I have a list of shorthand text, all in English. Is there a machine learning algorithm that can expand these abbreviations? For example, if the shorthand is 'txt', it could suggest 'text', 'context', 'textual', etc., with varying penalty values.

In addition, when I choose the right word, I want it to learn from that choice, so that the next time I enter the same shorthand, my choice gets a higher rating.

Edit

Specifically, I have tried the language model described here, but it only handles words up to two edits away. Its `edits1` function is below:

    def edits1(word):
        "All edits that are one edit away from `word`."
        letters    = 'abcdefghijklmnopqrstuvwxyz'
        splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
        deletes    = [L + R[1:]               for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
        replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
        inserts    = [L + c + R               for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

It takes a word and generates every string that is one edit away by deleting, transposing, replacing, and inserting letters (using the letters of the alphabet).

How do I extend this to more than two edits?

  • The first part of your question isn't really machine learning, just mapping an extension to replacements. The second part is machine learning. Commented Mar 30, 2018 at 1:42
  • Welcome to Stack Overflow, please review: stackoverflow.com/help/how-to-ask Commented Mar 30, 2018 at 1:48
  • How many different shorthand input strings are in your possible entry data set? Commented Mar 30, 2018 at 2:01
  • The number of shorthand inputs isn't defined. An input can be any length, really. Commented Apr 6, 2018 at 9:38

1 Answer

The first part is about producing candidate words; the second is about ranking those words (and updating the rankings). I'll address the two in turn and point out where machine learning comes in, since that was part of the original question.

For the first part, I don't think you need machine learning; having thought about it a little, using ML here seems artificial. I think you could make good headway with a dictionary of acronyms combined with synonym lookups.

  1. For example, start by looking up "txt" in a list such as this which lists "text" as an expansion for "txt".
  2. Take "text" and look up synonyms. You may want to restrict synonyms to those that look similar to the original acronym, i.e. those containing a substring with a small edit distance to "txt" or containing the expansion found in the acronym dictionary ("text"). Take a look at this post for how to use NLTK for finding Synsets.
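
The two steps above could be sketched roughly as follows. This is only an illustration: `ABBREVIATIONS` and `RELATED_WORDS` are small made-up dictionaries standing in for a real acronym list and a real synonym lookup (such as WordNet via NLTK), and the "looks similar" filter is simplified to a check that the shorthand's letters appear in the word in order.

```python
# Hypothetical data: a real system would load a full abbreviation list and
# query a thesaurus such as WordNet instead of these hand-written dicts.
ABBREVIATIONS = {"txt": ["text"]}
RELATED_WORDS = {"text": ["textual", "context", "document", "passage"]}


def is_subsequence(short, word):
    """Return True if the letters of `short` appear in `word` in order."""
    remaining = iter(word)
    # `letter in remaining` consumes the iterator up to the first match,
    # so each later letter must be found after the previous one.
    return all(letter in remaining for letter in short)


def candidates(short):
    """Collect expansions for `short`, then related words that still resemble it."""
    found = []
    for expansion in ABBREVIATIONS.get(short, []):
        found.append(expansion)
        for related in RELATED_WORDS.get(expansion, []):
            # Keep only related words containing the shorthand's letters in order.
            if is_subsequence(short, related) and related not in found:
                found.append(related)
    return found


print(candidates("txt"))
```

With this toy data, "document" and "passage" are dropped because they do not contain t, x, t in order, while "textual" and "context" are kept.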

The important part here is to cover all the acronyms you'll encounter, so you may want to allow the user to enter in missing acronyms and expansions for those acronyms.

For the second part, you may want to combine two scoring algorithms to assign a score to each word and rank the words by their scores.

The first scoring algorithm should be something that works without any user data, so that you have a semi-intelligent ordering of words from the start. An example would be scoring a word by how many edits separate it from the acronym. So "textual" would get a lower score than "text" for the acronym "txt" because it takes more edits to get from "txt" to "textual" (four inserted letters versus one).

The second scoring algorithm would take over as you get more user data. For example, you could keep track of the popularity of each word (i.e. the fraction of times it was chosen). See Online machine learning.
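
As a minimal sketch of such a score, assuming plain Levenshtein distance (inserts, deletes and substitutions only, so a transposition counts as two edits, unlike in `edits1` from the question). The `1 / (1 + distance)` mapping is an arbitrary choice to make closer words score higher; any decreasing function would do.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest inserts, deletes and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a matching letter
            ))
        previous = current
    return previous[-1]


def edit_score(short, word):
    """Map the distance into (0, 1]; fewer edits gives a higher score."""
    return 1.0 / (1.0 + edit_distance(short, word))


words = ["textual", "context", "text"]
ranked = sorted(words, key=lambda w: edit_score("txt", w), reverse=True)
print(ranked[0])  # text
```

Because the distance is computed directly between the shorthand and each candidate, it works for any number of edits rather than stopping at two.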
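
A minimal sketch of the popularity idea, assuming choices are counted per shorthand. The class name `ChoiceTracker` and its methods are made up for illustration, and persistence is left out.

```python
from collections import Counter, defaultdict


class ChoiceTracker:
    """Counts which expansion the user picked for each shorthand."""

    def __init__(self):
        # shorthand -> Counter mapping each chosen word to its pick count
        self.counts = defaultdict(Counter)

    def record(self, short, chosen):
        """Update the counts after the user picks `chosen` for `short`."""
        self.counts[short][chosen] += 1

    def popularity(self, short, word):
        """Fraction of past picks for `short` that were `word` (0.0 if none yet)."""
        total = sum(self.counts[short].values())
        return self.counts[short][word] / total if total else 0.0


tracker = ChoiceTracker()
for choice in ["text", "text", "context", "text"]:
    tracker.record("txt", choice)

print(tracker.popularity("txt", "text"))  # 0.75
```

Each call to `record` updates the score immediately, which is the "online" part: there is no separate retraining step.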

Combine the two scores into a final score via a learned linear function (See Linear Regression).
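
One way this combination might look, as a sketch only: a linear function of the two scores whose weights are fitted by stochastic gradient descent on squared error. The training rows below are invented for illustration; in practice each row would come from a logged suggestion, labelled 1.0 if the user chose that word and 0.0 otherwise.

```python
def predict(weights, features):
    """Linear score: bias plus a weighted sum of the feature values."""
    bias, *coefs = weights
    return bias + sum(c * x for c, x in zip(coefs, features))


def fit(examples, steps=2000, rate=0.1):
    """Fit the weights by stochastic gradient descent on squared error."""
    weights = [0.0, 0.0, 0.0]  # bias, edit-score weight, popularity weight
    for _ in range(steps):
        for features, label in examples:
            error = predict(weights, features) - label
            weights[0] -= rate * error
            for k, x in enumerate(features, start=1):
                weights[k] -= rate * error * x
    return weights


# Made-up rows: (edit_score, popularity) -> 1.0 if the user chose the word.
examples = [
    ((0.5, 0.75), 1.0),
    ((0.2, 0.25), 0.0),
    ((0.2, 0.00), 0.0),
    ((0.5, 0.00), 0.0),
    ((0.2, 0.90), 1.0),
]
weights = fit(examples)

# Rank candidates by the combined score, highest first.
scored = {"text": (0.5, 0.75), "textual": (0.2, 0.0)}
ranked = sorted(scored, key=lambda w: predict(weights, scored[w]), reverse=True)
print(ranked)
```

With little or no user data the popularity feature is zero for every word, so the ordering falls back to the edit-based score; as choices accumulate, the fitted weights decide how much popularity should override it.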
