$\begingroup$

I have a reasonably simple problem to solve: I need to extract reservation numbers from unstructured text. Based on my research, this looks like an NER problem. From a visual inspection of the dataset, I noticed that the reservation number frequently appears near specific keywords, such as 'confirmation', 'reservation', 'confirmation number', 'reservation number', etc.

First, I decided to try a regex rule to extract the data, but even minor variations in the text might make this solution unreliable. The reservation number itself can take very different forms, such as:

ZXC51657856, EA5FFD4, 45615177413515, QT454545EF

At this moment, I don't have a dataset available to train a classifier to solve this issue.

I would like to receive some ideas from the community to guide me towards an elegant solution to this problem, as I'm pretty new to ML in general and time is limited.

$\endgroup$

1 Answer

$\begingroup$

From your question, I agree that this is an NER problem. As for the dataset: unless an existing dataset already tags reservation numbers and is similar to your application, you WILL have to create your own.

I worked on a similar problem before and my dataset looked something like this:

<TEAM>Northern</TEAM> NNP <TEAM>Ireland</TEAM> NNP man NN <PLAYER>James</PLAYER> NNP <PLAYER>McIlroy</PLAYER> NNP is VBZ confident JJ he PRP can MD win VB his PRP$ first JJ major JJ title NN at IN this DT weekend NN 's POS <COMPETITION>Spar</COMPETITION> JJ <COMPETITION>European</COMPETITION> JJ <COMPETITION>Indoor</COMPETITION> NNP <COMPETITION>Championships</COMPETITION> NNP in IN <LOCATION>Madrid</LOCATION> NNP 

You can see that each word carries an entity tag and a part-of-speech tag. When I parse this dataset for training, I also add the IOB tags (Inside, Outside, and Beginning):

[(('Claxton', 'NNP\n'), 'B-PLAYER'), (('hunting', 'VBG\n'), 'O'), (('first', 'RB\n'), 'O'), (('major', 'JJ\n'), 'O'), (('medal', 'NNS\n'), 'O'), (('.', '.\n'), 'O'), (('British', 'JJ\n'), 'O'), (('hurdler', 'NN\n'), 'O'), (('Sarah', 'NNP\n'), 'B-PLAYER'), (('Claxton', 'NNP\n'), 'I-PLAYER')......] 
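To make the IOB step concrete, here is a minimal sketch of how the labels could be derived. It is not the code I originally used: the function name `to_iob` and the `(word, pos, entity)` input format are assumptions made for illustration.

```python
def to_iob(tagged_tokens):
    """Turn (word, pos, entity_or_None) triples into ((word, pos), iob_label) pairs.

    Hypothetical helper: the first token of an entity gets B-<ENTITY>,
    following tokens of the same entity get I-<ENTITY>, everything else gets O.
    Limitation: two adjacent entities of the same type are merged into one.
    """
    result = []
    prev_entity = None
    for word, pos, entity in tagged_tokens:
        if entity is None:
            label = "O"
        elif entity == prev_entity:
            label = "I-" + entity
        else:
            label = "B-" + entity
        result.append(((word, pos), label))
        prev_entity = entity
    return result


sample = [
    ("hurdler", "NN", None),
    ("Sarah", "NNP", "PLAYER"),
    ("Claxton", "NNP", "PLAYER"),
    ("is", "VBZ", None),
]
iob = to_iob(sample)
```

For reservation numbers you would most likely need a single entity type, and since a number is usually one token, almost every positive label would be a `B-` tag.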

Then I just used NLTK's ClassifierBasedTagger (there are other taggers too). I can't find the source, but I used this code:

    from collections.abc import Iterable

    from nltk.chunk import ChunkParserI, conlltags2tree
    from nltk.tag import ClassifierBasedTagger


    class NamedEntityChunker(ChunkParserI):
        def __init__(self, train_sents, **kwargs):
            assert isinstance(train_sents, Iterable), 'The training set should be an Iterable'
            self.feature_detector = features
            self.tagger = ClassifierBasedTagger(
                train=train_sents,
                feature_detector=features,
                **kwargs)

        def parse(self, tagged_sents):
            # Tag each (word, pos) token with an IOB label, then build a chunk tree.
            chunks = self.tagger.tag(tagged_sents)
            iob_triplets = [(w, t, c) for ((w, t), c) in chunks]
            return conlltags2tree(iob_triplets)

Here features is a function that returns a dictionary of the features to train the model on, such as the previous word, the previous word's POS tag, etc.:

    {
        'word': word,
        'lemma': stemmer.stem(word),
        'pos': pos,
        'allascii': allascii,
        'next-word': nextword,
        'next-lemma': stemmer.stem(nextword),
        'next-pos': nextpos,
        'prev-word': prevword,
        'prev-lemma': stemmer.stem(prevword),
        'prev-pos': prevpos
    }
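For completeness, here is a rough sketch of what such a feature function might look like when adapted to reservation numbers. It assumes the `(tokens, index, history)` signature that ClassifierBasedTagger passes to its feature detector, with `tokens` as a list of `(word, pos)` pairs. The shape features (`has-digit`, `is-alnum-upper`, `length`) and the padding markers are my own additions for illustration, and the lemma features are left out so the sketch runs without a stemmer.

```python
import re


def features(tokens, index, history):
    """Feature dictionary for the token at `index`.

    tokens:  list of (word, pos) pairs for one sentence
    index:   position of the current token
    history: IOB labels already predicted for tokens[0:index]
    """
    # Pad the sentence so neighbours exist at both boundaries.
    padded = [("__START__", "__START__")] + list(tokens) + [("__END__", "__END__")]
    i = index + 1  # shift by one because of the left padding
    word, pos = padded[i]
    prevword, prevpos = padded[i - 1]
    nextword, nextpos = padded[i + 1]
    return {
        "word": word,
        "pos": pos,
        "allascii": word.isascii(),
        # Shape features (assumed useful for codes such as QT454545EF).
        "has-digit": any(ch.isdigit() for ch in word),
        "is-alnum-upper": bool(re.fullmatch(r"[A-Z0-9]+", word)),
        "length": len(word),
        # Context features: keywords like 'confirmation' show up here.
        "prev-word": prevword.lower(),
        "prev-pos": prevpos,
        "next-word": nextword.lower(),
        "next-pos": nextpos,
        "prev-iob": history[-1] if history else "__START__",
    }


tokens = [("confirmation", "NN"), ("number", "NN"), ("QT454545EF", "NNP")]
feats = features(tokens, 2, ["O", "O"])
```

The idea is that the context features let the classifier pick up the nearby keywords, while the shape features describe the candidate token itself.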

You can find useful theory here

I hope this helps.

$\endgroup$
  • $\begingroup$ Small note: the initial dataset could be created with the regex rules that have already been developed. There won't be POS tags, but they might not be necessary in this (relatively) simpler problem. $\endgroup$ Commented Sep 7, 2018 at 18:23
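As a rough illustration of that bootstrapping idea, the sketch below uses a keyword-anchored regex to produce weak labels for whitespace-separated tokens. The pattern, the `RESNUM` label, and the helper names are all assumptions: the candidate is taken to be 6 to 20 uppercase letters or digits with at least one digit, which fits the four examples in the question but may not fit real data.

```python
import re

# Keyword (case-insensitive), optional qualifier and separator,
# then a case-sensitive candidate that must contain at least one digit.
PATTERN = re.compile(
    r"(?i:confirmation|reservations?)"
    r"(?i:\s+(?:number|no\.?|code))?"
    r"\s*(?i:is|:|#|-)?\s*"
    r"\b((?=[A-Z0-9]*\d)[A-Z0-9]{6,20})\b"
)


def extract_numbers(text):
    """Return the candidate reservation numbers found next to a keyword."""
    return [m.group(1) for m in PATTERN.finditer(text)]


def weak_labels(text):
    """Label each whitespace-separated token as B-RESNUM or O."""
    spans = [m.span(1) for m in PATTERN.finditer(text)]
    labelled = []
    for tok in re.finditer(r"\S+", text):
        # A token is positive if it overlaps a matched candidate span.
        hit = any(tok.start() < end and start < tok.end() for start, end in spans)
        labelled.append((tok.group(), "B-RESNUM" if hit else "O"))
    return labelled


text = "Your reservation number is QT454545EF. Confirmation: ZXC51657856 was sent."
numbers = extract_numbers(text)
labels = weak_labels(text)
```

Labels produced this way are noisy, so it is worth reviewing a sample by hand before training on them.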
