From your question, I too feel this is an NER problem. As for the dataset: unless there is an existing dataset that tags reservation numbers and is similar to your application, you WILL have to create your own.
I worked on a similar problem before and my dataset looked something like this:
<TEAM>Northern</TEAM> NNP <TEAM>Ireland</TEAM> NNP man NN <PLAYER>James</PLAYER> NNP <PLAYER>McIlroy</PLAYER> NNP is VBZ confident JJ he PRP can MD win VB his PRP$ first JJ major JJ title NN at IN this DT weekend NN 's POS <COMPETITION>Spar</COMPETITION> JJ <COMPETITION>European</COMPETITION> JJ <COMPETITION>Indoor</COMPETITION> NNP <COMPETITION>Championships</COMPETITION> NNP in IN <LOCATION>Madrid</LOCATION> NNP
You can see that each word carries both the entity tag and the part-of-speech tag. When I parse this dataset for training, I also add IOB tags (Inside, Outside, Beginning), so each sentence ends up looking like this:
[(('Claxton', 'NNP\n'), 'B-PLAYER'), (('hunting', 'VBG\n'), 'O'), (('first', 'RB\n'), 'O'), (('major', 'JJ\n'), 'O'), (('medal', 'NNS\n'), 'O'), (('.', '.\n'), 'O'), (('British', 'JJ\n'), 'O'), (('hurdler', 'NN\n'), 'O'), (('Sarah', 'NNP\n'), 'B-PLAYER'), (('Claxton', 'NNP\n'), 'I-PLAYER')......]
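If it helps, here is a rough sketch of how that conversion could look. This is a hypothetical helper of my own, assuming the space-separated "word pos" format shown above, with the entity tags wrapping the words:

import re

# Matches a word wrapped in an entity tag, e.g. <TEAM>Northern</TEAM>
TAG_RE = re.compile(r'<(?P<entity>\w+)>(?P<word>.+?)</(?P=entity)>')

def to_iob(line):
    """Turn '<TEAM>Northern</TEAM> NNP man NN ...' into ((word, pos), iob) pairs."""
    tokens = line.split()
    pairs = zip(tokens[0::2], tokens[1::2])  # (possibly tagged word, pos)
    result = []
    prev_entity = 'O'
    for word, pos in pairs:
        match = TAG_RE.match(word)
        if match:
            entity = match.group('entity')
            # First word of an entity gets B-, following words of the same entity get I-
            iob = ('I-' if prev_entity == entity else 'B-') + entity
            result.append(((match.group('word'), pos), iob))
            prev_entity = entity
        else:
            result.append(((word, pos), 'O'))
            prev_entity = 'O'
    return result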
Then I just used NLTK's ClassifierBasedTagger (there are other taggers too). I can't find the original source anymore, but I used code along these lines:
from collections.abc import Iterable

from nltk.tag import ClassifierBasedTagger
from nltk.chunk import ChunkParserI, conlltags2tree


class NamedEntityChunker(ChunkParserI):
    def __init__(self, train_sents, **kwargs):
        assert isinstance(train_sents, Iterable), 'The training set should be an Iterable'

        self.feature_detector = features
        self.tagger = ClassifierBasedTagger(
            train=train_sents,
            feature_detector=features,  # the feature function described below
            **kwargs)

    def parse(self, tagged_sents):
        chunks = self.tagger.tag(tagged_sents)

        # Transform [((w1, t1), iob1), ...] into the [(w1, t1, iob1), ...]
        # triplets that conlltags2tree expects
        iob_triplets = [(w, t, c) for ((w, t), c) in chunks]
        return conlltags2tree(iob_triplets)
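A minimal usage sketch, assuming train_samples holds sentences in the ((word, pos), iob) format shown above and features is the feature function described below:

from nltk import pos_tag, word_tokenize

chunker = NamedEntityChunker(train_samples)  # train_samples: list of IOB-annotated sentences
print(chunker.parse(pos_tag(word_tokenize("James McIlroy will compete in Madrid."))))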
Here features is a function that returns a dictionary of the features to train the model on, such as the current word, the previous word, the previous word's POS tag, and so on:
{
    'word': word,
    'lemma': stemmer.stem(word),
    'pos': pos,
    'allascii': allascii,

    'next-word': nextword,
    'next-lemma': stemmer.stem(nextword),
    'next-pos': nextpos,

    'prev-word': prevword,
    'prev-lemma': stemmer.stem(prevword),
    'prev-pos': prevpos,
}
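Putting that into a complete feature detector: ClassifierBasedTagger calls it as feature_detector(tokens, index, history), so a sketch could look like the one below. The padding tokens, the Snowball stemmer, and the exact feature set are my own choices, not the only way to do it:

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')

def features(tokens, index, history):
    """
    tokens  : the (word, pos) pairs of the sentence being tagged
    index   : position of the token to build features for
    history : IOB tags predicted so far (unused here)
    """
    # Pad the sequence so the prev/next lookups work at the sentence edges
    tokens = [('__START__', '__START__')] + list(tokens) + [('__END__', '__END__')]
    index += 1

    word, pos = tokens[index]
    prevword, prevpos = tokens[index - 1]
    nextword, nextpos = tokens[index + 1]
    allascii = all(ord(c) < 128 for c in word)

    return {
        'word': word,
        'lemma': stemmer.stem(word),
        'pos': pos,
        'allascii': allascii,

        'next-word': nextword,
        'next-lemma': stemmer.stem(nextword),
        'next-pos': nextpos,

        'prev-word': prevword,
        'prev-lemma': stemmer.stem(prevword),
        'prev-pos': prevpos,
    }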
You can find useful theory here.
I hope this helps.