Feasibility: train a model to learn how to extract data from documents

Question

I am looking for an approach to the following problem:

I have a large number of printed and scanned documents. I can already detect the text and the corresponding bounding boxes in each document, so I have the texts and the coordinates of their bounding boxes. If I now want to extract specific data such as "insurance contract number" or "recipient", I would have to program that manually (to some degree).

I was wondering whether there is a way to show the computer what the correct data extraction looks like. Is it possible to train a model on a large number of documents, with the correctly extracted data as labels? The features would be the extracted text plus the coordinates, and the model should learn which data to extract from the training examples with correctly extracted text.

All algorithms I have come across so far only "allow" one label. From my understanding this approach would need about 20 labels for every training example. Also, the prediction is only "correct" if all labels match.

Does anyone have an idea how to approach this?

Skiddles · Accepted Answer · 2018-11-29 13:52:23Z

Both of the answers from FelixGK and Wargream have merit. With the information you have posted though, I am not sure you are approaching the problem with the right objective, so the expected solution may seem further away than it probably is.

As FelixGK suggests, you should break the models down into a set of one-vs-the-rest binary classifiers, one for each label.
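A minimal sketch of that decomposition, assuming each detected box has already been turned into a small numeric feature vector. The three features (horizontal centre, vertical centre, share of digit characters), the field names, the training rows, and the hand-rolled logistic regression are all made up for illustration; any binary classifier could take its place.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score(model, x):
    # Probability that box x belongs to the label this model was trained for.
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_binary(X, y, epochs=2000, lr=0.5):
    """Logistic regression fitted with plain stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = score((w, b), xi) - yi  # gradient of the log loss w.r.t. the logit
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def train_one_vs_rest(X, labels):
    """One binary model per label: that label is 1, every other label is 0."""
    return {
        lab: train_binary(X, [1 if l == lab else 0 for l in labels])
        for lab in sorted(set(labels))
    }

def predict(models, x):
    """Return the label whose binary model is most confident, plus all scores."""
    scores = {lab: score(m, x) for lab, m in models.items()}
    return max(scores, key=scores.get), scores

# Hypothetical training boxes: [x_center, y_center, digit_ratio], scaled to 0..1.
X = [
    [0.80, 0.10, 0.90], [0.85, 0.12, 1.00], [0.75, 0.08, 0.80],  # contract number
    [0.10, 0.20, 0.00], [0.15, 0.22, 0.05], [0.12, 0.18, 0.00],  # recipient
    [0.50, 0.60, 0.10], [0.40, 0.70, 0.20], [0.60, 0.80, 0.00],  # other text
    [0.30, 0.50, 0.10],
]
labels = ["contract_number"] * 3 + ["recipient"] * 3 + ["other"] * 4

models = train_one_vs_rest(X, labels)
label, scores = predict(models, [0.82, 0.10, 0.95])
print(label, scores)
```

Note that every box is its own training example with a single label, so a document with 20 fields simply contributes 20 or more single-label examples.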

I am not sure what you mean when you state:

From my understanding this approach would need about 20 labels for every training example. Also, the prediction is only "correct" if all labels match.

Do you expect one model to collect multiple data elements within a document and assign the correct label to each element? This is probably difficult in a single model, but could be easier with some form of ensemble approach, such as a set of binary classifiers. If you want to try it all in one model, I would look at neural networks, specifically a network with a softmax activation function in the final layer. This produces a multi-class output with a probability assigned to each class. When using softmax, you generally accept the class with the highest probability.
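To make the softmax step concrete, here is a small sketch. It assumes some network (not shown) has already produced one raw score per class for every box; the class names, the example texts, and the scores are invented for illustration. Each box is classified on its own, and for every field the box with the highest probability for that field is kept.

```python
import math

CLASSES = ["contract_number", "recipient", "other"]  # hypothetical field names

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_box(logits):
    """Accept the class with the highest softmax probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs

def extract_fields(boxes):
    """For each field, keep the text of the box most likely to be that field."""
    best = {}
    for text, logits in boxes:
        for cls, p in zip(CLASSES, softmax(logits)):
            if cls != "other" and p > best.get(cls, ("", 0.0))[1]:
                best[cls] = (text, p)
    return {cls: text for cls, (text, _) in best.items()}

# Made-up (text, raw network scores) pairs for one document.
boxes = [
    ("Policy no. 4711-0815", [3.0, 0.2, 0.1]),
    ("Jane Doe", [0.1, 2.5, 0.3]),
    ("Dear customer,", [0.0, 0.1, 2.0]),
]

label, probs = classify_box(boxes[0][1])
fields = extract_fields(boxes)
print(label, probs)
print(fields)
```

Framed this way, a document is "correct" when each field's best box is the right one, and the per-field accuracy can still be measured separately.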

FelixGK · Accepted Answer · 2018-11-27 14:34:05Z

One thing that comes to mind is a Restricted Boltzmann Machine (RBM) or, if you need a more powerful model, a Deep Belief Network (DBN). Both learn to recognize familiar examples. You train them without labels; you just give them a set of training data. They then adjust their weights so that the energy of the training samples is minimal. Once you have a trained RBM or DBN, you can feed in an example and look at its energy. The lower the energy, the closer the example is to the samples used to train the model.

You could train one model per target class, then put your examples through all models and return those labels for which the model energy is below a threshold.
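A rough sketch of the thresholding step, assuming binary RBMs. For a single input, the quantity usually compared is the free energy, F(v) = -a·v - Σ_j log(1 + exp(b_j + Σ_i v_i W_ij)), which is what the function below computes. The weights here are set by hand rather than trained, and the four binary features, the class names, and the thresholds are purely hypothetical; in practice each threshold would have to be chosen per model, for example on held-out data.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def free_energy(v, W, a, b):
    """Free energy of a binary RBM with visible bias a, hidden bias b, weights W[i][j]."""
    visible_term = sum(ai * vi for ai, vi in zip(a, v))
    hidden_term = 0.0
    for j in range(len(b)):
        pre = b[j] + sum(v[i] * W[i][j] for i in range(len(v)))
        hidden_term += softplus(pre)
    return -visible_term - hidden_term

def labels_below_threshold(v, models, thresholds):
    """Return every label whose model assigns v a free energy below its threshold."""
    return [
        name
        for name, (W, a, b) in models.items()
        if free_energy(v, W, a, b) < thresholds[name]
    ]

# Hypothetical binary features: [top_of_page, right_side, mostly_digits, name_like].
# Hand-set parameters standing in for two separately trained RBMs (one hidden unit each).
models = {
    "contract_number": ([[2.0], [2.0], [2.0], [-2.0]], [0.0] * 4, [-3.0]),
    "recipient": ([[2.0], [-2.0], [-2.0], [2.0]], [0.0] * 4, [-1.0]),
}
thresholds = {"contract_number": -1.0, "recipient": -1.0}

print(labels_below_threshold([1, 1, 1, 0], models, thresholds))
print(labels_below_threshold([1, 0, 0, 1], models, thresholds))
print(labels_below_threshold([0, 0, 0, 0], models, thresholds))
```

Because each model is scored against its own threshold, a box can receive no label at all, which is a natural way to handle text that belongs to none of the target fields.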

Generally, you could take any binary classifier and train one per class to classify membership.

atmarges · Accepted Answer · 2018-11-28 12:52:44Z

We can relate this problem to object detection (given an image, draw a bounding box around each object present and identify the detected objects). Take a look at this neural-network-based object detection algorithm.

Your training data would be the images, with the bounding boxes and the classification of each box as labels. Just manually label some and then switch the placement of the objects to create more training data.
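One possible reading of "switch the placement" is sketched below on the annotations only: two labelled boxes on a page trade their top-left corners, which yields a new layout with the same labels. The Box type, the field names, and the coordinates are assumptions for illustration. If the model is trained on pixels, the corresponding image patches would have to be moved as well, and overlaps between boxes of different sizes are not checked here.

```python
import random
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Box:
    x: float      # left edge, as a fraction of page width
    y: float      # top edge, as a fraction of page height
    w: float
    h: float
    label: str

def swap_placement(boxes, i, j):
    """Return a copy of the page where boxes i and j trade top-left corners."""
    out = list(boxes)
    out[i] = replace(boxes[i], x=boxes[j].x, y=boxes[j].y)
    out[j] = replace(boxes[j], x=boxes[i].x, y=boxes[i].y)
    return out

def augment(boxes, n_copies, seed=0):
    """Create n_copies new layouts, each by swapping one random pair of boxes."""
    rng = random.Random(seed)
    pages = []
    for _ in range(n_copies):
        i, j = rng.sample(range(len(boxes)), 2)
        pages.append(swap_placement(boxes, i, j))
    return pages

# One manually labelled page (made-up coordinates).
page = [
    Box(0.70, 0.05, 0.25, 0.03, "contract_number"),
    Box(0.10, 0.15, 0.30, 0.08, "recipient"),
    Box(0.10, 0.40, 0.80, 0.30, "other"),
]

pages = augment(page, n_copies=5)
for p in pages:
    print([(b.label, b.x, b.y) for b in p])
```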

Finally, use OCR to translate the detected objects into text data.
