I am searching for an approach for solving the following problem:
I have a large number of printed and scanned documents. I can already detect the text and the corresponding bounding boxes in each document, so I have the texts and the coordinates of their bounding boxes. If I want to extract specific data such as "insurance contract number" or "recipient", I currently have to program the extraction rules manually (to some degree).
I was wondering whether there is a way to show the computer what the correct extraction looks like. Is it possible to train a model on a large number of documents, using the correctly extracted data as labels? The features would be the detected text plus the coordinates, and the model should learn from the training examples which data to extract.
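To make the setup concrete, one training example could look roughly like the sketch below. The class and field names (`TextBox`, `TrainingExample`, `insurance_contract_number`, `recipient`) and all sample values are made up for illustration; they do not come from any library or from my real data.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class TextBox:
    """One detected piece of text and where it sits on the page."""
    text: str
    # Assumed layout: (x_min, y_min, x_max, y_max) in page coordinates.
    bbox: Tuple[float, float, float, float]


@dataclass
class TrainingExample:
    """Features (detected boxes) and labels (correct field values) for one document."""
    boxes: List[TextBox]
    targets: Dict[str, str]  # field name -> correctly extracted value


# Hypothetical document with two of the roughly 20 fields filled in.
example = TrainingExample(
    boxes=[
        TextBox("Contract no.", (50, 40, 140, 55)),
        TextBox("A-12345", (150, 40, 210, 55)),
        TextBox("Recipient", (50, 70, 120, 85)),
        TextBox("Jane Doe", (150, 70, 220, 85)),
    ],
    targets={
        "insurance_contract_number": "A-12345",
        "recipient": "Jane Doe",
    },
)
```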
All algorithms I have come across so far only "allow" one label per training example. From my understanding, this approach would need around 20 labels for every training example. In addition, a prediction would only count as "correct" if all of its labels match.
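The "all labels must match" criterion can be written down as a small check. This is only a sketch of how I would score a prediction, assuming predictions and labels are both dictionaries from field name to extracted string; the function names and the sample values are hypothetical.

```python
from typing import Dict


def document_is_correct(predicted: Dict[str, str], expected: Dict[str, str]) -> bool:
    """True only if every expected field was extracted with exactly the expected value."""
    return all(predicted.get(field) == value for field, value in expected.items())


def field_accuracy(predicted: Dict[str, str], expected: Dict[str, str]) -> float:
    """Fraction of expected fields that were extracted correctly (a softer score)."""
    if not expected:
        return 1.0
    hits = sum(predicted.get(field) == value for field, value in expected.items())
    return hits / len(expected)


expected = {"insurance_contract_number": "A-12345", "recipient": "Jane Doe"}
predicted = {"insurance_contract_number": "A-12345", "recipient": "John Doe"}

# One of two fields is wrong, so the document as a whole counts as incorrect.
all_match = document_is_correct(predicted, expected)
per_field = field_accuracy(predicted, expected)
```

With the strict check, a single wrong field out of 20 makes the whole document count as a failure, while the per-field score would still show that most fields were extracted correctly.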
Does anyone have an idea how to approach this?