Feasibility: train a model to learn how to extract data from documents

I am looking for an approach to the following problem:

I have a large number of printed and scanned documents. I can already detect the text and the corresponding bounding boxes in each document, so I have the texts and the coordinates of the bounding boxes. Now, if I want to extract specific fields such as "insurance contract number" or "recipient", I would have to program that manually (to some degree).

I was wondering whether there is a way to show the computer what the correct extraction looks like. Is it possible to train a model on a large set of documents with the correctly extracted data as labels? The features would be the extracted texts plus their bounding-box coordinates, and the model should learn which data to extract from the training examples.
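
To make the setup concrete, here is a minimal sketch of how one labeled training document could be represented. The field names ("contract_number", "recipient") and the normalized-coordinate convention are my assumptions for illustration, not something fixed by my problem:

```python
from dataclasses import dataclass

@dataclass
class TextBox:
    text: str    # OCR text of this box
    x0: float    # bounding box, normalized to [0, 1] relative to page size
    y0: float
    x1: float
    y1: float
    label: str   # target field this box belongs to, or "other"

# One training document: every detected box, annotated with the field it fills
document = [
    TextBox("Contract no.", 0.10, 0.05, 0.25, 0.08, "other"),
    TextBox("KV-123456",    0.27, 0.05, 0.40, 0.08, "contract_number"),
    TextBox("John Doe",     0.10, 0.12, 0.30, 0.15, "recipient"),
]
```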

All algorithms I have come across so far only "allow" a single label. From my understanding, this approach would need around 20 labels per training example, and a prediction would only be "correct" if all labels match.
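
One idea I have considered for sidestepping the single-label restriction: classify each text box independently (field name or "other") instead of predicting all ~20 fields of a document at once. Then any standard single-label classifier applies, and "all fields of a document correct" becomes an evaluation metric rather than a modeling constraint. A minimal, self-contained sketch with scikit-learn, where the toy data, column names, and feature choices are all my assumptions:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# One row per detected text box; labels are per box, not per document
rows = pd.DataFrame([
    {"text": "Contract no.", "x0": 0.10, "y0": 0.05, "x1": 0.25, "y1": 0.08, "label": "other"},
    {"text": "KV-123456",    "x0": 0.27, "y0": 0.05, "x1": 0.40, "y1": 0.08, "label": "contract_number"},
    {"text": "John Doe",     "x0": 0.10, "y0": 0.12, "x1": 0.30, "y1": 0.15, "label": "recipient"},
    {"text": "Dear Sir,",    "x0": 0.10, "y0": 0.20, "x1": 0.25, "y1": 0.23, "label": "other"},
])

features = ColumnTransformer([
    # character n-grams tolerate OCR noise better than word tokens
    ("text", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)), "text"),
    # box geometry as plain numeric features
    ("geometry", "passthrough", ["x0", "y0", "x1", "y1"]),
])

clf = Pipeline([("features", features), ("model", LogisticRegression(max_iter=1000))])
clf.fit(rows.drop(columns="label"), rows["label"])
print(clf.predict(rows.drop(columns="label")))
```

I am not sure whether this per-box framing loses too much context between neighboring boxes, which is part of what I am asking.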

Does anyone have an idea how to approach this?
