Feasibility: train a model to learn how to extract data from documents

I am looking for an approach to the following problem:

I have a large number of printed and scanned documents. I can already detect the text and the corresponding bounding boxes in each document, so I have the texts and the coordinates of the bounding boxes. Now, if I want to extract specific fields such as "insurance contract number" or "recipient", I would have to program that manually (to some degree).

I was wondering whether there is a way to show the computer what the correct extraction looks like. Is it possible to train a model on a large set of documents with the correctly extracted data as labels? The features would be the extracted texts plus their bounding-box coordinates, and the model should learn which data to extract from the training examples.
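
To make the setup concrete, here is a minimal sketch of how one labeled training document could be represented. The field names ("contract_number", "recipient") and the normalized-coordinate convention are my assumptions for illustration, not something fixed by my problem:

```python
from dataclasses import dataclass

@dataclass
class TextBox:
    text: str    # OCR text of this box
    x0: float    # bounding box, normalized to [0, 1] relative to page size
    y0: float
    x1: float
    y1: float
    label: str   # target field this box belongs to, or "other"

# One training document: every detected box, annotated with the field it fills
document = [
    TextBox("Contract no.", 0.10, 0.05, 0.25, 0.08, "other"),
    TextBox("KV-123456",    0.27, 0.05, 0.40, 0.08, "contract_number"),
    TextBox("John Doe",     0.10, 0.12, 0.30, 0.15, "recipient"),
]
```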

All algorithms I have come across so far only "allow" a single label. From my understanding, this approach would need around 20 labels per training example, and a prediction would only be "correct" if all labels match.
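
One idea I have considered for sidestepping the single-label restriction: classify each text box independently (field name or "other") instead of predicting all ~20 fields of a document at once. Then any standard single-label classifier applies, and "all fields of a document correct" becomes an evaluation metric rather than a modeling constraint. A minimal, self-contained sketch with scikit-learn, where the toy data, column names, and feature choices are all my assumptions:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# One row per detected text box; labels are per box, not per document
rows = pd.DataFrame([
    {"text": "Contract no.", "x0": 0.10, "y0": 0.05, "x1": 0.25, "y1": 0.08, "label": "other"},
    {"text": "KV-123456",    "x0": 0.27, "y0": 0.05, "x1": 0.40, "y1": 0.08, "label": "contract_number"},
    {"text": "John Doe",     "x0": 0.10, "y0": 0.12, "x1": 0.30, "y1": 0.15, "label": "recipient"},
    {"text": "Dear Sir,",    "x0": 0.10, "y0": 0.20, "x1": 0.25, "y1": 0.23, "label": "other"},
])

features = ColumnTransformer([
    # character n-grams tolerate OCR noise better than word tokens
    ("text", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)), "text"),
    # box geometry as plain numeric features
    ("geometry", "passthrough", ["x0", "y0", "x1", "y1"]),
])

clf = Pipeline([("features", features), ("model", LogisticRegression(max_iter=1000))])
clf.fit(rows.drop(columns="label"), rows["label"])
print(clf.predict(rows.drop(columns="label")))
```

I am not sure whether this per-box framing loses too much context between neighboring boxes, which is part of what I am asking.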

Does anyone have an idea how to approach this?
