deleted 14 characters in body

edited Feb 22, 2021 at 18:38

Max

I want to use machine learning and NLP to convert semi-structured data in text files into structured data by predicting the patterns in the files and splitting the fields. For example, suppose I have a text file that looks like this:

Input:

2021565267MALL1ETAGE ZARA1st FLOOR 2345561
2022565267MALL2ETAGE ZARA1st FLOOR 2345561
2022565267ANFAPLACE2ETAGECOFEESHOP2345561
20225652634ANFAPLACE2ETAGE 2345561

Desired output:

2021565267,MALL1ETAGE ZARA1st FLOOR,2345561
2022565267,MALL2ETAGE ZARA1st FLOOR,2345561
2022565267,ANFAPLACE2ETAGECOFEESHOP,2345561
20225652634,ANFAPLACE2ETAGE,2345561

The semi-structured files are not fixed-width, so we cannot simply pass a column specification to pandas like this (it would work for the first line, for example):

col_specification = [(1, 10), ....]
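
To make the problem concrete, here is a minimal sketch in plain Python of what a fixed offset does to the sample lines. The width of 10 and the helper name `split_fixed` are assumptions chosen for illustration, not the elided specification above. The cut fits the first line but misaligns the last one, whose leading number has 11 digits.

```python
# Sample records from the input above, one per line.
lines = [
    "2021565267MALL1ETAGE ZARA1st FLOOR 2345561",
    "2022565267MALL2ETAGE ZARA1st FLOOR 2345561",
    "2022565267ANFAPLACE2ETAGECOFEESHOP2345561",
    "20225652634ANFAPLACE2ETAGE 2345561",
]


def split_fixed(line, id_width=10):
    """Cut a line at a fixed offset (hypothetical width, for illustration only)."""
    return line[:id_width], line[id_width:]


for line in lines:
    # The last line leaks one digit of its leading number into the second field.
    print(split_fixed(line))
```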

One approach I found online is to build a dictionary based on the occurrences of words in the semi-structured file. Would that work in this situation, and if so, how can I implement something like that?
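
For comparison, a rule-based baseline can already reproduce the desired output for the four sample lines. This is only a sketch and not the machine-learning or dictionary approach asked about: it assumes every record starts with a run of digits and ends with a run of digits, with free text in between. The pattern `RECORD` and the helper `split_record` are hypothetical names introduced here.

```python
import re

# Assumed record shape: leading digit run, free text, trailing digit run.
RECORD = re.compile(r"^(\d+)(.*?)(\d+)$")


def split_record(line):
    """Return [leading number, text, trailing number], or None if the shape does not match."""
    match = RECORD.match(line.strip())
    if match is None:
        return None
    first, middle, last = match.groups()
    return [first, middle.strip(), last]


lines = [
    "2021565267MALL1ETAGE ZARA1st FLOOR 2345561",
    "2022565267MALL2ETAGE ZARA1st FLOOR 2345561",
    "2022565267ANFAPLACE2ETAGECOFEESHOP2345561",
    "20225652634ANFAPLACE2ETAGE 2345561",
]

for line in lines:
    fields = split_record(line)
    if fields is not None:
        print(",".join(fields))
```

The obvious limitation is that a text field ending in a digit would be absorbed into the trailing number, which is the kind of ambiguity a learned or dictionary-based splitter would have to resolve.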

asked Feb 22, 2021 at 17:34

Max

Extracting structured data from semi-structured data

I want to use machine learning and NLP to convert semi-structured data in text files into structured data by predicting the patterns in the files and splitting the fields. For example, suppose I have a text file that looks like this:

Input:

2021565267MALL1ETAGE ZARA1st FLOOR 2345561
2022565267MALL2ETAGE ZARA1st FLOOR 2345561
2022565267ANFAPLACE2ETAGECOFEESHOP2345561
20225652634ANFAPLACE2ETAGE 2345561

Desired output:

2021565267,MOROCCOMALL1ETAGE ZARA1st FLOOR,2345561
2022565267,MOROCCOMALL2ETAGE ZARA1st FLOOR,2345561
2022565267,ANFAPLACE2ETAGECOFEESHOP,2345561
20225652634,ANFAPLACE2ETAGE,2345561

The semi-structured files are not fixed-width, so we cannot simply pass a column specification to pandas like this (it would work for the first line, for example):

col_specification = [(1, 10), ....]

One approach I found online is to build a dictionary based on the occurrences of words in the semi-structured file. Would that work in this situation, and if so, how can I implement something like that?
