Max

Extracting structured data from semi-structured data

I want to use machine learning and NLP to convert semi-structured data in text files into structured data by predicting the patterns in the files and splitting the fields. For example, suppose I have a text file that looks like this:

Input:

2021565267MALL1ETAGE ZARA1st FLOOR 2345561
2022565267MALL2ETAGE ZARA1st FLOOR 2345561
2022565267ANFAPLACE2ETAGECOFEESHOP2345561
20225652634ANFAPLACE2ETAGE 2345561

Desired output:

2021565267,MALL1ETAGE ZARA1st FLOOR,2345561
2022565267,MALL2ETAGE ZARA1st FLOOR,2345561
2022565267,ANFAPLACE2ETAGECOFEESHOP,2345561
20225652634,ANFAPLACE2ETAGE,2345561

The semi-structured files are not fixed-width, so we cannot simply pass a column specification to pandas like this (it would work for the first line, for example, but not for the others):

col_specification =[(1, 10),.... ] 
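To make the problem concrete, here is a minimal sketch that applies one fixed set of column offsets to the sample records using plain string slicing. The offsets in `COL_SPEC` and the helper `split_fixed` are hypothetical: the offsets are simply measured from the first record and are not part of any real layout.

```python
# Hypothetical fixed-width layout, measured from the first record only
# (0-based, half-open intervals: start inclusive, end exclusive).
COL_SPEC = [(0, 10), (10, 34), (35, 42)]

records = [
    "2021565267MALL1ETAGE ZARA1st FLOOR 2345561",
    "2022565267MALL2ETAGE ZARA1st FLOOR 2345561",
    "2022565267ANFAPLACE2ETAGECOFEESHOP2345561",
    "20225652634ANFAPLACE2ETAGE 2345561",
]


def split_fixed(line, spec=COL_SPEC):
    """Cut a line at fixed character offsets and trim surrounding whitespace."""
    return [line[start:end].strip() for start, end in spec]


for record in records:
    print(split_fixed(record))
```

The first two records split as intended, but the third loses the leading digit of its last field, and the fourth, whose first field is one digit longer, has that digit pushed into the second field and ends up with an empty third field.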

One approach I found online is to build a dictionary based on the occurrences of the words in the semi-structured file. Would that work in this situation, and if so, how could I implement something like that?