0

An extension to Problems reading CSV file with commas and characters in pandas

Solution provided in the above link works if and only if one column which contains commas as its characters and rest of the columns are well behaved.

What if there are more than one column with above issue?

Example CSV content with additional commas issue :

Name,Age,Address,Phone,Qualification Suresh,28,Texas,3334567892,B.Tech Ramesh,24,NewYork, NY,8978974040,9991111234,Ph.D Mukesh,26,Dallas,4547892345,Ph.D 

Required Output Pandas DataFrame:

Name Age Address Phone Qualification Suresh 28 Texas 3334567892 B.Tech Ramesh 24 NewYork, NY 8978974040,9991111234 Ph.D Mukesh 26 Dallas 4547892345 Ph.D 

Edited :

Input file with commas as characters in successive columns :

Name,Age,Address,Qualification,Grade Suresh,28,Texas,B.Tech,Ph.D,A Ramesh,24,NewYork, NY,B.Tech,A+ Mukesh,26,Dallas,B.Tech,Ph.D,A 

Required Output Pandas DataFrame:

Name Age Address Qualification Grade Suresh 28 Texas B.Tech,Ph.D A Ramesh 24 NewYork, NY B.Tech A+ Mukesh 26 Dallas B.Tech,Ph.D A 

Can I get any suggestions to solve this issue?

Thanks in Advance!!!

2 Answers 2

3

One way to do this would be to have " to clearly separate your data -

Name,Age,Address,Phone,Qualification Suresh,28,Texas,3334567892,B.Tech Ramesh,24,"NewYork, NY","8978974040,9991111234",Ph.D Mukesh,26,Dallas,4547892345,Ph.D 

If this isn't there, pandas will struggle to read it right.

Copy the above data, do a pd.read_clipboard(sep=',') and it will yield -

 Name Age Address Phone Qualification 0 Suresh 28 Texas 3334567892 B.Tech 1 Ramesh 24 NewYork, NY 8978974040,9991111234 Ph.D 2 Mukesh 26 Dallas 4547892345 Ph.D 

If modifying the source data as a whole is not within your means-

A practical approach would be to do a usual read_csv with error_bad_lines=False. Once done, look through the logs and make a note of the lines that pandas is struggling to read and modify only those lines accordingly.

Hope this helps.

Sign up to request clarification or add additional context in comments.

Comments

2

Your data appears fixed for the first two columns and also the last, so these can be removed and the remaining values could be processed using itertools.groupby() to group the remaining columns into numeric or non-numeric groups. The resulting data could then be loaded into pandas:

import pandas as pd from itertools import groupby import csv data = [] with open('input.csv', newline='') as f_input: csv_input = csv.reader(f_input) header = next(csv_input) for row in csv_input: addr_phone = [','.join(g) for k, g in groupby(row[2:-1], lambda x: x.isdigit())] data.append(row[:2] + addr_phone + [row[-1]]) df = pd.DataFrame(data, columns=header) print(df) 

Giving you:

 Name Age Address Phone Qualification 0 Suresh 28 Texas 3334567892 B.Tech 1 Ramesh 24 NewYork, NY 8978974040,9991111234 Ph.D 2 Mukesh 26 Dallas 4547892345 Ph.D 

To work with your second example, you would have to decide on a way to split the two columns. I would suggest you create a list of possible qualifications. When there is a match, you would be able to split at that point. For example:

import pandas as pd import csv def find_split(data): for index, v in enumerate(data): if v.lower() in ['b.tech', 'ph.d']: return [', '.join(data[:index]), ', '.join(data[index:])] return [', '.join(data), ''] data = [] with open('input.csv', newline='') as f_input: csv_input = csv.reader(f_input, skipinitialspace=True) header = next(csv_input) for row in csv_input: data.append(row[:2] + find_split(row[2:-1]) + [row[-1]]) df = pd.DataFrame(data, columns=header) print(df) 

Giving you:

 Name Age Address Qualification Grade 0 Suresh 28 Texas B.Tech, Ph.D A 1 Ramesh 24 NewYork, NY B.Tech A+ 2 Mukesh 26 Dallas B.Tech, Ph.D A 

You could create a list of qualifications by first creating a set() based on the contents of row[2] (lowercased). Print the contents of the set and then add that to the script and rerun it.

4 Comments

That works great!!! But what if I have a column(similar to address column with commas as it's characters in some lines) next to address column instead of Phone column??I mean if two non numeric columns with same issue are present in the data set next to each other, then It will be difficult to differenciate them using above solution right?
Please edit your question to include another example showing this. It is though not possible to guess which data belongs to which column in a completely generic way. The best solution would be to ensure the data is correctly escaped when it is created in the first place. Perhaps you could show how the data is being created?
The input data is provided by some third party source, so it's not possible to escape it while creation and this behavior is very rare, so there should be a way to handle it with some custom code snippet, since I cannot ignore these lines.
You are probably going to need a different approach for each type of file. I have updated the answer to solve your 2nd example.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.