
I have this .csv:

```
col1,col2,col3,col4,col5
247,19,1.0,2016-01-01 14:11:21,MP
247,3,1.0,2016-01-01 14:23:43,MP
247,12,1.0,2016-01-01 15:32:16,MP
402,3,1.0,2016-01-01 12:11:15,?
583,12,1.0,2016-01-01 02:33:57,?
769,16,1.0,2016-01-01 03:12:24,?
769,4,1.0,2016-01-01 03:22:29,?
.....
```

I need to take the col2 values for each unique col1 element and make a new .csv like this:

expected output:

```
19,3,12
3
12
16,4
...
```

That is, I want to output numbers until a non-unique value is seen, at which point I will start a new line and continue to output numbers.

I read the .csv this way and removed duplicates from the list:

```python
import pandas as pd

colnames = ['col1', 'col2', 'col3', 'col4', 'col5']
df = pd.read_csv('sorted.csv', names=colnames)
list1 = df.col1.tolist()
list2 = list(set(list1))
```

Now things are getting hard for me; I'm a newbie in Python. My idea was to compare each element in list2 with each row in df, writing the col2 elements to a new .csv. Could you help me, please?

  • What is your expected output of the df you want to write to csv? Commented Jun 26, 2018 at 12:42
  • You only need the first and second item from each line. Store them in a more useful data structure. Then iterate over that to generate the output. Commented Jun 26, 2018 at 12:44
  • If your intended output should be all column 2 values for a certain value of column 1, why are you removing duplicate values of column 1? Wouldn't this result in only one value of column 2 corresponding to a value in column 1? Please clarify your intended result so we can address this. Commented Jun 26, 2018 at 12:54
  • Please provide an example output and I'll build a possible answer Commented Jun 26, 2018 at 12:54
  • The example output is in the first post: "I need to take col2 values for each col1 unique element and make a new .csv like this:", so I just need a .csv file with those sequences; each row should be the sequence for a single value in col1. Commented Jun 26, 2018 at 13:08

3 Answers


Example in Python 3:

```python
import pandas as pd
import csv

x = pd.read_csv('input.txt')
y = x[['col1', 'col2']]
with open("output.csv", "w") as f:
    writer = csv.writer(f)
    y.groupby(['col1']).agg(lambda x: writer.writerow(list(x.values)))
```

Maybe you can try this. Don't store the whole output in a list or any data structure (memory issues): write to the file as you read and aggregate. (The reading should also be optimized to get an iterator if possible, rather than loading the whole input file at once.)
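A minimal sketch of that streaming idea using pandas' chunksize (the function name, file names, and chunk size here are illustrative assumptions; it also assumes the file is sorted by col1, as in the question, and has no header row, matching the names= usage):

```python
import csv
import pandas as pd

def write_groups(in_path="input.txt", out_path="output.csv", chunksize=10_000):
    """Stream the CSV in chunks so the whole file never sits in memory.

    Because the input is sorted by col1, each group is contiguous and
    only the last group of a chunk can spill into the next chunk.
    """
    colnames = ["col1", "col2", "col3", "col4", "col5"]
    pending_key, pending_vals = None, []
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for chunk in pd.read_csv(in_path, names=colnames, chunksize=chunksize):
            for key, group in chunk.groupby("col1", sort=False):
                vals = group["col2"].tolist()
                if key == pending_key:
                    # group continued from the previous chunk
                    pending_vals.extend(vals)
                else:
                    if pending_key is not None:
                        writer.writerow(pending_vals)
                    pending_key, pending_vals = key, vals
        if pending_key is not None:
            writer.writerow(pending_vals)  # flush the final group
```

The carry-over buffer is what makes the chunked read safe: a col1 run that straddles a chunk boundary still comes out as one output row.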


4 Comments

This is almost perfect! Is there any way to remove empty rows in the file? Actually I get an empty row between the rows with elements.
Is something like df.dropna (pandas.pydata.org/pandas-docs/stable/generated/…) what you're looking for?
@Nivii1406 It seems so, but it's not working; I get this error: "ParserError: Error tokenizing data. C error: Expected 3 fields in line 9, saw 4 in line 9" I have 4 numbers in the row
I've solved it by adding newline='' after "w" in the open call, thanks for your help!

You can do this by grouping your data then applying a set function as the aggregation.

```python
df.groupby('col1')['col2'].apply(set).apply(list)
```

The apply(set) call creates a set of all distinct col2 elements for each col1 value, and apply(list) then converts each set into a list.
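For example, on the sample data from the question (a sketch; note that set iteration order is arbitrary, so the values inside each list may come out in any order):

```python
import pandas as pd

# col1/col2 pairs from the question's sample .csv
df = pd.DataFrame({
    "col1": [247, 247, 247, 402, 583, 769, 769],
    "col2": [19, 3, 12, 3, 12, 16, 4],
})

# One list of distinct col2 values per unique col1 value,
# indexed by col1 (247 -> [19, 3, 12], 402 -> [3], ...)
result = df.groupby("col1")["col2"].apply(set).apply(list)
```

Each entry of result is a plain Python list, so writing it out with csv.writer.writerow gives one output row per col1 value.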

2 Comments

Hi,

```python
import pandas as pd

colnames = ['col1', 'col2', 'col3', 'col4', 'col5']
df = pd.read_csv('sorted.csv', names=colnames)
for value in df.groupby('col1')['col2'].apply(set).apply(list):
    print ",".join(map(str, value))
```

prints the following:

```
19,3,12
3
12
4,16
col2
```

Even the header line is included in the output.
@AbinayaDevarajan Are the headers included in your sorted.csv file? That seems to be the case. You are setting the names parameter in read_csv, which by default sets header=None, so the first row with the header information is treated as a row of data.

You need to track your duplicates. The simplest (as in easiest to understand, but sacrificing some efficiency) way is as follows:

```python
import pandas as pd

def file_out_helper(m_str):
    tgtfile = 'my_target_file.csv'
    with open(tgtfile, 'a') as f:  # append so earlier writes are kept
        f.write(m_str)

colnames = ['col1', 'col2', 'col3', 'col4', 'col5']
df = pd.read_csv('sorted.csv', names=colnames)
list1 = df.col2.tolist()

dup_tracker = []
for x in list1:
    if x in dup_tracker:        # a repeated col2 value starts a new line
        file_out_helper('\n')
    file_out_helper(str(x) + ', ')
    dup_tracker.append(x)
```

(Note the helper must be defined before the loop that calls it.)
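The same run-splitting can also be done in one pass with just the csv module and itertools.groupby (a sketch; the file and function names are placeholders, and it assumes the input is sorted by col1 with no header row, matching the names= usage in the question):

```python
import csv
from itertools import groupby

def col2_runs(in_path="sorted.csv", out_path="my_target_file.csv"):
    """Write one output row of col2 values per run of equal col1 values."""
    with open(in_path, newline="") as fin, open(out_path, "w", newline="") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        # groupby collapses consecutive rows that share the same col1
        for _, rows in groupby(reader, key=lambda row: row[0]):
            writer.writerow(row[1] for row in rows)
```

Because csv.reader is an iterator and groupby only buffers one run at a time, this never holds the whole file in memory.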

