
I have this .csv:

```
col1,col2,col3,col4,col5
247,19,1.0,2016-01-01 14:11:21,MP
247,3,1.0,2016-01-01 14:23:43,MP
247,12,1.0,2016-01-01 15:32:16,MP
402,3,1.0,2016-01-01 12:11:15,?
583,12,1.0,2016-01-01 02:33:57,?
769,16,1.0,2016-01-01 03:12:24,?
769,4,1.0,2016-01-01 03:22:29,?
.....
```

I need to take the col2 values for each unique col1 element and make a new .csv like this:

expected output:

```
19,3,12
3
12
16,4
...
```

That is, I want to output numbers until a non-unique value is seen, at which point I will start a new line and continue to output numbers.

I read the .csv this way and removed duplicates from the list:

```python
import pandas as pd

colnames = ['col1', 'col2', 'col3', 'col4', 'col5']
df = pd.read_csv('sorted.csv', names=colnames)
list1 = df.col1.tolist()
list2 = list(set(list1))
```

Now things are getting hard for me; I'm a newbie in Python. My idea was to compare each element in list2 with each row in df, writing the col2 elements to a new .csv. Could you help me, please?

  • What is your expected output of the df you want to write to csv? Commented Jun 26, 2018 at 12:42
  • You only need the first and second item from each line. Store them in a more useful data structure. Then iterate over that to generate the output. Commented Jun 26, 2018 at 12:44
  • If your intended output should be all column 2 values for a certain value of column 1, why are you removing duplicate values of column 1? Wouldn't this result in only one value of column 2 corresponding to a value in column 1? Please clarify your intended result so we can address this. Commented Jun 26, 2018 at 12:54
  • Please provide an example output and I'll build a possible answer Commented Jun 26, 2018 at 12:54
  • The example output is in the first post: "I need to take col2 values for each col1 unique element and make a new .csv like this:", so I just need a .csv file with those sequences; each row should be the sequence for a single value in col1. Commented Jun 26, 2018 at 13:08

3 Answers


Example in Python 3:

```python
import pandas as pd
import csv

x = pd.read_csv('input.txt')
y = x[['col1', 'col2']]
with open("output.csv", "w") as f:
    writer = csv.writer(f)
    y.groupby(['col1']).agg(lambda x: writer.writerow(list(x.values)))
```

Maybe you can try this. Don't store the whole output in a list or any data structure (memory issues): write to the file as you read and aggregate. (The reading should also be optimized to get an iterator if possible, rather than loading the whole input file at once.)
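A minimal sketch of that streaming idea using pandas' chunksize (the function name, file names, and chunk size here are illustrative assumptions; it also assumes the file is sorted by col1, as in the question, and has no header row, matching the names= usage):

```python
import csv
import pandas as pd

def write_groups(in_path="input.txt", out_path="output.csv", chunksize=10_000):
    """Stream the CSV in chunks so the whole file never sits in memory.

    Because the input is sorted by col1, each group is contiguous and
    only the last group of a chunk can spill into the next chunk.
    """
    colnames = ["col1", "col2", "col3", "col4", "col5"]
    pending_key, pending_vals = None, []
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for chunk in pd.read_csv(in_path, names=colnames, chunksize=chunksize):
            for key, group in chunk.groupby("col1", sort=False):
                vals = group["col2"].tolist()
                if key == pending_key:
                    # group continued from the previous chunk
                    pending_vals.extend(vals)
                else:
                    if pending_key is not None:
                        writer.writerow(pending_vals)
                    pending_key, pending_vals = key, vals
        if pending_key is not None:
            writer.writerow(pending_vals)  # flush the final group
```

The carry-over buffer is what makes the chunked read safe: a col1 run that straddles a chunk boundary still comes out as one output row.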


4 Comments

This is almost perfect! Is there any way to remove empty rows in the file? Actually I get an empty row between the rows with elements.
Is something like df.dropna (pandas.pydata.org/pandas-docs/stable/generated/…) what you're looking for?
@Nivii1406 It seems so, but it's not working; I get this error: "ParserError: Error tokenizing data. C error: Expected 3 fields in line 9, saw 4 in line 9" I have 4 numbers in the row
I've solved it by adding newline='' after "w" in the open call, thanks for your help!

You can do this by grouping your data then applying a set function as the aggregation.

```python
df.groupby('col1')['col2'].apply(set).apply(list)
```

The apply(set) call creates a set of all distinct col2 elements for each col1 value, and apply(list) then converts each set into a list.
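For example, on the sample data from the question (a sketch; note that set iteration order is arbitrary, so the values inside each list may come out in any order):

```python
import pandas as pd

# col1/col2 pairs from the question's sample .csv
df = pd.DataFrame({
    "col1": [247, 247, 247, 402, 583, 769, 769],
    "col2": [19, 3, 12, 3, 12, 16, 4],
})

# One list of distinct col2 values per unique col1 value,
# indexed by col1 (247 -> [19, 3, 12], 402 -> [3], ...)
result = df.groupby("col1")["col2"].apply(set).apply(list)
```

Each entry of result is a plain Python list, so writing it out with csv.writer.writerow gives one output row per col1 value.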

2 Comments

Hi,

```python
import pandas as pd

colnames = ['col1', 'col2', 'col3', 'col4', 'col5']
df = pd.read_csv('sorted.csv', names=colnames)
for value in df.groupby('col1')['col2'].apply(set).apply(list):
    print ",".join(map(str, value))
```

prints the following:

```
19,3,12
3
12
4,16
col2
```

Even the header line is included in the output.
@AbinayaDevarajan Are the headers included in your sorted.csv file? That seems to be the case. You are setting the names parameter in read_csv, which by default sets header=None, so the first row with the header information is treated as a row of data.

You need to track your duplicates. The simplest (as in easiest to understand, but sacrificing some efficiency) way is as follows:

```python
import pandas as pd

def file_out_helper(m_str):
    tgtfile = 'my_target_file.csv'
    with open(tgtfile, 'a') as f:  # append so earlier writes are kept
        f.write(m_str)

colnames = ['col1', 'col2', 'col3', 'col4', 'col5']
df = pd.read_csv('sorted.csv', names=colnames)
list1 = df.col2.tolist()

dup_tracker = []
for x in list1:
    if x in dup_tracker:        # a repeated col2 value starts a new line
        file_out_helper('\n')
    file_out_helper(str(x) + ', ')
    dup_tracker.append(x)
```

(Note the helper must be defined before the loop that calls it.)
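The same run-splitting can also be done in one pass with just the csv module and itertools.groupby (a sketch; the file and function names are placeholders, and it assumes the input is sorted by col1 with no header row, matching the names= usage in the question):

```python
import csv
from itertools import groupby

def col2_runs(in_path="sorted.csv", out_path="my_target_file.csv"):
    """Write one output row of col2 values per run of equal col1 values."""
    with open(in_path, newline="") as fin, open(out_path, "w", newline="") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        # groupby collapses consecutive rows that share the same col1
        for _, rows in groupby(reader, key=lambda row: row[0]):
            writer.writerow(row[1] for row in rows)
```

Because csv.reader is an iterator and groupby only buffers one run at a time, this never holds the whole file in memory.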

