Creating a 2D array based upon a CSV

Question

I have a CSV file with millions of lines in the format of the below:

start, finish,count; 101,101,10; 101,103,2; 101,104,8; 102,103,5;

So we have a start location, an end location and a count of the number of people who make that journey.

What I'd like to do is put this into a 'table-style' matrix with all the start locations running along the top, all the end locations running down the side and in the body of the matrix have a sum of all the counts that sit within that intersect.

So far I have the CSV file cleaned and imported and have the start and end locations stored as vectors, however I'm unsure how to proceed when forming the body of the matrix, can anyone help?

Thank you.

EDIT: I would like it to look as follows:

 101,102; 101,10,0; 103,2,5; 104,8,0;

Do I understand correctly that the same (source, destination) pair might be repeated, or will there always be at most a single count for a given source and destination? — holdenweb
– holdenweb, Commented Aug 4, 2016 at 12:58
Hi @holdenweb , there will be multiple entries with the same source & destination, hence the need to sum all the counts within the matrix body. — Tom Pinder
– Tom Pinder, Commented Aug 4, 2016 at 13:00

piRSquared · Accepted Answer · 2016-08-04 13:26:27Z

2

use set_index and unstack

df.set_index(['start', 'finish'])['count'].unstack(0)

To save to csv

print df.set_index(['start', 'finish'])['count'].unstack(0).rename_axis(None) \ .to_csv('myfilename.csv') ,101,102 101,10.0, 103,2.0,5.0 104,8.0,

edited Aug 4, 2016 at 13:26

answered Aug 4, 2016 at 13:21

piRSquared

296k68 gold badges509 silver badges654 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Tom Pinder Over a year ago

That's fantastic @piRSqaured, exactly what I was looking for. Thanks!

piRSquared Over a year ago

If this answered your question, don't forget to check it.

Gaurav Dhama · Accepted Answer · 2016-08-04 13:50:48Z

You said you have millions of lines, so i don't know whether this will be effective or not, but if you don't run into memory issues a pandas dataframe is the way to go:

import pandas as pd df = pd.read_csv('inputfile.csv') df = df.groupby(['start','finish']).agg({'count':sum}).reset_index() # Create Pivot table df_out = df.pivot(index='finish',columns = 'start',values='count') # Write Output df_out.rename_axis(None).to_csv('output.csv')

Thanks for that @GauravDhama, the only problem is that I was hoping to have all the unique 'start' values running across the top of the table and all the unique 'end' values running down the side of the table. Do you know if this is possible?

jezrael · Accepted Answer · 2016-08-04 13:58:48Z

Another solution with pivot:

print (df.pivot(index='finish', columns='start', values='count')) start 101 102 finish 101 10.0 NaN 103 2.0 5.0 104 8.0 NaN

If need remove columns and index names use rename_axis (new in pandas 0.18.0):

print (df.pivot(index='finish', columns='start', values='count') .rename_axis(None) .rename_axis(None, axis=1)) 101 102 101 10.0 NaN 103 2.0 5.0 104 8.0 NaN

Collectives™ on Stack Overflow

Creating a 2D array based upon a CSV

3 Answers 3

2 Comments

2 Comments

Comments

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

2 Comments

2 Comments

Comments

Related