2

I have a CSV file with millions of lines in the format of the below:

start, finish,count; 101,101,10; 101,103,2; 101,104,8; 102,103,5; 

So we have a start location, an end location and a count of the number of people who make that journey.

What I'd like to do is put this into a 'table-style' matrix with all the start locations running along the top, all the end locations running down the side and in the body of the matrix have a sum of all the counts that sit within that intersect.

So far I have the CSV file cleaned and imported and have the start and end locations stored as vectors, however I'm unsure how to proceed when forming the body of the matrix, can anyone help?

Thank you.

EDIT: I would like it to look as follows:

 101,102; 101,10,0; 103,2,5; 104,8,0; 
4
  • 1
    Can you please provide some sample output? Commented Aug 4, 2016 at 12:54
  • Do I understand correctly that the same (source, destination) pair might be repeated, or will there always be at most a single count for a given source and destination? Commented Aug 4, 2016 at 12:58
  • Hi @GauravDhama I've added this into the question. Commented Aug 4, 2016 at 12:59
  • Hi @holdenweb , there will be multiple entries with the same source & destination, hence the need to sum all the counts within the matrix body. Commented Aug 4, 2016 at 13:00

3 Answers 3

2

use set_index and unstack

df.set_index(['start', 'finish'])['count'].unstack(0) 

enter image description here


To save to csv

print df.set_index(['start', 'finish'])['count'].unstack(0).rename_axis(None) \ .to_csv('myfilename.csv') ,101,102 101,10.0, 103,2.0,5.0 104,8.0, 
Sign up to request clarification or add additional context in comments.

2 Comments

That's fantastic @piRSqaured, exactly what I was looking for. Thanks!
If this answered your question, don't forget to check it.
0

You said you have millions of lines, so i don't know whether this will be effective or not, but if you don't run into memory issues a pandas dataframe is the way to go:

import pandas as pd df = pd.read_csv('inputfile.csv') df = df.groupby(['start','finish']).agg({'count':sum}).reset_index() # Create Pivot table df_out = df.pivot(index='finish',columns = 'start',values='count') # Write Output df_out.rename_axis(None).to_csv('output.csv') 

2 Comments

Thanks for that @GauravDhama, the only problem is that I was hoping to have all the unique 'start' values running across the top of the table and all the unique 'end' values running down the side of the table. Do you know if this is possible?
Have edited the program. Check if this works for you!!
0

Another solution with pivot:

print (df.pivot(index='finish', columns='start', values='count')) start 101 102 finish 101 10.0 NaN 103 2.0 5.0 104 8.0 NaN 

If need remove columns and index names use rename_axis (new in pandas 0.18.0):

print (df.pivot(index='finish', columns='start', values='count') .rename_axis(None) .rename_axis(None, axis=1)) 101 102 101 10.0 NaN 103 2.0 5.0 104 8.0 NaN 

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.