4

I am trying to compile multiple text files into a single data frame. However, when I join the data frames with Pandas Concat function, the shape of the resulting data frame adds new columns. In the code sample below, data frame 3 has 12 columns rather than 8. Why?

**Input:** import pandas as pd df1 = pd.read_csv('2011-12-01-data.txt',sep = None, engine = 'python') df2 = pd.read_csv('2011-12-02-data.txt',sep = None, engine = 'python') df3= pd.concat([df1, df2]) print(df1.shape) print(df2.shape) print(df3.shape) **Output:** df1 shape = (26986, 8) df1 shape =(27266, 8) df3 shape =(54252, 12) 

The I am working with flight data available at http://lunadong.com/datasets/clean_flight.zip

0

2 Answers 2

2

I think you need header=None parameter for default columns names 0-7, because files has no headers. Also if there is separator tab, is possible specify it.

df1 = pd.read_csv('2011-12-01-data.txt',sep = '\t', engine = 'python', header=None) df2 = pd.read_csv('2011-12-02-data.txt',sep = '\t', engine = 'python', header=None) df3= pd.concat([df1, df2]) print(df1.shape) print(df2.shape) print(df3.shape) (26987, 8) (27267, 8) (54254, 8) print(df1.columns) Int64Index([0, 1, 2, 3, 4, 5, 6, 7], dtype='int64') print(df2.columns) Int64Index([0, 1, 2, 3, 4, 5, 6, 7], dtype='int64') print(df3.columns) Int64Index([0, 1, 2, 3, 4, 5, 6, 7], dtype='int64') 

Another solution is specify names parameter for new column names:

names= ['col1','col2','col3','col4','col5','col6','col7','col8'] df1 = pd.read_csv('2011-12-01-data.txt',sep = '\t', engine = 'python', names=names) df2 = pd.read_csv('2011-12-02-data.txt',sep = '\t', engine = 'python', names=names) df3= pd.concat([df1, df2]) print(df1.shape) print(df2.shape) print(df3.shape) (26987, 8) (27267, 8) (54254, 8) print(df1.columns) print(df2.columns) print(df3.columns) Index(['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8'], dtype='object') Index(['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8'], dtype='object') Index(['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8'], dtype='object') 

You get only 12 columns, because some values in first row of both dataframes are same, so from them was created columns names. After concat columns are aligned only for this columns. If values was different, there was no align and you get NaNs.

print(df1.columns) Index(['aa', 'AA-1007-TPA-MIA', '12/01/2011 01:55 PM', '12/01/2011 02:07 PM', 'F78', '12/01/2011 03:00 PM', '12/01/2011 02:57 PM', 'D5'], dtype='object') print(df2.columns) Index(['aa', 'AA-1007-TPA-MIA', '12/02/2011 01:55 PM', '12/02/2011 02:13 PM', 'F78', '12/02/2011 03:00 PM', '12/02/2011 03:05 PM', 'D5'], dtype='object') print(df3.columns) Index(['12/01/2011 01:55 PM', '12/01/2011 02:07 PM', '12/01/2011 02:57 PM', '12/01/2011 03:00 PM', '12/02/2011 01:55 PM', '12/02/2011 02:13 PM', '12/02/2011 03:00 PM', '12/02/2011 03:05 PM', 'AA-1007-TPA-MIA', 'D5', 'F78', 'aa'], dtype='object') 

print(df3.head()) 12/01/2011 01:55 PM 12/01/2011 02:07 PM 12/01/2011 02:57 PM \ 0 NaN 12/1/2011 2:07PM EST 12/1/2011 2:51PM EST 1 NaN 12/1/11 2:06 PM (-05:00) 12/1/11 2:51 PM (-05:00) 2 NaN 12/1/11 2:06 PM (-05:00) 12/1/11 2:51 PM (-05:00) 3 NaN 12/1/11 2:06 PM (-05:00) 12/1/11 2:51 PM (-05:00) 4 NaN 12/1/11 2:06 PM (-05:00) 12/1/11 2:51 PM (-05:00) 12/01/2011 03:00 PM 12/02/2011 01:55 PM 12/02/2011 02:13 PM \ 0 NaN NaN NaN 1 NaN NaN NaN 2 NaN NaN NaN 3 NaN NaN NaN 4 NaN NaN NaN 12/02/2011 03:00 PM 12/02/2011 03:05 PM AA-1007-TPA-MIA D5 F78 \ 0 NaN NaN AA-1007-TPA-MIA NaN NaN 1 NaN NaN AA-1007-TPA-MIA NaN NaN 2 NaN NaN AA-1007-TPA-MIA NaN NaN 3 NaN NaN AA-1007-TPA-MIA NaN NaN 4 NaN NaN AA-1007-TPA-MIA NaN NaN aa 0 flightexplorer 1 airtravelcenter 2 myrateplan 3 helloflight 4 flytecomm 

print(df3.tail()) 12/01/2011 01:55 PM 12/01/2011 02:07 PM 12/01/2011 02:57 PM \ 27261 NaN NaN NaN 27262 NaN NaN NaN 27263 NaN NaN NaN 27264 NaN NaN NaN 27265 NaN NaN NaN 12/01/2011 03:00 PM 12/02/2011 01:55 PM 12/02/2011 02:13 PM \ 27261 NaN Dec 02 - 10:20pm Dec 02 - 10:23pm 27262 NaN 10:20pDec 2 10:23pDec 2 27263 NaN 2011-12-02 10:20 PM NaN 27264 NaN 2011-12-02 10:20 pm NaN 27265 NaN 2011-12-02 10:20PM CST 2011-12-02 10:31PM CST 12/02/2011 03:00 PM 12/02/2011 03:05 PM AA-1007-TPA-MIA D5 \ 27261 Dec 02 - 11:59pm Dec 02 - 11:51pm* AA-2059-DFW-SLC A3 27262 11:43pDec 2 NaN AA-2059-DFW-SLC A3 27263 2011-12-02 11:59 PM NaN AA-2059-DFW-SLC NaN 27264 NaN NaN AA-2059-DFW-SLC NaN 27265 2011-12-02 11:35PM MST 2011-12-02 11:43PM MST AA-2059-DFW-SLC A3 F78 aa 27261 C20/C travelocity 27262 C20 orbitz 27263 NaN weather 27264 C20 dfw 27265 C20 flightwise 
Sign up to request clarification or add additional context in comments.

Comments

1

User jezrael's answer solves the problem. But let me try to explain why pandas added new columns to your concatenated data frame and what went wrong.

pandas misread header

When you set header = None, pandas reads the first line of your file as a header and sets it by default to names of each column. Based on your code, these are the two sets of columns one would get for each of your dataframes if header = None.

df1: ['aa', 'AA-1007-TPA-MIA', '12/01/2011 01:55 PM', '12/01/2011 02:07 PM', 'F78', '12/01/2011 03:00 PM', '12/01/2011 02:57 PM', 'D5']

df2: ['aa', 'AA-1007-TPA-MIA', '12/02/2011 01:55 PM', '12/02/2011 02:13 PM', 'F78', '12/02/2011 03:00 PM', '12/02/2011 03:05 PM', 'D5']

Non-Unique columns appended

Finally, when you concatenated the two dataframes, all columns that were not common to df1 and df2 were appended as separate columns. 'aa','AA-1007-TPA-MIA', 'F78' and 'D5' were unique to df1 and df2 while everything else was appended to the list of columns.

This lead to 4(df1&df2) + 4(df1) + 4(df2) = 12 columns

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.