I think you need header=None parameter for default columns names 0-7, because files has no headers. Also if there is separator tab, is possible specify it.
df1 = pd.read_csv('2011-12-01-data.txt',sep = '\t', engine = 'python', header=None) df2 = pd.read_csv('2011-12-02-data.txt',sep = '\t', engine = 'python', header=None) df3= pd.concat([df1, df2]) print(df1.shape) print(df2.shape) print(df3.shape) (26987, 8) (27267, 8) (54254, 8) print(df1.columns) Int64Index([0, 1, 2, 3, 4, 5, 6, 7], dtype='int64') print(df2.columns) Int64Index([0, 1, 2, 3, 4, 5, 6, 7], dtype='int64') print(df3.columns) Int64Index([0, 1, 2, 3, 4, 5, 6, 7], dtype='int64')
Another solution is specify names parameter for new column names:
names= ['col1','col2','col3','col4','col5','col6','col7','col8'] df1 = pd.read_csv('2011-12-01-data.txt',sep = '\t', engine = 'python', names=names) df2 = pd.read_csv('2011-12-02-data.txt',sep = '\t', engine = 'python', names=names) df3= pd.concat([df1, df2]) print(df1.shape) print(df2.shape) print(df3.shape) (26987, 8) (27267, 8) (54254, 8) print(df1.columns) print(df2.columns) print(df3.columns) Index(['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8'], dtype='object') Index(['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8'], dtype='object') Index(['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8'], dtype='object')
You get only 12 columns, because some values in first row of both dataframes are same, so from them was created columns names. After concat columns are aligned only for this columns. If values was different, there was no align and you get NaNs.
print(df1.columns) Index(['aa', 'AA-1007-TPA-MIA', '12/01/2011 01:55 PM', '12/01/2011 02:07 PM', 'F78', '12/01/2011 03:00 PM', '12/01/2011 02:57 PM', 'D5'], dtype='object') print(df2.columns) Index(['aa', 'AA-1007-TPA-MIA', '12/02/2011 01:55 PM', '12/02/2011 02:13 PM', 'F78', '12/02/2011 03:00 PM', '12/02/2011 03:05 PM', 'D5'], dtype='object') print(df3.columns) Index(['12/01/2011 01:55 PM', '12/01/2011 02:07 PM', '12/01/2011 02:57 PM', '12/01/2011 03:00 PM', '12/02/2011 01:55 PM', '12/02/2011 02:13 PM', '12/02/2011 03:00 PM', '12/02/2011 03:05 PM', 'AA-1007-TPA-MIA', 'D5', 'F78', 'aa'], dtype='object')
print(df3.head()) 12/01/2011 01:55 PM 12/01/2011 02:07 PM 12/01/2011 02:57 PM \ 0 NaN 12/1/2011 2:07PM EST 12/1/2011 2:51PM EST 1 NaN 12/1/11 2:06 PM (-05:00) 12/1/11 2:51 PM (-05:00) 2 NaN 12/1/11 2:06 PM (-05:00) 12/1/11 2:51 PM (-05:00) 3 NaN 12/1/11 2:06 PM (-05:00) 12/1/11 2:51 PM (-05:00) 4 NaN 12/1/11 2:06 PM (-05:00) 12/1/11 2:51 PM (-05:00) 12/01/2011 03:00 PM 12/02/2011 01:55 PM 12/02/2011 02:13 PM \ 0 NaN NaN NaN 1 NaN NaN NaN 2 NaN NaN NaN 3 NaN NaN NaN 4 NaN NaN NaN 12/02/2011 03:00 PM 12/02/2011 03:05 PM AA-1007-TPA-MIA D5 F78 \ 0 NaN NaN AA-1007-TPA-MIA NaN NaN 1 NaN NaN AA-1007-TPA-MIA NaN NaN 2 NaN NaN AA-1007-TPA-MIA NaN NaN 3 NaN NaN AA-1007-TPA-MIA NaN NaN 4 NaN NaN AA-1007-TPA-MIA NaN NaN aa 0 flightexplorer 1 airtravelcenter 2 myrateplan 3 helloflight 4 flytecomm
print(df3.tail()) 12/01/2011 01:55 PM 12/01/2011 02:07 PM 12/01/2011 02:57 PM \ 27261 NaN NaN NaN 27262 NaN NaN NaN 27263 NaN NaN NaN 27264 NaN NaN NaN 27265 NaN NaN NaN 12/01/2011 03:00 PM 12/02/2011 01:55 PM 12/02/2011 02:13 PM \ 27261 NaN Dec 02 - 10:20pm Dec 02 - 10:23pm 27262 NaN 10:20pDec 2 10:23pDec 2 27263 NaN 2011-12-02 10:20 PM NaN 27264 NaN 2011-12-02 10:20 pm NaN 27265 NaN 2011-12-02 10:20PM CST 2011-12-02 10:31PM CST 12/02/2011 03:00 PM 12/02/2011 03:05 PM AA-1007-TPA-MIA D5 \ 27261 Dec 02 - 11:59pm Dec 02 - 11:51pm* AA-2059-DFW-SLC A3 27262 11:43pDec 2 NaN AA-2059-DFW-SLC A3 27263 2011-12-02 11:59 PM NaN AA-2059-DFW-SLC NaN 27264 NaN NaN AA-2059-DFW-SLC NaN 27265 2011-12-02 11:35PM MST 2011-12-02 11:43PM MST AA-2059-DFW-SLC A3 F78 aa 27261 C20/C travelocity 27262 C20 orbitz 27263 NaN weather 27264 C20 dfw 27265 C20 flightwise