
I have 15 CSV files, one of whose columns represents the year. The problem is that this column is named 'year' in some files and 'year_' in the others, so after combining the files I have two columns holding the same information. Since each file has only one of the two names, a row with a value in 'year' has NaN in 'year_', and vice versa. I want to combine those two columns so that I can get rid of the NaNs. What is the best way to do this?

Before

   year  year_
1   NaN   1999
2  2002    NaN
3  2000    NaN
...
N   NaN   2004

I want this to be

After

   year
1  1999
2  2002
3  2000
...
N  2004
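(Aside: the split columns can also be avoided entirely by normalizing the column name as each file is read, before concatenating. A minimal sketch, using in-memory CSV buffers as stand-ins for the 15 files; the column values are illustrative:)

```python
import io

import pandas as pd

# Two hypothetical CSV snippets: one file uses 'year', the other 'year_'
csv_a = io.StringIO("id,year\n1,2002\n2,2000\n")
csv_b = io.StringIO("id,year_\n3,1999\n4,2004\n")

frames = []
for buf in (csv_a, csv_b):
    frame = pd.read_csv(buf)
    # Normalize the column name up front so every frame
    # contributes to a single 'year' column
    frames.append(frame.rename(columns={"year_": "year"}))

df = pd.concat(frames, ignore_index=True)
# df now has one 'year' column with no NaNs from the name mismatch
```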


You can use the combine_first function:

df['year'] = df['year'].combine_first(df['year_'])

where df['year'] is the default and df['year_'] is used to fill its null values.
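A runnable sketch of this answer; the sample frame reproduces the question's "Before" table with illustrative values:

```python
import numpy as np
import pandas as pd

# Sample frame matching the question's "Before" table
df = pd.DataFrame({
    "year":  [np.nan, 2002, 2000, np.nan],
    "year_": [1999, np.nan, np.nan, 2004],
})

# combine_first keeps non-null values from df['year'] and fills
# its NaNs from df['year_']
df["year"] = df["year"].combine_first(df["year_"])
df = df.drop(columns="year_")
```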


Seems to be faster than the sum solution.

Given that only one of the two columns has a valid value in each row, you can simply sum them along axis 1:

year_cols = df.columns[df.columns.str.contains('year')]
df['year'] = df[year_cols].sum(axis=1)
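The same approach end to end, on an illustrative frame matching the question's "Before" table:

```python
import numpy as np
import pandas as pd

# Sample frame matching the question's "Before" table
df = pd.DataFrame({
    "year":  [np.nan, 2002, 2000, np.nan],
    "year_": [1999, np.nan, np.nan, 2004],
})

# Select every column whose name contains 'year'
year_cols = df.columns[df.columns.str.contains("year")]
# sum skips NaN by default, so each row yields its single valid year
df["year"] = df[year_cols].sum(axis=1)
df = df.drop(columns="year_")
```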


Same idea as @Vaishali's: sum the year columns, but use filter to select them:

df.filter(like='year').sum(axis=1) 
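For completeness, a sketch of the filter variant on the same illustrative frame, assigning the result back and dropping the redundant column:

```python
import numpy as np
import pandas as pd

# Illustrative frame matching the question's "Before" table
df = pd.DataFrame({
    "year":  [np.nan, 2002, 2000, np.nan],
    "year_": [1999, np.nan, np.nan, 2004],
})

# filter(like='year') picks every column whose name contains 'year';
# sum skips NaN, so each row keeps its single valid value
df["year"] = df.filter(like="year").sum(axis=1)
df = df.drop(columns="year_")
```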

