I am working in Python and have a large dataset in a pandas dataframe. I have taken a section of this data and put it into another dataframe, where I have created a new column and populated it. I now want to put this new column back into the original dataframe, overwriting one of the existing columns, but only for the section I have edited.

Can you advise how this is best done? The only unique identifier is the automatically generated index. The second dataframe has kept the same index values as the larger one, so it should be quite straightforward, but I cannot work out how to a) reference the automatically created index values and b) use them to overwrite the existing data in the column with data from another dataframe.

So, it should be something like this (I realise this is a mashup of syntax but just trying to better explain what I am trying to do!):

where df1.ROW.INDEX == df2.ROW.INDEX insert into df1['col_name'].value from df2.['col_name'].value 

Any help would be greatly appreciated.

UPDATE: I now have this code which almost works:

index_values = edited_df.index.values
for i in index_values:
    main_df.iloc[i]['pop'] = edited_df.iloc[i]['new_col']

I get a caveats error, and the main_df is not changed. It looks like it is making copies in each iteration rather than updating the main dataframe.

UPDATE: FIXED. I finally managed to work out the kinks; the solution is below for anyone who has a similar problem.

index_values = edited_df.index.values
for i in index_values:
    main_df.iloc[i, main_df.columns.get_loc('pop')] = edited_df.iloc[i]['new_col']

2 Answers

Consider using pandas.DataFrame.update, which updates a dataframe in place from the dataframe passed in. It aligns on both index and column labels, so make sure the column names match in both dataframes.

main_df.update(edited_df, join='left', overwrite=True) 
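
As a rough sketch of the call above on made-up data: only `pop` and `new_col` are names from the question, while the `name` column, the values, and the `patch` variable are hypothetical. Because update matches on column labels, the edited column is exposed under the name `pop` before the call.

```python
import pandas as pd

# Hypothetical data; only 'pop' and 'new_col' are names from the question.
main_df = pd.DataFrame(
    {"name": ["a", "b", "c", "d"], "pop": [10.0, 20.0, 30.0, 40.0]}
)

# Take a section of the data, keeping the original index labels (1 and 2).
edited_df = main_df.iloc[1:3].copy()
edited_df["new_col"] = edited_df["pop"] * 10

# update() aligns on index AND column labels, so rename new_col to 'pop'.
patch = edited_df[["new_col"]].rename(columns={"new_col": "pop"})

# Modifies main_df in place; rows 0 and 3 are not in patch and stay as they were.
main_df.update(patch, join="left", overwrite=True)

print(main_df)
```

Passing only the renamed column keeps the call from touching any other column of main_df.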


I appreciate that you've found a solution that works. However, you're using a for loop when you don't need to. I'll start by improving your loop. Then I'll back up @Partfait's update idea.

Use loc to reference rows and columns by their labels. With iloc you're referencing by position, so you're relying on the coincidence that your index values are sequential integers that happen to match the row positions.

index_values = edited_df.index.values
for i in index_values:
    main_df.loc[i, 'pop'] = edited_df.loc[i, 'new_col']
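
To illustrate the label versus position point, here is a minimal sketch on a made-up frame whose index labels are 100, 200, 300 instead of 0, 1, 2. The data and the variable names `by_label`, `by_position`, and `label_as_position_failed` are hypothetical.

```python
import pandas as pd

# Hypothetical frame whose index labels are not 0, 1, 2.
main_df = pd.DataFrame({"pop": [10.0, 20.0, 30.0]}, index=[100, 200, 300])

# loc looks rows up by index label.
by_label = main_df.loc[200, "pop"]

# iloc looks rows up by position; position 1 is the row labelled 200.
by_position = main_df.iloc[1, main_df.columns.get_loc("pop")]

# Using a label as a position fails once labels and positions differ.
try:
    main_df.iloc[200, main_df.columns.get_loc("pop")]
    label_as_position_failed = False
except IndexError:
    label_as_position_failed = True

print(by_label, by_position, label_as_position_failed)
```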

However, loc can take array-like indexers and you're only using scalar indexers. That means you're better off using at:

index_values = edited_df.index.values
for i in index_values:
    main_df.at[i, 'pop'] = edited_df.at[i, 'new_col']

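Or, on older pandas versions, you can go even faster with set_value. Note that set_value and get_value were deprecated in pandas 0.21 and removed in 1.0, so stick with at on current versions.

index_values = edited_df.index.values
for i in index_values:
    main_df.set_value(i, 'pop', edited_df.get_value(i, 'new_col'))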

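All that said, here is how you could use loc in one go. Select only the edited rows: assigning to loc[:, 'pop'] aligns on the index and fills every row that is missing from edited_df with NaN.

main_df.loc[edited_df.index, 'pop'] = edited_df['new_col']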
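
A small sketch of the one-step loc assignment on made-up data, comparing a whole-column assignment with one restricted to the rows of edited_df. The values and the `whole` variable are hypothetical. Assigning a Series aligns on the index, so rows absent from edited_df end up as NaN unless the row selection is limited to edited_df.index.

```python
import pandas as pd

# Hypothetical data; only 'pop' and 'new_col' are names from the question.
main_df = pd.DataFrame({"pop": [10.0, 20.0, 30.0, 40.0]})
edited_df = main_df.iloc[1:3].copy()
edited_df["new_col"] = [200.0, 300.0]

# Whole-column assignment on a copy: rows 0 and 3 have no match and become NaN.
whole = main_df.copy()
whole.loc[:, "pop"] = edited_df["new_col"]

# Restricting the rows to edited_df.index leaves rows 0 and 3 untouched.
main_df.loc[edited_df.index, "pop"] = edited_df["new_col"]

print(whole)
print(main_df)
```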

Or as @Partfait suggested

main_df.update(edited_df['new_col'].rename('pop')) 
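
One difference between the two one-liners is worth a quick sketch, again on made-up data: update only copies non-NA values, so a NaN in new_col leaves the existing pop value alone, whereas a loc assignment writes the NaN. The `via_update` and `via_loc` names are hypothetical.

```python
import pandas as pd

# Hypothetical data with a missing value in the edited column.
main_df = pd.DataFrame({"pop": [10.0, 20.0, 30.0, 40.0]})
edited_df = main_df.iloc[1:3].copy()
edited_df["new_col"] = [200.0, float("nan")]

# update() skips NaN in the source, so row 2 keeps its old value.
via_update = main_df.copy()
via_update.update(edited_df["new_col"].rename("pop"))

# loc assignment copies the NaN into row 2.
via_loc = main_df.copy()
via_loc.loc[edited_df.index, "pop"] = edited_df["new_col"]

print(via_update)
print(via_loc)
```

Which behavior is preferable depends on whether a NaN in the edited section is meant to overwrite the original value.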
