I have two dataframes of different length in python pandas like this:

    df1:
       Column1 Column2 Column3
    0        1       a       r
    1        2       b       u
    2        3       c       k
    3        4       d       j
    4        5       e       f

    df2:
       ColumnA ColumnB
    0        1       a
    1        1       d
    2        1       e
    3        2       r
    4        2       w
    5        3       y
    6        3       h

What I am trying to do now is compare Column1 of df1 with ColumnA of df2. For each "hit", where a row in ColumnA of df2 has the same value as a row in Column1 of df1, I want to append a column to df1 containing the value that ColumnB of df2 has in the row where the "hit" was found, so that my result looks like this:

    df1:
       Column1 Column2 Column3 Column4 Column5 Column6
    0        1       a       r       a       d       e
    1        2       b       u       r       w
    2        3       c       k       y       h
    3        4       d       j
    4        5       e       f

What I have tried so far was:

    for row in df1, df2:
        if df1[Column1] == df2[ColumnA]:
            print 'yey!'

which gave me an error saying I could not compare two dataframes of different length. So I tried:

    for row in df1, df2:
        if df2[df2['ColumnA'].isin(df1['Column1'])]:
            print 'lalala'
        else:
            print 'Nope'

This "works" in the sense that I get output, but I do not think it iterates over the rows and compares them, since it only prints 'lalala' twice. So I researched some more and found a way to iterate over each row of the dataframe, which is:

    for index, row in df1.iterrows():
        print row['Column1']

But I do not know how to use this to compare the columns of the two dataframes and get the output I desire.

Any help on how to do this would be really appreciated.

1 Answer

I recommend using the DataFrame API, which lets you work with data frames through operations such as join, merge, and groupby. My solution is below:

    import pandas as pd

    df1 = pd.DataFrame({'Column1': [1, 2, 3, 4, 5],
                        'Column2': ['a', 'b', 'c', 'd', 'e'],
                        'Column3': ['r', 'u', 'k', 'j', 'f']})
    df2 = pd.DataFrame({'Column1': [1, 1, 1, 2, 2, 3, 3],
                        'ColumnB': ['a', 'd', 'e', 'r', 'w', 'y', 'h']})

    buffers = []
    for name, group in df2.groupby('Column1'):
        # One-row frame holding the group's key (keeps the first row's index label).
        buffer_df = pd.DataFrame({'Column1': group['Column1'][:1]})
        # Add one column per ColumnB value, numbered from 0.
        for i, value in enumerate(group['ColumnB']):
            buffer_df['Column_' + str(i)] = value
        buffers.append(buffer_df)

    # Stack the one-row frames; columns a group lacks become NaN.
    # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0.)
    dfs = pd.concat(buffers)

    result = pd.merge(df1, dfs, how='left', on='Column1')
    print(result)

The result is:

       Column1 Column2 Column3 Column_0 Column_1 Column_2
    0        1       a       r        a        d        e
    1        2       b       u        r        w      NaN
    2        3       c       k        y        h      NaN
    3        4       d       j      NaN      NaN      NaN
    4        5       e       f      NaN      NaN      NaN

P.S. More details:

1) For df2 I produce groups by 'Column1'. Each group is itself a data frame. Example below:

       Column1 ColumnB
    0        1       a
    1        1       d
    2        1       e

2) For each group I build a one-row data frame, buffer_df:

       Column1 Column_0 Column_1 Column_2
    0        1        a        d        e

3) After that I collect these rows into the data frame dfs:

       Column1 Column_0 Column_1 Column_2
    0        1        a        d        e
    3        2        r        w      NaN
    5        3        y        h      NaN

4) In the end I left-join df1 with dfs to obtain the desired result.

2)* buffer_df is produced iteratively (shown here for the group with Column1 == 3):

step0 (buffer_df = pd.DataFrame({'Column1': group['Column1'][:1]})):

       Column1
    5        3

step1 (buffer_df['Column_0'] = group['ColumnB'][5]):

       Column1 Column_0
    5        3        y

step2 (buffer_df['Column_1'] = group['ColumnB'][6]):

       Column1 Column_0 Column_1
    5        3        y        h
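
As an alternative to the explicit loop, the same reshaping can probably be done without iterating over groups. The following is a minimal sketch, not part of the original answer: it numbers the rows inside each group with `groupby(...).cumcount()`, pivots those positions into columns, and then left-joins onto df1. It uses the question's column name `ColumnA` for df2; the names `position` and `wide` are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({'Column1': [1, 2, 3, 4, 5],
                    'Column2': ['a', 'b', 'c', 'd', 'e'],
                    'Column3': ['r', 'u', 'k', 'j', 'f']})
df2 = pd.DataFrame({'ColumnA': [1, 1, 1, 2, 2, 3, 3],
                    'ColumnB': ['a', 'd', 'e', 'r', 'w', 'y', 'h']})

# Number the rows within each ColumnA group: 0, 1, 2, ...
position = df2.groupby('ColumnA').cumcount()

# One row per ColumnA value, one column per position within the group.
wide = (df2.assign(position=position)
           .pivot(index='ColumnA', columns='position', values='ColumnB')
           .add_prefix('Column_'))

# Left join keeps every row of df1; keys without a match get NaN.
result = df1.merge(wide, how='left', left_on='Column1', right_index=True)
print(result)
```

Because the row numbering comes from `cumcount()` rather than from index labels, this sketch does not depend on what the index of df2 looks like.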

7 Comments

Thank you, a very neat answer! However I notice that I do not quite understand what you're doing from buffer_df = .... until dfs = dfs.append(buffer_df). Could you explain what the code does? Thank you!
actually I think I get what the single lines of code do, but I do not get how they work together to create the output...
@sequence_hard check my answer again: new details have been added. Has the process become clearer for you?
Yes, it is clear now, thank you for that very detailed answer. I was too braindead yesterday, that's the reason for my late answer. However, when I try using the script on my actual data (which has a structure similar to the example data, just with more columns in each df and mixed string/integer values), I get the following error:

    line 33, in <module>
        buffer_df[string] = group['Gene'][iter]
    KeyError: 83

Any idea what the cause for that could be?
Since this:

    File "index.pyx", line 97, in pandas.index.IndexEngine.get_value (pandas/index.c:2679)
    File "index.pyx", line 105, in pandas.index.IndexEngine.get_value (pandas/index.c:2494)
    File "index.pyx", line 149, in pandas.index.IndexEngine.get_loc (pandas/index.c:3233)
    File "hashtable.pyx", line 381, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:7032)
    File "hashtable.pyx", line 387, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:6973)
    KeyError: 83

is also part of the error message, I guess something might be wrong with the indexing of my files...
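
One possible cause, offered only as a guess since the actual data is not shown: on a Series with an integer index, `series[n]` looks up the index label `n`, not the n-th position. If a counter is used as the key and the group's index does not contain that label, a KeyError like the one above would result. The sketch below reproduces this with hypothetical data (the `Key` column and the index values are invented for illustration) and shows positional access with `.iloc` and plain iteration as ways around it.

```python
import pandas as pd

# Hypothetical data with a non-contiguous integer index,
# as can happen after filtering or concatenating files.
df = pd.DataFrame({'Key': [1, 1, 2], 'Gene': ['g1', 'g2', 'g3']},
                  index=[10, 20, 83])
group = df[df['Key'] == 1]          # index labels are 10 and 20

try:
    group['Gene'][83]               # label lookup: 83 is not a label in this group
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Positional access does not depend on the index labels.
first_by_position = group['Gene'].iloc[0]

# Iterating over the values avoids indexing altogether.
values = list(group['Gene'])
```

If this is indeed the cause, calling `reset_index(drop=True)` on the data frame before grouping would not help on its own, because labels inside a group still need not start at zero; iterating over the values or using `.iloc` is the safer choice.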