1

I have two data sets, as follows:

A      B
IDs    IDs
1      1
2      2
3      5
4      7

How can we apply a join in Pandas or NumPy that gives me all the data from B that is not present in A? Something like the following:

B
Ids
5
7

I know it can be done with a for loop, but I don't want that, since my real data runs into the millions of rows. I am really not sure how to use Pandas or NumPy here; I have something like the following in mind:

pd.merge(A, B, on='ids', how='right') 

Thanks

2
  • Instead of right, try specifying outer. Commented Jun 7, 2016 at 13:04
  • what is the expected output? The column names seem to be A and B and not IDs ... this is misleading. Commented Jun 7, 2016 at 13:08

4 Answers

3

You can use NumPy's setdiff1d, like so -

np.setdiff1d(B['IDs'],A['IDs']) 

Also, np.in1d could be used for the same effect, like so -

B[~np.in1d(B['IDs'],A['IDs'])] 

Please note that np.setdiff1d would give us a sorted NumPy array as output.

Sample run -

>>> A = pd.DataFrame([1,2,3,4],columns=['IDs'])
>>> B = pd.DataFrame([1,7,5,2],columns=['IDs'])
>>> np.setdiff1d(B['IDs'],A['IDs'])
array([5, 7])
>>> B[~np.in1d(B['IDs'],A['IDs'])]
   IDs
1    7
2    5
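As a self-contained variant of the sample run, here is a minimal sketch that puts the two results side by side. It assumes a NumPy version that provides np.isin, which performs the same membership test as np.in1d; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

A = pd.DataFrame({"IDs": [1, 2, 3, 4]})
B = pd.DataFrame({"IDs": [1, 7, 5, 2]})

# True where a value of B also occurs in A (same test as np.in1d)
mask = np.isin(B["IDs"].to_numpy(), A["IDs"].to_numpy())

# Boolean indexing keeps B's original row order and index labels
only_in_b = B[~mask]

# setdiff1d returns a sorted array of unique values instead
sorted_unique = np.setdiff1d(B["IDs"], A["IDs"])

print(only_in_b["IDs"].tolist())  # [7, 5]
print(sorted_unique.tolist())     # [5, 7]
```

The mask approach is the one to pick if the other columns of B, or its row order, need to survive.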

3 Comments

Thank you so much! But despite several attempts, I am receiving the error "List indices must be integers not lists".
@manusharma So, do you have anything else apart from integers in that column of IDs, like strings maybe, or integers stored as strings?
I have two large lists/DataFrames; some of the values are long, some are integers. I tried map(int, dataset) to convert them all to one type, but I still get the same error: "List indices must be integers not lists".
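The thread never confirms the cause of that error. One possible explanation, and it is only a guess from the message, is that the data was held in plain Python lists, which cannot be indexed with a column name or a boolean mask. A minimal sketch under that assumption (list_a and list_b are hypothetical stand-ins for the real data) wraps the lists in pandas objects first:

```python
import pandas as pd

# Hypothetical stand-ins for the two large plain Python lists
list_a = [1, 2, 3, 4]
list_b = [1, 2, 5, 7]

# Wrap the lists so that boolean-mask indexing is available
a = pd.Series(list_a, name="IDs")
b = pd.Series(list_b, name="IDs")

# Keep the values of b that do not occur in a
only_in_b = b[~b.isin(a)]

print(only_in_b.tolist())  # [5, 7]
```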
2

You can use merge with the indicator parameter and then boolean indexing. Finally, you can drop the _merge column:

A = pd.DataFrame({'IDs':[1,2,3,4], 'B':[4,5,6,7], 'C':[1,8,9,4]})
print (A)
   B  C  IDs
0  4  1    1
1  5  8    2
2  6  9    3
3  7  4    4

B = pd.DataFrame({'IDs':[1,2,5,7], 'A':[1,8,3,7], 'D':[1,8,9,4]})
print (B)
   A  D  IDs
0  1  1    1
1  8  8    2
2  3  9    5
3  7  4    7

df = (pd.merge(A, B, on='IDs', how='outer', indicator=True))
df = df[df._merge == 'right_only']
df = df.drop('_merge', axis=1)
print (df)
    B   C  IDs    A    D
4 NaN NaN  5.0  3.0  9.0
5 NaN NaN  7.0  7.0  4.0
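The outer merge above carries A's columns along as NaN and turns the integer columns into floats. If only B's rows are wanted, one possible variation is to merge against A's key column alone. The sketch below is an assumption-laden illustration, not part of the original answer: rows_only_in_right is a hypothetical helper name, and it relies on the same indicator parameter.

```python
import pandas as pd

def rows_only_in_right(left, right, key):
    """Hypothetical helper: rows of `right` whose `key` is absent from `left`."""
    # Use only the key column of `left`, de-duplicated, so the merge
    # neither adds extra columns nor multiplies rows
    left_keys = left[[key]].drop_duplicates()
    merged = pd.merge(left_keys, right, on=key, how="right", indicator=True)
    # "right_only" marks keys found in `right` but not in `left`
    return merged[merged["_merge"] == "right_only"].drop(columns="_merge")

A = pd.DataFrame({"IDs": [1, 2, 3, 4], "B": [4, 5, 6, 7], "C": [1, 8, 9, 4]})
B = pd.DataFrame({"IDs": [1, 2, 5, 7], "A": [1, 8, 3, 7], "D": [1, 8, 9, 4]})

result = rows_only_in_right(A, B, "IDs")
print(result["IDs"].tolist())  # [5, 7]
```

Because nothing from A other than the key enters the merge, the result holds just IDs, A and D, with no NaN padding.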


1

You could convert the data series to sets and take the difference:

import pandas as pd

df=pd.DataFrame({'A' : [1,2,3,4], 'B' : [1,2,5,7]})
A=set(df['A'])
B=set(df['B'])
C=pd.DataFrame({'C' : list(B-A)})  # Take difference and convert back to DataFrame

The variable "C" then yields

   C
0  5
1  7
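One caveat with sets is that they are unordered and drop duplicates, so the row order of the result is not guaranteed in general. A minimal sketch that sorts the difference to get a deterministic order (the sorting step is an addition for illustration, not part of the answer above):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [1, 2, 5, 7]})

# Values of column B that never occur in column A
diff = set(df["B"]) - set(df["A"])

# Sort before building the DataFrame so the row order is reproducible
C = pd.DataFrame({"C": sorted(diff)})

print(C["C"].tolist())  # [5, 7]
```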


1

You can simply use pandas' .isin() method:

df = pd.DataFrame({'A' : [1,2,3,4], 'B' : [1,2,5,7]})
df[~df['B'].isin(df['A'])]

If these are separate DataFrames:

a = pd.DataFrame({'IDs' : [1,2,3,4]})
b = pd.DataFrame({'IDs' : [1,2,5,7]})
b[~b['IDs'].isin(a['IDs'])]

Output:

   IDs
2    5
3    7

