I have two string columns in my Pandas dataset
name1 name2 John Doe John Doe AleX T Franz K and I need to check whether name1 equals name2. The naive way I use now is using a simple mask
mask=df.name1==df.name2
But the problem is that there may be mislabeled strings (in a way that is not predictable - the data is too big) that prevent an exact matching to occur.
For instance "John Doe" and "John Doe " would not match. Of course, I trimmed, lower-cased my strings but other possibilities remain.
One idea would be to look whether name1 is contained in name2. But it seems I cannot use str.contains with another variable as argument. Any other ideas?
Many thanks!
EDIT: using isin gives non-sensical results. Example
test = pd.DataFrame({'A': ["john doe", " john doe", 'John'], 'B': [' john doe', 'eddie murphy', 'batman']}) test Out[6]: A B 0 john doe john doe 1 john doe eddie murphy 2 John batman test['A'].isin(test['B']) Out[7]: 0 False 1 True 2 False Name: A, dtype: bool 
df['name1'].str.lower() == df['name2'].str.lower()?stripthe spaces andlowerthe case, see my answerinthen you'd have to do this usingapplylikedf.apply(lambda x: x['A'].strip().lower() in x['B'].strip().lower(), axis=1)