15

I have two string columns in my Pandas dataset

name1 name2 John Doe John Doe AleX T Franz K 

and I need to check whether name1 equals name2. The naive way I use now is using a simple mask

mask=df.name1==df.name2

But the problem is that there may be mislabeled strings (in a way that is not predictable - the data is too big) that prevent an exact matching to occur.

For instance "John Doe" and "John Doe " would not match. Of course, I trimmed, lower-cased my strings but other possibilities remain.

One idea would be to look whether name1 is contained in name2. But it seems I cannot use str.contains with another variable as argument. Any other ideas?

Many thanks!

EDIT: using isin gives non-sensical results. Example

test = pd.DataFrame({'A': ["john doe", " john doe", 'John'], 'B': [' john doe', 'eddie murphy', 'batman']}) test Out[6]: A B 0 john doe john doe 1 john doe eddie murphy 2 John batman test['A'].isin(test['B']) Out[7]: 0 False 1 True 2 False Name: A, dtype: bool 
6
  • Hi @edchum this is not a duplicate of the question you mention. Commented Mar 11, 2016 at 13:34
  • 1
    So are you after something like df['name1'].str.lower() == df['name2'].str.lower()? Commented Mar 11, 2016 at 13:35
  • not really, because I believe perfect equality generate too many false negatives in the data. checking whether name1 is contained somewhere in name2 seems more appropriate Commented Mar 11, 2016 at 13:38
  • 1
    I think in this case you should strip the spaces and lower the case, see my answer Commented Mar 11, 2016 at 13:38
  • 1
    If you proposed using something like in then you'd have to do this using apply like df.apply(lambda x: x['A'].strip().lower() in x['B'].strip().lower(), axis=1) Commented Mar 11, 2016 at 13:40

5 Answers 5

8

I think you can use str.lower and str.replace with arbitrary whitespace s/+:

test = pd.DataFrame({'A': ["john doe", " john doe", 'John'], 'B': [' john doe', 'eddie murphy', 'batman']}) print test['A'].str.lower().str.replace('s/+',"") == test['B'].str.strip().str.replace('s/+',"") 0 True 1 False 2 False dtype: bool 
Sign up to request clarification or add additional context in comments.

3 Comments

I dont understand why this gives me non-sensical results. please see my updated question
do you have any ideas whats going on with isin here?
Sorry, I was offline.
5

You can use difflib to compute distance

import difflib as dfl dfl.SequenceMatcher(None,'John Doe', 'John doe').ratio() 

edit : integration with Pandas :

import pandas as pd import difflib as dfl df = pd.DataFrame({'A': ["john doe", " john doe", 'John'], 'B': [' john doe', 'eddie murphy', 'batman']}) df['VAR1'] = df.apply(lambda x : dfl.SequenceMatcher(None, x['A'], x['B']).ratio(),axis=1) 

1 Comment

does that integrate with pandas dataframes?
4

strip the spaces and lower the case:

In [414]: test['A'].str.strip().str.lower() == test['B'].str.strip().str.lower() Out[414]: 0 True 1 False 2 False dtype: bool 

Comments

1

What you want is a string distance based on editing effort distance(s1, s2), which is what we call edit distance of strings. Once you define that function in your namespace you can do:

df['distance_s'] = df.apply(lambda r: distance(r['name1'], r['name2'])) filtered = df[df['distance_s'] < eps] # you define eps 

From a Google search, the following came up:

https://pypi.python.org/pypi/editdistance

It is a dynamic programming problem, so you can challenge yourself by writing your own too. It may not be as efficient though.

Comments

0

Having stumbled upon a similar problem, I discovered that difflib.Sequencematcher ratio is a good measure of similarity between two texts. However, my goal was to carefully inspect the difference between two texts, and there were no convenient solution for it. So I've created a library. Here's how you could use it:

from IPython import display import pandas as pd from pandas_text_comparer import TextComparer # A toy dataset. Replace with your data df = pd.read_csv("https://github.com/n-splv/pandas-text-comparer/raw/main/data/demo/review-responses.csv.gz") comparer = TextComparer(df, column_a="llm_response", column_b="human_response") comparer.run() html = comparer.get_html() display.HTML(html) 

comparison result

There's more to it, e.g. you can filter the rows, sort them by ratio, add other columns to the view etc. Feel free to read more in the repo. Hope it helps someone!

2 Comments

Please elaborate how this would specifically answer OP's question. Similarity screening is not the request. They want exact match more or less a number of minor -but admittedly ill-defined- differences such as trailing characters. Have you tried with their provided input instead of that repo's own demo dataset?
Hey! With the help of this library, OP will be able to quickly spot the ways his name1 and name2 may differ and based on that write a preprocessing function, which should then be applied to both texts.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.