how to compare two string variables in pandas?

Question

I have two string columns in my Pandas dataset

name1 name2 John Doe John Doe AleX T Franz K

and I need to check whether name1 equals name2. The naive way I use now is using a simple mask

mask=df.name1==df.name2

But the problem is that there may be mislabeled strings (in a way that is not predictable - the data is too big) that prevent an exact matching to occur.

For instance "John Doe" and "John Doe " would not match. Of course, I trimmed, lower-cased my strings but other possibilities remain.

One idea would be to look whether name1 is contained in name2. But it seems I cannot use str.contains with another variable as argument. Any other ideas?

Many thanks!

EDIT: using isin gives non-sensical results. Example

test = pd.DataFrame({'A': ["john doe", " john doe", 'John'], 'B': [' john doe', 'eddie murphy', 'batman']}) test Out[6]: A B 0 john doe john doe 1 john doe eddie murphy 2 John batman test['A'].isin(test['B']) Out[7]: 0 False 1 True 2 False Name: A, dtype: bool

Hi @edchum this is not a duplicate of the question you mention. — ℕʘʘḆḽḘ
– ℕʘʘḆḽḘ, Commented Mar 11, 2016 at 13:34
So are you after something like df['name1'].str.lower() == df['name2'].str.lower()? — EdChum
– EdChum, Commented Mar 11, 2016 at 13:35
not really, because I believe perfect equality generate too many false negatives in the data. checking whether name1 is contained somewhere in name2 seems more appropriate — ℕʘʘḆḽḘ
– ℕʘʘḆḽḘ, Commented Mar 11, 2016 at 13:38
I think in this case you should strip the spaces and lower the case, see my answer — EdChum
– EdChum, Commented Mar 11, 2016 at 13:38
If you proposed using something like in then you'd have to do this using apply like df.apply(lambda x: x['A'].strip().lower() in x['B'].strip().lower(), axis=1) — EdChum
– EdChum, Commented Mar 11, 2016 at 13:40

jezrael · Accepted Answer · 2016-03-11 13:47:28Z

8

I think you can use str.lower and str.replace with arbitrary whitespace s/+:

test = pd.DataFrame({'A': ["john doe", " john doe", 'John'], 'B': [' john doe', 'eddie murphy', 'batman']}) print test['A'].str.lower().str.replace('s/+',"") == test['B'].str.strip().str.replace('s/+',"") 0 True 1 False 2 False dtype: bool

edited Mar 11, 2016 at 13:47

answered Mar 11, 2016 at 13:02

jezrael

868k103 gold badges1.4k silver badges1.3k bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

ℕʘʘḆḽḘ Over a year ago

I dont understand why this gives me non-sensical results. please see my updated question

ℕʘʘḆḽḘ Over a year ago

do you have any ideas whats going on with isin here?

jezrael Over a year ago

Sorry, I was offline.

steboc · Accepted Answer · 2016-03-11 14:14:17Z

You can use difflib to compute distance

import difflib as dfl dfl.SequenceMatcher(None,'John Doe', 'John doe').ratio()

edit : integration with Pandas :

import pandas as pd import difflib as dfl df = pd.DataFrame({'A': ["john doe", " john doe", 'John'], 'B': [' john doe', 'eddie murphy', 'batman']}) df['VAR1'] = df.apply(lambda x : dfl.SequenceMatcher(None, x['A'], x['B']).ratio(),axis=1)

EdChum · Accepted Answer · 2016-03-11 13:37:17Z

strip the spaces and lower the case:

In [414]: test['A'].str.strip().str.lower() == test['B'].str.strip().str.lower() Out[414]: 0 True 1 False 2 False dtype: bool

Patrick the Cat · Accepted Answer · 2016-03-11 13:04:42Z

What you want is a string distance based on editing effort distance(s1, s2), which is what we call edit distance of strings. Once you define that function in your namespace you can do:

df['distance_s'] = df.apply(lambda r: distance(r['name1'], r['name2'])) filtered = df[df['distance_s'] < eps] # you define eps

From a Google search, the following came up:

https://pypi.python.org/pypi/editdistance

It is a dynamic programming problem, so you can challenge yourself by writing your own too. It may not be as efficient though.

Nick S · Accepted Answer · 2024-04-22 15:43:11Z

Having stumbled upon a similar problem, I discovered that difflib.Sequencematcher ratio is a good measure of similarity between two texts. However, my goal was to carefully inspect the difference between two texts, and there were no convenient solution for it. So I've created a library. Here's how you could use it:

from IPython import display import pandas as pd from pandas_text_comparer import TextComparer # A toy dataset. Replace with your data df = pd.read_csv("https://github.com/n-splv/pandas-text-comparer/raw/main/data/demo/review-responses.csv.gz") comparer = TextComparer(df, column_a="llm_response", column_b="human_response") comparer.run() html = comparer.get_html() display.HTML(html)

There's more to it, e.g. you can filter the rows, sort them by ratio, add other columns to the view etc. Feel free to read more in the repo. Hope it helps someone!

Please elaborate how this would specifically answer OP's question. Similarity screening is not the request. They want exact match more or less a number of minor -but admittedly ill-defined- differences such as trailing characters. Have you tried with their provided input instead of that repo's own demo dataset?
Hey! With the help of this library, OP will be able to quickly spot the ways his name1 and name2 may differ and based on that write a preprocessing function, which should then be applied to both texts.

Collectives™ on Stack Overflow

how to compare two string variables in pandas?

5 Answers 5

3 Comments

1 Comment

Comments

Comments

2 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

5 Answers 5

3 Comments

1 Comment

Comments

Comments

2 Comments

Linked

Related