7

I have a pandas DataFrame and want to select the rows where the value in one column starts with the value in another column. I have tried the following:

    import pandas as pd

    df = pd.DataFrame({'A': ['apple', 'xyz', 'aa'], 'B': ['app', 'b', 'aa']})
    df_subset = df[df['A'].str.startswith(df['B'])]

But it raises the error below, and the solutions I have found have not helped either.

KeyError: "None of [Float64Index([nan, nan, nan], dtype='float64')] are in the [columns]" 

np.where(df['A'].str.startswith(df['B']), True, False), which I found elsewhere, also returns True for every row.

3 Answers

8

For a row-wise comparison, we can use DataFrame.apply with axis=1:

    m = df.apply(lambda x: x['A'].startswith(x['B']), axis=1)
    df[m]

           A    B
    0  apple  app
    2     aa   aa

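Your code does not work because Series.str.startswith expects a single pattern (a string scalar) that is tested against every element, and you are passing a pandas Series. It does not compare the two columns element by element. Judging by the Float64Index([nan, nan, nan]) in your KeyError, the call returned NaN for every row, so df[...] interpreted those values as column labels instead of a boolean mask. np.where reports True everywhere for the same reason: NaN is truthy. Quoting the docs: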

pat : str
Character sequence. Regular expressions are not accepted.
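The "True for all" result from np.where can be reproduced without depending on a particular pandas version. This is a minimal sketch: the array named failed is a hypothetical stand-in for what the question's call appears to have returned (three NaN values, going by the KeyError message), not the output of the call itself.

```python
import numpy as np

# Hypothetical stand-in for the result of the failed call in the question:
# one NaN per row, as suggested by Float64Index([nan, nan, nan]).
failed = np.array([np.nan, np.nan, np.nan])

# NaN is non-zero, so it counts as True when used as a condition.
nan_is_truthy = bool(float('nan'))

# np.where therefore picks the "True" branch for every element.
all_true = np.where(failed, True, False)
```

So the all-True output says nothing about the strings themselves; it only reflects that the condition was never a real boolean mask.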


2 Comments

Brilliant! I did try apply with a lambda too but failed to get it to work; I was missing the axis=1.
Yes, that can be confusing at first. The idea is that you want to apply your function to each row (so along the column axis) rather than to each column (which is along the index axis). In this case, axis='columns' would also work.
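To make the axis point concrete, here is a small sketch using the question's DataFrame. The variable names (per_row, per_col, m_int, m_str) are only illustrative. Each call to the function receives a Series, and its .name shows whether that Series is a row or a column.

```python
import pandas as pd

df = pd.DataFrame({'A': ['apple', 'xyz', 'aa'], 'B': ['app', 'b', 'aa']})

# axis=1: the function is called once per row; s.name is the row label.
per_row = df.apply(lambda s: s.name, axis=1)

# axis=0 (the default): the function is called once per column;
# s.name is the column label.
per_col = df.apply(lambda s: s.name, axis=0)

# axis=1 and axis='columns' are two spellings of the same thing.
m_int = df.apply(lambda r: r['A'].startswith(r['B']), axis=1)
m_str = df.apply(lambda r: r['A'].startswith(r['B']), axis='columns')
```

With the default axis=0, the lambda would receive a whole column, so x['A'] would look for a row labelled 'A' and fail.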
3

You may need a loop (here, a list comprehension), since str.startswith does not support a row-wise check against another column:

    [x.startswith(y) for x, y in zip(df.A, df.B)]
    Out[380]: [True, False, True]

    df_sub = df[[x.startswith(y) for x, y in zip(df.A, df.B)]].copy()
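The list comprehension assumes every value in both columns is a string. If either column can contain missing values, x.startswith(y) would raise on them. The sketch below guards against that, under the assumption (not stated in the question) that a missing value on either side should count as "no match". The sample data with None entries is made up for illustration.

```python
import pandas as pd

# Illustrative data with missing values; not from the original question.
df = pd.DataFrame({'A': ['apple', None, 'aa'], 'B': ['app', 'b', None]})

# Only call startswith when both sides are real strings;
# anything else (None, NaN) is treated as a non-match.
mask = [
    isinstance(x, str) and isinstance(y, str) and x.startswith(y)
    for x, y in zip(df['A'], df['B'])
]

df_sub = df[mask].copy()
```

Here only the first row is kept, because the other two rows each have a missing value in one column.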


1

You can achieve this without writing an explicit for loop, using np.frompyfunc:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({'A': ['apple', 'xyz', 'aa'], 'B': ['app', 'b', 'aa']})

    ufunc = np.frompyfunc(str.startswith, 2, 1)
    idx = ufunc(df['A'], df['B'])
    df[idx]

    Out[22]:
           A    B
    0  apple  app
    2     aa   aa
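One detail worth knowing about this approach: np.frompyfunc always produces object-dtype output, and it still calls the Python function once per element, so it is a convenience rather than a speed-up. A possible variant, sketched below, passes plain NumPy arrays and casts the result to a real boolean array before indexing. The names starts_with and mask are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['apple', 'xyz', 'aa'], 'B': ['app', 'b', 'aa']})

# Wrap str.startswith as a ufunc taking 2 inputs and returning 1 output.
starts_with = np.frompyfunc(str.startswith, 2, 1)

# frompyfunc returns an object-dtype array of Python bools.
raw = starts_with(df['A'].to_numpy(), df['B'].to_numpy())

# Cast to a proper boolean array so it is unambiguously a mask.
mask = raw.astype(bool)

df_subset = df[mask]
```

The explicit cast is optional for this data, but it keeps the mask's dtype predictable if the result is stored or combined with other conditions using & and |.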

