String concatenation of two pandas columns

Question

I have a following DataFrame:

from pandas import * df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3]})

It looks like this:

 bar foo 0 1 a 1 2 b 2 3 c

Now I want to have something like:

 bar 0 1 is a 1 2 is b 2 3 is c

How can I achieve this? I tried the following:

df['foo'] = '%s is %s' % (df['bar'], df['foo'])

but it gives me a wrong result:

>>>print df.ix[0] bar a foo 0 a 1 b 2 c Name: bar is 0 1 1 2 2 Name: 0

Sorry for a dumb question, but this one pandas: combine two columns in a DataFrame wasn't helpful for me.

Henry Ecker · Accepted Answer · 2022-02-15 04:13:52Z

178

df['bar'] = df.bar.map(str) + " is " + df.foo

edited Feb 15, 2022 at 4:13

Henry Ecker♦

35.9k19 gold badges48 silver badges67 bronze badges

answered Aug 8, 2012 at 6:03

BrenBarn

253k39 gold badges421 silver badges392 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

cs95 · Accepted Answer · 2019-06-13 20:00:24Z

This question has already been answered, but I believe it would be good to throw some useful methods not previously discussed into the mix, and compare all methods proposed thus far in terms of performance.

Here are some useful solutions to this problem, in increasing order of performance.

`DataFrame.agg`

This is a simple str.format-based approach.

df['baz'] = df.agg('{0[bar]} is {0[foo]}'.format, axis=1) df foo bar baz 0 a 1 1 is a 1 b 2 2 is b 2 c 3 3 is c

You can also use f-string formatting here:

df['baz'] = df.agg(lambda x: f"{x['bar']} is {x['foo']}", axis=1) df foo bar baz 0 a 1 1 is a 1 b 2 2 is b 2 c 3 3 is c

`char.array`-based Concatenation

Convert the columns to concatenate as chararrays, then add them together.

a = np.char.array(df['bar'].values) b = np.char.array(df['foo'].values) df['baz'] = (a + b' is ' + b).astype(str) df foo bar baz 0 a 1 1 is a 1 b 2 2 is b 2 c 3 3 is c

List Comprehension with `zip`

I cannot overstate how underrated list comprehensions are in pandas.

df['baz'] = [str(x) + ' is ' + y for x, y in zip(df['bar'], df['foo'])]

Alternatively, using str.join to concat (will also scale better):

df['baz'] = [ ' '.join([str(x), 'is', y]) for x, y in zip(df['bar'], df['foo'])]

df foo bar baz 0 a 1 1 is a 1 b 2 2 is b 2 c 3 3 is c

List comprehensions excel in string manipulation, because string operations are inherently hard to vectorize, and most pandas "vectorised" functions are basically wrappers around loops. I have written extensively about this topic in For loops with pandas - When should I care?. In general, if you don't have to worry about index alignment, use a list comprehension when dealing with string and regex operations.

The list comp above by default does not handle NaNs. However, you could always write a function wrapping a try-except if you needed to handle it.

def try_concat(x, y): try: return str(x) + ' is ' + y except (ValueError, TypeError): return np.nan df['baz'] = [try_concat(x, y) for x, y in zip(df['bar'], df['foo'])]

`perfplot` Performance Measurements

Graph generated using perfplot. Here's the complete code listing.

Functions

def brenbarn(df): return df.assign(baz=df.bar.map(str) + " is " + df.foo) def danielvelkov(df): return df.assign(baz=df.apply( lambda x:'%s is %s' % (x['bar'],x['foo']),axis=1)) def chrimuelle(df): return df.assign( baz=df['bar'].astype(str).str.cat(df['foo'].values, sep=' is ')) def vladimiryashin(df): return df.assign(baz=df.astype(str).apply(lambda x: ' is '.join(x), axis=1)) def erickfis(df): return df.assign( baz=df.apply(lambda x: f"{x['bar']} is {x['foo']}", axis=1)) def cs1_format(df): return df.assign(baz=df.agg('{0[bar]} is {0[foo]}'.format, axis=1)) def cs1_fstrings(df): return df.assign(baz=df.agg(lambda x: f"{x['bar']} is {x['foo']}", axis=1)) def cs2(df): a = np.char.array(df['bar'].values) b = np.char.array(df['foo'].values) return df.assign(baz=(a + b' is ' + b).astype(str)) def cs3(df): return df.assign( baz=[str(x) + ' is ' + y for x, y in zip(df['bar'], df['foo'])])

That's all I always wanted to know about string concatenation in pandas, but was too afraid too ask!
Can you please update the plot to next level 104 (or even higher), a quick visual answer with the current plot limited to 103 (1000 which is very small for today condition) is that cs3 is the best, eventually when you see brenbarn is looking less exponential than cs3, so most probably for large dataset brenbarn is the best (faster) answer.
@VelizarVESSELINOV Updated! What surprises me is that the numpy concatenation is slower than both the list comp and the pandas concatenation.
Have you considered using df['bar'].tolist() and df['foo'].tolist() in cs3()? My guess is that it would increase "base" time slightly but it would scale better.
Please include the fastest way: df.bar.astype(str) + ' is ' + df.foo

Daniel · Accepted Answer · 2014-05-16 22:58:47Z

The problem in your code is that you want to apply the operation on every row. The way you've written it though takes the whole 'bar' and 'foo' columns, converts them to strings and gives you back one big string. You can write it like:

df.apply(lambda x:'%s is %s' % (x['bar'],x['foo']),axis=1)

It's longer than the other answer but is more generic (can be used with values that are not strings).

chriad · Accepted Answer · 2014-03-28 17:56:36Z

16

You could also use

df['bar'] = df['bar'].str.cat(df['foo'].values.astype(str), sep=' is ')

answered Mar 28, 2014 at 17:56

chriad

1,41217 silver badges22 bronze badges

1 Comment

cbrnr Over a year ago

This doesn't work since df['bar'] is not a string column. The correct assignment is df['bar'] = df['bar'].astype(str).str.cat(df['foo'], sep=' is ').

vdi · Accepted Answer · 2017-04-29 10:56:05Z

13

df.astype(str).apply(lambda x: ' is '.join(x), axis=1) 0 1 is a 1 2 is b 2 3 is c dtype: object

answered Apr 29, 2017 at 10:56

vdi

79310 silver badges20 bronze badges

1 Comment

johnDanger Over a year ago

This answer also works with undetermined number of columns (> 1) & undetermined column names, making it more useful than the rest.

Maurício Collaça · Accepted Answer · 2022-10-22 03:55:43Z

It's been 10 years and no one proposed the most simple and intuitive way which is 50% faster than all examples proposed on these 10 years.

df.bar.astype(str) + ' is ' + df.foo

It's clearly now the right answer while using astype(str).

johnDanger · Accepted Answer · 2021-01-22 23:42:12Z

series.str.cat is the most flexible way to approach this problem:

For df = pd.DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3]})

df.foo.str.cat(df.bar.astype(str), sep=' is ') >>> 0 a is 1 1 b is 2 2 c is 3 Name: foo, dtype: object

OR

df.bar.astype(str).str.cat(df.foo, sep=' is ') >>> 0 1 is a 1 2 is b 2 3 is c Name: bar, dtype: object

Unlike .join() (which is for joining list contained in a single Series), this method is for joining 2 Series together. It also allows you to ignore or replace NaN values as desired.

Can you demonstrate ignoring/replacing NaN values with str.cat?

erickfis · Accepted Answer · 2019-06-21 20:28:45Z

@DanielVelkov answer is the proper one BUT using string literals is faster:

# Daniel's %timeit df.apply(lambda x:'%s is %s' % (x['bar'],x['foo']),axis=1) ## 963 µs ± 157 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) # String literals - python 3 %timeit df.apply(lambda x: f"{x['bar']} is {x['foo']}", axis=1) ## 849 µs ± 4.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

1'' · Accepted Answer · 2021-06-19 04:56:04Z

I think the most concise solution for arbitrary numbers of columns is a short-form version of this answer:

df.astype(str).apply(' is '.join, axis=1)

You can shave off two more characters with df.agg(), but it's slower:

df.astype(str).agg(' is '.join, axis=1)

robin girard · Accepted Answer · 2021-02-03 15:37:52Z

I have encountered a specific case from my side with 10^11 rows in my dataframe, and in this case none of the proposed solution is appropriate. I have used categories, and this should work fine in all cases when the number of unique string is not too large. This is easily done in the R software with XxY with factors but I could not find any other way to do it in python (I'm new to python). If anyone knows a place where this is implemented I'd be glad to know.

def Create_Interaction_var(df,Varnames): ''' :df data frame :list of 2 column names, say "X" and "Y". The two columns should be strings or categories convert strings columns to categories Add a column with the "interaction of X and Y" : X x Y, with name "Interaction-X_Y" ''' df.loc[:, Varnames[0]] = df.loc[:, Varnames[0]].astype("category") df.loc[:, Varnames[1]] = df.loc[:, Varnames[1]].astype("category") CatVar = "Interaction-" + "-".join(Varnames) Var0Levels = pd.DataFrame(enumerate(df.loc[:,Varnames[0]].cat.categories)).rename(columns={0 : "code0",1 : "name0"}) Var1Levels = pd.DataFrame(enumerate(df.loc[:,Varnames[1]].cat.categories)).rename(columns={0 : "code1",1 : "name1"}) NbLevels=len(Var0Levels) names = pd.DataFrame(list(itertools.product(dict(enumerate(df.loc[:,Varnames[0]].cat.categories)), dict(enumerate(df.loc[:,Varnames[1]].cat.categories)))), columns=['code0', 'code1']).merge(Var0Levels,on="code0").merge(Var1Levels,on="code1") names=names.assign(Interaction=[str(x) + '_' + y for x, y in zip(names["name0"], names["name1"])]) names["code01"]=names["code0"] + NbLevels*names["code1"] df.loc[:,CatVar]=df.loc[:,Varnames[0]].cat.codes+NbLevels*df.loc[:,Varnames[1]].cat.codes df.loc[:, CatVar]= df[[CatVar]].replace(names.set_index("code01")[["Interaction"]].to_dict()['Interaction'])[CatVar] df.loc[:, CatVar] = df.loc[:, CatVar].astype("category") return df

Shunya · Accepted Answer · 2022-06-15 08:56:52Z

0

from pandas import * x = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3]}) x x['bar'] = x.bar.astype("str") + " " + "is" + " " + x.foo x.drop(['foo'], axis=1)

edited Jun 15, 2022 at 8:56

Shunya

2,4894 gold badges20 silver badges36 bronze badges

answered Jun 14, 2022 at 20:51

hackerofjogos

92 bronze badges

1 Comment

Jeremy Caney Over a year ago

Remember that Stack Overflow isn't just intended to solve the immediate problem, but also to help future readers find solutions to similar problems, which requires understanding the underlying code. This is especially important for members of our community who are beginners, and not familiar with the syntax. Given that, can you edit your answer to include an explanation of what you're doing and why you believe it is the best approach?

Collectives™ on Stack Overflow

String concatenation of two pandas columns

11 Answers 11

Comments

`DataFrame.agg`

`char.array`-based Concatenation

List Comprehension with `zip`

`perfplot` Performance Measurements

6 Comments

Comments

1 Comment

1 Comment

1 Comment

1 Comment

Comments

Comments

Comments

1 Comment

Linked

Hot Network Questions

Collectives™ on Stack Overflow

11 Answers 11

Comments

char.array-based Concatenation

List Comprehension with zip

perfplot Performance Measurements

6 Comments

Comments

1 Comment

1 Comment

1 Comment

1 Comment

Comments

Comments

Comments

1 Comment

Linked

Related

`char.array`-based Concatenation

List Comprehension with `zip`

`perfplot` Performance Measurements