How do I create a new column where the values are selected based on an existing column?

Question

How do I add a color column to the following dataframe so that color='green' if Set == 'Z', and color='red' otherwise?

 Type Set 1 A Z 2 B Z 3 B X 4 C Y

Related: Create new column based on values from other columns / apply a function of multiple columns, row-wise in Pandas (same idea, but the selection criteria are based on multiple columns) — wjandrea
– wjandrea, Commented Sep 24, 2023 at 23:11
Since pandas 2.2.0, you can use case_when (a syntax useful in SQL in the context in the OP). See this answer for a demo. — cottontail
– cottontail, Commented Mar 10, 2024 at 12:11
@wjandrea if you insist that the distinction between doing this for a single column and for multiple columns is important, then this question needs to be clear that it's only about the single-column case - so I edited the title accordingly. — Karl Knechtel
– Karl Knechtel, Commented Jun 8, 2024 at 18:26
@KarlKnechtel Reading this back now, the more important thing is that there's only one criterion, which lets you use np.where instead of having to use np.select. I guess I misphrased my comment. — wjandrea
– wjandrea, Commented Jun 8, 2024 at 18:37

wjandrea · Accepted Answer · 2023-07-20 16:00:07Z

If you only have two choices to select from then use np.where:

df['color'] = np.where(df['Set']=='Z', 'green', 'red')

For example,

import pandas as pd import numpy as np df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')}) df['color'] = np.where(df['Set']=='Z', 'green', 'red') print(df)

yields

 Set Type color 0 Z A green 1 Z B green 2 X B red 3 Y C red

If you have more than two conditions then use np.select. For example, if you want color to be

yellow when (df['Set'] == 'Z') & (df['Type'] == 'A')
otherwise blue when (df['Set'] == 'Z') & (df['Type'] == 'B')
otherwise purple when (df['Type'] == 'B')
otherwise black,

then use

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')}) conditions = [ (df['Set'] == 'Z') & (df['Type'] == 'A'), (df['Set'] == 'Z') & (df['Type'] == 'B'), (df['Type'] == 'B')] choices = ['yellow', 'blue', 'purple'] df['color'] = np.select(conditions, choices, default='black') print(df)

which yields

 Set Type color 0 Z A yellow 1 Z B blue 2 X B purple 3 Y C black

stackoverflowuser2010 · Accepted Answer · 2017-08-16 16:49:28Z

191

List comprehension is another way to create another column conditionally. If you are working with object dtypes in columns, like in your example, list comprehensions typically outperform most other methods.

Example list comprehension:

df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]

%timeit tests:

import pandas as pd import numpy as np df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')}) %timeit df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']] %timeit df['color'] = np.where(df['Set']=='Z', 'green', 'red') %timeit df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green') 1000 loops, best of 3: 239 µs per loop 1000 loops, best of 3: 523 µs per loop 1000 loops, best of 3: 263 µs per loop

edited Aug 16, 2017 at 16:49

stackoverflowuser2010

41.5k52 gold badges178 silver badges229 bronze badges

answered Jul 2, 2015 at 0:27

cheekybastard

5,7653 gold badges25 silver badges26 bronze badges

6 Comments

boot-scootin Over a year ago

Note that, with much larger dataframes (think pd.DataFrame({'Type':list('ABBC')*100000, 'Set':list('ZZXY')*100000})-size), numpy.where outpaces map, but the list comprehension is king (about 50% faster than numpy.where).

Matti Over a year ago

Can the list comprehension method be used if the condition needs information from multiple columns? I am looking for something like this (this does not work): df['color'] = ['red' if (x['Set'] == 'Z') & (x['Type'] == 'B') else 'green' for x in df]

cheekybastard Over a year ago

Add iterrows to the dataframe, then you can access multiple columns via row: ['red' if (row['Set'] == 'Z') & (row['Type'] == 'B') else 'green' for index, row in in df.iterrows()]

Paul Rougieux Over a year ago

Note this nice solution will not work if you need to take replacement values from another series in the data frame, such as df['color_type'] = np.where(df['Set']=='Z', 'green', df['Type'])

AMC Over a year ago

@cheekybastard Or don't, since .iterrows() is notoriously sluggish and the DataFrame shouldn't be modified while iterating.

|

acharuva · Accepted Answer · 2014-06-06 04:43:52Z

Another way in which this could be achieved is

df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')

bli · Accepted Answer · 2019-06-21 15:23:50Z

The following is slower than the approaches timed here, but we can compute the extra column based on the contents of more than one column, and more than two values can be computed for the extra column.

Simple example using just the "Set" column:

def set_color(row): if row["Set"] == "Z": return "red" else: return "green" df = df.assign(color=df.apply(set_color, axis=1)) print(df)

 Set Type color 0 Z A red 1 Z B red 2 X B green 3 Y C green

Example with more colours and more columns taken into account:

def set_color(row): if row["Set"] == "Z": return "red" elif row["Type"] == "C": return "blue" else: return "green" df = df.assign(color=df.apply(set_color, axis=1)) print(df)

 Set Type color 0 Z A red 1 Z B red 2 X B green 3 Y C blue

Edit (21/06/2019): Using plydata

It is also possible to use plydata to do this kind of things (this seems even slower than using assign and apply, though).

from plydata import define, if_else

Simple if_else:

df = define(df, color=if_else('Set=="Z"', '"red"', '"green"')) print(df)

 Set Type color 0 Z A red 1 Z B red 2 X B green 3 Y C green

Nested if_else:

df = define(df, color=if_else( 'Set=="Z"', '"red"', if_else('Type=="C"', '"green"', '"blue"'))) print(df)

 Set Type color 0 Z A red 1 Z B red 2 X B blue 3 Y C green

Hossein Kalbasi · Accepted Answer · 2021-01-31 07:17:13Z

You can simply use the powerful .loc method and use one condition or several depending on your need (tested with pandas=1.0.5).

Code Summary:

df=pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split())) df['Color'] = "red" df.loc[(df['Set']=="Z"), 'Color'] = "green" #practice! df.loc[(df['Set']=="Z")&(df['Type']=="B")|(df['Type']=="C"), 'Color'] = "purple"

Explanation:

df=pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split())) # df so far: Type Set 0 A Z 1 B Z 2 B X 3 C Y

add a 'color' column and set all values to "red"

df['Color'] = "red"

Apply your single condition:

df.loc[(df['Set']=="Z"), 'Color'] = "green" # df: Type Set Color 0 A Z green 1 B Z green 2 B X red 3 C Y red

or multiple conditions if you want:

df.loc[(df['Set']=="Z")&(df['Type']=="B")|(df['Type']=="C"), 'Color'] = "purple"

You can read on Pandas logical operators and conditional selection here: Logical operators for boolean indexing in Pandas

boot-scootin · Accepted Answer · 2017-05-16 20:59:07Z

Here's yet another way to skin this cat, using a dictionary to map new values onto the keys in the list:

def map_values(row, values_dict): return values_dict[row] values_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4} df = pd.DataFrame({'INDICATOR': ['A', 'B', 'C', 'D'], 'VALUE': [10, 9, 8, 7]}) df['NEW_VALUE'] = df['INDICATOR'].apply(map_values, args = (values_dict,))

What's it look like:

df Out[2]: INDICATOR VALUE NEW_VALUE 0 A 10 1 1 B 9 2 2 C 8 3 3 D 7 4

This approach can be very powerful when you have many ifelse-type statements to make (i.e. many unique values to replace).

And of course you could always do this:

df['NEW_VALUE'] = df['INDICATOR'].map(values_dict)

But that approach is more than three times as slow as the apply approach from above, on my machine.

And you could also do this, using dict.get:

df['NEW_VALUE'] = [values_dict.get(v, None) for v in df['INDICATOR']]

Update: On 100,000,000 rows, 52 string values, .apply() takes 47 seconds, versus only 5.91 seconds for .map().

Mykola Zotko · Accepted Answer · 2021-12-28 14:25:49Z

You can use pandas methods where and mask:

df['color'] = 'green' df['color'] = df['color'].where(df['Set']=='Z', other='red') # Replace values where the condition is False

or

df['color'] = 'red' df['color'] = df['color'].mask(df['Set']=='Z', other='green') # Replace values where the condition is True

Alternatively, you can use the method transform with a lambda function:

df['color'] = df['Set'].transform(lambda x: 'green' if x == 'Z' else 'red')

Output:

 Type Set color 1 A Z green 2 B Z green 3 B X red 4 C Y red

Performance comparison from @chai:

import pandas as pd import numpy as np df = pd.DataFrame({'Type':list('ABBC')*1000000, 'Set':list('ZZXY')*1000000}) %timeit df['color1'] = 'red'; df['color1'].where(df['Set']=='Z','green') %timeit df['color2'] = ['red' if x == 'Z' else 'green' for x in df['Set']] %timeit df['color3'] = np.where(df['Set']=='Z', 'red', 'green') %timeit df['color4'] = df.Set.map(lambda x: 'red' if x == 'Z' else 'green') 397 ms ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 976 ms ± 241 ms per loop 673 ms ± 139 ms per loop 796 ms ± 182 ms per loop

xiaotong xu · Accepted Answer · 2021-07-02 08:17:27Z

if you have only 2 choices, use np.where()

df = pd.DataFrame({'A':range(3)}) df['B'] = np.where(df.A>2, 'yes', 'no')

if you have over 2 choices, maybe apply() could work input

arr = pd.DataFrame({'A':list('abc'), 'B':range(3), 'C':range(3,6), 'D':range(6, 9)})

and arr is

 A B C D 0 a 0 3 6 1 b 1 4 7 2 c 2 5 8

if you want the column E tobe if arr.A =='a' then arr.B elif arr.A=='b' then arr.C elif arr.A == 'c' then arr.D else something_else

arr['E'] = arr.apply(lambda x: x['B'] if x['A']=='a' else(x['C'] if x['A']=='b' else(x['D'] if x['A']=='c' else 1234)), axis=1)

and finally the arr is

 A B C D E 0 a 0 3 6 0 1 b 1 4 7 4 2 c 2 5 8 8

Jaroslav Bezděk · Accepted Answer · 2019-10-10 14:30:03Z

One liner with .apply() method is following:

df['color'] = df['Set'].apply(lambda set_: 'green' if set_=='Z' else 'red')

After that, df data frame looks like this:

>>> print(df) Type Set color 0 A Z green 1 B Z green 2 B X red 3 C Y red

sammywemmy · Accepted Answer · 2021-10-18 09:11:47Z

The case_when function from pyjanitor is a wrapper around pd.Series.mask and offers a chainable/convenient form for multiple conditions:

For a single condition:

df.case_when( df.col1 == "Z", # condition "green", # value if True "red", # value if False column_name = "color" ) Type Set color 1 A Z green 2 B Z green 3 B X red 4 C Y red

For multiple conditions:

df.case_when( df.Set.eq('Z') & df.Type.eq('A'), 'yellow', # condition, result df.Set.eq('Z') & df.Type.eq('B'), 'blue', # condition, result df.Type.eq('B'), 'purple', # condition, result 'black', # default if none of the conditions evaluate to True column_name = 'color' ) Type Set color 1 A Z yellow 2 B Z blue 3 B X purple 4 C Y black

More examples can be found here

Gabe · Accepted Answer · 2024-01-26 18:30:39Z

This is can be done easily using case when if you have Pandas v2.2.0 (Jan 2024)

example from the release notes:

 import pandas as pd df = pd.DataFrame(dict(a=[1, 2, 3], b=[4, 5, 6])) default=pd.Series('default', index=df.index) default.case_when( caselist=[ (df.a == 1, 'first'), # condition, replacement (df.a.gt(1) & df.b.eq(5), 'second'), # condition, replacement ], ) Out[4]: 0 first 1 second 2 default dtype: object

drGabriel · Accepted Answer · 2023-01-25 19:46:47Z

Here is an easy one-liner you can use when you have one or several conditions:

df['color'] = np.select(condlist=[df['Set']=="Z", df['Set']=="Y"], choicelist=["green", "yellow"], default="red")

Easy and good to go!

See more here: https://numpy.org/doc/stable/reference/generated/numpy.select.html

Yaakov Bressler · Accepted Answer · 2019-12-08 18:42:01Z

If you're working with massive data, a memoized approach would be best:

# First create a dictionary of manually stored values color_dict = {'Z':'red'} # Second, build a dictionary of "other" values color_dict_other = {x:'green' for x in df['Set'].unique() if x not in color_dict.keys()} # Next, merge the two color_dict.update(color_dict_other) # Finally, map it to your column df['color'] = df['Set'].map(color_dict)

This approach will be fastest when you have many repeated values. My general rule of thumb is to memoize when: data_size > 10**4 & n_distinct < data_size/4

E.x. Memoize in a case 10,000 rows with 2,500 or fewer distinct values.

Alright, so with only 2 distinct values to map, 100,000,000 rows, it takes 6.67 seconds to run without "memoization", and 9.86 seconds with.
100,000,000 rows, 52 distinct values, where 1 of those maps to the first output value, and the other 51 all correspond to the other: 7.99 seconds without memoization, 11.1 seconds with.
Are your values in random order? Or are they back to back? High speed of pandas could be due to caching @AMC
Are your values in random order? Or are they back to back? Values are random, selected using random.choices().

rubengavidia0x · Accepted Answer · 2022-03-04 23:16:57Z

A Less verbose approach using np.select:

a = np.array([['A','Z'],['B','Z'],['B','X'],['C','Y']]) df = pd.DataFrame(a,columns=['Type','Set']) conditions = [ df['Set'] == 'Z' ] outputs = [ 'Green' ] # conditions Z is Green, Red Otherwise. res = np.select(conditions, outputs, 'Red') res array(['Green', 'Green', 'Red', 'Red'], dtype='<U5') df.insert(2, 'new_column',res) df Type Set new_column 0 A Z Green 1 B Z Green 2 B X Red 3 C Y Red df.to_numpy() array([['A', 'Z', 'Green'], ['B', 'Z', 'Green'], ['B', 'X', 'Red'], ['C', 'Y', 'Red']], dtype=object) %%timeit conditions = [df['Set'] == 'Z'] outputs = ['Green'] np.select(conditions, outputs, 'Red') 134 µs ± 9.71 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) df2 = pd.DataFrame({'Type':list('ABBC')*1000000, 'Set':list('ZZXY')*1000000}) %%timeit conditions = [df2['Set'] == 'Z'] outputs = ['Green'] np.select(conditions, outputs, 'Red') 188 ms ± 26.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Paras · Accepted Answer · 2024-10-24 05:33:22Z

Little modification to This Answer of acharuva

df['color'] = df.apply( lambda x: 'red' if x == 'Z' else 'green', axis=1)

Collectives™ on Stack Overflow

How do I create a new column where the values are selected based on an existing column?

15 Answers 15

Comments

6 Comments

Comments

Edit (21/06/2019): Using plydata

Comments

Comments

1 Comment

Comments

Comments

Comments

Comments

Comments

Comments

4 Comments

Comments

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

15 Answers 15

Comments

6 Comments

Comments

Edit (21/06/2019): Using plydata

Comments

Comments

1 Comment

Comments

Comments

Comments

Comments

Comments

Comments

4 Comments

Comments

Comments

Linked

Related