648

I want to apply my custom function (it uses an if-else ladder) to these six columns (ERI_Hispanic, ERI_AmerInd_AKNatv, ERI_Asian, ERI_Black_Afr.Amer, ERI_HI_PacIsl, ERI_White) in each row of my dataframe.

I've tried different methods from other questions but still can't seem to find the right answer for my problem. The critical piece of this is that if the person is counted as Hispanic they can't be counted as anything else. Even if they have a "1" in another ethnicity column they still are counted as Hispanic not two or more races. Similarly, if the sum of all the ERI columns is greater than 1 they are counted as two or more races and can't be counted as a unique ethnicity(except for Hispanic).

It's almost like doing a for loop through each row and if each record meets a criterion they are added to one list and eliminated from the original.

From the dataframe below I need to calculate a new column based on the following spec in SQL:

CRITERIA

IF [ERI_Hispanic] = 1 THEN RETURN “Hispanic” ELSE IF SUM([ERI_AmerInd_AKNatv] + [ERI_Asian] + [ERI_Black_Afr.Amer] + [ERI_HI_PacIsl] + [ERI_White]) > 1 THEN RETURN “Two or More” ELSE IF [ERI_AmerInd_AKNatv] = 1 THEN RETURN “A/I AK Native” ELSE IF [ERI_Asian] = 1 THEN RETURN “Asian” ELSE IF [ERI_Black_Afr.Amer] = 1 THEN RETURN “Black/AA” ELSE IF [ERI_HI_PacIsl] = 1 THEN RETURN “Haw/Pac Isl.” ELSE IF [ERI_White] = 1 THEN RETURN “White” 

Comment: If the ERI Flag for Hispanic is True (1), the employee is classified as “Hispanic”

Comment: If more than 1 non-Hispanic ERI Flag is true, return “Two or More”

DATAFRAME

 lname fname rno_cd eri_afr_amer eri_asian eri_hawaiian eri_hispanic eri_nat_amer eri_white rno_defined 0 MOST JEFF E 0 0 0 0 0 1 White 1 CRUISE TOM E 0 0 0 1 0 0 White 2 DEPP JOHNNY 0 0 0 0 0 1 Unknown 3 DICAP LEO 0 0 0 0 0 1 Unknown 4 BRANDO MARLON E 0 0 0 0 0 0 White 5 HANKS TOM 0 0 0 0 0 1 Unknown 6 DENIRO ROBERT E 0 1 0 0 0 1 White 7 PACINO AL E 0 0 0 0 0 1 White 8 WILLIAMS ROBIN E 0 0 1 0 0 0 White 9 EASTWOOD CLINT E 0 0 0 0 0 1 White 
2
  • 3
    For this task, apply seems obvious but don't use it because it is just a loop over the rows. There are better ways to do it. Examples: here, here, here. Commented Nov 17, 2022 at 21:37
  • 2
    I'll second @cottontail BEWARE of this approach, it leads to very poor performance! Commented Jun 14, 2023 at 12:14

8 Answers 8

796

OK, two steps to this - first is to write a function that does the translation you want - I've put an example together based on your pseudo-code:

def label_race(row): if row['eri_hispanic'] == 1: return 'Hispanic' if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1: return 'Two Or More' if row['eri_nat_amer'] == 1: return 'A/I AK Native' if row['eri_asian'] == 1: return 'Asian' if row['eri_afr_amer'] == 1: return 'Black/AA' if row['eri_hawaiian'] == 1: return 'Haw/Pac Isl.' if row['eri_white'] == 1: return 'White' return 'Other' 

You may want to go over this, but it seems to do the trick - notice that the parameter going into the function is considered to be a Series object labelled "row".

Next, use the apply function in pandas to apply the function - e.g.

df.apply(label_race, axis=1) 

Note the axis=1 specifier, that means that the application is done at a row, rather than a column level. The results are here:

0 White 1 Hispanic 2 White 3 White 4 Other 5 White 6 Two Or More 7 White 8 Haw/Pac Isl. 9 White 

If you're happy with those results, then run it again, saving the results into a new column in your original dataframe.

df['race_label'] = df.apply(label_race, axis=1) 

The resultant dataframe looks like this (scroll to the right to see the new column):

 lname fname rno_cd eri_afr_amer eri_asian eri_hawaiian eri_hispanic eri_nat_amer eri_white rno_defined race_label 0 MOST JEFF E 0 0 0 0 0 1 White White 1 CRUISE TOM E 0 0 0 1 0 0 White Hispanic 2 DEPP JOHNNY NaN 0 0 0 0 0 1 Unknown White 3 DICAP LEO NaN 0 0 0 0 0 1 Unknown White 4 BRANDO MARLON E 0 0 0 0 0 0 White Other 5 HANKS TOM NaN 0 0 0 0 0 1 Unknown White 6 DENIRO ROBERT E 0 1 0 0 0 1 White Two Or More 7 PACINO AL E 0 0 0 0 0 1 White White 8 WILLIAMS ROBIN E 0 0 1 0 0 0 White Haw/Pac Isl. 9 EASTWOOD CLINT E 0 0 0 0 0 1 White White 
Sign up to request clarification or add additional context in comments.

1 Comment

No need to wrap the function in a lambda, this should work as well: df['race_label'] = df.apply (label_race, axis=1)
454

Since this is the first Google result for 'pandas new column from others', here's a simple example:

import pandas as pd # make a simple dataframe df = pd.DataFrame({'a':[1,2], 'b':[3,4]}) df # a b # 0 1 3 # 1 2 4 # create an unattached column with an index df.apply(lambda row: row.a + row.b, axis=1) # 0 4 # 1 6 # do same but attach it to the dataframe df['c'] = df.apply(lambda row: row.a + row.b, axis=1) df # a b c # 0 1 3 4 # 1 2 4 6 

If you get the SettingWithCopyWarning you can do it this way also:

col = df.apply(lambda row: row.a + row.b, axis=1) df = df.assign(c=col.values) # assign values to column 'c' 

Source: https://stackoverflow.com/a/12555510/243392

And if your column name includes spaces you can use syntax like this:

df = df.assign(**{'some column name': col.values}) 

And here's the documentation for apply, and assign.

2 Comments

How does this answer the question? This function doesn't use any conditionals (if) at all plus you shouldn't even be using .apply here when you can just do df['a'] + df['b'].
The first part of the answer with df printed 2 times does not work for me with python3.10 and pandas 2.3.0 getting error: AttributeError: 'Series' object has no attribute 'a'. Did you mean: 'at'? EDIT: it turned out that axis=1 is a must for it to work
145

The answers above are perfectly valid, but a vectorized solution exists, in the form of numpy.select. This allows you to define conditions, then define outputs for those conditions, much more efficiently than using apply:


First, define conditions:

conditions = [ df['eri_hispanic'] == 1, df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1), df['eri_nat_amer'] == 1, df['eri_asian'] == 1, df['eri_afr_amer'] == 1, df['eri_hawaiian'] == 1, df['eri_white'] == 1, ] 

Now, define the corresponding outputs:

outputs = [ 'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White' ] 

Finally, using numpy.select:

res = np.select(conditions, outputs, 'Other') pd.Series(res) 

0 White 1 Hispanic 2 White 3 White 4 Other 5 White 6 Two Or More 7 White 8 Haw/Pac Isl. 9 White dtype: object 

Why should numpy.select be used over apply? Here are some performance checks:

df = pd.concat([df]*1000) In [42]: %timeit df.apply(lambda row: label_race(row), axis=1) 1.07 s ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) In [44]: %%timeit ...: conditions = [ ...: df['eri_hispanic'] == 1, ...: df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1), ...: df['eri_nat_amer'] == 1, ...: df['eri_asian'] == 1, ...: df['eri_afr_amer'] == 1, ...: df['eri_hawaiian'] == 1, ...: df['eri_white'] == 1, ...: ] ...: ...: outputs = [ ...: 'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White' ...: ] ...: ...: np.select(conditions, outputs, 'Other') ...: ...: 3.09 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 

Using numpy.select gives us vastly improved performance, and the discrepancy will only increase as the data grows.

Comments

43

.apply() takes in a function as the first parameter; pass in the label_race function as so:

df['race_label'] = df.apply(label_race, axis=1) 

You don't need to make a lambda function to pass in a function.

1 Comment

This seems to be a comment on the top answer and doesn't stand on its own since label_race is not defined. I've just removed the unnecessary lambda from the top answer so this is no longer needed in any case. In the future, please either edit existing answers to improve them or post a comment.
36

try this,

df.loc[df['eri_white']==1,'race_label'] = 'White' df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.' df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA' df.loc[df['eri_asian']==1,'race_label'] = 'Asian' df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native' df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More' df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic' df['race_label'].fillna('Other', inplace=True) 

O/P:

 lname fname rno_cd eri_afr_amer eri_asian eri_hawaiian \ 0 MOST JEFF E 0 0 0 1 CRUISE TOM E 0 0 0 2 DEPP JOHNNY NaN 0 0 0 3 DICAP LEO NaN 0 0 0 4 BRANDO MARLON E 0 0 0 5 HANKS TOM NaN 0 0 0 6 DENIRO ROBERT E 0 1 0 7 PACINO AL E 0 0 0 8 WILLIAMS ROBIN E 0 0 1 9 EASTWOOD CLINT E 0 0 0 eri_hispanic eri_nat_amer eri_white rno_defined race_label 0 0 0 1 White White 1 1 0 0 White Hispanic 2 0 0 1 Unknown White 3 0 0 1 Unknown White 4 0 0 0 White Other 5 0 0 1 Unknown White 6 0 0 1 White Two Or More 7 0 0 1 White White 8 0 0 0 White Haw/Pac Isl. 9 0 0 1 White White 

use .loc instead of apply.

it improves vectorization.

.loc works in simple manner, mask rows based on the condition, apply values to the freeze rows.

for more details visit, .loc docs

Performance metrics:

Accepted Answer:

def label_race (row): if row['eri_hispanic'] == 1 : return 'Hispanic' if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1 : return 'Two Or More' if row['eri_nat_amer'] == 1 : return 'A/I AK Native' if row['eri_asian'] == 1: return 'Asian' if row['eri_afr_amer'] == 1: return 'Black/AA' if row['eri_hawaiian'] == 1: return 'Haw/Pac Isl.' if row['eri_white'] == 1: return 'White' return 'Other' df=pd.read_csv('dataser.csv') df = pd.concat([df]*1000) %timeit df.apply(lambda row: label_race(row), axis=1) 

1.15 s ± 46.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

My Proposed Answer:

def label_race(df): df.loc[df['eri_white']==1,'race_label'] = 'White' df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.' df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA' df.loc[df['eri_asian']==1,'race_label'] = 'Asian' df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native' df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More' df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic' df['race_label'].fillna('Other', inplace=True) df=pd.read_csv('s22.csv') df = pd.concat([df]*1000) %timeit label_race(df) 

24.7 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Comments

12

Pandas>=2.2.0 solution: case_when

We can use case_when method to create a new column using a switch statement. First, assign a column with the default value ('Other' in the example in the OP), and then replace values in this new column using a list of (condition, replacement value) tuples.

df['race_label'] = 'Other' df['race_label'] = df['race_label'].case_when( [ (df['eri_hispanic']==1, 'Hispanic'), (df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white'] > 1, 'Two Or More'), (df['eri_nat_amer'] == 1, 'A/I AK Native'), (df['eri_asian'] == 1, 'Asian'), (df['eri_afr_amer'] == 1, 'Black/AA'), (df['eri_hawaiian'] == 1, 'Haw/Pac Isl.'), (df['eri_white'] == 1, 'White') ] ) 

This is somewhat similar to numpy.select but unlike that function, we see the condition next to the value, so it's more readable (imo). Also because it is vectorized, it is much faster than apply on large dataframes.


If we inspect its source code, apply() is a syntactic sugar for a Python for-loop (via the apply_series_generator() method of the FrameApply class). Because it has the pandas overhead, it's generally slower than a Python loop.

Use optimized (vectorized) methods wherever possible. If you have to use a loop, use @numba.jit decorator.

1. Don't use apply() for an if-else ladder

df.apply() is just about the slowest way to do this in pandas. As shown in the answers of user3483203 and Mohamed Thasin ah, depending on the dataframe size, np.select() and df.loc may be 50-300 times faster than df.apply() to produce the same output.

As it happens, a loop implementation (not unlike apply()) with the @jit decorator from numba module is (about 50-60%) faster than df.loc and np.select.1

Numba works on numpy arrays, so before using the jit decorator, you need to convert the dataframe into a numpy array. Then fill in values in a pre-initialized empty array by checking the conditions in a loop. Since numpy arrays don't have column names, you have to access the columns by their index in the loop. The most inconvenient part of the if-else ladder in the jitted function over the one in apply() is accessing the columns by their indices. Otherwise it's almost the same implementation.

import numpy as np import numba as nb @nb.jit(nopython=True) def conditional_assignment(arr, res): length = len(arr) for i in range(length): if arr[i][3] == 1: res[i] = 'Hispanic' elif arr[i][0] + arr[i][1] + arr[i][2] + arr[i][4] + arr[i][5] > 1: res[i] = 'Two Or More' elif arr[i][0] == 1: res[i] = 'Black/AA' elif arr[i][1] == 1: res[i] = 'Asian' elif arr[i][2] == 1: res[i] = 'Haw/Pac Isl.' elif arr[i][4] == 1: res[i] = 'A/I AK Native' elif arr[i][5] == 1: res[i] = 'White' else: res[i] = 'Other' return res # the columns with the boolean data cols = [c for c in df.columns if c.startswith('eri_')] # initialize an empty array to be filled in a loop # for string dtype arrays, we need to know the length of the longest string # and use it to set the dtype res = np.empty(len(df), dtype=f"<U{len('A/I AK Native')}") # pass the underlying numpy array of `df[cols]` into the jitted function df['rno_defined'] = conditional_assignment(df[cols].values, res) 

2. Don't use apply() for numeric operations

If you need to add a new row by adding two columns, your first instinct may be to write

df['c'] = df.apply(lambda row: row['a'] + row['b'], axis=1) 

But instead of this, row-wise add using sum(axis=1) method (or + operator if there are only a couple of columns):

df['c'] = df[['a','b']].sum(axis=1) # equivalently df['c'] = df['a'] + df['b'] 

Depending on the dataframe size, sum(1) may be 100s of times faster than apply().

In fact, you will almost never need apply() for numeric operations on a pandas dataframe because it has optimized methods for most operations: addition (sum(1)), subtraction (sub() or diff()), multiplication (prod(1)), division (div() or /), power (pow()), >, >=, ==, %, //, &, | etc. can all be performed on the entire dataframe without apply().

For example, let's say you want to create a new column using the following rule:

IF [colC] > 0 THEN RETURN [colA] * [colB] ELSE RETURN [colA] / [colB] 

Using the optimized pandas methods, this can be written as

df['new'] = df[['colA','colB']].prod(1).where(df['colC']>0, df['colA'] / df['colB']) 

the equivalent apply() solution is:

df['new'] = df.apply(lambda row: row.colA * row.colB if row.colC > 0 else row.colA / row.colB, axis=1) 

The approach using the optimized methods is 250 times faster than the equivalent apply() approach for dataframes with 20k rows. This gap only increases as the data size increases (for a dataframe with 1 mil rows, it's 365 times faster) and the time difference will become more and more noticeable.2



1: In the below result, I show the performance of the three approaches using a dataframe with 24 mil rows (this is the largest frame I can construct on my machine). For smaller frames, the numba-jitted function consistently runs at least 50% faster than the other two as well (you can check yourself).
def pd_loc(df): df['rno_defined'] = 'Other' df.loc[df['eri_nat_amer'] == 1, 'rno_defined'] = 'A/I AK Native' df.loc[df['eri_asian'] == 1, 'rno_defined'] = 'Asian' df.loc[df['eri_afr_amer'] == 1, 'rno_defined'] = 'Black/AA' df.loc[df['eri_hawaiian'] == 1, 'rno_defined'] = 'Haw/Pac Isl.' df.loc[df['eri_white'] == 1, 'rno_defined'] = 'White' df.loc[df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1) > 1, 'rno_defined'] = 'Two Or More' df.loc[df['eri_hispanic'] == 1, 'rno_defined'] = 'Hispanic' return df def np_select(df): conditions = [df['eri_hispanic'] == 1, df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1), df['eri_nat_amer'] == 1, df['eri_asian'] == 1, df['eri_afr_amer'] == 1, df['eri_hawaiian'] == 1, df['eri_white'] == 1] outputs = ['Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'] df['rno_defined'] = np.select(conditions, outputs, 'Other') return df @nb.jit(nopython=True) def conditional_assignment(arr, res): length = len(arr) for i in range(length): if arr[i][3] == 1 : res[i] = 'Hispanic' elif arr[i][0] + arr[i][1] + arr[i][2] + arr[i][4] + arr[i][5] > 1 : res[i] = 'Two Or More' elif arr[i][0] == 1: res[i] = 'Black/AA' elif arr[i][1] == 1: res[i] = 'Asian' elif arr[i][2] == 1: res[i] = 'Haw/Pac Isl.' elif arr[i][4] == 1 : res[i] = 'A/I AK Native' elif arr[i][5] == 1: res[i] = 'White' else: res[i] = 'Other' return res def nb_loop(df): cols = [c for c in df.columns if c.startswith('eri_')] res = np.empty(len(df), dtype=f"<U{len('A/I AK Native')}") df['rno_defined'] = conditional_assignment(df[cols].values, res) return df # df with 24mil rows n = 4_000_000 df = pd.DataFrame({ 'eri_afr_amer': [0, 0, 0, 0, 0, 0]*n, 'eri_asian': [1, 0, 0, 0, 0, 0]*n, 'eri_hawaiian': [0, 0, 0, 1, 0, 0]*n, 'eri_hispanic': [0, 1, 0, 0, 1, 0]*n, 'eri_nat_amer': [0, 0, 0, 0, 1, 0]*n, 'eri_white': [0, 0, 1, 1, 0, 0]*n }, dtype='int8') df.insert(0, 'name', ['MOST', 'CRUISE', 'DEPP', 'DICAP', 'BRANDO', 'HANKS']*n) %timeit nb_loop(df) # 5.23 s ± 45.2 ms per loop (mean ± std. dev. of 10 runs, 10 loops each) %timeit pd_loc(df) # 7.97 s ± 28.8 ms per loop (mean ± std. dev. of 10 runs, 10 loops each) %timeit np_select(df) # 8.5 s ± 39.6 ms per loop (mean ± std. dev. of 10 runs, 10 loops each) 

2: In the below result, I show the performance of the two approaches using a dataframe with 20k rows and again with 1 mil rows. For smaller frames, the gap is smaller because the optimized approach has an overhead while apply() is a loop. As the size of the frame increases, the vectorization overhead cost diminishes w.r.t. to the overall runtime of the code while apply() remains a loop over the frame.

n = 20_000 # 1_000_000 df = pd.DataFrame(np.random.rand(n,3)-0.5, columns=['colA','colB','colC']) %timeit df[['colA','colB']].prod(1).where(df['colC']>0, df['colA'] / df['colB']) # n = 20000: 2.69 ms ± 23.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) # n = 1000000: 86.2 ms ± 441 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) %timeit df.apply(lambda row: row.colA * row.colB if row.colC > 0 else row.colA / row.colB, axis=1) # n = 20000: 679 ms ± 33.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # n = 1000000: 31.5 s ± 587 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 

3 Comments

Is df[['colA','colB']].prod(1) (I infer 1 is the axis argument) better than df['colA'] * df['colB']? Why?
@KarlKnechtel It's not better, just more concise and I guess less error-prone (no need to worry about brackets) and more readable than df['colA'].mul(df['colB']). In terms of performance, it's even worse than mul() or * because it has to do a lot more validation before reduction.
I'm not entirely convinced of those advantages, either.
6

Choosing a method according to the complexity of the criteria

For the examples below - in order to show multiple types of rules for the new column - we will assume a DataFrame with columns 'red', 'green' and 'blue', containing floating-point values ranging 0 to 1.

General case: .apply

As long as the necessary logic to compute the new value can be written as a function of other values in the same row, we can use the .apply method of the DataFrame to get the desired result. Write the function so that it accepts a single parameter, which is a single row of the input:

def as_hex(value): # clamp to avoid rounding errors etc. return min(max(0, int(value * 256)), 255) def hex_color(row): r, g, b = as_hex(row['red']), as_hex(row['green']), as_hex(row['blue']) return f'#{r:02x}{g:02x}{b:02x}' 

Pass the function itself (don't write parentheses after the name) to .apply, and specify axis=1 (meaning to supply rows to the categorizing function, so as to compute a column - rather than the other way around). Thus:

df['hex_color'] = df.apply(hex_color, axis=1) 

Note that wrapping in lambda is not necessary, since we are not binding any arguments or otherwise modifying the function.

The .apply step is necessary because the conversion function itself is not vectorized. Thus, a naive approach like df['color'] = hex_color(df) will not work (example question).

This tool is powerful, but inefficient. For best performance, please use a more specific approach where applicable.

Multiple choices with conditions: numpy.select or repeated assignment with df.loc or df.where

Suppose we were thresholding the color values, and computing rough color names like so:

def additive_color(row): # Insert here: logic that takes values from the `row` and computes # the desired cell value for the new column in that row. # The `row` is an ordinary `Series` object representing a row of the # original `DataFrame`; it can be indexed with column names, thus: if row['red'] > 0.5: if row['green'] > 0.5: return 'white' if row['blue'] > 0.5 else 'yellow' else: return 'magenta' if row['blue'] > 0.5 else 'red' elif row['green'] > 0.5: return 'cyan' if row['blue'] > 0.5 else 'green' else: return 'blue' if row['blue'] > 0.5 else 'black' 

In cases like this - where the categorizing function would be an if/else ladder, or match/case in 3.10 and up - we may get much faster performance using numpy.select.

This approach works very differently. First, compute masks on the data for where each condition applies:

black = (df['red'] <= 0.5) & (df['green'] <= 0.5) & (df['blue'] <= 0.5) white = (df['red'] > 0.5) & (df['green'] > 0.5) & (df['blue'] > 0.5) 

To call numpy.select, we need two parallel sequences - one of the conditions, and another of the corresponding values:

df['color'] = np.select( [white, black], ['white', 'black'], 'colorful' ) 

The optional third argument specifies a value to use when none of the conditions are met. (As an exercise: fill in the remaining conditions, and try it without a third argument.)

A similar approach is to make repeated assignments based on each condition. Assign the default value first, and then use df.loc to assign specific values for each condition:

df['color'] = 'colorful' df.loc[white, 'color'] = 'white' df.loc[black, 'color'] = 'black' 

Alternately, df.where can be used to do the assignments. However, df.where, used like this, assigns the specified value in places where the condition is not met, so the conditions must be inverted:

df['color'] = 'colorful' df['color'] = df['color'].where(~white, 'white').where(~black, 'black') 

Simple mathematical manipulations: built-in mathematical operators and broadcasting

For example, an apply-based approach like:

def brightness(row): return row['red'] * .299 + row['green'] * .587 + row['blue'] * .114 df['brightness'] = df.apply(brightness, axis=1) 

can instead be written by broadcasting the operators, for much better performance (and is also simpler):

df['brightness'] = df['red'] * .299 + df['green'] * .587 + df['blue'] * .114 

As an exercise, here's the first example redone that way:

def as_hex(column): scaled = (column * 256).astype(int) clamped = scaled.where(scaled >= 0, 0).where(scaled <= 255, 255) return clamped.apply(lambda i: f'{i:02x}') df['hex_color'] = '#' + as_hex(df['red']) + as_hex(df['green']) + as_hex(df['blue']) 

I was unable to find a vectorized equivalent to format the integer values as hex strings, so .apply is still used internally here - meaning that the full speed penalty still comes into play. Still, this demonstrates some general techniques.

For more details and examples, see cottontail's answer.

1 Comment

My goals in this answer are: 1) combine cottontail's advice for when not to use .apply, with simple how-tos for each approach (as described in the most popular answers); 2) talk through how each approach works and explain them clearly, with a consistent style; 3) use a different example for the input, to avoid any possible political controversy and to admit multiple plausible manipulations of the data.
0

Yet another (easily generalizable) approach, whose corner-stone is pandas.DataFrame.idxmax. First, the easily generalizable preamble.

# Indeed, all your conditions boils down to the following _gt_1_key = 'two_or_more' _lt_1_key = 'other' # The "dictionary-based" if-else statements labels = { _gt_1_key : 'Two Or More', 'eri_hispanic': 'Hispanic', 'eri_nat_amer': 'A/I AK Native', 'eri_asian' : 'Asian', 'eri_afr_amer': 'Black/AA', 'eri_hawaiian': 'Haw/Pac Isl.', 'eri_white' : 'White', _lt_1_key : 'Other', } # The output-driving 1-0 matrix mat = df.filter(regex='^eri_').copy() # `~.copy` to avoid `SettingWithCopyWarning` 

... and, finally, in a vectorized fashion:

mat[_gt_1_key] = gt1 = mat.sum(axis=1) mat[_lt_1_key] = gt1.eq(0).astype(int) race_label = mat.idxmax(axis=1).map(labels) 

where

>>> race_label 0 White 1 Hispanic 2 White 3 White 4 Other 5 White 6 Two Or More 7 White 8 Haw/Pac Isl. 9 White dtype: object 

that is a pandas.Series instance you can easily host within df, i.e. doing df['race_label'] = race_label.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.