Pandas dataframe column from data containing NaN values

Question

The following code transforms a given pandas column FEAT into a new binary feature named STREAM. The program works as long as there are no NaN values in the original dataframe. When there are, the following exception occurs: ValueError: Length of values does not match length of index. I need to carry the NaN values over to the new column. Is that doable? Here is the code that fails:

    import pandas as pd
    import numpy as np

    data = {
        'FEAT': [8, 15, 7, np.nan, 5, 2, 11, 15]
    }
    customer = pd.DataFrame(data)
    customer = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David', 'Bob', 'Sally', 'Mia', 'Luis'])

    # create binary variable STREAM 0:mainstream 1:avantgarde
    stream_0 = [1, 3, 5, 8, 10, 12, 14]
    stream_1 = [2, 4, 6, 7, 9, 11, 13, 15]

    # convert FEAT to list_0
    list_0 = customer['FEAT'].values.tolist()

    # create a list of length = len(customer) whose elements are:
    # 0 if the value of 'FEAT' is in stream_0
    # 1 if the value of 'FEAT' is in stream_1
    L = []
    for i in list_0:
        if i in stream_0:
            L.append(0)
        elif i in stream_1:
            L.append(1)

    # convert the list to a new column of customer df
    customer['STREAM'] = L
    print(customer)

What value should it get in the else block, which is missing (i.e. when the value is in neither stream_0 nor stream_1)?
– ALollz, Commented May 13, 2020 at 15:41

ALollz · Accepted Answer · 2020-05-13 16:47:26Z

The issue is that you are missing an else block, so when a value (like NaN) is in neither stream_0 nor stream_1 nothing is appended, which leaves L with fewer elements than there are rows in customer.
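To make the point concrete, here is a minimal sketch of the original loop with an else branch added. Using np.nan as the fallback value is an assumption based on the question's wish to carry NaN over to the new column; any other placeholder would fix the length mismatch just as well.

```python
import numpy as np
import pandas as pd

customer = pd.DataFrame(
    {'FEAT': [8, 15, 7, np.nan, 5, 2, 11, 15]},
    index=['June', 'Robert', 'Lily', 'David', 'Bob', 'Sally', 'Mia', 'Luis'],
)

stream_0 = [1, 3, 5, 8, 10, 12, 14]
stream_1 = [2, 4, 6, 7, 9, 11, 13, 15]

L = []
for i in customer['FEAT'].tolist():
    if i in stream_0:
        L.append(0)
    elif i in stream_1:
        L.append(1)
    else:
        # Assumed fallback: the value (e.g. NaN) is in neither list,
        # so append NaN to keep exactly one entry per row.
        L.append(np.nan)

# L now has the same length as the dataframe, so the assignment succeeds.
customer['STREAM'] = L
print(customer)
```

Because every iteration appends exactly one element, len(L) always equals len(customer), regardless of how many values fall outside both lists.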

Looping here is unnecessary; np.select can handle the column creation. Its default argument plays the role of the else block.

    customer['STREAM'] = np.select([customer.FEAT.isin(stream_0), customer.FEAT.isin(stream_1)],
                                   [0, 1],
                                   default=np.nan)  # np.nan: the np.NaN alias was removed in NumPy 2.0

            FEAT  STREAM
    June     8.0     0.0
    Robert  15.0     1.0
    Lily     7.0     1.0
    David    NaN     NaN
    Bob      5.0     0.0
    Sally    2.0     1.0
    Mia     11.0     1.0
    Luis    15.0     1.0

You could also map the few values with a dictionary; everything that is in neither list becomes NaN.

    d = {key: value for l, value in zip([stream_0, stream_1], [0, 1]) for key in l}
    customer['STREAM'] = customer['FEAT'].map(d)

The dict is built with a comprehension that creates the key-value pairs: every key in stream_0 gets the value 0, and every key in stream_1 gets the value 1. The comprehension is a bit complicated, so an easier-to-understand way to accomplish the same thing is to create each dictionary separately and then combine them.

    d_1 = {k: 0 for k in stream_0}
    d_2 = {k: 1 for k in stream_1}
    d = {**d_1, **d_2}  # Combine
    #{1: 0, 2: 1, 3: 0, 4: 1, 5: 0, 6: 1, 7: 1,
    # 8: 0, 9: 1, 10: 0, 11: 1, 12: 0, 13: 1, 14: 0, 15: 1}
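As a further aid to reading the one-line comprehension, the sketch below unrolls it into plain nested loops and checks that both forms build the same dictionary. The names d_loop, d_comp and feat are illustrative only and do not appear in the answer; feat is a small stand-in for the FEAT column.

```python
import numpy as np
import pandas as pd

stream_0 = [1, 3, 5, 8, 10, 12, 14]
stream_1 = [2, 4, 6, 7, 9, 11, 13, 15]

# Unrolled version: pair each list with its label, then give every
# element of that list the label as its value.
d_loop = {}
for keys, value in zip([stream_0, stream_1], [0, 1]):
    for key in keys:
        d_loop[key] = value

# The same logic written as a single dict comprehension.
d_comp = {key: value
          for keys, value in zip([stream_0, stream_1], [0, 1])
          for key in keys}

print(d_loop == d_comp)

# Hypothetical stand-in for the FEAT column: values missing from the
# dictionary (including NaN) come out of map as NaN.
feat = pd.Series([8, np.nan, 7])
stream = feat.map(d_loop)
print(stream)
```

The outer for in the comprehension corresponds to the outer loop and the inner for to the inner loop, in the same left-to-right order.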

The second solution works. However, it is very cryptic. Can you explain a bit more or provide easier-to-understand code? Thanks.
@josephpareti I added some explanation, and also a more straightforward way to create the dictionary. It's a bit more typing, but I think it's clearer. map uses that dict to transform the values.
Thank you. I understand the option with the two dictionaries better, but I still do not understand d = {**d_1, **d_2}.
@josephpareti See stackoverflow.com/questions/38987/…. In Python 3.5+ it's one very concise way to merge dictionaries; otherwise you can use dict.update.
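To illustrate the last comment, here is a small sketch comparing the unpacking form with dict.update. The names merged_unpack and merged_update are made up for the illustration; the dictionaries are rebuilt from the stream lists so the snippet runs on its own.

```python
stream_0 = [1, 3, 5, 8, 10, 12, 14]
stream_1 = [2, 4, 6, 7, 9, 11, 13, 15]

d_1 = {k: 0 for k in stream_0}
d_2 = {k: 1 for k in stream_1}

# Unpacking: ** spreads each dict's key-value pairs into a new dict literal.
merged_unpack = {**d_1, **d_2}

# Equivalent with dict.update: copy d_1 first so it is not modified in place.
merged_update = dict(d_1)
merged_update.update(d_2)

print(merged_unpack == merged_update)

# If a key appears in both dicts, the right-most value wins in either form.
print({**{'a': 1}, **{'a': 2}})
```

In Python 3.9 and later, d_1 | d_2 is another way to write the same merge. Since stream_0 and stream_1 share no values here, the order of the merge does not affect the result.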
