3

I have a dataframe that looks as follows:

 data = np.array([[5, 'red', 2,6, 8, 10], [11, 'red', 3,9,6,15], [8, 'blue', 0, 3, 5, 10], [2, 'blue', 1, 2, 3, 4]]) df = pd.DataFrame(data, columns = ['A','B','red_lower', 'red_upper', 'blue_lower', 'blue_upper']) 
 A B red_lower red_upper blue_lower blue_upper 0 5 red 2 6 8 10 1 11 red 3 9 6 15 2 8 blue 0 3 5 10 3 2 blue 1 2 3 4 

I'd like to create an additional column that tells me if the value in a column A is in the range of the color given in column B. For example, in row 0, since 5 has the designation red, I will check if 5 is between 2 and 6. It is, so I will have the new column have a 1.

Desired result:

 A B red_lower red_upper blue_lower blue_upper in_range 0 5 red 2 6 8 10 1 1 11 red 3 9 6 15 0 2 8 blue 0 3 5 10 1 3 2 blue 1 2 3 4 0 

I've tried to write a loop, but I'm getting many series errors. I really dont want to have to split up the dataframe (by color), but maybe that's the way to go? (in my actual dataframe, there are six different 'colors', not just two).

Thank you!

EDIT: bonus if we have the additional column tell me if the value is above or below the range! For example, in row 1, 11 is outside the range, so is too high. Table should look this way:

 A B red_lower red_upper blue_lower blue_upper in_range 0 5 red 2 6 8 10 inside 1 11 red 3 9 6 15 above 2 8 blue 0 3 5 10 inside 3 2 blue 1 2 3 4 below 
1
  • for loop or groupby should be the way Commented Jul 8, 2019 at 21:08

2 Answers 2

3

Here is using groupby split the df and most of step handled by the definition , which means you do not need input the different color each time

l=[] for name,x in df.groupby('B',sort=False): s1=(x.A >= x.filter(like=name).iloc[:, 0]) & (x.A <= x.filter(like=name).iloc[:, 1]) s2=x.A<x.filter(like=name).iloc[:, 0] l.extend(np.select([s1,s2],['inside','below'],default='above').tolist()) df['in_range']=l df Out[64]: A B red_lower red_upper blue_lower blue_upper in_range 0 5 red 2 6 8 10 inside 1 11 red 3 9 6 15 above 2 8 blue 0 3 5 10 inside 3 2 blue 1 2 3 4 below 
Sign up to request clarification or add additional context in comments.

4 Comments

I think the issue here is my 'color' strings might not show up exactly in the column name - there are extra dashes between words, and there are capitalization issues, so the filter isn't perfect. When running this, I get a 'single positional indexer is out-of-bounds' error. I'll try to play with it to edit my column names and B entries to make them match up nicer perhaps?
This needed lots of tweaking for the mess of data I'm working with, but this was really helpful, thank you!
I want to edit this because there is a problem in the code here. This base code DOES NOT WORK if the colors are mixed. If this table is changed to red/blue/red/blue, the designations are off. I'm going to uncheck this as a solution for now - any thoughts @WeNYoBen ?
@shakebeforeopening then you should sort your columns first , then after assign it back , you can always sort_index get back the original ordered ..My code work , just need to have more effort on it, Likedf=df.sort_values('B'), then using above code , then df=df.sort_index()
3

justify + broadcast + mask + logical_and

You can use some nifty broadcasting here, and the function justify from another answer. This assumes that each color has a single valid range. It also assumes that all of your numeric columns are in fact numeric.


values = df.A.values colors = df.B.values range_frame = df.iloc[:, 2:] ranges = range_frame.columns.str.split('_').str[0].values m = colors != ranges[:, None] masked = range_frame.mask(m) jf = justify(masked.values, invalid_val=np.nan)[:, :2] ir = np.logical_and(jf[:, 0] < values, values < jf[:, 1]).astype(int) c1 = values <= jf[:, 0] c2 = values >= jf[:, 1] irl = np.select([c1, c2], ['below', 'above'], 'inside') df.assign(in_range=ir, in_range_flag=irl) 

 A B red_lower red_upper blue_lower blue_upper in_range in_range_flag 0 5 red 2 6 8 10 1 inside 1 11 red 3 9 6 15 0 above 2 8 blue 0 3 5 10 1 inside 3 3 blue 1 2 3 4 0 below 

stack + reshape + logical_and

Again making the same assumptions as the first answer.


u = df.set_index(['A', 'B']).stack().rename_axis(['A', 'B', 'flag']).reset_index() frame = u[u.flag.str.split('_').str[0] == u.B] values = frame[::2].A.values ranges = frame[0].values.reshape(-1, 2) ir = np.logical_and(ranges[:, 0] < values, values < ranges[:, 1]) c1 = values <= ranges[:, 0] c2 = values >= ranges[:, 1] irl = np.select([c1, c2], ['below', 'above'], 'inside') df.assign(in_range=ir, in_range_flag=irl) 

Here is the definition for the justify function by @Divakar:

def justify(a, invalid_val=0, axis=1, side='left'): """ Justifies a 2D array Parameters ---------- A : ndarray Input array to be justified axis : int Axis along which justification is to be made side : str Direction of justification. It could be 'left', 'right', 'up', 'down' It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0. """ if invalid_val is np.nan: mask = ~np.isnan(a) else: mask = a!=invalid_val justified_mask = np.sort(mask,axis=axis) if (side=='up') | (side=='left'): justified_mask = np.flip(justified_mask,axis=axis) out = np.full(a.shape, invalid_val) if axis==1: out[justified_mask] = a[mask] else: out.T[justified_mask.T] = a.T[mask.T] return out 

1 Comment

I definitely should have mentioned that I have some empty strings (not NULLS) in certain ranges where they aren't provided (it's not the cleanest data in the world). Each color does match up to a 'range', but that range may be empty. I could amend this code with an if statement that checks if the range is empty first to put the string 'not provided' in the additional column? But I'm not sure how that would affect the justify function.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.