I would like to find a pandas solution for the following problem (the real dataframe is very long, so performance is important):

I have an input dataframe df and need to build a new dataframe dfNew, where I need to derive the output in column 'rs' from the values of the other columns.

The required logic is as follows:

  • t always increases steadily from 0 to its maximum value, then starts again at 0.
  • Whenever we are in the range from t = 0 up to and including the next pt = 'X', the value of column td should be used for the result column rs; otherwise the value of column md should be used.

What would a pandas-based solution to derive rs from the other columns look like?

    import pandas as pd

    td = ['td0','td1','td2','td3','td4','td5','td6','td7','td8','td9','td10','td11','td12']
    md = ['md0','md1','md2','md3','md4','md5','md6','md7','md8','md9','md10','md11','md12']
    t  = [  0  ,  1  ,  2  ,  3  ,  0  ,  1  ,  2  ,  3  ,  4  ,  5  ,  0   ,  1   ,  2   ]
    pt = [ 'n' , 'n' , 'X' , 'n' , 'n' , 'n' , 'n' , 'X' , 'n' , 'n' , 'n'  , 'X'  , 'n'  ]

    df = pd.DataFrame({'td': td, 'md': md, 't': t, 'pt': pt}, columns=['td', 'md', 't', 'pt'])

    df

          td    md  t pt
    0    td0   md0  0  n
    1    td1   md1  1  n
    2    td2   md2  2  X
    3    td3   md3  3  n
    4    td4   md4  0  n
    5    td5   md5  1  n
    6    td6   md6  2  n
    7    td7   md7  3  X
    8    td8   md8  4  n
    9    td9   md9  5  n
    10  td10  md10  0  n
    11  td11  md11  1  X
    12  td12  md12  2  n

    dfNew

          td    md  t pt    rs
    0    td0   md0  0  n   td0
    1    td1   md1  1  n   td1
    2    td2   md2  2  X   td2
    3    td3   md3  3  n   md3
    4    td4   md4  0  n   td4
    5    td5   md5  1  n   td5
    6    td6   md6  2  n   td6
    7    td7   md7  3  X   td7
    8    td8   md8  4  n   md8
    9    td9   md9  5  n   md9
    10  td10  md10  0  n  td10
    11  td11  md11  1  X  td11
    12  td12  md12  2  n  md12

2 Answers

Here's my take with groupby and cumsum

    import numpy as np

    # df.t.eq(0).cumsum() marks the range of t
    # similarly x.shift().eq('X').cumsum() marks the X range
    pt_range = (df.groupby(df.t.eq(0).cumsum())
                  .pt.apply(lambda x: x.shift().eq('X').cumsum()))
    df['rs'] = np.where(pt_range, df.md, df.td)

Output:

    +----+------+------+---+----+------+
    |    | td   | md   | t | pt | rs   |
    +----+------+------+---+----+------+
    |  0 | td0  | md0  | 0 | n  | td0  |
    |  1 | td1  | md1  | 1 | n  | td1  |
    |  2 | td2  | md2  | 2 | X  | td2  |
    |  3 | td3  | md3  | 3 | n  | md3  |
    |  4 | td4  | md4  | 0 | n  | td4  |
    |  5 | td5  | md5  | 1 | n  | td5  |
    |  6 | td6  | md6  | 2 | n  | td6  |
    |  7 | td7  | md7  | 3 | X  | td7  |
    |  8 | td8  | md8  | 4 | n  | md8  |
    |  9 | td9  | md9  | 5 | n  | md9  |
    | 10 | td10 | md10 | 0 | n  | td10 |
    | 11 | td11 | md11 | 1 | X  | td11 |
    | 12 | td12 | md12 | 2 | n  | md12 |
    +----+------+------+---+----+------+
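
Since performance was mentioned in the question, here is a sketch of a variant that avoids apply with a Python lambda and uses only a grouped cumsum. It is not part of the answer above. The names segment, is_x and x_before are made up for this illustration, and whether it is actually faster on the real data has not been measured here.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "td": [f"td{i}" for i in range(13)],
    "md": [f"md{i}" for i in range(13)],
    "t": [0, 1, 2, 3, 0, 1, 2, 3, 4, 5, 0, 1, 2],
    "pt": ["n", "n", "X", "n", "n", "n", "n", "X", "n", "n", "n", "X", "n"],
})

# A new segment starts every time t is 0.
segment = df["t"].eq(0).cumsum()

# 1 where pt is 'X', else 0.
is_x = df["pt"].eq("X").astype(int)

# Number of X rows seen strictly before the current row within its segment:
# the running count including the current row, minus the current row itself.
x_before = is_x.groupby(segment).cumsum() - is_x

# Take md once an X has already been passed in the segment, otherwise td.
df["rs"] = np.where(x_before.gt(0), df["md"], df["td"])
print(df)
```

Subtracting is_x from the running count plays the same role as shift() in the answer: the X row itself still counts as part of the td range.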

6 Comments

This looks like a work of genius to me. Honestly :-) Unfortunately, I do not yet understand how the groupby and apply(lambda ...) work together. Could you explain this a little?
groupby gathers all the records that share the same property. Here, df.groupby(some_series) groups the records that have the same value in the series. Then df.groupby(some_series).pt looks only at the pt column of each group, which is the series x passed to apply.
Thanks, that was already helpful. I think my main difficulty is still understanding how df.groupby(some_series).pt combined with apply produces the proper pt_range as a result.
Print df.t.eq(0).cumsum() to see what it looks like.
First, make sure you understand what each of the segments above is (they are the consecutive segments between zeros of t). Now, back to apply(lambda x: ...): it takes each abc[0][1] and passes it to the lambda.
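
To make the explanation in these comments concrete, the following sketch prints the intermediate pieces one at a time. It only uses the t and pt columns from the question; the names segment, per_segment and flags are invented for this illustration.

```python
import pandas as pd

# Only the two columns that drive the grouping logic.
df = pd.DataFrame({
    "t": [0, 1, 2, 3, 0, 1, 2, 3, 4, 5, 0, 1, 2],
    "pt": ["n", "n", "X", "n", "n", "n", "n", "X", "n", "n", "n", "X", "n"],
})

# Step 1: every t == 0 bumps the counter, so each run of t gets its own label.
segment = df["t"].eq(0).cumsum()
print(segment.tolist())  # [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3]

# Step 2: look at the pt column of each segment separately. Each x here is
# the Series that apply would hand to the lambda.
per_segment = {}
for key, x in df.groupby(segment)["pt"]:
    # shift() moves values one row down, so the flag turns on only on the
    # row after an X; cumsum() keeps it on for the rest of the segment.
    flags = x.shift().eq("X").cumsum()
    per_segment[int(key)] = flags.tolist()
    print(key, x.tolist(), flags.tolist())
```

A flag of 0 means "still at or before the X", so td is used; any non-zero flag means "past the X", so md is used.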

I have built an algorithm to break the series after each X, but I am not sure how efficient it will be.

    # store pt to list
    pt_list = df.pt.tolist()

    # iterate through the list to get the index of each n after each X
    md_map = {}
    for idx, item in enumerate(pt_list):
        if item == "X" and idx != df.index.max():
            key = idx+1
            value = "md"
            md_map[key] = value

    # map it with data frame
    df["td_md"] = df.index.map(md_map)

    # fill the na with td
    df["td_md"] = df.td_md.fillna("td")

    # create rs column from index and td_md
    df["rs"] = df.td_md + df.index.astype(str)

I did not think about each and every condition, but you will have to build something like this.
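
One condition the loop above does not cover: it flags only the row directly after each X, so on the sample data row 9 would still get td rather than md. A minimal sketch of one way to carry the flag forward until t returns to 0 is shown below. It reads the values from the td and md columns instead of building strings from the index; the names after_x and use_md are made up here, and a plain Python loop like this may be slow on a very long dataframe.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "td": [f"td{i}" for i in range(13)],
    "md": [f"md{i}" for i in range(13)],
    "t": [0, 1, 2, 3, 0, 1, 2, 3, 4, 5, 0, 1, 2],
    "pt": ["n", "n", "X", "n", "n", "n", "n", "X", "n", "n", "n", "X", "n"],
})

use_md = []      # one boolean per row: True means "take md"
after_x = False  # becomes True once an X has been passed in the current run

for t_val, pt_val in zip(df["t"], df["pt"]):
    if t_val == 0:
        after_x = False      # a new run of t starts, so go back to td
    use_md.append(after_x)   # the X row itself is recorded before the flag flips
    if pt_val == "X":
        after_x = True       # every following row in this run takes md

# Keep td where use_md is False, otherwise take md.
df["rs"] = df["td"].where(~pd.Series(use_md, index=df.index), df["md"])
print(df)
```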
