
I want to group by "ts_code" and, over the last N rows of each group, calculate the percentage change between the max of one column (high) and the min of another column (low) taken after that max occurs. Specifically,

df

   ts_code  high  low
0        A    20   10
1        A    30    5
2        A    40   20
3        A    50   10
4        A    20   30
5        B    50   10
6        B    30    5
7        B    40   20
8        B    10   10
9        B    20   30

Goal

Below is my expected result

   ts_code  high  low  l3_high_low_pct_chg  l4_high_low_pct_chg
0        A    20   10                  NaN                  NaN
1        A    30    5                  NaN                  NaN
2        A    40   20                 0.50                  NaN
3        A    50   10                 0.80                 0.80
4        A    20   30                 0.80                 0.80
5        B    50   10                  NaN                  NaN
6        B    30    5                  NaN                  NaN
7        B    40   20                 0.90                  NaN
8        B    10   10                 0.75                 0.90
9        B    20   30                 0.75                 0.75

lN_high_low_pct_chg (e.g. l3_high_low_pct_chg) = 1 - (the min value of the low column after the high peak) / (the max value of the high column), computed over the last N rows, for each group and each row.
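For a single window, the calculation can be sketched like this (a minimal illustration of the formula above; the helper name window_pct_chg is hypothetical):

def window_pct_chg(highs, lows):
    # position of the peak of the high column within the window
    peak = max(range(len(highs)), key=lambda i: highs[i])
    # lowest low at or after the peak
    min_low_after_peak = min(lows[peak:])
    return 1 - min_low_after_peak / max(highs)

# Group A, window of the last 3 rows ending at index 2:
print(window_pct_chg([20, 30, 40], [10, 5, 20]))  # 0.5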

Attempt and problem

df['l3_highest'] = df.groupby('ts_code')['high'].transform(lambda x: x.rolling(3).max())
df['l3_lowest'] = df.groupby('ts_code')['low'].transform(lambda x: x.rolling(3).min())
df['l3_high_low_pct_chg'] = 1 - df['l3_lowest'] / df['l3_highest']

But this fails: for the third row (index 2) of group A, l3_lowest would be 5, not 20, because a plain rolling min ignores where the peak of the high column occurs. I don't know how to restrict the calculation to the rows after the peak.

For the last 4 rows: at index 8, the window's max high is 50 (index 5) and the min low at or after that peak is 5, so l4_high_low_pct_chg = 1 - 5/50 = 0.9. At index 9, the max high is 40 (index 7) and the min low after it is 10, so l4_high_low_pct_chg = 1 - 10/40 = 0.75.
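As a hand-check of those two windows (using the group B values from the expected result above):

# index 8, n=4 -> group B rows 5..8
highs, lows = [50, 30, 40, 10], [10, 5, 20, 10]
# peak high is 50 at position 0; lows from the peak on are [10, 5, 20, 10], min 5
print(1 - 5 / 50)    # 0.9
# index 9, n=4 -> group B rows 6..9
highs, lows = [30, 40, 10, 20], [5, 20, 10, 30]
# peak high is 40 at position 1; lows from the peak on are [20, 10, 30], min 10
print(1 - 10 / 40)   # 0.75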

Another test dataset

  • If the rolling window is 52, then for the hy_code 880912 group at index 1252, l52_high_low_pct_chg would be 0.281131, and for the 880301 group at index 1251 it would be 0.321471.
  • Can you explain how the 0.6 and 0.9 values are achieved? Which rows and values lead to those results? Commented Mar 7, 2022 at 15:36
  • @aaossa I updated and explained it. Commented Mar 8, 2022 at 8:25
  • For index 8, if we are looking at a rolling window of 4, then "the min value of the low column after the peak" would be 10, because the peak is 20 at index 7. The high between index 5 and 8 would be 50, so the formula would be 1-(10/50) = 0.80 Please explain what the formula is that gives you low=5. Commented Mar 8, 2022 at 14:37
  • @TroyD By "peak" I mean the peak of the high column only. For index 8, the max is 50, so the index of the min value from the low column must be equal to or greater than 5 for group B, and 5 is the lowest low between indexes 5 and 8. Commented Mar 9, 2022 at 0:19
  • I changed my answer to use the 'high' column for the peak instead of the 'low' column. There is still a disagreement on the expected result, though. For index 3, you have 1-10/50, which is 0.8, and we agree on that. For index 4, you have 0.4, which I assume is 1-30/50=0.4. So for index 3, the high and the low are on the same row, so the low comes from the same row as the peak. For index 4, the low is 30, presumably because you're not allowing the low to be 10, since that's on the same row as the peak. So it seems the logic for index 3 and index 4 is not the same. Commented Mar 9, 2022 at 3:00

2 Answers

Grouping by 'ts_code' is just a trivial groupby() call. DataFrame.rolling() operates on single columns, so it's tricky to apply when you need data from multiple columns. You can use rolling_apply from numpy_ext, as in this example: Pandas rolling apply using multiple columns. However, I just created a function that manually slices the dataframe into n-length sub-dataframes, then applies a function to calculate the value. idxmax() finds the index of the peak of the high column, then we take the min() of the low values that follow. The rest is pretty straightforward.

import numpy as np
import pandas as pd

df = pd.DataFrame([['A', 20, 10],
                   ['A', 30, 5],
                   ['A', 40, 20],
                   ['A', 50, 10],
                   ['A', 20, 30],
                   ['B', 50, 10],
                   ['B', 30, 5],
                   ['B', 40, 20],
                   ['B', 10, 10],
                   ['B', 20, 30]],
                  columns=['ts_code', 'high', 'low'])

def custom_f(df, n):
    s = pd.Series(np.nan, index=df.index)

    def sub_f(df_):
        high_peak_idx = df_['high'].idxmax()
        min_low_after_peak = df_.loc[high_peak_idx:]['low'].min()
        max_high = df_['high'].max()
        return 1 - min_low_after_peak / max_high

    for i in range(df.shape[0] - n + 1):
        df_ = df.iloc[i:i + n]
        s.iloc[i + n - 1] = sub_f(df_)

    return s

df['l3_high_low_pct_chg'] = df.groupby("ts_code").apply(custom_f, 3).values
df['l4_high_low_pct_chg'] = df.groupby("ts_code").apply(custom_f, 4).values

print(df)

If you prefer to use the rolling function, this method gives the same output:

def rolling_f(rolling_df):
    df_ = df.loc[rolling_df.index]
    high_peak_idx = df_['high'].idxmax()
    min_low_after_peak = df_.loc[high_peak_idx:]["low"].min()
    max_high = df_['high'].max()
    return 1 - min_low_after_peak / max_high

df['l3_high_low_pct_chg'] = df.groupby("ts_code").rolling(3).apply(rolling_f).values[:, 0]
df['l4_high_low_pct_chg'] = df.groupby("ts_code").rolling(4).apply(rolling_f).values[:, 0]

print(df)

Finally, if you want to do a true rolling-window calculation that avoids any index lookup, you can use numpy_ext (https://pypi.org/project/numpy-ext/):

from numpy_ext import rolling_apply

def np_ext_f(rolling_df, n):
    def rolling_apply_f(high, low):
        return 1 - low[np.argmax(high):].min() / high.max()
    try:
        return pd.Series(rolling_apply(rolling_apply_f, n, rolling_df['high'].values, rolling_df['low'].values),
                         index=rolling_df.index)
    except ValueError:
        return pd.Series(np.nan, index=rolling_df.index)

df['l3_high_low_pct_chg'] = df.groupby('ts_code').apply(np_ext_f, n=3).sort_index(level=1).values
df['l4_high_low_pct_chg'] = df.groupby('ts_code').apply(np_ext_f, n=4).sort_index(level=1).values

print(df)

Output:

  ts_code  high  low  l3_high_low_pct_chg  l4_high_low_pct_chg
0       A    20   10                  NaN                  NaN
1       A    30    5                  NaN                  NaN
2       A    40   20                 0.50                  NaN
3       A    50   10                 0.80                 0.80
4       A    20   30                 0.80                 0.80
5       B    50   10                  NaN                  NaN
6       B    30    5                  NaN                  NaN
7       B    40   20                 0.90                  NaN
8       B    10   10                 0.75                 0.90
9       B    20   30                 0.75                 0.75

For large datasets, the speed of these operations becomes an issue. So, to compare the speed of these different methods, I created a timing function:

import time

def timeit(f):
    def timed(*args, **kw):
        ts = time.time()
        result = f(*args, **kw)
        te = time.time()
        print('func:%r took: %2.4f sec' % (f.__name__, te - ts))
        return result
    return timed

Next, let's make a large DataFrame, just by copying the existing dataframe 500 times:

df = pd.concat([df for x in range(500)], axis=0)
df = df.reset_index()

Finally, we run the three methods under the timing decorator:

@timeit
def method_1():
    df['l52_high_low_pct_chg'] = df.groupby("ts_code").apply(custom_f, 52).values

method_1()

@timeit
def method_2():
    df['l52_high_low_pct_chg'] = df.groupby("ts_code").rolling(52).apply(rolling_f).values[:, 0]

method_2()

@timeit
def method_3():
    df['l52_high_low_pct_chg'] = df.groupby('ts_code').apply(np_ext_f, n=52).sort_index(level=1).values

method_3()

Which gives us this output:

func:'method_1' took: 2.5650 sec
func:'method_2' took: 15.1233 sec
func:'method_3' took: 0.1084 sec

So the fastest method is numpy_ext, which makes sense because it is optimized for vectorized calculations over raw arrays. The second fastest is the custom function I wrote, which is somewhat efficient because it does some vectorized calculations while also doing some Pandas lookups. The slowest method by far is the Pandas rolling function.
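As a side note, if adding numpy_ext as a dependency is not an option, a similar raw-array approach can be sketched with NumPy's sliding_window_view (available since NumPy 1.20). This is only a sketch under those assumptions, not part of the timed comparison above:

import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def np_windows_f(group, n):
    high = group['high'].to_numpy(dtype=float)
    low = group['low'].to_numpy(dtype=float)
    out = np.full(len(group), np.nan)
    if len(group) >= n:
        hw = sliding_window_view(high, n)   # shape: (len(group) - n + 1, n)
        lw = sliding_window_view(low, n)
        peaks = hw.argmax(axis=1)           # position of the high peak per window
        mins = np.array([l[p:].min() for l, p in zip(lw, peaks)])
        out[n - 1:] = 1 - mins / hw.max(axis=1)
    return pd.Series(out, index=group.index)

df['l52_high_low_pct_chg'] = df.groupby('ts_code').apply(np_windows_f, n=52).sort_index(level=1).values

This keeps everything in NumPy while still letting groupby handle the per-group boundaries.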


14 Comments

I updated and explained it.
Is using a loop a must?
@Jack I added a rolling method that may look a little cleaner. Inside the function, it's either doing df_=df.loc[rolling_df.index] or df_=df.iloc[i:i+n], so either way it's iterating over an index range and looking up a subsection of the dataframe.
@Jack I'm afraid I don't understand what you mean. It will give you an error if the rolling window is greater than the size of the dataframe. I'm not sure if that's the problem you're referring to though. Can you be more specific about what you're expecting the answer to be?
Sorry, I can get the right answer for the latest data if the dataframe is sorted by the group column (like hy_code) and the index is re-sequenced.

For my solution, we'll use .groupby("ts_code") and then .rolling to process windows of a certain size with a custom function. This custom function takes each group and, instead of applying a function directly to the received values, uses those values to query the original dataframe. Then we can calculate the values you expect: find the row where the "high" peak is, look at the following rows to find the minimum "low" value, and finally calculate the result using your formula:

def custom_function(group, df):
    # Query the original dataframe using the group values
    group = df.loc[group.values]
    # Calculate your formula
    high_peak_row = group["high"].idxmax()
    min_low_after_peak = group.loc[high_peak_row:, "low"].min()
    return 1 - min_low_after_peak / group.loc[high_peak_row, "high"]

# Reset the index to roll over that column and be able to query the original dataframe
df["l3_high_low_pct_chg"] = df.reset_index().groupby("ts_code")["index"].rolling(3).apply(custom_function, args=(df,)).values
df["l4_high_low_pct_chg"] = df.reset_index().groupby("ts_code")["index"].rolling(4).apply(custom_function, args=(df,)).values

Output:

  ts_code  high  low  l3_high_low_pct_chg  l4_high_low_pct_chg
0       A    20   10                  NaN                  NaN
1       A    30    5                  NaN                  NaN
2       A    40   20                 0.50                  NaN
3       A    50   10                 0.80                 0.80
4       A    20   30                 0.80                 0.80
5       B    50   10                  NaN                  NaN
6       B    30    5                  NaN                  NaN
7       B    40   20                 0.90                  NaN
8       B    10   10                 0.75                 0.90
9       B    20   30                 0.75                 0.75

We can take this idea further and group only once:

groups = df.reset_index().groupby("ts_code")["index"]
for n in [3, 4]:
    df[f"l{n}_high_low_pct_chg"] = groups.rolling(n).apply(custom_function, args=(df,)).values

4 Comments

Are you using it on your real data or the sample data? Also, you said that Troy D's second method does not provide the answer you want, but here you said that the output of my method is not the same as that one. So are we both wrong? Is his first method correct?
What's the output you expect for a rolling window of 52?
Sorry, I can get the right answer for the latest data if the dataframe is sorted by the group column (like hy_code) and the index is re-sequenced.
But the data is large: if my data from GitHub is repeated 1,000,000 times, it will run slowly (the window size is 52).
