Pandas new dataframe by rolling the rows

Question

I'm trying to create a new pandas dataframe by rolling the row values in a window, i.e. going from this:

     A   R   N   D   C   Q
    -1  -2  -3  -3  -1  -2
    -1  -2  -3  -3  -1  -2
    -1  -2  -3  -3  -1  -2
    -1  -2  -3  -3  -1  -2

to something like this:

    A1  R1  N1  D1  C1  Q1  A2  R2  N2  D2  C2  Q2  …  An  Rn  Nn  Dn  Cn  Qn
    -1  -2  -3  -3  -1   a  -1  -2  -3  -3  -1   b
    -1  -2  -3  -3  -1   b  -1  -2  -3  -3  -1   c
    -1  -2  -3  -3  -1   c  -1  -2  -3  -3  -1   d
    -1  -2  -3  -3  -1   d   .   .   .   .   .   .

It is similar to a rolling window over a string, e.g. EXAM with window 3 yields EXA, XAM. The key difference here is that instead of letters, I'm trying to create windows of rows. This new dataframe will be used for training an SVM. Although I could create another column with a scaled value corresponding to the other columns (a single column is easier to roll), I think I would lose some information; that's why I'm using the complete columns.
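To make the string analogy concrete, here is a minimal sketch of it. The helper name `string_windows` is hypothetical and only serves the illustration.

```python
def string_windows(text, size):
    """Return every run of `size` consecutive characters in `text`."""
    # A string of length L has L - size + 1 windows of the given size.
    return [text[i:i + size] for i in range(len(text) - size + 1)]


print(string_windows("EXAM", 3))  # ['EXA', 'XAM']
```

The row-wise version asked about here follows the same pattern, with rows of the dataframe taking the place of characters.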

In essence, I'm trying to do something like this, but for n window size:

Mirko Salaris · Accepted Answer · 2020-08-21 13:07:23Z

You can use numpy indexing to accomplish this:

    In [1]: import pandas as pd
       ...: import numpy as np
       ...: import string

    In [2]: abc = list(string.ascii_letters.upper())
       ...: df = pd.DataFrame(dict(a=abc, b=abc[::-1]))
       ...: df.head()
    Out[2]:
       a  b
    0  A  Z
    1  B  Y
    2  C  X
    3  D  W
    4  E  V

    In [3]: # construct an indexing array
       ...: n = 5
       ...: vals = df.values
       ...: idx = np.tile(np.arange(n), (len(df) - n + 1, 1)) + np.arange(len(df) - n + 1).reshape(-1,1)
       ...: idx[:10]
    Out[3]:
    array([[ 0,  1,  2,  3,  4],
           [ 1,  2,  3,  4,  5],
           [ 2,  3,  4,  5,  6],
           [ 3,  4,  5,  6,  7],
           [ 4,  5,  6,  7,  8],
           [ 5,  6,  7,  8,  9],
           [ 6,  7,  8,  9, 10],
           [ 7,  8,  9, 10, 11],
           [ 8,  9, 10, 11, 12],
           [ 9, 10, 11, 12, 13]])

    In [4]: # construct columns and index using flattened index array
       ...: cols = [ "{}_{}".format(c,str(i)) for i in range(n) for c in df.columns]
       ...: df2 = pd.DataFrame(vals[idx.flatten()].reshape(len(df)-n+1,df.shape[1]*n), columns=cols)
       ...: df2.head()
    Out[4]:
      a_0 b_0 a_1 b_1 a_2 b_2 a_3 b_3 a_4 b_4
    0   A   Z   B   Y   C   X   D   W   E   V
    1   B   Y   C   X   D   W   E   V   F   U
    2   C   X   D   W   E   V   F   U   G   T
    3   D   W   E   V   F   U   G   T   H   S
    4   E   V   F   U   G   T   H   S   I   R
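As a possible alternative to building the index array by hand, the same layout can be sketched with `numpy.lib.stride_tricks.sliding_window_view`. This is not part of the answer above. It assumes NumPy 1.20 or later and numeric data, and the helper name `roll_rows` and the small example frame are made up for illustration.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view


def roll_rows(df, n):
    """Return a frame whose i-th row is rows i..i+n-1 of `df` laid side by side."""
    vals = df.to_numpy()
    # Shape (len(df) - n + 1, n_cols, n): the window axis is appended last.
    windows = sliding_window_view(vals, n, axis=0)
    # Put the window axis before the column axis, then flatten each window
    # so the values appear row by row, matching the column names below.
    flat = windows.transpose(0, 2, 1).reshape(len(df) - n + 1, -1)
    cols = ["{}_{}".format(c, i) for i in range(n) for c in df.columns]
    return pd.DataFrame(flat, columns=cols)


# Hypothetical numeric example: 6 rows, 2 columns, window of 3 rows.
df = pd.DataFrame(np.arange(12).reshape(6, 2), columns=["a", "b"])
rolled = roll_rows(df, 3)
print(rolled)
```

With 6 input rows and a window of 3, this gives 6 - 3 + 1 = 4 output rows, and the last one contains the final row of `df`.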

Thanks! It works like a charm. I will look into the code more, but I guess this does the job perfectly.
Hong, I encountered something strange while scaling the code. In some instances, it is skipping the last row in the dataframe.
In idx = np.tile(np.arange(5), (len(df) - 5,1)) + np.arange(len(df) - 5).reshape(-1,1) I subtract 5 to avoid indexing errors. This could be the issue... You could also pad the end of your dataframe with NaNs.
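The skipped last row mentioned in these comments can be checked with a small sketch. It compares an index array with `len(df) - n` windows against one with `len(df) - n + 1` windows. The helper `window_index` and the sizes used are hypothetical, and broadcasting is used in place of `np.tile` for brevity.

```python
import numpy as np


def window_index(n_rows, n, include_last=True):
    """Row indices for each window of n consecutive rows."""
    # There are n_rows - n + 1 complete windows; using n_rows - n drops the last.
    count = n_rows - n + 1 if include_last else n_rows - n
    # Broadcasting a column of window starts against a row of offsets.
    return np.arange(count).reshape(-1, 1) + np.arange(n)


n_rows, n = 8, 5
short = window_index(n_rows, n, include_last=False)
full = window_index(n_rows, n)

print(short.max())  # highest row index reached with n_rows - n windows
print(full.max())   # highest row index reached with n_rows - n + 1 windows
```

With 8 rows and a window of 5, the shorter variant never reaches row 7, while the variant with `+ 1` ends with the window covering rows 3 to 7 and still stays in bounds.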
