I'm trying to create a new pandas DataFrame by rolling a window over the rows, i.e. going from this:

     A   R   N   D   C   Q
    -1  -2  -3  -3  -1  -2
    -1  -2  -3  -3  -1  -2
    -1  -2  -3  -3  -1  -2
    -1  -2  -3  -3  -1  -2

to something like this:

    A1  R1  N1  D1  C1  Q1  A2  R2  N2  D2  C2  Q2  …  An  Rn  Nn  Dn  Cn  Qn
    -1  -2  -3  -3  -1  a   -1  -2  -3  -3  -1  b
    -1  -2  -3  -3  -1  b   -1  -2  -3  -3  -1  c
    -1  -2  -3  -3  -1  c   -1  -2  -3  -3  -1  d
    -1  -2  -3  -3  -1  d   .   .   .   .   .   .

It is similar to a rolling window over a string: EXAM with a window of 3 yields EXA, XAM. The key difference is that instead of letters, I'm building the windows out of rows. The new dataframe will be used to train an SVM. Although I could create a single column with a scaled value derived from the other columns (a single column is easier to roll), I think I would lose some information, which is why I'm using the complete columns.
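To make the analogy concrete, here is a minimal sketch in plain Python. The helper name `windows` and the sample rows are made up for illustration; the point is only that the same slicing works for letters and for rows.

```python
def windows(seq, n):
    """Return every run of n consecutive items of seq, in order."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


# Letters: the example from the question.
print(windows("EXAM", 3))  # ['EXA', 'XAM']

# Rows: each window holds n consecutive rows, which are then joined
# into one long row of n * (number of columns) values.
rows = [[-1, -2], [-3, -3], [-1, -2], [-2, -1]]  # hypothetical data
flat = [sum(w, []) for w in windows(rows, 2)]
print(flat)
```

A sequence of length N yields N - n + 1 windows, so four rows with a window of 2 give three widened rows.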

In essence, I'm trying to do something like this, but for n window size:


1 Answer

You can use numpy indexing to accomplish this:

    In [1]: import pandas as pd
       ...: import numpy as np
       ...: import string
       ...:

    In [2]: abc = list(string.ascii_letters.upper())
       ...: df = pd.DataFrame(dict(a=abc, b=abc[::-1]))
       ...: df.head()
       ...:
    Out[2]:
       a  b
    0  A  Z
    1  B  Y
    2  C  X
    3  D  W
    4  E  V

    In [3]: # construct an indexing array
       ...: n = 5
       ...: vals = df.values
       ...: idx = np.tile(np.arange(n), (len(df) - n + 1, 1)) + np.arange(len(df) - n + 1).reshape(-1,1)
       ...: idx[:10]
       ...:
    Out[3]:
    array([[ 0,  1,  2,  3,  4],
           [ 1,  2,  3,  4,  5],
           [ 2,  3,  4,  5,  6],
           [ 3,  4,  5,  6,  7],
           [ 4,  5,  6,  7,  8],
           [ 5,  6,  7,  8,  9],
           [ 6,  7,  8,  9, 10],
           [ 7,  8,  9, 10, 11],
           [ 8,  9, 10, 11, 12],
           [ 9, 10, 11, 12, 13]])

    In [4]: # construct columns and index using flattened index array
       ...: cols = [ "{}_{}".format(c,str(i)) for i in range(n) for c in df.columns]
       ...: df2 = pd.DataFrame(vals[idx.flatten()].reshape(len(df)-n+1,df.shape[1]*n), columns=cols)
       ...: df2.head()
       ...:
    Out[4]:
      a_0 b_0 a_1 b_1 a_2 b_2 a_3 b_3 a_4 b_4
    0   A   Z   B   Y   C   X   D   W   E   V
    1   B   Y   C   X   D   W   E   V   F   U
    2   C   X   D   W   E   V   F   U   G   T
    3   D   W   E   V   F   U   G   T   H   S
    4   E   V   F   U   G   T   H   S   I   R
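For reuse with an arbitrary window size, the same indexing idea could be wrapped in a function. This is only a sketch: the name `rolling_rows` and the small example frame are assumptions, and it builds the index array by broadcasting instead of `np.tile`, which should produce the same array.

```python
import numpy as np
import pandas as pd


def rolling_rows(df, n):
    """Stack every run of n consecutive rows of df side by side."""
    if not 1 <= n <= len(df):
        raise ValueError("n must be between 1 and len(df)")
    n_windows = len(df) - n + 1
    # A column of window starts plus a row of in-window offsets
    # broadcasts to an (n_windows, n) array of row positions.
    idx = np.arange(n_windows)[:, None] + np.arange(n)
    # Indexing gives shape (n_windows, n, n_columns); merge the last two axes.
    vals = df.to_numpy()[idx].reshape(n_windows, n * df.shape[1])
    cols = ["{}_{}".format(c, i) for i in range(n) for c in df.columns]
    return pd.DataFrame(vals, columns=cols)


# Hypothetical example frame, smaller than the one in the answer.
df = pd.DataFrame({"a": list("ABCDEF"), "b": list("ZYXWVU")})
out = rolling_rows(df, 3)
print(out)
```

Because the values pass through a single NumPy array, a frame with mixed column types comes back with object columns; converting afterwards (for example with `out.infer_objects()`) may be needed before feeding it to a model.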

3 Comments

Thanks! It works like a charm. I will look into the code more, but I guess this does the job perfectly.
Hong, I encountered something strange while scaling the code. In some instances, it is skipping the last row in the dataframe.
In idx = np.tile(np.arange(5), (len(df) - 5,1)) + np.arange(len(df) - 5).reshape(-1,1) I subtract 5 to avoid indexing errors. This could be the issue... You could also pad the end of your dataframe with NaNs.
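The skipped last row looks like an off-by-one in the number of windows: a frame of N rows has N - n + 1 windows of width n, so using N - n drops the final one. A small sketch of both the count and the NaN-padding idea, using a made-up four-row frame (the variable names are illustrative only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 2, 3], "b": [10, 11, 12, 13]})
n = 2

# Window start positions for a frame of 4 rows and a window of 2.
full = np.arange(len(df) - n + 1)  # starts 0, 1, 2: the last window covers rows 2 and 3
short = np.arange(len(df) - n)     # starts 0, 1: row 3 never appears in any window

# Padding: append n - 1 all-NaN rows so that every original row starts a window.
padded = df.reindex(range(len(df) + n - 1))
starts = np.arange(len(padded) - n + 1)

print(full, short, starts)
print(padded)
```

Note that reindexing with missing labels turns the integer columns into floats, because NaN has no integer representation in the default dtypes.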
