Let $X = (X_1, \dots, X_n)$ be a univariate time series. I would like to know how to standardize my data when I split it into train and test sets. Let me explain how I transform $X$ so that I can fit an LSTM neural net. From $X$ I build a new input array and its corresponding output array, so we have: $X = ( (X_1, \dots, X_m), \dots , (X_{n-m}, \dots, X_{n-1}) )$
$Y = (X_{m+1}, ..., X_n)$
$\text{Card}X = \text{Card}Y$
Let $p$ be the size of my test set. Using Python's slicing notation, we have:
$X_{train} = X[:-p]$
$X_{test} = X[-p:]$
The same goes for $Y$. Now, I am wondering how to standardize my data. I think that standardizing $X$ before splitting it into train and test sets could lead to over-fitting, since we would apply a transformation that involves all the $X_i$, including the test values. Basically, I am not sure whether statistics (mean, standard deviation) computed over the full series would leak information from the test set. So I think it would be better to compute the mean and the standard deviation on the training set only, and use them to standardize both the train and the test sets. It makes no sense to me to standardize them separately, since $\text{Card}\,X_{test} \ll \text{Card}\,X_{train}$. But maybe I am wrong. I would also like to know whether I have to standardize both $X$ and $Y$, or just $X$. When working with an MLP neural net, I used to normalize only the input data.
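To make my question concrete, here is a minimal sketch of what I have in mind, using NumPy (the synthetic series, the variable names, and the values of $m$ and $p$ are just for illustration):

```python
import numpy as np

def make_windows(x, m):
    """Turn a 1-D series into (window, next-value) pairs."""
    X = np.array([x[i:i + m] for i in range(len(x) - m)])
    Y = x[m:]
    return X, Y

rng = np.random.default_rng(0)
series = rng.normal(size=100).cumsum()  # synthetic series for illustration

m, p = 5, 20  # window size and test-set size (illustrative values)
X, Y = make_windows(series, m)
X_train, X_test = X[:-p], X[-p:]
Y_train, Y_test = Y[:-p], Y[-p:]

# Fit the scaling statistics on the training inputs only...
mu, sigma = X_train.mean(), X_train.std()

# ...and apply the same transform to both splits (and to the targets,
# so that predictions can later be inverted with the same mu/sigma).
X_train_s = (X_train - mu) / sigma
X_test_s = (X_test - mu) / sigma
Y_train_s = (Y_train - mu) / sigma
Y_test_s = (Y_test - mu) / sigma
```

This way the test set never contributes to the statistics, and the targets live on the same scale as the inputs.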
So, thank you for reading; if you have any ideas, remarks, or questions, please let me know. I can explain more if needed :)
P.S. I couldn't find a 'standardization' tag, so I used the tag named normalization instead.