Why is linear regression doing well on time series data?

Question

This follows on from this question.

I have the following task: I have time series data and want to train on 3 consecutive days to predict the 4th day. Each day's data is one CSV file of dimension 24x25, and every data point in each CSV file is a pixel.

I need to predict day4 (the 4th day) using day1, day2, day3 (the three consecutive prior days) as training data, and then calculate the MSE between the predicted day4 data and the original day4 data. Let's call it mse1.

Similarly, I need to predict day5 (the 5th day) using day2, day3, day4 as training data, and then calculate mse2 (MSE between predicted day5 data and original day5 data).

Then I need to predict day6 (the 6th day) using day3, day4, day5 as training data, and calculate mse3 (MSE between predicted day6 data and original day6 data).

..........

Finally, I want to predict day93 using day90, day91, day92 as training data, and calculate mse90 (MSE between predicted day93 data and original day93 data).

I want to use linear regression in this case, which gives 90 MSE values for this model.

    import os
    import pandas as pd
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import MinMaxScaler
    import matplotlib.pyplot as plt

    # Paths
    data_folder = r'C:\Users\alokj\OneDrive\Desktop\jupyter_proj\All_data'
    output_folder = r'C:\Users\alokj\OneDrive\Desktop\jupyter_proj\90_days_merged'

    # Ensure the output folder exists
    os.makedirs(output_folder, exist_ok=True)

    # List all CSV files in the folder
    csv_files = [f for f in os.listdir(data_folder) if f.endswith('.csv')]

    # Sort the files based on the numeric part extracted from the filename
    csv_files = sorted(csv_files, key=lambda x: int(x.split('_Day')[1].split('_')[0]))

    # Prepare data
    data_list = [pd.read_csv(os.path.join(data_folder, file), header=None).values for file in csv_files]
    data_array = np.array(data_list)  # Shape: (num_days, 24, 25)

    # Flatten the data for easier handling in regression models
    num_days, rows, cols = data_array.shape
    data_flattened = data_array.reshape(num_days, -1)  # Shape: (num_days, 600)

    # Prepare features and target matrix for range (3, num_days)
    X = np.array([data_flattened[i-3:i].flatten() for i in range(3, num_days)])  # Shape: (num_days-3, 1800)
    y = data_flattened[3:num_days]  # Target is the 4th day in each sequence

    # Train-Test Split and Validation (Separate fixed split)
    train_size = int(0.8 * len(X))  # 80% for training
    X_train = X[:train_size]
    y_train = y[:train_size]
    X_test = X[train_size:]
    y_test = y[train_size:]

    # Scaling the data
    scaler_X = MinMaxScaler()
    scaler_X.fit(X_train)  # Fit on training set
    X_train_scaled = scaler_X.transform(X_train)
    X_test_scaled = scaler_X.transform(X_test)

    scaler_y = MinMaxScaler()
    scaler_y.fit(y_train)  # Fit on training set
    y_train_scaled = scaler_y.transform(y_train)
    y_test_scaled = scaler_y.transform(y_test)

    ### Linear Regression
    lr_model = LinearRegression()
    lr_model.fit(X_train_scaled, y_train_scaled)
    y_pred_test_scaled_lr = lr_model.predict(X_test_scaled)
    y_pred_test_lr = scaler_y.inverse_transform(y_pred_test_scaled_lr)

    # Validation for Days 3 to 93
    # (range upper bound set to 93 so that XX matches the stated shape and yy)
    XX = np.array([data_flattened[i-3:i].flatten() for i in range(3, 93)])  # Shape: (90, 1800)
    yy = data_flattened[3:93]  # Target for validation
    yy_pred_lr = lr_model.predict(scaler_X.transform(XX))
    yy_pred_lr = scaler_y.inverse_transform(yy_pred_lr)

    # Calculate residuals for Linear Regression
    residuals_lr = [np.mean((yy[i] - yy_pred_lr[i])**2) for i in range(len(yy))]

    # Plot residuals for all models
    days = [f'Day {i+4}' for i in range(90)]  # Labels from Day 4 to Day 93
    plt.figure(figsize=(12, 6))
    plt.plot(days, residuals_lr, label='Linear Regression Residuals', marker='o')

    # Configure plot
    plt.xticks(ticks=range(0, len(days), 2),
               labels=[f'Day {i+4}' for i in range(0, len(days), 2)],
               rotation=45, ha='right')
    plt.xlabel('Days (Validation Set)')
    plt.ylabel('Residuals (MSE)')
    plt.title('Residuals for Models (Validation Set)')
    plt.legend()
    plt.grid(True)

    # Save and show plot
    plt.savefig(os.path.join(output_folder, 'residuals_plot_models_comparison_with_naive.png'))
    plt.show()

My result:

We know that linear regression models often do not do very well with time series data because the assumption of independent and identically distributed data is usually violated.

But in my case, judging from the above plot, the regression model is doing exceptionally well (the mean squared error is very low, close to zero). Could anybody check the regression model in my code for mistakes or bugs that I might not be aware of?

I implemented my code based on the suggestions in @RobertLong's answer to the question linked above.

Link to the folder with all 93 days of data that I used for the code.

Robert Long · Accepted Answer · 2024-11-29 18:15:25Z

Linear Regression for Time Series Prediction

Your sliding-window approach to time series prediction seems promising, but the rather low MSE values call for further scrutiny. Below is a detailed code review, focusing on debugging, validation, and suggestions for improvement.

Mathematical Framework

Linear Regression

Linear regression models the relationship between predictors $X$ and a target variable $Y$ as:

$$ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \epsilon $$

where:

$\beta_0$ is the intercept,
$\beta_i$ are the coefficients of the predictors $X_i$,
$p$ is the number of predictors,
$\epsilon$ represents the error term (assumed to follow a normal distribution, $\epsilon \sim \mathcal{N}(0, \sigma^2)$).

In your problem, $X$ consists of the flattened matrices for three consecutive days ($3 \times 24 \times 25 = 1800$ features), and $Y$ represents the flattened matrix of the next day ($1 \times 24 \times 25 = 600$ features).

It is often more convenient to use matrix notation:

Matrix Form

$$ \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} $$

where:

$\mathbf{Y}$ is the $n \times 1$ vector of observed target values,
$\mathbf{X}$ is the $n \times (p + 1)$ matrix of predictors, including a column of ones for the intercept,
$\boldsymbol{\beta}$ is the $(p + 1) \times 1$ vector of coefficients (including the intercept),
$\boldsymbol{\epsilon}$ is the $n \times 1$ vector of error terms.

The goal is to find $\boldsymbol{\beta}$ such that the residual sum of squares (RSS) is minimised:

$$ \text{RSS} = (\mathbf{Y} - \mathbf{X} \boldsymbol{\beta})^\top (\mathbf{Y} - \mathbf{X} \boldsymbol{\beta}) $$

When $\mathbf{X}^\top \mathbf{X}$ is invertible, the solution is obtained via the normal equation:

$$ \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y} $$
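As a rough illustration of the normal equation, the sketch below solves it with NumPy on synthetic data and compares the result with a general least-squares solver. The data, the sizes `n` and `p`, and the values in `true_beta` are made up for illustration and are not taken from the question. Note that the inverse only exists when $\mathbf{X}$ has full column rank; with 1800 features and at most 90 windows, as in the question, it does not, and a solver such as `np.linalg.lstsq` returns a minimum-norm solution instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, well-posed problem (assumed sizes): n = 200 observations, p = 3 predictors.
n, p = 200, 3
X_raw = rng.normal(size=(n, p))
true_beta = np.array([1.0, 2.0, -0.5, 0.3])  # intercept first; illustrative values

# Prepend the column of ones for the intercept, as in the matrix form above.
X = np.column_stack([np.ones(n), X_raw])
Y = X @ true_beta + rng.normal(scale=0.1, size=n)

# Normal equation; solve() avoids forming the explicit inverse.
beta_normal = np.linalg.solve(X.T @ X, X.T @ Y)

# General least-squares solver, which also copes with rank-deficient X.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
```

On a well-conditioned problem like this one, both routes should agree to numerical precision.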

Mean Squared Error (MSE)

The model’s performance is evaluated using the MSE:

$$ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2 $$

where:

$n$ is the number of observations,
$Y_i$ and $\hat{Y}_i$ are the actual and predicted values, respectively.
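For the setting in the question, where each day is a row of 600 pixel values, the formula can be applied row by row to give one MSE per day. A minimal sketch, with a hypothetical helper name `mse_per_day` and a tiny made-up example:

```python
import numpy as np

def mse_per_day(y_true, y_pred):
    """Return one MSE per row (day), averaging over the pixels of that day."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2, axis=1)

# Made-up example: 2 days with 3 pixels each.
y_true = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
y_pred = np.array([[1.0, 2.0, 5.0], [4.0, 5.0, 6.0]])
daily_mse = mse_per_day(y_true, y_pred)
```

Here only one pixel of the first day is off (by 2), so the first day's MSE is 4/3 and the second day's is 0.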

Interpreting "Low MSE"

Before evaluating the results, it is important to define what constitutes "low MSE" in this context. MSE values are relative and depend on several factors:

Scale of the Target Variable:
- MSE depends on the magnitude of the target values. For example, an MSE of 0.001 is small for pixel values normalised between 0 and 1 but could be very large for targets in the range of 0 to 0.01.
- Normalising the target values or using metrics like $R^2$ can provide clearer insight: $$ R^2 = 1 - \frac{\text{MSE}}{\text{Variance of } Y} $$
Baseline Comparison:
- Compare the MSE to simple benchmarks, such as:
  - Persistence Model: Predicting the next day as identical to the previous day.
  - Moving Average: Averaging the past three days.
- If the regression model’s MSE is close to these baselines, its performance may not be exceptional.
Variability in the Data:
- Data with low variability or strong temporal patterns (e.g., high autocorrelation) will naturally lead to lower MSE values, as the task is easier to predict.
- Conversely, noisy or irregular data would make low MSE surprising.
Model Assumptions:
- Linear regression assumes linear relationships and normally distributed, homoscedastic residuals. If these assumptions are violated but MSE remains low, issues like overfitting or data leakage should be considered.
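To make the point about scale concrete, the following sketch computes $R^2$ from the MSE and the target variance for the same absolute error at two target scales. The helper name `r_squared` and all data are assumptions made for illustration; the noise level is chosen so that the MSE is roughly 0.001.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - MSE / Var(y_true), using the population variance."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)
    return 1.0 - mse / np.var(y_true)

rng = np.random.default_rng(1)
y_unit = rng.uniform(0.0, 1.0, size=1000)   # targets spread over [0, 1]
y_small = y_unit * 0.01                     # targets spread over [0, 0.01]
noise = rng.normal(scale=0.03, size=1000)   # same error in both cases, MSE near 0.001

r2_unit = r_squared(y_unit, y_unit + noise)
r2_small = r_squared(y_small, y_small + noise)
```

The same error gives an $R^2$ close to 1 on the unit-scale targets and a strongly negative $R^2$ on the small-scale targets, which is why a raw MSE value says little on its own.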

Clarifying "Low MSE" in This Case

Contextualise the MSE values by:
1. Providing Scale Context: Assess the range of the target variable.
2. Baseline Comparison: Compare against naive models to gauge relative performance.
3. Evaluating Variance: Compare MSE to the variance of the target variable to determine its significance.
4. Benchmarking: Review similar datasets or tasks for comparison.

Defining "low MSE" relative to these factors ensures the results are interpreted meaningfully and any anomalies are identified.

Debugging Suggestions

1. Alignment Check

Misaligned predictors and targets can lead to incorrect training/testing splits. Verify that data indices are properly aligned:

    for i in range(3, num_days):
        print(f"Train: Days {i-3}, {i-2}, {i-1}; Predict: Day {i}")
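Printing the indices shows the intended pairing, but it does not prove that the arrays themselves are aligned. One way to check this, sketched below, is to build the windows from stand-in data in which every value encodes its own (zero-based) day and then assert that each window really precedes its target. The stand-in data and the small `num_days` are assumptions for illustration; the window construction mirrors the one in the question.

```python
import numpy as np

# Stand-in data: day d is a flattened 24x25 matrix filled with the value d.
num_days, rows, cols = 10, 24, 25
data_flattened = np.repeat(np.arange(num_days, dtype=float), rows * cols).reshape(num_days, -1)

# Same sliding-window construction as in the question.
X = np.array([data_flattened[i-3:i].flatten() for i in range(3, num_days)])
y = data_flattened[3:num_days]

for k in range(len(X)):
    window_days = X[k].reshape(3, -1)[:, 0]  # day index of each of the three input days
    target_day = y[k, 0]                     # day index of the target
    assert np.array_equal(window_days, [target_day - 3, target_day - 2, target_day - 1])
```

If the loop finishes without an assertion error, every target is preceded by exactly its three previous days.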

2. Temporal Dependency Analysis

ACF and PACF Plots: Use autocorrelation (ACF) and partial autocorrelation (PACF) plots to understand the temporal structure of your data:
- ACF (Autocorrelation Function): Measures the correlation between time series values at different lags.
- PACF (Partial Autocorrelation Function): Measures the correlation at a given lag after removing the effect of the shorter lags.

How to Generate:

    import statsmodels.api as sm
    import matplotlib.pyplot as plt

    series = data_flattened[:, 0]  # Use the first feature across all days

    fig, axes = plt.subplots(2, 1, figsize=(10, 8))
    sm.graphics.tsa.plot_acf(series, lags=30, ax=axes[0])
    sm.graphics.tsa.plot_pacf(series, lags=30, ax=axes[1])
    plt.show()

Interpretation:
- Strong ACF correlations at short lags suggest that temporal dependencies are significant. Adding lagged variables explicitly as features may improve performance. See the following thread over at Cross Validated for further details on interpretation of ACF/PACF plots:
  
  How to interpret an ACF and PACF together?
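Plots cover one pixel at a time. For a quick numeric summary across many pixels, the lag-$k$ sample autocorrelation can be computed directly, as in the sketch below. The helper name `lag_autocorr` and the two demo series (a linear trend and an alternating series) are assumptions for illustration; in practice the columns of `data_flattened` would take their place.

```python
import numpy as np

def lag_autocorr(series, lag=1):
    """Sample autocorrelation of a 1-D series at the given positive lag."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = np.sum(x * x)
    if denom == 0.0:
        return np.nan  # constant series: autocorrelation is undefined
    return np.sum(x[lag:] * x[:-lag]) / denom

# Made-up stand-in with 93 days and 2 "pixels":
# column 0 is a steady trend, column 1 flips sign every day.
demo = np.column_stack([np.arange(93.0), np.tile([1.0, -1.0], 47)[:93]])
lag1 = np.array([lag_autocorr(demo[:, j], lag=1) for j in range(demo.shape[1])])
```

The trend column has a lag-1 autocorrelation close to 1 and the alternating column close to -1. Values near 1 across most pixels would indicate that yesterday's image already predicts today's very well.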

Validation Strategy

1. Walk-Forward Validation

Train and evaluate the model iteratively using only past data to prevent information leakage:

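    # Reuses np, data_flattened and num_days from the question's code
    from sklearn.linear_model import LinearRegression

    mse_walk = []
    for i in range(4, num_days):  # day 4 is the first with an earlier window to train on
        # Fit only on windows whose target day precedes day i
        X_train = np.array([data_flattened[j-3:j].flatten() for j in range(3, i)])
        y_train = data_flattened[3:i]
        model = LinearRegression().fit(X_train, y_train)
        # Predict day i from the three days before it, then score against the actual day i
        y_pred = model.predict(data_flattened[i-3:i].reshape(1, -1))
        mse_walk.append(np.mean((data_flattened[i] - y_pred[0]) ** 2))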

Why It Matters: This ensures that the model is validated in a way that mimics real-world forecasting, where future data is unavailable during training.

Recommendations for Improvement

Verify the problem

Ensure that you really are dealing with unexpectedly low MSE - check the Interpreting "Low MSE" section.

Validation

Use strict chronological splits for training and testing to prevent leakage.
Test the model on an out-of-sample dataset (e.g., reserve the last few days for testing).
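The difference between in-sample and out-of-sample error matters a great deal here. In the question's code, the final evaluation runs over all 90 windows, the first 80% of which were also used for fitting, so most of the plotted MSE values are in-sample. The sketch below uses pure noise with the same shapes (an assumption for illustration, with `np.linalg.lstsq` standing in for `LinearRegression`) to show that, with far more features than training windows, a least-squares fit can reproduce its training targets almost exactly even when there is nothing to learn.

```python
import numpy as np

rng = np.random.default_rng(3)

# Pure-noise stand-in with the question's shapes: 90 windows, 1800 features, 600 targets.
X = rng.normal(size=(90, 1800))
y = rng.normal(size=(90, 600))

# Chronological 80/20 split: the first 72 windows train, the last 18 test.
train_size = int(0.8 * len(X))
X_train, y_train = X[:train_size], y[:train_size]
X_test, y_test = X[train_size:], y[train_size:]

# Minimum-norm least-squares fit (no intercept, for brevity).
beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

mse_in = np.mean((y_train - X_train @ beta) ** 2)
mse_out = np.mean((y_test - X_test @ beta) ** 2)
print(f"in-sample MSE: {mse_in:.2e}, out-of-sample MSE: {mse_out:.2e}")
```

The in-sample MSE is essentially zero while the out-of-sample MSE is on the order of the target variance, so only the held-out days give an honest picture of performance.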

Benchmarking

Compare against baselines like naive persistence or simple moving averages.
Use MSE reductions relative to the baseline as a measure of model improvement.
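Both baselines and the relative MSE reduction take only a few lines. The sketch below assumes an array laid out like `data_flattened` in the question (one row per day); the helper names `baseline_mse` and `relative_reduction` and the demo data are hypothetical.

```python
import numpy as np

def baseline_mse(data_flattened):
    """Per-day MSE of two naive forecasts for every day that has three predecessors."""
    data = np.asarray(data_flattened, dtype=float)
    target = data[3:]
    persistence = data[2:-1]                                  # previous day
    moving_avg = (data[:-3] + data[1:-2] + data[2:-1]) / 3.0  # mean of the past three days
    mse_persist = np.mean((target - persistence) ** 2, axis=1)
    mse_ma = np.mean((target - moving_avg) ** 2, axis=1)
    return mse_persist, mse_ma

def relative_reduction(mse_model, mse_baseline):
    """Fraction by which the model lowers the baseline's mean MSE (positive is better)."""
    return 1.0 - np.mean(mse_model) / np.mean(mse_baseline)

# Made-up stand-in: 6 days with 2 pixels each, every day filled with its own index.
demo = np.repeat(np.arange(6.0), 2).reshape(6, 2)
mse_persist, mse_ma = baseline_mse(demo)
```

On this steadily increasing demo data, persistence is always off by 1 (MSE 1) and the three-day average by 2 (MSE 4). A regression model is only worth keeping if its out-of-sample MSE is clearly below both.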

Advanced Models

CNNs: Leverage spatial structure in your 24x25 matrices for better feature extraction, particularly when relationships between neighbouring pixels are relevant. (LeCun et al., 1998)
LSTMs: Long Short-Term Memory networks (LSTMs) are designed to capture long-term dependencies in sequential data, making them suitable for modelling temporal patterns in your dataset. (Hochreiter & Schmidhuber, 1997)
ARIMA/SARIMA: Handle temporal dependencies like trends and seasonality explicitly. These models are well-suited for univariate time series but can be extended to multivariate settings. (Box et al., 2015)
ARCH/GARCH Models: If your data exhibits time-varying volatility, Autoregressive Conditional Heteroskedasticity (ARCH) or Generalised ARCH (GARCH) models may be appropriate to model conditional variance over time. (Engle, 1982; Bollerslev, 1986)
RNNs: Recurrent Neural Networks (RNNs) are ideal for sequential data but may struggle with long-term dependencies compared to LSTMs. (Rumelhart et al., 1986)

Each of these models provides unique strengths, and choosing the right one depends on the specific temporal and spatial complexities of your dataset.

Summing Up

While your implementation is a good start, the (apparently) low MSE values suggest potential issues such as data leakage or overly simplistic temporal patterns. First, I would verify that the MSE is indeed low, as detailed in the Interpreting "Low MSE" section. By verifying data alignment, analysing temporal dependencies with ACF/PACF, and benchmarking against baselines, you can hopefully obtain "better" results (and a better understanding of your model's strengths and limitations). Exploring more advanced models may further enhance predictive performance and address the spatial and temporal complexities of your data.

References:

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324. https://doi.org/10.1109/5.726791
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780. https://doi.org/10.1162/neco.1997.9.8.1735
Box, G. E. P., Jenkins, G. M., Reinsel, G. C., & Ljung, G. M. (2015). Time Series Analysis: Forecasting and Control (5th ed.). Wiley.
Engle, R. F. (1982). Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica, 50(4), 987-1007. https://doi.org/10.2307/1912773
Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31(3), 307-327. https://doi.org/10.1016/0304-4076(86)90063-1
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533-536. https://doi.org/10.1038/323533a0

Hi Robert, based on your suggestions I have implemented the regression model; please have a look at this question. I really appreciate your feedback.
– S. M., Commented Mar 12 at 23:00
