VECM

Using a VECM to predict FANG stocks
See the VECM documentation

[1]:

import pandas as pd import numpy as np from pandas_datareader import data as pdr import yfinance as yf from scalecast.Forecaster import Forecaster from scalecast.MVForecaster import MVForecaster from scalecast.Pipeline import Transformer, Reverter, MVPipeline from scalecast.util import ( find_optimal_lag_order, find_optimal_coint_rank, Forecaster_with_missing_vals, ) from scalecast.auxmodels import vecm from scalecast.multiseries import export_model_summaries from scalecast import GridGenerator import matplotlib.pyplot as plt 

[2]:

yf.pdr_override() 

Download data using a public API

[3]:

FANG = [ 'META', 'AMZN', 'NFLX', 'GOOG', ] fs = [] for sym in FANG: df = pdr.get_data_yahoo(sym) # since the api doesn't send the data exactly in Business-day frequency # we can correct it using this function f = Forecaster_with_missing_vals( y=df['Close'], current_dates = df.index, future_dates = 65, end = '2022-09-30', desired_frequency = 'B', fill_strategy = 'linear_interp', add_noise = True, noise_lookback = 5, ) fs.append(f) mvf = MVForecaster(*fs,names=FANG,test_length=65) mvf.set_validation_metric('rmse') mvf.add_sklearn_estimator(vecm,'vecm') mvf 

 [*********************100%***********************] 1 of 1 completed [*********************100%***********************] 1 of 1 completed [*********************100%***********************] 1 of 1 completed [*********************100%***********************] 1 of 1 completed

[3]:

 MVForecaster( DateStartActuals=2012-05-18T00:00:00.000000000 DateEndActuals=2023-08-03T00:00:00.000000000 Freq=B N_actuals=2925 N_series=4 SeriesNames=['META', 'AMZN', 'NFLX', 'GOOG'] ForecastLength=65 Xvars=[] TestLength=65 ValidationLength=1 ValidationMetric=rmse ForecastsEvaluated=[] CILevel=None CurrentEstimator=mlr OptimizeOn=mean GridsFile=MVGrids )

[4]:

mvf.plot() plt.show() 

Augmented Dickey Fuller Tests to Confirm Unit-1 Roots

[5]:

for stock, f in zip(FANG,fs): adf_result = f.adf_test(full_res=True) print('the stock {} is {}stationary at level'.format( stock, 'not ' if adf_result[1] > 0.05 else '' ) ) 

 the stock META is not stationary at level the stock AMZN is not stationary at level the stock NFLX is not stationary at level the stock GOOG is not stationary at level

[6]:

for stock, f in zip(FANG,fs): adf_result = f.adf_test(diffy=True,full_res=True) print('the stock {} is {}stationary at its first difference'.format( stock, 'not ' if adf_result[1] > 0.05 else '' ) ) 

 the stock META is stationary at its first difference the stock AMZN is stationary at its first difference the stock NFLX is stationary at its first difference the stock GOOG is stationary at its first difference

Measure IC to Find Optimal Lag Order

this is used to run the cointegration test

[7]:

lag_test = find_optimal_lag_order(mvf,train_only=True) pd.DataFrame( { 'aic':lag_test.aic, 'bic':lag_test.bic, 'hqic':lag_test.hqic, 'fpe':lag_test.fpe, }, index = ['optimal lag order'], ).T 

[7]:

	optimal lag order
aic	27
bic	1
hqic	3
fpe	27

Johansen cointegration test

[8]:

coint_res = find_optimal_coint_rank( mvf, det_order=1, k_ar_diff=10, train_only=True, ) print(coint_res) coint_res.rank 

 Johansen cointegration test using trace test statistic with 5% significance level ===================================== r_0 r_1 test statistic critical value ------------------------------------- 0 4 56.60 55.25 1 4 33.50 35.01 -------------------------------------

[8]:

We found a cointegration rank of 1.

Run VECM

Now, we can specify a grid that will try more lags, deterministic terms, seasonal fluctuations, and cointegration ranks of 0 and 1

[9]:

vecm_grid = dict( lags = [0], # required to set this to 0 for the vecm model in scalecast freq = ['B'], # only necessary to suppress a warning k_ar_diff = range(1,66), # lag orders to try coint_rank = [0,1], deterministic = ["n","co","lo","li","cili","colo"], seasons = [0,5,30,65,260], ) mvf.set_estimator('vecm') mvf.ingest_grid(vecm_grid) mvf.limit_grid_size(100,random_seed=20) mvf.cross_validate(k=3,verbose=True) mvf.auto_forecast() results = mvf.export('model_summaries') results[[ 'ModelNickname', 'Series', 'TestSetRMSE', 'TestSetMAE', ]] 

 Num hyperparams to try for the vecm model: 100. Fold 0: Train size: 2145 (2012-05-18 00:00:00 - 2020-08-06 00:00:00). Test Size: 715 (2020-08-07 00:00:00 - 2023-05-04 00:00:00). Fold 1: Train size: 1430 (2012-05-18 00:00:00 - 2017-11-09 00:00:00). Test Size: 715 (2017-11-10 00:00:00 - 2020-08-06 00:00:00). Fold 2: Train size: 715 (2012-05-18 00:00:00 - 2015-02-12 00:00:00). Test Size: 715 (2015-02-13 00:00:00 - 2017-11-09 00:00:00). Chosen paramaters: {'lags': 0, 'freq': 'B', 'k_ar_diff': 28, 'coint_rank': 1, 'deterministic': 'li', 'seasons': 5}.

[9]:

	ModelNickname	Series	TestSetRMSE	TestSetMAE
0	vecm	META	50.913814	44.133838
1	vecm	AMZN	23.752750	22.348669
2	vecm	NFLX	113.705564	105.310047
3	vecm	GOOG	18.846431	18.041912

View VECM Results

[10]:

results['TestSetRMSE'].mean() 

[10]:

 51.80463965264752

[11]:

mvf.export_validation_grid('vecm').sample(15) 

[11]:

	freq	k_ar_diff	coint_rank	deterministic	seasons	Fold0Metric	Fold1Metric	Fold2Metric	AverageMetric	MetricEvaluated	Optimized On
71	B	58	0	colo	0	162.599874	44.268484	26.594518	77.820958	rmse	mean
40	B	7	0	li	260	NaN	NaN	NaN	NaN	rmse	mean
3	B	54	1	lo	30	110.553065	51.443275	19.154490	60.383610	rmse	mean
30	B	29	0	co	30	112.035140	44.427934	18.598329	58.353801	rmse	mean
76	B	1	1	cili	65	93.945660	51.090754	21.836356	55.624257	rmse	mean
87	B	50	0	lo	260	163.152289	46.616014	13.327631	74.365311	rmse	mean
10	B	44	0	cili	65	NaN	NaN	NaN	NaN	rmse	mean
9	B	47	0	n	0	86.212252	55.639330	36.182421	59.344668	rmse	mean
94	B	38	1	colo	5	139.580751	41.852200	24.055439	68.496130	rmse	mean
43	B	55	1	colo	260	112.237679	43.782141	63.595936	73.205252	rmse	mean
73	B	40	0	lo	0	159.422501	46.642590	18.090453	74.718515	rmse	mean
62	B	18	1	n	5	166.241648	54.618358	30.753820	83.871275	rmse	mean
63	B	12	0	co	260	108.180519	44.685347	23.286315	58.717394	rmse	mean
60	B	21	1	li	260	82.029398	50.186756	31.553748	54.589967	rmse	mean
64	B	43	1	co	0	100.291180	46.496187	16.839982	54.542450	rmse	mean

[12]:

mvf.plot_test_set( series='AMZN', models='vecm', include_train=130, figsize=(16,8) ) plt.show() 

Re-weight Evaluation Metrics and Rerun VECM

[13]:

weights = results['TestSetRMSE'] / results['TestSetRMSE'].sum() weights 

[13]:

 0 0.245701 1 0.114627 2 0.548723 3 0.090950 Name: TestSetRMSE, dtype: float64

[14]:

mvf.set_optimize_on( lambda x: ( x[0]*weights[0] + x[1]*weights[1] + x[2]*weights[2] + x[3]*weights[3] ) ) mvf.ingest_grid(vecm_grid) mvf.limit_grid_size(100,random_seed=20) mvf.cross_validate(k=3,verbose=True) mvf.auto_forecast(call_me='vecm_weighted') results = mvf.export('model_summaries') results[[ 'ModelNickname', 'Series', 'TestSetRMSE', 'TestSetMAE', ]] 

 Num hyperparams to try for the vecm model: 100. Fold 0: Train size: 2145 (2012-05-18 00:00:00 - 2020-08-06 00:00:00). Test Size: 715 (2020-08-07 00:00:00 - 2023-05-04 00:00:00). Fold 1: Train size: 1430 (2012-05-18 00:00:00 - 2017-11-09 00:00:00). Test Size: 715 (2017-11-10 00:00:00 - 2020-08-06 00:00:00). Fold 2: Train size: 715 (2012-05-18 00:00:00 - 2015-02-12 00:00:00). Test Size: 715 (2015-02-13 00:00:00 - 2017-11-09 00:00:00). Chosen paramaters: {'lags': 0, 'freq': 'B', 'k_ar_diff': 62, 'coint_rank': 1, 'deterministic': 'li', 'seasons': 0}.

[14]:

	ModelNickname	Series	TestSetRMSE	TestSetMAE
0	vecm	META	50.913814	44.133838
1	vecm_weighted	META	33.772797	28.520031
2	vecm	AMZN	23.752750	22.348669
3	vecm_weighted	AMZN	15.835426	14.617043
4	vecm	NFLX	113.705564	105.310047
5	vecm_weighted	NFLX	59.887563	53.210215
6	vecm	GOOG	18.846431	18.041912
7	vecm_weighted	GOOG	16.757831	16.121594

[15]:

results.loc[results['ModelNickname'] == 'vecm_weighted','TestSetRMSE'].mean() 

[15]:

 31.56340410699967

An improvement by weighting the optimizer!

[16]:

mvf.plot_test_set( series='META', models='all', include_train=130, figsize=(16,8) ) plt.show() 

[17]:

mvf.plot( series='all', models='all', figsize=(16,8) ) plt.show() 

Try Other MV Models

[31]:

GridGenerator.get_mv_grids() # open MVGrids.py and manually change all lags arguments to range(1,66) 

[32]:

transformers = [] reverters = [] for stock, f in zip(FANG,fs): transformer = Transformer( transformers = [('DiffTransform',)] ) reverter = Reverter( reverters = [('DiffRevert',)], base_transformer = transformer, ) transformers.append(transformer) reverters.append(reverter) 

[33]:

def Xvar_select(f): f.set_validation_length(65) f.auto_Xvar_select( estimator='gbt', max_depth=2, max_ar=0, # in mv modeling, lags are a hyperparameter, not a regressor in the MVForecaster object ) def mvforecaster(mvf): models = ( 'mlr', 'elasticnet', 'gbt', 'xgboost', 'lightgbm', 'knn', ) mvf.set_test_length(65) mvf.tune_test_forecast( models, limit_grid_size=10, cross_validate=True, k=3, ) 

[34]:

pipeline = MVPipeline( steps = [ ('Transform',transformers), ('Xvar Select',[Xvar_select]*4), ('Forecast',mvforecaster), ('Revert',reverters), ], names = FANG, ) 

[35]:

fs = pipeline.fit_predict(*fs) 

 Finished loading model, total used 150 iterations Finished loading model, total used 150 iterations Finished loading model, total used 150 iterations Finished loading model, total used 150 iterations Finished loading model, total used 150 iterations Finished loading model, total used 150 iterations Finished loading model, total used 150 iterations Finished loading model, total used 150 iterations Finished loading model, total used 150 iterations Finished loading model, total used 150 iterations Finished loading model, total used 150 iterations Finished loading model, total used 150 iterations Finished loading model, total used 150 iterations Finished loading model, total used 150 iterations Finished loading model, total used 150 iterations Finished loading model, total used 150 iterations Finished loading model, total used 150 iterations Finished loading model, total used 150 iterations Finished loading model, total used 150 iterations Finished loading model, total used 150 iterations

[36]:

results = export_model_summaries(dict(zip(FANG,fs))) 

View Results

[37]:

model_rmses = results.groupby('ModelNickname')['TestSetRMSE'].mean().sort_values().reset_index() model_rmses 

[37]:

	ModelNickname	TestSetRMSE
0	xgboost	27.918698
1	knn	41.832711
2	mlr	42.519262
3	gbt	43.895070
4	elasticnet	44.320649
5	lightgbm	44.554279

The above table is the mean mape performance from each model over all series.

[38]:

series_rmses = results.groupby('Series')['TestSetRMSE'].min().reset_index() series_rmses['Model'] = [ results.loc[ results['TestSetRMSE'] == i, 'ModelNickname' ].values[0] for i in series_rmses['TestSetRMSE'] ] series_rmses 

[38]:

	Series	TestSetRMSE	Model
0	AMZN	17.433315	xgboost
1	GOOG	12.892690	xgboost
2	META	10.509525	xgboost
3	NFLX	70.839262	xgboost

[39]:

series_rmses['TestSetRMSE'].mean() 

[39]:

 27.918697955026953

The above table shows the best model for each series and its derived RMSE. The average RMSE of all these models applied to the individual series is 27.9, but being so dependent on the test set to choose the model probably leads to overfitting.

[41]:

fs[1].plot_test_set( models='xgboost', include_train=130, ) plt.title('AMZN Best Model Test Predictions') plt.show() 

[42]:

fs[1].plot( models='xgboost', ) plt.title('AMZN Best Model Forecast') plt.show() 

[ ]: