Cross Validation

Since time series have an inherent temporal structure, traditional shuffled k-fold cross-validation can leak future information into the training set. hcrystalball therefore implements forward rolling cross-validation, in which each training set consists only of observations that occurred prior to the observations that form the test set.
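The forward rolling scheme can be sketched in plain Python. The `rolling_splits` helper below is hypothetical (it is not part of hcrystalball), but it mimics the kind of splits a rolling-origin splitter such as `FinerTimeSplit(horizon=5, n_splits=2)` would produce: each test window of length `horizon` is preceded only by its training observations.

```python
# Hypothetical sketch of forward rolling (rolling-origin) splits.
def rolling_splits(n_samples, horizon, n_splits):
    """Yield (train_indices, test_indices) pairs where every test window
    of length `horizon` follows all of its training observations."""
    for i in range(n_splits, 0, -1):
        test_start = n_samples - i * horizon
        train = list(range(test_start))
        test = list(range(test_start, test_start + horizon))
        yield train, test

for train, test in rolling_splits(n_samples=20, horizon=5, n_splits=2):
    print(len(train), test)
# 10 [10, 11, 12, 13, 14]
# 15 [15, 16, 17, 18, 19]
```

Note that the training set grows with each split while the test horizon stays fixed, so no test observation ever precedes a training one.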


[1]:
from hcrystalball.model_selection import FinerTimeSplit
from sklearn.model_selection import cross_validate
from hcrystalball.wrappers import ExponentialSmoothingWrapper
[2]:
import pandas as pd
import matplotlib.pyplot as plt
# plt.style.use('fivethirtyeight')
plt.rcParams['figure.figsize'] = [12, 6]
[3]:
from hcrystalball.utils import get_sales_data

df = get_sales_data(n_dates=100,
                    n_assortments=1,
                    n_states=1,
                    n_stores=1)
X, y = pd.DataFrame(index=df.index), df['Sales']

Native Cross Validation

[4]:
cross_validate(ExponentialSmoothingWrapper(),
               X,
               y,
               cv=FinerTimeSplit(horizon=5, n_splits=2),
               scoring='neg_mean_absolute_error')
[4]:
{'fit_time': array([0.00929976, 0.00830221]),
 'score_time': array([0.00860476, 0.00812888]),
 'test_score': array([-4822.51813324, -5489.07012375])}
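Since `cross_validate` returns one score per split and scikit-learn's scorers follow a "higher is better" convention, the negative MAE values above can be summarized by negating and averaging them. A small sketch, using the illustrative values from the output above:

```python
import numpy as np

# Illustrative values copied from the cross_validate output above.
cv_results = {
    'fit_time': np.array([0.00929976, 0.00830221]),
    'score_time': np.array([0.00860476, 0.00812888]),
    'test_score': np.array([-4822.51813324, -5489.07012375]),
}

# Negate to recover the mean absolute error on its natural scale.
mean_mae = -cv_results['test_score'].mean()
print(round(mean_mae, 2))  # 5155.79
```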

Grid search and model selection

Model selection and parameter tuning is the area where hcrystalball really shines. There is an ongoing, probably never-ending, discussion about the superiority or inferiority of ML techniques over common statistical/econometric ones. Why not try both? The difficulty of simply comparing the performance of different kinds of algorithms, such as SARIMAX, Prophet, regularized linear models, and XGBoost, is what led to hcrystalball. Let’s see how to do it!

[5]:
from hcrystalball.compose import TSColumnTransformer
from hcrystalball.feature_extraction import SeasonalityTransformer
from hcrystalball.wrappers import ProphetWrapper
from hcrystalball.wrappers import get_sklearn_wrapper
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

from hcrystalball.wrappers import SarimaxWrapper
from sklearn.model_selection import GridSearchCV

import numpy as np
import pandas as pd

Define our pipeline

[6]:
sklearn_model_pipeline = Pipeline([
    ('seasonality', SeasonalityTransformer(freq='D')),
    ('model', 'passthrough') # this will be overwritten by param grid
])
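The `'passthrough'` placeholder is an ordinary scikit-learn pipeline step that the param grid later overwrites through `set_params` (the `model__model` key). The same mechanism can be exercised by hand; in this sketch `StandardScaler` merely stands in for `SeasonalityTransformer` so the example runs with scikit-learn alone:

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A pipeline with a placeholder step, as in the cell above
# (StandardScaler stands in for SeasonalityTransformer here).
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('model', 'passthrough'),
])

# GridSearchCV does this internally for every candidate in the grid.
pipe.set_params(model=LinearRegression())
print(type(pipe.named_steps['model']).__name__)  # LinearRegression
```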

Define pipeline parameters including different models

[7]:
param_grid = [{'model': [sklearn_model_pipeline],
               'model__model':[get_sklearn_wrapper(RandomForestRegressor),
                               get_sklearn_wrapper(LinearRegression)]},
              {'model': [ProphetWrapper()],
               'model__seasonality_mode':['multiplicative', 'additive']},
              {'model': [SarimaxWrapper(order=(2,1,1), suppress_warnings=True)]}
             ]
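A list of dicts like the one above is expanded by `GridSearchCV` into one candidate per key-value combination within each dict (this is what `sklearn.model_selection.ParameterGrid` does). The expansion can be sketched with `itertools.product`; model objects are replaced by strings purely for illustration:

```python
from itertools import product

# String stand-ins for the model objects in the real param_grid above.
param_grid = [
    {'model': ['sklearn_pipeline'],
     'model__model': ['RandomForestRegressor', 'LinearRegression']},
    {'model': ['ProphetWrapper'],
     'model__seasonality_mode': ['multiplicative', 'additive']},
    {'model': ['SarimaxWrapper(order=(2, 1, 1))']},
]

# Cross product within each dict, concatenated across dicts.
candidates = []
for grid in param_grid:
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        candidates.append(dict(zip(keys, values)))

print(len(candidates))  # 5 candidates: 2 + 2 + 1
```

This is why the grid search below fits five distinct model configurations per cross-validation split.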

Custom scorer

[11]:
from hcrystalball.metrics import make_ts_scorer
from sklearn.metrics import mean_absolute_error
[12]:
scoring = make_ts_scorer(mean_absolute_error,
                         greater_is_better=False)
[13]:
grid_search = GridSearchCV(estimator=sklearn_model_pipeline,
                           param_grid=param_grid,
                           scoring=scoring,
                           cv=FinerTimeSplit(horizon=5, n_splits=2),
                           refit=False,
                           error_score=np.nan)
results = grid_search.fit(X, y)
[14]:
results.scorer_.cv_data.loc[:,lambda x: x.columns != 'split'].plot();
[image: examples_tutorial_wrappers_05_model_selection_22_0.png — cv_data plotted per column: y_true alongside each model’s cross-validated predictions]

hcrystalball internally tracks data based on unique model hashes, since model string representations (reprs) are too long to serve as usable column names in a dataframe. If you are curious, e.g. which model performed worst so you can exclude it from further experiments, you can find out via the scorer’s estimator_ids attribute.

[15]:
results.scorer_.cv_data.head()
[15]:
split y_true feef8e885b65983a6b2a2afa81790f63 f15ffb80abe21be5aa49ca2f4bef1428 443c753f6a53b3387028bafe49a286eb 6f632e060699dce1d328412e4b83c850 12436b23a2222e4b1df46a139cef649f
2015-07-22 0 18046.0 21091.84 20676.741135 24428.095232 24065.094696 20108.463667
2015-07-23 0 19532.0 21087.16 18015.494274 21852.187052 21566.265711 17871.040123
2015-07-24 0 17420.0 22742.39 18084.973453 20794.186777 20548.954160 19143.531055
2015-07-25 0 13558.0 13243.11 12739.066714 14475.237140 14713.923592 13445.097910
2015-07-26 0 0.0 0.00 -685.019580 -12.321919 1099.681532 -48.938135

We can get to the model definitions by looking up the hash in the results.scorer_.estimator_ids dict.
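The lookup itself is a plain dict access. The mapping below is a hypothetical stand-in (which hash belongs to which model depends on the fitted grid search, not on this sketch), but the pattern is the same:

```python
# Hypothetical stand-in for results.scorer_.estimator_ids (hash -> repr);
# the real pairs are populated by hcrystalball during the grid search.
estimator_ids = {
    'feef8e885b65983a6b2a2afa81790f63': '<model repr A>',
    'f15ffb80abe21be5aa49ca2f4bef1428': '<model repr B>',
}

# Map a cv_data column name back to the model that produced it.
column = 'f15ffb80abe21be5aa49ca2f4bef1428'
print(estimator_ids[column])  # <model repr B>
```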