Inspecting ModelSelectorResult

When we go from multiple time series down to a single time series, the best way to access all relevant information is through ModelSelectorResult objects.

[1]:
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('seaborn')
plt.rcParams['figure.figsize'] = [12, 6]
[2]:
from hcrystalball.model_selection import ModelSelector
from hcrystalball.utils import get_sales_data
from hcrystalball.wrappers import get_sklearn_wrapper
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
[3]:
df = get_sales_data(n_dates=365*2,
                    n_assortments=1,
                    n_states=1,
                    n_stores=2)
df.head()
[3]:
Store Sales Open Promo SchoolHoliday StoreType Assortment Promo2 State HolidayCode
Date
2013-08-01 817 25013 True True True a a False BE DE-BE
2013-08-01 513 22514 True True True a a False BE DE-BE
2013-08-02 513 19330 True True True a a False BE DE-BE
2013-08-02 817 22870 True True True a a False BE DE-BE
2013-08-03 513 16633 True False False a a False BE DE-BE
[4]:
# let's start simple
df_minimal = df[['Sales']]
[5]:
ms_minimal = ModelSelector(frequency='D', horizon=10)
[6]:
ms_minimal.create_gridsearch(
    n_splits=2,
    between_split_lag=None,
    sklearn_models=False,
    sklearn_models_optimize_for_horizon=False,
    autosarimax_models=False,
    prophet_models=False,
    tbats_models=False,
    exp_smooth_models=False,
    average_ensembles=False,
    stacking_ensembles=False)
[7]:
ms_minimal.add_model_to_gridsearch(get_sklearn_wrapper(LinearRegression, hcb_verbose=False))
ms_minimal.add_model_to_gridsearch(get_sklearn_wrapper(RandomForestRegressor, random_state=42, hcb_verbose=False))
[8]:
ms_minimal.select_model(df=df_minimal, target_col_name='Sales')

Ways to access ModelSelectorResult

There are four ways to get down to the single time-series result level; the cell below shows all of them.

  • The first is via .results[i], which is fast, but does not guarantee that results are loaded in the same order in which they were created (the reason is a hash used in the name of each result; results are later read in alphabetical order).

  • The second and third use .get_result_for_partition() with a dict-based partition.

  • The fourth does the same using the partition_hash (which also appears in the result's file name if persisted).

[9]:
result = ms_minimal.results[0]
result = ms_minimal.get_result_for_partition({'no_partition_label': ''})
result = ms_minimal.get_result_for_partition(ms_minimal.partitions[0])
result = ms_minimal.get_result_for_partition('fb452abd91f5c3bcb8afa4162c6452c2')

ModelSelectorResult is rich

As you can see below, we try to store all the relevant information to enable easy access to data that would otherwise be very lengthy to obtain.

[10]:
result
[10]:
ModelSelectorResult
-------------------
  best_model_name: sklearn
  frequency: D
  horizon: 10

  country_code_column: None

  partition: {'no_partition_label': ''}
  partition_hash: fb452abd91f5c3bcb8afa4162c6452c2

  df_plot: DataFrame (730, 6) suited for plotting cv results with .plot()
  X_train: DataFrame (730, 0) with training feature values
  y_train: DataFrame (730,) with training target values
  cv_results: DataFrame (2, 11) with gridsearch cv info
  best_model_cv_results: Series with gridsearch cv info
  cv_data: DataFrame (20, 4) with models predictions, split and true target values
  best_model_cv_data: DataFrame (20, 3) with model predictions, split and true target values

  model_reprs: Dict of model_hash and model_reprs
  best_model_hash: 53ec261970542c03a3b0a54fc6af214d
  best_model: Pipeline(memory=None,
         steps=[('exog_passthrough', 'passthrough'), ('holiday', 'passthrough'),
                ('model',
                 SklearnWrapper(bootstrap=True, ccp_alpha=0.0,
                                clip_predictions_lower=None,
                                clip_predictions_upper=None, criterion='mse',
                                fit_params=None, hcb_verbose=False, lags=3,
                                max_depth=None, max_features='auto',
                                max_leaf_nodes=None, max_samples=None,
                                min_impurity_decrease=0.0,
                                min_impurity_split=None, min_samples_leaf=1,
                                min_samples_split=2,
                                min_weight_fraction_leaf=0.0, n_estimators=100,
                                n_jobs=None, name='sklearn', oob_score=False,
                                optimize_for_horizon=False, random_state=42,
                                verbose=0, warm_start=False))],
         verbose=False)
-------------------

Training data

[11]:
result.X_train
[11]:
Date
2013-08-01
2013-08-02
2013-08-03
2013-08-04
2013-08-05
...
2015-07-27
2015-07-28
2015-07-29
2015-07-30
2015-07-31

730 rows × 0 columns

[12]:
result.y_train
[12]:
Date
2013-08-01    47527
2013-08-02    42200
2013-08-03    30370
2013-08-04        0
2013-08-05    42239
              ...
2015-07-27    43671
2015-07-28    41142
2015-07-29    39906
2015-07-30    39800
2015-07-31    43052
Freq: D, Name: Sales, Length: 730, dtype: int64

Data behind plots

Ready to be plotted or adjusted to your needs

[13]:
result.df_plot
[13]:
actuals cv_forecast(sklearn) cv_split error cv_split_str mae
2013-08-01 47527 NaN NaN NaN cv_split=nan, mae=nan NaN
2013-08-02 42200 NaN NaN NaN cv_split=nan, mae=nan NaN
2013-08-03 30370 NaN NaN NaN cv_split=nan, mae=nan NaN
2013-08-04 0 NaN NaN NaN cv_split=nan, mae=nan NaN
2013-08-05 42239 NaN NaN NaN cv_split=nan, mae=nan NaN
... ... ... ... ... ... ...
2015-07-27 43671 22441.52 1 21229.48 cv_split=1, mae=9370.22 9370.219
2015-07-28 41142 42666.69 1 1524.69 cv_split=1, mae=9370.22 9370.219
2015-07-29 39906 40116.97 1 210.97 cv_split=1, mae=9370.22 9370.219
2015-07-30 39800 27123.95 1 12676.05 cv_split=1, mae=9370.22 9370.219
2015-07-31 43052 28785.73 1 14266.27 cv_split=1, mae=9370.22 9370.219

730 rows × 6 columns

[14]:
result.df_plot.tail(50).plot();
[14]:
[figure: line plot of the last 50 rows of df_plot — actuals, cv forecasts and errors]
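The frame is also handy for quick custom aggregations. As a minimal sketch (using only the df_plot columns shown above), the mean absolute error per cv split can be recomputed directly:

# recompute the mean absolute error per cv split from the plotting frame;
# training rows have cv_split=NaN and are dropped by groupby automatically
result.df_plot.groupby('cv_split')['error'].mean()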

Best Model Metadata

These attributes can help, for example, to filter cv_data or to get a glimpse of which parameters the best model uses.

[16]:
result.best_model_hash
[16]:
'53ec261970542c03a3b0a54fc6af214d'
[17]:
result.best_model_name
[17]:
'sklearn'
[18]:
result.best_model_repr
[18]:
"Pipeline(memory=None,steps=[('exog_passthrough','passthrough'),('holiday','passthrough'),('model',SklearnWrapper(bootstrap=True,ccp_alpha=0.0,clip_predictions_lower=None,clip_predictions_upper=None,criterion='mse',fit_params=None,hcb_verbose=False,lags=3,max_depth=None,max_features='auto',max_leaf_nodes=None,max_samples=None,min_impurity_decrease=0.0,min_impurity_split=None,min_samples_leaf=1,min_samples_split=2,min_weight_fraction_leaf=0.0,n_estimators=100,n_jobs=None,name='sklearn',oob_score=False,optimize_for_horizon=False,random_state=42,verbose=0,warm_start=False))],verbose=False)"

CV Results

Get information about how our model behaved in cross validation

[19]:
result.best_model_cv_results['mean_fit_time']
[19]:
0.0009076595306396484

Or how all the models behaved

[20]:
result.cv_results.sort_values('rank_test_score').head()
[20]:
mean_fit_time std_fit_time mean_score_time std_score_time param_model params split0_test_score split1_test_score mean_test_score std_test_score rank_test_score
1 0.000908 0.000003 0.199280 0.001475 SklearnWrapper(bootstrap=True, ccp_alpha=0.0, ... {'model': SklearnWrapper(bootstrap=True, ccp_a... -10491.576000 -9370.219000 -9930.89750 560.678500 1
0 0.001297 0.000441 0.019724 0.001840 SklearnWrapper(clip_predictions_lower=None, cl... {'model': SklearnWrapper(clip_predictions_lowe... -12522.584395 -8074.404344 -10298.49437 2224.090025 2
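
The hash-named prediction columns you will see in cv_data below can be translated back to full model definitions via model_reprs, which maps model hashes to their string representations; for example:

# look up the full string representation of the best model by its hash
result.model_reprs[result.best_model_hash]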

CV Data

Access predictions made during cross validation, along with the cv splits and true target values

[21]:
result.cv_data.head()
[21]:
split y_true 584d397ba93c7fbf49f5fd97f42d5d62 53ec261970542c03a3b0a54fc6af214d
2015-07-12 0 0.0 26171.429013 16311.14
2015-07-13 0 48687.0 27599.262145 30500.02
2015-07-14 0 45498.0 30372.219206 33853.97
2015-07-15 0 45209.0 36301.509820 45178.86
2015-07-16 0 43669.0 35328.104719 42460.11
[22]:
result.cv_data.drop(['split'], axis=1).plot();
[22]:
[figure: cv predictions of both models plotted against the true target values]
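Since cv_data holds one prediction column per model (named by the model hash), per-model error metrics can be computed directly; a small sketch computing each model's mean absolute error over all splits:

# prediction columns are all columns except 'split' and 'y_true'
pred_cols = result.cv_data.columns.difference(['split', 'y_true'])
# mean absolute error of each model over all cv splits
result.cv_data[pred_cols].sub(result.cv_data['y_true'], axis=0).abs().mean()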
[23]:
result.best_model_cv_data.head()
[23]:
split y_true best_model
2015-07-12 0 0.0 16311.14
2015-07-13 0 48687.0 30500.02
2015-07-14 0 45498.0 33853.97
2015-07-15 0 45209.0 45178.86
2015-07-16 0 43669.0 42460.11
[24]:
result.best_model_cv_data.plot();
[24]:
[figure: best model cv predictions plotted against the true target values]

Plotting Functions

Both accept **plot_params that you can pass through to your plotting backend

[25]:
result.plot_result(plot_from='2015-06', title='Performance', color=['blue','green']);
[25]:
[figure: 'Performance' plot of cv forecasts vs. actuals from 2015-06 onward]
[26]:
result.plot_error(title='Error');
[26]:
[figure: error plots per cv split (cv_split=0, mae=10491.58; cv_split=1, mae=9370.22)]

Convenient Persist Method

[27]:
result.persist?
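
As a hedged sketch (the path argument is an assumption here; consult the help output above for the exact signature), persisting the whole result to the current directory might look like:

# persist the result object to disk
# (the `path` argument is assumed; see `result.persist?` above)
result.persist(path='.')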