
class hcrystalball.model_selection.ModelSelector(horizon, frequency, country_code_column=None)[source]

Bases: object

Make large-scale cross validation easily accessible.

Runs cross validation through the create_gridsearch and select_model methods, and presents / persists all relevant information.

  • horizon (int) – How many steps ahead the predictions during model selection are to be made

  • frequency (str) – Temporal frequency of the data to be used in model selection. Data with a different frequency will be resampled to this frequency.

  • country_code_column (str, list) – Name of the column(s) holding the ISO code of the country/region, which can be used for supplying holidays, e.g. ‘State’ with values like ‘DE’, ‘CZ’ or ‘Region’ with values like ‘DE-NW’, ‘DE-HE’, etc. If used, data provided later must have daily frequency.
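A minimal usage sketch of the class, assuming hcrystalball is installed and that df is a pandas.DataFrame with a daily DatetimeIndex. The column names ‘Sales’ and ‘Store’ are hypothetical, and only arguments listed in the signatures on this page are used. The import sits inside the function so that the snippet can be loaded even where the library is missing.

```python
def run_model_selection(df):
    """Sketch: pick the best model per store (column names are hypothetical)."""
    # Imported lazily so that defining this function does not require hcrystalball.
    from hcrystalball.model_selection import ModelSelector

    ms = ModelSelector(
        horizon=7,      # predict 7 steps ahead during model selection
        frequency="D",  # data are resampled to daily frequency if needed
    )
    # Build the grid search; every argument not listed keeps its default.
    ms.create_gridsearch(n_splits=2, sklearn_models=True, prophet_models=False)
    # Run cross validation separately for each value of the hypothetical 'Store' column.
    ms.select_model(df=df, target_col_name="Sales", partition_columns=["Store"])
    # Results are stored on the selector after select_model has run.
    return ms.results
```

Calling run_model_selection(df) would return the content of self.results, one entry per partition.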

Attributes Summary


partitions

List of partitions the model selection was run on.


results

Results for each partition


stored_path

Path where ModelSelector object was stored

Methods Summary


add_model_to_gridsearch(model)

Extend self.grid_search parameter grid with provided model.

create_gridsearch([n_splits, …])

Create grid_search attribute (sklearn.model_selection.GridSearchCV) based on selection criteria


get_partitions([as_dataframe])

Provide overview of partitions for which results are available


get_result_for_partition([partition])

Provide result for given partition

persist_results([folder_path, …])

Store expert files for each partition.


plot_best_wrapper_classes([title])

Plot number of selected wrapper classes that were picked as best models

plot_results([partitions, plot_from])

Plot training data and cv forecasts for each partition

select_model(df, target_col_name[, …])

Run cross validation on data and select the best model.

Attributes Documentation


partitions

List of partitions the model selection was run on.

Created only after calling select_model.


List of dictionaries of partitions

Return type



ValueError – If select_model was not called before


results

Results for each partition


List of ModelSelectorResult objects

Return type



ValueError – If select_model was not called before


stored_path

Path where ModelSelector object was stored

Created only after calling persist_results.


Pathlike string to the folder containing stored ModelSelector object

Return type



ValueError – If persist_results was not called before

Methods Documentation


add_model_to_gridsearch(model)[source]

Extend self.grid_search parameter grid with provided model.

Adds given model or list of models to the gridsearch under ‘model’ step


model (sklearn compatible model or list of sklearn compatible models) – model(s) to be added to provided grid search
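How a model might end up under the ‘model’ step can be sketched in plain Python. The sketch assumes the parameter grid is a list of dictionaries, a form that sklearn.model_selection.GridSearchCV accepts for param_grid. The helper name and the string placeholders standing in for estimators are hypothetical, and this is not the library's implementation.

```python
def add_models_to_param_grid(param_grid, model):
    """Return a copy of param_grid extended with the model(s) under the 'model' key."""
    # Accept a single model as well as a list of models.
    models = model if isinstance(model, list) else [model]
    # Each dictionary in the list is searched independently by a grid search.
    return list(param_grid) + [{"model": models}]


# Strings stand in for sklearn compatible estimators (hypothetical placeholders).
grid = [{"model": ["linear_regression"], "model__lags": [3, 7]}]
extended = add_models_to_param_grid(grid, ["random_forest", "ridge"])
```

The original grid is left untouched, and the new entry is searched alongside the existing ones.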

create_gridsearch(n_splits=5, between_split_lag=None, scoring='neg_mean_absolute_error', country_code=None, holidays_days_before=0, holidays_days_after=0, holidays_bridge_days=False, sklearn_models=True, sklearn_models_optimize_for_horizon=False, autosarimax_models=False, autoarima_dict=None, prophet_models=False, tbats_models=False, exp_smooth_models=False, theta_models=False, average_ensembles=False, stacking_ensembles=False, stacking_ensembles_train_horizon=10, stacking_ensembles_train_n_splits=20, clip_predictions_lower=None, clip_predictions_upper=None, exog_cols=None, hcb_verbose=False)[source]

Create grid_search attribute (sklearn.model_selection.GridSearchCV) based on selection criteria

  • n_splits (int) – How many cross-validation folds should be used in model selection

  • between_split_lag (int) – Lag, in number of observations, between consecutive cv_splits. If kept as None, horizon is used, resulting in non-overlapping cv_splits

  • scoring (str, callable) – String of sklearn regression metric name, or hcrystalball compatible scorer. For creation of hcrystalball compatible scorer use make_ts_scorer function.

  • country_code (str) – Country code in str (e.g. ‘DE’). Used in holiday transformer. Only one of country_code_column or country_code can be set.

  • holidays_days_before (int) – Number of days before the holiday which will be taken into account (e.g. 2 means that a new bool column will be created that is True for the 2 days before holidays and False otherwise)

  • holidays_days_after (int) – Number of days after the holiday which will be taken into account (e.g. 2 means that a new bool column will be created that is True for the 2 days after holidays and False otherwise)

  • holidays_bridge_days (bool) – Whether to add a feature marking days where holidays_days_before and holidays_days_after overlap, which serves for modeling working days between holidays

  • sklearn_models (bool) – Whether to consider sklearn models

  • sklearn_models_optimize_for_horizon (bool) – Whether to add, on top of the default sklearn models, models that optimize predictions for each horizon

  • autosarimax_models (bool) – Whether to consider auto sarimax models

  • autoarima_dict (dict) – Specification of the pmdarima AutoARIMA search space

  • prophet_models (bool) – Whether to consider FB prophet models

  • tbats_models (bool) – Whether to consider TBATS models

  • exp_smooth_models (bool) – Whether to consider exponential smoothing models

  • theta_models (bool) – Whether to consider Theta models

  • average_ensembles (bool) – Whether to consider average ensemble models

  • stacking_ensembles (bool) – Whether to consider stacking ensemble models

  • stacking_ensembles_train_horizon (int) – Which horizon should be used in the meta model in stacking ensembles

  • stacking_ensembles_train_n_splits (int) – Number of splits used in the meta model in stacking ensembles

  • clip_predictions_lower (float, int) – Minimal number allowed in the predictions

  • clip_predictions_upper (float, int) – Maximal number allowed in the predictions

  • exog_cols (list) – List of columns to be used as exogenous variables

  • hcb_verbose (bool) – Whether to keep (True) or suppress (False) messages to stdout and stderr from the wrapper and 3rd party libraries during fit and predict
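The interplay of n_splits, horizon and between_split_lag can be illustrated with a small sketch. It assumes that test windows are horizon observations long and are laid out backwards from the end of the series, each shifted by between_split_lag. The function name is hypothetical and the library's splitter may differ in detail.

```python
def sketch_cv_splits(n_samples, horizon, n_splits=5, between_split_lag=None):
    """Return (train_range, test_range) pairs, oldest split first."""
    # None falls back to horizon, which makes the test windows non-overlapping.
    lag = horizon if between_split_lag is None else between_split_lag
    splits = []
    for i in range(n_splits):
        test_end = n_samples - i * lag
        test_start = test_end - horizon
        if test_start <= 0:
            raise ValueError("not enough observations for the requested splits")
        # Train on everything before the test window (expanding window).
        splits.append((range(0, test_start), range(test_start, test_end)))
    return splits[::-1]


splits = sketch_cv_splits(n_samples=20, horizon=3, n_splits=3)
overlapping = sketch_cv_splits(n_samples=20, horizon=3, n_splits=3, between_split_lag=1)
```

With the default lag the test windows tile the end of the series without overlap, while a lag smaller than horizon makes consecutive test windows share observations.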


get_partitions(as_dataframe=True)[source]

Provide overview of partitions for which results are available


as_dataframe (bool) – Whether to return partitions as pandas.DataFrame; if False, a list of dicts is returned


partitions available in model selector results

Return type

pandas.DataFrame, list[dict, dict, …]


get_result_for_partition(partition=None)[source]

Provide result for given partition


partition (str, dict) – partition_hash or partition_dict of the data to which the result is tied


result of model selection for given partition

Return type



ValueError – if partition is not present in the results

persist_results(folder_path='results', persist_cv_results=False, persist_cv_data=False, persist_model_reprs=False, persist_best_model=False, persist_partition=False, persist_model_selector_results=True)[source]

Store expert files for each partition.

The file names follow {partition_hash}.{expert_type} e.g. 795dab1813f05b1abe9ae6ded93e1ec4.cv_data

Stores value of folder_path argument to self.stored_path

  • folder_path (str) – Path to the directory where expert files are stored, by default ‘results’ (relative to the current working directory)

  • persist_cv_results (bool) – If True cv_results of sklearn.model_selection.GridSearchCV as pandas df will be saved as pickle for each partition

  • persist_cv_data (bool) – If True, the detailed cv data (a pandas DataFrame) will be saved as pickle for each partition

  • persist_model_reprs (bool) – If True model reprs will be saved as json for each partition

  • persist_best_model (bool) – If True best model will be saved as pickle for each partition

  • persist_partition (bool) – If True dictionary of partition label will be saved as json for each partition

  • persist_model_selector_results (bool) – If True, ModelSelectorResult with all important information will be saved as pickle for each partition
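The {partition_hash}.{expert_type} naming scheme can be sketched as follows. The hashing shown here (MD5 of the partition dictionary serialized as JSON with sorted keys) is an assumption chosen because it yields 32 hexadecimal characters like the example above; the library's actual hashing may differ, and both helper names are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def sketch_partition_hash(partition):
    """Hash a partition dict; MD5 over sorted-key JSON is an assumption."""
    payload = json.dumps(partition, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()


def sketch_expert_file_path(folder_path, partition, expert_type):
    """Build the path {folder_path}/{partition_hash}.{expert_type}."""
    return Path(folder_path) / f"{sketch_partition_hash(partition)}.{expert_type}"


# Hypothetical partition label made of two partition columns.
path = sketch_expert_file_path("results", {"Store": 1, "Region": "DE-NW"}, "cv_data")
```

Sorting the keys before hashing makes the file name independent of the order in which the partition columns are listed.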

plot_best_wrapper_classes(title='Most often selected classes', **plot_params)[source]

Plot number of selected wrapper classes that were picked as best models


title (str) – Title of the plot


Plot of most selected wrapper classes

Return type


plot_results(partitions=None, plot_from=None, **plot_params)[source]

Plot training data and cv forecasts for each partition

  • partitions (list) – List of partitions to plot results for

  • plot_from (str) – Date from which to show the plot e.g. ‘2019-12-31’, ‘2019’, or ‘2019-12’


List of matplotlib.axes._subplots.AxesSubplot for each partition

Return type


select_model(df, target_col_name, partition_columns=None, parallel_over_columns=None, executor=None, include_rules=None, exclude_rules=None, output_path='', persist_cv_results=False, persist_cv_data=False, persist_model_reprs=False, persist_best_model=False, persist_partition=False, persist_model_selector_results=False)[source]

Run cross validation on data and select the best model.

The best model is selected for each time series, stored in the attribute self.results and, if requested, also persisted.

  • df (pandas.DataFrame) – Container holding historical data for training

  • target_col_name (str) – Name of target column

  • partition_columns (list, tuple) – Column names based on which the data should be split up / partitioned

  • parallel_over_columns (list, tuple) – Subset of partition_columns used to split the work for parallel execution.

  • executor (prefect.executors) – Provide prefect’s executor. Only valid when parallel_over_columns is set. For more information see

  • include_rules (dict) – Dictionary with keys being column names and values being list of values to include in the output.

  • exclude_rules (dict) – Dictionary with keys being column names and values being list of values to exclude from the output.

  • output_path (str) – Path to directory for storing the output, default behavior is current working directory

  • persist_cv_results (bool) – If True cv_results of sklearn.model_selection.GridSearchCV as pandas df will be saved as pickle for each partition

  • persist_cv_data (bool) – If True, the detailed cv data (a pandas DataFrame) will be saved as pickle for each partition

  • persist_model_reprs (bool) – If True model reprs will be saved as json for each partition

  • persist_best_model (bool) – If True best model will be saved as pickle for each partition

  • persist_partition (bool) – If True dictionary of partition label will be saved as json for each partition

  • persist_model_selector_results (bool) – If True, ModelSelectorResult with all important information will be saved as pickle for each partition
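The effect of partition_columns, include_rules and exclude_rules can be sketched on plain dictionaries instead of a pandas.DataFrame. The sketch assumes that a row is kept when it matches every include rule and no exclude rule; the helper name and the sample rows are hypothetical and the library's own filtering may combine rules differently.

```python
from collections import defaultdict


def sketch_partition_rows(rows, partition_columns, include_rules=None, exclude_rules=None):
    """Group rows by partition columns after applying include / exclude rules."""
    include_rules = include_rules or {}
    exclude_rules = exclude_rules or {}
    partitions = defaultdict(list)
    for row in rows:
        # Drop rows whose value is not in the allowed list of an include rule.
        if any(row[col] not in allowed for col, allowed in include_rules.items()):
            continue
        # Drop rows whose value is in the banned list of an exclude rule.
        if any(row[col] in banned for col, banned in exclude_rules.items()):
            continue
        # One partition per unique combination of partition column values.
        key = tuple((col, row[col]) for col in partition_columns)
        partitions[key].append(row)
    return dict(partitions)


rows = [
    {"Store": 1, "Region": "DE-NW", "Sales": 10.0},
    {"Store": 1, "Region": "DE-NW", "Sales": 12.0},
    {"Store": 2, "Region": "DE-HE", "Sales": 7.0},
    {"Store": 3, "Region": "CZ", "Sales": 5.0},
]
selected = sketch_partition_rows(rows, ["Store", "Region"], exclude_rules={"Region": ["CZ"]})
```

Each key of the returned dictionary corresponds to one time series for which a best model would be selected.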