ModelSelector

class hcrystalball.model_selection.ModelSelector(horizon, frequency, country_code_column=None)[source]

Bases: object

Make large-scale cross validation easily accessible.

The create_gridsearch and select_model methods run cross validation and present or persist all relevant information.

Parameters
  • horizon (int) – How many steps ahead the predictions during model selection are to be made

  • frequency (str) – Temporal frequency of the data which is to be used in model selection. Data with a different frequency will be resampled to this frequency.

  • country_code_column (str, list) – Name of the column(s) holding the ISO code of the country/region, which can be used for supplying holidays, e.g. ‘State’ with values like ‘DE’, ‘CZ’ or ‘Region’ with values like ‘DE-NW’, ‘DE-HE’, etc. If used, data provided later must have daily frequency.
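
A minimal usage sketch, assembled only from the signatures documented on this page. The helper name `run_model_selection` and the column names `Sales` and `Store` are hypothetical, and `frequency="D"` assumes a pandas-style frequency alias for daily data. The import is deferred into the function so the snippet can be loaded without hcrystalball installed.

```python
def run_model_selection(df):
    """Run model selection on `df` and return the selector.

    Assumes `df` is a pandas.DataFrame of daily data with a hypothetical
    target column 'Sales' and a hypothetical partition column 'Store'.
    """
    # Deferred import: only needed when the function is actually called.
    from hcrystalball.model_selection import ModelSelector

    selector = ModelSelector(horizon=7, frequency="D")
    # The grid search is created first; select_model then runs cross
    # validation for each partition of the data.
    selector.create_gridsearch(n_splits=2, sklearn_models=True)
    selector.select_model(
        df=df,
        target_col_name="Sales",
        partition_columns=["Store"],
    )
    return selector
```

After the call, `selector.results` and `selector.get_partitions()` give access to the outcome for each partition.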

Attributes Summary

partitions

List of partitions the model selection was run on.

results

Results for each partition

stored_path

Path where ModelSelector object was stored

Methods Summary

add_model_to_gridsearch(model)

Extend self.grid_search parameter grid with provided model.

create_gridsearch([n_splits, …])

Create grid_search attribute (sklearn.model_selection.GridSearchCV) based on selection criteria

get_partitions([as_dataframe])

Provide overview of partitions for which results are available

get_result_for_partition([partition])

Provide result for given partition

persist_results([folder_path, …])

Store expert files for each partition.

plot_best_wrapper_classes([title])

Plot number of selected wrapper classes that were picked as best models

plot_results([partitions, plot_from])

Plot training data and cv forecasts for each partition

select_model(df, target_col_name[, …])

Run cross validation on data and select the best model.

Attributes Documentation

partitions

List of partitions the model selection was run on.

Created only after calling select_model.

Returns

List of dictionaries of partitions

Return type

list

Raises

ValueError – If select_model was not called before

results

Results for each partition

Returns

List of ModelSelectorResult objects

Return type

list

Raises

ValueError – If select_model was not called before

stored_path

Path where ModelSelector object was stored

Created only after calling persist_results.

Returns

Pathlike string to the folder containing stored ModelSelector object

Return type

str

Raises

ValueError – If persist_results was not called before
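
The attributes in this section are documented as raising ValueError when read too early. The sketch below shows one common way to get that behaviour with a guarded property. `SelectorSketch` is a hypothetical stand-in written for illustration, not the library's implementation.

```python
class SelectorSketch:
    """Hypothetical stand-in illustrating a guarded attribute."""

    def __init__(self):
        # Stays None until select_model has been called.
        self._results = None

    def select_model(self, results):
        # The real method runs cross validation; here the results are
        # simply passed in to keep the sketch small.
        self._results = list(results)

    @property
    def results(self):
        if self._results is None:
            raise ValueError("select_model must be called before reading results.")
        return self._results
```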

Methods Documentation

add_model_to_gridsearch(model)[source]

Extend self.grid_search parameter grid with provided model.

Adds given model or list of models to the gridsearch under ‘model’ step

Parameters

model (sklearn compatible model or list of sklearn compatible models) – model(s) to be added to provided grid search
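
As a rough illustration of what extending a parameter grid under the ‘model’ step can look like: sklearn's GridSearchCV accepts a list of dictionaries as its parameter grid, so new candidates can be appended as one more dictionary. The function name `add_model_to_param_grid` is hypothetical, plain strings stand in for estimators, and whether the library appends one dictionary per call is an assumption.

```python
def add_model_to_param_grid(param_grid, model):
    """Append model(s) to a list-of-dicts parameter grid under 'model'."""
    # Accept a single model or a list of models, as the method above does.
    models = model if isinstance(model, list) else [model]
    param_grid.append({"model": models})
    return param_grid
```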

create_gridsearch(n_splits=5, between_split_lag=None, scoring='neg_mean_absolute_error', country_code=None, holidays_days_before=0, holidays_days_after=0, holidays_bridge_days=False, sklearn_models=True, sklearn_models_optimize_for_horizon=False, autosarimax_models=False, autoarima_dict=None, prophet_models=False, tbats_models=False, exp_smooth_models=False, theta_models=False, average_ensembles=False, stacking_ensembles=False, stacking_ensembles_train_horizon=10, stacking_ensembles_train_n_splits=20, clip_predictions_lower=None, clip_predictions_upper=None, exog_cols=None, hcb_verbose=False)[source]

Create grid_search attribute (sklearn.model_selection.GridSearchCV) based on selection criteria

Parameters
  • n_splits (int) – How many cross-validation folds should be used in model selection

  • between_split_lag (int) – Lag, in observations, between consecutive cv_splits. If kept as None, horizon is used, resulting in non-overlapping cv_splits

  • scoring (str, callable) – String of sklearn regression metric name, or hcrystalball compatible scorer. For creation of hcrystalball compatible scorer use make_ts_scorer function.

  • country_code (str) – Country code in str (e.g. ‘DE’). Used in holiday transformer. Only one of country_code_column or country_code can be set.

  • holidays_days_before (int) – Number of days before the holiday which will be taken into account (e.g. 2 means that a new bool column will be created and will be True for the 2 days before holidays, otherwise False)

  • holidays_days_after (int) – Number of days after the holiday which will be taken into account (e.g. 2 means that a new bool column will be created and will be True for the 2 days after holidays, otherwise False)

  • holidays_bridge_days (bool) – Overlapping holidays_days_before and holidays_days_after feature, which serves for modeling working days between holidays

  • sklearn_models (bool) – Whether to consider sklearn models

  • sklearn_models_optimize_for_horizon (bool) – Whether to also add, on top of the default sklearn behavior, models that optimize predictions for each horizon

  • autosarimax_models (bool) – Whether to consider auto sarimax models

  • autoarima_dict (dict) – Specification of the pmdarima AutoARIMA search space

  • prophet_models (bool) – Whether to consider FB prophet models

  • tbats_models (bool) – Whether to consider TBATS models

  • exp_smooth_models (bool) – Whether to consider exponential smoothing models

  • theta_models (bool) – Whether to consider theta models

  • average_ensembles (bool) – Whether to consider average ensemble models

  • stacking_ensembles (bool) – Whether to consider stacking ensemble models

  • stacking_ensembles_train_horizon (int) – Which horizon should be used in the meta model in stacking ensembles

  • stacking_ensembles_train_n_splits (int) – Number of splits used in the meta model in stacking ensembles

  • clip_predictions_lower (float, int) – Minimal number allowed in the predictions

  • clip_predictions_upper (float, int) – Maximal number allowed in the predictions

  • exog_cols (list) – List of columns to be used as exogenous variables

  • hcb_verbose (bool) – Whether to keep (True) or suppress (False) messages to stdout and stderr from the wrapper and 3rd party libraries during fit and predict
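
To make the interplay of n_splits, horizon and between_split_lag concrete, here is a small sketch of rolling-origin splits in plain Python. It is one plausible reading of the parameter descriptions above, not the library's splitter: the function name `rolling_splits`, the expanding training window and the alignment of the last split with the end of the data are all assumptions.

```python
def rolling_splits(n_samples, horizon, n_splits, between_split_lag=None):
    """Return (train_indices, test_indices) ranges for each split."""
    # With no explicit lag, splits are shifted by the horizon, so the
    # test windows do not overlap.
    lag = horizon if between_split_lag is None else between_split_lag
    splits = []
    for i in range(n_splits):
        # The last split ends at the final observation; earlier splits
        # end `lag` observations before the one that follows them.
        test_end = n_samples - (n_splits - 1 - i) * lag
        test_start = test_end - horizon
        if test_start <= 0:
            raise ValueError("Not enough observations for the requested splits.")
        splits.append((range(0, test_start), range(test_start, test_end)))
    return splits
```

With 20 observations, a horizon of 3 and 2 splits, the default lag yields disjoint test windows, while `between_split_lag=1` yields windows that share two observations.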

get_partitions(as_dataframe=False)[source]

Provide overview of partitions for which results are available

Parameters

as_dataframe (bool) – Whether to return partitions as pandas.DataFrame; if False, a list of dicts is returned

Returns

partitions available in model selector results

Return type

pandas.DataFrame, list[dict, dict, …]

get_result_for_partition(partition=None)[source]

Provide result for given partition

Parameters

partition (str, dict) – partition_hash or partition_dict of data to which result is tied to

Returns

result of model selection for given partition

Return type

ModelSelectorResult

Raises

ValueError – if partition is not present in the results
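
The lookup accepts either a hash string or a partition dictionary. A minimal sketch of that idea follows, assuming results are kept in a dictionary keyed by hash and that the hash is an MD5 digest of the key-sorted JSON form of the partition. Both the storage layout and the hashing scheme are assumptions made for illustration, and the names `partition_hash` and `get_result` are hypothetical.

```python
import hashlib
import json


def partition_hash(partition):
    """Hash a partition dict independently of key order (assumed scheme)."""
    payload = json.dumps(partition, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()


def get_result(results_by_hash, partition):
    """Look up a result by partition hash (str) or partition dict."""
    key = partition if isinstance(partition, str) else partition_hash(partition)
    if key not in results_by_hash:
        raise ValueError(f"Partition {partition!r} is not present in the results.")
    return results_by_hash[key]
```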

persist_results(folder_path='results', persist_cv_results=False, persist_cv_data=False, persist_model_reprs=False, persist_best_model=False, persist_partition=False, persist_model_selector_results=True)[source]

Store expert files for each partition.

The file names follow {partition_hash}.{expert_type} e.g. 795dab1813f05b1abe9ae6ded93e1ec4.cv_data

Stores value of folder_path argument to self.stored_path

Parameters
  • folder_path (str) – Path to the directory where expert files are stored, by default ‘results’, resolved relative to the current working directory

  • persist_cv_results (bool) – If True cv_results of sklearn.model_selection.GridSearchCV as pandas df will be saved as pickle for each partition

  • persist_cv_data (bool) – If True the pandas df detail cv data will be saved as pickle for each partition

  • persist_model_reprs (bool) – If True model reprs will be saved as json for each partition

  • persist_best_model (bool) – If True best model will be saved as pickle for each partition

  • persist_partition (bool) – If True dictionary of partition label will be saved as json for each partition

  • persist_model_selector_results (bool) – If True ModelSelectorResult with all important information will be saved as pickle for each partition
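
The naming scheme {partition_hash}.{expert_type} can be sketched as follows. Only the cv_data suffix is confirmed by the example above; deriving every suffix by stripping the `persist_` prefix from the flag name is an assumption, and `expert_file_paths` is a hypothetical helper.

```python
from pathlib import Path


def expert_file_paths(folder_path, partition_hash, **persist_flags):
    """List the files that would be written for one partition."""
    prefix = "persist_"
    paths = []
    for flag, enabled in persist_flags.items():
        if enabled:
            # Assumed mapping: persist_cv_data -> cv_data, and so on.
            expert_type = flag[len(prefix):] if flag.startswith(prefix) else flag
            paths.append(Path(folder_path) / f"{partition_hash}.{expert_type}")
    return paths
```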

plot_best_wrapper_classes(title='Most often selected classes', **plot_params)[source]

Plot number of selected wrapper classes that were picked as best models

Parameters

title (str) – Title of the plot

Returns

Plot of most selected wrapper classes

Return type

matplotlib.axes._subplots.AxesSubplot

plot_results(partitions=None, plot_from=None, **plot_params)[source]

Plot training data and cv forecasts for each partition

Parameters
  • partitions (list) – List of partitions to plot results for

  • plot_from (str) – Date from which to show the plot e.g. ‘2019-12-31’, ‘2019’, or ‘2019-12’

Returns

List of matplotlib.axes._subplots.AxesSubplot for each partition

Return type

list

select_model(df, target_col_name, partition_columns=None, parallel_over_columns=None, executor=None, include_rules=None, exclude_rules=None, output_path='', persist_cv_results=False, persist_cv_data=False, persist_model_reprs=False, persist_best_model=False, persist_partition=False, persist_model_selector_results=False)[source]

Run cross validation on data and select the best model.

Best models are selected for each timeseries, stored in attribute self.results and if wanted also persisted.

Parameters
  • df (pandas.DataFrame) – Container holding historical data for training

  • target_col_name (str) – Name of target column

  • partition_columns (list, tuple) – Column names based on which the data should be split up / partitioned

  • parallel_over_columns (list, tuple) – Subset of partition_columns over which the model selection is split for parallel execution.

  • executor (prefect.executors) – Provide prefect’s executor. Only valid when parallel_over_columns is set. For more information see https://docs.prefect.io/api/latest/engine/executors.html

  • include_rules (dict) – Dictionary with keys being column names and values being list of values to include in the output.

  • exclude_rules (dict) – Dictionary with keys being column names and values being list of values to exclude from the output.

  • output_path (str) – Path to directory for storing the output, default behavior is current working directory

  • persist_cv_results (bool) – If True cv_results of sklearn.model_selection.GridSearchCV as pandas df will be saved as pickle for each partition

  • persist_cv_data (bool) – If True the pandas df detail cv data will be saved as pickle for each partition

  • persist_model_reprs (bool) – If True model reprs will be saved as json for each partition

  • persist_best_model (bool) – If True best model will be saved as pickle for each partition

  • persist_partition (bool) – If True dictionary of partition label will be saved as json for each partition

  • persist_model_selector_results (bool) – If True ModelSelectorResult with all important information will be saved as pickle for each partition
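
The include_rules and exclude_rules dictionaries map column names to lists of values. The sketch below applies such rules to plain dictionaries standing in for data rows. It assumes a row is kept when it satisfies every include rule and matches no exclude rule; how the library combines rules across columns is not stated above, and `filter_rows` is a hypothetical helper.

```python
def filter_rows(rows, include_rules=None, exclude_rules=None):
    """Keep rows that pass all include rules and hit no exclude rule."""
    include_rules = include_rules or {}
    exclude_rules = exclude_rules or {}
    kept = []
    for row in rows:
        # Assumed semantics: include rules are combined with AND,
        # exclude rules with OR.
        included = all(row[col] in values for col, values in include_rules.items())
        excluded = any(row[col] in values for col, values in exclude_rules.items())
        if included and not excluded:
            kept.append(row)
    return kept
```

For example, `include_rules={"Region": ["DE"]}` together with `exclude_rules={"Store": [2]}` keeps all German stores except store 2.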