ModelSelector

class hcrystalball.model_selection.ModelSelector(horizon, frequency, country_code_column=None)

Bases: object

Make large-scale cross-validation easily accessible.

Use the create_gridsearch and select_model methods to run cross-validation and to present or persist all relevant information.

Parameters

horizon (int) – How many steps ahead the predictions during model selection are to be made
frequency (str) – Temporal frequency of the data to be used in model selection. Data with a different frequency will be resampled to this frequency.
country_code_column (str, list) – Name of the column(s) holding the ISO code of the country/region, which can be used for supplying holidays, e.g. ‘State’ with values like ‘DE’, ‘CZ’, or ‘Region’ with values like ‘DE-NW’, ‘DE-HE’. If used, data provided later must have daily frequency.
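
As a rough orientation, the workflow described above might look like the following sketch. It uses only the constructor and method parameters listed on this page. The column names `Sales` and `Store`, the helper name `run_model_selection`, and the daily frequency alias `"D"` are illustrative assumptions, and `df` is assumed to be a pandas.DataFrame in a shape hcrystalball accepts (for example, indexed by date). The import sits inside the function so the sketch can be loaded even where hcrystalball is not installed.

```python
def run_model_selection(df, target_col_name="Sales", partition_columns=("Store",)):
    """Hypothetical helper: run model selection and return the selector."""
    # Imported lazily; hcrystalball must be installed when the function is called.
    from hcrystalball.model_selection import ModelSelector

    # Forecast 7 steps ahead on daily data ("D" is an assumed frequency alias).
    ms = ModelSelector(horizon=7, frequency="D")

    # Build the grid search; only sklearn-based models are considered here.
    ms.create_gridsearch(n_splits=2, sklearn_models=True, prophet_models=False)

    # Run cross-validation separately for each partition of the data.
    ms.select_model(
        df=df,
        target_col_name=target_col_name,
        partition_columns=list(partition_columns),
    )
    return ms
```

After such a call, `ms.results` and `ms.partitions` should be available, and `ms.persist_results(...)` could be used to store the outcome.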

Attributes Summary

`partitions`	List of partitions the model selection was run on.
`results`	Results for each partition
`stored_path`	Path where `ModelSelector` object was stored

Methods Summary

`add_model_to_gridsearch`(model)	Extend `self.grid_search` parameter grid with provided model.
`create_gridsearch`([n_splits, …])	Create grid_search attribute (`sklearn.model_selection.GridSearchCV`) based on selection criteria
`get_partitions`([as_dataframe])	Provide overview of partitions for which results are available
`get_result_for_partition`([partition])	Provide result for given partition
`persist_results`([folder_path, …])	Store expert files for each partition.
`plot_best_wrapper_classes`([title])	Plot number of selected wrapper classes that were picked as best models
`plot_results`([partitions, plot_from])	Plot training data and cv forecasts for each partition
`select_model`(df, target_col_name[, …])	Run cross-validation on data and select the best model.

Attributes Documentation

partitions

List of partitions the model selection was run on.

Created only after calling select_model.

Returns: List of dictionaries of partitions
Return type: list
Raises: ValueError – If select_model was not called before
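
To make "list of dictionaries of partitions" more concrete, the sketch below derives such a list from plain rows. It is an illustration only: it assumes that each partition is described by a dict mapping every partition column to one value, and the helper name `distinct_partitions` and the sample columns are hypothetical, not part of hcrystalball.

```python
def distinct_partitions(rows, partition_columns):
    """Return one dict per distinct combination of partition column values."""
    seen = set()
    partitions = []
    for row in rows:
        # The tuple of partition values identifies the partition of this row.
        key = tuple(row[col] for col in partition_columns)
        if key not in seen:
            seen.add(key)
            partitions.append(dict(zip(partition_columns, key)))
    return partitions


# Invented sample data: two stores in one region, one store in another.
rows = [
    {"Region": "DE", "Store": 1, "Sales": 10.0},
    {"Region": "DE", "Store": 1, "Sales": 12.0},
    {"Region": "DE", "Store": 2, "Sales": 7.0},
    {"Region": "CZ", "Store": 3, "Sales": 5.0},
]
partitions = distinct_partitions(rows, ["Region", "Store"])
```

Here `partitions` holds three dicts, one per distinct (Region, Store) pair, in order of first appearance.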

results

Results for each partition

Returns: List of ModelSelectorResult objects
Return type: list
Raises: ValueError – If select_model was not called before

stored_path

Path where ModelSelector object was stored

Created only after calling persist_results.

Returns: Pathlike string to the folder containing stored ModelSelector object
Return type: str
Raises: ValueError – If persist_results was not called before

Methods Documentation

add_model_to_gridsearch(model)

Extend self.grid_search parameter grid with provided model.

Adds the given model or list of models to the grid search under the ‘model’ step

Parameters: model (sklearn compatible model or list of sklearn compatible models) – model(s) to be added to the grid search

create_gridsearch(n_splits=5, between_split_lag=None, scoring='neg_mean_absolute_error', country_code=None, holidays_days_before=0, holidays_days_after=0, holidays_bridge_days=False, sklearn_models=True, sklearn_models_optimize_for_horizon=False, autosarimax_models=False, autoarima_dict=None, prophet_models=False, tbats_models=False, exp_smooth_models=False, theta_models=False, average_ensembles=False, stacking_ensembles=False, stacking_ensembles_train_horizon=10, stacking_ensembles_train_n_splits=20, clip_predictions_lower=None, clip_predictions_upper=None, exog_cols=None, hcb_verbose=False)

Create grid_search attribute (sklearn.model_selection.GridSearchCV) based on selection criteria

Parameters

n_splits (int) – How many cross-validation folds should be used in model selection
between_split_lag (int) – Lag, in observations, between consecutive cv_splits. If kept as None, horizon is used, resulting in non-overlapping cv_splits
scoring (str, callable) – String of sklearn regression metric name, or hcrystalball compatible scorer. For creation of hcrystalball compatible scorer use make_ts_scorer function.
country_code (str) – Country code in str (e.g. ‘DE’). Used in holiday transformer. Only one of country_code_column or country_code can be set.
holidays_days_before (int) – Number of days before the holiday which will be taken into account (e.g. 2 means that a new bool column will be created and will be True for the 2 days before holidays, otherwise False)
holidays_days_after (int) – Number of days after the holiday which will be taken into account (e.g. 2 means that a new bool column will be created and will be True for the 2 days after holidays, otherwise False)
holidays_bridge_days (bool) – Overlapping holidays_days_before and holidays_days_after feature, which serves for modeling working days between holidays
sklearn_models (bool) – Whether to consider sklearn models
sklearn_models_optimize_for_horizon (bool) – Whether to also add, on top of the default sklearn behavior, models that optimize predictions for each horizon
autosarimax_models (bool) – Whether to consider auto sarimax models
autoarima_dict (dict) – Specification of the pmdarima AutoARIMA search space
prophet_models (bool) – Whether to consider FB prophet models
tbats_models (bool) – Whether to consider TBATS models
exp_smooth_models (bool) – Whether to consider exponential smoothing models
theta_models (bool) – Whether to consider theta models
average_ensembles (bool) – Whether to consider average ensemble models
stacking_ensembles (bool) – Whether to consider stacking ensemble models
stacking_ensembles_train_horizon (int) – Which horizon should be used in the meta model in stacking ensembles
stacking_ensembles_train_n_splits (int) – Number of splits used in the meta model in stacking ensembles
clip_predictions_lower (float, int) – Minimal number allowed in the predictions
clip_predictions_upper (float, int) – Maximal number allowed in the predictions
exog_cols (list) – List of columns to be used as exogenous variables
hcb_verbose (bool) – Whether to keep (True) or suppress (False) messages to stdout and stderr from the wrapper and 3rd party libraries during fit and predict
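
The interplay of n_splits, between_split_lag and the selector's horizon can be pictured with the sketch below. It is an illustration in plain Python, not hcrystalball's splitter: the function name `sketch_cv_splits` is hypothetical, and it assumes an expanding training window with test folds counted back from the end of the series. What it takes from the description above is that each test fold spans horizon observations and that leaving between_split_lag as None falls back to horizon, which gives non-overlapping folds.

```python
def sketch_cv_splits(n_samples, horizon, n_splits, between_split_lag=None):
    """Return (train_range, test_range) pairs, oldest fold first."""
    # With no explicit lag, consecutive folds are shifted by a full horizon.
    lag = horizon if between_split_lag is None else between_split_lag
    splits = []
    for i in range(n_splits):
        # Assumption: the last fold ends at the end of the series and
        # earlier folds are shifted back by multiples of the lag.
        test_end = n_samples - (n_splits - 1 - i) * lag
        test_start = test_end - horizon
        if test_start <= 0:
            raise ValueError("not enough observations for the requested splits")
        splits.append((range(0, test_start), range(test_start, test_end)))
    return splits


# Default lag (equal to horizon): test folds touch but do not overlap.
default_folds = [(t.start, t.stop) for _, t in sketch_cv_splits(30, 5, 3)]

# Smaller lag: test folds overlap by horizon - lag observations.
overlapping_folds = [(t.start, t.stop) for _, t in sketch_cv_splits(30, 5, 3, 2)]
```

With 30 observations, a horizon of 5 and 3 splits, the default yields test windows [15, 20), [20, 25) and [25, 30), while a lag of 2 yields [21, 26), [23, 28) and [25, 30).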

get_partitions(as_dataframe=False)

Provide overview of partitions for which results are available

Parameters: as_dataframe (bool) – Whether to return partitions as a pandas.DataFrame; if False, a list of dicts is returned
Returns: partitions available in model selector results
Return type: pandas.DataFrame, list[dict, dict, …]

get_result_for_partition(partition=None)

Provide result for given partition

Parameters: partition (str, dict) – partition_hash or partition_dict of the data to which the result is tied
Returns: result of model selection for given partition
Return type: ModelSelectorResult
Raises: ValueError – if partition is not present in the results

persist_results(folder_path='results', persist_cv_results=False, persist_cv_data=False, persist_model_reprs=False, persist_best_model=False, persist_partition=False, persist_model_selector_results=True)

Store expert files for each partition.

The file names follow {partition_hash}.{expert_type} e.g. 795dab1813f05b1abe9ae6ded93e1ec4.cv_data

Stores value of folder_path argument to self.stored_path

Parameters

folder_path (str) – Path to the directory where expert files are stored, by default ‘results’
persist_cv_results (bool) – If True cv_results of sklearn.model_selection.GridSearchCV as pandas df will be saved as pickle for each partition
persist_cv_data (bool) – If True the pandas df detail cv data will be saved as pickle for each partition
persist_model_reprs (bool) – If True model reprs will be saved as json for each partition
persist_best_model (bool) – If True best model will be saved as pickle for each partition
persist_partition (bool) – If True dictionary of partition label will be saved as json for each partition
persist_model_selector_results (bool) – If True ModelSelectorResult with all important information will be saved as pickle for each partition
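
The {partition_hash}.{expert_type} naming pattern can be sketched with pathlib as follows. The two helper functions are hypothetical and only illustrate how such names are composed and taken apart; the hash and the `cv_data` expert type are the ones from the example above, and no claim is made here about how hcrystalball computes the hash.

```python
from pathlib import Path


def expert_file_path(folder_path, partition_hash, expert_type):
    """Hypothetical helper: build the path of one expert file."""
    return Path(folder_path) / f"{partition_hash}.{expert_type}"


def split_expert_file_name(file_name):
    """Recover (partition_hash, expert_type) from an expert file name."""
    partition_hash, expert_type = file_name.split(".", 1)
    return partition_hash, expert_type


path = expert_file_path("results", "795dab1813f05b1abe9ae6ded93e1ec4", "cv_data")
print(path.name)  # 795dab1813f05b1abe9ae6ded93e1ec4.cv_data
```

Splitting on the first dot keeps the hash and the expert type separable even if an expert type were itself to contain a dot.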

plot_best_wrapper_classes(title='Most often selected classes', **plot_params)

Plot number of selected wrapper classes that were picked as best models

Parameters: title (str) – Title of the plot
Returns: Plot of most selected wrapper classes
Return type: matplotlib.axes._subplots.AxesSubplot

plot_results(partitions=None, plot_from=None, **plot_params)

Plot training data and cv forecasts for each partition

Parameters

partitions (list) – List of partitions to plot results for
plot_from (str) – Date from which to show the plot e.g. ‘2019-12-31’, ‘2019’, or ‘2019-12’

Returns

List of matplotlib.axes._subplots.AxesSubplot for each partition

Return type

list

select_model(df, target_col_name, partition_columns=None, parallel_over_columns=None, executor=None, include_rules=None, exclude_rules=None, output_path='', persist_cv_results=False, persist_cv_data=False, persist_model_reprs=False, persist_best_model=False, persist_partition=False, persist_model_selector_results=False)

Run cross-validation on data and select the best model.

The best model is selected for each time series and stored in the attribute self.results; if wanted, the results are also persisted.

Parameters

df (pandas.DataFrame) – Container holding historical data for training
target_col_name (str) – Name of target column
partition_columns (list, tuple) – Column names based on which the data should be split up / partitioned
parallel_over_columns (list, tuple) – Subset of partition_columns over which the computation is split and run in parallel
executor (prefect.executors) – Provide prefect’s executor. Only valid when parallel_over_columns is set. For more information see https://docs.prefect.io/api/latest/engine/executors.html
include_rules (dict) – Dictionary with keys being column names and values being list of values to include in the output.
exclude_rules (dict) – Dictionary with keys being column names and values being list of values to exclude from the output.
output_path (str) – Path to directory for storing the output, default behavior is current working directory
persist_cv_results (bool) – If True cv_results of sklearn.model_selection.GridSearchCV as pandas df will be saved as pickle for each partition
persist_cv_data (bool) – If True the pandas df detail cv data will be saved as pickle for each partition
persist_model_reprs (bool) – If True model reprs will be saved as json for each partition
persist_best_model (bool) – If True best model will be saved as pickle for each partition
persist_partition (bool) – If True dictionary of partition label will be saved as json for each partition
persist_model_selector_results (bool) – If True ModelSelectorResult with all important information will be saved as pickle for each partition
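
The shape of include_rules and exclude_rules can be illustrated on plain rows. This is a sketch, not the library's implementation: `filter_rows` is a hypothetical helper, the sample columns are invented, and the way several rules are combined (a row must satisfy every include rule and match no exclude rule) is an assumption.

```python
def filter_rows(rows, include_rules=None, exclude_rules=None):
    """Keep rows matching all include rules and none of the exclude rules."""
    include_rules = include_rules or {}
    exclude_rules = exclude_rules or {}
    kept = []
    for row in rows:
        # Each rule maps a column name to the list of values it applies to.
        included = all(row[col] in values for col, values in include_rules.items())
        excluded = any(row[col] in values for col, values in exclude_rules.items())
        if included and not excluded:
            kept.append(row)
    return kept


# Invented sample data.
rows = [
    {"Store": 1, "Region": "DE"},
    {"Store": 2, "Region": "CZ"},
    {"Store": 3, "Region": "DE"},
]
selected = filter_rows(
    rows,
    include_rules={"Region": ["DE"]},
    exclude_rules={"Store": [3]},
)
```

In this example only the row for store 1 remains: store 2 fails the include rule and store 3 is removed by the exclude rule.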