ModelSelector
-
class hcrystalball.model_selection.ModelSelector(horizon, frequency, country_code_column=None)
Bases: object
Make large-scale cross-validation easily accessible.
Use the create_gridsearch and select_model methods to run cross-validation and to present or persist all relevant information.
- Parameters
horizon (int) – How many steps ahead the predictions during model selection are to be made
frequency (str) – Temporal frequency of the data which is to be used in model selection. Data with a different frequency will be resampled to this frequency.
country_code_column (str) – Name of the column with ISO code of country/region, which can be used for supplying holidays. If used, later provided data must have daily frequency. e.g. ‘State’ with values like ‘DE’, ‘CZ’ or ‘Region’ with values like ‘DE-NW’, ‘DE-HE’, etc.
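A minimal sketch of the workflow this class supports, assuming hcrystalball is installed and df is a pandas.DataFrame with a datetime index. The argument values and the helper name run_model_selection are illustrative assumptions; only the constructor and method signatures come from this page.

```python
# Illustrative values only; choose them to match your data.
SELECTOR_KWARGS = {"horizon": 7, "frequency": "D"}
GRIDSEARCH_KWARGS = {
    "n_splits": 3,
    "prophet_models": False,
    "exp_smooth_models": True,
}


def run_model_selection(df, target_col_name, partition_columns=None):
    """Hypothetical helper: build a selector, define the grid search, pick best models."""
    # Imported lazily so this module still loads where hcrystalball is absent.
    from hcrystalball.model_selection import ModelSelector

    selector = ModelSelector(**SELECTOR_KWARGS)
    selector.create_gridsearch(**GRIDSEARCH_KWARGS)
    selector.select_model(
        df=df,
        target_col_name=target_col_name,
        partition_columns=partition_columns,
    )
    # The results attribute is available only after select_model has run.
    return selector.results
```

The call order matters: create_gridsearch defines the candidate models, and select_model then runs cross-validation for each partition.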
Attributes Summary
partitions – List of partitions the model selection was run on.
results – Results for each partition
stored_path – Path where the ModelSelector object was stored
Methods Summary
add_model_to_gridsearch(model) – Extend self.grid_search parameter grid with provided model.
create_gridsearch([n_splits, …]) – Create grid_search attribute (sklearn.model_selection.GridSearchCV) based on selection criteria
get_partitions([as_dataframe]) – Provide overview of partitions for which results are available
get_result_for_partition([partition]) – Provide result for given partition
persist_results([folder_path, …]) – Store expert files for each partition.
plot_best_wrapper_classes([title]) – Plot number of selected wrapper classes that were picked as best models
plot_results([partitions, plot_from]) – Plot training data and cv forecasts for each partition
select_model(df, target_col_name[, …]) – Run cross-validation on data and select the best model.
Attributes Documentation
-
partitions
List of partitions the model selection was run on.
Created only after calling select_model.
- Returns
List of dictionaries of partitions
- Return type
- Raises
ValueError – If select_model was not called before
-
results
Results for each partition
- Returns
List of ModelSelectorResult objects
- Return type
- Raises
ValueError – If select_model was not called before
-
stored_path
Path where the ModelSelector object was stored.
Created only after calling persist_results.
- Returns
Pathlike string to the folder containing stored ModelSelector object
- Return type
- Raises
ValueError – If persist_results was not called before
Methods Documentation
-
add_model_to_gridsearch(model)
Extend self.grid_search parameter grid with provided model.
Adds the given model or list of models to the grid search under the ‘model’ step
- Parameters
model (sklearn compatible model or list of sklearn compatible models) – model(s) to be added to provided grid search
-
create_gridsearch(n_splits=5, between_split_lag=None, scoring='neg_mean_absolute_error', country_code_column=None, country_code=None, sklearn_models=False, sklearn_models_optimize_for_horizon=False, autosarimax_models=False, autoarima_dict=None, prophet_models=True, tbats_models=False, exp_smooth_models=False, average_ensembles=False, stacking_ensembles=False, stacking_ensembles_train_horizon=10, stacking_ensembles_train_n_splits=20, clip_predictions_lower=None, clip_predictions_upper=None, exog_cols=None)
Create grid_search attribute (sklearn.model_selection.GridSearchCV) based on selection criteria
- Parameters
n_splits (int) – How many cross-validation folds should be used in model selection
between_split_lag (int) – Lag, in number of observations, between consecutive cv_splits. If kept as None, horizon is used, resulting in non-overlapping cv_splits
scoring (str, callable) – String of sklearn regression metric name, or hcrystalball compatible scorer. For creation of hcrystalball compatible scorer use make_ts_scorer function.
country_code_column (str) – Column in data, that contains country code in str (e.g. ‘DE’). Used in holiday transformer. Only one of country_code_column or country_code can be set.
country_code (str) – Country code in str (e.g. ‘DE’). Used in holiday transformer. Only one of country_code_column or country_code can be set.
sklearn_models (bool) – Whether to consider sklearn models
sklearn_models_optimize_for_horizon (bool) – Whether to add, on top of the default sklearn behavior, models that optimize predictions for each horizon
autosarimax_models (bool) – Whether to consider auto sarimax models
autoarima_dict (dict) – Specification of the pmdarima AutoARIMA search space
prophet_models (bool) – Whether to consider FB prophet models
exp_smooth_models (bool) – Whether to consider exponential smoothing models
average_ensembles (bool) – Whether to consider average ensemble models
stacking_ensembles (bool) – Whether to consider stacking ensemble models
stacking_ensembles_train_horizon (int) – Which horizon should be used in meta model in stacking ensembles
stacking_ensembles_train_n_splits (int) – Number of splits used in meta model in stacking ensembles
clip_predictions_lower (float, int) – Minimal number allowed in the predictions
clip_predictions_upper (float, int) – Maximal number allowed in the predictions
exog_cols (list) – List of columns to be used as exogenous variables
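To illustrate how n_splits, horizon and between_split_lag interact, here is a standalone sketch of rolling-origin splits. It is not hcrystalball's splitter: the function name rolling_origin_splits and the choice to anchor the last test window at the end of the series are assumptions made for illustration.

```python
def rolling_origin_splits(n_samples, n_splits, horizon, between_split_lag=None):
    """Return (train_indices, test_indices) pairs, oldest split first.

    Assumption: each test window is `horizon` observations long and the
    windows are shifted against each other by `between_split_lag`.
    """
    # Mirror the documented default: with None, the lag equals the horizon,
    # so consecutive test windows do not overlap.
    lag = horizon if between_split_lag is None else between_split_lag
    splits = []
    for offset in range(n_splits - 1, -1, -1):
        test_end = n_samples - offset * lag
        test_start = test_end - horizon
        if test_start <= 0:
            raise ValueError("not enough observations for the requested splits")
        splits.append((range(0, test_start), range(test_start, test_end)))
    return splits
```

With 20 observations, 3 splits and a horizon of 4, the default lag gives disjoint test windows, while between_split_lag=2 makes neighbouring windows share two observations.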
-
get_partitions(as_dataframe=False)
Provide overview of partitions for which results are available
- Parameters
as_dataframe (bool) – Whether to return partitions as a pandas.DataFrame; if False, a list of dicts is returned
- Returns
partitions available in model selector results
- Return type
pandas.DataFrame, list[dict, dict, ..]
-
get_result_for_partition(partition=None)
Provide result for given partition
- Parameters
partition (str, dict) – partition_hash or partition_dict of the data to which the result is tied
- Returns
result of model selection for given partition
- Return type
- Raises
ValueError – if partition is not present in the results
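A short sketch of how get_partitions and get_result_for_partition might be combined to walk over all results. The helper name results_by_partition is hypothetical; it relies only on the two methods documented above and expects a ModelSelector on which select_model has already been run.

```python
def results_by_partition(model_selector):
    """Pair every partition dict with its result (hypothetical helper)."""
    pairs = []
    # as_dataframe=False yields the partitions as a list of dicts.
    for partition in model_selector.get_partitions(as_dataframe=False):
        # A partition dict (or its hash) identifies the result; per the
        # documentation, a ValueError is raised for an unknown partition.
        result = model_selector.get_result_for_partition(partition)
        pairs.append((partition, result))
    return pairs
```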
-
persist_results(folder_path='results', persist_cv_results=False, persist_cv_data=False, persist_model_reprs=False, persist_best_model=False, persist_partition=False, persist_model_selector_results=True)
Store expert files for each partition.
The file names follow {partition_hash}.{expert_type}, e.g. 795dab1813f05b1abe9ae6ded93e1ec4.cv_data
Stores the value of the folder_path argument to self.stored_path
- Parameters
folder_path (str) – Path to the directory where expert files are stored, by default ‘results’
persist_cv_results (bool) – If True, cv_results of sklearn.model_selection.GridSearchCV as pandas df will be saved as pickle for each partition
persist_cv_data (bool) – If True, the pandas df detail cv data will be saved as pickle for each partition
persist_model_reprs (bool) – If True model reprs will be saved as json for each partition
persist_best_model (bool) – If True best model will be saved as pickle for each partition
persist_partition (bool) – If True dictionary of partition label will be saved as json for each partition
persist_model_selector_results (bool) – If True ModelSelectorResult with all important information will be saved as pickle for each partition
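The naming scheme {partition_hash}.{expert_type} can be sketched with the standard library. How hcrystalball derives the hash is not stated here, so the MD5-over-sorted-JSON scheme in partition_hash below is an assumption (chosen only because it also yields 32 hexadecimal characters), as are both helper names.

```python
import hashlib
import json
from pathlib import Path


def partition_hash(partition):
    """Hypothetical hash of a partition dict; the library's scheme may differ."""
    payload = json.dumps(partition, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()


def expert_file_path(folder_path, partition, expert_type):
    """Build '<folder_path>/<partition_hash>.<expert_type>' for one partition."""
    return Path(folder_path) / f"{partition_hash(partition)}.{expert_type}"
```

For example, expert_file_path("results", {"Region": "DE-NW"}, "cv_data") points to a file inside the results folder whose stem is the hash and whose suffix is .cv_data.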
-
plot_best_wrapper_classes(title='Most often selected classes', **plot_params)
Plot number of selected wrapper classes that were picked as best models
- Parameters
title (str) – Title of the plot
-
plot_results(partitions=None, plot_from=None, **plot_params)
Plot training data and cv forecasts for each partition
-
select_model(df, target_col_name, partition_columns=None, parallel_over_columns=None, executor=None, include_rules=None, exclude_rules=None, country_code_column=None, output_path='', persist_cv_results=False, persist_cv_data=False, persist_model_reprs=False, persist_best_model=False, persist_partition=False, persist_model_selector_results=False)
Run cross-validation on data and select the best model.
Best models are selected for each time series, stored in the attribute self.results and, if wanted, also persisted.
- Parameters
df (pandas.DataFrame) – Container holding historical data for training
target_col_name (str) – Name of target column
partition_columns (list, tuple) – Column names based on which the data should be split up / partitioned
parallel_over_columns (list, tuple) – Subset of partition_columns used to split the work for parallel execution.
executor (prefect.engine.executors) – Provide prefect’s executor. Only valid when parallel_over_columns is set. For more information see https://docs.prefect.io/api/latest/engine/executors.html
include_rules (dict) – Dictionary with keys being column names and values being list of values to include in the output.
exclude_rules (dict) – Dictionary with keys being column names and values being list of values to exclude from the output.
country_code_column (str) – Name of the column with country code, which can be used for supplying holidays (i.e. having gridsearch with HolidayTransformer with argument country_code_column set to this one).
output_path (str) – Path to directory for storing the output, default behavior is current working directory
persist_cv_results (bool) – If True cv_results of sklearn.model_selection.GridSearchCV as pandas df will be saved as pickle for each partition
persist_cv_data (bool) – If True the pandas df detail cv data will be saved as pickle for each partition
persist_model_reprs (bool) – If True model reprs will be saved as json for each partition
persist_best_model (bool) – If True best model will be saved as pickle for each partition
persist_partition (bool) – If True dictionary of partition label will be saved as json for each partition
persist_model_selector_results (bool) – If True ModelSelectorResult with all important information will be saved as pickle for each partition
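The roles of partition_columns, include_rules and exclude_rules can be illustrated on plain dictionaries. This is a standalone sketch, not the library's implementation: the helper names and the row-as-dict representation are assumptions, as is the choice that a row must satisfy every include rule. The real method operates on a pandas.DataFrame.

```python
from collections import defaultdict


def filter_rows(rows, include_rules=None, exclude_rules=None):
    """Keep rows matching every include rule and no exclude rule (assumed semantics)."""
    include_rules = include_rules or {}
    exclude_rules = exclude_rules or {}
    kept = []
    for row in rows:
        if any(row[col] not in allowed for col, allowed in include_rules.items()):
            continue
        if any(row[col] in banned for col, banned in exclude_rules.items()):
            continue
        kept.append(row)
    return kept


def split_into_partitions(rows, partition_columns):
    """Group rows by their values in partition_columns; one group per time series."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in partition_columns)
        groups[key].append(row)
    return dict(groups)
```

Each resulting group corresponds to one time series for which a best model would be selected.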