Advanced Large Scale learning with ModelSelector

Very often we have many different products, regions, countries, shops…for which we need to delivery forecast. This can be easily done with ModelSelector

[1]:
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('seaborn')
plt.rcParams['figure.figsize'] = [12, 6]
[2]:
from hcrystalball.model_selection import ModelSelector
from hcrystalball.utils import get_sales_data
from hcrystalball.wrappers import get_sklearn_wrapper
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

Get Dummy Data

[3]:
df = get_sales_data(n_dates=365*2,
                    n_assortments=2,
                    n_states=2,
                    n_stores=2)#.assign(BelgiumHolidays="BE")
[4]:
df.head()
[4]:
Store Sales Open Promo SchoolHoliday StoreType Assortment Promo2 State HolidayCode
Date
2013-08-01 817 25013 True True True a a False BE DE-BE
2013-08-01 251 18633 True True True a c False NW DE-NW
2013-08-01 335 16324 True True True b a True NW DE-NW
2013-08-01 380 15092 True True True a a True NW DE-NW
2013-08-01 788 19788 True True True a c False BE DE-BE

Get predefined sklearn models, holidays and exogenous variables

Here for the sake of time, we will use the advantage of the create_gridsearch method for cv splits, default scorer etc. and just extend empty grid with two models

[5]:
ms = ModelSelector(
    frequency='D',
    horizon=10,
    country_code_column='HolidayCode',
#     country_code_column=['HolidayCode','BelgiumHolidays'])
)
[6]:
# see full default parameter grid in hands on exercise
ms.create_gridsearch(
    n_splits=2,
    between_split_lag=5, # create overlapping cv_splits
    sklearn_models=False,
    sklearn_models_optimize_for_horizon=False,
    autosarimax_models=False,
    prophet_models=False,
    tbats_models=False,
    exp_smooth_models=False,
    average_ensembles=False,
    stacking_ensembles=False,
    exog_cols=['Open','Promo','SchoolHoliday','Promo2'],
    holidays_days_before=2,
    holidays_days_after=1,
    holidays_bridge_days=True,
)
[7]:
ms.add_model_to_gridsearch(get_sklearn_wrapper(LinearRegression, hcb_verbose=False))
ms.add_model_to_gridsearch(get_sklearn_wrapper(RandomForestRegressor, random_state=42, hcb_verbose=False))

Run model selection with partitions

This can be done within classical for loop that enables you to see progress bar, or within parallelized prefect flow in case you would define parallel_over_columns, which must be subset of partition_columns and optionally add executor to point to your running dask cluster. Default uses LocalExecutor, you might also try LocalDaskExecutor, that prefect will spin up for you DaskExecutor if you have one already running and you want to connect to it.

[8]:
df
[8]:
Store Sales Open Promo SchoolHoliday StoreType Assortment Promo2 State HolidayCode
Date
2013-08-01 817 25013 True True True a a False BE DE-BE
2013-08-01 251 18633 True True True a c False NW DE-NW
2013-08-01 335 16324 True True True b a True NW DE-NW
2013-08-01 380 15092 True True True a a True NW DE-NW
2013-08-01 788 19788 True True True a c False BE DE-BE
... ... ... ... ... ... ... ... ... ... ...
2015-07-31 523 15349 True True True c c False BE DE-BE
2015-07-31 513 19959 True True True a a False BE DE-BE
2015-07-31 380 17133 True True True a a True NW DE-NW
2015-07-31 335 17867 True True True b a True NW DE-NW
2015-07-31 251 22205 True True True a c False NW DE-NW

5840 rows × 10 columns

[9]:
# from prefect.engine.executors import LocalDaskExecutor
ms.select_model(df=df,
                target_col_name='Sales',
                partition_columns=['Assortment', 'State', 'Store'],
#                 parallel_over_columns=['Assortment'],
#                 executor = LocalDaskExecutor(),
               )
[10]:
ms.get_partitions(as_dataframe=True)
[10]:
Assortment State Store
0 a BE 513
1 a BE 817
2 a NW 335
3 a NW 380
4 c BE 523
5 c BE 788
6 c NW 251
7 c NW 756
[11]:
ms.plot_results(partitions=ms.partitions[:2], plot_from='2015-06');
[11]:
[[<AxesSubplot:title={'center':'Assortment=a State=BE Store=513 | (cv_split=0, mae=1382.88)'}>,
  <AxesSubplot:title={'center':'Assortment=a State=BE Store=513 | (cv_split=1, mae=1745.12)'}>],
 [<AxesSubplot:title={'center':'Assortment=a State=BE Store=817 | (cv_split=0, mae=1417.65)'}>,
  <AxesSubplot:title={'center':'Assortment=a State=BE Store=817 | (cv_split=1, mae=1540.06)'}>]]
../../../_images/examples_tutorial_model_selection_03_model_selector_advanced_14_1.png
../../../_images/examples_tutorial_model_selection_03_model_selector_advanced_14_2.png
../../../_images/examples_tutorial_model_selection_03_model_selector_advanced_14_3.png
../../../_images/examples_tutorial_model_selection_03_model_selector_advanced_14_4.png