get_sales_data

hcrystalball.utils.get_sales_data(n_dates=100, n_assortments=2, n_states=3, n_stores=3)

Load subset of Rossmann store sales dataset.

This function loads a subset of the Rossmann store sales dataset from https://www.kaggle.com/c/rossmann-store-sales, restricted to the 100 stores with the highest overall sales. The data covers stores in Germany for the date range 2015-04-23 to 2015-07-31.

The data is returned as a pandas.DataFrame:

  • Date - DataFrame index, date of recorded sales numbers

  • Store - a unique Id for each store

  • Sales - the turnover for any given day (this is what you are predicting)

  • Open - an indicator for whether the store was open: 0 = closed, 1 = open

  • Promo - indicates whether a store is running a promo on that day

  • SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools

  • StoreType - differentiates between 4 different store models: a, b, c, d

  • Assortment - describes an assortment level: a = basic, b = extra, c = extended

  • Promo2 - a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating

  • State - String code for state in Germany that the store is in (see https://en.wikipedia.org/wiki/States_of_Germany)

  • HolidayCode - the State prefixed with DE-.

Assortment, State and Store serve as data partitioning columns. HolidayCode is used to provide country-specific holidays for the given Date. Open, Promo, Promo2 and SchoolHoliday serve as exogenous variables. Sales is the target column to predict.
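The column roles described above can be sketched with plain pandas. This is a minimal illustration, not the library's own workflow: the rows below are made-up values shaped like the returned frame, and the names `partition_columns`, `exogenous_columns`, `target_column` and `series` are hypothetical helpers introduced only for this example.

```python
import pandas as pd

# Hypothetical rows shaped like the frame described above (values are made up).
df = pd.DataFrame(
    {
        "Store": [1, 1, 2, 2],
        "Sales": [100, 120, 80, 90],
        "Open": [True, True, True, True],
        "Promo": [False, True, False, True],
        "SchoolHoliday": [False, False, False, False],
        "StoreType": ["a", "a", "b", "b"],
        "Assortment": ["a", "a", "c", "c"],
        "Promo2": [False, False, True, True],
        "State": ["HE", "HE", "NW", "NW"],
    },
    index=pd.DatetimeIndex(
        ["2015-04-23", "2015-04-24", "2015-04-23", "2015-04-24"], name="Date"
    ),
)

# HolidayCode is the State prefixed with "DE-".
df["HolidayCode"] = "DE-" + df["State"]

# Roles of the columns, as listed in the description above.
partition_columns = ["Assortment", "State", "Store"]
exogenous_columns = ["Open", "Promo", "Promo2", "SchoolHoliday"]
target_column = "Sales"

# One (X, y) pair per partition, i.e. per individual time series.
series = {}
for key, part in df.groupby(partition_columns):
    X = part[exogenous_columns]  # exogenous variables
    y = part[target_column]      # target to predict
    series[key] = (X, y)
```

Each key of `series` is one (Assortment, State, Store) combination, so every entry holds a single time series indexed by Date.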

Parameters
  • n_dates (int) – Number of days to be included for each series

  • n_assortments (int) – Number of assortments to include

  • n_states (int) – Number of states to include

  • n_stores (int) – Number of stores to include

Example

>>> get_sales_data()
            Store  Sales  Open  Promo  SchoolHoliday StoreType Assortment  Promo2 State HolidayCode
Date
2015-04-23    906   8162  True  False          False         a          a   False    HE       DE-HE
2015-04-23    251  16573  True  False          False         a          c   False    NW       DE-NW
2015-04-23    320  13114  True  False          False         a          c   False    SH       DE-SH
2015-04-23    335  11189  True  False          False         b          a    True    NW       DE-NW
2015-04-23    336  10184  True  False          False         a          a   False    HE       DE-HE
...           ...    ...   ...    ...            ...       ...        ...     ...   ...         ...
2015-07-31    817  23093  True   True           True         a          a   False    BE       DE-BE
2015-07-31    831  15152  True   True           True         a          a   False    NW       DE-NW
2015-07-31    906  15131  True   True           True         a          a   False    HE       DE-HE
2015-07-31    586  17879  True   True           True         a          c   False    NW       DE-NW
2015-07-31    251  22205  True   True           True         a          c   False    NW       DE-NW

Returns

Rossmann store sales subset, see description above.

Return type

pandas.DataFrame

Raises

ValueError – Raised if the number of assortments is higher than the dataset holds, if there are fewer than the requested number of states within any assortment, or if there are not enough valid combinations of assortments, states and stores.
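Because an overly large request raises ValueError, a caller may want to fall back to a smaller one. The sketch below shows one possible pattern. `load_first_available` is a hypothetical helper, and `fake_loader` is a stand-in with a made-up limit so the example runs without the library installed. Neither is part of hcrystalball.

```python
def load_first_available(loader, candidates):
    """Return (kwargs, data) for the first candidate the loader accepts."""
    errors = []
    for kwargs in candidates:
        try:
            return kwargs, loader(**kwargs)
        except ValueError as exc:
            # Remember why this request failed and try the next one.
            errors.append(f"{kwargs}: {exc}")
    raise ValueError("no candidate could be loaded:\n" + "\n".join(errors))


# Hypothetical stand-in for get_sales_data; the limit of 3 is made up and
# only mimics "raise ValueError when more is requested than is available".
def fake_loader(n_assortments=2, n_states=3, n_stores=3):
    if n_assortments > 3:
        raise ValueError("more assortments requested than available")
    return {
        "n_assortments": n_assortments,
        "n_states": n_states,
        "n_stores": n_stores,
    }


# Requests are tried in order, from the largest to the smallest.
chosen, data = load_first_available(
    fake_loader,
    [{"n_assortments": 5}, {"n_assortments": 3}, {"n_assortments": 2}],
)
```

With the library installed, `get_sales_data` could be passed in place of `fake_loader`, since the helper only relies on keyword arguments and on ValueError being raised.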