forecast_functions module

Functions for forecasting the House Price Index (HPI) with an ARX model.

It includes two forecast performance evaluation tests:

  • Testing the equality of prediction mean squared errors: David I. Harvey, Stephen J. Leybourne, Paul Newbold (1997)

  • Tests for forecast encompassing: David I. Harvey, Stephen J. Leybourne, Paul Newbold (1998)

GTBpy.forecast_functions.lags_list_function(lags, max_lag)

Generate a list of lag combinations based on the specified lags parameter and maximum lag value.

Parameters

lags : {‘Auto’, ‘glob’, list of int}

Determines the type of lag combinations to generate:

  • ‘Auto’: Automatically generates sequential lags up to max_lag.

  • ‘glob’: Generates all possible combinations of lags up to max_lag.

  • list of int: Uses the specified list of lags directly.

max_lag : int

The maximum lag value to consider when generating lag combinations.

Returns

lags_list : list of list of int

A list where each element is a list of integers representing a combination of lag values. The specific combinations depend on the input lags parameter.

Examples

>>> lags_list_function('Auto', 3)
[[], [1], [1, 2], [1, 2, 3]]
>>> lags_list_function('glob', 2)
[[], [1], [2], [1, 2]]
>>> lags_list_function([1, 2], 3)
[[1, 2]]
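
The two string modes can be pictured with a few lines of standard-library Python. The sketch below is not GTBpy's implementation: `sketch_lags_list` is a hypothetical stand-in written only to reproduce the documented outputs above, and its handling of an explicit list is inferred from the third example.

```python
from itertools import combinations

def sketch_lags_list(lags, max_lag):
    """Hypothetical re-creation of the documented behaviour (not GTBpy code)."""
    if lags == 'Auto':
        # Nested prefixes: [], [1], [1, 2], ..., [1, ..., max_lag]
        return [list(range(1, k + 1)) for k in range(max_lag + 1)]
    if lags == 'glob':
        # Every subset of {1, ..., max_lag}, ordered by subset size
        candidates = range(1, max_lag + 1)
        return [list(c) for size in range(max_lag + 1)
                for c in combinations(candidates, size)]
    # An explicit list is wrapped as the single candidate
    return [list(lags)]

print(sketch_lags_list('Auto', 3))
print(sketch_lags_list('glob', 2))
print(len(sketch_lags_list('glob', 13)))
```

Under this reading, ‘Auto’ yields max_lag + 1 candidates while ‘glob’ yields 2**max_lag, so ‘glob’ becomes expensive quickly as max_lag grows.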

GTBpy.forecast_functions.result_table(index, header, h_list, index_name, first_index='None')

Create a multi-indexed DataFrame with specified index and column headers.

Parameters

index : list of str

A list of labels for the DataFrame’s index.

header : str

The header label for the DataFrame’s columns.

h_list : list of int

A list of integers that will be appended to the header label to create the column names.

index_name : str

The name to assign to the DataFrame’s index.

first_index : str, optional

The label for the first index position. Defaults to ‘None’.

Returns

df : pandas.DataFrame

A DataFrame with a MultiIndex for the columns, where the first level is the header and the second level corresponds to ‘h=’ followed by each element in h_list. The DataFrame’s index is set to the provided index list, with an optional first_index prepended.

Examples

>>> result_table(['A', 'B', 'C'], 'Metric', [1, 2, 3], 'Category')
    Metric          
        h=1  h=2  h=3
None   NaN   NaN   NaN
A      NaN   NaN   NaN
B      NaN   NaN   NaN
C      NaN   NaN   NaN
>>> result_table(['X', 'Y'], 'Value', [1, 2], 'Type', 'Start')
    Value      
        h=1  h=2
Start   NaN   NaN
X       NaN   NaN
Y       NaN   NaN
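
A table of this shape can be built directly with pandas. The following is a minimal sketch under the assumption that the function only assembles an empty, NaN-filled frame; `sketch_result_table` is a hypothetical name and the real implementation may differ in details such as dtype or how the index name is displayed.

```python
import numpy as np
import pandas as pd

def sketch_result_table(index, header, h_list, index_name, first_index='None'):
    """Hypothetical stand-in: empty table with two-level column labels."""
    # Two-level columns: (header, 'h=1'), (header, 'h=2'), ...
    columns = pd.MultiIndex.from_product([[header], [f'h={h}' for h in h_list]])
    # Row labels: the first_index label followed by the supplied labels
    rows = pd.Index([first_index] + list(index), name=index_name)
    return pd.DataFrame(np.nan, index=rows, columns=columns)

table = sketch_result_table(['A', 'B', 'C'], 'Metric', [1, 2, 3], 'Category')
# Cells are addressed by row label and a (header, horizon) column tuple
table.loc['A', ('Metric', 'h=1')] = 0.03
print(table)
```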

GTBpy.forecast_functions.compute_ic_cv(y, X, metric, cv_criteria='MSFE', fit_intercept=True, cv=5, shuffle=False, n_iter=20, seed=None)

Compute the information criterion (IC) or cross-validation (CV) score based on the specified metric and criteria.

Parameters

y : array-like or pandas.Series

The dependent variable vector (target).

X : array-like or pandas.DataFrame

The independent variable matrix (features).

metric : {‘CV’, ‘IC’}

The type of metric to compute:

  • ‘CV’: Cross-validation metric based on the cv_criteria.

  • ‘IC’: Information criterion (e.g., BIC).

cv_criteria : {‘MSFE’, ‘MAFE’}, optional

The criterion used to compute the CV score when metric is ‘CV’:

  • ‘MSFE’: Mean Squared Forecast Error (negative mean squared error).

  • ‘MAFE’: Mean Absolute Forecast Error (negative mean absolute error).

Default is ‘MSFE’.

fit_intercept : bool, optional

Whether to calculate the intercept for the linear model. If set to False, no intercept will be used in calculations. Default is True.

cv : int, optional

The number of folds in cross-validation. Default is 5.

shuffle : bool, optional

Whether to shuffle the data before splitting into batches in cross-validation. Default is False.

n_iter : int, optional

The number of iterations for cross-validation when shuffle is True. Default is 20.

seed : int, optional

The random seed for reproducibility when shuffling data in cross-validation. Default is None.

Returns

ic : float

The computed information criterion (IC) or cross-validation (CV) score.

Examples

>>> compute_ic_cv(y, X, metric='CV', cv_criteria='MSFE', cv=5)
0.045
>>> compute_ic_cv(y, X, metric='IC')
210.34
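
To make the two metrics concrete, here is a minimal NumPy sketch of one common way to compute them for a linear model. It is written under stated assumptions and is not GTBpy's code: the BIC form n·ln(RSS/n) + k·ln(n), the unshuffled contiguous folds, and the positive sign of the returned MSFE are choices made for illustration, and the names `sketch_bic` and `sketch_cv_msfe` are hypothetical.

```python
import numpy as np

def _design(X, fit_intercept=True):
    """Return the regressor matrix, with a leading column of ones if requested."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    if fit_intercept:
        X = np.column_stack([np.ones(len(X)), X])
    return X

def sketch_bic(y, X, fit_intercept=True):
    """Assumed BIC form for OLS: n * ln(RSS / n) + k * ln(n)."""
    y = np.asarray(y, dtype=float)
    Z = _design(X, fit_intercept)
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    rss = float(np.sum((y - Z @ beta) ** 2))
    n, k = Z.shape
    return n * np.log(rss / n) + k * np.log(n)

def sketch_cv_msfe(y, X, cv=5, fit_intercept=True):
    """Mean squared error averaged over contiguous, unshuffled folds."""
    y = np.asarray(y, dtype=float)
    Z = _design(X, fit_intercept)
    fold_errors = []
    for test_idx in np.array_split(np.arange(len(y)), cv):
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False            # hold out this fold
        beta, *_ = np.linalg.lstsq(Z[train], y[train], rcond=None)
        fold_errors.append(np.mean((y[test_idx] - Z[test_idx] @ beta) ** 2))
    return float(np.mean(fold_errors))

# Synthetic data: a regressor that truly drives y should score better
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=200)
no_regressors = np.empty((200, 0))         # intercept-only model
print(sketch_bic(y, x) < sketch_bic(y, no_regressors))
print(sketch_cv_msfe(y, x) < sketch_cv_msfe(y, no_regressors))
```

With either metric, a lower value indicates the preferred specification in this sketch.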

GTBpy.forecast_functions.lag_selector(df, lag_select, seed=None, cv_criteria='MSFE', cv=5, shuffle=False, n_iter=20, h=1, max_lag=13, exog=None, var_order='cross', y_lags='Auto', exog_lags='Auto', seasonal=False, verbose=0)

Select optimal lags for the dependent and exogenous variables using information criteria (IC) or cross-validation (CV) metrics.

Parameters

df : pandas.DataFrame

The input DataFrame containing the time series data. The first column is assumed to be the dependent variable (‘y’).

lag_select : {‘IC’, ‘CV’}

The metric used to select the optimal lags:

  • ‘IC’: Information criterion (e.g., BIC).

  • ‘CV’: Cross-validation score based on cv_criteria.

seed : int, optional

Random seed for reproducibility in cross-validation. Default is None.

cv_criteria : {‘MSFE’, ‘MAFE’}, optional

The criterion used for cross-validation when lag_select is ‘CV’:

  • ‘MSFE’: Mean Squared Forecast Error (default).

  • ‘MAFE’: Mean Absolute Forecast Error.

cv : int, optional

Number of folds for cross-validation. Default is 5.

shuffle : bool, optional

Whether to shuffle the data before splitting into batches for cross-validation. Default is False.

n_iter : int, optional

Number of iterations for cross-validation when shuffle is True. Default is 20.

h : int, optional

Forecast horizon, indicating how many steps ahead the model is predicting. Default is 1.

max_lag : int, optional

The maximum lag order to consider for selection. Default is 13.

exog : str, optional

Name of the exogenous variable in the DataFrame. Default is None.

var_order : {‘cross’, ‘nested’}, optional

The order in which lag selection is performed:

  • ‘cross’: Cross all combinations of lags for ‘y’ and ‘exog’.

  • ‘nested’: Select the best lags for ‘y’ first, then choose the best lags for ‘exog’ based on the selected ‘y’ lags.

Default is ‘cross’.

y_lags : {‘Auto’, ‘glob’, list of int}, optional

The lags to consider for the dependent variable:

  • ‘Auto’: Automatically generate lag sequences up to max_lag.

  • ‘glob’: Generates all possible combinations of lags up to max_lag.

  • list of int: Use the specified lags directly.

Default is ‘Auto’.

exog_lags : {‘Auto’, ‘glob’, list of int}, optional

The lags to consider for the exogenous variable, if applicable:

  • ‘Auto’: Automatically generate lag sequences up to max_lag.

  • ‘glob’: Generates all possible combinations of lags up to max_lag.

  • list of int: Use the specified lags directly.

Default is ‘Auto’.

seasonal : bool, optional

Whether to include seasonal dummies (monthly) in the model. Default is False.

verbose : int, optional

If greater than 1, prints detailed information about the lag selection process. Default is 0.

Returns

y_lags : list of int

The optimal lags for the dependent variable (‘y’) based on the specified metric.

exog_lags : list of int

The optimal lags for the exogenous variable based on the specified metric.

IC : float

The minimum information criterion (IC) or cross-validation score obtained.

ics : dict

A dictionary where keys are the IC/CV scores and values are the corresponding lags for ‘y’ and ‘exog’.

Examples

>>> lag_selector(df, lag_select='CV', cv_criteria='MSFE', h=1)
([1, 2, 3], [1], 0.034, {...})
>>> lag_selector(df, lag_select='IC', exog='ExogVar', h=2, max_lag=5)
([1, 3], [1], 210.45, {...})
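
The difference between the ‘cross’ and ‘nested’ search orders can be illustrated with a toy scoring function. The sketch below is an assumption-laden illustration, not GTBpy's code: `sketch_select` and `toy_score` are hypothetical, lower scores are taken to be better, and the ‘nested’ first stage is assumed to score the ‘y’ lags with no exogenous lags.

```python
from itertools import product

def sketch_select(y_cands, x_cands, score, var_order='cross'):
    """Return (best_y_lags, best_exog_lags, best_score); lower is better."""
    if var_order == 'cross':
        # Score every (y lags, exog lags) pair
        pairs = product(y_cands, x_cands)
    else:
        # 'nested': fix the best y lags first, then search exog lags only
        best_y = min(y_cands, key=lambda yl: score(yl, []))
        pairs = ((best_y, xl) for xl in x_cands)
    best_y, best_x = min(pairs, key=lambda pair: score(*pair))
    return best_y, best_x, score(best_y, best_x)

def toy_score(y_lags, x_lags):
    """Hypothetical criterion with an interaction that only 'cross' can find."""
    base = (len(y_lags) - 2) ** 2 + (len(x_lags) - 1) ** 2
    bonus = 3 if (len(y_lags), len(x_lags)) == (3, 2) else 0
    return base - bonus

cands = [[], [1], [1, 2], [1, 2, 3]]
print(sketch_select(cands, cands, toy_score, 'cross'))
print(sketch_select(cands, cands, toy_score, 'nested'))
```

In this toy setting ‘cross’ scores 16 pairs and finds the jointly best combination, while ‘nested’ scores 8 candidates and can miss it. That is the usual trade-off between exhaustiveness and cost.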

GTBpy.forecast_functions.model(df, h=1, max_lag=3, exog=None, seasonal=False, lag_select='IC', seed=None, cv=5, shuffle=False, n_iter=20, y_lags='Auto', exog_lags='Auto', var_order='cross', train_cut=0.8, verbose=0, log=False, plot=False, original_hpi=None, original_scale=True)

Fit an autoregressive exogenous (ARX) model with selected lags using information criteria (IC) or cross-validation (CV) for dependent and independent variables, and evaluate its performance.

Parameters

df : pandas.DataFrame

The input DataFrame containing the time series data. The first column is assumed to be the dependent variable (‘y’).

h : int, optional

Forecast horizon, indicating how many steps ahead the model is predicting. Default is 1.

max_lag : int, optional

The maximum lag order to consider for selection. Default is 3.

exog : str, optional

Name of the exogenous variable in the DataFrame. Default is None.

seasonal : bool, optional

Whether to include seasonal dummies (monthly) in the model. Default is False.

lag_select : {‘IC’, ‘CV’}, optional

The metric used to select the optimal lags:

  • ‘IC’: Information criterion (e.g., BIC).

  • ‘CV’: Cross-validation score based on cv_criteria.

Default is ‘IC’.

seed : int, optional

Random seed for reproducibility in cross-validation. Default is None.

cv : int, optional

Number of folds for cross-validation. Default is 5.

shuffle : bool, optional

Whether to shuffle the data before splitting into batches for cross-validation. Default is False.

n_iter : int, optional

Number of iterations for cross-validation when shuffle is True. Default is 20.

y_lags : {‘Auto’, ‘glob’, list of int}, optional

The lags to consider for the dependent variable:

  • ‘Auto’: Automatically generate lag sequences up to max_lag.

  • ‘glob’: Generates all possible combinations of lags up to max_lag.

  • list of int: Use the specified lags directly.

Default is ‘Auto’.

exog_lags : {‘Auto’, ‘glob’, list of int}, optional

The lags to consider for the exogenous variable, if applicable:

  • ‘Auto’: Automatically generate lag sequences up to max_lag.

  • ‘glob’: Generates all possible combinations of lags up to max_lag.

  • list of int: Use the specified lags directly.

Default is ‘Auto’.

var_order : {‘cross’, ‘nested’}, optional

The order in which lag selection is performed:

  • ‘cross’: Cross all combinations of lags for ‘y’ and ‘exog’.

  • ‘nested’: Select the best lags for ‘y’ first, then choose the best lags for ‘exog’ based on the selected ‘y’ lags.

Default is ‘cross’.

train_cut : float or str or datetime-like, optional

The cutoff point for splitting the data into training and validation sets. Can be a float between 0 and 1 representing the proportion of the data to use for training, or a specific index value. Default is 0.8.

verbose : int, optional

Level of verbosity for debugging or detailed output. Default is 0.

log : bool, optional

If True, the data is assumed to be log-transformed and predictions will be back-transformed. Default is False.

plot : bool, optional

If True, plots the actual vs. predicted values with a vertical line indicating the training/validation split. Default is False.

original_hpi : pandas.Series, optional

Original HPI (House Price Index) values for back-transforming predictions. Only used if original_scale is True. Default is None.

original_scale : bool, optional

Whether to return the predictions on the original scale (before any transformations). Default is True.

Returns

MAFE_val : float

Mean Absolute Forecast Error on the validation set.

MSFE_val : float

Mean Squared Forecast Error on the validation set.

MAFE_train : float

Mean Absolute Forecast Error on the training set.

MSFE_train : float

Mean Squared Forecast Error on the training set.

lags_set : dict

Dictionary containing the selected lags for ‘y’ and ‘exog’.

IC : float

The minimum information criterion (IC) or cross-validation score obtained during lag selection.

res_full : pandas.Series

The full set of predictions for the dependent variable, including both training and validation periods.

Examples

>>> MAFE_val, MSFE_val, MAFE_train, MSFE_train, lags_set, IC, res_full = model(df, h=1, max_lag=3, exog='ExogVar')
>>> print(MAFE_val, MSFE_val, lags_set)
0.034 0.002 {'y lags': [1, 2], 'exog lags': [1]}
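
For readers unfamiliar with ARX regressions, the sketch below shows one way such a model can be set up as a direct h-step regression and scored on a hold-out sample. Everything in it is an assumption made for illustration, not GTBpy's implementation: the alignment rule (lag l at horizon h uses the value h + l - 1 periods back), the plain least-squares fit, the fractional-only `train_cut`, and the hypothetical names `sketch_arx_frame` and `sketch_fit_eval`.

```python
import numpy as np
import pandas as pd

def sketch_arx_frame(df, h=1, y_lags=(1,), exog=None, exog_lags=()):
    """Build the target and lagged regressors for a direct h-step regression."""
    y = df.iloc[:, 0]                        # first column is the dependent variable
    cols = {'const': pd.Series(1.0, index=df.index)}
    for lag in y_lags:
        # Assumed alignment: information available h periods before the target
        cols[f'y_lag{lag}'] = y.shift(h + lag - 1)
    if exog is not None:
        for lag in exog_lags:
            cols[f'{exog}_lag{lag}'] = df[exog].shift(h + lag - 1)
    X = pd.DataFrame(cols)
    keep = X.notna().all(axis=1)             # drop rows lost to lagging
    return y[keep], X[keep]

def sketch_fit_eval(df, train_cut=0.8, **kwargs):
    """Fit by least squares on the training share, score on the remainder."""
    y, X = sketch_arx_frame(df, **kwargs)
    n_train = int(len(y) * train_cut)        # only the float form of train_cut
    X_np, y_np = X.to_numpy(), y.to_numpy()
    beta, *_ = np.linalg.lstsq(X_np[:n_train], y_np[:n_train], rcond=None)
    err = y_np[n_train:] - X_np[n_train:] @ beta
    return beta, float(np.mean(np.abs(err))), float(np.mean(err ** 2))

# Noise-free AR(1) series y_t = 1 + 0.8 * y_{t-1}, so the fit should be exact
values = [0.0]
for _ in range(39):
    values.append(1.0 + 0.8 * values[-1])
demo = pd.DataFrame({'HPI': values})
beta, mafe_val, msfe_val = sketch_fit_eval(demo, h=1, y_lags=(1,))
print(np.round(beta, 6), mafe_val < 1e-8)
```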

GTBpy.forecast_functions.compare_exog(df, train_cut, h_list=[1, 3, 6, 12], seasonal=False, max_lag=3, lag_select='IC', seed=None, lag_fit_intercept=True, lag_cv=5, lag_shuffle=False, lag_iter=20, hsi_CV_select=False, hsi_fit_intercept=True, hsi_cv=4, hsi_iter=40, y_lags='Auto', exog_lags='Auto', var_order='cross', verbose=False, log=False, original_hpi=None, original_scale=True, sort_df='criteria', sort_col=-1)

Compare the impact of different exogenous variables on forecasting accuracy using various evaluation metrics.

Parameters

df : pandas.DataFrame

The input DataFrame containing the time series data. The first column is assumed to be the dependent variable (‘HPI’), and the subsequent columns are potential exogenous variables.

train_cut : float or str or datetime-like

The cutoff point for splitting the data into training and validation sets. Can be a float between 0 and 1 representing the proportion of the data to use for training, or a specific index value.

h_list : list of int, optional

A list of forecast horizons (e.g., [1, 3, 6, 12]) to evaluate. Default is [1, 3, 6, 12].

seasonal : bool, optional

Whether to include seasonal dummies (monthly) in the model. Default is False.

max_lag : int, optional

The maximum lag order to consider for selection. Default is 3.

lag_select : {‘IC’, ‘CV’}, optional

The metric used to select the optimal lags:

  • ‘IC’: Information criterion (e.g., BIC).

  • ‘CV’: Cross-validation score.

Default is ‘IC’.

seed : int, optional

Random seed for reproducibility in cross-validation. Default is None.

lag_fit_intercept : bool, optional

Whether to include an intercept in the lag selection linear regression model. Default is True.

lag_cv : int, optional

Number of folds for cross-validation during lag selection. Default is 5.

lag_shuffle : bool, optional

Whether to shuffle the data before splitting into batches for cross-validation during lag selection. Default is False.

lag_iter : int, optional

Number of iterations for cross-validation during lag selection when lag_shuffle is True. Default is 20.

hsi_CV_select : bool, optional

Whether to perform cross-validation for the selection of the UHSI (Unified Housing Sentiment Index) based on the selected lags. Default is False.

hsi_fit_intercept : bool, optional

Whether to include an intercept in the UHSI selection linear regression model. Default is True.

hsi_cv : int, optional

Number of folds for cross-validation during UHSI selection. Default is 4.

hsi_iter : int, optional

Number of iterations for cross-validation during UHSI selection when hsi_CV_select is True. Default is 40.

y_lags : {‘Auto’, ‘glob’, list of int}, optional

The lags to consider for the dependent variable (‘HPI’):

  • ‘Auto’: Automatically generate lag sequences up to max_lag.

  • ‘glob’: Generates all possible combinations of lags up to max_lag.

  • list of int: Use the specified lags directly.

Default is ‘Auto’.

exog_lags : {‘Auto’, ‘glob’, list of int}, optional

The lags to consider for the exogenous variables, if applicable:

  • ‘Auto’: Automatically generate lag sequences up to max_lag.

  • ‘glob’: Generates all possible combinations of lags up to max_lag.

  • list of int: Use the specified lags directly.

Default is ‘Auto’.

var_order : {‘cross’, ‘nested’}, optional

The order in which lag selection is performed:

  • ‘cross’: Cross all combinations of lags for ‘y’ and ‘exog’.

  • ‘nested’: Select the best lags for ‘y’ first, then choose the best lags for ‘exog’ based on the selected ‘y’ lags.

Default is ‘cross’.

verbose : bool, optional

If True, print detailed output during the function execution. Default is False.

log : bool, optional

If True, the data is assumed to be log-transformed and predictions will be back-transformed. Default is False.

original_hpi : pandas.Series, optional

Original HPI (House Price Index) values for back-transforming predictions. Only used if original_scale is True. Default is None.

original_scale : bool, optional

Whether to return the predictions on the original scale (before any transformations). Default is True.

sort_df : {‘criteria’, ‘MAFE’, ‘MSFE’}, optional

Criteria to sort the results by:

  • ‘criteria’: Sort by the lag selection criteria (e.g., IC or CV).

  • ‘MAFE’: Sort by Mean Absolute Forecast Error.

  • ‘MSFE’: Sort by Mean Squared Forecast Error.

Default is ‘criteria’.

sort_col : int, optional

The column index in sort_df to use for sorting. Default is -1 (last column).

Returns

MAFE_val_df : pandas.DataFrame

DataFrame containing the Mean Absolute Forecast Error (MAFE) on the validation set for each exogenous variable and forecast horizon.

MAFE_val_df_improve : pandas.DataFrame

DataFrame containing the percentage improvement in MAFE on the validation set for each exogenous variable compared to the baseline (no exogenous variables).

MSFE_val_df : pandas.DataFrame

DataFrame containing the Mean Squared Forecast Error (MSFE) on the validation set for each exogenous variable and forecast horizon.

MSFE_val_df_improve : pandas.DataFrame

DataFrame containing the percentage improvement in MSFE on the validation set for each exogenous variable compared to the baseline.

criteria_df : pandas.DataFrame

DataFrame containing the lag selection criteria (e.g., IC or CV) values for each exogenous variable and forecast horizon.

lags_df : pandas.DataFrame

DataFrame containing the selected lags for the dependent variable (‘HPI’) and each exogenous variable.

MAFE_train_df : pandas.DataFrame

DataFrame containing the Mean Absolute Forecast Error (MAFE) on the training set for each exogenous variable and forecast horizon.

MSFE_train_df : pandas.DataFrame

DataFrame containing the Mean Squared Forecast Error (MSFE) on the training set for each exogenous variable and forecast horizon.

forecast_dict : dict

A dictionary where each key is a forecast horizon (from h_list) and the corresponding value is a DataFrame containing the actual vs. predicted values for each exogenous variable.

df : pandas.DataFrame

The modified input DataFrame with additional columns for principal components (UHSI) if hsi_CV_select is True.

Examples

>>> MAFE_val_df, MAFE_val_df_improve, MSFE_val_df, MSFE_val_df_improve, criteria_df, lags_df, MAFE_train_df, MSFE_train_df, forecast_dict, df = compare_exog(df, '2023-01-01', h_list=[1, 6, 12], lag_select='IC', seasonal=True)
>>> print(MAFE_val_df)
           Horizon 1  Horizon 6  Horizon 12
exog_var1    0.03      0.04       0.05
exog_var2    0.02      0.03       0.04
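
The "improvement" tables compare each model's error with that of the baseline without exogenous variables. A minimal sketch of such a measure follows, assuming the relative reduction (base - model) / base, where positive values mean the exogenous variable helps. The exact scaling and sign convention used by GTBpy are not stated here, so `sketch_improvement` and the numbers below are purely illustrative.

```python
def sketch_improvement(base_error, model_error):
    """Relative error reduction versus the baseline (multiply by 100 for percent)."""
    return (base_error - model_error) / base_error

# Hypothetical validation errors per horizon for a baseline and one predictor
base = {1: 0.050, 6: 0.080}
with_exog = {1: 0.040, 6: 0.088}
for h in base:
    print(h, round(sketch_improvement(base[h], with_exog[h]), 3))
```

A negative value, as at the second horizon in this made-up data, would indicate that the exogenous variable worsened the forecast.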

GTBpy.forecast_functions.HLN_MDM(d, h)

Compute the Harvey, Leybourne, and Newbold (1997) Modified Diebold-Mariano (MDM) statistic and its p-value.

Parameters

d : numpy.ndarray or pandas.Series

The array or series of forecast error differentials. This represents the difference between the forecast errors of two competing models.

h : int

The forecast horizon (number of steps ahead).

Returns

MDM : float

The Modified Diebold-Mariano statistic.

pval : float

The p-value associated with the MDM statistic, under the null hypothesis that the forecast accuracy of the two models is the same.

Notes

  • The Harvey, Leybourne, and Newbold (1997) test modifies the Diebold-Mariano test to account for small-sample bias, especially when the forecast horizon is greater than one.

  • This test is particularly useful in comparing predictive accuracy when forecasts are based on overlapping data.

References

Harvey, D. I., Leybourne, S. J., & Newbold, P. (1997). Testing the equality of prediction mean squared errors. International Journal of Forecasting, 13(2), 281-291.

Examples

>>> d = np.array([0.1, 0.2, -0.1, 0.05, 0.3])
>>> h = 1
>>> MDM, pval = HLN_MDM(d, h)
>>> print(f"MDM Statistic: {MDM}, p-value: {pval}")
MDM Statistic: 1.414213562373095, p-value: 0.1826592511440174
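
For reference, the statistic described in Harvey, Leybourne, and Newbold (1997) can be sketched in a few lines of NumPy. This is an illustrative reading of the published formula, not GTBpy's code: `sketch_mdm_statistic` is a hypothetical name, and details such as the variance estimator may differ from what the package does. The p-value step is omitted to keep the sketch dependent on NumPy only; in the paper the statistic is compared with a Student t distribution with n - 1 degrees of freedom.

```python
import numpy as np

def sketch_mdm_statistic(d, h):
    """Modified Diebold-Mariano statistic for loss differentials d at horizon h."""
    d = np.asarray(d, dtype=float)
    n = d.size
    dev = d - d.mean()
    # Autocovariances gamma_0, ..., gamma_{h-1} of the loss differential
    gammas = [np.sum(dev[k:] * dev[:n - k]) / n for k in range(h)]
    long_run_var = gammas[0] + 2.0 * sum(gammas[1:])
    dm = d.mean() / np.sqrt(long_run_var / n)          # original DM statistic
    # Small-sample correction factor proposed by HLN (1997)
    correction = np.sqrt((n + 1 - 2 * h + h * (h - 1) / n) / n)
    return correction * dm

print(sketch_mdm_statistic([1.0, 2.0, 3.0, 4.0, 5.0], h=1))
```

For h = 1 only the variance term gamma_0 enters, and the correction reduces to sqrt((n - 1) / n).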

GTBpy.forecast_functions.forecast_table(self)

Generate a table comparing forecast performance across different predictors and horizons.

This table includes the Mean Absolute Forecast Error (MAFE), Mean Squared Forecast Error (MSFE), and p-values for hypothesis tests related to equal predictive accuracy and forecast encompassing.

Hypotheses:

  • H_{0,1}: Equal MAFE between the predictor model and the base model.

  • H_{0,2}: Equal MSFE between the predictor model and the base model.

  • H_{0,3}: The predictor model forecast encompasses the base model.

Returns

result : pandas.DataFrame

A DataFrame with a MultiIndex of forecast horizons (h) and predictors, and columns including:

  • ‘HPI lags’: The lags of the dependent variable (HPI).

  • ‘Exog lags’: The lags of the exogenous variable.

  • ‘MAFE’: The Mean Absolute Forecast Error for the predictor.

  • ‘MSFE’: The Mean Squared Forecast Error for the predictor.

  • ‘MAFE improvement’: The percentage improvement in MAFE relative to the base model.

  • ‘MSFE improvement’: The percentage improvement in MSFE relative to the base model.

  • ‘H_{0,1}’: The p-value for the hypothesis test of equal MAFE (using the Modified Diebold-Mariano test).

  • ‘H_{0,2}’: The p-value for the hypothesis test of equal MSFE (using the Modified Diebold-Mariano test).

  • ‘H_{0,3}’: The p-value for the hypothesis test of forecast encompassing (using the Modified Diebold-Mariano test).

Notes

  • The base model is represented by the ‘None’ predictor.

  • Forecast encompassing tests whether the predictor model contains all the information in the base model.

  • The method uses the Harvey, Leybourne, and Newbold (1997) Modified Diebold-Mariano test to compute p-values.

Examples

>>> table = model.forecast_table()
>>> print(table)
                HPI lags Exog lags     MAFE     MSFE MAFE improvement MSFE improvement   H_{0,1}   H_{0,2}   H_{0,3}
h    Predictors                                                                                                      
1    None              ...       ...  0.0251  0.00123            0.000            0.000  0.0321    0.0287    0.1025
    UHSI_1            ...       ...  0.0223  0.00110            0.111            0.105  0.2103    0.1325    0.0923
    UHSI_3            ...       ...  0.0210  0.00105            0.162            0.147  0.1809    0.1014    0.0534
...
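
Each of the three hypotheses can be phrased as a zero-mean test on a series of differentials built from the two models' forecast errors, which is then passed to the Modified Diebold-Mariano test. The sketch below shows the standard constructions from Harvey, Leybourne, and Newbold (1997, 1998). The function name `sketch_differentials` is hypothetical, and the ordering and sign conventions are assumptions that may not match GTBpy's internals.

```python
import numpy as np

def sketch_differentials(e_model, e_base):
    """Differential series whose mean is zero under each null hypothesis."""
    e_model = np.asarray(e_model, dtype=float)
    e_base = np.asarray(e_base, dtype=float)
    return {
        # Equal MAFE: difference in absolute errors
        'H_{0,1}': np.abs(e_model) - np.abs(e_base),
        # Equal MSFE: difference in squared errors
        'H_{0,2}': e_model ** 2 - e_base ** 2,
        # Predictor model encompasses the base model (HLN 1998 form)
        'H_{0,3}': e_model * (e_model - e_base),
    }

diffs = sketch_differentials([1.0, -2.0], [2.0, -1.0])
for name, series in diffs.items():
    print(name, series)
```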

GTBpy.forecast_functions.results(self, results_table=False, MAFE_val_df=False, MAFE_val_df_improve=False, MSFE_val_df=False, MSFE_val_df_improve=False, forecast_criteria_df=False, lags_df=False, MAFE_train_df=False, MSFE_train_df=False, head=10)

Display selected result tables generated during the forecasting process.

Parameters

results_table : bool, optional

If True, display the full results table from the forecast comparison (default is False).

MAFE_val_df : bool, optional

If True, display the Mean Absolute Forecast Error (MAFE) validation DataFrame (default is False).

MAFE_val_df_improve : bool, optional

If True, display the percentage improvement in MAFE for each predictor (default is False).

MSFE_val_df : bool, optional

If True, display the Mean Squared Forecast Error (MSFE) validation DataFrame (default is False).

MSFE_val_df_improve : bool, optional

If True, display the percentage improvement in MSFE for each predictor (default is False).

forecast_criteria_df : bool, optional

If True, display the forecast criteria DataFrame, showing the selected model criteria (default is False).

lags_df : bool, optional

If True, display the DataFrame containing the selected lags for each model (default is False).

MAFE_train_df : bool, optional

If True, display the Mean Absolute Forecast Error (MAFE) training DataFrame (default is False).

MSFE_train_df : bool, optional

If True, display the Mean Squared Forecast Error (MSFE) training DataFrame (default is False).

head : int or bool, optional

Number of rows to display from each DataFrame (default is 10). If True, display all rows of each DataFrame.

Returns

None

Displays the selected DataFrames in the Jupyter Notebook environment.

Notes

  • The method allows selective display of any of the key result tables generated during the model comparison process.

  • It uses the display function from IPython to show the DataFrames.

Examples

>>> model.results(results_table=True, MAFE_val_df=True, head=5)
    Displays the results table and the MAFE validation DataFrame with the top 5 rows.
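
The dual meaning of `head` (an integer row count, or True for all rows) has one subtlety worth illustrating: in Python, bool is a subclass of int, so True must be recognised before it is treated as a number. The helper below is a hypothetical sketch of that logic only, not the method's actual code.

```python
import pandas as pd

def sketch_head(df, head=10):
    """Return all rows when head is True, otherwise the first `head` rows."""
    # Identity check first: True == 1, so a numeric test would show a single row
    return df if head is True else df.head(head)

frame = pd.DataFrame({'MAFE': range(20)})
print(len(sketch_head(frame, head=5)), len(sketch_head(frame, head=True)))
```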