forecast_functions module

Functions for forecasting the House Price Index (HPI) with an ARX model.

It includes two forecast performance evaluation tests:

- Testing the equality of prediction mean squared errors: David I. Harvey, Stephen J. Leybourne, Paul Newbold (1997)
- Tests for forecast encompassing: David I. Harvey, Stephen J. Leybourne, Paul Newbold (1998)

GTBpy.forecast_functions.lags_list_function(lags, max_lag)

Generate a list of lag combinations based on the specified lags parameter and maximum lag value.

Parameters

lags ({‘Auto’, ‘glob’, list of int}): Determines the type of lag combinations to generate:
    - ‘Auto’: Automatically generates sequential lags up to max_lag.
    - ‘glob’: Generates all possible combinations of lags up to max_lag.
    - list of int: Uses the specified list of lags directly.
max_lag (int): The maximum lag value to consider when generating lag combinations.

Returns

lags_list (list of list of int): A list where each element is a list of integers representing a combination of lag values. The specific combinations depend on the input lags parameter.

Examples

>>> lags_list_function('Auto', 3)
[[], [1], [1, 2], [1, 2, 3]]

>>> lags_list_function('glob', 2)
[[], [1], [2], [1, 2]]

>>> lags_list_function([1, 2], 3)
[[1, 2]]
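
The three modes can be reproduced with a few lines of standard-library Python. The sketch below is not the GTBpy source: `lags_list_sketch` is a hypothetical name, and the ordering of the ‘glob’ combinations (by size, then lexicographically) is an assumption chosen to match the documented examples.

```python
from itertools import combinations


def lags_list_sketch(lags, max_lag):
    """Hypothetical re-implementation of the documented lag-combination modes."""
    if lags == 'Auto':
        # Nested sequences: [], [1], [1, 2], ..., [1, ..., max_lag]
        return [list(range(1, k + 1)) for k in range(max_lag + 1)]
    if lags == 'glob':
        # Every subset of {1, ..., max_lag}, smallest subsets first
        candidates = range(1, max_lag + 1)
        return [list(c)
                for size in range(max_lag + 1)
                for c in combinations(candidates, size)]
    # An explicit list of lags is used as the single candidate
    return [list(lags)]


print(lags_list_sketch('Auto', 3))
print(lags_list_sketch('glob', 2))
print(lags_list_sketch([1, 2], 3))
```

Note that ‘glob’ produces 2**max_lag candidates, so the search grows quickly with max_lag.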

GTBpy.forecast_functions.result_table(index, header, h_list, index_name, first_index='None')

Create a multi-indexed DataFrame with specified index and column headers.

Parameters

index (list of str): A list of labels for the DataFrame’s index.
header (str): The header label for the DataFrame’s columns.
h_list (list of int): A list of integers that will be appended to the header label to create the column names.
index_name (str): The name to assign to the DataFrame’s index.
first_index (str, optional): The label for the first index position. Defaults to ‘None’.

Returns

df (pandas.DataFrame): A DataFrame with a MultiIndex for the columns, where the first level is the header and the second level corresponds to ‘h=’ followed by each element in h_list. The DataFrame’s index is set to the provided index list, with an optional first_index prepended.

Examples

>>> result_table(['A', 'B', 'C'], 'Metric', [1, 2, 3], 'Category')
      Metric
      h=1  h=2  h=3
None  NaN  NaN  NaN
A     NaN  NaN  NaN
B     NaN  NaN  NaN
C     NaN  NaN  NaN

>>> result_table(['X', 'Y'], 'Value', [1, 2], 'Type', 'Start')
       Value
       h=1  h=2
Start  NaN  NaN
X      NaN  NaN
Y      NaN  NaN
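
An empty table of this shape can be built directly with pandas. The following is a minimal sketch under stated assumptions, not the GTBpy implementation: `result_table_sketch` is a hypothetical name, and the table is assumed to start out filled with NaN.

```python
import pandas as pd


def result_table_sketch(index, header, h_list, index_name, first_index='None'):
    """Hypothetical sketch: empty DataFrame with (header, 'h=<n>') columns."""
    # Two-level columns: one top-level header, one sub-column per horizon
    columns = pd.MultiIndex.from_product([[header], [f'h={h}' for h in h_list]])
    # Prepend the label used for the first row (the baseline in later tables)
    rows = pd.Index([first_index] + list(index), name=index_name)
    return pd.DataFrame(index=rows, columns=columns)


table = result_table_sketch(['A', 'B', 'C'], 'Metric', [1, 2, 3], 'Category')
print(table.shape)
```

Cells can then be filled by label, for example `table.loc['A', ('Metric', 'h=1')] = 0.03`.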

GTBpy.forecast_functions.compute_ic_cv(y, X, metric, cv_criteria='MSFE', fit_intercept=True, cv=5, shuffle=False, n_iter=20, seed=None)

Compute the information criterion (IC) or cross-validation (CV) score based on the specified metric and criteria.

Parameters

y (array-like or pandas.Series): The dependent variable vector (target).
X (array-like or pandas.DataFrame): The independent variable matrix (features).
metric ({‘CV’, ‘IC’}): The type of metric to compute:
    - ‘CV’: Cross-validation metric based on the cv_criteria.
    - ‘IC’: Information criterion (e.g., BIC).
cv_criteria ({‘MSFE’, ‘MAFE’}, optional): The criterion used to compute the CV score when metric is ‘CV’. Default is ‘MSFE’.
    - ‘MSFE’: Mean Squared Forecast Error (negative mean squared error).
    - ‘MAFE’: Mean Absolute Forecast Error (negative mean absolute error).
fit_intercept (bool, optional): Whether to calculate the intercept for the linear model. If set to False, no intercept will be used in calculations. Default is True.
cv (int, optional): The number of folds in cross-validation. Default is 5.
shuffle (bool, optional): Whether to shuffle the data before splitting into batches in cross-validation. Default is False.
n_iter (int, optional): The number of iterations for cross-validation when shuffle is True. Default is 20.
seed (int, optional): The random seed for reproducibility when shuffling data in cross-validation. Default is None.

Returns

ic (float): The computed information criterion (IC) or cross-validation (CV) score.

Examples

>>> compute_ic_cv(y, X, metric='CV', cv_criteria='MSFE', cv=5)
0.045

>>> compute_ic_cv(y, X, metric='IC')
210.34
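
To make the two metrics concrete, here is a rough NumPy sketch of a BIC for an OLS fit and an unshuffled k-fold MSFE. It is an illustration only: `bic_sketch` and `cv_msfe_sketch` are hypothetical names, the BIC form n·ln(RSS/n) + k·ln(n) and the contiguous fold layout are assumptions, and the score is returned as a positive error, so GTBpy may scale or sign its values differently.

```python
import numpy as np


def _design(X, n, fit_intercept):
    """Return X as a 2-D float matrix, optionally with a leading column of ones."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    if fit_intercept:
        X = np.column_stack([np.ones(n), X])
    return X


def bic_sketch(y, X, fit_intercept=True):
    """Assumed BIC for a Gaussian OLS fit: n*ln(RSS/n) + k*ln(n)."""
    y = np.asarray(y, dtype=float)
    X = _design(X, len(y), fit_intercept)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, k = X.shape
    return float(n * np.log(resid @ resid / n) + k * np.log(n))


def cv_msfe_sketch(y, X, cv=5, fit_intercept=True):
    """Mean squared error averaged over `cv` contiguous, unshuffled folds."""
    y = np.asarray(y, dtype=float)
    X = _design(X, len(y), fit_intercept)
    all_idx = np.arange(len(y))
    fold_errors = []
    for test_idx in np.array_split(all_idx, cv):
        train_idx = np.setdiff1d(all_idx, test_idx)
        beta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        fold_errors.append(np.mean((y[test_idx] - X[test_idx] @ beta) ** 2))
    return float(np.mean(fold_errors))


# Synthetic data: y depends on x, while noise_only is irrelevant
rng = np.random.default_rng(0)
x = rng.normal(size=60)
noise_only = rng.normal(size=60)
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=60)
print(bic_sketch(y, x) < bic_sketch(y, noise_only))
print(cv_msfe_sketch(y, x) < cv_msfe_sketch(y, noise_only))
```

Under either metric a lower value indicates the preferred specification in this sketch.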

GTBpy.forecast_functions.lag_selector(df, lag_select, seed=None, cv_criteria='MSFE', cv=5, shuffle=False, n_iter=20, h=1, max_lag=13, exog=None, var_order='cross', y_lags='Auto', exog_lags='Auto', seasonal=False, verbose=0)

Select optimal lags for the dependent and exogenous variables using information criteria (IC) or cross-validation (CV) metrics.

Parameters

df (pandas.DataFrame): The input DataFrame containing the time series data. The first column is assumed to be the dependent variable (‘y’).
lag_select ({‘IC’, ‘CV’}): The metric used to select the optimal lags:
    - ‘IC’: Information criterion (e.g., BIC).
    - ‘CV’: Cross-validation score based on cv_criteria.
seed (int, optional): Random seed for reproducibility in cross-validation. Default is None.
cv_criteria ({‘MSFE’, ‘MAFE’}, optional): The criterion used for cross-validation when lag_select is ‘CV’:
    - ‘MSFE’: Mean Squared Forecast Error (default).
    - ‘MAFE’: Mean Absolute Forecast Error.
cv (int, optional): Number of folds for cross-validation. Default is 5.
shuffle (bool, optional): Whether to shuffle the data before splitting into batches for cross-validation. Default is False.
n_iter (int, optional): Number of iterations for cross-validation when shuffle is True. Default is 20.
h (int, optional): Forecast horizon, indicating how many steps ahead the model is predicting. Default is 1.
max_lag (int, optional): The maximum lag order to consider for selection. Default is 13.
exog (str, optional): Name of the exogenous variable in the DataFrame. Default is None.
var_order ({‘cross’, ‘nested’}, optional): The order in which lag selection is performed. Default is ‘cross’.
    - ‘cross’: Cross all combinations of lags for ‘y’ and ‘exog’.
    - ‘nested’: Select the best lags for ‘y’ first, then choose the best lags for ‘exog’ based on the selected ‘y’ lags.
y_lags ({‘Auto’, ‘glob’, list of int}, optional): The lags to consider for the dependent variable. Default is ‘Auto’.
    - ‘Auto’: Automatically generate lag sequences up to max_lag.
    - ‘glob’: Generates all possible combinations of lags up to max_lag.
    - list of int: Use the specified lags directly.
exog_lags ({‘Auto’, ‘glob’, list of int}, optional): The lags to consider for the exogenous variable, if applicable. Default is ‘Auto’.
    - ‘Auto’: Automatically generate lag sequences up to max_lag.
    - ‘glob’: Generates all possible combinations of lags up to max_lag.
    - list of int: Use the specified lags directly.
seasonal (bool, optional): Whether to include seasonal dummies (monthly) in the model. Default is False.
verbose (int, optional): If greater than 1, prints detailed information about the lag selection process. Default is 0.

Returns

y_lags (list of int): The optimal lags for the dependent variable (‘y’) based on the specified metric.
exog_lags (list of int): The optimal lags for the exogenous variable based on the specified metric.
IC (float): The minimum information criterion (IC) or cross-validation score obtained.
ics (dict): A dictionary where keys are the IC/CV scores and values are the corresponding lags for ‘y’ and ‘exog’.

Examples

>>> lag_selector(df, lag_select='CV', cv_criteria='MSFE', h=1)
([1, 2, 3], [1], 0.034, {...})

>>> lag_selector(df, lag_select='IC', exog='ExogVar', h=2, max_lag=5)
([1, 3], [1], 210.45, {...})
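
The difference between the ‘cross’ and ‘nested’ search orders can be sketched independently of any model. In the snippet below, `select_cross`, `select_nested` and `toy_score` are hypothetical names; `toy_score` stands in for whatever IC or CV score the library computes, and evaluating the ‘y’ lags with no exogenous lags in the first nested step is an assumption.

```python
from itertools import product


def select_cross(y_candidates, exog_candidates, score):
    """Score every (y lags, exog lags) pair and keep the lowest score."""
    return min(product(y_candidates, exog_candidates),
               key=lambda pair: score(*pair))


def select_nested(y_candidates, exog_candidates, score):
    """Pick the best y lags first, then the best exog lags given those y lags."""
    best_y = min(y_candidates, key=lambda y_lags: score(y_lags, []))
    best_exog = min(exog_candidates, key=lambda exog_lags: score(best_y, exog_lags))
    return best_y, best_exog


def toy_score(y_lags, exog_lags):
    """Placeholder score (lower is better), minimized at 2 y lags and 1 exog lag."""
    return abs(len(y_lags) - 2) + abs(len(exog_lags) - 1)


candidates = [[], [1], [1, 2], [1, 2, 3]]  # the 'Auto' candidates for max_lag=3
print(select_cross(candidates, candidates, toy_score))
print(select_nested(candidates, candidates, toy_score))
```

With m candidates per variable, ‘cross’ evaluates m × m models while ‘nested’ evaluates about 2m, at the cost of possibly missing the joint optimum.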

GTBpy.forecast_functions.model(df, h=1, max_lag=3, exog=None, seasonal=False, lag_select='IC', seed=None, cv=5, shuffle=False, n_iter=20, y_lags='Auto', exog_lags='Auto', var_order='cross', train_cut=0.8, verbose=0, log=False, plot=False, original_hpi=None, original_scale=True)

Fit an autoregressive exogenous (ARX) model with selected lags using information criteria (IC) or cross-validation (CV) for dependent and independent variables, and evaluate its performance.

Parameters

df (pandas.DataFrame): The input DataFrame containing the time series data. The first column is assumed to be the dependent variable (‘y’).
h (int, optional): Forecast horizon, indicating how many steps ahead the model is predicting. Default is 1.
max_lag (int, optional): The maximum lag order to consider for selection. Default is 3.
exog (str, optional): Name of the exogenous variable in the DataFrame. Default is None.
seasonal (bool, optional): Whether to include seasonal dummies (monthly) in the model. Default is False.
lag_select ({‘IC’, ‘CV’}, optional): The metric used to select the optimal lags. Default is ‘IC’.
    - ‘IC’: Information criterion (e.g., BIC).
    - ‘CV’: Cross-validation score based on cv_criteria.
seed (int, optional): Random seed for reproducibility in cross-validation. Default is None.
cv (int, optional): Number of folds for cross-validation. Default is 5.
shuffle (bool, optional): Whether to shuffle the data before splitting into batches for cross-validation. Default is False.
n_iter (int, optional): Number of iterations for cross-validation when shuffle is True. Default is 20.
y_lags ({‘Auto’, ‘glob’, list of int}, optional): The lags to consider for the dependent variable. Default is ‘Auto’.
    - ‘Auto’: Automatically generate lag sequences up to max_lag.
    - ‘glob’: Generates all possible combinations of lags up to max_lag.
    - list of int: Use the specified lags directly.
exog_lags ({‘Auto’, ‘glob’, list of int}, optional): The lags to consider for the exogenous variable, if applicable. Default is ‘Auto’.
    - ‘Auto’: Automatically generate lag sequences up to max_lag.
    - ‘glob’: Generates all possible combinations of lags up to max_lag.
    - list of int: Use the specified lags directly.
var_order ({‘cross’, ‘nested’}, optional): The order in which lag selection is performed. Default is ‘cross’.
    - ‘cross’: Cross all combinations of lags for ‘y’ and ‘exog’.
    - ‘nested’: Select the best lags for ‘y’ first, then choose the best lags for ‘exog’ based on the selected ‘y’ lags.
train_cut (float or str or datetime-like, optional): The cutoff point for splitting the data into training and validation sets. Can be a float between 0 and 1 representing the proportion of the data to use for training, or a specific index value. Default is 0.8.
verbose (int, optional): Level of verbosity for debugging or detailed output. Default is 0.
log (bool, optional): If True, the data is assumed to be log-transformed and predictions will be back-transformed. Default is False.
plot (bool, optional): If True, plots the actual vs. predicted values with a vertical line indicating the training/validation split. Default is False.
original_hpi (pandas.Series, optional): Original HPI (House Price Index) values for back-transforming predictions. Only used if original_scale is True. Default is None.
original_scale (bool, optional): Whether to return the predictions on the original scale (before any transformations). Default is True.

Returns

MAFE_val (float): Mean Absolute Forecast Error on the validation set.
MSFE_val (float): Mean Squared Forecast Error on the validation set.
MAFE_train (float): Mean Absolute Forecast Error on the training set.
MSFE_train (float): Mean Squared Forecast Error on the training set.
lags_set (dict): Dictionary containing the selected lags for ‘y’ and ‘exog’.
IC (float): The minimum information criterion (IC) or cross-validation score obtained during lag selection.
res_full (pandas.Series): The full set of predictions for the dependent variable, including both training and validation periods.

Examples

>>> MAFE_val, MSFE_val, MAFE_train, MSFE_train, lags_set, IC, res_full = model(df, h=1, max_lag=3, exog='ExogVar')
>>> print(MAFE_val, MSFE_val, lags_set)
0.034 0.002 {'y lags': [1, 2], 'exog lags': [1]}
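
For orientation, the error measures and a proportional train/validation split can be written out in plain Python. This is a sketch, not the library code: the helper names are hypothetical, and truncating the cut position with `int` is an assumption about how a float train_cut maps to a row count.

```python
def split_by_proportion(values, train_cut=0.8):
    """Split a sequence into (train, validation) at a proportional cut point."""
    cut = int(len(values) * train_cut)  # assumed rounding rule
    return values[:cut], values[cut:]


def mafe(actual, predicted):
    """Mean Absolute Forecast Error: the average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def msfe(actual, predicted):
    """Mean Squared Forecast Error: the average of (actual - predicted)**2."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


train, validation = split_by_proportion(list(range(10)), train_cut=0.8)
print(len(train), len(validation))
print(mafe([1.0, 2.0, 3.0], [1.0, 1.0, 5.0]))
print(msfe([1.0, 2.0, 3.0], [1.0, 1.0, 5.0]))
```

MSFE penalizes large misses more heavily than MAFE, which is why the two measures can rank models differently.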

GTBpy.forecast_functions.compare_exog(df, train_cut, h_list=[1, 3, 6, 12], seasonal=False, max_lag=3, lag_select='IC', seed=None, lag_fit_intercept=True, lag_cv=5, lag_shuffle=False, lag_iter=20, hsi_CV_select=False, hsi_fit_intercept=True, hsi_cv=4, hsi_iter=40, y_lags='Auto', exog_lags='Auto', var_order='cross', verbose=False, log=False, original_hpi=None, original_scale=True, sort_df='criteria', sort_col=-1)

Compare the impact of different exogenous variables on forecasting accuracy using various evaluation metrics.

Parameters

df (pandas.DataFrame): The input DataFrame containing the time series data. The first column is assumed to be the dependent variable (‘HPI’), and the subsequent columns are potential exogenous variables.
train_cut (float or str or datetime-like): The cutoff point for splitting the data into training and validation sets. Can be a float between 0 and 1 representing the proportion of the data to use for training, or a specific index value.
h_list (list of int, optional): A list of forecast horizons (e.g., [1, 3, 6, 12]) to evaluate. Default is [1, 3, 6, 12].
seasonal (bool, optional): Whether to include seasonal dummies (monthly) in the model. Default is False.
max_lag (int, optional): The maximum lag order to consider for selection. Default is 3.
lag_select ({‘IC’, ‘CV’}, optional): The metric used to select the optimal lags. Default is ‘IC’.
    - ‘IC’: Information criterion (e.g., BIC).
    - ‘CV’: Cross-validation score.
seed (int, optional): Random seed for reproducibility in cross-validation. Default is None.
lag_fit_intercept (bool, optional): Whether to include an intercept in the lag selection linear regression model. Default is True.
lag_cv (int, optional): Number of folds for cross-validation during lag selection. Default is 5.
lag_shuffle (bool, optional): Whether to shuffle the data before splitting into batches for cross-validation during lag selection. Default is False.
lag_iter (int, optional): Number of iterations for cross-validation during lag selection when lag_shuffle is True. Default is 20.
hsi_CV_select (bool, optional): Whether to perform cross-validation for the selection of the UHSI (Unified Housing Sentiment Index) based on the selected lags. Default is False.
hsi_fit_intercept (bool, optional): Whether to include an intercept in the UHSI selection linear regression model. Default is True.
hsi_cv (int, optional): Number of folds for cross-validation during UHSI selection. Default is 4.
hsi_iter (int, optional): Number of iterations for cross-validation during UHSI selection when hsi_CV_select is True. Default is 40.
y_lags ({‘Auto’, ‘glob’, list of int}, optional): The lags to consider for the dependent variable (‘HPI’). Default is ‘Auto’.
    - ‘Auto’: Automatically generate lag sequences up to max_lag.
    - ‘glob’: Generates all possible combinations of lags up to max_lag.
    - list of int: Use the specified lags directly.
exog_lags ({‘Auto’, ‘glob’, list of int}, optional): The lags to consider for the exogenous variables, if applicable. Default is ‘Auto’.
    - ‘Auto’: Automatically generate lag sequences up to max_lag.
    - ‘glob’: Generates all possible combinations of lags up to max_lag.
    - list of int: Use the specified lags directly.
var_order ({‘cross’, ‘nested’}, optional): The order in which lag selection is performed. Default is ‘cross’.
    - ‘cross’: Cross all combinations of lags for ‘y’ and ‘exog’.
    - ‘nested’: Select the best lags for ‘y’ first, then choose the best lags for ‘exog’ based on the selected ‘y’ lags.
verbose (bool, optional): If True, print detailed output during the function execution. Default is False.
log (bool, optional): If True, the data is assumed to be log-transformed and predictions will be back-transformed. Default is False.
original_hpi (pandas.Series, optional): Original HPI (House Price Index) values for back-transforming predictions. Only used if original_scale is True. Default is None.
original_scale (bool, optional): Whether to return the predictions on the original scale (before any transformations). Default is True.
sort_df ({‘criteria’, ‘MAFE’, ‘MSFE’}, optional): Criteria to sort the results by. Default is ‘criteria’.
    - ‘criteria’: Sort by the lag selection criteria (e.g., IC or CV).
    - ‘MAFE’: Sort by Mean Absolute Forecast Error.
    - ‘MSFE’: Sort by Mean Squared Forecast Error.
sort_col (int, optional): The column index in sort_df to use for sorting. Default is -1 (last column).

Returns

MAFE_val_df (pandas.DataFrame): DataFrame containing the Mean Absolute Forecast Error (MAFE) on the validation set for each exogenous variable and forecast horizon.
MAFE_val_df_improve (pandas.DataFrame): DataFrame containing the percentage improvement in MAFE on the validation set for each exogenous variable compared to the baseline (no exogenous variables).
MSFE_val_df (pandas.DataFrame): DataFrame containing the Mean Squared Forecast Error (MSFE) on the validation set for each exogenous variable and forecast horizon.
MSFE_val_df_improve (pandas.DataFrame): DataFrame containing the percentage improvement in MSFE on the validation set for each exogenous variable compared to the baseline.
criteria_df (pandas.DataFrame): DataFrame containing the lag selection criteria (e.g., IC or CV) values for each exogenous variable and forecast horizon.
lags_df (pandas.DataFrame): DataFrame containing the selected lags for the dependent variable (‘HPI’) and each exogenous variable.
MAFE_train_df (pandas.DataFrame): DataFrame containing the Mean Absolute Forecast Error (MAFE) on the training set for each exogenous variable and forecast horizon.
MSFE_train_df (pandas.DataFrame): DataFrame containing the Mean Squared Forecast Error (MSFE) on the training set for each exogenous variable and forecast horizon.
forecast_dict (dict): A dictionary where each key is a forecast horizon (from h_list) and the corresponding value is a DataFrame containing the actual vs. predicted values for each exogenous variable.
df (pandas.DataFrame): The modified input DataFrame with additional columns for principal components (UHSI) if hsi_CV_select is True.

Examples

>>> MAFE_val_df, MAFE_val_df_improve, MSFE_val_df, MSFE_val_df_improve, criteria_df, lags_df, MAFE_train_df, MSFE_train_df, forecast_dict, df = compare_exog(df, '2023-01-01', h_list=[1, 6, 12], lag_select='IC', seasonal=True)
>>> print(MAFE_val_df)
           Horizon 1  Horizon 6  Horizon 12
exog_var1       0.03       0.04        0.05
exog_var2       0.02       0.03        0.04
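
The improvement tables compare each exogenous variable with the baseline that uses no exogenous variable. A minimal sketch of that comparison follows; `relative_improvement` is a hypothetical helper, the error values are made up for illustration, and whether GTBpy reports the result as a fraction or multiplied by 100 is not confirmed here.

```python
def relative_improvement(baseline_error, model_error):
    """Fraction by which model_error falls below baseline_error (positive is better)."""
    return (baseline_error - model_error) / baseline_error


# Illustrative validation errors per forecast horizon (not real results)
baseline = {1: 0.050, 6: 0.060}
with_exog = {1: 0.040, 6: 0.063}

for h in baseline:
    gain = relative_improvement(baseline[h], with_exog[h])
    print(f"h={h}: {gain:+.1%}")
```

A negative value means the exogenous variable made the forecast worse than the baseline at that horizon.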

GTBpy.forecast_functions.HLN_MDM(d, h)

Compute the Harvey, Leybourne, and Newbold (1997) Modified Diebold-Mariano (MDM) statistic and its p-value.

Parameters

d (numpy.ndarray or pandas.Series): The array or series of forecast error differentials. This represents the difference between the forecast errors of two competing models.
h (int): The forecast horizon (number of steps ahead).

Returns

MDM (float): The Modified Diebold-Mariano statistic.
pval (float): The p-value associated with the MDM statistic, under the null hypothesis that the forecast accuracy of the two models is the same.

Notes

The Harvey, Leybourne, and Newbold (1997) test modifies the Diebold-Mariano test to account for small-sample bias, especially when the forecast horizon is greater than one.
This test is particularly useful in comparing predictive accuracy when forecasts are based on overlapping data.

References

Harvey, D. I., Leybourne, S. J., & Newbold, P. (1997). Testing the equality of prediction mean squared errors. International Journal of Forecasting, 13(2), 281-291.

Examples

>>> d = np.array([0.1, 0.2, -0.1, 0.05, 0.3])
>>> h = 1
>>> MDM, pval = HLN_MDM(d, h)
>>> print(f"MDM Statistic: {MDM}, p-value: {pval}")
MDM Statistic: 1.414213562373095, p-value: 0.1826592511440174
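
For reference, the textbook form of the statistic can be computed with the standard library alone. The sketch below follows the usual Harvey, Leybourne and Newbold (1997) construction: a Diebold-Mariano statistic with a rectangular long-run variance over h − 1 autocovariances, scaled by the small-sample correction factor. `mdm_statistic` is a hypothetical name, and the library's implementation may differ in detail (for example in the variance estimator), so its output need not coincide with this sketch.

```python
import math


def mdm_statistic(d, h):
    """Textbook HLN (1997) modified Diebold-Mariano statistic for differentials d."""
    n = len(d)
    d_bar = sum(d) / n

    def autocov(k):
        # Sample autocovariance of d at lag k (divisor n)
        return sum((d[t] - d_bar) * (d[t - k] - d_bar) for t in range(k, n)) / n

    # Long-run variance using autocovariances up to lag h - 1
    long_run_var = autocov(0) + 2 * sum(autocov(k) for k in range(1, h))
    dm = d_bar / math.sqrt(long_run_var / n)

    # Small-sample correction factor from HLN (1997)
    correction = math.sqrt((n + 1 - 2 * h + h * (h - 1) / n) / n)
    return correction * dm


print(mdm_statistic([0.1, 0.2, -0.1, 0.05, 0.3], h=1))
```

The p-value is then obtained by comparing the statistic with a Student t distribution with n − 1 degrees of freedom.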

GTBpy.forecast_functions.forecast_table(self)

Generate a table comparing forecast performance across different predictors and horizons.

This table includes the Mean Absolute Forecast Error (MAFE), Mean Squared Forecast Error (MSFE), and p-values for hypothesis tests related to equal predictive accuracy and forecast encompassing.

Hypotheses:

H_{0,1}: Equal MAFE between the predictor model and the base model.
H_{0,2}: Equal MSFE between the predictor model and the base model.
H_{0,3}: The predictor model forecast encompasses the base model.

Returns

result (pandas.DataFrame): A DataFrame with a MultiIndex of forecast horizons (h) and predictors, and columns including:
    - ‘HPI lags’: The lags of the dependent variable (HPI).
    - ‘Exog lags’: The lags of the exogenous variable.
    - ‘MAFE’: The Mean Absolute Forecast Error for the predictor.
    - ‘MSFE’: The Mean Squared Forecast Error for the predictor.
    - ‘MAFE improvement’: The percentage improvement in MAFE relative to the base model.
    - ‘MSFE improvement’: The percentage improvement in MSFE relative to the base model.
    - ‘H_{0,1}’: The p-value for the hypothesis test of equal MAFE (using the Modified Diebold-Mariano test).
    - ‘H_{0,2}’: The p-value for the hypothesis test of equal MSFE (using the Modified Diebold-Mariano test).
    - ‘H_{0,3}’: The p-value for the hypothesis test of forecast encompassing (using the Modified Diebold-Mariano test).

Notes

The base model is represented by the ‘None’ predictor.
Forecast encompassing tests whether the predictor model contains all the information in the base model.
The method uses the Harvey, Leybourne, and Newbold (1997) Modified Diebold-Mariano test to compute p-values.

Examples

>>> table = model.forecast_table()
>>> print(table)
               HPI lags  Exog lags    MAFE     MSFE  MAFE improvement  MSFE improvement  H_{0,1}  H_{0,2}  H_{0,3}
h  Predictors
1  None             ...        ...  0.0251  0.00123             0.000             0.000   0.0321   0.0287   0.1025
   UHSI_1           ...        ...  0.0223  0.00110             0.111             0.105   0.2103   0.1325   0.0923
   UHSI_3           ...        ...  0.0210  0.00105             0.162             0.147   0.1809   0.1014   0.0534
...
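
Each of the three hypotheses corresponds to a series of loss differentials that can be passed to the Modified Diebold-Mariano test. The sketch below shows one common way to build them from two error series; `loss_differentials` is a hypothetical helper, and the sign convention and the choice of the predictor model as the encompassing forecast in d_t = e_pred,t (e_pred,t − e_base,t), following Harvey, Leybourne and Newbold (1998), are assumptions about how the table is computed.

```python
def loss_differentials(e_pred, e_base):
    """Build the differential series for H_{0,1}, H_{0,2} and H_{0,3}.

    e_pred: forecast errors of the predictor model
    e_base: forecast errors of the base model (same length)
    """
    pairs = list(zip(e_pred, e_base))
    d_mafe = [abs(p) - abs(b) for p, b in pairs]      # equal absolute error
    d_msfe = [p ** 2 - b ** 2 for p, b in pairs]      # equal squared error
    d_encompassing = [p * (p - b) for p, b in pairs]  # forecast encompassing
    return d_mafe, d_msfe, d_encompassing


d1, d2, d3 = loss_differentials([1.0, -2.0], [2.0, -1.0])
print(d1, d2, d3)
```

Under each null hypothesis the corresponding differential series has mean zero, which is what the test statistic evaluates.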

GTBpy.forecast_functions.results(self, results_table=False, MAFE_val_df=False, MAFE_val_df_improve=False, MSFE_val_df=False, MSFE_val_df_improve=False, forecast_criteria_df=False, lags_df=False, MAFE_train_df=False, MSFE_train_df=False, head=10)

Display selected result tables generated during the forecasting process.

Parameters

results_table (bool, optional): If True, display the full results table from the forecast comparison (default is False).
MAFE_val_df (bool, optional): If True, display the Mean Absolute Forecast Error (MAFE) validation DataFrame (default is False).
MAFE_val_df_improve (bool, optional): If True, display the percentage improvement in MAFE for each predictor (default is False).
MSFE_val_df (bool, optional): If True, display the Mean Squared Forecast Error (MSFE) validation DataFrame (default is False).
MSFE_val_df_improve (bool, optional): If True, display the percentage improvement in MSFE for each predictor (default is False).
forecast_criteria_df (bool, optional): If True, display the forecast criteria DataFrame, showing the selected model criteria (default is False).
lags_df (bool, optional): If True, display the DataFrame containing the selected lags for each model (default is False).
MAFE_train_df (bool, optional): If True, display the Mean Absolute Forecast Error (MAFE) training DataFrame (default is False).
MSFE_train_df (bool, optional): If True, display the Mean Squared Forecast Error (MSFE) training DataFrame (default is False).
head (int or bool, optional): Number of rows to display from each DataFrame (default is 10). If True, display all rows of each DataFrame.

Returns

None: Displays the selected DataFrames in the Jupyter Notebook environment.

Notes

The method allows selective display of any of the key result tables generated during the model comparison process.
It uses the display function from IPython to show the DataFrames.

Examples

>>> model.results(results_table=True, MAFE_val_df=True, head=5)
    Displays the results table and the MAFE validation DataFrame with the top 5 rows.