forecast_functions module
Functions for forecasting the House Price Index (HPI) with an ARX model.
The module also includes two forecast performance evaluation tests:
- Testing the equality of prediction mean squared errors: David I. Harvey, Stephen J. Leybourne, Paul Newbold (1997)
- Tests for forecast encompassing: David I. Harvey, Stephen J. Leybourne, Paul Newbold (1998)
- GTBpy.forecast_functions.lags_list_function(lags, max_lag)
Generate a list of lag combinations based on the specified lags parameter and maximum lag value.
Parameters
- lags : {‘Auto’, ‘glob’, list of int}
Determines the type of lag combinations to generate:
  - ‘Auto’: Automatically generates sequential lags up to max_lag.
  - ‘glob’: Generates all possible combinations of lags up to max_lag.
  - list of int: Uses the specified list of lags directly.
- max_lag : int
The maximum lag value to consider when generating lag combinations.
Returns
- lags_list : list of list of int
A list where each element is a list of integers representing a combination of lag values. The specific combinations depend on the input lags parameter.
Examples
>>> lags_list_function('Auto', 3)
[[], [1], [1, 2], [1, 2, 3]]
>>> lags_list_function('glob', 2)
[[], [1], [2], [1, 2]]
>>> lags_list_function([1, 2], 3)
[[1, 2]]
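The three modes can be illustrated with a short standalone sketch. The function name `sketch_lags_list` is hypothetical and the body is an assumption reconstructed from the documented examples, not the library's source.

```python
from itertools import combinations

def sketch_lags_list(lags, max_lag):
    """Hypothetical re-implementation that reproduces the documented examples."""
    if lags == 'Auto':
        # Nested sequences: no lags, then [1], [1, 2], ..., [1, ..., max_lag].
        return [list(range(1, k + 1)) for k in range(max_lag + 1)]
    if lags == 'glob':
        # Every subset of {1, ..., max_lag}, ordered by subset size.
        pool = range(1, max_lag + 1)
        return [list(c) for size in range(max_lag + 1)
                for c in combinations(pool, size)]
    # An explicit list is used as the single candidate combination.
    return [list(lags)]

print(sketch_lags_list('Auto', 3))
print(sketch_lags_list('glob', 2))
```

Under this reading, ‘glob’ produces 2**max_lag candidates, so it is likely to become expensive for larger values of max_lag, while ‘Auto’ produces only max_lag + 1.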
- GTBpy.forecast_functions.result_table(index, header, h_list, index_name, first_index='None')
Create a multi-indexed DataFrame with specified index and column headers.
Parameters
- index : list of str
A list of labels for the DataFrame’s index.
- header : str
The header label for the DataFrame’s columns.
- h_list : list of int
A list of integers that will be appended to the header label to create the column names.
- index_name : str
The name to assign to the DataFrame’s index.
- first_index : str, optional
The label for the first index position. Defaults to ‘None’.
Returns
- df : pandas.DataFrame
A DataFrame with a MultiIndex for the columns, where the first level is the header and the second level corresponds to ‘h=’ followed by each element in h_list. The DataFrame’s index is set to the provided index list, with an optional first_index prepended.
Examples
>>> result_table(['A', 'B', 'C'], 'Metric', [1, 2, 3], 'Category')
     Metric
        h=1  h=2  h=3
None    NaN  NaN  NaN
A       NaN  NaN  NaN
B       NaN  NaN  NaN
C       NaN  NaN  NaN
>>> result_table(['X', 'Y'], 'Value', [1, 2], 'Type', 'Start')
      Value
        h=1  h=2
Start   NaN  NaN
X       NaN  NaN
Y       NaN  NaN
- GTBpy.forecast_functions.compute_ic_cv(y, X, metric, cv_criteria='MSFE', fit_intercept=True, cv=5, shuffle=False, n_iter=20, seed=None)
Compute the information criterion (IC) or cross-validation (CV) score based on the specified metric and criteria.
Parameters
- y : array-like or pandas.Series
The dependent variable vector (target).
- X : array-like or pandas.DataFrame
The independent variable matrix (features).
- metric : {‘CV’, ‘IC’}
The type of metric to compute:
  - ‘CV’: Cross-validation metric based on the cv_criteria.
  - ‘IC’: Information criterion (e.g., BIC).
- cv_criteria : {‘MSFE’, ‘MAFE’}, optional
The criterion used to compute the CV score when metric is ‘CV’:
  - ‘MSFE’: Mean Squared Forecast Error (negative mean squared error).
  - ‘MAFE’: Mean Absolute Forecast Error (negative mean absolute error).
Default is ‘MSFE’.
- fit_intercept : bool, optional
Whether to calculate the intercept for the linear model. If set to False, no intercept will be used in calculations. Default is True.
- cv : int, optional
The number of folds in cross-validation. Default is 5.
- shuffle : bool, optional
Whether to shuffle the data before splitting into batches in cross-validation. Default is False.
- n_iter : int, optional
The number of iterations for cross-validation when shuffle is True. Default is 20.
- seed : int, optional
The random seed for reproducibility when shuffling data in cross-validation. Default is None.
Returns
- ic : float
The computed information criterion (IC) or cross-validation (CV) score.
Examples
>>> compute_ic_cv(y, X, metric='CV', cv_criteria='MSFE', cv=5)
0.045
>>> compute_ic_cv(y, X, metric='IC')
210.34
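To make the two metrics concrete, here is a minimal pure-Python sketch for the simplest case of one regressor with an intercept. All names (`fit_line`, `bic`, `cv_score`) are hypothetical, the BIC form n*ln(RSS/n) + k*ln(n) is the usual Gaussian expression up to a constant, and the sign and scaling that compute_ic_cv actually returns are not confirmed by this page.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x (one regressor plus intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b

def bic(x, y):
    """Gaussian BIC up to a constant: n*ln(RSS/n) + k*ln(n), with k = 2 here."""
    a, b = fit_line(x, y)
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    n = len(y)
    return n * math.log(rss / n) + 2 * math.log(n)

def cv_score(x, y, cv=5, criterion='MSFE'):
    """Unshuffled K-fold CV: mean out-of-fold squared or absolute error."""
    n = len(y)
    bounds = [round(i * n / cv) for i in range(cv + 1)]
    losses = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        # Fit on everything outside the fold, score on the fold itself.
        a, b = fit_line(x[:lo] + x[hi:], y[:lo] + y[hi:])
        for xi, yi in zip(x[lo:hi], y[lo:hi]):
            err = yi - a - b * xi
            losses.append(err ** 2 if criterion == 'MSFE' else abs(err))
    return sum(losses) / len(losses)

# Hypothetical data: a line with a small alternating disturbance.
x = [float(i) for i in range(10)]
y = [1.0 + 2.0 * xi + (0.5 if i % 2 else -0.5) for i, xi in enumerate(x)]
print(bic(x, y), cv_score(x, y, cv=5), cv_score(x, y, cv=5, criterion='MAFE'))
```

In both cases a smaller value indicates a better candidate, which matches the documented use of the minimum score during lag selection.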
- GTBpy.forecast_functions.lag_selector(df, lag_select, seed=None, cv_criteria='MSFE', cv=5, shuffle=False, n_iter=20, h=1, max_lag=13, exog=None, var_order='cross', y_lags='Auto', exog_lags='Auto', seasonal=False, verbose=0)
Select optimal lags for the dependent and exogenous variables using information criteria (IC) or cross-validation (CV) metrics.
Parameters
- df : pandas.DataFrame
The input DataFrame containing the time series data. The first column is assumed to be the dependent variable (‘y’).
- lag_select : {‘IC’, ‘CV’}
The metric used to select the optimal lags:
  - ‘IC’: Information criterion (e.g., BIC).
  - ‘CV’: Cross-validation score based on cv_criteria.
- seed : int, optional
Random seed for reproducibility in cross-validation. Default is None.
- cv_criteria : {‘MSFE’, ‘MAFE’}, optional
The criterion used for cross-validation when lag_select is ‘CV’:
  - ‘MSFE’: Mean Squared Forecast Error (default).
  - ‘MAFE’: Mean Absolute Forecast Error.
- cv : int, optional
Number of folds for cross-validation. Default is 5.
- shuffle : bool, optional
Whether to shuffle the data before splitting into batches for cross-validation. Default is False.
- n_iter : int, optional
Number of iterations for cross-validation when shuffle is True. Default is 20.
- h : int, optional
Forecast horizon, indicating how many steps ahead the model is predicting. Default is 1.
- max_lag : int, optional
The maximum lag order to consider for selection. Default is 13.
- exog : str, optional
Name of the exogenous variable in the DataFrame. Default is None.
- var_order : {‘cross’, ‘nested’}, optional
The order in which lag selection is performed:
  - ‘cross’: Cross all combinations of lags for ‘y’ and ‘exog’.
  - ‘nested’: Select the best lags for ‘y’ first, then choose the best lags for ‘exog’ based on the selected ‘y’ lags.
Default is ‘cross’.
- y_lags : {‘Auto’, ‘glob’, list of int}, optional
The lags to consider for the dependent variable:
  - ‘Auto’: Automatically generate lag sequences up to max_lag.
  - ‘glob’: Generates all possible combinations of lags up to max_lag.
  - list of int: Use the specified lags directly.
Default is ‘Auto’.
- exog_lags : {‘Auto’, ‘glob’, list of int}, optional
The lags to consider for the exogenous variable, if applicable:
  - ‘Auto’: Automatically generate lag sequences up to max_lag.
  - ‘glob’: Generates all possible combinations of lags up to max_lag.
  - list of int: Use the specified lags directly.
Default is ‘Auto’.
- seasonal : bool, optional
Whether to include seasonal dummies (monthly) in the model. Default is False.
- verbose : int, optional
If greater than 1, prints detailed information about the lag selection process. Default is 0.
Returns
- y_lags : list of int
The optimal lags for the dependent variable (‘y’) based on the specified metric.
- exog_lags : list of int
The optimal lags for the exogenous variable based on the specified metric.
- IC : float
The minimum information criterion (IC) or cross-validation score obtained.
- ics : dict
A dictionary where keys are the IC/CV scores and values are the corresponding lags for ‘y’ and ‘exog’.
Examples
>>> lag_selector(df, lag_select='CV', cv_criteria='MSFE', h=1)
([1, 2, 3], [1], 0.034, {...})
>>> lag_selector(df, lag_select='IC', exog='ExogVar', h=2, max_lag=5)
([1, 3], [1], 210.45, {...})
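The difference between the ‘cross’ and ‘nested’ search orders can be sketched with a toy scoring function. Everything below is hypothetical: `select_lags` and `toy_score` are illustrative names, and the assumption that the nested search scores the ‘y’ lags without any exogenous lags is a guess, not something this page states.

```python
from itertools import product

def select_lags(y_candidates, exog_candidates, score, var_order='cross'):
    """Return (y_lags, exog_lags, best_score), minimising `score`."""
    if var_order == 'cross':
        # Score every (y lags, exog lags) pair jointly.
        pairs = product(y_candidates, exog_candidates)
        best = min(pairs, key=lambda p: score(p[0], p[1]))
    else:
        # 'nested': fix the best y lags first, then search the exog lags.
        best_y = min(y_candidates, key=lambda yl: score(yl, []))
        best_x = min(exog_candidates, key=lambda xl: score(best_y, xl))
        best = (best_y, best_x)
    return best[0], best[1], score(best[0], best[1])

def toy_score(y_lags, exog_lags):
    """Hypothetical stand-in for an IC or CV score (lower is better)."""
    return (len(y_lags) + len(exog_lags) - 2) ** 2 + 0.1 * len(y_lags)

candidates = [[], [1], [1, 2]]
print(select_lags(candidates, candidates, toy_score, 'cross'))
print(select_lags(candidates, candidates, toy_score, 'nested'))
```

With this toy score the joint search finds a combination the nested search misses. The trade-off is cost: the joint search evaluates len(y_candidates) * len(exog_candidates) models, the nested one only len(y_candidates) + len(exog_candidates).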
- GTBpy.forecast_functions.model(df, h=1, max_lag=3, exog=None, seasonal=False, lag_select='IC', seed=None, cv=5, shuffle=False, n_iter=20, y_lags='Auto', exog_lags='Auto', var_order='cross', train_cut=0.8, verbose=0, log=False, plot=False, original_hpi=None, original_scale=True)
Fit an autoregressive exogenous (ARX) model with selected lags using information criteria (IC) or cross-validation (CV) for dependent and independent variables, and evaluate its performance.
Parameters
- df : pandas.DataFrame
The input DataFrame containing the time series data. The first column is assumed to be the dependent variable (‘y’).
- h : int, optional
Forecast horizon, indicating how many steps ahead the model is predicting. Default is 1.
- max_lag : int, optional
The maximum lag order to consider for selection. Default is 3.
- exog : str, optional
Name of the exogenous variable in the DataFrame. Default is None.
- seasonal : bool, optional
Whether to include seasonal dummies (monthly) in the model. Default is False.
- lag_select : {‘IC’, ‘CV’}, optional
The metric used to select the optimal lags:
  - ‘IC’: Information criterion (e.g., BIC).
  - ‘CV’: Cross-validation score based on cv_criteria.
Default is ‘IC’.
- seed : int, optional
Random seed for reproducibility in cross-validation. Default is None.
- cv : int, optional
Number of folds for cross-validation. Default is 5.
- shuffle : bool, optional
Whether to shuffle the data before splitting into batches for cross-validation. Default is False.
- n_iter : int, optional
Number of iterations for cross-validation when shuffle is True. Default is 20.
- y_lags : {‘Auto’, ‘glob’, list of int}, optional
The lags to consider for the dependent variable:
  - ‘Auto’: Automatically generate lag sequences up to max_lag.
  - ‘glob’: Generates all possible combinations of lags up to max_lag.
  - list of int: Use the specified lags directly.
Default is ‘Auto’.
- exog_lags : {‘Auto’, ‘glob’, list of int}, optional
The lags to consider for the exogenous variable, if applicable:
  - ‘Auto’: Automatically generate lag sequences up to max_lag.
  - ‘glob’: Generates all possible combinations of lags up to max_lag.
  - list of int: Use the specified lags directly.
Default is ‘Auto’.
- var_order : {‘cross’, ‘nested’}, optional
The order in which lag selection is performed:
  - ‘cross’: Cross all combinations of lags for ‘y’ and ‘exog’.
  - ‘nested’: Select the best lags for ‘y’ first, then choose the best lags for ‘exog’ based on the selected ‘y’ lags.
Default is ‘cross’.
- train_cut : float or str or datetime-like, optional
The cutoff point for splitting the data into training and validation sets. Can be a float between 0 and 1 representing the proportion of the data to use for training, or a specific index value. Default is 0.8.
- verbose : int, optional
Level of verbosity for debugging or detailed output. Default is 0.
- log : bool, optional
If True, the data is assumed to be log-transformed and predictions will be back-transformed. Default is False.
- plot : bool, optional
If True, plots the actual vs. predicted values with a vertical line indicating the training/validation split. Default is False.
- original_hpi : pandas.Series, optional
Original HPI (House Price Index) values for back-transforming predictions. Only used if original_scale is True. Default is None.
- original_scale : bool, optional
Whether to return the predictions on the original scale (before any transformations). Default is True.
Returns
- MAFE_val : float
Mean Absolute Forecast Error on the validation set.
- MSFE_val : float
Mean Squared Forecast Error on the validation set.
- MAFE_train : float
Mean Absolute Forecast Error on the training set.
- MSFE_train : float
Mean Squared Forecast Error on the training set.
- lags_set : dict
Dictionary containing the selected lags for ‘y’ and ‘exog’.
- IC : float
The minimum information criterion (IC) or cross-validation score obtained during lag selection.
- res_full : pandas.Series
The full set of predictions for the dependent variable, including both training and validation periods.
Examples
>>> MAFE_val, MSFE_val, MAFE_train, MSFE_train, lags_set, IC, res_full = model(df, h=1, max_lag=3, exog='ExogVar')
>>> print(MAFE_val, MSFE_val, lags_set)
0.034 0.002 {'y lags': [1, 2], 'exog lags': [1]}
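The pieces this function combines (lagged regressors at horizon h, a proportional train/validation split, and the two error measures) can be sketched in plain Python. The helper names are hypothetical, and two details are assumptions rather than documented behaviour: that lag k at horizon h refers to the observation h + k - 1 steps before the target, and that a float train_cut is truncated to an integer row count.

```python
def lagged_rows(y, lags, h=1):
    """Pair each target y[t] with regressors y[t - h - k + 1] for k in lags."""
    start = h + max(lags, default=0) - 1
    return [([y[t - h - k + 1] for k in lags], y[t])
            for t in range(start, len(y))]

def split_index(n_obs, train_cut=0.8):
    """Number of leading observations assigned to the training set."""
    return int(n_obs * train_cut)

def mafe(actual, predicted):
    """Mean Absolute Forecast Error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def msfe(actual, predicted):
    """Mean Squared Forecast Error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

series = [10, 11, 12, 13]
print(lagged_rows(series, [1, 2], h=1))  # regressors are y[t-1], y[t-2]
print(lagged_rows(series, [1], h=2))     # regressor is y[t-2]
print(split_index(10, 0.8))
print(mafe([1, 2, 3], [1, 1, 5]), msfe([1, 2, 3], [1, 1, 5]))
```

Because the split is chronological, the validation errors measure out-of-sample accuracy on the most recent part of the sample.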
- GTBpy.forecast_functions.compare_exog(df, train_cut, h_list=[1, 3, 6, 12], seasonal=False, max_lag=3, lag_select='IC', seed=None, lag_fit_intercept=True, lag_cv=5, lag_shuffle=False, lag_iter=20, hsi_CV_select=False, hsi_fit_intercept=True, hsi_cv=4, hsi_iter=40, y_lags='Auto', exog_lags='Auto', var_order='cross', verbose=False, log=False, original_hpi=None, original_scale=True, sort_df='criteria', sort_col=-1)
Compare the impact of different exogenous variables on forecasting accuracy using various evaluation metrics.
Parameters
- df : pandas.DataFrame
The input DataFrame containing the time series data. The first column is assumed to be the dependent variable (‘HPI’), and the subsequent columns are potential exogenous variables.
- train_cut : float or str or datetime-like
The cutoff point for splitting the data into training and validation sets. Can be a float between 0 and 1 representing the proportion of the data to use for training, or a specific index value.
- h_list : list of int, optional
A list of forecast horizons (e.g., [1, 3, 6, 12]) to evaluate. Default is [1, 3, 6, 12].
- seasonal : bool, optional
Whether to include seasonal dummies (monthly) in the model. Default is False.
- max_lag : int, optional
The maximum lag order to consider for selection. Default is 3.
- lag_select : {‘IC’, ‘CV’}, optional
The metric used to select the optimal lags:
  - ‘IC’: Information criterion (e.g., BIC).
  - ‘CV’: Cross-validation score.
Default is ‘IC’.
- seed : int, optional
Random seed for reproducibility in cross-validation. Default is None.
- lag_fit_intercept : bool, optional
Whether to include an intercept in the lag selection linear regression model. Default is True.
- lag_cv : int, optional
Number of folds for cross-validation during lag selection. Default is 5.
- lag_shuffle : bool, optional
Whether to shuffle the data before splitting into batches for cross-validation during lag selection. Default is False.
- lag_iter : int, optional
Number of iterations for cross-validation during lag selection when lag_shuffle is True. Default is 20.
- hsi_CV_select : bool, optional
Whether to perform cross-validation for the selection of the UHSI (Unified Housing Sentiment Index) based on the selected lags. Default is False.
- hsi_fit_intercept : bool, optional
Whether to include an intercept in the UHSI selection linear regression model. Default is True.
- hsi_cv : int, optional
Number of folds for cross-validation during UHSI selection. Default is 4.
- hsi_iter : int, optional
Number of iterations for cross-validation during UHSI selection when hsi_CV_select is True. Default is 40.
- y_lags : {‘Auto’, ‘glob’, list of int}, optional
The lags to consider for the dependent variable (‘HPI’):
  - ‘Auto’: Automatically generate lag sequences up to max_lag.
  - ‘glob’: Generates all possible combinations of lags up to max_lag.
  - list of int: Use the specified lags directly.
Default is ‘Auto’.
- exog_lags : {‘Auto’, ‘glob’, list of int}, optional
The lags to consider for the exogenous variables, if applicable:
  - ‘Auto’: Automatically generate lag sequences up to max_lag.
  - ‘glob’: Generates all possible combinations of lags up to max_lag.
  - list of int: Use the specified lags directly.
Default is ‘Auto’.
- var_order : {‘cross’, ‘nested’}, optional
The order in which lag selection is performed:
  - ‘cross’: Cross all combinations of lags for ‘y’ and ‘exog’.
  - ‘nested’: Select the best lags for ‘y’ first, then choose the best lags for ‘exog’ based on the selected ‘y’ lags.
Default is ‘cross’.
- verbose : bool, optional
If True, print detailed output during the function execution. Default is False.
- log : bool, optional
If True, the data is assumed to be log-transformed and predictions will be back-transformed. Default is False.
- original_hpi : pandas.Series, optional
Original HPI (House Price Index) values for back-transforming predictions. Only used if original_scale is True. Default is None.
- original_scale : bool, optional
Whether to return the predictions on the original scale (before any transformations). Default is True.
- sort_df : {‘criteria’, ‘MAFE’, ‘MSFE’}, optional
Criteria to sort the results by:
  - ‘criteria’: Sort by the lag selection criteria (e.g., IC or CV).
  - ‘MAFE’: Sort by Mean Absolute Forecast Error.
  - ‘MSFE’: Sort by Mean Squared Forecast Error.
Default is ‘criteria’.
- sort_col : int, optional
The index of the column, within the table chosen by sort_df, to sort on. Default is -1 (last column).
Returns
- MAFE_val_df : pandas.DataFrame
DataFrame containing the Mean Absolute Forecast Error (MAFE) on the validation set for each exogenous variable and forecast horizon.
- MAFE_val_df_improve : pandas.DataFrame
DataFrame containing the percentage improvement in MAFE on the validation set for each exogenous variable compared to the baseline (no exogenous variables).
- MSFE_val_df : pandas.DataFrame
DataFrame containing the Mean Squared Forecast Error (MSFE) on the validation set for each exogenous variable and forecast horizon.
- MSFE_val_df_improve : pandas.DataFrame
DataFrame containing the percentage improvement in MSFE on the validation set for each exogenous variable compared to the baseline.
- criteria_df : pandas.DataFrame
DataFrame containing the lag selection criteria (e.g., IC or CV) values for each exogenous variable and forecast horizon.
- lags_df : pandas.DataFrame
DataFrame containing the selected lags for the dependent variable (‘HPI’) and each exogenous variable.
- MAFE_train_df : pandas.DataFrame
DataFrame containing the Mean Absolute Forecast Error (MAFE) on the training set for each exogenous variable and forecast horizon.
- MSFE_train_df : pandas.DataFrame
DataFrame containing the Mean Squared Forecast Error (MSFE) on the training set for each exogenous variable and forecast horizon.
- forecast_dict : dict
A dictionary where each key is a forecast horizon (from h_list) and the corresponding value is a DataFrame containing the actual vs. predicted values for each exogenous variable.
- df : pandas.DataFrame
The modified input DataFrame with additional columns for principal components (UHSI) if hsi_CV_select is True.
Examples
>>> (MAFE_val_df, MAFE_val_df_improve, MSFE_val_df, MSFE_val_df_improve,
...  criteria_df, lags_df, MAFE_train_df, MSFE_train_df, forecast_dict, df) = compare_exog(
...     df, '2023-01-01', h_list=[1, 6, 12], lag_select='IC', seasonal=True)
>>> print(MAFE_val_df)
           Horizon 1  Horizon 6  Horizon 12
exog_var1       0.03       0.04        0.05
exog_var2       0.02       0.03        0.04
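The improvement tables compare each exogenous variable against the baseline without exogenous variables. A minimal sketch of that comparison follows, assuming the improvement is the relative error reduction (baseline - model) / baseline. Whether the library reports it as a fraction or multiplied by 100 is not stated here, and the error values below are hypothetical.

```python
def relative_improvement(baseline_error, model_error):
    """Fractional error reduction versus the baseline (positive means better)."""
    return (baseline_error - model_error) / baseline_error

# Hypothetical validation MAFE values for one horizon; 'None' is the baseline.
mafe_val = {'None': 0.050, 'exog_var1': 0.040, 'exog_var2': 0.055}
improve = {name: relative_improvement(mafe_val['None'], err)
           for name, err in mafe_val.items()}

for name, value in improve.items():
    print(f"{name}: {value:+.1%}")
```

A negative value means the exogenous variable made the validation error worse than the baseline.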
- GTBpy.forecast_functions.HLN_MDM(d, h)
Compute the Harvey, Leybourne, and Newbold (1997) Modified Diebold-Mariano (MDM) statistic and its p-value.
Parameters
- d : numpy.ndarray or pandas.Series
The array or series of forecast error differentials. This represents the difference between the forecast errors of two competing models.
- h : int
The forecast horizon (number of steps ahead).
Returns
- MDM : float
The Modified Diebold-Mariano statistic.
- pval : float
The p-value associated with the MDM statistic, under the null hypothesis that the forecast accuracy of the two models is the same.
Notes
The Harvey, Leybourne, and Newbold (1997) test modifies the Diebold-Mariano test to account for small-sample bias, especially when the forecast horizon is greater than one.
This test is particularly useful in comparing predictive accuracy when forecasts are based on overlapping data.
References
Harvey, D. I., Leybourne, S. J., & Newbold, P. (1997). Testing the equality of prediction mean squared errors. International Journal of Forecasting, 13(2), 281-291.
Examples
>>> d = np.array([0.1, 0.2, -0.1, 0.05, 0.3])
>>> h = 1
>>> MDM, pval = HLN_MDM(d, h)
>>> print(f"MDM Statistic: {MDM}, p-value: {pval}")
MDM Statistic: 1.414213562373095, p-value: 0.1826592511440174
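For readers who want to see the statistic itself, here is a pure-Python sketch of the textbook form from Harvey, Leybourne and Newbold (1997): the Diebold-Mariano ratio with a rectangular-window variance using autocovariances up to lag h - 1, multiplied by the small-sample correction. The name `mdm_statistic` is hypothetical, and the library's HLN_MDM may differ in details such as the variance estimator.

```python
import math

def mdm_statistic(d, h):
    """Modified Diebold-Mariano statistic for loss differentials d at horizon h."""
    n = len(d)
    mean = sum(d) / n
    dev = [x - mean for x in d]
    # Sample autocovariances of d at lags 0..h-1 (divisor n).
    gamma = [sum(dev[t] * dev[t - k] for t in range(k, n)) / n for k in range(h)]
    # Estimated variance of the mean differential.
    var_mean = (gamma[0] + 2 * sum(gamma[1:])) / n
    dm = mean / math.sqrt(var_mean)
    # HLN small-sample correction factor.
    correction = math.sqrt((n + 1 - 2 * h + h * (h - 1) / n) / n)
    return correction * dm

print(mdm_statistic([1.0, 2.0, 3.0, 4.0, 5.0], h=1))
print(mdm_statistic([1.0, 2.0, 3.0, 4.0, 5.0], h=2))
```

In the original paper the statistic is compared with a Student t distribution with n - 1 degrees of freedom. The p-value step is omitted here to keep the sketch free of third-party dependencies. For h = 1 the statistic reduces to the ordinary one-sample t-statistic of d.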
- GTBpy.forecast_functions.forecast_table(self)
Generate a table comparing forecast performance across different predictors and horizons.
This table includes the Mean Absolute Forecast Error (MAFE), Mean Squared Forecast Error (MSFE), and p-values for hypothesis tests related to equal predictive accuracy and forecast encompassing.
Hypotheses:
H_{0,1}: Equal MAFE between the predictor model and the base model.
H_{0,2}: Equal MSFE between the predictor model and the base model.
H_{0,3}: The predictor model forecast encompasses the base model.
Returns
- result : pandas.DataFrame
A DataFrame with a MultiIndex of forecast horizons (h) and predictors, and columns including:
  - ‘HPI lags’: The lags of the dependent variable (HPI).
  - ‘Exog lags’: The lags of the exogenous variable.
  - ‘MAFE’: The Mean Absolute Forecast Error for the predictor.
  - ‘MSFE’: The Mean Squared Forecast Error for the predictor.
  - ‘MAFE improvement’: The percentage improvement in MAFE relative to the base model.
  - ‘MSFE improvement’: The percentage improvement in MSFE relative to the base model.
  - ‘H_{0,1}’: The p-value for the hypothesis test of equal MAFE (using the Modified Diebold-Mariano test).
  - ‘H_{0,2}’: The p-value for the hypothesis test of equal MSFE (using the Modified Diebold-Mariano test).
  - ‘H_{0,3}’: The p-value for the hypothesis test of forecast encompassing (using the Modified Diebold-Mariano test).
Notes
The base model is represented by the ‘None’ predictor.
Forecast encompassing tests whether the predictor model contains all the information in the base model.
The method uses the Harvey, Leybourne, and Newbold (1997) Modified Diebold-Mariano test to compute p-values.
Examples
>>> table = model.forecast_table()
>>> print(table)
              HPI lags Exog lags    MAFE     MSFE  MAFE improvement  MSFE improvement  H_{0,1}  H_{0,2}  H_{0,3}
h Predictors
1 None             ...       ...  0.0251  0.00123             0.000             0.000   0.0321   0.0287   0.1025
  UHSI_1           ...       ...  0.0223  0.00110             0.111             0.105   0.2103   0.1325   0.0923
  UHSI_3           ...       ...  0.0210  0.00105             0.162             0.147   0.1809   0.1014   0.0534
...
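Each of the three hypotheses corresponds to a series of loss differentials that a Modified Diebold-Mariano test is then applied to. The sketch below shows one conventional construction, assuming forecast errors e = actual - forecast. The encompassing differential e_pred * (e_pred - e_base) follows Harvey, Leybourne and Newbold (1998). The function name and the sign conventions are assumptions, not taken from the library.

```python
def loss_differentials(e_base, e_pred):
    """Differential series for H_{0,1}, H_{0,2} and H_{0,3}."""
    pairs = list(zip(e_base, e_pred))
    # H_{0,1}: equal absolute error, d_t = |e_base| - |e_pred|.
    d_abs = [abs(b) - abs(p) for b, p in pairs]
    # H_{0,2}: equal squared error, d_t = e_base^2 - e_pred^2.
    d_sq = [b * b - p * p for b, p in pairs]
    # H_{0,3}: predictor encompasses base, d_t = e_pred * (e_pred - e_base).
    d_enc = [p * (p - b) for b, p in pairs]
    return d_abs, d_sq, d_enc

# Hypothetical forecast errors of the base and predictor models.
d_abs, d_sq, d_enc = loss_differentials([1.0, -2.0], [0.5, -1.0])
print(d_abs, d_sq, d_enc)
```

Under each null hypothesis the corresponding differential has mean zero, so a small p-value for H_{0,1} or H_{0,2} points to a difference in accuracy, while a small p-value for H_{0,3} suggests the base model still carries information the predictor model lacks.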
- GTBpy.forecast_functions.results(self, results_table=False, MAFE_val_df=False, MAFE_val_df_improve=False, MSFE_val_df=False, MSFE_val_df_improve=False, forecast_criteria_df=False, lags_df=False, MAFE_train_df=False, MSFE_train_df=False, head=10)
Display selected result tables generated during the forecasting process.
Parameters
- results_table : bool, optional
If True, display the full results table from the forecast comparison (default is False).
- MAFE_val_df : bool, optional
If True, display the Mean Absolute Forecast Error (MAFE) validation DataFrame (default is False).
- MAFE_val_df_improve : bool, optional
If True, display the percentage improvement in MAFE for each predictor (default is False).
- MSFE_val_df : bool, optional
If True, display the Mean Squared Forecast Error (MSFE) validation DataFrame (default is False).
- MSFE_val_df_improve : bool, optional
If True, display the percentage improvement in MSFE for each predictor (default is False).
- forecast_criteria_df : bool, optional
If True, display the forecast criteria DataFrame, showing the selected model criteria (default is False).
- lags_df : bool, optional
If True, display the DataFrame containing the selected lags for each model (default is False).
- MAFE_train_df : bool, optional
If True, display the Mean Absolute Forecast Error (MAFE) training DataFrame (default is False).
- MSFE_train_df : bool, optional
If True, display the Mean Squared Forecast Error (MSFE) training DataFrame (default is False).
- head : int or bool, optional
Number of rows to display from each DataFrame (default is 10). If True, display all rows of each DataFrame.
Returns
- None
Displays the selected DataFrames in the Jupyter Notebook environment.
Notes
The method allows selective display of any of the key result tables generated during the model comparison process.
It uses the display function from IPython to show the DataFrames.
Examples
>>> model.results(results_table=True, MAFE_val_df=True, head=5)
Displays the first 5 rows of the results table and of the MAFE validation DataFrame.