google_trends_functions module

Functions for building the HSI from Google Trends data.

GTBpy.google_trends_functions.trend_loader(hpi, folder_path, RE_companies=None, drop_companies=True, title_offset=10)[source]

Loads the Google Trends data stored as CSV files in a folder and returns four DataFrames.

Parameters

hpi (pandas DataFrame of shape (n_samples, 1)): HPI index data.
folder_path (str): Path of the folder in which the Google Trends data are stored.
RE_companies (list of str, default=None): Names of the real estate companies selected to constitute the aggregate company-search index.

Returns

df: pandas DataFrame including HPI and all search queries.
df_full: pandas DataFrame including HPI and the search queries that have real values since 2004-01-01.
df_full_nonzero: pandas DataFrame including HPI and the search queries that have real values since 2004-01-01 and contain no values equal to zero.
df_part: pandas DataFrame including HPI and the search queries that are not in df_full.
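As an illustration of how the four returned DataFrames relate to each other, the following minimal sketch splits an already-merged table by data availability. The helper name split_by_history, the target column name 'HPI', and the reading of "real value" as "not missing" are assumptions made for this example. It is not the trend_loader implementation and it does not read any CSV files.

```python
import numpy as np
import pandas as pd


def split_by_history(df, start="2004-01-01", target="HPI"):
    """Split query columns by availability from `start` onward (illustrative only)."""
    queries = df.columns.drop(target)
    window = df.loc[start:, queries]

    # Assumption: a query has a "real value" when it is not missing.
    has_full_history = window.notna().all().to_numpy()
    has_no_zero = (window != 0).all().to_numpy()

    full_cols = list(queries[has_full_history])
    nonzero_cols = list(queries[has_full_history & has_no_zero])
    part_cols = list(queries[~has_full_history])

    df_full = df[[target] + full_cols]
    df_full_nonzero = df[[target] + nonzero_cols]
    df_part = df[[target] + part_cols]
    return df, df_full, df_full_nonzero, df_part


# Tiny hand-made table: 'a' is complete, 'b' contains a zero, 'c' starts late.
index = pd.date_range("2004-01-01", periods=6, freq="MS")
data = pd.DataFrame(
    {
        "HPI": [100.0, 101.0, 102.0, 103.0, 104.0, 105.0],
        "a": [10, 12, 11, 13, 14, 15],
        "b": [5, 0, 6, 7, 8, 9],
        "c": [np.nan, np.nan, 3, 4, 5, 6],
    },
    index=index,
)
df, df_full, df_full_nonzero, df_part = split_by_history(data)
print(list(df_full.columns), list(df_full_nonzero.columns), list(df_part.columns))
```

In this sketch every returned table keeps the HPI column, so each one can be passed on as a self-contained modeling dataset.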

GTBpy.google_trends_functions.reg_trend_selector(series, train_cut, alpha, recursive)[source]

Determine the appropriate trend model for a time series based on the Augmented Dickey-Fuller (ADF) test.

Parameters

series (pd.Series): The time series data to be analyzed.
train_cut (str or pd.Timestamp): The date or index that defines the end of the training period.
alpha (float): The significance level used to assess the p-value from the ADF test.
recursive (bool): If True, the function recursively differences the series until stationarity is achieved, based on the ADF test.

Returns

series (pd.Series): The original or differenced time series, depending on the ADF test results.
trend_cols (list of str): The columns corresponding to the selected trend model, which may include:
- ‘const’: constant term
- ‘t’: linear time trend
- ‘t2’: quadratic time trend

Notes

The function performs the ADF test on the training period of the series with different trend models (‘c’, ‘ct’, and ‘ctt’).
If the null hypothesis of the ADF test (that the series has a unit root) is not rejected at the given alpha level, the function recursively differences the series if recursive is set to True.
The trend model is chosen based on the significance of the ADF test with different trend components.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> series = pd.Series(np.random.randn(100).cumsum(), index=pd.date_range('2000-01-01', periods=100))
>>> series, trend_cols = reg_trend_selector(series, train_cut='2000-03-31', alpha=0.05, recursive=True)
    The function returns the appropriately differenced series and the trend model to be used.

GTBpy.google_trends_functions.reg_detrend_deseasonal(df, train_cut, deseasonal, detrend, regression='ct', alpha=0.1, recursive=True, col=None)[source]

Detrend and/or deseasonalize a DataFrame using linear regression.

Parameters

df (pd.DataFrame): The time series data to be detrended and/or deseasonalized. Each column is treated as a separate time series.
train_cut (str, int, float, or pd.Timestamp): The date or index that defines the end of the training period. If a float between 0 and 1 is provided, it represents a fraction of the total dataset length.
deseasonal (bool): If True, the function removes the seasonal component from the series using monthly dummies.
detrend (bool): If True, the function removes the trend component from the series based on the specified regression model.
regression ({‘c’, ‘ct’, ‘ctt’} or None, optional): The type of trend model to be used:
- ‘c’: constant term only.
- ‘ct’: constant and linear trend (default).
- ‘ctt’: constant, linear, and quadratic trend.
If None, the function automatically selects the trend model based on the Augmented Dickey-Fuller (ADF) test.
alpha (float, optional): The significance level used for the ADF test when automatically selecting the trend model. Default is 0.1.
recursive (bool, optional): If True, the function recursively differences the series until stationarity is achieved, based on the ADF test. Default is True.
col (list of str, optional): A list of column names in df to be processed. If None, all columns are processed. Default is None.

Returns

pd.DataFrame: A DataFrame with the detrended and/or deseasonalized series. The trend and seasonal components are removed according to the specified parameters.

Notes

The function adds a constant (const), linear time trend (t), and quadratic time trend (t2) to the DataFrame as potential regressors.
Monthly dummy variables are generated and used to remove seasonality if deseasonal is True.
The function determines the appropriate trend model using the ADF test if regression is set to None.
The function drops the additional columns (const, t, t2, and seasonal dummies) before returning the final DataFrame.
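The regression approach in the notes above can be sketched as follows: build the const, t, and t2 regressors plus monthly dummies, fit them on the training period only, and subtract the fitted values from the whole series. The function name detrend_deseasonal_sketch is hypothetical, and np.linalg.lstsq is used here as a plain ordinary least squares solver. This is an assumption-laden illustration rather than the GTBpy code, and it omits the ADF-based model selection.

```python
import numpy as np
import pandas as pd


def detrend_deseasonal_sketch(series, train_cut, regression="ct", deseasonal=True):
    """Subtract a trend (and monthly effects) fitted on the training period."""
    t = np.arange(len(series), dtype=float)
    X = pd.DataFrame({"const": 1.0, "t": t, "t2": t ** 2}, index=series.index)
    cols = {"c": ["const"], "ct": ["const", "t"], "ctt": ["const", "t", "t2"]}[regression]

    if deseasonal:
        # Monthly dummies; January is dropped to avoid collinearity with 'const'.
        months = pd.Series(series.index.month, index=series.index)
        dummies = pd.get_dummies(months, prefix="m", drop_first=True, dtype=float)
        X = X.join(dummies)
        cols = cols + list(dummies.columns)

    # Estimate the coefficients on the training period only.
    train = series.index <= pd.Timestamp(train_cut)
    beta, *_ = np.linalg.lstsq(
        X.loc[train, cols].to_numpy(), series[train].to_numpy(), rcond=None
    )

    # Remove the fitted component from the full series (train and test).
    fitted = X[cols].to_numpy() @ beta
    return pd.Series(series.to_numpy() - fitted, index=series.index, name=series.name)


index = pd.date_range("2004-01-01", periods=120, freq="MS")
pattern = np.array([3, 1, -2, 0, 2, -1, -3, 1, 0, 2, -2, -1], dtype=float)
hpi = pd.Series(
    2.0 + 0.5 * np.arange(120) + pattern[index.month.to_numpy() - 1],
    index=index,
    name="HPI",
)
residual = detrend_deseasonal_sketch(hpi, train_cut="2011-12-01", regression="ct")
print(residual.abs().max())
```

Fitting on the training period and applying the coefficients to the full index keeps information from the test period out of the estimated trend and seasonal effects.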

Examples

>>> df = pd.DataFrame({'HPI': np.random.randn(100).cumsum()}, index=pd.date_range('2000-01-01', periods=100, freq='MS'))
>>> df_detrended = reg_detrend_deseasonal(df, train_cut='2005-01-01', deseasonal=True, detrend=True, regression='ct')
    This example removes both trend and seasonality from the 'HPI' series in `df` using a linear trend model with a constant and linear trend.

GTBpy.google_trends_functions.MA_detrend_deseasonal(df, train_cut, deseasonal, detrend, col=None)[source]

Detrend and/or deseasonalize a DataFrame using moving average decomposition.

Parameters

df (pd.DataFrame): The time series data to be detrended and/or deseasonalized. Each column is treated as a separate time series.
train_cut (str, int, float, or pd.Timestamp): The date or index that defines the end of the training period. If a float between 0 and 1 is provided, it represents a fraction of the total dataset length.
deseasonal (bool): If True, the function removes the seasonal component from the series using moving average decomposition.
detrend (bool): If True, the function removes the trend component from the series using moving average decomposition.
col (list of str, optional): A list of column names in df to be processed. If None, all columns are processed. Default is None.
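The accepted forms of train_cut can be illustrated with a small resolver that maps each form to a timestamp in the index. The helper resolve_train_cut is hypothetical, and the exact rounding used for a fractional cut, as well as the treatment of an integer as a position, are assumptions. The library may resolve these cases differently.

```python
import pandas as pd


def resolve_train_cut(index, train_cut):
    """Map a fraction, integer position, or date-like value to a timestamp."""
    if isinstance(train_cut, float) and 0 < train_cut < 1:
        # Assumption: the fraction counts observations from the start of the index.
        return index[int(len(index) * train_cut) - 1]
    if isinstance(train_cut, int):
        # Assumption: an integer is a position in the index.
        return index[train_cut]
    return pd.Timestamp(train_cut)


index = pd.date_range("2004-01-01", periods=100, freq="MS")
print(resolve_train_cut(index, 0.8))
print(resolve_train_cut(index, 10))
print(resolve_train_cut(index, "2010-01-01"))
```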

Returns

pd.DataFrame: A DataFrame with the detrended and/or deseasonalized series. The trend and seasonal components are removed according to the specified parameters.

Notes

The function uses seasonal_decompose from statsmodels.tsa to perform the moving average decomposition.
The deseasonal parameter removes the seasonal component using the seasonal decomposition from the training set.
The trend component is removed by subtracting the moving average trend calculated from the entire series.
The function creates a temporary column for month (month) to align seasonal adjustments, which is dropped before returning the final DataFrame.

Examples

>>> df = pd.DataFrame({'HPI': np.random.randn(100).cumsum()}, index=pd.date_range('2000-01-01', periods=100, freq='MS'))
>>> df_ma_detrended = MA_detrend_deseasonal(df, train_cut='2005-01-01', deseasonal=True, detrend=True)
    This example removes both trend and seasonality from the 'HPI' series in `df` using moving average decomposition.

GTBpy.google_trends_functions.prepare_gtrends(df, detrend=False, deseasonal=False, method='reg', regression='ct', alpha=0.15, log=False, smooth=12, winsorize_trends=False, train_cut=0.8)[source]

Prepares Google Trends data by applying optional transformations such as detrending, deseasonalizing, smoothing, and logging. When detrend=False and deseasonal=True, there are two ways to proceed. The first is to estimate the seasonal effect directly. The second is to estimate the trend first, subtract it from the series, estimate the seasonal factors from the detrended series, and then subtract those seasonal factors from the original series, so that the returned series keeps the trend and residuals and excludes only the seasonal factors. In this function, the ‘growth’ method uses the first way, and the ‘MA’ and ‘reg’ methods use the second.
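The second way described above (estimate the trend, measure seasonality on the detrended series, then subtract only the seasonal factors from the original series) can be sketched with pandas alone. The function name deseasonalize_keep_trend is hypothetical, and a centered rolling mean stands in for whichever trend estimate the library uses, so this is only an approximation of the idea.

```python
import numpy as np
import pandas as pd


def deseasonalize_keep_trend(series, window=12):
    """Remove monthly seasonal factors while keeping trend and residuals."""
    # Step 1: estimate the trend (stand-in: centered moving average).
    trend = series.rolling(window, center=True, min_periods=window).mean()

    # Step 2: average the detrended series by calendar month.
    detrended = series - trend
    by_month = detrended.groupby(series.index.month).mean()
    by_month = by_month - by_month.mean()  # center the factors around zero

    # Step 3: subtract the seasonal factors from the ORIGINAL series.
    seasonal = pd.Series(
        by_month.reindex(series.index.month).to_numpy(), index=series.index
    )
    return series - seasonal


index = pd.date_range("2004-01-01", periods=120, freq="MS")
pattern = np.array([3, 1, -2, 0, 2, -1, -3, 1, 0, 2, -2, -1], dtype=float)
trend_part = 10.0 + 0.5 * np.arange(120)
queries = pd.Series(trend_part + pattern[index.month.to_numpy() - 1], index=index)

adjusted = deseasonalize_keep_trend(queries)
print(np.abs(adjusted.to_numpy() - trend_part).max())
```

On this synthetic series the adjusted values coincide with the linear trend, which shows that the trend survives and only the monthly pattern is removed.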

Parameters

df (pd.DataFrame): A DataFrame where the first column is the dependent variable (e.g., HPI) and the remaining columns are Google Trends data.
detrend (bool, optional): If True, the function removes the trend component from the Google Trends data. Default is False.
deseasonal (bool, optional): If True, the function removes the seasonal component from the Google Trends data. Default is False.
method (str, optional): The method used for detrending and/or deseasonalizing. Options are ‘reg’ for regression-based, ‘MA’ for moving average, or ‘growth’ for differencing. Default is ‘reg’.
regression (str, optional): The type of regression to use for detrending. Options are ‘c’ (constant), ‘ct’ (constant + trend), or ‘ctt’ (constant + trend + quadratic trend). Used only if method='reg'. Default is ‘ct’.
alpha (float, optional): The significance level for the augmented Dickey-Fuller (ADF) test used in the regression-based method. Default is 0.15.
log (bool, optional): If True, log-transforms the Google Trends data by applying log(X + 1) to avoid taking the logarithm of zero. Default is False.
smooth (int or bool, optional): The smoothing window size. If an integer is provided, it specifies the window size for the seasonal decomposition trend smoothing. If True, a default window size of 12 is used. Default is 12.
winsorize_trends (bool, optional): If True, winsorizes the Google Trends data by applying limits of 0.05 on both ends to reduce the impact of outliers. Default is False.
train_cut (float, str, or pd.Timestamp, optional): The point at which to split the data into training and testing. If a float between 0 and 1 is provided, it represents a fraction of the total dataset length. If a string or timestamp is provided, it is used directly as a cutoff date. Default is 0.8.

Returns

pd.DataFrame: A DataFrame with the dependent variable and the transformed Google Trends data. The exact transformations depend on the provided parameters.

Raises

Exception: If an invalid method is specified.

Notes

The method parameter determines how the detrending and/or deseasonalizing is performed. ‘reg’ uses regression-based methods, ‘MA’ uses moving averages, and ‘growth’ uses differencing.
The smooth parameter can be used to apply additional smoothing using seasonal decomposition.
The log transformation is applied after winsorizing if both are enabled.
The final DataFrame returned is merged on the index of the dependent variable and the transformed Google Trends data, ensuring alignment.
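The ordering noted above (winsorize first, then log(X + 1)) can be illustrated with a short sketch. The function name winsorize_then_log is hypothetical, and clipping at the 5% and 95% quantiles is used as a simple stand-in for a winsorization routine, so values near the limits may differ slightly from what the library produces.

```python
import numpy as np
import pandas as pd


def winsorize_then_log(df, limit=0.05):
    """Clip each column at its lower/upper quantiles, then apply log(X + 1)."""
    lower = df.quantile(limit)
    upper = df.quantile(1 - limit)
    # Stand-in for winsorization: values beyond the quantiles are pulled back to them.
    clipped = df.clip(lower=lower, upper=upper, axis=1)
    # log1p(x) = log(1 + x), so zero search volume maps to zero.
    return np.log1p(clipped)


trends = pd.DataFrame({"query": np.arange(101, dtype=float)})
prepared = winsorize_then_log(trends)
print(prepared["query"].min(), prepared["query"].max())
```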

Examples

>>> df = pd.DataFrame({'HPI': np.random.randn(100).cumsum(), 'trend': np.random.randn(100)}, index=pd.date_range('2000-01-01', periods=100))
>>> prepared_df = prepare_gtrends(df, detrend=True, deseasonal=True, method='MA', smooth=6)
    This example removes trend and seasonality using the moving average method and applies a smoothing window of 6 periods to the 'trend' data.

GTBpy.google_trends_functions.prepare_hpi(df, detrend=False, deseasonal=False, method='reg', regression='ct', alpha=0.15, log=False, train_cut=0.8)[source]

Prepares the Housing Price Index (HPI) data by applying optional transformations such as detrending, deseasonalizing, and logging.

Parameters

df (pd.DataFrame): A DataFrame containing at least a column labeled ‘HPI’, representing the Housing Price Index.
detrend (bool, optional): If True, removes the trend component from the HPI data. Default is False.
deseasonal (bool, optional): If True, removes the seasonal component from the HPI data. Default is False.
method (str, optional): The method used for detrending and/or deseasonalizing. Options are ‘reg’ for regression-based or ‘MA’ for moving average. Default is ‘reg’.
regression (str, optional): The type of regression to use for detrending. Options are ‘c’ (constant), ‘ct’ (constant + trend), or ‘ctt’ (constant + trend + quadratic trend). Used only if method='reg'. Default is ‘ct’.
alpha (float, optional): The significance level for the augmented Dickey-Fuller (ADF) test used in the regression-based method. Default is 0.15.
log (bool, optional): If True, log-transforms the HPI data by applying log(HPI). Default is False.
train_cut (float, str, or pd.Timestamp, optional): The point at which to split the data into training and testing. If a float between 0 and 1 is provided, it represents a fraction of the total dataset length. If a string or timestamp is provided, it is used directly as a cutoff date. Default is 0.8.

Returns

pd.DataFrame: A DataFrame with the transformed HPI data, depending on the selected options for detrending, deseasonalizing, and logging.

Raises

Exception: If an attempt is made to detrend HPI using the Moving Average method.

Notes

The function includes a log transformation option to stabilize variance, which is applied before any other transformations.
Detrending using the Moving Average (MA) method is explicitly disallowed for the HPI, as indicated by an exception.
The ADF test’s significance level (alpha) and type of regression used can significantly impact the detrending process, particularly in borderline cases.

Examples

>>> df = pd.DataFrame({'HPI': np.random.randn(100).cumsum()}, index=pd.date_range('2000-01-01', periods=100))
>>> prepared_df = prepare_hpi(df, detrend=True, deseasonal=True, method='reg', regression='ct')
    This example removes the trend and seasonality from the HPI data using a regression-based method with a constant and trend.

GTBpy.google_trends_functions.compute_weights(criteria, var_select, selection_index=None, coef_select='abs', n=10, power=4, rank_based=False)[source]

Computes weights for a set of criteria, selecting top variables or indices based on the specified selection method.

Parameters

criteria (pd.Series or pd.DataFrame): The data containing the criteria used to compute weights, such as coefficients or cross-validation errors.
var_select (str): The selection method for the variables. Options include ‘EN’ (Elastic Net), ‘coef’ (coefficient), ‘tvalue’, ‘CV’ (cross-validation), or ‘IC’ (information criterion).
selection_index (pd.Index or list, optional): The indices to consider for selection. If None, the entire index of criteria is used. Default is None.
coef_select (str, optional): Method for selecting coefficients. Options are ‘abs’ for absolute values or ‘neg’ for negative values. Only relevant if var_select is ‘EN’, ‘coef’, or ‘tvalue’. Default is ‘abs’.
n (int, optional): The number of top variables or indices to select based on the computed weights. Default is 10.
power (int or float, optional): The exponent used to compute the weights. Higher powers give more weight to larger values (or to smaller values if var_select='CV' or var_select='IC'). Default is 4.
rank_based (bool, optional): If True, weights are computed based on the rank of each criterion rather than its raw value. Default is False.

Returns

weights (pd.Series): A series of computed weights corresponding to the criteria.
top_queries (pd.Index or list): The top n indices or variables based on the computed weights.

Raises

ValueError: If an invalid value for var_select or coef_select is provided.

Notes

The function supports both positive and negative selection of coefficients, depending on the coef_select parameter.
When rank_based is True, the function ranks the criteria before computing the weights, which can be useful for stabilizing the selection process.
The function supports two main types of selection: those based on coefficient magnitudes (‘EN’, ‘coef’, ‘tvalue’) and those based on error metrics (‘CV’, ‘IC’).
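To make the role of power and rank_based more concrete, here is a minimal sketch of one possible power-based weighting scheme: scores are raised to power and normalized to sum to one, and error-type criteria are inverted first so that smaller errors receive larger weights. The function power_weights and its larger_is_better flag are hypothetical. The exact formula used by compute_weights is not stated in this documentation, so the numbers from this sketch should not be expected to match the library's output.

```python
import pandas as pd


def power_weights(criteria, larger_is_better=True, power=4, n=3, rank_based=False):
    """Normalize score**power to weights and return the top-n labels."""
    # Magnitude-type criteria use absolute values; error-type criteria are inverted.
    score = criteria.abs() if larger_is_better else 1.0 / criteria
    if rank_based:
        # Replace raw scores by their ranks (1 = weakest candidate).
        score = score.rank()
    raised = score ** power
    weights = raised / raised.sum()
    top = weights.sort_values(ascending=False).index[:n]
    return weights, top


coefs = pd.Series([0.2, -0.5, 1.0, -0.3, 0.7], index=["A", "B", "C", "D", "E"])
w_coef, top_coef = power_weights(coefs, larger_is_better=True, n=3)

errors = pd.Series([0.8, 0.5, 1.2, 0.9, 0.6], index=["A", "B", "C", "D", "E"])
w_err, top_err = power_weights(errors, larger_is_better=False, n=2, rank_based=True)
print(list(top_coef), list(top_err))
```

Under this scheme a higher power concentrates the weight on the strongest few candidates, while rank-based scoring makes the weights insensitive to the scale of the criteria.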

Examples

>>> criteria = pd.Series([0.2, -0.5, 1.0, -0.3, 0.7], index=['A', 'B', 'C', 'D', 'E'])
>>> weights, top_queries = compute_weights(criteria, var_select='coef', coef_select='abs', n=3)
>>> print(weights)
A    0.061926
B    0.142322
C    0.388067
D    0.100637
E    0.307048
dtype: float64
>>> print(top_queries)
Index(['C', 'E', 'B'], dtype='object')

>>> criteria = pd.Series([0.8, 0.5, 1.2, 0.9, 0.6], index=['A', 'B', 'C', 'D', 'E'])
>>> weights, top_queries = compute_weights(criteria, var_select='CV', rank_based=True, n=2)
>>> print(weights)
A    0.214796
B    0.355939
C    0.131228
D    0.178110
E    0.119927
dtype: float64
>>> print(top_queries)
Index(['B', 'D'], dtype='object')

GTBpy.google_trends_functions.GT_plot(self, df, exog, lag, title=None, winsorize_trend=False, scaled=False)[source]

Generate plots to visualize the relationship between the Housing Price Index (HPI) and a selected exogenous variable.

Parameters

df (pandas.DataFrame): The DataFrame containing the HPI and exogenous variables.
exog (str): The name of the exogenous variable to be plotted and analyzed.
lag (int): The lag to apply to the exogenous variable in the scatter plot and time series plot.
title (str, optional): The title for the overall figure. Default is None.
winsorize_trend (bool, optional): If True, applies winsorization to the exogenous variable to limit the effect of outliers. Default is False.
scaled (bool, optional): If True, standardizes the exogenous variable by removing the mean and scaling to unit variance. Default is False.

Returns

None: Displays a 2x2 grid of plots:
- Top-left: time series plot of HPI.
- Top-right: scatter plot of the lagged exogenous variable vs. HPI.
- Bottom-left: time series plot of the lagged exogenous variable.
- Bottom-right: empty (this subplot is removed).

Notes

The function visualizes the relationship between HPI and a chosen exogenous variable by generating time series and scatter plots.
If winsorize_trend is True, the exogenous variable is adjusted to mitigate the impact of extreme values.
If scaled is True, the exogenous variable is standardized before being plotted.
The bottom-right subplot (ax4) is intentionally removed, leaving three plots in the 2x2 grid.
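The data preparation behind these plots (lagging the exogenous variable and optionally standardizing it) can be sketched without any plotting code. The helper lag_and_scale is hypothetical, and using the population standard deviation (ddof=0) is an assumption about how "unit variance" is computed.

```python
import numpy as np
import pandas as pd


def lag_and_scale(df, exog, lag, scaled=True):
    """Shift an exogenous column by `lag` periods and optionally standardize it."""
    lagged = df[exog].shift(lag)  # the first `lag` observations become NaN
    if scaled:
        lagged = (lagged - lagged.mean()) / lagged.std(ddof=0)
    return lagged


index = pd.date_range("2004-01-01", periods=10, freq="MS")
frame = pd.DataFrame(
    {"HPI": np.linspace(100.0, 109.0, 10), "search": np.arange(10, dtype=float)},
    index=index,
)
lagged_search = lag_and_scale(frame, exog="search", lag=2)
print(lagged_search)
```

Each HPI observation is then paired with the search value from two months earlier, which is the pairing a lagged scatter plot would display.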

Examples

>>> model.GT_plot(df, exog='Google_Trends_Search', lag=2, title='Google Trends vs. HPI', winsorize_trend=True, scaled=True)
    Displays the plots for the relationship between HPI and the Google Trends search data with a lag of 2.

GTBpy.google_trends_functions.plot_gtrends(self, exog, h: int, winsorize_trend=False, scaled=False)[source]

Plot Google Trends data and its transformations over different stages of processing.

Parameters

exog (str or None): The exogenous variable (Google Trends search term) to be plotted. If None, the variable with the highest weight for the specified horizon h is selected.
h (int): The forecast horizon to be used. If h is not in self.selection_h_list, the maximum value from self.selection_h_list is used.
winsorize_trend (bool, optional): If True, the Google Trends data is winsorized. Default is False.
scaled (bool, optional): If True, the Google Trends data is scaled using standardization. Default is False.

Returns

None: Displays a series of plots showing the original Google Trends data, winsorized data, prepared Google Trends data, and the corresponding Housing Price Index (HPI) data.

Notes

This function visualizes the Google Trends data (exog) across different stages: original, after winsorization, after preparation, and after scaling.
The h parameter specifies the forecast horizon, and the function adjusts the plots accordingly.
If exog is not provided, the function automatically selects the best-performing Google Trends term based on predefined criteria.
The vertical red dashed line in each plot indicates the cut-off point between training and validation data.

Examples

>>> model.plot_gtrends(exog='housing_market', h=6, winsorize_trend=True, scaled=True)
    Plots the Google Trends data for 'housing_market' with winsorization and scaling applied at horizon 6.

>>> model.plot_gtrends(exog=None, h=3)
    Automatically selects the best Google Trends term and plots it at horizon 3 without any additional processing.

GTBpy.google_trends_functions.plot_hsi(self, hsi=None)[source]

Plot the selected Housing Search Index (HSI) for different forecast horizons.

Parameters

hsi (str, optional): The specific HSI to be plotted. If None, the HSI with the highest weight for each forecast horizon (h) is selected. Default is None.

Returns

None: Displays the plots for the selected HSIs across the specified forecast horizons in self.selection_h_list.

Notes

This method generates a plot for each forecast horizon (h) in reverse order (starting from the longest horizon).
If no specific hsi is provided, the method will automatically select and plot the HSI with the maximum weight for each h from self.criteria_dfs.
The plots are created using the GT_plot function, which visualizes the relationship between the HPI and the selected HSI.

Examples

>>> model.plot_hsi(hsi='Sentiment_Index')
    Plots the specified 'Sentiment_Index' for all horizons in the selection list.

>>> model.plot_hsi()
    Automatically selects and plots the HSI with the maximum weight for each horizon in the selection list.

GTBpy.google_trends_functions.plot_improvement(obj_list, measure='MAFE', h_list=None, colors=None, labels=None, figsize=(15, 9))[source]

Plot the Kernel Density Estimate (KDE) of improvement measures (MAFE/MSFE) for different models across forecast horizons.

Parameters

obj_list (list): A list of objects containing the results of the models to be compared. Each object should have MAFE_val_df_improve and MSFE_val_df_improve attributes.
measure (str, optional): The performance measure to be plotted. Can be ‘MAFE’ (Mean Absolute Forecast Error) or ‘MSFE’ (Mean Squared Forecast Error). Default is ‘MAFE’.
h_list (list, optional): The list of forecast horizons to be considered. If None, the horizons from the first object in obj_list are used.
colors (list, optional): List of colors to be used for plotting different models. If None, a default list of colors is used.
labels (list, optional): List of labels for the models in obj_list. If None, default labels (‘G1’, ‘G2’, …) are generated.
figsize (tuple, optional): Size of the figure. Default is (15, 9).

Returns

None: Displays the KDE plots for the selected measure across the specified forecast horizons.

Notes

This function plots the density of the percentage improvement in MAFE/MSFE for different models over multiple forecast horizons (h).
A vertical line is added to each plot to mark the performance of the Universal Housing Sentiment Index (UHSI) for the corresponding forecast horizon.
If h_list is not provided, the forecast horizons from the first model in obj_list are used by default.

Examples

>>> plot_improvement(models_list, measure='MAFE', h_list=[1, 3, 6, 12], colors=['blue', 'orange'], labels=['Model 1', 'Model 2'])
    Plots the MAFE improvement KDE for the specified models and forecast horizons.

GTBpy.google_trends_functions.plot_forecast(self, exog)[source]

Plot the forecasted vs actual Housing Price Index (HPI) values over different forecast horizons.

Parameters

exog (str or list, optional): The exogenous variable(s) used in the forecasting models. If None, the best-performing exogenous variable for each horizon is selected based on the criteria in self.forecast_criteria_df.

Returns

None: Displays the forecast plots for the actual and predicted HPI across different forecast horizons.

Notes

The function generates subplots for each forecast horizon h in self.h_list, displaying the actual and predicted HPI.
If exog is a list, the corresponding element is used as the exogenous variable for each horizon. If exog is a single string, it is used for all horizons.
The vertical red dashed line marks the cut-off point between the training and validation datasets.

Examples

>>> model.plot_forecast(exog=['UHSI_1', 'UHSI_3', 'UHSI_6'])
    Plots the forecasts using the specified exogenous variables for each horizon.

>>> model.plot_forecast(exog='UHSI_3')
    Plots the forecasts using 'UHSI_3' as the exogenous variable for all horizons.

class GTBpy.google_trends_functions.GoogleTrend(df, train_cut=0.8, verbose=False, seed=None, pca_lag=True)[source]

Bases: object

A class for processing and modeling Google Trends data to predict housing prices. This class provides tools for preparing, transforming, and modeling time series data using methods such as detrending, deseasonalizing, principal component analysis (PCA), and lagged variable selection.

Parameters

df (pd.DataFrame): A DataFrame containing the initial dataset, including the Housing Price Index (HPI) and Google Trends data.
train_cut (float, int, or str, optional): Defines the end of the training period. If a float between 0 and 1 is provided, it represents the fraction of the dataset used for training. Default is 0.8.
verbose (bool, optional): If True, enables detailed output for logging purposes. Default is False.
seed (int or None, optional): Seed for random number generation, used to ensure reproducibility. Default is None.
pca_lag (bool, optional): If True, PCA with lagged variables is used during feature extraction. Default is True.

Attributes

df (pd.DataFrame): A copy of the initial dataset provided by the user.
train_cut (pd.Timestamp): Timestamp or index defining the end of the training period.
verbose (bool): Verbosity level for logging purposes.
seed (int or None): Seed for random number generation.
pca_lag (bool): Specifies whether PCA with lagged variables should be used.
gtrends_df (pd.DataFrame): Transformed Google Trends data.
hpi_df (pd.DataFrame): Transformed Housing Price Index (HPI) data.
hsi_df (pd.DataFrame): Final set of features generated after processing HPI and Google Trends data.
hsi_dfs (pd.DataFrame): Collection of transformed features including multiple PCA layers.
criteria_dfs (pd.DataFrame): DataFrame containing computed criteria values for feature selection.
criteria_dfs_expanded (pd.DataFrame): DataFrame containing expanded versions of computed criteria for feature selection.
forecast_hsi (pd.DataFrame): The final set of features used for forecasting.
MAFE_val_df, MSFE_val_df, forecast_criteria_df, etc.: DataFrames and metrics representing model performance during the forecast.

Methods

prepare_gtrends(detrend=True, deseasonal=True, …): Prepares Google Trends data by applying optional transformations such as detrending and deseasonalizing.
prepare_hpi(detrend=False, deseasonal=True, …): Prepares the HPI data by applying optional transformations such as detrending and deseasonalizing.
compute_hsi(layer=1, …): Extracts key features from the dataset using PCA, lagged variables, and selection criteria.
lag_setting(lag_select='IC', …): Sets the lag selection criteria and related parameters for the model.
forecast(h_list=[1,3,6,12], …): Forecasts the HPI using prepared Google Trends data and selected lags.
results(results_table=False, …): Displays various results from the model, including performance metrics.
plot_gtrends(exog=None, …): Plots Google Trends data used in the model.
plot_hsi(hsi=None): Plots HSI data used in the model.
plot_improvement(measure='MAFE', …): Plots improvements in forecasting accuracy based on specified performance measures.
plot_forecast(exog=None): Plots the forecasted HPI based on the selected model features.

Examples

>>> GT_object = GoogleTrend(df, '2019-12-31', verbose=1, seed=1, pca_lag=True)
>>> GT_object.prepare_gtrends(detrend=True, deseasonal=True, method='reg', regression='ct', log=False, smooth=12)
>>> GT_object.prepare_hpi(detrend=True, deseasonal=True, method='reg', regression='ct', log=True)
>>> GT_object.compute_hsi(layer=1, input_pool=True, auto_layer=False, n_input=10, n_hsi=20, max_lag=3, var_select='CV', random='rnd', cv=5, shuffle=True, n_iter=20, selection_h_list=[0,1,3,6,12])
>>> GT_object.compute_hsi(layer=2, input_pool=True, auto_layer=True, n_input=10, n_hsi=20, max_lag=1, var_select='CV', random='rnd', cv=5, shuffle=True, n_iter=20, selection_h_list=[0,1,3,6,12])
>>> GT_object.lag_setting(y_lags='Auto', exog_lags='Auto', max_lag=3, lag_select='IC')
>>> GT_object.forecast(h_list=[1,3,6,12], seasonal=True, hsi_CV_select=True, fit_intercept=True, cv=5, n_iter=20, original_scale=False)
>>> GT_object.results(MAFE_val_df=True, MAFE_val_df_improve=True, forecast_criteria_df=True, lags_df=True, head=True)

prepare_gtrends(detrend=True, deseasonal=True, method='reg', regression='ct', alpha=0.15, log=False, smooth=False, winsorize=False)[source]

Prepares Google Trends data by applying optional transformations such as detrending, deseasonalizing, smoothing, and logging.

Parameters

detrend (bool, optional): If True, removes the trend component from the Google Trends data. Default is True.
deseasonal (bool, optional): If True, removes the seasonal component from the Google Trends data. Default is True.
method (str, optional): The method used for detrending and/or deseasonalizing. Options are ‘reg’ for regression-based, ‘MA’ for moving average, or ‘growth’ for differencing. Default is ‘reg’.
regression (str, optional): The type of regression to use for detrending. Options are ‘c’ (constant), ‘ct’ (constant + trend), or ‘ctt’ (constant + trend + quadratic trend). Used only if method='reg'. Default is ‘ct’.
alpha (float, optional): The significance level for the augmented Dickey-Fuller (ADF) test used in the regression-based method. Default is 0.15.
log (bool, optional): If True, log-transforms the Google Trends data by applying log(X + 1) to stabilize variance. Default is False.
smooth (bool or int, optional): If True, applies smoothing using a default window of 12. If an integer is provided, it specifies the window size for smoothing. Default is False.
winsorize (bool, optional): If True, applies winsorization to the Google Trends data to reduce the effect of outliers. Default is False.

Returns

None: Updates the gtrends_df attribute with the prepared Google Trends data after applying the specified transformations.

Notes

This method prepares the Google Trends data for further modeling by optionally removing trend and seasonal components, smoothing, and log-transforming the data.
The method parameter determines the approach for detrending and deseasonalizing, allowing flexibility in how these transformations are applied.
The transformations help stabilize the data, improve stationarity, and reduce the impact of outliers, which can enhance the performance of downstream models.

prepare_hpi(detrend=False, deseasonal=True, method='reg', regression='ct', alpha=0.15, log=False)[source]

Prepares the Housing Price Index (HPI) data by applying optional transformations such as detrending, deseasonalizing, and logging.

Parameters

detrend (bool, optional): If True, removes the trend component from the HPI data. Default is False.
deseasonal (bool, optional): If True, removes the seasonal component from the HPI data. Default is True.
method (str, optional): The method used for detrending and/or deseasonalizing. Options are ‘reg’ for regression-based or ‘MA’ for moving average. Default is ‘reg’.
regression (str, optional): The type of regression to use for detrending. Options are ‘c’ (constant), ‘ct’ (constant + trend), or ‘ctt’ (constant + trend + quadratic trend). Used only if method='reg'. Default is ‘ct’.
alpha (float, optional): The significance level for the augmented Dickey-Fuller (ADF) test used in the regression-based method. Default is 0.15.
log (bool, optional): If True, log-transforms the HPI data by applying log(HPI). Default is False.

Returns

None: Updates the hpi_df attribute with the prepared HPI data after applying the specified transformations.

Notes

This method prepares the Housing Price Index (HPI) data for further modeling by optionally removing trend and seasonal components and applying a log transformation.
The method parameter determines how the detrending and deseasonalizing are applied, offering flexibility in transformations.
The transformations help in stabilizing the variance, improving stationarity, and making the data suitable for time series modeling and forecasting.

compute_hsi(layer=1, input_pool=True, auto_layer=False, n_input=10, n_hsi=20, max_lag=3, var_select='CV', random=False, cv=3, shuffle=True, n_iter=20, alphas=None, l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1], selection_h_list=[0, 1, 3, 6, 12], coef_select='abs', criteria_dis=False)[source]

Extracts key features from the dataset using Principal Component Analysis (PCA), lagged variables, and feature selection criteria.

Parameters

layer : int, optional: The current layer of PCA and feature extraction. Default is 1.
input_pool : bool, optional: If True, the full input feature pool is used during the feature extraction. Default is True.
auto_layer : bool, optional: If True, additional layers are added automatically until no improvement is observed. Default is False.
n_input : int, optional: Number of input features to select. Default is 10.
n_hsi : int, optional: Number of HSI (Housing Sentiment Index) features to compute. Default is 20.
max_lag : int, optional: Maximum lag to consider for lagged features. Default is 3.
var_select : str, optional: The method used for variable selection. Options include ‘CV’, ‘EN’, ‘coef’, or ‘tvalue’. Default is ‘CV’.
random : bool or str, optional: If True or set to ‘intc’ or ‘rnd’, a randomized selection process is used. Default is False.
cv : int, optional: Number of cross-validation folds. Default is 3.
shuffle : bool, optional: If True, shuffle the data during cross-validation. Default is True.
n_iter : int, optional: Number of iterations for randomized search or selection. Default is 20.
alphas : array-like or None, optional: Array of alpha values for ElasticNetCV. If None, np.logspace(-3, 0, 100) is used. Default is None.
l1_ratio : list, optional: List of L1 ratios for ElasticNetCV. Default is [.1, .5, .7, .9, .95, .99, 1].
selection_h_list : list of int, optional: List of lag periods to consider for selection. Default is [0, 1, 3, 6, 12].
coef_select : str, optional: Method for selecting coefficients (‘abs’ for absolute values or ‘neg’ for negative values). Default is ‘abs’.
criteria_dis : bool, optional: If True, displays criteria DataFrames used in feature selection. Default is False.

Returns

None: Updates the attributes hsi_df, hsi_dfs, X_scaled_lagged_full, criteria_dfs, and forecast_hsi with the newly computed features and criteria.

Notes

This method uses PCA for dimensionality reduction and lagged variable selection to generate new features.
Feature selection is based on different criteria, including ElasticNet coefficients, cross-validation metrics, and other statistics.
The process can include both deterministic and randomized approaches to ensure robustness.
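
As a rough illustration of combining lagged variables with PCA, the sketch below standardizes a set of query series, appends lags 0 through max_lag, and extracts the leading principal components with a singular value decomposition. The function name lagged_pca, the column names, and the synthetic data are hypothetical. The selection criteria (ElasticNet, cross-validation) are not reproduced here, and the library's actual procedure may differ.

```python
import numpy as np
import pandas as pd

def lagged_pca(X, max_lag=3, n_components=2):
    """Hypothetical helper: PCA scores of standardized features and their lags."""
    Z = (X - X.mean()) / X.std(ddof=0)        # standardize each column
    # Stack lag 0..max_lag copies side by side; rows with missing lags are dropped.
    lagged = pd.concat(
        {lag: Z.shift(lag) for lag in range(max_lag + 1)}, axis=1
    ).dropna()
    A = lagged.to_numpy()
    A = A - A.mean(axis=0)                    # re-center after dropping rows
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)     # share of variance per component
    cols = [f"pc_{i + 1}" for i in range(n_components)]
    return pd.DataFrame(scores, index=lagged.index, columns=cols), explained[:n_components]

# Five noisy copies of one common random-walk factor.
rng = np.random.default_rng(0)
common = rng.normal(size=200).cumsum()
X = pd.DataFrame(
    {f"query_{i}": common + rng.normal(scale=0.5, size=200) for i in range(5)},
    index=pd.date_range("2004-01-01", periods=200, freq="MS"),
)
components, explained = lagged_pca(X)
```

The resulting component scores are mutually uncorrelated, which is what makes them convenient as compact regressors in a later forecasting step.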

lag_setting(y_lags='Auto', exog_lags='Auto', max_lag=3, lag_select='IC', fit_intercept=False, cv=5, shuffle=True, n_iter=20, var_order='cross')[source]

Sets the lag selection criteria and related parameters for the model.

Parameters

lag_select : str, optional: The method used for selecting the lag order. Options include ‘IC’ (information criterion), ‘CV’ (cross-validation), etc. Default is ‘IC’.
fit_intercept : bool, optional: If True, fits an intercept in the lag model. Default is False.
cv : int, optional: Number of cross-validation folds to use when selecting lags. Default is 5.
shuffle : bool, optional: If True, shuffle the data during cross-validation. Default is True.
n_iter : int, optional: Number of iterations to use during randomized cross-validation or lag selection. Default is 20.
var_order : str, optional: The order in which variables are considered for lag selection. Options include ‘cross’ (cross-sectional ordering) and others. Default is ‘cross’.
max_lag : int, optional: The maximum lag to consider for the model. Default is 3.
y_lags : str or int, optional: The number of lags for the target variable (y). If ‘Auto’, the lag length is automatically determined. Default is ‘Auto’.
exog_lags : str or int, optional: The number of lags for the exogenous variables (exog). If ‘Auto’, the lag length is automatically determined. Default is ‘Auto’.

Returns

None: Updates the lag-related attributes of the instance, such as lag_select, lag_fit_intercept, lag_cv, and others.

Notes

This method sets the parameters for the lag selection process, which determines the temporal dependencies in the model.
The selection can be done using either information criteria or cross-validation approaches.
The var_order parameter controls the order in which variables are processed for lag determination, which can impact model performance.
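
To make information-criterion lag selection concrete, here is a minimal sketch for a purely autoregressive model. It is an assumption-laden illustration rather than the library's routine: the function name select_ar_lag is hypothetical, the criterion is the BIC, and exogenous variables and the var_order logic are left out. Every candidate lag order is fitted on the same estimation sample so that the criteria are comparable.

```python
import numpy as np

def select_ar_lag(y, max_lag=3, fit_intercept=False):
    """Hypothetical helper: choose the AR order in 1..max_lag that minimizes BIC."""
    y = np.asarray(y, dtype=float)
    n_eff = len(y) - max_lag                  # common sample for every candidate
    target = y[max_lag:]
    bic = {}
    for p in range(1, max_lag + 1):
        # Column k holds y lagged by k periods, aligned with the target.
        cols = [y[max_lag - k : len(y) - k] for k in range(1, p + 1)]
        if fit_intercept:
            cols.append(np.ones(n_eff))
        X = np.column_stack(cols)
        beta, *_ = np.linalg.lstsq(X, target, rcond=None)
        resid = target - X @ beta
        sigma2 = resid @ resid / n_eff
        bic[p] = n_eff * np.log(sigma2) + X.shape[1] * np.log(n_eff)
    best = min(bic, key=bic.get)
    return best, bic

# Simulated AR(2) process; lag 1 alone should fit clearly worse than lag 2.
rng = np.random.default_rng(1)
y = np.zeros(500)
for t in range(2, 500):
    y[t] = 0.5 * y[t - 1] + 0.3 * y[t - 2] + rng.normal()
best_lag, bic = select_ar_lag(y)
```

A cross-validation variant would replace the BIC with an out-of-sample error averaged over folds, at the cost of more model fits.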

forecast(h_list=[1, 3, 6, 12], seasonal=False, hsi_CV_select=False, fit_intercept=False, cv=4, n_iter=100, original_scale=True, sort_df='criteria', sort_col=-1)[source]

Forecasts the Housing Price Index (HPI) using prepared Google Trends data and selected lag features.

Parameters

h_list : list of int, optional: A list of forecast horizons for which the predictions will be made. Default is [1, 3, 6, 12].
seasonal : bool, optional: If True, includes seasonal components in the model during forecasting. Default is False.
hsi_CV_select : bool, optional: If True, uses cross-validation for selecting the Housing Sentiment Index (HSI) features. Default is False.
fit_intercept : bool, optional: If True, fits an intercept term in the forecasting model. Default is False.
cv : int, optional: Number of cross-validation folds for model validation. Default is 4.
n_iter : int, optional: Number of iterations to use during cross-validation or parameter tuning. Default is 100.
original_scale : bool, optional: If True, the forecasted values are converted back to their original scale. Default is True.
sort_df : str, optional: The method used to sort the features for the forecast model. Options include ‘criteria’. Default is ‘criteria’.
sort_col : int, optional: The column index to use for sorting when selecting features. Default is -1 (the last column).

Returns

None: Updates the attributes with forecast results, including various performance metrics and forecast data.

Attributes Updated

MAFE_val_df : pd.DataFrame: Mean Absolute Forecasting Error (MAFE) values for the validation set.
MAFE_val_df_improve : pd.DataFrame: Improvement in MAFE values compared to a benchmark.
MSFE_val_df : pd.DataFrame: Mean Squared Forecasting Error (MSFE) values for the validation set.
MSFE_val_df_improve : pd.DataFrame: Improvement in MSFE values compared to a benchmark.
forecast_criteria_df : pd.DataFrame: DataFrame containing criteria values used for selecting features during forecasting.
lags_df : pd.DataFrame: DataFrame representing the lags used for each feature in the model.
MAFE_train_df : pd.DataFrame: MAFE values for the training set.
MSFE_train_df : pd.DataFrame: MSFE values for the training set.
forecast_dict : dict: Dictionary containing forecasted values for different horizons.
hsi : pd.DataFrame: DataFrame containing the forecasted HPI values.
results_table : pd.DataFrame: Table summarizing the forecast results, including performance metrics.

Notes

This method forecasts the HPI using a combination of Google Trends data and selected lagged features.
Various forecast horizons can be specified using the h_list parameter to produce forecasts for different time periods.
The method can optionally fit an intercept term and use cross-validation to select the most important features.
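
The error measures named above are simple to compute by hand, which can help when checking the reported tables. The sketch below is illustrative only: the function names are hypothetical, the numbers are made up, and "improvement" is assumed to mean the percentage reduction in error relative to the benchmark, which this documentation does not state explicitly.

```python
import numpy as np

def mafe(actual, forecast):
    """Mean Absolute Forecasting Error."""
    return float(np.mean(np.abs(np.asarray(actual) - np.asarray(forecast))))

def msfe(actual, forecast):
    """Mean Squared Forecasting Error."""
    return float(np.mean((np.asarray(actual) - np.asarray(forecast)) ** 2))

def improvement(model_error, benchmark_error):
    """Percentage reduction in error versus the benchmark (assumed definition)."""
    return 100.0 * (benchmark_error - model_error) / benchmark_error

# Made-up values for one forecast horizon.
actual = np.array([100.0, 102.0, 101.0, 105.0])
benchmark = np.array([99.0, 100.0, 103.0, 102.0])   # e.g. a model without search data
model = np.array([100.5, 101.0, 101.5, 104.0])      # e.g. a model with an HSI regressor

mafe_gain = improvement(mafe(actual, model), mafe(actual, benchmark))
msfe_gain = improvement(msfe(actual, model), msfe(actual, benchmark))
```

Under this definition a positive value means the model beats the benchmark, and MSFE-based gains react more strongly to a few large errors than MAFE-based gains.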

results(results_table=False, MAFE_val_df=False, MAFE_val_df_improve=False, MSFE_val_df=False, MSFE_val_df_improve=False, forecast_criteria_df=False, lags_df=False, MAFE_train_df=False, head=10)[source]

Displays various results from the model, including performance metrics and forecast data.

Parameters

results_table : bool, optional: If True, displays the full results table summarizing the forecast performance. Default is False.
MAFE_val_df : bool, optional: If True, displays the Mean Absolute Forecasting Error (MAFE) values for the validation set. Default is False.
MAFE_val_df_improve : bool, optional: If True, displays the improvement in MAFE values compared to a benchmark. Default is False.
MSFE_val_df : bool, optional: If True, displays the Mean Squared Forecasting Error (MSFE) values for the validation set. Default is False.
MSFE_val_df_improve : bool, optional: If True, displays the improvement in MSFE values compared to a benchmark. Default is False.
forecast_criteria_df : bool, optional: If True, displays the DataFrame containing criteria values used for selecting features during forecasting. Default is False.
lags_df : bool, optional: If True, displays the DataFrame representing the lags used for each feature in the model. Default is False.
MAFE_train_df : bool, optional: If True, displays the MAFE values for the training set. Default is False.
head : int, optional: Number of rows to show for each displayed DataFrame. Default is 10.

Returns

None: Displays the specified results based on the provided parameters.

Notes

This method allows the user to access different performance metrics and forecast-related data.
By selecting the appropriate parameters, users can view specific tables and metrics that summarize the model’s forecasting performance.
The head parameter controls the number of rows to display for DataFrames to avoid overwhelming output.

plot_gtrends(exog=None, h=12, winsorize_trend=False, scaled=False)[source]

Plots the Google Trends data used in the model.

Parameters

exog : list of str or None, optional: List of exogenous variables (Google Trends features) to plot. If None, all available features are plotted. Default is None.
h : int, optional: The forecast horizon for which the trends are plotted. Default is 12.
winsorize_trend : bool, optional: If True, plots the Google Trends data after winsorization, which reduces the effect of outliers. Default is False.
scaled : bool, optional: If True, plots the scaled version of Google Trends data, allowing for comparison across different features. Default is False.

Returns

None: Displays the plots for the specified Google Trends features.

Notes

This method provides a visual representation of Google Trends data used in the model, which helps in understanding the temporal patterns in the data.
Users can optionally plot specific features by providing a list in the exog parameter.
The winsorize_trend parameter allows for a clearer view by minimizing the impact of extreme values.
The scaled parameter enables visualization of standardized features, useful for comparing different variables on the same scale.
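
For readers unfamiliar with the two transformations, the sketch below shows one common way to winsorize and standardize a series. The function names and the 5th/95th percentile cutoffs are assumptions chosen for illustration. The cutoffs and scaler used by the library are not documented here.

```python
import numpy as np

def winsorize(x, lower=0.05, upper=0.95):
    """Hypothetical helper: clip values outside the given quantiles."""
    x = np.asarray(x, dtype=float)
    lo, hi = np.quantile(x, [lower, upper])
    return np.clip(x, lo, hi)

def standardize(x):
    """Rescale to zero mean and unit (population) standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# A search-interest series with one extreme spike.
series = np.array([10, 12, 11, 13, 100, 12, 11, 10, 12, 13], dtype=float)
clipped = winsorize(series)     # the spike is pulled down to the 95th percentile
scaled = standardize(clipped)   # comparable with other standardized series
```

Winsorizing before scaling keeps a single spike from dominating the standard deviation, so the remaining variation stays visible in the plot.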

plot_hsi(hsi=None)[source]

Plots the Housing Sentiment Index (HSI) data used in the model, allowing for visualization of the extracted features.

Parameters

hsi : list of str or None, optional: List of HSI features to plot. If None, all available HSI features are plotted. Default is None.

Returns

None: Displays the plots for the specified HSI features.

Notes

This method provides a visual representation of the HSI features that have been extracted using PCA and other transformations.
The hsi parameter allows users to specify particular HSI features to visualize, or plot all available features if set to None.
The plot is useful for analyzing the derived sentiment features and understanding their temporal patterns.

plot_improvement(measure='MAFE', h_list=None, colors=None, labels=None, figsize=(15, 9))[source]

Plots the improvement in forecasting accuracy for different horizons, using a specified performance measure.

Parameters

measure : str, optional: The performance measure to visualize. Options include ‘MAFE’ (Mean Absolute Forecasting Error) and ‘MSFE’ (Mean Squared Forecasting Error). Default is ‘MAFE’.
h_list : list of int or None, optional: A list of forecast horizons to include in the plot. If None, all available horizons are included. Default is None.
colors : list of str or None, optional: List of colors to use for the plot lines. If None, default colors are used. Default is None.
labels : list of str or None, optional: List of labels for each forecast horizon in the plot. If None, default labels are used. Default is None.
figsize : tuple of int, optional: Figure size for the plot. Default is (15, 9).

Returns

None: Displays a plot showing the improvement in forecasting accuracy for the specified measure and horizons.

Notes

This method helps visualize the improvement in forecasting performance for different horizons.
The measure parameter allows the user to specify whether to visualize Mean Absolute Forecasting Error (MAFE) or Mean Squared Forecasting Error (MSFE).
Users can customize the appearance of the plot using the colors, labels, and figsize parameters.
The h_list parameter allows for selecting specific forecast horizons to analyze, which can be useful for evaluating the model’s performance over varying time periods.

plot_forecast(exog=None)[source]

Plots the forecasted Housing Price Index (HPI) values along with the actual values for comparison.

Parameters

exog : list of str or None, optional: List of exogenous variables (features) to include in the forecast plot. If None, all available features are included. Default is None.

Returns

None: Displays the forecast plot showing the predicted HPI values and the actual values.

Notes

This method provides a visual comparison between the forecasted HPI values and the actual values, allowing users to assess the model’s predictive performance.
The exog parameter allows users to include specific exogenous features in the plot for further analysis of their impact on the forecast.
The plot is useful for evaluating the quality of the model’s predictions over different time horizons.