google_trends_functions module
Functions used to compute the HSI from Google Trends data.
- GTBpy.google_trends_functions.trend_loader(hpi, folder_path, RE_companies=None, drop_companies=True, title_offset=10)[source]
Loads Google Trends data stored as CSV files in a folder and returns four dataframes.
Parameters
- hpiPandas dataframe of shape (n_samples, 1)
Housing Price Index (HPI) data.
- folder_pathstr
Path to the folder in which the Google Trends CSV files are stored.
- RE_companieslist of strs, default = None
Names of the real estate companies whose searches are aggregated into a company search index.
Returns
- df : pandas dataframe including HPI and all search queries
- df_full : pandas dataframe including HPI and search queries that have real value since 2004-01-01
- df_full_nonzero : pandas dataframe including HPI and search queries that have real value since 2004-01-01 and do not have values equal to zero
- df_part : pandas dataframe including HPI and search queries that are not in df_full
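The four return values can be pictured with a minimal sketch. It is not GTBpy's code: the helper `split_by_coverage`, the column names, and the reading of "real value" as "not missing" are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def split_by_coverage(df, target="HPI", start="2004-01-01"):
    """Split query columns by how completely they cover the sample from `start`."""
    window = df.loc[start:]
    queries = [c for c in df.columns if c != target]
    # Queries observed in every period from `start` onward (assumed meaning of "real value")
    full = [c for c in queries if window[c].notna().all()]
    # Of those, queries that never take the value zero
    full_nonzero = [c for c in full if (window[c] != 0).all()]
    # Remaining queries, which have at least one missing observation
    part = [c for c in queries if c not in full]
    return df, df[[target] + full], df[[target] + full_nonzero], df[[target] + part]

# Synthetic monthly data; the query names are made up for illustration
index = pd.date_range("2004-01-01", periods=6, freq="MS")
raw = pd.DataFrame(
    {
        "HPI": [100.0, 101.0, 102.0, 103.0, 104.0, 105.0],
        "mortgage": [50, 52, 51, 55, 60, 58],
        "realtor": [0, 3, 4, 0, 5, 6],
        "zillow": [np.nan, np.nan, 10, 12, 15, 20],
    },
    index=index,
)
df, df_full, df_full_nonzero, df_part = split_by_coverage(raw)
print(list(df_full.columns))
print(list(df_full_nonzero.columns))
print(list(df_part.columns))
```

In this toy data, `realtor` is complete but contains zeros, so it appears in `df_full` only, while `zillow` starts late and ends up in `df_part`.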
- GTBpy.google_trends_functions.reg_trend_selector(series, train_cut, alpha, recursive)[source]
Determine the appropriate trend model for a time series based on the Augmented Dickey-Fuller (ADF) test.
Parameters
- seriespd.Series
The time series data to be analyzed.
- train_cutstr or pd.Timestamp
The date or index that defines the end of the training period.
- alphafloat
The significance level used to assess the p-value from the ADF test.
- recursivebool
If True, the function will recursively difference the series until stationarity is achieved, based on the ADF test.
Returns
- seriespd.Series
The original or differenced time series, depending on the ADF test results.
- trend_colslist of str
A list of columns corresponding to the appropriate trend model, which may include:
  - ‘const’ : Constant term
  - ‘t’ : Linear time trend
  - ‘t2’ : Quadratic time trend
Notes
The function performs the ADF test on the training period of the series with different trend models (‘c’, ‘ct’, and ‘ctt’).
If the null hypothesis of the ADF test (that the series has a unit root) is not rejected at the given alpha level, the function recursively differences the series if recursive is set to True.
The trend model is chosen based on the significance of the ADF test with different trend components.
Examples
>>> series = pd.Series(np.random.randn(100).cumsum(), index=pd.date_range('2000-01-01', periods=100))
>>> series, trend_cols = reg_trend_selector(series, train_cut='2000-12-31', alpha=0.05, recursive=True)
The function returns the appropriately differenced series and the trend model to be used.
- GTBpy.google_trends_functions.reg_detrend_deseasonal(df, train_cut, deseasonal, detrend, regression='ct', alpha=0.1, recursive=True, col=None)[source]
Detrend and/or deseasonalize a DataFrame using linear regression.
Parameters
- dfpd.DataFrame
The time series data to be detrended and/or deseasonalized. Each column is treated as a separate time series.
- train_cutstr, int, float, or pd.Timestamp
The date or index that defines the end of the training period. If a float between 0 and 1 is provided, it represents a fraction of the total dataset length.
- deseasonalbool
If True, the function removes the seasonal component from the series using monthly dummies.
- detrendbool
If True, the function removes the trend component from the series based on the specified regression model.
- regression{‘c’, ‘ct’, ‘ctt’}, optional
The type of trend model to be used:
  - ‘c’ : Constant term only.
  - ‘ct’ : Constant and linear trend (default).
  - ‘ctt’ : Constant, linear, and quadratic trend.
If None, the function automatically selects the trend model based on the Augmented Dickey-Fuller (ADF) test.
- alphafloat, optional
The significance level used for the ADF test when automatically selecting the trend model. Default is 0.1.
- recursivebool, optional
If True, the function recursively differences the series until stationarity is achieved, based on the ADF test.
- collist of str, optional
A list of column names in df to be processed. If None, all columns are processed. Default is None.
Returns
- pd.DataFrame
A DataFrame with the detrended and/or deseasonalized series. The trend and seasonal components are removed according to the specified parameters.
Notes
The function adds a constant (const), linear time trend (t), and quadratic time trend (t2) to the DataFrame as potential regressors.
Monthly dummy variables are generated and used to remove seasonality if deseasonal is True.
The function determines the appropriate trend model using the ADF test if regression is set to None.
The function drops the additional columns (const, t, t2, and seasonal dummies) before returning the final DataFrame.
Examples
>>> df = pd.DataFrame({'HPI': np.random.randn(100).cumsum()}, index=pd.date_range('2000-01-01', periods=100))
>>> df_detrended = reg_detrend_deseasonal(df, train_cut='2005-01-01', deseasonal=True, detrend=True, regression='ct')
This example removes both trend and seasonality from the 'HPI' series in `df` using a linear trend model with a constant and linear trend.
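The regression approach described in the notes can be sketched with plain least squares. This is an illustration under stated assumptions, not the library's implementation: the helper `reg_adjust` is hypothetical, it handles only the ‘ct’ model, and it assumes that coefficients are estimated on the training period and then applied to the whole series.

```python
import numpy as np
import pandas as pd

def reg_adjust(series, train_end, detrend=True, deseasonal=True):
    """Remove a linear trend and monthly seasonality estimated by least squares."""
    X = pd.DataFrame(index=series.index)
    X["const"] = 1.0
    X["t"] = np.arange(len(series), dtype=float)
    # Eleven monthly dummies; the first month is the baseline absorbed by const
    month = pd.Series(series.index.month, index=series.index)
    dummies = pd.get_dummies(month, prefix="m", drop_first=True, dtype=float)
    X = X.join(dummies)

    # Fit on the training period only (assumption), apply to every observation
    train = X.index <= pd.Timestamp(train_end)
    beta, *_ = np.linalg.lstsq(X[train].to_numpy(), series[train].to_numpy(), rcond=None)
    beta = pd.Series(beta, index=X.columns)

    out = series.astype(float).copy()
    if detrend:
        out = out - X[["const", "t"]] @ beta[["const", "t"]]
    if deseasonal:
        season_cols = list(dummies.columns)
        out = out - X[season_cols] @ beta[season_cols]
    return out

# Synthetic series: linear trend plus a December effect, with no noise
index = pd.date_range("2010-01-01", periods=48, freq="MS")
t = np.arange(48)
december = (index.month == 12).astype(float)
hpi = pd.Series(2.0 + 0.5 * t + 3.0 * december, index=index)
adjusted = reg_adjust(hpi, train_end="2012-12-01")
print(adjusted.abs().max())
```

Because the synthetic series is exactly linear plus a monthly effect, the adjusted series is zero up to rounding, including in the periods after the training cut.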
- GTBpy.google_trends_functions.MA_detrend_deseasonal(df, train_cut, deseasonal, detrend, col=None)[source]
Detrend and/or deseasonalize a DataFrame using moving average decomposition.
Parameters
- dfpd.DataFrame
The time series data to be detrended and/or deseasonalized. Each column is treated as a separate time series.
- train_cutstr, int, float, or pd.Timestamp
The date or index that defines the end of the training period. If a float between 0 and 1 is provided, it represents a fraction of the total dataset length.
- deseasonalbool
If True, the function removes the seasonal component from the series using moving average decomposition.
- detrendbool
If True, the function removes the trend component from the series using moving average decomposition.
- collist of str, optional
A list of column names in df to be processed. If None, all columns are processed. Default is None.
Returns
- pd.DataFrame
A DataFrame with the detrended and/or deseasonalized series. The trend and seasonal components are removed according to the specified parameters.
Notes
The function uses seasonal_decompose from statsmodels.tsa to perform the moving average decomposition.
The deseasonal parameter removes the seasonal component using the seasonal decomposition from the training set.
The trend component is removed by subtracting the moving average trend calculated from the entire series.
The function creates a temporary column for month (month) to align seasonal adjustments, which is dropped before returning the final DataFrame.
Examples
>>> df = pd.DataFrame({'HPI': np.random.randn(100).cumsum()}, index=pd.date_range('2000-01-01', periods=100))
>>> df_ma_detrended = MA_detrend_deseasonal(df, train_cut='2005-01-01', deseasonal=True, detrend=True)
This example removes both trend and seasonality from the 'HPI' series in `df` using moving average decomposition.
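As a rough picture of moving average decomposition, the sketch below computes a centered moving average trend over the whole series and monthly seasonal factors from the training period. It deliberately uses NumPy and pandas rather than `seasonal_decompose`, so it is a stand-in for the idea and not the function's actual code; `ma_adjust` and its arguments are hypothetical, and an even `period` is assumed.

```python
import numpy as np
import pandas as pd

def ma_adjust(series, train_end, period=12, detrend=True, deseasonal=True):
    """Additive moving average decomposition for monthly data (even `period`)."""
    x = series.to_numpy(dtype=float)
    half = period // 2
    # Centered 2 x period moving average; the first and last `half` points stay NaN
    weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    trend = pd.Series(np.nan, index=series.index)
    trend.iloc[half:len(x) - half] = np.convolve(x, weights, mode="valid")

    # Seasonal factors: mean detrended value per calendar month, training period only
    detrended_train = (series - trend).loc[:train_end]
    factors = detrended_train.groupby(detrended_train.index.month).mean()
    factors = factors - factors.mean()  # center the factors around zero
    seasonal = pd.Series(
        factors.reindex(series.index.month).to_numpy(), index=series.index
    )

    out = series.astype(float).copy()
    if detrend:
        out = out - trend
    if deseasonal:
        out = out - seasonal
    return out

# Synthetic series: linear trend plus a smooth yearly cycle, with no noise
index = pd.date_range("2010-01-01", periods=60, freq="MS")
t = np.arange(60)
cycle = np.sin(2 * np.pi * index.month.to_numpy() / 12)
hpi = pd.Series(0.5 * t + cycle, index=index)
adjusted = ma_adjust(hpi, train_end="2013-12-01")
print(adjusted.dropna().abs().max())
```

One consequence of the centered average is visible here: when detrending is requested, the first and last six observations are missing, because the trend cannot be computed at the edges.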
- GTBpy.google_trends_functions.prepare_gtrends(df, detrend=False, deseasonal=False, method='reg', regression='ct', alpha=0.15, log=False, smooth=12, winsorize_trends=False, train_cut=0.8)[source]
Prepares Google Trends data by applying optional transformations such as detrending, deseasonalizing, smoothing, and logging.
When detrend=False and deseasonal=True, the seasonal effect can be removed in two ways. The first estimates the seasonal effect directly from the series. The second first estimates the trend, subtracts it from the series, estimates the seasonal factors from the detrended series, and then subtracts those factors from the original series, so that the returned series keeps the trend and residuals and excludes only the seasonal factors. In this function, the ‘growth’ method uses the first way and the ‘MA’ and ‘reg’ methods use the second.
Parameters
- dfpd.DataFrame
A DataFrame where the first column is the dependent variable (e.g., HPI) and the remaining columns are Google Trends data.
- detrendbool, optional
If True, the function removes the trend component from the Google Trends data.
- deseasonalbool, optional
If True, the function removes the seasonal component from the Google Trends data.
- methodstr, optional
The method used for detrending and/or deseasonalizing. Options are ‘reg’ for regression-based, ‘MA’ for moving average, or ‘growth’ for differencing. Default is ‘reg’.
- regressionstr, optional
The type of regression to use for detrending. Options are ‘c’ (constant), ‘ct’ (constant + trend), or ‘ctt’ (constant + trend + quadratic trend). Used only if method=’reg’. Default is ‘ct’.
- alphafloat, optional
The significance level for the augmented Dickey-Fuller test used in the regression-based method. Default is 0.15.
- logbool, optional
If True, log-transform the Google Trends data by applying log(X + 1) to avoid taking the logarithm of zero. Default is False.
- smoothint or bool, optional
The smoothing window size. If an integer is provided, it specifies the window size for the seasonal decomposition trend smoothing. If True, a default window size of 12 is used. Default is 12.
- winsorize_trendsbool, optional
If True, winsorizes the Google Trends data by applying limits of 0.05 on both ends to reduce the impact of outliers. Default is False.
- train_cutfloat or str or pd.Timestamp, optional
The point at which to split the data into training and testing. If a float between 0 and 1 is provided, it represents a fraction of the total dataset length. If a string or timestamp is provided, it is used directly as a cutoff date. Default is 0.8.
Returns
- pd.DataFrame
A DataFrame with the dependent variable and the transformed Google Trends data. The exact transformations depend on the provided parameters.
Raises
- Exception
If an invalid method is specified.
Notes
The method parameter determines how the detrending and/or deseasonalizing is performed. ‘reg’ uses regression-based methods, ‘MA’ uses moving averages, and ‘growth’ uses differencing.
The smooth parameter can be used to apply additional smoothing using seasonal decomposition.
The log transformation is applied after winsorizing if both are enabled.
The final DataFrame returned is merged on the index of the dependent variable and the transformed Google Trends data, ensuring alignment.
Examples
>>> df = pd.DataFrame({'HPI': np.random.randn(100).cumsum(), 'trend': np.random.randn(100)}, index=pd.date_range('2000-01-01', periods=100))
>>> prepared_df = prepare_gtrends(df, detrend=True, deseasonal=True, method='MA', smooth=6)
This example removes trend and seasonality using the moving average method and applies a smoothing window of 6 periods to the 'trend' data.
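The winsorize-then-log order mentioned in the notes can be illustrated as follows. This is a sketch under assumptions: `winsorize_and_log` is a hypothetical helper, and quantile clipping is used as a simple stand-in for whatever winsorization routine the function actually calls.

```python
import numpy as np
import pandas as pd

def winsorize_and_log(trends, limit=0.05):
    """Clip each column at its `limit` and `1 - limit` quantiles, then apply log(X + 1)."""
    lower = trends.quantile(limit)
    upper = trends.quantile(1 - limit)
    # axis=1 aligns the per-column bounds with the DataFrame columns
    clipped = trends.clip(lower=lower, upper=upper, axis=1)
    # log1p computes log(X + 1), which is finite when a search volume is zero
    return np.log1p(clipped)

# One made-up query with a zero and a single extreme spike
trends = pd.DataFrame({"query": [0, 10, 12, 11, 13, 9, 10, 12, 11, 500]}, dtype=float)
prepared = winsorize_and_log(trends)
print(prepared["query"].round(3).tolist())
```

Clipping before the log keeps the spike from dominating the scale, and the `+ 1` inside the log is what allows zero search volumes to pass through.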
- GTBpy.google_trends_functions.prepare_hpi(df, detrend=False, deseasonal=False, method='reg', regression='ct', alpha=0.15, log=False, train_cut=0.8)[source]
Prepares the Housing Price Index (HPI) data by applying optional transformations such as detrending, deseasonalizing, and logging.
Parameters
- dfpd.DataFrame
A DataFrame containing at least a column labeled ‘HPI’, representing the Housing Price Index.
- detrendbool, optional
If True, removes the trend component from the HPI data. Default is False.
- deseasonalbool, optional
If True, removes the seasonal component from the HPI data. Default is False.
- methodstr, optional
The method used for detrending and/or deseasonalizing. Options are ‘reg’ for regression-based or ‘MA’ for moving average. Default is ‘reg’.
- regressionstr, optional
The type of regression to use for detrending. Options are ‘c’ (constant), ‘ct’ (constant + trend), or ‘ctt’ (constant + trend + quadratic trend). Used only if method=’reg’. Default is ‘ct’.
- alphafloat, optional
The significance level for the augmented Dickey-Fuller (ADF) test used in the regression-based method. Default is 0.15.
- logbool, optional
If True, log-transforms the HPI data by applying log(HPI). Default is False.
- train_cutfloat or str or pd.Timestamp, optional
The point at which to split the data into training and testing. If a float between 0 and 1 is provided, it represents a fraction of the total dataset length. If a string or timestamp is provided, it is used directly as a cutoff date. Default is 0.8.
Returns
- pd.DataFrame
A DataFrame with the transformed HPI data, depending on the selected options for detrending, deseasonalizing, and logging.
Raises
- Exception
If an attempt is made to detrend HPI using the Moving Average method.
Notes
The function includes a log transformation option to stabilize variance, which is applied before any other transformations.
Detrending using the Moving Average (MA) method is explicitly disallowed for the HPI, as indicated by an exception.
The ADF test’s significance level (alpha) and type of regression used can significantly impact the detrending process, particularly in borderline cases.
Examples
>>> df = pd.DataFrame({'HPI': np.random.randn(100).cumsum()}, index=pd.date_range('2000-01-01', periods=100))
>>> prepared_df = prepare_hpi(df, detrend=True, deseasonal=True, method='reg', regression='ct')
This example removes the trend and seasonality from the HPI data using a regression-based method with a constant and trend.
- GTBpy.google_trends_functions.compute_weights(criteria, var_select, selection_index=None, coef_select='abs', n=10, power=4, rank_based=False)[source]
Computes weights for a set of criteria, selecting top variables or indices based on the specified selection method.
Parameters
- criteriapd.Series or pd.DataFrame
The data containing the criteria used to compute weights, such as coefficients or cross-validation errors.
- var_selectstr
The selection method for the variables. Options include ‘EN’ (Elastic Net), ‘coef’ (coefficient), ‘tvalue’, ‘CV’ (cross-validation), or ‘IC’ (information criterion).
- selection_indexpd.Index or list, optional
The indices to consider for selection. If None, the entire index of criteria is used. Default is None.
- coef_selectstr, optional
Method for selecting coefficients. Options are ‘abs’ for absolute values or ‘neg’ for negative values. Only relevant if var_select is ‘EN’, ‘coef’, or ‘tvalue’. Default is ‘abs’.
- nint, optional
The number of top variables or indices to select based on the computed weights. Default is 10.
- powerint or float, optional
The exponent used to compute the weights. Higher powers give more weight to larger values (or smaller values if using var_select=’CV’ or var_select=’IC’). Default is 4.
- rank_basedbool, optional
If True, weights are computed based on the rank of each criterion rather than its raw value. Default is False.
Returns
- weightspd.Series
A series of computed weights corresponding to the criteria.
- top_queriespd.Index or list
The top n indices or variables based on the computed weights.
Raises
- ValueError
If an invalid value for var_select or coef_select is provided.
Notes
The function supports both positive and negative selection of coefficients, depending on the coef_select parameter.
When rank_based is True, the function ranks the criteria before computing the weights, which can be useful for stabilizing the selection process.
The function supports two main types of selection: those based on coefficient magnitudes (‘EN’, ‘coef’, ‘tvalue’) and those based on error metrics (‘CV’, ‘IC’).
Examples
>>> criteria = pd.Series([0.2, -0.5, 1.0, -0.3, 0.7], index=['A', 'B', 'C', 'D', 'E'])
>>> weights, top_queries = compute_weights(criteria, var_select='coef', coef_select='abs', n=3)
>>> print(weights)
A    0.061926
B    0.142322
C    0.388067
D    0.100637
E    0.307048
dtype: float64
>>> print(top_queries)
Index(['C', 'E', 'B'], dtype='object')
>>> criteria = pd.Series([0.8, 0.5, 1.2, 0.9, 0.6], index=['A', 'B', 'C', 'D', 'E'])
>>> weights, top_queries = compute_weights(criteria, var_select='CV', rank_based=True, n=2)
>>> print(weights)
A    0.214796
B    0.355939
C    0.131228
D    0.178110
E    0.119927
dtype: float64
>>> print(top_queries)
Index(['B', 'D'], dtype='object')
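One plausible reading of power-based weighting is sketched below: scores are raised to `power` and normalized to sum to one, with error-type criteria inverted so that smaller values score higher. The helper `power_weights` and its exact formula are assumptions, not the library's definition, and the sketch is not expected to reproduce the numbers printed in the examples above.

```python
import pandas as pd

def power_weights(criteria, higher_is_better=True, power=4, n=3, rank_based=False):
    """Turn criteria into normalized weights and return the top-n labels."""
    if higher_is_better:
        scores = criteria.abs()      # coefficient-type criteria: magnitude matters
    else:
        scores = 1.0 / criteria      # error-type criteria: assumes strictly positive errors
    if rank_based:
        scores = scores.rank()       # rank 1 is the weakest, the largest rank the strongest
    raw = scores ** power
    weights = raw / raw.sum()        # normalize so the weights sum to one
    top = weights.nlargest(n).index
    return weights, top

coefs = pd.Series([0.2, -0.5, 1.0, -0.3, 0.7], index=["A", "B", "C", "D", "E"])
coef_weights, coef_top = power_weights(coefs, higher_is_better=True, n=3)
print(list(coef_top))

cv_errors = pd.Series([0.9, 0.4, 0.6], index=["q1", "q2", "q3"])
cv_weights, cv_top = power_weights(cv_errors, higher_is_better=False, n=2)
print(list(cv_top))
```

A higher `power` concentrates the weight on the strongest criteria, and `rank_based=True` makes the result depend only on the ordering, which damps the influence of a single extreme value.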
- GTBpy.google_trends_functions.GT_plot(self, df, exog, lag, title=None, winsorize_trend=False, scaled=False)[source]
Generate plots to visualize the relationship between the Housing Price Index (HPI) and a selected exogenous variable.
Parameters
- dfpandas.DataFrame
The DataFrame containing the HPI and exogenous variables.
- exogstr
The name of the exogenous variable to be plotted and analyzed.
- lagint
The lag to apply to the exogenous variable in the scatter plot and time series plot.
- titlestr, optional
The title for the overall figure (default is None).
- winsorize_trendbool, optional
If True, apply winsorization to the exogenous variable to limit the effect of outliers (default is False).
- scaledbool, optional
If True, standardize the exogenous variable by removing the mean and scaling to unit variance (default is False).
Returns
- None
Displays a 2x2 grid of plots:
  - Top-left: time series plot of HPI.
  - Top-right: scatter plot of the lagged exogenous variable against HPI.
  - Bottom-left: time series plot of the lagged exogenous variable.
  - Bottom-right: empty; this subplot is removed.
Notes
The function visualizes the relationship between HPI and a chosen exogenous variable by generating time series and scatter plots.
If winsorize_trend is True, the exogenous variable is adjusted to mitigate the impact of extreme values.
If scaled is True, the exogenous variable is standardized before being plotted.
The bottom-right subplot (ax4) is intentionally removed, leaving three plots in the 2x2 grid.
Examples
>>> model.GT_plot(df, exog='Google_Trends_Search', lag=2, title='Google Trends vs. HPI', winsorize_trend=True, scaled=True)
Displays the plots for the relationship between HPI and the Google Trends search data with a lag of 2.
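The data preparation behind the lagged scatter plot can be sketched without any plotting. The helper `lagged_pair` is hypothetical, and the use of the population standard deviation for scaling is an assumption.

```python
import numpy as np
import pandas as pd

def lagged_pair(df, exog, lag, target="HPI", scaled=False):
    """Pair the target with the exogenous variable observed `lag` periods earlier."""
    x = df[exog].shift(lag)                      # the value at time t comes from t - lag
    if scaled:
        x = (x - x.mean()) / x.std(ddof=0)       # zero mean, unit variance
    pair = pd.concat([df[target], x.rename(f"{exog}_lag{lag}")], axis=1)
    return pair.dropna()                         # the first `lag` rows have no lagged value

index = pd.date_range("2015-01-01", periods=6, freq="MS")
df = pd.DataFrame(
    {"HPI": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
     "searches": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]},
    index=index,
)
pair = lagged_pair(df, "searches", lag=2)
scaled_pair = lagged_pair(df, "searches", lag=2, scaled=True)
print(pair)
```

With `lag=2`, the March HPI value is matched with the January search value, which is the pairing a scatter plot of lagged searches against HPI would show.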
- GTBpy.google_trends_functions.plot_gtrends(self, exog, h: int, winsorize_trend=False, scaled=False)[source]
Plot Google Trends data and its transformations over different stages of processing.
Parameters
- exogstr or None
The exogenous variable (Google Trends search term) to be plotted. If None, the variable with the highest weight for the specified horizon h is selected.
- hint
The forecast horizon to be used. If h is not in self.selection_h_list, the maximum value from self.selection_h_list is used.
- winsorize_trendbool, optional
If True, the Google Trends data is winsorized (default is False).
- scaledbool, optional
If True, the Google Trends data is scaled using standardization (default is False).
Returns
- None
Displays a series of plots showing the original Google Trends data, winsorized data, prepared Google Trends data, and the corresponding Housing Price Index (HPI) data.
Notes
This function visualizes the Google Trends data (exog) across different stages: original, after winsorization, after preparation, and after scaling.
The h parameter specifies the forecast horizon, and the function adjusts the plots accordingly.
If exog is not provided, the function automatically selects the best-performing Google Trends term based on predefined criteria.
The vertical red dashed line in each plot indicates the cut-off point between training and validation data.
Examples
>>> model.plot_gtrends(exog='housing_market', h=6, winsorize_trend=True, scaled=True)
Plots the Google Trends data for 'housing_market' with winsorization and scaling applied at horizon 6.
>>> model.plot_gtrends(exog=None, h=3)
Automatically selects the best Google Trends term and plots it at horizon 3 without any additional processing.
- GTBpy.google_trends_functions.plot_hsi(self, hsi=None)[source]
Plot the selected Housing Search Index (HSI) for different forecast horizons.
Parameters
- hsistr, optional
The specific HSI to be plotted. If None, the HSI with the highest weight for each forecast horizon (h) is selected (default is None).
Returns
- None
Displays the plots for the selected HSIs across the specified forecast horizons in self.selection_h_list.
Notes
This method generates a plot for each forecast horizon (h) in reverse order (starting from the longest horizon).
If no specific hsi is provided, the method will automatically select and plot the HSI with the maximum weight for each h from self.criteria_dfs.
The plots are created using the GT_plot function, which visualizes the relationship between the HPI and the selected HSI.
Examples
>>> model.plot_hsi(hsi='Sentiment_Index')
Plots the specified 'Sentiment_Index' for all horizons in the selection list.
>>> model.plot_hsi()
Automatically selects and plots the HSI with the maximum weight for each horizon in the selection list.
- GTBpy.google_trends_functions.plot_improvement(obj_list, measure='MAFE', h_list=None, colors=None, labels=None, figsize=(15, 9))[source]
Plot the Kernel Density Estimate (KDE) of improvement measures (MAFE/MSFE) for different models across forecast horizons.
Parameters
- obj_listlist
A list of objects containing the results of the models to be compared. Each object should have MAFE_val_df_improve and MSFE_val_df_improve attributes.
- measurestr, optional
The performance measure to be plotted. Can be ‘MAFE’ (Mean Absolute Forecast Error) or ‘MSFE’ (Mean Squared Forecast Error). Default is ‘MAFE’.
- h_listlist, optional
The list of forecast horizons to be considered. If None, the horizons from the first object in obj_list will be used.
- colorslist, optional
List of colors to be used for plotting different models. If None, a default list of colors will be used.
- labelslist, optional
List of labels for the models in obj_list. If None, default labels (‘G1’, ‘G2’, …) will be generated.
- figsizetuple, optional
Size of the figure. Default is (15, 9).
Returns
- None
Displays the KDE plots for the selected measure across the specified forecast horizons.
Notes
This function plots the density of the percentage improvement in MAFE/MSFE for different models over multiple forecast horizons (h).
A vertical line is added to each plot to mark the performance of the Universal Housing Sentiment Index (UHSI) for the corresponding forecast horizon.
If h_list is not provided, the forecast horizons from the first model in obj_list are used by default.
Examples
>>> plot_improvement(models_list, measure='MAFE', h_list=[1, 3, 6, 12], colors=['blue', 'orange'], labels=['Model 1', 'Model 2'])
Plots the MAFE improvement KDE for the specified models and forecast horizons.
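For reference, the two error measures and a percentage improvement over a benchmark can be computed as in the sketch below. The function names are hypothetical, and the improvement formula (relative reduction against a benchmark forecast) is an assumption about how the `*_improve` attributes are defined.

```python
import numpy as np

def forecast_errors(actual, forecast):
    """Mean absolute and mean squared forecast error."""
    err = np.asarray(actual, dtype=float) - np.asarray(forecast, dtype=float)
    return {"MAFE": float(np.mean(np.abs(err))), "MSFE": float(np.mean(err ** 2))}

def pct_improvement(benchmark, candidate):
    """Percentage reduction in error relative to the benchmark (assumed definition)."""
    return 100.0 * (benchmark - candidate) / benchmark

actual = [100.0, 102.0, 104.0, 106.0]
benchmark = forecast_errors(actual, [101.0, 101.0, 106.0, 104.0])   # e.g. a model without search data
candidate = forecast_errors(actual, [100.5, 102.5, 103.0, 106.0])   # e.g. a model with a search index

mafe_gain = pct_improvement(benchmark["MAFE"], candidate["MAFE"])
msfe_gain = pct_improvement(benchmark["MSFE"], candidate["MSFE"])
print(round(mafe_gain, 2), round(msfe_gain, 2))
```

A positive value means the candidate forecast has a lower error than the benchmark. MSFE penalizes large misses more heavily, so the two measures can rank models differently.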
- GTBpy.google_trends_functions.plot_forecast(self, exog)[source]
Plot the forecasted vs actual Housing Price Index (HPI) values over different forecast horizons.
Parameters
- exogstr or list, optional
The exogenous variable(s) used in the forecasting models. If None, the best performing exogenous variable for each horizon is selected based on the criteria in self.forecast_criteria_df.
Returns
- None
Displays the forecast plots for the actual and predicted HPI across different forecast horizons.
Notes
The function generates subplots for each forecast horizon h in self.h_list, displaying the actual and predicted HPI.
If exog is a list, the corresponding element is used as the exogenous variable for each horizon. If exog is a single string, it is used for all horizons.
The vertical red dashed line marks the cut-off point between the training and validation datasets.
Examples
>>> model.plot_forecast(exog=['UHSI_1', 'UHSI_3', 'UHSI_6'])
Plots the forecasts using the specified exogenous variables for each horizon.
>>> model.plot_forecast(exog='UHSI_3')
Plots the forecasts using 'UHSI_3' as the exogenous variable for all horizons.
- class GTBpy.google_trends_functions.GoogleTrend(df, train_cut=0.8, verbose=False, seed=None, pca_lag=True)[source]
Bases: object
A class for processing and modeling Google Trends data to predict housing prices. This class provides tools for preparing, transforming, and modeling time series data using methods such as detrending, deseasonalizing, principal component analysis (PCA), and lagged variable selection.
Parameters
- dfpd.DataFrame
A DataFrame containing the initial dataset, including the Housing Price Index (HPI) and Google Trends data.
- train_cutfloat, int, or str, optional
Defines the end of the training period. If a float between 0 and 1 is provided, it represents the fraction of the dataset used for training. Default is 0.8.
- verbosebool, optional
If True, enables detailed output for logging purposes. Default is False.
- seedint or None, optional
Seed for random number generation, used to ensure reproducibility. Default is None.
- pca_lagbool, optional
If True, PCA with lagged variables is used during the feature extraction. Default is True.
Attributes
- dfpd.DataFrame
A copy of the initial dataset provided by the user.
- train_cutpd.Timestamp
Timestamp or index defining the end of the training period.
- verbosebool
Verbosity level for logging purposes.
- seedint or None
Seed for random number generation.
- pca_lagbool
Specifies whether PCA with lagged variables should be used.
- gtrends_dfpd.DataFrame
Transformed Google Trends data.
- hpi_dfpd.DataFrame
Transformed Housing Price Index (HPI) data.
- hsi_dfpd.DataFrame
Final set of features generated after processing HPI and Google Trends data.
- hsi_dfspd.DataFrame
Collection of transformed features including multiple PCA layers.
- criteria_dfspd.DataFrame
DataFrame containing computed criteria values for feature selection.
- criteria_dfs_expandedpd.DataFrame
DataFrame containing expanded versions of computed criteria for feature selection.
- forecast_hsipd.DataFrame
The final set of features used for forecasting.
- MAFE_val_df, MSFE_val_df, forecast_criteria_df, etc.
DataFrames and metrics representing model performance during the forecast.
Methods
- prepare_gtrends(detrend=True, deseasonal=True, …)
Prepares Google Trends data by applying optional transformations such as detrending and deseasonalizing.
- prepare_hpi(detrend=False, deseasonal=True, …)
Prepares the HPI data by applying optional transformations such as detrending and deseasonalizing.
- compute_hsi(layer=1, …)
Extracts key features from the dataset using PCA, lagged variables, and selection criteria.
- lag_setting(lag_select=’IC’, …)
Sets the lag selection criteria and related parameters for the model.
- forecast(h_list=[1,3,6,12], …)
Forecasts the HPI using prepared Google Trends data and selected lags.
- results(results_table=False, …)
Displays various results from the model, including performance metrics.
- plot_gtrends(exog=None, …)
Plots Google Trends data used in the model.
- plot_hsi(hsi=None)
Plots HSI data used in the model.
- plot_improvement(measure=’MAFE’, …)
Plots improvements in forecasting accuracy based on specified performance measures.
- plot_forecast(exog=None)
Plots the forecasted HPI based on the selected model features.
Examples
>>> GT_object = GoogleTrend(df, '2019-12-31', verbose=1, seed=1, pca_lag=True)
>>> GT_object.prepare_gtrends(detrend=True, deseasonal=True, method='reg', regression='ct', log=False, smooth=12)
>>> GT_object.prepare_hpi(detrend=True, deseasonal=True, method='reg', regression='ct', log=True)
>>> GT_object.compute_hsi(layer=1, input_pool=True, auto_layer=False, n_input=10, n_hsi=20, max_lag=3, var_select='CV', random='rnd', cv=5, shuffle=True, n_iter=20, selection_h_list=[0,1,3,6,12])
>>> GT_object.compute_hsi(layer=2, input_pool=True, auto_layer=True, n_input=10, n_hsi=20, max_lag=1, var_select='CV', random='rnd', cv=5, shuffle=True, n_iter=20, selection_h_list=[0,1,3,6,12])
>>> GT_object.lag_setting(y_lags='Auto', exog_lags='Auto', max_lag=3, lag_select='IC')
>>> GT_object.forecast(h_list=[1,3,6,12], seasonal=True, hsi_CV_select=True, fit_intercept=True, cv=5, n_iter=20, original_scale=False)
>>> GT_object.results(MAFE_val_df=True, MAFE_val_df_improve=True, forecast_criteria_df=True, lags_df=True, head=True)
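Since `train_cut` may be given as a fraction, an integer, or a date while the attribute is stored as a timestamp, a conversion step along the following lines is implied. The helper `resolve_train_cut` is hypothetical, and the rounding rule for fractions is an assumption rather than the class's documented behavior.

```python
import numpy as np
import pandas as pd

def resolve_train_cut(index, train_cut):
    """Map a fraction, integer position, or date to the last training timestamp."""
    if isinstance(train_cut, float) and 0 < train_cut < 1:
        # Fraction of the sample: keep the first int(n * fraction) observations (assumption)
        position = int(len(index) * train_cut) - 1
        return index[position]
    if isinstance(train_cut, (int, np.integer)):
        # Integer: treated here as a position in the index (assumption)
        return index[train_cut]
    # Anything else is parsed as a date
    return pd.Timestamp(train_cut)

index = pd.date_range("2004-01-01", periods=10, freq="MS")
cut_from_fraction = resolve_train_cut(index, 0.8)
cut_from_position = resolve_train_cut(index, 5)
cut_from_date = resolve_train_cut(index, "2019-12-31")
print(cut_from_fraction, cut_from_position, cut_from_date)
```

Resolving the cut once and storing a timestamp lets the later steps select the training period with a simple label slice on the date index.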
- prepare_gtrends(detrend=True, deseasonal=True, method='reg', regression='ct', alpha=0.15, log=False, smooth=False, winsorize=False)[source]
Prepares Google Trends data by applying optional transformations such as detrending, deseasonalizing, smoothing, and logging.
Parameters
- detrendbool, optional
If True, removes the trend component from the Google Trends data. Default is True.
- deseasonalbool, optional
If True, removes the seasonal component from the Google Trends data. Default is True.
- methodstr, optional
The method used for detrending and/or deseasonalizing. Options are ‘reg’ for regression-based, ‘MA’ for moving average, or ‘growth’ for differencing. Default is ‘reg’.
- regressionstr, optional
The type of regression to use for detrending. Options are ‘c’ (constant), ‘ct’ (constant + trend), or ‘ctt’ (constant + trend + quadratic trend). Used only if method=’reg’. Default is ‘ct’.
- alphafloat, optional
The significance level for the augmented Dickey-Fuller (ADF) test used in the regression-based method. Default is 0.15.
- logbool, optional
If True, log-transforms the Google Trends data by applying log(X + 1) to stabilize variance. Default is False.
- smoothbool or int, optional
If True, applies smoothing using a default window of 12. If an integer is provided, it specifies the window size for smoothing. Default is False.
- winsorizebool, optional
If True, applies winsorization to the Google Trends data to reduce the effect of outliers. Default is False.
Returns
- None
Updates the gtrends_df attribute with the prepared Google Trends data after applying the specified transformations.
Notes
This method prepares the Google Trends data for further modeling by optionally removing trend and seasonal components, smoothing, and log-transforming the data.
The method parameter determines the approach for detrending and deseasonalizing, allowing flexibility in how these transformations are applied.
The transformations help stabilize the data, improve stationarity, and reduce the impact of outliers, which can enhance the performance of downstream models.
- prepare_hpi(detrend=False, deseasonal=True, method='reg', regression='ct', alpha=0.15, log=False)[source]
Prepares the Housing Price Index (HPI) data by applying optional transformations such as detrending, deseasonalizing, and logging.
Parameters
- detrendbool, optional
If True, removes the trend component from the HPI data. Default is False.
- deseasonalbool, optional
If True, removes the seasonal component from the HPI data. Default is True.
- methodstr, optional
The method used for detrending and/or deseasonalizing. Options are ‘reg’ for regression-based or ‘MA’ for moving average. Default is ‘reg’.
- regressionstr, optional
The type of regression to use for detrending. Options are ‘c’ (constant), ‘ct’ (constant + trend), or ‘ctt’ (constant + trend + quadratic trend). Used only if method=’reg’. Default is ‘ct’.
- alphafloat, optional
The significance level for the augmented Dickey-Fuller (ADF) test used in the regression-based method. Default is 0.15.
- logbool, optional
If True, log-transforms the HPI data by applying log(HPI). Default is False.
Returns
- None
Updates the hpi_df attribute with the prepared HPI data after applying the specified transformations.
Notes
This method prepares the Housing Price Index (HPI) data for further modeling by optionally removing trend and seasonal components and applying a log transformation.
The method parameter determines how the detrending and deseasonalizing are applied, offering flexibility in transformations.
The transformations help in stabilizing the variance, improving stationarity, and making the data suitable for time series modeling and forecasting.
- compute_hsi(layer=1, input_pool=True, auto_layer=False, n_input=10, n_hsi=20, max_lag=3, var_select='CV', random=False, cv=3, shuffle=True, n_iter=20, alphas=None, l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1], selection_h_list=[0, 1, 3, 6, 12], coef_select='abs', criteria_dis=False)[source]
Extracts key features from the dataset using Principal Component Analysis (PCA), lagged variables, and feature selection criteria.
Parameters
- layer : int, optional
The current layer of PCA and feature extraction. Default is 1.
- input_pool : bool, optional
If True, the full input feature pool is used during the feature extraction. Default is True.
- auto_layer : bool, optional
If True, additional layers are added automatically until no improvement is observed. Default is False.
- n_input : int, optional
Number of input features to select. Default is 10.
- n_hsi : int, optional
Number of HSI (Housing Sentiment Index) features to compute. Default is 20.
- max_lag : int, optional
Maximum lag to consider for lagged features. Default is 3.
- var_select : str, optional
The method used for variable selection. Options include ‘CV’, ‘EN’, ‘coef’, or ‘tvalue’. Default is ‘CV’.
- random : bool or str, optional
If True or set to ‘intc’ or ‘rnd’, a randomized selection process is used. Default is False.
- cv : int, optional
Number of cross-validation folds. Default is 3.
- shuffle : bool, optional
If True, shuffle the data during cross-validation. Default is True.
- n_iter : int, optional
Number of iterations for randomized search or selection. Default is 20.
- alphas : array-like, optional
Array of alpha values for ElasticNetCV. If None (the default), np.logspace(-3, 0, 100) is used.
- l1_ratio : list, optional
List of L1 ratios for ElasticNetCV. Default is [.1, .5, .7, .9, .95, .99, 1].
- selection_h_list : list of int, optional
List of lag periods to consider for selection. Default is [0, 1, 3, 6, 12].
- coef_select : str, optional
Method for selecting coefficients (‘abs’ for absolute values or ‘neg’ for negative values). Default is ‘abs’.
- criteria_dis : bool, optional
If True, displays criteria DataFrames used in feature selection. Default is False.
Returns
- None
Updates the attributes hsi_df, hsi_dfs, X_scaled_lagged_full, criteria_dfs, and forecast_hsi with the newly computed features and criteria.
Notes
This method uses PCA for dimensionality reduction and lagged variable selection to generate new features.
Feature selection is based on different criteria, including ElasticNet coefficients, cross-validation metrics, and other statistics.
The process can include both deterministic and randomized approaches to ensure robustness.
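The combination of scaling, lagging, and PCA described above can be sketched with NumPy alone. This is a simplified illustration on synthetic data, not GTBpy's implementation: the helper names standardize, add_lags, and pca_scores are hypothetical, and the order of operations (scale, then lag, then PCA) is an assumption suggested by the attribute name X_scaled_lagged_full.

```python
import numpy as np

def standardize(X):
    """Scale each column to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def add_lags(X, max_lag):
    """Stack X with its lags 1..max_lag side by side.

    The first max_lag rows are dropped because they lack a full lag history.
    """
    n = X.shape[0]
    return np.hstack([X[max_lag - k : n - k] for k in range(max_lag + 1)])

def pca_scores(X, n_components):
    """Leading principal-component scores of X, computed via SVD."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * s[:n_components]

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))  # 60 months, 4 synthetic search queries
X_scaled_lagged = add_lags(standardize(X), max_lag=3)  # 4 * (3 + 1) = 16 columns
scores = pca_scores(X_scaled_lagged, n_components=2)
```

With max_lag=3 each query contributes four columns (lags 0 to 3), and the component scores are mutually orthogonal, which is what makes them convenient as compact candidate features.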
- lag_setting(y_lags='Auto', exog_lags='Auto', max_lag=3, lag_select='IC', fit_intercept=False, cv=5, shuffle=True, n_iter=20, var_order='cross')
Sets the lag selection criteria and related parameters for the model.
Parameters
- lag_select : str, optional
The method used for selecting the lag order. Options include ‘IC’ (information criterion), ‘CV’ (cross-validation), etc. Default is ‘IC’.
- fit_intercept : bool, optional
If True, fits an intercept in the lag model. Default is False.
- cv : int, optional
Number of cross-validation folds to use when selecting lags. Default is 5.
- shuffle : bool, optional
If True, shuffle the data during cross-validation. Default is True.
- n_iter : int, optional
Number of iterations to use during randomized cross-validation or lag selection. Default is 20.
- var_order : str, optional
The order in which variables are considered for lag selection. Options include ‘cross’ (cross-sectional ordering) and others. Default is ‘cross’.
- max_lag : int, optional
The maximum lag to consider for the model. Default is 3.
- y_lags : str or int, optional
The number of lags for the target variable (y). If ‘Auto’, the lag length is automatically determined. Default is ‘Auto’.
- exog_lags : str or int, optional
The number of lags for the exogenous variables (exog). If ‘Auto’, the lag length is automatically determined. Default is ‘Auto’.
Returns
- None
Updates the lag-related attributes of the instance, such as lag_select, lag_fit_intercept, lag_cv, and others.
Notes
This method sets the parameters for the lag selection process, which determines the temporal dependencies in the model.
The selection can be done using either information criteria or cross-validation approaches.
The var_order parameter controls the order in which variables are processed for lag determination, which can impact model performance.
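To make information-criterion lag selection concrete, the sketch below picks an autoregressive order between 1 and max_lag by minimizing the BIC on a simulated series. The source does not say which criterion lag_select=‘IC’ uses, so the choice of BIC is an assumption, and select_ar_lag_bic is a hypothetical helper rather than a GTBpy function.

```python
import numpy as np

def select_ar_lag_bic(y, max_lag, fit_intercept=False):
    """Pick the AR order in 1..max_lag with the lowest BIC (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    target = y[max_lag:]  # common estimation sample for every candidate order
    n = len(target)
    best_lag, best_bic = None, np.inf
    for p in range(1, max_lag + 1):
        # Column k holds y lagged by k periods, aligned with target
        cols = [y[max_lag - k : len(y) - k] for k in range(1, p + 1)]
        if fit_intercept:
            cols.append(np.ones(n))
        A = np.column_stack(cols)
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        rss = float(np.sum((target - A @ beta) ** 2))
        bic = n * np.log(rss / n) + A.shape[1] * np.log(n)
        if bic < best_bic:
            best_lag, best_bic = p, bic
    return best_lag

# Simulated AR(2) process (illustrative data only)
rng = np.random.default_rng(1)
e = rng.normal(size=500)
y = np.zeros(500)
for t in range(2, 500):
    y[t] = 0.5 * y[t - 1] - 0.5 * y[t - 2] + e[t]

chosen = select_ar_lag_bic(y, max_lag=3)
```

All candidate orders are fitted on the same sample (the first max_lag observations are dropped for each), so their criterion values are directly comparable.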
- forecast(h_list=[1, 3, 6, 12], seasonal=False, hsi_CV_select=False, fit_intercept=False, cv=4, n_iter=100, original_scale=True, sort_df='criteria', sort_col=-1)
Forecasts the Housing Price Index (HPI) using prepared Google Trends data and selected lag features.
Parameters
- h_list : list of int, optional
A list of forecast horizons for which the predictions will be made. Default is [1, 3, 6, 12].
- seasonal : bool, optional
If True, includes seasonal components in the model during forecasting. Default is False.
- hsi_CV_select : bool, optional
If True, uses cross-validation for selecting the Housing Sentiment Index (HSI) features. Default is False.
- fit_intercept : bool, optional
If True, fits an intercept term in the forecasting model. Default is False.
- cv : int, optional
Number of cross-validation folds for model validation. Default is 4.
- n_iter : int, optional
Number of iterations to use during cross-validation or parameter tuning. Default is 100.
- original_scale : bool, optional
If True, the forecasted values are converted back to their original scale. Default is True.
- sort_df : str, optional
The method used to sort the features for the forecast model. Options include ‘criteria’. Default is ‘criteria’.
- sort_col : int, optional
The column index to use for sorting when selecting features. Default is -1 (the last column).
Returns
- None
Updates the attributes with forecast results, including various performance metrics and forecast data.
Attributes Updated
- MAFE_val_df : pd.DataFrame
Mean Absolute Forecasting Error (MAFE) values for the validation set.
- MAFE_val_df_improve : pd.DataFrame
Improvement in MAFE values compared to a benchmark.
- MSFE_val_df : pd.DataFrame
Mean Squared Forecasting Error (MSFE) values for the validation set.
- MSFE_val_df_improve : pd.DataFrame
Improvement in MSFE values compared to a benchmark.
- forecast_criteria_df : pd.DataFrame
DataFrame containing criteria values used for selecting features during forecasting.
- lags_df : pd.DataFrame
DataFrame representing the lags used for each feature in the model.
- MAFE_train_df : pd.DataFrame
MAFE values for the training set.
- MSFE_train_df : pd.DataFrame
MSFE values for the training set.
- forecast_dict : dict
Dictionary containing forecasted values for different horizons.
- hsi : pd.DataFrame
DataFrame containing the forecasted HPI values.
- results_table : pd.DataFrame
Table summarizing the forecast results, including performance metrics.
Notes
This method forecasts the HPI using a combination of Google Trends data and selected lagged features.
Various forecast horizons can be specified using the h_list parameter to produce forecasts for different time periods.
The method can optionally fit an intercept term and use cross-validation to select the most important features.
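For readers unfamiliar with the error measures stored in the MAFE and MSFE tables, the sketch below computes both for a model and a benchmark forecast and then a relative improvement. The numbers are made up, and the improvement formula (percentage reduction in error relative to the benchmark) is an assumption; the source does not state how GTBpy defines the values in the *_improve DataFrames.

```python
from statistics import fmean

def mafe(actual, predicted):
    """Mean Absolute Forecasting Error."""
    return fmean(abs(a - p) for a, p in zip(actual, predicted))

def msfe(actual, predicted):
    """Mean Squared Forecasting Error."""
    return fmean((a - p) ** 2 for a, p in zip(actual, predicted))

def improvement_pct(benchmark_error, model_error):
    """Percentage error reduction relative to the benchmark (assumed definition)."""
    return 100.0 * (benchmark_error - model_error) / benchmark_error

# Illustrative values only
actual = [100.0, 102.0, 101.0, 105.0]
benchmark = [98.0, 100.0, 104.0, 101.0]
model = [99.0, 101.0, 102.0, 104.0]

mafe_gain = improvement_pct(mafe(actual, benchmark), mafe(actual, model))
msfe_gain = improvement_pct(msfe(actual, benchmark), msfe(actual, model))
```

Under this definition a positive value means the model beats the benchmark. MSFE penalizes large misses more heavily than MAFE, so the two measures can rank models differently.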
- results(results_table=False, MAFE_val_df=False, MAFE_val_df_improve=False, MSFE_val_df=False, MSFE_val_df_improve=False, forecast_criteria_df=False, lags_df=False, MAFE_train_df=False, head=10)
Displays various results from the model, including performance metrics and forecast data.
Parameters
- results_table : bool, optional
If True, displays the full results table summarizing the forecast performance. Default is False.
- MAFE_val_df : bool, optional
If True, displays the Mean Absolute Forecasting Error (MAFE) values for the validation set. Default is False.
- MAFE_val_df_improve : bool, optional
If True, displays the improvement in MAFE values compared to a benchmark. Default is False.
- MSFE_val_df : bool, optional
If True, displays the Mean Squared Forecasting Error (MSFE) values for the validation set. Default is False.
- MSFE_val_df_improve : bool, optional
If True, displays the improvement in MSFE values compared to a benchmark. Default is False.
- forecast_criteria_df : bool, optional
If True, displays the DataFrame containing criteria values used for selecting features during forecasting. Default is False.
- lags_df : bool, optional
If True, displays the DataFrame representing the lags used for each feature in the model. Default is False.
- MAFE_train_df : bool, optional
If True, displays the MAFE values for the training set. Default is False.
- head : int, optional
Number of rows to show for each displayed DataFrame. Default is 10.
Returns
- None
Displays the specified results based on the provided parameters.
Notes
This method allows the user to access different performance metrics and forecast-related data.
By selecting the appropriate parameters, users can view specific tables and metrics that summarize the model’s forecasting performance.
The head parameter controls the number of rows to display for DataFrames to avoid overwhelming output.
- plot_gtrends(exog=None, h=12, winsorize_trend=False, scaled=False)
Plots the Google Trends data used in the model.
Parameters
- exog : list of str or None, optional
List of exogenous variables (Google Trends features) to plot. If None, all available features are plotted. Default is None.
- h : int, optional
The forecast horizon for which the trends are plotted. Default is 12.
- winsorize_trend : bool, optional
If True, plots the Google Trends data after winsorization, which reduces the effect of outliers. Default is False.
- scaled : bool, optional
If True, plots the scaled version of Google Trends data, allowing for comparison across different features. Default is False.
Returns
- None
Displays the plots for the specified Google Trends features.
Notes
This method provides a visual representation of Google Trends data used in the model, which helps in understanding the temporal patterns in the data.
Users can optionally plot specific features by providing a list in the exog parameter.
The winsorize_trend parameter allows for a clearer view by minimizing the impact of extreme values.
The scaled parameter enables visualization of standardized features, useful for comparing different variables on the same scale.
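The effect of winsorization and scaling on a single search-interest series can be sketched as follows. The 5th and 95th percentile limits, the use of z-score standardization, and the helper names are assumptions chosen for illustration; the source does not specify the limits or the scaler GTBpy applies.

```python
import numpy as np

def winsorize(x, lower_pct=5.0, upper_pct=95.0):
    """Clip values outside the given percentiles (assumed limits, not GTBpy defaults)."""
    x = np.asarray(x, dtype=float)
    lo, hi = np.percentile(x, [lower_pct, upper_pct])
    return np.clip(x, lo, hi)

def standardize(x):
    """Rescale to zero mean and unit variance so series share a common scale."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# Synthetic search-interest series with one extreme spike (illustrative data only)
trend = np.array([10, 12, 11, 13, 12, 100, 11, 12, 10, 13], dtype=float)
trend_w = winsorize(trend)
trend_ws = standardize(trend_w)
```

Winsorizing pulls the spike back toward the bulk of the data instead of discarding it, so the remaining variation stays visible when the series is plotted.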
- plot_hsi(hsi=None)
Plots the Housing Sentiment Index (HSI) data used in the model, allowing for visualization of the extracted features.
Parameters
- hsi : list of str or None, optional
List of HSI features to plot. If None, all available HSI features are plotted. Default is None.
Returns
- None
Displays the plots for the specified HSI features.
Notes
This method provides a visual representation of the HSI features that have been extracted using PCA and other transformations.
The hsi parameter allows users to specify particular HSI features to visualize, or plot all available features if set to None.
This is useful for analyzing the derived sentiment features and understanding their temporal patterns.
- plot_improvement(measure='MAFE', h_list=None, colors=None, labels=None, figsize=(15, 9))
Plots the improvement in forecasting accuracy for different horizons, using a specified performance measure.
Parameters
- measure : str, optional
The performance measure to visualize. Options include ‘MAFE’ (Mean Absolute Forecasting Error) and ‘MSFE’ (Mean Squared Forecasting Error). Default is ‘MAFE’.
- h_list : list of int or None, optional
A list of forecast horizons to include in the plot. If None, all available horizons are included. Default is None.
- colors : list of str or None, optional
List of colors to use for the plot lines. If None, default colors are used. Default is None.
- labels : list of str or None, optional
List of labels for each forecast horizon in the plot. If None, default labels are used. Default is None.
- figsize : tuple of int, optional
Figure size for the plot. Default is (15, 9).
Returns
- None
Displays a plot showing the improvement in forecasting accuracy for the specified measure and horizons.
Notes
This method helps visualize the improvement in forecasting performance for different horizons.
The measure parameter allows the user to specify whether to visualize Mean Absolute Forecasting Error (MAFE) or Mean Squared Forecasting Error (MSFE).
Users can customize the appearance of the plot using the colors, labels, and figsize parameters.
The h_list parameter allows for selecting specific forecast horizons to analyze, which can be useful for evaluating the model’s performance over varying time periods.
- plot_forecast(exog=None)
Plots the forecasted Housing Price Index (HPI) values along with the actual values for comparison.
Parameters
- exog : list of str or None, optional
List of exogenous variables (features) to include in the forecast plot. If None, all available features are included. Default is None.
Returns
- None
Displays the forecast plot showing the predicted HPI values and the actual values.
Notes
This method provides a visual comparison between the forecasted HPI values and the actual values, allowing users to assess the model’s predictive performance.
The exog parameter allows users to include specific exogenous features in the plot for further analysis of their impact on the forecast.
This is useful for evaluating the quality of the model’s predictions over different time horizons.