Here we explore the theoretical coefficient distributions of a linear regression model. When fitting a regression model we get estimates of the standard deviation of each coefficient. We use bootstrapping to build an empirical distribution of the regression coefficients and compare it against those theoretical distributions.
The outline is as follows:
- Collect some data
- Fit a regression model
- Fit many models on bootstrapped samples of the training data
- Compare the distribution of bootstrapped coefficients to the theoretical distribution of coefficients
The data used is weather data collected from Kaggle: https://www.kaggle.com/budincsevity/szeged-weather
import os
import pandas as pd
import numpy as np
dir_path = os.getcwd()
data = pd.read_csv(os.path.join(dir_path, "data", "weatherHistory.csv"))
data.head()
| Formatted Date | Summary | Precip Type | Temperature (C) | Apparent Temperature (C) | Humidity | Wind Speed (km/h) | Wind Bearing (degrees) | Visibility (km) | Loud Cover | Pressure (millibars) | Daily Summary |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2006-04-01 00:00:00.000 +0200 | Partly Cloudy | rain | 9.472222 | 7.388889 | 0.89 | 14.1197 | 251.0 | 15.8263 | 0.0 | 1015.13 | Partly cloudy throughout the day. |
1 | 2006-04-01 01:00:00.000 +0200 | Partly Cloudy | rain | 9.355556 | 7.227778 | 0.86 | 14.2646 | 259.0 | 15.8263 | 0.0 | 1015.63 | Partly cloudy throughout the day. |
2 | 2006-04-01 02:00:00.000 +0200 | Mostly Cloudy | rain | 9.377778 | 9.377778 | 0.89 | 3.9284 | 204.0 | 14.9569 | 0.0 | 1015.94 | Partly cloudy throughout the day. |
3 | 2006-04-01 03:00:00.000 +0200 | Partly Cloudy | rain | 8.288889 | 5.944444 | 0.83 | 14.1036 | 269.0 | 15.8263 | 0.0 | 1016.41 | Partly cloudy throughout the day. |
4 | 2006-04-01 04:00:00.000 +0200 | Mostly Cloudy | rain | 8.755556 | 6.977778 | 0.83 | 11.0446 | 259.0 | 15.8263 | 0.0 | 1016.51 | Partly cloudy throughout the day. |
Next we do some data preparation, building a feature matrix to predict apparent temperature from most of the remaining columns:
# select features for predicting apparent temperature
X = data.loc[:, ['Temperature (C)',
                 'Humidity', 'Wind Speed (km/h)',
                 'Wind Bearing (degrees)', 'Visibility (km)', 'Loud Cover',
                 'Pressure (millibars)']]
# one-hot encode precipitation type into 'rain' and 'snow' columns
X = X.join(pd.get_dummies(data['Precip Type']))
# add a constant column for the intercept
X['Constant'] = 1
y = data.loc[:, 'Apparent Temperature (C)']
X.head()
| Temperature (C) | Humidity | Wind Speed (km/h) | Wind Bearing (degrees) | Visibility (km) | Loud Cover | Pressure (millibars) | rain | snow | Constant |
---|---|---|---|---|---|---|---|---|---|---|
0 | 9.472222 | 0.89 | 14.1197 | 251.0 | 15.8263 | 0.0 | 1015.13 | 1 | 0 | 1 |
1 | 9.355556 | 0.86 | 14.2646 | 259.0 | 15.8263 | 0.0 | 1015.63 | 1 | 0 | 1 |
2 | 9.377778 | 0.89 | 3.9284 | 204.0 | 14.9569 | 0.0 | 1015.94 | 1 | 0 | 1 |
3 | 8.288889 | 0.83 | 14.1036 | 269.0 | 15.8263 | 0.0 | 1016.41 | 1 | 0 | 1 |
4 | 8.755556 | 0.83 | 11.0446 | 259.0 | 15.8263 | 0.0 | 1016.51 | 1 | 0 | 1 |
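As an aside, statsmodels can build the intercept column for us with sm.add_constant, which prepends a column named 'const'. A sketch of the equivalent call (the manual 'Constant' column above does the same job):
import statsmodels.api as sm
# equivalent alternative: prepend an intercept column named 'const'
X_alt = sm.add_constant(X.drop(columns='Constant'))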
We reduce the training data size, keeping roughly 10% of the rows, purely to widen the coefficient distributions for the purposes of this article.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.9)
X_train.reset_index(inplace=True, drop=True)
y_train.reset_index(inplace=True, drop=True)
X_train.shape
(9645, 10)
Statsmodels approach
With statsmodels we can apply the ordinary least squares solution to the above data to recover estimates of the model coefficients. When we fit a linear regression model, the Hessian of the squared loss (the matrix of second-order derivatives, which for OLS is XᵀX) determines how sensitive the coefficients are to changes in the data. Its inverse, scaled by the residual variance, gives the covariance matrix of the coefficient estimates, and the square roots of its diagonal give their standard deviations.
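To make this concrete, below is a minimal numpy sketch of that calculation, using a hypothetical design matrix X_mat and target y_vec (neither comes from the data above). The standard errors are the square roots of the diagonal of the residual variance times (XᵀX)⁻¹:
import numpy as np
rng = np.random.default_rng(0)
X_mat = rng.normal(size=(100, 3))                           # hypothetical design matrix
y_vec = X_mat @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)
XtX_inv = np.linalg.inv(X_mat.T @ X_mat)                    # inverse Hessian of the squared loss
beta = XtX_inv @ X_mat.T @ y_vec                            # OLS coefficient estimates
resid = y_vec - X_mat @ beta
sigma2 = resid @ resid / (X_mat.shape[0] - X_mat.shape[1])  # unbiased residual variance
se = np.sqrt(sigma2 * np.diag(XtX_inv))                     # coefficient standard errors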
import statsmodels.api as sm
results = sm.OLS(y_train, X_train).fit()
print(results.summary())
# extract coefficient means and standard errors
w_sm_mu = results.params
w_sm_std = results.bse  # sqrt(diag(results.scale * results.normalized_cov_params))
OLS Regression Results
====================================================================================
Dep. Variable: Apparent Temperature (C) R-squared: 0.990
Model: OLS Adj. R-squared: 0.990
Method: Least Squares F-statistic: 1.175e+05
Date: Thu, 11 Jun 2020 Prob (F-statistic): 0.00
Time: 00:00:29 Log-Likelihood: -14401.
No. Observations: 9645 AIC: 2.882e+04
Df Residuals: 9636 BIC: 2.888e+04
Df Model: 8
Covariance Type: nonrobust
==========================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------
Temperature (C) 1.1138 0.002 607.061 0.000 1.110 1.117
Humidity 0.9269 0.078 11.828 0.000 0.773 1.081
Wind Speed (km/h) -0.0919 0.002 -54.661 0.000 -0.095 -0.089
Wind Bearing (degrees) 0.0006 0.000 5.584 0.000 0.000 0.001
Visibility (km) -0.0082 0.003 -2.805 0.005 -0.014 -0.002
Loud Cover 2.722e-15 1.54e-15 1.764 0.078 -3.03e-16 5.75e-15
Pressure (millibars) 0.0002 9.24e-05 2.550 0.011 5.45e-05 0.000
rain -0.3603 0.186 -1.942 0.052 -0.724 0.003
snow -0.8371 0.189 -4.432 0.000 -1.207 -0.467
Constant -1.9649 0.224 -8.783 0.000 -2.403 -1.526
==============================================================================
Omnibus: 240.963 Durbin-Watson: 1.987
Prob(Omnibus): 0.000 Jarque-Bera (JB): 262.505
Skew: 0.378 Prob(JB): 9.95e-58
Kurtosis: 3.287 Cond. No. 3.57e+20
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 7.96e-32. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
The above model coefficients suggest that the apparent temperature is driven mostly by the actual temperature, with humidity contributing positively and wind speed negatively. The condition-number warning is also expected here: the Loud Cover column appears to be zero throughout the dataset (it is zero in every row shown above), which makes the design matrix rank deficient and leaves its ~1e-15 coefficient meaningless.
Bootstrapping approach
To get an empirical view of the distribution of the regression coefficients we can refit the model on many bootstrapped samples. A bootstrap sample is a random sample of the training set, drawn with replacement and of the same size as the original. If we take 1000 bootstrap samples and fit a model on each, we gain 1000 estimates of the regression coefficients.
We do this as follows:
w_bs = []
n = X_train.shape[0]
for i in range(1000):
    # draw row indices with replacement
    samp = np.random.randint(n, size=n)
    # refit the model on the bootstrap sample
    results_bs = sm.OLS(y_train.loc[samp], X_train.loc[samp, :]).fit()
    w_bs.append(results_bs.params)
w_bs = np.array(w_bs)
# summarise coefficient distributions
w_bs_mu = np.mean(w_bs, axis=0)
w_bs_std = np.std(w_bs, axis=0)
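Beyond means and standard deviations, the bootstrap samples also give percentile confidence intervals directly. As a sketch, these can be compared against the [0.025, 0.975] columns of the statsmodels summary above:
# 95% bootstrap percentile intervals for each coefficient
ci_lower, ci_upper = np.percentile(w_bs, [2.5, 97.5], axis=0)
pd.DataFrame({'lower': ci_lower, 'upper': ci_upper}, index=X_train.columns)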
Results comparison
We can now compare the coefficient distributions to the theoretical results from statsmodels.
Below we plot the coefficient means and standard deviations from both statsmodels and the bootstrapped models.
import matplotlib.pyplot as plt
plt.style.use("seaborn-whitegrid")
fig, ax = plt.subplots(ncols=2, figsize=(10, 6))
# coefficient means, indexed by feature position
ax[0].plot(range(X.shape[1]), w_sm_mu, label='statsmodels')
ax[0].plot(range(X.shape[1]), w_bs_mu, 'x', label='bootstrapped')
ax[0].set_ylabel('Mean')
# coefficient standard deviations
ax[1].plot(range(X.shape[1]), w_sm_std, label='statsmodels')
ax[1].plot(range(X.shape[1]), w_bs_std, 'x', label='bootstrapped')
ax[1].set_ylabel('Standard deviation')
plt.legend()
plt.show()
The above figures show that both the means and standard deviations are close, which suggests the theoretical coefficient distributions are accurate. The two sets of numbers should converge further with more bootstrap samples.
I would expect the bootstrapped coefficients to slightly overestimate the coefficient standard deviations: each bootstrap sample contains fewer unique data points than the full training set, so the coefficients are less tightly constrained. The results above do not give enough evidence to confirm this either way.
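For reference, the expected fraction of unique rows in a bootstrap sample of size n is 1 - (1 - 1/n)^n, which approaches 1 - 1/e ≈ 63.2% for large n. A quick simulation, reusing n from above, confirms this:
# fraction of unique rows in one bootstrap sample
samp = np.random.randint(n, size=n)
print(len(np.unique(samp)) / n)  # ≈ 0.632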
The raw result numbers are shown below:
coefficients = pd.concat([w_sm_mu,
pd.DataFrame(data=w_bs_mu, index=X_train.columns),
pd.DataFrame(data=w_sm_std, index=X_train.columns),
pd.DataFrame(data=w_bs_std, index=X_train.columns)], axis=1)
coefficients.columns = ['statsmodels_mu', 'bootstrapped_mu', 'statsmodels_std', 'bootstrapped_std']
coefficients
| statsmodels_mu | bootstrapped_mu | statsmodels_std | bootstrapped_std |
---|---|---|---|---|
Temperature (C) | 1.113766e+00 | 1.113721e+00 | 1.702782e-03 | 1.919617e-03 |
Humidity | 9.269036e-01 | 9.255212e-01 | 7.272990e-02 | 8.587625e-02 |
Wind Speed (km/h) | -9.193476e-02 | -9.207228e-02 | 1.560984e-03 | 2.599598e-03 |
Wind Bearing (degrees) | 5.787597e-04 | 5.797346e-04 | 9.620203e-05 | 1.074969e-04 |
Visibility (km) | -8.162228e-03 | -8.096304e-03 | 2.701086e-03 | 2.871113e-03 |
Loud Cover | 2.722203e-15 | 2.385971e-17 | 1.432384e-15 | 2.841603e-15 |
Pressure (millibars) | 2.355312e-04 | 2.313375e-04 | 8.573335e-05 | 9.545615e-05 |
rain | -3.603448e-01 | -3.645398e-01 | 1.721736e-01 | 1.573641e-01 |
snow | -8.370791e-01 | -8.412777e-01 | 1.753044e-01 | 1.671947e-01 |
Constant | -1.964852e+00 | -1.954745e+00 | 2.076308e-01 | 2.028931e-01 |