Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt

import statsmodels.api as sm

plt.rc("figure", figsize=(16,8))
plt.rc("font", size=14)

Artificial data

[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
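
The draw above is unseeded, so the simulated y (and every printed result below) will differ slightly between runs. A minimal sketch of a reproducible variant using NumPy's Generator API (the seed value 12345 is arbitrary):

rng = np.random.default_rng(12345)  # fixed seed for reproducibility
y = y_true + sig * rng.standard_normal(nsample)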

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.982
Model:                            OLS   Adj. R-squared:                  0.980
Method:                 Least Squares   F-statistic:                     816.1
Date:                Sat, 06 Feb 2021   Prob (F-statistic):           7.07e-40
Time:                        16:48:16   Log-Likelihood:                -2.3163
No. Observations:                  50   AIC:                             12.63
Df Residuals:                      46   BIC:                             20.28
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9649      0.090     55.128      0.000       4.784       5.146
x1             0.5057      0.014     36.406      0.000       0.478       0.534
x2             0.5563      0.055     10.189      0.000       0.446       0.666
x3            -0.0206      0.001    -16.886      0.000      -0.023      -0.018
==============================================================================
Omnibus:                        3.432   Durbin-Watson:                   1.756
Prob(Omnibus):                  0.180   Jarque-Bera (JB):                2.040
Skew:                          -0.254   Prob(JB):                        0.361
Kurtosis:                       2.151   Cond. No.                         221.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
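
The quantities in this table are also available as attributes on the results object, for example:

print(olsres.params)      # estimated coefficients
print(olsres.bse)         # standard errors of the coefficients
print(olsres.rsquared)    # R-squared
print(olsres.conf_int())  # 95% confidence intervals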

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.45007842  4.95791391  5.42260851  5.81384337  6.11224154  6.31255157
  6.42451031  6.47124316  6.48546465  6.50410351  6.5622354   6.68732009
  6.89468987  7.18503052  7.54426863  7.94588383  8.35526598  8.73540157
  9.05295488  9.283745    9.41671864  9.45576617  9.41908245  9.33617745
  9.24302787  9.17616624  9.16667877  9.23509791  9.38802899  9.61706496
  9.90016762 10.20528848 10.4956342  10.73571164 10.8971593  10.963407
 10.93239835 10.81692859 10.64254237 10.44333683 10.25635956 10.11552176
 10.04602673 10.06022905 10.15560477 10.31516603 10.5102522  10.70523866
 10.8633899  10.95289499]
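
In-sample, predict(X) with the original design matrix agrees with the stored fitted values; a quick sanity check:

assert np.allclose(ypred, olsres.fittedvalues)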

Create a new sample of explanatory variables Xnew, predict, and plot

[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.93835847 10.77775122 10.49288993 10.1334921   9.76500348  9.45257462
  9.24510976  9.16329341  9.19452599  9.2960087 ]
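
If uncertainty around the out-of-sample predictions is also needed, get_prediction returns point predictions together with confidence and prediction intervals; a minimal sketch:

pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))  # mean, mean_ci_*, obs_ci_* columns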

Plot comparison

[7]:
fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");
[Figure: data points, true curve, and in-/out-of-sample OLS predictions]
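
To also show uncertainty, the interval columns from summary_frame can be shaded onto the same axes. A sketch of lines one might add to the plotting cell above, assuming the pred object from the earlier sketch:

frame = pred.summary_frame(alpha=0.05)
ax.fill_between(x1n, frame["obs_ci_lower"], frame["obs_ci_upper"],
                color="r", alpha=0.2, label="95% prediction interval")
ax.legend(loc="best")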

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

[8]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate use of the identity transform, i.e., we do not want any expansion magic from using **2 inside the formula.

[9]:
res.params
[9]:
Intercept           4.964899
x1                  0.505674
np.sin(x1)          0.556318
I((x1 - 5) ** 2)   -0.020593
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are constructed automatically.

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    10.938358
1    10.777751
2    10.492890
3    10.133492
4     9.765003
5     9.452575
6     9.245110
7     9.163293
8     9.194526
9     9.296009
dtype: float64
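
As a consistency check, the formula-based predictions match the array-based out-of-sample predictions computed earlier:

assert np.allclose(res.predict(exog=dict(x1=x1n)), ynewpred)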