In this notebook, we present some basic ideas of numpy, matplotlib, and statsmodels. First, import pylab.
%pylab inline
The %pylab command above imports most of numpy into the global namespace, so we don't have to type "from numpy import *". The inline option makes plots appear in the notebook itself rather than in a separate pop-up window.
# a quick check that numpy and pylab are loaded
a = arange(10)
plot(a,a**1.5)
# NYC monthly average temperatures (deg F) from weather.com; search 'average temperature in new york city'
nyc = array([30, 32, 40, 50, 61, 70, 76, 74, 66, 55, 45, 35])
plot(nyc)
plot(nyc,'*')
y = nyc
months = arange(12)+0.5 #offset to middle of each month
X = array([ones(12), cos(2*pi*months/12), sin(2*pi*months/12)]).transpose()
X
We present four solutions to the linear regression problem. The first two solutions are straightforward, but the latter two are better behaved numerically and are generally preferred.
The straightforward solution \({\hat \beta} = (X^T X)^{-1}X^Ty\) is easy to compute, but for larger or poorly conditioned problems the QR or SVD methods should be used instead. The main problem with the straightforward solution is that forming \(X^T X\) squares the condition number of the problem, causing a loss of numerical accuracy (especially for large \(X\)). Better is to use the SVD algorithm (numpy.linalg.lstsq) or the QR algorithm (statsmodels.OLS).
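A small sketch of the conditioning problem, written with explicit numpy imports so it runs outside the %pylab session. The design matrix below is a made-up example with two nearly collinear columns; the point is that the condition number of \(X^TX\) is roughly the square of that of \(X\), so the normal equations can lose about half of the available digits of accuracy.

```python
import numpy as np

# Made-up design matrix with two nearly collinear columns
rng = np.random.default_rng(0)
x = rng.standard_normal(100)
X = np.column_stack([np.ones(100), x, x + 1e-6 * rng.standard_normal(100)])

# cond(X^T X) is roughly cond(X)**2
print(np.linalg.cond(X))        # large
print(np.linalg.cond(X.T @ X))  # roughly the square of the above
```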
Recall that matrix multiplication in numpy is done with the dot function; X.T is the transpose of X, and inv computes the inverse of a square matrix.
betahat = dot(inv(dot(X.T, X)),dot(X.T,y))
betahat
Slightly better is to use the numpy.linalg.solve function to solve the normal equations. As a general rule, it is better to "solve" equations than to compute the inverse and multiply by the inverse (as we did above). I.e., solve the normal equations \((X^T X) {\hat\beta} = X^Ty\) for \({\hat\beta}\).
import numpy.linalg as la
betahat = la.solve(dot(X.T,X), dot(X.T,y))
betahat
The standard numpy routine for least squares linear regression is numpy.linalg.lstsq. It comes with numpy, so nothing else needs to be installed, and it uses the SVD method.
betahat, resid, rnk, singvals = la.lstsq(X, y)
print(betahat, resid, rnk, singvals)
We'll get the same numbers for betahat and resid from statsmodels below. rnk=3=m means X has full column rank. Singular values well above 0 mean the problem is numerically well behaved; be careful when any singular value is close to 0. (We originally used the name "rank", but %pylab imports numpy's rank function, so we changed it to "rnk". Next time we'll change the %pylab statement at the top to avoid this kind of name collision.)
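The singular values returned by lstsq give a quick diagnostic: their largest-to-smallest ratio is the condition number of \(X\). A self-contained check on our design matrix (explicit imports, so it runs outside %pylab; rcond=None silences the warning on newer numpy):

```python
import numpy as np

months = np.arange(12) + 0.5
X = np.column_stack([np.ones(12),
                     np.cos(2*np.pi*months/12),
                     np.sin(2*np.pi*months/12)])
nyc = np.array([30, 32, 40, 50, 61, 70, 76, 74, 66, 55, 45, 35])

betahat, resid, rnk, singvals = np.linalg.lstsq(X, nyc, rcond=None)
print(rnk)                         # 3: full column rank
print(singvals[0] / singvals[-1]) # condition number of X; sqrt(2) here
```

The columns of X are mutually orthogonal over a full period, so \(X^TX\) is diag(12, 6, 6) and the condition number is \(\sqrt{12/6} = \sqrt{2}\): about as well behaved as a regression problem gets.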
There is a newer and bigger linear regression module called "statsmodels". statsmodels contains lots of linear regression code (and lots of other statistical models as well). But it may not be installed by default. It depends on your distribution.
model.fit(method='pinv') is the default; it uses the Moore–Penrose pseudoinverse, which for full-rank \(X\) equals \((X^TX)^{-1}X^T\). You can also use method='qr'. There's no compelling reason to use QR in this small, numerically well-behaved example, but we'll do it anyway. All of this is documented in help(sm.OLS).
import statsmodels.api as sm
model = sm.OLS(y,X)
results = model.fit(method='qr')
print(results.params)
print(results.rsquared)
Plot the results. Pretty up the picture a bit.
a, b, c = results.params
dates = linspace(0,12,101) #to produce smooth curve
fitted = a + b*cos(2*pi*dates/12) + c*sin(2*pi*dates/12)
plot(months,y,'*', dates, fitted)
axhline(y=a)
xlabel('Months')
ylabel('Average Temperature in NYC')
savefig('nyc.pdf') #comment out if you don't want to save figure
results.summary()
residuals = y - dot(X,results.params)
print(residuals)
print(sum(residuals**2))
4.02 is the same residual sum of squares that la.lstsq returned earlier.
Let's see how the standard error for 'const' is determined. The estimated covariance matrix of \({\hat\beta}\) is \(\widehat{\operatorname{Cov}}({\hat\beta}) = \frac{\|y-{\hat y}\|^2}{n-m}\,(X^T X)^{-1}\), and the standard errors are the square roots of its diagonal. Note that \(X^TX\) is diagonal here, so its inverse is also diagonal.
d = diag(dot(X.T, X)) #get main diagonal
squarederror = sum(residuals**2)
stderr = sqrt(squarederror/(d[0]*(12-3)))
print(stderr)
For the vast majority of problems, my suggestion is to use statsmodels with the QR method. QR is accurate and fast (possibly slightly slower than la.solve, but only slightly), and statsmodels reports lots of additional information as well. The SVD is generally the most accurate method, but it is also the most time consuming; for almost all problems, QR offers the best balance of speed and accuracy.
Charles Boncelet, 29 April 2014