This very simple case-study is designed to get you up-and-running quickly with `statsmodels`

. Starting from raw data, we will show the steps needed to estimate a statistical model and to draw a diagnostic plot. We will only use functions provided by `statsmodels`

or its `pandas`

and `patsy`

dependencies.

After installing statsmodels and its dependencies, we load a few modules and functions:

In [1]: from __future__ import print_function In [2]: import statsmodels.api as sm In [3]: import pandas In [4]: from patsy import dmatrices

pandas builds on `numpy`

arrays to provide rich data structures and data analysis tools. The `pandas.DataFrame`

function provides labelled arrays of (potentially heterogenous) data, similar to the `R`

“data.frame”. The `pandas.read_csv`

function can be used to convert a comma-separated values file to a `DataFrame`

object.

patsy is a Python library for describing statistical models and building Design Matrices using `R`

-like formulas.

We download the Guerry dataset, a collection of historical data used in support of Andre-Michel Guerry’s 1833 *Essay on the Moral Statistics of France*. The data set is hosted online in comma-separated values format (CSV) by the Rdatasets repository. We could download the file locally and then load it using `read_csv`

, but `pandas`

takes care of all of this automatically for us:

In [5]: df = sm.datasets.get_rdataset("Guerry", "HistData").data

The Input/Output doc page shows how to import from various other formats.

We select the variables of interest and look at the bottom 5 rows:

In [6]: vars = ['Department', 'Lottery', 'Literacy', 'Wealth', 'Region'] In [7]: df = df[vars] In [8]: df[-5:] Out[8]: Department Lottery Literacy Wealth Region 81 Vienne 40 25 68 W 82 Haute-Vienne 55 13 67 C 83 Vosges 14 62 82 E 84 Yonne 51 47 30 C 85 Corse 83 49 37 NaN

Notice that there is one missing observation in the *Region* column. We eliminate it using a `DataFrame`

method provided by `pandas`

:

In [9]: df = df.dropna() In [10]: df[-5:] Out[10]: Department Lottery Literacy Wealth Region 80 Vendee 68 28 56 W 81 Vienne 40 25 68 W 82 Haute-Vienne 55 13 67 C 83 Vosges 14 62 82 E 84 Yonne 51 47 30 C

We want to know whether literacy rates in the 86 French departments are associated with per capita wagers on the Royal Lottery in the 1820s. We need to control for the level of wealth in each department, and we also want to include a series of dummy variables on the right-hand side of our regression equation to control for unobserved heterogeneity due to regional effects. The model is estimated using ordinary least squares regression (OLS).

To fit most of the models covered by `statsmodels`

, you will need to create two design matrices. The first is a matrix of endogenous variable(s) (i.e. dependent, response, regressand, etc.). The second is a matrix of exogenous variable(s) (i.e. independent, predictor, regressor, etc.). The OLS coefficient estimates are calculated as usual:

\[\hat{\beta} = (X'X)^{-1} X'y\]

where \(y\) is an \(N \times 1\) column of data on lottery wagers per capita (*Lottery*). \(X\) is \(N \times 7\) with an intercept, the *Literacy* and *Wealth* variables, and 4 region binary variables.

The `patsy`

module provides a convenient function to prepare design matrices using `R`

-like formulas. You can find more information here.

We use `patsy`

’s `dmatrices`

function to create design matrices:

In [11]: y, X = dmatrices('Lottery ~ Literacy + Wealth + Region', data=df, return_type='dataframe')

The resulting matrices/data frames look like this:

In [12]: y[:3] Out[12]: Lottery 0 41.0 1 38.0 2 66.0 In [13]: X[:3]

© 2009–2012 Statsmodels Developers

© 2006–2008 Scipy Developers

© 2006 Jonathan E. Taylor

Licensed under the 3-clause BSD License.

http://www.statsmodels.org/stable/gettingstarted.html