/Statsmodels

# Fitting models using R-style formulas

Since version 0.5.0, `statsmodels` allows users to fit statistical models using R-style formulas. Internally, `statsmodels` uses the patsy package to convert formulas and data to the matrices that are used in model fitting. The formula framework is quite powerful; this tutorial only scratches the surface. A full description of the formula language can be found in the `patsy` docs:

```In [1]: import statsmodels.formula.api as smf

In [2]: import numpy as np

In [3]: import pandas
```

Notice that we called `statsmodels.formula.api` instead of the usual `statsmodels.api`. The `formula.api` hosts many of the same functions found in `api` (e.g. OLS, GLM), but it also holds lower case counterparts for most of these models. In general, lower case models accept `formula` and `df` arguments, whereas upper case ones take `endog` and `exog` design matrices. `formula` accepts a string which describes the model in terms of a `patsy` formula. `df` takes a pandas data frame.

`dir(smf)` will print a list of available models.

Formula-compatible models have the following generic call signature: `(formula, data, subset=None, *args, **kwargs)`

## OLS regression using formulas

To begin, we fit the linear model described on the Getting Started page. Download the data, subset columns, and list-wise delete to remove missing observations:

```In [4]: df = sm.datasets.get_rdataset("Guerry", "HistData").data

In [5]: df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()

Out[6]:
Lottery  Literacy  Wealth Region
0       41        37      73      E
1       38        51      22      N
2       66        13      61      C
3       80        46      76      E
4       79        69      83      E
```

Fit the model:

```In [7]: mod = smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df)

In [8]: res = mod.fit()

In [9]: print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable:                Lottery   R-squared:                       0.338
Method:                 Least Squares   F-statistic:                     6.636
Date:                Tue, 28 Feb 2017   Prob (F-statistic):           1.07e-05
Time:                        21:36:08   Log-Likelihood:                -375.30
No. Observations:                  85   AIC:                             764.6
Df Residuals:                      78   BIC:                             781.7
Df Model:                           6
Covariance Type:            nonrobust
===============================================================================
coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      38.6517      9.456      4.087      0.000      19.826      57.478
Region[T.E]   -15.4278      9.727     -1.586      0.117     -34.793       3.938
Region[T.N]   -10.0170      9.260     -1.082      0.283     -28.453       8.419
Region[T.S]    -4.5483      7.279     -0.625      0.534     -19.039       9.943
Region[T.W]   -10.0913      7.196     -1.402      0.165     -24.418       4.235
Literacy       -0.1858      0.210     -0.886      0.378      -0.603       0.232
Wealth          0.4515      0.103      4.390      0.000       0.247       0.656
==============================================================================
Omnibus:                        3.049   Durbin-Watson:                   1.785
Prob(Omnibus):                  0.218   Jarque-Bera (JB):                2.694
Skew:                          -0.340   Prob(JB):                        0.260
Kurtosis:                       2.454   Cond. No.                         371.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
```

## Categorical variables

Looking at the summary printed above, notice that `patsy` determined that elements of Region were text strings, so it treated Region as a categorical variable. `patsy`‘s default is also to include an intercept, so we automatically dropped one of the Region categories.

If Region had been an integer variable that we wanted to treat explicitly as categorical, we could have done so by using the `C()` operator:

```In [10]: res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df).fit()

In [11]: print(res.params)
Intercept         38.651655
C(Region)[T.E]   -15.427785
C(Region)[T.N]   -10.016961
C(Region)[T.S]    -4.548257
C(Region)[T.W]   -10.091276
Literacy          -0.185819
Wealth             0.451475
dtype: float64
```

Examples more advanced features `patsy`‘s categorical variables function can be found here: Patsy: Contrast Coding Systems for categorical variables

## Operators

We have already seen that “~” separates the left-hand side of the model from the right-hand side, and that “+” adds new columns to the design matrix.

### Removing variables

The “-” sign can be used to remove columns/variables. For instance, we can remove the intercept from a model by:

```In [12]: res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region) -1 ', data=df).fit()

In [13]: print(res.params)
C(Region)[C]    38.651655
C(Region)[E]    23.223870
C(Region)[N]    28.634694
C(Region)[S]    34.103399
C(Region)[W]    28.560379
Literacy        -0.185819
Wealth           0.451475
dtype: float64
```

### Multiplicative interactions

”:” adds a new column to the design matrix with the product of the other two columns. “*” will also include the individual columns that were multiplied together:

```In [14]: res1 = smf.ols(formula='Lottery ~ Literacy : Wealth - 1', data=df).fit()

In [15]: res2 = smf.ols(formula='Lottery ~ Literacy * Wealth - 1', data=df).fit()

In [16]: print(res1.params)
Literacy:Wealth    0.018176
dtype: float64

In [17]: print(res2.params)
```