Statsmodels supports a variety of approaches for analyzing contingency tables, including methods for assessing independence, symmetry, and homogeneity, as well as methods for working with collections of tables from a stratified population.

The methods described here are mainly for two-way tables. Multi-way tables can be analyzed using log-linear models. Statsmodels does not currently have a dedicated API for log-linear modeling, but Poisson regression in `statsmodels.genmod.GLM` can be used for this purpose.

A contingency table is a multi-way table that describes a data set in which each observation belongs to one category for each of several variables. For example, if there are two variables, one with \(r\) levels and one with \(c\) levels, then we have an \(r \times c\) contingency table. The table can be described in terms of the number of observations that fall into a given cell of the table, e.g. \(T_{ij}\) is the number of observations that have level \(i\) for the first variable and level \(j\) for the second variable. Note that each variable must have a finite number of levels (or categories), which can be either ordered or unordered. In different contexts, the variables defining the axes of a contingency table may be called **categorical variables** or **factor variables**. They may be either **nominal** (if their levels are unordered) or **ordinal** (if their levels are ordered).

The underlying population for a contingency table is described by a **distribution table** \(P_{ij}\), where \(P_{ij}\) is the probability that an observation has level \(i\) for the first variable and level \(j\) for the second. The elements of \(P\) are probabilities, and the sum of all elements in \(P\) is 1. Methods for analyzing contingency tables use the data in \(T\) to learn about properties of \(P\).
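To make the relationship between \(T\) and \(P\) concrete, here is a small NumPy sketch that turns a table of counts into the empirical proportions, which are the natural plug-in estimate of \(P\). The counts are those of the Arthritis example printed later in this document; the variable names are illustrative only.

```python
import numpy as np

# Observed counts T (rows: Placebo, Treated; columns: Marked, None, Some),
# copied from the Arthritis table shown in this document.
T = np.array([[7, 29, 7],
              [21, 13, 7]], dtype=float)

n = T.sum()        # total number of observations
P_hat = T / n      # empirical proportions, a plug-in estimate of P

# Marginal distributions of the row and column variables.
row_marginal = P_hat.sum(axis=1)
col_marginal = P_hat.sum(axis=0)

print(P_hat.round(3))
print(row_marginal.round(3), col_marginal.round(3))
```

By construction the entries of `P_hat` sum to 1, as do each of the two marginal distributions.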

`statsmodels.stats.Table` is the most basic class for working with contingency tables. We can create a `Table` object directly from any rectangular array-like object containing the contingency table cell counts:

    In [1]: import numpy as np

    In [2]: import pandas as pd

    In [3]: import statsmodels.api as sm

    In [4]: df = sm.datasets.get_rdataset("Arthritis", "vcd").data

    In [5]: tab = pd.crosstab(df['Treatment'], df['Improved'])

    In [6]: tab = tab.loc[:, ["None", "Some", "Marked"]]

    In [7]: table = sm.stats.Table(tab)

Alternatively, we can pass the raw data and let the Table class construct the array of cell counts for us:

In [8]: table = sm.stats.Table.from_data(df[["Treatment", "Improved"]])

**Independence** is the property that the row and column factors occur independently. **Association** is the lack of independence. If the row and column factors are independent, the joint distribution can be written as the outer product of the row and column marginal distributions:

\[
P_{ij} = \sum_k P_{ik} \cdot \sum_k P_{kj} \quad \forall\, i, j
\]
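The outer-product form can be checked numerically. The following NumPy sketch estimates the row and column marginals from the Arthritis counts printed in this document, forms their outer product, and compares the implied cell counts with the observed ones through Pearson residuals. It is a plain NumPy illustration of the formula, not the statsmodels implementation, and the variable names are illustrative.

```python
import numpy as np

# Observed counts (rows: Placebo, Treated; columns: Marked, None, Some),
# copied from the Arthritis table shown in this document.
T = np.array([[7, 29, 7],
              [21, 13, 7]], dtype=float)
n = T.sum()

# Estimated marginal distributions of the row and column factors.
row = T.sum(axis=1) / n
col = T.sum(axis=0) / n

# Under independence, the joint distribution is the outer product of the marginals.
P_indep = np.outer(row, col)

# Expected cell counts under independence, and Pearson residuals.
fitted = n * P_indep
resid_pearson = (T - fitted) / np.sqrt(fitted)

# The sum of squared Pearson residuals is the Pearson chi-square statistic.
chi2 = (resid_pearson ** 2).sum()

print(fitted.round(3))
print(resid_pearson.round(3))
print(round(chi2, 3))
```

Cells with large absolute residuals are the ones that depart most from independence; here those are the "Marked" and "None" columns, while the "Some" column is close to its expected count in both rows.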

We can obtain the best-fitting independent distribution for our observed data, and then view residuals which identify particular cells that most strongly violate independence:

    In [9]: print(table.table_orig)
    Improved   Marked  None  Some
    Treatment
    Placebo         7    29     7
    Treated        21    13     7

    In [10]: print(table.fittedvalues)

© 2009–2012 Statsmodels Developers

© 2006–2008 Scipy Developers

© 2006 Jonathan E. Taylor

Licensed under the 3-clause BSD License.

http://www.statsmodels.org/stable/contingency_tables.html