New in version 0.15.

Note

While there was `pandas.Categorical` in earlier versions, the ability to use categorical data in `Series` and `DataFrame` is new.

This is an introduction to the pandas categorical data type, including a short comparison with R’s `factor`.

`Categoricals` are a pandas data type corresponding to categorical variables in statistics: variables that can take on only a limited, and usually fixed, number of possible values (`categories`; `levels` in R). Examples are gender, social class, blood types, country affiliations, observation time or ratings via Likert scales.

In contrast to statistical categorical variables, categorical data might have an order (e.g. ‘strongly agree’ vs ‘agree’ or ‘first observation’ vs. ‘second observation’), but numerical operations (additions, divisions, …) are not possible.

All values of categorical data are either in `categories` or `np.nan`. Order is defined by the order of `categories`, not the lexical order of the values. Internally, the data structure consists of a `categories` array and an integer array of `codes` which point to the real value in the `categories` array.
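The two internal arrays can be inspected through the `.cat` accessor. The following is a minimal sketch; the sample values are made up for illustration, and one missing value is included to show how it appears in the codes.

```python
import numpy as np
import pandas as pd

# Illustrative data: two distinct values and one missing value.
s = pd.Series(["b", "a", np.nan, "b"], dtype="category")

# The distinct values, in category order.
categories = list(s.cat.categories)

# Integer positions into `categories`; a missing value is stored as -1.
codes = s.cat.codes.tolist()

print(categories)  # ['a', 'b']
print(codes)       # [1, 0, -1, 1]
```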

The categorical data type is useful in the following cases:

- A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory.
- The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order.
- As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
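The first two cases can be sketched as follows. The sample words and the repetition factor are arbitrary choices for illustration, and the exact byte counts depend on the platform and pandas version, so only the comparison between the two is meaningful.

```python
import pandas as pd

words = ["three", "one", "two", "one"]

# Case 1: few distinct strings repeated many times.
as_object = pd.Series(words * 1000, dtype=object)
as_category = as_object.astype("category")
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(category_bytes < object_bytes)  # True

# Case 2: logical order differs from lexical order.
ordered = pd.Series(
    pd.Categorical(words, categories=["one", "two", "three"], ordered=True)
)
lexical = sorted(words)                        # plain string sort
logical = ordered.sort_values().tolist()       # follows the category order
smallest, largest = ordered.min(), ordered.max()

print(lexical)            # ['one', 'one', 'three', 'two']
print(logical)            # ['one', 'one', 'two', 'three']
print(smallest, largest)  # one three
```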

See also the API docs on categoricals.

Categorical `Series` or columns in a `DataFrame` can be created in several ways:

By specifying `dtype="category"` when constructing a `Series`:

    In [1]: s = pd.Series(["a","b","c","a"], dtype="category")

    In [2]: s
    Out[2]: 
    0    a
    1    b
    2    c
    3    a
    dtype: category
    Categories (3, object): [a, b, c]

By converting an existing `Series` or column to a `category` dtype:

    In [3]: df = pd.DataFrame({"A":["a","b","c","a"]})

    In [4]: df["B"] = df["A"].astype('category')

    In [5]: df
    Out[5]: 
       A  B
    0  a  a
    1  b  b
    2  c  c
    3  a  a

By using some special functions:

    In [6]: df = pd.DataFrame({'value': np.random.randint(0, 100, 20)})

    In [7]: labels = [ "{0} - {1}".format(i, i + 9) for i in range(0, 100, 10) ]

    In [8]: df['group'] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)

    In [9]: df.head(10)
    Out[9]: 
       value    group
    0     65  60 - 69
    1     49  40 - 49
    2     56  50 - 59
    3     43  40 - 49
    4     43  40 - 49
    5     91  90 - 99
    6     32  30 - 39
    7     87  80 - 89
    8     36  30 - 39
    9      8    0 - 9

See documentation for `cut()`.
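Because the session above draws random values, its output differs from run to run. A deterministic sketch of the same binning, using three hand-picked values (an illustrative choice, not part of the original example), makes it easier to check that `cut()` returns categorical data whose categories follow the bin order:

```python
import pandas as pd

# Hand-picked values instead of random ones, so the result is reproducible.
values = pd.Series([5, 15, 95])
labels = ["{0} - {1}".format(i, i + 9) for i in range(0, 100, 10)]

# Bin edges 0, 10, ..., 100; right=False makes each bin closed on the left.
bins = list(range(0, 105, 10))
groups = pd.cut(values, bins, right=False, labels=labels)

print(groups.tolist())     # ['0 - 9', '10 - 19', '90 - 99']
print(groups.dtype.name)   # category
print(groups.cat.ordered)  # True
```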

By passing a `pandas.Categorical` object to a `Series` or assigning it to a `DataFrame`.

    In [10]: raw_cat = pd.Categorical(["a","b","c","a"], categories=["b","c","d"],
       ....:                          ordered=False)
       ....: 

    In [11]: s = pd.Series(raw_cat)

    In [12]: s
    Out[12]: 
    0    NaN
    1      b
    2      c
    3    NaN
    dtype: category
    Categories (3, object): [b, c, d]

    In [13]: df = pd.DataFrame({"A":["a","b","c","a"]})

    In [14]: df["B"] = raw_cat

    In [15]: df
    Out[15]: 
       A    B
    0  a  NaN
    1  b    b
    2  c    c
    3  a  NaN

You can also specify differently ordered categories or make the resulting data ordered, by passing these arguments to `astype()`:

    In [16]: s = pd.Series(["a","b","c","a"])

    In [17]: s_cat = s.astype("category", categories=["b","c","d"], ordered=False)

    In [18]: s_cat
    Out[18]: 
    0    NaN
    1      b
    2      c
    3    NaN
    dtype: category
    Categories (3, object): [b, c, d]
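The `categories` and `ordered` keywords to `astype()` belong to the pandas version documented on this page. As an assumption about later pandas releases (not something this page states), the same result can be obtained by building a `CategoricalDtype` from `pandas.api.types` and passing it to `astype()`. A minimal sketch:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

s = pd.Series(["a", "b", "c", "a"])

# Describe the target categories and orderedness in a dtype object.
dtype = CategoricalDtype(categories=["b", "c", "d"], ordered=False)
s_cat = s.astype(dtype)

# "a" is not among the categories, so those positions become missing.
print(s_cat.isna().tolist())       # [True, False, False, True]
print(list(s_cat.cat.categories))  # ['b', 'c', 'd']
```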

Categorical data has a specific `category` dtype:

    In [19]: df.dtypes
    Out[19]: 
    A      object
    B    category
    dtype: object

Note

In contrast to R’s `factor` function, categorical data does not convert input values to strings; the categories keep the same data type as the original values.
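A small sketch with integer input (the values are arbitrary) shows that the categories stay integers rather than becoming strings:

```python
import numpy as np
import pandas as pd

ints = pd.Series([3, 1, 2, 1], dtype="category")

# The categories index keeps an integer dtype.
cat_dtype = ints.cat.categories.dtype
cat_values = list(ints.cat.categories)

print(np.issubdtype(cat_dtype, np.integer))  # True
print(cat_values)                            # [1, 2, 3]
```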

Note

In contrast to R’s `factor` function, there is currently no way to assign/change labels at creation time. Use `categories` to change the categories after creation time.
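One way to relabel after creation is the `rename_categories` method on the `.cat` accessor, used here instead of assigning to `categories` directly. This is a sketch, and the new labels are placeholders chosen for illustration:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Replace the labels position by position: a -> x, b -> y, c -> z.
renamed = s.cat.rename_categories(["x", "y", "z"])

print(renamed.tolist())              # ['x', 'y', 'z', 'x']
print(list(renamed.cat.categories))  # ['x', 'y', 'z']
```

The call returns a new `Series`; the original `s` keeps its old labels.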

To get back to the original `Series` or `numpy` array, use `Series.astype(original_dtype)` or `np.asarray(categorical)`:

    In [20]: s = pd.Series(["a","b","c","a"])

    In [21]: s
    Out[21]: 
    0    a
    1    b
    2    c
    3    a
    dtype: object

    In [22]: s2 = s.astype('category')

    In [23]: s2
    Out[23]: 
    0    a
    1    b
    2    c
    3    a
    dtype: category
    Categories (3, object): [a, b, c]

    In [24]: s2.astype(str)
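Both conversions described above can be sketched in a self-contained script. Reading the target dtype from the original `Series` via `s.dtype`, rather than hard-coding it, is a choice made for this illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "c", "a"])
s2 = s.astype("category")

# Back to a Series with the dtype of the original data.
restored = s2.astype(s.dtype)

# Or to a plain NumPy array of the underlying values.
as_array = np.asarray(s2)

print(restored.tolist())  # ['a', 'b', 'c', 'a']
print(as_array.tolist())  # ['a', 'b', 'c', 'a']
```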

© 2008–2012, AQR Capital Management, LLC, Lambda Foundry, Inc. and PyData Development Team

Licensed under the 3-clause BSD License.

http://pandas.pydata.org/pandas-docs/version/0.20.2/categorical.html