W3cubDocs

Sparse data structures

Note

SparseSeries and SparseDataFrame have been deprecated. Their purpose is served equally well by a Series or DataFrame with sparse values. See Migrating for tips on migrating.

Pandas provides data structures for efficiently storing sparse data. These are not necessarily sparse in the typical “mostly 0”. Rather, you can view these objects as being “compressed” where any data matching a specific value (NaN / missing value, though any value can be chosen, including 0) is omitted. The compressed values are not actually stored in the array.

In [1]: arr = np.random.randn(10)

In [2]: arr[2:-2] = np.nan

In [3]: ts = pd.Series(pd.SparseArray(arr))

In [4]: ts
Out[4]: 
0    0.469112
1   -0.282863
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8   -0.861849
9   -2.104569
dtype: Sparse[float64, nan]

Notice the dtype, Sparse[float64, nan]. The nan means that elements in the array that are nan aren’t actually stored, only the non-nan elements are. Those non-nan elements have a float64 dtype.

The sparse objects exist for memory efficiency reasons. Suppose you had a large, mostly NA DataFrame:

In [5]: df = pd.DataFrame(np.random.randn(10000, 4))

In [6]: df.iloc[:9998] = np.nan

In [7]: sdf = df.astype(pd.SparseDtype("float", np.nan))

In [8]: sdf.head()
Out[8]: 
    0   1   2   3
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN

In [9]: sdf.dtypes

© 2008–2012, AQR Capital Management, LLC, Lambda Foundry, Inc. and PyData Development Team
Licensed under the 3-clause BSD License.
https://pandas.pydata.org/pandas-docs/version/0.25.0/user_guide/sparse.html