Note
SparseSeries and SparseDataFrame have been deprecated. Their purpose is served equally well by a Series or DataFrame with sparse values. See Migrating for tips on migrating.
pandas provides data structures for efficiently storing sparse data. These are not necessarily sparse in the typical “mostly 0” sense. Rather, you can view these objects as being “compressed”: any data matching a specific value (NaN / missing value, though any value can be chosen, including 0) is omitted. The compressed values are not actually stored in the array.
In [1]: arr = np.random.randn(10)

In [2]: arr[2:-2] = np.nan

In [3]: ts = pd.Series(pd.SparseArray(arr))

In [4]: ts
Out[4]:
0    0.469112
1   -0.282863
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8   -0.861849
9   -2.104569
dtype: Sparse[float64, nan]
Notice the dtype, Sparse[float64, nan]. The nan means that elements in the array that are nan aren’t actually stored; only the non-nan elements are. Those non-nan elements have a float64 dtype.
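The two parts of the dtype can also be inspected programmatically. The following is a minimal sketch, not part of the original guide: it uses hand-picked values instead of random ones, and it assumes `pd.arrays.SparseArray`, the namespaced spelling of the `pd.SparseArray` constructor used above, which later pandas releases require. It also shows an array whose omitted value is 0 rather than nan.

```python
import numpy as np
import pandas as pd

# Illustrative data (assumed values): only four of ten elements are non-NaN.
arr = np.array([0.5, -0.3, np.nan, np.nan, np.nan,
                np.nan, np.nan, np.nan, -0.9, -2.1])
ts = pd.Series(pd.arrays.SparseArray(arr))

print(ts.dtype.subtype)      # dtype of the values that are stored
print(ts.dtype.fill_value)   # value that is omitted from storage
print(ts.sparse.sp_values)   # only the stored (non-fill) values
print(ts.sparse.density)     # fraction of elements actually stored

# Any fill value can be chosen, including 0.
zeros = pd.arrays.SparseArray([0, 0, 1, 2, 0], fill_value=0)
print(zeros.dtype)
print(zeros.sp_values)       # the zeros are not stored
```

Here `ts.sparse` is the accessor that pandas exposes on a Series with sparse values; `sp_values` and `density` report what is physically stored and how much of the array it covers.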
The sparse objects exist for memory efficiency reasons. Suppose you had a large, mostly NA DataFrame:
In [5]: df = pd.DataFrame(np.random.randn(10000, 4))

In [6]: df.iloc[:9998] = np.nan

In [7]: sdf = df.astype(pd.SparseDtype("float", np.nan))

In [8]: sdf.head()
Out[8]:
    0   1   2   3
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN

In [9]: sdf.dtypes
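One way to check the memory saving is to compare the byte counts of the dense and sparse frames. This is a hedged sketch rather than output taken from the guide: it rebuilds the same frame as above, and the variable names `dense_bytes` and `sparse_bytes` are introduced here purely for illustration. Exact byte counts depend on the pandas version and platform, so none are shown.

```python
import numpy as np
import pandas as pd

# Rebuild the mostly-NA frame: only the last two rows hold real values.
df = pd.DataFrame(np.random.randn(10000, 4))
df.iloc[:9998] = np.nan
sdf = df.astype(pd.SparseDtype("float", np.nan))

# Total bytes used by each frame, including the index.
dense_bytes = df.memory_usage().sum()
sparse_bytes = sdf.memory_usage().sum()
print("dense: ", dense_bytes)
print("sparse:", sparse_bytes)

# Fraction of elements that are actually stored in the sparse frame.
print("density:", sdf.sparse.density)
```

The sparse frame stores only the eight non-NA values plus their positions, so its footprint should be a small fraction of the dense one.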
© 2008–2012, AQR Capital Management, LLC, Lambda Foundry, Inc. and PyData Development Team
Licensed under the 3-clause BSD License.
https://pandas.pydata.org/pandas-docs/version/0.25.0/user_guide/sparse.html