statsmodels.stats.correlation_tools.corr_thresholded(data, minabs=None, max_elt=10000000.0)
Construct a sparse matrix containing the thresholded row-wise correlation matrix from a data array.
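Parameters:
data – The data array from which the row-wise thresholded correlation matrix is computed.
minabs – The threshold. Correlation coefficients smaller than minabs in magnitude are set to zero. If None, defaults to 1 / sqrt(n).
max_elt – The maximum number of values in any intermediate matrix constructed during the calculation.
Returns:
cormat – The thresholded correlation matrix, in COO format.
Return type:
sparse.coo_matrix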
This is an alternative to C = np.corrcoef(data); C *= (np.abs(C) >= minabs), suitable for very tall data matrices.
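For reference, the dense computation mentioned above can be written out as follows. This is a small sketch on made-up random data, using only NumPy; the array shape and the choice of threshold are assumptions for illustration. It is practical only when the number of rows is small enough for the full square matrix to fit in memory.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((6, 200))   # 6 rows (variables), 200 observations each

minabs = 1 / np.sqrt(data.shape[1])    # illustrative threshold, 1 / sqrt(n)

C = np.corrcoef(data)                  # dense row-wise correlation matrix, shape (6, 6)
C *= (np.abs(C) >= minabs)             # zero every entry below the threshold in magnitude
```

With 100,000 rows the dense matrix C would hold 10^10 values, which is the case the sparse, thresholded version is meant to handle.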
If the data are jointly Gaussian, the marginal sampling distributions of the elements of the sample correlation matrix are approximately Gaussian with standard deviation 1 / sqrt(n) when the population correlation is zero, where n is the number of observations per row (the number of columns of data). The default value of minabs is thus equal to 1 standard error, which will set to zero approximately 68% of the estimated correlation coefficients for which the population value is zero.
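The 68% figure can be checked with a quick simulation. The sketch below is not part of the library; it assumes independent Gaussian rows (so every population correlation is zero), and the sizes n and p are arbitrary choices. It measures the fraction of off-diagonal sample correlations that fall below 1 / sqrt(n) in magnitude.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400   # observations per row
p = 200   # number of independent rows, so every population correlation is 0

data = rng.standard_normal((p, n))
C = np.corrcoef(data)

# Off-diagonal sample correlations, each pair of rows counted once
r = C[np.triu_indices(p, k=1)]

minabs = 1 / np.sqrt(n)
frac_zeroed = np.mean(np.abs(r) < minabs)   # fraction that thresholding would zero out
print(frac_zeroed)
```

The printed fraction should be close to 0.68, the probability that a Gaussian variable falls within one standard deviation of its mean.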
No intermediate matrix with more than max_elt values will be constructed. However, memory use could still be high if a large number of correlation values exceed minabs in magnitude.
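One way to respect such a memory bound is to compute the correlations in blocks of rows and keep only the entries that pass the threshold. The sketch below illustrates that idea with NumPy; the function name thresholded_corr_triplets is hypothetical, and this is not claimed to be how statsmodels implements corr_thresholded. It assumes no row of the data is constant, and it always processes at least one row per block.

```python
import numpy as np

def thresholded_corr_triplets(data, minabs, max_elt=1e7):
    """Hypothetical helper: (row, col, value) triplets of row-wise
    correlations whose magnitude is at least minabs."""
    data = np.asarray(data, dtype=float)
    nrow, ncol = data.shape

    # Standardize each row so the dot product of two rows is their correlation.
    z = data - data.mean(axis=1, keepdims=True)
    z /= np.sqrt((z ** 2).sum(axis=1, keepdims=True))

    # Rows per block, so a (block x nrow) slab holds about max_elt values at most.
    block = max(1, int(max_elt // nrow))

    rows, cols, vals = [], [], []
    for start in range(0, nrow, block):
        slab = z[start:start + block] @ z.T        # correlations for this block of rows
        i, j = np.nonzero(np.abs(slab) >= minabs)  # positions that survive thresholding
        rows.append(i + start)
        cols.append(j)
        vals.append(slab[i, j])

    return np.concatenate(rows), np.concatenate(cols), np.concatenate(vals)

rng = np.random.default_rng(0)
x = rng.standard_normal((50, 30))
rows, cols, vals = thresholded_corr_triplets(x, 0.3, max_elt=200)
```

The returned triplets are the only storage that grows with the number of retained correlations, which is why a low threshold can still lead to high memory use.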
The thresholded matrix is returned in COO format, which can easily be converted to other sparse formats.
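As an illustration of such conversions, the sketch below builds a small COO matrix by hand as a stand-in for the function's return value (the entries are made up) and converts it with the standard scipy.sparse methods.

```python
import numpy as np
from scipy import sparse

# Stand-in for a thresholded correlation matrix; values are illustrative only.
row = np.array([0, 0, 1, 1, 2])
col = np.array([0, 1, 0, 1, 2])
val = np.array([1.0, 0.45, 0.45, 1.0, 1.0])
cmat = sparse.coo_matrix((val, (row, col)), shape=(3, 3))

csr = cmat.tocsr()       # suited to row slicing and matrix-vector products
csc = cmat.tocsc()       # suited to column slicing
dense = cmat.toarray()   # only sensible when the matrix is small
```

COO is convenient for construction, while CSR or CSC is usually the better choice for arithmetic and indexing.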
Here x is a tall data matrix (in practice it might have, e.g., 100,000 rows and 50 columns; a small 100 x 10 array is used below). The row-wise correlation matrix of x is calculated and stored in sparse form, with all entries smaller than 0.3 in magnitude treated as 0.
>>> import numpy as np
>>> from statsmodels.stats.correlation_tools import corr_thresholded
>>> np.random.seed(1234)
>>> b = 1.5 - np.random.rand(10, 1)
>>> x = np.random.randn(100,1).dot(b.T) + np.random.randn(100,10)
>>> cmat = corr_thresholded(x, 0.3)
© 2009–2012 Statsmodels Developers
© 2006–2008 Scipy Developers
© 2006 Jonathan E. Taylor
Licensed under the 3-clause BSD License.
http://www.statsmodels.org/stable/generated/statsmodels.stats.correlation_tools.corr_thresholded.html