
fetch_openml

Fetch dataset from openml by name or dataset id.

Datasets are uniquely identified by either an integer ID or by a combination of name and version (i.e. there might be multiple versions of the ‘iris’ dataset). Please give either name or data_id (not both). In case a name is given, a version can also be provided.
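
The "either name or data_id, not both" rule can be sketched as a small wrapper. This is an illustrative sketch only: `openml_kwargs` and `load` are hypothetical helpers, not part of scikit-learn, and the checks mirror the paragraph above rather than the library's own validation.

```python
def openml_kwargs(name=None, version=None, data_id=None):
    """Build keyword arguments for fetch_openml (hypothetical helper).

    Exactly one of `name` or `data_id` must be given; `version` is only
    accepted together with `name`.
    """
    if (name is None) == (data_id is None):
        raise ValueError("pass exactly one of name or data_id")
    if data_id is not None:
        if version is not None:
            raise ValueError("version only applies together with name")
        return {"data_id": data_id}
    kwargs = {"name": name}
    if version is not None:
        kwargs["version"] = version
    return kwargs


def load(**identifier):
    """Fetch a dataset identified by name (and version) or by data_id."""
    # Imported lazily: the call below downloads from OpenML and needs
    # network access, so nothing is fetched when this module is imported.
    from sklearn.datasets import fetch_openml

    return fetch_openml(**openml_kwargs(**identifier))
```

With this sketch, `load(name="adult", version=2)` and `load(data_id=...)` with an integer ID are both accepted, while passing a name and an ID together fails before any request is made.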

Notes

The "pandas" and "liac-arff" parsers can lead to different data types in the output. The notable differences are the following:

The "liac-arff" parser always encodes categorical features as str objects. In contrast, the "pandas" parser infers the type while reading, and numerical categories are cast to integers whenever possible.
The "liac-arff" parser uses float64 to encode numerical features tagged as ‘REAL’ and ‘NUMERICAL’ in the metadata. The "pandas" parser instead infers whether these numerical features correspond to integers and, if so, uses pandas’ Integer extension dtype.
In particular, classification datasets with integer categories are typically loaded as such (0, 1, ...) with the "pandas" parser while "liac-arff" will force the use of string encoded class labels such as "0", "1" and so on.
The "pandas" parser does not strip single quotes (') from string columns. For instance, a string 'my string' is kept as is, whereas the "liac-arff" parser strips the single quotes. For categorical columns, the single quotes are stripped from the values.
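
One way to see these differences on a concrete dataset is to load it once with each parser and compare the column dtypes. The sketch below assumes pandas is installed and network access is available when `compare_parsers` is called; both function names are hypothetical and used only for illustration.

```python
def dtype_differences(dtypes_a, dtypes_b):
    """Return {column: (dtype_a, dtype_b)} for shared columns whose dtypes differ."""
    return {
        col: (dtypes_a[col], dtypes_b[col])
        for col in dtypes_a
        if col in dtypes_b and dtypes_a[col] != dtypes_b[col]
    }


def compare_parsers(name, version):
    """Fetch one dataset with both parsers and report differing column dtypes."""
    # Imported lazily: each fetch_openml call may download data from OpenML.
    from sklearn.datasets import fetch_openml

    dtypes = {}
    for parser in ("liac-arff", "pandas"):
        bunch = fetch_openml(name, version=version, as_frame=True, parser=parser)
        # Convert the dtypes to plain strings so they compare by name.
        dtypes[parser] = bunch.frame.dtypes.astype(str).to_dict()
    return dtype_differences(dtypes["liac-arff"], dtypes["pandas"])
```

Calling `compare_parsers("adult", 2)` would return a dictionary keyed by column name. Which columns appear in it depends on the dataset's metadata, so no output is shown here.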

In addition, when as_frame=False is used, the "liac-arff" parser returns ordinally encoded data where the categories are provided in the attribute categories of the Bunch instance. In contrast, "pandas" returns a NumPy array where the categories are not encoded.
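
When working with the ordinally encoded output, the codes can be mapped back to their category values. The sketch below rests on several assumptions: `categories` maps each categorical feature name to a list in which the value encoded as i sits at index i, the data array is dense, and missing entries appear as NaN. Both helper names are hypothetical.

```python
import math


def decode_ordinal(codes, category_values):
    """Map ordinal codes back to category values; missing codes become None."""
    decoded = []
    for code in codes:
        if code is None or (isinstance(code, float) and math.isnan(code)):
            decoded.append(None)
        else:
            # Codes arrive as floats such as 2.0, so convert before indexing.
            decoded.append(category_values[int(code)])
    return decoded


def decoded_feature(bunch, feature):
    """Decode one categorical column of a Bunch fetched with
    as_frame=False and parser="liac-arff" (assumes a dense data array)."""
    column = bunch.feature_names.index(feature)
    codes = bunch.data[:, column].tolist()
    return decode_ordinal(codes, bunch.categories[feature])
```

For example, `decode_ordinal([0.0, 2.0], ["a", "b", "c"])` gives `["a", "c"]`.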

Examples

>>> from sklearn.datasets import fetch_openml
>>> adult = fetch_openml("adult", version=2)  
>>> adult.frame.info()  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             48842 non-null  int64
 1   workclass       46043 non-null  category
 2   fnlwgt          48842 non-null  int64
 3   education       48842 non-null  category
 4   education-num   48842 non-null  int64
 5   marital-status  48842 non-null  category
 6   occupation      46033 non-null  category
 7   relationship    48842 non-null  category
 8   race            48842 non-null  category
 9   sex             48842 non-null  category
 10  capital-gain    48842 non-null  int64
 11  capital-loss    48842 non-null  int64
 12  hours-per-week  48842 non-null  int64
 13  native-country  47985 non-null  category
 14  class           48842 non-null  category
dtypes: category(9), int64(6)
memory usage: 2.7 MB

Gallery examples

Release Highlights for scikit-learn 1.4

Release Highlights for scikit-learn 1.2

Release Highlights for scikit-learn 1.1

Release Highlights for scikit-learn 0.22

Categorical Feature Support in Gradient Boosting

Combine predictors using stacking

Features in Histogram Gradient Boosting Trees

Image denoising using kernel PCA

Time-related feature engineering

Forecasting of CO2 level on Mona Loa dataset using Gaussian process regression (GPR)

Early stopping of Stochastic Gradient Descent

MNIST classification using multinomial logistic + L1

Poisson regression and non-normal loss

Tweedie regression on insurance claims

Common pitfalls in the interpretation of coefficients of linear models

Partial Dependence and Individual Conditional Expectation Plots

Permutation Importance vs Random Forest Feature Importance (MDI)

Evaluation of outlier detection estimators

Introducing the set_output API

Visualizations with Display Objects

Post-hoc tuning the cut-off point of decision function

Post-tuning the decision threshold for cost-sensitive learning

Overview of multiclass training meta-estimators

Multilabel classification using a classifier chain

Approximate nearest neighbors in TSNE

Visualization of MLP weights on MNIST

Column Transformer with Mixed Types

Effect of transforming the targets in regression model

Comparing Target Encoder with Other Encoders

© 2007–2025 The scikit-learn developers
Licensed under the 3-clause BSD License.
https://scikit-learn.org/1.6/modules/generated/sklearn.datasets.fetch_openml.html