Sklearn dataset loading utilities

The scikit-learn Python library comes with a few small standard datasets that do not require downloading any file from an external website. They are bundled with the scikit-learn installation and are available through the sklearn.datasets package.

General dataset API

There are three main kinds of dataset interfaces that can be used to get datasets depending on the desired type of dataset.

  • The dataset loaders

They can be used to load small standard datasets, also called toy datasets, which I cover in more detail in the next sections of this document.

  • The dataset fetchers

They can be used to download and load larger datasets. For further information, refer to the Real world datasets section on the scikit-learn website.

  • The dataset generation functions

They can be used to generate controlled synthetic datasets. Find more information in the Generated datasets section on the scikit-learn website.

The datasets also contain a full description in their DESCR attribute, and some contain feature_names and target_names. See the dataset description below for details.
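
As a rough illustration of the three interfaces, here is a minimal sketch. It assumes a reasonably recent scikit-learn release; make_classification and fetch_california_housing are picked only as examples of a generator and a fetcher and are not named in the text above. The fetcher is mentioned in a comment rather than called, because it would download data.

```python
from sklearn.datasets import load_wine, make_classification

# 1. Loader: reads a small dataset bundled with scikit-learn (no download).
wine = load_wine()

# 2. Fetcher: functions named fetch_* (for example fetch_california_housing)
#    download the data on first use, so none is called in this sketch.

# 3. Generator: builds a synthetic dataset with controlled properties.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

print(wine.data.shape)   # (178, 13)
print(X.shape, y.shape)  # (100, 5) (100,)

# Loaders return an object that carries metadata next to the arrays.
print(wine.DESCR[:60])
print(wine.feature_names[:3])
print(wine.target_names)
```

Note that the generator returns plain arrays, while the loader returns an object with the DESCR, feature_names and target_names attributes described above.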

Toy datasets

scikit-learn comes with a few small standard datasets that do not require downloading any file from some external website.

They can be loaded using the load_* functions of sklearn.datasets, such as load_iris(), load_digits(), load_wine() and load_breast_cancer().
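
To get a feel for how small these datasets are, the sketch below calls several loaders and prints the shape of each data array. The selection of loaders is an assumption about what the installed scikit-learn version provides; older or newer releases may offer a slightly different set.

```python
from sklearn.datasets import (
    load_breast_cancer,
    load_diabetes,
    load_digits,
    load_iris,
    load_linnerud,
    load_wine,
)

loaders = [
    load_iris,
    load_diabetes,
    load_digits,
    load_linnerud,
    load_wine,
    load_breast_cancer,
]

# Call each loader and record the (rows, columns) of its data array.
shapes = {}
for loader in loaders:
    dataset = loader()
    shapes[loader.__name__] = dataset.data.shape
    print(loader.__name__, dataset.data.shape)
```

Every loader follows the same pattern: call it with no arguments and read the arrays from the returned object.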

Keep in mind that these datasets are useful for quickly illustrating the behavior of the various algorithms implemented in scikit-learn, even though they are considered too small to be representative of real-world machine learning tasks.

To load a toy dataset from the sklearn API, we call the function load_wine() and store the result in the variable wine.
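
In code, this step might look like the following sketch. The second call, with return_X_y=True, is an optional variation that is not part of the walkthrough; it returns only the feature array and the target array.

```python
from sklearn.datasets import load_wine

# Load the bundled wine dataset; no network access is needed.
wine = load_wine()

# Optional variation: get just the features and the targets as a tuple.
X, y = load_wine(return_X_y=True)
```

The rest of this walkthrough uses the wine variable.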

Next, let’s use the shape attribute of wine.data to inspect how many rows and columns it has.
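
A minimal sketch of that check follows. The variable names n_rows and n_cols are introduced here only for illustration.

```python
from sklearn.datasets import load_wine

wine = load_wine()

# The returned object itself has no shape; the arrays stored in it do.
n_rows, n_cols = wine.data.shape
print(wine.data.shape)    # (178, 13)
print(wine.target.shape)  # (178,)
```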

Looking at the dataset’s type, we notice that load_wine() returns a sklearn.utils.Bunch object. For more information about Bunch, see the scikit-learn documentation.
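
The sketch below shows one way to check the type. It also illustrates that a Bunch behaves like a dictionary whose keys can be read as attributes. The exact module path printed by type() may differ between scikit-learn versions.

```python
from sklearn.datasets import load_wine
from sklearn.utils import Bunch

wine = load_wine()

print(type(wine))
print(isinstance(wine, Bunch))  # True
print(isinstance(wine, dict))   # True

# Keys can be read either as attributes or as dictionary items.
print(wine["data"] is wine.data)  # True
```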

So far we know that our wine toy dataset consists of 178 rows and 13 columns, but we haven’t taken a first look at it yet. For that, let’s print the DESCR attribute mentioned earlier. It plays a role similar to the pandas describe() function, but it provides more detailed information about the dataset.
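
A short sketch of how the description can be printed is shown here. The variable name description is only illustrative, and the exact wording of the text depends on the installed scikit-learn version.

```python
from sklearn.datasets import load_wine

wine = load_wine()

# DESCR is a plain string holding the documentation for the dataset.
description = wine.DESCR
print(type(description).__name__)  # str
print(description)
```

Unlike describe(), which computes statistics from the data, DESCR is a fixed text that ships with the dataset.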

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

:Number of Instances: 178 (50 in each of three classes)
:Number of Attributes: 13 numeric, predictive attributes and the class
:Attribute Information:

  • Alcohol
  • Malic acid
  • Ash
  • Alcalinity of ash
  • Magnesium
  • Total phenols
  • Flavanoids
  • Nonflavanoid phenols
  • Proanthocyanins
  • Color intensity
  • Hue
  • OD280/OD315 of diluted wines
  • Proline

  • class:
    • class_0
    • class_1
    • class_2

    :Summary Statistics:

    ============================= ==== ===== ======= =====
                                  Min  Max   Mean    SD
    ============================= ==== ===== ======= =====
    Alcohol:                      11.0 14.8  13.0    0.8
    Malic Acid:                   0.74 5.80  2.34    1.12
    Ash:                          1.36 3.23  2.36    0.27
    Alcalinity of Ash:            10.6 30.0  19.5    3.3
    Magnesium:                    70.0 162.0 99.7    14.3
    Total Phenols:                0.98 3.88  2.29    0.63
    Flavanoids:                   0.34 5.08  2.03    1.00
    Nonflavanoid Phenols:         0.13 0.66  0.36    0.12
    Proanthocyanins:              0.41 3.58  1.59    0.57
    Colour Intensity:             1.3  13.0  5.1     2.3
    Hue:                          0.48 1.71  0.96    0.23
    OD280/OD315 of diluted wines: 1.27 4.00  2.61    0.71
    Proline:                      278  1680  746     315
    ============================= ==== ===== ======= =====

    :Missing Attribute Values: None
    :Class Distribution: class_0 (59), class_1 (71), class_2 (48)
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%)
    :Date: July, 1988

This is a copy of UCI ML Wine recognition datasets.
https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

The data is the results of a chemical analysis of wines grown in the same
region in Italy by three different cultivators. There are thirteen different
measurements taken for different constituents found in the three types of
wine.

Original Owners:

Forina, M. et al, PARVUS -
An Extendible Package for Data Exploration, Classification and Correlation.
Institute of Pharmaceutical and Food Analysis and Technologies,
Via Brigata Salerno, 16147 Genoa, Italy.

Citation:

Lichman, M. (2013). UCI Machine Learning Repository
[https://archive.ics.uci.edu/ml]. Irvine, CA: University of California,
School of Information and Computer Science.

.. topic:: References

(1) S. Aeberhard, D. Coomans and O. de Vel,
Comparison of Classifiers in High Dimensional Settings,
Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of
Mathematics and Statistics, James Cook University of North Queensland.
(Also submitted to Technometrics).

The data was used with many others for comparing various
classifiers. The classes are separable, though only RDA
has achieved 100% correct classification.
(RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data))
(All results using the leave-one-out technique)

(2) S. Aeberhard, D. Coomans and O. de Vel,
"THE CLASSIFICATION PERFORMANCE OF RDA"
Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. of
Mathematics and Statistics, James Cook University of North Queensland.
(Also submitted to Journal of Chemometrics).

By calling the keys() method, we get a list of the attributes available on this dataset.
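
A minimal sketch of that call follows. The exact set and order of keys is an assumption that depends on the scikit-learn version; recent releases, for instance, also include a frame key.

```python
from sklearn.datasets import load_wine

wine = load_wine()

# keys() lists the names under which the Bunch stores its contents.
keys = list(wine.keys())
print(keys)
```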

The first one, data, returns an array with the data itself contained in the dataset.
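
As a sketch, the first few rows can be inspected like this. Slicing to three rows is just a choice to keep the output short.

```python
from sklearn.datasets import load_wine

wine = load_wine()

# data is a NumPy array with one row per wine and one column per feature.
print(type(wine.data).__name__)  # ndarray
print(wine.data[:3])             # first three samples
print(wine.target[:3])           # their integer class codes
```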

The fourth one, target_names, gives us the names of the wine classes, that is, the labels of the target.
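
The sketch below prints the class names and, as an illustrative extra step, maps the integer codes in target to those names to count the samples per class. The names labels and counts are introduced here only for the example.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()

print(wine.target_names)

# target stores integer codes; index target_names to get readable labels.
labels = wine.target_names[wine.target]
names, sizes = np.unique(labels, return_counts=True)
counts = {str(name): int(size) for name, size in zip(names, sizes)}
print(counts)  # {'class_0': 59, 'class_1': 71, 'class_2': 48}
```

The counts match the class distribution listed in the dataset description.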

And the last one, feature_names, gives us the names of all the features in the dataset.
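
A small sketch that lists the feature names next to their column index in wine.data:

```python
from sklearn.datasets import load_wine

wine = load_wine()

# One name per column of wine.data, in column order.
print(len(wine.feature_names))  # 13
for index, name in enumerate(wine.feature_names):
    print(index, name)
```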

Converting a Sklearn dataset to a pandas DataFrame

Though the dataset loaded from the sklearn API is ready to be used by machine learning algorithms, it is also useful to convert it into a pandas DataFrame for other purposes.
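
One possible way to do the conversion is sketched below, assuming pandas is installed. The column name target is a choice made for this example, not something fixed by scikit-learn.

```python
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()

# Build the DataFrame from the feature array and the feature names.
df = pd.DataFrame(wine.data, columns=wine.feature_names)

# Add the integer class codes as an extra column.
df["target"] = wine.target

print(df.shape)  # (178, 14)
print(df.head())
```

In recent scikit-learn versions, load_wine(as_frame=True) is an alternative that returns the data already as pandas objects.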

Either form can serve as the starting point for our EDA or data preprocessing for machine learning. In this tutorial, we have learned how to use the pre-loaded datasets available in the sklearn API. If you are interested in cloning this notebook, you can find it in my GitHub repo, and if you need more documentation, you can find it on the sklearn site.

To be continued…