The scikit-learn Python library comes with a few small standard datasets that do not require downloading any file from an external website, as they ship with the scikit-learn installation and are available through the sklearn.datasets package.
General dataset API
There are three main kinds of dataset interfaces that can be used to get datasets depending on the desired type of dataset.
- The dataset loaders
They can be used to load small standard datasets, also called toy datasets, which I cover in more detail in the following sections of this document.
- The dataset fetchers
They can be used to download and load larger datasets. For further information, refer to the Real world datasets section on the scikit-learn website.
- The dataset generation functions
They can be used to generate controlled synthetic datasets. Find more information in the Generated datasets section on the scikit-learn website.
Each dataset also contains a full description in its DESCR attribute, and some contain feature_names and target_names. See the dataset descriptions below for details.
Toy datasets
scikit-learn comes with a few small standard datasets that do not require downloading any file from some external website.
They can be loaded using the following functions:
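As a rough sketch, the loaders I am confident exist in recent scikit-learn releases are load_iris, load_diabetes, load_digits, load_linnerud, load_wine and load_breast_cancer (the exact set may differ between versions). The snippet below, which assumes scikit-learn is installed, calls each one and prints the shape of its data array; the names `loaders` and `shapes` are only illustrative.

```python
from sklearn.datasets import (
    load_breast_cancer,
    load_diabetes,
    load_digits,
    load_iris,
    load_linnerud,
    load_wine,
)

# Map a readable name to each toy-dataset loader (illustrative helper dict).
loaders = {
    "load_iris": load_iris,
    "load_diabetes": load_diabetes,
    "load_digits": load_digits,
    "load_linnerud": load_linnerud,
    "load_wine": load_wine,
    "load_breast_cancer": load_breast_cancer,
}

# Call every loader and record the (rows, columns) shape of its data array.
shapes = {name: loader().data.shape for name, loader in loaders.items()}

for name, shape in shapes.items():
    print(f"{name:20s} {shape}")
```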
Keep in mind that these datasets are useful for quickly illustrating the behavior of the various algorithms implemented in scikit-learn, even though they are often too small to be representative of real-world machine learning tasks.
To load a toy dataset from the sklearn API, we call the load_wine() function and store the result in the variable wine.
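The loading step described above can be sketched as follows, assuming scikit-learn is installed:

```python
from sklearn.datasets import load_wine

# Load the wine toy dataset that ships with scikit-learn.
wine = load_wine()
```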
Next, let’s use the shape attribute of its data array to inspect how many rows and columns it has.
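A minimal sketch of this check. Note that shape lives on the NumPy arrays inside the returned object (wine.data and wine.target), not on the object itself:

```python
from sklearn.datasets import load_wine

wine = load_wine()

# Feature matrix: one row per wine sample, one column per measurement.
print(wine.data.shape)    # (178, 13)

# Target vector: one class label per sample.
print(wine.target.shape)  # (178,)
```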
Looking at the dataset’s type, we see that load_wine() returns a sklearn.utils.Bunch object. For more information about Bunch, see the scikit-learn documentation.
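A short sketch of inspecting the returned type. A Bunch behaves like a dictionary whose keys are also exposed as attributes, so both access styles below return the same array. The exact module path printed by type() may vary between scikit-learn versions, so no output is shown for it.

```python
from sklearn.datasets import load_wine
from sklearn.utils import Bunch

wine = load_wine()

# Show the concrete type of the returned container.
print(type(wine))

# Bunch is a dict subclass: keys can be read as items or as attributes.
print(isinstance(wine, Bunch))
print(wine["data"] is wine.data)
```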
So far we know that our wine toy dataset is comprised of 178 rows and 13 columns, but we haven’t taken a first look at it yet. For that, let’s use the DESCR attribute mentioned earlier. It plays a role loosely similar to the pandas describe() function, but it is a prewritten text description that provides more detailed information about the dataset.
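A minimal sketch of printing the description. DESCR is a plain string, so print() renders it with its line breaks; the exact wording can differ slightly between scikit-learn versions.

```python
from sklearn.datasets import load_wine

wine = load_wine()

# DESCR holds the full text description bundled with the dataset.
print(wine.DESCR)
```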
.. _wine_dataset:
Wine recognition dataset
------------------------
**Data Set Characteristics:**
:Number of Instances: 178 (50 in each of three classes)
:Number of Attributes: 13 numeric, predictive attributes and the class
:Attribute Information:
- Alcohol
- Malic acid
- Ash
- Alcalinity of ash
- Magnesium
- Total phenols
- Flavanoids
- Nonflavanoid phenols
- Proanthocyanins
- Color intensity
- Hue
- OD280/OD315 of diluted wines
- Proline
- class:
- class_0
- class_1
- class_2
:Summary Statistics:
============================= ==== ===== ======= =====
                              Min  Max   Mean    SD
============================= ==== ===== ======= =====
Alcohol:                      11.0 14.8  13.0    0.8
Malic Acid:                   0.74 5.80  2.34    1.12
Ash:                          1.36 3.23  2.36    0.27
Alcalinity of Ash:            10.6 30.0  19.5    3.3
Magnesium:                    70.0 162.0 99.7    14.3
Total Phenols:                0.98 3.88  2.29    0.63
Flavanoids:                   0.34 5.08  2.03    1.00
Nonflavanoid Phenols:         0.13 0.66  0.36    0.12
Proanthocyanins:              0.41 3.58  1.59    0.57
Colour Intensity:             1.3  13.0  5.1     2.3
Hue:                          0.48 1.71  0.96    0.23
OD280/OD315 of diluted wines: 1.27 4.00  2.61    0.71
Proline:                      278  1680  746     315
============================= ==== ===== ======= =====
:Missing Attribute Values: None
:Class Distribution: class_0 (59), class_1 (71), class_2 (48)
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988
This is a copy of UCI ML Wine recognition datasets.
https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data
The data is the results of a chemical analysis of wines grown in the same
region in Italy by three different cultivators. There are thirteen different
measurements taken for different constituents found in the three types of
wine.
Original Owners:
Forina, M. et al, PARVUS -
An Extendible Package for Data Exploration, Classification and Correlation.
Institute of Pharmaceutical and Food Analysis and Technologies,
Via Brigata Salerno, 16147 Genoa, Italy.
Citation:
Lichman, M. (2013). UCI Machine Learning Repository
[https://archive.ics.uci.edu/ml]. Irvine, CA: University of California,
School of Information and Computer Science.
.. topic:: References
(1) S. Aeberhard, D. Coomans and O. de Vel,
Comparison of Classifiers in High Dimensional Settings,
Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of
Mathematics and Statistics, James Cook University of North Queensland.
(Also submitted to Technometrics).
The data was used with many others for comparing various
classifiers. The classes are separable, though only RDA
has achieved 100% correct classification.
(RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data))
(All results using the leave-one-out technique)
(2) S. Aeberhard, D. Coomans and O. de Vel,
"THE CLASSIFICATION PERFORMANCE OF RDA"
Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. of
Mathematics and Statistics, James Cook University of North Queensland.
(Also submitted to Journal of Chemometrics).
By calling the keys() method, we get the list of attributes available on this dataset.
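A small sketch of listing the keys. The exact set and order depend on the installed scikit-learn version (for example, newer releases also include a frame key), so no output is shown here.

```python
from sklearn.datasets import load_wine

wine = load_wine()

# Bunch is dict-like, so keys() lists every available attribute name.
keys = list(wine.keys())
print(keys)
```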
The first one, data, returns an array with the data itself.
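As an illustration, the data array can be inspected like any other NumPy array; the variable name X is just a common convention, not something scikit-learn requires.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()

# The feature matrix is a plain NumPy array of floats.
X = wine.data
print(type(X), X.dtype, X.shape)

# Peek at the first three samples (rows).
print(X[:3])
```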
The fourth one, target_names, gives us the names of the wine classes, that is, the labels of the target.
And the last one, feature_names, gives the names of all features in the dataset.
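A brief sketch of reading both name attributes, together with a count of samples per class that matches the class distribution given in the description above:

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()

# Names of the three wine classes (the target labels).
class_names = list(wine.target_names)
print(class_names)  # ['class_0', 'class_1', 'class_2']

# Names of the 13 measured features, in column order.
feature_names = list(wine.feature_names)
print(feature_names)

# Number of samples in each class, indexed by class id.
class_counts = np.bincount(wine.target).tolist()
print(class_counts)  # [59, 71, 48]
```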
Converting a scikit-learn dataset to a pandas DataFrame
Though the dataset loaded from the sklearn API is ready to be used by machine learning algorithms, it is also useful to convert it into a pandas DataFrame for other purposes.
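Two possible ways to do the conversion are sketched below; this is not necessarily the exact code from the original notebook. The first builds the DataFrame by hand from data and feature_names. The second relies on the as_frame=True option, which assumes scikit-learn 0.23 or later and pandas installed. The column name "target" in the manual version is my own choice.

```python
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()

# Option 1: build the DataFrame manually from the arrays in the Bunch.
df = pd.DataFrame(wine.data, columns=wine.feature_names)
df["target"] = wine.target  # append the class label as an extra column
print(df.head())

# Option 2: let scikit-learn build it (needs scikit-learn >= 0.23).
wine_frame = load_wine(as_frame=True).frame
print(wine_frame.shape)
```

With the data in a DataFrame, calls such as df.describe() or df.groupby("target").mean() become available for exploration.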
Either way, the result will be the starting point for our EDA or data preprocessing for machine learning. In this tutorial, we have learned how to use the pre-loaded datasets available in the sklearn API. If you are interested in cloning this notebook, you can find it in my GitHub repo, and if you need more documentation, you can find it on the scikit-learn site.