UCL Winners Exploratory Data Analysis

https://es.uefa.com/uefachampionsleague/news/0258-0e9533031260-37a6195f5f8a-1000--donde-ver-la-final-de-la-champions/

With this year's winner of the UEFA Champions League, Europe's most prestigious club tournament, about to be decided, let's recap which teams have won the most editions of the competition throughout its history.

For this analysis we will use data obtained from Wikipedia covering the finals of the European club championship since its inception in 1955. I have already parsed and cleaned this data and updated it with the latest records; the dataset and the notebook for this analysis are available in my GitHub repo.

Data Acquisition

First we create a folder in our root directory to store the dataset, using the os package to create it only if the path does not already exist. Then we use the urllib package to download the CSV file from base_url and save it in that folder, after first checking that the file is not already there.
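
The steps above can be sketched roughly as follows. This is a minimal sketch, not the notebook's actual code: the function name fetch_dataset, the folder name "datasets", the file name and the placeholder URL are all assumptions made for illustration.

```python
import os
import urllib.request


def fetch_dataset(base_url, filename, data_dir="datasets"):
    """Download base_url + filename into data_dir, skipping work already done."""
    # Create the target folder only if it does not exist yet.
    if not os.path.isdir(data_dir):
        os.makedirs(data_dir)
    file_path = os.path.join(data_dir, filename)
    # Download only when the file is not already on disk.
    if not os.path.isfile(file_path):
        urllib.request.urlretrieve(base_url + filename, file_path)
    return file_path


# Hypothetical call; replace the placeholder URL and file name with the real ones:
# csv_path = fetch_dataset("https://example.com/data/", "ucl_winners.csv")
```

Because both checks run before any work is done, calling the function a second time is harmless: the folder is reused and the file is not downloaded again.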

Once we have our dataset, we convert it into a DataFrame with the read_csv command.
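
As a rough illustration of that call, the sketch below reads a CSV from an in-memory string so that it runs without the downloaded file. The five rows are a handful of real finals, not the full dataset, and the column names and the variable name df are assumptions; with the real file you would pass its path to read_csv instead.

```python
import io

import pandas as pd

# Stand-in for the downloaded file: a few finals only, with assumed column names.
sample_csv = io.StringIO(
    "Season,Nation,Winners,Score,Runners-up,Runner-UpNation\n"
    "2012-13,Germany,Bayern Munich,2-1,Borussia Dortmund,Germany\n"
    "2014-15,Spain,Barcelona,3-1,Juventus,Italy\n"
    "2017-18,Spain,Real Madrid,3-1,Liverpool,England\n"
    "2018-19,England,Liverpool,2-0,Tottenham Hotspur,England\n"
    "2019-20,Germany,Bayern Munich,1-0,Paris Saint-Germain,France\n"
)

# read_csv accepts a path, a URL or any file-like object.
df = pd.read_csv(sample_csv)

print(df.shape)   # (rows, columns)
print(df.head())  # first rows of the table
```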

The output of the previous code is a DataFrame with 64 rows that holds the data for every Champions League final, from the first edition, won by Real Madrid, to the most recent one, won by Bayern Munich.

Grouping data with Pandas

The group by clause is an operation on DataFrames. A Series is a 1D object, so performing a group by operation on it is less useful, although it can be used to obtain the distinct values of the Series. The result of a group by operation is not a DataFrame but a GroupBy object, which behaves much like a dict that maps each group key to a DataFrame.

The output above shows the season, the nations to which the winning and runner-up clubs belong, the score, the venue, and the attendance figures. Suppose we wanted to rank the nations by the number of European club championships they have won. We can do this with group by. First, we apply the groupby function to the DataFrame and check the type of the result:
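
A minimal sketch of this step, using a small stand-in table rather than the full dataset. The column names "Season", "Nation" and "Winners" and the variable name df are assumptions; nationsGrp is the name used in this article.

```python
import pandas as pd

# Small stand-in for the finals table (assumed column names, a few real finals).
df = pd.DataFrame({
    "Season":  ["2012-13", "2014-15", "2017-18", "2018-19", "2019-20"],
    "Nation":  ["Germany", "Spain", "Spain", "England", "Germany"],
    "Winners": ["Bayern Munich", "Barcelona", "Real Madrid", "Liverpool", "Bayern Munich"],
})

# Group the finals by the winning club's nation.
nationsGrp = df.groupby("Nation")

# Nothing is computed yet; we only get a lazy GroupBy object back.
print(type(nationsGrp))
```

The printed class is a DataFrameGroupBy; the module path in front of it depends on the installed pandas version.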

We notice that nationsGrp is an object of type pandas.core.groupby.DataFrameGroupBy (the exact module path varies with the pandas version). The column on which we use groupby is referred to as the key, and the rows that correspond to each key value are collected under it, much as in a dictionary. We can see what the groups look like by using the groups attribute on the resulting DataFrameGroupBy object:

As mentioned before, this is basically a dictionary that shows the unique groups and the axis labels corresponding to each group, in this case the row numbers. For example, we can pull the full record at index 62 of the DataFrame, which corresponds to the 2017–18 final, played against Liverpool, which gave Real Madrid their 13th “Orejona”.
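
Both ideas can be sketched on a small stand-in table (assumed column names, a few real finals only). Here the 2017–18 final sits at position 2 rather than 62, because the sample has just five rows.

```python
import pandas as pd

# Small stand-in for the finals table (assumed column names, a few real finals).
df = pd.DataFrame({
    "Season":  ["2012-13", "2014-15", "2017-18", "2018-19", "2019-20"],
    "Nation":  ["Germany", "Spain", "Spain", "England", "Germany"],
    "Winners": ["Bayern Munich", "Barcelona", "Real Madrid", "Liverpool", "Bayern Munich"],
})
nationsGrp = df.groupby("Nation")

# groups maps each key (a nation) to the row labels that belong to it.
for nation, row_labels in nationsGrp.groups.items():
    print(nation, row_labels.tolist())

# Any of the listed rows can then be fetched by position from the DataFrame.
print(df.iloc[2])
```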

The number of groups is obtained by using the len() function in the cell below:
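
A short sketch on the same kind of stand-in table (assumed column names, a few real finals only); the ngroups attribute is shown as an equivalent alternative to len().

```python
import pandas as pd

# Small stand-in for the finals table (assumed column names, a few real finals).
df = pd.DataFrame({
    "Season":  ["2012-13", "2014-15", "2017-18", "2018-19", "2019-20"],
    "Nation":  ["Germany", "Spain", "Spain", "England", "Germany"],
    "Winners": ["Bayern Munich", "Barcelona", "Real Madrid", "Liverpool", "Bayern Munich"],
})
nationsGrp = df.groupby("Nation")

# Number of distinct nations among the winners in this sample.
print(len(nationsGrp))
# The same figure is also exposed as an attribute.
print(nationsGrp.ngroups)
```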

We will now use the grouped data held in the DataFrameGroupBy object, nationsGrp, to display some tables. First we need to convert it to a DataFrame, so that we can create a new measure and sort it in ascending order.

In the table we see that the nation with the most Champions League wins is Spain, mostly due to the 13 and 5 trophies won by Real Madrid and Barcelona, respectively.
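
One possible way to build such a table is sketched below on a small stand-in dataset, so the counts it prints are those of the sample, not the real totals. The column names, the "Wins" measure and the variable name nationWins are assumptions.

```python
import pandas as pd

# Small stand-in for the finals table (assumed column names, a few real finals).
df = pd.DataFrame({
    "Season":  ["2012-13", "2014-15", "2017-18", "2018-19", "2019-20"],
    "Nation":  ["Germany", "Spain", "Spain", "England", "Germany"],
    "Winners": ["Bayern Munich", "Barcelona", "Real Madrid", "Liverpool", "Bayern Munich"],
})
nationsGrp = df.groupby("Nation")

# size() counts the rows per group; reset_index turns the resulting Series
# into a DataFrame with the count stored in a new "Wins" column.
nationWins = nationsGrp.size().reset_index(name="Wins")

# Ascending order puts the leaders at the bottom; pass ascending=False to flip it.
nationWins = nationWins.sort_values("Wins", ascending=True)
print(nationWins)
```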

The size() function returns a Series with the group names as the index and the size of each group. The size() function is also an aggregation function.

To break the wins down further by country and club, we apply a multicolumn groupby and then size() and sort_values():

A multicolumn groupby uses more than one column as the key; the key columns are passed as a list. From the result we can see that the most successful club in this competition has been Real Madrid of Spain.
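
A minimal sketch of a multicolumn groupby on a small stand-in table. Because the sample holds only five finals, the club on top here is not the one that leads the full dataset; the column names and the variable names winnersGrp and clubWins are assumptions.

```python
import pandas as pd

# Small stand-in for the finals table (assumed column names, a few real finals).
df = pd.DataFrame({
    "Season":  ["2012-13", "2014-15", "2017-18", "2018-19", "2019-20"],
    "Nation":  ["Germany", "Spain", "Spain", "England", "Germany"],
    "Winners": ["Bayern Munich", "Barcelona", "Real Madrid", "Liverpool", "Bayern Munich"],
})

# Passing a list of columns makes each (nation, club) pair its own group.
winnersGrp = df.groupby(["Nation", "Winners"])

# Count the finals per pair and put the largest counts first.
clubWins = winnersGrp.size().sort_values(ascending=False)
print(clubWins)
```

The result is a Series whose index has two levels, one per key column, so a single entry can be read with a tuple such as clubWins.loc[("Spain", "Real Madrid")].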

In this tutorial we have quickly explored a data analysis technique that is well known among data analysts: groupby. You can find many more Pandas techniques in the official documentation.