Sending Data to Google Sheets Using Python: Data Manipulation and Querying

Extract your data with Python and load it into Google Sheets for manipulation and visualization

Image from fullstackfeed

Introduction and motivation

The motivation for writing this tutorial is to provide readers with the tools to extract, load, manipulate, and organize data into a familiar format: Google Spreadsheets. We will learn how to dump thousands of rows of data in an automated way from multiple plain-text formats (.csv, .tsv, .txt, .json files) stored locally, in the cloud, or even coming from APIs. Once the data is in Google Spreadsheets, we will be able to manipulate, query, and visualize it in some of the same ways we might in a Jupyter notebook.

Image by author

You can find more information on setting up your service account in Google Cloud Console in this article. Once you have configured your Google Service Account, I suggest you create the following folder structure in your Google Drive, where the data folder will contain the dataset you retrieve in the accompanying Jupyter Notebook, as well as the API key file you’ll need to connect your Python code with the Google Spreadsheet. The config folder will contain the config.ini file to store “secrets” like passwords and credentials.

Image by author

Google Colab

Google Colab provides a development platform that can be used to easily share and replicate work. There are very few requirements to get started, and with access to the underlying container, there are countless applications. For all of these reasons and more, Colab is a great tool for teaching and learning.

If you’re not already familiar with Colab, I strongly recommend creating an account and working through some of the examples in the official docs.

Setting up the notebook

Google Colab comes with the standard Python packages preinstalled, but we will need a few others. If you already have these packages installed, you can simply import them directly into your notebook. If not, you can install them with pip or conda from your terminal. To install with pip directly in the notebook (without switching to your command line), prefix the command with an exclamation mark: !pip install gspread==3.6.0. Next, we will need to mount our Google Drive storage in the notebook, as shown below:

Image by Author
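A minimal setup cell might look like the following sketch. The gspread version pin matches the one mentioned above, and since google.colab is only importable inside a Colab runtime, the Drive mount is guarded so the cell also runs elsewhere:

```python
# Inside Colab you can install extra packages from a cell, e.g.:
# !pip install gspread==3.6.0 gspread-dataframe

import pandas as pd  # used throughout the tutorial

# drive.mount exposes your Google Drive under /content/drive; the import
# only succeeds inside a Colab runtime, so we guard it.
try:
    from google.colab import drive
    drive.mount("/content/drive")
    IN_COLAB = True
except ImportError:
    IN_COLAB = False
```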

Downloading the dataset

In this tutorial we will be using the “Adult” dataset from the UCI Machine Learning Repository (you can find a link to it on the homepage, under the “Most Popular Datasets” category). This data extraction was originally done by Barry Becker utilizing the 1994 Census database. Inside the data folder, you will find four files. Here, we will focus primarily on the adult.data and adult.names files.

I downloaded the dataset from the UCI repo using the wget UNIX command, which we can also use directly in Jupyter notebooks by preceding it with a !. We’ll also be using cat, ls, and head for data and file inspection, as shown below:

Image by author

By running these commands, the files are downloaded to the Colab runtime’s working directory, /content.

Image by author
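If you prefer not to shell out to wget, the same download can be sketched in pure Python with urllib. The URL below is the standard UCI repository path for the Adult data file, and the helper function name is just illustrative:

```python
from urllib.request import urlretrieve

# Pure-Python alternative to the !wget cell; points at the UCI
# repository's directory for the Adult dataset.
ADULT_DATA_URL = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
)

def download_adult_data(dest: str = "adult.data") -> str:
    """Download adult.data into the current folder and return the local path."""
    path, _headers = urlretrieve(ADULT_DATA_URL, dest)
    return path
```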

Exploring the dataset

When exploring these files in the notebook, you may notice that adult.names contains a rather comprehensive description of the dataset:

Image by author

On the information page of the dataset, as well as in the final part of the adult.names file, we can locate our feature names, along with their descriptions.

UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Adult)

Below I create a list of column names and assign them as headers of the DataFrame using the names parameter of [pd.read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html):

Image by author
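That step can be sketched as follows. The column names come from the adult.names file; in the notebook you would point read_csv at the downloaded file, but here a two-row sample is parsed inline so the snippet is self-contained:

```python
import io
import pandas as pd

# Column names taken from the adult.names file (adult.data has no header row)
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]

# In the notebook this would be:
# df = pd.read_csv("adult.data", names=columns, skipinitialspace=True)
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical,"
    " Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse,"
    " Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)
df = pd.read_csv(sample, names=columns, skipinitialspace=True)
```

skipinitialspace=True strips the space that follows each comma in the raw file, so string values like “State-gov” come through clean.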

We can also explore some basic information about the dataset using the .info() and .describe() methods, which return the data type and number of observations per feature, as well as some descriptive statistics, like the mean, median, and quartiles:

Image by author (df.info())

Image by author (df.describe())

Sending data to Google Spreadsheets

To send our data to Google Spreadsheets, I created an empty spreadsheet in the data folder, connected with gspread.service_account in the notebook, and then used the id of the empty file (visible in its URL) to open it. Note the file id underlined in red below:

Image by author

Image by author

Then we will need to provide access to the Service Account by clicking the “Share” button located in the upper right corner of the sheet:

Image by author

After instantiating the spreadsheet handle with sh = gc.open_by_key to send my data from the notebook to my sheet, it’s time to send it!

Image by author

In the example above, sh.get_worksheet() takes an index as a parameter, which corresponds to the sheet number we intend to send the data to. In this case, as we are using the first sheet of the workbook, the index is 0.

The set_with_dataframe function takes as arguments the worksheet (ws), our DataFrame, the starting row and col, and whether or not to include the column headers.
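The whole flow can be sketched as below, assuming gspread and the gspread-dataframe package are installed; the key path and spreadsheet id are placeholders you would replace with your own:

```python
import pandas as pd

KEY_PATH = "/content/drive/MyDrive/data/service_account_key.json"  # assumed path
SPREADSHEET_ID = "your-spreadsheet-id-here"  # the id underlined in the sheet URL

def send_to_sheet(df: pd.DataFrame, key_path: str, spreadsheet_id: str) -> None:
    """Open the spreadsheet by id and write df to its first worksheet."""
    import gspread
    from gspread_dataframe import set_with_dataframe

    gc = gspread.service_account(filename=key_path)  # authenticate
    sh = gc.open_by_key(spreadsheet_id)              # open the workbook
    ws = sh.get_worksheet(0)                         # index 0 = first sheet
    set_with_dataframe(ws, df, row=1, col=1, include_column_header=True)
```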

Once that cell is executed, the data is sent to the first sheet of the workbook we previously set:

Image by author

Once we’ve loaded the data into Google Sheets, we can generate some attractive and relatively comprehensive visualizations, similar to those produced by other BI tools like Tableau or PowerBI:

Image by author

These charts are generated by queries executed in the very same Google sheet. For more information about the Google Sheets QUERY function, see this tutorial.


We can start by answering the following questions by querying the main dataset:

How is the capital gain distributed per age range?
=QUERY(dataframe![dataframe_range],"select B,sum(L) group by B order by sum(L) desc",1)

How many observations are there per workclass?
=QUERY(dataframe![dataframe_range],"select C,count(A) group by C order by count(A) desc",1)

Which are the top occupations by workclass?
=QUERY(dataframe![dataframe_range],"select H,C,count(A) where C='Private' group by H,C order by count(A) desc",1)

Which sex has the most capital gain?
=QUERY(dataframe![dataframe_range],"select K,sum(L) group by K order by sum(L) desc",1)

Which is the most profitable education status?
=QUERY(dataframe![dataframe_range],"select E,sum(L) group by E order by sum(L) desc",1)

The first thing we need to know before performing queries in Google Sheets is where our data is located. In our case, the data lives in a sheet named dataframe, and [dataframe_range] is nothing more than the range of the entire dataset, A1:P32562. After that, we include the same combination of statements we would write in SQL (within double quotes), with the difference that here we refer to the columns by their spreadsheet letters. The final number at the end of the query indicates whether the result should include a header row: 1 means it does, 0 means it doesn’t.

Notice that for the charts I have added a calculated dimension in column B, which groups ages into bins:

=IFS(AND(A2>10,A2<=20),"under 20 years",AND(A2>20,A2<=40),"between 20–40 years",AND(A2>40,A2<=60),"between 40–60 years",AND(A2>60,A2<=80),"between 60–80 years",A2>80,"more than 80 years")

Image by author
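The same age-range dimension can be sketched in pandas with pd.cut. The bin edges mirror the spreadsheet formula (right-closed intervals, so e.g. (10, 20] maps to “under 20 years”), with the last label normalized to “more than 80 years”:

```python
import pandas as pd

# Bin a few sample ages the same way the IFS formula does in column B.
ages = pd.Series([18, 35, 55, 75, 85])
age_range = pd.cut(
    ages,
    bins=[10, 20, 40, 60, 80, 90],  # 90 is the dataset's maximum age
    labels=[
        "under 20 years", "between 20-40 years", "between 40-60 years",
        "between 60-80 years", "more than 80 years",
    ],
)
```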

We can even use the statistical functions to generate a table that describes our dataset, much like pandas’ DataFrame.describe():

Image by author

We can continue to answer questions about our data without ever leaving the spreadsheet by building a dashboard with our charts. Of course, a Google Spreadsheet dashboard won’t be nearly as sophisticated as dedicated BI tools like Tableau, QlikView, or PowerBI, but it is certainly a great starting point.

In this tutorial, we have learned how to do some basic data extraction using Python in a Jupyter notebook and then automatically dump that data into Google Spreadsheets. Once in Google Spreadsheets, we performed additional manipulation of the dataset, ran queries on the data, and talked about creating a dashboard.

I hope this tutorial has been useful for you! Please feel free to clone the code I used from this GitHub repo.

Thanks for reading!

Editor’s Note: Heartbeat is a contributor-driven online publication and community dedicated to providing premier educational resources for data science, machine learning, and deep learning practitioners. We’re committed to supporting and inspiring developers and engineers from all walks of life.

Editorially independent, Heartbeat is sponsored and published by Comet, an MLOps platform that enables data scientists & ML teams to track, compare, explain, & optimize their experiments. We pay our contributors, and we don’t sell ads.

If you’d like to contribute, head on over to our call for contributors. You can also sign up to receive our weekly newsletters (Deep Learning Weekly and the Comet Newsletter), join us on Slack, and follow Comet on Twitter and LinkedIn for resources, events, and much more that will help you build better ML models, faster.