When scheduling our scripts to run automatically, we run into the issue that the machine running them needs to stay on at all times, unless we have a server or a VM in the computing cloud (AWS, GCP, etc.).
The steps of configuring virtual environments can be skipped if all we need is to schedule a couple of scripts and dump fewer than a hundred CSV files for further analysis.
Deepnote has become an option for working and analyzing data in a cloud environment, and it now lets you schedule online jobs, cron-style, completely free.
Deepnote is a web application compatible with Jupyter notebooks that provides an interactive, project-level work environment for data science. In it you can write and execute code simply and collaboratively, in real time, without any extra configuration, which makes it a good place both to start studying data science and to develop professional projects.
First of all, I will clarify that I have no relation to the Deepnote project nor any kind of sponsorship from that company; I am just sharing knowledge about tools I find useful in my daily work in data science, data analysis, and data engineering, ready to be used.
Accessing the Deepnote Dashboard
Accessing Deepnote is free; you just need to sign up with your GitHub or Google account.
Once you have signed up, you can log in using the options provided in the prompt.
You will then be taken to the Dashboard page, where you can start creating your online notebooks. In this link, you can find a complete article by one of the Medium writers that explains all the features and how to start working in Deepnote, which I will not cover in this tutorial.
How the script works
For my scheduled notebook I will be extracting the Covid data from the GitHub repo of Our World in Data, which is updated daily.
You can download this script from my GitHub and play with this data. The first thing the script does is import the required Python packages:
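Something like the following covers the essentials (the original notebook may import more; pandas and datetime are the only packages the steps below rely on):

```python
# Core packages for the download-and-filter script
import datetime

import pandas as pd
```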
Then it reads the data from the remote GitHub repo, stores it in a pandas dataframe, and counts the number of records in the Covid dataset.
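A sketch of that step, assuming the standard raw URL of the OWID file (the exact path in the original script may differ):

```python
# Read the Our World in Data Covid dataset directly from GitHub
url = (
    "https://raw.githubusercontent.com/owid/covid-19-data/"
    "master/public/data/owid-covid-data.csv"
)
df = pd.read_csv(url)

# Count the records in the full dataset
print(f"Total records: {len(df)}")
```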
Then it sets the datetime variables that will be used to filter the dataset, and names the output file.
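For example (the variable names startDay, endDay and the output file name are illustrative):

```python
# Date window: from 7 days ago up to today
endDay = datetime.date.today()
startDay = endDay - datetime.timedelta(days=7)

# Name the output CSV after the window it covers
outputName = f"covid_uruguay_{startDay}_{endDay}.csv"
```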
We can inspect the columns of the dataset and notice that, among those 67 columns, there are some interesting ones related to the population of the different countries.
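A quick way to list them (columns such as 'population' and 'population_density' show up here):

```python
# Inspect the dataset's 67 columns
print(len(df.columns))
print(df.columns.tolist())
```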
We can also inspect the different countries in the dataset (the location column).
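For instance:

```python
# Unique values of the 'location' column (countries and aggregates)
print(df["location"].unique())
```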
But let's leave the EDA for another time and continue configuring the subset of data that our scheduled notebook will produce.
Then I will select the data for the location I am interested in, which is Uruguay, and print in the cell below the number of records in the resulting subset.
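A minimal version of that filter, using the column and country names as they appear in the OWID dataset:

```python
# Keep only the rows for Uruguay
df_uy = df[df["location"] == "Uruguay"]
print(f"Records for Uruguay: {len(df_uy)}")
```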
After that, I will keep only the rows between the datetime variables set above: from 7 days back from the current day (startDay) up to today (endDay).
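Continuing the sketch above, the date filter could look like this:

```python
# Parse the 'date' column and keep only the last week
df_uy = df_uy.assign(date=pd.to_datetime(df_uy["date"]))
mask = (df_uy["date"] >= pd.Timestamp(startDay)) & (df_uy["date"] <= pd.Timestamp(endDay))
df_week = df_uy.loc[mask]
print(f"Records for the last week: {len(df_week)}")
```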
Since I will run this on February 28th, I will get one week of Covid cases for my country. The idea is to schedule this job with one of the options the scheduler offers (Hourly, Daily, and Weekly); this time it will be Weekly.
The downloaded data will be stored in my Google Drive storage; this guide shows how to integrate your Google Drive with your Deepnote notebook.
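Once the Drive integration is mounted in the project, saving the weekly subset is a single call; the mount path below is only a placeholder, so replace it with the path your own integration shows:

```python
# Write the weekly subset to the mounted Google Drive folder
# (replace the path with the one shown in your Deepnote integration)
df_week.to_csv(f"/datasets/my-google-drive/{outputName}", index=False)
```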
Scheduling the notebook
So next let's schedule our notebook (script) to run on Deepnote. In the top right corner of your notebook, click the down arrow on the Run notebook button, and you will be prompted with the scheduling options (Hourly, Daily, and Weekly), plus some other advanced options to choose from.
Voilà! The script will run on a weekly basis, downloading data to Google Drive storage, ready for further analysis.