As regular readers know, I’ve been pursuing a Master’s in data analysis at SMU since last August. I’m still quite new to programming in Python and the data analysis process, so from time to time I practice on the various public datasets available for use.
COVID-19 is certainly top of mind for most people right now, given its significant impact on our lives. Between social distancing and working from home, this unprecedented health crisis may change the way we behave and do things forever. It has certainly affected our portfolios, with rapid write-offs in market value during the March madness.
One key question investors have is whether the COVID-19 outbreak is under control and, consequently, whether the recent market rally is sustainable. As such, I thought I’d share something different with you today: a step-by-step guide on how to perform an exploratory analysis on the global COVID-19 dataset compiled by Johns Hopkins University (JHU), with the aid of Python.
Hopefully, we will gain some insights into the spread of the virus in various countries we invest in and whether it is under control.
Before we start, here’s a quick introduction to Python and its role in data analysis.
Python is actually a general-purpose programming language which you can pick up to do almost anything. Designing your own games, automating repetitive menial tasks: all of this is possible with Python. As for why use Python specifically for data analysis, there are 2 reasons to my mind.
- Python syntax (ie the way you write Python statements) is very natural and descriptive, much more so than other programming languages. As such, anybody can pick up the code and roughly understand what it is trying to do. This helps programming noobs like me.
- Extensive work has been done to augment Python with libraries of ready-made algorithms, so users can quickly perform data analysis without writing complex code.
If you would like to follow along, you need to install Python on your computer. There are a couple of ways to do this. Personally, my favourite way is to use the Anaconda distribution, because it comes with most of the essential tools and packages required to perform data analysis on your own computer in a self-contained environment.
Once you install and run Anaconda, you will get the above screen. To start coding, you should launch either Jupyter Notebook or Spyder. For beginners like me, I recommend using Jupyter Notebook as its web browser interface is intuitive for most people.
Once you launch Jupyter Notebook and create a new Python 3 notebook, you should see the above in your browser.
Now you’re ready to start coding! Or in this case, copying my code 😛
Alternatively, you can simply download my IPython notebook here to run for yourself.
About the Dataset
The dataset I’m performing my analysis on is compiled by Johns Hopkins University. The GitHub repository can be found here. The data is sourced by their team from official sources worldwide, and it powers the university’s COVID-19 Dashboard.
There are 5 files in the dataset that are updated daily – Global and US cumulative cases, Global and US cumulative deaths, and Global cumulative recoveries.
For the purposes of this analysis, I will focus on the Global cumulative confirmed cases dataset pictured above. Here are the fields in the dataset:
- Province/State – Certain countries provide state level case data
- Country/Region – Self-explanatory
- Lat and Long – Latitude and longitude of the country/state. With this info, one could technically plot the data on a world map. That is more advanced, so I’ll skip it for this exercise.
- Dates – One column per date, containing the no. of cumulative cases as of that date.
As you can see, the dataset is very much like an Excel sheet. You can technically download this CSV file and open it on your computer in Excel, but the downside is that you would have to re-download the file every time it’s updated.
With all that out of the way, we can finally start coding. We start by importing libraries into Python.
import pandas as pd
import numpy as np   #needed later for np.log
import datetime
import seaborn as sns
import matplotlib.pyplot as plt
Libraries are collections of prepackaged functions that perform certain tasks without the user having to code them manually. Think of them as Excel functions like AVERAGE or VLOOKUP (see the small sketch after the list below).
The libraries I’ve used for this analysis include:
- Pandas – The single most important Python library for data analysis. Helps a lot with transforming and extracting data from datasets.
- Numpy – A package that allows you to perform common mathematical calculations
- Datetime – As this is a time series, there is a need to process some date data.
- Matplotlib – Essential visualisation library for plotting graphs
- Seaborn – A prettier visualisation library that extends Matplotlib
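To make the Excel analogy concrete, here’s a minimal sketch of one library call doing the work of an Excel formula (the numbers and dates are made up for illustration):

import pandas as pd

#A tiny, made-up series of daily case counts
cases = pd.Series([10, 25, 40, 60], index=['1/22/20', '1/23/20', '1/24/20', '1/25/20'])

#One prepackaged call replaces a manual loop, much like =AVERAGE(...) in Excel
print(cases.mean())     #33.75
print(cases.cumsum())   #running total, like a helper column in Excel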
Not all datasets available for you to use are clean or in the format you need for your analysis. Although this dataset is already quite clean, there’s still a need to process it a little bit.
1) Reading in the data
We start by reading in the dataset from the online link and temporarily storing it in a Dataframe.
#Read in dataset
df = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
df.head()
df.head() shows us the first few rows of the dataset, with the output shown above. I do this usually to check that the dataset is read in properly.
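Beyond eyeballing the first few rows, a couple of quick checks can confirm what was read in. This is just a sketch of checks I find useful; the exact counts will differ depending on when you pull the data:

#Quick sanity checks on the freshly read data
print(df.shape)                         #(rows, columns): one row per country/province
print(df.columns[:6])                   #Province/State, Country/Region, Lat, Long, then dates
print(df['Country/Region'].nunique())   #number of distinct countries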
2) Drop irrelevant columns
Next, we remove any unwanted columns from the dataset. As earlier mentioned, we do not need the Lat and Long columns as I don’t plan to visualise the data on a World Map.
#Drop Latitude and Longitude
df = df.drop(['Lat', 'Long'], axis=1)
3) Process certain country data
As I looked through the data, I noticed that certain countries have state-level data but no national-level row. Here’s Australia as an example:
As I only wanted national-level data, I had to sum these rows up into one national-level row. There are 3 countries that need this – Australia, China and Canada.
There are 2 steps to perform. We start by consolidating the rows for Australia, China and Canada with the following code:
#Consolidate Province/State data for certain countries into national data
tbp_countries = ['China', 'Australia', 'Canada']
df_processed = df[df['Country/Region'].isin(tbp_countries)]
df_processed = df_processed.groupby('Country/Region', as_index=False).sum()
df_processed
This creates a DataFrame with the processed data as above.
Next, drop the rows with state level data and append our consolidated country data. I also did some minor data transformations in the process. See comments for additional information.
df = df[df['Province/State'].isnull()]               #keep only rows without state-level data
df = df.drop('Province/State', axis=1)
df = df[~df['Country/Region'].isin(tbp_countries)]   #drop the original rows for the 3 countries
df = df.append(df_processed, ignore_index=True)      #append the consolidated rows (use pd.concat in pandas 2.0+)
df = df.set_index('Country/Region')                  #set Country/Region as index
df.columns = pd.to_datetime(df.columns)              #set column names as dates
df = df.T                                            #transpose so dates become the index
df
We’ve successfully processed the data for cumulative cases!
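If you want to double-check that the consolidation worked, a quick look at one of the affected countries should now show a single national series (a minimal sketch):

#Dates are now the index and each country is a single column
print(type(df.index))           #DatetimeIndex
print(df['Australia'].tail())   #latest national-level cumulative cases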
4) Calculate daily new cases and daily new case growth
Of course, we would probably like to analyse the daily new cases by country. I’d also like to analyse the growth in daily new cases to understand the pace at which the virus is spreading in each country.
To obtain the daily new cases, we need to find the difference between the cumulative cases on a particular day versus the previous day.
df_newcases = df.diff(periods=1)
df_newcases = df_newcases.fillna(0)
As for the growth in daily new cases, we simply run the same code a second time.
df_newcasesgrowth = df_newcases.diff(periods=1)
df_newcasesgrowth = df_newcasesgrowth.fillna(0)
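To see what the two rounds of diff() produce, here’s a tiny worked example on a made-up cumulative series (values are illustrative only):

cases = pd.Series([100, 150, 225, 300])   #made-up cumulative cases
new = cases.diff().fillna(0)              #0, 50, 75, 75 -> daily new cases
growth = new.diff().fillna(0)             #0, 50, 25, 0  -> change in daily new cases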
And we’re done processing! At the end of these programming steps, we end up with 3 data tables (DataFrames) as follows:
- df – contains daily cumulative confirmed case data by country
- df_newcases – contains daily confirmed case data by country
- df_newcasesgrowth – contains the daily change (growth or reduction) in new cases by country
Using these 3 DataFrames, we can visualise the data and plot graphs to explore it.
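Before plotting anything, you can already answer simple questions straight from these DataFrames. A minimal sketch (pick any country present in the dataset):

#Latest cumulative count and the past week of new cases for one country
country = 'Singapore'
print(df[country].iloc[-1])           #most recent cumulative case count
print(df_newcases[country].tail(7))   #new cases over the last 7 days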
There are many ways to visualise data to give the numbers meaning. You can find out various things like…
Top 20 Countries by Cumulative Cases for a Given Date
This set of code plots a bar chart to show the top 20 countries in terms of cumulative cases for a given date. Change the date variable below if you wish to look back in time.
Date = datetime.datetime(2020,4,9)   #Set Date
df_Date = df.loc[Date]
df_Top20 = df_Date.nlargest(20, keep='all')

plt.figure(figsize=(20,10))
sns.barplot(x=df_Top20.values, y=df_Top20.index)
plt.title('Top 20 Countries by Cumulative Cases', fontsize=25)
plt.ylabel('Country', fontsize=15)
plt.xlabel('No. of Cumulative Cases', fontsize=15)
plt.show()
No surprises here: the US is way out in front in terms of infections, with European countries filling the bulk of the top 10.
Country Specific Analysis
This set of code plots the cumulative cases, daily new cases and case growth for a given country all on the same chart. Simply change the country variable for different country results. The code below is for China.
Country = 'China'   #Set Country
x1 = df[Country].index
y1 = df[Country].values                  #cumulative cases
y2 = df_newcases[Country].values         #daily new cases
y3 = df_newcasesgrowth[Country].values   #change in daily new cases

plt.figure(figsize=(20,10))
plt.plot(x1, y1)
plt.plot(x1, y2)
plt.plot(x1, y3)
plt.title('COVID19 in ' + Country)
plt.ylabel('Cases')
plt.xlabel('Date')
plt.axhline()   #horizontal line at zero for reference
plt.legend(['Cumulative Cases', 'New Cases', 'Case Growth'])
plt.show()
As you can see, China has managed to significantly flatten the curve since the spike in cases in February. This is what most countries are aiming for, and what investors are looking for in the markets.
Cumulative Case Curves in Selected Countries
The following set of code plots the case curves for selected countries, to allow some comparison of which stage of the outbreak each country is in at the moment.
This set of code is slightly more advanced, as it involves using a for loop to plot multiple lines and taking the logarithm of cumulative cases. Plotting on a log scale makes exponential growth appear as a straight line, so a flattening curve is easy to spot.
#Select Countries
Countries = ['US', 'Spain', 'Italy', 'Germany', 'France', 'China', 'Japan', 'India', 'Singapore']
x1 = df.index

plt.figure(figsize=(20,10))
for y in Countries:
    plt.plot(x1, np.log(df[y].values))   #natural log of cumulative cases

plt.title('COVID19 Cases in Selected Countries')
plt.ylabel('Log of Cases')
plt.xlabel('Date')
plt.legend(Countries)
plt.show()
As you can see from this graph, the US and Europe are in the process of flattening their respective curves. India’s curve is quite worrying, actually. Singapore had initially managed to flatten its curve before being hit by a second wave of infections.
And many more possibilities
I’m showing just the 3 charts I find most pertinent for this dataset. You can analyse state-level data if you wish. You can plot death and recovery rates from the other datasets within the JHU COVID-19 repository. Or you can combine other datasets to perform repeatable exploratory analysis. The possibilities are endless.
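As one example, here’s a sketch of how you might pull in the global deaths file from the same repository and compute a naive case fatality rate. I’m assuming the deaths file follows the same layout as the confirmed-cases file, and deaths divided by confirmed cases is a rough proxy at best:

#Read in the global cumulative deaths dataset from the same JHU repository
deaths_url = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/'
              'csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')
df_deaths = pd.read_csv(deaths_url)

#Naive case fatality rate for one country on the latest date
country = 'Italy'
total_deaths = df_deaths[df_deaths['Country/Region'] == country].iloc[:, 4:].sum().iloc[-1]
total_cases = df[country].iloc[-1]   #from the confirmed-cases DataFrame built earlier
print(total_deaths / total_cases)    #rough case fatality rate, not a true CFR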
What I see in the charts is heartening, as there may be light at the end of the tunnel in the coming month. I feel the market rally is pricing in the gradual flattening of the curves in the US and Europe.
While there is hope for an end to this crisis, there are several unknowns still to look out for:
- The economic fallout from global economies grinding to a halt, and the effects of the unprecedented levels of liquidity provided by central banks.
- A potential second wave of infections once governments start to reopen their borders.
That said, I’m cautiously optimistic about the markets in the months ahead.
Hope you find this article useful and insightful. Do feel free to comment on my code and suggest other types of analysis 🙂