Exploratory analysis on COVID-19 Dataset with Python

April 11, 2020 By KK

As regular readers know, I’ve been pursuing a Masters in data analysis from SMU since last August. I’m still quite new to programming in Python and the data analysis process. As such, from time to time, I practice on various public datasets available for use.

COVID-19 has certainly been top of most people’s minds, given its significant impact on our lives. Between social distancing and working from home, this unprecedented health crisis may change the way we behave and do things forever. It has certainly affected our portfolios, with rapid write-downs in market value during the March Madness.

One key question that investors have is whether the COVID-19 outbreak is under control and, consequently, whether the recent market rally is sustainable. As such, I thought I’d share something different with you today: a step-by-step guide on how to perform an exploratory analysis on the global COVID-19 dataset compiled by Johns Hopkins University (JHU), with the aid of Python.

Hopefully, we will gain some insights into the spread of the virus in the various countries we invest in and whether it is under control.


Contents

  • Introduction
  • Prerequisites
  • About the Dataset
  • Import Libraries
  • Data processing
      • 1) Reading in the data
      • 2) Drop irrelevant columns
      • 3) Process certain country data
      • 4) Calculate daily new cases and daily new case growth
  • Visualising data
      • Top 20 Countries by Cumulative Cases for a Given Date
      • Country Specific Analysis
      • Cumulative Case Curves in Selected Countries
      • And many more possibilities
  • Final Insights

Introduction

Before we start, here’s a quick introduction to Python and its role in data analysis.

Python is actually a general-purpose programming language which you can pick up to do almost anything: designing your own games, automating repetitive menial tasks, and more. As for why use Python specifically for data analysis, there are 2 reasons in my mind.

  • Python syntax (ie the way you write Python statements) is very natural and descriptive, much more so than other programming languages. As such, anybody can pick up the code and roughly understand what it is trying to do. This helps programming noobs like me.
  • Extensive work has been done to augment Python with prepackaged algorithms, so that users can quickly perform data analysis without writing complex code.

Prerequisites

If you would like to follow along, you need to install Python on your computer. There are a couple of ways to do this. Personally, my favourite way is to use the Anaconda distribution. This is because it comes with a lot of the essential tools and packages required to perform data analysis on your own computer in a segregated manner.

Once you install and run Anaconda, you will see the screen above. To start coding, you should launch either Jupyter Notebook or Spyder. For beginners like me, I recommend Jupyter Notebook, as its web browser interface is intuitive for most people.

Once you’ve launched Jupyter Notebook and created a new Python 3 notebook, you should see the above in your browser.

Now you’re ready to start coding! Or in this case, copying my code 😛

Alternatively, you can simply download my IPython notebook here to run for yourself.
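Before moving on, you can quickly confirm that your installation works by running a simple cell like this (just a minimal sanity check, not part of the analysis):

#Minimal sanity check that Python is up and running
import sys
print(sys.version) #should print a Python 3.x version string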


About the Dataset

The dataset that I’m performing my analysis on is compiled by Johns Hopkins University (JHU). The GitHub repository can be found here. The data is sourced by their team from official sources worldwide, and it powers the university’s COVID-19 Dashboard.

There are 5 files in the dataset that are updated daily – Global and US cumulative cases, Global and US cumulative deaths, and Global cumulative recoveries.

For the purposes of this analysis, I will focus on the Global cumulative confirmed cases dataset pictured above. Here are the fields in the dataset:

  • Province/State – Certain countries provide state level case data
  • Country/Region – Self-explanatory
  • Lat and Long – Latitude and longitude of the country/state. With this info, one can technically plot the data on a world map. This is more advanced, so I’ll skip it for this exercise.
  • Dates – One column per date, containing the no. of cumulative cases as at that date.

As you can see, the dataset is very much like an Excel sheet. You could technically download this CSV file and open it in Excel, but the downside is that you would have to re-download the file every time it’s updated.

Import Libraries

With all that out of the way, we can finally start coding. We start by importing libraries into Python.

import pandas as pd
import numpy as np
import datetime
import seaborn as sns
import matplotlib.pyplot as plt

Libraries are collections of prepackaged functions that perform specific tasks without the need for the user to code them manually. Think of them as functions in Excel like mean or vlookup.

The libraries I’ve used for this analysis include:

  • Pandas – The single most important Python library for data analysis. Helps a lot with transforming and extracting data from datasets.
  • Numpy – A package that allows you to perform common mathematical calculations
  • Datetime – As this is a time series, there is a need to process some date data.
  • Matplotlib – Essential visualisation library for plotting graphs
  • Seaborn – A prettier visualisation library that extends Matplotlib

Data processing

Not all datasets available for you to use are clean or in the format you need for your analysis. Although this dataset is already quite clean, there’s still a need to process it a little bit.

1) Reading in the data

We start by reading in the dataset from the online link and storing it in a DataFrame.

#Read in dataset
df = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
df.head()

df.head() shows us the first few rows of the dataset, with the output shown above. I usually do this to check that the dataset was read in properly.
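Besides df.head(), a couple of other quick checks can be handy at this stage (optional):

#Optional quick checks on the freshly loaded data
print(df.shape) #number of (rows, columns) in the dataset
df.info() #column names, data types and non-null counts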

2) Drop irrelevant columns

Next, we remove any unwanted columns from the dataset. As mentioned earlier, we don’t need the Lat and Long columns since I don’t plan to visualise the data on a world map.

#Drop Latitude and Longitude
df = df.drop(['Lat', 'Long'], axis=1)

3) Process certain country data

As I looked through the data, I noticed that certain countries have state-level data with no national-level data. Here’s Australia as an example:

As I only wanted national-level data, I had to sum these rows up into one national-level row. There are 3 countries that need this – Australia, China and Canada.
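If you’d like to verify this yourself, a quick count of rows per country shows which countries are split into multiple state-level rows:

#Count rows per country; countries with more than 1 row have state-level data
df['Country/Region'].value_counts().head(10)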

There are 2 steps to perform for this. We start by consolidating the rows for Australia, China and Canada with the following code:

#Consolidate Province / State data for certain countries into National data
tbp_countries = ['China', 'Australia', 'Canada']
df_processed = df[df['Country/Region'].isin(tbp_countries)]
df_processed = df_processed.groupby('Country/Region', as_index=False).sum()
df_processed

This creates a DataFrame with the processed data as above.

Next, drop the rows with state level data and append our consolidated country data. I also did some minor data transformations in the process. See comments for additional information.

df = df[df['Province/State'].isnull()] #keep only rows without a state-level breakdown
df = df.drop('Province/State', axis=1) #drop the now-empty Province/State column
df = df[~df['Country/Region'].isin(tbp_countries)] #remove remaining rows for the 3 countries
df = df.append(df_processed, ignore_index=True) #append the consolidated national data
df = df.set_index('Country/Region') #set Country/Region as Index
df.columns = pd.to_datetime(df.columns) #set Column names as Dates
df = df.T #Transpose the DataFrame so rows are dates and columns are countries
df

We’ve successfully processed the data for cumulative cases!
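As a quick sanity check, the 3 consolidated countries should now each appear as a single column indexed by date:

#Sanity check: each consolidated country is now a single column
df[['Australia', 'China', 'Canada']].tail()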


4) Calculate daily new cases and daily new case growth

Of course, we would probably like to analyse the daily new cases by country. I’d also like to run an analysis on the growth in daily new cases to understand the pace of the virus’s spread in each country.

To obtain the daily new cases, we need to find the difference between the cumulative cases on a particular day and the previous day.

df_newcases = df.diff(periods=1)
df_newcases = df_newcases.fillna(0)
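If diff() is new to you, here’s a tiny illustration of what it does:

#diff() subtracts each value from the one before it
pd.Series([100, 150, 210]).diff() #returns NaN, 50.0, 60.0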

As for the growth in daily new cases, we simply apply the same diff operation to the new-cases DataFrame.

df_newcasesgrowth = df_newcases.diff(periods=1)
df_newcasesgrowth = df_newcasesgrowth.fillna(0)

And we’re done processing! At the end of these programming steps, we end up with 3 data tables (DataFrames) as follows:

  • df – contains daily cumulative confirmed case data by country
  • df_newcases – contains daily confirmed case data by country
  • df_newcasesgrowth – contains daily growth / reduction in confirmed case data by country

Using these 3 DataFrames, we are able to visualise the data / plot graphs to explore the data.

Visualising data

There are many ways to visualise data to give the numbers meaning. You can find out various things like…

Top 20 Countries by Cumulative Cases for a Given Date

This set of code plots a bar chart showing the top 20 countries in terms of cumulative cases for a given date. Change the Date variable below if you wish to look back in time.

Date = datetime.datetime(2020,4,9) #Set Date
df_Date = df.loc[Date]
df_Top20 = df_Date.nlargest(20, keep='all')
plt.figure(figsize=(20,10))
sns.barplot(df_Top20.values[:], df_Top20.index[:])
plt.title('Top 20 Countries by Cumulative Cases', fontsize=25)
plt.ylabel('Country', fontsize=15)
plt.xlabel('No. of Cumulative Cases', fontsize=15)
plt.show()

No surprises here: the US is way in front in terms of infections, with European countries filling the bulk of the top 10.


Country Specific Analysis

This set of code plots the cumulative cases, daily new cases and case growth for a given country, all on the same chart. Simply change the Country variable to see results for a different country. The code below is for China.

Country = 'China' #Set Country
x1 = df[Country].index[:]
y1 = df[Country].values[:]
y2 = df_newcases[Country].values[:]
y3 = df_newcasesgrowth[Country].values[:]
plt.figure(figsize=(20,10))
plt.plot(x1, y1)
plt.plot(x1, y2)
plt.plot(x1, y3)
plt.title('COVID19 in ' + Country)
plt.ylabel('Cases')
plt.xlabel('Date')
plt.axhline()
plt.legend(['Cumulative Cases', 'New Cases', 'Case Growth'])
plt.show()

As you can see, China has managed to significantly flatten the curve since the spike in cases in February. This is what most countries are aiming for, and what investors are looking for in the markets.

Cumulative Case Curves in Selected Countries

The following set of code plots the cumulative case curves for selected countries, so we can compare which stage of the coronavirus outbreak each country is in at the moment.

This set of code is slightly more advanced, as it involves using for loops to plot multiple lines and taking the logarithm of cumulative cases.

#Select Countries
Countries = ['US', 'Spain', 'Italy', 'Germany', 'France', 'China', 'Japan', 'India', 'Singapore']
x1 = df.index[:]
plt.figure(figsize = (20,10))
for y in Countries:
    plt.plot(x1,np.log(df[y].values[:]))
plt.title('COVID19 Cases in Selected Countries')
plt.ylabel('Log of Cases')
plt.xlabel('Date')
plt.legend(Countries)
plt.show()

As you can see from this graph, the US and Europe are in the process of flattening their respective curves. India’s curve is actually quite worrying. Singapore had initially managed to flatten its curve before being hit by a second wave of infections.
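As a side note, instead of taking np.log of the values yourself, you can plot the raw case counts and let Matplotlib log-scale the y-axis, which keeps the axis labels in actual case numbers. This is an alternative sketch, not what I used above:

#Alternative: plot raw cases on a logarithmic y-axis
plt.figure(figsize=(20,10))
for y in Countries:
    plt.plot(x1, df[y].values[:])
plt.yscale('log') #log-scale the axis instead of transforming the data
plt.title('COVID19 Cases in Selected Countries (Log Scale)')
plt.ylabel('Cases (log scale)')
plt.xlabel('Date')
plt.legend(Countries)
plt.show()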

And many more possibilities

I’m showing just the 3 most pertinent charts arising from this dataset. You could analyse state-level data if you wish, plot death and recovery rates based on the other datasets within the JHU COVID-19 repository, or combine other datasets to perform repeatable exploratory analysis. The possibilities are endless.
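As a taste of what’s possible, here’s a minimal sketch of a naive case fatality rate, assuming the deaths file in the same repository follows the same layout as the confirmed-cases file:

#Naive case fatality rate: latest cumulative deaths / latest cumulative cases
#Assumes the deaths file has the same layout as the confirmed cases file
deaths_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'
df_deaths = pd.read_csv(deaths_url)
deaths_latest = df_deaths.groupby('Country/Region').sum().iloc[:, -1] #latest date column
cases_latest = df.iloc[-1] #last row of our transposed cumulative cases DataFrame
cfr = (deaths_latest / cases_latest).dropna().sort_values(ascending=False)
cfr.head(10)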


Final Insights

What I see in the charts is heartening, as there may be light at the end of the tunnel in the next month. I feel the market rally is pricing in the gradual flattening of the curves in the US and Europe.

While there is hope for an end to this crisis, there are several unknowns still to look out for:

  • The economic fallout of global economies grinding to a halt and unprecedented levels of liquidity provided by Central Banks.
  • A potential second wave of infections once governments start to reopen their borders again.

That said, I’m cautiously optimistic about the markets in the months ahead.

Hope you find this article useful and insightful. Do feel free to comment on my code and suggest other types of analysis 🙂

Happy Hunting,
KK

If you love the articles I write, follow me on Facebook, Twitter, InvestingNote, StocksCafe or subscribe to my blog and never miss another article!



Filed Under: Python Tagged With: Python for Finance


Reader Interactions

Comments

  1. Jason says

    April 11, 2020 at 5:32 pm

    Hi, how have you found the part time masters so far? I am thinking of doing data analytics but still weighing pros and cons like you. Thanks!

    • KK says

      April 11, 2020 at 6:13 pm

      Hi Jason,

Part-time masters has been fun so far. Don’t know about you, but I like being a student lol. The main issue is the need to balance work and studies. I don’t have many commitments, so it’s not as hard for me.

      Hope it helps!

      Regards,
      KK

  2. mc says

    April 11, 2020 at 5:50 pm

    someone already did the log plots, although they have done it via new cases vs total cases, instead of by time on the x-axis; have a look, it is really interesting.

    https://aatishb.com/covidtrends/

    • KK says

      April 11, 2020 at 6:19 pm

      Hi mc,

Thanks for sharing, interesting way to look at it. I think it’s just different ways of viewing the same thing.

      Regards,
      KK
