Chapter 3 Data transformation

Most of the data arrives in a format that is ready for visualization and exploratory data analysis, so little transformation is required. We will convert some variables to factors and create new grouping variables from the continuous ones. For example, we will segment states into “high”, “med”, and “low” groups based on their positive tests and deaths, and then store these groups as factors. We will also derive day and month variables from the date variable, which will allow us to visualize daily and monthly trends.
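The steps described above can be sketched in base R. This is a minimal illustration, not the project's actual code: the data frame `covid`, its column names, the state codes `S1` to `S3`, and all values are made up, and the choice of terciles as the cutoffs for “low”, “med”, and “high” is an assumption, since the text does not say how the groups are defined. "Day" is taken here to mean day of the month, which is also an assumption.

```r
# Toy data with hypothetical states and made-up cumulative positive counts.
covid <- data.frame(
  state    = rep(c("S1", "S2", "S3"), each = 2),
  date     = rep(as.Date(c("2020-03-30", "2020-04-02")), times = 3),
  positive = c(40, 100, 400, 1000, 5000, 10000)
)

# One row per state: the latest (maximum) cumulative positive count.
state_totals <- aggregate(positive ~ state, data = covid, FUN = max)

# Assumption: split states into three groups at the tercile boundaries.
breaks <- quantile(state_totals$positive, probs = c(0, 1/3, 2/3, 1))
state_totals$positive_level <- cut(
  state_totals$positive,
  breaks         = breaks,
  labels         = c("low", "med", "high"),
  include.lowest = TRUE,   # keep the state with the smallest total
  ordered_result = TRUE    # ordered factor: low < med < high
)

# Attach the group to every daily row and convert state to a factor.
covid <- merge(covid, state_totals[, c("state", "positive_level")], by = "state")
covid$state <- factor(covid$state)

# Derive month and day-of-month from the date variable.
covid$month <- as.integer(format(covid$date, "%m"))
covid$day   <- as.integer(format(covid$date, "%d"))
```

The same pattern would apply to deaths. With real data, tied quantiles can produce duplicate breaks, in which case fixed thresholds may be a more robust choice.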

When plotting COVID-19 patterns at the state level, the data for certain states shows a weekly cycle: reported cases and deaths are lower on weekends than on weekdays. To smooth out this cycle, we compute a 7-day average with the rollmean function from the zoo library. This is a standard practice used in other COVID-19 databases. To compare the spread of COVID-19 across states, we will also present graphs on a per capita basis (e.g. Illinois has fewer cases than California, but it also has far fewer people, so it is suffering more from the virus per capita). Where numbers are visibly under-reported, we will remove those data points to avoid misleading results. For example, the reported COVID-19 test counts in New Jersey were below 1,000 until March 23, 2020, and then suddenly increased to 9,000 on March 24. We therefore exclude New Jersey test data before March 24.
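A rough sketch of these three adjustments follows, assuming the zoo package is installed. The data frame `daily`, its columns, the state codes, the population figures, and all counts are hypothetical. Using a trailing window (`align = "right"`) and setting under-reported values to `NA` instead of dropping whole rows are illustrative choices, not necessarily what the analysis does.

```r
library(zoo)

# Toy data: two hypothetical states, eight days each, made-up counts.
dates <- seq(as.Date("2020-03-20"), by = "day", length.out = 8)
daily <- data.frame(
  state     = rep(c("S1", "S2"), each = 8),
  date      = rep(dates, times = 2),
  new_cases = c(7, 7, 7, 7, 7, 0, 0, 7, rep(14, 8)),
  tests     = c(50, 60, 40, 55, 900, 950, 1000, 980, rep(500, 8))
)
population <- c(S1 = 1e6, S2 = 2e6)  # hypothetical populations

# Rows must be in date order within each state before rolling.
daily <- daily[order(daily$state, daily$date), ]

# 7-day trailing average per state; the first six days of each state are NA.
daily$cases_7d <- ave(
  daily$new_cases, daily$state,
  FUN = function(x) rollmean(x, k = 7, fill = NA, align = "right")
)

# Per capita rate: averaged cases per 100,000 residents.
daily$cases_7d_per_100k <-
  daily$cases_7d / unname(population[as.character(daily$state)]) * 1e5

# Blank out visibly under-reported test counts before a cutoff date.
under_reported <- daily$state == "S1" & daily$date < as.Date("2020-03-24")
daily$tests[under_reported] <- NA
```

In this toy data, S2 has more averaged cases than S1 in absolute terms, and its per capita rate is what allows a fair comparison between states of different size. Setting values to `NA` keeps the case and death columns for those dates, and most plotting functions simply skip the missing points.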