Health policy in the United States is complicated, and several forms of healthcare coverage exist, including both federal government-led healthcare policy and private insurance companies. Before making any inference about the relationship between health conditions and health policy, it is important to have a general idea about healthcare economics in the US. Thus, we are interested in getting a sense of health expenditure, including healthcare coverage and healthcare spending, across states. More specifically, the questions are:
- Is there a relationship between healthcare coverage and healthcare spending in the United States?
- Which US states spend the most and which spend the least on healthcare? How does the spending distribution change across geographic regions in the United States?
- Does the relationship between healthcare coverage and healthcare spending in the United States change from 2013 to 2014?
In this case study, we’ll walk you through collecting, importing, cleaning, wrangling, and visualizing the data, using well-established and commonly used packages, including datasets, tidyr, dplyr, ggplot2, and ggrepel.
Image source from US Department of Health and Human Services
We will be using the data from the Henry J Kaiser Family Foundation (KFF).
We have downloaded, re-named, and saved these files in the GitHub repository under the data/KFF/ directory.
Now, before we dig into the data analysis, we need to introduce a set of R packages that we will use to analyze the data.
The tidyverse is “an opinionated collection of R packages designed for data science. All packages share an underlying philosophy and common APIs.”
Another way of putting it is that it’s a set of packages that are useful specifically for data manipulation, exploration and visualization with a common philosophy.
The common philosophy is called “tidy” data. It is a standard way of mapping the meaning of a dataset to its structure.
In tidy data:
- each variable forms a column
- each observation forms a row
- each value has its own cell
Below, we are interested in transforming the table on the right into the table on the left, which is considered “tidy”.
Working with tidy data is useful because it creates a structured way of organizing data values within a data set. This makes the data analysis process more efficient and simplifies the development of data analysis tools that work together. In this way, you can focus on the problem you are investigating, rather than the uninteresting logistics of data.
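To make this concrete, here is a toy sketch (made-up numbers, not the KFF data) of the same values stored in a non-tidy layout and in a tidy layout:
# Not tidy: one column per year, so the "year" variable is hidden in the column names
wide <- data.frame(Location = c("Alabama", "Alaska"),
                   `2013` = c(100, 200),
                   `2014` = c(110, 210),
                   check.names = FALSE)
wide
# Tidy: one column per variable (Location, year, coverage) and one row per observation
tidy <- data.frame(Location = rep(c("Alabama", "Alaska"), 2),
                   year = rep(c(2013, 2014), each = 2),
                   coverage = c(100, 200, 110, 210))
tidy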
Installing the tidyverse
We can install this set of R packages with install.packages("tidyverse") and load it with library(tidyverse).
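If you have never installed these packages before, the one-time installation looks like this (not evaluated here):
install.packages("tidyverse")  # only needs to be run once per machine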
When we load the tidyverse package using library(tidyverse), there are six core R packages that load: ggplot2, dplyr, tidyr, readr, purrr, and tibble.
Here, we load in the tidyverse.
library(tidyverse)
Because these packages all share the “tidy” philosophy, the data analysis workflow is easier as you move from package to package.
Here, we will focus on the readr
, tidyr
and dplyr
R packages to import data, to transform data to the “tidy” format, and to wrangle data.
Next, we will give a brief description of the features in each of these packages.
There are several base R functions that allow you to read data into R, which you may be familiar with, such as read.table(), read.csv(), and read.delim(). Instead of using these, we will use the functions in the readr R package. One of the main reasons is that the functions in readr are around 10x faster.
The readr R package
library(readr)
The main functions in readr
are:
readr functions | Description
---|---
read_delim() | reads in a flat file data with a given character to separate fields
read_csv() | reads in a CSV file
read_tsv() | reads in a file with values separated by tabs
read_lines() | reads only a certain number of lines from the file
read_file() | reads a complete file into a string
write_csv() | writes a data frame to CSV
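For example, write_csv() is the mirror image of read_csv(); a minimal sketch (the file name here is just an illustration):
# Write a built-in data frame out as a CSV file in the working directory
write_csv(mtcars, "mtcars-example.csv")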
A useful cheatsheet for the functions in the readr
package can be found on RStudio’s website:
Let’s try reading in some data. We will begin by reading in the healthcare-coverage.csv
data.
If we want to see what the header of the file looks like, we can use the read_lines() function to peek at the first few lines.
read_lines(file = "./data/KFF/healthcare-coverage.csv", n_max = 10)
[1] "\"Title: Health Insurance Coverage of the Total Population | The Henry J. Kaiser Family Foundation\""
[2] "\"Timeframe: 2013 - 2016\""
[3] "\"Location\",\"2013__Employer\",\"2013__Non-Group\",\"2013__Medicaid\",\"2013__Medicare\",\"2013__Other Public\",\"2013__Uninsured\",\"2013__Total\",\"2014__Employer\",\"2014__Non-Group\",\"2014__Medicaid\",\"2014__Medicare\",\"2014__Other Public\",\"2014__Uninsured\",\"2014__Total\",\"2015__Employer\",\"2015__Non-Group\",\"2015__Medicaid\",\"2015__Medicare\",\"2015__Other Public\",\"2015__Uninsured\",\"2015__Total\",\"2016__Employer\",\"2016__Non-Group\",\"2016__Medicaid\",\"2016__Medicare\",\"2016__Other Public\",\"2016__Uninsured\",\"2016__Total\""
[4] "\"United States\",\"155696900\",\"13816000\",\"54919100\",\"40876300\",\"6295400\",\"41795100\",\"313401200\",\"154347500\",\"19313000\",\"61650400\",\"41896500\",\"5985000\",\"32967500\",\"316159900\",\"155965800\",\"21816500\",\"62384500\",\"43308400\",\"6422300\",\"28965900\",\"318868500\",\"157381500\",\"21884400\",\"62303400\",\"44550200\",\"6192200\",\"28051900\",\"320372000\""
[5] "\"Alabama\",\"2126500\",\"174200\",\"869700\",\"783000\",\"85600\",\"724800\",\"4763900\",\"2202800\",\"288900\",\"891900\",\"718400\",\"143900\",\"522200\",\"4768000\",\"2218000\",\"291500\",\"911400\",\"719100\",\"174600\",\"519400\",\"4833900\",\"2263800\",\"262400\",\"997000\",\"761200\",\"128800\",\"420800\",\"4834100\""
[6] "\"Alaska\",\"364900\",\"24000\",\"95000\",\"55200\",\"60600\",\"102200\",\"702000\",\"345300\",\"26800\",\"130100\",\"55300\",\"37300\",\"100800\",\"695700\",\"355700\",\"22300\",\"128100\",\"60900\",\"47700\",\"90500\",\"705300\",\"324400\",\"20300\",\"145400\",\"68200\",\"55600\",\"96900\",\"710800\""
[7] "\"Arizona\",\"2883800\",\"170800\",\"1346100\",\"842000\",\"N/A\",\"1223000\",\"6603100\",\"2835200\",\"333500\",\"1639400\",\"911100\",\"N/A\",\"827100\",\"6657200\",\"2766500\",\"278400\",\"1711500\",\"949000\",\"189300\",\"844800\",\"6739500\",\"3010700\",\"377000\",\"1468400\",\"1028000\",\"172500\",\"833700\",\"6890200\""
[8] "\"Arkansas\",\"1128800\",\"155600\",\"600800\",\"515200\",\"67600\",\"436800\",\"2904800\",\"1176500\",\"231700\",\"639200\",\"479400\",\"82000\",\"287200\",\"2896000\",\"1293700\",\"200200\",\"641400\",\"484500\",\"63700\",\"268400\",\"2953000\",\"1290900\",\"252900\",\"618600\",\"490000\",\"67500\",\"225500\",\"2945300\""
[9] "\"California\",\"17747300\",\"1986400\",\"8344800\",\"3828500\",\"675400\",\"5594100\",\"38176400\",\"17703700\",\"2778800\",\"9618800\",\"4049000\",\"634400\",\"3916700\",\"38701300\",\"17718300\",\"3444200\",\"10138100\",\"4080100\",\"752700\",\"2980600\",\"39113900\",\"18116200\",\"3195400\",\"9853800\",\"4436000\",\"556100\",\"3030800\",\"39188300\""
[10] "\"Colorado\",\"2852500\",\"426300\",\"697300\",\"549700\",\"118100\",\"654000\",\"5297800\",\"2489400\",\"397900\",\"1053700\",\"619500\",\"214000\",\"602900\",\"5377400\",\"2706000\",\"346900\",\"1036600\",\"708000\",\"148000\",\"475700\",\"5421300\",\"2872600\",\"370000\",\"855800\",\"692400\",\"190100\",\"528400\",\"5509200\""
It looks like the first two lines are descriptive and are not useful. We will tell R to skip reading them using the skip argument in read_csv(). The third line looks like it contains the column names, and the data starts on the fourth line.
coverage <- read_csv("./data/KFF/healthcare-coverage.csv",
skip = 2, col_names = TRUE)
head(coverage)
# A tibble: 6 x 29
Location `2013__Employer` `2013__Non-Grou~ `2013__Medicaid`
<chr> <int> <int> <int>
1 United ~ 155696900 13816000 54919100
2 Alabama 2126500 174200 869700
3 Alaska 364900 24000 95000
4 Arizona 2883800 170800 1346100
5 Arkansas 1128800 155600 600800
6 Califor~ 17747300 1986400 8344800
# ... with 25 more variables: `2013__Medicare` <int>, `2013__Other
# Public` <chr>, `2013__Uninsured` <int>, `2013__Total` <int>,
# `2014__Employer` <int>, `2014__Non-Group` <int>,
# `2014__Medicaid` <int>, `2014__Medicare` <int>, `2014__Other
# Public` <chr>, `2014__Uninsured` <int>, `2014__Total` <int>,
# `2015__Employer` <int>, `2015__Non-Group` <int>,
# `2015__Medicaid` <int>, `2015__Medicare` <int>, `2015__Other
# Public` <chr>, `2015__Uninsured` <int>, `2015__Total` <int>,
# `2016__Employer` <int>, `2016__Non-Group` <dbl>,
# `2016__Medicaid` <int>, `2016__Medicare` <int>, `2016__Other
# Public` <chr>, `2016__Uninsured` <int>, `2016__Total` <int>
tail(coverage)
# A tibble: 6 x 29
Location `2013__Employer` `2013__Non-Grou~ `2013__Medicaid`
<chr> <int> <int> <int>
1 <NA> NA NA NA
2 *Uninsu~ NA NA NA
3 <NA> NA NA NA
4 For exa~ NA NA NA
5 <NA> NA NA NA
6 *N/A*: ~ NA NA NA
# ... with 25 more variables: `2013__Medicare` <int>, `2013__Other
# Public` <chr>, `2013__Uninsured` <int>, `2013__Total` <int>,
# `2014__Employer` <int>, `2014__Non-Group` <int>,
# `2014__Medicaid` <int>, `2014__Medicare` <int>, `2014__Other
# Public` <chr>, `2014__Uninsured` <int>, `2014__Total` <int>,
# `2015__Employer` <int>, `2015__Non-Group` <int>,
# `2015__Medicaid` <int>, `2015__Medicare` <int>, `2015__Other
# Public` <chr>, `2015__Uninsured` <int>, `2015__Total` <int>,
# `2016__Employer` <int>, `2016__Non-Group` <dbl>,
# `2016__Medicaid` <int>, `2016__Medicare` <int>, `2016__Other
# Public` <chr>, `2016__Uninsured` <int>, `2016__Total` <int>
It looks like we now have the right header, but there are a bunch of NAs at the end of the data frame because most of it isn’t useful data. Let’s take a closer look at the last 30 lines.
tail(coverage, n=30)
# A tibble: 30 x 29
Location `2013__Employer` `2013__Non-Grou~ `2013__Medicaid`
<chr> <int> <int> <int>
1 Washing~ 3541600 309000 1026800
2 West Vi~ 841300 42600 382500
3 Wiscons~ 3154500 225300 907600
4 Wyoming 305900 19500 74200
5 Notes NA NA NA
6 The maj~ NA NA NA
7 <NA> NA NA NA
8 "In thi~ NA NA NA
9 <NA> NA NA NA
10 Data ex~ NA NA NA
# ... with 20 more rows, and 25 more variables: `2013__Medicare` <int>,
# `2013__Other Public` <chr>, `2013__Uninsured` <int>,
# `2013__Total` <int>, `2014__Employer` <int>, `2014__Non-Group` <int>,
# `2014__Medicaid` <int>, `2014__Medicare` <int>, `2014__Other
# Public` <chr>, `2014__Uninsured` <int>, `2014__Total` <int>,
# `2015__Employer` <int>, `2015__Non-Group` <int>,
# `2015__Medicaid` <int>, `2015__Medicare` <int>, `2015__Other
# Public` <chr>, `2015__Uninsured` <int>, `2015__Total` <int>,
# `2016__Employer` <int>, `2016__Non-Group` <dbl>,
# `2016__Medicaid` <int>, `2016__Medicare` <int>, `2016__Other
# Public` <chr>, `2016__Uninsured` <int>, `2016__Total` <int>
It looks like there is a line with a string Notes
in it and everything below that line should not be read in. We can use the n_max
argument here.
# First pass: read everything after the two descriptive lines
coverage <- read_csv("./data/KFF/healthcare-coverage.csv",
                     skip = 2, col_names = TRUE)
# Second pass: stop just above the row whose Location is "Notes"
coverage <- read_csv("./data/KFF/healthcare-coverage.csv",
                     skip = 2, col_names = TRUE,
                     n_max = which(coverage$Location == "Notes")-1)
tail(coverage)
# A tibble: 6 x 29
Location `2013__Employer` `2013__Non-Grou~ `2013__Medicaid`
<chr> <int> <int> <int>
1 Vermont 317700 26200 123400
2 Virginia 4661600 364800 773200
3 Washing~ 3541600 309000 1026800
4 West Vi~ 841300 42600 382500
5 Wiscons~ 3154500 225300 907600
6 Wyoming 305900 19500 74200
# ... with 25 more variables: `2013__Medicare` <int>, `2013__Other
# Public` <chr>, `2013__Uninsured` <int>, `2013__Total` <int>,
# `2014__Employer` <int>, `2014__Non-Group` <int>,
# `2014__Medicaid` <int>, `2014__Medicare` <int>, `2014__Other
# Public` <chr>, `2014__Uninsured` <int>, `2014__Total` <int>,
# `2015__Employer` <int>, `2015__Non-Group` <int>,
# `2015__Medicaid` <int>, `2015__Medicare` <int>, `2015__Other
# Public` <chr>, `2015__Uninsured` <int>, `2015__Total` <int>,
# `2016__Employer` <int>, `2016__Non-Group` <dbl>,
# `2016__Medicaid` <int>, `2016__Medicare` <int>, `2016__Other
# Public` <chr>, `2016__Uninsured` <int>, `2016__Total` <int>
That’s better!
Now, because we are also going to want to use the healthcare-spending.csv data, let’s read it in now.
spending <- read_csv("./data/KFF/healthcare-spending.csv",
skip = 2, col_names = TRUE)
spending <- read_csv("./data/KFF/healthcare-spending.csv",
skip = 2, col_names = TRUE,
n_max = which(spending$Location == "Notes")-1)
tail(spending)
# A tibble: 6 x 25
Location `1991__Total He~ `1992__Total He~ `1993__Total He~
<chr> <int> <int> <int>
1 Vermont 1330 1421 1522
2 Virginia 14829 15599 16634
3 Washing~ 12674 13859 14523
4 West Vi~ 4672 5159 5550
5 Wiscons~ 12694 13669 14636
6 Wyoming 1023 1067 1171
# ... with 21 more variables: `1994__Total Health Spending` <int>,
# `1995__Total Health Spending` <int>, `1996__Total Health
# Spending` <int>, `1997__Total Health Spending` <int>, `1998__Total
# Health Spending` <int>, `1999__Total Health Spending` <int>,
# `2000__Total Health Spending` <int>, `2001__Total Health
# Spending` <int>, `2002__Total Health Spending` <int>, `2003__Total
# Health Spending` <int>, `2004__Total Health Spending` <int>,
# `2005__Total Health Spending` <int>, `2006__Total Health
# Spending` <int>, `2007__Total Health Spending` <int>, `2008__Total
# Health Spending` <int>, `2009__Total Health Spending` <int>,
# `2010__Total Health Spending` <int>, `2011__Total Health
# Spending` <int>, `2012__Total Health Spending` <int>, `2013__Total
# Health Spending` <int>, `2014__Total Health Spending` <int>
glimpse() at your data
One last thing in this section. One way to look at our data would be to use head() or tail(), as we just saw. Another one you might have heard of is the str() function. One you might not have heard of is the glimpse() function. It’s used for a special type of object in R called a tibble. Let’s read the help file to learn more.
?tibble::tibble
It’s kind of like print()
where it shows you columns running down the page. Let’s try it out. If we look at our data, say the coverage
data frame, we see that it is not “tidy”:
glimpse(coverage)
Observations: 52
Variables: 29
$ Location <chr> "United States", "Alabama", "Alaska", "Ar...
$ `2013__Employer` <int> 155696900, 2126500, 364900, 2883800, 1128...
$ `2013__Non-Group` <int> 13816000, 174200, 24000, 170800, 155600, ...
$ `2013__Medicaid` <int> 54919100, 869700, 95000, 1346100, 600800,...
$ `2013__Medicare` <int> 40876300, 783000, 55200, 842000, 515200, ...
$ `2013__Other Public` <chr> "6295400", "85600", "60600", "N/A", "6760...
$ `2013__Uninsured` <int> 41795100, 724800, 102200, 1223000, 436800...
$ `2013__Total` <int> 313401200, 4763900, 702000, 6603100, 2904...
$ `2014__Employer` <int> 154347500, 2202800, 345300, 2835200, 1176...
$ `2014__Non-Group` <int> 19313000, 288900, 26800, 333500, 231700, ...
$ `2014__Medicaid` <int> 61650400, 891900, 130100, 1639400, 639200...
$ `2014__Medicare` <int> 41896500, 718400, 55300, 911100, 479400, ...
$ `2014__Other Public` <chr> "5985000", "143900", "37300", "N/A", "820...
$ `2014__Uninsured` <int> 32967500, 522200, 100800, 827100, 287200,...
$ `2014__Total` <int> 316159900, 4768000, 695700, 6657200, 2896...
$ `2015__Employer` <int> 155965800, 2218000, 355700, 2766500, 1293...
$ `2015__Non-Group` <int> 21816500, 291500, 22300, 278400, 200200, ...
$ `2015__Medicaid` <int> 62384500, 911400, 128100, 1711500, 641400...
$ `2015__Medicare` <int> 43308400, 719100, 60900, 949000, 484500, ...
$ `2015__Other Public` <chr> "6422300", "174600", "47700", "189300", "...
$ `2015__Uninsured` <int> 28965900, 519400, 90500, 844800, 268400, ...
$ `2015__Total` <int> 318868500, 4833900, 705300, 6739500, 2953...
$ `2016__Employer` <int> 157381500, 2263800, 324400, 3010700, 1290...
$ `2016__Non-Group` <dbl> 21884400, 262400, 20300, 377000, 252900, ...
$ `2016__Medicaid` <int> 62303400, 997000, 145400, 1468400, 618600...
$ `2016__Medicare` <int> 44550200, 761200, 68200, 1028000, 490000,...
$ `2016__Other Public` <chr> "6192200", "128800", "55600", "172500", "...
$ `2016__Uninsured` <int> 28051900, 420800, 96900, 833700, 225500, ...
$ `2016__Total` <int> 320372000, 4834100, 710800, 6890200, 2945...
The datasets R package
Since our goal is to get a sense of the health expenditure, including healthcare coverage and healthcare spending, across states, it would be nice to add some information about each state, namely the state abbreviation and state region (e.g. Northeast, South, etc.).
For this we use the state dataset in the datasets R package.
Before we begin, let’s look at what states are there:
unique(coverage$Location)
[1] "United States" "Alabama" "Alaska"
[4] "Arizona" "Arkansas" "California"
[7] "Colorado" "Connecticut" "Delaware"
[10] "District of Columbia" "Florida" "Georgia"
[13] "Hawaii" "Idaho" "Illinois"
[16] "Indiana" "Iowa" "Kansas"
[19] "Kentucky" "Louisiana" "Maine"
[22] "Maryland" "Massachusetts" "Michigan"
[25] "Minnesota" "Mississippi" "Missouri"
[28] "Montana" "Nebraska" "Nevada"
[31] "New Hampshire" "New Jersey" "New Mexico"
[34] "New York" "North Carolina" "North Dakota"
[37] "Ohio" "Oklahoma" "Oregon"
[40] "Pennsylvania" "Rhode Island" "South Carolina"
[43] "South Dakota" "Tennessee" "Texas"
[46] "Utah" "Vermont" "Virginia"
[49] "Washington" "West Virginia" "Wisconsin"
[52] "Wyoming"
We see there are more than 50 states because “United States” and “District of Columbia” are both included.
Let’s look what states are inside the state
dataset.
library(datasets)
data(state)
unique(state.name)
[1] "Alabama" "Alaska" "Arizona" "Arkansas"
[5] "California" "Colorado" "Connecticut" "Delaware"
[9] "Florida" "Georgia" "Hawaii" "Idaho"
[13] "Illinois" "Indiana" "Iowa" "Kansas"
[17] "Kentucky" "Louisiana" "Maine" "Maryland"
[21] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
[25] "Missouri" "Montana" "Nebraska" "Nevada"
[29] "New Hampshire" "New Jersey" "New Mexico" "New York"
[33] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
[37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina"
[41] "South Dakota" "Tennessee" "Texas" "Utah"
[45] "Vermont" "Virginia" "Washington" "West Virginia"
[49] "Wisconsin" "Wyoming"
Ah, ok. So let’s start by dealing with DC as a special case.
# Append the District of Columbia to the built-in state vectors
state.abb <- c(state.abb, "DC")
state.region <- as.factor(c(as.character(state.region), "South"))
state.name <- c(state.name, "District of Columbia")
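As a quick check (a minimal sketch), the appended entries let us look up DC just like any other state:
# Look up the abbreviation and region for DC via match()
state.abb[match("District of Columbia", state.name)]
state.region[match("District of Columbia", state.name)]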
We will deal with the “United States” in the next section.
A subset of the data analysis process can be thought of as importing the data, tidying it, and then transforming it, where each of these steps needs its own tools and software to complete.
After we import the data into R, if we are going to take advantage of the “tidyverse”, this means we need to transform the data into a form that is “tidy”. If you recall, in tidy data:
- each variable forms a column
- each observation forms a row
- each value has its own cell
For example, consider a data set of stock prices with three variables: time, company name, and stock price. The same values could be organized in more than one way, for instance with one column per company and one row per time point, or with the companies along the rows instead.
In both cases, the data is the same, but the structure is different. This can be frustrating to deal with as an analyst because the meaning of the values (rows and columns) in the two data sets is different. Providing a standardized way of organizing values within a data set alleviates a major portion of this frustration.
For motivation, a tidy version of this stock data would have three columns representing the three variables (time, company name, and stock price), and every row would contain one stock price from a particular time and for a specific company.
If we consider our coverage
dataframe, we see it is also not in a tidy format. Each row contains information about the coverage level by Location
across years and types of coverage.
coverage[1:5, 1:5]
# A tibble: 5 x 5
Location `2013__Employer` `2013__Non-Grou~ `2013__Medicaid`
<chr> <int> <int> <int>
1 United ~ 155696900 13816000 54919100
2 Alabama 2126500 174200 869700
3 Alaska 364900 24000 95000
4 Arizona 2883800 170800 1346100
5 Arkansas 1128800 155600 600800
# ... with 1 more variable: `2013__Medicare` <int>
Now, let’s use the tidyr
R package to transform our data into a tidy format.
The tidyr R package
What is the tidyr R package?
tidyr is an R package that transforms data sets to a tidy format.
This package is installed and loaded when you load the tidyverse
using library(tidyverse)
. However, you can also just load the library by itself.
library(tidyr)
The main functions in tidyr
are:
tidyr functions | Description
---|---
gather() | takes multiple columns and gathers them into key-value pairs, making “wide” data longer
separate() | turns a single character column into multiple columns
spread() | spreads rows into multiple columns, transforming “long” data into “wide” format
We’ll explore what it means to go between a “wide” and “long” data format using gather()
, separate()
, and spread()
.
A cheatsheet for the functions in the tidyr package can be found on RStudio’s website:
gather()
Let’s start by looking at the gather()
help file
?gather
This function gathers multiple columns and collapses them into new key-value pairs. This transforms data from a wide format into a long format.
- key is the name of the new column that you are creating, which contains the values of the column headings that you are gathering
- value is the name of the new column that will contain the values themselves
For example, here we create the columns year_type and tot_coverage. We also want to keep the Location column as it is because it also contains observation-level data.
coverage <- gather(coverage, "year_type", "tot_coverage", -Location)
coverage
# A tibble: 1,456 x 3
Location year_type tot_coverage
<chr> <chr> <chr>
1 United States 2013__Employer 155696900
2 Alabama 2013__Employer 2126500
3 Alaska 2013__Employer 364900
4 Arizona 2013__Employer 2883800
5 Arkansas 2013__Employer 1128800
6 California 2013__Employer 17747300
7 Colorado 2013__Employer 2852500
8 Connecticut 2013__Employer 2030500
9 Delaware 2013__Employer 473700
10 District of Columbia 2013__Employer 324300
# ... with 1,446 more rows
Now we see each row contains one observation: a Location, a year_type and a tot_coverage value. It would be nice to separate out the information in the year_type column into two columns. We can apply the same technique to the healthcare spending dataset.
Let’s do the same for the spending data. In this case we will use year and tot_spending for the key and value. We also want to keep Location like before.
spending <- gather(spending, "year", "tot_spending", -Location)
spending
# A tibble: 1,248 x 3
Location year tot_spending
<chr> <chr> <int>
1 United States 1991__Total Health Spending 675896
2 Alabama 1991__Total Health Spending 10393
3 Alaska 1991__Total Health Spending 1458
4 Arizona 1991__Total Health Spending 9269
5 Arkansas 1991__Total Health Spending 5632
6 California 1991__Total Health Spending 81438
7 Colorado 1991__Total Health Spending 8460
8 Connecticut 1991__Total Health Spending 10950
9 Delaware 1991__Total Health Spending 1938
10 District of Columbia 1991__Total Health Spending 2800
# ... with 1,238 more rows
We will explore how to do that in the Data Wrangling section below. For now let’s learn more about the tidyr
package.
spread()
In contrast to gathering multiple columns into key-value pairs, we can spread a key-value pair across multiple columns.
The function spread()
does just that. It transforms data from a long format into a wide format.
- key is the name of the column in your data set that contains the values of the column headings that you are spreading across multiple columns
- value is the name of the column that contains the values for the multiple columns
spread(coverage, year_type, tot_coverage)
# A tibble: 52 x 29
Location `2013__Employer` `2013__Medicaid` `2013__Medicare`
<chr> <chr> <chr> <chr>
1 Alabama 2126500 869700 783000
2 Alaska 364900 95000 55200
3 Arizona 2883800 1346100 842000
4 Arkansas 1128800 600800 515200
5 Califor~ 17747300 8344800 3828500
6 Colorado 2852500 697300 549700
7 Connect~ 2030500 532000 475300
8 Delaware 473700 192700 141300
9 Distric~ 324300 174900 59900
10 Florida 8023400 3190900 3108800
# ... with 42 more rows, and 25 more variables: `2013__Non-Group` <chr>,
# `2013__Other Public` <chr>, `2013__Total` <chr>,
# `2013__Uninsured` <chr>, `2014__Employer` <chr>,
# `2014__Medicaid` <chr>, `2014__Medicare` <chr>,
# `2014__Non-Group` <chr>, `2014__Other Public` <chr>,
# `2014__Total` <chr>, `2014__Uninsured` <chr>, `2015__Employer` <chr>,
# `2015__Medicaid` <chr>, `2015__Medicare` <chr>,
# `2015__Non-Group` <chr>, `2015__Other Public` <chr>,
# `2015__Total` <chr>, `2015__Uninsured` <chr>, `2016__Employer` <chr>,
# `2016__Medicaid` <chr>, `2016__Medicare` <chr>,
# `2016__Non-Group` <chr>, `2016__Other Public` <chr>,
# `2016__Total` <chr>, `2016__Uninsured` <chr>
In the real world, analyzing data rarely involves data that can be easily imported and ready for analysis. According to Wikipedia:
Data munging or data wrangling is loosely the process of manually converting or mapping data from one “raw” form into another format that allows for more convenient consumption of the data with the help of semi-automated tools.
As you will see in class, one of the most time-consuming aspects of the data analysis process is “data wrangling”. This is also a trendy term for cleaning up a messy data set.
R provides an incredibly powerful and flexible language for data wrangling. However, the syntax can be somewhat hard to get used to. We will therefore introduce a package that makes the syntax much more like the English language: dplyr.
The dplyr R package
What is the dplyr R package?
dplyr is a powerful R package to transform and summarize tabular data with rows and columns.
The package contains a set of functions (or “verbs”) to perform common data manipulation operations such as filtering for rows, selecting specific columns, re-ordering rows, adding new columns and summarizing data.
In addition, dplyr contains a useful function to perform another common task, which is the “split-apply-combine” concept. We will discuss that in a little bit.
How does the dplyr R package compare with base R functions?
If you are familiar with R, you are probably familiar with base R functions such as split()
, subset()
, apply()
, sapply()
, lapply()
, tapply()
and aggregate()
. Compared to base functions in R, the functions in dplyr
are easier to work with, are more consistent in the syntax and are targeted for data analysis around data frames instead of just vectors.
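As a small illustration of the difference in style (a sketch using the coverage data already in memory, which at this point has Location, year_type and tot_coverage columns):
# Base R: index columns by name
head(coverage[, c("Location", "tot_coverage")])
# dplyr: the select() verb reads more like a sentence
head(select(coverage, Location, tot_coverage))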
The important dplyr
verbs to remember are:
dplyr verbs | Description
---|---
select() | select columns
filter() | filter rows
arrange() | re-order or arrange rows
mutate() | create new columns
summarize() | summarize values
group_by() | allows for group operations in the “split-apply-combine” concept
Before we go any further, let’s introduce the pipe operator: %>%. dplyr imports this operator from another package (magrittr); see its help file for more details. This operator allows you to pipe the output from one function to the input of another function. Instead of nesting functions (reading from the inside to the outside), the idea of piping is to read the functions from left to right.
Returning to the stocks data described above, we could pipe a stocks data frame to the function that gathers multiple columns into key-value pairs.
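Since that stocks table is not reproduced here, here is a minimal sketch with a made-up toy data frame (the companies X and Y and their prices are hypothetical):
# A toy "stocks" data frame in wide format: one column per company
stocks <- data.frame(time = as.Date("2009-01-01") + 0:2,
                     X = c(1.20, 1.35, 1.10),
                     Y = c(2.10, 2.05, 2.30))
# Pipe the data frame into gather(), collapsing the company columns
# into key-value pairs (company, price) while keeping time as is
stocks %>%
  gather("company", "price", -time)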
dplyr verbs in action: separate(), unite(), …
First, let’s separate the year_type column in the coverage dataset into two columns: year and type (the type of health coverage).
To do this, we will use the separate()
function in the tidyr
package.
Note:
- separate() = separate one column into multiple columns
- unite() = unite multiple columns into one
coverage %>%
separate(year_type, sep="__",
into=c("year", "type"))
# A tibble: 1,456 x 4
Location year type tot_coverage
<chr> <chr> <chr> <chr>
1 United States 2013 Employer 155696900
2 Alabama 2013 Employer 2126500
3 Alaska 2013 Employer 364900
4 Arizona 2013 Employer 2883800
5 Arkansas 2013 Employer 1128800
6 California 2013 Employer 17747300
7 Colorado 2013 Employer 2852500
8 Connecticut 2013 Employer 2030500
9 Delaware 2013 Employer 473700
10 District of Columbia 2013 Employer 324300
# ... with 1,446 more rows
We see that we now have two columns, except the year column is stored as a character. If we look at the help file ?separate, we see we can use the convert=TRUE argument to convert the character to an integer.
coverage <-
coverage %>%
separate(year_type, sep="__",
into=c("year", "type"),
convert = TRUE)
coverage
# A tibble: 1,456 x 4
Location year type tot_coverage
<chr> <int> <chr> <chr>
1 United States 2013 Employer 155696900
2 Alabama 2013 Employer 2126500
3 Alaska 2013 Employer 364900
4 Arizona 2013 Employer 2883800
5 Arkansas 2013 Employer 1128800
6 California 2013 Employer 17747300
7 Colorado 2013 Employer 2852500
8 Connecticut 2013 Employer 2030500
9 Delaware 2013 Employer 473700
10 District of Columbia 2013 Employer 324300
# ... with 1,446 more rows
Next, we see that the tot_coverage column is also a character. Gah!
Let’s fix that. We can use the mutate_at() function to do this. We are asking R to take the tot_coverage column, convert it to an integer, and then replace the old column with the new converted column.
coverage <-
coverage %>%
mutate_at("tot_coverage", as.integer)
# Add the abbreviation of States
coverage$abb <- state.abb[match(coverage$Location, state.name)]
coverage$region <- state.region[match(coverage$Location, state.name)]
coverage
# A tibble: 1,456 x 6
Location year type tot_coverage abb region
<chr> <int> <chr> <int> <chr> <fct>
1 United States 2013 Employer 155696900 <NA> <NA>
2 Alabama 2013 Employer 2126500 AL South
3 Alaska 2013 Employer 364900 AK West
4 Arizona 2013 Employer 2883800 AZ West
5 Arkansas 2013 Employer 1128800 AR South
6 California 2013 Employer 17747300 CA West
7 Colorado 2013 Employer 2852500 CO West
8 Connecticut 2013 Employer 2030500 CT Northeast
9 Delaware 2013 Employer 473700 DE South
10 District of Columbia 2013 Employer 324300 DC South
# ... with 1,446 more rows
The coverage data looks good now. We see that there are different years and different types of healthcare coverage, and we have also linked the coverage data with our location information (the state abbreviations and regions) in the code above.
What is in the coverage dataset?
table(coverage$type, coverage$year)
2013 2014 2015 2016
Employer 52 52 52 52
Medicaid 52 52 52 52
Medicare 52 52 52 52
Non-Group 52 52 52 52
Other Public 52 52 52 52
Total 52 52 52 52
Uninsured 52 52 52 52
separate() and unite() in the spending dataset
Next, we will look at the spending data. We see the year column has information that we do not want. We only care about the year.
spending
# A tibble: 1,248 x 3
Location year tot_spending
<chr> <chr> <int>
1 United States 1991__Total Health Spending 675896
2 Alabama 1991__Total Health Spending 10393
3 Alaska 1991__Total Health Spending 1458
4 Arizona 1991__Total Health Spending 9269
5 Arkansas 1991__Total Health Spending 5632
6 California 1991__Total Health Spending 81438
7 Colorado 1991__Total Health Spending 8460
8 Connecticut 1991__Total Health Spending 10950
9 Delaware 1991__Total Health Spending 1938
10 District of Columbia 1991__Total Health Spending 2800
# ... with 1,238 more rows
Let’s use the separate() function with convert=TRUE to separate the year column into two columns. Then, we introduce another dplyr action verb: select().
The two most basic functions are select() and filter(), which select columns and filter rows, respectively.
select()
In the separate()
function, we create two new columns called year
and name
. Then, we ask to return all the columns, except name
. To select all the columns except a specific column, use the “-” (subtraction) operator (also known as negative indexing).
spending <-
spending %>%
separate(year, sep="__", into=c("year", "name"), convert = TRUE) %>%
select(-name)
spending
# A tibble: 1,248 x 3
Location year tot_spending
<chr> <int> <int>
1 United States 1991 675896
2 Alabama 1991 10393
3 Alaska 1991 1458
4 Arizona 1991 9269
5 Arkansas 1991 5632
6 California 1991 81438
7 Colorado 1991 8460
8 Connecticut 1991 10950
9 Delaware 1991 1938
10 District of Columbia 1991 2800
# ... with 1,238 more rows
The function select()
is much more powerful though. To select a range of columns by name, use the “:” (colon) operator
coverage %>%
select(year:type)
# A tibble: 1,456 x 2
year type
<int> <chr>
1 2013 Employer
2 2013 Employer
3 2013 Employer
4 2013 Employer
5 2013 Employer
6 2013 Employer
7 2013 Employer
8 2013 Employer
9 2013 Employer
10 2013 Employer
# ... with 1,446 more rows
To select all columns that start with the character string “t”, use the function starts_with()
coverage %>%
select(starts_with("t"))
# A tibble: 1,456 x 2
type tot_coverage
<chr> <int>
1 Employer 155696900
2 Employer 2126500
3 Employer 364900
4 Employer 2883800
5 Employer 1128800
6 Employer 17747300
7 Employer 2852500
8 Employer 2030500
9 Employer 473700
10 Employer 324300
# ... with 1,446 more rows
Some additional options to select columns based on specific criteria include:
- ends_with() = select columns that end with a character string
- contains() = select columns that contain a character string
- matches() = select columns that match a regular expression
- one_of() = select column names that are from a group of names
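For instance (a minimal sketch using the coverage data already in memory), these helpers can be mixed with plain column names:
# Keep Location plus any column whose name ends with "coverage";
# in the current data that matches only tot_coverage
coverage %>%
  select(Location, ends_with("coverage"))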
filter()
Let’s say we want to know how many people had health insurance coverage in Maryland. First, we can filter the rows for the state of Maryland.
coverage %>%
filter(Location == "Maryland")
# A tibble: 28 x 6
Location year type tot_coverage abb region
<chr> <int> <chr> <int> <chr> <fct>
1 Maryland 2013 Employer 3172400 MD South
2 Maryland 2013 Non-Group 320800 MD South
3 Maryland 2013 Medicaid 889800 MD South
4 Maryland 2013 Medicare 751500 MD South
5 Maryland 2013 Other Public 124400 MD South
6 Maryland 2013 Uninsured 682000 MD South
7 Maryland 2013 Total 5940900 MD South
8 Maryland 2014 Employer 3558800 MD South
9 Maryland 2014 Non-Group 361700 MD South
10 Maryland 2014 Medicaid 807900 MD South
# ... with 18 more rows
Note: you can use the boolean operators (e.g. >
, <
, >=
, <=
, !=
, %in%
) to create logical tests.
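As an aside, %in% is handy for matching several values at once; a minimal sketch:
# Keep the Uninsured rows for a handful of states using %in%
coverage %>%
  filter(Location %in% c("Maryland", "Virginia", "Delaware"),
         type == "Uninsured")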
For example, if we wanted only years after 2014, we can add a second criteria:
coverage %>%
filter(Location == "Maryland",
year > 2014)
# A tibble: 14 x 6
Location year type tot_coverage abb region
<chr> <int> <chr> <int> <chr> <fct>
1 Maryland 2015 Employer 3431400 MD South
2 Maryland 2015 Non-Group 371400 MD South
3 Maryland 2015 Medicaid 856800 MD South
4 Maryland 2015 Medicare 705500 MD South
5 Maryland 2015 Other Public 141200 MD South
6 Maryland 2015 Uninsured 394300 MD South
7 Maryland 2015 Total 5900500 MD South
8 Maryland 2016 Employer 3210600 MD South
9 Maryland 2016 Non-Group 443000 MD South
10 Maryland 2016 Medicaid 926300 MD South
11 Maryland 2016 Medicare 827000 MD South
12 Maryland 2016 Other Public 153800 MD South
13 Maryland 2016 Uninsured 372100 MD South
14 Maryland 2016 Total 5932800 MD South
coverage %>%
filter(Location == "Maryland",
type == "Uninsured")
# A tibble: 4 x 6
Location year type tot_coverage abb region
<chr> <int> <chr> <int> <chr> <fct>
1 Maryland 2013 Uninsured 682000 MD South
2 Maryland 2014 Uninsured 343000 MD South
3 Maryland 2015 Uninsured 394300 MD South
4 Maryland 2016 Uninsured 372100 MD South
What happened between 2013 and 2014?
arrange()
Now, let’s say we want to see which states have the smallest or largest amounts of coverage. To arrange (or re-order) rows by a particular column, such as the total coverage, list the name of the column you want to arrange the rows by.
coverage %>%
arrange(tot_coverage)
# A tibble: 1,456 x 6
Location year type tot_coverage abb region
<chr> <int> <chr> <int> <chr> <fct>
1 Vermont 2013 Other Public 9900 VT Northeast
2 Vermont 2014 Other Public 9900 VT Northeast
3 Rhode Island 2013 Other Public 12100 RI Northeast
4 Wyoming 2014 Other Public 13600 WY West
5 Delaware 2013 Other Public 13800 DE South
6 Vermont 2016 Other Public 14600 VT Northeast
7 New Hampshire 2013 Other Public 15100 NH Northeast
8 Wyoming 2016 Other Public 16400 WY West
9 Vermont 2015 Other Public 16500 VT Northeast
10 North Dakota 2014 Other Public 17300 ND North Central
# ... with 1,446 more rows
Which states have the largest Employer type of healthcare coverage?
Hint: use the desc() function inside of arrange() to order rows in descending order.
coverage %>%
filter(Location != "United States", year == 2016, type == "Employer") %>%
arrange(desc(tot_coverage)) %>%
head(n=3)
# A tibble: 3 x 6
Location year type tot_coverage abb region
<chr> <int> <chr> <int> <chr> <fct>
1 California 2016 Employer 18116200 CA West
2 Texas 2016 Employer 13607200 TX South
3 New York 2016 Employer 9767500 NY Northeast
join()
Here, we’re going to demonstrate how to join two datasets using the family of join() functions, including left_join(), right_join(), inner_join(), …
Up until now, we have been working with the coverage and spending datasets separately. Next, we will combine these together.
If we want to combine, say, coverage
and spending
together, we have to decide a few things. Both share a Location
column and a year
column. However, the range of years
is different between datasets.
table(coverage$year)
2013 2014 2015 2016
364 364 364 364
table(spending$year)
1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005
52 52 52 52 52 52 52 52 52 52 52 52 52 52 52
2006 2007 2008 2009 2010 2011 2012 2013 2014
52 52 52 52 52 52 52 52 52
Do we want a dataset with all the years available, or only a portion of them? There is spending information from 1991-2014, but coverage information only from 2013-2016.
dplyr
has a list of join
functions that are useful to combine datasets. To read more about them, Jenny Bryan has created a nice cheatsheet.
If we look at the help file
?dplyr::join
We see there are several options for us to pick from. Let’s try one out. We’ll start with left_join()
and see what that does.
hc <- left_join(coverage, spending, by = c("Location", "year"))
head(hc)
# A tibble: 6 x 7
Location year type tot_coverage abb region tot_spending
<chr> <int> <chr> <int> <chr> <fct> <int>
1 United States 2013 Employer 155696900 <NA> <NA> 2435624
2 Alabama 2013 Employer 2126500 AL South 33788
3 Alaska 2013 Employer 364900 AK West 7684
4 Arizona 2013 Employer 2883800 AZ West 41481
5 Arkansas 2013 Employer 1128800 AR South 20500
6 California 2013 Employer 17747300 CA West 278168
tail(hc)
# A tibble: 6 x 7
Location year type tot_coverage abb region tot_spending
<chr> <int> <chr> <int> <chr> <fct> <int>
1 Vermont 2016 Total 622500 VT Northeast NA
2 Virginia 2016 Total 8175000 VA South NA
3 Washington 2016 Total 7297300 WA West NA
4 West Virginia 2016 Total 1814100 WV South NA
5 Wisconsin 2016 Total 5766100 WI North Central NA
6 Wyoming 2016 Total 571700 WY West NA
What did it do? We see that the new hc
dataset includes all the years from 2013-2016 (as that is the range of years in coverage
), but because the spending
dataset only goes to 2014, the tot_spending
is reported as NA for years 2015 and 2016.
What about right_join()?
hc <- right_join(coverage, spending, by = c("Location", "year"))
head(hc)
# A tibble: 6 x 7
Location year type tot_coverage abb region tot_spending
<chr> <int> <chr> <int> <chr> <fct> <int>
1 United States 1991 <NA> NA <NA> <NA> 675896
2 Alabama 1991 <NA> NA <NA> <NA> 10393
3 Alaska 1991 <NA> NA <NA> <NA> 1458
4 Arizona 1991 <NA> NA <NA> <NA> 9269
5 Arkansas 1991 <NA> NA <NA> <NA> 5632
6 California 1991 <NA> NA <NA> <NA> 81438
tail(hc)
# A tibble: 6 x 7
Location year type tot_coverage abb region tot_spending
<chr> <int> <chr> <int> <chr> <fct> <int>
1 Wyoming 2014 Non-Group 31600 WY West 4856
2 Wyoming 2014 Medicaid 54900 WY West 4856
3 Wyoming 2014 Medicare 65600 WY West 4856
4 Wyoming 2014 Other Public 13600 WY West 4856
5 Wyoming 2014 Uninsured 58100 WY West 4856
6 Wyoming 2014 Total 572000 WY West 4856
Here, we see every row in the spending dataset is there, but with NAs for the years that there was no coverage data.
There is also a full_join()
and inner_join()
. If we want the intersection of years
from coverage
and spending
(meaning only 2013 and 2014), we should use inner_join()
.
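For completeness, a full_join() would keep every Location-year combination found in either dataset; a quick sketch (we do not use this result below):
# full_join() keeps all rows from both datasets (years 1991-2016),
# filling in NA wherever one side has no matching row
hc_full <- full_join(coverage, spending, by = c("Location", "year"))
dim(hc_full)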
hc <- inner_join(coverage, spending, by = c("Location", "year"))
head(hc)
# A tibble: 6 x 7
Location year type tot_coverage abb region tot_spending
<chr> <int> <chr> <int> <chr> <fct> <int>
1 United States 2013 Employer 155696900 <NA> <NA> 2435624
2 Alabama 2013 Employer 2126500 AL South 33788
3 Alaska 2013 Employer 364900 AK West 7684
4 Arizona 2013 Employer 2883800 AZ West 41481
5 Arkansas 2013 Employer 1128800 AR South 20500
6 California 2013 Employer 17747300 CA West 278168
tail(hc)
# A tibble: 6 x 7
Location year type tot_coverage abb region tot_spending
<chr> <int> <chr> <int> <chr> <fct> <int>
1 Vermont 2014 Total 617000 VT Northeast 6389
2 Virginia 2014 Total 8258800 VA South 62847
3 Washington 2014 Total 7085000 WA West 55819
4 West Virginia 2014 Total 1825500 WV South 17491
5 Wisconsin 2014 Total 5747200 WI North Central 50109
6 Wyoming 2014 Total 572000 WY West 4856
Yes, that’s what we want!
Next, if we are only interested in looking at US states, we can remove the rows corresponding to the Location == "United States"
hc <- hc %>%
filter(Location != "United States")
Another problem is that inside our hc
dataset, we have seen there are multiple types
of healthcare coverage
table(hc$type)
Employer Medicaid Medicare Non-Group Other Public
102 102 102 102 102
Total Uninsured
102 102
The Total type is not really a formal type of healthcare coverage; it really just represents the total number of people in the state. This is useful information, and we can include it as a column called tot_pop. How can we do this?
Well, one way would be to use the join
functions again in dplyr
.
pop <- hc %>%
filter(type == "Total") %>%
select(Location, year, tot_coverage)
pop
# A tibble: 102 x 3
Location year tot_coverage
<chr> <int> <int>
1 Alabama 2013 4763900
2 Alaska 2013 702000
3 Arizona 2013 6603100
4 Arkansas 2013 2904800
5 California 2013 38176400
6 Colorado 2013 5297800
7 Connecticut 2013 3578900
8 Delaware 2013 909300
9 District of Columbia 2013 652100
10 Florida 2013 19429000
# ... with 92 more rows
hc <- hc %>%
filter(type != "Total") %>%
left_join(pop, by = c("Location", "year")) %>%
rename(tot_coverage = tot_coverage.x, tot_pop = tot_coverage.y)
hc
# A tibble: 612 x 8
Location year type tot_coverage abb region tot_spending tot_pop
<chr> <int> <chr> <int> <chr> <fct> <int> <int>
1 Alabama 2013 Emplo~ 2126500 AL South 33788 4.76e6
2 Alaska 2013 Emplo~ 364900 AK West 7684 7.02e5
3 Arizona 2013 Emplo~ 2883800 AZ West 41481 6.60e6
4 Arkansas 2013 Emplo~ 1128800 AR South 20500 2.90e6
5 California 2013 Emplo~ 17747300 CA West 278168 3.82e7
6 Colorado 2013 Emplo~ 2852500 CO West 34090 5.30e6
7 Connecticut 2013 Emplo~ 2030500 CT North~ 34223 3.58e6
8 Delaware 2013 Emplo~ 473700 DE South 9038 9.09e5
9 District o~ 2013 Emplo~ 324300 DC South 7443 6.52e5
10 Florida 2013 Emplo~ 8023400 FL South 150547 1.94e7
# ... with 602 more rows
We can check to make sure that Total is no longer listed as a type of healthcare coverage.
table(hc$type)
Employer Medicaid Medicare Non-Group Other Public
102 102 102 102 102
Uninsured
102
We are now ready to try answering our first question that we asked:
- Is there a relationship between healthcare coverage and healthcare spending in the United States?
Let’s pick out the type==Employer
and year==2013
.
hc.employer.2013 <- hc %>%
filter(type == "Employer", year == "2013")
plot(hc.employer.2013$tot_spending,
hc.employer.2013$tot_coverage, log = "xy",
xlab = "spending", ylab = "coverage")
We see there is a strong relationship. However, healthcare coverage and spending are both also strongly related to population size:
par(mfrow=c(1,2))
plot(hc.employer.2013$tot_pop,
hc.employer.2013$tot_coverage, log = "xy",
xlab = "population size", ylab = "coverage")
plot(hc.employer.2013$tot_pop,
hc.employer.2013$tot_spending, log = "xy",
xlab = "population size", ylab = "spending")
This means we need to take into account the population size of each state when we are comparing healthcare coverage and spending.
mutate()
Instead of the absolute number of people who are covered (tot_coverage), we will calculate the proportion of people who are covered in each state, year and type.
For this, we will use the mutate()
function in dplyr
.
hc <- hc %>%
mutate(prop_coverage = tot_coverage/tot_pop)
hc
# A tibble: 612 x 9
Location year type tot_coverage abb region tot_spending tot_pop
<chr> <int> <chr> <int> <chr> <fct> <int> <int>
1 Alabama 2013 Empl~ 2126500 AL South 33788 4.76e6
2 Alaska 2013 Empl~ 364900 AK West 7684 7.02e5
3 Arizona 2013 Empl~ 2883800 AZ West 41481 6.60e6
4 Arkansas 2013 Empl~ 1128800 AR South 20500 2.90e6
5 Califor~ 2013 Empl~ 17747300 CA West 278168 3.82e7
6 Colorado 2013 Empl~ 2852500 CO West 34090 5.30e6
7 Connect~ 2013 Empl~ 2030500 CT North~ 34223 3.58e6
8 Delaware 2013 Empl~ 473700 DE South 9038 9.09e5
9 Distric~ 2013 Empl~ 324300 DC South 7443 6.52e5
10 Florida 2013 Empl~ 8023400 FL South 150547 1.94e7
# ... with 602 more rows, and 1 more variable: prop_coverage <dbl>
We need to add another column to our dataset. We will add the spending per capita (or spending per person) in dollars and name this column spending_capita
.
The tot_spending
column is reported in millions (1e6). Therefore, to calculate spending_capita
we will need to adjust for this scaling factor to report it on the original scale (just dollars) and then divide by tot_pop
.
hc <- hc %>%
mutate(spending_capita = (tot_spending*1e6) / tot_pop)
hc %>% select(prop_coverage, spending_capita)
# A tibble: 612 x 2
prop_coverage spending_capita
<dbl> <dbl>
1 0.446 7093.
2 0.520 10946.
3 0.437 6282.
4 0.389 7057.
5 0.465 7286.
6 0.538 6435.
7 0.567 9562.
8 0.521 9940.
9 0.497 11414.
10 0.413 7749.
# ... with 602 more rows
Now we are ready to go back to our first question
- Is there a relationship between healthcare coverage and healthcare spending in the United States?
hc.employer.2013 <- hc %>%
filter(type == "Employer", year == "2013")
plot(hc.employer.2013$spending_capita,
hc.employer.2013$prop_coverage, log = "xy",
xlab = "spending per capita",
ylab = "proportion of Employer coverage")
Yes, it looks like there is a relationship for Employer healthcare coverage in 2013.
We will continue to explore the other types of coverage later on. For now, we get back to learning more action verbs in dplyr.
Our second question that we were interested in was:
- Which US states spend the most and which spend the least on healthcare? How does the spending distribution change across geographic regions in the United States?
To answer these questions, we need to learn how to calculate summary statistics in our data.
summarise()
The summarise() function in dplyr will create summary statistics for a given column in the data frame, such as the max, min, or average. For example, to compute the average spending per capita, we can apply the mean() function to the column spending_capita and call the summary value avg_spending_capita.
hc %>%
summarise(avg_spending_capita = mean(spending_capita))
# A tibble: 1 x 1
avg_spending_capita
<dbl>
1 8246.
There are many other summary statistics you could consider, such as sd()
, min()
, median()
, mean()
, sum()
, n()
(returns the length of vector), first()
(returns first value in vector), last()
(returns last value in vector) and n_distinct()
(number of distinct values in vector).
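For instance (a minimal sketch), several of these can be combined in a single summarise() call:
# Combine several summary statistics in one summarise() call
hc %>%
  summarise(min_spending_capita = min(spending_capita),
            median_spending_capita = median(spending_capita),
            max_spending_capita = max(spending_capita),
            n_rows = n())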
Also note, the average spending per capita we computed above is taken across all states and all years, which is not very informative.
If you recall, our question asked about which states spent the most, so we want an average spending per capita for each state.
For this, we need to introduce another function in dplyr called group_by().
group_by()
The group_by() verb is an incredibly powerful function in dplyr. As we mentioned before, it’s related to the concept of “split-apply-combine”.
In our example above, we want to split the data frame by some variable (e.g. Location
), apply a function to the individual data frames (mean
) and then combine the output back into a summary data frame.
Let’s see how that would look
hc %>%
group_by(Location) %>%
summarise(avg_spending_capita = mean(spending_capita))
# A tibble: 51 x 2
Location avg_spending_capita
<chr> <dbl>
1 Alabama 7244.
2 Alaska 11331.
3 Arizona 6397.
4 Arkansas 7324.
5 California 7416.
6 Colorado 6602.
7 Connecticut 9730.
8 Delaware 10127.
9 District of Columbia 11698.
10 Florida 7945.
# ... with 41 more rows
That’s better. Here we are averaging across the years 2013 and 2014.
# smallest
hc %>%
group_by(Location) %>%
summarise(avg_spending_capita = mean(spending_capita)) %>%
arrange(avg_spending_capita) %>%
head(n=3)
# A tibble: 3 x 2
Location avg_spending_capita
<chr> <dbl>
1 Utah 5842.
2 Arizona 6397.
3 Georgia 6513.
# largest
hc %>%
group_by(Location) %>%
summarise(avg_spending_capita = mean(spending_capita)) %>%
arrange(desc(avg_spending_capita)) %>%
head(n=3)
# A tibble: 3 x 2
Location avg_spending_capita
<chr> <dbl>
1 District of Columbia 11698.
2 Alaska 11331.
3 Massachusetts 10535.
Hint: Calculate the mean and standard deviation of spending per capita for each geographic region in the US.
hc %>%
group_by(region) %>%
summarise(avg_spending_capita = mean(spending_capita),
sd_spending_capita = sd(spending_capita))
# A tibble: 4 x 3
region avg_spending_capita sd_spending_capita
<fct> <dbl> <dbl>
1 North Central 8404. 541.
2 Northeast 9592. 546.
3 South 7994. 1273.
4 West 7498. 1329.
Another way to visualize distributions is to use boxplots.
Create four boxplots representing the spending per capita distribution for each of the four regions using the boxplot() function in R.
boxplot(hc$spending_capita ~ hc$region)
Now that we have our data in a tidy format, we will learn how to visualize it using the ggplot2 R package in the tidyverse.
As you have already seen, there are many functions available in base R that can create plots (e.g. plot()
, boxplot()
). Others include: hist()
, qqplot()
, etc. These functions are great because they come with a basic installation of R and can be quite powerful when you need a quick visualization of something when you are exploring data.
We are choosing to introduce ggplot2 because, in our opinion, it’s one of the simplest ways for beginners to create relatively complicated plots that are intuitive and aesthetically pleasing.
The ggplot2 R package
The reason ggplot2 is generally intuitive for beginners is its use of the grammar of graphics, or the gg in ggplot2. The idea is that you can construct many sentences by learning just a few nouns, adjectives, and verbs. There are specific “words” that we will need to learn, and once we do, you will be able to create (or “write”) hundreds of different plots.
The critical part of making graphics using ggplot2 is that the data needs to be in a tidy format. Given that we have just spent the last two lectures learning how to work with tidy data, we are primed to take advantage of all that ggplot2 has to offer!
We will show how it’s easy to pipe tidy data (output) as input to other functions that creates plots. This all works because we are working within the tidyverse.
ggplot2 cheatsheet
A cheatsheet for the functions in the ggplot2 package can be found on RStudio’s website:
What is the ggplot() function?
As explained by Hadley Wickham:
the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinates system.
ggplot2 terminology
Two key pieces of terminology:
- aesthetics: plot attributes mapped from the data, such as the x and y variable names
- geoms: geometric objects that define the type of plot, such as geom_point(), geom_bar(), geom_line(), geom_histogram()
object.
An empty ggplot object
library(ggplot2)
p <- ggplot()
A ggplot object associated with a dataset
p <- hc %>%
filter(year==2014) %>%
ggplot()
or a ggplot object with a dataset and x
and y
defined
p <- hc %>%
filter(year==2014) %>%
ggplot(aes(x = spending_capita, y = prop_coverage))
p
geom_point()
The function aes()
is an aesthetic mapping function inside the ggplot()
object. We use this function to specify plot attributes (e.g. x
and y
variable names) that will not change as we add more layers.
Anything that goes in the ggplot()
object becomes a global setting. From there, we use the geom
objects to add more layers to the base ggplot()
object. These will define what we are interested in illustrating using the data.
If you recall, our first question that we were interested in was
- Is there a relationship between healthcare coverage and healthcare spending in the United States?
Before, we were using base R to create something like this:
hc.employer.2013 <- hc %>%
filter(type == "Employer", year == "2013")
plot(hc.employer.2013$spending_capita,
hc.employer.2013$prop_coverage,
xlab = "spending per capita",
ylab = "coverage proportion")
Let’s re-create this plot with ggplot2
using the geom_point()
geometry.
p <- hc %>%
filter(type == "Employer", year == "2013") %>%
ggplot(aes(x = spending_capita, y = prop_coverage))
p + geom_point() +
xlab("spending per capita") +
ylab("coverage proportion")
We used the xlab()
and ylab()
functions in ggplot2
to specify the x-axis and y-axis labels.
Note, we do not have to assign (<-
) the plot to anything:
hc %>%
filter(type == "Employer", year == "2013") %>%
ggplot(aes(x = spending_capita, y = prop_coverage)) +
geom_point() +
xlab("spending per capita") +
ylab("coverage proportion")
It’s also simple to fit a linear regression model and plot it on top of scatter plot using the geom_smooth()
(or stat_smooth()
) functions.
hc %>%
filter(type == "Employer", year == "2013") %>%
ggplot(aes(x = spending_capita, y = prop_coverage)) +
geom_point() +
xlab("spending per capita") +
ylab("coverage proportion") +
geom_smooth(method = "lm", col = "red")
The standard error bounds are computed and included in the plot.
It would be nice to know which point is represented by which state. For this, we will introduce another geom called geom_text().
geom_text()
In our dataset, we have information about the abbreviation for each state. We could add the abbreviations for each state next to the point on the plot to assess which states have a higher or lower coverage for a given amount of money they spend per capita.
hc %>%
filter(type == "Employer", year == "2013") %>%
ggplot(aes(x = spending_capita, y = prop_coverage)) +
geom_point() +
xlab("spending per capita") +
ylab("coverage proportion") +
geom_smooth(method = "lm", col = "red") +
geom_text(aes(label=abb))
That is cool, but it would be even better if we could nudge the text over a bit. Let’s look at the help file for geom_text()
:
?ggplot2::geom_text
We see there is an argument called nudge_x
and nudge_y
. We can use these to nudge the text over a bit so the text is not directly on top of the points.
hc %>%
filter(type == "Employer", year == "2013") %>%
ggplot(aes(x = spending_capita, y = prop_coverage)) +
geom_point() +
xlab("spending per capita") +
ylab("coverage proportion") +
geom_smooth(method = "lm", col = "red") +
geom_text(aes(label=abb), nudge_x = 150)
Next, try coloring the points by the region of each state.
## add your code here
hc %>%
filter(type == "Employer", year == "2013") %>%
ggplot(aes(x = spending_capita, y = prop_coverage,
color = region)) +
geom_point() +
xlab("spending per capita") +
ylab("coverage proportion") +
geom_smooth(method = "lm", col = "red") +
geom_text(aes(label=abb), nudge_x = 150)
Try the ggrepel package, and check if you can improve the quality of the visualization using geom_text_repel() from ggrepel instead of nudge_x from ggplot2.
## add your code here
library(ggrepel)
hc %>%
filter(type == "Employer", year == "2013") %>%
ggplot(aes(x = spending_capita, y = prop_coverage,
color = region)) +
geom_point() +
xlab("spending per capita") +
ylab("coverage proportion") +
geom_smooth(method = "lm", col = "red") +
geom_text_repel(aes(label=abb))
facet_wrap
Ok, getting back to our original question:
- Is there a relationship between healthcare coverage and healthcare spending in the United States?
We saw there was a positive relationship, but this was only for one type of healthcare coverage (Employer
) and one year. What about the other types?
For this, we will introduce facets. The idea of faceting is to stratify the data by some variable and make the same plot for each stratum.
For example, if we wanted to facet by the type
variable, we will add a layer to our ggplot()
object using the facet_grid()
or facet_wrap()
functions. The function expects the row and column variables to be separated by a ~
.
hc %>%
filter(year == "2013") %>%
ggplot(aes(x = spending_capita, y = prop_coverage,
color = region)) +
geom_point() +
xlab("spending per capita") +
ylab("coverage proportion") +
geom_smooth(method = "lm", col = "red") +
geom_text_repel(aes(label=abb)) +
facet_wrap(~type)
We see that the proportion of people covered varies a lot across coverage types, so forcing the same y-axis scale on every facet is not ideal. Let’s read the help file to see if there is some way to not restrict the y-axis to be the same.
?ggplot2::facet_grid
Yes, we see there is an argument called scales that can be set to "free_y", "free_x", or "free" (both). Let’s try free_y and look at a different year (year=="2014"):
hc %>%
filter(year == "2014") %>%
ggplot(aes(x = spending_capita, y = prop_coverage,
color = region)) +
geom_point() +
xlab("spending per capita") +
ylab("coverage proportion") +
geom_smooth(method = "lm", col = "red") +
geom_text_repel(aes(label=abb)) +
facet_wrap(~type, scales="free_y")
Given we know Other Public refers to the military or Veterans Administration, we can see states like HI, VA, and NV have a larger proportion of military or VA Other Public type coverage. While a state like AK has a similar proportion of Other Public coverage, it has a much larger spending per capita.
We also see a negative relationship with the Uninsured type: the more states spend, the fewer uninsured people in the state.
geom_boxplot()
Next, let’s revisit the second question.
- Which US states spend the most and which spend the least on healthcare? How does the spending distribution change across geographic regions in the United States?
Let’s try making a boxplot with ggplot2
. If you recall, the way to do this in base R was:
boxplot(hc$spending_capita ~ hc$region)
Now, we introduce the geom_boxplot()
function. Note, we needed to tell ggplot2
what needs to be along the x and y axis in aes()
.
hc %>%
ggplot(aes(x = region, y = spending_capita)) +
geom_boxplot()
facet_grid
- Does the relationship between healthcare coverage and healthcare spending in the United States change from 2013 to 2014?
Let’s try faceting by both year and type. Note that we can facet by rows by putting a column name before the ~, and facet by columns by putting a column name after the ~. We are also using facet_grid() instead of facet_wrap().
p <- hc %>%
ggplot(aes(x = spending_capita, y = prop_coverage,
color = region)) +
geom_point() +
xlab("spending per capita") +
ylab("coverage proportion") +
geom_smooth(method = "lm", col = "red") +
geom_text_repel(aes(label=abb))
p + facet_grid(year~type, scales="free")
p + facet_grid(year~type)
Total healthcare expenditure is associated with population size, so to make a fair comparison we created healthcare spending per capita. The exploratory analysis via data visualization then showed that higher healthcare spending per capita is positively associated with a higher proportion of employer-based coverage and negatively associated with the proportion of the uninsured population across states.