Disclaimer: The purpose of the Open Case Studies project is to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given data set, and should not be used in the context of making policy decisions without external consultation from scientific experts.

This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) United States License.

To cite this case study please use:

Wright, Carrie and Meng, Qier and Jager, Leah and Taub, Margaret and Hicks, Stephanie. (2020). https://github.com/opencasestudies/ocs-bp-diet. Exploring global patterns of dietary behaviors associated with health risk (Version v1.0.0).

To access the GitHub Repository with the data for this case study see here: https://github.com/opencasestudies/ocs-bp-diet.

You may also access and download the data using our OCSdata package. To learn more about this package including examples, see this link. Here is how you would install this package:

install.packages("OCSdata")

This case study is part of a series of public health case studies for the Bloomberg American Health Initiative.


The total reading time for this case study is calculated via koRpus and shown below:

Reading Time Method
99 minutes koRpus

Readability Score:

A readability index estimates the reading difficulty level of a particular text. Flesch-Kincaid, FORCAST, and SMOG are three common readability indices that were calculated for this case study via koRpus. These indices provide an estimation of the minimum reading level required to comprehend this case study by grade and age.

Text language: en 
index grade age
Flesch-Kincaid 10 15
FORCAST 10 15
SMOG 12 17

Please help us by filling out our survey.

Motivation


An article recently published in The Lancet evaluated global dietary trends and the relationship of dietary factors with mortality and fertility.

GBD 2017 Diet Collaborators. Health effects of dietary risks in 195 countries, 1990–2017: a systematic analysis for the Global Burden of Disease Study 2017. The Lancet 393, 1958–1972 (2019).

This article evaluated food consumption patterns in 195 countries for 15 different dietary risk factors that have probable associations with non-communicable disease (NCD). For example, over-consumption of sodium is associated with high blood pressure. These consumption levels were then used to estimate levels of mortality and morbidity due to NCD, as well as disability-adjusted life-years (DALYs) attributable to sub-optimal consumption of foods related to these dietary risk factors. The authors found that:

“High intake of sodium …, low intake of whole grains …, and low intake of fruits … were the leading dietary risk factors for deaths and DALYs globally and in many countries.”

This figure from the paper’s supplementary materials shows the ranking of the 15 dietary risk factors based on the estimated number of attributable deaths. Here, the numbers and colors of the little squares imply rankings of the risk factors (rows) by regions (columns). The color red indicates risk factors that are associated with larger number of attributable deaths. The column on the right is the overall global data. As you can see here, the top 3 risk factors are often issues for many different regions of the world.

This case study will evaluate the data reported in this article to explore regional, age, and gender specific differences in dietary consumption patterns around the world in 2017.

Main Questions


Our main questions are:

  1. What are the global trends for potentially harmful diets?
  2. How do males and females compare?
  3. How do different age groups compare for these dietary factors?
  4. How do different countries compare? In particular, how does the US compare to other countries in terms of diet trends?

Learning Objectives


In this case study, we will walk you through importing data from PDF files and CSV files, cleaning data, wrangling data, comparing data, joining data, visualizing data, and comparing two or more groups using well-established and commonly used packages, including stringr, tidyr, dplyr, purrr, and ggplot2. We will especially focus on using packages and functions from the Tidyverse. The tidyverse is a library of packages created by RStudio. While some students may be familiar with previous R programming packages, these packages make data science in R especially legible and intuitive.

The skills, methods, and concepts that students will be familiar with by the end of this case study are:

Data Science Learning Objectives:

  1. Importing/extracting data from PDF (dplyr, stringr)
  2. How to reshape data by pivoting between “long” and “wide” formats (tidyr)
  3. Perform functions on all columns of a tibble (purrr)
  4. Data cleaning with regular expressions (stringr)
  5. Specific data value reassignment
  6. Separate data within a column into multiple columns (tidyr)
  7. Methods to Compare data (dplyr)
  8. Combining data from two sources (dplyr)
  9. Make interactive plots (ggiraph)
  10. Make a zoom facet for plot (ggforce)
  11. Combine plots together (cowplot)

Statistical Learning Objectives:

  1. Understanding of how the t-test and the ANOVA are specialized regressions
  2. Basic understanding of the utility of a regression analysis
  3. How to implement a linear regression analysis in R
  4. How to interpret regression coefficients
  5. Awareness of t-test assumptions
  6. Awareness of linear regression assumptions
  7. How to use Q-Q plots to check for normality
  8. Difference between fixed effects and random effects
  9. How to perform paired t-test
  10. How to perform a linear mixed effects regression


We will begin by loading the packages that we will need:

library(here)
library(readr)
library(dplyr)
library(skimr)
library(pdftools)
library(stringr)
library(magrittr)
library(purrr)
library(tibble)
library(tidyr)
library(ggplot2)
library(ggpubr)
library(forcats)
library(lme4)
library(lmerTest)
library(car)
library(ggiraph)
library(ggforce)
library(viridis)
library(cowplot)
library(OCSdata)
Package Use in this case study
here to easily load and save data
readr to import the CSV file data
dplyr to arrange/filter/select/compare specific subsets of the data
skimr to get an overview of data
pdftools to read a PDF into R
stringr to manipulate the text within the PDF of the data
magrittr to use the %<>% piping operator
purrr to perform functions on all columns of a tibble
tibble to create data objects that we can manipulate with dplyr/stringr/tidyr/purrr
tidyr to separate data within a column into multiple columns
ggplot2 to make visualizations with multiple layers
ggpubr to easily add regression line equations to plots
forcats to change details about factors (categorical variables)
lme4 to fit a linear mixed effects model
lmerTest to perform linear mixed model testing
car to perform Levene’s Test of Homogeneity of Variances
ggiraph to make plots interactive
ggforce to modify facets in plots
viridis to plot in a color palette that is easily interpreted by colorblind individuals
cowplot to allow plots to be combined
OCSdata to access and download OCS data files

The first time we use a function, we will use the :: to indicate which package we are using. Unless we have overlapping function names, this is not necessary, but we will include it here to be informative about where the functions we will use come from.

Context


Here is an excerpt from the article itself about the context of the work:

Many dietary factors have well-established associations with health risk. The authors that generated this data set identified 15 dietary factors that have probable health risk based on literature search.

Here you can see a table of the sources for the health risks associated with the dietary factors. The first column shows the risk factors and the second column shows the health outcomes. This table is part of “Supplemental Table 1. Epidemiological evidence supporting causality between dietary risk factors and disease endpoints” from the paper’s supplementary materials.

In the article the authors found that most of the mortality associated with each factor is related to cardiovascular disease.

Limitations


There are some important limitations regarding the data from this article to keep in mind. The definition of certain dietary factors varied across some of the collection sources. Intakes of certain healthy foods like vegetables and fruits are likely positively correlated with each other and likely negatively correlated with intakes of unhealthy foods. Much of the data was collected with 24 hour recall surveys which are prone to issues due to inaccuracy of memory recall or other biases such as a tendency for some people to report healthier behaviors. The guidelines in the PDF are not specified by gender even though it is known that there are different dietary requirements for optimal health for certain nutrients by gender. The article discusses some limitations about accounting for overall food consumption when calculating consumption of particular foods:

"To remove the effect of energy intake as a potential confounder and address measurement error in dietary assessment tools, most cohorts have adjusted for total energy intake in their statistical models. This energy adjustment means that diet components are defined as risks in terms of the share of diet and not as absolute levels of exposure. In other words, an increase in intake of foods and macronutrients should be compensated by a decrease in intake of other dietary factors to hold total energy intake constant. Thus, the relative risk of change in each component of diet depends on the other components for which it is substituted. However, the relative risks estimated from meta-analyses of cohort studies do not generally specify the type of substitution.

There are also important nuances to keep in mind regarding some of the dietary factors. For example calcium consumption was calculated based on consumption of dairy products, while calcium can be acquired from other sources including plant-based sources. However in these data, the influence of plant-based consumption of calcium was not accounted for, nor was supplementation through vitamin sources.

Also, while gender and sex are not actually binary, the data used in this analysis only contains information for groups of individuals described as male or female.

What are the data?


We will be using data that we requested from the Global Burden of Disease (GBD) of the Institute for Health Metrics and Evaluation (IHME) about dietary intake, as well as the guideline data about optimal consumption amounts for different foods contained within the PDF of the article. We have two CSV files, dietary_risk_exposure_all_ages_2017.csv and dietary_risk_exposure_sep_ages_2017.csv. The first one includes consumption levels at the global level and for different countries for all ages combined.

Looking at the CSV file in excel:

Here you can see that the data contains mean consumption values for both men and women in various countries at the national level in 2017 for various foods that may be problematic for health. The units for the food varies. So for example, the mean column in row that says “Diet low in fiber” indicates the average consumption level per person in that region and of that gender of fiber in grams per day.

The second CSV file has similar data, but consumption levels for different age groups are separated.

The authors of this article obtained the data from a variety of sources including household budget surveys and nutritional surveys regarding 24 hour recall of food consumption and 24 hour urinary sodium analysis. The data was derived from sales data from Euromonitor, estimates about national availability of specific nutrients from the United Nations Food and Agriculture Organization (FAO) and the United States Department of Agriculture’s National Nutrition Database.

Data Import


If you have trouble accessing the GitHub Repository, the data can be downloaded from here and here.

Let’s import our data into R now so that we can explore the data further.

In our case, we downloaded this data and put it within a “data” directory within a subdirectory called “raw” for our project. If you use an RStudio project, then you can use the here() function of the here package to make the path for importing this data simpler. The here package automatically starts looking for files based on where you have a .Rproj file which is created when you start a new RStudio project. We can specify that we want to look for the files within the “docs” directory within a directory where our .Rproj file is located by separating the name of the “data” directory, the “raw” subdirectory, and the file name using commas.


Click here to see more about creating new projects in RStudio.

You can create a project by going to the File menu of RStudio like so:

You can also do so by clicking the project button:

See here to learn more about using RStudio projects.


diet_data <- readr::read_csv(here("data", "raw", 
                       "dietary_risk_exposure_all_ages_2017.csv"))
sep_age_diet_data <- read_csv(here("data", "raw", 
                       "dietary_risk_exposure_sep_ages_2017.csv"))

You may also use the OCSdata package to download the raw data:

# install.packages("OCSdata")
library(OCSdata)
raw_data("ocs-bp-diet", outpath = getwd())
# This will save the raw data files in a "OCSdata/data/raw/" sub-folder 
# in your current working directory

If you used the OCSdata package to download the raw data, you can import the data into R like so:

diet_data <- readr::read_csv(here("OCSdata", "data", "raw", 
                       "dietary_risk_exposure_all_ages_2017.csv"))
sep_age_diet_data <- read_csv(here("OCSdata", "data", "raw", 
                       "dietary_risk_exposure_sep_ages_2017.csv"))

First let’s just get a general sense of our data. We can do that using the glimpse() function of the dplyr package (it is also in the tibble package).

dplyr::glimpse(diet_data)
Rows: 5,880
Columns: 11
$ year_id        <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2~
$ location_name  <chr> "Global", "Global", "China", "China", "North Korea", "N~
$ rei_id         <dbl> 117, 117, 117, 117, 117, 117, 117, 117, 117, 117, 117, ~
$ rei_name       <chr> "Diet high in processed meat", "Diet high in processed ~
$ age_group_name <chr> "All available ages", "All available ages", "All availa~
$ sex            <chr> "Male", "Female", "Male", "Female", "Male", "Female", "~
$ parameter      <chr> "continuous", "continuous", "continuous", "continuous",~
$ mean           <dbl> 4.2865629, 3.2640990, 2.3176975, 1.7512870, 0.5665229, ~
$ upper          <dbl> 4.4633117, 3.3765360, 2.6944978, 2.0454134, 0.6596296, ~
$ lower          <dbl> 4.1309531, 3.1547299, 1.9933744, 1.5161724, 0.4818201, ~
$ unit           <chr> "g/day", "g/day", "g/day", "g/day", "g/day", "g/day", "~
glimpse(sep_age_diet_data)
Rows: 88,200
Columns: 11
$ year_id        <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2~
$ location_name  <chr> "Global", "Global", "Global", "Global", "Global", "Glob~
$ rei_name       <chr> "Diet low in calcium", "Diet low in calcium", "Diet low~
$ age_group_id   <dbl> 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 30, 31, 32,~
$ age_group_name <chr> "25 to 29", "30 to 34", "35 to 39", "40 to 44", "45 to ~
$ sex            <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male",~
$ parameter      <chr> "continuous", "continuous", "continuous", "continuous",~
$ mean           <dbl> 0.4166193, 0.4253721, 0.4352053, 0.4412156, 0.4565795, ~
$ upper          <dbl> 0.4310974, 0.4403903, 0.4481282, 0.4579218, 0.4763366, ~
$ lower          <dbl> 0.4030347, 0.4110026, 0.4222248, 0.4264522, 0.4381799, ~
$ unit           <chr> "g/day", "g/day", "g/day", "g/day", "g/day", "g/day", "~

Here we can tell that the sep_age_diet_data is much larger than the diet_data. The diet_data has only 5,880 rows while the sep_age_diet_data has 88,200 rows!

However, both files appear to have the same column structure with 11 variables each.

The skim() function of the skimr package is also really helpful for getting a general sense of your data.

skim(diet_data)
Data summary
Name diet_data
Number of rows 5880
Number of columns 11
_______________________
Column type frequency:
character 6
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
location_name 0 1 4 32 0 196 0
rei_name 0 1 16 39 0 15 0
age_group_name 0 1 18 18 0 1 0
sex 0 1 4 6 0 2 0
parameter 0 1 10 10 0 1 0
unit 0 1 5 12 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year_id 0 1 2017.00 0.00 2017 2017.00 2017.00 2017.00 2017.00 ▁▁▇▁▁
rei_id 0 1 133.67 53.93 111 114.00 118.00 123.00 333.00 ▇▁▁▁▁
mean 0 1 38.27 64.29 0 0.45 8.38 49.12 566.69 ▇▁▁▁▁
upper 0 1 41.81 70.06 0 0.49 9.22 53.18 624.23 ▇▁▁▁▁
lower 0 1 35.02 59.06 0 0.42 7.52 45.18 513.22 ▇▁▁▁▁

Notice how there is a column providing the number of missing observations for each variable. It looks like our data is very complete and we do not have any missing data. We also get a sense about the size of our data.

The n_unqiue column shows us the number of unique values for each of our columns.

Let’s take a look at sep_age_diet_data.

skim(sep_age_diet_data)
Data summary
Name sep_age_diet_data
Number of rows 88200
Number of columns 11
_______________________
Column type frequency:
character 6
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
location_name 0 1 4 32 0 196 0
rei_name 0 1 16 39 0 15 0
age_group_name 0 1 7 8 0 15 0
sex 0 1 4 6 0 2 0
parameter 0 1 10 10 0 1 0
unit 0 1 5 12 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year_id 0 1 2017.00 0.00 2017 2017.00 2017.00 2017.00 2017.00 ▁▁▇▁▁
age_group_id 0 1 32.87 54.46 10 13.00 17.00 30.00 235.00 ▇▁▁▁▁
mean 0 1 36.06 61.94 0 0.42 7.90 43.54 604.01 ▇▁▁▁▁
upper 0 1 45.86 78.69 0 0.53 10.43 55.10 806.98 ▇▁▁▁▁
lower 0 1 28.01 48.47 0 0.32 5.72 34.17 494.66 ▇▁▁▁▁

We can see that there are many more rows in this data set.

Let’s change the variable name rei_name to dietary_risk so that it makes more sense. We can use the rename() function from the dplyr package.

diet_data <- dplyr::rename(diet_data, dietary_risk = rei_name)
sep_age_diet_data <- dplyr::rename(sep_age_diet_data, dietary_risk = rei_name)

glimpse(diet_data)
Rows: 5,880
Columns: 11
$ year_id        <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2~
$ location_name  <chr> "Global", "Global", "China", "China", "North Korea", "N~
$ rei_id         <dbl> 117, 117, 117, 117, 117, 117, 117, 117, 117, 117, 117, ~
$ dietary_risk   <chr> "Diet high in processed meat", "Diet high in processed ~
$ age_group_name <chr> "All available ages", "All available ages", "All availa~
$ sex            <chr> "Male", "Female", "Male", "Female", "Male", "Female", "~
$ parameter      <chr> "continuous", "continuous", "continuous", "continuous",~
$ mean           <dbl> 4.2865629, 3.2640990, 2.3176975, 1.7512870, 0.5665229, ~
$ upper          <dbl> 4.4633117, 3.3765360, 2.6944978, 2.0454134, 0.6596296, ~
$ lower          <dbl> 4.1309531, 3.1547299, 1.9933744, 1.5161724, 0.4818201, ~
$ unit           <chr> "g/day", "g/day", "g/day", "g/day", "g/day", "g/day", "~
glimpse(sep_age_diet_data)
Rows: 88,200
Columns: 11
$ year_id        <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2~
$ location_name  <chr> "Global", "Global", "Global", "Global", "Global", "Glob~
$ dietary_risk   <chr> "Diet low in calcium", "Diet low in calcium", "Diet low~
$ age_group_id   <dbl> 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 30, 31, 32,~
$ age_group_name <chr> "25 to 29", "30 to 34", "35 to 39", "40 to 44", "45 to ~
$ sex            <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male",~
$ parameter      <chr> "continuous", "continuous", "continuous", "continuous",~
$ mean           <dbl> 0.4166193, 0.4253721, 0.4352053, 0.4412156, 0.4565795, ~
$ upper          <dbl> 0.4310974, 0.4403903, 0.4481282, 0.4579218, 0.4763366, ~
$ lower          <dbl> 0.4030347, 0.4110026, 0.4222248, 0.4264522, 0.4381799, ~
$ unit           <chr> "g/day", "g/day", "g/day", "g/day", "g/day", "g/day", "~

Looks good!

We will then take a look at the different dietary risk factors considered. To do this we will use the distinct() function of the dplyr package.

This function grabs only the distinct or unique rows from a given variable (dietary_risk, in our case) of a given data frame (diet_data, in our case).

dplyr::distinct(diet_data, dietary_risk)
# A tibble: 15 x 1
   dietary_risk                           
   <chr>                                  
 1 Diet high in processed meat            
 2 Diet high in red meat                  
 3 Diet high in sodium                    
 4 Diet high in sugar-sweetened beverages 
 5 Diet high in trans fatty acids         
 6 Diet low in calcium                    
 7 Diet low in fiber                      
 8 Diet low in fruits                     
 9 Diet low in legumes                    
10 Diet low in milk                       
11 Diet low in nuts and seeds             
12 Diet low in polyunsaturated fatty acids
13 Diet low in seafood omega-3 fatty acids
14 Diet low in vegetables                 
15 Diet low in whole grains               

Both over and under consumption could be a health problem!

We will be using the %>% pipe for sequential steps in our code later on. This will make more sense when we have multiple sequential steps using the same data object.

We could do the same code as above using this notation. For example we first grab the diet_data, then we select the distinct values of the dietary_risk variable.

diet_data %>%
  distinct(dietary_risk)
# A tibble: 15 x 1
   dietary_risk                           
   <chr>                                  
 1 Diet high in processed meat            
 2 Diet high in red meat                  
 3 Diet high in sodium                    
 4 Diet high in sugar-sweetened beverages 
 5 Diet high in trans fatty acids         
 6 Diet low in calcium                    
 7 Diet low in fiber                      
 8 Diet low in fruits                     
 9 Diet low in legumes                    
10 Diet low in milk                       
11 Diet low in nuts and seeds             
12 Diet low in polyunsaturated fatty acids
13 Diet low in seafood omega-3 fatty acids
14 Diet low in vegetables                 
15 Diet low in whole grains               

OK, so that gives us an idea of what dietary factors we can explore, and we can see that there are 15 of them.

Let’s see if the location_name values are the same between both CSV files. To do this we will use the setequal() function of dplyr.

dplyr::setequal(
  distinct(diet_data, location_name),
  distinct(sep_age_diet_data, location_name)
)
[1] TRUE

OK, we got the value of TRUE, so it looks like the same locations are in both files.

Note: In this case were comparing two different objects so using the pipe is not as useful.

Let’s take a look at the locations included in the data.

# scroll through the output!
sep_age_diet_data %>%
  distinct(location_name) %>%
  pull()
  [1] "Global"                           "China"                           
  [3] "North Korea"                      "Taiwan (Province of China)"      
  [5] "Cambodia"                         "Indonesia"                       
  [7] "Laos"                             "Malaysia"                        
  [9] "Maldives"                         "Myanmar"                         
 [11] "Philippines"                      "Sri Lanka"                       
 [13] "Thailand"                         "Timor-Leste"                     
 [15] "Vietnam"                          "Fiji"                            
 [17] "Kiribati"                         "Marshall Islands"                
 [19] "Federated States of Micronesia"   "Papua New Guinea"                
 [21] "Samoa"                            "Solomon Islands"                 
 [23] "Tonga"                            "Vanuatu"                         
 [25] "Armenia"                          "Azerbaijan"                      
 [27] "Georgia"                          "Kazakhstan"                      
 [29] "Kyrgyzstan"                       "Mongolia"                        
 [31] "Tajikistan"                       "Turkmenistan"                    
 [33] "Uzbekistan"                       "Albania"                         
 [35] "Bosnia and Herzegovina"           "Bulgaria"                        
 [37] "Croatia"                          "Czech Republic"                  
 [39] "Hungary"                          "Macedonia"                       
 [41] "Montenegro"                       "Poland"                          
 [43] "Romania"                          "Serbia"                          
 [45] "Slovakia"                         "Slovenia"                        
 [47] "Belarus"                          "Estonia"                         
 [49] "Latvia"                           "Lithuania"                       
 [51] "Moldova"                          "Russian Federation"              
 [53] "Ukraine"                          "Brunei"                          
 [55] "Japan"                            "South Korea"                     
 [57] "Singapore"                        "Australia"                       
 [59] "New Zealand"                      "Andorra"                         
 [61] "Austria"                          "Belgium"                         
 [63] "Cyprus"                           "Denmark"                         
 [65] "Finland"                          "France"                          
 [67] "Germany"                          "Greece"                          
 [69] "Iceland"                          "Ireland"                         
 [71] "Israel"                           "Italy"                           
 [73] "Luxembourg"                       "Malta"                           
 [75] "Netherlands"                      "Norway"                          
 [77] "Portugal"                         "Spain"                           
 [79] "Sweden"                           "Switzerland"                     
 [81] "United Kingdom"                   "Argentina"                       
 [83] "Chile"                            "Uruguay"                         
 [85] "Canada"                           "United States"                   
 [87] "Antigua and Barbuda"              "The Bahamas"                     
 [89] "Barbados"                         "Belize"                          
 [91] "Cuba"                             "Dominica"                        
 [93] "Dominican Republic"               "Grenada"                         
 [95] "Guyana"                           "Haiti"                           
 [97] "Jamaica"                          "Saint Lucia"                     
 [99] "Saint Vincent and the Grenadines" "Suriname"                        
[101] "Trinidad and Tobago"              "Bolivia"                         
[103] "Ecuador"                          "Peru"                            
[105] "Colombia"                         "Costa Rica"                      
[107] "El Salvador"                      "Guatemala"                       
[109] "Honduras"                         "Mexico"                          
[111] "Nicaragua"                        "Panama"                          
[113] "Venezuela"                        "Brazil"                          
[115] "Paraguay"                         "Algeria"                         
[117] "Bahrain"                          "Egypt"                           
[119] "Iran"                             "Iraq"                            
[121] "Jordan"                           "Kuwait"                          
[123] "Lebanon"                          "Libya"                           
[125] "Morocco"                          "Palestine"                       
[127] "Oman"                             "Qatar"                           
[129] "Saudi Arabia"                     "Syria"                           
[131] "Tunisia"                          "Turkey"                          
[133] "United Arab Emirates"             "Yemen"                           
[135] "Afghanistan"                      "Bangladesh"                      
[137] "Bhutan"                           "India"                           
[139] "Nepal"                            "Pakistan"                        
[141] "Angola"                           "Central African Republic"        
[143] "Congo"                            "Democratic Republic of the Congo"
[145] "Equatorial Guinea"                "Gabon"                           
[147] "Burundi"                          "Comoros"                         
[149] "Djibouti"                         "Eritrea"                         
[151] "Ethiopia"                         "Kenya"                           
[153] "Madagascar"                       "Malawi"                          
[155] "Mauritius"                        "Mozambique"                      
[157] "Rwanda"                           "Seychelles"                      
[159] "Somalia"                          "Tanzania"                        
[161] "Uganda"                           "Zambia"                          
[163] "Botswana"                         "Lesotho"                         
[165] "Namibia"                          "South Africa"                    
[167] "Swaziland"                        "Zimbabwe"                        
[169] "Benin"                            "Burkina Faso"                    
[171] "Cameroon"                         "Cape Verde"                      
[173] "Chad"                             "Cote d'Ivoire"                   
[175] "The Gambia"                       "Ghana"                           
[177] "Guinea"                           "Guinea-Bissau"                   
[179] "Liberia"                          "Mali"                            
[181] "Mauritania"                       "Niger"                           
[183] "Nigeria"                          "Sao Tome and Principe"           
[185] "Senegal"                          "Sierra Leone"                    
[187] "Togo"                             "American Samoa"                  
[189] "Bermuda"                          "Greenland"                       
[191] "Guam"                             "Northern Mariana Islands"        
[193] "Puerto Rico"                      "Virgin Islands, U.S."            
[195] "South Sudan"                      "Sudan"                           

OK, so there are global values, as well as values for 195 countries.

Let’s take a look at the data when we order it by the mean consumption rate column. We can do so using the arrange() function of the dplyr package.

diet_data %>%
  dplyr::arrange(mean) %>%
  glimpse()
Rows: 5,880
Columns: 11
$ year_id        <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2~
$ location_name  <chr> "Lebanon", "Lebanon", "Italy", "Turkey", "Kazakhstan", ~
$ rei_id         <dbl> 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, ~
$ dietary_risk   <chr> "Diet high in trans fatty acids", "Diet high in trans f~
$ age_group_name <chr> "All available ages", "All available ages", "All availa~
$ sex            <chr> "Male", "Female", "Male", "Male", "Male", "Male", "Male~
$ parameter      <chr> "continuous", "continuous", "continuous", "continuous",~
$ mean           <dbl> 0.0007277926, 0.0008198374, 0.0013574727, 0.0014110786,~
$ upper          <dbl> 0.0008927912, 0.0010079978, 0.0014915092, 0.0017274811,~
$ lower          <dbl> 0.0005853401, 0.0006704380, 0.0012335834, 0.0011349718,~
$ unit           <chr> "% energy/day", "% energy/day", "% energy/day", "% ener~

OK, so it looks like people in Lebanon don’t eat very many trans fatty acids.

Let’s also figure out how many values there are in each age group of the data that is separated by age. We will use the count() function of the dplyr package to do this.

sep_age_diet_data %>%
  dplyr::count(age_group_name)
# A tibble: 15 x 2
   age_group_name     n
   <chr>          <int>
 1 25 to 29        5880
 2 30 to 34        5880
 3 35 to 39        5880
 4 40 to 44        5880
 5 45 to 49        5880
 6 50 to 54        5880
 7 55 to 59        5880
 8 60 to 64        5880
 9 65 to 69        5880
10 70 to 74        5880
11 75 to 79        5880
12 80 to 84        5880
13 85 to 89        5880
14 90 to 94        5880
15 95 plus         5880

That’s a lot of values!

Let’s look a bit deeper to try to understand why. We can use the count() function again but get the number of values for each category within sex, age_group_name and location_name of the data.

sep_age_diet_data %>%
  count(sex, age_group_name, location_name)
# A tibble: 5,880 x 4
   sex    age_group_name location_name           n
   <chr>  <chr>          <chr>               <int>
 1 Female 25 to 29       Afghanistan            15
 2 Female 25 to 29       Albania                15
 3 Female 25 to 29       Algeria                15
 4 Female 25 to 29       American Samoa         15
 5 Female 25 to 29       Andorra                15
 6 Female 25 to 29       Angola                 15
 7 Female 25 to 29       Antigua and Barbuda    15
 8 Female 25 to 29       Argentina              15
 9 Female 25 to 29       Armenia                15
10 Female 25 to 29       Australia              15
# ... with 5,870 more rows

OK, so it looks like these are probably the consumption values for each of the different dietary factors (since there were 15 different factors) for each age group and gender combination within each country.

We can confirm this by filtering the data to one of the age groups, for a single gender, and for a single location. To do this we can use the filter() function of the dplyr package. Notice that we need to use two equal signs == to specify what values we would like for each variable.

sep_age_diet_data %>%
  dplyr::filter(
    sex == "Female",
    age_group_name == "25 to 29",
    location_name == "Afghanistan"
  )
# A tibble: 15 x 11
   year_id location_name dietary_risk          age_group_id age_group_name sex  
     <dbl> <chr>         <chr>                        <dbl> <chr>          <chr>
 1    2017 Afghanistan   Diet low in calcium             10 25 to 29       Fema~
 2    2017 Afghanistan   Diet low in fiber               10 25 to 29       Fema~
 3    2017 Afghanistan   Diet low in seafood ~           10 25 to 29       Fema~
 4    2017 Afghanistan   Diet low in fruits              10 25 to 29       Fema~
 5    2017 Afghanistan   Diet low in whole gr~           10 25 to 29       Fema~
 6    2017 Afghanistan   Diet low in legumes             10 25 to 29       Fema~
 7    2017 Afghanistan   Diet low in milk                10 25 to 29       Fema~
 8    2017 Afghanistan   Diet low in nuts and~           10 25 to 29       Fema~
 9    2017 Afghanistan   Diet high in process~           10 25 to 29       Fema~
10    2017 Afghanistan   Diet low in polyunsa~           10 25 to 29       Fema~
11    2017 Afghanistan   Diet high in red meat           10 25 to 29       Fema~
12    2017 Afghanistan   Diet high in sodium             10 25 to 29       Fema~
13    2017 Afghanistan   Diet high in sugar-s~           10 25 to 29       Fema~
14    2017 Afghanistan   Diet high in trans f~           10 25 to 29       Fema~
15    2017 Afghanistan   Diet low in vegetabl~           10 25 to 29       Fema~
# ... with 5 more variables: parameter <chr>, mean <dbl>, upper <dbl>,
#   lower <dbl>, unit <chr>

This confirms that for each of the 15 dietary factors, our unit of observation is a combination of gender, age and country.

However, before we proceed with our analysis, we will want to perform some additional data wrangling. To do this, we will introduce the pdftools package, which will allow us to pull additional data from the manuscript itself.

While all of the mean consumption values are reported in grams, each dietary factor has a different amount that is considered optimal for consuming. To make the consumption values more comparable across factors, let’s also get some data from the PDF of the paper so that we can calculate consumption of these dietary factors as percentages of the daily optimum.

We are interested in this table on page 3:

First let’s import the PDF using the pfd_text() function of the pdftools package.

You can find this file here.

paper <- pdftools::pdf_text(here("data", "raw",
"Afshin_et_al_2019.pdf"))

We can save our imported data as an rda file (stands for R data file) using the save() function.

save(diet_data, sep_age_diet_data, paper, file = here::here("data", "imported", "imported_data.rda"))

Data Wrangling


If you have been following along but stopped, we could load our imported data like so:

load(here::here("data", "imported", "imported_data.rda"))

If you skipped the data import section click here.

First you need to install and load the OCSdata package:

install.packages("OCSdata")
library(OCSdata)

Then, you may load the imported data using the following code:

imported_data("ocs-bp-diet", outpath = getwd())
load(here::here("OCSdata", "data", "imported", "imported_data.rda"))

If the package does not work for you, alternatively, an RDA file (stands for R data) of the data can be found in our GitHub repository or slightly more directly here. Download this file and then place it in your current working directory within a subdirectory called “imported” within a subdirectory called “data” to copy and paste our code. We used an RStudio project and the here package to navigate to the file more easily.

load(here::here("data", "imported", "imported_data.rda"))

Click here to see more about creating new projects in RStudio.

You can create a project by going to the File menu of RStudio like so:

You can also do so by clicking the project button:

See here to learn more about using RStudio projects and here to learn more about the here package.



Let’s take a look at our manuscript data.

We can use the base summary() function to get a sense of what the data looks like. By base we mean that these functions are part of the base package and are loaded automatically on startup of R. Thus, library(base) is not required.

summary(paper)
   Length     Class      Mode 
       15 character character 

We can see that we have 15 different character strings. Each one contains the text on each of the 15 different pages of the PDF.

Again, the table we are interested in is on the third page, so let’s grab just that portion of the PDF. The top of this page looks like:

# Here we will select the 3rd value in the paper object
pdf_table <- paper[3]

summary(pdf_table)
   Length     Class      Mode 
        1 character character 
# specifying nchar.max truncates the output
glimpse(pdf_table, nchar.max = 800)
 chr "                                                                                                                                                                                                  Articles\n\n\n\n\nin systolic blood pressure, and then estimated the                                          Disease-specific deaths and disability-adjusted\nrelationship between change in systolic blood pressure                                      life-years\nand disease outcomes.14                                                                     Data on disease-specific deaths and disability-adjusted\n                                                                                            life-years (DALYs) by age, sex, country, and year were\nOptimal level of intake      "| __truncated__

Here we can see that the pdf_table object now contains the text from the 3rd page as a single large character string. However the text is difficult to read because of the column structure in the PDF. Now let’s try to grab just the text in the table.

One way to approach this is to split the string by some pattern that we notice in the table.

All the rows of interest of the table appear to start with the word "Diet". Moreover, only the capitalized form of the word "Diet" appears to be within the table, and it is not present in the preceding text (although "diet" is).

Let’s use the str_split() function of the stringr package to split the data within the object called pdf_table by the word "Diet". Only lines from page 3 that contain the word "Diet" will be selected (and not "diet" as this function is case-sensitive). Each section of the text that contains "Diet" will be split into individual pieces every time the word "Diet" occurs and the word itself will be removed.

In this case we are also using the magrittr assignment pipe or double pipe that looks like this %<>% of the magrittr package. This allows us use the pdf_table data as input to the later steps but also reassign the output to the same data object name.

pdf_table %<>%
  stringr::str_split(pattern = "Diet")

Using the base::summary() and dplyr::glimpse() function we can see that we created a list of the rows in the table that contained the word "Diet". We can see that we start with the row that contains "low in fruits".

pdf_table %>%
  summary()
     Length Class  Mode     
[1,] 17     -none- character
pdf_table %>%
  glimpse()
List of 1
 $ : chr [1:17] "                                                                                                               "| __truncated__ " low in fruits                Mean daily consumption of fruits (fresh, frozen, cooked, canned, or dried fruits,"| __truncated__ " low in vegetables            Mean daily consumption of vegetables (fresh, frozen, cooked, canned, or dried veg"| __truncated__ " low in legumes               Mean daily consumption of legumes (fresh, frozen, cooked, canned, or dried legume"| __truncated__ ...

In order to extract the values that we want from these character strings, we will use some additional functions from the stringr package. RStudio creates really helpful cheat sheets like this one which shows you all the major functions in the stringr package. You can download others here.

You can see that we could have also used the str_split_fixed() function which would also separate the substrings into different columns of a matrix, however we would need to know the number of substrings or pieces that we would like returned.

For more information about str_split() see here.

Let’s separate the values within the list using the base unlist function, this will allow us to easily select the different substrings within the object called pdf_table.

pdf_table %<>%
  unlist()

It’s important to realize that the first split will split the text before the first occurrence of "Diet" as the first value in the output. (This is why there are 17 elements in pdf_table rather than 15, the number of rows in the table.) We could use the first() function of the dplyr package to look at this value. However, we will suppress the output as this is quite large.

dplyr::first(pdf_table)

Instead we can take a look at the second element of the list. using the nth() function of dplyr.

nth(pdf_table, 2)
[1] " low in fruits                Mean daily consumption of fruits (fresh, frozen, cooked, canned, or dried fruits, excluding                250 g (200–300) per day                           94·9\n                                    fruit juices and salted or pickled fruits)\n  "

Indeed this looks like the first row of interest in our table:

Using the last() and the nth() functions of the dplyr package we can take a look at the last values of the list.

# to see the second to last value we can use nth()
# the -2 specifies that we want the second-to-last value
# -3 would be third-to-last and -1 would be the last value
dplyr::nth(pdf_table, -2)
[1] " high in sodium               24 h urinary sodium measured in g per day                                                                     3 g (1–5) per day*                             26·2\n\n *To reflect the uncertainty in existing evidence on optimal level of intake for sodium, 1–5 g per day was considered as the uncertainty range for the optimal level of sodium where less than 2·3 g per day is the\n intake level of sodium associated with the lowest level of blood pressure in randomised controlled trials and 4–5 g per day is the level of sodium intake associated with the lowest risk of cardiovascular disease in\n observational studies.\n\n Table: "
# to see the very last value we can use last()
dplyr::last(pdf_table)
[1] "ary risk factor exposure definitions, optimal level, and data representativeness index, 1990–2017\n\n\n\nwww.thelancet.com Published online April 3, 2019 http://dx.doi.org/10.1016/S0140-6736(19)30041-8                                                                                                                          3\n"

We don’t need this part of the table or the text before the table if we just want the consumption recommendations.

So we will select the second through the second-to-last of the substrings. Since we have seventeen substrings, we will select the second through the sixteenth. However a better way to do this rather than selecting by index, would be to select phrases that are unique to the text within the table that we want. We will use the str_subset() function of stringr package to select the table rows with consumption guidelines. Most of the rows have the phrase “Mean daily consumption”, however, there are other phrases for some of the rows, including “Mean daily intake” and “24 h sodium”. So we will subset for each of these phrases.

# one could subset the pdf_table like this:
# pdf_table <- pdf_table[2:16]

pdf_table %<>%
  str_subset(pattern = "Mean daily consumption|Mean daily intake|24 h")

Notice that we separate the different patterns to look for using vertical bar character "|" and that all of the patterns are within quotation marks together.

Question opportunity:

  1. What other string patterns could you use to subset the rows of the table that we want?

  2. Why might it be better to subset based on the text rather than the index?

Now the first row is what we want:

first(pdf_table)
[1] " low in fruits                Mean daily consumption of fruits (fresh, frozen, cooked, canned, or dried fruits, excluding                250 g (200–300) per day                           94·9\n                                    fruit juices and salted or pickled fruits)\n  "

And the last row is what we want:

last(pdf_table)
[1] " high in sodium               24 h urinary sodium measured in g per day                                                                     3 g (1–5) per day*                             26·2\n\n *To reflect the uncertainty in existing evidence on optimal level of intake for sodium, 1–5 g per day was considered as the uncertainty range for the optimal level of sodium where less than 2·3 g per day is the\n intake level of sodium associated with the lowest level of blood pressure in randomised controlled trials and 4–5 g per day is the level of sodium intake associated with the lowest risk of cardiovascular disease in\n observational studies.\n\n Table: "

At this point, we have a better look at the current representation of the table data in R, and we might notice something that will need to be fixed. In the string above, the decimal points from the PDF are being recognized as something called an interpunct instead of a period or decimal. An interpunct is a centered dot, as opposed to a period or decimal that is aligned to the bottom of the line.

The interpunct was previously used to separate words in certain languages, like ancient Latin.

[source]

You can produce an interpunct on a Mac like this:

[source]

It is important to replace these for later when we want these values to be converted from character strings to numeric. We will again use the stringr package. This time we will use the str_replace_all() function which replaces all instances of a pattern in an individual string. In this case we want to replace all instances of the interpunct with a decimal point.

pdf_table %<>%
  stringr::str_replace_all(
    pattern = "·",
    replacement = "."
  )
last(pdf_table)
[1] " high in sodium               24 h urinary sodium measured in g per day                                                                     3 g (1–5) per day*                             26.2\n\n *To reflect the uncertainty in existing evidence on optimal level of intake for sodium, 1–5 g per day was considered as the uncertainty range for the optimal level of sodium where less than 2.3 g per day is the\n intake level of sodium associated with the lowest level of blood pressure in randomised controlled trials and 4–5 g per day is the level of sodium intake associated with the lowest risk of cardiovascular disease in\n observational studies.\n\n Table: "

Looks good!

Now we will try to split the strings for each row based on the presence of two spaces to create the columns of the table, as there appears to be more than one space between the columns. The resulting substrings will be separated by quotes.

For additional details, the second page of the stringr cheat sheet has more information about using “Special Characters” in stringr. For example \\s is interpreted as a space as the \\ indicates that the s should be interpreted as a special character and not simply the letter s. The {2,} indicates two or more spaces, while {2} would indicate exactly two spaces.

table_split <- str_split(
  string = pdf_table,
  pattern = "\\s{2,}"
)
glimpse(table_split) # scroll the output!
List of 15
 $ : chr [1:6] " low in fruits" "Mean daily consumption of fruits (fresh, frozen, cooked, canned, or dried fruits, excluding" "250 g (200–300) per day" "94.9" ...
 $ : chr [1:7] " low in vegetables" "Mean daily consumption of vegetables (fresh, frozen, cooked, canned, or dried vegetables," "360 g (290–430) per day" "94.9" ...
 $ : chr [1:5] " low in legumes" "Mean daily consumption of legumes (fresh, frozen, cooked, canned, or dried legumes)" "60 g (50–70) per day" "94.9" ...
 $ : chr [1:7] " low in whole grains" "Mean daily consumption of whole grains (bran, germ, and endosperm in their natural" "125 g (100–150) per day" "94.9" ...
 $ : chr [1:5] " low in nuts and seeds" "Mean daily consumption of nut and seed foods" "21 g (16–25) per day" "94.9" ...
 $ : chr [1:6] " low in milk" "Mean daily consumption of milk including non-fat, low-fat, and full-fat milk, excluding soy" "435 g (350–520) per day" "94.9" ...
 $ : chr [1:6] " high in red meat" "Mean daily consumption of red meat (beef, pork, lamb, and goat, but excluding poultry, fish," "23 g (18–27) per day" "94.9" ...
 $ : chr [1:6] " high in processed meat" "Mean daily consumption of meat preserved by smoking, curing, salting, or addition of" "2 g (0–4) per day" "36.9" ...
 $ : chr [1:6] " high in sugar-sweetened Mean daily consumption of beverages with =50 kcal per 226.8 serving, including carbonated" "3 g (0–5) per day" "36.9" "beverages" ...
 $ : chr [1:6] " low in fibre" "Mean daily intake of fibre from all sources including fruits, vegetables, grains, legumes, and" "24 g (19–28) per day" "94.9" ...
 $ : chr [1:5] " low in calcium" "Mean daily intake of calcium from all sources, including milk, yogurt, and cheese" "1.25 g (1.00–1.50) per day" "94.9" ...
 $ : chr [1:5] " low in seafood omega-3 Mean daily intake of eicosapentaenoic acid and docosahexaenoic acid" "250 mg (200–300) per day" "94.9" "fatty acids" ...
 $ : chr [1:7] " low in polyunsaturated" "Mean daily intake of omega-6 fatty acids from all sources, mainly liquid vegetable oils," "11% (9–13) of total daily energy" "94.9" ...
 $ : chr [1:6] " high in trans fatty acids" "Mean daily intake of trans fat from all sources, mainly partially hydrogenated vegetable oils" "0.5% (0.0–1.0) of total daily energy" "36.9" ...
 $ : chr [1:8] " high in sodium" "24 h urinary sodium measured in g per day" "3 g (1–5) per day*" "26.2" ...

Now we can see that each of our 15 strings has been split into pieces, but unfortunately, it was not completely consistent across dietary factors. Why did this happen? If we look closely, we can see that the sugar-sweetened beverage and the seafood category had only one space between the first and second columns. These are the columns about the dietary category and the one that describes in more detail what the consumption suggestion is about.

The values for these two columns appear to be together still in the same substring for these two categories. We can see this because there are no quotation marks adjacent to the word "Mean".

Here you can see how the next substring should have started with the word "Mean" by the new inclusion of a quotation mark ". The red rectangles indicate the problematic substrings, while the green rectangles show examples where the split worked correctly.

We can add an extra space in front of the word "Mean" for these particular categories and then try splitting again.

Since we originally split based on two or more spaces, we can just add a space in front of the word “Mean” for all the pdf_table strings and then try subsetting again. We can use the str_which() function of the stringr package to find the index of these particular cases.

pdf_table %>%
  str_which(pattern = "seafood|sugar")
[1]  9 12

Here we can use the str_subset() function of the stringr package to see just the strings that match these patterns within pdf_table:

pdf_table %>%
  str_subset(pattern = "seafood|sugar")
[1] " high in sugar-sweetened Mean daily consumption of beverages with =50 kcal per 226.8 serving, including carbonated                          3 g (0–5) per day                              36.9\n  beverages                    beverages, sodas, energy drinks, fruit drinks, but excluding 100% fruit and vegetable juices\n  "
[2] " low in seafood omega-3 Mean daily intake of eicosapentaenoic acid and docosahexaenoic acid                                              250 mg (200–300) per day                          94.9\n  fatty acids\n  "                                                                                                              

This is equivalent to using the str_which() function with []:

pdf_table[str_which(pdf_table, pattern = "seafood|sugar")]

Now we can replace these values within the pdf_table object after adding a space in front of “Mean”:

pdf_table[str_which(pdf_table,
  pattern =
    "seafood|sugar"
)] <- str_replace(
  string = pdf_table[str_which(pdf_table,
    pattern =
      "seafood|sugar"
  )],
  pattern = "Mean",
  replacement = " Mean"
)

And now we can try splitting again by two or more spaces:

table_split <- str_split(pdf_table, pattern = "\\s{2,}")

We could also just add a space in front of all the values of “Mean” in pdf_table since the split was performed based on two or more spaces. Thus the other elements in pdf_table would also be split just as before despite the additional space.

pdf_table <- pdf_table %>%
  stringr::str_replace(
    pattern = "Mean",
    replacement = " Mean"
  )
table_split <- str_split(pdf_table, pattern = "\\s{2,}")

# scroll the output!
glimpse(table_split)
List of 15
 $ : chr [1:6] " low in fruits" "Mean daily consumption of fruits (fresh, frozen, cooked, canned, or dried fruits, excluding" "250 g (200–300) per day" "94.9" ...
 $ : chr [1:7] " low in vegetables" "Mean daily consumption of vegetables (fresh, frozen, cooked, canned, or dried vegetables," "360 g (290–430) per day" "94.9" ...
 $ : chr [1:5] " low in legumes" "Mean daily consumption of legumes (fresh, frozen, cooked, canned, or dried legumes)" "60 g (50–70) per day" "94.9" ...
 $ : chr [1:7] " low in whole grains" "Mean daily consumption of whole grains (bran, germ, and endosperm in their natural" "125 g (100–150) per day" "94.9" ...
 $ : chr [1:5] " low in nuts and seeds" "Mean daily consumption of nut and seed foods" "21 g (16–25) per day" "94.9" ...
 $ : chr [1:6] " low in milk" "Mean daily consumption of milk including non-fat, low-fat, and full-fat milk, excluding soy" "435 g (350–520) per day" "94.9" ...
 $ : chr [1:6] " high in red meat" "Mean daily consumption of red meat (beef, pork, lamb, and goat, but excluding poultry, fish," "23 g (18–27) per day" "94.9" ...
 $ : chr [1:6] " high in processed meat" "Mean daily consumption of meat preserved by smoking, curing, salting, or addition of" "2 g (0–4) per day" "36.9" ...
 $ : chr [1:7] " high in sugar-sweetened" "Mean daily consumption of beverages with =50 kcal per 226.8 serving, including carbonated" "3 g (0–5) per day" "36.9" ...
 $ : chr [1:6] " low in fibre" "Mean daily intake of fibre from all sources including fruits, vegetables, grains, legumes, and" "24 g (19–28) per day" "94.9" ...
 $ : chr [1:5] " low in calcium" "Mean daily intake of calcium from all sources, including milk, yogurt, and cheese" "1.25 g (1.00–1.50) per day" "94.9" ...
 $ : chr [1:6] " low in seafood omega-3" "Mean daily intake of eicosapentaenoic acid and docosahexaenoic acid" "250 mg (200–300) per day" "94.9" ...
 $ : chr [1:7] " low in polyunsaturated" "Mean daily intake of omega-6 fatty acids from all sources, mainly liquid vegetable oils," "11% (9–13) of total daily energy" "94.9" ...
 $ : chr [1:6] " high in trans fatty acids" "Mean daily intake of trans fat from all sources, mainly partially hydrogenated vegetable oils" "0.5% (0.0–1.0) of total daily energy" "36.9" ...
 $ : chr [1:8] " high in sodium" "24 h urinary sodium measured in g per day" "3 g (1–5) per day*" "26.2" ...

Looks better!

We want just the first (the food category) and third column (the optimal consumption amount suggested) for each row in the table. However, the table is currently stored as a list of character vectors, so it is not quite so simple to extract these values.

We can use the map function of the purrr package to accomplish this.

The map function allows us to perform the same action multiple times across each element within an object, in this case, a list.

The following will allow us to select the first or third substring from each element of the pdf_table object.

category <- map(table_split, 1)
amount <- map(table_split, 3)
head(category)
[[1]]
[1] " low in fruits"

[[2]]
[1] " low in vegetables"

[[3]]
[1] " low in legumes"

[[4]]
[1] " low in whole grains"

[[5]]
[1] " low in nuts and seeds"

[[6]]
[1] " low in milk"
head(amount)
[[1]]
[1] "250 g (200–300) per day"

[[2]]
[1] "360 g (290–430) per day"

[[3]]
[1] "60 g (50–70) per day"

[[4]]
[1] "125 g (100–150) per day"

[[5]]
[1] "21 g (16–25) per day"

[[6]]
[1] "435 g (350–520) per day"

Now we will create a tibble using this data. However, currently both category and amount are of class list. To create a tibble we need to unlist the data to create vectors.

class(category)
[1] "list"
category %<>% unlist()
amount %<>% unlist()
class(category)
[1] "character"

category
 [1] " low in fruits"             " low in vegetables"        
 [3] " low in legumes"            " low in whole grains"      
 [5] " low in nuts and seeds"     " low in milk"              
 [7] " high in red meat"          " high in processed meat"   
 [9] " high in sugar-sweetened"   " low in fibre"             
[11] " low in calcium"            " low in seafood omega-3"   
[13] " low in polyunsaturated"    " high in trans fatty acids"
[15] " high in sodium"           
amount
 [1] "250 g (200–300) per day"             
 [2] "360 g (290–430) per day"             
 [3] "60 g (50–70) per day"                
 [4] "125 g (100–150) per day"             
 [5] "21 g (16–25) per day"                
 [6] "435 g (350–520) per day"             
 [7] "23 g (18–27) per day"                
 [8] "2 g (0–4) per day"                   
 [9] "3 g (0–5) per day"                   
[10] "24 g (19–28) per day"                
[11] "1.25 g (1.00–1.50) per day"          
[12] "250 mg (200–300) per day"            
[13] "11% (9–13) of total daily energy"    
[14] "0.5% (0.0–1.0) of total daily energy"
[15] "3 g (1–5) per day*"                  

We could have done all of this at once in one command like this:

category <- unlist(map(table_split, 1))
amount <- unlist(map(table_split, 3))

Now we will create a tibble, which is an important data frame structure in the tidyverse which allows us to use other packages in the tidyverse with our data.

We will name our tibble columns now as we create our tibble using the tibble() function of both the tidyr and the tibble packages, as names are required in tibbles.

guidelines <- tibble::tibble(
  category = category,
  amount = amount
)
guidelines
# A tibble: 15 x 2
   category                     amount                              
   <chr>                        <chr>                               
 1 " low in fruits"             250 g (200–300) per day             
 2 " low in vegetables"         360 g (290–430) per day             
 3 " low in legumes"            60 g (50–70) per day                
 4 " low in whole grains"       125 g (100–150) per day             
 5 " low in nuts and seeds"     21 g (16–25) per day                
 6 " low in milk"               435 g (350–520) per day             
 7 " high in red meat"          23 g (18–27) per day                
 8 " high in processed meat"    2 g (0–4) per day                   
 9 " high in sugar-sweetened"   3 g (0–5) per day                   
10 " low in fibre"              24 g (19–28) per day                
11 " low in calcium"            1.25 g (1.00–1.50) per day          
12 " low in seafood omega-3"    250 mg (200–300) per day            
13 " low in polyunsaturated"    11% (9–13) of total daily energy    
14 " high in trans fatty acids" 0.5% (0.0–1.0) of total daily energy
15 " high in sodium"            3 g (1–5) per day*                  

Looking pretty good!

Separating values within a variable


Recall that the main goal of this data wrangling is to extract the optimal intake level for each dietary factor. So while we have managed to pull and organize the data from the pdf table, we need to further process the results to isolate this numeric value.

Do to this, we want to separate the different numbers within the amount column, to isolate the optimal amount, and the optimal range, and eventually convert them to numeric values.

Recall what the original table looked like:

We can use the tidyr::separate() function to separate the data within the amount column into three new columns based on the optimal level and the optimal range. We can separate the values based on the open parentheses "(" and the long dash "–" characters. Again we will use the bar "|" to indicate that we want to separate by either character.

# The first column will be called optimal
# It will contain the 1st part of the amount column data before the "("
# The 2nd column will be called lower
# It will contain the data after the "("
# The 3rd column will be called upper
# It will contain the 2nd part of the data based on the "–"
# The "\\" are necessary - we will explain very soon

guidelines %<>%
  tidyr::separate(amount,
    c("optimal", "lower", "upper"),
    sep = "\\(|–"
  )

guidelines
# A tibble: 15 x 4
   category                     optimal   lower upper                     
   <chr>                        <chr>     <chr> <chr>                     
 1 " low in fruits"             "250 g "  200   300) per day              
 2 " low in vegetables"         "360 g "  290   430) per day              
 3 " low in legumes"            "60 g "   50    70) per day               
 4 " low in whole grains"       "125 g "  100   150) per day              
 5 " low in nuts and seeds"     "21 g "   16    25) per day               
 6 " low in milk"               "435 g "  350   520) per day              
 7 " high in red meat"          "23 g "   18    27) per day               
 8 " high in processed meat"    "2 g "    0     4) per day                
 9 " high in sugar-sweetened"   "3 g "    0     5) per day                
10 " low in fibre"              "24 g "   19    28) per day               
11 " low in calcium"            "1.25 g " 1.00  1.50) per day             
12 " low in seafood omega-3"    "250 mg " 200   300) per day              
13 " low in polyunsaturated"    "11% "    9     13) of total daily energy 
14 " high in trans fatty acids" "0.5% "   0.0   1.0) of total daily energy
15 " high in sodium"            "3 g "    1     5) per day*               

Let’s also create a new variable/column in our tibble that indicates the direction of over- or under-consumption that can be harmful for each dietary factor.

guidelines %<>%
  separate(category, c("direction", "food"), sep = " in ")
guidelines
# A tibble: 15 x 5
   direction food              optimal   lower upper                     
   <chr>     <chr>             <chr>     <chr> <chr>                     
 1 " low"    fruits            "250 g "  200   300) per day              
 2 " low"    vegetables        "360 g "  290   430) per day              
 3 " low"    legumes           "60 g "   50    70) per day               
 4 " low"    whole grains      "125 g "  100   150) per day              
 5 " low"    nuts and seeds    "21 g "   16    25) per day               
 6 " low"    milk              "435 g "  350   520) per day              
 7 " high"   red meat          "23 g "   18    27) per day               
 8 " high"   processed meat    "2 g "    0     4) per day                
 9 " high"   sugar-sweetened   "3 g "    0     5) per day                
10 " low"    fibre             "24 g "   19    28) per day               
11 " low"    calcium           "1.25 g " 1.00  1.50) per day             
12 " low"    seafood omega-3   "250 mg " 200   300) per day              
13 " low"    polyunsaturated   "11% "    9     13) of total daily energy 
14 " high"   trans fatty acids "0.5% "   0.0   1.0) of total daily energy
15 " high"   sodium            "3 g "    1     5) per day*               

If we wanted to remove the direction variable we could use the modify_at() function of the purrr package:

guidelines %>% purrr::modify_at("direction", ~NULL)

Data cleaning with regular expressions


OK, looking better, but we still need a bit of cleaning to remove symbols and extra words from the columns. Some of the extra symbols include: "%", ")" and the "*".

The "*" and the ")" are what we call metacharacters or regular expressions. These are characters that have special meanings.

Now we need the "\\" to indicate that we want these characters to be matched exactly and not interpreted as the meaning of the symbol. Recall that we used "\\(|–" earlier.

See here for more info about regular expressions in R.


Click here for a simple example of regular expressions using the str_count() function of the stringr package

The str_count() function counts the number of instances of a character string. In this case we will look for individual characters but you could also search for words or phrases.

regextest <- readr::read_file(here("docs", "regEx.txt"))
regextest
[1] "Testing for ts or\ttabs can be tricky.(yes, it really can!*)\r\n"

Count the letter t:

str_count(regextest, "t") # notice this doesn't include the t in the tab
[1] 5

Count tabs:

str_count(regextest, "\\t") # search for tab
[1] 1
# this would not work:
str_count(regextest, "[t]") # searches for the letter t
[1] 5

Count parentheses:

# this would not work because R thinks this is part of the code itself
# str_count(regextest, ")")
# this would not work because R thinks this is part of the code itself
# str_count(regextest, "\)")
str_count(regextest, "\\)") # this works!
[1] 1
# this works! because it is a punctuation character
str_count(regextest, "[)]")
[1] 1

Count the occurrence of the asterisk:

# this also does not work
# str_count(regextest, "*")
# nor does this
# str_count(regextest, "\*")
str_count(regextest, "\\*") # this works!
[1] 1
# this works! because it is a punctuation character
str_count(regextest, "[*]") # this works!
[1] 1

We also want to make a unit variable so that we can make sure that our units are consistent later.

guidelines %>%
  pull(optimal)
 [1] "250 g "  "360 g "  "60 g "   "125 g "  "21 g "   "435 g "  "23 g "  
 [8] "2 g "    "3 g "    "24 g "   "1.25 g " "250 mg " "11% "    "0.5% "  
[15] "3 g "   

Notice that the values that are percentages don’t have spaces between the number and the unit. We can separate the "optimal" values by a space or a percent symbol "%" using "|" to indicate that we want to separate by either. In this case we will lose the “%” and will need to add it back to those values.

guidelines %<>%
  separate(optimal,
    into = c("optimal", "unit"),
    sep = " |%",
    remove = FALSE
  )
guidelines
# A tibble: 15 x 6
   direction food              lower optimal unit  upper                     
   <chr>     <chr>             <chr> <chr>   <chr> <chr>                     
 1 " low"    fruits            200   250     "g"   300) per day              
 2 " low"    vegetables        290   360     "g"   430) per day              
 3 " low"    legumes           50    60      "g"   70) per day               
 4 " low"    whole grains      100   125     "g"   150) per day              
 5 " low"    nuts and seeds    16    21      "g"   25) per day               
 6 " low"    milk              350   435     "g"   520) per day              
 7 " high"   red meat          18    23      "g"   27) per day               
 8 " high"   processed meat    0     2       "g"   4) per day                
 9 " high"   sugar-sweetened   0     3       "g"   5) per day                
10 " low"    fibre             19    24      "g"   28) per day               
11 " low"    calcium           1.00  1.25    "g"   1.50) per day             
12 " low"    seafood omega-3   200   250     "mg"  300) per day              
13 " low"    polyunsaturated   9     11      ""    13) of total daily energy 
14 " high"   trans fatty acids 0.0   0.5     ""    1.0) of total daily energy
15 " high"   sodium            1     3       "g"   5) per day*               

Great, so to now we will add “%” to the unit variable for the "low in polyunsaturated" and "high in trans fatty acids" rows.

First we need to replace the empty values with NA using the na_if() function of the dplyr package.

guidelines %<>%
  na_if("")
guidelines
# A tibble: 15 x 6
   direction food              lower optimal unit  upper                     
   <chr>     <chr>             <chr> <chr>   <chr> <chr>                     
 1 " low"    fruits            200   250     g     300) per day              
 2 " low"    vegetables        290   360     g     430) per day              
 3 " low"    legumes           50    60      g     70) per day               
 4 " low"    whole grains      100   125     g     150) per day              
 5 " low"    nuts and seeds    16    21      g     25) per day               
 6 " low"    milk              350   435     g     520) per day              
 7 " high"   red meat          18    23      g     27) per day               
 8 " high"   processed meat    0     2       g     4) per day                
 9 " high"   sugar-sweetened   0     3       g     5) per day                
10 " low"    fibre             19    24      g     28) per day               
11 " low"    calcium           1.00  1.25    g     1.50) per day             
12 " low"    seafood omega-3   200   250     mg    300) per day              
13 " low"    polyunsaturated   9     11      <NA>  13) of total daily energy 
14 " high"   trans fatty acids 0.0   0.5     <NA>  1.0) of total daily energy
15 " high"   sodium            1     3       g     5) per day*               

Then to replace the NA values, we can use the replace_na() function in the tidyr package and the mutate() function of dplyr to specify which values to replace, in this case the NA values within the variable unit. Essentially this variable gets reassigned with the new values, as we mostly think of the mutate() function as creating new variables.

guidelines %<>%
  dplyr::mutate(unit = replace_na(unit, "%"))

# now just to show these rows
guidelines %>%
  filter(unit == "%")
# A tibble: 2 x 6
  direction food              lower optimal unit  upper                     
  <chr>     <chr>             <chr> <chr>   <chr> <chr>                     
1 " low"    polyunsaturated   9     11      %     13) of total daily energy 
2 " high"   trans fatty acids 0.0   0.5     %     1.0) of total daily energy

Let’s also move unit to be the last column. We can use the relocate() function of the dplyr package to do this. For more information about the relocate() function see here.

guidelines %<>%
  relocate(unit, .after = last_col())

To remove all of the remaining extra characters and words we will again use the stringr package. This time we will use the str_remove_all() function to remove all instances of these characters.

guidelines <- as_tibble(
  map(guidelines, str_remove_all,
    pattern = "\\) per day|\\) of total daily energy|\\*"
  )
)
guidelines
# A tibble: 15 x 6
   direction food              lower optimal upper unit 
   <chr>     <chr>             <chr> <chr>   <chr> <chr>
 1 " low"    fruits            200   250     300   g    
 2 " low"    vegetables        290   360     430   g    
 3 " low"    legumes           50    60      70    g    
 4 " low"    whole grains      100   125     150   g    
 5 " low"    nuts and seeds    16    21      25    g    
 6 " low"    milk              350   435     520   g    
 7 " high"   red meat          18    23      27    g    
 8 " high"   processed meat    0     2       4     g    
 9 " high"   sugar-sweetened   0     3       5     g    
10 " low"    fibre             19    24      28    g    
11 " low"    calcium           1.00  1.25    1.50  g    
12 " low"    seafood omega-3   200   250     300   mg   
13 " low"    polyunsaturated   9     11      13    %    
14 " high"   trans fatty acids 0.0   0.5     1.0   %    
15 " high"   sodium            1     3       5     g    

Nice! That’s pretty clean but we can do a bit more.

Data type conversion


One of the next things to notice about our data is all of our variables are of class character, which is not how we want them to be.

For example, the optimal amounts of consumption are currently of class character, which is indicated by the <chr> just below the column names/variable names of the guidelines tibble:

guidelines
# A tibble: 15 x 6
   direction food              lower optimal upper unit 
   <chr>     <chr>             <chr> <chr>   <chr> <chr>
 1 " low"    fruits            200   250     300   g    
 2 " low"    vegetables        290   360     430   g    
 3 " low"    legumes           50    60      70    g    
 4 " low"    whole grains      100   125     150   g    
 5 " low"    nuts and seeds    16    21      25    g    
 6 " low"    milk              350   435     520   g    
 7 " high"   red meat          18    23      27    g    
 8 " high"   processed meat    0     2       4     g    
 9 " high"   sugar-sweetened   0     3       5     g    
10 " low"    fibre             19    24      28    g    
11 " low"    calcium           1.00  1.25    1.50  g    
12 " low"    seafood omega-3   200   250     300   mg   
13 " low"    polyunsaturated   9     11      13    %    
14 " high"   trans fatty acids 0.0   0.5     1.0   %    
15 " high"   sodium            1     3       5     g    

To convert these values to numeric we use the mutate() and across() functions of the dplyr package.

The across() function has two main arguments: (i) the columns you want to operate on and (ii) the function or list of functions to apply to each column. In this case if we look at the beginning of the guidelines tibble, we can see that optimal, lower and upper should be converted. As these three columns are sequential, we can simply put a : between optimal and upper to indicate that we want all the variables in between these columns to be converted.

guidelines %<>%
  mutate(across(lower:upper, as.numeric))
guidelines
# A tibble: 15 x 6
   direction food              lower optimal upper unit 
   <chr>     <chr>             <dbl>   <dbl> <dbl> <chr>
 1 " low"    fruits              200  250    300   g    
 2 " low"    vegetables          290  360    430   g    
 3 " low"    legumes              50   60     70   g    
 4 " low"    whole grains        100  125    150   g    
 5 " low"    nuts and seeds       16   21     25   g    
 6 " low"    milk                350  435    520   g    
 7 " high"   red meat             18   23     27   g    
 8 " high"   processed meat        0    2      4   g    
 9 " high"   sugar-sweetened       0    3      5   g    
10 " low"    fibre                19   24     28   g    
11 " low"    calcium               1    1.25   1.5 g    
12 " low"    seafood omega-3     200  250    300   mg   
13 " low"    polyunsaturated       9   11     13   %    
14 " high"   trans fatty acids     0    0.5    1   %    
15 " high"   sodium                1    3      5   g    

Great! Now these variables are of class <dbl> (stands for double) which indicates that they are numeric. Here is a link for more information on numeric classes in R.

If we had not replaced the "·" interpunct values to a period, conversion from character to numeric would be problematic and would result in NA values.

Data value reassignments


We seem to have lost the word "beverages" from the "sugar-sweetened beverages" category, as well as "fatty acids" from the "seafood omega 3 fatty acids", and the "polyunsaturated fatty acids" categories as the full category name was listed on two lines within the table. We would like to replace these values with the full name.

To select the food variable we will show you several options. Only a couple will work well with reassigning the data in that particular variable within guidelines without assigning an intermediate data object. We will look using mutate_at(), pull(), select(), and two styles of brackets ["variable name"] and [["variablename"]].

The bracket ["variable name"] option and the select() option will grab a tibble (data frame) version of the food column out of guidelines. However we can’t start commands with select for assignments.

guidelines["food"] # same output as select
# A tibble: 15 x 1
   food             
   <chr>            
 1 fruits           
 2 vegetables       
 3 legumes          
 4 whole grains     
 5 nuts and seeds   
 6 milk             
 7 red meat         
 8 processed meat   
 9 sugar-sweetened  
10 fibre            
11 calcium          
12 seafood omega-3  
13 polyunsaturated  
14 trans fatty acids
15 sodium           
select(guidelines, "food") # same output as brackets
# A tibble: 15 x 1
   food             
   <chr>            
 1 fruits           
 2 vegetables       
 3 legumes          
 4 whole grains     
 5 nuts and seeds   
 6 milk             
 7 red meat         
 8 processed meat   
 9 sugar-sweetened  
10 fibre            
11 calcium          
12 seafood omega-3  
13 polyunsaturated  
14 trans fatty acids
15 sodium           

pull() and the bracket [["variable name"]] option in contrast, will grab the vector version of the food data:

pull(guidelines, "food") # get character vector not a tibble
 [1] "fruits"            "vegetables"        "legumes"          
 [4] "whole grains"      "nuts and seeds"    "milk"             
 [7] "red meat"          "processed meat"    "sugar-sweetened"  
[10] "fibre"             "calcium"           "seafood omega-3"  
[13] "polyunsaturated"   "trans fatty acids" "sodium"           
# bracket option:
guidelines[["food"]] # get character vector not a tibble
 [1] "fruits"            "vegetables"        "legumes"          
 [4] "whole grains"      "nuts and seeds"    "milk"             
 [7] "red meat"          "processed meat"    "sugar-sweetened"  
[10] "fibre"             "calcium"           "seafood omega-3"  
[13] "polyunsaturated"   "trans fatty acids" "sodium"           

The pull() function can be very useful when combined with other functions (for example you typically want to use a vector with the str_replace() function), but just like select, we can’t start assignments with pull().

This is not possible and will result in an error:

select(guidelines, food) <-
  str_replace(
    pull(guidelines, "food"),
    pattern = "sugar-sweetened",
    replacement = "sugar-sweetened beverages"
  )

guidelines %>% select(food) <-
  str_replace(
    pull(guidelines, "food"),
    pattern = "sugar-sweetened",
    replacement = "sugar-sweetened beverages"
  )

This will only print the result, but not reassign the food variable values:

guidelines %>%
  pull(food) %>%
  str_replace(
    pattern = "sugar-sweetened",
    replacement = "sugar-sweetened beverages"
  )
 [1] "fruits"                    "vegetables"               
 [3] "legumes"                   "whole grains"             
 [5] "nuts and seeds"            "milk"                     
 [7] "red meat"                  "processed meat"           
 [9] "sugar-sweetened beverages" "fibre"                    
[11] "calcium"                   "seafood omega-3"          
[13] "polyunsaturated"           "trans fatty acids"        
[15] "sodium"                   

Using select() would work as well to print the result (although the result structure is different):

guidelines %>%
  select(food) %>%
  str_replace(
    pattern = "sugar-sweetened",
    replacement = "sugar-sweetened beverages"
  )
[1] "c(\"fruits\", \"vegetables\", \"legumes\", \"whole grains\", \"nuts and seeds\", \"milk\", \"red meat\", \"processed meat\", \"sugar-sweetened beverages\", \"fibre\", \"calcium\", \"seafood omega-3\", \"polyunsaturated\", \"trans fatty acids\", \"sodium\")"

Question opportunity:

Why do these commands not reassign the food variable values?

The bracket option is great alternative and allows us to reassign the values within guidelines easily. Either of the two styles of brackets: ["variable name"] and [["variablename"]] will work.

# 1st method: `["variable name"]`
# Replacing "sugar-sweetened" with "sugar-sweetened beverages"
guidelines["food"] <-
  str_replace(
    pull(guidelines, "food"),
    pattern = "sugar-sweetened",
    replacement = "sugar-sweetened beverages"
  )

# 2nd method: `[["variablename"]]`
# Replacing "seafood omega-3" with"seafood omega-3 fatty acids"
guidelines[["food"]] <-
  str_replace(
    pull(guidelines, "food"),
    pattern = "seafood omega-3",
    replacement = "seafood omega-3 fatty acids"
  )

guidelines
# A tibble: 15 x 6
   direction food                        lower optimal upper unit 
   <chr>     <chr>                       <dbl>   <dbl> <dbl> <chr>
 1 " low"    fruits                        200  250    300   g    
 2 " low"    vegetables                    290  360    430   g    
 3 " low"    legumes                        50   60     70   g    
 4 " low"    whole grains                  100  125    150   g    
 5 " low"    nuts and seeds                 16   21     25   g    
 6 " low"    milk                          350  435    520   g    
 7 " high"   red meat                       18   23     27   g    
 8 " high"   processed meat                  0    2      4   g    
 9 " high"   sugar-sweetened beverages       0    3      5   g    
10 " low"    fibre                          19   24     28   g    
11 " low"    calcium                         1    1.25   1.5 g    
12 " low"    seafood omega-3 fatty acids   200  250    300   mg   
13 " low"    polyunsaturated                 9   11     13   %    
14 " high"   trans fatty acids               0    0.5    1   %    
15 " high"   sodium                          1    3      5   g    

Finally, the best option is probably the mutate_at() function from dplyr. In this case we need to include ~ in front of the function that we would like to use on the values in our food variables. We also include . as a replacement to reference the data that we want to use within str_replace() (which in this case is the food variable values of guidelines).

Notice we didn’t need this when we previously use mutate_at() with the as.numeric() function. This is because the str_replace() function requires us to specify what data we are using as one of the arguments, while as.numeric() does not.

# Replacing "polyunsaturated" with"polyunsaturated fatty acids"
guidelines %<>%
  mutate_at(
    vars(food),
    ~ str_replace(
      string = .,
      pattern = "polyunsaturated",
      replacement = "polyunsaturated fatty acids"
    )
  )

guidelines
# A tibble: 15 x 6
   direction food                        lower optimal upper unit 
   <chr>     <chr>                       <dbl>   <dbl> <dbl> <chr>
 1 " low"    fruits                        200  250    300   g    
 2 " low"    vegetables                    290  360    430   g    
 3 " low"    legumes                        50   60     70   g    
 4 " low"    whole grains                  100  125    150   g    
 5 " low"    nuts and seeds                 16   21     25   g    
 6 " low"    milk                          350  435    520   g    
 7 " high"   red meat                       18   23     27   g    
 8 " high"   processed meat                  0    2      4   g    
 9 " high"   sugar-sweetened beverages       0    3      5   g    
10 " low"    fibre                          19   24     28   g    
11 " low"    calcium                         1    1.25   1.5 g    
12 " low"    seafood omega-3 fatty acids   200  250    300   mg   
13 " low"    polyunsaturated fatty acids     9   11     13   %    
14 " high"   trans fatty acids               0    0.5    1   %    
15 " high"   sodium                          1    3      5   g    

This might be considered a better option because it is more readable as to where the food data came from that we are replacing values within.

There is one last minor detail… the direction variable has leading spaces still. We can use str_trim() to fix that!

guidelines %<>%
  mutate_at(vars(direction), str_trim)

guidelines
# A tibble: 15 x 6
   direction food                        lower optimal upper unit 
   <chr>     <chr>                       <dbl>   <dbl> <dbl> <chr>
 1 low       fruits                        200  250    300   g    
 2 low       vegetables                    290  360    430   g    
 3 low       legumes                        50   60     70   g    
 4 low       whole grains                  100  125    150   g    
 5 low       nuts and seeds                 16   21     25   g    
 6 low       milk                          350  435    520   g    
 7 high      red meat                       18   23     27   g    
 8 high      processed meat                  0    2      4   g    
 9 high      sugar-sweetened beverages       0    3      5   g    
10 low       fibre                          19   24     28   g    
11 low       calcium                         1    1.25   1.5 g    
12 low       seafood omega-3 fatty acids   200  250    300   mg   
13 low       polyunsaturated fatty acids     9   11     13   %    
14 high      trans fatty acids               0    0.5    1   %    
15 high      sodium                          1    3      5   g    

OK! Now we know how much of each dietary factor we generally need for optimal health according to the guidelines used in this article.

Comparing data


Recall that the main goal of pulling the guideline amounts from the pdf was that we would like to see how the mean consumption rates for the different groups of people compared to the optimal intake guidelines.

One way we could do this is to calculate a consumption percentage of the optimal value.

To calculate this it would be helpful to put the guideline amounts with the average consumption rates into the same tibble, especially because the observed consumption data (diet_data and sep_age_diet_data) are very different dimensions from the guidelines data.

In order to create a tibble with our observed consumption rates with the suggested consumption rates, we will join our data using dplyr. In order to do so it is important that our different data sets have at least one column with the same values that we can use to join them together. So let’s first assess if that is the case.

distinct(diet_data, dietary_risk)
# A tibble: 15 x 1
   dietary_risk                           
   <chr>                                  
 1 Diet high in processed meat            
 2 Diet high in red meat                  
 3 Diet high in sodium                    
 4 Diet high in sugar-sweetened beverages 
 5 Diet high in trans fatty acids         
 6 Diet low in calcium                    
 7 Diet low in fiber                      
 8 Diet low in fruits                     
 9 Diet low in legumes                    
10 Diet low in milk                       
11 Diet low in nuts and seeds             
12 Diet low in polyunsaturated fatty acids
13 Diet low in seafood omega-3 fatty acids
14 Diet low in vegetables                 
15 Diet low in whole grains               
select(guidelines, food)
# A tibble: 15 x 1
   food                       
   <chr>                      
 1 fruits                     
 2 vegetables                 
 3 legumes                    
 4 whole grains               
 5 nuts and seeds             
 6 milk                       
 7 red meat                   
 8 processed meat             
 9 sugar-sweetened beverages  
10 fibre                      
11 calcium                    
12 seafood omega-3 fatty acids
13 polyunsaturated fatty acids
14 trans fatty acids          
15 sodium                     

We are actually pretty close: there are 15 dietary factors in each data set, and the names are nearly the same. To make them match completely, we can see that we need to remove the "Diet low in" and "Diet high in" phrases from the observed consumption data.

diet_data %<>%
  mutate_at(
    vars(dietary_risk),
    ~ str_remove(
      string = .,
      pattern = "Diet low in |Diet high in "
    )
  )

sep_age_diet_data %<>%
  mutate_at(
    vars(dietary_risk),
    ~ str_remove(
      string = .,
      pattern = "Diet low in |Diet high in "
    )
  )

Also let’s double check that the two observed files have the same exact values for dietary factor names.

We can use the setequal() function from dplyr to check that the unique values for dietary_risk are the same for both diet_data and sep_age_diet_data.

setequal(
  distinct(diet_data, dietary_risk),
  distinct(sep_age_diet_data, dietary_risk)
)
[1] TRUE

Great!

Note that the default of the set_equal function ignores the order of values in rows. So we still don’t know if the order is the same.

We can check using the all_equal function of dplyr which reports back clues about what might be different if anything. Importantly we are including ignore_row_order = FALSE as the default is TRUE.

all_equal(distinct(diet_data, dietary_risk),
  distinct(sep_age_diet_data, dietary_risk),
  ignore_row_order = FALSE
)
[1] "Same row values, but different order"

Looks like they are not in the same order.

Note that if any of the values are different, all_equal() will first report this and will not report that the rows are in a different order.


Click here to see a toy example about how the three comparison functions (setequal(), all_equal() (also all.equal() for tbl_df), and setdiff()) work in dplyr.

It’s important to realize that row order is ignored by bothsetequal() and setdiff().

Now let’s compare two tibbles that have different row orders and different values.

Here are our tibbles to compare:

X <- tibble(test = c("A", "B", "AC", "D"))
Y <- tibble(test = c("A", "D", "A", "B"))
X
# A tibble: 4 x 1
  test 
  <chr>
1 A    
2 B    
3 AC   
4 D    
Y
# A tibble: 4 x 1
  test 
  <chr>
1 A    
2 D    
3 A    
4 B    
class(Y)
[1] "tbl_df"     "tbl"        "data.frame"

Since we are using tibbles, which are of class tbl_df we can use either all_equal or all.equal(). Notice that it doesn’t report rows being a different order because it first tells what rows have unique values or rows with a value that has a different number of frequency.

all_equal(X, Y, ignore_row_order = TRUE)
[1] "- Rows in x but not in y: 3\n"
all_equal(X, Y, ignore_row_order = FALSE)
[1] "- Rows in x but not in y: 3\n"
# Doesn't report rows being different order
all.equal(X, Y, ignore_row_order = TRUE)
[1] "Component \"test\": 3 string mismatches"
all.equal(X, Y, ignore_row_order = FALSE)
[1] "Component \"test\": 3 string mismatches"
# Doesn't report rows being different order

setequal() does not provide clues about what is different but TRUE (no differences) or FALSE (indicating at least one difference).

# Reports false indicating at least one difference
setequal(X, Y)
[1] FALSE

setdiff() tells us what is different and is dependent on the order of the objects compared, but prioritizes the values that are unique to each.

# This reports what is unique to X
setdiff(X, Y)
# A tibble: 1 x 1
  test 
  <chr>
1 AC   
# This reports what is unique to Y - nothing in this case
setdiff(Y, X)
# A tibble: 0 x 1
# ... with 1 variable: test <chr>

Now let’s make it so that only the order is different:

Y <- tibble(test = c("A", "D", "AC", "B"))
X
# A tibble: 4 x 1
  test 
  <chr>
1 A    
2 B    
3 AC   
4 D    
Y
# A tibble: 4 x 1
  test 
  <chr>
1 A    
2 D    
3 AC   
4 B    

Now that there are no values that are unique to either X or Y, all_equal() reports that there is a different order.

all_equal(X, Y, ignore_row_order = TRUE)
[1] TRUE
all_equal(X, Y, ignore_row_order = FALSE) # reports diff order
[1] "Same row values, but different order"

Remember setequal() ignores order and gives a value of TRUE for no differences.

# It reports no difference!
setequal(X, Y)
[1] TRUE

setdiff() also ignores order and shows no differences.

setdiff(X, Y)
# A tibble: 0 x 1
# ... with 1 variable: test <chr>

If we have different column/variable names this makes comparisons more challenging. Columns will be identified for having different names.

X <- tibble(colname1 = c("A", "B", "AC", "D"))
Y <- tibble(colname2 = c("A", "D", "AG", "B"))

all_equal() will simply report that col names are different

all_equal(X, Y, ignore_row_order = TRUE)
[1] "not compatible: \n- Cols in y but not x: `colname2`.\n- Cols in x but not y: `colname1`.\n"
all_equal(X, Y, ignore_row_order = FALSE)
[1] "not compatible: \n- Cols in y but not x: `colname2`.\n- Cols in x but not y: `colname1`.\n"

seteqaul() will report TRUE or FALSE to indicate either a difference in columns or rows

setequal(X, Y)
[1] FALSE

setdiff() requires that column names be the same so this will cause an error:

setdiff(X, Y) # This will not work

OK, let’s keep going with our data.

How similar are the guidelines tibble and the observed consumption tibbles?

setequal(
  distinct(diet_data, dietary_risk),
  select(guidelines, food)
)
[1] FALSE

OK, looks like we have some different values.

Let’s use the setdiff function to get more information about what is different between the values.

setdiff(
  distinct(diet_data, dietary_risk),
  select(guidelines, food)
)

:( That wont work. This is because setdiff() requires that the column names are the same in the objects that we are comparing.

We can use the rename() function from dplyr to do this. We list the value that we want to change to first. We find “food” more intuitive now so we are going to change “dietary_risk” to “food” for the diet_data and the sep_age_diet_data:

diet_data %<>%
  dplyr::rename(food = dietary_risk)
sep_age_diet_data %<>%
  dplyr::rename(food = dietary_risk)
setdiff(
  distinct(diet_data, food),
  select(guidelines, food)
)
# A tibble: 1 x 1
  food 
  <chr>
1 fiber

Great, now we know that the fiber value appears to be different between the two.

Checking our original files we can see that the British spelling “fibre” is used in the table from the article (that we used to create guidelines), in contrast to the American spelling “fiber” used in the CSV files.

Let’s stick with the American spelling, so we will replace "fibre" in the guideline tibble.

guidelines %<>%
  mutate_at(
    vars(food),
    ~ str_replace(
      string = .,
      pattern = "fibre",
      replacement = "fiber"
    )
  )

guidelines %>%
  filter(food == "fiber")
# A tibble: 1 x 6
  direction food  lower optimal upper unit 
  <chr>     <chr> <dbl>   <dbl> <dbl> <chr>
1 low       fiber    19      24    28 g    

Now let’s check again to see that our food values match between the guidelines and the observed consumption data tibbles.

setdiff(
  select(guidelines, food),
  distinct(diet_data, food)
)
# A tibble: 0 x 1
# ... with 1 variable: food <chr>
setdiff(
  select(guidelines, food),
  distinct(sep_age_diet_data, food)
)
# A tibble: 0 x 1
# ... with 1 variable: food <chr>

Great! There are no differences :)

Joining data


Now we can put our guideline data together with the diet_data and the sep_age_diet_data.

Remember that the food data in our guidelines tibble is not necessarily in the same order as that of the consumption data tibbles. Thus this could be a problem if we decided to expand the guidelines rows (to repeat for the number of fruit observations etc.) and add them to our observed consumption tibbles by binding them together by column.

[source]

In that case we could use the arrange() function of dplyr to sort the data alphabetically.

However, we will instead use a joining function of dplyr. These functions combine the data together based on common values and don’t require the rows to be in the same order. There are a variety of options.

[source]

In our case we would like to retain all of the values of diet_data and sep_age_diet_data. We would like to add new columns of values to these tibbles that correspond to the guideline information about amounts of consumption for each food type in the guidelines tibble. We shouldn’t have any values of food in guidelines that don’t match, so we will not get any NA values. Therefore, in our case any of the mutating join functions should result in the same output.

It’s important to check if we have any overlapping variable names before we join the data. Otherwise, these columns will either be used to identify which rows to join, or new copies of the columns, with a default name to distinguish the columns of one data set from those of the other, will be created. We can use the base R function names() and the intersect() function of the dplyr package to identify which column names are common to our two data sets.

dplyr::intersect(
  names(diet_data),
  names(guidelines)
)
[1] "food"  "upper" "lower" "unit" 

So it looks like the "upper" , "lower" and "unit" variable names are overlapping. Therefore, to distinguish the names later we will rename the guideline "upper" , "lower" and "unit" variables.

We will again use the rename function from the dplyr package. We can list multiple variables to rename and separate each with a comma. We need to list the new names first.

guidelines %<>%
  rename(
    upper_optimal = upper,
    lower_optimal = lower,
    unit_optimal = unit
  )

guidelines
# A tibble: 15 x 6
   direction food               lower_optimal optimal upper_optimal unit_optimal
   <chr>     <chr>                      <dbl>   <dbl>         <dbl> <chr>       
 1 low       fruits                       200  250            300   g           
 2 low       vegetables                   290  360            430   g           
 3 low       legumes                       50   60             70   g           
 4 low       whole grains                 100  125            150   g           
 5 low       nuts and seeds                16   21             25   g           
 6 low       milk                         350  435            520   g           
 7 high      red meat                      18   23             27   g           
 8 high      processed meat                 0    2              4   g           
 9 high      sugar-sweetened b~             0    3              5   g           
10 low       fiber                         19   24             28   g           
11 low       calcium                        1    1.25           1.5 g           
12 low       seafood omega-3 f~           200  250            300   mg          
13 low       polyunsaturated f~             9   11             13   %           
14 high      trans fatty acids              0    0.5            1   %           
15 high      sodium                         1    3              5   g           

It is also a good idea to check our units to make sure they are the same for both guidelines and the observed consumption tibbles(diet_and_guidelines and all_age_diet_and_guidelines).

Let’s take a look with the count() function of the dplyr package. We will also use the bind_cols() function of dplyr to put the data together so that we can see it easily.

dplyr::bind_cols(
  count(diet_data, unit, food),
  count(sep_age_diet_data, unit, food),
  count(guidelines, unit_optimal, food)
)
# A tibble: 15 x 9
   unit...1   food...2 n...3 unit...4 food...5 n...6 unit_optimal food...8 n...9
   <chr>      <chr>    <int> <chr>    <chr>    <int> <chr>        <chr>    <int>
 1 % energy/~ polyuns~   392 % energ~ polyuns~  5880 %            polyuns~     1
 2 % energy/~ trans f~   392 % energ~ trans f~  5880 %            trans f~     1
 3 g/day      calcium    392 g/day    calcium   5880 g            calcium      1
 4 g/day      fiber      392 g/day    fiber     5880 g            fiber        1
 5 g/day      fruits     392 g/day    fruits    5880 g            fruits       1
 6 g/day      legumes    392 g/day    legumes   5880 g            legumes      1
 7 g/day      milk       392 g/day    milk      5880 g            milk         1
 8 g/day      nuts an~   392 g/day    nuts an~  5880 g            nuts an~     1
 9 g/day      process~   392 g/day    process~  5880 g            process~     1
10 g/day      red meat   392 g/day    red meat  5880 g            red meat     1
11 g/day      seafood~   392 g/day    seafood~  5880 g            sodium       1
12 g/day      sodium     392 g/day    sodium    5880 g            sugar-s~     1
13 g/day      sugar-s~   392 g/day    sugar-s~  5880 g            vegetab~     1
14 g/day      vegetab~   392 g/day    vegetab~  5880 g            whole g~     1
15 g/day      whole g~   392 g/day    whole g~  5880 mg           seafood~     1

We can see that the only potential issue is the seafood omega-3 fatty acids data which is in g/day for the observed data(diet_data and all_age_diet_and_guidelines), but in mg/day in the guidelines data.

We can account for this by dividing the guidelines seafood omega-3 fatty acids data by 1000 to convert it to grams from milligrams.

To do this we will use the if_else() function in the dplyr package. This allows us to specify a condition (in this case if the unit is "mg"), as well as values if this condition is met (true), or if the condition is not met (false).

In the following we mutate the values in each of the guideline numeric columns (lower, optimal and upper) one at a time. When we refer to lower for example we refer to the values in the column/variable. So if the condition is not met, then the original value is retained. We will also replace "mg" with "g" after everything is converted to grams.

# "lower_optimal" variable
guidelines %<>%
  mutate(lower_optimal = dplyr::if_else(
    condition = unit_optimal == "mg",
    true = lower_optimal / 1000,
    false = lower_optimal
  ))
# Explanation for the use of "if_else()" here
# If the "unit_optimal" variable is in "mg", we convert the corresponding "lower_optimal" (currently in mg) variable to grams (g) by dividing by 1,000.
# If not, the corresponding "lower_optimal" (already in g) is not changed

# "optimal" variable
guidelines %<>%
  mutate(optimal = if_else(condition = unit_optimal == "mg",
    true = optimal / 1000,
    false = optimal
  ))

# "upper_optimal" variable
guidelines %<>%
  mutate(upper_optimal = if_else(condition = unit_optimal == "mg",
    true = upper_optimal / 1000,
    false = upper_optimal
  ))

# replace "mg" with "g" in the "unit_optimal" variable
guidelines %<>%
  mutate(unit_optimal = if_else(condition = unit_optimal == "mg",
    true = "g",
    false = unit_optimal
  ))

guidelines
# A tibble: 15 x 6
   direction food               lower_optimal optimal upper_optimal unit_optimal
   <chr>     <chr>                      <dbl>   <dbl>         <dbl> <chr>       
 1 low       fruits                     200    250            300   g           
 2 low       vegetables                 290    360            430   g           
 3 low       legumes                     50     60             70   g           
 4 low       whole grains               100    125            150   g           
 5 low       nuts and seeds              16     21             25   g           
 6 low       milk                       350    435            520   g           
 7 high      red meat                    18     23             27   g           
 8 high      processed meat               0      2              4   g           
 9 high      sugar-sweetened b~           0      3              5   g           
10 low       fiber                       19     24             28   g           
11 low       calcium                      1      1.25           1.5 g           
12 low       seafood omega-3 f~           0.2    0.25           0.3 g           
13 low       polyunsaturated f~           9     11             13   %           
14 high      trans fatty acids            0      0.5            1   %           
15 high      sodium                       1      3              5   g           

Click here to see a couple of other ways to do this:
# Another possible way with dplyr::case_when():
guidelines %<>%
  mutate(lower_optimal = case_when(
    unit_optimal == "mg" ~ lower_optimal / 1000,
    unit_optimal != "mg" ~ lower_optimal
  ))

# Or could use this:
guidelines %<>%
  mutate_at(
    vars(unit_optimal),
    ~ str_replace(
      string = .,
      pattern = "mg",
      replacement = "g"
    )
  )

In contrast we could have changed or mutated the values for lower_optimal, optimal, upper_optimal all at once like this using the funs() argument in mutate_at() of dplyr.

guidelines[str_which(
  string = guidelines[["food"]],
  pattern = "seafood omega-3 fatty acids"
), ] <- guidelines %>%
  filter(food == "seafood omega-3 fatty acids") %>%
  mutate_at(vars(lower_optimal:upper_optimal), funs(. / 1000))

Now we are ready to join the data!

Again, we would like to add new columns of values to diet_data and all_age_diet_and_guidelines that correspond to the guideline information about amounts of consumption for each food type in the guidelines tibble. So we will join the data based on the food variable values. We will use the full_join() function of the dplyr package.

diet_and_guidelines <- diet_data %>%
  dplyr::full_join(guidelines, by = "food")

all_age_diet_and_guidelines <- sep_age_diet_data %>%
  full_join(guidelines, by = "food")

glimpse(diet_and_guidelines)
Rows: 5,880
Columns: 16
$ year_id        <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2~
$ location_name  <chr> "Global", "Global", "China", "China", "North Korea", "N~
$ rei_id         <dbl> 117, 117, 117, 117, 117, 117, 117, 117, 117, 117, 117, ~
$ food           <chr> "processed meat", "processed meat", "processed meat", "~
$ age_group_name <chr> "All available ages", "All available ages", "All availa~
$ sex            <chr> "Male", "Female", "Male", "Female", "Male", "Female", "~
$ parameter      <chr> "continuous", "continuous", "continuous", "continuous",~
$ mean           <dbl> 4.2865629, 3.2640990, 2.3176975, 1.7512870, 0.5665229, ~
$ upper          <dbl> 4.4633117, 3.3765360, 2.6944978, 2.0454134, 0.6596296, ~
$ lower          <dbl> 4.1309531, 3.1547299, 1.9933744, 1.5161724, 0.4818201, ~
$ unit           <chr> "g/day", "g/day", "g/day", "g/day", "g/day", "g/day", "~
$ direction      <chr> "high", "high", "high", "high", "high", "high", "high",~
$ lower_optimal  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ optimal        <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2~
$ upper_optimal  <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4~
$ unit_optimal   <chr> "g", "g", "g", "g", "g", "g", "g", "g", "g", "g", "g", ~
glimpse(all_age_diet_and_guidelines)
Rows: 88,200
Columns: 16
$ year_id        <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2~
$ location_name  <chr> "Global", "Global", "Global", "Global", "Global", "Glob~
$ food           <chr> "calcium", "calcium", "calcium", "calcium", "calcium", ~
$ age_group_id   <dbl> 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 30, 31, 32,~
$ age_group_name <chr> "25 to 29", "30 to 34", "35 to 39", "40 to 44", "45 to ~
$ sex            <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male",~
$ parameter      <chr> "continuous", "continuous", "continuous", "continuous",~
$ mean           <dbl> 0.4166193, 0.4253721, 0.4352053, 0.4412156, 0.4565795, ~
$ upper          <dbl> 0.4310974, 0.4403903, 0.4481282, 0.4579218, 0.4763366, ~
$ lower          <dbl> 0.4030347, 0.4110026, 0.4222248, 0.4264522, 0.4381799, ~
$ unit           <chr> "g/day", "g/day", "g/day", "g/day", "g/day", "g/day", "~
$ direction      <chr> "low", "low", "low", "low", "low", "low", "low", "low",~
$ lower_optimal  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
$ optimal        <dbl> 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1~
$ upper_optimal  <dbl> 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, ~
$ unit_optimal   <chr> "g", "g", "g", "g", "g", "g", "g", "g", "g", "g", "g", ~

It’s always a good idea to check that the values are what you expect after merging.

diet_and_guidelines %>%
  count(food, optimal)
# A tibble: 15 x 3
   food                        optimal     n
   <chr>                         <dbl> <int>
 1 calcium                        1.25   392
 2 fiber                         24      392
 3 fruits                       250      392
 4 legumes                       60      392
 5 milk                         435      392
 6 nuts and seeds                21      392
 7 polyunsaturated fatty acids   11      392
 8 processed meat                 2      392
 9 red meat                      23      392
10 seafood omega-3 fatty acids    0.25   392
11 sodium                         3      392
12 sugar-sweetened beverages      3      392
13 trans fatty acids              0.5    392
14 vegetables                   360      392
15 whole grains                 125      392
all_age_diet_and_guidelines %>%
  count(food, optimal)
# A tibble: 15 x 3
   food                        optimal     n
   <chr>                         <dbl> <int>
 1 calcium                        1.25  5880
 2 fiber                         24     5880
 3 fruits                       250     5880
 4 legumes                       60     5880
 5 milk                         435     5880
 6 nuts and seeds                21     5880
 7 polyunsaturated fatty acids   11     5880
 8 processed meat                 2     5880
 9 red meat                      23     5880
10 seafood omega-3 fatty acids    0.25  5880
11 sodium                         3     5880
12 sugar-sweetened beverages      3     5880
13 trans fatty acids              0.5   5880
14 vegetables                   360     5880
15 whole grains                 125     5880
# For easy comparison we will arrange by food alphabetically
arrange(guidelines, food)
# A tibble: 15 x 6
   direction food               lower_optimal optimal upper_optimal unit_optimal
   <chr>     <chr>                      <dbl>   <dbl>         <dbl> <chr>       
 1 low       calcium                      1      1.25           1.5 g           
 2 low       fiber                       19     24             28   g           
 3 low       fruits                     200    250            300   g           
 4 low       legumes                     50     60             70   g           
 5 low       milk                       350    435            520   g           
 6 low       nuts and seeds              16     21             25   g           
 7 low       polyunsaturated f~           9     11             13   %           
 8 high      processed meat               0      2              4   g           
 9 high      red meat                    18     23             27   g           
10 low       seafood omega-3 f~           0.2    0.25           0.3 g           
11 high      sodium                       1      3              5   g           
12 high      sugar-sweetened b~           0      3              5   g           
13 high      trans fatty acids            0      0.5            1   %           
14 low       vegetables                 290    360            430   g           
15 low       whole grains               100    125            150   g           

Looks good!

Calculating relative consumption


Recall that our aim is to compare the consumption rates of these dietary factors by different groups of people, and ideally, to facilitate cross-factor comparisons, we want to consider consumption rates relative to the optimal guidelines.

To do this, let’s calculate values of consumption that are relative to the suggested guidelines.

There are a few approaches we could take. One is to calculate a "percentage of optimal consumption" using the mean value for each observed factor relative to its optimal value. To do this we will use the mutate() function of the dplyrpackage. This will create a new variable called Relative_Percent that will be equal to the ratio of the mean value and the optimal value, multiplied by 100, to create a percentage relative to the optimal amount suggested.

diet_and_guidelines %<>%
  mutate(Relative_Percent = (mean / optimal) * 100)

all_age_diet_and_guidelines %<>%
  mutate(Relative_Percent = (mean / optimal) * 100)

Another option is to incorporate the range of optimal intakes and the direction that is associated with health risk. If the direction of risk is high and the consumption was greater than the optimal mean value, than the percentage is calculated based on the upper_optimal value, while if the direction of risk is low and the consumption is less than the optimal mean value, then the percentage is calculated based on the lower_optimal value. We will use the case_when() function of the dplyr package to do this. This allows us to specify values (indicated on the right side of the ~symbol) based on specific conditions (indicated on the left side of the ~ symbol). We can specify multiple conditions using the & symbol.

diet_and_guidelines %<>%
  mutate(range_percent = case_when(
    direction == "high" ~ (mean / upper_optimal) * 100,
    direction == "low" ~ (mean / lower_optimal) * 100
  ))

all_age_diet_and_guidelines %<>%
  mutate(range_percent = case_when(
    direction == "high" ~ (mean / upper_optimal) * 100,
    direction == "low" ~ (mean / lower_optimal) * 100
  ))


diet_and_guidelines %<>%
  mutate(percent_over_under = case_when(
    direction == "high" & mean > upper_optimal ~
    ((mean - upper_optimal) / upper_optimal) * 100,
    direction == "high" & mean <= upper_optimal ~ 0,
    direction == "low" & mean >= lower_optimal ~ 0,
    direction == "low" & mean < lower_optimal ~
    ((lower_optimal - mean) / lower_optimal) * -100
  ))


all_age_diet_and_guidelines %<>%
  mutate(percent_over_under = case_when(
    direction == "high" & mean > upper_optimal ~
    ((mean - upper_optimal) / upper_optimal) * 100,
    direction == "high" & mean <= upper_optimal ~ 0,
    direction == "low" & mean >= lower_optimal ~ 0,
    direction == "low" & mean < lower_optimal ~
    ((lower_optimal - mean) / lower_optimal) * -100
  ))

Another option is to create a binary outcome indicating whether optimal consumption was achieved or not.

diet_and_guidelines %<>%
  mutate(opt_achieved = if_else(
    condition = direction == "low" & mean > lower_optimal |
      direction == "high" & mean < upper_optimal,
    true = "Yes",
    false = "No"
  ))

all_age_diet_and_guidelines %<>%
  mutate(opt_achieved = if_else(
    condition = direction == "low" & mean > lower_optimal |
      direction == "high" & mean < upper_optimal,
    true = "Yes",
    false = "No"
  ))

glimpse(diet_and_guidelines)
Rows: 5,880
Columns: 20
$ year_id            <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 201~
$ location_name      <chr> "Global", "Global", "China", "China", "North Korea"~
$ rei_id             <dbl> 117, 117, 117, 117, 117, 117, 117, 117, 117, 117, 1~
$ food               <chr> "processed meat", "processed meat", "processed meat~
$ age_group_name     <chr> "All available ages", "All available ages", "All av~
$ sex                <chr> "Male", "Female", "Male", "Female", "Male", "Female~
$ parameter          <chr> "continuous", "continuous", "continuous", "continuo~
$ mean               <dbl> 4.2865629, 3.2640990, 2.3176975, 1.7512870, 0.56652~
$ upper              <dbl> 4.4633117, 3.3765360, 2.6944978, 2.0454134, 0.65962~
$ lower              <dbl> 4.1309531, 3.1547299, 1.9933744, 1.5161724, 0.48182~
$ unit               <chr> "g/day", "g/day", "g/day", "g/day", "g/day", "g/day~
$ direction          <chr> "high", "high", "high", "high", "high", "high", "hi~
$ lower_optimal      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
$ optimal            <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~
$ upper_optimal      <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, ~
$ unit_optimal       <chr> "g", "g", "g", "g", "g", "g", "g", "g", "g", "g", "~
$ Relative_Percent   <dbl> 214.32814, 163.20495, 115.88487, 87.56435, 28.32615~
$ range_percent      <dbl> 107.164072, 81.602474, 57.942437, 43.782175, 14.163~
$ percent_over_under <dbl> 7.164072, 0.000000, 0.000000, 0.000000, 0.000000, 0~
$ opt_achieved       <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Ye~
glimpse(all_age_diet_and_guidelines)
Rows: 88,200
Columns: 20
$ year_id            <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 201~
$ location_name      <chr> "Global", "Global", "Global", "Global", "Global", "~
$ food               <chr> "calcium", "calcium", "calcium", "calcium", "calciu~
$ age_group_id       <dbl> 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 30, 31,~
$ age_group_name     <chr> "25 to 29", "30 to 34", "35 to 39", "40 to 44", "45~
$ sex                <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Ma~
$ parameter          <chr> "continuous", "continuous", "continuous", "continuo~
$ mean               <dbl> 0.4166193, 0.4253721, 0.4352053, 0.4412156, 0.45657~
$ upper              <dbl> 0.4310974, 0.4403903, 0.4481282, 0.4579218, 0.47633~
$ lower              <dbl> 0.4030347, 0.4110026, 0.4222248, 0.4264522, 0.43817~
$ unit               <chr> "g/day", "g/day", "g/day", "g/day", "g/day", "g/day~
$ direction          <chr> "low", "low", "low", "low", "low", "low", "low", "l~
$ lower_optimal      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
$ optimal            <dbl> 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.2~
$ upper_optimal      <dbl> 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1~
$ unit_optimal       <chr> "g", "g", "g", "g", "g", "g", "g", "g", "g", "g", "~
$ Relative_Percent   <dbl> 33.32954, 34.02977, 34.81642, 35.29725, 36.52636, 3~
$ range_percent      <dbl> 41.66193, 42.53721, 43.52053, 44.12156, 45.65795, 4~
$ percent_over_under <dbl> -58.33807, -57.46279, -56.47947, -55.87844, -54.342~
$ opt_achieved       <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No~

One last thing that can be useful with data wrangling is to reshape the data into what is called the long format. This is very useful for creating visualizations with a powerful and flexible package called ggplot2.

To coerce an object into long format, we create more rows and fewer columns. For more information about this, please see the Data Visualization section of this case study.

We would like to put together the different types of percentages of the optimal intake that we just calculated.

To get our data in long format we can use the pivot_longer() function of the tidyr package. We will list the columns that we want to come together into the longer format using the cols argument. The names_to argument indicates the name of the variable that will include the character information about the values that we are consolidating, i.e., this variable contains the names of the columns that we are bringing together. The values_to is the name of the column that will contain the values of the columns we are consolidating. We can use contains() of the tidyr package to look at the variables with names that contain "percent" .

diet_and_guidelines_long <- diet_and_guidelines %>%
  pivot_longer(
    cols = contains("percent"),
    names_to = "percent_type",
    values_to = "percent"
  )

Click here to see how this would be done with the older version of this function, called gather():

Recall that for pivot_longer(), the cols argument is used. For gather() we would simply list the variables that we wish to consolidate. The names_to and values_to arguments of pivot_longer() are equivalent to the key and value arguments in gather() respectively.

We would get an identical output from the two methods. We can check that with setequal().

diet_and_guidelines_long2 <- diet_and_guidelines %>%
  gather(contains("percent"),
    key = percent_type,
    value = percent
  )

setequal(diet_and_guidelines_long, diet_and_guidelines_long2)
[1] TRUE

Let’s do the same for the age separated data.

all_age_diet_and_guidelines_long <- all_age_diet_and_guidelines %>%
  pivot_longer(
    cols = contains("percent"),
    names_to = "percent_type",
    values_to = "percent"
  )

We now have the main variables and data formats that we need to proceed with the next steps of our analysis, including data exploration and eventually, modeling.

Now we will save our wrangled data. We will save it as an rda file for ourselves and as csv files, as this is often a good option for collaborators. We need a separate csv file for each tibble. We will save these files in a directory called “wrangled” within our “data” directory of our project.

save(all_age_diet_and_guidelines, all_age_diet_and_guidelines_long, diet_and_guidelines, sep_age_diet_data, file = here::here("data", "wrangled", "wrangled_data.rda"))

write_csv(all_age_diet_and_guidelines, file = here::here("data", "wrangled", "all_age_diet_and_guidelines.csv"))
write_csv(all_age_diet_and_guidelines_long, file = here::here("data", "wrangled", "all_age_diet_and_guidelines_long.csv"))
write_csv(diet_and_guidelines, file = here::here("data", "wrangled", "diet_and_guidelines.csv"))
write_csv(sep_age_diet_data, file = here::here("data", "wrangled", "sep_age_diet_data.csv"))

Data Exploration


If you have been following along but stopped you could load the wrangled data like so:

load(here::here("data", "wrangled", "wrangled_data.rda"))

If you skipped the data wrangling section click here.

First you need to install and load the OCSdata package:

install.packages("OCSdata")
library(OCSdata)

Then, you may load the wrangled data using the following code:

wrangled_rda("ocs-bp-diet", outpath = getwd())
load(here::here("OCSdata", "data", "wrangled", "wrangled_data.rda"))

If the package does not work for you, alternatively, an RDA file (stands for R data) of the data can be found here or slightly more directly here. Download this file and then place it in your current working directory within a subdirectory called “wrangled” within a subdirectory called “data” to copy and paste our code. We used an RStudio project and the here package to navigate to the file more easily.

load(here::here("data", "wrangled", "wrangled_data.rda"))

Click here to see more about creating new projects in RStudio.

You can create a project by going to the File menu of RStudio like so:

You can also do so by clicking the project button:

See here to learn more about using RStudio projects and here to learn more about the here package.



Exploring age collapsed data


Let’s start by taking a look at the percent of consumption, across all dietary factors. Again we will use the base R summary() function:

diet_and_guidelines %>%
  select(Relative_Percent) %>%
  summary()
 Relative_Percent  
 Min.   :    0.02  
 1st Qu.:   11.20  
 Median :   34.71  
 Mean   :  303.03  
 3rd Qu.:   73.12  
 Max.   :13724.61  

Wow! Some of the values are nearly zero, suggesting that some people are consuming basically zero percent of what is suggested for optimal health. On the other hand, for some dietary factors people are consuming over 13,000 percent what is suggested!

This is why it is important to look at the direction of consumption that could be harmful. For example if there is a population that consumes large amounts of vegetables this could be a good thing, but if there is a population consuming large amounts of sodium this would be a bad thing.

Let’s take a look to see what dietary factors are at the extremes by arranging the data using the arrange() function of the dplyr package. We can arrange by smallest to largest using the default and we can arrange largest to smallest using the minus sign -.

diet_and_guidelines %>%
  arrange(-Relative_Percent) %>%
  glimpse()
Rows: 5,880
Columns: 20
$ year_id            <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 201~
$ location_name      <chr> "Guatemala", "Honduras", "Nicaragua", "Costa Rica",~
$ rei_id             <dbl> 118, 118, 118, 118, 118, 118, 118, 118, 118, 118, 1~
$ food               <chr> "sugar-sweetened beverages", "sugar-sweetened bever~
$ age_group_name     <chr> "All available ages", "All available ages", "All av~
$ sex                <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Ma~
$ parameter          <chr> "continuous", "continuous", "continuous", "continuo~
$ mean               <dbl> 411.7382, 390.2302, 381.4949, 378.7929, 374.6964, 3~
$ upper              <dbl> 444.4002, 423.6780, 411.2766, 407.2788, 402.3668, 3~
$ lower              <dbl> 381.8858, 364.0112, 353.7633, 353.6314, 348.9515, 3~
$ unit               <chr> "g/day", "g/day", "g/day", "g/day", "g/day", "g/day~
$ direction          <chr> "high", "high", "high", "high", "high", "high", "hi~
$ lower_optimal      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
$ optimal            <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ~
$ upper_optimal      <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ~
$ unit_optimal       <chr> "g", "g", "g", "g", "g", "g", "g", "g", "g", "g", "~
$ Relative_Percent   <dbl> 13724.608, 13007.674, 12716.496, 12626.430, 12489.8~
$ range_percent      <dbl> 8234.765, 7804.605, 7629.897, 7575.858, 7493.927, 7~
$ percent_over_under <dbl> 8134.765, 7704.605, 7529.897, 7475.858, 7393.927, 7~
$ opt_achieved       <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No~

OK, so it looks like sugar-sweetened beverages are really over-consumed in some parts of the world!

Recall from the supplementary table from the article that over-consumption of sugar-sweetened beverages is associated with both Diabetes mellitus type 2 and Ischemic heart disease. This article discusses some of the controversy over the potential health risks associated with high consumption of sugar.

It still looks quite bad if we look at the other calculated percentage values.

diet_and_guidelines %>%
  select(contains("percent")) %>%
  summary()
 Relative_Percent   range_percent      percent_over_under
 Min.   :    0.02   Min.   :   0.025   Min.   : -99.97   
 1st Qu.:   11.20   1st Qu.:  13.263   1st Qu.: -77.60   
 Median :   34.71   Median :  38.941   Median : -42.20   
 Mean   :  303.03   Mean   : 196.373   Mean   : 108.59   
 3rd Qu.:   73.12   3rd Qu.:  72.414   3rd Qu.:   0.00   
 Max.   :13724.61   Max.   :8234.765   Max.   :8134.77   

So some places are still consuming 8,000 percent more than the upper range of the suggested optimal intake.

Let’s take a look at global levels:

diet_and_guidelines %>%
  filter(food == "sugar-sweetened beverages" &
    location_name == "Global")
# A tibble: 2 x 20
  year_id location_name rei_id food   age_group_name sex   parameter  mean upper
    <dbl> <chr>          <dbl> <chr>  <chr>          <chr> <chr>     <dbl> <dbl>
1    2017 Global           118 sugar~ All available~ Male  continuo~ 114.  123. 
2    2017 Global           118 sugar~ All available~ Fema~ continuo~  80.2  86.3
# ... with 11 more variables: lower <dbl>, unit <chr>, direction <chr>,
#   lower_optimal <dbl>, optimal <dbl>, upper_optimal <dbl>,
#   unit_optimal <chr>, Relative_Percent <dbl>, range_percent <dbl>,
#   percent_over_under <dbl>, opt_achieved <chr>

For those who are less familiar with the metric system where grams are equivalent to milliliters, it may be useful to realize how many fluid ounces the max amount of consumption per day (~444g for the upper value for Guatemala) actually is.

There are 0.35247 ounces in one gram.

# top amount in ounces
0.35247 * 444.4002
[1] 156.6377

OK, so the top consumers are drinking about 87 fluid ounces per day. Since there are 12 ounces in a single can of soda, this is about 7.25 sodas per day. Globally on average, males are drinking around 1.924 sodas worth of sweetened beverages, while females are drinking about 1.401.

Let’s take a look at what is under-consumed:

diet_and_guidelines %>%
  arrange(Relative_Percent) %>%
  glimpse()
Rows: 5,880
Columns: 20
$ year_id            <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 201~
$ location_name      <chr> "Chad", "Chad", "Mali", "Mali", "Burkina Faso", "Bu~
$ rei_id             <dbl> 122, 122, 122, 122, 122, 122, 122, 122, 122, 122, 1~
$ food               <chr> "polyunsaturated fatty acids", "polyunsaturated fat~
$ age_group_name     <chr> "All available ages", "All available ages", "All av~
$ sex                <chr> "Female", "Male", "Female", "Male", "Female", "Male~
$ parameter          <chr> "continuous", "continuous", "continuous", "continuo~
$ mean               <dbl> 0.002227074, 0.002295916, 0.002301266, 0.002373331,~
$ upper              <dbl> 0.002383410, 0.002439931, 0.002461209, 0.002520039,~
$ lower              <dbl> 0.002082132, 0.002161922, 0.002152212, 0.002229945,~
$ unit               <chr> "% energy/day", "% energy/day", "% energy/day", "% ~
$ direction          <chr> "low", "low", "low", "low", "low", "low", "low", "l~
$ lower_optimal      <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, ~
$ optimal            <dbl> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,~
$ upper_optimal      <dbl> 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,~
$ unit_optimal       <chr> "%", "%", "%", "%", "%", "%", "%", "%", "%", "%", "~
$ Relative_Percent   <dbl> 0.02024612, 0.02087197, 0.02092060, 0.02157574, 0.0~
$ range_percent      <dbl> 0.02474526, 0.02551018, 0.02556963, 0.02637035, 0.0~
$ percent_over_under <dbl> -99.97525, -99.97449, -99.97443, -99.97363, -99.960~
$ opt_achieved       <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No~

On the other hand, it looks like some places are consuming almost no polyunsaturated fatty acids. These are fats that found in plant-based sources like seeds and nuts. According to an article about polyunsaturated fatty acids and its influence on health:

Coronary heart disease (CHD) is the leading cause of death worldwide … The types of dietary fats consumed play an important role in CHD risk, representing key modifiable risk factors…In particular, higher intakes of trans fat (TFA) and of saturated fat (SFA) replacing ω‐6 (n‐6) polyunsaturated fat (PUFA) are associated with increased CHD… whereas higher intake of PUFA replacing either SFA or carbohydrate is associated with lower risk.

Let’s get an idea about how countries compare in terms of how many of the dietary factors are consumed at the optimal level (the opt_achieved variable).

diet_and_guidelines %>%
  count(opt_achieved)
# A tibble: 2 x 2
  opt_achieved     n
  <chr>        <int>
1 No            4360
2 Yes           1520

Looks like overall, only 34.86% of dietary factors for all tested populations were at optimal levels.

Let’s get an idea about how countries compare on this metric.

diet_and_guidelines %>%
  count(opt_achieved, location_name) %>%
  filter(opt_achieved == "Yes") %>%
  arrange(-n) %>%
  # this allows us to show the full output
  print(n = 1e3)
# A tibble: 196 x 3
    opt_achieved location_name                        n
    <chr>        <chr>                            <int>
  1 Yes          Qatar                               13
  2 Yes          Rwanda                              13
  3 Yes          Turkey                              13
  4 Yes          Dominican Republic                  12
  5 Yes          Egypt                               12
  6 Yes          Malawi                              12
  7 Yes          Benin                               11
  8 Yes          Burkina Faso                        11
  9 Yes          Cameroon                            11
 10 Yes          Cuba                                11
 11 Yes          Ethiopia                            11
 12 Yes          Ghana                               11
 13 Yes          Lesotho                             11
 14 Yes          Mali                                11
 15 Yes          Nigeria                             11
 16 Yes          Senegal                             11
 17 Yes          Togo                                11
 18 Yes          Tunisia                             11
 19 Yes          Angola                              10
 20 Yes          Azerbaijan                          10
 21 Yes          Bahrain                             10
 22 Yes          Barbados                            10
 23 Yes          Bermuda                             10
 24 Yes          Chad                                10
 25 Yes          Comoros                             10
 26 Yes          Cote d'Ivoire                       10
 27 Yes          El Salvador                         10
 28 Yes          Guatemala                           10
 29 Yes          Haiti                               10
 30 Yes          India                               10
 31 Yes          Iran                                10
 32 Yes          Kenya                               10
 33 Yes          Lebanon                             10
 34 Yes          Mauritania                          10
 35 Yes          Mauritius                           10
 36 Yes          Morocco                             10
 37 Yes          Mozambique                          10
 38 Yes          Myanmar                             10
 39 Yes          Nepal                               10
 40 Yes          Nicaragua                           10
 41 Yes          Niger                               10
 42 Yes          North Korea                         10
 43 Yes          Sao Tome and Principe               10
 44 Yes          Sierra Leone                        10
 45 Yes          South Sudan                         10
 46 Yes          Tanzania                            10
 47 Yes          Timor-Leste                         10
 48 Yes          Uganda                              10
 49 Yes          United Arab Emirates                10
 50 Yes          Zambia                              10
 51 Yes          Armenia                              9
 52 Yes          Bosnia and Herzegovina               9
 53 Yes          Burundi                              9
 54 Yes          Costa Rica                           9
 55 Yes          Equatorial Guinea                    9
 56 Yes          Eritrea                              9
 57 Yes          Fiji                                 9
 58 Yes          Honduras                             9
 59 Yes          Iraq                                 9
 60 Yes          Kuwait                               9
 61 Yes          Oman                                 9
 62 Yes          Papua New Guinea                     9
 63 Yes          Solomon Islands                      9
 64 Yes          Tajikistan                           9
 65 Yes          Trinidad and Tobago                  9
 66 Yes          Uzbekistan                           9
 67 Yes          Virgin Islands, U.S.                 9
 68 Yes          Afghanistan                          8
 69 Yes          Albania                              8
 70 Yes          Algeria                              8
 71 Yes          American Samoa                       8
 72 Yes          Antigua and Barbuda                  8
 73 Yes          Bangladesh                           8
 74 Yes          Belize                               8
 75 Yes          Bhutan                               8
 76 Yes          Botswana                             8
 77 Yes          Brazil                               8
 78 Yes          Cambodia                             8
 79 Yes          Cape Verde                           8
 80 Yes          Congo                                8
 81 Yes          Democratic Republic of the Congo     8
 82 Yes          Djibouti                             8
 83 Yes          Dominica                             8
 84 Yes          Federated States of Micronesia       8
 85 Yes          Gabon                                8
 86 Yes          Georgia                              8
 87 Yes          Grenada                              8
 88 Yes          Guam                                 8
 89 Yes          Guinea                               8
 90 Yes          Guinea-Bissau                        8
 91 Yes          Guyana                               8
 92 Yes          Indonesia                            8
 93 Yes          Israel                               8
 94 Yes          Jamaica                              8
 95 Yes          Jordan                               8
 96 Yes          Kazakhstan                           8
 97 Yes          Kiribati                             8
 98 Yes          Kyrgyzstan                           8
 99 Yes          Laos                                 8
100 Yes          Liberia                              8
101 Yes          Libya                                8
102 Yes          Macedonia                            8
103 Yes          Madagascar                           8
104 Yes          Malaysia                             8
105 Yes          Maldives                             8
106 Yes          Marshall Islands                     8
107 Yes          Mexico                               8
108 Yes          Moldova                              8
109 Yes          Namibia                              8
110 Yes          Northern Mariana Islands             8
111 Yes          Pakistan                             8
112 Yes          Palestine                            8
113 Yes          Peru                                 8
114 Yes          Philippines                          8
115 Yes          Saint Lucia                          8
116 Yes          Saint Vincent and the Grenadines     8
117 Yes          Samoa                                8
118 Yes          Saudi Arabia                         8
119 Yes          Seychelles                           8
120 Yes          Somalia                              8
121 Yes          Sri Lanka                            8
122 Yes          Sudan                                8
123 Yes          Suriname                             8
124 Yes          Swaziland                            8
125 Yes          Syria                                8
126 Yes          Taiwan (Province of China)           8
127 Yes          Thailand                             8
128 Yes          The Bahamas                          8
129 Yes          The Gambia                           8
130 Yes          Tonga                                8
131 Yes          Turkmenistan                         8
132 Yes          Vanuatu                              8
133 Yes          Yemen                                8
134 Yes          Zimbabwe                             8
135 Yes          Bolivia                              7
136 Yes          Central African Republic             7
137 Yes          Luxembourg                           7
138 Yes          Montenegro                           7
139 Yes          Panama                               7
140 Yes          Paraguay                             7
141 Yes          Puerto Rico                          7
142 Yes          South Africa                         7
143 Yes          Ukraine                              7
144 Yes          Venezuela                            7
145 Yes          Vietnam                              7
146 Yes          Canada                               6
147 Yes          Colombia                             6
148 Yes          Greece                               6
149 Yes          Japan                                6
150 Yes          Malta                                6
151 Yes          Mongolia                             6
152 Yes          Portugal                             6
153 Yes          Romania                              6
154 Yes          Serbia                               6
155 Yes          Argentina                            5
156 Yes          Belarus                              5
157 Yes          Bulgaria                             5
158 Yes          Croatia                              5
159 Yes          Ecuador                              5
160 Yes          Finland                              5
161 Yes          Global                               5
162 Yes          Iceland                              5
163 Yes          Netherlands                          5
164 Yes          Russian Federation                   5
165 Yes          Sweden                               5
166 Yes          Andorra                              4
167 Yes          Australia                            4
168 Yes          Austria                              4
169 Yes          Belgium                              4
170 Yes          Brunei                               4
171 Yes          Chile                                4
172 Yes          China                                4
173 Yes          Cyprus                               4
174 Yes          Denmark                              4
175 Yes          Estonia                              4
176 Yes          France                               4
177 Yes          Germany                              4
178 Yes          Ireland                              4
179 Yes          Italy                                4
180 Yes          Latvia                               4
181 Yes          Lithuania                            4
182 Yes          New Zealand                          4
183 Yes          Norway                               4
184 Yes          Poland                               4
185 Yes          Singapore                            4
186 Yes          South Korea                          4
187 Yes          Spain                                4
188 Yes          Switzerland                          4
189 Yes          United Kingdom                       4
190 Yes          Uruguay                              4
191 Yes          Czech Republic                       3
192 Yes          Greenland                            3
193 Yes          Hungary                              3
194 Yes          Slovakia                             3
195 Yes          Slovenia                             3
196 Yes          United States                        3

It looks as though on average the populations (both male and female separately) in Qatar, Rwanda, and Turkey consumed the optimal level of intake for the largest number of dietary factors (13 out of 30 (for the 15 dietary factors for males and females)).

In contrast, the Czech Republic, Greenland, Hungary, Slovakia, Slovenia, and the United States had the poorest consumption rates (27 out of 30 were not at optimal levels).

diet_and_guidelines %>%
  count(opt_achieved, location_name) %>%
  filter(opt_achieved == "No") %>%
  arrange(-n) %>%
  # to show full output
  print(n = 1e3)
# A tibble: 196 x 3
    opt_achieved location_name                        n
    <chr>        <chr>                            <int>
  1 No           Czech Republic                      27
  2 No           Greenland                           27
  3 No           Hungary                             27
  4 No           Slovakia                            27
  5 No           Slovenia                            27
  6 No           United States                       27
  7 No           Andorra                             26
  8 No           Australia                           26
  9 No           Austria                             26
 10 No           Belgium                             26
 11 No           Brunei                              26
 12 No           Chile                               26
 13 No           China                               26
 14 No           Cyprus                              26
 15 No           Denmark                             26
 16 No           Estonia                             26
 17 No           France                              26
 18 No           Germany                             26
 19 No           Ireland                             26
 20 No           Italy                               26
 21 No           Latvia                              26
 22 No           Lithuania                           26
 23 No           New Zealand                         26
 24 No           Norway                              26
 25 No           Poland                              26
 26 No           Singapore                           26
 27 No           South Korea                         26
 28 No           Spain                               26
 29 No           Switzerland                         26
 30 No           United Kingdom                      26
 31 No           Uruguay                             26
 32 No           Argentina                           25
 33 No           Belarus                             25
 34 No           Bulgaria                            25
 35 No           Croatia                             25
 36 No           Ecuador                             25
 37 No           Finland                             25
 38 No           Global                              25
 39 No           Iceland                             25
 40 No           Netherlands                         25
 41 No           Russian Federation                  25
 42 No           Sweden                              25
 43 No           Canada                              24
 44 No           Colombia                            24
 45 No           Greece                              24
 46 No           Japan                               24
 47 No           Malta                               24
 48 No           Mongolia                            24
 49 No           Portugal                            24
 50 No           Romania                             24
 51 No           Serbia                              24
 52 No           Bolivia                             23
 53 No           Central African Republic            23
 54 No           Luxembourg                          23
 55 No           Montenegro                          23
 56 No           Panama                              23
 57 No           Paraguay                            23
 58 No           Puerto Rico                         23
 59 No           South Africa                        23
 60 No           Ukraine                             23
 61 No           Venezuela                           23
 62 No           Vietnam                             23
 63 No           Afghanistan                         22
 64 No           Albania                             22
 65 No           Algeria                             22
 66 No           American Samoa                      22
 67 No           Antigua and Barbuda                 22
 68 No           Bangladesh                          22
 69 No           Belize                              22
 70 No           Bhutan                              22
 71 No           Botswana                            22
 72 No           Brazil                              22
 73 No           Cambodia                            22
 74 No           Cape Verde                          22
 75 No           Congo                               22
 76 No           Democratic Republic of the Congo    22
 77 No           Djibouti                            22
 78 No           Dominica                            22
 79 No           Federated States of Micronesia      22
 80 No           Gabon                               22
 81 No           Georgia                             22
 82 No           Grenada                             22
 83 No           Guam                                22
 84 No           Guinea                              22
 85 No           Guinea-Bissau                       22
 86 No           Guyana                              22
 87 No           Indonesia                           22
 88 No           Israel                              22
 89 No           Jamaica                             22
 90 No           Jordan                              22
 91 No           Kazakhstan                          22
 92 No           Kiribati                            22
 93 No           Kyrgyzstan                          22
 94 No           Laos                                22
 95 No           Liberia                             22
 96 No           Libya                               22
 97 No           Macedonia                           22
 98 No           Madagascar                          22
 99 No           Malaysia                            22
100 No           Maldives                            22
101 No           Marshall Islands                    22
102 No           Mexico                              22
103 No           Moldova                             22
104 No           Namibia                             22
105 No           Northern Mariana Islands            22
106 No           Pakistan                            22
107 No           Palestine                           22
108 No           Peru                                22
109 No           Philippines                         22
110 No           Saint Lucia                         22
111 No           Saint Vincent and the Grenadines    22
112 No           Samoa                               22
113 No           Saudi Arabia                        22
114 No           Seychelles                          22
115 No           Somalia                             22
116 No           Sri Lanka                           22
117 No           Sudan                               22
118 No           Suriname                            22
119 No           Swaziland                           22
120 No           Syria                               22
121 No           Taiwan (Province of China)          22
122 No           Thailand                            22
123 No           The Bahamas                         22
124 No           The Gambia                          22
125 No           Tonga                               22
126 No           Turkmenistan                        22
127 No           Vanuatu                             22
128 No           Yemen                               22
129 No           Zimbabwe                            22
130 No           Armenia                             21
131 No           Bosnia and Herzegovina              21
132 No           Burundi                             21
133 No           Costa Rica                          21
134 No           Equatorial Guinea                   21
135 No           Eritrea                             21
136 No           Fiji                                21
137 No           Honduras                            21
138 No           Iraq                                21
139 No           Kuwait                              21
140 No           Oman                                21
141 No           Papua New Guinea                    21
142 No           Solomon Islands                     21
143 No           Tajikistan                          21
144 No           Trinidad and Tobago                 21
145 No           Uzbekistan                          21
146 No           Virgin Islands, U.S.                21
147 No           Angola                              20
148 No           Azerbaijan                          20
149 No           Bahrain                             20
150 No           Barbados                            20
151 No           Bermuda                             20
152 No           Chad                                20
153 No           Comoros                             20
154 No           Cote d'Ivoire                       20
155 No           El Salvador                         20
156 No           Guatemala                           20
157 No           Haiti                               20
158 No           India                               20
159 No           Iran                                20
160 No           Kenya                               20
161 No           Lebanon                             20
162 No           Mauritania                          20
163 No           Mauritius                           20
164 No           Morocco                             20
165 No           Mozambique                          20
166 No           Myanmar                             20
167 No           Nepal                               20
168 No           Nicaragua                           20
169 No           Niger                               20
170 No           North Korea                         20
171 No           Sao Tome and Principe               20
172 No           Sierra Leone                        20
173 No           South Sudan                         20
174 No           Tanzania                            20
175 No           Timor-Leste                         20
176 No           Uganda                              20
177 No           United Arab Emirates                20
178 No           Zambia                              20
179 No           Benin                               19
180 No           Burkina Faso                        19
181 No           Cameroon                            19
182 No           Cuba                                19
183 No           Ethiopia                            19
184 No           Ghana                               19
185 No           Lesotho                             19
186 No           Mali                                19
187 No           Nigeria                             19
188 No           Senegal                             19
189 No           Togo                                19
190 No           Tunisia                             19
191 No           Dominican Republic                  18
192 No           Egypt                               18
193 No           Malawi                              18
194 No           Qatar                               17
195 No           Rwanda                              17
196 No           Turkey                              17

Let’s look at the raw US data:

diet_and_guidelines %>%
  filter(location_name == "United States") %>%
  glimpse()
Rows: 30
Columns: 20
$ year_id            <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 201~
$ location_name      <chr> "United States", "United States", "United States", ~
$ rei_id             <dbl> 117, 117, 116, 116, 124, 124, 118, 118, 123, 123, 1~
$ food               <chr> "processed meat", "processed meat", "red meat", "re~
$ age_group_name     <chr> "All available ages", "All available ages", "All av~
$ sex                <chr> "Male", "Female", "Male", "Female", "Male", "Female~
$ parameter          <chr> "continuous", "continuous", "continuous", "continuo~
$ mean               <dbl> 25.31934316, 16.90044942, 62.30037501, 40.30456938,~
$ upper              <dbl> 27.99854665, 18.57300631, 66.61324291, 43.41968400,~
$ lower              <dbl> 22.85404285, 15.32263709, 58.39239430, 37.25710437,~
$ unit               <chr> "g/day", "g/day", "g/day", "g/day", "g/day", "g/day~
$ direction          <chr> "high", "high", "high", "high", "high", "high", "hi~
$ lower_optimal      <dbl> 0.0, 0.0, 18.0, 18.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0,~
$ optimal            <dbl> 2.00, 2.00, 23.00, 23.00, 3.00, 3.00, 3.00, 3.00, 0~
$ upper_optimal      <dbl> 4.0, 4.0, 27.0, 27.0, 5.0, 5.0, 5.0, 5.0, 1.0, 1.0,~
$ unit_optimal       <chr> "g", "g", "g", "g", "g", "g", "g", "g", "%", "%", "~
$ Relative_Percent   <dbl> 1265.967158, 845.022471, 270.871196, 175.237258, 17~
$ range_percent      <dbl> 632.983579, 422.511236, 230.742130, 149.276183, 104~
$ percent_over_under <dbl> 532.983579, 322.511236, 130.742130, 49.276183, 4.04~
$ opt_achieved       <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Y~

Let’s see how males and females compare for achieving the optimal intake, across all countries:

count(diet_and_guidelines, sex, opt_achieved)
# A tibble: 4 x 3
  sex    opt_achieved     n
  <chr>  <chr>        <int>
1 Female No            2171
2 Female Yes            769
3 Male   No            2189
4 Male   Yes            751

Looks pretty similar, but it may be a bit better for females. We will evaluate this further below.

In order to assess what we have observed so far in a graphical way, we will make some data visualizations. One way we can do this is with the ggplot2 package. The ggplot2 package creates plots by building the plot components piece by piece, using "layers".

With ggplot2 we select what data we would like to plot using the first function (ggplot()) and then we add on additional layers of complexity (these layers can even involve different data). The aes() argument specifies what aspects of the data will be plotted where. The geom_* function specifies what type of plot to create (e.g. geom_histogram() creates a histogram). Notice in the following code how there is a plus sign between the ggplot() function and the geom_bar() function; this is how we combine different plot layers.

We will see later how we can add many layers to plots with ggplot2. For additional information on using ggplot2, see this case study.

diet_and_guidelines %>%
  ggplot(aes(opt_achieved, fill = sex)) +
  geom_bar(position = "dodge")

Continuing with ggplot2 we will now create a different type of plot. This time we will create a series of box plots. We will use the facet_wrap() function of ggplot2 to allow us to create many different plots simultaneously. In this case we can look at box plots for the different dietary factors colored by sex. The scales argument when set to "free" means that each of the sequential plot created by the facet can have a different scale for the y axis, otherwise, by default they are constrained to the same scale. Since our dietary factors are measured on very different scales, we do not want this constraint here.

# we will create a new variable with food names with new lines
# str_replace() is used here because we are only replacing the first instance of space
# otherwise str_replace_all() should be used
diet_and_guidelines %<>%
  mutate(
    food_to_plot =
      str_replace(
        string = pull(diet_and_guidelines, food),
        pattern = " ",
        replacement = "\n"
      )
  )

diet_and_guidelines %>%
  ggplot(aes(
    y = Relative_Percent,
    x = sex,
    color = sex
  )) +
  geom_boxplot() +
  facet_wrap(~food_to_plot,
    scales = "free",
    # specifies the number of rows of subplots
    nrow = 3,
    # moves the food label to the right
    strip.position = "right"
  ) +
  # this changes the size of the font for the labels
  theme(
    strip.text.y = element_text(size = 8),
    axis.text.x = element_text(
      angle = 70,
      hjust = 1
    )
  )

If we just look at differences by sex for the specific dietary factors, males appear to potentially consume more of many of the factors, including possibly more sodium, fiber, calcium, red meat, and sugar-sweetened beverages than females. Females may consume more fruit.

Exploring the data separated by age


Now we will take a look at the data that is separated by age groups.

First, recall that we have 15 different age groups starting from age 25 to 95 plus.

all_age_diet_and_guidelines %>%
  count(age_group_name)
# A tibble: 15 x 2
   age_group_name     n
   <chr>          <int>
 1 25 to 29        5880
 2 30 to 34        5880
 3 35 to 39        5880
 4 40 to 44        5880
 5 45 to 49        5880
 6 50 to 54        5880
 7 55 to 59        5880
 8 60 to 64        5880
 9 65 to 69        5880
10 70 to 74        5880
11 75 to 79        5880
12 80 to 84        5880
13 85 to 89        5880
14 90 to 94        5880
15 95 plus         5880
sep_age_diet_data %>%
  ggplot(aes(y = mean, x = age_group_name, col = sex)) +
  geom_boxplot() +
  facet_wrap(~food, scales = "free", nrow = 6) +
  theme(
    axis.text.x = element_text(angle = 70, hjust = 1),
    strip.text.x = element_text(size = 8)
  )

We can see from these plots that there appear to be age differences and gender differences for some of the different dietary factors. We will work to create clearer figures later on. However these initial figures have given us a better sense of the data that we are working with.

Data Analysis


If you have been following along but stopped you could load the wrangled data like so:

load(here::here("data", "wrangled", "wrangled_data.rda"))

If you skipped the data wrangling section click here.

First you need to install and load the OCSdata package:

install.packages("OCSdata")
library(OCSdata)

Then, you may load the wrangled data using the following code:

wrangled_rda("ocs-bp-diet", outpath = getwd())
load(here::here("OCSdata", "data", "wrangled", "wrangled_data.rda"))

If the package does not work for you, alternatively, an RDA file (stands for R data) of the data can be found here or slightly more directly here. Download this file and then place it in your current working directory within a subdirectory called “wrangled” within a subdirectory called “data” to copy and paste our code. We used an RStudio project and the here package to navigate to the file more easily.

load(here::here("data", "wrangled", "wrangled_data.rda"))

Click here to see more about creating new projects in RStudio.

You can create a project by going to the File menu of RStudio like so:

You can also do so by clicking the project button:

See here to learn more about using RStudio projects and here to learn more about the here package.



Recall what our main questions were:

Our main questions are:

  1. What are the global trends for potentially harmful diets?
  2. How do males and females compare?
  3. How do different age groups compare for these dietary factors?
  4. How do different countries compare? In particular, how does the US compare to other countries in terms of diet trends?

We have some general sense about global trends for the risk-associated dietary factors, however we want to know more.

We are interested in how much the genders differ, how much the 15 different age groups differ, and how the 195 countries compare.

In order to make inferences about these comparisons, it is helpful to perform statistical tests. These tests can help us to determine the strength of the association between the consumption of the dietary factors (our outcome variable) and sex, age group, and country identity (our predictor variables). One way to look at the strength of association between variables is to use a statistical method called regression.

If we measure consumption using either raw consumption or the percent of optimal consumption, then our outcome variable is what we call continuous, because our values can take on any numeric value within the range of possible values. To look at the strength of association with a continuous outcome, we can use linear regression.

If, instead, we measure consumption by whether or not the optimal level of consumption was achieved (“yes” or “no”), then our outcome would be considered binary, meaning it can take only two possible values. To look at the strength of association with a binary outcome, we can use logistic regression. There are other regression method for different types of outcomes as well; see here for a guide on different types of regression methods.

In this case study, we will focus on the outcome of the percent of optimal consumption (Relative_Percent), so we will focus our analysis on linear regression.

You may have already learned that one can compare a continuous outcome between two groups using a \(t\)-test. For more information on the \(t\)-test see this case study. And perhaps you have heard about ANOVA (ANalysis Of VAriance) for comparing a continuous outcome across more than two groups. It turns out that both the \(t\)-test and ANOVA are specialized types of linear regression. We will use each of these tests to investigate patterns of consumption for dietary factors that contribute to health risk and we will look at how we can obtain equivalent results with regression.

Linear Regression


So what is linear regression? How can we use regression to compare our groups of interest and look at the relationship between group identity and consumption of dietary factors associated with health risk?

The statistical version of the term regression was coined in 1877 in this article about the relationship between hereditary traits and population averages. The author particularly focused on height and kinship or relatedness. The word itself means "to go back to a simpler state". It was noticed that individuals with parents who had an extreme trait, such as exceptional height, tended to have a height more similar to the average of the population than the extreme height of their parents. For example if parents were very tall, their children were likely to be a bit shorter than their parents and therefore closer to the population average. Thus the children regressed towards the mean or in the author’s words the offspring showed:

“a regression towards mediocrity”

See here for more information about this history.

When we think about this from a statistical standpoint, regression allows us to estimate or regress relationships between variables with a “simple” model. We do this by estimating the mean of an outcome, given a value of an input or predictor variable. This can be useful for predicting future values of the outcome based on the approximation of the real relationship between the variables within the model, or just for understanding how different variables are related to one another.

We will start by considering simple linear regression, where we have one continuous predictor variable and one continuous outcome variable, as shown below:

We want to identify a “best fit” line that summarizes the relationship between these two variables. We can so this using the ordinary least squares method, which chooses the line that best fits the data by minimizing the sum of the squared vertical distances between each point and the line. In the above example, this line turns out to be:

Fitting a line to the data like this allows us to create a formula for the line using an intercept and a slope, so that we can then estimate mean values of \(Y\) (dependent/outcome variable) given known values of \(X\) (independent/predictor/covariate/explanatory variable(s)). People will also say that we are “regressing \(Y\) on \(X\)”.

You may have seen the formula for a line written like this:

\[Y = mX + b\]

or

\[Y = aX + b\]

In this case \(m\) or \(a\) is the slope of the line and \(b\) is a constant and represents the y-intercept or the point where the y axis is crossed by the line, when \(x = 0\).

In regression, we usually write this model like this:

\[Y = \beta_{1}X +\beta_{0}\]

Now \(\beta_{1}\), called “beta one”, is our slope and \(\beta_{0}\), called “beta zero” (or “beta naught”), is our intercept. In our example above, the slope of the regression line is \(\beta_{1} = 2.3\) and the intercept is \(\beta_{0} = -6.6\).

Importantly the slope (\(\beta_{1}\)) gives us a quantitative measure of the relationship between the independent variable (\(X\)) on the dependent variable (\(Y\)). In particular, \(\beta_{1}\) tells how the expected difference in the \(Y\) value for a difference of 1 unit in the \(X\) value.

It’s possible that the regression line will perfectly fit the data, and all points will lie on the line with no distance to the line:

In this case, the slope or \(\beta_{1}\) is 1 and the intercept \(\beta_{0}\) is 10 and every observed data point lies exactly on the line, e.g., we can see that when \(X\) is 50, \(Y\) is exactly 60. This is very unusual in statistical analysis however, as often the relationship between variables is more complicated and there is more noise in our data. In these other cases there will be greater distances between the line and the points.

Like this regression:

In this case, because there is some vertical distance between the line and the data points, there is a bit of what is called “error” in the model. The formula for the relationship between \(X\) and \(Y\) does not perfectly describe the data. The vertical distance between the line and each data point is what we call a residual. Our least squares method finds the line with the minimized value of the sum of the squared residual values.

Check out this interactive explanation of how the ordinary least squares method works.

Here is an image of what we are saying about the ordinary least squares regression to fit a line to data:
[source]

This basic concept of simple linear regression an be extended to allow for more than one covariate (the independent variables, or x’s); this is called multivariable linear regression. With more than one independent variable, we can’t visualize these relationships easily with a line on a two-dimensional page, but the mathematical concept remains in some sense the same.

R has it’s own way of representing the regression equation in code. For a guide on how to perform regressions in R see here.

In R we indicate a linear model like this:

y ~ x

Here our response/outcome variable is on the left of the ~ while our covariates/explanatory variables are on the right of the ~.

Before we get started, let’s remove the global values from our data and set them aside, as this is really a composite of all the country values.

global <- diet_and_guidelines %>%
  filter(location_name == "Global")
diet_and_guidelines %<>%
  filter(location_name != "Global")
all_age_diet_and_guidelines %<>%
  filter(location_name != "Global")

\(t\)-test and linear regression


Since we will be covering a lot of different statistical concepts here, we will want to focus are analysis on a single dietary factor. Let’s choose one of the dietary factors that appeared to potentially have a difference between genders based on our figure in our exploratory analysis.

“If we just look at differences by sex for the specific dietary factors, males appear to potentially consume more of many of the factors, including possibly more sodium, fiber, calcium, red meat, and sugar-sweetened beverages than females. Females may consume more fruit.”

Let’s take a look at red meat.

We can compare the relative percent of red meat consumption of males and females around the world using the well known \(t\)-test using the t.test() function and a linear regression model using the lm() function (both are included in stats package that is installed with R by default) and we will get the same results. See here for additional explanation about why that is the case. Here and here are also great sources about how many commonly known statistical tests are specialized forms of regression.

Before we get started, let’s think about the assumptions of both an independent samples \(t\)-test and linear regression.

Independent samples \(t\)-test assumptions:

  1. Normality of the outcome in each group (this is not as much of an issue if the number of observations is relatively large, i.e., total n > 30 - which is indeed the case for us!)
  2. Equal variance between the two groups
  3. Independent observations

Linear regression assumptions:

  1. L (linear) - There is a linear relationship between the outcome variable and each covariate.

  2. I (independent) - The outcome for individual observations are independent from one another, given the covariates in the model.

  3. N (normal) - The residuals (errors) are normally distributed. Note that the variables themselves do not need to be normally distributed.

  4. E (equal variances) - The variance of the residuals is constant across covariate groups. This is called homoscedasticity. In other words the residuals are of similar size along the regression line.

It’s also important that if there are multiple predictor variables, that these are not too highly correlated.

See here for additional information about the assumptions of linear regressions.

Notice that many of the assumptions between \(t\)-tests and linear regression are similar – each has an assumption of normality, equal variance, and independence!

Assessing normality

First we will explore the shape of the distribution of these relative percent of red meat consumption. We can do this by looking at a frequency distribution of the Relative_Percent variable for red meat consumption. We will use the geom_histogram() of the ggplot2package to create a histogram to evaluate the frequency distributions of our data. The facet_wrap() function of the ggplot2 package allows us to look at different parts of our data in separate plots. Here we can compare the distribution for males and females.

diet_and_guidelines %>%
  filter(food == "red meat") %>%
  ggplot(aes(x = Relative_Percent)) +
  geom_histogram() +
  facet_wrap(~sex)

This Relative_Percent variable appears to have a right skew for both male and female individuals. We can also see this by looking at normal Quantile-Quantile (Q-Q) plots of this variable. Remember that in a Q-Q plot, points away from the line indicate one of the distributions is more skewed than the other. In this case, we see that the values in are sample are skewed relative to the theoretical normal distribution. Here is a great reference for interpreting Q-Q plots.

diet_and_guidelines %>%
  filter(food == "red meat") %>%
  ggplot(aes(sample = Relative_Percent)) +
  facet_wrap(~sex) +
  geom_qq() +
  geom_qq_line()

We can consider transforming our data to make it more normally distributed. When data is highly right skewed, a log transformation is often helpful.

Let’s take a look a the log (with base 10) of our Relative_Percent variable:

diet_and_guidelines %>%
  filter(food == "red meat") %>%
  ggplot(aes(x = log10(Relative_Percent))) +
  geom_histogram() +
  facet_wrap(~sex)

diet_and_guidelines %>%
  filter(food == "red meat") %>%
  ggplot(aes(sample = log10(Relative_Percent))) +
  facet_wrap(~sex) +
  geom_qq() +
  geom_qq_line()

OK, so now our histograms look fairly normal. It isn’t perfect, but we have a large number of samples, so this is good for our \(t\)-test assumptions.

Assessing equal variances

The next thing we need to check is if the variance in red meat consumption is similar between the two gender groups. We can use the var.test() of the stats package using the log-normalized data, as this data is fairly normally distributed.

Because we are piping in our data to this test function, we need to indicate that this is the data we intend to use by using . for the data argument. This is a handy tip when piping into a function outside of the tidyverse where the first argument isn’t a data set.

diet_and_guidelines %>%
  filter(food == "red meat") %>%
  var.test(log10(Relative_Percent) ~ sex, data = .)

    F test to compare two variances

data:  log10(Relative_Percent) by sex
F = 0.96369, num df = 194, denom df = 194, p-value = 0.797
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.7266925 1.2779719
sample estimates:
ratio of variances 
          0.963687 

The p value > .05 for this test, thus we can conclude that there is not enough evidence to reject the null hypothesis that there is no difference in the variance of the distributions, so we conclude that variance is roughly equal.

Comparing a \(t\)-test to linear regression

Now let’s compare the consumption of red meat across genders using both a \(t\)-test and a linear regression. First our independent samples \(t\)-test:

diet_and_guidelines %>%
  filter(food == "red meat") %>%
  t.test(log10(Relative_Percent) ~ sex, data = ., var.equal = TRUE)

    Two Sample t-test

data:  log10(Relative_Percent) by sex
t = -5.3187, df = 388, p-value = 1.77e-07
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
 -0.2525474 -0.1162266
sample estimates:
mean in group Female   mean in group Male 
            1.798872             1.983259 

Notice here that sample means for the two groups are 1.80 and 1.98 for males and females, respectively. So that means the difference in sample means is 1.80 - 1.98 = -0.18. We also see a test statistic of \(t\) = -5.32 and a very small \(p\)-value.

Let’s examine the same relationship using linear regression:

diet_and_guidelines %>%
  filter(food == "red meat") %>%
  lm(log10(Relative_Percent) ~ sex, data = .) %>%
  summary()

Call:
lm(formula = log10(Relative_Percent) ~ sex, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.0238 -0.2484  0.0052  0.3127  0.6170 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.79887    0.02451  73.382  < 2e-16 ***
sexMale      0.18439    0.03467   5.319 1.77e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3423 on 388 degrees of freedom
Multiple R-squared:  0.06795,   Adjusted R-squared:  0.06555 
F-statistic: 28.29 on 1 and 388 DF,  p-value: 1.77e-07

Look at the results for the slope of the regression line, indicated by the sexMale row in the output above. Notice how the \(t\)-value and the \(p\)-value match our \(t\)-test! (Well, the signs are switched in each case – the \(t\) value is negative in the t.test() output because the male group is being used as reference group, while the female group is being used as the reference group in lm()). We can fix this using the fct_inorder() function of the forcats package which is all about factors. This function allows us to order the factor by what appears first. In this case “male” appears first, so now our output will match that of the lm() function.

diet_and_guidelines %<>%
  mutate_at(vars(sex), factor)

diet_and_guidelines %>%
  filter(food == "red meat") %>%
  lm(log10(Relative_Percent) ~ sex, data = .) %>%
  summary()

Call:
lm(formula = log10(Relative_Percent) ~ sex, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.0238 -0.2484  0.0052  0.3127  0.6170 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.79887    0.02451  73.382  < 2e-16 ***
sexMale      0.18439    0.03467   5.319 1.77e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3423 on 388 degrees of freedom
Multiple R-squared:  0.06795,   Adjusted R-squared:  0.06555 
F-statistic: 28.29 on 1 and 388 DF,  p-value: 1.77e-07
diet_and_guidelines %>%
  mutate_at(vars(sex), forcats::fct_inorder) %>%
  filter(food == "red meat") %>%
  t.test(log10(Relative_Percent) ~ sex,
    data = .,
    var.equal = TRUE
  )

    Two Sample t-test

data:  log10(Relative_Percent) by sex
t = 5.3187, df = 388, p-value = 1.77e-07
alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
95 percent confidence interval:
 0.1162266 0.2525474
sample estimates:
  mean in group Male mean in group Female 
            1.983259             1.798872 

Now they match. Notice that the degrees of freedom also match, both results show 388 degrees of freedom. We are estimating two parameters for the linear model the two \(\beta\) coefficients, (the slope and intercept), and for the \(t\)-test we are estimating the means of two groups (males and females). Overall we have two samples (male and female) for each of the 195 countries.

Thus, the overall sample number is: \(n = 195*2 = 390\)

\[df = n - # parameters estimating\] Thus the degrees of freedom can be calculated as: \(df = 390 -2 = 388\)

Let’s look more closely at the linear regression output from lm(). Our estimated intercept (\(\beta_{0}\)) is 1.80, which can be interpreted as the mean value when sex is not male (so in this case when sex is female). This matches the sample mean of the female group in the t.test() output.

Our estimated slope (\(\beta_{1}\)) is 0.18, which can be interpreted as the slope of the regression line or the mean change in \(Y\) associated with one-unit increase in \(X\). Since our \(X\) variable is sex, a one-unit change means moving from one group to another. So we can think of the slope as the difference between the means of the two groups, male (\(X\)=1) minus female (\(X\) = 0). If we calculate this difference in means as calculated in the t.test() output, we get the value of \(\beta_{1}\) (the slope or the sexMale estimate) of the lm() output!

Mean of males - Mean of females \(1.983259 - 1.798872 =0.184387\)

Cool! For more information about the output of lm() see here.

After fitting our linear regression model, we can use the base plot() function to get information about our model residuals to help us assess whether any of the assumptions of linear regression are violated. Here we choose to view the first three of these plots with which = 1:3.

diet_and_guidelines %>%
  filter(food == "red meat") %>%
  lm(log10(Relative_Percent) ~ sex, data = .) %>%
  plot(which = 1:3)

The second plot shows us that our residuals are slightly negatively (or left) skewed. We can see also see the spread of the residuals is similar between males and females, as the first and third plot show similar spreads of values in the two lines. This suggests that the assumption of homoscedasticity is met. Here is what these plots would look like if the variance were not the same between the groups:

diet_and_guidelines %>%
  filter(food == "red meat") %>%
  mutate(example_perc = case_when(
    # this will artifically make our female data have different
    # variance from the male data
    sex == "Female" ~ log(Relative_Percent),
    sex == "Male" ~ Relative_Percent
  )) %>%
  lm(log10(example_perc) ~ sex, data = .) %>%
  plot(which = 1:3)

In this case the spread of the points is clearly less for one group compared to the other. If we saw plots like these, we would be concerned the assumption of homoscedasticity was violated.

Assessing independence

We never considered the assumption of independent required by both a \(t\)-test and linear regression. Do we truly have independent samples in this case? No! Since we have female and male values from the same countries, our data is really what we would call “paired”. The male and female diet values from the same country are most likely related to each another because of cultural effects on diet. This means the assumption of independence for the independent samples \(t\)-test is violated, as is the independence assumption for linear regression.

We can address this by doing a paired \(t\)-test instead of an independent \(t\)-test and by accounting for country in our linear model by adding it to our model as what we call a fixed effect.

Paired \(t\)-test and linear model with fixed effects


Now we will perform the paired versions of our analysis. This is very easy to do with the t.test() function, by simply using the paired argument and setting it equal to TRUE.

However, our data needs to be in a slightly different form to do the paired test, since we have to tell R which values need to be paired together. Instead of one long dataset with different rows for males and females, we will need separate columns for the male and female values. So we need to make our dataset wider. We can do that using the pivot_wider() function of the tidyr package. To use this function we specify the values that we want to separate into more variables using the values_from argument and we use the names_from argument to specify how we want to separate these other variables. In this case we will make a male and female version of all the other variables specified.

wide_diet <- diet_and_guidelines %>%
  pivot_wider(
    values_from = c(
      contains("percent"),
      mean,
      upper,
      lower,
      opt_achieved
    ),
    names_from = sex
  )

glimpse(wide_diet)
Rows: 2,925
Columns: 26
$ year_id                   <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 20~
$ location_name             <chr> "China", "North Korea", "Taiwan (Province of~
$ rei_id                    <dbl> 117, 117, 117, 117, 117, 117, 117, 117, 117,~
$ food                      <chr> "processed meat", "processed meat", "process~
$ age_group_name            <chr> "All available ages", "All available ages", ~
$ parameter                 <chr> "continuous", "continuous", "continuous", "c~
$ unit                      <chr> "g/day", "g/day", "g/day", "g/day", "g/day",~
$ direction                 <chr> "high", "high", "high", "high", "high", "hig~
$ lower_optimal             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
$ optimal                   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,~
$ upper_optimal             <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,~
$ unit_optimal              <chr> "g", "g", "g", "g", "g", "g", "g", "g", "g",~
$ Relative_Percent_Male     <dbl> 115.88487, 28.32615, 119.29284, 23.54152, 20~
$ Relative_Percent_Female   <dbl> 87.56435, 20.67010, 90.93536, 17.67477, 15.9~
$ range_percent_Male        <dbl> 57.942437, 14.163074, 59.646422, 11.770760, ~
$ range_percent_Female      <dbl> 43.782175, 10.335051, 45.467681, 8.837384, 7~
$ percent_over_under_Male   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
$ percent_over_under_Female <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
$ mean_Male                 <dbl> 2.3176975, 0.5665229, 2.3858569, 0.4708304, ~
$ mean_Female               <dbl> 1.7512870, 0.4134020, 1.8187072, 0.3534954, ~
$ upper_Male                <dbl> 2.6944978, 0.6596296, 2.7830329, 0.5541629, ~
$ upper_Female              <dbl> 2.0454134, 0.4730273, 2.1087411, 0.4132150, ~
$ lower_Male                <dbl> 1.9933744, 0.4818201, 2.0572123, 0.3964639, ~
$ lower_Female              <dbl> 1.5161724, 0.3573467, 1.5705803, 0.3017578, ~
$ opt_achieved_Male         <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Y~
$ opt_achieved_Female       <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Y~

You can see we now have a Relative_Percent_Male variable and a Relative_Percent_Female variable. We can use these two variables in our paired \(t\)-test. Since the paired version of the \(t\)-test doesn’t take a data= argument, we will pull the appropriate variables from our data a little bit differently, using the pull() function.

t.test(log10(pull(
  filter(wide_diet, food == "red meat"),
  Relative_Percent_Male
)),
log10(pull(
  filter(wide_diet, food == "red meat"),
  Relative_Percent_Female
)),
var.equal = TRUE, paired = TRUE
)

    Paired t-test

data:  log10(pull(filter(wide_diet, food == "red meat"), Relative_Percent_Male)) and log10(pull(filter(wide_diet, food == "red meat"), Relative_Percent_Female))
t = 188.16, df = 194, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.1824542 0.1863197
sample estimates:
mean of the differences 
               0.184387 

Here an estimated mean difference (Males - Females) of 0.18, and that this is considered significantly different than 0 due to a very small \(p\)-value. You can also see that now our degrees of freedom are 194, which makes sense because with paired samples we are only estimating one parameter (the mean difference) based on data on 195 differences for each country. So \(df = n - \# \ parameters = 195 -1 = 194\).

The paired version of the linear model is a bit more complex. In this case we will add another term in our model to evaluate the influence of sex on Relative_Percent consumption while keeping the country identity fixed or constant, or in other words controlling/adjusting for country. We can use the + to add this additional term. Now that we have multiple covariate/explanatory variable terms, we would call this a multivariable linear regression.

So now our model in words will be:

Mean relative consumption of red meat is dependent on sex and country. Or in other words, sex and location influence the consumption of red meat around the world.

Then the coefficient for sex will be different from what we had in our previous lm() model, as it will be calculated while keeping location_name or the country where the consumption value was obtained fixed, or in other words “controlling for location_name.” This will also result in output for each of the countries. The coefficients here represent the average difference in consumption value for each country compared to the reference country of Afghanistan, while accounting for sex.

This now should meet the assumption of independence for a linear regression model, since observations will be independent conditional an the covariates of sex and country.

Let’s fit this model and look at the results.

diet_and_guidelines %>%
  filter(food == "red meat") %>%
  lm(log10(Relative_Percent) ~ sex + location_name, data = .) %>%
  summary()

Call:
lm(formula = log10(Relative_Percent) ~ sex + location_name, data = .)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.031341 -0.003897  0.000000  0.003897  0.031341 

Coefficients:
                                               Estimate Std. Error t value
(Intercept)                                    1.537017   0.006860 224.066
sexMale                                        0.184387   0.000980 188.159
location_nameAlbania                           0.438021   0.009676  45.268
location_nameAlgeria                           0.045488   0.009676   4.701
location_nameAmerican Samoa                    0.467181   0.009676  48.281
location_nameAndorra                           0.748631   0.009676  77.368
location_nameAngola                            0.250455   0.009676  25.884
location_nameAntigua and Barbuda               0.299310   0.009676  30.932
location_nameArgentina                         0.824071   0.009676  85.164
location_nameArmenia                           0.394447   0.009676  40.764
location_nameAustralia                         0.814642   0.009676  84.190
location_nameAustria                           0.870921   0.009676  90.006
location_nameAzerbaijan                        0.293968   0.009676  30.380
location_nameBahrain                           0.289155   0.009676  29.883
location_nameBangladesh                       -0.611390   0.009676 -63.185
location_nameBarbados                          0.310247   0.009676  32.063
location_nameBelarus                           0.687041   0.009676  71.003
location_nameBelgium                           0.676956   0.009676  69.961
location_nameBelize                            0.176578   0.009676  18.249
location_nameBenin                            -0.326705   0.009676 -33.764
location_nameBermuda                           0.424542   0.009676  43.875
location_nameBhutan                           -0.120195   0.009676 -12.422
location_nameBolivia                           0.493009   0.009676  50.950
location_nameBosnia and Herzegovina            0.170649   0.009676  17.636
location_nameBotswana                          0.133756   0.009676  13.823
location_nameBrazil                            0.740470   0.009676  76.525
location_nameBrunei                            0.221789   0.009676  22.921
location_nameBulgaria                          0.680417   0.009676  70.318
location_nameBurkina Faso                      0.092972   0.009676   9.608
location_nameBurundi                          -0.196961   0.009676 -20.355
location_nameCambodia                          0.110614   0.009676  11.432
location_nameCameroon                         -0.055387   0.009676  -5.724
location_nameCanada                            0.667331   0.009676  68.966
location_nameCape Verde                        0.394975   0.009676  40.819
location_nameCentral African Republic          0.369231   0.009676  38.159
location_nameChad                              0.043477   0.009676   4.493
location_nameChile                             0.670126   0.009676  69.255
location_nameChina                             0.634103   0.009676  65.532
location_nameColombia                          0.360716   0.009676  37.279
location_nameComoros                          -0.115892   0.009676 -11.977
location_nameCongo                            -0.257950   0.009676 -26.658
location_nameCosta Rica                        0.379751   0.009676  39.246
location_nameCote d'Ivoire                    -0.387071   0.009676 -40.002
location_nameCroatia                           0.581435   0.009676  60.089
location_nameCuba                              0.347647   0.009676  35.928
location_nameCyprus                            0.600760   0.009676  62.086
location_nameCzech Republic                    0.647519   0.009676  66.918
location_nameDemocratic Republic of the Congo -0.140305   0.009676 -14.500
location_nameDenmark                           0.662793   0.009676  68.497
location_nameDjibouti                          0.204672   0.009676  21.152
location_nameDominica                          0.196880   0.009676  20.347
location_nameDominican Republic                0.279081   0.009676  28.842
location_nameEcuador                           0.516439   0.009676  53.372
location_nameEgypt                             0.138031   0.009676  14.265
location_nameEl Salvador                       0.043217   0.009676   4.466
location_nameEquatorial Guinea                 0.415090   0.009676  42.898
location_nameEritrea                          -0.141927   0.009676 -14.668
location_nameEstonia                           0.541948   0.009676  56.008
location_nameEthiopia                         -0.173354   0.009676 -17.915
location_nameFederated States of Micronesia    0.209379   0.009676  21.638
location_nameFiji                              0.297535   0.009676  30.749
location_nameFinland                           0.625494   0.009676  64.642
location_nameFrance                            0.700810   0.009676  72.426
location_nameGabon                             0.187008   0.009676  19.326
location_nameGeorgia                           0.228131   0.009676  23.576
location_nameGermany                           0.703028   0.009676  72.655
location_nameGhana                            -0.346773   0.009676 -35.838
location_nameGreece                            0.667048   0.009676  68.937
location_nameGreenland                         0.768537   0.009676  79.425
location_nameGrenada                           0.213445   0.009676  22.059
location_nameGuam                              0.577515   0.009676  59.684
location_nameGuatemala                         0.009005   0.009676   0.931
location_nameGuinea                           -0.194870   0.009676 -20.139
location_nameGuinea-Bissau                     0.133387   0.009676  13.785
location_nameGuyana                           -0.216393   0.009676 -22.363
location_nameHaiti                            -0.051033   0.009676  -5.274
location_nameHonduras                          0.117781   0.009676  12.172
location_nameHungary                           0.610661   0.009676  63.109
location_nameIceland                           0.659366   0.009676  68.143
location_nameIndia                            -0.650568   0.009676 -67.234
location_nameIndonesia                        -0.219020   0.009676 -22.635
location_nameIran                             -0.045613   0.009676  -4.714
location_nameIraq                             -0.318520   0.009676 -32.918
location_nameIreland                           0.800964   0.009676  82.776
location_nameIsrael                            0.483992   0.009676  50.019
location_nameItaly                             0.663507   0.009676  68.571
location_nameJamaica                          -0.033539   0.009676  -3.466
location_nameJapan                             0.374311   0.009676  38.683
location_nameJordan                            0.109734   0.009676  11.341
location_nameKazakhstan                        0.657682   0.009676  67.969
location_nameKenya                             0.104881   0.009676  10.839
location_nameKiribati                          0.130670   0.009676  13.504
location_nameKuwait                            0.365667   0.009676  37.790
location_nameKyrgyzstan                        0.424414   0.009676  43.861
location_nameLaos                              0.224672   0.009676  23.219
location_nameLatvia                            0.573087   0.009676  59.226
location_nameLebanon                           0.330736   0.009676  34.180
location_nameLesotho                           0.078852   0.009676   8.149
location_nameLiberia                          -0.380804   0.009676 -39.355
location_nameLibya                             0.101999   0.009676  10.541
location_nameLithuania                         0.653012   0.009676  67.486
location_nameLuxembourg                        0.801064   0.009676  82.787
location_nameMacedonia                         0.315642   0.009676  32.620
location_nameMadagascar                        0.010171   0.009676   1.051
location_nameMalawi                           -0.240867   0.009676 -24.893
location_nameMalaysia                          0.126939   0.009676  13.119
location_nameMaldives                         -0.131439   0.009676 -13.584
location_nameMali                              0.252462   0.009676  26.091
location_nameMalta                             0.654370   0.009676  67.626
location_nameMarshall Islands                  0.242041   0.009676  25.014
location_nameMauritania                        0.253405   0.009676  26.188
location_nameMauritius                         0.084586   0.009676   8.742
location_nameMexico                            0.493796   0.009676  51.032
location_nameMoldova                           0.248523   0.009676  25.684
location_nameMongolia                          0.802313   0.009676  82.916
location_nameMontenegro                        0.695318   0.009676  71.858
location_nameMorocco                           0.046481   0.009676   4.804
location_nameMozambique                       -0.146113   0.009676 -15.100
location_nameMyanmar                           0.161358   0.009676  16.676
location_nameNamibia                           0.271516   0.009676  28.060
location_nameNepal                            -0.003694   0.009676  -0.382
location_nameNetherlands                       0.689104   0.009676  71.216
location_nameNew Zealand                       0.792965   0.009676  81.950
location_nameNicaragua                        -0.108988   0.009676 -11.263
location_nameNiger                             0.191332   0.009676  19.773
location_nameNigeria                          -0.169992   0.009676 -17.568
location_nameNorth Korea                      -0.246185   0.009676 -25.442
location_nameNorthern Mariana Islands          0.435022   0.009676  44.958
location_nameNorway                            0.663406   0.009676  68.560
location_nameOman                              0.377107   0.009676  38.972
location_namePakistan                          0.014624   0.009676   1.511
location_namePalestine                        -0.002820   0.009676  -0.291
location_namePanama                            0.477302   0.009676  49.327
location_namePapua New Guinea                  0.177936   0.009676  18.389
location_nameParaguay                          0.601524   0.009676  62.165
location_namePeru                             -0.037011   0.009676  -3.825
location_namePhilippines                       0.344310   0.009676  35.583
location_namePoland                            0.687177   0.009676  71.017
location_namePortugal                          0.733722   0.009676  75.827
location_namePuerto Rico                       0.412028   0.009676  42.581
location_nameQatar                             0.429955   0.009676  44.434
location_nameRomania                           0.652514   0.009676  67.435
location_nameRussian Federation                0.535982   0.009676  55.392
location_nameRwanda                           -0.277812   0.009676 -28.711
location_nameSaint Lucia                       0.212165   0.009676  21.926
location_nameSaint Vincent and the Grenadines  0.196704   0.009676  20.329
location_nameSamoa                             0.267956   0.009676  27.692
location_nameSao Tome and Principe            -0.047627   0.009676  -4.922
location_nameSaudi Arabia                      0.057527   0.009676   5.945
location_nameSenegal                          -0.072896   0.009676  -7.534
location_nameSerbia                            0.508400   0.009676  52.541
location_nameSeychelles                        0.301701   0.009676  31.180
location_nameSierra Leone                     -0.522458   0.009676 -53.994
location_nameSingapore                         0.582594   0.009676  60.209
location_nameSlovakia                          0.627148   0.009676  64.813
location_nameSlovenia                          0.679024   0.009676  70.174
location_nameSolomon Islands                  -0.010315   0.009676  -1.066
location_nameSomalia                          -0.256388   0.009676 -26.497
location_nameSouth Africa                      0.435449   0.009676  45.002
location_nameSouth Korea                       0.592893   0.009676  61.273
location_nameSouth Sudan                       0.015340   0.009676   1.585
location_nameSpain                             0.678441   0.009676  70.114
location_nameSri Lanka                        -0.749966   0.009676 -77.506
location_nameSudan                            -0.050768   0.009676  -5.247
location_nameSuriname                          0.176747   0.009676  18.266
location_nameSwaziland                         0.301692   0.009676  31.179
location_nameSweden                            0.692026   0.009676  71.518
location_nameSwitzerland                       0.701932   0.009676  72.542
location_nameSyria                            -0.011904   0.009676  -1.230
location_nameTaiwan (Province of China)        0.677543   0.009676  70.021
location_nameTajikistan                        0.015453   0.009676   1.597
location_nameTanzania                         -0.129507   0.009676 -13.384
location_nameThailand                          0.164882   0.009676  17.040
location_nameThe Bahamas                       0.577812   0.009676  59.715
location_nameThe Gambia                       -0.389047   0.009676 -40.206
location_nameTimor-Leste                       0.060600   0.009676   6.263
location_nameTogo                             -0.308629   0.009676 -31.896
location_nameTonga                             0.289031   0.009676  29.870
location_nameTrinidad and Tobago               0.160447   0.009676  16.582
location_nameTunisia                          -0.005090   0.009676  -0.526
location_nameTurkey                            0.050119   0.009676   5.180
location_nameTurkmenistan                      0.722890   0.009676  74.708
location_nameUganda                            0.015166   0.009676   1.567
location_nameUkraine                           0.372418   0.009676  38.488
location_nameUnited Arab Emirates              0.283428   0.009676  29.291
location_nameUnited Kingdom                    0.556010   0.009676  57.461
location_nameUnited States                     0.708984   0.009676  73.271
location_nameUruguay                           0.632295   0.009676  65.345
location_nameUzbekistan                        0.458726   0.009676  47.407
location_nameVanuatu                           0.191061   0.009676  19.745
location_nameVenezuela                         0.439508   0.009676  45.421
location_nameVietnam                           0.494979   0.009676  51.154
location_nameVirgin Islands, U.S.              0.438211   0.009676  45.287
location_nameYemen                            -0.126839   0.009676 -13.108
location_nameZambia                           -0.104306   0.009676 -10.780
location_nameZimbabwe                          0.046567   0.009676   4.812
                                              Pr(>|t|)    
(Intercept)                                    < 2e-16 ***
sexMale                                        < 2e-16 ***
location_nameAlbania                           < 2e-16 ***
location_nameAlgeria                          4.90e-06 ***
location_nameAmerican Samoa                    < 2e-16 ***
location_nameAndorra                           < 2e-16 ***
location_nameAngola                            < 2e-16 ***
location_nameAntigua and Barbuda               < 2e-16 ***
location_nameArgentina                         < 2e-16 ***
location_nameArmenia                           < 2e-16 ***
location_nameAustralia                         < 2e-16 ***
location_nameAustria                           < 2e-16 ***
location_nameAzerbaijan                        < 2e-16 ***
location_nameBahrain                           < 2e-16 ***
location_nameBangladesh                        < 2e-16 ***
location_nameBarbados                          < 2e-16 ***
location_nameBelarus                           < 2e-16 ***
location_nameBelgium                           < 2e-16 ***
location_nameBelize                            < 2e-16 ***
location_nameBenin                             < 2e-16 ***
location_nameBermuda                           < 2e-16 ***
location_nameBhutan                            < 2e-16 ***
location_nameBolivia                           < 2e-16 ***
location_nameBosnia and Herzegovina            < 2e-16 ***
location_nameBotswana                          < 2e-16 ***
location_nameBrazil                            < 2e-16 ***
location_nameBrunei                            < 2e-16 ***
location_nameBulgaria                          < 2e-16 ***
location_nameBurkina Faso                      < 2e-16 ***
location_nameBurundi                           < 2e-16 ***
location_nameCambodia                          < 2e-16 ***
location_nameCameroon                         3.90e-08 ***
location_nameCanada                            < 2e-16 ***
location_nameCape Verde                        < 2e-16 ***
location_nameCentral African Republic          < 2e-16 ***
location_nameChad                             1.20e-05 ***
location_nameChile                             < 2e-16 ***
location_nameChina                             < 2e-16 ***
location_nameColombia                          < 2e-16 ***
location_nameComoros                           < 2e-16 ***
location_nameCongo                             < 2e-16 ***
location_nameCosta Rica                        < 2e-16 ***
location_nameCote d'Ivoire                     < 2e-16 ***
location_nameCroatia                           < 2e-16 ***
location_nameCuba                              < 2e-16 ***
location_nameCyprus                            < 2e-16 ***
location_nameCzech Republic                    < 2e-16 ***
location_nameDemocratic Republic of the Congo  < 2e-16 ***
location_nameDenmark                           < 2e-16 ***
location_nameDjibouti                          < 2e-16 ***
location_nameDominica                          < 2e-16 ***
location_nameDominican Republic                < 2e-16 ***
location_nameEcuador                           < 2e-16 ***
location_nameEgypt                             < 2e-16 ***
location_nameEl Salvador                      1.35e-05 ***
location_nameEquatorial Guinea                 < 2e-16 ***
location_nameEritrea                           < 2e-16 ***
location_nameEstonia                           < 2e-16 ***
location_nameEthiopia                          < 2e-16 ***
location_nameFederated States of Micronesia    < 2e-16 ***
location_nameFiji                              < 2e-16 ***
location_nameFinland                           < 2e-16 ***
location_nameFrance                            < 2e-16 ***
location_nameGabon                             < 2e-16 ***
location_nameGeorgia                           < 2e-16 ***
location_nameGermany                           < 2e-16 ***
location_nameGhana                             < 2e-16 ***
location_nameGreece                            < 2e-16 ***
location_nameGreenland                         < 2e-16 ***
location_nameGrenada                           < 2e-16 ***
location_nameGuam                              < 2e-16 ***
location_nameGuatemala                        0.353203    
location_nameGuinea                            < 2e-16 ***
location_nameGuinea-Bissau                     < 2e-16 ***
location_nameGuyana                            < 2e-16 ***
location_nameHaiti                            3.54e-07 ***
location_nameHonduras                          < 2e-16 ***
location_nameHungary                           < 2e-16 ***
location_nameIceland                           < 2e-16 ***
location_nameIndia                             < 2e-16 ***
location_nameIndonesia                         < 2e-16 ***
location_nameIran                             4.63e-06 ***
location_nameIraq                              < 2e-16 ***
location_nameIreland                           < 2e-16 ***
location_nameIsrael                            < 2e-16 ***
location_nameItaly                             < 2e-16 ***
location_nameJamaica                          0.000650 ***
location_nameJapan                             < 2e-16 ***
location_nameJordan                            < 2e-16 ***
location_nameKazakhstan                        < 2e-16 ***
location_nameKenya                             < 2e-16 ***
location_nameKiribati                          < 2e-16 ***
location_nameKuwait                            < 2e-16 ***
location_nameKyrgyzstan                        < 2e-16 ***
location_nameLaos                              < 2e-16 ***
location_nameLatvia                            < 2e-16 ***
location_nameLebanon                           < 2e-16 ***
location_nameLesotho                          4.43e-14 ***
location_nameLiberia                           < 2e-16 ***
location_nameLibya                             < 2e-16 ***
location_nameLithuania                         < 2e-16 ***
location_nameLuxembourg                        < 2e-16 ***
location_nameMacedonia                         < 2e-16 ***
location_nameMadagascar                       0.294507    
location_nameMalawi                            < 2e-16 ***
location_nameMalaysia                          < 2e-16 ***
location_nameMaldives                          < 2e-16 ***
location_nameMali                              < 2e-16 ***
location_nameMalta                             < 2e-16 ***
location_nameMarshall Islands                  < 2e-16 ***
location_nameMauritania                        < 2e-16 ***
location_nameMauritius                        1.09e-15 ***
location_nameMexico                            < 2e-16 ***
location_nameMoldova                           < 2e-16 ***
location_nameMongolia                          < 2e-16 ***
location_nameMontenegro                        < 2e-16 ***
location_nameMorocco                          3.11e-06 ***
location_nameMozambique                        < 2e-16 ***
location_nameMyanmar                           < 2e-16 ***
location_nameNamibia                           < 2e-16 ***
location_nameNepal                            0.703082    
location_nameNetherlands                       < 2e-16 ***
location_nameNew Zealand                       < 2e-16 ***
location_nameNicaragua                         < 2e-16 ***
location_nameNiger                             < 2e-16 ***
location_nameNigeria                           < 2e-16 ***
location_nameNorth Korea                       < 2e-16 ***
location_nameNorthern Mariana Islands          < 2e-16 ***
location_nameNorway                            < 2e-16 ***
location_nameOman                              < 2e-16 ***
location_namePakistan                         0.132342    
location_namePalestine                        0.771045    
location_namePanama                            < 2e-16 ***
location_namePapua New Guinea                  < 2e-16 ***
location_nameParaguay                          < 2e-16 ***
location_namePeru                             0.000176 ***
location_namePhilippines                       < 2e-16 ***
location_namePoland                            < 2e-16 ***
location_namePortugal                          < 2e-16 ***
location_namePuerto Rico                       < 2e-16 ***
location_nameQatar                             < 2e-16 ***
location_nameRomania                           < 2e-16 ***
location_nameRussian Federation                < 2e-16 ***
location_nameRwanda                            < 2e-16 ***
location_nameSaint Lucia                       < 2e-16 ***
location_nameSaint Vincent and the Grenadines  < 2e-16 ***
location_nameSamoa                             < 2e-16 ***
location_nameSao Tome and Principe            1.83e-06 ***
location_nameSaudi Arabia                     1.26e-08 ***
location_nameSenegal                          1.83e-12 ***
location_nameSerbia                            < 2e-16 ***
location_nameSeychelles                        < 2e-16 ***
location_nameSierra Leone                      < 2e-16 ***
location_nameSingapore                         < 2e-16 ***
location_nameSlovakia                          < 2e-16 ***
location_nameSlovenia                          < 2e-16 ***
location_nameSolomon Islands                  0.287747    
location_nameSomalia                           < 2e-16 ***
location_nameSouth Africa                      < 2e-16 ***
location_nameSouth Korea                       < 2e-16 ***
location_nameSouth Sudan                      0.114522    
location_nameSpain                             < 2e-16 ***
location_nameSri Lanka                         < 2e-16 ***
location_nameSudan                            4.04e-07 ***
location_nameSuriname                          < 2e-16 ***
location_nameSwaziland                         < 2e-16 ***
location_nameSweden                            < 2e-16 ***
location_nameSwitzerland                       < 2e-16 ***
location_nameSyria                            0.220097    
location_nameTaiwan (Province of China)        < 2e-16 ***
location_nameTajikistan                       0.111900    
location_nameTanzania                          < 2e-16 ***
location_nameThailand                          < 2e-16 ***
location_nameThe Bahamas                       < 2e-16 ***
location_nameThe Gambia                        < 2e-16 ***
location_nameTimor-Leste                      2.38e-09 ***
location_nameTogo                              < 2e-16 ***
location_nameTonga                             < 2e-16 ***
location_nameTrinidad and Tobago               < 2e-16 ***
location_nameTunisia                          0.599486    
location_nameTurkey                           5.55e-07 ***
location_nameTurkmenistan                      < 2e-16 ***
location_nameUganda                           0.118652    
location_nameUkraine                           < 2e-16 ***
location_nameUnited Arab Emirates              < 2e-16 ***
location_nameUnited Kingdom                    < 2e-16 ***
location_nameUnited States                     < 2e-16 ***
location_nameUruguay                           < 2e-16 ***
location_nameUzbekistan                        < 2e-16 ***
location_nameVanuatu                           < 2e-16 ***
location_nameVenezuela                         < 2e-16 ***
location_nameVietnam                           < 2e-16 ***
location_nameVirgin Islands, U.S.              < 2e-16 ***
location_nameYemen                             < 2e-16 ***
location_nameZambia                            < 2e-16 ***
location_nameZimbabwe                         2.99e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.009676 on 194 degrees of freedom
Multiple R-squared:  0.9996,    Adjusted R-squared:  0.9993 
F-statistic:  2671 on 195 and 194 DF,  p-value: < 2.2e-16

First let’s look at the estimated coefficient for the sexMale variable, which is 0.18. This can be interpreted as the difference in mean log relative percent consumption between males and females, holding country constant. So comparing males to females within the same country. Notice this is the same estimated difference we found from our paired \(t\)-test! The \(p\)-value for this coefficient also matches the \(p\)-value from the paired \(t\)-test.

You can also see from this output that we have a coefficient for every country except Afghanistan, which is our reference country. These coefficients compare the country to that reference. So the estimated coefficient for location_nameAlbania, 0.44, can be interpreted as the difference in mean log relative percent consumption between Albania and Afghanistan, holding sex constant. So comparing Albania to Afghanistan within males or comparing Albania to Afghanistan within females.

Finally, you might notice that the number of residual degrees of freedom for this regression is 194, just as in the paired \(t\)-test. This makes sense since we have to estimate a coefficient for 194 countries (all except Afghanistan) as well as a coefficient for sex and an intercept. So we have:

\[df = n - # parameters estimating = 390 - 194 - 2 = 194\]

We should also check the residual plots for this fixed effects regression model.

diet_and_guidelines %>%
  filter(food == "red meat") %>%
  lm(log10(Relative_Percent) ~ sex + location_name, data = .) %>%
  plot(which = 1:3)

These residual plots look much better than our previous plots. This guide provides more information on how to interpret these residual diagnostic plots.

Based on our Q-Q plot, we appear to have some outliers perhaps at the extreme ends of our tails but overall the residuals look fairly normal. The residual vs fitted plot shows us if the relationship between our outcome variable and our predictors looks linear, if there is unequal error variance between groups, and if there are possible outliers. Ideally this should look like a band of points equally centered around zero. Here are examples of these plots that might show issues of concern.

Overall our plot looks fairly good. The shape of our data looks fairly linear (the residuals don’t appear to have a shape other than a band or line), there does not appear to be any extreme outliers (no data points are especially far away) and the points have the same general range around the line for the various fitted values. There are a few points with wider residuals at the higher fitted values, but overall this looks quite reasonable.

Our scale-location plot also shows us that our variance looks fairly equal across groups as our values show a relatively even spread. A larger bend in the line would indicate more variation in the variance across our independent variable groups also known as heteroscedasticity. There is only a slight bend in the line for our data suggestive of homoscedasticity. So our assumptions look pretty good:

  1. Linear - the relationship appears to be fairly linear
  2. Independence - now that we have taken care of the location structure in our data, our samples are independent
  3. Normality - the residuals appear to be fairly normally distributed and we have a large number of samples to help account for minor violations
  4. Equal variance - the variance in the residuals appear to be fairly equal across the groups of the independent/predictor variables

Paired \(t\)-test and linear model with mixed effects


To “pair” our data using fixed effects cost us an additional 194 variables in our regression model, one for each country except Afghanistan. Alternatively, we can perform a slightly different type of regression that still accounts for the paired structure in the data.

In this case we will use the lmer() function of the lme4 package. This function allows us to fit what is called a linear mixed effects regression model. We will also use the lmerTest package, since this adds test statistics and \(p\)-values to the linear mixed effects model output.

This type of regression is called mixed because it contains both fixed and random effects. There are many different definitions for fixed and random effects and the difference is conceptually complex and context specific.

However in simplistic terms, fixed effects are generally speaking the variables of interest that we have reason to believe explain or predict the outcome or response variable, while random effects are those that may introduce additional variance in the influence of those predictor variables on the outcome variable. For example, they may provide information about group or batch structures within the data.

In our case, we are interested in the influence of sex on the consumption of red meat, however the identity of the country where the male and female consumption values were obtained may influence this relationship and we would like to control for that. We don’t want to model for location_name itself, but just model it’s influence on the relationship of sex on consumption of red meat. In other words, we are interested in getting a sense of how sex influences consumption rates in general and we want to account for the paired structure within our data, the fact that we have corresponding consumption values for the two sexes from different countries. The notation for including a random effect like this is 1 | variable_name. The one indicates a varying-intercept group effect, in other words we expect that the intercept may vary for each value of the variable indicated to the right of the |. So in our case, the intercept (log relative percent consumption when sex is assigned to the zero value - female) may be different for each country.

Let’s fit a mixed effects model that includes a fixed effect for sex and a random intercept for country:

diet_and_guidelines %>%
  filter(food == "red meat") %>%
  lmer(log10(Relative_Percent) ~ sex + (1 | location_name), data = .) %>%
  summary()
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: log10(Relative_Percent) ~ sex + (1 | location_name)
   Data: .

REML criterion at convergence: -969.5

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.2253 -0.3914 -0.0038  0.3993  3.2527 

Random effects:
 Groups        Name        Variance  Std.Dev.
 location_name (Intercept) 1.171e-01 0.342181
 Residual                  9.363e-05 0.009676
Number of obs: 390, groups:  location_name, 195

Fixed effects:
             Estimate Std. Error        df t value Pr(>|t|)    
(Intercept)   1.79887    0.02451 194.15516   73.38   <2e-16 ***
sexMale       0.18439    0.00098 194.00000  188.16   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
        (Intr)
sexMale -0.020

How would we interpret the results of this model? Again, let’s look at the estimated coefficient for the sexMale variable, which is 0.18. This can be interpreted the same way as in the simple linear regression, as the difference in mean log relative percent consumption between males and females. However, here we haven’t violated the independence assumption because we are accounting for the paired nature of the data through the random effect for country. The \(t\)-statistic and \(p\)-value for this coefficient also match those from the paired \(t\)-test we did before:

t.test(log10(pull(
  filter(wide_diet, food == "red meat"),
  Relative_Percent_Male
)),
log10(pull(
  filter(wide_diet, food == "red meat"),
  Relative_Percent_Female
)),
var.equal = TRUE, paired = TRUE
)

    Paired t-test

data:  log10(pull(filter(wide_diet, food == "red meat"), Relative_Percent_Male)) and log10(pull(filter(wide_diet, food == "red meat"), Relative_Percent_Female))
t = 188.16, df = 194, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.1824542 0.1863197
sample estimates:
mean of the differences 
               0.184387 

Notice that in the output for the mixed effects model, there are not coefficients for each country, like there were in the fixed effects model. This is because we are not explicitly estimating individual country effects in this model. Instead, the country effect is captured through the intercept in this model. Our estimated intercept is 1.80 and the standard deviation of this intercept is 0.34 (shown in the Random effects table in the output.) We can interpret this as saying that each country has an intercept that comes from a normal distribution with mean of 1.80 and a standard deviation of 0.34. Since the intercept in this model represents the log relative percent consumption for females, this give us an idea of how female consumption varies across countries – average log consumption across countries is 1.80, but there is variability from one country to another. And then the male log consumption is, on average, 0.18 higher than for females.

It is more complicated to calculate the degrees of freedom in the mixed effect model and beyond this case study, but it is based on the Satterthwaite formula and results in the same degrees of freedom.

Finally, lets see what our residual plots look like for this mixed effects model. We can’t use the plot() function with a lmer() model to get all of the plots at once, but we can construct a residual vs. fitted value plot and a Q-Q plot ourselves:

diet_and_guidelines %>%
  filter(food == "red meat") %>%
  lmer(log10(Relative_Percent) ~ sex + (1 | location_name), data = .) %>%
  plot()

diet_and_guidelines %>%
  filter(food == "red meat") %>%
  lmer(log10(Relative_Percent) ~ sex + (1 | location_name), data = .) %>%
  resid() %>%
  qqnorm()

diet_and_guidelines %>%
  filter(food == "red meat") %>%
  lmer(log10(Relative_Percent) ~ sex + (1 | location_name), data = .) %>%
  resid() %>%
  qqline()

Notice that the plots look very similar to what we saw with the fixed effects model.

We see that the paired \(t\)-test, the fixed effects model adjusting for country, and the mixed effects model with a random intercept for country all give the same results in this case. So which test should we use? The decision of which test to perform depends on your question of interest. In this case we were particularly interested in the influence of sex, so setting location_name to a random effect provides the same level of detail about sex without as much information about location_name, so that might be ideal. As we can see, the results, in this case, are the same. The benefit of using regression over a simple paired \(t\)-test would be the ability to add other covariates to our model if we wanted to adjust for other country characteristics.

Overall, though, we can conclude from these tests that we have enough evidence to reject the null hypothesis that there is no difference between the mean consumption of males and females ( or that sex has no association or influence on red meat consumption.) Therefore, it appears that males consume significantly more red meat than females globally.

ANalysis Of VAriance (ANOVA) test


We are also interested in the influence of age group on dietary consumption, but because there are 15 age groups we can’t assess the influence of age group on consumption using the paired \(t\)-test, as this test can only compare 2 groups.

If we wanted to test the hypothesis that there are any age group differences, that at least one of the groups is different from the others; we could use an ANOVA test. This test allows us to compare means of 3 or more groups by evaluating the variance of the data within the groups and among the groups.

Our null hypothesis is that all age groups have equal means: \[ H_0: \mu_{1} = \mu_{2} =\mu_{3}=\mu_{4} = ... \mu_{15} \]

The alternative hypothesis is that at least one age group mean is not equal to the others.

Importantly, if we reject the null, we do not know which group means are different from one another. Subsequent testing is required if we want to know this information. In this case we call this type of non-specific hypothesis an “omnibus” hypothesis.

You could actually perform an ANOVA to compare two means, but in this case you would get an \(F\)-statistic instead of a \(t\)-statistic which would be equivalent to \(t^2\). However it is not conventional to use ANOVA for only 2 means. The \(F\)-statistic is derived form the \(F\)-test is used for a few different type of tests. In the ANOVA the F-test is calculated as:

\[F = \frac{ variability\ between\ the \ groups}{ variablity\ within\ the \ groups}\]

The larger the ratio, the larger the variability between the groups, thus the more likely that the data for each group comes from a different distribution with different means, suggesting that the groups are different.

It turns out that the ANOVA test is also equivalent to linear regression. We will demonstrate this by evaluating how the consumption of red meat varies by age group using an ANOVA and a linear regression.

Thinking about how we want to know if red meat consumption differs between age groups from the linear regression perspective, we could also describe our null hypothesis as:

There is no influence of age group identity on consumption or there is no relationship between age group identity and consumption.

And we could describe our alternative hypothesis as:

Age group identity does influence consumption or explain some of the variation in consumption.

ANOVA assumptions

The ANOVA assumptions are quite similar to the \(t\)-test assumptions:

  1. Normality of the data for all tested groups (less of an issue if the number of observations is relatively large total n > 30)
  2. Equal variance between the groups - aka Homogeneity of Variances assumption (make sure you do the correct test if the data is not normal)
  3. Independent observations

let’s evaluate our assumptions for the groups we are comparing, starting with normality using Q-Q plots. First let’s make age_group_name a factor:

all_age_diet_and_guidelines %<>%
  mutate_at(vars(age_group_name), factor)

Now let’s look at Q-Q plots of both relative percent consumption and the log-transformed version of this variable:

all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  ggplot(aes(sample = Relative_Percent)) +
  facet_wrap(~age_group_name) +
  geom_qq() +
  geom_qq_line()

all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  ggplot(aes(sample = log10(Relative_Percent))) +
  facet_wrap(~age_group_name) +
  geom_qq() +
  geom_qq_line()

After transformation, these Q-Q plots look pretty good.

Now let’s look at the assumption of constant variance. There are different ways to assess this assumption across more than two groups. Bartlett’s test works well if the data appears to be quite normally distributed, while the Fligner-Killeen test is nonparametric and does not assume normality of the data.

We will use another popular test, Levene’s test, which is more robust to violations of normality than the Bartlett’s test, but not as robust as the Fligner-Killeen test. The null hypothesis of this test, as for the other two tests, is that the variances are equal across all of the groups. The alternative hypothesis is that at least one pair of groups has different variances. In symbols we would write this as

\[ H_0: \sigma_1^2 = \sigma_2^2 = \sigma_3^2 ... = \sigma_n^2 \]

and

\[H_a:\sigma_i^2 \neq \sigma_j^2 \] for at least one pair (\(i\),\(j\)).

We will use the leveneTest() function of the car package to performs Levene’s test.

all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  car::leveneTest(log10(Relative_Percent) ~ age_group_name, data = .)
Levene's Test for Homogeneity of Variance (center = median)
        Df F value Pr(>F)
group   14  0.0063      1
      5835               

Our data does not appear to violate the homogeneity of variances assumption as our \(p\)-value was greater than 0.05 and so we would fail to reject the null hypothesis of equal variances.

We already know that our independence assumption is not met, since the data for the different age groups comes from the same countries. We will account for this in later models, but first let’s compare the results between ANOVA and linear regression assuming the independence assumption is met.

ANOVA and linear regression

We can use the aov() function of the stats package to perform an ANOVA test. We will be performing what is called a one-way ANOVA because we only have one independent variable (age group). We will also perform a linear regression for comparison.

all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  aov(log10(Relative_Percent) ~ age_group_name, data = .) %>%
  summary()
                 Df Sum Sq Mean Sq F value Pr(>F)    
age_group_name   14   42.5   3.038   23.37 <2e-16 ***
Residuals      5835  758.7   0.130                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  lm(log10(Relative_Percent) ~ age_group_name, data = .) %>%
  summary()

Call:
lm(formula = log10(Relative_Percent) ~ age_group_name, data = .)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.09054 -0.25152  0.00684  0.30072  0.72695 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)             1.921191   0.018259 105.219  < 2e-16 ***
age_group_name30 to 34  0.001325   0.025822   0.051 0.959073    
age_group_name35 to 39  0.001745   0.025822   0.068 0.946133    
age_group_name40 to 44 -0.004447   0.025822  -0.172 0.863283    
age_group_name45 to 49 -0.019088   0.025822  -0.739 0.459801    
age_group_name50 to 54 -0.034767   0.025822  -1.346 0.178223    
age_group_name55 to 59 -0.051470   0.025822  -1.993 0.046279 *  
age_group_name60 to 64 -0.071080   0.025822  -2.753 0.005929 ** 
age_group_name65 to 69 -0.094345   0.025822  -3.654 0.000261 ***
age_group_name70 to 74 -0.119423   0.025822  -4.625 3.83e-06 ***
age_group_name75 to 79 -0.147352   0.025822  -5.706 1.21e-08 ***
age_group_name80 to 84 -0.210007   0.025822  -8.133 5.08e-16 ***
age_group_name85 to 89 -0.215369   0.025822  -8.341  < 2e-16 ***
age_group_name90 to 94 -0.218172   0.025822  -8.449  < 2e-16 ***
age_group_name95 plus  -0.218096   0.025822  -8.446  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3606 on 5835 degrees of freedom
Multiple R-squared:  0.05309,   Adjusted R-squared:  0.05081 
F-statistic: 23.37 on 14 and 5835 DF,  p-value: < 2.2e-16

We can see that the \(F\)-statistic (\(F\) value in the aov() output and at the bottom of the lm() output) is the same for both models and the \(p\)-value for the \(F\)-statistic is the same!

We also see that the degrees of freedom for the \(F\)-statistic is 14. This makes sense because we have 15 different age groups and degrees of freedom for the \(F\)-statistic are calculated as \(df = n - 1\). So in our case: \(df = 15 -1\).

The difference here is that with the lm() model we also get information about how the individual age groups are associated with the log relative percent consumption of red meat. Notice that if we look at all the age groups in the data

all_age_diet_and_guidelines %>%
  distinct(age_group_name)
# A tibble: 15 x 1
   age_group_name
   <fct>         
 1 25 to 29      
 2 30 to 34      
 3 35 to 39      
 4 40 to 44      
 5 45 to 49      
 6 50 to 54      
 7 55 to 59      
 8 60 to 64      
 9 65 to 69      
10 70 to 74      
11 75 to 79      
12 80 to 84      
13 85 to 89      
14 90 to 94      
15 95 plus       

we see that our lm() results are missing one of the age groups, the 25 to 29 age group. That is because this is the reference group and the coefficients indicate the slope or difference in log relative percent consumption rates for each listed age group compared to this reference group.

ANOVA and linear regression with fixed effects

Now let’s account for the paired location_name structure within our data, since the above models violate the independence assumptions for ANOVA and linear regression. We can do this by adding another fixed effect to both the ANOVA model and the linear regression model. For ANOVA, this means we are now doing a two-way ANOVA, since we have two independent variables (age group and country). For linear regression, we are now adding a fixed effect for country to our model.

all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  aov(log10(Relative_Percent) ~ age_group_name + location_name, data = .) %>%
  summary()
                 Df Sum Sq Mean Sq F value Pr(>F)    
age_group_name   14   42.5   3.038   356.2 <2e-16 ***
location_name   194  710.6   3.663   429.5 <2e-16 ***
Residuals      5641   48.1   0.009                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  lm(log10(Relative_Percent) ~ age_group_name + location_name, data = .) %>%
  summary()

Call:
lm(formula = log10(Relative_Percent) ~ age_group_name + location_name, 
    data = .)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.17321 -0.08990  0.04057  0.08969  0.19936 

Coefficients:
                                                Estimate Std. Error t value
(Intercept)                                    1.6473808  0.0174553  94.377
age_group_name30 to 34                         0.0013252  0.0066133   0.200
age_group_name35 to 39                         0.0017447  0.0066133   0.264
age_group_name40 to 44                        -0.0044467  0.0066133  -0.672
age_group_name45 to 49                        -0.0190883  0.0066133  -2.886
age_group_name50 to 54                        -0.0347670  0.0066133  -5.257
age_group_name55 to 59                        -0.0514702  0.0066133  -7.783
age_group_name60 to 64                        -0.0710801  0.0066133 -10.748
age_group_name65 to 69                        -0.0943452  0.0066133 -14.266
age_group_name70 to 74                        -0.1194232  0.0066133 -18.058
age_group_name75 to 79                        -0.1473520  0.0066133 -22.281
age_group_name80 to 84                        -0.2100070  0.0066133 -31.755
age_group_name85 to 89                        -0.2153687  0.0066133 -32.566
age_group_name90 to 94                        -0.2181720  0.0066133 -32.990
age_group_name95 plus                         -0.2180964  0.0066133 -32.979
location_nameAlbania                           0.4696634  0.0238444  19.697
location_nameAlgeria                           0.0524978  0.0238444   2.202
location_nameAmerican Samoa                    0.4781072  0.0238444  20.051
location_nameAndorra                           0.7709817  0.0238444  32.334
location_nameAngola                            0.2502250  0.0238444  10.494
location_nameAntigua and Barbuda               0.3148782  0.0238444  13.206
location_nameArgentina                         0.8391115  0.0238444  35.191
location_nameArmenia                           0.4143613  0.0238444  17.378
location_nameAustralia                         0.8382715  0.0238444  35.156
location_nameAustria                           0.9006351  0.0238444  37.771
location_nameAzerbaijan                        0.3013522  0.0238444  12.638
location_nameBahrain                           0.2892541  0.0238444  12.131
location_nameBangladesh                       -0.6036909  0.0238444 -25.318
location_nameBarbados                          0.3355809  0.0238444  14.074
location_nameBelarus                           0.7101162  0.0238444  29.781
location_nameBelgium                           0.7081735  0.0238444  29.700
location_nameBelize                            0.1840253  0.0238444   7.718
location_nameBenin                            -0.3251738  0.0238444 -13.637
location_nameBermuda                           0.4539095  0.0238444  19.036
location_nameBhutan                           -0.1193613  0.0238444  -5.006
location_nameBolivia                           0.5011897  0.0238444  21.019
location_nameBosnia and Herzegovina            0.1966458  0.0238444   8.247
location_nameBotswana                          0.1360033  0.0238444   5.704
location_nameBrazil                            0.7420931  0.0238444  31.122
location_nameBrunei                            0.2248900  0.0238444   9.432
location_nameBulgaria                          0.7016126  0.0238444  29.425
location_nameBurkina Faso                      0.0952272  0.0238444   3.994
location_nameBurundi                          -0.1975274  0.0238444  -8.284
location_nameCambodia                          0.1162693  0.0238444   4.876
location_nameCameroon                         -0.0547177  0.0238444  -2.295
location_nameCanada                            0.6965652  0.0238444  29.213
location_nameCape Verde                        0.4033481  0.0238444  16.916
location_nameCentral African Republic          0.3695721  0.0238444  15.499
location_nameChad                              0.0445016  0.0238444   1.866
location_nameChile                             0.6899420  0.0238444  28.935
location_nameChina                             0.6542148  0.0238444  27.437
location_nameColombia                          0.3677083  0.0238444  15.421
location_nameComoros                          -0.1092061  0.0238444  -4.580
location_nameCongo                            -0.2562887  0.0238444 -10.748
location_nameCosta Rica                        0.3922246  0.0238444  16.449
location_nameCote d'Ivoire                    -0.3878599  0.0238444 -16.266
location_nameCroatia                           0.6168738  0.0238444  25.871
location_nameCuba                              0.3715729  0.0238444  15.583
location_nameCyprus                            0.6210060  0.0238444  26.044
location_nameCzech Republic                    0.6754221  0.0238444  28.326
location_nameDemocratic Republic of the Congo -0.1381953  0.0238444  -5.796
location_nameDenmark                           0.6940034  0.0238444  29.105
location_nameDjibouti                          0.2040664  0.0238444   8.558
location_nameDominica                          0.2201500  0.0238444   9.233
location_nameDominican Republic                0.2904822  0.0238444  12.182
location_nameEcuador                           0.5001219  0.0238444  20.974
location_nameEgypt                             0.1407597  0.0238444   5.903
location_nameEl Salvador                       0.0564578  0.0238444   2.368
location_nameEquatorial Guinea                 0.4145078  0.0238444  17.384
location_nameEritrea                          -0.1427694  0.0238444  -5.988
location_nameEstonia                           0.5724487  0.0238444  24.008
location_nameEthiopia                         -0.1749948  0.0238444  -7.339
location_nameFederated States of Micronesia    0.2177319  0.0238444   9.131
location_nameFiji                              0.3063835  0.0238444  12.849
location_nameFinland                           0.6616460  0.0238444  27.748
location_nameFrance                            0.7352843  0.0238444  30.837
location_nameGabon                             0.1926532  0.0238444   8.080
location_nameGeorgia                           0.2530224  0.0238444  10.611
location_nameGermany                           0.7386742  0.0238444  30.979
location_nameGhana                            -0.3464979  0.0238444 -14.532
location_nameGreece                            0.7022224  0.0238444  29.450
location_nameGreenland                         0.7880282  0.0238444  33.049
location_nameGrenada                           0.2394215  0.0238444  10.041
location_nameGuam                              0.5937555  0.0238444  24.901
location_nameGuatemala                         0.0139062  0.0238444   0.583
location_nameGuinea                           -0.1905498  0.0238444  -7.991
location_nameGuinea-Bissau                     0.1322312  0.0238444   5.546
location_nameGuyana                           -0.2063983  0.0238444  -8.656
location_nameHaiti                            -0.0491435  0.0238444  -2.061
location_nameHonduras                          0.1244476  0.0238444   5.219
location_nameHungary                           0.6410019  0.0238444  26.883
location_nameIceland                           0.6839722  0.0238444  28.685
location_nameIndia                            -0.6388459  0.0238444 -26.792
location_nameIndonesia                        -0.2110810  0.0238444  -8.852
location_nameIran                             -0.0399115  0.0238444  -1.674
location_nameIraq                             -0.3147627  0.0238444 -13.201
location_nameIreland                           0.8205275  0.0238444  34.412
location_nameIsrael                            0.5031271  0.0238444  21.100
location_nameItaly                             0.7012419  0.0238444  29.409
location_nameJamaica                          -0.0196064  0.0238444  -0.822
location_nameJapan                             0.4163050  0.0238444  17.459
location_nameJordan                            0.1122555  0.0238444   4.708
location_nameKazakhstan                        0.6667740  0.0238444  27.964
location_nameKenya                             0.1062621  0.0238444   4.456
location_nameKiribati                          0.1360152  0.0238444   5.704
location_nameKuwait                            0.3628791  0.0238444  15.219
location_nameKyrgyzstan                        0.4299491  0.0238444  18.031
location_nameLaos                              0.2281184  0.0238444   9.567
location_nameLatvia                            0.6051096  0.0238444  25.377
location_nameLebanon                           0.3358240  0.0238444  14.084
location_nameLesotho                           0.0827870  0.0238444   3.472
location_nameLiberia                          -0.3810751  0.0238444 -15.982
location_nameLibya                             0.1058210  0.0238444   4.438
location_nameLithuania                         0.6853303  0.0238444  28.742
location_nameLuxembourg                        0.8252128  0.0238444  34.608
location_nameMacedonia                         0.3375009  0.0238444  14.154
location_nameMadagascar                        0.0086607  0.0238444   0.363
location_nameMalawi                           -0.2390062  0.0238444 -10.024
location_nameMalaysia                          0.1348008  0.0238444   5.653
location_nameMaldives                         -0.1304736  0.0238444  -5.472
location_nameMali                              0.2545834  0.0238444  10.677
location_nameMalta                             0.6861462  0.0238444  28.776
location_nameMarshall Islands                  0.2453513  0.0238444  10.290
location_nameMauritania                        0.2586036  0.0238444  10.845
location_nameMauritius                         0.1018209  0.0238444   4.270
location_nameMexico                            0.5019364  0.0238444  21.050
location_nameMoldova                           0.2694136  0.0238444  11.299
location_nameMongolia                          0.8050147  0.0238444  33.761
location_nameMontenegro                        0.7197936  0.0238444  30.187
location_nameMorocco                           0.0536665  0.0238444   2.251
location_nameMozambique                       -0.1454548  0.0238444  -6.100
location_nameMyanmar                           0.1701756  0.0238444   7.137
location_nameNamibia                           0.2766865  0.0238444  11.604
location_nameNepal                             0.0040043  0.0238444   0.168
location_nameNetherlands                       0.7182009  0.0238444  30.120
location_nameNew Zealand                       0.8215769  0.0238444  34.456
location_nameNicaragua                        -0.1022029  0.0238444  -4.286
location_nameNiger                             0.1929072  0.0238444   8.090
location_nameNigeria                          -0.1687456  0.0238444  -7.077
location_nameNorth Korea                      -0.2308157  0.0238444  -9.680
location_nameNorthern Mariana Islands          0.4519371  0.0238444  18.954
location_nameNorway                            0.6802116  0.0238444  28.527
location_nameOman                              0.3715916  0.0238444  15.584
location_namePakistan                          0.0163832  0.0238444   0.687
location_namePalestine                        -0.0001115  0.0238444  -0.005
location_namePanama                            0.4910430  0.0238444  20.594
location_namePapua New Guinea                  0.1780454  0.0238444   7.467
location_nameParaguay                          0.6105228  0.0238444  25.604
location_namePeru                             -0.0270767  0.0238444  -1.136
location_namePhilippines                       0.3489508  0.0238444  14.634
location_namePoland                            0.7119822  0.0238444  29.859
location_namePortugal                          0.7651172  0.0238444  32.088
location_namePuerto Rico                       0.4442532  0.0238444  18.631
location_nameQatar                             0.4226739  0.0238444  17.726
location_nameRomania                           0.6732206  0.0238444  28.234
location_nameRussian Federation                0.5601442  0.0238444  23.492
location_nameRwanda                           -0.2771514  0.0238444 -11.623
location_nameSaint Lucia                       0.2295368  0.0238444   9.626
location_nameSaint Vincent and the Grenadines  0.2179043  0.0238444   9.139
location_nameSamoa                             0.2797823  0.0238444  11.734
location_nameSao Tome and Principe            -0.0434018  0.0238444  -1.820
location_nameSaudi Arabia                      0.0524593  0.0238444   2.200
location_nameSenegal                          -0.0681700  0.0238444  -2.859
location_nameSerbia                            0.5359842  0.0238444  22.478
location_nameSeychelles                        0.3146719  0.0238444  13.197
location_nameSierra Leone                     -0.5214405  0.0238444 -21.868
location_nameSingapore                         0.5946425  0.0238444  24.938
location_nameSlovakia                          0.6438632  0.0238444  27.003
location_nameSlovenia                          0.7116904  0.0238444  29.847
location_nameSolomon Islands                  -0.0090348  0.0238444  -0.379
location_nameSomalia                          -0.2564213  0.0238444 -10.754
location_nameSouth Africa                      0.4267033  0.0238444  17.895
location_nameSouth Korea                       0.5949378  0.0238444  24.951
location_nameSouth Sudan                       0.0164738  0.0238444   0.691
location_nameSpain                             0.7138244  0.0238444  29.937
location_nameSri Lanka                        -0.7330116  0.0238444 -30.741
location_nameSudan                            -0.0465153  0.0238444  -1.951
location_nameSuriname                          0.1906880  0.0238444   7.997
location_nameSwaziland                         0.3040096  0.0238444  12.750
location_nameSweden                            0.7247779  0.0238444  30.396
location_nameSwitzerland                       0.7281804  0.0238444  30.539
location_nameSyria                            -0.0024640  0.0238444  -0.103
location_nameTaiwan (Province of China)        0.6900827  0.0238444  28.941
location_nameTajikistan                        0.0176046  0.0238444   0.738
location_nameTanzania                         -0.1278459  0.0238444  -5.362
location_nameThailand                          0.1802762  0.0238444   7.561
location_nameThe Bahamas                       0.5890517  0.0238444  24.704
location_nameThe Gambia                       -0.3863879  0.0238444 -16.205
location_nameTimor-Leste                       0.0705710  0.0238444   2.960
location_nameTogo                             -0.3083612  0.0238444 -12.932
location_nameTonga                             0.3002675  0.0238444  12.593
location_nameTrinidad and Tobago               0.1770619  0.0238444   7.426
location_nameTunisia                           0.0071742  0.0238444   0.301
location_nameTurkey                            0.0631309  0.0238444   2.648
location_nameTurkmenistan                      0.7290156  0.0238444  30.574
location_nameUganda                            0.0146972  0.0238444   0.616
location_nameUkraine                           0.3932282  0.0238444  16.491
location_nameUnited Arab Emirates              0.2716685  0.0238444  11.393
location_nameUnited Kingdom                    0.5932920  0.0238444  24.882
location_nameUnited States                     0.7390080  0.0238444  30.993
location_nameUruguay                           0.6597541  0.0238444  27.669
location_nameUzbekistan                        0.4620568  0.0238444  19.378
location_nameVanuatu                           0.1977037  0.0238444   8.291
location_nameVenezuela                         0.4446078  0.0238444  18.646
location_nameVietnam                           0.5087180  0.0238444  21.335
location_nameVirgin Islands, U.S.              0.4675163  0.0238444  19.607
location_nameYemen                            -0.1277139  0.0238444  -5.356
location_nameZambia                           -0.1062726  0.0238444  -4.457
location_nameZimbabwe                          0.0462498  0.0238444   1.940
                                              Pr(>|t|)    
(Intercept)                                    < 2e-16 ***
age_group_name30 to 34                        0.841191    
age_group_name35 to 39                        0.791927    
age_group_name40 to 44                        0.501365    
age_group_name45 to 49                        0.003912 ** 
age_group_name50 to 54                        1.52e-07 ***
age_group_name55 to 59                        8.37e-15 ***
age_group_name60 to 64                         < 2e-16 ***
age_group_name65 to 69                         < 2e-16 ***
age_group_name70 to 74                         < 2e-16 ***
age_group_name75 to 79                         < 2e-16 ***
age_group_name80 to 84                         < 2e-16 ***
age_group_name85 to 89                         < 2e-16 ***
age_group_name90 to 94                         < 2e-16 ***
age_group_name95 plus                          < 2e-16 ***
location_nameAlbania                           < 2e-16 ***
location_nameAlgeria                          0.027728 *  
location_nameAmerican Samoa                    < 2e-16 ***
location_nameAndorra                           < 2e-16 ***
location_nameAngola                            < 2e-16 ***
location_nameAntigua and Barbuda               < 2e-16 ***
location_nameArgentina                         < 2e-16 ***
location_nameArmenia                           < 2e-16 ***
location_nameAustralia                         < 2e-16 ***
location_nameAustria                           < 2e-16 ***
location_nameAzerbaijan                        < 2e-16 ***
location_nameBahrain                           < 2e-16 ***
location_nameBangladesh                        < 2e-16 ***
location_nameBarbados                          < 2e-16 ***
location_nameBelarus                           < 2e-16 ***
location_nameBelgium                           < 2e-16 ***
location_nameBelize                           1.39e-14 ***
location_nameBenin                             < 2e-16 ***
location_nameBermuda                           < 2e-16 ***
location_nameBhutan                           5.73e-07 ***
location_nameBolivia                           < 2e-16 ***
location_nameBosnia and Herzegovina            < 2e-16 ***
location_nameBotswana                         1.23e-08 ***
location_nameBrazil                            < 2e-16 ***
location_nameBrunei                            < 2e-16 ***
location_nameBulgaria                          < 2e-16 ***
location_nameBurkina Faso                     6.59e-05 ***
location_nameBurundi                           < 2e-16 ***
location_nameCambodia                         1.11e-06 ***
location_nameCameroon                         0.021782 *  
location_nameCanada                            < 2e-16 ***
location_nameCape Verde                        < 2e-16 ***
location_nameCentral African Republic          < 2e-16 ***
location_nameChad                             0.062047 .  
location_nameChile                             < 2e-16 ***
location_nameChina                             < 2e-16 ***
location_nameColombia                          < 2e-16 ***
location_nameComoros                          4.75e-06 ***
location_nameCongo                             < 2e-16 ***
location_nameCosta Rica                        < 2e-16 ***
location_nameCote d'Ivoire                     < 2e-16 ***
location_nameCroatia                           < 2e-16 ***
location_nameCuba                              < 2e-16 ***
location_nameCyprus                            < 2e-16 ***
location_nameCzech Republic                    < 2e-16 ***
location_nameDemocratic Republic of the Congo 7.17e-09 ***
location_nameDenmark                           < 2e-16 ***
location_nameDjibouti                          < 2e-16 ***
location_nameDominica                          < 2e-16 ***
location_nameDominican Republic                < 2e-16 ***
location_nameEcuador                           < 2e-16 ***
location_nameEgypt                            3.77e-09 ***
location_nameEl Salvador                      0.017930 *  
location_nameEquatorial Guinea                 < 2e-16 ***
location_nameEritrea                          2.26e-09 ***
location_nameEstonia                           < 2e-16 ***
location_nameEthiopia                         2.46e-13 ***
location_nameFederated States of Micronesia    < 2e-16 ***
location_nameFiji                              < 2e-16 ***
location_nameFinland                           < 2e-16 ***
location_nameFrance                            < 2e-16 ***
location_nameGabon                            7.88e-16 ***
location_nameGeorgia                           < 2e-16 ***
location_nameGermany                           < 2e-16 ***
location_nameGhana                             < 2e-16 ***
location_nameGreece                            < 2e-16 ***
location_nameGreenland                         < 2e-16 ***
location_nameGrenada                           < 2e-16 ***
location_nameGuam                              < 2e-16 ***
location_nameGuatemala                        0.559777    
location_nameGuinea                           1.61e-15 ***
location_nameGuinea-Bissau                    3.06e-08 ***
location_nameGuyana                            < 2e-16 ***
location_nameHaiti                            0.039348 *  
location_nameHonduras                         1.86e-07 ***
location_nameHungary                           < 2e-16 ***
location_nameIceland                           < 2e-16 ***
location_nameIndia                             < 2e-16 ***
location_nameIndonesia                         < 2e-16 ***
location_nameIran                             0.094219 .  
location_nameIraq                              < 2e-16 ***
location_nameIreland                           < 2e-16 ***
location_nameIsrael                            < 2e-16 ***
location_nameItaly                             < 2e-16 ***
location_nameJamaica                          0.410962    
location_nameJapan                             < 2e-16 ***
location_nameJordan                           2.56e-06 ***
location_nameKazakhstan                        < 2e-16 ***
location_nameKenya                            8.49e-06 ***
location_nameKiribati                         1.23e-08 ***
location_nameKuwait                            < 2e-16 ***
location_nameKyrgyzstan                        < 2e-16 ***
location_nameLaos                              < 2e-16 ***
location_nameLatvia                            < 2e-16 ***
location_nameLebanon                           < 2e-16 ***
location_nameLesotho                          0.000521 ***
location_nameLiberia                           < 2e-16 ***
location_nameLibya                            9.25e-06 ***
location_nameLithuania                         < 2e-16 ***
location_nameLuxembourg                        < 2e-16 ***
location_nameMacedonia                         < 2e-16 ***
location_nameMadagascar                       0.716457    
location_nameMalawi                            < 2e-16 ***
location_nameMalaysia                         1.65e-08 ***
location_nameMaldives                         4.64e-08 ***
location_nameMali                              < 2e-16 ***
location_nameMalta                             < 2e-16 ***
location_nameMarshall Islands                  < 2e-16 ***
location_nameMauritania                        < 2e-16 ***
location_nameMauritius                        1.98e-05 ***
location_nameMexico                            < 2e-16 ***
location_nameMoldova                           < 2e-16 ***
location_nameMongolia                          < 2e-16 ***
location_nameMontenegro                        < 2e-16 ***
location_nameMorocco                          0.024443 *  
location_nameMozambique                       1.13e-09 ***
location_nameMyanmar                          1.07e-12 ***
location_nameNamibia                           < 2e-16 ***
location_nameNepal                            0.866641    
location_nameNetherlands                       < 2e-16 ***
location_nameNew Zealand                       < 2e-16 ***
location_nameNicaragua                        1.85e-05 ***
location_nameNiger                            7.23e-16 ***
location_nameNigeria                          1.65e-12 ***
location_nameNorth Korea                       < 2e-16 ***
location_nameNorthern Mariana Islands          < 2e-16 ***
location_nameNorway                            < 2e-16 ***
location_nameOman                              < 2e-16 ***
location_namePakistan                         0.492057    
location_namePalestine                        0.996268    
location_namePanama                            < 2e-16 ***
location_namePapua New Guinea                 9.46e-14 ***
location_nameParaguay                          < 2e-16 ***
location_namePeru                             0.256190    
location_namePhilippines                       < 2e-16 ***
location_namePoland                            < 2e-16 ***
location_namePortugal                          < 2e-16 ***
location_namePuerto Rico                       < 2e-16 ***
location_nameQatar                             < 2e-16 ***
location_nameRomania                           < 2e-16 ***
location_nameRussian Federation                < 2e-16 ***
location_nameRwanda                            < 2e-16 ***
location_nameSaint Lucia                       < 2e-16 ***
location_nameSaint Vincent and the Grenadines  < 2e-16 ***
location_nameSamoa                             < 2e-16 ***
location_nameSao Tome and Principe            0.068780 .  
location_nameSaudi Arabia                     0.027843 *  
location_nameSenegal                          0.004266 ** 
location_nameSerbia                            < 2e-16 ***
location_nameSeychelles                        < 2e-16 ***
location_nameSierra Leone                      < 2e-16 ***
location_nameSingapore                         < 2e-16 ***
location_nameSlovakia                          < 2e-16 ***
location_nameSlovenia                          < 2e-16 ***
location_nameSolomon Islands                  0.704770    
location_nameSomalia                           < 2e-16 ***
location_nameSouth Africa                      < 2e-16 ***
location_nameSouth Korea                       < 2e-16 ***
location_nameSouth Sudan                      0.489664    
location_nameSpain                             < 2e-16 ***
location_nameSri Lanka                         < 2e-16 ***
location_nameSudan                            0.051132 .  
location_nameSuriname                         1.53e-15 ***
location_nameSwaziland                         < 2e-16 ***
location_nameSweden                            < 2e-16 ***
location_nameSwitzerland                       < 2e-16 ***
location_nameSyria                            0.917699    
location_nameTaiwan (Province of China)        < 2e-16 ***
location_nameTajikistan                       0.460357    
location_nameTanzania                         8.57e-08 ***
location_nameThailand                         4.66e-14 ***
location_nameThe Bahamas                       < 2e-16 ***
location_nameThe Gambia                        < 2e-16 ***
location_nameTimor-Leste                      0.003093 ** 
location_nameTogo                              < 2e-16 ***
location_nameTonga                             < 2e-16 ***
location_nameTrinidad and Tobago              1.29e-13 ***
location_nameTunisia                          0.763519    
location_nameTurkey                           0.008129 ** 
location_nameTurkmenistan                      < 2e-16 ***
location_nameUganda                           0.537669    
location_nameUkraine                           < 2e-16 ***
location_nameUnited Arab Emirates              < 2e-16 ***
location_nameUnited Kingdom                    < 2e-16 ***
location_nameUnited States                     < 2e-16 ***
location_nameUruguay                           < 2e-16 ***
location_nameUzbekistan                        < 2e-16 ***
location_nameVanuatu                           < 2e-16 ***
location_nameVenezuela                         < 2e-16 ***
location_nameVietnam                           < 2e-16 ***
location_nameVirgin Islands, U.S.              < 2e-16 ***
location_nameYemen                            8.84e-08 ***
location_nameZambia                           8.48e-06 ***
location_nameZimbabwe                         0.052472 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09235 on 5641 degrees of freedom
Multiple R-squared:   0.94, Adjusted R-squared:  0.9377 
F-statistic: 424.5 on 208 and 5641 DF,  p-value: < 2.2e-16

It’s hard to see that these results match, since the linear regression output doesn’t print the \(F\)-statistic for the age groups together or the countries together; it only gives results for individual \(t\)-tests of each regression coefficient. We can get these grouped \(F\)-statistics using the anova() function of the stats package. This function does not actually directly perform ANOVA like the aov() function, but instead prints a variance table using a lm() object.

all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  lm(log10(Relative_Percent) ~ age_group_name + location_name, data = .) %>%
  anova()
Analysis of Variance Table

Response: log10(Relative_Percent)
                 Df Sum Sq Mean Sq F value    Pr(>F)    
age_group_name   14  42.53  3.0380  356.23 < 2.2e-16 ***
location_name   194 710.57  3.6627  429.48 < 2.2e-16 ***
Residuals      5641  48.11  0.0085                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see that indeed the \(F\)-values and \(p\)-values from linear regression match those from ANOVA. In this case, this analysis suggests that there is a significant relationship between age group and consumption, even when controlling for country. It also suggests that there is a significant relationship between country and consumption, even when controlling for age group. However, only the first relationship is our relationship of interest; the second is only included in the model to account for the dependent nature of the data.

Remember, the ANOVA results indicate that the means are different across these groups, but it does not inform us about which groups are different. However, the original lm() output using the `summary()~ command gives more information about specific group differences. Remember, though, that these are relative to the reference level for the age group and location and that these values are calculated for the effect on consumption while controlling for the other variable in the model.

all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  lm(log10(Relative_Percent) ~ age_group_name + location_name, data = .) %>%
  summary()

Call:
lm(formula = log10(Relative_Percent) ~ age_group_name + location_name, 
    data = .)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.17321 -0.08990  0.04057  0.08969  0.19936 

Coefficients:
                                                Estimate Std. Error t value
(Intercept)                                    1.6473808  0.0174553  94.377
age_group_name30 to 34                         0.0013252  0.0066133   0.200
age_group_name35 to 39                         0.0017447  0.0066133   0.264
age_group_name40 to 44                        -0.0044467  0.0066133  -0.672
age_group_name45 to 49                        -0.0190883  0.0066133  -2.886
age_group_name50 to 54                        -0.0347670  0.0066133  -5.257
age_group_name55 to 59                        -0.0514702  0.0066133  -7.783
age_group_name60 to 64                        -0.0710801  0.0066133 -10.748
age_group_name65 to 69                        -0.0943452  0.0066133 -14.266
age_group_name70 to 74                        -0.1194232  0.0066133 -18.058
age_group_name75 to 79                        -0.1473520  0.0066133 -22.281
age_group_name80 to 84                        -0.2100070  0.0066133 -31.755
age_group_name85 to 89                        -0.2153687  0.0066133 -32.566
age_group_name90 to 94                        -0.2181720  0.0066133 -32.990
age_group_name95 plus                         -0.2180964  0.0066133 -32.979
location_nameAlbania                           0.4696634  0.0238444  19.697
location_nameAlgeria                           0.0524978  0.0238444   2.202
location_nameAmerican Samoa                    0.4781072  0.0238444  20.051
location_nameAndorra                           0.7709817  0.0238444  32.334
location_nameAngola                            0.2502250  0.0238444  10.494
location_nameAntigua and Barbuda               0.3148782  0.0238444  13.206
location_nameArgentina                         0.8391115  0.0238444  35.191
location_nameArmenia                           0.4143613  0.0238444  17.378
location_nameAustralia                         0.8382715  0.0238444  35.156
location_nameAustria                           0.9006351  0.0238444  37.771
location_nameAzerbaijan                        0.3013522  0.0238444  12.638
location_nameBahrain                           0.2892541  0.0238444  12.131
location_nameBangladesh                       -0.6036909  0.0238444 -25.318
location_nameBarbados                          0.3355809  0.0238444  14.074
location_nameBelarus                           0.7101162  0.0238444  29.781
location_nameBelgium                           0.7081735  0.0238444  29.700
location_nameBelize                            0.1840253  0.0238444   7.718
location_nameBenin                            -0.3251738  0.0238444 -13.637
location_nameBermuda                           0.4539095  0.0238444  19.036
location_nameBhutan                           -0.1193613  0.0238444  -5.006
location_nameBolivia                           0.5011897  0.0238444  21.019
location_nameBosnia and Herzegovina            0.1966458  0.0238444   8.247
location_nameBotswana                          0.1360033  0.0238444   5.704
location_nameBrazil                            0.7420931  0.0238444  31.122
location_nameBrunei                            0.2248900  0.0238444   9.432
location_nameBulgaria                          0.7016126  0.0238444  29.425
location_nameBurkina Faso                      0.0952272  0.0238444   3.994
location_nameBurundi                          -0.1975274  0.0238444  -8.284
location_nameCambodia                          0.1162693  0.0238444   4.876
location_nameCameroon                         -0.0547177  0.0238444  -2.295
location_nameCanada                            0.6965652  0.0238444  29.213
location_nameCape Verde                        0.4033481  0.0238444  16.916
location_nameCentral African Republic          0.3695721  0.0238444  15.499
location_nameChad                              0.0445016  0.0238444   1.866
location_nameChile                             0.6899420  0.0238444  28.935
location_nameChina                             0.6542148  0.0238444  27.437
location_nameColombia                          0.3677083  0.0238444  15.421
location_nameComoros                          -0.1092061  0.0238444  -4.580
location_nameCongo                            -0.2562887  0.0238444 -10.748
location_nameCosta Rica                        0.3922246  0.0238444  16.449
location_nameCote d'Ivoire                    -0.3878599  0.0238444 -16.266
location_nameCroatia                           0.6168738  0.0238444  25.871
location_nameCuba                              0.3715729  0.0238444  15.583
location_nameCyprus                            0.6210060  0.0238444  26.044
location_nameCzech Republic                    0.6754221  0.0238444  28.326
location_nameDemocratic Republic of the Congo -0.1381953  0.0238444  -5.796
location_nameDenmark                           0.6940034  0.0238444  29.105
location_nameDjibouti                          0.2040664  0.0238444   8.558
location_nameDominica                          0.2201500  0.0238444   9.233
location_nameDominican Republic                0.2904822  0.0238444  12.182
location_nameEcuador                           0.5001219  0.0238444  20.974
location_nameEgypt                             0.1407597  0.0238444   5.903
location_nameEl Salvador                       0.0564578  0.0238444   2.368
location_nameEquatorial Guinea                 0.4145078  0.0238444  17.384
location_nameEritrea                          -0.1427694  0.0238444  -5.988
location_nameEstonia                           0.5724487  0.0238444  24.008
location_nameEthiopia                         -0.1749948  0.0238444  -7.339
location_nameFederated States of Micronesia    0.2177319  0.0238444   9.131
location_nameFiji                              0.3063835  0.0238444  12.849
location_nameFinland                           0.6616460  0.0238444  27.748
location_nameFrance                            0.7352843  0.0238444  30.837
location_nameGabon                             0.1926532  0.0238444   8.080
location_nameGeorgia                           0.2530224  0.0238444  10.611
location_nameGermany                           0.7386742  0.0238444  30.979
location_nameGhana                            -0.3464979  0.0238444 -14.532
location_nameGreece                            0.7022224  0.0238444  29.450
location_nameGreenland                         0.7880282  0.0238444  33.049
location_nameGrenada                           0.2394215  0.0238444  10.041
location_nameGuam                              0.5937555  0.0238444  24.901
location_nameGuatemala                         0.0139062  0.0238444   0.583
location_nameGuinea                           -0.1905498  0.0238444  -7.991
location_nameGuinea-Bissau                     0.1322312  0.0238444   5.546
location_nameGuyana                           -0.2063983  0.0238444  -8.656
location_nameHaiti                            -0.0491435  0.0238444  -2.061
location_nameHonduras                          0.1244476  0.0238444   5.219
location_nameHungary                           0.6410019  0.0238444  26.883
location_nameIceland                           0.6839722  0.0238444  28.685
location_nameIndia                            -0.6388459  0.0238444 -26.792
location_nameIndonesia                        -0.2110810  0.0238444  -8.852
location_nameIran                             -0.0399115  0.0238444  -1.674
location_nameIraq                             -0.3147627  0.0238444 -13.201
location_nameIreland                           0.8205275  0.0238444  34.412
location_nameIsrael                            0.5031271  0.0238444  21.100
location_nameItaly                             0.7012419  0.0238444  29.409
location_nameJamaica                          -0.0196064  0.0238444  -0.822
location_nameJapan                             0.4163050  0.0238444  17.459
location_nameJordan                            0.1122555  0.0238444   4.708
location_nameKazakhstan                        0.6667740  0.0238444  27.964
location_nameKenya                             0.1062621  0.0238444   4.456
location_nameKiribati                          0.1360152  0.0238444   5.704
location_nameKuwait                            0.3628791  0.0238444  15.219
location_nameKyrgyzstan                        0.4299491  0.0238444  18.031
location_nameLaos                              0.2281184  0.0238444   9.567
location_nameLatvia                            0.6051096  0.0238444  25.377
location_nameLebanon                           0.3358240  0.0238444  14.084
location_nameLesotho                           0.0827870  0.0238444   3.472
location_nameLiberia                          -0.3810751  0.0238444 -15.982
location_nameLibya                             0.1058210  0.0238444   4.438
location_nameLithuania                         0.6853303  0.0238444  28.742
location_nameLuxembourg                        0.8252128  0.0238444  34.608
location_nameMacedonia                         0.3375009  0.0238444  14.154
location_nameMadagascar                        0.0086607  0.0238444   0.363
location_nameMalawi                           -0.2390062  0.0238444 -10.024
location_nameMalaysia                          0.1348008  0.0238444   5.653
location_nameMaldives                         -0.1304736  0.0238444  -5.472
location_nameMali                              0.2545834  0.0238444  10.677
location_nameMalta                             0.6861462  0.0238444  28.776
location_nameMarshall Islands                  0.2453513  0.0238444  10.290
location_nameMauritania                        0.2586036  0.0238444  10.845
location_nameMauritius                         0.1018209  0.0238444   4.270
location_nameMexico                            0.5019364  0.0238444  21.050
location_nameMoldova                           0.2694136  0.0238444  11.299
location_nameMongolia                          0.8050147  0.0238444  33.761
location_nameMontenegro                        0.7197936  0.0238444  30.187
location_nameMorocco                           0.0536665  0.0238444   2.251
location_nameMozambique                       -0.1454548  0.0238444  -6.100
location_nameMyanmar                           0.1701756  0.0238444   7.137
location_nameNamibia                           0.2766865  0.0238444  11.604
location_nameNepal                             0.0040043  0.0238444   0.168
location_nameNetherlands                       0.7182009  0.0238444  30.120
location_nameNew Zealand                       0.8215769  0.0238444  34.456
location_nameNicaragua                        -0.1022029  0.0238444  -4.286
location_nameNiger                             0.1929072  0.0238444   8.090
location_nameNigeria                          -0.1687456  0.0238444  -7.077
location_nameNorth Korea                      -0.2308157  0.0238444  -9.680
location_nameNorthern Mariana Islands          0.4519371  0.0238444  18.954
location_nameNorway                            0.6802116  0.0238444  28.527
location_nameOman                              0.3715916  0.0238444  15.584
location_namePakistan                          0.0163832  0.0238444   0.687
location_namePalestine                        -0.0001115  0.0238444  -0.005
location_namePanama                            0.4910430  0.0238444  20.594
location_namePapua New Guinea                  0.1780454  0.0238444   7.467
location_nameParaguay                          0.6105228  0.0238444  25.604
location_namePeru                             -0.0270767  0.0238444  -1.136
location_namePhilippines                       0.3489508  0.0238444  14.634
location_namePoland                            0.7119822  0.0238444  29.859
location_namePortugal                          0.7651172  0.0238444  32.088
location_namePuerto Rico                       0.4442532  0.0238444  18.631
location_nameQatar                             0.4226739  0.0238444  17.726
location_nameRomania                           0.6732206  0.0238444  28.234
location_nameRussian Federation                0.5601442  0.0238444  23.492
location_nameRwanda                           -0.2771514  0.0238444 -11.623
location_nameSaint Lucia                       0.2295368  0.0238444   9.626
location_nameSaint Vincent and the Grenadines  0.2179043  0.0238444   9.139
location_nameSamoa                             0.2797823  0.0238444  11.734
location_nameSao Tome and Principe            -0.0434018  0.0238444  -1.820
location_nameSaudi Arabia                      0.0524593  0.0238444   2.200
location_nameSenegal                          -0.0681700  0.0238444  -2.859
location_nameSerbia                            0.5359842  0.0238444  22.478
location_nameSeychelles                        0.3146719  0.0238444  13.197
location_nameSierra Leone                     -0.5214405  0.0238444 -21.868
location_nameSingapore                         0.5946425  0.0238444  24.938
location_nameSlovakia                          0.6438632  0.0238444  27.003
location_nameSlovenia                          0.7116904  0.0238444  29.847
location_nameSolomon Islands                  -0.0090348  0.0238444  -0.379
location_nameSomalia                          -0.2564213  0.0238444 -10.754
location_nameSouth Africa                      0.4267033  0.0238444  17.895
location_nameSouth Korea                       0.5949378  0.0238444  24.951
location_nameSouth Sudan                       0.0164738  0.0238444   0.691
location_nameSpain                             0.7138244  0.0238444  29.937
location_nameSri Lanka                        -0.7330116  0.0238444 -30.741
location_nameSudan                            -0.0465153  0.0238444  -1.951
location_nameSuriname                          0.1906880  0.0238444   7.997
location_nameSwaziland                         0.3040096  0.0238444  12.750
location_nameSweden                            0.7247779  0.0238444  30.396
location_nameSwitzerland                       0.7281804  0.0238444  30.539
location_nameSyria                            -0.0024640  0.0238444  -0.103
location_nameTaiwan (Province of China)        0.6900827  0.0238444  28.941
location_nameTajikistan                        0.0176046  0.0238444   0.738
location_nameTanzania                         -0.1278459  0.0238444  -5.362
location_nameThailand                          0.1802762  0.0238444   7.561
location_nameThe Bahamas                       0.5890517  0.0238444  24.704
location_nameThe Gambia                       -0.3863879  0.0238444 -16.205
location_nameTimor-Leste                       0.0705710  0.0238444   2.960
location_nameTogo                             -0.3083612  0.0238444 -12.932
location_nameTonga                             0.3002675  0.0238444  12.593
location_nameTrinidad and Tobago               0.1770619  0.0238444   7.426
location_nameTunisia                           0.0071742  0.0238444   0.301
location_nameTurkey                            0.0631309  0.0238444   2.648
location_nameTurkmenistan                      0.7290156  0.0238444  30.574
location_nameUganda                            0.0146972  0.0238444   0.616
location_nameUkraine                           0.3932282  0.0238444  16.491
location_nameUnited Arab Emirates              0.2716685  0.0238444  11.393
location_nameUnited Kingdom                    0.5932920  0.0238444  24.882
location_nameUnited States                     0.7390080  0.0238444  30.993
location_nameUruguay                           0.6597541  0.0238444  27.669
location_nameUzbekistan                        0.4620568  0.0238444  19.378
location_nameVanuatu                           0.1977037  0.0238444   8.291
location_nameVenezuela                         0.4446078  0.0238444  18.646
location_nameVietnam                           0.5087180  0.0238444  21.335
location_nameVirgin Islands, U.S.              0.4675163  0.0238444  19.607
location_nameYemen                            -0.1277139  0.0238444  -5.356
location_nameZambia                           -0.1062726  0.0238444  -4.457
location_nameZimbabwe                          0.0462498  0.0238444   1.940
                                              Pr(>|t|)    
(Intercept)                                    < 2e-16 ***
age_group_name30 to 34                        0.841191    
age_group_name35 to 39                        0.791927    
age_group_name40 to 44                        0.501365    
age_group_name45 to 49                        0.003912 ** 
age_group_name50 to 54                        1.52e-07 ***
age_group_name55 to 59                        8.37e-15 ***
age_group_name60 to 64                         < 2e-16 ***
age_group_name65 to 69                         < 2e-16 ***
age_group_name70 to 74                         < 2e-16 ***
age_group_name75 to 79                         < 2e-16 ***
age_group_name80 to 84                         < 2e-16 ***
age_group_name85 to 89                         < 2e-16 ***
age_group_name90 to 94                         < 2e-16 ***
age_group_name95 plus                          < 2e-16 ***
location_nameAlbania                           < 2e-16 ***
location_nameAlgeria                          0.027728 *  
location_nameAmerican Samoa                    < 2e-16 ***
location_nameAndorra                           < 2e-16 ***
location_nameAngola                            < 2e-16 ***
location_nameAntigua and Barbuda               < 2e-16 ***
location_nameArgentina                         < 2e-16 ***
location_nameArmenia                           < 2e-16 ***
location_nameAustralia                         < 2e-16 ***
location_nameAustria                           < 2e-16 ***
location_nameAzerbaijan                        < 2e-16 ***
location_nameBahrain                           < 2e-16 ***
location_nameBangladesh                        < 2e-16 ***
location_nameBarbados                          < 2e-16 ***
location_nameBelarus                           < 2e-16 ***
location_nameBelgium                           < 2e-16 ***
location_nameBelize                           1.39e-14 ***
location_nameBenin                             < 2e-16 ***
location_nameBermuda                           < 2e-16 ***
location_nameBhutan                           5.73e-07 ***
location_nameBolivia                           < 2e-16 ***
location_nameBosnia and Herzegovina            < 2e-16 ***
location_nameBotswana                         1.23e-08 ***
location_nameBrazil                            < 2e-16 ***
location_nameBrunei                            < 2e-16 ***
location_nameBulgaria                          < 2e-16 ***
location_nameBurkina Faso                     6.59e-05 ***
location_nameBurundi                           < 2e-16 ***
location_nameCambodia                         1.11e-06 ***
location_nameCameroon                         0.021782 *  
location_nameCanada                            < 2e-16 ***
location_nameCape Verde                        < 2e-16 ***
location_nameCentral African Republic          < 2e-16 ***
location_nameChad                             0.062047 .  
location_nameChile                             < 2e-16 ***
location_nameChina                             < 2e-16 ***
location_nameColombia                          < 2e-16 ***
location_nameComoros                          4.75e-06 ***
location_nameCongo                             < 2e-16 ***
location_nameCosta Rica                        < 2e-16 ***
location_nameCote d'Ivoire                     < 2e-16 ***
location_nameCroatia                           < 2e-16 ***
location_nameCuba                              < 2e-16 ***
location_nameCyprus                            < 2e-16 ***
location_nameCzech Republic                    < 2e-16 ***
location_nameDemocratic Republic of the Congo 7.17e-09 ***
location_nameDenmark                           < 2e-16 ***
location_nameDjibouti                          < 2e-16 ***
location_nameDominica                          < 2e-16 ***
location_nameDominican Republic                < 2e-16 ***
location_nameEcuador                           < 2e-16 ***
location_nameEgypt                            3.77e-09 ***
location_nameEl Salvador                      0.017930 *  
location_nameEquatorial Guinea                 < 2e-16 ***
location_nameEritrea                          2.26e-09 ***
location_nameEstonia                           < 2e-16 ***
location_nameEthiopia                         2.46e-13 ***
location_nameFederated States of Micronesia    < 2e-16 ***
location_nameFiji                              < 2e-16 ***
location_nameFinland                           < 2e-16 ***
location_nameFrance                            < 2e-16 ***
location_nameGabon                            7.88e-16 ***
location_nameGeorgia                           < 2e-16 ***
location_nameGermany                           < 2e-16 ***
location_nameGhana                             < 2e-16 ***
location_nameGreece                            < 2e-16 ***
location_nameGreenland                         < 2e-16 ***
location_nameGrenada                           < 2e-16 ***
location_nameGuam                              < 2e-16 ***
location_nameGuatemala                        0.559777    
location_nameGuinea                           1.61e-15 ***
location_nameGuinea-Bissau                    3.06e-08 ***
location_nameGuyana                            < 2e-16 ***
location_nameHaiti                            0.039348 *  
location_nameHonduras                         1.86e-07 ***
location_nameHungary                           < 2e-16 ***
location_nameIceland                           < 2e-16 ***
location_nameIndia                             < 2e-16 ***
location_nameIndonesia                         < 2e-16 ***
location_nameIran                             0.094219 .  
location_nameIraq                              < 2e-16 ***
location_nameIreland                           < 2e-16 ***
location_nameIsrael                            < 2e-16 ***
location_nameItaly                             < 2e-16 ***
location_nameJamaica                          0.410962    
location_nameJapan                             < 2e-16 ***
location_nameJordan                           2.56e-06 ***
location_nameKazakhstan                        < 2e-16 ***
location_nameKenya                            8.49e-06 ***
location_nameKiribati                         1.23e-08 ***
location_nameKuwait                            < 2e-16 ***
location_nameKyrgyzstan                        < 2e-16 ***
location_nameLaos                              < 2e-16 ***
location_nameLatvia                            < 2e-16 ***
location_nameLebanon                           < 2e-16 ***
location_nameLesotho                          0.000521 ***
location_nameLiberia                           < 2e-16 ***
location_nameLibya                            9.25e-06 ***
location_nameLithuania                         < 2e-16 ***
location_nameLuxembourg                        < 2e-16 ***
location_nameMacedonia                         < 2e-16 ***
location_nameMadagascar                       0.716457    
location_nameMalawi                            < 2e-16 ***
location_nameMalaysia                         1.65e-08 ***
location_nameMaldives                         4.64e-08 ***
location_nameMali                              < 2e-16 ***
location_nameMalta                             < 2e-16 ***
location_nameMarshall Islands                  < 2e-16 ***
location_nameMauritania                        < 2e-16 ***
location_nameMauritius                        1.98e-05 ***
location_nameMexico                            < 2e-16 ***
location_nameMoldova                           < 2e-16 ***
location_nameMongolia                          < 2e-16 ***
location_nameMontenegro                        < 2e-16 ***
location_nameMorocco                          0.024443 *  
location_nameMozambique                       1.13e-09 ***
location_nameMyanmar                          1.07e-12 ***
location_nameNamibia                           < 2e-16 ***
location_nameNepal                            0.866641    
location_nameNetherlands                       < 2e-16 ***
location_nameNew Zealand                       < 2e-16 ***
location_nameNicaragua                        1.85e-05 ***
location_nameNiger                            7.23e-16 ***
location_nameNigeria                          1.65e-12 ***
location_nameNorth Korea                       < 2e-16 ***
location_nameNorthern Mariana Islands          < 2e-16 ***
location_nameNorway                            < 2e-16 ***
location_nameOman                              < 2e-16 ***
location_namePakistan                         0.492057    
location_namePalestine                        0.996268    
location_namePanama                            < 2e-16 ***
location_namePapua New Guinea                 9.46e-14 ***
location_nameParaguay                          < 2e-16 ***
location_namePeru                             0.256190    
location_namePhilippines                       < 2e-16 ***
location_namePoland                            < 2e-16 ***
location_namePortugal                          < 2e-16 ***
location_namePuerto Rico                       < 2e-16 ***
location_nameQatar                             < 2e-16 ***
location_nameRomania                           < 2e-16 ***
location_nameRussian Federation                < 2e-16 ***
location_nameRwanda                            < 2e-16 ***
location_nameSaint Lucia                       < 2e-16 ***
location_nameSaint Vincent and the Grenadines  < 2e-16 ***
location_nameSamoa                             < 2e-16 ***
location_nameSao Tome and Principe            0.068780 .  
location_nameSaudi Arabia                     0.027843 *  
location_nameSenegal                          0.004266 ** 
location_nameSerbia                            < 2e-16 ***
location_nameSeychelles                        < 2e-16 ***
location_nameSierra Leone                      < 2e-16 ***
location_nameSingapore                         < 2e-16 ***
location_nameSlovakia                          < 2e-16 ***
location_nameSlovenia                          < 2e-16 ***
location_nameSolomon Islands                  0.704770    
location_nameSomalia                           < 2e-16 ***
location_nameSouth Africa                      < 2e-16 ***
location_nameSouth Korea                       < 2e-16 ***
location_nameSouth Sudan                      0.489664    
location_nameSpain                             < 2e-16 ***
location_nameSri Lanka                         < 2e-16 ***
location_nameSudan                            0.051132 .  
location_nameSuriname                         1.53e-15 ***
location_nameSwaziland                         < 2e-16 ***
location_nameSweden                            < 2e-16 ***
location_nameSwitzerland                       < 2e-16 ***
location_nameSyria                            0.917699    
location_nameTaiwan (Province of China)        < 2e-16 ***
location_nameTajikistan                       0.460357    
location_nameTanzania                         8.57e-08 ***
location_nameThailand                         4.66e-14 ***
location_nameThe Bahamas                       < 2e-16 ***
location_nameThe Gambia                        < 2e-16 ***
location_nameTimor-Leste                      0.003093 ** 
location_nameTogo                              < 2e-16 ***
location_nameTonga                             < 2e-16 ***
location_nameTrinidad and Tobago              1.29e-13 ***
location_nameTunisia                          0.763519    
location_nameTurkey                           0.008129 ** 
location_nameTurkmenistan                      < 2e-16 ***
location_nameUganda                           0.537669    
location_nameUkraine                           < 2e-16 ***
location_nameUnited Arab Emirates              < 2e-16 ***
location_nameUnited Kingdom                    < 2e-16 ***
location_nameUnited States                     < 2e-16 ***
location_nameUruguay                           < 2e-16 ***
location_nameUzbekistan                        < 2e-16 ***
location_nameVanuatu                           < 2e-16 ***
location_nameVenezuela                         < 2e-16 ***
location_nameVietnam                           < 2e-16 ***
location_nameVirgin Islands, U.S.              < 2e-16 ***
location_nameYemen                            8.84e-08 ***
location_nameZambia                           8.48e-06 ***
location_nameZimbabwe                         0.052472 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09235 on 5641 degrees of freedom
Multiple R-squared:   0.94, Adjusted R-squared:  0.9377 
F-statistic: 424.5 on 208 and 5641 DF,  p-value: < 2.2e-16

ANOVA and linear regression with mixed effects

We could instead perform a similar analysis as we did for the two group analysis where we controlled for the paired data structure using a random effect based on country In particular, we could include a random intercept for country. We could do this within the aov() function using Error() and within the lmer() function with 1 | variable_name.

all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  aov(
    log10(Relative_Percent) ~ age_group_name + Error(location_name),
    data = .
  ) %>%
  summary()

Error: location_name
           Df Sum Sq Mean Sq F value Pr(>F)
Residuals 194  710.6   3.663               

Error: Within
                 Df Sum Sq Mean Sq F value Pr(>F)    
age_group_name   14  42.53  3.0380   356.2 <2e-16 ***
Residuals      5641  48.11  0.0085                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  lmer(
    log10(Relative_Percent) ~ age_group_name + (1 | location_name),
    data = .
  ) %>%
  summary()
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: log10(Relative_Percent) ~ age_group_name + (1 | location_name)
   Data: .

REML criterion at convergence: -9975.4

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.8675 -0.9750  0.4451  0.9700  2.1669 

Random effects:
 Groups        Name        Variance Std.Dev.
 location_name (Intercept) 0.121806 0.34901 
 Residual                  0.008528 0.09235 
Number of obs: 5850, groups:  location_name, 195

Fixed effects:
                         Estimate Std. Error         df t value Pr(>|t|)    
(Intercept)             1.921e+00  2.543e-02  2.068e+02  75.558  < 2e-16 ***
age_group_name30 to 34  1.325e-03  6.613e-03  5.641e+03   0.200  0.84119    
age_group_name35 to 39  1.745e-03  6.613e-03  5.641e+03   0.264  0.79193    
age_group_name40 to 44 -4.447e-03  6.613e-03  5.641e+03  -0.672  0.50136    
age_group_name45 to 49 -1.909e-02  6.613e-03  5.641e+03  -2.886  0.00391 ** 
age_group_name50 to 54 -3.477e-02  6.613e-03  5.641e+03  -5.257 1.52e-07 ***
age_group_name55 to 59 -5.147e-02  6.613e-03  5.641e+03  -7.783 8.37e-15 ***
age_group_name60 to 64 -7.108e-02  6.613e-03  5.641e+03 -10.748  < 2e-16 ***
age_group_name65 to 69 -9.434e-02  6.613e-03  5.641e+03 -14.266  < 2e-16 ***
age_group_name70 to 74 -1.194e-01  6.613e-03  5.641e+03 -18.058  < 2e-16 ***
age_group_name75 to 79 -1.474e-01  6.613e-03  5.641e+03 -22.281  < 2e-16 ***
age_group_name80 to 84 -2.100e-01  6.613e-03  5.641e+03 -31.755  < 2e-16 ***
age_group_name85 to 89 -2.154e-01  6.613e-03  5.641e+03 -32.566  < 2e-16 ***
age_group_name90 to 94 -2.182e-01  6.613e-03  5.641e+03 -32.990  < 2e-16 ***
age_group_name95 plus  -2.181e-01  6.613e-03  5.641e+03 -32.979  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice now the results only show for the age group variable, since this is the only fixed effect in the model. However, dependence in the data due to country is still accounted for through the random effect.

If we use anova() instead of summary() for our lmer() model, we can see they give the same results.

all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  aov(
    log10(Relative_Percent) ~ age_group_name + Error(location_name),
    data = .
  ) %>%
  summary()

Error: location_name
           Df Sum Sq Mean Sq F value Pr(>F)
Residuals 194  710.6   3.663               

Error: Within
                 Df Sum Sq Mean Sq F value Pr(>F)    
age_group_name   14  42.53  3.0380   356.2 <2e-16 ***
Residuals      5641  48.11  0.0085                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  lmer(
    log10(Relative_Percent) ~ age_group_name + (1 | location_name),
    data = .
  ) %>%
  anova()
Type III Analysis of Variance Table with Satterthwaite's method
               Sum Sq Mean Sq NumDF DenDF F value    Pr(>F)    
age_group_name 42.532   3.038    14  5641  356.23 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Modeling all groups of interest


Now we can extend out model to include include sex, age_group_name and location_name all in the same linear model and get information about how each of these factors influences dietary consumption, while accounting for the other factors. Since we are primarily interested in the effects of sex and age, but want to account for the dependence in the data due to repeated measurements by country, we will include sex and age_group_name as fixed effects and incorporate a random intercept for location_name.

all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  lmer(
    log10(Relative_Percent) ~ sex + age_group_name + (1 | location_name),
    data = .
  ) %>%
  anova()
Type III Analysis of Variance Table with Satterthwaite's method
               Sum Sq Mean Sq NumDF DenDF F value    Pr(>F)    
sex            47.603  47.603     1  5640  530948 < 2.2e-16 ***
age_group_name 42.532   3.038    14  5640   33885 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  lmer(
    log10(Relative_Percent) ~ sex + age_group_name + (1 | location_name),
    data = .
  ) %>%
  summary()
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: log10(Relative_Percent) ~ sex + age_group_name + (1 | location_name)
   Data: .

REML criterion at convergence: -35657.3

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-8.7650 -0.4242 -0.0168  0.4071 17.2844 

Random effects:
 Groups        Name        Variance  Std.Dev.
 location_name (Intercept) 1.221e-01 0.349411
 Residual                  8.966e-05 0.009469
Number of obs: 5850, groups:  location_name, 195

Fixed effects:
                         Estimate Std. Error         df  t value Pr(>|t|)    
(Intercept)             1.831e+00  2.503e-02  1.941e+02   73.161  < 2e-16 ***
sexMale                 1.804e-01  2.476e-04  5.640e+03  728.662  < 2e-16 ***
age_group_name30 to 34  1.325e-03  6.781e-04  5.640e+03    1.954   0.0507 .  
age_group_name35 to 39  1.745e-03  6.781e-04  5.640e+03    2.573   0.0101 *  
age_group_name40 to 44 -4.447e-03  6.781e-04  5.640e+03   -6.558 5.94e-11 ***
age_group_name45 to 49 -1.909e-02  6.781e-04  5.640e+03  -28.151  < 2e-16 ***
age_group_name50 to 54 -3.477e-02  6.781e-04  5.640e+03  -51.274  < 2e-16 ***
age_group_name55 to 59 -5.147e-02  6.781e-04  5.640e+03  -75.907  < 2e-16 ***
age_group_name60 to 64 -7.108e-02  6.781e-04  5.640e+03 -104.828  < 2e-16 ***
age_group_name65 to 69 -9.435e-02  6.781e-04  5.640e+03 -139.139  < 2e-16 ***
age_group_name70 to 74 -1.194e-01  6.781e-04  5.640e+03 -176.123  < 2e-16 ***
age_group_name75 to 79 -1.474e-01  6.781e-04  5.640e+03 -217.312  < 2e-16 ***
age_group_name80 to 84 -2.100e-01  6.781e-04  5.640e+03 -309.714  < 2e-16 ***
age_group_name85 to 89 -2.154e-01  6.781e-04  5.640e+03 -317.622  < 2e-16 ***
age_group_name90 to 94 -2.182e-01  6.781e-04  5.640e+03 -321.756  < 2e-16 ***
age_group_name95 plus  -2.181e-01  6.781e-04  5.640e+03 -321.644  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Looking at the anova() output, we can see that sex and age group both have significant associations with the consumption of red meat, when controlling for the other variable. Additionally, by looking at the individual coefficient estimates in the summary() output, we see that males tend to have higher red meat consumption compared to females (positive coefficient for sexMale) and that consumption seems to decrease with increasing age (negative coefficients for all the age group categories that appear to become larger in magnitude with increasing age).

Data Visualization


If you have been following along but stopped you could load the wrangled data like so:

load(here::here("data", "wrangled", "wrangled_data.rda"))

If you are starting the case study at this section click here.

First you need to install and load the OCSdata package:

install.packages("OCSdata")
library(OCSdata)

Then, you may load the wrangled data using the following code:

wrangled_rda("ocs-bp-diet", outpath = getwd())
load(here::here("OCSdata", "data", "wrangled", "wrangled_data.rda"))

If the package does not work for you, alternatively, an RDA file (stands for R data) of the data can be found here or slightly more directly here. Download this file and then place it in your current working directory within a subdirectory called “wrangled” within a subdirectory called “data” to copy and paste our code. We used an RStudio project and the here package to navigate to the file more easily.

load(here::here("data", "wrangled", "wrangled_data.rda"))

Click here to see more about creating new projects in RStudio.

You can create a project by going to the File menu of RStudio like so:

You can also do so by clicking the project button:

See here to learn more about using RStudio projects and here to learn more about the here package.



Now that we have statistically analyzed the consumption of red meat based on the location, sex, and age group of different populations around the world. Let’s make some visualizations to help with our interpretations.

Red Meat


Let’s try to make a plot that shows the relationship of age group, sex, and location on consumption of red meat.

First we will filter our data for only the data associated with red meat, and then we will create a box plot graph with age group as the x axis, but include box plots for each sex for each age group. We can include an additional subplot to just look at the relationship of sex and consumption. Recall that the ggplot2 package is very useful for making figures and uses a layering structure to make plots using the + between layers.

all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  ggplot(aes(
    x = age_group_name,
    y = Relative_Percent,
    col = sex
  )) +
  geom_boxplot() +
  # this adds the individual points for the sex comparison
  geom_jitter(aes(
    x = sex,
    y = Relative_Percent
  ),
  # width specifies how wide the points will be plotted
  width = .2,
  size = 2,
  shape = 21
  ) +
  # this angles the x axis text and removes the legend
  theme(
    axis.text.x = element_text(
      angle = 70,
      hjust = 1
    ),
    legend.position = "none"
  )

OK, this is pretty good, but we can do better.

Let’s try specifically looking at the countries that over-consumed red meat. We can look at these countries by filtering our data where Relative_Percent was greater than 100%. Now we will overlap the jitter points and the box plot using the position_jitterdoge() as the position in geom_pont(). In order to not obscure our box plots, we can use the argument alpha to make our jitter points more transparent.

all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  filter(Relative_Percent > 100) %>%
  ggplot(aes(
    y = Relative_Percent,
    x = age_group_name,
    fill = sex
  )) +
  # this position option will separate the points by sex
  # this is determined by the fill argument in the ggplot() function
  # could also use col argument but it would change the style a bit
  geom_point(
    position = position_jitterdodge(),
    aes(col = sex),
    alpha = 3 / 10
  ) +
  geom_boxplot(outlier.shape = NA) +
  theme(axis.text.x = element_text(
    angle = 70,
    hjust = 1
  ))

What are the countries that have such high consumption rates?

all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  filter(Relative_Percent > 1000)
# A tibble: 0 x 20
# ... with 20 variables: year_id <dbl>, location_name <chr>, food <chr>,
#   age_group_id <dbl>, age_group_name <chr>, sex <chr>, parameter <chr>,
#   mean <dbl>, upper <dbl>, lower <dbl>, unit <chr>, direction <chr>,
#   lower_optimal <dbl>, optimal <dbl>, upper_optimal <dbl>,
#   unit_optimal <chr>, Relative_Percent <dbl>, range_percent <dbl>,
#   percent_over_under <dbl>, opt_achieved <chr>

Looks like the males in Laos and Timor_Leste have the highest consumption.

Now let’s plot just the populations that eat less than the optimal amount by filtering for Relative_Percent < 100%.

all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  filter(Relative_Percent < 100) %>%
  ggplot(aes(
    y = Relative_Percent,
    x = age_group_name,
    fill = sex
  )) +
  geom_point(
    position = position_jitterdodge(),
    aes(col = sex),
    alpha = 3 / 10
  ) +
  geom_boxplot(outlier.shape = NA) +
  theme(axis.text.x = element_text(
    angle = 70,
    hjust = 1
  ))

Nice! It would be nice to be able to know what countries each data point corresponds to. One way to do this is using a package called ggiraph. This package is really helpful for creating interactive graphs.

We will use the geom_point_interactive() function to allow us to hover over points to display the country name. We indicate what label we want with the tooltip argument. This function is similar to the normal geom_point() function. Thus, we will include the same arguments as before. However, we will also split the male and female data using facet_wrap() to make things a bit less overwhelming.

Notice that we are creating a plot object before we use the geom_point_interactive().

We are also rendering the plot with the girafe() function of the ggiraph package.

g <- all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  filter(Relative_Percent < 100) %>%
  ggplot(aes(
    y = Relative_Percent,
    x = age_group_name,
    fill = sex
  )) +
  geom_boxplot(outlier.shape = NA) +
  facet_wrap(~sex) +
  theme(axis.text.x = element_text(
    angle = 70,
    hjust = 1
  ))

g <- g + geom_point_interactive(aes(
  color = sex,
  tooltip = location_name
),
size = 2,
position = position_jitterdodge(),
alpha = 3 / 10
)

girafe(code = print(g))

Cool!

From this plot we can see the countries with populations that do well by not over-consuming red meat, (as over-consumption is associated with health risk). We see that different countries greatly vary, we can see that overall younger populations appear to consume more red meat, and men appear to consume red meat.

Let’s do the same thing for the over-consuming countries. We can also take this one step further to show all the points for the same country when we hover over one data point by using the data_id argument of the geom_point_interactive() function.

We can also add links to Wikipedia pages for these countries using the onclick argument. See this link for more information on using the ggirpah package. We will use the base sprintf() function to format our urls for the Wikipedia links into C style to open a new tab for the link when a user clicks on the figure.

all_age_diet_and_guidelines %<>%
  mutate(link = sprintf(
    "window.open(\"%s%s\")",
    "http://en.wikipedia.org/wiki/",
    as.character(pull(
      all_age_diet_and_guidelines,
      location_name
    ))
  ))

g <- all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  filter(Relative_Percent > 100) %>%
  ggplot(aes(
    y = Relative_Percent,
    x = age_group_name,
    fill = sex
  )) +
  geom_boxplot(outlier.shape = NA) +
  facet_wrap(~sex) +
  theme(
    legend.position = "none",
    axis.text.x = element_text(
      angle = 70,
      hjust = 1
    )
  ) +
  expand_limits(y = 99)


g <- g + geom_point_interactive(aes(
  color = sex,
  tooltip = location_name,
  data_id = location_name,
  onclick = link
),
size = 2,
position = position_jitterdodge(),
alpha = 3 / 10
)

g <- g + geom_point_interactive(
  data =
    all_age_diet_and_guidelines %>%
      filter(food == "red meat") %>%
      filter(Relative_Percent > 100) %>%
      filter(location_name == "United States"),
  aes(
    fill = location_name,
    tooltip = location_name,
    data_id = location_name,
    onclick = link
  ),
  size = 4,
  position = position_jitterdodge(),
  alpha = 5 / 10,
  color = "black"
)

girafe(code = print(g))

United Sates Data


Now let’s take a look at the US data specifically.

diet_and_guidelines %>%
  filter(location_name == "United States") %>%
  count(opt_achieved)
# A tibble: 2 x 2
  opt_achieved     n
  <chr>        <int>
1 No              27
2 Yes              3

OK, it looks like optimal consumption levels were achieved for only 10% of the dietary factors.

Let’s look at males and females separately:

Question Opportunity

Can you come up with the code for how you would do this?


Click here to reveal the code.
diet_and_guidelines %>%
  filter(
    sex == "Male",
    location_name == "United States"
  ) %>%
  count(opt_achieved, food) %>%
  arrange(food)

diet_and_guidelines %>%
  filter(
    sex == "Female",
    location_name == "United States"
  ) %>%
  count(opt_achieved, food) %>%
  arrange(food)

For males:

# A tibble: 15 x 3
   opt_achieved food                            n
   <chr>        <chr>                       <int>
 1 No           calcium                         1
 2 No           fiber                           1
 3 No           fruits                          1
 4 No           legumes                         1
 5 No           milk                            1
 6 No           nuts and seeds                  1
 7 No           polyunsaturated fatty acids     1
 8 No           processed meat                  1
 9 No           red meat                        1
10 No           seafood omega-3 fatty acids     1
11 No           sodium                          1
12 No           sugar-sweetened beverages       1
13 Yes          trans fatty acids               1
14 No           vegetables                      1
15 No           whole grains                    1

For females:

diet_and_guidelines %>%
  filter(
    sex == "Female",
    location_name == "United States"
  ) %>%
  count(opt_achieved, food) %>%
  arrange(food)
# A tibble: 15 x 3
   opt_achieved food                            n
   <chr>        <chr>                       <int>
 1 No           calcium                         1
 2 No           fiber                           1
 3 No           fruits                          1
 4 No           legumes                         1
 5 No           milk                            1
 6 No           nuts and seeds                  1
 7 No           polyunsaturated fatty acids     1
 8 No           processed meat                  1
 9 No           red meat                        1
10 No           seafood omega-3 fatty acids     1
11 Yes          sodium                          1
12 No           sugar-sweetened beverages       1
13 Yes          trans fatty acids               1
14 No           vegetables                      1
15 No           whole grains                    1

So females are a bit better about not over-consuming sodium in the United States relative to males. Both groups are doing well with avoiding trans fatty acids. Let’s look more closely at which dietary components have high and low consumption in the United States:

all_age_diet_and_guidelines %>%
  filter(location_name == "United States") %>%
  ggplot(aes(
    y = Relative_Percent,
    x = food,
    fill = sex
  )) +
  theme(axis.text.x = element_text(
    angle = 70,
    hjust = 1
  )) +
  facet_wrap(~direction, scales = "free") +
  geom_boxplot() +
  geom_point(
    position = position_jitterdodge(),
    alpha = 3 / 10
  )

OK, so we can indeed see that overall consumption of sodium and trans fatty acids is pretty close to optimal. So that’s great. However, Both males and females are over-consuming processed meat, red meat, and sugar-sweetened beverages. On the other hand both genders are not getting adequate intake of all the other dietary factors for optimal health. The population in the United states has especially poor intake of polyunsaturated fats. it also looks like in most cases females are getting less of the dietary factors that pose health risks when under-consumed, with the exception of fruits.

How about if we look at age groups. First let’s look at the dietary components with that were over-consumed in the United States.

Question Opportunity

Can you come up with the code for this on your own?


Click here to reveal the code.

We will also move our legend to the bottom of the plot using the theme() function of the ggplot2 package, like so:

plot_age_groups <- all_age_diet_and_guidelines %>%
  filter(
    location_name == "United States",
    direction == "high"
  ) %>%
  ggplot(aes(
    y = Relative_Percent,
    x = food,
    fill = age_group_name
  )) +
  facet_wrap(~food, scales = "free") +
  geom_boxplot() +
  theme(legend.position = "bottom")

plot_age_groups

OK! It looks like age really influences the consumption of these dietary factors. With the exception of trans fatty acids, the consumption of all of these dietary factors seems to decrease with age. Let’s also use thescale_fill_viridis() function of the viridis package to change the colors of our plot. This package uses palettes of colors that are discernible for individuals who are colorblind.

all_age_diet_and_guidelines %>%
  filter(
    location_name == "United States",
    direction == "high"
  ) %>%
  ggplot(aes(
    y = Relative_Percent,
    x = food,
    fill = age_group_name
  )) +
  facet_wrap(~food, scales = "free") +
  geom_boxplot() +
  # change the colors from rainbow to purple/green/yellow
  scale_fill_viridis(discrete = TRUE) +
  theme_linedraw() +
  theme(
    strip.text = element_text(size = 8, face = "bold"),
    axis.text.x = element_blank(),
    axis.title.x = element_blank(),
    legend.position = "bottom"
  )

Nice!

Now let’s look at the dietary factors that when consumed at low levels increase health risk:

Question Opportunity

Again, see if you come up with the code for this on your own?


Click here to reveal the code.
low_foods_plot <- all_age_diet_and_guidelines %>%
  filter(
    location_name == "United States",
    direction == "low"
  ) %>%
  ggplot(aes(
    y = Relative_Percent,
    x = food,
    fill = age_group_name
  )) +
  facet_wrap(~food, scales = "free") +
  geom_boxplot() +
  # change the colors from rainbow to purple/green/yellow
  scale_fill_viridis(discrete = TRUE) +
  theme_linedraw() +
  theme(
    strip.text = element_text(size = 7, face = "bold"),
    axis.text.x = element_blank(),
    axis.title.x = element_blank(),
    legend.position = "bottom"
  )

low_foods_plot

Interesting, we see that for the foods that are over consumed (processed meat, red meat, sodium, and sugar-sweetened beverages), consumption appears to decrease with age. For the foods that are under consumed, many appear to rise and fall with age.

Summary


Synopsis


We have evaluated average consumption estimates of 15 dietary factors with probably non-communicable disease (NCD) risk from 195 different countries around the world. To do so we imported data from a PDF using the pdftools package, as well as data from two CSV files using readr. We used tidyverse packages such as dplyr, stringr, and tidy to clean and join the data from the PDF with the CSV files.

We learned that regression is a powerful and flexible statistical tool that simplifies or estimates the relationships between variables using a mathematical model. We learned about the utility of regression techniques to compare groups, look for associations between variables, and predict outcomes based on multiple predictor or explanatory variables. We then compared this to other popular tests like the \(t\)-test and the ANOVA. We learned that these tests are actually equivalent to specialized types of regressions.

Our statistical analysis focused on evaluating differences in the consumption of red meat around the world between females and males and across different age groups. First we looked at the assumptions of \(t\)-tests and regressions, and determined that the rate of red meat consumption relative to the optimal guideline-suggested amount was right skewed. We learned that we could transform the data by taking the log of these values to achieve more normally distributed data. To compare males and females we used a \(t\)-test and learned that a \(t\)-test is a specialized form of a linear regression. To compare the 15 different age groups we used an ANOVA and learned that ANOVA is also a specialized form of linear regression. We examined how we obtained the same results using either statistical test. This was also the case if we looked at the effect of gender and controlled for the paired country structure in the data by either including location_name in the model as another term or by using a mixed effects model to control for this structure as a random effect but not specifically test for the influence of location_name on red meat consumption estimates. We learned that fixed effects are those that we wish to evaluate, while random effects are those that may influence the relationships of our variables of interest but that we do not wish to actively evaluate. Using these tests and models, we determined that males consume more red meat than females on average around the world.

Our ANOVA analysis of age determined that indeed there is at least one age group that consumed a significantly different amount of red meat compared to the other age groups, and this was still the case when we controlled for location_name. However, we learned that the ANOVA does not provide information about which age groups are different. We learned how the regression could provide some quantification of the effect of specific age groups relative to the reference age group. Furthermore, our data visualizations allowed us to determine that in general red meat consumption appears to be higher in the younger age groups relative to the older age groups.

Finally, we also looked at differences in red meat consumption between the different countries and saw in our ANOVA analysis and our regression analysis that there were significant differences. We were able to use a regression that included sex, age_group_name, and location_name to evaluate the influence of each of the three demographic factors on consumption while controlling or accounting for the other two. Our results demonstrated that all three influenced or were associated with red meat consumption.

In preforming our statistical analyses we learned about the assumptions of the \(t\)-test, regression, and the ANOVA. We also learned about important methods to tests these assumptions.

Using the ggplot2 package we were able to visualize trends in the data and to compare consumption of these dietary factors in the US with that of the other countries.

We see that the populations in many countries are over-consuming foods that are associated with health risk when over-consumed. In particular processed meat and sugar-sweetened beverages appear to be the most over consumed. Importantly both of these appear to be consumed at higher quantities by males and younger adults. People in the US appear to consume fewer sugar-sweetened beverages than other countries, however, people are still over-consuming. Processed meat however appears to be especially bad in the US. In terms of food that need to be consumed in adequate amounts to overcome health risk, nearly all countries for all factors are not reaching guideline levels. However, there are some countries consuming more than adequate amounts of legumes, vegetables, fruits and fiber. People in the US appear to eat more milk products and consume more omega-3 fatty acids and calcium rich foods than other countries. All countries including the US consume very low levels of polyunsaturated fatty acids. These polyunsaturated fatty acids are abundant in seeds, nuts and avocados, as well as fish. Likely the low level of consumption of nuts and seeds contributes to these low polyunsaturated fatty acid estimates. The supplementary table included in the article suggests that poor consumption of polyunsaturated fatty acids is associated with ischemic heart disease. The article takes this data further to evaluate the association of consumption levels of these foods with mortality.

Analyses like the one in our case study are important for defining which groups could benefit the most from interventions, education, and policy changes when attempting to mitigate public health challenges. You can see in the article however that many additional considerations would be involved to perform a more thorough analysis to adequately understand the data enough to recommend policy changes.

Suggested Homework


Students can evaluate consumption estimates of another dietary factor besides red meat.

Additional Information


Session Info


sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] OCSdata_1.0.2             cowplot_1.1.1            
 [3] viridis_0.6.2             viridisLite_0.4.0        
 [5] ggforce_0.3.3             ggiraph_0.8.2            
 [7] car_3.0-12                carData_3.0-5            
 [9] lmerTest_3.1-3            lme4_1.1-27.1            
[11] Matrix_1.3-4              forcats_0.5.1            
[13] ggpubr_0.4.0              ggplot2_3.3.5            
[15] tidyr_1.1.4               tibble_3.1.6             
[17] purrr_0.3.4               stringr_1.4.0            
[19] pdftools_3.1.1            skimr_2.1.3              
[21] dplyr_1.0.7               readr_2.1.1              
[23] koRpus.lang.en_0.1-4      koRpus_0.13-8            
[25] sylly_0.1-6               read.so_0.1.1            
[27] wordcountaddin_0.3.0.9000 magrittr_2.0.2           
[29] knitr_1.37                here_1.0.1               

loaded via a namespace (and not attached):
 [1] nlme_3.1-153        fs_1.5.2            usethis_2.1.5      
 [4] bit64_4.0.5         httr_1.4.2          rprojroot_2.0.2    
 [7] repr_1.1.4          numDeriv_2016.8-1.1 tools_4.1.2        
[10] backports_1.4.1     bslib_0.3.1         utf8_1.2.2         
[13] R6_2.5.1            mgcv_1.8-38         DBI_1.1.2          
[16] colorspace_2.0-2    withr_2.5.0         gridExtra_2.3      
[19] tidyselect_1.1.2    bit_4.0.4           curl_4.3.2         
[22] compiler_4.1.2      cli_3.2.0           labeling_0.4.2     
[25] sass_0.4.0          scales_1.1.1        askpass_1.1        
[28] systemfonts_1.0.4   digest_0.6.29       minqa_1.2.4        
[31] rmarkdown_2.11      base64enc_0.1-3     pkgconfig_2.0.3    
[34] htmltools_0.5.2     fastmap_1.1.0       highr_0.9          
[37] htmlwidgets_1.5.4   rlang_1.0.1         rstudioapi_0.13    
[40] farver_2.1.0        jquerylib_0.1.4     generics_0.1.1     
[43] jsonlite_1.8.0      vroom_1.5.7         polynom_1.4-0      
[46] Rcpp_1.0.8          munsell_0.5.0       fansi_1.0.2        
[49] abind_1.4-5         lifecycle_1.0.1     stringi_1.7.6      
[52] yaml_2.3.5          MASS_7.3-54         grid_4.1.2         
[55] parallel_4.1.2      crayon_1.5.0        lattice_0.20-45    
[58] splines_4.1.2       hms_1.1.1           pillar_1.7.0       
[61] uuid_1.0-3          boot_1.3-28         ggsignif_0.6.3     
[64] glue_1.6.1          evaluate_0.15       qpdf_1.1           
[67] data.table_1.14.2   remotes_2.4.2       vctrs_0.3.8        
[70] nloptr_1.2.2.3      tzdb_0.2.0          tweenr_1.0.2       
[73] gtable_0.3.0        polyclip_1.10-0     assertthat_0.2.1   
[76] xfun_0.29           sylly.en_0.1-3      broom_0.7.11       
[79] rstatix_0.7.0       ellipsis_0.3.2     

Estimate of RMarkdown Compilation Time:

About 79 - 89 seconds

This compilation time was measured on a PC machine operating on Windows 10. This range should only be used as an estimate as compilation time will vary with different machines and operating systems.

Acknowledgments


We would like to acknowledge Jessica Fanzo for assisting in framing the major direction of the case study, as well as Ashkan Afshin and Erin Mullany for giving us access to the data.

We would like to acknowledge Michael Breshock for his contributions to this case study and developing the OCSdata package.

We would also like to acknowledge the Bloomberg American Health Initiative for funding this work.

---
title: "Open Case Studies: Exploring global patterns of dietary behaviors associated with health risk "
css: style.css
output:
  html_document:
    includes:
       in_header: GA_Script.Rhtml
    self_contained: yes
    code_download: yes
    highlight: tango
    number_sections: no
    theme: cosmo
    toc: yes
    toc_float: yes
  pdf_document:
    toc: yes
  word_document:
    toc: yes

---
<style>
#TOC {
  background: url("https://opencasestudies.github.io/img/icon-bahi.png");
  background-size: contain;
  padding-top: 240px !important;
  background-repeat: no-repeat;
}
</style>

<!-- Open all links in new tab-->  
<base target="_blank"/>  
<div id="google_translate_element"></div>

<script type="text/javascript" src='//translate.google.com/translate_a/element.js?cb=googleTranslateElementInit'></script>

<script type="text/javascript">
function googleTranslateElementInit() {
  new google.translate.TranslateElement({pageLanguage: 'en'}, 'google_translate_element');
}
</script>


```{r setup, include=FALSE}
knitr::opts_chunk$set(
  include = TRUE, comment = NA, echo = TRUE,
  message = FALSE, warning = FALSE, cache = FALSE,
  fig.align = "center", out.width = "90%"
)
library(here)
library(knitr)
library(magrittr)
remotes::install_github("benmarwick/wordcountaddin", type = "source", dependencies = TRUE)
remotes::install_github("alistaire47/read.so")
library(wordcountaddin)
library(read.so)

rmarkdown:::perf_timer_reset_all()
rmarkdown:::perf_timer_start("render")
```



#### {.outline }
```{r, echo = FALSE, out.width="800 ptx"}
knitr::include_graphics(here::here("img", "mainplot.png"))
```


#### {.disclaimer_block}

**Disclaimer**: The purpose of the [Open Case Studies](https://www.opencasestudies.org){target="_blank"} project is **to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data**. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given data set, and should not be used in the context of making policy decisions without external consultation from scientific experts. 

####

#### {.license_block}

This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 [(CC BY-NC 3.0)](https://creativecommons.org/licenses/by-nc/3.0/us/){target="_blank"}  United States License.

####

#### {.reference_block}

To cite this case study please use:

Wright, Carrie and Meng, Qier and Jager, Leah and Taub, Margaret and Hicks, Stephanie. (2020). [https://github.com/opencasestudies/ocs-bp-diet](https://github.com/opencasestudies/ocs-bp-diet). Exploring global patterns of dietary behaviors associated with health risk (Version v1.0.0).

####

To access the GitHub Repository with the data for this case study see here: https://github.com/opencasestudies/ocs-bp-diet.

You may also access and download the data using our `OCSdata` package. To learn more about this package including examples, see this [link](https://github.com/opencasestudies/OCSdata). Here is how you would install this package:

```{r, eval=FALSE}
install.packages("OCSdata")
```

This case study is part of a series of public health case studies for the [Bloomberg American Health Initiative](https://americanhealth.jhu.edu/open-case-studies).

***

The total reading time for this case study is calculated via [koRpus](https://github.com/unDocUMeantIt/koRpus) and shown below: 

```{r, echo=FALSE}
readtable = text_stats("index.Rmd") # producing reading time markdown table
readtime = read.so::read.md(readtable) %>% dplyr::select(Method, koRpus) %>% # reading table into dataframe, selecting relevant factors
  dplyr::filter(Method == "Reading time") %>% # dropping unnecessary rows
  dplyr::mutate(koRpus = paste(round(as.numeric(stringr::str_split(koRpus, " ")[[1]][1])), "minutes")) %>% # rounding reading time estimate
  dplyr::mutate(Method = "koRpus") %>% dplyr::relocate(koRpus, .before = Method) %>% dplyr::rename(`Reading Time` = koRpus) # reorganizing table
knitr::kable(readtime, format="markdown")
```

***

**Readability Score: **

A readability index estimates the reading difficulty level of a particular text. Flesch-Kincaid, FORCAST, and SMOG are three common readability indices that were calculated for this case study via [koRpus](https://github.com/unDocUMeantIt/koRpus). These indices provide an estimation of the minimum reading level required to comprehend this case study by grade and age. 

```{r, echo=FALSE}
rt = wordcountaddin::readability("index.Rmd", quiet=TRUE) # producing readability markdown table
df = read.so::read.md(rt) %>% dplyr::select(index, grade, age) %>%  # reading table into dataframe, selecting relevant factors
  tidyr::drop_na() %>% dplyr::mutate(grade = round(as.numeric(grade)), # dropping rows with missing values, rounding age and grade columns
                                     age = round(as.numeric(age))
                                     )
knitr::kable(df, format="markdown")
```

***

Please help us by filling out our survey.


<div style="display: flex; justify-content: center;"><iframe src="https://docs.google.com/forms/d/e/1FAIpQLSfpN4FN3KELqBNEgf2Atpi7Wy7Nqy2beSkFQINL7Y5sAMV5_w/viewform?embedded=true" width="1200" height="700" frameborder="0" marginheight="0" marginwidth="0">Loading…</iframe></div>


## **Motivation**
***

An [article](https://www.thelancet.com/action/showPdf?pii=S0140-6736%2819%2930041-8){target="_blank"} recently published in The 
Lancet evaluated global dietary trends and the relationship of dietary factors with mortality and fertility.

```{r, echo = FALSE}
knitr::include_graphics(here::here("img", "thepaper.png"))
```

#### {.reference_block}
GBD 2017 Diet Collaborators. Health effects of dietary risks in 195 countries, 1990–2017: a systematic analysis for the Global Burden of Disease Study 2017. *The Lancet* 393, 1958–1972 (2019).

####

This article evaluated food consumption patterns in 195 countries for 15 different dietary risk factors that have probable associations with non-communicable disease (NCD). For example, over-consumption of sodium is associated with high blood pressure. These consumption levels were then used to estimate levels of mortality and morbidity due to NCD, as well as disability-adjusted life-years (DALYs) attributable to sub-optimal consumption of foods related to these dietary risk factors. The authors found that: 

> "High intake of sodium ..., low intake of whole grains ..., and low intake of fruits ... were the leading dietary risk factors for deaths and DALYs globally and in many countries." 

This figure from the paper's [supplementary materials](https://www.thelancet.com/cms/10.1016/S0140-6736(19)30041-8/attachment/3d4c0258-c2ea-405f-8d11-e9ae65e6f996/mmc1.pdf){target="_blank"} shows the ranking of the 15 dietary risk factors based on the estimated number of attributable deaths. Here, the numbers and colors of the little squares imply rankings of the risk factors (rows) by regions (columns). The color red indicates risk factors that are associated with larger number of attributable deaths. The column on the right is the overall global data. As you can see here, the top 3 risk factors are often issues for many different regions of the world.

```{r, echo = FALSE, out.width= "700 px"}
knitr::include_graphics(here("img", "deaths.png"))
```

This case study will evaluate the data reported in this article to explore regional, age, and gender specific differences in dietary consumption patterns around the world in 2017. 

## **Main Questions**
***

#### {.main_question_block}
<b><u> Our main questions are: </u></b>

1) What are the global trends for potentially harmful diets?
2) How do males and females compare?
3) How do different age groups compare for these dietary factors?
4) How do different countries compare? In particular, how does the US compare to other countries in terms of diet trends?

####

## **Learning Objectives**
***

In this case study, we will walk you through importing data from PDF files and CSV files, cleaning data, wrangling data, comparing data, joining data, visualizing data, and <b> comparing two or more groups </b> using well-established and commonly used packages, including `stringr`, `tidyr`, `dplyr`, `purrr`, and `ggplot2`. We will especially focus on using packages and functions from the [Tidyverse](https://www.tidyverse.org/){target="_blank"}. The tidyverse is a library of packages created by RStudio. While some students may be familiar with previous R programming packages, these packages make data science in R especially legible and intuitive.

The skills, methods, and concepts that students will be familiar with by the end of this case study are:

<u>**Data Science Learning Objectives:**</u>

1. Importing/extracting data from PDF (`dplyr`, `stringr`)  
2. How to reshape data by pivoting between "long" and "wide" formats (`tidyr`)    
3. Perform functions on all columns of a tibble (`purrr`)  
4. Data cleaning with regular expressions (`stringr`)  
5. Specific data value reassignment  
6. Separate data within a column into multiple columns (`tidyr`) 
7. Methods to Compare data (`dplyr`)  
8. Combining data from two sources (`dplyr`)  
9. Make interactive plots (`ggiraph`)  
10. Make a zoom facet for plot (`ggforce`) 
11. Combine plots together (`cowplot`)

<u>**Statistical Learning Objectives:**</u> 

1.  Understanding of how the *t*-test and the ANOVA are specialized
    regressions
2.  Basic understanding of the utility of a regression analysis
3.  How to implement a linear regression analysis in R
4.  How to interpret regression coefficients
5.  Awareness of *t*-test assumptions
6.  Awareness of linear regression assumptions
7.  How to use Q-Q plots to check for normality
8.  Difference between fixed effects and random effects
9.  How to perform paired *t*-test
10. How to perform a linear mixed effects regression


```{r, out.width = "20%", echo = FALSE, fig.align = "center"}
include_graphics("https://tidyverse.tidyverse.org/logo.png")
```

***

We will begin by loading the packages that we will need:

```{r}
library(here)
library(readr)
library(dplyr)
library(skimr)
library(pdftools)
library(stringr)
library(magrittr)
library(purrr)
library(tibble)
library(tidyr)
library(ggplot2)
library(ggpubr)
library(forcats)
library(lme4)
library(lmerTest)
library(car)
library(ggiraph)
library(ggforce)
library(viridis)
library(cowplot)
library(OCSdata)
```


 Package   | Use in this case study                                                                        
---------- |-------------
[here](https://github.com/jennybc/here_here){target="_blank"}       | to easily load and save data
[readr](https://readr.tidyverse.org/){target="_blank"}      | to import the CSV file data
[dplyr](https://dplyr.tidyverse.org/){target="_blank"}      | to arrange/filter/select/compare specific subsets of the data 
[skimr](https://cran.r-project.org/web/packages/skimr/index.html){target="_blank"}      | to get an overview of data
[pdftools](https://cran.r-project.org/web/packages/pdftools/pdftools.pdf){target="_blank"}   | to read a PDF into R 
[stringr](https://stringr.tidyverse.org/articles/stringr.html){target="_blank"}    | to manipulate the text within the PDF of the data
[magrittr](https://magrittr.tidyverse.org/articles/magrittr.html){target="_blank"}   | to use the `%<>%` piping operator
[purrr](https://purrr.tidyverse.org/){target="_blank"}      | to perform functions on all columns of a tibble
[tibble](https://tibble.tidyverse.org/){target="_blank"}     | to create data objects that we can manipulate with dplyr/stringr/tidyr/purrr
[tidyr](https://tidyr.tidyverse.org/){target="_blank"}      | to separate data within a column into multiple columns
[ggplot2](https://ggplot2.tidyverse.org/){target="_blank"}    | to make visualizations with multiple layers
[ggpubr](https://cran.r-project.org/web/packages/ggpubr/index.html){target="_blank"}    | to easily add regression line equations to plots
[forcats](https://forcats.tidyverse.org/){target="_blank"}    | to change details about factors (categorical variables)
[lme4](https://cran.r-project.org/web/packages/lme4/lme4.pdf)| to fit a linear mixed effects model
[lmerTest](https://cran.r-project.org/web/packages/lmerTest/lmerTest.pdf)| to perform linear mixed model testing
[car](https://cran.r-project.org/web/packages/car/car.pdf)| to perform Levene's Test of Homogeneity of Variances
[ggiraph](https://cran.r-project.org/web/packages/ggiraph/index.html)| to make plots interactive
[ggforce](https://cran.r-project.org/web/packages/ggforce/ggforce.pdf)| to modify facets in plots
[viridis](https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html)| to plot in a color palette that is easily interpreted by colorblind individuals
[cowplot](https://cran.r-project.org/web/packages/cowplot/vignettes/introduction.html){target="_blank"} | to allow plots to be combined
[OCSdata](https://github.com/opencasestudies/OCSdata){target="_blank"} | to access and download OCS data files
___



The first time we use a function, we will use the `::` to indicate which package we are using. Unless we have overlapping function names, this is not necessary, but we will include it here to be informative about where the functions we will use come from.


## **Context**
***

Here is an excerpt from the article itself about the context of the work:
```{r, echo = FALSE}
knitr::include_graphics(here("img", "context.png"))
```

Many dietary factors have well-established associations with health risk. The authors that generated this data set identified 15 dietary factors that have probable health risk based on literature search.

Here you can see a table of the sources for the health risks associated with the dietary factors. The first column shows the risk factors and the second column shows the health outcomes. This table is part of "Supplemental Table 1. Epidemiological evidence supporting causality between dietary risk factors and disease endpoints" from the paper’s [supplementary materials](https://www.thelancet.com/cms/10.1016/S0140-6736(19)30041-8/attachment/3d4c0258-c2ea-405f-8d11-e9ae65e6f996/mmc1.pdf){target="_blank"}.


```{r, echo = FALSE, out.width= "500 px"}
knitr::include_graphics(here("img", "dietaryrisk.png"))
```


In the article the authors found that most of the mortality associated with each factor is related to cardiovascular disease.

```{r, echo = FALSE, out.width= "500 px"}
knitr::include_graphics(here("img", "cardiorisk.png"))
```

## **Limitations**
***

There are some important limitations regarding the data from this article to keep in mind. The definition of certain dietary factors varied across some of the collection sources. Intakes of certain healthy foods like vegetables and fruits are likely positively correlated with each other and likely negatively correlated with intakes of unhealthy foods. Much of the data was collected with 24 hour recall surveys which are prone to issues due to inaccuracy of memory recall or other biases such as a tendency for some people to report healthier behaviors. The guidelines in the PDF are not specified by gender even though it is known that there are different dietary requirements for optimal health for certain nutrients by gender. The article discusses some limitations about accounting for overall food consumption when calculating consumption of particular foods:

> "To remove the effect of energy intake as a potential confounder and address measurement error in dietary assessment tools, most cohorts have adjusted for total energy intake in their statistical models. This energy adjustment means that diet components are defined as risks in terms of the share of diet and not as absolute levels of exposure. In other words, an increase in intake of foods and macronutrients should be compensated by a decrease in intake of other dietary factors to hold total energy intake constant. Thus, the relative risk of change in each component of diet depends on the other components for which it is substituted. However, the relative risks estimated from meta-analyses of cohort studies do not generally specify the type of substitution.

There are also important nuances to keep in mind regarding some of the dietary factors. For example calcium consumption was calculated based on consumption of dairy products, while calcium can be acquired from other sources including plant-based sources. However in these data, the influence of plant-based consumption of calcium was not accounted for, nor was supplementation through vitamin sources. 

Also, while [gender](https://www.genderspectrum.org/quick-links/understanding-gender/){target="_blank"} and [sex](https://www.who.int/genomics/gender/en/index1.html){target="_blank"} are not actually binary, the data used in this analysis only contains information for groups of individuals described as male or female. 

## **What are the data?**
***

We will be using data that we requested from the [Global Burden of Disease (GBD)](http://www.healthdata.org/gbd){target="_blank"} of the [Institute for Health Metrics and Evaluation (IHME)](http://www.healthdata.org/about) about dietary intake, as well as the guideline data about optimal consumption amounts for different foods contained within the PDF of the article. We have two CSV files, dietary_risk_exposure_all_ages_2017.csv and dietary_risk_exposure_sep_ages_2017.csv. The first one includes consumption levels at the global level and for different countries for all ages combined.

Looking at the CSV file in excel:

```{r, echo = FALSE, out.width = "800px"}
knitr::include_graphics(here::here("img", "csv.png"))
```

Here you can see that the data contains mean consumption values for both men and women in various countries at the national level in 2017 for various foods that may be problematic for health. The units for the food varies. So for example, the mean column in row that says "Diet low in fiber" indicates the average consumption level per person in that region and of that gender of fiber in grams per day.

The second CSV file has similar data, but consumption levels for different age groups are separated.

```{r, echo = FALSE, out.width = "800px"}
knitr::include_graphics(here::here("img", "age_sep3.png"))
```

The authors of this article obtained the data from a variety of sources including household budget surveys and nutritional surveys regarding 24 hour recall of food consumption and 24 hour urinary sodium analysis. The data was derived from sales data from [Euromonitor](https://www.euromonitor.com/){target="_blank"}, estimates about national availability of specific nutrients from the [United Nations Food and Agriculture Organization (FAO)](http://www.fao.org/home/en/){target="_blank"} and the [United States Department of Agriculture](https://www.usda.gov/){target="_blank"}'s [National Nutrition Database](https://data.nal.usda.gov/dataset/usda-national-nutrient-database-standard-reference-legacy-release){target="_blank"}.

## **Data Import**
***

If you have trouble accessing the [GitHub Repository](https://github.com/opencasestudies/ocs-bp-diet), the data can be downloaded from [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-diet/master/data/raw/dietary_risk_exposure_all_ages_2017.csv) and [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-diet/master/data/raw/dietary_risk_exposure_sep_ages_2017.csv).

Let's import our data into R now so that we can explore the data further.

In our case, we downloaded this data and put it within a "data" directory within a subdirectory called "raw" for our project. If you use an RStudio project, then you can use the `here()` function of the `here` package to make the path for importing this data simpler. The `here` package automatically starts looking for files based on where you have a `.Rproj` file which is created when you start a new RStudio project. We can specify that we want to look for the files within the "docs" directory within a directory where our `.Rproj` file is located by separating the name of the "data" directory, the "raw" subdirectory, and the file name using commas.

***
<details> <summary> Click here to see more about creating new projects in RStudio. </summary>

You can create a project by going to the File menu of RStudio like so:


```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "New_project.png"))
```

You can also do so by clicking the project button:

```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "project_button.png"))
```

See [here](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects) to learn more about using RStudio projects. 

</details>

***

```{r}
diet_data <- readr::read_csv(here("data", "raw", 
                       "dietary_risk_exposure_all_ages_2017.csv"))
sep_age_diet_data <- read_csv(here("data", "raw", 
                       "dietary_risk_exposure_sep_ages_2017.csv"))
```

You may also use the `OCSdata` package to download the raw data: 

```{r, eval=FALSE}
# install.packages("OCSdata")
library(OCSdata)
raw_data("ocs-bp-diet", outpath = getwd())
# This will save the raw data files in a "OCSdata/data/raw/" sub-folder 
# in your current working directory
```

If you used the `OCSdata` package to download the raw data, you can import the data into R like so:

```{r, eval=FALSE}
diet_data <- readr::read_csv(here("OCSdata", "data", "raw", 
                       "dietary_risk_exposure_all_ages_2017.csv"))
sep_age_diet_data <- read_csv(here("OCSdata", "data", "raw", 
                       "dietary_risk_exposure_sep_ages_2017.csv"))
```

First let's just get a general sense of our data. We can do that using the `glimpse()` function of the `dplyr` package (it is also in the `tibble` package).

```{r}
dplyr::glimpse(diet_data)
```

```{r}
glimpse(sep_age_diet_data)
```

Here we can tell that the `sep_age_diet_data` is much larger than the `diet_data`. The `diet_data` has only 5,880 rows while the `sep_age_diet_data` has 88,200 rows!

However, both files appear to have the same column structure with 11 variables each.


The `skim()` function of the `skimr` package is also really helpful for getting a general sense of your data.

```{r}
skim(diet_data)
```

         
Notice how there is a column providing the number of missing observations for each variable. It looks like our data is very complete and we do not have any missing data.
We also get a sense about the size of our data.

The `n_unqiue` column shows us the number of unique values for each of our columns.


Let's take a look at `sep_age_diet_data`.

```{r}
skim(sep_age_diet_data)
```

We can see that there are many more rows in this data set.

Let's change the variable name `rei_name` to `dietary_risk` so that it makes more sense. We can use the `rename()` function from the `dplyr` package.

```{r}
diet_data <- dplyr::rename(diet_data, dietary_risk = rei_name)
sep_age_diet_data <- dplyr::rename(sep_age_diet_data, dietary_risk = rei_name)

glimpse(diet_data)
glimpse(sep_age_diet_data)
```

Looks good!

We will then take a look at the different dietary risk factors considered.
To do this we will use the `distinct()` function of the `dplyr` package.

This function grabs only the distinct or unique rows from a given variable (`dietary_risk`, in our case) of a given data frame (`diet_data`, in our case).

```{r}
dplyr::distinct(diet_data, dietary_risk)
```

Both over and under consumption could be a health problem!

We will be using the `%>%` pipe for sequential steps in our code later on.
This will make more sense when we have multiple sequential steps using the same data object.


We could do the same code as above using this notation. For example we first grab the `diet_data`, then we select the distinct values of the `dietary_risk` variable.

```{r}
diet_data %>%
  distinct(dietary_risk)
```

OK, so that gives us an idea of what dietary factors we can explore, and we can see that there are 15 of them. 

Let's see if the `location_name` values are the same between both CSV files. To do this we will use the `setequal()` function of `dplyr`.
```{r}
dplyr::setequal(
  distinct(diet_data, location_name),
  distinct(sep_age_diet_data, location_name)
)
```

OK, we got the value of TRUE, so it looks like the same locations are in both files.

Note: In this case were comparing two different objects so using the pipe is not as useful.

Let's take a look at the locations included in the data.

#### {.scrollable }
```{r}
# scroll through the output!
sep_age_diet_data %>%
  distinct(location_name) %>%
  pull()
```
####


OK, so there are global values, as well as values for 195 countries.


Let's take a look at the data when we order it by the mean consumption rate column. We can do so using the `arrange()` function of the `dplyr` package.

```{r}
diet_data %>%
  dplyr::arrange(mean) %>%
  glimpse()
```

OK, so it looks like people in Lebanon don't eat very many trans fatty acids.

Let's also figure out how many values there are in each age group of the data that is separated by age. We will use the `count()` function of the `dplyr` package to do this.

```{r}
sep_age_diet_data %>%
  dplyr::count(age_group_name)
```
That's a lot of values!

Let's look a bit deeper to try to understand why.
We can use the `count()` function again but get the number of values for each category within `sex`, `age_group_name` and `location_name` of the data.

```{r}
sep_age_diet_data %>%
  count(sex, age_group_name, location_name)
```

OK, so it looks like these are probably the consumption values for each of the different dietary factors (since there were 15 different factors) for each age group and gender combination within each country.

We can confirm this by filtering the data to one of the age groups, for a single gender, and for a single location. To do this we can use the `filter()` function of the `dplyr` package. Notice that we need to use two equal signs `==` to specify what values we would like for each variable.

```{r}
sep_age_diet_data %>%
  dplyr::filter(
    sex == "Female",
    age_group_name == "25 to 29",
    location_name == "Afghanistan"
  )
```

This confirms that for each of the 15 dietary factors, our unit of observation is a combination of gender, age and country. 

However, before we proceed with our analysis, we will want to perform some additional data wrangling. To do this, we will introduce the `pdftools` package, which will allow us to pull additional data from the manuscript itself.


While all of the mean consumption values are reported in grams, each dietary factor has a different amount that is considered optimal for consuming. To make the consumption values more comparable across factors, let's also get some data from the PDF of the paper so that we can calculate consumption of these dietary factors as percentages of the daily optimum.

We are interested in this table on page 3:

```{r, echo = FALSE, out.width = "800px"}
knitr::include_graphics(here::here("img", "Table.png"))
```

First let's import the PDF using the `pfd_text()` function of the `pdftools` package.

You can find this file [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-diet/master/data/raw/Afshin_et_al_2019.pdf).

```{r}
paper <- pdftools::pdf_text(here("data", "raw",
"Afshin_et_al_2019.pdf"))
```

We can save our imported data as an rda file (stands for R data file) using the `save()` function. 

```{r}
save(diet_data, sep_age_diet_data, paper, file = here::here("data", "imported", "imported_data.rda"))
```


## **Data Wrangling**
***

If you have been following along but stopped, we could load our imported data like so:

```{r}
load(here::here("data", "imported", "imported_data.rda"))
```

***
<details> <summary> If you skipped the data import section click here. </summary>

First you need to install and load the `OCSdata` package:

```{r, eval=FALSE}
install.packages("OCSdata")
library(OCSdata)
```

Then, you may load the imported data using the following code:

```{r, eval=FALSE}
imported_data("ocs-bp-diet", outpath = getwd())
load(here::here("OCSdata", "data", "imported", "imported_data.rda"))
```

If the package does not work for you, alternatively, an RDA file (stands for R data) of the data can be found in our [GitHub repository](https://github.com//opencasestudies/ocs-bp-diet/tree/master/data/imported) or slightly more directly [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-diet/master/data/imported/imported_data.rda). Download this file and then place it in your current working directory within a subdirectory called "imported" within a subdirectory called "data" to copy and paste our code. We used an RStudio project and the [`here` package](https://github.com/jennybc/here_here) to navigate to the file more easily. 

```{r}
load(here::here("data", "imported", "imported_data.rda"))
```

***
<details> <summary> Click here to see more about creating new projects in RStudio. </summary>

You can create a project by going to the File menu of RStudio like so:


```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "New_project.png"))
```

You can also do so by clicking the project button:

```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "project_button.png"))
```

See [here](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects) to learn more about using RStudio projects and [here](https://github.com/jennybc/here_here) to learn more about the `here` package.

</details>
***
</details>

***

Let's take a look at our manuscript data.

We can use the `base` `summary()` function to get a sense of what the data looks like. By `base` we mean that these functions are part of the `base` package and are loaded automatically on startup of R. Thus, `library(base)` is not required.

```{r}
summary(paper)
```

We can see that we have 15 different character strings. Each one contains the text on each of the 15 different pages of the PDF.

Again, the table we are interested in is on the third page, so let's grab just that portion of the PDF. The top of this page looks like:

```{r, echo = FALSE, out.width = "800px"}
knitr::include_graphics(here::here("img", "page3.png"))
```

```{r}
# Here we will select the 3rd value in the paper object
pdf_table <- paper[3]

summary(pdf_table)

# specifying nchar.max truncates the output
glimpse(pdf_table, nchar.max = 800)
```

Here we can see that the `pdf_table` object now contains the text from the 3rd page as a **single large character string**. However the text is difficult to read because of the column structure in the PDF. Now let's try to grab just the text in the table.

One way to approach this is to split the string by some pattern that we notice in the table.

```{r, echo = FALSE, out.width = "800px"}
knitr::include_graphics(here::here("img", "Table.png"))
```

All the rows of interest of the table appear to start with the word `"Diet"`. Moreover, only the capitalized form of the word `"Diet"` appears to be within the table, and it is not present in the preceding text (although `"diet"` is). 

```{r, echo = FALSE, out.width = "800px"}
knitr::include_graphics(here::here("img", "Diet_on_page3.png"))
```


Let's use the `str_split()` function of the `stringr` package to split the data within the object called `pdf_table` by the word `"Diet"`.  Only lines from page 3 that contain the word `"Diet"` will be selected (and not `"diet"` as this function is case-sensitive). Each section of the text that contains `"Diet"` will be split into individual pieces every time the word `"Diet"` occurs and the word itself will be removed.

In this case we are also using the magrittr assignment pipe or double pipe that looks like this `%<>%` of the `magrittr` package. This allows us use the `pdf_table` data as input to the later steps but also reassign the output to the same data object name.

```{r}
pdf_table %<>%
  stringr::str_split(pattern = "Diet")
```

Using the `base::summary()` and `dplyr::glimpse()` function we can see that we created a list of the rows in the table that contained the word `"Diet"`. We can see that we start with the row that contains `"low in fruits"`. 

```{r}
pdf_table %>%
  summary()
```

```{r}
pdf_table %>%
  glimpse()
```

In order to extract the values that we want from these character strings, we will use some additional functions from the `stringr` package. RStudio creates really helpful cheat sheets like this one which shows you all the major functions in the `stringr` package. You can download others [here](https://rstudio.com/resources/cheatsheets/){target="_blank"}.

```{r, echo = FALSE, out.width = "800px"}
knitr::include_graphics(here::here("img", "strings-1_str_split.png"))
```

You can see that we could have also used the `str_split_fixed()` function which would also separate the substrings into different columns of a matrix, however we would need to know the number of substrings or pieces that we would like returned.

For more information about `str_split()` see [here](http://rfunction.com/archives/1499){target="_blank"}.

Let's separate the values within the list using the base `unlist` function, this will allow us to easily select the different substrings within the object called `pdf_table`.

```{r}
pdf_table %<>%
  unlist()
```

It's important to realize that the first split will split the text before the first occurrence of `"Diet"` as the first value in the output. (This is why there are 17 elements in `pdf_table` rather than 15, the number of rows in the table.) We could use the `first()` function of the `dplyr` package to look at this value. However, we will suppress the output as this is quite large.

```{r, eval = FALSE}
dplyr::first(pdf_table)
```

Instead we can take a look at the second element of the list. using the `nth()` function of `dplyr`.

```{r}
nth(pdf_table, 2)
```

Indeed this looks like the first row of interest in our table:

```{r,echo = FALSE,out.width= "800px"}
knitr::include_graphics(here("img", "firstrow.png"))
```


Using the `last()` and the `nth()` functions of the `dplyr` package we can take a look at the last values of the list.
```{r}
# to see the second to last value we can use nth()
# the -2 specifies that we want the second-to-last value
# -3 would be third-to-last and -1 would be the last value
dplyr::nth(pdf_table, -2)

# to see the very last value we can use last()
dplyr::last(pdf_table)
```

```{r, echo = FALSE, out.width = "800px"}
knitr::include_graphics(here::here("img", "end_of_table.png"))
```


We don't need this part of the table or the text before the table if we just want the consumption recommendations. 

So we will select the second through the second-to-last of the substrings. Since we have seventeen substrings, we will select the second through the sixteenth. However a better way to do this rather than selecting by index, would be to select phrases that are unique to the text within the table that we want. We will use the `str_subset()` function of `stringr` package to select the table rows with consumption guidelines.  Most of the rows have the phrase "Mean daily consumption", however, there are other phrases for some of the rows, including "Mean daily intake" and "24 h sodium". So we will subset for each of these phrases.

```{r}
# one could subset the pdf_table like this:
# pdf_table <- pdf_table[2:16]

pdf_table %<>%
  str_subset(pattern = "Mean daily consumption|Mean daily intake|24 h")
```

Notice that we separate the different patterns to look for using vertical bar character `"|"` and that all of the patterns are within quotation marks together.

#### {.think_question_block}
<u>Question opportunity:</u> 

1) What other string patterns could you use to subset the rows of the table that we want?

2) Why might it be better to subset based on the text rather than the index?

####


Now the first row is what we want:
```{r}
first(pdf_table)
```

And the last row is what we want:
```{r}
last(pdf_table)
```

At this point, we have a better look at the current representation of the table data in R, and we might notice something that will need to be fixed. In the string above, the decimal points from the PDF are being recognized as something called an interpunct instead of a period or decimal. An interpunct is a centered dot, as opposed to a period or decimal that is aligned to the bottom of the line.

The interpunct was previously used to separate words in certain languages, like ancient Latin.


<p align="center">
  <img width="400" src="https://www.yourdictionary.com/image/articles/3417.Latin.jpg">
</p>

###### [[source](https://www.yourdictionary.com/image/articles/3417.Latin.jpg)]

You can produce an interpunct on a Mac like this:


<p align="center">
  <img width="400" src="https://www.shorttutorials.com/mac-os-special-characters-shortcuts/images/middle-dot.png">
</p>

###### [[source](https://www.shorttutorials.com/mac-os-special-characters-shortcuts/middle-dot.html)]


It is important to replace these for later when we want these values to be converted from character strings to numeric. We will again use the `stringr` package. This time we will use the `str_replace_all()` function which replaces all instances of a pattern in an individual string. In this case we want to replace all instances of the interpunct with a decimal point.


```{r,}
pdf_table %<>%
  stringr::str_replace_all(
    pattern = "·",
    replacement = "."
  )
last(pdf_table)
```

Looks good!

Now we will try to split the strings for each row based on the presence of two spaces to create the columns of the table, as there appears to be more than one space between the columns. The resulting substrings will be separated by quotes.

For additional details, the second page of the `stringr` cheat sheet has more information about using "Special Characters" in `stringr`. For example `\\s` is interpreted as a space as the `\\` indicates that the `s` should be interpreted as a special character and not simply the letter s.  The `{2,}` indicates two or more spaces, while `{2}` would indicate exactly two spaces.

```{r, echo = FALSE,out.width = "800px"}
knitr::include_graphics(here("img", "strings-2_highlight.png"))
```

#### {.scrollable }
```{r}
table_split <- str_split(
  string = pdf_table,
  pattern = "\\s{2,}"
)
glimpse(table_split) # scroll the output!
```
####

Now we can see that each of our 15 strings has been split into pieces, but unfortunately, it was not completely consistent across dietary factors. Why did this happen? If we look closely, we can see that the sugar-sweetened beverage and the seafood category had only one space between the first and second columns. These are the columns about the dietary category and the one that describes in more detail what the consumption suggestion is about.

The values for these two columns appear to be together still in the same substring for these two categories. We can see this because there are no quotation marks adjacent to the word `"Mean"`.

Here you can see how the next substring should have started with the word `"Mean"` by the new inclusion of a quotation mark `"`. The red rectangles indicate the problematic substrings, while the green rectangles show examples where the split worked correctly.

```{r, echo = FALSE, out.width = "700px"}
knitr::include_graphics(here("img", "substring_sep.png"))
```


We can add an extra space in front of the word `"Mean"` for these particular categories and then try splitting again.

Since we originally split based on two or more spaces, we can just add a space in front of the word "Mean" for all the `pdf_table` strings and then try subsetting again. We can use the `str_which()` function of the `stringr` package to find the index of these particular cases.

```{r}
pdf_table %>%
  str_which(pattern = "seafood|sugar")
```

Here we can use the `str_subset()` function of the `stringr` package to see just the strings that match these patterns within `pdf_table`:
```{r}
pdf_table %>%
  str_subset(pattern = "seafood|sugar")
```

This is equivalent to using the `str_which()` function with `[]`:
```{r, eval = FALSE}
pdf_table[str_which(pdf_table, pattern = "seafood|sugar")]
```

Now we can replace these values within the pdf_table object after adding a space in front of "Mean":

```{r}
pdf_table[str_which(pdf_table,
  pattern =
    "seafood|sugar"
)] <- str_replace(
  string = pdf_table[str_which(pdf_table,
    pattern =
      "seafood|sugar"
  )],
  pattern = "Mean",
  replacement = " Mean"
)
```

And now we can try splitting again by two or more spaces:
```{r}
table_split <- str_split(pdf_table, pattern = "\\s{2,}")
```

We could also just add a space in front of all the values of "Mean" in `pdf_table` since the split was performed based on two or more spaces. Thus the other elements in `pdf_table` would also be split just as before despite the additional space.

```{r, eval = FALSE}
pdf_table <- pdf_table %>%
  stringr::str_replace(
    pattern = "Mean",
    replacement = " Mean"
  )
table_split <- str_split(pdf_table, pattern = "\\s{2,}")
```

#### {.scrollable }
```{r}
# scroll the output!
glimpse(table_split)
```
####

Looks better!

We want just the first (the food **category**) and third column (the optimal consumption **amount** suggested) for each row in the table. However, the table is currently stored as a list of character vectors, so it is not quite so simple to extract these values.

We can use the `map` function of the `purrr` package to accomplish this.

The `map` function allows us to perform the same action multiple times across each element within an object, in this case, a list.

The following will allow us to select the first or third substring from each element of the `pdf_table` object.

```{r}
category <- map(table_split, 1)
amount <- map(table_split, 3)
head(category)
head(amount)
```

Now we will create a `tibble` using this data. However, currently both `category` and `amount` are of class `list`. To create a `tibble` we need to unlist the data to create vectors.

```{r}
class(category)
category %<>% unlist()
amount %<>% unlist()
class(category)
```

#### {.scrollable }
```{r}
category
amount
```
####

We could have done all of this at once in one command like this:

```{r, eval = FALSE}
category <- unlist(map(table_split, 1))
amount <- unlist(map(table_split, 3))
```

Now we will create a `tibble`, which is an important data frame structure in the tidyverse which allows us to use other packages in the tidyverse with our data.

We will name our `tibble` columns now as we create our `tibble` using the `tibble()` function of both the `tidyr` and the `tibble` packages, as names are required in tibbles.

```{r}
guidelines <- tibble::tibble(
  category = category,
  amount = amount
)
guidelines
```

Looking pretty good!

### **Separating values within a variable**
***

Recall that the main goal of this data wrangling is to extract the optimal intake level for each dietary factor. So while we have managed to pull and organize the data from the pdf table, we need to further process the results to isolate this numeric value.

Do to this, we want to separate the different numbers within the `amount` column, to isolate the optimal amount, and the optimal range, and eventually convert them to numeric values.

Recall what the original table looked like:
```{r, echo = FALSE, out.width = "800px"}
knitr::include_graphics(here("img", "firstrow.png"))
```

We can use the `tidyr::separate()` function to separate the data within the amount column into three new columns based on the optimal level and the optimal range. We can separate the values based on the open parentheses `"("` and the long dash `"–"` characters. Again we will use the bar `"|"` to indicate that we want to separate by either character.

```{r}
# The first column will be called optimal
# It will contain the 1st part of the amount column data before the "("
# The 2nd column will be called lower
# It will contain the data after the "("
# The 3rd column will be called upper
# It will contain the 2nd part of the data based on the "–"
# The "\\" are necessary - we will explain very soon

guidelines %<>%
  tidyr::separate(amount,
    c("optimal", "lower", "upper"),
    sep = "\\(|–"
  )

guidelines
```


Let's also create a new variable/column in our tibble that indicates the direction of over- or under-consumption that can be harmful for each dietary factor.

```{r}
guidelines %<>%
  separate(category, c("direction", "food"), sep = " in ")
guidelines
```

If we wanted to remove the direction variable we could use the `modify_at()` function of the `purrr` package:

```{r,eval = FALSE}
guidelines %>% purrr::modify_at("direction", ~NULL)
```


### **Data cleaning with regular expressions**
***

OK, looking better, but we still need a bit of cleaning to remove symbols and extra words from the columns. Some of the extra symbols include: `"%"`, `")"` and the `"*"`.

The `"*"` and the `")"` are what we call metacharacters or [regular expressions](https://www.r-bloggers.com/regular-expressions-every-r-programmer-should-know/){target="_blank"}. These are characters that have special meanings.

```{r, echo = FALSE, out.width = "800px"}
knitr::include_graphics(here("img", "RegExCheatsheet.png"))
```

Now we need the `"\\"` to indicate that we want these characters to be matched exactly and not interpreted as the meaning of the symbol. Recall that we used `"\\(|–"` earlier.

See [here](https://cran.r-project.org/web/packages/stringr/vignettes/regular-expressions.html){target="_blank"} for more info about regular expressions in R. 

***

<details> <summary> Click here for a simple example of regular expressions using the `str_count()` function of the `stringr` package </summary>

The `str_count()` function counts the number of instances of a character string. In this case we will look for individual characters but you could also search for words or phrases.

```{r}
regextest <- readr::read_file(here("docs", "regEx.txt"))
regextest
```

Count the letter t:
```{r}
str_count(regextest, "t") # notice this doesn't include the t in the tab
```

Count tabs:
```{r}
str_count(regextest, "\\t") # search for tab
# this would not work:
str_count(regextest, "[t]") # searches for the letter t
```

Count parentheses:
```{r}
# this would not work because R thinks this is part of the code itself
# str_count(regextest, ")")
# this would not work because R thinks this is part of the code itself
# str_count(regextest, "\)")
str_count(regextest, "\\)") # this works!
# this works! because it is a punctuation character
str_count(regextest, "[)]")
```

Count the occurrence of the asterisk:
```{r}
# this also does not work
# str_count(regextest, "*")
# nor does this
# str_count(regextest, "\*")
str_count(regextest, "\\*") # this works!
# this works! because it is a punctuation character
str_count(regextest, "[*]") # this works!
```

</details>

***

We also want to make a unit variable so that we can make sure that our units are consistent later. 

```{r}
guidelines %>%
  pull(optimal)
```

Notice that the values that are percentages don't have spaces between the number and the unit.
We can separate the `"optimal"` values by a space or a percent symbol `"%"` using `"|"` to indicate that we want to separate by either. In this case we will lose the "%" and will need to add it back to those values.

```{r}
guidelines %<>%
  separate(optimal,
    into = c("optimal", "unit"),
    sep = " |%",
    remove = FALSE
  )
guidelines
```

Great, so to now we will add "`%`" to the `unit` variable for  the `"low in polyunsaturated"` and `"high in trans fatty acids"` rows.

First we need to replace the empty values with `NA` using the `na_if()` function of the `dplyr` package.

```{r}
guidelines %<>%
  na_if("")
guidelines
```


Then to replace the `NA` values, we can use the `replace_na()` function in the `tidyr` package and the `mutate()` function of `dplyr` to specify which values to replace, in this case the `NA` values within the variable `unit`. Essentially this variable gets reassigned with the new values, as we mostly think of the `mutate()` function as creating new variables.

```{r}
guidelines %<>%
  dplyr::mutate(unit = replace_na(unit, "%"))

# now just to show these rows
guidelines %>%
  filter(unit == "%")
```

Let's also move `unit` to be the last column. We can use the `relocate()` function of the `dplyr` package to do this. For more information about the `relocate()` function see [here](https://dplyr.tidyverse.org/reference/relocate.html){target="_blank"}.

```{r}
guidelines %<>%
  relocate(unit, .after = last_col())
```

To remove all of the remaining extra characters and words we will again use the `stringr` package. This time we will use the `str_remove_all()` function to remove all instances of these characters.

```{r}
guidelines <- as_tibble(
  map(guidelines, str_remove_all,
    pattern = "\\) per day|\\) of total daily energy|\\*"
  )
)
guidelines
```

Nice! That's pretty clean but we can do a bit more.

### **Data type conversion**
***

One of the next things to notice about our data is all of our variables are of class character, which is not how we want them to be.

For example, the optimal amounts of consumption are currently of class character, which is indicated by the `<chr>` just below the column names/variable names of the `guidelines` tibble:

```{r}
guidelines
```


To convert these values to numeric we use the `mutate()` and `across()` functions of the `dplyr` package.

The `across()` function has two main arguments: (i) the columns you want to operate on and (ii) the function or list of functions to apply to each column. In this case if we look at the beginning of the `guidelines` tibble, we can see that `optimal`, `lower` and `upper` should be converted. As these three columns are sequential, we can simply put a `:` between `optimal` and `upper` to indicate that we want all the variables in between these columns to be converted. 

```{r}
guidelines %<>%
  mutate(across(lower:upper, as.numeric))
guidelines
```

Great! Now these variables are of class `<dbl>` (stands for double) which indicates that they are numeric. Here is a [link](http://uc-r.github.io/integer_double/){target="_blank"} for more information on numeric classes in R.

If we had not replaced the `"·"` interpunct values to a period, conversion from character to numeric would be problematic and would result in NA values.

### **Data value reassignments**
***

We seem to have lost the word `"beverages"` from the `"sugar-sweetened beverages"` category,  as well as `"fatty acids"` from the `"seafood omega 3 fatty acids"`, and the `"polyunsaturated fatty acids"` categories as the full category name was listed on two lines within the table. We would like to replace these values with the full name. 

To select the `food` variable we will show you several options. Only a couple will work well with reassigning the data in that particular variable within `guidelines` without assigning an intermediate data object. We will look using `mutate_at()`, `pull()`, `select()`, and two styles of brackets `["variable name"]` and `[["variablename"]]`.

The bracket `["variable name"]` option and the `select()` option will grab a tibble (data frame) version of the food column out of guidelines. However we can't start commands with select for assignments.

```{r}
guidelines["food"] # same output as select
select(guidelines, "food") # same output as brackets
```


`pull()` and the bracket `[["variable name"]]` option in contrast, will grab the vector version of the food data:

```{r}
pull(guidelines, "food") # get character vector not a tibble
# bracket option:
guidelines[["food"]] # get character vector not a tibble
```

The `pull()` function can be very useful when combined with other functions (for example you typically want to use a vector with the `str_replace()` function), but just like select, we can't start assignments with `pull()`.


This is not possible and will result in an error:
```{r, eval = FALSE}
select(guidelines, food) <-
  str_replace(
    pull(guidelines, "food"),
    pattern = "sugar-sweetened",
    replacement = "sugar-sweetened beverages"
  )

guidelines %>% select(food) <-
  str_replace(
    pull(guidelines, "food"),
    pattern = "sugar-sweetened",
    replacement = "sugar-sweetened beverages"
  )
```

This will only print the result, but not reassign the food variable values:

```{r}
guidelines %>%
  pull(food) %>%
  str_replace(
    pattern = "sugar-sweetened",
    replacement = "sugar-sweetened beverages"
  )
```   

Using `select()` would work as well to print the result (although the result structure is different):

```{r}
guidelines %>%
  select(food) %>%
  str_replace(
    pattern = "sugar-sweetened",
    replacement = "sugar-sweetened beverages"
  )
```

#### {.think_question_block}

<u>Question opportunity:</u> 

Why do these commands not reassign the food variable values?

####

The bracket option is great alternative and allows us to reassign the values within guidelines easily. Either of the two styles of brackets: `["variable name"]` and `[["variablename"]]` will work.

```{r}
# 1st method: `["variable name"]`
# Replacing "sugar-sweetened" with "sugar-sweetened beverages"
guidelines["food"] <-
  str_replace(
    pull(guidelines, "food"),
    pattern = "sugar-sweetened",
    replacement = "sugar-sweetened beverages"
  )

# 2nd method: `[["variablename"]]`
# Replacing "seafood omega-3" with"seafood omega-3 fatty acids"
guidelines[["food"]] <-
  str_replace(
    pull(guidelines, "food"),
    pattern = "seafood omega-3",
    replacement = "seafood omega-3 fatty acids"
  )

guidelines
```


Finally, the best option is probably the `mutate_at()` function from `dplyr`. In this case we need to include `~` in front of the function that we would like to use on the values in our `food` variables. We also include `.` as a replacement to reference the data that we want to use within `str_replace()` (which in this case is the `food` variable values of `guidelines`).

Notice we didn't need this when we previously use `mutate_at()` with the `as.numeric()` function. This is because the `str_replace()` function requires us to specify what data we are using as one of the arguments, while `as.numeric()` does not.

```{r}
# Replacing "polyunsaturated" with"polyunsaturated fatty acids"
guidelines %<>%
  mutate_at(
    vars(food),
    ~ str_replace(
      string = .,
      pattern = "polyunsaturated",
      replacement = "polyunsaturated fatty acids"
    )
  )

guidelines
```

This might be considered a better option because it is more readable as to where the `food` data came from that we are replacing values within.

There is one last minor detail... the `direction` variable has leading spaces still. We can use `str_trim()` to fix that!

```{r}
guidelines %<>%
  mutate_at(vars(direction), str_trim)

guidelines
```

OK! Now we know how much of each dietary factor we generally need for optimal health according to the guidelines used in this article.



### **Comparing data**
***

Recall that the main goal of pulling the guideline amounts from the pdf was that we would like to see how the mean consumption rates for the different groups of people compared to the optimal intake guidelines.

One way we could do this is to calculate a consumption percentage of the optimal value.

To calculate this it would be helpful to put the guideline amounts with the average consumption rates into the same tibble, especially because the observed consumption data (`diet_data` and `sep_age_diet_data`) are very different dimensions from the `guidelines` data. 

In order to create a tibble with our observed consumption rates with the suggested consumption rates, we will join our data using `dplyr`. In order to do so it is important that our different data sets have at least one column with the same values that we can use to join them together. So let's first assess if that is the case.


```{r}
distinct(diet_data, dietary_risk)
select(guidelines, food)
```

We are actually pretty close: there are 15 dietary factors in each data set, and the names are nearly the same. To make them match completely, we can see that we need to remove the `"Diet low in"` and `"Diet high in"` phrases from the observed consumption data.

```{r}
diet_data %<>%
  mutate_at(
    vars(dietary_risk),
    ~ str_remove(
      string = .,
      pattern = "Diet low in |Diet high in "
    )
  )

sep_age_diet_data %<>%
  mutate_at(
    vars(dietary_risk),
    ~ str_remove(
      string = .,
      pattern = "Diet low in |Diet high in "
    )
  )
```

Also let's double check that the two observed files have the same exact values for dietary factor names. 

We can use the `setequal()` function from `dplyr` to check that the unique values for `dietary_risk` are the same for both `diet_data` and `sep_age_diet_data`.


```{r}
setequal(
  distinct(diet_data, dietary_risk),
  distinct(sep_age_diet_data, dietary_risk)
)
```
Great!

Note that the default of the set_equal function ignores the order of values in rows. So we still don't know if the order is the same.

We can check using the `all_equal` function of `dplyr` which reports back clues about what might be different if anything. Importantly we are including `ignore_row_order = FALSE` as the default is `TRUE`.

```{r}
all_equal(distinct(diet_data, dietary_risk),
  distinct(sep_age_diet_data, dietary_risk),
  ignore_row_order = FALSE
)
```

Looks like they are not in the same order. 

Note that if any of the values are different, `all_equal()` will first report this and will not report that the rows are in a different order.

***

<details> <summary> Click here to see a toy example about how the three comparison functions (`setequal()`, `all_equal()` (also `all.equal()` for `tbl_df`), and `setdiff()`) work in `dplyr`. </summary> 

It's important to realize that row order is ignored by both`setequal()` and `setdiff()`. 

Now let's compare two tibbles that have different row orders and different values. 

Here are our tibbles to compare:
```{r}
X <- tibble(test = c("A", "B", "AC", "D"))
Y <- tibble(test = c("A", "D", "A", "B"))
X
Y
class(Y)
```

Since we are using tibbles, which are of class `tbl_df` we can use either `all_equal` or `all.equal()`.
Notice that it doesn't report rows being a different order because it first tells what rows have unique values or rows with a value that has a different number of frequency.

```{r}
all_equal(X, Y, ignore_row_order = TRUE)
all_equal(X, Y, ignore_row_order = FALSE)
# Doesn't report rows being different order
all.equal(X, Y, ignore_row_order = TRUE)
all.equal(X, Y, ignore_row_order = FALSE)
# Doesn't report rows being different order
```

`setequal()` does not provide clues about what is different but TRUE (no differences) or FALSE (indicating at least one difference).

```{r}
# Reports false indicating at least one difference
setequal(X, Y)
```

`setdiff()` tells us what is different and is dependent on the order of the objects compared, but prioritizes the values that are unique to each.

```{r}
# This reports what is unique to X
setdiff(X, Y)
# This reports what is unique to Y - nothing in this case
setdiff(Y, X)
```

Now let's make it so that only the order is different:
```{r}
Y <- tibble(test = c("A", "D", "AC", "B"))
X
Y
```

Now that there are no values that are unique to either X or Y, `all_equal()` reports that there is a different order. 
```{r}
all_equal(X, Y, ignore_row_order = TRUE)
all_equal(X, Y, ignore_row_order = FALSE) # reports diff order
```


Remember `setequal()` ignores order and gives a value of TRUE for no differences.
```{r}
# It reports no difference!
setequal(X, Y)
```

`setdiff()` also ignores order and shows no differences.
```{r}
setdiff(X, Y)
```

If we have different column/variable names this makes comparisons more challenging. Columns will be identified for having different names.
```{r}
X <- tibble(colname1 = c("A", "B", "AC", "D"))
Y <- tibble(colname2 = c("A", "D", "AG", "B"))
```

`all_equal()` will simply report that col names are different

```{r}
all_equal(X, Y, ignore_row_order = TRUE)
all_equal(X, Y, ignore_row_order = FALSE)
```

`seteqaul()` will report `TRUE` or `FALSE` to indicate either a difference in columns or rows

```{r}
setequal(X, Y)
```

`setdiff()` requires that column names be the same so this will cause an error:

```{r, eval = FALSE}
setdiff(X, Y) # This will not work
```

</details> 

***

OK, let's keep going with our data.

How similar are the guidelines tibble and the observed consumption tibbles?

```{r}
setequal(
  distinct(diet_data, dietary_risk),
  select(guidelines, food)
)
```

OK, looks like we have some different values.

Let's use the `setdiff` function to get more information about what is different between the values.

```{r, eval = FALSE}
setdiff(
  distinct(diet_data, dietary_risk),
  select(guidelines, food)
)
```

:( That wont work. This is because `setdiff()` requires that the column names are the same in the objects that we are comparing.


We can use the `rename()` function from `dplyr` to do this. We list the value that we want to change to first. We find "food" more intuitive now so we are going to change "dietary_risk" to "food" for the `diet_data` and the `sep_age_diet_data`:

```{r}
diet_data %<>%
  dplyr::rename(food = dietary_risk)
sep_age_diet_data %<>%
  dplyr::rename(food = dietary_risk)
```


```{r}
setdiff(
  distinct(diet_data, food),
  select(guidelines, food)
)
```

Great, now we know that the `fiber` value appears to be different between the two.


Checking our original files we can see that the British spelling "fibre" is used in the table from the article (that we used to create `guidelines`), in contrast to the American spelling "fiber" used in the CSV files.

Let's stick with the American spelling, so we will replace `"fibre"` in the guideline tibble.

```{r}
guidelines %<>%
  mutate_at(
    vars(food),
    ~ str_replace(
      string = .,
      pattern = "fibre",
      replacement = "fiber"
    )
  )

guidelines %>%
  filter(food == "fiber")
```

Now let's check again to see that our food values match between the guidelines and the observed consumption data tibbles.

```{r}
setdiff(
  select(guidelines, food),
  distinct(diet_data, food)
)

setdiff(
  select(guidelines, food),
  distinct(sep_age_diet_data, food)
)
```

Great!  There are no differences :)

### **Joining data**
***

Now we can put our guideline data together with the `diet_data` and the `sep_age_diet_data`.

Remember that the `food` data in our `guidelines` tibble is not necessarily in the same order as that of the consumption data tibbles. Thus this could be a problem if we decided to expand the `guidelines` rows (to repeat for the number of fruit observations etc.) and add them to our observed consumption tibbles by binding them together by column. 

```{r, echo = FALSE, outwidth = "50%", fig.align= "center"}
knitr::include_graphics(here("img", "bind.png"))
```

#### [[source]](https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)

In that case we could use the `arrange()` function of `dplyr` to sort the data alphabetically.

However, we will instead use a joining function of `dplyr`. These functions combine the data together based on **common values** and don't require the rows to be in the same order. There are a variety of options.

```{r, echo = FALSE, outwidth = "50%", fig.align= "center"}
knitr::include_graphics(here("img", "join.png"))
```

#### [[source]](https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)


In our case we would like to retain all of the values of `diet_data` and `sep_age_diet_data`. We would like to add new columns of values to these tibbles that correspond to the guideline information about amounts of consumption for each food type in the `guidelines` tibble. We shouldn't have any values of `food` in `guidelines` that don't match, so we will not get any `NA` values. Therefore, in our case any of the mutating join functions should result in the same output.

It's important to check if we have any overlapping variable names before we join the data. Otherwise, these columns will either be used to identify which rows to join, or new copies of the columns, with a default name to distinguish the columns of one data set from those of the other, will be created. We can use the base R function `names()`  and the `intersect()` function of the `dplyr` package to identify which column names are common to our two data sets.

```{r}
dplyr::intersect(
  names(diet_data),
  names(guidelines)
)
```

So it looks like the `"upper"` , `"lower"` and `"unit"` variable names are overlapping. Therefore, to distinguish the names later we will rename the guideline `"upper"` , `"lower"` and `"unit"` variables.

We will again use the `rename` function from the `dplyr` package. We can list multiple variables to rename and separate each with a comma. We need to list the new names first.

```{r}
guidelines %<>%
  rename(
    upper_optimal = upper,
    lower_optimal = lower,
    unit_optimal = unit
  )

guidelines
```

It is also a good idea to check our units to make sure they are the same for both `guidelines` and the observed consumption tibbles(`diet_and_guidelines` and `all_age_diet_and_guidelines`).

Let's take a look with the `count()` function of the `dplyr` package. We will also use the `bind_cols()` function of `dplyr` to put the data together so that we can see it easily.

```{r}
dplyr::bind_cols(
  count(diet_data, unit, food),
  count(sep_age_diet_data, unit, food),
  count(guidelines, unit_optimal, food)
)
```

We can see that the only potential issue is the `seafood omega-3 fatty acids` data which is in g/day for the observed data(`diet_data` and `all_age_diet_and_guidelines`), but in mg/day in the `guidelines` data.

We can account for this by dividing the `guidelines` `seafood omega-3 fatty acids data` by 1000 to convert it to grams from milligrams.

To do this we will use the `if_else()` function in the `dplyr` package. This allows us to specify a condition (in this case if the unit is `"mg"`), as well as values if this condition is met (true), or if the condition is not met (false). 

In the following we mutate the values in each of the guideline numeric columns (`lower`, `optimal` and `upper`) one at a time. When we refer to `lower` for example we refer to the values in the column/variable. So if the condition is not met, then the original value is retained. We will also replace `"mg"` with `"g"` after everything is converted to grams.



```{r}
# "lower_optimal" variable
guidelines %<>%
  mutate(lower_optimal = dplyr::if_else(
    condition = unit_optimal == "mg",
    true = lower_optimal / 1000,
    false = lower_optimal
  ))
# Explanation for the use of "if_else()" here
# If the "unit_optimal" variable is in "mg", we convert the corresponding "lower_optimal" (currently in mg) variable to grams (g) by dividing by 1,000.
# If not, the corresponding "lower_optimal" (already in g) is not changed

# "optimal" variable
guidelines %<>%
  mutate(optimal = if_else(condition = unit_optimal == "mg",
    true = optimal / 1000,
    false = optimal
  ))

# "upper_optimal" variable
guidelines %<>%
  mutate(upper_optimal = if_else(condition = unit_optimal == "mg",
    true = upper_optimal / 1000,
    false = upper_optimal
  ))

# replace "mg" with "g" in the "unit_optimal" variable
guidelines %<>%
  mutate(unit_optimal = if_else(condition = unit_optimal == "mg",
    true = "g",
    false = unit_optimal
  ))

guidelines
```


***
<details> <summary> Click here to see a couple of other ways to do this: </summary>

```{r, eval = FALSE}
# Another possible way with dplyr::case_when():
guidelines %<>%
  mutate(lower_optimal = case_when(
    unit_optimal == "mg" ~ lower_optimal / 1000,
    unit_optimal != "mg" ~ lower_optimal
  ))

# Or could use this:
guidelines %<>%
  mutate_at(
    vars(unit_optimal),
    ~ str_replace(
      string = .,
      pattern = "mg",
      replacement = "g"
    )
  )
```

</details>

***



In contrast we could have changed or mutated the values for `lower_optimal`, `optimal`, `upper_optimal` all at once like this using the `funs()` argument in `mutate_at()` of `dplyr`.

```{r, eval = FALSE}
guidelines[str_which(
  string = guidelines[["food"]],
  pattern = "seafood omega-3 fatty acids"
), ] <- guidelines %>%
  filter(food == "seafood omega-3 fatty acids") %>%
  mutate_at(vars(lower_optimal:upper_optimal), funs(. / 1000))
```


Now we are ready to join the data!

Again, we would like to add new columns of values to `diet_data` and `all_age_diet_and_guidelines` that correspond to the guideline information about amounts of consumption for each food type in the `guidelines` tibble. So we will join the data based on the `food` variable values. We will use the `full_join()` function of the `dplyr` package.

```{r}
diet_and_guidelines <- diet_data %>%
  dplyr::full_join(guidelines, by = "food")

all_age_diet_and_guidelines <- sep_age_diet_data %>%
  full_join(guidelines, by = "food")

glimpse(diet_and_guidelines)
glimpse(all_age_diet_and_guidelines)
```

It's always a good idea to check that the values are what you expect after merging. 

```{r}
diet_and_guidelines %>%
  count(food, optimal)

all_age_diet_and_guidelines %>%
  count(food, optimal)

# For easy comparison we will arrange by food alphabetically
arrange(guidelines, food)
```

Looks good!
 
 
### **Calculating relative consumption**
***

Recall that our aim is to compare the consumption rates of these dietary factors by different groups of people, and ideally, to facilitate cross-factor comparisons, we want to consider consumption rates relative to the optimal guidelines.

To do this, let's calculate values of consumption that are relative to the suggested guidelines.

There are a few approaches we could take. One is to calculate a `"percentage of optimal consumption"` using the mean value for each observed factor relative to its optimal value. To do this we will use the `mutate()` function of the `dplyr`package. This will create a new variable called `Relative_Percent` that will be equal to the ratio of the `mean` value and the `optimal` value, multiplied by 100, to create a percentage relative to the optimal amount suggested.

```{r}
diet_and_guidelines %<>%
  mutate(Relative_Percent = (mean / optimal) * 100)

all_age_diet_and_guidelines %<>%
  mutate(Relative_Percent = (mean / optimal) * 100)
```

Another option is to incorporate the range of optimal intakes and the direction that is associated with health risk. If the direction of risk is `high` and the consumption was greater than the `optimal` mean value, than the percentage is calculated based on the `upper_optimal` value, while if the direction of risk is `low` and the consumption is less than the `optimal` mean value, then the percentage is calculated based on the `lower_optimal` value. We will use the `case_when()` function of the `dplyr` package to do this. This allows us to specify values (indicated on the right side of the `~`symbol) based on specific conditions (indicated on the left side of the `~` symbol). We can specify multiple conditions using the `&` symbol.

```{r}
diet_and_guidelines %<>%
  mutate(range_percent = case_when(
    direction == "high" ~ (mean / upper_optimal) * 100,
    direction == "low" ~ (mean / lower_optimal) * 100
  ))

all_age_diet_and_guidelines %<>%
  mutate(range_percent = case_when(
    direction == "high" ~ (mean / upper_optimal) * 100,
    direction == "low" ~ (mean / lower_optimal) * 100
  ))


diet_and_guidelines %<>%
  mutate(percent_over_under = case_when(
    direction == "high" & mean > upper_optimal ~
    ((mean - upper_optimal) / upper_optimal) * 100,
    direction == "high" & mean <= upper_optimal ~ 0,
    direction == "low" & mean >= lower_optimal ~ 0,
    direction == "low" & mean < lower_optimal ~
    ((lower_optimal - mean) / lower_optimal) * -100
  ))


all_age_diet_and_guidelines %<>%
  mutate(percent_over_under = case_when(
    direction == "high" & mean > upper_optimal ~
    ((mean - upper_optimal) / upper_optimal) * 100,
    direction == "high" & mean <= upper_optimal ~ 0,
    direction == "low" & mean >= lower_optimal ~ 0,
    direction == "low" & mean < lower_optimal ~
    ((lower_optimal - mean) / lower_optimal) * -100
  ))
```

Another option is to create a binary outcome indicating whether optimal consumption was achieved or not.

```{r}

diet_and_guidelines %<>%
  mutate(opt_achieved = if_else(
    condition = direction == "low" & mean > lower_optimal |
      direction == "high" & mean < upper_optimal,
    true = "Yes",
    false = "No"
  ))

all_age_diet_and_guidelines %<>%
  mutate(opt_achieved = if_else(
    condition = direction == "low" & mean > lower_optimal |
      direction == "high" & mean < upper_optimal,
    true = "Yes",
    false = "No"
  ))

glimpse(diet_and_guidelines)
glimpse(all_age_diet_and_guidelines)
```

One last thing that can be useful with data wrangling is to **reshape** the data into what is called the **long** format. This is very useful for creating visualizations with a powerful and flexible package called `ggplot2`.

To coerce an object into long format, we create **more rows and fewer columns**. For more information about this, please see the Data Visualization section of this [case study](https://www.opencasestudies.org/ocs-bp-opioid-rural-urban/#Rural_and_Urban_areas/#:~:text=%20the%20data%20was%20presented%20in%20a%20format%20that%20is%20called%20long%20format.){target="_blank"}.

We would like to put together the different types of percentages of the optimal intake that we just calculated.

To get our data in long format we can use the `pivot_longer()` function of the `tidyr` package. We will list the columns that we want to come together into the longer format using the `cols` argument. The `names_to` argument indicates the name of the variable that will include the character information about the values that we are consolidating, i.e., this variable contains the names of the columns that we are bringing together. The `values_to` is the name of the column that will contain the values of the columns we are consolidating. We can use `contains()` of the `tidyr` package to look at the variables with names that contain `"percent"` .

```{r}
diet_and_guidelines_long <- diet_and_guidelines %>%
  pivot_longer(
    cols = contains("percent"),
    names_to = "percent_type",
    values_to = "percent"
  )
```

***
<details> <summary> Click here to see how this would be done with the older version of this function, called `gather()`: </summary>

Recall that for `pivot_longer()`, the `cols` argument is used. For `gather()` we would simply list the variables that we wish to consolidate. The `names_to` and `values_to` arguments of `pivot_longer()` are equivalent to the `key` and `value` arguments in `gather()` respectively.

We would get an identical output from the two methods. We can check that with `setequal()`.
```{r}
diet_and_guidelines_long2 <- diet_and_guidelines %>%
  gather(contains("percent"),
    key = percent_type,
    value = percent
  )

setequal(diet_and_guidelines_long, diet_and_guidelines_long2)
```

</details>

***

Let's do the same for the age separated data.

```{r}
all_age_diet_and_guidelines_long <- all_age_diet_and_guidelines %>%
  pivot_longer(
    cols = contains("percent"),
    names_to = "percent_type",
    values_to = "percent"
  )
```

We now have the main variables and data formats that we need to proceed with the next steps of our analysis, including data exploration and eventually, modeling.

Now we will save our wrangled data. We will save it as an rda file for ourselves and as csv files, as this is often a good option for collaborators. We need a separate csv file for each tibble. We will save these files in a directory called "wrangled" within our "data" directory of our project.

```{r}
save(all_age_diet_and_guidelines, all_age_diet_and_guidelines_long, diet_and_guidelines, sep_age_diet_data, file = here::here("data", "wrangled", "wrangled_data.rda"))

write_csv(all_age_diet_and_guidelines, file = here::here("data", "wrangled", "all_age_diet_and_guidelines.csv"))
write_csv(all_age_diet_and_guidelines_long, file = here::here("data", "wrangled", "all_age_diet_and_guidelines_long.csv"))
write_csv(diet_and_guidelines, file = here::here("data", "wrangled", "diet_and_guidelines.csv"))
write_csv(sep_age_diet_data, file = here::here("data", "wrangled", "sep_age_diet_data.csv"))
```

## **Data Exploration**
***

If you have been following along but stopped you could load the wrangled data like so:

```{r}
load(here::here("data", "wrangled", "wrangled_data.rda"))
```

***
<details> <summary> If you skipped the data wrangling section click here. </summary>

First you need to install and load the `OCSdata` package:

```{r, eval=FALSE}
install.packages("OCSdata")
library(OCSdata)
```

Then, you may load the wrangled data using the following code:

```{r, eval=FALSE}
wrangled_rda("ocs-bp-diet", outpath = getwd())
load(here::here("OCSdata", "data", "wrangled", "wrangled_data.rda"))
```

If the package does not work for you, alternatively, an RDA file (stands for R data) of the data can be found [here](https://github.com//opencasestudies/ocs-bp-diet/tree/master/data/wrangled) or slightly more directly [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-diet/master/data/wrangled/wrangled_data.rda). Download this file and then place it in your current working directory within a subdirectory called "wrangled" within a subdirectory called "data" to copy and paste our code. We used an RStudio project and the [`here` package](https://github.com/jennybc/here_here) to navigate to the file more easily. 

```{r}
load(here::here("data", "wrangled", "wrangled_data.rda"))
```


***
<details> <summary> Click here to see more about creating new projects in RStudio. </summary>

You can create a project by going to the File menu of RStudio like so:


```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "New_project.png"))
```

You can also do so by clicking the project button:

```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "project_button.png"))
```

See [here](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects) to learn more about using RStudio projects and [here](https://github.com/jennybc/here_here) to learn more about the `here` package.

</details>
***
</details>
***
 
### **Exploring age collapsed data**
***

Let's start by taking a look at the  percent of consumption, across all dietary factors. Again we will use the base R `summary()` function:

```{r}
diet_and_guidelines %>%
  select(Relative_Percent) %>%
  summary()
```

Wow! Some of the values are nearly zero, suggesting that some people are consuming basically zero percent of what is suggested for optimal health. On the other hand, for some dietary factors people are consuming over 13,000 percent what is suggested! 

This is why it is important to look at the direction of consumption that could be harmful. For example if there is a population that consumes large amounts of vegetables this could be a good thing, but if there is a population consuming large amounts of sodium this would be a bad thing. 

Let's take a look to see what dietary factors are at the extremes by arranging the data using the `arrange()` function of the `dplyr` package. We can arrange by smallest to largest using the default and we can arrange largest to smallest using the minus sign `-`.

```{r}
diet_and_guidelines %>%
  arrange(-Relative_Percent) %>%
  glimpse()
```

OK, so it looks like sugar-sweetened beverages are really over-consumed in some parts of the world!

Recall from the supplementary table from the article that over-consumption of sugar-sweetened beverages is associated with both Diabetes mellitus type 2 and Ischemic heart disease. This [article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5133084/){target="_blank"} discusses some of the controversy over the potential health risks associated with high consumption of sugar.

It still looks quite bad if we look at the other calculated percentage values. 
```{r}
diet_and_guidelines %>%
  select(contains("percent")) %>%
  summary()
```
So some places are still consuming 8,000 percent more than the upper range of the suggested optimal intake.

Let's take a look at global levels:
```{r}
diet_and_guidelines %>%
  filter(food == "sugar-sweetened beverages" &
    location_name == "Global")
```

For those who are less familiar with the metric system where grams are equivalent to milliliters, it may be useful to realize how many fluid ounces the max amount of consumption per day (~444g for the `upper` value for Guatemala) actually is. 

There are 0.35247 ounces in one gram.

```{r}
# top amount in ounces
0.35247 * 444.4002
```

OK, so the top consumers are drinking about 87 fluid ounces per day. Since there are 12 ounces in a single can of soda, this is about `r 87/12` sodas per day. Globally on average, males are drinking around `r round((65.5*0.35247)/12, 3)` sodas worth of sweetened beverages, while females are drinking about `r round((47.7*0.35247)/12, 3)`.


Let's take a look at what is under-consumed:

```{r}
diet_and_guidelines %>%
  arrange(Relative_Percent) %>%
  glimpse()
```

On the other hand, it looks like some places are consuming almost no polyunsaturated fatty acids. These are fats that found in plant-based sources like seeds and nuts. According to an [article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4859401/){target="_blank"} about polyunsaturated fatty acids and its influence on health:

> Coronary heart disease (CHD) is the leading cause of death worldwide ... The types of dietary fats consumed play an important role in CHD risk, representing key modifiable risk factors...In particular, higher intakes of trans fat (TFA) and of saturated fat (SFA) replacing ω‐6 (n‐6) polyunsaturated fat (PUFA) are associated with increased CHD... whereas higher intake of PUFA replacing either SFA or carbohydrate is associated with lower risk.


Let's get an idea about how countries compare in terms of how many of the dietary factors are consumed at the optimal level (the `opt_achieved` variable).

```{r}
diet_and_guidelines %>%
  count(opt_achieved)
```

Looks like overall, only `r round(1520/4360*100, 2)`%  of dietary factors for all tested populations were at optimal levels.

Let's get an idea about how countries compare on this metric.

#### {.scrollable }
```{r}
diet_and_guidelines %>%
  count(opt_achieved, location_name) %>%
  filter(opt_achieved == "Yes") %>%
  arrange(-n) %>%
  # this allows us to show the full output
  print(n = 1e3)
```
####

It looks as though on average the populations (both male and female separately) in Qatar, Rwanda, and Turkey consumed the optimal level of intake for the largest number of dietary factors (13 out of 30 (for the 15 dietary factors for males and females)).

In contrast, the Czech Republic, Greenland, Hungary, Slovakia, Slovenia, and the United States had the poorest consumption rates (27 out of 30 were not at optimal levels).

#### {.scrollable }
```{r}
diet_and_guidelines %>%
  count(opt_achieved, location_name) %>%
  filter(opt_achieved == "No") %>%
  arrange(-n) %>%
  # to show full output
  print(n = 1e3)
```
####

Let's look at the raw US data:
```{r}
diet_and_guidelines %>%
  filter(location_name == "United States") %>%
  glimpse()
```

Let's see how males and females compare for achieving the optimal intake, across all countries:

```{r}
count(diet_and_guidelines, sex, opt_achieved)
```
Looks pretty similar, but it may be a bit better for females. We will evaluate this further below.

In order to assess what we have observed so far in a graphical way, we will make some data visualizations. One way we can do this is with the `ggplot2` package.
The [ggplot2](https://ggplot2.tidyverse.org/){target="_blank"} package creates plots by building the plot components piece by piece, using `"layers"`.

With `ggplot2` we select what data we would like to plot using the first function (`ggplot()`) and then we add on additional layers of complexity (these layers can even involve different data). The `aes()` argument specifies what aspects of the data will be plotted where. The `geom_*` function specifies what type of plot to create (e.g. `geom_histogram()` creates a histogram). Notice in the following code how there is a plus sign between the `ggplot()` function and the `geom_bar()` function; this is how we combine different plot layers. 

We will see later how we can add many layers to plots with `ggplot2`. For additional information on using `ggplot2`, see this [case study](https://opencasestudies.github.io/ocs-bp-co2-emissions/){target="_blank"}.

```{r}
diet_and_guidelines %>%
  ggplot(aes(opt_achieved, fill = sex)) +
  geom_bar(position = "dodge")
```

Continuing with `ggplot2` we will now create a different type of plot. This time we will create a series of box plots. We will use the `facet_wrap()` function of ggplot2 to allow us to create many different plots simultaneously. In this case we can look at box plots for the different dietary factors colored by sex. The `scales` argument when set to `"free"` means that each of the sequential plot created by the facet can have a different scale for the y axis, otherwise, by default they are constrained to the same scale. Since our dietary factors are measured on very different scales, we do not want this constraint here.


```{r}
# we will create a new variable with food names with new lines
# str_replace() is used here because we are only replacing the first instance of space
# otherwise str_replace_all() should be used
diet_and_guidelines %<>%
  mutate(
    food_to_plot =
      str_replace(
        string = pull(diet_and_guidelines, food),
        pattern = " ",
        replacement = "\n"
      )
  )

diet_and_guidelines %>%
  ggplot(aes(
    y = Relative_Percent,
    x = sex,
    color = sex
  )) +
  geom_boxplot() +
  facet_wrap(~food_to_plot,
    scales = "free",
    # specifies the number of rows of subplots
    nrow = 3,
    # moves the food label to the right
    strip.position = "right"
  ) +
  # this changes the size of the font for the labels
  theme(
    strip.text.y = element_text(size = 8),
    axis.text.x = element_text(
      angle = 70,
      hjust = 1
    )
  )
```


If we just look at differences by sex for the specific dietary factors,  males appear to potentially consume more of many of the factors, including possibly more sodium, fiber, calcium, red meat, and sugar-sweetened beverages than females. Females may consume more fruit.

### **Exploring the data separated by age**
***

Now we will take a look at the data that is separated by age groups.

First, recall that we have 15 different age groups starting from age 25 to 95 plus.
```{r}
all_age_diet_and_guidelines %>%
  count(age_group_name)
```



```{r, fig.height=15}
sep_age_diet_data %>%
  ggplot(aes(y = mean, x = age_group_name, col = sex)) +
  geom_boxplot() +
  facet_wrap(~food, scales = "free", nrow = 6) +
  theme(
    axis.text.x = element_text(angle = 70, hjust = 1),
    strip.text.x = element_text(size = 8)
  )
```

We can see from these plots that there appear to be age differences and gender differences for some of the different dietary factors. We will work to create clearer figures later on. However these initial figures have given us a better sense of the data that we are working with.


## **Data Analysis**
***

If you have been following along but stopped you could load the wrangled data like so:

```{r}
load(here::here("data", "wrangled", "wrangled_data.rda"))
```

***
<details> <summary> If you skipped the data wrangling section click here. </summary>

First you need to install and load the `OCSdata` package:

```{r, eval=FALSE}
install.packages("OCSdata")
library(OCSdata)
```

Then, you may load the wrangled data using the following code:

```{r, eval=FALSE}
wrangled_rda("ocs-bp-diet", outpath = getwd())
load(here::here("OCSdata", "data", "wrangled", "wrangled_data.rda"))
```

If the package does not work for you, alternatively, an RDA file (stands for R data) of the data can be found [here](https://github.com//opencasestudies/ocs-bp-diet/tree/master/data/wrangled) or slightly more directly [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-diet/master/data/wangled/wrangled_data.rda). Download this file and then place it in your current working directory within a subdirectory called "wrangled" within a subdirectory called "data" to copy and paste our code. We used an RStudio project and the [`here` package](https://github.com/jennybc/here_here) to navigate to the file more easily. 

```{r}
load(here::here("data", "wrangled", "wrangled_data.rda"))
```


***
<details> <summary> Click here to see more about creating new projects in RStudio. </summary>

You can create a project by going to the File menu of RStudio like so:


```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "New_project.png"))
```

You can also do so by clicking the project button:

```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "project_button.png"))
```

See [here](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects) to learn more about using RStudio projects and [here](https://github.com/jennybc/here_here) to learn more about the `here` package.

</details>
***
</details>
***

Recall what our main questions were:

#### {.main_question_block}
<b><u> Our main questions are: </u></b>

1) What are the global trends for potentially harmful diets?
2) How do males and females compare?
3) How do different age groups compare for these dietary factors?
4) How do different countries compare? In particular, how does the US compare to other countries in terms of diet trends?

####

We have some general sense about global trends for the risk-associated dietary factors, however we want to know more.

We are interested in how much the genders differ, how much the 15 different age groups differ, and how the 195 countries compare. 

In order to make [inferences](https://www.britannica.com/science/inference-statistics) about these comparisons, it is helpful to perform statistical tests. These tests can help us to determine the strength of the association between the consumption of the dietary factors (our outcome variable) and sex, age group, and country identity (our predictor variables). One way to look at the strength of association between variables is to use a statistical method called **regression**.

If we measure consumption using either raw consumption or the percent of optimal consumption, then our outcome variable is what we call **continuous**, because our values can take on any numeric value within the range of possible values.  To look at the strength of association with a continuous outcome, we can use **linear regression**.

If, instead, we measure consumption by whether or not the optimal level of consumption was achieved ("yes" or "no"), then our outcome would be considered **binary**, meaning it can take only two possible values.  To look at the strength of association with a binary outcome, we can use **logistic regression.**  There are other regression method for different types of outcomes as well; see [here](https://www.analyticsvidhya.com/blog/2015/08/comprehensive-guide-regression/){target="_blank"} for a guide on different types of regression methods.

In this case study, we will focus on the outcome of the percent of optimal consumption (`Relative_Percent`), so we will focus our analysis on linear regression.  

You may have already learned that one can compare a continuous outcome between two groups using a $t$-test. For more information on the $t$-test see this [case study](https://opencasestudies.github.io/ocs-bp-rural-and-urban-obesity/){target="_blank"}.  And perhaps you have heard about ANOVA (ANalysis Of VAriance) for comparing a continuous outcome across more than two groups.  It turns out that both the $t$-test and ANOVA are specialized types of [linear regression](https://lindeloev.github.io/tests-as-linear/){target="_blank"}. We will use each of these tests to investigate patterns of consumption for dietary factors that contribute to health risk and we will look at how we can obtain equivalent results with regression.

### **Linear Regression**
***

So what is linear regression? How can we use regression to compare our groups of interest and look at the relationship between group identity and consumption of dietary factors associated with health risk?

The statistical version of the term regression was coined in 1877 in this [article](http://galton.org/essays/1870-1879/galton-1877-typical-laws-heredity.pdf ){target="_blank"} about the relationship between hereditary traits and population averages. The author particularly focused on [height](https://zenodo.org/record/1449548#.Xlf_9hNKihc){target="_blank"} and kinship or relatedness. The word itself means `"to go back to a simpler state"`. It was noticed that individuals with parents who had an extreme trait, such as exceptional height, tended to have a height more similar to the average of the population than the extreme height of their parents. For example if parents were very tall, their children were likely to be a bit shorter than their parents and therefore closer to the population average. Thus the children regressed towards the mean or in the author's words the offspring showed:

> "a *regression* towards mediocrity"

See [here](https://en.wikipedia.org/wiki/Regression_toward_the_mean){target="_blank"} for more information about this history.

When we think about this from a statistical standpoint, regression allows us to estimate or **regress** relationships between variables with a "simple" model. We do this by **estimating the mean** of an outcome, given a value of an input or predictor variable. This can be useful for **predicting future values** of the outcome based on the approximation of the real relationship between the variables within the model, or just for understanding how different variables are related to one another.

We will start by considering **simple linear regression**, where we have one continuous predictor variable and one continuous outcome variable, as shown below:
```{r, echo = FALSE, out.width = "400ptx"}
set.seed(15)
data_x <- sample(1:10, 10, replace = TRUE)
data_y <- data_x + rnorm(10, 0, 10)
thedata <- bind_cols(x = data_x, y = data_y)

ggplot(data = thedata, aes(x = x, y = y)) +
  geom_point() +
  theme(
    axis.text.y = element_text(size = 15),
    axis.text.x = element_text(size = 15)
  )
```

We want to identify a "best fit" line that summarizes the relationship between these two variables.  We can so this using the ordinary least squares method, which chooses the line that best fits the data by minimizing the sum of the squared vertical distances between each point and the line. In the above example, this line turns out to be:
```{r, echo = FALSE, out.width = "400ptx"}
set.seed(15)
data_x <- sample(1:10, 10, replace = TRUE)
data_y <- data_x + rnorm(10, 0, 10)
thedata <- bind_cols(x = data_x, y = data_y)

ggplot(data = thedata, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(
    method = "lm",
    se = FALSE,
    color = "black",
    formula = y ~ x
  ) +
  stat_regline_equation(size = 6) +
  theme(
    axis.text.y = element_text(size = 15),
    axis.text.x = element_text(size = 15)
  )
```

Fitting a line to the data like this allows us to create a formula for the line using an **intercept** and a **slope**, so that we can then estimate **mean** values of $Y$ (dependent/outcome variable) given known values of $X$ (independent/predictor/covariate/explanatory variable(s)). People will also say that we are "regressing $Y$ on $X$".
 
You may have seen the formula for a line written like this:

$$Y = mX + b$$ 

<center> or </center>
$$Y = aX + b$$

In this case $m$ or $a$ is the slope of the line and $b$ is a constant and represents the y-intercept or the point where the y axis is crossed by the line, when $x = 0$.

In regression, we usually write this model like this:

$$Y = \beta_{1}X +\beta_{0}$$

Now $\beta_{1}$, called "beta one", is our slope and $\beta_{0}$, called "beta zero" (or "beta naught"), is our intercept.  In our example above, the slope of the regression line is $\beta_{1} = 2.3$ and the intercept is $\beta_{0} = -6.6$.

Importantly the slope ($\beta_{1}$) gives us a quantitative measure of the relationship between the independent variable ($X$) on the dependent variable ($Y$).  In particular, $\beta_{1}$ tells how the expected difference in the $Y$ value for a difference of 1 unit in the $X$ value.

It's possible that the regression line will perfectly fit the data, and all points will lie on the line with no distance to the line:

```{r, echo = FALSE, out.width = "400ptx"}

data_x <- sample(1:100, 20, replace = TRUE)
data_y <- data_x + 10
thedata <- bind_cols(x = data_x, y = data_y)

ggplot(data = thedata, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(
    method = "lm",
    se = FALSE,
    color = "black",
    formula = y ~ x
  ) +
  stat_regline_equation(size = 6) +
  theme(
    axis.text.y = element_text(size = 15),
    axis.text.x = element_text(size = 15)
  )
```

In this case, the slope or $\beta_{1}$ is 1 and the intercept $\beta_{0}$ is 10 and every observed data point lies exactly on the line, e.g., we can see that when $X$ is 50, $Y$ is exactly 60. This is very unusual in statistical analysis however, as often the relationship between variables is more complicated and there is more noise in our data. In these other cases there will be greater distances between the line and the points. 

Like this regression:
```{r, echo = FALSE, out.width="400ptx"}
set.seed(13)
thedata %<>% mutate(y2 = rnorm(20, sd = 40))

ggplot(data = thedata, aes(x = x, y = y2)) +
  geom_smooth(
    method = "lm",
    se = FALSE,
    color = "black",
    formula = y ~ x
  ) +
  geom_point() +
  stat_regline_equation(size = 6) +
  theme(
    axis.text.y = element_text(size = 15),
    axis.text.x = element_text(size = 15)
  )
```

In this case, because there is some vertical distance between the line and the data points, there is a bit of what is called "error" in the model. The formula for the relationship between $X$ and $Y$ does not perfectly describe the data. The vertical distance between the line and each data point is what we call a [residual](https://www.statisticshowto.datasciencecentral.com/residual/){target="_blank"}. Our least squares method finds the line with the minimized value of the sum of the squared residual values.

Check out this [interactive explanation](http://setosa.io/ev/ordinary-least-squares-regression/){target="_blank"} of how the ordinary least squares method works.

Here is an image of what we are saying about the ordinary least squares regression to fit a line to data:
<center>![](https://qph.fs.quoracdn.net/main-qimg-3b0d7655ac76edf1241f97015ee755b4)</center>

###### [[source](https://qph.fs.quoracdn.net/main-qimg-3b0d7655ac76edf1241f97015ee755b4)]

This basic concept of simple linear regression an be extended to allow for more than one covariate (the independent variables, or x's); this is called **multivariable** linear regression.   With more than one independent variable, we can't visualize these relationships easily with a line on a two-dimensional page, but the mathematical concept remains in some sense the same.

R has it's own way of representing the regression equation in code. For a guide on how to perform regressions in R see [here](http://www.montefiore.ulg.ac.be/~kvansteen/GBIO0009-1/ac20092010/Class8/Using%20R%20for%20linear%20regression.pdf){target="_blank"}.

In R we indicate a linear model like this:
```{r, eval = FALSE}
y ~ x
```
Here our response/outcome variable is on the left of the `~` while our covariates/explanatory variables are on the right of the `~`.


Before we get started, let's remove the global values from our data and set them aside, as this is really a composite of all the country values.

```{r}
global <- diet_and_guidelines %>%
  filter(location_name == "Global")
diet_and_guidelines %<>%
  filter(location_name != "Global")
all_age_diet_and_guidelines %<>%
  filter(location_name != "Global")
```

### **$t$-test and linear regression**
***

Since we will be covering a lot of different statistical concepts here, we will want to focus are analysis on a single dietary factor. Let's choose one of the dietary factors that appeared to potentially have a difference between genders based on our figure in our exploratory analysis.

> "If we just look at differences by sex for the specific dietary factors,  males appear to potentially consume more of many of the factors, including possibly more sodium, fiber, calcium, red meat, and sugar-sweetened beverages than females. Females may consume more fruit."

Let's take a look at red meat.

We can compare the relative percent of red meat consumption of males and females around the world using the well known $t$-test using the `t.test()` function and a linear regression model using the `lm()` function (both are included in `stats` package that is installed with R by default) and we will get the **same results**. See [here](https://scientificallysound.org/2017/06/08/$t$-test-as-linear-models-r/){target="_blank"} for additional explanation about why that is the case. [Here](https://towardsdatascience.com/everything-is-just-a-regression-5a3bf22c459c){target="_blank"} and [here](https://lindeloev.github.io/tests-as-linear/){target="_blank"} are also great sources about how many commonly known statistical tests are specialized forms of regression.

Before we get started, let's think about the assumptions of both an independent samples $t$-test and linear regression.


#### Independent samples $t$-test assumptions:

1) Normality of the outcome in each group (this is not as much of an issue if the number of observations is relatively large, i.e., total n > 30 - which is indeed the case for us!)
2) Equal variance between the two groups
3) Independent observations

#### Linear regression assumptions:

1) **L** (linear) - There is a linear relationship between the outcome variable and each covariate.

2) **I** (independent) - The outcome for individual observations are independent from one another, given the covariates in the model.

3) **N** (normal) - The residuals (errors) are normally distributed. Note that the variables themselves do not need to be normally distributed.

4) **E** (equal variances) - The variance of the residuals is constant across covariate groups.  This is called [homoscedasticity](https://www.statisticssolutions.com/homoscedasticity/){target="_blank"}. In other words the residuals are of similar size along the regression line.

It's also important that if there are multiple predictor variables, that these are not too highly correlated.

See [here](https://www.jmp.com/en_us/statistics-knowledge-portal/what-is-regression/simple-linear-regression-assumptions.html){target="_blank"} for additional information about the assumptions of linear regressions.

Notice that many of the assumptions between $t$-tests and linear regression are similar -- each has an assumption of normality, equal variance, and independence!

#### Assessing normality

First we will explore the shape of the distribution of these relative percent of red meat consumption.  We can do this by looking at a frequency distribution of the `Relative_Percent` variable for red meat consumption.  We will use the `geom_histogram()` of the `ggplot2`package to create a histogram to evaluate the frequency distributions of our data. The `facet_wrap()` function of the `ggplot2` package allows us to look at different parts of our data in separate plots.  Here we can compare the distribution for males and females.

```{r}
diet_and_guidelines %>%
  filter(food == "red meat") %>%
  ggplot(aes(x = Relative_Percent)) +
  geom_histogram() +
  facet_wrap(~sex)
```

This `Relative_Percent` variable appears to have a right skew for both male and female individuals.  We can also see this by looking at normal Quantile-Quantile (Q-Q) plots of this variable.  Remember that in a Q-Q plot, points away from the line indicate one of the distributions is more skewed than the other.  In this case, we see that the values in are sample are skewed relative to the theoretical normal distribution. [Here](http://www.ucd.ie/ecomodel/Resources/QQplots_WebVersion.html){target="_blank"} is a great reference for interpreting Q-Q plots.

```{r}
diet_and_guidelines %>%
  filter(food == "red meat") %>%
  ggplot(aes(sample = Relative_Percent)) +
  facet_wrap(~sex) +
  geom_qq() +
  geom_qq_line()
```

We can consider transforming our data to make it more normally distributed. When data is highly right skewed, a log transformation is often helpful.

Let's take a look a the log (with base 10) of our `Relative_Percent` variable:

```{r}
diet_and_guidelines %>%
  filter(food == "red meat") %>%
  ggplot(aes(x = log10(Relative_Percent))) +
  geom_histogram() +
  facet_wrap(~sex)

diet_and_guidelines %>%
  filter(food == "red meat") %>%
  ggplot(aes(sample = log10(Relative_Percent))) +
  facet_wrap(~sex) +
  geom_qq() +
  geom_qq_line()
```

OK, so now our histograms look fairly normal. It isn't perfect, but we have a large number of samples, so this is good for our $t$-test assumptions. 

#### Assessing equal variances

The next thing we need to check is if the variance in red meat consumption is similar between the two gender groups. We can use the `var.test()`  of the `stats` package using the log-normalized data, as this data is fairly normally distributed.

Because we are piping in our data to this test function, we need to indicate that this is the data we intend to use by using `.` for the `data` argument.  This is a handy tip when piping into a function outside of the `tidyverse` where the first argument isn't a data set.

```{r}
diet_and_guidelines %>%
  filter(food == "red meat") %>%
  var.test(log10(Relative_Percent) ~ sex, data = .)
```

The p value > .05 for this test, thus we can conclude that there is not enough evidence to reject the null hypothesis that there is no difference in the variance of the distributions, so we conclude that variance is roughly equal.

#### Comparing a $t$-test to linear regression

Now let's compare the consumption of red meat across genders using both a $t$-test and a linear regression. First our independent samples $t$-test:
```{r}
diet_and_guidelines %>%
  filter(food == "red meat") %>%
  t.test(log10(Relative_Percent) ~ sex, data = ., var.equal = TRUE)
```

Notice here that sample means for the two groups are 1.80 and 1.98 for males and females, respectively.  So that means the difference in sample means is 1.80 - 1.98 = -0.18.  We also see a test statistic of $t$ = -5.32 and a very small $p$-value.  

Let's examine the same relationship using linear regression:
```{r}
diet_and_guidelines %>%
  filter(food == "red meat") %>%
  lm(log10(Relative_Percent) ~ sex, data = .) %>%
  summary()
```

Look at the results for the slope of the regression line, indicated by the `sexMale` row in the output above.  Notice how the $t$-value and the $p$-value match our $t$-test!  (Well, the signs are switched in each case -- the $t$ value is negative in the `t.test()` output because the male group is being used as reference group, while the female group is being used as the reference group in `lm()`). We can fix this using the `fct_inorder()` function of the `forcats` package which is all about factors. This function allows us to order the factor by what appears first. In this case "male" appears first, so now our output will match that of the `lm()` function.


```{r}
diet_and_guidelines %<>%
  mutate_at(vars(sex), factor)

diet_and_guidelines %>%
  filter(food == "red meat") %>%
  lm(log10(Relative_Percent) ~ sex, data = .) %>%
  summary()

diet_and_guidelines %>%
  mutate_at(vars(sex), forcats::fct_inorder) %>%
  filter(food == "red meat") %>%
  t.test(log10(Relative_Percent) ~ sex,
    data = .,
    var.equal = TRUE
  )
```

Now they match. Notice that the degrees of freedom also match, both results show 388 degrees of freedom. We are estimating two parameters for the linear model the two $\beta$ coefficients, (the slope and intercept), and for the $t$-test we are estimating the means of two groups (males and females). Overall we have two samples (male and female) for each of the 195 countries. 

Thus, the overall sample number is:  $n = 195*2 = 390$

$$df = n - # parameters estimating$$ 
Thus the degrees of freedom can be calculated as:  $df = 390 -2 = 388$

Let's look more closely at the linear regression output from `lm()`.  Our estimated intercept ($\beta_{0}$) is 1.80, which can be interpreted as the mean value when sex is not male (so in this case when sex is female).  This matches the sample mean of the female group in the `t.test()` output. 

Our estimated slope ($\beta_{1}$) is 0.18, which can be interpreted as the slope of the regression line or **the mean change in $Y$ associated with one-unit increase in $X$**.  Since our $X$ variable is sex, a one-unit change means moving from one group to another.  So we can think of the slope as the difference between the means of the two groups, male ($X$=1) minus female ($X$ = 0).  If we calculate this difference in means as calculated in the `t.test()` output, we get the value of $\beta_{1}$ (the slope or the `sexMale estimate`) of the `lm()` output!

Mean of males - Mean of females
$1.983259 - 1.798872 =0.184387$

Cool!  For more information about the output of `lm()` see [here](https://feliperego.github.io/blog/2015/10/23/Interpreting-Model-Output-In-R){target="_blank"}.

After fitting our linear regression model, we can use the base `plot()` function to get information about our model residuals to help us assess whether any of the assumptions of linear regression are violated. Here we choose to view the first three of these plots with `which = 1:3`.

```{r}
diet_and_guidelines %>%
  filter(food == "red meat") %>%
  lm(log10(Relative_Percent) ~ sex, data = .) %>%
  plot(which = 1:3)
```

The second plot shows us that our residuals are slightly negatively (or left) skewed.  We can see also see the spread of the residuals is similar between males and females, as the first and third plot show similar spreads of values in the two lines. This suggests that the assumption of homoscedasticity is met.  Here is what these plots would look like if the variance were not the same between the groups:

```{r}
diet_and_guidelines %>%
  filter(food == "red meat") %>%
  mutate(example_perc = case_when(
    # this will artifically make our female data have different
    # variance from the male data
    sex == "Female" ~ log(Relative_Percent),
    sex == "Male" ~ Relative_Percent
  )) %>%
  lm(log10(example_perc) ~ sex, data = .) %>%
  plot(which = 1:3)
```

In this case the spread of the points is clearly less for one group compared to the other.  If we saw plots like these, we would be concerned the assumption of homoscedasticity was violated.

#### Assessing independence

We never considered the assumption of independent required by both a $t$-test and linear regression.  Do we truly have independent samples in this case?  No!  Since we have female and male values from the same countries, our data is really what we would call "paired". The male and female diet values from the same country are most likely related to each another because of cultural effects on diet.  This means the assumption of independence for the independent samples $t$-test is violated, as is the independence assumption for linear regression.  

We can address this by doing a *paired* $t$-test instead of an independent $t$-test and by accounting for country in our linear model by adding it to our model as what we call a *fixed effect*.   

### **Paired $t$-test and linear model with fixed effects**
***

Now we will perform the paired versions of our analysis. This is very easy to do with the `t.test()` function, by simply using the `paired` argument and setting it equal to `TRUE`.

However, our data needs to be in a slightly different form to do the paired test, since we have to tell `R` which values need to be paired together.  Instead of one long dataset with different rows for males and females, we will need separate columns for the male and female values.  So we need to make our dataset *wider*.  We can do that using the `pivot_wider()` function of  the `tidyr` package. To use this function we specify the values that we want to separate into more variables using the `values_from` argument and we use the `names_from` argument to specify how we want to separate these other variables. In this case we will make a male and female version of all the other variables specified.

```{r}
wide_diet <- diet_and_guidelines %>%
  pivot_wider(
    values_from = c(
      contains("percent"),
      mean,
      upper,
      lower,
      opt_achieved
    ),
    names_from = sex
  )

glimpse(wide_diet)
```

You can see we now have a `Relative_Percent_Male` variable and a `Relative_Percent_Female` variable.  We can use these two variables in our paired $t$-test.  Since the paired version of the $t$-test doesn't take a `data=` argument, we will pull the appropriate variables from our data a little bit differently, using the `pull()` function.
```{r}
t.test(log10(pull(
  filter(wide_diet, food == "red meat"),
  Relative_Percent_Male
)),
log10(pull(
  filter(wide_diet, food == "red meat"),
  Relative_Percent_Female
)),
var.equal = TRUE, paired = TRUE
)
```

Here an estimated mean difference (Males - Females) of 0.18, and that this is considered significantly different than 0 due to a very small $p$-value.  You can also see that now our degrees of freedom are 194, which makes sense because with paired samples we are only estimating one parameter (the mean difference) based on data on 195 differences for each country. So $df = n - \# \ parameters = 195 -1 = 194$.

The paired version of the linear model is a bit more complex. In this case we will add another term in our model to evaluate the influence of `sex` on `Relative_Percent` consumption while keeping the country identity fixed or constant, or in other words controlling/adjusting for country. We can use the  `+` to add this additional term. Now that we have multiple covariate/explanatory variable terms, we would call this a **multivariable linear regression**.

So now our model in words will be: 

Mean relative consumption of red meat is dependent on sex and country. Or in other words, sex and location influence the consumption of red meat around the world.

Then the coefficient for `sex` will be different from what we had in our previous `lm()` model, as it will be calculated while keeping `location_name` or the country where the consumption value was obtained fixed, or in other words "controlling for `location_name`." This will also result in output for each of the countries. The [coefficients](https://www.theanalysisfactor.com/interpreting-regression-coefficients/){target="_blank"} here represent the average difference in consumption value for each country compared to the reference country of Afghanistan, while accounting for sex.

This now should meet the assumption of independence for a linear regression model, since observations will be independent conditional an the covariates of sex and country.

Let's fit this model and look at the results.

#### {.scrollable }
```{r}
diet_and_guidelines %>%
  filter(food == "red meat") %>%
  lm(log10(Relative_Percent) ~ sex + location_name, data = .) %>%
  summary()
```
####

First let's look at the estimated coefficient for the `sexMale` variable, which is 0.18.  This can be interpreted as the difference in mean log relative percent consumption between males and females, holding country constant.  So comparing males to females within the same country.  Notice this is the same estimated difference we found from our paired $t$-test!  The $p$-value for this coefficient also matches the $p$-value from the paired $t$-test.

You can also see from this output that we have a coefficient for every country except Afghanistan, which is our reference country.  These coefficients compare the country to that reference.  So the estimated coefficient for `location_nameAlbania`, 0.44, can be interpreted as the difference in mean log relative percent consumption between Albania and Afghanistan, holding sex constant.  So comparing Albania to Afghanistan within males or comparing Albania to Afghanistan within females.

Finally, you might notice that the number of residual degrees of freedom for this regression is 194, just as in the paired $t$-test.  This makes sense since we have to estimate a coefficient for 194 countries (all except Afghanistan) as well as a coefficient for sex and an intercept.  So we have:

$$df = n - # parameters estimating = 390 - 194 - 2 = 194$$ 

We should also check the residual plots for this fixed effects regression model.

```{r}
diet_and_guidelines %>%
  filter(food == "red meat") %>%
  lm(log10(Relative_Percent) ~ sex + location_name, data = .) %>%
  plot(which = 1:3)
```

These residual plots look much better than our previous plots.  This [guide](https://data.library.virginia.edu/diagnostic-plots/) provides more information on how to interpret these residual diagnostic plots.

Based on our [Q-Q plot](http://onlinestatbook.com/2/advanced_graphs/q-q_plots.html){target="_blank"}, we appear to have some outliers perhaps at the extreme ends of our tails but overall the residuals look fairly normal. The [residual vs fitted plot](https://online.stat.psu.edu/stat462/node/118/){target="_blank"} shows us if the relationship between our outcome variable and our predictors looks linear, if there is unequal error variance between groups, and if there are possible outliers. Ideally this should look like a band of points equally centered around zero. Here are [examples](http://docs.statwing.com/interpreting-residual-plots-to-improve-your-regression/){target="_blank"} of these plots that might show issues of concern. 

Overall our plot looks fairly good. The shape of our data looks fairly linear (the residuals don't appear to have a shape other than a band or line), there does not appear to be any extreme outliers (no data points are especially far away) and the points have the same general range around the line for the various fitted values. There are a few points with wider residuals at the higher fitted values, but overall this looks quite reasonable. 

Our [scale-location plot](https://boostedml.com/2019/03/linear-regression-plots-scale-location-plot.html){target="_blank"} also shows us that our variance looks fairly equal across groups as our values show a relatively even spread. A larger  bend in the line would indicate more variation in the variance across our independent variable groups also known as [heteroscedasticity](https://statisticsbyjim.com/regression/heteroscedasticity-regression/){target="_blank"}. There is only a slight bend in the line for our data suggestive of [homoscedasticity ](https://www.statisticssolutions.com/homoscedasticity/){target="_blank"}. So our assumptions look pretty good:

1) Linear - the relationship appears to be fairly linear 
2) Independence - now that we have taken care of the location structure in our data, our samples are independent
3) Normality - the residuals appear to be fairly normally distributed and we have a large number of samples to help account for minor violations
4) Equal variance - the variance in the residuals appear to be fairly equal across the groups of the independent/predictor variables


### **Paired $t$-test and linear model with mixed effects**
***


To "pair" our data using fixed effects cost us an additional 194 variables in our regression model, one for each country except Afghanistan.  Alternatively, we can perform a slightly different type of regression that still accounts for the paired structure in the data.

In this case we will use the `lmer()` function of the `lme4` package.  This function allows us to fit what is called a [linear mixed effects regression model](https://ourcodingclub.github.io/tutorials/mixed-models/){target="_blank"}. We will also use the `lmerTest` package, since this adds test statistics and $p$-values to the linear mixed effects model output.  

This type of regression is called **mixed** because it contains both **fixed** and **random** effects.  There are many different definitions for **fixed** and **random** effects and the difference is conceptually complex and context specific. 

However in simplistic terms, **fixed effects** are generally speaking the variables of interest that we have reason to believe explain or predict the outcome or response variable, while random effects are those that may introduce additional variance in the influence of those predictor variables on the outcome variable. For example, they may provide information about **group or batch structures** within the data.  

In our case, we are interested in the influence of `sex` on the consumption of red meat, however the identity of the country where the male and female consumption values were obtained may influence this relationship and we would like to control for that. We don't want to model for `location_name` itself, but just model it's influence on the relationship of `sex` on consumption of red meat. In other words, we are interested in getting a sense of how sex influences consumption rates in general and we want to account for the paired structure within our data, the fact that we have corresponding consumption values for the two sexes from different countries. The notation for including a random effect like this is  `1 | variable_name`. The one indicates a varying-intercept group effect, in other words we expect that the intercept may vary for each value of the variable indicated to the right of the `|`. So in our case, the intercept  (log relative percent consumption when sex is assigned to the zero value - female) may be different for each country.

Let's fit a mixed effects model that includes a fixed effect for sex and a random intercept for country:

```{r}
diet_and_guidelines %>%
  filter(food == "red meat") %>%
  lmer(log10(Relative_Percent) ~ sex + (1 | location_name), data = .) %>%
  summary()
```

How would we interpret the results of this model? Again, let's look at the estimated coefficient for the `sexMale` variable, which is 0.18.  This can be interpreted the same way as in the simple linear regression, as the difference in mean log relative percent consumption between males and females.  However, here we haven't violated the independence assumption because we are accounting for the paired nature of the data through the random effect for country.  The $t$-statistic and $p$-value for this coefficient also match those from the paired $t$-test we did before:
```{r}
t.test(log10(pull(
  filter(wide_diet, food == "red meat"),
  Relative_Percent_Male
)),
log10(pull(
  filter(wide_diet, food == "red meat"),
  Relative_Percent_Female
)),
var.equal = TRUE, paired = TRUE
)
```

Notice that in the output for the mixed effects model, there are **not** coefficients for each country, like there were in the fixed effects model.  This is because we are not explicitly estimating individual country effects in this model.  Instead, the country effect is captured through the intercept in this model.  Our estimated intercept is 1.80 and the standard deviation of this intercept is 0.34 (shown in the `Random effects` table in the output.)  We can interpret this as saying that each country has an intercept that comes from a normal distribution with mean of 1.80 and a standard deviation of 0.34.  Since the intercept in this model represents the log relative percent consumption for females, this give us an idea of how female consumption varies across countries -- average log consumption across countries is 1.80, but there is variability from one country to another.  And then the male log consumption is, on average, 0.18 higher than for females.

It is more complicated to calculate the degrees of freedom in the mixed effect model and beyond this case study, but it is based on the [Satterthwaite formula](https://www.statisticshowto.datasciencecentral.com/satterthwaite-formula/){target="_blank"} and results in the same degrees of freedom.

Finally, lets see what our residual plots look like for this mixed effects model.  We can't use the `plot()` function with a `lmer()` model to get all of the plots at once, but we can construct a residual vs. fitted value plot and a Q-Q plot ourselves:

```{r}
diet_and_guidelines %>%
  filter(food == "red meat") %>%
  lmer(log10(Relative_Percent) ~ sex + (1 | location_name), data = .) %>%
  plot()

diet_and_guidelines %>%
  filter(food == "red meat") %>%
  lmer(log10(Relative_Percent) ~ sex + (1 | location_name), data = .) %>%
  resid() %>%
  qqnorm()

diet_and_guidelines %>%
  filter(food == "red meat") %>%
  lmer(log10(Relative_Percent) ~ sex + (1 | location_name), data = .) %>%
  resid() %>%
  qqline()
```

Notice that the plots look very similar to what we saw with the fixed effects model.

We see that the paired $t$-test, the fixed effects model adjusting for country, and the mixed effects model with a random intercept for country all give the same results in this case.  So which test should we use?  The decision of which test to perform depends on your question of interest. In this case we were particularly interested in the influence of `sex`, so setting `location_name` to a random effect provides the same level of detail about sex without as much information about `location_name`, so that might be ideal. As we can see, the results, in this case, are the same.  The benefit of using regression over a simple paired $t$-test would be the ability to add other covariates to our model if we wanted to adjust for other country characteristics.

Overall, though, we can conclude from these tests that we have enough evidence to reject the null hypothesis that there is no difference between the mean consumption of males and females ( or that `sex` has no association or influence on red meat consumption.) **Therefore, it appears that males consume significantly more red meat than females globally.** 


### **ANalysis Of VAriance (ANOVA) test**
***

We are also interested in the influence of age group on dietary consumption, but because there are 15 age groups we can't assess the influence of age group on consumption using the paired $t$-test, as this test can only compare 2 groups. 

If we wanted to test the hypothesis that there are any age group differences, that at least one of the groups is different from the others; we could use an [ANOVA test](http://onlinestatbook.com/2/analysis_of_variance/intro.html){target="_blank"}. This test allows us to compare means of 3 or more groups by evaluating the variance of the data within the groups and among the groups. 

Our null hypothesis is that all age groups have equal means:
$$ H_0: \mu_{1} = \mu_{2} =\mu_{3}=\mu_{4} = ... \mu_{15} $$

The alternative hypothesis is that at least one age group mean is not equal to the others.

**Importantly**, if we reject the null, we *do not know which group means are different from one another*. Subsequent testing is required if we want to know this information. In this case we call this type of non-specific hypothesis an "omnibus" hypothesis.

You could actually perform an ANOVA to compare two means, but in this case you would get an $F$-statistic instead of a $t$-statistic which would be equivalent to $t^2$. However it is not conventional to use ANOVA for only 2 means. The $F$-statistic is derived form the $F$-test is used for a few different type of tests. In the ANOVA the F-test is calculated as:

$$F = \frac{ 
variability\ between\  the \ groups}{ 
variablity\ within\ the \ groups}$$

The larger the ratio, the larger the variability between the groups, thus the more likely that the data for each group comes from a different distribution with different means, suggesting that the groups are different.

It turns out that the ANOVA test is also equivalent to linear regression.  We will demonstrate this by evaluating how the consumption of red meat varies by age group using an ANOVA and a linear regression.

Thinking about how we want to know if red meat consumption differs between age groups from the linear regression perspective, we could also describe our null hypothesis as:

There is no influence of age group identity on consumption or there is no relationship between age group identity and consumption.

And we could describe our alternative hypothesis as:

Age group identity does influence consumption or explain some of the variation in consumption.

#### ANOVA assumptions

The ANOVA assumptions are quite similar to the $t$-test assumptions:

1) Normality of the data for all tested groups (less of an issue if the number of observations is relatively large total n > 30)
2) Equal variance between the groups - aka [Homogeneity of Variances assumption](https://uc-r.github.io/assumptions_homogeneity){target="_blank"} (make sure you do the correct test if the data is not normal)
3) Independent observations

let's evaluate our assumptions for the groups we are comparing, starting with normality using Q-Q plots.  First let's make `age_group_name` a factor:

```{r}
all_age_diet_and_guidelines %<>%
  mutate_at(vars(age_group_name), factor)
```

Now let's look at Q-Q plots of both relative percent consumption and the log-transformed version of this variable:
```{r}
all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  ggplot(aes(sample = Relative_Percent)) +
  facet_wrap(~age_group_name) +
  geom_qq() +
  geom_qq_line()

all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  ggplot(aes(sample = log10(Relative_Percent))) +
  facet_wrap(~age_group_name) +
  geom_qq() +
  geom_qq_line()
```

After transformation, these Q-Q plots look pretty good.

Now let's look at the assumption of constant variance. There are different ways to assess this assumption across more than two groups.  [Bartlett's test](https://www.itl.nist.gov/div898/handbook/eda/section3/eda357.htm){target="_blank"} works well if the data appears to be quite normally distributed, while the [Fligner-Killeen](http://wiki.stat.ucla.edu/socr/index.php/AP_Statistics_Curriculum_2007_NonParam_VarIndep){target="_blank"} test is nonparametric and does not assume normality of the data.

We will use another popular test, [Levene's test](https://www.itl.nist.gov/div898/handbook/eda/section3/eda35a.htm){target="_blank"}, which is more robust to violations of normality than the Bartlett's test, but not as robust as the Fligner-Killeen test.  The null hypothesis of this test, as for the other two tests, is that the variances are equal across all of the groups.  The alternative hypothesis is that at least one pair of groups has different variances.  In symbols we would write this as

$$ H_0: \sigma_1^2 = \sigma_2^2 = \sigma_3^2 ... = \sigma_n^2 $$

and

$$H_a:\sigma_i^2 \neq \sigma_j^2   $$
for at least  one  pair ($i$,$j$).


We will use the `leveneTest()` function of the `car` package to performs Levene's test.

```{r}
all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  car::leveneTest(log10(Relative_Percent) ~ age_group_name, data = .)
```

Our data does not appear to violate the homogeneity of variances assumption as our $p$-value was greater than 0.05 and so we would fail to reject the null hypothesis of equal variances.

We already know that our independence assumption is not met, since the data for the different age groups comes from the same countries.  We will account for this in later models, but first let's compare the results between ANOVA and linear regression assuming the independence assumption is met.

#### ANOVA and linear regression

We can use the `aov()` function of the `stats` package to perform an ANOVA test. We will be performing what is called a [one-way ANOVA](http://onlinestatbook.com/2/analysis_of_variance/one-way.html){target="_blank"} because we only have one independent variable (age group). We will also perform a linear regression for comparison.

#### {.scrollable }
```{r}
all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  aov(log10(Relative_Percent) ~ age_group_name, data = .) %>%
  summary()

all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  lm(log10(Relative_Percent) ~ age_group_name, data = .) %>%
  summary()
```
####


We can see that the $F$-statistic ($F$ value in the `aov()` output and at the bottom of the `lm()` output) is the same for both models and the $p$-value for the $F$-statistic is the same!

We also see that the degrees of freedom for the $F$-statistic is 14. This makes sense because we have 15 different age groups and degrees of freedom for the $F$-statistic are calculated as $df = n - 1$. So in our case: $df = 15 -1$.


The difference here is that with the `lm()` model we also get information about how the individual age groups are associated with the log relative percent consumption of red meat. Notice that if we look at all the age groups in the data

```{r}
all_age_diet_and_guidelines %>%
  distinct(age_group_name)
```

we see that our `lm()` results are missing one of the age groups, the `25 to 29` age group. That is because this is the **reference group** and the coefficients indicate the slope or difference in log relative percent consumption rates for each listed age group *compared* to this reference group. 

#### ANOVA and linear regression with fixed effects

Now let's account for the paired `location_name` structure within our data, since the above models violate the independence assumptions for ANOVA and linear regression. We can do this by adding another fixed effect to both the ANOVA model and the linear regression model.  For ANOVA, this means we are now doing a two-way ANOVA, since we have two independent variables (age group and country).  For linear regression, we are now adding a fixed effect for country to our model.  

#### {.scrollable }

```{r}
all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  aov(log10(Relative_Percent) ~ age_group_name + location_name, data = .) %>%
  summary()

all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  lm(log10(Relative_Percent) ~ age_group_name + location_name, data = .) %>%
  summary()
```
#### 

It's hard to see that these results match, since the linear regression output doesn't print the $F$-statistic for the age groups together or the countries together; it only gives results for individual $t$-tests of each regression coefficient.  We can get these grouped $F$-statistics using the `anova()` function of the `stats` package. This function does not actually directly perform ANOVA like the `aov()` function, but instead prints a variance table using a `lm()` object.

```{r}
all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  lm(log10(Relative_Percent) ~ age_group_name + location_name, data = .) %>%
  anova()
```
#### 

We can see that indeed the $F$-values and $p$-values from linear regression match those from ANOVA. In this case, this analysis suggests that there is a significant relationship between age group and consumption, even when controlling for country.  It also suggests that there is a significant relationship between country and consumption, even when controlling for age group.  However, only the first relationship is our relationship of interest; the second is only included in the model to account for the dependent nature of the data.

Remember, the ANOVA results indicate that the means are different across these groups, but it **does not** inform us about which groups are different. However, the original `lm()` output using the `summary()~ command gives more information about specific group differences. Remember, though, that these are **relative to the reference** level for the age group and location and that these values are calculated for the effect on consumption while controlling for the other variable in the model.

#### {.scrollable }
```{r}
all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  lm(log10(Relative_Percent) ~ age_group_name + location_name, data = .) %>%
  summary()
```
####

#### ANOVA and linear regression with mixed effects

We could instead perform a similar analysis as we did for the two group analysis where we controlled for the paired data structure using a random effect based on country In particular, we could include a random intercept for country.  We could do this within the `aov()` function using `Error()` and within the `lmer()` function with `1 | variable_name`.

#### {.scrollable }
```{r}
all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  aov(
    log10(Relative_Percent) ~ age_group_name + Error(location_name),
    data = .
  ) %>%
  summary()

all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  lmer(
    log10(Relative_Percent) ~ age_group_name + (1 | location_name),
    data = .
  ) %>%
  summary()
```

Notice now the results only show for the age group variable, since this is the only fixed effect in the model.  However, dependence in the data due to country is still accounted for through the random effect.

If we use `anova()` instead of `summary()` for our `lmer()` model, we can see they give the same results.
```{r}
all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  aov(
    log10(Relative_Percent) ~ age_group_name + Error(location_name),
    data = .
  ) %>%
  summary()

all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  lmer(
    log10(Relative_Percent) ~ age_group_name + (1 | location_name),
    data = .
  ) %>%
  anova()
```


### **Modeling all groups of interest**
***

Now we can extend out model to include include `sex`, `age_group_name` and `location_name` all in the same linear model and get information about how each of these factors influences dietary consumption, while accounting for the other factors. Since we are primarily interested in the effects of sex and age, but want to account for the dependence in the data due to repeated measurements by country, we will include `sex` and `age_group_name` as fixed effects and incorporate a random intercept for `location_name`.  

```{r}
all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  lmer(
    log10(Relative_Percent) ~ sex + age_group_name + (1 | location_name),
    data = .
  ) %>%
  anova()

all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  lmer(
    log10(Relative_Percent) ~ sex + age_group_name + (1 | location_name),
    data = .
  ) %>%
  summary()
```

Looking at the `anova()` output, we can see that sex and age group both have significant associations with the consumption of red meat, when controlling for the other variable.  Additionally, by looking at the individual coefficient estimates in the `summary()` output, we see that males tend to have higher red meat consumption compared to females (positive coefficient for `sexMale`) and that consumption seems to decrease with increasing age (negative coefficients for all the age group categories that appear to become larger in magnitude with increasing age).

## **Data Visualization**
***

If you have been following along but stopped you could load the wrangled data like so:

```{r}
load(here::here("data", "wrangled", "wrangled_data.rda"))
```

***
<details> <summary> If you are starting the case study at this section click here. </summary>

First you need to install and load the `OCSdata` package:

```{r, eval=FALSE}
install.packages("OCSdata")
library(OCSdata)
```

Then, you may load the wrangled data using the following code:

```{r, eval=FALSE}
wrangled_rda("ocs-bp-diet", outpath = getwd())
load(here::here("OCSdata", "data", "wrangled", "wrangled_data.rda"))
```

If the package does not work for you, alternatively, an RDA file (stands for R data) of the data can be found [here](https://github.com//opencasestudies/ocs-bp-diet/tree/master/data/wrangled) or slightly more directly [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-diet/master/data/wrangled/wrangled_data.rda). Download this file and then place it in your current working directory within a subdirectory called "wrangled" within a subdirectory called "data" to copy and paste our code. We used an RStudio project and the [`here` package](https://github.com/jennybc/here_here) to navigate to the file more easily. 

```{r}
load(here::here("data", "wrangled", "wrangled_data.rda"))
```


***
<details> <summary> Click here to see more about creating new projects in RStudio. </summary>

You can create a project by going to the File menu of RStudio like so:


```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "New_project.png"))
```

You can also do so by clicking the project button:

```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "project_button.png"))
```

See [here](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects) to learn more about using RStudio projects and [here](https://github.com/jennybc/here_here) to learn more about the `here` package.

</details>
***
</details>
***

Now that we have statistically analyzed the consumption of red meat based on the location, sex, and age group of different populations around the world. Let's make some visualizations to help with our interpretations.

### **Red Meat**
***

Let's try to make a plot that shows the relationship of age group, sex, and location on consumption of red meat.

First we will filter our data for only the data associated with red meat, and then we will create a box plot graph with age group as the x axis, but include box plots for each sex for each age group. We can include an additional subplot to just look at the relationship of sex and consumption. Recall that the `ggplot2` package is very useful for making figures and uses a layering structure to make plots using the `+` between layers.

```{r}
all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  ggplot(aes(
    x = age_group_name,
    y = Relative_Percent,
    col = sex
  )) +
  geom_boxplot() +
  # this adds the individual points for the sex comparison
  geom_jitter(aes(
    x = sex,
    y = Relative_Percent
  ),
  # width specifies how wide the points will be plotted
  width = .2,
  size = 2,
  shape = 21
  ) +
  # this angles the x axis text and removes the legend
  theme(
    axis.text.x = element_text(
      angle = 70,
      hjust = 1
    ),
    legend.position = "none"
  )
```

OK, this is pretty good, but we can do better.

Let's try specifically looking at the countries that over-consumed red meat. We can look at these countries by filtering our data where `Relative_Percent` was greater than 100%. Now we will overlap the jitter points and the box plot using the `position_jitterdoge()` as the position in `geom_pont()`. In order to not obscure our box plots, we can use the argument `alpha` to make our jitter points more transparent.

```{r}
all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  filter(Relative_Percent > 100) %>%
  ggplot(aes(
    y = Relative_Percent,
    x = age_group_name,
    fill = sex
  )) +
  # this position option will separate the points by sex
  # this is determined by the fill argument in the ggplot() function
  # could also use col argument but it would change the style a bit
  geom_point(
    position = position_jitterdodge(),
    aes(col = sex),
    alpha = 3 / 10
  ) +
  geom_boxplot(outlier.shape = NA) +
  theme(axis.text.x = element_text(
    angle = 70,
    hjust = 1
  ))
```

What are the countries that have such high consumption rates?

```{r}
all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  filter(Relative_Percent > 1000)
```
 
 Looks like the males in Laos and Timor_Leste have the highest consumption.

 
Now let's plot just the populations that eat less than the optimal amount by filtering for `Relative_Percent` < 100%.
 
```{r}
all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  filter(Relative_Percent < 100) %>%
  ggplot(aes(
    y = Relative_Percent,
    x = age_group_name,
    fill = sex
  )) +
  geom_point(
    position = position_jitterdodge(),
    aes(col = sex),
    alpha = 3 / 10
  ) +
  geom_boxplot(outlier.shape = NA) +
  theme(axis.text.x = element_text(
    angle = 70,
    hjust = 1
  ))
```

Nice! It would be nice to be able to know what countries each data point corresponds to. One way to do this is using a package called `ggiraph`. This package is really helpful for creating interactive graphs. 

We will use the `geom_point_interactive()` function to allow us to hover over points to display the country name. We indicate what label we want with the `tooltip` argument.
This function is similar to the normal `geom_point()` function. Thus, we will include the same arguments as before. However, we will also split the male and female data using `facet_wrap()` to make things a bit less overwhelming.

Notice that we are creating a plot object before we use the `geom_point_interactive()`.

We are also rendering the plot with the `girafe()` function of the `ggiraph` package.


```{r, eval = TRUE}
g <- all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  filter(Relative_Percent < 100) %>%
  ggplot(aes(
    y = Relative_Percent,
    x = age_group_name,
    fill = sex
  )) +
  geom_boxplot(outlier.shape = NA) +
  facet_wrap(~sex) +
  theme(axis.text.x = element_text(
    angle = 70,
    hjust = 1
  ))

g <- g + geom_point_interactive(aes(
  color = sex,
  tooltip = location_name
),
size = 2,
position = position_jitterdodge(),
alpha = 3 / 10
)

girafe(code = print(g))
```


Cool! 

From this plot we can see the countries with populations that do well by not over-consuming red meat, (as over-consumption is associated with health risk). We see that different countries greatly vary, we can see that overall younger populations appear to consume more red meat, and men appear to consume red meat.

Let's do the same thing for the over-consuming countries. We can also take this one step further to show all the points for the same country when we hover over one data point by using the `data_id` argument of the `geom_point_interactive()` function. 


We can also add links to Wikipedia pages for these countries using the `onclick` argument. See this [link](https://davidgohel.github.io/ggiraph/articles/offcran/using_ggiraph.html){target="_blank"} for more information on using the `ggirpah` package. We will use the base `sprintf()` function to format our urls for the Wikipedia links into C style to open a new tab for the link when a user clicks on the figure.

```{r, eval = TRUE}
all_age_diet_and_guidelines %<>%
  mutate(link = sprintf(
    "window.open(\"%s%s\")",
    "http://en.wikipedia.org/wiki/",
    as.character(pull(
      all_age_diet_and_guidelines,
      location_name
    ))
  ))

g <- all_age_diet_and_guidelines %>%
  filter(food == "red meat") %>%
  filter(Relative_Percent > 100) %>%
  ggplot(aes(
    y = Relative_Percent,
    x = age_group_name,
    fill = sex
  )) +
  geom_boxplot(outlier.shape = NA) +
  facet_wrap(~sex) +
  theme(
    legend.position = "none",
    axis.text.x = element_text(
      angle = 70,
      hjust = 1
    )
  ) +
  expand_limits(y = 99)


g <- g + geom_point_interactive(aes(
  color = sex,
  tooltip = location_name,
  data_id = location_name,
  onclick = link
),
size = 2,
position = position_jitterdodge(),
alpha = 3 / 10
)

g <- g + geom_point_interactive(
  data =
    all_age_diet_and_guidelines %>%
      filter(food == "red meat") %>%
      filter(Relative_Percent > 100) %>%
      filter(location_name == "United States"),
  aes(
    fill = location_name,
    tooltip = location_name,
    data_id = location_name,
    onclick = link
  ),
  size = 4,
  position = position_jitterdodge(),
  alpha = 5 / 10,
  color = "black"
)

girafe(code = print(g))
```

### **United Sates Data**
***

Now let's take a look at the US data specifically.

```{r}
diet_and_guidelines %>%
  filter(location_name == "United States") %>%
  count(opt_achieved)
```

OK, it looks like optimal consumption levels were achieved for only 10% of the dietary factors.

Let's look at males and females separately:

#### {.recall_code_question_block}
<b><u> Question Opportunity </u></b>

Can you come up with the code for how you would do this?

####

***
<details> <summary> Click here to reveal the code. </summary>

```{r, eval = FALSE}
diet_and_guidelines %>%
  filter(
    sex == "Male",
    location_name == "United States"
  ) %>%
  count(opt_achieved, food) %>%
  arrange(food)

diet_and_guidelines %>%
  filter(
    sex == "Female",
    location_name == "United States"
  ) %>%
  count(opt_achieved, food) %>%
  arrange(food)
```
</details>
***

For males:
```{r, echo = FALSE}
diet_and_guidelines %>%
  filter(
    sex == "Male",
    location_name == "United States"
  ) %>%
  count(opt_achieved, food) %>%
  arrange(food)
```

For females:
```{r}
diet_and_guidelines %>%
  filter(
    sex == "Female",
    location_name == "United States"
  ) %>%
  count(opt_achieved, food) %>%
  arrange(food)
```


So females are a bit better about not over-consuming sodium in the United States relative to males.  Both groups are doing well with avoiding trans fatty acids. Let's look more closely at which dietary components have high and low consumption in the United States:

```{r}
all_age_diet_and_guidelines %>%
  filter(location_name == "United States") %>%
  ggplot(aes(
    y = Relative_Percent,
    x = food,
    fill = sex
  )) +
  theme(axis.text.x = element_text(
    angle = 70,
    hjust = 1
  )) +
  facet_wrap(~direction, scales = "free") +
  geom_boxplot() +
  geom_point(
    position = position_jitterdodge(),
    alpha = 3 / 10
  )
```

OK, so we can  indeed see that overall consumption of sodium and trans fatty acids is pretty close to optimal. So that's great. However, Both males and females are over-consuming processed meat, red meat, and sugar-sweetened beverages. On the other hand both genders are not getting adequate intake of all the other dietary factors for optimal health. The population in the United states has especially poor intake of polyunsaturated fats. it also looks like in most cases females are getting less of the dietary factors that pose health risks when under-consumed, with the exception of fruits.


How about if we look at age groups. First let's look at the dietary components with that were over-consumed in the United States.

#### {.recall_code_question_block}
<b><u> Question Opportunity </u></b>

Can you come up with the code for this on your own?

####

***
<details> <summary> Click here to reveal the code. </summary>

 We will also move our legend to the bottom of the plot using the `theme()` function of the `ggplot2` package, like so:

```{r}
plot_age_groups <- all_age_diet_and_guidelines %>%
  filter(
    location_name == "United States",
    direction == "high"
  ) %>%
  ggplot(aes(
    y = Relative_Percent,
    x = food,
    fill = age_group_name
  )) +
  facet_wrap(~food, scales = "free") +
  geom_boxplot() +
  theme(legend.position = "bottom")
```
</details>

***

```{r}
plot_age_groups
```

OK! It looks like age really influences the consumption of these dietary factors. With the exception of trans fatty acids, the consumption of all of these dietary factors seems to decrease with age.  Let's also use the`scale_fill_viridis()` function of the `viridis` package to change the colors of our plot. This package uses palettes of colors that are discernible for individuals who are colorblind. 
```{r}
all_age_diet_and_guidelines %>%
  filter(
    location_name == "United States",
    direction == "high"
  ) %>%
  ggplot(aes(
    y = Relative_Percent,
    x = food,
    fill = age_group_name
  )) +
  facet_wrap(~food, scales = "free") +
  geom_boxplot() +
  # change the colors from rainbow to purple/green/yellow
  scale_fill_viridis(discrete = TRUE) +
  theme_linedraw() +
  theme(
    strip.text = element_text(size = 8, face = "bold"),
    axis.text.x = element_blank(),
    axis.title.x = element_blank(),
    legend.position = "bottom"
  )
```

Nice!


Now let's look at the dietary factors that when consumed at low levels increase health risk:


#### {.recall_code_question_block}
<b><u> Question Opportunity </u></b>

Again, see if you come up with the code for this on your own?

####

***
<details> <summary> Click here to reveal the code. </summary>

```{r}
low_foods_plot <- all_age_diet_and_guidelines %>%
  filter(
    location_name == "United States",
    direction == "low"
  ) %>%
  ggplot(aes(
    y = Relative_Percent,
    x = food,
    fill = age_group_name
  )) +
  facet_wrap(~food, scales = "free") +
  geom_boxplot() +
  # change the colors from rainbow to purple/green/yellow
  scale_fill_viridis(discrete = TRUE) +
  theme_linedraw() +
  theme(
    strip.text = element_text(size = 7, face = "bold"),
    axis.text.x = element_blank(),
    axis.title.x = element_blank(),
    legend.position = "bottom"
  )
```

</details> 
***


```{r}
low_foods_plot
```



Interesting, we see that for the foods that are over consumed (processed meat, red meat, sodium, and sugar-sweetened beverages), consumption appears to decrease with age. For the foods that are under consumed, many appear to rise and fall with age.

### **Overall trends**
***

Finally, we would like to get a general sense of how consumption of these dietary factors differs around the world and we would like to know how the US compares to other countries. 

Before we do this let's change the labels of the foods by adding new line breaks (`"\n"`) so that they will fit more easily on our graphs.

#### {.recall_code_question_block}
<b><u> Question Opportunity </u></b>

See if you come up with the code for this on your own?

####

***
<details> <summary> Click here to reveal the code for the `diet_and_guidelines` and the `all_age_diet_and_guidelines` data. </summary>

```{r}
diet_and_guidelines[["food"]] <- str_replace_all(
  diet_and_guidelines[["food"]],
  "sugar-sweetened beverages",
  "sugar-sweetened\nbeverages"
)
diet_and_guidelines[["food"]] <- str_replace_all(
  diet_and_guidelines[["food"]],
  "seafood omega-3 fatty acids",
  "seafood omega-3\nfatty acids"
)
diet_and_guidelines[["food"]] <- str_replace_all(
  diet_and_guidelines[["food"]],
  "polyunsaturated fatty acids",
  "polyunsaturated\nfatty acids"
)

all_age_diet_and_guidelines[["food"]] <- str_replace_all(
  all_age_diet_and_guidelines[["food"]],
  "sugar-sweetened beverages",
  "sugar-sweetened\nbeverages"
)

all_age_diet_and_guidelines[["food"]] <- str_replace_all(
  all_age_diet_and_guidelines[["food"]],
  "seafood omega-3 fatty acids",
  "seafood omega-3\nfatty acids"
)

all_age_diet_and_guidelines[["food"]] <- str_replace_all(
  all_age_diet_and_guidelines[["food"]],
  "polyunsaturated fatty acids",
  "polyunsaturated\nfatty acids"
)
```

</details>
***

```{r}
diet_and_guidelines %>%
  select(food) %>%
  distinct()
```

```{r}
all_age_diet_and_guidelines %>%
  select(food) %>%
  distinct()
```

Nice!

#### Under-consumed foods

To choose the colors for our plot we can use the `show_col()` function of the `scales` package (which is installed with `ggplot2` package) to preview color options from the viridis palette of the `viridis` package.

```{r}
scales::show_col(viridis_pal()(3))
```

We will use the `position_jitterdodge()` in our `position` argument of `geom_point()` to indicate how the points should be grouped by sex.

```{r, fig.width = 17, fig.height=14}
# first filter the values
Under <- diet_and_guidelines %>%
  filter(direction == "low") %>%
  filter(Relative_Percent < 110) %>%
  ggplot(aes(
    y = Relative_Percent,
    x = food,
    color = sex
  )) +
  # adds grey points for each country
  geom_point(
    position = position_jitterdodge(),
    color = "dark grey",
    alpha = 7 / 10,
    size = 7,
    # this specifies how to separate the points
    aes(fill = sex)
  ) +
  # add boxplots
  geom_boxplot(
    outlier.shape = NA,
    color = "black",
    lwd = 2.5,
    aes(fill = sex)
  ) +
  # this allows us to use specific colors
  scale_fill_manual(
    values =
      c(
        "#481567FF",
        "#1F968BFF"
      )
  ) +
  # adds line for optimal amount
  geom_hline(
    yintercept = 100,
    linetype = "dashed",
    color = "red",
    size = 3
  ) +
  # manually changes y axis breaks
  scale_y_continuous(
    breaks = c(0, 25, 50, 75, 100),
    labels = c(
      0, 25, 50, 75,
      "100% \n optimal amount"
    )
  )
# creates a larger black point for the US data
Under <- Under +
  geom_point(
    data = diet_and_guidelines %>%
      filter(
        direction == "low",
        location_name == "United States"
      ),
    position = position_jitterdodge(
      jitter.width = 0.01,
      dodge.width = 0.7
    ),
    color = "black",
    size = 11,
    aes(
      fill = sex,
      shape = location_name
    )
  )

# creates a smaller yellow point on top for the US data
Under <- Under +
  geom_point(
    data = diet_and_guidelines %>%
      filter(
        direction == "low",
        location_name == "United States"
      ),
    position = position_jitterdodge(
      jitter.width = 0.01,
      dodge.width = 0.7
    ),
    color = "#FFDF00",
    size = 7,
    aes(
      fill = sex,
      shape = location_name
    )
  ) +
  # make the plot look nice
  theme_linedraw() +
  theme(
    plot.title = element_text(
      size = 40,
      hjust = 0.5,
      face = "bold"
    ),
    axis.text.x = element_text(
      angle = 70,
      hjust = 1,
      size = 32
    ),
    # this is useful for removing the legend
    legend.position = "none",
    axis.text.y = element_text(size = 35),
    axis.title.y = element_text(size = 35),
    axis.title.x = element_text(size = 25),
    panel.background = element_rect(
      colour = "black",
      size = 1.5
    )
  ) +
  labs(
    title = "\n Global consumption of foods associated with\nhealth risk when under-consumed",
    x = "",
    y = "Percent consumption relative \n to guidelines"
  )

Under
```

#### Over-consumed foods

We will use the `facet_zoom()` function of the `ggforce` package to create a plot with a zoomed in portion.

```{r, fig.width=20, fig.height=16}
Over <- diet_and_guidelines %>%
  filter(direction == "high") %>%
  ggplot(aes(
    y = Relative_Percent,
    x = food
  )) +
  # this adds points for each country
  geom_point(
    position = position_jitterdodge(),
    color = "dark grey",
    alpha = 7 / 10,
    # this specifies how to separate the points for male and female
    aes(fill = sex),
    size = 7
  ) +
  # this adds boxplots
  geom_boxplot(
    outlier.shape = NA,
    color = "black",
    lwd = 2.5,
    # key_glyph = draw_key_rect),
    aes(fill = sex)
  ) +
  # this manually changes boxplot colors
  scale_fill_manual(
    values =
      c(
        "#481567FF",
        "#1F968BFF"
      )
  ) +
  # this adds optimal red line
  geom_hline(
    yintercept = 100,
    linetype = "dashed",
    color = "red",
    size = 3
  ) +
  # this changes y axis breaks
  scale_y_continuous(
    breaks = c(100, 500, 1000, 2000),
    labels = c("100% thresh.", 500, 1000, 2000)
  ) +
  # this zooms in to part of the plot
  ggforce::facet_zoom(ylim = c(0, 1300)) +
  # this changes the legend direction
  guides(guide_legend(
    direction = "horizontal",
    label.vjust = .5
  ))

# creates a larger black point for the US data
Over <- Over +
  geom_point(
    data = diet_and_guidelines %>%
      filter(
        direction == "high",
        location_name == "United States"
      ),
    position = position_jitterdodge(
      jitter.width = 0.01,
      dodge.width = 0.7
    ),
    color = "black",
    size = 11,
    aes(
      fill = sex,
      shape = location_name
    )
  )

# creates a smaller yellow point on top for the US data
Over <- Over +
  geom_point(
    data = diet_and_guidelines %>%
      filter(
        direction == "high",
        location_name == "United States"
      ),
    position = position_jitterdodge(
      jitter.width = 0.01,
      dodge.width = 0.7
    ),
    color = "#FFDF00",
    size = 7,
    aes(
      fill = sex,
      shape = location_name
    )
  ) +
  scale_color_discrete(breaks = "United States") +
  # this makes the plot look nice
  theme_linedraw() +
  theme(
    plot.title = element_text(
      size = 40,
      hjust = 0.5,
      face = "bold"
    ),
    axis.text.x = element_text(
      angle = 60,
      hjust = 1,
      size = 32
    ),
    legend.title = element_blank(),
    legend.position = "bottom",
    # this is for changing the zoom triangle color
    strip.background = element_rect(
      fill = "grey86",
      colour = "grey86"
    ),
    axis.text.y = element_text(size = 35),
    axis.title.y = element_text(size = 35),
    axis.title.x = element_text(size = 25),
    legend.text = element_text(
      size = 30,
      vjust = .01
    ),
    # this is for changing the legend symbol size
    legend.key.height = unit(2, "cm"),
    legend.key.width = unit(2, "cm"),
    # this is for changing the figure outline
    panel.background = element_rect(
      colour = "black",
      size = 1.5
    )
  ) +
  expand_limits(y = -10) +
  labs(
    title = " Global consumption of foods associated with\n health risk when over-consumed",
    x = "",
    y = "Percent consumption relative \n to guidelines"
  )

Over
```

Let's put the plots together:

```{r, fig.height = 26, fig.width = 18}
cowplot::plot_grid(Over, Under, ncol = 1)
```
Nice!


#### Over-consumed by Age Global trends

Now we will look at Global trends with age. 

To make this plot we will use the `facet_wrap_paginate()` function of the `ggforce` package which allows you to specify the number of columns or rows for the facets.

```{r, fig.width = 16, fig.height = 18}

Over_age <- all_age_diet_and_guidelines %>%
  filter(direction == "high") %>%
  filter(Relative_Percent < 4000) %>%
  ggplot(aes(
    y = Relative_Percent,
    x = age_group_name,
    fill = age_group_name
  )) +
  geom_boxplot(
    outlier.shape = NA,
    color = "black",
    aes(fill = age_group_name)
  ) +
  geom_hline(
    yintercept = 100,
    linetype = "dashed",
    color = "red",
    size = 2
  ) +
  theme_linedraw() +
  theme(
    plot.title = element_text(
      size = 30,
      hjust = 0.5,
      face = "bold"
    ),
    axis.text.x = element_blank(),
    legend.title = element_blank(),
    legend.position = "bottom",
    axis.text.y = element_text(size = 28),
    axis.title.y = element_text(size = 30),
    axis.title.x = element_text(size = 30),
    strip.text.x = element_text(
      size = 25,
      face = "bold"
    ),
    legend.text = element_text(size = 25),
    legend.key.height = unit(1.2, "cm"),
    legend.key.width = unit(1.2, "cm"),
    panel.background = element_rect(
      colour = "black",
      size = 1.5
    )
  ) +
  labs(
    title = " Global consumption across age\n  of foods associated\n with health risk when over-consumed",
    x = "Age",
    y = "\nPercent consumption relative \n to guidelines"
  ) +
  ggforce::facet_wrap_paginate(~food,
    ncol = 2,
    scales = "free"
  ) +
  scale_fill_viridis(discrete = TRUE)

Over_age
```


Hmmm these are a bit difficult to see for some of the dietary factors like red meat because the outliers are making the range much larger than the box plot themselves. We can calculate the values for the box plots ourselves to deal with this:

```{r}
calc_stat <- function(x) {
  coef <- 1.5
  n <- sum(!is.na(x))
  # calculate quantiles
  stats <- quantile(x, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
  names(stats) <- c("ymin", "lower", "middle", "upper", "ymax")
  return(stats)
}
```

Then we can use it in our plot code:

This is thanks to this:
https://stackoverflow.com/questions/25124895/no-outliers-in-ggplot-boxplot-with-facet-wrap

```{r, fig.width = 16, fig.height = 18}
# map_df(all_age_diet_and_guidelines %>% select(Relative_Percent), calc_stat)

Over_age <- all_age_diet_and_guidelines %>%
  filter(direction == "high") %>%
  ggplot(aes(
    y = Relative_Percent,
    x = age_group_name,
    fill = age_group_name
  )) +
  # here we will replace geom_boxplot() yet still create boxplots
  stat_summary(
    fun.data = calc_stat,
    geom = "boxplot",
    outlier.shape = NA,
    color = "black",
    lwd = 1.1,
    aes(fill = age_group_name)
  ) +
  geom_hline(
    yintercept = 100,
    linetype = "dashed",
    color = "red",
    size = 2
  ) +
  theme_linedraw() +
  theme(
    axis.text.x = element_text(
      angle = 60,
      hjust = 1
    ),
    legend.title = element_blank()
  ) +
  theme_linedraw() +
  theme(
    plot.title = element_text(
      size = 30,
      hjust = 0.5,
      face = "bold"
    ),
    axis.text.x = element_blank(),
    legend.title = element_blank(),
    legend.position = "bottom",
    axis.text.y = element_text(size = 28),
    axis.title.y = element_text(size = 30),
    axis.title.x = element_text(size = 30),
    strip.text.x = element_text(
      size = 25,
      face = "bold"
    ),
    legend.text = element_text(size = 25),
    legend.key.height = unit(1.2, "cm"),
    legend.key.width = unit(1.2, "cm"),
    panel.background = element_rect(
      colour = "black",
      size = 1.5
    )
  ) +
  labs(
    title = " Global consumption across age\n  of foods associated\n with health risk when over-consumed",
    x = "Age",
    y = "\nPercent consumption relative \n to guidelines"
  ) +
  guides(fill = guide_legend(
    nrow = 5,
    byrow = TRUE
  )) +
  facet_wrap_paginate(~food,
    ncol = 2,
    scales = "free"
  ) +
  scale_fill_viridis(discrete = TRUE)

Over_age
```

#### Under-consumed by Age Global Trends

```{r, fig.width=16, fig.height=18}
Under_age <- all_age_diet_and_guidelines %>%
  filter(direction == "low") %>%
  ggplot(aes(
    y = Relative_Percent,
    x = age_group_name,
    fill = age_group_name
  )) +
  # here we will replace geom_boxplot()
  stat_summary(
    fun.data = calc_stat,
    geom = "boxplot",
    outlier.shape = NA,
    color = "black",
    lwd = 1.1,
    aes(fill = age_group_name)
  ) +
  geom_hline(
    yintercept = 100,
    linetype = "dashed",
    color = "red",
    size = 2
  ) +
  theme_linedraw() +
  theme(
    axis.text.x = element_text(
      angle = 60,
      hjust = 1
    ),
    legend.title = element_blank(),
    legend.position = "bottom"
  ) +
  facet_wrap_paginate(~food, nrow = 5, scales = "free") +
  theme_linedraw() +
  theme(
    plot.title = element_text(
      size = 30,
      hjust = 0.5,
      face = "bold"
    ),
    axis.text.x = element_blank(),
    legend.title = element_blank(),
    legend.position = "none",
    axis.text.y = element_text(size = 28),
    axis.title.y = element_text(size = 30),
    axis.title.x = element_text(size = 25),
    strip.text.x = element_text(
      size = 25,
      face = "bold"
    ),
    panel.background = element_rect(
      colour = "black",
      size = 1.5
    ),
    axis.ticks.x = element_blank()
  ) +
  labs(
    title = "\n Global consumption across age\n  of foods associated\n with health risk when under-consumed",
    x = "",
    y = "\nPercent consumption relative \n to guidelines"
  ) +
  scale_fill_viridis(discrete = TRUE)

Under_age
```




Now let's put all the plots together:

```{r, fig.height=33, fig.width = 34}
cowplot::plot_grid(Over,
  Over_age,
  Under,
  Under_age,
  ncol = 2,
  rel_widths = c(2.5, 1)
)
```

```{r, include = FALSE, echo = FALSE}
png(
  filename = here::here("img", "mainplot.png"),
  res = 300, width = 34, height = 30, units = "in"
)
cowplot::plot_grid(Over,
  Over_age,
  Under,
  Under_age,
  ncol = 2,
  rel_widths = c(2.5, 1)
) + theme(plot.margin = unit(c(1, 1, 1, 1), "cm"))
dev.off()
```


## **Summary**
***

### **Synopsis**
***

We have evaluated average consumption estimates of 15 dietary factors with probably non-communicable disease (NCD) risk from 195 different countries around the world. To do so we imported data from a PDF using the `pdftools` package, as well as data from two CSV files using `readr`. We used `tidyverse` packages such as `dplyr`, `stringr`, and `tidy` to clean and join the data from the PDF with the CSV files. 

We learned that regression is a powerful and flexible statistical tool that simplifies or estimates the relationships between variables using a mathematical model. We learned about the utility of regression techniques to compare groups, look for associations between variables, and predict outcomes based on multiple predictor or explanatory variables. We then compared this to other popular tests like the $t$-test and the ANOVA. We learned that these tests are actually equivalent to specialized types of regressions.

Our statistical analysis focused on evaluating differences in the consumption of red meat around the world between females and males and across different age groups. First we looked at the assumptions of [$t$-tests](https://stattrek.com/statistics/dictionary.aspx?definition=two-sample%20$t$-test){target="_blank"} and regressions, and determined that the rate of red meat consumption relative to the optimal guideline-suggested amount was right skewed. We learned that we could transform the data by taking the log of these values to achieve more normally distributed data. To compare males and females we used a $t$-test and learned that a $t$-test is a specialized form of a linear regression. To compare the 15 different age groups we used an ANOVA and learned that ANOVA is also a specialized form of linear regression. We examined how we obtained the same results using either statistical test. This was also the case if we looked at the effect of gender and controlled for the paired country structure in the data by either including `location_name` in the model as another term or by using a mixed effects model to control for this structure as a random effect but not specifically test for the influence of `location_name` on red meat consumption estimates. We learned that fixed effects are those that we wish to evaluate, while random effects are those that may influence the relationships of our variables of interest but that we do not wish to actively evaluate. Using these tests and models, we determined that males consume more red meat than females on average around the world. 

Our ANOVA analysis of age determined that indeed there is at least one age group that consumed a significantly different amount of red meat compared to the other age groups, and this was still the case when we controlled for `location_name`. However, we learned that the ANOVA does not provide information about which age groups are different. We learned how the regression could provide some quantification of the effect of specific age groups relative to the reference age group. Furthermore, our data visualizations allowed us to determine that in general red meat consumption appears to be higher in the younger age groups relative to the older age groups. 

Finally, we also looked at differences in red meat consumption between the different countries and saw in our ANOVA analysis and our regression analysis that there were significant differences. We were able to use a regression that included `sex`, `age_group_name`, and `location_name` to evaluate the influence of each of the three demographic factors on consumption while controlling or accounting for the other two. Our results demonstrated that all three influenced or were associated with red meat consumption.

In preforming our statistical analyses we learned about the assumptions of the $t$-test, regression, and the ANOVA. We also learned about important methods to tests these assumptions.

Using the `ggplot2` package we were able to visualize trends in the data and to compare consumption of these dietary factors in the US with that of the other countries.

We see that the populations in many countries are over-consuming foods that are associated with health risk when over-consumed. In particular processed meat and sugar-sweetened beverages appear to be the most over consumed. Importantly both of these appear to be consumed at higher quantities by males and younger adults. People in the US  appear to consume fewer sugar-sweetened beverages than other countries, however, people are still over-consuming. Processed meat however appears to be especially bad in the US. In terms of food that need to be consumed in adequate amounts to overcome health risk, nearly all countries for all factors are not reaching guideline levels. However, there are some countries consuming more than adequate amounts of legumes, vegetables, fruits and fiber. People in the US appear to eat more milk products and consume more omega-3 fatty acids and calcium rich foods than other countries. All countries including the US consume very low levels of polyunsaturated fatty acids. These [polyunsaturated fatty acids](https://en.wikipedia.org/wiki/Polyunsaturated_fat) are abundant in seeds, nuts and avocados, as well as fish. Likely the low level of consumption of nuts and seeds contributes to these low polyunsaturated fatty acid estimates. The supplementary table included in the article suggests that poor consumption of polyunsaturated fatty acids is associated with ischemic heart disease. The article takes this data further to evaluate the association of consumption levels of these foods with mortality.

Analyses like the one in our case study are important for defining which groups could benefit the most from interventions, education, and policy changes when attempting to mitigate public health challenges. You can see in the [article](https://www.thelancet.com/action/showPdf?pii=S0140-6736%2819%2930041-8){target="_blank"} however that many additional considerations would be involved to perform a more thorough analysis to adequately understand the data enough to recommend policy changes.


## **Suggested Homework**
***

Students can evaluate consumption estimates of another dietary factor besides red meat.


## **Additional Information**
*** 

### **Helpful Links**
***

<u>Terms and concepts covered:</u>  

[Tidyverse](https://www.tidyverse.org/){target="_blank"}  
[RStudio cheatsheets](https://rstudio.com/resources/cheatsheets/){target="_blank"}  
[Interpunct](https://www.shorttutorials.com/mac-os-special-characters-shortcuts/middle-dot.html){target="_blank"}  
[Regular expressions](https://www.r-bloggers.com/regular-expressions-every-r-programmer-should-know/){target="_blank"}  
[Inference](https://www.britannica.com/science/inference-statistics){target="_blank"}  
[Regression](https://lindeloev.github.io/tests-as-linear/){target="_blank"}  
[Different types of regression](https://www.analyticsvidhya.com/blog/2015/08/comprehensive-guide-regression/){target="_blank"}  
[Ordinary least squares method](http://setosa.io/ev/ordinary-least-squares-regression/){target="_blank"}  
[Residual](https://www.statisticshowto.datasciencecentral.com/residual/){target="_blank"}  
[$t$-tests](https://stattrek.com/statistics/dictionary.aspx?definition=two-sample%20$t$-test){target="_blank"}  
[ANOVA](http://onlinestatbook.com/2/analysis_of_variance/intro.html){target="_blank"}  
[$t$-tests and ANOVA are equivalent to regression](https://scientificallysound.org/2017/06/08/$t$-test-as-linear-models-r/){target="_blank"} also see [here](https://towardsdatascience.com/everything-is-just-a-regression-5a3bf22c459c){target="_blank"} and [here](https://lindeloev.github.io/tests-as-linear/){target="_blank"} about how many commonly known statistical tests are specialized forms of regression  
[Normal Distribution](https://www.physiology.org/doi/full/10.1152/advan.00064.2017){target="_blank"}  
[Q-Q plot](http://onlinestatbook.com/2/advanced_graphs/q-q_plots.html){target="_blank"}  
[Guide to residual diagnostic plots](https://data.library.virginia.edu/diagnostic-plots/) and [Examples](http://docs.statwing.com/interpreting-residual-plots-to-improve-your-regression/){target="_blank"}  
[Residual vs fitted plot](https://online.stat.psu.edu/stat462/node/118/){target="_blank"}  
[Scale-location plot](https://boostedml.com/2019/03/linear-regression-plots-scale-location-plot.html){target="_blank"}  
[Homoscedasticity ](https://www.statisticssolutions.com/homoscedasticity/){target="_blank"}  
[Heteroscedasticity](https://statisticsbyjim.com/regression/heteroscedasticity-regression/){target="_blank"}  
[Interpreting `lm()` output](https://feliperego.github.io/blog/2015/10/23/Interpreting-Model-Output-In-R){target="_blank"}  
[Coefficients](https://www.theanalysisfactor.com/interpreting-regression-coefficients/){target="_blank"}  
[Linear mixed effects regression](https://ourcodingclub.github.io/tutorials/mixed-models/){target="_blank"}  
[Satterthwaite formula](https://www.statisticshowto.datasciencecentral.com/satterthwaite-formula/){target="_blank"}  
[Mood's Two-Sample Scale Test](https://files.eric.ed.gov/fulltext/ED065559.pdf){target="_blank"}   
[Standard deviation](https://www.statsdirect.com/help/basic_descriptive_statistics/standard_deviation.htm){target="_blank"}  
[Homogeneity of Variances assumption](https://uc-r.github.io/assumptions_homogeneity){target="_blank"}   
[polyunsaturated fatty acids](https://en.wikipedia.org/wiki/Polyunsaturated_fat){target="_blank"} 


<u>Tests of Homogeneity of Variance for 3 or more groups:</u>

[Bartlett's test](https://www.itl.nist.gov/div898/handbook/eda/section3/eda357.htm){target="_blank"}  
[Fligner-Killeen](http://wiki.stat.ucla.edu/socr/index.php/AP_Statistics_Curriculum_2007_NonParam_VarIndep){target="_blank"}  
[Levene's test](https://www.itl.nist.gov/div898/handbook/eda/section3/eda35a.htm){target="_blank"}  
 

<u>Other helpful links:</u>

[Long and Wide Data Formats](https://opencasestudies.github.io/ocs-healthexpenditure/ocs-healthexpenditure.html){target="_blank"}    
[Distributions](http://onlinestatbook.com/2/introduction/distributions.html){target="_blank"} 
[Skewed Distributions](http://onlinestatbook.com/2/glossary/skew.html){target="_blank"} 
[Bimodal Distribution](http://onlinestatbook.com/2/introduction/distributions.html){target="_blank"} 
[ggplot2](https://opencasestudies.github.io/ocs-healthexpenditure/ocs-healthexpenditure.html){target="_blank"}    
[Shapiro-Wilk Test](http://www.statistics4u.info/fundstat_eng/ee_shapiro_wilk_test.html){target="_blank"}   
[Paired Data](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5579465/){target="_blank"}  
[Welch's $t$-test](https://www.statisticshowto.datasciencecentral.com/welchs-test-for-unequal-variances/){target="_blank"}    
[Parametric and Nonparametric Methods](https://www.mayo.edu/research/documents/parametric-and-nonparametric-demystifying-the-terms/doc-20408960){target="_blank"}   
[Variance](https://stattrek.com/statistics/dictionary.aspx?definition=variance){target="_blank"}  
[Balanced Study Design](https://www.statisticshowto.datasciencecentral.com/balanced-and-unbalanced-designs/){target="_blank"}  
[Independent Observations](https://www.stat.cmu.edu/~cshalizi/36-220/lecture-5.pdf){target="_blank"}  
[Transformation](https://www.statisticshowto.datasciencecentral.com/transformation-statistics/){target="_blank"}  
[Permutation/Resampling Methods](https://jhu-advdatasci.github.io/2019/lectures/21-resampling-techniques.html){target="_blank"}   
[Central Limit Theorem](https://www.analyticsvidhya.com/blog/2019/05/statistics-101-introduction-central-limit-theorem/){target="_blank"} 
[Wilcoxon Signed Rank Test](http://www.biostathandbook.com/wilcoxonsignedrank.html)   
[Wilcoxon Rank Sum Test](http://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_nonparametric/BS704_Nonparametric4.html){target="_blank"}  
[Two-sample Kolmogorov-Smirnov Test](https://www.itl.nist.gov/div898/software/dataplot/refman1/auxillar/ks2samp.htm){target="_blank"}  
[Type 1 Error](https://web.ma.utexas.edu/users/mks/statmistakes/errortypes.html){target="_blank"}  
[p-value](https://towardsdatascience.com/p-values-explained-by-data-scientist-f40a746cfc8){target="_blank"}  
[Multiple Testing](https://www.gs.washington.edu/academics/courses/akey/56008/lecture/lecture10.pdf){target="_blank"}    
[Bonferroni Method of Multiple Testing Correction](http://mathworld.wolfram.com/BonferroniCorrection.html){target="_blank"}

<u>Packages used in this case study: </u>

 Package   | Use in this case study                                                                        
---------- |-------------
[here](https://github.com/jennybc/here_here){target="_blank"}       | to easily load and save data  
[readr](https://readr.tidyverse.org/){target="_blank"}      | to import the CSV file data  
[dplyr](https://dplyr.tidyverse.org/){target="_blank"}      | to arrange/filter/select/compare specific subsets of the data  
[skimr](https://cran.r-project.org/web/packages/skimr/index.html){target="_blank"}      | to get an overview of data    
[pdftools](https://cran.r-project.org/web/packages/pdftools/pdftools.pdf){target="_blank"}   | to read a PDF into R   
[stringr](https://stringr.tidyverse.org/articles/stringr.html){target="_blank"}    | to manipulate the text within the PDF of the data   
[magrittr](https://magrittr.tidyverse.org/articles/magrittr.html){target="_blank"}   | to use the `%<>%` piping operator  
[purrr](https://purrr.tidyverse.org/){target="_blank"}      | to perform functions on all columns of a tibble   
[tibble](https://tibble.tidyverse.org/){target="_blank"}     | to create data objects that we can manipulate with  dplyr/stringr/tidyr/purrr  
[tidyr](https://tidyr.tidyverse.org/){target="_blank"}      | to separate data within a column into multiple columns 
[ggplot2](https://ggplot2.tidyverse.org/){target="_blank"}    | to make visualizations with multiple layers  
[ggpubr](https://cran.r-project.org/web/packages/ggpubr/index.html){target="_blank"}    | to easily add regression line equations to plots  
[forcats](https://forcats.tidyverse.org/){target="_blank"}    | to change details about factors (categorical variables)  
[lmerTest](https://cran.r-project.org/web/packages/lmerTest/lmerTest.pdf)| to perform linear mixed model testing   
[car](https://cran.r-project.org/web/packages/car/car.pdf)| to perform Levene's Test of Homogeneity of Variances   
[ggiraph](https://cran.r-project.org/web/packages/ggiraph/index.html)| to make plots interactive   
[ggforce](https://cran.r-project.org/web/packages/ggforce/ggforce.pdf)| to modify facets in plots  
[viridis](https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html)| to plot in color palette    
[cowplot](https://cran.r-project.org/web/packages/cowplot/vignettes/introduction.html){target="_blank"} | to allow plots to be combined   

### **Session Info**
***

```{r}
sessionInfo()
```

**Estimate of RMarkdown Compilation Time: **

```{r, echo=FALSE}
rmarkdown:::perf_timer_stop("render")
pts = rmarkdown:::perf_timer_summary()
cat("About", round(pts$time[1]/1000 + 5), "-", round(pts$time[1]/1000 + 15),"seconds")
```

This compilation time was measured on a PC machine operating on Windows 10. This range should only be used as an estimate as compilation time will vary with different machines and operating systems.

### **Acknowledgments**
***

We would like to acknowledge [Jessica
Fanzo](https://bioethics.jhu.edu/people/profile/jessica-fanzo/) for
assisting in framing the major direction of the case study, as well as [Ashkan Afshin](https://globalhealth.washington.edu/faculty/ashkan-afshin) and [Erin Mullany](http://www.healthdata.org/about/erin-mullany) for giving us access to the data.

We would like to acknowledge [Michael Breshock](https://mbreshock.github.io/) for his contributions to this case study and developing the `OCSdata` package.

We would also like to acknowledge the [Bloomberg American Health
Initiative](https://americanhealth.jhu.edu/) for funding this work.
