QUESTION

May 13, 2019

In this project you will use data from the World Bank to fit several regression models that predict outcome variables of interest from carefully chosen predictor variables, with the goal of telling an interesting story. You can choose both the outcomes and the predictors. To see which variables are available, browse the World Bank indicator site (https://data.worldbank.org/indicator/?tab=featured). In the example code included below, I chose infant mortality and child mortality (under age 5) as outcome variables, and a measure of pollution and a measure of child labor as predictor variables. After removing observations with missing values I ended up with a dataset of 71 countries. This is probably a biased sample of all countries, but it is large enough to fit models on and draw conclusions about these countries (even if those conclusions may not generalize to the whole world).

You will submit a report knitted from R Markdown in RStudio. I have provided code below to help get you started by downloading some example data. In RStudio you should start a new R Markdown document, and then add section names (intro, first outcome, second outcome, conclusion), code chunks, and text as necessary to follow the rest of the instructions below.

Note that, depending on which variables you choose, the available data may have too many missing values, leaving a dataset with very few observations. I included some code to check whether this is happening. When you run the line nrow(wb_recent), if the result is about 60 or less, rather than something close to 100 or more, you may need to replace

1. Choose two outcome variables which are related, but not identical, for example two different measures of poverty, or two health outcomes, or two educational outcomes, etc.
2. Choose at least three predictor variables that you want to use to predict the outcomes. Use all the concepts we've talked about in class in this decision. Try to avoid collinearity by not picking multiple variables that measure essentially the same thing, try to find predictors that are interesting and not just another way of measuring the outcome(s), and try to control for the predictors that you think might be important, so that the coefficient(s) have a meaningful interpretation which is not too complicated.
3. Pick one predictor variable to focus on for interpreting and telling your story. This variable should be included as a predictor in all models throughout, while the other predictors may be included or left out.
4. Write an introduction section to explain your analysis plan. This should just be a high-level description of the outcomes and the predictor you're focusing on, and what type of relationship you think you might find (e.g. positive or negative). This introduction should be at the top of the document, before any code chunks that you run to complete the next parts.
5. Starting with one of the outcome variables, fit two regression models, one of which is a sub-model of the other (i.e. has predictors which are a subset of the predictors in the other model), and use an F-test to compare them.
6. Use confint() on both models to see the 95% confidence intervals for the one predictor variable you chose to focus on. Use summary() on both models and note the p-values for the test of your chosen predictor and the adjusted R-squared values for both models.
7. Plot and interpret the regression diagnostic plots, commenting on any clear problems that you see, or any obvious differences in them between the two models. You do not need to fix these problems, just note them.
8. Write a brief summary, perhaps just one paragraph, describing your conclusions about the relationship between the predictor variable you've chosen and the first outcome variable. Describe the interpretations of the coefficients for your chosen predictor variable in both of these models, why the estimated coefficients/intervals for this predictor are different between the two models, and which model you think tells the clearest and most accurate story. Does the relationship match what you thought it would in the introduction? Is it statistically / practically significant? (You may need to check the World Bank site to find the units of the variable to understand practical significance.)
9. Repeat parts 5-8 above with the second outcome variable, but in your summary now focus on any changes/consistencies you notice between using this second outcome and the story you told based on the first outcome. Does the story hold up, or do things seem like they might be more complicated?
10. Write a conclusion of one or two paragraphs giving your high-level interpretation: did the results match what you thought they would, or make you rethink and change your mind? What types of limitations do you think are the most important factors in interpreting your analysis: might the variables have serious measurement issues, did removing missing values result in a sample that might be biased based on which countries are included, did the models have very low predictive ability or obvious problems with fitting the data, can any causal conclusions be reached, and if not, why? Did the coefficients for the predictor variable you focused on tell a consistent story or change dramatically between models, and are there any concerns about things like Simpson's paradox that might apply? Are there any useful or interesting insights about the underlying human issues that these variables are trying to measure that you learned from the analysis?
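The model comparison and interval steps in the list above can be sketched roughly as follows. This is only an illustration on simulated data: the data frame sim, its column names, and the coefficients used to generate it are made up and stand in for wb_recent. It fits a sub-model and a larger model, compares them with anova() (which performs the F-test for nested lm fits), and pulls out the confidence interval, p-value, and adjusted R-squared for the focus predictor.

```r
# Simulated stand-in for wb_recent; all names and values here are hypothetical.
set.seed(1)
n <- 80
sim <- data.frame(pollution = rnorm(n, mean = 30, sd = 10),
                  childlabor = rnorm(n, mean = 15, sd = 5))
sim$infantmort <- 5 + 0.4 * sim$pollution + 0.8 * sim$childlabor +
  rnorm(n, sd = 5)

# The small model's predictors are a subset of the big model's predictors.
small <- lm(infantmort ~ pollution, data = sim)
big   <- lm(infantmort ~ pollution + childlabor, data = sim)

# F-test comparing the nested models (row 2 holds the test).
f_test <- anova(small, big)
print(f_test)

# 95% confidence intervals for the focus predictor in each model.
ci_small <- confint(small, "pollution", level = 0.95)
ci_big   <- confint(big, "pollution", level = 0.95)
print(ci_small)
print(ci_big)

# p-value for the focus predictor and adjusted R-squared for each model.
p_small <- summary(small)$coefficients["pollution", "Pr(>|t|)"]
p_big   <- summary(big)$coefficients["pollution", "Pr(>|t|)"]
adj_r2  <- c(small = summary(small)$adj.r.squared,
             big   = summary(big)$adj.r.squared)
print(c(p_small = p_small, p_big = p_big))
print(adj_r2)
```

With real data, the same calls apply once sim is replaced by wb_recent and the formulas use your own outcome and predictor names.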

Note that wb_variables has to use the code names for variables from the World Bank, and you can rename them by typing the names you want (in the same order) in wb_names, like in my example.

# Run this in the console once before
# running anything else:
# install.packages("wbstats")
# Delete this comment after

library(tidyverse)
library(wbstats)

# Delete the comments below after finishing your project but
# before submitting it

# After loading library(wbstats)
# you can use wbsearch() in the console to find variables
# you're interested in using as predictors or outcomes.
# Can also browse https://data.worldbank.org/indicator/
# to find variables and copy their names from there.

# I did this:
# wbsearch(pattern = "mortality rate")
# SH.DYN.MORT Mortality rate, under-5 (per 1,000 live births)
# SP.DYN.IMRT.IN Mortality rate, infant (per 1,000 live births)
# wbsearch(pattern = "pollution")
# EN.ATM.PM25.MC.M3 PM2.5 air pollution, mean annual exposure (micrograms per cubic meter)
# wbsearch(pattern = "children in employment")
# SL.TLF.0714.ZS Children in employment, total (% of children ages 7-14)

# Don't delete after here, just make changes

# Change to the outcomes and predictors you're interested in
wb_variables <- c("SH.DYN.MORT",
                  "SP.DYN.IMRT.IN",
                  "EN.ATM.PM25.MC.M3",
                  "SL.TLF.0714.ZS")

# Names have to be in the same order as wb_variables!
wb_names <- c("childmort",
              "infantmort",
              "pollution",
              "childlabor")

wb_data <- wb(country = "all", indicator = wb_variables,
              startdate = 2007, enddate = 2018, return_wide = TRUE)

wb_recent <- wb_data %>%
  group_by(country) %>%
  arrange(desc(date)) %>%
  fill(wb_variables) %>% # fill NAs with value from most recent year
  drop_na() %>%
  top_n(n = 1, wt = date) %>%
  ungroup() %>%
  rename_at(vars(wb_variables), ~ wb_names)

# View(wb_recent) and scroll to look for any obvious problems

# Countries included
# wb_recent$country

# Number of countries included
nrow(wb_recent)

# If there are too few countries, it's probably because drop_na()
# removed lots of observations due to some variable being mostly missing.
# If that happens, run the code below to see which variables might be
# causing the problem.
wb_data %>%
  summarize_at(vars(wb_variables), function(column) mean(is.na(column)))

# Can also use View and scroll to look for any obvious problems
# View(wb_data)

# After you identify the problem, go back and remove the variable
# and pick a different one. Remember to change the wb_names to match!

# Once you're done picking variables you can delete this section
# (up to the nrow(wb_recent) line, keep that one)
# and move on to fitting models.

model1 <- lm(infantmort ~ pollution + childlabor, wb_recent)
summary(model1)

model2 <- lm(childmort ~ pollution + childlabor, wb_recent)
summary(model2)
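The example code stops at summary(), while the instructions also ask for regression diagnostic plots. A minimal sketch of one way to draw them is shown here using base R's plot() method for lm objects. The data frame sim and everything in it is simulated and hypothetical, used only so the block runs without downloading anything. The suggested format loads ggfortify, whose autoplot() is an alternative; it is mentioned in a comment but not run here.

```r
# Simulated stand-in for wb_recent; names and values are hypothetical.
set.seed(2)
n <- 80
sim <- data.frame(pollution = runif(n, min = 5, max = 60))
sim$childmort <- 10 + 0.5 * sim$pollution + rnorm(n, sd = 6)

fit <- lm(childmort ~ pollution, data = sim)

# Draw the four default diagnostic plots in a 2 x 2 grid:
# residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage.
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)  # restore the previous plotting layout

# With ggfortify loaded, autoplot(fit) draws a ggplot2 version of these plots.

# The quantities behind the first plot, for inspecting points directly.
diag_data <- data.frame(fitted = fitted(fit), residual = resid(fit))
head(diag_data)
```

When reading the plots, look for curvature or a funnel shape in the residuals vs fitted panel and for points far from the line in the Q-Q panel.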

Suggested Format

```{r include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(wbstats) # data
library(tidyverse) # manipulation
library(ggfortify) # autoplot
```

# Introduction

Brief explanation (one paragraph) describing the analysis plan. What are the predictors and outcomes, which predictor are you focusing on, and what type of relationship are you expecting to find?

# Analysis

## Data description

```{r}
# download data
# give variables easy to understand names
```

Text describing units of measurement for variables

## First outcome

```{r}
# fit models
# show summaries
```

Interpret summaries (note: the p-value at the bottom of the summary is for an F-test comparing that model to the intercept-only model. Aside from R-squared/adjusted R-squared ignore the rest of the bottom few lines of the summary)

```{r}
# do F-test
```

Which model does the test choose?

```{r}
# show confints
```

Interpret intervals, talk about statistical/practical significance for your main predictor

```{r}
# diagnostic plots
```

Interpret the plots, describe any problems you see (pay more attention to the points themselves and don't take the blue line too seriously)

Brief summary of your overall conclusions for this part. If there are any major differences between the models explain why you think that is. Did the results match your expectations about the story you're trying to tell about your main predictor and the outcome?

## Second outcome

Same steps as for the first outcome.

# Conclusion

One or two paragraphs, as described in the project instructions.
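The suggested format notes that the p-value at the bottom of summary() comes from an F-test comparing the model to the intercept-only model. A small sketch for checking that on simulated data follows. The data frame sim and its columns are hypothetical. The block recomputes the p-value from the F statistic stored in the summary and compares it to the one from anova() against an intercept-only fit.

```r
# Simulated data; names and values are hypothetical.
set.seed(3)
n <- 60
sim <- data.frame(pollution = rnorm(n), childlabor = rnorm(n))
sim$infantmort <- 1 + 0.5 * sim$pollution + rnorm(n)

full      <- lm(infantmort ~ pollution + childlabor, data = sim)
intercept <- lm(infantmort ~ 1, data = sim)

# summary() stores the overall F statistic and its degrees of freedom.
fs <- summary(full)$fstatistic
p_summary <- unname(pf(fs["value"], fs["numdf"], fs["dendf"],
                       lower.tail = FALSE))

# The same test done explicitly as a nested-model comparison.
p_anova <- anova(intercept, full)[2, "Pr(>F)"]

print(c(p_summary = p_summary, p_anova = p_anova))
print(all.equal(p_summary, p_anova))
```

The two p-values agree, which is why that line of the summary says nothing about any single predictor on its own.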