
QUESTION

Apr 10, 2026


In this assignment, we will run a correlation and a bivariate regression analysis to explore the relationship between median_aqi and child_mortality (deaths among children under age 18 per 100,000 population). We will use the R functions ggplot to create a scatterplot, cor to compute the correlation, and lm to run the bivariate regression.
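The three functions named above fit together roughly as in the following minimal sketch. It assumes a data frame called aq with columns median_aqi and child_mortality; the values below are made up purely for illustration and are not the textbook data. The scatterplot step assumes the ggplot2 package is installed and is skipped otherwise.

```r
# Made-up illustrative values; NOT the textbook data.
aq <- data.frame(
  median_aqi      = c(28, 35, 41, 44, 50, 53, 60, 38),
  child_mortality = c(42.1, 47.5, 51.0, 49.8, 58.3, 55.2, 63.9, 45.0)
)

# Scatterplot with a fitted straight line (needs the ggplot2 package).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(aq, aes(x = median_aqi, y = child_mortality)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
  print(p)
}

# Pearson correlation; use = "complete.obs" drops rows with missing values.
r <- cor(aq$median_aqi, aq$child_mortality, use = "complete.obs")
print(r)

# Bivariate regression: child_mortality is the outcome, median_aqi the predictor.
fit <- lm(child_mortality ~ median_aqi, data = aq)
print(summary(fit))
```

In the summary output, the row for median_aqi gives the estimated slope and its p-value, which is the usual basis for deciding whether a relationship was found.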

Assignment 6 Instructions

Read Chapters 9 & 11
To test the relationship between median_aqi and child_mortality (deaths among children under age 18 per 100,000 population), you will need to extract the appropriate data from the textbook database using MySQL Workbench, prepare the data using Excel, load the data into RStudio, run the appropriate analyses, and describe the results.
For this assignment, you are expected to use MySQL Workbench, Excel, and RStudio more independently, but some additional instructions are provided; see the attachment below.
For this assignment, you will submit a Word document containing a scatterplot, correlation, and regression output.
You will also need to interpret the findings and answer the following questions:

a. What is the hypothesis for this analysis?

b. Did you find a relationship between median_aqi and child_mortality?

c. What conclusions might you draw from these findings?

Is there a relationship between the rate of air pollution particulates measured in US counties and population health—to include diabetes, low birth weight, frequent mental distress, poor or fair health, and life expectancy?

The dependent variables for the proposed study include the following:

1. Diabetes prevalence, low birth weight, frequent mental distress, poor or fair health, and life expectancy

For the proposed research question, there is one independent variable: hazardous air pollutants. Pollutants are measured by the air quality index (AQI), which provides the number of good days, moderate days, unhealthy days, very unhealthy days, and hazardous days based on the air pollutants measured for a specific county. The AQI is considered a measure of air quality by the EPA. For the remainder of this chapter, the independent variable will be referred to as air pollution particulate matter, or PM.

Hypotheses

The research questions will require more than one hypothesis. The hypotheses will include the following:

H10: There is not a relationship between the PM and the prevalence rate of diabetes in US counties.

H1A: There is a relationship between the PM and the prevalence rate of diabetes in US counties.

H20: There is not a relationship between the PM and the rate of infants born with low birth weight in US counties.

H2A: There is a relationship between the PM and the rate of infants born with low birth weight in US counties.

H30: There is not a relationship between the PM and the rate of frequent mental distress in US counties.

H3A: There is a relationship between the PM and the rate of frequent mental distress in US counties.

H40: There is not a relationship between the PM and the rate of poor or fair health in US counties.

H4A: There is a relationship between the PM and the rate of poor or fair health in US counties.

H50: There is not a relationship between the PM and life expectancy in US counties.

H5A: There is a relationship between the PM and life expectancy in US counties.

To answer our research questions, we require a specific process of data acquisition, preparation, and discovery. The following steps will be explained in detail:

1. Extract the data sets from MySQL.

a. The corrplot package was added for use in this chapter:

library(corrplot)

## corrplot 0.84 loaded

2. Prepare the data.

a. In Microsoft Excel, we had to find and replace all NULL values with blanks after the data set was downloaded from MySQL.

3. Import the data into RStudio.

4. Perform descriptive statistics on the variables of PM, diabetes, frequent mental distress, poor or fair health, and life expectancy.

5. Conduct similar linear regressions to examine the relationship between PM and measures of population health at the US county level.
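Steps 4 and 5 above could be sketched in R as follows. This is only an illustration under stated assumptions: it uses median_aqi as the PM measure, takes the outcome column names from the query in this chapter, and fills them with simulated noise rather than the real aq.csv values, so the printed estimates mean nothing substantively.

```r
# Simulated stand-in for aq.csv; pure noise, NOT real county data.
set.seed(42)
n <- 50
aq <- data.frame(median_aqi = runif(n, 20, 70))
outcomes <- c("diabetes_prevalence", "low_birthweight",
              "frequent_mental_distress", "poor_or_fair_health",
              "life_expectancy")
for (v in outcomes) aq[[v]] <- rnorm(n)

# Step 4: descriptive statistics for every variable.
print(summary(aq))

# Step 5: one bivariate regression per outcome, each with median_aqi as predictor.
fits <- lapply(outcomes, function(v) {
  lm(reformulate("median_aqi", response = v), data = aq)
})
names(fits) <- outcomes

# Collect the slope and its p-value from each model into one table.
results <- t(sapply(fits, function(f) {
  summary(f)$coefficients["median_aqi", c("Estimate", "Pr(>|t|)")]
}))
print(results)
```

Each row of results corresponds to one pair of hypotheses: a small p-value for the slope would count as evidence against the matching null hypothesis.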

The Analysis

This section provides a step-by-step description of obtaining and analyzing the data required to answer the proposed research question.

Step 1: Extract the Data

In order to carry out the analysis, we need the following data:

1. A list of all US counties with measures of levels of the AQI

2. A measure of the prevalence of diabetes cases for each county

3. A measure of the prevalence of births that are considered to be low birth weight for each county

4. A measure of the prevalence of frequent mental distress for each county

5. A measure of the prevalence of poor or fair health for each county

6. A measure of life expectancy for each county

One data set is used to answer the proposed research questions. The MySQL query draws on the air_pollutants, geo_fips_region, and chr_health_outcomes tables.

The MySQL script that was used to extract the data from the database included the following:

SELECT
    f.state_name,
    f.area_name,
    f.region,
    f.subregion,
    a.days_with_aqi,
    a.good_days,
    a.moderate_days,
    a.unhealthy_for_sensitive_groups_days,
    a.unhealthy_days,
    a.very_unhealthy_days,
    a.hazardous_days,
    a.median_aqi,
    c.child_mortality,
    c.diabetes_prevalence,
    c.infant_mortality,
    c.frequent_mental_distress,
    c.poor_or_fair_health,
    c.life_expectancy,
    c.low_birthweight,
    c.premature_age_adjusted_mortality,
    c.premature_death
FROM
    air_pollutants AS a
        JOIN
    geo_fips_region AS f ON a.state_county_fips = f.state_county_fips
        JOIN
    chr_health_outcomes AS c ON c.fips_code = f.state_county_fips;

The data set can be exported from MySQL by selecting the Export button and naming the file aq.csv. This data set includes days_with_aqi, good_days, moderate_days, unhealthy_for_sensitive_groups_days, unhealthy_days, very_unhealthy_days, hazardous_days, median_aqi, child_mortality, diabetes_prevalence, infant_mortality, frequent_mental_distress, poor_or_fair_health, life_expectancy, low_birthweight, premature_age_adjusted_mortality, and premature_death.

Step 2: Prepare the Data

One of the first steps in data preparation is to carefully examine the data for erroneous values or blanks. An erroneous value may be a filler, that is, a default value used to indicate a blank field. For example, the value –1111.1 may be used to represent a blank field entry. A value of NULL is also common after exporting data from MySQL. One way to handle erroneous values is to replace them with a blank entry. Filler values must be removed from your data set for two reasons. If the fillers are numeric, they can skew the mean and alter the results of inferential statistical analyses. If they are formatted as text, the column may not support a quantitative analysis such as the calculation of a mean.
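The effect of a numeric filler on a mean can be seen in a small R sketch. The vector below is made up for illustration; only the filler value -1111.1 comes from the example in the text.

```r
# Made-up measurements; -1111.1 is a filler standing in for a blank field.
x <- c(10.2, 11.5, -1111.1, 9.8, 12.1)

# With the filler left in, the mean is dragged far below every real value.
mean_with_filler <- mean(x)
print(mean_with_filler)

# Replace the filler with NA (R's missing value) and exclude it from the mean.
x[x == -1111.1] <- NA
mean_clean <- mean(x, na.rm = TRUE)
print(mean_clean)

# A text filler has a different effect: the whole column becomes character,
# so a mean cannot be computed until the filler is removed.
y <- c("10.2", "NULL", "9.8")
print(class(y))
```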

The data set that was queried from MySQL and exported as aq.csv contains NULL values that need to be removed. Open the file in Microsoft Excel by double-clicking it. Under the Home tab, select the magnifying glass and then select Replace. In the window that appears, enter “NULL” into the “Find what:” text field and leave the “Replace with:” field blank. After you select Replace All, every NULL value will be replaced with a blank entry. See figure 9.1 for an example of how to use the Find and Replace functionality in Microsoft Excel.
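As an alternative to the Excel step (not the chapter's own method), the NULL text can be treated as missing at import time in R. The sketch below writes a tiny made-up file to a temporary path so that it runs on its own; the county names and values are hypothetical, and in practice the path would point to aq.csv.

```r
# Write a tiny made-up CSV that mimics an export containing the text NULL.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "area_name,median_aqi,child_mortality",
  "County A,35,48.2",
  "County B,NULL,51.7",
  "County C,42,NULL"
), path)

# na.strings tells read.csv which text values to convert to NA on import.
aq <- read.csv(path, na.strings = c("NULL", ""), stringsAsFactors = FALSE)

# The columns stay numeric, and the missing entries can be counted per column.
str(aq)
print(colSums(is.na(aq)))
```

Either approach gives the same result once the data are in RStudio: the affected cells are read as missing values instead of as text.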