R (or RStudio) to investigate the Law of Large Numbers

Project 2

Due Date: Thursday, April 20th during recitation

In this project, we are going to use R (or RStudio) to investigate the Law of Large Numbers, the Central Limit Theorem, and the interpretation of confidence intervals.

R is free statistical software with many mathematical and statistical functions and packages that help us explore data. RStudio is a user-friendly interface for R.

Both R and RStudio are available on MSU computers.

If you would like to install them on your own computer, please go to https://cloud.r-project.org/ to download and install R first, then go to https://www.rstudio.com/products/rstudio/download/ to download and install RStudio. Make sure you have R on your computer before installing RStudio.

WE WILL NOT ANSWER ANY QUESTIONS RELATED TO THEIR INSTALLATION.

RStudio looks like this:


You may type code in the Code Editor, and once you run a line of it, it will appear in the R Console. You can also type code directly in the R Console, but this is not recommended, since code run there is not saved. Variables you define appear in Workspace and History. Plots you draw appear in Plots and Files. If you need help with a specific function, type ? before the function name in the R Console and hit Enter; the help document will appear in Plots and Files.

Note that in RStudio, a line is treated as a comment (not code) if it starts with a pound sign #. Everything after # on a line is not run by RStudio, but everything before # runs normally.
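For example, this small snippet (my own illustration, not part of the assignment) shows a comment and the help shortcut in action:

```r
x = 2 + 3   # everything after the pound sign is ignored by R
x           # typing a variable name prints its value in the R Console
# ?mean     # uncomment and run this line to open the help page for mean()
```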

In the following 3 parts, read the instructions carefully. The code is given; run it and use your results to answer the questions. The green parts are comments, which will help you understand what the code does.

# Problem 1, discover the Law of Large Numbers

# In probability theory, the law of large numbers (LLN) is a theorem that describes the

# result of performing the same experiment a large number of times. According to the law,

# the average of the results obtained from a large number of trials should be close to

# the expected value, and will tend to become closer as more trials are performed.

# Now consider the random experiment of rolling a fair die and record the number on the

# top. We already calculated that the expected value is 3.5. We will simulate the process of

# rolling a fair die 2000 times and see how the average changes

topnumber = rep(1,2000) # create a variable named "topnumber" to store the 2000 numbers from the rolls of the die

average = rep(1,2000) # create a variable named "average" to store the 2000 averages

# We will write a loop to simulate the process of rolling the die 2000 times

for (i in 1:2000) { # repeat the following procedure 2000 times, each time we do

result = sample(1:6,1) # roll the die and record the result in the variable named "result"

topnumber[i] = result # store the result in the i-th position of the variable "topnumber"

averagevalue = mean(topnumber[1:i]) # calculate the average of the first i rolls

average[i] = averagevalue # store the result in the i-th position of the variable "average"

} # end of the loop; the braces { } are needed so that all four lines run on every pass
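As a side note, the same simulation can be written without an explicit loop. This sketch (my own, using sample with replace = TRUE and cumsum; the variable names are not from the assignment) produces the same kind of running average:

```r
set.seed(1)                                # fix the seed so the run is reproducible
rolls = sample(1:6, 2000, replace = TRUE)  # all 2000 die rolls at once
runavg = cumsum(rolls) / seq_along(rolls)  # running average after each roll
runavg[2000]                               # should be close to the expected value 3.5
```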

# Now in the variable "average" we have all 2000 averages. We will plot them and see how the averages

# change as the number of trials increases

plot(average, type = 'l', ylim=c(1,6), xlab=c("number of trials"), ylab=c("average of the previous trials"))

lines(rep(3.5,2000),type='l',col='red')

# The red line is the expected value 3.5

# Now answer the following questions

# (1) What is the result of your first roll?

# (2) Describe how the average values change and how this relates to the Law of Large Numbers

# Problem 2, investigate the Central Limit Theorem

# In probability theory, the central limit theorem (CLT) establishes that, for the most commonly studied scenarios,

# when independent random variables are added, their sum tends toward a normal distribution (commonly known as a

# bell curve) even if the original variables themselves are not normally distributed.

# Specifically, we know that if the population proportion is p and we have a large sample of size n, the sample proportion,

# hat_p is approximately normally distributed with mean p and standard deviation sqrt(p(1-p)/n). However, for small

# sample, this may not be true. How large is large? Let's investigate.

# Consider all MSU students as our population. Assume that 75% of MSU students have laptops. Now we randomly choose

# 10, 50, 500 students and see how their sample proportions are distributed.
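Before running the simulations, it may help to compute the standard deviation sqrt(p(1-p)/n) that the CLT predicts for each sample size. A short calculation of my own:

```r
p = 0.75
n = c(10, 50, 500)
se = sqrt(p * (1 - p) / n)  # predicted sd of p_hat for n = 10, 50, 500
round(se, 4)                # 0.1369 0.0612 0.0194
```

Notice how the predicted spread shrinks as n grows; compare these numbers with the widths of your three histograms.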

# First case n=10, we repeat the sampling process 500 times and see how the sample proportion for a sample of size 10 is distributed

x = rbinom(500,10,0.75) # generate 500 samples of size 10, and store the numbers of students who have laptops in each sample in variable "x"

p_hat = x/10 # calculate the sample proportion for each sample and store them in variable "p_hat"

hist(p_hat, freq = FALSE, breaks = 8, main = c("Histogram of p_hat when n=10")) # make a histogram of sample proportion

#lines(density(p_hat)) # draw the density curve of the distribution of the sample proportions

# Second case n=50, we repeat the sampling process 500 times and see how the sample proportion for a sample of size 50

# is distributed

x = rbinom(500,50,0.75) # generate 500 samples of size 50, and store the numbers of students who have laptops in each sample in variable "x"

p_hat = x/50 # calculate the sample proportion for each sample and store them in variable "p_hat"

hist(p_hat, freq = FALSE, breaks = 8, main = c("Histogram of p_hat when n=50")) # make a histogram of sample proportion

lines(density(p_hat)) # draw the density curve of the distribution of the sample proportions

# Third case n=500, we repeat the sampling process 500 times and see how the sample proportion for a sample of size 500

# is distributed

x = rbinom(500,500,0.75) # generate 500 samples of size 500, and store the numbers of students who have laptops in each sample in variable "x"

p_hat = x/500 # calculate the sample proportion for each sample and store them in variable "p_hat"

hist(p_hat, freq = FALSE, breaks = 8, main = c("Histogram of p_hat when n=500")) # make a histogram of sample proportion

lines(density(p_hat)) # draw the density curve of the distribution of the sample proportions
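One numeric check of the CLT (a sketch of my own, for the n = 500 case): the spread of the simulated p_hat values should be close to the predicted sqrt(p(1-p)/n).

```r
set.seed(123)            # fix the seed so the run is reproducible
x = rbinom(500, 500, 0.75)
p_hat = x / 500
sd(p_hat)                # empirical spread of the 500 sample proportions
sqrt(0.75 * 0.25 / 500)  # theoretical value, about 0.0194
```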

# Now attach your three plots and describe how your results illustrate the Central Limit Theorem

# Problem 3, investigate the Confidence intervals

# The interpretation of a 90% confidence interval is that we are 90% confident that the interval will cover the value of the true parameter.

# In other words, if we construct 90% confidence intervals many times, we expect about 90% of them to cover the value of the true parameter.

# Consider all fish in a river as our population. Assume the average length of the population is 15 inches and the standard deviation

# is 1 inch. Further assume the length of the fish is normally distributed. Now we will construct 90% confidence intervals using different

# simulated samples from this population.

# We will simulate 50 samples of size 100, and use them to construct 50 confidence intervals. Recall a 90% confidence interval for population

# mean with known population standard deviation sigma is x_bar +- 1.645*sigma/sqrt(n); here sigma = 1 and n = 100
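Plugging in the values used in the simulation code (sigma = 1 and n = 100), the margin of error works out as follows (a quick check of my own):

```r
sigma = 1
n = 100
moe = 1.645 * sigma / sqrt(n)  # half-width of the 90% confidence interval
moe                            # 0.1645
```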

lb = rep(1,50) # create a variable named "lb" to store the values of the lower bound of each confidence interval

ub = rep(1,50) # create a variable named "ub" to store the values of the upper bound of each confidence interval

for (i in 1:50) { # repeat the following 50 times, each time we do

fishsample = rnorm(100,15,1) # draw a random sample of size 100 from the population and store the lengths in the variable "fishsample" (we avoid the name "sample" so as not to mask R's built-in sample() function)

x_bar = mean(fishsample) # calculate the sample mean

lbvalue = x_bar-1.645*1/sqrt(100) # calculate the lower bound of the confidence interval

ubvalue = x_bar+1.645*1/sqrt(100) # calculate the upper bound of the confidence interval

lb[i] = lbvalue # store the lower bound in variable "lb"

ub[i] = ubvalue # store the upper bound in variable "ub"

} # end of the loop; the braces { } are needed so that all the lines run on every pass

plot(c(1:50),rep(15,50),ylim=c(14.5,15.5),type='l',xlab=c("observations"),ylab=c("length")) # plot the true mean 15 as a horizontal line

arrows(c(1:50),lb,c(1:50),ub,code=3,length=0.02,angle=90,col='red') # plot the confidence intervals
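To answer question (2) without counting intervals by eye, you can compare the bounds directly. This is a sketch of my own; it re-runs the simulation so that it is self-contained, and your own count will differ from run to run:

```r
set.seed(7)                                  # reproducible run; any seed works
lb = ub = rep(1, 50)
for (i in 1:50) {
  fish = rnorm(100, 15, 1)                   # sample of 100 fish lengths
  lb[i] = mean(fish) - 1.645 * 1 / sqrt(100) # lower bound of the interval
  ub[i] = mean(fish) + 1.645 * 1 / sqrt(100) # upper bound of the interval
}
sum(lb <= 15 & 15 <= ub)                     # how many of the 50 intervals cover 15
```

On average about 45 of the 50 intervals (90%) should cover the true mean.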

# Now attach your plot and answer the following questions

# (1) How many confidence intervals do you expect to cover the true mean 15 inches?

# (2) How many confidence intervals actually cover the true mean 15 inches?