STAT task

Task 5: What’s for breakfast? Name:

The dataset ‘cereal’ contains information on a selection of Kellogg’s (K), General Mills (G), and Post (P) breakfast cereals. Each of these cereals is placed on either shelf 1 (bottom), shelf 2 (middle), or shelf 3 (top). Use the graphical and numerical data summary methods that we have discussed in class to respond to ONE of the following questions:

1. Which brand of cereal is the healthiest?

2. Which shelf has the healthiest cereals?

You are free to determine what constitutes a ‘healthy’ cereal as you see fit, but your definition should include criteria related to at least 2 variables. State your definition of healthy (e.g. ‘My definition of healthy is that a cereal is high in protein and low in salt.’). Then investigate the question you’ve chosen from 1 and 2 above.

For your analysis:

  • Consider a variety of summaries and graphics.

  • Explain why you chose the methods you used.

  • Describe what you learned from each summary.

  • State and justify your conclusions.

Include relevant excerpts from your output in a brief report describing your analysis and conclusions.

The report must be typed and double-spaced; about 3-5 pages including graphics. Attach your code as an appendix (not part of the 3-5 pages).

Include the following (labeled) sections:

  1. Problem Statement

  2. Methods

  3. Conclusions

  4. Discussion

I. Problem Statement: In this section, the author will describe the question(s) being addressed in the analysis and any relevant background that is helpful to understanding the question. For this project, the author will give his/her working definition of ‘healthy’ and justify it. (1-2 paragraphs)

II. Methods: In this section, the author will detail the methods used, the rationale behind using the chosen methods, what was learned from the different methods, and relevant computer output.

III. Conclusions: In this section, the author will describe overall conclusions and the justification for these conclusions from the work done. In particular, how would you answer the original question posed?

IV. Discussion: In this section, the author will discuss the limitations of his/her analysis (e.g. did you have to make any particular assumptions? Was there missing information?), describe additional questions that would be of interest to investigate with the data, and offer any final insights into the question under consideration.

Grading Rubric:

Basic assignment requirements completed +8

  • All required sections included and adequately covered, +2

  • All relevant computer output included and discussed, +1

  • Appropriate analysis and conclusions +2

  • Conclusions sufficiently and appropriately justified, +2

  • Organized, clear +1

Excellent analysis and write-up +1

Exceptional analysis and write-up +2

Taking the analysis and report from ‘good’ to ‘excellent’ or from ‘excellent’ to ‘exceptional’ does not necessarily mean adding more (graphics, words, etc.), though this might be part of it. Rather, it means conducting a more thoughtful and careful analysis and preparing a more organized and informative report on your work.

Helpful R code

Methods in R for selecting rows within column variables.

Code and (where applicable) output

Explanation of Code

> cereal<-read.table("cereal.txt",header=T)

Read in the data, name it cereal, indicate that there is a header row

> attach(cereal)

This command enables us to refer to the columns by name (e.g. the second column can be indicated by ‘mfr’ for manufacturer).
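As a hedged alternative to attach(), columns can also be reached with $ or with(), which avoids surprises if another object happens to share a column's name. The sketch below uses a small made-up data frame called toy (a hypothetical stand-in, not the real cereal data) so that it runs without cereal.txt; only the column names mfr, shelf, and carbo come from this handout.

```r
# Hypothetical stand-in for the cereal data; all values are made up.
toy <- data.frame(
  mfr   = c("K", "K", "G", "G", "P", "P"),
  shelf = c(1, 3, 1, 2, 2, 3),
  carbo = c(21, 14, 12, 17, 11, 15)
)

# Refer to a column with $ instead of attaching the data frame
toy$mfr

# with() evaluates an expression using the data frame's columns by name
k_mean <- with(toy, mean(carbo[mfr == "K"]))
k_mean
```

With the real data, the same pattern would be cereal$mfr or with(cereal, ...).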

> hist(carbo[which(mfr=="K")])

This creates a histogram of the variable ‘carbo’ for those cereals that are from manufacturer Kellogg’s.

hist() creates the histogram

carbo indicates the column variable that we want

[] indicates that we want only select elements from carbo

which specifies the elements that we want

mfr == "K" we want those elements for which the manufacturer is Kellogg’s; the double = causes R to test whether the condition is true, as opposed to setting mfr equal to "K"

You can change the title and labels using instructions in the main R document.
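To see what each piece of the selection does, it can help to run the pieces one at a time. This is a minimal sketch on a made-up data frame called toy (hypothetical values, not the real cereal data); the names is_k and k_carbo are introduced here purely for illustration.

```r
# Hypothetical stand-in for the cereal data; all values are made up.
toy <- data.frame(
  mfr   = c("K", "K", "G", "G", "P", "P"),
  shelf = c(1, 3, 1, 2, 2, 3),
  carbo = c(21, 14, 12, 17, 11, 15)
)

# == tests each element and returns TRUE/FALSE; it does not assign anything
is_k <- toy$mfr == "K"
is_k

# which() turns the TRUE/FALSE values into the positions of the TRUEs
which(is_k)

# [] keeps only the carbo values at those positions
k_carbo <- toy$carbo[which(is_k)]
k_carbo

# Indexing with the logical vector directly gives the same values here,
# because mfr has no missing entries in this toy example
identical(k_carbo, toy$carbo[is_k])
```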

> mean(carbo[which(shelf == "1")])

[1] 15.41667

This selection method works for other functions (such as ‘mean’ here) and variables (‘shelf’ here) as well.
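When the same summary is wanted for every group, one possible shortcut is tapply(), which applies a function to a variable separately within each category. The sketch below runs on a made-up data frame called toy (hypothetical values, not the real cereal data), so the numbers it prints will not match the output shown above.

```r
# Hypothetical stand-in for the cereal data; all values are made up.
toy <- data.frame(
  mfr   = c("K", "K", "G", "G", "P", "P"),
  shelf = c(1, 3, 1, 2, 2, 3),
  carbo = c(21, 14, 12, 17, 11, 15)
)

# Mean of carbo for one shelf, using the selection method from this handout
shelf1_mean <- mean(toy$carbo[which(toy$shelf == "1")])
shelf1_mean

# Mean of carbo for every shelf at once
shelf_means <- tapply(toy$carbo, toy$shelf, mean)
shelf_means

# The same idea works with other functions and grouping variables
tapply(toy$carbo, toy$mfr, median)
```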

> boxplot(carbo~mfr)

This creates side by side boxplots of carbo separated by mfr.

boxplot() creates the boxplot

carbo indicates the variable that we want boxplots of

~ indicates that we’re going to separate the variable by categories

mfr indicates the variable holding the categories for separation.
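The formula form also accepts a data argument, plus the usual title and axis-label arguments, so a labeled version might look like the sketch below. It uses a made-up data frame called toy (hypothetical values, not the real cereal data), and the title and label text are only placeholders.

```r
# Hypothetical stand-in for the cereal data; all values are made up.
toy <- data.frame(
  mfr   = c("K", "K", "G", "G", "P", "P"),
  shelf = c(1, 3, 1, 2, 2, 3),
  carbo = c(21, 14, 12, 17, 11, 15)
)

# Side-by-side boxplots of carbo, one box per manufacturer.
# data = toy tells R where to find carbo and mfr, so attach() is not needed.
bp <- boxplot(carbo ~ mfr, data = toy,
              main = "carbo by manufacturer (toy data)",
              xlab = "Manufacturer (mfr)",
              ylab = "carbo")

# boxplot() also returns the numbers behind the picture
bp$names   # group labels, in plotting order
bp$stats   # per group: lower whisker, lower hinge, median, upper hinge, upper whisker
```

Replacing mfr with shelf in the formula would give one box per shelf instead.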