For Mac:
1. Download the appropriate version of .pkg file from the following link.
2. Open the downloaded .pkg file and follow the instructions to install R.
For Linux:
1. For complete R installation in Linux, follow the instructions here.
2. For Linux distributions with apt-get installed (Debian, Ubuntu, etc.) execute the following command in your terminal:
sudo apt-get install r-base
For Windows:
1. Download the binary setup file for R from this link.
2. Open the downloaded .exe file and follow the instructions to install R.
Go to the following link and choose the appropriate installer file for your operating system. Download it, and then run it to install RStudio.
The purpose of this assignment is to take a snapshot of what you know right now. You may not know everything, and that’s okay! Your grade is only determined by whether or not you complete this assignment.
The pre-test can be found here.
Part 1 Download sleepData.csv from Lyceum. This file represents the last ~2.5 years of sleep data from yours truly. I want you to practice reading in data and getting insights from it. Please follow each of the following steps, and place the code that you used to do this in a text file.
Part 2 Based on the ST reading on misleading graphs, I want you to find your own misleading data visualization. The visualization should be representing quantitative data, and may be taken from a news source, an advertisement, etc. Once you have identified your misleading graph, save the image (screenshot is fine), and write a 3-5 sentence paragraph underneath it describing what is misleading about it. Please append this to your Part 1 homework, and save your file as a single PDF and upload to Lyceum.
Take the same graph/visualization that you submitted for last class, and use R to re-create it in a less misleading way. You may not have access to the raw data that created the graph. If this is the case, feel free to guess the datapoints as closely as you can. The key to the assignment is that your reader will get a realistic sense of the data, not that the graphing aesthetics are identical, so don’t worry if you are not sure how to manipulate color, transparency, or other “bells and whistles”. Create your writeup in a .Rmd file that contains both code and graph, and submit this on Lyceum.
Part 1:
Describe a typical student in our class. Use the appropriate measures of central tendency and variability to describe both the most common traits in the studentDemographics.csv file and how they vary. You should write for a non-technical audience, as you would in a newspaper article.
Part 2:
(a) Give three examples (other than ones we discussed in class) of data distributions that follow a normal distribution.
(b) Give three examples (other than ones we discussed in class) of data distributions that follow a long-tailed distribution.
Let’s say that you are a scout for Major League Baseball. You are watching a group of 100 players who have each had 38 plate appearances in the last month. Let us also say that this is a really good group of players whose lifetime batting averages are all 0.333. This means that they will hit the ball into play one out of every three times they have a plate appearance. Although this sounds low, it’s [actually quite good](https://en.wikipedia.org/wiki/Batting_average_(baseball)).
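If it helps to see this setup in code, here is a minimal sketch that treats each player's month as 38 independent trials with success probability 0.333. The independence assumption, the seed, and the variable names are mine, not part of the assignment.

```r
set.seed(1)          # arbitrary seed, only so the run is repeatable

n_players <- 100     # players being scouted
n_pa      <- 38      # plate appearances per player this month
p_hit     <- 0.333   # shared lifetime batting average

# One binomial draw per player: number of hits in 38 plate appearances
hits <- rbinom(n_players, size = n_pa, prob = p_hit)

# Observed monthly batting average for each player
observed_avg <- hits / n_pa

summary(observed_avg)             # spread of the 100 observed averages
sd(observed_avg)                  # simulated variability
sqrt(p_hit * (1 - p_hit) / n_pa)  # binomial standard error, for comparison
```

Every simulated player has the same true ability, so any spread in `observed_avg` comes from chance alone.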
Consider the racial profiling example that we started in class:
In a certain week in 1997, the police at a certain location in Philadelphia made 262 car stops. Of these, 207 drivers were African American. Among the whole population of Philadelphia, 42.2% were African American in 1997. Does this prove the police were guilty of racial profiling, i.e. deliberately stopping drivers because they were African Americans?
Using the binomial distribution, state whether or not you have evidence that the Philadelphia police were profiling African American drivers. Then, compute the complementary analysis: is there any evidence that police were pulling over too few non-Black drivers? Why or why not? Last, consider the assumptions of using the binomial distribution here. Is this a good model for this problem? Why or why not? Submit a single page PDF with your analysis to Lyceum.
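One way to set up the calculation in R is sketched below, using the numbers from the problem. The choice of a one-sided test and the variable names are assumptions on my part. Interpreting the output and judging the model's assumptions are still up to you.

```r
n_stops     <- 262     # car stops that week
black_stops <- 207     # stops of African American drivers
p_pop       <- 0.422   # population share of African Americans

# Exact binomial test: is 207 of 262 more than chance would produce
# if each stop were an independent draw from the population?
res <- binom.test(black_stops, n_stops, p = p_pop, alternative = "greater")
res$p.value

# The same upper-tail probability, computed directly: P(X >= 207)
p_direct <- pbinom(black_stops - 1, size = n_stops, prob = p_pop,
                   lower.tail = FALSE)
p_direct

# Complementary view: 55 of 262 stops were non-Black drivers,
# who made up 57.8% of the population
other_stops <- n_stops - black_stops
res_other <- binom.test(other_stops, n_stops, p = 1 - p_pop,
                        alternative = "less")
res_other$p.value
```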
Download the sleepData.csv file, and load it into RStudio using read.csv(). With this data:
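Loading a CSV and taking a first look can be sketched as follows. The columns of sleepData.csv are not listed here, so the sketch writes a tiny stand-in file first. The column names `date` and `hoursSlept` are hypothetical.

```r
# Stand-in file so the sketch runs on its own; the real column names
# in sleepData.csv may differ from these hypothetical ones.
tmp <- tempfile(fileext = ".csv")
write.csv(
  data.frame(date       = c("2020-01-01", "2020-01-02", "2020-01-03"),
             hoursSlept = c(7.5, 6.0, 8.25)),
  tmp, row.names = FALSE
)

# With the real file this would be: sleep <- read.csv("sleepData.csv")
sleep <- read.csv(tmp)

str(sleep)                  # column names and types
head(sleep)                 # first few rows
summary(sleep$hoursSlept)   # quick numeric summary of one column
```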
You wish to sample Bates College students for their opinions about the 2020 Presidential election. Devise a method for creating a random sample of 100 students for your survey. Please comment on any possible drawbacks of your sampling method. I will not count words, but a sufficiently detailed answer will likely be 100-400 words. Please submit your responses to Lyceum in whatever method works best for you.
Using the cltStarter.Rmd, fill in the missing code to run the simulation. Here, you will use a range of sample sizes from 2 to 100 to show how the mean of the sampling distribution approaches \(\mu\), and that the standard deviation of the sampling distribution approaches the standard error of the mean (\(\sigma/ \sqrt{n}\)). Using your data, please comment on the common statistical adage that the central limit theorem tells us that sample sizes should contain at least 30 items.
Important note: Because the code is incomplete, you will receive a knitting error until the code runs properly. Use the green comments as instructions that point you to the lines that need editing from you (there will be 4 of them). Knit your document to PDF or Word, and upload to Lyceum.
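For orientation, the idea behind the simulation can be sketched independently of cltStarter.Rmd. This is not the starter file's code. The normal population with \(\mu = 10\) and \(\sigma = 2\), the sample sizes, and the number of repetitions are all assumptions chosen for illustration.

```r
set.seed(42)                      # arbitrary seed for repeatability

mu    <- 10                       # assumed population mean
sigma <- 2                        # assumed population standard deviation
sample_sizes <- c(2, 5, 10, 30, 100)
n_reps <- 5000                    # simulated samples per sample size

results <- sapply(sample_sizes, function(n) {
  # Draw n_reps samples of size n and keep each sample's mean
  means <- replicate(n_reps, mean(rnorm(n, mean = mu, sd = sigma)))
  c(n              = n,
    mean_of_means  = mean(means),      # should sit near mu
    sd_of_means    = sd(means),        # empirical spread of the means
    theoretical_se = sigma / sqrt(n))  # sigma / sqrt(n)
})

clt_table <- t(results)           # one row per sample size
clt_table
```

Comparing the `sd_of_means` and `theoretical_se` columns row by row shows how the two relate at each sample size.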
In class, you saw that there is a systematic bias in sample standard deviations that cannot be corrected as easily as the bias in sample variance. It turns out that correcting this bias is actually very hard, and depends on the distribution of the data in ways that go beyond the scope of this course. However, it has been suggested that using \(\sqrt{\frac{1}{n-1.5}\sum_{i=1}^n(x_i-\bar{x})^2}\) will correct the bias in most cases.
Starting with biasHomework.Rmd, fill in the relevant lines to compute sample means, uncorrected sample variance, corrected sample variance, uncorrected sample standard deviation (using sd() is fine here), and corrected sample standard deviation using the equation above. In 1-2 sentences, comment on whether this new correction seems to work.
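The \(n - 1.5\) formula translates to R in a couple of lines. The sketch below is separate from biasHomework.Rmd. The function name `sd_corrected` and the simulated normal data (\(\sigma = 3\), samples of size 5) are my own choices for illustration.

```r
# Standard deviation with n - 1.5 in the denominator,
# following the formula in the assignment text
sd_corrected <- function(x) {
  n <- length(x)
  sqrt(sum((x - mean(x))^2) / (n - 1.5))
}

set.seed(7)        # arbitrary seed for repeatability
sigma <- 3         # assumed true population standard deviation
n     <- 5         # small samples make the bias easier to see

# Many small samples from a normal population
samples <- replicate(20000, rnorm(n, sd = sigma), simplify = FALSE)

mean_sd_usual     <- mean(sapply(samples, sd))            # n - 1 version
mean_sd_corrected <- mean(sapply(samples, sd_corrected))  # n - 1.5 version

c(true = sigma, usual = mean_sd_usual, corrected = mean_sd_corrected)
```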
In your own .Rmd file, read in the studentDemographics.csv file available on Lyceum. Recall that these data came from the student survey at the beginning of the semester. Using these data, compute 95% confidence intervals for the following variables:
Knit your markdown file to a format of your choice, and upload to Lyceum.
Using confidenceIntervals.Rmd that is available on Lyceum, fill in the existing code to simulate 10,000 experiments. In each experiment, take a sample from a known distribution, and calculate a 95% confidence interval for your sample. Fill in the code that keeps track of “if” each interval contains the population mean. Follow the remaining directions in order to check the behavior for 90% confidence intervals, and also how sample size affects the calculations. Note: please do not change curly brackets - they are there for a reason!
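The logic of the coverage simulation can be sketched in a few lines. This is a standalone illustration, not the contents of confidenceIntervals.Rmd. The normal population (mean 50, standard deviation 10), the sample size of 25, and the t-based interval are assumptions.

```r
set.seed(123)      # arbitrary seed for repeatability

mu    <- 50        # assumed population mean
sigma <- 10        # assumed population standard deviation
n     <- 25        # sample size per experiment
n_exp <- 10000     # number of simulated experiments

covered <- logical(n_exp)   # TRUE when an interval contains mu

for (i in seq_len(n_exp)) {
  x <- rnorm(n, mean = mu, sd = sigma)
  # 95% t-based confidence interval for the mean
  half_width <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
  lower <- mean(x) - half_width
  upper <- mean(x) + half_width
  if (lower <= mu && mu <= upper) {
    covered[i] <- TRUE
  }
}

mean(covered)      # proportion of intervals that captured mu
```

Swapping `0.975` for `0.95` gives 90% intervals, and changing `n` shows the effect of sample size on interval width.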
Using monteCarlo.Rmd, fill in the existing code to work through bootstrapping, using both the “zap” data and student demographic data for coffee consumption. The last problem features a permutation test. Please submit your knitted file (PDF, .docx, or html only) - .Rmd files will no longer be accepted.
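As a reminder of the two resampling ideas, here is a minimal sketch on made-up numbers. It is not the code from monteCarlo.Rmd. The vectors `cups`, `group_a`, and `group_b` are hypothetical stand-ins for the real data, and the percentile interval is just one of several bootstrap interval methods.

```r
set.seed(2020)     # arbitrary seed for repeatability

## Bootstrap: resample WITH replacement to estimate uncertainty in a mean
cups <- c(0, 0, 1, 1, 1, 2, 2, 3, 3, 4, 5, 6)   # hypothetical daily cups
boot_means <- replicate(5000, mean(sample(cups, replace = TRUE)))
boot_ci <- quantile(boot_means, c(0.025, 0.975)) # percentile 95% interval
boot_ci

## Permutation test: shuffle group labels to build a null distribution
group_a <- c(1, 2, 2, 3, 4)                      # hypothetical group 1
group_b <- c(0, 1, 1, 2, 5)                      # hypothetical group 2
obs_diff <- mean(group_a) - mean(group_b)

pooled <- c(group_a, group_b)
n_a <- length(group_a)
perm_diffs <- replicate(5000, {
  shuffled <- sample(pooled)                     # reshuffle WITHOUT replacement
  mean(shuffled[seq_len(n_a)]) - mean(shuffled[-seq_len(n_a)])
})

# Two-sided p value: share of shuffles at least as extreme as observed
p_value <- mean(abs(perm_diffs) >= abs(obs_diff))
p_value
```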
Part 1: If you have not completed your permutation code from last time, please complete it now. If you are done, feel free to continue on to Part 2.
Part 2: Your permutation test showed no statistically significant difference in the coffee consumption of sophomores and seniors. Imagine that your grade depended on getting a significant result. How would you “p-hack” this data set to get your desired result? Note that your answer should avoid outright fraud. You may choose to exclude datapoints if you provide a (non-p-hacking) rationale for your decision. Implement your idea: did it give you a p value below 0.05?
The average time for all runners who finished the Cherry Blossom Run in 2006 was 93.29 minutes. We want to determine whether finishing times are getting faster, slower, or staying the same. We will use data from 100 participants in the 2012 Cherry Blossom Run. These 100 runners had a mean finishing time of 95.61 minutes and a standard deviation of 15.78 minutes. Use simNHST.Rmd to fill in your analysis. The correctly implemented graphs are available on Lyceum for you to compare to your own work.
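As a cross-check on your work, the same question can be approached analytically from the summary statistics alone. The sketch below uses a one-sample t statistic. This is a different technique from the simulation in simNHST.Rmd, and treating the 2006 average as a fixed null value is an assumption.

```r
mu0  <- 93.29      # 2006 average finishing time (null value), in minutes
xbar <- 95.61      # 2012 sample mean, in minutes
s    <- 15.78      # 2012 sample standard deviation, in minutes
n    <- 100        # 2012 sample size

se     <- s / sqrt(n)          # standard error of the mean
t_stat <- (xbar - mu0) / se    # distance from null in standard errors

# Two-sided p value, since times could be faster or slower
p_two_sided <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

c(se = se, t = t_stat, p = p_two_sided)
```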
At lunch, your friend asserts the following hypothesis: “International students are less likely to study abroad because they already are abroad.” You are not sure if they are correct, but wonder if the student demographic data we collected can help. You reason that students who live farther from Bates are more likely to be international students, so if your friend’s hypothesis is correct, students in our class who have not studied abroad will have a larger distance home when compared with students who have.
Run the powerHomework.R script on Lyceum (note that this file requires you to install the pwr library). Observe the resulting graph and comment on what you see. Examine the results (held in samsize) and compare them to the Cohen’s d values (held in d). What sample size is required for d = 0.5 and a power of 0.8?
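If you would like to sanity-check what you read off the script's output, base R's `power.t.test()` does a similar calculation without the pwr package. This sketch assumes a two-sample, two-sided t-test at \(\alpha = 0.05\), which may or may not match the design used in powerHomework.R.

```r
# Setting sd = 1 makes delta the standardized effect size (Cohen's d)
res <- power.t.test(delta = 0.5, sd = 1, power = 0.8, sig.level = 0.05,
                    type = "two.sample", alternative = "two.sided")

res$n                          # required n per group (not a whole number)
n_per_group <- ceiling(res$n)  # round up to whole participants
n_per_group
```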
Using the studentDemographics.csv data, use the cor.test() function to test for correlations between all pairs of interval and ratio variables in the set. Report all results in standard form (i.e. t(df)=tStat, p=pVal, CI=[lower upper]) along with your interpretation of the meaning of each test.
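Pulling the pieces of a `cor.test()` result into the standard reporting form might look like the sketch below. The data are simulated, and `x` and `y` are hypothetical stand-ins for a pair of columns from studentDemographics.csv.

```r
set.seed(11)                   # arbitrary seed for repeatability
x <- rnorm(30)                 # hypothetical variable 1
y <- 0.5 * x + rnorm(30)       # hypothetical variable 2, related to x

ct <- cor.test(x, y)           # Pearson correlation test by default

# Assemble r, t(df), p, and the 95% confidence interval into one string
report <- sprintf("r=%.2f, t(%d)=%.2f, p=%.3f, CI=[%.2f %.2f]",
                  ct$estimate, ct$parameter, ct$statistic, ct$p.value,
                  ct$conf.int[1], ct$conf.int[2])
report
```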
Part 1: If you did not complete your analysis of the vitamin D and cognitive function datasets in class, finish them up here. For each, provide a scatterplot of the variables, calculate the mean and sd of each variable, and the Pearson correlation coefficient between them. Using lm(), fit a linear model, and plot the fit on your scatterplot. Calculate the coefficient of determination (\(R^{2}\)) by hand, and verify by getting a summary of your lm() object. Last, write a 1-2 sentence interpretation of your results, considering your regression coefficient and the \(R^{2}\).
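Here is a minimal sketch of the "by hand" \(R^{2}\) check on simulated data. The vectors `x` and `y` are made up and stand in for the vitamin D and cognitive function variables, whose real names are not given here.

```r
set.seed(3)                            # arbitrary seed for repeatability
x <- runif(40, min = 10, max = 50)     # hypothetical predictor
y <- 2 + 0.1 * x + rnorm(40)           # hypothetical outcome

fit <- lm(y ~ x)                       # simple linear regression

# R^2 by hand: 1 - (residual sum of squares / total sum of squares)
ss_res  <- sum(residuals(fit)^2)
ss_tot  <- sum((y - mean(y))^2)
r2_hand <- 1 - ss_res / ss_tot

r2_hand
summary(fit)$r.squared                 # should agree with r2_hand
cor(x, y)^2                            # also equal, with one predictor

plot(x, y)                             # scatterplot of the data
abline(fit)                            # fitted regression line
```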
Part 2: Find a news article from the last six months that covers a scientific study of an association between two variables. Write a paragraph about how the result was presented to the general public, given your knowledge of correlation and regression. Did the article imply a causal link between variables? What (if any) limitations did it acknowledge in the study? Were these correct? Were there additional limitations that should have been considered?
Using the final Vitamin D data presented in class, analyze whether the assumptions of regression are met:
Include any code that you used to make your determination, along with your analysis.
Using the studentDemographics.csv data, use glm() to predict students’ TV time from any two predictors of your choice. Provide a 3-6 sentence interpretation of your findings, including the statistical significance of your predictors, the interpretation of each coefficient (including intercept), and the overall fit of your model. Please include your code as well as your paragraph.
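The shape of such a model can be sketched with simulated data. The column names `tvHours`, `sleepHours`, and `coffeeCups` are hypothetical and may not match studentDemographics.csv, and the Gaussian family is an assumption about what suits the outcome.

```r
set.seed(5)                    # arbitrary seed for repeatability
n <- 60

# Simulated stand-in data; real column names may differ
d <- data.frame(sleepHours = rnorm(n, mean = 7, sd = 1),
                coffeeCups = rpois(n, lambda = 2))
d$tvHours <- 1 + 0.3 * d$coffeeCups - 0.1 * d$sleepHours + rnorm(n, sd = 0.5)

# Gaussian glm with two predictors (same fit as ordinary least squares)
fit <- glm(tvHours ~ sleepHours + coffeeCups, data = d, family = gaussian())

coef_table <- summary(fit)$coefficients   # estimates, SEs, t values, p values
coef_table

# Rough measure of overall fit: share of deviance explained
1 - fit$deviance / fit$null.deviance
```

Each row of `coef_table` corresponds to one term, with the intercept first.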
Using ADHD.csv, do the following:
Complete the statistical reasoning post-test. This is non-evaluative, meaning that if you turn it in, you will get a 2. Please do your best, though. I will show the growth in the class on Wednesday and it’s more interesting when everyone takes this seriously. :)
None. Use this time to work on your final exam.