September

September 4

Part 1: Install R and RStudio on your computer

Step 1: Install R

For Mac:
1. Download the appropriate version of the .pkg file from the following link.
2. Open the downloaded .pkg file and follow the instructions to install R.

For Linux:
1. For complete R installation in Linux, follow the instructions here.
2. For Linux distributions with apt-get installed (Debian, Ubuntu, etc.) execute the following command in your terminal:

sudo apt-get install r-base

For Windows:
1. Download the binary setup file for R from this link.
2. Open the downloaded .exe file and follow the instructions to install R.

Step 2: Install RStudio

Go to the following link and choose the appropriate installer file for your operating system. Download it, and then run it to install RStudio.

Part 2: Complete statistical reasoning pre-test

The purpose of this assignment is to take a snapshot of what you know right now. You may not know everything, and that’s okay! Your grade is only determined by whether or not you complete this assignment.
The pre-test can be found here.

September 9

Part 1: Download sleepData.csv from Lyceum. This file represents the last ~2.5 years of sleep data from yours truly. I want you to practice reading in data and getting insights from it. Please follow each of the following steps, and place the code that you used to do this in a text file.

Read in sleepData.csv in RStudio.
Using $ notation as well as built-in R functions that you learned in class, find the minimum and maximum sleep times over the recording period.
For both minimum and maximum times, convert from minutes to hours in your report.
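
The steps above can be sketched as follows. This is only an illustration: the data frame is a made-up stand-in so the example runs without the file, the values are invented, and the column name sleepMinutes is assumed here to match the real file.

```r
# Hypothetical stand-in for the real file; with the actual data you would use
# sleepData <- read.csv("sleepData.csv")
sleepData <- data.frame(sleepMinutes = c(412, 388, 455, 301, 520, 467))

# $ notation pulls one column out of the data frame as a vector
minMinutes <- min(sleepData$sleepMinutes)
maxMinutes <- max(sleepData$sleepMinutes)

# Convert minutes to hours
minHours <- minMinutes / 60
maxHours <- maxMinutes / 60

print(c(min = minHours, max = maxHours))
```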

Part 2: Based on the ST reading on misleading graphs, I want you to find your own misleading data visualization. The visualization should represent quantitative data, and may be taken from a news source, an advertisement, etc. Once you have identified your misleading graph, save the image (a screenshot is fine), and write a 3-5 sentence paragraph underneath it describing what is misleading about it. Please append this to your Part 1 homework, save your file as a single PDF, and upload it to Lyceum.

September 11

Take the same graph/visualization that you submitted for last class, and use R to re-create it in a less misleading way. You may not have access to the raw data that created the graph. If this is the case, feel free to guess the datapoints as closely as you can. The key to the assignment is that your reader will get a realistic sense of the data, not that the graphing aesthetics are identical, so don’t worry if you are not sure how to manipulate color, transparency, or other “bells and whistles”. Create your writeup in a .Rmd file that contains both code and graph, and submit this on Lyceum.

September 16

Part 1:
Describe a typical student in our class. Use the appropriate measures of central tendency and variability to describe both the most common traits in the studentDemographics.csv file and how much they vary. You should write for a non-technical audience, as in a newspaper article.

Part 2:
(a) Give three examples (other than ones we discussed in class) of data distributions that follow a normal distribution.
(b) Give three examples (other than ones we discussed in class) of data distributions that follow a long-tailed distribution.

September 18

Let’s say that you are a scout for Major League Baseball. You are watching a group of 100 players who have each had 38 plate appearances in the last month. Let us also say that this is a really good group of players whose lifetime batting averages are all 0.333. This means that they will hit the ball into play one out of every three times they have a plate appearance. Although this sounds low, it’s [actually quite good](https://en.wikipedia.org/wiki/Batting_average_(baseball)).

Use rbinom() to simulate this situation.
With just these data, how many players seem to be “legendary” (i.e. have a batting average >=0.400)? (Hint: use which() and length() to have R do this for you automatically).
Let’s say that any player with a batting average below 0.200 will not be considered. How many players will not be considered?
What are the dangers in using just this small number of observations to make your conclusions?
Submit a .Rmd file with your code and answers to Lyceum.
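
One possible shape for this simulation is sketched below. The variable names and the seed are arbitrary choices made for this illustration, and your own counts will differ from run to run unless you also fix a seed.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

nPlayers <- 100       # players being scouted
nAppearances <- 38    # plate appearances per player
trueAverage <- 0.333  # every player's lifetime batting average

# One draw per player: number of hits in 38 plate appearances
hits <- rbinom(nPlayers, size = nAppearances, prob = trueAverage)
observedAverage <- hits / nAppearances

# which() returns the indices that meet the condition; length() counts them
nLegendary <- length(which(observedAverage >= 0.400))
nCut <- length(which(observedAverage < 0.200))

print(c(legendary = nLegendary, cut = nCut))
```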

September 23

Consider the racial profiling example that we started in class:

In a certain week in 1997, the police at a certain location in Philadelphia made 262 car stops. Of these, 207 drivers were African American. Among the whole population of Philadelphia, 42.2% were African American in 1997. Does this prove the police were guilty of racial profiling, i.e. deliberately stopping drivers because they were African Americans?

Using the binomial distribution, state whether or not you have evidence that the Philadelphia police were profiling African American drivers. Then, compute the complementary analysis: is there any evidence that police were pulling over too few non-Black drivers? Why or why not? Last, consider the assumptions of using the binomial distribution here. Is this a good model for this problem? Why or why not? Submit a single-page PDF with your analysis to Lyceum.
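
A minimal sketch of the tail-probability calculation, assuming each stop is treated as an independent draw from the city-wide population (the very assumption you are asked to critique). The variable names are chosen for this illustration.

```r
nStops <- 262          # car stops in the week
nBlackDrivers <- 207   # stops involving African American drivers
popProportion <- 0.422 # share of the population in 1997

# P(X >= 207) = 1 - P(X <= 206) under Binomial(262, 0.422)
pUpper <- pbinom(nBlackDrivers - 1, size = nStops, prob = popProportion,
                 lower.tail = FALSE)

# Complementary view: P(Y <= 55) for all other drivers under Binomial(262, 0.578)
nOther <- nStops - nBlackDrivers
pLower <- pbinom(nOther, size = nStops, prob = 1 - popProportion)

print(c(pUpper = pUpper, pLower = pLower))
```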

September 25

Download the sleepData.csv file, and load it into RStudio using read.csv(). With this data:

Create a variable called sleepHours that converts the original $sleepMinutes field to hours.
Create a histogram of sleepHours. Is this data set normally distributed?
The “95%” rule says that 95% of observations should fall within 2 standard deviations of the mean. Compute the mean and standard deviation of sleepHours, and compute what values constitute the mean - 2sd and the mean + 2sd respectively. Use length() and which() to determine the proportion of observations inside of this range. Does the 95% rule hold up for this set?
Submit a single PDF of your findings. This should be knitted from a .Rmd file.
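
The range check described above might look like the following sketch. The sleep values here are simulated stand-ins (drawn from a normal distribution, which the real data may not follow), so the proportion printed says nothing about the actual file. A call to hist(sleepHours) would give the histogram.

```r
set.seed(2)  # arbitrary seed for reproducibility

# Invented stand-in for sleepData$sleepMinutes
sleepMinutes <- rnorm(900, mean = 450, sd = 60)
sleepHours <- sleepMinutes / 60

m <- mean(sleepHours)
s <- sd(sleepHours)
lower <- m - 2 * s
upper <- m + 2 * s

# Indices of observations inside the interval, then the share they make up
inside <- which(sleepHours >= lower & sleepHours <= upper)
propInside <- length(inside) / length(sleepHours)

print(c(lower = lower, upper = upper, proportion = propInside))
```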

September 30

You wish to sample Bates College students for their opinions about the 2020 Presidential election. Devise a method for creating a random sample of 100 students for your survey. Please comment on any possible downfalls of your sampling method. I will not count words, but a sufficiently detailed answer will likely be 100-400 words. Please submit your responses to Lyceum in whatever method works best for you.

October

October 2

Using the cltStarter.Rmd, fill in the missing code to run the simulation. Here, you will use a range of sample sizes from 2 to 100 to show how the mean of the sampling distribution approaches $\mu$, and that the standard deviation of the sampling distribution approaches the standard error of the mean ($\sigma/ \sqrt{n}$). Using your data, please comment on the common statistical adage that the central limit theorem tells us that sample sizes should contain at least 30 items.

Important note: Because the code is incomplete, you will receive a knitting error until the code runs properly. Use the green comments as instructions that point you to the lines that need editing from you (there will be 4 of them). Knit your document to PDF or Word, and upload to Lyceum.
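
For orientation, here is a self-contained sketch of the kind of simulation described above. It is not the contents of cltStarter.Rmd: the exponential population, the number of simulations, and the handful of sample sizes are all assumptions made for this illustration.

```r
set.seed(3)  # arbitrary seed for reproducibility

rate <- 1
mu <- 1 / rate     # mean of an exponential population
sigma <- 1 / rate  # standard deviation of an exponential population
nSims <- 5000      # simulated samples per sample size
sampleSizes <- c(2, 5, 10, 30, 100)

results <- data.frame(n = sampleSizes, meanOfMeans = NA, sdOfMeans = NA,
                      sem = sigma / sqrt(sampleSizes))

for (i in seq_along(sampleSizes)) {
  n <- sampleSizes[i]
  # Each replicate draws n values and keeps the sample mean
  sampleMeans <- replicate(nSims, mean(rexp(n, rate = rate)))
  results$meanOfMeans[i] <- mean(sampleMeans)
  results$sdOfMeans[i] <- sd(sampleMeans)
}

print(results)
```

Comparing the sdOfMeans and sem columns row by row shows how closely the simulated spread tracks $\sigma/\sqrt{n}$ at each sample size.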

October 7

In class, you saw that there is a systematic bias in sample standard deviations that cannot be easily corrected like variance can. It turns out that correcting this bias is actually very hard, and depends on the distribution of the data in ways that go beyond the scope of this course. However, it has been suggested that using $\sqrt{\frac{1}{n-1.5}\sum_{i=1}^n(x_i-\bar{x})^2}$ will correct the bias in most cases.

Starting with biasHomework.Rmd, fill in the relevant lines to compute sample means, uncorrected sample variance, corrected sample variance, uncorrected sample standard deviation (using sd() is fine here), and corrected sample standard deviation using the equation above. In 1-2 sentences, comment on whether this new correction seems to work.
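
A rough sketch of how the two standard deviation estimates could be compared by simulation. This is not biasHomework.Rmd: the normal population, the sample size of 5, and the number of simulations are assumptions made for this illustration.

```r
set.seed(4)  # arbitrary seed for reproducibility

sigma <- 1     # true population standard deviation (assumed)
n <- 5         # small samples make the bias easier to see
nSims <- 20000

sdUsual <- numeric(nSims)
sdAlt <- numeric(nSims)

for (i in seq_len(nSims)) {
  x <- rnorm(n, mean = 0, sd = sigma)
  ss <- sum((x - mean(x))^2)        # sum of squared deviations
  sdUsual[i] <- sqrt(ss / (n - 1))  # same value that sd(x) returns
  sdAlt[i] <- sqrt(ss / (n - 1.5))  # n - 1.5 in the denominator
}

# Average of each estimator across simulations, to compare against sigma
print(c(usual = mean(sdUsual), alt = mean(sdAlt), truth = sigma))
```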

October 9

In your own .Rmd file, read in the studentDemographics.csv file available on Lyceum. Recall that these data came from the student survey at the beginning of the semester. Using these data, compute 95% confidence intervals for the following variables:

television time
number of siblings
coffee consumed
distance home
hand length (being sure to omit the 46 cm outlier)

Knit your markdown file to a format of your choice, and upload to Lyceum.
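
As a reminder of the mechanics, the sketch below builds a t-based 95% confidence interval for a single vector. The values are invented and do not come from the survey, and the class may have used a different formula (for example, one based on the normal distribution).

```r
# Hypothetical values standing in for one survey column
x <- c(2, 0, 5, 1, 3, 4, 2, 6, 1, 2)

n <- length(x)
m <- mean(x)
sem <- sd(x) / sqrt(n)  # standard error of the mean

# Critical t value leaving 2.5% in each tail, with n - 1 degrees of freedom
tCrit <- qt(0.975, df = n - 1)
ci <- c(lower = m - tCrit * sem, upper = m + tCrit * sem)

print(ci)
```

The same interval is reported by t.test(x)$conf.int, which is a convenient way to check a hand calculation.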

October 14

Using confidenceIntervals.Rmd that is available on Lyceum, fill in the existing code to simulate 10,000 experiments. In each experiment, take a sample from a known distribution, and calculate a 95% confidence interval for your sample. Fill in the code that keeps track of “if” each interval contains the population mean. Follow the remaining directions in order to check the behavior for 90% confidence intervals, and also how sample size affects the calculations. Note: please do not change curly brackets - they are there for a reason!
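
The logic of the simulation can be sketched independently of confidenceIntervals.Rmd. The population parameters, the sample size, and the use of a t-based interval below are assumptions made for this illustration.

```r
set.seed(5)  # arbitrary seed for reproducibility

mu <- 50     # known population mean (assumed)
sigma <- 10  # known population standard deviation (assumed)
n <- 25      # observations per experiment
nExperiments <- 10000
containsMu <- logical(nExperiments)

for (i in seq_len(nExperiments)) {
  x <- rnorm(n, mean = mu, sd = sigma)
  halfWidth <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
  # TRUE when the interval covers the population mean
  containsMu[i] <- (mean(x) - halfWidth <= mu) && (mu <= mean(x) + halfWidth)
}

# Share of intervals that captured the population mean
coverage <- mean(containsMu)
print(coverage)
```

Swapping 0.975 for 0.95 would give 90% intervals, and changing n shows how sample size affects the interval width.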

October 21

Using monteCarlo.Rmd, fill in the existing code to work through bootstrapping, using both the “zap” data and student demographic data for coffee consumption. The last problem features a permutation test. Please submit your knitted file (PDF, .docx, or html only) - .Rmd files will no longer be accepted.
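
For reference, a bare-bones bootstrap of a sample mean might look like the sketch below. The coffee values are invented, and the percentile interval is one common choice that may differ from what monteCarlo.Rmd asks for.

```r
set.seed(6)  # arbitrary seed for reproducibility

# Hypothetical cups-of-coffee values; not the class data
coffee <- c(0, 1, 1, 2, 0, 3, 2, 1, 4, 0, 2, 1)

nBoot <- 5000
bootMeans <- numeric(nBoot)
for (i in seq_len(nBoot)) {
  # Resample with replacement, same size as the original sample
  resample <- sample(coffee, size = length(coffee), replace = TRUE)
  bootMeans[i] <- mean(resample)
}

# Percentile interval: the middle 95% of the bootstrap means
bootCI <- quantile(bootMeans, probs = c(0.025, 0.975))
print(bootCI)
```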

October 23

Part 1: If you have not completed your permutation code from last time, please complete it now. If you are done, feel free to continue on to Part 2.
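
If it helps to see the overall structure of a permutation test, here is a small sketch. The two groups contain invented values, and the two-sided comparison is an assumption; your own code should follow the setup in monteCarlo.Rmd.

```r
set.seed(7)  # arbitrary seed for reproducibility

# Hypothetical cups per day for two class years; not the class data
sophomores <- c(1, 0, 2, 1, 3, 0, 1, 2)
seniors <- c(2, 3, 1, 4, 2, 0, 3, 2)

observedDiff <- mean(seniors) - mean(sophomores)
pooled <- c(sophomores, seniors)
nSoph <- length(sophomores)

nPerm <- 5000
permDiffs <- numeric(nPerm)
for (i in seq_len(nPerm)) {
  shuffled <- sample(pooled)  # shuffling breaks any link between value and group
  permDiffs[i] <- mean(shuffled[-(1:nSoph)]) - mean(shuffled[1:nSoph])
}

# Two-sided p value: share of shuffles at least as extreme as the observed difference
pValue <- mean(abs(permDiffs) >= abs(observedDiff))
print(c(observedDiff = observedDiff, p = pValue))
```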

Part 2: Your permutation test showed no statistically significant difference in the coffee consumption of sophomores and seniors. Imagine that your grade depended on getting a significant result. How would you “p-hack” this data set to get your desired result? Note that your answer should avoid outright fraud. You may choose to exclude datapoints if you provide a (non-p-hacking) rationale for your decision. Implement your idea: did it give you a p value below 0.05?

October 28

The average time for all runners who finished the Cherry Blossom Run in 2006 was 93.29 minutes. We want to determine whether finishing times are getting faster, slower, or staying the same. We will use data from 100 participants in the 2012 Cherry Blossom Run. These 100 runners had a mean finishing time of 95.61 minutes and a standard deviation of 15.78 minutes. Use simNHST.Rmd to fill in your analysis. The correctly implemented graphs are available on Lyceum for you to compare to your own work.
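
As an analytic cross-check on a simulation, the same summary statistics can be run through a one-sample t statistic. This sketch is not what simNHST.Rmd contains; it assumes a two-sided test against the 2006 mean.

```r
mu0 <- 93.29   # 2006 average finishing time (minutes)
xbar <- 95.61  # 2012 sample mean
s <- 15.78     # 2012 sample standard deviation
n <- 100       # 2012 sample size

sem <- s / sqrt(n)           # standard error of the mean
tStat <- (xbar - mu0) / sem  # distance from the 2006 mean in standard errors
pValue <- 2 * pt(-abs(tStat), df = n - 1)  # two-sided p value

print(c(t = tStat, df = n - 1, p = pValue))
```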

October 30

At lunch, your friend asserts the following hypothesis: “International students are less likely to study abroad because they already are abroad.” You are not sure if they are correct, but wonder if the student demographic data we collected can help. You reason that students who live farther from Bates are more likely to be international students, so if your friend’s hypothesis is correct, students in our class who have not studied abroad will have a larger distance home when compared with students who have.

What is the null hypothesis for this test?
Using studentDemographics.csv, partition out the students who have studied abroad from those who have not. (Hint: use the same skills that you gained in the coffee problems to help).
Consider the distance home for both groups. Use the appropriate type of t-test to test whether the groups have the same mean distance home.
Report your statistics using the standards learned in class and from the Navarro text. Please submit all code that you used as well.
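
The partition-then-test pattern is sketched below on a tiny invented data frame. The column names studiedAbroad and distanceHome are hypothetical and may not match studentDemographics.csv, and the Welch test with a one-sided alternative is only one possible choice that you would need to justify.

```r
# Hypothetical data frame; the column names and values are invented
students <- data.frame(
  studiedAbroad = c("yes", "no", "no", "yes", "no", "yes", "no", "no", "yes", "no"),
  distanceHome = c(120, 3400, 250, 90, 5600, 300, 180, 4100, 60, 220)
)

# Partition the distances by group
abroad <- students$distanceHome[students$studiedAbroad == "yes"]
notAbroad <- students$distanceHome[students$studiedAbroad == "no"]

# Welch's t-test (R's default) does not assume equal variances.
# alternative = "greater" asks whether the first group's mean is larger.
result <- t.test(notAbroad, abroad, alternative = "greater")
print(result)
```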

November

November 4

Run the powerHomework.R script on Lyceum (note that this file requires you to install the pwr package). Observe the resulting graph and comment on what you see. Examine the results (held in samsize) and compare them to the Cohen’s d values (held in d). What sample size is required for a d=0.5 and power of 0.8?
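
If you want to cross-check a single value without the pwr package, base R's power.t.test() performs a similar calculation. This sketch swaps in that function and assumes a two-sample, two-sided t-test at a significance level of 0.05, which may not match the settings in powerHomework.R.

```r
# Base R analogue of a two-sample power calculation.
# With sd = 1, delta plays the role of Cohen's d.
result <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8,
                       type = "two.sample", alternative = "two.sided")

# Round up, since participants come in whole numbers
nPerGroup <- ceiling(result$n)
print(nPerGroup)
```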

November 6

Using the studentDemographics.csv data, use the cor.test() function to test for correlations between all pairs of interval and ratio variables in the set. Report all results in standard form (i.e. t(df)=tStat, p=pVal, CI=[lower upper]) along with your interpretation of the meaning of each test.
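
The pieces of the standard report can be pulled straight out of the object that cor.test() returns, as in this sketch. The two vectors are invented and are not survey variables.

```r
# Hypothetical paired measurements; not the class data
x <- c(1.0, 2.5, 3.1, 4.8, 5.2, 6.9, 7.4, 8.8)
y <- c(2.1, 2.9, 4.2, 4.0, 6.1, 6.5, 8.3, 8.0)

res <- cor.test(x, y)  # Pearson correlation by default

# Assemble the pieces of the report from the test object
report <- sprintf("r=%.2f, t(%d)=%.2f, p=%.4f, CI=[%.2f %.2f]",
                  res$estimate, as.integer(res$parameter), res$statistic,
                  res$p.value, res$conf.int[1], res$conf.int[2])
print(report)
```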

November 11

Part 1: If you did not complete your analysis of the vitamin D and cognitive function datasets in class, finish them up here. For each, provide a scatterplot of the variables, calculate the mean and sd of each variable, and the Pearson correlation coefficient between them. Using lm(), fit a linear model, and plot the fit on your scatterplot. Calculate the coefficient of determination ($R^{2}$) by hand, and verify by getting a summary of your lm() object. Last, write a 1-2 sentence interpretation of your results, considering your regression coefficient and the $R^{2}$.
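
The by-hand calculation of $R^{2}$ and its verification can be sketched as follows, using invented data rather than either class dataset. Calling abline(fit) after plot(x, y) would draw the fitted line on a scatterplot.

```r
# Hypothetical data; not the vitamin D or cognitive function sets
x <- c(10, 14, 19, 23, 28, 31, 36, 40)
y <- c(52, 55, 61, 60, 68, 71, 70, 78)

fit <- lm(y ~ x)

# R^2 by hand: 1 - (residual sum of squares / total sum of squares)
ssResidual <- sum(residuals(fit)^2)
ssTotal <- sum((y - mean(y))^2)
r2ByHand <- 1 - ssResidual / ssTotal

# The same quantity as reported by summary(), and as the squared Pearson r
r2FromSummary <- summary(fit)$r.squared
print(c(byHand = r2ByHand, fromSummary = r2FromSummary, rSquared = cor(x, y)^2))
```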

Part 2: Find a news article from the last six months that covers a scientific study of an association between two variables. Write a paragraph about how the result was presented to the general public, given your knowledge of correlation and regression. Did the article imply a causal link between variables? What (if any) limitations did it acknowledge in the study? Were these correct? Were there additional limitations that should have been considered?

November 13

Using the final Vitamin D data presented in class, analyze whether the assumptions of regression are met:

Are the two predictors reasonably uncorrelated?
Is the relationship we are trying to predict linear?
Are the residuals of the model normally distributed and homoscedastic?

Include any code that you used to make your determination, along with your analysis.
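
A few numerical checks that correspond to the questions above are sketched here on simulated data. The variables, the Shapiro-Wilk test, and the rough spread check are choices made for this illustration, not the methods prescribed in class. Diagnostic plots such as plot(fit, which = 1) are usually the better tool for judging linearity and homoscedasticity.

```r
set.seed(8)  # arbitrary seed for reproducibility

# Simulated stand-in data with two predictors; not the class data
n <- 80
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - 1 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
res <- residuals(fit)

# Predictors: a correlation near 0 suggests little collinearity
predictorCor <- cor(x1, x2)

# Normality of residuals: a large p value means no evidence against normality
normality <- shapiro.test(res)

# Rough homoscedasticity check: does residual size change with the fitted values?
spreadCor <- cor(abs(res), fitted(fit))

print(c(predictorCor = predictorCor, shapiroP = normality$p.value,
        spreadCor = spreadCor))
```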

November 18

Using the studentDemographics.csv data, use glm() to predict students’ TV time from any two predictors of your choice. Provide a 3-6 sentence interpretation of your findings, including the statistical significance of your predictors, the interpretation of each coefficient (including intercept), and the overall fit of your model. Please include your code as well as your paragraph.
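
The call pattern for glm() with two predictors is sketched below on simulated data. The column names (tvHours, siblings, coffee) are invented for this illustration and may not match studentDemographics.csv.

```r
set.seed(9)  # arbitrary seed for reproducibility

# Simulated stand-in data; the column names are hypothetical
n <- 60
students <- data.frame(siblings = rpois(n, 1.5), coffee = rpois(n, 2))
students$tvHours <- 3 + 0.5 * students$siblings - 0.2 * students$coffee + rnorm(n)

# With the default gaussian family, glm() fits an ordinary linear regression
fit <- glm(tvHours ~ siblings + coffee, data = students)

# Coefficient table: estimate, standard error, t value, p value
coefTable <- summary(fit)$coefficients
print(coefTable)
```

Each row of the table corresponds to one term in the model, with the intercept first.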

November 20

Using ADHD.csv, do the following:

Create a chart of RT scores as a function of ADHD diagnosis and letter spacing.
Use aov() to do a 3x2 factorial ANOVA to determine the effects of ADHD diagnosis and letter spacing on task performance. Make sure to include an interaction term in your model.
Report results of all tests using nomenclature from class and textbook.
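
The model formula for a factorial ANOVA with an interaction is sketched below on simulated data. The column names, the factor levels, the assumption that letter spacing is the three-level factor, and the between-subjects layout are all invented for this illustration and may not match ADHD.csv.

```r
set.seed(10)  # arbitrary seed for reproducibility

# Simulated stand-in for ADHD.csv; names and levels are hypothetical
design <- expand.grid(spacing = c("narrow", "normal", "wide"),
                      diagnosis = c("ADHD", "control"),
                      subject = 1:10)
design$RT <- rnorm(nrow(design), mean = 600, sd = 50)
design$spacing <- factor(design$spacing)
design$diagnosis <- factor(design$diagnosis)

# spacing * diagnosis expands to both main effects plus their interaction
fit <- aov(RT ~ spacing * diagnosis, data = design)

# ANOVA table with one row per effect and a final row for the residuals
anovaTable <- summary(fit)[[1]]
print(anovaTable)
```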

December

December 2

Complete the statistical reasoning post-test. This is non-evaluative, meaning that if you turn it in, you will get a 2. Please do your best, though. I will show the growth in the class on Wednesday and it’s more interesting when everyone takes this seriously. :)

December 4

None. Use this time to work on your final exam.

Statistical Methods: Homework Assignments

Michelle R. Greene

Fall 2019