MATH 4530-100 (4216), Spring 2016

Statistical Computing

Catalog Description:
Introduction to computational statistics; Monte Carlo methods, bootstrap, data partitioning methods, EM algorithm, probability density estimation, Markov Chain Monte Carlo methods.
Desired Learning Outcomes:
Students will be able to:
  • Generate distributions by various methods.
  • Use computer-intensive methods for estimation and hypotheses testing.
  • Conduct data analysis using one or more major statistical models.
    Requisites:
    MATH 4500 Theory of Statistics
    Instructor:
    Martin J. Mohlenkamp, mohlenka@ohio.edu, (740)593-1259, 315-B Morton Hall.
    Office hours: Monday, Wednesday, and Friday 10:45-11:40am, or by appointment.
    Web page:
    http://www.ohiouniversityfaculty.com/mohlenka/20162/4530-5530.
    Class hours/ location:
    Monday, Wednesday, and Friday 9:40-10:35am in 314 Morton Hall.
    Text:
    None. We will scavenge materials from the internet.
    Computational Resources:
    We will do our computations in the Sagemath Cloud using the language R.
    Tests:
    None.
    Journal:
    Each week you will submit a journal documenting that you have performed the requested tasks and learned. Typical components are:
    Writing quality counts:
    We will use the Good Problems method of gradually increasing the criteria, and its writing guides on Layout, Logic, Flow, Intros, Symbols, and Graphs.
    Extensions:
    Journals are due at specified times and will be graded soon thereafter. If you email me before I have graded it, you can have a 24-hour extension, with a 5% penalty. You can get further extensions with a penalty of 5% per 24 hours.
    Partners and Ratings:
    You will work with a partner (assuming an even number of students) and submit a single journal. You will rate the relative contribution of your partner and these ratings will be used to adjust the journal scores at the end of the term, using a statistical analysis.
    Final Project:
    You will individually do a final project to produce materials that could have been used for one day's topic/lesson in this class, but were not used here or in the 2014 version of this class. You will produce: (If you have an idea for a different final project that you would prefer, you can propose it.)
    Final Exam:
    The final exam is scheduled for Friday April 29 1-3 pm. Your final project will substitute for this exam, and be due at the scheduled ending time of the exam.
    Attendance:
    This is a "lab" class, so your attendance, participation, and collaboration are essential. You are allowed 4 absences (out of 41 classes) without penalty; these include university excused absences for illness, death in the immediate family, religious observance, jury duty, or involvement in University-sponsored activities. Each additional absence will reduce your final average by 0.5%. Your attendance record will be available in Blackboard.
    Grade:
    Your grade is based on journals 80% and final project 20%. Your journal average is adjusted by the ratings you receive from your partners and your overall average may be penalized due to excessive absences. An average of 90% guarantees you at least an A-, 80% a B-, 70% a C-, and 60% a D-.
    Academic Misconduct:
    Your work must be done by you, not by someone else for you. Your words must be your own; any text not your own must be properly quoted and cited. You can ask others questions, look in books, use resources from the internet, and generally use whatever help you can find; such help must be acknowledged by naming the person, citing the book, giving the internet link, etc. A minor, first-time violation of this policy will receive a warning and discussion and clarification of the rules. Serious or second violations will result in a grade penalty on the assignment. Very serious or repeated violations will result in failure in the class and be reported to the Office of Community Standards and Student Responsibility, which may impose additional sanctions. You may appeal any sanctions through the grade appeal process.
    Special Needs:
    If you have specific physical, psychiatric, or learning disabilities and require accommodations, please let me know as soon as possible so that your learning needs may be appropriately met. You should also register with Student Accessibility Services to obtain written documentation and to learn about the resources they have available.
    Game:
    We will attempt the competitive mathematical game False Alarms in a Sensor Network. If our results are good, we will enter the competition; the deadline is April 30th, 2016. (In the 2014 version of the game, some of us entered and did well.)
    Learning Resources:

    MATH 5530-100 (4228), Spring 2016

    For students enrolled in MATH 5530, the above syllabus is modified as follows:

    Requisites:
    MATH 5500 Theory of Statistics
    Explorations:
    Most weeks you will have a small, additional, individual, open-ended task. These tasks will help prepare you for your final project.
    Final project:
    You will individually do a final project to validate (or invalidate) a recent published work in statistical computing. You will produce: (If you have an idea for a different final project that you would prefer, you can propose it.)
    Grade:
    Your grade is based on journals 70%, explorations 10%, and project 20%.

    Schedule

    Subject to change. Some tasks will be filled in as we go along.

    Week Date Topic/Materials/Tasks
    1
    Mon Jan 11
    • Introduction, syllabus, etc.
    • Get set up on the Sagemath Cloud:
      • Use Firefox (or Chrome), not Internet Explorer.
      • Sign up for a free account using your real name and your University email address (@ohio.edu). Sign in.
      • Click on "Help" in the upper left and read about it.
      • Look for a project in your account titled with your name. I created this and shared it with you. Your submissions as an individual, such as your biography, go here.
      • Look for a project "StatisticalComputing". I will put things here for the whole class to use.
    • Do your autobiography
      • Familiarize yourself with the markdown language documentation and extensions.
      • Familiarize yourself with the html language documentation.
      • Familiarize yourself with the \(\LaTeX\) Wikibook.
      • Upload the autobiography file to your project. Edit it to make your autobiography.
    Wed Jan 13
    • Find your partner for this week's tasks and journal (see StatisticalComputing/partners.sagews), and sit next to them.
      • One of you create a new project using the naming convention "Week Firstname and Firstname". For example, if your name is Hillary and your partner is Donald, then name it "1 Hillary and Donald".
      • Hit the "Settings" button, look under "Collaborators", search for the other person by email, and add them as a collaborator. Search for me as mohlenka@ohio.edu and add me as a collaborator.
      • Hit the "New" button, pick a name (such as "1Journal"), and hit the "SageMath Worksheet" button. This will create the file for your journal this week. You can both edit it simultaneously. (I can access this file to grade it so you do not have to send it to me.)
      • Use markdown (or html) to put in your names, a title, the course and the week.
    • Introduction to R:
      • Read What is R?
      • Skim the FAQs
      • Notice the manuals and familiarize yourself with An Introduction to R. You will refer to these manuals a lot.
      • Become familiar with the all-important help() and help.search() functions. Try help(Syntax), help(Arithmetic), help(Comparison), help(Extract), and help(Control).
      • In your journal, write a list of 10 interesting things you learned about R and link to where you learned them.
      • Upload the sage worksheet basics.sagews. Run each cell of code and observe/guess what it is doing.
      • In your journal, run 10 different R commands in 10 different cells. Briefly explain what each one does.
    Thu Jan 14 8am autobiography (counts as a journal) due.
    Fri Jan 15
    • Plotting warm up:
    • In your journal:
      • Plot the data x<-c(1,2,4,5,9) and y<-c(0,-1,3,3,1) using each of the 9 options for type. Use layout() to show them in a 3x3 grid.
      • Pick your favorite type (not "n") and plot with a title, subtitle, x-label, y-label, and larger x and y limits.
      • Read help(plotmath). Repeat the above plot, now with some of the labels as mathematical expressions; use at least some powers and Greek letters.
      • Repeat the above plot, making it colorful. Use abline() to add a thick green line \(y=0.5x-1\).
      • Read about table(), pie(), barplot(), and rainbow(). Make a pie chart and a barplot of the frequencies of the y values colored by rainbow; use layout() to show them side by side.
      • Use boxplot() to make a boxplot of x and y. Title and label it.
    • MATH 5530 Exploration:
      • Download an article from Computational Statistics & Data Analysis or Statistics and Computing published in 2015 or 2016. (You need to be on-campus or use the proxy server through the library to download.) Upload the paper to your individual project.
      • Read the abstract and introduction and skim the rest of the paper.
      • In a sage worksheet titled "1Exploration", include:
        • The full bibliographic information on the article and the name of the pdf file.
        • A one-paragraph (about 10 sentence) summary in your own words of the topic of the paper.
    2
    Mon Jan 18 Martin Luther King, Jr. Day holiday
    Wed Jan 20
    • Find your partner for this week's tasks and journal, sit next to them, and set up a project for your journal.
    • Read about head(), summary(), mean(), var(), paste(), and print().
    • Common discrete probability distributions:
      • Read about the R functions for the binomial, geometric, hypergeometric, poisson, and negative binomial distributions.
      • For each of the five above distributions:
        • Choose some parameters (not trivial like 0 or 1).
        • Use the d....() function to generate the probability distribution function and plot it.
        • Use the r....() function to generate 1000 data points. Use table() and lines() (or points()) and appropriate scaling to plot it on the same graph as the distribution function, so that they approximately match. Remember to title your graph.
        • Use mean() and var() to check the mean and variance of the data; compare to the theoretical values. (Note that the Wikipedia and R definitions sometimes differ, such as switching successes and failures.)
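One way the discrete-distribution tasks above could be approached is sketched below for the binomial case only. It compares the probabilities from dbinom() with the scaled frequencies of 1000 rbinom() draws. The parameters size=10 and prob=0.3 and the seed are arbitrary choices made for this illustration, not values from the course.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
n <- 10; p <- 0.3                # illustrative parameters (an assumption)
k <- 0:n
pmf <- dbinom(k, size = n, prob = p)
plot(k, pmf, type = "h", main = "Binomial(10, 0.3): theory vs. sample",
     xlab = "k", ylab = "probability")

draws <- rbinom(1000, size = n, prob = p)
# factor() with fixed levels keeps zero counts, so the table lines up with k
freq <- table(factor(draws, levels = k)) / length(draws)
points(k, as.numeric(freq), col = "red")

# Sample moments next to the theoretical ones n*p and n*p*(1-p)
c(sample_mean = mean(draws), theory_mean = n * p)
c(sample_var = var(draws), theory_var = n * p * (1 - p))
```

Dividing the table by the number of draws is the "appropriate scaling" that puts the frequencies on the same axis as the probabilities.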
    Thu Jan 21 8am journal for Jan 13-15 due. 5530 exploration from Jan 15 due.
    Fri Jan 22 (drop deadline)
    • From the StatisticalComputing project copy ratings.sagews to your individual project. In it rate your partner on the journal due yesterday.
    • Read about density().
    • Common continuous probability densities:
      • Read about the R functions for the uniform, normal, gamma, and beta distributions.
      • For each of the four above densities:
        • Choose some parameters (not trivial like 0 or 1).
        • Use the d....() function to generate the probability density function and plot it.
        • Use the r....() function to generate 1000 data points. Use density() and lines() to plot it on the same graph as the distribution function. Remember to title your graph.
        • Use mean() and var() to check the mean and variance of the data; compare to the theoretical values.
    • MATH 5530 Exploration:
      • Repeat the exploration from Jan 15 with a new article.
      • Identify one concept or method that you are not familiar with. Research it (usually Wikipedia is sufficient) and write a paragraph explaining it. Cite and link to your sources.
    3
    Mon Jan 25
    • Find your partner for this week's tasks and journal, sit next to them, and set up a project for your journal.
    • Read the Good Problems handout on Flow. Starting with this journal, be sure to use complete sentences and paragraphs and have text to bind your journal together.
    • Making functions:
      • Read about writing your own functions, return(), if(), sapply(), length(), numeric(), and while().
      • Implement the function \(f(x)=\left\{\begin{array}{ll} 1-|x| & -1\le x \le 1\\ 0 & \text{otherwise}\end{array}\right.\) and plot it on \([-2,2]\).
      • Find the explicit formula for the function \(F(x)=\int_{-\infty}^x f(t)dt\), implement it, and plot it on \([-2,2]\).
      • Find the explicit formula for the function \(F^{-1}(y)\), implement it, and plot it on an appropriate interval.
    • Custom densities:
      • Read about Inverse transform sampling. Summarize the method in your own words. Use this method to generate samples from the probability density function \(f(x)\) above. Show that it worked.
      • Read about Rejection sampling. Summarize the method in your own words. Use this method to generate samples from the probability density function \(f(x)\) above. Show that it worked.
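As a rough sketch of the two sampling methods named above, applied to the triangular density \(f(x)\): the inverse CDF below follows from \(F(x)=(1+x)^2/2\) on \([-1,0]\) and \(F(x)=1-(1-x)^2/2\) on \([0,1]\), and the rejection sampler uses a uniform proposal on \([-1,1]\). The sample sizes and seed are arbitrary assumptions for this illustration.

```r
set.seed(2)
f <- function(x) ifelse(abs(x) <= 1, 1 - abs(x), 0)

# Inverse transform sampling: apply F^{-1} to uniform(0,1) draws
Finv <- function(u) ifelse(u <= 0.5, sqrt(2 * u) - 1, 1 - sqrt(2 * (1 - u)))
inv_samples <- Finv(runif(10000))

# Rejection sampling: propose uniformly on [-1,1] (density 1/2) with
# envelope constant M = 2, so a proposal x is accepted with probability f(x)
proposal <- runif(30000, -1, 1)
rej_samples <- proposal[runif(30000) <= f(proposal)]

# Compare both sample densities with the target
plot(density(inv_samples), main = "Samples vs. target f", xlim = c(-2, 2))
lines(density(rej_samples), col = "blue")
curve(f, -2, 2, add = TRUE, col = "red")
```

About half of the proposals should be accepted, since the acceptance rate of rejection sampling is 1/M.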
    Tue Jan 26 8am journal for Jan 20-22 due. Rate your partner.
    Wed Jan 27
    • Follow these instructions to install the R package "mcmc". (Let me know if it fails.)
    • Markov Chain Monte Carlo methods:
      • Read about Markov Chain Monte Carlo methods. Read about it again, this time slower and more carefully. Summarize in your own words.
      • Read about the Metropolis-Hastings algorithm. Summarize in your own words.
      • Do library(mcmc) to load the mcmc package. Do help(metrop) to read about its metrop() function. (If library() fails, you may need to use its lib.loc option.)
      • Let \(f(x)=\left\{\begin{array}{ll} 1-|x| & -1\le x \le 1\\ 0 & \text{otherwise}\end{array}\right.\). Use metrop() to produce samples from the distribution \(f\). Plot the samples produced versus their index to see how the Markov Chain moves around. Plot the resulting density to see how well the sampling worked.
      • Let \(g_A(x) = \frac{f(x) + f(x-A)}{2}\). Use metrop() to produce samples from the distribution \(g_A\) for \(A=1,3,5\). For each \(A\) plot the samples versus their index and the resulting densities. Explore the options to metrop() to make the sampling and resulting densities better.
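To make the Metropolis idea above concrete, here is a hand-written random-walk Metropolis sampler in base R. It is a stand-in for illustration, not the metrop() function from the mcmc package, whose interface differs. The function name rw_metropolis, the step size, and the chain length are all assumptions of this sketch.

```r
set.seed(3)
f <- function(x) ifelse(abs(x) <= 1, 1 - abs(x), 0)

# Hypothetical helper: random-walk Metropolis with a normal proposal
rw_metropolis <- function(target, start, n, step) {
  chain <- numeric(n)
  current <- start
  for (i in seq_len(n)) {
    candidate <- current + rnorm(1, sd = step)
    # accept with probability min(1, target(candidate) / target(current))
    if (runif(1) < target(candidate) / target(current)) current <- candidate
    chain[i] <- current
  }
  chain
}

chain <- rw_metropolis(f, start = 0, n = 20000, step = 0.5)
plot(chain[1:2000], type = "l", xlab = "index", ylab = "sample",
     main = "First 2000 steps of the chain")
plot(density(chain), main = "Chain density vs. target f")
curve(f, -2, 2, add = TRUE, col = "red")
```

The same helper can be pointed at \(g_A\) by passing function(x) (f(x) + f(x - A)) / 2 as the target; for larger \(A\) the step size will probably need to grow before the chain crosses the gap between the two bumps.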
    Thu Jan 28 8am 5530 exploration from Jan 22 due.
    Fri Jan 29
    • Catch up.
    • MATH 5530 Exploration:
      • Repeat the exploration from Jan 22 with a new article.
      • Identify the specific claims made in the paper (such as that their method is better than existing methods for certain problems).
    4
    Mon Feb 1
    • Find your partner for this week's tasks and journal, sit next to them, and set up a project for your journal.
    • Read about read.table(), qqplot(), qqnorm().
    • Maximum Likelihood Estimators:
      • Read about Maximum Likelihood Estimators. Summarize in your own words.
      • Load the stats4 package using library and read about its mle function.
      • Get the continuous data set unknowncontinuous.dat. Through plots or measurements, determine which continuous probability density (one of uniform, normal, gamma, beta) was used to generate it.
      • Use mle() to compute the maximum likelihood estimate of the parameter(s). Plot the density function using these parameters along with the density from the data to see how well it worked.
      • Get the discrete data set unknowndiscrete.dat. Through plots or measurements, determine which discrete probability distribution (one of binomial, geometric, poisson, negative binomial) was used to generate it.
      • Use mle() to compute the maximum likelihood estimate of the continuous parameters. If the distribution type you selected has discrete parameters (like n), then fix them using the 'fixed' option and manually try a few values that seem reasonable from the plots; which value is most likely to be true? Plot the distribution function using the parameters you determined and the normalized table from the data to see how well it worked.
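The following sketch shows one way mle() from stats4 might be called for a continuous density. The data here are simulated from a gamma distribution as a stand-in; they are not the contents of unknowncontinuous.dat, and the choice of gamma, the starting values, and the bounds are assumptions of this example.

```r
library(stats4)                          # ships with R
set.seed(4)
x <- rgamma(2000, shape = 3, rate = 2)   # stand-in data, NOT the course file

# Negative log-likelihood as a function of the named parameters
nll <- function(shape, rate) {
  -sum(dgamma(x, shape = shape, rate = rate, log = TRUE))
}

# Bounded optimization keeps both parameters positive
fit <- mle(nll, start = list(shape = 1, rate = 1),
           method = "L-BFGS-B", lower = c(shape = 0.01, rate = 0.01))
est <- coef(fit)
est

# Fitted density on top of the density estimated from the data
plot(density(x), main = "Data density vs. fitted gamma")
curve(dgamma(x, shape = est["shape"], rate = est["rate"]),
      add = TRUE, col = "red")
```

For a discrete parameter such as the binomial n, the 'fixed' argument of mle() can hold it constant while the remaining parameters are estimated.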
    Tue Feb 2 8am journal for Jan 25-29 due. Rate your partner.
    Wed Feb 3
    • Read about sample and replicate.
    • Generate 10000 samples from the (Gaussian mixture) density 0.3*normal(0,1)+0.7*normal(5,2), meaning that a sample has 0.3 probability of coming from normal(0,1) and 0.7 probability of coming from normal(5,2). Plot the density to make sure it looks correct.
    • Expectation Maximization:
      • Read about the EM algorithm and especially its use for Gaussian mixtures. Summarize the method in your own words.
      • Install the package mclust, load it, and read about its em function.
      • Forget that you know the parameters 0.3, 0, 1, 0.7, 5, and 2, and use em to try to recover them from the data you generated above.
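As an illustration of the mixture and EM steps above, the sketch below generates the mixture and then runs a hand-written EM loop in base R. It is not the em function from mclust. It assumes that the second argument of normal(.,.) is a standard deviation (the text does not say whether 2 is a standard deviation or a variance), and the starting values and iteration count are arbitrary.

```r
set.seed(5)
n <- 10000
# Assumption: normal(5,2) means mean 5 and standard deviation 2
from_first <- runif(n) < 0.3
x <- ifelse(from_first, rnorm(n, 0, 1), rnorm(n, 5, 2))
plot(density(x), main = "Density of the generated mixture")

# Hand-written EM for a two-component one-dimensional Gaussian mixture
w  <- c(0.5, 0.5)
mu <- as.numeric(quantile(x, c(0.25, 0.75)))
s  <- rep(sd(x), 2)
for (iter in 1:300) {
  # E-step: responsibility of each component for each point
  d1 <- w[1] * dnorm(x, mu[1], s[1])
  d2 <- w[2] * dnorm(x, mu[2], s[2])
  r1 <- d1 / (d1 + d2)
  r2 <- 1 - r1
  # M-step: responsibility-weighted weights, means, standard deviations
  w  <- c(mean(r1), mean(r2))
  mu <- c(sum(r1 * x) / sum(r1), sum(r2 * x) / sum(r2))
  s  <- c(sqrt(sum(r1 * (x - mu[1])^2) / sum(r1)),
          sqrt(sum(r2 * (x - mu[2])^2) / sum(r2)))
}
rbind(weight = w, mean = mu, sd = s)
```

The recovered values should land near 0.3, 0, 1 and 0.7, 5, 2, up to sampling error.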
    Thu Feb 4 8am 5530 exploration from Jan 29 due.
    Fri Feb 5
    • Catch up.
    • MATH 5530 Exploration:
      • Repeat the exploration from Jan 29 with a new article.
      • List the numerical experiments in the paper and identify which you might be able to reproduce to (in)validate the claims of the paper.
    5
    Mon Feb 8
    • Find your partner for this week's tasks and journal, sit next to them, and set up a project for your journal.
    • Read the Good Problems handout on Introductions and Conclusions. Starting this week's journal, you need to include an introduction and a conclusion.
    • Read about matrix().
    • Gibbs sampling:
      • Read about Gibbs sampling. Summarize in your own words.
      • Write a Gibbs sampler to construct samples from the uniform distribution on a disc of radius 1. Plot the resulting points.
      • Write a Gibbs sampler to construct samples from the distribution f(x,y) whose marginal in x is unif(min=0,max=5) independent of y and whose marginal in y is unif(min=x,max=2x+1) for x in [0,5]. Plot the resulting points.
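For the unit-disc task above, a Gibbs sampler could look roughly like this. It relies on the fact that, for the uniform distribution on the disc, the conditional of x given y is uniform on the chord \([-\sqrt{1-y^2},\sqrt{1-y^2}]\), and likewise for y given x. The starting point, chain length, and seed are assumptions of this sketch.

```r
set.seed(6)
n <- 5000
pts <- matrix(0, nrow = n, ncol = 2)
x <- 0; y <- 0                       # start at the center of the disc
for (i in 1:n) {
  # given y, x is uniform on the horizontal chord through y
  r <- sqrt(1 - y^2)
  x <- runif(1, -r, r)
  # given x, y is uniform on the vertical chord through x
  r <- sqrt(1 - x^2)
  y <- runif(1, -r, r)
  pts[i, ] <- c(x, y)
}
plot(pts, asp = 1, pch = ".", xlab = "x", ylab = "y",
     main = "Gibbs samples on the unit disc")
```

Setting asp = 1 keeps the axes on the same scale so the disc does not look like an ellipse.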
    Tue Feb 9 8am journal for Feb 1-5 due. Rate your partner.
    Wed Feb 10
    • Multivariate Normal distributions:
      • Read about the Multivariate Normal distribution and the methods for drawing values from it. Summarize in your own words.
      • Install the mvtnorm package and read about its rmvnorm, pmvnorm, qmvnorm, and dmvnorm functions.
      • Read about the persp, contour, image, and wireframe functions in the lattice package.
      • Let \(\mu=[1,2]\) and \(\Sigma=\left[\begin{array}{cc}1&0.8\\ 0.8&1\end{array}\right]\).
        • Generate 1000 samples from the normal\((\mu,\Sigma)\) using rmvnorm and plot them. Use colMeans and var to check \(\mu\) and \(\Sigma\).
        • Evaluate the density function for normal\((\mu,\Sigma)\) using dmvnorm on a grid including \([-2,4]\times [-1,5] \). Plot using persp, contour, image, and wireframe.
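One of the standard methods for drawing from a multivariate normal is the Cholesky approach, sketched here in base R for the \(\mu\) and \(\Sigma\) above. This is a substitute for illustration, not the rmvnorm function from mvtnorm; the seed is arbitrary.

```r
set.seed(7)
mu <- c(1, 2)
Sigma <- matrix(c(1, 0.8, 0.8, 1), 2, 2)

R <- chol(Sigma)                      # upper triangular, t(R) %*% R equals Sigma
Z <- matrix(rnorm(2000), ncol = 2)    # 1000 rows of independent N(0,1) pairs
X <- sweep(Z %*% R, 2, mu, "+")       # rows now have mean mu, covariance Sigma

plot(X, xlab = "x1", ylab = "x2", main = "1000 bivariate normal samples")
colMeans(X)                           # should be close to mu
var(X)                                # should be close to Sigma
```

Each row of Z %*% R has covariance t(R) %*% R, which is why the upper-triangular factor multiplies on the right.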
    Thu Feb 11 8am 5530 exploration from Feb 5 due.
    Fri Feb 12
    • Create a data set D with samples from the uniform distribution on the unit disc.
    • Create a data set G with samples from the two-dimensional normal distribution with \(\mu=[0,0]\) and \(\Sigma=\left[\begin{array}{cc}1&0\\ 0&1\end{array}\right]\).
    • For each of these:
      • Determine (by thinking) the density function for the x values and the density function for the y values. Are they the same?
      • Determine (by thinking) whether x and y are independent.
      • Test computationally whether or not x and y have the same distribution.
      • Test computationally whether or not x and y are independent.
    • MATH 5530 Exploration:
      • Repeat the exploration from Feb 5 with a new article.
      • Identify an R function that we have not used in this class and would likely be useful in trying to validate the results in this paper. Run a very simple calculation using this function.
    6
    Mon Feb 15
    • Find your partner for this week's tasks and journal, sit next to them, and set up a project for your journal.
    • From the StatisticalComputing project, copy the files 2161_MATH2301_100key.sagews, 2161_MATH2301_100grades.csv, and 2161_MATH2301_100perfect.csv. Look at them. Do not share them with anyone outside this class.
    • Read about data frames, read.csv, summary, head, factor, levels, with, names, sort, and NA.
    • Use read.csv to load the data in 2161_MATH2301_100grades.csv as a data frame named grades and the data in 2161_MATH2301_100perfect.csv as a data frame named perfect. Apply head to both data frames and summary to grades and interpret the results.
    • Plot grades$grade and note that the grades are in alphabetical order, rather than their natural order. Use factor and its levels option to replace grades$grade with itself but with the grades in the order c("A","A-","B+","B","B-","C+","C","C-","D+","D","D-","F","FS","WP","WF","Z"). Replot to show that it worked. (You can use the option cex.names=0.75 to shrink the labels so they all show.)
    • Similarly, fix the order of the levels of grades$Level to put them in their natural order and plot to show that it worked.
    • Similarly, fix the order of the levels of grades$College to put them in order from most students to fewest students. (Use table, sort, and names to automatically find the correct order of colleges.) Plot to show that it worked.
    • (Start using with.)
    • Plot grade versus avg and avg versus grade. Explain how to read the plots and which plot you think is most useful. Interpret what the plot tells you about how the students did.
    • Similarly, plot, explain, and interpret using Level and avg.
    • Similarly, plot, explain, and interpret using Level and grade.
    • Similarly, plot, explain, and interpret using College and grade.
    • Use table to make a contingency table of College and grade.
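The level-reordering steps above can be tried on a toy example before touching the real data. The vectors below are made up for this sketch (the college codes "BUS" and "HSP" are hypothetical); only the list of grade levels comes from the task description.

```r
# Toy stand-in for grades$grade; the real CSV is not reproduced here
grade <- factor(c("B", "A", "C-", "F", "A", "B+", "WP", "A"))
natural <- c("A","A-","B+","B","B-","C+","C","C-",
             "D+","D","D-","F","FS","WP","WF","Z")
grade <- factor(grade, levels = natural)   # same values, new level order
table(grade)

# Toy stand-in for grades$College, ordered from most to fewest students
college <- factor(c("A&S", "ENT", "A&S", "BUS", "A&S", "ENT", "A&S", "HSP"))
by_size <- names(sort(table(college), decreasing = TRUE))
college <- factor(college, levels = by_size)
table(college)

# Contingency table of the two factors
table(college, grade)
```

Re-creating the factor with an explicit levels argument changes only the ordering used by plot and table, not the underlying values.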
    Tue Feb 16 8am journal for Feb 8-12 due. Rate your partner.
    Wed Feb 17
    • Read about vector manipulations, selecting subsets of the data, subset, is.na, %in%, rbind, legend, and length.
    • Using subset, create the following data frames:
      • finished: students who took the final exam.
      • unfinished: students who did not take the final exam.
      • ABC: students with grades in c("A","A-","B+","B","B-","C+","C","C-").
      • DFW: students with grades in c("D+","D","D-","F","FS","WP","WF","Z").
      Use summary, table, or another method to show that you got the subsets you wanted.
    • Show the distribution of finished$Level and unfinished$Level within a single barplot. Each level (like "Freshman") should have two bars (use the "beside" option). Make the plot colorful and include a legend. Interpret the results.
    • Similarly, show the distributions of Level for ABC and DFW and interpret.
    • Similarly, show the distribution of College for finished/unfinished and for ABC/DFW; interpret.
    • Plot Level versus College for the ABC data frame and (separately) for the DFW data frame. Interpret the results.
    Thu Feb 18 8am 5530 exploration from Feb 12 due.
    Fri Feb 19
    • Summarize what you have learned so far from this data set. How could the Mathematics department use this information to make MATH 2301 better?
    • MATH 5530 Exploration:
      • Repeat the exploration from Feb 12 with a new article.
    7
    Mon Feb 22
    • Find your partner for this week's tasks and journal, sit next to them, and set up a project for your journal.
    • Read the Good Problems handout on Logic. Starting with this week's journal, be sure to make your logic clear and use logical connectives.
    • Linear Models
      • Read about linear models. Summarize the method in your own words.
      • Read about linear models in R, lm, fitted, coefficients, and I.
      • Using the finished dataframe, explore how well the scores on the exam etotal (written by the course coordinator) are predicted by the scores on tests tbest5 (written by the instructor).
        • Apply lm to fit a line relating these variables. Run summary and plot on the result. (It actually produces 4 plots, so use layout(matrix(1:4,2,2)).) Interpret the results.
        • plot(tbest5,etotal). From the result of lm, extract the coefficients (automatically, not copy and paste) and use them to plot the line of best fit on top of the data. Interpret the result.
      • Similarly, see how well tbest5 is predicted by gwbest10 using a line and interpret.
      • Similarly, see how well tbest5 is predicted by gwbest10 using a parabola (quadratic in gwbest10). (You will need to use the I function in your formula.) Argue whether the line or parabola is better.
      • Run
        lmetg <- lm(etotal ~ tbest5*gwbest10,data=finished)
        summary(lmetg)
        layout(matrix(1:4,2,2))
        plot(lmetg)
        Explain what the model is, what the results mean, and how well the prediction worked. Argue whether this is better or worse than the prediction using only tbest5.
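A small sketch of the lm workflow above, run on simulated data. The data frame reuses the column names tbest5 and etotal, but its values are invented for this illustration and say nothing about the real course data.

```r
set.seed(8)
# Simulated stand-in with made-up values
finished <- data.frame(tbest5 = runif(100, 40, 100))
finished$etotal <- 2 * finished$tbest5 - 20 + rnorm(100, sd = 15)

fit <- lm(etotal ~ tbest5, data = finished)
summary(fit)

# Extract the coefficients programmatically and draw the fitted line
b <- coefficients(fit)                 # named: "(Intercept)", "tbest5"
with(finished, plot(tbest5, etotal))
abline(a = b["(Intercept)"], b = b["tbest5"], col = "blue")

# A quadratic fit needs I() so that ^ means arithmetic, not a formula operator
fit2 <- lm(etotal ~ tbest5 + I(tbest5^2), data = finished)
coefficients(fit2)
```

In a formula, tbest5*gwbest10 expands to both main effects plus their interaction, which is what the lmetg model above fits.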
    Tue Feb 23 8am journal for Feb 15-19 due. Rate your partner.
    Wed Feb 24
    • Read about Generalized Linear Models. Summarize in your own words and specifically address:
      • How are they different from (ordinary) linear models?
      • What is a link function?
      • What choices (such as the link function) within the generalized linear model give an (ordinary) linear model?
    • Read about Generalized Linear Models in R, glm, and family.
    • Use glm with appropriate choices to try to reproduce the results from lm when etotal is predicted by tbest5 using a line.
    • When you used lm to predict etotal using tbest5, you (should have) found that very low scores on tbest5 lead to negative predictions for etotal, which is nonsense. Use glm with family=binomial to avoid this nonsense. (It expects response values in \([0,1]\), so try to predict etotal/200.)
    • Using layout(matrix(1:12,4,3)), plot the results from the original lm test, the glm test that should reproduce it, and the glm test using family=binomial. Interpret the results.
    • Make a single plot that has
      • The original (tbest5,etotal) points with xlim=c(0,100),ylim=c(0,200).
      • The prediction line you got using lm.
      • The prediction line you got using glm trying to reproduce lm, in a different color.
      • The prediction curve you got using glm with family=binomial. (You will need to map using binomial()$linkinv and multiply by 200. It should look similar to the prediction lines but stay in \([0,200]\).)
      Interpret the results.
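The glm steps above might be sketched as follows on simulated data (invented values, reusing only the column names). The Gaussian family with identity link is used to reproduce lm, and family=binomial is fitted to etotal/200. The suppressWarnings() call hides the non-integer-response warning that a binomial fit to proportions produces; whether to silence it is a choice made for this sketch.

```r
set.seed(9)
# Simulated stand-in with made-up values, clipped to the range [0, 200]
d <- data.frame(tbest5 = runif(150, 0, 100))
d$etotal <- pmin(200, pmax(0, 2 * d$tbest5 - 20 + rnorm(150, sd = 15)))

fit_lm  <- lm(etotal ~ tbest5, data = d)
fit_gau <- glm(etotal ~ tbest5, family = gaussian(link = "identity"), data = d)
fit_bin <- suppressWarnings(
  glm(I(etotal / 200) ~ tbest5, family = binomial, data = d))

# Predictions on a grid: map the linear predictor through the inverse link
grid <- data.frame(tbest5 = 0:100)
eta  <- predict(fit_bin, newdata = grid)        # link (logit) scale
pred_bin <- 200 * binomial()$linkinv(eta)       # back to the 0..200 scale

plot(d$tbest5, d$etotal, xlim = c(0, 100), ylim = c(0, 200),
     xlab = "tbest5", ylab = "etotal")
abline(fit_lm, col = "blue")
abline(fit_gau, col = "green", lty = 2)
lines(grid$tbest5, pred_bin, col = "red")
```

The two straight lines should coincide, while the red curve bends to stay inside \([0,200]\).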
    Thu Feb 25 8am 5530 exploration from Feb 19 due.
    Fri Feb 26 Due March 10:
    • MATH 4530 Students: Propose a topic for your final project. Explain why you think it is a good choice.
    • MATH 5530 Exploration: Propose a paper to use for your final project. It could be one you used for an exploration or a new one. Explain why you think it is a good choice. Identify its specific claims and the numerical experiments you plan to reproduce.
    Spring Break
    8
    Mon Mar 7 Work on your final project proposal.
    Tue Mar 8 8am journal for Feb 22-24 due. Rate your partner.
    Wed Mar 9
    • Find your partner for this week's tasks and journal, sit next to them, and set up a project for your journal.
    • Read the competitive mathematical game False Alarms in a Sensor Network.
    • Read about Expected Value. Summarize in your own words.
    • Compute the expected values of the following:
      • The cost of radioactive leaks from nuclear plants in one year, assuming the sensor network did not detect them early.
      • The cost of radioactive leaks from nuclear plants in one year, assuming the sensor network did detect them early.
      • The cost of clouds of radioactivity from the East, assuming the sensor network did not detect them early.
      • The cost of clouds of radioactivity from the East, assuming the sensor network did detect them early.
      • The cost of dirty bombs, assuming the sensor network did not detect them early.
      • The cost of dirty bombs, assuming the sensor network did detect them early.
      • The cost to maintain the sensor network.
      • The cost of sensors giving false alarms, assuming the sensor is isolated.
      • The cost of sensors giving false alarms, assuming the sensor network detects it as a likely malfunction.
    • If there were no sensor network, what is the expected total cost?
    • In the best case, where the sensor network detects everything, what is the expected total cost?
    • What is the net benefit of installing the sensor network?
    Thu Mar 10 8am final project proposal due. For 5530 students counts as an exploration.
    Fri Mar 11
    • Make a draft entry in the game (in your journal, not as a .pdf).
    • MATH 5530 Exploration:
      • If your final project proposal was rejected, then propose a new paper.
      • If your final project proposal was accepted, then pick some self-contained topic that you will need for your final project, explain it, and do some related computation.
    9
    Mon Mar 14 Work on your final project.
    Tue Mar 15 8am journal for Mar 9-11 due. Rate your partner.
    Wed Mar 16
    • Find your partner for this week's tasks and journal, sit next to them, and set up a project for your journal.
    • Read about data.frame, cbind, and rowSums.
    • Make a table of the grades received by students who did not take test 1 or did not take test 2 (or did not take both). Interpret the results. (We will not be able to use these for the analysis today.)
    • Make a data frame begend such that:
      • Students who did not take test 1, test 2, or both, are excluded.
      • It has a column Level3 derived from Level that preserves "Freshman" and "Sophomore" but has all other values converted to "other".
      • It has a column College3 derived from College that preserves "A&S" and "ENT" but has all other values converted to "other".
      • It has a column preparation with the sum of all the questions from test 1 and questions 1, 2, and 3 from test 2. (These are PreCalculus questions.)
      • It has a column gw1and2 that is the sum of the first 2 groupworks, with missing scores counted as 0.
      • It has a column grade2 with "ABC" if the student grade was in c("A","A-","B+","B","B-","C+","C","C-") and "DFW" otherwise.
      Show that your data frame is correct by using table, summary, etc.
    • Plot all 10 combinations of pairs of factors in begend. For each pair, choose the order (i.e. plot(x,y) or plot(y,x)) that gives the most useful plot and interpret the results.
    Thu Mar 17 8am 5530 exploration from Mar 11 due.
    Fri Mar 18
    • Read about Binary Classification and Evaluation of binary classifiers. Summarize in your own words. Give the formulas for and interpretation of sensitivity, specificity, and accuracy.
    • Consider a grade2 value of "DFW" as the disease state. Our goal is to diagnose this disease based on information available early in the semester.
    • From looking at the plots, decide on a classification of students into "ABC" or "DFW" using only Level3 information. Compute the sensitivity, specificity, and accuracy of your classifier.
    • Repeat using only College3 information. Does this classifier do better or worse?
    • Write a function ssa with inputs:
      • score: a factor containing numerical scores;
      • class: a factor containing the true classes corresponding to the numerical scores (only two classes);
      • levels: a vector (or list) of the two classes, with the class generally corresponding to lower scores first; and
      • cut: a cutoff score.
      Have it compute the sensitivity, specificity, and accuracy of the binary classifier that classifies scores less than or equal to cut as levels[1] and scores greater than cut as levels[2]. Return c(sensitivity,specificity,accuracy).
    • Run ssa with score=preparation and class=grade2 for cut in 1:150 and plot the sensitivity, specificity, and accuracy on a single graph. Interpret the results.
    • Run ssa with score=gw1and2 and class=grade2 for cut in 1:200 and plot the sensitivity, specificity, and accuracy on a single graph. Interpret the results.
    • If the Mathematics department wants to identify at-risk students early in the semester, what method do you recommend they use?
    • MATH 5530 Exploration: Pick some self-contained topic that you will need for your final project, explain it, and do some related computation.
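    The ssa task above can be sketched as follows. This is one possible reading, not a reference solution: it assumes that levels[1] (the class associated with lower scores) is the "positive" or disease class, which the task statement does not fix, and it uses a small made-up data set in place of preparation and grade2.

```r
# Sketch of ssa. Treating levels[1] as the "positive" (disease) class is an
# assumption; the task statement does not say which class is positive.
ssa <- function(score, class, levels, cut) {
  levels <- as.character(unlist(levels))     # accept a vector or a list
  score  <- as.numeric(as.character(score))  # accept a factor or numeric scores
  class  <- as.character(class)
  keep   <- !is.na(score) & !is.na(class)    # drop incomplete records
  score  <- score[keep]
  class  <- class[keep]

  predicted <- ifelse(score <= cut, levels[1], levels[2])
  tp <- sum(predicted == levels[1] & class == levels[1])  # true positives
  fn <- sum(predicted == levels[2] & class == levels[1])  # false negatives
  tn <- sum(predicted == levels[2] & class == levels[2])  # true negatives
  fp <- sum(predicted == levels[1] & class == levels[2])  # false positives

  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    accuracy    = (tp + tn) / length(class))
}

# Made-up toy data (hypothetical, not the course data).
score <- c(10, 20, 30, 40, 50, 60)
class <- factor(c("DFW", "DFW", "ABC", "DFW", "ABC", "ABC"))
print(ssa(score, class, c("DFW", "ABC"), cut = 30))

# Sweep the cutoff and draw all three curves on one graph.
cuts   <- seq(0, 70, by = 10)
curves <- sapply(cuts, function(k) ssa(score, class, c("DFW", "ABC"), k))
matplot(cuts, t(curves), type = "l", lty = 1, col = 1:3,
        xlab = "cut", ylab = "value")
legend("right", legend = rownames(curves), col = 1:3, lty = 1)
```

    With the real data, the sweep would use cut in 1:150 or 1:200 as stated in the tasks.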
    10
    Mon Mar 21 Work on your final project.
    Tue Mar 22 8am journal for Mar 16-18 due. Rate your partner.
    Wed Mar 23
    • Find your partner for this week's tasks and journal, sit next to them, and set up a project for your journal.
    • From the StatisticalComputing project, copy the files that start with L3C3eg. Do not share them with anyone outside this class. These contain Level3, College3, etotal, and grade for different sections of MATH 2301. Fix the order of the Level3, College3, and grade factors to their natural orders.
    • For each of the following, make a single plot overlaying the curves from each of the L3C3eg dataframes. Color-code by dataframe and include a legend. Interpret the results.
      • Plot the proportion of students in each level of Level3.
      • Plot the proportion of students in each level of College3.
      • Plot the density of etotal.
      • Plot the proportion of students that received each grade.
      • Plot the proportion of students getting DFW grades at each level of Level3 (i.e. proportion of Freshmen with DFW, proportion of Sophomores with DFW, ...). (Hint: sapply(levels(Level3),function(x){sum(Level3==x & grade %in% c("D+","D","D-","F","FS","WP","WF"))/sum(Level3==x)}).)
      • Plot the proportion of students getting DFW grades in each level of College3.
      • Plot the mean etotal of students that received each grade.
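    As an illustration of the overlay plots requested above, here is a minimal sketch for the DFW-by-Level3 plot. The data frames secA and secB, the Level3 level names, and the reduced grade set are all simulated stand-ins, not the L3C3eg files; the sapply call follows the hint in the task.

```r
set.seed(1)
lv <- c("Freshman", "Sophomore", "Junior")  # hypothetical Level3 levels
gr <- c("A", "B", "C", "D", "F", "WP")       # reduced, hypothetical grade set

# Simulated section; factor(..., levels = ) fixes the natural order.
make_section <- function(n) {
  data.frame(Level3 = factor(sample(lv, n, replace = TRUE), levels = lv),
             grade  = factor(sample(gr, n, replace = TRUE), levels = gr))
}
sections <- list(secA = make_section(80), secB = make_section(120))

dfw <- c("D+", "D", "D-", "F", "FS", "WP", "WF")
dfw_by_level <- function(df) {
  sapply(levels(df$Level3), function(x) {
    sum(df$Level3 == x & df$grade %in% dfw) / sum(df$Level3 == x)
  })
}
props <- sapply(sections, dfw_by_level)  # rows = levels, columns = sections
print(props)

# One curve per data frame, color-coded, with a legend.
matplot(props, type = "b", pch = 1, lty = 1, col = seq_along(sections),
        xaxt = "n", xlab = "Level3", ylab = "Proportion DFW", ylim = c(0, 1))
axis(1, at = seq_along(lv), labels = lv)
legend("topright", legend = names(sections),
       col = seq_along(sections), lty = 1, pch = 1)
```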
    Thu Mar 24 8am 5530 exploration from Mar 18 due.
    Fri Mar 25 (drop deadline with WP/WF)
    • Note that a rough draft of your final project report is due next Thursday.
    • Read about Hypothesis Testing. Summarize in your own words.
    • Read about Student's t-test and t.test. Summarize in your own words.
    • Apply t.test to L3C3eg100$etotal to test the hypothesis that it is drawn from a population whose mean is greater than 110. Repeat for 115, 120, 125, and 130. Interpret the results.
    • Read about Welch's t test. Summarize in your own words.
    • Apply t.test to the etotal for each pair of L3C3eg dataframes. Interpret the results.
    • Summarize what you have learned about MATH 2301 this week. What differences in performance between sections do you think are due to different student populations? What differences do you think are due to different instructors? How could the Mathematics department use this information to make MATH 2301 better?
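    The two kinds of t.test calls used in the tasks above can be sketched as follows. The vectors etotal_a and etotal_b are simulated placeholders for the etotal columns of two L3C3eg dataframes, so the printed numbers say nothing about the real sections.

```r
set.seed(2)
etotal_a <- rnorm(40, mean = 120, sd = 25)  # simulated stand-in for one section
etotal_b <- rnorm(55, mean = 112, sd = 30)  # simulated stand-in for another

# One-sample, one-sided tests: alternative hypothesis "population mean > mu0".
mu0s  <- c(110, 115, 120, 125, 130)
pvals <- sapply(mu0s, function(mu0) {
  t.test(etotal_a, mu = mu0, alternative = "greater")$p.value
})
print(data.frame(mu0 = mu0s, p.value = round(pvals, 4)))

# Two-sample comparison. t.test defaults to var.equal = FALSE, which is
# Welch's test (it does not assume equal variances).
welch <- t.test(etotal_a, etotal_b)
print(welch)
```

    A small p-value in the first table is evidence that the mean exceeds mu0; the p-values grow as mu0 increases because the same sample mean is being compared with a higher threshold.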
    11
    Mon Mar 28 Work on your final project.
    Tue Mar 29 8am journal for Mar 23-25 due. Rate your partner.
    Wed Mar 30
    • Find your partner for this week's tasks and journal, sit next to them, and set up a project for your journal.
    • From grades$etotal make etotalnoNA that has NA values discarded. From grades$etotal make etotal0 that discards NA values that correspond to a grade of "Z" and sets the remaining NA values to 0.
    • Make a function subsetmeanvar that inputs two vectors (data,indices) and returns the mean and variance of the subset of the data vector with those indices.
    • Make a function subsetmedian that inputs (data,indices) and returns the median of the subset of the data with those indices.
    • Read about Bootstrapping. Summarize the method in your own words.
    • Read about the boot package and its functions boot and boot.ci.
    • Use boot and boot.ci on etotalnoNA with the statistic subsetmeanvar to study the empirical distribution of the mean statistic for etotalnoNA. Print the outputs of boot and boot.ci and plot the output of boot. Interpret the results.
    • One may argue that the mean of etotalnoNA is a poor way to measure the effectiveness of the instructor, since a terrible instructor may scare off all but the strongest students. To account for these students, we can instead consider the median of etotal0. Use boot and boot.ci on etotal0 with the statistic subsetmedian to study the empirical distribution of the median statistic for etotal0. Interpret the results.
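    A minimal sketch of the bootstrap tasks above, assuming the boot package is installed (it is normally distributed with R). The vector etotal_sim is simulated and only stands in for etotalnoNA and etotal0; the choices R = 2000 and the interval types are illustrative.

```r
library(boot)

set.seed(3)
etotal_sim <- round(rnorm(60, mean = 115, sd = 30))  # simulated stand-in data

# Statistics in the form boot expects: function(data, indices).
subsetmeanvar <- function(data, indices) {
  d <- data[indices]
  c(mean(d), var(d))
}
subsetmedian <- function(data, indices) median(data[indices])

# Bootstrap the mean (first component of subsetmeanvar).
b_mean <- boot(etotal_sim, subsetmeanvar, R = 2000)
print(b_mean)
plot(b_mean, index = 1)
ci_mean <- boot.ci(b_mean, index = 1, type = c("norm", "basic", "perc"))
print(ci_mean)

# Bootstrap the median.
b_med  <- boot(etotal_sim, subsetmedian, R = 2000)
ci_med <- boot.ci(b_med, type = "perc")
print(ci_med)
```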
    Thu Mar 31 8am rough draft of final project report due. For 5530 students counts as an exploration.
    Fri Apr 1
    • Note that a rough draft of your final project presentation (slides) is due next Thursday. See next Monday for guidance.
    • From grades make a data frame eqs that includes only the scores on the exam questions and has NA values discarded. Run boxplot on it to see how the students did on each question, and interpret the results.
    • One can argue that if the class does very badly on a question, then it was "too hard" and asking it was not productive. Identify the question on which the students did the worst and make a table of the scores. Look at the question itself and judge whether or not it is too hard. Argue whether or not questions similar to this one should be included on exams in the future.
    • One can argue that if the class does very well on a question, then it was "too easy" and asking it was not productive. Identify the question on which the students did the best and make a table of the scores. Look at the question itself and judge whether or not it is too easy. Argue whether or not questions similar to this one should be included on exams in the future.
    • One can argue that if the correlation between scores on two questions is too high, then we could save everyone time and energy by only asking one of them. Apply pairs to eqs and interpret the results. Apply cor and find the most correlated pair of questions. Look at the questions themselves and judge whether or not they are too similar. Argue whether or not only one question similar to one of this pair should be included on exams in the future.
    • Summarize what you have learned about MATH 2301 this week. How should performance differences between instructors be measured? Should the design of the final exam be modified?
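    One way to find the most correlated pair of questions with cor is sketched below. The data frame eqs here is simulated with hypothetical columns q1 to q4 (q2 is built to nearly duplicate q1), so it only illustrates the mechanics.

```r
set.seed(4)
n <- 50
ability <- rnorm(n)
q1 <- ability + rnorm(n, sd = 0.3)
eqs <- data.frame(
  q1 = q1,
  q2 = q1 + rnorm(n, sd = 0.1),  # built to be nearly redundant with q1
  q3 = rnorm(n),
  q4 = ability + rnorm(n)
)
eqs$q3[c(5, 17)] <- NA           # pretend two scores are missing
eqs <- na.omit(eqs)              # discard rows containing NA

boxplot(eqs, ylab = "score")     # per-question score distributions
pairs(eqs)                       # scatterplots of every pair of questions

cm <- cor(eqs)
cm[upper.tri(cm, diag = TRUE)] <- NA  # keep each pair once, drop the diagonal
best <- which(cm == max(cm, na.rm = TRUE), arr.ind = TRUE)
pair <- c(rownames(cm)[best[1, "row"]], colnames(cm)[best[1, "col"]])
print(pair)
```

    This picks the largest positive correlation; using abs(cm) instead would also catch strongly negatively correlated pairs.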
    12
    Mon Apr 4 Work on your final project presentation slides:
    Tue Apr 5 8am journal for Mar 30 - Apr 1 due. Rate your partner.
    Wed Apr 6
    • Find your partner for this week's tasks and journal, sit next to them, and set up a project for your journal.
    • Read about Monte Carlo integration. Summarize the basic method in your own words.
    • For practice, we will estimate the integral \(I = \int_{-1}^1 (1-x^2)\,dx\). Compute the exact value so we can compare.
    • Write a function with input n that uses the basic Monte Carlo integration method with n points to estimate \(I\).
    • Read about Antithetic variates. Summarize the method in your own words. Write a function with input n that uses this method with n total points to estimate \(I\). (It should use n/2 original points and n/2 antithetic points.)
    • Read about Importance sampling. Summarize the method in your own words. Write a function with input n that uses this method with n total points to estimate \(I\) using sampling density \(f(x)=1-|x|\) on \([-1,1]\).
    • Compare the three methods by doing the following for each:
      • Use replicate to run it 1000 times using 1000 points to collect 1000 estimates for \(I\).
      • Compute the variance of the estimates.
      • Run summary on the absolute value of the error of the estimates.
      Interpret the results. Which method is working better?
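    The comparison above can be sketched as follows; the function names are hypothetical. The exact value is \(I = 4/3\). Two assumptions are made: the antithetic partner of a uniform point \(x\) on \([-1,1]\) is taken to be \(-x\), and the density \(f(x)=1-|x|\) is sampled as the sum of two independent uniforms on \([0,1]\) minus 1.

```r
g     <- function(x) 1 - x^2
exact <- 4 / 3

# Basic Monte Carlo: average g over uniform points, times the interval length.
mc_basic <- function(n) {
  x <- runif(n, -1, 1)
  2 * mean(g(x))
}

# Antithetic variates: n/2 uniform points plus their reflections -x.
mc_antithetic <- function(n) {
  x <- runif(n %/% 2, -1, 1)
  2 * mean(c(g(x), g(-x)))
}

# Importance sampling with the triangular density f(x) = 1 - |x|.
mc_importance <- function(n) {
  x <- runif(n) + runif(n) - 1  # triangular on (-1, 1)
  f <- 1 - abs(x)
  mean(g(x) / f)
}

set.seed(5)
methods <- list(basic = mc_basic, antithetic = mc_antithetic,
                importance = mc_importance)
est <- sapply(methods, function(m) replicate(1000, m(1000)))

print(apply(est, 2, var))                     # variance of the estimates
print(apply(abs(est - exact), 2, summary))    # summary of absolute errors
```

    Because \(1-x^2\) is an even function, \(g(-x)=g(x)\), so with this choice of antithetic points the reflected values merely duplicate the originals and the estimator effectively uses only n/2 independent points.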
    Thu Apr 7 8am rough draft of presentation (slides) due. For 5530 students counts as an exploration.
    Fri Apr 8
    • (No more 5530 explorations.)
    • Read about Monte Carlo methods. Summarize in your own words.
    • Consider the following three-person game, with players whom we will call 1, 2, and 3. In each round two players play while the third waits. The winner of that round plays in the next round versus the player who was waiting. If a player wins two consecutive rounds then the game stops and that player is declared the overall winner. Each round is a simple coin toss, with the two players having equal probability of winning. Suppose in the first round 1 plays 2 while 3 waits. Is the probability of 1 being the overall winner the same as the probability of 2 being the overall winner? Is the probability of 1 being the overall winner the same as the probability of 3 being the overall winner?
    • Write a function that simulates this game and returns the (overall) winner (1,2, or 3).
    • Use replicate to run this function many times and compute the relative frequencies of the different players winning. Interpret the results.
    • Suppose you need to decide a winner among 3 players using only a (fair) coin and you would like the probabilities of each winning to be the same. Decide on a method/game to do so, write a function that simulates it, and run the simulation many times to show that the relative frequencies tend towards equality.
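    A possible simulation of the three-person game, together with one candidate fair method, is sketched below. The function names play_game and fair_pick are hypothetical, and the fair method (two coin flips, repeating when both come up tails) is only one of several that would work.

```r
# Simulate the game: 1 plays 2 first while 3 waits; the first player to win
# two consecutive rounds is the overall winner.
play_game <- function() {
  players     <- c(1, 2)
  waiting     <- 3
  last_winner <- 0                          # nobody has won a round yet
  repeat {
    winner <- players[sample.int(2, 1)]     # fair coin toss
    if (winner == last_winner) return(winner)
    loser       <- setdiff(players, winner)
    players     <- c(winner, waiting)       # winner meets the waiting player
    waiting     <- loser
    last_winner <- winner
  }
}

# One fair alternative: flip twice, read the flips as a binary number,
# and start over if both flips are 0.
fair_pick <- function() {
  repeat {
    flips <- sample(c(0, 1), 2, replace = TRUE)
    code  <- 2 * flips[1] + flips[2]
    if (code > 0) return(code)
  }
}

set.seed(6)
N <- 20000
game_freq <- table(factor(replicate(N, play_game()), levels = 1:3)) / N
fair_freq <- table(factor(replicate(N, fair_pick()), levels = 1:3)) / N
print(game_freq)
print(fair_freq)
```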
    13
    Mon Apr 11
    • From the StatisticalComputing project, copy the file ratinganalysis.sagews into your personal project. Read through it to see how partner ratings will be used to adjust journal scores. On Thursday after you submit your final journal, make sure your partner ratings are up to date. If you see any way to improve the process in ratinganalysis.sagews to make it better/fairer, then comment on it in your ratings.sagews file.
    • Look at the presentation rating guide and form. If you have any questions on how the presentations will work then ask them.
    Wed Apr 13 Project presentations:
    • Must be at least 10 and at most 15 minutes long.
    • Must use \(\LaTeX\) slides in the beamer class.
    • Aim for half the presentation to explain to the class general background and half to show what you did.
    • Worth 10% of your project grade.
    • You will rate each other. (rating guide and form)
    Thu Apr 14 8am journal for Apr 6-8 due. Rate your partner.
    Fri Apr 15 More presentations
    14
    Mon Apr 18 More presentations
    Wed Apr 20 More presentations
    Fri Apr 22 More presentations or clean-up
    15
    Fri Apr 29 Final Exam 1-3pm (virtual, your presence is not required).
    • Presentation slides due. You can improve them based on feedback from your presentation.
    • Project report due.

    Leftovers

    Topic/Materials/Tasks
    • Read about the Shapiro-Wilk test for normality and shapiro.test. Summarize in your own words.
    • For each of the L3C3eg dataframes, apply shapiro.test and qqnorm to etotal and interpret the results.
    • Jackknife:
      • Read about the Jackknife. Summarize the method in your own words.
      • From our Monte Carlo data on the fine for Ni, calculate the Jackknife estimate of the expected fine. How does it compare to the original Monte Carlo estimate of the expected fine?
      • Compute the Jackknife estimate of the standard error and compare it to the bootstrap estimate of the standard error.
    • Cross-Validation: **web**
    • split() and lapply().
    • Importing data: (See the R Data Import/Export manual.)
    • Some functions it would have been useful to know already are: unique(), all(), paste(), rbind(), and cbind().
    • Nonlinear models:
      • Guess a functional form (or a few) for how it depends on coord_y, such as \(ay+b\), \(a\exp(by)\), \(a\exp(b(y+c)^2)\), etc. Use nls() to find the parameters then plot the original data and the fitted values. Which functional form matches best?
    • Read about rep(), %%, and %/%.
    • Read about missing values, t(), rowMeans(), and colMeans().
    • Miscellaneous plotting tools:
      • Use hist() to make a histogram of the x values and a second histogram with too many breaks; use layout() to show them side by side.
      • Plot the density of the x values and a second density with too small width parameter; use layout() to show them side by side.
      • Read about the cloud() function in the package lattice. Plot coord_z versus Ni and Cr and describe what you observe.
    • Read about the kde2d() function in the MASS package; apply it to your samples. Plot the resulting density using contour(), image(), and persp().
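    For the Jackknife item in the list above, a generic leave-one-out sketch is given here. The function name jackknife and the simulated vector fines are hypothetical; the Monte Carlo data on the fine for Ni is not reproduced. The bootstrap standard error is computed with base R resampling for comparison.

```r
# Leave-one-out jackknife for a statistic stat applied to a vector x.
jackknife <- function(x, stat) {
  n         <- length(x)
  theta_hat <- stat(x)
  theta_loo <- sapply(seq_len(n), function(i) stat(x[-i]))
  theta_bar <- mean(theta_loo)
  bias      <- (n - 1) * (theta_bar - theta_hat)
  se        <- sqrt((n - 1) / n * sum((theta_loo - theta_bar)^2))
  c(estimate = theta_hat - bias, bias = bias, se = se)
}

set.seed(7)
fines <- rexp(100, rate = 1 / 50)  # simulated stand-in data
jk <- jackknife(fines, mean)
print(jk)

# Bootstrap standard error of the mean, by direct resampling.
boot_means <- replicate(2000, mean(sample(fines, replace = TRUE)))
print(sd(boot_means))
```

    For the sample mean the jackknife estimate equals the ordinary mean and the jackknife standard error equals sd(x)/sqrt(n), so the interesting comparisons arise with other statistics.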

    Martin J. Mohlenkamp
    Last modified: Thu Apr 28 11:04:11 EDT 2016