MATH 4530 (5380-100), Spring 2014

Statistical Computing

Catalog Description:
Introduction to computational statistics; Monte Carlo methods, bootstrap, data partitioning methods, EM algorithm, probability density estimation, Markov Chain Monte Carlo methods.
Desired Learning Outcomes:
Students will be able to:
  • Generate distributions by various methods.
  • Use computer-intensive methods for estimation and hypotheses testing.
  • Conduct data analysis using one or more major statistical models.
  • Requisites:
    MATH 4500 Theory of Statistics
    Instructor:
    Martin J. Mohlenkamp, mohlenka@ohio.edu, (740)593-1259, 315-B Morton Hall.
    Office hours: Monday, Wednesday, and Friday 2:00-2:55pm, or by appointment.
    Web page:
    http://www.ohiouniversityfaculty.com/mohlenka/20142/4530-5530.
    Class hours/ location:
    Monday, Wednesday, and Friday 12:55-1:50pm in 314 Morton Hall.
    Text:
    None. We will scavenge materials from the internet.
    Computational Resources:
    We will do our computations in the Sagemath Cloud mostly using the language R.
    Attendance:
    This is a "lab" class, so your attendance, participation, and collaboration is essential. Each class you attend earns you 0.5% toward your final grade, up to a maximum of 18%. Since there are 41 classes, this allows for 5 absences, including university excused absences for illness, death in the immediate family, religious observance, jury duty, or involvement in University-sponsored activities. Your attendance record will be available in Blackboard.
    Journal:
    Each week you will submit a journal documenting that you have performed the requested tasks and learned. Writing quality counts. Typical components are:
    Extensions:
    Journals are due at specified times and will be graded soon thereafter. If you email me before I have graded it, you can have a 24 hour extension, with penalty 5%. You can get further extensions with penalty 5% per 24 hours.
    Partners and Ratings:
    You will work with a partner (assuming an even number of students) and submit a single journal. You will rate the relative contribution of your partner and these ratings will be used to adjust the journal scores at the end of the term, by a mechanism to be decided.
    Tests:
    None.
    Final Exam:
    The final exam is scheduled for Friday May 2 at 3:10-5:10pm in our regular classroom. Your final project will substitute for this exam, and be due at the scheduled ending time of the exam.
    Project:
    You will develop a final project (with a partner), give a presentation on what you did, and submit a final report.
    Game:
    We will attempt the mathematical game Checking an Industrial Process, which uses this data. If are results are good, we will enter the competition; the deadline is April 30th, 2014. For some of you this could be your final project. Addendum: Our entry.
    Grade:
    Your grade is based on attendance 18%, journals 67%, and project 15%. An average of 90% guarantees you at least an A-, 80% a B-, 70% a C-, and 60% a D-.
    Academic Dishonesty:
    Your work must be done by you, not by someone else for you. Your words must be your own; any text not your own must be properly quoted and cited. You can ask others questions, look in books, use resources from the internet, and generally use whatever help you can find; such help must be acknowledged by naming the person, citing the book, giving the internet link, etc. A minor, first-time violation of this policy will receive a warning and discussion and clarification of the rules. Serious or second violations will result in a grade penalty on the assignment. Very serious or repeated violations will result in failure in the class and be reported to the Office of Community Standards and Student Responsibility, which may impose additional sanctions. You may appeal any sanctions through the grade appeal process.
    Special Needs:
    If you have specific physical, psychiatric, or learning disabilities and require accommodations, please let me know as soon as possible so that your learning needs may be appropriately met. You should also register with Student Accessibility Services to obtain written documentation and to learn about the resources they have available.
    Learning Resources:

    MATH 5530 (5397-100), Spring 2014

    For students enrolled in MATH 5530, the above syllabus is modified as follows:

    Requisites:
    MATH 5500 Theory of Statistics
    Exploration:
    In addition to the journal, each week submit individually an "exploration" of some additional topic you encountered and thought was interesting or useful. In scope, this exploration should be 10% the size of the journal.
    Final Exam:
    You will have a final exam on the more theoretical aspects of what we have studied. Amendment: Instead of an exam, submit a final paper using finaltemplate.tex.
    Grade:
    Your grade is based on attendance 18%, journals 52%, exploration 5%, project 15%, and final exam 10%. An average of 90% guarantees you at least an A-, 80% a B-, 70% a C-, and 60% a D-.

    Schedule

    Subject to change. Many of the tasks will be filled in as we go along.

    Week Date Topic/Materials/Tasks
    1
    Mon Jan 13
    • Introduction, syllabus, etc.
    • Get set up on the Sagemath Cloud:
      • Use Firefox (or Chrome), not Internet Explorer.
      • Sign up for a free account using your real name and your University email address (@ohio.edu). Sign in.
      • Click on "Help" in the upper left and read about it.
      • Look for a project in your account titled with your name. I created this and shared it with you. Your submissions as an individual, such as your biography, go here.
      • Look for a project "StatisticalComputing". I will put things here for the whole class to use.
    • Do your Biographical Survey
    Wed Jan 15
    • Find your partner for this week's tasks and journal, and sit next to them.
      • Look for a project shared with your partner. You will submit your journal using this project.
      • One of you upload the journal sample to this shared project. You can then both edit it.
      • Play with this journal. A cell with code is run by entering it, holding the "shift" key and hitting the "enter" (return) key. A cell with pretty text can be made using either html or markdown. (You can see how a cell with text was made by selecting it and hitting the button in the upper left that looks like an eye with a slash through it.)
    • Introduction to R:
    • Using R in the sage cloud:
      • Upload the sage worksheet basics.sagews. Run each cell of code and observe/guess what it is doing.
    Thu Jan 16 8am Biographical survey (as a journal) due.
    Fri Jan 17
    • Read about the game Checking an Industrial Process and download the data.
    • Figure out how to convert the data, upload it, and get it into R. (See the R Data Import/Export manual.) End with two data sets in R, named NonDestructive and Destructive, with (shortened) column headings "day", "cylinder", "coord_x", "coord_y", "coord_z", "Ni", and "Cr". Apply summary() to show that it loaded correctly.
    2
    Mon Jan 20Martin Luther King, Jr. Day holiday
    Wed Jan 22
    • Find your partner for this week's tasks and journal, and sit next to them.
    • Read about vector manipulations, data frames, writing your own functions, table(), summary(), length(), split(), subset(), and lapply().
    • Reread the game description to see how the samples were supposedly selected, and check:
      • Are there the correct number of cylinders selected each day?
      • Are the correct number of samples taken from each cylinder?
      Use the functions that you just read about or others you may find.
    Thu Jan 23 8am journal for Jan 13-17 due.
    Fri Jan 24
    • Rate your partner from last week.
    • Read about selecting subsets of the data.
    • The game description claims that the Company's data on Cr and Ni percentages satisfy the criteria set by both the Industry and the Safety Authority. Verify this.
    3
    Mon Jan 27
    • Find your partner for this week's tasks and journal, and sit next to them.
    • Read the Good Problems handout on Introductions and Conclusions. Starting with your week 3 journal, you need to include an introduction and a conclusion.
    • Plotting:
      • Read about Graphical Procedures in R, especially plot(), points() and lines().
      • Upload the sage worksheet r-plotting.sagews. Read through it, evaluating each cell.
      • Using the Destructive dataset, make a plot of coord_x by Ni. Make a second plot only showing points with coord_y=0. Make a third plot only showing points with coord_y=0 and with points color-coded by coord_z. What do you observe? Repeat for the NonDestructive dataset.
    Tue Jan 28 8am journal for Jan 22-24 due.
    Wed Jan 29
    • Rate your partner from last week.
    • Remarks:
      • If someone reads only the text (none of the code) in your journal, it should make sense to them. Use this standard when deciding what and how much to write.
      • Some functions it would have been useful to know already are: head(), unique(), all(), paste(), rbind(), and cbind().
    • Common discrete probability distributions:
      • Read about the R functions for the binomial, geometric, hypergeometric, poisson, and negative binomial distributions.
      • Read about mean(), and var().
      • For each of the five above distributions:
        • Choose some parameters (not trivial like 0 or 1).
        • Use the d....() function to generate the probability distribution function and plot it.
        • Use the r....() function to generate 1000 data points. Use table() and lines() (or points()) and appropriate scaling to plot it on the same graph as the distribution function, so that they approximately match.
        • Use mean() and var() to check the mean and variance of the data; compare to the theoretical values.
    Fri Jan 31
    • Common continuous probability densities:
      • Read about the R functions for the uniform, normal, gamma, and beta distributions.
      • Read about density().
      • For each of the four above densities:
        • Choose some parameters (not trivial like 0 or 1).
        • Use the d....() function to generate the probability density function and plot it.
        • Use the r....() function to generate 1000 data points. Use density() and lines() to plot it on the same graph as the distribution function.
        • Use mean() and var() to check the mean and variance of the data; compare to the theoretical values.
    4
    Mon Feb 3 Snow day! Class canceled.
    Tue Feb 4 8am journal for Jan 27-31 due.
    Wed Feb 5
    • Rate your partner from last week.
    • Find your partner for this week's tasks and journal, and sit next to them.
    • Making functions:
      • Read about writing your own functions, return(), if(), sapply(), and while().
      • Implement the function \(f(x)=\left\{\begin{array}{ll} 1-|x| & -1\le x \le 1\\ 0 & \text{otherwise}\end{array}\right.\) and plot it on \([-2,2]\).
      • Find the explicit formula for the function \(F(x)=\int_{-\infty}^x f(t)dt\), implement it, and plot it on \([-2,2]\).
      • Find the explicit formula for the function \(F^{-1}(y)\), implement it, and plot it on an appropriate interval.
    • Custom densities:
      • Read about Inverse transform sampling. Summarize the method in your own words. Use this method to generate samples from the probability density function \(f(x)\) above. Show that it worked.
      • Read about Rejection sampling. Summarize the method in your own words. Use this method to generate samples from the probability density function \(f(x)\) above. Show that it worked.
    Fri Feb 7
    • Read about library() and matrix().
    • Multivariate Normal distributions:
      • Read about the Multivariate Normal distribution and the methods for drawing values from it. Summarize in your own words.
      • Read about the rmvnorm(), pmvnorm(), qmvnorm(), and dmvnorm() functions in the mvtnorm package.
      • Using \(\mu=[1,2]\) and \(\Sigma=\left[\begin{array}{cc}1&0.8\\ 0.8&1\end{array}\right]\) generate 1000 samples and plot them.
    5
    Mon Feb 10
    • Find your partner for this week's tasks and journal, and sit next to them.
    • Linear Models
    • Game coord_z dependence:
      • From the Destructive dataframe make a dataframe that only uses points where coord_y==0, has first column coord_z, and has second column Niun8x (Ni after removing the mean 8 and the dependence on coord_x). Plot it.
      • Use lm() to model Niun8x as a line in coord_z. Plot the resulting line along with the original data. Factor out the slope to get Niun8x modeled as proportional to (coord_z+C) for some (rounded) C. Plot coord_z versus Niun8x/(coord_z+C); how well did it do?
      • Use lm() and I() to model Niun8x as a quadratic in coord_z. Do you think the actual relationship is linear or quadratic?
    Tue Feb 11 8am journal for Feb 3-7 due.
    Wed Feb 12
    • Rate your partner from last week.
    • Nonlinear models:
    • Game coord_y dependence:
      • From the Destructive dataframe make a dataframe that has first column coord_y and has second column Niun8xz (Ni after removing the mean 8, the dependence on coord_x, and the dependence on coord_z). Plot it.
      • Guess a functional form (or a few) for how it depends on coord_y, such as \(ay+b\), \(a\exp(by)\), \(a\exp(b(y+c)^2)\), etc. Use nls() to find the parameters then plot the original data and the fitted values. Which functional form matches best?
      • Using your best match but ignoring the overall amplitude, Divide Niun8xz by this function of coord_y and plot. How well did it do?
    Fri Feb 14
    • Game day and cylinder dependence:
      • From the Destructive dataframe make a dataframe that has first column day, the second column cylinder, and third column Niun8xzy (Ni after removing the mean 8, the dependence on coord_x, the dependence on coord_z, and the dependence on coord_y).
      • Try to determine the dependence of Niun8xzy on day and/or cylinder and remove it.
    6
    Mon Feb 17
    • Find your partner for this week's tasks and journal, and sit next to them.
    • Monte Carlo methods for integration:
      • Read about Monte Carlo integration. Summarize the method in your own words. Use the (basic) method to estimate \(\int_{-1}^1 (1-x^2)\,dx\).
      • Read about Antithetic variates. Summarize the method in your own words. Use this method to estimate \(\int_{-1}^1 (1-x^2)\,dx\).
      • Read about Importance sampling. Summarize the method in your own words. Use the method to estimate \(\int_{-1}^1 (1-x^2)\,dx\) using sampling density \(f(x)=1-|x|\). Is this a good choice for the sampling density?
      • Compare the three methods and describe what you observe.
    Tue Feb 18 8am journal for Feb 10-14 due.
    Wed Feb 19
    • Rate your partner from last week.
    • Read about sample(), rep(), %%, %/%, and replicate().
    • Generating the game sampling:
      • Read the game description on how the Safety Authority will sample.
      • Write a function cylindersample() that produces a dataframe with first column coord_x, second column coord_y, and third column coord_z, and rows with samples following the distribution described for sampling one cylinder.
      • Write a function allsample() that produces a dataframe with first column day, second column cylinder, third column coord_x, fourth column coord_y, and fifth column coord_z, and rows with samples following the distribution described for sampling the entire production of cylinders.
      • Run tests to confirm your sampling is working as desired.
    Fri Feb 21
    • Game fines:
      • Read the game description on how the Safety Authority will assess fines based on the Ni content.
      • Write a function fineNi(Ni) that inputs a vector of Ni values and returns the fine.
    • Based on our analysis of Ni, write a function predictNi(day,cylinder,coord_x,coord_y,coord_z). Find the maximum and minimum values of this function.
    7
    Mon Feb 24
    • Find your partner for this week's tasks and journal, and sit next to them.
    • Read the Good Problems handout on Logic. Starting with your week 7 journal, you need to make your logic clear and use logical connectives.
    • Read about missing values, t(), rowMeans(), and colMeans().
    • Partner ratings:
      • From the StatisticalComputing project, get the file ratingmatrix.data. Each row is a rater and each column is the person rated. (Do not try to figure out which row/column is you.)
      • Compute the mean rating each person received, and the mean rating each person gave.
      • If both partners rated accurately, then the sum of their ratings of each other would be 5. A generous person would tend to produce higher sums and a greedy person lower sums. Measure the extent to which each rater is generous/greedy.
      • From the above measures and/or any other measures you think are relevant, compute a score for each person.
      • Decide a system for translating a score to an adjustment on the overall journal grade for a student. The system must be net neutral, meaning that the sum of the adjustments must be zero. Explain why you think your system is fair.
    Tue Feb 25 8am journal for Feb 17-21 due.
    Wed Feb 26
    • Rate your partner from last week.
    • Monte Carlo estimate of the expected Ni fine:
      • Formulate the expected value of the Ni fine as an integral of some function over some random variable.
      • Use Monte Carlo integration to estimate this integral. (Don't break the sage cloud.)
    • Bootstrapping:
      • Read about Bootstrapping. Summarize the method in your own words.
      • Read about the boot package and its functions boot() and boot.ci().
      • Use boot() and boot.ci() to study the empirical distribution of the fine. Is your Monte Carlo sample of the fine sufficient to determine the expected fine within a reasonably narrow range? What else do you observe?
    Fri Feb 28
    • Look at StatisticalComputing/Projects/guidelines.sagews. Over break, think about what you would like to do for your final project.
    • 5530 students: Should we really have a final exam, or should we substitute something else? Any substitute should be done alone and should be beyond what the undergaduates are required to do. Possibilities include:
      • You do your project individually and include an analysis of the theory behind one of the computatonal techniques you used. This would count as both your project and exam.
      • You do a short paper with the background and theory of a computational technique. This would replace your exam.
      Comment and/or make suggestions in your exploration.
    • Jackknife:
      • Read about the Jackknife. Summarize the method in your own words.
      • From our Monte Carlo data on the fine for Ni, calculate the Jackknife estimate of the expected fine. How does it compare to the original Monte Carlo estimate of the expected fine?
      • Compute the Jackknife estimate of the standard error and compare it to the bootstrap estimate of the standard error.
    Spring Break
    8
    Mon Mar 10
    • Find your partner for this week's tasks and journal, and sit next to them.
    • Upload talktemplate.tex and OHIOCLR.pdf; open talktemplate.tex.
    • Look in StatisticalComputing/partners.sagews for your topic. Modify talktemplate.tex into a short but coherent talk on this topic. Your main product is the resulting slides; in your journal comment briefly on problems you encountered, cool things you learned, etc.
    Tue Mar 11 8am journal for Feb 24-28 due.
    Wed Mar 12
    • Rate your partner from week 7.
    • Miscellaneous plotting tools:
      • Read about qqplot(). Generate
        • samples from a normal distribution,
        • samples from a normal distribution but with different mean and variance, and
        • samples from a different distribution (such as gamma).
        Apply qqplot() to each pair of data and describe what you observe.
      • Apply pairs() to the Destructive data set and describe what you observe.
      • Plot Ni versus Cr and describe what you observe. Read about the cloud() function in the package lattice. Plot coord_z versus Ni and Cr and describe what you observe.
    Fri Mar 14 Instead of coming to class, go to the \(\pi\)-day poster session and/or the Colloquium. Write about it in your journal.
    9
    Mon Mar 17 Work on figuring out what to do your project on and with whom you will work. Once that is settled, start working on the project.
    Tue Mar 18 8am journal for Mar 10-14 due.
    Wed Mar 19
    • Rate your partner from last week.
    • From now on you will work with your project partner(s). (If you are doing your project alone, you can work with a partner on the journals.)
    • Read the Good Problems handout on Flow. Starting with this journal, pay extra attention to using complete sentences and paragraphs and having text bind your journal together.
    • Markov Chain Monte Carlo methods:
      • Read about Markov Chain Monte Carlo methods. Read about it again, this time slower and more carefully. Summarize in your own words.
      • Read about the Metropolis-Hastings algorithm. Summarize in your own words.
      • Run install.packages("mcmc") to install the mcmc package and then library(mcmc) to load it. (If you have trouble, see the manual install instructions and use the lib.loc option in library().) Read about its metrop() function.
      • Let \(f(x)=\left\{\begin{array}{ll} 1-|x| & -1\le x \le 1\\ 0 & \text{otherwise}\end{array}\right.\). Use metrop() to produce samples from the distribution \(f\). Plot the samples produced versus their index to see how the Markov Chain moves around. Plot the resulting density to see how well the sampling worked.
      • Let \(g_A(x) = \frac{f(x) + f(x-A)}{2}\). Use metrop() to produce samples from the distribution \(g_A\) for \(A=1,3,5\). For each \(A\) plot the samples versus their index and the resulting densities. Explore the options to metrop() to make the sampling and resulting densities better.
    Fri Mar 21 Work on your project. Set some goals.
    10
    Mon Mar 24
    • Gibbs sampling:
      • Read about Gibbs sampling. Summarize in your own word.
      • Write a Gibbs sampler to construct samples from the uniform distribution on a disc of radius 1. Plot the resulting points.
      • Read about the kde2d() function in the MASS package; apply it to your samples. Plot the resulting density using contour(), image(), and persp().
    • Let f(x,y)=unif(x,min=0,max=5)unif(y,min=x,max=2x+1). By a method of your choice, construct samples from this distribution. Plot the points and/or density by a method of your choice.
    Tue Mar 25 8am: I will look at what you have so far for your project, but not grade anything. (No exploration due.)
    Wed Mar 26 Work on your project.
    Fri Mar 28 (drop deadline with WP/WF)
    • Reverse-engineering data:
      • Get the discrete data set unknowndiscrete.dat. Determine which discrete probability distribution (one of binomial, geometric, hypergeometric, poisson, negative binomial) was used to generate it and with what parameters.
      • Get the continuous data set unknowncontinuous.dat. Determine which continuous probability density (one of uniform, normal, gamma, beta) was used to generate it and with what parameters.
    11
    Mon Mar 31 Work on your project.
    Tue Apr 1 8am journal for Mar 19, 24, and 28 due.
    Wed Apr 2
    • Read the Good Problems handout on Graphs. Starting with this journal, pay extra attention to titles, labels, and legends on your graphs.
    • Maximum Likelihood Estimators:
      • Read about Maximum Likelihood Estimators. Summarize in your own words.
      • Read about the mle() function in the stats4 package.
      • Using the continuous data set unknowncontinuous.dat that we studied last week, select the density type that you determined and use mle() to compute the maximum likelihood estimate of the parameter(s). Plot the density function using these parameters along with the density from the data to see how well it worked.
      • Using the discrete data set unknowndiscrete.dat that we studied last week, select the distribution type that you determined and use mle() to compute the maximum likelihood estimate of the continuous parameters. If the distribution type you selected has discrete parameters (like n), then fix them using the 'fixed' option and manually try a few values near your estimate from last week; which value is most likely to be true? Plot the distribution function using the parameters you determined and the normalized table from the data to see how well it worked.
    Fri Apr 4
    • Work on your project.
    • 5530 students look at the template finaltemplate.tex for your final exam/paper and discuss with me.
    12
    Mon Apr 7
    • Expectation Maximization:
      • Read about the EM algorithm and especially its use for Gaussian mixtures. Summarize the method in your own words.
      • Install the package mclust, load it, and read about its em() function.
      • Generate samples from the (Gaussion mixture) density 0.3*normal(0,1)+0.7*normal(5,2). Forget that you know the parameters 0.3, 0, 1, 0.7, 5, and 2, and use em() to try to recover them.
    Tue Apr 8 8am rough draft of project report due; it counts as a journal. (No exploration due.)
    Wed Apr 9 Work on your project.
    Fri Apr 11
    • More miscellaneous plotting tools:
      • Plot the data x=c(1,2,4,5,9) y=c(0,-1,3,3,1) using each of the 9 options for type. Use layout() to show in a 3x3 grid.
      • Pick your favorite type (not "n") and plot with a title, subtitle, x-label, y-label, and larger x and y limits.
      • Read help(plotmath). Repeat the above plot, now with some of the labels mathematical expressions; use at least some powers and greek letters.
      • Repeat the above plot, making it colorful. Use abline() to add a thick green line \(y=0.5x-1\).
      • Read about pie(), barplot(), and rainbow(). Make a pie chart and a barplot of the y values colored by rainbow; use layout() to show them side by side.
      • Use boxplot() to make a boxplot of x and y. Title and label it.
      • Make a histogram of the x values and a second histogram with too many breaks; use layout() to show them side by side.
      • Plot the density of the x values and a second density with too small width parameter; use layout() to show them side by side.
    13
    Mon Apr 14 Project presentations
    Wed Apr 16 Project presentations
    Fri Apr 18 Project presentations
    14
    Mon Apr 21
    • Create a data set D with samples from the uniform distribution on the unit disc.
    • Create a data set G with samples from the two-dimensional normal distribution with \(\mu=[0,0]\) and \(\Sigma=\left[\begin{array}{cc}1&0\\ 0&1\end{array}\right]\).
    • For each of these:
      • Determine (by thinking) the density function for the x values and the density function for the y values. Are they the same?
      • Determine (by thinking) whether x and y are independent.
      • Test computationally whether or not x and y have the same distribution.
      • Test computationally whether or not x and y are independent.
    Wed Apr 23
    • Work on your project.
    Thu Apr 25 8am journal for Apr 2, 7, 11, and 21 due. (5530 students: draft of final paper due; counts as an exploration.)
    Fri Apr 25 Clean-up
    15
    Fri May 2 Final Exam 3:10-5:10pm in our regular classroom. Project reports and slides due. (5530 papers due.)

    Martin J. Mohlenkamp
    Last modified: Tue May 6 13:33:07 EDT 2014