This web page describes an activity within the Department of Mathematics at Ohio University, but is not an official university web page.
If you have difficulty accessing these materials due to visual impairment, please email me at mohlenka@ohio.edu; an alternative format may be available.

MATH 4530-100 (4216), Spring 2016

Statistical Computing

Catalog Description:

Introduction to computational statistics; Monte Carlo methods, bootstrap, data partitioning methods, EM algorithm, probability density estimation, Markov Chain Monte Carlo methods.

Desired Learning Outcomes:

Students will be able to:

Generate distributions by various methods.

Use computer-intensive methods for estimation and hypothesis testing.

Conduct data analysis using one or more major statistical models.

Requisites:

MATH 4500 Theory of Statistics

Instructor:

Martin J. Mohlenkamp, mohlenka@ohio.edu, (740)593-1259, 315-B Morton Hall.
Office hours: Monday, Wednesday, and Friday 10:45-11:40am, or by appointment.

Web page:

http://www.ohiouniversityfaculty.com/mohlenka/20162/4530-5530.

Class hours/location:

Monday, Wednesday, and Friday 9:40-10:35am in 314 Morton Hall.

Text:

None. We will scavenge materials from the internet.

Computational Resources:

We will do our computations in the Sagemath Cloud using the language R.

Tests:

None.

Journal:

Each week you will submit a journal documenting that you have performed the requested tasks and learned. Typical components are:

reflections, comments, and questions about topics you were asked to read and learn about;
solutions to specific problems (similar to traditional homework), with explanations; and
analysis and conclusions on open-ended questions.

Writing quality counts: We will use the Good Problems method of gradually increasing the criteria, and its writing guides on Layout, Logic, Flow, Intros, Symbols, and Graphs.
Extensions: Journals are due at specified times and will be graded soon thereafter. If you email me before I have graded it, you can have a 24 hour extension, with penalty 5%. You can get further extensions with penalty 5% per 24 hours.
Partners and Ratings: You will work with a partner (assuming an even number of students) and submit a single journal. You will rate the relative contribution of your partner and these ratings will be used to adjust the journal scores at the end of the term, using a statistical analysis.

Final Project:

You will individually do a final project to produce materials that could have been used for one day's topic/lesson in this class, but were not used here or in the 2014 version of this class. You will produce:

A Sage worksheet with:
- A description of the topic, why you think it would be a good one to include in this class, and the sources you used. (20%)
- The Topic/Materials/Tasks as they would appear in the schedule. (10%)
- A complete solution to the tasks. (50%; writing quality counts)
An oral presentation. (10%)
Presentation slides, in $\LaTeX$ using the beamer class. (10%)

(If you have an idea for a different final project that you would prefer, you can propose it.)

Final Exam:

The final exam is scheduled for Friday April 29 1-3 pm. Your final project will substitute for this exam, and be due at the scheduled ending time of the exam.

Attendance:

This is a "lab" class, so your attendance, participation, and collaboration are essential. You are allowed 4 absences (out of 41 classes) without penalty; these include university excused absences for illness, death in the immediate family, religious observance, jury duty, or involvement in University-sponsored activities. Each additional absence will reduce your final average by 0.5%. Your attendance record will be available in Blackboard.

Grade:

Your grade is based on journals 80% and final project 20%. Your journal average is adjusted by the ratings you receive from your partners and your overall average may be penalized due to excessive absences. An average of 90% guarantees you at least an A-, 80% a B-, 70% a C-, and 60% a D-.

Academic Misconduct:

Your work must be done by you, not by someone else for you. Your words must be your own; any text not your own must be properly quoted and cited. You can ask others questions, look in books, use resources from the internet, and generally use whatever help you can find; such help must be acknowledged by naming the person, citing the book, giving the internet link, etc. A minor, first-time violation of this policy will receive a warning and discussion and clarification of the rules. Serious or second violations will result in a grade penalty on the assignment. Very serious or repeated violations will result in failure in the class and be reported to the Office of Community Standards and Student Responsibility, which may impose additional sanctions. You may appeal any sanctions through the grade appeal process.

Special Needs:

If you have specific physical, psychiatric, or learning disabilities and require accommodations, please let me know as soon as possible so that your learning needs may be appropriately met. You should also register with Student Accessibility Services to obtain written documentation and to learn about the resources they have available.

Game:

We will attempt the competitive mathematical game False Alarms in a Sensor Network. If our results are good, we will enter the competition; the deadline is April 30th, 2016. (In the 2014 version of the game, some of us entered and did well.)

Learning Resources:

For quick computations, you can use the Sage cell.

There is a "sagemath" app for mobile devices.
R:
- R home
- UCLA's Institute for Digital Research and Education; R resources and more.
- Wikipedia R
- R programming Wikibook
- Codeschool's Try-R minicourse.
Statistical Computing classes/texts (using R):
- Maria Rizzo Textbook information and code.
- Pittsburgh 2011 has 12 lecture note presentations and 4 homework sets.
- Carnegie Mellon 2013 has about 20 lecture notes, 11 homeworks, and 11 labs.
- Johns Hopkins 2004 has some notes and homeworks; it has a Computer Science emphasis.

MATH 5530-100 (4228), Spring 2016

For students enrolled in MATH 5530, the above syllabus is modified as follows:

Requisites:

MATH 5500 Theory of Statistics

Explorations:

Most weeks you will have a small, additional, individual, open-ended task. These tasks will help prepare you for your final project.

Final project:

You will individually do a final project to validate (or invalidate) a recent published work in statistical computing. You will produce:

A Sage worksheet with:
- A description of the topic in the paper, a summary of the methods and claims, and your analysis of the correctness of the claims. (30%; writing quality counts)
- Your tests of their methods and claims and validations (reproductions) or invalidations of their numerical results. (50%; writing quality counts)
An oral presentation. (10%)
Presentation slides, in $\LaTeX$ using the beamer class. (10%)

(If you have an idea for a different final project that you would prefer, you can propose it.)

Grade:

Your grade is based on journals 70%, explorations 10%, and project 20%.

Schedule

Subject to change. Some tasks will be filled in as we go along.

Week	Date	Topic/Materials/Tasks
1
	Mon Jan 11	Introduction, syllabus, etc. Get set up on the Sagemath Cloud: Use Firefox (or Chrome), not Internet Explorer. Sign up for a free account using your real name and your University email address (@ohio.edu). Sign in. Click on "Help" in the upper left and read about it. Look for a project in your account titled with your name. I created this and shared it with you. Your submissions as an individual, such as your autobiography, go here. Look for a project "StatisticalComputing". I will put things here for the whole class to use. Do your autobiography: Familiarize yourself with the markdown language documentation and extensions. Familiarize yourself with the html language documentation. Familiarize yourself with the $\LaTeX$ Wikibook. Upload the autobiography file to your project. Edit it to make your autobiography.
	Wed Jan 13	Find your partner for this week's tasks and journal (see StatisticalComputing/partners.sagews), and sit next to them. One of you create a new project using the naming convention "Week Firstname and Firstname". For example, if your name is Hillary and your partner is Donald, then name it "1 Hillary and Donald". Hit the "Settings" button, look under "Collaborators", search for the other person by email, and add them as a collaborator. Search for me as mohlenka@ohio.edu and add me as a collaborator. Hit the "New" button, pick a name (such as "1Journal"), and hit the "SageMath Worksheet" button. This will create the file for your journal this week. You can both edit it simultaneously. (I can access this file to grade it so you do not have to send it to me.) Use markdown (or html) to put in your names, a title, the course and the week. Introduction to R: Read What is R? Skim the FAQs. Notice the manuals and familiarize yourself with An Introduction to R. You will refer to these manuals a lot. Become familiar with the all-important help() and help.search() functions. Try `help(Syntax)`, `help(Arithmetic)`, `help(Comparison)`, `help(Extract)`, and `help(Control)`. In your journal, write a list of 10 interesting things you learned about R and link to where you learned them. Upload the sage worksheet basics.sagews. Run each cell of code and observe/guess what it is doing. In your journal, run 10 different R commands in 10 different cells. Briefly explain what each one does.
	Thu Jan 14 8am autobiography (counts as a journal) due.
	Fri Jan 15	Plotting warm up: Read the Good Problems handout on Graphs. Read about Graphical Procedures in R, especially plot(), points() and lines(). Upload the sage worksheet r-plotting.sagews. Read through it, evaluating each cell. In your journal: Plot the data `x<-c(1,2,4,5,9)` `y<-c(0,-1,3,3,1)` using each of the 9 options for type. Use layout() to show in a 3x3 grid. Pick your favorite type (not "n") and plot with a title, subtitle, x-label, y-label, and larger x and y limits. Read help(plotmath). Repeat the above plot, now with some of the labels as mathematical expressions; use at least some powers and Greek letters. Repeat the above plot, making it colorful. Use abline() to add a thick green line $y=0.5x-1$. Read about table(), pie(), barplot(), and rainbow(). Make a pie chart and a barplot of the frequencies of the y values colored by rainbow; use layout() to show them side by side. Use boxplot() to make a boxplot of x and y. Title and label it. MATH 5530 Exploration: Download an article from Computational Statistics & Data Analysis or Statistics and Computing published in 2015 or 2016. (You need to be on-campus or use the proxy server through the library to download.) Upload the paper to your individual project. Read the abstract and introduction and skim the rest of the paper. In a sage worksheet titled "1Exploration", include: The full bibliographic information on the article and the name of the pdf file. A one-paragraph (about 10 sentence) summary in your own words of the topic of the paper.
2
	Mon Jan 18	Martin Luther King, Jr. Day holiday
	Wed Jan 20	Find your partner for this week's tasks and journal, sit next to them, and set up a project for your journal. Read about head(), summary(), mean(), var(), paste(), and print(). Common discrete probability distributions: Read about the R functions for the binomial, geometric, hypergeometric, poisson, and negative binomial distributions. For each of the five above distributions: Choose some parameters (not trivial like 0 or 1). Use the d....() function to generate the probability distribution function and plot it. Use the r....() function to generate 1000 data points. Use table() and lines() (or points()) and appropriate scaling to plot it on the same graph as the distribution function, so that they approximately match. Remember to title your graph. Use mean() and var() to check the mean and variance of the data; compare to the theoretical values. (Note that the Wikipedia and R definitions sometimes differ, such as switching successes and failures.)
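As one illustration of the comparison described in the Jan 20 tasks, the following is a minimal sketch using the binomial distribution. The parameters (`size = 10`, `prob = 0.3`), the seed, and the variable names are arbitrary choices made for this example, not values prescribed by the course.

```r
# Sketch: compare simulated binomial data with the theoretical distribution.
# The parameters and seed below are arbitrary illustrative choices.
set.seed(1)
size <- 10
prob <- 0.3
k <- 0:size

pmf <- dbinom(k, size = size, prob = prob)       # theoretical probabilities
draws <- rbinom(1000, size = size, prob = prob)  # 1000 simulated data points

# Relative frequencies of each value 0..size, on the same scale as pmf.
freq <- table(factor(draws, levels = k)) / length(draws)

plot(k, pmf, type = "h", ylim = c(0, max(pmf, freq)),
     main = "Binomial(10, 0.3): theory vs. 1000 samples",
     xlab = "k", ylab = "probability")
points(k, as.numeric(freq), col = "red")

# Sample moments next to the theoretical values size*prob and size*prob*(1-prob).
c(mean = mean(draws), theory = size * prob)
c(var = var(draws), theory = size * prob * (1 - prob))
```

Dividing the table by the sample size is the "appropriate scaling" step: it turns counts into relative frequencies so they can share an axis with the probabilities.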
	Thu Jan 21 8am journal for Jan 13-15 due. 5530 exploration from Jan 15 due.
	Fri Jan 22	(drop deadline) From the StatisticalComputing project copy ratings.sagews to your individual project. In it rate your partner on the journal due yesterday. Read about density(). Common continuous probability densities: Read about the R functions for the uniform, normal, gamma, and beta distributions. For each of the four above densities: Choose some parameters (not trivial like 0 or 1). Use the d....() function to generate the probability density function and plot it. Use the r....() function to generate 1000 data points. Use density() and lines() to plot it on the same graph as the distribution function. Remember to title your graph. Use mean() and var() to check the mean and variance of the data; compare to the theoretical values. MATH 5530 Exploration: Repeat the exploration from Jan 15 with a new article. Identify one concept or method that you are not familiar with. Research it (usually Wikipedia is sufficient) and write a paragraph explaining it. Cite and link to your sources.
3
	Mon Jan 25	Find your partner for this week's tasks and journal, sit next to them, and set up a project for your journal. Read the Good Problems handout on Flow. Starting with this journal, be sure to use complete sentences and paragraphs and have text to bind your journal together. Making functions: Read about writing your own functions, return(), if(), sapply(), length(), numeric(), and while(). Implement the function $f(x)=\left\{\begin{array}{ll} 1-|x| & -1\le x \le 1\\ 0 & \text{otherwise}\end{array}\right.$ and plot it on $[-2,2]$. Find the explicit formula for the function $F(x)=\int_{-\infty}^x f(t)dt$, implement it, and plot it on $[-2,2]$. Find the explicit formula for the function $F^{-1}(y)$, implement it, and plot it on an appropriate interval. Custom densities: Read about Inverse transform sampling. Summarize the method in your own words. Use this method to generate samples from the probability density function $f(x)$ above. Show that it worked. Read about Rejection sampling. Summarize the method in your own words. Use this method to generate samples from the probability density function $f(x)$ above. Show that it worked.
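The two sampling methods named in the Jan 25 tasks can be sketched as follows for the triangular density $f$. This is one possible approach, not a reference solution: the inverse used here, $F^{-1}(y)=\sqrt{2y}-1$ for $y\le 1/2$ and $1-\sqrt{2(1-y)}$ otherwise, comes from integrating $f$ by hand, and the names `Finv`, `x_inv`, and `x_rej`, the seed, and the sample sizes are choices made for this example.

```r
# Triangular density on [-1, 1]; vectorized with pmax instead of if().
f <- function(x) pmax(1 - abs(x), 0)

# Inverse of the cumulative distribution function F, for y in [0, 1].
Finv <- function(y) ifelse(y <= 0.5, sqrt(2 * y) - 1, 1 - sqrt(2 * (1 - y)))

set.seed(2)
n <- 10000

# Inverse transform sampling: apply Finv to uniform(0,1) draws.
x_inv <- Finv(runif(n))

# Rejection sampling: propose uniformly on [-1, 1] and accept a proposal x
# when a uniform height on [0, 1] falls below f(x) (f has maximum 1).
proposal <- runif(3 * n, min = -1, max = 1)
height <- runif(3 * n)
x_rej <- proposal[height <= f(proposal)]

# Overlay the estimated densities on the target density.
curve(f, from = -2, to = 2, main = "Triangular density: target and samples",
      ylab = "density")
lines(density(x_inv), col = "blue")
lines(density(x_rej), col = "red")
```

About half of the proposals are accepted, since the area under $f$ is 1 and the bounding box has area 2; that is why three times as many proposals as desired samples are drawn here.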
	Tue Jan 26 8am journal for Jan 20-22 due. Rate your partner.
	Wed Jan 27	Follow these instructions to install the R package "mcmc". (Let me know if it fails.) Markov Chain Monte Carlo methods: Read about Markov Chain Monte Carlo methods. Read about it again, this time slower and more carefully. Summarize in your own words. Read about the Metropolis-Hastings algorithm. Summarize in your own words. Do `library(mcmc)` to load the mcmc package. Do `help(metrop)` to read about its metrop() function. (If library() fails, you may need to use its lib.loc option.) Let $f(x)=\left\{\begin{array}{ll} 1-|x| & -1\le x \le 1\\ 0 & \text{otherwise}\end{array}\right.$. Use metrop() to produce samples from the distribution $f$. Plot the samples produced versus their index to see how the Markov Chain moves around. Plot the resulting density to see how well the sampling worked. Let $g_A(x) = \frac{f(x) + f(x-A)}{2}$. Use metrop() to produce samples from the distribution $g_A$ for $A=1,3,5$. For each $A$ plot the samples versus their index and the resulting densities. Explore the options to metrop() to make the sampling and resulting densities better.
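To show what a Metropolis sampler does internally, here is a hand-written random-walk Metropolis sketch in base R for the density $f$. It deliberately does not call `metrop()` from the mcmc package; the function name `metropolis`, the proposal step size `0.5`, the starting point, and the chain length are all assumptions made for illustration.

```r
# Hand-written random-walk Metropolis sampler (NOT the mcmc package's metrop()).
f <- function(x) pmax(1 - abs(x), 0)

metropolis <- function(target, n, start = 0, step = 0.5) {
  chain <- numeric(n)
  current <- start
  for (i in seq_len(n)) {
    candidate <- current + rnorm(1, sd = step)   # symmetric proposal
    # Accept with probability min(1, target(candidate) / target(current)).
    if (runif(1) < target(candidate) / target(current)) {
      current <- candidate
    }
    chain[i] <- current
  }
  chain
}

set.seed(3)
chain <- metropolis(f, n = 20000)

layout(matrix(1:2, 1, 2))
plot(chain[1:2000], type = "l", xlab = "index", ylab = "sample",
     main = "First 2000 steps")
plot(density(chain), main = "Estimated density")
curve(f, from = -2, to = 2, add = TRUE, col = "red")
```

Because the chain starts where the target is positive and candidates with zero density are never accepted, the ratio in the acceptance step is always well defined. For a bimodal target such as $g_A$ with large $A$, a step this small would rarely cross the gap between the modes, which is the kind of behavior the index plot makes visible.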
	Thu Jan 28 8am 5530 exploration from Jan 22 due.
	Fri Jan 29	Catch up. MATH 5530 Exploration: Repeat the exploration from Jan 22 with a new article. Identify the specific claims made in the paper (such as that their method is better than existing methods for certain problems).
4
	Mon Feb 1	Find your partner for this week's tasks and journal, sit next to them, and set up a project for your journal. Read about read.table(), qqplot(), qqnorm(). Maximum Likelihood Estimators: Read about Maximum Likelihood Estimators. Summarize in your own words. Load the `stats4` package using `library` and read about its `mle` function. Get the continuous data set unknowncontinuous.dat. Through plots or measurements, determine which continuous probability density (one of uniform, normal, gamma, beta) was used to generate it. Use mle() to compute the maximum likelihood estimate of the parameter(s). Plot the density function using these parameters along with the density from the data to see how well it worked. Get the discrete data set unknowndiscrete.dat. Through plots or measurements, determine which discrete probability distribution (one of binomial, geometric, poisson, negative binomial) was used to generate it. Use mle() to compute the maximum likelihood estimate of the continuous parameters. If the distribution type you selected has discrete parameters (like n), then fix them using the 'fixed' option and manually try a few values that seem reasonable from the plots; which value is most likely to be true? Plot the distribution function using the parameters you determined and the normalized table from the data to see how well it worked.
	Tue Feb 2 8am journal for Jan 25-29 due. Rate your partner.
	Wed Feb 3	Read about `sample` and `replicate`. Generate 10000 samples from the (Gaussian mixture) density 0.3 normal(0,1) + 0.7 normal(5,2), meaning that a sample has 0.3 probability of coming from normal(0,1) and 0.7 probability of coming from normal(5,2). Plot the density to make sure it looks correct. Expectation Maximization: Read about the EM algorithm and especially its use for Gaussian mixtures. Summarize the method in your own words. Install the package mclust, load it, and read about its `em` function. Forget that you know the parameters 0.3, 0, 1, 0.7, 5, and 2, and use `em` to try to recover them from the data you generated above.
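A minimal sketch of the Feb 3 mixture experiment follows. It generates the mixture samples and then runs a hand-written EM loop in base R rather than the `em` function from mclust. Two assumptions are made here: the second argument of normal(5,2) is read as a standard deviation (the `rnorm` convention), and the starting guesses and the fixed count of 300 iterations are arbitrary.

```r
# Sketch: sample from the two-component mixture, then run a hand-written
# EM iteration (NOT mclust's em()). Assumes the second parameter of each
# normal is a standard deviation, following rnorm's convention.
set.seed(4)
n <- 10000
component <- sample(1:2, n, replace = TRUE, prob = c(0.3, 0.7))
x <- rnorm(n, mean = c(0, 5)[component], sd = c(1, 2)[component])

plot(density(x), main = "Mixture samples")

# Crude starting guesses for the weights, means, and standard deviations.
w <- c(0.5, 0.5)
mu <- as.numeric(quantile(x, c(0.25, 0.75)))
s <- c(sd(x), sd(x))

for (iter in 1:300) {
  # E-step: responsibility of each component for each data point.
  d1 <- w[1] * dnorm(x, mu[1], s[1])
  d2 <- w[2] * dnorm(x, mu[2], s[2])
  r1 <- d1 / (d1 + d2)
  r2 <- 1 - r1
  # M-step: responsibility-weighted updates of the parameters.
  w <- c(mean(r1), mean(r2))
  mu <- c(sum(r1 * x) / sum(r1), sum(r2 * x) / sum(r2))
  s <- c(sqrt(sum(r1 * (x - mu[1])^2) / sum(r1)),
         sqrt(sum(r2 * (x - mu[2])^2) / sum(r2)))
}

round(rbind(weight = w, mean = mu, sd = s), 2)
```

Indexing `c(0, 5)[component]` picks the mean for each draw according to its component label, which implements "0.3 probability of coming from normal(0,1)" directly.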
	Thu Feb 4 8am 5530 exploration from Jan 29 due.
	Fri Feb 5	Catch up. MATH 5530 Exploration: Repeat the exploration from Jan 29 with a new article. List the numerical experiments in the paper and identify which you might be able to reproduce to (in)validate the claims of the paper.
5
	Mon Feb 8	Find your partner for this week's tasks and journal, sit next to them, and set up a project for your journal. Read the Good Problems handout on Introductions and Conclusions. Starting this week's journal, you need to include an introduction and a conclusion. Read about matrix(). Gibbs sampling: Read about Gibbs sampling. Summarize in your own words. Write a Gibbs sampler to construct samples from the uniform distribution on a disc of radius 1. Plot the resulting points. Write a Gibbs sampler to construct samples from the distribution f(x,y) whose marginal in x is unif(min=0,max=5) independent of y and whose marginal in y is unif(min=x,max=2x+1) for x in [0,5]. Plot the resulting points.
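For the first Gibbs sampling task of Feb 8, one possible sketch is below. It relies on the fact that, for the uniform distribution on the unit disc, the conditional distribution of $x$ given $y$ is uniform on $[-\sqrt{1-y^2},\sqrt{1-y^2}]$, and symmetrically for $y$ given $x$. The function name `gibbs_disc`, the starting point, and the sample size are choices made for this example.

```r
# Gibbs sampler for the uniform distribution on the unit disc.
gibbs_disc <- function(n, x0 = 0, y0 = 0) {
  pts <- matrix(0, nrow = n, ncol = 2, dimnames = list(NULL, c("x", "y")))
  x <- x0
  y <- y0
  for (i in seq_len(n)) {
    # Given y, x is uniform on the horizontal chord of the disc at height y.
    half <- sqrt(1 - y^2)
    x <- runif(1, min = -half, max = half)
    # Given x, y is uniform on the vertical chord at position x.
    half <- sqrt(1 - x^2)
    y <- runif(1, min = -half, max = half)
    pts[i, ] <- c(x, y)
  }
  pts
}

set.seed(5)
pts <- gibbs_disc(5000)
plot(pts, asp = 1, pch = ".", main = "Gibbs samples on the unit disc")
```

Setting `asp = 1` keeps the axes on the same scale so the disc is not drawn as an ellipse.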
	Tue Feb 9 8am journal for Feb 1-5 due. Rate your partner.
	Wed Feb 10	Multivariate Normal distributions: Read about the Multivariate Normal distribution and the methods for drawing values from it. Summarize in your own words. Install the `mvtnorm` package and read about its `rmvnorm`, `pmvnorm`, `qmvnorm`, and `dmvnorm` functions. Read about the `persp`, `contour`, and `image` functions and the `wireframe` function in the `lattice` package. Let $\mu=[1,2]$ and $\Sigma=\left[\begin{array}{cc}1&0.8\\ 0.8&1\end{array}\right]$. Generate 1000 samples from the normal$(\mu,\Sigma)$ using `rmvnorm` and plot them. Use `colMeans` and `var` to check $\mu$ and $\Sigma$. Evaluate the density function for normal$(\mu,\Sigma)$ using `dmvnorm` on a grid including $[-2,4]\times [-1,5] $. Plot using `persp`, `contour`, `image`, and `wireframe`.
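One of the standard methods for drawing multivariate normal values is to transform independent standard normals with a Cholesky factor of $\Sigma$. The sketch below does this by hand in base R as a stand-in for `rmvnorm` (which it does not call), using the $\mu$ and $\Sigma$ from the Feb 10 task; the seed and variable names are choices made for this example.

```r
# Sketch: draw from normal(mu, Sigma) by hand with a Cholesky factor
# (this stands in for mvtnorm::rmvnorm, which is not used here).
set.seed(6)
mu <- c(1, 2)
Sigma <- matrix(c(1, 0.8, 0.8, 1), nrow = 2)

n <- 1000
R <- chol(Sigma)                       # upper triangular, t(R) %*% R == Sigma
z <- matrix(rnorm(n * 2), ncol = 2)    # independent standard normals
samples <- sweep(z %*% R, 2, mu, "+")  # rows have covariance t(R) %*% R

plot(samples, xlab = "x", ylab = "y",
     main = "1000 samples from normal(mu, Sigma)")
colMeans(samples)   # should be close to mu
var(samples)        # should be close to Sigma
```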
	Thu Feb 11 8am 5530 exploration from Feb 5 due.
	Fri Feb 12	Create a data set D with samples from the uniform distribution on the unit disc. Create a data set G with samples from the two-dimensional normal distribution with $\mu=[0,0]$ and $\Sigma=\left[\begin{array}{cc}1&0\\ 0&1\end{array}\right]$. For each of these: Determine (by thinking) the density function for the x values and the density function for the y values. Are they the same? Determine (by thinking) whether x and y are independent. Test computationally whether or not x and y have the same distribution. Test computationally whether or not x and y are independent. MATH 5530 Exploration: Repeat the exploration from Feb 5 with a new article. Identify an R function that we have not used in this class and would likely be useful in trying to validate the results in this paper. Run a very simple calculation using this function.
6
	Mon Feb 15	Find your partner for this week's tasks and journal, sit next to them, and set up a project for your journal. From the StatisticalComputing project, copy the files 2161_MATH2301_100key.sagews, 2161_MATH2301_100grades.csv, and 2161_MATH2301_100perfect.csv. Look at them. Do not share them with anyone outside this class. Read about data frames, `read.csv`, `summary`, `head`, `factor`, `levels`, `with`, `names`, `sort`, and `NA`. Use `read.csv` to load the data in 2161_MATH2301_100grades.csv as a data frame named `grades` and the data in 2161_MATH2301_100perfect.csv as a data frame named `perfect`. Apply `head` to both data frames and `summary` to grades and interpret the results. Plot `grades$grade` and note that the grades are in alphabetical order, rather than their natural order. Use `factor` and its `levels` option to replace `grades$grade` with itself but with the grades in the order `c("A","A-","B+","B","B-","C+","C","C-","D+","D","D-","F","FS","WP","WF","Z")`. Replot to show that it worked. (You can use the option `cex.names=0.75` to shrink the labels so they all show.) Similarly, fix the order of the levels of `grades$Level` to put them in their natural order and plot to show that it worked. Similarly, fix the order of the levels of `grades$College` to put them in order from most students to fewest students. (Use `table`, `sort`, and `names` to automatically find the correct order of colleges.) Plot to show that it worked. (Start using `with`.) Plot `grade` versus `avg` and `avg` versus `grade`. Explain how to read the plots and which plot you think is most useful. Interpret what the plot tells you about how the students did. Similarly, plot, explain, and interpret using `Level` and `avg`. Similarly, plot, explain, and interpret using `Level` and `grade`. Similarly, plot, explain, and interpret using `College` and `grade`. Use `table` to make a contingency table of `College` and `grade`.
	Tue Feb 16 8am journal for Feb 8-12 due. Rate your partner.
	Wed Feb 17	Read about vector manipulations, selecting subsets of the data, `subset`, `is.na`, `%in%`, `rbind`, `legend`, and `length`. Using `subset`, create the following data frames: `finished`: students who took the final exam. `unfinished`: students who did not take the final exam. `ABC`: students with grades in `c("A","A-","B+","B","B-","C+","C","C-")`. `DFW`: students with grades in `c("D+","D","D-","F","FS","WP","WF","Z")`. Use `summary`, `table`, or another method to show that you got the subsets you wanted. Show the distribution of `finished$Level` and `unfinished$Level` within a single barplot. Each level (like "Freshman") should have two bars (use the "beside" option). Make the plot colorful and include a legend. Interpret the results. Similarly, show the distributions of `Level` for `ABC` and `DFW` and interpret. Similarly, show the distribution of `College` for `finished`/`unfinished` and for `ABC`/`DFW`; interpret. Plot `Level` versus `College` for the `ABC` data frame and (separately) for the `DFW` data frame. Interpret the results.
	Thu Feb 18 8am 5530 exploration from Feb 12 due.
	Fri Feb 19	Summarize what you have learned so far from this data set. How could the Mathematics department use this information to make MATH 2301 better? MATH 5530 Exploration: Repeat the exploration from Feb 12 with a new article.
7
	Mon Feb 22	Find your partner for this week's tasks and journal, sit next to them, and set up a project for your journal. Read the Good Problems handout on Logic. Starting with this week's journal, be sure to make your logic clear and use logical connectives. Linear Models: Read about linear models. Summarize the method in your own words. Read about linear models in R, `lm`, `fitted`, `coefficients`, and `I`. Using the `finished` dataframe, explore how well the scores on the exam `etotal` (written by the course coordinator) are predicted by the scores on tests `tbest5` (written by the instructor). Apply `lm` to fit a line relating these variables. Run `summary` and `plot` on the result. (It actually produces 4 plots, so use `layout(matrix(1:4,2,2))`.) Interpret the results. Run `plot(tbest5,etotal)`. From the result of `lm`, extract the coefficients (automatically, not copy and paste) and use them to plot the line of best fit on top of the data. Interpret the result. Similarly, see how well `tbest5` is predicted by `gwbest10` using a line and interpret. Similarly, see how well `tbest5` is predicted by `gwbest10` using a parabola (quadratic in `gwbest10`). (You will need to use the `I` function in your formula.) Argue whether the line or parabola is better. Run `lmetg <- lm(etotal ~ tbest5*gwbest10,data=finished); summary(lmetg); layout(matrix(1:4,2,2)); plot(lmetg)`. Explain what the model is, what the results mean, and how well the prediction worked. Argue whether this is better or worse than the prediction using only `tbest5`.
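The mechanics of fitting a line with `lm`, extracting its coefficients, and using `I` in a formula can be sketched on made-up data. The data frame below only borrows the column names `tbest5` and `etotal`; its values are synthetic (generated from an invented line plus noise) and have nothing to do with the actual course data set.

```r
# Synthetic stand-in for the course data (made-up numbers, illustration only).
set.seed(7)
finished <- data.frame(tbest5 = runif(100, min = 40, max = 100))
finished$etotal <- 2 * finished$tbest5 - 20 + rnorm(100, sd = 15)

fit <- lm(etotal ~ tbest5, data = finished)
summary(fit)

# Extract the intercept and slope programmatically and draw the fitted line.
b <- coefficients(fit)
with(finished, plot(tbest5, etotal, main = "etotal vs. tbest5 (synthetic data)"))
abline(a = b[1], b = b[2], col = "red")

# Quadratic model: I() protects ^ from its special meaning inside a formula.
fit2 <- lm(etotal ~ tbest5 + I(tbest5^2), data = finished)
coefficients(fit2)
```

Without `I()`, the term `tbest5^2` would be interpreted by the formula syntax as an interaction expansion rather than as the square of the variable.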
	Tue Feb 23 8am journal for Feb 15-19 due. Rate your partner.
	Wed Feb 24	Read about Generalized Linear Models. Summarize in your own words and specifically address: How are they different from (ordinary) linear models? What is a link function? What choices (such as the link function) within the generalized linear model give an (ordinary) linear model? Read about Generalized Linear Models in R, `glm`, and `family`. Use `glm` with appropriate choices to try to reproduce the results from `lm` when `etotal` is predicted by `tbest5` using a line. When you used `lm` to predict `etotal` using `tbest5`, you (should have) found that very low scores on `tbest5` lead to negative predictions for `etotal`, which is nonsense. Use `glm` with `family=binomial` to avoid this nonsense. (It expects response values in $[0,1]$, so try to predict `etotal/200`.) Using `layout(matrix(1:12,4,3))`, plot the results from the original `lm` test, the `glm` test that should reproduce it, and the `glm` test using `family=binomial`. Interpret the results. Make a single plot that has: The original `(tbest5,etotal)` points with `xlim=c(0,100),ylim=c(0,200)`. The prediction line you got using `lm`. The prediction line you got using `glm` trying to reproduce `lm`, in a different color. The prediction curve you got using `glm` with `family=binomial`. (You will need to map using `binomial()$linkinv` and multiply by 200. It should look similar to the prediction lines but stay in $[0,200]$.) Interpret the results.
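The relationship between `lm` and `glm` can be sketched on synthetic data as follows. The numbers are made up and only the variable names `tbest5` and `etotal` are borrowed from the task. One substitution is made on purpose: the logit fit uses `family = quasibinomial` rather than `family = binomial`, which yields the same coefficient estimates for a fractional response but avoids R's warning about non-integer counts.

```r
# Synthetic stand-in data (made-up numbers; not the course data set).
set.seed(8)
tbest5 <- runif(100, min = 20, max = 100)
etotal <- pmin(pmax(2 * tbest5 - 20 + rnorm(100, sd = 15), 0), 200)

fit_lm <- lm(etotal ~ tbest5)
# Gaussian family with identity link: the same model as lm.
fit_glm <- glm(etotal ~ tbest5, family = gaussian(link = "identity"))

# Fractional responses in [0, 1] with a logit link. quasibinomial is used
# here instead of binomial only to avoid the non-integer-count warning.
fit_logit <- glm(I(etotal / 200) ~ tbest5, family = quasibinomial)

# Map the linear predictor back to the response scale and rescale by 200.
grid <- seq(0, 100, by = 1)
eta <- coef(fit_logit)[1] + coef(fit_logit)[2] * grid
pred <- 200 * binomial()$linkinv(eta)

plot(tbest5, etotal, xlim = c(0, 100), ylim = c(0, 200))
abline(fit_lm, col = "blue")
lines(grid, pred, col = "red")
```

The inverse logit maps any real number into $(0,1)$, so the red curve cannot leave $[0,200]$ however low `tbest5` is, unlike the straight line.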
	Thu Feb 25 8am 5530 exploration from Feb 19 due.
	Fri Feb 26	Due March 10: MATH 4530 Students: Propose a topic for your final project. Explain why you think it is a good choice. MATH 5530 Exploration: Propose a paper to use for your final project. It could be one you used for an exploration or a new one. Explain why you think it is a good choice. Identify its specific claims and the numerical experiments you plan to reproduce.
Spring Break
8
	Mon Mar 7	Work on your final project proposal.
	Tue Mar 8 8am journal for Feb 22-24 due. Rate your partner.
	Wed Mar 9	Find your partner for this week's tasks and journal, sit next to them, and set up a project for your journal. Read the competitive mathematical game False Alarms in a Sensor Network. Read about Expected Value. Summarize in your own words. Compute the expected values of the following: The cost of radioactive leaks from nuclear plants in one year, assuming the sensor network did not detect them early. The cost of radioactive leaks from nuclear plants in one year, assuming the sensor network did detect them early. The cost of clouds of radioactivity from the East, assuming the sensor network did not detect them early. The cost of clouds of radioactivity from the East, assuming the sensor network did detect them early. The cost of dirty bombs, assuming the sensor network did not detect them early. The cost of dirty bombs, assuming the sensor network did detect them early. The cost to maintain the sensor network. The cost of sensors giving false alarms, assuming the sensor is isolated. The cost of sensors giving false alarms, assuming the sensor network detects it as a likely malfunction. If there was no sensor network, what is the expected total cost? In the best case, where the sensor network detects everything, what is the expected total cost? What is the net benefit of installing the sensor network?
	Thu Mar 10 8am final project proposal due. For 5530 students counts as an exploration.
	Fri Mar 11	Make a draft entry in the game (in your journal, not as a .pdf). MATH 5530 Exploration: If your final project proposal was rejected, then propose a new paper. If your final project proposal was accepted, then pick some self-contained topic that you will need for your final project, explain it, and do some related computation.
9
	Mon Mar 14	Work on your final project.
	Tue Mar 15 8am journal for Mar 9-11 due. Rate your partner.
	Wed Mar 16	Find your partner for this week's tasks and journal, sit next to them, and set up a project for your journal. Read about `data.frame`, `cbind`, and `rowSums`. Make a table of the grades received by students who did not take test 1 or did not take test 2 (or did not take both). Interpret the results. (We will not be able to use these for the analysis today.) Make a data frame `begend` such that: Students who did not take test 1, test 2, or both, are excluded. It has a column `Level3` derived from `Level` that preserves "Freshman" and "Sophomore" but has all other values converted to "other". It has a column `College3` derived from `College` that preserves "A&S" and "ENT" but has all other values converted to "other". It has a column `preparation` with the sum of all the questions from test 1 and questions 1, 2, and 3 from test 2. (These are PreCalculus questions.) It has a column `gw1and2` that is the sum of the first 2 groupworks, with missing scores counted as 0. It has a column `grade2` with "ABC" if the student grade was in `c("A","A-","B+","B","B-","C+","C","C-")` and "DFW" otherwise. Show that your data frame is correct by using `table`, `summary`, etc. Plot all 10 combinations of pairs of factors in `begend`. For each pair, choose the order (i.e. `plot(x,y)` or `plot(y,x)`) that gives the most useful plot and interpret the results.
	Thu Mar 17 8am 5530 exploration from Mar 11 due.
	Fri Mar 18	Read about Binary Classification and Evaluation of binary classifiers. Summarize in your own words. Give the formulas for and interpretation of sensitivity, specificity, and accuracy. Consider a `grade2` value of "DFW" as the disease state. Our goal is to diagnose this disease based on information available early in the semester. From looking at the plots, decide on a classification of students into "ABC" or "DFW" using only `Level3` information. Compute the sensitivity, specificity, and accuracy of your classifier. Repeat using only `College3` information. Does this classifier do better or worse? Write a function `ssa` with inputs: `score`: a factor containing numerical scores; `class`: a factor containing the true classes corresponding to the numerical scores (only two classes); `levels`: a vector (or list) of the two classes, with the class generally corresponding to lower scores first; and `cut`: a cutoff score. Have it compute the sensitivity, specificity, and accuracy of the binary classifier that classifies scores less than or equal to `cut` as `levels[1]` and scores greater than `cut` as `levels[2]`. Return `c(sensitivity,specificity,accuracy)`. Run `ssa` with `score=preparation` and `class=grade2` for `cut` in `1:150` and plot the sensitivity, specificity, and accuracy on a single graph. Interpret the results. Run `ssa` with `score=gw1and2` and `class=grade2` for `cut` in `1:200` and plot the sensitivity, specificity, and accuracy on a single graph. Interpret the results. If the Mathematics department wants to identify at-risk students early in the semester, what method do you recommend they use? MATH 5530 Exploration: Pick some self-contained topic that you will need for your final project, explain it, and do some related computation.
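The `ssa` function specified in the Mar 18 tasks might look like the sketch below. It assumes that `levels[1]` (the class associated with lower scores, "DFW" here) plays the role of the disease state, so sensitivity is the fraction of true `levels[1]` cases that are classified as `levels[1]`. The six scores and classes are made up purely to exercise the function.

```r
# Sketch of ssa(); treating levels[1] as the "positive" (disease) class is
# an assumption made for this example.
ssa <- function(score, class, levels, cut) {
  predicted_pos <- score <= cut          # classified as levels[1]
  actual_pos <- class == levels[1]
  tp <- sum(predicted_pos & actual_pos)
  tn <- sum(!predicted_pos & !actual_pos)
  fp <- sum(predicted_pos & !actual_pos)
  fn <- sum(!predicted_pos & actual_pos)
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    accuracy = (tp + tn) / length(score))
}

# Tiny made-up example: low scores tend to be "DFW".
score <- c(10, 20, 30, 40, 50, 60)
class <- factor(c("DFW", "DFW", "ABC", "DFW", "ABC", "ABC"))
ssa(score, class, levels = c("DFW", "ABC"), cut = 20)

# Sweep the cutoff and plot all three rates on a single graph.
cuts <- seq(0, 60, by = 10)
res <- sapply(cuts, function(k) ssa(score, class, c("DFW", "ABC"), k))
matplot(cuts, t(res), type = "l", lty = 1, col = 1:3,
        xlab = "cut", ylab = "rate")
legend("bottomright", legend = rownames(res), col = 1:3, lty = 1)
```

Raising the cutoff classifies more students as `levels[1]`, so sensitivity can only go up and specificity can only go down; the plot shows where the trade-off lies.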
10
	Mon Mar 21	Work on your final project.
	Tue Mar 22 8am journal for Mar 16-18 due. Rate your partner.
	Wed Mar 23	Find your partner for this week's tasks and journal, sit next to them, and set up a project for your journal. From the StatisticalComputing project, copy the files that start with `L3C3eg`. Do not share them with anyone outside this class. These contain `Level3`, `College3`, `etotal`, and `grade` for different sections of MATH 2301. Fix the order of the `Level3`, `College3`, and `grade` factors to their natural orders. For each of the following, make a single plot overlaying the curves from each of the `L3C3eg` dataframes. Color-code by dataframe and include a legend. Interpret the results. Plot the proportion of students in each level of `Level3`. Plot the proportion of students in each level of `College3`. Plot the density of `etotal`. Plot the proportion of students that received each grade. Plot the proportion of students getting DFW grades at each level of `Level3` (i.e. proportion of Freshmen with DFW, proportion of Sophomores with DFW, ...). (Hint: `sapply(levels(Level3),function(x){sum(Level3==x & grade %in% c("D+","D","D-","F","FS","WP","WF"))/sum(Level3==x)})`.) Plot the proportion of students getting DFW grades in each level of `College3`. Plot the mean `etotal` of students that received each grade.
	Thu Mar 24 8am 5530 exploration from Mar 18 due.
	Fri Mar 25	(drop deadline with WP/WF) Note that a rough draft of your final project report is due next Thursday. Read about Hypothesis Testing. Summarize in your own words. Read about Student's t-test and `t.test`. Summarize in your own words. Apply `t.test` to `L3C3eg100$etotal` to test the hypothesis that it is drawn from a population whose mean is greater than 110. Repeat for 115, 120, 125, and 130. Interpret the results. Read about Welch's t test. Summarize in your own words. Apply `t.test` to the `etotal` for each pair of `L3C3eg` dataframes. Interpret the results. Summarize what you have learned about MATH 2301 this week. What differences in performance between sections do you think are due to different student populations? What differences do you think are due to different instructors? How could the Mathematics department use this information to make MATH 2301 better?
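A minimal sketch of the two kinds of `t.test` calls described above, on simulated data. The vectors `x` and `y` are made up and stand in for `etotal` columns; the sketch assumes that "mean is greater than 110" is meant as the alternative hypothesis, which is what `alternative = "greater"` tests.

```r
set.seed(1)
x <- rnorm(40, mean = 120, sd = 20)   # made-up exam totals for one section
y <- rnorm(35, mean = 112, sd = 25)   # made-up exam totals for another section

# One-sample, one-sided tests: H0 mean = mu against H1 mean > mu.
mus   <- c(110, 115, 120, 125, 130)
pvals <- sapply(mus, function(m) t.test(x, mu = m, alternative = "greater")$p.value)
names(pvals) <- mus
pvals

# Two-sample test; with the default var.equal = FALSE this is Welch's t-test.
welch <- t.test(x, y)
welch$method
welch$p.value
```

For a fixed sample the p-value grows as the hypothesized mean `mu` increases, because the data offer less and less evidence that the population mean exceeds it.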
11
	Mon Mar 28	Work on your final project.
	Tue Mar 29 8am journal for Mar 23-25 due. Rate your partner.
	Wed Mar 30	Find your partner for this week's tasks and journal, sit next to them, and set up a project for your journal. From `grades$etotal` make `etotalnoNA` that has `NA` values discarded. From `grades$etotal` make `etotal0` that discards `NA` values that correspond to a `grade` of "Z" and sets the remaining `NA` values to 0. Make a function `subsetmeanvar` that inputs two vectors `(data,indices)` and returns the mean and variance of the subset of the data vector with those indices. Make a function `subsetmedian` that inputs `(data,indices)` and returns the median of the subset of the data with those indices. Read about Bootstrapping. Summarize the method in your own words. Read about the `boot` package and its functions `boot` and `boot.ci`. Use `boot` and `boot.ci` on `etotalnoNA` with the statistic `subsetmeanvar` to study the empirical distribution of the mean statistic for `etotalnoNA`. Print the outputs of `boot` and `boot.ci` and plot the output of `boot`. Interpret the results. One may argue that the mean of `etotalnoNA` is a poor way to measure the effectiveness of the instructor, since a terrible instructor may scare off all but the strongest students. To account for these students, we can instead consider the median of `etotal0`. Use `boot` and `boot.ci` on `etotal0` with the statistic `subsetmedian` to study the empirical distribution of the median statistic for `etotal0`. Interpret the results.
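A minimal sketch of how statistics of the form `(data, indices)` plug into `boot` and `boot.ci`, assuming the `boot` package is installed. The vectors `scores` and `scores0` are made up, and the function names ending in `_sketch` are hypothetical. The interval types are restricted to ones that need only the statistic itself.

```r
library(boot)

set.seed(1)
scores  <- round(rnorm(60, mean = 120, sd = 25))  # made-up totals, no NA
scores0 <- c(scores, rep(0, 8))                   # same, with 8 students counted as 0

# boot() calls the statistic with the data and a vector of resampled indices.
subsetmeanvar_sketch <- function(data, indices) {
  d <- data[indices]
  c(mean(d), var(d))
}
subsetmedian_sketch <- function(data, indices) median(data[indices])

b_mean <- boot(scores, subsetmeanvar_sketch, R = 2000)
print(b_mean)
ci <- boot.ci(b_mean, type = c("norm", "perc"), index = 1)  # index 1 = the mean
print(ci)

b_med  <- boot(scores0, subsetmedian_sketch, R = 2000)
ci_med <- boot.ci(b_med, type = "perc")
print(ci_med)
```

Calling `plot(b_mean, index = 1)` then shows a histogram and normal Q-Q plot of the bootstrap replicates of the mean.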
	Thu Mar 31 8am rough draft of final project report due. For 5530 students counts as an exploration.
	Fri Apr 1	Note that a rough draft of your final project presentation (slides) is due next Thursday. See next Monday for guidance. From `grades` make a data frame `eqs` that includes only the scores on the exam questions and has `NA` values discarded. Run `boxplot` on it to see how the students did on each question, and interpret the results. One can argue that if the class does very badly on a question, then it was "too hard" and asking it was not productive. Identify the question on which the students did the worst and make a table of the scores. Look at the question itself and judge whether or not it is too hard. Argue whether or not questions similar to this one should be included on exams in the future. One can argue that if the class does very well on a question, then it was "too easy" and asking it was not productive. Identify the question on which the students did the best and make a table of the scores. Look at the question itself and judge whether or not it is too easy. Argue whether or not questions similar to this one should be included on exams in the future. One can argue that if the correlation between scores on two questions is too high, then we could save everyone time and energy by only asking one of them. Apply `pairs` to `eqs` and interpret the results. Apply `cor` and find the most correlated pair of questions. Look at the questions themselves and judge whether or not they are too similar. Argue whether or not only one question similar to one of this pair should be included on exams in the future. Summarize what you have learned about MATH 2301 this week. How should performance differences between instructors be measured? Should the design of the final exam be modified?
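A minimal sketch of one way to pull the most correlated pair out of a `cor` matrix, using made-up scores. The data frame `eqs_toy` and its columns are hypothetical, and the sketch assumes "most correlated" means largest absolute correlation.

```r
set.seed(2)
q1 <- sample(0:10, 50, replace = TRUE)
eqs_toy <- data.frame(
  q1 = q1,
  q2 = pmin(10, q1 + sample(0:1, 50, replace = TRUE)),  # nearly a copy of q1
  q3 = sample(0:10, 50, replace = TRUE)                 # unrelated to q1
)

r <- cor(eqs_toy)
r[upper.tri(r, diag = TRUE)] <- NA   # keep each pair once and drop the diagonal

# Position of the largest absolute correlation among the remaining entries.
best <- which(abs(r) == max(abs(r), na.rm = TRUE), arr.ind = TRUE)
pair <- c(rownames(r)[best[1, "row"]], colnames(r)[best[1, "col"]])
pair
```

Blanking the diagonal first matters because every question has correlation 1 with itself, which would otherwise always win.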
12
	Mon Apr 4	Work on your final project presentation slides: Upload talktemplate.tex and OHIOCLR.pdf. Open talktemplate.tex and edit it to become your talk.
	Tue Apr 5 8am journal for Mar 30 - Apr 1 due. Rate your partner.
	Wed Apr 6	Find your partner for this week's tasks and journal, sit next to them, and set up a project for your journal. Read about Monte Carlo integration. Summarize the basic method in your own words. For practice, we will estimate the integral $I = \int_{-1}^1 (1-x^2)\,dx$. Compute the exact value so we can compare. Write a function with input `n` that uses the basic Monte Carlo integration method with `n` points to estimate $I$. Read about Antithetic variates. Summarize the method in your own words. Write a function with input `n` that uses this method with `n` total points to estimate $I$. (It should use `n/2` original points and `n/2` antithetic points.) Read about Importance sampling. Summarize the method in your own words. Write a function with input `n` that uses this method with `n` total points to estimate $I$ using sampling density $f(x)=1-|x|$ on $[-1,1]$. Compare the three methods by doing the following for each: Use `replicate` to run it 1000 times using 1000 points to collect 1000 estimates for $I$. Compute the variance of the estimates. Run `summary` on the absolute value of the error of the estimates. Interpret the results. Which method is working better?
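A minimal sketch of the three estimators and the comparison described above. The function names are hypothetical. The sketch assumes the antithetic partner of a point `x` drawn uniformly from $[-1,1]$ is `-x`, and it samples from the triangular density $1-|x|$ by adding two independent uniforms on $[0,1]$ and subtracting 1. The exact value $I = 4/3$ is used only to measure the error.

```r
g <- function(x) 1 - x^2               # integrand on [-1, 1]

mc_basic <- function(n) {
  x <- runif(n, -1, 1)
  2 * mean(g(x))                       # interval length times average height
}

mc_antithetic <- function(n) {
  x <- runif(n / 2, -1, 1)             # n/2 original points
  2 * mean(c(g(x), g(-x)))             # plus their n/2 antithetic partners
}

mc_importance <- function(n) {
  x <- runif(n) + runif(n) - 1         # draws with density f(x) = 1 - |x|
  f <- 1 - abs(x)
  mean(g(x) / f)                       # average of the weighted integrand
}

set.seed(3)
methods <- list(basic = mc_basic, antithetic = mc_antithetic,
                importance = mc_importance)
est <- sapply(methods, function(m) replicate(1000, m(1000)))  # 1000 x 3 matrix

apply(est, 2, var)                     # variance of the estimates, by method
apply(abs(est - 4/3), 2, summary)      # summary of the absolute errors
```

Other choices of antithetic partner are possible (for example, using the symmetry of the integrand to work on $[0,1]$ with partner `1 - x`), and they can behave quite differently from the one assumed here.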
	Thu Apr 7 8am rough draft of presentation (slides) due. For 5530 students counts as an exploration.
	Fri Apr 8	(No more 5530 explorations.) Read about Monte Carlo methods. Summarize in your own words. Consider the following three-person game, with players whom we will call 1, 2, and 3. In each round two players play while the third waits. The winner of that round plays in the next round versus the player who was waiting. If a player wins two consecutive rounds then the game stops and that player is declared the overall winner. Each round is a simple coin toss, with the two players having equal probability of winning. Suppose in the first round 1 plays 2 while 3 waits. Is the probability of 1 being the overall winner the same as the probability of 2 being the overall winner? Is the probability of 1 being the overall winner the same as the probability of 3 being the overall winner? Write a function that simulates this game and returns the (overall) winner (1,2, or 3). Use `replicate` to run this function many times and compute the relative frequencies of the different players winning. Interpret the results. Suppose you need to decide a winner among 3 players using only a (fair) coin and you would like the probabilities of each winning to be the same. Decide on a method/game to do so, write a function that simulates it, and run the simulation many times to show that the relative frequencies tend towards equality.
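A minimal sketch of one way to simulate the three-person game described above. The function name `play_game` and the number of repetitions are arbitrary choices made for illustration.

```r
play_game <- function() {
  players     <- c(1, 2)   # the two players in the current round
  waiting     <- 3         # the player sitting out
  last_winner <- 0         # nobody has won a round yet
  repeat {
    winner <- sample(players, 1)               # fair coin toss between the two
    if (winner == last_winner) return(winner)  # two consecutive wins ends the game
    loser       <- setdiff(players, winner)
    players     <- c(winner, waiting)          # winner stays, waiting player enters
    waiting     <- loser
    last_winner <- winner
  }
}

set.seed(4)
wins <- replicate(10000, play_game())
freq <- table(factor(wins, levels = 1:3)) / length(wins)
freq
```

Wrapping `wins` in `factor(..., levels = 1:3)` keeps all three players in the table even if one of them never wins in a short run.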
13
	Mon Apr 11	From the StatisticalComputing project, copy the file `ratinganalysis.sagews` into your personal project. Read through it to see how partner ratings will be used to adjust journal scores. On Thursday after you submit your final journal, make sure your partner ratings are up to date. If you see any way to improve the process in `ratinganalysis.sagews` to make it better/fairer, then comment on it in your `ratings.sagews` file. Look at the presentation rating guide and form. If you have any questions on how the presentations will work then ask them.
	Wed Apr 13	Project presentations: Must be at least 10 and at most 15 minutes long. Must use $\LaTeX$ slides in the beamer class. Aim for half the presentation to explain to the class general background and half to show what you did. Worth 10% of your project grade. You will rate each other. (rating guide and form)
	Thu Apr 14 8am journal for Apr 6-8 due. Rate your partner.
	Fri Apr 15	More presentations
14
	Mon Apr 18	More presentations
	Wed Apr 20	More presentations
	Fri Apr 22	More presentations or clean-up
15
	Fri Apr 29	Final Exam 1-3pm (virtual, your presence is not required). Presentation slides due. You can improve them based on feedback from your presentation. Project report due.

Leftovers

q prefix for quantiles; dotchart; apply; MASS fitdistr; nortest library; moments library, something on R-values
code comments?
demo()
Selected applications: e.g. kernel density; regression; design of experiment; time series.
- 1: repeat function rep(); lists, frames;
- 2: ordered(); data input scan() read...
- 4: hist(); matrix indexing
- 5: functions on matrices diag() t() Conj() rank()?? solve() eigen() svd() chol() det() rankMatrix() %*% %o% cbind() rbind(); function to sample with and without replacement.
- 6: multiD plotting pairs() cloud() in lattice; Monte Carlo for mean, integral, probability; MC simulation of two games
- 9: MC point estimator, confidence interval; MSE comparison for different estimators
- 10: MC simulation of Type I error; MC power graphs
- 11: ANOVA aov() model.tables() interaction.plot(); ANCOVA plot.design()
- 12: bootstrap, for standard error, for bias; jackknife, for variance, for bias; compare bootstrap and jackknife; bootstrap in final project
Carnegie Mellon 2013:
- do.call() curve() for symbolic plot ifelse() rep() optim() nls()

Topic/Materials/Tasks
Read about the Shapiro-Wilk test for normality and `shapiro.test`. Summarize in your own words. For each of the `L3C3eg` dataframes, apply `shapiro.test` and `qqnorm` to `etotal` and interpret the results. Jackknife: Read about the Jackknife. Summarize the method in your own words. From our Monte Carlo data on the fine for Ni, calculate the Jackknife estimate of the expected fine. How does it compare to the original Monte Carlo estimate of the expected fine? Compute the Jackknife estimate of the standard error and compare it to the bootstrap estimate of the standard error. Cross-Validation: Read about Cross-Validation. Summarize the method in your own words. *** split() and lapply(). Importing data: (See the R Data Import/Export manual.) Some functions it would have been useful to know already are: unique(), all(), paste(), rbind(), and cbind().
Nonlinear models: Read about Nonlinear Least Squares. Summarize the method in your own words. Read about how it is done in R and nls(). Guess a functional form (or a few) for how it depends on coord_y, such as $ay+b$, $a\exp(by)$, $a\exp(b(y+c)^2)$, etc. Use nls() to find the parameters then plot the original data and the fitted values. Which functional form matches best?
Read about rep(), %%, and %/%. Read about missing values, t(), rowMeans(), and colMeans().
Miscellaneous plotting tools: Use hist() to make a histogram of the x values and a second histogram with too many breaks; use layout() to show them side by side. Plot the density of the x values and a second density with too small width parameter; use layout() to show them side by side. Read about the cloud() function in the package lattice. Plot coord_z versus Ni and Cr and describe what you observe. Read about the kde2d() function in the MASS package; apply it to your samples. Plot the resulting density using contour(), image(), and persp().

Martin J. Mohlenkamp

Last modified: Thu Apr 28 11:04:11 EDT 2016