Homework Assignments

Statistics 202: Statistical Aspects of Data Mining, Summer 2007

Chapter 5 Homework Part 2 and Chapter 8 Homework - Due Tuesday 8/14 at 9AM:

1) Read Chapter 5 (Sections 5.2, 5.5 and 5.6) and Chapter 8 (Sections 8.1 and 8.2).

2) This question deals with In Class Exercise #42.

a) Repeat In Class Exercise #42 for the k-nearest neighbor classifier for k=1,2,...,10. (We did k=1 in class, which is the default.) Use R to produce a single plot showing both the training error and the test error as functions of the value of k. Provide your R code. Remember that the 61st column is the response and the other 60 columns are the predictors. Provide a legend on your plot and label the axes. Include your name in the title of the plot. What value of k does your plot suggest is optimal with regard to the test error?

b) Repeat part a using the exact same R code. Explain why both the training errors and the test errors change for even values of k but not for odd values of k. Hint: Read the help on the knn function if you do not know.
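
The loop that part a asks for might look roughly like the sketch below. It runs on simulated data rather than the sonar files, and the object names (train_x, err_train and so on) and the plot title are placeholders. Only knn() from the class package comes from the exercise itself.

```r
library(class)  # provides knn()

set.seed(1)
# Simulated two-class data standing in for the real training and test sets
n <- 100
train_x <- matrix(rnorm(n * 2), ncol = 2)
train_y <- factor(ifelse(train_x[, 1] + rnorm(n, sd = 0.5) > 0, 1, -1), levels = c(-1, 1))
test_x  <- matrix(rnorm(n * 2), ncol = 2)
test_y  <- factor(ifelse(test_x[, 1] + rnorm(n, sd = 0.5) > 0, 1, -1), levels = c(-1, 1))

ks <- 1:10
err_train <- numeric(length(ks))
err_test  <- numeric(length(ks))
for (i in seq_along(ks)) {
  # Predict the training points and the test points from the training set
  fit_train <- knn(train_x, train_x, train_y, k = ks[i])
  fit_test  <- knn(train_x, test_x,  train_y, k = ks[i])
  err_train[i] <- mean(fit_train != train_y)
  err_test[i]  <- mean(fit_test  != test_y)
}

# Both error curves on one plot, with axis labels and a legend
plot(ks, err_test, type = "b", col = "red",
     ylim = range(0, err_train, err_test),
     xlab = "k", ylab = "Misclassification error",
     main = "Your Name: k-NN error")
lines(ks, err_train, type = "b", col = "blue")
legend("topright", legend = c("Test", "Training"),
       col = c("red", "blue"), lty = 1, pch = 1)
```

For the real data, the first 60 columns of each file would replace the simulated predictor matrices and the 61st column would replace the simulated labels.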

3) This question deals with In Class Exercise #44.

a) Using the abline() function, add some lines to the plot from In Class Exercise #44 to try to get a close approximation for the equation of the separating hyperplane (line) estimated in svm(). Give the equation for your approximation in the form x2=m*x1+b.

b) For a general line of the form x2=m*x1+b and a point (x1,x2)=(k1,k2) give an expression for the (shortest) distance from the point to the line.

c) Which two points in In Class Exercise #44 are closest to the separating hyperplane (line) from svm() based on your approximation of its equation? Use your expression in part b to compute the distances from each of these two points to the separating hyperplane (line) from svm() using your approximation for the line equation.
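
As a hedged illustration of the computation in parts b and c, the sketch below uses the standard point-to-line distance after rewriting x2=m*x1+b as m*x1 - x2 + b = 0. The function name point_line_dist and the example line and point are hypothetical and are not taken from In Class Exercise #44.

```r
# Distance from the point (k1, k2) to the line x2 = m*x1 + b,
# using the line rewritten as m*x1 - x2 + b = 0
point_line_dist <- function(k1, k2, m, b) {
  abs(m * k1 - k2 + b) / sqrt(m^2 + 1)
}

# Hypothetical line (m = 1, b = 2) and point (0, 0), for illustration only
d <- point_line_dist(k1 = 0, k2 = 0, m = 1, b = 2)
d
```

A quick sanity check is a horizontal line (m = 0), where the distance reduces to the vertical gap |k2 - b|.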

4) Suppose I have 51 classifiers which each classify a point correctly 60% of the time. If these 51 classifiers are completely independent and I take the majority vote, how often is the majority vote correct for that point?
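
One way to set this up, assuming independence as stated: the number of correct classifiers follows a binomial distribution, and the majority vote is correct when more than half of them are correct. The sketch below wraps that in a hypothetical helper, majority_vote_prob, and tries it on a small case of 3 classifiers rather than 51.

```r
# Probability that more than half of n independent classifiers,
# each correct with probability p, are correct (n assumed odd)
majority_vote_prob <- function(n, p) {
  # P(X > floor(n/2)) for X ~ Binomial(n, p)
  pbinom(floor(n / 2), size = n, prob = p, lower.tail = FALSE)
}

# Small illustrative case: 3 classifiers, each correct 60% of the time
p3 <- majority_vote_prob(3, 0.6)
p3
```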

5) Compute the misclassification error on the training data for the Random Forest classifier from In Class Exercise #47. Show your R code for doing this.

6) This question deals with the AdaBoost algorithm and In Class Exercise #48.

a) For AdaBoost, the exponential loss for each data point is defined as e (= 2.718282) to the power of negative 1 times the product of y and F for that data point. The exponential loss for a dataset is the sum of the exponential loss for each data point. (See page 10 of http://www-stat.wharton.upenn.edu/~dmease/contraryevidence.pdf for an expression using the notation from class). It can be shown that AdaBoost is doing a stagewise minimization of the exponential loss for the training data. Make a single plot showing both the natural log of the exponential loss for the training data and the natural log of the exponential loss for the test data for In Class Exercise #48. Provide a legend on your plot and label the axes. Include your name in the title of the plot. From this plot, you should see that the exponential loss for the training data is monotone decreasing at an exponential rate. Does the exponential loss for the test data have this same behavior?

b) For boosting, changing the value of alpha by multiplying it by a constant less than 1 is quite popular. This technique is known as shrinkage and is advocated by many researchers in the statistics community (not including me if you read the paper linked in part a). Repeat In Class Exercise #48 but replace alpha by .1*alpha in the AdaBoost algorithm. Provide your R code and the plot showing the training misclassification error and test misclassification error as functions of the iterations. Provide a legend on your plot and label the axes. Include your name in the title of the plot. From this plot and the final value for the misclassification error on the test data, what do you conclude about the benefits of shrinkage in this example?
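
The exponential loss described in part a can be written as a one-line R function. This is only a sketch: it assumes the labels y are coded as -1 and +1 and that f_score holds the value of F for each point, and the names exp_loss and f_score as well as the example numbers are hypothetical.

```r
# Exponential loss for a data set: the sum over points of exp(-y * F)
# Assumption: y is coded as -1 / +1 and f_score is F evaluated at each point
exp_loss <- function(y, f_score) {
  sum(exp(-y * f_score))
}

# Made-up labels and scores, for illustration only
y       <- c(1, -1, 1, -1)
f_score <- c(0.8, -1.2, -0.3, 0.5)
log_loss <- log(exp_loss(y, f_score))  # natural log, the quantity plotted in part a
log_loss
```

Evaluating this after each boosting iteration, once on the training data and once on the test data, would give the two curves for the plot.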

7) Using the 1-dimensional data x<-c(1,2,2.5,3,3.5,4,4.5,5,7,8,8.5,9,9.5,10) do the following.

a) Write your own R code to carry out k-means clustering for k=2 clusters. Use Algorithm 8.1 on page 497. Run the algorithm for 10 iterations. Use the means as the centroids. Use Euclidean distance (absolute value). If a point is equidistant between the cluster centers, assign it to the cluster on the left (this should not matter). Use the points x=0 and x=10 as the initial cluster centers. Show your R code and the values of the cluster centers at each iteration. How many iterations does the algorithm need to converge?

b) Repeat part a, but use x=9 and x=10 as the initial cluster centers. How many iterations do you need for convergence?

c) Does the default implementation of kmeans() give the same final centers as your algorithm if you tell it to use c(0,10) as the initial values for centers? Show the R code for checking this.

d) Does the default implementation of kmeans() give the same final centers as your algorithm if you tell it to use c(9,10) as the initial values for centers? Show the R code for checking this.
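
The assign-then-update loop in part a can be sketched as below. This is one possible structure, not a reference solution. The function name kmeans_1d is hypothetical, it assumes the two centers are supplied in increasing order, and it assumes that an empty cluster simply keeps its previous center. The toy data are deliberately different from the homework vector.

```r
# One-dimensional 2-means; centers must be given in increasing order
kmeans_1d <- function(x, centers, iters = 10) {
  history <- matrix(NA_real_, nrow = iters, ncol = 2)
  for (it in seq_len(iters)) {
    # Assign each point to the nearer center; ties go to the left center
    left <- abs(x - centers[1]) <= abs(x - centers[2])
    # Update each center to the mean of its points
    # (assumption: an empty cluster keeps its previous center)
    if (any(left))  centers[1] <- mean(x[left])
    if (any(!left)) centers[2] <- mean(x[!left])
    history[it, ] <- centers
  }
  history  # one row of centers per iteration
}

# Toy data, not the homework vector
toy <- c(1, 2, 3, 10, 11, 12)
h <- kmeans_1d(toy, centers = c(0, 10), iters = 3)
h
```

Returning the centers from every iteration makes it easy to see at which iteration they stop changing.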

Chapter 4 Homework and Chapter 5 Homework Part 1 - Due Tuesday 8/7:

1) Read Chapter 4 (all sections) and Chapter 5 (Section 5.7 only).

2) Do Chapter 4 textbook problem #3 (parts a,b,c,d,e) on pages 198-200.

3) Do Chapter 4 textbook problem #5 (parts b,c only) on page 200.

4) Do Chapter 4 textbook problem #7 (parts b,c,d,e only) on page 201. The instructions in the book are somewhat unclear for this one. The following should help clarify these. Ignore "Build a two-level decision tree" in the first part of the problem. In part b, just find the optimal split for each of the 2 nodes resulting from the optimal split we found in class for part a. For part d they want you to use attribute C as the first split in the tree (instead of the best split we found in class) and then choose the best split for each of the two child nodes. You should see that the resulting tree actually gives a lower misclassification error than the greedy tree.

5) The file http://www-stat.wharton.upenn.edu/~dmease/rpart_text_example.txt gives an example of text output for a tree fit using the rpart() function in R from the library rpart. Use this tree to predict the class labels for the 10 observations in the test data at http://www-stat.wharton.upenn.edu/~dmease/test_data.csv.

6) I split the popular sonar data set into a training set (http://www-stat.wharton.upenn.edu/~dmease/sonar_train.csv) and a test set (http://www-stat.wharton.upenn.edu/~dmease/sonar_test.csv). Use R to produce a single plot showing both the training error and the test error as functions of the value of the tree depth "dep" when you fit the training data using the rpart function with all the default values except control=rpart.control(minsplit=0,minbucket=0,cp=-1, maxcompete=0, maxsurrogate=0, usesurrogate=0, xval=0,maxdepth=dep). Remember that the 61st column is the response and the other 60 columns are the predictors. Provide a legend on your plot and label the axes. Include your name in the title of the plot. What value of dep does your plot suggest is optimal with regard to the test error?
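
A rough sketch of the depth loop, run on simulated data instead of the sonar files. The rpart.control settings are copied from the question. Everything else (the make_data helper, the column names, the range of depths and the plot title) is a placeholder.

```r
library(rpart)

set.seed(1)
# Simulated data standing in for the sonar training and test sets
make_data <- function(n) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  y  <- factor(ifelse(x1 * x2 + rnorm(n, sd = 0.5) > 0, 1, -1), levels = c(-1, 1))
  data.frame(x1 = x1, x2 = x2, y = y)
}
train <- make_data(200)
test  <- make_data(200)

deps <- 1:6
err_train <- numeric(length(deps))
err_test  <- numeric(length(deps))
for (i in seq_along(deps)) {
  # y is a factor, so rpart fits a classification tree by default
  fit <- rpart(y ~ ., data = train,
               control = rpart.control(minsplit = 0, minbucket = 0, cp = -1,
                                       maxcompete = 0, maxsurrogate = 0,
                                       usesurrogate = 0, xval = 0,
                                       maxdepth = deps[i]))
  err_train[i] <- mean(predict(fit, train, type = "class") != train$y)
  err_test[i]  <- mean(predict(fit, test,  type = "class") != test$y)
}

plot(deps, err_test, type = "b", col = "red",
     ylim = range(0, err_train, err_test),
     xlab = "Tree depth (dep)", ylab = "Misclassification error",
     main = "Your Name: rpart error by depth")
lines(deps, err_train, type = "b", col = "blue")
legend("topright", legend = c("Test", "Training"),
       col = c("red", "blue"), lty = 1, pch = 1)
```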

7) Do Chapter 5 textbook problem #17 (parts a,c,d) on pages 322-323. Note that there is a typo in part c - it should read "Repeat the analysis for part (b)". We will do part b in class.

Chapter 3 Homework Part 2 and Chapter 6 Homework - Due 9AM Tuesday 7/24:

1) Read Chapter 6 (only sections 6.1 and 6.7).

2) This question uses the sample of 10,000 Ohio house prices at http://www-stat.wharton.upenn.edu/~dmease/OH_house_prices.csv. Download the data set to your computer. Note that the house prices are in thousands of dollars.

a) What is the median value? Is it larger or smaller than the mean?

b) What does your answer to part a suggest about the shape of the distribution (right-skewed or left-skewed)?

c) How does the median change if you add 10 (thousand dollars) to all the values?

d) How does the median change if you multiply all the values by 2?

e) Add 29000 (which is $29 million) to the first ten values in the data set. How does this affect the mean and the median?

3) This question uses the following people's ages: 19,23,30,30,45,25,24,20. Store them in R using the syntax ages<-c(19,23,30,30,45,25,24,20).

a) Compute the standard deviation in R using the sd() function.

b) Compute the same value by hand and show all the steps.

c) Using R, how does the value in part a change if you add 10 to all the values?

d) Using R, how does the value in part a change if you multiply all the values by 100?
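
The by-hand steps in part b follow the sample standard deviation formula, which divides by n - 1, the same convention sd() uses. The sketch below walks through those steps on a made-up vector rather than the ages from the question.

```r
# Made-up values, not the ages from the question
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

n      <- length(x)
xbar   <- mean(x)                       # step 1: sample mean
sq_dev <- (x - xbar)^2                  # step 2: squared deviations from the mean
s      <- sqrt(sum(sq_dev) / (n - 1))   # step 3: divide by n - 1, then take the square root

# The by-hand value should agree with sd()
all.equal(s, sd(x))
```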

4) This question uses the data at http://www-stat.wharton.upenn.edu/~dmease/football.csv. Download it to your computer. This data set gives the total number of wins for each of the 117 Division 1A college football teams for the 2003 and 2004 seasons.

a) Compute the correlation in R using the function cor().

b) How does the value in part a change if you add 10 to all the values for 2004?

c) How does the value in part a change if you multiply all the 2004 values by 2?

d) How does the value in part a change if you multiply all the 2004 values by -2?

5) Do Chapter 6 textbook problem #2 (parts a,b,c,d only) on page 404.

6) Do Chapter 6 textbook problem #3 (parts b,c,d only) on page 405.

7) Do Chapter 6 textbook problem #6 (parts d,e only) on page 406.

8) Using the data at www.stats202.com/more_stats202_logs.txt and treating each row as a "market basket" compute the support and confidence for the rule ip=65.57.245.11 → "Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3". State what the support and confidence values mean in plain English in this context.
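
For a rule A → B, support is the fraction of all rows that contain both A and B, and confidence is the fraction of rows containing A that also contain B. The sketch below computes both on a tiny made-up table. The column names ip and browser and all of the values are hypothetical and do not reflect the layout of the actual log file.

```r
# Toy "log" with one row per request; all values are made up
logs <- data.frame(
  ip      = c("1.1.1.1", "1.1.1.1", "2.2.2.2", "1.1.1.1", "3.3.3.3"),
  browser = c("Firefox", "Firefox", "Firefox", "MSIE", "MSIE"),
  stringsAsFactors = FALSE
)

has_lhs  <- logs$ip == "1.1.1.1"                  # rows containing the left-hand side
has_both <- has_lhs & logs$browser == "Firefox"   # rows containing both sides

support    <- sum(has_both) / nrow(logs)    # share of all rows with both items
confidence <- sum(has_both) / sum(has_lhs)  # share of left-hand-side rows that also have the right-hand side
c(support = support, confidence = confidence)
```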

9) My friend and I play basketball. I make 70% of my non-long shots and 40% of my long shots. My friend makes 60% of his non-long shots and 20% of his long shots. Letting pf denote the percent of my friend's shots which are long shots and pm denote the percent of my shots which are long shots, what relationship must hold between pm and pf in order for my friend to make a higher overall percentage of his total shots than me?

Chapter 3 Homework Part 1 - Due Tuesday 7/17:

1) Read Chapter 3 (only sections 3.1, 3.2 and 3.3).

2) Do Chapter 3 textbook problem #2 on page 142.

3) This question uses a sample of 1500 California house prices at http://www-stat.wharton.upenn.edu/~dmease/CA_house_prices.csv and a sample of 10,000 Ohio house prices at http://www-stat.wharton.upenn.edu/~dmease/OH_house_prices.csv. Download both data sets to your computer. Note that the house prices are in thousands of dollars.

a) Use R to produce a single graph displaying a boxplot for each set (as in ICE #16). Include the R commands and the plot. Put your name in the title of the plot (for example, main="Britney Spears' Boxplots").

b) Use R to produce a frequency histogram for only the California house prices. Use intervals of width $500,000 beginning at 0 and ending at $3.5 million. Include the R commands and the plot. Put your name in the title of the plot.

c) Use R to produce a plot showing relative frequency polygons for both the California prices and the Ohio prices on the same graph (as in ICE #10). Include a legend. Use the midpoints of the intervals from the previous exercise. (The first point should be at -$250,000 and the last at $3.75 million.) Include the R commands and the plot. Put your name in the title of the plot.

d) Use R to plot the ECDF of the California houses and Ohio houses on the same graph (as in ICE #11). Include a legend. Include the R commands and the plot. Put your name in the title of the plot.
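
A minimal sketch of two of the plot types above, using simulated prices in place of the CSV files. The vectors ca and oh, their ranges and the titles are placeholders. The breaks argument shows one way to fix the histogram intervals at 0, 500, ..., 3500 (in thousands of dollars).

```r
set.seed(1)
# Simulated prices in thousands of dollars, standing in for the two CSV files
ca <- runif(300, min = 50, max = 3400)
oh <- runif(300, min = 20, max = 800)

# Frequency histogram with explicit interval endpoints 0, 500, ..., 3500
h <- hist(ca, breaks = seq(0, 3500, by = 500),
          xlab = "Price (thousands of dollars)",
          main = "Your Name: CA histogram")

# ECDFs of both samples on the same graph, with a legend
plot(ecdf(ca), col = "red",
     xlab = "Price (thousands of dollars)", ylab = "F(x)",
     main = "Your Name: ECDFs")
plot(ecdf(oh), add = TRUE, col = "blue")
legend("bottomright", legend = c("CA", "OH"),
       col = c("red", "blue"), lty = 1)
```

The object returned by hist() also holds the interval midpoints (h$mids) and counts (h$counts), which can be reused when building a frequency polygon.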

4) This question uses the data at http://www-stat.wharton.upenn.edu/~dmease/football.csv. Download it to your computer. This data set gives the total number of wins for each of the 117 Division 1A college football teams for the 2003 and 2004 seasons.

a) Use plot() in R to make a scatter plot for this data with 2003 wins on the x-axis and 2004 wins on the y-axis. Use the range 0 to 12 for both the x-axis and y-axis. Include the R commands and the plot. Put your name in the title of the plot.

b) Why are there fewer than 117 points visible on your graph in part a? What is the solution we discussed in class to deal with this problem?

c) Fix the problem so that all 117 points are visible using the technique we discussed in class. Add the diagonal line y=x. Label the points for Stanford and Auburn. Include the R commands and the new plot along with the old. Put your name in the title of the plot.

5) Redo ICE #19 and #20 in Excel using the June data instead of the May data. Include your plot and your table. Put your name in the title of the plot.

Chapters 1 and 2 Homework - Due Tuesday 7/10:

1) Read Chapter 1 (all) and Chapter 2 (only sections 2.1, 2.2 and 2.3).

2) Redo In Class Exercises #1 and #2, but use different examples from those which we used in class.

3) Do Chapter 2 textbook problem #2 on page 89.

a) Read in the data in R using data<-read.csv("myfirstdata.csv",header=FALSE). Note, you first need to specify your working directory using the setwd() command. Determine whether each of the two attributes (columns) is treated as qualitative (categorical) or quantitative (numeric) using R. Explain how you can tell using R.

b) What is the specific problem that causes one of these two attributes to be read in as qualitative (categorical) when it seems it should be quantitative (numeric)?

c) Use the command plot() in R to make a plot for each column by entering plot(data[,1]) and plot(data[,2]). Because one variable is read in as quantitative (numeric) and the other as qualitative (categorical) these two plots are showing completely different things by default. Explain exactly what is being plotted in each of the two cases. Include these two plots in your homework.

d) Read the data into Excel. Excel should have no problem opening the file directly since it is .csv. Create a new column that is equal to the second column plus 10. What is the result for the problem observations (rows) you identified in part b?
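
A small sketch of how column types can be inspected after read.csv(). It writes a made-up four-row file to a temporary path, so neither the file contents nor the bad entry reflects myfirstdata.csv. Note that since R 4.0.0 read.csv() reads text columns as character by default. The sketch sets stringsAsFactors=TRUE to reproduce the older behavior that this question assumes.

```r
# Made-up file: the second column contains one non-numeric entry
path <- tempfile(fileext = ".csv")
writeLines(c("1,10", "2,20", "3,2o", "4,40"), path)

# stringsAsFactors = TRUE mimics the pre-4.0 default of read.csv()
data <- read.csv(path, header = FALSE, stringsAsFactors = TRUE)

sapply(data, class)     # storage class of each column
is.numeric(data[, 1])   # TRUE for a quantitative column
is.factor(data[, 2])    # TRUE for a column read in as categorical
```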

a) Read the data into R using data<-read.csv("onemillion.csv",header=FALSE). Note, you first need to specify your working directory using the setwd() command. Extract a simple random sample with replacement of 10,000 observations (rows). Show your R commands for doing this.

b) For your sample, use the functions mean(), max(), var() and quantile(,.25) to compute the mean, maximum, variance and 1st quartile respectively. Show your R code and the resulting values.

c) Compute the same quantities in part b on the entire data set and show your answers.

d) It is very likely that exactly one of your quantities in part c differs from the corresponding sample quantity in part b by more than 0.1. If this is true for your sample, say which quantity this is. If this is not true for your sample, repeat parts a and b until it is true and say which quantity it is.

e) Save your sample from R to a csv file using the command write.csv(). Then open this file with Excel and compute the mean, maximum, variance and 1st quartile. Provide the values and name the Excel functions you used to compute these.

f) Which of the Excel functions in part e give you a result which differs by more than 0.01 from the corresponding R function on the same data?

g) Exactly what happens if you try to open the full data set with Excel?
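
The sampling and summary steps in parts a, b and e can be sketched as follows, on simulated data rather than onemillion.csv. The single column name V1, the size of the simulated data set and the temporary output path are assumptions made for illustration.

```r
set.seed(1)
# Simulated stand-in for the full data set (assumed to have one numeric column)
full <- data.frame(V1 = rnorm(100000))

# Simple random sample of rows, drawn with replacement
idx  <- sample(nrow(full), size = 10000, replace = TRUE)
samp <- full[idx, , drop = FALSE]

# Mean, maximum, variance and 1st quartile of the sample
stats <- c(mean = mean(samp$V1), max = max(samp$V1),
           var = var(samp$V1), q1 = unname(quantile(samp$V1, 0.25)))
stats

# Save the sample so it can be opened in a spreadsheet
out <- tempfile(fileext = ".csv")
write.csv(samp, out, row.names = FALSE)
```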