VU FDA

Title: Assignment 4: Classification analysis

Due date: May 17 (09:45 — in class)

For all questions, please use R. The rmarkdown package (R Markdown) lets you mix code and text in a single Markdown document. This is known as literate programming and I think it's a really great idea!

Question 1 (SVM) (40 points)

For this assignment we will use libsvm. Instructions for installation are available on the libsvm website. To use it from R, install and load the e1071 package with the following commands:

install.packages("e1071") # install
library("e1071")          # load to use

The usage instructions can be found on the website. You will need the svm and predict.svm functions from this package. To see their parameters, open the help pages with ?svm and ?predict.svm; ??svm and ??predict.svm instead search all installed documentation for the term.
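As a rough orientation, the sketch below shows the basic fit-and-predict cycle with e1071. It is only an illustration under stated assumptions: it uses the built-in iris data as a stand-in (not voices.data), a plain random 50/50 split, and an arbitrary cost value.

```r
library(e1071)

# iris ships with R and stands in for the assignment data here (assumption).
set.seed(1)
train_idx <- sample(nrow(iris), nrow(iris) / 2)   # random 50/50 split
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Species is a factor, so svm() fits a classifier rather than a regression.
fit <- svm(Species ~ ., data = train, kernel = "linear", cost = 1)

# predict() dispatches to predict.svm for objects of class "svm".
pred <- predict(fit, newdata = test)

# Confusion table and test accuracy.
conf_mat <- table(predicted = pred, actual = test$Species)
accuracy <- mean(pred == test$Species)
print(conf_mat)
print(accuracy)
```

The kernel argument also accepts "polynomial" (with degree = 2 for a quadratic kernel) and "radial" (the Gaussian kernel); gamma, coef0 and cost are the usual parameters to experiment with.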

The data in voices.data consists of digitized frequencies of phonemes from five different classes. Each sample has 256 features, each of which can take values 0–255. The classes refer to what phoneme the person is speaking. There are more than five phonemes in all human languages I know, but five is complex enough for the assignment :)

(a) Produce a separate plot of the phoneme curves for each class (i.e. 5 plots). This should help you understand which frequencies are likely to be important for classification. Based on these graphs, describe which features you think are most important for classification (a short paragraph will be sufficient). Also, before doing part (b), write down whether you think a linear classifier is sufficient for this problem, and why or why not (again, a paragraph should be sufficient).
(b) Use the support vector machine classifier to classify the dataset. Use a 50/50 split into training data and test data, making sure that no speaker appears in both the training and the test set. Use a linear, a quadratic, and a Gaussian kernel for your classification and compare the results. Discuss the model performance for each type of kernel (1-2 sentences). For each of these kernels, feel free to experiment with the other parameters and report what worked better or worse and why (1-2 sentences).
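The requirement that no speaker ends up in both sets means the split has to be drawn over speakers rather than over rows. A minimal sketch in base R follows. The data frame is synthetic, and the column names speaker and class are hypothetical; the actual layout of voices.data may differ.

```r
# Synthetic stand-in for voices.data: 20 speakers with 5 samples each.
# The column names `speaker` and `class` are hypothetical.
set.seed(42)
n_speakers <- 20
voices <- data.frame(
  speaker = rep(seq_len(n_speakers), each = 5),
  class   = factor(rep(c("c1", "c2", "c3", "c4", "c5"), times = n_speakers))
)

# Sample half of the speakers, not half of the rows, so that no speaker
# contributes samples to both sets.
speakers       <- unique(voices$speaker)
train_speakers <- sample(speakers, length(speakers) %/% 2)

is_train <- voices$speaker %in% train_speakers
train <- voices[is_train, ]
test  <- voices[!is_train, ]

# Sanity check: the two speaker sets must be disjoint.
stopifnot(length(intersect(train$speaker, test$speaker)) == 0)
```

Note that splitting speakers 50/50 gives an exact 50/50 split of the samples only when every speaker has the same number of samples; otherwise the row counts are approximately equal.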

Question 2 (Neural network) (60 points)

For this part of the assignment we will be implementing our own nascent neural network library! Soon we will be unstoppable :) You can use the same dataset from question 1.

(a) Write a program to fit a neural network with a single hidden layer (10 hidden nodes) that is trained using backpropagation. For the stopping condition you can use either a variable number of iterations or a minimum change in error, but report which one you used.
(b) Apply your neural network program to the voices.data dataset. Use a cross-validation procedure with a 50/50 split of test and training data. Generate graphs of the test and training error of the neural network as a function of the number of training iterations.
(c) Vary the number of hidden units in the network from 1 to 10, and report how performance changes with the number of hidden units. Justify your conclusions (2-3 paragraphs). Again, you will need the graphs of the training and test performance for each network architecture to justify your conclusions.
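To make the mechanics concrete, here is a minimal sketch of backpropagation for a network with one hidden layer, written in base R. It is a simplified illustration, not a reference solution. It assumes a two-class toy problem with a single sigmoid output (the five-class data would need one output per class), batch gradient descent on half the mean squared error, a fixed number of iterations as the stopping condition, and arbitrary values for the learning rate and initial weights. The names forward and train_nn are made up for this example.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Forward pass: hidden activations H and network outputs out.
forward <- function(X, W1, b1, W2, b2) {
  H <- sigmoid(sweep(X %*% W1, 2, b1, "+"))   # n x n_hidden
  list(H = H, out = sigmoid(H %*% W2 + b2))   # out is n x 1
}

# Batch gradient descent; records train and test error at every iteration.
train_nn <- function(X, y, X_test, y_test,
                     n_hidden = 10, n_iter = 1000, lr = 1) {
  n  <- nrow(X)
  # Small random initial weights; biases start at zero.
  W1 <- matrix(rnorm(ncol(X) * n_hidden, sd = 0.5), ncol(X), n_hidden)
  b1 <- rep(0, n_hidden)
  W2 <- matrix(rnorm(n_hidden, sd = 0.5), n_hidden, 1)
  b2 <- 0
  train_err <- test_err <- numeric(n_iter)

  for (it in seq_len(n_iter)) {
    fw <- forward(X, W1, b1, W2, b2)
    train_err[it] <- mean((fw$out - y)^2)
    test_err[it]  <- mean((forward(X_test, W1, b1, W2, b2)$out - y_test)^2)

    # Backward pass: chain rule from the output layer to the hidden layer.
    d_out <- (fw$out - y) * fw$out * (1 - fw$out)    # n x 1
    d_hid <- (d_out %*% t(W2)) * fw$H * (1 - fw$H)   # n x n_hidden

    # Gradient step, averaged over the batch.
    W2 <- W2 - lr * (t(fw$H) %*% d_out) / n
    b2 <- b2 - lr * mean(d_out)
    W1 <- W1 - lr * (t(X) %*% d_hid) / n
    b1 <- b1 - lr * colMeans(d_hid)
  }
  list(train_err = train_err, test_err = test_err)
}

# Toy data: an XOR-like pattern that a linear classifier cannot separate.
set.seed(7)
n <- 200
X <- matrix(rnorm(n * 2), n, 2)
y <- matrix(as.numeric(X[, 1] * X[, 2] > 0), n, 1)
idx <- sample(n, n / 2)                              # 50/50 split
fit <- train_nn(X[idx, ], y[idx, , drop = FALSE],
                X[-idx, ], y[-idx, , drop = FALSE])

# Training and test error as a function of the iteration number.
matplot(cbind(fit$train_err, fit$test_err), type = "l", lty = 1,
        xlab = "iteration", ylab = "mean squared error")
legend("topright", legend = c("train", "test"), col = 1:2, lty = 1)
```

Calling train_nn with n_hidden set to each value from 1 to 10 and plotting the resulting curves side by side is one way to compare architectures.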

Procedure and Submission

Please submit a ZIP file with your answers to Moodle. Use the following naming scheme for your submission: “lastname_matrikelnumber_A4.zip”.

Late submission

Late submissions are NOT possible. Any assignment submitted late will receive zero points.

Academic Honesty