Assignment 3: Pen and paper exercises

Due Date: May 17 (09:45 – in class)


  1. (30 points)

    The CEO of a wholesale company comes to you because they want to predict when their customers are going to order a new desk. This way they can have enough on hand to fill demand. They figure that they can predict this based on other items the customer has ordered in the past. Specifically, they sampled i.i.d. customers; for customer \(i\), let \(x_i \subset \{1,\ldots,d\}\) denote the subset of items the customer bought, and let \(t_i \in \{\pm 1\}\) be the label indicating whether this customer ordered a desk. As prior knowledge, the CEO knows that there are \(k\) items such that if the customer ordered at least one of them, then the customer will also order a desk; that is, the class label is 1 iff the customer bought at least one of these \(k\) items. Sadly, the names of these \(k\) items are not known (that’s why they hired you). In addition, according to regulations, each customer can buy at most \(s\) items. Help the CEO design a learning algorithm whose time complexity is polynomial in \(s\) and \(k\). An illustrative sketch of this setting is given after part 3 below.

    1. What representation would you use for each observation \(x_i\) in memory?

    2. Write down your algorithm given this representation. Hint: think about how you would find the distinguishing features quickly.

    3. Justify why this algorithm can take polynomial time by referring to specific steps in your algorithm.
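
    The following is a minimal, illustrative Python sketch of the setting only, not a model solution: it shows one way a purchased subset could be stored as a \(d\)-dimensional indicator vector and how the stated rule (a desk is ordered iff at least one of the \(k\) special items was bought) generates labels. All concrete numbers and names (`d`, `special_items`, `baskets`) are made up.

    ```python
    import numpy as np

    # Hypothetical numbers, purely for illustration (items are indexed 0..d-1 here,
    # whereas the exercise indexes them 1..d).
    d = 20                  # total number of distinct items
    special_items = {3, 7}  # the k items that trigger a desk order (unknown to the learner)
    s = 5                   # regulatory limit: each customer buys at most s items

    def to_indicator(basket, d):
        """Store a purchased subset of items as a length-d 0/1 indicator vector."""
        x = np.zeros(d, dtype=int)
        x[list(basket)] = 1
        return x

    def true_label(basket):
        """Labelling rule from the exercise: +1 iff at least one special item was bought."""
        return 1 if basket & special_items else -1

    baskets = [{3, 5}, {1, 2, 4}, {7, 11, 12}]           # each basket has at most s items
    X = np.stack([to_indicator(b, d) for b in baskets])  # shape (3, d)
    t = np.array([true_label(b) for b in baskets])       # array([ 1, -1,  1])
    print(X.shape, t)
    ```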

  2. (25 points)

    Consider a very simple dataset consisting of just two points. Each point has a single feature and there are two classes: (1) when \(x_1 = -\sqrt{2}\) then \(t_1 = -1\), and (2) when \(x_2 = 3\sqrt{2}\) then \(t_2 = 1\). Even though we only have two points, we decide to map them to a three-dimensional feature space using the feature map of a second-order polynomial kernel, \(\phi(x) = [1,\sqrt{2}x, x^2]^T\). The max-margin classifier has the form \[\begin{align} \min_{\mathbf{w}, w_0}\ & \lVert \mathbf{w} \rVert^2 \quad \text{s.t.} \\ & t_1(\mathbf{w}^T \phi(x_1) + w_0) \ge 1 \\ & t_2(\mathbf{w}^T \phi(x_2) + w_0) \ge 1 \end{align}\] A short numerical sketch for checking candidate solutions against these constraints is given after part 5 below.

    1. Give an example of a vector that is parallel to the vector \(\mathbf{w}\). Hint: Think about how \(\mathbf{w}\) relates to the decision boundary.

    2. How large is the margin for the optimal decision boundary? Hint: recall that the margin is the distance from each support vector to the decision boundary. Hint 2: think about the geometry of 2 points in space, with a line separating one from the other.

    3. Given that the margin is equal to \(1/\lVert \mathbf{w} \rVert\), solve for \(\mathbf{w}\).

    4. Now that you have a value for \(\mathbf{w}\), find the value for \(w_0\). You will also need the above constraints. Hint: the support vectors lie exactly on the margin, so the constraints hold with equality.

    5. What is the discriminant function, \(f(x) = w_0 + \mathbf{w}^T \phi(x)\), as an explicit function of \(x\)?
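
    For convenience, here is a small NumPy sketch (not part of the required answer) that evaluates the given feature map on the two training points and checks whether a candidate pair \((\mathbf{w}, w_0)\) satisfies the two margin constraints. The candidate used in the usage example is arbitrary and feasible, not the optimum you are asked to find.

    ```python
    import numpy as np

    def phi(x):
        """Feature map from the exercise: phi(x) = [1, sqrt(2)*x, x**2]."""
        return np.array([1.0, np.sqrt(2) * x, x**2])

    # The two training points and their labels.
    X = np.array([-np.sqrt(2), 3 * np.sqrt(2)])
    T = np.array([-1.0, 1.0])

    def satisfies_constraints(w, w0):
        """Check t_i (w^T phi(x_i) + w0) >= 1 for both points."""
        margins = T * (np.array([phi(x) for x in X]) @ w + w0)
        return np.all(margins >= 1 - 1e-9), margins

    # Usage example with an arbitrary (not optimal) candidate:
    ok, margins = satisfies_constraints(np.array([0.0, 1.0, 0.0]), 0.0)
    print(ok, margins)  # prints True and the two margin values (both >= 1)
    ```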

  3. (35 points)

    Consider a classification problem in which each observation \(x_n\) is known to belong to one of three classes, corresponding to \(t = 0\), \(t = 1\), and \(t = 2\), and suppose that the procedure for collecting training data is imperfect, so that training points are sometimes mislabelled. For every data point \(x_n\), instead of having a value \(t\) for the class label, we instead have values \(\pi_{n0}\), \(\pi_{n1}\), and \(\pi_{n2}\) representing the probabilities that \(t_n = 0\), \(t_n = 1\), and \(t_n = 2\), respectively. Note that \(\pi_{n0} + \pi_{n1} + \pi_{n2} = 1\). Given probabilistic models \(p(t = 0 | \phi)\), \(p(t = 1 | \phi)\), and \(p(t = 2 | \phi)\), write down the log-likelihood function appropriate to such a data set.
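
    As background only: the sketch below evaluates a three-class probabilistic model \(p(t = k | \phi)\) on a design matrix, assuming a linear softmax model purely for illustration (the exercise does not fix the form of the model). The soft labels \(\pi_{nk}\) and these class probabilities are the ingredients of the log-likelihood you are asked to write down; the code deliberately stops short of that final step.

    ```python
    import numpy as np

    def softmax(a):
        """Numerically stable softmax over the last axis."""
        a = a - a.max(axis=-1, keepdims=True)
        e = np.exp(a)
        return e / e.sum(axis=-1, keepdims=True)

    def class_probabilities(Phi, W):
        """p(t = k | phi_n) for an (assumed) linear three-class softmax model.

        Phi: (N, M) design matrix of basis-function values phi(x_n).
        W:   (M, 3) weight matrix, one column per class.
        """
        return softmax(Phi @ W)

    # Toy usage: N = 4 points, M = 2 basis functions, 3 classes (all values made up).
    rng = np.random.default_rng(0)
    Phi = rng.normal(size=(4, 2))
    W = rng.normal(size=(2, 3))
    Pi = np.array([[0.8, 0.1, 0.1],    # soft labels pi_{n0}, pi_{n1}, pi_{n2}
                   [0.2, 0.7, 0.1],
                   [0.1, 0.1, 0.8],
                   [1/3, 1/3, 1/3]])
    P = class_probabilities(Phi, W)    # the p(t = k | phi_n) that enter your answer
    print(P.shape, P.sum(axis=1))      # (4, 3); each row sums to 1
    ```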

  4. (10 points)

    Let’s investigate the curse of dimensionality. A small simulation sketch for sanity-checking your answers is given after the sub-questions below.

    1. Let’s assume we have a number of observations, each consisting of just one feature. This feature can take any value in the range \([0,1]\). Also, assume that the observations are uniformly distributed in this space. Associated with each observation is a class. We want to investigate what will happen if we base our prediction only on observations that are within a neighborhood of 0.03 on either side (this creates a range of 0.06). For example, for an observation where \(x = 0.35\) we will base our classification on observations in the range \(0.32 \le x \le 0.38\). On average, what fraction of the available observations will we use to make the prediction?

    2. Repeat part a for observations with 2 features, \(x_1\) and \(x_2\). In this case assume that we base our prediction on observations that lie within a range of 0.06 (0.03 on either side) in both \(x_1\) and \(x_2\). On average, what fraction of the samples will we use to make the prediction?

    3. Repeat part a for a feature vector of 100 features per observation. If we again use the samples within a range of 0.06 in all features, what fraction of the samples will we use to make the prediction?

    4. Use your results from parts a, b, and c to explain the curse of dimensionality. Make sure to include a discussion of what patterns, if any, you observed between parts a, b, and c, as well as a formula describing this pattern.

    5. (10 bonus points) Describe one way to get around the curse of dimensionality. Write down how this would change the fractions in parts a, b, and c.
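
    Not required for the exercise, but the following small Monte Carlo sketch (Python/NumPy; all parameter names are made up) can be used to sanity-check your answers to parts a–c: it estimates the fraction of uniformly distributed observations that fall inside a neighborhood of width 0.06 per feature around an interior query point, for \(p\) features.

    ```python
    import numpy as np

    def neighborhood_fraction(p, width=0.06, n_samples=100_000, seed=0):
        """Estimate the fraction of uniform observations in [0,1]^p that fall
        within width/2 on either side of an interior query point in every feature."""
        rng = np.random.default_rng(seed)
        query = np.full(p, 0.5)                   # interior point, so no edge effects
        points = rng.uniform(size=(n_samples, p))
        inside = np.all(np.abs(points - query) <= width / 2, axis=1)
        return inside.mean()

    for p in (1, 2, 100):
        # For p = 100 essentially no sample lands in the neighborhood.
        print(p, neighborhood_fraction(p))
    ```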

Total: 100 points

Procedure and Submission

Please submit a PDF document with your answers to Moodle. Use the following naming scheme for your submission: “lastname_matrikelnumber_A3.pdf”.

Late submission

Late submissions are NOT possible. Any assignment submitted late will receive zero points.

Academic Honesty