Based on feedback, we are keeping the new format. Further, we will experiment
with having only one deadline, which, for this assignment, is Thursday. No
summaries are required for this week as the programming assignment is a bit
longer than others.
NOTE A FEW CHANGES IN THE FOLLOWING PREAMBLE
-----------------------------------------------------------------------------
Unless otherwise specified, questions have unit value. The total value of the
assignments from each week will vary substantively.
Recall that assignments are graded rather loosely on effort, and that 3/4 of
the total marks (1/2 for ugrads) over all assignments over all weeks
represents 100%. This policy is in place partly to allow for error in the
grading approach which, by necessity, is somewhat subjective, and needs to be
done somewhat superficially. It is recommended (and requested) that you try to
overshoot the 3/4 requirement, rather than worry about the details of how the
grading is done.
Problems denoted EXTRA can be substituted for other problems, or done in
addition, but they do not count towards the computation of the 3/4
requirement. They may be discussed in class depending on time and interest.
They are problems that I think might be useful, and likely be assigned if we
had more time per chapter.
Sometimes you will explicitly have to choose some of your own problems. Even
when this is not the case, you can substitute some problems in the book for
non-programming assignments if they appear more helpful to you. For now, limit
the number of substitutions to 50% of what you hand in. This parameter may be
increased or decreased as we go on.
You are encouraged to discuss the problems with your peers, but I would like
individual final submissions demonstrating effort and understanding of what
was done. If you end up working closely with someone on a problem set, make a
note on your submission saying who it was. For programming assignments, each
person should turn in their own work.
Since this is graduate level research course that is graded predominately on
effort, I am confident that there will not be any problems with academic
honesty. However, do note that non-negligible deviations are often
surprisingly easy to spot, and can be verified by discussing the submitted
solutions with the student.
-----------------------------------------------------------------------------
Problems for Week 5.
Total value is 11.
Due Thursday. It is possible that we will discuss the first 3 in Tuesday's
class, so it is appreciated that the majority of the class has begun to think
about them by Tuesday.
-----------------------------------------------------------------------------
DOUBLE value.
1. Propose a problem where the concept of a missing variable is a useful way
to think about the problem. There are many good ones. Try to think of one
that is not simply clustering or GMM in disguise. Explain a bit your model.
Try to draw a graphical representation for it.
DOUBLE value.
2. Fill in some of the details in the derivations of 9.17, 9.19, and 9.22.
3. 9.4. This tweaks EM to include a prior over parameters.
Value is SIX.
4. Implement EM for the Gaussian mixture model in the case of
diagonal covariance where the variance of each feature (here, X and Y) is
the same for each clusters. IE, we are assuming that the clusters are
Gaussian "balls" of various sizes. Make the number of clusters, K, easy to
set. You can deal with the problem described on page 434 by adding a
small, fixed, amount of variance to the estimated variance at each
iteration. Your program should output the log-likelihood at each
iteration. It must go up. If it does not always go up, then you have a
bug.
Use your program to cluster the 2D data linked here.
Begin by plotting these points. First state how many clusters you think are
reasonable. Then try to fit that many clusters.
Initialization is always a problem with EM. You may want to experiment a
bit. One method that should work on this data is to randomly assign each
point to a cluster.
You also want to make sure that you try to use sufficient iterations so
that you are close to convergence, at least until you understand the
problem (then you might back off the number of iterations to reduce the
chances of over-fitting). For this data, I suggest that you push it towards
100 iterations. This helps check that the log-likelihood always goes up.
Your program should assign a cluster label to each point based on the
cluster that has maximum responsibility for explaining the point. In
other words, the cluster that yields highest probability for that point.
Your program should plot the points with different colors (or symbols) for
each cluster assignment (see Figure 9.5, but no need to blend the colors
unless you think it is fun to do). Do you think that the program generally
works? Provide a plot of an example where it is working (unless you find
that it never does).
Now try different numbers of clusters. If you increase the number of
clusters sufficiently, you may find that not every cluster has a
representative point. In such a case, you would have fewer colors/symbols
than clusters. Regardless of whether you observe it, explain why this is
possible.