CS2613 Assignment 6


The goal of this assignment is to implement an approximate nearest-neighbour classifier. KNN (k-nearest neighbour) is a well-known classification algorithm. It assumes you are given a set of measurement vectors (e.g. from iris flowers), each labelled by class. An unknown measurement vector is then classified by finding the k closest training vectors and assigning the majority class among them.

In practice this can be slow if the number of training vectors is large. In situations where the number of features f is small but the number of training vectors is large, the following is a reasonable strategy:

- Split the feature space into equal-sized boxes.
- Count the number of training vectors of each class in each box.
- Label each box according to which class has the maximum number of training vectors in it.
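Putting these steps together, the strategy might look like the following sketch (all names here, such as train, labels, p and nclasses, are illustrative assumptions, and the inlined formulas correspond to the helper functions the assignment develops step by step):

```octave
% Illustrative sketch of the whole strategy; the inlined formulas
% correspond to the ranges/bucket/hash/tally functions developed below.
rand("state", 42);                         % fixed seed for reproducibility
p = 4; nclasses = 2;
train = [rand(20,2); 2 + rand(20,2)];      % two well-separated classes
labels = [ones(20,1); 2*ones(20,1)];

R = [min(train); max(train)];              % bounding box of the data
f = columns(train);
votes = zeros(p^f, nclasses);
for i = 1:rows(train)
  % which box does this training vector fall in?
  b = min(floor(p * (train(i,:) - R(1,:)) ./ (R(2,:) - R(1,:))), p - 1);
  h = 1 + sum(b .* p.^(0:f-1));            % box index as a single integer
  votes(h, labels(i)) += 1;                % count votes per box, per class
endfor
[m, boxclass] = max(votes, [], 2);         % majority class per box
boxclass(m == 0) = nclasses + 1;           % empty boxes are "unknown"

% classifying an unknown vector u is then just a table lookup
u = [0.5, 0.5];
b = min(floor(p * (u - R(1,:)) ./ (R(2,:) - R(1,:))), p - 1);
class = boxclass(1 + sum(b .* p.^(0:f-1)));
```

Classification requires no search over the training set, only a constant-time lookup; that is what makes this an approximate but fast alternative to KNN.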

General Instructions

Getting started

Splitting data

A fundamental rule of pattern classification is to never test your classifier with your training data. To avoid this problem, we will split our data set into two parts: one part for training, and one part for testing the classifier. Write a function randomsplit that splits the rows of a matrix randomly in the given ratio. One thing the following tests cannot show is that it is important to extract a random subset for training, since the rows of our dataset are sorted by class.

%!test
%! [a, b] = randomsplit(zeros(10,1), 0.5);
%! assert(a == zeros(5,1))
%! assert(b == zeros(5,1))

%!test
%! [foo, bar] = randomsplit(zeros(10,1), 0.7);
%! assert(foo == zeros(7,1))
%! assert(bar == zeros(3,1))
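One possible implementation, sketched here under the assumption that rounding (rather than truncation) decides the size of the training part:

```octave
% Sketch of randomsplit: shuffle the row indices with randperm, then
% cut the shuffled rows at round(ratio * n).  The rounding convention
% is an assumption; check it against the tests.
function [train, test] = randomsplit(A, ratio)
  n = rows(A);
  idx = randperm(n);            % random permutation of 1..n
  k = round(ratio * n);         % rows that go to the training part
  train = A(idx(1:k), :);
  test  = A(idx(k+1:end), :);
endfunction
```

Because the row order is randomized before cutting, the training part is a random subset even when the input rows are sorted by class.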

Finding the bounding box

We can think about our data as living in a box in some high dimensional space. The sides of this box are the max and min values for each coordinate. Write a function ranges that returns these values as a matrix, and passes the following tests

%!test
%! A=[0,0,0;
%!    0,1,0;
%!    0,0,1;
%!    1,1,1];
%! assert (ranges(A), [0,0,0;
%!                     1,1,1]);

%!test
%! A=[0,0,0;
%!    0,10,0;
%!    0,0,1;
%!    1,1,-11];
%! assert (ranges(A), [0,0,-11;
%!                     1,10,1]);
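Since Octave's min and max operate column-wise on matrices, one sketch of ranges is simply a two-row stack:

```octave
% Sketch of ranges: first row holds the per-column minima, second row
% the per-column maxima, matching the layout used in the tests above.
function R = ranges(A)
  R = [min(A); max(A)];
endfunction
```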

Bucketing vectors

The core of our classifier is a function to put vectors into a bucket. If we imagine the bounding box split along each axis into p slices, we get pᶠ boxes for f features. If f=2, this is just a grid of rectangles. Write a function bucket that calculates which of these boxes (indexed from [0,0,…,0] to [p-1,p-1,…,p-1]) a given vector is in. In order to keep the number of boxes from being bigger than needed, we force vectors with max values into the box with p-1 in that coordinate. To help you visualize the buckets/boxes under discussion, you can use boxplot.m.

In the tests, the three parameters are (v,ranges,p), where v is the vector to be bucketed, ranges are computed in the previous step, and p is the precision.
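A sketch of bucket under these conventions (handling of degenerate ranges where max equals min is left as an assumption; the clamp implements the rule that max values land in coordinate p-1):

```octave
% Sketch of bucket: scale each coordinate into [0, p), take the floor,
% and clamp so that v equal to the max lands in coordinate p-1, not p.
function b = bucket(v, R, p)
  lo = R(1,:);                           % per-coordinate minima
  hi = R(2,:);                           % per-coordinate maxima
  b = floor(p * (v - lo) ./ (hi - lo));  % slice index along each axis
  b = min(b, p - 1);                     % force max values into the last box
endfunction
```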

Computing a hash function

Looking at our classifier, we see that we actually need a single integer address for each bucket. Write a function hash that uses bucket, then further transforms the resulting vector into a number from 1 to pᶠ. Some tests are provided. This is done by treating the bucket coordinates as the digits of a base-p number, then adding 1 to obtain a valid Octave index.
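One way to sketch hash, with the bucket function from the previous step repeated here (as an assumed implementation) so the example is self-contained:

```octave
% bucket from the previous step (assumed implementation)
function b = bucket(v, R, p)
  b = min(floor(p * (v - R(1,:)) ./ (R(2,:) - R(1,:))), p - 1);
endfunction

% Sketch of hash: base-p positional encoding of the bucket coordinates
% (least-significant digit first), shifted by 1 to range over 1..p^f.
function h = hash(v, R, p)
  b = bucket(v, R, p);             % digits, each in 0..p-1
  f = numel(b);
  h = 1 + sum(b .* p.^(0:f-1));
endfunction
```

With p=2 and f=2 this maps the four boxes to the addresses 1 through 4.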

Tallying the votes

The final step to complete the classifier is to transform the array votes into something that can be used as a classifier. Each row has c columns, one per class. Write a function tally that returns the column with the most votes, or c+1 if no column has votes. In case of (positive) ties, pick arbitrarily.

%!test
%! A = [1,2,3;
%!      2,1,3;
%!      2,3,1;
%!      3,1,2;
%!      3,2,1;
%!      0,0,0];
%! assert (tally(A) == [3;3;2;1;1;4]);
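A sketch of tally using the second (index) output of max; it assumes vote counts are nonnegative, so a row maximum of zero means the row received no votes at all:

```octave
% Sketch of tally: per-row argmax over the columns; rows with no votes
% (all zero, assuming nonnegative counts) map to c+1.
function t = tally(votes)
  [m, t] = max(votes, [], 2);        % m: row maxima, t: their column indices
  t(m == 0) = columns(votes) + 1;    % no votes at all -> "unknown" class
endfunction
```

Note that max returns the first column achieving the maximum, which satisfies the "pick arbitrarily" rule for ties.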

Testing the classifier