CS2613 Assignment 6

Due 2022-12-08 23:59 AST

Overview

The goal of this assignment is to implement an approximate nearest neighbour classifier. KNN (k-nearest neighbour) is a well-known classification algorithm. It assumes you are given a set of measurement vectors (e.g. from iris flowers), each labelled by class. An unknown measurement vector is then classified by finding the k closest training vectors and taking a majority vote among their class labels.
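
For contrast with the approximate method developed in this assignment, here is a minimal sketch of brute-force KNN; the function name knn_classify, and the convention that column 1 of each training row holds the class label, are assumptions for illustration.

## Hypothetical brute-force KNN sketch, for comparison only.
## train is an n x (f+1) matrix whose first column is the class label;
## v is a 1 x f feature vector.
function label = knn_classify(train, v, k)
  dists = sqrt(sumsq(train(:,2:end) - v, 2));  # distance from v to every row
  [~, order] = sort(dists);
  label = mode(train(order(1:k), 1));          # majority vote of k nearest labels
endfunction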

In practice this can be slow if the number of training vectors is large. In situations where the number of features f is small but the number of training vectors is large, the following is a reasonable strategy: compute the bounding box of the training data, split it into a grid of pᶠ buckets, hash each training vector into a bucket and record a vote for its class there, and then classify an unknown vector by the majority class of the bucket it hashes to. The rest of this assignment builds this classifier step by step; an end-to-end sketch follows.
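
The whole pipeline might look like the following sketch, which relies on the helper functions ranges, hash, and tally developed below; the function name classify and the parameter c (the number of classes) are assumptions, and column 1 of every row is taken to hold a class label in 1..c.

## Hypothetical end-to-end sketch of the approximate classifier.
function labels = classify(train, test, p, c)
  r = ranges(train);
  f = columns(train) - 1;              # number of features
  votes = zeros(p^f, c);               # one row of vote counts per bucket
  for i = 1:rows(train)                # each training row votes for its class
    votes(hash(train(i,:), r, p), train(i,1)) += 1;
  endfor
  winners = tally(votes);              # majority class per bucket, c+1 if empty
  labels = zeros(rows(test), 1);
  for i = 1:rows(test)                 # classify each test row by its bucket
    labels(i) = winners(hash(test(i,:), r, p));
  endfor
endfunction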

General Instructions

Getting started

A fundamental rule of pattern classification is to never test your classifier with your training data. To avoid this problem, we will split our data set into two parts, one for training and one for testing the classifier. To get you started, and to serve as an example of loop-free style, here is a function randomsplit that randomly splits the rows of a matrix in a given ratio. One thing the following tests do not show is that it is important to extract a random subset for training, since the rows of our dataset are sorted by class.

## usage: given an n x k matrix data and frac ∈ [0,1], randomly split
## the rows of data into first (floor(frac*n) rows) and second (the rest)
function [first,second] = randomsplit(data, frac)
  shuffled = data(randperm(rows(data)),:);

  split = floor(frac*rows(shuffled));
  first = shuffled(1:split,:);

  second = shuffled(split+1:end,:);
endfunction

%!assert(randomsplit(zeros(10,1), 0.5) == zeros(5,1))

%!test
%! [foo, bar] = randomsplit(zeros(10,1), 0.7);
%! assert(foo == zeros(7,1))
%! assert(bar == zeros(3,1))
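
As a usage sketch (the file name iris.mat and the variable name data are assumptions), a 70/30 train/test split might look like:

## Hypothetical usage: iris.mat is assumed to define a labelled matrix data.
load iris.mat
[train, test] = randomsplit(data, 0.7);   # 70% for training, 30% for testing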

Finding the bounding box

We can think of our data as living in a box in some high-dimensional space. The sides of this box are given by the max and min values of each coordinate. Write a function ranges that returns these values as a two-row matrix (per-column minima in the first row, maxima in the second) and passes the following tests:

%!test
%! A=[0,0,0;
%!    0,1,0;
%!    0,0,1;
%!    1,1,1];
%! assert (ranges(A), [0,0,0;
%!                     1,1,1]);

%!test
%! A=[0,0,0;
%!    0,10,0;
%!    0,0,1;
%!    1,1,-11];
%! assert (ranges(A), [0,0,-11;
%!                     1,10,1]);
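
One loop-free way to satisfy these tests uses Octave's columnwise min and max; this is a sketch under the row convention above, not necessarily the intended solution.

## Sketch of ranges: row 1 holds the per-column minima, row 2 the maxima.
function r = ranges(data)
  r = [min(data); max(data)];
endfunction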

Bucketing vectors

The core of our classifier is a function that puts vectors into buckets. If we imagine the bounding box split along each axis into p slices, we get pᶠ boxes for f features. If f=2, this is just a grid of rectangles. Write a function bucket that calculates which of these boxes (indexed from [0,0,…,0] to [p-1,p-1,…,p-1]) a given vector is in. To keep the number of boxes from being bigger than needed, we force vectors with maximum values into the box with coordinate p-1. To help you visualize the buckets/boxes under discussion, you can use boxplot.m.

In the tests, the three parameters are (v, ranges, p), where v is the vector to be bucketed, ranges is the matrix computed in the previous step, and p is the precision.

In this part, the first coordinate (the class label) is ignored for the purposes of bucketing.
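
A minimal sketch of bucket consistent with these conventions; it assumes a nondegenerate bounding box, i.e. hi > lo in every feature coordinate.

## Sketch of bucket: column 1 (the class label) is ignored, and vectors
## with maximum values are forced into the box with coordinate p-1.
function b = bucket(v, ranges, p)
  lo = ranges(1, 2:end);
  hi = ranges(2, 2:end);                # assumes hi > lo in every coordinate
  b = min(floor(p * (v(2:end) - lo) ./ (hi - lo)), p - 1);
endfunction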

Computing a hash function

Looking at our classifier, we see that we actually need a single integer address for each bucket. Write a function hash that uses bucket, then further transforms the coordinate vector into a number from 1 to pᶠ. Some tests are provided. This is done by interpreting the bucket coordinates as the digits of a base-p number (plus 1, so that the result can serve as an Octave index).

In this part, the first coordinate (the class label) is ignored for the purposes of computing the hash function.
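
A sketch of hash along those lines, assuming bucket returns a row vector of coordinates in [0, p-1]:

## Sketch of hash: read the bucket coordinates as base-p digits (least
## significant first), then shift into 1..p^f for use as an Octave index.
function h = hash(v, ranges, p)
  b = bucket(v, ranges, p);
  h = 1 + sum(b .* p.^(0:numel(b)-1));
endfunction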

Tallying the votes

The final step to complete the classifier is to transform the array votes into something that can be used as a classifier. Each row of votes has c columns, one per class. Write a function tally that returns, for each row, the column with the most votes, or c+1 if the row has no votes. In case of (positive) ties, pick arbitrarily.

%!test
%! A = [1,2,3;
%!      2,1,3;
%!      2,3,1;
%!      3,1,2;
%!      3,2,1;
%!      0,0,0];
%! assert (tally(A) == [3;3;2;1;1;4]);
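
A loop-free sketch that passes this test; max breaks ties by taking the first column, which is one arbitrary choice.

## Sketch of tally: the winning column per row, or c+1 for all-zero rows.
function t = tally(votes)
  [m, t] = max(votes, [], 2);       # per-row maximum and its column index
  t(m == 0) = columns(votes) + 1;   # rows with no votes get class c+1
endfunction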

Testing the classifier