Due 2022-12-08 23:59 AST
Overview
The goal of the assignment is to implement an approximate nearest
neighbour classifier.
KNN (k-Nearest Neighbour) is a well-known classification algorithm. It
assumes you are given a set of measurement vectors (e.g. from iris
flowers), each labelled by class. An unknown measurement vector is then
classified by finding the k closest training vectors and taking the
majority class among them.
In practice this can be slow if the number of training vectors is large.
In situations where the number of features f is small, but the number of
training vectors is large, the following is a reasonable strategy
(sketched in code after this list):
- Split the feature space into equal sized boxes.
- Count, for each class, the number of training vectors in each box.
- Label each box according to which class has the maximum number of training vectors in it.
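To make the strategy concrete, here is a rough, hypothetical sketch of how the pieces of this assignment (randomsplit, ranges, hash and tally, all described below) might fit together. It is not the provided classify.m, which is the authoritative script and may be organised quite differently; it also assumes the class label is an integer from 1 to c stored in the first column, with f feature columns after it, and that p and c are set elsewhere.
[train, test] = randomsplit(data, 0.7);   # hold out part of the data for testing
R = ranges(train);                        # bounding box of the training data
f = columns(train) - 1;                   # number of features (label excluded)
votes = zeros(p^f, c);                    # one row per box, one column per class
for i = 1:rows(train)                     # written with a loop for clarity only
  votes(hash(train(i,:), R, p), train(i,1)) += 1;
endfor
labels = tally(votes);                    # labels(h) is the predicted class of box h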
General Instructions
- Every function you write for this assignment should have a usage block giving roughly the same information as a Python function docstring. You can test if it's working by running Octave's help command on your function. Make sure there's no blank line between the function and the usage block.
- For full credit, all functions you write should be loop-free.
- Comment your code appropriately. When writing comments for loopless code, it is more important to explain the intent of a line of code than to repeat the documentation for a built-in function.
- There are several tests given to help you understand the requirements for the functions. It's up to you to decide if they are enough for a given function.
- The marking rubric is available.
Getting started
- Save the iris data set as ~/fcshome/cs2613/assignments/A6/iris.csv
- Save the classifier script as ~/fcshome/cs2613/assignments/A6/classify.m
A fundamental rule of pattern classification is to never test your
classifier with your training data. To avoid this problem, we will
split our data set into two parts, one part for training, and one part
for testing the classifier. To get you started, and to serve as an
example of loop-free style, here is a function randomsplit that splits
the rows of a matrix randomly in the given ratio. One thing that's not
shown in the following tests is that it's important we extract a
random subset for training, as the rows of our dataset are sorted by
class.
## usage: given an n x k matrix data, and frac ∈ [0,1],
## return in first a random selection of floor(frac*n) rows of data,
## and in second the remaining rows
function [first,second] = randomsplit(data, frac)
  shuffled = data(randperm(rows(data)),:);  # random permutation of the rows
  split = floor(frac*rows(shuffled));       # how many rows go in the first part
  first = shuffled(1:split,:);
  second = shuffled(split+1:end,:);
endfunction
%!assert(randomsplit(zeros(10,1), 0.5) == [zeros(5,1),zeros(5,1)])
%!test
%! [foo, bar] = randomsplit(zeros(10,1), 0.7);
%! assert(foo == zeros(7,1))
%! assert(bar == zeros(3,1))
Finding the bounding box
We can think about our data as living in a box in some high
dimensional space. The sides of this box are the max and min values of
each coordinate. Write a function ranges that returns these values as
a matrix, and passes the following tests:
%!test
%! A=[0,0,0;
%! 0,1,0;
%! 0,0,1;
%! 1,1,1];
%! assert (ranges(A), [0,0,0;
%! 1,1,1]);
%!test
%! A=[0,0,0;
%! 0,10,0;
%! 0,0,1;
%! 1,1,-11];
%! assert (ranges(A), [0,0,-11;
%! 1,10,1]);
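One possible loop-free sketch of ranges (not necessarily the only acceptable version) relies on the fact that Octave's min and max reduce along columns when given a matrix:
## usage: ranges(A)
## return a 2 x k matrix whose first row holds the column-wise
## minimums of A and whose second row holds the column-wise maximums
function R = ranges(A)
  ## min and max operate column-wise by default, so no loop is needed
  R = [min(A); max(A)];
endfunction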
Bucketing vectors
The core of our classifier is a function to put vectors into a
bucket. If we imagine the bounding box split along each axis into p
slices, we get pᶠ boxes for f features. If f=2, this is just a grid of
rectangles. Write a function bucket that calculates which of these
boxes (indexed from [0,0,…,0] to [p-1,p-1,…,p-1]) a given vector is
in. To keep the number of boxes from being bigger than needed, we
force vectors with max values into the box with coordinate p-1 along
that axis. To help you visualize the buckets/boxes under discussion,
you can use boxplot.m.
In the tests, the three parameters are (v, ranges, p), where v is the
vector to be bucketed, ranges is the matrix computed in the previous
step, and p is the precision.
In this part, the first coordinate (the class label) is ignored for the purposes of bucketing.
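A minimal sketch of bucket under these assumptions is shown below. It drops the first (class label) coordinate of both v and ranges inside the function; if your tests expect the caller to strip the label instead, remove that step.
## usage: bucket(v, ranges, p)
## return the box (one coordinate per feature, each in 0..p-1)
## containing the vector v, ignoring the class label in column 1
function b = bucket(v, ranges, p)
  lo = ranges(1, 2:end);                 # per-feature minimums, label dropped
  hi = ranges(2, 2:end);                 # per-feature maximums, label dropped
  x  = v(2:end);                         # the feature part of v
  ## scale each feature to [0,1], stretch by p, and round down to a
  ## slice index; min(..., p-1) forces max values into the last box
  b = min(floor((x - lo) ./ (hi - lo) * p), p - 1);
endfunction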
Computing a hash function
Looking at our classifier, we see that we actually need a single
integer address for each bucket. Write a function hash that uses
bucket, then further transforms the vector to a number from 1 to pᶠ.
Some tests are provided. This is done by
- interpreting the output from bucket as a number in base p, least significant digit first
- adding 1 so the result can be used as an Octave array index.
In this part, the first coordinate (the class label) is ignored for the purposes of computing the hash function.
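Continuing the sketch from the previous section, and with the same caveat about the label column, hash might look something like this:
## usage: hash(v, ranges, p)
## map the bucket containing v to a single index in 1..p^f by reading
## the bucket coordinates as base-p digits, least significant first
function h = hash(v, ranges, p)
  b = bucket(v, ranges, p);
  ## the dot product with successive powers of p converts the digit
  ## vector to an integer in 0..p^f-1; adding 1 gives a valid index
  h = dot(b, p .^ (0:columns(b)-1)) + 1;
endfunction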
Tallying the votes
The final step to complete the classifier is to transform the array
votes into something that can be used as a classifier. Each row has c
columns, one per class. Write a function tally that returns, for each
row, the column with the most votes, or c+1 if no column has votes. In
case of (positive) ties, pick arbitrarily.
%!test
%! A = [1,2,3;
%! 2,1,3;
%! 2,3,1;
%! 3,1,2;
%! 3,2,1;
%! 0,0,0];
%! assert (tally(A) == [3;3;2;1;1;4]);
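One loop-free sketch of tally that passes the test above: max along the second dimension returns both the winning count and its column (picking the first column on ties), and all-zero rows are then remapped to c+1.
## usage: tally(votes)
## for each row of votes, return the index of the column with the
## most votes, or columns(votes)+1 when the row has no votes at all
function t = tally(votes)
  [best, t] = max(votes, [], 2);         # winning count and column, per row
  t(best == 0) = columns(votes) + 1;     # rows with no votes get class c+1
endfunction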
Testing the classifier
- With the given parameters, your classifier should get the right answer between 60 and 80 percent of the time.
- Try different values of p. What do you notice about accuracy? How about run time?
- Try smaller or larger training sets. Does this have an effect on accuracy? What would be a hazard of using too large a training set? You can read more about cross validation.
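As a rough, hypothetical way of measuring accuracy on the held-out rows, reusing the variables from the sketch in the Overview (labels, R, p, and the test split, again with integer labels in column 1):
## classify every test row by looking up the label of its box;
## arrayfun avoids an explicit loop over the test rows
predicted = labels(arrayfun(@(i) hash(test(i,:), R, p), 1:rows(test)));
accuracy = mean(predicted(:) == test(:,1))   # fraction classified correctly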