Lab 20

Before the lab

Linear Algebra

One of the main features of Octave we will discuss is vectorization. To understand it, we need some background material on linear algebra. If you've taken a linear algebra course recently, this should be easy; otherwise you will probably need some review.

Pattern Classification / Machine Learning


Fibonacci as matrix multiplication I/II

Time: 10 minutes
Activity: Discussion

The following is a well-known identity about the Fibonacci numbers F(n).

[ 1, 1;
  1, 0 ]^n = [ F(n+1), F(n);
               F(n),   F(n-1) ]

This can be proven by induction; the key step is to consider the matrix product

[ F(n+1), F(n);
  F(n),   F(n-1) ] * [ 1, 1;
                       1, 0 ]
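Multiplying this out and applying the recurrence F(n+1) = F(n) + F(n-1) to each entry gives

[ F(n+1)+F(n), F(n+1);     [ F(n+2), F(n+1);
  F(n)+F(n-1), F(n)   ]  =   F(n+1), F(n)   ]

which is the right hand side of the identity with n replaced by n+1.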

Fibonacci as matrix multiplication II/II

Since matrix exponentiation is built into Octave, the formula above is particularly easy to implement.

Save the following as ~/cs2613/labs/L19/matfib.m, and fill in the two matrix operations needed to complete the algorithm.

function ret = matfib(n)
  A = [1,1; 1,0];


endfunction

%!assert (matfib(0) == 0);
%!assert (matfib(1) == 1);
%!assert (matfib(2) == 1);
%!assert (matfib(3) == 2);
%!assert (matfib(4) == 3);
%!assert (matfib(5) == 5);
%!assert (matfib(6) == 8);
%!assert (matfib(25) == 75025);
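If you get stuck, here is one possible completion (a sketch; try working out the two operations yourself first). By the identity above, A^n contains F(n) as its off-diagonal entry:

function ret = matfib(n)
  A = [1,1; 1,0];
  B = A^n;         % by the identity, B = [F(n+1), F(n); F(n), F(n-1)]
  ret = B(1,2);    % F(n) is the off-diagonal entry
endfunction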

We can expect the second Fibonacci implementation to be faster for two distinct reasons; compare the two implementations with the following benchmark.

function bench3
  timeit(10000, @tabfib, 42)   % the earlier tabfib implementation
  timeit(10000, @matfib, 42)   % the new matrix-power implementation
endfunction

Questions for your journal

K nearest neighbour

Time: 25 minutes
Activity: Individual

The k-nearest neighbour algorithm (KNN) is a famous machine learning / pattern classification algorithm. In this algorithm you are given a training set of numerical measurement vectors (e.g. lab attendance and height), each labelled with a class (e.g. final grade). Given a new, unlabelled measurement, the algorithm predicts the corresponding class by looking at the k closest training vectors and voting.

In this lab we will work with a famous set of measurements of iris flowers, where the "class" is the species of flower.

You need to know the following definition from linear algebra:

distance(X,Y) = norm(X-Y)
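For example, in Octave:

norm([0,0] - [3,4])   % ans = 5, the distance between the points (0,0) and (3,4)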

Complete the following function, which returns the row indices of the k closest rows of data to v. In our data sets, we will assume the first column is always the class label; it should be ignored in distance calculations.

function out = nearest(v, k, data)
  for i=1:rows(data)
    dist(i)=
  endfor
  [sorted, indices]=sort(dist);
  out = sort(indices(1:k));
endfunction

%!test
%! v = [0,1,1];
%! data = [0,0,0; 0,0,1; 0,1,0; 1,1,1];
%! assert(nearest(v,1,data) == 4)
%! assert(nearest(v,3,data) == [2,3,4])
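One possible way to fill in the distance computation (a sketch; remember that column 1 of data, and the first entry of v, hold the class label):

function out = nearest(v, k, data)
  for i=1:rows(data)
    % distance from v to row i, comparing measurements only
    dist(i) = norm(v(2:end) - data(i,2:end));
  endfor
  [sorted, indices] = sort(dist);
  out = sort(indices(1:k));
endfunction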

You can time your nearest neighbour query with the following:

load training
load testing
timeit(1000,@nearest,testing(1,:), 3, training);

K nearest neighbours continued

Time: 10 minutes
Activity: Demo

In practice the function nearest would probably be combined with something like the following; they are split out here to help with debugging and testing. Use the parameter fun rather than calling nearest directly, so that we can swap in a different query function later.

function out=findclass(v, k, data, fun=@nearest)
    selection=
    out = mode(data(selection,1));
endfunction

%!test
%! v = [0,1,1];
%! data = [0,0,0; 0,0,1; 0,1,0; 1,1,1];
%! assert(findclass(v,1,data) == 1)
%! assert(findclass(v,3,data) == 0)
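A possible completion: ask fun for the k nearest rows, then take the most common class label among them:

function out = findclass(v, k, data, fun=@nearest)
  selection = fun(v, k, data);      % row indices of the k nearest training vectors
  out = mode(data(selection,1));    % majority vote over their class labels
endfunction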

Here is a for-loop based version of a kNN classifier. The return value is the success rate of the classifier (i.e. the fraction of the time it gets the right answer on the test set).

function ratio=knn(k,training, testing, output=0)

  correct = 0;
  for i=1:rows(testing)
    row=testing(i,:);
    if findclass(row,k,training) == row(1)
      correct += 1;
    endif
  endfor

  ratio = correct/rows(testing);
endfunction

You can try out the classifier with

load training
load testing
knn(3,training,testing)

Vectorizing nearest

Time: 25 minutes
Activity: Individual

We need a couple of ideas from basic linear algebra to make this vectorization work. We can make many copies of the input vector v by multiplying it on the left by a column vector of ones (no loop required). Try this in Octave; you should get a 10x4 matrix, each of whose rows is [1,2,3,4]:

ones(10,1)*[1,2,3,4]

As a convenience, we can also sort by the square of the norm, which is simply the dot product of a vector with itself; since distances are non-negative, squaring does not change their order.

function out = nearest2(v, k, data)

  diff = data(       ) - ones(          ,1)*v(2:end);
  normsq = dot(            )
  [sorted, indices]=sort(normsq);
  out = sort(transpose(indices(1:k)));
endfunction

%!test
%! v = [0,1,1];
%! data = [0,0,0; 0,0,1; 0,1,0; 1,1,1];
%! assert(nearest2(v,1,data) == 4)
%! assert(nearest2(v,3,data) == [2,3,4])
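One way to fill in the blanks (a sketch): replicate v across the rows using the ones trick, then use dot along dimension 2 to get each row's squared distance:

function out = nearest2(v, k, data)
  % subtract a copy of v (without its label) from every measurement row at once
  diff = data(:,2:end) - ones(rows(data),1)*v(2:end);
  % squared length of each row of diff: row-wise dot product with itself
  normsq = dot(diff, diff, 2);
  [sorted, indices] = sort(normsq);
  out = sort(transpose(indices(1:k)));
endfunction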

Finally, check the speedup with this second knn function.

function ratio=knn2(k,training, testing, output=0)

  correct = 0;
  for i=1:rows(testing)
    row=testing(i,:);
    if findclass(row,k,training,@nearest2) == row(1)
      correct += 1;
    endif
  endfor

  ratio = correct/rows(testing);
endfunction

You can use the following lines in bench.m:

timeit(10,@knn,3,training,testing);
timeit(10,@knn2,3,training,testing);

Before next lab

Read