CS2613 Lab 23

Background

Cell Arrays

Time
15 minutes
Activity
Demo

So far we have mainly concentrated on arrays, which in Octave contain only numbers. These have efficiency advantages for certain computations, but they're not very flexible. Roughly speaking, cell arrays are like multidimensional versions of Python's lists: they support elements of different types, along with random access.
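
For example, here is a small sketch (not part of the lab exercises) showing mixed element types and the difference between the two kinds of cell array indexing:

    c = {1, "two", [3,4,5]};
    class(c{2})   # curly braces extract the element itself, here a string
    class(c(2))   # parentheses return a (sub) cell array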

One slightly surprising but useful application of cell arrays is as lists of vectors.

    data = [0,0,0; 0,0,1; 0,1,0; 1,1,1]
    cells = num2cell(data,2)
    cellfun(@sum,cells)
Complete the following function rowfun, which applies fun to each row of mat, using the cell array functions from the demo above:

function out=rowfun(fun,mat)


endfunction

%!shared data
%! data = [0,0,0; 0,0,1; 0,1,0; 1,1,1];

%!assert(rowfun(@norm,data), [0;1;1;sqrt(3)], eps)

%!assert(rowfun(@(v)(dot(v,v)),data), [0;1;1;3], eps)

%!assert(rowfun(@prod,data), [0;0;0;1], eps)
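
One possible completion of rowfun, combining num2cell and cellfun as in the demo above (a sketch; other solutions are possible):

function out=rowfun(fun,mat)
  # split mat into a cell array containing one cell per row,
  # then apply fun to the vector in each cell
  cells = num2cell(mat,2);
  out = cellfun(fun,cells);
endfunction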

Variable-length argument lists

Time
15 minutes
Activity
Demo

It turns out that the function timeit we used in Lab 21 is not quite general enough, since it works only for single-argument functions. Octave provides a special input argument varargin, which collects all remaining arguments into a cell array, providing similar functionality to the . rest argument in Racket, or the ...rest parameter in JavaScript ES2015.

Use varargin to complete the following code for an updated timeit function (one possible completion is sketched after the tests below).

# Based on an example from the Julia microbenchmark suite.

function timeit(reps, func,         )
    times = zeros(reps, 1);

    for i=1:reps
      tic(); func(             ); times(i) = toc();
    end

    times = sort(times);
    fprintf ('%s\tmedian=%.3fms mean=%.3fms total=%.3fms\n', func2str(func), median(times)*1000,
             mean(times)*1000, sum(times)*1000);
endfunction

%!test "nullary function"
%! timeit(10000,@rand)

%!test "unary function"
%! timeit(10000,@rand,1)

%!test "Binary function"
%! timeit(10000,@plus,1,2)

%!test "Ternery function"
%! timeit(10000,@plus,1,2,3)
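
Here is one possible completion (a sketch; everything except the two filled-in blanks is unchanged from the skeleton above):

function timeit(reps, func, varargin)
    times = zeros(reps, 1);

    for i=1:reps
      # varargin{:} expands the collected cell array back into
      # separate arguments for func
      tic(); func(varargin{:}); times(i) = toc();
    end

    times = sort(times);
    fprintf ('%s\tmedian=%.3fms mean=%.3fms total=%.3fms\n', func2str(func),
             median(times)*1000, mean(times)*1000, sum(times)*1000);
endfunction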

K nearest neighbours

Time
20 minutes
Activity
Small Groups

The k-nearest neighbour algorithm (KNN) is a famous machine learning (pattern classification) algorithm. In this algorithm you are given a training set of numerical measurement vectors (e.g. lab attendance and height), each labelled with a class (e.g. final grade). Given a new, unlabelled measurement vector, the algorithm predicts the corresponding class by looking at the k closest training vectors and voting.

In this lab, and in A5, we will work with a famous set of measurements of iris flowers, where the "class" is the species of flower.

Complete the following function, which returns the row indices of the k closest rows of data to v (one possible completion is sketched after the tests below). In our data sets, we will assume the first column is always the class label, and should be ignored in distance calculations.

function out = nearest(v, k, data)
  for i=1:rows(data)
    dist(i)=
  endfor
  [sorted, indices]=sort(dist);
  out = sort(indices(1:k));
endfunction

%!test
%! v = [0,1,1];
%! data = [0,0,0; 0,0,1; 0,1,0; 1,1,1];
%! assert(nearest(v,1,data) == 4)
%! assert(nearest(v,3,data) == [2,3,4])
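
One possible completion (a sketch using norm; note that v(2:end) and data(i,2:end) drop the class label in column 1):

function out = nearest(v, k, data)
  for i=1:rows(data)
    # Euclidean distance, ignoring the class label in column 1
    dist(i) = norm(v(2:end) - data(i,2:end));
  endfor
  [sorted, indices] = sort(dist);
  out = sort(indices(1:k));
endfunction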

You can time your nearest neighbour query with the following:

    load training
    load testing
    timeit(1000,@nearest,testing(1,:), 3, training);

K nearest neighbours continued

Time
10 minutes
Activity
Demo

In practice the function nearest would probably be combined with something like the following; they are split out here to help with debugging and testing. Use the parameter fun here rather than calling nearest directly, so that we can swap in a different query function later (one possible completion is sketched after the tests below).

function out=findclass(v, k, data, fun=@nearest)
    selection=
    out = mode(data(selection,1));
endfunction

%!test
%! v = [0,1,1];
%! data = [0,0,0; 0,0,1; 0,1,0; 1,1,1];
%! assert(findclass(v,1,data) == 1)
%! assert(findclass(v,3,data) == 0)
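
One possible completion (a sketch): the query function returns the indices of the k nearest rows, and mode picks the most common class label among them.

function out=findclass(v, k, data, fun=@nearest)
    # indices of the k nearest training rows
    selection = fun(v, k, data);
    # vote: the most common class label among those rows
    out = mode(data(selection,1));
endfunction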

Here is a for-loop based version of a KNN classifier. The return value is the success rate of the classifier (i.e. the fraction of the time it gets the right answer on the test set).

function ratio=knn(k,training, testing, output=0)

  correct = 0;
  for i=1:rows(testing)
    row=testing(i,:);
    if findclass(row,k,training) == row(1)
      correct += 1;
    endif
  endfor

  ratio = correct/rows(testing);
endfunction

You can try out the classifier with

load testing
load training
knn(3,training,testing)

Vectorizing nearest

Time
20 minutes
Activity
Small groups

We need one idea from basic linear algebra to make this vectorization work: we can make many copies of the input vector v by multiplying it on the left by a column vector of ones (no loop required). Try this in Octave:

ones(10,1)*[1,2,3,4]

Every row of the result is a copy of [1,2,3,4]. This replication, together with taking row-wise dot products, provides the two key ideas behind this vectorization. As a convenience, we can also sort by the square of the norm, which is simply the dot product of a vector with itself (one possible completion is sketched after the tests below).

function out = nearest2(v, k, data)

  diff = data(       ) - ones(          ,1)*v(2:end);
  normsq = dot(            );
  [sorted, indices]=sort(normsq);
  out = sort(transpose(indices(1:k)));
endfunction

%!test
%! v = [0,1,1];
%! data = [0,0,0; 0,0,1; 0,1,0; 1,1,1];
%! assert(nearest2(v,1,data) == 4)
%! assert(nearest2(v,3,data) == [2,3,4])
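
One possible completion (a sketch; dot(diff,diff,2) takes the dot product along each row):

function out = nearest2(v, k, data)
  # replicate v (minus its class label) once per row of data,
  # then subtract to get all of the difference vectors at once
  diff = data(:,2:end) - ones(rows(data),1)*v(2:end);
  # squared norm of each row: the dot product of the row with itself
  normsq = dot(diff,diff,2);
  [sorted, indices] = sort(normsq);
  out = sort(transpose(indices(1:k)));
endfunction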

Finally, check the speedup with this second knn function:

function ratio=knn2(k,training, testing, output=0)

  correct = 0;
  for i=1:rows(testing)
    row=testing(i,:);
    if findclass(row,k,training,@nearest2) == row(1)
      correct += 1;
    endif
  endfor

  ratio = correct/rows(testing);
endfunction

You can use the following lines in bench.m:

timeit(10,@knn,3,training,testing);
timeit(10,@knn2,3,training,testing);

Using rowfun

Time
20 minutes
Activity
Small groups

Use the find function to complete the following loop-free version of knn (one possible completion is sketched below). It turns out there's not a huge performance benefit, but the technique of using find is useful to know in general.

function ratio=rowknn(k,training, testing)
  classes=rowfun(@(row)(findclass(row,k,training)),testing);
  correct=length(                           );
  ratio = correct/rows(testing);
endfunction
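
One possible completion (a sketch): find returns the indices of the nonzero entries of its argument, so its length counts the rows where the predicted class matches the label in column 1.

function ratio=rowknn(k,training, testing)
  classes=rowfun(@(row)(findclass(row,k,training)),testing);
  correct=length(find(classes == testing(:,1)));
  ratio = correct/rows(testing);
endfunction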

Add the following line to your bench.m, and compare the timings.

timeit(10,@rowknn,3,training,testing);

If you have time, it's interesting to look at the profile output for the three knn versions.