I'm about halfway through my Honours year at present (although it doesn't feel like it), doing my dissertation in Mathematics. In the process of writing some code in the R scripting language, I came upon an annoying problem.
I have two matrices:
and
1
and what I want is to create a third matrix:
2
Now it is easy to code something that will create this matrix one element at a time. Now, since X and Y will likely have at least 600 rows each, meaning we'll have a loop with 6002, or 360000 iterations. And since each of those iterations involves 3p floating-point operations, that means K is created in 1,080,000p flops. But this is in fact a lower bound; if we do each cell individually, then it will take much more than that, since R is notoriously inefficient when it comes to one-at-a-time operations.
Where R shines is in operating on entire vectors at once. Toy example: let's say I have a vector of numbers, x, and I want to increase each number by its position within the vector. So the first number increases by 1, the second by 2 and so on. Ask a computer science student, and they'll tell you to make a loop that increments each element one at a time and uses the loop index to keep track of where it's up to. Ask someone who uses R (i.e. not a computer scientist but a mathematician) and they'll say to add the vector to another, dynamically-defined one. In code:
z <- x + 1:length(x)
1:n creates a new vector, where the first element is 1 and it increments the elements by 1 until it gets to the same length as x, and then stops. So this is useful; it means I can create K one column at a time.
But one column at a time is still kind of inefficient; that's still a loop and R is not well-optimised for loops. So I spent a couple of weeks niggling at this problem, trying to find a way to do it without loop structures at all. Then a friend of mine at work, discussing something unrelated, reminded me of the dist() function, which is almost exactly what I needed!
dist() does the following: pass it a matrix, X, and it returns a lower triangular matrix3 where the element at row i-1 and column j is ||xj-xi||. But, this requires you to pass it a single matrix, not two. So what I can do is tack the matrix Y on the bottom of X and pass the result to dist()! The output matrix will therefore have a bunch of stuff I don't need - I don't care about the distance between elements in X and other elements in X - so when I create it I can then pull out the quadrant which contains the distances between the Y values and X values and ditch the rest.
The question remains whether this will actually turn out to be less processor cycles. dist() is pretty well optimised whereas calling one set of vectors at a time is not; however, using dist() means that I'll be creating a whole bunch of distances I don't care about. So the next step will be benchmarking using some test data I have from last year.
This is what an applied mathematician does.
1 For clarity, what this means is that X is a matrix with n rows and p columns, and Y is a matrix with m rows and p columns, and the values in these matrices are all real numbers.
2 This is notational shorthand; what, for example, is meant by ||y1-x1||2 is: take the first row in the matrix Y; subtract the first row in the matrix X; then square all the elements in the resulting vector and add them up. Important note: ||y1-x1|| is the square root of this. Further important note: this value is called the Euclidean distance between two vectors.
3Lower triangular matrix means that 1) the matrix is square and 2) all the elements in the top right above the diagonal are zero.