Saturday, September 7, 2013

Optimization in R (Part I)

This post is the first in a series on solving practical convex optimization problems with R.  My goal for the series is to work up to some interesting applications of the constrained optimization libraries available in R.  In particular, I plan to work out some test cases and practical examples using the quadprog package.

In this very introductory post, however, I want to illustrate some important ideas about convexity using R.

A convex function of a single variable resembles something like the blue curve in the following diagram:

 library(ggplot2)  
 # The convex curve y = (x - 1/2)^2 + 1  
 x <- seq(from=-1, to=1.5, by=.01)  
 y <- (x-1/2)^2 + 1  
 # A chord connecting two points on the curve, and the minimum point  
 segment <- data.frame(x=c(-1/2,1), y=c(2,1.25))  
 min_point <- data.frame(x=1/2, y=1)  
 mydata <- data.frame(x,y)  
 ggplot(mydata, aes(x, y)) + geom_line(size=3, color="blue") +  
  geom_point(data=min_point, color="yellow", size=5) +  
  geom_line(data=segment, color="red", size=2) +  
  geom_point(data=segment, color="red", size=5)  

The point of this diagram is that whenever you connect two points on the graph of a convex function with a straight line segment, the line segment always lies on or above the graph of the function.  A strictly convex function has at most one minimum, and when a minimum exists it is the global minimum; that's the point shown in yellow in the diagram.

Here's what a convex function of two variables looks like:

 library(rgl)  
 # Define surface  
 f <- function(x,y) x^2+y^2  
 # Build a grid  
 x<-seq(from=-1,to=1,by=.01)  
 y<-seq(from=-1,to=1,by=.01)  
 z<-outer(x,y,f)  
 # Plot  
 open3d()  
 rgl.surface(x,y,z, color='green',alpha=.75, smooth=TRUE)  
 axes3d( color='black', alpha=1)  
 title3d('f(x,y)=x^2+y^2','','x','z','y', color='blue', alpha=1)

It's not hard to imagine that any line segment connecting two points on this convex surface again lies entirely above the surface. Moreover, we immediately recognize an absolute minimum at the origin. Though more difficult to visualize, the definition of convexity is the same in higher dimensions: a function is convex if a line segment connecting any two points on the function's graph never dips below the graph.
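To make this precise: a function \(f\) is convex if, for any two points \(x\) and \(y\) in its domain and any \(\lambda \in [0,1]\),
$$f(\lambda x + (1-\lambda) y) \le \lambda f(x) + (1-\lambda) f(y).$$
The left side is the function's value along the segment from \(x\) to \(y\), and the right side is the height of the chord joining \((x, f(x))\) and \((y, f(y))\).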

One reason to care about convexity is that any local minimum of a convex function is automatically a global minimum.  So given a convex function, how do you locate its minimum? Imagine that you were placed somewhere on a convex surface like the one appearing in the figure above. If you wanted to find the global minimum, you'd likely begin walking in the direction of steepest descent, and you'd continue along that path until you could descend no further. At that point, because you know the surface is convex, you know that you are standing on the minimum.

This very simple idea is the basis for the family of gradient descent algorithms for finding the minimum of a convex function.  The gradient of a function \(f\) at a point \(x_0\) is the vector
$$\nabla f(x_0) = \left. \left(\frac{\partial f}{\partial x_1} , \ldots , \frac{\partial f}{\partial x_n} \right) \right|_{x=x_0}$$

Here is a simple numerical implementation of the gradient that uses centred differences to approximate the derivatives in this definition.
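In symbols, each partial derivative is approximated by
$$\left.\frac{\partial f}{\partial x_i}\right|_{x=x_0} \approx \frac{f(x_0 + h e_i) - f(x_0 - h e_i)}{2h},$$
where \(e_i\) is the unit vector along the \(i\)-th coordinate and \(h\) is a small step.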

 # gradient function  
 grad <-function(f,x,h){  
  len <-length(x)  
  y<-rep(NA,len)  
  # Compute the gradient with a centred difference  
  for(i in 1:len){  
   xp<-x  
   xm<-x  
   xp[i]<-x[i]+h  
   xm[i]<-x[i]-h  
   y[i]<-(f(xp)-f(xm))/(2*h)   
  }  
  return(y)  
 }  
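As a quick sanity check (f_test is just a throwaway test function introduced here), we can compare the numerical gradient of \(f(x,y) = x^2 + y^2\) against the exact gradient \((2x, 2y)\):

 # Sanity check: numerical vs. exact gradient of f(x,y) = x^2 + y^2  
 f_test <- function(x) x[1]^2 + x[2]^2  
 grad(f_test, c(1, -2), 1e-4)   # exact gradient at (1, -2) is c(2, -4)  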

[Warning:  For actual numerical work, you probably don't want to write your own derivative functions.  There are excellent libraries available to do this kind of work.  See, for example, the "numDeriv" package.]

The gradient at \(x_0\) always points in the direction of steepest ascent from the point \(x_0\).  To find the minimum, we start at some initial point on the surface and compute the gradient.  Then we take a step in the direction exactly opposite the gradient vector.  If the step size is chosen well, we should find ourselves at a point lower on the surface than the one where we started.  We repeat this process until we can descend no further (that is, until the gradient is the zero vector).
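In symbols, each iteration replaces the current point \(x_k\) with
$$x_{k+1} = x_k - \delta_k \, \nabla f(x_k),$$
where \(\delta_k > 0\) is the step size chosen at iteration \(k\).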

Choosing the right step size at each iteration of the algorithm is a sticky point.  Though a uniform step size might work for very simple functions, in general uniform steps can cause the algorithm to fail to converge.  A standard approach is to scale the step size based on the length of the gradient vector or on how much progress the previous steps made.

Here's a very simple, adaptive gradient descent implementation.  At each iteration, we start with a step size of del and repeatedly cut it in half until we find a step size that yields an improvement.

 # Gradient descent method  
 gradDescend <- function(f, x0, del){  
  z0 <- f(x0)  
  eps <- 1e-8   # Tolerance  
  G <- grad(f, x0, 1e-4)  
  x <- x0 - del*G  
  # Halve the step size until the step actually improves on f(x0)  
  while(z0 < f(x) + eps && del > eps){  
   del <- del/2  
   x <- x0 - del*G  
  }  
  # If improvement is only possible with a step size below the tolerance,  
  # we assume convergence and return NULL. Otherwise, return the improved x.  
  if(del < eps) return(NULL)  
  else return(x)  
 }  
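To see a single call in action (f_bowl is just a throwaway test function; grad and gradDescend are the functions defined above), here's one descent step on a two-variable bowl:

 # One descent step on f(x,y) = x^2 + y^2, starting from (1, 1)  
 f_bowl <- function(x) x[1]^2 + x[2]^2  
 gradDescend(f_bowl, c(1, 1), 0.25)   # steps from (1, 1) to roughly (0.5, 0.5)  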

[Warning:  Here especially you'll want to rely on R libraries rather than on your own implementations for actual numerical work.  Functions like optim are equipped to deal with the pathological cases one can encounter in practice.]

Wikipedia has a nice list of test functions for optimization problems.  Let's compare our steepest descent method to R's optim function using two of the suggested test functions:

 # Four variable Sphere function--min at (0,0,0,0)  
 f1<- function(x) x[1]^2+x[2]^2+x[3]^2 + x[4]^2  
 # Two variable Rosenbrock function--pathological example with min at (1,1).  
 # Note: convex near the minimum, but not globally convex.  
 f2<- function(x) (1-x[1])^2+100*(x[2]-x[1]^2)^2  

Run the tests:
 # Initial values  
 g1 <- c(1/2,-1/3,-1/5,2)  
 g2 <- c(-1/2, 1)  
 # Optim  
 opt_f1 <-optim(par=g1, f1, method="BFGS")  
 opt_f2 <-optim(par=g2, f2, method="BFGS")  
 # Tester function  
  descentTester <- function(f,x0,del){  
   x<-x0  
   path <- NULL  # Storage for path to minimum.  
   while(!is.null(x)){  
    path<-rbind(path,x)  
    x<-gradDescend(f,x,del)  
   }  
   rownames(path) <- 1:nrow(path)  
   return(path)  
  }  
 path1<-descentTester(f1,g1,1/3)  
 path2<-descentTester(f2,g2,1/3)  

We specify the BFGS method in optim.  This method uses information from both the function and its gradient to minimize the function, so it provides a reasonable comparison against our simple implementation.  Here is the output from optim stored in opt_f1, opt_f2:

 > opt_f1  
 $par  
 [1] 1.230906e-13 -1.748844e-13 -4.921677e-14 -6.452512e-14  
 $value  
 [1] 5.366749e-26  
 $counts  
 function gradient   
     9    3   
 $convergence  
 [1] 0  
 $message  
 NULL  
 > opt_f2  
 $par  
 [1] 0.9998002 0.9996004  
 $value  
 [1] 3.99241e-08  
 $counts  
 function gradient   
    119    48   
 $convergence  
 [1] 0  
 $message  
 NULL  

How did optim do on our two examples?  Optim found the solution to both problems.  It found the first solution with only 9 evaluations of f1 and 3 evaluations of f1's gradient.  The second solution took 119 evaluations of f2 and 48 evaluations of its gradient.

We can look at the matrices path1 and path2 to assess the performance of our steepest descent implementation.
 > nrow(path1)  
 [1] 11  
 > nrow(path2)  
 [1] 6197  
 > path1[nrow(path1),]  
 [1] 8.467544e-06 -5.645029e-06 -3.387018e-06 3.387018e-05  
 > path2[nrow(path2),]  
 [1] 1.002628 1.005267  

First, we note that our numerical precision is considerably lower than that of the implementation in optim. We can improve this somewhat by lowering the tolerance value eps in the gradDescend function.
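One simple way to do that (just a sketch; the only change from the version above is that eps becomes an argument) is to expose the tolerance as a parameter with the current value as its default:

 # Sketch: same gradDescend, but with the tolerance exposed as a parameter  
 gradDescend <- function(f, x0, del, eps = 1e-8){  
  z0 <- f(x0)  
  G <- grad(f, x0, 1e-4)  
  x <- x0 - del*G  
  # Halve the step size until the step improves on f(x0) by at least eps  
  while(z0 < f(x) + eps && del > eps){  
   del <- del/2  
   x <- x0 - del*G  
  }  
  if(del < eps) return(NULL)  
  else return(x)  
 }  

A call like gradDescend(f1, x, del, eps=1e-12) would then let the walk continue until improvements smaller than 1e-12 are no longer possible, at the cost of more iterations.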

In terms of computational efficiency, our implementation is on par with optim for the first example. In the second, however, we took thousands more iterations to converge to the numerical minimum. Part of the reason is that the minimum of f2 lies in a deep, curved crease. Once we step into this crease, there is considerable zig-zagging on the approach to the minimum. A more sophisticated implementation might use more information about the local topography and the history of our steps to recognize when we have fallen into this oscillating approach pattern.
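As one illustration of the step-history idea (a sketch of the standard momentum trick, not something built into gradDescend above; the name gradDescendMomentum and the parameters beta and iters are just for this example), we can blend the previous step into the current one, which tends to damp the zig-zag along the floor of a narrow valley:

 # Sketch: fixed-step gradient descent with a momentum term  
 # (setting beta = 0 recovers plain gradient descent)  
 gradDescendMomentum <- function(f, x0, del, beta = 0.9, iters = 1000){  
  x <- x0  
  v <- rep(0, length(x0))   # running combination of past steps  
  for(i in 1:iters){  
   v <- beta*v + del*grad(f, x, 1e-4)  
   x <- x - v  
  }  
  return(x)  
 }  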


Here's a very zoomed-in picture of the alternation pattern within the crease of the f2 function.  Note that we are still at a considerable distance from the minimum at (1,1).
 path2<-as.data.frame(path2)  
 colnames(path2) <-c("x","y")  
 ggplot(data=path2, aes(x=x, y=y))+ geom_path(color="black", size=1) + geom_point(size=5, color="red")+  
   scale_x_continuous(limits=c(1.0105,1.0107))+scale_y_continuous(limits=c(1.0211,1.0215))+  
   ggtitle("Gradient Descent in the Crease of the Rosenbrock Function")  


We also note that our implementation is very sensitive to the initial step size parameter del.  For f2, certain choices of del lead to far better computational performance than others.  And if you think about the geometry of the f1 example, you'll realize that there's a special choice of del that gets you to the minimum in a single step.
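To see why, note that the gradient of f1 is \(\nabla f_1(x) = 2x\), so with del = 1/2 the update \(x - \tfrac{1}{2}(2x)\) lands exactly at the origin, no matter where you start:

 # With del = 1/2, the very first step lands (numerically) on the origin  
 descentTester(f1, g1, 1/2)  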

Let's look at a final example of a successful walk to the minimum for a less trivial minimization problem.
 
 f3 <- function(x) x[2]^2+x[1]*x[2]+x[1]^2 + exp(x[2] + x[1])  
 path3<-descentTester(f3,c(-1,2),1/3)  
 path3<-as.data.frame(path3)  
 colnames(path3)<-c("x","y")  
 # plot the path taken by gradDescend  
 ggplot(data=path3, aes(x=x,y=y)) + geom_path(color="blue",size=2) + geom_point(color="red",size=3)+  
   geom_point(data=path3[1,],color="yellow",size=3)+  
   geom_point(data=path3[nrow(path3),],color="yellow",size=3)  
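As a quick check on where the walk ended up, we can compare its endpoint with what optim reports for the same starting point:

 # Compare the endpoint of our walk with optim's answer  
 path3[nrow(path3), ]  
 optim(par=c(-1,2), f3, method="BFGS")$par  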

In the next instalment of this series, we'll explore some applications of convex minimization using optim.