Monday, November 25, 2013

Generating Ticker Symbols with Markov Chains

Stock ticker symbols are short character string codes (AAPL for Apple, WMT for Walmart) that uniquely identify stock listings. They originate from the days when stock quotes were transmitted via telegraph. These symbols are often used as tools for branding and companies choose them to be memorable and easily recognized.

As a fun and simple exercise in text processing with R, we'll build a model that generates random character strings mimicking the roughly 6,000 ticker symbols listed on the AMEX, NASDAQ, and NYSE. A basic tool for generating random text patterned off an input source text is the Markov chain. Markov chains are the engine behind gibberish generators, a briefly popular internet meme in the '90s.

We start with the full set of real ticker symbols. Using the TTR ("Technical Trading Rules") library, it's easy to get the list for the AMEX, NASDAQ, and NYSE.
library(TTR)
listings <- stockSymbols() 

On November 17, 2013, TTR found 6,462 ticker symbols. The 10 largest symbols by market cap at that time were:
listings <- listings[order(listings$MarketCap,decreasing=TRUE),]
 Symbol                           Name    MarketCap IPOyear Exchange
   AAPL                     Apple Inc. 472354352358    1980   NASDAQ
    XOM        Exxon Mobil Corporation 416188308487      NA     NYSE
   GOOG                    Google Inc. 345299382446    2004   NASDAQ
   MSFT          Microsoft Corporation 315895470257    1986   NASDAQ
     GE       General Electric Company 275192436800      NA     NYSE
    JNJ              Johnson & Johnson 266315572747      NA     NYSE
    WMT          Wal-Mart Stores, Inc. 256994921591    1970     NYSE
    CVX            Chevron Corporation 230896151941      NA     NYSE
     PG Procter & Gamble Company (The) 230614695048      NA     NYSE
    WFC          Wells Fargo & Company 229351244873      NA     NYSE

Ticker symbols pulled down from TTR range from 1 to 8 characters in length and use the capital letters A through Z and the hyphen "-". In ticker symbols that contain hyphens, the characters following the hyphen appear to be "behind the dot" codes that provide additional information on the asset type. To keep things simple, we'll just strip off this additional info:
  symbols <- unique(gsub("-.*","",listings$Symbol))

This leaves us with a list of 5,594 symbols. Let's get the distribution of symbol lengths for our symbol list:
lengths <- nchar(symbols)
pLength <- prop.table(table(lengths))
print(round(pLength,4))
lengths
1      2      3      4      5 
0.0039 0.0387 0.4572 0.4698 0.0305 
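
As a quick aside (not needed for what follows), this table is itself an empirical distribution we could sample from, for instance to choose a length for a generated symbol. A minimal sketch:
 # Sketch: draw a random symbol length from the empirical length distribution
 sample(as.integer(names(pLength)), size=1, prob=pLength)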

Let's now describe a very simple Markov chain model for text strings. A word $w$ of length $n$ is a sequence of letters $c_1, c_2,...,c_n$, which we naturally write as $w=c_1c_2...c_n$. If the first $n-1$ letters of $w$ are known, we write the probability that the $n$-th letter is $c_n$ as $p(c_n|c_1c_2...c_{n-1})$.

A Markov chain model of $p(c_n|c_1c_2...c_{n-1})$ assumes that $$p(c_n|c_1c_2...c_{n-1})=p(c_n | c_{n-1}).$$ That is, we assume that the $n$-th letter of the word depends only on the letter that directly precedes it. The Markov assumption often allows huge simplifications to probability models and computations.

As an example, suppose that ticker symbols are generated by a probabilistic process. If we have a three-letter ticker symbol whose first two letters are "WM", we can meaningfully quantify the probability that the third letter is "T". Under a Markov model, $p(T|WM)=p(T|M)$. There is a little bit of ambiguity in the expression $p(T|M)$: is this the probability that $T$ directly follows an $M$ anywhere in a symbol, or the probability that $T$ is the third letter given that $M$ is the second? Here we'll treat the position information as relevant and compute $$p(c_3=T|WM) = p(c_3=T | c_2 = M).$$
With our sample of symbols, here is a way to approximate $p(T|WM)$ in R using the Markov model:
sum(grepl("^.MT",symbols[lengths==3]))/sum(grepl("^.M",symbols[lengths==3]))
[1] 0.08391608
Unpacking the syntax: grepl performs regular expression matching and returns TRUE when a match is found. The pattern "^.MT" matches symbols whose first character is anything and whose second and third characters are "MT", while "^.M" matches symbols whose second character is "M". The sums count the number of TRUE values.
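
The same counting idea gives the whole conditional distribution rather than a single probability. Here is a small sketch (an aside, reusing the symbols and lengths objects from above) that tabulates the third letters that follow a second-letter "M":
 # Sketch: empirical distribution of the third character given the second is "M"
 threes  <- symbols[lengths==3]
 afterM  <- substr(threes[substr(threes,2,2)=="M"], 3, 3)
 round(prop.table(table(afterM)), 3)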

With the preliminaries out of the way, let's demonstrate a function which generates a random symbol based on the Markov chain model:
# Create a random symbol of length n based on a length 1 Markov Chain
newSymbol <- function(n=3){
  symbolSet <- symbols[lengths==n]
  s <- ""
  for(i in 1:n){
    # Match the previous character of s against character i-1 of each symbol
    pattern         <- paste0("^",substr(s,i-1,i-1))
    match           <- grepl(pattern,substr(symbolSet,i-1,i-1))
    
    # Distribution of the i-th character among the matching symbols
    nxtLetterDist   <- substr(symbolSet,i,i)
    nxtLetterDist   <- nxtLetterDist[nxtLetterDist!="" & match]
    
    # Sample the next character from the distribution
    nextChar        <- sample(nxtLetterDist,1)
    
    # Append to the string
    s <- paste0(s,nextChar)
  }
  return(s)
}

Here is an example of the generator in action:
> replicate(10, newSymbol(3))
 [1] "EMS" "ZTT" "NBS" "BTG" "MAG" "HAM" "PLK" "ALX" "CFM" "FLS"

How can we test that the symbol generator is really making symbols similar to actual ticker symbols? One way is to inspect how the generator performs relative to a completely random symbol generator. Since there are 26 letters available for a valid ticker symbol, a completely random generator would use any given letter with probability 1/26. So, for example, the distribution of the second character of a completely randomly generated symbol should be uniform on the letters A-Z, each letter having probability 1/26. For the Markov chain model, we would hope that the second character of a generated symbol has a distribution similar to the distribution of the second character across the set of actual ticker symbols.
The following snippet inspects the distribution of the second character of symbols of length 3 for natural symbols, Markov generated symbols, and completely random symbols.
# Generate a bunch of symbols with the Markov chain
symbolsA <- unique(replicate(10000, newSymbol(3)))

# Build data frame for second character distributions
require(ggplot2)
require(reshape2)
Natural <- prop.table(table(substr(symbols[lengths==3],2,2)))
Markov <- prop.table(table(substr(symbolsA,2,2)))
secondChar<-as.data.frame(t(rbind(Natural,Markov)))
secondChar$character <- rownames(secondChar)
secondChar <- melt(secondChar, id=c("character"))

# Build the distribution plot
ggplot(secondChar, aes(x=character,y=value)) + 
    geom_point(aes(color=variable), alpha=.75,size=5)+
    ylab("Probability")+xlab("Character")+
    ggtitle("Ticker Symbols:  Second Character Distribution")+
    geom_hline(yintercept=1/26, size=1,color="black")

Here is the plot:
In this figure, the black line represents the uniform probability distribution corresponding to a completely random generation of length 3 ticker symbols. The blue and red circles compare the Markov generator model to the true distribution of the second characters in length 3 ticker symbols.
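
Beyond the visual comparison, a single number summarizing the gap can be handy. Here is a minimal sketch (not part of the original comparison) computing the total variation distance between the two second-character distributions built above:
 # Sketch: total variation distance between the Natural and Markov
 # second-character distributions; values near 0 indicate a close match
 chars <- union(names(Natural), names(Markov))
 p <- setNames(rep(0, length(chars)), chars); q <- p
 p[names(Natural)] <- Natural
 q[names(Markov)]  <- Markov
 0.5 * sum(abs(p - q))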

Friday, November 1, 2013

Underdetermined Systems in R

In this post, we begin a study of underdetermined systems.   We'll motivate this study by considering an application problem in competitive intelligence.  We note from the start, however, that this post is less about finding a particular solution to an underdetermined problem and more about ways to explore the possible solutions.

Problem:  We are trying to learn about the sales regions of a key competitor.  Suppose the competitor sells three versions of a product: basic, standard and luxury.  The product is sold in two regions: North and South.  Competitive intelligence mined from the competitor's annual reports gives us the following revenue schedule:

 Revenue (k USD)
 Region/Product    Basic   Standard   Luxury   TOTAL
 North                x1         x2       x3     400
 South                x4         x5       x6     600
 TOTAL               450        250      300    1000

The values x1,x2,...,x6 are the unknown revenue amounts for each product/region combination.  In this problem, we would like to investigate the possible values of x1,x2,...,x6 which are consistent with the known revenue totals appearing in the margins of the table.  We'll call any set of values x1,x2,...,x6 that is consistent with all the known revenue totals a feasible solution.

We can read off the following 5 relations from the table:

$$ \begin{align}
x_1 + x_2 + x_3  &= 400 \\
x_4 + x_5 + x_6  &= 600 \\
x_1 + x_4 &= 450 \\
x_2 + x_5 &= 250 \\
x_3 + x_6 &= 300
\end{align}
$$
Actually, there is a bit of redundancy in these equations.  For example, the first equation is equal to the sum of the last three equations minus the second equation.  It's important to check for and remove all redundancy from the system for the sake of the computations we'll do below.  We remove the second (redundant) equation and rewrite the system in matrix form as:

$$ \left[ \begin{matrix} 1 & 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 1 \end{matrix} \right]
\left[ \begin{matrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{matrix} \right] = \left[ \begin{matrix} 400 \\ 450 \\ 250 \\ 300 \end{matrix} \right].$$

 A <- matrix(c(  
  1,1,1,0,0,0,  
  1,0,0,1,0,0,  
  0,1,0,0,1,0,  
  0,0,1,0,0,1),  
  byrow=TRUE,nrow=4, ncol=6)  
 f <- matrix(c(400,450,250,300), nrow=4, ncol=1)  
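
As a quick check (a small sketch using base R only), we can confirm that the dropped South-total equation x4 + x5 + x6 = 600 really is redundant: appending its coefficient row to A leaves the rank unchanged.
 # Sketch: the appended South-total row does not increase the rank of A
 Afull <- rbind(A, c(0,0,0,1,1,1))  
 qr(A)$rank      # 4  
 qr(Afull)$rank  # still 4  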

This system, which we express symbolically as \(A\vec{x} = \vec{f}\), is an underdetermined linear system since it consists of 6 unknowns but only 4 equations.  A key fact to recognize is that an underdetermined linear system that has a solution will always have infinitely many.

We will investigate three issues around underdetermined systems in this post:
  1. When does a system have a solution?
  2. If an underdetermined system has solutions, how can we generate them?
  3. Once we have a method for generating solutions, can we develop any criteria for preferring one particular solution among the infinitely many possible solutions?

Consistency and Stability

The first issue is easy to handle.  A system of linear equations is called consistent if it has one or more solutions.  The system \(A\vec{x} = \vec{f}\) has a solution whenever \(\vec{f}\) is in the column space of \(A\), i.e. whenever \(\vec{f}\) is a linear combination of the columns of \(A\).  This leads us to the notion of matrix rank: the rank of \(A\) is the number of linearly independent columns of \(A\).  In these terms, we have the following consistency test: a system is consistent if and only if the rank of \(A\) equals the rank of \([A|\vec{f}]\), the matrix obtained by appending \(\vec{f}\) as a column to \(A\).

Here's one way to implement the consistency test in R:
 > require(Matrix)  
 > rankMatrix(A)  
 [1] 4  
 attr(,"method")  
 [1] "tolNorm2"  
 attr(,"useGrad")  
 [1] FALSE  
 attr(,"tol")  
 [1] 2.76354e-15

 > rankMatrix(cbind(A,f))
[1] 4
attr(,"method")
[1] "tolNorm2"
attr(,"useGrad")
[1] FALSE
attr(,"tol")
[1] 1.377126e-12

We can see that \(A\) and \([A|\vec{f}]\) have the same rank, so \(A\vec{x} = \vec{f}\) has solutions.  Note that accurate computation of the rank of \(A\) (as well as the QR decomposition of \(A\) that we perform below) requires that the matrix \(A\) be numerically "well-behaved".  One way to quantify good numerical behaviour is the condition number of \(A\),

$$ \kappa(A) = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)},$$
where \(\sigma_{\max},\sigma_{\min}\) are the maximum and minimum non-zero singular values of \(A\).  Base R's kappa function computes an estimate of this quantity:

 > kappa(A)  
 [1] 2.987313  

The condition number of \(A\) roughly describes how sensitive the output of the transformation \(A\) is to a small change in the input.  We should be worried about doing any computations with \(A\) if \(\kappa(A)\) is very large.  Fortunately, our condition number here is quite reasonable.  Note that a large condition number may suggest that \(A\) contains some redundant rows.
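
To see what a badly conditioned system looks like, here is a small sketch (a deliberately contrived example, not part of the original problem) that appends a row nearly identical to the first row of \(A\):
 # Sketch: an almost-redundant row (nearly a copy of row 1) produces a tiny
 # singular value, so the condition number should blow up
 Abad <- rbind(A, c(1, 1, 1, 0, 0, 1e-8))  
 kappa(Abad)   # expect a value many orders of magnitude larger than kappa(A)  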

The Space of Solutions

For real-valued \(A\), \(\vec{f}\), every solution to the system
$$ A \vec{x} = \vec{f} $$
can be written in the form
$$ \vec{x} = \vec{x}_0 + \sum_{i=1}^{n_0} w_i q_i $$
where
  • \(\vec{x}_0\) is any particular solution to \(A\vec{x} = \vec{f}\)
  • \(q_1,...,q_{n_0}\) is a basis for the nullspace of \(A\)
  • \(w_1,...,w_{n_0}\) are real valued coefficients 
  • \(n_0\) is the dimension of the nullspace of \(A\).  If \(A\) is \(m \times n\) with rank \(m\) and \(m < n\), then the rank-nullity theorem tells us that \(n_0 = n-m\).
We will write this representation in the shorthand
$$ \vec{x} = \vec{x}_0 + Q \vec{w}$$
where \(Q\) is an \(n \times (n-m)\) matrix whose columns are a basis for the nullspace of \(A\) and \(\vec{w} \in \mathbb{R}^{n-m}\).  We note that in general there are infinitely many bases for the nullspace of a matrix, so this representation is not unique.

We now want to describe procedures in R that will allow us to numerically determine (1) a particular solution \(\vec{x}_0\) to \(A\vec{x} = \vec{f}\) and (2) a convenient choice of the matrix \(Q\).

To obtain a particular solution of \(A\vec{x} = \vec{f}\), we can make use of the pseudoinverse of \(A\).  Assuming the \(m \times n\) matrix \(A\) has full rank with \(m <n\), the matrix \(AA^T\) is an \(m \times m\) matrix of full rank, that is \(AA^T\) is invertible.  It is not hard to prove the identity
 $$ A[A^T (AA^T)^{-1}] = I_m $$
where \(I_m\) is the \(m \times m\) identity matrix.  Right-multiplying both sides of this equation by \(\vec{f}\), we obtain
$$ A[A^T (AA^T)^{-1} \vec{f}] = \vec{f}$$.
Thus a particular solution of $A \vec{x} = \vec{f}$ is
$$ \vec{x}_0 = A^T (AA^T)^{-1} \vec{f}.$$
Though there are infinitely many particular solutions we could use, the solution \(\vec{x}_0\) turns out to be the solution with smallest Euclidean norm. Here is one way to compute \(\vec{x}_0\) in R:

 x0 <- t(A)%*%solve(A%*%t(A),f)  

For the example problem:
 > x0  
      [,1]  
 [1,] 191.66667  
 [2,] 91.66667  
 [3,] 116.66667  
 [4,] 258.33333  
 [5,] 158.33333  
 [6,] 183.33333  

Next, we need to obtain a basis for the nullspace of \(A\).  One approach for this is to apply a QR factorization to \(A^T\).  This will decompose \(A^T\) as
$$A^T = \left[  \begin{matrix} Q_1  & Q_2 \end{matrix} \right] \left[ \begin{matrix} R_1  \\ R_2 \end{matrix} \right]$$
where \(Q_1\) is \(n \times m \), \(Q_2\) is \(n \times (n-m)\), and \(R_2\) is an \((n-m) \times m\) block of zeros.  The columns of \(Q_2\) form a basis for the nullspace of \(A\).

Here is an R implementation:
 QR <- qr(t(A))  
 Q <- qr.Q(QR,complete=TRUE)[,-(1:QR$rank)]  
> Q
           [,1]       [,2]
[1,]  0.2654822  0.5126915
[2,] -0.5767449 -0.0264314
[3,]  0.3112627 -0.4862601
[4,] -0.2654822 -0.5126915
[5,]  0.5767449  0.0264314
[6,] -0.3112627  0.4862601

We now have our solution representation \(\vec{x} = \vec{x}_0 + Q \vec{w}\).  To see that this is really a solution for any choice of \(\vec{w}\), note that
$$ A Q  = 0_{m \times (n-m) } $$
and compute
$$  A \vec{x} = A \vec{x}_0 + A Q \vec{w}  = \vec{f} + 0_{m \times (n-m)} \vec{w} = \vec{f}.$$
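
As a sanity check, here is a short sketch (with an arbitrary random seed and scale) that draws a few coefficient vectors \(\vec{w}\) and confirms that every resulting \(\vec{x}\) reproduces \(\vec{f}\):
 # Sketch: any w gives a solution of A x = f (though possibly with negative
 # components); draw five random w and check
 set.seed(1)  
 W <- matrix(rnorm(2*5, sd=50), nrow=2)  
 X <- apply(W, 2, function(w) x0 + Q %*% w)  
 round(A %*% X, 8)   # every column should equal f = (400, 450, 250, 300)  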

Criteria for Selecting a Solution

Now we consider the question of whether there are any criteria to prefer one solution of \(A\vec{x} = \vec{f}\) over another.  One obvious requirement is that our solution should have all non-negative components, \(\vec{x} \geq 0\).

How can we ensure that \(\vec{x} = \vec{x}_0 + Q \vec{w}\) is non-negative?  Satisfying this condition will require us to choose \(\vec{w}\) carefully.  In the context of our problem, it turns out that the set of \(\vec{w} = (w_1,w_2) \) which satisfy \( \vec{x}_0 + Q \vec{w} \geq 0\) defines a convex polygon in the \(w_1,w_2\) plane.  [More generally, the set of \(\vec{w} \in \mathbb{R}^m\) such that \( \vec{x}_0 + Q \vec{w} \geq 0\) defines a convex polytope in \(\mathbb{R}^m\).]  We could imagine a scheme that generates positive solutions by finding vectors lying within this convex polygon.  This is a rather interesting idea but will be very difficult to implement for even moderately sized problems.  Even just the task of finding all the vertices of a polygon (polytope) defined by a matrix inequality \(A\vec{x} \leq b\) is quite difficult!  We'll leave this line of thinking for another post.
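
To make the polygon idea concrete without computing any vertices, here is a small sketch (an illustration only; the grid range is an arbitrary choice) that probes the \(w_1,w_2\) plane and flags the points where \(\vec{x}_0 + Q\vec{w}\) stays non-negative; those points fill out the feasible polygon.
 # Sketch: probe a grid in the (w1, w2) plane and flag the feasible points,
 # i.e. those where every component of x0 + Q w is non-negative
 wGrid    <- expand.grid(w1=seq(-600, 600, by=10), w2=seq(-600, 600, by=10))  
 feasible <- apply(wGrid, 1, function(w) all(x0 + Q %*% w >= 0))  
 mean(feasible)   # fraction of the probed grid lying in the feasible polygon  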

Another approach is to formulate the problem as a linear program:
$$\left\{ \begin{array}{l} \mathrm{minimize:} \qquad \vec{c} \cdot \vec{x} \\ \mathrm{subject \; to:} \quad A\vec{x}= \vec{f}, \qquad x \geq 0  \end{array} \right.$$

For this particular application, the vector \(\vec{c}\) can be chosen to be anything we like, since the goal is simply to find a solution that satisfies the \(A \vec{x} = \vec{f}\) and \(\vec{x} \geq 0 \) constraints.  By varying \(\vec{c}\), we'll get different solutions to the system.

Here is an implementation using the R package linprog:
 > require(linprog)  
 > cvec <- rep(1,6)  
 > solveLP( cvec, f, A, const.dir=rep("=",4), lpSolve=TRUE)  
 Results of Linear Programming / Linear Optimization  
 (using lpSolve)  
 Objective function (Minimum): 1000   
 Solution  
 opt  
 1  0  
 2 100  
 3 300  
 4 450  
 5 150  
 6  0  
 Constraints  
 actual dir bvec free  
 1  400  = 400  0  
 2  450  = 450  0  
 3  250  = 250  0  
 4  300  = 300  0  

A drawback of the linear programming approach is that (in the absence of further information) it tends to choose the sparsest possible solutions.  That is, it "forces" zeros into the solution vector.  The reason for this is fairly clear from the mechanics of the simplex algorithm used to solve linear programs.  We see that for the example problem, linear programming gives us the solution with x1=x6=0.  Varying \(\vec{c}\) gives different but still sparse solutions to the system.

For an algorithm that does not give preference to the sparsest solutions of \(A\vec{x}=\vec{f}\), we can turn to nonlinear optimization.  Here is one simple approach.  We observe that for the example problem \(\sum_{i=1}^6 x_i = 1000\) and so
$$ \mu = \frac{1}{6} \sum_{i=1}^6 x_i = \frac{1000}{6} \approx 166.667. $$
We then define the objective function
$$ F( \vec{x} ) = \sum_{i=1}^6 (x_i - \mu)^2$$
We seek the vector \(\vec{x} \in \mathbb{R}^6\) such that \(\vec{x}\) minimizes \(F(\vec{x})\) and satisfies
$$ \vec{x} \geq 0 \qquad A \vec{x} = \vec{f}.$$

In words, we must minimize the squared deviation between the solution components and the mean, subject to the system constraints.  This is a simple example of a quadratic program.  There are many approaches available to solve this species of optimization problem and R has several dedicated libraries.  We will use the 'quadprog' library here.  This library solves quadratic programming problems of the form
$$ \left\{ \begin{array}{l} \mathrm{minimize:} \qquad -\vec{d}^T \vec{x} + \frac{1}{2} \vec{x}^T D \vec{x} \\ \mathrm{subject \; to:} \quad A_1^T \vec{x} = \vec{f}_1, \qquad A_2^T \vec{x} \geq \vec{f}_2  \end{array} \right.  $$
Here \(D\) is a \(n \times n\) symmetric, positive definite matrix.

To cast our problem in this form, we write
$$ F(\vec{x}) =  (\vec{x} - \vec{\mu}  )^T (\vec{x} - \vec{\mu}) = \vec{x}^T \vec{x} - 2 \vec{\mu}^T \vec{x} + \vec{\mu}^T \vec{\mu},$$
where \(\vec{\mu}  = \mu (1,\ldots, 1)\).  Since \(\vec{\mu}^T \vec{\mu}\) is a constant, minimizing \(F(\vec{x})\) is equivalent to minimizing
$$ \tilde{F}(\vec{x}) = \frac{1}{2} \vec{x}^T D \vec{x} -  \vec{d}^T \vec{x},$$
where \(\vec{d} = 2 \vec{\mu}\) and \(D = 2 I \).  Then, we set \(A_1 = A^T\), \(A_2 = I\), \( \vec{f}_1 = \vec{f} \), and \( \vec{f}_2 = \vec{0} \). 

Here is the R implementation:
 require(quadprog)  
 Dmat <- 2*diag(6)  
 dvec <- 2*(1000/6)*rep(1,6)  
 A1 <- A  
 A2 <- diag(6)  
 Amat <- t(rbind(A1,A2))  
 f1 <- f  
 f2 <- rep(0,6)  
 bvec <- c(f1,f2)  

Now, we call the solver (note that the meq=4 parameter tells quadprog that the first four constraints in our system are equality constraints and the rest are inequality constraints):
 > solve.QP(Dmat, dvec, Amat, bvec, meq=4)  
 $solution  
 [1] 191.66667 91.66667 116.66667 258.33333 158.33333 183.33333  
 $value  
 [1] -149166.7  
 $unconstrained.solution  
 [1] 166.6667 166.6667 166.6667 166.6667 166.6667 166.6667  

Clearly, we can generate many different solutions to the underdetermined system by modifying the objective function \(F(\vec{x})\).  For example, replacing \(\vec{\mu}\) with 0, we'll obtain the smallest 2-norm non-negative solution that satisfies \(A\vec{x} = \vec{f}\).  This should agree with \(\vec{x}_0\) in the case where it happens that \(\vec{x}_0 \geq 0\).
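
As a quick check of that last claim, here is a minimal sketch (using the same Dmat, Amat, and bvec as above) that zeroes out the linear term and compares the result to \(\vec{x}_0\):
 # Sketch: with dvec = 0 the QP minimizes ||x||^2 subject to Ax = f, x >= 0;
 # since x0 happens to be non-negative here, the two columns should match
 sol0 <- solve.QP(Dmat, rep(0,6), Amat, bvec, meq=4)  
 round(cbind(sol0$solution, x0), 4)  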

The functional form of \(F(\vec{x})\) allows us to assign preference to particular feasible solutions.  If we have additional expectations/beliefs about the values of x1,...,x6, these can be built into the objective function.  In the simple example above, our additional expectation is that the individual solution components shouldn't be very different from the mean.  Clearly, much more complex ideas can be built into the objective function to find feasible solutions that are also consistent with our expectations about x1,x2,...,x6.

Saturday, September 7, 2013

Optimization in R (Part I)

This post is the first in a series on solving practical convex optimization problems with R.  My goal for the series is to work up to some interesting applications of the constrained optimization libraries available in R.  In particular, I plan to work out some test cases and practical examples using the quadprog package.

In this very introductory post, however, I want to illustrate some important ideas about convexity using R.

A convex function of a single variable resembles something like the blue curve in the following diagram:

 library(ggplot2)  
 x<-seq(from=-1,to=1.5,by=.01)  
 y<-(x-1/2)^2 +1  
 segment <-data.frame(x=c(-1/2,1),y=c(2,1.25))  
 min_point <-data.frame(x=1/2,y=1)  
 mydata <-data.frame(x,y)  
 ggplot(mydata, aes(x, y)) + geom_line(size=3, color="blue")+  
  geom_point(data=min_point,color="yellow", size=5)+  
  geom_line(data=segment, color="red", size=2)+  
  geom_point(data=segment, color="red", size=5)  

The point of this diagram is that whenever you connect two points on the graph of a convex function with a straight line segment, the line segment always lies above the graph of the function.  When a strictly convex function attains a minimum, that minimum is unique and global; it is the point shown in yellow in the diagram.
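
For the curve above, this chord condition is easy to check numerically. Here is a quick sketch (using a throwaway function name g so as not to clash with names used later):
 # Sketch: verify g(t*a + (1-t)*b) <= t*g(a) + (1-t)*g(b) for all t in [0,1]
 g <- function(x) (x-1/2)^2 + 1  
 a <- -1/2; b <- 1  
 t <- seq(0, 1, by=.01)  
 all(g(t*a + (1-t)*b) <= t*g(a) + (1-t)*g(b) + 1e-12)   # TRUE  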

Here's what a convex function of two variables looks like:

 library(rgl)  
 # Define surface  
 f <- function(x,y) x^2+y^2  
 # Build a grid  
 x<-seq(from=-1,to=1,.01)  
 y<-seq(from=-1,to=1,.01)  
 z<-outer(x,y,f)  
 # Plot  
 open3d()  
 rgl.surface(x,y,z, color='green',alpha=.75, smooth=TRUE)  
 axes3d( color='black', alpha=1)  
 title3d('f(x,y)=x^2+y^2','','x','z','y', color='blue', alpha=1)

It's not hard to imagine that any line segment connecting two points on this convex surface will again lie completely above the surface. Moreover, we immediately recognize an absolute minimum at the origin. Though more difficult to visualize, the definition of convexity is much the same in higher dimensions: a function is convex if a line segment connecting two points on the function's graph always remains above the graph.

One reason to care about convexity is that it makes minimization tractable: a local minimum of a convex function is automatically a global minimum.  So given a convex function, how do you locate its minimum? Imagine that you were placed somewhere on a convex surface like the one appearing in the figure above. If you wanted to find the global minimum, you'd likely begin walking in the direction of steepest descent. You'd continue to follow along the path of steepest descent until you can no longer descend. At this point, because you know the surface is convex, you know that you are standing on the minimum.

This very simple idea is the basis for the family of gradient descent algorithms for finding the minimum of a convex function.  The gradient of a function \(f\) at a point \(x_0\) is the vector
$$\nabla f(x_0) = \left. \left(\frac{\partial f}{\partial x_1} , ... , \frac{\partial f}{\partial x_n} \right) \right|_{x=x_0}$$

Here is a simple numerical implementation of the gradient that uses centred differences to approximate the derivatives in this definition.

 # gradient function  
 grad <-function(f,x,h){  
  len <-length(x)  
  y<-rep(NA,len)  
  # Compute the gradient with a centred difference  
  for(i in 1:len){  
   xp<-x  
   xm<-x  
   xp[i]<-x[i]+h  
   xm[i]<-x[i]-h  
   y[i]<-(f(xp)-f(xm))/(2*h)   
  }  
  return(y)  
 }  

[Warning:  For actual numerical work, you probably don't want to write your own derivative functions.  There are excellent libraries available to do this kind of work.  See, for example, the "numDeriv" package.]

The gradient at \(x_0\) always points in the direction of steepest ascent from the point \(x_0\).  To find the minimum, we start at some initial point on the surface and compute the gradient.   Then we take a step in the direction exactly opposite the gradient vector.  If the step size is chosen well, we should find ourselves at a point lower on the surface than the point at which we started.  We repeat this process until we can descend no further (that is, until the gradient is numerically zero). 

Choosing the right step size at each level in the algorithm is a sticky point.  Though a uniform step size might work for very simple functions, in general uniform steps can cause the algorithm to fail to converge.  A standard approach is to scale the step size based on the length of the gradient vector or the inter-step performance.

Here's a very simple, adaptive gradient descent implementation.  At each iteration, we start with a step size of del and repeatedly cut the step size in half until we find a step size that results in an improvement.

 # Gradient descent method  
 gradDescend<-function(f,x0,del){  
  z0<-f(x0)     
  eps<-1e-8   # Tolerance  
  G<-grad(f,x0,1e-4)  
  x<-x0-del*G  
  # Find a choice of step size so that the result improves  
  while(z0<f(x)+eps & del>eps){  
   del<-del/2  
   x<-x0-del*G  
  }  
  # If improvement is only possible by step size below the tolerance,   
  # we assume convergence and return NULL. Otherwise, return the improved x.  
  if(del<eps) return(NULL)  
  else return(x)  
 }  

[Warning:  Here especially you'll want to rely on R libraries rather than on your own implementations for actual numerical work.  Functions like optim are equipped to deal with the pathological cases one can encounter in practice.]

Wikipedia has a nice list of test functions for convex optimization problems.  Let's compare our steepest descent method to R's optim function using two of the suggested test functions:

 # Four variable Sphere function--min at (0,0,0,0)  
 f1<- function(x) x[1]^2+x[2]^2+x[3]^2 + x[4]^2  
 # Two variable Rosenbrock function--pathological example with min at (1,1).  Note: convex at (1,1) but not globally convex.
 f2<- function(x) (1-x[1])^2+100*(x[2]-x[1]^2)^2  

Run the tests:
 # Initial values  
 g1 <- c(1/2,-1/3,-1/5,2)  
 g2 <- c(-1/2, 1)  
 # Optim  
 opt_f1 <-optim(par=g1, f1, method="BFGS")  
 opt_f2 <-optim(par=g2, f2, method="BFGS")  
 # Tester function  
  descentTester = function(f,x0,del){  
   x<-x0  
   path <- NULL  # Storage for path to minimum.  
   while(!is.null(x)){  
    path<-rbind(path,x)  
    x<-gradDescend(f,x,del)  
   }  
   rownames(path) = 1:nrow(path)  
   return(path)  
  }  
 path1<-descentTester(f1,g1,1/3)  
 path2<-descentTester(f2,g2,1/3)  

We specify the BFGS method in optim.  This method uses information from both the function and its gradient to minimize the function, so it provides a reasonable comparison against our simple implementation.  Here is the output from optim stored in opt_f1, opt_f2:

 > opt_f1  
 $par  
 [1] 1.230906e-13 -1.748844e-13 -4.921677e-14 -6.452512e-14  
 $value  
 [1] 5.366749e-26  
 $counts  
 function gradient   
     9    3   
 $convergence  
 [1] 0  
 $message  
 NULL  
 > opt_f2  
 $par  
 [1] 0.9998002 0.9996004  
 $value  
 [1] 3.99241e-08  
 $counts  
 function gradient   
    119    48   
 $convergence  
 [1] 0  
 $message  
 NULL  

How did optim do on our two examples?  Optim found the solution to both problems.  It found the first solution with only 9 evaluations of f1 and 3 evaluations of f1's gradient.  The second solution took 119 evaluations of f2 and 48 evaluations of its gradient.

We can look at the data frames path1 and path2 to assess the performance of our steepest descent implementation.
 > nrow(path1)  
 [1] 11  
 > nrow(path2)  
 [1] 6197  
 > path1[nrow(path1),]  
 [1] 8.467544e-06 -5.645029e-06 -3.387018e-06 3.387018e-05  
 > path2[nrow(path2),]  
 [1] 1.002628 1.005267  

First, we note that our numerical precision is considerably lower than that of the implementation in optim. We can adjust this somewhat by lowering the tolerance value in the gradDescend function.

In terms of computational efficiency, our implementation is on par with optim for the first example. In the second, however, we took thousands more iterations to converge to the numerical minimum. Part of the reason for this is that the minimum for f2 lies in a deep, curved crease. Once we step into this crease, there is considerable zig-zagging on the approach to the minimum. A more sophisticated implementation might use more information on the local topography and the history of our steps to recognize when we have fallen into this oscillating approach pattern.


Here's a very zoomed-in picture of the alternation pattern within the crease of the f2 function.  Note that we are still at a considerable distance from the minimum at (1,1).
 path2<-as.data.frame(path2)  
 colnames(path2) <-c("x","y")  
 ggplot(data=path2, aes(x=x, y=y))+ geom_path(color="black", size=1) + geom_point(size=5, color="red")+  
   scale_x_continuous(limits=c(1.0105,1.0107))+scale_y_continuous(limits=c(1.0211,1.0215))+  
   ggtitle("Gradient Descent in the Crease of the Rosenbrock Function")  


We also note that our implementation is very sensitive to the initial step-size parameter del.  If you think about the geometry of the f1 example, you'll realize that there's a certain special choice of del that can get you to the minimum in a single step. In f2, certain choices of del lead to far better computational performance than others.
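
For instance, here is that special choice for f1: the gradient of the sphere function at x is 2*x, so a single step of size 1/2 lands exactly on the minimum (a small check reusing grad, f1, and g1 from above):
 # Sketch: one step of size 1/2 along the sphere's negative gradient reaches the origin
 g1 - (1/2)*grad(f1, g1, 1e-4)   # numerically ~ c(0, 0, 0, 0)  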

Let's look at a final example of a successful  walk to the minimum for a less-trivial minimization problem.
 
 f3 <- function(x) x[2]^2+x[1]*x[2]+x[1]^2 + exp(x[2] + x[1])  
 path3<-descentTester(f3,c(-1,2),1/3)  
 path3<-as.data.frame(path3)  
 colnames(path3)<-c("x","y")  
 # plot the path taken by gradDescent  
 ggplot(data=path3, aes(x=x,y=y)) + geom_path(color="blue",size=2) + geom_point(color="red",size=3)+  
   geom_point(data=path3[1,],color="yellow",size=3)+  
   geom_point(data=path3[nrow(path3),],color="yellow",size=3)  

In the next instalment of this series, we'll explore some applications of convex minimization using optim.