Sunday, October 19, 2014

Voice control your Roku player with Python

In this post, I describe a simple way to implement a voice-controlled 'remote' for a Roku streaming player over a telnet connection with Python.

The complete code (about 100 lines) is available on github.

Roku on Telnet

Roku streaming players offer an interesting unofficial control interface over a telnet connection on the local network. To connect to Roku over telnet:
  1. Put Roku in development mode by pressing the following button sequence on your Roku remote: Home (three times), up (twice), right, left, right, left, right.
  2. Follow the on-screen instructions and Roku will tell you its IP address: e.g. 192.168.1.20
  3. Using any computer on the same network as Roku, open a terminal and telnet into Roku's port 8080:
    telnet 192.168.1.20 8080
  4. If all goes well, you should see Roku's device ID, ethernet and wifi MAC addresses, and command prompt printed on the console:
    Trying 192.168.1.20...
    Connected to 192.168.1.20.
    Escape character is '^]'.
    ETHMAC 00:00:00:00:00:0a
    WIFIMAC 00:00:00:00:00:0b
  5. From the telnet command line, we can now type simple Roku control commands like 'press play' and 'press right'.
I successfully telneted into Roku SD, Roku HD, and Roku 3 players.
  • Supported commands include: right, left, up, down, select, pause, play, forward, reverse
  • All models support abbreviated commands, e.g. "press r" for "press right".
  • The Roku 3 requires abbreviated commands. Moreover, it can accept multiple commands per line, such as "press dls" (down, left, select). The Roku 3 also appears to support some additional button presses, such as "press a" and "press c".

Adding Voice Control

To have some more fun with this feature, I used Python's SpeechRecognition library to develop a simple voice-enabled interface for Roku. The SpeechRecognition module is a Python wrapper around Google's speech recognition API. The module uses PyAudio to record wav files from the microphone:
import speech_recognition as sr
recognizer = sr.Recognizer()
with sr.Microphone() as source:
    audio = recognizer.listen(source, timeout=2)
These files are then streamed to Google's recognition API for transcription:
command = recognizer.recognize(audio)
If recognition is successful, SpeechRecognition should return a string transcription. We can then process this transcription to control the Roku. To manage the processing of transcriptions or "concept mapping", we use a simple approach based on keywords. We define a concept class to hold a list of keywords and a related action. For example, Roku needs a SelectConcept that should request the select button when the user says either of the keywords 'select' or 'ok':
SelectConcept = Concept(['select','ok'], 'press select')
Each concept has a scoring method which takes an input message and returns an action if the message contains the keyword:
# Returns the action string 'press select'
SelectConcept.conceptMatch('please press select button'.split(" "))
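For concreteness, here is a minimal sketch of a Concept class consistent with the two calls above. This is hypothetical; the actual class in the github code may differ:

```python
class Concept:
    def __init__(self, keywords, action):
        # keywords that should trigger this concept, and the Roku
        # command string to emit when one of them is heard
        self.keywords = set(keywords)
        self.action = action

    def conceptMatch(self, words):
        # Return the action string if any keyword appears in the
        # tokenized transcription; otherwise return None.
        if self.keywords.intersection(words):
            return self.action
        return None

SelectConcept = Concept(['select', 'ok'], 'press select')
```

A keyword set rather than exact phrase matching keeps the mapping forgiving of recognition noise like filler words.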
Once we've defined all of our concept data, we need a way to pass the action requests to the Roku from within Python. For this, we can use Python's telnetlib to connect to Roku's 8080 port and pass message strings like 'press select'.
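A minimal sketch of that plumbing might look like the following. The helper names are mine, the sending path is an untested illustration, and telnetlib is in the standard library through Python 3.12:

```python
def format_command(action):
    # Roku's unofficial interface reads CRLF-terminated ASCII commands
    return action.encode("ascii") + b"\r\n"

def send_command(host, action, port=8080):
    # Hypothetical helper: open the unofficial control port and
    # send a single command such as 'press select'.
    import telnetlib  # stdlib through Python 3.12
    tn = telnetlib.Telnet(host, port, timeout=5)
    tn.write(format_command(action))
    tn.close()
```

With a helper like this, the concept-matching output can be forwarded directly, e.g. `send_command("192.168.1.20", "press select")` for a Roku at that address.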

See the full code on github.

Tuning and other details

If you are lucky, PyAudio and SpeechRecognition will work well out of the box. More likely, however, you will encounter a few library dependency problems and hardware-specific issues. On my Linux systems, I needed to install a few extra audio packages before PyAudio would work. Note that even when PyAudio is working, you may still see a number of warning/diagnostic messages printed to the console, e.g.:
ALSA lib pcm_dsnoop.c:618:(snd_pcm_dsnoop_open) unable to open slave
ALSA lib pcm_dmix.c:1022:(snd_pcm_dmix_open) unable to open slave
ALSA lib pcm_dmix.c:1022:(snd_pcm_dmix_open) unable to open slave
ALSA lib pcm.c:2239:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.rear
ALSA lib pcm.c:2239:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.center_lfe
ALSA lib pcm.c:2239:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.side
bt_audio_service_open: connect() failed: Connection refused (111)
bt_audio_service_open: connect() failed: Connection refused (111)
bt_audio_service_open: connect() failed: Connection refused (111)
bt_audio_service_open: connect() failed: Connection refused (111)
ALSA lib pcm_dmix.c:1022:(snd_pcm_dmix_open) unable to open slave
The call:
with sr.Microphone() as source:
    audio = recognizer.listen(source, timeout=2)
will listen to the microphone channel for up to two seconds waiting for a signal above a particular energy threshold. If the signal passes over the threshold value, recording begins and will continue until a pause is encountered. The threshold value and stopping pause length are controlled by the parameters:
recognizer.energy_threshold = 2000
recognizer.pause_threshold = .5
The values of these parameters will need to be tuned for particular microphone hardware, system settings, and the level of ambient noise during recording. It may be helpful to adjust the system microphone input level as well to optimize recognition performance. Experimenting on several different microphones, I found optimal energy thresholds for this application varied widely between 200 and 2000. More details about recognizer parameters are available in the SpeechRecognition documentation.
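For intuition about what the energy threshold is compared against, here is a toy short-term energy measure over raw PCM samples. This is an illustration only, not the library's actual internals:

```python
import math

def rms_energy(samples):
    # Root-mean-square amplitude of a chunk of PCM samples; recording
    # starts when a measure like this exceeds the energy threshold.
    return math.sqrt(sum(s * s for s in samples) / len(samples))
```

Comparing a measure like this on a quiet-room chunk versus a spoken chunk gives a feel for where a sensible threshold sits between the two.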

Even with optimization, SpeechRecognition is not very reliable for this toy application and latency is too high. A better implementation would use a custom language model and a more robust concept mapping system. An interesting alternative to SpeechRecognition might be blather.

Sunday, June 1, 2014

Analyzing the 2011-2012 California Health Interview Survey with R

The California Health Interview Survey (CHIS) is a remarkable biannual survey of health status, care access, and demographics for California residents. The 2011-2012 public use survey data has recently been released and is freely available after registering at the CHIS site. CHIS currently offers data in SAS, Stata, and SPSS formats. However, thanks to a few great R packages, it's not very hard to bring this complex and interesting data set into R. In this post, we'll cover some practical details about using CHIS data. We'll also take up a more technical topic: how the CHIS weights were built, and how they should and should not be used. Before we begin, here are some key references for CHIS:
  • UCLA maintains extensive documentation of the CHIS methodology including important details of how the sample weighting was developed and implemented. Definitions for the CHIS variables can be found in the data dictionary included with the CHIS public use datasets.
  • Thomas Lumley created and maintains the survey package. He has provided a description of how to work with the CHIS survey weights here and also in his book.

Getting The Data

To begin, we first need to register (free) at the CHIS site and download the data. The CHIS is a landline and mobile telephone survey, designed to represent the non-institutionalized resident population of California. The data is divided into three subsets: child, adolescent, and adult. For simplicity, we'll just work with the adult data set which has about 43k records. To bring the data into R, download the Stata flavor from CHIS, unzip it, and load it into R as follows:
# Read CHIS file
library(foreign)  # provides read.dta
file <- "~/Desktop/CHIS/STATA/chis11-12_adult_stata/Data/ADULT.dta"  # your file
options(nwarnings=500)  # Bump up number of warnings -- see note below
CHIS   <- read.dta(file, convert.factors = TRUE)
warnings() # print the 257 warning messages
# Note: read.dta generates warnings because some of the Stata 
# value label definitions are not found in the raw Stata file.  Thus,
# some variables will show the raw survey coding and some will print
# with the decoded labels from CHIS.  Consult the data dictionary to translate
# raw survey codes.  To be really careful, we could set convert.factors = FALSE
# to work only with the raw codes for all variables.

Now that we've got the data loaded, we want to take a stab at analyzing it. Browsing through the data dictionary, we see that CHIS records educational attainment in the variable aheduc. How can we tabulate this variable? Observe that CHIS includes 81 weight variables, rakedw0, rakedw1, ..., rakedw80; the last 80 of these are called replicate weights. For the moment, we'll disregard all of these weights except for rakedw0. The variable rakedw0 is the sample weight and it tells us how much each of the 43k observations should "count" in an analysis. Using R's xtabs function, we can tabulate educational attainment with the weight rakedw0:
# Educational attainment -- counts
#, round are just aesthetic choices., CHIS, drop.unused.levels = TRUE), 0))
# aheduc    Freq
# 1              GRADE 1-8 2214253
# 2             GRADE 9-11 1929232
# 3  GRADE 12/H.S. DIPLOMA 6747336
# 4           SOME COLLEGE 4155536
# 5      VOCATIONAL SCHOOL  842097
# 6        AA OR AS DEGREE 1936675
# 7        BA OR BS DEGREE 5918519
# 8      SOME GRAD. SCHOOL  284266
# 9        MA OR MS DEGREE 2537555
# 10   PH.D. OR EQUIVALENT  930248
# 11   NO FORMAL EDUCATION  300768

# sum over the table
margin.table(xtabs(rakedw0 ~ aheduc, CHIS, drop.unused.levels = TRUE))
# 27796484

Though we had only 43k records in the sample, the sum over all the categories is about 28M people which is roughly the size of the non-institutionalized adult population of California. This is because the sample weight rakedw0 was designed so that CHIS totals match closely to known CA demographics (primarily from the California Department of Finance's Demographics Unit).

Also, we observe that:
# Male/Female proportion in CHIS using the sample weight (correct)
round(prop.table(xtabs( rakedw0 ~ srsex, CHIS, drop.unused.levels = TRUE))*100,2)
# srsex
#   MALE FEMALE
#  48.72  51.28

# Male/Female proportion in RAW CHIS (wrong!!)
round(prop.table(xtabs(  ~ srsex, CHIS, drop.unused.levels = TRUE))*100,2)
# srsex
#   MALE FEMALE
#  41.57  58.43  <-- (!)

The raw CHIS sample contains many more women than we'd expect from a completely random draw from the California population. The sample weight rakedw0 includes an adjustment to rebalance the over-representation of females. Later in the post, we'll explore the CHIS weighting method in more detail and also explain the reason for the other weights rakedw1,..., rakedw80.

Simple point estimates with CHIS

CHIS is a great resource for investigating how demographic and health outcome factors are related. Let's use the data to assess a simple hypothesis: Californians with lower educational attainment are more likely to be overweight/obese (ovrwt).

When working with CHIS data it is frequently convenient to group and recode categorical variables. There are a number of ways we could do this but the "car" package offers a recoding function with a simple syntax:
# Recode educational attainment into three levels
library(car)  # provides recode()
# One possible grouping, consistent with the tabulation below:
recodeSTR = "c('BA OR BS DEGREE')='BA';
             c('SOME GRAD. SCHOOL','MA OR MS DEGREE','PH.D. OR EQUIVALENT')='MA OR MORE';
             else='LESS THAN BA'"
CHIS$college_grad = recode(CHIS$aheduc, recodeSTR)

Here's a simple visual tabulation from CHIS to support our hypothesis:

# make a plot

# Educational attainment (recoded) vs overweight
round(prop.table(xtabs(rakedw0 ~ college_grad + ovrwt, 
                   CHIS, drop.unused.levels=TRUE),1)*100,1)
#college_grad    YES   NO
#  BA           51.2 48.8
#  LESS THAN BA 64.6 35.4
#  MA OR MORE   50.5 49.5

# summary df
df <- ~ college_grad + ovrwt, 
                        CHIS, drop.unused.levels=TRUE),1))

# reorder the factors for sake of the plot
df$college_grad = factor(df$college_grad, 
                             levels=c('LESS THAN BA','BA','MA OR MORE')) 

# plot
library(ggplot2)
library(scales)  # for percent_format()
ggplot(df, aes(x=college_grad, y=Freq, fill=ovrwt)) + geom_bar(stat="identity") +
  scale_y_continuous(labels = percent_format()) +
  ylab("% of Adults") + xlab("Education Level") +
  scale_fill_manual(values = c("blue","yellow"))

Now that we've found a potentially interesting relationship in the dataset, our next step might be to check that this relationship is statistically significant. This means we need to compute standard errors for our cross tabulation. Here's where working with the complex CHIS survey design gets a bit tricky...

Replicate weights and estimator variances

At this point, it's useful to take a more detailed look at the weighting methodology employed in CHIS and other complex surveys. This section is a little technical, but the key practical consequences are summarized at the end. The CHIS sample weights are designed through a process called raking (which is closely related to iterative proportional fitting). To understand this process, imagine that we are conducting a study on smoking and have collected the following sample:
sample <- data.frame(
  gender = c(rep("M",3),rep("F",7)),
  city   = c(rep(c("A","B"),4),"B","B"),
  smoke  = c(rep(c(1,0),3),1,0,1,1)
)
#   gender city smoke
#1       M    A     1
#2       M    B     0
#3       M    A     1
#4       F    B     0
#5       F    A     1
#6       F    B     0
#7       F    A     1
#8       F    B     0
#9       F    B     1
#10      F    B     1

From an external source, such as census data, suppose we know with high confidence that 52% of the total population (city A + city B) are female and 48% are male. Suppose we also know that the population is distributed so that 55% of people live in A and 45% live in B. These external facts are usually called control totals. Our sample is clearly not representative of the population, as only 40% of respondents are from city A and 70% are female. If we want to estimate the joint distribution of smoking rates by gender and city, what can we do?

Consider the strategy of assigning a weight $w_i$ to each respondent in the sample. This weight tells us how much each observation should count for in the cross tabulation. Then, for example, we can estimate: $$ p(\mathrm{smoke}, M ,A) \approx \frac{1}{10} \sum_{i: \mathrm{gender}_i = M, \mathrm{city}_i = A } w_i \cdot \mathrm{smoke}_i. $$ The key issue is how to choose the weights $w_i$. Raking is one such strategy which can make sense for complex surveys.

In a raking procedure, we begin with one of the control variables (say gender) and choose $w_i$ so that the sample distribution will match the population distribution. For example, to get 48% male representation we'd need to give each male in the sample a weight such that $$.3 \cdot w_i(\mathrm{gender}_i = M) = .48 \qquad \Rightarrow \qquad w_i(\mathrm{gender}_i = M) = \frac{0.48}{0.3} = 1.6.$$ Similarly, to get 52% female representation, we'd need to give each female in the sample a weight such that $$ w_i(\mathrm{gender}_i = F) = \frac{0.52}{0.7} \approx 0.7429.$$ We can then recompute the crosstabs with this initial weight:
sample$w0 = ifelse(sample$gender=="M",.48/.3, .52/.7)

prop.table(xtabs(w0 ~ gender, sample))
#   F    M 
#0.52 0.48 

prop.table(xtabs(w0 ~ city, sample))
#        A         B 
#0.4685714 0.5314286 

Now gender is in balance but city remains out of balance. To balance the city proportions, we need to weight the weighted data.
# tabulate with current weight
P <- prop.table(xtabs(w0 ~ city, sample))

# update the weight for city
sample$w1 = ifelse(sample$city=="A",.55/P[1], .45/P[2])

# tabulate again
prop.table(xtabs(w0*w1 ~ city, sample))
# city
# A    B 
# 0.55 0.45 
prop.table(xtabs(w0*w1 ~ gender, sample))
# gender
#F         M 
#0.4889064 0.5110936 

So we've balanced city but in doing so we've also unbalanced gender. However, gender is not nearly as out of balance as it was when we began. This suggests that we can iterate this process to build a weight variable that will balance both gender and city simultaneously. Indeed, here is a simple piece of code that will produce the sample weight for this toy problem:
# controls
pop.gender = c(F=.52,M=.48) = c(A=.55,B=.45)

# set initialize the weight
sample$w <- 1

# run the algorithm (only need a few iterations to converge)
for(i in 1:10){

  # get current weighted proportion for gender
  P0 <- prop.table(xtabs(w ~ gender, sample))
  sample$w0 = ifelse(sample$gender=="F", pop.gender[1]/P0[1], pop.gender[2]/P0[2])

  #update weight
  sample$w = sample$w * sample$w0

  # get current weighted proportion for city
  P1 <- prop.table(xtabs(w ~ city, sample))
  sample$w1 = ifelse(sample$city=="A",[1]/P1[1],[2]/P1[2])

  #update weight
  sample$w = sample$w * sample$w1
}
# It works!
prop.table(xtabs(w ~ gender, sample))
# gender
# F    M 
# 0.52 0.48 
prop.table(xtabs(w ~ city, sample))
# city
# A    B 
# 0.55 0.45 

This simplified raking algorithm is a good approximation to how the weights in CHIS (and other complex surveys) are developed. Of course in CHIS, the process is much more complicated and the rebalancing is done over 11 design variables and with many more factor levels.
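The loop above generalizes directly to any number of control variables. As an illustration, here is the same toy calculation written as a short general raking routine in Python (function and variable names are mine):

```python
from collections import defaultdict

def rake(rows, margins, iters=100):
    # rows: list of tuples of category labels.
    # margins: list of (position, {level: target proportion}) pairs,
    # i.e. the control totals for each control variable.
    w = [1.0] * len(rows)
    for _ in range(iters):
        for pos, target in margins:
            totals = defaultdict(float)
            for row, wi in zip(rows, w):
                totals[row[pos]] += wi
            s = sum(w)
            # scale each weight so this margin matches its control total
            for i, row in enumerate(rows):
                w[i] *= target[row[pos]] / (totals[row[pos]] / s)
    return w

# the toy smoking sample from the post: (gender, city)
sample = [("M","A"),("M","B"),("M","A"),("F","B"),("F","A"),
          ("F","B"),("F","A"),("F","B"),("F","B"),("F","B")]
weights = rake(sample, [(0, {"M": .48, "F": .52}),
                        (1, {"A": .55, "B": .45})])
```

After raking, the weighted gender margin is .48/.52 and the weighted city margin is .55/.45, matching the R demo.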

Raking is an interesting procedure from a technical perspective. Why would we expect this algorithm to converge? When it does converge, why can we use these weights to obtain unbiased estimates within cells? These questions are well outside the scope of this post but there is a fair amount of literature on the topic (this article, for example). Practically speaking, what we need to know for working with CHIS are the following facts:
  • The raked weight RAKEDW0 can be used to obtain unbiased estimates of means and proportions in the usual way, e.g. with xtabs in R.
  • Variances (and hence standard errors) for these estimators cannot be computed directly using RAKEDW0. A nice description of this issue can be found in this article.

In the next section, we'll explain how to correctly compute the standard errors for means and proportions in CHIS.

Standard errors with Replicate Weights

To handle the standard errors, we need to make use of the replicate weights rakedw1,...,rakedw80. Essentially, these weights capture variability across subsamples of the CHIS sample; from the spread of estimates recomputed under each replicate weight, we can deduce the correct variance estimates for our point estimators. The details of how the CHIS replicate weights were created and how they are used can be found in section 9.2 of the CHIS methodology manual.
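Generically, replicate-weight standard errors have a simple shape: recompute the estimate once under each replicate weight, then measure the spread of those estimates around the full-sample estimate. This Python sketch shows that generic shape; the scaling constant here is a placeholder, and CHIS's exact constants are the ones given in the methodology manual:

```python
import math

def replicate_se(full_est, replicate_ests, scale=1.0):
    # Spread of the replicate estimates around the full-sample
    # estimate, times a design-dependent scaling constant.
    return math.sqrt(scale * sum((r - full_est) ** 2 for r in replicate_ests))
```

If every replicate estimate equals the full-sample estimate, the standard error is zero; the more the replicates disagree, the larger it gets.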

Fortunately, R's survey package was built precisely to handle complex survey designs, including replicate weights. The key feature of the survey package is that all the trickiness of the CHIS design can be encapsulated in a survey design object $D$. Once R knows about the design, survey's methods will take care of correctly computing the standard errors behind the scenes. Let's return to our education/obesity example and compute standard errors for our cross tabulations:
# survey analysis with the survey package

# remove empty levels from overwt (aesthetic)
CHIS$ovrwt <- factor(CHIS$ovrwt)

# design structure -- settings are thanks to the references
# by T. Lumley.  The design object captures all the information
# about the replicate weights.
D <- svrepdesign(variables = CHIS[,c("college_grad", "ovrwt")], 
            repweights = CHIS[,grep("rakedw[1-9]",colnames(CHIS))],
            weights = CHIS$rakedw0,
            combined.weights = TRUE,
            type = "other",
            scale = 1,
            rscales = 1)

# compute the means and standard errors
svyby( ~ovrwt, ~college_grad, D, 
                FUN = svymean, keep.names = FALSE)

#  college_grad  ovrwtYES   ovrwtNO         se1         se2
#1           BA 0.5122654 0.4877346 0.007344430 0.007344430
#2 LESS THAN BA 0.6456145 0.3543855 0.005697532 0.005697532
#3   MA OR MORE 0.5051845 0.4948155 0.009186273 0.009186273

# compare with the naive standard error (WRONG!):
n  <- sum(CHIS$rakedw0[CHIS$college_grad == "BA"])
x  <- sum(CHIS$rakedw0[CHIS$college_grad == "BA" & CHIS$ovrwt == "YES"]) 
p  <- x/n   # proportion
SE <- sqrt(p*(1-p)/n)  # standard error in proportion
# 0.0002006993 
# This is a much smaller error than what we got with the correct 
# estimate provided by survey! Thus, we would be overly confident
# about our results if we used the naive method.

Sunday, April 20, 2014

The Circus Tent Problem with R's Quadprog

The MathWorks has an interesting demo on how the shape of a circus tent can be modeled as the solution of a quadratic program in MATLAB. In this post, we'll show how to solve this same problem in R using the quadprog package and also provide the technical details not covered in the MathWorks example. In particular, we'll explain how an energy function is associated to this problem and how discretizing this energy function gives rise to a sparse quadratic program.

Quadratic programming is a powerful and versatile technique that has found applications in a diverse range of fields. Perhaps its most famous application is in portfolio allocation theory, though it is also very well known as the computational foundation of support vector machines (SVMs). A quadratic program is a constrained optimization problem where the objective function is a quadratic form (a multidimensional generalization of a quadratic function). The constraints in a standard quadratic program are a mix of linear equality and inequality constraints. This example provides both an intuitive introduction to quadratic programming in R and a great test case for benchmarking optimization algorithms.

Problem Description

We imagine draping a piece of fabric over an ensemble of 5 tent poles, placed symmetrically over a 36 by 36 grid. The four outer poles are all the same size. The center pole is somewhat larger and taller. In R, it's easy to build this ensemble by exploiting symmetry:
# Build the pole grid using basic symmetries.
q2 = matrix(0,18,18)
q2[8:9,8:9] = .3
q2[17:18,17:18] = 1/2
q1 <- q2[,18:1]
top  <- cbind(q2,q1)
z <- rbind(top,top[18:1,])

Let's plot the pole ensemble and the piece of fabric:

# Plot
library(rgl)
x <- (1:36)/36
y <- (1:36)/36
rgl.surface(x,y, z, color='blue',alpha=.75, smooth=TRUE)
rgl.surface(x,y, matrix(1/2,36,36), color='red', front='lines', back='lines')
bg3d("white")
The task is to model the height $u(x,y)$ of the piece of fabric over the grid point $(x,y)$.

Defining an Energy Function

We now need to specify an energy function which defines what should happen when you drape the fabric over the pole ensemble. The idea is that once the fabric is placed, it eventually settles to some minimal energy state. Let $u(x,y)$ be any function that gives heights of the fabric over any grid point $(x,y)$. Suppose we come up with a way to measure the energy $E$ of any fabric distribution $u$. We call this measurement $E[u]$. A minimal energy state is a particular fabric distribution $u^*$ that makes our measurement $E[u^*]$ as small as possible.

The question of how to measure the energy $E[u]$ is one for physicists and engineers. Without specifying the mechanical laws governing the process by which the fabric "settles", we shouldn't expect to be able to provide realistic solutions to this model. Nevertheless, let's consider the famous energy measurement: $$ E[u] = \frac{1}{2}\int_0^{l_y} \int_0^{l_x} \nabla u(x,y) \cdot \nabla u(x,y) \;dx dy $$ where $l_y$ and $l_x$ are the length of the vertical and horizontal sides of the grid and $\nabla$ is the gradient operator. This measurement is called the Dirichlet energy of $u$ and, roughly, it captures the total amount of variation in $u$.

We would like to find a numerical approximation to the fabric distribution $u(x,y)$ that makes the energy $E[u]$ as small as possible and also fits on top of our tent peg ensemble. Let's take for granted that such an optimal fabric distribution exists and that this function has no tears or creases (continuous differentiability). We can then use integration by parts to rewrite the Dirichlet energy as: $$\begin{array}{l} E[u] &= \frac{1}{2} \int_0^{l_y} \int_0^{l_x} \nabla u(x,y) \cdot \nabla u(x,y) \; dx \;dy \\ &= - \frac{1}{2} \int_0^{l_y} \int_0^{l_x} u(x,y) \Delta u(x,y) \; dx dy + \frac{1}{2} \oint_{\square} u(x,y) \nabla u(x,y) \cdot \nu(x,y) \; dS. \end{array}$$ The second integral in the last expression is a surface integral around the rectangular boundary of our grid. If we assume that $u(x,y)=0$ at the edge of the grid then this term vanishes and we're left with the energy expression: $$\begin{array}{l} E[u] &= - \frac{1}{2} \int_0^{l_y} \int_0^{l_x} u(x,y) \Delta u(x,y) \; dx dy. \end{array}$$

Discretizing the Energy Function

Now, let's show how to use the previous formula to get a numerical approximation to the function $u(x,y)$ that minimizes the energy $E[u]$. To do this, we choose a uniform discretization of our grid. Let $h_x$ and $h_y$ be the spacing between grid points in the $x$ and $y$ directions. We can then approximate the Laplacian $\Delta$ using finite difference formulas: $$ \Delta u(x,y) \approx \frac{ u(x+h_x,y) - 2u(x,y) + u(x-h_x,y) }{h_x^2} + \frac{u(x,y +h_y) - 2u(x,y) + u(x,y-h_y)}{h_y^2}.$$ Now we discretize the energy function $E[u]$: $$ \begin{array}{l} E[u] &= -\frac{1}{2} \int_0^{l_y} \int_0^{l_x} u(x,y) \Delta u(x,y) \; dx \;dy \\ & \approx - \frac{h_x h_y}{2} \sum_{i=1}^{N_y} \sum_{j=1}^{N_x} u(jh_x, ih_y) \Delta u(j h_x, i h_y). \end{array} $$ In our discrete problem we are trying to find the values of $u$ at each of our $N_x \cdot N_y$ grid points. In the current formulation, these points are arranged as an $N_y \times N_x$ matrix but it turns out to be much more convenient to reorganize these points into a single long vector of $N_x \cdot N_y$ components. We'll do this by stacking the columns of the grid matrix into a single column so that the grid point $(i,j)$ will map to component $k= N_y(j-1) + i$. In R, we can stack the columns of the matrix $z$ in this way by
nr <- nrow(z)
nc <- ncol(z)
N  <- nr*nc
bvec <- matrix(z,N,1,byrow=FALSE)
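As a quick sanity check on this column-stacking convention, the index map is easy to state in any language; here is a tiny Python version (the function name is mine):

```python
def vec_index(i, j, ny):
    # 1-based grid point (i, j) -> 1-based vector component
    # k = ny*(j - 1) + i, matching R's column-major stacking.
    return ny * (j - 1) + i
```

The first column maps to components 1 through ny, the second column to ny+1 through 2*ny, and so on.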

We write $u_k$ to denote the $k$-th entry of the solution vector $\vec{u}$ in this new indexing system. Then the energy approximation becomes: $$ \begin{array}{l} E[u] & \approx \frac{h_x h_y}{2} \sum_{k=1}^{N_x \cdot N_y} u_k(- \Delta u_k). \end{array} $$ Now, we observe that the approximation of $-\Delta u_k$ involves only five components of the $\vec{u}$ vector. In fact: $$ -\Delta u_k \approx -\frac{u_{k+N_y}- 2u_k + u_{k - N_y}}{h_x^2} - \frac{u_{k+1} -2u_k + u_{k-1} }{h_y^2}.$$ Using this observation, we write the energy approximation as $$ \begin{array}{l} E[u] &\approx \frac{h_x h_y}{2} \vec{u}^T L\vec{u}. \end{array} $$ where $L$ is a special matrix called the discrete Laplacian. The discrete Laplacian is a famous sparse matrix which has five non-zero bands (one for each component that contributes to a typical $\Delta u_k$ value). It is a numerical approximation to the Laplace operator $-\Delta$ and has many remarkable properties. We can use Kronecker products to easily build the 2D discrete Laplacian on a rectangular grid in R:
# 2D Discrete Laplacian on an lx by ly rectangular grid with
# nx grid lines in the x direction and ny grid lines in the y direction.
Laplacian2D <- function(lx,ly, nx,ny){
  hx <- lx/(nx-1)
  hy <- ly/(ny-1)
  tr_x <- c(2,-1,rep(0,nx-2))
  tr_y <- c(2,-1,rep(0,ny-2))
  Tx <- toeplitz(tr_x)/(hx^2)
  Ty <- toeplitz(tr_y)/(hy^2)
  Ix <- diag(nx)
  Iy <- diag(ny)
  L <- kronecker(Tx,Iy) + kronecker(Ix,Ty)
  return(L)
}
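For readers who prefer to see the five bands written out explicitly, here is the same construction sketched in plain Python without Kronecker products. Indexing is 0-based with k = ny*j + i, mirroring the column-major stacking above:

```python
def laplacian2d(lx, ly, nx, ny):
    # Dense 2D discrete Laplacian on an nx-by-ny grid: a diagonal
    # band plus four off-diagonal bands at distances 1 and ny.
    hx, hy = lx / (nx - 1), ly / (ny - 1)
    n = nx * ny
    L = [[0.0] * n for _ in range(n)]
    for j in range(nx):
        for i in range(ny):
            k = ny * j + i  # column-stacked index
            L[k][k] = 2.0 / hx**2 + 2.0 / hy**2
            if j > 0:
                L[k][k - ny] = -1.0 / hx**2
            if j < nx - 1:
                L[k][k + ny] = -1.0 / hx**2
            if i > 0:
                L[k][k - 1] = -1.0 / hy**2
            if i < ny - 1:
                L[k][k + 1] = -1.0 / hy**2
    return L
```

Writing out the loops makes the sparsity obvious: each row has at most five non-zero entries, which is exactly what a dense QP solver fails to exploit.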

Solving the Quadratic Program

To recap, our goal was to find a numerical approximation to the function $u$ which minimizes $E[u]$ and which fits on our tent pole ensemble. Our work in the previous section showed that we can find such an approximation by solving the following optimization problem: $$ \left\{ \begin{array}{l} \min_{\vec{u} \in \mathbb{R}^{N_y \cdot N_x}} \; \frac{1}{2} \vec{u}^T L \vec{u} \\ \mbox{subject to:} \qquad \vec{u} \geq \vec{z} \end{array} \right. $$ where $\vec{z}$ is the vector whose components $z_k$ give the heights of the ground and tent pole ensemble at the (vectorized) grid point $k$. This is a quadratic programming problem, and a very tractable one since an important property of the matrix $L$ is that it is symmetric positive definite.
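Before reaching for a QP solver, it can help to see this structure on a tiny example. The following Python sketch solves a 1D analog of the tent problem with projected gradient descent; this method is chosen purely for illustration and is not the algorithm quadprog uses:

```python
def solve_tent_1d(z, step=0.25, iters=5000):
    # Minimize (1/2) u^T L u subject to u >= z, where L is the 1D
    # discrete Laplacian (grid spacing 1, zero just outside the grid),
    # via projected gradient descent: step downhill, then clip to z.
    n = len(z)
    u = list(z)  # feasible starting point
    for _ in range(iters):
        g = []
        for i in range(n):
            left = u[i - 1] if i > 0 else 0.0
            right = u[i + 1] if i < n - 1 else 0.0
            g.append(2.0 * u[i] - left - right)  # (L u)_i
        u = [max(zi, ui - step * gi) for zi, ui, gi in zip(z, u, g)]
    return u

# a single 'pole' of height 1 in the middle of a 5-point grid
u = solve_tent_1d([0, 0, 1, 0, 0])
```

The solution is the expected linear "tent" draped over the single pole, approximately [1/3, 2/3, 1, 2/3, 1/3]: the fabric touches the pole and slopes down linearly toward the zero boundary.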

Note that the Mathworks example uses a slightly more general quadratic form $$ E[u] \approx \frac{1}{2} \vec{u}^TL \vec{u} + \vec{c}^T \vec{u} $$ where $\vec{c}$ is a constant vector whose components are all equal to $h_x \cdot h_y$. Though the energy function is not explained in their example, it is likely that this linear term is meant to capture some material effects that aren't accounted for by the simple Dirichlet energy measure. We'll also include this term in our solution below to make use of a slightly more interesting quadratic form.

We'll solve our quadratic program with R's quadprog library. The solution algorithm implemented in R's quadprog is somewhat different from that available in MATLAB. One important note about R's quadprog library is that it does not take advantage of the sparsity of the system matrix $L$ nor of the sparsity of the constraints $\vec{u} \geq \vec{z}$. Since an $N \times N$ grid leads to an $N^2 \times N^2$ matrix $L$, the dense matrix representation of $L$ can quickly become impractical. For the $36 \times 36$ grid of this problem, however, quadprog works wonderfully:
# Solve the QP:
Ny  <- nrow(z)
Nx  <- ncol(z)
N   <- Nx*Ny
hx  <- 1/(Nx-1)
hy  <- 1/(Ny-1)

# System matrices
# Note: quadprog's quadratic form is -d^T x + 1/2 x^T D x with t(A)%*%x>= b
theta <- 1                            # coefficient for linear term
Dmat <- hx*hy*Laplacian2D(1,1,36,36)  
dvec <- -theta*(hx*hy)*rep(1,N)      
Amat <- t(diag(N))
bvec <- matrix(z,N,1,byrow=FALSE)     # lower bound

# call the solver
sol  <- solve.QP(Dmat, dvec, Amat, bvec)

# extract and reshape the solution
tent <- matrix(sol$solution,nrow(z),ncol(z), byrow=FALSE)

Now we plot the solution:

x <- (1:Nx)/Nx
y <- (1:Ny)/Ny
rgl.surface(x, y , z, color='blue',alpha=.75, smooth=TRUE)
rgl.surface(x, y , tent, color='red', front='lines', back='lines')
bg3d("white")

Sunday, March 23, 2014

Going viral with R's igraph package

R's igraph package provides a simple and flexible set of utilities for working with graphs.  In this post, we'll use this package to animate the simulated spread of a disease through a network.


A graph is just a collection of nodes joined by edges:
# Specify an undirected graph by hand, using a numeric   
# vector of the pairs of vertices sharing an edge.
library(igraph)
G <- graph( c(1,2,1,3,1,4,3,4,3,5,5,6,6,7,7,8,8,9,3,8,5,8), directed = FALSE )

# visualization
plot(G, layout = layout.fruchterman.reingold,
     vertex.size = 25,
     vertex.frame.color = "white",
     vertex.label.color = "white", = "sans")

In graph models, the nodes of a graph typically represent entities and the edges represent relationships between these entities. Both nodes and edges may have attributes or qualities that characterize the entities and the relationships. A node might have an attribute of "color". An edge might have an attribute of "weight" that encodes the strength of the relationship between the vertices that the edge joins. The igraph package makes it very simple to manage the assignment of attributes to the components of a graph:
G <- graph( c(1,2,1,3,1,4,3,4,3,5,5,6,6,7,7,8,8,9,3,8,5,8), directed = FALSE )

# Assign attributes to the graph
G$name    <- "A colorful example graph"

# Assign attributes to the graph's vertices
V(G)$name  <- toupper(letters[1:9])
V(G)$color <- sample(rainbow(9),9,replace=FALSE)

# Assign attributes to the edges
E(G)$weight <- runif(length(E(G)),.5,4)

# Plot the graph -- details in the "Drawing graphs" section of the igraph manual
plot(G, layout = layout.fruchterman.reingold, 
     main = G$name,
     vertex.label = V(G)$name,
     vertex.size = 25,
     vertex.color= V(G)$color,
     vertex.frame.color= "white",
     vertex.label.color = "white", = "sans")

Building a simulation

The ease and flexibility with which you can assign and update attributes of a graph's nodes and edges makes igraph a powerful tool for prototyping simulations with graphs. Let's consider a very simplified example. Let $G$ be a graph whose nodes are people in a population. Two nodes in $G$ will share an edge if they have daily in-person contact with each other. Suppose also that a virus has infected a few nodes of $G$. These infected nodes can spread the virus to other uninfected nodes with whom they have daily contact. Newly infected nodes will then spread the virus to other nodes, and so on. Let's use the simple probability model: $$p(x \mbox{ infects } y | x \mbox{ is infected}, y \mbox{ is uninfected}, x\mbox{ and } y \mbox{ share an edge})= \frac{1}{2}.$$
Here is a function that implements this simulation model:
spreadVirus <- function(G,Vinitial,p){
  # Precompute all outgoing graph adjacencies
  G$AdjList = get.adjlist(G,mode="out")
  # Initialize various graph attributes
  V(G)$color    = "blue"
  E(G)$color    = "black"
  V(G)[Vinitial]$color    <- "yellow"

  # List to store the incremental graphs (for plotting later)
  Glist <- list(G)
  count <- 1
  # Spread the infection
  active <- Vinitial
  while(length(active) > 0){
    new_infected <- NULL
    E(G)$color = "black"
    for(v in active){
      # spread through the daily contacts of vertex v
      daily_contacts <- G$AdjList[[v]]
      E(G)[v %->% daily_contacts]$color <- "red"
      for(v1 in daily_contacts){
        new_color <- sample(c("red","blue"), 1, prob=c(p,1-p))
        if(V(G)[v1]$color == "blue" & new_color=="red"){
          V(G)[v1]$color <- "red"
          new_infected <- c(new_infected,v1)
        }
      }
    }
    # the next active set
    active <- new_infected
    # Add graph to list
    count <- count + 1
    Glist[[count]] <- G
  }
  return(Glist)
}

The function spreadVirus returns an ordered list of graphs representing the discrete states of the model as the virus spreads. We'll demonstrate the use of this function on a simple directed graph. A directed graph is a graph whose edges are "one-way": a directed edge from node $X$ to node $Y$ allows information to pass from $X$ to $Y$ but not from $Y$ to $X$.
# Specify a directed graph by hand
G <- graph.formula(1-+2,1-+3,2-+4,2-+5,3-+6,5-+7,7-+8,8-+9,9+-7, 9-+10,
      6-+9,1-+5,5-+20,3-+9,10-+11,10-+15,11-+12,11-+13,13-+14, 14-+15)
Vinitial <- c("1","10")

# Run the simulation
Glist <- spreadVirus(G,Vinitial,1/2)

# Animation plots (generates a .GIF); requires the animation package
library(animation)
L <- layout.fruchterman.reingold(Glist[[1]])
saveGIF({
  count <- 0
  for(i in 1:length(Glist)){
    plot(Glist[[i]], layout = L,
         vertex.label = NA,
         vertex.size = 10,
         vertex.color = V(Glist[[i]])$color,
         vertex.frame.color = "white",
         edge.arrow.size = 1)
    count <- count + 1
    title(main="Graph simulation example", 
          sub=paste("Time = ",count), cex.main = 3, cex.sub = 2)
  }
}, interval = 1, = "demo.gif", ani.width = 1000, ani.height = 1000)

This is a very simplistic model and it does not, for example, take into account the importance of time dynamic effects in the virus spread. Thanks to the flexibility of igraph, however, it is not difficult to extend a basic simulation like spreadVirus to include much more complex effects.

We should note, however, that although this implementation is very easy to understand, it is not very efficient. The heavy use of loops will be quite slow when we start working with large graphs. Though we might gain some efficiency through vectorization or other optimizations, for running larger simulations we would probably want to port this routine to another platform like C++. Nevertheless, the strength of R's igraph for quickly building and testing a prototype should be apparent from this example.
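To illustrate the kind of vectorization mentioned above, here is a rough sketch (not the spreadVirus implementation itself, and using a made-up toy graph) of how one infection step can be computed with an adjacency-matrix product instead of nested loops. For simplicity, every infected node spreads on every step, a simplification of spreadVirus's active-set logic:

```r
library(igraph)

set.seed(1)
# Toy directed graph and initial infections (illustrative only)
G <- erdos.renyi.game(50, 0.1, directed = TRUE)
A <- as.matrix(get.adjacency(G))
infected <- rep(FALSE, vcount(G))
infected[1:3] <- TRUE
p <- 1/2

# One simulation step, with no loops over vertices:
# count the infected in-neighbors of every node ...
exposures <- as.vector(t(A) %*% infected)
# ... each exposure transmits independently with probability p,
# so a node with k infected neighbors is infected with prob 1-(1-p)^k
p_infect <- 1 - (1 - p)^exposures
new_infected <- !infected & (runif(vcount(G)) < p_infect)
infected <- infected | new_infected
```

A whole-population update like this replaces the two inner for loops with three vectorized lines, at the cost of materializing the adjacency matrix.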

More realistic graphs

So far, the graphs we have worked with have been quite simple. How do we start building more complex graphs? There are many great sources for real graph data, like Stanford's SNAP dataset collection. There are also many routines for generating random graphs in the igraph package. The function is one such example. The degree of a node in a graph is the number of edges attached to that node. One way to study graphs is to look at the distribution of the degrees of the nodes in the graph. For many interesting graphs, the degree distribution often has the form of an exponential distribution: $$ p(\mathrm{degree}(\mathrm{node}) = N ) \propto \exp\left(-\tau N \right), $$ see for example this Wikipedia article which discusses the degree distribution of web links. As suggested in the igraph manual, we can generate a larger random graph with an exponential degree distribution as follows:
N <- 200
in.deg  <- sample(1:N, N, replace = TRUE, prob = exp(-.5*1:N))
G <-, out.deg = in.deg, method = "simple")
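To confirm that the generator actually realizes the requested degree sequence, we can compare the graph's in-degrees with the sequence we drew (a short self-contained sketch; method = "simple" is one of's standard options and it reproduces the degree sequence exactly):

```r
library(igraph)

set.seed(1)
N <- 200
in.deg <- sample(1:N, N, replace = TRUE, prob = exp(-.5*(1:N)))
G <-, out.deg = in.deg, method = "simple")

# The generated graph reproduces the drawn degree sequence
all(sort(degree(G, mode = "in")) == sort(in.deg))   # TRUE
```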

Here's an example simulation on this new graph:

# set seed for reproducible example
set.seed(1)
# build graph list
Vinitial   = sample(V(G),5,replace=FALSE)
Glist <- spreadVirus(G,Vinitial,1/2)

# Circle layout is better for large graphs
L <- layout.circle(Glist[[1]])

# Animation (as before, requires the animation package)
library(animation)
saveGIF({
  count <- 0
  for(i in 1:length(Glist)){
    plot(Glist[[i]], layout = L,
         vertex.label = NA,
         vertex.size = 4,
         vertex.color = V(Glist[[i]])$color,
         vertex.frame.color = "white",
         edge.arrow.size = 1)
    count <- count + 1
    title(main="Graph simulation example (200 nodes)", 
          sub=paste("Time = ",count), cex.main = 3, cex.sub = 2)
  }
}, interval = 1, = "demo.gif", ani.width = 1000, ani.height = 1000)

Sunday, January 12, 2014

Solving Quadratic Programs with R's quadprog package

In this post, we'll explore a special type of nonlinear constrained optimization problems called quadratic programs. Quadratic programs appear in many practical applications, including portfolio optimization and in solving support vector machine (SVM) classification problems. There are several packages available to solve quadratic programs in R. Here, we'll work with the quadprog package. Before we dive into some examples with quadprog, we'll give a brief overview of the terminology and mechanics of quadratic programs.

Quick Overview of Quadratic Programming

In a quadratic programming problem, we consider a quadratic objective function: $$ Q(x) = \frac{1}{2} x^T D x - d^T x + c.$$ Here, $x$ is a vector in $\mathbb{R}^n$, $D$ is an $n \times n$ symmetric positive definite matrix, $d$ is a constant vector in $\mathbb{R}^n$ and $c$ is a scalar constant. The function $Q(x)$ is sometimes referred to as a quadratic form and it generalizes the quadratic function $q(x) = ax^2 +bx + c$ to higher dimensions. The key feature to note about $Q(x)$ is that it is a convex function.

We also impose a system of linear constraints on the vector $x \in \mathbb{R}^n$. We write these constraints in the form $$ Ax = f \qquad Bx \geq g.$$  Here $A$ is an $m_1 \times n$ matrix with $m_1 \leq n$ and $B$ is a $m_2 \times n$ matrix.  The vectors $f$ and $g$ have lengths $m_1$ and $m_2$ respectively.  This specification is general enough to allow us to consider a variety of practical conditions on the components of $x$, e.g. we can force $x$ to satisfy the sum condition $$ \sum_{i=1}^n x_i = 1 $$ or the box constraints $a_i \leq x_i \leq b_i$. We'll describe how to encode practical constraint conditions into matrix systems below.
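As a concrete preview (a sketch with made-up bounds for $n = 3$), the sum condition becomes a single equality row, and the box constraints $a_i \leq x_i \leq b_i$ become $2n$ inequality rows built from identity matrices:

```r
# Sketch: encoding constraints for a small problem with n = 3 unknowns
n <- 3
a <- c(0, 0, 0)   # example lower bounds
b <- c(1, 1, 1)   # example upper bounds

# Equality constraint sum(x) = 1: a single row of ones
A <- matrix(1, nrow = 1, ncol = n)
f <- 1

# Box constraints a <= x <= b, written as B x >= g:
#   x >= a   and   -x >= -b  (i.e. x <= b)
B <- rbind(diag(n), -diag(n))
g <- c(a, -b)
```

Any point with components in [0, 1] then satisfies all six inequality rows of B at once.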

With this notation out of the way, we can write the quadratic program (QP) compactly as: $$ \left\{ \begin{array}{l} \mathrm{minimize}_{x \in \mathbb{R}^n}: \qquad Q(x) = \frac{1}{2} x^T D x - d^T x + c \\ \mathrm{subject\; to:} \qquad Ax = f \qquad Bx \geq g \end{array} \right.$$

Example #1:

Consider the objective function: $$ \begin{eqnarray*} Q(x,y) &=& \frac{1}{2} \left[ \begin{matrix} x & y \end{matrix} \right] \left[ \begin{matrix} 2 & -1 \\ -1 & 2 \end{matrix} \right]\left[ \begin{matrix} x \\ y \end{matrix} \right] - \left[ \begin{matrix} -3 & 2 \end{matrix} \right] \left[ \begin{matrix} x \\ y \end{matrix} \right] + 4 \\ &=& x^2 + y^2 -xy +3x -2y + 4 . \end{eqnarray*}$$ We seek to minimize this function over the triangular region $$\begin{eqnarray*} y &\geq& 2 - x \\ y &\geq& -2 + x \\ y &\leq& 3. \end{eqnarray*}$$
We can find the vertices of this triangle and plot the region in R:
plot(0, 0, xlim = c(-2,5.5), ylim = c(-1,3.5), type = "n", 
     xlab = "x", ylab = "y", main="Feasible Region")
polygon(c(2,5,-1), c(0,3,3), border=TRUE, lwd=4, col="blue")

To solve this QP with the quadprog library, we'll need to translate both our objective function and the constraint system into the matrix formulation required by quadprog. From the quadprog documentation:
This routine implements the dual method of Goldfarb and Idnani (1982, 1983) for solving quadratic programming problems of the form min(-d^T b + 1/2 b^T D b) with the constraints A^T b >= b_0.
It's not hard to see how to put the quadratic form $Q(x,y)$ into the correct matrix form for the objective function of quadprog (but be sure to note the sign pattern and factors of two).  First we observe that, for any constant $c$, the minimizer of $Q(x,y)+c$ is the same as the minimizer of $Q(x,y)$. We can therefore ignore all constant terms in the quadratic form $Q(x,y)$. We set:
$$D =  \left[ \begin{matrix} 2 & -1 \\ -1 & 2 \end{matrix} \right] \qquad  d = \left[ \begin{matrix} -3 \\ 2 \end{matrix} \right]. $$
We can write the constraint equations in the form:
$$\left[ \begin{matrix} 1 & 1 \\ -1 & 1 \\ 0 & -1 \end{matrix} \right] \left[ \begin{matrix} x \\ y \end{matrix} \right] \geq \left[ \begin{matrix} 2 \\ -2 \\ -3 \end{matrix} \right]$$ so that $$A = \left[ \begin{matrix} 1 & 1 \\ -1 & 1 \\ 0 & -1 \end{matrix} \right]^T \qquad b_0 = \left[ \begin{matrix} 2 \\ -2 \\ -3 \end{matrix} \right].$$ Here is the complete implementation in R:
library(quadprog)
Dmat <- 2*matrix(c(1,-1/2,-1/2,1), nrow = 2, byrow=TRUE)
dvec <- c(-3,2)
A    <- matrix(c(1,1,-1,1,0,-1), ncol = 2 , byrow=TRUE)
bvec <- c(2,-2,-3)
Amat <- t(A)
sol  <- solve.QP(Dmat, dvec, Amat, bvec, meq=0)

Note the parameter meq is used to tell quadprog that first meq constraint conditions should be treated as equalities. Let's inspect the output of the quadprog solver:
> sol
$solution
[1] 0.1666667 1.8333333

$value
[1] -0.08333333

$unconstrained.solution
[1] -1.3333333  0.3333333

$iterations
[1] 2 0

$Lagrangian
[1] 1.5 0.0 0.0

$iact
[1] 1
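As a quick sanity check on this output (a small self-contained snippet repeating the setup above), we can confirm that the reported solution satisfies every inequality constraint:

```r
library(quadprog)

# Re-stated setup so the check is self-contained
Dmat <- 2*matrix(c(1,-1/2,-1/2,1), nrow = 2, byrow = TRUE)
dvec <- c(-3,2)
A    <- matrix(c(1,1,-1,1,0,-1), ncol = 2, byrow = TRUE)
bvec <- c(2,-2,-3)
sol  <- solve.QP(Dmat, dvec, t(A), bvec, meq = 0)

# Every row of A %*% x should meet or exceed bvec at the solution
all(A %*% sol$solution >= bvec - 1e-8)   # TRUE
```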
The point $(1/6,11/6)$ is the unique minimizer of $Q(x,y)$ subject to the constraint conditions. The point $(-4/3,1/3)$ is the unique unconstrained minimum of $Q(x,y)$, reported in the unconstrained.solution slot. The slots iterations, Lagrangian, and iact are diagnostics describing the performance of the quadprog algorithm; we'll discuss these values in a future post. Now, let's visualize the QP solution by superimposing the boundary of the feasible region on a contour plot of the surface $Q(x,y)$.
In this plot, dark green shading indicates lower altitude regions of the surface $Q(x,y)$, while lighter regions indicate higher altitudes. The red point is the global minimum of $Q(x,y)$ and the yellow point is the solution to the QP.
# Contour Plot with Feasible region overlay; requires the lattice package
library(lattice)
qp_sol <- sol$solution
uc_sol <- sol$unconstrained.solution
x <- seq(-2, 5.5, length.out = 500)
y <- seq(-1, 3.5, length.out = 500)
grid <- expand.grid(x=x, y=y)
grid$z <- with(grid, x^2 + y^2 -x*y +3*x -2*y + 4)
levelplot(z~x*y, grid,  cuts=30,
    panel = function(...){
      panel.levelplot(...)
      panel.polygon(c(2,5,-1),c(0,3,3), border=TRUE, lwd=4, col="transparent")
      panel.points(c(uc_sol[1],qp_sol[1]), c(uc_sol[2],qp_sol[2]),
             lwd=5, col=c("red","yellow"), pch=19)},
    col.regions = terrain.colors(30))

Example #2:

Suppose we have selected 10 stocks from which to build a portfolio $\Pi$. We want to determine how much of each stock to include in our portfolio.

The expected monthly return rate of our portfolio is $$ \overline{r_\Pi} = \sum_{i=1}^{10} w_i \overline{r_i} $$ where $\overline{r_i}$ is the mean monthly return rate on asset $i$ and $w_i$ is the fraction of the portfolio value due to asset $i$. Note that the portfolio weights $w_i$ satisfy the constraints $$ 0 \leq w_i \leq 1 \qquad \qquad \sum_{i=1}^{10} w_i = 1. $$ In practice, we can only estimate the average returns $\overline{r_i}$ using past price data. This is a snap using R's quantmod package:
# Get monthly return data from 2012 through 2013; requires quantmod
library(quantmod)
myStocks <- c("AAPL","XOM","GOOGL","MSFT","GE","JNJ","WMT","CVX","PG","WFC")
getSymbols(myStocks, src='yahoo')
returnsList <- lapply(myStocks, 
                   function(s) periodReturn(eval(parse(text=s)),
                                 period='monthly', subset='2012::2013'))

# Combine return time series
returns.df  <-,returnsList)
colnames(returns.df) <- myStocks
The data frame returns.df contains a time series of monthly returns for each of the 10 specified ticker symbols. Let's plot the monthly returns:
# Plot monthly return data; requires reshape2 and ggplot2
library(reshape2)
library(ggplot2)
returns2 <-
returns2$date <- row.names(returns2)
returns2 <- melt(returns2, id="date")
ggplot(returns2, aes(x=date,y=value, group=variable)) +
    geom_line(aes(color=variable), lwd=1.5) +
    ylab("Monthly Return")+ xlab("Date") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

From this plot, we can see that there is significant fluctuation in return rates. This suggests that the variance of $r_i$ and covariances of $r_i$ and $r_j$ should play a role in our analysis. In effect, these variances and covariances indicate how likely we are to actually observe a portfolio return close to our expected return $\overline{r_\Pi}$.

To take into account the risk of deviations in our portfolio return, define the quadratic form $$Q(\vec{w}) = \vec{w}^T C \vec{w}, $$ where $C$ is the covariance matrix of the returns $r_i$. To solve the portfolio allocation problem, we'll try to determine the weights $\vec{w} = (w_1,...,w_{10})$ so that the risk function $Q(\vec{w})$ is minimized. But there are some restrictions to consider. In addition to requiring that $\sum_{i=1}^{10} w_i =1$ and $0 \leq w_i \leq 1$, we may also require a minimum return from the portfolio. For example, we might demand a minimum expected monthly return of 1%: $$ \sum_{i=1}^{10} w_i E(r_i) \geq .01.$$ The covariance matrix $C$ is always symmetric positive semidefinite, and it is positive definite except in the case of perfect multicollinearity, so this constrained minimization problem is a quadratic programming problem of the type that can be handled by quadprog.
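Since quadprog's solver requires a positive definite Dmat, it is worth checking this condition numerically. Here is a minimal sketch using simulated returns (made-up numbers standing in for the quantmod data):

```r
set.seed(1)
# Simulated stand-in for a monthly returns matrix: 24 months x 10 assets
R <- matrix(rnorm(24*10, mean = .01, sd = .05), nrow = 24)
C <- cov(R)

# C is positive definite iff all of its eigenvalues are strictly positive
ev <- eigen(C, symmetric = TRUE, only.values = TRUE)$values
min(ev) > 0   # TRUE for generic return data
```

If min(ev) were zero (perfect multicollinearity among the assets), solve.QP would reject the matrix.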

Let's now describe how to implement the quadprog solver for this problem. First, compute the average returns and covariance matrix in R:
# Compute the average returns and covariance matrix of the return series
r <- matrix(colMeans(returns.df), nrow=1)
C <- cov(returns.df)
The constraints in this case are a little trickier than in Example #1. For quadprog, all of our constraints must be expressed in a linear system of the form $A^T \vec{w} \geq f$, arranged so that any equality constraints appear in the first $m$ rows of the system. We build up the matrix $A^T$ step by step using rbind and apply a transpose at the very end.

To enforce the sum to one condition we need: $$ \left[ \begin{matrix} 1 & 1 & \cdots 1 \end{matrix} \right] \vec{w} = 1. $$ This equality condition should appear first in the system. To enforce the minimum expected return, we require $r \cdot \vec{w} \geq .01$ where $r$ is the row of average return rates obtained from the dataset. To force $0 \leq w_i \leq 1$, we require $$ I_{10} \vec{w} \geq 0 \qquad \qquad -I_{10} \vec{w} \geq - 1$$ where $I_{10}$ is the $10 \times 10$ identity matrix. Putting these steps together:
# Stage and solve the QP
A    <- matrix(1,1,10)
A    <- rbind(A, r, diag(10),-diag(10))
f    <- c(1, 0.01, rep(0,10),rep(-1,10))
sol <- solve.QP(Dmat=C, dvec = rep(0,10), Amat=t(A), bvec=f, meq=1)
Let's inspect the optimal allocation:
portfolio <- data.frame(name = myStocks, w = round(sol$solution,3))
ggplot(portfolio, aes(x=name, y=w)) + geom_bar(stat="identity", fill="blue")

The expected return from this allocation is about 1.2%. It is instructive to solve the quadratic program with different minimum return requirements. For example, there is a solution to this problem with minimum required expected return greater than or equal to 2% but no solution with minimum required expected return greater than or equal to 3%. What is key to note is that the solution with the 2% restriction has a higher value of $Q(\vec{w})$ (more risk) compared to the lower risk solution to the problem with the 1% restriction. As we'd expect, quadratic programming doesn't allow us to escape the risk-return trade-off; it only provides the lowest risk allocation for a given minimum expected return requirement.
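This trade-off can be traced out explicitly by re-solving the QP over a grid of minimum-return targets. Here is a sketch using simulated returns (made-up data standing in for the quantmod series), which produces one (target, risk) pair per solve; the targets are capped below max(r) so that every problem in the grid is feasible:

```r
library(quadprog)

set.seed(1)
# Simulated stand-in for the monthly return data: 24 months x 10 assets
R <- matrix(rnorm(24*10, mean = .01, sd = .05), nrow = 24)
r <- colMeans(R)
C <- cov(R)
n <- length(r)

# Re-solve the QP for a grid of minimum expected-return targets
targets  <- seq(0, 0.9*max(r), length.out = 5)
frontier <- sapply(targets, function(rmin){
  A <- rbind(rep(1, n), r, diag(n), -diag(n))
  f <- c(1, rmin, rep(0, n), rep(-1, n))
  sol <- solve.QP(Dmat = C, dvec = rep(0, n), Amat = t(A), bvec = f, meq = 1)
  c(target = rmin, risk = sol$value)
})
frontier
```

Because raising the return target shrinks the feasible set, the minimized risk in each column of frontier is nondecreasing in the target, which is exactly the risk-return trade-off described above.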