tag:blogger.com,1999:blog-18870319302413336622017-08-04T13:44:19.322-07:00quantitateSoftware, computing, optimization, statistics, and miscellanyRyan Walkernoreply@blogger.comBlogger14125tag:blogger.com,1999:blog-1887031930241333662.post-54921964289527423362017-02-12T16:18:00.000-08:002017-02-12T16:18:19.309-08:00Rust: Cross compiling from Ubuntu to OS X<style type="text/css"> .gist-file .gist-data {max-height: 500px;} </style>A few months ago, I saw this <a href="http://www.chriskrycho.com/2016/using-rust-for-scripting.html">post</a> by Chris Krycho on Hackernews, which points out just how easy it can be to share Rust binaries with friends. The use case really spoke to me -- I spend a lot of time building prototypes and I often need to share my prototypes with a few non-technical people. The typical sharing workflow looks like this: <ol> <li>I package my prototype as a Python script with a bunch of dependencies into a simple command line tool.</li><li>I send the package over to a collaborator.</li><li>Collaborator is missing roughly half of the dependencies.</li><li>I debate teaching collaborator about virtual environments or anaconda. Meanwhile, collaborator is busy typing things like <pre class="prettyprint bash"><br />sudo pip install randompackage<br /></pre>into their terminal. </li><li>Collaborator says, "it's still not working". Turns out collaborator only has Python 2.7 but the script is meant for Python 3.</li><li>The process repeats until we get the script running.</li></ol> After reading Chris's post, I started looking into what would be involved in cross compiling from Ubuntu (my preferred OS) to OS X (the preferred OS of most of my collaborators). I was discouraged--it looked complicated and I wasn't sure it could be done without having a running version of OS X with Xcode. Dismayed, I dropped the idea entirely. But the burden of sharing my work continues to weigh on me while my love for the Rust langauge grows ever stronger. Finally, this weekend I decided enough was enough and I was going to attempt to figure out how to cross compile from Ubuntu to OS X. <br><br> And it turns out that it's not nearly as hard as I expected! Most of the key ideas have already been fully explained by Brian Pearce in this <a href="http://alwayscoding.ca/momentos/2016/05/08/cross-compilation-to-osx-with-rust/">post</a>. Brian cross compiles from a Vagrant Ubuntu build environment running on an OS X host to an OS X binary. But Brian's post isn't quite enough for the devoted Linux user -- it still assumes you're running a copy of OS X somewhere (namely on the Vagrant host).<br><br> Below, I give a complete account of how to cross compile a Rust library from Ubuntu 14.04 for an OS X target. I emphasize again that most of the ideas are not mine but come from Brian's post as well as from the work of the <a href="https://github.com/tpoechtrager/osxcross">osxcross</a> project. I've just added a few details to complete the build without having to run OS X anywhere.<br><br> <h2>Prerequisites</h2>Make sure you've got the following dependencies installed directly from apt-get: <pre class="prettyprint bash"><br />sudo apt-get update<br />sudo apt-get install clang autotools-dev automake cmake libfuse-dev<br /></pre>Note, a difference from Brian's post is that we require <code>libfuse</code>. This allows osxcross to work directly with the necessary Xcode disk image so that we need only have the Xcode image and not a full OS X system to build the OS X SDK. Next, install rustup: <pre class="prettyprint bash"><br /> curl https://sh.rustup.rs -sSf | sh<br /></pre>You'll then want to run (if this is the first time using rustup): <pre class="prettyprint bash"><br /> source .cargo/env<br /></pre> Next clone the osxcross project: <pre class="prettyprint bash"><br />git clone https://github.com/tpoechtrager/osxcross<br /></pre> Finally, <a href="https://developer.apple.com/download/more/?name=Xcode%207.3">download version 7.3 of Xcode</a>. For this you'll need to signin in with any valid Apple ID. You want to obtain a <code>.dmg</code> file, e.g. <code>Xcode_7.3.1.dmg</code>.<br><br> <h2>Bootstrapping the OS X cross compile toolchain</h2>First, we'll use osxcross to build an OS X SDK package for cross compilation. Change into the osxcross root directory and run: <pre class="prettyprint bash"><br />./tools/gen_sdk_package_p7zip.sh Xcode_7.3.1.dmg<br /></pre>providing the full path to your Xcode <code>.dmg</code> file. If all goes well, you osxcross will have built a packaged OS X SDK into the root of the osxcross project. Move this package into the tarballs directory: <pre class="prettyprint bash"><br />mv MacOSX10.11.sdk.tar.xz tarballs/<br /></pre>Now we build osxcross: <pre class="prettyprint bash"><br />OSX_VERSION_MIN=10.7 ./build.sh<br /></pre>This creates the full OS X cross compiler tool chain, allowing us to compile Rust targets just like we would on a native OS X system. To make the toolchain available, you need to add it to your path. For this, you might consider edit your <code>~/.profile</code> from: <pre class="prettyprint bash"><br /># this line was automatically added to .profile by rustup:<br />export PATH="$HOME/.cargo/bin:$PATH"<br /></pre>to: <pre class="prettyprint bash"><br />export PATH="$HOME/.cargo/bin:$HOME/projects/osxcross/target/bin:$PATH"<br /></pre><br><br><h2>Cross compiling</h2>Now we are almost ready to cross compile. Change into your Rust project directory and use Rustup to configure an OS X target: <pre class="prettyprint bash"><br />rustup target add x86_64-apple-darwin<br /></pre> Finally, we need to tell <code>rustc</code> where to find the linker. Add or edit a <code>.cargo/config</code> file to the root of your Rust project to contain the following line: <pre class="prettyprint bash"><br />[target.x86_64-apple-darwin]<br />linker = "x86_64-apple-darwin15-clang"<br /></pre> We can then cross compile our Rust code with: <pre class="prettyprint bash"><br />cargo build --release --target=x86_64-apple-darwin<br /></pre>Your shiny new binary can be found under <code>target/x86_64-apple-darwin/release</code>.<br><br> Hooray and happy sharing! Ryan Walkerhttps://plus.google.com/112373077410906476606noreply@blogger.com0Palo Alto, CA, USA37.4418834 -122.1430194999999837.2400059 -122.46574299999997 37.643760900000004 -121.82029599999998tag:blogger.com,1999:blog-1887031930241333662.post-45839122727216717582016-12-27T11:04:00.000-08:002016-12-27T11:04:02.098-08:00Analyzing the 2015 California Health Interview Survey in R<style type="text/css"> .gist-file .gist-data {max-height: 500px;} </style>A few years ago, I wrote about how to <a href="http://blog.ryanwalker.us/2014/06/analyzing-2011-2012-california-health.html">analyze the 2012 California Health Interview Survey in R</a>. In 2012, plans for <a href="https://en.wikipedia.org/wiki/Covered_California">Covered California</a> (Obamacare in California) were just beginning to take shape. Today, Covered California is a relatively mature program and it is arguably the most successful implementation of the Affordable Care Act in the United States. This month, <a href="http://healthpolicy.ucla.edu/Pages/home.aspx">UCLA's Center for Health Policy</a> released the <a href="http://healthpolicy.ucla.edu/newsroom/press-releases/pages/details.aspx?NewsID=261">2015 California Health Interview Survey</a> (CHIS for short). With this fantastic new data set, we can measure the impact of Covered California in its second year. In this brief post, I'll review the basics of the working with CHIS data in R by way of a simple example. My hope is to inspire other R users to dive into this unique data set. <br> <br> <h2>The CHIS Quickstart guide for R</h2>Though CHIS is a complex survey, it's simple to work with CHIS data in R. Here's how to get started: <ul> <li> Head over to the <a href="http://healthpolicy.ucla.edu/chis/Pages/default.aspx">CHIS site and create an account</a>.</li><li> Once you've created a free account and accepted a bunch of terms of use agreements, you'll be able to <a href="http://healthpolicy.ucla.edu/chis/data/public-use-data-file/Pages/2015.aspx">download</a> the CHIS public use data files. You'll want to download the Stata .dta versions, as these are the easiest to work with using R's foreign package.</li><li> CHIS data is divided into 3 groups, child, adolescent, and adult. We'll work with the adult data below.</li><li> You'll also want to download the appropriate data dictionary for your data set. The dictionary provides excellent documentation about the hundreds of variables covered by CHIS. If it's your first time working with CHIS, I recommend a quick skim of the entire dictionary to get a sense of the kinds of things covered by the survey. </li> </ul>Once you've downloaded the data, to bring it into R you can use the foreign package: <pre class="prettyprint lang-r"><br /># Read CHIS file<br />library(foreign)<br />file <- "~/projects/CHIS/chis15_adult_stata/Data/ADULT.dta" # your file<br />CHIS <- read.dta(file, convert.factors = TRUE)<br /></pre> <br/> The most important thing to understand about CHIS data is how to use the replicate weights RAKEDW0-RAKED80. I covered the use of replicate weights in detail in this <a href="http://blog.ryanwalker.us/2014/06/analyzing-2011-2012-california-health.html">post</a>. The important points about replicate weights in CHIS are: <ul><li>Use RAKEDW0 for estimating means and counts in CHIS. RAKEDW0 is designed so that it's sum across all rows in the CHIS data is equal to the total non-institutionalized adult population of California.</li><li>Use RAKEDW1-RAKED80 for estimating variances as described <a href="http://blog.ryanwalker.us/2014/06/analyzing-2011-2012-california-health.html">here</a>. </ul> As an example, let's start by getting counts of health insurance coverage by type. For this we have two insurance type variable INSTYPE and the new INS9TP which gives a more detailed breakdown of insurance types. <pre class="prettyprint lang-r"><br /># tabulate the data<br />print(as.data.frame(xtabs(rakedw0~instype, CHIS, drop.unused.levels = TRUE)))<br /># instype Freq<br /># 1 UNINSURED 2910380.5<br /># 2 MEDICARE & MEDICAID 1561496.7<br /># 3 MEDICARE & OTHERS 1646743.6<br /># 4 MEDICARE ONLY 2129841.7<br /># 5 MEDICAID 6239539.9<br /># 6 EMPLOYMENT-BASED 12193686.7<br /># 7 PRIVATELY PURCHASED 1985807.9<br /># 8 OTHER PUBLIC 415154.8 <br /></pre> <br/> One interesting health behavior that CHIS tracks is fast food consumption. To create the variable AC31, CHIS asked respondents about the number of times they ate fast food in the past week. This simple script explores how fast food consumption behavior interacts with health insurance coverage type:<br><br><script src="https://gist.github.com/rwalk/f2d6d995b5133d92ca69495c218cb44b.js"></script><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/--3Uu1NXiieE/WGK0611DkVI/AAAAAAAAAZU/fAMzltWZUGwscPQsB5OUPQ7IixQx8PK2wCLcB/s1600/chis2015demo.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/--3Uu1NXiieE/WGK0611DkVI/AAAAAAAAAZU/fAMzltWZUGwscPQsB5OUPQ7IixQx8PK2wCLcB/s1600/chis2015demo.png" /></a></div><br><br> Already with this superficial analysis we can see some interesting things. First we notice that the uninsured are eating fast food more often than the non-Medicaid insured. The uninsured's fast food behavior looks quite similar to the Medicaid population while the fast food behaviour of the employment-based insured resembles the behaviour of the private purchase group. And most importantly, everyone is eating too much fast food.<br><br> <h2>Conclusion</h2>I hope this simple example inspires you to investigate CHIS data on your own. I think it would be especially interesting to see some further analysis of the nearly 3 million Californians who remain uninsured despite the relative success of Covered California. Some interesting background research on this topic can be found <a href="http://laborcenter.berkeley.edu/pdf/2012/aca_uninsured12.pdf">here</a> and <a href="http://healthpolicy.ucla.edu/publications/search/pages/detail.aspx?PubID=1593">here</a>. Feel free to get in touch if you are working with CHIS data to improve public health in California.<br>Ryan Walkerhttps://plus.google.com/112373077410906476606noreply@blogger.com1tag:blogger.com,1999:blog-1887031930241333662.post-71881079724600906552016-01-03T15:45:00.001-08:002016-11-15T07:57:53.462-08:00Color Quantization in R<style type="text/css"> .gist-file .gist-data {max-height: 500px;} </style>In this post, we'll look at a simple method to identify segments of an image based on RGB color values. The segmentation technique we'll consider is called <a href="https://en.wikipedia.org/wiki/Color_quantization">color quantization</a>. Not surprisingly, this topic lends itself naturally to visualization and R makes it easy to render some really cool graphics for the color quantization problem. <div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-vBWAP1Cx150/VomoOIBJyRI/AAAAAAAAAYg/teDEvC0rZ0U/s1600/barn.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-vBWAP1Cx150/VomoOIBJyRI/AAAAAAAAAYg/teDEvC0rZ0U/s640/barn.png" /></a></div><br><br> The code presented in detail below is packaged concisely in this github gist:<br><br><script src="https://gist.github.com/rwalk/e61e34e583e9b70fa7cf.js"></script>By sourcing this script in R, all the required images will be fetched and some demo visualizations will be rendered. <h1>Color Quantization</h1>Digital color images can be represented using the <a href="https://en.wikipedia.org/wiki/RGB_color_model">RGB color model.</a> In a digital RGB image, each pixel is associated with a triple of 3 channel values red, green, and blue. For a given pixel in the image, each channel has an intensity value (e.g. an integer in the range from 0 to 255 for an 8-bit color representation or a floating point number in the range from 0 to 1). To render a pixel in a particular image, the intensity values of three RGB channels are combined to yield a specific color value. This RGB illumination image from Wikipedia give some idea of how the three RGB channels can combine to form new colors:<br><br><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-8oxEkOm2LTM/VolWzmGcr4I/AAAAAAAAAUE/BZhuMUUtpes/s1600/RGB_illumination.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-8oxEkOm2LTM/VolWzmGcr4I/AAAAAAAAAUE/BZhuMUUtpes/s1600/RGB_illumination.jpg" /></a></div><br><br>The goal of <a href="https://en.wikipedia.org/wiki/Image_segmentation">image segmentation</a>, is to take a digital image and partition it into simpler regions. By breaking an image into simpler regions, it often becomes easier to identify interesting superstructure in an image such as edges of objects. For example, here's a possible segmentation of the Wikipedia RGB illumination image into 8 segments:<br><br><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-jtmC9PgiPn8/VolYEsWTfkI/AAAAAAAAAUQ/UOMNUYNwI8g/s1600/rgb_illumination_segmented.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-jtmC9PgiPn8/VolYEsWTfkI/AAAAAAAAAUQ/UOMNUYNwI8g/s1600/rgb_illumination_segmented.png" /></a></div>This segmentation picks out all of the solid color regions in the original image (excluding the white center) and discards much of the finer details of the image.<br><br>There are many approaches to segmenting an image but here we'll just consider a fairly simple one using <a href="https://en.wikipedia.org/wiki/K-means_clustering">K-means</a>. The k-means algorithm attempts to partition a data set into k clusters. Our data set will be the RBG channel values for each pixel in a given image and we'll choose k to coincide with the number of partitions we'd like to extract from the region. By clustering over the RGB channel values, we'll tend to get clusters whose RGB channel values are relatively "close" in terms of Euclidean distance. If the choice of k is a good one, the color values of the pixels within a cluster will be very close to each other and the color values of pixels within two different clusters will be fairly distinct. <h1>Implementing Color Segmentation in R</h1>This beautiful image of a <a href="https://en.wikipedia.org/wiki/Mandrill">mandrill</a> is famous in image processing (it's also in the public domain like all images in this post). <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-tkggOOkYd00/Voln4EZhHrI/AAAAAAAAAVg/s81ocGoNFcc/s1600/mandrill.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-tkggOOkYd00/Voln4EZhHrI/AAAAAAAAAVg/s81ocGoNFcc/s400/mandrill.png" /></a></div><br /><br />To load this PNG image into R, we'll use the <a href="https://cran.r-project.org/web/packages/png/index.html">PNG package</a>: <br /><pre class="prettyprint lang-r">library("png")<br /># download the mandrill image<br />if(!file.exists("mandrill.png")){<br /> download.file(url = "http://graphics.cs.williams.edu/data/images/mandrill.png", <br /> destfile="mandrill.png")<br />}<br /><br /># load the PNG into an RGB image object<br />mandrill = readPNG("mandrill.png")<br /><br /># This mandrill is 512 x 512 x 3 array<br />dim(mandrill)<br />## [1] 512 512 3<br /></pre>In R, an RGB image is represented as an n by m by 3 array. The last dimension of this array is the channel (1 for red, 2 for green, 3 for blue). Here's what the three RGB channels of the image look like:<br><br><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-kAtLMhm4lCM/VolnvzNgtdI/AAAAAAAAAVY/EMESzQKQdYo/s1600/mandrill_rgb.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-kAtLMhm4lCM/VolnvzNgtdI/AAAAAAAAAVY/EMESzQKQdYo/s640/mandrill_rgb.png" /></a></div> <br><br>Here are some ways to view image data directly from within R: <pre class="prettyprint lang-r">library("grid")<br />library("gridExtra")<br /><br />### EX 1: show the full RGB image<br />grid.raster(mandrill)<br /><br />### EX 2: show the B channel in gray scale representing pixel intensity<br />grid.raster(mandrill[,,3])<br /><br />### EX 3: show the 3 channels in separate images<br /># copy the image three times<br />mandrill.R = mandrill<br />mandrill.G = mandrill<br />mandrill.B = mandrill<br /><br /># zero out the non-contributing channels for each image copy<br />mandrill.R[,,2:3] = 0<br />mandrill.G[,,1]=0<br />mandrill.G[,,3]=0<br />mandrill.B[,,1:2]=0<br /><br /># build the image grid<br />img1 = rasterGrob(mandrill.R)<br />img2 = rasterGrob(mandrill.G)<br />img3 = rasterGrob(mandrill.B)<br />grid.arrange(img1, img2, img3, nrow=1)<br /></pre>Now let's segment this image. First, we need to reshape the array into a data frame with one row for each pixel and three columns for the RGB channels: <br /><pre class="prettyprint lang-r"><br /># reshape image into a data frame<br />df = data.frame(<br /> red = matrix(mandrill[,,1], ncol=1),<br /> green = matrix(mandrill[,,2], ncol=1),<br /> blue = matrix(mandrill[,,3], ncol=1)<br />)<br /></pre>Now, we apply k-means to our data frame. We'll choose k=4 to break the image into 4 color regions. <br /><pre class="prettyprint lang-r">### compute the k-means clustering<br />K = kmeans(df,4)<br />df$label = K$cluster<br /><br />### Replace the color of each pixel in the image with the mean <br />### R,G, and B values of the cluster in which the pixel resides:<br /><br /># get the coloring<br />colors = data.frame(<br /> label = 1:nrow(K$centers), <br /> R = K$centers[,"red"],<br /> G = K$centers[,"green"],<br /> B = K$centers[,"blue"]<br />)<br /><br /># merge color codes on to df<br /># IMPORTANT: we must maintain the original order of the df after the merge!<br />df$order = 1:nrow(df)<br />df = merge(df, colors)<br />df = df[order(df$order),]<br />df$order = NULL<br /></pre>Finally, we have to reshape our data frame back into an image: <br /><pre class="prettyprint lang-r"># get mean color channel values for each row of the df.<br />R = matrix(df$R, nrow=dim(mandrill)[1])<br />G = matrix(df$G, nrow=dim(mandrill)[1])<br />B = matrix(df$B, nrow=dim(mandrill)[1])<br /> <br /># reconstitute the segmented image in the same shape as the input image<br />mandrill.segmented = array(dim=dim(mandrill))<br />mandrill.segmented[,,1] = R<br />mandrill.segmented[,,2] = G<br />mandrill.segmented[,,3] = B<br /><br /># View the result<br />grid.raster(mandrill.segmented)<br /></pre>Here is our segmented image:<br><br><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-M3CGXqSUU-0/Volw4u4Nr0I/AAAAAAAAAWQ/KsNjV2p-9h8/s1600/mandrill_segmented.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-M3CGXqSUU-0/Volw4u4Nr0I/AAAAAAAAAWQ/KsNjV2p-9h8/s400/mandrill_segmented.png" /></a></div><br><br><h1>Color Space Plots in Two and Three Dimensions</h1>Color space is the three dimensional space formed by the three RGB channels. We can get a better understanding of color quantization by visualizing our images in color space. Here are animated 3d plots of the color space for the mandrill and the segmented mandrill:<br><br><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-kJT7OhWh2AM/VomT67ct2bI/AAAAAAAAAXA/VlDANEEdxnI/s1600/movie.gif" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="http://1.bp.blogspot.com/-kJT7OhWh2AM/VomT67ct2bI/AAAAAAAAAXA/VlDANEEdxnI/s1600/movie.gif" /></a><a href="http://3.bp.blogspot.com/-TK3u__5ym8I/VomTD-Sm-6I/AAAAAAAAAW4/tDHLwkUWlPE/s1600/movie.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-TK3u__5ym8I/VomTD-Sm-6I/AAAAAAAAAW4/tDHLwkUWlPE/s1600/movie.gif" /></a></div><br><br>These animations were generated with the help of the <a href="https://cran.r-project.org/web/packages/rgl/index.html">rgl package</a>: <br /><pre class="prettyprint lang-r">library("rgl")<br /># color space plot of mandrill<br />open3d()<br />plot3d(df$red, df$green, df$blue, <br /> col=rgb(df$red, df$green, df$blue),<br /> xlab="R", ylab="G", zlab="B",<br /> size=3, box=FALSE, axes=TRUE)<br />play3d( spin3d(axis=c(1,1,1), rpm=3), duration = 10 )<br /><br /># color space plot of segmented mandrill<br />open3d()<br />plot3d(df$red, df$green, df$blue, <br /> col=rgb(df$R, df$G, df$B),<br /> xlab="R", ylab="G", zlab="B",<br /> size=3, box=FALSE)<br />play3d( spin3d(axis=c(1,1,1), rpm=3), duration = 10 )<br /><br /># Use <br /># movie3d( spin3d(axis=c(1,1,1), rpm=3), duration = 10 )<br /># instead of play3d to generate GIFs (requires imagemagick).<br /></pre>To visualize color space in two dimensions, we can use <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">principle components analysis</a>. Principle components transforms the original RGB coordinate system into a new coordinate system UVW. In this system, the U coordinate captures as much of the variance in the original data as possible and the V coordinate captures as much of the variance as possible after factoring out U. So after performing PCA, most of the variation in the data should be visible by plotting in the UV plane. Here is the color space projection for the mandrill:<br><br><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-_WeK3GStyhE/Vome0UYJs_I/AAAAAAAAAYA/PwYLO9EGBwY/s1600/mandrill_cs.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-_WeK3GStyhE/Vome0UYJs_I/AAAAAAAAAYA/PwYLO9EGBwY/s1600/mandrill_cs.png" /></a></div> <br><br>and for the segmented mandrill:<br><br><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/--MfKa3p-4vg/Vome_A9JXTI/AAAAAAAAAYI/aEItLIT6EGE/s1600/mandrill_cs_segmented.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/--MfKa3p-4vg/Vome_A9JXTI/AAAAAAAAAYI/aEItLIT6EGE/s1600/mandrill_cs_segmented.png" /></a></div><br><br>Here is the code to generate these projections: <pre class="prettyprint lang-r">require("ggplot2")<br /><br /># perform PCA on the mandril data and add the uv coordinates to the dataframe<br />PCA = prcomp(df[,c("red","green","blue")], center=TRUE, scale=TRUE)<br />df$u = PCA$x[,1]<br />df$v = PCA$x[,2]<br /><br /># Inspect the PCA<br /># most of the cumulative proportion of variance in PC2 should be close to 1. <br />summary(PCA)<br /><br />#Importance of components:<br /># PC1 PC2 PC3<br />#Standard deviation 1.3903 0.9536 0.39695<br />#Proportion of Variance 0.6443 0.3031 0.05252<br />#Cumulative Proportion 0.6443 0.9475 1.00000<br /><br /># mandrill<br />ggplot(df, aes(x=u, y=v, col=rgb(red,green,blue))) + <br /> geom_point(size=2) + scale_color_identity()<br /><br /># segmented mandrill<br />ggplot(df, aes(x=u, y=v, col=rgb(R,G,B))) + <br /> geom_point(size=2) + scale_color_identity()<br /></pre>Ryan Walkerhttps://plus.google.com/112373077410906476606noreply@blogger.com6tag:blogger.com,1999:blog-1887031930241333662.post-55613475207345029562015-11-24T13:03:00.000-08:002016-11-15T08:01:47.792-08:00Building a Streaming Search Platform<style type="text/css"> .gist-file .gist-data {max-height: 500px;} </style>On average, Twitter users worldwide generate <a href="https://blog.twitter.com/2013/new-tweets-per-second-record-and-how">about 6,000 tweets per second</a>. Obviously, there is much interest in extracting real-time signal from this rich but noisy stream of data. More generally, there many open and interesting problems in using high-velocity streaming text sources to track real-time events. In this post, I describe the key components of a platform that will allow for near real-time search of a streaming text data source such as the Twitter firehose. <br /><br />Such a platform can have many applications far beyond monitoring Twitter. For example, a network of speech to text monitors could transcribe radio and television feeds and pass the transcriptions to the platform. When key phrases or features are found in the feeds, the platform could be configured to trigger real-time event management. This application is potentially relevant to financial, marketing, and other domains that depend on real-time information processing. <br /><br />All code for the platform I describe here can be found in my github project <a href="https://github.com/rwalk/straw">Straw</a>. The code base includes: <br /><ul><li> Complete Java-based <a href="http://storm.apache.org/">Storm</a> implementation, including both <a href="https://github.com/flaxsearch/luwak">Lucene-Luwak</a> and <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/search-percolate.html">Elasticsearch-Percolators</a> implementations of streaming search.</li><li> Scripts to automate AWS deployment using boto3</li><li> A local run mode enabling testing on a single machine using dockerized components</li><li> Benchmark utilities</li><li> A demo multiuser web interface where users register queries and receive streaming matches from a simulated twitter firehose</li></ul>I completed this project as a Fellow in the <a href="http://www.insightdataengineering.com/">Insight Data Engineering Program</a>. The original inspiration for for this project came from two excellent blog posts on streaming search: <br /><ul><li><a href="http://www.confluent.io/blog/real-time-full-text-search-with-luwak-and-samza/">Real-time full-text search with Luwak and Samza</a></li><li><a href="http://www.flax.co.uk/blog/2015/07/27/a-performance-comparison-of-streamed-search-implementations/">Elasticsearch Percolator & Luwak: a performance comparison of streamed search implementations</a></li></ul><h1>Streaming Search</h1>The key data structure for solving a traditional text search problem is an <i>inverted index</i> built from the collection of documents you want to be able to query. In its simplest form, an inverted index is just a map whose keys are the set of all unique terms in the documents. The value associated to a particular term in the map is a list of all the documents which use that term. <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-ECEnQBBrj14/VlNX2HqeLTI/AAAAAAAAATI/woLuTM-lRkA/s1600/inverted.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-ECEnQBBrj14/VlNX2HqeLTI/AAAAAAAAATI/woLuTM-lRkA/s1600/inverted.jpeg" /></a></div><br /><br />After the index has been built, users can submit queries to run against the index. For example, we can have a query that should return all the documents that contain both words in the phrase "llama pajamas". The query engine will split the input phrase into the tokens "llama" and "pajamas", then it will check the inverted index to get the list of all documents that contain the word "llamas" and the list of all documents that contain the word "pajamas". The engine will then return the intersection of these two lists, i.e. the list of the documents that are present in both lists. <br /><br />In the streaming case, documents arrive at a very fast rate (e.g. average of 6000 per second in the case of Twitter) and with this kind of velocity and volume it is impractical to build the inverted document index in real-time. Moreover, the goal is not to create a static index of tweets--rather it is to scan the tweets as they arrive in real-time and determine if they match a registered query. Here's where we can play a clever trick. Instead of building our inverted index from the documents, we can instead build the index from the queries themselves. <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-SzWBl1N3Ugo/VlMets_sf1I/AAAAAAAAASU/mz8fz2V5gAU/s1600/streaming-text.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-SzWBl1N3Ugo/VlMets_sf1I/AAAAAAAAASU/mz8fz2V5gAU/s1600/streaming-text.jpeg" /></a></div><br /><br />As a simple example, suppose a user wants to see all the tweets that contain the word "llama" and "pajamas". To add this query to the inverted index we would: <br /><ol><li> Create an identifier for the query, say "q1". </li><li> If "llama" is in the inverted index add "q1" to the list of keys at "llama". Otherwise, initialize "llama" in the index with a list containing "q1". </li><li> If "pajamas" is in the inverted index add "q1" to the list of keys at "pajamas". Otherwise, initialize "pajamas" in the index with a list containing "q1". </li></ol>As tweets arrive in the stream, the query engine will break the text into tokens and then query engine would return the intersection of all the list values whose key is a token in the inverted index. <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-Cx9NDMXl2GE/VlOR_4P0YlI/AAAAAAAAATw/zOQhN8ERXHo/s1600/inverted1.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-Cx9NDMXl2GE/VlOR_4P0YlI/AAAAAAAAATw/zOQhN8ERXHo/s1600/inverted1.jpeg" /></a></div><br><br>Fortunately, there are already several existing tools which can be used to build an inverted query index: <br /><ul><li> Elasticsearch <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/search-percolate.html">percolators</a> is a standard feature of Elasticsearch that allows us to index queries and "percolate" documents.</li><li> <a href="https://github.com/flaxsearch/luwak">Luwak</a> for Lucene is a Lucene module that uses significant pre-filtering to optimize matching against an inverted query index. Speed performance compared to percolators can be <a href="http://www.flax.co.uk/blog/2015/07/27/a-performance-comparison-of-streamed-search-implementations/">very significant</a>.</li></ul><h1>Architecture</h1>Now that we've got the basic tools for streaming search (Elasticsearch-Percolators or Lucene-Luwak), lets describe the architecture for the platform. The Straw platform is made up of the following components: <br/><ul><li>A streaming text source, such as the Twitter firehose, which emits a continuous stream of JSON documents</li><li>An <a href="https://kafka.apache.org/">Apache Kafka</a> cluster, which handles ingestion from the text stream</li><li>An <a href="http://storm.apache.org/">Apache Storm</a> cluster, which distributes computation across multiple search engine workers</li><li>A <a href="http://redis.io/">Redis</a> server, which provides a PUBSUB framework to collect and distribute matches to subscribed users</li><li>One or more clients, who submit queries and listen for matches on behalf of users</li></ul><br /><br /><a href="http://2.bp.blogspot.com/-vgJs0VbXTd8/VlORisyPuwI/AAAAAAAAATo/07S2DXVTZnA/s1600/arch.jpeg" imageanchor="1"><img border="0" src="http://2.bp.blogspot.com/-vgJs0VbXTd8/VlORisyPuwI/AAAAAAAAATo/07S2DXVTZnA/s1600/arch.jpeg" /></a><br><br> <h2>Streaming Sources</h2>The <a href="https://dev.twitter.com/streaming/overview">Twitter streaming API</a> does not offer access to the firehose without special permission. To see how Straw would perform under firehose level load, we can instead use the <a href="https://dev.twitter.com/streaming/reference/get/statuses/sample">sample endpoint</a> to collect a large corpus of tweets. We can either store these tweets in a file or alternatively send them directly to the Kafka cluster's documents topic:<br><br><script src="https://gist.github.com/rwalk/3ca1a2aac6af3a5ae635.js"></script><br><br> Alternatively, we can load tweets from a file into Kafka with a simple producer script:<br><br> <script src="https://gist.github.com/rwalk/1f0598b7e46b13bf9ca0.js"></script><br><br> To maintain a high load we can run multiple instances of this script and restart the script as soon as it finishes reading the file, using for example a <a href="https://www.digitalocean.com/community/tutorials/how-to-install-and-manage-supervisor-on-ubuntu-and-debian-vps">supervisor</a>. <br><br> Though the Straw project was designed for handling discrete JSON documents, by change the internal parsers it could be very easy to use other formats like XML. A more interesting challenge is handling continuous stream data, such as audio transcriptions. In this case, several strategies could be tried. For example, we could detect sentence breaks and treat each detected break as a separate document in the stream. <br><br> <h2>Kafka </h2>The Kafka cluster has two topics: documents and queries. The producer script above can be used to populate the documents topic. The frontend client populates the query topic with user subscriptions. In production, I found a 5 node Kafka cluster could easily accommodate Twitter level volume. For the documents topic, I used a partition factor of 5 and a replication factor of 2. While high availability is very important to accommodate the volume of the stream, document loss may not be a big concern. For queries, I used only 2 partitions with a replication factor of 3. Queries are infrequent so availability may not be important but query loss is not acceptable. Note that the partition factor should be less than or equal to the number of KafkaSpouts in our Storm topology, since each spout will consume from exactly one partition. <br><br>One other important Kafka configuration is in kafka.server.properties: <pre class="prettyprint lang-bash"><br /># The minimum age of a log file to be eligible for deletion<br />log.retention.hours=1<br /></pre>The Kafka default is 168 hours--far too big since you can easily fill a modestly sized disk under load. As messages should ideally be consumed in real-time, I recommend using the minimum value which is 1 hour. Note, however, that you may still need to ensure that you have a sufficiently large volume for the Kafka log. In production, I gave each Kafka node a 64GB volume with a 1 hour retention. <br> <br><h2>Storm </h2>The Storm topology implements KafkaSpouts for the documents and queries topics. In production, I used 5 document spouts and 3 query spouts (consistent with the Kafka partitioning). The bolts in the topology search the document stream and publish any matches to Redis. In production, I allocated a total of 6 workers. Sizing the cluster correctly proved to be somewhat challenging. I highly recommend this <a href="http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/">post</a> which explains the key concepts of Storm parallelism. Also the Storm built-in UI can be helpful for monitoring and understanding how the cluster is performing. <br><br> <div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-cHhEWQip2cw/VlMethD_H5I/AAAAAAAAASo/LobOA3Scn0g/s1600/topology1.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-cHhEWQip2cw/VlMethD_H5I/AAAAAAAAASo/LobOA3Scn0g/s1600/topology1.jpeg" /></a></div><br><br> In the most basic scenario, we assume that the number of queries is small and can fit easily into memory on a single machine. Then scaling to the volume of the stream is quite easy. The idea is to give each bolt a complete copy of the in memory Lucene-Luwak index (remember that the queries are what's being indexed here). So each time a user registers a new query, we must <i>broadcast</i> it to all of the bolts in the topology to maintain the local query index. When a document arrives from the stream, we can then randomly assign it to any bolt since each bolt has a full copy of the query index. To handle failover, we can also keep a global copy of the all the queries, so that if a bolt dies we can replace it with a new one and populate its index from the global store. This Java snippet defines such a topology: <pre class="prettyprint lang-java"><br />TopologyBuilder builder = new TopologyBuilder();<br />builder.setSpout("query-spout", new KafkaSpout(query_spout_config), 3);<br />builder.setSpout("document-spout", new KafkaSpout(document_spout_config), 5);<br />builder.setBolt("search-bolt", new LuwakSearchBolt(), 5)<br /> .allGrouping("query-spout")<br /> .shuffleGrouping("document-spout");<br /></pre><br><br>Since this platform is intended to be multiuser and multitenant, we can easily imagine a situation where the number of queries can't practically fit in memory on a single bolt. In this case, we can add another layer of bolts to the Storm topology: <br><br><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-Wr9e5bF4VrI/VlMet5fc3qI/AAAAAAAAASY/IfutokuVFF4/s1600/topology2.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-Wr9e5bF4VrI/VlMet5fc3qI/AAAAAAAAASY/IfutokuVFF4/s1600/topology2.jpeg" /></a></div><br><br>Here the complete query index is partitioned across a small cluster of bolts. Incoming queries are broadcast to the fan bolts. Each fan bolt will then randomly choose one Lucene worker to index that query. Documents from the stream can be shuffled among the fan bolts. Each fan bolt must then broadcast the document so that each Lucene bolt can check the document against its partition of the index. <br><br>If we use Percolators instead of Luwak then each bolt contains an Elasticsearch client. In this case, it is a good idea to collocate the Elasticsearch cluster with the search bolts and to use high replication so as to minimize network overhead. Note that Percolator queries are also stored in-memory, so we still face difficulties as the number of queries becomes large. <br><br> <h2>Redis </h2>Redis is most commonly used as an in-memory application cache, but it also has a simple and elegant publish-subscribe framework. Here's an example of pubsub using the Redis-cli: <br><br>In a terminal A, listeners subscribe to a topic: <pre class="prettyprint lang-bash"><br />127.0.0.1:6379> SUBSCRIBE "llama-topic"<br /></pre>In a separate terminal B, the publisher publishes to the topic: <pre class="prettyprint lang-bash"><br />127.0.0.1:6379> PUBLISH "llama-topic" "llamas love to wear pajamas"<br /></pre>Back in terminal A, the subscriber receives the message: <pre class="prettyprint lang-bash"><br />1) "message"<br />2) "llama-topic"<br />3) "llamas love to wear pajamas"<br /></pre>That's all there is to it. All standard Redis clients expose an API to interact with the PUBSUB framework. <br><br>When a user registers a query in the Straw platform, here's what happens: <ol><li>The client passes the query to the Kafka queries topic.</li><li>The client computes the MD5 hash of the query which will be the ID for the query.</li><li>The client subscribes the user to the computed ID in Redis PUBSUB. </li><li>The Storm cluster receives the query from the Kafka spout and broadcasts it to the Lucene bolts</li><li>Each bolt computes the MD5 hash of the query and registers the query with Luwak using the hash as the query ID</li><li>When a bolt receives a document, it uses Luwak to check if the document matches any query in the index. If Luwak finds a match, it will return one or matching IDs. For each ID returned by Luwak, the bolt will use Redis PUBSUB to publish the original document using the ID as the topic.</li><li>Subscribed clients will receive documents as they are published to Redis. </li></ol>Using the hash as the query ID allows two or more users to subscribe to the same query while only needing to actually index a single query. <br><br><h2>Clients</h2>A client for Straw has the following duties: <ol><li>Manage users. In particular, it must keep track of which users have subscribed to which queries</li><li>Push user queries to Kafka and subscribe to queries in Redis</li><li>Listen for responses from queries</li></ol>The Straw platform comes packaged with a default client which is a simple Python Flask webserver. The webserver is sessionized so that users can follow particular queries. The server implements a basic Kafka producer to publish queries to Kafka and Redis keeps track of the list of subscribed query IDs for each user. The listening is handled by a single background thread that holds a Redis client subscribed to all unique queries across the entire set of active users. When a query ID and document pair are found, the background thread queries Redis to find which users are subscribed to that query ID. It will then copy the document text to a result pool for each subscribed user. The user interface will checks the user's pool for updates every half-second so that results stream into the console. Here is a video of UI in action: <br><br><iframe width="640" height="360" src="https://www.youtube.com/embed/62uOmnYVzH8?feature=player_embedded" frameborder="0" allowfullscreen></iframe><br><br> <h1>Benchmarks and Conclusions</h1>One goal of the Straw project was to compare and measure performance of Elasticsearch-Percolators vs. Lucene-Luwak. Measuring this performance is not completely straightforward. I used the following very basic approach to measuring throughput: <ul><li> Fill Kafka's documents topic with a very large number of documents</li><li> Add a fixed number of reasonably complex queries to the Kafka query topic</li><li> Start the Kafka cluster </li> <li> Each worker Bolt has a counter and a stopwatch running in a background thread</li><li> Each time a document is passed to Lucene and response (empty or non-empty) is recieved, increment the counter </li><li> When the stopwatch reaches 10 seconds, publish the value of the counter to a special Redis topic e.g. "benchmark". Set the counter to 0 and restart the stopwatch </li></ul>By monitoring the benchmark channel in Redis, we can then track the search throughput of the system. Pictured below are density plots for estimated total throughput per second obtained by running this procedure for several hours: <div class="separator" style="clear: both; text-align: center;"><br><br><a href="http://4.bp.blogspot.com/-tdWqR_9Zmxk/VlMetYef5lI/AAAAAAAAASI/xUcawCI4AHs/s1600/benchmarks1.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-tdWqR_9Zmxk/VlMetYef5lI/AAAAAAAAASI/xUcawCI4AHs/s1600/benchmarks1.jpeg" /></a></div><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-P0IIoFmJTu4/VlMetXcR7gI/AAAAAAAAASM/0aHCvyu37JM/s1600/benchmarks2.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-P0IIoFmJTu4/VlMetXcR7gI/AAAAAAAAASM/0aHCvyu37JM/s1600/benchmarks2.jpeg" /></a></div><br><br>Some comments and conclusion about these preliminary estimates are in order: <ul><li>In both cases, Lucene-Luwak strongly outperforms Elasticsearch-Percolators. However, the Elasticsearch cluster I used was not especially optimized for this experiment. I suspect that a portion of the differential would disappear if more effort was made to localize the Elasticsearch index to the search bolts</li><li>As the number of queries increases we see significant reduction in the throughput. It would be very interesting to see if the fan bolt solution described above would improve this performance</li><li>The variance of throughput is very high, particularly for Luwak</li><li>In the small query case, we are easily accommodating average twitter level volume; for large queries we are close and could likely scale horizontally to obtain a solution</li><li>The queries used here are available in the <a href="https://github.com/rwalk/straw">straw</a> repository. I generated these by computing bigram frequency across a sample of 40M English language tweets and keeping the most frequent bigrams. It would be interesting to evaluate performance with more complex queries </li><li>The documents here are tweets which are limited to 140 characters. It would be interesting to evaluate performance with longer text sources</li></ul>Ryan Walkerhttps://plus.google.com/112373077410906476606noreply@blogger.com1tag:blogger.com,1999:blog-1887031930241333662.post-21445629623637474582015-03-08T18:06:00.001-07:002016-11-15T08:01:10.706-08:00Sparse Quadratic Programming with Ipoptr<style type="text/css"> .gist-file .gist-data {max-height: 500px;} </style>This post is a follow up to my <a href="http://quantitate.blogspot.com/2015/02/more-on-quadratic-progamming-in-r.html">last post</a> on quadratic programming facilities in R. A commenter pointed me to the <a href="http://www.ucl.ac.uk/~uctpjyy/ipoptr.html">ipoptr project</a> which exposes an R interface to the <a href="https://projects.coin-or.org/Ipopt">COIN-OR optimization routine Ipopt</a>. COIN-OR is a suite of optimization utilities implemented in C++ and supported by a back-end of configurable FORTRAN linear system solvers. ipoptr may be a good solution for users wishing to solve sparse nonlinear constrained optimization problems through an R frontend. Some highlights of this solver are: <ol><li> It is a general interior point solver that can handle nonlinear objectives with nonlinear constraints. In particular, no convexity assumptions are required to obtain local solutions. </li><li> It offers a flexible and lightweight specification of the objective function and a sparse matrix representation for the constraints and other problem data. </li></ol>In this post, I'll explain how ipoptr can be applied to solve quadratic programs and I'll compare the performance of this solver to other quadratic program solvers (<a href="http://cran.r-project.org/web/packages/quadprog/">quadprog</a>, <a href="http://cran.r-project.org/web/packages/kernlab/index.html">ipop</a>) available in R. We'll see that ipoptr is very fast and efficient on large sparse quadratic programs, seemingly an order of magnitude faster than quadprog on the demonstration problem considered in my previous post. Because the Ipopt backend is a bit tricky to install, the last section provides a detailed overview of how I successfully built this solver under Ubuntu Linux. <br><br> <h1>The ipoptr interface</h1>Before applying ipoptr to quadratic programming, it may be helpful to present a high-level picture of the ipoptr interface. The following is a summary of the ipoptr interface drawn from the <a href="http://www.ucl.ac.uk/~uctpjyy/downloads/ipoptr.pdf">documentation</a>. ipoptr stages and invokes an interior point algorithm from within R to find a local solution of the following constrained optimization problem: $$ \begin{aligned} \underset{\alpha \in \mathbb{R}^n}{\text{Minimize}}: \qquad & f(x) \\ \text{Subject to:} \qquad & g_L \leq g(x) \leq g_U \\ & x_L \leq x \leq x_U \end{aligned}$$ where: <ul><li> $f:\mathbb{R}^n \to \mathbb{R}$ the objective function is twice continuously differentiable </li><li> $g:\mathbb{R}^n \to \mathbb{R}^m$ defines the constraint set and is twice continuously differentiable </li><li> $x_U$, and $x_L$ are fixed vectors in $\mathbb{R}^n$</li><li> $g_U$, and $g_L$ are fixed vectors in $\mathbb{R}^m$</li></ul>Note that the solver does not need to make any assumptions about the convexity of either $f$ or $g$. The solver makes use of the following ingredients to find a solution: <ol><li> An initial guess $x_0$ for the solution </li><li> The objective function gradient $\nabla f$ </li><li> the Jacobian of the constraint map $J_g$ </li><li> Optionally, the Hessian of the problem Lagrangian in the form: $$H(x, \sigma_f, \vec{\lambda}) = \sigma_f \nabla^2 f(x) + \sum_{i=1}^m \lambda_i \nabla^2 g_i(x).$$ </li></ol>When $n$ is large, the dense matrix representations of $g$, $\nabla f$, and $J_g$ will have a substantial footprint. In application problems, however, the Jacobian and/or the Hessian object will often be sparse. For efficiency, ipoptr utilizes a sparse matrix representation for $H$ and $J_g$. This format is defined as follows. Let $A_{\mathrm{values}}$ be the list of all non-zero values in $A$ (read into the list from left to right along the rows of $A$). Then let $A_{\mathrm{mask}}$ be a list of lists, where the list at position $i$ in $A_{\mathrm{mask}}$ provides the column indexes of the nonzero elements in the $i$-th row of $A$. Then $(A_{\mathrm{values}}$, $A_{\mathrm{mask}})$ is a sparse representation of $A$.<br><br> Generally, it could be quite difficult to come up with the explicit representations of the Jacobian and Hessian that ipoptr consumes. Although the Hessian is an optional argument, including it can significantly improve convergence behaviour (try taking the Hessian arguments out in the examples below). For a quadratic program, however, it is simple to compute the Jacobian and Hessian directly. We'll do this in the next section and then provide an R wrapper function that can transform any symmetric positive definite quadratic program in standard matrix form into a quadratic program that can be solved by ipoptr. <br><br> <h1>Solving Quadratic Programs with ipoptr </h1>Consider the quadratic program: $$ \begin{aligned} \underset{\alpha \in \mathbb{R}^n}{\text{Minimize}}: \qquad & f(x) = -d^Tx + \frac{1}{2}x^T D x \\ \text{Subject to:} \qquad & g(x) = Ax \leq b \end{aligned}$$ where <ul><li> $D$ is an $n \times n$ matrix (which we'll assume is symmetric) </li><li> $A$ is an $m \times n$ matrix. </li><li> $d$ is a fixed vector in $\mathbb{R}^n$ and $b$ is a fixed vector in $\mathbb{R}^m$. </li></ul> Then using some <a href="http://en.wikipedia.org/wiki/Matrix_calculus#Identities">basic identities</a> of vector calculus and the symmetry of $D$: $$ \begin{aligned} \nabla f(x) &= -d + \frac{1}{2}(D^T + D) x = -d + Dx \\ \nabla^2 f(x) &= D \\ J_g(x) &= A^T \\ \nabla^2 g_i(x) &= 0 \mbox{ for all } i \\ H(x, \sigma_f, \vec{\lambda}) &= \sigma_f D. \end{aligned} $$ Because $D$ is positive definite, the global minimizer $\tilde{x}$ of $f$ can be computed as: $$\nabla f(\tilde{x}) = 0 \rightarrow \tilde{x} = D^{-1} d.$$ <a href="http://www.javaquant.net/papers/goldfarbidnani.pdf">Goldfarb and Idnani</a>, whose QP algorithm is implemented in the quadprog package, use the global minimizer $\tilde{x}$ as the initial point for their interior point primal/dual method. This choice is based on empirical evidence cited by the authors that this selection can substantially reduce the number of iterations required to find a solution. We'll follow suit with ipoptr by setting $x_0 =\tilde{x}$ for our initial guess. <br><br>The following R function uses these facts to solve a quadratic program in standard form with ipoptr: <br><br><script src="https://gist.github.com/rwalk/8f3a38064da2876acb8a.js"></script><br><br> <h1>Performance Comparison</h1>In my previous post, I <a href="http://quantitate.blogspot.com/">compared the performance of quadprog and kernlab's ipop solvers</a> on the <a href="http://quantitate.blogspot.com/2014/04/the-circus-tent-problem-with-rs-quadprog.html">circus tent demonstration problem</a>. For this post, I repeated the same experiment with the ipoptr solver and the results were very impressive. Ipoptr was <i>substantially faster</i> at solving this sparse demonstration problem than quadprog. Here are the solve times I obtained: <br><br> <a href="http://3.bp.blogspot.com/-57kX7LVYfrk/VPFNDuNrIcI/AAAAAAAAALQ/Bj1tmGMuo6I/s1600/qptimes2.png" imageanchor="1" ><img border="0" src="http://3.bp.blogspot.com/-57kX7LVYfrk/VPFNDuNrIcI/AAAAAAAAALQ/Bj1tmGMuo6I/s640/qptimes2.png" /></a><br><br><pre class="prettyprint lang-r"><br /> problem.size quadprog.time ipop.time ipoptr.time<br />1 64 0.002 0.024 0.010<br />2 256 0.048 0.710 0.034<br />3 576 0.518 4.759 0.126<br />4 1024 2.903 23.430 0.414<br />5 1600 10.815 77.708 0.855<br />6 2304 31.899 232.355 1.819<br />7 3136 79.749 544.193 3.318<br />8 4096 176.743 1521.225 8.297<br />9 5184 354.839 2543.754 9.039<br />10 6400 664.804 5053.922 19.551<br /></pre><br><br>Here is the experiment code:<br><br><script src="https://gist.github.com/rwalk/33f40a385cdef1c3812e.js"></script><br><br>It's not so surprising that the ipoptr package is substantially faster in this example because the matrix Dmat that defines the objective function for the tent problem is very sparse. Quadprog requires dense matrix representations to describe the QP that we wish to solve, while in ipoptr we need only specify the functional form the quadratic objective function. Here is the plot of Dmat's sparsity structure: <br><br><a href="http://2.bp.blogspot.com/-WrO96DZQ6vQ/VPzVo2ex3uI/AAAAAAAAAMM/F8khH-UQqgo/s1600/laplacian.png" imageanchor="1" ><img border="0" src="http://2.bp.blogspot.com/-WrO96DZQ6vQ/VPzVo2ex3uI/AAAAAAAAAMM/F8khH-UQqgo/s640/laplacian.png" /></a><br><br> To investigate the performance of ipoptr on dense QPs, I generated some random positive definite QPs of differing sizes and compared solve times against the solver from quadprog and the ipop solver from the kernlab package. In this dense problem case, quadprog appears to have a significant advantage over ipoptr. Still, ipoptr shows solid performance and is considerably faster than the pure R implementation of the interior point solver found in kernlab's ipop. <br><br><a href="http://3.bp.blogspot.com/-V7Npf6_SIi8/VPzOVwjr__I/AAAAAAAAAL0/OSXXTK70p7g/s1600/Rplot04.png" imageanchor="1" ><img border="0" src="http://3.bp.blogspot.com/-V7Npf6_SIi8/VPzOVwjr__I/AAAAAAAAAL0/OSXXTK70p7g/s640/Rplot04.png" /></a><br><br><pre class="prettyprint lang-r"><br /> problem.size quadprog.time ipop.time ipoptr.time<br />1 500 0.068 1.942 0.294<br />2 1500 0.909 35.352 5.374<br />3 2500 4.598 155.138 23.160<br />4 3500 13.186 414.762 60.888<br />5 4500 28.419 868.141 126.278<br />6 5500 49.608 1570.674 226.914<br />7 6500 81.756 2574.962 369.469<br />8 7500 128.451 3937.629 563.236<br /></pre><br><br>Here is the code for the randomly generated QP experiment: <script src="https://gist.github.com/rwalk/666bde369571568b2fb9.js"></script><br><br><h1>Conclusions and Comments</h1>In this post, we looked at how the ipoptr interface can be used to solve quadratic programming problems and compared the Ipopt solver to other quadratic program solvers available in R. We found: <ul><li> ipoptr is very efficient and well-suited for solving large, sparse quadratic programs, even outperforming the quadprog package solver in this case. </li><li> However, for dense symmetric positive definite QPs, quadprog is much faster than ipoptr. </li><li> Both solvers are substantially faster than the pure R interior point solver ipop from the kernlab package. </li></ul>Some additional remarks: <ul><li> For the experiments in this post, Ipopt was configured to use the <a href="http://mumps.enseeiht.fr/">MUMPS</a> linear solver. It is possible to configure Ipopt to use other linear solvers and this could have an impact on performance. </li><li> Initializing Ipopt with $x_0$ as the global minimum of the objective function (as done internally in the quadprog package) seems to be somewhat beneficial for reducing runtimes. This effect should be investigated more carefully, however.</li><li> The timings presented above do not include the time required to translate from a QP specified in matrix format (e.g. Dmat, Amat, etc.) to the input format required by ipoptr. In practice, however, this step is rather expensive and users will probably want to generate the ipoptr input data directly instead of forming dense matrix inputs and then converting to the ipoptr format. The specifics of how this should be done will depend on the particular problem which the user is trying to solve. </li></ul><br><br> <h1>Installing ipoptr for Ubuntu Linux</h1>Because of complex licensing issues in COIN-OR suite and its many dependencies, it's not possible to bundle the backend ipopt solver and its R interface into a simple package installable directly from R. Thus, installing ipoptr won't be as easy as installing most other R pacakges. Fortunately, the <a href="http://www.ucl.ac.uk/~uctpjyy/downloads/ipoptr.pdf">CoinOpt documentation</a> and the <a href="http://www.ucl.ac.uk/~uctpjyy/downloads/ipoptr.pdf">R interface documentation</a> are fairly complete. <br> <br>After a bit of work, I was able to install everything on a Ubuntu-Linux box using the following steps. <pre class="prettyprint lang-sh"><br /># Change into directory where you want to put the ipoptr source and build path<br />cd ~/projects/<br /><br /># install dependencies, including latest blas and lapack.<br />sudo apt-get update<br />sudo apt-get install gcc g++ gfortran subversion patch wget<br />sudo apt-get install libblas3 liblapack3<br /><br /># get the project -- note, will need to accept SSL cert<br />svn co https://projects.coin-or.org/svn/Ipopt/stable/3.12 CoinIpopt<br /><br /># get and build Mumps linear solver<br />cd CoinIpopt/ThirdParty/Mumps<br />./get.Mumps<br />cd ../..<br /><br /># build and install Ipopt; it's a good idea to inspect the output of make test<br /># With this setup, AMPL test will fail (ok) but all the other tests should pass.<br />mkdir build<br />cd build<br />../configure<br />make<br />make test<br />make install<br /><br /># Move RInterface Makevars from build to src directory<br />cp Ipopt/contrib/RInterface/src/Makevars ../Ipopt/contrib/RInterface/src<br /></pre>Now, we can fire up R and install the package from source using: <pre class="prettyprint lang-r"><br />install.packages("~/projects/CoinIpopt/Ipopt/contrib/RInterface", <br /> repos=NULL, type="source")<br /></pre> Ryan Walkerhttps://plus.google.com/112373077410906476606noreply@blogger.com4tag:blogger.com,1999:blog-1887031930241333662.post-72872136666753232142015-02-10T21:04:00.001-08:002016-11-15T08:02:26.984-08:00More on Quadratic Programming in R<style type="text/css"> .gist-file .gist-data {max-height: 500px;} </style>This post is <a href="http://quantitate.blogspot.com/2014/01/solving-quadratic-progams-with-rs.html">another</a> tour of <a href="http://en.wikipedia.org/wiki/Quadratic_programming">quadratic programming</a> algorithms and applications in R. First, we look at the quadratic program that lies at the heart of <a href="http://en.wikipedia.org/wiki/Support_vector_machine#Dual_form">support vector machine (SVM) classification</a>. Then we'll look at a very different quadratic programming demo problem that models the energy of a <a href="http://quantitate.blogspot.com/2014/04/the-circus-tent-problem-with-rs-quadprog.html">circus tent</a>. The key difference between these two problems is that the energy minimization problem has a positive definite system matrix whereas the SVM problem has only a semi-definite one. This distinction has important implications when it comes to choosing a quadratic program solver and we'll do some solver benchmarking to further illustrate this issue. <br><br> <h1> QP and SVM </h1>Let's consider the following very simple version of the SVM problem. Suppose we have observed $y_i \in \{-1,1\}$, $X_i \in \mathbb{R}^{m}$ for $n$ (perfectly linearly separable) training cases. We let $y$ denote the $n \times 1$ vector of training labels and $X$ the $n \times m$ matrix of predictor variables. Our task is to find the hyperplane in $\mathbb{R}^m$ which "best separates" our two classes of labels $-1$ and $1$. Visually: <div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-5fGV1UxuApo/VNbnwKvVngI/AAAAAAAAAJE/wwHQfc9uTiQ/s1600/3dsvm.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-5fGV1UxuApo/VNbnwKvVngI/AAAAAAAAAJE/wwHQfc9uTiQ/s640/3dsvm.png" /></a></div><pre class="prettyprint lang-r"><br />library("e1071")<br />library("rgl")<br />train <- iris<br />train$y <-ifelse(train[,5]=="setosa", 1, -1)<br />sv <- svm(y~Petal.Length+Petal.Width+Sepal.Length, <br /> data=train, kernel="linear", scale=FALSE, type="C-classification")<br />W <- rowSums(sapply(1:length(sv$coefs), function(i) sv$coefs[i]*sv$SV[i,]))<br />plot3d(train$Petal.Length, train$Petal.Width, <br /> train$Sepal.Length, col= ifelse(train$y==-1,"red","blue"), <br /> size = 2, type='s', alpha = .6)<br />rgl.planes(a = W[1], b=W[2], c=W[3], d=-sv$rho, color="gray", alpha=.4)<br /></pre><br> The problem of finding this optimal hyperplane is a quadratic program (<a href="http://en.wikipedia.org/wiki/Support_vector_machine#Primal_form">the primal SVM problem</a>). For computational reasons however, it is much easier to work with a related quadratic program, the <a href="http://en.wikipedia.org/wiki/Support_vector_machine#Dual_form">SVM dual problem</a>. With a general kernel function, this quadratic program is: $$ \begin{aligned} \underset{\alpha \in \mathbb{R}^n}{\text{Maximize}}: \qquad & L\left( \alpha \right) = \sum_{i=1}^n \alpha_{i} - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \left( y_i k\left(\mathbf{X}_i, \mathbf{X}_j \right) y_j \right) \alpha_j \\ \text{Subject to:} \qquad & 0 \leq \alpha_i \leq C. \end{aligned}$$ In the simple case of a linear kernel, $k(\mathbf{x}, \mathbf{y}) = \mathbf{x} \cdot \mathbf{y}$ and so we can rewrite $L$ in matrix notation as \[ L(\alpha) = \mathbf{1} \cdot \alpha - \alpha^T Q^TQ \alpha\] with \[ Q = \left[ \begin{array}{ccc} y_1 (\mathbf{X}^T)_1 \; | & \ldots & | \; y_n (\mathbf{X}^T)_n \end{array}\right] \] where $(\mathbf{X}^T)_i$ denotes the $i$-th column of $\mathbf{X}^T$.<br><br> It is important to note that the matrix $A = Q^TQ$ is symmetric positive semi-definite since $A^T = A$ and for any nonzero $x$ in $\mathbb{R}^n$: \[x^TAx = x^TQ^T Q x = (Qx)^T (Qx) = \|Qx\|_2^2 \geq 0.\] However, in the context of SVM the matrix $A$ will usually not be positive definite. In particular, the rank of $X^T$ is at most $m$ (number of features) which is typically much less than the number of observations $n$. By the Rank-Nullity theorem, the nullspace of $Q$ has dimension $n-m$ which we expect to be quite large. So intuitively quite a few non-zero $x$ map to $0$ under $Q$ and so the above inequality cannot be strengthened to a strict inequality. <br><br> The fact that the matrix $A$ is only semi-definite and not positive-definite is significant because many quadratic programming algorithms are specialized to solve only positive definite problems. For example, R's quadprog handles only positive definite problems, whereas solvers like kernlab's ipop method can handle semidefinite problems. In the specialized semidefinite case of SVM, many highly optimized algorithms exist (for example, the algorithms implemented in libsvm and liblinear). In the following gist, we solve a separable SVM problem for Fisher's iris data in three ways: <ol><li> Using the e1071 wrapper around libsvm. </li><li> Using the semi-definite ipop solver from kernlab. Note that this general interior point solver is implemented in R and it can be quite slow when applied to larger scale problems. </li><li> Using quadprog's positive definite solver with a slight perturbance to the SVM data so that the system matrix becomes positive definite. Quadprog is a wrapper around an interior point solver implemented in Fortran.</li></ol>Note that only the first method is recommended for solving SVM problems in real life. The second and third methods are only included for the sake of the demonstrating the mechanics of quadratic programming. <script src="https://gist.github.com/rwalk/64f1365f7a2470c498a4.js"></script><br>We see that all three methods generate quite similar solutions for this highly stable and quite simple problem. <br><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-Ub7WjuQfK6U/VNblIBaOYUI/AAAAAAAAAIg/hLa-U8lBfjo/s1600/Rplot07.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-Ub7WjuQfK6U/VNblIBaOYUI/AAAAAAAAAIg/hLa-U8lBfjo/s640/Rplot07.png" /></a></div><br><br> As another example of the SVM technique, here is a minimal example that uses SVM to classify topics in the Reuters21578 corpus. <script src="https://gist.github.com/rwalk/99265561766015a153fc.js"></script> <br> <h1> The Circus Tent Revisited</h1>In a previous post, I explained how the solution of a <a href="http://quantitate.blogspot.com/2014/04/the-circus-tent-problem-with-rs-quadprog.html">quadratic program can model the shape of a circus tent</a>. <a href="http://1.bp.blogspot.com/-Q3KVWg_as68/VNhQl25eAaI/AAAAAAAAAJ4/JzIQRxPXOK4/s1600/pretent.png" imageanchor="1" ><img border="0" src="http://1.bp.blogspot.com/-Q3KVWg_as68/VNhQl25eAaI/AAAAAAAAAJ4/JzIQRxPXOK4/s320/pretent.png" /></a><a href="http://4.bp.blogspot.com/-7SHb0T1CGNM/VNhQlzPPDWI/AAAAAAAAAJ8/BCcBLHE1s5I/s1600/post.png" imageanchor="1" ><img border="0" src="http://4.bp.blogspot.com/-7SHb0T1CGNM/VNhQlzPPDWI/AAAAAAAAAJ8/BCcBLHE1s5I/s320/post.png" /></a><br> The system matrix in the circus tent problem is symmetric positive definite and therefore an ideal candidate for the quadprog solver. In the gist below, we build a series of increasingly larger circus tent problems and use this to profile the solver times of ipop and quadprog.<br> <a href="http://1.bp.blogspot.com/-FW--wQAMTiw/VNhTRxQf-tI/AAAAAAAAAKM/fDqtHby3Y_k/s1600/bigtent.png" imageanchor="1" ><img border="0" src="http://1.bp.blogspot.com/-FW--wQAMTiw/VNhTRxQf-tI/AAAAAAAAAKM/fDqtHby3Y_k/s320/bigtent.png" /></a><a href="http://2.bp.blogspot.com/-L1yVQlgVaSQ/VNhTR-RnZ2I/AAAAAAAAAKQ/p1EpMjJ8eUI/s1600/bigtentpost.png" imageanchor="1" ><img border="0" src="http://2.bp.blogspot.com/-L1yVQlgVaSQ/VNhTR-RnZ2I/AAAAAAAAAKQ/p1EpMjJ8eUI/s320/bigtentpost.png" /></a><br> <script src="https://gist.github.com/rwalk/f56362eb308a1a54c7c2.js"></script><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-cZbHUl4UzEg/VNrhGwJ4h3I/AAAAAAAAAKs/3Xsz7EJ3sf8/s1600/solver_times.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-cZbHUl4UzEg/VNrhGwJ4h3I/AAAAAAAAAKs/3Xsz7EJ3sf8/s640/solver_times.png" /></a></div> From this simple experiment we can make the following observations: <ol><li> For this symmetric positive definite problem, quadprog's solver is significantly faster than ipop. This is not surprising given that quadprog is calling compiled Fortran routines whereas ipop is an R implementation. </li><li> The time complexity for both solvers is superlinear in the problem size. Roughly, both solvers appear to be nearly cubic in problem size.</li><li> Even though both the system and constraint matrices are sparse, neither solver is able to take advantage of sparse matrix representations. In a pinch, it's worth noting that a little memory can be saved by using quadprog's solve.QP.compact which utilizes a more compact representation of the constraint matrix $\mathrm{Amat}$. </ol>Ryan Walkerhttps://plus.google.com/112373077410906476606noreply@blogger.com5tag:blogger.com,1999:blog-1887031930241333662.post-71501581146303382382014-10-19T18:32:00.001-07:002016-11-15T08:02:46.374-08:00Voice control your Roku player with PythonIn this post, I describe a simple way to implement a voice controlled Roku streaming player 'remote control' over a telnet connection with Python. <br> <br><b><font size="+1"> <a href="https://github.com/rwalk/RokuTelnetWithVoiceControl"">The complete code (about 100 lines) is available on github.</a> </font></b><br><br><h1> Roku on Telnet</h1>Roku streaming players offer an interesting <a href="http://forums.roku.com/viewtopic.php?t=20106">unofficial control interface</a> over a telnet connection on the local network. To connect to Roku over telnet: <ol><li> Put Roku in development mode by pressing the following button sequence on your Roku remote: <b>Home (three times), up (twice), right, left, right, left, right.</b></li><li> Follow the on-screen instructions and Roku will tell you its IP address: e.g. 123.456.7.8.90</li><li> Using any computer on the same network as Roku, open a terminal and telnet into Roku's port 8080: <pre><br />telnet 123.456.7.8.90 8080<br /></pre> </li></li> If all goes well, you should see Roku's device ID, ethernet and wifi MAC addresses, and command prompt printed on the console: <pre><br />Trying 123.456.7.8.90...<br />Connected to 123.456.7.8.90.<br />Escape character is '^]'.<br />A01234567890<br />ETHMAC 00:00:00:00:00:0a<br />WIFIMAC 00:00:00:00:00:0b<br />><br /></pre></li><li> From the telnet command line, we can now type simple Roku control commands like 'press play' and 'press right'. </li></ol>I successfully telneted into Roku SD, Roku HD, and Roku 3 players. <ul><li> Supported commands include: right, left, up, down, select, pause, play, forward, reverse </li><li> All models support abbreviated commands, e.g. "press r" for "press right".</li><li> Roku3 requires abbreviated commands. Moreover, Roku3 can accept multiple commands per turn such as "press dls" (down, left, select). Roku3 also appears to support some additional button presses such as "press a" and "press c".</li></ul><h1> Adding Voice Control </h1>To have some more fun with this feature, I used Python's <a href="https://pypi.python.org/pypi/SpeechRecognition/">SpeechRecognition</a> library to develop a simple voice enabled interface for Roku. The SpeechRecognition module is a Python wrapper around Google's speech recognition API. The module uses PyAudio to record wav files from the microphone: <pre class="prettyprint lang-python"><br />import speech_recognition as sr<br />recognizer = sr.Recognizer()<br />with sr.Microphone() as source: <br /> audio = recognizer.listen(source, timeout=2)<br /></pre>These files are then streamed to Google's recognition API for transcription: <pre class="prettyprint lang-python"><br />command = recognizer.recognize(audio)<br /></pre>If recognition is successful, SpeechRecognition should return a string transcription. We can then process this transcription to control the Roku. To manage the processing of transcriptions or "concept mapping", we use a simple approach based on keywords. We define a concept class to hold a list of keywords and a related action. For example, Roku needs a SelectConcept that should request the select button when the user says either of the keywords 'select' or 'ok': <pre class="prettyprint lang-python"><br />SelectConcept = Concept(['select','ok'], 'press select')<br /></pre>Each concept has a scoring method which takes an input message and returns an action if the message contains the keyword: <pre class="prettyprint lang-python"><br /># Returns the action string 'press select'<br />SelectConcept.conceptMatch('please press select button'.split(" "))<br /></pre> Once we've defined all of our concept data, we need a way to pass the action requests to the Roku from within Python. For this, we can use Python's <a href="https://docs.python.org/2/library/telnetlib.html">telnetlib</a> to connect to Roku's 8080 port and pass message strings like 'press select'. <br> <br><b><font size="+1"> <a href="https://github.com/rwalk/RokuTelnetWithVoiceControl"">See the full code on github.</a> </font></b><br><br><h1> Tuning and other details </h1>If you are lucky, PyAudio and SpeechRecognizer will work well out of the box. More likely, however, you will encounter a few library dependency problems and hardware-specific issues. On my Linux systems, I needed to add the following packages to get PyAudio working: <ul><li> <a href="http://www.portaudio.com/">PortAudio</a> </li><li> flac </li></ul> Note that even when PyAudio is working, <a href="http://blog.yjl.im/2012/11/pyaudio-portaudio-and-alsa-messages.html">you may still see a number of warning/diagnostic messages printed to the console</a>, e.g.: <pre><br />ALSA lib pcm_dsnoop.c:618:(snd_pcm_dsnoop_open) unable to open slave<br />ALSA lib pcm_dmix.c:1022:(snd_pcm_dmix_open) unable to open slave<br />ALSA lib pcm_dmix.c:1022:(snd_pcm_dmix_open) unable to open slave<br />ALSA lib pcm.c:2239:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.rear<br />ALSA lib pcm.c:2239:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.center_lfe<br />ALSA lib pcm.c:2239:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.side<br />bt_audio_service_open: connect() failed: Connection refused (111)<br />bt_audio_service_open: connect() failed: Connection refused (111)<br />bt_audio_service_open: connect() failed: Connection refused (111)<br />bt_audio_service_open: connect() failed: Connection refused (111)<br />ALSA lib pcm_dmix.c:1022:(snd_pcm_dmix_open) unable to open slave<br />Recognizing...<br /></pre> The call: <pre class="prettyprint lang-python"><br />with sr.Microphone() as source: <br /> audio = recognizer.listen(source, timeout=2)<br /></pre>will listen to the microphone channel for up to two seconds waiting for a signal above a particular energy threshold. If the signal passes over the threshold value, recording begins and will continue until a pause is encountered. The threshold value and stopping pause length are controlled by the parameters: <pre class="prettyprint lang-python"><br />recognizer.energy_threshold = 2000<br />recognizer.pause_threshold = .5<br /></pre>The values of these parameters will need to be tuned for particular microphone hardware, system settings, and the level of ambient noise during recording. It may be helpful to adjust the system microphone input level as well to optimize recognition performance. Experimenting on several different microphones, I found optimal energy thresholds for this application varied widely between 200 and 2000. More details about recognizer parameters are available in the <a href="https://pypi.python.org/pypi/SpeechRecognition/">SpeechRecognizer</a> documentation. <br><br>Even with optimization, SpeechRecognizer for this toy application is not very reliable and latency is too high. A better implementation would use a custom language model and a more robust concept mapping system. An interesting alternative to SpeechRecognition might be <a href="http://www.jezra.net/projects/blather">blather</a>. Ryan Walkerhttps://plus.google.com/112373077410906476606noreply@blogger.com0tag:blogger.com,1999:blog-1887031930241333662.post-50391929703416120432014-06-01T20:41:00.001-07:002014-06-02T08:11:07.058-07:00Analyzing the 2011-2012 California Health Inteview Survey with RThe <a href="http://healthpolicy.ucla.edu/chis/Pages/default.aspx">California Health Interview Survey</a> (CHIS) is a remarkable biannual survey of health status, care access, and demographics for California residents. The 2011-2012 public use survey data has recently been released and is freely available after registering at the CHIS site. CHIS currently offers data in SAS, Stata, and SPSS formats. However, thanks to a few great R packages, it's not very hard to bring this complex and interesting data set into R. In this post, we'll cover some practical details about using CHIS data. We'll also take up a more technical topic about how the CHIS weights were build and about how they should and should not be used. Before we begin, here are some key references for CHIS: <ul><li> UCLA maintains extensive documentation of the <a href="http://healthpolicy.ucla.edu/chis/design/Pages/methodology.aspx"> CHIS methodlogy</a> including important details of how the sample weighting was developed and implemented. Definitions for the CHIS variables can be found in the data dictionary included with the CHIS public use datasets.</li> <li> Thomas Lumley created and maintains the survey package. He has provided a description of how to work with the CHIS survey weights <a href="http://faculty.washington.edu/tlumley/old-survey/survey-wss.pdf">here</a> and also in his <a href="http://faculty.washington.edu/tlumley/old-svybook/index.html">book</a>. </li> </ul><br/><h3> Getting The Data </h3>To begin, we first need to register (free) at the CHIS site and download the data. The CHIS is a landline and mobile telephone survey, designed to represent the non-institutionalized resident population of California. The data is divided into three subsets: child, adolescent, and adult. For simplicity, we'll just work with the adult data set which has about 43k records. To bring the data into R, download the Stata flavor from CHIS, unzip it, and load it into R as follows: <pre class="prettyprint lang-r"><br /># Read CHIS file<br />library(foreign)<br />file <- "~/Desktop/CHIS/STATA/chis11-12_adult_stata/Data/ADULT.dta" # your file<br />options(nwarnings=500) # Bump up number of warnings -- see note below<br />CHIS <- read.dta(file, convert.factors = TRUE)<br />warnings() # print the 257 warning messages<br /># Note: read.dta generates warnings because some of the Stata <br /># value label definitions are not found in the raw Stata file. Thus,<br /># some variables will show the raw survey coding and some will print<br /># with the decoded labels from CHIS. Consult the data dictionary to translate<br /># raw survey codes. To be really careful, we could set convert.factors = FALSE<br /># to work only with the raw codes for all variables.<br /></pre> <br/> Now that we've got the data loaded, we want to take a stab at analyzing it. Browsing through the data dictionary, we see that CHIS records educational attainment in the variable <i>aheduc</i>. How can we tabulate this variable? Observe that CHIS includes 81 variables, <i> rakedw0, rakedw1,...,rakedw80 </i>, which are called replicate weights. For the moment, we'll disregard all of these weights except for <i>rakedw0</i>. The variable <i>rakedw0</i> is the sample weight and it tells us how much each of the 43k observations should "count" in an analysis. Using R's xtabs function, we can tabulate educational attainment with the weight <i>rakedw0</i>: <pre class="prettyprint lang-r"><br /># Educational attainment -- counts<br /># as.data.frame, round are just aesthetic choices.<br />as.data.frame(round(xtabs(rakedw0 ~ aheduc, CHIS, drop.unused.levels = TRUE),0))<br /># aheduc Freq<br /># 1 GRADE 1-8 2214253<br /># 2 GRADE 9-11 1929232<br /># 3 GRADE 12/H.S. DIPLOMA 6747336<br /># 4 SOME COLLEGE 4155536<br /># 5 VOCATIONAL SCHOOL 842097<br /># 6 AA OR AS DEGREE 1936675<br /># 7 BA OR BS DEGREE 5918519<br /># 8 SOME GRAD. SCHOOL 284266<br /># 9 MA OR MS DEGREE 2537555<br /># 10 PH.D. OR EQUIVALENT 930248<br /># 11 NO FORMAL EDUCATION 300768<br /><br /># sum over the table<br />margin.table(xtabs(rakedw0 ~ aheduc, CHIS, drop.unused.levels = TRUE))<br /># 27796484<br /></pre> <br/>Though we had only 43k records in the sample, the sum over all the categories is about 28M people which is roughly the size of the non-institutionalized adult population of California. This is because the sample weight <i>rakedw0</i> was designed so that CHIS totals match closely to known CA demographics (primarily from the <a href="http://www.dof.ca.gov/research/demographic/reports/view.php">California Department of Finance's Demographics Unit</a>). <br/> <br/> Also, we observe that: <pre class="prettyprint lang-r"><br /># Male/Female proportion in CHIS using the sample weight (correct)<br />round(prop.table(xtabs( rakedw0 ~ srsex, CHIS, drop.unused.levels = TRUE))*100,2)<br /># srsex<br /># MALE FEMALE <br /># 48.72 51.28<br /><br /># Male/Female proportion in RAW CHIS (wrong!!)<br />round(prop.table(xtabs( ~ srsex, CHIS, drop.unused.levels = TRUE))*100,2)<br /># srsex<br /># MALE FEMALE <br /># 41.57 58.43 <-- (!)<br /></pre> <br/> The raw CHIS sample contains many more women than we'd expect from a completely random draw from the California population. The sample weight <i>rakedw0 </i> includes an adjustment to rebalance the over-representation of females. Later in the post, we'll explore the CHIS weighting method in more detail and also explain the reason for the other weights <i>rakedw1,..., rakedw80</i>. <br/> <br/><h3> Simple point estimates with CHIS </h3>CHIS is a great resource for investigating how demographic and health outcome factors are related. Let's use the data to assess a simple hypothesis: Californians with lower educational attainment are more likely to be overweight/obese (<i>ovrwt</i>). <br> <br> When working with CHIS data it is frequently convenient to group and recode categorical variables. There are a number of ways we could do this but the "car" package offers a recoding function with a simple syntax: <pre class="prettyprint lang-r"><br /># Recode educational attainment<br />library(car)<br />recodeSTR = "<br />c('NOT ASCERTAINED') = 'UNKNOWN';<br />c('NO FORMAL EDUCATION','GRADE 1-8', 'GRADE 9-11',<br />'GRADE 12/H.S. DIPLOMA','SOME COLLEGE',<br />'VOCATIONAL SCHOOL','AA OR AS DEGREE') = 'LESS THAN BA';<br />c('BA OR BS DEGREE','SOME GRAD. SCHOOL') = 'BA';<br />c('MA OR MS DEGREE','PH.D. OR EQUIVALENT') = 'MA OR MORE' "<br />CHIS$college_grad = recode(CHIS$aheduc, recodeSTR)<br /></pre> <br/> Here's a simple visual tabulation from CHIS to support our hypothesis: <br/> <div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-IrX-_4NzcUA/U4PH9bwoOmI/AAAAAAAAAHE/hQe35Bcqfd8/s1600/ovrwt.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-IrX-_4NzcUA/U4PH9bwoOmI/AAAAAAAAAHE/hQe35Bcqfd8/s640/ovrwt.png" /></a></div> <br/> <pre class="prettyprint lang-r"><br /># make a plot<br />require(ggplot2)<br />require(scales)<br /><br /># Educational attainment (recoded) vs overweight<br />round(prop.table(xtabs(rakedw0 ~ college_grad + ovrwt, <br /> CHIS, drop.unused.levels=TRUE),1)*100,1)<br />#college_grad YES NO<br /># BA 51.2 48.8<br /># LESS THAN BA 64.6 35.4<br /># MA OR MORE 50.5 49.5<br /><br /># summary df<br />df <- as.data.frame(prop.table(xtabs(rakedw0 ~ college_grad + ovrwt, <br /> CHIS, drop.unused.levels=TRUE),1))<br /><br /># reorder the factors for sake of the plot<br />df$college_grad = factor(df$college_grad, <br /> levels=c('LESS THAN BA','BA','MA OR MORE')) <br /><br /># plot<br />ggplot(df, aes(x=college_grad, y=Freq, fill=ovrwt)) + geom_bar(stat="identity") +<br /> scale_y_continuous(labels = percent_format()) +<br /> ylab("% of Adults") + xlab("Education Level") +<br /> scale_fill_manual(values = c("blue","yellow"))<br /></pre> <br/>Now that we've found a potentially interesting relationship in the dataset, our next step might be to check that this relationship is statistically significant. This means we need to compute standard errors for our cross tabulation. Here's where working with the complex CHIS survey design gets a bit tricky...<br/> <br/> <h3> Replicate weights and estimator variances </h3>At this point, it's useful take a more detailed look at the weighting methodology employed in CHIS and other complex surveys. This section is a little technical but the key practical consequences are summarized at the end. The CHIS sample weights are designed through a process called raking (which is similar to <a href="http://en.wikipedia.org/wiki/Iterative_proportional_fitting">iterated proportional fitting</a>). To understand this process, imagine that we are conducting a study on smoking and have collected the following sample: <pre class="prettyprint lang-r"><br />sample <- data.frame(<br /> gender = c(rep("M",3),rep("F",7)),<br /> city = c(rep(c("A","B"),4),"B","B"),<br /> smoke = c(rep(c(1,0),3),1,0,1,1)<br />)<br /># gender city smoke<br />#1 M A 1<br />#2 M B 0<br />#3 M A 1<br />#4 F B 0<br />#5 F A 1<br />#6 F B 0<br />#7 F A 1<br />#8 F B 0<br />#9 F B 1<br />#10 F B 1<br /></pre> <br/> From an external source, such as census data, suppose we know with high-confidence that 52% of the total population (city A + city B) are female and 48% are male. Suppose we also know that the population is distributed so that 55% of people live in A and 45% live in B. These external facts are called usually called control totals. Our sample is clearly not representative of the population as 40% of respondents are from city A and 70% are female. If we want to estimate the joint distribution of smoking rates by gender and city, what can we do?<br/> <br/> Consider the strategy of assigning a weight $w_i$ to each respondent in the sample. This weight tells us how much each observation should count for in the cross tabulation. Then, for example, we can estimate: $$ p(\mathrm{smoke}, M ,A) \approx \frac{1}{10} \sum_{i: \mathrm{gender}_i = M, \mathrm{city}_i = A } w_i \cdot \mathrm{smoke}. $$ The key issue is how to choose the weight $w_i$. Raking is one such strategy which can make sense for complex surveys. <br/> <br/> In a raking procedure, we begin with one of the control variables (say gender) and choose $w_i$ so that the sample distribution will match the population distribution. For example, to get 48% male representation we'd need to give each male in the sample a weight such that $$.3 \cdot w_i(\mathrm{gender}_i = M) = .48 \qquad \Rightarrow \qquad w_i(\mathrm{gender}_i = M) = \frac{0.48}{0.3} = 1.6.$$ Similarly, to get 52% female representation, we'd need to give each female in the sample a weight such that $$ w_i(\mathrm{gender}_i = F) = \frac{0.52}{0.7} \approx 0.7429.$$ We can then recompute the crosstabs with this initial weight: <pre class="prettyprint lang-r"><br />sample$w0 = ifelse(sample$gender=="M",.48/.3, .52/.7)<br /><br />prop.table(xtabs(w0 ~ gender, sample))<br />#gender<br /># F M <br />#0.52 0.48 <br /><br />prop.table(xtabs(w0 ~ city, sample))<br />#city<br /># A B <br />#0.4685714 0.5314286 <br /></pre> <br/> Now gender is in balance but city remains out of balance. To balance the city proportions, we need to <i>weight the weighted data</i>. <pre class="prettyprint lang-r"><br /># tabulate with current weight<br />P <- prop.table(xtabs(w0 ~ city, sample))<br /><br /># update the weight for city<br />sample$w1 = ifelse(sample$city=="A",.55/P[1], .45/P[2])<br /><br /># tabulate again<br />prop.table(xtabs(w0*w1 ~ city, sample))<br /># city<br /># A B <br /># 0.55 0.45 <br />prop.table(xtabs(w0*w1 ~ gender, sample))<br /># gender<br />#F M <br />#0.4889064 0.5110936 <br /></pre> <br/> So we've balanced city but in doing so we've also unbalanced gender. However, gender is not nearly as out of balance as it was when we began. This suggests that we can <i>iterate</i> this process to build a weight variable that will balance both gender and city simultaneously. Indeed, here is a simple code that will produce the sample weight for this toy problem: <pre class="prettyprint lang-r"><br /># controls<br />pop.gender = c(F=.52,M=.48)<br />pop.city = c(A=.55,B=.45)<br /><br /># set initialize the weight<br />sample$w <- 1<br /><br /># run the algorithm (only need a few iterations to converge)<br />for(i in 1:10){<br /><br /> # get current weighted proportion for gender<br /> P0 <- prop.table(xtabs(w ~ gender, sample))<br /> sample$w0 = ifelse(sample$gender=="F", pop.gender[1]/P0[1], pop.gender[2]/P0[2])<br /><br /> #update weight<br /> sample$w = sample$w * sample$w0<br /><br /> # get current weighted proportion for city<br /> P1 <- prop.table(xtabs(w ~ city, sample))<br /> sample$w1 = ifelse(sample$city=="A", pop.city[1]/P1[1], pop.city[2]/P1[2])<br /><br /> #update weight<br /> sample$w = sample$w * sample$w1<br />}<br /># It works!<br />prop.table(xtabs(w ~ gender, sample))<br /># gender<br /># F M <br /># 0.52 0.48 <br />prop.table(xtabs(w ~ city, sample))<br /># city<br /># A B <br /># 0.55 0.45 <br /></pre> <br/> This simplified raking algorithm is a good approximation to how the weights in CHIS (and other complex surveys) are developed. Of course in CHIS, the process is much more complicated and the rebalancing is done over 11 design variables and with many more factor levels. <br/> <br/> Raking is an interesting procedure from a technical perspective. Why would we expect this algorithm to converge? When it does converge, why can we use these weights to obtain unbiased estimates within cells? These questions are well outside the scope of this post but there is a fair amount of literature on the topic (this <a href="http://www.amstat.org/sections/srms/Proceedings/y2004/files/Jsm2004-000074.pdf">article</a>, for example). Practically speaking, what we need to know for working with CHIS are the following facts: <ul><li> The raked weight RAKEDW0 can be used to obtain unbiased estimators of means and proportion in the usual way, e.g. with xtabs in R. </li><li> Variances (and hence standard errors) for these estimators <i>cannot</i> be computed directly using RAKEDW0. A nice description of this issue can be found in this <a href="http://aje.oxfordjournals.org/content/144/1/102.full.pdf?origin=publication_detail">article</a>. </li></ul><br/> In the next section, we'll explain how to correctly compute the standard errors for means and proportions in CHIS. <br/> <br/> <h3> Standard errors with Replicate Weights</h3>To handle the standard errors, we need to make use of the replicate weights <i>rakedw1,...,rakedw80</i>. Essentially, these weights capture variability across subsamples of the CHIS sample and from the variability within these subsamples we can deduce the correct variance estimates for our point estimators. The details of how the CHIS replicate weights were created and how they are used can be found in <a href="http://healthpolicy.ucla.edu/Documents/Newsroom%20PDF/CHIS2009_method5.pdf">section 9.2 of the CHIS methodology manual.</a> <br/> <br/> Fortunately, R's survey package was built precisely to handle complex survey designs, including replicate weights. The key feature of the survey package is that all trickiness of the CHIS design can be encapsulated in a survey design object $D$. Once R knows about the design, survey's methods will take care of the details of correctly computing the standard errors behind the scenes. Let's return to our education/obesity example and compute standard errors for our cross tabulations: <pre class="prettyprint lang-r"><br /># survey analysis with the survey package<br />require(survey)<br /><br /># remove empty levels from overwt (aesthetic)<br />CHIS$ovrwt <- factor(CHIS$ovrwt)<br /><br /># design structure -- settings are thanks to the references<br /># by T. Lumley. The design object captures all the information<br /># about the replicate weights.<br />D <- svrepdesign(variables = CHIS[,c("college_grad", "ovrwt")], <br /> repweights = CHIS[,grep("rakedw[1-9]",colnames(CHIS))],<br /> weights = CHIS$rakedw0,<br /> combined.weights = TRUE,<br /> type = "other",<br /> scale = 1,<br /> rscales = 1)<br /><br /># compute the means and standard errors<br />svyby( ~ovrwt, ~college_grad, D, <br /> FUN = svymean, keep.names = FALSE)<br /><br /># college_grad ovrwtYES ovrwtNO se1 se2<br />#1 BA 0.5122654 0.4877346 0.007344430 0.007344430<br />#2 LESS THAN BA 0.6456145 0.3543855 0.005697532 0.005697532<br />#3 MA OR MORE 0.5051845 0.4948155 0.009186273 0.009186273<br /><br /><br /># compare with the naive standard error (WRONG!):<br />n <- sum(CHIS$rakedw0[CHIS$college_grad == "BA"])<br />x <- sum(CHIS$rakedw0[CHIS$college_grad == "BA" & CHIS$ovrwt == "YES"]) <br />p <- x/n # proportion<br />SE <- sqrt(p*(1-p)/n) # standard error in proportion<br /># 0.0002006993 <br /># This is much smaller error than what we got with the correct <br /># estimate provided by survey! Thus, we would be overly confident<br /># about our results if used the naive method.<br /></pre> <br/> <br/> Ryan Walkerhttps://plus.google.com/112373077410906476606noreply@blogger.com2tag:blogger.com,1999:blog-1887031930241333662.post-63067641407643242192014-04-20T19:12:00.001-07:002014-04-23T08:00:17.334-07:00The Circus Tent Problem with R's QuadprogThe MathWorks has an interesting demo on how the shape of a <a href="http://www.mathworks.com/products/optimization/code-examples.html?file=/products/demos/shipping/optim/circustent.html">circus tent can be modeled as the solution of a quadratic program in MATLAB.</a> In this post, we'll show how to solve this same problem in R using the quadprog package and also provide the technical details not covered in the Mathwork's example. In particular, we'll explain how an energy function is associated to this problem and how descretizing this energy function gives rise to a sparse quadratic program. <br/> <br/> Quadratic programming is a powerful and versatile technique that has found applications in a diverse range of fields. Perhaps its most famous application is in portfolio allocation theory, though it also is very well known as the computational foundation of SVM. A quadratic program is a constrained optimization problem where the objective function is a quadratic form (multidimensional generalization of a quadratic function). The constraints in a standard quadratic program are a mix of linear equality and inequality constraints. This example provides both an intuitive introduction to quadratic programming in R and a great test case for benchmarking optimization algorithms. <br/> <br/> <h3> Problem Description </h3>We imagine draping a piece of fabric over an ensemble of 5 tent poles, placed symmetrically over a 36 by 36 grid. The four outer poles are all the same size. The center pole is somewhat larger and taller. In R, it's easy to build this ensemble by exploiting symmetry: <pre class="prettyprint lang-r"><br /># Build the pole grid using basic symmetries.<br />q2 = matrix(0,18,18)<br />q2[8:9,8:9] = .3<br />q2[17:18,17:18] = 1/2<br />q1 <- q2[,18:1]<br />top <- cbind(q2,q1)<br />z <- rbind(top,top[18:1,])<br /></pre> <br/> <br/>Let's plot the pole ensemble and the piece of fabric:<br/> <br/> <div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-ekP8Yuo3P3I/UzebT4tsfBI/AAAAAAAAAGQ/qc8RVWgBWQQ/s1600/foundation.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-ekP8Yuo3P3I/UzebT4tsfBI/AAAAAAAAAGQ/qc8RVWgBWQQ/s400/foundation.png" /></a></div><br/><pre class="prettyprint lang-r"><br /># Plot<br />library(rgl) <br />x <- (1:36)/36<br />y <- (1:36)/36<br />open3d()<br />rgl.surface(x,y, z, color='blue',alpha=.75, smooth=TRUE)<br />rgl.surface(x,y, matrix(1/2,36,36), color='red', front='lines', back='lines')<br />rgl.bg(color="white")<br /></pre> The task is to model the height $u(x,y)$ of the piece of fabric over the grid point $(x,y)$. <br/> <br/> <h3> Defining an Energy Function </h3>We now need to specify an energy function which defines what should happen when you drape the fabric over the pole ensemble. The idea is that once the fabric is placed, it eventually settles to some minimal energy state. Let $u(x,y)$ be any function that gives heights of the fabric over any grid point $(x,y)$. Suppose we come up with a way to measure the energy $E$ of any fabric distribution $u$. We call this measurement $E[u]$. A minimal energy state is a particular fabric distribution $u^*$ that makes our measurement $E[u^*]$ as small as possible. <br/> <br/> The question of how to measure the energy $E[u]$ is one for physicists and engineers. Without specifying the mechanical laws governing the process by which the fabric "settles", we shouldn't expect to be able to provide realistic solutions to this model. Nevertheless, let's consider the famous energy measurement: $$ E[u] = \frac{1}{2}\int_0^{l_y} \int_0^{l_x} \nabla u(x,y) \cdot \nabla u(x,y) \;dx dy $$ where $l_y$ and $l_x$ are the length of the vertical and horizontal sides of the grid and $\nabla$ is the gradient operator. This measurement is called the <a href="http://en.wikipedia.org/wiki/Dirichlet_energy">Dirichlet energy</a> of $u$ and, roughly, it captures the total amount of variation in $u$. <br/> <br/> We would like to find a numerical approximation to the fabric distribution $u(x,y)$ that makes the energy $E[u]$ as small as possible and also fits on top of our tent peg ensemble. Let's take for granted that such an optimal fabric distribution exists and that this function has no tears or creases (continuous differentiability). We can then use <a href="http://en.wikipedia.org/wiki/Integration_by_parts#Higher_dimensions">integration by parts</a> to rewrite the Dirichlet energy as: $$\begin{array}{l} E[u] &= \frac{1}{2} \int_0^{l_y} \int_0^{l_x} \nabla u(x,y) \cdot \nabla u(x,y) \; dx \;dy \\ &= - \frac{1}{2} \int_0^{l_y} \int_0^{l_x} u(x,y) \Delta u(x,y) \; dx dy + \frac{1}{2} \oint_{\square} u(x,y) \nabla u(x,y) \cdot \nu(x,y) \; dS. \end{array}$$ The second integral in the last expression is a surface integral around the rectangular boundary of our grid. If we assume that $u(x,y)=0$ at the edge of the grid then this term vanishes and we're left with the energy expression: $$\begin{array}{l} E[u] &= - \frac{1}{2} \int_0^{l_y} \int_0^{l_x} u(x,y) \Delta u(x,y) \; dx dy. \end{array}$$ <h3> Discretizing the Energy Function</h3>Now, let's show how to use the previous formula to get a numerical approximation to the function $u(x,y)$ that minimizes the energy $E[u]$. To do this, we choose a uniform discretization of our grid. Let $h_x$ and $h_y$ be the spacing between grid points in the $x$ and $y$ directions. We can then approximate the Laplacian $\Delta$ using <a href="http://en.wikipedia.org/wiki/Finite_difference#Finite_difference_in_several_variables">finite difference formulas</a>: $$ \Delta u(x,y) = \frac{ u(x+h_x,y) - 2u(x,y) + u(x-h_x,y) }{h_x^2} + \frac{u(x,y +h_y) - 2u(x,y) + u(x,y-h_y)}{h_y^2}.$$ Now we discretize the energy function $E[u]$: $$ \begin{array}{l} E[u] &= -\frac{1}{2} \int_0^{l_y} \int_0^{l_x} \nabla u(x,y) \cdot \nabla u(x,y) \; dx \;dy \\ & \approx - \frac{h_x h_y}{2} \sum_{i=1}^{N_y} \sum_{j=1}^{N_y} u(ih_x, jh_y) \Delta u(i h_x, j h_y). \end{array} $$ In our discrete problem we are trying to find the values of $u$ at each of our $N_x \cdot N_y$ grid points. In the current formulation, these points are arranged as an $N_y \times N_x$ matrix but it turns out to be much more convenient to reorganize these points into a single long vector of $N_x \cdot N_y$ components. We'll do this by stacking the columns of the grid matrix into a single column so that the grid point $(i,j)$ will map to component $k= N_y(j-1) + i$. In R, we can stack the columns of the matrix M in this way by <pre class="prettyprint lang-r"><br />nr <- nrow(z)<br />nc <- ncol(z)<br />N <- nr*nc<br />bvec <- matrix(z,N,1,byrow=FALSE)<br /></pre> <br/>We write $u_k$ to denote the $k$-th entry of the solution vector $\vec{u}$ in this new indexing system. Then the energy approximation becomes: $$ \begin{array}{l} E[u] & \approx \frac{h_x h_y}{2} \sum_{k=1}^{N_x \cdot N_y} u_k(- \Delta u_k). \end{array} $$ Now, we observe that the approximation of $-\Delta u_k$ involves only five components of the $\vec{u}$ vector. In fact: $$ -\Delta u_k \approx -\frac{u_{k+N_y}- 2u_k + u_{k - N_y}}{h_x^2} - \frac{u_{k+1} -2u_k + u_{k-1} }{h_y^2}.$$ Using this observation, we write the energy approximation as $$ \begin{array}{l} E[u] &\approx \frac{h_x h_y}{2} \vec{u}^T L\vec{u}. \end{array} $$ where $L$ is a special matrix called the discrete Laplacian. The discrete Laplacian is a famous sparse matrix which has five non-zero bands (one for each component that contributes to a typical $\Delta u_k$ value). It is a numerical approximation to the Laplace operator $-\Delta$ and has many remarkable properties. We can use <a href="http://en.wikipedia.org/wiki/Kronecker_sum_of_discrete_Laplacians">Kronecker products to easily build the 2D discrete Laplacian</a> on a rectangular grid in R: <pre class="prettyprint lang-r"><br /># 2D Discrete Laplacian on an lx by ly rectangular grid with<br /># nx grid lines in the x direction and ny grid lines in the y direction.<br />Laplacian2D <- function(lx,ly, nx,ny){<br /> hx <- lx/(nx-1)<br /> hy <- ly/(ny-1)<br /> tr_x <- c(2,-1,rep(0,nx-2))<br /> tr_y <- c(2,-1,rep(0,ny-2))<br /> Tx <- toeplitz(tr_x)/(hx^2)<br /> Ty <- toeplitz(tr_y)/(hy^2)<br /> Ix <- diag(nx)<br /> Iy <- diag(ny)<br /> L <- kronecker(Tx,Iy) + kronecker(Ix,Ty)<br /> return(L)<br />}<br /></pre> <br/> <br/><h3> Solving the Quadratic Program </h3>To recap, our goal was to find a numerical approximation to the function $u$ which minimizes $E[u]$ and which fits on our tent peg ensemble. Our work in the previous section showed that we can find such an approximation by solving the following optimization problem: $$ \left\{ \begin{array}{l} \min_{\vec{u} \in \mathbb{R}^{Ny \cdot Nx}} \; \frac{1}{2} \vec{u}^T L \vec{u} \\ \mbox{subject to:} \qquad \vec{u} \geq \vec{z} \end{array} \right. $$ where $\vec{z}$ is the vector whose component $z_k$ give the heights of the ground and tent pole ensemble at the (vectorized) grid point $k$. This is a quadratic programming problem, and a very tractable one since an important property of the matrix $L$ is that it is symmetric positive definite. <br/> <br/> Note that the Mathworks example uses a slightly more general quadratic form $$ E[u] \approx \frac{1}{2} \vec{u}^TL \vec{u} + \vec{c}^T \vec{u} $$ where $\vec{c}$ is a constant vector whose components are all equal to $h_x \cdot h_y$. Though the energy function is not explained in their example, it is likely that this linear term is meant to capture some material effects that aren't accounted for by the simple Dirichlet energy measure. We'll also include this term in our solution below to make use of a slightly more interesting quadratic form. <br/> <br/> We'll solve our quadratic program with R's quadprog library. The solution algorithm implemented in R's quadprog is somewhat different from that available in MATLAB. One important note about R's quadprog library is that it does not take advantage of the sparsity of the system matrix $L$ nor of the sparsity of the constraints $\vec{u} \geq \vec{z}$. Since an $N \times N$ matrix requires an $N^2 \times N^2$ matrix $L$, the dense matrix representation of $L$ can quickly become impractical. For the $36 \times 36$ grid of this problem, however, quadprog works wonderfully: <pre class="prettyprint lang-r"><br /># Solve the QP:<br />library(quadprog)<br />Ny <- nrow(z)<br />Nx <- ncol(z)<br />N <- Nx*Ny<br />hx <- 1/(Nx-1)<br />hy <- 1/(Ny-1)<br /><br /># System matrices<br /># Note: quadprog's quadratic form is -d^T x + 1/2 x^T D x with t(A)%*%x>= b<br />theta <- 1 # coefficient for linear term<br />Dmat <- hx*hy*Laplacian2D(1,1,36,36) <br />dvec <- -theta*(hx*hy)*rep(1,N) <br />Amat <- t(diag(N))<br />bvec <- matrix(z,N,1,byrow=FALSE) # lower bound<br /><br /># call the solver<br />sol <- solve.QP(Dmat, dvec, Amat, bvec)<br /><br /># extract and reshape the solution<br />tent <- matrix(sol$solution,nrow(z),ncol(z), byrow=FALSE)<br /></pre> <br/> <br/>Now we plot the solution: <br/> <br/><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-oejqkp8QqtA/Uzedhm8APSI/AAAAAAAAAGs/UynSE5yUYPI/s1600/tent.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-oejqkp8QqtA/Uzedhm8APSI/AAAAAAAAAGs/UynSE5yUYPI/s1600/tent.png" /></a></div><br/> <pre class="prettyprint lang-r"><br />#plot <br />open3d()<br />x <- (1:Nx)/Nx<br />y <- (1:Ny)/Ny<br />rgl.surface(x, y , z, color='blue',alpha=.75, smooth=TRUE)<br />rgl.surface(x, y , tent, color='red', front='lines', back='lines')<br />rgl.bg(color="white")<br /></pre> Ryan Walkerhttps://plus.google.com/112373077410906476606noreply@blogger.com8tag:blogger.com,1999:blog-1887031930241333662.post-42583832024008283262014-03-23T14:53:00.000-07:002014-03-23T14:53:14.830-07:00Going viral with R's igraph packageR's <a href="http://cran.r-project.org/web/packages/igraph/index.html" target="_blank">igraph</a> package provides a simple and flexible set of utilities for working with graphs. In this post, we'll use this package to animate the simulated spread of a disease through a network.<br /><br /><h1> Graphs </h1>A graph is just a collection of nodes joined by edges:<br /><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-xMOjGeR_4xM/Uy0weRr_jKI/AAAAAAAAAFI/MG25NfCg018/s1600/Rplot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-xMOjGeR_4xM/Uy0weRr_jKI/AAAAAAAAAFI/MG25NfCg018/s1600/Rplot.png" height="435" width="640" /></a></div><div class="separator" style="clear: both; text-align: center;"></div><pre class="prettyprint lang-r"><br />require(igraph)<br /># Specify an undirected graph by hand, using a numeric <br /># vector of the pairs of vertices sharing an edge.<br />G <- graph( c(1,2,1,3,1,4,3,4,3,5,5,6,6,7,7,8,8,9,3,8,5,8), directed = FALSE )<br /><br /># visualization<br /> plot(G, layout = layout.fruchterman.reingold,<br /> vertex.size = 25,<br /> vertex.color="red",<br /> vertex.frame.color= "white",<br /> vertex.label.color = "white",<br /> vertex.label.family = "sans",<br /> edge.width=2, <br /> edge.color="black")</pre> <br/> In graph models, the nodes of a graph typically represent entities and the edges represent relationships between these entities. Both nodes and edges may have attributes or qualities that characterize the entities and the relationships. A node might have an attribute of "color". An edge might have an attribute of "weight" that encodes the strength of the relationship between the vertices that the edge joins. The igraph package makes it very simple to manage the assignment of attributes to the components of a graph: <br/><pre class="prettyprint lang-r"><br />require(igraph)<br />G <- graph( c(1,2,1,3,1,4,3,4,3,5,5,6,6,7,7,8,8,9,3,8,5,8), directed = FALSE )<br /><br /># Assign attributes to the graph<br />G$name <- "A colorful example graph"<br /><br /># Assign attributes to the graph's vertices<br />V(G)$name <- toupper(letters[1:9])<br />V(G)$color <- sample(rainbow(9),9,replace=FALSE)<br /><br /># Assign attributes to the edges<br />E(G)$weight <- runif(length(E(G)),.5,4)<br /><br /># Plot the graph -- details in the "Drawing graphs" section of the igraph manual<br />plot(G, layout = layout.fruchterman.reingold, <br /> main = G$name,<br /> vertex.label = V(G)$name,<br /> vertex.size = 25,<br /> vertex.color= V(G)$color,<br /> vertex.frame.color= "white",<br /> vertex.label.color = "white",<br /> vertex.label.family = "sans",<br /> edge.width=E(G)$weight, <br /> edge.color="black")<br /></pre><a href="http://2.bp.blogspot.com/-berEdMW9aMg/Uy3ZG60xYRI/AAAAAAAAAFg/-FS0UPYN6F4/s1600/Rplot01.png" imageanchor="1" ><img border="0" src="http://2.bp.blogspot.com/-berEdMW9aMg/Uy3ZG60xYRI/AAAAAAAAAFg/-FS0UPYN6F4/s640/Rplot01.png" /></a><br/><br/><h1> Building a simulation </h1>The ease and flexibility with which you can assign and update attributes of a graph's nodes and edges makes igraph a powerful tool for prototyping simulations with graphs. Let's consider a very simplified example. Let $G$ be a graph whose nodes are people in a population. Two nodes in $G$ will share an edge if they have daily in-person contact with each other. Suppose also that a virus has infected a few nodes of $G$. These infected nodes can spread the virus to other uninfected nodes with whom they have daily contact. Newly infected nodes will then spread the virus to other nodes, and so on. Let's use the simple probability model: $$p(x \mbox{ infects } y | x \mbox{ is infected}, y \mbox{ is uninfected}, x\mbox{ and } y \mbox{ share an edge})= \frac{1}{2}.$$ <br>Here is a function that implements this simulation model: <pre class="prettyprint lang-r"><br />spreadVirus <- function(G,Vinitial,p){ <br /> # Precompute all outgoing graph adjacencies<br /> G$AdjList = get.adjlist(G,mode="out")<br /> <br /> # Initialize various graph attributes<br /> V(G)$color = "blue"<br /> E(G)$color = "black"<br /> V(G)[Vinitial]$color <- "yellow"<br /><br /> # List to store the incremental graphs (for plotting later)<br /> Glist <- list(G)<br /> count <- 1<br /> <br /> # Spread the infection<br /> active <- Vinitial<br /> while(length(active)>0){<br /> new_infected <- NULL<br /> E(G)$color = "black"<br /> for(v in active){<br /> # spread through the daily contacts of vertex v<br /> daily_contacts <- G$AdjList[[v]]<br /> E(G)[v %->% daily_contacts]$color <- "red"<br /> for(v1 in daily_contacts){<br /> new_color <- sample(c("red","blue"), 1 ,prob=c(p,1-p)) <br /> if(V(G)[v1]$color == "blue" & new_color=="red"){ <br /> V(G)[v1]$color <- "red"<br /> new_infected <- c(new_infected,v1)<br /> }<br /> }<br /> }<br /> # the next active set<br /> active <- new_infected<br /> <br /> # Add graph to list<br /> count <- count + 1<br /> Glist[[count]] <- G<br /> }<br /> return(Glist)<br />}<br /></pre> <br/>The function spreadVirus will return an ordered list of graphs representing the discrete states of the model as the virus spreads. We'll demonstrate the use of this function on a simple directed graph. A directed graph is a graph where edges are "one-way", so a directed edge from nodes $X$ to $Y$ only allows information to pass from $X$ to $Y$ and not from $Y$ to $X$. <pre class="prettyprint lang-r"><br /># Specify a directed graph by hand<br />G <- graph.formula(1-+2,1-+3,2-+4,2-+5,3-+6,5-+7,7-+8,8-+9,9+-7, 9-+10,<br /> 6-+9,1-+5,5-+20,3-+9,10-+11,10-+15,11-+12,11-+13,13-+14, 14-+15,<br /> 15-+11,4-+8,15-+16,15-+17,17-+18,18-+4,17-+19,19-+20,<br /> 20-+1,14-+1,19-+3)<br />Vinitial <- c("1","10")<br /><br /># Run the simulation<br />set.seed(55555)<br />Glist <- spreadVirus(G,Vinitial,1/2)<br /><br /># Animation plots (generates a .GIF)<br />require(animation)<br />L <- layout.fruchterman.reingold(Glist[[1]])<br />ani.options(interval=1)<br />saveGIF({<br /> count =0<br /> for(i in 1:length(Glist)){<br /> plot(Glist[[i]], layout = L,<br /> vertex.label = NA,<br /> vertex.size = 10,<br /> vertex.color= V(G)$color,<br /> vertex.frame.color= "white",<br /> edge.arrow.size = 1,<br /> edge.color=E(G)$color)<br /> count = count +1<br /> title(main="Graph simulation example", <br /> sub=paste("Time = ",count), cex.main = 3, cex.sub = 2)<br /> }<br />}, interval = 1, movie.name = "demo.gif", ani.width = 1000, ani.height = 1000)<br /></pre> <br/> <a href="http://4.bp.blogspot.com/-B5V2uCSwT7w/Uy9FkTdy70I/AAAAAAAAAFw/XN8zwYPggqw/s1600/example.gif" imageanchor="1" ><img border="0" src="http://4.bp.blogspot.com/-B5V2uCSwT7w/Uy9FkTdy70I/AAAAAAAAAFw/XN8zwYPggqw/s640/example.gif" /></a> <br /><br /> This is a very simplistic model and it does not, for example, take into account the importance of time dynamic effects in the virus spread. Thanks to the flexibility of igraph, however, it is not difficult to extend a basic simulation like spreadVirus to include much more complex effects. <br/> <br/> We should note, however, that although this implementation is very easy to understand, it is not very efficient. The heavy use of loops will be quite slow when we start working with large graphs. Though we might gain some efficiency through vectorization or other optimizations, for running larger simulations we would probably want to port this routine to another platform like C++. Nevertheless, the strength of R's igraph for quickly building and testing a prototype should be apparent from this example. <br/><br/><h1> More realistic graphs </h1>So far, the graphs we have worked with have been quite simple. How do we start building more complex graphs? There are many great sources for real graph data like <a href="http://snap.stanford.edu/data/">Stanford's SNAP dataset collection</a>. There are also many routines for generating random graphs in the igraph packages. The degree.sequence.game method in igraph is one such example. The degree of a node in a graph is the number of edges attached to that node. One way to study graphs is to look at the distribution of the degrees of the nodes in the graph. For many interesting graphs, the degree distribution often has the form of an exponential distribution: $$ p(\mathrm{degree}(\mathrm{node}) = N ) \propto \exp\left(-\tau N \right), $$ see for example <a href="http://en.wikipedia.org/wiki/Scale-free_network">this</a> Wikipedia article which discusses the degree distribution of all web links. As suggested in the igraph manual, we can generate a larger random graph with an exponential distribution as follows:<br/><pre class="prettyprint lang-r"><br />N <- 200<br />in.deg <- sample(1:N, N, replace = TRUE, prob = exp(-.5*1:N))<br />G <-degree.sequence.game(in.deg=in.deg,out.deg=in.deg,method="simple.no.multiple")<br /></pre> <br/>Here's an example simulation on this new graph: <br/><br/><a href="http://1.bp.blogspot.com/-nw50oF7r0m8/Uy9Rzl4DXGI/AAAAAAAAAGA/rjAp6L8kyKc/s1600/example3.gif" imageanchor="1" ><img border="0" src="http://1.bp.blogspot.com/-nw50oF7r0m8/Uy9Rzl4DXGI/AAAAAAAAAGA/rjAp6L8kyKc/s640/example3.gif" /></a> <pre class="prettyprint lang-r"><br /># set seed for reproducible example<br />set.seed(2014)<br /><br /># build graph list<br />Vinitial = sample(V(G),5,replace=FALSE)<br />Glist <- spreadVirus(G,Vinitial,1/2)<br /><br /># Circle layout is better for large graphs<br />L = layout.circle(Glist[[1]])<br /><br /># Animation<br />ani.options(interval=1)<br />saveGIF({<br /> count =0<br /> for(i in 1:length(Glist)){<br /> plot(Glist[[i]], layout = L,<br /> vertex.label = NA,<br /> vertex.size = 4,<br /> vertex.color= V(G)$color,<br /> vertex.frame.color= "white",<br /> edge.arrow.size = 1,<br /> edge.color=E(G)$color)<br /> count = count +1<br /> title(main="Graph simulation example (200 nodes)", <br /> sub=paste("Time = ",count), cex.main = 3, cex.sub = 2)<br /> }<br />}, interval = 1, movie.name = "demo.gif", ani.width = 1000, ani.height = 1000)<br /></pre> Ryan Walkerhttps://plus.google.com/112373077410906476606noreply@blogger.com1tag:blogger.com,1999:blog-1887031930241333662.post-72001516395402472032014-01-12T22:08:00.000-08:002015-03-01T16:22:54.163-08:00Solving Quadratic Progams with R's quadprog packageIn this post, we'll explore a special type of nonlinear constrained optimization problems called quadratic programs. Quadratic programs appear in many practical applications, including portfolio optimization and in solving support vector machine (SVM) classification problems. There are several packages available to solve quadratic programs in R. Here, we'll work with the <i>quadprog</i> package. Before we dive into some examples with quadprog, we'll give a brief overview of the terminology and mechanics of quadratic programs. <br /> <br /> <h3> Quick Overview of Quadratic Programming </h3>In a quadratic programming problem, we consider a <i>quadratic objective function</i>: $$ Q(x) = \frac{1}{2} x^T D x - d^T x + c.$$ Here, $x$ is a vector in $\mathbb{R}^n$, $D$ is an $n \times n$ <a href="http://en.wikipedia.org/wiki/Positive-definite_matrix">symmetric positive definite matrix</a>, $d$ is a constant vector in $\mathbb{R}^n$ and $c$ is a scalar constant. The function $Q(x)$ is sometimes referred to as a quadratic form and it generalizes the quadratic function $q(x) = ax^2 +bx + c$ to higher dimensions. The key feature to note about $Q(x)$ is that it is a <a href="http://quantitate.blogspot.com/2013/09/optimization-in-r-part-i.html">convex function</a>.<br/> <br /> We also impose a system of linear constraints on the vector $x \in \mathbb{R}^n$. We write these constraints in the form $$ Ax = f \qquad Bx \geq g.$$ Here $A$ is an $m_1 \times n$ matrix with $m_1 \leq n$ and $B$ is a $m_2 \times n$ matrix. The vectors $f$ and $g$ have lengths $m_1$ and $m_2$ respectively. This specification is general enough to allow us to consider a variety of practical conditions on the components of $x$, e.g. we can force $x$ to satisfy the sum condition $$ \sum_{i=1}^n x_i = 1 $$ or the box constraints $a_i \leq x_i \leq b_i$. We'll describe how to encode practical constraint conditions into matrix systems below.<br /><br />With this notation out of the way, we can write the quadratic program (QP) compactly as: $$ \left\{ \begin{array}{l} \mathrm{minimize}_{x \in \mathbb{R}^n}: \qquad Q(x) = \frac{1}{2} x^T D x - d^T x + c \\ \mathrm{subject\; to:} \qquad Ax = f \qquad Bx \geq g \end{array} \right.$$ <h3>Example #1:</h3>Consider the objective function: $$ \begin{eqnarray*} Q(x,y) &=& \frac{1}{2} \left[ \begin{matrix} x & y \end{matrix} \right] \left[ \begin{matrix} 2 & -1 \\ -1 & 2 \end{matrix} \right]\left[ \begin{matrix} x \\ y \end{matrix} \right] - \left[ \begin{matrix} -3 & 2 \end{matrix} \right] \left[ \begin{matrix} x \\ y \end{matrix} \right] + 4 \\ &=& x^2 + y^2 -xy +3x -2y + 4 . \end{eqnarray*}$$ We seek to minimize this function over the triangular region $$\begin{eqnarray*} y &\geq& 2 - x \\ y &\geq& -2 + x \\ y &\leq& 3. \end{eqnarray*}$$ <br />We can find the vertices of this triangle and plot the region in R: <br /><pre class="prettyprint lang-r">plot(0, 0, xlim = c(-2,5.5), ylim = c(-1,3.5), type = "n", <br /> xlab = "x", ylab = "y", main="Feasible Region")<br />polygon(c(2,5,-1), c(0,3,3), border=TRUE, lwd=4, col="blue")<br /></pre><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-xTYyotjGhVI/UryH6slTmxI/AAAAAAAAADs/KlomFFAszdo/s1600/FeasibleRegion.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-xTYyotjGhVI/UryH6slTmxI/AAAAAAAAADs/KlomFFAszdo/s640/FeasibleRegion.png" /></a></div><br />To solve this QP with the quadprog library, we'll need to translate both our objective function and the constraint system into the matrix formulation required by quadprog. From the quadprog documentation<br /><blockquote class="tr_bq">This routine implements the dual method of Goldfarb and Idnani (1982, 1983) for solving quadratic programming problems of the form min(-d^T b + 1/2 b^T D b) with the constraints A^T b >= b_0.</blockquote>It's not hard to see how to put the quadratic form $Q(x,y)$ into the correct matrix form for the objective function of quadprog (but be sure to note the sign pattern and factors of two). First we observe that, for any constant $c$, the minimizer of $Q(x,y)+c$ is the same as the minimizer of $Q(x,y)$. We can therefore ignore all constant terms in the quadratic form $Q(x,y)$. We set:<br />$$D = \left[ \begin{matrix} 2 & -1 \\ -1 & 2 \end{matrix} \right] \qquad d = \left[ \begin{matrix} -3 \\ 2 \end{matrix} \right]. $$<br />We can write the constraint equations in the form:<br />$$\left[ \begin{matrix} 1 & 1 \\ -1 & 1 \\ 0 & -1 \end{matrix} \right] \left[ \begin{matrix} x \\ y \end{matrix} \right] \geq \left[ \begin{matrix} 2 \\ -2 \\ -3 \end{matrix} \right]$$ so that $$A = \left[ \begin{matrix} 1 & 1 \\ -1 & 1 \\ 0 & -1 \end{matrix} \right]^T \qquad b_0 = \left[ \begin{matrix} 2 \\ -2 \\ -3 \end{matrix} \right].$$ Here is the complete implementation in R: <br /><pre class="prettyprint lang-r"><br />require(quadprog)<br />Dmat <- 2*matrix(c(1,-1/2,-1/2,1), nrow = 2, byrow=TRUE)<br />dvec <- c(-3,2)<br />A <- matrix(c(1,1,-1,1,0,-1), ncol = 2 , byrow=TRUE)<br />bvec <- c(2,-2,-3)<br />Amat <- t(A)<br />sol <- solve.QP(Dmat, dvec, Amat, bvec, meq=0)<br /></pre> <br> Note the parameter meq is used to tell quadprog that first meq constraint conditions should be treated as equalities. Let's inspect the output of the quadprog solver: <pre class="prettyprint lang-r">> sol<br />$solution<br />[1] 0.1666667 1.8333333<br />$value<br />[1] -0.08333333<br />$unconstrained.solution<br />[1] -1.3333333 0.3333333<br />$iterations<br />[1] 2 0<br />$Lagrangian<br />[1] 1.5 0.0 0.0<br />$iact<br />[1] 1<br /></pre>The point $(1/6,11/6)$ is the unique minimizer of $Q(x,y)$ subject to the constraint conditions. The point $(-4/3,1/3)$ is the unique minimum of $Q(x,y)$. The slots <i>iterations</i>,<i> Lagrangian</i>, and <i>iact </i> are diagnostics describing the performance of the quadprog algorithm. We'll provide a discussion of these values in a future post. Now, let's visualize the QP solution. For this, we superimpose the boundary of the feasible region on the contour plot of the surface $Q(x,y)$. <div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-_hHQ_xGE9VA/UryT-sK7sBI/AAAAAAAAAEE/RKOlzpPlEtw/s1600/contour.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-_hHQ_xGE9VA/UryT-sK7sBI/AAAAAAAAAEE/RKOlzpPlEtw/s1600/contour.png" /></a></div>In this plot, dark green shading indicates lower altitude regions of the surface $Q(x,y)$, while lighter regions indicate higher altitudes. The red point is the global minimum of $Q(x,y)$ and the yellow point is the solution to the QP. <pre class="prettyprint lang-r"><br /># Contour Plot with Feasible region overlay<br />require(lattice)<br />qp_sol <- sol$solution<br />uc_sol <- sol$unconstrained.solution<br />x <- seq(-2, 5.5, length.out = 500)<br />y <- seq(-1, 3.5, length.out = 500)<br />grid <- expand.grid(x=x, y=y)<br />grid$z <- with(grid, x^2 + y^2 -x*y +3*x -2*y + 4)<br />levelplot(z~x*y, grid, cuts=30,<br /> panel = function(...){<br /> panel.levelplot(...)<br /> panel.polygon(c(2,5,-1),c(0,3,3), border=TRUE, lwd=4, col="transparent")<br /> panel.points(c(uc_sol[1],qp_sol[1]),<br /> c(uc_sol[2],qp_sol[2]),<br /> lwd=5, col=c("red","yellow"), pch=19)},<br /> colorkey=FALSE,<br /> col.regions = terrain.colors(30))<br /></pre> <h3>Example #2:</h3>Suppose we have selected 10 stocks from which to build a portfolio $\Pi$. We want to determine how much of each stock to include in our portfolio. <br> <br> The expected monthly return rate of our portfolio is $$ \overline{r_\Pi} = \sum_{i=1}^{10} w_i \overline{r_i} $$ where $\overline{r_i}$ is the mean monthly return rate on asset $i$ and $w_i$ is the fraction of the portfolio value due to asset $i$. Note that the portfolio weights $w_i$ satisfy the constraints $$ 0 \leq w_i \leq 1 \qquad \qquad \sum_{i=1}^{10} w_i = 1. $$ In practice, we can only estimate the average returns $\overline{r_i}$ using past price data. This is a snap using R's <i>quantmod</i> package: <pre class="prettyprint lang-r"><br /># Get monthly return data from 2012 through 2013<br />require(quantmod)<br />myStocks <- c("AAPL","XOM","GOOGL","MSFT","GE","JNJ","WMT","CVX","PG","WF")<br />getSymbols(myStocks ,src='yahoo')<br />returnsList <- lapply(myStocks, <br /> function(s) periodReturn(eval(parse(text=s)),<br /> period='monthly', subset='2012::2013'))<br /><br /># Combine return time series<br />returns.df <- do.call(cbind,returnsList)<br />colnames(returns.df) <- myStocks<br /></pre> The data frame returns.df contains a time series of monthly returns for each of the 10 specified ticker symbols. Let's plot the monthly returns: <div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-q0V1w1p8CIU/UtNi8f5rIBI/AAAAAAAAAEU/_A_pTIeklgI/s1600/returns.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-q0V1w1p8CIU/UtNi8f5rIBI/AAAAAAAAAEU/_A_pTIeklgI/s1600/returns.png" /></a></div><pre class="prettyprint lang-r"><br /># Plot monthly return data<br />require(ggplot2)<br />require(reshape2)<br />returns2 <- as.data.frame(returns.df)<br />returns2$date <- row.names(returns2)<br />returns2 <- melt(returns2, id="date")<br />ggplot(returns2, aes(x=date,y=value, group=variable)) +<br /> geom_line(aes(color=variable), lwd=1.5) +<br /> ylab("Monthly Return")+ xlab("Date") +<br /> theme(axis.text.x = element_text(angle = 90, hjust = 1))<br /></pre> <br> From this plot, we can see that there is significant fluctuation in return rates. This suggests that the variance of $r_i$ and covariances of $r_i$ and $r_j$ should play a role in our analysis. In effect, these variances and covariances indicate how likely we are to actually observe a portfolio return close to our expected return $\overline{r_\Pi}$. <br> <br> To take into account the risk of deviations in our portfolio return, define the quadratic form $$Q(\vec{w}) = \vec{w}^T C \vec{w}, $$ where $C$ is the covariance matrix of the returns $r_i$. To solve the portfolio allocation problem, we'll try to determine the weights $\vec{w} = (w_1,...,w_{10})$ so that the risk function $Q(\vec{w})$ is minimized. But there are some restrictions to consider. In addition to requiring that $\sum_{i=1}^{10} w_i =1$ and $0 \leq w_i \leq 1$, we may also require a minimum return from the portfolio. For example, we might demand a minimum expected monthly return of 1%: $$ \sum_{i=1}^{10} w_i E(r_i) \geq .01.$$ We can prove that the covariance matrix $C$ is always symmetric positive definite (except in the case of perfect multicollinearity), so this constrained minimization problem is a quadratic programming problem of the type that can be handled by <i> quadprog</i>. <br> <br> Let's now describe how to implement the quadprog solver for this problem. First, compute the average returns and covariance matrix in R: <pre class="prettyprint lang-r"><br /># Compute the average returns and covariance matrix of the return series<br />r <- matrix(colMeans(returns.df), nrow=1)<br />C <- cov(returns.df)<br /></pre> The constraints in this case are a little bit more tricky than in Example #1. For quadprog, all of our constraints must be expressed in a linear system of the form $A^T \vec{w} \geq f$. The system should be arranged so that any equality constraints appear in the first $m$ rows of the system. We build up the matrix $A^T$ step by step using <i>rbind</i> and applying a transpose at the very end. <br> <br> To enforce the sum to one condition we need: $$ \left[ \begin{matrix} 1 & 1 & \cdots 1 \end{matrix} \right] \vec{w} = 1. $$ This equality condition should appear first in the system. To enforce the minimum expected return, we require $r \cdot \vec{w} \geq .01$ where $r$ is the row of average return rates obtained from the dataset. To force $0 \leq w_i \leq 1$, we require $$ I_{10} \vec{w} \geq 0 \qquad \qquad -I_{10} \vec{w} \geq - 1$$ where $I_{10}$ is the $10 \times 10$ identity matrix. Putting these steps together: <br><pre class="prettyprint lang-r"><br /># Stage and solve the QP<br />require(quadprog)<br />A <- matrix(1,1,10)<br />A <- rbind(A, r, diag(10),-diag(10))<br />f <- c(1, 0.01, rep(0,10),rep(-1,10))<br />sol <- solve.QP(Dmat=C, dvec = rep(0,10), Amat=t(A), bvec=f, meq=1)<br /></pre>Let's inspect the optimal allocation: <div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-Yoc8FsClqTU/UtN38BI2_eI/AAAAAAAAAEk/har2NSwFvTI/s1600/portSolution.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-Yoc8FsClqTU/UtN38BI2_eI/AAAAAAAAAEk/har2NSwFvTI/s1600/portSolution.png" /></a></div><pre class="prettyprint lang-r"><br />require(ggplot2)<br />portfolio <- data.frame(name = myStocks, w = round(sol$solution,3))<br />ggplot(portfolio, aes(x=name, y=w)) + geom_bar(stat="identity", fill="blue")<br /></pre> <br>The expected return from this allocation is about 1.2%. It is instructive to solve the quadratic program with different minimum return requirements. For example, there is a solution to this problem with minimum required expected return greater than or equal to 2% but no solution with minimum required expected return greater than or equal to 3%. What is key to note is that the solution with the 2% restriction has a higher value of $Q(\vec{w})$ (more risk) compared to the lower risk solution to the problem with the 1% restriction. As we'd expect, quadratic programming doesn't allow us to escape the risk-return trade-off; it only provides the lowest risk allocation for a given minimum expected return requirement. Ryan Walkerhttps://plus.google.com/112373077410906476606noreply@blogger.com5tag:blogger.com,1999:blog-1887031930241333662.post-65872085296474111582013-11-25T08:30:00.000-08:002013-11-25T09:15:50.350-08:00Generating Ticker Symbols with Markov Chains<script src="https://google-code-prettify.googlecode.com/svn/loader/run_prettify.js?lang=r&skin=default"></script> <a href="http://en.wikipedia.org/wiki/Ticker_symbol">Stock ticker symbols</a> are short character string codes (AAPL for Apple, WMT for Walmart) that uniquely identify stock listings. They originate from the days when stock quotes were transmitted via telegraph. These symbols are often used as tools for branding and companies choose them to be <a href="http://www.thestreet.com/story/11972063/1/the-10-most-clever-stock-ticker-symbols.html">memorable and easily recognized</a>.<br/> <br/> As a fun and simple exercise with text processing in R, we'll build a simple model that generates random character strings which mimic the roughly 6,000 ticker symbols listed on the AMEX, NASDAQ, and NYSE. A basic tool for generating random text patterned off of an input source text <i>is the Markov chain</i>. Markov chains are the engine behind <a href="http://thinkzone.wlonk.com/Gibber/GibGen.htm" target="_blank">gibberish generators</a>, a briefly popular internet meme in the 90's. <br/> <br/> We start with the base of all real ticker symbols. Using the TTR ("Technical Trading Rules") library, it's easy to get the full list for the AMEX, NASDAQ, and NYSE. <br /><pre class="prettyprint lang-r"><br />library(TTR)<br />listings <- stockSymbols() <br /></pre><br/>On November 17, 2013, TTR found 6,462 ticker symbols. The 10 largest symbols by market cap at that time were: <br /><pre class="prettyprint lang-r"><br />listings <- listings[order(listings$MarketCap,decreasing=TRUE),]<br /> Symbol Name MarketCap IPOyear Exchange<br /> AAPL Apple Inc. 472354352358 1980 NASDAQ<br /> XOM Exxon Mobil Corporation 416188308487 NA NYSE<br /> GOOG Google Inc. 345299382446 2004 NASDAQ<br /> MSFT Microsoft Corporation 315895470257 1986 NASDAQ<br /> GE General Electric Company 275192436800 NA NYSE<br /> JNJ Johnson & Johnson 266315572747 NA NYSE<br /> WMT Wal-Mart Stores, Inc. 256994921591 1970 NYSE<br /> CVX Chevron Corporation 230896151941 NA NYSE<br /> PG Procter & Gamble Company (The) 230614695048 NA NYSE<br /> WFC Wells Fargo & Company 229351244873 NA NYSE<br /></pre><br/>Ticker symbols pulled down from TTR range from 1 to 8 characters in length and make use of the capital letters A through Z and the hyphen "-". In ticker symbols that contains hyphens, the characters following a hyphen appear to be the <a href="http://en.wikipedia.org/wiki/Ticker_symbol#United_States">"behind the dot"</a> codes that provide additional information on the asset type. To keep things simple, we'll just strip off all this additional info:<br/><pre class="prettyprint lang-r"><br /> symbols <- unique(gsub("-.*","",listings$Symbol))<br /></pre><br/>This leaves us with a list of 5,594 symbols. Let's get the distribution of symbol lengths for our symbol list: <br/><pre class="prettyprint lang-r"><br />lengths <- nchar(symbols)<br />pLength <- prop.table(table(lengths))<br />print(round(pLength,4))<br />lengths<br />1 2 3 4 5 <br />0.0039 0.0387 0.4572 0.4698 0.0305 <br /></pre><br/>Let's now explain a very simple Markov chain model for text strings. A word $w$ of length $l$ is a combination of letters $c_1, c_2,...,c_n$, which we naturally write as $w=c_1c_2...c_n$. If the first $n-1$ letters of $w$ are known, we write the probability that the $n$-th letter is $c_n$ as $p(c_n|c_1c_2...c_{n-1})$. <br/> <br/> A Markov chain model of $p(c_n|c_1c_2...c_{n-1})$ allows us to assume that $$p(c_n|c_1c_2...c_{n-1})=p(c_n | c_{n-1}).$$ That is, we can assume that the $n$-th letter of the word only depends on the letter that directly precedes it. The Markov assumption often allows us to make huge simplifications to probability models and computations. <br/> <br/> As an example, suppose that ticker symbols are generated by a probabilistic process. If we have a three letter ticker symbol with its first two letters being "WM", we can meaningfully quantify the probability that the third letter is a "T". Under a Markov model, $p(T|WM)=p(T|M)$. Now there is a little bit of ambiguity in the expression $p(T|M)$. Is this the probability that $T$ directly follows an $M$ anywhere in a symbol or is the probability that $T$ is the third letter given $M$ is the second? Here we'll assume that the position information is relevant and we'll compute $$p(c_3=T|WM) = p(c_3=T | c_2 = M).$$ <br/>With our sample of symbols, here is a way to approximate $p(T|WM)$ in R using the Markov model: <pre class="prettyprint lang-r">sum(grepl("^.MT",symbols[lengths==3]))/sum(grepl("^.M",symbols[lengths==3]))<br />[1] 0.08391608<br /></pre>Unpacking the syntax, grepl is regular expression matching and returns TRUE when a match is found. The expression "^.MT" matches all the symbols whose first letter is anything and whose second and third letter are "MT" while "^.M" matches to anything with second letter "M". The sums count the number of TRUE evaluations. <br/> <br/> With the preliminaries out of the way, let's demonstrate a function which generates a random symbol based on the Markov chain model: <br/><pre class="prettyprint lang-r"><br /># Create a random symbol of length n based on a length 1 Markov Chain<br />newSymbol <- function(n=3){<br /> symbolSet <- symbols[lengths==n]<br /> s <- ""<br /> for(i in 1:n){<br /> # Match s in the first i characters of symbolSet<br /> pattern <- paste0("^",substr(s,i-1,i-1))<br /> match <- grepl(pattern,substr(symbolSet,i-1,i-1))<br /> <br /> # Distribution of next letter<br /> nxtLetterDist <- substr(symbolSet,i,i)[nxtLetterDist!="" & match]<br /> <br /> # Sample the next character from the distribution<br /> nextChar <- sample(nxtLetterDist,1)<br /> <br /> # Append to the string<br /> s <- paste0(s,nextChar)<br /> }<br /> return(s)<br />}<br /></pre> <br/>Here is an example of the generator in action: <br/><pre class="prettyprint lang-r"><br />> replicate(10, newSymbol(3))<br /> [1] "EMS" "ZTT" "NBS" "BTG" "MAG" "HAM" "PLK" "ALX" "CFM" "FLS"<br /></pre> <br/> How can we test that the symbol generator is really making symbols that are similar to the actual ticker symbols? Well, one way is to inspect how the generator performs relative to a completely random symbol generator. Since there are 26 letters available to use in a valid ticker symbol, a completely random generator would use any given letter with probability 1/26. So, for example, the distribution of the second character of completely random generated symbol should be a uniform distribution on the letters A-Z with uniform probability 1/26. For the Markov chain model, we would hope that the second character of a generated random symbol would have a probability distribution similar to the distribution of the second character of the set of actual ticker symbols. <br/> The following snippet inspects the distribution of the second character of symbols of length 3 for natural symbols, Markov generated symbols, and completely random symbols. <br/><pre class="prettyprint lang-r"><br /># Generate a bunch of symbols with the Markov chain<br />symbolsA <- unique(replicate(10000, newSymbol(3)))<br /><br /># Build data frame for second character distributions<br />require(ggplot2)<br />require(reshape2)<br />Natural <- prop.table(table(substr(symbols[lengths==3],2,2)))<br />Markov <- prop.table(table(substr(symbolsA,2,2)))<br />secondChar<-as.data.frame(t(rbind(Natural,Markov)))<br />secondChar$character <- rownames(secondChar)<br />secondChar <- melt(secondChar, id=c("character"))<br /><br /># Build the distribution plot<br />ggplot(secondChar, aes(x=character,y=value)) + <br /> geom_point(aes(color=variable), alpha=.75,size=5)+<br /> ylab("Probability")+xlab("Character")+<br /> ggtitle("Ticker Symbols: Second Character Distribution")+<br /> geom_hline(yintercept=1/26, size=1,color="black")<br /></pre> <br/>Here is the plot: <div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-WHjx0_djz3I/UpN6nCRaoTI/AAAAAAAAADE/LzkBfmBXbJ8/s1600/secondDistro.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-WHjx0_djz3I/UpN6nCRaoTI/AAAAAAAAADE/LzkBfmBXbJ8/s1600/secondDistro.png" /></a></div> In this figure, the black line represents the uniform probability distribution corresponding to a completely random generation of length 3 ticker symbols. The blue and red circles compare the Markov generator model to the true distribution of the second characters in length 3 ticker symbols. Ryan Walkerhttps://plus.google.com/112373077410906476606noreply@blogger.com0tag:blogger.com,1999:blog-1887031930241333662.post-77803208581818244922013-11-01T20:54:00.001-07:002014-03-23T15:09:23.995-07:00Underdetermined Systems in RIn this post, we begin a study of underdetermined systems.<b> </b>We'll motivate this study by considering an application problem in competitive intelligence. We note from the start, however, that this post is less about finding a particular solution to an underdetermined problem and more about ways to explore the <i>possible </i>solutions.<br /><br /><b>Problem: </b>We are trying to learn about the sales regions of a key competitor. Suppose the competitor sells three versions of a product: basic, standard and luxury. The product is sold in two regions: North and South. Competitive intelligence mined from the competitor's annual reports gives us the following revenue schedule:<br /><br /><table cellpadding="4" cellspacing="0" style="page-break-after: auto; page-break-before: auto; page-break-inside: auto;"><colgroup><col width="51"></col><col width="51"></col><col width="51"></col><col width="51"></col><col width="51"></col></colgroup><tbody><tr valign="top"><td bgcolor="#000080" style="border-bottom: #000000 1px solid; border-left: #000000 1px solid; border-right: medium none; border-top: #000000 1px solid; padding-bottom: 0.04in; padding-left: 0.04in; padding-right: 0in; padding-top: 0.04in;" width="20%"><div align="left" style="font-style: normal; text-decoration: none;"><span style="color: white;"><span style="font-family: Times New Roman, serif;"><span style="font-size: small;"><b><span style="text-decoration: none;"> <span style="color: white;"><span style="font-family: Times New Roman, serif;"><span style="font-size: small;"><b><span style="text-decoration: none;">Revenue </span></b></span></span></span></span></b></span></span></span></div><div align="left" style="font-style: normal; text-decoration: none;"><span style="color: white;"><span style="font-family: Times New Roman, serif;"><span style="font-size: small;"><b><span style="text-decoration: none;"><span style="color: white;"><span style="font-family: Times New Roman, serif;"><span style="font-size: small;"><b><span style="text-decoration: none;">(k USD) by</span></b></span></span></span> </span></b></span></span></span></div><div align="left" style="font-style: normal; text-decoration: none;"><span style="color: white;"><span style="font-family: Times New Roman, serif;"><span style="font-size: small;"><b><span style="text-decoration: none;">Region/Product</span></b></span></span></span></div></td><td bgcolor="#000080" style="border-bottom: #000000 1px solid; border-left: #000000 1px solid; border-right: medium none; border-top: #000000 1px solid; padding-bottom: 0.04in; padding-left: 0.04in; padding-right: 0in; padding-top: 0.04in;" width="20%"><div align="left" style="font-style: normal; text-decoration: none;"><br /></div><div align="left" style="font-style: normal; text-decoration: none;"><span style="color: white;"><span style="font-family: Times New Roman, serif;"><span style="font-size: small;"><b><span style="text-decoration: none;"> Basic</span></b></span></span></span></div></td><td bgcolor="#000080" style="border-bottom: #000000 1px solid; border-left: #000000 1px solid; border-right: medium none; border-top: #000000 1px solid; padding-bottom: 0.04in; padding-left: 0.04in; padding-right: 0in; padding-top: 0.04in;" width="20%"><div align="left" style="font-style: normal; text-decoration: none;"><br /></div><div align="left" style="font-style: normal; text-decoration: none;"><span style="color: white;"><span style="font-family: Times New Roman, serif;"><span style="font-size: small;"><b><span style="text-decoration: none;"> Standard</span></b></span></span></span></div></td><td bgcolor="#000080" style="border-bottom: #000000 1px solid; border-left: #000000 1px solid; border-right: medium none; border-top: #000000 1px solid; padding-bottom: 0.04in; padding-left: 0.04in; padding-right: 0in; padding-top: 0.04in;" width="20%"><div align="left" style="font-style: normal; text-decoration: none;"><br /></div><div align="left" style="font-style: normal; text-decoration: none;"><span style="color: white;"><span style="font-family: Times New Roman, serif;"><span style="font-size: small;"><b><span style="text-decoration: none;"> Luxury</span></b></span></span></span></div></td><td bgcolor="#000080" style="border-bottom: #000000 1px solid; border-left: #000000 1px solid; border-right: #000000 1px solid; border-top: #000000 1px solid; padding-bottom: 0.04in; padding-left: 0.04in; padding-right: 0.04in; padding-top: 0.04in;" width="20%"><div align="left" style="font-style: normal; text-decoration: none;"><br /></div><div align="left" style="font-style: normal; text-decoration: none;"><span style="color: white;"><span style="font-family: Times New Roman, serif;"><span style="font-size: small;"><b><span style="text-decoration: none;"> TOTAL</span></b></span></span></span></div></td></tr><tr valign="top"><td bgcolor="#4d4d4d" style="border-bottom: #000000 1px solid; border-left: #000000 1px solid; border-right: medium none; border-top: medium none; padding-bottom: 0.04in; padding-left: 0.04in; padding-right: 0in; padding-top: 0in;" width="20%"><div align="left" style="font-style: normal; text-decoration: none;"><span style="color: white;"><span style="font-family: Times New Roman, serif;"><span style="font-size: small;"><b><span style="text-decoration: none;">North</span></b></span></span></span></div></td><td bgcolor="#ffffff" style="border-bottom: #000000 1px solid; border-left: #000000 1px solid; border-right: medium none; border-top: medium none; padding-bottom: 0.04in; padding-left: 0.04in; padding-right: 0in; padding-top: 0in;" width="20%"><div align="center" style="font-style: normal; font-weight: normal; text-decoration: none;"><span style="color: black;"><span style="font-family: Times New Roman, serif;"><span style="font-size: small;"><span style="text-decoration: none;">x1</span></span></span></span></div></td><td bgcolor="#ffffff" style="border-bottom: #000000 1px solid; border-left: #000000 1px solid; border-right: medium none; border-top: medium none; padding-bottom: 0.04in; padding-left: 0.04in; padding-right: 0in; padding-top: 0in;" width="20%"><div align="center" style="font-style: normal; font-weight: normal; text-decoration: none;"><span style="color: black;"><span style="font-family: Times New Roman, serif;"><span style="font-size: small;"><span style="text-decoration: none;">x2</span></span></span></span></div></td><td bgcolor="#ffffff" style="border-bottom: #000000 1px solid; border-left: #000000 1px solid; border-right: medium none; border-top: medium none; padding-bottom: 0.04in; padding-left: 0.04in; padding-right: 0in; padding-top: 0in;" width="20%"><div align="center" style="font-style: normal; font-weight: normal; text-decoration: none;"><span style="color: black;"><span style="font-family: Times New Roman, serif;"><span style="font-size: small;"><span style="text-decoration: none;">x3</span></span></span></span></div></td><td bgcolor="#cccccc" style="border-bottom: #000000 1px solid; border-left: #000000 1px solid; border-right: #000000 1px solid; border-top: medium none; padding-bottom: 0.04in; padding-left: 0.04in; padding-right: 0.04in; padding-top: 0in;" width="20%"><div align="left" style="font-style: normal; font-weight: normal; text-decoration: none;"><span style="color: black;"><span style="font-family: Times New Roman, serif;"><span style="font-size: small;"><span style="text-decoration: none;">650</span></span></span></span></div></td></tr><tr valign="top"><td bgcolor="#4d4d4d" style="border-bottom: #000000 1px solid; border-left: #000000 1px solid; border-right: medium none; border-top: medium none; padding-bottom: 0.04in; padding-left: 0.04in; padding-right: 0in; padding-top: 0in;" width="20%"><div align="left" style="font-style: normal; text-decoration: none;"><span style="color: white;"><span style="font-family: Times New Roman, serif;"><span style="font-size: small;"><b><span style="text-decoration: none;">South</span></b></span></span></span></div></td><td bgcolor="#ffffff" style="border-bottom: #000000 1px solid; border-left: #000000 1px solid; border-right: medium none; border-top: medium none; padding-bottom: 0.04in; padding-left: 0.04in; padding-right: 0in; padding-top: 0in;" width="20%"><div align="center" style="font-style: normal; font-weight: normal; text-decoration: none;"><span style="color: black;"><span style="font-family: Times New Roman, serif;"><span style="font-size: small;"><span style="text-decoration: none;">x4</span></span></span></span></div></td><td bgcolor="#ffffff" style="border-bottom: #000000 1px solid; border-left: #000000 1px solid; border-right: medium none; border-top: medium none; padding-bottom: 0.04in; padding-left: 0.04in; padding-right: 0in; padding-top: 0in;" width="20%"><div align="center" style="font-style: normal; font-weight: normal; text-decoration: none;"><span style="color: black;"><span style="font-family: Times New Roman, serif;"><span style="font-size: small;"><span style="text-decoration: none;">x5</span></span></span></span></div></td><td bgcolor="#ffffff" style="border-bottom: #000000 1px solid; border-left: #000000 1px solid; border-right: medium none; border-top: medium none; padding-bottom: 0.04in; padding-left: 0.04in; padding-right: 0in; padding-top: 0in;" width="20%"><div align="center" style="font-style: normal; font-weight: normal; text-decoration: none;"><span style="color: black;"><span style="font-family: Times New Roman, serif;"><span style="font-size: small;"><span style="text-decoration: none;">x6</span></span></span></span></div></td><td bgcolor="#cccccc" style="border-bottom: #000000 1px solid; border-left: #000000 1px solid; border-right: #000000 1px solid; border-top: medium none; padding-bottom: 0.04in; padding-left: 0.04in; padding-right: 0.04in; padding-top: 0in;" width="20%"><div align="left" style="font-style: normal; font-weight: normal; text-decoration: none;"><span style="color: black;"><span style="font-family: Times New Roman, serif;"><span style="font-size: small;"><span style="text-decoration: none;">350</span></span></span></span></div></td></tr><tr valign="top"><td bgcolor="#4d4d4d" style="border-bottom: #000000 1px solid; border-left: #000000 1px solid; border-right: medium none; border-top: medium none; padding-bottom: 0.04in; padding-left: 0.04in; padding-right: 0in; padding-top: 0in;" width="20%"><div align="left" style="font-style: normal; text-decoration: none;"><span style="color: white;"><span style="font-family: Times New Roman, serif;"><span style="font-size: small;"><b><span style="text-decoration: none;">TOTAL</span></b></span></span></span></div></td><td bgcolor="#cccccc" style="border-bottom: #000000 1px solid; border-left: #000000 1px solid; border-right: medium none; border-top: medium none; padding-bottom: 0.04in; padding-left: 0.04in; padding-right: 0in; padding-top: 0in;" width="20%"><div align="center" style="font-style: normal; font-weight: normal; text-decoration: none;"><span style="color: black;"><span style="font-family: Times New Roman, serif;"><span style="font-size: small;"><span style="text-decoration: none;">300</span></span></span></span></div></td><td bgcolor="#cccccc" style="border-bottom: #000000 1px solid; border-left: #000000 1px solid; border-right: medium none; border-top: medium none; padding-bottom: 0.04in; padding-left: 0.04in; padding-right: 0in; padding-top: 0in;" width="20%"><div align="center" style="font-style: normal; font-weight: normal; text-decoration: none;"><span style="color: black;"><span style="font-family: Times New Roman, serif;"><span style="font-size: small;"><span style="text-decoration: none;">450</span></span></span></span></div></td><td bgcolor="#cccccc" style="border-bottom: #000000 1px solid; border-left: #000000 1px solid; border-right: medium none; border-top: medium none; padding-bottom: 0.04in; padding-left: 0.04in; padding-right: 0in; padding-top: 0in;" width="20%"><div align="center" style="font-style: normal; font-weight: normal; text-decoration: none;"><span style="color: black;"><span style="font-family: Times New Roman, serif;"><span style="font-size: small;"><span style="text-decoration: none;">250</span></span></span></span></div></td><td bgcolor="#cccccc" style="border-bottom: #000000 1px solid; border-left: #000000 1px solid; border-right: #000000 1px solid; border-top: medium none; padding-bottom: 0.04in; padding-left: 0.04in; padding-right: 0.04in; padding-top: 0in;" width="20%"><div align="left" style="font-style: normal; font-weight: normal; text-decoration: none;"><span style="color: black;"><span style="font-family: Times New Roman, serif;"><span style="font-size: small;"><span style="text-decoration: none;">1000</span></span></span></span></div></td></tr></tbody></table><br />The values of x1,x2,...,x6 are unknown revenue amounts for a product/region combination. In this problem, we would like to investigate the possible values of x1,x2,...,x6 which are <i>consistent </i>with the known revenue totals appearing in the margins of the table. Any set of values x1,x2,x3,...,x6 that is consistent with all the known revenue totals, we'll call a <i>feasible solution.</i><br /><br />We can read off the following 5 relations from the table:<br /><br />$$ \begin{align}<br />x_1 + x_2 + x_3 &= 650 \\<br />x_4 + x_5 + x_6 &= 350 \\<br />x_1 + x_4 &= 300 \\<br />x_2 + x_5 &= 450 \\<br />x_3 + x_6 & =250 <br />\end{align}<br />$$<br />Actually, there is a bit of redundancy in these equations. For example, the first equation is equal to the sum of the last three equations minus the second equation. It's important to check for and remove all redundancy from the system for the sake of the computations we'll do below. We remove one of the redundant equations and rewrite the system in the matrix form as:<br /><br />$$ \left[ \begin{matrix} 1 & 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 1 \end{matrix} \right]<br />\left[ \begin{matrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{matrix} \right] = \left[ \begin{matrix} 650 \\ 300 \\ 450 \\ 250 \end{matrix} \right].$$<br /><br /><pre style="background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> A <- matrix(c( <br /> 1,1,1,0,0,0, <br /> 1,0,0,1,0,0, <br /> 0,1,0,0,1,0, <br /> 0,0,1,0,0,1), <br /> byrow=TRUE,nrow=4, ncol=6) <br /> f <- matrix(c(400,450,250,300), nrow=4, ncol=1) <br /></code></pre><br />This system, which we express symbolically as \(A\vec{x} = \vec{f}\), is an <i>underdetermined linear system </i>since it consists of 6 unknowns but only 4 equations. A key fact to recognize is that an underdetermined linear system that has a solution will always have infinitely many.<br /><br />We will investigate three issues around underdetermined systems in this post:<br /><ol><li>When does a system have a solution?</li><li>If an underdetermined system has solutions, how can we generate them?</li><li>Once we have a method for generating solutions, can we develop any criteria for preferring one particular solution among the infinitely many possible solutions?</li></ol><ol></ol><h1>Consistency and Stability </h1>The first issue is easy to handle. A system of linear equations is called <i>consistent </i>if it has one or more solutions. The system \(A\vec{x} = \vec{f}\) has a solution whenever \(\vec{f}\) is in the column space of \(A\), i.e. whenever \(\vec{f}\) is a linear combination of the columns of \(A\). This leads us to consider the notion of <i>matrix</i> <i>rank</i>: the rank of \(A\) is the number of linearly independent columns in \(A\). In these terms, we have the following consistency test: a system is consistent if and only if the rank of \(A\) is the same as the rank of \([A|\vec{f}]\), the matrix attained by attaching \(\vec{f}\) as a column to \(A\).<br /><br />Here's one way to implement the consistency test in R:<br /><pre style="background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > require(Matrix) <br /> > rankMatrix(A) <br /> [1] 4 <br /> attr(,"method") <br /> [1] "tolNorm2" <br /> attr(,"useGrad") <br /> [1] FALSE <br /> attr(,"tol") <br /> [1] 2.76354e-15<br /><br /> > rankMatrix(cbind(A,f))<br />[1] 4<br />attr(,"method")<br />[1] "tolNorm2"<br />attr(,"useGrad")<br />[1] FALSE<br />attr(,"tol")<br />[1] 1.377126e-12<br /></code></pre><br />We can see that \(A\) and \([A|f]\) have the same rank, so that \(A\vec{x} = \vec{f}\) has solutions. Note that accurate computation of the rank of \(A\) (as well as the QR decomposition of \(A\) that we perform below) require that the matrix \(A\) be numerically "well-behaved". One way to quantify numerical "well-behaved" is to use the concept of the <a href="http://en.wikipedia.org/wiki/Condition_number" target="_blank">condition number</a> of \(A\). In base R, the condition number of rectangular matrix \(A\) is<br /><br />$$ \kappa(A) = \frac{\sigma_{\max}(A)}{\sigma_\min (A)}$$<br />where $\sigma_\max,\sigma_\min$ are the maximum and minimum non-zero <a href="http://en.wikipedia.org/wiki/Singular_value_decomposition" target="_blank">singular values</a> of \(A\).<br /><br /><pre style="background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > kappa(A) <br /> [1] 2.987313 <br /></code></pre><br />The condition number of \(A\) roughly describes how sensitive the output of the transformation \(A\) is to a small change in the input. We should be worried about doing any computations with \(A\) if \(\kappa(A)\) is very large. Fortunately, our condition number here is quite reasonable. Note that a large condition number may suggest that \(A\) contains some redundant rows.<br /><h1>The Space of Solutions</h1>For real-valued \(A\), \(\vec{f}\), every solution to the system<br />$$ A \vec{x} = \vec{f} $$<br />can be written in the form<br />$$ \vec{x} = \vec{x}_0 + \sum_{i=1}^{n_0} w_i q_i $$ <br />where<br /><ul><li>\(\vec{x}_0\) is any particular solution to \(A\vec{x} = \vec{f}\) </li><li>\(q_1,...,q_{n_0}\) is a basis for the nullspace of \(A\)</li><li>\(w_1,...,w_n\) are real valued coefficients </li><li>\(n_0\) is the dimension of the nullspace of \(A\). If \(A\) is \(n \times m\) with rank \(m\) then the rank-nullity theorem tells us that \(n_0 = n-m\).</li></ul>We will write this representation in the shorthand<br />$$ \vec{x} = \vec{x}_0 + Q \vec{w}$$<br />where \(Q\) is a \(n \times (m-n)\) matrix whose columns are a basis for the nullspace of \(A\) and \(\vec{w} \in \mathbb{R}^{n-m}\). We note that in general there are an infinite number of bases for the nullspace of a matrix, so this representation cannot be unique.<br /><br />We now want to describe procedures in R that will allow us to numerically determine (1) a particular solution \(\vec{x}_0\) to \(A\vec{x} = \vec{f}\) and (2) a convenient choice of the matrix \(Q\).<br /><br />To obtain a particular solution of \(A\vec{x} = \vec{f}\), we can make use of the <a href="http://en.wikipedia.org/wiki/Pseudoinverse" target="_blank">pseudoinverse</a> of \(A\). Assuming the \(m \times n\) matrix \(A\) has full rank with \(m <n\), the matrix \(AA^T\) is an \(m \times m\) matrix of full rank, that is \(AA^T\) is invertible. It is not hard to prove the identity<br /> $$ A[A^T (AA^T)^{-1}] = I_n $$<br />where \(I_n\) is the \(n \times n\) identity matrix. Right-multiplying both sides of this equation by $\vec{f}$, we obtain<br />$$ A[A^T (AA^T)^{-1} \vec{f}] = \vec{f}$$.<br />Thus a particular solution of $A \vec{x} = \vec{f}$ is<br />$$ \vec{x}_0 = A^T (AA^T)^{-1} \vec{f}.$$<br />Though there are infinitely many particular solutions we could use, the solution \(\vec{x}_0\) turns out to be the solution with <a href="http://see.stanford.edu/materials/lsoeldsee263/08-min-norm.pdf" target="_blank">smallest Euclidean norm</a>. Here is one way to compute \(\vec{x}_0\) in R:<br /><br /><pre style="background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> x0 <- t(A)%*%solve(A%*%t(A),f) <br /></code></pre><br />For the example problem:<br /><pre style="background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > x0 <br /> [,1] <br /> [1,] 191.66667 <br /> [2,] 91.66667 <br /> [3,] 116.66667 <br /> [4,] 258.33333 <br /> [5,] 158.33333 <br /> [6,] 183.33333 <br /></code></pre><br />Next, we need to obtain a basis for the nullspace of \(A\). One approach for this is to apply a <a href="http://en.wikipedia.org/wiki/QR_decomposition" target="_blank">QR factorization</a> to \(A^T\). This will decompose \(A^T\) as<br />$$A^T = \left[ \begin{matrix} Q_1 & Q_2 \end{matrix} \right] \left[ \begin{matrix} R_1 \\ R_2 \end{matrix} \right]$$<br />where \(Q_1\) is \(n \times m \) and \(Q_2\) is \(n \times (n-m)\). The columns of \(Q_2\) are a basis for the null space of \(A\).<br /><br />Here is an R implementation: <br /><pre style="background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> QR <- qr(t(A)) <br /> Q <- qr.Q(QR,complete=TRUE)[,-(1:QR$rank)] <br />> Q<br /> [,1] [,2]<br />[1,] 0.2654822 0.5126915<br />[2,] -0.5767449 -0.0264314<br />[3,] 0.3112627 -0.4862601<br />[4,] -0.2654822 -0.5126915<br />[5,] 0.5767449 0.0264314<br />[6,] -0.3112627 0.4862601<br /></code></pre><br />We now have our solution representation \(\vec{x} = \vec{x}_0 + Q \vec{w}\). To see that this is really a solution for any choice of \(\vec{w}\), note that<br />$$ A Q = 0_{n \times (n-m) } $$ <br />and compute<br />$$ A \vec{x} = A \vec{x}_0 + A Q \vec{w} = \vec{f} + 0_{n \times (n-m)} \vec{w} = \vec{f}.$$<br /><h1>Criteria for Selecting a Solution</h1>Now we consider the question of whether there is any criteria to prefer one solution of \(A\vec{x} = \vec{f}\) to another. One obvious requirement is that our solution should have all non-negative components \(\vec{x} \geq 0\). <br /><br />How can we ensure that \(\vec{x} = \vec{x}_0 + Q \vec{w}\) is non-negative? Satisfying this condition will require us to choose \(\vec{w}\) carefully. In the context of our problem, it turns out that the set of \(\vec{w} = (w_1,w_2) \) which satisfy \( \vec{x}_0 + Q \vec{w} \geq 0\) defines a convex polygon in the \(w_1,w_2\) plane. [More generally, the set of \(\vec{w} \in \mathbb{R}^m\) such that \( \vec{x}_0 + Q \vec{w} \geq 0\) defines a <a href="http://en.wikipedia.org/wiki/Convex_polytope" target="_blank">convex polytope</a> in \(\mathbb{R}^m\).] We could imagine a scheme that generates positive solutions by finding vectors lying within this convex polygon. This is a rather interesting idea but will be very difficult to implement for even moderately sized problems. Even just the task of <a href="http://en.wikipedia.org/wiki/Vertex_enumeration_problem" target="_blank">finding all the vertices of a polygon (polytope)</a> defined by a matrix inequality \(A\vec{x} \leq b\) is quite difficult! We'll leave this line of thinking for another post.<br /><br />Another approach is to formulate the problem as a <a href="http://en.wikipedia.org/wiki/Linear_programming" target="_blank">linear program</a>:<br />$$\left\{ \begin{array}{l} \mathrm{minimize:} \qquad \vec{c} \cdot \vec{x} \\ \mathrm{subject \; to:} \quad A\vec{x}= \vec{f}, \qquad x \geq 0 \end{array} \right.$$ <br /><br />For this particular application, the vector \(\vec{c}\) can be chosen to be anything we like, since the goal is to find a solution that satisfies the \(A \vec{x} = \vec{f}\) and \(\vec{x} \geq 0 \) constraints. By varying the \(\vec{c}\) function, we'll get different solutions to the system.<br /><br />Here is an implementation using the R package linprog: <br /><pre style="background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > require(linprog) <br /> > cvec <- rep(1,6) <br /> > solveLP( cvec, f, A, const.dir=rep("=",4), lpSolve=TRUE) <br /> Results of Linear Programming / Linear Optimization <br /> (using lpSolve) <br /> Objective function (Minimum): 1000 <br /> Solution <br /> opt <br /> 1 0 <br /> 2 100 <br /> 3 300 <br /> 4 450 <br /> 5 150 <br /> 6 0 <br /> Constraints <br /> actual dir bvec free <br /> 1 400 = 400 0 <br /> 2 450 = 450 0 <br /> 3 250 = 250 0 <br /> 4 300 = 300 0 <br /></code></pre><br />A drawback of the linear programming approach is that (in the absence of further information) it tends to choose the sparsest possible solutions. That is, it <br />"forces" zeros into the solution vector. The reason for this is fairly clear from the mechanics of the simplex algorithm used to solve linear programs. We see that for the example problem, linear programming gives us the solution with x1=x6=0. Varying the \(\vec{c}\) gives different but still sparse solutions to the system.<br /><br />For an algorithm that does not give preference to the sparsest solutions of \(A\vec{x}=\vec{f}\), we can turn to nonlinear optimization. Here is one simple approach. We observe that for the example problem \(\sum_{i=1}^6 x_i = 1000\) and so<br />$$ \mu = \frac{1}{6} \sum_{i=1}^6 x_i = \frac{1000}{6} \approx 166.667. $$<br />We then define the <i>objective function</i><br />$$ F( \vec{x} ) = \sum_{i=1}^6 (x_i - \mu)^2$$<br />We seek the vector \(\vec{x} \in \mathbb{R}^6\) such that \(\vec{x}\) minimizes \(F(\vec{x})\) and satisfies<br />$$ \vec{x} \geq 0 \qquad A \vec{x} = \vec{f}.$$ <br /><br />In words, we must minimize the squared deviation between the solution components and the mean, subject to the system constraints. This is a simple example of a quadratic program. There many approaches available to solve this species of optimization problem and R has several dedicated libraries. We will use the 'quadprog' library here. This library solves quadratic programming problems of the form<br />$$ \left\{ \begin{array}{l} -\vec{d}^T \vec{x} + \frac{1}{2} \vec{x}^T D \vec{x} \\ A_1^T \vec{x} = \vec{f}_1 \\ A_2^T \vec{x} \geq \vec{f}_2 \end{array} \right. $$<br />Here \(D\) is a \(n \times n\) symmetric, positive definite matrix.<br /><br />To cast our problem in this form, we write<br />$$ F(\vec{x}) = (\vec{x} - \vec{\mu} )^T (\vec{x} - \vec{\mu}) = \vec{x}^T \vec{x} - 2 \vec{\mu}^T \vec{x} + \vec{\mu}^T \vec{\mu} \geq \vec{x}^T \vec{x} - 2 \vec{\mu}^T \vec{x}$$<br />where \(\vec{\mu} = \mu (1,\ldots, 1)\). We recognize that minimizing \(F(\vec{x})\) is equivalent to minimizing<br />$$ \tilde{F}(\vec{x}) = \frac{1}{2} \vec{x}^T D \vec{x} - \vec{d}^T \vec{x},$$<br />where \(\vec{d} = 2 \vec{\mu}\) and \(D = 2 I \). Then, we set \(A_1 = A^T\), \(A_2 = I\), \( \vec{f}_1 = \vec{f} \), and \( \vec{f}_2 = \vec{0} \). <br /><br />Here is the R implementation:<br /><pre style="background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> require(quadprog) <br /> Dmat <- 2*diag(6) <br /> dvec <- 2*(1000/6)*rep(1,6) <br /> A1 <- A <br /> A2 <- diag(6) <br /> Amat <- t(rbind(A1,A2)) <br /> f1 <- f <br /> f2 <- rep(0,6) <br /> bvec <- c(f1,f2) <br /></code></pre><br />Now, we call the solver (note that the meq=4 parameter tells quadprog that the first four constraints in our system are equality constraints and the rest are inequality constraints): <br /><pre style="background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > solve.QP(Dmat, dvec, Amat, bvec, meq=4) <br /> $solution <br /> [1] 191.66667 91.66667 116.66667 258.33333 158.33333 183.33333 <br /> $value <br /> [1] -149166.7 <br /> $unconstrained.solution <br /> [1] 166.6667 166.6667 166.6667 166.6667 166.6667 166.6667 <br /></code></pre><br />Clearly, we can generate many different solutions to the underdetermined system by modifying the objective function \(F(\vec{x})\). For example, replacing \(\vec{\mu}\) with 0, we'll obtain the smallest 2-norm non-negative solution that satisfies \(A\vec{x} = \vec{f}\). This should agree with \(\vec{x}_0\) in the case where it happens that \(\vec{x}_0 \geq 0\).<br /><br />The functional form of \(F(\vec{x})\) allows us to assign preference to particular feasible solutions. If we have additional expectations/beliefs about the values of x1,...,x6, these can be built into the objective function. In the simple example above, our additional expectation is that the individual solution components shouldn't be very different from the mean. Clearly, much more complex ideas can be built into the objective function to find feasible solutions that are also consistent with our expectations about x1,x2,...,x6.<br /><br />Ryan Walkerhttps://plus.google.com/112373077410906476606noreply@blogger.com0tag:blogger.com,1999:blog-1887031930241333662.post-8823620014622228412013-09-07T20:06:00.000-07:002013-09-07T23:46:14.811-07:00 Optimization in R (Part I)This post is the first in a series on solving practical convex optimization problems with R. My goal for the series is to work up to some interesting applications of the constrained optimization libraries available in R. In particular, I plan to work out some test cases and practical examples using the <a href="http://cran.r-project.org/web/packages/quadprog/quadprog.pdf" target="_blank">quadprog package</a>. <br /><br />In this very introductory post, however, I want to illustrate some important ideas about convexity using R.<br /><br />A convex function function of a single variable resembles something like the blue curve in the following diagram:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-eMMsv95MLt0/UifuMYzYQOI/AAAAAAAAABI/teRVBkqGOjA/s1600/convex1D.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="http://1.bp.blogspot.com/-eMMsv95MLt0/UifuMYzYQOI/AAAAAAAAABI/teRVBkqGOjA/s400/convex1D.png" width="400" /></a></div><pre style="background-image: URL(http://2.bp.blogspot.com/_z5ltvMQPaa8/SjJXr_U2YBI/AAAAAAAAAAM/46OqEP32CJ8/s320/codebg.gif); background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> x<-seq(from=-1,to=1.5,by=.01) <br /> y<-(x-1/2)^2 +1 <br /> segment <-data.frame(x=c(-1/2,1),y=c(2,1.25)) <br /> min_point <-data.frame(x=1/2,y=1) <br /> mydata <-data.frame(x,y) <br /> ggplot(mydata, aes(x, y)) + geom_line(size=3, color="blue")+ <br /> geom_point(data=min_point,color="yellow", size=5)+ <br /> geom_line(data=segment, color="red", size=2)+ <br /> geom_point(data=segment, color="red", size=5) <br /></code></pre><br />The point of this diagram is that whenever you connect two points on the graph of a convex function with a straight line segment, the line segment always lies above the graph of the function. A (strictly) convex function always has a global minimum, which is the point in yellow in the diagram.<br /><br />Here's what a convex function of two variables looks like:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-yeXq916mUV4/Uity2q9Vk_I/AAAAAAAAABY/dZlJWAB0gDk/s1600/surface.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-yeXq916mUV4/Uity2q9Vk_I/AAAAAAAAABY/dZlJWAB0gDk/s1600/surface.png" /></a></div><pre style="background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> library(rgl) <br /> # Define surface <br /> f <- function(x,y) x^2+y^2 <br /> # Build a grid <br /> x<-seq(from=-1,to=1,.01) <br /> y<-seq(from=-1,to=1,.01) <br /> z<-outer(x,y,f) <br /> # Plot <br /> open3d() <br /> rgl.surface(x,y,z, color='green',alpha=.75, smooth=TRUE) <br /> axes3d( color='black', alpha=1) <br /> title3d('f(x,y)=x^2+y^2','','x','z','y', color='blue', alpha=1)</code></pre><br />It's not hard to imagine that any line segment connecting two points on this convex surface will again lie completely above the surface. Moreover, we immediately recognize an absolute minimum at the origin. Though more difficult to visualize, the definition of convexity is much the same in higher dimensions: a function is convex if a line segment connecting two points on the function's graph always remains above the graph.<br /><br />One reason to care about the convexity is because it is a condition that ensures a function has a minimum. So given a convex function, how do you locate its minimum? Imagine that you were placed somewhere on a convex surface like the one appearing in the figure above. If you wanted to find the global minimum, you'd likely begin walking in the direction of steepest descent. You'd continue to follow along the path of steepest descent until you can no longer descend. At this point, because you know the surface is convex, you know that you are standing on the minimum.<br /><br />This very simple idea is the basis for the family of <a href="http://en.wikipedia.org/wiki/Gradient_descent">gradient descent algorithms</a> for finding the minimum of a convex function. The gradient of a function \(f\) at a point \(x_0\) is the vector<br />$$\nabla f(x_0) = \left. \left(\frac{\partial f}{\partial x_1} , ... , \frac{\partial f}{\partial x_n} \right) \right|_{x=x_0}$$<br /><br />Here is a simple numerical implementation of the gradient that uses centred differences to approximate the derivatives in this definition.<br /><br /><pre style="background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> # gradient function <br /> grad <-function(f,x,h){ <br /> len <-length(x) <br /> y<-rep(NA,len) <br /> # Compute the gradient with a centred difference <br /> for(i in 1:len){ <br /> xp<-x <br /> xm<-x <br /> xp[i]<-x[i]+h <br /> xm[i]<-x[i]-h <br /> y[i]<-(f(xp)-f(xm))/(2*h) <br /> } <br /> return(y) <br /> } <br /></code></pre><br />[<b>Warning: </b>For actual numerical work, you probably don't want to write your own derivative functions. There are excellent libraries available to do this kind of work. See, for example, the <a href="http://cran.r-project.org/web/packages/numDeriv/numDeriv.pdf">"numDeriv"</a> package.]<br /><br />The gradient at \(x_0\) always points in the direction of <i>steepest ascent</i> from the point \(x_0\). To find the minimum, we at some initial point on the surface and compute the gradient. Then we take a step in the direction exactly opposite the direction of the gradient vector. If the step size is chosen well, we should find ourselves at a point lower on the surface than the point at which we started. We repeat this process until we can descend no further (that is, the gradient is the zero vector). <br /><br />Choosing the right step size at each level in the algorithm is a sticky point. Though a uniform step size might work for very simple functions, in general uniform steps can cause the algorithm to fail to converge. A standard approach is to scale the step size based on the length of the gradient vector or the inter-step performance.<br /><br />Here's a very simple, adaptive gradient descent implementation. At each iteration, we start with a step size of del and repeatedly cut the step size in half until we find a step size that results in an improvement.<br /><br /><pre style="background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> # Gradient descent method <br /> gradDescend<-function(f,x0,del){ <br /> z0<-f(x0) <br /> eps<-1e-8 # Tolerance <br /> G<-grad(f,x0,1e-4) <br /> x<-x0-del*G <br /> # Find a choice of step size so that the result improves <br /> while(z0<f(x)+eps & del>eps){ <br /> del<-del/2 <br /> x<-x0-del*G <br /> } <br /> # If improvement is only possible by step size below the tolerance, <br /> # we assume convergence and return NULL. Otherwise, return the improved x. <br /> if(del<eps) return(NULL) <br /> else return(x) <br /> } <br /></code></pre><br />[<b>Warning: </b>Here especially you'll want to rely on R libraries rather than on your own implementations for actual numerical work. Functions like <a href="http://stat.ethz.ch/R-manual/R-devel/library/stats/html/optim.html" target="_blank">optim</a> are equipped to deal with the pathological cases one can encounter in practice.]<br /><br />Wikipedia has a nice list of <a href="http://en.wikipedia.org/wiki/Test_functions_for_optimization" target="_blank">test functions</a> for convex optimization problems. Let's compare our steepest descent method to R's optim function using two of the suggested test functions:<br /><br /><pre style="background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> # Four variable Sphere function--min at (0,0,0,0) <br /> f1<- function(x) x[1]^2+x[2]^2+x[3]^2 + x[4]^2 <br /> # Two variable Rosenbrock function--pathological example with min at (1,1). Note: convex at (1,1) but not globally convex.<br /> f2<- function(x) (1-x[1])^2+100*(x[2]-x[1]^2)^2 <br /></code></pre><br />Run the tests: <br /><pre style="background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> # Initial values <br /> g1 <- c(1/2,-1/3,-1/5,2) <br /> g2 <- c(-1/2, 1) <br /> # Optim <br /> opt_f1 <-optim(par=g1, f1, method="BFGS") <br /> opt_f2 <-optim(par=g2, f2, method="BFGS") <br /> # Tester function <br /> descentTester = function(f,x0,del){ <br /> x<-x0 <br /> path <- NULL # Storage for path to minimum. <br /> while(!is.null(x)){ <br /> path<-rbind(path,x) <br /> x<-gradDescend(f,x,del) <br /> } <br /> rownames(path) = 1:nrow(path) <br /> return(path) <br /> } <br /> path1<-descentTester(f1,g1,1/3) <br /> path2<-descentTester(f2,g2,1/3) <br /></code></pre><br />We specify the <a href="http://en.wikipedia.org/wiki/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm" target="_blank">BFGS</a> method in optim. This method uses information from both the function and its gradient to minimize the function, so it provides a reasonable comparison against our simple implementation. Here is the output from optim stored in opt_f1, opt_f2:<br /><br /><pre style="background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > opt_f1 <br /> $par <br /> [1] 1.230906e-13 -1.748844e-13 -4.921677e-14 -6.452512e-14 <br /> $value <br /> [1] 5.366749e-26 <br /> $counts <br /> function gradient <br /> 9 3 <br /> $convergence <br /> [1] 0 <br /> $message <br /> NULL <br /> > opt_f2 <br /> $par <br /> [1] 0.9998002 0.9996004 <br /> $value <br /> [1] 3.99241e-08 <br /> $counts <br /> function gradient <br /> 119 48 <br /> $convergence <br /> [1] 0 <br /> $message <br /> NULL <br /></code></pre><br />How did optim do on our two examples? Optim found the solution to both problems. It found the first solution with only 9 evaluations of f1 and 3 evaluations of f1's gradient. The second solution took 119 evaluations of f2 and 48 evaluations of its gradient.<br /><br />We can look at the data frames path1 and path2 to assess the performance of our steepest descent implementation.<br /><pre style="background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> > nrow(path1) <br /> [1] 11 <br /> > nrow(path2) <br /> [1] 6197 <br /> > path1[nrow(path1),] <br /> [1] 8.467544e-06 -5.645029e-06 -3.387018e-06 3.387018e-05 <br /> > path2[nrow(path2),] <br /> [1] 1.002628 1.005267 <br /></code></pre><br />First, we note that our numerical precision is considerably lower than the implementation in optim. We can adjust this somewhat by lowering the tolerance value in the gradDescent function.<br /><br />In terms of computational efficiency, our implementation is on par with the optim for the first example. In the second, however, we took thousands more iterations to converge to the numerical minimum. Part of the reason for this is that the minimum for f2 lies in a deep, curved crease. Once we step into this crease, there is considerable zig-zagging on the approach to the minimum. A more sophisticated implementation might use more information on the local topography and the history of our steps to recognize when we have fallen into this oscillating approach pattern.<br /><br /><br />Here's a very zoomed-in picture of the alternation pattern within the crease of the f2 function. Note that we are still at a considerable distance from the minimum at (1,1).<br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-TPtfZdlNv1I/UiwZb06SWaI/AAAAAAAAACM/fgUdORDDacQ/s1600/rosenbrock.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-TPtfZdlNv1I/UiwZb06SWaI/AAAAAAAAACM/fgUdORDDacQ/s1600/rosenbrock.png" /></a></div><pre style="background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> path2<-as.data.frame(path2) <br /> colnames(path2) <-c("x","y") <br /> ggplot(data=path2, aes(x=x, y=y))+ geom_path(color="black", size=1) + geom_point(size=5, color="red")+ <br /> scale_x_continuous(limits=c(1.0105,1.0107))+scale_y_continuous(limits=c(1.0211,1.0215))+ <br /> ggtitle("Gradient Descent in the Crease of the Rosenbrock Function") <br /></code></pre><br /><br />We also note that our implementation is very sensitive to the size of the initial step size parameter del. If you think about the geometry of the f1 example, you'll realize that there's a certain special choice of del that can get you to the minimum in a single step. In f2, certain choices of del lead to far better computational performance than others.<br /><br />Let's look at a final example of a successful walk to the minimum for a less-trivial minimization problem.<br /> <br /><pre style="background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> f3 <- function(x) x[2]^2+x[1]*x[2]+x[1]^2 + exp(x[2] + x[1]) <br /> path3<-descentTester(f3,c(-1,2),1/3) <br /> path3<-as.data.frame(path3) <br /> colnames(path3)<-c("x","y") <br /> # plot the path taken by gradDescent <br /> ggplot(data=path3, aes(x=x,y=y)) + geom_path(color="blue",size=2) + geom_point(color="red",size=3)+ <br /> geom_point(data=path3[1,],color="yellow",size=3)+ <br /> geom_point(data=path3[nrow(path3),],color="yellow",size=3) <br /></code></pre><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-z_GA5x3ILRM/UivliVS5FmI/AAAAAAAAABo/PFp-c7xFoVw/s1600/gradDescentPath.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="http://4.bp.blogspot.com/-z_GA5x3ILRM/UivliVS5FmI/AAAAAAAAABo/PFp-c7xFoVw/s400/gradDescentPath.png" width="400" /></a></div>In the next instalment of this series, we'll explore some applications of convex minimization using optim.<br /><br />Ryan Walkerhttps://plus.google.com/112373077410906476606noreply@blogger.com0