Replicating "The Markov Chain Monte Carlo Revolution"
Tuesday, December 8th, 2020
Why Markov Chain Monte Carlo sampling algorithms aren't as efficient as simple optimization algorithms when searching for the maximum likelihood.
If you'd like to following along at home, I've posted the code for this post on Github
With the added free-time of only working part-time as a data/software consultant, I’ve recently been working through my backlog of papers and links favorited over the past few years. One that had been sitting in the queue for a while was The Markov Chain Monte Carlo Revolution by Persi Diaconis, a math & statistics professor who came to the field after dropping out of high school to become a professional magician. What a story!
On first learning of MCMC methods in reading All of Statistics by Larry Wasserman, I was a bit skeptical of their usefulness. Estimating a probability distribution (most usefully, the kinds of complex ones produced with Bayesian techniques) using intelligent sampling techniques seemed unreliable, unscalable, and not repeatable given the difficulties of tuning and discerning whether or not your simulation has “poor mixing.” However, in the intervening years, computational and algorithmic progress seems to have, at least somewhat, ameliorated my concerns. That, and my interest in Bayesian Hierarchical Modeling motivated wanting to revisit MCMC.
Much to my confusion however, the motivating example at the center of the paper seemed to be misusing MCMC. Diaconis describes being asked to decipher some coded text, which he suspects is “a simple substitution cipher, each symbol standing for a letter.” To determine the mapping of symbol to letter, he constructs a probability distribution as the product of the likelihood of each first-order transition from letter to letter in the text. To calculate those first-order transition likelihoods, he simply samples their frequency from a large text, in this case, War and Peace. With this framing, solving the cipher can be framed as finding the mapping of symbol to letter that maximizes this probability.
Oddly, Diaconis uses MCMC sampling methods intended for estimating this probability distribution to solve this problem, when a simple optimization algorithm would do the trick, in a fraction of the time I might add. I say that now with confidence only after coding up a small replication in Python, which you can find here.
Let’s talk about the differences between the two algorithms I’ll be showing you the results for. They’re so similar, we can encapsulate how they differ in a single step. First, we pick a random mapping of symbols to letters, and calculate its probability†. From there, we pick a random pair of letters to swap in this mapping, and calculate that mapping’s probability. Each algorithm has its own “step rule” for either accepting or rejecting this new “candidate mapping.” If it’s rejected, we’ll go through the same evaluation process with a random ordering of all the possible swaps until we find one we like, unless we run out, in which case we’ve “converged.” Alternatively, once we accept a “candidate mapping,” we’ll begin the step process all over from there.
The simple optimization algorithm I implemented is a greedy, stochastic hill climber. The rule is quite simple: if the candidate mapping’s probability is higher, accept, otherwise reject. The greediness refers to the fact that it only accepts improvements, never straying from its upward trajectory to explore a new path; stochastic refers to the fact that we’re choosing from a random order of letter swaps, lending some diversity and exploration to our algorithm.
On the other hand, the Metropolis-Hastings MCMC sampling method is more permissive. It also accepts any improved candidate, but when presented with an inferior mapping, will occasionally take that path, depending on the acceptance ratio: the probability of the candidate divided by that of the current mapping. Specifically, it draws a random sample from 0 to 1, and accepts the candidate if the acceptance ratio exceeds it, hence the name. Putting together the value at each step will, over time, conveniently approximate the probability distribution itself, spending more time at more likely areas of the distribution thanks to that simple sampling rule.
However, the algorithm is prone to backtracking, purposefully designed to meander through the space of solutions instead of specifically searching for the most likely one. With discrete spaces like our cipher solution space, this can be partially mitigated by caching the function we’re evaluating, which I’ve done, but even caches take time to evaluate.
Alright, with both algorithms described, let’s take a look at how they did. To test them, I randomly scrambled “Tomorrow, tomorrow, and tomorrow,” Shakespeare’s famous soliloquy from Macbeth. I first ran the hill climbing optimization algorithm, restarting as many times as was necessary to find the correct solution. I would then run the MCMC algorithm, checking every so often if it had successfully found the solution, since it runs continuously, never converging. After coding this up, I found one big problem to my simple experimental setup: the MCMC sampler would occasionally never find the solution. Okay, I haven’t solved the halting problem, so I only know it wouldn’t find it after hours overnight, but you get the idea. Anyways, to solve this, I added a simple timeout at 1 minute. After running 1000 trials, I found that a whopping 5% of the time, the scrambled text could not be solved by MCMC within a minute. This, despite the fact that the hill climbing algorithm never took longer than two seconds, on average taking a mere 305ms.
Comparing the samplers over these runs, we see that about ¾ of the time, they perform identically. In the other ¼ of cases though, the MCMC diverges in performance, taking in many cases an order of magnitude or more longer than the simple hill climbing algorithm.
To demonstrate what’s going on here, I measured the number of cache hits and misses, in effect determining how often each sampler would backtrack on itself. Here, I’ve plotted the percentage of function evaluations as a cumulative distribution function for each sampler, as a function of the duration each took. The hill climbing algorithm consistently achieves around 80% on this metric, due to occasionally resampling neighbors that have been already traversed. In comparison, the MCMC algorithm actually performs better on this measure when it is successful. But as you can see, when it takes longer, it’s clearly getting lost by resampling previous character mappings already traversed.
Of course, we’re using a cache here, so the cost of a cache hit is small: just a dictionary look-up. To better understand how well each sampling algorithm does with exploring the space, the real metric of interest here ought to be the number of unique evaluations over time, factoring in 1) the increased cost of the more complex step acceptance rule of MCMC, and 2) the frequency of cache hits. Here, we see that the MCMC algorithm does much better thanks to the cache. Still, when only one in ten evaluations is unique, the cost is high with exploring the space, and the number of unique function evaluations drops off.
So, the lesson learned here is that MCMC samplers aren’t going to serve you well if you’re simply looking for the maximum likelihood estimate. Good to know!
†: We actually calculate the log probability, since it’s much simpler to add up the contribution of each transition than to multiply them out to an infinitesimally small number.
Questions | Comments | Suggestions
If you have any feedback and want to continue the conversation, please get in touch; I'd be happy to hear from you! Feel free to use the form, or just email me directly at email@example.com.