Colonel Blotto Tournament

Monday, June 5th, 2017

If you'd like to follow along at home, I've published the code for this post on Github

Back in February, FiveThirtyEight's Friday puzzle column, The Riddler, posed not so much a brainteaser as a competition. I'll let the author, Oliver Roeder, explain:

In a distant, war-torn land, there are 10 castles. There are two warlords: you and your archenemy. Each castle has its own strategic value for a would-be conqueror. Specifically, the castles are worth 1, 2, 3, …, 9, and 10 victory points. You and your enemy each have 100 soldiers to distribute, any way you like, to fight at any of the 10 castles. Whoever sends more soldiers to a given castle conquers that castle and wins its victory points. If you each send the same number of troops, you split the points. You don’t know what distribution of forces your enemy has chosen until the battles begin. Whoever wins the most points wins the war.

Submit a plan distributing your 100 soldiers among the 10 castles. Once I receive all your battle plans, I’ll adjudicate all the possible one-on-one matchups. Whoever wins the most wars wins the battle royale and is crowned king or queen of Riddler Nation!

Each match in this tournament is a type of two-person, zero-sum game known as Colonel Blotto, a classic in the field of game theory. Academics have studied the problem since the '50's, exploring the Nash equilibria of various versions of the game — different numbers of troops & castles, different weightings for each castle, continuous instead of discrete troop deployments. Despite its simple structure, Colonel Blotto is an incredibly complex game, e.g. The Riddler's version has 4.3 trillion potential deployment configurations. For the uninitiated, Nash equilibrium is a concept from game theory, named after the founder of the field John Nash of A Beautiful Mind fame. A Nash equilibrium occurs in a game when a set of opponents, having determined their strategies, would not change their mind knowing their adversary's decision. In essence, each player's strategy is the best response to the circumstances of the game, hence "equilibrium." As hinted at by the Latin pluralization, multiple Nash equilibria can exist within a game. As well, the strategies in an equilibrium do not need to be deterministic — send this many troops to each castle — and can in fact be probabilistic — draw the troop deployments from this probability distribution.

Now that you've got the concept down, let's cast it aside. In addition to being computationally infeasible to calculate for a game of this size, Nash equilibria can be downright counterintuitive and unuseful when dealing with the irrational behavior of humans. Not only are people predictably irrational, most of us are not trained in game theory, and none of us are even capable of computing what "rational behavior" is in this case (at least in our heads). I write this with the confidence of empirical evidence, as FiveThirtyEight was kind enough to publish all 1,387 submissions on Github.

First off, a tip of the cap is in order to the queen or king of troll nation who found the worst possible troop distribution — all 100 troops on castle one. Well done madam or sir! OK, let's take a look at the field a few different ways.

First up, I scored each entry against the remaining field, counting up the number of wins, ties, and losses. I then gave each a "win margin percentage", which is just (wins - losses) / total matches. A few examples: an undefeated team would be 100%, a .500 team is 0%, and our friend the ruler of troll nation checks in at -100%. Let's plot the scores of the field on a percentile basis:

Our troll despot isn't alone, off in the bottom left corner, with a few similarly terrible competitors. On the other side of the chart, the best entries won just under 70% more games than they lost, including the winner, Cyrus Hettle, as announced in the following week's Riddler.

In that post, the author plotted histograms for the number of troops at each castle. I decided to take that one step further, and color each histogram bar based on the average win margin percentage for entries that had that many troops at that castle. The result, on the right, shows lots of wonderful human irrationality.

For example, one interesting cognitive bias crops up quite clearly in castles seven through ten: round number bias. The peaks of troop counts divisible by ten, and less so five, shows our strange fascination with quantities that correspond to the number of fingers we have on one hand. We often take it for granted, but base ten is arbitrary, as any computer or nerdy t-shirt will tell you.

Interestingly, it's clear that some competitors were aware of this bias, and took advantage by choosing to deploy an extra troop, one above the cluster at each round number. Others, one more step ahead, doubled down and added a second, and so on, to diminishing returns up to the next round number. Thanks to the colorization, we can see that the round number crowd were generally less successful than the adjacent groups.

Just by looking at the range of colors present for each histogram, it's clear that strategies on the first four castles were less impactful than those of the last six. They have fewer deep purple losers, or bright yellow winners. This makes some sense, since they collectively account for only 10% of the total points in the match, and the strategies are less diverse.

As the castles become more valuable, however, the strategies diverge, resulting in that interesting multi-peak distribution, with round number periodicity. The first peak at zero troops shows that for virtually every castle, dispatching a small expeditionary force of 2-4 troops was a worthwhile investment of manpower. Beyond five troops, and you'll find yourself in the deep purple gap where you've committed more than enough troops to succeed over the small groups, but not enough to beat the larger battalions.

This gets at a larger strategy of aiming for small wins and big losses. Since each castle is a binary outcome (win or lose all the points), running up the score at any particular location is counter-productive, since it means you'll be light on infantry at each other battlefield. Thus, juking your opponent into sending a large force to an almost empty castle is thus a huge success, allowing you to spread those savings out over a number of wins.

Of course, we don't actually know our opponents before submitting a troop distribution to the tournament, but if we did have this information, just eyeballing these charts, it looks like small scout forces for castles 1-4, 9, & 10 could hold the fort down while sending 15-25 troops to castles 5-8, e.g. 2, 3, 3, 3, 13, 22, 24, 24, 3, 3. Unfortunately, with a 55% win margin percentage, that puts us squarely in the 97th percentile, with forty entries above us, so we'll need to do better.

While the histograms were great, they don't capture any relationships in the data between each castle. For example, we already know that the number of troops deployed to any one castle is negatively correlated with the troops deployed to all other castles — there are only 100 troops to distribute. However, with ten dimensions of data (recall, 10 castles), producing a grid of 45 correlation plots is pretty unwieldy, e.g. here, and also only shows relationships between any two dimensions. Instead, I decided to try a parallel coordinates chart, which is basically an unrolled radar chart. Picture's worth a thousand words:

Each entry in the field is represented by a horizontal line, and as it crosses from left to right, its height represents the number of troops at each castle. With 1,387 lines, the chart becomes a bit of a mess of spaghetti, but to compensate, I've set the transparency of each to just 10%. More popular strategies become more distinct as lines layer upon each other, while uncommon distributions remain faint. Now, before we make too much of this chart, let's add one more complicating factor by coloring each line based on its success in the tournament, specifically its percentile. Take a look:

If you look closely, you can just about make out the original histograms from our previous chart. As well, different strategies start to emerge. We can see the even splitters (something like 10 across the board) don't seem to have fared well, over-dispersing their troops without winning enough castles, though they would have beat our previous example entry 29-26. The slow-rampers (starting at something like 5 and linearly increasing to 15) didn't earn themselves any trophies, and the same can be said for the two-levelers (start off with a few zeros, then evenly distribute your troops amongst the top 4-6 castles). Sure enough, we can make out what looks like a strong negative correlation in troop levels between neighboring castles. All that said, for all the pretty lines and colors, it's still hard to say what the winning strategy is from this chart. So, instead of eyeballing it using data visualization, let's try out an algorithm.

Before we start traversing a constrained 10-dimensional space of 4.3 trillion potential entries, let's gain some intuition in dimensions we can visualize. The simplest, non-trivial example is 2 castles, and let's say 3 troops. Here's a plot showing our four strategy options:

In a 2-dimensional space, the x-y plane plotted above, the strategies form a 1-dimensional line. This is because, while we have two variables (castles one and two), we also have one constraint (the troops sum to three), giving us one one degree of freedom. The nearest neighboring strategies can be found by switching a single troop between each castle, resulting in up to two nearest neighbors. Moving to three castles and three troops:

I'd just like to note that matplotlib's 3D visualization capabilities are unfortunately not yet at parity to MATLAB. Apparently Mayavi is the best way to produce excellent 3D visualizations with Python, but I hope you don't mind slumming it with me and mplot3d just this once. Excuses for poorly placed axes labels, ugly background grids, and the lack of a legend aside, we can see an intuitive analogy in mapping from two to three castles. With three variables (castles one, two, and three) and one constraint (troops still sum to three), we have two degrees of freedom in a 3-dimensional space, forming a neat 2-dimensional plane. Traversing this space is a little bit less intuitive, but the nearest neighbors of a solution can similarly be found by switching a single troop between two castles, the only difference being we now have to choose which two castles to switch troops. This leads to a total of six nearest neighbors, as can be seen in the hexagon of points around the central (1, 1, 1) strategy. While I prefer this method of visualizing our solution space, since all of our potential strategies are contained in one flat surface, we can actually visualize them in 2D, like so:

This plot was produced using just a 2D plot tool, but one could imagine viewing the former 3D version from above to get the same result. Here, we simply assume that the third castle, represented as the z-axis coming out of the screen, takes up any leftover troops. I've colored the various third castle troop counts to show that this 2-dimensional plane can be formed by stacking 1-dimensional lines. So, if going from 1D to 2D involved stacking lines, then going from 2D to 3D must involve stacking planes, right? Let's take a look at the four castle, three troop variation in this way:

Since the legend feature is broken: the blue dots represent each potential troop deployment, while each colored plane and marker outline show the various fourth castle troop quantities. Again, the fourth castle takes up any excess troops not sent to defend castles one through three. We're left with a range of solutions that forms an irregular tetrahedron in the first octant, the 3-dimensional analog to the 2-dimensional right triangle in the first quadrant from the 3 castle example.

Beyond mapping the four castle example into 3D, we're relying on the intuition we've built up, e.g. in imagining stepping through the 9-dimensional hyperplane of entries in 10D for our ten castle problem, even if 10-dimensional hyperspace is nothing more than an abstract concept, at least for me. When I state that there are 90 nearest neighbors between any one strategy, found by switching a single troop between any two castles, we can go back to the 3-castle in 3D case and imagine stepping along the surface.

It turns out that finding the best strategy given the existing field is a nifty optimization problem. Brute-force search — trying all 4.3 trillion options — is too slow, so we need a smart way to explore this huge space. Luckily, optimization is also a huge (academic) space studied for hundreds of years with countless applications. We're hoping to maximize a single objective — win margin percentage against the field — over ten bounded, discrete variables — a non-negative integer of troops distributed to each of ten castles — subject to one constraint — there are 100 total troops.

One simple, intuitive and commonly used algorithm, gradient descent, searches a space by, after being supplied with an initial guess, taking the derivative of the objective function with respect to each input, and making incremental steps proportional to that derivative in the direction of steepest descent, similar to a ball rolling down a hill. (Note: its standard in the optimization literature to assume objective function minimization is the goal, but no one really cares, since maximization is simply minimization of the negative of the objective function, and vice versa.) Another algorithm, momentum, takes that analogy one step further, weighting the direction of steepest descent with the direction of the previous step, similar to a heavy cannonball rolling down a curvy hill — not always taking the steepest path due to its built-up momentum.

These are both examples of gradient-based optimization methods, in that they require the objective function be differentiable, i.e. we can calculate the slope. Unfortunately, our problem doesn't have a closed-form derivative, i.e. a simple equation for calculating the direction & magnitude of steepest descent. One way to get around this is to numerically differentiate the objective function by sampling small steps away from our current guess, but in my experience, it turns out this is a pretty inefficient way of going about things.

Non-gradient based optimization is a wild field, with lots of interesting algorithms, including particle swarm, simulated annealing, differential evolution, genetic algorithms, and covariance matrix adaptation. One of the simplest, hill climbing, after an initial guess, iteratively tests out small steps in random directions until a better solution is found, adopting that as its next position. The algorithm keeps stepping in this way until no further improvement can be found, at which point we call the algorithm "converged" and call it a day.

Hill climbing should work well for our purposes, but we have to be careful in both generating an initial guess and a random step, to make sure we keep the troop distributions as non-negative integers that sum to 100. For the former (generating an initial guess), a simple multinomial distribution (rolling a k-sided dice n times and summing the number of rolls for each side) will suffice. With the latter, the intuition we built-up previously is helpful in imagining the 90 nearest-neighbors to any single solution found by switching a single troop from one castle to another. As well, I opted to code up a variant called steepest ascent hill climbing. Instead of randomly selecting from the nearest neighbors, we evaluate all 90, and choose the best one to base our next iteration on.

One programming note: after evaluating each nearest neighbor and stepping to the best one, it turns out our algorithm will re-evaluate many of the same troop distributions, i.e. many nearest neighbors of a solution are themselves nearest neighbors. To avoid a costly re-evaluation over the entire field of entries, I implemented a simple caching scheme using a Python dictionary and its convenient __missing__ method. Instead of immediately scoring an entry, the function first checks to see if it has the score saved from a previous attempt. If not, it'll evaluate, and add that to its record for the future.

Putting that all into action, and starting with the random guess (14, 20, 9, 11, 16, 7, 7, 0, 14, 2), after 26 steps, we find the optimal solution (12, 6, 5, 19, 25, 6, 4, 3, 13, 7) with a win margin of 16.1%. Wait, what gives?! Wasn't the best entry of the tournament at 68.3%?!

It turns out, upon convergence, our version of the hill climbing algorithm only guarantees a local maximum. While (12, 6, 5, 19, 25, 6, 4, 3, 13, 7) may be superior to all its nearest-neighbors, i.e. it is a local maximum, it is certainly not the global maximum. We were in search of Everest, but instead, we discovered Whitney, only ~15,000 feet short (sorry metric friends!). Our search space, like the Earth's surface, is lumpy, with lots of local maxima (Whitney) that are mostly indistinguishable from the global maximum (Everest). To avoid this problem, academics are often concerned with convex objective functions, a nice property that, amongst other things, ensures that any local minimum is the global minimum.

Unfortunately, for non-convex problems such as our own, 100% confidence in finding the global maximum requires the brute-force, global search method we originally cast aside. We'll attempt to mitigate this issue by implementing random-restart, or shotgun hill climbing, in which we'll try lots of initial guesses, finding many local maxima in the hopes of one being the global maximum. With enough restarts, one of those initial guesses will be in Nepal or Tibet, instead of California.

One improvement over using truly random initial guesses is to evenly distribute those guesses across the space of potential solutions. For example, in the case of searching for the tallest mountain on Earth, one might try an initial guess at each continent, instead of randomly spinning the globe seven times. To do this, I used something called Latin Hypercube Sampling, which conveniently generates a set of near-random samples guaranteed to spread across a given range. Here's what that looks like for 100 samples on a range of 0-10:

Intuitively, the random samples will, on average, be evenly distributed, but random variation occurs. Instead, a Latin Hypercube Sampling method guarantees us the even distribution, while still providing essentially random samples.

With everything in place, I quickly profiled my code, and found that each local maxima took about half a second to find, so I took a nice half hour coffee break and let my laptop crunch through 3,600 LHS-random starts. It found two neighboring optimal troop distributions with an identical record of 1,272 wins, 4 ties, and 111 losses, good for a 83.7% win margin: (0, 6, 7, 12, 12, 21, 3, 32, 3, 4) and (0, 6, 7, 12, 12, 21, 3, 31, 3, 5).

This leads to the obvious question: was 3,600 starts enough, or is there an even better entry out there? To answer this question, I plotted the distribution of scores for each of the local maxima we found. Quick sidenote: while histograms are a great tool for this task, when estimating the probability distribution of a population given a random sample, I prefer to use Kernel Density Estimation using cross-validation for bandwidth selection.

Our search space is quite non-convex, with 1,473 unique local maxima discovered in a half hour alone. Still, with our confidence bands pretty tight, there seems to be very little likelihood of an entry better than the two we discovered. At the risk of pulling a George W. Bush on USS Abraham Lincoln: MISSION ACCOMPLISHED!

At this point, you're probably wondering why I've been so interested in a tournament from months ago. Well, FiveThirtyEight recently announced a second round and I couldn't help but pull out my computer to start analyzing the first round in preparation for making a submission. We've already found a troop distribution that could crush the competition from round 1, and more generally created an algorithm for efficiently finding the best entry given a field. The only problem: the field will almost certainly be different in round 2.

In my mind, there are two ways they could differ. First, if we consider each entry in the original field a random sample from some distribution, then the second round would be a new set (of indeterminate size) of samples from the same distribution. This would produce similar histograms and parallel coordinate charts, but might not lead to the same optimal troop distribution. Second, round 2's supposed distribution could itself differ from round 1. While the latter seems likely given that competitors now have access to round 1's submissions and can take them into account, the former is a good place to start.

Given scenario one, it seems as though the previous optimal solution we found would be the best choice, but is it possible that it's only optimal specifically for those 1,387 submissions, and not generally optimal for a future round drawn from the same distribution? The growing field of machine learning gives us an excellent approach to answer this question called cross-validation. Machine learning is concerned with the creation of models that, given a dataset, can make predictions. Cross-validation is a convenient way to evaluate the predictive capabilities of those models.

As an example, say your friend the pediatrician wanted to predict the full-grown height of a 2-year-old, given its height, weight, and gender, and already had the medical records of 1,000 previous patients now full-grown. You chose an algorithm, which learned from the 1,000 data points you fed it, but before trying it out, your friend wants to know how accurate the predictions will be, e.g. the average error in inches. Unfortunately, if you use that same dataset to check for accuracy, you'll get a biased result, with a lower predicted error than reality. What you need is two datasets: one to train your model, and another to test it. One simple way to achieve this is to leave one data point out, train your model on the other 999 examples, then make a prediction using the left out two-year-old's height, weight, and gender. Do this for each previous patient, and you now have an unbiased set of predictions for each child to evaluate.

That process of iteratively slicing up a dataset for separate training and testing is cross-validation. When algorithms slow or datasets grow, in-place of the simple leave one out approach, we can save time by splitting our data into a handful of equally-sized groups, e.g. 10 groups of 100 former patients, and leave one group out at a time. The groups are commonly known as folds, as in "10-fold cross-validation."

We can use cross-validation to evaluate the generalizability of the optimal solutions of our multi-start, hill-climbing algorithm. After splitting the round 1 field into groups, we'll find the best troop distribution of the sub-field that excludes one group at a time, resulting in an optimal troop distribution that can be tested by each excluded group. The only challenge with this scheme is our familiar problem of determining how many multi-starts are required to give us confidence we've found the best troop distribution. Sticking with 3600 starts would mean it would take about 5 hours to complete 10-fold cross-validatio. Is that really necessary, or can we save time by using fewer restarts? More precisely, after how many starts can we be 95% confident we've found the best one?

To answer this question, instead of running our multi-start search algorithm many more times, we can quickly bootstrap (repeatedly sample with replacement) our existing set of 3600 local maxima. Here's the results:

The first plot shows how confident we can be that we've found the optimal solution, given how many starts we've performed. In this case, we've reached 95% confidence, i.e. we'll have found the optimal troop distribution 95 times out of 100, after a bit more than 100 samples (notice the logarithmic scale of the x-axis). The second chart shows the same takeaway — we need about 100 starts — by plotting the win margin percentage of the average and 95% confidence level for the best entry after so many starts. By also plotting the win margin percentage as well, it's clear that even if we're not assured of the optimal entry after 100 solutions, the best entry we find will be very close to that optimal either way.

So with that knowledge in hand, we can now setup and run our cross-validation to evaluate the generalizability of our algorithm. We'll split the data up into 10 groups (10-fold cross-validation), find 100 local maxima for each sub-field by excluding one group at a time, then score the top entry against the excluded field. Better yet, since we've already done the work of finding 100 local maxima, we might as well score the top 10 unique entries found. We can then compare the win margin percentages for the sub-fields, aka training data, against the excluded group, aka testing data. We'll expect some drop off between the two, but how bad is it?

Sure enough, generalizing, i.e. going from training data to test data, takes off about 2.5 points of win margin percentage. Still, given that the optimal solution we found for round 1 was more than 15 points better than any entry from the field, we can be pretty confident that it would win, if indeed the second round's underlying probability distribution were identical to round 1.

While we can rest assured we've found a robust optimal solution for the previous round, it's unlikely the next round will resemble the last. I'm not smart enough to have a thorough, precise, sophisticated way to predict what the next round will look like, so instead, I came up with an intuitive, defensible, but ultimately arbitrary method:

From my dataset of 3,600 local optima, I first filtered them for uniqueness, giving me 1,473 entries.
Next, I took only the solutions that performed better than the 90th percentile entry from round 1, a win margin of 43%, leaving 890 local maxima.
After that, I filtered for spacing. Similar to the two neighboring best solutions, many solutions were close to each other, and I wanted to have a more diverse round 2 field. I used an L1 norm or Manhattan length as my distance metric, and filtered for solutions more than 5 steps apart.
Finally, I combined the 396 local maxima left with the top 10% from round 1, producing a total of 535 round 2 entries

Here's what that ended up looking like, using the same visualizations as round 1. First, the win margin percentiles:

Next, the histograms:

And finally, the parallel coordinates chart:

Applying the search algorithm one last time with 1,000 starts just to be thorough, I found (0, 8, 11, 15, 4, 25, 4, 4, 6, 23) as the optimal troop distribution, with a win margin of 92%. Fingers crossed that takes the cake!

Questions | Comments | Suggestions

If you have any feedback and want to continue the conversation, please get in touch; I'd be happy to hear from you! Feel free to use the form, or just email me directly at matt.e.fay@gmail.com.