Colonel Blotto Tournament, Revisited

Thursday, June 22nd, 2017

This is a follow-up to this previous post analyzing FiveThirtyEight's Colonel Blotto Tournament.

If you'd like to follow along at home, I've published the code for this post on GitHub.

By the end of my last post on FiveThirtyEight's Battle for Riddler Nation, I had put together a strategy for winning round 2 of the battle. First, a quick review of the game's rules, from author Oliver Roeder:

In a distant, war-torn land, there are 10 castles. There are two warlords: you and your archenemy. Each castle has its own strategic value for a would-be conqueror. Specifically, the castles are worth 1, 2, 3, …, 9, and 10 victory points. You and your enemy each have 100 soldiers to distribute, any way you like, to fight at any of the 10 castles. Whoever sends more soldiers to a given castle conquers that castle and wins its victory points. If you each send the same number of troops, you split the points. You don’t know what distribution of forces your enemy has chosen until the battles begin. Whoever wins the most points wins the war.

Submit a plan distributing your 100 soldiers among the 10 castles. Once I receive all your battle plans, I’ll adjudicate all the possible one-on-one matchups. Whoever wins the most wars wins the battle royale and is crowned king or queen of Riddler Nation!
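To make the scoring concrete, here's a minimal sketch of how a single one-on-one matchup might be tallied. The function and example deployments are mine, for illustration only; they're not taken from the published code.

```python
def score_matchup(mine, theirs):
    """Score one Blotto matchup: castles are worth 1 through 10 points;
    more soldiers wins a castle outright, and a tie splits its points."""
    my_points = their_points = 0.0
    for castle, (m, t) in enumerate(zip(mine, theirs), start=1):
        if m > t:
            my_points += castle
        elif t > m:
            their_points += castle
        else:  # tie: split the castle's points
            my_points += castle / 2
            their_points += castle / 2
    return my_points, their_points

# Illustrative example: an even split vs. loading up on the top castles
even_split = (10,) * 10
top_heavy = (0, 0, 0, 0, 0, 0, 0, 30, 35, 35)
print(score_matchup(even_split, top_heavy))  # (28.0, 27.0): the even split squeaks by
```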

After visualizing the field from round 1 (FiveThirtyEight was kind enough to publish the data on GitHub) and gaining some intuition around the search space of potential troop deployments, we came up with an algorithm — a multi-start, steepest ascent hill climber distributed through the search space using Latin Hypercube Sampling — to find the optimal troop deployment for defeating the round 1 field. With that optimization tool, and an educated guess for what the second round's field would look like, we were able to find a suitable submission, and left it there.

Now that they've posted the results, we can see where we went right, and where we went wrong: our entry of (0, 8, 11, 15, 4, 25, 4, 4, 6, 23) finished 124th of 932 submissions with a win margin of 28.5%, landing squarely in the 87th percentile of entries. B+ ain't bad, but if you're a competitive ass like me, this feels more like "Needs improvement!" To figure out what happened, let's start with the visualizations we built for round 1. First, let's take a look at the win margin percentiles:

I checked, and sure enough, the troll of Riddler Nation made another appearance, checking in at last place, or the 0th percentile at the bottom left, losing all 931 battles with a troop distribution of (100, 0, 0, 0, 0, 0, 0, 0, 0, 0). At the top right of the chart, we see that the very best strategies won more than 55% more often than they lost, with Vince Vatter taking home the trophy with a strong 61.8% win margin! Next, the histograms:

Interestingly, competitors were less likely to send 0-2 troops to a castle this time around, reacting to the easy wins a small force of 3-5 had in round 1. This is particularly true of castles 9 & 10, which exhibit a funky peak starting at 3 instead of 0.

However, the round number bias (small peaks at troop distributions divisible by 5 & 10) is still present and particularly noticeable in castles 8, 9, & 10.

Before reading too much into which strategies were more or less successful, it's important to note the spread of win margin for this chart. While the minimum dark purples are ~-55%, the maximum is only ~20%. This tells me two things: 1) we can only make strong statements about strategies to avoid when using this chart, and 2) strategies within an individual castle are less important than the relations between them when determining success.

With that caveat, it's clear that last round's winning strategy of loading up on castles 7 & 8 was largely unsuccessful. Specifically, we can see a large number of entries with 30-34 troops at castle 8 and 25-28 troops at castle 7, a largely ineffective response to round 1.

While following the herd towards loading up on castles 7 & 8 was unsuccessful, the collective strategy of not giving away castles 9 & 10 was a must. Competitors who sent fewer than three troops to either fared extremely poorly.

And finally, the parallel coordinates chart:

Again, we can see that loading up on castles 7 & 8 was ineffective this round, as was the related strategy of giving up an easy win at castles 9 & 10. Instead, it looks like sending a small 3-6 sized force to 7 & 8 was better. Competing strongly for castles 3-5 was also a winning strategy.

This all raises the question of where our previous strategy went awry. Looking through the data, three things immediately jump out at me:

First, the predicted round 2 field we used for finding our best strategy was too small, at only 535 entries, compared to 1,387 from round 1, and 932 in round 2. Only including the top local optima from round 1 along with the best original entries didn't produce a large and diverse enough field to optimize over.

Second, on the internet, trolls abound! While I noted their existence in round 1 at the start of my first post (the troll of Riddler Nation), I had forgotten to take them into account in round 2. The win margin versus percentile charts from both rounds show a clear inflection point at the bottom 8% of entries, below which the scores become much worse. A better method for predicting the next round needs to factor in this group of naive players and trolls.

Third, I simply didn't run my optimization code with enough multi-starts when searching for the optimal entry against our predicted round 2 field. The smaller, less diverse field seems to have had more local optima, and thus it was more difficult to find the global one. In this case, the best entry found against our predicted round 2 field, (0, 1, 12, 17, 21, 4, 4, 5, 13, 23), would have fared much better, with a 38.5% win margin, earning 55th place. Whoops!

If we want to do better, that means we'll need to 1) find a way to produce a larger, more diverse field, 2) include trolls in that field, and 3) find more local optima (either by speeding up the algorithm, or finding more patience).


Luckily, I found a subtle algorithmic change that added randomness and diversity while speeding up the search time by a factor of six. If you recall from last time, the algorithm we ended up using was a steepest ascent hill climber. After an initial guess, our algorithm iteratively tested out all 90 nearest-neighbors, adopting the best as its next position. This stepping occurred until the current guess was found to be superior to all its neighbors, i.e. it was a local maximum, the top of the hill.

Instead of testing out each neighbor and stepping to the best one — choosing the "steepest ascent" — I changed the algorithm to step to the first superior neighbor from a random ordering of them. Whereas before, the search algorithm was deterministic, i.e. given the same initial guess it would follow the same path to an identical solution, this new algorithm is stochastic, or randomly determined. While this might lead to discovering a few more local maxima, more importantly, the set of search paths the algorithm follows is now vastly more diverse, and can be sampled for generating the next field. Another counter-intuitive benefit: the paths may be longer, but the algorithm itself is much faster because it evaluates far fewer neighbors at each step.
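Here's a rough sketch of the change in Python, assuming (as in the last post) that a "neighbor" is any deployment reachable by moving a single soldier from one castle to another, which gives at most 10 × 9 = 90 of them. The win_margin function stands in for whatever scores an entry against the field; the real implementation lives in the repo and differs in its details.

```python
import random

def neighbors(entry):
    """Yield every deployment reachable by moving one soldier
    from one castle to another (at most 10 * 9 = 90 of them)."""
    for src in range(10):
        if entry[src] == 0:
            continue
        for dst in range(10):
            if dst == src:
                continue
            step = list(entry)
            step[src] -= 1
            step[dst] += 1
            yield tuple(step)

def steepest_ascent(entry, win_margin):
    """Old approach: evaluate every neighbor, step to the best one."""
    while True:
        best = max(neighbors(entry), key=win_margin)
        if win_margin(best) <= win_margin(entry):
            return entry  # no neighbor is better: local maximum
        entry = best

def first_improvement(entry, win_margin):
    """New approach: step to the first superior neighbor from a random
    ordering; stochastic, and far fewer evaluations per step."""
    while True:
        current = win_margin(entry)
        candidates = list(neighbors(entry))
        random.shuffle(candidates)
        for candidate in candidates:
            if win_margin(candidate) > current:
                entry = candidate
                break
        else:  # no neighbor is better: local maximum
            return entry
```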

With that solution in hand, I next tackled a more systematic way to predict the round 2 field than my previous defensible but ultimately arbitrary attempt:

  • After discovering 3,600 local optima, I filtered them for uniqueness, giving me 1,473 entries.
  • Next, I took only the solutions that performed better than the 90th percentile entry from round 1, a win margin of 43%, leaving 890 local maxima.
  • After that, I filtered for spacing. As with the two neighboring best solutions, many solutions were close to each other, and I wanted a more diverse round 2 field. Using the L1 norm, or Manhattan distance, as my distance metric, I kept only solutions more than 5 steps apart (a sketch of this filter follows the list).
  • Finally, I combined the 396 local maxima left with the top 10% from round 1, producing a total of 535 round 2 entries.
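One simple way to implement that spacing filter is a greedy pass that keeps a solution only if it's far enough, in L1 distance, from everything already kept. A minimal sketch, assuming the candidates arrive sorted best-first; the actual filtering in the published code may differ:

```python
def l1_distance(a, b):
    """Manhattan distance between two troop deployments."""
    return sum(abs(x - y) for x, y in zip(a, b))

def space_out(entries, min_distance=5):
    """Greedily keep only entries more than min_distance apart,
    assuming entries is already sorted best-first."""
    kept = []
    for entry in entries:
        if all(l1_distance(entry, k) > min_distance for k in kept):
            kept.append(entry)
    return kept
```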

This time, using the new faster, randomized search algorithm:

  • First, I discovered 5,000 local maxima for round 1, storing both those entries and the paths leading to them, giving me 306,474 total entries.
  • Next, I took only the solutions that would have won round 1, requiring a win margin greater than 61.8%, leaving 32,735 entries.
  • Finally, I randomly sampled (with replacement) from both the original field and this new field of winning entries to form the predicted next round.

The only challenge left was to figure out what proportion ought to come from the first round, versus from the set of entries optimized against the first round. It occurred to me that it might be prudent to mix in some entries optimized against that optimized set, and perhaps even a few more from a thrice-optimized set, and so on. Unfortunately, this seemed like another defensible but ultimately arbitrary dead-end.

That is, until I started looking into beauty contests...


John Maynard Keynes is considered by many to be the father of macroeconomics. In chapter 12 of his seminal work The General Theory of Employment, Interest and Money, Keynes compared stock market investors to competitors in a form of beauty contest since named after him:

... professional investment may be likened to those newspaper competitions in which the competitors have to pick out the six prettiest faces from a hundred photographs, the prize being awarded to the competitor whose choice most nearly corresponds to the average preferences of the competitors as a whole; so that each competitor has to pick, not those faces which he himself finds prettiest, but those which he thinks likeliest to catch the fancy of the other competitors, all of whom are looking at the problem from the same point of view. It is not a case of choosing those which, to the best of one's judgment, are really the prettiest, nor even those which average opinion genuinely thinks the prettiest. We have reached the third degree where we devote our intelligences to anticipating what average opinion expects the average opinion to be. And there are some, I believe, who practise the fourth, fifth and higher degrees.

Or, if you'd prefer a less potentially misogynistic version, Planet Money, NPR's economics podcast, ran a decidedly more modern experiment involving cute animal videos. Slow Lorises aside, the analogy to our Colonel Blotto problem resides in predicting to what degree competitors will consider the problem in submitting an entry. For example, first degree submissions would come from the same distribution as the original round, i.e. they would not have factored in the results at all. Second degree submissions are those that have been optimized against the first degree submissions, i.e. beat round 1. Third degree submissions were optimized against the second degree submissions, and so on. Figuring out how to predict the round 2 field is thus the same question as figuring out the distribution of first, second, and third (maybe more?) degree submissions.

It turns out another version of a Keynesian Beauty Contest — the classic game theory game "Guess ⅔ of the Average" — can reveal the depth of reasoning of its participants, and oddly enough, it was the other riddle included in the original Can You Rule Riddler Nation? post from FiveThirtyEight. In that version, competitors submitted an integer between 1 and 1,000,000,000, with the winner being whoever came closest to ⅔ of the average of all submissions. First degree submissions would be uniformly distributed across the entire range. Second degree submissions would take this into account, predicting the mean to be 500,000,000 and submitting ⅔ of that, or 333,333,333. Third degree submissions would then submit ⅔ of that, or 222,222,222. At degree infinity, the Nash equilibrium of guessing the minimum is finally reached.
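The ladder of reasoning is easy to write down explicitly. A tiny sketch of the first few degrees for the 1-to-1,000,000,000 version:

```python
# Level-k reasoning for "Guess 2/3 of the Average" on 1..1,000,000,000.
# Degree 1 players guess uniformly at random (mean 500,000,000); each
# deeper degree submits 2/3 of what it expects the field to average,
# converging toward the Nash equilibrium at the minimum allowed guess.
guess = 1_000_000_000 / 2
for degree in range(2, 6):
    guess *= 2 / 3
    print(f"degree {degree}: {guess:,.0f}")
# degree 2: 333,333,333
# degree 3: 222,222,222
# degree 4: 148,148,148
# degree 5: 98,765,432
```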

While Oliver Roeder didn't post the data from this competition (and didn't reply to email), I found this histogram of the results of a Danish newspaper's version of this game, digitizing the plot using Engauge Digitizer. I've reproduced it below:

In this version, submissions could be any number between 0 and 100 (not just integers). The maximum possible result of this game, in which all entries are 100, is 66.666..., and yet a whopping 9.2% of entries are greater than 67! If we chalk all of these submissions up to the uniform distribution of the first degree group (only about a third of which falls above 67), that would mean a full 27.4% of competitors are of the first degree. The second and third degree groups have large, distinct, roughly equal peaks at 33 and 22, respectively. There is notably no distinct fourth degree peak at 15. If we constrain the second and third degree groups to be equal and allow for only one further, fourth iteration, then reproducing the mean of the population puts 31.3%, 31.3%, and 9.9% of participants at the second, third, and fourth degrees, respectively.
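To make that arithmetic explicit, here's a small sketch of the calculation. The assumptions are the ones above: a uniform first degree group with a mean of 50, second/third/fourth degree groups sitting at 33, 22, and 15, equal second and third degree shares, and a population mean recovered from the digitized histogram:

```python
def degree_proportions(first_degree_share, population_mean):
    """Solve for the second/third/fourth degree shares under the
    assumptions above: second == third (peaks at 33 and 22), a single
    fourth degree group (peak at 15), a uniform first degree group
    (mean 50), and a mixture that reproduces the observed mean."""
    remainder = 1 - first_degree_share  # 2p + q = remainder
    # 50*f + 33*p + 22*p + 15*q = population_mean, with q = remainder - 2p
    p = (population_mean - 50 * first_degree_share - 15 * remainder) / (33 + 22 - 2 * 15)
    q = remainder - 2 * p
    return p, p, q

# With first_degree_share = 0.274 and the mean recovered from the digitized
# data, this lands near the 31.3% / 31.3% / 9.9% split quoted above.
```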

While that seems reasonable, there's certainly room for interpretation. FiveThirtyEight's audience is almost certainly different from that of a Danish newspaper, and might include more nerdy, game-theory types who reason to higher degrees. At the same time, the game we're concerned with, Colonel Blotto, is MUCH more complex, so analyzing it to various depths takes much more time and effort than a simple multiplication by ⅔, which might mean submissions reflect less depth of reasoning. And finally, there's the matter of trolls. I've ignored the impact of trolls in analyzing the Danish data since the competition had a roughly $1,000 reward, but one could certainly attribute the peak at 100 to them.

To take the trolls into account in predicting round 2, I ended up randomly sampling from the troll population of round 1 (the bottom 8%), and keeping their proportion of the field the same. The other 92% was then filled proportional to the results from the Danish competition.
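Putting it together, building the predicted field might look something like the sketch below. I'm assuming pools of entries at each degree (the round 1 field for degree one, and entries optimized against the previous degree for the rest); the names and exact proportions are illustrative, and the published code is the source of truth.

```python
import random

def predict_round2(trolls, degree1, degree2, degree3, degree4,
                   field_size, troll_share=0.08,
                   degree_shares=(0.274, 0.313, 0.313, 0.099)):
    """Sample (with replacement) a predicted round 2 field: a fixed share
    of trolls drawn from round 1's bottom 8%, with the rest split across
    the reasoning degrees in roughly the Danish-contest proportions."""
    n_trolls = round(field_size * troll_share)
    remaining = field_size - n_trolls
    field = random.choices(trolls, k=n_trolls)
    for pool, share in zip((degree1, degree2, degree3, degree4), degree_shares):
        field += random.choices(pool, k=round(remaining * share))
    return field  # note: rounding can leave this a few entries off field_size
```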


Finally, the rubber meets the road. In the plot below, each line traces a set of top 100 entries against their win margin percentage versus the actual round 2 field. The blue line shows the actual top 100 entries from the round, while the green line represents the top 100 entries found by applying our optimization algorithm to the actual round 2 field, i.e. we cheated with perfect knowledge. Thus, the green line represents a sort of upper bound. Next, in red, are the results of the previous post's attempt at predicting the round 2 field, this time running the optimization much longer. And finally, the Keynesian Beauty Contest approach to predicting round 2 optimal solutions is shown in orange.

The previous attempt, in red, is noisy because that method is deterministic, so there's only a single run to show. The beauty contest method, in orange, is random, which meant I was able to run it 40 times and find the average expected win margin percentage for each rank, smoothing the result using kernel density regression. The transparent ribbon represents the 95% confidence interval for that average. More on regression of the kernel density variety and others in a later post!
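As a teaser for that later post: the smoother here is in the spirit of Nadaraya-Watson kernel regression, where the estimate at each rank is a locally weighted average of nearby observations, with a Gaussian kernel supplying the weights. A minimal sketch (not the exact code used for the plot, and the bandwidth is illustrative):

```python
import numpy as np

def kernel_regression(x, y, x_query, bandwidth=2.0):
    """Nadaraya-Watson kernel regression with a Gaussian kernel: the
    estimate at each query point is a weighted average of y, with
    weights that fall off with distance in x."""
    x, y, x_query = map(np.asarray, (x, y, x_query))
    sq_dist = (x_query[:, None] - x[None, :]) ** 2    # (queries, observations)
    weights = np.exp(-sq_dist / (2 * bandwidth ** 2))
    return (weights @ y) / weights.sum(axis=1)

# e.g. smooth the per-rank win margins pooled across the 40 runs:
# ranks = np.arange(1, 101)
# smoothed = kernel_regression(all_ranks, all_margins, ranks)
```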

As you can see, even with the help of John Maynard Keynes, our top entry only averaged a ~50% win margin, or about 20th place. Certainly nothing to sneeze at, but not a guaranteed victory. By my calculations, this method would have won the tournament roughly 5% of the time. Averaging the 97th percentile with a 1-in-20 chance of winning it all seems like a good place to quit while we're ahead, but if anyone has any contributions to improve on that, I'd love to post a follow-up!

 

Questions | Comments | Suggestions

If you have any feedback and want to continue the conversation, please get in touch; I'd be happy to hear from you! Feel free to use the form, or just email me directly at matt.e.fay@gmail.com.
