Open benchmarks for the win
Recently, there was some commotion on Twitter because a competitor heavily restricts publicizing benchmarks of their Solver as part of their license. That might seem harsh, but I can understand the sentiment: when a competitor publicizes a benchmark report comparing our product against their own, I know we’re gonna get screwed. Unlike single product benchmarking, competitive benchmarking is inherently dishonest…
Competitive benchmarking for dummies
As a competitor, you can use several (obvious and not so obvious) means to prove your superiority over another Solver:
- Publication bias
  - Pick a use case which is known to work well in your Solver.
  - Use datasets with a scale and granularity which are known to work well in your Solver.
  - If you’re really evil, benchmark multiple use cases and datasets in both Solvers and only retain those for which your Solver wins.
- Expertise imbalance
  - Let one of your experts develop an implementation for both Solvers.
    - Motivation: like any other company, your company only employs experts in your own technology.
  - If he has years of recent experience in your technology, it’s unlikely he’ll have had time for any recent experience in the competing technology.
    - So you’re effectively using your jockey on someone else’s horse.
- Tweaking imbalance
  - Spend an equal amount of time on both implementations.
    - The use case is probably already implemented in your Solver (or straightforward to implement), so you can spend most of the time budget on tweaking it further.
    - You’ll need to learn the competitor’s Solver first, so most of the time budget for that implementation goes into learning the technology, which leaves no room for tweaking.
- Funding
  - There’s no need to explicitly set a desired outcome: your developer will know better than to bite the hand that feeds him.
Notice how these approaches don’t require any malice (except for the evil one): it’s normal to conduct a competitive benchmark like this…
Furthermore, you can make the competitive benchmark comparison look more objective, by sponsoring an academic research group to do the benchmark for you. Just make sure that’s a research group which has been happily using your technology for years and has little or no experience with the competition.
Marketing value
The marketing value of such a benchmark report should not be underestimated. These numbers, written in black and white, clearly showing the superiority of your Solver over another Solver, make a strong argument:
- To close sales deals, when in direct competition with the other Solver.
- To convince developers, researchers and students to learn and use your technology.
- To build a strong, long-term reputation.
  - Benchmarks from the 90’s can still affect the Google search results today, for example for “performance of Java vs C++”.
  - Such information spreads virally, and counter claims might not.
Empirical evidence
Are all competitive benchmark reports lying? Yes, they are probably misrepresenting the truth.
Should we therefore restrict users from publicizing benchmarks on our Solver? No, of course not (even if our open source license allowed such conditions, which it does not).
Computer science - like any other science - is built on empirical evidence: the promise that any experiment I publish can be repeated by others independently. If we prevent people from publishing such repeated experiments, we undermine our science. In fact, the more people report their benchmarks, the clearer our strengths and weaknesses become. Historically, this approach has already enabled us to diagnose and fix weaknesses, regardless of whether those were caused by our Solver or by the user’s domain-specific implementation.
Therefore, OptaPlanner welcomes external benchmark reports. I believe in Open Science, as strongly as I believe in Open Source. I do ask, as a courtesy, that you allow public comments/feedback on a public report website and that you publish the details (such as the Solver configuration). If you use the OptaPlanner Benchmarker toolkit (which you will find convenient), simply share the benchmarker HTML report.
To run any of the benchmarks of the OptaPlanner Examples locally, simply run a *BenchmarkApp executable class, for example CloudBalancingBenchmarkApp. Notice how a small change in the *BenchmarkConfig.xml, such as switching score calculation from Easy Java to Drools or from Drools to Incremental Java, can have a serious effect on the results.
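If you prefer to launch the benchmarker from your own code instead of an example’s *BenchmarkApp class, a minimal sketch could look like the one below. It assumes OptaPlanner’s benchmark module is on the classpath; the benchmark config resource path is an illustrative guess, so point it at whichever *BenchmarkConfig.xml you actually want to run.

```java
import org.optaplanner.benchmark.api.PlannerBenchmark;
import org.optaplanner.benchmark.api.PlannerBenchmarkFactory;

public class MyBenchmarkApp {

    public static void main(String[] args) {
        // Load a benchmark configuration XML from the classpath.
        // The resource path below is an assumption: replace it with the
        // *BenchmarkConfig.xml of the example you want to benchmark.
        PlannerBenchmarkFactory benchmarkFactory = PlannerBenchmarkFactory.createFromXmlResource(
                "org/optaplanner/examples/cloudbalancing/benchmark/cloudBalancingBenchmarkConfig.xml");

        // Runs every solver configuration in that XML on every dataset and
        // writes the benchmarker HTML report to the configured benchmark directory.
        PlannerBenchmark benchmark = benchmarkFactory.buildPlannerBenchmark();
        benchmark.benchmark();
    }
}
```

Switching the score calculation then comes down to editing the scoreDirectorFactory section of each solver config in that XML (for example an easyScoreCalculatorClass versus a scoreDrl), although the exact element names depend on your OptaPlanner version.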
In short: I like external benchmarks, but dislike competitive benchmarks, except for …
Independent research challenges
Can we compare fairly with our competition? Yes, through an independent research challenge.
Regularly, the academic community launches such challenges. Each challenge:
- defines a real-world use case with real-world constraints
- provides multiple, real-world datasets (half of which they keep hidden)
- expects reproducible results within a specific time limit on specific hardware
- gets worldwide participation from the academic and/or enterprise Operations Research community
- benchmarks each contestant’s implementation on the same hardware in the same time limit to determine a winner
- benchmarks those hidden datasets to counter overfitting and dataset recognition
It’s fair: each jockey rides his own horse. Most of the arguments against competitive benchmarking do not apply. And as an added bonus, we get to learn from and compare with the academic research community.
In the past, OptaPlanner has done well on these challenges, despite the limited weekend time we have to spend on them. In the last challenge, the ICON power scheduling challenge, we (Lukas, Matej and I) finished in 2nd place. A minority of the researchers still beat us (with their innovative algorithms in their experimental contraptions and massive time to tweak/build those), but it’s been years since a competing Solver has beaten us.
Long term vision
Sharing our benchmarks and enabling others to easily reproduce them is part of a bigger vision: too many research papers (on metaheuristics and other optimization algorithms) are hard to reproduce. That’s the paradox in computer science research: to reproduce the findings of a research paper, all we really need is a computer and the code. We don’t need an expensive laboratory. Yet, in practice, the code is usually closed and the raw benchmark data is not accessible. It’s as if everyone is scared of sharing the dirty secrets of their code and their benchmarks.
I believe that we - the worldwide optimization research community - need to create a benchmark repository: a centralized repository of benchmarks for every use case, for every dataset, for every algorithm, for every implementation version, for any amount of running time. That, together with a good statistical interface, will give us some real insight as to which optimization algorithms are good under which circumstances.
We - in OptaPlanner - are well on our way to building exactly that:
- OptaPlanner Examples already implements 14 distinct use cases.
- For each use case, we already benchmark many different optimization algorithms.
- Our benchmarker HTML report already includes many useful statistics to analyze the raw benchmark data.
Join us :)