Sunday 6 May 2012

Lies, damned lies, and performance metrics

At the DESI IV e-discovery workshop at ICAIL last June, one of the presentations had a single slide that really caught my attention. It showed the performance of a large number of e-discovery systems on two axes representing the most common metrics used in information retrieval, computational linguistics, and what have you: precision and recall. (The slide was based on an earlier publication whose reference I unfortunately did not note down at the time and have not been able to find since; pointers extremely welcome.)

What was so intriguing about that slide was that, apart from a small number of outliers, almost all of the systems evaluated sat in the bottom left quadrant, with both precision and recall (typically well) below 50%. Coming from my previous line of work, I found this quite shocking, and had I known it earlier, I would certainly have worded a paper or two a bit differently wherever e-discovery was used as an example. All the same, I suppose even at these figures e-discovery already outperforms all the alternatives, but there is certainly still considerable room for improvement.

What I was accustomed to was that a marketable product should deliver well above 90% on both measures, or else the users would simply stop using it. Whether it is 93% or 98% is not all that important, because that kind of variation is mostly just noise and depends on how well suited the test materials happen to be for that particular system. In particular, if you use a manually annotated test corpus as a gold standard in a commercial setting, you can really only use it once: whatever lies between the actual performance and 100% is what we in the business call bugs, and they should be dealt with unless there is a good reason not to. And so the gold standard becomes tainted once those bugs are fixed. Which is of course no reason not to use the new figures for marketing purposes. (‘Press statements aren’t delivered under oath.’ - Jim Hacker, PM)

Performance is not just an e-discovery issue; it comes up in many other legal technology contexts as well. For example, the Swedish trade mark law start-up Markify prides itself on the ‘99% accuracy’ of its system, as in this recent Arctic Startup profile. The actual study on which this claim is based is also available. The results are based on querying a set of 1000 actual cases of successful US trade mark oppositions, the question being whether the different services would return the correct mark (that of the opponent) when queried for the mark being opposed. Here are the results, and for added entertainment and easier overall comparison I have also computed the F-scores for all of them (the calculation is sketched just after the table):

System                        Recall    Precision    F-score
Markify                       99.7%     0.02%        0.04%
Thomson Reuters Compumark     45.5%     0.21%        0.4%
CT Corsearch                  34.6%     0.43%        0.85%
USPTO                         34.2%     0.31%        0.6%
Trademarkia                   32.5%     0.75%        1.5%
CSC                           31.8%     0.55%        1.1%
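
The F-score here is the usual balanced F1, the harmonic mean of precision and recall, which is dominated by the smaller of the two numbers; that is why even 99.7% recall cannot rescue the score when precision sits at 0.02%. A minimal sketch of the calculation in Python (the recall and precision figures are from the study above; the little script itself is only an illustration):

def f1(recall, precision):
    # Balanced F-score: harmonic mean of precision and recall.
    # Works directly on percentages; the result comes out in percent as well.
    if recall + precision == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# (system, recall %, precision %) as reported in the study
systems = [
    ("Markify", 99.7, 0.02),
    ("Thomson Reuters Compumark", 45.5, 0.21),
    ("CT Corsearch", 34.6, 0.43),
    ("USPTO", 34.2, 0.31),
    ("Trademarkia", 32.5, 0.75),
    ("CSC", 31.8, 0.55),
]

for name, recall, precision in systems:
    print("%-28s F1 = %.2f%%" % (name, f1(recall, precision)))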

If recall is all you are after, improving on that 99% is really easy: simply return the entire database for each query and you reach 100% just like that. Precision naturally drops to epsilon, but so what. Of course this comparison is not quite fair (the study gave no proper indication of where the correct answer was placed in the list of results), but still, merely returning the desired answer somewhere is definitely not enough, at least when it is returned needle-in-a-haystack style: the longer the list, the more likely the person reading the results is to miss the right answer even when it is there. For what it’s worth, I tried searching for ‘äpyli’ (that’s Helsinki slang for ‘apple’) on Markify's system and quit after 10 pages of results, by which time the trade mark of a well-known Cupertino-based fruit company had not yet shown up, and the results being shown at that point were already much further away from the query. I suppose ‘the other [sic!] high quality paid trademark search services that can run $500 a word’ can still breathe easy.
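
To make the return-the-whole-database point concrete, here is a toy sketch; the database size and the one-relevant-mark-per-query setup are made-up illustration values, not figures from the study:

# A 'search engine' that answers every query with the entire database.
DATABASE_SIZE = 1000000     # made-up number of registered marks
RELEVANT_PER_QUERY = 1      # the single opposing mark we actually want

retrieved = DATABASE_SIZE                  # everything comes back
relevant_retrieved = RELEVANT_PER_QUERY    # so the right answer is always in there

recall = 1.0 * relevant_retrieved / RELEVANT_PER_QUERY   # 1.0, i.e. 100%
precision = 1.0 * relevant_retrieved / retrieved         # 0.000001, i.e. epsilon

print("recall = %.0f%%, precision = %.4f%%" % (100 * recall, 100 * precision))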
