What was so intriguing about that slide was that, apart from a small number of outliers, almost all of the systems evaluated were in the bottom-left quadrant, with both precision and recall (typically well) below 50%. Based on my expectations from my previous line of work, this was quite shocking, and had I known this earlier, I would certainly have worded a paper or two a bit differently as far as using e-discovery as an example is concerned. All the same, I suppose even at these figures e-discovery already outperforms all the alternatives, but there is certainly still considerable room for improvement.
What I was accustomed to was that a marketable product should deliver well above 90% in both columns or else the users would just stop using it. Whether the figures are 93% or 98% is not all that important, because that kind of variation is mostly just noise and depends on how well suited the test materials happen to be for that system. In particular, if you use a manually annotated test corpus as a gold standard in a commercial scenario, you can really only use it once, because whatever is left between the actual performance and 100% is what we in the business call bugs, and they should be dealt with unless there is a good reason not to do so. And so our gold standard becomes tainted once these bugs are fixed. Which is of course no reason not to use the new figures for marketing purposes. (‘Press statements aren’t delivered under oath.’ - Jim Hacker, PM)
Performance is not just an e-discovery issue. It is raised in many other legal technology contexts as well. For example, the Swedish trade mark law start-up Markify prides itself on the ‘99% accuracy’ of its system, as in this recent Arctic Startup profile. The actual study on which this claim is based is also available. The results are based on querying a set of 1000 actual cases of successful US trade mark oppositions, and the question was whether the different services would return the correct mark (that of the opponent) when queried for the mark that is being opposed. Here are the results, and for
System | Recall | Precision | F-score |
---|---|---|---|
Markify | 99.7% | 0.02% | 0.04% |
Thomson Reuters Compumark | 45.5% | 0.21% | 0.4% |
CT Corsearch | 34.6% | 0.43% | 0.85% |
USPTO | 34.2% | 0.31% | 0.6% |
Trademarkia | 32.5% | 0.75% | 1.5% |
CSC | 31.8% | 0.55% | 1.1% |
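
For readers who want to check the last column: a minimal sketch, assuming the F-score reported here is the balanced F1 (the harmonic mean of precision and recall), which the source does not state explicitly. The function name `f_score` and the `results` dictionary are my own; the figures are copied from the recall and precision columns of the table.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (F1): the harmonic mean of precision and recall.

    Works on fractions or on percentages alike, as long as both
    arguments use the same unit; the result is in that unit too.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall %, precision %) as listed in the table above
results = {
    "Markify": (99.7, 0.02),
    "Thomson Reuters Compumark": (45.5, 0.21),
    "CT Corsearch": (34.6, 0.43),
    "USPTO": (34.2, 0.31),
    "Trademarkia": (32.5, 0.75),
    "CSC": (31.8, 0.55),
}

# Assumption: F1, i.e. precision and recall weighted equally
f_scores = {name: f_score(p, r) for name, (r, p) in results.items()}

for name, f in f_scores.items():
    print(f"{name}: F = {f:.2f}%")
```

Because the harmonic mean is dominated by the smaller of its two inputs, a recall close to 100% does almost nothing for the F-score when precision is a small fraction of a percent.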
If recall is all you are after, improving on that 99% is really easy. Simply by returning the entire database for each query you can reach 100% just like that. At the same time precision naturally drops down to epsilon, but so what. Of course this is not quite fair (the study gave no proper indication of where the correct answer was placed on the list of results), but still, just returning the desired answer is definitely not enough, at least when it is returned in needle-in-a-haystack mode: even if the result is there, the longer the list is, the more likely it is to be missed by the person reading it. For what it’s worth, I tried to search for ‘äpyli’ (that’s Helsinki slang for ‘apple’) on Markify's system and quit after 10 pages of results, at which point the trade mark of a well-known Cupertino-based fruit company had not yet shown up, and the results being shown were already much further away from the query. I suppose ‘the other [sic!] high quality paid trademark search services that can run $500 a word’ can still breathe easy.
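The return-everything trick can be illustrated with a toy calculation. This is only a sketch under stated assumptions: the database size of one million marks and the single relevant mark per query are hypothetical numbers chosen for illustration, not figures from the study, and `precision_recall` is my own helper.

```python
def precision_recall(returned: set, relevant: set) -> tuple:
    """Return (precision, recall) for one query, both as fractions."""
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical database of one million marks, identified by number
DATABASE_SIZE = 1_000_000
database = set(range(DATABASE_SIZE))

# Hypothetical query with exactly one correct answer (the opponent's mark)
relevant = {123_456}

# "Search engine" that simply returns the whole database for every query
precision, recall = precision_recall(database, relevant)

print(f"recall    = {recall:.0%}")
print(f"precision = {precision:.6%}")
```

Recall comes out perfect because the correct mark is necessarily somewhere in the list, while precision is one over the size of the database. Neither number says anything about how far down the list the reader has to scroll, which is the part that matters in practice.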