Sunday 6 May 2012

Lies, damned lies, and performance metrics

In the DESI IV e-discovery workshop at ICAIL last June, one presentation included a single slide that really caught my attention. It showed the performance of a large number of e-discovery systems on two axes representing the most common metrics used in information retrieval, computational linguistics, and what have you: precision and recall. (The slide was based on an earlier publication which I unfortunately did not register at the time and have not been able to find since; pointers extremely welcome.)
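For readers who do not juggle these metrics daily: precision is the share of the returned results that are actually relevant, and recall is the share of the relevant documents that actually get returned. A minimal sketch in Python (the function names are my own, not anyone's API):

    def precision(retrieved, relevant):
        """Share of retrieved documents that are relevant."""
        retrieved, relevant = set(retrieved), set(relevant)
        return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

    def recall(retrieved, relevant):
        """Share of relevant documents that were retrieved."""
        retrieved, relevant = set(retrieved), set(relevant)
        return len(retrieved & relevant) / len(relevant) if relevant else 0.0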

What was so intriguing about that slide was that apart from a small number of outliers, almost all of the systems evaluated sat in the bottom-left quadrant, with both precision and recall (typically well) below 50%. Given my expectations from my previous line of work, this was quite a shock, and had I known it earlier, I would certainly have worded a paper or two a bit differently as far as using e-discovery as an example is concerned. All the same, I suppose even at these figures e-discovery already outperforms all the alternatives, but there is certainly still considerable room for improvement.

What I was accustomed to was that a marketable product should deliver well above 90% on both metrics or else the users would simply stop using it. Whether the figure is 93% or 98% is not all that important, because that kind of variation is mostly just noise and depends on how well suited the test materials happen to be for that system. In particular, if you use a manually annotated test corpus as a gold standard in a commercial setting, you can really only use it once: whatever is left between the actual performance and 100% is what we in the business call bugs, and bugs should be dealt with unless there is a good reason not to. And so our gold standard becomes tainted the moment those bugs are fixed. Which is of course no reason not to use the new figures for marketing purposes. (‘Press statements aren’t delivered under oath.’ - Jim Hacker, PM)

Performance is not just an e-discovery issue; it comes up in many other legal technology contexts as well. For example, the Swedish trade mark law start-up Markify prides itself on the ‘99% accuracy’ of its system, as in this recent Arctic Startup profile. The actual study on which this claim is based is also available. The results are based on querying a set of 1000 actual cases of successful US trade mark oppositions, and the question was whether the different services would return the correct mark (that of the opponent) when queried for the mark being opposed. Here are the results; for added entertainment and easier overall comparison, I have also computed the F-scores for all of them (a sketch of the computation follows the table):

System                       Recall   Precision   F-score
Markify                      99.7%    0.02%       0.04%
Thomson Reuters Compumark    45.5%    0.21%       0.4%
CT Corsearch                 34.6%    0.43%       0.9%
USPTO                        34.2%    0.31%       0.6%
Trademarkia                  32.5%    0.75%       1.5%
CSC                          31.8%    0.55%       1.1%
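The F-score in the last column is the harmonic mean of precision and recall, which is why a microscopic precision drags the combined score down no matter how stellar the recall is. For the record, here is how I computed it (my own quick arithmetic, nothing from the study itself):

    def f_score(precision, recall):
        """Balanced F1 score: the harmonic mean of precision and recall."""
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    # the Markify row, for instance:
    print(f_score(0.0002, 0.997))  # ~0.0004, i.e. 0.04%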

If recall is all you are after, improving on that 99% is really easy: simply return the entire database for each query and you reach 100% just like that, while precision naturally drops down to epsilon, but so what (a toy illustration below). Of course this comparison is not quite fair (the study gave no proper indication of where the correct answer was placed on the list of results), but still, just returning the desired answer is definitely not enough, at least when it is returned needle-in-a-haystack style: even if the result is there, the longer the list, the more likely the person reading the results is to miss it. For what it’s worth, I tried searching for ‘äpyli’ (that’s Helsinki slang for ‘apple’) on Markify's system and quit after 10 pages of results, by which time the trade mark of a well-known Cupertino-based fruit company had not yet shown up, and the results being shown were already much further away from the query. I suppose ‘the other [sic!] high quality paid trademark search services that can run $500 a word’ can still breathe easy.
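To put numbers on the return-the-whole-database trick: if each query has exactly one correct answer somewhere among N marks, dumping everything gives perfect recall and a precision of exactly 1/N. A toy illustration (the database size is invented):

    N = 1000000         # invented database size
    hits_per_query = 1  # exactly one correct mark per query

    recall = 1.0                    # the right mark is always somewhere in the output
    precision = hits_per_query / N  # ...buried under N - 1 wrong ones
    f1 = 2 * precision * recall / (precision + recall)
    print(precision, f1)            # 1e-06, ~2e-06: epsilon indeed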

Saturday 5 May 2012

What is Legal Technology?

Makeovers 'R' Us
In my opinion, AI & law desperately needs a makeover. One simple but effective way for the field to reinvent itself is rebranding, and the best label I can think of is Legal Technology (oikeusteknologia, rättsteknologi, ret(t)steknologi, õigustehnoloogia, R/rechtstechnologie, technologie juridique, tecnologia giuridica, юридическая технология &c).

As someone who has followed and worked in language technology for about two decades now, I see AI & law as being in the same state today that language technology was in in the early 1990s. I have presented some lessons learned on how to approach real-world problems at the detail level in my robo-judge paper, so I won't go into them here. Instead, my proposal here looks at language technology as a field that has successfully reinvented itself a couple of times already. Early on, it was known only as natural language processing (NLP), a subfield of AI and a form of basic research, rarely with any concrete application in mind. (A notable exception to this is machine translation, which also happens to be older than the term AI itself. More on that in a separate post as well as an article written jointly by me and Anniina Real Soon Now.) Then came computational linguistics, which was centered on using computational models and techniques as tools for linguistic research. (This is where I think AI & law is now.) Of these, corpus linguistics in particular has become mainstream in virtually all subfields of linguistics, but other computational methods are now widely used outside computational linguistics proper as well. Through the 1990s, computational linguistics also started to find its way into commercial applications in domains such as language checking, information retrieval, text-to-speech and vice versa, dialogue systems, and machine translation. As these real-world applications started to generate increasingly important research questions in their own right, language technology was born.

"Legal technology" as a term is not my invention. For example, in the US there has been a bicoastal biannual conference called LegalTech® since 2006. As far as I know, most of the technologies presented there are are not all that interesting from an AI & law perspective, with topics such as case management and billing platforms, synchronizing your BlackBerry with your Outlook and stuff like that, and whatever new cruft Westlaw and LexisNexis have come up with each year.

More to the point are, for example, the LawTechCamps arranged by Daniel Martin Katz (of Computational Legal Studies) and others in June 2011, next week in Toronto, and at the end of June in London. There is also a growing number of start-up companies in the field, at least in the US, as listed just the other day on the eLawyering blog. Most of the start-ups listed seem to be working on applications having to do with contracts (possibly a sign of flock mentality on the venture capital side?). Contracts are also the target of the only legal tech start-up I know of here in Finland, Sopima. With a large number of companies in the same domain fighting over the same market from somewhat different perspectives, it is clear that only some of them will be able to succeed (at least as far as the US companies are concerned; Europe is a different kettle of fish because of its different legal culture(s) and the prevalence of non-English languages).

The best products have to address a real-world problem and solve it well and efficiently. Usability is another key success factor, and it still seems to be generally neglected in legal IT. Just because a certain design is a possible way to do something does not mean it is the best way (indeed it rarely is, though at least it usually is not quite this bad; required reading: Donald Norman's The Design of Everyday Things, MIT Press 1989). In particular, just replicating ancient practices from the age of pen and paper (and secretaries) and possibly adding some bells and whistles is, unfortunately, a tried and true pattern. And the result is an application that takes a week-long course just to get started with.

All the same, the technically best solution does not necessarily win the game. In the end, it all boils down to the viability of the business model and the ability to make it into a reality. (Here's a convenient rule of thumb: marketing costs money, selling makes money. Close early, close often.)

So how can the AI & law community contribute to the impending legal technology boom? One approach is to take an existing, reasonably well-developed Good Old-Fashioned AI & law technology and find a real-life legal problem it could plausibly solve. (I'm afraid I can't come up with an example.) The other approach is to take an existing problem (= market need) in the legal community, a problem of the kind that should be solvable by computing, and to look all over computer science for that solution. (Here e-discovery is a prime example, though it does not travel well, and performance-wise it is quite disappointing by the language technology metrics I'm used to; at least it is as reliable as people doing the same job while being faster and cheaper.) Since language plays a key role in law, language technology is one obvious place to look, but it should definitely not be the only one for any legal tech company. I'm sure the next 20 years will be a lot more interesting (and profitable) for the field than the past 20.