Machine Translation and Multilingual E-Discovery

E-discovery hit prime time in the November 2015 episode of The Good Wife, appropriately entitled “Discovery”. Chumhum, a large search company, is being sued by a restaurant owner because one of its applications, Chummy Maps, has designated the restaurant’s neighborhood as potentially unsafe and hence put her out of business. The key accusation is that the safety level algorithm is based on racial demographic profiling. Fans of the mega-popular series got a peek behind the scenes of how digital forensic methods are used to extract potentially relevant evidence from terabytes of electronically stored information (ESI).

E-discovery was born in 1970 when the US Federal Rules of Civil Procedure (FRCP) were amended for the first time to make “data compilations” discoverable. In the decade that followed, there were fewer than 10 e-discovery cases; in the half-decade between 2010 and 2015, more than 360 such cases were tried before US courts.^[1] Examples of the types of ESI included in e-discovery are e-mails, instant messaging chats, documents, accounting databases, CAD/CAM files, and Web sites.^[2]

Now take the breathtaking scope of e-discovery in and of itself and put it in the context of a lawsuit conducted across borders, including countries where the ESI is not in English. Given economic globalization, multilingual litigation is no longer rare; according to Gartner, 80% of all litigation will be multilingual by 2020.

One of the ways to manage the daunting costs and time frames of multilingual e-discovery is to use Machine Translation (MT). However, there is an ongoing debate as to whether MT is good enough to ensure a reliable e-discovery process. Despite the advances in MT since its commercial debut in the 1980s, there are those who argue that MT is still too literal: nuanced meanings in the original are lost, and grammatical errors are introduced into the finished product. Specifically in the case of e-discovery, there are concerns that MT does not account for differences in legal terminology and legal systems across cultures. Last but not least, submitting legal documents to online machine translation systems could very well constitute a breach of confidentiality.^[3]

No machine learning technology can replace the distinctly human ability to understand context and bridge cultural gaps. Here are some ideas for how human translation and MT can be used together in multilingual e-discovery to address both quality and cost constraints:

MT can be effective when formal or formulaic language is used, as in legal or scientific documents.^[4]
MT can be used to produce a “gist translation” that is good enough for the English-speaking review team to categorize documents. The team can then obtain a certified human translation of only the subset of documents that are most relevant.^[5]
Highly customized MT engines can be used, supported by a post-editing process involving legal translation experts.
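
For readers who think in code, the gist-then-certify idea above can be pictured as a simple two-stage filter. What follows is only a minimal sketch under stated assumptions, not a description of any real e-discovery product: the `Document` class, the `gist_translate` stand-in (which merely tags the text instead of translating it), and the toy relevance rule are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    source_text: str
    gist: str = ""          # rough machine translation, filled in by stage 1
    relevant: bool = False  # the review team's decision, filled in by stage 2


def gist_translate(text: str) -> str:
    """Hypothetical stand-in for an MT engine; it only tags the text."""
    return f"[gist] {text}"


def triage(documents, is_relevant):
    """Gist-translate every document, then return those marked relevant.

    `is_relevant` stands in for the human review team's judgment of each gist.
    """
    for doc in documents:
        doc.gist = gist_translate(doc.source_text)
        doc.relevant = is_relevant(doc.gist)
    return [doc for doc in documents if doc.relevant]


docs = [
    Document("A1", "Vertrag zur Lieferung von Bauteilen"),
    Document("A2", "Einladung zum Sommerfest"),
]
# Toy reviewer rule (an assumption): flag anything mentioning a contract.
for_human_translation = triage(docs, lambda gist: "Vertrag" in gist)
print([doc.doc_id for doc in for_human_translation])  # prints ['A1']
```

The point of the sketch is the cost structure: every document passes through the cheap machine stage, while only the documents the reviewers flag move on to the expensive certified human translation.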