Hacks for Finding Foreign Languages in Your eDiscovery Data
A New Morningside Translations Article Featured on the Relativity Blog.
Relativity is the most popular eDiscovery platform for lawyers and legal professionals. We recently published a blog post on Relativity’s website highlighting three important tricks to quickly identify critical foreign language information. Check it out below.
You are an eDiscovery professional at a big law firm. You sit down at your desk, log into Relativity, and start sifting through documents. You’re cruising right along, finding exactly what you need and you’re even ahead of your deadline. You start to think about lunch. Should I get a $15 salad from the place across the street? Should I just get pizza? Get pizza, you deserve it. Get a whole pie. You’re doing great.
Then you come across a huge cache of foreign-looking documents. You think they’re in Romanian, but you don’t actually know Romanian. You also don’t know Latvian, Lithuanian, or Polish. Could it be one of those? A bead of sweat forms on your brow and you start to panic, running through your options in a mental catalog:
- To ask one of the partners what to do, turn to page 17
- To go down a 13-hour internet rabbit hole, turn to page 25
- To run to the parking lot to cry in your car, turn to page 33
Don’t love those options? Luckily, there are a few helpful hacks that can save you a ton of time and frustration after you stumble upon foreign language documents halfway through your review—or, even better, prevent this heart-sinking moment and find them at the outset.
Identifying foreign languages as early as possible in your review process is critical to achieving clear, predictable costs, preventing unnecessary delays, and constructing a workflow that makes sense. Here are several easy tricks to help you navigate foreign eDiscovery waters.
#1: Use foreign language stop words
If you don’t have Relativity Analytics, or if you’re looking for a quick and easy way to scan your data set for a certain foreign language, then a creative use of stop words in a dtSearch might help.
Stop words, also called noise words, are the most frequently used words in a given language (for example, in English: and, the, my, all, for). They are typically filtered out of a dtSearch or keyword search, as they tend to be so common that they don’t return valuable search results. However, their frequency also makes them a great way to find foreign language documents.
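Because it’s safe to assume that stop words can be found in just about any piece of text, a dtSearch for a list of stop words will likely return any documents in the foreign language. If, for example, you believe your data set may contain German, then a search for German stop words will hopefully return any documents with German text. Expect a few false positives, since some short words exist in more than one language (the German “die” and “in” are also English words), so treat the hits as a starting point rather than a final answer.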
Note that each language has its own unique set of stop words, so rather than translate a list of English words, it’s best to obtain a list of stop words in the desired foreign language from a legal language services expert.
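To make the idea concrete, here is a minimal Python sketch of the stop words hack run against plain extracted text. It is only an illustration: the eight German words are a small sample rather than a vetted list, the function names and the `min_hits` threshold are made up for this example, and the OR-joined string is a generic boolean query you would still need to check against your own search index settings (an index that filters some of these words as noise words would ignore them).

```python
import re

# Illustrative sample of German stop words (an assumption, not a vetted list).
GERMAN_STOP_WORDS = {"und", "der", "das", "nicht", "ist", "mit", "für", "auch"}


def build_or_query(words):
    """Join stop words into one boolean OR query string."""
    return " OR ".join(sorted(words))


def stop_word_hits(text, stop_words):
    """Count how many tokens in the text appear in the stop word set."""
    tokens = re.findall(r"\w+", text.lower())
    return sum(1 for token in tokens if token in stop_words)


def flag_documents(docs, stop_words, min_hits=3):
    """Return the IDs of documents with at least min_hits stop word tokens."""
    return [
        doc_id
        for doc_id, text in docs.items()
        if stop_word_hits(text, stop_words) >= min_hits
    ]


# Hypothetical documents keyed by control number.
docs = {
    "DOC-1": "Das ist nicht der Vertrag und auch nicht mit der Anlage.",
    "DOC-2": "Please find the contract attached.",
}

query = build_or_query(GERMAN_STOP_WORDS)
flagged = flag_documents(docs, GERMAN_STOP_WORDS)
```

Requiring several hits per document, rather than one, is a simple way to cut down on false positives from words that happen to exist in more than one language.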
#2: Run language identification
While the stop words hack gets the job done, it requires that you have an idea of which languages are in your data set and proves tedious if you want to search for more than one language. Full language identification analysis is preferable for data sets that may contain multiple languages, or if you simply want to double check for any foreign languages before moving forward with your review.
Language identification uses machine learning to detect the languages in a piece of text automatically. A feature in Relativity Analytics, it returns the primary and up to two secondary languages in a document, along with the percentage breakdown of each language.
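If it helps to picture that output, the toy Python sketch below returns a primary language and up to two secondary languages with a rough percentage for each. To be clear, this is not how Relativity Analytics works: it swaps the machine learning for simple stop word counting, and the word lists and the `identify_languages` function are assumptions invented for this illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop word samples (assumptions, not vetted lists).
STOP_WORDS = {
    "English": {"the", "and", "of", "to", "is", "for"},
    "German": {"und", "der", "das", "nicht", "ist", "mit"},
    "Spanish": {"el", "la", "los", "que", "es", "para"},
}


def identify_languages(text, max_languages=3):
    """Return up to max_languages (language, percent) pairs, most frequent first.

    The percentage is each language's share of all stop word matches,
    which is only a crude stand-in for a real language breakdown.
    """
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for token in tokens:
        for language, words in STOP_WORDS.items():
            if token in words:
                counts[language] += 1
    total = sum(counts.values())
    if total == 0:
        return []  # no stop words matched, so no guess is made
    return [
        (language, round(100 * hits / total))
        for language, hits in counts.most_common(max_languages)
    ]


sample = (
    "The contract is for the buyer. "
    "Der Vertrag ist nicht mit der Anlage und das ist alles."
)
result = identify_languages(sample)
```

For the mixed English and German sample, the first entry in `result` plays the role of the primary language and any remaining entries are the secondary ones.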
From here, you can use the language identification output to guide your next steps: build dashboards for a bird’s-eye view of documents, custodians, and control numbers by language; batch documents by language so they can be sent to foreign language reviewers efficiently; and send any documents with foreign language text for machine translation so you can review the gist in English. No matter which approach you take, language identification results lay the groundwork for the rest of your review workflow.
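The batching step is easy to picture in code. The short Python sketch below groups documents by primary language and splits each group into fixed-size review batches. It assumes you have exported control numbers alongside their primary language; the `batch_by_language` function, the sample control numbers, and the batch size are all hypothetical.

```python
from collections import defaultdict


def batch_by_language(documents, batch_size=2):
    """Group (control_number, language) pairs into per-language batches."""
    by_language = defaultdict(list)
    for control_number, language in documents:
        by_language[language].append(control_number)

    batches = {}
    for language, ids in by_language.items():
        # Slice each language's documents into chunks of batch_size.
        batches[language] = [
            ids[i:i + batch_size] for i in range(0, len(ids), batch_size)
        ]
    return batches


# Hypothetical export of control numbers and primary languages.
exported = [
    ("DOC-001", "German"),
    ("DOC-002", "English"),
    ("DOC-003", "German"),
    ("DOC-004", "German"),
]

batches = batch_by_language(exported)
```

Each language ends up with its own list of batches, so the German batches can go to German-speaking reviewers while the English documents stay in your regular queue.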
#3: Recognize that the internet is your friend—except for when it’s not
The beauty of the internet is that you can find almost anything you’re looking for with the click of a button. A simple Google search of stop words in your suspected language will net you some quick and reliable returns. Searching “Spanish stop words,” for example, points you to a comprehensive list of stop words in over 40 languages. The internet is pretty great, am I right? But don’t let it give you a false sense of security.
We’re all aware of the free translation tools out there. You might think that copying and pasting from your documents into one of these free engines is your ticket out of this language identification mess, but before you go down that road, there are a few important issues to consider:
- Copying and pasting is extremely tedious when you consider the volume of documents you’re likely dealing with. “Ctrl+C, Ctrl+V” isn’t really a feasible option when confronted with hundreds or thousands of documents.
- Free online translation tools are not secure. Once you input text into one of these tools, the provider’s terms of service may allow it to store or reuse that text. In most cases, you’re dealing with sensitive documents that shouldn’t be exposed to third parties. But, of course, you already know that.
Choosing one of the hacks above rather than a free online translation tool is a surefire way to keep your data secure and allow your team the time they need to focus on building a killer case.
So you found foreign language documents. Now what?
Now it’s time to determine whether those foreign language documents are relevant, privileged, or something else—in other words, to figure out what they say. To do that, you’ll likely want to partner with a trusted language service provider. Choosing a reliable provider is a topic for another day, but here are a few quick tips to get started:
- Make sure they have ISO-certified quality — Bad translations can cause confusion and cost you time and money. Defend yourself against them. Choosing a provider certified by the International Organization for Standardization is a good start.
- Make sure they have extensive experience in eDiscovery — Most often, a combination of tools—such as machine translation, foreign language review, and keyword search term translation—will optimize your time and costs, so make sure your provider is familiar with all of them and how they apply to these types of projects.
- Make sure they are familiar with your chosen technology — Selecting a partner who is already comfortable in your eDiscovery software can save time, boost security, and prevent headaches. Some may even have a dedicated application for your platform, like Morningside’s Relativity plugin, providing dedicated support inside the tool you already know.
With these simple workflow hacks, you have some better options to choose your own eDiscovery translation adventure. Have you used any of them before? Let us know in the comments.
ABOUT THE AUTHOR:
As VP, Business Development at Morningside Translations, Dylan Blaney leverages his industry expertise to provide comprehensive language solutions to law firms, eDiscovery service providers, and in-house legal teams. Under Dylan’s leadership, Morningside’s legal group advises international law firms and corporate legal and compliance departments around the world on legal translation and interpretation best practices.