What is this TAR (Technology Assisted Review) that people are talking about, and what do they really mean by this term? Perhaps more importantly, why should you care? Part 1 of this three-part article will discuss the terminology of TAR. Part 2 will drill into two specific statistical methodologies commonly used in TAR, support vector machines and conceptual search: how they work, what they do and which are appropriate for particular use cases. Part 3 will focus on the practical: What do the cases say? How can parties use TAR in a defensible way? How can TAR help to achieve proportionality?
The identification, collection and production of electronically stored information (ESI) has brought about a revolution in litigation discovery, both because of the sheer quantity of information that must be addressed and because of the industry that the ESI revolution has spawned: the e-discovery and litigation support industry. The first time that a lawyer used software to view the images of scanned documents was a use of technology that assisted the lawyer's review, although that review was still a linear, document-by-document review, and searching the documents was only possible in a limited way, if at all. Viewed against that backdrop, many technological developments in this industry can on some level be termed “Technology Assisted Review.”
Today, many terms are heard in the marketplace in addition to “TAR”: Categorization, concept searching, predictive coding, clustering, and other terms. A full-blown discussion of all of these terms would be encyclopedic, but it is possible to present a high-level understanding of them.
In the beginning, there was Keyword Search TAR
Keyword searching, in which a pre-determined set of words is run against all of the documents and only those documents that “hit” on one or more keywords are reviewed, was a breakthrough technique. Nonetheless, studies (including the Blair and Maron study and others) conclude that keyword searching yields only 20% to 40% recall of relevant documents. Despite these limits, keyword search remains the most common approach used today for reducing the number of documents to be reviewed.
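The recall limitation can be illustrated with a toy example. Everything below (the document text, the keyword list, and the “ground truth” relevance calls) is invented for illustration: a keyword search retrieves only documents containing the chosen terms, and recall is the fraction of truly relevant documents the search actually finds.

```python
# Toy corpus with assumed ground-truth relevance (illustrative only).
docs = {
    1: "The merger agreement was signed by both parties.",
    2: "Lunch plans for Friday.",
    3: "Concerns about the agreement's indemnification clause.",
    4: "The deal terms were renegotiated after due diligence.",
}
relevant = {1, 3, 4}            # documents a human review would call relevant
keywords = {"merger", "agreement"}

# A document "hits" if it contains any keyword.
hits = {doc_id for doc_id, text in docs.items()
        if any(kw in text.lower() for kw in keywords)}

# Recall: what fraction of the truly relevant documents did we retrieve?
recall = len(hits & relevant) / len(relevant)
print(hits)    # → {1, 3}
print(recall)  # → 0.666... (document 4 is relevant but uses different vocabulary)
```

Document 4 is relevant but discusses the “deal” rather than the “merger agreement”, so the keyword search misses it: exactly the vocabulary-mismatch problem behind the low recall figures the studies report.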
Keyword search was advanced by Boolean searches, which allow combinations of terms to disambiguate over-inclusive keywords. For example, a search string might read “include (keyword1, keyword2, keyword3, etc.) but not (excludeword1, excludeword2, excludeword3, etc.)”. These combinations of related words, or “ontologies”, often include proximity limiters (e.g., Abraham w/2 Lincoln). Boolean searching improved the retrieval of relevant documents, but given the syntax rules required to formulate Boolean searches, litigation counsel had to engage highly skilled linguists to develop productive combinations. This was expensive and time-consuming, although it did create a short-lived burst of prominence and exposure for linguistic experts. Some would say ontology-based search was the beginning of Technology Assisted Review.
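As a rough sketch of how the include/exclude and proximity ideas work (this is a simplification; real review platforms implement much richer query grammars), a Boolean match and a w/2-style proximity check might look like:

```python
import re

def boolean_match(text, include, exclude):
    """Hit if the text contains any include term and no exclude term."""
    words = set(re.findall(r"\w+", text.lower()))
    return bool(words & include) and not (words & exclude)

def within_n(text, a, b, n):
    """Proximity limiter: terms a and b within n words (e.g. Abraham w/2 Lincoln)."""
    words = re.findall(r"\w+", text.lower())
    pos = {a: [], b: []}
    for i, w in enumerate(words):
        if w in pos:
            pos[w].append(i)
    return any(abs(i - j) <= n for i in pos[a] for j in pos[b])

doc = "President Abraham Lincoln delivered the address."
print(boolean_match(doc, include={"lincoln"}, exclude={"tunnel"}))  # → True
print(within_n(doc, "abraham", "lincoln", 2))                       # → True
```

The exclude set is what tames an over-inclusive keyword: “lincoln” alone would also pull in documents about the Lincoln Tunnel, so those are carved out.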
Then came “Conceptual Search” TAR
Recently, sophisticated prioritization algorithms packaged as e-discovery software have exploded onto the scene as an efficient and cost-effective method of slicing and dicing enormous quantities of ESI.
Many algorithms will categorize ESI. Keyword searching, discussed above, is one straightforward example. Other applications use a conceptual indexing algorithm to categorize by concept, subject or legal issue based on document exemplars fed to the system. The marketplace uses terms such as concept search or clustering; these are forms of categorization.
Different statistical models power these algorithms; the models can be based on word indexes, dictionaries, thesauruses, the context within which words commonly occur in the dataset, and ontologies, which are sets of related concepts.
Some tools operate with no human intervention or training, but simply by applying the algorithm, which then organizes the documents into similar clusters. For such applications, however, the clustering (or foldering) that results is not necessarily what the review team would choose or what is helpful; it is driven by the operation of the algorithm. A more productive tool may be one that starts with human review of a set of documents in order to code or differentiate by issue, subject or concept, followed by application of the algorithm and then refinement by additional human input. This is what the e-discovery industry calls “categorization”. Document reviewers can then concentrate on issue categories, which is said to speed up document review.
Clustering and categorization tools may do more than categorize; they may also rank documents by the strength of their association within the cluster. This feature can also speed up document reviews by allowing review teams to concentrate on the highest-ranked documents within each issue or cluster.
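A minimal sketch of clustering with association-strength ranking (pure Python, with invented example documents; commercial tools use far more sophisticated statistical models than word counts and cosine similarity) might look like:

```python
import math
from collections import Counter

def vectorize(text):
    """Represent a document as a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two count vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "merger agreement signed parties",
    "merger deal agreement terms",
    "lunch plans friday cafeteria",
    "friday lunch menu cafeteria",
]
vecs = [vectorize(d) for d in docs]

# Use one document per topic as a seed centroid (illustrative shortcut;
# real clustering algorithms discover centroids themselves).
centroids = [vecs[0], vecs[2]]

# Assign each document to its closest centroid, recording the similarity
# score so documents can be ranked by strength of association.
clusters = {0: [], 1: []}
for i, v in enumerate(vecs):
    sims = [cosine(v, c) for c in centroids]
    best = sims.index(max(sims))
    clusters[best].append((i, sims[best]))

for label, members in clusters.items():
    members.sort(key=lambda p: p[1], reverse=True)  # strongest association first
    print(label, members)
```

The sorted score within each cluster is the “strength of association” ranking described above: reviewers would start with the documents at the top of each list.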
Yes or No: Predictive Coding TAR
Predictive coding applies a binary decision, relevant or not relevant, to a body of documents. The goal of predictive coding software is to assign prioritization scores to documents within a corpus by likelihood of relevance: higher scores indicate documents more likely to be relevant, and lower scores indicate documents less likely to be relevant.
The science underlying predictive coding algorithms is not new; it has been used for decades in other industries such as energy distribution, air traffic control, weather forecasting, and insurance. Any field where known facts can be extrapolated and monitored with a statistically sound control model can successfully apply this science, known as “predictive analytics”. A familiar use of predictive analytics is recommending products based on previous purchases and buying habits; we see it every day when we shop online or listen to Pandora.

Predictive coding requires human involvement. One or more subject matter experts (or SMEs) review a sample of the documents and make yes-no determinations about relevance. The coded sample “trains” an algorithm to score all of the documents in the collection. Additional small batches may have to be reviewed in order to “stabilize” the system, which calculates key statistical benchmarks that allow the review team and e-discovery experts to decide when the system has achieved sufficient stabilization. Some algorithms require or permit “seeding”, which is adding coded documents back into the data set to provide further system training and better refinement of the ranking process.
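The workflow just described (SME coding, training, then scoring the corpus) can be sketched in a few lines. This toy nearest-prototype scorer stands in for the proprietary algorithms real products use, and all the training and corpus data is invented:

```python
import math
from collections import Counter

def vec(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# A coded sample: the SME's yes/no relevance calls (illustrative data).
training = [
    ("merger agreement signed", True),
    ("indemnification clause agreement", True),
    ("lunch plans friday", False),
    ("fantasy football league", False),
]

# "Train" by building one word-count prototype per class from the sample.
rel, not_rel = Counter(), Counter()
for text, is_relevant in training:
    (rel if is_relevant else not_rel).update(vec(text))

def score(text):
    """Prioritization score: higher means more likely relevant."""
    v = vec(text)
    return cosine(v, rel) - cosine(v, not_rel)

# Score the whole corpus and review the highest-scoring documents first.
corpus = ["draft merger agreement attached", "friday lunch reservation"]
ranked = sorted(corpus, key=score, reverse=True)
print(ranked)
```

Seeding, in this sketch, would simply mean appending newly coded documents to `training` and rebuilding the prototypes, which is how additional review batches refine the ranking.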
TAR is Disruptive, But in a Good Way
TAR is generating interest and buzz among law firms and their clients. Law firms, always cautious, are reluctant to try TAR because they fear that the science is a “black box” that they won't be able to explain to a court in the event of a challenge to their document production. They also fear that the expense will outweigh the return. Interestingly, clients are driving some of the uses of TAR because it provides a principled framework for addressing massive volumes of data. Clients see that TAR is likely to result in the review of far fewer documents and a resulting cost savings. Law firms are less enthusiastic about that development, whether from worry about the sufficiency of a document production, a fear of lost review revenue, or both.
Loosely defined, the term “TAR” can cover many technologies. Part 2 of this article will focus on two specific types of TAR, support vector machines and conceptual search, which operate on different statistical algorithms, and will provide an overview of how those models work and which are appropriate for particular use cases.
Copyright © 2019 Legal IT Professionals. All Rights Reserved.