eDiscovery and the data transfer problem
Discovery is a legal device employed by a party in a civil or criminal action, prior to trial, to require the adverse party to disclose information essential to case preparation that only the other party knows or possesses [1]. Electronic discovery, or eDiscovery, is simply discovery applied to electronic media, e.g., email, documents, spreadsheets, schematics, instant messenger logs, and voice mail recordings. eDiscovery is growing. It is growing because the legal requirements are expanding [2]. It is growing because technology is making more and more data readily accessible and therefore discoverable. Forrester expects the eDiscovery technology market to grow from $1.4B in 2006 to $4.8B in 2011; the 2006 Socha-Gelbmann Electronic Discovery Survey estimates growth from $1.8B in 2006 to $3.1B in 2008 [3]. Machine costs should range from 15% to 25% of the total.
I have spent the last two years working for Gallivan, Gallivan & O’Melia (GG&O), a Seattle-based firm that offers both software and consulting to law firms and enterprises for electronic discovery. GG&O’s capabilities (and hence my experience) range from forensic data acquisition through native document processing and review support to imaging for production. I functioned as both the technical and the managerial lead for the Silicon Valley office.
During my tenure, a “standard” matter ballooned from several hundred gigabytes and several hundred thousand files to multiple terabytes and multiple millions of files. Because of the volume of data involved, data transfer, both from drive to drive and from drive to memory to CPU (for hashing and indexing), has become the primary bottleneck holding attorneys back from review. Unlike gaming or many scientific pursuits, eDiscovery is generally not computationally intensive; its performance is not CPU-bound but input/output (i/o) bound.
Other applications that are i/o bound include bioinformatics, homeland security, financial (transactional) databases, and enterprise document management systems. For these, increased data throughput can generally be categorized as “nice-to-have” or “do-it-only-when-it-becomes-cheap-enough”. The requirements of electronic discovery, in contrast, are business critical. The legal and financial pressures are tangible and quantifiable (especially when dealing with government agencies with three-letter acronyms!). As the volume of data increases, the machine time, most of it spent on data transfer, can stretch to days. Even when RAM- or Flash-based solid state drives (SSDs) become available, the time required to transfer data will remain a limiting factor. The interesting thing is that the eDiscovery industry will drive its own technology adoption. Because litigation is competitive and time is always of the essence, faster data transfer will be taken up as soon as it becomes available.
The data transfer bottleneck. The bottleneck for data transfer comes from the need to access the entire volume through a single interface. SATA interfaces are rated at 150-300 MB/s (parallel IDE tops out at about 133 MB/s). For comparison, USB devices operate at approximately 60 MB/s and DRAM at up to 2-3 GB/s. Besides the interface, there is also the issue of sequential versus random i/o. Operating at the highest rates, a terabyte (TB) takes approximately 1-2 hours to transfer sequentially from one drive to another. When files are accessed randomly (and there are many small files), the transfer time can grow by as much as 3-4x. The gap between random and sequential access can be narrowed by using RAM or Flash SSDs, which address data without the seek penalty of a spinning disk. But throughput will still be throttled by the interface. Where is my 1 TB RAM computer?
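The transfer times quoted above can be checked with a little arithmetic. The following is a minimal sketch, not a benchmark: it assumes 1 TB = 10^12 bytes, 1 MB = 10^6 bytes, and an interface that sustains its full rated speed. The function name transfer_hours and the random_penalty multiplier are hypothetical, introduced only to stand in for the 3-4x slowdown attributed to random access.

```python
# Back-of-the-envelope transfer times for the interface rates quoted above.
# Assumptions (not from the original text): decimal units and an interface
# that sustains its rated speed for the whole transfer.

TB = 10**12  # bytes in one terabyte (decimal)
MB = 10**6   # bytes in one megabyte (decimal)


def transfer_hours(volume_bytes, rate_mb_per_s, random_penalty=1.0):
    """Return the hours needed to move volume_bytes at rate_mb_per_s.

    random_penalty is a hypothetical multiplier: 1.0 means purely
    sequential access, 3.0-4.0 models many small, randomly accessed files.
    """
    seconds = volume_bytes / (rate_mb_per_s * MB) * random_penalty
    return seconds / 3600


if __name__ == "__main__":
    rates = [("USB", 60), ("SATA 150", 150), ("SATA 300", 300)]
    for label, rate in rates:
        sequential = transfer_hours(TB, rate)
        random_io = transfer_hours(TB, rate, random_penalty=4.0)
        print(f"{label:>8}: {sequential:5.1f} h sequential, "
              f"up to {random_io:5.1f} h random")
```

Under these assumptions, one terabyte takes roughly 0.9 hours at 300 MB/s and 1.9 hours at 150 MB/s, which matches the 1-2 hour range, and a 4x random-access penalty pushes the slower case past 7 hours.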
1. West Publishing Company and West Group. West's Encyclopedia of American Law. 1998, Minneapolis/St. Paul, MN: West Pub. Co. v. 1-12.
2. Federal Rules of Civil Procedure. 2006 [cited 2007 May 10]; Available from: http://www.law.cornell.edu/rules/frcp/.
3. Skjekkeland, A. eDiscovery Market Size. AIIM Knowledge Center Blog 2007 [cited 2007 May 10]; Available from: http://infogovernance.blogspot.com/2007/03/ediscovery-market-size-aiim-knowledge.html.