De-duping documents and emails for legal document review.
The EDRM defines de-duplication as the removal of duplicate documents. Also called “de-duping” for short, it is a process that identifies duplicate digital objects in data sets so that they can be removed from subsequent steps in the discovery process. Duplicates can be identified and removed in several ways. One of the primary ways is to rely on hash values to determine which documents are exact duplicates.
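As a rough illustration of hash-based identification, the sketch below groups documents whose contents produce the same hash value. It is a minimal example, not a description of any particular review platform: the choice of SHA-256 and the names content_hash and find_exact_duplicates are assumptions made here for illustration, and it hashes raw file bytes only.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return a hex digest of the document's bytes (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def find_exact_duplicates(documents: dict[str, bytes]) -> dict[str, list[str]]:
    """Group document IDs by content hash and keep only groups with duplicates.

    `documents` maps a hypothetical document ID to that document's raw bytes.
    """
    groups: dict[str, list[str]] = {}
    for doc_id, data in documents.items():
        groups.setdefault(content_hash(data), []).append(doc_id)
    # A hash shared by more than one document marks a set of exact duplicates.
    return {digest: ids for digest, ids in groups.items() if len(ids) > 1}
```

Calling find_exact_duplicates with three documents, two of which have identical bytes, returns a single group containing the IDs of those two documents. Because the hash covers every byte, documents that differ by even one character fall into different groups, which is why this approach finds exact duplicates only.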
How is de-duplication related to eDiscovery?
In eDiscovery, de-duplication is an accepted and important method for reducing the volume of documents that must be reviewed and produced. There are two main types of de-duplication: horizontal de-duplication and vertical de-duplication.
In horizontal de-duplication, sometimes referred to as “cross-custodian de-duplication,” documents are compared across the data sets of multiple custodians. That way, if a document such as an email is duplicated across multiple custodians, only one copy passes to subsequent stages of the discovery process, along with a report of the other custodians whose data sets also contained it. In vertical de-duplication, only duplicate objects within a single custodian’s data set are identified and removed.
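The difference between the two types can be sketched in a few lines. The example below is a simplified illustration under stated assumptions: the function name dedupe, the scope parameter, the (custodian, document ID, bytes) record layout, and the use of SHA-256 are all hypothetical choices made here, not features of a specific tool. The only thing that changes between the two modes is the key used to decide whether a document has been seen before.

```python
import hashlib
from collections import defaultdict


def dedupe(records, scope="horizontal"):
    """De-duplicate (custodian, doc_id, data) records.

    scope="horizontal": compare across all custodians.
    scope="vertical":   compare only within each custodian's own data set.
    Returns (kept, report), where `report` maps a kept document ID to the
    other custodians that also held a duplicate of it.
    """
    if scope not in ("horizontal", "vertical"):
        raise ValueError("scope must be 'horizontal' or 'vertical'")

    seen = {}                         # comparison key -> (custodian, doc_id) kept
    kept = []                         # documents that pass to later stages
    also_held_by = defaultdict(set)   # kept doc_id -> other custodians

    for custodian, doc_id, data in records:
        digest = hashlib.sha256(data).hexdigest()
        # Horizontal keys on the hash alone; vertical adds the custodian,
        # so identical documents held by different custodians never collide.
        key = digest if scope == "horizontal" else (custodian, digest)
        if key in seen:
            kept_custodian, kept_id = seen[key]
            if custodian != kept_custodian:
                also_held_by[kept_id].add(custodian)
        else:
            seen[key] = (custodian, doc_id)
            kept.append((custodian, doc_id))

    return kept, {doc: sorted(names) for doc, names in also_held_by.items()}
```

For example, if custodian alice holds two copies of a memo and custodian bob holds a third copy plus one unrelated document, the horizontal mode keeps one memo and the unrelated document and reports that bob also held the memo. The vertical mode keeps one memo per custodian plus the unrelated document, and its report is empty because no comparison crosses custodians.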