Optical Character Recognition (OCR) is the process of electronically recognizing characters in image files, such as scanned paper documents or other graphic files, and converting them into editable, searchable text. It has been in commercial use in the computing field since the 1980s and has enabled a mass transition from paper documents to electronic ones. Although OCR is considered a legacy technology, it remains an important method for digitizing scanned paper documents for legal purposes.
How is OCR related to eDiscovery?
OCR has been used in the litigation support industry, and later as part of the eDiscovery process, since the beginning of those fields as a way to rapidly transform scanned paper documents into indexable data sets that search engines can use. Without OCR, legal professionals would have to manually review every document for keywords. With its advent, electronic keyword and key-phrase searching became possible, increasing the efficiency and lowering the cost of legal document review. Today, because most documents originate in electronic form, OCR has morphed into what is referred to as text extraction: the text found in electronic documents is pulled out and added to a digital index for subsequent searching. Indeed, the terms OCR and text extraction are now usually used interchangeably.
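To make the idea of an "indexable data set" concrete, here is a minimal Python sketch of how text produced by OCR or text extraction might be added to a keyword index and then searched. It is an illustration only, not how any particular eDiscovery platform works: the function names (`tokenize`, `build_index`, `search`), the document IDs, and the sample text are all hypothetical, and the text is assumed to have already been extracted.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    # Copy the first posting set so the index itself is never mutated.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical extracted text, keyed by made-up document IDs.
documents = {
    "DOC-001": "Settlement agreement between the parties, dated March 3.",
    "DOC-002": "Email regarding the draft agreement and payment schedule.",
    "DOC-003": "Invoice for consulting services.",
}

index = build_index(documents)
print(sorted(search(index, "agreement")))
print(sorted(search(index, "payment agreement")))
```

Once the index exists, a keyword query becomes a lookup rather than a read-through of every page, which is the efficiency gain described above. Production search engines add much more, such as phrase matching, stemming, and tolerance for OCR errors.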