How accurate is automated redaction in eDiscovery?

Automated redaction accuracy ranges from roughly 65 to 99 percent depending on entity type and document type. SSN detection in structured text is typically 95 to 99 percent accurate. Medical record number detection in clinical notes is typically 80 to 88 percent accurate. Handwritten text requires OCR first and can drop to 65 to 75 percent accuracy.

Production Set 09 // Redaction Workflow

AI redaction automation, accuracy, validation, and the spreadsheet trap.

VERIFIED 21 APR 2026 // INDEPENDENT REFERENCE // NOT LEGAL ADVICE

Redaction is both a legal obligation and an operational cost centre in modern eDiscovery. The obligation is precise: privileged content, PII, PHI, and third-party confidential information must be masked before production. The cost is real: in complex HIPAA-implicated litigation, manual redaction can account for 20 to 30 percent of total review spend. AI automation helps substantially, but failure modes require a validated QA step before any production.

Section 01 // What Must Be Redacted

What requires redaction in modern discovery

• Attorney-client privilege and work product: communications with counsel, attorney memoranda, litigation strategy documents, draft pleadings. Covered by the privilege review workflow; see /privilege-review.
• PII (Personally Identifiable Information): Social Security numbers, dates of birth, financial account numbers, passport numbers, driver's license numbers. Redaction is required by protective orders, state privacy laws, and GDPR Article 5 data minimisation where EU data is involved.
• PHI (Protected Health Information): HIPAA-defined health identifiers such as patient names, medical record numbers, diagnosis codes, treatment dates, and insurer identifiers. Redaction is required in any litigation involving healthcare records, insurance claims, or employment benefits.
• Trade secret and third-party confidential: proprietary formulas, pricing models, non-public customer data, third-party contractual confidential information. Typically governed by the protective order in the case.

Section 02 // Three Approaches

Manual, NER-automated, LLM-classified

In manual redaction, a reviewer examines each document page and applies a black-box redaction mark over sensitive content. It is the most accurate method but is extremely time-intensive for large productions. At 20 to 40 pages per hour for a careful manual redaction pass on mixed document types, a 100,000-page production with 15 percent of pages requiring redaction (15,000 pages) can run 375 to 750 attorney hours.
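The hour estimate above is simple arithmetic and can be reproduced for other production sizes. The sketch below is illustrative only; the function name and parameters are hypothetical, and the inputs are the figures quoted in this section.

```python
def manual_redaction_hours(total_pages, redaction_share, pages_per_hour):
    """Attorney hours needed to manually redact the pages that require it."""
    pages_to_redact = total_pages * redaction_share
    return pages_to_redact / pages_per_hour


# Figures from the paragraph above: 100,000 pages, 15 percent needing redaction,
# and a review pace of 20 (slow) to 40 (fast) pages per hour.
fast_pace_hours = manual_redaction_hours(100_000, 0.15, 40)
slow_pace_hours = manual_redaction_hours(100_000, 0.15, 20)
print(f"{fast_pace_hours:.0f} to {slow_pace_hours:.0f} attorney hours")
```

Halving the review pace doubles the hours, which is why the page rate assumed for a given document mix matters more than the headline page count.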

NER (Named Entity Recognition) automation uses a trained entity-recognition model to identify specific entity types (PERSON, ORG, DATE, SSN, ACCOUNT_NUMBER, etc.) and automatically propose or apply redactions. NER is highly accurate for structured entities (SSN pattern matching is 95 to 99 percent accurate) but degrades on unstructured or context-dependent entities.
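To illustrate why structured entities are the easy case, here is a minimal sketch of pattern-based SSN detection. It is not any platform's implementation: the pattern, function names, and mask text are assumptions made for this example. The pattern only covers the hyphen- or space-separated 3-2-4 layout and excludes number ranges that are never issued (area 000, 666, or 900 to 999; group 00; serial 0000).

```python
import re

# Hypothetical pattern for illustration: 3-2-4 digits separated by hyphens or spaces,
# excluding area, group, and serial values that are never issued.
SSN_PATTERN = re.compile(
    r"\b(?!000|666|9\d\d)\d{3}[- ](?!00)\d{2}[- ](?!0000)\d{4}\b"
)


def propose_ssn_redactions(text):
    """Return (start, end) character spans that look like SSNs."""
    return [match.span() for match in SSN_PATTERN.finditer(text)]


def apply_redactions(text, spans, mask="[REDACTED-SSN]"):
    """Replace each span with a mask, right to left so earlier offsets stay valid."""
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + mask + text[end:]
    return text


sample = "Claimant SSN 123-45-6789, invoice 000-12-3456."
spans = propose_ssn_redactions(sample)
print(apply_redactions(sample, spans))
# prints: Claimant SSN [REDACTED-SSN], invoice 000-12-3456.
```

Keeping the proposal step separate from the apply step mirrors the "proposes or applies" distinction above: proposed spans can be queued for attorney review before anything is masked. A pattern like this would miss an SSN written as nine unbroken digits or split across an OCR line break, which is one way accuracy degrades outside clean structured text.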

LLM-classified redaction uses a large language model to identify redactable content based on a natural-language description of what must be redacted (‘redact all health information and diagnostic information about any individual other than the named plaintiff’). LLM classification handles contextual and relationship-dependent redaction better than NER, but requires more compute per document and generates more false positives that require attorney review.

Section 03 // Accuracy Log

Accuracy by entity type and document type

Entity / Document Type	NER Accuracy	LLM Accuracy	Key Challenge
SSN in structured text	95-99%	97-99%	Pattern matching; few challenges
Medical record numbers	80-88%	88-94%	Format varies by institution
Third-party trade secrets (contextual)	60-72%	78-88%	Context-dependent; no pattern
Native Word / email documents	90-96%	92-97%	Well-structured text; reliable
PDF documents (native)	88-94%	90-95%	Layer vs image PDF matters
Scanned paper (OCR)	70-80%	72-82%	OCR quality is the binding constraint
Native spreadsheets	65-75%	72-84%	Cell-level redaction; formula exposure

Approximate // Last verified Apr 2026

Section 04 // Validation

Validation before production

Automated redaction requires validation before production. The standard approach is stratified sampling: a random sample of documents from each stratum (document type, entity type, custodian, date range) is reviewed by an attorney to check that the automated redactions are correct and complete. The sample size follows the same 95 percent confidence with plus-or-minus 5 percent margin convention used for TAR validation.
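As a rough guide to what that convention implies per stratum, the sketch below uses the standard sample-size formula for estimating a proportion, with the conservative assumption p = 0.5 and an optional finite population correction. The function name and the stratum sizes are hypothetical; the statistical design for a real matter should be agreed with the parties or a qualified statistician.

```python
import math
from statistics import NormalDist


def validation_sample_size(population=None, confidence=0.95, margin=0.05, p=0.5):
    """Documents to sample for a given confidence level and margin of error.

    Uses n0 = z^2 * p * (1 - p) / margin^2, then applies the finite population
    correction n = n0 / (1 + (n0 - 1) / N) when a stratum size N is given.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95 percent
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    if population is not None:
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)


# Hypothetical strata and document counts, for illustration only.
strata = {"native email": 40_000, "scanned paper": 2_000, "native spreadsheets": 600}
for name, size in strata.items():
    print(name, validation_sample_size(population=size))
```

With no population cap the formula gives 385 documents; small strata need fewer, but each stratum is sampled separately, so the total attorney QA load grows with the number of strata.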

A Federal Rule of Evidence 502(d) order should cover inadvertent production of privileged material revealed through imperfect redaction. An additional protective order clause should cover PII and PHI produced inadvertently through redaction failure. The claw-back provision in the protective order should address both.

Section 05 // Spreadsheet Trap

The native spreadsheet problem

Native spreadsheet redaction is significantly harder than most firms assume. A display-layer redaction overlays a black box over the relevant cell value, as it would on a PDF page, but in a native Excel file the cell value is still accessible in the underlying XML. Producing a native spreadsheet with a ‘redacted’ cell that is actually readable by the receiving party is a common and expensive production error. The correct approach for native spreadsheets is to convert to PDF before applying redactions, or to use a platform that modifies the underlying cell data and not just the display layer. Verify your platform's native spreadsheet redaction approach before any production involving Excel files with sensitive financial or personal data.
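One way to spot-check this before production is to open the .xlsx file as what it is, a ZIP archive of XML parts, and search those parts for values that were supposed to be redacted. The sketch below is a minimal illustration using only the Python standard library; the function names are hypothetical, and the demo workbook is a tiny stand-in for the package layout, not a complete Excel file.

```python
import io
import zipfile


def find_unredacted_terms(xlsx_bytes, sensitive_terms):
    """Return {part_name: [terms found]} for each XML part that still holds a term."""
    hits = {}
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as archive:
        for name in archive.namelist():
            if not name.endswith(".xml"):
                continue
            xml_text = archive.read(name).decode("utf-8", errors="replace")
            found = [term for term in sensitive_terms if term in xml_text]
            if found:
                hits[name] = found
    return hits


def build_demo_workbook():
    """Build a minimal stand-in for an .xlsx package (illustration only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # Cell text typically lives in the shared strings part, not in the sheet.
        archive.writestr(
            "xl/sharedStrings.xml", "<sst><si><t>123-45-6789</t></si></sst>"
        )
        archive.writestr(
            "xl/worksheets/sheet1.xml",
            '<worksheet><sheetData><row><c t="s"><v>0</v></c></row></sheetData></worksheet>',
        )
    return buffer.getvalue()


print(find_unredacted_terms(build_demo_workbook(), ["123-45-6789"]))
```

A plain substring search is only a first check. It will miss values that are XML-escaped, split across rich-text runs, stored as numbers in a different format, or held in cached formula results, comments, or hidden sheets under a different representation, so a clean result does not prove the file is safe to produce.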

Section 06 // FAQ

Frequently asked questions

How accurate is automated redaction?

Accuracy ranges from 65 to 99 percent depending on entity type and document type. SSN detection in structured text is 95 to 99 percent accurate. Scanned paper documents through OCR can drop to 70 to 80 percent. All automated redaction requires human QA sampling before production.

Can AI redact native spreadsheet files?

This requires care. A display-layer redaction on a native Excel file does not remove the underlying data; the cell value remains accessible in the file's XML. The correct approach is to convert to PDF before applying redactions, or to use a platform that modifies the underlying cell data. Confirm your platform's native spreadsheet redaction approach before any Excel production.

Does automated redaction meet HIPAA requirements?

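HIPAA de-identification under the Safe Harbor method requires removal of all 18 identifier types listed in 45 CFR 164.514(b)(2). Automated redaction can systematically identify and redact these identifiers, but the output must be validated before it can be relied on: Safe Harbor also requires that the covered entity have no actual knowledge that the remaining information could identify an individual, and the Expert Determination method under 45 CFR 164.514(b)(1) requires analysis by a qualified expert. Automated redaction alone does not constitute HIPAA de-identification without validation.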

What tools offer automated redaction?

Major platforms with automated redaction include Relativity (native redaction module), Everlaw, DISCO, Exterro, and specialist tools like Blackout by IPRO. NER-based redaction is standard; LLM-classified contextual redaction is available in Relativity aiR for Privilege and selected specialist platforms.

Cross-reference

PRV /Privilege review reference
PLT /Platform redaction comparison
ETH /Ethics + confidentiality