Production Set 09 // Redaction Workflow
AI redaction automation, accuracy, validation, and the spreadsheet trap.
VERIFIED 21 APR 2026 // INDEPENDENT REFERENCE // NOT LEGAL ADVICE
Redaction is both a legal obligation and an operational cost centre in modern eDiscovery. The obligation is precise: privileged content, PII, PHI, and third-party confidential information must be masked before production. The cost is real: in complex HIPAA-implicated litigation, manual redaction can account for 20 to 30 percent of total review spend. AI automation helps substantially, but its failure modes make a validated QA step necessary before any production.
Section 01 // What Must Be Redacted
What requires redaction in modern discovery
- Attorney-client privilege and work product: communications with counsel, attorney memoranda, litigation strategy documents, draft pleadings. Covered by the privilege review workflow; see /privilege-review.
- PII (Personally Identifiable Information): Social Security numbers, dates of birth, financial account numbers, passport numbers, driver's license numbers. Required by protective orders, state privacy laws, and GDPR Article 5 data minimisation where EU data is involved.
- PHI (Protected Health Information): HIPAA-defined health identifiers such as patient names, medical record numbers, diagnosis codes, treatment dates, and insurer identifiers. Required in any litigation involving healthcare records, insurance claims, or employment benefits.
- Trade secret and third-party confidential: proprietary formulas, pricing models, non-public customer data, third-party contractual confidential information. Typically governed by the protective order in the case.
Section 02 // Three Approaches
Manual, NER-automated, LLM-classified
In manual redaction, a reviewer examines each document page and applies a black-box redaction mark over sensitive content. It is the most accurate method but is extremely time-intensive for large productions. At 20 to 40 pages per hour for a careful manual redaction pass on mixed document types, a 100,000-page production in which 15 percent of pages (15,000 pages) require redaction can run 375 to 750 attorney hours.
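The hour estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch; the function name and parameters are illustrative, and the inputs are simply the figures quoted in the paragraph.

```python
def manual_redaction_hours(total_pages, redaction_percent, pages_per_hour):
    """Attorney hours needed to manually redact the affected share of a production."""
    pages_to_redact = total_pages * redaction_percent / 100
    return pages_to_redact / pages_per_hour

# 100,000 pages, 15 percent needing redaction, 20 to 40 pages per hour.
fastest = manual_redaction_hours(100_000, 15, 40)
slowest = manual_redaction_hours(100_000, 15, 20)
print(f"{fastest:.0f} to {slowest:.0f} attorney hours")
```

Swapping in a matter's own page count and observed reviewer throughput gives a quick budget range before choosing an automation approach.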
NER (Named Entity Recognition) automation uses a trained entity-recognition model to identify specific entity types (PERSON, ORG, DATE, SSN, ACCOUNT_NUMBER, etc.) and automatically proposes or applies redactions. NER is highly accurate for structured entities (SSN pattern matching is 95 to 99 percent accurate) but degrades on unstructured or context-dependent entities.
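For structured entities, the pattern-matching side of this approach can be illustrated with a regular expression. The sketch below is not a trained NER model and not any platform's implementation; it assumes dashed SSNs only (123-45-6789), and the function names are hypothetical. It proposes spans first and applies them in a separate step, mirroring the propose-or-apply distinction above.

```python
import re

# Dashed SSN format only (an assumption of this sketch). The lookaheads skip
# values the SSA does not issue: area 000, 666, or 900-999; group 00; serial 0000.
SSN_PATTERN = re.compile(r"\b(?!000|666|9\d\d)\d{3}-(?!00)\d{2}-(?!0000)\d{4}\b")

def propose_ssn_redactions(text):
    """Return (start, end) character spans that look like SSNs."""
    return [match.span() for match in SSN_PATTERN.finditer(text)]

def apply_redactions(text, spans, mask="[REDACTED]"):
    """Replace each span with the mask, working right to left so offsets stay valid."""
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + mask + text[end:]
    return text

sample = "Claimant SSN 123-45-6789, ref 000-12-3456."
spans = propose_ssn_redactions(sample)
redacted = apply_redactions(sample, spans)
print(redacted)
```

A pattern like this misses undashed nine-digit SSNs and OCR-damaged digits, which is one reason structured-entity accuracy still falls short of 100 percent.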
LLM-classified redaction uses a large language model to identify redactable content based on a natural-language description of what must be redacted (‘redact all health information and diagnostic information about any individual other than the named plaintiff’). LLM classification handles contextual and relationship-dependent redaction better than NER, but requires more compute per document and generates more false positives that require attorney review.
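One way to wire an instruction-driven pass is sketched below. Everything here is an assumption for illustration: the prompt wording, the JSON reply shape, and the `complete` callable, which stands in for whatever model client a platform provides. The sketch only accepts passages it can locate verbatim in the document and routes the rest to attorney review.

```python
import json

def build_prompt(instruction, document_text):
    """Combine the natural-language redaction instruction with the document."""
    return (
        "Identify every passage that must be redacted under this instruction.\n"
        f"Instruction: {instruction}\n"
        'Reply with JSON only: {"passages": ["<verbatim quote>", ...]}\n'
        f"Document:\n{document_text}"
    )

def propose_llm_redactions(instruction, document_text, complete):
    """complete is a caller-supplied function (hypothetical): prompt str -> reply str."""
    reply = complete(build_prompt(instruction, document_text))
    located, needs_review = [], []
    for quote in json.loads(reply)["passages"]:
        start = document_text.find(quote)
        if start == -1:
            needs_review.append(quote)  # model quoted text not present in the document
        else:
            located.append((start, start + len(quote)))
    return located, needs_review

# Canned reply standing in for a real model call.
def fake_complete(prompt):
    return json.dumps({"passages": ["Tom Lee has diabetes", "Tom Lee has hypertension"]})

doc = "Plaintiff Ann Lee has asthma. Her brother Tom Lee has diabetes."
rule = "redact health information about anyone other than the named plaintiff"
located, needs_review = propose_llm_redactions(rule, doc, fake_complete)
print(located, needs_review)
```

Requiring a verbatim match keeps hallucinated passages out of the redaction set, at the cost of sending paraphrased hits to a human.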
Section 03 // Accuracy Log
Accuracy by entity type and document type
| Entity / Document Type | NER Accuracy | LLM Accuracy | Key Challenge |
|---|---|---|---|
| SSN in structured text | 95-99% | 97-99% | Pattern matching; few challenges |
| Medical record numbers | 80-88% | 88-94% | Format varies by institution |
| Third-party trade secrets (contextual) | 60-72% | 78-88% | Context-dependent; no pattern |
| Native Word / email documents | 90-96% | 92-97% | Well-structured text; reliable |
| PDF documents (native) | 88-94% | 90-95% | Layer vs image PDF matters |
| Scanned paper (OCR) | 70-80% | 72-82% | OCR quality is the binding constraint |
| Native spreadsheets | 65-75% | 72-84% | Cell-level redaction; formula exposure |
Approximate // Last verified Apr 2026
Section 04 // Validation
Validation before production
Automated redaction requires validation before production. The standard approach is stratified sampling: a random sample of documents from each stratum (document type, entity type, custodian, date range) is reviewed by an attorney to check that the automated redactions are correct and complete. The sample size follows the same convention used for TAR validation, 95 percent confidence with a plus-or-minus 5 percent margin, which works out to roughly 385 documents for a large population.
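The sample-size convention can be made concrete with the standard formula for estimating a proportion, with an optional finite-population correction. This is a sketch under stated assumptions: worst-case proportion of 0.5, simple random sampling within each stratum, and the margin applied per stratum. The source does not say whether the margin applies per stratum or overall, and the function names are illustrative.

```python
import math
import random
from statistics import NormalDist

def sample_size(confidence=0.95, margin=0.05, proportion=0.5, population=None):
    """Documents to review for the given confidence level and margin of error."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = z**2 * proportion * (1 - proportion) / margin**2
    if population is not None:
        # Finite-population correction for small strata.
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)

def draw_stratified_sample(strata, seed=0):
    """strata maps a stratum name to a list of document IDs (hypothetical structure)."""
    rng = random.Random(seed)
    picks = {}
    for name, doc_ids in strata.items():
        n = min(len(doc_ids), sample_size(population=len(doc_ids)))
        picks[name] = rng.sample(doc_ids, n)
    return picks

strata = {"scanned": list(range(2000)), "spreadsheets": list(range(50))}
picks = draw_stratified_sample(strata)
print(sample_size(), {name: len(ids) for name, ids in picks.items()})
```

Small strata end up reviewed almost in full, which is usually acceptable because they also tend to be the high-risk document types.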
A Rule 502(d) order should cover inadvertent production of privileged material revealed through imperfect redaction. An additional protective order clause should cover PII and PHI produced inadvertently through redaction failure. The claw-back provision in the protective order should address both.
Section 05 // Spreadsheet Trap
The native spreadsheet problem
Native spreadsheet redaction is significantly harder than most firms assume. In a PDF, redaction overlays a black box on the relevant cell value. In a native Excel file, masking a cell at the display layer leaves the value accessible in the underlying XML. Producing a native spreadsheet with a ‘redacted’ cell that is actually readable by the receiving party is a common and expensive production error. The correct approach for native spreadsheets is to convert to PDF before applying redactions, or to use a platform that modifies the underlying cell data and not just the display layer. Verify your platform's native spreadsheet redaction approach before any production involving Excel files with sensitive financial or personal data.
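A basic pre-production check follows from the fact that an .xlsx file is a ZIP archive of XML parts: open the archive and search every part for values that were supposed to be removed. The sketch below uses only the Python standard library. The function name is hypothetical, the demo archive is a minimal stand-in rather than a valid workbook, and a plain substring search will miss values that are XML-escaped or split across rich-text runs, so a clean result is not proof of a clean file.

```python
import io
import zipfile

def find_leaked_values(xlsx_file, sensitive_values):
    """Map each XML part of the archive to the sensitive values still present in it."""
    leaks = {}
    with zipfile.ZipFile(xlsx_file) as archive:
        for part in archive.namelist():
            if not part.endswith(".xml"):
                continue
            xml_text = archive.read(part).decode("utf-8", errors="ignore")
            found = [value for value in sensitive_values if value in xml_text]
            if found:
                leaks[part] = found
    return leaks

# Demo: an in-memory stand-in for a workbook whose "redacted" cell still
# references a shared string containing the original value.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("xl/sharedStrings.xml", "<sst><si><t>123-45-6789</t></si></sst>")
    archive.writestr("xl/worksheets/sheet1.xml", '<c r="A1" t="s"><v>0</v></c>')
buffer.seek(0)

leaks = find_leaked_values(buffer, ["123-45-6789"])
print(leaks)
```

Running a check like this against the list of values redacted during review gives a quick signal on whether a platform changed the cell data or only the display layer.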
Section 06 // FAQ