Document Field Extraction Evaluation Results
Overview
This document presents the evaluation results for document field extraction using different preprocessing approaches. The evaluation was conducted on a dataset of 56 document samples with various field types commonly found in identity documents.
Evaluation Metrics
The evaluation uses standard information extraction metrics (a computation sketch follows this list):
- Precision: Ratio of correctly extracted fields to total extracted fields
- Recall: Ratio of correctly extracted fields to total ground truth fields
- F1-Score: Harmonic mean of precision and recall
- Accuracy: Overall field-level accuracy
- TP: True Positives (correctly extracted fields)
- FP: False Positives (incorrectly extracted fields)
- FN: False Negatives (missed fields)
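As a reference, here is a minimal sketch of how these metrics can be computed from pooled TP/FP/FN counts. The `micro_metrics` helper is illustrative, not the actual evaluation code; it assumes field accuracy is computed as correct fields over ground-truth fields, which is why it matches micro recall in the result tables below.

```python
def micro_metrics(tp: int, fp: int, fn: int) -> dict:
    """Micro-averaged extraction metrics from pooled TP/FP/FN counts.

    Assumption: field accuracy = correctly extracted fields / ground-truth
    fields, i.e. it equals micro recall.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "field_accuracy": recall,  # assumed definition, see note above
    }
```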
Preprocessing Approaches
1. No Preprocessing (Baseline)
- Configuration: Raw images without any preprocessing
- Performance:
- Micro Precision: 79.0%
- Micro Recall: 68.7%
- Micro F1: 73.5%
- Field Accuracy: 68.7%
2. Crop
- Configuration: Content-aware cropping (no shadow removal)
- Performance:
- Micro Precision: 94.8%
- Micro Recall: 89.9%
- Micro F1: 92.3% (+18.8 pts vs. baseline)
- Field Accuracy: 89.9%
3. Crop + PaddleOCR + Shadow Removal
- Configuration: Cropping with PaddleOCR document processing and shadow removal
- Performance:
- Micro Precision: 93.6%
- Micro Recall: 89.4%
- Micro F1: 91.5% (+18.0 pts vs. baseline)
- Field Accuracy: 89.4%
4. Crop + PaddleOCR + Shadow Removal + Cache
- Configuration: Cropping with PaddleOCR, shadow removal, and caching
- Performance:
- Micro Precision: 92.5%
- Micro Recall: 88.3%
- Micro F1: 90.3% (+16.8 pts vs. baseline)
- Field Accuracy: 88.3%
5. Crop + Shadow Removal + Cache
- Configuration: Cropping with shadow removal and caching
- Performance:
- Micro Precision: 93.6%
- Micro Recall: 88.5%
- Micro F1: 91.0% (+17.5 pts vs. baseline)
- Field Accuracy: 88.5%
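The five configurations above differ only in which preprocessing steps are enabled. A minimal sketch of how they could be expressed as feature toggles (the dataclass and its field names are hypothetical, not the actual evaluation code):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PreprocessConfig:
    # Hypothetical toggles mirroring the five evaluated configurations.
    crop: bool = False            # content-aware cropping
    paddle_doc: bool = False      # PaddleOCR document processing
    shadow_removal: bool = False  # DocShadow shadow removal
    cache: bool = False           # cache intermediate results (speed only)

CONFIGS = {
    "baseline": PreprocessConfig(),
    "crop": PreprocessConfig(crop=True),
    "crop_paddle_shadow": PreprocessConfig(crop=True, paddle_doc=True, shadow_removal=True),
    "crop_paddle_shadow_cache": PreprocessConfig(crop=True, paddle_doc=True, shadow_removal=True, cache=True),
    "crop_shadow_cache": PreprocessConfig(crop=True, shadow_removal=True, cache=True),
}
```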
Field-Level Performance Analysis
High-Performance Fields
Fields that consistently perform well across all approaches:
| Field | Best F1 | Best Approach | Performance Trend |
|---|---|---|---|
| Gender | 85.1% | Crop + PaddleOCR | Consistent improvement |
| Birth Date | 80.5% | Crop + PaddleOCR | Strong improvement |
| Document Type | 85.4% | Crop + PaddleOCR | Significant improvement |
| Surname | 82.9% | Crop + PaddleOCR | Consistent improvement |
Medium-Performance Fields
Fields with moderate improvement:
| Field | Best F1 | Best Approach | Performance Trend |
|---|---|---|---|
| Birth Place | 83.4% | Crop Only | Good improvement |
| Expiry Date | 78.5% | Crop + PaddleOCR | Moderate improvement |
| Issue Date | 69.3% | Crop + Shadow + Cache | Variable performance |
| Address | 44.4% | Crop + PaddleOCR | Limited improvement |
Low-Performance Fields
Fields that remain challenging:
| Field | Best F1 | Best Approach | Notes |
|---|---|---|---|
| MRZ Lines | 41.8% | Crop + Shadow + Cache | Complex OCR patterns |
| Personal Number | 40.0% | Crop + PaddleOCR + Cache | Small text, variable format |
| Issue Place | 50.0% | Crop + PaddleOCR + Cache | Handwritten text challenges |
Zero-Performance Fields
Fields that consistently fail across all approaches:
- Recto/Verso: Document side detection
- Code: Encoded information
- Height: Physical measurements
- Type: Document classification
Key Findings
1. Preprocessing Impact
- Cropping alone delivers the strongest overall boost (+18.8 F1 pts vs. baseline)
- PaddleOCR + Shadow Removal is highly competitive (up to +18.0 F1 pts)
- Caching slightly reduces accuracy in this evaluation (roughly 1 F1 pt); its value is speed, not accuracy
2. Field Type Sensitivity
- Structured fields (dates, numbers) benefit most from preprocessing
- Text fields (names, addresses) show moderate improvement
- Complex fields (MRZ, codes) remain challenging
3. Processing Pipeline Efficiency
- Crop currently provides the best overall F1 in this evaluation
- Crop + PaddleOCR + Shadow Removal is close and benefits some fields
- Caching shows minimal gains; use for speed, not accuracy
Recommendations
For Production Use
- Use Crop as the primary preprocessing step
- Focus optimization on high-value fields (dates, document types, names)
- Consider field-specific preprocessing strategies for challenging fields
For Further Research
- Investigate MRZ line extraction techniques
- Explore advanced OCR methods for handwritten text
- Develop specialized preprocessing for low-performance fields
Performance Targets
- Overall Micro F1: target 65%+ (currently 92.3% at best, with Crop; target met)
- Field Accuracy: target 50%+ (currently 89.9% at best, with Crop; target met)
- Critical Fields: Ensure 80%+ F1 for dates and document types
Technical Details
Dataset Characteristics
- Total Samples: 56 documents
- Field Types: 25+ different field categories
- Document Types: Identity documents, permits, certificates
- Image Quality: Variable (scanned, photographed, digital)
Evaluation Methodology
- Ground Truth: Manually annotated field boundaries and text
- Evaluation: Field-level precision, recall, and F1 calculation
- Aggregation: Micro-averaging across all fields and samples (see the sketch below)
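A minimal sketch of field-level scoring with micro-averaging, assuming per-sample dictionaries mapping field name to extracted text and exact string matching (the actual matching criterion may be more lenient):

```python
from collections import Counter

def evaluate(predictions, ground_truths):
    """Micro-averaged field-level evaluation.

    predictions / ground_truths: lists (one entry per document sample) of
    dicts mapping field name -> text. Exact string match and the
    "wrong value counts as both FP and FN" convention are assumptions.
    """
    counts = Counter()
    for pred, gt in zip(predictions, ground_truths):
        for field, value in pred.items():
            if field in gt and value == gt[field]:
                counts["tp"] += 1
            else:
                counts["fp"] += 1  # extracted but spurious or wrong
        for field in gt:
            if field not in pred or pred[field] != gt[field]:
                counts["fn"] += 1  # ground-truth field missed or wrong
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```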
Preprocessing Pipeline
- Image Input: Raw document images
- Cropping: Content area detection and extraction
- Document Processing: PaddleOCR unwarping and orientation
- Shadow Removal: Optional DocShadow processing
- Field Extraction: OCR-based text extraction
- Post-processing: Field validation and formatting
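A minimal end-to-end sketch of this pipeline, assuming OpenCV for an illustrative content-area crop and the PaddleOCR 2.x `ocr()` interface for orientation-aware text extraction; the DocShadow step is a placeholder, and unwarping plus the field validation/formatting post-processing are not shown:

```python
import cv2
import numpy as np
from paddleocr import PaddleOCR  # assumes the PaddleOCR 2.x API

def crop_content(img: np.ndarray) -> np.ndarray:
    """Illustrative content-aware crop: bounding box of the largest bright region."""
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return img
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    return img[y:y + h, x:x + w]

def remove_shadow(img: np.ndarray) -> np.ndarray:
    """Placeholder for the optional DocShadow step (model call not shown)."""
    return img

def extract_text(img: np.ndarray, ocr: PaddleOCR) -> list:
    """Run OCR and return recognized text lines; field parsing/validation
    (the post-processing step) would follow and is omitted here."""
    result = ocr.ocr(img, cls=True)
    lines = []
    for page in result or []:
        for _box, (text, _score) in page or []:
            lines.append(text)
    return lines

if __name__ == "__main__":
    ocr = PaddleOCR(use_angle_cls=True, lang="en")  # orientation classification
    image = cv2.imread("document.jpg")   # raw document image
    image = crop_content(image)          # cropping
    image = remove_shadow(image)         # optional shadow removal
    print(extract_text(image, ocr))      # OCR-based extraction
```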
Conclusion
The evaluation demonstrates that preprocessing significantly improves document field extraction performance. Cropping alone provides the best overall micro F1 (92.3%, +18.8 points over the 73.5% baseline), with Crop + PaddleOCR + Shadow Removal close behind (91.5%) and beneficial for some fields. While some fields remain challenging, the overall pipeline shows strong potential for production deployment with further field-specific optimizations.
Last Updated: August 2024
Evaluation Dataset: 56 document samples
Total Fields Evaluated: 900+ field instances