Product: PdfTron
Product Version: 11.10.0
Please give a brief summary of your issue:
Need Robust Way to Extract Title Block Fields Across Varying Drawing Layouts
Please describe your issue and provide steps to reproduce it:
Background
We are using Apryse PDFNet Data Extraction in an Orchard Core–based application to extract metadata from construction drawing PDFs (architectural, structural, civil, etc.).
Our primary goal is to extract title block information, such as:
- Drawn By
- Checked By
- Sheet Number
- Total Sheets
- Project / Company Name
We are currently using:
DataExtractionModule.ExtractData(
inputPdfPath,
DataExtractionModule.DataExtractionEngine.e_doc_structure
);
This works correctly and returns structured OCR output (text blocks, bounding boxes, layout).
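As a rough illustration of how the returned JSON can be flattened for downstream logic, the sketch below walks the JSON string and collects every object that has both a text value and a four-number rectangle. The property names `text` and `rect`, and the `TextBlock` and `ExtractionJson` types, are assumptions made for illustration; they are not a statement of the actual output schema and would need to be adjusted to it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

// One piece of extracted text with its bounding box, in the order the JSON lists it.
public record TextBlock(string Text, double X0, double Y0, double X1, double Y1);

public static class ExtractionJson
{
    // Collects every JSON object that carries both a text string and a 4-number rect.
    // ASSUMPTION: "text" and "rect" are placeholder property names, not a confirmed schema.
    public static List<TextBlock> Flatten(string json)
    {
        var blocks = new List<TextBlock>();
        using var doc = JsonDocument.Parse(json);
        Walk(doc.RootElement, blocks);
        return blocks;
    }

    private static bool IsRect(JsonElement e) =>
        e.ValueKind == JsonValueKind.Array
        && e.GetArrayLength() == 4
        && e.EnumerateArray().All(v => v.ValueKind == JsonValueKind.Number);

    // Depth-first walk over objects and arrays; scalar values are ignored.
    private static void Walk(JsonElement node, List<TextBlock> blocks)
    {
        if (node.ValueKind == JsonValueKind.Object)
        {
            if (node.TryGetProperty("text", out var text)
                && text.ValueKind == JsonValueKind.String
                && node.TryGetProperty("rect", out var rect)
                && IsRect(rect))
            {
                blocks.Add(new TextBlock(
                    text.GetString() ?? "",
                    rect[0].GetDouble(), rect[1].GetDouble(),
                    rect[2].GetDouble(), rect[3].GetDouble()));
            }
            foreach (var prop in node.EnumerateObject())
                Walk(prop.Value, blocks);
        }
        else if (node.ValueKind == JsonValueKind.Array)
        {
            foreach (var item in node.EnumerateArray())
                Walk(item, blocks);
        }
    }
}
```

A flat list like this keeps the text and its coordinates together, which is what the positional and layout-based logic described below operates on.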
Current Behavior (What Works)
- Apryse successfully:
  - Performs OCR
  - Detects text accurately
  - Preserves bounding boxes and layout
  - Handles rotated drawings
- We can locate title block text visually in the extracted JSON.
- We can identify labels like:
  - DRAWN BY
  - CHECKED BY
  - SHEET 1 OF 12
This confirms that OCR and layout extraction are functioning correctly.
The Core Issue
The challenge is semantic field extraction across different drawing formats.
Problem Details
- Title blocks vary significantly between documents:
  - Different labels (DRAWN, DRAFTED, DESIGNED BY)
  - Different positions (bottom, right, vertical orientation)
  - Different grid structures
  - Some drawings omit labels entirely
- Because of this variability:
  - Fixed positional logic fails
  - Label-based logic fails
  - Regex-based logic fails
Example:
In one drawing:
DRAWN BY : CFB
DESIGNED : CFB
Apryse correctly extracts the text but does not infer the relationship between each label and its value, which we understand is by design.
Why This Is a Problem for Us
We are processing multiple document types from different sources, and we cannot assume:
- Fixed title block layout
- Fixed wording
- Fixed coordinates
- Single template
As a result, any hardcoded extraction logic breaks as soon as the document format changes.
Attempted Approaches
1. Positional / Coordinate-Based Matching
- Works only for a single known layout
- Breaks immediately for new drawings
2. Label-Based Matching
- Fails when labels change or are missing
3. Regex on OCR Text
- Loses layout context
- Produces unreliable results
4. Virtual-to-Physical Resource Handling
- Successfully resolved data-extraction runtime dependencies
- This is no longer an issue
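For context, approaches 2 and 3 boil down to patterns like the ones below. This is a simplified sketch rather than our exact code; the `NaiveTitleBlockParser` class and the output key names are made up for illustration, and the label variants are the ones listed earlier in this post.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NaiveTitleBlockParser
{
    // Matches "DRAWN BY : CFB", "DRAFTED: XYZ", "DESIGNED : CFB" and captures the value.
    // Any wording outside this hardcoded list, or a value without a colon, is missed.
    private static readonly Regex DrawnBy = new Regex(
        @"\b(?:DRAWN|DRAFTED|DESIGNED)(?:\s+BY)?\s*:\s*(\S+)",
        RegexOptions.IgnoreCase);

    // Matches "SHEET 1 OF 12" and captures both numbers.
    private static readonly Regex Sheet = new Regex(
        @"\bSHEET\s+(\d+)\s+OF\s+(\d+)\b",
        RegexOptions.IgnoreCase);

    // Runs both patterns over plain OCR text; layout information is not used at all.
    public static Dictionary<string, string> Parse(string ocrText)
    {
        var fields = new Dictionary<string, string>();

        var drawn = DrawnBy.Match(ocrText);
        if (drawn.Success)
            fields["DrawnBy"] = drawn.Groups[1].Value;

        var sheet = Sheet.Match(ocrText);
        if (sheet.Success)
        {
            fields["SheetNumber"] = sheet.Groups[1].Value;
            fields["TotalSheets"] = sheet.Groups[2].Value;
        }
        return fields;
    }
}
```

On the example above (`DRAWN BY : CFB`) this finds the value, but a drawing that abbreviates the label, omits it, or places the value in a neighbouring grid cell produces no match, which is the failure mode we keep hitting.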
Direction We Are Exploring
We are now considering a hybrid approach:
- Use Apryse only for OCR + layout extraction
- Isolate the title block region using heuristics
- Convert that region into a structured, layout-preserved representation
- Use an LLM to infer semantic meaning and normalize fields
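To make the last three steps more concrete, here is a minimal sketch of how we picture them. Everything in it is an assumption for illustration: the `TextBlock` record, the `TitleBlockSketch` class, the 25% edge strip, the row tolerance, a top-left origin with y growing downward, and the prompt wording. The LLM call itself is deliberately left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical flattened form of one extracted text block.
public record TextBlock(string Text, double X0, double Y0, double X1, double Y1);

public static class TitleBlockSketch
{
    private static double CenterX(TextBlock b) => (b.X0 + b.X1) / 2;
    private static double CenterY(TextBlock b) => (b.Y0 + b.Y1) / 2;

    // Heuristic isolation: keep blocks whose center lies in the right or bottom edge strip.
    // ASSUMPTION: top-left origin, y grows downward; the strip width is a guess to tune.
    public static List<TextBlock> IsolateCandidates(
        IEnumerable<TextBlock> blocks, double pageWidth, double pageHeight, double strip = 0.25)
        => blocks.Where(b => CenterX(b) >= pageWidth * (1 - strip)
                          || CenterY(b) >= pageHeight * (1 - strip)).ToList();

    // Layout-preserved text: blocks are grouped into rows when their vertical center is
    // within rowTolerance of the row's first block, then each row is ordered left to right.
    public static string RenderRows(IEnumerable<TextBlock> blocks, double rowTolerance = 5.0)
    {
        var rows = new List<List<TextBlock>>();
        foreach (var b in blocks.OrderBy(CenterY))
        {
            var last = rows.LastOrDefault();
            if (last != null && Math.Abs(CenterY(last[0]) - CenterY(b)) <= rowTolerance)
                last.Add(b);
            else
                rows.Add(new List<TextBlock> { b });
        }

        var sb = new StringBuilder();
        foreach (var row in rows)
            sb.AppendLine(string.Join(" | ",
                row.OrderBy(b => b.X0).Select(b => $"{b.Text} @x={b.X0:F0}")));
        return sb.ToString();
    }

    // Prompt text only; the field list mirrors the title block fields named in this post.
    public static string BuildPrompt(string renderedRows) =>
        "The text below is a title block from a construction drawing, one row per line, " +
        "cells separated by '|' with their x position.\n" +
        "Return JSON with keys drawnBy, checkedBy, sheetNumber, totalSheets, " +
        "projectOrCompanyName. Use null for any field that is not present.\n\n" +
        renderedRows;
}
```

Keeping row and column order in the rendered text is meant to give the model the layout context that plain regex over OCR text loses, while the explicit null instruction covers drawings that omit a field.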
This approach seems promising, but we want to ensure we are aligning with Apryse best practices and not reinventing something already supported.
Questions for Apryse Team
- Is Template-Based Data Extraction the recommended approach for title block extraction across varying drawing formats?
  - Can multiple templates be applied dynamically?
  - Is there confidence scoring or fallback support?
- Are there examples or guidance from Apryse for:
  - Title block extraction
  - Drawing metadata normalization
  - Handling missing or variant labels
- Does Apryse provide (or plan to provide):
  - Semantic field inference
  - Layout-aware key/value extraction
  - ML-based drawing understanding
- For large-scale drawing processing, what is the recommended architecture:
  - Templates only?
  - Templates + heuristics?
  - External AI/LLM integration?
Please provide a link to a minimal sample where the issue is reproducible:
