Data Extraction: Need Robust Way to Extract Title Block Fields Across Varying Drawing Layouts

Product: PdfTron

Product Version: 11.10.0

Please give a brief summary of your issue:
Need Robust Way to Extract Title Block Fields Across Varying Drawing Layouts

Please describe your issue and provide steps to reproduce it:

Background

We are using Apryse PDFNet Data Extraction in an Orchard Core–based application to extract metadata from construction drawing PDFs (architectural, structural, civil, etc.).

Our primary goal is to extract title block information, such as:

  • Drawn By

  • Checked By

  • Sheet Number

  • Total Sheets

  • Project / Company Name

We are currently using:

DataExtractionModule.ExtractData(
    inputPdfPath,
    DataExtractionModule.DataExtractionEngine.e_doc_structure
);

This works correctly and returns structured OCR output (text blocks, bounding boxes, layout).

Current Behavior (What Works)

  • Apryse successfully:

    • Performs OCR

    • Detects text accurately

    • Preserves bounding boxes and layout

    • Handles rotated drawings

  • We can locate title block text visually in the extracted JSON.

  • We can identify labels like:

    • DRAWN BY

    • CHECKED BY

    • SHEET 1 OF 12

This confirms that OCR and layout extraction are functioning correctly.
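
For reference, once a string such as SHEET 1 OF 12 has been located, splitting it into Sheet Number and Total Sheets is the easy part. The following is a minimal sketch in Python (chosen for brevity; the logic ports directly to C#). The function name and the pattern are our own illustration, not part of the Apryse API, and the pattern only covers purely numeric sheet numbers, so identifiers such as A-101 would need a different rule.

```python
import re

# Matches "SHEET 1 OF 12", "Sheet 3 of 7", "SHEET: 2 OF 10" (numeric sheets only).
SHEET_RE = re.compile(r"\bSHEET\s*:?\s*(\d+)\s+OF\s+(\d+)", re.IGNORECASE)


def parse_sheet_count(text):
    """Return (sheet_number, total_sheets), or None if the pattern is absent."""
    match = SHEET_RE.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    print(parse_sheet_count("SHEET 1 OF 12"))
```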

The Core Issue

The challenge is semantic field extraction across different drawing formats.

Problem Details

  • Title blocks vary significantly between documents:

    • Different labels (DRAWN, DRAFTED, DESIGNED BY)

    • Different positions (bottom, right, vertical orientation)

    • Different grid structures

    • Some drawings omit labels entirely

  • Because of this variability:

    • Fixed positional logic fails

    • Label-based logic fails

    • Regex-based logic fails

Example:

In one drawing:

DRAWN BY : CFB

DESIGNED : CFB

Apryse correctly extracts the text but does not infer relationships between labels and values, which we understand is by design.
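
To make the example concrete, this is roughly the kind of label-based logic we mean, sketched in Python. The synonym table, the field names, and both function names are hypothetical, and treating "designed by" as its own field is an assumption. It handles the two lines above, but it returns nothing as soon as a label is missing or is not in the table, which is exactly where it breaks down for us.

```python
# Hypothetical synonym table; real drawings will need many more variants.
LABEL_SYNONYMS = {
    "drawn_by": {"DRAWN BY", "DRAWN", "DRAFTED", "DRAFTED BY"},
    "designed_by": {"DESIGNED BY", "DESIGNED"},
    "checked_by": {"CHECKED BY", "CHECKED", "CHK"},
}


def canonical_field(label):
    """Map a raw label to a canonical field name, or None if it is unknown."""
    key = " ".join(label.upper().replace(".", "").split())
    for field, variants in LABEL_SYNONYMS.items():
        if key in variants:
            return field
    return None


def parse_labelled_lines(lines):
    """Split 'LABEL : VALUE' lines and keep those with a known label."""
    fields = {}
    for line in lines:
        label, sep, value = line.partition(":")
        field = canonical_field(label)
        if sep and field and value.strip():
            fields[field] = value.strip()
    return fields


if __name__ == "__main__":
    print(parse_labelled_lines(["DRAWN BY : CFB", "DESIGNED : CFB"]))
```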

Why This Is a Problem for Us

We are processing multiple document types from different sources, and we cannot assume:

  • Fixed title block layout

  • Fixed wording

  • Fixed coordinates

  • Single template

As a result, any hardcoded extraction logic breaks as soon as the document format changes.

Attempted Approaches

1. Positional / Coordinate-Based Matching

  • Works only for a single known layout

  • Breaks immediately for new drawings

2. Label-Based Matching

  • Fails when labels change or are missing

3. Regex on OCR Text

  • Loses layout context

  • Produces unreliable results

4. Virtual-to-Physical Resource Handling

  • Successfully resolved data-extraction runtime dependencies

  • This is no longer an issue
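
One variant of approaches 1 and 2 that keeps layout context is to pair each label with its nearest neighbouring block using the bounding boxes. A minimal sketch in Python follows. The Block type is a simplified stand-in for the extracted text blocks, not the actual Apryse JSON schema, and it assumes a top-left origin with y increasing downward (flip the "below" test for PDF user space).

```python
from typing import NamedTuple


class Block(NamedTuple):
    """Hypothetical text block: text plus a bounding box (y grows downward)."""
    text: str
    x1: float
    y1: float
    x2: float
    y2: float


def _overlap(a1, a2, b1, b2):
    """Length of the overlap between intervals [a1, a2] and [b1, b2]."""
    return min(a2, b2) - max(a1, b1)


def nearest_value(label, blocks):
    """Return the closest block to the right of or directly below the label."""
    best, best_dist = None, float("inf")
    for b in blocks:
        if b is label:
            continue
        same_row = _overlap(label.y1, label.y2, b.y1, b.y2) > 0 and b.x1 >= label.x2
        below = _overlap(label.x1, label.x2, b.x1, b.x2) > 0 and b.y1 >= label.y2
        if same_row:
            dist = b.x1 - label.x2
        elif below:
            dist = b.y1 - label.y2
        else:
            continue
        if dist < best_dist:
            best, best_dist = b, dist
    return best
```

This still depends on finding a label first, and the nearest block may itself be another label, so it does not solve the unlabeled case.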

Direction We Are Exploring

We are now considering a hybrid approach:

  1. Use Apryse only for OCR + layout extraction

  2. Isolate the title block region using heuristics

  3. Convert that region into a structured, layout-preserved representation

  4. Use an LLM to infer semantic meaning and normalize fields

This approach seems promising, but we want to ensure we are aligning with Apryse best practices and not reinventing something already supported.
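
Steps 2 to 4 could look roughly like the following Python sketch. Everything in it is an assumption for illustration: the Block type is a simplified stand-in for the extracted text blocks (not the Apryse JSON schema), the coordinates use a top-left origin with y increasing downward, the 25% edge band is an arbitrary heuristic, and the prompt wording is a placeholder. The LLM call itself is left out.

```python
from typing import NamedTuple


class Block(NamedTuple):
    """Hypothetical text block: text plus a bounding box (y grows downward)."""
    text: str
    x1: float
    y1: float
    x2: float
    y2: float


def title_block_candidates(blocks, page_w, page_h, band=0.25):
    """Step 2: keep blocks whose centre lies in the right or bottom edge band."""
    keep = []
    for b in blocks:
        cx, cy = (b.x1 + b.x2) / 2, (b.y1 + b.y2) / 2
        if cx >= page_w * (1 - band) or cy >= page_h * (1 - band):
            keep.append(b)
    return keep


def to_layout_text(blocks, row_tol=5.0):
    """Step 3: group blocks into rows by y, then join each row left to right."""
    rows = []
    for b in sorted(blocks, key=lambda blk: (blk.y1, blk.x1)):
        if rows and abs(rows[-1][0].y1 - b.y1) <= row_tol:
            rows[-1].append(b)
        else:
            rows.append([b])
    return "\n".join(
        " | ".join(blk.text for blk in sorted(row, key=lambda blk: blk.x1))
        for row in rows
    )


def build_prompt(layout_text, fields):
    """Step 4 (input only): ask for the fields as JSON, null when absent."""
    return (
        "Extract these title block fields as JSON, using null when a field "
        "is absent: " + ", ".join(fields) + "\n\n" + layout_text
    )
```

Keeping the row and column structure in the text sent to the model is the point of step 3: it lets the model see that CFB sits next to DRAWN BY even when no colon or fixed position ties them together.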

Questions for Apryse Team

  1. Is Template-Based Data Extraction the recommended approach for title block extraction across varying drawing formats?

    • Can multiple templates be applied dynamically?

    • Is there confidence scoring or fallback support?

  2. Are there examples or guidance from Apryse for:

    • Title block extraction

    • Drawing metadata normalization

    • Handling missing or variant labels

  3. Does Apryse provide (or plan to provide):

    • Semantic field inference

    • Layout-aware key/value extraction

    • ML-based drawing understanding

  4. For large-scale drawing processing, what is the recommended architecture:

    • Templates only?

    • Templates + heuristics?

    • External AI/LLM integration?
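
To illustrate what we mean by fallback support in questions 1 and 4, here is a minimal Python sketch of a confidence-gated chain. The extractor callables, the confidence scores, and the 0.8 threshold are all hypothetical; we do not know whether Apryse exposes comparable scores, which is part of the question.

```python
def extract_with_fallback(text, extractors, threshold=0.8):
    """Try extractors in order and stop at the first confident result.

    `extractors` is a list of (name, callable) pairs; each callable takes the
    text and returns (fields, confidence in [0, 1]). If none reaches the
    threshold, the most confident result seen is returned instead.
    """
    best_name, best_fields, best_conf = None, {}, 0.0
    for name, extractor in extractors:
        fields, conf = extractor(text)
        if conf >= threshold:
            return name, fields, conf
        if conf > best_conf:
            best_name, best_fields, best_conf = name, fields, conf
    return best_name, best_fields, best_conf
```

Ordering the chain as templates, then heuristics, then an external LLM would keep the most expensive step for the drawings the cheaper steps cannot handle.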

Please provide a link to a minimal sample where the issue is reproducible:

Hello, I’m AiRon, an automated tech support bot.
While you wait for one of our customer support representatives, I can help you with your question.

It seems you are dealing with a complex issue of extracting title block fields from varying drawing layouts using Apryse’s PDFNet Data Extraction. The variability in drawing formats makes it challenging to rely on fixed positional, label-based, or regex logic for data extraction.

Some things you can try:


  • Consider using the Smart Data Extraction capabilities of Apryse, particularly the Key-Value Extraction (KVE) feature, which is designed to handle variability in document structures, including CAD title blocks.

  • Leverage the Document Structure Recognition feature to parse the logical structure of your documents and assist in identifying patterns that can be used for extraction.

  • Explore a hybrid approach where you first extract OCR and layout information using Apryse, and then apply custom logic or AI models to interpret and normalize the extracted data.

  • Investigate the use of form field identification and key-value extraction for semantic mapping, which can help in associating labels with their corresponding values.

  • Consider integrating machine learning models or AI for semantic understanding and normalization of extracted data fields if Apryse’s built-in capabilities do not fully meet your needs.




Hopefully, this helps you solve your problem while you wait for a human to get back to you.

Thanks for sharing your use case with us. Do you have any sample documents that you can share? If you can’t upload them via our community forum, you can email them to support@apryse.com. Be sure to reference this forum post in your email.

In addition, what company are you affiliated with?

GenericHouse 1.pdf (227.0 KB)

I want the title block data

Thank you for providing this. To clarify, is this the region of interest?

[Image: screenshot of the proposed region of interest]

In addition, do you have any other documents whose title block differs from this one, so that we can see the information appearing in varying locations?

What company are you affiliated with?

36472-1._architectural_final_compressed.pdf (3.8 MB)

Another document for your reference.

Thank you. And what company are you with, please?
