Does HTML2PDF conversion lose the information related to H1-H6 headings?

maneesharajaratne · August 2, 2022, 12:06pm

Product: PDFNet Windows

Product Version: 9.2.0

I am trying to convert an HTML string to PDF using HTML2PDF and then make the result accessible using the tag information that comes with the PDFDoc.
The HTML string conversion looks like this:
PDFDoc doc = new PDFDoc();
HTML2PDF converter = new HTML2PDF();
string html = "<html><body><h1>Heading</h1><p>Paragraph.</p></body></html>";
converter.InsertFromHtmlString(html);
converter.Convert(doc);

After the conversion, I processed the elements in the PDFDoc to see what the MCTag returns for each text element (e_text), in this case Heading and Paragraph:
var tag = element.GetMCTag();
For Heading, tag.GetName() returns “P” as the element tag, where I expected “h1”. For Paragraph it gave the correct MCTag, which is “P”.
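For reference, the check described above could be sketched roughly as follows. This assumes the PDFNet C# API as used in the SDK's LogicalStructure sample (ElementReader, Element.GetMCTag, Element.GetStructMCID); the file name output.pdf and the license key placeholder are hypothetical, and text nested inside form XObjects is not visited.

```csharp
using System;
using pdftron;
using pdftron.PDF;
using pdftron.SDF;

class DumpTextTags
{
    static void Main()
    {
        PDFNet.Initialize("YOUR_LICENSE_KEY"); // placeholder, not a real key

        // "output.pdf" is a hypothetical file produced by HTML2PDF.
        using (PDFDoc doc = new PDFDoc("output.pdf"))
        using (ElementReader reader = new ElementReader())
        {
            for (PageIterator itr = doc.GetPageIterator(); itr.HasNext(); itr.Next())
            {
                reader.Begin(itr.Current());
                Element element;
                while ((element = reader.Next()) != null)
                {
                    // Only text runs are of interest here.
                    if (element.GetType() != Element.Type.e_text) continue;

                    // The marked-content tag is a name object when present.
                    Obj tag = element.GetMCTag();
                    string name = (tag != null && tag.IsName()) ? tag.GetName() : "(none)";

                    Console.WriteLine("{0} [MCID {1}]: {2}",
                        name, element.GetStructMCID(), element.GetTextString());
                }
                reader.End();
            }
        }
    }
}
```

Printing the MCID next to the tag name makes it easier to compare each text run with the entries in the structure tree later on.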

Am I missing something during the conversion? Or is there a way to get the correct tag information for HTML headings (H1-H6), or some other way of getting the heading info after the conversion?

Thank you
waiting for a quick answer
Maneesha

Ryan · August 3, 2022, 3:52pm

Currently no, the output does not include H1-H6 headings.

Could you elaborate on how exactly not having the H1-H6 headings affects your users?

maneesharajaratne · August 4, 2022, 3:28am

Hi Ryan,
Yes, we are trying to create a logical structure out of an untagged PDF (one converted from HTML using HTML2PDF), because the requirement is to produce a PDF/UA-compliant PDF. In the logical structure tree, the headings should therefore be identified as headings and tagged accordingly, so that assistive technology can tell what is a heading and what is a paragraph.
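To illustrate what "identified as headings" means at the structure-tree level, here is a minimal sketch that walks the tree of a tagged PDF and flags H and H1-H6 elements. It assumes the PDFNet C# classes STree and SElement as used in the SDK's LogicalStructure sample; output.pdf and the license key placeholder are hypothetical, and custom structure types that are only role-mapped to headings are not resolved.

```csharp
using System;
using pdftron;
using pdftron.PDF;
using pdftron.PDF.Struct;

class ListHeadings
{
    // Recursively print structure elements, marking H and H1-H6.
    static void Visit(SElement elem, int depth)
    {
        if (!elem.IsValid()) return;

        string type = elem.GetType();
        bool isHeading = type == "H" ||
            (type.Length == 2 && type[0] == 'H' && type[1] >= '1' && type[1] <= '6');

        Console.WriteLine("{0}{1}{2}",
            new string(' ', depth * 2), type, isHeading ? "  <-- heading" : "");

        int n = elem.GetNumKids();
        for (int i = 0; i < n; ++i)
        {
            // Content items (marked content, object references) are leaves; skip them.
            if (!elem.IsContentItem(i))
                Visit(elem.GetAsStructElem(i), depth + 1);
        }
    }

    static void Main()
    {
        PDFNet.Initialize("YOUR_LICENSE_KEY"); // placeholder, not a real key

        using (PDFDoc doc = new PDFDoc("output.pdf")) // hypothetical input file
        {
            STree tree = doc.GetStructTree();
            if (!tree.IsValid())
            {
                Console.WriteLine("No structure tree: the document is untagged.");
                return;
            }
            for (int i = 0; i < tree.GetNumKids(); ++i)
                Visit(tree.GetKid(i), 0);
        }
    }
}
```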

Ryan · August 5, 2022, 4:44pm

Thank you for the clarifications.

I have added this feature request to the product backlog, but at this time it is not on our schedule.

If instead your input was a DOCX file, then the PDF output with our SDK is fully tagged with the H1-H6 tags. Is switching from HTML to DOCX an option for you?

maneesharajaratne · August 8, 2022, 6:18am

No, unfortunately, the DOCX type is not used within the application.
Thank you for the information

Ryan · August 9, 2022, 9:58pm

I am happy to report that starting with our next release, PDFNet 9.4, the HTML2PDF module output will be a Tagged PDF, and the H1-H6 entries will be preserved and present in the PDF output.

To get notified for the next official SDK release you can join our Discourse Announcements channel.

For platform specific notifications, such as Nuget, NPM, CocoaPods, please see the respective PDFTron product documentation page.

maneesharajaratne · August 10, 2022, 6:16am

That’s good news. Will you be considering nested elements as well, like Tables and lists in the output tagged PDF?

maneesharajaratne · August 16, 2022, 6:31am

Hi @Ryan,
When can we expect the next release (9.4) of PDFTron SDK? Will it be within this year?

Ryan · August 16, 2022, 3:49pm

You can download a Preview build here:

https://pdftron.s3.amazonaws.com/custom/ID-zJWLuhTffd3c/support/html2pdf/html2pdf_chromium/HTML2PDFWindows_TaggedOutput.zip

While not as fully tested as our official releases, this should be fine for production usage, as the only change from the official release is the flag to generate a Tagged PDF.

maneesharajaratne · August 17, 2022, 8:56am

Hi,
Thank you for the response.
I checked the DLL, but the output was the same: an untagged PDF. Do I have to set some properties, or will this only be available in the released version?
I am running the following on the file attached here:
html.txt (10.4 KB)

HTML2PDF converter = new HTML2PDF();
converter.InsertFromHtmlString(contentString);
converter.Convert(doc);
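A fuller version of this snippet, as a rough sketch, would point the SDK at the HTML2PDF module folder, check the conversion result, and then test whether the output carries a structure tree. It assumes the PDFNet C# calls used in the SDK's HTML2PDF sample (SetModulePath, IsModuleAvailable, GetLog); the module path, output file name, and license key placeholder are hypothetical. No tagging-specific option is set, because the thread does not name one.

```csharp
using System;
using pdftron;
using pdftron.PDF;
using pdftron.SDF;

class ConvertHtml
{
    static void Main()
    {
        PDFNet.Initialize("YOUR_LICENSE_KEY"); // placeholder, not a real key

        // Hypothetical folder holding the unzipped HTML2PDF module.
        HTML2PDF.SetModulePath("path/to/html2pdf/module");
        if (!HTML2PDF.IsModuleAvailable())
        {
            Console.WriteLine("HTML2PDF module not found at the given path.");
            return;
        }

        string contentString =
            "<html><body><h1>Heading</h1><p>Paragraph.</p></body></html>";

        using (PDFDoc doc = new PDFDoc())
        {
            HTML2PDF converter = new HTML2PDF();
            converter.InsertFromHtmlString(contentString);

            if (!converter.Convert(doc))
            {
                Console.WriteLine("Conversion failed:\n" + converter.GetLog());
                return;
            }

            // A valid structure tree indicates that the output is tagged.
            Console.WriteLine("Structure tree present: " + doc.GetStructTree().IsValid());

            doc.Save("output.pdf", SDFDoc.SaveOptions.e_linearized);
        }
    }
}
```

Checking the structure tree right after Convert gives a quick yes/no answer on whether a given module and SDK combination produces tagged output.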

Ryan · August 24, 2022, 7:05pm

Sorry for the confusion, which can happen with Preview builds like this one.

There is some PDFNet SDK side work that also needs to be done, which is actually more involved than the change on the HTML2PDF module side.

I will update you once I know more. In the meantime, thank you for your patience.

Ryan · September 13, 2022, 5:32pm

The update on the PDFNet SDK is done in our developer preview builds. You can try that now if you like.

Let me know how this SDK build works for you with the Aug 16th HTML2PDF module.

maneesharajaratne · September 27, 2022, 5:55am

Hi @Ryan ,

I was able to test it, and I can see that the logical structure is now being created with the new version of PDFTron. However, it does not seem to do the mapping properly. Was that intentional, or is it because the DLL is still under development?

Either way, if the content comes through as marked content, I should be able to map the PDF elements to the logical structure tree using the PDFTron documentation. I’ll let you know if that’s doable in this state.
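One possible way to do that mapping, as a sketch only: collect every marked-content reference in the structure tree into a lookup keyed by page number and MCID, then look up each text element's MCID while reading the page. This assumes the PDFNet C# API from the SDK's LogicalStructure sample (ContentItem, SElement, Element.GetStructMCID); output.pdf, the key format, and the license key placeholder are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using pdftron;
using pdftron.PDF;
using pdftron.PDF.Struct;

class McidMap
{
    // Fills map with "pageNumber:mcid" -> structure type of the owning element.
    static void Collect(SElement elem, Dictionary<string, string> map)
    {
        if (!elem.IsValid()) return;
        int n = elem.GetNumKids();
        for (int i = 0; i < n; ++i)
        {
            if (elem.IsContentItem(i))
            {
                ContentItem item = elem.GetAsContentItem(i);
                ContentItem.Type t = item.GetType();
                if (t == ContentItem.Type.e_MCID || t == ContentItem.Type.e_MCR)
                    map[item.GetPage().GetIndex() + ":" + item.GetMCID()] = elem.GetType();
            }
            else
            {
                Collect(elem.GetAsStructElem(i), map);
            }
        }
    }

    static void Main()
    {
        PDFNet.Initialize("YOUR_LICENSE_KEY"); // placeholder, not a real key

        using (PDFDoc doc = new PDFDoc("output.pdf")) // hypothetical input file
        using (ElementReader reader = new ElementReader())
        {
            var map = new Dictionary<string, string>();
            STree tree = doc.GetStructTree();
            if (tree.IsValid())
                for (int i = 0; i < tree.GetNumKids(); ++i)
                    Collect(tree.GetKid(i), map);

            for (PageIterator itr = doc.GetPageIterator(); itr.HasNext(); itr.Next())
            {
                Page page = itr.Current();
                reader.Begin(page);
                Element element;
                while ((element = reader.Next()) != null)
                {
                    if (element.GetType() != Element.Type.e_text) continue;

                    // GetStructMCID returns a negative value for untagged content.
                    int mcid = element.GetStructMCID();
                    string type;
                    if (mcid < 0 || !map.TryGetValue(page.GetIndex() + ":" + mcid, out type))
                        type = "(unmapped)";

                    Console.WriteLine("{0}: {1}", type, element.GetTextString());
                }
                reader.End();
            }
        }
    }
}
```

The page number is part of the key because MCIDs are only unique within a single content stream, so the same number can recur on different pages.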

Thank you for the quick update