PdfTron is misidentifying office types in conversion to Pdf

phil.heroux · October 31, 2024, 3:04pm

Product:
PDFTron.NETCore.Windows.x64

Product Version:
10.10.0

Please give a brief summary of your issue:
PdfTron is misidentifying office types in conversion to Pdf

Please describe your issue and provide steps to reproduce it:
We are attempting to use PdfTron to convert Microsoft Office documents to PDF. The documents arrive as Streams, so we are using OfficeToPDF(PDFDoc inDoc, Filter inData, ConversionOptions options). This works for the newer Office types (DOCX, XLSX, PPTX), but it frequently misidentifies the older formats (DOC, XLS, PPT, RTF) and uses the wrong conversion method. Any help with our issue would be appreciated!

This is the exception we see for these documents:

Exception:
    Message: document layout failed: Unable to convert this document from binary to OOXML form. File is not a valid zip archive.
    Conditional expression:
    Version : 10.10.0-4950f2eb9c
    Platform : Windows
    Architecture : AMD64
    Filename : FlowToPDFConversion.cpp
    Function : PDF::DocxConversion::Convert()
    Linenumber : 200

Please provide a link to a minimal sample where the issue is reproducible:
This is how we are converting the Stream to use the OfficeToPdf method:

    using (pdftron.PDF.PDFDoc pdfDoc = new())
    {
        using (Filter filter = getFilterForConversion(inpStream))
        {
            pdftron.PDF.Convert.OfficeToPDF(pdfDoc, filter, null);

            pdfDoc.Save(outpMemoryStream, pdftron.SDF.SDFDoc.SaveOptions.e_linearized);
        }
    }


    private static Filter getFilterForConversion(Stream inpStream)
    {
        using (BinaryReader binaryReader = new(inpStream))
        {
            // Write stream to a byte array. This is needed for pdfTron.Filters.FilterWriter
            int streamLength = (int)inpStream.Length;
            Byte[] byteArrayOfStream = binaryReader.ReadBytes(streamLength);

            Filter filter = new MemoryFilter(byteArrayOfStream.Length, true);
            filter.Begin();

            // Write the byte array to the Filter via FilterWriter
            using (FilterWriter filterWriter = new FilterWriter(filter))
            {
                filterWriter.WriteBuffer(byteArrayOfStream);
                filterWriter.Flush();
            }

            // Ensure filter position is at start and return
            filter.Begin();
            return filter;
        }
    }

shakthi124 · October 31, 2024, 9:24pm

Thank you for contacting us about this. Note that the ConversionOptions class offers a direct way to specify the file extension. Can you please try setting the file extension using the SetFileExtension(string) method in the ConversionOptions to see if you are still able to reproduce the issue?

phil.heroux · November 1, 2024, 9:44pm

Hi @shakthi124, thanks for the feedback. I created a ConversionOptions variable, called SetFileExtension(fileTypeString) on it, and passed it to the OfficeToPDF method. However, I am still seeing issues with the conversion of these older types. Here are some example messages:

When I ran type RTF through the conversion:
Exception:
    Message: This file type is not supported for PDFNet builtin conversion!
    Conditional expression: false
    Version : 10.10.0-4950f2eb9c
    Platform : Windows
    Architecture : AMD64
    Filename : Office2PDFNative.cpp
    Function : trn::PDF::Office2PDFNative::CreateConversion
    Linenumber : 1241

When I ran type DOC through the conversion:
Exception:
    Message: document layout failed: Unable to convert this document from binary to OOXML form. File is not a valid zip archive.
    Conditional expression:
    Version : 10.10.0-4950f2eb9c
    Platform : Windows
    Architecture : AMD64
    Filename : FlowToPDFConversion.cpp
    Function : PDF::DocxConversion::Convert()
    Linenumber : 200

Let me know if any other information can help with the investigation, thanks!

phil.heroux · November 5, 2024, 2:32pm

Hi @shakthi124, I wanted to follow up here and see if you had any advice on next steps. Thanks!

shakthi124 · November 5, 2024, 9:09pm

Thank you for your reply. Please note that RTF conversions are not supported by our built-in Office conversion. Instead, you will need to use the ToPDF function, which uses an external application to convert RTF files.

As for the DOC conversion issue you are seeing, is this occurring with all DOC files, or with a single file? If it's the latter, can you please forward us the file by creating a ticket in our support portal: https://support.apryse.com/

Please also reference the forum post. Thank you.
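
Since the input here is a Stream with no file name, one way to choose the extension string, or to route RTF away from OfficeToPDF, is to look at the leading bytes of the stream. The sketch below is only an illustration and is not part of the PDFTron API: FormatSniffer and ContainerKind are hypothetical names. It identifies the container family only (legacy OLE2 binary, ZIP-based OOXML, or RTF), so it cannot tell DOC from XLS or PPT on its own. It may still help spot a file whose content does not match the extension it was labelled with.

```csharp
using System;
using System.IO;

// Hypothetical helper types, not part of PDFNet.
public enum ContainerKind { Unknown, Ole2Binary, ZipPackage, Rtf }

public static class FormatSniffer
{
    // OLE2 compound file header (used by legacy DOC, XLS and PPT).
    private static readonly byte[] Ole2Magic = { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };
    // ZIP local file header "PK\x03\x04" (used by DOCX, XLSX and PPTX).
    private static readonly byte[] ZipMagic = { 0x50, 0x4B, 0x03, 0x04 };
    // ASCII "{\rtf", the start of an RTF document.
    private static readonly byte[] RtfMagic = { 0x7B, 0x5C, 0x72, 0x74, 0x66 };

    // Reads up to 8 bytes from the current position, then restores the position.
    public static ContainerKind Sniff(Stream stream)
    {
        if (!stream.CanSeek)
            throw new ArgumentException("Stream must be seekable.", nameof(stream));

        long start = stream.Position;
        byte[] header = new byte[8];
        int read = 0;
        while (read < header.Length)
        {
            int n = stream.Read(header, read, header.Length - read);
            if (n == 0) break; // end of stream reached early
            read += n;
        }
        stream.Position = start;

        if (StartsWith(header, read, Ole2Magic)) return ContainerKind.Ole2Binary;
        if (StartsWith(header, read, ZipMagic)) return ContainerKind.ZipPackage;
        if (StartsWith(header, read, RtfMagic)) return ContainerKind.Rtf;
        return ContainerKind.Unknown;
    }

    private static bool StartsWith(byte[] data, int length, byte[] magic)
    {
        if (length < magic.Length) return false;
        for (int i = 0; i < magic.Length; i++)
        {
            if (data[i] != magic[i]) return false;
        }
        return true;
    }
}
```

A result of ContainerKind.Rtf would mean the stream should go through the ToPDF route instead of OfficeToPDF. For Ole2Binary and ZipPackage the caller still needs the original file name or other metadata to decide which extension string to pass to SetFileExtension.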

phil.heroux · November 5, 2024, 9:50pm

Hi @shakthi124, I will check with my management whether I am able to send you an example file in a separate ticket.

For the RTF conversion, the input I have is a MemoryStream, not a string. Is there an overload of ToPDF() that takes a Stream or Filter? I didn’t see one in the documentation.

Thanks,
Phil

shakthi124 · November 7, 2024, 12:00am

Thank you for your reply. Unfortunately, the ToPDF function that leverages external applications requires a file path (string) as input and cannot work with streams. In this case, would it be possible for you to write the stream out to a temporary file?
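
The temporary-file approach suggested above could look roughly like the sketch below. TempFileBridge and WithTempFile are hypothetical names, not PDFNet APIs. The helper copies the stream to a uniquely named file with the requested extension, hands the path to a callback, and deletes the file afterwards even if the callback throws. It assumes the external converter relies on the file extension, which is why the extension is passed in explicitly.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of PDFNet.
public static class TempFileBridge
{
    // Writes the stream to a temp file named <guid>.<extension>, invokes the
    // callback with the file path, and always removes the file afterwards.
    public static T WithTempFile<T>(Stream input, string extension, Func<string, T> convert)
    {
        string fileName = Guid.NewGuid().ToString("N") + "." + extension.TrimStart('.');
        string path = Path.Combine(Path.GetTempPath(), fileName);
        try
        {
            // CopyTo starts at the current position, so rewind seekable
            // streams (a MemoryStream that was just written sits at its end).
            if (input.CanSeek) input.Position = 0;

            using (FileStream file = new FileStream(path, FileMode.CreateNew, FileAccess.Write, FileShare.None))
            {
                input.CopyTo(file);
            }

            return convert(path);
        }
        finally
        {
            if (File.Exists(path)) File.Delete(path);
        }
    }
}
```

The callback is where the path-based ToPDF call from this thread would go, for example a lambda that calls it with pdfDoc and the supplied path and then returns true. The exact method name and signature in the .NET package should be checked against the API reference.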