Unlocking the Power of PDFs: Converting to Markdown for LLMs

Aug 28, 2024

Introduction:

Large Language Models (LLMs) are powerful technologies capable of generating coherent and contextually relevant text. However, they may sometimes produce responses that lack factual accuracy or context.

By incorporating retrieval-based methods, such as Retrieval-Augmented Generation (RAG), we can enhance the quality of generated text by providing external context, mitigating “hallucination issues” and improving response relevance.

PDF documents are one of the most common formats for distributing information and a rich source of high-quality, domain-specific content, including academic papers, reports, and books. This makes them ideal source material for LLM applications that need to work with the information in these documents.

However, unlocking the information from these PDFs, in a manner that LLMs can readily process, presents several challenges:

Lack of Structural Metadata
Complex Layouts and Formats
Quality (OCR) Issues
Multimodal Content

Why Convert to Markdown?

Markdown is a lightweight markup language that lets users format plain text with simple syntax. Converting PDFs to Markdown lets the content keep its structural integrity. Unlike plain text extracted from PDFs, Markdown-formatted text preserves elements like headings, lists, and links. This preservation is crucial for LLMs, because it keeps the context and hierarchy of the information intact, leading to more accurate processing and analysis.

1. Preserve Structural Integrity: Markdown lets you organize information into headings, lists, and tables, revealing important content structure that is hidden in raw text. Clear headings help an LLM identify topics and subtopics, so its answers are more precise and contextually appropriate. This organization also reduces irrelevant information in the model's context and leads to clearer, more coherent responses.

2. Embedding Links and References: Markdown lets you embed hyperlinks, footnotes, and references. For RAG systems, this can be crucial for referring to external sources or providing additional context.

3. Refined Chunking: The structure preserved by Markdown allows for better chunking of the content, because headings mark natural boundaries. Text can be split section by section instead of at arbitrary positions in the full document.
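As a rough illustration of the chunking point, here is a minimal sketch that splits Markdown into one chunk per heading section. The function name `chunk_by_heading` and the shape of the returned dictionaries are my own assumptions, not part of any tool discussed in this article. It only looks at `#`-style headings and does not skip `#` lines inside fenced code blocks.

```python
import re

# Matches ATX-style headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str) -> list[dict]:
    """Split Markdown text into chunks, one per heading section."""
    chunks = []
    current = {"heading": None, "level": 0, "lines": []}

    def has_content(chunk: dict) -> bool:
        # Keep a chunk if it has a heading or any non-blank text.
        return chunk["heading"] is not None or any(
            line.strip() for line in chunk["lines"]
        )

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # A new heading closes the section collected so far.
            if has_content(current):
                chunks.append(current)
            current = {
                "heading": match.group(2).strip(),
                "level": len(match.group(1)),
                "lines": [],
            }
        else:
            current["lines"].append(line)

    if has_content(current):
        chunks.append(current)

    return [
        {
            "heading": c["heading"],
            "level": c["level"],
            "text": "\n".join(c["lines"]).strip(),
        }
        for c in chunks
    ]
```

Each chunk keeps its heading and level, so a retriever can attach the section title to the text it embeds. Text that appears before the first heading is returned as a chunk whose heading is `None`.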

Challenges Faced When Converting PDFs to Markdown:

1. Handling Lack of Structural Metadata

Converting PDFs to Markdown involves inferring and recreating structural metadata that is not explicitly present in the PDF. This requires sophisticated algorithms to identify and tag headers accurately, ensuring that the resulting Markdown document retains the structure and hierarchy of the original PDF.

Example, using the Aspose library: the converter did not recognize “1 Introduction” as a header and emitted it as regular text instead.
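One common way to infer headers is to compare font sizes. The sketch below is a simple heuristic of my own, not how Aspose or any other tool mentioned here works. It assumes you already have `(text, font_size)` pairs from some PDF library, treats the most frequent size as body text, and maps larger sizes to heading levels. The `ratio` threshold is an arbitrary assumption.

```python
from collections import Counter


def infer_headings(spans: list[tuple[str, float]], ratio: float = 1.15) -> list[str]:
    """Turn (text, font_size) pairs into Markdown lines.

    Spans whose font is noticeably larger than the body font become headings.
    """
    if not spans:
        return []

    # Assume the most frequent font size belongs to body text.
    body_size = Counter(size for _, size in spans).most_common(1)[0][0]

    # Distinct larger sizes, biggest first, map to heading levels 1, 2, 3, ...
    heading_sizes = sorted(
        {size for _, size in spans if size >= body_size * ratio},
        reverse=True,
    )

    lines = []
    for text, size in spans:
        if size in heading_sizes:
            level = min(heading_sizes.index(size) + 1, 6)
            lines.append("#" * level + " " + text)
        else:
            lines.append(text)
    return lines
```

A heuristic like this fails when headings use the body font size and differ only in weight or numbering, which may be what happens with a header such as “1 Introduction”. Real converters usually combine several signals.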

2. Preserving Layout Integrity

Retaining the original layout and structure of the PDF in Markdown is difficult, and formatting is often lost along the way. Multi-column layouts, whitespace, and tables present significant challenges. Markdown has limited support for complex layouts, necessitating creative solutions or compromises in the final output.

Example, using the pymupdf4llm library: the converter did not recognize the two-column structure of the text and extracted the right column before the left one (the text highlighted in blue appears before the text highlighted in red instead of the other way around).

If the content order is disrupted, the model may misinterpret the relationships between different parts of the text, leading to confusion and inaccuracies in its responses. This can result in flawed or irrelevant information being generated, which undermines the reliability and effectiveness of the LLM.
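To make the ordering problem concrete, here is a minimal sketch of restoring reading order on a two-column page. It assumes each text block is a dictionary with hypothetical keys `x0` (left edge), `y0` (top edge), and `text`, and that the page splits cleanly at its horizontal midpoint. This is an illustration, not the approach pymupdf4llm takes.

```python
def reading_order(blocks: list[dict], page_width: float) -> list[dict]:
    """Order text blocks for a two-column page.

    The left column is read top to bottom, then the right column.
    """
    midpoint = page_width / 2

    # Assign each block to a column by the position of its left edge.
    left = [b for b in blocks if b["x0"] < midpoint]
    right = [b for b in blocks if b["x0"] >= midpoint]

    def by_top(block: dict) -> float:
        return block["y0"]

    return sorted(left, key=by_top) + sorted(right, key=by_top)
```

Full-width elements such as titles, figures, or tables that span both columns break the midpoint assumption, so a real converter would need to detect the column layout per region instead of per page.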

3. Extracting Embedded Links

Markdown lets you embed hyperlinks and references found in the PDF. Especially in scientific papers, this can be crucial for referring to external sources, providing additional context, and maintaining academic integrity. However, identifying those links proves to be a difficult task for many converters.

Example, using the morethan.io pdf2md service: the converter preserved only the visible text of the link and did not emit the embedded link target in Markdown link syntax.
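A quick way to spot this kind of loss is to compare the link targets you expect with the ones that survive in the Markdown output. The sketch below assumes you already have the list of URIs from the PDF's link annotations, obtained with whatever PDF library you use. The function name `missing_links` is hypothetical, and the regular expression is simplified: it does not handle URLs that contain parentheses or whitespace.

```python
import re

# Simplified pattern for inline Markdown links: [text](target)
MD_LINK = re.compile(r"\[([^\]]*)\]\(([^)\s]+)\)")


def missing_links(markdown: str, expected_uris: list[str]) -> list[str]:
    """Return the expected URIs that do not appear as Markdown link targets."""
    found = {uri for _, uri in MD_LINK.findall(markdown)}
    return [uri for uri in expected_uris if uri not in found]
```

An empty result means every expected target appears somewhere in the output. It does not confirm that each link is attached to the right piece of text.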

4. OCR Issues

Scanned or flat PDFs introduce additional complexity because they require Optical Character Recognition (OCR). OCR is prone to errors, especially with poor scan quality, varying fonts, or handwritten text. These errors lead to incorrect text extraction, which in turn hurts the accuracy and relevance of the responses from any LLM that is fed the extracted text.

Example, using JINA.AI’s Reader API: the highlighted text is scanned and therefore flat. JINA.AI did not recognize any of that text and preserved only the headings (which were not flat) in the extracted Markdown.
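Before running OCR, it helps to know which pages actually need it. The sketch below is a simple heuristic under my own assumptions: it takes the text already extracted from each page (by any library) and flags pages whose text layer is nearly empty. The `min_chars` threshold is arbitrary.

```python
def pages_needing_ocr(page_texts: list[str], min_chars: int = 20) -> list[int]:
    """Return 0-based indexes of pages whose text layer is nearly empty."""
    flagged = []
    for index, text in enumerate(page_texts):
        # Count only visible (non-whitespace) characters.
        visible = sum(1 for ch in text if not ch.isspace())
        if visible < min_chars:
            flagged.append(index)
    return flagged
```

A page-level check like this would likely miss a mixed page such as the one described above, where real headings sit next to a scanned body. Catching that case requires checking regions of the page or comparing the amount of text against the area covered by images.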

5. Extracting Formulas

Academic papers often contain many formulas. Because formulas are riddled with inconsistent formatting and special characters, many PDF parsers have a hard time deciphering them.

Example, using LlamaParse: the formulas from the original PDF are converted to LaTeX to preserve their structure.

After rendering the LaTeX back into a readable formula, it's evident that the formula was extracted incorrectly!

This cannot be overlooked: formulas often play a vital role in the content and meaning of these papers, and a formula that is parsed incorrectly can lead to misinformation in LLM outputs.
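A cheap first check on extracted formulas is to look for structural damage in the LaTeX. The sketch below is only a sanity check under my own assumptions: it counts braces and `\left`/`\right` pairs, and the function name `latex_structure_issues` is hypothetical.

```python
import re


def latex_structure_issues(latex: str) -> list[str]:
    """Report basic structural problems in an extracted LaTeX formula."""
    issues = []
    depth = 0
    i = 0
    while i < len(latex):
        ch = latex[i]
        if ch == "\\":
            # Skip the character after a backslash so that escaped
            # braces (\{ and \}) are not counted as grouping braces.
            i += 2
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                issues.append("closing brace without a matching opening brace")
                depth = 0
        i += 1

    if depth > 0:
        issues.append(f"{depth} unclosed opening brace(s)")

    # Count \left and \right, but not longer commands such as \leftarrow.
    lefts = len(re.findall(r"\\left(?![A-Za-z])", latex))
    rights = len(re.findall(r"\\right(?![A-Za-z])", latex))
    if lefts != rights:
        issues.append("mismatched \\left and \\right")

    return issues
```

A formula can pass this check and still be wrong, for example with a swapped exponent or a dropped symbol. The check only catches output that is structurally broken, so comparing the rendered formula against the original PDF remains necessary.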

6. Optimizing Processing Time

Efficient processing is vital when converting PDFs to Markdown, particularly with large documents or batches. Slow processing can impede productivity and delay access to crucial data, underscoring the importance of choosing a tool that balances speed with accuracy and quality.
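To compare tools on speed, you can wrap each converter in a small timing harness. The sketch below assumes `convert` is any callable that takes a file path and returns Markdown. It stands in for whichever tool you are testing and does not correspond to a specific library's API.

```python
import time
from typing import Callable


def time_converter(convert: Callable[[str], str], paths: list[str]) -> dict:
    """Run `convert` on each path and report wall-clock timings in seconds."""
    timings = {}
    for path in paths:
        start = time.perf_counter()
        convert(path)
        timings[path] = time.perf_counter() - start

    total = sum(timings.values())
    return {
        "per_file": timings,
        "total": total,
        "mean": total / len(paths) if paths else 0.0,
    }
```

Wall-clock time for hosted services also includes network latency and queueing, so it is worth timing several runs and comparing speed alongside output quality instead of on its own.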

Conclusion

Converting PDFs to Markdown involves addressing several significant challenges, from handling the lack of structural metadata to preserving layout integrity and embedded links.

In upcoming articles, I will dig deeper into these challenges and discuss how the tools below attempt to solve them:

LlamaParse — powered by LLMs
JINA.AI
pymupdf4llm
morethan.io: pdf2md
Aspose
ContextForce — powered by LLMs

Stay tuned!
