Visualizing the Hidden Structures of Unstructured Documents

Editor’s note: Cedric Clyburn and William Caban are both speakers for ODSC East this May 13th to 15th in Boston! Be sure to check out their talk, “Structuring the Unstructured: Advanced Document Parsing for AI Workflows,” there!

We’ve all been there: tackling the challenge of extracting unstructured data from documents while maintaining context awareness and fidelity. There is no shortage of tools for converting PDF or DOCX files to Markdown or plain text. These work well for documents with simple, clean layouts, which are far from what we find in the wild: enterprise reports, forms, magazines, presentations, and everything in between.

An enterprise document is not just text and simple tables. Enterprise and technical documents may contain diagrams, graphs, tables, multi-column layouts, highlighted or formatted text, block quotes, and code snippets. These elements and visuals provide clarity and enhance understanding of the topic in the document. Solving this for traditional NLP problems or retrieval systems, or extracting knowledge from documents to train models, continues to be challenging. Different tools tackle different aspects of this domain and are optimized for different parts of the challenge. Over the years, I have used many tools while working in these domains, crafting a “magic combination” for each use case.

Last year, while looking to improve context-aware chunking techniques, I learned about the open-source tool Docling. Initially, I thought it was just another PDF processing tool and started testing it with that mental model. Boy, was I wrong! Don’t get me wrong: it does all of that, but it is much more. The secret is in the what and how: the way it captures the document layout, the text formatting, and the block types within the document, and makes them accessible to developers, is what makes it unique.

Docling’s capability to uncover structures and patterns hidden within unstructured data is a powerful feature that can easily be overlooked. To achieve this, Docling extracts the page and document layout along with the coordinates of each element, then projects the extracted text back onto the discovered layout to determine which text belongs to the same blocks, columns, paragraphs, lists, and so on.
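
To make this concrete, here is a minimal sketch (not from the original article) of how that layout information can be inspected with Docling’s Python API. The input file name is a placeholder, and attribute names may vary slightly between Docling versions.

# Sketch: inspect the layout metadata Docling attaches to each text element.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("report.pdf")   # placeholder path; any PDF works
doc = result.document                      # a DoclingDocument

# Every text item carries a label (title, paragraph, list item, ...) and
# provenance: the page it came from and its bounding box on that page.
for item in doc.texts:
    if item.prov:
        prov = item.prov[0]
        print(item.label, "page", prov.page_no, prov.bbox, "->", item.text[:60])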

Take, for example, a Spanish document from the government of El Salvador (https://www.mh.gob.sv). The PDF for the following example is Strategy-Manager-Financiera-Ade-El-Risgo-de-Desastres.pdf, which contains tables, images, indices, vertical text, and the underlying visual elements that determine whether pieces of text belong together or to different chunks. When extracting the text to a simple format like Markdown, even when the text is identified, much of the contextual information is lost, making it difficult to determine the context of a given text with high accuracy for advanced NLP tasks.

For easier visualization of the impact of the different types of conversions, let’s use docling-serve.

# install docling-serve 
pip install docling-serve gradio 

# run the docling serve API interface 
docling-serve run --enable-ui --host 127.0.0.1 --port 10100 
Starting production server

Server started at http://127.0.0.1:10100 
Documentation at http://127.0.0.1:10100/docs 
UI at http://127.0.0.1:10100/ui 

Logs: 
INFO: Started server process [8863] 
INFO: Waiting for application startup. 
INFO: Application startup complete. 
INFO: Uvicorn running on http://127.0.0.1:10100 (Press CTRL+C to quit)
...

Connecting to the UI gives quick access to the conversion settings and lets us interact with the service.

The Markdown version can encode images inline and extracts the text.

The Markdown-rendered version shows how the text will be interpreted when converting the PDF into Markdown. As the screenshot below shows, the context information derived from the original layout is completely lost.
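
For reference, that Markdown export can be reproduced with a couple of lines of the Python API. This is just a sketch, with a placeholder file path:

# Sketch: exporting the converted document to Markdown flattens the layout.
from docling.document_converter import DocumentConverter

doc = DocumentConverter().convert("report.pdf").document  # placeholder path
print(doc.export_to_markdown())  # the text survives, but coordinates and layout do not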

When using the DoclingDocument format, the Docling-rendered view tells a different story. This option maintains a high-fidelity representation of the original document, including its layout and formatting. Each piece of text, including the rotated text on the left of the page, is identified and extracted as a standalone text element with coordinates and other metadata, making it possible to render a document very close to the original PDF from a structured JSON format.
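
As a rough sketch of what that looks like in code (method names may differ slightly between Docling versions), the same DoclingDocument can be serialized to structured JSON that keeps labels, coordinates, and reading order:

# Sketch: save the lossless DoclingDocument representation as JSON.
import json
from docling.document_converter import DocumentConverter

doc = DocumentConverter().convert("report.pdf").document  # placeholder path
with open("report.json", "w", encoding="utf-8") as f:
    json.dump(doc.export_to_dict(), f, ensure_ascii=False, indent=2)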

The richness of the metadata and layout information that Docling captures as structured output when processing a document is what sets it apart. That information expands the possibilities for traditional NLP use cases, for retrieval systems like RAG, and for creating training datasets for LLMs.

When designing a retrieval system, each text chunk can easily be mapped back to the exact location in the document it came from. In addition, when using the HybridChunker, a tokenization-aware hierarchical chunker, each chunk retains context awareness and its place in the document hierarchy. All of these native capabilities positively impact the accuracy and quality of retrieved documents. What would usually require custom implementations combining multiple approaches to enrich text chunks is now a native function of the tool.
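
Here is a minimal sketch of that chunking flow, assuming a DoclingDocument from an earlier conversion; the metadata field names (headings, doc_items, prov) reflect recent Docling releases and may differ in older ones.

# Sketch: tokenization-aware chunking that keeps hierarchy and provenance.
from docling.chunking import HybridChunker
from docling.document_converter import DocumentConverter

doc = DocumentConverter().convert("report.pdf").document  # placeholder path
chunker = HybridChunker()  # a specific tokenizer can be configured if needed

for chunk in chunker.chunk(dl_doc=doc):
    headings = chunk.meta.headings or []        # hierarchical context (section path)
    item = chunk.meta.doc_items[0]              # document element(s) the text came from
    page = item.prov[0].page_no if item.prov else None
    print(" > ".join(headings), "| page", page, "|", chunk.text[:80])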

When using the extracted text to ground training datasets for model distillation, the rich metadata makes it possible to contextualize each section of the document as knowledge is extracted for the training dataset.
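
As one illustration, the chunks from the previous sketch could be turned into grounded dataset records; the record schema below is purely illustrative and not a Docling feature.

# Sketch: build dataset records that keep each chunk's section context and pages.
records = []
for chunk in chunker.chunk(dl_doc=doc):
    records.append({
        "context": " > ".join(chunk.meta.headings or []),  # section the text belongs to
        "text": chunk.text,
        "pages": sorted({p.page_no for it in chunk.meta.doc_items for p in it.prov}),
    })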

Wrapping Up

Docling represents a significant advancement in document processing technology. It preserves the structural integrity and context that most conversion tools discard. Maintaining layout information, text formatting, and hierarchical relationships enables developers to create more accurate and context-aware document processing systems.

For developers working with enterprise documents, the implications are substantial:

  1. Enhanced Retrieval Systems: RAG applications benefit from precise document mapping and context awareness, leading to more accurate information retrieval.
  2. Improved Training Data: Rich metadata allows for better contextualization of extracted knowledge when creating datasets for LLM training.
  3. Layout-Aware Processing: The ability to understand document structure means applications can interpret information like humans do, considering headings, formatting, and spatial relationships.
  4. Better Multi-Format Support: Complex elements like rotated text, multi-column layouts, and tables are preserved rather than flattened.

As unstructured documents remain a primary source of enterprise knowledge, tools like Docling that can extract structure from seeming chaos become increasingly valuable. The patterns hidden in the noise of our documents are finally becoming visible, allowing us to process information with the context and nuance it deserves.

If you’re working with document processing, especially for enterprise or technical documents with complex layouts, I encourage you to explore what Docling can offer for your specific use cases. The ability to maintain structural fidelity while converting documents to machine-readable formats may be the missing piece in your document processing pipeline.

About the Authors/ODSC East 2025 Speakers:

William Caban, Product Manager in Red Hat’s AI Business Unit, is a technology leader who bridges the gap between cutting-edge AI innovation and enterprise solutions. With deep expertise in high-performance computing and machine learning operations, he has successfully architected and deployed AI platforms that scale across global organizations. At Red Hat, William leads the development of enterprise-grade Generative AI solutions, helping organizations navigate the complexities of large language models (LLMs), responsible AI governance, and seamless integration with existing infrastructure. His patent portfolio reflects breakthrough contributions in distributed computing and AI systems optimization. Beyond the corporate sphere, William dedicates his time to mentoring social entrepreneurs and sharing practical frameworks for embedding ethical AI principles into product development while maximizing social impact.

Cedric Clyburn (@cedricclyburn), Senior Developer Advocate at Red Hat, is an enthusiastic software technologist with a background in Kubernetes, DevOps, and container tools. He has experience speaking at and organizing conferences including DevNexus, WeAreDevelopers, The Linux Foundation, KCD NYC, and more. Cedric loves all things open source and works to make developers’ lives easier! He is based out of New York.


