Extracting Data From PDFs and Other Unstructured Documents

If your organization is looking to speed up mundane tasks like extracting information from PDFs, emails, text messages, and other unstructured documents, this article walks you through the basic concepts behind how that is done.  While I focus mostly on PDF documents, the same concepts apply to any type of document, whether it is a Word document, a text file, an email or text message, or even an audio recording transcript.

PDFs Are Great For Sharing Documents

PDF stands for ‘Portable Document Format’, and it is the go-to format for sharing documents digitally.  If you create a document in Word and want to share it, sending the actual Word document may not be a good approach, since the person you are sharing it with may not have Word installed.  Or, you may not want to hand over an editable source document, but rather a non-editable version of it.  PDFs are versatile because they maintain the layout of the original document, and allow people to share documents, presentations, invoices and so on in a way that looks the same on any device.

Extracting Information From PDFs

While PDFs are excellent for sharing, they can be challenging if you need to extract information from the document.  Depending on how the PDF was created, the text that comprises the document may or may not be selectable.  If the document was scanned, or was published as flattened page images so that it cannot be edited, each page is actually an image, which means that the text doesn’t exist in the PDF as text.  This is a problem if you need to programmatically extract information from the PDF.  Realistically, even if the PDF is saved with its text intact, the unstructured nature of these types of documents makes it very challenging to isolate the data that you are looking for and pluck it out of the document.  The term ‘unstructured text’ is an important one to understand, so I will explain exactly what that means.
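To make the text-versus-image distinction concrete, here is a minimal sketch of how a program might guess whether a PDF carries real text.  It assumes that some PDF library has already pulled out whatever text it could find on each page; the function name needs_ocr and the thresholds are my own illustrative choices, not part of any standard.

```python
def needs_ocr(page_texts, min_chars_per_page=20):
    """Guess whether a PDF is image based from its per-page extracted text.

    page_texts is assumed to be a list of strings, one per page, produced by
    whatever PDF library you use. The threshold is an arbitrary example value.
    """
    if not page_texts:
        return True
    # Count pages whose extracted text has a meaningful number of
    # non-whitespace characters.
    pages_with_text = sum(
        1
        for text in page_texts
        if len("".join(text.split())) >= min_chars_per_page
    )
    # Treat the document as image based when fewer than half of its pages
    # carry usable text.
    return pages_with_text < len(page_texts) / 2


scanned = ["", "  ", ""]
digital = [
    "Invoice 1042 issued to Acme Corp on 2024-01-15",
    "Total amount due: 1,250.00 USD",
]
print(needs_ocr(scanned))  # True
print(needs_ocr(digital))  # False
```

A check like this is only a heuristic.  A scanned PDF can contain a stray text stamp, and a text PDF can contain mostly blank pages, so the threshold usually needs tuning against real documents.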

Unstructured vs Structured Data

Unstructured data is data that is written as prose, in sentences and paragraphs.  It can be an essay, a contact form submission from your website, or an email that comes in from a client.  In general, unstructured text does not have clear labels or hints that a software application could use to build a rule-based algorithm to extract the data.

Key differences between structured and unstructured text

  1. Known Layout and Organization:  structured documents have a well defined format.  They typically contain fields, labels, titles and separators that software can use to locate and extract data.  Unstructured documents have no such predictable layout.
  2. Content in Documents:  structured documents typically have succinct information specific to the topic, whereas unstructured documents may be longer and have less specific information that is harder to extract.
  3. Extraction Process:  structured text can be extracted using tried and tested pattern matchers like regular expressions, whereas unstructured text requires more sophisticated natural language processing (NLP) models that are trained to extract the information.
  4. Accuracy in Data:  generally speaking, structured data can be extracted with greater accuracy than unstructured data, which relies on an NLP model for extraction.  While structured documents yield better results, unfortunately the majority of documents are unstructured.
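As a small illustration of point 3, the sketch below pulls fields out of a structured document with regular expressions.  The sample document, the field labels, and the function name extract_fields are all made up for this example.

```python
import re

# A made-up structured document: every value sits behind a predictable label.
STRUCTURED_DOC = """\
Invoice Number: INV-1042
Invoice Date: 2024-01-15
Total Due: 1250.00
"""

# One pattern per field. Each pattern relies on the label being present.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"^Invoice Number:\s*(\S+)", re.MULTILINE),
    "invoice_date": re.compile(r"^Invoice Date:\s*(\d{4}-\d{2}-\d{2})", re.MULTILINE),
    "total_due": re.compile(r"^Total Due:\s*([\d.]+)", re.MULTILINE),
}


def extract_fields(text):
    """Return a dict of field name to captured value, or None when not found."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


print(extract_fields(STRUCTURED_DOC))

# The same rules find nothing in prose that carries the same information.
email = "Hi, I'm still waiting on the invoice from mid-January for about twelve hundred dollars."
print(extract_fields(email))
```

The second call returns None for every field, which is the gap that trained NLP models are meant to fill.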

Unstructured Data Extraction Process

Now that you understand the differences between structured and unstructured data, I will explain how exactly information is extracted from unstructured documents, like PDFs.

As shown in the diagram above, the steps are fairly straightforward, since there is only one decision to make: whether the PDF is text based or image based.  If the PDF is image based, the pipeline runs the document through an Optical Character Recognition (OCR) microservice that tries to transcribe the text found on each page.  If the document is already text based, the pipeline simply takes the text pulled from the PDF.  Either way, the resulting text is run through the AI model that extracts the information.  Once the information is extracted, it is sent to an application database so that it can be persisted and used by other systems.
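The flow described above can be sketched in a few lines.  This is only an outline under stated assumptions: the OCR service and the AI model are passed in as plain functions, the fake_ocr and fake_extract stand-ins exist purely so the example runs, and the table layout is hypothetical.

```python
import json
import sqlite3


def run_pipeline(document_id, page_texts, page_images, ocr, extract, conn):
    """Route a document through OCR if needed, extract fields, and persist them."""
    # The single decision: does the PDF already carry usable text?
    has_text = any(text.strip() for text in page_texts)
    if has_text:
        text = "\n".join(page_texts)
    else:
        # Image-based PDF: transcribe each page image instead.
        text = "\n".join(ocr(image) for image in page_images)

    # Both branches end up at the same extraction step.
    fields = extract(text)

    # Persist the extracted fields as JSON in the application database.
    conn.execute(
        "INSERT INTO extractions (document_id, fields) VALUES (?, ?)",
        (document_id, json.dumps(fields)),
    )
    conn.commit()
    return fields


# Stand-ins for the real OCR microservice and the trained model.
def fake_ocr(image_bytes):
    return image_bytes.decode("utf-8")


def fake_extract(text):
    return {"word_count": len(text.split())}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE extractions (document_id TEXT, fields TEXT)")

# A text-based document skips OCR; an image-based one goes through it.
run_pipeline("doc-1", ["Agreement between A and B"], [], fake_ocr, fake_extract, conn)
run_pipeline("doc-2", [""], [b"Scanned page text"], fake_ocr, fake_extract, conn)
print(conn.execute("SELECT document_id, fields FROM extractions").fetchall())
```

Passing the OCR and extraction steps in as functions keeps the decision logic separate from the services behind it, so either one can be swapped out without touching the pipeline.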

Example Use-Cases

While the above example is a bit abstract, the following example use-cases should get your wheels turning:

  1. Extracting Information from Complex Legal Agreements:  models can be trained to identify all of the parties, dates, and terms of complex contracts and send the summarized information to your main CRM.
  2. Extracting Information From Customer Emails:  organizations that need to deal with a high volume of emails from customers can have models trained to extract key data or sentiment from the emails.  For example, a model could watch for angry emails that relate to clients tagged as high-priority.  Or models can be trained to extract order numbers, SKUs, and other vital information when customers reach out via email.
  3. Extracting Data from Invoices or Receipts:  if your A/R or A/P department deals with a lot of digital invoices, receipts or statements from different customers and suppliers, a model can be trained to understand these semi-structured documents and extract the relevant data from them.  From there, the data can be stored in a database or it can trigger a workflow based on the data that was extracted.
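To show what triggering a workflow from extracted data might look like, here is a minimal sketch for the invoice use-case.  The ExtractedInvoice fields, the workflow names, and the approval threshold are all hypothetical; a real system would use whatever your model outputs and your business rules require.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractedInvoice:
    """Fields a trained model might return; None means it could not find one."""
    supplier: Optional[str]
    invoice_number: Optional[str]
    total: Optional[float]


def choose_workflow(invoice, approval_threshold=5000.0):
    """Pick a follow-up workflow based on what was extracted."""
    values = (invoice.supplier, invoice.invoice_number, invoice.total)
    # A missing field suggests the extraction is incomplete, so a person
    # should look at the document.
    if any(value is None for value in values):
        return "manual_review"
    # Large amounts go to a manager; the threshold is an example value.
    if invoice.total >= approval_threshold:
        return "manager_approval"
    return "auto_post"


print(choose_workflow(ExtractedInvoice("Acme Corp", "INV-1042", 1250.0)))
print(choose_workflow(ExtractedInvoice("Acme Corp", "INV-1043", 18000.0)))
print(choose_workflow(ExtractedInvoice("Acme Corp", None, 300.0)))
```

Routing incomplete extractions to a person is one simple way to keep model mistakes out of downstream systems.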

Next Steps

This article is just a brief overview of the process needed to extract data from PDF documents.  While the overall concepts are straightforward, assembling the data, validating it, and training models with it can be complex.  Feel free to contact Wired Solutions if your organization needs help with its information extraction project.
