Benefits of document classification

How business processes improve with it

Is document processing really needed?

With the spread of digitalisation, it is tempting to believe that document processing is a thing of the past. Aren't companies moving towards a paperless standard? Doesn't every transmission of information take place over a digital channel? If that were the case, it would be hard to find a place for document classification these days.

In fact, this assumption is misleading at best and utterly wrong at worst, for two reasons:

  • Not every market, and not every industry within a market, has made the transition to a fully digital operation. Think of the last time you signed a paper document. I would bet it was very recently. And that paper will be archived somewhere in old-fashioned ink-and-paper form.
  • Even if the transition to the digital sphere has been made or is midway through, the case for document processing is just as strong. The reason is that there is a big difference between human-readable documents and machine-readable objects. While an image file, be it PDF, JPG or similar, is good for human users or agents, it still needs to be processed along the same lines as old-fashioned paper. The need and the process remain nearly the same.
Therefore businesses and other organisations have to tackle the processing of vast amounts of unstructured data contained in images, PDFs or paper documents. Processing those documents typically means identifying the class they belong to (that's classification) but also extracting some relevant data from them (which is called entity extraction).

Text classification: what is it?

The concept is pretty simple: given a document, assign it to one of a closed set of predefined categories. For example, is this a resume, a payroll slip or an invoice? Those categories must have some solid, preexisting business meaning, and classifying the documents has to be a required step in some business workflow. The traditional approach is to set up a team of agents who go through the document stream and manually annotate the class that each document belongs to. Initially every agent in the team had to be in the same place, but now, thanks to technology, it is also possible to have decentralized teams accessing a database of digitized documents. While perfectly feasible, this approach presents a number of drawbacks:
  • The approach is not cost efficient. Specifically, costs grow in proportion to the document stream, which prevents any economy of scale.
  • Such a repetitive process is necessarily error-prone, even if the task is relatively simple. Some companies solve this problem by duplicating the team and classifying each document twice. Whenever there is disagreement, a third agent takes the final decision. Of course this solution increases the cost even more.
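
To make the cost of the double-annotation scheme concrete, here is a minimal sketch in Python. The function name, the document labels and the third-agent decisions are all hypothetical and only serve as an illustration; real teams will have their own tooling.

```python
def final_label(first, second, third=None):
    """Return (label, annotations_used) for one document.

    first and second are the labels given by the two regular agents.
    third is the decision of the extra agent, needed only on disagreement.
    """
    if first == second:
        return first, 2
    if third is None:
        raise ValueError("agents disagree and no third decision was given")
    return third, 3


# Hypothetical annotations: (first agent, second agent, third agent or None).
documents = [
    ("invoice", "invoice", None),
    ("payroll", "resume", "payroll"),
    ("resume", "resume", None),
]

results = [final_label(*doc) for doc in documents]
labels = [label for label, _ in results]
total_annotations = sum(used for _, used in results)

print(labels)
print(total_annotations)
```

Every document costs at least two manual annotations, plus one more for each disagreement, so the effort per document never drops as volume grows.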
The alternative is automatic document classification, where the right Natural Language Processing methodology, combined with other IT components such as OCR, web services and cloud processing, can deliver a scalable, fast and efficient solution.
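
As an illustration of what automatic classification can look like, the sketch below trains a tiny bag-of-words Naive Bayes classifier using only the Python standard library. This is one common textbook technique, not necessarily the method a production system would use. The class name, the training snippets and the category labels are hypothetical, and the texts stand in for what an OCR step would deliver.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesClassifier:
    """Tiny multinomial Naive Bayes over bag-of-words counts."""

    def fit(self, texts, labels):
        self.class_doc_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {
            word for counts in self.word_counts.values() for word in counts
        }
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.class_doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_doc_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Log prior plus log likelihoods with add-one smoothing.
            score = math.log(doc_count / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training snippets, standing in for OCR output.
training_texts = [
    "invoice number total amount due vat payment terms",
    "invoice date customer tax number amount payable",
    "payroll slip gross salary net salary employee deductions",
    "salary statement employee net pay social security",
    "resume work experience education skills references",
    "curriculum vitae education professional experience languages",
]
training_labels = ["invoice", "invoice", "payroll", "payroll", "resume", "resume"]

classifier = NaiveBayesClassifier().fit(training_texts, training_labels)
predicted = classifier.predict("Net salary paid to the employee this month")
print(predicted)
```

Once trained, classifying one more document is a single function call, which is where the scalability comes from. A real deployment would need far more training data and a proper evaluation against manually labelled documents.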

Information extraction

The job is incomplete if we limit ourselves to document classification. Our business process will typically require the extraction of relevant information from the document in structured form. For example:

  • From an invoice extract the invoice date or the tax number of the client
  • From a payroll slip extract the net salary or the tenure of the employee

The automation of such tasks is surrounded by a number of technical challenges that go beyond natural language processing and touch the field of computer vision. However this automation has an obvious range of advantages, from speed and efficiency to economies of scale.

In natural language processing there is a distinction between Named Entities (typically proper nouns such as country names or organizations, but also dates and numbers) and Slots (which are domain-specific pieces of data such as tax IDs or part numbers). Both heuristic and Machine Learning techniques are required to extract these data from the original image.
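
The heuristic side of slot extraction can be sketched with a few regular expressions. The example below is a minimal illustration in Python: the OCR text, the field labels, the slot names and the patterns are all invented for this post, and real invoices vary far too much for two patterns to be enough.

```python
import re
from datetime import datetime

# Hypothetical OCR output of an invoice; the layout and labels are invented.
ocr_text = """
ACME Supplies Ltd.
Invoice No: 2024-0815
Invoice date: 14/03/2024
Client tax number: ES-B12345678
Total due: 1,250.00 EUR
"""

# Hypothetical slot patterns, one per piece of data we want to extract.
SLOT_PATTERNS = {
    "invoice_date": re.compile(
        r"invoice date\s*:\s*(\d{2}/\d{2}/\d{4})", re.IGNORECASE
    ),
    "client_tax_number": re.compile(
        r"tax number\s*:\s*([A-Z]{2}-?[A-Z0-9]{8,12})", re.IGNORECASE
    ),
}


def extract_slots(text):
    """Return a dict of slot name to raw matched string, or None if absent."""
    slots = {}
    for name, pattern in SLOT_PATTERNS.items():
        match = pattern.search(text)
        slots[name] = match.group(1) if match else None
    return slots


def normalise_date(raw):
    """Convert DD/MM/YYYY into ISO 8601, or return None if parsing fails."""
    if raw is None:
        return None
    try:
        return datetime.strptime(raw, "%d/%m/%Y").date().isoformat()
    except ValueError:
        return None


slots = extract_slots(ocr_text)
slots["invoice_date"] = normalise_date(slots["invoice_date"])
print(slots)
```

Patterns like these break as soon as the layout or the wording changes, which is one reason Machine Learning techniques are needed alongside heuristics.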

In some cases non-textual information such as logos, signatures or stamps needs to be extracted and sometimes matched against a reference image. This falls in the computer vision domain once more.

Practical use cases

The range of situations where scanned or paper documents are sent for processing is very wide. We will mention only a handful of them.

  • Customer onboarding. Be it in insurance, telecoms, finance or utilities, the need for the supplier to get and process documents from the client can be found everywhere. The utility company needs the client's rental contract. The finance house asks for a payroll slip. The insurance company requests a blood test. Many more cases like these happen every day.
  • Insurance claims. There are few industries more swamped with documents than insurance. This is most evident in the claims department, where procedures require a number of supporting documents to back up each claim. A lot of manual processing is still in place.
  • Legal advice. This area deserves a post in its own right. Just imagine the time lawyers take to find a small piece of data across a number of thick files holding past court decisions or contracts.


We hope that the benefits of document classification have become evident in this post. Namely:

  • Speed. These days everybody wants and expects immediate turnaround of any process. Delaying a business process in insurance, credit, retail or any other business because someone needs to go through a huge stack of papers is unthinkable.
  • Efficiency. While technology is not always cheap and not every manual action is a candidate for automation, most of the time automation makes the process more reliable and makes economic sense for the business.
  • Economies of scale. Once automation is introduced, management gains the ability to grow the business in a fast and agile manner, tackling growth peaks by adding just a few technical resources, or none at all!