Context and the history of document processing
Andrzej Kinastowski (AK): Konnichiwa, welcome to the Office Samurai podcast, where we strike inefficiencies with the precision of an Edo period archer. This time we’re going to be discussing Intelligent Document Processing (IDP), so digitized processing of invoices, purchase orders, account statements, government IDs, customs documents, all the paper and paperless paper you may have. I’m your host Andrzej Kinastowski, one of the founders of Office Samurai, the company that believes your business shouldn’t be held hostage by a poorly scanned document. Now grab your favorite katana or that surprisingly pointy letter opener you liberated from accounting, and let’s get to it.
Almost 20 years ago, when I was starting my corporate career, I was hired into a BPO, and my job at first was to book invoices, which was a usual entry-level job at that time. Every day I would get a package of actual paper invoices, and I would type them into a computer. I remember thinking at the time that it was quite crazy that we were in the 21st century, yet big international companies were sending each other pieces of paper, pieces of paper that humans then needed to put into the computer. It was peculiar. Some other customers of that BPO were using OCRs (Optical Character Recognition), and I was quite jealous of them.
A few years later, when I moved into continuous improvement and was doing Lean management projects, I saw that some of those teams actually put more work into fixing what the OCR had done wrongly than they would have if they had just booked those invoices manually. Those were old-school OCRs where, for every new invoice, you had to build what was called a mask. If the invoice template changed, you had to change the mask. The mask would tell the tool that with this type of invoice, you can expect the bank account to be in this part of the page and the gross amount in that part. So it required a lot of maintenance, and it wasn’t that great. Today we’re here to talk about how those things are being done now.
I am joined today by Tomasz Wierzbicki. He is one of the consultants in Office Samurai. He is an expert in a lot of things, working on Intelligent Document Processing, Communication Mining, GenAI Agents, among other things. Tomasz, welcome to the podcast.
Tomasz Wierzbicki (TW): Hi, very happy to be here.
AK: Tell me, why are you here (meaning why are you the one that I am talking to about Intelligent Document Processing)?
TW: I deal with Intelligent Document Processing and recently more towards like classical machine learning and Communication Mining (which is mostly NLP, Natural Language Processing). The story started somewhere around two years ago. I remember our tech lead Konrad doing some experiments with Document Understanding, and I tried it and got hooked up on it very quickly. It’s very interesting to be combining the stuff we regularly do (RPA or automation) with this little bit of Artificial Intelligence. Nowadays, with the rapid advance of technology, it’s a very interesting time and domain to be in. I feel like I’m learning every day, and stuff gets obsolete very quickly right now.
Evolution: from classical OCR to layout agnostic IDP
AK: When it comes to the old-school OCRs that we were using 20 years ago in a BPO, how are today’s tools different? Because what I saw 20 years ago was not that impressive.
TW: Just a short disclaimer: for me it’s going to be more like 10 to 15 years ago (20 years ago I was more like a high school student). I did have to deal with that classical stuff. Back at the time, as I remember it, it was a question of whether it really paid off, because those OCRs were freakishly expensive to use, and back then the technology was not as mature as it is right now. Our tech lead Konrad said that right now OCRs are kind of hitting the upper limit of their efficiency, so the technology is really mature and is currently being enhanced with some deep learning and probably large language models as well.
There were still a lot of mistakes, so the volumes you wanted to process with OCR really needed to be huge for the solution to pay off in the long run, because you would still need to repair typos (the common thing was that ‘1’ looked like ‘l,’ zeros looked like ‘O’s, and stuff like that). There was a lot of post-processing done on that OCR text.
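The kind of post-processing Tomasz describes can be sketched in a few lines. This is a minimal, illustrative example, assuming we know a field should be purely numeric; the substitution table and function name are my own, not any vendor’s actual rules.

```python
# Correct common OCR character confusions ('l' read for '1', 'O' for '0')
# in fields we know should be numeric, then strip anything non-numeric.
CONFUSIONS = str.maketrans({"l": "1", "I": "1", "O": "0", "o": "0", "S": "5", "B": "8"})

def clean_numeric_field(raw: str) -> str:
    """Apply confusion-character substitutions, keep only digits and separators."""
    substituted = raw.translate(CONFUSIONS)
    return "".join(ch for ch in substituted if ch.isdigit() or ch in ".,-")

print(clean_numeric_field("1O2.5O"))    # → 102.50
print(clean_numeric_field("l,250.00"))  # → 1,250.00
```

Real pipelines combined many such heuristics, which is exactly why maintenance was so heavy.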
The classical approach was that even though there was OCR, it couldn’t do much more than read and recognize text. Right now with IDP (Intelligent Document Processing), we are also doing pattern recognition, so that the tool is layout agnostic.

Document structure categorization
TW: When you have a document, you’re probably going to have some sort of structure. We could split those documents into categories:
- No Structure at All: One contract or certificate might be laid out totally differently from the next.
- Semi-Structured: Common types like invoices or purchase orders, where you can expect them to have certain stuff on them. In Polish legislation, you are legally bound to include certain types of information on your invoice. You can expect not only the presence of a given piece of data but also some structure (like a header repeated over multiple pages).
- Full Structure: Our famous PIT tax form, or the old paper bank-transfer forms at the post office.
Right now, we’re not only focused on getting the text but also on getting locations, probably the angle at which the document is skewed, and all sorts of other parameters. Long story short, it’s not only about getting the text but also about recognizing structure, so it’s more of an image recognition kind of situation, not just the text part.
AK: This is something that is very easy for humans. I remember when I was taught to read those invoices (they showed me a few, they told me what kind of information I’m looking for), and it’s very easy for a human to recognize those patterns. But it was always very hard for a computer to do it.
IDP workflow: classification, feature extraction, and data output
AK: How does it work so that we put a digitized invoice in, and then as an output, we get all the information nicely sorted into an Excel table?
TW: Based on the projects I’ve done, the usual path would involve a couple of crucial steps that you would always do.
Step 1: Digitization and Feature Extraction
TW: Not every document you have is going to be native (a native PDF kind of situation, where you can actually select the text). Sometimes it’s flat or even scanned, so those tools need to either OCR it or just extract the text. The first step is to digitize, and here the whole OCR realm is open for research.
Basically, we need to get the text, but apart from just the text, we also extract features. It takes the form of a big object (in programming, code-wise) with all sorts of different parameters: words, their locations, maybe their nearest neighbors, layout features like the skew of the document, and all the different weights and biases you can expect. This document object model is then fed into some kind of machine learning algorithm.
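To make the “big object” idea concrete, here is a toy sketch of such a document object model: per-word text plus geometry and confidence, not just a flat string. The class and field names are illustrative assumptions, not any specific vendor’s schema.

```python
# A toy document object model: the digitization step emits words with
# bounding boxes and OCR confidence, plus page-level features like skew.
from dataclasses import dataclass, field

@dataclass
class Word:
    text: str
    page: int
    box: tuple[float, float, float, float]  # x0, y0, x1, y1 in page units
    confidence: float                        # OCR confidence, 0.0-1.0

@dataclass
class DocumentModel:
    words: list[Word] = field(default_factory=list)
    skew_degrees: float = 0.0                # estimated page rotation

    def full_text(self) -> str:
        return " ".join(w.text for w in self.words)

doc = DocumentModel(
    words=[Word("Invoice", 1, (50, 40, 120, 55), 0.98),
           Word("No.", 1, (125, 40, 150, 55), 0.95),
           Word("2024/001", 1, (155, 40, 230, 55), 0.91)],
    skew_degrees=0.4,
)
print(doc.full_text())  # → Invoice No. 2024/001
```

The downstream models consume exactly this kind of structure: the text for language features, the boxes for layout features.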
In a more classical approach, this would probably be a combination of deep neural nets: convolutional neural networks for the image (the structure part), with embeddings on top for the text part. Those models are either pre-trained (the supplier can offer you models specifically for invoices or some other document types) or you need to do the training on your own.
Step 2: Document Classification
TW: The first thing you want to do is classification. If you cannot ensure that you are only getting one particular document type, you will want to classify. The algorithm tells you what document type it is (e.g., invoice or maybe credit note) without human intervention. It’s not going to be a simple keyword-based classifier, but again, a machine learning approach.
AK: So, if the algorithm isn’t sure that this is this particular type of document, then it would seek human help, and then this data we can use to retrain the model to make it stronger, to make it able to be more sure of its decisions?
TW: Yeah, exactly. You can build that feedback loop: since you needed to involve a human for validation anyway, why not use that work to supply more training examples to the model. Once you know which document type it is, you can feed it to specialized models (e.g., one model just for invoices, another for orders), routing each document to the right model.
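The routing-plus-feedback loop described above can be sketched as follows. The classifier here is a stub standing in for a real trained model, and the threshold value is an illustrative assumption.

```python
# Classify a document; below a confidence threshold, ask a human and
# keep the corrected label as a future training example.
CONFIDENCE_THRESHOLD = 0.80
training_examples = []  # feedback loop: human-corrected labels for retraining

def classify(document_text: str) -> tuple[str, float]:
    """Stub classifier; a real one would be a trained ML model."""
    if "credit note" in document_text.lower():
        return "credit_note", 0.92
    if "invoice" in document_text.lower():
        return "invoice", 0.95
    return "unknown", 0.30

def route(document_text: str, ask_human) -> str:
    label, confidence = classify(document_text)
    if confidence < CONFIDENCE_THRESHOLD:
        label = ask_human(document_text)                   # human in the loop
        training_examples.append((document_text, label))   # reinforce the model
    return label

print(route("Invoice No. 2024/001", ask_human=lambda t: "invoice"))       # → invoice
print(route("Delivery note #17", ask_human=lambda t: "delivery_note"))    # → delivery_note
```

Once the label is known, the document can be handed to the specialized extraction model for that type.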
Step 3: Data Extraction
TW: The second part, which is what people most often expect, is extraction (just getting the important stuff out of the document). On invoices, that would be the invoice number, a date, and the products or services you’re paying for. The goal is for the algorithm to actually grab those specific values so you end up with a structured data format you can then work with. This way, you don’t need a human just retyping them manually, or even copy-pasting, which is also a mundane and tedious task.
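What “structured data format” means in practice can be as simple as a record appended to a table. A minimal sketch, with illustrative field names rather than any vendor schema:

```python
# Turn extracted fields into one CSV row: the structured output that
# downstream automation or an ERP import can consume.
import csv
import io

extracted = {
    "invoice_number": "2024/001",
    "invoice_date": "2024-03-15",
    "gross_amount": "1250.00",
    "currency": "PLN",
}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(extracted))
writer.writeheader()
writer.writerow(extracted)
print(buffer.getvalue().strip())
```

The same record could just as easily go into an Excel table, a database insert, or an API call.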

Implementation and the Human in the Loop (HITL) imperative
AK: So the tool understands what the type of the document is, it knows what data it needs to get from it, and then it gets it from the document. What happens then? How does this connect to our business processes?
TW: Since you’ve got structured data, you can basically do anything with it that is possible programmatically or RPA-wise. How it connects depends on the vendor of your IDP solution.
🔸 It can be anything from basic API (any coding language will do).
🔸 With more sophisticated platforms (like UiPath), there will be some ready-made interfaces, maybe libraries with RPA, where you have some ready-made activities where you just select the model and the document path.
If you don’t have resources for hard coding (not RPA style, not low code, but classical programming), you need to research the integrations, as this can make the entry barrier steep. Solutions like Document Understanding offer everything out of the box, and the experience is really, really nice even for the nontechnical users. I often refer to the training interface as a “coloring book” because the business user just selects pieces of text and data on a nice graphical user interface.
The role of HITL and confidence score
AK: What we used to do with those old-school OCRs is that if the model couldn’t handle a particular document, it was handing it over to manual handling for a human to book it. If you weren’t putting enough effort into maintenance, the model was able to understand less and less of your invoices. How does this work with Intelligent Document Processing?
TW: What you might want to look for is a solution that incorporates something called Human in the Loop (HITL). You don’t interrupt the process, but you put a human in the middle (not a robot or AI) who validates whether the data is okay. You can also run other items in parallel, so you don’t stop everything just for this one document to be validated.
The validation station is a graphical user interface where you get the document displayed, and you also get the values for the recognized document type and all the values for the fields. Then you get to check if the model did all right or not. Here comes the important aspect of confidence score. You need to accept that this is not your classical RPA. The neural net will give you some values (usually ranging from 0 to 100%) representing how confident the model is that a given value is actually the value you care for. You then get to decide if you’re comfortable with (say) that 85% confidence to run the next automation. Our usual approach would be to not be scared of heuristics. AI is great, but don’t be afraid to look up data or build some mapping tables (for example, finding that invoice number somewhere in the system first) and only then deciding to display the validation station to the human user.
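The heuristic-before-human approach above can be sketched as a small decision function. The threshold and the lookup table (standing in for an ERP or MRP search) are illustrative assumptions.

```python
# Decide what to do with an extracted field: accept on high confidence,
# try a cheap system lookup next, and only then involve the human.
THRESHOLD = 0.85
known_invoice_numbers = {"2024/001", "2024/002"}  # stand-in for an ERP lookup

def decide(field_value: str, confidence: float) -> str:
    if confidence >= THRESHOLD:
        return "auto_accept"
    if field_value in known_invoice_numbers:       # heuristic: found in the system
        return "auto_accept"
    return "send_to_validation_station"

print(decide("2024/001", 0.60))  # → auto_accept (found in the system)
print(decide("2O24/0O1", 0.60))  # → send_to_validation_station
print(decide("2024/003", 0.95))  # → auto_accept (high confidence)
```

The point is that a mapping table or lookup often resolves a low-confidence value for free, so the validation station only sees the genuinely ambiguous cases.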
The human vs. machine accuracy paradox
TW: The tool is probably going to do like 80% of your work, but we’re not there yet (maybe for very simple cases) where the pass-through without human touch is 100%. Still, for important processes, hardly anybody will leave everything to AI.
AK: We also have to remember that AI is not supposed to be perfect, but neither are humans. Our correctness KPIs were usually around 96% for invoice booking. This means that in 4% of the invoices booked by humans, you would find something done wrong. The work is monotonous, so you get distracted, you forget things. I do understand why businesses are hesitant to use non-deterministic models, but it’s not like humans are that much better at it.
TW: People don’t validate that thoroughly. Show me a person patient enough to run a 30-page PDF line by line against an MRP system. If you are a big company, you get thousands, tens of thousands of invoices every day. It’s just inhuman to tell somebody: you’ve got 10,000 pages and you need to compare them to some kind of database. You have to understand that both humans and machines will get things wrong, and you need your processes to be able to handle that.

Use cases: beyond invoices and high-volume documents
AK: So we started with invoices; this is something most companies start with when it comes to Intelligent Document Processing, because everybody gets invoices, and the bigger you are, the more invoices you get. But what are some other types of documents you could be processing with IDP where it actually still makes sense to use the technology?
TW: What I would suggest is to take those two basic functionalities of classification and extraction (understanding what document type it is and then fetching the data out of the document), and then brainstorm what you could be doing with those two functionalities in any process. Other examples we did that are more interesting (because just classifying, extracting, and typing into somewhere else is the most common thing) included recognizing whether a signature is present on the document.
Use Case 1: Recognizing signature presence
TW: One customer did not care about any sort of text or data. There were three fields on some delivery documents of sorts, and all they cared about was whether there were three yeses. These were stamps and handwritten signatures. With such stuff, you probably want to lean towards image recognition systems, because they’re specialized. Fortunately, the solution we use supports a specific field type for signatures, and the result can be just a Boolean (a logical value of yes or no). We used that, and it worked: 80-90% of the cases were read correctly.
Use Case 2: Data obfuscation (Anonymization)
TW: Another example was for a Lithuanian company: some government authorities requested documents from the company, and they were allowed to do so. On the other hand, the company needed to stay compliant with GDPR, so they couldn’t just hand over those documents. They needed to obfuscate (anonymize) the personal data of the employees, because these were HR documents or contracts. The solution was to use that extraction capability, but not to extract: to nail down where those personal values were, then use some third-party libraries (we used Python) to draw black rectangles on top of them, and then flatten the whole thing into a file that you cannot reverse engineer.
Use Case 3: HR document management and splitting
TW: And one more interesting case, again high volumes, a problematic one: HR documents. They had a lot of HR documents (contracts, gym membership cards, healthcare agreements). They wanted to digitize them, but they had put those piles of paper straight into a scanner, resulting in 100- or 150-page merged PDFs. If documents are merged, it’s a challenge to understand where document A ends and document B starts. These tools can also help with that, figuring out that one document ranges from here to here and splitting the file into chunks for us. The general goal was to later check whether an employee (e.g., John Smith) had certain documents signed (e.g., this contract and this agreement, but maybe the medical care form is missing).
AK: So what I understand you’re saying is: sometimes we start with simple documents, more on the structured side and high volume, so invoices, purchase orders, account statements. Then when it comes to HR administration, we do get a lot of paper, and a lot of companies are still in the process of digitizing all those archives. And then we have this box of stranger stuff, like, as you said, finding signatures or obfuscating data. That’s a more creative way of using the tool to handle whatever papers we may have.
Technical challenges and the “Digital Paper” paradox
AK: And this is one thing that amazes me: we are using PDFs, but PDF isn’t really a great format for data processing. PDF is still unstructured data (unless you have some hidden layers with the structure). In those 20 years (since I started manual invoice processing), we have moved from paper to what is basically digital paper, but it’s still paper, because it’s not structured data. We didn’t move to CSV files or SQL tables; we have just moved from one kind of paper to another.
TW: People often forget that PDF was originally designed for print (marketing copy, digital leaflets), but it quickly became the most popular format for basically anything. We shot ourselves in the foot by not going the extra mile to figure out better standardized formats. Right now, these tools basically help us fix a mistake from the past. If everything ran via an API or some EDI system, you wouldn’t need AI tools to recognize the data for you.

Centralized E-Invoicing and the future of IDP
AK: But that would be my next question, because there are already some countries in Europe that have centralized e-invoicing. In Poland, we’re still struggling with it; it has been postponed a few times, but maybe someday it will come. How do you think this will affect the need for IDP technologies?
TW: Right now, with what we’ve got, it’s hard to say. With invoices being the most popular document type, I would at least consider it and do some research on European Union or local government initiatives. I see improvement, with governments going digital. So I would at least do the research, so it doesn’t turn out that you spent half a year building a solution, and another half a year later everybody goes over to digital invoicing and is legally obliged to go that way.
AK: But the research needs to be thorough because from what I understand a lot of those invoicing systems only have the basic data from the invoice, and they do not have the tables with the details. So you still need to find a way to read those, plus, it’s just invoices, and we still process a lot of other document types.
TW: The pace of technology and the pace of adoption are two totally different things. IDP solutions were new 3 or 4 years ago, but people are still coming to us with the same old stuff, same problems, same PDFs. I guess it’s here to stay. Something that already runs and can already give you some profit is still going to be better than chasing the dream and not getting anything done.
Vendor choice: why UiPath Document Understanding
AK: There are a few major players on the market, and at Office Samurai we mostly use Document Understanding from UiPath. Can you tell me why we’re doing that, why we’re not using the other vendors?
TW: I guess our habits play a key role here; we already know the technology. The vendor is known for integrating well with the rest of its product portfolio. Another advantage is the implementation experience for business users. I often refer to the training as a “coloring book”: a business user just selects pieces of text and data. They offer pre-trained models that already come with some baseline efficiency. For the most common pre-trained fields for invoices, you can expect an average confidence score of 60-70%. If you’re already using the software for RPA, onboarding another tool within the same platform is very easy. They also tend to be mentioned as the leader in that technology.
The hybrid approach: IDP and Large Language Models (LLMs)
AK: We do get questions about Large Language Models (LLMs). A lot of companies get this idea about: “Why instead of intelligent document processing, why can’t we just put it in ChatGPT?”. What’s your take on that?
TW: The first thing I always mention is the repeatability. Doing high volumes would involve building a workflow. Second, documents that are transaction-oriented, like invoices, are not really that rich in text itself. LLMs are about understanding language and not about understanding structures. This is where dedicated IDP solutions still work better.
A very interesting area is building hybrids. You can take the output from the IDP solution and then ask the LLM to validate it. You could build an agent that has access to your ERP system, and it can look things up and verify that data. A big area is fuzzy matching: you can ask the agent to examine whether mismatched wording for a product is “really the same thing just expressed with a different sort of wording”. Another area where you can use LLMs is handwriting. Konrad, our tech lead, fed the OCR text back to the LLM along with the image, which had a lot of handwriting, and asked it to correct the mistakes in the handwriting. We got the best of both worlds.
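The fuzzy-matching idea can be sketched without any LLM at all: here `difflib`’s similarity ratio stands in for what an LLM or embedding comparison would do, and the 0.7 threshold is an illustrative assumption.

```python
# Check whether two product descriptions are "really the same thing
# just expressed with different wording" before escalating a mismatch.
from difflib import SequenceMatcher

def probably_same(a: str, b: str, threshold: float = 0.7) -> bool:
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return ratio >= threshold

print(probably_same("Laptop Dell Latitude 5540", "DELL latitude 5540 laptop"))  # → True
print(probably_same("Laptop Dell Latitude 5540", "Office chair, black"))        # → False
```

In a hybrid setup, a cheap check like this filters the obvious matches, and only the ambiguous pairs get sent to the LLM agent for a judgment call.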
The third factor is the confidence level. LLMs will not give you any sort of probability. Everybody has heard of hallucinations: when an LLM is wrong, it will tell you the hallucinated answer with 100% certainty. Hardly any executive will right now leave everything to AI without any human supervision. So, for now, we’re doing hybrids.
When automation hits the wall: the problematic documents
AK: Did you encounter any projects or types of documents this technology failed, or didn’t work like we wanted it to?
TW: Handwriting is still a thing, so don’t do handwriting; you can only expect it to work well when the handwriting is legible. Low-quality scans happen rarely now; if they do, I would ask whether it wouldn’t be cheaper just to replace the equipment. Signatures are also problematic, especially illegible ones; we only needed to check whether a signature was there or not.
But the thing that was actually the most problematic was nested tables. The tool can define two-dimensional X and Y structures, but documents often have a tendency to build one big table where, inside one column, you have the next table. It gets problematic to teach the model that. You need workarounds and post-processing of the data.
AK: This is our announcement to all the people who are designing invoices and other documents: if we want to replace people just retyping and copy-pasting from these documents…
TW: …let us stop doing those fancy layouts. Really, for automation, a text file is going to be better, just listing everything out.
AK: No nested tables, please.

Final thoughts: IDP or the apocalypse
AK: Well, someday maybe we will get to the world where intelligent document processing isn’t needed anymore because we exchange data with each other in a way that makes sense in the 21st century, not in the 20th century. But it’s going to be a long time.
TW: If you’re starting fresh as a new company, as a startup, think from the start about how you handle this. If you can incorporate some kind of digitized process where computers talk to each other via APIs or standardized formats like XML and JSON, it’s going to be better than just replacing paper with its digital version as a PDF.
AK: I think we can clearly see that there are loads of options on how to use intelligent document processing. And it’s not just about the savings.
TW: It’s also about it being harder and harder to find people who actually want to do this kind of job. Finding people who will do it long enough to make it worth teaching them is getting really hard. Those volumes are not going to get lower; every business needs to grow. So we have two options: we either implement intelligent document processing, or we just wait for the apocalypse.
AK: On that optimistic note, Tomasz, thank you so much for sharing your experience.
TW: First, let’s see where that message about not doing fancy PDFs gets us. To the people who are designing them: remember, in five years, we’re going to check up on you. Simplicity is beautiful. Thank you, everyone.
AK: All right folks, that’s it, another Office Samurai podcast sliced, diced, and served like your favorite sashimi platter. We did this episode in cooperation with UiPath, our first choice in process automation platforms. A huge thank you to you, our listeners, for tuning in, whether you’re on your commute, stuck in a meeting pretending to take notes, or just hiding from your inbox. We appreciate you, and we promise not to automate you yet. A huge thank you to our guest, Tomasz Wierzbicki, for showing us the path to victory over paperwork. And as always, a round of applause for Anna Cubal, our producer, who wrangles these episodes with more precision than an IDP model reading a perfectly formatted invoice. This was all recorded at the legendary Wodzu Beats Studio, where the coffee is strong, but our hatred for manual data entry is stronger. If you liked what you heard, tell your friends, and if you hated it, definitely tell your enemies. Remember to subscribe wherever you’re currently procrastinating – Spotify, Apple, or that weird podcast app your cousin recommended. And hey, if you loved it, hated it, or have suggestions for future episodes, or just want to argue about acronyms, don’t hesitate to reach out. Send us your feedback, your questions, or even your haikus about automation. We’re open-minded here at Samurai. Until next time, keep your blades sharp and your processes sharper.