
AI Document Processing

Azure Document Intelligence

TIP

We can also self-host Azure Document Intelligence as Docker Containers - see Container Installation

Klippa

Docs: https://dochorizon.klippa.com/docs/api/getting-started

We tried using the Klippa OpenAPI spec to generate an API client, but the generated source code was unusable because of duplicated names in their schema.

Custom Models

Conceptual trial and error - just some CLI examples

We are using a combination of tesseract and ollama to process documents locally.

First, read the raw text layer from the file:

```shell
pdfium text <document>.pdf
```

Then render the document pages as images:

```shell
pdfium render <document>.pdf ./page_%d.jpg
```

Then run OCR on the page images with tesseract (`-` as the output base writes the recognized text to stdout):

```shell
tesseract page_1.jpg - -l deu
```
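The CLI steps above can be stitched together from Python. This is only a sketch: the helper names are ours, and it assumes the `pdfium` (pdfium-cli) and `tesseract` binaries are on `PATH` and that pages are rendered to `./page_%d.jpg` as shown above.

```python
import subprocess
from pathlib import Path


def render_cmd(pdf: str, out_pattern: str = "./page_%d.jpg") -> list[str]:
    """Build the pdfium-cli command that renders PDF pages to images."""
    return ["pdfium", "render", pdf, out_pattern]


def ocr_cmd(image: str, lang: str = "deu") -> list[str]:
    """Build the tesseract command; '-' as output base prints text to stdout."""
    return ["tesseract", image, "-", "-l", lang]


def ocr_document(pdf: str) -> str:
    """Render all pages, OCR each rendered image, and join the text."""
    subprocess.run(render_cmd(pdf), check=True)
    texts = []
    for image in sorted(Path(".").glob("page_*.jpg")):
        result = subprocess.run(
            ocr_cmd(str(image)), capture_output=True, text=True, check=True
        )
        texts.append(result.stdout)
    return "\n".join(texts)
```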

Then pass the output to llama3.2-vision using the following API request. Note that for Ollama's `/api/chat`, the `images` array belongs on a message object, not at the top level of the request:

```http
POST http://localhost:11434/api/chat
Content-Type: application/json

{
  "model": "llama3.2-vision",
  "stream": false,
  "messages": [
    {
      "role": "user",
      "content": "The document contains an invoice. Extract invoiceNumber, sender, recipient and sum according to the schema as JSON."
    },
    {
      "role": "user",
      "content": "Document Data: <data>",
      "images": ["<base64 encoded image>"]
    }
  ],
  "format": {
    "type": "object",
    "properties": {
      "invoiceNumber": { "type": "string" },
      "recipient": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "address": {
            "type": "object",
            "properties": {
              "addressLine1": { "type": "string" },
              "addressLine2": { "type": "string" },
              "city": { "type": "string" },
              "zip": { "type": "string" }
            }
          }
        }
      },
      "sender": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "address": {
            "type": "object",
            "properties": {
              "addressLine1": { "type": "string" },
              "addressLine2": { "type": "string" },
              "city": { "type": "string" },
              "zip": { "type": "string" }
            }
          }
        }
      },
      "sum": {
        "type": "object",
        "properties": {
          "gross": { "type": "number" },
          "net": { "type": "number" }
        },
        "required": ["net", "gross"]
      }
    },
    "required": ["invoiceNumber", "sender", "recipient", "sum"]
  }
}
```
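The same request can be issued from Python with the standard library alone. A sketch, assuming an Ollama server on `localhost:11434`; the schema is abbreviated here to `invoiceNumber` and `sum`, and the helper names are ours:

```python
import json
from urllib import request

# Abbreviated structured-output schema for illustration.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoiceNumber": {"type": "string"},
        "sum": {
            "type": "object",
            "properties": {
                "gross": {"type": "number"},
                "net": {"type": "number"},
            },
            "required": ["net", "gross"],
        },
    },
    "required": ["invoiceNumber", "sum"],
}


def build_payload(ocr_text: str, image_b64: str) -> dict:
    """Assemble the /api/chat body; 'images' sits on the message object."""
    return {
        "model": "llama3.2-vision",
        "stream": False,
        "messages": [
            {
                "role": "user",
                "content": "The document contains an invoice. Extract "
                "invoiceNumber and sum according to the schema as JSON.",
            },
            {
                "role": "user",
                "content": f"Document Data: {ocr_text}",
                "images": [image_b64],
            },
        ],
        "format": INVOICE_SCHEMA,
    }


def extract_invoice(ocr_text: str, image_b64: str) -> dict:
    """POST to Ollama and parse the schema-constrained JSON answer."""
    body = json.dumps(build_payload(ocr_text, image_b64)).encode()
    req = request.Request(
        "http://localhost:11434/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        reply = json.load(resp)
    # With "format" set, message.content is a JSON string matching the schema.
    return json.loads(reply["message"]["content"])
```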

Tesseract can also produce output in the ALTO format. We could try to use this output and feed it into an LLM together with a JSON schema for the output.
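ALTO is XML, so word positions can be recovered with the standard library before handing text to the LLM. A sketch, assuming `tesseract page_1.jpg page_1 alto` has written `page_1.xml` and that tesseract emits the ALTO v3 namespace (recent versions do):

```python
import xml.etree.ElementTree as ET

# Namespace used by tesseract's ALTO renderer (assumption: v3).
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v3#"}


def alto_words(xml_text: str) -> list[tuple[str, int, int]]:
    """Return (word, x, y) triples from an ALTO document string."""
    root = ET.fromstring(xml_text)
    words = []
    for s in root.iterfind(".//alto:String", ALTO_NS):
        words.append((s.get("CONTENT"), int(s.get("HPOS")), int(s.get("VPOS"))))
    return words
```

Keeping the coordinates around could later help with table or line-item extraction, where plain text loses the layout.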

Current state of an (OpenAI-compatible) JSON schema:

```json
{
  "name": "de_billvault_invoice",
  "strict": false,
  "schema": {
    "type": "object",
    "additionalProperties": false,
    "properties": {
      "invoiceNumber": { "type": "string" },
      "recipient": {
        "type": "object",
        "additionalProperties": false,
        "properties": {
          "name": { "type": "string" },
          "address": {
            "type": "object",
            "additionalProperties": false,
            "properties": {
              "addressLine1": { "type": "string" },
              "addressLine2": { "type": "string" },
              "city": { "type": "string" },
              "zip": { "type": "string" }
            }
          }
        }
      },
      "sender": {
        "type": "object",
        "additionalProperties": false,
        "properties": {
          "name": { "type": "string" },
          "address": {
            "type": "object",
            "additionalProperties": false,
            "properties": {
              "addressLine1": { "type": "string" },
              "addressLine2": { "type": "string" },
              "city": { "type": "string" },
              "zip": { "type": "string" }
            }
          }
        }
      },
      "lines": {
        "type": "array",
        "items": {
          "type": "object",
          "required": ["description"],
          "additionalProperties": false,
          "properties": {
            "description": { "type": "string" },
            "quantity": { "type": "number" },
            "amount": {
              "type": "object",
              "additionalProperties": false,
              "properties": {
                "gross": { "type": "number" },
                "net": { "type": "number" },
                "taxRate": { "type": "number" }
              }
            }
          }
        }
      },
      "sum": {
        "type": "object",
        "additionalProperties": false,
        "properties": {
          "gross": { "type": "number" },
          "net": { "type": "number" }
        },
        "required": ["net", "gross"]
      }
    },
    "required": ["invoiceNumber", "sender", "recipient", "sum", "lines"]
  }
}
```

LangChain based custom processing system

While researching other options, LangChain came to mind. For a future independent document-analysis pipeline, we would first extract data from the documents (either in ALTO format or as plain text) using tesseract and feed this data (converted to text) into a locally running LLM (e.g. Llama 3.3 from Meta AI on Hugging Face) via a local pipeline.

For orchestration and job management, we will create a REST API (something like this) using FastAPI and Celery.

During research we also discovered Poetry (dependency management) and Black (code formatting); both could help structure development.

Semantic search on documents

As we are essentially building a mix of a bookkeeping and a document management solution, a semantic search feature comes to mind. We could create embeddings from the structured document data and use RAG to feed the documents matching a query into llama3.3 for a response.
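The retrieval half of that idea can be sketched with the standard library: embed each document once, then rank documents by cosine similarity to the query embedding and hand the top hits to the LLM. Assumptions here: Ollama's `/api/embeddings` endpoint on `localhost:11434` and `nomic-embed-text` as the embedding model; the helper names are ours.

```python
import json
import math
from urllib import request


def embed(text: str, model: str = "nomic-embed-text") -> list[float]:
    """Fetch an embedding vector for `text` from a local Ollama server."""
    body = json.dumps({"model": model, "prompt": text}).encode()
    req = request.Request(
        "http://localhost:11434/api/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["embedding"]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict, k: int = 3) -> list[str]:
    """Return the ids of the k documents most similar to the query."""
    return sorted(
        doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True
    )[:k]
```

For more than a handful of documents, the brute-force ranking would be replaced by a vector store, but the retrieval contract stays the same.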