
AI Document Processing

Azure Document Intelligence

TIP

We can also self-host Azure Document Intelligence as Docker Containers - see Container Installation

Klippa

Docs: https://dochorizon.klippa.com/docs/api/getting-started

We tried using the Klippa OpenAPI spec to generate an API client, but the generated source code was unusable because of duplicated names in their schema.

Custom Models

Conceptual trial and error - just some CLI examples

We are using a combination of tesseract and ollama to process documents locally.

First, read the raw text layer from the file:

```shell
pdfium text <document>.pdf
```

Then render the document pages as images:

```shell
pdfium render <document>.pdf ./page_%d.jpg
```

Then run OCR on the page images with tesseract (`-` as the output base writes the recognized text to stdout):

```shell
tesseract page_1.jpg - -l deu
```
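The CLI steps above can be stitched together from Python. This is only a sketch: the helper names are ours, and it assumes the `pdfium` (pdfium-cli) and `tesseract` binaries are on `PATH` and that pages are rendered to `./page_%d.jpg` as shown above.

```python
import subprocess
from pathlib import Path


def render_cmd(pdf: str, out_pattern: str = "./page_%d.jpg") -> list[str]:
    """Build the pdfium-cli command that renders PDF pages to images."""
    return ["pdfium", "render", pdf, out_pattern]


def ocr_cmd(image: str, lang: str = "deu") -> list[str]:
    """Build the tesseract command; '-' as output base prints text to stdout."""
    return ["tesseract", image, "-", "-l", lang]


def ocr_document(pdf: str) -> str:
    """Render all pages, OCR each rendered image, and join the text."""
    subprocess.run(render_cmd(pdf), check=True)
    texts = []
    for image in sorted(Path(".").glob("page_*.jpg")):
        result = subprocess.run(
            ocr_cmd(str(image)), capture_output=True, text=True, check=True
        )
        texts.append(result.stdout)
    return "\n".join(texts)
```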

Then pass the output to llama3.2-vision using the following API request. Note that for Ollama's `/api/chat`, the `images` array belongs on a message object, not at the top level of the request:

```http
POST http://localhost:11434/api/chat
Content-Type: application/json

{
  "model": "llama3.2-vision",
  "stream": false,
  "messages": [
    {
      "role": "user",
      "content": "The document contains an invoice. Extract invoiceNumber, sender, recipient and sum according to the schema as JSON."
    },
    {
      "role": "user",
      "content": "Document Data: <data>",
      "images": ["<base64 encoded image>"]
    }
  ],
  "format": {
    "type": "object",
    "properties": {
      "invoiceNumber": { "type": "string" },
      "recipient": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "address": {
            "type": "object",
            "properties": {
              "addressLine1": { "type": "string" },
              "addressLine2": { "type": "string" },
              "city": { "type": "string" },
              "zip": { "type": "string" }
            }
          }
        }
      },
      "sender": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "address": {
            "type": "object",
            "properties": {
              "addressLine1": { "type": "string" },
              "addressLine2": { "type": "string" },
              "city": { "type": "string" },
              "zip": { "type": "string" }
            }
          }
        }
      },
      "sum": {
        "type": "object",
        "properties": {
          "gross": { "type": "number" },
          "net": { "type": "number" }
        },
        "required": ["net", "gross"]
      }
    },
    "required": ["invoiceNumber", "sender", "recipient", "sum"]
  }
}
```
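The same request can be issued from Python with the standard library alone. A sketch, assuming an Ollama server on `localhost:11434`; the schema is abbreviated here to `invoiceNumber` and `sum`, and the helper names are ours:

```python
import json
from urllib import request

# Abbreviated structured-output schema for illustration.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoiceNumber": {"type": "string"},
        "sum": {
            "type": "object",
            "properties": {
                "gross": {"type": "number"},
                "net": {"type": "number"},
            },
            "required": ["net", "gross"],
        },
    },
    "required": ["invoiceNumber", "sum"],
}


def build_payload(ocr_text: str, image_b64: str) -> dict:
    """Assemble the /api/chat body; 'images' sits on the message object."""
    return {
        "model": "llama3.2-vision",
        "stream": False,
        "messages": [
            {
                "role": "user",
                "content": "The document contains an invoice. Extract "
                "invoiceNumber and sum according to the schema as JSON.",
            },
            {
                "role": "user",
                "content": f"Document Data: {ocr_text}",
                "images": [image_b64],
            },
        ],
        "format": INVOICE_SCHEMA,
    }


def extract_invoice(ocr_text: str, image_b64: str) -> dict:
    """POST to Ollama and parse the schema-constrained JSON answer."""
    body = json.dumps(build_payload(ocr_text, image_b64)).encode()
    req = request.Request(
        "http://localhost:11434/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        reply = json.load(resp)
    # With "format" set, message.content is a JSON string matching the schema.
    return json.loads(reply["message"]["content"])
```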

Tesseract can also produce output in the ALTO format. We could try to use this output and feed it into an LLM together with a JSON schema for the output.
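ALTO is XML, so word positions can be recovered with the standard library before handing text to the LLM. A sketch, assuming `tesseract page_1.jpg page_1 alto` has written `page_1.xml` and that tesseract emits the ALTO v3 namespace (recent versions do):

```python
import xml.etree.ElementTree as ET

# Namespace used by tesseract's ALTO renderer (assumption: v3).
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v3#"}


def alto_words(xml_text: str) -> list[tuple[str, int, int]]:
    """Return (word, x, y) triples from an ALTO document string."""
    root = ET.fromstring(xml_text)
    words = []
    for s in root.iterfind(".//alto:String", ALTO_NS):
        words.append((s.get("CONTENT"), int(s.get("HPOS")), int(s.get("VPOS"))))
    return words
```

Keeping the coordinates around could later help with table or line-item extraction, where plain text loses the layout.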

Current state of an (OpenAI-compatible) JSON schema:

```json
{
  "name": "de_billvault_invoice",
  "strict": false,
  "schema": {
    "type": "object",
    "additionalProperties": false,
    "properties": {
      "invoiceNumber": { "type": "string" },
      "recipient": {
        "type": "object",
        "additionalProperties": false,
        "properties": {
          "name": { "type": "string" },
          "address": {
            "type": "object",
            "additionalProperties": false,
            "properties": {
              "addressLine1": { "type": "string" },
              "addressLine2": { "type": "string" },
              "city": { "type": "string" },
              "zip": { "type": "string" }
            }
          }
        }
      },
      "sender": {
        "type": "object",
        "additionalProperties": false,
        "properties": {
          "name": { "type": "string" },
          "address": {
            "type": "object",
            "additionalProperties": false,
            "properties": {
              "addressLine1": { "type": "string" },
              "addressLine2": { "type": "string" },
              "city": { "type": "string" },
              "zip": { "type": "string" }
            }
          }
        }
      },
      "lines": {
        "type": "array",
        "items": {
          "type": "object",
          "required": ["description"],
          "additionalProperties": false,
          "properties": {
            "description": { "type": "string" },
            "quantity": { "type": "number" },
            "amount": {
              "type": "object",
              "additionalProperties": false,
              "properties": {
                "gross": { "type": "number" },
                "net": { "type": "number" },
                "taxRate": { "type": "number" }
              }
            }
          }
        }
      },
      "sum": {
        "type": "object",
        "additionalProperties": false,
        "properties": {
          "gross": { "type": "number" },
          "net": { "type": "number" }
        },
        "required": ["net", "gross"]
      }
    },
    "required": ["invoiceNumber", "sender", "recipient", "sum", "lines"]
  }
}
```

LangChain based custom processing system

While researching other options, LangChain came to mind. For a future independent document-analysis pipeline, we would first extract data from the documents (either in ALTO format or as plain text) using tesseract and feed this data (converted to text) into a locally running LLM (e.g. Llama 3.3 from Meta AI on Hugging Face) via a local pipeline.

For orchestration and job management, we will create a REST API (something like this) using FastAPI and Celery.

During research we also discovered Poetry (dependency management) and Black (code formatting); both could help structure development.

Semantic search on documents

As we are essentially building a mix of a bookkeeping and a document management solution, a semantic search feature comes to mind. We could create embeddings from the structured document data and use RAG to feed the documents matching a query into llama3.3 for a response.
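The retrieval half of that idea can be sketched with the standard library: embed each document once, then rank documents by cosine similarity to the query embedding and hand the top hits to the LLM. Assumptions here: Ollama's `/api/embeddings` endpoint on `localhost:11434` and `nomic-embed-text` as the embedding model; the helper names are ours.

```python
import json
import math
from urllib import request


def embed(text: str, model: str = "nomic-embed-text") -> list[float]:
    """Fetch an embedding vector for `text` from a local Ollama server."""
    body = json.dumps({"model": model, "prompt": text}).encode()
    req = request.Request(
        "http://localhost:11434/api/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["embedding"]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict, k: int = 3) -> list[str]:
    """Return the ids of the k documents most similar to the query."""
    return sorted(
        doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True
    )[:k]
```

For more than a handful of documents, the brute-force ranking would be replaced by a vector store, but the retrieval contract stays the same.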