AKTA — under the hood

Technical companion to the AKTA project: architecture, stack, engineering principles, and the key ADRs. For readers who want to know how it's built.

The big picture

AKTA consists of three services, with clearly divided responsibilities.

  • Core: the source of truth. Documents, metadata, search, question-answering, audit log. A Spring Boot application in Java that exposes a REST API and keeps its state in Postgres.
  • File-Manager: the only service allowed to see the Synology filesystem. It detects new files in the inbox folder, shepherds them through the pipeline, and archives them at the end. Also Spring Boot.
  • Processing: everything content-related. It extracts PDF text, with OCR as the fallback, then calls the local LLM for metadata suggestions, deadlines, and answers to free-form questions. Python with FastAPI, because the OCR and LLM libraries are simply more at home there.

On top of all that sits a React UI where I review and correct documents, search, and ask questions of my archive. The split into services is functional, not something the user feels.

Tech stack

  • Backend: Java with Spring Boot; Python with FastAPI
  • Database: Postgres, schema versioning via Flyway
  • Search and question-answering: Postgres full-text search with a custom configuration for household paperwork (umlaut tolerance, synonyms), combined with vector embeddings via pgvector. A RAG pipeline for free-form questions runs on the same foundation: an LLM call condenses the top hits from hybrid search into an answer, with mandatory citations from the underlying documents.
  • LLM & OCR: Ollama running locally with an open-source language model, Tesseract for scanned documents
  • Operations: Docker Compose, container images from a self-hosted Harbor registry, CI/CD via Gitea Actions with vulnerability scanning before every deploy
  • Tests: JUnit with Testcontainers for the JVM side, pytest for Python, Playwright for end-to-end
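
To make the "mandatory citations" part of the RAG pipeline concrete, here is a minimal Python sketch of how it could be wired. Everything specific in it is an assumption for illustration: the `Hit` type, the `DOC-<n>` ID format, the square-bracket citation syntax, and the function names are hypothetical, not taken from the AKTA code.

```python
import re
from dataclasses import dataclass


# Hypothetical shape of a search hit; the real AKTA types are not shown in this post.
@dataclass(frozen=True)
class Hit:
    doc_id: str   # assumed ID format, e.g. "DOC-12"
    snippet: str


REFUSAL = "I don't know from the files."
CITATION = re.compile(r"\[(DOC-\d+)\]")  # assumed citation syntax


def build_prompt(question: str, hits: list[Hit]) -> str:
    """Put the top hits into the prompt, each tagged with its source ID."""
    context = "\n".join(f"[{h.doc_id}] {h.snippet}" for h in hits)
    return (
        "Answer only from the context below. Cite the source ID in square "
        "brackets after every statement. If the context does not contain "
        f"the answer, reply exactly: {REFUSAL}\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def enforce_citations(answer: str, hits: list[Hit]) -> tuple[str, list[str]]:
    """Return (answer, cited_ids), or the refusal if the citations do not hold up."""
    allowed = {h.doc_id for h in hits}
    cited = CITATION.findall(answer)
    # No source, no answer: reject missing citations and IDs that were never in the context.
    if not cited or any(c not in allowed for c in cited):
        return REFUSAL, []
    # Deduplicate while keeping first-seen order.
    return answer, list(dict.fromkeys(cited))
```

The point of the second function is that the prompt alone is not trusted to enforce the rule: the answer is checked after the LLM call, and anything without a verifiable source ID is replaced by the refusal.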

Engineering principles

A handful of rules I've held myself to:

  • Suggestion, not automation. The LLM gets to guess, the human decides. Sounds banal but it changes where validation sits and how the UI looks.
  • Findability beats perfect filing. If I find a document in five seconds, I don't care whether the category was a hundred percent precise.
  • Sources, not inventions. Answers to free-form questions come from the documents, not from the LLM. The prompt explicitly demands source IDs from the context; no source, no answer. Better "I don't know from the files" than a fabricated date.
  • One source of truth. Metadata lives in Postgres, not in parallel copies in the search index, a cache, or the filesystem. That way caches really are just caches.
  • Ports and adapters across all services. Each service has clear functional ports; the technical adapters (HTTP, JDBC, Ollama, filesystem) sit on the outside. Swapping Ollama for something else would be an adapter swap, not a refactor.
  • Untrusted in, sanitised out. I treat OCR text and LLM output like user input — as potentially malicious. Lengths get capped, content gets checked before it's stored.
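
As an illustration of "untrusted in, sanitised out", a sanitising step for a title or category suggestion could look roughly like the sketch below. The cap of 120 characters and the function name are assumptions; the post does not state the real limits or checks.

```python
import unicodedata

MAX_TITLE_LEN = 120  # assumed cap; the real limits are not given in this post


def sanitise(text: str, max_len: int = MAX_TITLE_LEN) -> str:
    """Normalise, strip control characters, collapse whitespace, cap the length."""
    # NFC keeps umlauts as single code points instead of letter + combining mark.
    text = unicodedata.normalize("NFC", text)
    # Whitespace (including newlines and tabs) becomes a plain space;
    # every other control or format character (Unicode category C*) is dropped.
    cleaned = "".join(
        " " if ch.isspace() else ch
        for ch in text
        if ch.isspace() or not unicodedata.category(ch).startswith("C")
    )
    # Collapse runs of spaces and trim both ends.
    cleaned = " ".join(cleaned.split())
    return cleaned[:max_len].rstrip()
```

For example, `sanitise("  Invoice\x00 2024\n\tQ1 ")` yields `"Invoice 2024 Q1"`. Running the same kind of function in both Processing and Core keeps the two stations independent of each other.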

Key decisions (ADRs)

The ADRs live in the repo. A few highlights:

  • Content hash as an optional unique column. Documents scanned twice get caught by the SHA-256 of their content. To keep older rows legal, uniqueness is enforced by a partial unique index that covers only rows with a hash. New rows must carry a hash; old rows without one may remain.
  • LLM output is untrusted. The Processing service caps title and category suggestions before they reach Core. Core sanitises a second time. Two stations, both independent.
  • Extract deadlines, remind on deadlines. A second LLM call pulls a date out of the document text, stored idempotently with marker columns for the three reminder stages. The source ("from the LLM" vs "from the human") is part of the model.
  • Prompts in the database. In an earlier iteration the prompts were files inside the container, with a bind mount for live edits from the UI. Every deploy with rsync --delete overwrote them. The clean fix: prompts are data, not code.
  • Hybrid search via Reciprocal Rank Fusion. Classical full-text search and vector similarity produce separate hit lists. RRF combines them without one side steamrolling the other — and the FTS fallback keeps search alive when the embeddings are down.
  • RAG pipeline with mandatory citations. For free-form questions ("when does the home contents insurance expire") the hybrid search delivers the top-K hits, whose snippets become the context of an LLM call. The prompt forces source IDs into the answer and forbids any statement without context backing. The frontend has a "Questions" page where the answer and clickable sources sit side by side — hallucination protection as an acceptance criterion, not a nice-to-have.
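
The Reciprocal Rank Fusion step from the hybrid search ADR can be sketched in a few lines of Python. This is a generic version of the technique, not AKTA's actual code: the function names are hypothetical, and `k = 60` is the constant commonly used for RRF, not a value confirmed by this post.

```python
from collections import defaultdict
from typing import Optional


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


def hybrid_search(fts_hits: list[str], vector_hits: Optional[list[str]]) -> list[str]:
    """Fuse both hit lists; use full-text only when the embedding side is down."""
    lists = [fts_hits] if vector_hits is None else [fts_hits, vector_hits]
    return reciprocal_rank_fusion(lists)
```

Because only ranks enter the score, the raw full-text and vector similarity scores never have to be put on a common scale, which is why neither side can steamroll the other. With `fts_hits = ["a", "b", "c"]` and `vector_hits = ["c", "a", "d"]` the fused order is `["a", "c", "b", "d"]`; with `vector_hits = None` the full-text order is returned unchanged.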

What I'd do differently today

  • Write the prompt editor against the database from day one. The file path was half an hour of setup and three weeks of pain.
  • Formalise the LLM trust model in the very first sprint. Introducing length caps only later cost me exactly one payment reminder that ended up categorised as "Newsletter".
  • Take container vulnerability scans seriously before the first production deploy. A week of image hardening after the fact hurts more than two days up front.