AKTA — under the hood

Published: May 21, 2026

Technical companion to the AKTA project: architecture, stack, engineering principles, and the key ADRs. For readers who want to know how it's built.

The big picture

AKTA consists of three services, with clearly divided responsibilities.

Core: the source of truth. Documents, metadata, search, question-answering, audit log. A Spring Boot application in Java that speaks REST to the outside world and talks to Postgres internally.
File-Manager: the only service allowed to see the Synology filesystem. It detects new files in the inbox folder, shepherds them through the pipeline, and archives them at the end. Also Spring Boot.
Processing: everything content-related. It extracts PDF text, falls back to OCR for scans, then uses the local LLM for metadata suggestions, deadlines, and answers to free-form questions. Python with FastAPI, because the OCR and LLM libraries are simply more at home there.

On top of all that, a React UI where I review documents, correct them, search, and ask questions of my archive. The split into services is functional, not something the user feels.

Tech stack

Backend: Java with Spring Boot; Python with FastAPI
Database: Postgres, schema versioning via Flyway
Search and question-answering: Postgres full-text search with a custom configuration for household paperwork (umlaut tolerance, synonyms), combined with vector embeddings via pgvector. A RAG pipeline for free-form questions runs on the same foundation: the top hits from hybrid search are condensed into an answer by an LLM call, with mandatory citations from the underlying documents.
LLM & OCR: Ollama running locally with an open-source language model, Tesseract for scanned documents
Operations: Docker Compose, container images from a self-hosted Harbor registry, CI/CD via Gitea Actions with vulnerability scanning before every deploy
Tests: JUnit with Testcontainers for the JVM side, pytest for Python, Playwright for end-to-end
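
The "mandatory citations" part of the RAG pipeline can be sketched roughly as follows. This is an illustration under assumptions, not AKTA's actual code: the function names, the [doc:<id>] marker format, and the prompt wording are all hypothetical. The idea is that the prompt labels every snippet with its document ID, and the answer is only accepted if it cites at least one ID and every cited ID was really in the context.

```python
import re

# Hypothetical citation format: the prompt asks the model to cite as [doc:<id>].
CITATION = re.compile(r"\[doc:(\d+)\]")
# Refusal text modelled on the "I don't know from the files" principle.
REFUSAL = "I don't know from the files."


def build_prompt(question: str, hits: list[tuple[int, str]]) -> str:
    """Assemble an LLM prompt from (document_id, snippet) search hits."""
    context = "\n".join(f"[doc:{doc_id}] {snippet}" for doc_id, snippet in hits)
    return (
        "Answer using only the context below. "
        "Cite every statement as [doc:<id>]. "
        f"If the context does not contain the answer, reply exactly: {REFUSAL}\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def enforce_citations(answer: str, hits: list[tuple[int, str]]) -> tuple[str, list[int]]:
    """Return (answer, cited_ids), or the refusal if citations are missing or unknown."""
    known = {doc_id for doc_id, _ in hits}
    cited = [int(m) for m in CITATION.findall(answer)]
    # No source, no answer: reject uncited answers and IDs that were never in the context.
    if not cited or any(doc_id not in known for doc_id in cited):
        return REFUSAL, []
    # De-duplicate while keeping the order in which sources were cited.
    return answer, list(dict.fromkeys(cited))
```

Checking the model's output after the call matters because a prompt can only ask for citations; it cannot guarantee them. The returned ID list is what a UI could turn into clickable sources.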

Engineering principles

A handful of rules I've held myself to:

Suggestion, not automation. The LLM gets to guess, the human decides. Sounds banal but it changes where validation sits and how the UI looks.
Findability beats perfect filing. If I find a document in five seconds, I don't care whether the category was a hundred percent precise.
Sources, not inventions. Answers to free-form questions come from the documents, not from the LLM. The prompt explicitly demands source IDs from the context; no source, no answer. Better "I don't know from the files" than a fabricated date.
One source of truth. Metadata lives in Postgres. It is not duplicated in the search index, a cache, or the filesystem. That way caches really are just caches.
Ports and adapters across all services. Each service has clear functional ports; the technical adapters (HTTP, JDBC, Ollama, filesystem) sit on the outside. Swapping Ollama for something else would be an adapter swap, not a refactor.
Untrusted in, sanitised out. I treat OCR text and LLM output like user input — as potentially malicious. Lengths get capped, content gets checked before it's stored.
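
To make "untrusted in, sanitised out" concrete, here is a minimal sketch of what a sanitising step for an LLM-suggested title might look like. The function name and the cap of 120 characters are assumptions for illustration, not values from the project.

```python
import unicodedata

MAX_TITLE_LEN = 120  # assumed cap, not AKTA's real limit


def sanitise_suggestion(raw: str, max_len: int = MAX_TITLE_LEN) -> str:
    """Treat OCR or LLM text like user input: normalise, strip, collapse, cap."""
    text = unicodedata.normalize("NFC", raw)
    cleaned = []
    for ch in text:
        if ch.isspace():
            # Newlines, tabs and the like become plain spaces.
            cleaned.append(" ")
        elif not unicodedata.category(ch).startswith("C"):
            # Keep everything except control/format characters (Unicode "C*" categories).
            cleaned.append(ch)
    # Collapse runs of whitespace, then cap the length.
    collapsed = " ".join("".join(cleaned).split())
    return collapsed[:max_len].rstrip()
```

Running the same kind of function in two places (once where the suggestion is produced, once where it is stored) is cheap, and it means neither side has to trust the other.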

Key decisions (ADRs)

The ADRs live in the repo. A few highlights:

Content hash as an optional unique column. Documents scanned twice get caught by the SHA-256 of their content. To keep older rows legal, it runs as a partial unique index — new rows without a hash are forbidden, old ones may remain.
LLM output is untrusted. The Processing service caps title and category suggestions before they reach Core. Core sanitises a second time. Two checkpoints, each independent of the other.
Extract deadlines, remind about deadlines. A second LLM call pulls a date out of the document text; it is stored idempotently, with marker columns for the three reminder stages. The source ("from the LLM" vs "from the human") is part of the model.
Prompts in the database. In an earlier iteration the prompts lived as files inside the container, with a bind mount for live edits from the UI. Every deploy that ran rsync --delete overwrote them. The clean fix: prompts are data, not code.
Hybrid search via Reciprocal Rank Fusion. Classical full-text search and vector similarity produce separate hit lists. RRF combines them without one side steamrolling the other — and the FTS fallback keeps search alive when the embeddings are down.
RAG pipeline with mandatory citations. For free-form questions ("when does the home contents insurance expire") the hybrid search delivers the top-K hits, whose snippets become the context of an LLM call. The prompt forces source IDs into the answer and forbids any statement without context backing. The frontend has a "Questions" page where the answer and clickable sources sit side by side — hallucination protection as an acceptance criterion, not a nice-to-have.
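
For readers who haven't met Reciprocal Rank Fusion: each document's score is the sum of 1 / (k + rank) over every hit list it appears in, so only positions matter and the raw scores of the two searches never have to be made comparable. A minimal sketch follows; the function name is hypothetical, and k = 60 is the commonly used default rather than a value confirmed for AKTA.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Fuse ranked lists of document IDs into one list, best first."""
    scores: dict[int, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit of each list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


full_text_hits = [1, 2, 3]  # made-up document IDs from full-text search
vector_hits = [3, 1, 4]     # made-up document IDs from vector similarity
fused = reciprocal_rank_fusion([full_text_hits, vector_hits])
```

If the vector list is empty because the embeddings are unavailable, the fused result is simply the full-text order, which is the fallback behaviour described above.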

What I'd do differently today

Write the prompt editor against the database from day one. The file-based route was half an hour of setup and three weeks of pain.
Formalise the LLM trust model in the very first sprint. Adding length caps only later cost me exactly one payment reminder categorised as "Newsletter".
Take container vulnerability scans seriously before the first production deploy. A week of image hardening after the fact is more painful than two days up front.