143. Local LLM Inference with Ollama
Status: Accepted
Date: 2025-07-06
Context
Our system's AI features rely on Large Language Models (LLMs). The default way to access powerful LLMs is via cloud-based, commercial APIs (e.g., OpenAI, Anthropic, Google). However, relying exclusively on these external APIs has several drawbacks for development and certain use cases:
- Cost: API calls can be expensive, especially during development, experimentation, and testing, where a large volume of calls may be made.
- Latency: Network latency to external APIs can slow down the development feedback loop.
- Privacy: Sending data to third-party services may not be desirable for sensitive or proprietary information.
- Vendor Lock-in: Building directly against a specific vendor's API can create lock-in.
Decision
We will use Ollama as our primary tool for local LLM inference.
The 17_ollama Ansible role will install the Ollama server on our development machines and on powerful servers like hecate. This allows us to download and run a wide variety of open-source LLMs (e.g., the Llama and Mistral families) directly on our own hardware.
Our AI services (via the kaido-ollama library) will be configured to use the local Ollama endpoint for development, testing, and any production workloads that are suitable for the available local models.
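To make the decision concrete, here is a minimal sketch of calling a locally running Ollama server over its standard HTTP API (default port 11434). It is illustrative only, not the actual kaido-ollama implementation; the model name and prompt are placeholders, and it assumes the model has already been pulled.

```python
# Minimal sketch: non-streaming generation against a local Ollama server.
# Assumes Ollama is listening on its default port (11434) and the model
# has already been pulled (e.g. with `ollama pull llama3`).
import requests

OLLAMA_URL = "http://localhost:11434"  # local endpoint installed by the Ansible role


def generate(prompt: str, model: str = "llama3") -> str:
    """Send a single generation request to the local Ollama server and return the text."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]


if __name__ == "__main__":
    print(generate("Summarise the benefits of local LLM inference in one sentence."))
```

Because the call is a plain HTTP request to localhost, the same pattern works in tests and batch jobs without API keys or per-call charges.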
Consequences
Positive:
- Zero Marginal Cost for Inference: Once the hardware is paid for, running inference on local models incurs no per-call cost, regardless of volume. This is a major benefit for development, experimentation, and high-volume batch processing tasks.
- Low Latency: Inference calls are local network requests, resulting in very low latency and a fast, iterative development cycle.
- Data Privacy: All data remains within our own infrastructure, which is ideal for sensitive information.
- Flexibility: Gives us access to a huge and rapidly growing ecosystem of open-source models. We can experiment with different models and fine-tunes to find the best one for a specific task.
Negative:
- Lower Model Capability: The most powerful, state-of-the-art models are typically only available via commercial APIs. The open-source models available through Ollama, while very capable, may not match the performance of models like GPT-4 or Claude 3 Opus on the most complex reasoning tasks.
- Hardware Requirements: Running LLMs locally requires significant hardware resources (RAM, VRAM, and processing power), which means investing in powerful machines like hecate.
- Operational Overhead: We are now responsible for managing the Ollama server, downloading models, and keeping the hardware running correctly.
Mitigation:
- Hybrid Approach: We are not exclusively using local models. The decision is to use Ollama as our primary tool for local inference, not our only tool. We will continue to use external APIs for production tasks that require the highest level of model capability. Our AI libraries will be designed to switch between local and remote endpoints via configuration (see the sketch after this list).
- Investment in Hardware: We have made the necessary investment in hardware (hecate) to make local inference viable for a wide range of tasks.
- Simple Management: Ollama itself is a simple, low-maintenance tool. Its command-line interface makes downloading and managing models straightforward, so the operational overhead is minimal.
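As a sketch of the hybrid approach described above, the following hypothetical helper routes the same prompt either to the local Ollama endpoint or to a commercial API based on configuration. The environment variable names (LLM_BACKEND, OLLAMA_URL, OPENAI_MODEL, etc.) are illustrative assumptions, not the real kaido-ollama configuration.

```python
# Hypothetical configuration-driven backend selection (illustrative names only).
# LLM_BACKEND chooses the route: "ollama" (local, default) or "openai" (remote).
import os

import requests


def chat(prompt: str) -> str:
    backend = os.environ.get("LLM_BACKEND", "ollama")
    if backend == "ollama":
        # Local path: no per-call cost, low latency, data stays on our hardware.
        resp = requests.post(
            os.environ.get("OLLAMA_URL", "http://localhost:11434") + "/api/generate",
            json={
                "model": os.environ.get("OLLAMA_MODEL", "llama3"),
                "prompt": prompt,
                "stream": False,
            },
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["response"]
    if backend == "openai":
        # Remote path: reserved for tasks that need the highest model capability.
        resp = requests.post(
            "https://api.openai.com/v1/chat/completions",
            headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
            json={
                "model": os.environ.get("OPENAI_MODEL", "gpt-4o"),
                "messages": [{"role": "user", "content": prompt}],
            },
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]
    raise ValueError(f"Unknown LLM_BACKEND: {backend}")
```

Switching backends then becomes an environment change rather than a code change, which keeps the vendor lock-in concern from the Context section contained to configuration.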