17. Mar 2026

Local AI for protecting sensitive information

Running large language models (LLMs) locally makes it possible to use AI even with sensitive information without disclosing it. With the right equipment, it is even possible to deploy coding agents while keeping the source code secure. Axel Burghof explains how this works and what requirements are necessary.

Author

Axel Burghof

A person wearing a headset sits in front of two monitors displaying code; an overlay reads "Private AI Assistant" alongside security notices; a plant and a notebook sit on the desk.

LLMs and Agentic Coding

Anyone who has ever solved a programming task with the help of a coding agent realizes immediately: from now on, you won’t want to work without one. With a little practice, a few requirements quickly turn into executable code. Tests with existing source code show that coding agents also make many tasks in maintenance and further development more pleasant and faster to complete.

The agents’ performance relies primarily on large language models (LLMs): the selected model receives the relevant code snippets and technical requirements and processes them according to the task. The greatest benefit comes from interacting with the largest, most modern LLMs trained for coding. The hardware resources to run them are found almost exclusively in the cloud, either directly from the original developer or in slightly modified form from alternative AI providers. The LLM creators, in turn, count on using the submitted information to train their next generation of models: a race for the best AI in which every dataset counts. But what happens if a project does not permit such free handling of information that is, in part, sensitive?


AI-Supported Development Without the Cloud

At Accso, we have set ourselves the goal of working AI-natively: coding agents should naturally contribute to the development of our projects. In many client projects, however, there are strict data protection and compliance requirements – meaning project information must not leave the protected environment. Cloud-based LLMs are therefore often not an option.

The alternative: We or our clients operate the LLMs ourselves. If we want to prevent agents from sending project information to the cloud, we need to clarify:

  • What LLM sizes are realistic for local deployment?
  • Is that sufficient for agentic coding or only for code completion?
  • Where are the bottlenecks: RAM, bandwidth, context size, tool calls?


Requirements for Agentic Coding

Classic code completion is relatively undemanding: a smaller model that has seen a lot of code, a few thousand tokens of context – that’s it. For agentic coding, the requirements are significantly higher. This is qualitatively different from “please give me the next 20 lines of code.”

The agent must:

  • Understand project structures (monorepos, microservices, build systems)
  • Be familiar with coding conventions, frameworks, and platforms
  • Understand technical requirements, at least in broad terms
  • Not just work “on the side,” but actively:
      • Read and modify files
      • Search directories
      • Start builds and tests
      • Invoke additional tools, e.g., via MCP

These diverse activities of the coding agents require sophisticated tool invocation cycles between agents and the LLM – sometimes involving extensive context (source code files, documentation, logs, chat history) or retries when something goes wrong. This increases complexity enormously.
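Such a cycle can be sketched in a few lines. The following Python stub is purely illustrative: the “LLM” is a scripted stand-in, and the JSON tool-call format and tool names (`read_file`, `run_tests`) are assumptions made for this example, not any particular agent’s real protocol.

```python
import json

# Tools the agent offers to the (mocked) model. Names are illustrative.
TOOLS = {
    "read_file": lambda path: f"contents of {path}",
    "run_tests": lambda: "2 passed, 0 failed",
}

def fake_llm(history):
    """Stub standing in for a real LLM. Emits one tool call per turn,
    then a final plain-text answer once test results appear in history."""
    if not any("run_tests" in msg for msg in history):
        if not any("read_file" in msg for msg in history):
            return json.dumps({"tool": "read_file", "args": {"path": "app.py"}})
        return json.dumps({"tool": "run_tests", "args": {}})
    return "DONE: tests pass"

def agent_loop(max_turns=5):
    history = []
    for _ in range(max_turns):
        reply = fake_llm(history)
        try:
            call = json.loads(reply)          # a tool call?
        except json.JSONDecodeError:
            return reply                      # final, plain-text answer
        result = TOOLS[call["tool"]](**call["args"])
        # Feed the tool result back so the model sees it next turn.
        history.append(f'{call["tool"]} -> {result}')
    return "gave up"

print(agent_loop())  # -> DONE: tests pass
```

Even this toy version shows where the context grows: every tool result is appended to the history that the model must re-read on the next turn.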


LLM Basics: Size, Quantization, Formats

If you want to run these processes on your own hardware instead of in the cloud as usual, certain prerequisites are needed. To discuss the hardware question meaningfully, here’s a brief overview of the technical foundation:


Parameter sizes and working memory: the “B” in the name

LLMs are roughly classified by the number of their parameters, identifiable by name components such as “-7B,” “-14B,” or “-32B.” The parameter count gives a rough measure of how much retrievable knowledge a model holds. The “B” stands for “billion,” and common sizes range from 7B up to 70B+. There are also so-called Mixture-of-Experts (MoE) models, in which only a fraction of the total parameters is active for any given token (recognizable by names such as “30B-A3B,” where 3B of the 30B parameters are active per token).

Guides from manufacturers like Apple emphasize: The actual limitation for local use is how many of these parameters, including quantization, can be meaningfully processed with the available memory.

The original language models typically use 16 bits per parameter (“FP16”). This results in the following rule of thumb for memory requirements:

  • 1B parameters → approx. 2 GB for the weights alone
  • 8B → approx. 16 GB
  • 30B → approx. 60 GB

Compared to the weights of such large models, the memory overhead of the inference software and its supporting functions is negligible.
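The rule of thumb above is a one-line calculation; here is a minimal sketch, using decimal gigabytes and ignoring runtime overhead:

```python
def weight_memory_gb(params_billion, bits_per_weight=16):
    """Rule-of-thumb memory for the weights alone:
    parameters x bytes-per-parameter, ignoring runtime overhead."""
    bytes_per_weight = bits_per_weight / 8
    return params_billion * 1e9 * bytes_per_weight / 1e9  # decimal GB

# The FP16 figures from the text:
print(weight_memory_gb(1))    # 1B  -> 2.0 GB
print(weight_memory_gb(8))    # 8B  -> 16.0 GB
print(weight_memory_gb(30))   # 30B -> 60.0 GB
```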


Quantization: Saving RAM and computing power at the expense of quality

To run such models on “normal” hardware, quantization is used. Quantization allows the knowledge embodied in the parameters to be utilized with less memory and computational power – at the cost of lower-quality outputs. The weights are compressed from, for example, 16 bits to 4–8 bits. MLX and other frameworks support such methods and offer quantized versions of many common models.

Example:

  • 16 bits → 2 bytes per weight
  • 4 bits → 0.5 bytes per weight

This reduces the memory requirements of a 1B model from approximately 2 GB to approximately 0.5 GB.
The trade-off:

  • slightly lower accuracy
  • more grammatical and logical errors
  • increased tendency to hallucinate

For simple code completion, a well-quantized 7–14B code model can be perfectly sufficient; for complex agent tasks, linguistic errors in an LLM’s output compromise the reliability of tool calls. In some cases, however, small models can be “persuaded” to cooperate with the agent under strict guidelines, whereas large LLMs independently select the correct format.


Model Formats and Inference Runtimes

But it is not just the memory requirements of the parameters alone that determine the ratio of performance to hardware resources. The application of the parameters—known as “inference”—requires computational power proportional to the number and size of the parameters. Many high-precision parameters mean a high demand for computing cores working in parallel. CPUs, GPUs, and NPUs (Neural Processing Units) can be combined for this purpose. Depending on the type of hardware used, manufacturer-specific LLM formats enable the best utilization of the hardware.

The weights of an LLM are available in different formats—depending on the runtime, use case, and hardware stack:

  • GGUF – the format based on llama.cpp, optimized for tailored utilization of available CPU/GPU cores
  • MLX – Apple’s proprietary format or stack for Apple Silicon, efficiently utilizes unified memory and GPU/NPU
  • PyTorch/Transformers checkpoints – a generic format upon which frameworks such as vLLM or Text Generation Inference (TGI) are built
  • NVIDIA-optimized formats – e.g., TensorRT LLM engines, highly optimized for NVIDIA GPUs, Tensor Cores, and CUDA
  • AMD stack – featuring ROCm, vLLM/TGI, which are specifically optimized for AMD Instinct and Radeon Pro

The various software products for LLM inference make it possible to extract the maximum AI performance from a given piece of hardware, but only with the right combination of model version, size, encoding, quantization, and engine, because each model architecture and quantization scheme requires corresponding support in the engine. Open Weight models are published in multiple variants, sizes, and encodings, so both can be chosen to match the use case and to make the most of the available hardware. For each language model, a suitable pairing of parameters and engine must then be found to achieve the expected performance.
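As a rough illustration of this matching step, here is a deliberately simplified lookup that mirrors the pairings listed above. The hardware categories and engine names are chosen for the example; real deployments involve many more variables (quantization support, model architecture, driver versions).

```python
# First-order "which stack fits which hardware" decision, mirroring
# the format list above. This is a sketch, not a sizing tool.
FORMAT_BY_HARDWARE = {
    "apple_silicon": ("MLX", "mlx-lm"),
    "nvidia_gpu":    ("TensorRT LLM engine", "TensorRT-LLM"),
    "amd_gpu":       ("PyTorch/Transformers checkpoint", "vLLM on ROCm"),
    "cpu_only":      ("GGUF", "llama.cpp"),
}

def suggest_stack(hardware):
    # GGUF/llama.cpp as the fallback: it runs almost anywhere.
    fmt, engine = FORMAT_BY_HARDWARE.get(hardware, ("GGUF", "llama.cpp"))
    return f"{fmt} via {engine}"

print(suggest_stack("apple_silicon"))  # -> MLX via mlx-lm
```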


Hardware Requirements for Coding Models: From Completion to Agent

Now that the technical foundations are in place, let us turn to the hardware requirements of the models used for code completion and agentic coding.

Code Completion with Small Code Models (<10B)

Nearly every major provider (OpenAI, Google, Anthropic, Qwen, Mistral, …) offers variants of its models that have been explicitly trained or fine-tuned on code. This makes sense: programming languages are strictly structured; this helps the models generate correct code.

For pure code completion, models with 3–7B parameters are often sufficient:

  • specifically trained on code (“Coder,” “Code,” “Instruct” variants)
  • with moderate context (8–16k tokens)
  • aggressively quantized (4–6 bits) to run them locally fast enough

The Continue.dev AI plugin can be configured with different models for different tasks. This allows you to improve performance and save memory during code completion.
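As a sketch of such a split, assuming Continue’s JSON configuration format and Ollama-served Qwen coder models (the model names are examples; adjust them to your setup):

```json
{
  "models": [
    {
      "title": "Local chat model",
      "provider": "ollama",
      "model": "qwen2.5-coder:14b"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Local completion model",
    "provider": "ollama",
    "model": "qwen2.5-coder:3b"
  }
}
```

The idea: the small, aggressively quantized model answers completion requests with low latency, while the larger model is reserved for chat and refactoring tasks.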
 

Agentic coding requires more “brainpower” and more context

However, as soon as we want to equip a coding agent with the appropriate hardware to solve tasks independently, the requirements shift:

  • World knowledge about frameworks, tools, patterns, and common error scenarios
  • Better planning and “multi-step reasoning”
  • Large context window: simultaneous view of solution strategies, prompts, source code, command outputs, and chat history

Today’s range of Open Weight LLMs also offers suitable candidates for coding agents. The achievable potential depends on the available hardware. Key considerations include fundamental suitability for coding, for tool invocations, and for sufficiently large contexts. Compared to frontier models in the cloud, local solutions always represent a compromise. However, hardware investments are likely to pay off in many cases compared to cloud-based solutions.


LLMs and Agentic Coding: Tool-Calling Quality

The use of LLMs for coding agents introduces an additional dependency: coding agents offer tools to the LLM, and the LLM formulates the calls for these tools. LLMs are specifically trained for this task. In practice, combinations of LLMs and coding agents perform with varying degrees of success. Larger models typically handle tool calls more reliably than smaller ones, and some agents execute tool calls more reliably than others. In some cases, configuring the language model for structured responses helps.
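A minimal guard of the kind an agent might place around the model’s output can be sketched as follows; the JSON tool-call shape and the tool names are illustrative assumptions, not any agent’s actual protocol:

```python
import json

# Accept only well-formed JSON naming a known tool; anything else
# (prose, malformed JSON, hallucinated tool names) triggers a retry
# upstream. Tool names here are illustrative.
KNOWN_TOOLS = {"read_file", "run_tests"}

def parse_tool_call(raw):
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None  # model answered in prose or malformed JSON
    if not isinstance(call, dict) or call.get("tool") not in KNOWN_TOOLS:
        return None  # hallucinated or misspelled tool name
    return call

print(parse_tool_call('{"tool": "run_tests", "args": {}}'))
print(parse_tool_call("Sure! Let me run the tests."))  # -> None
```

Large models rarely trip this guard; with small, heavily quantized models, the retry path gets exercised noticeably more often.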

 

Conclusion

With the right hardware, the required LLM performance can be ensured, although the need for a large context window currently narrows the field of viable models considerably. New, more capable models are released constantly, and their new features will gradually eliminate the shortcomings observed today.

After extensive research, Accso already offers its customers customized on-premises AI solutions, relieving them of the concern of having to send sensitive data to the cloud for AI processing. On-premises AI solutions make an important contribution to digital sovereignty and data protection. They can also reduce power consumption, a contribution to Green IT. By offering local AI solutions, Accso underscores its commitment to sustainable IT.

Would you like to use Agentic Coding in your company but have data protection or compliance requirements? Contact us. We’d be happy to advise you on local AI solutions and support you in implementing data protection-compliant development environments.