
How I run LLMs locally

A Hacker News user asked me [0] how I run LLMs locally, with some specific questions; I'm documenting my setup here for everyone.

Before I begin, I want to honor the thousands, perhaps millions, of unknown artists, coders and writers whose work Large Language Models (LLMs) have been trained on, often without due credit or compensation.


The r/LocalLLaMA subreddit [1] and the Ollama blog [2] are good places to start with running LLMs locally.

Hardware

I have a laptop running Linux with a Core i9 CPU (32 threads), an RTX 4090 GPU (16 GB VRAM) and 96 GB of RAM. Models that fit entirely inside VRAM generate more tokens/second; larger models are partially offloaded to system RAM (dGPU offloading), so tokens/second drops. I talk about models in a section below.
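Whether a model fits in VRAM comes down to simple arithmetic. A minimal sketch of the back-of-the-envelope rule I use (the 20% overhead factor for KV cache and activations is my assumption, not an exact figure):

```python
def fits_in_vram(params_billion: float, bits_per_weight: int,
                 vram_gb: float, overhead: float = 1.2) -> bool:
    """Estimate whether a quantized model fits entirely in VRAM.

    Footprint ~= parameter count * bytes per weight, padded by ~20%
    for KV cache and activations (a rough assumption).
    """
    size_gb = params_billion * (bits_per_weight / 8) * overhead
    return size_gb <= vram_gb

# A 7B model at 4-bit quantization needs ~4.2 GB -> fits in 16 GB VRAM.
print(fits_in_vram(7, 4, 16))    # True
# A 70B model at 4-bit needs ~42 GB -> spills over into system RAM.
print(fits_in_vram(70, 4, 16))   # False
```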

Such a powerful computer is not necessary for running LLMs locally; small models run fine on older GPUs or even CPUs, although they are slower and hallucinate more.

Software

There are several high-quality open-source tools that enable running LLMs locally. These are the tools I use most often.

Ollama [3] is middleware with Python and JavaScript libraries built around llama.cpp [4], which makes it easy to run LLMs. I run Ollama in Docker [5].
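Once the Ollama container is up, it serves a REST API on port 11434 by default. A minimal sketch of querying it from Python with only the standard library, assuming a local server with the llama3.2 model already pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_payload(model: str, prompt: str) -> bytes:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return json.dumps({"model": model, "prompt": prompt,
                       "stream": False}).encode()

def generate(model: str, prompt: str) -> str:
    """Send a prompt to a locally running Ollama server, return the reply."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running Ollama server):
# print(generate("llama3.2", "Why is the sky blue? One sentence."))
```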

Open WebUI [6] is a frontend that offers a familiar chat interface for text and image input; it communicates with the Ollama backend and streams the output back to the user.

llamafile [7] packs an LLM into a single executable file. This is probably the easiest way to get started with local LLMs, but I'm having issues with dGPU offloading in llamafile [8].

I'm not a big consumer of image / video generation models, but when necessary I use AUTOMATIC1111 [9] for images that need some customization and Fooocus [10] for simple image generation. For complex workflow automations involving image creation, there's ComfyUI [11].

For code completion I use Continue [12] in VS Code.
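Continue is pointed at the local Ollama server through its config.json (typically under ~/.continue/). A sketch of such a configuration, assuming the model names match what `ollama list` reports; the exact schema may differ between Continue versions, so treat this as illustrative:

```json
{
  "models": [
    {
      "title": "Qwen2.5 Coder",
      "provider": "ollama",
      "model": "qwen2.5-coder"
    }
  ],
  "tabAutocompleteModel": {
    "title": "DeepSeek Coder v2",
    "provider": "ollama",
    "model": "deepseek-coder-v2"
  }
}
```

With this in place, tab autocompletion goes to the completion model while the chat sidebar uses the chat model, matching the model split described below.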

I use the Smart Connections [13] plugin for Obsidian [14] to query my notes via Ollama.

Screenshot of Obsidian with the Smart Connections chat pane showing the last journal entry I wrote.
I asked Smart Connections when I last wrote a journal entry; I hope to write my journal every day in 2025.

Models

I use the Ollama models page [15] to download the latest LLMs, and I track new models via RSS in Thunderbird. I use CivitAI [16] to download image-generation models for specific styles (e.g., isometric, for world building). Note, though, that most of CivitAI's models seem to be intended for creating adult images.

I choose LLMs based on their performance-to-size ratio. My current selection changes constantly due to the rapid pace of LLM development.

•	Llama3.2 for Smart Connections and generic queries.
•	Deepseek-coder-v2 for code completion in Continue.
•	Qwen2.5-coder for chatting about code in Continue.
•	Stable Diffusion for image generation in AUTOMATIC1111 or Fooocus.
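The LLM entries in the list above can be captured as a small task-to-model map, which is handy when scripting against the Ollama API. A sketch; the model tags follow Ollama's registry naming, which is an assumption worth checking against `ollama list`:

```python
# Task -> local model, mirroring the selection above.
MODELS = {
    "notes": "llama3.2",                # Smart Connections and generic queries
    "completion": "deepseek-coder-v2",  # code completion in Continue
    "code-chat": "qwen2.5-coder",       # chatting about code in Continue
}

def model_for(task: str) -> str:
    """Return the preferred local model for a task, defaulting to llama3.2."""
    return MODELS.get(task, "llama3.2")

print(model_for("completion"))  # deepseek-coder-v2
print(model_for("unknown"))     # llama3.2
```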

Update

I update the Docker containers with Watchtower [17] and update models from within Open WebUI.

Fine-Tuning and Quantization

I have not fine-tuned or quantized any models on my machine because my Intel CPU may have a manufacturing defect [18], so I don't want to push it to high temperatures for long periods during training.

Conclusion

Running LLMs locally gives me full control over my data and low-latency responses. None of this would be possible without open-source projects, freely available open models, and the original creators of the data on which these models were trained.

I will update this post if and when I use new tools / models.

(0) https://news.ycombinator.com/item?id=42537024

(1) https://www.reddit.com/r/LocalLLaMA/

(2) https://ollama.com/blog

(3) https://ollama.com/download

(4) https://github.com/ggerganov/llama.cpp

(5) https://hub.docker.com/r/ollama/ollama

(6) https://github.com/open-webui/open-webui

(7) https://github.com/Mozilla-Ocho/llamafile

(8) https://github.com/Mozilla-Ocho/llamafile/issues/611

(9) https://github.com/AUTOMATIC1111/stable-diffusion-webui

(10) https://github.com/llyasviel/Fooocus

(11) https://github.com/comfyanonymous/ComfyUI

(12) https://docs.continue.dev/getting-started/overview

(13) https://github.com/brianpetro/obsidian-smart-connections

(14) https://obsidian.md

(15) https://ollama.com/search

(16) https://civitai.com/models/63376/isometric-chinese-style-architecture-lora

(17) https://containrrr.dev/watchtower/

(18) https://en.wikipedia.org/wiki/Raptor_Lake#Instability_and_degradation_issue

I try to write low-frequency, high-quality content on health, product development, programming, software engineering, DIY, security, philosophy and other interests. If you would like to receive it in your email inbox, please consider subscribing to my newsletter.



2024-12-29 10:49:00
