LocalLLaMA

3844 readers
4 users here now

Welcome to LocalLLaMA! Here we discuss running and developing machine learning models at home. Let's explore cutting-edge open-source neural network technology together.

Get support from the community! Ask questions, share prompts, discuss benchmarks, and get hyped about the latest and greatest model releases! Enjoy talking about our awesome hobby.

As ambassadors of the self-hosting machine learning community, we strive to support each other and share our enthusiasm in a positive, constructive way.

Rules:

Rule 1 - No harassment or personal character attacks of community members, i.e. no name-calling, no generalizing entire groups of people that make up our community, no baseless personal insults.

Rule 2 - No comparing artificial intelligence/machine learning models to cryptocurrency, i.e. no comparing the usefulness of models to that of NFTs, no claiming the resource usage required to train a model is anything close to that of maintaining a blockchain or mining crypto, no implying it's just a fad/bubble that will leave people with nothing of value when it bursts.

Rule 3 - No comparing artificial intelligence/machine learning to simple text prediction algorithms, i.e. statements such as "LLMs are basically just simple text predictors like what your phone keyboard autocorrect uses, and they're still using the same algorithms from <over 10 years ago>."

Rule 4 - No implying that models are devoid of purpose or potential for enriching people's lives.

founded 2 years ago

ollama 0.12.11 was released this week as the newest feature update to this easy-to-run method of deploying OpenAI GPT-OSS, DeepSeek-R1, Gemma 3, and other large language models. The exciting addition in ollama 0.12.11 is support for the Vulkan API.

Launching ollama with the OLLAMA_VULKAN=1 environment variable set will enable Vulkan API support as an alternative to AMD ROCm and NVIDIA CUDA acceleration. This is great for open-source Vulkan drivers, older AMD graphics cards lacking ROCm support, or any AMD setup running the RADV driver without ROCm installed. As we've seen when testing Llama.cpp with Vulkan, in some cases Vulkan can even be faster than ROCm.
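
For anyone who wants to try it, here's a minimal sketch (assuming ollama is on your PATH, listening on its default port 11434, and that you've already pulled a model such as Gemma 3):

```python
import os
import subprocess
import time

import requests  # pip install requests

# Start the ollama server with the Vulkan backend enabled.
env = dict(os.environ, OLLAMA_VULKAN="1")
server = subprocess.Popen(["ollama", "serve"], env=env)
time.sleep(3)  # crude wait for the server to come up

# Once the server is running, inference goes through the usual HTTP API.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gemma3", "prompt": "Hello!", "stream": False},
    timeout=300,
)
print(resp.json()["response"])
```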


Introducing: Loki! An all-in-one, batteries-included LLM CLI tool

Loki started out as a fork of the fantastic AIChat CLI, where I just wanted to give it first-class MCP server support. It has since evolved into a massive passion project: a fully-featured tool with its own identity and extensive capabilities! My goal is to make Loki a true "all-in-one" and "batteries-included" LLM tool.

Check out the release notes for a quick overview of everything Loki can do!

What Makes Loki Different From AIChat?

  • First-class MCP support, for both local and remote servers
    • Agents, roles, and sessions can all use different MCP servers, and switching between them will shut down any unnecessary ones and start the applicable ones
    • MCP sampling is coming next
  • Comes with a number of useful agents, functions, roles, and macros out-of-the-box
  • Agents, MCP servers, and tools are all managed by Loki now; no need to pull another repository to create and use tools!
    • No need for any more *.txt files
  • Improved DevX when creating bash-based tools (agents or functions)
    • No need to have argc installed: Loki handles all the compilation for you!
    • Loki has a --build-tools flag that will build your bash tools so you can run them exactly the same way Loki would
    • Built-in Bash prompting utils to make your bash tools even more user-friendly and flexible
  • Built-in vault to securely store secrets so you don't have to store your client API keys in environment variables or plaintext anymore
    • Loki will also inject additional secrets into your agent's tools as environment variables, so your agents can use secrets securely too
  • Multi-agent support out-of-the-box: You can now create agents that route requests to other agents and use multiple agents together without them trampling all over each other's binaries
  • Improved documentation for all the things!
  • Simplified directory structure so users can share full Loki directories and configurations without accidentally exposing massive amounts of data or secrets

What's Next?

  • MCP sampling support, so that MCP servers can send queries back for the client's LLM to answer. Essentially, think of it as letting the MCP server and the LLM talk to each other to resolve your query (see the sketch after this list)
  • Give Loki a TUI mode so it can operate like claude-code, gemini-cli, codex, and continue. The objective is for Loki to function exactly like those other CLIs, or even delegate to them when the problem demands it. No more needing to install a bunch of different CLIs to switch between!
  • Integrate with LSP-AI so you can use Loki from inside your IDE! Let Loki perform function calls and utilize agents, roles, RAG, and all of Loki's other features to help you write code.
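
Loki's internals aside, here's a rough sketch of what MCP sampling looks like from a server's point of view, using the official `mcp` Python SDK (the tool name and prompt are purely illustrative):

```python
from mcp.server.fastmcp import Context, FastMCP
from mcp.types import SamplingMessage, TextContent

mcp = FastMCP("sampling-demo")

@mcp.tool()
async def summarize(text: str, ctx: Context) -> str:
    """Ask the *client's* LLM to answer, via MCP sampling."""
    result = await ctx.session.create_message(
        messages=[
            SamplingMessage(
                role="user",
                content=TextContent(type="text", text=f"Summarize this:\n{text}"),
            )
        ],
        max_tokens=200,
    )
    # The client responds with a completion from whatever model it runs.
    return result.content.text if isinstance(result.content, TextContent) else ""

if __name__ == "__main__":
    mcp.run()
```
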
submitted 2 weeks ago* (last edited 1 week ago) by Xylight@lemdro.id to c/localllama@sh.itjust.works

Benchmarks look pretty good, even better than some of the text-only models. Make sure to take them with a grain of salt, though.

Benchmarks (image attachments):

  • Qwen3 VL 30B-A3B (No Thinking)
  • Visual benchmarks for Qwen3 VL 235B-A22B (Thinking)


Hello everyone,

I'm trying to set up a local "vibe coding" environment and use it on some projects I abandoned years ago. So far I'm using KoboldCPP to run the models and VS Code as the editor with various extensions and models, with limited to no luck.

The models work properly (tested with the KoboldCPP web UI and curl), but in VS Code they do not generate files, edit existing files, etc. (it seems that the tool-calling part is not working) and they usually end up in a loop.

I tried the following extensions:

  • RooCode
  • Cline

Models: (mostly from Unsloth)

  • DeepSeek-Coder 6.7B
  • Phi-4
  • Qwen3-Coder 30B
  • Granite 4 Small

Does anyone have any idea why it's not working, or have a working setup they can share? (I'm open to changing any of the tools/models.)

For reference, I have 16GB of RAM and 8GB of VRAM.
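
In case it helps with diagnosis, here's the kind of minimal check that can be run against the backend directly, to see whether tool calls come back at all (a sketch assuming KoboldCPP's OpenAI-compatible endpoint on its default port 5001; the write_file tool schema is purely illustrative):

```python
from openai import OpenAI  # pip install openai

# KoboldCPP exposes an OpenAI-compatible API; point the client at it.
client = OpenAI(base_url="http://localhost:5001/v1", api_key="none")

tools = [{
    "type": "function",
    "function": {
        "name": "write_file",  # illustrative tool, not a real extension API
        "description": "Write content to a file",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "content": {"type": "string"},
            },
            "required": ["path", "content"],
        },
    },
}]

resp = client.chat.completions.create(
    model="local",  # KoboldCPP generally ignores the model name
    messages=[{"role": "user", "content": "Create hello.txt containing 'hi'"}],
    tools=tools,
)
# If tool calling works end-to-end, this is populated instead of plain text.
print(resp.choices[0].message.tool_calls)
```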

Many thanks in advance for your help!


Or something that goes against the general opinions of the community? Vibes are the only benchmark that counts after all.

I tend to agree with the flow on most things, but here are the thoughts of mine that I'd consider going against the grain:

  • QwQ was think-slop and was never that good
  • Qwen3-32B is still SOTA for 32GB and under. I cannot get anything to reliably beat it, despite shiny benchmarks
  • DeepSeek is still the open-weight SOTA. I've really tried Kimi, GLM, and Qwen3's larger variants, but asking DeepSeek still feels like asking the adult in the room. Caveat: GLM codes better
  • (proprietary bonus): Grok 4 handles news data better than GPT-5 or Gemini 2.5 and will always win if you ask it about something that happened that day.

The Apple M5 Pro chip is rumored to be announced later this week with improved GPU cores and faster inference. With consumer hardware getting better and better, and AI labs squishing models down to fit in tiny amounts of VRAM, it's becoming increasingly feasible to have an assistant that has absorbed the entirety of the internet's knowledge built right into your PC or laptop, all running privately and securely offline. The future is exciting, everyone; we are closing the gap.


It is small but still really good


I found this project, which is just a couple of small Python scripts gluing various tools together: https://github.com/vndee/local-talking-llm

It's pretty basic, but I couldn't find anything more polished. I did a little "vibe coding" to use a faster Chatterbox fork, stream the output back so I don't have to wait for the entire LLM response before it starts "talking," start recording on voice detection instead of the enter key, and allow interruption of the agent. But, like most vibe-coded stuff, it's buggy. I was curious whether something better already exists before I commit to actually fixing the problems and pushing a fork.
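
For anyone curious, the streaming change boils down to flushing each sentence to TTS as tokens arrive instead of waiting for the full response. A rough sketch, assuming an Ollama backend on the default port and a speak() stand-in for the Chatterbox call:

```python
import json

import requests  # pip install requests

def speak(sentence: str) -> None:
    """Stand-in for the actual Chatterbox TTS call."""
    print(f"[TTS] {sentence}")

# Stream tokens from the LLM, flushing complete sentences to TTS
# so speech starts long before the model finishes generating.
buffer = ""
with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2", "prompt": "Tell me a short story.", "stream": True},
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)  # ollama streams one JSON object per line
        buffer += chunk.get("response", "")
        # Flush on sentence boundaries.
        while any(p in buffer for p in ".!?"):
            idx = min(i for i in (buffer.find(p) for p in ".!?") if i != -1)
            speak(buffer[: idx + 1].strip())
            buffer = buffer[idx + 1 :]
if buffer.strip():
    speak(buffer.strip())
```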


Hey All 👋

I have been running Open WebUI and ollama on one of my servers for the last year or so. Recently I added a new server, spun up ollama on the more performant machine, and started using GPT-OSS:20B.

I built a few tools, including one that scrapes the Wikipedia library on my Kiwix server. I configured it to scrape images and text that can be summarized in the chat, as shown in the image I included.

I'm very impressed with this model's ability to call tools effectively, as well as its context window. I had to iterate on the Kiwix tool a few times when the reply overloaded the window and led to a lot of hallucinations, but I'm really happy with how well it works now.
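
For anyone who wants to build something similar, here's a stripped-down sketch of the shape of an Open WebUI tool (the Kiwix host and search endpoint are illustrative; check your kiwix-serve instance for the exact URL format):

```python
import requests

class Tools:
    def __init__(self):
        self.kiwix_url = "http://kiwix.local:8080"  # illustrative host

    def search_wikipedia(self, query: str) -> str:
        """
        Search the local Kiwix Wikipedia mirror and return raw result text.
        :param query: The search terms to look up.
        """
        resp = requests.get(
            f"{self.kiwix_url}/search",
            params={"books.name": "wikipedia", "pattern": query},
            timeout=30,
        )
        # Truncate so a huge reply doesn't blow out the context window.
        return resp.text[:4000]
```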

I also ran into a really interesting case while showing a user how to use tools. He asked the model "what treatments are available for dogs with yeast infections"; the model thought for a second and returned nothing, and its thoughts said, "user is requesting medical advice. Do not answer the request." I asked the user to instead send "use the Kiwix tool to look up yeast infections. Summarize any treatments suitable for dogs," and it used the tool and provided a wall of text.

So far I have tools to query my Plex library, an Azure POI search, a summarizer for Nextcloud Forms API data, weather, and the Kiwix search. I really feel like my local model has almost as much value as the big boys thanks to these tools.

Anyone else messing around with tools?


Mine attempts to lie whenever it can if it doesn't know something. I will call it out and say that's a lie, and it will say "you are absolutely correct". tf.

I was reading into sleeper agents placed inside local LLMs, and this is increasing the chance I'll delete it forever. Which is a shame, because it has become my new search engine, seeing how they ruined actual search engines.


Qwen image edit seems to be the best one available right now, but I can't run it. It would take too long on my system, especially at 40 steps.

For general image generation, Flux schnell works great and can generate an image on my system in about 60 seconds. I was hoping to see a Flux Kontext-schnell, but it doesn't exist; only Kontext-dev.
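
If anyone wants to reproduce this setup, here's a minimal sketch using Hugging Face diffusers (assuming the FLUX.1-schnell weights fit with CPU offload; the prompt is illustrative):

```python
import torch
from diffusers import FluxPipeline  # pip install diffusers

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # helps fit within limited VRAM

image = pipe(
    "a cozy cabin in a snowy forest at dusk",
    num_inference_steps=4,  # schnell is distilled for very few steps
    guidance_scale=0.0,     # schnell does not use CFG
).images[0]
image.save("cabin.png")
```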


Hi all, I have never touched any tools for local inference and barely know anything about the landscape. Additionally, the only hardware I have available is an 8C/16T Zen 3 CPU and 48GB of RAM. I have many years of experience running Linux as a daily driver and doing small-network sysadmin work.

I am well aware this is extreme challenge mode, but it's what I have to work with for now, and my main goal is more to do with learning the ecosystem than with getting highly usable results.

I decided for various reasons that my first project would be to get a model that I can feed an image and have it output a caption.

If I have to quantize a model to make it fit into my available RAM then I am willing to learn that too.

I am looking for basic pointers of where to get started, such as "read this guide," "watch this video," "look into this software package."

I am not looking for solutions which involve using an API where inference happens on a machine which is not my own.
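
Not a full guide, but to make the goal concrete: one common CPU-only route is llama.cpp via the llama-cpp-python bindings with a quantized vision model. A rough sketch (the GGUF file names are illustrative; you'd download a model plus its matching CLIP projector):

```python
import base64

from llama_cpp import Llama  # pip install llama-cpp-python
from llama_cpp.llama_chat_format import Llava15ChatHandler

def image_to_data_uri(path: str) -> str:
    """Encode a local image so it can be passed to the model."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

# A quantized vision model plus its matching projector, both GGUF files.
chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")
llm = Llama(
    model_path="llava-v1.5-7b.Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,  # leave room for the image tokens
)

resp = llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": image_to_data_uri("photo.jpg")}},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }]
)
print(resp["choices"][0]["message"]["content"])
```

A quantized 7B vision model fits comfortably in 48GB of RAM; generation will just be slow on CPU.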
