LocalLLaMA

2982 readers
29 users here now

Welcome to LocalLLaMA! Here we discuss running and developing machine learning models at home. Let's explore cutting-edge open-source neural network technology together.

Get support from the community! Ask questions, share prompts, discuss benchmarks, get hyped at the latest and greatest model releases! Enjoy talking about our awesome hobby.

As ambassadors of the self-hosting machine learning community, we strive to support each other and share our enthusiasm in a positive constructive way.

founded 2 years ago
MODERATORS
1
 
 

My group chats use those reaction emojis all the time. Maybe they could train a model to classify with those, then use that classifier to help RL models into being funny.

All my funniest groupchats are on Snapchat.

I don't think this would be ethical, but it could be effective.

2
 
 

A few days ago ROCm 6.4 was officially added to the Arch repositories - which is great - but it made my current setup completely explode - which is less great - and right now I don't have the will needed to descend into gdb hell and come back...

So I've taken this opportunity to set up a podman (a docker alternative) container to use the older ROCm 6.3.3, which still works for me. On the plus side this has made it even easier to test new things and do random stuff: I will probably port my Vulkan setup too, at a later date.

Long story short I've decided to clean it up a bit, place a bunch of links and comments, and share it with you all in the hope it will help someone out.

You still need to handle the necessary requirements on your host system to make everything work, but I have complete trust in you! Even if it doesn't work, it is a starting point that I hope will give some direction on what to do.

BTW I'm not an expert in this field, so some things can undoubtedly be improved.

Assumptions

  • To make this simpler I will assume, and advise you to use, this kind of folder structure:
base_dir
 ├─ROCm_debian_dev
 │  └─ Dockerfile
 └─llamacpp_rocm6.33
    ├─ logs
    │   └─ logfile.log
    ├─ workdir
    │   └─ entrypoint.sh
    ├─ Dockerfile
    └─ compose.yaml
  • I've tested this on Arch Linux. You can probably make it work on basically any reasonably current distro, but that's untested.

  • You should follow the basic requirements from the AMD documentation, and cross your fingers. You can probably find a more precise guide on your distro's wiki. Or just install any and all ROCm and HIP related SDKs (see the sketch after this list). Sigh.

  • I'm using podman, which is an alternative to docker. It has some idiosyncrasies - which I will not get into, because they would require another full write-up - so if you use docker you may need to modify some things. I can't help you there.

  • This is given with no warranty: if your computer catches on fire, it is on you (code MIT/Apache 2 license, the one you prefer; text CC BY-SA 4.0). More at the end.

  • You should know which 'generation' of card yours is. ROCm works in mysterious ways and each card has its problems. Generally you can just steamroll forward without a care, but you still need to find which HSA_OVERRIDE_GFX_VERSION your card needs to run under. For example, for an rx6600xt/rx6650xt it would be gfx1030 and HSA_OVERRIDE_GFX_VERSION=10.3.0 (see the sketch after this list). Some info here: Compatibility Matrix. You can (not so) easily search for the correct gfx and HSA codes on the web. I don't think the 9xxx series is currently supported, but I could be wrong.

  • There's an official Docker image in the llama.cpp repository; you could give that one a go. Personally I like building them myself, so I understand what is going on when I inevitably bleed on the edge - in fact I didn't even consider the existence of an official Dockerfile until after writing this post... Welp. Still, they are two different approaches, pick your poison.
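
A minimal host-side sketch of the last few points, assuming Arch Linux: the package names and the gfx1032 -> gfx1030 mapping come from my rx6600xt/rx6650xt case, so treat them as assumptions and adapt them to your distro and card.

# Host packages on Arch (names will differ on other distros):
sudo pacman -S rocminfo rocm-hip-sdk

# See which gfx name your card actually reports:
rocminfo | grep -o 'gfx[0-9a-f]*' | sort -u

# rx6650xt reports gfx1032, but the supported build target is gfx1030, so it runs with:
#   HSA_OVERRIDE_GFX_VERSION=10.3.0
# (in this guide that variable is set later, in compose.yaml)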

Dockerfile(s)

These can, at a high level, be described as the recipes with which we will set up the container images that will compile and run llama.cpp for us.

I will put two Dockerfiles here: the first can be used as a fixed base, while the second can be rebuilt every time you want to update llama.cpp.

Now, this will create a new image each time. We could use a volume (a directory shared between the host machine and the container) to just git pull the new code instead of cloning it, but that would throw away much of the point of running this in a container. TLDR: for now don't overthink it and just go with the flow.

Base image

This is a pretty basic recipe: it takes the official dev-ubuntu image by AMD and then augments it to suit our needs. You can easily use other versions of ROCm (for example dev-ubuntu-24.04:6.4-complete) or even of Ubuntu. You can find the filtered list of the images here: Link

Could we use a lighter image? Yes. Should we? Probably. Maybe next time.

tbh I've tried other images with no success, or they needed too much effort for a minimal reward: this Just Works™. YMMV.

base_dir/ROCm_debian_dev/Dockerfile

# This is the one that currently works for me, you can
# select a different one:
#   https://hub.docker.com/r/rocm/dev-ubuntu-24.04/tags
FROM docker.io/rocm/dev-ubuntu-24.04:6.3.3-complete
# 6.4.0
# FROM docker.io/rocm/dev-ubuntu-24.04:6.4-complete

# We update and then install some stuff.
# In theory we could delete more things to make the final
# image slimmer.
RUN apt-get update && apt-get install -y \
    build-essential \
    git \
    cmake \
    libcurl4-openssl-dev \
    && rm -rf /var/lib/apt/lists/*

It is a big image, over 30GB on disk (around 6GB to download for 6.3.3-complete and around 4GB for 6.4-complete).

Let's build it:

cd base_dir/ROCm_debian_dev/
podman build -t rocm-6.3.3_ubuntu-dev:latest .

This will build it and add it to your local images (you can see them with podman images) with the name rocm-6.3.3_ubuntu-dev and the tag latest. You can change them as you see fit, obviously. You can even give multiple tags to the same image: a common pattern is to use a more specific tag and then also add the latest tag to the most recent one you have generated, so you don't have to change the other scripts that reference it. More info here: podman tag
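
For example, a hedged sketch of the multi-tag idea (the date tag is just a placeholder):

# Give the freshly built image an extra, more specific tag:
podman tag rocm-6.3.3_ubuntu-dev:latest rocm-6.3.3_ubuntu-dev:2025-05-25
# Check that both tags point at the same image ID:
podman images | grep rocm-6.3.3_ubuntu-dev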

The real image

The second image is the one that will compile and then run llama-server and llama-bench, and you need to customize it:

  • You should modify the number after -j based on the number of virtual cores your CPU has, minus one. You can use nproc in a terminal to check (see the quick sketch after this list).
  • You have to change the AMDGPU_TARGETS code based on your gfx version! Pay attention, because the correct one may not be the one returned by rocminfo: for example, the rx6650xt reports gfx1032, but that is not directly supported by ROCm, so you have to use the supported (and basically identical) gfx1030 instead.
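
A quick check for the -j value from the first point (just the "cores minus one" rule):

# Virtual cores on the host; subtract one for the -j flag in the Dockerfile below.
nproc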

If you want to compile with a ROCm image newer than 6.3 you need to swap the commented lines. Still, I have no idea whether it works or whether it is even supported by llama.cpp.

More info, and some tips, here: Link

base_dir/llamacpp_rocm6.33/Dockerfile

FROM localhost/rocm-6.3.3_ubuntu-dev:latest

# This could be shortened, but I like to have multiple
# steps to make it clear, and show how to achieve
# things in different ways.
WORKDIR /app
RUN git clone https://github.com/ggml-org/llama.cpp.git
WORKDIR /app/llama.cpp
RUN mkdir build_hip
WORKDIR build_hip
# This will run the cmake configuration.
# Pre  6.4 -DAMDGPU_TARGETS=gfx1030
RUN HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" cmake -S .. -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1030 -DCMAKE_BUILD_TYPE=Release
# Post 6.4 -DGPU_TARGETS=gfx1030
# RUN HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" cmake -S .. -DGGML_HIP=ON -DGPU_TARGETS=gfx1030 -DCMAKE_BUILD_TYPE=Release
# Here we build the binaries, both for the server and the bench.
RUN cmake --build . --config Release -j7 --target llama-server
RUN cmake --build . --config Release -j7 --target llama-bench

To build this one we will need to use a different command:

cd base_dir/llamacpp_rocm6.33/
podman build --no-cache -t rocm-6.3.3_llamacpp:b1234 .

As you can see we have added the --no-cache flag. This is to make sure that the image actually gets rebuilt; otherwise it would just keep returning the same image over and over from the cache, because the recipe didn't change. This time the tag is a b1234 placeholder: you should use the current release build number or the current commit short hash of llama.cpp (you can easily find them when you start the binary, or on the GitHub page) to remember at which point you compiled, and use the dynamic latest tag as a supplementary bookmark. The current date is a good candidate too.
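
To recover the commit short hash after the fact, and to add latest as the supplementary bookmark, something like this should work; the b1234 tag is the placeholder from above, and I'm assuming the base image's default entrypoint lets you run a one-off command:

# Ask the image which commit it was built from (git is already installed in it):
podman run --rm localhost/rocm-6.3.3_llamacpp:b1234 git -C /app/llama.cpp rev-parse --short HEAD
# Then point latest at the same image:
podman tag localhost/rocm-6.3.3_llamacpp:b1234 rocm-6.3.3_llamacpp:latest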

If something doesn't feel right - for example your GPU is not being used when you make a request to the server - you should read the logs of the cmake configuration step, to check that everything required has been set up correctly and there are no errors.

Let's compose it up

Now that we have two images that built without any kind of error we can use them to reach our goal. I've heavily commented the compose file, so just read and modify it directly. Don't worry too much about all the lines, but if you are curious - and you should be - you can easily search for them and find a bunch of explanations that are surely better than what I could write here without taking up too much space.

Since it is a yaml file - bless the soul of whoever decided that - pay attention to the whitespace! It matters!

We will use two volumes: one will point to the folder where you have downloaded your GGUF files, the second one to where we have the entrypoint.sh file. We are putting the script in a volume instead of baking it into the container so you can easily modify it and experiment.

A small model that you could use as a benchmark to see if everything is working is Meta-Llama-3.1-8B-Instruct-IQ4_XS.gguf.
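
If you still need to download it, a minimal sketch (the destination path is the placeholder used in the compose file below, and the URL is the usual Hugging Face resolve form of the link in the compose comments):

cd /path/on/your/machine/where/the/ggufs/are
curl -L -O https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-IQ4_XS.gguf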

base_dir/llamacpp_rocm6.33/compose.yaml

# Benchmark image: https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/blob/main/Meta-Llama-3.1-8B-Instruct-IQ4_XS.gguf
# benchmark command:
#    ./bin/llama-bench -t 7 -m /app/models/Meta-Llama-3.1-8B-Instruct-IQ4_XS.gguf -ngl 99 -fa 1 -ctk q4_0 -ctv q4_0
#    ./bin/llama-bench -t 7 -m /app/models/Meta-Llama-3.1-8B-Instruct-IQ4_XS.gguf -ngl 99
services:
    llamacpp-server:
        # If you have renamed the image, change it here too!
        image: localhost/rocm-6.3.3_llamacpp:latest
        # The subsequent two lines are needed to enter the image and directly use bash:
        # start it with [podman-compose up -d|docker compose up -d]
        # and then docker attach to the container with
        # [podman|docker] attach ID
        # You'll need to change the entrypoint.sh file too, just with the
        # shebang and a line straight up calling `bash`, as content.
        stdin_open: true
        tty: true
        # end bash section, Comment those two lines if you don't need shell
        # access. Or leave them.
        group_add:
            # The video group is needed on most distros to access the GPU
            # the render group is not present in some and needed
            # in others. Try it out.
            - "video" # 985 # video group - "render" # 989 # render
        environment:
            # FIXME: Change this with the right one!
            # If you have a wrong one it will _not work_.
            - HSA_OVERRIDE_GFX_VERSION=10.3.0
        devices:
            - /dev/kfd:/dev/kfd
            - /dev/dri:/dev/dri
        cap_add:
            - SYS_PTRACE
        logging:
            # The default logging driver is journald, which I despise
            # because it can pollute the journal pretty badly.
            #
            # The none driver will not save the logs anywhere.
            # You can still attach to the container, but you will lose
            # the lines before the attachment.
            # driver: none
            #
            # The json-file option is deprecated, so we will use the
            # k8s-file one.
            # You can use `podman-compose logs -f` to keep tabs, and it will not
            # pollute the system journal.
            # Remember to `podman-compose down` to stop the container.
            # `ctrl+c`ing the logs will do nothing.
            driver: k8s-file
            options:
                max-size: "10m"
                max-file: "3"
                # You should probably use an absolute path.
                # Really.
                path: ./logs/logfile.log
        # This is mostly a fix for how the podman network stack works.
        # If you are offline when starting the container it would just not
        # start, erroring out. Putting it in host mode solves this,
        # but it has other cons.
        # Reading the issue (https://github.com/containers/podman/issues/21896) it is
        # probably fixed, but I still have to test that.
        # It mainly means that you can't run multiple of these at once, because they
        # would take the same port. Luckily you can change the port in the llama-server
        # command in the entrypoint.sh script.
        network_mode: "host"
        ipc: host
        security_opt:
            - seccomp:unconfined
        # These you really need to CHANGE.
        volumes:
            # FIXME: Change these paths! Only the left side before the `:`.
            #        Use absolute paths.
            - /path/on/your/machine/where/the/ggufs/are:/app/models
            - /path/to/rocm6.3.3-llamacpp/workdir:/workdir
        # It doesn't work with podman-compose
        # restart: no
        entrypoint: "/workdir/entrypoint.sh"
        # To make it easy to use I've added a number of env variables
        # with which you can set the llama.cpp command params.
        # More info in the bash script, but they are quite self explanatory.
        command:
            - "${MODEL_FILENAME:-Meta-Llama-3.1-8B-Instruct-IQ4_XS.gguf}"
            - "${GPU_LAYERS:-22}"
            - "${CONTEXT_SIZE:-8192}"
            - "${CALL_TYPE:-bench}"
            - "${CPU_THREADS:-7}"

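If you only need a quick shell inside the running container, without touching the entrypoint as the comments above describe, podman exec should also do the trick (the container name is whatever podman-compose generated; check it with podman ps):

podman ps
podman exec -it <container_name_or_id> bash
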
Now that you have meticulously modified the above file, let's talk about the script that will launch llama.cpp.

base_dir/llamacpp_rocm6.33/workdir/entrypoint.sh

#!/bin/bash
cd /app/llama.cpp/build_hip || exit 1
MODEL_FILENAME=${1:-"Meta-Llama-3.1-8B-Instruct-IQ4_XS.gguf"}
GPU_LAYERS=${2:-"22"}
CONTEXT_SIZE=${3:-"8192"}
CALL_TYPE=${4:-"server"}
CPU_THREADS=${5:-"7"}

if [ "$CALL_TYPE" = "bench" ]; then
  ./bin/llama-bench -t "$CPU_THREADS" -m /app/models/"$MODEL_FILENAME" -ngl "$GPU_LAYERS"
elif [ "$CALL_TYPE" = "fa-bench" ]; then
  ./bin/llama-bench -t "$CPU_THREADS" -m /app/models/"$MODEL_FILENAME" -ngl "$GPU_LAYERS" -fa 1 -ctk q4_0 -ctv q4_0
elif [ "$CALL_TYPE" = "server" ]; then
  ./bin/llama-server -t "$CPU_THREADS" -c "$CONTEXT_SIZE" -m /app/models/"$MODEL_FILENAME" -fa -ngl "$GPU_LAYERS" -ctk q4_0 -ctv q4_0
else
  echo "Valid modalities are \"bench\", \"fa-bench\" or \"server\""
  exit 1
fi

exit 0

This is straightforward. It enters the folder (inside the container) where we built the binaries and then calls the right command, chosen by the CALL_TYPE parameter. I've set it up to handle some common options, so you don't have to change the script every time you want to run a different model or change the number of layers loaded into VRAM.

The beauty of it is that you could put a .env file in the llamacpp_rocm6.33 folder with the params you want to use, and just start the container.

An example .env file could be:

base_dir/llamacpp_rocm6.33/.env

MODEL_FILENAME=Meta-Llama-3.1-8B-Instruct-IQ4_XS.gguf
GPU_LAYERS=99
CONTEXT_SIZE=8192
CALL_TYPE=bench
CPU_THREADS=7
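
Since compose substitutes ${VAR:-default} from the shell environment as well as from the .env file, you can also override things inline for a one-off run (a minimal sketch, assuming your podman-compose handles substitution like docker compose does):

cd base_dir/llamacpp_rocm6.33/
CALL_TYPE=server GPU_LAYERS=99 podman-compose up -d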

Some notes:

  • For now it uses flash attention by default with a quantized KV cache. You can avoid this by deleting the -fa and the -ctk q4_0 -ctv q4_0. Experiment a bit.
  • You could add more params or environment variables: it is easy to do. How about one for the port number? (See the sketch after this list.)
  • Find more info about llama.cpp server here: Link.
  • And the bench here: Link.
  • For now I've set up three modes: the server, a plain bench, and a bench with FlashAttention enabled - server, bench and fa-bench respectively.
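
As an example of the port idea from the notes above, here is a hedged sketch of the two small changes involved; the LLAMA_PORT name and the sixth positional parameter are my own additions, not something already in the files:

# entrypoint.sh: read a sixth positional parameter and pass it to the server
LLAMA_PORT=${6:-"8080"}
./bin/llama-server -t "$CPU_THREADS" -c "$CONTEXT_SIZE" -m /app/models/"$MODEL_FILENAME" \
    -fa -ngl "$GPU_LAYERS" -ctk q4_0 -ctv q4_0 --port "$LLAMA_PORT"

# compose.yaml: add one more entry at the end of the command: list
#     - "${LLAMA_PORT:-8080}"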

Time to start it

Starting it is just a command away:

cd base_dir/llamacpp_rocm6.33/
podman-compose up -d
podman-compose logs -f

When everything is completely loaded, open your browser and go to http://127.0.0.1:8080/ to be welcomed by the llama.cpp webui, and check whether the GPU is being used. (I have my fingers crossed for you!)
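
You can also poke it from the terminal: llama-server exposes an OpenAI-compatible API, so a request like this should get an answer back (a minimal sketch):

curl http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Say hello in one short sentence."}]}'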

Now that everything is working, have fun with your waifus and/or husbandos! ..Sorry, I meant, be productive with your helpful assistant!

When you are done, in the same folder, run podman-compose down to mercilessly kill them off.

Licensing

I know, I know. But better safe than sorry.

All the code, configurations and comments in them that are not already under other licenses or copyrighted by others are dual licensed under the MIT and Apache 2 licenses, Copyright 2025 [Mechanize@feddit.it](https://feddit.it/u/Mechanize). Take your pick.

All the other text of the post © 2025 by Mechanize@feddit.it is licensed under CC BY-SA 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by-sa/4.0/

3
 
 

Gemma 3n includes the following key features:

Audio input: Process sound data for speech recognition, translation, and audio data analysis.

Visual and text input: Multimodal capabilities let you handle vision, sound, and text to help you understand and analyze the world around you.

PLE caching: Per-Layer Embedding (PLE) parameters contained in these models can be cached to fast, local storage to reduce model memory run costs. Learn more

MatFormer architecture: Matryoshka Transformer architecture allows for selective activation of the model's parameters per request to reduce compute cost and response times. Learn more

Conditional parameter loading: Bypass loading of vision and audio parameters in the model to reduce the total number of loaded parameters and save memory resources. Learn more

Wide language support: Wide linguistic capabilities, trained in over 140 languages.

32K token context: Substantial input context for analyzing data and handling processing tasks.

4
 
 

"While the B60 is designed for powerful 'Project Battlematrix' AI workstations sold as full systems ranging from $5,000 to $10,000, it will carry a roughly $500 per-unit price tag."

5
 
 

If you are an agent builder, these three protocols should be all you need

  • MCP gives agents tools
  • A2A allows agents to communicate with other agents
  • AG-UI brings your agents to the frontend, so they can engage with users.

Is there anything I'm missing?

6
7
 
 

from 10b0t0mized: I miss the days when I had to go through a humiliation ritual before getting my questions answered.

Nowadays you can just ask your questions of an infinitely patient entity; AI is really terrible.

8
9
10
 
 

I took a practice test (math) and would like to have it graded by an LLM, since I can't find the answer key online. I have 20GB of VRAM, but I'm on Intel Arc so I can't do Gemma 3. I would prefer models from ollama.com 'cause I'm not deep enough down the rabbit hole to try huggingface stuff yet and don't have time to right now.

11
17
submitted 1 week ago* (last edited 1 week ago) by Eyekaytee@aussie.zone to c/localllama@sh.itjust.works
 
 

This fork introduces a Radio Station feature where AI generates continuous radio music. The process involves two key components:

  • LLM: Generates the lyrics for the songs.
  • ACE: Composes the music for the generated lyrics.

Due to the limitations of slower PCs, the demo video includes noticeable gaps (approximately 4 minutes) between the generated songs.

If your computer struggles to stream songs continuously, increasing the buffer size will result in a longer initial delay but fewer gaps between songs (until the buffer is depleted again).

By default the app attempts to load the model file gemma-3-12b-it-abliterated.q4_k_m.gguf from the same directory. However, you can also use alternative LLMs. Note that the quality of generated lyrics will vary depending on the LLM's capabilities.

12
13
 
 

Hi, I'm not too informed about LLMs, so I'd appreciate any corrections to what I might be getting wrong. I have a collection of books I would like to train an LLM on, so I could use it as a quick source of information on the topics covered by the books. Is this feasible?

14
 
 

Something I always liked about NousResearch is how they seemingly try to understand cognition in a more philosophical/metaphysically symbolic way and aren't afraid to let you know it. I think their unique view may allow them to find new perspectives that advance the field. Check out AscensionMaze in particular; the wording they use is just fascinating.

15
 
 

I'm interested in really leveraging the full capabilities of local AI, for code generation and everything else. Let me know what you people are using.

16
 
 

It's amazing how far open source LLMs have come.

Qwen3-32B recreated the Windows 95 Starfield screensaver as a web app, with a bonus feature that enables "warp drive" on click. This was generated with reasoning disabled (/no_think) using a 4-bit quant running locally on a 4090.

Here's the result: https://codepen.io/mekelef486/pen/xbbWGpX

Model: Qwen3-32B-Q4_K_M.gguf (Unsloth quant)

Llama.cpp Server Docker Config:

docker run \
-p 8080:8080 \
-v /path/to/models:/models \
--name llama-cpp-qwen3-32b \
--gpus all \
ghcr.io/ggerganov/llama.cpp:server-cuda \
-m /models/qwen3-32b-q4_k_m.gguf \
--host 0.0.0.0 --port 8080 \
--n-gpu-layers 65 \
--ctx-size 13000 \
--temp 0.7 \
--top-p 0.8 \
--top-k 20 \
--min-p 0

System Prompt:

You are a helpful expert and aid. Communicate clearly and succinctly. Avoid emojis.

User Prompt:

Create a simple web app that uses javascript to visualize a simple starfield, where the user is racing forward through the stars from a first person point of view like in the old Microsoft screensaver. Stars must be uniformly distributed. Clicking inside the window enables "warp speed" mode, where the visualization speeds up and star trails are added. The app must be fully contained in a single HTML file. /no_think

17
18
19
 
 

Qwen3 was apparently posted early, then quickly pulled from HuggingFace and Modelscope. The large ones are MoEs, per screenshots from Reddit:

screenshots

Including a 235B/22B active and a 30B/3B active.

Context appears to 'only' be 32K unfortunately: https://huggingface.co/qingy2024/Qwen3-0.6B/blob/main/config_4b.json

But it's possible they're still training them to 256K:

from reddit

Take it all with a grain of salt - configs could change with the official release - but it appears it is happening today.

20
21
 
 

This is one of the "smartest" models you can fit on a 24GB GPU now, with no offloading and very little quantization loss. It feels big and insightful, like a better (albeit dry) Llama 3.3 70B with thinking, and with more STEM world knowledge than QwQ 32B, but it comfortably fits thanks to the new exl3 quantization!

Quantization Loss

You need to use a backend that supports exl3, like (at the moment) text-gen-web-ui or (soon) TabbyAPI.

22
 
 

I would like my model to know the code libraries I use and help me write code with them. I use llama.cpp's server and web UI for inference, but I have no clue how to get started with RAG, since it seems it is not natively supported with llama.cpp's server implementation. It almost looks like I would need to code my own agent.

I am not interested in commercial offerings or APIs. If you use RAG, how do you do it?

23
 
 

I'm currently running Gemma 3. It is really good overall, but one thing that is frustrating is the relentless positivity.

Is there a way to make it more critical?

I'm not looking for it to say "that is a shit idea", but less of the "that is a great observation" or "you've made a really insightful point" etc. would be nice...

If a human were talking like that, I'd be suspicious of their motives. Since it is a machine, I don't think it is trying to manipulate me; I think the programming is just set too positive.

It may also be cultural: as a rule New Zealanders are less emotive in our communication, and the LLM (to me) feels like an overly positive American.

24
 
 

Seems there's not a lot of talk about relatively unknown finetunes these days, so I'll start posting more!

Openbuddy's been on my radar, but this one is very interesting: QwQ 32B, post-trained on openbuddy's dataset, apparently with QAT applied (though it's kinda unclear) and context-extended. Observations:

  • Quantized with exllamav2, it seems to show lower distortion levels than normal QwQ. It works conspicuously well at 4.0bpw and 3.5bpw.

  • Seems good at long context. Have not tested 200K, but it's quite excellent in the 64K range.

  • Works fine in English.

  • The chat template is funky. It seems to mix up the and <|think|> tags in particular (why don't they just use ChatML?), and needs some wrangling with your own template.

  • Seems smart, can't say if it's better or worse than QwQ yet, other than it doesn't seem to "suffer" below 3.75bpw like QwQ does.

Also, I reposted this from /r/locallama, as I feel the community generally should do going forward. Given its spirit, it seems like we should be on Lemmy instead?

25
view more: next ›