A few days ago ROCm 6.4 was officially added to the Arch repositories - which is great - but it made my current setup completely explode - which is less great - and right now I don't have the will to descend into gdb hell and claw my way back out...
So I've taken this opportunity to set up a podman (a docker alternative) container to use the older ROCm 6.3.3, which still works for me. On the plus side this has made it even easier to test new things and do random stuff: I will probably port my Vulkan setup too, at a later date.
Long story short I've decided to clean it up a bit, place a bunch of links and comments, and share it with you all in the hope it will help someone out.
You still need to handle the necessary requirements on your host system to make everything work, but I have complete trust in you!
Even if it doesn't work, it is a starting point that I hope will give some direction on what to do.
BTW I'm not an expert in this field, so some things can undoubtedly be improved.
Assumptions
- To make this simpler I will assume, and advise you to use, this kind of folder structure (a one-liner to create it is shown right after this list):
base_dir
├─ROCm_debian_dev
│ └─ Dockerfile
└─llamacpp_rocm6.33
├─ logs
│ └─ logfile.log
├─ workdir
│ └─ entrypoint.sh
├─ Dockerfile
└─ compose.yaml
- I've tested this on Arch Linux. You can probably make it work on basically any reasonably current distro, but it's untested.
- You should follow the basic requirements from the AMD documentation, and cross your fingers. You can probably find a more precise guide on your distro wiki. Or just install any and all ROCm and HIP related SDKs. Sigh.
- I'm using podman, which is an alternative to docker. It has some idiosyncrasies - which I will not get into because they would require another full write-up - so if you use docker it is possible you'll need to modify some things. I can't help you there.
- This is given with no warranty: if your computer catches on fire, it is on you (code MIT/Apache 2 license, the one you prefer; text CC BY-SA 4.0). More at the end.
- You should know which 'generation' your card is. ROCm works in mysterious ways and each card has its own problems. Generally you can just steamroll forward without much care, but you still need to find which `HSA_OVERRIDE_GFX_VERSION` your card needs to run under. For example, for an rx6600xt/rx6650xt it would be `gfx1030` and `HSA_OVERRIDE_GFX_VERSION=10.3.0`. Some info here: Compatibility Matrix. You can (not so) easily search for the correct gfx and HSA codes on the web, and there's a quick check right after this list. I don't think the 9xxx series is currently supported, but I could be wrong.
- There's an official Docker image in the llama.cpp repository, you could give that one a go. Personally I like building them myself, so I understand what is going on when I inevitably bleed on the edge - in fact I didn't even consider the existence of an official Dockerfile until after writing this post. Whelp. Still, they are two different approaches, pick your poison.
Dockerfile(s)
These can, at a high level, be described as the recipe with which we will set up the container that will compile and run llama.cpp for us.
I will put two Dockerfiles here: one can be used as a fixed base, while the second one can be re-built every time you want to update llama.cpp.
Now, this will create a new image each time. We could use a volume (like a virtual directory shared between the host machine and the container) to just `git pull` the new code instead of cloning, but that would throw away most of the point of running this in a container.
TLDR: For now don't overthink it and go with the flow.
Base image
This is a pretty basic recipe: it gets the official dev-ubuntu image by AMD and then augments it to suit our needs. You can easily use other versions of ROCm (for example dev-ubuntu-24.04:6.4-complete) or even of ubuntu. You can find the filtered list of the images here: Link
Could we use a lighter image? Yes. Should we? Probably. Maybe next time.
tbh I've tried other images with no success, or they needed too much effort for a minimal reward: this Just Works™. YMMV.
base_dir/ROCm_debian_dev/Dockerfile
# This is the one that currently works for me, you can
# select a different one:
# https://hub.docker.com/r/rocm/dev-ubuntu-24.04/tags
FROM docker.io/rocm/dev-ubuntu-24.04:6.3.3-complete
# 6.4.0
# FROM docker.io/rocm/dev-ubuntu-24.04:6.4-complete
# We update and then install some stuff.
# In theory we could delete more things to make the final
# image slimmer.
RUN apt-get update && apt-get install -y \
build-essential \
git \
cmake \
libcurl4-openssl-dev \
&& rm -rf /var/lib/apt/lists/*
It is a big image, over 30GB on disk (around 6GB to download for 6.3.3-complete and around 4GB for 6.4-complete).
Let's build it:
cd base_dir/ROCm_debian_dev/
podman build -t rocm-6.3.3_ubuntu-dev:latest .
This will build it and add it to your local images (you can see them with `podman images`) with the name `rocm-6.3.3_ubuntu-dev` and the tag `latest`. You can change them as you see fit, obviously.
You can even give multiple tags to the same image: a common way is to use a more specific tag and then also add the `latest` tag to the last one you have generated, so you don't have to change the other scripts that reference it.
More info here: podman tag
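For example, to build with a dated tag and then point `latest` at it (the date here is just a placeholder):

podman build -t rocm-6.3.3_ubuntu-dev:2025-05-10 .
podman tag rocm-6.3.3_ubuntu-dev:2025-05-10 rocm-6.3.3_ubuntu-dev:latest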
The real image
The second image is the one that will handle the compilation and then the execution of llama.cpp (server and bench), and you need to customize it:
- You should modify the number after the `-j` based on the number of virtual cores that your CPU has, minus one. You can use `nproc` in a terminal to check it. (If you would rather not hardcode the value, see the build-arg sketch after the build command below.)
- You have to change the AMDGPU_TARGETS code based on your gfx version! Pay attention, because the correct one is probably not the one returned by `rocminfo`: for example the rx6650xt is `gfx1032`, but that is not directly supported by ROCm. You have to use the supported (and basically identical) `gfx1030` instead.
If you want to compile with a ROCm image newer than 6.3 you need to swap the commented lines. Still, I have no idea if it works or if it is even supported by llama.cpp.
More info, and some tips, here: Link
base_dir/llamacpp_rocm6.33/Dockerfile
FROM localhost/rocm-6.3.3_ubuntu-dev:latest
# This could be shortened, but I like to have multiple
# steps to make it clear, and show how to achieve
# things in different ways.
WORKDIR /app
RUN git clone https://github.com/ggml-org/llama.cpp.git
WORKDIR /app/llama.cpp
RUN mkdir build_hip
WORKDIR build_hip
# This will run the cmake configuration.
# Pre 6.4 -DAMDGPU_TARGETS=gfx1030
RUN HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" cmake -S .. -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1030 -DCMAKE_BUILD_TYPE=Release
# Post 6.4 -DGPU_TARGETS=gfx1030
# RUN HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" cmake -S .. -DGGML_HIP=ON -DGPU_TARGETS=gfx1030 -DCMAKE_BUILD_TYPE=Release
# Here we build the binaries, both for the server and the bench.
RUN cmake --build . --config Release -j7 --target llama-server
RUN cmake --build . --config Release -j7 --target llama-bench
To build this one we will need to use a different command:
cd base_dir/llamacpp_rocm6.33/
podman build --no-cache -t rocm-6.3.3_llamacpp:b1234 .
As you can see we have added the `--no-cache` flag: this is to make sure that the image actually gets rebuilt, otherwise podman would just keep returning the same image over and over from the cache - because the recipe didn't change.
This time the tag is a `b1234` placeholder: you should use the current release build number or the current short commit hash of llama.cpp (you can easily find them when you start the binary, or on the GitHub page) to remember at which point you compiled it, and use the dynamic `latest` tag as a supplementary bookmark. The current date is a good candidate too.
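As a side note: if you would rather not hardcode the `-j` value and the gfx target in the Dockerfile, you could turn them into build arguments. This is just a sketch on top of the Dockerfile above, and `BUILD_JOBS`/`GFX_TARGET` are names I made up:

# In base_dir/llamacpp_rocm6.33/Dockerfile, after the FROM line
ARG BUILD_JOBS=7
ARG GFX_TARGET=gfx1030
# ...then use them in the configure and build steps:
RUN HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" cmake -S .. -DGGML_HIP=ON -DAMDGPU_TARGETS=${GFX_TARGET} -DCMAKE_BUILD_TYPE=Release
RUN cmake --build . --config Release -j${BUILD_JOBS} --target llama-server
RUN cmake --build . --config Release -j${BUILD_JOBS} --target llama-bench

The build command then becomes something like:

podman build --no-cache --build-arg BUILD_JOBS=$(( $(nproc) - 1 )) --build-arg GFX_TARGET=gfx1030 -t rocm-6.3.3_llamacpp:b1234 .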
If something doesn't feel right - for example your GPU is not being used when you make a request to the server - you should go back and read the logs of the cmake configuration step, to check that everything required has been correctly set up and there are no errors.
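A quick way to check that the container can see the GPU at all, before llama.cpp even enters the picture, is to run rocminfo inside the base image with the same devices and group the compose file below will use (a sketch; adjust the image name if you changed it):

podman run --rm --device /dev/kfd --device /dev/dri --group-add video localhost/rocm-6.3.3_ubuntu-dev:latest rocminfo | grep -i gfx

If your card shows up in the output, the host side plumbing is fine.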
Let's compose it up
Now that we have two images that have built without any kind of error, we can use them to reach our goal.
I've heavily commented it, so just read and modify it directly.
Don't worry too much about all the lines, but if you are curious - and you should be - you can easily search for them and find a bunch of explanations that are surely better than what I could write here without taking up too much space.
Since it is a yaml file - bless the soul of whoever decided that - pay attention to the whitespace! It matters!
We will use two volumes: one will point to the folder where you have downloaded your GGUF files, the second one will point to where we have the `entrypoint.sh` file. We are putting the script into a volume instead of baking it into the container so you can easily modify it, to experiment.
A small model that you could use as a benchmark to see if everything is working is Meta-Llama-3.1-8B-Instruct-IQ4_XS.gguf.
base_dir/llamacpp_rocm6.33/compose.yaml
# Benchmark image: https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/blob/main/Meta-Llama-3.1-8B-Instruct-IQ4_XS.gguf
# benchmark command:
# ./bin/llama-bench -t 7 -m /app/models/Meta-Llama-3.1-8B-Instruct-IQ4_XS.gguf -ngl 99 -fa 1 -ctk q4_0 -ctv q4_0
# ./bin/llama-bench -t 7 -m /app/models/Meta-Llama-3.1-8B-Instruct-IQ4_XS.gguf -ngl 99
services:
llamacpp-server:
# If you have renamed the image, change it here too!
image: localhost/rocm-6.3.3_llamacpp:latest
# The subsequent two lines are needed to enter the image and directly use bash:
# start it with [podman-compose up -d|docker compose up -d]
# and then docker attach to the container with
# [podman|docker] attach ID
# You'll need to change the entrypoint.sh file too, just with the
# shebang and a line straight up calling `bash`, as content.
stdin_open: true
tty: true
    # End of the bash section. Comment those two lines out if you don't need
    # shell access. Or leave them.
group_add:
    # The video group is needed on most distros to access the GPU;
    # the render group is not present in some and needed in others.
    # Try it out.
      - "video"    # 985 # video group
      # - "render" # 989 # render group
environment:
# FIXME: Change this with the right one!
# If you have a wrong one it will _not work_.
- HSA_OVERRIDE_GFX_VERSION=10.3.0
devices:
- /dev/kfd:/dev/kfd
- /dev/dri:/dev/dri
cap_add:
- SYS_PTRACE
logging:
# The default logging driver is journald, which I despise
# because it can pollute it up pretty hard.
#
# The none driver will not save the logs anywhere.
# You can still attach to the container, but you will lose
# the lines before the attachment.
# driver: none
#
# The json-file option is deprecated, so we will use the
# k8s-file one.
# You can use `podman-compose logs -f` to keep tabs, and it will not
# pollute the system journal.
# Remember to `podman-compose down` to stop the container.
# `ctrl+c`ing the logs will do nothing.
driver: k8s-file
options:
max-size: "10m"
max-file: "3"
# You should probably use an absolute path.
# Really.
path: ./logs/logfile.log
# This is mostly a fix for how podman net stack works.
    # If you are offline when starting the container it would just not
    # start, erroring out. Putting it in host network mode solves this,
    # but it has other cons.
# Reading the issue(https://github.com/containers/podman/issues/21896) it is
# probably fixed, but I still have to test it out.
    # It mainly means that you can't have multiple of these running, because they would
    # all take the same port. Luckily you can change the port from the llama-server
    # command in the entrypoint.sh script.
network_mode: "host"
ipc: host
security_opt:
- seccomp:unconfined
# These you really need to CHANGE.
volumes:
# FIXME: Change these paths! Only the left side before the `:`.
# Use absolute paths.
- /path/on/your/machine/where/the/ggufs/are:/app/models
- /path/to/rocm6.3.3-llamacpp/workdir:/workdir
# It doesn't work with podman-compose
# restart: no
entrypoint: "/workdir/entrypoint.sh"
# To make it easy to use I've added a number of env variables
# with which you can set the llama.cpp command params.
# More info in the bash script, but they are quite self explanatory.
command:
- "${MODEL_FILENAME:-Meta-Llama-3.1-8B-Instruct-IQ4_XS.gguf}"
- "${GPU_LAYERS:-22}"
- "${CONTEXT_SIZE:-8192}"
- "${CALL_TYPE:-bench}"
- "${CPU_THREADS:-7}"
Now that you have meticulously modified the above file, let's talk about the script that will launch llama.cpp.
base_dir/llamacpp_rocm6.33/workdir/entrypoint.sh
#!/bin/bash
cd /app/llama.cpp/build_hip || exit 1
MODEL_FILENAME=${1:-"Meta-Llama-3.1-8B-Instruct-IQ4_XS.gguf"}
GPU_LAYERS=${2:-"22"}
CONTEXT_SIZE=${3:-"8192"}
CALL_TYPE=${4:-"server"}
CPU_THREADS=${5:-"7"}
if [ "$CALL_TYPE" = "bench" ]; then
./bin/llama-bench -t "$CPU_THREADS" -m /app/models/"$MODEL_FILENAME" -ngl "$GPU_LAYERS"
elif [ "$CALL_TYPE" = "fa-bench" ]; then
./bin/llama-bench -t "$CPU_THREADS" -m /app/models/"$MODEL_FILENAME" -ngl "$GPU_LAYERS" -fa 1 -ctk q4_0 -ctv q4_0
elif [ "$CALL_TYPE" = "server" ]; then
./bin/llama-server -t "$CPU_THREADS" -c "$CONTEXT_SIZE" -m /app/models/"$MODEL_FILENAME" -fa -ngl "$GPU_LAYERS" -ctk q4_0 -ctv q4_0
else
echo "Valid modalities are \"bench\", \"fa-bench\" or \"server\""
exit 1
fi
exit 0
This is straightforward. It enters the folder (inside the container) where we built the binaries and then calls the right command, based on the arguments passed in from the compose file.
I've set it up to handle some common options, so you don't have to change the script every time you want to run a different model or change the number of layers loaded on VRAM.
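One thing that is easy to forget: since the script lives on the host and is executed directly as the entrypoint, it needs to be executable, otherwise the container will die immediately with a permission error.

chmod +x base_dir/llamacpp_rocm6.33/workdir/entrypoint.sh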
The beauty of it is that you could put a `.env` file in the `llamacpp_rocm6.33` folder with the params you want to use, and just start the container. An example `.env` file could be:
base_dir/llamacpp_rocm6.33/.env
MODEL_FILENAME=Meta-Llama-3.1-8B-Instruct-IQ4_XS.gguf
GPU_LAYERS=99
CONTEXT_SIZE=8192
CALL_TYPE=bench
CPU_THREADS=7
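If podman-compose picks up your shell environment for the interpolation (it should, and shell variables normally take precedence over the `.env` file), you can also override a single value on the fly without editing anything:

CALL_TYPE=server GPU_LAYERS=22 podman-compose up -d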
Some notes:
- For now it uses flash attention by default with a quantized context. You can avoid this by deleting the `-fa` and the `-ctk q4_0 -ctv q4_0` flags. Experiment around.
- You could add more params or environment variables: it is easy to do. How about one for the port number? (There's a sketch of exactly that right after this list.)
- Find more info about llama.cpp server here: Link.
- And the bench here: Link.
- For now I've set up three modes: `server`, `bench` and `fa-bench`. One is the server, one is a plain bench and the last is a bench with FlashAttention enabled.
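Since I teased it above, here is roughly what adding a port parameter could look like. `SERVER_PORT` is a name I just made up; llama-server itself takes the port with `--port`. In entrypoint.sh you would add a sixth positional parameter and pass it along in the server branch:

SERVER_PORT=${6:-"8080"}
./bin/llama-server -t "$CPU_THREADS" -c "$CONTEXT_SIZE" -m /app/models/"$MODEL_FILENAME" -fa -ngl "$GPU_LAYERS" -ctk q4_0 -ctv q4_0 --port "$SERVER_PORT"

Then add - "${SERVER_PORT:-8080}" as the last entry under command: in compose.yaml, and optionally a SERVER_PORT=8081 line to your .env file.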
Time to start it
Starting it is just a command away:
cd base_dir/llamacpp_rocm6.33/
podman-compose up -d
podman-compose logs -f
When everything is completely loaded, open your browser and go to http://127.0.0.1:8080/ to be welcomed by the llama.cpp webui, and check whether the GPU is being used. (I've got my fingers crossed for you!)
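If you prefer the terminal over the webui, a couple of curl calls are enough to check that the server is answering (llama-server exposes a /health endpoint and an OpenAI-compatible API):

curl http://127.0.0.1:8080/health
curl http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"Say hi in five words."}]}'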
Now that everything is working, have fun with your waifus and/or husbandos! ..Sorry, I meant, be productive with your helpful assistant!
When you are done, in the same folder, run `podman-compose down` to mercilessly kill them off.
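Since every llama.cpp update produces a brand new (and not exactly small) image, it is also worth pruning the old ones from time to time, for example:

podman images
podman rmi localhost/rocm-6.3.3_llamacpp:b1234
# or remove every image not referenced by any container:
podman image prune -a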
Licensing
I know, I know. But better safe than sorry.
All the code, configurations and comments in them, when not otherwise already under other licenses or under copyright by others, are dual licensed under the MIT and Apache 2 licenses, Copyright 2025 [Mechanize@feddit.it](https://feddit.it/u/Mechanize). Take your pick.
All the other text of the post © 2025 by Mechanize@feddit.it is licensed under CC BY-SA 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by-sa/4.0/