Hey there ThorrJo, welcome to our community.
I recommend kobold.cpp as your first inference engine since it's very easy to get running, especially on Linux. Since you have no GPU, you don't need to worry about CUDA or Vulkan for offloading.
https://github.com/LostRuins/koboldcpp/
Read the koboldcpp wiki section on vision models and mmproj projectors. For the image recognition model itself I recommend NVIDIA's Cosmos-Reason1 finetune of Qwen2.5-VL. Make sure to load the Qwen2.5-VL mmproj file that kobold links alongside the main model (there's a launch sketch after the links below).
https://github.com/LostRuins/koboldcpp/wiki#what-is-llava-and-mmproj
https://huggingface.co/koboldcpp/mmproj/tree/main
https://huggingface.co/mradermacher/Cosmos-Reason1-7B-i1-GGUF
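If you'd rather launch from a script than the GUI, something like this should work. It's a rough sketch: the GGUF filenames are placeholders for whatever you actually download, and it's worth double-checking the flag names against `--help` for your version.

```python
# Sketch: start koboldcpp with the model + vision projector loaded.
# Assumes you're running the koboldcpp.py script; the standalone Linux
# binary takes the same flags. Filenames below are guesses, not exact.
import subprocess

subprocess.run([
    "python", "koboldcpp.py",
    "--model", "Cosmos-Reason1-7B.i1-Q6_K.gguf",           # whichever quant you grabbed
    "--mmproj", "mmproj-Qwen2.5-VL-7B-Instruct-f16.gguf",  # the matching projector file
    "--contextsize", "16384",  # as large as your RAM comfortably allows
    "--threads", "8",          # match your physical core count
])
```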
The GGUFs I linked are already pre-quantized; with 48GB of RAM you should easily be able to load the biggest quant available plus the f16 mmproj, with lots of room left over for context allocation. Allocate as much context as you can: larger, high-resolution images take more input context to process. Rough numbers below.
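As a back-of-envelope check (these are estimates based on typical quant sizes, not measured numbers):

```python
# Rough RAM estimate for a 7B model: a Q8_0 quant stores roughly
# 1 byte per weight plus a little overhead; smaller quants scale down.
model_params = 7e9          # Cosmos-Reason1-7B
bytes_per_weight = 1.07     # ~Q8_0; a Q4_K_M quant is closer to ~0.6
mmproj_gb = 1.4             # assumed f16 Qwen2.5-VL projector; check the real file size

model_gb = model_params * bytes_per_weight / 1e9
print(f"model ~{model_gb:.1f} GB + mmproj ~{mmproj_gb} GB")
# ~7.5 + 1.4 GB loaded, leaving 35+ GB of your 48 GB for context cache and the OS
```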
For troubleshooting: if its replies are wonky, try changing the chat template first (I forget if it's ChatML or something else for Qwen2.5-VL). You can try adjusting the sampler settings too.
Kobold.CPP runs a web interface you can connect to through the browser on multiple devices. It also exposes its backend through an OpenAI-compatible API, so you can write your own custom apps to send and receive, or use kobold with other frontend software that's compatible with OpenAI-style APIs, like tinychat, if you want to go further.
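For example, here's a minimal sketch of hitting that API from Python. It assumes koboldcpp's default port (5001) and that your build accepts OpenAI-style image_url content parts for vision; the image filename is just a placeholder. Check the wiki if the request format differs on your version.

```python
# Send an image + question to koboldcpp's OpenAI-compatible endpoint.
import base64
import requests  # pip install requests

with open("photo.jpg", "rb") as f:  # placeholder: any local image
    img_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:5001/v1/chat/completions",
    json={
        "model": "koboldcpp",  # koboldcpp serves whatever model it loaded
        "max_tokens": 300,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
            ],
        }],
    },
    timeout=600,  # CPU-only inference can be slow, give it time
)
print(resp.json()["choices"][0]["message"]["content"])
```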
If you have any specific questions or need help feel free to reach out :)