Distributed llama.cpp inference via RPC

The idea for this article has been on my mind for a long time. One of my hobbies is distributed computing, another is neural networks, and I have long been taken with the idea of running LLM inference on several computers at once so that they all work on the same model in parallel.

After googling for a while, I found out that the LocalAI project has supported this for quite some time, so I quickly deployed it on several computers, did all the necessary configuration to link the instances into a single system, and was, to put it mildly, disappointed. The solution turned out to be "fatally insufficient": the Docker image was built suboptimally, huge in size and amd64-only; the bundled web interface could not be disabled; the selection of models was meager; some of the available LLMs did not work in RPC mode; all embedding models also refused to run in this mode; and so on and so forth.

After digging around a bit more, I went into the source code, found a mention of the llama.cpp project, and then a call to the rpc-server binary. That is how I ended up on the llama.cpp/examples/rpc page, and everything started spinning...

Brief(?) overview

Let's first ask GigaChat what the RPC protocol is:

The RPC (Remote Procedure Call) protocol allows programs to call functions or procedures in another address space, on remote nodes, or in independent systems on the same node. It includes a network protocol for data exchange in client-server mode and an object serialization language for encoding data when transmitting it over the network.
There are various implementations of RPC, including SOA, CORBA, and DCOM. TCP and UDP protocols are often used for the transport layer, but there are also HTTP-based implementations. Examples of RPC implementations are XML-RPC, which uses XML for message encoding and HTTP as a transport mechanism, and gRPC, which uses HTTP/2 and Protocol Buffers to describe interfaces. RPC is widely used in various network services, including NFS.

In the llama.cpp project, this protocol is implemented in a client-server format: utilities such as llama-server, llama-cli, llama-embedding, and so on act as RPC clients, while the specialized rpc-server binary acts as the RPC server.

Briefly, it all works as follows (a command-line sketch follows the list):

  1. An RPC client, say llama-server, receives a list of RPC servers and a model through command line arguments at startup;

  2. The RPC client reads the model and then "slices" its layers so that they are evenly distributed among all RPC servers;

  3. Then the RPC client distributes the layers across the servers and starts inference.
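To make the roles concrete, here is a minimal command-line sketch of the same scheme; the host names and the model path are placeholders, and the exact options are explained later in the article:

# On each worker node: start an RPC server (memory limit in megabytes)
rpc-server --host 0.0.0.0 --port 50052 --mem 2048

# On the client node: point any llama.cpp tool at the workers via --rpc
llama-cli \
  --model ./models/TinyLlama-1.1B-q4_0.gguf \
  --rpc host-a:50052,host-b:50052 \
  --gpu-layers 99 \
  -p "Hello,"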

Schematically, it looks like this: one RPC client talks over the network to several rpc-server instances, each of which holds its share of the model's layers.

At the same time, rpc-server can be built for different backends: different processor architectures, with different feature sets. For example, you can build one RPC server for x86_64 with CUDA support, another for x86_64 without CUDA, and a third for ARM64 to run on a RepkaPi 3, and the RPC client will happily work with all of them at once and perform inference.
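Jumping ahead a little, with the make-based build described in the next section these variants boil down to something like the following; the GGML_CUDA switch is my assumption based on the llama.cpp Makefile at the time of writing, and cross-compilation for ARM is covered separately below:

# CPU-only RPC server (for x86_64 or ARM64, depending on the build host)
GGML_RPC=ON make rpc-server

# RPC server with CUDA support (assumes the Makefile's GGML_CUDA switch)
GGML_RPC=ON GGML_CUDA=1 make rpc-server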

Building binaries

After carefully studying the instructions for building both the server and clients, I concluded that at least four binary files are needed to solve the task:

  • llama-cli - a command-line utility that allows running LLM inference;

  • llama-embedding - a command-line utility that allows running embedding model inference;

  • llama-server - a very simple API server that can work both in LLM inference mode and in embedding model inference mode;

  • rpc-server - a binary that will run on remote machines and perform all the inference work.

So, in short, the llama.cpp build can be done in three simple steps.

  1. Install the packages needed for the build:

apt install -fyq bash wget git make g++
  2. Clone the repository to your host and navigate to the source directory:

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
  3. Start the compilation (the instructions provide an example using cmake, but I prefer make):

GGML_RPC=ON make llama-server llama-cli llama-embedding rpc-server libggml.so libllama.so

It is important to set the GGML_RPC=ON variable before make (you can also use export, but I find the inline form more convenient). This variable enables the code paths that add RPC support to the resulting binaries.

After the compilation is complete, the executable binary files listed after make will appear in the directory.

Building Docker Images

The ability to compile binaries for different architectures is certainly useful, but what if we have, say, a dozen computers and virtual machines or a Kubernetes cluster? We won't be compiling on each node, will we? Of course not! Instead, we will use DevOps practices and build the binaries into Docker images.

For the sake of unification, the official library image Ubuntu 22.04 LTS was chosen as the base, since it is also what the nvidia/cuda base containers use.

I decided to use a multi-stage build divided into two stages for the project.

In the first stage, everything needed for compilation is downloaded and the compilation itself is performed:

FROM ubuntu:22.04 AS builder
WORKDIR /app

ARG LLAMACPP_REPO="https://github.com/ggerganov/llama.cpp.git"
ARG LLAMACPP_VERSION="master"

# Install dependencies
RUN apt update -q \
 && apt install -fyq bash wget git make g++ \
 && apt clean

# Clone repo
RUN git clone --branch "$LLAMACPP_VERSION" --depth 1 "$LLAMACPP_REPO"

# Build binaries
WORKDIR /app/llama.cpp
RUN GGML_RPC=ON make -j$(nproc) llama-server llama-cli llama-embedding rpc-server libggml.so libllama.so

And in the second stage, the compiled binaries are copied into a clean base image:

FROM ubuntu:22.04
WORKDIR /app

# Install basic dependencies
RUN apt update -q \
 && apt install -fyq libgomp1 \
 && apt clean

# Create folders
RUN mkdir -pv /app/models

# Copy compiled tools  
COPY --from=builder /app/llama.cpp/libllama.so /usr/lib/x86_64-linux-gnu
COPY --from=builder /app/llama.cpp/libggml.so /usr/lib/x86_64-linux-gnu
COPY --from=builder /app/llama.cpp/rpc-server .
COPY --from=builder /app/llama.cpp/llama-cli .
COPY --from=builder /app/llama.cpp/llama-embedding .
COPY --from=builder /app/llama.cpp/llama-server .

# Init entrypoint  
ADD entrypoint.sh .  
ENTRYPOINT ["/app/entrypoint.sh"]

The full Dockerfile code is in the GitHub repository.

Building Docker images with CUDA support

There are no fundamental differences from the Dockerfile based on library ubuntu, except that the first build stage uses the nvidia/cuda devel image and the second stage uses the nvidia/cuda runtime image.

# Stage 1
FROM nvidia/cuda:12.5.1-devel-ubuntu22.04 AS builder
# Stage 2
FROM nvidia/cuda:12.5.1-runtime-ubuntu22.04

The full Dockerfile.cuda code is in the GitHub repository.

About entrypoint.sh

Since I wanted a universal container that can be used in several modes, I had to implement a dedicated entrypoint.sh script that runs every time the container starts.

The container is planned to work in the following modes:

backend
The mode in which rpc-server is launched; the start command looks like this:

rpc-server --host "0.0.0.0" --port "50052" --mem "1024"

Note the somewhat unusual --mem option, which specifies how much memory (in megabytes) this RPC server may use. If rpc-server is built with CUDA support, the parameter limits the amount of VRAM (video memory); without CUDA support, it limits the amount of RAM (system memory).

server
The mode in which llama-server is launched: a simple API server that lets you interact with large (and small) language models as well as embedding models. The launch command looks as follows:

llama-server --host "0.0.0.0" --port "8080" --model "/app/models/TinyLlama-1.1B-q4_0.gguf" --gpu-layers 99 --rpc backend01:50052,backend02:50052

Pay attention to the --gpu-layers option: under normal circumstances it specifies how many layers may be loaded into the video card's memory, but when the --rpc option is given, its behavior changes and it specifies how many layers may be offloaded to the RPC servers.

With the --rpc option, we list, separated by commas, the hosts and ports of the RPC servers the RPC client will connect to.

none
A special mode that simply runs sleep inf, so that you can attach to the container and manually launch llama-cli or, say, llama-embedding.
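For example (the container name is a placeholder, and the backend addresses are assumed to match the compose example further below):

# Attach to a container running in "none" mode
docker exec -it <container-name> bash

# Inside the container, run one-off inference by hand
/app/llama-cli \
  --model /app/models/TinyLlama-1.1B-q4_0.gguf \
  --rpc backend-cpu:50052,backend-cuda:50052 \
  --gpu-layers 99 \
  -p "The capital of France is"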

If you put all this together in one script, you get a universal entrypoint.sh.
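A rough sketch of such a script (not necessarily the author's exact code; the port numbers and layer count are hardcoded here for brevity) boils down to a case over the APP_MODE variable:

#!/usr/bin/env bash
set -e

case "${APP_MODE:-none}" in
  backend)
    # RPC worker: listen on 50052 and limit usable memory (MiB) via APP_MEM
    exec /app/rpc-server --host "0.0.0.0" --port "50052" --mem "${APP_MEM:-1024}"
    ;;
  server)
    # API server: load the model and offload its layers to the RPC backends
    exec /app/llama-server \
      --host "0.0.0.0" --port "8080" \
      --model "${APP_MODEL}" \
      --gpu-layers 99 \
      --rpc "${APP_RPC_BACKENDS}"
    ;;
  *)
    # "none" (or anything else): keep the container alive for manual runs
    exec sleep inf
    ;;
esac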

Cross-platform Docker image build

One of the curious features of the library ubuntu image is that it supports many processor architectures. I was primarily interested in amd64, arm64, and arm/v7: the first for obvious reasons, the last two so that the RPC server can run on microcomputers. The nvidia/cuda image, however, is only available for the amd64 and arm64 architectures.

The build itself will be performed with the docker buildx plugin, which extends Docker's basic functionality. In our case, we are only interested in its ability to cross-build images, since the ARM64 build is planned to run on an x86_64 processor.

So, first, let's create a buildx builder; let's call it, say, my_builder.

docker buildx create --name my_builder --driver=docker-container

Next, let's assume that the Dockerfile and entrypoint.sh files are located in a directory called llama.cpp:

docker buildx build --builder=my_builder --platform=linux/amd64,linux/arm64,linux/arm/v7 --build-arg LLAMACPP_VERSION=master ./llama.cpp/

Here we see that the build is happening for three architectures, and the version used is HEAD from the master branch of the llama.cpp repository.

By adding the options --tag=${owner}/${repo}:${tag} and --push, we can tag the images and push them to the registry.
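For example (the image name and tag here are illustrative):

docker buildx build \
  --builder=my_builder \
  --platform=linux/amd64,linux/arm64,linux/arm/v7 \
  --build-arg LLAMACPP_VERSION=master \
  --tag=evilfreelancer/llama.cpp-rpc:latest \
  --push \
  ./llama.cpp/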

A complete example of building and publishing the containers with GitHub Actions can be found in the GitHub repository.

Running through Docker Compose

So, let's assume we have built several images, pushed them to Docker Hub, and now want to run all of this on our hardware. Say we have two servers: one with a graphics card, but only 1 GB of VRAM, and another without a graphics card that can spare only 2 GB of RAM. We plan to run the TinyLlama 1.1B model on them, with the user interacting with the API server.

Schematically, the user talks to the API server (llama-server) in one container, which offloads the model's layers to two RPC backends: one with CUDA and one CPU-only.

As a result, we get the following docker-compose.yml:

version: "3.9"

services:

  main:
    image: evilfreelancer/llama.cpp-rpc:latest
    restart: unless-stopped
    volumes:
      - ./models:/app/models
    environment:
      APP_MODE: server
      APP_MODEL: /app/models/TinyLlama-1.1B-q4_0.gguf
      APP_RPC_BACKENDS: backend-cuda:50052,backend-cpu:50052
    ports:
      - "127.0.0.1:8080:8080"

  backend-cpu:
    image: evilfreelancer/llama.cpp-rpc:latest
    restart: unless-stopped
    environment:
      APP_MODE: backend
      APP_MEM: 2048

  backend-cuda:
    image: evilfreelancer/llama.cpp-rpc:latest-cuda
    restart: "unless-stopped"
    environment:
      APP_MODE: backend
      APP_MEM: 1024
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [ gpu ]

Next, you need to create a models directory next to docker-compose.yml and download the TinyLlama-1.1B-q4_0.gguf file into it.
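For example (the download URL is a placeholder; use any q4_0 GGUF build of TinyLlama 1.1B you trust):

mkdir -p ./models
# Hypothetical URL: substitute the actual location of the GGUF file
wget -O ./models/TinyLlama-1.1B-q4_0.gguf \
  "https://huggingface.co/<owner>/<repo>/resolve/main/<file>.gguf"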

Start the stack with the command:

docker compose up -d

Then wait a little, and once the stack is up, you can try running inference through curl:

curl \
    --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:"}'

The response will be something like this:
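(Abridged and illustrative; the exact set of fields and the generated text depend on the llama.cpp version and the model.)

{
  "content": " 1. Choose a domain name and a hosting provider...",
  "model": "/app/models/TinyLlama-1.1B-q4_0.gguf",
  "tokens_predicted": 128,
  "tokens_evaluated": 13,
  "stop": true
}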

What's next?

In principle, the project is already usable: it has everything you need, and whatever is missing can easily be added later.

Of particular interest, I would point you to this small PR in the ollama project (still pending at the time of publication) and this discussion, also in the ollama issue tracker. In short, the developers want to add the ability to perform distributed inference on RPC backends, similar to what was demonstrated in this publication. So in the future I plan to try integrating ollama with my Docker containers.

I also plan to use these containers in Kubernetes, so most likely I will soon prepare a k8s operator, or simply a deployment packaged as a Helm chart, to simplify rolling the servers out across nodes.

I also have a pile of microcomputers gathering dust in storage, as well as two TuringPi v1 boards for clustering Raspberry Pi CM3 modules; I plan to experiment with them in the future too, which is why arm/v7 is among the container architectures listed above.

In general, there is plenty of work to do, if only there were time...

With that, I take my leave. Thank you for reading the article to the end; if you are interested in what happens with this project next, you are welcome to join my Telegram channel @evilfreelancer.

Links

  • https://github.com/EvilFreelancer/docker-llama.cpp-rpc

  • https://hub.docker.com/r/evilfreelancer/llama.cpp-rpc

Other:

  • https://github.com/ggerganov/ggml/pull/761

  • https://github.com/ggerganov/llama.cpp/issues/7293

  • https://github.com/ggerganov/llama.cpp/pull/6829

  • https://github.com/ggerganov/llama.cpp/tree/master/examples/rpc

  • https://github.com/mudler/LocalAI/commit/fdb45153fed10d8a2c775633e952fdf02de60461

  • https://github.com/mudler/LocalAI/pull/2324

  • https://github.com/ollama/ollama/issues/4643

  • https://github.com/ollama/ollama/pull/6729
