I'm trying to build a distributed LLM inference platform with Hugging Face support. Python handles the model inference and Java interfaces with external systems. Below is the Python code that receives input text from the Java program, runs it through a pre-trained LLM, and returns the generated text:

```python
import socket
import sys
import threading

from transformers import pipeline

generator = pipeline('text-generation', model='gpt2-large', device="cuda")

def process_input(input_text):
    request = generator(input_text, min_length=200)
    return request[0]["generated_text"]

def handle_connection(conn):
    with conn:
        data = conn.recv(10240).decode()
        processed_data = process_input(data.strip())
        conn.sendall(processed_data.encode())

PORT = int(sys.argv[1])
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.bind(('localhost', PORT))
    s.listen()
    while True:
        conn, addr = s.accept()
        thread = threading.Thread(target=handle_connection, args=(conn,))
        thread.start()
```

As you can see, it accepts socket connections from the Java process, and each connection is handled in its own thread: the thread receives the text, sends back the generated result, and closes the connection. All threads share a single pipeline instance, since the model takes too much memory to load more than once.
On the Java side, I have an LLMProcess class that handles spawning the Python process and communicating with it, using one thread per request.
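
For illustration, here is a simplified sketch of how the request path of LLMProcess could look, assuming one TCP connection per request to the Python server (the real class also manages starting the Python process; HOST and PORT are placeholders, not the actual configuration):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class LLMProcess {
    // Placeholder values; the real class reads these from its configuration
    // and is also responsible for launching the Python server process.
    private static final String HOST = "localhost";
    private static final int PORT = 5000;

    // Sends the input text over a new socket connection and returns the
    // generated text produced by the Python server.
    public String request(String inputText) {
        try (Socket socket = new Socket(HOST, PORT)) {
            OutputStream out = socket.getOutputStream();
            out.write(inputText.getBytes(StandardCharsets.UTF_8));
            out.flush();
            socket.shutdownOutput(); // signal end of the request payload

            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));
            StringBuilder response = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                response.append(line).append('\n');
            }
            return response.toString().trim();
        } catch (Exception e) {
            throw new RuntimeException("Request to Python server failed", e);
        }
    }
}
```

The essential point is that every call to request opens its own connection, sends the text, and blocks until the Python side writes the generated text back and closes the socket.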
```java
LLMProcess process = new LLMProcess();
for (int i = 0; i < 50; i++) {
    int index = i;
    Thread thread = new Thread(() -> {
        System.out.println(index + " : " + process.request("Sample text"));
        System.out.flush();
    });
    thread.start();
}
```

However, when I fire a large number of requests at once, they are processed essentially sequentially, and the threads add overhead instead of letting the LLM pipeline and the GPU work concurrently.
The goal is to minimize the threading overhead and make full use of the GPU. Even though the pipeline runs on the GPU, its load stays minimal during execution, typically below 3%.