I'm trying to build a distributed LLM inference platform with Hugging Face support. Python handles the model inference and Java interfaces with external systems. Below is the Python code that receives input text from the Java program, runs it through a pre-trained LLM, and returns the generated text:

```python
import socket
import sys
import threading

from transformers import pipeline

generator = pipeline('text-generation', model='gpt2-large', device="cuda")

def process_input(input_text):
    request = generator(input_text, min_length=200)
    return request[0]["generated_text"]

def handle_connection(conn):
    with conn:
        data = conn.recv(10240).decode()
        processed_data = process_input(data.strip())
        conn.sendall(processed_data.encode())

PORT = int(sys.argv[1])
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.bind(('localhost', PORT))
    s.listen()
    while True:
        conn, addr = s.accept()
        thread = threading.Thread(target=handle_connection, args=(conn,))
        thread.start()
```

As you can see, it accepts socket connections from the Java process, and each connection is handled in its own thread: the thread receives the text, sends back the generated result, and closes the connection. All threads share a single pipeline instance, since the model takes too much memory to load more than once.
On the Java side, I have an LLMProcess class that handles spawning the Python process and communicating with it, using one thread per request.
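
For illustration, here is a simplified sketch of how the request path of LLMProcess could look, assuming one TCP connection per request to the Python server (the real class also manages starting the Python process; HOST and PORT are placeholders, not the actual configuration):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class LLMProcess {
    // Placeholder values; the real class reads these from its configuration
    // and is also responsible for launching the Python server process.
    private static final String HOST = "localhost";
    private static final int PORT = 5000;

    // Sends the input text over a new socket connection and returns the
    // generated text produced by the Python server.
    public String request(String inputText) {
        try (Socket socket = new Socket(HOST, PORT)) {
            OutputStream out = socket.getOutputStream();
            out.write(inputText.getBytes(StandardCharsets.UTF_8));
            out.flush();
            socket.shutdownOutput(); // signal end of the request payload

            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));
            StringBuilder response = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                response.append(line).append('\n');
            }
            return response.toString().trim();
        } catch (Exception e) {
            throw new RuntimeException("Request to Python server failed", e);
        }
    }
}
```

The essential point is that every call to request opens its own connection, sends the text, and blocks until the Python side writes the generated text back and closes the socket.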
```java
LLMProcess process = new LLMProcess();
for (int i = 0; i < 50; i++) {
    int index = i;
    Thread thread = new Thread(() -> {
        System.out.println(index + " : " + process.request("Sample text"));
        System.out.flush();
    });
    thread.start();
}
```

However, when I fire a large number of requests at once, they are processed essentially sequentially, and the threads add overhead instead of letting the LLM pipeline and the GPU work concurrently.
The goal is to minimize the threading overhead and make full use of the GPU. Even though the pipeline runs on the GPU, its load stays minimal during execution, typically below 3%.