61 questions
18 votes
7 answers
77k views
Error while installing Python package: llama-cpp-python
I am using Llama to create an application. Previously I used OpenAI, but I am looking for a free alternative. Based on my limited research, this library provides OpenAI-like API access, making it quite ...
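A minimal sketch of the OpenAI-style chat API the question refers to (the model path is a placeholder):

    from llama_cpp import Llama

    # Any local GGUF model file; the path here is a placeholder.
    llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf")

    # create_chat_completion mirrors OpenAI's chat-completion response shape.
    response = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Hello, who are you?"}]
    )
    print(response["choices"][0]["message"]["content"])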
13 votes
11 answers
77k views
llama-cpp-python not using NVIDIA GPU CUDA
I have been playing around with oobabooga text-generation-webui on my Ubuntu 20.04 with my NVIDIA GTX 1060 6GB for some weeks without problems. I have been using llama2-chat models sharing memory ...
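For reference, GPU offload in llama-cpp-python is opt-in; a sketch of the relevant parameter, assuming a CUDA-enabled build (model path is a placeholder):

    from llama_cpp import Llama

    # n_gpu_layers controls how many layers are offloaded to the GPU;
    # -1 offloads all of them. With the default of 0, inference stays on
    # the CPU even when the wheel was built with CUDA support.
    llm = Llama(
        model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",
        n_gpu_layers=-1,
        verbose=True,  # prints backend and offload info at load time
    )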
5 votes
2 answers
10k views
AssertionError when using llama-cpp-python in Google Colab
I'm trying to use llama-cpp-python (a Python wrapper around llama.cpp) to do inference using the Llama LLM in Google Colab. My code looks like this: !pip install llama-cpp-python from llama_cpp import ...
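This AssertionError often comes from a model file that failed to load (for example, a wrong path); a small sketch that fails with a clearer message instead (the Colab path is a placeholder):

    import os
    from llama_cpp import Llama

    model_path = "/content/model.gguf"  # placeholder Colab path
    # Llama() raises a bare AssertionError when loading fails, so checking
    # the file first gives a more useful error message.
    assert os.path.exists(model_path), f"model file not found: {model_path}"
    llm = Llama(model_path=model_path)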
5 votes
2 answers
14k views
Very slow response from LLM-based Q/A query engine
I built a Q/A query bot over a 4 MB CSV file I have locally. I'm using Chroma for vector DB creation, with the embedding model being Instructor Large from Hugging Face, and the LLM chat model being ...
4 votes
1 answer
3k views
How can I install llama-cpp-python with cuBLAS using poetry?
I can install llama cpp with cuBLAS using pip as below: CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python However, I don't know how to install it with cuBLAS when ...
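A sketch of one commonly suggested route: Poetry delegates the build to pip's machinery, which reads the same environment variables as the pip command quoted above:

    # Same build flags as the pip command in the question, passed to Poetry:
    CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 poetry add llama-cpp-python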
4 votes
1 answer
4k views
No GPU support while running llama-cpp-python inside a docker container
I'm trying to run llama index with llama cpp by following the installation docs but inside a docker container. Following this repo for installation of llama_cpp_python==0.2.6. DOCKERFILE # Use the ...
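Independent of the Dockerfile, the container also needs GPU access at run time (the host must have the NVIDIA Container Toolkit installed; the image name is a placeholder):

    # Build the image, then expose the host GPUs to the container:
    docker build -t llama-cpp-gpu .
    docker run --gpus all llama-cpp-gpu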
3 votes
1 answer
4k views
Llama.cpp GPU Offloading Issue - Unexpected Switch to CPU
I'm reaching out to the community for some assistance with an issue I'm encountering in llama.cpp. Previously, the program was successfully utilizing the GPU for execution. However, recently, it seems ...
3 votes
1 answer
5k views
RAG with Langchain and FastAPI: Stream generated answer and return source documents
I have built a RAG application with Langchain and now want to deploy it with FastAPI. Generally it works to call a FastAPI endpoint, and the answer of the LCEL chain gets streamed. However, I want ...
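A minimal sketch of one way to do this, assuming a `retriever` and an LCEL `chain` built elsewhere: emit the source documents as the first line, then stream the generated tokens:

    import json
    from fastapi import FastAPI
    from fastapi.responses import StreamingResponse

    app = FastAPI()

    @app.get("/ask")
    async def ask(question: str):
        async def event_stream():
            # `retriever` and `chain` are assumed to be defined elsewhere.
            docs = retriever.invoke(question)
            # Send the source documents first, as one JSON line...
            yield json.dumps({"sources": [d.metadata for d in docs]}) + "\n"
            # ...then stream the generated answer chunk by chunk.
            async for chunk in chain.astream(question):
                yield chunk
        return StreamingResponse(event_stream(), media_type="text/plain")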
3 votes
1 answer
3k views
LLM model is not loading into the GPU even after BLAS = 1, LlamaCpp, Langchain, Mistral 7b GGUF Model
Confession: first of all, I am not an expert at all in this area; I am just practicing and trying to learn while working. Also, I am confused about whether this kind of model does not run on this type ...
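For context, BLAS = 1 in the load log only means the build has BLAS (e.g. cuBLAS) support; layers still have to be offloaded explicitly. A sketch with LangChain's LlamaCpp wrapper (model path is a placeholder):

    from langchain_community.llms import LlamaCpp

    llm = LlamaCpp(
        model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # placeholder
        n_gpu_layers=-1,  # offload all layers; the default keeps them on CPU
        n_batch=512,
        verbose=True,     # the load log should then show layers on the GPU
    )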
3 votes
0 answers
209 views
Cannot run inference with images on llama-cpp-python
I am new to this. I have been trying but could not make the model answer questions about images. from llama_cpp import Llama import torch from PIL import Image import base64 llm = Llama( model_path='Holo1-...
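For reference, image input in llama-cpp-python goes through a multimodal chat handler plus a separate CLIP projector file; a sketch with the LLaVA 1.5 handler (file paths are placeholders):

    import base64
    from llama_cpp import Llama
    from llama_cpp.llama_chat_format import Llava15ChatHandler

    # The CLIP/projector file ships separately from the main GGUF model.
    chat_handler = Llava15ChatHandler(clip_model_path="./mmproj.gguf")  # placeholder
    llm = Llama(
        model_path="./llava-v1.5-7b.Q4_K_M.gguf",  # placeholder
        chat_handler=chat_handler,
        n_ctx=2048,  # larger context to make room for the image embedding
    )

    # Images are passed as data URIs inside an OpenAI-style message.
    with open("image.png", "rb") as f:
        data_uri = "data:image/png;base64," + base64.b64encode(f.read()).decode()

    response = llm.create_chat_completion(messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": data_uri}},
            {"type": "text", "text": "Describe this image."},
        ],
    }])
    print(response["choices"][0]["message"]["content"])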
2 votes
1 answer
973 views
How to use `llama-cpp-python` to output list of candidate tokens and their probabilities?
I want to choose my tokens manually, instead of letting llama-cpp-python automatically choose one for me. This requires me to see a list of candidate next tokens, along with their probabilities, ...
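A sketch using the OpenAI-style logprobs parameter of the completion API, which returns the top candidate tokens per position (model path is a placeholder):

    from llama_cpp import Llama

    # logits_all=True is required at construction time for logprobs to work.
    llm = Llama(model_path="./models/model.gguf", logits_all=True)  # placeholder

    # logprobs=N returns the N most likely tokens, with log-probabilities,
    # at each generated position, OpenAI-completion style.
    out = llm("The capital of France is", max_tokens=1, logprobs=10)
    print(out["choices"][0]["logprobs"]["top_logprobs"][0])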
2 votes
2 answers
4k views
Detecting GPU availability in llama-cpp-python
Question How can I programmatically check if llama-cpp-python is installed with support for a CUDA-capable GPU? Context In my program, I am trying to warn the developers when they fail to configure ...
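In recent versions, the low-level bindings expose llama.cpp's own capability check; a sketch (behaviour may vary across versions and backends):

    import llama_cpp

    # Thin binding around llama.cpp's llama_supports_gpu_offload(); returns
    # True when the installed wheel was built with a GPU backend (CUDA, Metal, ...).
    if llama_cpp.llama_supports_gpu_offload():
        print("GPU offload available")
    else:
        print("CPU-only build")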
2 votes
1 answer
4k views
Running Local LLMs in Production and handling multiple requests
I am trying to run a RAG with the Gemma LLM locally. It is running fine, but I can't handle more than one request at a time. Is there a way to handle concurrent requests while utilizing resources ...
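One commonly suggested route is the OpenAI-compatible server bundled with llama-cpp-python, which serves requests over HTTP instead of a single in-process model (model path is a placeholder):

    # Requires: pip install 'llama-cpp-python[server]'
    python -m llama_cpp.server --model ./models/gemma.gguf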
2 votes
1 answer
1k views
How to make an LLM remember previous runtime chats
I want my LLM chatbot to remember previous conversations even after restarting the program. It is made with llama-cpp-python and LangChain; it has conversation memory for the current chat, but obviously ...
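A minimal sketch of the usual approach: persist the message history to disk after each turn and reload it on startup (the file name and message format here are assumptions):

    import json
    import os

    HISTORY_FILE = "chat_history.json"  # hypothetical file name

    def load_history():
        # Reload earlier conversations when the program restarts.
        if os.path.exists(HISTORY_FILE):
            with open(HISTORY_FILE) as f:
                return json.load(f)
        return []

    def save_history(messages):
        # Call after each turn so nothing is lost on exit.
        with open(HISTORY_FILE, "w") as f:
            json.dump(messages, f)

    messages = load_history()
    messages.append({"role": "user", "content": "Hello again!"})
    save_history(messages)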
2 votes
0 answers
1k views
Connection error in langchain with llama2 model downloaded locally
raise ConnectionError(e, request=request) requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/generate/ (Caused by ...
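Port 11434 is Ollama's default, so this error usually means the Ollama server isn't running; a quick reachability check, assuming a default local install:

    import requests

    try:
        requests.get("http://localhost:11434", timeout=2)
        print("Ollama server is reachable")
    except requests.exceptions.ConnectionError:
        print("Ollama is not running; start it with `ollama serve`")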