Python bindings for Transformer models implemented in C/C++ using the GGML library.
Also see ChatDocs
| Models | Model Type | CUDA | Metal |
|---|---|---|---|
| GPT-2 | `gpt2` | | |
| GPT-J, GPT4All-J | `gptj` | | |
| GPT-NeoX, StableLM | `gpt_neox` | | |
| Falcon | `falcon` | ✅ | |
| LLaMA, LLaMA 2 | `llama` | ✅ | ✅ |
| MPT | `mpt` | ✅ | |
| StarCoder, StarChat | `gpt_bigcode` | ✅ | |
| Dolly V2 | `dolly-v2` | | |
| Replit | `replit` | | |
```sh
pip install ctransformers
```

It provides a unified interface for all models:

```python
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("/path/to/ggml-model.bin", model_type="gpt2")

print(llm("AI is going to"))
```

To stream the output, set `stream=True`:

```python
for text in llm("AI is going to", stream=True):
    print(text, end="", flush=True)
```

You can load models from Hugging Face Hub directly:

```python
llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml")
```

If a model repo has multiple model files (`.bin` or `.gguf` files), specify a model file using:

```python
llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", model_file="ggml-model.bin")
```

Note: The 🤗 Transformers integration is an experimental feature and may change in the future.
To use it with 🤗 Transformers, create model and tokenizer using:
```python
from ctransformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)
tokenizer = AutoTokenizer.from_pretrained(model)
```

You can use the 🤗 Transformers text generation pipeline:

```python
from transformers import pipeline

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("AI is going to", max_new_tokens=256))
```

You can use 🤗 Transformers generation parameters:

```python
pipe("AI is going to", max_new_tokens=256, do_sample=True, temperature=0.8, repetition_penalty=1.1)
```

You can use 🤗 Transformers tokenizers:

```python
from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)  # Load model from GGML model repo.
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # Load tokenizer from original model repo.
```

It is integrated into LangChain. See LangChain docs.
To run some of the model layers on GPU, set the gpu_layers parameter:
```python
llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GGML", gpu_layers=50)
```

Install CUDA libraries using:

```sh
pip install ctransformers[cuda]
```

To enable ROCm support, install the `ctransformers` package using:

```sh
CT_HIPBLAS=1 pip install ctransformers --no-binary ctransformers
```

To enable Metal support, install the `ctransformers` package using:

```sh
CT_METAL=1 pip install ctransformers --no-binary ctransformers
```

Note: GPTQ support is an experimental feature, and only LLaMA models are supported, using ExLlama.
Install additional dependencies using:
```sh
pip install ctransformers[gptq]
```

Load a GPTQ model using:

```python
llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GPTQ")
```

If the model name or path doesn't contain the word `gptq`, specify `model_type="gptq"`.
It can also be used with LangChain. Low-level APIs are not fully supported.
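The naming rule for GPTQ models can be sketched as a small helper. This is a minimal sketch, not the library's own detection logic: `gptq_load_kwargs` is a hypothetical name, the local path in the example is made up, and the case-insensitive match is an assumption based on `TheBloke/Llama-2-7B-GPTQ` loading without an explicit `model_type`.

```python
def gptq_load_kwargs(model_path_or_repo_id: str) -> dict:
    """Return extra keyword arguments for loading a GPTQ model.

    Hypothetical helper: adds model_type="gptq" only when the name or path
    does not already contain the word "gptq" (assumed case-insensitive).
    """
    if "gptq" in model_path_or_repo_id.lower():
        return {}
    return {"model_type": "gptq"}


print(gptq_load_kwargs("TheBloke/Llama-2-7B-GPTQ"))  # name already contains "gptq"
print(gptq_load_kwargs("/models/llama-2-7b-4bit"))   # hypothetical local path
```

The result could then be passed along as `AutoModelForCausalLM.from_pretrained(name, **gptq_load_kwargs(name))`.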
| Parameter | Type | Description | Default |
|---|---|---|---|
| `top_k` | `int` | The top-k value to use for sampling. | `40` |
| `top_p` | `float` | The top-p value to use for sampling. | `0.95` |
| `temperature` | `float` | The temperature to use for sampling. | `0.8` |
| `repetition_penalty` | `float` | The repetition penalty to use for sampling. | `1.1` |
| `last_n_tokens` | `int` | The number of last tokens to use for repetition penalty. | `64` |
| `seed` | `int` | The seed value to use for sampling tokens. | `-1` |
| `max_new_tokens` | `int` | The maximum number of new tokens to generate. | `256` |
| `stop` | `List[str]` | A list of sequences to stop generation when encountered. | `None` |
| `stream` | `bool` | Whether to stream the generated text. | `False` |
| `reset` | `bool` | Whether to reset the model state before generating text. | `True` |
| `batch_size` | `int` | The batch size to use for evaluating tokens in a single prompt. | `8` |
| `threads` | `int` | The number of threads to use for evaluating tokens. | `-1` |
| `context_length` | `int` | The maximum context length to use. | `-1` |
| `gpu_layers` | `int` | The number of layers to run on GPU. | `0` |
Note: Currently only LLaMA, MPT and Falcon models support the `context_length` parameter.
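As a sketch of how the table maps onto the API: every parameter that also appears in the `__call__` signature in this document can be overridden per call, while `context_length` and `gpu_layers` do not appear there. The helper below separates one settings dict into those two groups. `split_params` is a hypothetical name, and passing the remaining options as keyword arguments to `from_pretrained` is an assumption modeled on the `gpu_layers=50` example.

```python
# Parameters accepted by llm(...) according to the documented __call__ signature.
CALL_PARAMS = {
    "max_new_tokens", "top_k", "top_p", "temperature", "repetition_penalty",
    "last_n_tokens", "seed", "batch_size", "threads", "stop", "stream", "reset",
}


def split_params(params: dict) -> tuple:
    """Split settings into (load_time, per_call) dicts."""
    per_call = {k: v for k, v in params.items() if k in CALL_PARAMS}
    load_time = {k: v for k, v in params.items() if k not in CALL_PARAMS}
    return load_time, per_call


# Example values chosen for illustration only.
settings = {"gpu_layers": 50, "context_length": 2048, "temperature": 0.7, "stop": ["\n\n"]}
load_time, per_call = split_params(settings)
print(load_time)
print(per_call)
```

Under those assumptions, `load_time` would go to `AutoModelForCausalLM.from_pretrained(..., **load_time)` and `per_call` to `llm(prompt, **per_call)`.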
```python
from_pretrained(
    model_path_or_repo_id: str,
    model_type: Optional[str] = None,
    model_file: Optional[str] = None,
    config: Optional[ctransformers.hub.AutoConfig] = None,
    lib: Optional[str] = None,
    local_files_only: bool = False,
    revision: Optional[str] = None,
    hf: bool = False,
    **kwargs
) → LLM
```

Loads the language model from a local file or remote repo.

Args:

- `model_path_or_repo_id`: The path to a model file or directory or the name of a Hugging Face Hub model repo.
- `model_type`: The model type.
- `model_file`: The name of the model file in repo or directory.
- `config`: `AutoConfig` object.
- `lib`: The path to a shared library or one of `avx2`, `avx`, `basic`.
- `local_files_only`: Whether or not to only look at local files (i.e., do not try to download the model).
- `revision`: The specific model version to use. It can be a branch name, a tag name, or a commit id.
- `hf`: Whether to create a Hugging Face Transformers model.

Returns: `LLM` object.
```python
__init__(
    model_path: str,
    model_type: Optional[str] = None,
    config: Optional[ctransformers.llm.Config] = None,
    lib: Optional[str] = None
)
```

Loads the language model from a local file.

Args:

- `model_path`: The path to a model file.
- `model_type`: The model type.
- `config`: `Config` object.
- `lib`: The path to a shared library or one of `avx2`, `avx`, `basic`.
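Properties of the `LLM` object:

- `bos_token_id`: The beginning-of-sequence token.
- `config`: The config object.
- `context_length`: The context length of the model.
- `embeddings`: The input embeddings.
- `eos_token_id`: The end-of-sequence token.
- `logits`: The unnormalized log probabilities.
- `model_path`: The path to the model file.
- `model_type`: The model type.
- `pad_token_id`: The padding token.
- `vocab_size`: The number of tokens in the vocabulary.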
```python
detokenize(tokens: Sequence[int], decode: bool = True) → Union[str, bytes]
```

Converts a list of tokens to text.

Args:

- `tokens`: The list of tokens.
- `decode`: Whether to decode the text as a UTF-8 string.

Returns: The combined text of all tokens.
```python
embed(
    input: Union[str, Sequence[int]],
    batch_size: Optional[int] = None,
    threads: Optional[int] = None
) → List[float]
```

Computes embeddings for a text or list of tokens.

Note: Currently only LLaMA and Falcon models support embeddings.

Args:

- `input`: The input text or list of tokens to get embeddings for.
- `batch_size`: The batch size to use for evaluating tokens in a single prompt. Default: `8`
- `threads`: The number of threads to use for evaluating tokens. Default: `-1`

Returns: The input embeddings.
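Embeddings are usually compared with one another rather than inspected directly. The sketch below shows one common comparison, cosine similarity, over two `List[float]` vectors such as those returned by `embed`. `cosine_similarity` is a hypothetical helper, and the short vectors are placeholders for real model output.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen in this sketch for zero vectors
    return dot / (norm_a * norm_b)


# Placeholder vectors; with a loaded model these would come from
# llm.embed("first text") and llm.embed("second text").
v1 = [1.0, 0.0, 1.0]
v2 = [1.0, 0.0, 1.0]
v3 = [0.0, 1.0, 0.0]
print(cosine_similarity(v1, v2))  # identical direction, close to 1
print(cosine_similarity(v1, v3))  # orthogonal vectors
```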
```python
eval(
    tokens: Sequence[int],
    batch_size: Optional[int] = None,
    threads: Optional[int] = None
) → None
```

Evaluates a list of tokens.

Args:

- `tokens`: The list of tokens to evaluate.
- `batch_size`: The batch size to use for evaluating tokens in a single prompt. Default: `8`
- `threads`: The number of threads to use for evaluating tokens. Default: `-1`
```python
generate(
    tokens: Sequence[int],
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    temperature: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    last_n_tokens: Optional[int] = None,
    seed: Optional[int] = None,
    batch_size: Optional[int] = None,
    threads: Optional[int] = None,
    reset: Optional[bool] = None
) → Generator[int, NoneType, NoneType]
```

Generates new tokens from a list of tokens.

Args:

- `tokens`: The list of tokens to generate tokens from.
- `top_k`: The top-k value to use for sampling. Default: `40`
- `top_p`: The top-p value to use for sampling. Default: `0.95`
- `temperature`: The temperature to use for sampling. Default: `0.8`
- `repetition_penalty`: The repetition penalty to use for sampling. Default: `1.1`
- `last_n_tokens`: The number of last tokens to use for repetition penalty. Default: `64`
- `seed`: The seed value to use for sampling tokens. Default: `-1`
- `batch_size`: The batch size to use for evaluating tokens in a single prompt. Default: `8`
- `threads`: The number of threads to use for evaluating tokens. Default: `-1`
- `reset`: Whether to reset the model state before generating text. Default: `True`

Returns: The generated tokens.
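The documented methods compose into a low-level generation loop: tokenize the prompt, pull tokens from `generate`, and detokenize the result. The sketch below relies only on the signatures documented here. `generate_text` and the `EchoLLM` stand-in are hypothetical; the stand-in exists only so the loop can run without a model file, and the explicit `is_eos_token` check is a defensive assumption.

```python
from typing import Iterator, List, Sequence


def generate_text(llm, prompt: str, max_new_tokens: int = 16) -> str:
    """Generate up to max_new_tokens tokens and return them as text."""
    tokens = llm.tokenize(prompt)
    out: List[int] = []
    for token in llm.generate(tokens):
        if llm.is_eos_token(token) or len(out) >= max_new_tokens:
            break
        out.append(token)
    return llm.detokenize(out)


class EchoLLM:
    """Hypothetical stand-in: one token per character, echoes the prompt."""

    EOS = 0

    def tokenize(self, text: str) -> List[int]:
        return [ord(c) for c in text]

    def detokenize(self, tokens: Sequence[int]) -> str:
        return "".join(chr(t) for t in tokens)

    def is_eos_token(self, token: int) -> bool:
        return token == self.EOS

    def generate(self, tokens: Sequence[int]) -> Iterator[int]:
        yield from tokens
        yield self.EOS


print(generate_text(EchoLLM(), "hello"))
```

With a real model, the object returned by `AutoModelForCausalLM.from_pretrained(...)` would take the place of `EchoLLM()`.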
```python
is_eos_token(token: int) → bool
```

Checks if a token is an end-of-sequence token.

Args:

- `token`: The token to check.

Returns: `True` if the token is an end-of-sequence token else `False`.
```python
prepare_inputs_for_generation(
    tokens: Sequence[int],
    reset: Optional[bool] = None
) → Sequence[int]
```

Removes input tokens that are evaluated in the past and updates the LLM context.

Args:

- `tokens`: The list of input tokens.
- `reset`: Whether to reset the model state before generating text. Default: `True`

Returns: The list of tokens to evaluate.
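One way to picture what this method describes: if the model has already evaluated a prefix of the new input, only the remaining suffix needs evaluation. The following is a conceptual sketch of that idea, not the library's implementation. `tokens_to_evaluate` and its `past_tokens` argument are hypothetical, and the fallback to a full evaluation when the histories diverge is a simplification chosen for this sketch.

```python
from typing import List, Sequence


def tokens_to_evaluate(past_tokens: Sequence[int], tokens: Sequence[int]) -> List[int]:
    """Return the part of `tokens` not covered by already-evaluated context."""
    # Length of the longest common prefix of the two sequences.
    n = 0
    for old, new in zip(past_tokens, tokens):
        if old != new:
            break
        n += 1
    if n < len(past_tokens):
        # Histories diverge: this sketch falls back to a full evaluation.
        return list(tokens)
    return list(tokens[n:])


print(tokens_to_evaluate([1, 2, 3], [1, 2, 3, 4, 5]))  # only the new suffix
print(tokens_to_evaluate([1, 2, 9], [1, 2, 3, 4]))     # diverged, evaluate all
```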
```python
reset() → None
```

Deprecated since 0.2.27.
```python
sample(
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    temperature: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    last_n_tokens: Optional[int] = None,
    seed: Optional[int] = None
) → int
```

Samples a token from the model.

Args:

- `top_k`: The top-k value to use for sampling. Default: `40`
- `top_p`: The top-p value to use for sampling. Default: `0.95`
- `temperature`: The temperature to use for sampling. Default: `0.8`
- `repetition_penalty`: The repetition penalty to use for sampling. Default: `1.1`
- `last_n_tokens`: The number of last tokens to use for repetition penalty. Default: `64`
- `seed`: The seed value to use for sampling tokens. Default: `-1`

Returns: The sampled token.
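For readers unfamiliar with the sampling parameters, the sketch below shows the textbook way `temperature`, `top_k`, and `top_p` are combined when picking a token from a list of logits. It illustrates the general technique only and is not taken from the library's source. `sample_token` is a hypothetical function, it assumes `temperature` is greater than zero, and the repetition penalty is left out for brevity.

```python
import math
import random
from typing import List, Optional, Sequence


def sample_token(
    logits: Sequence[float],
    top_k: int = 40,
    top_p: float = 0.95,
    temperature: float = 0.8,
    rng: Optional[random.Random] = None,
) -> int:
    """Pick a token index from unnormalized log probabilities."""
    rng = rng or random.Random()
    # Temperature scaling: lower values sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Top-k: keep only the k highest-scoring candidates.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    order = order[: max(1, top_k)]
    # Softmax over the survivors (subtract the max for numerical stability).
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept: List[int] = []
    kept_probs: List[float] = []
    cumulative = 0.0
    for i, p in zip(order, probs):
        kept.append(i)
        kept_probs.append(p)
        cumulative += p
        if cumulative >= top_p:
            break
    # random.choices renormalizes the remaining weights.
    return rng.choices(kept, weights=kept_probs, k=1)[0]


logits = [2.0, 0.5, -1.0, 3.0]
print(sample_token(logits, top_k=1))  # top_k=1 always picks the highest logit
```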
```python
tokenize(text: str, add_bos_token: Optional[bool] = None) → List[int]
```

Converts text into a list of tokens.

Args:

- `text`: The text to tokenize.
- `add_bos_token`: Whether to add the beginning-of-sequence token.

Returns: The list of tokens.
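A common use of `tokenize` is counting tokens before generation. The sketch below checks whether a prompt plus the requested number of new tokens fits a given context length, assuming the context window has to hold both. `fits_in_context` and `fake_tokenize` are hypothetical; the latter merely stands in for `llm.tokenize` so the example runs without a model.

```python
from typing import Callable, List


def fits_in_context(
    tokenize: Callable[[str], List[int]],
    prompt: str,
    context_length: int,
    max_new_tokens: int = 256,
) -> bool:
    """True if the prompt plus the requested new tokens fit in the context."""
    return len(tokenize(prompt)) + max_new_tokens <= context_length


# Hypothetical stand-in for llm.tokenize: one token per whitespace-separated word.
def fake_tokenize(text: str) -> List[int]:
    return list(range(len(text.split())))


print(fits_in_context(fake_tokenize, "AI is going to", context_length=512))
print(fits_in_context(fake_tokenize, "AI is going to", context_length=128))
```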
```python
__call__(
    prompt: str,
    max_new_tokens: Optional[int] = None,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    temperature: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    last_n_tokens: Optional[int] = None,
    seed: Optional[int] = None,
    batch_size: Optional[int] = None,
    threads: Optional[int] = None,
    stop: Optional[Sequence[str]] = None,
    stream: Optional[bool] = None,
    reset: Optional[bool] = None
) → Union[str, Generator[str, NoneType, NoneType]]
```

Generates text from a prompt.

Args:

- `prompt`: The prompt to generate text from.
- `max_new_tokens`: The maximum number of new tokens to generate. Default: `256`
- `top_k`: The top-k value to use for sampling. Default: `40`
- `top_p`: The top-p value to use for sampling. Default: `0.95`
- `temperature`: The temperature to use for sampling. Default: `0.8`
- `repetition_penalty`: The repetition penalty to use for sampling. Default: `1.1`
- `last_n_tokens`: The number of last tokens to use for repetition penalty. Default: `64`
- `seed`: The seed value to use for sampling tokens. Default: `-1`
- `batch_size`: The batch size to use for evaluating tokens in a single prompt. Default: `8`
- `threads`: The number of threads to use for evaluating tokens. Default: `-1`
- `stop`: A list of sequences to stop generation when encountered. Default: `None`
- `stream`: Whether to stream the generated text. Default: `False`
- `reset`: Whether to reset the model state before generating text. Default: `True`

Returns: The generated text.
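With `stream=True`, the return type above is a generator of text chunks rather than a single string. A minimal sketch of consuming such a stream while also keeping the full text follows. `collect_stream` is a hypothetical helper, and the list of chunks is a placeholder for the generator a loaded model would return.

```python
from typing import Callable, Iterable, Optional


def collect_stream(
    chunks: Iterable[str], on_chunk: Optional[Callable[[str], None]] = None
) -> str:
    """Forward each chunk to on_chunk as it arrives and return the joined text."""
    parts = []
    for chunk in chunks:
        if on_chunk is not None:
            on_chunk(chunk)
        parts.append(chunk)
    return "".join(parts)


# Placeholder chunks; with a loaded model: llm("AI is going to", stream=True).
fake_stream = iter([" change", " the", " world"])
text = collect_stream(fake_stream, on_chunk=lambda c: print(c, end="", flush=True))
print()
```

Keeping the callback optional lets the same helper serve both interactive output and silent collection.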