Make a completion for a given list of chat messages.
NOTE: this function uses the chat template (if available) to format the chat messages. If no template is available, the default format (ChatML) is used. It can throw an error if the chat template is not compatible.
Chat messages
Output completion text (only the completion part)
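As an illustration, a minimal sketch of a chat completion call; the instance name `wllama`, the method name `createChatCompletion`, and the `nPredict` option are assumptions, not confirmed by this reference.

```ts
// `wllama` is an assumed, already-loaded instance; `createChatCompletion`
// and `nPredict` are illustrative names, not confirmed method/option names.
const messages = [
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'Tell me a short joke.' },
];
const completion = await wllama.createChatCompletion(messages, {
  nPredict: 128, // assumed option: maximum number of tokens to generate
});
// Only the newly generated text is returned, not the formatted prompt.
console.log(completion);
```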
Make a completion for a given text.
Input text
Output completion text (only the completion part)
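A similar sketch for plain text completion; the method name `createCompletion` and its options are assumed, not confirmed by this reference.

```ts
// Assumed names: `wllama` instance and `createCompletion(text, options)`.
const output = await wllama.createCompletion('Once upon a time', {
  nPredict: 64, // assumed option: number of tokens to generate
});
// Only the completion part is returned, without the input text.
console.log(output);
```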
Calculate the embedding vector for a given text. By default, BOS and EOS tokens are added automatically. Use the "skipBOS" and "skipEOS" options to disable this.
Input text
An embedding vector
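A hedged sketch of the embedding call; only the `skipBOS`/`skipEOS` option names come from this reference, while the instance and method names are assumptions.

```ts
// Assumed names: `wllama` instance and `createEmbedding(text, options)`.
const vector = await wllama.createEmbedding('Hello world', {
  skipBOS: true, // do not prepend the BOS token
  skipEOS: true, // do not append the EOS token
});
console.log(vector.length); // dimensionality of the embedding vector
```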
Apply the chat template to a list of messages.
List of messages
Whether to add the assistant prompt at the end
Optional
template: string (optional): a custom template; see the llama-server --chat-template argument for more details
Formatted chat
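A sketch of applying the chat template; the method name `formatChat` and the exact parameter shape (positional template string vs. options object) are assumptions.

```ts
// Assumed names: `wllama` instance and `formatChat(messages, addAssistant, template?)`.
const formatted = await wllama.formatChat(
  [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Hello!' },
  ],
  true, // add the assistant prompt at the end
  // a custom Jinja template string could be passed here, analogous to
  // llama-server's --chat-template argument (exact parameter shape assumed)
);
console.log(formatted); // the formatted chat as a single prompt string
```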
Get the Jinja chat template that comes with the model. It is only available if the original model (before converting to GGUF) has the template in tokenizer_config.json.
NOTE: This can only be used after loadModel is called.
The Jinja template, or null if there is no template in the GGUF.
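For example (the method name `getChatTemplate` is an assumption):

```ts
// Assumed names: `wllama` instance and `getChatTemplate()`.
const template = await wllama.getChatTemplate();
if (template === null) {
  // the GGUF carries no embedded Jinja template; the default format applies
} else {
  console.log(template);
}
```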
Get the token ID used by the decoder to start generating the output sequence (only usable for encoder-decoder architectures). In other words, the encoder uses the normal BOS token while the decoder uses this token.
NOTE: This can only be used after loadModel is called.
-1 if the model is not loaded.
Get the softmax-ed probabilities of the logits; can be used for custom sampling.
Get the top K tokens with the highest logit values. If topK == -1, all n_vocab logits are returned, but this is not recommended because it is slow.
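A sketch of using the top-K logits for custom sampling; the method name `getLogits` and the `{ token, p }` element shape are assumptions.

```ts
// Assumed names: `wllama` instance, `getLogits(topK)`, and the
// `{ token, p }` element shape of the returned array.
const top10 = await wllama.getLogits(10); // top 10 tokens after softmax
const best = top10[0]; // highest-probability token
console.log(best.token, best.p);
// Passing -1 would return all n_vocab entries, which is slow.
```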
Get model hyper-parameters and metadata
NOTE: This can only be used after loadModel is called.
ModelMetadata
Load a model from a given list of Blobs.
You can pass multiple buffers to the function (in case the model consists of multiple shards).
Can be either a list of Blobs (if you use a local file) or a Model object (if you use ModelManager)
LoadModelConfig
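A sketch of loading a local model; the method name `loadModel` appears in this reference, while the config field shown is an assumption.

```ts
// `wllama` is an assumed instance; `n_ctx` is an assumed LoadModelConfig field.
// One Blob per shard; a single-file model needs only one Blob.
const files: Blob[] = [
  /* e.g. files picked via an <input type="file"> element */
];
await wllama.loadModel(files, {
  n_ctx: 2048, // assumed field: context size
});
```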
Load a model from a given Hugging Face model ID and file path.
The HF model ID, for example: 'ggml-org/models'
The GGUF file path, for example: 'tinyllamas/stories15M-q4_0.gguf'
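The model ID and file path below are the example values from this reference; the instance and method names are assumptions.

```ts
// Assumed names: `wllama` instance and `loadModelFromHF(modelId, filePath)`.
await wllama.loadModelFromHF(
  'ggml-org/models',                 // HF model ID
  'tinyllamas/stories15M-q4_0.gguf', // GGUF file path within the repo
);
```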
Load a model from a given URL (or a list of URLs, in case the model is split into smaller files).
If the model was already downloaded (via downloadModel()), then we will use the cached model.
URL to the GGUF file. If the model is split, pass the URL to the first shard.
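A sketch of loading from a URL; the method name and the URL itself are illustrative assumptions.

```ts
// Assumed names: `wllama` instance and `loadModelFromUrl(url, config)`.
// For a split model, pass the URL of the first shard (or a list of shard URLs).
await wllama.loadModelFromUrl(
  'https://example.com/models/model-00001-of-00003.gguf', // hypothetical URL
  {}, // optional LoadModelConfig
);
```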
Look up whether a token exists in the vocab. Useful for searching for special tokens like "<|im_start|>". NOTE: It matches the whole token, so do not use it as a replacement for tokenize().
Token ID associated with the given piece. Returns -1 if the token cannot be found.
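For example (the method name `lookupToken` is an assumption; the special token string comes from this reference):

```ts
// Assumed names: `wllama` instance and `lookupToken(piece)`.
const imStart = await wllama.lookupToken('<|im_start|>');
if (imStart === -1) {
  // the token does not exist in the model's vocab
} else {
  console.log('token ID:', imStart);
}
```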
Create or reset the ctx_sampling
When re-initializing the ctx_sampling, you can re-import past tokens into the new context.
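A sketch of (re-)initializing sampling; the method name `samplingInit` and the config field names are assumptions.

```ts
// Assumed names: `wllama` instance, `samplingInit(config, pastTokens?)`,
// and the sampling config fields shown below.
const previousTokens: number[] = []; // tokens from an earlier generation, if any
await wllama.samplingInit(
  { temp: 0.7, top_k: 40, top_p: 0.9 }, // assumed sampling parameters
  previousTokens, // re-import past tokens into the new ctx_sampling
);
```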
Set options for the underlying llama_context.
Get debug info.