Namespace: LLama.Native
Direct translation of the llama.cpp API
public class NativeApi
Inheritance Object → NativeApi
public NativeApi()
Mirostat 1.0 algorithm described in the paper https://arxiv.org/abs/2007.14966. Uses tokens instead of words.
public static int llama_sample_token_mirostat(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, float tau, float eta, int m, Single& mu)
candidates LLamaTokenDataArrayNative&
A vector of llama_token_data containing the candidate tokens, their probabilities (p), and log-odds (logit) for the current position in the generated text.
tau Single
The target cross-entropy (or surprise) value you want to achieve for the generated text. A higher value corresponds to more surprising or less predictable text, while a lower value corresponds to less surprising or more predictable text.
eta Single
The learning rate used to update mu based on the error between the target and observed surprisal of the sampled word. A larger learning rate will cause mu to be updated more quickly, while a smaller learning rate will result in slower updates.
m Int32
The number of tokens considered in the estimation of s_hat. This is an arbitrary value that is used to calculate s_hat, which in turn helps to calculate the value of k. In the paper, they use m = 100, but you can experiment with different values to see how it affects the performance of the algorithm.
mu Single&
Maximum cross-entropy. This value is initialized to be twice the target cross-entropy (2 * tau) and is updated in the algorithm based on the error between the target and observed surprisal.
Mirostat 2.0 algorithm described in the paper https://arxiv.org/abs/2007.14966. Uses tokens instead of words.
public static int llama_sample_token_mirostat_v2(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, float tau, float eta, Single& mu)
candidates LLamaTokenDataArrayNative&
A vector of llama_token_data containing the candidate tokens, their probabilities (p), and log-odds (logit) for the current position in the generated text.
tau Single
The target cross-entropy (or surprise) value you want to achieve for the generated text. A higher value corresponds to more surprising or less predictable text, while a lower value corresponds to less surprising or more predictable text.
eta Single
The learning rate used to update mu based on the error between the target and observed surprisal of the sampled word. A larger learning rate will cause mu to be updated more quickly, while a smaller learning rate will result in slower updates.
mu Single&
Maximum cross-entropy. This value is initialized to be twice the target cross-entropy (2 * tau) and is updated in the algorithm based on the error between the target and observed surprisal.
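The role of tau, eta and mu can be illustrated with a small managed sketch of one Mirostat 2.0 step. This is not the native implementation: `MirostatV2Sketch` and its `Sample` method are hypothetical names, the candidates are assumed to be a plain array of probabilities already sorted in descending order, and the truncation rule (drop candidates whose surprise exceeds mu) is an assumption based on the paper's description.

```csharp
using System;

public static class MirostatV2Sketch
{
    // sortedProbs: candidate probabilities in descending order (assumed to sum to 1).
    // Returns the index of the sampled candidate and updates mu in place.
    public static int Sample(float[] sortedProbs, float tau, float eta, ref float mu, Random rng)
    {
        if (sortedProbs.Length == 0)
            throw new ArgumentException("At least one candidate is required.", nameof(sortedProbs));

        // 1. Keep the leading candidates whose surprise, -log2(p), does not exceed mu.
        int keep = 0;
        while (keep < sortedProbs.Length && -Math.Log(sortedProbs[keep], 2.0) <= mu)
            keep++;
        if (keep == 0) keep = 1; // always keep at least one candidate

        // 2. Sample among the kept candidates in proportion to their probabilities.
        double sum = 0.0;
        for (int i = 0; i < keep; i++) sum += sortedProbs[i];
        double r = rng.NextDouble() * sum;
        double acc = 0.0;
        int chosen = keep - 1;
        for (int i = 0; i < keep; i++)
        {
            acc += sortedProbs[i];
            if (r < acc) { chosen = i; break; }
        }

        // 3. Update mu using the error between observed and target surprise.
        double observed = -Math.Log(sortedProbs[chosen] / sum, 2.0);
        mu = (float)(mu - eta * (observed - tau));
        return chosen;
    }
}
```

A caller would typically start with `mu = 2 * tau` and pass the same `mu` variable by reference on every step, so that the value adapts over the course of generation.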
Selects the token with the highest probability.
public static int llama_sample_token_greedy(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates)
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
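As an illustration of what greedy selection means, here is a minimal managed sketch. It does not call the native function: `GreedySketch` and `SelectGreedy` are hypothetical names, and the candidates are modelled as `(Id, Logit)` tuples rather than a LLamaTokenDataArrayNative. Because softmax preserves order, the highest logit is also the highest probability.

```csharp
using System;

public static class GreedySketch
{
    // Returns the Id of the candidate with the largest logit.
    // Ties resolve to the earliest candidate in the array.
    public static int SelectGreedy((int Id, float Logit)[] candidates)
    {
        if (candidates.Length == 0)
            throw new ArgumentException("At least one candidate is required.", nameof(candidates));

        var best = candidates[0];
        foreach (var c in candidates)
        {
            if (c.Logit > best.Logit)
                best = c;
        }
        return best.Id;
    }
}
```

For candidates `(10, 0.1f), (42, 2.5f), (7, -1f)` the sketch returns 42.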
Randomly selects a token from the candidates based on their probabilities.
public static int llama_sample_token(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates)
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
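Sampling "based on their probabilities" can be sketched as drawing from a cumulative distribution. The code below is an illustration only: `MultinomialSketch` is a hypothetical name, the probabilities are assumed to be a plain `float[]` that sums to 1, and `System.Random` stands in for whatever random number generator the native library uses.

```csharp
using System;

public static class MultinomialSketch
{
    // probs[i] is the probability of candidate i; the values are assumed to sum to 1.
    // Returns the index of the sampled candidate.
    public static int Sample(float[] probs, Random rng)
    {
        if (probs.Length == 0)
            throw new ArgumentException("At least one candidate is required.", nameof(probs));

        double r = rng.NextDouble(); // uniform in [0, 1)
        double acc = 0.0;
        for (int i = 0; i < probs.Length; i++)
        {
            acc += probs[i];
            if (r < acc)
                return i;
        }
        // Rounding can leave acc slightly below 1; fall back to the last candidate.
        return probs.Length - 1;
    }
}
```

Passing a seeded `Random` makes a run reproducible, which is the same idea as fixing the seed on the native side.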
Converts a token ID to a string, using the vocabulary of the provided context.
public static IntPtr llama_token_to_str(SafeLLamaContextHandle ctx, int token)
token Int32
IntPtr
Pointer to a string.
Get the "Beginning of sentence" token
public static int llama_token_bos(SafeLLamaContextHandle ctx)
Get the "End of sentence" token
public static int llama_token_eos(SafeLLamaContextHandle ctx)
Get the "new line" token
public static int llama_token_nl(SafeLLamaContextHandle ctx)
Print out timing information for this context
public static void llama_print_timings(SafeLLamaContextHandle ctx)
Reset all collected timing information for this context
public static void llama_reset_timings(SafeLLamaContextHandle ctx)
Print system information
public static IntPtr llama_print_system_info()
Get the number of tokens in the model vocabulary
public static int llama_model_n_vocab(SafeLlamaModelHandle model)
model SafeLlamaModelHandle
Get the size of the context window for the model
public static int llama_model_n_ctx(SafeLlamaModelHandle model)
model SafeLlamaModelHandle
Get the dimension of embedding vectors from this model
public static int llama_model_n_embd(SafeLlamaModelHandle model)
model SafeLlamaModelHandle
Convert a single token into text
public static int llama_token_to_piece_with_model(SafeLlamaModelHandle model, int llamaToken, Byte* buffer, int length)
model SafeLlamaModelHandle
llamaToken Int32
buffer Byte*
buffer to write string into
length Int32
size of the buffer
Int32
The length written or, if the buffer is too small, a negative number that indicates the length required
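The return convention suggests a two-call pattern: try a small buffer, and retry with a larger one if the result is negative. The sketch below illustrates that pattern without touching native code. `TokenPieceSketch`, `PieceWriter` and `ReadPiece` are hypothetical names; the delegate stands in for the native call, and the sketch assumes that a negative result is exactly the negated required length and that the bytes are UTF-8.

```csharp
using System;
using System.Text;

public static class TokenPieceSketch
{
    // Hypothetical stand-in for the native call: writes into buffer and returns
    // the number of bytes written, or a negative value if length is too small.
    public delegate int PieceWriter(byte[] buffer, int length);

    public static string ReadPiece(PieceWriter write, int initialSize = 8)
    {
        var buffer = new byte[initialSize];
        int n = write(buffer, buffer.Length);

        if (n < 0)
        {
            // Assumption: -n is the required size, so one retry is enough.
            buffer = new byte[-n];
            n = write(buffer, buffer.Length);
        }

        if (n < 0)
            throw new InvalidOperationException("Buffer was still too small after resizing.");

        return Encoding.UTF8.GetString(buffer, 0, n);
    }
}
```

The same retry shape applies to the tokenization functions, whose negative return value reports the number of tokens that would have been produced.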
Convert text into tokens
public static int llama_tokenize_with_model(SafeLlamaModelHandle model, Byte* text, Int32* tokens, int n_max_tokens, bool add_bos)
model SafeLlamaModelHandle
text Byte*
tokens Int32*
n_max_tokens Int32
add_bos Boolean
Int32
Returns the number of tokens on success, no more than n_max_tokens.
Returns a negative number on failure - the number of tokens that would have been returned
Register a callback to receive llama log messages
public static void llama_log_set(LLamaLogCallback logCallback)
logCallback LLamaLogCallback
Create a new grammar from the given set of grammar rules
public static IntPtr llama_grammar_init(LLamaGrammarElement** rules, ulong n_rules, ulong start_rule_index)
rules LLamaGrammarElement**
n_rules UInt64
start_rule_index UInt64
Free all memory from the given SafeLLamaGrammarHandle
public static void llama_grammar_free(IntPtr grammar)
grammar IntPtr
Apply constraints from grammar
public static void llama_sample_grammar(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, SafeLLamaGrammarHandle grammar)
candidates LLamaTokenDataArrayNative&
grammar SafeLLamaGrammarHandle
Accepts the sampled token into the grammar
public static void llama_grammar_accept_token(SafeLLamaContextHandle ctx, SafeLLamaGrammarHandle grammar, int token)
grammar SafeLLamaGrammarHandle
token Int32
Quantize a model file. Returns 0 on success
public static int llama_model_quantize(string fname_inp, string fname_out, LLamaModelQuantizeParams* param)
fname_inp String
fname_out String
param LLamaModelQuantizeParams*
Int32
Returns 0 on success
Remarks:
not great API - very likely to change
Apply classifier-free guidance to the logits as described in academic paper "Stay on topic with Classifier-Free Guidance" https://arxiv.org/abs/2306.17806
public static void llama_sample_classifier_free_guidance(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative candidates, SafeLLamaContextHandle guidanceCtx, float scale)
candidates LLamaTokenDataArrayNative
A vector of llama_token_data containing the candidate tokens, the logits must be directly extracted from the original generation context without being sorted.
guidanceCtx SafeLLamaContextHandle
A separate context from the same model. Other than a negative prompt at the beginning, it should have all generated and user input tokens copied from the main context.
scale Single
Guidance strength. 1.0f means no guidance. Higher values mean stronger guidance.
Repetition penalty described in CTRL academic paper https://arxiv.org/abs/1909.05858, with negative logit fix.
public static void llama_sample_repetition_penalty(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, Int32* last_tokens, ulong last_tokens_size, float penalty)
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
last_tokens Int32*
last_tokens_size UInt64
penalty Single
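The "negative logit fix" can be illustrated with a short managed sketch: dividing a negative logit by the penalty would make the token more likely, so negative logits are multiplied instead. `RepetitionPenaltySketch` is a hypothetical name, logits are modelled as a `float[]` indexed by token ID, and applying the penalty once per distinct recent token is an assumption about the intended behaviour.

```csharp
using System;
using System.Collections.Generic;

public static class RepetitionPenaltySketch
{
    // logits[id] is the logit of token id; the array is modified in place.
    public static void Apply(float[] logits, int[] lastTokens, float penalty)
    {
        // Each recently seen token is penalised once, however often it occurred.
        var seen = new HashSet<int>(lastTokens);
        foreach (int id in seen)
        {
            if (id < 0 || id >= logits.Length)
                continue;

            if (logits[id] <= 0f)
                logits[id] *= penalty; // push negative logits further down
            else
                logits[id] /= penalty; // shrink positive logits
        }
    }
}
```

With logits `{ 2, -2, 1 }`, recent tokens `{ 0, 1, 1 }` and a penalty of 2, the logits become `{ 1, -4, 1 }`.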
Frequency and presence penalties described in OpenAI API https://platform.openai.com/docs/api-reference/parameter-details.
public static void llama_sample_frequency_and_presence_penalties(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, Int32* last_tokens, ulong last_tokens_size, float alpha_frequency, float alpha_presence)
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
last_tokens Int32*
last_tokens_size UInt64
alpha_frequency Single
alpha_presence Single
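The two coefficients act differently: the frequency penalty grows with the number of occurrences, while the presence penalty is applied once to any token that occurred at all. The sketch below illustrates this under the assumption that the adjustment is `logit -= count * alpha_frequency + alpha_presence` for every token with `count > 0`. `FrequencyPresenceSketch` is a hypothetical name and logits are modelled as a `float[]` indexed by token ID.

```csharp
using System;
using System.Collections.Generic;

public static class FrequencyPresenceSketch
{
    // logits[id] is the logit of token id; the array is modified in place.
    public static void Apply(float[] logits, int[] lastTokens, float alphaFrequency, float alphaPresence)
    {
        // Count how often each token appears in the recent history.
        var counts = new Dictionary<int, int>();
        foreach (int id in lastTokens)
        {
            counts.TryGetValue(id, out int n);
            counts[id] = n + 1;
        }

        foreach (var kv in counts)
        {
            if (kv.Key < 0 || kv.Key >= logits.Length)
                continue;

            // Frequency term scales with the count; presence term is a flat cost.
            logits[kv.Key] -= kv.Value * alphaFrequency + alphaPresence;
        }
    }
}
```

With logits `{ 1, 1, 1 }`, recent tokens `{ 0, 0, 2 }`, `alphaFrequency = 0.5` and `alphaPresence = 0.25`, the logits become `{ -0.25, 1, 0.25 }`.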
Apply classifier-free guidance to the logits as described in academic paper "Stay on topic with Classifier-Free Guidance" https://arxiv.org/abs/2306.17806
public static void llama_sample_classifier_free_guidance(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, SafeLLamaContextHandle guidance_ctx, float scale)
candidates LLamaTokenDataArrayNative&
A vector of llama_token_data containing the candidate tokens, the logits must be directly extracted from the original generation context without being sorted.
guidance_ctx SafeLLamaContextHandle
A separate context from the same model. Other than a negative prompt at the beginning, it should have all generated and user input tokens copied from the main context.
scale Single
Guidance strength. 1.0f means no guidance. Higher values mean stronger guidance.
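To show how scale interacts with the two contexts, here is a minimal sketch of one common formulation of classifier-free guidance: both logit vectors are converted to log-probabilities and then combined as `scale * (base - guidance) + guidance`. This formulation is an assumption for illustration, not a statement about the native code. `GuidanceSketch` and its methods are hypothetical names, and both inputs are plain `float[]` arrays indexed by token ID.

```csharp
using System;
using System.Linq;

public static class GuidanceSketch
{
    // Numerically stable log-softmax: subtract the maximum before exponentiating.
    public static float[] LogSoftmax(float[] logits)
    {
        float max = logits.Max();
        double sum = logits.Sum(l => Math.Exp(l - max));
        float logSum = (float)Math.Log(sum);
        return logits.Select(l => l - max - logSum).ToArray();
    }

    // Returns guided logits; with scale == 1 the result equals LogSoftmax(logits).
    public static float[] Apply(float[] logits, float[] guidanceLogits, float scale)
    {
        if (logits.Length != guidanceLogits.Length)
            throw new ArgumentException("Both contexts must share the same vocabulary size.");

        float[] b = LogSoftmax(logits);
        float[] g = LogSoftmax(guidanceLogits);
        var result = new float[b.Length];
        for (int i = 0; i < b.Length; i++)
            result[i] = scale * (b[i] - g[i]) + g[i];
        return result;
    }
}
```

In this formulation a scale above 1 amplifies the tokens that the main context prefers over the guidance context, which matches the description of 1.0f as "no guidance".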
Sorts candidate tokens by their logits in descending order and calculates probabilities based on the logits.
public static void llama_sample_softmax(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates)
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
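The two steps described here, sorting and then converting logits to probabilities, can be sketched in managed code as follows. `SoftmaxSketch` and `SortAndSoftmax` are hypothetical names, and candidates are modelled as tuples instead of a LLamaTokenDataArrayNative. Subtracting the maximum logit before exponentiating is a standard stability measure and is an assumption here.

```csharp
using System;
using System.Linq;

public static class SoftmaxSketch
{
    // Returns the candidates sorted by logit (descending) with probabilities filled in.
    public static (int Id, float Logit, float P)[] SortAndSoftmax((int Id, float Logit)[] candidates)
    {
        var sorted = candidates.OrderByDescending(c => c.Logit).ToArray();
        if (sorted.Length == 0)
            return Array.Empty<(int Id, float Logit, float P)>();

        // After sorting, the first entry holds the maximum logit.
        float max = sorted[0].Logit;
        double[] exps = sorted.Select(c => Math.Exp(c.Logit - max)).ToArray();
        double sum = exps.Sum();

        return sorted
            .Select((c, i) => (Id: c.Id, Logit: c.Logit, P: (float)(exps[i] / sum)))
            .ToArray();
    }
}
```

The resulting probabilities sum to 1 and appear in descending order, which is the layout that truncation samplers such as top-k and top-p rely on.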
Top-K sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
public static void llama_sample_top_k(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, int k, ulong min_keep)
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
k Int32
min_keep UInt64
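A minimal sketch of how k and min_keep could interact: the effective k is raised to at least min_keep and capped at the number of candidates. `TopKSketch` is a hypothetical name, candidates are modelled as `(Id, Logit)` tuples, and `int` is used for `minKeep` for simplicity even though the native parameter is a UInt64.

```csharp
using System;
using System.Linq;

public static class TopKSketch
{
    // Returns the k candidates with the highest logits, in descending order.
    public static (int Id, float Logit)[] TopK((int Id, float Logit)[] candidates, int k, int minKeep = 1)
    {
        // Never keep fewer than minKeep candidates, and never more than exist.
        k = Math.Max(k, minKeep);
        k = Math.Min(k, candidates.Length);

        return candidates
            .OrderByDescending(c => c.Logit)
            .Take(k)
            .ToArray();
    }
}
```

With four candidates, `k = 1` and `minKeep = 3`, the sketch keeps three candidates, because min_keep takes precedence over a smaller k.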
Nucleus sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
public static void llama_sample_top_p(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, float p, ulong min_keep)
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
p Single
min_keep UInt64
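Nucleus sampling keeps the smallest prefix of the sorted candidates whose cumulative probability reaches p. The sketch below computes only the size of that prefix. `TopPSketch` and `KeepCount` are hypothetical names, the input is assumed to be a `float[]` of probabilities already sorted in descending order, and the handling of `minKeep` is an assumption.

```csharp
using System;

public static class TopPSketch
{
    // sortedProbs: probabilities in descending order.
    // Returns how many leading candidates to keep.
    public static int KeepCount(float[] sortedProbs, float p, int minKeep = 1)
    {
        double cumulative = 0.0;
        for (int i = 0; i < sortedProbs.Length; i++)
        {
            cumulative += sortedProbs[i];

            // Stop once the nucleus covers p and at least minKeep candidates.
            if (cumulative >= p && i + 1 >= minKeep)
                return i + 1;
        }
        // The threshold was never reached, so keep everything.
        return sortedProbs.Length;
    }
}
```

For probabilities `{ 0.5, 0.3, 0.15, 0.05 }` and `p = 0.7`, two candidates are kept; raising `minKeep` to 3 keeps three.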
Tail Free Sampling described in https://www.trentonbricken.com/Tail-Free-Sampling/.
public static void llama_sample_tail_free(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, float z, ulong min_keep)
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
z Single
min_keep UInt64
Locally Typical Sampling implementation described in the paper https://arxiv.org/abs/2202.00666.
public static void llama_sample_typical(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, float p, ulong min_keep)
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
p Single
min_keep UInt64
Modify logits by temperature
public static void llama_sample_temperature(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, float temp)
candidates LLamaTokenDataArrayNative&
temp Single
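Temperature scaling divides every logit by temp before probabilities are computed, so values below 1 sharpen the distribution and values above 1 flatten it. A minimal sketch, with `TemperatureSketch` as a hypothetical name and logits as a plain `float[]`:

```csharp
using System;

public static class TemperatureSketch
{
    // Divides every logit by temp, modifying the array in place.
    public static void Apply(float[] logits, float temp)
    {
        if (temp <= 0f)
            throw new ArgumentOutOfRangeException(nameof(temp), "Temperature must be positive.");

        for (int i = 0; i < logits.Length; i++)
            logits[i] /= temp;
    }
}
```

With logits `{ 2, 4 }` and `temp = 2`, the logits become `{ 1, 2 }`, which halves the gap between the two candidates.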
A method that does nothing. This is a native method, calling it will force the llama native dependencies to be loaded.
public static bool llama_empty_call()
Create a LLamaContextParams with default values
public static LLamaContextParams llama_context_default_params()
Create a LLamaModelQuantizeParams with default values
public static LLamaModelQuantizeParams llama_model_quantize_default_params()
Check if memory mapping is supported
public static bool llama_mmap_supported()
Check if memory locking is supported
public static bool llama_mlock_supported()
Export a static computation graph for context of 511 and batch size of 1
NOTE: since this functionality is mostly for debugging and demonstration purposes, we hardcode these
parameters here to keep things simple
IMPORTANT: do not use for anything else other than debugging and testing!
public static int llama_eval_export(SafeLLamaContextHandle ctx, string fname)
fname String
Various functions for loading a ggml llama model.
Allocate (almost) all memory needed for the model.
Return NULL on failure
public static IntPtr llama_load_model_from_file(string path_model, LLamaContextParams params)
path_model String
params LLamaContextParams
Create a new llama_context with the given model.
Return value should always be wrapped in SafeLLamaContextHandle!
public static IntPtr llama_new_context_with_model(SafeLlamaModelHandle model, LLamaContextParams params)
model SafeLlamaModelHandle
params LLamaContextParams
not great API - very likely to change.
Initialize the llama + ggml backend
Call once at the start of the program
public static void llama_backend_init(bool numa)
numa Boolean
Frees all allocated memory in the given llama_context
public static void llama_free(IntPtr ctx)
ctx IntPtr
Frees all allocated memory associated with a model
public static void llama_free_model(IntPtr model)
model IntPtr
Apply a LoRA adapter to a loaded model
path_base_model is the path to a higher quality model to use as a base for
the layers modified by the adapter. Can be NULL to use the current loaded model.
The model needs to be reloaded before applying a new adapter, otherwise the adapter
will be applied on top of the previous one
public static int llama_model_apply_lora_from_file(SafeLlamaModelHandle model_ptr, string path_lora, string path_base_model, int n_threads)
model_ptr SafeLlamaModelHandle
path_lora String
path_base_model String
n_threads Int32
Int32
Returns 0 on success
Returns the number of tokens in the KV cache
public static int llama_get_kv_cache_token_count(SafeLLamaContextHandle ctx)
Sets the current rng seed.
public static void llama_set_rng_seed(SafeLLamaContextHandle ctx, int seed)
seed Int32
Returns the maximum size in bytes of the state (rng, logits, embedding
and kv_cache) - will often be smaller after compacting tokens
public static ulong llama_get_state_size(SafeLLamaContextHandle ctx)
Copies the state to the specified destination address.
Destination needs to have allocated enough memory.
public static ulong llama_copy_state_data(SafeLLamaContextHandle ctx, Byte* dest)
dest Byte*
UInt64
the number of bytes copied
Copies the state to the specified destination address.
Destination needs to have allocated enough memory (see llama_get_state_size)
public static ulong llama_copy_state_data(SafeLLamaContextHandle ctx, Byte[] dest)
dest Byte[]
UInt64
the number of bytes copied
Set the state reading from the specified address
public static ulong llama_set_state_data(SafeLLamaContextHandle ctx, Byte* src)
src Byte*
UInt64
the number of bytes read
Set the state reading from the specified address
public static ulong llama_set_state_data(SafeLLamaContextHandle ctx, Byte[] src)
src Byte[]
UInt64
the number of bytes read
Load session file
public static bool llama_load_session_file(SafeLLamaContextHandle ctx, string path_session, Int32[] tokens_out, ulong n_token_capacity, UInt64* n_token_count_out)
path_session String
tokens_out Int32[]
n_token_capacity UInt64
n_token_count_out UInt64*
Save session file
public static bool llama_save_session_file(SafeLLamaContextHandle ctx, string path_session, Int32[] tokens, ulong n_token_count)
path_session String
tokens Int32[]
n_token_count UInt64
Run the llama inference to obtain the logits and probabilities for the next token.
tokens + n_tokens is the provided batch of new tokens to process
n_past is the number of tokens to use from previous eval calls
public static int llama_eval(SafeLLamaContextHandle ctx, Int32[] tokens, int n_tokens, int n_past, int n_threads)
tokens Int32[]
n_tokens Int32
n_past Int32
n_threads Int32
Int32
Returns 0 on success
Run the llama inference to obtain the logits and probabilities for the next token.
tokens + n_tokens is the provided batch of new tokens to process
n_past is the number of tokens to use from previous eval calls
public static int llama_eval_with_pointer(SafeLLamaContextHandle ctx, Int32* tokens, int n_tokens, int n_past, int n_threads)
tokens Int32*
n_tokens Int32
n_past Int32
n_threads Int32
Int32
Returns 0 on success
Convert the provided text into tokens.
public static int llama_tokenize(SafeLLamaContextHandle ctx, string text, Encoding encoding, Int32[] tokens, int n_max_tokens, bool add_bos)
text String
encoding Encoding
tokens Int32[]
n_max_tokens Int32
add_bos Boolean
Int32
Returns the number of tokens on success, no more than n_max_tokens.
Returns a negative number on failure - the number of tokens that would have been returned
Convert the provided text into tokens.
public static int llama_tokenize_native(SafeLLamaContextHandle ctx, Byte* text, Int32* tokens, int n_max_tokens, bool add_bos)
text Byte*
tokens Int32*
n_max_tokens Int32
add_bos Boolean
Int32
Returns the number of tokens on success, no more than n_max_tokens.
Returns a negative number on failure - the number of tokens that would have been returned
Get the number of tokens in the model vocabulary for this context
public static int llama_n_vocab(SafeLLamaContextHandle ctx)
Get the size of the context window for the model for this context
public static int llama_n_ctx(SafeLLamaContextHandle ctx)
Get the dimension of embedding vectors from the model for this context
public static int llama_n_embd(SafeLLamaContextHandle ctx)
Token logits obtained from the last call to llama_eval()
The logits for the last token are stored in the last row
Can be mutated in order to change the probabilities of the next token.
Rows: n_tokens
Cols: n_vocab
public static Single* llama_get_logits(SafeLLamaContextHandle ctx)
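Because the logits form an n_tokens by n_vocab matrix stored row by row, the last token's logits start at offset `(n_tokens - 1) * n_vocab`. The sketch below shows that index arithmetic on a managed array. `LogitsSketch` and `LastRow` are hypothetical names, and the flat `float[]` stands in for the memory behind the returned pointer.

```csharp
using System;

public static class LogitsSketch
{
    // flat holds nTokens rows of nVocab logits each, in row-major order.
    // Returns a copy of the last row.
    public static float[] LastRow(float[] flat, int nTokens, int nVocab)
    {
        if (nTokens < 1 || nVocab < 1 || flat.Length != nTokens * nVocab)
            throw new ArgumentException("flat must contain exactly nTokens * nVocab values.");

        var row = new float[nVocab];
        Array.Copy(flat, (nTokens - 1) * nVocab, row, 0, nVocab);
        return row;
    }
}
```

With `flat = { 1, 2, 3, 4, 5, 6 }`, `nTokens = 2` and `nVocab = 3`, the last row is `{ 4, 5, 6 }`.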
Get the embeddings for the input
shape: [n_embd] (1-dimensional)
public static Single* llama_get_embeddings(SafeLLamaContextHandle ctx)