Namespace: LLama.Native
public class NativeApi
Inheritance Object → NativeApi
public NativeApi()
public static void llama_print_timings(SafeLLamaContextHandle ctx)
public static void llama_reset_timings(SafeLLamaContextHandle ctx)
Print system information
public static IntPtr llama_print_system_info()
public static int llama_model_quantize(string fname_inp, string fname_out, LLamaFtype ftype, int nthread)
fname_inp String
fname_out String
ftype LLamaFtype
nthread Int32
Repetition penalty described in the CTRL paper (https://arxiv.org/abs/1909.05858), with a fix for negative logits.
public static void llama_sample_repetition_penalty(SafeLLamaContextHandle ctx, IntPtr candidates, Int32[] last_tokens, ulong last_tokens_size, float penalty)
candidates IntPtr
Pointer to LLamaTokenDataArray
last_tokens Int32[]
last_tokens_size UInt64
penalty Single
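The "negative logit fix" can be illustrated in plain managed code. The sketch below is not the native implementation; it assumes, for simplicity, that `logits[i]` holds the logit of token id `i`, and the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class RepetitionPenaltySketch
{
    // Penalizes every token id that occurs in lastTokens.
    // Assumption of this sketch: logits[i] is the logit of token id i.
    public static void Apply(float[] logits, int[] lastTokens, float penalty)
    {
        var seen = new HashSet<int>(lastTokens);
        foreach (int token in seen)
        {
            if (token < 0 || token >= logits.Length) continue;

            // Dividing a negative logit by a penalty > 1 would make the token
            // more likely, so non-positive logits are multiplied instead.
            if (logits[token] <= 0f) logits[token] *= penalty;
            else logits[token] /= penalty;
        }
    }
}
```

With `penalty = 2`, a repeated token with logit 2.0 drops to 1.0 and one with logit -1.0 drops to -2.0; tokens that were not seen are left unchanged.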
Frequency and presence penalties described in OpenAI API https://platform.openai.com/docs/api-reference/parameter-details.
public static void llama_sample_frequency_and_presence_penalties(SafeLLamaContextHandle ctx, IntPtr candidates, Int32[] last_tokens, ulong last_tokens_size, float alpha_frequency, float alpha_presence)
candidates IntPtr
Pointer to LLamaTokenDataArray
last_tokens Int32[]
last_tokens_size UInt64
alpha_frequency Single
alpha_presence Single
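As a rough illustration of the formula in the linked OpenAI documentation, each token's logit is reduced by `count * alpha_frequency` plus `alpha_presence` if the token occurred at all. The sketch below assumes `logits[i]` is the logit of token id `i`; the class name is hypothetical and the native code may differ in detail.

```csharp
using System;
using System.Collections.Generic;

public static class FrequencyPresenceSketch
{
    // Assumption of this sketch: logits[i] is the logit of token id i.
    public static void Apply(float[] logits, int[] lastTokens,
                             float alphaFrequency, float alphaPresence)
    {
        // Count how often each token id occurs in the recent history.
        var counts = new Dictionary<int, int>();
        foreach (int t in lastTokens)
        {
            counts.TryGetValue(t, out int c);
            counts[t] = c + 1;
        }

        foreach (var kv in counts)
        {
            if (kv.Key < 0 || kv.Key >= logits.Length) continue;
            // The frequency term scales with the count; the presence term
            // is applied once for any token that occurred at least once.
            logits[kv.Key] -= kv.Value * alphaFrequency + alphaPresence;
        }
    }
}
```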
Sorts candidate tokens by their logits in descending order and calculates probabilities based on the logits.
public static void llama_sample_softmax(SafeLLamaContextHandle ctx, IntPtr candidates)
candidates IntPtr
Pointer to LLamaTokenDataArray
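The behaviour described above (sort by logit, then convert logits to probabilities) can be sketched in managed code as follows. This is a hypothetical helper working on a plain `float[]`, not a wrapper around the native call; subtracting the maximum logit before exponentiating is a common numerical-stability choice assumed here.

```csharp
using System;

public static class SoftmaxSketch
{
    // Returns candidates sorted by logit (descending) with probabilities filled in.
    // Assumption of this sketch: logits[i] is the logit of token id i.
    public static (int Id, float Logit, float P)[] Apply(float[] logits)
    {
        var items = new (int Id, float Logit, float P)[logits.Length];
        for (int i = 0; i < logits.Length; i++) items[i] = (i, logits[i], 0f);

        Array.Sort(items, (a, b) => b.Logit.CompareTo(a.Logit));
        if (items.Length == 0) return items;

        // Subtract the largest logit so that Math.Exp never overflows.
        float max = items[0].Logit;
        double sum = 0.0;
        for (int i = 0; i < items.Length; i++)
        {
            double e = Math.Exp(items[i].Logit - max);
            items[i].P = (float)e;
            sum += e;
        }
        for (int i = 0; i < items.Length; i++)
            items[i].P = (float)(items[i].P / sum);

        return items;
    }
}
```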
Top-K sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
public static void llama_sample_top_k(SafeLLamaContextHandle ctx, IntPtr candidates, int k, ulong min_keep)
candidates IntPtr
Pointer to LLamaTokenDataArray
k Int32
min_keep UInt64
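A minimal sketch of how `k` and `min_keep` could interact, assuming that `min_keep` acts as a lower bound on the number of surviving candidates. The helper below is hypothetical and operates on a plain array rather than on an LLamaTokenDataArray.

```csharp
using System;

public static class TopKSketch
{
    // Keeps the k highest-logit candidates, but never fewer than minKeep
    // (and never more than the number of candidates available).
    public static (int Id, float Logit)[] Apply(float[] logits, int k, int minKeep)
    {
        var items = new (int Id, float Logit)[logits.Length];
        for (int i = 0; i < logits.Length; i++) items[i] = (i, logits[i]);

        Array.Sort(items, (a, b) => b.Logit.CompareTo(a.Logit));

        int keep = Math.Min(Math.Max(k, minKeep), items.Length);
        keep = Math.Max(keep, 0);

        var result = new (int Id, float Logit)[keep];
        Array.Copy(items, result, keep);
        return result;
    }
}
```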
Nucleus sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
public static void llama_sample_top_p(SafeLLamaContextHandle ctx, IntPtr candidates, float p, ulong min_keep)
candidates IntPtr
Pointer to LLamaTokenDataArray
p Single
min_keep UInt64
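Nucleus sampling keeps the smallest prefix of the sorted candidates whose cumulative probability reaches `p`. The sketch below computes only the cut-off point and assumes its input is already normalized and sorted in descending order; the names are hypothetical and the native implementation may differ.

```csharp
using System;

public static class TopPSketch
{
    // Returns how many leading candidates survive the top-p cut.
    // Assumption: sortedProbs is normalized and sorted in descending order.
    public static int KeepCount(float[] sortedProbs, float p, int minKeep)
    {
        double cumulative = 0.0;
        for (int i = 0; i < sortedProbs.Length; i++)
        {
            cumulative += sortedProbs[i];
            // Stop once the nucleus covers probability mass p,
            // provided at least minKeep candidates are retained.
            if (cumulative >= p && i + 1 >= minKeep) return i + 1;
        }
        return sortedProbs.Length;
    }
}
```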
Tail Free Sampling described in https://www.trentonbricken.com/Tail-Free-Sampling/.
public static void llama_sample_tail_free(SafeLLamaContextHandle ctx, IntPtr candidates, float z, ulong min_keep)
candidates IntPtr
Pointer to LLamaTokenDataArray
z Single
min_keep UInt64
Locally Typical Sampling implementation described in the paper https://arxiv.org/abs/2202.00666.
public static void llama_sample_typical(SafeLLamaContextHandle ctx, IntPtr candidates, float p, ulong min_keep)
candidates IntPtr
Pointer to LLamaTokenDataArray
p Single
min_keep UInt64
public static void llama_sample_temperature(SafeLLamaContextHandle ctx, IntPtr candidates, float temp)
candidates IntPtr
temp Single
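This entry has no description in the source. Temperature sampling conventionally divides every logit by `temp` before the softmax, and the sketch below assumes that behaviour. The helper is hypothetical, and rejecting non-positive temperatures is a choice made for this example rather than documented behaviour.

```csharp
using System;

public static class TemperatureSketch
{
    // Scales logits in place. temp < 1 sharpens the distribution,
    // temp > 1 flattens it.
    public static void Apply(float[] logits, float temp)
    {
        if (temp <= 0f)
            throw new ArgumentOutOfRangeException(nameof(temp), "temp must be positive in this sketch");

        for (int i = 0; i < logits.Length; i++)
            logits[i] /= temp;
    }
}
```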
Mirostat 1.0 algorithm described in the paper https://arxiv.org/abs/2007.14966. Uses tokens instead of words.
public static int llama_sample_token_mirostat(SafeLLamaContextHandle ctx, IntPtr candidates, float tau, float eta, int m, Single* mu)
candidates IntPtr
A vector of llama_token_data containing the candidate tokens, their probabilities (p), and log-odds (logit) for the current position in the generated text.
tau Single
The target cross-entropy (or surprise) value you want to achieve for the generated text. A higher value corresponds to more surprising or less predictable text, while a lower value corresponds to less surprising or more predictable text.
eta Single
The learning rate used to update mu based on the error between the target and observed surprisal of the sampled word. A larger learning rate will cause mu to be updated more quickly, while a smaller learning rate will result in slower updates.
m Int32
The number of tokens considered in the estimation of s_hat. This is an arbitrary value that is used to calculate s_hat, which in turn helps to calculate the value of k. In the paper, they use m = 100, but you can experiment with different values to see how it affects the performance of the algorithm.
mu Single*
Maximum cross-entropy. This value is initialized to be twice the target cross-entropy (2 * tau) and is updated in the algorithm based on the error between the target and observed surprisal.
Mirostat 2.0 algorithm described in the paper https://arxiv.org/abs/2007.14966. Uses tokens instead of words.
public static int llama_sample_token_mirostat_v2(SafeLLamaContextHandle ctx, IntPtr candidates, float tau, float eta, Single* mu)
candidates IntPtr
A vector of llama_token_data containing the candidate tokens, their probabilities (p), and log-odds (logit) for the current position in the generated text.
tau Single
The target cross-entropy (or surprise) value you want to achieve for the generated text. A higher value corresponds to more surprising or less predictable text, while a lower value corresponds to less surprising or more predictable text.
eta Single
The learning rate used to update mu based on the error between the target and observed surprisal of the sampled word. A larger learning rate will cause mu to be updated more quickly, while a smaller learning rate will result in slower updates.
mu Single*
Maximum cross-entropy. This value is initialized to be twice the target cross-entropy (2 * tau) and is updated in the algorithm based on the error between the target and observed surprisal.
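The roles of `tau`, `eta` and `mu` can be illustrated with a simplified managed-code sketch of a Mirostat 2.0 step: drop candidates whose surprise (`-log2 p`) exceeds `mu`, sample from the rest, then move `mu` toward the target. This is an approximation written for illustration under the assumption that the input probabilities are normalized and sorted in descending order; it is not the native implementation, and all names are hypothetical.

```csharp
using System;

public static class MirostatV2Sketch
{
    // Returns the index of the sampled candidate and updates mu.
    // Assumption: sortedProbs is normalized and sorted in descending order.
    public static int Sample(float[] sortedProbs, float tau, float eta, ref float mu, Random rng)
    {
        if (sortedProbs.Length == 0)
            throw new ArgumentException("at least one candidate is required", nameof(sortedProbs));

        // Keep candidates whose surprise does not exceed mu (always at least one).
        int keep = 0;
        while (keep < sortedProbs.Length && -Math.Log(sortedProbs[keep], 2.0) <= mu) keep++;
        keep = Math.Max(keep, 1);

        // Sample from the truncated distribution (renormalized via 'total').
        double total = 0.0;
        for (int i = 0; i < keep; i++) total += sortedProbs[i];
        double r = rng.NextDouble() * total;
        double cumulative = 0.0;
        int chosen = keep - 1;
        for (int i = 0; i < keep; i++)
        {
            cumulative += sortedProbs[i];
            if (r < cumulative) { chosen = i; break; }
        }

        // Move mu against the error between observed and target surprise.
        double observed = -Math.Log(sortedProbs[chosen] / total, 2.0);
        mu -= eta * (float)(observed - tau);
        return chosen;
    }
}
```

A caller would typically initialize `mu` to `2 * tau`, as stated above, and pass the same variable to every subsequent call so the estimate carries over between tokens.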
Selects the token with the highest probability.
public static int llama_sample_token_greedy(SafeLLamaContextHandle ctx, IntPtr candidates)
candidates IntPtr
Pointer to LLamaTokenDataArray
Randomly selects a token from the candidates based on their probabilities.
public static int llama_sample_token(SafeLLamaContextHandle ctx, IntPtr candidates)
candidates IntPtr
Pointer to LLamaTokenDataArray
public static bool llama_empty_call()
public static LLamaContextParams llama_context_default_params()
public static bool llama_mmap_supported()
public static bool llama_mlock_supported()
Various functions for loading a ggml llama model.
Allocates (almost) all memory needed for the model.
Returns NULL (IntPtr.Zero) on failure.
public static IntPtr llama_init_from_file(string path_model, LLamaContextParams params_)
path_model String
params_ LLamaContextParams
Initializes the llama + ggml backend.
Call once at the start of the program.
Note: this API is provisional and very likely to change.
public static void llama_init_backend()
Frees all allocated memory
public static void llama_free(IntPtr ctx)
ctx IntPtr
Applies a LoRA adapter to a loaded model.
path_base_model is the path to a higher-quality model to use as a base for
the layers modified by the adapter. It can be NULL to use the currently loaded model.
The model needs to be reloaded before applying a new adapter; otherwise the adapter
will be applied on top of the previous one.
public static int llama_apply_lora_from_file(SafeLLamaContextHandle ctx, string path_lora, string path_base_model, int n_threads)
path_lora String
path_base_model String
n_threads Int32
Int32
Returns 0 on success
Returns the number of tokens in the KV cache
public static int llama_get_kv_cache_token_count(SafeLLamaContextHandle ctx)
Sets the current rng seed.
public static void llama_set_rng_seed(SafeLLamaContextHandle ctx, int seed)
seed Int32
Returns the maximum size in bytes of the state (rng, logits, embedding
and kv_cache). The actual size will often be smaller after compacting tokens.
public static ulong llama_get_state_size(SafeLLamaContextHandle ctx)
Copies the state to the specified destination address.
Destination needs to have allocated enough memory.
Returns the number of bytes copied
public static ulong llama_copy_state_data(SafeLLamaContextHandle ctx, Byte[] dest)
dest Byte[]
Sets the state by reading from the specified address.
Returns the number of bytes read.
public static ulong llama_set_state_data(SafeLLamaContextHandle ctx, Byte[] src)
src Byte[]
Load session file
public static bool llama_load_session_file(SafeLLamaContextHandle ctx, string path_session, Int32[] tokens_out, ulong n_token_capacity, UInt64* n_token_count_out)
path_session String
tokens_out Int32[]
n_token_capacity UInt64
n_token_count_out UInt64*
Save session file
public static bool llama_save_session_file(SafeLLamaContextHandle ctx, string path_session, Int32[] tokens, ulong n_token_count)
path_session String
tokens Int32[]
n_token_count UInt64
Run the llama inference to obtain the logits and probabilities for the next token.
tokens + n_tokens is the provided batch of new tokens to process
n_past is the number of tokens to use from previous eval calls
public static int llama_eval(SafeLLamaContextHandle ctx, Int32[] tokens, int n_tokens, int n_past, int n_threads)
tokens Int32[]
n_tokens Int32
n_past Int32
n_threads Int32
Int32
Returns 0 on success
public static int llama_eval_with_pointer(SafeLLamaContextHandle ctx, Int32* tokens, int n_tokens, int n_past, int n_threads)
tokens Int32*
n_tokens Int32
n_past Int32
n_threads Int32
Converts the provided text into tokens.
The tokens buffer must be large enough to hold the resulting tokens.
Returns the number of tokens on success, which is no more than n_max_tokens.
Returns a negative number on failure; its absolute value is the number of tokens that would have been returned.
public static int llama_tokenize(SafeLLamaContextHandle ctx, string text, Encoding encoding, Int32[] tokens, int n_max_tokens, bool add_bos)
text String
encoding Encoding
tokens Int32[]
n_max_tokens Int32
add_bos Boolean
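The return convention described for llama_tokenize (a negative result reports the required token count) lends itself to a retry-with-larger-buffer pattern. The sketch below uses a hypothetical `tokenize(buffer, capacity)` delegate as a stand-in for the real call so that it stays self-contained; in real code the delegate body would forward to `NativeApi.llama_tokenize` with a context, text and encoding.

```csharp
using System;

public static class TokenizeSketch
{
    // 'tokenize' is a hypothetical stand-in: it fills the buffer and returns
    // the token count, or a negative required count if the buffer is too small.
    public static int[] TokenizeAll(Func<int[], int, int> tokenize, int initialCapacity)
    {
        var buffer = new int[Math.Max(initialCapacity, 1)];
        int n = tokenize(buffer, buffer.Length);

        if (n < 0)
        {
            // The magnitude of the negative result is the size that is needed.
            buffer = new int[-n];
            n = tokenize(buffer, buffer.Length);
            if (n < 0)
                throw new InvalidOperationException("tokenization failed after resizing the buffer");
        }

        Array.Resize(ref buffer, n);
        return buffer;
    }
}
```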
public static int llama_tokenize_native(SafeLLamaContextHandle ctx, SByte[] text, Int32[] tokens, int n_max_tokens, bool add_bos)
text SByte[]
tokens Int32[]
n_max_tokens Int32
add_bos Boolean
public static int llama_n_vocab(SafeLLamaContextHandle ctx)
public static int llama_n_ctx(SafeLLamaContextHandle ctx)
public static int llama_n_embd(SafeLLamaContextHandle ctx)
Token logits obtained from the last call to llama_eval()
The logits for the last token are stored in the last row
Can be mutated in order to change the probabilities of the next token
Rows: n_tokens
Cols: n_vocab
public static Single* llama_get_logits(SafeLLamaContextHandle ctx)
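Given the row and column layout described above, the logits for the last token start at offset `(n_tokens - 1) * n_vocab`. The sketch below shows that index arithmetic on a managed `float[]`, assuming the buffer really holds `n_tokens` rows; copying from the `Single*` returned by the native call into such an array is left out, and the helper name is hypothetical.

```csharp
using System;

public static class LogitsLayoutSketch
{
    // Extracts the last row from a row-major [nTokens x nVocab] logits buffer.
    public static float[] LastRow(float[] flat, int nTokens, int nVocab)
    {
        if (nTokens <= 0 || nVocab <= 0 || flat.Length < nTokens * nVocab)
            throw new ArgumentException("buffer is smaller than nTokens * nVocab");

        var row = new float[nVocab];
        Array.Copy(flat, (nTokens - 1) * nVocab, row, 0, nVocab);
        return row;
    }
}
```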
Get the embeddings for the input
shape: [n_embd] (1-dimensional)
public static Single* llama_get_embeddings(SafeLLamaContextHandle ctx)
Token Id -> String. Uses the vocabulary in the provided context
public static IntPtr llama_token_to_str(SafeLLamaContextHandle ctx, int token)
token Int32
IntPtr
Pointer to a string.
public static int llama_token_bos()
public static int llama_token_eos()
public static int llama_token_nl()