NativeApi

candidates LLamaTokenDataArrayNative&

A vector of llama_token_data containing the candidate tokens, their probabilities (p), and log-odds (logit) for the current position in the generated text.

tau Single

The target cross-entropy (or surprise) value you want to achieve for the generated text. A higher value corresponds to more surprising or less predictable text, while a lower value corresponds to less surprising or more predictable text.

eta Single

The learning rate used to update mu based on the error between the target and observed surprisal of the sampled word. A larger learning rate will cause mu to be updated more quickly, while a smaller learning rate will result in slower updates.

m Int32

The number of tokens considered in the estimation of s_hat. This is an arbitrary value that is used to calculate s_hat, which in turn helps to calculate the value of k. In the paper, they use m = 100, but you can experiment with different values to see how it affects the performance of the algorithm.

mu Single&

Maximum cross-entropy. This value is initialized to be twice the target cross-entropy (2 * tau) and is updated in the algorithm based on the error between the target and observed surprisal.

Returns

llama_sample_token_mirostat_v2(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Single, Single, Single&)

Mirostat 2.0 algorithm described in the paper https://arxiv.org/abs/2007.14966. Uses tokens instead of words.

public static int llama_sample_token_mirostat_v2(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, float tau, float eta, Single& mu)

Parameters

candidates LLamaTokenDataArrayNative&

A vector of llama_token_data containing the candidate tokens, their probabilities (p), and log-odds (logit) for the current position in the generated text.

Returns

Int32
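To make the roles of tau, eta and mu concrete, the following is a minimal sketch of one Mirostat 2.0 step in plain C#. It does not call NativeApi: the helper name MirostatV2Step, the already-sorted probability array and the injected uniform value u are assumptions made for illustration, and the native implementation may differ in detail.

```csharp
using System;

// One Mirostat 2.0 step over probabilities sorted in descending order.
// Returns the index of the sampled candidate and updates mu in place.
// u is a uniform random value in [0, 1), injected so the sketch is deterministic.
static int MirostatV2Step(double[] sortedProbs, double tau, double eta, ref double mu, double u)
{
    // Keep only candidates whose surprise (-log2 p) does not exceed mu; always keep at least one.
    int keep = 0;
    while (keep < sortedProbs.Length && -Math.Log(sortedProbs[keep], 2.0) <= mu) keep++;
    keep = Math.Max(keep, 1);

    // Total mass of the surviving candidates, used to renormalize.
    double total = 0;
    for (int i = 0; i < keep; i++) total += sortedProbs[i];

    // Sample from the truncated distribution.
    double threshold = u * total, cumulative = 0;
    int chosen = keep - 1;
    for (int i = 0; i < keep; i++)
    {
        cumulative += sortedProbs[i];
        if (threshold < cumulative) { chosen = i; break; }
    }

    // Move mu against the error between the observed surprise and the target tau.
    double observed = -Math.Log(sortedProbs[chosen] / total, 2.0);
    mu -= eta * (observed - tau);
    return chosen;
}

double currentMu = 2 * 5.0; // start at 2 * tau
int sampled = MirostatV2Step(new[] { 0.5, 0.25, 0.125, 0.125 }, 5.0, 0.1, ref currentMu, 0.0);
Console.WriteLine($"sampled index {sampled}, updated mu {currentMu}");
```

When the sampled token is less surprising than tau, mu grows and later steps admit more candidates; when it is more surprising, mu shrinks.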

llama_sample_token_greedy(SafeLLamaContextHandle, LLamaTokenDataArrayNative&)

Selects the token with the highest probability.

public static int llama_sample_token_greedy(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates)

Parameters

candidates LLamaTokenDataArrayNative&

Pointer to LLamaTokenDataArray

Returns

Int32

llama_sample_token(SafeLLamaContextHandle, LLamaTokenDataArrayNative&)

Randomly selects a token from the candidates based on their probabilities.

public static int llama_sample_token(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates)

Parameters

candidates LLamaTokenDataArrayNative&

Pointer to LLamaTokenDataArray

Returns

Int32
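As an illustration of what "randomly selects a token based on probabilities" means, here is a small self-contained sketch of categorical sampling. The helper SampleIndex and the plain float array are hypothetical stand-ins; the sketch does not touch LLamaTokenDataArrayNative or the native sampler.

```csharp
using System;

// Picks an index with probability proportional to probs[i],
// given u drawn uniformly from [0, 1).
static int SampleIndex(float[] probs, double u)
{
    double total = 0;
    foreach (float p in probs) total += p;

    double threshold = u * total, cumulative = 0;
    for (int i = 0; i < probs.Length; i++)
    {
        cumulative += probs[i];
        if (threshold < cumulative) return i;
    }
    return probs.Length - 1; // guard against rounding at the upper edge
}

var rng = new Random(42);
float[] candidateProbs = { 0.7f, 0.2f, 0.1f };
int picked = SampleIndex(candidateProbs, rng.NextDouble());
Console.WriteLine($"picked candidate {picked}");
```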

llama_token_to_str(SafeLLamaContextHandle, Int32)

Converts a token ID to a string, using the vocabulary of the provided context.

public static IntPtr llama_token_to_str(SafeLLamaContextHandle ctx, int token)

Parameters

token Int32

Returns

IntPtr

Pointer to a string.

llama_token_bos(SafeLLamaContextHandle)

Get the "Beginning of sentence" token

public static int llama_token_bos(SafeLLamaContextHandle ctx)

Parameters

ctx SafeLLamaContextHandle

Returns

Int32

llama_token_eos(SafeLLamaContextHandle)

Get the "End of sentence" token

public static int llama_token_eos(SafeLLamaContextHandle ctx)

Parameters

ctx SafeLLamaContextHandle

Returns

Int32

llama_token_nl(SafeLLamaContextHandle)

Get the "new line" token

public static int llama_token_nl(SafeLLamaContextHandle ctx)

Parameters

ctx SafeLLamaContextHandle

Returns

Int32

llama_print_timings(SafeLLamaContextHandle)

Print out timing information for this context

public static void llama_print_timings(SafeLLamaContextHandle ctx)

Parameters

ctx SafeLLamaContextHandle

llama_reset_timings(SafeLLamaContextHandle)

Reset all collected timing information for this context

public static void llama_reset_timings(SafeLLamaContextHandle ctx)

Parameters

ctx SafeLLamaContextHandle

llama_print_system_info()

Print system information

public static IntPtr llama_print_system_info()

Returns

IntPtr

llama_model_n_vocab(SafeLlamaModelHandle)

Get the number of tokens in the model vocabulary

public static int llama_model_n_vocab(SafeLlamaModelHandle model)

Parameters

model SafeLlamaModelHandle

Returns

Int32

llama_model_n_ctx(SafeLlamaModelHandle)

Get the size of the context window for the model

public static int llama_model_n_ctx(SafeLlamaModelHandle model)

Parameters

model SafeLlamaModelHandle

Returns

Int32

llama_model_n_embd(SafeLlamaModelHandle)

Get the dimension of embedding vectors from this model

public static int llama_model_n_embd(SafeLlamaModelHandle model)

Parameters

model SafeLlamaModelHandle

Returns

Int32

llama_token_to_piece_with_model(SafeLlamaModelHandle, Int32, Byte*, Int32)

Convert a single token into text

public static int llama_token_to_piece_with_model(SafeLlamaModelHandle model, int llamaToken, Byte* buffer, int length)

Parameters

llamaToken Int32

buffer Byte*

buffer to write string into

length Int32

size of the buffer

Returns

Int32

The length written or, if the buffer is too small, a negative number whose magnitude is the required length.
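The negative-return convention lends itself to a two-call pattern: try a small buffer, then retry with the size the callee reports. The sketch below illustrates that pattern only. FakeTokenToPiece is a hypothetical managed stand-in that mimics the documented contract (it takes the piece text directly instead of a model and token), so the example runs without the native library.

```csharp
using System;
using System.Text;

// Hypothetical stand-in with the same return contract as the native call:
// writes UTF-8 bytes into buffer and returns the count, or the negated
// required length if the buffer is too small.
static int FakeTokenToPiece(string piece, byte[] buffer)
{
    byte[] bytes = Encoding.UTF8.GetBytes(piece);
    if (bytes.Length > buffer.Length) return -bytes.Length;
    bytes.CopyTo(buffer, 0);
    return bytes.Length;
}

// Try a small buffer first, then retry once with the exact size requested.
static string ReadPiece(string piece)
{
    byte[] buffer = new byte[4];
    int written = FakeTokenToPiece(piece, buffer);
    if (written < 0)
    {
        buffer = new byte[-written];
        written = FakeTokenToPiece(piece, buffer);
    }
    return Encoding.UTF8.GetString(buffer, 0, written);
}

Console.WriteLine(ReadPiece(" hello"));
```

With the real method the buffer would be pinned or stack-allocated and passed as a Byte*, but the retry logic would have the same shape.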

llama_tokenize_with_model(SafeLlamaModelHandle, Byte*, Int32*, Int32, Boolean)

Convert text into tokens

public static int llama_tokenize_with_model(SafeLlamaModelHandle model, Byte* text, Int32* tokens, int n_max_tokens, bool add_bos)

Parameters

text Byte*

tokens Int32*

n_max_tokens Int32

add_bos Boolean

Returns

Int32

Returns the number of tokens on success, no more than n_max_tokens.
Returns a negative number on failure; its magnitude is the number of tokens that would have been returned.

llama_log_set(LLamaLogCallback)

public static void llama_log_set(LLamaLogCallback logCallback)

Parameters

logCallback LLamaLogCallback

llama_grammar_init(LLamaGrammarElement**, UInt64, UInt64)

Create a new grammar from the given set of grammar rules

public static IntPtr llama_grammar_init(LLamaGrammarElement** rules, ulong n_rules, ulong start_rule_index)

Parameters

rules LLamaGrammarElement**

n_rules UInt64

start_rule_index UInt64

Returns

IntPtr

llama_grammar_free(IntPtr)

Free all memory from the given SafeLLamaGrammarHandle

public static void llama_grammar_free(IntPtr grammar)

Parameters

grammar IntPtr

llama_sample_grammar(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, SafeLLamaGrammarHandle)

Apply constraints from grammar

public static void llama_sample_grammar(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, SafeLLamaGrammarHandle grammar)

Parameters

candidates LLamaTokenDataArrayNative&

grammar SafeLLamaGrammarHandle

llama_grammar_accept_token(SafeLLamaContextHandle, SafeLLamaGrammarHandle, Int32)

Accepts the sampled token into the grammar

public static void llama_grammar_accept_token(SafeLLamaContextHandle ctx, SafeLLamaGrammarHandle grammar, int token)

Parameters

grammar SafeLLamaGrammarHandle

token Int32

llama_model_quantize(String, String, LLamaModelQuantizeParams*)

Returns 0 on success

public static int llama_model_quantize(string fname_inp, string fname_out, LLamaModelQuantizeParams* param)

Parameters

fname_inp String

fname_out String

param LLamaModelQuantizeParams*

Returns

Int32

Returns 0 on success

Remarks:

not great API - very likely to change

llama_sample_classifier_free_guidance(SafeLLamaContextHandle, LLamaTokenDataArrayNative, SafeLLamaContextHandle, Single)

Apply classifier-free guidance to the logits as described in academic paper "Stay on topic with Classifier-Free Guidance" https://arxiv.org/abs/2306.17806

public static void llama_sample_classifier_free_guidance(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative candidates, SafeLLamaContextHandle guidanceCtx, float scale)

Parameters

candidates LLamaTokenDataArrayNative

A vector of llama_token_data containing the candidate tokens, the logits must be directly extracted from the original generation context without being sorted.

guidanceCtx SafeLLamaContextHandle

A separate context from the same model. Other than a negative prompt at the beginning, it should have all generated and user input tokens copied from the main context.

scale Single

Guidance strength. 1.0f means no guidance. Higher values mean stronger guidance.

llama_sample_repetition_penalty(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Int32*, UInt64, Single)

Repetition penalty described in CTRL academic paper https://arxiv.org/abs/1909.05858, with negative logit fix.

public static void llama_sample_repetition_penalty(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, Int32* last_tokens, ulong last_tokens_size, float penalty)

Parameters

candidates LLamaTokenDataArrayNative&

Pointer to LLamaTokenDataArray

last_tokens Int32*

last_tokens_size UInt64

penalty Single
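A minimal sketch of the penalty rule, assuming the "negative logit fix" means that negative logits are multiplied by the penalty instead of divided, so that the penalty always lowers the token's score. The helper name and the use of a logit array indexed by token ID are illustrative assumptions, not the native data layout.

```csharp
using System;
using System.Collections.Generic;

// Penalizes the logit of every token ID that appears in lastTokens.
// logits is indexed by token ID here for simplicity.
static void ApplyRepetitionPenalty(float[] logits, int[] lastTokens, float penalty)
{
    var seen = new HashSet<int>(lastTokens);
    foreach (int token in seen)
    {
        // Dividing a negative logit would raise it, so negative logits are multiplied instead.
        if (logits[token] <= 0) logits[token] *= penalty;
        else logits[token] /= penalty;
    }
}

float[] demoLogits = { 2.0f, -1.0f, 0.5f };
ApplyRepetitionPenalty(demoLogits, new[] { 0, 1, 1 }, 2.0f);
Console.WriteLine(string.Join(", ", demoLogits));
```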

llama_sample_frequency_and_presence_penalties(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Int32*, UInt64, Single, Single)

Frequency and presence penalties described in OpenAI API https://platform.openai.com/docs/api-reference/parameter-details.

public static void llama_sample_frequency_and_presence_penalties(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, Int32* last_tokens, ulong last_tokens_size, float alpha_frequency, float alpha_presence)

Parameters

candidates LLamaTokenDataArrayNative&

Pointer to LLamaTokenDataArray

last_tokens Int32*

last_tokens_size UInt64

alpha_frequency Single

alpha_presence Single
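The two coefficients can be read as follows: alpha_frequency is subtracted once per prior occurrence of a token, and alpha_presence is subtracted once if the token has occurred at all. The sketch below illustrates that formula on a logit array indexed by token ID; the helper name and array layout are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;

// Subtracts count * alphaFrequency + alphaPresence from the logit of each token
// that occurs in lastTokens. logits is indexed by token ID for simplicity.
static void ApplyFrequencyAndPresence(float[] logits, int[] lastTokens, float alphaFrequency, float alphaPresence)
{
    var counts = new Dictionary<int, int>();
    foreach (int token in lastTokens)
        counts[token] = counts.TryGetValue(token, out int c) ? c + 1 : 1;

    foreach (var pair in counts)
        logits[pair.Key] -= pair.Value * alphaFrequency + alphaPresence;
}

float[] demoLogits = { 1.0f, 1.0f, 1.0f };
ApplyFrequencyAndPresence(demoLogits, new[] { 0, 0, 2 }, 0.5f, 0.25f);
Console.WriteLine(string.Join(", ", demoLogits));
```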

llama_sample_classifier_free_guidance(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, SafeLLamaContextHandle, Single)

Apply classifier-free guidance to the logits as described in academic paper "Stay on topic with Classifier-Free Guidance" https://arxiv.org/abs/2306.17806

public static void llama_sample_classifier_free_guidance(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, SafeLLamaContextHandle guidance_ctx, float scale)

Parameters

candidates LLamaTokenDataArrayNative&

A vector of llama_token_data containing the candidate tokens, the logits must be directly extracted from the original generation context without being sorted.

guidance_ctx SafeLLamaContextHandle

A separate context from the same model. Other than a negative prompt at the beginning, it should have all generated and user input tokens copied from the main context.

scale Single

Guidance strength. 1.0f means no guidance. Higher values mean stronger guidance.
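The effect of scale can be sketched as a blend of log-probabilities from the two contexts: guidance + scale * (main - guidance), which reduces to the main distribution when scale is 1.0. The code below is an illustration of that formula under the assumption that both logit vectors are first normalized with log-softmax; the helper names are hypothetical and the native implementation may differ.

```csharp
using System;

// Log-softmax, used to put both sets of logits on a comparable scale.
static double[] LogSoftmax(double[] logits)
{
    double max = double.NegativeInfinity;
    foreach (double x in logits) max = Math.Max(max, x);
    double sum = 0;
    foreach (double x in logits) sum += Math.Exp(x - max);
    double logSum = max + Math.Log(sum);
    var result = new double[logits.Length];
    for (int i = 0; i < logits.Length; i++) result[i] = logits[i] - logSum;
    return result;
}

// Blends main and guidance log-probabilities: guidance + scale * (main - guidance).
static double[] ApplyGuidance(double[] mainLogits, double[] guidanceLogits, double scale)
{
    double[] mainLp = LogSoftmax(mainLogits), guideLp = LogSoftmax(guidanceLogits);
    var guided = new double[mainLp.Length];
    for (int i = 0; i < mainLp.Length; i++)
        guided[i] = guideLp[i] + scale * (mainLp[i] - guideLp[i]);
    return guided;
}

double[] blended = ApplyGuidance(new[] { 2.0, 1.0, 0.0 }, new[] { 0.0, 1.0, 2.0 }, 1.5);
Console.WriteLine(string.Join(", ", blended));
```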

llama_sample_softmax(SafeLLamaContextHandle, LLamaTokenDataArrayNative&)

Sorts candidate tokens by their logits in descending order and calculates probabilities based on the logits.

public static void llama_sample_softmax(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates)

Parameters

candidates LLamaTokenDataArrayNative&

Pointer to LLamaTokenDataArray
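For reference, the sort-then-normalize step can be sketched in managed code as below. The tuple layout (Id, Logit, P) loosely mirrors the fields the candidates array is described as holding, but the helper SoftmaxSorted and its types are illustrative assumptions rather than the native structures. It expects a non-empty input.

```csharp
using System;
using System.Linq;

// Sorts by logit (largest first) and attaches softmax probabilities.
static (int Id, float Logit, float P)[] SoftmaxSorted(float[] logits)
{
    // Pair each logit with its token ID and sort by logit, largest first.
    var sorted = logits.Select((logit, id) => (Id: id, Logit: logit))
                       .OrderByDescending(t => t.Logit)
                       .ToArray();

    // Subtract the maximum before exponentiating to avoid overflow.
    float max = sorted[0].Logit;
    double sum = sorted.Sum(t => Math.Exp(t.Logit - max));
    return sorted
        .Select(t => (Id: t.Id, Logit: t.Logit, P: (float)(Math.Exp(t.Logit - max) / sum)))
        .ToArray();
}

foreach (var entry in SoftmaxSorted(new[] { 1.0f, 3.0f, 2.0f }))
    Console.WriteLine($"token {entry.Id}: logit {entry.Logit}, p {entry.P}");
```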

llama_sample_top_k(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Int32, UInt64)

Top-K sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751

public static void llama_sample_top_k(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, int k, ulong min_keep)

Parameters

candidates LLamaTokenDataArrayNative&

Pointer to LLamaTokenDataArray

k Int32

min_keep UInt64
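A small sketch of how k and min_keep might interact, assuming min_keep acts as a lower bound on the number of survivors and the candidate count as an upper bound. TopKIds is a hypothetical helper operating on a plain logit array indexed by token ID.

```csharp
using System;
using System.Linq;

// Returns the IDs of the highest-logit tokens, keeping at least minKeep
// and at most logits.Length of them.
static int[] TopKIds(float[] logits, int k, int minKeep)
{
    int keep = Math.Min(Math.Max(k, minKeep), logits.Length);
    return Enumerable.Range(0, logits.Length)
                     .OrderByDescending(id => logits[id])
                     .Take(keep)
                     .ToArray();
}

int[] keptIds = TopKIds(new[] { 0.1f, 3.0f, 2.0f, -1.0f }, 2, 1);
Console.WriteLine(string.Join(", ", keptIds));
```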

llama_sample_top_p(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Single, UInt64)

Nucleus sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751

public static void llama_sample_top_p(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, float p, ulong min_keep)

Parameters

candidates LLamaTokenDataArrayNative&

Pointer to LLamaTokenDataArray

p Single

min_keep UInt64
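Nucleus sampling keeps the smallest prefix of the sorted candidates whose cumulative probability reaches p. The sketch below computes the size of that prefix, assuming the input is already sorted in descending order and normalized, and that min_keep is a lower bound on the prefix length. NucleusSize is a hypothetical helper.

```csharp
using System;

// sortedProbs must be sorted in descending order and sum to 1.
// Returns how many leading candidates survive the top-p cut.
static int NucleusSize(float[] sortedProbs, float p, int minKeep)
{
    float cumulative = 0f;
    for (int i = 0; i < sortedProbs.Length; i++)
    {
        cumulative += sortedProbs[i];
        // Stop once the kept mass reaches p and at least minKeep candidates are kept.
        if (cumulative >= p && i + 1 >= minKeep) return i + 1;
    }
    return sortedProbs.Length;
}

int nucleus = NucleusSize(new[] { 0.5f, 0.3f, 0.1f, 0.1f }, 0.7f, 1);
Console.WriteLine($"kept {nucleus} candidates");
```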

llama_sample_tail_free(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Single, UInt64)

Tail Free Sampling described in https://www.trentonbricken.com/Tail-Free-Sampling/.

public static void llama_sample_tail_free(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, float z, ulong min_keep)

Parameters

candidates LLamaTokenDataArrayNative&

Pointer to LLamaTokenDataArray

z Single

min_keep UInt64

llama_sample_typical(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Single, UInt64)

Locally Typical Sampling implementation described in the paper https://arxiv.org/abs/2202.00666.

public static void llama_sample_typical(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, float p, ulong min_keep)

Parameters

candidates LLamaTokenDataArrayNative&

Pointer to LLamaTokenDataArray

p Single

min_keep UInt64

llama_sample_temperature(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Single)

Modify logits by temperature

public static void llama_sample_temperature(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, float temp)

Parameters

candidates LLamaTokenDataArrayNative&

temp Single
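Temperature scaling is commonly implemented by dividing each logit by temp, so values below 1 sharpen the distribution and values above 1 flatten it. The following sketch assumes that convention; ApplyTemperature is a hypothetical helper working on a plain array.

```csharp
using System;

// Divides every logit by temp in place. temp must be greater than zero.
static void ApplyTemperature(float[] logits, float temp)
{
    for (int i = 0; i < logits.Length; i++) logits[i] /= temp;
}

float[] demoLogits = { 2.0f, 1.0f, 0.0f };
ApplyTemperature(demoLogits, 0.5f);
Console.WriteLine(string.Join(", ", demoLogits));
```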

llama_empty_call()

A method that does nothing. This is a native method, calling it will force the llama native dependencies to be loaded.

public static bool llama_empty_call()

Returns

Boolean

llama_context_default_params()

Create a LLamaContextParams with default values

public static LLamaContextParams llama_context_default_params()

Returns

LLamaContextParams

llama_model_quantize_default_params()

Create a LLamaModelQuantizeParams with default values

public static LLamaModelQuantizeParams llama_model_quantize_default_params()

Returns

LLamaModelQuantizeParams

llama_mmap_supported()

Check if memory mapping is supported

public static bool llama_mmap_supported()

Returns

Boolean

llama_mlock_supported()

Check if memory locking is supported

public static bool llama_mlock_supported()

Returns

Boolean

llama_eval_export(SafeLLamaContextHandle, String)

Export a static computation graph for context of 511 and batch size of 1
NOTE: since this functionality is mostly for debugging and demonstration purposes, we hardcode these
parameters here to keep things simple
IMPORTANT: do not use for anything else other than debugging and testing!

public static int llama_eval_export(SafeLLamaContextHandle ctx, string fname)

Parameters

fname String

Returns

Int32

llama_load_model_from_file(String, LLamaContextParams)

Various functions for loading a ggml llama model.
Allocate (almost) all memory needed for the model.
Return NULL on failure

public static IntPtr llama_load_model_from_file(string path_model, LLamaContextParams params)

Parameters

path_model String

params LLamaContextParams

Returns

IntPtr

llama_new_context_with_model(SafeLlamaModelHandle, LLamaContextParams)

Create a new llama_context with the given model.
Return value should always be wrapped in SafeLLamaContextHandle!

public static IntPtr llama_new_context_with_model(SafeLlamaModelHandle model, LLamaContextParams params)

Parameters

params LLamaContextParams

Returns

IntPtr

llama_backend_init(Boolean)

not great API - very likely to change.
Initialize the llama + ggml backend
Call once at the start of the program

public static void llama_backend_init(bool numa)

Parameters

numa Boolean

llama_free(IntPtr)

Frees all allocated memory in the given llama_context

public static void llama_free(IntPtr ctx)

Parameters

ctx IntPtr

llama_free_model(IntPtr)

Frees all allocated memory associated with a model

public static void llama_free_model(IntPtr model)

Parameters

model IntPtr

llama_model_apply_lora_from_file(SafeLlamaModelHandle, String, String, Int32)

Apply a LoRA adapter to a loaded model
path_base_model is the path to a higher quality model to use as a base for
the layers modified by the adapter. Can be NULL to use the current loaded model.
The model needs to be reloaded before applying a new adapter, otherwise the adapter
will be applied on top of the previous one

public static int llama_model_apply_lora_from_file(SafeLlamaModelHandle model_ptr, string path_lora, string path_base_model, int n_threads)

Parameters

model_ptr SafeLlamaModelHandle

path_lora String

path_base_model String

n_threads Int32

Returns

Int32

Returns 0 on success

llama_get_kv_cache_token_count(SafeLLamaContextHandle)

Returns the number of tokens in the KV cache

public static int llama_get_kv_cache_token_count(SafeLLamaContextHandle ctx)

Parameters

ctx SafeLLamaContextHandle

Returns

Int32

llama_set_rng_seed(SafeLLamaContextHandle, Int32)

Sets the current rng seed.

public static void llama_set_rng_seed(SafeLLamaContextHandle ctx, int seed)

Parameters

seed Int32

llama_get_state_size(SafeLLamaContextHandle)

Returns the maximum size in bytes of the state (rng, logits, embedding
and kv_cache) - will often be smaller after compacting tokens

public static ulong llama_get_state_size(SafeLLamaContextHandle ctx)

Parameters

ctx SafeLLamaContextHandle

Returns

UInt64

llama_copy_state_data(SafeLLamaContextHandle, Byte*)

Copies the state to the specified destination address.
Destination needs to have allocated enough memory.

public static ulong llama_copy_state_data(SafeLLamaContextHandle ctx, Byte* dest)

Parameters

dest Byte*

Returns

UInt64

the number of bytes copied

llama_copy_state_data(SafeLLamaContextHandle, Byte[])

Copies the state to the specified destination address.
Destination needs to have allocated enough memory (see llama_get_state_size)

public static ulong llama_copy_state_data(SafeLLamaContextHandle ctx, Byte[] dest)

Parameters

dest Byte[]

Returns

UInt64

the number of bytes copied

llama_set_state_data(SafeLLamaContextHandle, Byte*)

Sets the state by reading from the specified address

public static ulong llama_set_state_data(SafeLLamaContextHandle ctx, Byte* src)

Parameters

src Byte*

Returns

UInt64

the number of bytes read

llama_set_state_data(SafeLLamaContextHandle, Byte[])

Sets the state by reading from the specified array

public static ulong llama_set_state_data(SafeLLamaContextHandle ctx, Byte[] src)

Parameters

src Byte[]

Returns

UInt64

the number of bytes read

llama_load_session_file(SafeLLamaContextHandle, String, Int32[], UInt64, UInt64*)

Load session file

public static bool llama_load_session_file(SafeLLamaContextHandle ctx, string path_session, Int32[] tokens_out, ulong n_token_capacity, UInt64* n_token_count_out)

Parameters

path_session String

tokens_out Int32[]

n_token_capacity UInt64

n_token_count_out UInt64*

Returns

Boolean

llama_save_session_file(SafeLLamaContextHandle, String, Int32[], UInt64)

Save session file

public static bool llama_save_session_file(SafeLLamaContextHandle ctx, string path_session, Int32[] tokens, ulong n_token_count)

Parameters

path_session String

tokens Int32[]

n_token_count UInt64

Returns

Boolean

llama_eval(SafeLLamaContextHandle, Int32[], Int32, Int32, Int32)

Run the llama inference to obtain the logits and probabilities for the next token.
tokens + n_tokens is the provided batch of new tokens to process
n_past is the number of tokens to use from previous eval calls

public static int llama_eval(SafeLLamaContextHandle ctx, Int32[] tokens, int n_tokens, int n_past, int n_threads)

Parameters

tokens Int32[]

n_tokens Int32

n_past Int32

n_threads Int32

Returns

Int32

Returns 0 on success

llama_eval_with_pointer(SafeLLamaContextHandle, Int32*, Int32, Int32, Int32)

public static int llama_eval_with_pointer(SafeLLamaContextHandle ctx, Int32* tokens, int n_tokens, int n_past, int n_threads)

Parameters

tokens Int32*

n_tokens Int32

n_past Int32

n_threads Int32

Returns

Int32

Returns 0 on success

llama_tokenize(SafeLLamaContextHandle, String, Encoding, Int32[], Int32, Boolean)

Convert the provided text into tokens.

public static int llama_tokenize(SafeLLamaContextHandle ctx, string text, Encoding encoding, Int32[] tokens, int n_max_tokens, bool add_bos)

Parameters

text String

encoding Encoding

tokens Int32[]

n_max_tokens Int32

add_bos Boolean

Returns

Int32

Returns the number of tokens on success, no more than n_max_tokens.
Returns a negative number on failure; its magnitude is the number of tokens that would have been returned.

llama_tokenize_native(SafeLLamaContextHandle, Byte*, Int32*, Int32, Boolean)

Convert the provided text into tokens.

public static int llama_tokenize_native(SafeLLamaContextHandle ctx, Byte* text, Int32* tokens, int n_max_tokens, bool add_bos)

Parameters

text Byte*

tokens Int32*

n_max_tokens Int32

add_bos Boolean

Returns

Int32

Returns the number of tokens on success, no more than n_max_tokens.
Returns a negative number on failure; its magnitude is the number of tokens that would have been returned.

llama_n_vocab(SafeLLamaContextHandle)

Get the number of tokens in the model vocabulary for this context

public static int llama_n_vocab(SafeLLamaContextHandle ctx)

Parameters

ctx SafeLLamaContextHandle

Returns

Int32

llama_n_ctx(SafeLLamaContextHandle)

Get the size of the context window for the model for this context

public static int llama_n_ctx(SafeLLamaContextHandle ctx)

Parameters

ctx SafeLLamaContextHandle

Returns

Int32

llama_n_embd(SafeLLamaContextHandle)

Get the dimension of embedding vectors from the model for this context

public static int llama_n_embd(SafeLLamaContextHandle ctx)

Parameters

ctx SafeLLamaContextHandle

Returns

Int32

llama_get_logits(SafeLLamaContextHandle)

Token logits obtained from the last call to llama_eval()
The logits for the last token are stored in the last row
Can be mutated in order to change the probabilities of the next token.

Rows: n_tokens

Cols: n_vocab

public static Single* llama_get_logits(SafeLLamaContextHandle ctx)

Parameters

ctx SafeLLamaContextHandle

Returns

Single*
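Because the logits form an n_tokens by n_vocab matrix with the last token in the last row, the next-token logits start at offset (n_tokens - 1) * n_vocab if the buffer is laid out row by row. The sketch below shows that index arithmetic on a managed array; the row-major layout and the helper LastRow are assumptions for illustration, and real code would read from the returned Single* instead.

```csharp
using System;

// Copies the last row out of a row-major [nTokens x nVocab] buffer.
static float[] LastRow(float[] flatLogits, int nTokens, int nVocab)
{
    var row = new float[nVocab];
    Array.Copy(flatLogits, (nTokens - 1) * nVocab, row, 0, nVocab);
    return row;
}

float[] flat = { 1f, 2f, 3f, 4f, 5f, 6f }; // 2 tokens, vocabulary of 3
float[] last = LastRow(flat, 2, 3);
Console.WriteLine(string.Join(", ", last));
```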

llama_get_embeddings(SafeLLamaContextHandle)

Get the embeddings for the input
shape: [n_embd] (1-dimensional)

public static Single* llama_get_embeddings(SafeLLamaContextHandle ctx)

Parameters