- Async loading supports cancellation through a `CancellationToken`. If loading is cancelled, an `OperationCanceledException` is thrown; if it fails for any other reason, a `LoadWeightsFailedException` is thrown (see the sketch below).
- Updated examples to use `LoadFromFileAsync`
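For illustration, a minimal sketch of a cancellable async load (the model path is a placeholder, and `LoadWeightsFailedException` is assumed to live in `LLama.Exceptions`; treat the exact overload shape as illustrative):

```csharp
using System;
using System.Threading;
using LLama;
using LLama.Common;
using LLama.Exceptions; // assumed namespace for LoadWeightsFailedException

var parameters = new ModelParams("model.gguf"); // placeholder path

// Cancel the load if it takes longer than 30 seconds
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));

try
{
    using var weights = await LLamaWeights.LoadFromFileAsync(parameters, cts.Token);
    // ... use the weights ...
}
catch (OperationCanceledException)
{
    // Loading was cancelled through the token
}
catch (LoadWeightsFailedException)
{
    // Loading failed for some other reason
}
```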
* Added the ability to save and load individual conversations in a batched executor.
- New example
- Added `BatchedExecutor.Load(filepath)` method
- Added `Conversation.Save(filepath)` method
- Added new (currently internal) `SaveState`/`LoadState` methods in `LLamaContext` which can stash some extra binary data in the header
* Added ability to save/load a `Conversation` to an in-memory state, instead of to file.
* Moved the new save/load methods out to an extension class specifically for the batched executor.
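A rough sketch of how the save/load surface fits together (file names are placeholders; the parameterless in-memory `Save()` overload and the exact extension-method call shapes are assumptions based on the notes above):

```csharp
using LLama;
using LLama.Batched;
using LLama.Common;

var parameters = new ModelParams("model.gguf"); // placeholder path
using var weights = LLamaWeights.LoadFromFile(parameters);
using var executor = new BatchedExecutor(weights, parameters);

using var conversation = executor.Create();
conversation.Prompt("The quick brown fox");
await executor.Infer();

// Save the conversation to disk...
conversation.Save("conversation.bin");

// ...or capture an in-memory state object instead of touching the filesystem
var state = conversation.Save();

// Later: load a saved conversation back into an executor
using var restored = executor.Load("conversation.bin");
```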
* Removed unnecessary spaces
* Updated binaries, using [this build](https://github.com/SciSharp/LLamaSharp/actions/runs/8654672719/job/23733195669) for llama.cpp commit `f7001ccc5aa359fcf41bba19d1c99c3d25c9bcc7`.
- Added all of the new native functions.
- Moved some functions (e.g. `SafeLlamaModelHandle` specific functions) into `SafeLlamaModelHandle.cs`
- Exposed tokens on `SafeLlamaModelHandle` and `LLamaWeights` through a `Tokens` property. As new special tokens are added in the future, they can be added here.
- Changed all token properties to return nullable tokens, to handle models which do not define certain tokens (see the sketch below).
- Fixed `DefaultSamplingPipeline` to handle models which have no newline token.
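For example, callers now have to account for models which lack a given token (the `Tokens.Newline` member name here is illustrative):

```csharp
using LLama;
using LLama.Common;

var parameters = new ModelParams("model.gguf"); // placeholder path
using var weights = LLamaWeights.LoadFromFile(parameters);

// Token properties are nullable because a model may not define them at all
var newline = weights.Tokens.Newline;
if (newline is null)
{
    // No newline token: e.g. DefaultSamplingPipeline skips its newline handling
}
```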
* Moved native methods to more specific locations.
- Context specific things have been moved into `SafeLLamaContextHandle.cs` and made private - they're exposed through C# properties and methods already.
- Checking that GPU layer count is zero if GPU offload is not supported.
- Moved methods for creating default structs (`llama_model_quantize_default_params` and `llama_context_default_params`) into relevant structs.
* Removed exception if `GpuLayerCount > 0` when GPU is not supported.
* Added low level wrapper methods for new per-sequence state load/save in `SafeLLamaContextHandle`
- Added high level wrapper methods (save/load with `State` object or memory mapped file) in `LLamaContext`
- Moved native methods for per-sequence state load/save into `SafeLLamaContextHandle`
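A sketch of how the higher level per-sequence wrappers might be used; the method names and the overloads taking a `LLamaSeqId` are assumptions based on the description above, not the exact public surface:

```csharp
using LLama;
using LLama.Common;
using LLama.Native;

var parameters = new ModelParams("model.gguf"); // placeholder path
using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);

var seq = (LLamaSeqId)0;

// In-memory round trip for a single sequence (hypothetical method names)
var state = context.GetState(seq);
context.LoadState(state, seq);

// File based round trip, backed by a memory mapped file (hypothetical method names)
context.SaveState("sequence0.state", seq);
context.LoadState("sequence0.state", seq);
```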
* Added update and defrag methods for KV cache in `SafeLLamaContextHandle`
* Updated submodule to `f7001ccc5aa359fcf41bba19d1c99c3d25c9bcc7`
* Passing the sequence ID when saving a single sequence state
Replaced `BatchedExecutor.Prompt(string)` method with `BatchedExecutor.Create()` method. This improves the API in two ways:
- A conversation can be created, without immediately prompting it
- Other prompting overloads (e.g. prompt with token list) can be used without duplicating all the overloads onto `BatchedExecutor`
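The old one-step prompt therefore becomes a two-step create-then-prompt, roughly:

```csharp
using LLama;
using LLama.Batched;
using LLama.Common;

var parameters = new ModelParams("model.gguf"); // placeholder path
using var weights = LLamaWeights.LoadFromFile(parameters);
using var executor = new BatchedExecutor(weights, parameters);

// Create a conversation without prompting it immediately...
using var conversation = executor.Create();

// ...then prompt it later with whichever overload is convenient
conversation.Prompt("Once upon a time");

await executor.Infer();
```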
Added `BatchSize` property to `LLamaContext`
Modified `LLamaBatch` to not share tokens with other sequences if logits is true. This ensures that the logit span at the end is used by exactly one sequence, so it's safe to mutate. This removes the need for copying _very_ large arrays (vocab size) and simplifies sampling pipelines.
* Added a `Guidance` method to `LLamaTokenDataArray` which applies classifier free guidance (see the sketch below)
* Factored out a safer `llama_sample_apply_guidance` method based on spans
* Created a guided sampling demo using the batched executor
* Fixed a comment: "classifier free", not "context free"
* Rebased onto master and fixed breakage due to changes in `BaseSamplingPipeline`
* Asking user for guidance weight
* Progress bar in batched fork demo
* Improved fork example (using tree display)
* Added proper disposal of resources in batched examples
* Added some more comments in BatchedExecutorGuidance
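Conceptually, classifier free guidance blends the main context's logits with those of a guidance (negative prompt) context. A minimal, standalone sketch of that blend (the standard CFG interpolation, shown here for illustration rather than as the exact native implementation):

```csharp
using System;

public static class GuidanceSketch
{
    // Blend the main logits toward/away from the guidance (negative prompt) logits,
    // in place. A weight of 1 leaves the logits unchanged; larger weights push the
    // result further away from the guidance output.
    public static void ApplyGuidance(Span<float> logits, ReadOnlySpan<float> guidanceLogits, float weight)
    {
        if (logits.Length != guidanceLogits.Length)
            throw new ArgumentException("Logit spans must have the same length");

        for (var i = 0; i < logits.Length; i++)
            logits[i] = guidanceLogits[i] + weight * (logits[i] - guidanceLogits[i]);
    }
}
```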
* Modified `ISamplingPipeline` to accept a `ReadOnlySpan<float>` of logits directly. This moves the responsibility for copying the logits into the pipeline.
- Added a flag to `BaseSamplingPipeline` indicating whether a logit copy is necessary, skipping it in most cases.
* Fixed `RestoreProtectedTokens` not working if logit processing is skipped
* Implemented a new greedy sampling pipeline (always sample the most likely token); a minimal sketch of the idea follows below.
- Moved `Grammar` into `BaseSamplingPipeline`
- Removed the "protected tokens" concept from `BaseSamplingPipeline`; it was introducing a lot of incidental complexity.
- Implemented newline logit save/restore in `DefaultSamplingPipeline` (the only place protected tokens were used)
* Implemented pipelines for mirostat v1 and v2
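As an illustration of the span-based pipeline shape, greedy sampling reduces to an argmax over the logits. This is a standalone sketch of the idea, not the library's implementation:

```csharp
using System;

public static class GreedySketch
{
    // Greedy sampling: return the index (token id) of the largest logit.
    public static int SampleGreedy(ReadOnlySpan<float> logits)
    {
        var best = 0;
        for (var i = 1; i < logits.Length; i++)
        {
            if (logits[i] > logits[best])
                best = i;
        }
        return best;
    }
}
```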
* LLama.Examples: cleaned up several examples
- UserSettings: simplified the validation/re-ask loop down to one call
- Program: added colour to the figlet title
- Batched examples: show the default prompt
- ExampleRunner: reset state after running an example
* LLama.Examples: disable console logging
* LLama.Examples: rename titles to signal grouped topics
* LLama.Examples: add additional PDF for Q&A
* LLama.Examples: improve kernel memory demo (multi-document ingestion)
* LLama.Examples: improve message before resetting to main menu
* LLama.Examples: document Q&A with local memory
* LLama.Examples: RepoUtils.cs → ConsoleLogger.cs
* LLama.Examples: Examples/Runner.cs → ExampleRunner.cs
* LLama.Examples: delete unused console logger
* LLama.Examples: improve splash screen appearance
- `llama_empty_call()` no longer shows configuration information on startup; it is displayed automatically the first time a model is used
* LLama.Examples: Runner → ExampleRunner
* LLama.Examples: improve model path prompt
- The last used model is stored in a config file and re-used when a blank path is provided
* LLama.Examples: call `NativeApi.llama_empty_call()` at startup
* LLama.Examples: reduce console noise when saving model path
* Embeddings example: set `EmbeddingMode` to true
- Prevents an exception from being thrown when `GetEmbeddings()` is called
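A minimal sketch of that setup (the model path is a placeholder; the constructor and return types follow the existing embedder API but treat the details as illustrative):

```csharp
using LLama;
using LLama.Common;

var parameters = new ModelParams("model.gguf") // placeholder path
{
    EmbeddingMode = true // must be enabled before creating the embedder
};

using var weights = LLamaWeights.LoadFromFile(parameters);
using var embedder = new LLamaEmbedder(weights, parameters);

var embeddings = embedder.GetEmbeddings("Hello, world");
```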
* Embeddings example: improve documentation and styling
* docs: improve GetEmbeddings page
- If `EmbeddingMode` is not set to true, `GetEmbeddings()` throws an exception
* docs: improve GetEmbeddings page
- The previous commit 6c9ff3158c was inaccurate
* Embeddings example: improve styling
- Displays the example description after the model is loaded, so the text is on screen when the prompt is first requested
Conversations can be "forked" to create a copy of a conversation at a given point. This allows e.g. prompting a conversation with a system prefix just once and then forking it for each individual conversation. Conversations can also be "rewound" to an earlier state.
Added two new examples demonstrating forking and rewinding (a sketch of the workflow follows the notes below).
- Added a `DecodeAsync` overload which runs the work in a task
- Replaced some `NativeHandle` usage in `BatchedDecoding` with higher level equivalents.
- Made the `LLamaBatch` grow when token capacity is exceeded, removing the need to manage token capacity externally.
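A sketch of the fork/rewind workflow (prompt text and the token count passed to `Rewind` are arbitrary; treat the exact call shapes as illustrative):

```csharp
using LLama;
using LLama.Batched;
using LLama.Common;

var parameters = new ModelParams("model.gguf"); // placeholder path
using var weights = LLamaWeights.LoadFromFile(parameters);
using var executor = new BatchedExecutor(weights, parameters);

// Evaluate a shared system prefix exactly once
using var root = executor.Create();
root.Prompt("You are a helpful assistant.");
await executor.Infer();

// Fork copies of that state for each individual conversation
using var chatA = root.Fork();
using var chatB = root.Fork();
chatA.Prompt("First question");
chatB.Prompt("Second question");

// A conversation can also be rewound to an earlier state
chatA.Rewind(5); // e.g. discard the last 5 tokens
```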