## LLama.Web - Basic ASP.NET Core examples of LLamaSharp in action
LLama.Web has no heavy dependencies and no extra frameworks over Bootstrap and jQuery, so the examples stay clean and easy to copy over to your own project.
## Websockets
Using SignalR websockets simplifies streaming responses and managing models on a per-connection basis.
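As a rough illustration (not the actual hub shipped in LLama.Web), a SignalR hub can stream each generated token to the caller as soon as it is produced. The hub name and dependency-injection wiring below are hypothetical and only sketch the idea:

```cs
using System.Collections.Generic;
using LLama;
using LLama.Common;
using Microsoft.AspNetCore.SignalR;

// Hypothetical hub for illustration only; names and DI wiring are assumptions.
public class InferenceHub : Hub
{
    private readonly InteractiveExecutor _executor;

    // In a real app you would keep one executor/session per connection;
    // a single injected executor keeps this sketch short.
    public InferenceHub(InteractiveExecutor executor) => _executor = executor;

    // Streaming hub method: each token is pushed to the caller as it is generated.
    public async IAsyncEnumerable<string> Prompt(string prompt)
    {
        var inferenceParams = new InferenceParams { MaxTokens = 256 };
        await foreach (var token in _executor.InferAsync(prompt, inferenceParams))
            yield return token;
    }
}
```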
For more examples, please refer to [LLamaSharp.Examples](./LLama.Examples).
Sometimes, your application using LLM and LLamaSharp may behave unexpectedly. Here are some frequently asked questions which may help you deal with your problem.
## Why is the GPU not used when I have installed CUDA?
1. If you are using backend packages, please make sure you have installed the CUDA backend package that matches the CUDA version of your device. Please note that before LLamaSharp v0.10.0, only one backend package should be installed.
2. Add `NativeLibraryConfig.Instance.WithLogs(LLamaLogLevel.Info)` to the very beginning of your code. The log will show which native library file is loaded. If the CPU library is loaded, please try to compile the native library yourself and open an issue about it. If the CUDA library is loaded, please check whether `GpuLayerCount > 0` when loading the model weights.
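For reference, a minimal sketch of the two checks above might look like this (the model path and layer count are placeholders):

```cs
using LLama;
using LLama.Common;
using LLama.Native;

// Enable native-library loading logs; this must run before any other LLamaSharp call.
NativeLibraryConfig.Instance.WithLogs(LLamaLogLevel.Info);

var parameters = new ModelParams("<path to your gguf model>")
{
    // Must be greater than 0, otherwise no layers are offloaded to the GPU.
    GpuLayerCount = 32
};
using var model = LLamaWeights.LoadFromFile(parameters);
```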
## BatchedExecutor
Different from other executors, `BatchedExecutor` can accept inputs from multiple sessions and generate outputs for them at the same time. Here is an example of how to use it.
```cs
using LLama.Batched;
```
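For the complete, runnable version, please refer to the LLamaSharp examples project referenced earlier; the snippet below is only a rough sketch of the batched workflow, and the exact `Prompt`/`Infer` overloads may differ slightly between LLamaSharp versions.

```cs
using LLama;
using LLama.Batched;
using LLama.Common;

// Rough sketch: the model path is a placeholder and the exact Conversation API
// (Prompt/Infer overloads) varies between LLamaSharp versions.
var parameters = new ModelParams("<path to your gguf model>");
using var model = LLamaWeights.LoadFromFile(parameters);
using var executor = new BatchedExecutor(model, parameters);

// Two independent sessions sharing one executor.
var conversationA = executor.Create();
var conversationB = executor.Create();

conversationA.Prompt(executor.Context.Tokenize("Question: What is an apple? Answer:"));
conversationB.Prompt(executor.Context.Tokenize("Question: What is a banana? Answer:"));

// A single Infer() call evaluates the pending prompts of both conversations in one batch.
await executor.Infer();

// Sampling the next token for each conversation and feeding it back with
// Prompt()/Infer() is repeated until generation is finished (omitted here).
```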
As indicated in [Architecture](../Architecture.md), LLamaSharp uses a native library compiled from llama.cpp to run the models.
Before introducing the way to customize native library loading, please follow the tips below to see if you need to compile the native library yourself, rather than use the published backend packages, which contain native library files for multiple targets.
1. Your device/environment is not supported by any published backend package. For example, Vulkan is not supported yet. In this case, it would mean a lot if you open an issue to tell us you are using it. Since our support for a new backend may be delayed, you can compile the native library yourself before that.
2. You want to gain the best performance from LLamaSharp. Because LLamaSharp offloads the model to both GPU and CPU, performance depends significantly on the CPU when your GPU memory is small. AVX ([Advanced Vector Extensions](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions)) and BLAS ([Basic Linear Algebra Subprograms](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms)) are the most important ways to accelerate CPU computation. By default, LLamaSharp disables BLAS and uses AVX2 for the CUDA backend. If you would like to enable BLAS or use AVX-512 along with CUDA, please compile the native library yourself, following the [instructions here](../ContributingGuide.md), and load it as sketched below.
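If you do compile the native library yourself, LLamaSharp can be pointed at it through `NativeLibraryConfig`. The sketch below uses a placeholder path, and the exact `WithLibrary` overload (a single path, or separate llama/llava paths) differs between LLamaSharp versions, so check the version you use.

```cs
using LLama.Native;

// Must run before any model is loaded so the custom library is picked up.
// The path is a placeholder for your self-compiled llama.cpp library.
NativeLibraryConfig.Instance.WithLibrary("<path to your compiled native library>");
```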
If you are new to LLM, here are some tips to help you get started with `LLamaSharp`.
1. The main ability of LLamaSharp is to provide an efficient way to run inference of LLMs on your device (and to fine-tune models in the future). The model weights, however, need to be downloaded from other resources, such as [huggingface](https://huggingface.co).
2. To gain high performance, LLamaSharp interacts with a native library compiled from C++, which is called the `backend`. We provide backend packages for Windows, Linux and macOS with CPU, CUDA, Metal and OpenCL. You **don't** need to handle anything about C++; just install the backend packages. If no published backend matches your device, please open an issue to let us know. If compiling C++ code is not difficult for you, you could also follow [this guide]() to compile a backend and run LLamaSharp with it.
3. `LLaMA` originally refers to the weights released by Meta (Facebook Research). After that, many models have been fine-tuned based on it, such as `Vicuna`, `GPT4All`, and `Pygmalion`. There are two popular file formats for these models now, which are the PyTorch format (.pth) and the Huggingface format (.bin). LLamaSharp uses `GGUF` format files, which can be converted from these two formats. There are two options for you to get a GGUF format file. a) Search the model name + 'gguf' on [Huggingface](https://huggingface.co); you will find lots of model files that have already been converted to GGUF format. Please pay attention to their publishing time, because some old files may only work with old versions of LLamaSharp. b) Convert the PyTorch or Huggingface format to GGUF format yourself. Please follow the instructions in [this part of the llama.cpp readme](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#prepare-and-quantize) to convert them with the python scripts.
4. LLamaSharp supports multi-modal models, which means the model can take both text and images as input. Note that two model files are required for multi-modal (LLaVA): the main model and the mm-proj model (loading both is sketched below). Here is a huggingface repo which shows that: [link](https://huggingface.co/ShadowBeast/llava-v1.6-mistral-7b-Q5_K_S-GGUF/tree/main).
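A minimal sketch of loading both files is shown below. The file names are placeholders, and the `LLavaWeights`/`InteractiveExecutor` overloads are assumed from recent LLamaSharp versions, so check them against the version you use.

```cs
using LLama;
using LLama.Common;

// Placeholder file names; both files usually come from the same huggingface repo.
var mainModelPath = "<main model>.gguf";
var mmProjPath = "<mm-proj model>.gguf";

var parameters = new ModelParams(mainModelPath);
using var model = LLamaWeights.LoadFromFile(parameters);
using var clipModel = LLavaWeights.LoadFromFile(mmProjPath);   // the mm-proj (CLIP) weights

using var context = model.CreateContext(parameters);
// Assumed overload: an executor that accepts the LLaVA weights alongside the context.
var executor = new InteractiveExecutor(context, clipModel);
```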
The inference parameters accepted by the executors are defined by the `InferenceParams` class:

```cs
public class InferenceParams : LLama.Abstractions.IInferenceParams, System.IEquatable`1[[LLama.Common.InferenceParams, LLamaSharp, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null]]
```