
docs: add the documentations with mkdocs.

tags/v0.4.0
Yaohui Liu 2 years ago
parent
commit
eed96248b5
No known key found for this signature in database GPG Key ID: E86D01E1809BD23E
25 changed files with 1436 additions and 2 deletions
  1. +2
    -2
      LLama.Examples/NewVersion/TestRunner.cs
  2. +23
    -0
      docs/Architecher.md
  3. +36
    -0
      docs/ChatSession/basic-usages.md
  4. +14
    -0
      docs/ChatSession/save-load-session.md
  5. +245
    -0
      docs/ChatSession/transforms.md
  6. +65
    -0
      docs/ContributingGuide.md
  7. +110
    -0
      docs/GetStarted.md
  8. +3
    -0
      docs/HighLevelApps/bot-sharp.md
  9. +69
    -0
      docs/LLamaExecutors/differences.md
  10. +261
    -0
      docs/LLamaExecutors/parameters.md
  11. +27
    -0
      docs/LLamaExecutors/save-load-state.md
  12. +18
    -0
      docs/LLamaExecutors/text-to-text-apis.md
  13. +13
    -0
      docs/LLamaModel/embeddings.md
  14. +208
    -0
      docs/LLamaModel/parameters.md
  15. +23
    -0
      docs/LLamaModel/quantization.md
  16. +19
    -0
      docs/LLamaModel/save-load-state.md
  17. +25
    -0
      docs/LLamaModel/tokenization.md
  18. +163
    -0
      docs/More/log.md
  19. +3
    -0
      docs/NonEnglishUsage/Chinese.md
  20. +44
    -0
      docs/Tricks.md
  21. +36
    -0
      docs/index.md
  22. BIN
      docs/media/LLamaSharpLogo.png
  23. BIN
      docs/media/structure.jpg
  24. BIN
      docs/media/structure.vsdx
  25. +29
    -0
      mkdocs.yml

+ 2
- 2
LLama.Examples/NewVersion/TestRunner.cs View File

@@ -30,11 +30,11 @@ namespace LLama.Examples.NewVersion

if (choice == 0)
{
- ChatSessionStripRoleName.Run();
+ ChatSessionWithRoleName.Run();
}
else if (choice == 1)
{
- ChatSessionWithRoleName.Run();
+ ChatSessionStripRoleName.Run();
}
else if(choice == 2)
{


+ 23
- 0
docs/Architecher.md View File

@@ -0,0 +1,23 @@
# Architecture

## Architecture of the main functions

The figure below shows the core framework structure, which is separated into four levels.

- **LLamaModel**: The holder of a model, which directly interacts with the native library and provides some basic APIs such as tokenization and embedding. Currently it includes three classes: `LLamaModel`, `LLamaEmbedder` and `LLamaQuantizer`.
- **LLamaExecutors**: Executors which define the way to run the LLama model. They provide text-to-text APIs to make the model easy to use. Currently we provide three kinds of executors: `InteractiveExecutor`, `InstructExecutor` and `StatelessExecutor`.
- **ChatSession**: A wrapper of `InteractiveExecutor` and `LLamaModel`, which supports interactive tasks and saving/re-loading sessions. It also provides a flexible way to customize the text processing via `IHistoryTransform`, `ITextTransform` and `ITextStreamTransform`.
- **High-level Applications**: Some applications that provide higher-level integrations. For example, [BotSharp](https://github.com/SciSharp/BotSharp) provides integration for vector search, Chatbot UI and Web APIs. [semantic-kernel](https://github.com/microsoft/semantic-kernel) provides various APIs for operations related to LLMs. If you've made an integration, please tell us and add it to the doc!


![structure_image](media/structure.jpg)

## Recommended usage

Since `LLamaModel` interacts with the native library, it's not recommended to use its methods directly unless you know what you are doing. The same applies to `NativeApi`, which is not included in the architecture figure above.

`ChatSession` is recommended when you want to build an application similar to ChatGPT or a chat bot, because it works best with `InteractiveExecutor`. Though other executors can also be passed as a parameter to initialize a `ChatSession`, this is not encouraged if you are new to LLamaSharp and LLMs.

High-level applications, such as BotSharp, are supposed to be used when you want to concentrate on the parts not related to the LLM. For example, if you want to deploy a chat bot to help you remember your schedules, using BotSharp may be a good choice.

Note that the APIs of the high-level applications may not be stable yet. Please take that into account when using them.

+ 36
- 0
docs/ChatSession/basic-usages.md View File

@@ -0,0 +1,36 @@
# Basic usages of ChatSession

`ChatSession` is a higher-level abstraction than the executors. In the context of a chat application like ChatGPT, a "chat session" refers to an interactive conversation or exchange of messages between the user and the chatbot. It represents a continuous flow of communication where the user enters input or asks questions, and the chatbot responds accordingly. A chat session typically starts when the user initiates a conversation with the chatbot and continues until the interaction comes to a natural end or is explicitly terminated by either the user or the system. During a chat session, the chatbot maintains the context of the conversation, remembers previous messages, and generates appropriate responses based on the user's inputs and the ongoing dialogue.

## Initialize a session

Currently, the only parameter that is accepted is an `ILLamaExecutor`, because this is the only parameter that we're sure will exist in all future versions. Since it's a high-level abstraction, we're conservative about the API design. In the future, more kinds of constructors may be added.

```cs
InteractiveExecutor ex = new(new LLamaModel(new ModelParams(modelPath)));
ChatSession session = new ChatSession(ex);
```

## Chat with the bot

There are two kinds of input accepted by the `Chat` API: `ChatHistory` and `String`. The string API is quite similar to that of the executors, while the `ChatHistory` API aims to provide more flexible usage. For example, suppose you had a chat with the bot in session A before opening session B. Session B has no memory of what you said before, so you can feed the history of A into B.

```cs
string prompt = "What is C#?";

foreach (var text in session.Chat(prompt, new InferenceParams() { Temperature = 0.6f, AntiPrompts = new List<string> { "User:" } })) // the inference params should be changed depending on your statement
{
    Console.Write(text);
}
```
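
The `ChatHistory` overload works the same way. Here's a minimal sketch of feeding the history of one session into another (`sessionA` and `sessionB` are illustrative names, not part of the library):

```cs
// Feed the accumulated history of session A into session B.
foreach (var text in sessionB.Chat(sessionA.History, new InferenceParams() { Temperature = 0.6f, AntiPrompts = new List<string> { "User:" } }))
{
    Console.Write(text);
}
```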

## Get the history

Currently `History` is a property of `ChatSession`.

```cs
foreach(var rec in session.History.Messages)
{
    Console.WriteLine($"{rec.AuthorRole}: {rec.Content}");
}
```

+ 14
- 0
docs/ChatSession/save-load-session.md View File

@@ -0,0 +1,14 @@
# Save/Load Chat Session

Generally, the chat session may need to be switched, which requires the ability to load and save sessions.

When building a chat bot app, it's **NOT encouraged** to initialize many chat sessions and keep them in memory waiting to be switched, because the memory consumption of both CPU and GPU is expensive. It's recommended to save the current session before switching to a new one, and load the file when switching back to it.

The API is quite simple: the files will be saved into the directory you specify. If the path does not exist, a new directory will be created.

```cs
string savePath = "<save dir>";
session.SaveSession(savePath);

session.LoadSession(savePath);
```

+ 245
- 0
docs/ChatSession/transforms.md View File

@@ -0,0 +1,245 @@
# Transforms in Chat Session

There are three important elements in `ChatSession`: input, output and history. Besides, there are some conversions between them. Since how they are processed varies under different conditions, LLamaSharp hands this power over to the users.

Currently, there are three kinds of processes that can be customized, as introduced below.

## Input transform

In general, the input of the chat API is plain text (not a stream), so `ChatSession` processes it in a pipeline. If you want to use your own transform, define a class that implements `ITextTransform` and add it to the pipeline of `ChatSession`.

```cs
public interface ITextTransform
{
    string Transform(string text);
}
```

```cs
public class MyInputTransform1 : ITextTransform
{
    public string Transform(string text)
    {
        return $"Question: {text}\n";
    }
}

public class MyInputTransform2 : ITextTransform
{
    public string Transform(string text)
    {
        return text + "Answer: ";
    }
}

session.AddInputTransform(new MyInputTransform1()).AddInputTransform(new MyInputTransform2());
```

## Output transform

Different from the input, the output of the chat API is a text stream. Therefore you need to process it word by word instead of getting the full text at once.

The interface takes an `IEnumerable<string>` as input, which is actually a yielded sequence.

```cs
public interface ITextStreamTransform
{
    IEnumerable<string> Transform(IEnumerable<string> tokens);
    IAsyncEnumerable<string> TransformAsync(IAsyncEnumerable<string> tokens);
}
```

When implementing it, you could throw a not-implemented exception in one of them if you only need to use the chat API synchronously or asynchronously.

Different from the input transform pipeline, the output transform only supports one transform.

```cs
session.WithOutputTransform(new MyOutputTransform());
```

Here's an example of how to implement the interface. In this example, the transform detects whether there are keywords in the response and removes them.

```cs
/// <summary>
/// A text output transform that removes the keywords from the response.
/// </summary>
public class KeywordTextOutputStreamTransform : ITextStreamTransform
{
    HashSet<string> _keywords;
    int _maxKeywordLength;
    bool _removeAllMatchedTokens;

    /// <summary>
    ///
    /// </summary>
    /// <param name="keywords">Keywords that you want to remove from the response.</param>
    /// <param name="redundancyLength">The extra length when searching for the keyword. For example, if your only keyword is "highlight",
    /// maybe the token you get is "\r\nhighlight". In this condition, if redundancyLength=0, the token cannot be successfully matched because the length of "\r\nhighlight" (11)
    /// has already exceeded the maximum length of the keywords (9). On the contrary, setting redundancyLength >= 2 leads to a successful match.
    /// The larger the redundancyLength is, the lower the processing speed. But as a rule of thumb, it won't introduce too much performance impact when redundancyLength <= 5.</param>
    /// <param name="removeAllMatchedTokens">If set to true, when getting a matched keyword, all the related tokens will be removed. Otherwise only the part matching the keyword will be removed.</param>
    public KeywordTextOutputStreamTransform(IEnumerable<string> keywords, int redundancyLength = 3, bool removeAllMatchedTokens = false)
    {
        _keywords = new(keywords);
        _maxKeywordLength = keywords.Select(x => x.Length).Max() + redundancyLength;
        _removeAllMatchedTokens = removeAllMatchedTokens;
    }

    /// <inheritdoc />
    public IEnumerable<string> Transform(IEnumerable<string> tokens)
    {
        var window = new Queue<string>();

        foreach (var s in tokens)
        {
            window.Enqueue(s);
            var current = string.Join("", window);
            if (_keywords.Any(x => current.Contains(x)))
            {
                var matchedKeyword = _keywords.First(x => current.Contains(x));
                int total = window.Count;
                for (int i = 0; i < total; i++)
                {
                    window.Dequeue();
                }
                if (!_removeAllMatchedTokens)
                {
                    yield return current.Replace(matchedKeyword, "");
                }
            }
            if (current.Length >= _maxKeywordLength)
            {
                if (_keywords.Any(x => current.Contains(x)))
                {
                    var matchedKeyword = _keywords.First(x => current.Contains(x));
                    int total = window.Count;
                    for (int i = 0; i < total; i++)
                    {
                        window.Dequeue();
                    }
                    if (!_removeAllMatchedTokens)
                    {
                        yield return current.Replace(matchedKeyword, "");
                    }
                }
                else
                {
                    int total = window.Count;
                    for (int i = 0; i < total; i++)
                    {
                        yield return window.Dequeue();
                    }
                }
            }
        }
        int totalCount = window.Count;
        for (int i = 0; i < totalCount; i++)
        {
            yield return window.Dequeue();
        }
    }

    /// <inheritdoc />
    public async IAsyncEnumerable<string> TransformAsync(IAsyncEnumerable<string> tokens)
    {
        throw new NotImplementedException(); // This is implemented in `LLamaTransforms` but we ignore it here.
    }
}
```

## History transform

The chat history can be converted to or from text, which is exactly what this interface does.

```cs
public interface IHistoryTransform
{
    string HistoryToText(ChatHistory history);
    ChatHistory TextToHistory(AuthorRole role, string text);
}
```

Similar to the output transform, the history transform is added in the following way:

```cs
session.WithHistoryTransform(new MyHistoryTransform());
```

The implementation is quite flexible, depending on what you want the history message to be like. Here's an example, which is the default history transform in LLamaSharp.

```cs
/// <summary>
/// The default history transform.
/// Uses plain text with the following format:
/// [Author]: [Message]
/// </summary>
public class DefaultHistoryTransform : IHistoryTransform
{
    private readonly string defaultUserName = "User";
    private readonly string defaultAssistantName = "Assistant";
    private readonly string defaultSystemName = "System";
    private readonly string defaultUnknownName = "??";

    string _userName;
    string _assistantName;
    string _systemName;
    string _unknownName;
    bool _isInstructMode;

    public DefaultHistoryTransform(string? userName = null, string? assistantName = null,
        string? systemName = null, string? unknownName = null, bool isInstructMode = false)
    {
        _userName = userName ?? defaultUserName;
        _assistantName = assistantName ?? defaultAssistantName;
        _systemName = systemName ?? defaultSystemName;
        _unknownName = unknownName ?? defaultUnknownName;
        _isInstructMode = isInstructMode;
    }

    public virtual string HistoryToText(ChatHistory history)
    {
        StringBuilder sb = new();
        foreach (var message in history.Messages)
        {
            if (message.AuthorRole == AuthorRole.User)
            {
                sb.AppendLine($"{_userName}: {message.Content}");
            }
            else if (message.AuthorRole == AuthorRole.System)
            {
                sb.AppendLine($"{_systemName}: {message.Content}");
            }
            else if (message.AuthorRole == AuthorRole.Unknown)
            {
                sb.AppendLine($"{_unknownName}: {message.Content}");
            }
            else if (message.AuthorRole == AuthorRole.Assistant)
            {
                sb.AppendLine($"{_assistantName}: {message.Content}");
            }
        }
        return sb.ToString();
    }

    public virtual ChatHistory TextToHistory(AuthorRole role, string text)
    {
        ChatHistory history = new ChatHistory();
        history.AddMessage(role, TrimNamesFromText(text, role));
        return history;
    }

    public virtual string TrimNamesFromText(string text, AuthorRole role)
    {
        if (role == AuthorRole.User && text.StartsWith($"{_userName}:"))
        {
            text = text.Substring($"{_userName}:".Length).TrimStart();
        }
        else if (role == AuthorRole.Assistant && text.EndsWith($"{_assistantName}:"))
        {
            text = text.Substring(0, text.Length - $"{_assistantName}:".Length).TrimEnd();
        }
        if (_isInstructMode && role == AuthorRole.Assistant && text.EndsWith("\n> "))
        {
            text = text.Substring(0, text.Length - "\n> ".Length).TrimEnd();
        }
        return text;
    }
}
```

+ 65
- 0
docs/ContributingGuide.md View File

@@ -0,0 +1,65 @@
# LLamaSharp Contributing Guide

Hi, welcome to developing LLamaSharp together with us! We are always open to every contributor and any form of contribution! If you want to actively help maintain this library, please contact us to get write access after some PRs. (Email: AsakusaRinne@gmail.com)

On this page, we'd like to introduce how to make contributions here easily. 😊

## Compile the native library from source

Firstly, please clone the [llama.cpp](https://github.com/ggerganov/llama.cpp) repository and follow the instructions in the [llama.cpp readme](https://github.com/ggerganov/llama.cpp#build) to configure your local environment.

If you want to support cuBLAS in the compilation, please make sure that you've installed CUDA.

When building from source, please add `-DBUILD_SHARED_LIBS=ON` to the cmake command. For example, when building with cuBLAS but without OpenBLAS, use the following command:

```bash
cmake .. -DLLAMA_CUBLAS=ON -DBUILD_SHARED_LIBS=ON
```

After running `cmake --build . --config Release`, you can find `llama.dll`, `llama.so` or `llama.dylib` in your build directory. After copying it to `LLamaSharp/LLama/runtimes` and renaming it to `libllama.dll`, `libllama.so` or `libllama.dylib`, you can use it as the native library in LLamaSharp.


## Add a new feature to LLamaSharp

After refactoring the framework in `v0.4.0`, LLamaSharp will try to maintain backward compatibility. However, in the following cases a breaking change is okay:

1. Due to some breaking changes in [llama.cpp](https://github.com/ggerganov/llama.cpp), making a breaking change will help to maintain a good abstraction and friendly user APIs.
2. A very important feature cannot be implemented without refactoring some parts.
3. After some discussions, an agreement was reached that making the breaking change is reasonable.

If a new feature can be added without introducing any breaking change, please **open a PR** rather than opening an issue first. We will never refuse a PR but will help to improve it, unless it's malicious.

When adding the feature, please take care of the namespace and the naming convention. For example, if you are adding an integration for WPF, please put the code under the namespace `LLama.WPF` or `LLama.Integration.WPF` instead of the root namespace. The naming convention of LLamaSharp follows the Pascal naming convention, but in some parts that are invisible to users, you can do whatever you want.

## Find the problem and fix the BUG

If the issue is related to the LLM's internal behaviors, such as endlessly generating the response, the best way to find the problem is to do a comparison test between llama.cpp and LLamaSharp.

You could use exactly the same prompt, the same model and the same parameters to run the inference in llama.cpp and LLamaSharp respectively to see if it's really a problem caused by the implementation in LLamaSharp.

If the experiment shows that it works well in llama.cpp but not in LLamaSharp, the search for the problem can begin. While the reason for the problem could vary, the best way, I think, is to add log prints in the code of llama.cpp and use it in LLamaSharp after compilation. Thus, when running LLamaSharp, you can see what happened inside the native library.

After finding out the reason, a painful but happy process comes. When working on the BUG fix, there's only one rule to follow, that is, keeping the examples working well. If the modification fixes the BUG but impacts other functions, it is not a good fix.

During the BUG fix process, please don't hesitate to discuss together when you get stuck on something.

## Add integrations

All kinds of integrations are welcome here! Currently the following integrations are in progress or on our schedule:

1. BotSharp
2. semantic-kernel
3. Unity

Besides, for some other integrations, like `ASP.NET Core`, `SQL`, `Blazor` and so on, we'd appreciate it if you could help with them. If your time is limited, providing an example also means a lot!

## Add examples

There are mainly two ways to add an example:

1. Add the example to `LLama.Examples` in the repository.
2. Put the example in another repository and add the link to the readme or docs of LLamaSharp.

## Add documents

LLamaSharp uses [mkdocs](https://github.com/mkdocs/mkdocs) to build the documentation. Please follow the mkdocs tutorial to add or modify documents in LLamaSharp.

+ 110
- 0
docs/GetStarted.md View File

@@ -0,0 +1,110 @@
# Get Started

## Install packages

Firstly, search for `LLamaSharp` in the NuGet package manager and install it.

```
PM> Install-Package LLamaSharp
```

Then, search and install one of the following backends:

```
LLamaSharp.Backend.Cpu
LLamaSharp.Backend.Cuda11
LLamaSharp.Backend.Cuda12
```

Here's the mapping between versions and the corresponding model samples provided by `LLamaSharp`. If you're not sure which model works with a version, please try our sample model.

| LLamaSharp.Backend | LLamaSharp | Verified Model Resources | llama.cpp commit id |
| - | - | -- | - |
| - | v0.2.0 | This version is not recommended to use. | - |
| - | v0.2.1 | [WizardLM](https://huggingface.co/TheBloke/wizardLM-7B-GGML/tree/previous_llama), [Vicuna (filenames with "old")](https://huggingface.co/eachadea/ggml-vicuna-13b-1.1/tree/main) | - |
| v0.2.2 | v0.2.2, v0.2.3 | [WizardLM](https://huggingface.co/TheBloke/wizardLM-7B-GGML/tree/previous_llama_ggmlv2), [Vicuna (filenames without "old")](https://huggingface.co/eachadea/ggml-vicuna-13b-1.1/tree/main) | 63d2046 |
| v0.3.0 | v0.3.0 | [LLamaSharpSamples v0.3.0](https://huggingface.co/AsakusaRinne/LLamaSharpSamples/tree/v0.3.0), [WizardLM](https://huggingface.co/TheBloke/wizardLM-7B-GGML/tree/main) | 7e4ea5b |


## Download a model

Any of the following models should work:

- LLaMA 🦙
- [Alpaca](https://github.com/ggerganov/llama.cpp#instruction-mode-with-alpaca)
- [GPT4All](https://github.com/ggerganov/llama.cpp#using-gpt4all)
- [Chinese LLaMA / Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca)
- [Vigogne (French)](https://github.com/bofenghuang/vigogne)
- [Vicuna](https://github.com/ggerganov/llama.cpp/discussions/643#discussioncomment-5533894)
- [Koala](https://bair.berkeley.edu/blog/2023/04/03/koala/)
- [OpenBuddy 🐶 (Multilingual)](https://github.com/OpenBuddy/OpenBuddy)
- [Pygmalion 7B / Metharme 7B](#using-pygmalion-7b--metharme-7b)
- [WizardLM](https://github.com/nlpxucan/WizardLM)

**Note that because `llama.cpp` is under fast development and often introduces breaking changes, some model weights on Hugging Face that work under one version may be invalid with another version. If it's your first time configuring LLamaSharp, we suggest using the verified model weights in the table above.**

## Run the program

Please create a console program targeting netstandard2.0 or higher (net6.0 or higher is recommended). Then, paste the following code into `Program.cs`:

```cs
using LLama.Common;
using LLama;

string modelPath = "<Your model path>"; // change it to your own model path
var prompt = "Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.\r\n\r\nUser: Hello, Bob.\r\nBob: Hello. How may I help you today?\r\nUser: Please tell me the largest city in Europe.\r\nBob: Sure. The largest city in Europe is Moscow, the capital of Russia.\r\nUser:"; // use the "chat-with-bob" prompt here.

// Initialize a chat session
var ex = new InteractiveExecutor(new LLamaModel(new ModelParams(modelPath, contextSize: 1024, seed: 1337, gpuLayerCount: 5)));
ChatSession session = new ChatSession(ex);

// show the prompt
Console.WriteLine();
Console.Write(prompt);

// run the inference in a loop to chat with LLM
while (true)
{
    foreach (var text in session.Chat(prompt, new InferenceParams() { Temperature = 0.6f, AntiPrompts = new List<string> { "User:" } }))
    {
        Console.Write(text);
    }

    Console.ForegroundColor = ConsoleColor.Green;
    prompt = Console.ReadLine();
    Console.ForegroundColor = ConsoleColor.White;
}
```

After starting it, you'll see the following outputs.

```
Please input your model path: D:\development\llama\weights\wizard-vicuna-13B.ggmlv3.q4_1.bin
llama.cpp: loading model from D:\development\llama\weights\wizard-vicuna-13B.ggmlv3.q4_1.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 1024
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 3 (mostly Q4_1)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 7759.48 MB
llama_model_load_internal: mem required = 9807.48 MB (+ 1608.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size = 800.00 MB

Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.

User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User:
```

Now, enjoy chatting with LLM!

+ 3
- 0
docs/HighLevelApps/bot-sharp.md View File

@@ -0,0 +1,3 @@
# The Usage of BotSharp Integration

This document is still in progress. Please wait a bit. Thank you for your support! :)

+ 69
- 0
docs/LLamaExecutors/differences.md View File

@@ -0,0 +1,69 @@
## Differences between the executors

There are currently three kinds of executors provided: `InteractiveExecutor`, `InstructExecutor` and `StatelessExecutor`.

In short, `InteractiveExecutor` is suitable for continuously getting answers to your questions from the LLM. `InstructExecutor` lets the LLM execute your instructions, such as "continue writing". `StatelessExecutor` is best for one-time jobs because previous inferences have no impact on the current one.
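
As a quick illustration, here's a minimal sketch of constructing the three executors over the same model (assuming `InstructExecutor` and `StatelessExecutor` are constructed the same way as `InteractiveExecutor`; the model path is a placeholder):

```cs
var model = new LLamaModel(new ModelParams("<modelPath>"));

var interactiveExecutor = new InteractiveExecutor(model); // keeps the conversation state
var instructExecutor = new InstructExecutor(model);       // follows instructions, also stateful
var statelessExecutor = new StatelessExecutor(model);     // no memory between calls
```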


## Interactive mode & Instruct mode

Both of them take "completing the prompt" as the goal when generating the response. For example, if you input `Long long ago, there was a fox who wanted to make friends with humans. One day`, then the LLM will continue to write the story.

Under interactive mode, you play the role of the user and the LLM plays the role of the assistant. It will then help you with your questions or requests.

Under instruct mode, you give the LLM some instructions and it follows them.

Though their behaviors sound similar, the difference matters depending on your prompt. For example, "chat-with-bob" performs well under interactive mode and `alpaca` does well under instruct mode.

```
// chat-with-bob

Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.

User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User:
```

```
// alpaca

Below is an instruction that describes a task. Write a response that appropriately completes the request.
```

Therefore, please modify the prompt correspondingly when switching from one mode to the other.

## Stateful mode and stateless mode

Despite the differences between interactive mode and instruct mode, both of them are stateful. That is, your previous questions/instructions will impact the current response from the LLM. On the contrary, the stateless executor does not have such a "memory". No matter how many times you talk to it, it will only concentrate on what you say this time.

Since the stateless executor has no memory of previous conversations, you need to input your question together with the whole prompt to get a better answer.

For example, if you feed `Q: Who is Trump? A: ` to the stateless executor, it may give the following answer with the anti-prompt `Q: `.

```
Donald J. Trump, born June 14, 1946, is an American businessman, television personality, politician and the 45th President of the United States (2017-2021). # Anexo:Torneo de Hamburgo 2022 (individual masculino)

## Presentación previa

* Defensor del título: Daniil Medvédev
```

It seems that things went well at first. However, after answering the question itself, the LLM began to talk about other things until the answer reached the token count limit. The reason for this strange behavior is that the anti-prompt cannot be matched. With this input, the LLM cannot decide whether to append the string "A: " at the end of the response.

As an improvement, let's take the following text as the input:

```
Q: What is the capital of the USA? A: Washington. Q: What is the sum of 1 and 2? A: 3. Q: Who is Trump? A: 
```

Then, I got the following answer with the anti-prompt `Q: `.

```
45th president of the United States.
```

This time, by repeating the same pattern of `Q: xxx? A: xxx.`, the LLM outputs the anti-prompt we want, which helps to decide where to stop the generation.
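
To make this concrete, here's a minimal sketch of running the improved few-shot prompt with a stateless executor (assuming `StatelessExecutor` is constructed from a `LLamaModel` like the other executors; the model path is a placeholder):

```cs
var executor = new StatelessExecutor(new LLamaModel(new ModelParams("<modelPath>")));
string prompt = "Q: What is the capital of the USA? A: Washington. Q: What is the sum of 1 and 2? A: 3. Q: Who is Trump? A: ";

// The anti-prompt "Q: " tells the executor where to stop once the model starts a new question.
foreach (var text in executor.Infer(prompt, new InferenceParams() { AntiPrompts = new List<string> { "Q: " } }))
{
    Console.Write(text);
}
```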


+ 261
- 0
docs/LLamaExecutors/parameters.md View File

@@ -0,0 +1,261 @@
# Inference Parameters

Different from `LLamaModel`, when using an executor, `InferenceParams` is passed to the `Infer` method instead of the constructor. This is because executors only define the way to run the model, so in each run you can change the settings for that particular inference.
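
For example, here's a minimal sketch of changing the settings between two runs of the same executor (the prompts and values are illustrative):

```cs
var executor = new InteractiveExecutor(new LLamaModel(new ModelParams("<modelPath>")));

// First run: more creative output.
foreach (var text in executor.Infer("Write a short poem about the sea.", new InferenceParams() { Temperature = 0.8f }))
{
    Console.Write(text);
}

// Second run: more deterministic output with a hard limit on length.
foreach (var text in executor.Infer("Summarize the poem in one sentence.", new InferenceParams() { Temperature = 0.2f, MaxTokens = 64 }))
{
    Console.Write(text);
}
```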


# InferenceParams

Namespace: LLama.Common

```csharp
public class InferenceParams
```

Inheritance [Object](https://docs.microsoft.com/en-us/dotnet/api/system.object) → [InferenceParams]()

## Properties

### **TokensKeep**

number of tokens to keep from initial prompt

```csharp
public int TokensKeep { get; set; }
```

#### Property Value

[Int32](https://docs.microsoft.com/en-us/dotnet/api/system.int32)<br>

### **MaxTokens**

how many new tokens to predict (n_predict); set to -1 to generate responses infinitely
until completion.

```csharp
public int MaxTokens { get; set; }
```

#### Property Value

[Int32](https://docs.microsoft.com/en-us/dotnet/api/system.int32)<br>

### **LogitBias**

logit bias for specific tokens

```csharp
public Dictionary<int, float> LogitBias { get; set; }
```

#### Property Value

[Dictionary&lt;Int32, Single&gt;](https://docs.microsoft.com/en-us/dotnet/api/system.collections.generic.dictionary-2)<br>

### **AntiPrompts**

Sequences where the model will stop generating further tokens.

```csharp
public IEnumerable<string> AntiPrompts { get; set; }
```

#### Property Value

[IEnumerable&lt;String&gt;](https://docs.microsoft.com/en-us/dotnet/api/system.collections.generic.ienumerable-1)<br>

### **PathSession**

path to file for saving/loading model eval state

```csharp
public string PathSession { get; set; }
```

#### Property Value

[String](https://docs.microsoft.com/en-us/dotnet/api/system.string)<br>

### **InputSuffix**

string to suffix user inputs with

```csharp
public string InputSuffix { get; set; }
```

#### Property Value

[String](https://docs.microsoft.com/en-us/dotnet/api/system.string)<br>

### **InputPrefix**

string to prefix user inputs with

```csharp
public string InputPrefix { get; set; }
```

#### Property Value

[String](https://docs.microsoft.com/en-us/dotnet/api/system.string)<br>

### **TopK**

0 or lower to use vocab size

```csharp
public int TopK { get; set; }
```

#### Property Value

[Int32](https://docs.microsoft.com/en-us/dotnet/api/system.int32)<br>

### **TopP**

1.0 = disabled

```csharp
public float TopP { get; set; }
```

#### Property Value

[Single](https://docs.microsoft.com/en-us/dotnet/api/system.single)<br>

### **TfsZ**

1.0 = disabled

```csharp
public float TfsZ { get; set; }
```

#### Property Value

[Single](https://docs.microsoft.com/en-us/dotnet/api/system.single)<br>

### **TypicalP**

1.0 = disabled

```csharp
public float TypicalP { get; set; }
```

#### Property Value

[Single](https://docs.microsoft.com/en-us/dotnet/api/system.single)<br>

### **Temperature**

1.0 = disabled

```csharp
public float Temperature { get; set; }
```

#### Property Value

[Single](https://docs.microsoft.com/en-us/dotnet/api/system.single)<br>

### **RepeatPenalty**

1.0 = disabled

```csharp
public float RepeatPenalty { get; set; }
```

#### Property Value

[Single](https://docs.microsoft.com/en-us/dotnet/api/system.single)<br>

### **RepeatLastTokensCount**

last n tokens to penalize (0 = disable penalty, -1 = context size) (repeat_last_n)

```csharp
public int RepeatLastTokensCount { get; set; }
```

#### Property Value

[Int32](https://docs.microsoft.com/en-us/dotnet/api/system.int32)<br>

### **FrequencyPenalty**

frequency penalty coefficient
0.0 = disabled

```csharp
public float FrequencyPenalty { get; set; }
```

#### Property Value

[Single](https://docs.microsoft.com/en-us/dotnet/api/system.single)<br>

### **PresencePenalty**

presence penalty coefficient
0.0 = disabled

```csharp
public float PresencePenalty { get; set; }
```

#### Property Value

[Single](https://docs.microsoft.com/en-us/dotnet/api/system.single)<br>

### **Mirostat**

Mirostat uses tokens instead of words.
The algorithm is described in the paper https://arxiv.org/abs/2007.14966.
0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0

```csharp
public MiroStateType Mirostat { get; set; }
```

#### Property Value

[MiroStateType]()<br>

### **MirostatTau**

target entropy

```csharp
public float MirostatTau { get; set; }
```

#### Property Value

[Single](https://docs.microsoft.com/en-us/dotnet/api/system.single)<br>

### **MirostatEta**

learning rate

```csharp
public float MirostatEta { get; set; }
```

#### Property Value

[Single](https://docs.microsoft.com/en-us/dotnet/api/system.single)<br>

### **PenalizeNL**

consider newlines as a repeatable token (penalize_nl)

```csharp
public bool PenalizeNL { get; set; }
```

#### Property Value

[Boolean](https://docs.microsoft.com/en-us/dotnet/api/system.boolean)<br>

+ 27
- 0
docs/LLamaExecutors/save-load-state.md View File

@@ -0,0 +1,27 @@
# Save/Load State of Executor

Similar to `LLamaModel`, an executor also has its state, which can be saved and loaded. **Note that in most cases, the state of the executor and the state of the model should be loaded and saved at the same time.**

To decouple the model and the executor, we provide APIs to save/load the state of the model and the executor respectively. However, during the inference, the processed information leaves a footprint in `LLamaModel`'s native context. Therefore, if you just load a state from another executor but keep the model unmodified, some strange things may happen. The same applies to loading only the model state.

Is there a case that requires loading only one of them? The answer is YES. For example, after resetting the model state, if you don't want the inference to start from the new position, leaving the executor unmodified is okay. Anyway, this flexible usage may cause some unexpected behaviors, so please ensure you know what you're doing before using it in this way.

In future versions, we'll open access to some variables inside the executor to support more flexible usage.

The APIs to load/save the state of the executors are similar to those of `LLamaModel`. However, note that `StatelessExecutor` doesn't have such APIs because it's stateless itself. Besides, the output of `GetStateData` is an object of type `ExecutorBaseState`.

```cs
LLamaModel model = new LLamaModel(new ModelParams("<modelPath>"));
InteractiveExecutor executor = new InteractiveExecutor(model);
// do some things...
executor.SaveState("executor.st");
var stateData = executor.GetStateData();

InteractiveExecutor executor2 = new InteractiveExecutor(model);
executor2.LoadState(stateData);
// do some things...

InteractiveExecutor executor3 = new InteractiveExecutor(model);
executor3.LoadState("executor.st");
// do some things...
```
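
Following the note above, here's a minimal sketch of saving and restoring the model state and the executor state together (the file names are illustrative):

```cs
// Save both states after some inference.
model.SaveState("model.st");
executor.SaveState("executor.st");

// ... later, restore both before continuing the conversation.
model.LoadState("model.st");
executor.LoadState("executor.st");
```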

+ 18
- 0
docs/LLamaExecutors/text-to-text-apis.md View File

@@ -0,0 +1,18 @@
# Text-to-Text APIs of the executors

All the executors implement the interface `ILLamaExecutor`, which provides two APIs to execute text-to-text tasks.

```cs
public interface ILLamaExecutor
{
    public LLamaModel Model { get; }

    IEnumerable<string> Infer(string text, InferenceParams? inferenceParams = null, CancellationToken token = default);

    IAsyncEnumerable<string> InferAsync(string text, InferenceParams? inferenceParams = null, CancellationToken token = default);
}
```

Just pass the text to the executor with the inference parameters. For the inference parameters, please refer to [executor inference parameters doc](./parameters.md).

The output of both APIs is a **yielded enumerable**. Therefore, when receiving the output, you can directly use `foreach` to act on each word as it arrives, instead of waiting for the whole process to complete.
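
For example, here's a minimal sketch of consuming the streaming output (assuming an `executor` has already been created; the async variant requires a project that supports `await foreach`):

```cs
// Synchronous streaming.
foreach (var text in executor.Infer("What is C#?", new InferenceParams() { AntiPrompts = new List<string> { "User:" } }))
{
    Console.Write(text);
}

// Asynchronous streaming.
await foreach (var text in executor.InferAsync("What is C#?", new InferenceParams() { AntiPrompts = new List<string> { "User:" } }))
{
    Console.Write(text);
}
```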

+ 13
- 0
docs/LLamaModel/embeddings.md View File

@@ -0,0 +1,13 @@
# Get Embeddings

Getting the embeddings of a text from an LLM is sometimes useful, for example, to train other MLP models.

To get the embeddings, please initialize a `LLamaEmbedder` and then call `GetEmbeddings`.

```cs
var embedder = new LLamaEmbedder(new ModelParams("<modelPath>"));
string text = "hello, LLM.";
float[] embeddings = embedder.GetEmbeddings(text);
```

The output is a float array. Note that the length of the array is determined by the model you load. If you want a smaller embedding size, please consider using a different model.

+ 208
- 0
docs/LLamaModel/parameters.md View File

@@ -0,0 +1,208 @@
# LLamaModel Parameters

When initializing a `LLamaModel` object, there are three parameters: `ModelParams Params, string encoding = "UTF-8", ILLamaLogger? logger = null`.

The usage of `logger` is further introduced in the [logger doc](../More/log.md). The `encoding` is the encoding you want to use when dealing with text via this model.

The most important of all is `ModelParams`, which is defined below. We'll explain the parameters step by step in this document.

```cs
public class ModelParams
{
    public int ContextSize { get; set; } = 512;
    public int GpuLayerCount { get; set; } = 20;
    public int Seed { get; set; } = 1686349486;
    public bool UseFp16Memory { get; set; } = true;
    public bool UseMemorymap { get; set; } = true;
    public bool UseMemoryLock { get; set; } = false;
    public bool Perplexity { get; set; } = false;
    public string ModelPath { get; set; }
    public string LoraAdapter { get; set; } = string.Empty;
    public string LoraBase { get; set; } = string.Empty;
    public int Threads { get; set; } = Math.Max(Environment.ProcessorCount / 2, 1);
    public int BatchSize { get; set; } = 512;
    public bool ConvertEosToNewLine { get; set; } = false;
}
```
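
For reference, here's a minimal sketch of constructing a model with these parameters (the values are illustrative and should be adjusted to your hardware and model):

```cs
var parameters = new ModelParams("<modelPath>", contextSize: 1024, gpuLayerCount: 20, seed: 1337);
var model = new LLamaModel(parameters, "UTF-8", logger: null); // encoding and logger are optional
```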


# ModelParams

Namespace: LLama.Common

```csharp
public class ModelParams
```

Inheritance [Object](https://docs.microsoft.com/en-us/dotnet/api/system.object) → [ModelParams]()

## Properties

### **ContextSize**

Model context size (n_ctx)

```csharp
public int ContextSize { get; set; }
```

#### Property Value

[Int32](https://docs.microsoft.com/en-us/dotnet/api/system.int32)<br>

### **GpuLayerCount**

Number of layers to run in VRAM / GPU memory (n_gpu_layers)

```csharp
public int GpuLayerCount { get; set; }
```

#### Property Value

[Int32](https://docs.microsoft.com/en-us/dotnet/api/system.int32)<br>

### **Seed**

Seed for the random number generator (seed)

```csharp
public int Seed { get; set; }
```

#### Property Value

[Int32](https://docs.microsoft.com/en-us/dotnet/api/system.int32)<br>

### **UseFp16Memory**

Use f16 instead of f32 for memory kv (memory_f16)

```csharp
public bool UseFp16Memory { get; set; }
```

#### Property Value

[Boolean](https://docs.microsoft.com/en-us/dotnet/api/system.boolean)<br>

### **UseMemorymap**

Use mmap for faster loads (use_mmap)

```csharp
public bool UseMemorymap { get; set; }
```

#### Property Value

[Boolean](https://docs.microsoft.com/en-us/dotnet/api/system.boolean)<br>

### **UseMemoryLock**

Use mlock to keep model in memory (use_mlock)

```csharp
public bool UseMemoryLock { get; set; }
```

#### Property Value

[Boolean](https://docs.microsoft.com/en-us/dotnet/api/system.boolean)<br>

### **Perplexity**

Compute perplexity over the prompt (perplexity)

```csharp
public bool Perplexity { get; set; }
```

#### Property Value

[Boolean](https://docs.microsoft.com/en-us/dotnet/api/system.boolean)<br>

### **ModelPath**

Model path (model)

```csharp
public string ModelPath { get; set; }
```

#### Property Value

[String](https://docs.microsoft.com/en-us/dotnet/api/system.string)<br>

### **LoraAdapter**

lora adapter path (lora_adapter)

```csharp
public string LoraAdapter { get; set; }
```

#### Property Value

[String](https://docs.microsoft.com/en-us/dotnet/api/system.string)<br>

### **LoraBase**

base model path for the lora adapter (lora_base)

```csharp
public string LoraBase { get; set; }
```

#### Property Value

[String](https://docs.microsoft.com/en-us/dotnet/api/system.string)<br>

### **Threads**

Number of threads (-1 = autodetect) (n_threads)

```csharp
public int Threads { get; set; }
```

#### Property Value

[Int32](https://docs.microsoft.com/en-us/dotnet/api/system.int32)<br>

### **BatchSize**

batch size for prompt processing (must be &gt;=32 to use BLAS) (n_batch)

```csharp
public int BatchSize { get; set; }
```

#### Property Value

[Int32](https://docs.microsoft.com/en-us/dotnet/api/system.int32)<br>

### **ConvertEosToNewLine**

Whether to convert eos to newline during the inference.

```csharp
public bool ConvertEosToNewLine { get; set; }
```

#### Property Value

[Boolean](https://docs.microsoft.com/en-us/dotnet/api/system.boolean)<br>

### **EmbeddingMode**

Whether to use embedding mode. (embedding) Note that if this is set to true,
the `LLamaModel` won't produce text responses anymore.

```csharp
public bool EmbeddingMode { get; set; }
```

#### Property Value

[Boolean](https://docs.microsoft.com/en-us/dotnet/api/system.boolean)<br>

+ 23
- 0
docs/LLamaModel/quantization.md View File

@@ -0,0 +1,23 @@
# Quantization

Quantization is significant for accelerating model inference. Since there's little accuracy (performance) reduction when quantizing the model, don't hesitate to quantize it!

To quantize the model, please call `Quantize` from `LLamaQuantizer`, which is a static method.

```cs
string srcPath = "<model.bin>";
string dstPath = "<model_q4_0.bin>";
LLamaQuantizer.Quantize(srcPath, dstPath, "q4_0");
// The following overload is also okay.
// LLamaQuantizer.Quantize(srcPath, dstPath, LLamaFtype.LLAMA_FTYPE_MOSTLY_Q4_0);
```

After calling it, a quantized model file will be saved.

There are currently 5 types of quantization supported:

- q4_0
- q4_1
- q5_0
- q5_1
- q8_0

+ 19
- 0
docs/LLamaModel/save-load-state.md View File

@@ -0,0 +1,19 @@
# Save/Load State

There are two ways to load state: loading from a path and loading from a byte array. Correspondingly, state data can be extracted as a byte array or saved to a file.

```cs
LLamaModel model = new LLamaModel(new ModelParams("<modelPath>"));
// do some things...
model.SaveState("model.st");
var stateData = model.GetStateData();
model.Dispose();

LLamaModel model2 = new LLamaModel(new ModelParams("<modelPath>"));
model2.LoadState(stateData);
// do some things...

LLamaModel model3 = new LLamaModel(new ModelParams("<modelPath>"));
model3.LoadState("model.st");
// do some things...
```

+ 25
- 0
docs/LLamaModel/tokenization.md View File

@@ -0,0 +1,25 @@
# Tokenization/Detokenization

A pair of APIs to convert between text and tokens.

## Tokenization

The basic usage is to call `Tokenize` after initializing the model.

```cs
LLamaModel model = new LLamaModel(new ModelParams("<modelPath>"));
string text = "hello";
int[] tokens = model.Tokenize(text).ToArray();
```

Depending on the model (or vocab), the output will vary.

## Detokenization

Similar to tokenization, just pass an `IEnumerable<int>` to `Detokenize` method.

```cs
LLamaModel model = new LLamaModel(new ModelParams("<modelPath>"));
int[] tokens = new int[] {125, 2568, 13245};
string text = model.Detokenize(tokens);
```

+ 163
- 0
docs/More/log.md View File

@@ -0,0 +1,163 @@
# The Logger in LLamaSharp

LLamaSharp supports customized loggers because it could be used in many kinds of applications, like WinForms/WPF, Web API and Blazor, where the preferred logger varies.

## Define customized logger

What you need to do is implement the `ILLamaLogger` interface.

```cs
public interface ILLamaLogger
{
    public enum LogLevel
    {
        Info,
        Debug,
        Warning,
        Error
    }
    void Log(string source, string message, LogLevel level);
}
```

The `source` specifies where the log message is from, which could be a function, a class, etc.

The `message` is the log message itself.

The `level` is the level of the information in the log. As shown above, there are four levels: `Info`, `Debug`, `Warning` and `Error`.

The following is a simple example of the logger implementation:

```cs
public sealed class LLamaDefaultLogger : ILLamaLogger
{
    private static readonly Lazy<LLamaDefaultLogger> _instance = new Lazy<LLamaDefaultLogger>(() => new LLamaDefaultLogger());

    private bool _toConsole = true;
    private bool _toFile = false;

    private FileStream? _fileStream = null;
    private StreamWriter _fileWriter = null;

    public static LLamaDefaultLogger Default => _instance.Value;

    private LLamaDefaultLogger()
    {

    }

    public LLamaDefaultLogger EnableConsole()
    {
        _toConsole = true;
        return this;
    }

    public LLamaDefaultLogger DisableConsole()
    {
        _toConsole = false;
        return this;
    }

    public LLamaDefaultLogger EnableFile(string filename, FileMode mode = FileMode.Append)
    {
        _fileStream = new FileStream(filename, mode, FileAccess.Write);
        _fileWriter = new StreamWriter(_fileStream);
        _toFile = true;
        return this;
    }

    public LLamaDefaultLogger DisableFile(string filename)
    {
        if (_fileWriter is not null)
        {
            _fileWriter.Close();
            _fileWriter = null;
        }
        if (_fileStream is not null)
        {
            _fileStream.Close();
            _fileStream = null;
        }
        _toFile = false;
        return this;
    }

    public void Log(string source, string message, LogLevel level)
    {
        if (level == LogLevel.Info)
        {
            Info(message);
        }
        else if (level == LogLevel.Debug)
        {

        }
        else if (level == LogLevel.Warning)
        {
            Warn(message);
        }
        else if (level == LogLevel.Error)
        {
            Error(message);
        }
    }

    public void Info(string message)
    {
        message = MessageFormat("info", message);
        if (_toConsole)
        {
            Console.ForegroundColor = ConsoleColor.White;
            Console.WriteLine(message);
            Console.ResetColor();
        }
        if (_toFile)
        {
            Debug.Assert(_fileStream is not null);
            Debug.Assert(_fileWriter is not null);
            _fileWriter.WriteLine(message);
        }
    }

    public void Warn(string message)
    {
        message = MessageFormat("warn", message);
        if (_toConsole)
        {
            Console.ForegroundColor = ConsoleColor.Yellow;
            Console.WriteLine(message);
            Console.ResetColor();
        }
        if (_toFile)
        {
            Debug.Assert(_fileStream is not null);
            Debug.Assert(_fileWriter is not null);
            _fileWriter.WriteLine(message);
        }
    }

    public void Error(string message)
    {
        message = MessageFormat("error", message);
        if (_toConsole)
        {
            Console.ForegroundColor = ConsoleColor.Red;
            Console.WriteLine(message);
            Console.ResetColor();
        }
        if (_toFile)
        {
            Debug.Assert(_fileStream is not null);
            Debug.Assert(_fileWriter is not null);
            _fileWriter.WriteLine(message);
        }
    }

    private string MessageFormat(string level, string message)
    {
        DateTime now = DateTime.Now;
        string formattedDate = now.ToString("yyyy.MM.dd HH:mm:ss");
        return $"[{formattedDate}][{level}]: {message}";
    }
}
```
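
Here's a minimal sketch of enabling the default logger and passing it to a model (the log file name and model path are illustrative):

```cs
var logger = LLamaDefaultLogger.Default.EnableConsole().EnableFile("llama.log");
var model = new LLamaModel(new ModelParams("<modelPath>"), "UTF-8", logger);
logger.Log("Example", "The model has been loaded.", ILLamaLogger.LogLevel.Info);
```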

+ 3
- 0
docs/NonEnglishUsage/Chinese.md View File

@@ -0,0 +1,3 @@
# Use LLamaSharp with Chinese

It's supported now but the document is still in progress. Please wait a while. Thank you for your support! :)

+ 44
- 0
docs/Tricks.md View File

@@ -0,0 +1,44 @@
# Tricks for FAQ

Sometimes, your application built with an LLM and LLamaSharp may behave strangely. Before opening an issue to report the BUG, the following tricks may be worth a try.


## Carefully set the anti-prompts

The anti-prompt can also be called the "stop keyword", which decides when to stop the response generation. Under interactive mode, the maximum token count is usually not set, which lets the LLM generate responses infinitely. Therefore, setting the anti-prompt correctly helps a lot to avoid strange behaviors. For example, the prompt file `chat-with-bob.txt` has the following content:

```
Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.

User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User:
```

Therefore, the anti-prompt should be set to "User:". If the last line of the prompt is removed, the LLM will automatically generate a question (user) and a response (bob) once when running the chat session. Therefore, it's suggested to append the anti-prompt to the prompt when starting a chat session.

What if an extra line is appended? The string "User:" in the prompt will be followed by the character "\n". Thus, when running the model, the automatic generation of a pair of question and response may appear because the anti-prompt is "User:" but the last token is "User:\n". Whether it appears is undefined behavior, which depends on the implementation inside the `LLamaExecutor`. Anyway, since it may lead to unexpected behaviors, it's recommended to trim your prompt or keep it carefully consistent with your anti-prompt.
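
As a reference, here's a minimal sketch of setting the anti-prompt for the `chat-with-bob` prompt (assuming a `session` has already been created as in the Get Started doc):

```cs
var inferenceParams = new InferenceParams()
{
    Temperature = 0.6f,
    AntiPrompts = new List<string> { "User:" } // must match the last line of the prompt
};

foreach (var text in session.Chat(prompt, inferenceParams))
{
    Console.Write(text);
}
```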

## Pay attention to the length of prompt

Sometimes we want to input a long prompt to execute a task. However, the context size may limit the inference of the LLaMA model. Please ensure the inequality below holds.

$$ len(prompt) + len(response) < len(context) $$

In this inequality, `len(response)` refers to the expected number of tokens for the LLM to generate.
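
Since the inequality is measured in tokens, a rough check can be done with `Tokenize` before running the inference (a sketch; the expected response length is an estimate you make for your task):

```cs
var parameters = new ModelParams("<modelPath>", contextSize: 1024);
var model = new LLamaModel(parameters);

int promptTokens = model.Tokenize(prompt).Count();
int expectedResponseTokens = 256; // your own estimate
if (promptTokens + expectedResponseTokens >= parameters.ContextSize)
{
    Console.WriteLine("The prompt is too long for this context size; please shorten it.");
}
```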

## Try different executors with a prompt

Some prompts work well under interactive mode, such as `chat-with-bob`, while others may work well under instruct mode, such as `alpaca`. Besides, if your input is quite simple and a one-time job, such as "Q: what is the satellite of the earth? A: ", stateless mode will be a good choice.

If your chat bot performs badly, trying a different executor may make it work well.

## Choose model weights depending on your task

The differences between models may lead to very different behaviors under the same task. For example, if you're building a chat bot for a non-English language, a model specially fine-tuned for the language you want to use will have a huge effect on the performance.

## Set the layer count you want to offload to GPU

Currently, the `GpuLayerCount` parameter, which decides the number of layers loaded onto the GPU, is set to 20 by default. However, if you have a powerful GPU, setting it to a larger number will give you faster inference.
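
For example, here's a sketch of offloading more layers when enough VRAM is available (the value is illustrative):

```cs
// Offload more layers to the GPU for faster inference (requires enough VRAM).
var modelParams = new ModelParams("<modelPath>", gpuLayerCount: 40);
var executor = new InteractiveExecutor(new LLamaModel(modelParams));
```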

+ 36
- 0
docs/index.md View File

@@ -0,0 +1,36 @@
# Overview

![logo](./media/LLamaSharpLogo.png)

LLamaSharp is the C#/.NET binding of [llama.cpp](https://github.com/ggerganov/llama.cpp). It provides APIs to run inference with LLaMA models and deploy them in native environments or on the Web. It helps C# developers deploy LLMs (Large Language Models) locally and integrate them with C# apps.

## Main features

- Model inference
- Model quantization
- Generating embeddings
- Interactive/Instruct/Stateless executor mode
- Chat session APIs
- Save/load the state
- Integration with other applications like BotSharp and semantic-kernel

## Essential insights for novice learners

If you are new to LLMs, here are some tips to help you get started with `LLamaSharp`. If you are experienced in this field, we'd still recommend you take a few minutes to read it, because some things behave differently compared to the cpp/python versions.

1. The main ability of LLamaSharp is to provide an efficient way to run inference of LLMs (Large Language Models) locally (and to fine-tune models in the future). The model weights, however, need to be downloaded from other resources, like [huggingface](https://huggingface.co).
2. Since LLamaSharp supports multiple platforms, the NuGet package is split into `LLamaSharp` and `LLama.Backend`. After installing `LLamaSharp`, please install one of `LLama.Backend.Cpu`, `LLama.Backend.Cuda11` and `LLama.Backend.Cuda12`. If you use the source code, dynamic libraries can be found in `LLama/Runtimes`. Then rename the one you want to use to `libllama.dll`.
3. `LLaMa` originally refers to the weights released by Meta (Facebook Research). After that, many models were fine-tuned based on it, such as `Vicuna`, `GPT4All`, and `Pygmalion`. Though all of these models are supported by LLamaSharp, some steps are necessary for different file formats. There are mainly three kinds of files: `.pth`, `.bin (ggml)`, and `.bin (quantized)`. If you have the `.bin (quantized)` file, it can be used directly by LLamaSharp. If you have the `.bin (ggml)` file, you can use it directly, but you'll get higher inference speed after quantization. If you have the `.pth` file, you need to follow [the instructions in llama.cpp](https://github.com/ggerganov/llama.cpp#prepare-data--run) to convert it to a `.bin (ggml)` file first.
4. LLamaSharp supports GPU acceleration, but it requires a CUDA installation. Please install CUDA 11 or CUDA 12 on your system before using LLamaSharp to enable GPU support. If you have another CUDA version, you can compile llama.cpp from source to get the dll. For building from source, please refer to [issue #5](https://github.com/SciSharp/LLamaSharp/issues/5).

## Welcome to join the development!

Community effort is always one of the most important things in open-source projects. Any contribution in any form is welcome here. For example, the following things mean a lot for LLamaSharp:

1. Open an issue when you find something wrong.
2. Open a PR if you've fixed something. Even just correcting a typo means a lot.
3. Help to optimize the documentation.
4. Write an example or blog post about how to integrate LLamaSharp with your apps.
5. Ask for a missing feature and discuss it with other developers.

If you'd like to get deeply involved in development, please reach out to us in the Discord channel or send an email to `AsakusaRinne@gmail.com`. :)

BIN
docs/media/LLamaSharpLogo.png View File

Width: 1079  |  Height: 162  |  Size: 41 kB

BIN
docs/media/structure.jpg View File

Width: 1747  |  Height: 1400  |  Size: 165 kB

BIN
docs/media/structure.vsdx View File


+ 29
- 0
mkdocs.yml View File

@@ -0,0 +1,29 @@
site_name: LLamaSharp Documentation
nav:
  - Overview: index.md
  - Get Started: GetStarted.md
  - Architecture: Architecher.md
  - Tricks for FAQ: Tricks.md
  - Contributing Guide: ContributingGuide.md
  - LLamaModel:
    - Model Parameters: LLamaModel/parameters.md
    - Tokenization: LLamaModel/tokenization.md
    - Get Embeddings: LLamaModel/embeddings.md
    - Quantization: LLamaModel/quantization.md
    - Save/Load State: LLamaModel/save-load-state.md
  - LLamaExecutors:
    - Inference Parameters: LLamaExecutors/parameters.md
    - Text-to-Text APIs: LLamaExecutors/text-to-text-apis.md
    - Save/Load State: LLamaExecutors/save-load-state.md
    - Differences of Executors: LLamaExecutors/differences.md
  - ChatSession:
    - Basic Usages: ChatSession/basic-usages.md
    - Transforms: ChatSession/transforms.md
    - Save/Load Session: ChatSession/save-load-session.md
  - Non-English Usages:
    - Chinese: NonEnglishUsage/Chinese.md
  - High-level Applications:
    - BotSharp: HighLevelApps/bot-sharp.md
  - More:
    - Logger: More/log.md
theme: readthedocs
