BpeTokenizer 1.0.5
BpeTokenizer
BpeTokenizer is a C# implementation of tiktoken, the tokenizer written by OpenAI. It is a byte pair encoding (BPE) tokenizer that splits text into subword units.
This library is built for x64 architectures.
Since it is derived from tiktoken, BpeTokenizer can also serve as a token counter. For example, when streaming tokens from the OpenAI API for GPT chat completions, the calling software can count them to keep track of its API cost.
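One way to track tokens while a response streams in is sketched below. `StreamingTokenCounter` is a hypothetical helper, not part of BpeTokenizer. It takes the token counter as a `Func<string, int>` so that it does not depend on the concrete encoder type, and it assumes the counter returns an `int`. It recounts the whole buffer on every chunk, because chunk boundaries need not line up with token boundaries.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of BpeTokenizer.
public sealed class StreamingTokenCounter
{
    private readonly Func<string, int> _countTokens;
    private readonly StringBuilder _buffer = new StringBuilder();

    // countTokens: any function that returns the token count of a string.
    public StreamingTokenCounter(Func<string, int> countTokens)
        => _countTokens = countTokens ?? throw new ArgumentNullException(nameof(countTokens));

    // Appends a streamed chunk and returns the token count of all text received so far.
    public int Append(string chunk)
    {
        _buffer.Append(chunk);
        return _countTokens(_buffer.ToString());
    }
}
```

With an encoder from this library, the counter could be created as `new StreamingTokenCounter(text => encoder.CountTokens(text))`, assuming `CountTokens` returns an `int`. Recounting the full buffer is simple but does more work as the response grows; for long responses, counting once at the end may be enough.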
To install BpeTokenizer, run the following command in the Package Manager Console:
Install-Package BpeTokenizer
If you'd prefer to use the .NET CLI, run this command instead:
dotnet add package BpeTokenizer
Usage
To use BpeTokenizer, import the namespace:
using BpeTokenizer;
Then create an encoder by its model or encoding name:
// By its encoding name:
var encoder = await BytePairEncodingRegistry.GetEncodingAsync("cl100k_base");
// By its model:
var encoder = await BytePairEncodingModels.EncodingForModelAsync("gpt-4");
Both methods are asynchronous and should be awaited, because they either download the encoding data from a remote server or load it from the local cache.
Once you have an encoding, you can encode your text:
var tokens = encoder.Encode("Hello BPE world!"); //Results in: [9906, 426, 1777, 1917, 0]
To decode a stream of tokens, you can use the following:
var text = encoder.Decode(tokens); //Results in: "Hello BPE world!"
Supported Encodings/Models:
BpeTokenizer supports the following encodings:
- cl100k_base
- p50k_edit
- p50k_base
- r50k_base
- gpt2
You can use these encoding names when creating an encoder:
var cl100kBaseEncoder = await BytePairEncodingRegistry.GetEncodingAsync("cl100k_base");
var p50kEditEncoder = await BytePairEncodingRegistry.GetEncodingAsync("p50k_edit");
var p50kBaseEncoder = await BytePairEncodingRegistry.GetEncodingAsync("p50k_base");
var r50kBaseEncoder = await BytePairEncodingRegistry.GetEncodingAsync("r50k_base");
var gpt2Encoder = await BytePairEncodingRegistry.GetEncodingAsync("gpt2");
The following models are supported (from the tiktoken source, encoding in parentheses):
- Chat (all cl100k_base)
  - gpt-4 - e.g., gpt-4-0314, etc., plus gpt-4-32k
  - gpt-3.5-turbo - e.g., gpt-3.5-turbo-0301, -0401, etc.
  - gpt-35-turbo - Azure deployment name
- Text (future use, all cl100k_base; API availability on Jan 4, 2024)
  - ada-002
  - babbage-002
  - curie-002
  - davinci-002
  - gpt-3.5-turbo-instruct
- Code (all p50k_base)
  - code-davinci-002
  - code-davinci-001
  - code-cushman-002
  - code-cushman-001
  - davinci-codex
  - cushman-codex
- Edit (all p50k_edit)
  - text-davinci-edit-001
  - code-davinci-edit-001
- Embeddings
  - text-embedding-ada-002 (cl100k_base)
- Legacy (no longer available on Jan 4, 2024)
  - text-davinci-003 (p50k_base)
  - text-davinci-002 (p50k_base)
  - text-davinci-001 (r50k_base)
  - text-curie-001 (r50k_base)
  - text-babbage-001 (r50k_base)
  - text-ada-001 (r50k_base)
  - davinci (r50k_base)
  - curie (r50k_base)
  - babbage (r50k_base)
  - ada (r50k_base)
- Old Embeddings (all r50k_base)
  - text-similarity-davinci-001
  - text-similarity-curie-001
  - text-similarity-babbage-001
  - text-similarity-ada-001
  - text-search-davinci-doc-001
  - text-search-curie-doc-001
  - text-search-babbage-doc-001
  - text-search-ada-doc-001
  - code-search-babbage-code-001
  - code-search-ada-code-001
- Open Source
  - gpt2 (gpt2)
You can use these model names when creating an encoder (list not exhaustive):
var gpt4Encoder = await BytePairEncodingModels.EncodingForModelAsync("gpt-4");
var textDavinci003Encoder = await BytePairEncodingModels.EncodingForModelAsync("text-davinci-003");
var textDavinci001Encoder = await BytePairEncodingModels.EncodingForModelAsync("text-davinci-001");
var codeDavinci002Encoder = await BytePairEncodingModels.EncodingForModelAsync("code-davinci-002");
var textDavinciEdit001Encoder = await BytePairEncodingModels.EncodingForModelAsync("text-davinci-edit-001");
var textEmbeddingAda002Encoder = await BytePairEncodingModels.EncodingForModelAsync("text-embedding-ada-002");
var textSimilarityDavinci001Encoder = await BytePairEncodingModels.EncodingForModelAsync("text-similarity-davinci-001");
var gpt2Encoder = await BytePairEncodingModels.EncodingForModelAsync("gpt2");
Several of the older models are being deprecated at the start of 2024; see the Legacy group in the model list above.
Token Counting
To count tokens in a given string, you can use the following:
var tokenCount = encoder.CountTokens("Hello BPE world!"); //Results in: 5
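A token count is usually fed into a cost estimate or a context-size check. The following is a minimal sketch of both. `TokenBudget`, its method names, and every price or limit passed to it are hypothetical and not part of BpeTokenizer; look up actual prices and context limits for the model you call.

```csharp
using System;

// Hypothetical helper, not part of BpeTokenizer.
public static class TokenBudget
{
    // Estimated cost of tokenCount tokens at a given price per 1,000 tokens.
    public static decimal EstimateCost(int tokenCount, decimal pricePer1KTokens)
    {
        if (tokenCount < 0) throw new ArgumentOutOfRangeException(nameof(tokenCount));
        return tokenCount / 1000m * pricePer1KTokens;
    }

    // True when the prompt plus the tokens reserved for the reply fit in the context limit.
    public static bool FitsContext(int promptTokens, int reservedForReply, int contextLimit)
        => promptTokens + reservedForReply <= contextLimit;
}
```

For example, `TokenBudget.EstimateCost(encoder.CountTokens(prompt), 0.03m)` would estimate the cost of a prompt at a placeholder price of 0.03 per 1,000 tokens, assuming `CountTokens` returns an `int`.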
Product | Compatible and additional computed target framework versions |
---|---|
.NET | net7.0 is compatible. net7.0-android was computed. net7.0-ios was computed. net7.0-maccatalyst was computed. net7.0-macos was computed. net7.0-tvos was computed. net7.0-windows was computed. net8.0 was computed. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 was computed. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. |
Dependencies
- net7.0
  - Newtonsoft.Json (>= 13.0.3)
Corrected ReadMe.md to point to the appropriate branch for GitHub link.