crates.io "tokenization" keyword
brainwires-datasets 0.1.0
Training data pipelines for the Brainwires Agent Framework — JSONL I/O, tokenization, deduplicati...
1 version - Latest release: about 1 hour ago - 0 downloads total - 1 maintainer
fern-tokenization 0.0.0
Empty crate, used only to reserve the name.
1 version - Latest release: over 3 years ago - 1.66 thousand downloads total - 14 stars on GitHub - 1 maintainer
derive-finite-automaton-derive 0.3.0
Procedural macro for generating finite automaton
6 versions - Latest release: 8 months ago - 1 dependent package - 1 dependent repository - 14.3 thousand downloads total - 2 stars on GitHub - 1 maintainer
crossandra 1.0.0 💰
A fast and simple lexical tokenization library.
3 versions - Latest release: 8 days ago - 1.94 thousand downloads total - 8 stars on GitHub - 1 maintainer
build-trie 0.1.1
Procedural macro for generating match and state code representing a trie structure
2 versions - Latest release: almost 5 years ago - 3.06 thousand downloads total - 3 stars on GitHub - 1 maintainer
sentence 0.0.2
Sentence tokenizes English language sentences for use in TTS applications.
2 versions - Latest release: almost 6 years ago - 3.1 thousand downloads total - 2 stars on GitHub - 1 maintainer
chunk 0.9.2
The fastest semantic text chunking library — up to 1TB/s chunking throughput
7 versions - Latest release: about 2 months ago - 373 downloads total - 1 maintainer
toon_ql 0.0.2
A query language for Toon data
2 versions - Latest release: 4 months ago - 45 downloads total - 1 maintainer
tuck5 0.2.0
A pragmatic lexer/parser generator
4 versions - Latest release: over 2 years ago - 4.92 thousand downloads total - 0 stars on GitHub - 1 maintainer
go-brrr 0.1.0
Token-efficient code analysis for LLMs - Rust implementation
1 version - Latest release: about 2 months ago - 14 downloads total - 1 maintainer
niblits 0.3.6
Token-aware, multi-format text chunking library with language-aware semantic splitting
5 versions - Latest release: about 1 month ago - 97 downloads total - 1 maintainer
memchunk 0.4.0
The fastest semantic text chunking library — up to 1TB/s chunking throughput
11 versions - Latest release: 2 months ago - 221 downloads total - 2 stars on GitHub - 1 maintainer
marqant 1.0.0
Quantum-compressed markdown format for AI consumption with 90% token reduction
5 versions - Latest release: 4 months ago - 1.78 thousand downloads total - 0 stars on GitHub - 1 maintainer
rustrawi 0.1.2 💰
Rust port of the original PHP Sastrawi
3 versions - Latest release: about 3 years ago - 3.67 thousand downloads total - 0 stars on GitHub - 1 maintainer
any-lexer 0.0.3
Lexers for various programming languages and formats
3 versions - Latest release: over 2 years ago - 2 dependent packages - 6.42 thousand downloads total - 0 stars on GitHub - 1 maintainer
vaporetto 0.6.5
Vaporetto: a pointwise prediction based tokenizer
18 versions - Latest release: 12 months ago - 3 dependent packages - 1 dependent repository - 156 thousand downloads total - 245 stars on GitHub - 1 maintainer
wordpieces 0.6.1
Split tokens into word pieces
10 versions - Latest release: over 3 years ago - 3 dependent packages - 3 dependent repositories - 21.7 thousand downloads total - 5 stars on GitHub - 1 maintainer
tokenizer-lib 1.6.0
Tokenization utilities for building parsers in Rust
15 versions - Latest release: almost 2 years ago - 2 dependent packages - 1 dependent repository - 25.4 thousand downloads total - 2 stars on GitHub - 1 maintainer
vibrato 0.5.2
Vibrato: viterbi-based accelerated tokenizer
12 versions - Latest release: 12 months ago - 1 dependent package - 1 dependent repository - 50.1 thousand downloads total - 377 stars on GitHub - 2 maintainers
colorblast-cli 0.0.1
Syntax highlighting CLI for various programming languages, markup languages and various other for...
1 version - Latest release: over 2 years ago - 1.51 thousand downloads total - 0 stars on GitHub - 1 maintainer
blex 0.2.2
A lightweight lexing framework
4 versions - Latest release: almost 3 years ago - 1 dependent package - 5.74 thousand downloads total - 0 stars on GitHub - 1 maintainer
libtqsm 0.6.1
Sentence segmenter that supports ~300 languages
1 version - Latest release: almost 2 years ago - 1 dependent package - 3.34 thousand downloads total - 2 stars on GitHub - 1 maintainer
classi-cine 0.5.1
A tool that builds smart video playlists by learning your preferences through Bayesian classifica...
12 versions - Latest release: 7 months ago - 12.3 thousand downloads total - 6 stars on GitHub - 1 maintainer
vtext 0.2.0
NLP with Rust
4 versions - Latest release: over 5 years ago - 3 dependent repositories - 14.7 thousand downloads total - 153 stars on GitHub - 1 maintainer
strizer 0.1.0
minimal and fast library for text tokenization
1 version - Latest release: almost 5 years ago - 1.94 thousand downloads total - 1 star on GitHub - 1 maintainer
esg-tokenization-protocol 0.1.2
Official Rust implementation of the ESG Tokenization Protocol (ERC-8040 / EIP-8040). MIT-grade co...
3 versions - Latest release: 4 months ago - 62 downloads total - 1 maintainer
kizzasi-tokenizer 0.1.0
Signal quantization and tokenization for Kizzasi AGSP - VQ-VAE, μ-law, continuous embeddings
1 version - Latest release: about 2 months ago - 26 downloads total - 1 maintainer
colorblast 0.0.3
Syntax highlighting library for various programming languages, markup languages and various other...
3 versions - Latest release: over 2 years ago - 1 dependent package - 4.49 thousand downloads total - 0 stars on GitHub - 1 maintainer
bytepunch-rs 0.1.0
Profile-aware semantic compression for structured documents (CML and beyond)
1 version - Latest release: 3 months ago - 0 downloads total - 1 maintainer
text-scanner 0.0.3
A UTF-8 char-oriented, zero-copy, text and code scanning library
3 versions - Latest release: over 2 years ago - 1 dependent package - 6.36 thousand downloads total - 0 stars on GitHub - 1 maintainer
bpetok 0.1.2
A simple CLI for tokenizing text input using Byte Pair Encoding (BPE).
3 versions - Latest release: over 1 year ago - 3.06 thousand downloads total - 1 maintainer
unscanny 0.1.0 💰
Painless string scanning.
1 version - Latest release: almost 4 years ago - 8 dependent packages - 28 dependent repositories - 11.5 million downloads total - 56 stars on GitHub - 1 maintainer
agrocrypto-core 0.1.0
The core engine of AgroCrypto: a blockchain-native asset tokenization and settlement layer.
1 version - Latest release: 11 months ago - 917 downloads total - 1 maintainer
pretok 0.1.0
A string pre-tokenizer for C-like syntaxes.
1 version - Latest release: over 5 years ago - 1 dependent repository - 1.69 thousand downloads total - 0 stars on GitHub - 1 maintainer
vaporetto_tantivy 0.24.0
Vaporetto Tokenizer for Tantivy
15 versions - Latest release: 9 months ago - 28.4 thousand downloads total - 242 stars on GitHub - 1 maintainer
bitcoin-get-json-token 0.1.1
A comprehensive Rust library for parsing and tokenizing JSON data, optimized for Bitcoin applicat...
2 versions - Latest release: 3 months ago - 4.38 thousand downloads total - 1 maintainer
bpe-match 0.1.1
A pattern matching library for BPE tokenization, intended to replace regex-based approaches.
2 versions - Latest release: 5 months ago - 374 downloads total - 1 maintainer
vaporetto_rules 0.6.5
Rule-base filters for Vaporetto
12 versions - Latest release: 12 months ago - 1 dependent package - 1 dependent repository - 71 thousand downloads total - 242 stars on GitHub - 1 maintainer
derive-finite-automaton 0.3.0
Procedural macro for generating finite automaton
6 versions - Latest release: 8 months ago - 1 dependent package - 1 dependent repository - 13.6 thousand downloads total - 2 stars on GitHub - 1 maintainer
Related Keywords
tokenizer (11)
parsing (11)
nlp (10)
rust (10)
lexer (8)
text (6)
token (4)
code2image (4)
html-syntax-highlighter (4)
lexers (4)
render-code (4)
syntax-highlighting (4)
text-scanner (4)
utils (4)
analyzer (4)
japanese (4)
morphological-analysis (4)
segmentation (4)
morphological (3)
chunking (3)
compression (3)
text-processing (2)
blockchain (2)
lex (2)
streaming (2)
highlighter (2)
highlighting (2)
syntax (2)
bpe (2)
simd (2)
ai (2)
quantization (1)
classification (1)
semantic (1)
cml (1)
cli (1)
tokenizing (1)
scanning (1)
carbon-credits (1)
settlement (1)
web3 (1)
lexical-analysis (1)
tantivy (1)
json (1)
bayes (1)
ml (1)
bitcoin (1)
cryptocurrency (1)
pattern-matching (1)
audio (1)
eip8040 (1)
erc8040 (1)
esg (1)
vq-vae (1)
tf-idf (1)
information-retrieval (1)
dictionary (1)
bag-of-words (1)
levenshtein (1)
tfidf (1)
serde-json (1)
reqwest (1)
naive-bayes-classifier (1)
http (1)
bayesian-inference (1)
vlc (1)
playlist (1)
rust-crate (1)
parser (1)
parse (1)
toon (1)
query (1)
data (1)
tts (1)
english (1)
sentence (1)
proc-macro (1)
parser-combinators (1)
regex (1)
lexing (1)
database (1)
proxy (1)
privacy (1)
fern (1)
training-data (1)
jsonl (1)
deduplication (1)
datasets (1)
word (1)
piece (1)
wordpiece (1)
stopwords (1)
stemmer (1)
indonesian (1)
stem (1)
stopword (1)
sastrawi (1)
smart-tree (1)
markdown-converter (1)
documentation-tool (1)