An open API service providing package, version and dependency metadata of many open source software ecosystems and registries.

pypi.org "tokenizer" keyword

View the packages on the pypi.org package registry that are tagged with the "tokenizer" keyword.

divyanx-tokenizers 0.20.0.dev0
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
1 version - Latest release: 8 months ago - 27 downloads last month - 9,605 stars on GitHub - 1 maintainer
tokenizers-gt 0.15.2.post0
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
3 versions - Latest release: about 1 year ago - 2.32 thousand downloads last month - 9,605 stars on GitHub - 1 maintainer
tensorflow-onmttok-ops 0.4.0
OpenNMT Tokenizer as TensorFlow Operations
5 versions - Latest release: over 4 years ago - 1 dependent repository - 285 downloads last month - 1 maintainer
Top 1.6% on pypi.org
sacremoses 0.1.1
SacreMoses
52 versions - Latest release: over 1 year ago - 120 dependent packages - 5,564 dependent repositories - 2.07 million downloads last month - 486 stars on GitHub - 3 maintainers
biosaic 0.0.7
Tokenizer for encoding/decoding DNA & amino acid sequences
2 versions - Latest release: 7 days ago - 233 downloads last month - 1 star on GitHub - 1 maintainer
twitter-korean 0.1.0.dev522
Python port of the normalizer in https://github.com/twitter/twitter-korean-text
2 versions - Latest release: over 1 year ago - 2 dependent repositories - 31 downloads last month - 1 maintainer
Top 6.9% on pypi.org
text2text 1.9.5
Text2Text Language Modeling Toolkit
192 versions - Latest release: 3 months ago - 4 dependent repositories - 7.11 thousand downloads last month - 300 stars on GitHub - 1 maintainer
tokens-cli 0.1.0
Count tokens in text using tiktoken encoders
1 version - Latest release: about 11 hours ago - 1 maintainer
japanesetokenizer 1.3.7
Aims to make JapaneseTokenizer as easy to use as possible
21 versions - Latest release: about 7 years ago - 1 dependent repository - 502 downloads last month - 138 stars on GitHub - 1 maintainer
dango 0.0.1
An easy to use tokenizer for Japanese text, aimed at language learners and non-linguists
2 versions - Latest release: over 3 years ago - 3 dependent repositories - 382 downloads last month - 16 stars on GitHub - 1 maintainer
Top 1.9% on pypi.org
syntok 1.4.4
Text tokenization and sentence segmentation (segtok v2).
16 versions - Latest release: about 3 years ago - 6 dependent packages - 59 dependent repositories - 54.3 thousand downloads last month - 193 stars on GitHub - 1 maintainer
tokenizerchanger 1.0.4
Library for manipulating the existing tokenizer.
19 versions - Latest release: about 1 month ago - 835 downloads last month - 16 stars on GitHub - 1 maintainer
rusyll 0.1.1
Splitting Russian words into phonetic syllables
1 version - Latest release: over 4 years ago - 2 dependent repositories - 64 downloads last month - 6 stars on GitHub - 1 maintainer
irtm 0.0.4
A toolbox for Information Retrieval & Text Mining.
4 versions - Latest release: over 3 years ago - 1 dependent repository - 187 downloads last month - 1 star on GitHub - 1 maintainer
alt-eval 1.2.0
Automatic lyrics transcription evaluation toolkit
4 versions - Latest release: 8 months ago - 526 downloads last month - 486 stars on GitHub - 1 maintainer
kitoken 0.10.1 💰
Fast and versatile tokenizer for language models, supporting BPE, Unigram and WordPiece tokenization
2 versions - Latest release: 4 months ago - 533 downloads last month - 16 stars on GitHub - 1 maintainer
sengiri 0.2.1 💰
Yet another sentence-level tokenizer for Japanese text
3 versions - Latest release: over 5 years ago - 7 dependent repositories - 284 downloads last month - 22 stars on GitHub - 1 maintainer
crossandra 2.2.1 💰
A fast and simple enum/regex-based tokenizer with decent configurability
12 versions - Latest release: 11 months ago - 1 dependent package - 1 dependent repository - 4.91 thousand downloads last month - 9 stars on GitHub - 1 maintainer
gpt3_tokenizer 0.1.5
Encoder/Decoder and tokens counter for GPT3
6 versions - Latest release: 12 months ago - 957 downloads last month - 8 stars on GitHub - 1 maintainer
livelex 0.3.0
The livelex lexer
6 versions - Latest release: about 5 years ago - 1 dependent repository - 112 downloads last month - 10 stars on GitHub - 1 maintainer
korhal 0.1.2
KOrean Rpc-based Application for Handy Application for Language-processing
2 versions - Latest release: over 6 years ago - 1 dependent repository - 55 downloads last month - 1 maintainer
twokenize 1.0.0
Word segmentation / tokenization focussed on Twitter
1 version - Latest release: almost 7 years ago - 6 dependent repositories - 306 downloads last month - 7 stars on GitHub - 1 maintainer
zltk 0.0.1
A collection of commonly used functions.
2 versions - Latest release: over 1 year ago - 100 downloads last month - 1 maintainer
parce 0.33.0
The parce lexer
32 versions - Latest release: almost 2 years ago - 1 dependent repository - 716 downloads last month - 10 stars on GitHub - 1 maintainer
greedtok 0.14
Partition Cover Approach to Tokenization
3 versions - Latest release: 3 days ago - 31 downloads last month - 1 star on GitHub - 1 maintainer
ja-sentence 0.0.5
Light-weight sentence tokenizer for Japanese.
5 versions - Latest release: over 3 years ago - 1 dependent repository - 155 downloads last month - 1 star on GitHub - 1 maintainer
Top 3.7% on pypi.org
sentence-splitter 1.4
Text to sentence splitter using heuristic algorithm by Philipp Koehn and Josh Schroeder
4 versions - Latest release: over 6 years ago - 12 dependent packages - 42 dependent repositories - 88.9 thousand downloads last month - 241 stars on GitHub - 2 maintainers
hebrew-tokenizer 2.3.0
A very simple python tokenizer for Hebrew text
8 versions - Latest release: over 3 years ago - 1 dependent package - 2 dependent repositories - 1.15 thousand downloads last month - 25 stars on GitHub - 1 maintainer
djurl 0.2.0
Simple yet helpful library for writing Django URLs in an easy, short and intuitive way.
4 versions - Latest release: almost 8 years ago - 2 dependent repositories - 107 downloads last month - 79 stars on GitHub - 1 maintainer
hindikosh 0.0.1
Hindi corpus reader
1 version - Latest release: over 6 years ago - 1 dependent repository - 53 downloads last month - 1 star on GitHub - 1 maintainer
bleuscore 0.1.3
A fast bleu score calculator
4 versions - Latest release: 11 months ago - 658 downloads last month - 10 stars on GitHub - 1 maintainer
autotiktokenizer 0.2.2
🧰 The AutoTokenizer that TikToken always needed -- Load any tokenizer with TikToken now! ✨
7 versions - Latest release: 4 months ago - 26.5 thousand downloads last month - 39 stars on GitHub - 1 maintainer
optimal-data-selector 1.2.2
A package for optimizing models, transferring or copying files from one directory to another, use for nlp ...
22 versions - Latest release: 11 months ago - 184 downloads last month - 1 maintainer
Top 6.2% on pypi.org
tokenizer 3.4.5
A tokenizer for Icelandic text
55 versions - Latest release: 8 months ago - 6 dependent packages - 75 dependent repositories - 20.6 thousand downloads last month - 28 stars on GitHub - 3 maintainers
Top 3.6% on pypi.org
pyonmttok 1.37.1
Fast and customizable text tokenization library with BPE and SentencePiece support
66 versions - Latest release: about 2 years ago - 3 dependent packages - 103 dependent repositories - 28.8 thousand downloads last month - 302 stars on GitHub - 4 maintainers
Top 6.5% on pypi.org
somajo 2.4.3
A tokenizer and sentence splitter for German and English web and social media texts.
60 versions - Latest release: 9 months ago - 1 dependent package - 10 dependent repositories - 2.98 thousand downloads last month - 139 stars on GitHub - 1 maintainer
xml-cleaner 2.0.4
Word and sentence tokenization.
27 versions - Latest release: over 8 years ago - 4 dependent repositories - 939 downloads last month - 13 stars on GitHub - 1 maintainer
vaporetto 0.3.0
Python wrapper of Vaporetto tokenizer
5 versions - Latest release: about 2 years ago - 1 dependent repository - 1.9 thousand downloads last month - 20 stars on GitHub - 1 maintainer
kimchima 0.5.4
A collection of tools for ML model development.
15 versions - Latest release: 10 months ago - 560 downloads last month - 0 stars on GitHub - 1 maintainer
zh-sentence 0.0.5
Light-weight sentence tokenizer for Chinese languages.
5 versions - Latest release: over 3 years ago - 1 dependent repository - 166 downloads last month - 2 stars on GitHub - 1 maintainer
mecab-text-cleaner 0.1.1 💰
Simple Python package for getting Japanese reading (yomigana) using MeCab
2 versions - Latest release: over 1 year ago - 78 downloads last month - 7 stars on GitHub - 1 maintainer
morpholog 1.6
Morphological tokenizer for Russian, able to split words into morphemes: prefixes, roots, infix...
7 versions - Latest release: over 4 years ago - 1 dependent package - 1 dependent repository - 237 downloads last month - 12 stars on GitHub - 1 maintainer
py-nltools 0.5.0
A collection of basic python modules for spoken natural language processing
22 versions - Latest release: almost 6 years ago - 2 dependent repositories - 691 downloads last month - 56 stars on GitHub - 1 maintainer
count-tokens 0.7.2
Count the number of tokens in a text file using the tiktoken tokenizer from OpenAI.
8 versions - Latest release: 3 months ago - 6.63 thousand downloads last month - 6 stars on GitHub - 1 maintainer
ilmulti 0.0.1
Multilingual Text Tooling around Indian Languages
2 versions - Latest release: over 4 years ago - 1 dependent repository - 85 downloads last month - 22 stars on GitHub - 1 maintainer
Top 5.1% on pypi.org
vncorenlp 1.0.3
A Python wrapper for VnCoreNLP using a bidirectional communication channel.
2 versions - Latest release: over 6 years ago - 4 dependent packages - 31 dependent repositories - 2 thousand downloads last month - 56 stars on GitHub - 1 maintainer
transformers-embedder 3.0.11
Word level transformer based embeddings
24 versions - Latest release: almost 2 years ago - 1 dependent repository - 694 downloads last month - 34 stars on GitHub - 1 maintainer
doc2term 0.1
A fast NLP tokenizer that detects tokens and removes duplicates and punctuation
1 version - Latest release: almost 4 years ago - 1 dependent repository - 33 downloads last month - 2 stars on GitHub - 1 maintainer
code-splitter 0.1.5
Split code into semantic chunks using tree-sitter
5 versions - Latest release: 7 months ago - 2.24 thousand downloads last month - 3 stars on GitHub - 1 maintainer
generic-lexer 1.1.1
A generic pattern-based Lexer/tokenizer tool.
1 version - Latest release: over 4 years ago - 1 dependent repository - 46 downloads last month - 2 stars on GitHub - 1 maintainer
dir2text 2.0.0
A Python library and command-line tool for expressing directory structures and file contents in f...
4 versions - Latest release: 6 days ago - 64 downloads last month - 1 star on GitHub - 1 maintainer
Top 2.3% on pypi.org
fugashi 1.4.0 💰
A Cython MeCab wrapper for fast, pythonic Japanese tokenization.
75 versions - Latest release: 5 months ago - 32 dependent packages - 243 dependent repositories - 308 thousand downloads last month - 440 stars on GitHub - 1 maintainer
lexikanon 0.6.5
A Python Library for Tokenizers
26 versions - Latest release: about 1 year ago - 3 dependent packages - 851 downloads last month - 1 star on GitHub - 1 maintainer
xontrib-output-search 0.6.5 💰
Get identifiers, names, paths, URLs and words from the previous command output and use them for t...
13 versions - Latest release: about 1 year ago - 1 dependent package - 5 dependent repositories - 480 downloads last month - 44 stars on GitHub - 1 maintainer
python-ucto 0.6.9
This is a Python binding to the tokenizer Ucto. Tokenisation is one of the first steps in almost a...
24 versions - Latest release: 4 months ago - 1 dependent package - 4 dependent repositories - 3.59 thousand downloads last month - 29 stars on GitHub - 1 maintainer
transformer-embedder 1.7.16
Word level transformer based embeddings
52 versions - Latest release: over 3 years ago - 2 dependent repositories - 1.07 thousand downloads last month - 34 stars on GitHub - 1 maintainer
tokenlens 0.1.6
A library for accurate token counting and limit validation across various LLM providers
7 versions - Latest release: 3 months ago - 213 downloads last month - 1 star on GitHub - 1 maintainer
wyzard 1.0
Run various transformers models from one package.
3 versions - Latest release: almost 2 years ago - 103 downloads last month - 0 stars on GitHub - 2 maintainers
pynutshell 1.0.2
An unsupervised text summarization and information retrieval library under the hood using natural...
3 versions - Latest release: over 4 years ago - 1 dependent repository - 145 downloads last month - 15 stars on GitHub - 1 maintainer
pyregtokenizer 0.0.2
A BPE Tokenizer using regex
2 versions - Latest release: 12 months ago - 24 downloads last month - 1 maintainer
tokenicer 0.0.4
A (nicer) tokenizer you want to use for model `inference` and `training`: with all known peventab...
5 versions - Latest release: about 2 months ago - 9.97 thousand downloads last month - 6 stars on GitHub - 1 maintainer
sctokenizer 0.0.8
A Source Code Tokenizer
8 versions - Latest release: about 2 years ago - 4 dependent repositories - 6.46 thousand downloads last month - 13 stars on GitHub - 1 maintainer
Top 9.0% on pypi.org
cereja 2.0.8
Cereja is a bundle of useful functions that I don't want to rewrite.
140 versions - Latest release: about 2 months ago - 3 dependent packages - 2 dependent repositories - 5.35 thousand downloads last month - 27 stars on GitHub - 1 maintainer
pithy 0.0.13
Pithy is a collection of utility libraries for Python 3.
11 versions - Latest release: almost 5 years ago - 8 dependent repositories - 247 downloads last month - 5 stars on GitHub - 1 maintainer
tolkien 0.0.1
Token class for lexers and parsers.
1 version - Latest release: over 5 years ago - 1 dependent repository - 33 downloads last month - 5 stars on GitHub - 1 maintainer
spag 1.0.0a0
A module containing scanner (regular expression) and parser (BNF) compilers as well as a base gen...
1 version - Latest release: over 6 years ago - 1 dependent repository - 49 downloads last month - 8 stars on GitHub - 1 maintainer
microtokenizer 0.21.3 💰
A micro tokenizer for Chinese
54 versions - Latest release: 6 months ago - 1 dependent repository - 1.59 thousand downloads last month - 144 stars on GitHub - 1 maintainer
ipa-core 0.1.3
NLP Preprocessing Pipeline Wrappers
4 versions - Latest release: almost 2 years ago - 156 downloads last month - 11 stars on GitHub - 1 maintainer
bpeasy 0.1.5
Fast bare-bones BPE for modern tokenizer training
6 versions - Latest release: 17 days ago - 6.78 thousand downloads last month - 152 stars on GitHub - 1 maintainer
bodotokenizer 0.1.1
Package for Bodo Tokenizer
2 versions - Latest release: about 3 years ago - 1 dependent repository - 102 downloads last month - 0 stars on GitHub - 1 maintainer
dante-tokenizer 0.2.0
A Portuguese Twitter Tokenizer for DANTE dataset
3 versions - Latest release: almost 4 years ago - 1 dependent repository - 61 downloads last month - 2 stars on GitHub - 1 maintainer
Top 5.5% on pypi.org
simplemma 1.1.2
A lightweight toolkit for multilingual lemmatization and language detection.
18 versions - Latest release: 5 months ago - 6 dependent packages - 25 dependent repositories - 16.8 thousand downloads last month - 154 stars on GitHub - 1 maintainer
mwtokenizer 0.2.0
Wikipedia Tokenizer Utility
3 versions - Latest release: over 1 year ago - 1 dependent repository - 284 downloads last month - 0 stars on gitlab.wikimedia.org - 1 maintainer
sylber 0.1.4
Python code for "Sylber: Syllabic Embedding Representation of Speech from Raw Audio"
5 versions - Latest release: about 1 month ago - 479 downloads last month - 29 stars on GitHub - 1 maintainer
openkoreantext 0.2.6
Python interface to open-korean-text, a Korean morphological analyzer.
7 versions - Latest release: over 7 years ago - 1 dependent repository - 171 downloads last month - 4 stars on GitHub - 1 maintainer
yamper 0.1.0
A Markdown to HTML converter
1 version - Latest release: 8 months ago - 48 downloads last month - 8 stars on GitHub - 1 maintainer
flash-tokenizer 1.2.0
Extremely fast bert tokenizer
33 versions - Latest release: 18 days ago - 6.83 thousand downloads last month - 287 stars on GitHub - 1 maintainer
nepalikit 1.0.2
A Nepali language processing library
3 versions - Latest release: 9 months ago - 181 downloads last month - 7 stars on GitHub - 1 maintainer
pytokencounter 1.7.0
A Python library for tokenizing text and counting tokens using various encoding schemes.
16 versions - Latest release: about 1 month ago - 667 downloads last month - 2 stars on GitHub - 1 maintainer
rs-bpe 0.1.0
A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust
1 version - Latest release: about 1 month ago - 1.91 thousand downloads last month - 1 star on GitHub - 1 maintainer
semantic-text-splitter 0.25.1
Split text into semantic chunks, up to a desired chunk size. Supports calculating length by chara...
46 versions - Latest release: 25 days ago - 1 dependent package - 88.5 thousand downloads last month - 270 stars on GitHub - 1 maintainer
wordpiece-rs 0.1.0
A fast WordPiece tokenizer implementation in Rust with Python bindings
1 version - Latest release: 2 months ago - 112 downloads last month - 0 stars on GitHub - 1 maintainer
Top 2.5% on pypi.org
soynlp 0.0.493
Unsupervised Korean Natural Language Processing Toolkits
31 versions - Latest release: over 5 years ago - 4 dependent packages - 48 dependent repositories - 4.68 thousand downloads last month - 954 stars on GitHub - 1 maintainer
pinyintokenizer 0.0.3
Pinyin Tokenizer, Chinese pinyin tokenizer
3 versions - Latest release: 3 months ago - 1.07 thousand downloads last month - 29 stars on GitHub - 1 maintainer
twkorean 0.1.5
Python interface to twitter-korean-text, a Korean morphological analyzer.
6 versions - Latest release: over 10 years ago - 4 dependent repositories - 248 downloads last month - 33 stars on GitHub - 1 maintainer
Top 2.4% on pypi.org
natasha 1.6.0
Named-entity recognition for the Russian language
13 versions - Latest release: over 1 year ago - 6 dependent packages - 73 dependent repositories - 16 thousand downloads last month - 1,242 stars on GitHub - 2 maintainers
javac-parser 1.0.0
Exposes the OpenJDK Java parser and scanner to Python
16 versions - Latest release: almost 7 years ago - 4 dependent repositories - 7.75 thousand downloads last month - 6 stars on GitHub - 1 maintainer
ebnfparser 2.1.3
Very powerful and optional parser framework for Python
24 versions - Latest release: about 7 years ago - 1 dependent repository - 496 downloads last month - 65 stars on GitHub - 1 maintainer
thai-tokenizer 0.2.5
Fast and accurate Thai tokenization library.
7 versions - Latest release: about 4 years ago - 1 dependent repository - 4.32 thousand downloads last month - 5 stars on GitHub - 1 maintainer
python-vncorenlp 0.1.8
python_vncorenlp
9 versions - Latest release: over 4 years ago - 1 dependent repository - 248 downloads last month - 2 stars on GitHub - 1 maintainer
Top 2.3% on pypi.org
hazm 0.10.0
Persian NLP Toolkit
15 versions - Latest release: over 1 year ago - 8 dependent packages - 126 dependent repositories - 15.5 thousand downloads last month - 1,121 stars on GitHub - 1 maintainer
rs-bytepiece 0.2.2
bytepiece-rs Python binding
7 versions - Latest release: over 1 year ago - 209 downloads last month - 14 stars on GitHub - 1 maintainer
parasol-nlp 0.0.4
Korean tokenizer with character decomposition
4 versions - Latest release: about 5 years ago - 1 dependent repository - 110 downloads last month - 3 stars on GitHub - 1 maintainer
token-vision 0.1.0
A fast, offline token calculator for images with various AI models (Claude, GPT-4V, Gemini)
5 versions - Latest release: 4 months ago - 214 downloads last month - 0 stars on GitHub - 1 maintainer
ai21-tokenizer 0.12.0
AI21's Jurassic models tokenizers
22 versions - Latest release: 8 months ago - 1 dependent package - 52.1 thousand downloads last month - 30 stars on GitHub - 1 maintainer
rwkv-tokenizer 0.5.2
RWKV Tokenizer
13 versions - Latest release: 10 months ago - 2.08 thousand downloads last month - 44 stars on GitHub - 1 maintainer
pyrwkv-tokenizer 0.9.1
RWKV Tokenizer
10 versions - Latest release: 20 days ago - 9.02 thousand downloads last month - 44 stars on GitHub - 1 maintainer
tglex 0.2.1
Lexical analysis base for telegram bots
4 versions - Latest release: about 5 years ago - 1 dependent repository - 135 downloads last month - 0 stars on GitHub - 1 maintainer
tokenregex 0.1.14
NLP at your fingertips
15 versions - Latest release: over 8 years ago - 1 dependent repository - 227 downloads last month - 28 stars on GitHub - 1 maintainer
Top 7.4% on pypi.org
botok 0.9.0
Tibetan Word Tokenizer
28 versions - Latest release: about 1 month ago - 1 dependent package - 21 dependent repositories - 2.84 thousand downloads last month - 65 stars on GitHub - 1 maintainer