An open API service providing package, version and dependency metadata of many open source software ecosystems and registries.

pypi.org "tokenization" keyword

View the packages on the pypi.org package registry that are tagged with the "tokenization" keyword.

subtokenizer 0.0.19
Subwords tokenizer for neural natural language processing
16 versions - Latest release: over 5 years ago - 1 dependent repository - 312 downloads last month - 5 stars on GitHub - 1 maintainer
Top 0.1% on pypi.org
spacy 3.8.5 💰
Industrial-strength Natural Language Processing (NLP) in Python
216 versions - Latest release: 17 days ago - 873 dependent packages - 15,793 dependent repositories - 17.2 million downloads last month - 29,548 stars on GitHub - 3 maintainers
rusyll 0.1.1
Splitting Russian words into phonetic syllables
1 version - Latest release: over 4 years ago - 2 dependent repositories - 64 downloads last month - 6 stars on GitHub - 1 maintainer
dango 0.0.1
An easy to use tokenizer for Japanese text, aimed at language learners and non-linguists
2 versions - Latest release: over 3 years ago - 3 dependent repositories - 382 downloads last month - 16 stars on GitHub - 1 maintainer
Top 3.7% on pypi.org
sentence-splitter 1.4
Text to sentence splitter using heuristic algorithm by Philipp Koehn and Josh Schroeder
4 versions - Latest release: over 6 years ago - 12 dependent packages - 42 dependent repositories - 88.9 thousand downloads last month - 241 stars on GitHub - 2 maintainers
spacyface 0.3.0
Aligner for spacy and huggingface tokenization
10 versions - Latest release: about 4 years ago - 1 dependent repository - 186 downloads last month - 44 stars on GitHub - 1 maintainer
bigrams 0.1.2
Simply create (N)grams
3 versions - Latest release: about 2 years ago - 152 downloads last month - 6 stars on GitHub - 1 maintainer
llm-obfuscator 0.1.0
A tool for obfuscating text by manipulating token IDs while preserving token count and structure
1 version - Latest release: about 1 month ago - 94 downloads last month - 1 maintainer
nlp-preprocessing 0.2.0
A Package for text preprocessing
14 versions - Latest release: over 4 years ago - 1 dependent repository - 471 downloads last month - 16 stars on GitHub - 1 maintainer
Top 3.6% on pypi.org
pyonmttok 1.37.1
Fast and customizable text tokenization library with BPE and SentencePiece support
66 versions - Latest release: about 2 years ago - 3 dependent packages - 103 dependent repositories - 28.8 thousand downloads last month - 302 stars on GitHub - 4 maintainers
spacywb 0.1.1 💰
Industrial-strength Natural Language Processing (NLP) in Python
2 versions - Latest release: over 3 years ago - 1 dependent repository - 64 downloads last month - 31,363 stars on GitHub - 1 maintainer
semantic-split 0.1.0 💰
A better way to split (chunk/group) your text before inserting them into an LLM/Vector DB.
1 version - Latest release: almost 2 years ago - 581 downloads last month - 31,363 stars on GitHub - 1 maintainer
spacy-weibo 2.3.0 💰
Industrial-strength Natural Language Processing (NLP) in Python
1 version - Latest release: over 3 years ago - 1 dependent repository - 45 downloads last month - 31,363 stars on GitHub - 1 maintainer
Top 4.5% on pypi.org
spacy-nightly 3.0.0rc5 💰
Industrial-strength Natural Language Processing (NLP) in Python
74 versions - Latest release: about 4 years ago - 2 dependent packages - 9 dependent repositories - 6.62 thousand downloads last month - 31,363 stars on GitHub - 2 maintainers
spacy-ci-improve 2.0.5 💰
Industrial-strength Natural Language Processing (NLP) with Python and Cython
1 version - Latest release: about 7 years ago - 1 dependent repository - 57 downloads last month - 31,363 stars on GitHub - 1 maintainer
vaporetto 0.3.0
Python wrapper of Vaporetto tokenizer
5 versions - Latest release: about 2 years ago - 1 dependent repository - 1.9 thousand downloads last month - 20 stars on GitHub - 1 maintainer
mrs-spellings 1.0.3
a micro utility for generating plausible misspellings
7 versions - Latest release: almost 5 years ago - 1 dependent repository - 247 downloads last month - 2 stars on GitHub - 1 maintainer
textmate-grammar-python 0.6.1
A lexer and tokenizer for grammar files as defined by TextMate and used in VSCode, implemented in...
12 versions - Latest release: 9 months ago - 1 dependent package - 525 downloads last month - 8 stars on GitHub - 1 maintainer
witokit 1.1.0
A python module to generate a tokenized dump of Wikipedia for NLP
20 versions - Latest release: over 5 years ago - 1 dependent repository - 254 downloads last month - 9 stars on GitHub - 1 maintainer
Top 6.9% on pypi.org
text2text 1.9.5
Text2Text Language Modeling Toolkit
192 versions - Latest release: 3 months ago - 4 dependent repositories - 6.94 thousand downloads last month - 300 stars on GitHub - 1 maintainer
count-tokens 0.7.2
Count the number of tokens in a text file using the tiktoken tokenizer from OpenAI.
8 versions - Latest release: 3 months ago - 6.63 thousand downloads last month - 6 stars on GitHub - 1 maintainer
tivars 0.9.2
A library for interacting with TI-(e)z80 (82/83/84 series) calculator files
10 versions - Latest release: 4 months ago - 454 downloads last month - 19 stars on GitHub - 1 maintainer
Top 2.9% on pypi.org
sudachipy 0.6.10 💰
Python version of Sudachi, the Japanese Morphological Analyzer
41 versions - Latest release: 3 months ago - 23 dependent packages - 74 dependent repositories - 1.18 million downloads last month - 273 stars on GitHub - 2 maintainers
aymara 0.4.1
Python bindings to the LIMA linguistic analyzer
23 versions - Latest release: over 2 years ago - 1 dependent repository - 777 downloads last month - 103 stars on GitHub - 1 maintainer
tokviz 0.1
Library for visualizing tokenization patterns across different language models
1 version - Latest release: about 1 year ago - 57 downloads last month - 10 stars on GitHub - 1 maintainer
Top 2.4% on pypi.org
zhon 2.1.1
Zhon provides constants used in Chinese text processing.
15 versions - Latest release: 5 months ago - 7 dependent packages - 159 dependent repositories - 118 thousand downloads last month - 370 stars on GitHub - 1 maintainer
llama-tokens 0.0.3
A Quick Library with Llama 3.1/3.2 Tokenization - source https://github.com/jeffxtang/llama-tokens
3 versions - Latest release: 5 months ago - 175 downloads last month - 1 star on GitHub - 1 maintainer
xontrib-output-search 0.6.5 💰
Get identifiers, names, paths, URLs and words from the previous command output and use them for t...
13 versions - Latest release: about 1 year ago - 1 dependent package - 5 dependent repositories - 480 downloads last month - 44 stars on GitHub - 1 maintainer
python-ucto 0.6.9
This is a Python binding to the tokenizer Ucto. Tokenisation is one of the first steps in almost a...
24 versions - Latest release: 4 months ago - 1 dependent package - 4 dependent repositories - 3.59 thousand downloads last month - 29 stars on GitHub - 1 maintainer
Top 5.9% on pypi.org
attacut 1.0.6
Fast and Reasonably Accurate Word Tokenizer for Thai
17 versions - Latest release: over 5 years ago - 1 dependent package - 11 dependent repositories - 4.28 thousand downloads last month - 85 stars on GitHub - 1 maintainer
pymmseg 1.2.0
pyMMSeg-cpp, a high performance Chinese word segmentation utility.
1 version - Latest release: about 12 years ago - 4 dependent repositories - 23 downloads last month - 189 stars on GitHub - 1 maintainer
pithy 0.0.13
Pithy is a collection of utility libraries for Python 3.
11 versions - Latest release: almost 5 years ago - 8 dependent repositories - 247 downloads last month - 5 stars on GitHub - 1 maintainer
tolkien 0.0.1
Token class for lexers and parsers.
1 version - Latest release: over 5 years ago - 1 dependent repository - 33 downloads last month - 5 stars on GitHub - 1 maintainer
spag 1.0.0a0
A module containing scanner (regular expression) and parser (BNF) compilers as well as a base gen...
1 version - Latest release: over 6 years ago - 1 dependent repository - 49 downloads last month - 8 stars on GitHub - 1 maintainer
hangul-korean 1.0rc2
Word segmentation for the Korean Language
2 versions - Latest release: about 4 years ago - 104 downloads last month - 1 maintainer
biosaic 0.0.7
Tokenizer for encoding/decoding DNA & amino acid sequences
2 versions - Latest release: 6 days ago - 0 stars on GitHub - 1 maintainer
nlpannotator 1.0.6
Annotator combining different NLP pipelines
7 versions - Latest release: over 1 year ago - 209 downloads last month - 0 stars on GitHub - 1 maintainer
ipa-core 0.1.3
NLP Preprocessing Pipeline Wrappers
4 versions - Latest release: almost 2 years ago - 156 downloads last month - 11 stars on GitHub - 1 maintainer
bpeasy 0.1.5
Fast bare-bones BPE for modern tokenizer training
6 versions - Latest release: 17 days ago - 6.78 thousand downloads last month - 152 stars on GitHub - 1 maintainer
miditok-for-musiclang 0.0.1
A convenient MIDI tokenizer for Deep Learning networks, with multiple encoding strategies
1 version - Latest release: over 1 year ago - 1 dependent package - 23 downloads last month - 1 star on GitHub - 1 maintainer
Top 5.9% on pypi.org
miditok 3.0.5
MIDI / symbolic music tokenizers for Deep Learning models.
65 versions - Latest release: 2 months ago - 2 dependent packages - 2 dependent repositories - 3.93 thousand downloads last month - 758 stars on GitHub - 1 maintainer
Top 5.5% on pypi.org
simplemma 1.1.2
A lightweight toolkit for multilingual lemmatization and language detection.
18 versions - Latest release: 5 months ago - 6 dependent packages - 25 dependent repositories - 16.8 thousand downloads last month - 154 stars on GitHub - 1 maintainer
Top 4.6% on pypi.org
spacy-streamlit 1.0.6
Visualize spaCy with streamlit
17 versions - Latest release: almost 2 years ago - 68 dependent repositories - 8.41 thousand downloads last month - 831 stars on GitHub - 2 maintainers
nlpbrl 1.0.1
NLP algorithm integration package
5 versions - Latest release: about 2 years ago - 137 downloads last month - 0 stars on GitHub - 1 maintainer
Top 4.6% on pypi.org
trankit 1.1.2
Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing
12 versions - Latest release: 6 months ago - 2 dependent packages - 6 dependent repositories - 1.98 thousand downloads last month - 749 stars on GitHub - 1 maintainer
mamba-safe 1.0.1
A framework to generate molecules with the mamba architecture
2 versions - Latest release: 8 months ago - 93 downloads last month - 2 stars on GitHub - 1 maintainer
nepalikit 1.0.2
A Nepali language processing library
3 versions - Latest release: 9 months ago - 181 downloads last month - 7 stars on GitHub - 1 maintainer
pytokencounter 1.7.0
A Python library for tokenizing text and counting tokens using various encoding schemes.
16 versions - Latest release: about 1 month ago - 667 downloads last month - 2 stars on GitHub - 1 maintainer
rs-bpe 0.1.0
A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust
1 version - Latest release: about 1 month ago - 1.91 thousand downloads last month - 1 star on GitHub - 1 maintainer
datesearch 0.0.1
Search for date-related tokens using regular expressions
1 version - Latest release: over 4 years ago - 1 dependent repository - 47 downloads last month - 0 stars on GitHub - 1 maintainer
ponrawee-ssg 0.0.8
Thai syllable segmentation using Conditional Random Fields
1 version - Latest release: over 3 years ago - 1 dependent repository - 48 downloads last month - 27 stars on GitHub - 1 maintainer
Top 8.7% on pypi.org
match 0.3.2
Match tokenized words and phrases within the original, untokenized, often messy, text.
6 versions - Latest release: over 2 years ago - 10 dependent repositories - 2.56 thousand downloads last month - 19 stars on GitHub - 3 maintainers
rftokenizer 2.3.2
A character-wise tokenizer for morphologically rich languages
8 versions - Latest release: about 1 month ago - 1 dependent repository - 295 downloads last month - 27 stars on GitHub - 1 maintainer
maze-dataset 1.3.2
generating and working with datasets of mazes
24 versions - Latest release: 9 days ago - 1.22 thousand downloads last month - 5 stars on GitHub - 1 maintainer
tokenization-scorer 1.1.8
Package for evaluating text tokenizations.
12 versions - Latest release: 3 months ago - 447 downloads last month - 39 stars on GitHub - 1 maintainer
quickbpe 1.8.6
A fast BPE implementation in C
14 versions - Latest release: 4 months ago - 335 downloads last month - 6 stars on GitHub - 1 maintainer
huspacy-nightly 0.11.0.dev261 💰
HuSpaCy: industrial strength Hungarian natural language processing
126 versions - Latest release: over 1 year ago - 1 dependent repository - 2.25 thousand downloads last month - 155 stars on GitHub - 1 maintainer
Top 6.6% on pypi.org
huspacy 0.12.1 💰
HuSpaCy: industrial strength Hungarian natural language processing
23 versions - Latest release: 6 months ago - 1 dependent package - 6 dependent repositories - 2.19 thousand downloads last month - 142 stars on GitHub - 1 maintainer
tokenization-layer 0.0.2
An NLP tokenization algorithm that is a trainable layer for neural networks.
2 versions - Latest release: over 3 years ago - 1 dependent repository - 94 downloads last month - 2 stars on GitHub - 1 maintainer
alphacodings 0.2.0
base26 ([A-Z]) and base52 ([A-Za-z]) encodings
2 versions - Latest release: 4 months ago - 106 downloads last month - 1,044 stars on GitHub - 1 maintainer
hebpipe 4.0.0.0
A pipeline for Hebrew NLP
15 versions - Latest release: about 1 month ago - 1 dependent repository - 381 downloads last month - 36 stars on GitHub - 1 maintainer
Top 8.6% on pypi.org
ssg 0.0.8
Thai syllable segmentation using Conditional Random Fields
6 versions - Latest release: over 3 years ago - 1 dependent package - 16 dependent repositories - 4.12 thousand downloads last month - 27 stars on GitHub - 1 maintainer
llmaestro 0.1.0
A system for orchestrating LLM tasks that exceed context limits
1 version - Latest release: 2 months ago - 66 downloads last month - 0 stars on GitHub - 1 maintainer
hanzinlp 0.1.0
An NLP package specifically for Chinese
1 version - Latest release: over 1 year ago - 63 downloads last month - 24 stars on GitHub - 1 maintainer
overtokenizer 0.2.0
Unicode-based language-agnostic (over-) tokenizer.
2 versions - Latest release: almost 7 years ago - 1 dependent repository - 75 downloads last month - 1 maintainer
vtext 0.2.0
Natural Language Processing in Rust with Python bindings
4 versions - Latest release: almost 5 years ago - 4 dependent repositories - 300 downloads last month - 150 stars on GitHub - 1 maintainer
Top 4.5% on pypi.org
ekphrasis 0.5.4
Text processing tool, geared towards text from social networks, such as Twitter or Facebook. Ekph...
54 versions - Latest release: almost 3 years ago - 48 dependent repositories - 3.96 thousand downloads last month - 666 stars on GitHub - 1 maintainer
nupunkt 0.5.1
Next-generation Punkt sentence and paragraph boundary detection with zero dependencies
5 versions - Latest release: 13 days ago - 318 downloads last month - 0 stars on GitHub - 1 maintainer
Top 3.8% on pypi.org
icetk 0.0.7
A unified tokenization tool for Images, Chinese and English.
7 versions - Latest release: about 2 years ago - 18 dependent packages - 552 dependent repositories - 14.4 thousand downloads last month - 151 stars on GitHub - 1 maintainer
handict 0.2.0 💰
Yet another word segmentation tool.
3 versions - Latest release: about 5 years ago - 1 dependent repository - 117 downloads last month - 1 star on GitHub - 1 maintainer
example990420 1.1.2
Taiwanese Hokkien Transliterator and Tokeniser
9 versions - Latest release: 11 months ago - 342 downloads last month - 10 stars on GitHub - 1 maintainer
Top 7.0% on pypi.org
rosette-api 1.31.0
Babel Street Analytics API Python client SDK
35 versions - Latest release: 5 months ago - 3 dependent repositories - 1.11 thousand downloads last month - 38 stars on GitHub - 3 maintainers
tokencost 0.1.20
To calculate token and translated USD cost of string and message calls to OpenAI, for example whe...
25 versions - Latest release: 20 days ago - 1 dependent package - 28.3 thousand downloads last month - 190 stars on GitHub - 3 maintainers
Top 3.2% on pypi.org
razdel 0.5.0
Splits Russian text into tokens, sentences, and sections. Rule-based.
5 versions - Latest release: about 5 years ago - 7 dependent packages - 105 dependent repositories - 27.2 thousand downloads last month - 261 stars on GitHub - 1 maintainer
lughaatnlp 1.3.1
A Python package for natural language processing tasks for the Urdu language, including normaliza...
8 versions - Latest release: 4 months ago - 219 downloads last month - 6 stars on GitHub - 1 maintainer
ts-tokenizer 0.1.19
TS Tokenizer is a hybrid (lexicon-based and rule-based) tokenizer designed specifically for token...
20 versions - Latest release: 3 months ago - 450 downloads last month - 1 star on GitHub - 1 maintainer
Top 2.2% on pypi.org
youtokentome 1.0.6
Unsupervised text tokenizer focused on computational efficiency
8 versions - Latest release: about 5 years ago - 8 dependent packages - 228 dependent repositories - 59.7 thousand downloads last month - 966 stars on GitHub - 3 maintainers
sic 1.3.3
Utility for string normalization
15 versions - Latest release: over 3 years ago - 4 dependent repositories - 710 downloads last month - 2 stars on GitHub - 1 maintainer
quebra-frases 0.3.7
quebra_frases chunks strings into byte sized pieces
12 versions - Latest release: almost 4 years ago - 4 dependent packages - 2 dependent repositories - 15.8 thousand downloads last month - 1 star on GitHub - 2 maintainers
taibun 1.1.7
Taiwanese Hokkien Transliterator and Tokeniser
14 versions - Latest release: 8 months ago - 305 downloads last month - 10 stars on GitHub - 1 maintainer
kl3m-data-client 0.1.2
Client for interacting with KL3M data stored in S3
1 version - Latest release: 21 days ago - 1 maintainer
mytokenize 0.1.1
Comprehensive tokenization library for Myanmar language
3 versions - Latest release: 5 months ago - 163 downloads last month - 3 stars on GitHub - 1 maintainer
beanstream 1.0.1
Beanstream SDK library for processing credit card payments.
5 versions - Latest release: almost 9 years ago - 3 dependent repositories - 930 downloads last month - 8 stars on GitHub - 1 maintainer
nlpcube 0.3.1.2
Natural Language Processing Toolkit with support for tokenization, sentence splitting, lemmatizat...
22 versions - Latest release: almost 2 years ago - 1 dependent repository - 1.37 thousand downloads last month - 556 stars on GitHub - 4 maintainers
grigora 0.0.3
Optimised implementation of common deep learning preprocessing utilities.
3 versions - Latest release: almost 6 years ago - 1 dependent repository - 108 downloads last month - 2 stars on GitHub - 1 maintainer
Top 9.8% on pypi.org
tokenmonster 1.1.12
Tokenize and decode text with TokenMonster vocabularies.
15 versions - Latest release: over 1 year ago - 2 dependent packages - 1 dependent repository - 1.49 thousand downloads last month - 567 stars on GitHub - 1 maintainer
Top 9.0% on pypi.org
spacy-wheel 3.5.0
Reupload of SpaCy 3.4.4 with Global Wheel
1 version - Latest release: about 2 years ago - 25,021 stars on GitHub - 1 maintainer
reason 1.0.7
Natural language processing toolbox
19 versions - Latest release: over 1 year ago - 9 dependent repositories - 834 downloads last month - 3 stars on GitHub - 1 maintainer
b-labs-models 2017.8.22
Ready to use CRFSuite models for sentence segmentation, tokenization and so on
3 versions - Latest release: over 7 years ago - 10 dependent repositories - 84 downloads last month - 15 stars on GitHub - 1 maintainer
charformer-pytorch 0.0.4
Charformer - Pytorch
4 versions - Latest release: almost 4 years ago - 1 dependent repository - 208 downloads last month - 117 stars on GitHub - 1 maintainer
pyfpe 0.10.3
Python FPE: format-preserving encryption of values
3 versions - Latest release: over 3 years ago - 1 dependent repository - 2.99 thousand downloads last month - 2 stars on GitHub - 1 maintainer
naivenlp 0.0.9
NLP toolkit, including tokenization, sequence tagging, etc.
9 versions - Latest release: over 4 years ago - 1 dependent repository - 314 downloads last month - 2 stars on GitHub - 1 maintainer
ud-toolkit 0.0.2
NLP toolkit built around UDPipe.
2 versions - Latest release: over 6 years ago - 1 dependent repository - 71 downloads last month - 4 stars on GitHub - 1 maintainer
Top 6.0% on pypi.org
ff3 1.0.2
Format Preserving Encryption (FPE) with FF3
7 versions - Latest release: about 1 year ago - 3 dependent packages - 3 dependent repositories - 202 thousand downloads last month - 101 stars on GitHub - 1 maintainer
epub-conversion 1.0.15
Python package for converting xml and epubs to text files
14 versions - Latest release: about 5 years ago - 5 dependent repositories - 767 downloads last month - 34 stars on GitHub - 1 maintainer
wikipedia-ner 0.0.24
Python package for creating labeled examples from wiki dumps
22 versions - Latest release: about 10 years ago - 3 dependent repositories - 290 downloads last month - 67 stars on GitHub - 1 maintainer
ciseau 1.0.1
Word and sentence tokenization.
2 versions - Latest release: over 7 years ago - 8 dependent repositories - 514 downloads last month - 12 stars on GitHub - 1 maintainer
sept 0.4.2
The Simple Extensible Path Template (sept) is a simple to configure templating system designed at...
6 versions - Latest release: over 3 years ago - 1 dependent repository - 205 downloads last month - 8 stars on GitHub - 1 maintainer
anyks-lm 3.5.0 💰
Smart language model
43 versions - Latest release: over 2 years ago - 1 dependent repository - 1.34 thousand downloads last month - 46 stars on GitHub - 1 maintainer
Top 6.0% on pypi.org
mosestokenizer 1.2.1
Wrappers for several pre-processing scripts from the Moses toolkit.
5 versions - Latest release: over 3 years ago - 9 dependent packages - 74 dependent repositories - 19.6 thousand downloads last month - 20 stars on GitHub - 1 maintainer