Ecosyste.ms: Packages

An open API service providing package, version and dependency metadata of many open source software ecosystems and registries.

pypi.org "tokenizer" keyword

twokenize 1.0.0
Word segmentation / tokenization focussed on Twitter
1 version - Latest release: almost 6 years ago - 6 dependent repositories - 33 downloads last month - 7 stars on GitHub - 1 maintainer
tensorflow-onmttok-ops 0.4.0
OpenNMT Tokenizer as TensorFlow Operations
5 versions - Latest release: almost 4 years ago - 1 dependent repository - 30 downloads last month - 1 maintainer
Top 2.3% on pypi.org
fugashi 1.3.2 💰
A Cython MeCab wrapper for fast, pythonic Japanese tokenization.
71 versions - Latest release: about 2 months ago - 32 dependent packages - 243 dependent repositories - 231 thousand downloads last month - 366 stars on GitHub - 1 maintainer
zh-sentence 0.0.5
Light-weight sentence tokenizer for Chinese languages.
5 versions - Latest release: over 2 years ago - 1 dependent repository - 104 downloads last month - 2 stars on GitHub - 1 maintainer
xml-cleaner 2.0.4
Word and sentence tokenization.
27 versions - Latest release: over 7 years ago - 4 dependent repositories - 116 downloads last month - 13 stars on GitHub - 1 maintainer
korhal 0.1.2
KOrean Rpc-based Application for Handy Application for Language-processing
2 versions - Latest release: over 5 years ago - 1 dependent repository - 10 downloads last month - 1 maintainer
livelex 0.3.0
The livelex lexer
6 versions - Latest release: over 4 years ago - 1 dependent repository - 19 downloads last month - 9 stars on GitHub - 1 maintainer
Top 1.9% on pypi.org
syntok 1.4.4
Text tokenization and sentence segmentation (segtok v2).
16 versions - Latest release: about 2 years ago - 6 dependent packages - 59 dependent repositories - 69 thousand downloads last month - 193 stars on GitHub - 1 maintainer
math-tokenizer 1.0.1
Simple and lightweight tokenizer for mathematical functions
2 versions - Latest release: about 7 years ago - 1 dependent repository - 22 downloads last month - 1 star on GitHub - 1 maintainer
Top 3.3% on pypi.org
nagisa 0.2.11
A Japanese tokenizer based on recurrent neural networks
25 versions - Latest release: 4 months ago - 5 dependent packages - 28 dependent repositories - 261 thousand downloads last month - 371 stars on GitHub - 1 maintainer
ciseau 1.0.1
Word and sentence tokenization.
2 versions - Latest release: over 6 years ago - 8 dependent repositories - 124 downloads last month - 13 stars on GitHub - 1 maintainer
Top 9.0% on pypi.org
cereja 1.9.9
Cereja is a bundle of useful functions that I don't want to rewrite.
130 versions - Latest release: about 1 month ago - 3 dependent packages - 2 dependent repositories - 3.18 thousand downloads last month - 23 stars on GitHub - 1 maintainer
Top 1.6% on pypi.org
sacremoses 0.1.1
SacreMoses
52 versions - Latest release: 7 months ago - 120 dependent packages - 5,564 dependent repositories - 2.2 million downloads last month - 479 stars on GitHub - 3 maintainers
nlp-zero 0.1.6
Unsupervised NLP toolkit
1 version - Latest release: about 6 years ago - 1 dependent repository - 7 downloads last month - 1 maintainer
ipa-core 0.1.3
NLP Preprocessing Pipeline Wrappers
4 versions - Latest release: about 1 year ago - 57 downloads last month - 13 stars on GitHub - 1 maintainer
polyglot-tokenizer 2.0.2
Tokenizer for world's most spoken languages and social media texts like Facebook, Twitter etc.
9 versions - Latest release: almost 3 years ago - 1 dependent repository - 125 downloads last month - 3 stars on GitHub - 2 maintainers
transformers-embedder 3.0.11
Word level transformer based embeddings
24 versions - Latest release: about 1 year ago - 1 dependent repository - 87 downloads last month - 34 stars on GitHub - 1 maintainer
dango 0.0.1
An easy to use tokenizer for Japanese text, aimed at language learners and non-linguists
2 versions - Latest release: over 2 years ago - 3 dependent repositories - 302 downloads last month - 12 stars on GitHub - 1 maintainer
parasol-nlp 0.0.4
Korean tokenizer with character decomposition
4 versions - Latest release: over 4 years ago - 1 dependent repository - 16 downloads last month - 3 stars on GitHub - 1 maintainer
Top 5.5% on pypi.org
simplemma 0.9.1
A simple multilingual lemmatizer for Python.
15 versions - Latest release: over 1 year ago - 6 dependent packages - 25 dependent repositories - 10.9 thousand downloads last month - 129 stars on GitHub - 1 maintainer
rusyll 0.1.1
Splitting Russian words into phonetic syllables
1 version - Latest release: almost 4 years ago - 2 dependent repositories - 44 downloads last month - 6 stars on GitHub - 1 maintainer
generic-lexer 1.1.1
A generic pattern-based Lexer/tokenizer tool.
1 version - Latest release: over 3 years ago - 1 dependent repository - 13 downloads last month - 2 stars on GitHub - 1 maintainer
rs-bytepiece 0.2.2
bytepiece-rs Python binding
7 versions - Latest release: 7 months ago - 28 downloads last month - 14 stars on GitHub - 1 maintainer
alt-eval 1.1.0
Automatic lyrics transcription evaluation toolkit
3 versions - Latest release: 3 months ago - 44 downloads last month - 479 stars on GitHub - 1 maintainer
crossandra 2.1.0
A fast and simple enum/regex-based tokenizer with decent configurability
12 versions - Latest release: about 1 month ago - 1 dependent package - 1 dependent repository - 5.76 thousand downloads last month - 8 stars on GitHub - 1 maintainer
parce 0.33.0
The parce lexer
32 versions - Latest release: about 1 year ago - 1 dependent repository - 96 downloads last month - 9 stars on GitHub - 1 maintainer
spag 1.0.0a0
A module containing scanner (regular expression) and parser (BNF) compilers as well as a base gen...
1 version - Latest release: over 5 years ago - 1 dependent repository - 9 downloads last month - 8 stars on GitHub - 1 maintainer
dom-tokenizers 0.0.6
DOM-aware tokenization for 🤗 Hugging Face language models
13 versions - Latest release: 16 days ago - 1.95 thousand downloads last month - 0 stars on GitHub - 1 maintainer
Top 6.9% on pypi.org
text2text 1.4.4
Text2Text: Crosslingual NLP/G toolkit
142 versions - Latest release: 3 months ago - 4 dependent repositories - 1.08 thousand downloads last month - 274 stars on GitHub - 1 maintainer
Top 9.7% on pypi.org
sinling 0.3.6
A language processing tool for Sinhalese (සිංහල)
7 versions - Latest release: over 3 years ago - 2 dependent packages - 3 dependent repositories - 195 downloads last month - 47 stars on GitHub - 1 maintainer
gpt3_tokenizer 0.1.5
Encoder/Decoder and tokens counter for GPT3
6 versions - Latest release: about 1 month ago - 768 downloads last month - 7 stars on GitHub - 1 maintainer
mecab-text-cleaner 0.1.1 💰
Simple Python package for getting Japanese reading (yomigana) using MeCab
2 versions - Latest release: 5 months ago - 14 downloads last month - 3 stars on GitHub - 1 maintainer
tokenizers-gt 0.15.2.post0
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
3 versions - Latest release: 3 months ago - 571 downloads last month - 8,543 stars on GitHub - 1 maintainer
testasasnkaonlytest 0.1.3
A very basic calculator
4 versions - Latest release: about 3 years ago - 1 dependent repository - 34 downloads last month - 46 stars on GitHub - 1 maintainer
Top 0.6% on pypi.org
tokenizers 0.19.1
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
95 versions - Latest release: about 2 months ago - 380 dependent packages - 14,571 dependent repositories - 25.2 million downloads last month - 8,516 stars on GitHub - 4 maintainers
javac-parser 1.0.0
Exposes the OpenJDK Java parser and scanner to Python
16 versions - Latest release: about 6 years ago - 4 dependent repositories - 19 thousand downloads last month - 6 stars on GitHub - 1 maintainer
dadmatools 3.0.0
DadmaTools is a Persian NLP toolkit
19 versions - Latest release: 6 months ago - 1 dependent repository - 418 downloads last month - 164 stars on GitHub - 1 maintainer
nepalitokenizers 0.0.2
Pre-trained Tokenizers for the Nepali language with an interface to HuggingFace's tokenizers libr...
2 versions - Latest release: 11 months ago - 49 downloads last month - 2 stars on GitHub - 1 maintainer
bleuscore 0.1.2
A fast bleu score calculator
4 versions - Latest release: about 1 month ago - 601 downloads last month - 0 stars on GitHub - 1 maintainer
taibun 1.1.2
Taiwanese Hokkien Transliterator and Tokeniser
12 versions - Latest release: 19 days ago - 498 downloads last month - 10 stars on GitHub - 1 maintainer
doc2term 0.1
A fast NLP tokenizer that detects tokens and removes duplicates and punctuation
1 version - Latest release: about 3 years ago - 1 dependent repository - 14 downloads last month - 1 star on GitHub - 1 maintainer
unitok 3.5.2
Unified Tokenizer
99 versions - Latest release: 2 months ago - 1 dependent repository - 177 downloads last month - 4 stars on GitHub - 2 maintainers
easytoken 2.0.2 💰
easytoken is an independent Open Source, Natural Language Processing python library which impleme...
11 versions - Latest release: over 4 years ago - 1 dependent repository - 35 downloads last month - 1 star on GitHub - 1 maintainer
semiformal 0.7.0
Tokenizer for semiformal unicode text using TR-29 segmentation
2 versions - Latest release: 10 months ago - 1 dependent repository - 248 downloads last month - 0 stars on GitHub - 1 maintainer
mwtokenizer 0.2.0
Wikipedia Tokenizer Utility
3 versions - Latest release: 5 months ago - 1 dependent repository - 21 downloads last month - 1 maintainer
morpholog 1.6
Morphological tokenizer for Russian is able to split words into morphemes: prefixes, roots, infix...
7 versions - Latest release: over 3 years ago - 1 dependent package - 1 dependent repository - 82 downloads last month - 12 stars on GitHub - 1 maintainer
optimal-data-selector 1.2.1
A package to optimize models, transfer or copy files from one directory to another, use for nlp ...
22 versions - Latest release: 5 months ago - 54 downloads last month - 1 maintainer
kimchima 0.5.0
The collections of tools for ML model development.
14 versions - Latest release: about 1 month ago - 216 downloads last month - 0 stars on GitHub - 1 maintainer
korrektor-py 1.0.0
Python wrapper for the https://korrektor.uz/api
1 version - Latest release: almost 2 years ago - 12 downloads last month - 2 stars on GitHub - 1 maintainer
irtm 0.0.4
A toolbox for Information Retrieval & Text Mining.
4 versions - Latest release: over 2 years ago - 1 dependent repository - 19 downloads last month - 1 star on GitHub - 1 maintainer
extractionstring 0.8.2
Basic tools to tokenize (i.e. to construct atomic-entities/sub-strings of) a string, for Natural ...
1 version - Latest release: over 1 year ago - 20 downloads last month - 0 stars on framagit.org - 1 maintainer
dadmatools-light 1.2.0
DadmaTools is a Persian NLP toolkit
3 versions - Latest release: almost 2 years ago - 16 downloads last month - 162 stars on GitHub - 1 maintainer
twkorean 0.1.5
Python interface to twitter-korean-text, a Korean morphological analyzer.
6 versions - Latest release: over 9 years ago - 4 dependent repositories - 79 downloads last month - 33 stars on GitHub - 1 maintainer
lexikanon 0.6.5
A Python Library for Tokenizers
26 versions - Latest release: 2 months ago - 3 dependent packages - 122 downloads last month - 1 star on GitHub - 1 maintainer
plane 0.2.1 💰
A lib for text preprocessing
20 versions - Latest release: over 3 years ago - 3 dependent repositories - 236 downloads last month - 11 stars on GitHub - 1 maintainer
example990420 1.1.2
Taiwanese Hokkien Transliterator and Tokeniser
9 versions - Latest release: 19 days ago - 307 downloads last month - 10 stars on GitHub - 1 maintainer
bodotokenizer 0.1.1
Package for Bodo Tokenizer
2 versions - Latest release: about 2 years ago - 1 dependent repository - 30 downloads last month - 0 stars on GitHub - 1 maintainer
jk-php-tokenizer 0.2020.3.9
This Python module is a tokenizer for configuration files written in PHP.
1 version - Latest release: about 4 years ago - 1 dependent repository - 15 downloads last month - 1 star on GitHub - 1 maintainer
djurl 0.2.0
Simple yet helpful library for writing Django URLs in an easy, short and intuitive way.
4 versions - Latest release: almost 7 years ago - 2 dependent repositories - 16 downloads last month - 80 stars on GitHub - 1 maintainer
space-wrap 0.0.3
Automated Spacy wrapper to turn plain text into Spacy doc objects
3 versions - Latest release: over 1 year ago - 29 downloads last month - 1 maintainer
pyregtokenizer 0.0.1
A BPE Tokenizer using regex
2 versions - Latest release: about 1 month ago - 240 downloads last month - 1 maintainer
unico 0.0.0
Unico provides Unicode metadata parsed directly from the published standard data.
1 version - Latest release: 10 months ago - 1 dependent repository - 0 stars on GitHub - 1 maintainer
openai-function-tokens 0.1.2
A package to estimate token counts for messages AND functions in openai's chat completion API.
3 versions - Latest release: 9 months ago - 1.07 thousand downloads last month - 14 stars on GitHub - 1 maintainer
bpeasy 0.1.2
Fast bare-bones BPE for modern tokenizer training
3 versions - Latest release: 6 months ago - 397 downloads last month - 125 stars on GitHub - 1 maintainer
sengirifix 0.1.3
Yet another fork of sentence-level tokenizer for the Japanese text
1 version - Latest release: over 3 years ago - 1 dependent repository - 14 downloads last month - 0 stars on GitHub - 1 maintainer
Top 2.3% on pypi.org
hazm 0.10.0
Persian NLP Toolkit
15 versions - Latest release: 5 months ago - 8 dependent packages - 126 dependent repositories - 7.78 thousand downloads last month - 1,116 stars on GitHub - 1 maintainer
pyvgram 0.1.2
VGram tokenization
5 versions - Latest release: almost 3 years ago - 1 dependent repository - 38 downloads last month - 0 stars on GitHub - 1 maintainer
ilmulti 0.0.1
Multilingual Text Tooling around Indian Languages
2 versions - Latest release: almost 4 years ago - 1 dependent repository - 10 downloads last month - 21 stars on GitHub - 1 maintainer
tokengeex 1.0.1
TokenGeeX is an efficient tokenizer for code based on UnigramLM and TokenMonster.
9 versions - Latest release: 16 days ago - 6 thousand downloads last month - 3 stars on GitHub - 1 maintainer
tokenizer-adapter 0.1.2
A simple to adapt a pretrained language model to a new vocabulary
3 versions - Latest release: 5 months ago - 41 downloads last month - 1 star on GitHub - 1 maintainer
python-vncorenlp 0.1.8
python_vncorenlp
9 versions - Latest release: almost 4 years ago - 1 dependent repository - 67 downloads last month - 2 stars on GitHub - 1 maintainer
python-rdrsegmenter 0.1.1
python_rdrsegmenter
2 versions - Latest release: over 3 years ago - 1 dependent repository - 96 downloads last month - 1 star on GitHub - 1 maintainer
word-piece-tokenizer 1.0.1 💰
A Lightweight Word Piece Tokenizer
2 versions - Latest release: over 1 year ago - 289 downloads last month - 5 stars on GitHub - 1 maintainer
Top 4.5% on pypi.org
ekphrasis 0.5.4
Text processing tool, geared towards text from social networks, such as Twitter or Facebook. Ekph...
54 versions - Latest release: about 2 years ago - 48 dependent repositories - 1.71 thousand downloads last month - 656 stars on GitHub - 1 maintainer
handict 0.2.0 💰
Yet another word segmentation tool.
3 versions - Latest release: about 4 years ago - 1 dependent repository - 39 downloads last month - 1 star on GitHub - 1 maintainer
easy-tokenizer 0.0.10
tokenizer tool
8 versions - Latest release: about 4 years ago - 1 dependent repository - 103 downloads last month - 1 star on GitHub - 1 maintainer
hindikosh 0.0.1
Hindi corpus reader
1 version - Latest release: over 5 years ago - 1 dependent repository - 18 downloads last month - 1 star on GitHub - 1 maintainer
Top 4.4% on pypi.org
segments 2.2.1
Unicode Standard tokenization routines and orthography profile segmentation
17 versions - Latest release: almost 2 years ago - 5 dependent packages - 696 dependent repositories - 267 thousand downloads last month - 29 stars on GitHub - 4 maintainers
quebra-frases 0.3.7
quebra_frases chunks strings into byte sized pieces
12 versions - Latest release: about 3 years ago - 4 dependent packages - 2 dependent repositories - 12.4 thousand downloads last month - 1 star on GitHub - 2 maintainers
sengiri 0.2.1 💰
Yet another sentence-level tokenizer for the Japanese text
3 versions - Latest release: over 4 years ago - 7 dependent repositories - 460 downloads last month - 21 stars on GitHub - 1 maintainer
Top 5.3% on pypi.org
hangul-utils 0.4.5
An integrated library for Korean preprocessing.
9 versions - Latest release: almost 4 years ago - 1 dependent package - 10 dependent repositories - 1.26 thousand downloads last month - 196 stars on GitHub - 1 maintainer
Top 5.1% on pypi.org
vncorenlp 1.0.3
A Python wrapper for VnCoreNLP using a bidirectional communication channel.
2 versions - Latest release: almost 6 years ago - 4 dependent packages - 31 dependent repositories - 2.15 thousand downloads last month - 55 stars on GitHub - 1 maintainer
Top 3.6% on pypi.org
pyonmttok 1.37.1
Fast and customizable text tokenization library with BPE and SentencePiece support
66 versions - Latest release: over 1 year ago - 3 dependent packages - 103 dependent repositories - 23.1 thousand downloads last month - 259 stars on GitHub - 4 maintainers
unicodetokenizer 0.2.2
UnicodeTokenizer: tokenize all Unicode text
25 versions - Latest release: 7 months ago - 270 downloads last month - 0 stars on GitHub - 1 maintainer
Top 9.8% on pypi.org
tokenmonster 1.1.12
Tokenize and decode text with TokenMonster vocabularies.
15 versions - Latest release: 9 months ago - 2 dependent packages - 1 dependent repository - 1.4 thousand downloads last month - 485 stars on GitHub - 1 maintainer
Top 6.2% on pypi.org
tokenizer 3.4.3
A tokenizer for Icelandic text
53 versions - Latest release: 10 months ago - 6 dependent packages - 75 dependent repositories - 20.9 thousand downloads last month - 27 stars on GitHub - 3 maintainers
Top 5.2% on pypi.org
spacy-experimental 0.6.4
Cutting-edge experimental spaCy components and features
9 versions - Latest release: 7 months ago - 6 dependent packages - 14 dependent repositories - 3.96 thousand downloads last month - 94 stars on GitHub - 1 maintainer
Top 6.5% on pypi.org
somajo 2.4.2
A tokenizer and sentence splitter for German and English web and social media texts.
59 versions - Latest release: 3 months ago - 1 dependent package - 10 dependent repositories - 1.8 thousand downloads last month - 133 stars on GitHub - 1 maintainer
Top 3.7% on pypi.org
sentence-splitter 1.4
Text to sentence splitter using heuristic algorithm by Philipp Koehn and Josh Schroeder
4 versions - Latest release: over 5 years ago - 12 dependent packages - 42 dependent repositories - 190 thousand downloads last month - 216 stars on GitHub - 4 maintainers
semantic-text-splitter 0.13.1
Split text into semantic chunks, up to a desired chunk size. Supports calculating length by chara...
24 versions - Latest release: 26 days ago - 1 dependent package - 11.8 thousand downloads last month - 135 stars on GitHub - 1 maintainer
Top 2.9% on pypi.org
segtok 1.5.11
sentence segmentation and word tokenization tools
23 versions - Latest release: over 2 years ago - 8 dependent packages - 353 dependent repositories - 327 thousand downloads last month - 166 stars on GitHub - 1 maintainer
ai21-tokenizer 0.9.1
AI21's Jurassic models tokenizers
16 versions - Latest release: 19 days ago - 1 dependent package - 71.8 thousand downloads last month - 26 stars on GitHub - 1 maintainer
python-ucto 0.6.7
This is a Python binding to the tokenizer Ucto. Tokenisation is one of the first steps in almost a...
22 versions - Latest release: 7 months ago - 1 dependent package - 4 dependent repositories - 856 downloads last month - 29 stars on GitHub - 1 maintainer
Top 2.4% on pypi.org
natasha 1.6.0
Named-entity recognition for the Russian language
13 versions - Latest release: 10 months ago - 6 dependent packages - 73 dependent repositories - 9.39 thousand downloads last month - 1,149 stars on GitHub - 2 maintainers
ebnfparser 2.1.3
very powerful and optional parser framework for python
24 versions - Latest release: about 6 years ago - 1 dependent repository - 115 downloads last month - 64 stars on GitHub - 1 maintainer
tiniestsegmenter 0.1.0
Compact Japanese segmenter
1 version - Latest release: 22 days ago - 0 stars on GitHub - 1 maintainer
count-tokens 0.7.0
Count the number of tokens in a text file using the tiktoken tokenizer from OpenAI.
7 versions - Latest release: 8 months ago - 2.04 thousand downloads last month - 3 stars on GitHub - 1 maintainer
wyzard 1.0
Run various transformers models from one package.
3 versions - Latest release: about 1 year ago - 33 downloads last month - 0 stars on GitHub - 2 maintainers
zltk 0.0.1
A collection of commonly used functions.
2 versions - Latest release: 5 months ago - 24 downloads last month - 1 maintainer
basictokenizer 0.0.4 (removed)
A basic and useful tokenizer.
3 versions - Latest release: over 1 year ago - 49 downloads last month - 1 maintainer
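
Each entry above ends with a summary line whose fields are separated by " - ". For readers who want to sort or filter the listing, the following is a minimal sketch of turning one such line into a dictionary. The helper name parse_summary and the returned key names are assumptions made for illustration; they are not part of the Ecosyste.ms API. Labels are kept exactly as printed, so singular and plural forms are not merged.

```python
import re

# Multipliers for the magnitude words used in the listing ("thousand", "million").
_SCALE = {"thousand": 1_000, "million": 1_000_000}

# Matches "<number> [thousand|million] <label>", e.g. "231 thousand downloads last month".
_FIELD = re.compile(r"^([\d,]+(?:\.\d+)?)\s+(?:(thousand|million)\s+)?(.+)$")


def parse_summary(line: str) -> dict:
    """Parse one ' - '-separated summary line into a dict (hypothetical helper)."""
    stats = {}
    for part in line.split(" - "):
        part = part.strip()
        # The release field is free text ("about 2 months ago"), so keep it as a string.
        if part.startswith("Latest release:"):
            stats["latest_release"] = part.split(":", 1)[1].strip()
            continue
        match = _FIELD.match(part)
        if not match:
            continue  # skip anything that is not "<number> <label>"
        number, scale, label = match.groups()
        value = float(number.replace(",", "")) * _SCALE.get(scale, 1)
        stats[label] = round(value)  # counts are rounded figures in the listing
    return stats


if __name__ == "__main__":
    fugashi = (
        "71 versions - Latest release: about 2 months ago - 32 dependent packages"
        " - 243 dependent repositories - 231 thousand downloads last month"
        " - 366 stars on GitHub - 1 maintainer"
    )
    print(parse_summary(fugashi))
```

Download counts such as "231 thousand" are already rounded in the listing, so the parsed integers are approximations rather than exact figures.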