Ecosyste.ms: Packages

An open API service providing package, version and dependency metadata of many open source software ecosystems and registries.

pypi.org "tokenizer" keyword

lexikanon 0.6.5
A Python Library for Tokenizers
26 versions - Latest release: about 2 months ago - 3 dependent packages - 122 downloads last month - 1 stars on GitHub - 1 maintainer
kimchima 0.5.0
The collections of tools for ML model development.
13 versions - Latest release: about 1 month ago - 216 downloads last month - 0 stars on GitHub - 1 maintainer
crossandra 2.1.0
A fast and simple enum/regex-based tokenizer with decent configurability
10 versions - Latest release: 21 days ago - 1 dependent package - 1 dependent repositories - 1.84 thousand downloads last month - 8 stars on GitHub - 1 maintainer
plane 0.2.1 💰
A lib for text preprocessing
20 versions - Latest release: over 3 years ago - 3 dependent repositories - 236 downloads last month - 11 stars on GitHub - 1 maintainer
example990420 1.1.2
Taiwanese Hokkien Transliterator and Tokeniser
9 versions - Latest release: 8 days ago - 307 downloads last month - 10 stars on GitHub - 1 maintainer
taibun 1.1.2
Taiwanese Hokkien Transliterator and Tokeniser
10 versions - Latest release: 8 days ago - 419 downloads last month - 10 stars on GitHub - 1 maintainer
bodotokenizer 0.1.1
Package for Bodo Tokenizer
2 versions - Latest release: about 2 years ago - 1 dependent repositories - 30 downloads last month - 0 stars on GitHub - 1 maintainer
jk-php-tokenizer 0.2020.3.9
This python module is a tokenizer for configuration files written in PHP.
1 version - Latest release: about 4 years ago - 1 dependent repositories - 15 downloads last month - 1 stars on GitHub - 1 maintainer
Top 5.5% on pypi.org
simplemma 0.9.1
A simple multilingual lemmatizer for Python.
14 versions - Latest release: over 1 year ago - 6 dependent packages - 25 dependent repositories - 10.7 thousand downloads last month - 128 stars on GitHub - 1 maintainer
djurl 0.2.0
Simple yet helpful library for writing Django URLs in an easy, short and intuitive way.
4 versions - Latest release: almost 7 years ago - 2 dependent repositories - 16 downloads last month - 80 stars on GitHub - 1 maintainer
space-wrap 0.0.3
Automated Spacy wrapper to turn plain text into Spacy doc objects
3 versions - Latest release: over 1 year ago - 29 downloads last month - 1 maintainer
pyregtokenizer 0.0.1
A BPE Tokenizer using regex
2 versions - Latest release: about 1 month ago - 240 downloads last month - 1 maintainer
mecab-text-cleaner 0.1.1 💰
Simple Python package for getting Japanese reading (yomigana) using MeCab
2 versions - Latest release: 5 months ago - 18 downloads last month - 3 stars on GitHub - 1 maintainer
unico 0.0.0
Unico provides Unicode metadata parsed directly from the published standard data.
1 version - Latest release: 9 months ago - 1 dependent repositories - 0 stars on GitHub - 1 maintainer
openai-function-tokens 0.1.2
A package to estimate token counts for messages and functions in OpenAI's chat completion API.
3 versions - Latest release: 8 months ago - 1.07 thousand downloads last month - 14 stars on GitHub - 1 maintainer
bpeasy 0.1.2
Fast bare-bones BPE for modern tokenizer training
3 versions - Latest release: 5 months ago - 397 downloads last month - 125 stars on GitHub - 1 maintainer
sengirifix 0.1.3
Yet another fork of sentence-level tokenizer for the Japanese text
1 version - Latest release: over 3 years ago - 1 dependent repositories - 14 downloads last month - 0 stars on GitHub - 1 maintainer
Top 2.3% on pypi.org
hazm 0.10.0
Persian NLP Toolkit
15 versions - Latest release: 4 months ago - 8 dependent packages - 126 dependent repositories - 7.78 thousand downloads last month - 1,116 stars on GitHub - 1 maintainer
tokenizers-gt 0.15.2.post0
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
3 versions - Latest release: 3 months ago - 1.21 thousand downloads last month - 8,489 stars on GitHub - 1 maintainer
Top 0.6% on pypi.org
tokenizers 0.19.1
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
95 versions - Latest release: about 1 month ago - 380 dependent packages - 14,571 dependent repositories - 25.7 million downloads last month - 8,489 stars on GitHub - 4 maintainers
pyvgram 0.1.2
VGram tokenization
5 versions - Latest release: over 2 years ago - 1 dependent repositories - 38 downloads last month - 0 stars on GitHub - 1 maintainer
ilmulti 0.0.1
Multilingual Text Tooling around Indian Languages
2 versions - Latest release: over 3 years ago - 1 dependent repositories - 10 downloads last month - 21 stars on GitHub - 1 maintainer
tokengeex 1.0.1
TokenGeeX is an efficient tokenizer for code based on UnigramLM and TokenMonster.
9 versions - Latest release: 5 days ago - 6 thousand downloads last month - 3 stars on GitHub - 1 maintainer
tokenizer-adapter 0.1.2
A simple tool to adapt a pretrained language model to a new vocabulary
3 versions - Latest release: 4 months ago - 41 downloads last month - 1 stars on GitHub - 1 maintainer
python-vncorenlp 0.1.8
python_vncorenlp
9 versions - Latest release: almost 4 years ago - 1 dependent repositories - 67 downloads last month - 2 stars on GitHub - 1 maintainer
python-rdrsegmenter 0.1.1
python_rdrsegmenter
2 versions - Latest release: over 3 years ago - 1 dependent repositories - 96 downloads last month - 1 stars on GitHub - 1 maintainer
word-piece-tokenizer 1.0.1 💰
A Lightweight Word Piece Tokenizer
2 versions - Latest release: over 1 year ago - 289 downloads last month - 5 stars on GitHub - 1 maintainer
Top 4.5% on pypi.org
ekphrasis 0.5.4
Text processing tool, geared towards text from social networks, such as Twitter or Facebook. Ekph...
54 versions - Latest release: about 2 years ago - 48 dependent repositories - 1.71 thousand downloads last month - 656 stars on GitHub - 1 maintainer
handict 0.2.0 💰
Yet another word segmentation tool.
3 versions - Latest release: about 4 years ago - 1 dependent repositories - 39 downloads last month - 1 stars on GitHub - 1 maintainer
easy-tokenizer 0.0.10
tokenizer tool
8 versions - Latest release: about 4 years ago - 1 dependent repositories - 103 downloads last month - 1 stars on GitHub - 1 maintainer
Top 1.9% on pypi.org
syntok 1.4.4
Text tokenization and sentence segmentation (segtok v2).
16 versions - Latest release: about 2 years ago - 6 dependent packages - 59 dependent repositories - 68.4 thousand downloads last month - 193 stars on GitHub - 1 maintainer
hindikosh 0.0.1
Hindi corpus reader
1 version - Latest release: over 5 years ago - 1 dependent repositories - 18 downloads last month - 1 stars on GitHub - 1 maintainer
Top 4.4% on pypi.org
segments 2.2.1
Unicode Standard tokenization routines and orthography profile segmentation
17 versions - Latest release: almost 2 years ago - 5 dependent packages - 696 dependent repositories - 267 thousand downloads last month - 29 stars on GitHub - 4 maintainers
quebra-frases 0.3.7
quebra_frases chunks strings into byte sized pieces
12 versions - Latest release: almost 3 years ago - 4 dependent packages - 2 dependent repositories - 12.4 thousand downloads last month - 1 stars on GitHub - 2 maintainers
sengiri 0.2.1 💰
Yet another sentence-level tokenizer for the Japanese text
3 versions - Latest release: over 4 years ago - 7 dependent repositories - 460 downloads last month - 21 stars on GitHub - 1 maintainer
bleuscore 0.1.2
A fast (not yet :) BLEU score calculator
3 versions - Latest release: 23 days ago - 1.33 thousand downloads last month - 0 stars on GitHub - 1 maintainer
Top 5.3% on pypi.org
hangul-utils 0.4.5
An integrated library for Korean preprocessing.
9 versions - Latest release: almost 4 years ago - 1 dependent package - 10 dependent repositories - 1.26 thousand downloads last month - 196 stars on GitHub - 1 maintainer
Top 5.1% on pypi.org
vncorenlp 1.0.3
A Python wrapper for VnCoreNLP using a bidirectional communication channel.
2 versions - Latest release: almost 6 years ago - 4 dependent packages - 31 dependent repositories - 2.15 thousand downloads last month - 55 stars on GitHub - 1 maintainer
Top 3.6% on pypi.org
pyonmttok 1.37.1
Fast and customizable text tokenization library with BPE and SentencePiece support
66 versions - Latest release: about 1 year ago - 3 dependent packages - 103 dependent repositories - 23.1 thousand downloads last month - 259 stars on GitHub - 4 maintainers
Top 1.6% on pypi.org
sacremoses 0.1.1
SacreMoses
52 versions - Latest release: 7 months ago - 120 dependent packages - 5,564 dependent repositories - 2.07 million downloads last month - 479 stars on GitHub - 3 maintainers
unicodetokenizer 0.2.2
UnicodeTokenizer: tokenize all Unicode text
25 versions - Latest release: 6 months ago - 270 downloads last month - 0 stars on GitHub - 1 maintainer
Top 9.8% on pypi.org
tokenmonster 1.1.12
Tokenize and decode text with TokenMonster vocabularies.
15 versions - Latest release: 9 months ago - 2 dependent packages - 1 dependent repositories - 1.4 thousand downloads last month - 485 stars on GitHub - 1 maintainer
Top 6.2% on pypi.org
tokenizer 3.4.3
A tokenizer for Icelandic text
53 versions - Latest release: 9 months ago - 6 dependent packages - 75 dependent repositories - 20.9 thousand downloads last month - 27 stars on GitHub - 3 maintainers
Top 5.2% on pypi.org
spacy-experimental 0.6.4
Cutting-edge experimental spaCy components and features
9 versions - Latest release: 7 months ago - 6 dependent packages - 14 dependent repositories - 3.96 thousand downloads last month - 94 stars on GitHub - 1 maintainer
Top 6.5% on pypi.org
somajo 2.4.2
A tokenizer and sentence splitter for German and English web and social media texts.
59 versions - Latest release: 3 months ago - 1 dependent package - 10 dependent repositories - 1.8 thousand downloads last month - 133 stars on GitHub - 1 maintainer
Top 3.7% on pypi.org
sentence-splitter 1.4
Text to sentence splitter using heuristic algorithm by Philipp Koehn and Josh Schroeder
4 versions - Latest release: over 5 years ago - 12 dependent packages - 42 dependent repositories - 190 thousand downloads last month - 216 stars on GitHub - 4 maintainers
semantic-text-splitter 0.13.1
Split text into semantic chunks, up to a desired chunk size. Supports calculating length by chara...
24 versions - Latest release: 14 days ago - 1 dependent package - 11.8 thousand downloads last month - 135 stars on GitHub - 1 maintainer
Top 2.9% on pypi.org
segtok 1.5.11
sentence segmentation and word tokenization tools
23 versions - Latest release: over 2 years ago - 8 dependent packages - 353 dependent repositories - 327 thousand downloads last month - 166 stars on GitHub - 1 maintainer
ai21-tokenizer 0.9.1
AI21's Jurassic models tokenizers
16 versions - Latest release: 7 days ago - 1 dependent package - 71.8 thousand downloads last month - 26 stars on GitHub - 1 maintainer
python-ucto 0.6.7
This is a Python binding to the tokenizer Ucto. Tokenisation is one of the first steps in almost a...
22 versions - Latest release: 7 months ago - 1 dependent package - 4 dependent repositories - 856 downloads last month - 29 stars on GitHub - 1 maintainer
Top 2.4% on pypi.org
natasha 1.6.0
Named-entity recognition for the Russian language
13 versions - Latest release: 10 months ago - 6 dependent packages - 73 dependent repositories - 9.39 thousand downloads last month - 1,149 stars on GitHub - 2 maintainers
Top 3.3% on pypi.org
nagisa 0.2.11
A Japanese tokenizer based on recurrent neural networks
25 versions - Latest release: 4 months ago - 5 dependent packages - 28 dependent repositories - 346 thousand downloads last month - 367 stars on GitHub - 1 maintainer
Top 2.3% on pypi.org
fugashi 1.3.2 💰
A Cython MeCab wrapper for fast, pythonic Japanese tokenization.
71 versions - Latest release: about 1 month ago - 32 dependent packages - 243 dependent repositories - 217 thousand downloads last month - 366 stars on GitHub - 1 maintainer
ebnfparser 2.1.3
very powerful and optional parser framework for python
24 versions - Latest release: about 6 years ago - 1 dependent repositories - 115 downloads last month - 64 stars on GitHub - 1 maintainer
tiniestsegmenter 0.1.0
Compact Japanese segmenter
1 version - Latest release: 11 days ago - 0 stars on GitHub - 1 maintainer
gpt3_tokenizer 0.1.5
Encoder/Decoder and tokens counter for GPT3
6 versions - Latest release: 26 days ago - 796 downloads last month - 7 stars on GitHub - 1 maintainer
alt-eval 1.1.0
Automatic lyrics transcription evaluation toolkit
3 versions - Latest release: 2 months ago - 35 downloads last month - 479 stars on GitHub - 1 maintainer
rs-bytepiece 0.2.2
bytepiece-rs Python binding
7 versions - Latest release: 6 months ago - 68 downloads last month - 14 stars on GitHub - 1 maintainer
semiformal 0.7.0
Tokenizer for semiformal unicode text using TR-29 segmentation
2 versions - Latest release: 9 months ago - 1 dependent repositories - 187 downloads last month - 0 stars on GitHub - 1 maintainer
optimal-data-selector 1.2.1
A package to optimize models, used for NLP short word treatment, choosing optimal data for M...
21 versions - Latest release: 5 months ago - 30 downloads last month - 1 maintainer
count-tokens 0.7.0
Count the number of tokens in a text file using the tiktoken tokenizer from OpenAI.
7 versions - Latest release: 8 months ago - 2.04 thousand downloads last month - 3 stars on GitHub - 1 maintainer
mwtokenizer 0.2.0
Wikipedia Tokenizer Utility
3 versions - Latest release: 5 months ago - 1 dependent repositories - 17 downloads last month - 1 maintainer
nepalitokenizers 0.0.2
Pre-trained Tokenizers for the Nepali language with an interface to HuggingFace's tokenizers libr...
2 versions - Latest release: 11 months ago - 74 downloads last month - 2 stars on GitHub - 1 maintainer
wyzard 1.0
Run various transformers models from one package.
3 versions - Latest release: about 1 year ago - 33 downloads last month - 0 stars on GitHub - 2 maintainers
zltk 0.0.1
A collection of commonly used functions.
2 versions - Latest release: 5 months ago - 24 downloads last month - 1 maintainer
basictokenizer 0.0.4 removed
A basic and useful tokenizer.
3 versions - Latest release: over 1 year ago - 49 downloads last month - 1 maintainer
extractionstring 0.8.2
Basic tools to tokenize (i.e. to construct atomic-entities/sub-strings of) a string, for Natural ...
1 version - Latest release: over 1 year ago - 6 downloads last month - 0 stars on framagit.org - 1 maintainer
vibrato 0.2.0
Viterbi-based accelerated tokenizer (Python wrapper)
3 versions - Latest release: about 1 year ago - 2.87 thousand downloads last month - 34 stars on GitHub - 1 maintainer
ipa-core 0.1.3
NLP Preprocessing Pipeline Wrappers
4 versions - Latest release: about 1 year ago - 27 downloads last month - 13 stars on GitHub - 1 maintainer
scanpars 0.0.0 removed
scanpars umbrella project
1 version - Latest release: over 1 year ago - 11 downloads last month - 0 stars on GitHub - 1 maintainer
jf-tokenize-package 1.0.3
A simple tokenizer function for NLP
4 versions - Latest release: almost 2 years ago - 21 downloads last month - 0 stars on GitHub - 1 maintainer
vaporetto 0.3.0
Python wrapper of Vaporetto tokenizer
5 versions - Latest release: about 1 year ago - 1 dependent repositories - 3.66 thousand downloads last month - 19 stars on GitHub - 1 maintainer
zh-sentence 0.0.5
Light-weight sentence tokenizer for Chinese languages.
5 versions - Latest release: over 2 years ago - 1 dependent repositories - 167 downloads last month - 1 stars on GitHub - 1 maintainer
youcab 0.1.3
Converts MeCab parsing results to Python objects.
4 versions - Latest release: over 3 years ago - 1 dependent repositories - 36 downloads last month - 0 stars on GitHub - 1 maintainer
xml-cleaner 2.0.4
Word and sentence tokenization.
27 versions - Latest release: over 7 years ago - 4 dependent repositories - 116 downloads last month - 13 stars on GitHub - 1 maintainer
whoosh-igo 0.7
tokenizers for Whoosh designed for Japanese language
6 versions - Latest release: almost 12 years ago - 2 dependent repositories - 21 downloads last month - 6 stars on GitHub - 1 maintainer
unitok 3.5.2
Unified Tokenizer
99 versions - Latest release: about 2 months ago - 1 dependent repositories - 200 downloads last month - 4 stars on GitHub - 2 maintainers
twokenize 1.0.0
Word segmentation / tokenization focussed on Twitter
1 version - Latest release: almost 6 years ago - 6 dependent repositories - 90 downloads last month - 7 stars on GitHub - 1 maintainer
twkorean 0.1.5
Python interface to twitter-korean-text, a Korean morphological analyzer.
6 versions - Latest release: over 9 years ago - 4 dependent repositories - 43 downloads last month - 33 stars on GitHub - 1 maintainer
twitter-korean 0.1.0.dev522
Python port to the normalizer in https://github.com/twitter/twitter-korean-text
2 versions - Latest release: 9 months ago - 2 dependent repositories - 13 downloads last month - 1 maintainer
transformer-embedder 1.7.16
Word level transformer based embeddings
52 versions - Latest release: over 2 years ago - 2 dependent repositories - 93 downloads last month - 34 stars on GitHub - 1 maintainer
toktok 0.0.2
Toktok tokenizer
2 versions - Latest release: over 5 years ago - 1 dependent repositories - 47 downloads last month - 1 stars on GitHub - 1 maintainer
tokenregex 0.1.14
NLP at your fingertips
15 versions - Latest release: over 7 years ago - 1 dependent repositories - 46 downloads last month - 28 stars on GitHub - 1 maintainer
tokenize-output 0.4.10 💰
Get identifiers, names, paths, URLs and words from the command output.
9 versions - Latest release: about 1 year ago - 1 dependent package - 3 dependent repositories - 166 downloads last month - 6 stars on GitHub - 1 maintainer
thai-tokenizer 0.2.5
Fast and accurate Thai tokenization library.
7 versions - Latest release: about 3 years ago - 1 dependent repositories - 8.55 thousand downloads last month - 5 stars on GitHub - 1 maintainer
tglex 0.2.1
Lexical analysis base for telegram bots
4 versions - Latest release: about 4 years ago - 1 dependent repositories - 44 downloads last month - 0 stars on GitHub - 1 maintainer
tftokenizers 0.1.8
Use Huggingface Transformer and Tokenizers as Tensorflow Reusable SavedModels.
9 versions - Latest release: about 2 years ago - 1 dependent repositories - 122 downloads last month - 5 stars on GitHub - 1 maintainer
Top 6.9% on pypi.org
text2text 1.4.4
Text2Text: Crosslingual NLP/G toolkit
142 versions - Latest release: 3 months ago - 4 dependent repositories - 984 downloads last month - 272 stars on GitHub - 1 maintainer
testasasnkaonlytest 0.1.3
A very basic calculator
4 versions - Latest release: about 3 years ago - 1 dependent repositories - 17 downloads last month - 46 stars on GitHub - 1 maintainer
tensorflow-onmttok-ops 0.4.0
OpenNMT Tokenizer as TensorFlow Operations
5 versions - Latest release: almost 4 years ago - 1 dependent repositories - 66 downloads last month - 1 maintainer
spag 1.0.0a0
A module containing scanner (regular expression) and parser (BNF) compilers as well as a base gen...
1 version - Latest release: over 5 years ago - 1 dependent repositories - 13 downloads last month - 8 stars on GitHub - 1 maintainer
Top 2.5% on pypi.org
soynlp 0.0.493
Unsupervised Korean Natural Language Processing Toolkits
31 versions - Latest release: over 4 years ago - 4 dependent packages - 48 dependent repositories - 4.2 thousand downloads last month - 900 stars on GitHub - 1 maintainer
Top 9.7% on pypi.org
sinling 0.3.6
A language processing tool for Sinhalese (සිංහල)
7 versions - Latest release: over 3 years ago - 2 dependent packages - 3 dependent repositories - 472 downloads last month - 46 stars on GitHub - 1 maintainer
sept 0.4.2
The Simple Extensible Path Template (sept) is a simple to configure templating system designed at...
6 versions - Latest release: over 2 years ago - 1 dependent repositories - 37 downloads last month - 8 stars on GitHub - 1 maintainer
separatrice-temp 1.6.4
Separatrice is able to split a text into sentences and a sentence into clauses (Russian). See doc...
3 versions - Latest release: about 3 years ago - 1 dependent repositories - 16 downloads last month - 0 stars on GitHub - 1 maintainer
separatrice 1.6.2
Separatrice is able to split a text into sentences and a sentence into clauses (Russian). See doc...
9 versions - Latest release: over 3 years ago - 1 dependent repositories - 99 downloads last month - 0 stars on GitHub - 1 maintainer
sctokenizer 0.0.8
A Source Code Tokenizer
8 versions - Latest release: about 1 year ago - 4 dependent repositories - 1.49 thousand downloads last month - 12 stars on GitHub - 1 maintainer
rusyll 0.1.1
Splitting Russian words into phonetic syllables
1 version - Latest release: almost 4 years ago - 2 dependent repositories - 50 downloads last month - 6 stars on GitHub - 1 maintainer
re101 0.4.0
A back-pocket regex cookbook
10 versions - Latest release: over 5 years ago - 8 dependent repositories - 340 downloads last month - 5 stars on GitHub - 1 maintainer
pytokenizer 1.1.4
A streaming tokenizer.
6 versions - Latest release: over 3 years ago - 1 dependent repositories - 72 downloads last month - 0 stars on GitHub - 1 maintainer