Ecosyste.ms: Packages
An open API service providing package, version, and dependency metadata for many open source software ecosystems and registries.
pypi.org "text-extraction" keyword
Top 1.7% on pypi.org
trafilatura 1.9.0 💰
Python package and command-line tool designed to gather text on the Web, includes all necessary d...
44 versions - Latest release: 14 days ago - 71 dependent packages - 63 dependent repositories - 476 thousand downloads last month - 2,688 stars on GitHub - 1 maintainer
Top 1.2% on pypi.org
tika 2.6.0 💰
Apache Tika Python library
35 versions - Latest release: over 1 year ago - 33 dependent packages - 528 dependent repositories - 351 thousand downloads last month - 1,420 stars on GitHub - 1 maintainer
Top 1.6% on pypi.org
sumy 0.11.0 💰
Module for automatic summarization of text documents and HTML pages.
16 versions - Latest release: over 1 year ago - 9 dependent packages - 413 dependent repositories - 379 thousand downloads last month - 3,421 stars on GitHub - 1 maintainer
Top 2.1% on pypi.org
srt 3.5.3
A tiny library for parsing, modifying, and composing SRT files.
30 versions - Latest release: about 1 year ago - 26 dependent packages - 205 dependent repositories - 93.4 thousand downloads last month - 422 stars on GitHub - 1 maintainer
Top 6.2% on pypi.org
slate3k 0.5.3
Extract text from PDF documents easily.
1 version - Latest release: about 7 years ago - 2 dependent packages - 29 dependent repositories - 20.6 thousand downloads last month - 30 stars on GitHub - 1 maintainer
Top 2.7% on pypi.org
justext 3.0.1 💰
Heuristic based boilerplate removal tool
7 versions - Latest release: 7 days ago - 7 dependent packages - 43 dependent repositories - 538 thousand downloads last month - 682 stars on GitHub - 1 maintainer
Top 4.2% on pypi.org
boilerpy3 1.0.7
Python port of Boilerpipe, for HTML boilerplate removal and text extraction
7 versions - Latest release: 7 months ago - 12 dependent packages - 25 dependent repositories - 256 thousand downloads last month - 70 stars on GitHub - 1 maintainer
yt2text 1.0.2
Extract text from a YouTube video in a single command, using OpenAI's Whisper speech recognition ...
3 versions - Latest release: 7 months ago - 26 downloads last month - 1 star on GitHub - 1 maintainer
balena-cpu 1.0.0 💰
BALanced Execution through Natural Activation : a human-computer interaction methodology for code...
1 version - Latest release: 4 months ago - 35 downloads last month - 5 stars on GitHub - 1 maintainer
yirabot 1.0.9
YiraBot: Simplifying Web Scraping for All. A user-friendly tool for developers and enthusiasts, o...
20 versions - Latest release: 2 months ago - 172 downloads last month - 11 stars on GitHub - 1 maintainer
deboiler 2023.46.150
Deboiler is an open-source package to clean HTML pages across an entire domain
2 versions - Latest release: 6 months ago - 248 downloads last month - 7 stars on GitHub - 1 maintainer
fundus 0.3.1
A very simple news crawler
6 versions - Latest release: 3 days ago - 800 downloads last month - 38 stars on GitHub - 1 maintainer
ammico 0.2.0
AI Media and Misinformation Content Analysis Tool
2 versions - Latest release: 8 months ago - 22 downloads last month - 4 stars on GitHub - 1 maintainer
arachnio 0.0.0
Client library for interacting with Arachnio API
1 version - Latest release: about 1 year ago - 11 downloads last month - 0 stars on GitHub - 1 maintainer
wikipedia-ner 0.0.24
Python package for creating labeled examples from wiki dumps
22 versions - Latest release: about 9 years ago - 3 dependent repositories - 93 downloads last month - 68 stars on GitHub - 1 maintainer
wagtail-textract 1.2
Allow searching for text in Documents in the Wagtail content management system
8 versions - Latest release: over 4 years ago - 1 dependent repository - 52 downloads last month - 31 stars on GitHub - 2 maintainers
util-ds 0.5.3 💰
This project is a convenient part of the NLP project, including several already exposed projects ...
22 versions - Latest release: almost 2 years ago - 1 dependent repository - 31 downloads last month - 3,421 stars on GitHub - 1 maintainer
pdf-layout-scanner 1.3.3
A more complete example of programming with PDFMiner, which continues where the default documenta...
7 versions - Latest release: over 4 years ago - 1 dependent repository - 78 downloads last month - 8 stars on GitHub - 1 maintainer
newsman 1.1.0
A tool for web news scraping.
2 versions - Latest release: over 4 years ago - 1 dependent repository - 18 downloads last month - 0 stars on GitHub - 1 maintainer
Top 6.5% on pypi.org
mobi 0.3.3
unpack unencrypted mobi files
6 versions - Latest release: about 2 years ago - 8 dependent packages - 13 dependent repositories - 1.6 thousand downloads last month - 55 stars on GitHub - 1 maintainer
hnlp 0.0.1
Humanly Deeplearning NLP.
2 versions - Latest release: almost 4 years ago - 1 dependent repository - 14 downloads last month - 27 stars on GitHub - 1 maintainer
extracteur-de-fou-malade-pour-charles-le-charlo 0.0.1 💰
PDF data parser
1 version - Latest release: over 3 years ago - 1 dependent repository - 15 downloads last month - 3,421 stars on GitHub - 1 maintainer
Top 3.4% on pypi.org
breadability 0.1.20
Port of Readability HTML parser in Python
21 versions - Latest release: about 10 years ago - 2 dependent packages - 212 dependent repositories - 225 thousand downloads last month - 203 stars on GitHub - 1 maintainer
articleparse 0.2.1 💰
Heuristic text extraction from news articles
3 versions - Latest release: over 6 years ago - 12 downloads last month - 9 stars on GitHub - 1 maintainer
galeodes 0.7 💰
Browsers options
7 versions - Latest release: about 2 years ago - 8 dependent repositories - 1.91 thousand downloads last month - 0 stars on GitHub - 1 maintainer
pd3f 0.4.0
Reconstruct the original continuous text from PDFs with language models
5 versions - Latest release: about 3 years ago - 1 dependent repository - 68 downloads last month - 32 stars on GitHub - 1 maintainer
aiopytesseract 0.14.0 💰
asyncio tesseract wrapper for Tesseract-OCR
15 versions - Latest release: 3 months ago - 1 dependent repository - 468 downloads last month - 15 stars on GitHub - 1 maintainer
apple-ocr 1.0.8 💰
An OCR (Optical Character Recognition) utility for text extraction from images.
9 versions - Latest release: 4 months ago - 99 downloads last month - 68 stars on GitHub - 1 maintainer
Top 5.2% on pypi.org
slate 0.5.2 💰
Extract text from PDF documents easily.
4 versions - Latest release: 9 months ago - 46 dependent repositories - 1.28 thousand downloads last month - 422 stars on GitHub - 1 maintainer
pnlp 0.4.10
A pre/post-processing tool for NLP.
23 versions - Latest release: 4 months ago - 1 dependent package - 2 dependent repositories - 170 downloads last month - 27 stars on GitHub - 1 maintainer
hotpdf 0.5.2
Fast PDF Data Extraction library
27 versions - Latest release: 3 months ago - 1.89 thousand downloads last month - 164 stars on GitHub - 1 maintainer
Related Keywords
python (15)
nlp (11)
text (6)
pdf (5)
NLP (4)
html-extraction (4)
web-scraping (4)
html-extractor (4)
extraction (3)
pagerank-algorithm (3)
reduction (3)
summarization (3)
lsa (3)
html-page (3)
summarizer (3)
summary (3)
sumy (3)
textteaser (3)
text-cleaning (3)
text-processing (2)
command-line-tool (2)
boilerplate-removal (2)
news-scraping (2)
machine-learning (2)
data-extraction (2)
python3 (2)
tesseract (2)
pdfminer (2)
text extraction (2)
html-parsing (2)
html-parser (2)
chinese-nlp (2)
concurrency (2)
nlp-enhancer (2)
nlp-preprocess (2)
normalization (2)
preprocessing (2)
text-length (2)
corpus (2)
news-crawler (2)
scraper (2)
article-extractor (2)
crawler (2)
news (2)
readability (2)
scraping (2)
text-mining (2)
ocr (2)
"ocr" (1)
epub (1)
tokenization (1)
dataset (1)
named-entity-recognition (1)
wikipedia (1)
django (1)
search (1)
hotpdf (1)
textract (1)
wagtail (1)
layout-analysis (1)
data extraction (1)
press (1)
articles (1)
scraping-websites (1)
mobi (1)
language-model (1)
mobipocket (1)
web scraping (1)
web crawling (1)
cc-news (1)
commoncrawl (1)
text-search (1)
rss (1)
sitemap (1)
web-corpus (1)
classification (1)
"asyncio" (1)
computer-vision (1)
translation (1)
pdfquery (1)
arachnio (1)
arachn.io (1)
web-scraping-python (1)
XML (1)
pd3f (1)
clustering (1)
image-recognition (1)
OCR (1)
tesseract-ocr (1)
pytesseract-ocr (1)
pytesseract (1)
pdftotext (1)
optical-character-recognition (1)
text-analysis (1)
asyncio (1)
breadability (1)
content (1)
HTML (1)
parsing (1)
readable (1)