pypi.org "text-extraction" keyword
View the packages on the pypi.org package registry that are tagged with the "text-extraction" keyword.
trafilatura 2.0.0 💰
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction...
Top 1.7% on pypi.org - 50 versions - Latest release: 5 months ago - 71 dependent packages - 63 dependent repositories - 944 thousand downloads last month - 4,118 stars on GitHub - 1 maintainer
gittxt 1.7.7
Gittxt: Get Text from Git — Optimized for AI.
18 versions - Latest release: 4 days ago - 1.09 thousand downloads last month - 0 stars on GitHub - 1 maintainer
tika 3.1.0 💰
Apache Tika Python library
Top 1.2% on pypi.org - 36 versions - Latest release: 23 days ago - 33 dependent packages - 528 dependent repositories - 438 thousand downloads last month - 1,426 stars on GitHub - 1 maintainer
sumy 0.11.0 💰
Module for automatic summarization of text documents and HTML pages.
Top 1.6% on pypi.org - 16 versions - Latest release: over 2 years ago - 9 dependent packages - 413 dependent repositories - 139 thousand downloads last month - 3,431 stars on GitHub - 1 maintainer
slate3k 0.5.3
Extract text from PDF documents easily.
Top 6.2% on pypi.org - 1 version - Latest release: about 8 years ago - 2 dependent packages - 29 dependent repositories - 22.8 thousand downloads last month - 30 stars on GitHub - 1 maintainer
pdf-layout-scanner 1.3.3
A more complete example of programming with PDFMiner, which continues where the default documenta...
7 versions - Latest release: over 5 years ago - 1 dependent repository - 414 downloads last month - 8 stars on GitHub - 1 maintainer
articleparse 0.2.1 💰
Heuristic text extraction from news articles
3 versions - Latest release: over 7 years ago - 89 downloads last month - 10 stars on GitHub - 1 maintainer
fundus 0.5.0
A very simple news crawler
14 versions - Latest release: 2 months ago - 1.81 thousand downloads last month - 366 stars on GitHub - 1 maintainer
vision-parse 0.1.13
Parse PDF documents into markdown formatted content using Vision LLMs
14 versions - Latest release: 3 months ago - 2.94 thousand downloads last month - 339 stars on GitHub - 1 maintainer
magicconvert 0.1.0
MagicConvert is a Python library that converts various document formats (PDF, DOCX, XLSX, PPTX, H...
2 versions - Latest release: 2 months ago - 106 downloads last month - 1 star on GitHub - 1 maintainer
kreuzberg 3.1.3
A text extraction library supporting PDFs, images, office documents and more
19 versions - Latest release: 9 days ago - 6.53 thousand downloads last month - 1,736 stars on GitHub - 1 maintainer
wagtail-textract 1.2
Allow searching for text in Documents in the Wagtail content management system
8 versions - Latest release: over 5 years ago - 1 dependent repository - 253 downloads last month - 31 stars on GitHub - 2 maintainers
hotpdf 0.5.2
Fast PDF Data Extraction library
27 versions - Latest release: about 1 year ago - 2.44 thousand downloads last month - 186 stars on GitHub - 1 maintainer
fileseek 0.1.3
FileSeek – AI-Powered Local Document Archive&Search
3 versions - Latest release: 2 months ago - 505 downloads last month - 1 maintainer
justext 3.0.2 💰
Heuristic based boilerplate removal tool
Top 2.7% on pypi.org - 8 versions - Latest release: about 2 months ago - 7 dependent packages - 43 dependent repositories - 1.09 million downloads last month - 725 stars on GitHub - 1 maintainer
srt 3.5.3
A tiny library for parsing, modifying, and composing SRT files.
Top 2.1% on pypi.org - 30 versions - Latest release: about 2 years ago - 26 dependent packages - 205 dependent repositories - 269 thousand downloads last month - 498 stars on GitHub - 1 maintainer
pd3f 0.4.0
Reconstruct the original continuous text from PDFs with language models
5 versions - Latest release: about 4 years ago - 1 dependent repository - 216 downloads last month - 32 stars on GitHub - 1 maintainer
slate 0.5.2 💰
Extract text from PDF documents easily.
Top 5.2% on pypi.org - 4 versions - Latest release: over 1 year ago - 46 dependent repositories - 567 downloads last month - 427 stars on GitHub - 1 maintainer
ammico 0.2.6
AI Media and Misinformation Content Analysis Tool
8 versions - Latest release: about 2 months ago - 344 downloads last month - 6 stars on GitHub - 1 maintainer
apple-ocr 1.0.8 💰
An OCR (Optical Character Recognition) utility for text extraction from images.
9 versions - Latest release: about 1 year ago - 381 downloads last month - 105 stars on GitHub - 1 maintainer
atai-whisper-tool 0.0.7
OpenAI Whisper with Apple MPS support
6 versions - Latest release: about 1 month ago - 448 downloads last month - 0 stars on GitHub - 1 maintainer
atai-pdf-tool 0.1.1
A tool for parsing and extracting text from PDF files with OCR capabilities
5 versions - Latest release: about 2 months ago - 277 downloads last month - 0 stars on GitHub - 1 maintainer
vlense 0.1.4
A Python package to extract text from images and PDFs using Vision Language Model (VLM).
5 versions - Latest release: 5 months ago - 118 downloads last month - 1 star on GitHub - 1 maintainer
extracteur-de-fou-malade-pour-charles-le-charlo 0.0.1 💰
PDF data parser
1 version - Latest release: over 4 years ago - 1 dependent repository - 54 downloads last month - 3,518 stars on GitHub - 1 maintainer
util-ds 0.5.3 💰
This project is a convenient part of the NLP project, including several already exposed projects ...
22 versions - Latest release: almost 3 years ago - 1 dependent repository - 129 downloads last month - 3,518 stars on GitHub - 1 maintainer
breadability 0.1.20
Port of Readability HTML parser in Python
Top 3.4% on pypi.org - 21 versions - Latest release: about 11 years ago - 2 dependent packages - 212 dependent repositories - 89.6 thousand downloads last month - 204 stars on GitHub - 1 maintainer
hnlp 0.0.1
Humanly Deeplearning NLP.
2 versions - Latest release: almost 5 years ago - 1 dependent repository - 76 downloads last month - 29 stars on GitHub - 1 maintainer
pnlp 0.4.16
A pre/post-processing tool for NLP.
29 versions - Latest release: about 2 months ago - 1 dependent package - 2 dependent repositories - 1.17 thousand downloads last month - 29 stars on GitHub - 1 maintainer
galeodes 0.7 💰
Browsers options
7 versions - Latest release: about 3 years ago - 8 dependent repositories - 1.96 thousand downloads last month - 0 stars on GitHub - 1 maintainer
mobi 0.3.3 💰
unpack unencrypted mobi files
Top 6.5% on pypi.org - 6 versions - Latest release: about 3 years ago - 8 dependent packages - 13 dependent repositories - 1.77 thousand downloads last month - 61 stars on GitHub - 1 maintainer
tikara 0.1.6
The metadata and text content extractor for almost every file type.
6 versions - Latest release: 3 months ago - 214 downloads last month - 1 star on GitHub - 1 maintainer
llamasearch-pdf-llamasearch 0.1.0
A comprehensive PDF processing toolkit for document workflows
1 version - Latest release: 14 days ago
arachnio 0.0.0
Client library for interacting with Arachnio API
1 version - Latest release: about 2 years ago - 61 downloads last month - 0 stars on GitHub - 1 maintainer
html-to-markdown 1.3.0
Convert HTML to markdown
5 versions - Latest release: 17 days ago - 7.35 thousand downloads last month - 23 stars on GitHub
wpextract 1.1.1
Create datasets from WordPress sites
9 versions - Latest release: 3 months ago - 348 downloads last month - 3 stars on GitHub - 1 maintainer
yirabot 1.0.9
YiraBot: Simplifying Web Scraping for All. A user-friendly tool for developers and enthusiasts, o...
20 versions - Latest release: about 1 year ago - 695 downloads last month - 20 stars on GitHub - 1 maintainer
pdf-parser-header-footer 0.1.10
A Python package for processing PDFs with header and footer detection
10 versions - Latest release: 24 days ago - 131 downloads last month - 1 maintainer
atai-ebook-tool 0.0.4
A command-line tool for parsing ebooks (such as EPUB and MOBI) and converting them into a structu...
3 versions - Latest release: 25 days ago - 304 downloads last month - 0 stars on GitHub - 1 maintainer
yt2text 1.0.2
Extract text from a YouTube video in a single command, using OpenAI's Whisper speech recognition ...
3 versions - Latest release: over 1 year ago - 125 downloads last month - 4 stars on GitHub - 1 maintainer
wikipedia-ner 0.0.24
Python package for creating labeled examples from wiki dumps
22 versions - Latest release: about 10 years ago - 3 dependent repositories - 290 downloads last month - 67 stars on GitHub - 1 maintainer
aiopytesseract 0.14.0 💰
asyncio tesseract wrapper for Tesseract-OCR
15 versions - Latest release: about 1 year ago - 1 dependent repository - 2.01 thousand downloads last month - 17 stars on GitHub - 1 maintainer
deboiler 2023.46.150
Deboiler is an open-source package to clean HTML pages across an entire domain
2 versions - Latest release: over 1 year ago - 126 downloads last month - 7 stars on GitHub - 1 maintainer
balena-cpu 1.0.0 💰
BALanced Execution through Natural Activation : a human-computer interaction methodology for code...
1 version - Latest release: about 1 year ago - 55 downloads last month - 5 stars on GitHub - 1 maintainer
boilerpy3 1.0.7
Python port of Boilerpipe, for HTML boilerplate removal and text extraction
Top 4.2% on pypi.org - 7 versions - Latest release: over 1 year ago - 12 dependent packages - 25 dependent repositories - 181 thousand downloads last month - 86 stars on GitHub - 1 maintainer
spanish-pdf-parser 0.1.0
A Python package for processing PDFs with header and footer detection
1 version - Latest release: 3 months ago - 62 downloads last month - 1 maintainer
newsman 1.1.0 (removed)
A tool for web news scraping.
2 versions - Latest release: over 5 years ago - 1 dependent repository - 9 downloads last month - 0 stars on GitHub - 1 maintainer
Related Keywords
python (15)
pdf (13)
nlp (13)
ocr (11)
document-processing (9)
text-processing (6)
text (6)
web-scraping (5)
NLP (4)
html-extraction (4)
html-extractor (4)
text-mining (4)
textteaser (3)
text-cleaning (3)
extraction (3)
tesseract (3)
markdown (3)
image-to-text (3)
corpus (3)
html-page (3)
lsa (3)
pagerank-algorithm (3)
reduction (3)
data-extraction (3)
crawler (3)
summarization (3)
summarizer (3)
summary (3)
sumy (3)
python3 (2)
pdfminer (2)
boilerplate-removal (2)
text-analysis (2)
news-scraping (2)
parser (2)
nlp-enhancer (2)
concurrency (2)
chinese-nlp (2)
nlp-preprocess (2)
normalization (2)
preprocessing (2)
text-length (2)
whisper (2)
machine-learning (2)
command-line-tool (2)
html-parsing (2)
html-parser (2)
document-indexing (2)
pdf-processing (2)
document-management (2)
metadata (2)
text extraction (2)
search (2)
docx (2)
asyncio (2)
file-conversion (2)
tesseract-ocr (2)
format-detection (2)
pdf-to-markdown (2)
news-crawler (2)
natural-language-processing (2)
scraper (2)
tika (2)
html-to-markdown (2)
article-extractor (2)
rag (2)
scraping (2)
llm (2)
readability (2)
text-recognition (2)
file-format (1)
file-analysis (1)
excel (1)
document-understanding (1)
document-text (1)
document-reader (1)
document-parsing (1)
document-ocr (1)
document-metadata (1)
document-intelligence (1)
document-extraction (1)
html-text-extraction (1)
file-identification (1)
file-parsing (1)
file-processing (1)
file-reader (1)
full-text-extraction (1)
file-type (1)
format-identification (1)
image-extraction (1)
information-extraction (1)
language-detection (1)
full text extraction (1)
html text extraction (1)
scraping-websites (1)
news (1)
articles (1)
screenshots (1)
selenium (1)
wrapper (1)