An open API service providing package, version and dependency metadata of many open source software ecosystems and registries.

pypi.org "text-extraction" keyword

View the packages on the pypi.org package registry that are tagged with the "text-extraction" keyword.

Top 1.7% on pypi.org
trafilatura 2.0.0 💰
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction...
50 versions - Latest release: 5 months ago - 71 dependent packages - 63 dependent repositories - 944 thousand downloads last month - 4,118 stars on GitHub - 1 maintainer
gittxt 1.7.7
Gittxt: Get Text from Git — Optimized for AI.
18 versions - Latest release: 4 days ago - 1.09 thousand downloads last month - 0 stars on GitHub - 1 maintainer
Top 1.2% on pypi.org
tika 3.1.0 💰
Apache Tika Python library
36 versions - Latest release: 23 days ago - 33 dependent packages - 528 dependent repositories - 438 thousand downloads last month - 1,426 stars on GitHub - 1 maintainer
Top 1.6% on pypi.org
sumy 0.11.0 💰
Module for automatic summarization of text documents and HTML pages.
16 versions - Latest release: over 2 years ago - 9 dependent packages - 413 dependent repositories - 139 thousand downloads last month - 3,431 stars on GitHub - 1 maintainer
Top 6.2% on pypi.org
slate3k 0.5.3
Extract text from PDF documents easily.
1 version - Latest release: about 8 years ago - 2 dependent packages - 29 dependent repositories - 22.8 thousand downloads last month - 30 stars on GitHub - 1 maintainer
pdf-layout-scanner 1.3.3
A more complete example of programming with PDFMiner, which continues where the default documenta...
7 versions - Latest release: over 5 years ago - 1 dependent repository - 414 downloads last month - 8 stars on GitHub - 1 maintainer
articleparse 0.2.1 💰
Heuristic text extraction from news articles
3 versions - Latest release: over 7 years ago - 89 downloads last month - 10 stars on GitHub - 1 maintainer
fundus 0.5.0
A very simple news crawler
14 versions - Latest release: 2 months ago - 1.81 thousand downloads last month - 366 stars on GitHub - 1 maintainer
vision-parse 0.1.13
Parse PDF documents into markdown formatted content using Vision LLMs
14 versions - Latest release: 3 months ago - 2.94 thousand downloads last month - 339 stars on GitHub - 1 maintainer
magicconvert 0.1.0
MagicConvert is a Python library that converts various document formats (PDF, DOCX, XLSX, PPTX, H...
2 versions - Latest release: 2 months ago - 106 downloads last month - 1 star on GitHub - 1 maintainer
kreuzberg 3.1.3
A text extraction library supporting PDFs, images, office documents and more
19 versions - Latest release: 9 days ago - 6.53 thousand downloads last month - 1,736 stars on GitHub - 1 maintainer
wagtail-textract 1.2
Allow searching for text in Documents in the Wagtail content management system
8 versions - Latest release: over 5 years ago - 1 dependent repository - 253 downloads last month - 31 stars on GitHub - 2 maintainers
hotpdf 0.5.2
Fast PDF Data Extraction library
27 versions - Latest release: about 1 year ago - 2.44 thousand downloads last month - 186 stars on GitHub - 1 maintainer
fileseek 0.1.3
FileSeek – AI-Powered Local Document Archive&Search
3 versions - Latest release: 2 months ago - 505 downloads last month - 1 maintainer
Top 2.7% on pypi.org
justext 3.0.2 💰
Heuristic based boilerplate removal tool
8 versions - Latest release: about 2 months ago - 7 dependent packages - 43 dependent repositories - 1.09 million downloads last month - 725 stars on GitHub - 1 maintainer
Top 2.1% on pypi.org
srt 3.5.3
A tiny library for parsing, modifying, and composing SRT files.
30 versions - Latest release: about 2 years ago - 26 dependent packages - 205 dependent repositories - 269 thousand downloads last month - 498 stars on GitHub - 1 maintainer
pd3f 0.4.0
Reconstruct the original continuous text from PDFs with language models
5 versions - Latest release: about 4 years ago - 1 dependent repository - 216 downloads last month - 32 stars on GitHub - 1 maintainer
Top 5.2% on pypi.org
slate 0.5.2 💰
Extract text from PDF documents easily.
4 versions - Latest release: over 1 year ago - 46 dependent repositories - 567 downloads last month - 427 stars on GitHub - 1 maintainer
ammico 0.2.6
AI Media and Misinformation Content Analysis Tool
8 versions - Latest release: about 2 months ago - 344 downloads last month - 6 stars on GitHub - 1 maintainer
apple-ocr 1.0.8 💰
An OCR (Optical Character Recognition) utility for text extraction from images.
9 versions - Latest release: about 1 year ago - 381 downloads last month - 105 stars on GitHub - 1 maintainer
atai-whisper-tool 0.0.7
OpenAI Whisper with Apple MPS support
6 versions - Latest release: about 1 month ago - 448 downloads last month - 0 stars on GitHub - 1 maintainer
atai-pdf-tool 0.1.1
A tool for parsing and extracting text from PDF files with OCR capabilities
5 versions - Latest release: about 2 months ago - 277 downloads last month - 0 stars on GitHub - 1 maintainer
vlense 0.1.4
A Python package to extract text from images and PDFs using Vision Language Model (VLM).
5 versions - Latest release: 5 months ago - 118 downloads last month - 1 star on GitHub - 1 maintainer
extracteur-de-fou-malade-pour-charles-le-charlo 0.0.1 💰
PDF data parser
1 version - Latest release: over 4 years ago - 1 dependent repository - 54 downloads last month - 3,518 stars on GitHub - 1 maintainer
util-ds 0.5.3 💰
This project is a convenient part of the NLP project, including several already exposed projects ...
22 versions - Latest release: almost 3 years ago - 1 dependent repository - 129 downloads last month - 3,518 stars on GitHub - 1 maintainer
Top 3.4% on pypi.org
breadability 0.1.20
Port of Readability HTML parser in Python
21 versions - Latest release: about 11 years ago - 2 dependent packages - 212 dependent repositories - 89.6 thousand downloads last month - 204 stars on GitHub - 1 maintainer
hnlp 0.0.1
Humanly Deeplearning NLP.
2 versions - Latest release: almost 5 years ago - 1 dependent repository - 76 downloads last month - 29 stars on GitHub - 1 maintainer
pnlp 0.4.16
A pre/post-processing tool for NLP.
29 versions - Latest release: about 2 months ago - 1 dependent package - 2 dependent repositories - 1.17 thousand downloads last month - 29 stars on GitHub - 1 maintainer
galeodes 0.7 💰
Browsers options
7 versions - Latest release: about 3 years ago - 8 dependent repositories - 1.96 thousand downloads last month - 0 stars on GitHub - 1 maintainer
Top 6.5% on pypi.org
mobi 0.3.3 💰
unpack unencrypted mobi files
6 versions - Latest release: about 3 years ago - 8 dependent packages - 13 dependent repositories - 1.77 thousand downloads last month - 61 stars on GitHub - 1 maintainer
tikara 0.1.6
The metadata and text content extractor for almost every file type.
6 versions - Latest release: 3 months ago - 214 downloads last month - 1 star on GitHub - 1 maintainer
llamasearch-pdf-llamasearch 0.1.0
A comprehensive PDF processing toolkit for document workflows
1 version - Latest release: 14 days ago
arachnio 0.0.0
Client library for interacting with Arachnio API
1 version - Latest release: about 2 years ago - 61 downloads last month - 0 stars on GitHub - 1 maintainer
html-to-markdown 1.3.0
Convert HTML to markdown
5 versions - Latest release: 17 days ago - 7.35 thousand downloads last month - 23 stars on GitHub
wpextract 1.1.1
Create datasets from WordPress sites
9 versions - Latest release: 3 months ago - 348 downloads last month - 3 stars on GitHub - 1 maintainer
yirabot 1.0.9
YiraBot: Simplifying Web Scraping for All. A user-friendly tool for developers and enthusiasts, o...
20 versions - Latest release: about 1 year ago - 695 downloads last month - 20 stars on GitHub - 1 maintainer
pdf-parser-header-footer 0.1.10
A Python package for processing PDFs with header and footer detection
10 versions - Latest release: 24 days ago - 131 downloads last month - 1 maintainer
atai-ebook-tool 0.0.4
A command-line tool for parsing ebooks (such as EPUB and MOBI) and converting them into a structu...
3 versions - Latest release: 25 days ago - 304 downloads last month - 0 stars on GitHub - 1 maintainer
yt2text 1.0.2
Extract text from a YouTube video in a single command, using OpenAI's Whisper speech recognition ...
3 versions - Latest release: over 1 year ago - 125 downloads last month - 4 stars on GitHub - 1 maintainer
wikipedia-ner 0.0.24
Python package for creating labeled examples from wiki dumps
22 versions - Latest release: about 10 years ago - 3 dependent repositories - 290 downloads last month - 67 stars on GitHub - 1 maintainer
aiopytesseract 0.14.0 💰
asyncio tesseract wrapper for Tesseract-OCR
15 versions - Latest release: about 1 year ago - 1 dependent repository - 2.01 thousand downloads last month - 17 stars on GitHub - 1 maintainer
deboiler 2023.46.150
Deboiler is an open-source package to clean HTML pages across an entire domain
2 versions - Latest release: over 1 year ago - 126 downloads last month - 7 stars on GitHub - 1 maintainer
balena-cpu 1.0.0 💰
BALanced Execution through Natural Activation : a human-computer interaction methodology for code...
1 version - Latest release: about 1 year ago - 55 downloads last month - 5 stars on GitHub - 1 maintainer
Top 4.2% on pypi.org
boilerpy3 1.0.7
Python port of Boilerpipe, for HTML boilerplate removal and text extraction
7 versions - Latest release: over 1 year ago - 12 dependent packages - 25 dependent repositories - 181 thousand downloads last month - 86 stars on GitHub - 1 maintainer
spanish-pdf-parser 0.1.0
A Python package for processing PDFs with header and footer detection
1 version - Latest release: 3 months ago - 62 downloads last month - 1 maintainer
newsman 1.1.0 (removed)
A tool for web news scraping.
2 versions - Latest release: over 5 years ago - 1 dependent repository - 9 downloads last month - 0 stars on GitHub - 1 maintainer
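The per-package summary lines in the listing above follow a regular pattern: fields separated by " - ", with large counts abbreviated as "thousand" or "million". For readers who want to compare packages programmatically, here is a minimal Python sketch that parses one such line into a dictionary. The function names `parse_count` and `parse_summary` are hypothetical helpers written for this illustration and are not part of any API offered by the service; the field layout is assumed only from the text visible on this page.

```python
import re

# Multipliers for the abbreviated counts seen in the listing
# (assumed from the visible text, e.g. "944 thousand", "1.09 million").
_SCALE = {"thousand": 1_000, "million": 1_000_000}


def parse_count(text: str) -> int:
    """Turn '944 thousand', '1.09 million' or '4,118' into an integer."""
    parts = text.replace(",", "").split()
    value = float(parts[0])
    if len(parts) > 1:
        value *= _SCALE[parts[1]]
    return round(value)


def parse_summary(line: str) -> dict:
    """Parse one ' - '-separated summary line into a dict of fields."""
    fields = {}
    for chunk in line.split(" - "):
        chunk = chunk.strip()
        # The release age is free text, so keep it as a string.
        if chunk.startswith("Latest release:"):
            fields["latest_release"] = chunk.split(":", 1)[1].strip()
            continue
        # Every other field is "<count> <label>", e.g. "71 dependent packages".
        m = re.fullmatch(r"([\d.,]+(?: thousand| million)?) (.+)", chunk)
        if m:
            fields[m.group(2)] = parse_count(m.group(1))
    return fields


if __name__ == "__main__":
    line = (
        "50 versions - Latest release: 5 months ago - 71 dependent packages - "
        "63 dependent repositories - 944 thousand downloads last month - "
        "4,118 stars on GitHub - 1 maintainer"
    )
    print(parse_summary(line))
```

Labels are kept exactly as they appear, so singular and plural forms ("1 version" and "50 versions") produce different keys. Abbreviated counts are rounded on the page, so the parsed numbers are approximations.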
Related Keywords
python (15), pdf (13), nlp (13), ocr (11), document-processing (9), text-processing (6), text (6), web-scraping (5), NLP (4), html-extraction (4), html-extractor (4), text-mining (4), textteaser (3), text-cleaning (3), extraction (3), tesseract (3), markdown (3), image-to-text (3), corpus (3), html-page (3), lsa (3), pagerank-algorithm (3), reduction (3), data-extraction (3), crawler (3), summarization (3), summarizer (3), summary (3), sumy (3), python3 (2), pdfminer (2), boilerplate-removal (2), text-analysis (2), news-scraping (2), parser (2), nlp-enhancer (2), concurrency (2), chinese-nlp (2), nlp-preprocess (2), normalization (2), preprocessing (2), text-length (2), whisper (2), machine-learning (2), command-line-tool (2), html-parsing (2), html-parser (2), document-indexing (2), pdf-processing (2), document-management (2), metadata (2), text extraction (2), search (2), docx (2), asyncio (2), file-conversion (2), tesseract-ocr (2), format-detection (2), pdf-to-markdown (2), news-crawler (2), natural-language-processing (2), scraper (2), tika (2), html-to-markdown (2), article-extractor (2), rag (2), scraping (2), llm (2), readability (2), text-recognition (2), file-format (1), file-analysis (1), excel (1), document-understanding (1), document-text (1), document-reader (1), document-parsing (1), document-ocr (1), document-metadata (1), document-intelligence (1), document-extraction (1), html-text-extraction (1), file-identification (1), file-parsing (1), file-processing (1), file-reader (1), full-text-extraction (1), file-type (1), format-identification (1), image-extraction (1), information-extraction (1), language-detection (1), full text extraction (1), html text extraction (1), scraping-websites (1), news (1), articles (1), screenshots (1), selenium (1), wrapper (1)