proxy.golang.org : github.com/lmullen/chronam-ocr-debatcher
This utility converts Chronicling America OCR batches into CSVs of the OCR text. It takes as its arguments paths to Chronicling America OCR batches, which are stored as .tar.bz2 files. Each batch contains directories of text files (which we care about) and XML files (which we don't). The path to each text file forms, with modification, the ID for that page on Chronicling America. The utility reads in each batch, extracts the page text, and writes one CSV file per batch with columns for the batch ID, page ID, and text. Batches are processed in parallel.
purl: pkg:golang/github.com/lmullen/chronam-ocr-debatcher
License: MIT
Latest release: over 6 years ago
First release: over 6 years ago
Namespace: github.com/lmullen
Stars: 2 on GitHub
Forks: 0 on GitHub
See more repository details: repos.ecosyste.ms
Last synced: 15 days ago