SunnkerLocket89/README.md

Idaho4 Exhibits Parser

This repository provides a command-line helper that downloads and organises the public exhibits listed in the Idaho4_exhibits_with_full_metadata.xlsx spreadsheet. The script reads the spreadsheet, downloads the referenced PDF files, and optionally extracts the first N pages of each document into a dedicated folder.

Installation

The parser now works out of the box using only the Python standard library. Optional third-party packages improve performance and unlock extras. Install them individually or, when available, via the provided requirements.txt file:

pip install -r requirements.txt

Usage

python run_idaho4_parser.py \
  --in-file Idaho4_exhibits_with_full_metadata.xlsx \
  --sheet Exhibits_With_Metadata \
  --workers 6 \
  --extract-pages 4

By default, the script stores the downloaded PDFs in idaho4_output/downloads and writes a JSON manifest plus a CSV summary to idaho4_output. Downloaded files are prefixed with the zero-padded Excel row number, which guarantees unique filenames and keeps the on-disk order aligned with the worksheet.

The manifest records whether each row succeeded, was skipped (for example, because it did not contain a URL), or failed, and includes the corresponding Excel row number for quick cross-referencing. Re-run the command with --resume to continue from where a previous session stopped without re-downloading files.
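The naming and resume behaviour described above can be illustrated with a minimal sketch. This is not the script's actual code: the helper names `build_filename` and `plan_row`, the padding width of 5, the record keys, and the "pending" status are all assumptions made for illustration.

```python
from pathlib import Path


def build_filename(row_number: int, exhibit_id: str, width: int = 5) -> str:
    """Prefix a sanitised exhibit identifier with a zero-padded row number.

    Hypothetical helper: the padding width and the sanitising rule are assumptions.
    """
    safe_id = "".join(c if c.isalnum() or c in "-_" else "_" for c in exhibit_id)
    return f"{row_number:0{width}d}_{safe_id}.pdf"


def plan_row(row_number: int, exhibit_id: str, url: str,
             download_dir: Path, resume: bool) -> dict:
    """Return a manifest-style record for one spreadsheet row (illustrative only)."""
    record = {"row": row_number, "id": exhibit_id, "url": url}
    if not url:
        # Rows without a URL are recorded as skipped rather than failed.
        record["status"] = "skipped"
        record["reason"] = "no URL"
        return record
    target = download_dir / build_filename(row_number, exhibit_id)
    record["path"] = str(target)
    if resume and target.exists() and target.stat().st_size > 0:
        # With resume enabled, a non-empty file on disk is not fetched again.
        record["status"] = "success"
        record["reason"] = "already downloaded"
    else:
        # The caller would download the file and then set success or failed.
        record["status"] = "pending"
    return record
```

Because the row number comes first and is zero-padded, a plain alphabetical listing of the download folder follows worksheet order, and two rows that share an identifier still get distinct filenames.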

Common flags

  • --url-column – Set the spreadsheet column that contains the PDF URL. When omitted, the script attempts to infer a sensible column automatically.
  • --id-column – Configure the column that uniquely identifies each exhibit. This identifier is used to name the downloaded files.
  • --out-dir – Choose a different destination directory for all generated artefacts.
  • --manifest / --csv – Override the default output paths of the JSON manifest and the CSV summary.
  • --verbose – Enable verbose logging for troubleshooting.
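
How the script infers the URL column when --url-column is omitted is not documented here. One plausible heuristic, sketched below under that caveat, checks header names first and then falls back to cell contents. The function name `infer_url_column` and the hint keywords are hypothetical.

```python
from typing import Optional, Sequence


def infer_url_column(headers: Sequence[str],
                     rows: Sequence[Sequence[object]]) -> Optional[str]:
    """Guess which column holds PDF URLs (an assumed heuristic, not the script's own)."""
    # First pass: prefer a header whose name hints at a link.
    hints = ("url", "link", "href", "pdf")
    for name in headers:
        if any(hint in name.lower() for hint in hints):
            return name
    # Second pass: pick the column with the most cells that look like http(s) URLs.
    best_name, best_hits = None, 0
    for index, name in enumerate(headers):
        hits = sum(
            1
            for row in rows
            if index < len(row)
            and str(row[index]).strip().lower().startswith(("http://", "https://"))
        )
        if hits > best_hits:
            best_name, best_hits = name, hits
    return best_name
```

A heuristic like this can misfire, for instance on a header such as "PDF description", which is one reason to pass --url-column explicitly when the spreadsheet layout is known.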

Run python run_idaho4_parser.py --help to see the full list of supported flags.
