Doinggithub14/Adobe-round-1a-Div


Adobe Hackathon – PDF Outline Extractor (Offline)

This project extracts structured outlines (Title, H1, H2, H3) from PDF files using an ML model that is trained offline. It was built for Adobe Hackathon Round 1A.


What We’ve Done So Far

  • Extracted font and layout-based features using PyMuPDF
  • Converted JSON ground truth + PDFs into a dataset (CSV)
  • Trained an XGBoost classifier using TF-IDF vectorized features
  • Evaluated using precision/recall metrics
  • Saved model, vectorizer, and encoder
  • Configured everything to run in Docker (offline)
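
As a rough illustration of the first two steps above, here is a minimal sketch of how per-line font and layout features might be flattened into CSV rows. It is not the code from this repository. The function names, the feature columns, the `"O"` label for non-heading lines, and the ground-truth entry shape (`level`, `text`, `page`) are all assumptions made for the example. The input is the dictionary that PyMuPDF returns from `page.get_text("dict")`.

```python
import csv
import io

BOLD_FLAG = 1 << 4  # PyMuPDF sets bit 4 of span["flags"] for bold text


def spans_to_rows(page_dict, page_number):
    """Flatten one page's text dict into one feature row per text line."""
    rows = []
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # 0 = text block, 1 = image block
            continue
        for line in block.get("lines", []):
            spans = [s for s in line.get("spans", []) if s["text"].strip()]
            if not spans:
                continue
            text = " ".join(s["text"].strip() for s in spans)
            rows.append({
                "page": page_number,
                "text": text,
                "font_size": max(s["size"] for s in spans),
                "is_bold": int(any(s["flags"] & BOLD_FLAG for s in spans)),
                "y_top": min(s["bbox"][1] for s in spans),
                "word_count": len(text.split()),
            })
    return rows


def label_rows(rows, outline):
    """Attach a ground-truth label; lines with no match get 'O' (assumed)."""
    # Assumed ground-truth shape: {"level": "H1", "text": "...", "page": 1}
    lookup = {(h["page"], h["text"].strip().lower()): h["level"] for h in outline}
    for row in rows:
        row["label"] = lookup.get((row["page"], row["text"].lower()), "O")
    return rows


def rows_to_csv(rows):
    """Serialize labeled rows to CSV text, using the first row's keys as header."""
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

With PyMuPDF installed, the input could come from `fitz.open(path)` by calling `page.get_text("dict")` on each page. Matching ground truth by exact lowercased text is the simplest choice and will miss headings that wrap across lines, so a real pipeline would likely need fuzzier matching.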

How to Run (Docker)

Step 1: Build Docker image

docker build --platform linux/amd64 -t adobe-headings:v1 .
