# PDF Injector

A Python tool for extracting PDF pages as images and generating segmentation masks using SAM2, from local files or Firebase Storage.
## Features

- Convert PDF pages to images using `pdf2image`
- Generate segmentation masks for each page using SAM2
- Save images and masks in an organized directory structure
- Load PDFs from the local filesystem or Firebase Storage (`gs://`)
- Separate visualization tool for browsing results
- Unified file-loading interface using `iopath`
- Command-line interface using `fire`
## Installation

### Prerequisites

- Python 3.7+
- `poppler-utils` (required by `pdf2image`)
### Install System Dependencies

**Ubuntu/Debian:**

```bash
sudo apt-get update
sudo apt-get install -y poppler-utils
```

**macOS:**

```bash
brew install poppler
```

**Windows:** Download and install poppler from https://github.com/oschwartz10612/poppler-windows/releases/

**WSL2:** Also install Tkinter for the visualizer: `sudo apt install -y python3-tk`
### Install Python Dependencies

```bash
pip install fire pdf2image pillow iopath matplotlib google-cloud-storage sam2 openai
pip install git+https://github.com/facebookresearch/segment-anything.git
```
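Because `pdf2image` only fails at runtime when poppler is missing, a quick sanity check can save a confusing traceback later. The `check_poppler` helper below is a hypothetical illustration, not part of this tool:

```python
import shutil


def check_poppler() -> bool:
    """Return True if poppler's pdftoppm binary is on the PATH."""
    return shutil.which("pdftoppm") is not None


print("poppler found" if check_poppler() else "poppler missing - install poppler-utils")
```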
## Getting Started

### Basic Usage

Extract pages and generate masks from a local PDF:

```bash
python pdf_injector.py --file_name="path/to/your.pdf"
```
This will create an output directory with the structure:

```
output_20231201_123456/
├── page_0001/
│   ├── input_image.png
│   └── masks/
│       ├── mask_000.png
│       ├── mask_001.png
│       └── ...
├── page_0002/
│   ├── input_image.png
│   └── masks/
│       └── ...
└── ...
```
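The layout above can be traversed programmatically with `pathlib`; a small sketch (the `summarize_output` name is illustrative, not part of the tool's API):

```python
from pathlib import Path


def summarize_output(output_dir):
    """Map each page directory name to its number of saved masks."""
    summary = {}
    for page_dir in sorted(Path(output_dir).glob("page_*")):
        masks = list((page_dir / "masks").glob("mask_*.png"))
        summary[page_dir.name] = len(masks)
    return summary
```

For example, `summarize_output("output_20231201_123456")` would return something like `{"page_0001": 2, "page_0002": 3}`.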
### Save to a Custom Directory

```bash
python pdf_injector.py --file_name="document.pdf" --output_dir="my_custom_output"
```
### Firebase Storage Usage

Load a PDF from Firebase Storage:

```bash
python pdf_injector.py --file_name="gs://pajama-d9d22.firebasestorage.app/story_injection/test_files/document.pdf"
```
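For reference, a `gs://` URI splits into a bucket name and an object path. The `parse_gs_uri` helper below is a hypothetical illustration of that split (the tool itself resolves these paths through `iopath`):

```python
def parse_gs_uri(uri: str):
    """Split a gs:// URI into (bucket, blob_path)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a gs:// URI: {uri}")
    bucket, _, blob = uri[len(prefix):].partition("/")
    return bucket, blob


# → ('pajama-d9d22.firebasestorage.app', 'story_injection/test_files/document.pdf')
print(parse_gs_uri("gs://pajama-d9d22.firebasestorage.app/story_injection/test_files/document.pdf"))
```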
## Visualizing Results

Use the separate visualization tool to browse saved masks:

```bash
python visualize_masks.py --output_dir="output_20231201_123456"
```

The visualizer provides:

- An interactive slider to navigate between pages
- Keyboard navigation (arrow keys, Home/End)
- Checkboxes to toggle individual mask overlays
- Color-coded masks for easy identification
## Python API Usage

```python
from pdf_injector import load_pdf, save_masks_to_disk
from image_to_story_node_processor import ImageToRGB, PageSegment, ToNumpy
from torchvision import transforms
import torch

# Load local PDF
pages = load_pdf("document.pdf")

# Create processing pipeline
process_page = transforms.Compose([
    ImageToRGB(),
    PageSegment(device=torch.device('cuda')),
    ToNumpy()
])

# Process pages to generate masks
transformed_pages = [process_page(page.image) for page in pages]

# Save images and masks
output_dir = save_masks_to_disk(pages, transformed_pages)
print(f"Saved to: {output_dir}")

# Access page data
for page in pages:
    print(f"Page {page.page_number}:")
    print(f"  Image size: {page.image.size}")
    print(f"  Metadata: {page.metadata}")
```
## Visualization API Usage

```python
from visualize_masks import visualize_output_directory

# Visualize saved masks
visualize_output_directory("output_20231201_123456")
```
## SAM2 Setup

Example Colab: https://colab.research.google.com/github/facebookresearch/sam2/blob/main/notebooks/image_predictor_example.ipynb#scrollTo=7e28150b

Download pretrained weights:

```bash
wget -P ./pretrained_models/ https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_large.pt
```
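Once the weights are downloaded, a mask generator can be built roughly as in the SAM2 repo examples. This is a hedged sketch, not the exact code this tool runs: the config name and checkpoint path are assumptions based on the download command above, and the `try/except` keeps the snippet importable on machines without `sam2` installed.

```python
try:
    import torch
    from sam2.build_sam import build_sam2
    from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator

    # Assumed paths: checkpoint from the wget above; config name from the SAM2 repo
    checkpoint = "./pretrained_models/sam2.1_hiera_large.pt"
    config = "configs/sam2.1/sam2.1_hiera_l.yaml"
    device = "cuda" if torch.cuda.is_available() else "cpu"

    model = build_sam2(config, checkpoint, device=device)
    mask_generator = SAM2AutomaticMaskGenerator(model)
except Exception:
    # sam2/torch not installed, or checkpoint missing
    mask_generator = None
```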
## Output Structure

The tool saves results in the following directory structure:

```
output_<timestamp>/              # Root output directory
├── page_0001/                   # Directory for page 1
│   ├── input_image.png          # Original PDF page as an image
│   └── masks/                   # Directory containing all masks
│       ├── mask_000.png         # Individual mask 0
│       ├── mask_001.png         # Individual mask 1
│       └── ...                  # Additional masks
├── page_0002/                   # Directory for page 2
│   └── ...
└── ...
```
**Note:** By default, output directories are created under `engine_test/` unless an absolute path is provided.
## ScannedPage Structure

Each `ScannedPage` object contains:

- `page_number`: Page number (1-indexed)
- `image`: PIL Image object
- `text`: Empty string (reserved for future text extraction)
- `metadata`: Dictionary containing source file information
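The fields above suggest a shape along these lines. This is a hypothetical sketch, not the actual class definition; the image field is a `PIL.Image.Image` in practice but is typed loosely here to keep the snippet self-contained:

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ScannedPage:
    page_number: int                                   # 1-indexed page number
    image: Any                                         # PIL.Image.Image in practice
    text: str = ""                                     # reserved for future text extraction
    metadata: dict = field(default_factory=dict)       # source file information


page = ScannedPage(page_number=1, image=None, metadata={"source": "document.pdf"})
print(page.page_number, page.metadata)
```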