PDF Injector

A Python tool that extracts the pages of a PDF as images and generates segmentation masks for each page with SAM2. PDFs can be loaded from the local filesystem or from Firebase Storage.

Firebase Storage console (test files): https://console.firebase.google.com/u/0/project/pajama-d9d22/storage/pajama-d9d22.firebasestorage.app/files/~2Fstory_injection~2Ftest_files

Features

Convert PDF pages to images using pdf2image
Generate segmentation masks for each page using SAM2
Save images and masks in organized directory structure
Load PDFs from local filesystem or Firebase Storage (gs://)
Separate visualization tool for browsing results
Unified file loading interface using iopath
Command-line interface using fire

Installation

Prerequisites

Python 3.7+ (note that the sam2 package itself requires Python 3.10 or newer)
poppler-utils (required for pdf2image)

Install System Dependencies

Ubuntu/Debian:

sudo apt-get update
sudo apt-get install -y poppler-utils

macOS:

brew install poppler

Windows: Download poppler from https://github.com/oschwartz10612/poppler-windows/releases/ and add its bin directory to your PATH.

WSL2: Also install Tk, which matplotlib's Tk GUI backend depends on:

sudo apt install -y python3-tk
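
To confirm the system dependency is in place before running the tool, a small check like the following may help. It is only a sketch: it assumes the poppler binaries pdf2image relies on (pdftoppm and pdfinfo) should be discoverable on PATH, and check_poppler is a hypothetical helper, not part of this project.

```python
import shutil

# pdf2image shells out to these poppler command-line tools.
POPPLER_TOOLS = ("pdftoppm", "pdfinfo")


def check_poppler(tools=POPPLER_TOOLS):
    """Return {tool_name: resolved_path_or_None} for each poppler binary."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in check_poppler().items():
        print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```

If either tool is reported as missing, revisit the install step for your platform above.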

Install Python Dependencies

pip install fire pdf2image pillow iopath matplotlib google-cloud-storage sam2 openai
pip install git+https://github.com/facebookresearch/segment-anything.git

Getting Started

Basic Usage

Extract pages and generate masks from a local PDF:

python pdf_injector.py --file_name="path/to/your.pdf"

This will create an output directory with the structure:

output_20231201_123456/
├── page_0001/
│   ├── input_image.png
│   └── masks/
│       ├── mask_000.png
│       ├── mask_001.png
│       └── ...
├── page_0002/
│   ├── input_image.png
│   └── masks/
│       └── ...
└── ...
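
Because the layout is regular, the results can be inspected with a few lines of standard-library Python. The sketch below assumes only the directory and file names shown in the tree above; count_masks is a hypothetical helper, not a function exported by pdf_injector.

```python
from pathlib import Path


def count_masks(output_dir):
    """Map each page directory name to the number of mask PNGs it holds."""
    counts = {}
    for page_dir in sorted(Path(output_dir).glob("page_*")):
        if not page_dir.is_dir():
            continue
        masks_dir = page_dir / "masks"
        # A page without a masks/ directory is reported as having zero masks.
        n_masks = len(list(masks_dir.glob("mask_*.png"))) if masks_dir.is_dir() else 0
        counts[page_dir.name] = n_masks
    return counts
```

For example, count_masks("output_20231201_123456") would return a dictionary keyed by page_0001, page_0002, and so on.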

Save to Custom Directory

python pdf_injector.py --file_name="document.pdf" --output_dir="my_custom_output"

Firebase Storage Usage

Load a PDF from Firebase Storage:

python pdf_injector.py --file_name="gs://pajama-d9d22.firebasestorage.app/story_injection/test_files/document.pdf"
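
A gs:// URI is made of a bucket name followed by an object path. The tool resolves these through iopath, so the snippet below is not how pdf_injector loads files. It is only an illustrative sketch (split_gs_uri is a hypothetical helper) of how such a URI breaks down, which can be useful when checking a path against the Firebase console.

```python
from urllib.parse import urlparse


def split_gs_uri(uri):
    """Split 'gs://bucket/path/to/object' into (bucket, object_path)."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a gs:// URI: {uri!r}")
    # The leading slash belongs to the URI syntax, not to the object name.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, blob = split_gs_uri(
    "gs://pajama-d9d22.firebasestorage.app/story_injection/test_files/document.pdf"
)
```

Here bucket is the storage bucket and blob is the object path inside it.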

Visualizing Results

Use the separate visualization tool to browse saved masks:

python visualize_masks.py --output_dir="output_20231201_123456"

The visualizer provides:

Interactive slider to navigate between pages
Keyboard navigation (arrow keys, Home/End)
Checkboxes to toggle individual mask overlays
Color-coded masks for easy identification
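
One common way to give each mask a distinguishable color is to space hues evenly around the color wheel. The following is a minimal sketch of that idea, not necessarily the scheme visualize_masks.py uses; mask_colors and its parameters are hypothetical.

```python
import colorsys


def mask_colors(n_masks, saturation=0.8, value=1.0):
    """Return n_masks RGB tuples (floats in [0, 1]) with evenly spaced hues."""
    return [
        colorsys.hsv_to_rgb(i / n_masks, saturation, value)
        for i in range(n_masks)
    ]
```

The returned tuples are in the 0 to 1 range that matplotlib accepts directly as colors.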

Python API Usage

from pdf_injector import load_pdf, save_masks_to_disk
from image_to_story_node_processor import ImageToRGB, PageSegment, ToNumpy
from torchvision import transforms
import torch

# Load local PDF
pages = load_pdf("document.pdf")

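# Create processing pipeline (use the GPU when available, otherwise fall back to CPU)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
process_page = transforms.Compose([
    ImageToRGB(),
    PageSegment(device=device),
    ToNumpy()
])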

# Process pages to generate masks
transformed_pages = [process_page(page.image) for page in pages]

# Save images and masks
output_dir = save_masks_to_disk(pages, transformed_pages)
print(f"Saved to: {output_dir}")

# Access page data
for page in pages:
    print(f"Page {page.page_number}:")
    print(f"  Image size: {page.image.size}")
    print(f"  Metadata: {page.metadata}")

Visualization API Usage

from visualize_masks import visualize_output_directory

# Visualize saved masks
visualize_output_directory("output_20231201_123456")

SAM2 Setup

Example Colab: https://colab.research.google.com/github/facebookresearch/sam2/blob/main/notebooks/image_predictor_example.ipynb#scrollTo=7e28150b

Pretrained weights:

wget -P ./pretrained_models/ https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_large.pt
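
Once the checkpoint is downloaded, it can be loaded with the sam2 package roughly as follows. This is a sketch based on the upstream SAM2 example notebooks, not on this project's PageSegment implementation, which may configure the model differently. The config name configs/sam2.1/sam2.1_hiera_l.yaml is an assumption taken from the sam2 repository, and build_mask_generator and largest_first are hypothetical helpers.

```python
CHECKPOINT = "./pretrained_models/sam2.1_hiera_large.pt"
MODEL_CFG = "configs/sam2.1/sam2.1_hiera_l.yaml"  # assumed: config shipped with sam2


def build_mask_generator(checkpoint=CHECKPOINT, model_cfg=MODEL_CFG, device="cuda"):
    """Build an automatic mask generator from the downloaded checkpoint."""
    # Imported lazily so this module can be imported without sam2 installed.
    from sam2.build_sam import build_sam2
    from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator

    model = build_sam2(model_cfg, checkpoint, device=device)
    return SAM2AutomaticMaskGenerator(model)


def largest_first(masks):
    """Sort SAM2 mask records (dicts with an 'area' key) from largest to smallest."""
    return sorted(masks, key=lambda m: m["area"], reverse=True)
```

A generator built this way exposes generate(image), which takes an RGB image as a NumPy array and returns one record per mask; largest_first can then order those records by area.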

Output Structure

The tool saves results in the following directory structure:

output_<timestamp>/              # Root output directory
├── page_0001/                  # Directory for page 1
│   ├── input_image.png         # Original PDF page as image
│   └── masks/                  # Directory containing all masks
│       ├── mask_000.png        # Individual mask 0
│       ├── mask_001.png        # Individual mask 1
│       └── ...                 # Additional masks
├── page_0002/                  # Directory for page 2
│   └── ...
└── ...

Note: By default, output directories are created under engine_test/ unless an absolute path is provided.
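
The naming and placement rule in the note above can be sketched as follows. The timestamp format (YYYYMMDD_HHMMSS) is inferred from the example name output_20231201_123456, and resolve_output_dir is a hypothetical helper, so treat this as an approximation of the behavior rather than the tool's actual code.

```python
from datetime import datetime
from pathlib import Path


def resolve_output_dir(output_dir=None, base="engine_test", now=None):
    """Return the directory results would be written to under the stated rule."""
    if output_dir is None:
        now = now or datetime.now()
        # Assumed format, inferred from the example directory name.
        output_dir = f"output_{now:%Y%m%d_%H%M%S}"
    path = Path(output_dir)
    # Absolute paths are used as given; relative ones are placed under base.
    return path if path.is_absolute() else Path(base) / path
```

Under this sketch, a relative name such as my_custom_output ends up at engine_test/my_custom_output, while an absolute path is left unchanged.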

ScannedPage Structure

Each ScannedPage object contains:

page_number: Page number (1-indexed)
image: PIL Image object
text: Empty string (reserved for future text extraction)
metadata: Dictionary containing source file information
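
For orientation, the fields listed above correspond to a container along these lines. This is an illustrative dataclass under the assumption that ScannedPage is a plain data holder; the real class in the codebase may be defined differently, and the "source" metadata key in the usage line is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ScannedPage:
    page_number: int                  # 1-indexed page number
    image: Any                        # a PIL Image in the real tool
    text: str = ""                    # reserved for future text extraction
    metadata: Dict[str, Any] = field(default_factory=dict)  # source file info


# Hypothetical usage; image is left as None to avoid a Pillow dependency here.
page = ScannedPage(page_number=1, image=None, metadata={"source": "document.pdf"})
```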
