PDF Injector

A Python tool that loads PDFs from local files or Firebase Storage, extracts each page as an image, and generates segmentation masks for it using SAM2.

Firebase Storage console (test files): https://console.firebase.google.com/u/0/project/pajama-d9d22/storage/pajama-d9d22.firebasestorage.app/files/~2Fstory_injection~2Ftest_files?fb_gclid=Cj0KCQjwnJfEBhCzARIsAIMtfKI7J5chRqEDqPIPDna1pGa86iDT8-sQ0yswfvqTuxVjZfTd9uPMB30aAmceEALw_wcB

Features

  • Convert PDF pages to images using pdf2image
  • Generate segmentation masks for each page using SAM2
  • Save images and masks in an organized directory structure
  • Load PDFs from local filesystem or Firebase Storage (gs://)
  • Separate visualization tool for browsing results
  • Unified file loading interface using iopath
  • Command-line interface using fire

Installation

Prerequisites

  • Python 3.7+
  • poppler-utils (required for pdf2image)

Install System Dependencies

Ubuntu/Debian:

sudo apt-get update
sudo apt-get install -y poppler-utils

macOS:

brew install poppler

Windows: Download and install poppler from: https://github.com/oschwartz10612/poppler-windows/releases/

WSL2: in addition to poppler-utils, install python3-tk:

sudo apt install -y python3-tk

Install Python Dependencies

pip install fire pdf2image pillow iopath matplotlib google-cloud-storage sam2 openai
pip install git+https://github.com/facebookresearch/segment-anything.git
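Before running the tool, it can help to confirm that the prerequisites are visible from Python. The sketch below is not part of the project: check_environment and module_available are hypothetical helpers. It looks for the pdftoppm and pdfinfo binaries that poppler provides (pdf2image calls them) and tries to locate each Python package. The import names are assumptions derived from the pip command above.

```python
import importlib.util
import shutil

# Poppler command-line tools that pdf2image relies on.
POPPLER_BINARIES = ("pdftoppm", "pdfinfo")

# Import names assumed for the packages in the pip command above
# (for example, pillow is imported as PIL).
PYTHON_MODULES = (
    "fire", "pdf2image", "PIL", "iopath",
    "matplotlib", "google.cloud.storage", "sam2", "openai",
)


def module_available(name):
    """Return True if `name` can be located (parent packages get imported)."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (e.g. `google`) is missing.
        return False


def check_environment():
    """Map each prerequisite to True (found) or False (missing)."""
    status = {f"bin:{b}": shutil.which(b) is not None for b in POPPLER_BINARIES}
    status.update({f"py:{m}": module_available(m) for m in PYTHON_MODULES})
    return status


if __name__ == "__main__":
    for item, ok in check_environment().items():
        print(f"{'OK     ' if ok else 'MISSING'}  {item}")
```

A MISSING entry only means the item was not found on the current PATH or in the current Python environment; it says nothing about version compatibility.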

Getting Started

Basic Usage

Extract pages and generate masks from a local PDF:

python pdf_injector.py --file_name="path/to/your.pdf"

This will create an output directory with the structure:

output_20231201_123456/
├── page_0001/
│   ├── input_image.png
│   └── masks/
│       ├── mask_000.png
│       ├── mask_001.png
│       └── ...
├── page_0002/
│   ├── input_image.png
│   └── masks/
│       └── ...
└── ...
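To consume this layout from a script, a directory walk is enough. The following is a minimal sketch using only the standard library; list_pages is a hypothetical helper, not a function exported by pdf_injector, and it assumes the page_XXXX/input_image.png and page_XXXX/masks/mask_NNN.png naming shown above.

```python
from pathlib import Path


def list_pages(output_dir):
    """Return {page_dir_name: [mask file names]} for an output directory.

    Assumes the layout shown above: page_XXXX/input_image.png plus
    page_XXXX/masks/mask_NNN.png.
    """
    root = Path(output_dir)
    pages = {}
    for page_dir in sorted(root.glob("page_*")):
        if not (page_dir / "input_image.png").is_file():
            continue  # skip directories without a rendered page image
        masks = sorted(p.name for p in (page_dir / "masks").glob("mask_*.png"))
        pages[page_dir.name] = masks
    return pages
```

Calling list_pages("output_20231201_123456") would return one entry per page directory, each mapped to its sorted mask file names (an empty list if the page has no masks).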

Save to Custom Directory

python pdf_injector.py --file_name="document.pdf" --output_dir="my_custom_output"

Firebase Storage Usage

Load a PDF from Firebase Storage:

python pdf_injector.py --file_name="gs://pajama-d9d22.firebasestorage.app/story_injection/test_files/document.pdf"
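A gs:// path consists of a bucket name followed by the object path inside that bucket. The sketch below only illustrates how the URI above breaks down; split_gs_uri is a hypothetical helper, and the tool's own loading goes through iopath according to the feature list, so this is not how pdf_injector necessarily resolves the path.

```python
from urllib.parse import urlparse


def split_gs_uri(uri):
    """Split 'gs://bucket/path/to/object' into (bucket, object_path)."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"Not a gs:// URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


bucket, blob = split_gs_uri(
    "gs://pajama-d9d22.firebasestorage.app/story_injection/test_files/document.pdf"
)
print(bucket)  # pajama-d9d22.firebasestorage.app
print(blob)    # story_injection/test_files/document.pdf
```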

Visualizing Results

Use the separate visualization tool to browse saved masks:

python visualize_masks.py --output_dir="output_20231201_123456"

The visualizer provides:

  • Interactive slider to navigate between pages
  • Keyboard navigation (arrow keys, Home/End)
  • Checkboxes to toggle individual mask overlays
  • Color-coded masks for easy identification
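One common way to color-code an arbitrary number of masks is to space hues evenly around the color wheel. The sketch below shows that idea with the standard library only; mask_colors is a hypothetical helper, and the palette visualize_masks.py actually uses is not documented here and may differ.

```python
import colorsys


def mask_colors(n):
    """Return n RGB tuples (floats in 0..1) with evenly spaced hues."""
    return [colorsys.hsv_to_rgb(i / n, 0.8, 1.0) for i in range(n)]


# One color per mask; such tuples are accepted as colors by matplotlib.
for index, rgb in enumerate(mask_colors(4)):
    print(f"mask_{index:03d}", tuple(round(c, 2) for c in rgb))
```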

Python API Usage

from pdf_injector import load_pdf, save_masks_to_disk
from image_to_story_node_processor import ImageToRGB, PageSegment, ToNumpy
from torchvision import transforms
import torch

# Load local PDF
pages = load_pdf("document.pdf")

# Create processing pipeline
process_page = transforms.Compose([
    ImageToRGB(),
    PageSegment(device=torch.device('cuda')),
    ToNumpy()
])

# Process pages to generate masks
transformed_pages = [process_page(page.image) for page in pages]

# Save images and masks
output_dir = save_masks_to_disk(pages, transformed_pages)
print(f"Saved to: {output_dir}")

# Access page data
for page in pages:
    print(f"Page {page.page_number}:")
    print(f" Image size: {page.image.size}")
    print(f" Metadata: {page.metadata}")

Visualization API Usage

from visualize_masks import visualize_output_directory

# Visualize saved masks
visualize_output_directory("output_20231201_123456")

SAM2 Setup

Example Colab: https://colab.research.google.com/github/facebookresearch/sam2/blob/main/notebooks/image_predictor_example.ipynb#scrollTo=7e28150b

Pretrained weights:

wget -P ./pretrained_models/ https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_large.pt

Output Structure

The tool saves results in the following directory structure:

output_<timestamp>/              # Root output directory
├── page_0001/                   # Directory for page 1
│   ├── input_image.png          # Original PDF page as image
│   └── masks/                   # Directory containing all masks
│       ├── mask_000.png         # Individual mask 0
│       ├── mask_001.png         # Individual mask 1
│       └── ...                  # Additional masks
├── page_0002/                   # Directory for page 2
│   └── ...
└── ...

Note: By default, output directories are created under engine_test/ unless an absolute path is provided.
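The rule in this note can be sketched as follows. This is an illustration under stated assumptions, not the tool's code: resolve_output_dir is a hypothetical helper, the relative base is taken to be engine_test/ as the note says, and the timestamp format is inferred from the example name output_20231201_123456.

```python
from datetime import datetime
from pathlib import Path

# Default parent directory named in the note above.
DEFAULT_ROOT = Path("engine_test")


def resolve_output_dir(output_dir=None, now=None):
    """Return the directory results would be written to under the stated rule."""
    if output_dir is None:
        # Timestamped name in the output_YYYYMMDD_HHMMSS style (assumed format).
        stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
        output_dir = f"output_{stamp}"
    path = Path(output_dir)
    # Absolute paths are used as given; relative ones go under engine_test/.
    return path if path.is_absolute() else DEFAULT_ROOT / path
```

Under these assumptions, resolve_output_dir("my_custom_output") yields engine_test/my_custom_output, while an absolute path is returned unchanged.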

ScannedPage Structure

Each ScannedPage object contains:

  • page_number: Page number (1-indexed)
  • image: PIL Image object
  • text: Empty string (reserved for future text extraction)
  • metadata: Dictionary containing source file information
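The fields listed above map naturally onto a dataclass. The following is an illustrative stand-in, assuming only what the list states; the project's real ScannedPage may be defined differently, and the "source" metadata key is a hypothetical example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ScannedPage:
    """Illustrative stand-in for the project's ScannedPage (assumed shape)."""
    page_number: int                 # 1-indexed
    image: Any                       # a PIL Image object in the real tool
    text: str = ""                   # reserved for future text extraction
    metadata: Dict[str, Any] = field(default_factory=dict)


# image=None keeps the sketch free of third-party imports.
page = ScannedPage(page_number=1, image=None, metadata={"source": "document.pdf"})
print(page.page_number, repr(page.text), page.metadata)
```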