PDFInfo Tips: Speed Up PDF Analysis and Metadata Extraction

PDFInfo: Quick Guide to Extracting Metadata from PDFs

What pdfinfo is

pdfinfo (part of Poppler, packaged as poppler-utils) is a command-line tool that prints a PDF’s Info dictionary and other file-level details: title, author, creator, producer, creation and modification dates, page count, page size, PDF version, file size, encryption status and permissions, whether the PDF is tagged, whether it contains JavaScript, and more.

Install

  • Debian/Ubuntu: sudo apt-get install poppler-utils
  • macOS (Homebrew): brew install poppler

Basic usage

  • Show basic metadata:

    Code

    pdfinfo file.pdf
  • Read from stdin:

    Code

    pdfinfo -

Useful options

  • -meta — print the PDF metadata stream
  • -custom — print custom and standard metadata
  • -js — print JavaScript in the PDF
  • -struct / -struct-text — print logical structure (Tagged PDF) / structure with text
  • -box — print MediaBox, CropBox, BleedBox, TrimBox, ArtBox
  • -url — list URLs (annotations)
  • -f N -l M — examine pages N through M (prints sizes/bounding boxes per page if range used)
  • -isodates — print dates in ISO-8601 format
  • -rawdates — print raw PDF date strings
  • -opw / -upw — owner/user password for encrypted PDFs
  • -enc encoding-name — set output encoding (default UTF-8)
  • -v / -h — version / help
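
As a rough illustration of the page-range options, the sketch below extracts the per-page sizes that pdfinfo prints when -f and -l select more than one page. The function name page_sizes is hypothetical, and the line layout it matches ("Page    1 size: 612 x 792 pts (letter)") is assumed from typical Poppler output, so check it against your installed version.

```shell
# Hypothetical helper: reads pdfinfo output on stdin and prints
# "page width height" for every per-page size line.
# Assumed line layout: "Page    1 size: 612 x 792 pts (letter)"
page_sizes() {
  awk '$1 == "Page" && $3 == "size:" { print $2, $4, $6 }'
}

# Typical use (requires pdfinfo and a real file):
#   pdfinfo -f 1 -l 3 file.pdf | page_sizes
```

The single "Page size:" line printed without a range is deliberately not matched, because its third field is the width rather than "size:".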

Example outputs

  • General example (dates shown as printed with -isodates):

    Code

    Title:          Report Q4
    Author:         Alice Smith
    Creator:        Microsoft Word
    Producer:       Mac OS X Quartz PDFContext
    CreationDate:   2024-11-12T09:15:00Z
    ModDate:        2024-11-12T09:20:00Z
    Pages:          12
    Encrypted:      no
    Page size:      612 x 792 pts (letter)
    File size:      234567 bytes
    PDF version:    1.7
  • Metadata stream (use -meta):

    Code

    <?xpacket begin="…"?>
    <x:xmpmeta …> … </x:xmpmeta>

Scripting tips

  • Parse output into a key/value map in scripts (grep/sed/awk or a language wrapper).
  • Example: get page count in shell:

    Code

    pdfinfo file.pdf | awk -F: '/^Pages/ {print $2+0}'
  • Use pdfinfo alongside other poppler tools: pdftotext, pdfimages, pdffonts, pdfseparate, pdfunite.
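
Splitting on every colon works for Pages but truncates values that themselves contain colons, such as ISO dates or some titles. The following is a minimal sketch of a more general lookup; the function name pdfinfo_field is hypothetical, and it assumes the usual "Key:   value" layout of pdfinfo output.

```shell
# Hypothetical helper: print the value of one field from pdfinfo output.
#   $1    = field name exactly as printed, without the colon (e.g. "Pages")
#   stdin = pdfinfo output
# Only the first colon ends the key, so values may contain colons.
pdfinfo_field() {
  awk -v key="$1" '
    index($0, key ":") == 1 {
      val = substr($0, length(key) + 2)   # text after "key:"
      sub(/^[ \t]+/, "", val)             # drop the padding before the value
      print val
      exit
    }'
}

# Typical use (requires pdfinfo and a real file):
#   pdfinfo -isodates file.pdf | pdfinfo_field CreationDate
```

If the field is absent the function prints nothing, which a calling script can test for with an empty-string check.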

When to use pdfinfo

  • Quickly inspect document metadata and properties before processing or publishing.
  • Detect encryption or unexpected producers/creators.
  • Automate metadata checks in CI or batch processing scripts.
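
A batch or CI check along these lines might look like the sketch below. Both function names are hypothetical, and the match assumes the "Encrypted: yes (...)" / "Encrypted: no" line that pdfinfo normally prints. Files that pdfinfo cannot open at all (for example, when a user password is required) produce no output here and are therefore not flagged, so treat this as a starting point rather than a complete audit.

```shell
# Hypothetical helper: exit status 0 when the pdfinfo output on stdin
# reports the document as encrypted, non-zero otherwise.
is_encrypted() {
  grep -q '^Encrypted:[[:space:]]*yes'
}

# Hypothetical batch check: print the names of encrypted PDFs in a
# directory (default: current directory) and return 1 if any were found.
# Requires pdfinfo on PATH.
find_encrypted() {
  dir="${1:-.}"
  found=0
  for f in "$dir"/*.pdf; do
    [ -e "$f" ] || continue          # skip the literal pattern if no PDFs exist
    if pdfinfo "$f" 2>/dev/null | is_encrypted; then
      printf '%s\n' "$f"
      found=1
    fi
  done
  return "$found"
}

# Typical use in a CI step:
#   find_encrypted ./docs || { echo "encrypted PDFs found" >&2; exit 1; }
```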

Limitations

  • Reads and reports what’s stored in the PDF — metadata can be missing or intentionally altered.
  • URL extraction is limited to supported annotation types; it won’t search plain text for HTTP strings.

