PDFInfo Tips: Speed Up PDF Analysis and Metadata Extraction

PDFInfo: Quick Guide to Extracting Metadata from PDFs

What pdfinfo is

pdfinfo, part of Poppler (packaged as poppler-utils on many Linux distributions), is a command-line tool that prints a PDF’s Info dictionary and other file-level details: title, author, creator, producer, creation and modification dates, page count, page size, PDF version, file size, encryption status and permissions, whether the PDF is tagged, whether it contains a metadata stream or JavaScript, and more.

Install

Debian/Ubuntu: sudo apt-get install poppler-utils
macOS (Homebrew): brew install poppler

Basic usage

Show basic metadata:
```
pdfinfo file.pdf
```
Read from stdin:
```
pdfinfo - < file.pdf
```

Useful options

-meta — print the PDF metadata stream
-custom — print custom and standard metadata
-js — print JavaScript in the PDF
-struct / -struct-text — print logical structure (Tagged PDF) / structure with text
-box — print MediaBox, CropBox, BleedBox, TrimBox, ArtBox
-url — list URLs (annotations)
-f N -l M — examine pages N through M (prints sizes/bounding boxes per page if range used)
-isodates — print dates in ISO-8601 format
-rawdates — print raw PDF date strings
-opw / -upw — owner/user password for encrypted PDFs
-enc encoding-name — set output encoding (default UTF-8)
-v / -h — version / help
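
As a rough sketch of how the -f/-l range output might be consumed in a script: the helper below reads pdfinfo output on stdin and prints one "page width height" line per page. It assumes that, when a page range is given, pdfinfo prints per-page lines shaped like `Page    1 size: 612 x 792 pts (letter)`; that layout is an assumption based on typical Poppler output and may differ between versions. The function name page_sizes is hypothetical.

```shell
# Hypothetical helper: reads the output of `pdfinfo -f N -l M file.pdf`
# on stdin and prints "page width height" (sizes in points) per page.
# Assumes per-page lines look like: "Page    1 size: 612 x 792 pts (letter)".
page_sizes() {
    # Fields: $1="Page" $2=page number $3="size:" $4=width $5="x" $6=height
    awk '$1 == "Page" && $3 == "size:" { print $2, $4, $6 }'
}

# Possible usage (requires pdfinfo to be installed):
#   pdfinfo -f 1 -l 3 file.pdf | page_sizes
```

The single-document line `Page size: ...` is deliberately not matched, because its third field is a number rather than `size:`.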

Example outputs

General example (dates shown as printed with -isodates):

```
Title:          Report Q4
Author:         Alice Smith
Creator:        Microsoft Word
Producer:       Mac OS X Quartz PDFContext
CreationDate:   2024-11-12T09:15:00Z
ModDate:        2024-11-12T09:20:00Z
Pages:          12
Encrypted:      no
Page size:      612 x 792 pts (letter)
File size:      234567 bytes
PDF version:    1.7
```

Metadata stream (use -meta):

```
<?xpacket begin="…"?>
…
</x:xmpmeta>
```

Scripting tips

Parse output into a key/value map in scripts (grep/sed/awk or a language wrapper).

Example: get page count in shell:

```
pdfinfo file.pdf | awk -F: '/^Pages/ {print $2+0}'
```
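
Splitting on ":" works for Pages, but it would cut off values that themselves contain colons, such as ISO dates or some titles. A minimal sketch of a more general lookup, assuming each field is printed as `Key:` followed by padding and the value on a single line; the function name pdfinfo_field is hypothetical:

```shell
# Hypothetical helper: reads pdfinfo output on stdin and prints the value
# of the first field whose name exactly matches the given key.
# Assumes "Key:<spaces>value" on one line.
pdfinfo_field() {
    awk -v key="$1" '
        index($0, key ":") == 1 {
            # Keep everything after "Key:", including any later colons.
            value = substr($0, length(key) + 2)
            sub(/^[ \t]+/, "", value)   # strip the alignment padding
            print value
            exit
        }'
}

# Possible usage (requires pdfinfo to be installed):
#   pdfinfo -isodates file.pdf | pdfinfo_field CreationDate
```

Because the match requires the key to be followed directly by a colon, asking for `Pages` does not accidentally pick up the `Page size` line.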

Use pdfinfo alongside other poppler tools: pdftotext, pdfimages, pdffonts, pdfseparate, pdfunite.

When to use pdfinfo

Quickly inspect document metadata and properties before processing or publishing.
Detect encryption or unexpected producers/creators.
Automate metadata checks in CI or batch processing scripts.
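
One way such an automated check could look, as a sketch only: the function below reads pdfinfo output on stdin and succeeds only when the Encrypted field is present and reads `no`. It assumes the field is printed as `Encrypted:` followed by `no` or by `yes` plus details; the name is_unencrypted is hypothetical.

```shell
# Hypothetical CI check: reads pdfinfo output on stdin.
# Exit status 0 only if an "Encrypted:" line exists and its value is "no".
is_unencrypted() {
    awk '
        $1 == "Encrypted:" { found = 1; ok = ($2 == "no") }
        END {
            if (found && ok) exit 0
            exit 1   # encrypted, or the field was missing entirely
        }'
}

# Possible usage in a batch script (requires pdfinfo to be installed):
#   for f in *.pdf; do
#       pdfinfo "$f" | is_unencrypted || echo "encrypted or unreadable: $f" >&2
#   done
```

Treating a missing field as a failure is a deliberate choice here: if pdfinfo could not read the file at all, the check should not pass silently.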

Limitations

Reads and reports what’s stored in the PDF — metadata can be missing or intentionally altered.
URL extraction is limited to supported annotation types; it won’t search plain text for HTTP strings.
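
To cover the plain-text case, text extracted with pdftotext can be scanned separately. The sketch below is a simple filter, not part of pdfinfo: it reads text on stdin and prints the unique http/https strings it finds. The function name text_urls and the deliberately loose pattern are assumptions; trailing punctuation may be captured along with a URL.

```shell
# Hypothetical filter: reads plain text on stdin and prints each distinct
# http:// or https:// string, one per line, in byte order.
text_urls() {
    # -o prints only the matching part; the pattern stops at whitespace.
    grep -oE 'https?://[^[:space:]]+' | LC_ALL=C sort -u
}

# Possible usage (requires pdftotext from poppler-utils):
#   pdftotext file.pdf - | text_urls
```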

