Features Comparison
Feature Matrix
The following table illustrates how PyMuPDF compares with other typical solutions.
| Feature | PyMuPDF | pikepdf | PyPDF2 | pdfrw | pdfplumber / pdfminer |
|---|---|---|---|---|---|
| Supports Multiple Document Formats | PDF, XPS, EPUB, MOBI, FB2, CBZ, SVG, TXT, Image; DOCX, XLSX, PPTX, HWPX (see note) | | | | |
| Implementation | Python and C | Python and C++ | Python | Python | Python |
| Render Document Pages | All document types | No rendering | No rendering | No rendering | No rendering |
| Write Text to PDF Page | See Page.insert_htmlbox, Page.insert_textbox, or TextWriter | | | | |
| Supports CJK characters | |||||
| Extract Text | All document types | PDF only | PDF only | ||
| Extract Text as Markdown (.md) | All document types | ||||
| Extract Tables | All document types | PDF only | |||
| Extract Vector Graphics | All document types | Limited | |||
| Draw Vector Graphics (PDF) | |||||
| Based on Existing, Mature Library | MuPDF | QPDF | |||
| Automatic Repair of Damaged PDFs | |||||
| Encrypted PDFs | Limited | Limited | |||
| Linearized PDFs | |||||
| Incremental Updates | |||||
| Integrates with Jupyter and IPython Notebooks | |||||
| Joining / Merging PDF with other Document Types | All document types | PDF only | PDF only | PDF only | PDF only |
| OCR API for Seamless Integration with Tesseract | All document types | ||||
| Integrated Checkpoint / Restart Feature (PDF) | |||||
| PDF Optional Content | |||||
| PDF Embedded Files | Limited | Limited | |||
| PDF Redactions | |||||
| PDF Annotations | Full | Limited | |||
| PDF Form Fields | Create, read, update | Limited, no creation | |||
| PDF Page Labels | Read-only | ||||
| Support Font Sub-Setting | |||||
Note
A note about Office document types (DOCX, XLSX, PPTX) and Hangul documents (HWPX): these documents can be loaded into PyMuPDF, and you will receive a Document object.
There are some caveats:
- The input is converted to HTML in order to lay out the content.
- Because of this, the original page separation is lost.
- When saving the result, a faithful representation of the original layout cannot be expected.
These input files are therefore mainly useful for text extraction.
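As a rough illustration of three rows in the feature matrix (writing text to a PDF page, extracting text, and rendering a page), the following minimal sketch builds a one-page PDF in memory and reads it back. It assumes PyMuPDF is installed and importable as `pymupdf`. The helper names `build_sample_pdf` and `extract_text_and_render` are hypothetical and not part of the library.

```python
def build_sample_pdf() -> bytes:
    """Create a one-page PDF in memory and write a line of text onto it."""
    # Imported here so this sketch can be loaded even where PyMuPDF is missing.
    import pymupdf

    doc = pymupdf.open()              # a new, empty PDF
    page = doc.new_page()             # default page size
    box = pymupdf.Rect(72, 72, 400, 200)
    page.insert_textbox(box, "Hello from PyMuPDF", fontsize=12)
    data = doc.tobytes()              # serialize the PDF to bytes
    doc.close()
    return data


def extract_text_and_render(pdf_bytes: bytes, dpi: int = 72):
    """Return the plain text of page 0 and the pixel size of its rendering."""
    import pymupdf

    with pymupdf.open(stream=pdf_bytes, filetype="pdf") as doc:
        page = doc[0]
        text = page.get_text()            # plain text extraction
        pix = page.get_pixmap(dpi=dpi)    # rasterize the page
        return text, (pix.width, pix.height)
```

Calling `extract_text_and_render(build_sample_pdf())` should return the inserted text together with the width and height of the rendered image. To keep the image, `pix.save("page.png")` could be added, where the file name is only an example.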
PyMuPDF Product Suite
PyMuPDF is the standard version of the library; in addition, there is a family of related products, each with different features and functionality.
Additional products in the PyMuPDF product suite are:
PyMuPDF Pro adds support for Office document formats.
PyMuPDF4LLM is optimized for large language model (LLM) applications, providing enhanced text extraction and processing capabilities.
PyMuPDF Layout focuses on layout analysis and semantic understanding, ideal for document conversion and formatting tasks with enhanced results.
Note
All of the products above depend on the same core product, PyMuPDF, and therefore have full access to all of its features. These additional products can be seen as optional extras that enhance the core PyMuPDF library.
PyMuPDF Products Comparison
The following table illustrates what features the products offer:
| | PyMuPDF | PyMuPDF Pro | PyMuPDF4LLM | PyMuPDF Layout |
|---|---|---|---|---|
| Input Documents | | as PyMuPDF and: | as PyMuPDF | as PyMuPDF |
| Output Documents | Can convert any input document to | as PyMuPDF | as PyMuPDF and: Markdown ( | as PyMuPDF4LLM and: |
| Page Analysis | Basic page analysis to return document structure | as PyMuPDF | as PyMuPDF | Advanced page analysis with trained data for enhanced results |
| Data extraction | Basic data extraction with structured layout information and bounding box data | as PyMuPDF | Advanced data extraction with structure tags such as headings, lists, tables | Advanced layout analysis and semantic understanding |
| Table extraction | Basic table extraction as part of text extraction | as PyMuPDF | Advanced table extraction with cell structure and data types | Superior table detection |
| Image extraction | Basic image extraction | as PyMuPDF | Advanced detection and rendering of image areas on the page, saving them to disk or embedding them in Markdown output | Superior detection of “picture” areas |
| Vector extraction | Vector extraction and clustering | as PyMuPDF | as PyMuPDF | Superior detection of “picture” areas |
| Popular RAG Integrations | LangChain, LlamaIndex | as PyMuPDF | as PyMuPDF, plus some additional helper methods for RAG workflows | as PyMuPDF4LLM |
| OCR | On-demand invocation of built-in Tesseract for text detection on pages or images. | as PyMuPDF | as PyMuPDF | Automatic OCR based on page content analysis. |
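
The Markdown output listed for PyMuPDF4LLM can be sketched as follows. This is a minimal example that assumes the separate `pymupdf4llm` package is installed. The wrapper name `pdf_to_markdown` and any file paths passed to it are hypothetical.

```python
from pathlib import Path


def pdf_to_markdown(pdf_path: str, md_path: str) -> str:
    """Convert a document to Markdown text and store it in a file."""
    # Imported here so this sketch can be loaded even where the package is missing.
    import pymupdf4llm

    md_text = pymupdf4llm.to_markdown(pdf_path)   # whole document as one string
    Path(md_path).write_text(md_text, encoding="utf-8")
    return md_text
```

A call such as `pdf_to_markdown("input.pdf", "output.md")` would write the Markdown next to the source file; the returned string can also be passed directly to a RAG pipeline.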
Performance
To benchmark PyMuPDF across a range of tasks, a test suite with a fixed set of 8 PDFs, totalling 7,031 pages containing text and images, is used to obtain performance timings.
Here are current results, grouped by task:
- Copying
This refers to opening a document and then saving it to a new file. This test measures the speed of reading a PDF and re-writing as a new PDF. This process is also at the core of functions like merging / joining multiple documents. The numbers below therefore apply to PDF joining and merging.
The results for all 7,031 pages are: [timing results not reproduced]
- Text Extraction
This refers to extracting simple, plain text from every page of the document and storing it in a text file.
The results for all 7,031 pages are: [timing results not reproduced]
- Rendering
This refers to making an image (such as a PNG) from every page of a document at a given DPI resolution. This feature is the basis for displaying a document in a GUI window.
The results for all 7,031 pages are: [timing results not reproduced]
Note
For more detail regarding the methodology for these performance timings see: Performance Comparison Methodology.
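
For readers who want to time the same three tasks on their own files, here is a minimal sketch. It is not the methodology used for the timings referred to above. The function names `time_task`, `copy_pdf`, `extract_text` and `render_pages`, the best-of-three strategy, and the default DPI are all assumptions made for illustration. The three task functions assume PyMuPDF is installed.

```python
import time
from typing import Callable


def time_task(task: Callable[[], object], repeats: int = 3) -> float:
    """Run `task` several times and return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        task()
        best = min(best, time.perf_counter() - start)
    return best


def copy_pdf(src: str, dst: str) -> None:
    """Copying: open a document and save it to a new file."""
    import pymupdf
    with pymupdf.open(src) as doc:
        doc.save(dst)


def extract_text(src: str, dst: str) -> None:
    """Text extraction: write the plain text of every page to a text file."""
    import pymupdf
    with pymupdf.open(src) as doc, open(dst, "w", encoding="utf-8") as out:
        for page in doc:
            out.write(page.get_text())


def render_pages(src: str, dpi: int = 150) -> None:
    """Rendering: rasterize every page at the given DPI (images are discarded)."""
    import pymupdf
    with pymupdf.open(src) as doc:
        for page in doc:
            page.get_pixmap(dpi=dpi)


# Example use (paths are placeholders):
# seconds = time_task(lambda: copy_pdf("input.pdf", "copy.pdf"))
```

Taking the best of several runs reduces noise from disk caching and other processes; an average or median would be an equally reasonable choice.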
License and Copyright
PyMuPDF and MuPDF are now available under both the open-source AGPL and commercial license agreements. Please read the full text of the AGPL license agreement, available in the distribution material (file COPYING) and on the GNU license page, to ensure that your use case complies with the guidelines of the license. If you determine you cannot meet the requirements of the AGPL, please contact Artifex for more information regarding a commercial license.
Artifex is the exclusive commercial licensing agent for MuPDF.
Artifex, the Artifex logo, MuPDF, and the MuPDF logo are registered trademarks of Artifex Software Inc.
This documentation covers PyMuPDF 1.27.1.
The major and minor versions of PyMuPDF and MuPDF will always be the same. Only the third qualifier (patch level) may deviate from that of MuPDF.
Typically, PyMuPDF is released more frequently than MuPDF, so the patch level of PyMuPDF will often be greater than that of the embedded MuPDF.
For example, PyMuPDF-1.24.5 contains MuPDF-1.24.2.
Also see pymupdf_version and mupdf_version.
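
The versioning rule above can be expressed as a small check. This is a minimal sketch; the helper names `major_minor` and `versions_consistent` are hypothetical, and the check assumes plain dotted version strings such as "1.24.5".

```python
def major_minor(version: str) -> tuple[int, int]:
    """Return the (major, minor) part of a dotted version string."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def versions_consistent(pymupdf_version: str, mupdf_version: str) -> bool:
    """True if both versions share major and minor; patch levels may differ."""
    return major_minor(pymupdf_version) == major_minor(mupdf_version)


print(versions_consistent("1.24.5", "1.24.2"))  # True
```

With PyMuPDF installed, the two strings could be taken from `pymupdf.pymupdf_version` and `pymupdf.mupdf_version`.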
