Include custom metadata from PDF and Office Types, switch yauzl+concat libraries out for fflate #76

Open
carlosb1504 wants to merge 6 commits into harshankur:master from carlosb1504:master

@carlosb1504

Description

One of our primary use cases for browser-based PDF and Office document parsing is detecting labels (metadata) added by third-party applications, e.g. Microsoft Purview.

This change adds customProperties to the parsed metadata output for all supported office formats and PDF.

Changes:

  • Added a customProperties field to OfficeMetadata, populated from custom.xml for Office Open XML types and from the Custom metadata returned by pdfjs for PDF types.
  • Replaced yauzl and concat-stream with fflate for ZIP extraction, removing two runtime dependencies and the Node-specific stream pipeline they required. This greatly improves performance on platforms that have to use polyfilled streams (e.g. sandboxed JavaScript runtimes, embedded WebViews).
  • Removed four unused Node polyfills from the browser bundle (crypto, util, timers, path), reducing the bundle size by ~241 KB (~20%).
  • Updated the test files and test routine to include instances of custom metadata.
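
To illustrate the first change, here is a minimal sketch of how custom properties could be read out of an Office Open XML custom.xml part once its text has been extracted from the ZIP. It is not the code from this PR: `parseCustomXml`, `decodeXmlEntities` and the sample property names are hypothetical, the `vt:` prefix is assumed to be the one bound to the docPropsVTypes namespace, and the regex matching is a simplification that a real XML parser would handle more robustly.

```typescript
// Hypothetical sketch, not the implementation in this PR.
type CustomProperties = Record<string, string>;

/** Decode the five predefined XML entities. */
function decodeXmlEntities(text: string): string {
  return text
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&amp;/g, "&"); // decoded last so "&amp;lt;" yields "&lt;", not "<"
}

/** Collect name/value pairs from the text of a custom.xml part. */
function parseCustomXml(xml: string): CustomProperties {
  const result: CustomProperties = {};
  // Each <property ...> element wraps one typed value, e.g. <vt:lpwstr>.
  const propertyRe = /<property\b([^>]*)>([\s\S]*?)<\/property>/g;
  let match: RegExpExecArray | null;
  while ((match = propertyRe.exec(xml)) !== null) {
    const attributes = match[1] ?? "";
    const body = match[2] ?? "";
    const name = /\bname="([^"]*)"/.exec(attributes);
    // Assumption: values use the conventional "vt:" prefix.
    const value = /<vt:\w+[^>]*>([\s\S]*?)<\/vt:\w+>/.exec(body);
    if (name && value) {
      result[decodeXmlEntities(name[1] ?? "")] = decodeXmlEntities(value[1] ?? "");
    }
  }
  return result;
}

// Made-up sample document with two custom properties.
const sampleXml = `<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/custom-properties" xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">
  <property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="2" name="Classification"><vt:lpwstr>Internal &amp; Confidential</vt:lpwstr></property>
  <property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="3" name="Reviewed"><vt:bool>true</vt:bool></property>
</Properties>`;

const customProperties = parseCustomXml(sampleXml);
console.log(customProperties);
```

All values are kept as strings in this sketch; whether typed values such as `vt:bool` should be converted is a design choice the sketch leaves open.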

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@carlosb1504
Author

Any thoughts on this one, @harshankur?

@harshankur
Owner

To be honest, I haven't had time to look at it; I have been extremely busy. I will try hard to get through all the pull requests and issues this week. Sorry about the delay, and thanks @carlosb1504.

@carlosb1504
Author

no problem, thanks!

@harshankur
Owner

harshankur commented Mar 24, 2026

Hello @carlosb1504, can you update the branch with the following changes:

  1. Squash commits 1, 2 and 6.
  2. The comment for customProperties in types.ts mentions RTF \info items, but I see no changes that extract that data. Either implement the feature or update the comment.
  3. Drop commit 4. I see no benefit in the regex "simplification". It reduces readability without any performance benefit.
  4. Tests: do not just count the properties. Verify the actual key-value pairs.
  5. Tests: there are no tests for custom metadata in OpenOffice files, and none for RTF.
  6. The first commit message has a typo: "office typees".
  7. Rebase onto the current master.
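
As a rough illustration of point 4, the sketch below shows why a count-only check is weak and what a key-value comparison might look like. The helper `diffProperties` and the sample data are hypothetical and not tied to this project's test routine.

```typescript
// Hypothetical helper; the project's real test routine may look different.
function diffProperties(
  actual: Record<string, string>,
  expected: Record<string, string>,
): string[] {
  const has = (obj: object, key: string): boolean =>
    Object.prototype.hasOwnProperty.call(obj, key);
  const problems: string[] = [];
  // Every expected key must be present with exactly the expected value.
  for (const key of Object.keys(expected)) {
    if (!has(actual, key)) {
      problems.push(`missing key "${key}"`);
    } else if (actual[key] !== expected[key]) {
      problems.push(`"${key}": expected "${expected[key]}", got "${actual[key]}"`);
    }
  }
  // No extra keys may appear either.
  for (const key of Object.keys(actual)) {
    if (!has(expected, key)) problems.push(`unexpected key "${key}"`);
  }
  return problems;
}

// Made-up data: same number of properties, but one value is wrong.
const expectedProps: Record<string, string> = { Classification: "Internal", Reviewed: "true" };
const wrongProps: Record<string, string> = { Classification: "Public", Reviewed: "true" };

// A count-only check cannot tell the two apart.
const countOnlyPasses =
  Object.keys(wrongProps).length === Object.keys(expectedProps).length;

// The key-value comparison reports the mismatching entry.
const problems = diffProperties(wrongProps, expectedProps);
console.log(countOnlyPasses, problems);
```

In a real test the `expected` object would hold the properties known to be in each fixture file, and the test would fail whenever the returned list is non-empty.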