Tags: Unstructured-IO/unstructured

0.21.5

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
fix: relax lower bound for pdfminer.six (#4262)

The current lower bound for pdfminer.six is still too new for some commonly used file parsing tools like `pdfplumber`. This PR lowers this bound so that `unstructured` is compatible with those tools.

0.21.2

Verified

fix: self-install pinned spaCy model at runtime with SHA256 verification (#4258)

## Summary
- Replace `en-core-web-sm` direct URL dependency in `pyproject.toml` with the `installer` library
- spaCy model is now downloaded and installed on first use with SHA256 hash verification
- Removes `[tool.uv.sources]` section, making the install more portable across package managers

## Test plan
- [ ] Verify `tokenize.py` downloads and installs the spaCy model on first use
- [ ] Verify SHA256 hash check rejects tampered wheels
- [ ] Verify existing NLP tokenization tests pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Introduces runtime download-and-install behavior into the NLP path and writes into `site-packages`, which can fail under restricted networking/permissions or in unusual multi-process environments despite locking and hash checks.
>
> **Overview**
> Updates NLP tokenization to **lazy-load and self-install** the pinned `en_core_web_sm` spaCy model on first use, downloading the wheel from GitHub and verifying it via **SHA256**, with a cross-process `FileLock` to avoid concurrent installs.
>
> Removes the `en-core-web-sm` wheel URL dependency and `[tool.uv.sources]` override, adding `installer` (for wheel installation) and `filelock` (for install locking) to dependencies; `uv.lock` is updated accordingly and the version is bumped to `0.21.2`.
>
> Adjusts the `Dockerfile` to trigger model installation during image build (via `uv run` importing `_get_nlp`) so the model is present before `HF_HUB_OFFLINE=1` is set.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit df62a9c. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Yao You <yao@unstructured.io>
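
The hash check described in this commit message can be illustrated with a minimal sketch. This is not the code from the PR: the function names `sha256_of` and `verify_wheel` are hypothetical, and the download, the `FileLock`, and the wheel installation steps are left out. It only shows how a downloaded file might be compared against a pinned SHA256 digest before anything is installed.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_wheel(path: Path, expected_sha256: str) -> None:
    """Raise ValueError if the file's digest differs from the pinned digest.

    Hypothetical helper: the real project may structure this differently.
    """
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(f"SHA256 mismatch for {path.name}: got {actual}")
```

Reading in chunks keeps memory use flat for large wheels. A caller would run `verify_wheel` on the downloaded file and only proceed to installation if no exception is raised.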

0.21.1

Verified

bump version (#4257) 

0.21.0

Verified

Fix: replace nltk with spacy CVE 2025 14009 (#4255)

Co-authored-by: Lawrence Elitzer <lawrence@unstructured.io>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>

0.20.8

Verified

fix: update dependencies (#4247)

- Resolve a lock issue with Windows and Python 3.13 (lack of library support): a few dependencies are only required on non-Windows systems, or on Windows with Python < 3.13
- Downgrade `wrapt` so it is compatible with the `opentelemetry-instrumentation-httpx` library

0.20.6

Verified

fix: remap parent id after hashing (#4245)

This PR addresses an issue where hashing element IDs loses the parent ID reference. This happens when calling `partition_html`, where the partition process has already assigned parent IDs to elements based on the HTML structure before `apply_metadata` is called, i.e., before element ID hashing happens. This fix ensures that parent references still point to the correct elements after hashing.
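
The remapping idea can be sketched roughly as follows. This is an illustration under assumptions, not the library's implementation: the `Element` dataclass and `hash_ids` function are hypothetical stand-ins, and the hash input (index plus text) is chosen only for the example. The point is that the old-to-new ID mapping is recorded while hashing and then applied to every `parent_id`.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    """Hypothetical stand-in, not the library's Element class."""
    id: str
    text: str
    parent_id: Optional[str] = None


def hash_ids(elements: list[Element]) -> None:
    """Replace each id with a hash, then remap parent_id through the same mapping."""
    old_to_new: dict[str, str] = {}
    for index, el in enumerate(elements):
        # Assumed hash input for illustration: position plus text.
        new_id = hashlib.sha256(f"{index}:{el.text}".encode()).hexdigest()[:32]
        old_to_new[el.id] = new_id
        el.id = new_id
    for el in elements:
        if el.parent_id is not None:
            # Leave unknown parent ids untouched rather than dropping them.
            el.parent_id = old_to_new.get(el.parent_id, el.parent_id)
```

Without the second loop, a child's `parent_id` would still hold the pre-hash ID, which no element carries any more.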

0.20.2

Add verbose to PyPI publish step to debug 403

Co-authored-by: Cursor <cursoragent@cursor.com>

0.20.1

Verified

Use bigger runner to publish images (#4237)

<!-- CURSOR_SUMMARY -->
> [!NOTE]
> **Low Risk**
> CI/workflow-only change plus a version/changelog bump; low risk beyond potential runner availability/cost differences.
>
> **Overview**
> Fixes CD Docker publishing failures by switching the `publish-images` GitHub Actions job runner from `ubuntu-latest` to `ubuntu-latest-m`, addressing "no space left on device" issues when pulling multi-arch images.
>
> Updates release metadata by bumping `unstructured/__version__.py` to `0.20.1` and adding a matching changelog entry describing the CI runner change.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit f6408c9. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

0.19.3

Verified

Fix ARM64 image issues (#4233)

<!-- CURSOR_SUMMARY -->
> [!NOTE]
> **Medium Risk**
> Touches release CD for multi-arch Docker publishing and alters Docker build dependency installation behavior; failures could block image releases or mask install issues despite limited runtime code impact.
>
> **Overview**
> Improves the Docker publish GitHub Actions workflow by adding a manual `workflow_dispatch` trigger, switching to native runners for both `amd64` and `arm64`, and setting `fail-fast: false` so one-arch flakiness doesn’t cancel the other.
>
> Reduces CD image testing to a consistent lightweight target (`test_unstructured/partition/test_text.py` plus smoke tests) and hardens Docker builds by adding 3-attempt retry logic around `apk update/add` to mitigate transient Chainguard mirror failures. Bumps version to `0.19.3` and updates the changelog accordingly.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 559244d. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

0.18.32

Verified

feat: put pdfium call behind a threadlock (#4211)

[pdfium is not thread safe](https://groups.google.com/g/pdfium/c/HeZSsM_KEUk?pli=1), so this PR puts the call behind a thread lock.