Tags: Unstructured-IO/unstructured
fix: self-install pinned spaCy model at runtime with SHA256 verification (#4258)

## Summary
- Replace `en-core-web-sm` direct URL dependency in `pyproject.toml` with the `installer` library
- spaCy model is now downloaded and installed on first use with SHA256 hash verification
- Removes `[tool.uv.sources]` section, making the install more portable across package managers

## Test plan
- [ ] Verify `tokenize.py` downloads and installs the spaCy model on first use
- [ ] Verify SHA256 hash check rejects tampered wheels
- [ ] Verify existing NLP tokenization tests pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---

> [!NOTE]
> **Medium Risk**
> Introduces runtime download-and-install behavior into the NLP path and writes into `site-packages`, which can fail under restricted networking/permissions or in unusual multi-process environments despite locking and hash checks.
>
> **Overview**
> Updates NLP tokenization to **lazy-load and self-install** the pinned `en_core_web_sm` spaCy model on first use, downloading the wheel from GitHub and verifying it via **SHA256**, with a cross-process `FileLock` to avoid concurrent installs.
>
> Removes the `en-core-web-sm` wheel URL dependency and `[tool.uv.sources]` override, adding `installer` (for wheel installation) and `filelock` (for install locking) to dependencies; `uv.lock` is updated accordingly and the version is bumped to `0.21.2`.
>
> Adjusts the `Dockerfile` to trigger model installation during image build (via `uv run` importing `_get_nlp`) so the model is present before `HF_HUB_OFFLINE=1` is set.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit df62a9c. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Yao You <yao@unstructured.io>

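The SHA256 check described in #4258 can be illustrated with a minimal sketch using only the standard library. This is not the code from `tokenize.py`; the function names `sha256_of` and `verify_wheel` are hypothetical, and the download step, the `installer` call, and the `FileLock` around them are deliberately left out.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read fixed-size chunks so large wheels are never fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_wheel(path: Path, expected_sha256: str) -> None:
    """Raise ValueError if the file's digest differs from the pinned value.

    Hypothetical helper: the pinned digest would be a constant shipped
    with the package, so a tampered or truncated download is rejected
    before anything is installed.
    """
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(f"SHA256 mismatch for {path.name}: got {actual}")
```

Under these assumptions, a caller would run `verify_wheel(downloaded_path, PINNED_SHA256)` after the download and before installation, and treat a `ValueError` as a reason to delete the file and abort.
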
Fix: replace nltk with spacy CVE-2025-14009 (#4255)

Co-authored-by: Lawrence Elitzer <lawrence@unstructured.io>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>

fix: update dependencies (#4247)

- Resolve a lock issue on Windows with Python 3.13 (lack of library support): a few dependencies are required only on non-Windows systems, or on Windows with Python < 3.13
- Downgrade `wrapt` so it is compatible with the `opentelemetry-instrumentation-httpx` library

fix: remap parent id after hashing (#4245)

This PR addresses an issue where hashing element IDs loses the parent ID references. This happens when calling `partition_html`, where the partition process has already assigned parent IDs to elements based on the HTML structure before `apply_metadata` is called, i.e., before element ID hashing happens. The fix remaps parent IDs so that each parent reference still points to the same element after hashing.

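The remapping idea in #4245 can be sketched as follows. This is an illustrative reconstruction, not the library's implementation: the `Element` dataclass, the `hash_ids` function, and the choice of hash input are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned document element."""
    id: str
    text: str
    parent_id: Optional[str] = None


def hash_ids(elements: list[Element]) -> None:
    """Replace each id with a hash, then remap parent_id to the new ids."""
    old_to_new: dict[str, str] = {}
    # First pass: assign hashed ids and remember old -> new.
    for index, el in enumerate(elements):
        new_id = hashlib.sha256(f"{index}:{el.text}".encode()).hexdigest()[:32]
        old_to_new[el.id] = new_id
        el.id = new_id
    # Second pass: rewrite parent references through the mapping.
    # Unknown parent ids are left as-is rather than dropped.
    for el in elements:
        if el.parent_id is not None:
            el.parent_id = old_to_new.get(el.parent_id, el.parent_id)
```

Two passes are used so that a child listed before its parent is still remapped correctly; a single pass would miss any parent whose new ID has not been computed yet.
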
Use bigger runner to publish images (#4237)

> [!NOTE]
> **Low Risk**
> CI/workflow-only change plus a version/changelog bump; low risk beyond potential runner availability/cost differences.
>
> **Overview**
> Fixes CD Docker publishing failures by switching the `publish-images` GitHub Actions job runner from `ubuntu-latest` to `ubuntu-latest-m`, addressing "no space left on device" issues when pulling multi-arch images.
>
> Updates release metadata by bumping `unstructured/__version__.py` to `0.20.1` and adding a matching changelog entry describing the CI runner change.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit f6408c9. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>

Fix ARM64 image issues (#4233)

> [!NOTE]
> **Medium Risk**
> Touches release CD for multi-arch Docker publishing and alters Docker build dependency installation behavior; failures could block image releases or mask install issues despite limited runtime code impact.
>
> **Overview**
> Improves the Docker publish GitHub Actions workflow by adding a manual `workflow_dispatch` trigger, switching to native runners for both `amd64` and `arm64`, and setting `fail-fast: false` so one-arch flakiness doesn’t cancel the other.
>
> Reduces CD image testing to a consistent lightweight target (`test_unstructured/partition/test_text.py` plus smoke tests) and hardens Docker builds by adding 3-attempt retry logic around `apk update/add` to mitigate transient Chainguard mirror failures. Bumps version to `0.19.3` and updates the changelog accordingly.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 559244d. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>