chore(docker): update nltk data download process to include unstructured download_nltk_packages #28876
Conversation
Summary of Changes

Hello @lework, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request optimizes the NLTK data download process within the Docker build for the API service.
Code Review
This pull request updates the Dockerfile to pre-download NLTK packages required by the unstructured library. This is a good optimization to prevent downloads at runtime. The change correctly uses the download_nltk_packages function from unstructured. My review includes a suggestion to simplify the command by removing redundant package download calls, which will improve the clarity and maintainability of the Dockerfile.
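The simplification suggested by the review could look like the following Dockerfile fragment. This is a sketch only: the NLTK_DATA path and layer layout are assumptions for illustration, not lines taken from this PR's diff.

```dockerfile
# Sketch: instead of listing individual nltk.download(...) calls, let
# unstructured's own helper fetch exactly the packages it needs, so the
# pre-downloaded data matches what the library looks for at runtime.
ENV NLTK_DATA=/usr/local/share/nltk_data
RUN python -c "from unstructured.nlp.tokenize import download_nltk_packages; download_nltk_packages()"
```

Because download_nltk_packages is the same function unstructured calls internally, the package names and target directory stay in sync with the library, which is the maintainability gain the review points at.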
Please create an issue and link it in the description.

Added.

@laipz8200 @QuantumGhost Please review.

@crazywoola This PR also resolves issue #30092
Add unstructured download_nltk_packages; fixes issue #28902.
Important
Fixes #<issue number>.

Summary
When using the document extractor, unstructured's NLTK files are automatically downloaded to the /home/dify/nltk_data/ directory (see https://github.com/Unstructured-IO/unstructured/blob/91a9888d35b645988529e99b539eee9b7ffd30e3/unstructured/nlp/tokenize.py#L29).

The packages downloaded by download_nltk_packages do not match those specified in the Dockerfile, causing the NLTK packages to be downloaded again during extraction. Using the download_nltk_packages method downloads them directly. After modifying the Dockerfile to download them in advance, unstructured will not download them again at runtime.
Screenshots
NLTK_DATA=/usr/local/share/nltk_data python -c "import nltk; nltk.download('punkt'); nltk.download('averaged_perceptron_tagger'); nltk.download('stopwords')" && NLTK_DATA=/usr/local/share/nltk_data python -c "import nltk; from unstructured.nlp.tokenize import download_nltk_packages; nltk.download('punkt'); nltk.download('averaged_perceptron_tagger'); nltk.download('stopwords'); download_nltk_packages()"

Checklist
I ran dev/reformat (backend) and cd web && npx lint-staged (frontend) to appease the lint gods