📢 Spark NLP 6.2.0: A new stage for unstructured document ingestion and processing at scale
Spark NLP 6.2.0 introduces key upgrades across entity extraction, document normalization, HTML reading, and GGUF-based models. To recap, since the Spark NLP 6.1 releases you can:
Run inference with quantized, cutting-edge LLMs and VLMs such as Gemma 3, Phi-4, Llama 3.1, and Qwen 2.5
Rerank documents using llama.cpp with AutoGGUFReranker
Ingest unstructured documents of diverse formats
Reader2Doc: streamlines the process of loading and integrating diverse file formats (PDFs, Word, Excel, PowerPoint, HTML, Text, Email, Markdown) directly into Spark NLP pipelines with a unified and flexible interface.
Reader2Table: streamlines tabular data extraction from multiple document formats with seamless pipeline integration.
Reader2Image: extracts structured image content from various document types.
Spark NLP release 6.2.0 further focuses on automation, structure-awareness, and resource efficiency, making pipelines easier to configure, manage, and extend.
🔥 Highlights
Auto Modes for EntityRuler and DocumentNormalizer: automatic regex and text-cleaning presets for faster setup.
Hierarchical Element Tracking in HTMLReader: adds element and parent identifiers for structure-aware document processing.
Resource Management for AutoGGUF Annotators: improved control and cleanup of llama.cpp-based models.
🚀 New Features & Enhancements
EntityRulerModel and DocumentNormalizer Auto Modes
EntityRulerModel
Added autoMode parameter to enable predefined regex entity groups ("network_entities", "communication_entities", "media_entities", "email_entities", "all_entities").
Added extractEntities parameter to filter entities within auto modes.
Automatically applies case-insensitive regex presets when autoMode is set, and falls back to manual mode when it is not.
Retains full backward compatibility with JSON or RocksDB-based rules.
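To make the idea of a preset group plus an entity filter concrete, here is a minimal pure-Python sketch. It does not call Spark NLP: the `PRESETS` table, the `extract` helper, the entity labels, and the regex patterns are all hypothetical stand-ins, and the patterns actually shipped with EntityRulerModel may differ.

```python
import re

# Hypothetical preset groups keyed by auto-mode name.
# The real patterns bundled with Spark NLP may differ.
PRESETS = {
    "email_entities": {
        "EMAIL": r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
    },
    "network_entities": {
        "IPV4": r"\b(?:\d{1,3}\.){3}\d{1,3}\b",
        "URL": r"https?://[^\s]+",
    },
}


def extract(text, auto_mode, extract_entities=None):
    """Return (label, match) pairs for one preset group.

    extract_entities, when given, keeps only the listed labels,
    mirroring the role the notes describe for extractEntities.
    """
    results = []
    for label, pattern in PRESETS[auto_mode].items():
        if extract_entities is not None and label not in extract_entities:
            continue
        # Presets are applied case-insensitively, as the notes describe.
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            results.append((label, match.group(0)))
    return results


text = "Ping 10.0.0.1 or visit http://example.com for details."
print(extract(text, "network_entities"))
print(extract(text, "network_entities", extract_entities=["IPV4"]))
```

The point of the design is that a caller names a group instead of writing patterns, and narrows the output by label rather than by editing the rules.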
DocumentNormalizer
Added presetPattern and autoMode parameters to apply built-in text cleaning patterns.
New modes include "light_clean", "document_clean", "social_clean", "html_clean", and "full_auto".
Enables quick application of multiple cleaning operations without manual configuration.
Together, these additions significantly reduce boilerplate setup for common text extraction and normalization workflows.
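The cleaning presets can be pictured as named lists of substitutions applied in order. The sketch below is plain Python, not the DocumentNormalizer API: `PRESET_STEPS`, `normalize`, and the rules attached to each mode name are assumptions made for illustration, and the built-in presets may perform different or additional steps.

```python
import re

# Hypothetical (pattern, replacement) steps per preset name.
# Spark NLP's built-in cleaning rules may differ.
PRESET_STEPS = {
    "light_clean": [(r"\s+", " ")],
    "html_clean": [(r"<[^>]+>", " "), (r"\s+", " ")],
}


def normalize(text, auto_mode):
    """Apply every step of the chosen preset in order, then trim the ends."""
    for pattern, replacement in PRESET_STEPS[auto_mode]:
        text = re.sub(pattern, replacement, text)
    return text.strip()


print(normalize("<p>Hello   <b>world</b></p>", "html_clean"))
print(normalize("a  b\n c", "light_clean"))
```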
Hierarchical Element Identification in HTMLReader
Introduced element_id and parent_id metadata fields for each parsed HTML element.
Enables explicit structural relationships (e.g., title → paragraph → link) for hierarchical retrieval and contextual reasoning.
Supports graph-based indexing, hybrid search, and multi-level document analysis.
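As a rough illustration of what the two identifiers enable, the following sketch rebuilds parent/child links from a flat list of elements and walks from a link back up to its title. The sample records, the `type` and `text` keys, the ID values, and the use of `None` for a root's parent are assumptions; only the `element_id` and `parent_id` field names come from the notes above.

```python
from collections import defaultdict

# Hypothetical flattened reader output: one record per parsed HTML element.
elements = [
    {"element_id": "e1", "parent_id": None, "type": "title", "text": "Release notes"},
    {"element_id": "e2", "parent_id": "e1", "type": "paragraph", "text": "See the docs."},
    {"element_id": "e3", "parent_id": "e2", "type": "link", "text": "docs"},
]

# Index elements by ID and group child IDs under each parent ID.
by_id = {e["element_id"]: e for e in elements}
children = defaultdict(list)
for e in elements:
    children[e["parent_id"]].append(e["element_id"])


def ancestors(element_id):
    """Follow parent_id links upward; return element types from root to element."""
    path = []
    current = by_id.get(element_id)
    while current is not None:
        path.append(current["type"])
        current = by_id.get(current["parent_id"])
    return list(reversed(path))


print(" → ".join(ancestors("e3")))
```

The same `children` index could feed a graph store or let a retriever return a paragraph together with the title it sits under.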
AutoGGUF Annotator Enhancements
For AutoGGUFModel, AutoGGUFVision, AutoGGUFEmbeddings, and AutoGGUFReranker:
Added close() method to explicitly release llama.cpp model resources, preventing memory retention in long-running sessions.
Added setRemoveThinkingTag(tag: String) parameter to remove internal <think>...</think> sections from model outputs.
Tag sections are matched with the regex (?s)<$tag>.+?</$tag>.
🐛 Bug Fixes
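Referring back to the thinking-tag removal described for the AutoGGUF annotators, the pattern (?s)<$tag>.+?</$tag> can be tried out in plain Python. This is a sketch of the regex's behaviour only: `remove_thinking` is a hypothetical helper, and how the annotator treats surrounding whitespace is not stated in these notes.

```python
import re


def remove_thinking(text, tag="think"):
    """Delete every <tag>...</tag> section from text.

    (?s) lets '.' cross newlines so multi-line sections match;
    the lazy '.+?' stops at the first closing tag, so separate
    sections are removed one by one instead of as a single span.
    """
    pattern = re.compile(r"(?s)<{0}>.+?</{0}>".format(re.escape(tag)))
    return pattern.sub("", text)


raw = "<think>\nThe user wants a sum.\n</think>The answer is 4."
print(remove_thinking(raw))
```

Because the quantifier is `.+?` rather than `.*?`, an empty pair such as `<think></think>` would be left in place by this pattern.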
❤️ Community Support
💻 Installation
Python
Spark Packages
CPU
GPU
Apple Silicon
AArch64
Maven
FAT JARs
What's Changed
Full Changelog: 6.1.5...6.2.0
This discussion was created from the release 6.2.0.