Granary Dataset Processing (Component-Based)#135

Open
ssh-meister wants to merge 1 commit into main from granary

@ssh-meister (Collaborator) commented Jul 2, 2025

This pull request introduces a set of modular components for processing the Granary dataset.

🔧 General-Purpose Processors:

These processors are not specific to any single dataset and can be reused across different data pipelines:

  1. LambdaExpression processor (#136, "LambdaExpression processor implementation")
  2. SubRegex processor, which adds support for loading a list of regex substitution rules from a YAML file (#137, "SubRegex processor: substitution rules from an external YAML")
  3. ExtractTar and RemoveFiles processors (#139, "Add RemoveFiles and ExtractTar, reorganize audio converters")
  4. FasterWhisperInference, DetectWhisperHallucinationFeatures, vLLMInference, and CleanQwenGeneration processors (#141, "Refactor inference processes & add new engines (FasterWhisper, vLLM)")
  5. ListToEntries processor (#140, "ListToEntries processor")
  6. DropSpecifiedFields processor (#144, "DropSpecifiedFields processor implementation")
  7. CharacterHistogramLangValidator processor (#154, "CharacterHistogramLangValidator processor implementation")
  8. FastTextLangIdClassifier processor (#149, "FastTextLangIdClassifier processor implementation")
  9. CometoidWMTQualityEstimation processor (#151, "CometoidWMTQualityEstimation processor implementation")
  10. ConvertToTarredAudioDataset processor (#145, "ConvertToTarredAudioDataset processor implementation")
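
To illustrate the idea behind item 2 (regex substitution rules kept outside the code), here is a minimal sketch. It is not the SubRegex processor's actual API: the rule keys `pattern` and `repl`, the helper names `apply_rules` and `process_entry`, and the `text` field are all hypothetical. The rule list is written as the plain Python structure a YAML loader would typically return, so the example runs with the standard library only.

```python
import re

# Hypothetical rules, shaped like the list of mappings a YAML loader
# would return for a file of substitution rules. The key names
# "pattern" and "repl" are assumptions, not the processor's schema.
RULES = [
    {"pattern": r"\s+", "repl": " "},   # collapse runs of whitespace
    {"pattern": r"[“”]", "repl": '"'},  # normalize curly double quotes
]


def apply_rules(text, rules):
    """Apply each substitution rule in order and trim the result."""
    for rule in rules:
        text = re.sub(rule["pattern"], rule["repl"], text)
    return text.strip()


def process_entry(entry, rules, text_key="text"):
    """Return a copy of a manifest-style entry with its text field cleaned."""
    updated = dict(entry)
    updated[text_key] = apply_rules(entry[text_key], rules)
    return updated


if __name__ == "__main__":
    sample = {"text": "  “hello”   world ", "duration": 1.5}
    print(process_entry(sample, RULES))
```

Rules are applied in list order, so a rule that depends on earlier normalization should come later in the file.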

⛓️ Pipelines

  1. Unified pipeline and README with instructions and documentation (#155, "Granary large-scale speech processing pipeline")

@ssh-meister self-assigned this on Jul 23, 2025