
[feature] Add ArrowRecordReader for batch ingestion of Arrow IPC files#17871

Open
udaysagar2177 wants to merge 1 commit into apache:master from udaysagar2177:arrow

Conversation

@udaysagar2177

Summary

The Arrow plugin previously only had ArrowMessageDecoder for streaming ingestion (e.g. Kafka). This PR adds ArrowRecordReader implementing the RecordReader interface, enabling batch ingestion from Arrow IPC files, consistent with how other formats (Avro, JSON, ORC, Parquet, etc.) support both streaming and batch.

Changes:

  • Add ArrowRecordReader that reads Arrow IPC files using ArrowFileReader, iterating row-by-row across batches
  • Add ARROW to FileFormat enum and register in RecordReaderFactory
  • Generalize ArrowToGenericRowConverter to accept ArrowReader (parent of both ArrowStreamReader and ArrowFileReader) instead of ArrowStreamReader
  • Make convertSingleRow public with a reuse overload for row recycling
  • Add fieldsToRead filtering support to ArrowToGenericRowConverter
  • Add ArrowRecordReaderTest extending AbstractRecordReaderTest (10,000 random records across all Pinot field types, multi-batch writing, field filtering test)

Note: Arrow IPC file format requires seekable channels, so gzip compression is not supported (test overridden to skip).

Test plan

  • ArrowRecordReaderTest.testRecordReader - 10,000 random records with all Pinot SV/MV field types, verifies read + rewind
  • ArrowRecordReaderTest.testFieldsToReadFiltering - verifies only requested fields are extracted
  • ArrowRecordReaderTest.testGzipRecordReader - overridden (Arrow IPC doesn't support gzip)
  • All 16 existing ArrowMessageDecoderTest tests pass (no regressions from converter changes)
  • RecordReaderFactoryTest passes with new ARROW registration
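The fieldsToRead behavior exercised by the second test amounts to projecting each extracted row onto the requested columns. A minimal, hypothetical sketch of those semantics (`filter_fields` is not the PR's actual method name):

```python
def filter_fields(row: dict, fields_to_read) -> dict:
    """Keep only the requested fields of a row.

    Hypothetical stand-in for the converter's fieldsToRead filtering;
    a None/empty set is assumed to mean "read every field".
    """
    if not fields_to_read:
        return dict(row)
    return {k: v for k, v in row.items() if k in fields_to_read}


full_row = {"id": 1, "name": "a", "score": 2.5}
projected = filter_fields(full_row, {"id", "score"})
```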
@Jackie-Jiang Jackie-Jiang added the feature and ingestion (Related to data ingestion pipeline) labels Mar 20, 2026
@codecov-commenter

codecov-commenter commented Mar 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 63.22%. Comparing base (0f93d52) to head (0a519d2).
⚠️ Report is 184 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##             master   #17871    +/-   ##
============================================
- Coverage     63.25%   63.22%   -0.03%
+ Complexity     1499     1468      -31
============================================
  Files          3174     3190      +16
  Lines        190373   192198    +1825
  Branches      29089    29456     +367
============================================
+ Hits         120419   121520    +1101
- Misses        60610    61152     +542
- Partials       9344     9526     +182
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.20% <100.00%> (-0.02%) ⬇️
java-21 63.20% <100.00%> (-0.02%) ⬇️
temurin 63.22% <100.00%> (-0.03%) ⬇️
unittests 63.22% <100.00%> (-0.03%) ⬇️
unittests1 55.54% <100.00%> (-0.09%) ⬇️
unittests2 34.25% <100.00%> (+0.18%) ⬆️

Flags with carried forward coverage won't be shown.

@xiangfu0 xiangfu0 added the enhancement, json, and kafka labels and removed the feature label Mar 20, 2026

Labels

enhancement (Improvement to existing functionality), ingestion (Related to data ingestion pipeline), json (Related to JSON column support), kafka (Related to Kafka stream connector)

4 participants