
I am designing the flow below and want to know whether I am going about it the right way; I also want to drop any unnecessary steps I may have added. I have Hadoop with Spark as the execution engine.

  • The pipeline has to ingest batch data as well as streaming data.
  • The flow needs to pull the data and store it in HDFS.
  • Hive, with MySQL as the metastore, is used to run Hive queries.
  • Complex ETL operations are then performed using PySpark.
  • The transformed data is then loaded into an RDBMS.
  • Data from the RDBMS will be reported on using Apache Superset.
  • The complete flow should be run by a scheduler; I am using Airflow for that (a rough sketch of the batch-leg DAG is shown after this list). Can you please check and suggest whether I am missing something needed to make this flow robust?
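
Roughly, the batch leg of the Airflow DAG I have in mind looks like the sketch below. All task IDs, paths, scripts (`etl_job.py`, `load_to_rdbms.py`), and connection IDs are placeholders, not my real setup; the streaming leg would run as a long-lived job outside Airflow.

```python
# Sketch of the batch leg: ingest to HDFS -> PySpark ETL -> load into the RDBMS.
# All names, paths, and connection IDs below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="batch_pipeline",
    start_date=datetime(2022, 11, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Land the raw source extract in HDFS (replace with the actual ingestion command/tool)
    ingest_to_hdfs = BashOperator(
        task_id="ingest_to_hdfs",
        bash_command="hdfs dfs -put /staging/extract_{{ ds }}.csv /data/raw/",
    )

    # Run the PySpark ETL job against the raw data in HDFS
    transform = SparkSubmitOperator(
        task_id="pyspark_etl",
        application="/jobs/etl_job.py",          # placeholder PySpark script
        conn_id="spark_default",
        application_args=["--run-date", "{{ ds }}"],
    )

    # Load the transformed output into the reporting RDBMS
    load_to_rdbms = BashOperator(
        task_id="load_to_rdbms",
        bash_command="python /jobs/load_to_rdbms.py --run-date {{ ds }}",
    )

    ingest_to_hdfs >> transform >> load_to_rdbms
```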


  • Seems fine. Please clarify what problems you are running into. – Commented Nov 14, 2022 at 21:47

1 Answer


Use Debezium to pull changes from the RDBMS. All writes then end up in Kafka, and you don't end up with "batches" at all. (Sqoop is a retired Apache project.)
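
For illustration, registering a Debezium MySQL source connector against Kafka Connect's REST API might look roughly like this. Hostnames, credentials, and the table list are placeholders, and the property names follow the Debezium 1.x conventions (they change slightly in 2.x):

```python
# Rough sketch: register a Debezium MySQL source connector with Kafka Connect.
# Hostnames, credentials, and the table list are placeholders.
import json
import requests

connector = {
    "name": "mysql-source",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql.internal",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "***",
        "database.server.id": "184054",
        "database.server.name": "appdb",            # becomes the Kafka topic prefix
        "table.include.list": "appdb.orders,appdb.customers",
        "database.history.kafka.bootstrap.servers": "kafka:9092",
        "database.history.kafka.topic": "schema-changes.appdb",
    },
}

resp = requests.post(
    "http://kafka-connect:8083/connectors",
    data=json.dumps(connector),
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()
```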

Use Apache Pinot or Druid to ingest from Kafka directly. Then you don't need HDFS.
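
As a sketch, a Pinot REALTIME table that consumes the Debezium topic can be created by POSTing a table config to the Pinot controller. The table, schema, topic, and broker names below are illustrative, it assumes a matching Pinot schema has already been uploaded, and the exact config keys vary by Pinot version:

```python
# Rough sketch: create a Pinot REALTIME table that consumes the CDC topic.
# Assumes an "orders" schema already exists; all names are placeholders.
import json
import requests

table_config = {
    "tableName": "orders",
    "tableType": "REALTIME",
    "segmentsConfig": {
        "schemaName": "orders",
        "timeColumnName": "updated_at",
        "replication": "1",
    },
    "tenants": {},
    "tableIndexConfig": {
        "loadMode": "MMAP",
        "streamConfigs": {
            "streamType": "kafka",
            "stream.kafka.topic.name": "appdb.appdb.orders",
            "stream.kafka.broker.list": "kafka:9092",
            "stream.kafka.consumer.type": "lowlevel",
            "stream.kafka.consumer.factory.class.name":
                "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
            "stream.kafka.decoder.class.name":
                "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
        },
    },
    "metadata": {},
}

resp = requests.post(
    "http://pinot-controller:9000/tables",
    data=json.dumps(table_config),
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()
```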

You can query Pinot / Druid using SQL. Or you can use Presto in place of Hive/Spark SQL, and you should be able to connect Superset to Presto rather than to an intermediate RDBMS.
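
For example, querying Pinot with SQL from Python via the `pinotdb` DB-API client (the same driver Superset uses for its Pinot connection) looks roughly like this; the host, port, and table/column names are placeholders:

```python
# Rough sketch: run a SQL query against the Pinot broker via pinotdb.
from pinotdb import connect

conn = connect(host="pinot-broker", port=8099, path="/query/sql", scheme="http")
cursor = conn.cursor()
cursor.execute(
    "SELECT status, COUNT(*) AS cnt FROM orders GROUP BY status ORDER BY cnt DESC"
)
for row in cursor:
    print(row)
```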
