
I am designing the flow below and want to know whether I am going about it the right way; I also want to drop any unnecessary steps I may have added. I have Hadoop with Spark as the execution engine.

  • The pipeline has to ingest batch data as well as streaming data.
  • The flow needs to pull the data and store it in HDFS.
  • Hive, with MySQL as the metastore, is used to run Hive queries.
  • Complex ETL operations are then performed using PySpark.
  • The transformed data is then loaded into an RDBMS.
  • Data from the RDBMS will be reported on using Apache Superset.
  • The complete flow should be run by a scheduler; I am using Airflow for that (a rough sketch of the batch-leg DAG is shown after this list). Can you please check and suggest whether I am missing something needed to make this flow robust?
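
Roughly, the batch leg of the Airflow DAG I have in mind looks like the sketch below. All task IDs, paths, scripts (`etl_job.py`, `load_to_rdbms.py`), and connection IDs are placeholders, not my real setup; the streaming leg would run as a long-lived job outside Airflow.

```python
# Sketch of the batch leg: ingest to HDFS -> PySpark ETL -> load into the RDBMS.
# All names, paths, and connection IDs below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="batch_pipeline",
    start_date=datetime(2022, 11, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Land the raw source extract in HDFS (replace with the actual ingestion command/tool)
    ingest_to_hdfs = BashOperator(
        task_id="ingest_to_hdfs",
        bash_command="hdfs dfs -put /staging/extract_{{ ds }}.csv /data/raw/",
    )

    # Run the PySpark ETL job against the raw data in HDFS
    transform = SparkSubmitOperator(
        task_id="pyspark_etl",
        application="/jobs/etl_job.py",          # placeholder PySpark script
        conn_id="spark_default",
        application_args=["--run-date", "{{ ds }}"],
    )

    # Load the transformed output into the reporting RDBMS
    load_to_rdbms = BashOperator(
        task_id="load_to_rdbms",
        bash_command="python /jobs/load_to_rdbms.py --run-date {{ ds }}",
    )

    ingest_to_hdfs >> transform >> load_to_rdbms
```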


  • Seems fine. Please clarify what problems you are running into. – Commented Nov 14, 2022 at 21:47

1 Answer


Use Debezium to pull changes from the RDBMS. All writes then end up in Kafka, and you don't end up with "batches" at all. (Sqoop is a retired Apache project.)
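
For illustration, registering a Debezium MySQL source connector against Kafka Connect's REST API might look roughly like this. Hostnames, credentials, and the table list are placeholders, and the property names follow the Debezium 1.x conventions (they change slightly in 2.x):

```python
# Rough sketch: register a Debezium MySQL source connector with Kafka Connect.
# Hostnames, credentials, and the table list are placeholders.
import json
import requests

connector = {
    "name": "mysql-source",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql.internal",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "***",
        "database.server.id": "184054",
        "database.server.name": "appdb",            # becomes the Kafka topic prefix
        "table.include.list": "appdb.orders,appdb.customers",
        "database.history.kafka.bootstrap.servers": "kafka:9092",
        "database.history.kafka.topic": "schema-changes.appdb",
    },
}

resp = requests.post(
    "http://kafka-connect:8083/connectors",
    data=json.dumps(connector),
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()
```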

Use Apache Pinot or Druid to ingest from Kafka directly. Then you don't need HDFS.
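
As a sketch, a Pinot REALTIME table that consumes the Debezium topic can be created by POSTing a table config to the Pinot controller. The table, schema, topic, and broker names below are illustrative, it assumes a matching Pinot schema has already been uploaded, and the exact config keys vary by Pinot version:

```python
# Rough sketch: create a Pinot REALTIME table that consumes the CDC topic.
# Assumes an "orders" schema already exists; all names are placeholders.
import json
import requests

table_config = {
    "tableName": "orders",
    "tableType": "REALTIME",
    "segmentsConfig": {
        "schemaName": "orders",
        "timeColumnName": "updated_at",
        "replication": "1",
    },
    "tenants": {},
    "tableIndexConfig": {
        "loadMode": "MMAP",
        "streamConfigs": {
            "streamType": "kafka",
            "stream.kafka.topic.name": "appdb.appdb.orders",
            "stream.kafka.broker.list": "kafka:9092",
            "stream.kafka.consumer.type": "lowlevel",
            "stream.kafka.consumer.factory.class.name":
                "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
            "stream.kafka.decoder.class.name":
                "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
        },
    },
    "metadata": {},
}

resp = requests.post(
    "http://pinot-controller:9000/tables",
    data=json.dumps(table_config),
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()
```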

You can query Pinot / Druid using SQL. Or you can use Presto in place of Hive/Spark SQL, and you should be able to connect Superset to Presto rather than to an intermediate RDBMS.
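
For example, querying Pinot with SQL from Python via the `pinotdb` DB-API client (the same driver Superset uses for its Pinot connection) looks roughly like this; the host, port, and table/column names are placeholders:

```python
# Rough sketch: run a SQL query against the Pinot broker via pinotdb.
from pinotdb import connect

conn = connect(host="pinot-broker", port=8099, path="/query/sql", scheme="http")
cursor = conn.cursor()
cursor.execute(
    "SELECT status, COUNT(*) AS cnt FROM orders GROUP BY status ORDER BY cnt DESC"
)
for row in cursor:
    print(row)
```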
