I am designing the flow below and want to know whether I am going about it the right way; I also want to drop any unnecessary steps if I have added them. I have Hadoop running with Spark as the execution engine.
- The pipeline has to ingest data in batch as well as streaming mode.
- The ingested data needs to land in HDFS (see the ingestion sketch after this list).
- Hive, with MySQL as the metastore database, to run Hive queries on the data.
- Complex ETL transformations are then performed with PySpark (see the second sketch below).
- The transformed data is loaded into an RDBMS.
- The data in the RDBMS is reported on using Apache Superset.
- The complete flow should be run by a scheduler; for that I am using Airflow (see the DAG sketch at the end). Can you please check and suggest if I am missing anything needed to make this flow robust?
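
For the ingestion-to-HDFS step, here is a minimal sketch of what I have in mind, assuming the streaming source is Kafka (not fixed yet) and using placeholder topic names and HDFS paths:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ingest_to_hdfs")
         .getOrCreate())

# Streaming leg: read from a Kafka topic (assumed source) and land raw
# records in HDFS as Parquet; topic name and paths are placeholders.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events_topic")
          .load()
          .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp"))

(events.writeStream
 .format("parquet")
 .option("path", "hdfs:///data/raw/events")
 .option("checkpointLocation", "hdfs:///checkpoints/events")
 .trigger(processingTime="1 minute")
 .start())

# Batch leg: plain reads of periodic extracts (e.g. CSV dumps) landed the same way.
daily = spark.read.option("header", "true").csv("hdfs:///landing/daily_extract/")
daily.write.mode("append").parquet("hdfs:///data/raw/daily")
```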
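
For the PySpark ETL and the load into the RDBMS, a rough sketch of what I plan (the table, columns, aggregation, and JDBC connection details are placeholders; MySQL is just an example target):

```python
from pyspark.sql import SparkSession, functions as F

# enableHiveSupport() lets Spark use the Hive metastore (MySQL-backed in my
# setup, configured via hive-site.xml) so the ETL can read/write Hive tables.
spark = (SparkSession.builder
         .appName("etl_and_load")
         .enableHiveSupport()
         .getOrCreate())

# Example transformation -- stand-in for the real ETL logic.
raw = spark.table("raw_db.events")
daily_summary = (raw
                 .withColumn("event_date", F.to_date("timestamp"))
                 .groupBy("event_date", "event_type")
                 .agg(F.count("*").alias("event_count")))

# Load the transformed result into the reporting RDBMS over JDBC;
# Superset would then connect to this database for dashboards.
(daily_summary.write
 .format("jdbc")
 .option("url", "jdbc:mysql://reporting-db:3306/analytics")
 .option("dbtable", "daily_summary")
 .option("user", "etl_user")
 .option("password", "etl_password")
 .mode("overwrite")
 .save())
```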
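
For the scheduling part, a sketch of the Airflow DAG I am thinking of, assuming Airflow 2.x with the Apache Spark provider installed; the DAG id, connection id, and script paths are placeholders. The streaming job would run continuously outside this schedule, so only the batch leg is orchestrated here:

```python
from datetime import datetime
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="batch_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Pull the batch extracts into HDFS.
    ingest = SparkSubmitOperator(
        task_id="ingest_batch_to_hdfs",
        application="/jobs/ingest_to_hdfs.py",
        conn_id="spark_default",
    )

    # Run the PySpark ETL and load the result into the RDBMS.
    transform_and_load = SparkSubmitOperator(
        task_id="etl_and_load_rdbms",
        application="/jobs/etl_and_load.py",
        conn_id="spark_default",
    )

    ingest >> transform_and_load
```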
