Batch processing with Spark and Azure

Question

I work for an energy provider. We currently generate 1 GB of data per day in the form of flat files. We have decided to store the data in Azure Data Lake Store and want to run batch processing on it daily. What is the best way to transfer the flat files into Azure Data Lake Store? And once the data is in Azure, is it a good idea to process it with HDInsight Spark (for example with the DataFrame API or Spark SQL) and finally visualize it with Azure?

Using AzCopy to copy to Blob storage, and then code to move the data into Data Lake, would be an option if Data Lake doesn't provide something directly.
– Richard, Commented May 3, 2018 at 8:44
Yes, it is currently flat files. Actually, there is an ETL process that collects data from databases, transforms it, generates flat files at the end, and then puts them on the local file system.
– milad ahmadi, Commented May 3, 2018 at 9:40
Use the Azure Import/Export service for offline copy of data to Data Lake Store.
– Richard, Commented May 3, 2018 at 9:56

Hauke Mallow · Accepted Answer · 2018-05-06 20:42:38Z

For a daily load from a local file system I would recommend using Azure Data Factory version 2. You have to install integration runtimes on premises (more than one for high availability), and you have to consider several security topics (local firewalls, network connectivity, etc.). Detailed documentation can be found here. There are also some good tutorials available. With Azure Data Factory you can trigger your upload to Azure with a Get Metadata activity and use, e.g., an Azure Databricks Notebook activity for further Spark processing.
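To make the Spark step more concrete, here is a minimal PySpark sketch of what such a notebook or job might do once the files have landed in Data Lake Store. It is not taken from the answer: the `raw/yyyy/MM/dd/` folder layout, the `curated` output folder, the CSV format with a header row, and the `meter_id` and `kwh` columns are all assumptions made for illustration. It also assumes the cluster is already configured to access the store through the `adl://` scheme.

```python
from datetime import date


def daily_input_path(account: str, day: date, root: str = "raw") -> str:
    """Build the Data Lake Store folder URI for one day's files.

    The <root>/yyyy/MM/dd/ layout is an assumed convention, not a requirement.
    """
    return f"adl://{account}.azuredatalakestore.net/{root}/{day:%Y/%m/%d}/"


def run_daily_batch(account: str, day: date) -> None:
    # Imported lazily so the path helper can be used without a Spark install.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("daily-batch").getOrCreate()

    # Assumption: comma-separated flat files with a header row.
    readings = (
        spark.read.option("header", True)
        .option("inferSchema", True)
        .csv(daily_input_path(account, day))
    )

    # Hypothetical columns: meter_id, kwh. Replace with the real schema.
    readings.createOrReplaceTempView("readings")
    totals = spark.sql(
        "SELECT meter_id, SUM(kwh) AS total_kwh FROM readings GROUP BY meter_id"
    )

    # Write the aggregate as Parquet so reporting tools can read it efficiently.
    totals.write.mode("overwrite").parquet(
        daily_input_path(account, day, root="curated")
    )
```

The same aggregation could be written with the DataFrame API (`readings.groupBy("meter_id").sum("kwh")`). At 1 GB per day either form should work, so the choice is mostly a matter of taste.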
