
I work for an energy provider company. Currently we generate about 1 GB of data per day in the form of flat files. We have decided to use Azure Data Lake Store to hold the data, and we want to run batch processing on it on a daily basis. My question is: what is the best way to transfer the flat files into Azure Data Lake Store? And once the data has been pushed into Azure, is it a good idea to process it with HDInsight Spark (e.g. the DataFrame API or Spark SQL) and finally visualize it with Azure?
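
For context, something like the following minimal sketch is what I have in mind for the Spark part; the adl:// paths, the account name and the column names are just placeholders, not our real setup:

    from pyspark.sql import SparkSession

    # Placeholder ADLS path; replace account, folder and date with real values.
    input_path = "adl://<datalake-account>.azuredatalakestore.net/raw/2018-05-03/*.csv"

    spark = SparkSession.builder.appName("daily-batch").getOrCreate()

    # Read the day's flat files into a DataFrame (assuming CSV with a header row).
    df = (spark.read
              .option("header", "true")
              .option("inferSchema", "true")
              .csv(input_path))

    # Register the DataFrame so it can also be queried with Spark SQL.
    df.createOrReplaceTempView("readings")

    # Example aggregation; 'meter_id' and 'consumption_kwh' are made-up column names.
    daily_totals = spark.sql("""
        SELECT meter_id, SUM(consumption_kwh) AS total_kwh
        FROM readings
        GROUP BY meter_id
    """)

    # Write the result back to the lake, e.g. for later visualization.
    daily_totals.write.mode("overwrite").parquet(
        "adl://<datalake-account>.azuredatalakestore.net/curated/daily_totals/2018-05-03")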

  • Do you mean Azure Data Lake? Commented May 3, 2018 at 8:37
  • Using AzCopy to copy to Blob Storage and then code to transform the data into Data Lake would be an option if Data Lake doesn't provide something directly. Commented May 3, 2018 at 8:44
  • Where is your source data stored? A database? Flat files? Commented May 3, 2018 at 9:34
  • Yes, it is currently flat files. Actually, there is an ETL process that collects data from databases, transforms it and generates flat files, and then puts them in the local file system. Commented May 3, 2018 at 9:40
  • Use the Azure Import/Export service for an offline copy of data to Data Lake Store. Commented May 3, 2018 at 9:56

1 Answer


For a daily load from a local file system I would recommend using Azure Data Factory version 2. You have to install a self-hosted Integration Runtime on-premises (more than one for high availability), and you have to consider several security topics (local firewalls, network connectivity, etc.). Detailed documentation can be found here, and there are also some good tutorials available. With Azure Data Factory you can trigger your upload to Azure with a Get Metadata activity and use, for example, an Azure Databricks Notebook activity for further Spark processing.
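
For the Spark part, here is a rough sketch of what the notebook called by the Notebook activity could look like; the 'inputPath' parameter name, the CSV layout and the column names are assumptions for illustration only:

    from pyspark.sql import functions as F

    # 'spark' and 'dbutils' are provided by the Databricks runtime.
    # The Notebook activity can pass parameters as widgets; 'inputPath' is an
    # assumed parameter name pointing at the folder uploaded for the current day.
    input_path = dbutils.widgets.get("inputPath")

    # Read the uploaded flat files (assuming CSV with a header row).
    df = spark.read.option("header", "true").csv(input_path)

    # Example transformation with the DataFrame API; column names are made up.
    result = (df.groupBy("meter_id")
                .agg(F.sum("consumption_kwh").alias("total_kwh")))

    # Persist the aggregated output next to the input for downstream reporting.
    result.write.mode("overwrite").parquet(input_path.rstrip("/") + "_aggregated")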
