I have a requirement to parse a large number of small unstructured files in near real-time inside Azure and load the parsed data into a SQL database. I chose plain Python because I don't think a Spark cluster or a big-data approach suits the volume and size of the source files, and the parsing logic has already been written. I am now looking at how to schedule this Python script using Azure PaaS, with the following options (a rough sketch of the current script follows the list, for context):
1. Azure Data Factory
2. Azure Databricks
3. Both 1 and 2
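
For context, here is a minimal sketch of what the existing plain-Python script does. The `parse_file` helper, file paths, table name, and connection string are placeholders, not the real code:

```python
import glob
import pyodbc

# Placeholder: the real parsing logic is already written; this only
# illustrates its shape (one small unstructured file -> a list of rows).
def parse_file(path: str) -> list[tuple]:
    with open(path, encoding="utf-8") as f:
        raw = f.read()
    # ... custom unstructured parsing happens here ...
    return [(path, len(raw))]  # dummy rows for illustration only

# Placeholder connection string for Azure SQL Database.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:<server>.database.windows.net,1433;"
    "Database=<db>;Uid=<user>;Pwd=<password>;Encrypt=yes;"
)
cursor = conn.cursor()
cursor.fast_executemany = True

# Parse each small file and bulk-insert the resulting rows.
rows = []
for path in glob.glob("/mnt/landing/*.txt"):
    rows.extend(parse_file(path))

cursor.executemany(
    "INSERT INTO dbo.ParsedData (source_file, payload_len) VALUES (?, ?)", rows
)
conn.commit()
conn.close()
```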
What are the implications of running a Python notebook activity from Azure Data Factory that points to Azure Databricks? Would I be able to fully leverage the cluster (driver and workers)?
Also, please advise whether you think the script has to be converted to PySpark for this use case to run well on Azure Databricks. My only hesitation is that the files are just a few KB each and are unstructured. If conversion were required, I imagine it would look roughly like the sketch below.
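
This is only my assumption of what a conversion might look like, not working code: `parse_file` is the same placeholder as above, the mount path and JDBC connection details are made up, and I am assuming the workers can reach the files through the `/dbfs` mount.

```python
import glob

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("small-file-parser").getOrCreate()

# Placeholder: stands in for the already-written per-file Python parser.
def parse_file(path: str) -> list[tuple]:
    with open(path, encoding="utf-8") as f:
        raw = f.read()
    return [(path, len(raw))]  # dummy rows for illustration only

# List the small files on the driver, then distribute the *paths* so each
# worker runs the existing Python parser against the mounted storage.
paths = glob.glob("/dbfs/mnt/landing/*.txt")
rows = spark.sparkContext.parallelize(paths).flatMap(parse_file)
df = rows.toDF(["source_file", "payload_len"])

# Write the parsed rows to Azure SQL over JDBC (connection details are placeholders).
(df.write
   .format("jdbc")
   .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>")
   .option("dbtable", "dbo.ParsedData")
   .option("user", "<user>")
   .option("password", "<password>")
   .mode("append")
   .save())
```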