I need to process several thousand small log files.
I opted for Databricks to handle this problem because it has good parallel computing capabilities and integrates nicely with the Azure Blob Storage account where the files are hosted.
After some research, I keep finding the same snippet of code (in PySpark):
```python
# Get your list of files with a custom function
list_of_files = get_my_files()

# Create a path RDD and use a custom function to parse each file
path_rdd = sc.parallelize(list_of_files)
content = path_rdd.map(parse_udf).collect()
```

Is there a better method to do this? Would you opt for a `flatMap` if the log files are in CSV format?
Thank you!
Use Spark's built-in CSV reader instead of parsing every file with a UDF — this way you leverage Spark's internal parallel processing:

```python
df = spark.read.format("csv").option("header", "true").load("cars_data/")
```

If the files are laid out in a partitioned folder structure (e.g. `year=2021/month=01/day=05/`), Spark's partition discovery will automatically add `year`, `month` and `day` as columns, which you can use for filtering — that will certainly give you a performance gain. You can also read nested folders with wildcards:

```python
spark.read.csv("location/*/*/")
```