PySpark/DataBricks: How to read parquet files using 'file:///' and not 'dbfs'

Question

I am trying to use petastorm in a different manner which requires that I tell it where my parquet files are stored through one of the following:

hdfs://some_hdfs_cluster/user/yevgeni/parquet8, or file:///tmp/mydataset, or s3://bucket/mydataset, or gs://bucket/mydataset. Since I am on DataBricks and given other constraints, my option is to use the file:/// option.

However, I am at a loss as to how to specify the location of my parquet files. I keep getting an error saying that the path does not exist.

Here is what I am doing:

# save spark df to parquet
dbutils.fs.rm('dbfs:/mnt/team01/assembled_train.parquet', recurse=True)
assembled_train.write.parquet('dbfs:/mnt/team01/assembled_train')

# look at files
display(dbutils.fs.ls('mnt/team01/assembled_train/'))

# results
path  name  size
dbfs:/mnt/team01/assembled_train/_SUCCESS  _SUCCESS  0
dbfs:/mnt/team01/assembled_train/_committed_2150262571233317067  _committed_2150262571233317067  856
dbfs:/mnt/team01/assembled_train/_started_2150262571233317067  _started_2150262571233317067  0
dbfs:/mnt/team01/assembled_train/part-00000-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035357-1-c000.snappy.parquet  part-00000-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035357-1-c000.snappy.parquet  578991
dbfs:/mnt/team01/assembled_train/part-00001-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035358-1-c000.snappy.parquet  part-00001-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035358-1-c000.snappy.parquet  579640
dbfs:/mnt/team01/assembled_train/part-00002-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035359-1-c000.snappy.parquet  part-00002-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035359-1-c000.snappy.parquet  580675
dbfs:/mnt/team01/assembled_train/part-00003-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035360-1-c000.snappy.parquet  part-00003-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035360-1-c000.snappy.parquet  579483
dbfs:/mnt/team01/assembled_train/part-00004-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035361-1-c000.snappy.parquet  part-00004-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035361-1-c000.snappy.parquet  578807
dbfs:/mnt/team01/assembled_train/part-00005-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035362-1-c000.snappy.parquet  part-00005-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035362-1-c000.snappy.parquet  580942
dbfs:/mnt/team01/assembled_train/part-00006-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035363-1-c000.snappy.parquet  part-00006-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035363-1-c000.snappy.parquet  579202
dbfs:/mnt/team01/assembled_train/part-00007-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035364-1-c000.snappy.parquet  part-00007-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035364-1-c000.snappy.parquet  579810

While testing with a basic dataframe load from the file structure, like so:

df1 = spark.read.option("header", "true").parquet('file:///mnt/team01/assembled_train/part-00000-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035357-1-c000.snappy.parquet')

I get a "file does not exist" error.

Just wondering what would happen if you remove 'file://' or use 'dbfs:/'?
– mck, Commented Nov 29, 2020 at 16:35

mck · Accepted Answer · 2020-11-29 16:30:06Z

You just need to specify the path as it is, no need for 'file:///':

df1 = spark.read.option("header", "true").parquet('/mnt/team01/assembled_train/part-00000-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035357-1-c000.snappy.parquet')

If this doesn't work, try the methods in https://docs.databricks.com/applications/machine-learning/load-data/petastorm.html#configure-cache-directory
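For the 'file:///' form that Petastorm expects, one approach worth trying is to go through the local /dbfs mount point rather than the DBFS root. The sketch below assumes the cluster exposes DBFS on the local filesystem under /dbfs (the FUSE mount), so that 'dbfs:/mnt/...' is visible locally as '/dbfs/mnt/...'. The helper name dbfs_to_file_url is hypothetical and only illustrates the string mapping; it does not check that the path exists.

```python
def dbfs_to_file_url(path: str) -> str:
    """Map a DBFS path to a file:// URL via the local /dbfs mount.

    Assumption: DBFS is mounted on the local filesystem at /dbfs.
    This only rewrites the string; it does not touch the filesystem.
    """
    # Drop the 'dbfs:' scheme if present.
    if path.startswith("dbfs:"):
        path = path[len("dbfs:"):]
    # Normalize to exactly one leading slash (also handles 'mnt/...').
    path = "/" + path.lstrip("/")
    # 'file://' + '/dbfs' + '/mnt/...' gives 'file:///dbfs/mnt/...'.
    return "file:///dbfs" + path


if __name__ == "__main__":
    print(dbfs_to_file_url("dbfs:/mnt/team01/assembled_train"))
```

The resulting URL (here 'file:///dbfs/mnt/team01/assembled_train') is the kind of value that could then be passed to Petastorm as the dataset URL, for example to make_batch_reader. Whether that works depends on the cluster configuration, so treat it as something to test rather than a guaranteed fix.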

Thanks, that does work for that purpose but Petastorm still does not like it. The link you gave me, in addition to your comment, does help me get a bit further and I appreciate it greatly!
