I am running Spark in cluster mode and reading data from an RDBMS via JDBC.
As per the Spark docs, these partitioning parameters describe how to partition the table when reading in parallel from multiple workers: partitionColumn, lowerBound, upperBound, and numPartitions:
These options must all be specified if any of them is specified. They describe how to partition the table when reading in parallel from multiple workers. partitionColumn must be a numeric column from the table in question. Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in table. So all rows in the table will be partitioned and returned. This option applies only to reading.
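To make the "partition stride" idea concrete, here is a minimal Python sketch of how the bounds can be turned into per-partition WHERE clauses. This is a simplification of what Spark does internally (in JDBCRelation.columnPartition), not the exact implementation; the column name and bound values are illustrative:

```python
def partition_where_clauses(column, lower_bound, upper_bound, num_partitions):
    """Sketch of how bounds become per-partition predicates (simplified;
    assumption: mirrors Spark's internal logic, not the exact code)."""
    # The bounds only determine the stride; they never filter rows.
    stride = upper_bound // num_partitions - lower_bound // num_partitions
    clauses = []
    current = lower_bound
    for i in range(num_partitions):
        l_bound = f"{column} >= {current}" if i > 0 else None
        current += stride
        u_bound = f"{column} < {current}" if i < num_partitions - 1 else None
        if l_bound is None:
            # The first partition also picks up NULLs and anything below
            # lowerBound, so all rows in the table are still returned.
            clauses.append(f"{u_bound} or {column} is null")
        elif u_bound is None:
            # The last partition is unbounded above, catching rows
            # at or beyond upperBound.
            clauses.append(l_bound)
        else:
            clauses.append(f"{l_bound} AND {u_bound}")
    return clauses

# Example: 4 partitions over a numeric column "id" with bounds [0, 100)
for clause in partition_where_clauses("id", 0, 100, 4):
    print(clause)
# id < 25 or id is null
# id >= 25 AND id < 50
# id >= 50 AND id < 75
# id >= 75
```

Each clause becomes one partition's query, which is why rows outside the bounds still show up: the first and last partitions are open-ended.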
These are, however, optional parameters. My question is: what happens if I don't specify them?
- Does only one worker read the whole data?
- If Spark still reads in parallel, how does it partition the data?