user10938362

While Spark supports limited predicate pushdown over JDBC, all other operations, such as limit, group, and aggregations, are performed internally. Unfortunately this means that take(4) will fetch the data first and then apply the limit. In other words, your database will execute (assuming no projections and filters) something equivalent to:

SELECT * FROM table 

and the rest will be handled by Spark. There are some optimizations involved (in particular, Spark evaluates partitions iteratively to obtain the number of records requested by LIMIT), but it is still quite an inefficient process compared to database-side optimizations.
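To make that iterative evaluation concrete, here is a rough pure-Python sketch of the idea (this is not Spark's actual implementation; the 4x scale-up mirrors what `spark.sql.limit.scaleUpFactor` defaults to):

```python
# Sketch only (not Spark's code): take(n) scans one partition first,
# then scales up the number of partitions scanned each round until
# n rows have been collected.
def iterative_take(partitions, n, scale_up=4):
    rows, scanned, to_scan = [], 0, 1
    while len(rows) < n and scanned < len(partitions):
        for part in partitions[scanned:scanned + to_scan]:
            rows.extend(part)
        scanned += to_scan
        to_scan *= scale_up
    return rows[:n]
```

Note that even when only the first partition is touched, each partition that is read was still populated by the full database-side scan.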

If you want to push the limit down to the database, you'll have to do it statically, using a subquery as the dbtable parameter:

(sqlContext.read.format("jdbc")
    .options(url="xxxx", dbtable="(SELECT * FROM xxx LIMIT 4) tmp", ....))

sqlContext.read.format("jdbc").options(Map(
  "url" -> "xxxx",
  "dbtable" -> "(SELECT * FROM xxx LIMIT 4) tmp"))

Please note that the alias in the subquery is mandatory.
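As a minimal illustration of this pattern, the subquery string can be built with a small helper (the function name is hypothetical, not part of Spark's API), with the mandatory alias included:

```python
# Hypothetical helper: builds a dbtable value that pushes a LIMIT
# into the database, including the mandatory subquery alias.
def limited_dbtable(table, n, alias="tmp"):
    return "(SELECT * FROM {} LIMIT {}) {}".format(table, n, alias)

# The result is what you would pass as the dbtable option, e.g.:
# sqlContext.read.format("jdbc").options(
#     url="xxxx", dbtable=limited_dbtable("xxx", 4))
```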

Note:

This behavior may be improved in the future, once Data Source API v2 is ready.

While Spark supports a limited predicate pushdown over JDBC all other operations, like limit, group, aggregations are performed internally. Unfortunately it means that take(4) will fetch data first and then apply the limit. In other words your database will execute (assuming no projections an filters) something equivalent to:

SELECT * FROM table 

and the rest will handled by Spark. There are some optimizations involved (in particular Spark evaluates partitions iteratively to obtain number of records requested by LIMIT) but it still quite inefficient process compared to database-side optimizations.

If you want to push limit to the database you'll have to do it statically using subquery as a dbtable parameter:

(sqlContext.read.format('jdbc') .options(url='xxxx', dbtable='(SELECT * FROM xxx LIMIT 4) tmp', ....)) 
sqlContext.read.format("jdbc").options(Map( "url" -> "xxxx", "dbtable" -> "(SELECT * FROM xxx LIMIT 4) tmp", )) 

Please note that an alias in subquery is mandatory.

Note:

This behavior may be improved in the future, once Data Source API v2 is ready:

While Spark supports a limited predicate pushdown over JDBC all other operations, like limit, group, aggregations are performed internally. Unfortunately it means that take(4) will fetch data first and then apply the limit. In other words your database will execute (assuming no projections an filters) something equivalent to:

SELECT * FROM table 

and the rest will handled by Spark. There are some optimizations involved (in particular Spark evaluates partitions iteratively to obtain number of records requested by LIMIT) but it still quite inefficient process compared to database-side optimizations.

If you want to push limit to the database you'll have to do it statically using subquery as a dbtable parameter:

(sqlContext.read.format('jdbc') .options(url='xxxx', dbtable='(SELECT * FROM xxx LIMIT 4) tmp', ....)) 
sqlContext.read.format("jdbc").options(Map( "url" -> "xxxx", "dbtable" -> "(SELECT * FROM xxx LIMIT 4) tmp", )) 

Please note that an alias in subquery is mandatory.

Note:

This behavior may be improved in the future, once Data Source API v2 is ready:

added 281 characters in body
Source Link
zero323