edited Apr 14, 2016 at 9:46

While Spark supports limited predicate pushdown over JDBC, all other operations, such as limits, grouping, and aggregations, are performed internally. Unfortunately, this means that take(4) will fetch the data first and then apply the limit. In other words, your database will execute (assuming no projections and filters) something equivalent to:

SELECT * FROM table

and the rest will be handled by Spark. There may be some optimizations involved, but it is still an inefficient process.

If you want to push the limit to the database, you'll have to do it statically, using a subquery as the dbtable parameter:

(sqlContext.read.format('jdbc')
    .options(url='xxxx', dbtable='(SELECT * FROM xxx LIMIT 4) tmp', ....))

sqlContext.read.format("jdbc").options(Map(
  "url" -> "xxxx",
  "dbtable" -> "(SELECT * FROM xxx LIMIT 4) tmp"
))

Please note that an alias for the subquery is mandatory.
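If the limit has to vary between runs, the dbtable string can be built in plain Python before it is handed to the reader. The following is only a minimal sketch: `limited_dbtable` and `read_limited` are hypothetical helper names, not part of Spark, and it assumes your database understands `LIMIT n` (MySQL and PostgreSQL do, some other databases use a different syntax). The table name is pasted into the SQL as-is, so it should not come from untrusted input.

```python
def limited_dbtable(table, n, alias="tmp"):
    """Build a dbtable value so that the database applies the limit itself.

    Hypothetical helper for illustration; assumes a LIMIT-style dialect.
    """
    if isinstance(n, bool) or not isinstance(n, int) or n < 0:
        raise ValueError("n must be a non-negative integer")
    # The alias after the closing parenthesis is required, because Spark
    # places the whole dbtable value in the FROM clause of its own query.
    return "(SELECT * FROM {} LIMIT {}) {}".format(table, n, alias)


def read_limited(sqlContext, url, table, n):
    """Hypothetical wrapper: sqlContext is assumed to be an existing SQLContext."""
    return (sqlContext.read.format("jdbc")
            .options(url=url, dbtable=limited_dbtable(table, n))
            .load())


print(limited_dbtable("xxx", 4))
```

Only `limited_dbtable` runs without a Spark session; `read_limited` also needs a reachable database and a JDBC driver on the classpath.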

answered Mar 8, 2016 at 14:38

zero323
