
For the following generic SQL:

 showTablesSql = """SELECT table_catalog,table_schema,table_name FROM information_schema.tables ORDER BY table_schema,table_name""" 

When it is submitted via Spark JDBC to PostgreSQL, the following exception occurs:

py4j.protocol.Py4JJavaError: An error occurred while calling o34.load.
: org.postgresql.util.PSQLException: ERROR: syntax error at or near "SELECT"
  Position: 15
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2578)

Here is the code being used:

url = f"jdbc:postgresql://{c['db.host']}/{c['db.name']}?user={c['db.user']}&password={c['db.password']}"
print(url)
empDF = spark.read \
    .format("jdbc") \
    .option("url", url) \
    .option("dbtable", showTablesSql) \
    .option("user", c['db.user']) \
    .option("password", c['db.password']) \
    .load()

Here are the stack trace details:

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
jdbc:postgresql://localhost/bluej?user=bluej&password=mypassword
Traceback (most recent call last):
  File "/git/bluej/fusion/python/pointr/bluej/util/sparkmgr.py", line 37, in <module>
    tab = readTab(db, tname)
  File "/git/bluej/fusion/python/pointr/bluej/util/sparkmgr.py", line 23, in readTab
    empDF = spark.read \
  File "/shared/spark3/python/pyspark/sql/readwriter.py", line 166, in load
    return self._df(self._jreader.load())
  File "/shared/spark3/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1285, in __call__
  File "/shared/spark3/python/pyspark/sql/utils.py", line 98, in deco
    return f(*a, **kw)
  File "/shared/spark3/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o34.load.
: org.postgresql.util.PSQLException: ERROR: syntax error at or near "SELECT"
  Position: 15
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2578)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2313)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:331)
    at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:448)
    at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:369)
    at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:159)
    at org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:109)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:61)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:339)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:240)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:229)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:229)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:179)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.base/java.lang.Thread.run(Thread.java:834)
  • Is it possible to set log_statement = all temporarily and find out what exact query got sent to Postgres? It seems odd that there is a syntax error at position 15. Commented Mar 20, 2020 at 23:13
  • Yes, that position is in the middle of a table name. I'll try that; restarting the pg server now. Commented Mar 20, 2020 at 23:14
  • Actually, you probably won't even need to set log_statement = all: the log_min_error_statement default should log the query for you. Just look in your Postgres logs and find out what actual query was received. Commented Mar 20, 2020 at 23:16
  • I restarted the db with the higher logging, and there are entries in there from the startup, but no entries for the above queries that I just reran a couple of times. Any ideas why they would not generate log entries? Commented Mar 20, 2020 at 23:23
  • I believe you should write the subquery in parentheses, "(SELECT ... )", as you would in a SQL FROM clause. Commented Mar 20, 2020 at 23:33

1 Answer


In a comment, @BjarniRagnarsson alluded to the dbtable value actually needing to be a subquery. I found some background on this from the esteemed @zero323:

https://stackoverflow.com/a/32629170/1056563

Since dbtable is used as the source for the SELECT statement, it has to be in a form that would be valid in a normal SQL query. If you want to use a subquery, you should pass the query in parentheses and provide an alias:

USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:postgresql:dbserver",
  dbtable "(SELECT * FROM mytable) tmp"
);
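Incidentally, this also explains the otherwise puzzling "Position: 15" in the error. Judging from JDBCRDD.resolveTable in the stack trace, Spark resolves the schema by wrapping the dbtable value in a probe query of roughly the form SELECT * FROM <dbtable> WHERE 1=0 (a simplified assumption about Spark's internals, not verbatim from its source). A small sketch:

```python
# Simplified sketch of the schema-resolution probe Spark sends to Postgres,
# built from the unparenthesized query passed as "dbtable":
dbtable = "SELECT table_catalog,table_schema,table_name FROM information_schema.tables"
probe = f"SELECT * FROM {dbtable} WHERE 1=0"

# "SELECT * FROM " is 14 characters long, so the nested SELECT that Postgres
# rejects begins at 1-based position 15 -- matching the reported error.
assert probe.index(dbtable) + 1 == 15
```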

After rewriting the SQL as a parenthesized subquery with an alias, the statement is parsed properly. No data is coming back yet, but that is likely a separate issue.
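Applied to the code from the question, the fix is a one-line change: wrap the SQL in parentheses and add an alias before passing it as dbtable. The spark.read call is left commented out since it needs a live SparkSession and a reachable Postgres; the alias name tables_sq is arbitrary.

```python
showTablesSql = (
    "SELECT table_catalog, table_schema, table_name "
    "FROM information_schema.tables "
    "ORDER BY table_schema, table_name"
)

# A valid "dbtable" value: a parenthesized subquery followed by an alias.
dbtable = f"({showTablesSql}) tables_sq"
assert dbtable.startswith("(SELECT") and dbtable.endswith(") tables_sq")

# With a live SparkSession and the connection config `c` from the question:
# empDF = (spark.read.format("jdbc")
#          .option("url", url)
#          .option("dbtable", dbtable)
#          .option("user", c['db.user'])
#          .option("password", c['db.password'])
#          .load())
```

On Spark 2.4 and later there is also a `query` option (`.option("query", showTablesSql)`) that performs this wrapping internally, so the manual parentheses and alias are not needed there.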


