
I currently have the following Python code to read in a table from a local SQL Server db into Pandas:

import pandas as pd
import pyodbc

# Connect to DB
server = 'server'
db = 'db'
conn = pyodbc.connect('DRIVER={SQL SERVER}; SERVER=' + server + '; DATABASE=' + db + '; TRUSTED_CONNECTION=yes')
cursor = conn.cursor()
table = 'table'
df = pd.read_sql('Select * From ' + table, conn)

That code works, but now I would like to do the same thing in PySpark. What is the equivalent of this code in PySpark?

I have tried the following:

import os

import findspark
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.functions import *

# didn't know which of these would work so tried both
os.environ['SPARK_CLASSPATH'] = 'path/to/sqljdbc42.jar'
os.environ['driver-class-path'] = 'path/to/sqljdbc42.jar'

findspark.init('C:/spark/spark')

spark = SparkSession \
    .builder \
    .appName("SparkCoreTest") \
    .getOrCreate()
sc = spark.sparkContext
sqlctx = SQLContext(sc)

server = 'server'
db = 'db'
url = 'jdbc:sqlserver//' + server + ';databaseName=' + db
table = 'table'
properties = {'driver': 'com.microsoft.sqlserver.jdbc.SQLServerDriver'}
df = sqlctx.read.format('jdbc').options(url=url, dbtable=table, driver='{SQL SERVER}').load()

This gives java.lang.ClassNotFoundException: {SQL SERVER}. Throughout this process I've also gotten errors about not being able to find a "suitable driver," although I think I've fixed those by changing os.environ. Any help would be greatly appreciated!

  • You obviously mean "pyspark equivalent of pyodbc"... Commented Nov 28, 2017 at 15:49

1 Answer


You are quite close. The ClassNotFoundException means that Spark cannot locate the driver class: '{SQL SERVER}' is a pyodbc-style driver name, not a JDBC driver class, and the driver jar itself is not on Spark's classpath.

In order to use JDBC drivers for MySQL, SQL Server, etc., the jars must be in a folder that is accessible from both the master and the workers (if you run Spark locally, you are fine), and you need to tell Spark where the jars are, either with the --jars flag, like this:

spark-submit --jars /path/to/sqljdbc42.jar ...other params... yourscript.py
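The same flag also works if you are just experimenting in the interactive shell (a sketch; substitute your actual jar path):

pyspark --jars /path/to/sqljdbc42.jar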

or programmatically, when you build the session:

from pyspark import SparkConf
from pyspark.sql import SparkSession

# create a configuration instance and set the spark.jars option
conf = SparkConf()
conf.set('spark.jars', '/path/to/sqljdbc42.jar')

# give that configuration to the spark session
spark = SparkSession.builder \
    .config(conf=conf) \
    .appName(app_name) \
    .getOrCreate()
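If you do not need a full SparkConf object, the builder accepts the same option directly; a minimal equivalent sketch (the app name here is just a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config('spark.jars', '/path/to/sqljdbc42.jar') \
    .appName('SparkCoreTest') \
    .getOrCreate()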

Additionally, you should read from SQL Server through the SparkSession rather than through a SQLContext directly (although this depends on your Spark version):

df = spark.read.format('jdbc') \
    .options(url=url, dbtable=table, driver='com.microsoft.sqlserver.jdbc.SQLServerDriver') \
    .load()

(note that the driver option takes the JDBC driver class name, exactly as in the properties dict you already built, not the ODBC-style '{SQL SERVER}' string)
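Putting it all together, here is a minimal end-to-end sketch. Two things in it are worth flagging: your URL is missing the colon after jdbc:sqlserver (it should read jdbc:sqlserver://), and integratedSecurity=true is my assumption for the equivalent of TRUSTED_CONNECTION=yes (it requires the driver's native authentication DLL on Windows):

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.set('spark.jars', '/path/to/sqljdbc42.jar')

spark = SparkSession.builder \
    .config(conf=conf) \
    .appName('SparkCoreTest') \
    .getOrCreate()

server = 'server'
db = 'db'
# correct scheme is jdbc:sqlserver:// -- the version in the question lacks the colon
# integratedSecurity=true is an assumed stand-in for TRUSTED_CONNECTION=yes
url = 'jdbc:sqlserver://' + server + ';databaseName=' + db + ';integratedSecurity=true'

df = spark.read.format('jdbc') \
    .options(url=url, dbtable='table', driver='com.microsoft.sqlserver.jdbc.SQLServerDriver') \
    .load()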

(note: there is also a way to work with environment variables but I have not used it)
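For completeness, the environment variable people usually mean here is PYSPARK_SUBMIT_ARGS, which has to be set before the SparkSession is created; a sketch I have not verified myself:

import os

# must be set before building the session, and must end with 'pyspark-shell'
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /path/to/sqljdbc42.jar pyspark-shell'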

Hope this helps, good luck!
