I currently have the following Python code to read a table from a local SQL Server database into pandas:
import pandas as pd
import pyodbc

# Connect to the local SQL Server instance using Windows authentication
server = 'server'
db = 'db'
conn = pyodbc.connect(
    'DRIVER={SQL Server};SERVER=' + server +
    ';DATABASE=' + db + ';TRUSTED_CONNECTION=yes'
)
cursor = conn.cursor()

table = 'table'
df = pd.read_sql('SELECT * FROM ' + table, conn)

That code works, but now I would like to do the same thing in PySpark. What is the equivalent of this code in PySpark?
I have tried the following:
import os

import findspark
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.functions import *

# Didn't know which of these would work, so I tried both
os.environ['SPARK_CLASSPATH'] = 'path/to/sqljdbc42.jar'
os.environ['driver-class-path'] = 'path/to/sqljdbc42.jar'

findspark.init('C:/spark/spark')

spark = SparkSession \
    .builder \
    .appName("SparkCoreTest") \
    .getOrCreate()
sc = spark.sparkContext
sqlctx = SQLContext(sc)

server = 'server'
db = 'db'
url = 'jdbc:sqlserver//' + server + ';databaseName=' + db
table = 'table'
properties = {'driver': 'com.microsoft.sqlserver.jdbc.SQLServerDriver'}

df = sqlctx.read.format('jdbc').options(url=url, dbtable=table, driver='{SQL SERVER}').load()

This gives java.lang.ClassNotFoundException: {SQL SERVER}. Throughout this process I've also gotten errors about not being able to find a "suitable driver," although I think I've fixed those by changing os.environ. Any help would be greatly appreciated!
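For reference, here is my best guess at a corrected version, based on what I've pieced together from the JDBC docs. I suspect the driver option has to be the JDBC class name (com.microsoft.sqlserver.jdbc.SQLServerDriver) rather than the ODBC-style string, that the URL needs :// after jdbc:sqlserver, and that the jar is better passed through the session config than through os.environ. The integratedSecurity=true part is my attempt at matching TRUSTED_CONNECTION=yes; I believe it also requires sqljdbc_auth.dll to be on the system path. Untested sketch:

import findspark
findspark.init('C:/spark/spark')

from pyspark.sql import SparkSession

# Guess: point Spark at the driver jar via session config instead of os.environ
spark = SparkSession \
    .builder \
    .appName("SparkCoreTest") \
    .config('spark.driver.extraClassPath', 'path/to/sqljdbc42.jar') \
    .getOrCreate()

server = 'server'
db = 'db'
table = 'table'

# Guess: the URL scheme needs '://', and integratedSecurity=true should be
# the JDBC equivalent of TRUSTED_CONNECTION=yes (needs sqljdbc_auth.dll)
url = 'jdbc:sqlserver://' + server + ';databaseName=' + db + ';integratedSecurity=true'

# Guess: 'driver' must be the JDBC class name, not an ODBC driver string
df = spark.read \
    .format('jdbc') \
    .option('url', url) \
    .option('dbtable', table) \
    .option('driver', 'com.microsoft.sqlserver.jdbc.SQLServerDriver') \
    .load()

Is that the right direction, or is there a cleaner way to register the jar with Spark?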