
I have what I consider to be a pretty simple requirement.

I want to create a job that takes one file, transforms it into another file, and then updates the Data Catalog metadata within Glue. This would allow another job to then pick up the new data source and consume it using Glue/EMR/Athena.

Now, I can do the transform without any issues, but for the life of me I cannot work out how to create the table within Glue other than by using a crawler, the console, or the Glue API. I would prefer to do this inside the job so that I can just call the next job rather than execute a crawler and wait for it to complete.

The issue with the Glue API is that I would also have to convert the Spark schema into the layout the API expects.
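
For context, here is a rough sketch of what that conversion involves, assuming a boto3 client and reusing the dataframe and parquet_path names from the snippet below; the type mapping is deliberately incomplete, and the database and table names are placeholders:

import boto3

def spark_type_to_glue(dt):
    # Illustrative mapping only - a real job would need to cover every
    # Spark type it might encounter (decimals, arrays, structs, ...).
    return {"StringType": "string", "LongType": "bigint",
            "IntegerType": "int", "DoubleType": "double"}.get(type(dt).__name__, "string")

columns = [{"Name": f.name, "Type": spark_type_to_glue(f.dataType)}
           for f in dataframe.schema.fields]

boto3.client("glue").create_table(
    DatabaseName="my_database",  # placeholder database name
    TableInput={
        "Name": "my_glue_table",
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": columns,
            "Location": parquet_path,  # the S3 path the job writes to
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {"SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"},
        },
    },
)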

In Spark on EMR I can create the Glue Data Catalog table pretty easily (although this is not well documented!):

dataframe.write.mode(mode).format("parquet").option("path", parquet_path).saveAsTable(glue_table)
dataframe.write.format("parquet").mode(mode).save(parquet_path)

This doesn't work in Glue. I can set up the Glue Data Catalog as the Hive metastore on the Spark session within the Glue job:

spark = SparkSession.builder \
    .appName(args['JOB_NAME']) \
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
    .enableHiveSupport() \
    .getOrCreate()

However, when I try to set the database it says it doesn't exist, and when I list the databases I get the following:

Databases=[Database(name=u'default', description=u'Default Hive database', locationUri=u'hdfs://ip-172-31-29-88.ap-southeast-2.compute.internal:8020/user/spark/warehouse')] 

This makes me think that Glue jobs don't work with the Glue Data Catalog - they seem to be using a default Hive catalog. Am I missing something?

The reason this is an issue is that in EMR I can do stuff like:

spark.sql("select * from my_glue_table") 

This works in EMR, but I suspect it will not work in a Glue job unless I run a crawler first, and I really don't see the need for a crawler when in EMR I can do pretty much the same thing with one line of code.

Am I missing something here?

Thanks in advance.

3 Answers


You can create a temporary view from a DataFrame and run SQL queries against it:

var dataDf = glueContext.sparkSession.read.format(format).load(path)
// or: var dataDf = dynamicFrame.toDF()
dataDf.createOrReplaceTempView("my_glue_table")
val allDataDf = glueContext.sparkSession.sql("select * from my_glue_table")

To create a table in the Data Catalog, the following code can help:

val table = new com.amazonaws.services.glue.catalog.Table(namespace, tblName, schema, partitions, parameters, location, serdeInfo, hiveCompatible)
glueContext.getCatalogClient.createTable(table)

Comments

Thank you so much for this. I am aware of the first call - the problem with it is that it isn't available outside of the current job.
Do you have any details of the second call? I can't find any reference to that class at all - for example, what is the serdeInfo object? Is hiveCompatible a boolean? Is there any documentation on it? Unfortunately, when you Google it, all you get is this page at the moment.
Unfortunately there are no docs currently available for this method call and parameters. I can try to find some info for you if it's still needed.
It would be great if we could get this working. We have a workaround for now, but your solution is much more elegant than what we are currently doing.
Sorry, I can't provide all the details; it's too complicated to describe all the dependent classes. For example, the schema type com.amazonaws.services.glue.schema.Schema uses com.amazonaws.services.glue.schema.types.DataType with more than 20 actual types. You can play with reflection to get constructor params and/or static creational methods.

AWS announced a new feature in April 2020 that makes this easier:

https://aws.amazon.com/about-aws/whats-new/2020/04/aws-glue-now-supports-the-ability-to-update-partitions-from-glue-spark-etl-jobs/
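
In practice, this means a job can write a DynamicFrame and have the sink create or update the catalog table itself. A minimal sketch based on that announcement, assuming an existing glueContext and dynamic_frame, with placeholder database, table, and path names:

# Write the frame and let the sink create/update the Data Catalog table.
sink = glueContext.getSink(
    connection_type="s3",
    path="s3://my-bucket/output/",  # placeholder output path
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=[])
sink.setFormat("glueparquet")
sink.setCatalogInfo(catalogDatabase="my_database", catalogTableName="my_glue_table")
sink.writeFrame(dynamic_frame)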



You can use the CREATE TABLE statement in Spark SQL to add the table to the AWS Glue Data Catalog.

spark.sql("USE database_name") df.registerTempTable("df") spark.sql(""" CREATE TABLE table_name USING CSV AS SELECT * FROM df """) 

When writing to CSV, I had to make sure the URI location for the Glue database was set, otherwise I'd end up with 'Can not create a Path from an empty string' errors, even when setting LOCATION in the query.
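
If you hit that error, one way to set the database location is via the Glue API before running the CREATE TABLE - a sketch, assuming boto3 and placeholder names:

import boto3

# Give the database a base LocationUri so CREATE TABLE has a path to resolve against.
glue = boto3.client("glue")
glue.update_database(
    Name="database_name",
    DatabaseInput={
        "Name": "database_name",
        "LocationUri": "s3://my-bucket/database_name/",  # placeholder base path
    },
)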

When writing to Parquet, it worked by setting LOCATION to an Amazon S3 path.
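
For reference, the Parquet variant described above looks roughly like this (the bucket path and table name are placeholders):

spark.sql("""
    CREATE TABLE table_name
    USING PARQUET
    LOCATION 's3://my-bucket/table_name/'
    AS SELECT * FROM df
""")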

