
I have what I consider to be a pretty simple requirement.

I want to create a job that takes one file, transforms it into another file, and then updates the Data Catalog metadata within Glue. This would allow another job to then pick up the new data source and consume it using Glue/EMR/Athena.

Now, I can do the transform without any issues, but for the life of me I cannot work out how to create the table within Glue other than by using a crawler, the console, or the Glue API. I would prefer to do this inside the job so that I can just call the next job rather than execute a crawler and wait for it to complete.

The issue with the Glue API is that I would also have to convert the Spark schema into the layout the API expects.
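
For context, here is a rough sketch of what that conversion involves, assuming a boto3 client and reusing the dataframe and parquet_path names from the snippet below; the type mapping is deliberately incomplete, and the database and table names are placeholders:

import boto3

def spark_type_to_glue(dt):
    # Illustrative mapping only - a real job would need to cover every
    # Spark type it might encounter (decimals, arrays, structs, ...).
    return {"StringType": "string", "LongType": "bigint",
            "IntegerType": "int", "DoubleType": "double"}.get(type(dt).__name__, "string")

columns = [{"Name": f.name, "Type": spark_type_to_glue(f.dataType)}
           for f in dataframe.schema.fields]

boto3.client("glue").create_table(
    DatabaseName="my_database",  # placeholder database name
    TableInput={
        "Name": "my_glue_table",
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": columns,
            "Location": parquet_path,  # the S3 path the job writes to
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {"SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"},
        },
    },
)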

In Spark on EMR I can create the Glue Data Catalog table pretty easily (although this is not well documented!):

dataframe.write.mode(mode).format("parquet").option("path", parquet_path).saveAsTable(glue_table)
dataframe.write.format("parquet").mode(mode).save(parquet_path)

This doesn't work in Glue. I can set up the Glue Data Catalog as the Hive metastore on the Spark session within the Glue job:

spark = SparkSession.builder \
    .appName(args['JOB_NAME']) \
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
    .enableHiveSupport() \
    .getOrCreate()

However, when I try to set the database it says it doesn't exist, and when I list the databases I get the following:

Databases=[Database(name=u'default', description=u'Default Hive database', locationUri=u'hdfs://ip-172-31-29-88.ap-southeast-2.compute.internal:8020/user/spark/warehouse')] 

This makes me think that Glue jobs don't work with the Glue Data Catalog - they seem to be using a default Hive catalog. Am I missing something?

The reason this is an issue is that in EMR I can do stuff like:

spark.sql("select * from my_glue_table") 

This works in EMR, but I suspect it will not work in a Glue job unless I run a crawler first, and I really don't see the need for a crawler when in EMR I can do pretty much the same thing with one line of code.

Am I missing something here?

Thanks in advance.

3 Answers


You can create a temporary view from a DataFrame and run SQL queries against it:

var dataDf = glueContext.sparkSession.read.format(format).load(path)
// or: var dataDf = dynamicFrame.toDF()
dataDf.createOrReplaceTempView("my_glue_table")
val allDataDf = glueContext.sparkSession.sql("select * from my_glue_table")

To create a table in the Data Catalog, the following code can help:

val table = new com.amazonaws.services.glue.catalog.Table(namespace, tblName, schema, partitions, parameters, location, serdeInfo, hiveCompatible)
glueContext.getCatalogClient.createTable(table)

Comments

Thank you so much for this. I am aware of the first call - the problem with it is that it isn't available outside of the current job.
Do you have any details of the second call? I can't find any reference to that class at all - for example, what is the serdeInfo object? Is hiveCompatible a boolean? Is there any documentation on it? Unfortunately, when you Google it, all you get is this page at the moment.
Unfortunately there are no docs currently available for this method call and parameters. I can try to find some info for you if it's still needed.
It would be great if we could get this working. We have a workaround for now, but your solution is much more elegant than what we are currently doing.
Sorry, I can't provide all the details; it's too complicated to describe all the dependent classes. For example, the schema type com.amazonaws.services.glue.schema.Schema uses com.amazonaws.services.glue.schema.types.DataType with more than 20 actual types. You can play with reflection to get constructor params and/or static creational methods.

AWS announced a new feature in April 2020 that makes this easier:

https://aws.amazon.com/about-aws/whats-new/2020/04/aws-glue-now-supports-the-ability-to-update-partitions-from-glue-spark-etl-jobs/
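
In practice, this means a job can write a DynamicFrame and have the sink create or update the catalog table itself. A minimal sketch based on that announcement, assuming an existing glueContext and dynamic_frame, with placeholder database, table, and path names:

# Write the frame and let the sink create/update the Data Catalog table.
sink = glueContext.getSink(
    connection_type="s3",
    path="s3://my-bucket/output/",  # placeholder output path
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=[])
sink.setFormat("glueparquet")
sink.setCatalogInfo(catalogDatabase="my_database", catalogTableName="my_glue_table")
sink.writeFrame(dynamic_frame)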



You can use the CREATE TABLE statement in Spark SQL to add the table to the AWS Glue Data Catalog.

spark.sql("USE database_name") df.registerTempTable("df") spark.sql(""" CREATE TABLE table_name USING CSV AS SELECT * FROM df """) 

When writing to CSV, I had to make sure the URI location for the Glue database was set, otherwise I'd end up with 'Can not create a Path from an empty string' errors, even when setting LOCATION in the query.
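
If you hit that error, one way to set the database location is via the Glue API before running the CREATE TABLE - a sketch, assuming boto3 and placeholder names:

import boto3

# Give the database a base LocationUri so CREATE TABLE has a path to resolve against.
glue = boto3.client("glue")
glue.update_database(
    Name="database_name",
    DatabaseInput={
        "Name": "database_name",
        "LocationUri": "s3://my-bucket/database_name/",  # placeholder base path
    },
)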

When writing to Parquet, it worked by setting LOCATION to an Amazon S3 path.
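
For reference, the Parquet variant described above looks roughly like this (the bucket path and table name are placeholders):

spark.sql("""
    CREATE TABLE table_name
    USING PARQUET
    LOCATION 's3://my-bucket/table_name/'
    AS SELECT * FROM df
""")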

