
We use Trino (https://trino.io/) to connect to HDFS. I discovered that the data in the information_schema tables, for example:

select * from information_schema.columns clz where clz.table_catalog = 'hive' and clz.table_schema = '<schema_name>' and clz.table_name = '<table_name>'

doesn’t always match up with what I get if I run

show tables from [schema]
show columns in [schema].[table]

etc. It seems that the show tables/show columns commands pretty much always match up with what I see if I run the hadoop command (hadoop fs -ls ...) to show the contents of the hdfs folder.

So I’m trying to figure out:

  • why the information_schema doesn’t give the same results as show tables/show columns/etc.
  • if there is a way to refresh/update information_schema to make it current

Thank you.

1 Answer


The information_schema in Trino simply exposes the metadata of the underlying data source. Its contents therefore vary depending on the data source and connector in use:

  • For RDBMS connectors such as PostgreSQL, it essentially exposes the information schema from PostgreSQL after applying type mapping and similar adjustments.
  • For the Hive and Iceberg connectors, it exposes the information from the Hive metastore service and the table format.
  • For other systems such as Elasticsearch, it is different again, but the information basically always comes from the underlying system.

In your specific case, the results might not match if some external system also modifies the object storage and the Hive metastore service (HMS). There is also a metadata cache in play with the HMS, and it could be stale.
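
As for refreshing, here is a minimal sketch. It assumes the catalog is named hive as in the question, that metastore caching is enabled for that catalog, and that your Trino version provides the Hive connector's system.flush_metadata_cache procedure (check the Hive connector documentation for your release). <schema_name> and <table_name> are the placeholders from the question.

```sql
-- Assumption: the catalog is named "hive" and this Trino release
-- ships the Hive connector's flush_metadata_cache procedure.

-- Drop the cached metastore entries for a single table ...
CALL hive.system.flush_metadata_cache(
    schema_name => '<schema_name>',
    table_name  => '<table_name>'
);

-- ... or, if it is unclear which objects are stale, flush everything.
CALL hive.system.flush_metadata_cache();

-- Rerun the original query; it should now reflect the current metastore state.
SELECT *
FROM information_schema.columns clz
WHERE clz.table_catalog = 'hive'
  AND clz.table_schema = '<schema_name>'
  AND clz.table_name = '<table_name>';
```

If the results still differ after the flush, the cache is probably not the cause, and it is worth checking whether another system is changing the HMS or the files in object storage directly.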
