I have an application running on EKS in AWS that uses a single Spark session to run multiple workloads. In each workload, I need to access data in S3 in another AWS account, for which I have STS credentials. At the beginning of each workload, I run code like:
```python
hadoop_conf.set(f'fs.s3a.bucket.{s3_bucket}.access.key', 'ACCESS_KEY')
hadoop_conf.set(f'fs.s3a.bucket.{s3_bucket}.secret.key', 'SECRET_KEY')
hadoop_conf.set(f'fs.s3a.bucket.{s3_bucket}.session.token', 'SESSION_TOKEN')
hadoop_conf.set(f'fs.s3a.bucket.{s3_bucket}.aws.credentials.provider',
                'org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider')
```

Then I check S3. I need to check whether the data is in Delta format, so I run:
```python
from delta.tables import DeltaTable

table_url = f"s3://{s3_bucket}/.../"
is_table = DeltaTable.isDeltaTable(spark, table_url)
```

When I run the first workload of the session, the credentials are set and the `isDeltaTable` call works as expected. Later, if another workload runs on the same session against a different S3 location, and the credentials have expired, the `isDeltaTable` call fails with errors like:
```
Lost task 2.0 in stage 253.0 (TID 45861) (100.64.174.54 executor 108): org.apache.hadoop.fs.s3a.AWSBadRequestException: getFileStatus on s3://s3_bucket/.../_delta_log/00000000000000000002.json: com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request
```

From what I can tell, the first `isDeltaTable` call in the session spins up executors, which pick up the Hadoop config present during the first workload. These executors stick around until the second workload, where they are reused for `isDeltaTable`, but they do not pick up the new Hadoop config, so they continue using the original credentials from the first workload and throw the Bad Request error.
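For clarity, the per-workload credential setup amounts to something like the helper below. The helper name and signature are just for illustration; in the real app, `hadoop_conf` comes from the Spark session's Hadoop configuration, and the stub class here only stands in for it so the snippet is self-contained.

```python
# Hypothetical helper wrapping the per-bucket setup shown above; the function
# name and arguments are mine, not from the real app. hadoop_conf only needs
# a .set(key, value) method, like Hadoop's Configuration object.
def configure_bucket_credentials(hadoop_conf, s3_bucket,
                                 access_key, secret_key, session_token):
    prefix = f"fs.s3a.bucket.{s3_bucket}"
    hadoop_conf.set(f"{prefix}.access.key", access_key)
    hadoop_conf.set(f"{prefix}.secret.key", secret_key)
    hadoop_conf.set(f"{prefix}.session.token", session_token)
    hadoop_conf.set(
        f"{prefix}.aws.credentials.provider",
        "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
    )


# Minimal stand-in for the driver-side Hadoop Configuration, just to show
# which keys get written.
class _StubConf:
    def __init__(self):
        self.values = {}

    def set(self, key, value):
        self.values[key] = value


conf = _StubConf()
configure_bucket_credentials(conf, "my-bucket", "AKIA...", "secret", "token")
```

The point is that these are driver-side settings: they are written once per workload, before the first S3 access for that bucket.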
Is it possible to update the Hadoop config on existing executors? I'm trying to avoid restarting my Spark session for performance reasons.