Spark supports multiple encryption schemes: client-side encryption, server-side encryption, and so on. What didn't work for me for a long time was reading encrypted data and writing it back out as plain text. Before reading, I would configure encryption, and that worked fine. However, writing as plain text didn't work: the output came out encrypted, even though I had disabled encryption.
I was told this happens because the encryption settings are cached, so my changes were not honored. What works for me now is using different access protocols for reading and writing S3 files.
So, for the configuration, do this (in Scala):
// Enable CSE for the s3:// prefix
spark.conf.set("fs.s3.enableServerSideEncryption", "false")
spark.conf.set("fs.s3.cse.enabled", "true")
spark.conf.set("fs.s3.cse.encryptionMaterialsProvider", "com.amazon.ws.emr.hadoop.fs.cse.KMSEncryptionMaterialsProvider")
spark.conf.set("fs.s3.cse.kms.keyId", "KMS ID")     // KMS key to encrypt the data with
spark.conf.set("fs.s3.cse.kms.region", "us-east-1") // the region for the KMS key

// Disable CSE for the s3a:// prefix so data is not encrypted
spark.conf.set("fs.s3a.enableServerSideEncryption", "false")
spark.conf.set("fs.s3a.cse.enabled", "false")
spark.conf.set("fs.s3a.canned.acl", "BucketOwnerFullControl")
spark.conf.set("fs.s3a.acl.default", "BucketOwnerFullControl")
spark.conf.set("fs.s3a.acl", "bucket-owner-full-control")
or in Python do this:
# Enable CSE for the s3:// prefix
spark._jsc.hadoopConfiguration().set("fs.s3.enableServerSideEncryption", "false")
spark._jsc.hadoopConfiguration().set("fs.s3.cse.enabled", "true")
spark._jsc.hadoopConfiguration().set("fs.s3.cse.encryptionMaterialsProvider", "com.amazon.ws.emr.hadoop.fs.cse.KMSEncryptionMaterialsProvider")
spark._jsc.hadoopConfiguration().set("fs.s3.cse.kms.keyId", "KMS ID")     # KMS key to encrypt the data with
spark._jsc.hadoopConfiguration().set("fs.s3.cse.kms.region", "us-east-1") # the region for the KMS key

# Disable CSE for the s3a:// prefix so data is not encrypted
spark._jsc.hadoopConfiguration().set("fs.s3a.enableServerSideEncryption", "false")
spark._jsc.hadoopConfiguration().set("fs.s3a.cse.enabled", "false")
spark._jsc.hadoopConfiguration().set("fs.s3a.canned.acl", "BucketOwnerFullControl")
spark._jsc.hadoopConfiguration().set("fs.s3a.acl.default", "BucketOwnerFullControl")
spark._jsc.hadoopConfiguration().set("fs.s3a.acl", "bucket-owner-full-control")
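With both prefixes configured, a decrypt-on-read, write-as-plaintext round trip looks like the sketch below (in Scala). The bucket name and paths are made up, and CSV is just an example format; substitute whatever you actually store:

// Read client-side-encrypted data through the s3:// prefix;
// EMRFS decrypts it transparently with the KMS key configured above.
val df = spark.read
  .option("header", "true")
  .csv("s3://my-bucket/encrypted/input/")   // hypothetical path

// Write the same data through the s3a:// prefix;
// CSE is disabled there, so the output lands in S3 as plain text.
df.write
  .option("header", "true")
  .csv("s3a://my-bucket/plaintext/output/") // hypothetical path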
Now, when you read or write a file using the s3:// prefix, it uses client-side encryption with the KMS key. However, if you read or write using s3a://, it doesn't encrypt. You can use the s3n:// prefix to configure yet another encryption scheme. If you want to do more than that, you need to dig into the protocol handlers.
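As a sketch of the s3n:// idea: I haven't verified this, but presumably the fs.s3n.* analogues of the keys above would attach a third scheme, for example CSE with a second KMS key. Treat every key name below as an assumption rather than a documented setting, and the key ID as a placeholder:

// UNVERIFIED: assumes EMRFS honors fs.s3n.* mirrors of the fs.s3.* CSE keys.
spark.conf.set("fs.s3n.cse.enabled", "true")
spark.conf.set("fs.s3n.cse.encryptionMaterialsProvider", "com.amazon.ws.emr.hadoop.fs.cse.KMSEncryptionMaterialsProvider")
spark.conf.set("fs.s3n.cse.kms.keyId", "ANOTHER KMS ID") // hypothetical second key
spark.conf.set("fs.s3n.cse.kms.region", "us-east-1")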