How to deploy Tink for BigQuery encryption on-prem and in the cloud

Data security is a key focus for organizations moving their data warehouses from on-premises to cloud-first systems, such as BigQuery. In addition to storage-level encryption, whether using Google-managed or customer-managed keys, BigQuery also provides column-level encryption. Using BigQuery’s SQL AEAD functions, organizations can enforce a more granular level of encryption to help protect sensitive customer data, such as government identity or credit card numbers, and help comply with security requirements.

While BigQuery provides column-level encryption in the cloud, many organizations operate in hybrid-cloud environments. To prevent a scenario where data needs to be decrypted and re-encrypted each time it moves between locations, Google Cloud offers a consistent and interoperable encryption mechanism. This enables deterministically encrypted data (which maintains referential integrity) to be immediately joined with on-prem tables for anonymized analytics.

To achieve BigQuery-compatible encryption on-prem, customers can use Tink, a Google-developed open-source cryptography library. BigQuery uses Tink to implement its SQL AEAD functions, so we can use the Tink library directly to encrypt data on-prem in a way that can later be decrypted using BigQuery SQL in the cloud, and to decrypt BigQuery’s column-level encrypted data outside of BigQuery.

For our customers who want to use Tink with BigQuery, we have put together a few helpful Python utilities and samples in the BigQuery Tink Toolkit GitHub repo. Let’s first walk through an example of how to use Tink directly to encrypt or decrypt on-prem data using the same keyset used for BigQuery, followed by how the BigQuery Tink Toolkit can help simplify working with Tink. 

To start, we need to retrieve the Tink keyset. We’ll assume that KMS-wrapped keysets are being used. These keysets need to be stored in BigQuery to be used with BigQuery SQL. If needed, they can also be replicated to a secondary store on-prem.

```python
from google.cloud import bigquery, kms

bq_client = bigquery.Client()
query_job = bq_client.query(
    """SELECT kms_resource_path, wrapped_keyset, associated_data
       FROM `my-keysets-table` WHERE column_name = 'my-pii-column';"""
)
# result() returns a row iterator; take the first matching row.
result_row = next(iter(query_job.result()))
```
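The lookup above is expected to match a single keyset row for the column. As a minimal sketch, a small guard can make that assumption explicit and fail loudly on zero or multiple matches. The helper name `exactly_one` is hypothetical and is not part of BigQuery or the toolkit; it works on any iterable of rows.

```python
from typing import Iterable, TypeVar

T = TypeVar("T")


def exactly_one(rows: Iterable[T], what: str = "keyset row") -> T:
    """Return the only item in rows, or raise if there are zero or several."""
    iterator = iter(rows)
    try:
        first = next(iterator)
    except StopIteration:
        raise LookupError(f"no {what} found") from None
    # Any further item means the lookup was ambiguous.
    for _ in iterator:
        raise LookupError(f"more than one {what} found")
    return first
```

With this helper, the row could be fetched as `exactly_one(query_job.result())`.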

Now that we have the encrypted keyset, we need to unwrap it to retrieve the usable Tink keyset. If Cloud KMS is not accessible from on-prem, the unwrapped keyset will need to be maintained in a secure keystore on-prem.

```python
kms_client = kms.KeyManagementServiceClient()
decrypted_keyset_obj = kms_client.decrypt(
    request={
        "name": result_row.kms_resource_path.split("gcp-kms://")[1],
        "ciphertext": result_row.wrapped_keyset,
    }
)
keyset = decrypted_keyset_obj.plaintext
```
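The unwrap step strips the `gcp-kms://` prefix from the stored resource path before calling Cloud KMS. A slightly more defensive version of that step might look like the sketch below. The function name `kms_key_name` and the example path are hypothetical; the only assumption carried over from the walkthrough is that stored paths start with `gcp-kms://`.

```python
KMS_URI_PREFIX = "gcp-kms://"


def kms_key_name(kms_resource_path: str) -> str:
    """Strip the gcp-kms:// prefix and return the bare Cloud KMS key name."""
    if not kms_resource_path.startswith(KMS_URI_PREFIX):
        raise ValueError(f"expected a path starting with {KMS_URI_PREFIX!r}")
    name = kms_resource_path[len(KMS_URI_PREFIX):]
    if not name:
        raise ValueError("KMS resource path has no key name after the prefix")
    return name


# Hypothetical example path, for illustration only.
example = "gcp-kms://projects/my-project/locations/us/keyRings/my-ring/cryptoKeys/my-key"
print(kms_key_name(example))
```

Unlike `split(...)[1]`, this raises a descriptive error when the prefix is missing instead of an `IndexError`.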

We can now use the keyset to generate a Tink primitive, which encrypts or decrypts data with the associated keyset. Note that the primitive must match the keyset type: DeterministicAead for a deterministic keyset, Aead for a nondeterministic one.

```python
import tink
from tink import aead, cleartext_keyset_handle, daead

binary_keyset_reader = tink.BinaryKeysetReader(keyset)
keyset_handle = cleartext_keyset_handle.read(binary_keyset_reader)

# If using a deterministic keyset:
daead.register()
cipher = keyset_handle.primitive(daead.DeterministicAead)

# If using a nondeterministic keyset instead:
# aead.register()
# cipher = keyset_handle.primitive(aead.Aead)
```

Once we have our cipher, we can use it to encrypt or decrypt data as needed.

```python
plaintext = "Hello world!"
associated_data = result_row.associated_data

# To encrypt:
# If using a deterministic keyset
ciphertext = cipher.encrypt_deterministically(
    plaintext.encode(), associated_data.encode()
)
# If using a nondeterministic keyset instead
# ciphertext = cipher.encrypt(plaintext.encode(), associated_data.encode())

# To decrypt (the result is bytes; call .decode() to recover the string):
# If using a deterministic keyset
plaintext = cipher.decrypt_deterministically(
    ciphertext, associated_data.encode()
)
# If using a nondeterministic keyset instead
# plaintext = cipher.decrypt(ciphertext, associated_data.encode())
```

To simplify this process, the BigQuery Tink Toolkit provides the CipherManager class, which handles four actions:

Retrieving the required keysets from a BigQuery table

Unwrapping those keysets

Creating a Tink cipher for each column

Providing a consistent interface to call encrypt and decrypt. 
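To illustrate what a consistent encrypt and decrypt interface can look like, here is a minimal sketch of a wrapper around a single column's Tink primitive. This is not the toolkit's actual CipherManager API; the class name `ColumnCipher` and its fields are hypothetical. It relies only on the method names used earlier in this walkthrough (`encrypt_deterministically` and `decrypt_deterministically` for deterministic primitives, `encrypt` and `decrypt` otherwise) and assumes string plaintexts encoded as UTF-8.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ColumnCipher:
    """Hypothetical wrapper: one Tink primitive plus a column's associated data."""

    primitive: Any  # assumed to be a Tink Aead or DeterministicAead instance
    associated_data: bytes

    def encrypt(self, plaintext: str) -> bytes:
        data = plaintext.encode("utf-8")
        # Deterministic primitives expose encrypt_deterministically instead of encrypt.
        if hasattr(self.primitive, "encrypt_deterministically"):
            return self.primitive.encrypt_deterministically(data, self.associated_data)
        return self.primitive.encrypt(data, self.associated_data)

    def decrypt(self, ciphertext: bytes) -> str:
        if hasattr(self.primitive, "decrypt_deterministically"):
            data = self.primitive.decrypt_deterministically(
                ciphertext, self.associated_data
            )
        else:
            data = self.primitive.decrypt(ciphertext, self.associated_data)
        return data.decode("utf-8")
```

Callers can then write `ColumnCipher(cipher, associated_data.encode()).encrypt("Hello world!")` without caring which kind of keyset backs the column.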

We have also included a sample Spark job that shows how to use CipherManager to encrypt or decrypt columns for a given table. We hope these come in handy – happy Tinkering.
