Getting started with Terraform and Datastream: Replicating Postgres data to BigQuery

Consider the case of an organization that has accumulated a lot of data, stored in various databases and applications. Producing analytics reports takes a long time because of the complexity of the data storage, so the team decides to replicate all the data to BigQuery to make reporting more efficient. Traditionally, this would be a large, complex, and costly project.

Instead of painstakingly setting up replication for each data source, the team can now use Datastream and Terraform. They compile a list of data sources, create a few configuration files according to the organization’s setup, and voila! Replication begins and data starts to appear in BigQuery within minutes.

Datastream is Google Cloud’s serverless and easy-to-use change data capture and replication service. If you are unfamiliar with Datastream, we recommend this post for a high-level overview, or read our latest Datastream for BigQuery launch announcement.

Terraform is a popular infrastructure as code (IaC) tool. Terraform enables infrastructure management through configuration files, which makes management safer, more consistent, and easier to automate.

Launched in mid-February 2023, Terraform support in Datastream unblocks and simplifies some exciting use cases, such as:

Policy compliance management – Terraform can be used to enforce compliance and governance policies on the resources that teams provision.

Automated replication process – Terraform can be used to automate Datastream operations. This can be useful when you need automated replication, replication from many data sources, or replication of a single data source to multiple destinations.

Using Terraform to set up Datastream replication from PostgreSQL to BigQuery

Let’s look at an example where the data source is a PostgreSQL database, and review the process step by step.

Limitations

Datastream only replicates data to BigQuery within the same Google Cloud project as the Datastream resources, so make sure to create your Datastream resources in the project where you want your data to land.

Requirements

The Datastream API needs to be enabled on the project before we continue. Go to the APIs & Services page in the Google Cloud console and make sure that the Datastream API is enabled.
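Alternatively, if you prefer to manage this with Terraform as well, the API can be enabled with the google_project_service resource. A minimal sketch (the project ID is a placeholder):

resource "google_project_service" "datastream" {
  # Enables the Datastream API on the given project.
  project = "PROJECT_ID"
  service = "datastream.googleapis.com"
}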

Make sure you have the Terraform CLI installed – see the Terraform CLI installation guide.

It’s possible to follow the steps in this blog with either a MySQL or an Oracle database, with a few slight modifications. Just skip the Postgres configuration section and use our MySQL or Oracle configuration guides instead.

You will naturally need a Postgres database instance with some initial data. If you want to set up a new Postgres instance, you can follow the steps in the Cloud SQL for Postgres quickstart guide. 

We will need to make sure that PostgreSQL is configured for replication with Datastream. This includes enabling logical replication and optionally creating a dedicated user for Datastream. See our PostgreSQL configuration guide. Make sure to note the replication slot and publication names from this step, as we will need them to configure the replication later on.
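If your source is a Cloud SQL instance that you also manage with Terraform, enabling logical replication comes down to setting a database flag. A minimal sketch, assuming a hypothetical instance name and machine tier:

resource "google_sql_database_instance" "postgres" {
  name             = "datastream-source"   # hypothetical instance name
  database_version = "POSTGRES_14"
  region           = "us-central1"

  settings {
    tier = "db-custom-2-7680"              # hypothetical machine tier

    # Required for Datastream: turns on logical replication for the instance.
    database_flags {
      name  = "cloudsql.logical_decoding"
      value = "on"
    }
  }
}

The publication, replication slot, and dedicated Datastream user still need to be created with SQL, as described in the configuration guide.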

You will also need to set up connectivity between your database and Datastream. Check the Network connectivity options guide, and find the connectivity type that fits your setup.
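For example, if you go with private connectivity, the peering can be declared in the same Terraform configuration. A sketch, assuming a network named NETWORK_NAME and an unused /29 CIDR range:

resource "google_datastream_private_connection" "default" {
  display_name          = "Private connectivity"
  location              = "us-central1"
  private_connection_id = "private-connection"

  vpc_peering_config {
    vpc    = "projects/PROJECT_ID/global/networks/NETWORK_NAME"
    subnet = "10.0.0.0/29"   # must not overlap with existing subnets
  }
}

The source Connection Profile created below would then reference this private connection instead of connecting directly.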

Configuring the replication with Terraform

We will start by creating Datastream Connection Profiles, which store the information needed to connect to the source and destination (hostname, port, user, etc.).

To do this, create a new .tf file in an empty directory and add the following configuration to it:

resource "google_datastream_connection_profile" "source" {
  display_name          = "Postgresql Source"
  location              = "us-central1"
  connection_profile_id = "source"

  postgresql_profile {
    hostname = "HOSTNAME"
    port     = 5432
    username = "USERNAME"
    password = "PASSWORD"
    database = "postgres"
  }
}

In this example, we create a new Connection Profile that points to a Postgres instance. Edit the configuration with your source information and save it. For other sources and configurations see the Terraform Datastream Connection Profile documentation.
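For example, if you are following along with MySQL as mentioned earlier, only the profile block changes. A sketch with placeholder connection details:

resource "google_datastream_connection_profile" "source" {
  display_name          = "MySQL Source"
  location              = "us-central1"
  connection_profile_id = "source"

  mysql_profile {
    hostname = "HOSTNAME"
    port     = 3306
    username = "USERNAME"
    password = "PASSWORD"
  }
}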

Next, let’s define the Connection Profile for the BigQuery destination:

resource "google_datastream_connection_profile" "destination" {
  display_name          = "BigQuery Destination"
  location              = "us-central1"
  connection_profile_id = "destination"

  bigquery_profile {}
}

We now have the source and destination configured, and are ready to configure the replication process between them. We will do that by defining a Stream, a Datastream resource that represents the replication from the source to the destination.

resource "google_datastream_stream" "stream" {
  display_name  = "Postgres to BigQuery Stream"
  location      = "us-central1"
  stream_id     = "stream"
  desired_state = "RUNNING"

  source_config {
    source_connection_profile = google_datastream_connection_profile.source.id
    postgresql_source_config {
      publication      = "PUBLICATION_NAME"
      replication_slot = "REPLICATION_SLOT"
    }
  }

  destination_config {
    destination_connection_profile = google_datastream_connection_profile.destination.id
    bigquery_destination_config {
      data_freshness = "900s"
      source_hierarchy_datasets {
        dataset_template {
          location = "us-central1"
        }
      }
    }
  }

  backfill_all {}
}

In this configuration, we are creating a new Stream and configuring the source and destination Connection Profiles and properties. Some features to note here:

Backfill (backfill_all) – Datastream will replicate an initial snapshot of historical data. This can be configured to exclude specific tables, as shown in the sketch after this list.

Replicating a subset of the source – you can specify which data should be included in or excluded from the Stream using the include/exclude lists. See more in the API docs.
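To illustrate both, the source and backfill blocks of the Stream above could be extended like this (a sketch; the schema and table names are hypothetical):

  postgresql_source_config {
    publication      = "PUBLICATION_NAME"
    replication_slot = "REPLICATION_SLOT"

    # Replicate only the "public" schema.
    include_objects {
      postgresql_schemas {
        schema = "public"
      }
    }
  }

  # Backfill everything except a hypothetical high-churn table.
  backfill_all {
    postgresql_excluded_objects {
      postgresql_schemas {
        schema = "public"
        postgresql_tables {
          table = "audit_log"
        }
      }
    }
  }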

Edit the configuration with the source Postgres publication and replication slot that you created in the initial setup. For other sources and configurations see the Terraform Stream documentation.

Running the Terraform configuration

Now that we have our configuration ready, it’s time to run it with the Terraform CLI. For that, we can use Cloud Shell, which has the Terraform CLI installed and configured with the permissions needed for your project.

You can also prepare a local environment by adding this to your .tf file:

terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "4.53.0"
    }
  }
}

provider "google" {
  credentials = file("PATH/TO/KEY.json")
  project     = "PROJECT_ID"
  region      = "us-central1"
}

Start by running the terraform init command to initialize Terraform in your configuration directory. Then run the terraform plan command to check and validate the configuration.

Now let’s run terraform apply to apply the new configuration.

If all went well, you should now have a running Datastream Stream! Go to the Datastream console to manage your Streams, and to the BigQuery console to check that the appropriate datasets were created.

When you’re done, you can use terraform destroy to remove the created Datastream resources.

Something went wrong?

You can set the TF_LOG=DEBUG environment variable (export TF_LOG=DEBUG) to see debug logs for the Terraform CLI. See Debugging Terraform for more.

Automating multiple replications

To automate Terraform configurations, we can make use of input variables, the count argument, and the element function. A simple variable that holds a list of sources can look like this:

variable "sources" {
  type = list(object({
    name             = string
    hostname         = string
    …
    publication      = string
    replication_slot = string
  }))
}

Add a count argument to the resources you want to automate (source, stream, and maybe destination), and replace the hardcoded values with values from the variable:

count = length(var.sources)
…
connection_profile_id = "source_${element(var.sources, count.index).name}"
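Putting it together for the source Connection Profile, a sketch might look like this (var.sources[count.index] is equivalent to element(var.sources, count.index); the credentials are placeholders that could also live in the variable):

resource "google_datastream_connection_profile" "source" {
  # One Connection Profile per entry in var.sources.
  count                 = length(var.sources)
  display_name          = "Postgresql Source ${var.sources[count.index].name}"
  location              = "us-central1"
  connection_profile_id = "source_${var.sources[count.index].name}"

  postgresql_profile {
    hostname = var.sources[count.index].hostname
    port     = 5432
    username = "USERNAME"   # placeholder
    password = "PASSWORD"   # placeholder
    database = "postgres"
  }
}

The Stream resource would get the same count and reference google_datastream_connection_profile.source[count.index].id as its source.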
