Transforming customer experiences with modern cloud database capabilities

Transforming customer experiences with modern cloud database capabilities

Editor’s note: Six customers, across a range of industries, share their success stories with Google Cloud databases.

From professional sports leagues to kidney care and digital commerce, Google Cloud databases enable organizations to develop radically transformative experiences for their users. The stories of how Google Cloud Databases have helped Box, Credit Karma, Davita, Forbes, MLB, and PLAID build data-driven applications is truly remarkable – from unifying data lifecycles for intelligent applications, to reducing, and even eliminating operational burden. Here are some of the key stories that customers shared at Google Cloud Next.

Box modernizes its NoSQL databases with zero downtime with Bigtable   

A content cloud, Box enables users to securely create, share, co-edit, and retain their content online. While moving its core infrastructure from on-premises data centers to the cloud, Box chose to migrate its NoSQL infrastructure to Cloud Bigtable. To fulfill the company’s user request needs, the NoSQL infrastructure has latency requirements measured in tens of milliseconds. “File metadata like location, size, and more, are stored in a NoSQL table and accessed at every download. This table is about 150 terabytes in size and spans over 600 billion rows. Hosting this on Bigtable removes the operational burden of infrastructure management. Using Bigtable, Box gains automatic replication with eventual consistency, an HBase-compliant library, and managed backup and restore features to support critical data.” Axatha Jayadev Jalimarada, Staff Software Engineer at Box, was enthusiastic about these Bigtable benefits, “We no longer need manual interventions by SREs to scale our clusters, and that’s been a huge operational relief. We see around 80 millisecond latencies to Bigtable from our on-prem services. We see sub-20 millisecond latencies from our Google Cloud resident services, especially when the Bigtable cluster is in the same region. Finally, most of our big NoSQL use cases have been migrated to Bigtable and I’m happy to report that some have been successfully running for over a year now.”

Axatha Jayadev Jalimarada walks through “how Box modernized their NoSQL databases with minimal effort and downtime” with Jordan Hambleton, Bigtable Solutions Architect at Google Cloud.

Credit Karma deploys models faster with Cloud Bigtable and BigQuery

Credit Karma, a consumer technology platform helping consumers in the US, UK and Canada make financial progress, is reliant on its data models and systems to deliver a personalized experience for its nearly 130 million members. Given its scale, Credit Karma recognized the need to cater to the growing volume, complexity, and speed of data, and began moving its technology stack to Google Cloud in 2016. 

UsingCloud Bigtable andBigQuery, Credit Karma registered a 7x increase in the number of pre-migration experiments, and began deploying 700 models/week compared to 10 per quarter. Additionally, Credit Karma was able to push recommendations through its modeling scoring service built on a reverse extract, transform, load, (ETL) process on BigQuery, Cloud Bigtable andGoogle Kubernetes Engine. Powering Credit karma’s recommendations are machine learning models at scale — the team runs about 58 billion model predictions each day.

Looking to learn “what’s next for engineers”? Check outthe conversation between Scott Wong, and Andi Gutmans, General Manager and Vice President of Engineering for Databases at Google.

DaVita leverages Spanner and BigQuery to centralize health data and analytics for clinician enablement

As a leading global kidney care company, DaVita spans the gamut of kidney care from chronic kidney disease to transplants. As part of its digital transformation strategy, DaVita was looking to centralize all electronic health records (EHRs) and related care activities into a single system that would not only embed work flows, but also save clinicians time and enable them to focus on their core competencies. Jay Richardson, VP, Application Development at DaVita, spoke to the magnitude of the task, “Creating a seamless, real-time data flow across 600,000 treatments on 200,000 patients and 45,000 clinicians was a tall engineering order.”  The architecture was set up in Cloud Spanner housing all the EHRs and related-care activities, and BigQuery handling the analytics. Spanner change streams replicated data changes to BigQuery with a 75 percent reduction in time for replication–from 60 to 15 seconds-enabling both, simplification of the integration process, as well as, a highly scalable solution. DaVita also gained deep, relevant, insights–about 200,000 a day–and full aggregation for key patient meds and labs data. This helps equip physicians with additional tools to care for their patients, without inundating them with numbers.

Jerene Yang, Senior Software Engineering Manager at Google Cloud, helps to “see the whole picture by unifying operational data with analytics” with Jay Richardson.

Forbes fires up digital transformation with Firestore

A leading media and information company, Forbes is plugged into an ecosystem of about 140 million—employees, contributors, and readers—across the globe. It recently underwent a successful digital transformation effort to support its rapidly scaling business. This included a swift, six-month migration to Google Cloud, and integrating with the full suite of Google Cloud products from BigQuery to Firestore—a NoSQL document database. Speaking of Firestore, Vadim Supitskiy, Chief Digital & Information Officer at Forbes, explained, “We love that it’s a managed service, we do not want to be in the business of managing databases. It has a flexible document model, which makes it very easy for developers to use and it integrates really, really, well with the products that GCP has to offer.” Firestore powers the Forbes insights and analytics platform to give its journalists and contributors comprehensive, real-time suggestions that help content creators author relevant content, and analytics to assess the performance of published articles. At the backend, Firestore seamlessly integrates with Firebase Auth, Google Kubernetes Engine, Cloud Functions, BigQuery, and Google Analytics, while reducing maintenance overheads. As a cloud-native database that requires no configuration or management, it’s cheap to store data in, and executes low-latency queries

Minh Nguyen, Senior Product Manager at Google cloud, discusses “serverless application development with a document database” with Vadim Supitskiyhere.

MLB hits a home run by moving to Cloud SQL

When you think ofMajor League Baseball (MLB), you think of star players and home runs. But as Joseph Zirilli, senior software engineer at MLB explained, behind-the-scenes technology is critical to the game, whether it is the TV streaming service, or on-field technology to capture statistics data. And that’s a heavy lift, especially when MLB was running its player scouting and management system for player transactions on a legacy, on-premises database. This, in combination with the limitations of conventional licensing, was adversely impacting the business. The lack of in-house expertise in the legacy database, coupled with its small team size, made routine tasks challenging. 

Having initiated the move to Google Cloud a few years ago, MLB was already using Cloud SQL for some of its newer products. It was also looking to standardize its relational database management system around PostgreSQL so it could build in-house expertise around a single database. They selected Cloud SQL which supported their needs, and also offered high availability and automation.

Today, with drastically improved database performance and automatic rightsizing of database instances, MLB is looking forward to keeping its operational costs low and hitting it out of the park for fan experience.

Sujatha Mandava, Director, Product Management, SQL Databases at Google Cloud, and Joseph Zirilli discuss “why now is the time to migrate your apps to managed databases”.

Major League Baseball trademarks and copyrights are used with permission of Major League Baseball. Visit

PLAID allies with AlloyDB to enhance the KARTE website and native app experience for customer engagement

PLAID, a Tokyo-based startup hosts KARTE, an engagement platform focused on customer experience that tracks the customer in real time, supports flexible interactions, and provides wide analytics functionality. To support hybrid transactional and analytical processing (HTAP) at scale, KARTE was using a combination of BigQuery, Bigtable, and Spanner in the backend. This enabled KARTE to process over 100,000 transactions per second, and store over 10 petabytes of data. Adding AlloyDB for PostgreSQL to the mix has provided KARTE with the ability to answer flexible analytical queries. In addition to the range of queries that KARTE can now handle, AlloyDB has brought in expanded capacity with low-latency analysis in a simplified system. As Yuki Makino, CTO at PLAID pointed out, “With the current (columnar) engine and AlloyDB performance is about 100 times faster than earlier.”

Yuki Makino, in conversation with Sandy Ghai, Product Manager at Google Cloud, says “goodbye, expensive legacy database, hello next-gen PostgreSQL database” here.

Implement a modern database strategy

Transformation hinges on new cloud database capabilities. Whether you want to increase your agility and pace of innovation, better manage your costs, or entirely shut down data centers, we can help you accelerate your move to cloud. From integration into a connected environment, to disruption-free migration, and automation to free up developers for creative work, Google Cloud databases offer unified, open, and intelligent building blocks to enable a modern database strategy.

Download the complimentary 2022 Gartner Magic Quadrant for Cloud Database Management Systems report. 

Learn more about Google Cloud databases.

Learn why customers choose Google Cloud databases in this e-book.

Source : Data Analytics Read More

Built with BigQuery: How Tamr delivers Master Data Management at scale and what this means for a data product strategy

Built with BigQuery: How Tamr delivers Master Data Management at scale and what this means for a data product strategy

Master data is a holistic view of your key business entities, providing a consistent set of identifiers and attributes that give context to the business data that matters most to your organization. It’s about ensuring that clean, accurate, curated data – the best available – is accessible throughout the company to manage operations and make critical business decisions. Having well-defined master data is essential to running your business operations. 

Master data undergoes a far more enriched and refined  process than other types of data captured across the organization. For instance, it’s not the same as the transactional data generated by applications. Instead, master data gives context to the transaction itself by providing the fundamental business objects – like the Customer, Product, Patient, or Supplier – on which the transactions are performed. 

Without master data, enterprise applications  are left with potentially inconsistent data living in disparate systems; with an unclear picture of whether multiple records are related. And without it, gaining essential business insight may be difficult, if not impossible, to  attain: for example, “which customers generate the most revenue?” or “which suppliers do we do the most business with?” 

Master data is a critical element of treating data as an enterprise asset and as a product.. A data product strategy requires that the data remain clean, integrated, and freshly updated with appropriate frequency. Without this additional preparation and enrichment, data becomes stale and incomplete, leading to inability to provide the necessary insights for timely business decisions. Data preparation, consolidation and  enrichment should be a  part of a data product strategy, since consolidating a complete set of external data sources will provide more complete and accurate insights for business decisions.  This data preparation, consolidation and enrichment  requires the right infrastructure, tools, and processes, otherwise it will be an additional burden on already thinly stretched data management teams. . 

This is why it is necessary to adopt and implement a next-generation master data management platform that enables a data product strategy to be operationalized. This in turn enables the acquisition of trusted records to drive business outcomes. 

The Challenge: A Single Source of Truth – The Unified “Golden” Record 

Many companies have built or are working on rolling out data lakes, lake houses, data marts, or data warehouses to address data integration challenges. However, when multiple data sets from disparate sources are combined, there is a high likelihood of introducing problems , which Tamr and Google Cloud are partnering to address and alleviate:

Data duplication: same semantic/physical entity like customer with different keys

Inconsistency: same entity having partial and/or mismatching properties (like different phone numbers or addresses for the same customers)

Reduced insight accuracy: duplicates skew the analytic key figures (like total distinct customers are higher with duplicates than without them)

 Timeliness impact: manual efforts to reach a consistent and rationalized  core set of data entities used  for application input and analytics  cause significant delays in processing and ultimately, decision making


Tamr is the leader in data mastering and next-generation master data management, delivering data products that provide clean, consolidated, and curated data to help businesses stay ahead in a rapidly changing world. Organizations benefit from Tamr’s integrated, turn-key solution that combines machine learning with humans-in-the-loop, a low-code/no-code environment, and integrated data enrichment to streamline operations. The outcome is higher quality data; faster and with less manual work

Tamr takes multiple source records, identifies duplicates, enriches data, assigns a unique ID, and provides a unified, mastered “golden record” while maintaining all source information for analysis and review. Once cleansed, data can be utilized in the downstream analytics and applications, enabling more informed decisions.

A successful data product strategy requires consistently cleaning and integrating data, a task that’s ideal for data mastering. machine-learning based capabilities in a data mastering platform can handle increases in data volume and variety, as well as data enrichment to ensure that the data stays fresh and accurate so it can be trusted by the business consumers. 

With accurate key entity data, companies can unlock the bigger picture of data insights. The term “key” signifies entities that are most important to an organization. For example, for healthcare organizations, this could mean patients and providers; for manufacturers, it could mean suppliers;  for financial services firms, it could mean customers. 

Below are examples of key business entities after they’ve been cleaned, enriched, and curated with Tamr:

Better Together: How Tamr leverages Google Cloud to differentiate their next-gen MDM

Tamr Mastering, a template-based SaaS MDM solution, is built on Google Cloud Platform technologies such as Cloud Dataproc, Cloud Bigtable and BigQuery, allowing customers to scale modern data pipelines with excellent performance while controlling costs.

The control plane (application layer) is built on Google Compute Engine (GCE) to leverage its scalability. The data plane utilizes a full suite of interconnected Google Cloud Platform services such as Google Dataproc for distributed processing, allowing for a flexible and sustainable way to bridge the gap between the analytics powers of distributed TensorFlow and the scaling capabilities of Hadoop in a managed offering. Google Cloud Storage is used for data movement/staging. 

Google Cloud Run, which enables Tamr to deploy containers directly on top of Google’s scalable infrastructure, is used in the data enrichment process. This approach allows serverless deployments without the need to create a stateful cluster or manage infrastructure to be productive with container deployments. Google Bigtable is utilized for data-scale storage, allowing for high throughput and scalability for key/value data. Data that doesn’t fall into the key/value lookup schema is retrieved in batches or used for analytical purposes. Google BigQuery is the ideal storage for this type of data and storage of the golden copy of the data discussed earlier in this blog post. Additionally, Tamr chose BigQuery as their central data storage solution due to the ability of BigQuery to promote schema denormalization with the native support of nested and repeated fields to denormalize data storage and increase query performance. 

On top of that, Tamr Mastering utilizes Cloud IAM for access control, authn/authz, configuration and observability. Deploying across the Google framework provides key advantages such as better performance due to higher bandwidth, lower management overhead, and autoscaling and resource adjustment, among other value drivers, all resulting in lower TCO.

The architecture above illustrates the different layers of functionality. Starting from the top down with the front-end deployment to the core layers at the borrow of the diagram. To scale the overall MDM architecture depicted in the above diagram, efficiently, Tamr has partnered with Google Cloud to focus on three core capabilities: 

Capability One: Machine learning optimized for scale and accuracy

Traditionally, organizing and mastering data in most organizations’ legacy infrastructure has been done using a rules-based approach (if <condition> then <action>). Conventional rules-based systems can be effective on a small scale, relying on human-built logic implemented in the rules to generate master records. However, such rules fail to scale when tasked with connecting and reconciling large amounts of highly variable data. 

Machine learning, on the other hand, becomes more efficient at matching records across datasets as more data is added. In fact, huge amounts of data (more than 1 million records across dozens of systems) provide more signal, so  the machine learning models are able to identify patterns, matches, and relationships, accelerating years of human effort down to days. Google’s high performance per core on Compute Engine, high network throughput and lower provisioning times across both storage and compute are all differentiating factors in Tamr’s optimized machine learning architecture on Google Cloud.

Capability Two: Ensure there is sufficient human input

While machine learning is critical, so is keeping humans in the loop and letting them provide feedback. Engaging business users and subject matter experts is key to building trust in the data. A middle ground where machines take the lead and humans provide guidance and feedback to make the machine – and the results – better is the data mastering approach that delivers the best outcomes. Not only will human input improve machine learning models, but it will also foster tighter alignment between the data and business outcomes that require curated data. 

Capability Three: Enrichment built in the workflow

As a final step in the process, data enrichment integrates internal data assets with external data to increase the value of these assets. It adds additional relevant or missing information so that the data is more complete – and thus more usable. Enriching data improves its quality, making it a more valuable asset to an organization. Combining data enrichment with data mastering means that not only are data sources automatically cleaned, they are also enhanced with valuable commercial information while avoiding the incredibly time-consuming and manual work that goes into consolidating or stitching internal data with  external data.

Below is an example of how these three core-capabilities are incorporated into the Tamr MDM architecture:

Building the data foundation for connected customer experiences at P360

When a major pharmaceutical company approached P360 for help with a digital transformation project aimed at better reaching the medical providers they count as customers, P360 realized that building a solid data foundation with a modern master data management (MDM) solution was the first step. 

“One of the customer’s challenges was master data management, which was the core component of rebuilding their data infrastructure. Everything revolves around data so not having a solid data infrastructure is a non-starter. Without it, you can’t compete, you can’t understand your customers and how they use your products,” said Anupam Nandwana, CEO of P360, a technology solutions provider for the pharmaceutical industry.

To develop that foundation of trusted data, P360 turned to Tamr Mastering. By using Tamr Mastering, the pharmaceutical company is quickly unifying internal and external data on millions of health care providers to create golden records that power downstream applications, including a new CRM system. Like other business-to-business companies, P360’s customer has diverse and expansive data from a variety of sources. From internal data like physician names and addresses to external data like prescription histories and claims information, this top pharmaceutical company has 150 data sources to master in order to get complete views of their customers. This includes records on 1 million healthcare providers (as well as 2 million provider addresses) and records on the more than 100,000 healthcare organizations.

“For the modern data platform, cloud is the only answer. To provide the scale, flexibility and speed that’s needed, it’s just not pragmatic to leverage other infrastructure. The cloud gives us the opportunity to do things faster. Completing this project in a short amount of time was a key criteria for success and that would have only been possible with the cloud. Using it was an easy decision,” Nandwana said. 

With Tamr Mastering, P360 helped their customer master millions of provider records in weeks and create golden records containing unique customer IDs as a consistent identifier and single source of truth. 


Google’s data cloud provides a complete platform for building data-driven applications like Tamr’s MDM solution on Google Cloud. Simplified data ingestion, processing, and storage to powerful analytics, AI, ML, and data sharing capabilities are integrated with the open, secure, and sustainable Google Cloud platform. With a diverse partner ecosystem, open-source tools, and APIs, Google Cloud can provide technology companies with a platform that provides the portability and differentiators they need to build their products and serve the next generation of customers. 

Learn more about Tamr on Google Cloud. 

Learn more about Google Cloud’s Built with BigQuery initiative

We thank the Google Cloud team member who co-authored the blog: Christian Williams, Principal Architect, Cloud Partner Engineering

Source : Data Analytics Read More

Optimize Cloud Composer via Better Airflow DAGs

Optimize Cloud Composer via Better Airflow DAGs

Hosting, orchestrating, and managing data pipelines is a complex process for any business.  Google Cloud offers Cloud Composer – a fully managed workflow orchestration service – enabling businesses to create, schedule, monitor, and manage workflows that span across clouds and on-premises data centers. Cloud Composer is built on the popular Apache Airflow open source project and operates using the Python programming language.  Apache Airflow allows users to create directed acyclic graphs (DAGs) of tasks, which can be scheduled to run at specific intervals or triggered by external events.

This guide contains a generalized checklist of activities when authoring Apache Airflow DAGs.  These items follow best practices determined by Google Cloud and the open source community.  A collection of performant DAGs will enable Cloud Composer to work optimally and standardized authoring will help developers manage hundreds or even thousands of DAGs.  Each item will benefit your Cloud Composer environment and your development process.

Get Started

1. Standardize file names. Help other developers browse your collection of DAG files.
a. ex)

2. DAGs should be deterministic.
a. A given input will always produce the same output.

3. DAGs should be idempotent. 
a. Triggering the DAG multiple times has the same effect/outcome.

4. Tasks should be atomic and idempotent. 
a. Each task should be responsible for one operation that can be re-run independently of the others. In an atomized task, a success in part of the task means a success of the entire task.

5. Simplify DAGs as much as possible.
a. Simpler DAGs with fewer dependencies between tasks tend to have better scheduling performance because they have less overhead. A linear structure (e.g. A -> B -> C) is generally more efficient than a deeply nested tree structure with many dependencies. 

Standardize DAG Creation

6. Add an owner to your default_args.
a. Determine whether you’d prefer the email address / id of a developer, or a distribution list / team name.

7. Use with DAG() as dag: instead of dag = DAG()
a. Prevent the need to pass the dag object to every operator or task group.

8. Set a version in the DAG ID. 
a. Update the version after any code change in the DAG.
b. This prevents deleted Task logs from vanishing from the UI, no-status tasks generated for old dag runs, and general confusion of when DAGs have changed.
c. Airflow open-source has plans to implement versioning in the future. 

9. Add tags to your DAGs.
a. Help developers navigate the Airflow UI via tag filtering.
b. Group DAGs by organization, team, project, application, etc. 

10. Add a DAG description. 
a. Help other developers understand your DAG.

11. Pause your DAGs on creation. 
a. This will help avoid accidental DAG runs that add load to the Cloud Composer environment.

12. Set catchup=False to avoid automatic catch ups overloading your Cloud Composer Environment.

13. Set a dagrun_timeout to avoid dags not finishing, and holding Cloud Composer Environment resources or introducing collisions on retries.

14. Set SLAs at the DAG level to receive alerts for long-running DAGs.
a. Airflow SLAs are always defined relative to the start time of the DAG, not to individual tasks.
b. Ensure that sla_miss_timeout is less than the dagrun_timeout.
c. Example: If your DAG usually takes 5 minutes to successfully finish, set the sla_miss_timeout to 7 minutes and the dagrun_timeout to 10 minutes.  Determine these thresholds based on the priority of your DAGs.

15. Ensure all tasks have the same start_date by default by passing arg to DAG during instantiation

16. Use a static start_date with your DAGs. 
a. A dynamic start_date is misleading, and can cause failures when clearing out failed task instances and missing DAG runs.

17. Set retries as a default_arg applied at the DAG level and get more granular for specific tasks only where necessary. 
a. A good range is 1–4 retries. Too many retries will add unnecessary load to the Cloud Composer environment.

Example putting all the above together:

code_block[StructValue([(u’code’, u’import airflowrnfrom airflow import DAGrnfrom airflow.operators.bash_operator import BashOperatorrnrn# Define default_args dictionary to specify default parameters of the DAG, such as the start date, frequency, and other settingsrndefault_args = {rn ‘owner’: ‘me’,rn ‘retries’: 2, # 2-4 retries maxrn ‘retry_delay’: timedelta(minutes=5),rn ‘is_paused_upon_creation’: True,rn ‘catchup’: False,rn}rnrn# Use the `with` statement to define the DAG object and specify the unique DAG ID and default_args dictionaryrnwith DAG(rn ‘dag_id_v1_0_0′, #versioned IDrn default_args=default_args,rn description=’This is a detailed description of the DAG’, #detailed descriptionrn start_date=datetime(2022, 1, 1), # Static start datern dagrun_timeout=timedelta(minutes=10), #timeout specific to this dagrn sla_miss_timeout=timedelta(minutes=7), # sla miss less than timeoutrn tags=[‘example’, ‘versioned_dag_id’], # tags specific to this dagrn schedule_interval=None,rn) as dag:rn # Define a task using the BashOperatorrn task = BashOperator(rn task_id=’bash_task’,rn bash_command=’echo “Hello World”‘rn )’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e853b802f10>)])]

18. Define what should occur for each callback function. (send an email, log a context, message slack channel, etc.).  Depending on the DAG you may be comfortable doing nothing. 
a. success
b. failure
c. sla_miss
d. retry


code_block[StructValue([(u’code’, u’from airflow import DAGrnfrom airflow.operators.python_operator import PythonOperatorrnrndefault_args = {rn ‘owner’: ‘me’,rn ‘retries’: 2, # 2-4 retries maxrn ‘retry_delay’: timedelta(minutes=5),rn ‘is_paused_upon_creation’: True,rn ‘catchup’: False,rn}rnrndef on_success_callback(context):rn # when a task in the DAG succeedsrn print(f”Task {context[‘task_instance_key_str’]} succeeded!”)rnrndef on_sla_miss_callback(context):rn # when a task in the DAG misses its SLArn print(f”Task {context[‘task_instance_key_str’]} missed its SLA!”)rnrndef on_retry_callback(context):rn # when a task in the DAG retriesrn print(f”Task {context[‘task_instance_key_str’]} retrying…”)rnrndef on_failure_callback(context):rn # when a task in the DAG failsrn print(f”Task {context[‘task_instance_key_str’]} failed!”)rnrn# Create a DAG and set the callbacksrnwith DAG(rn ‘dag_id_v1_0_0′,rn default_args=default_args,rn description=’This is a detailed description of the DAG’,rn start_date=datetime(2022, 1, 1), rn dagrun_timeout=timedelta(minutes=10),rn sla_miss_timeout=timedelta(minutes=7),rn tags=[‘example’, ‘versioned_dag_id’],rn schedule_interval=None,rn on_success_callback=on_success_callback, # what to do on successrn on_sla_miss_callback=on_sla_miss_callback, # what to do on sla missrn on_retry_callback=on_retry_callback, # what to do on retryrn on_failure_callback=on_failure_callback # what to do on failurern) as dag:rnrn def example_task(**kwargs):rn # This is an example task that will be part of the DAGrn print(f”Running example task with context: {kwargs}”)rnrn # Create a task and add it to the DAGrn task = PythonOperator(rn task_id=”example_task”,rn python_callable=example_task,rn provide_context=True,rn )’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e853b355650>)])]

19. Use Task Groups to organize Tasks.


code_block[StructValue([(u’code’, u’# Use the `with` statement to define the DAG object and specify the unique DAG ID and default_args dictionaryrnwith DAG(rn ‘example_dag’,rn default_args=default_args,rn schedule_interval=timedelta(hours=1),rn) as dag:rn # Define the first task grouprn with TaskGroup(name=’task_group_1′) as tg1:rn # Define the first task in the first task grouprn task_1_1 = BashOperator(rn task_id=’task_1_1′,rn bash_command=’echo “Task 1.1″‘,rn dag=dag,rn )’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e853b3553d0>)])]

Reduce the Load on Your Composer Environment

20. Use Jinja Templating / Macros instead of python functions.
a. Airflow’s template fields allow you to incorporate values from environment variables and jinja templates into your DAGs. This helps make your DAGs idempotent (meaning multiple invocations do not change the result) and prevents unnecessary function execution during Scheduler heartbeats.
b. The Airflow engine passes a few variables by default that are accessible in all templates.

Contrary to best practices, the following example defines variables based on datetime Python functions:

code_block[StructValue([(u’code’, u”# Variables used by tasksrn# Bad example – Define today’s and yesterday’s date using datetime modulerntoday = = – timedelta(1)”), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e853b355090>)])]

If this code is in a DAG file, these functions execute on every Scheduler heartbeat, which may not be performant. Even more importantly, this doesn’t produce an idempotent DAG. You can’t rerun a previously failed DAG run for a past date because is relative to the current date, not the DAG execution date.

A better way of implementing this is by using an Airflow Variable as such:

code_block[StructValue([(u’code’, u”# Variables used by tasksrn# Good example – Define yesterday’s date with an Airflow variablernyesterday = {{ yesterday_ds_nodash }}”), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e8554a515d0>)])]

21. Avoid creating your own additional Airflow Variables. 
a. The metadata database stores these variables and requires database connections to retrieve them. This can affect the performance of the Cloud Composer Environment. Use Environment Variables or Google Cloud Secrets instead.

22. Avoid running all DAGs on the exact same schedules (disperse workload as much as possible). 
a. Prefer to use cron expressions for schedule intervals compared to airflow macros or time_deltas. This allows a more rigid schedule and it’s easier to spread out workloads throughout the day, making it easier on your Cloud Composer environment.
b. can help with generating specific cron expression schedules.  Check out the examples here.


code_block[StructValue([(u’code’, u’schedule_interval=”*/5 * * * *”, # every 5 minutes.rnrn schedule_interval=”0 */6 * * *”, # at minute 0 of every 6th hour.’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e8554a516d0>)])]

23. Avoid XComs except for small amounts of data. 
a. These add storage and introduce more connections to the database. 
b. Use JSON dicts as values if absolutely necessary. (one connection for many values inside dict)

24. Avoid adding unnecessary objects in the dags/ Google Cloud Storage path. 
a. If you must, add an .airflowignore file to GCS paths that the Airflow Scheduler does not need to parse. (sql, plug-ins, etc.)

25. Set execution timeouts for tasks.


code_block[StructValue([(u’code’, u”# Use the `PythonOperator` to define the taskrntask = PythonOperator(rn task_id=’my_task’,rn python_callable=my_task_function,rn execution_timeout=timedelta(minutes=30), # Set the execution timeout to 30 minutesrn dag=dag,rn)”), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e8554a517d0>)])]

26. Use Deferrable Operators over Sensors when possible. 
a. A deferrable operator can suspend itself and free up the worker when it knows it has to wait, and hand off the job of resuming it to a Trigger. As a result, while it suspends (defers), it is not taking up a worker slot and your cluster will have fewer/lesser resources wasted on idle Operators or Sensors.


code_block[StructValue([(u’code’, u’PYSPARK_JOB = {rn “reference”: { “project_id”: “PROJECT_ID” },rn “placement”: { “cluster_name”: “PYSPARK_CLUSTER_NAME” },rn “pyspark_job”: {rn “main_python_file_uri”: “gs://dataproc-examples/pyspark/hello-world/”rn },rn}rnrnDataprocSubmitJobOperator(rn task_id=”dataproc-deferrable-example”,rn job=PYSPARK_JOB,rn deferrable=True,rn )’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e8554a518d0>)])]

27. When using Sensors, always define mode, poke_interval, and timeout. 
a. Sensors require Airflow workers to run.
b. Sensor checking every n seconds (i.e. poke_interval < 60)? Use mode=poke. A sensor in mode=poke will continuously poll every n seconds and hold Airflow worker resources. 
c. Sensor checking every n minutes (i.e. poke_interval >= 60)? Use mode=reschedule. A sensor in mode=reschedule will free up Airflow worker resources between poke intervals.


code_block[StructValue([(u’code’, u’table_partition_sensor = BigQueryTablePartitionExistenceSensor(rn project_id=”{{ project_id }}”,rn task_id=”bq_check_table_partition”,rn dataset_id=”{{ dataset }}”,rn table_id=”comments_partitioned”,rn partition_id=”{{ ds_nodash }}”,rn mode=”reschedule”rn poke_interval=60,rn timeout=60 * 5rn )’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e8554a519d0>)])]

28. Offload processing to external services (BigQuery, Dataproc, Cloud Functions, etc.) to minimize load on the Cloud Composer environment.
a. These services usually have their own Airflow Operators for you to utilize.

29. Do not use sub-DAGs.
a. Sub-DAGs were a feature in older versions of Airflow that allowed users to create reusable groups of tasks within DAGs. However, Airflow 2.0 deprecated sub-DAGs because they caused performance and functional issues.

30. UsePub/Subfor DAG-to-DAG dependencies.
a. Here is an example for multi-cluster / dag-to-dag dependencies. 

31. Make DAGs load faster.
a. Avoid unnecessary “Top-level” Python code. DAGs with many imports, variables, functions outside of the DAG will introduce greater parse times for the Airflow Scheduler and in turn reduce the performance and scalability of Cloud Composer / Airflow.
b. Moving imports and functions within the DAG can reduce parse time (in the order of seconds).
c. Ensure that developed DAGs do not increase DAG parse times too much.


code_block[StructValue([(u’code’, u”import airflowrnfrom airflow import DAGrnfrom airflow.operators.python_operator import PythonOperatorrnrn# Define default_args dictionaryrndefault_args = {rn ‘owner’: ‘me’,rn ‘start_date’: datetime(2022, 11, 17),rn}rnrn# Use with statement and DAG context manager to instantiate the DAGrnwith DAG(rn ‘my_dag_id’,rn default_args=default_args,rn schedule_interval=timedelta(days=1),rn) as dag:rn # Import module within DAG blockrn import my_module # DO THISrnrn # Define function within DAG blockrn def greet(): # DO THISrn greeting = my_module.generate_greeting()rn print(greeting)rnrn # Use the PythonOperator to execute the functionrn greet_task = PythonOperator(rn task_id=’greet_task’,rn python_callable=greetrn )”), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e8554a51ad0>)])]

Improve Development and Testing

32. Implement “self-checks” (via Sensors or Deferrable Operators).
a. To ensure that tasks are functioning as expected, you can add checks to your DAG. For example, if a task pushes data to a BigQuery partition, you can add a check in the next task to verify that the partition generates and that the data is correct.


code_block[StructValue([(u’code’, u’# ————————————————————rn # Transform source data and transfer to partitioned tablern # ————————————————————rnrn create_or_replace_partitioned_table_job = BigQueryInsertJobOperator(rn task_id=”create_or_replace_comments_partitioned_query_job”,rn configuration={rn “query”: {rn “query”: ‘sql/create_or_replace_comments_partitioned.sql’,rn “useLegacySql”: False,rn }rn },rn location=”US”,rn )rnrn create_or_replace_partitioned_table_job_error = dummy_operator.DummyOperator(rn task_id=”create_or_replace_partitioned_table_job_error”,rn trigger_rule=”one_failed”,rn )rnrn create_or_replace_partitioned_table_job_ok = dummy_operator.DummyOperator(rn task_id=”create_or_replace_partitioned_table_job_ok”, trigger_rule=”one_success”rn )rnrn # ————————————————————rn # Determine if today’s partition exists in comments_partitionedrn # ————————————————————rnrn table_partition_sensor = BigQueryTablePartitionExistenceSensor(rn project_id=”{{ project_id }}”,rn task_id=”bq_check_table_partition”,rn dataset_id=”{{ dataset }}”,rn table_id=”comments_partitioned”,rn partition_id=”{{ ds_nodash }}”,rn mode=”reschedule”rn poke_interval=60,rn timeout=60 * 5rn )rnrn create_or_replace_partitioned_table_job >> [rn create_or_replace_partitioned_table_job_error,rn create_or_replace_partitioned_table_job_ok,rn ]rn create_or_replace_partitioned_table_job_ok >> table_partition_sensor’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e8554a51bd0>)])]

33. Look for opportunities to dynamically generate similar tasks/task groups/DAGs via Python code.
a. This can simplify and standardize the development process for DAGs. 


code_block[StructValue([(u’code’, u’import airflowrnfrom airflow import DAGrnfrom airflow.operators.python_operator import PythonOperatorrnrndef create_dag(dag_id, default_args, task_1_func, task_2_func):rn with DAG(dag_id, default_args=default_args) as dag:rn task_1 = PythonOperator(rn task_id=’task_1′,rn python_callable=task_1_func,rn dag=dagrn )rn task_2 = PythonOperator(rn task_id=’task_2′,rn python_callable=task_2_func,rn dag=dagrn )rn task_1 >> task_2rn return dagrnrndef task_1_func():rn print(“Executing task 1”)rnrndef task_2_func():rn print(“Executing task 2″)rnrndefault_args = {rn ‘owner’: ‘me’,rn ‘start_date’: airflow.utils.dates.days_ago(2),rn}rnrnmy_dag_id = create_dag(rn dag_id=’my_dag_id’,rn default_args=default_args,rn task_1_func=task_1_func,rn task_2_func=task_2_funcrn)’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e8554a51cd0>)])]

34. Implement unit-testing for your DAGs


code_block[StructValue([(u’code’, u’from airflow import modelsrnfrom airflow.utils.dag_cycle_tester import test_cyclernrnrndef assert_has_valid_dag(module):rn “””Assert that a module contains a valid DAG.”””rnrn no_dag_found = Truernrn for dag in vars(module).values():rn if isinstance(dag, models.DAG):rn no_dag_found = Falsern test_cycle(dag) # Throws if a task cycle is found.rnrn if no_dag_found:rn raise AssertionError(‘module does not contain a valid DAG’)’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e8554a51dd0>)])]

35. Perform local development via the Composer Local Development CLI Tool.
a. Composer Local Development CLI tool streamlines Apache Airflow DAG development for Cloud Composer 2 by running an Airflow environment locally. This local Airflow environment uses an image of a specific Cloud Composer version.

36. If possible, keep a staging Cloud Composer Environment to fully test the complete DAG run before deploying in the production.
a. Parameterize your DAG to change the variables, e.g., the output path of Google Cloud Storage operation or the database used to read the configuration. Do not hard code values inside the DAG and then change them manually according to the environment.

37. Use a Python linting tool such as Pylint or Flake8 for standardized code.

38. Use a Python formatting tool such as Black or YAPF for standardized code.

Next Steps

In summary, this blog provides a comprehensive checklist of best practices for developing Airflow DAGs for use in Google Cloud Composer. By following these best practices, developers can help ensure that Cloud Composer is working optimally and that their DAGs are well-organized and easy to manage.

For more information about Cloud Composer, check out the following related blog posts and documentation pages:

What is Cloud Composer? 

Deutsche Bank uses Cloud Composer workload automation

Using Cloud Build to keep Airflow Operators up-to-date in your Composer environment

Writing DAGs (workflows) | Cloud Composer

Source : Data Analytics Read More

How to do multivariate time series forecasting in BigQuery ML

How to do multivariate time series forecasting in BigQuery ML

Companies across industries rely heavily on time series forecasting to project product demand, forecast sales, project online subscription/cancellation, and for many other use cases. This makes time series forecasting one of the most popular models in BigQuery ML. 

What is multivariate time series forecasting? For example, if you want to forecast ice cream sales, it is helpful to forecast using the external covariant “weather” along with the target metric “past sales.” Multivariate time series forecasting in BigQuery lets you create more accurate forecasting models without having to move data out of BigQuery. 

When it comes to time series forecasting, covariates or features besides the target time series are often used to provide better forecasting. Up until now, BigQuery ML has only supported univariate time series modeling using the ARIMA_PLUS model (documentation). It is one of the most popular BigQuery ML models.

While ARIMA_PLUS is widely used, forecasting using only the target variable is sometimes not sufficient. Some patterns inside the time series strongly depend on other features. We see strong customer demand for multivariate time series forecasting support that allows you to forecast using covariate and features.  

We recently announced the public preview of multivariate time series forecasting with external regressors. We are introducing a new model type ARIMA_PLUS_XREG, where the XREG refers to external regressors or side features. You can use the SELECT statement to choose side features with the target time series. This new model leverages the BigQuery ML linear regression model to include the side features and the BigQuery ML ARIMA_PLUS model to model the linear regression residuals.

The ARIMA_PLUS_XREG model supports the following capabilities: 

Automatic feature engineering for numerical, categorical, and array features.

All the model capabilities of the ARIMA_PLUS model, such as detecting seasonal trends, holidays, etc.

Headlight, an AI-powered ad agency, is using a multivariate forecasting model to determine conversion volumes for down-funnel metrics like subscriptions, cancellations, etc. based on cohort age. You can check out the customer video and demo here.

The following sections show some examples of the new ARIMA_PLUS_XREG model in BigQuery ML. In this example, we explore the bigquery-public-data.epa_historical_air_quality dataset, which has daily air quality and weather information. We use the model to forecast the PM2.51 , based on its historical data and some covariates, such as temperature and wind speed.

An example: forecast Seattle’s air quality with weather information

Step 1. Create the dataset

The PM2.5, temperature, and wind speed data are in separate tables. To simplify the queries, create a new table by joining those tables into a new table “bqml_test.seattle_air_quality_daily,” with the following columns:

date: the date of the observation

PM2.5: the average PM2.5 value for each day

wind_speed: the average wind speed for each day

temperature: the highest temperature for each day

The new table has daily data from 2009-08-11 to 2022-01-31.

code_block[StructValue([(u’code’, u”CREATE TABLE `bqml_test.seattle_air_quality_daily`rnASrnWITHrn pm25_daily AS (rn SELECTrn avg(arithmetic_mean) AS pm25, date_local AS datern FROMrn `bigquery-public-data.epa_historical_air_quality.pm25_nonfrm_daily_summary`rn WHERErn city_name = ‘Seattle’rn AND parameter_name = ‘Acceptable PM2.5 AQI & Speciation Mass’rn GROUP BY date_localrn ),rn wind_speed_daily AS (rn SELECTrn avg(arithmetic_mean) AS wind_speed, date_local AS datern FROMrn `bigquery-public-data.epa_historical_air_quality.wind_daily_summary`rn WHERErn city_name = ‘Seattle’ AND parameter_name = ‘Wind Speed – Resultant’rn GROUP BY date_localrn ),rn temperature_daily AS (rn SELECTrn avg(first_max_value) AS temperature, date_local AS datern FROMrn `bigquery-public-data.epa_historical_air_quality.temperature_daily_summary`rn WHERErn city_name = ‘Seattle’ AND parameter_name = ‘Outdoor Temperature’rn GROUP BY date_localrn )rnSELECTrn AS date, pm25, wind_speed, temperaturernFROM pm25_dailyrnJOIN wind_speed_daily USING (date)rnJOIN temperature_daily USING (date)”), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3eebb79cce90>)])]

Here is a preview of the data:

Step 2. Create Model

The “CREATE MODEL” query of the new multivariate model, ARIMA_PLUS_XREG, is very similar to the current ARIMA_PLUS model. The major differences are the MODEL_TYPE and inclusion of feature columns in the SELECT statement.

code_block[StructValue([(u’code’, u”CREATE OR REPLACErn MODELrn `bqml_test.seattle_pm25_xreg_model`rn OPTIONS (rn MODEL_TYPE = ‘ARIMA_PLUS_XREG’,rn time_series_timestamp_col = ‘date’,rn time_series_data_col = ‘pm25’)rnASrnSELECTrn date,rn pm25,rn temperature,rn wind_speedrnFROMrn `bqml_test.seattle_air_quality_daily`rnWHERErn datern BETWEEN DATE(‘2012-01-01’)rn AND DATE(‘2020-12-31′)”), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3eebb48f11d0>)])]

Step 3. Forecast the future data

With the created model, you can use the ML.FORECAST function to forecast the future data. Compared to the ARIMA_PLUS model, you have to specify the future covariates as an input.

code_block[StructValue([(u’code’, u”SELECTrn *rnFROMrn ML.FORECAST(rn MODELrn `bqml_test.seattle_pm25_xreg_model`,rn STRUCT(30 AS horizon),rn (rn SELECTrn date,rn temperature,rn wind_speedrn FROMrn `bqml_test.seattle_air_quality_daily`rn WHERErn date > DATE(‘2020-12-31′)rn ))”), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3eebb48f1d50>)])]

After running the above query, you can see the forecasting results:

Step 4. Evaluate the model

You can use the ML.EVALUATE function to evaluate the forecasting errors. You can set perform_aggregation to “TRUE” to get the aggregated error metric or “FALSE” to see the per timestamp errors.

code_block[StructValue([(u’code’, u”SELECTrn *rnFROMrn ML.EVALUATE(rn MODEL `bqml_test.seattle_pm25_xreg_model`,rn (rn SELECTrn date,rn pm25,rn temperature,rn wind_speedrn FROMrn `bqml_test.seattle_air_quality_daily`rn WHERErn date > DATE(‘2020-12-31′)rn ),rn STRUCT(rn TRUE AS perform_aggregation,rn 30 AS horizon))”), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3eebb48eb9d0>)])]

The evaluation result of ARIMA_PLUS_XREG is as follows:

As a comparison, we also show the univariate forecasting ARIMA_PLUS result in the following table:

Compared to ARIMA_PLUS, ARIMA_PLUS_XREG performs better on all measured metrics on this specific dataset and date range.


In the previous example, we demonstrated how to create a multivariate time series forecasting model, forecast future values using the model, and evaluate the forecasted results. The ML.ARIMA_EVALUATE and ML.ARIMA_COEFFICIENTS table value functions are also helpful for investigating your model. Based on the feedback from users, the model does the following to improve user productivity.  

It shortens the time spent preprocessing data and lets users keep their data in BigQuery when doing machine learning. 

It reduces overhead for the users who know SQL to do machine learning work in BigQuery.   

For more information about the ARIMA_PLUS_XREG model, please see thedocumentation here.

What’s Next?

In this blogpost, we described the BigQuery ML Multivariate Time Series Forecast model, which is now available for public preview. We also showed a code demo for a data scientist, data engineer, or data analyst to enable the multivariate time series forecast model. 

The following features are coming soon:

Large-scale multivariate time series, i.e., training millions of models for millions of multivariate time series in a single CREATE MODEL statement

Multivariate time series anomaly detection

Thanks to Xi Cheng, Honglin Zheng, Jiashang Liu, Amir Hormati, Mingge Deng and Abhinav Khushraj from the BigQuery ML team. Also thanks to Weijie Shen from the Google Resource Efficiency Data Science team.

1. A measure of air pollution from fine particulate matter

Source : Data Analytics Read More

How to deploy Tink for BigQuery encryption on-prem and in the cloud

How to deploy Tink for BigQuery encryption on-prem and in the cloud

Data security is a key focus for organizations moving their data warehouses from on-premises to cloud-first systems, such as BigQuery. In addition to storage-level encryption, whether using Google-managed or customer-managed keys, BigQuery also provides column-level encryption. Using BigQuery’s SQL AEAD functions, organizations can enforce a more granular level of encryption to help protect sensitive customer data, such as government identity or credit card numbers, and help comply with security requirements.

While BigQuery provides column-level encryption in the cloud, many organizations operate in hybrid-cloud environments. To prevent a scenario where data needs to be decrypted and re-encrypted each time it moves between locations, Google Cloud offers a consistent and interoperable encryption mechanism. This enables deterministically-encrypted data (which maintains referential integrity) to be immediately joined with on-prem tables for anonymized analytics.

To achieve a BigQuery-compatible encryption on-prem, customers can use Tink, a Google-developed open-source cryptography library. BigQuery uses Tink to implement its SQL AEAD functions. We can use the Tink library directly to encrypt data on-prem in a way that can later be decrypted using BigQuery SQL in the cloud, and decrypt BigQuery’s column level-encrypted data outside of BigQuery. 

For our customers who want to use Tink with BigQuery, we have put together a few helpful Python utilities and samples in the BigQuery Tink Toolkit GitHub repo. Let’s first walk through an example of how to use Tink directly to encrypt or decrypt on-prem data using the same keyset used for BigQuery, followed by how the BigQuery Tink Toolkit can help simplify working with Tink. 

To start, we need to retrieve the Tink keyset. We’ll assume that KMS-wrapped keysets are being used. These keysets need to be stored in BigQuery to use with BigQuery SQL.  If needed, they can also be replicated to a secondary store on-prem.

code_block[StructValue([(u’code’, u’from import bigquery, kmsrnrnbq_client = bigquery.Client()rnquery_job = bq_client.query(“””SELECT kms_resource_path, wrapped_keyset, associated_data FROM `my-keysets-table` WHERE column_name = “my-pii-column”;”””)rnresult_row = query_job.result()’), (u’language’, u’lang-py’), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3ec785035250>)])]

Now that we have the encrypted keyset, we need to unwrap it to retrieve the usable Tink keyset. If Cloud KMS is not accessible from on-prem, the unwrapped keyset will need to be maintained in a secure keystore on-prem.

code_block[StructValue([(u’code’, u’kms_client = kms.KeyManagementServiceClient()rndecrypted_keyset_obj = kms_client.decrypt(rn {rn “name”: result_row.kms_resource_path.split(“gcp-kms://”)[1],rn “ciphertext”: result_row.wrapped_keyset,rn }rn)rnkeyset = decrypted_keyset_obj.plaintext’), (u’language’, u’lang-py’), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3ec785035290>)])]

We can now use the keyset to generate a Tink primitive. This can be used to encrypt or decrypt data with the associated keyset. Note that different primitives should be used depending on whether the keyset is for a deterministic or nondeterministic key.

code_block[StructValue([(u’code’, u’import tinkrnfrom tink import aead, cleartext_keyset_handle, daeadrnrnbinary_keyset_reader = tink.BinaryKeysetReader(keyset)rnkeyset_handle = If using a determinisitic keyset:rndaead.register()rncipher = keyset_handle.primitive(daead.DeterministicAead)rnrn# If using a nondeterministic keyset instead:rnaead.register()rncipher = keyset_handle.primitive(aead.Aead)’), (u’language’, u’lang-py’), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3ec785036890>)])]

Once we have our cipher, we can use it to encrypt or decrypt data as needed.

code_block[StructValue([(u’code’, u’plaintext = “Hello world!”rnassociated_data = result_row.associated_datarn# To encrpyt:rn# If using a determinisitic keysetrnciphertext = cipher.encrypt_deterministically(rn plaintext.encode(), associated_data.encode()rn )rnrn# If using a nondeterministic keyset insteadrnciphertext = cipher.encrypt(plaintext.encode(), associated_data.encode())rnrnrn# To decrypt:rn# If using a determinisitic keysetrnplaintext = cipher.decrypt_deterministically(rn ciphertext, associated_data.encode()rn )rnrn# If using a nondeterministic keyset insteadrnplaintext = cipher.decrypt(rn ciphertext, associated_data.encode()rn )’), (u’language’, u’lang-py’), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3ec785036710>)])]

We have provided the CipherManager class to help simplify this process, which handles four actions:

Retrieving the required keysets from a BigQuery table

Unwrapping those keysets

Creating a Tink cipher for each column

Providing a consistent interface to call encrypt and decrypt. 

We have also included a sample Spark job that shows how to use CipherManager to encrypt or decrypt columns for a given table. We hope these come in handy – happy Tinkering.

Source : Data Analytics Read More

Log Analytics in Cloud Logging is now GA

Log Analytics in Cloud Logging is now GA

Solving big problems usually takes a combination of the right people and the right tools. SRE, DevOps, and IT operations teams in organizations both big and small have used Google Cloud’s built-in logging service, Cloud Logging, to troubleshoot faster, recognize trends easier, and scale operations more effectively. Additionally, customers have been building homegrown solutions that combine the powers of BigQuery and Cloud Logging to help them address operational and security challenges at massive scale. Last year we introduced Log Analytics, powered by BigQuery, so more customers can bring logs and advanced analysis together without having to build the connection themselves.  

Today, we are announcing the general availability of Cloud Logging’s Log Analytics (powered by BigQuery), a capability that allows you to search, aggregate and transform all log data types including application, network and audit log data at no additional cost for existing Cloud Logging customers. We are also launching three new Log Analytics capabilities. 

Multi-region support for US and EU region

Improved query experience to save and share queries

Support for custom retention up to 10 years

To get started, upgrade your existing log buckets to Log Analytics supported buckets. 

Same logs, same cost, more value with Log Analytics 

Log Analytics brings entirely new capabilities to search, aggregate, or transform logs at query time directly into Cloud Logging with a new user experience that’s optimized for analyzing logs data. 

Centralized logging –  By collecting and centrally storing the log data in a dedicated Log Bucket, it allows multiple stakeholders to manipulate their data from the same datasource. You don’t need to make duplicate copies of the data.

Reduced cost and complexity – Log Analytics allows reuse of data across the organization, effectively saving cost and reducing complexities. 

Ad hoc log analysis – It allows for ad-hoc query-time log analysis without requiring complex pre-processing. 

Scalable platform – Log Analytics can scale for observability using the serverless BQ platform and perform aggregation at petabyte scale efficiently

By leveraging BigQuery, Log Analytics breaks down data silos helping security, networking, developer and even business teams collaborate using a single copy of data. 

New features in this release

1. Multi-region support for Log Analytics buckets 

In addition to GA, we are also announcing multi-region support for Log Analytics with log buckets in the US and EU. These new multi-regions are available for log buckets that use Log Analytics and for those that don’t.  This means that you can now store and analyze your logs in the region that is most convenient for you, improving performance and reducing latency. 

2. Improved query experience

We are also improving the query experience by allowing users to save, share and re-use recent queries. This means that you can easily reuse and share your most important queries, saving time and making it easier to get the insights you need from your logs.

Log Analytics feature: Save & Share Query

3. Retain logs up to 10 years in a Log Analytics bucket

We are rolling out the ability to support custom log retention. You can now store logs in the Log Analytics supported bucket for beyond 30 days. Standard custom log retention pricingwill apply. 

Get started today

Now that Log Analytics is Generally Available, you can upgrade your log buckets to use Log Analytics and know that it’s covered under the Cloud Logging SLA. Upgrade your log bucket today to start taking advantage of Log Analytics. 

If this is the first time you’re hearing about Log Analytics, we’ve got you covered with some materials to get you up to speed. Take a look at our blog the top 10 reasons to get started with Log Analytics, watch a recent on-demand information session we did aimed at developers, and learn more about the overall challenges we’re helping you solve in this video: Streamline software development with better insights and diagnostics.

Source : Data Analytics Read More

Self-service analytics finally gets real for the Connected Sheet

Self-service analytics finally gets real for the Connected Sheet

Today’s workplace is getting more and more data intensive. This is why, even in a modern data stack world full of incredible business intelligence tools, “good ol’ reliable” spreadsheets are still commonly used to wrangle data problems big and small. 

Data is everywhere, so why aren’t insights?

While companies recognize the value data can provide for their business, tapping into it isn’t always easy. Spreadsheets are a fantastic tool for users to start turning raw data into something meaningful, which is why knowing how to use spreadsheets for data analysis remains critically important for users across industries — it’s the top use case cited by 63% of users when it comes to their sheets. And yet, more than half of users say they still struggle to make sense of their data in sheets. For every power user or data analyst who knows how to make the most of spreadsheets, there are countless business users, each with their own specific questions.  

The increasing demands on today’s knowledge workers, the new decision makers, are only exacerbating these challenges. Not only do you need the skills to analyze and act on data quickly, but you need live access to this data so you’re making decisions based on up-to-date information in the moment, not what was happening earlier today, yesterday, or last week. 

With the industry adopting cloud data warehouses and the rise of the data cloud, we are witnessing the rise of the connected sheet. Google Cloud BigQuery makes it easy for any information worker to tap into Google’s data cloud, leveraging the ease and use of a Connected Sheets experience. 

But let’s face it, spreadsheet users come in all shapes and sizes. Not everyone has the skills of an analytics engineer. You’ve got everyone from spreadsheet experts who can perform VLOOKUPS in the blink of an eye, to the humble beginner just mastering how to write a simple SUM() formula. The truth is spreadsheets are everywhere, but aren’t for everyone. Simply bringing them to the cloud isn’t going to fix that. 

As one of our ThoughtSpot customers said, “Sheets and Excel are pretty powerful, but not easy for everyone to use. It’s one thing to get the data into sheets, but once it’s there it’s difficult to drill, ask ad-hoc questions, or simply build charts and visualizations.”

Simply put: Spreadsheets can be hard. Creating a chart or data visualization from spreadsheets can be even more challenging. Drilling into spreadsheet data can be complicated, even for power users.

But it doesn’t have to be. We teamed up with Google Cloud to invent something new to bring the power of Connected Sheets to everyone. Introducing ThoughtSpot for Google Sheets.

Unleashing insights from data in sheets

ThoughtSpot for Sheets is a standalone native app plugin for Google Sheets that brings a true self-service analytics experience to your sheets data. With ThoughtSpot for Sheets, creating charts and visualizations is as simple as searching. All you need is some data in Sheets and a curious mind.

Let’s take a look.

Bringing data to Connected Sheets

There’s a plethora of ways to get data into Sheets. Many savvy users are turning to popular data integration tools like Supermetrics and to connect live data from their business apps to Google Sheets for ad-hoc analysis. Other people like a quick and dirty copy and paste. However, with so much business data moving to data clouds, you can now query the trove of customer data in BigQuery through Sheets.

Getting started

ThoughtSpot for Sheets is simple to launch. Do a quick search on Google Workspace Marketplace or find it here. A simple click and you’ve installed the app. Depending on your  organization, you might have to send a brief friendly message to authorize the plugin from your Google Workspace Admin.

With ThoughtSpot installed, you’re ready to connect to some data — just go to the Data tab and select “Connect to BigQuery”. Now you can start to explore tables in your business data set or connect to some of the public data sets available from Google Cloud’s Analytics Hub.

Once your Sheet is connected, the fun really begins. 

No data modeling required. That’s right. No star schemas or snowflake schemas. Just insights. 

With one click, you’ll be able to leap over the biggest hurdle people face with data. Simply select ThoughtSpot from your Extensions menu, and the app handles the rest. Depending on the data size, in just a few moments the app automatically reads, understands, and categorizes your data into different types (DATES, MEASURES, ATTRIBUTES) all in a easy to use modal that sits right on top of your Sheet for easy exploration.

Create content with search

Once data is in Sheets, ThoughtSpot makes it easy to ask questions of your data. Using the search bar or selecting from the side panel, you can start to create new visuals automatically. The user experience was designed for the data explorer, and the data explorer in all of us just waiting to be unleashed. The search engine acts as a powerful compiler and gives you access to unlimited combinations of searches to uncover opportunities hidden in your Sheets. With search you can call up endless combinations of measures, attributes, and even filters in your data, so no stone is left unturned.

We even built in powerful keywords to take your analysis further, like “top” which generates the top n items from a sorted result. If your search has more than one measure, ThoughtSpot generates the top n items from the first measure in the search.

With search you can even travel through time with ease. The search engine lets you easily aggregate your data daily, weekly, monthly or get even more granular with your time slices like [last 2 weeks], [last 35 days], etc.

Drill anywhere

True self-service BI should be limitless, without making drilling into the data difficult. So we put the data search engine to work again to let you drill anywhere in your data with a single click. Say “so long” to VLOOKUPS. 

Don’t worry, you won’t get lost. We built in easy internal navigation that lets you easily undo/redo your query so you never lose track of that aha-moment.

Data storytelling with Slides & Sheets

For almost every user we surveyed, making sensible charts and visualizations from their spreadsheet data is a requirement, especially for data storytelling, yet more than half say doing so is challenging. In addition to being a valuable companion for Google Sheets, the app also integrates with Google Slides to make sharing, collaboration, and presenting data a breeze. When you are happy with the chart or visualization you created, you can Pin to Slides which then lets you create a new presentation or add to an existing deck.

Putting ThoughtSpot for Sheets to work

So in theory, this all sounds great. But it’s even more exciting when applied to the real world. 


Whether Salesforce, Gong, Outreach, Qualified, or Drift, there are so many data sources to help give sales professionals the edge.

With ThoughtSpot for Sheets, you have easy to use sales analytics to manage pipeline health, go deeper with prospecting, or make sure your next QBR with your customer or manager is grounded in actionable data.


Google Sheets has become the workhorse of any marketer. Data from Google Ads, web analytics, HubSpot, and advertising spend all find their way into Sheets, begging for marketing analytics so users can understand customer and user trends, optimize programs, and uncover the next winning campaign.


Universities all over the world leverage Google Workspace as their productivity suite of choice. Students and researchers now have access to a powerful self-service experience, conveniently right in the Google Sheets they already use.

Explore public datasets available from BigQuery and Analytics Hub from crypto and census data to geopolitical ad spend to boost data fluency and research projects.

Spreadsheets 2.0 and the dawn of the Connected Sheet is upon us. Time to get more with your new best friend ThoughtSpot. Install the app free, and let us know what you think.

Happy data exploring.

Source : Data Analytics Read More

Built with BigQuery: How to accelerate data-centric AI development with Google Cloud and Snorkel AI

Built with BigQuery: How to accelerate data-centric AI development with Google Cloud and Snorkel AI

In Deloitte’s annual “State of AI in the Enterprise” survey, 94% of business leaders identified AI as critical to their organizations’ success over the next five years. That survey also uncovered a 29% increase in the number of organizations struggling to achieve meaningful AI-driven business outcomes. Part of this challenge lies in the ability to capitalize on existing data, in its various formats spread throughout the organization. For example, up to 80% of enterprise information assets are scattered across the organization in text, PDFs, emails, web pages, and other unstructured formats. This includes a wealth of valuable insights embedded within contracts, buried within patient files, recorded in chat transcripts, noted in EHR/CRM text fields, and present in other formats. This wealth of unstructured data is often untapped, as some business leaders may be unaware of the value or unsure how to leverage it.

Challenges: The need to put unstructured data to use more rapidly 

Accessing data across various locations and file types and then operationalizing that data for AI usage is usually a cumbersome, manual, time-consuming, and costly process. Individually labeling files to build an adequate dataset to train a machine learning (ML) model is notoriously slow, while human errors and inconsistencies also tend to degrade data quality and negatively impact ML model performance. 

Often, analyzing enterprise data requires the expertise of analysts, clinicians, lawyers or other domain specific experts. In highly-regulated industries such as financial services and healthcare, privacy regulations, standards, and other access restrictions make the challenges posed in using unstructured data proportionally higher. 

Solution approach

Snorkel AI has teamed with Google Cloud to help organizations transform raw, unstructured data into a format that can be used to train actionable AI-powered models for insights and decision making. By combining Google Cloud services such as BigQuery and Vertex AI with Snorkel AI’s data-centric AI platform for programmatic data curation and preparation, organizations can accelerate AI development 10-100x [1]. Tapping into the value of unstructured data stored in BigQuery and making that data ready for ML training empowers enterprises to incorporate all types of data for training AI models.

Snorkel AI’s data-centric approach unlocks new ways of preparing ML training workloads 

Snorkel AI addresses one of the biggest blockers to preparing data for AI development: the massive hand-labeled training datasets needed to prepare data for supervised training of ML models. Snorkel AI overcomes this bottleneck through using a programmatic labeling approach implemented in Snorkel Flow, a novel data-centric AI platform.

Leveraging business logic, and using foundation models as a means of generating labels, data science and ML teams can use Snorkel Flow’s labeling functions to programmatically label data using various sources, including previously-labeled datasets that may have been poorly labeled while encoding knowledge or heuristics from subject matter experts. Snorkel Flow can leverage these multiple data and knowledge sources to label large quantities of unstructured data at scale. 

In addition to data scientists, other users in the ML lifecycle, such as ML engineers, can leverage Snorkel Flow to rapidly improve training data quality and model performance using integrated error analysis and model-guided feedback mechanisms to develop more accurate AI applications.

The data-centric AI workflow within Snorkel Flow operates as follows:

Data scientists, ML engineers, and subject matter experts programmatically label large amounts of data in minutes to hours by creating labeling functions.

Upon creating labeling functions, Snorkel Flow generates a probabilistic labeled dataset that is used to train a model within the platform.

Next, data scientists use guided error analysis to analyze the model’s performance deficits. They look for the gaps that facilitate creation of more targeted and relevant labeling assignments. In other words, data scientists and other users specifically work on places where the model is most wrong, or on particular high-value examples, or on commonly confused classes of data.

Next, users collaboratively iterate on these gaps with internal experts, refining or adding labeling functions as needed to label even more data with which they can again feed into the model for analysis.

Users repeat this iteration even after deploying a model and monitoring a slice of production data.

As a result of this loop, the metrics improvements in an AI application are often orders-of-magnitude greater than what can be achieved with model-centric AI and hand-labeled data.

Solution details

Unified access to data stored on Google Cloud

With training data curation and preparation unblocked via programmatic labeling of unstructured data, data scientists can harness the full power of Google’s end-to-end BigQuery ML and/or Vertex AI platforms to fast-track the development of analytics and AI applications. Google Cloud customers can easily deploy Snorkel Flow on their Google Cloud infrastructure using Google Kubernetes Engine (GKE), then consume unstructured, semi-structured or structured data from Google Cloud data services such as BigQuery and Google Cloud Storage (GCS). See the below figure for data sources and integrations. 

BigQuery is a serverless, cost-effective, and cross-cloud analytics data warehouse built to address the needs of data-driven organizations. BigQuery breaks down silos across clouds, allowing enterprises to centralize all of their data – structured, semi-structured, and unstructured – in a single secure repository. BigQuery support for unstructured data management includes built-in capabilities to secure, govern, and share unstructured data.

Snorkel Flow + Google Cloud BigQuery 

The Snorkel Flow platform integrates natively with BigQuery to streamline and simplify AI development:

With a few clicks, data scientists can immediately pull relevant data from BigQuery directly into Snorkel Flow using the integrated BigQuery connector. 

Data can then be labeled programmatically using a data-centric AI workflow in Snorkel Flow to quickly generate high-quality training sets over complex, highly variable data. Snorkel Flow includes templates to classify and extract information from unstructured text, native PDFs, richly formatted documents, HTML data, conversational text, and more.

Newly labeled datasets can then be used to either train custom ML models or fine-tune pre-built models.

Labeled data can be loaded back into the BigQuery environment as structured data.

Real-world impact

Top U.S. banks, healthcare, insurance, and other Fortune 500 organizations have used Snorkel Flow to extract information from complex documents such as 10-K reports, clinical trial protocols, technical manuals, rent rolls, legal contracts, and more. One Fortune 500 telecom provider and long-time Google Cloud customer, for example, uses Snorkel Flow to classify encrypted network data flows into key application categories. Using Snorkel Flow’s comprehensive data exploration and error analysis tools, the telco successfully trained 200,000 labels in a matter of hours, achieving 25% better accuracy compared to an internal ground truth baseline.  

Google and Snorkel AI have collaborated on a Snorkel research project for Google’s internal use. Google used early versions of Snorkel’s core technology to tackle data labeling for content, product, and event classification problems that were not amenable to manual labeling due to the rapid variations in the labels. Using Snorkel, Google condensed a six month process involving thousands of hand-labeled examples into just 30 minutes and built content classification models that achieved an average performance improvement of 52% [2, 3, 4].

Better together: Snorkel AI + Google Cloud

Together, Google Cloud and Snorkel AI enable Fortune 500 enterprises, federal agencies, and other AI innovators to operationalize unstructured data to build and and accelerate AI applications to solve their most critical challenges 

To learn more, schedule a custom demo tailored to your use case with Snorkel AI ML experts or watch one of the below recent presentations: 

Accelerate AI development by eliminating the pain of manual labeling, delivered by Snorkel AI co-founder Henry Ehernberg as part of a Google Cloud BigQuery Innovation event

Promises and Compromises of Responsible Generative AI Model Adoption in the Enterprise, delivered by Google Director, Cloud Partner Engineering, Dr. Ali Arsanjani at Snorkel’s Foundation Model Summit 

[1] Snorkel AI documented customer results reflect 45x, 52%, 98% and similar improvements vs hand-labeling 
[2] Case study on Google’s use of Snorkel’s core technology:
[3] Harnessing Organizational Knowledge for Machine Learning:
[4] Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale:

Source : Data Analytics Read More

Run data science workloads without creating more data silos

Run data science workloads without creating more data silos

For organizations, it is important to build a data lake solution that offers flexible governance and the ability to break data silos while maintaining a simple and manageable data infrastructure that does not require multiple copies of the same data. This is particularly true for organizations trying to empower multiple data science teams to run workloads like demand forecasting or anomaly detection on the data lake. 

A data lake is a centralized repository designed to store, process, and secure large amounts of structured, semistructured, and unstructured data. It can store data in its native format and process any variety of it, ignoring size limits. For example, many companies have matrix structures, with specific teams responsible for some geographic regions while other teams are responsible for global coverage but only for their limited functional areas. This leads to data duplication and the creation of new data silos.

Managing distributed data at scale is incredibly complex. Distributed teams need to be able to own their data without creating silos, duplication, and inconsistencies. Dataplex allows organizations to scale their governance and introduce access policies that enable teams to operate on the portion of the data that is relevant to them.  

Google Cloud can support your data lake modernization journey no matter where you are with people, processes, and technology. BigLake allows Google customers to unify their data warehouses and data lakes. Dataproc empowers distributed data science teams in complex organizations to run workloads in Apache Spark and other engines directly on the data lake while respecting policies and access rules.

This blog will show how Dataproc, Dataplex, and BigLake can empower data teams in a complex organizational setting, following the example of a global consumer goods company that has finance teams organized geographically. At the same time, other functions, such as marketing, are global. 

Organizations are complex, but your data architecture doesn’t need to be

Our global consumer goods company has centralized their data in a data lake, and access policies ensure that each of their regional finance team has access only to the data that pertains to the appropriate location. While having access to global data, the marketing team does not have access to sensitive financial information stored in specific columns.

Dataproc with personal authentication enables these distributed teams to run data science and data engineering workloads on a centralized BigLake architecture with governance and policies defined in Dataplex.

BigLake creates a unified storage layer for all of the data and extends the BigQuery security model to file-based data in several different formats on Google Cloud and even on other clouds. Thanks to Dataproc, you can process this data in open-source engines such as Apache Spark and others.

In this example, our global consumer goods company has a centralized file-based repository of sales data for each product. Thanks to BigLake, this company can map these files in their data lake to tables, apply row and column level security and, with Dataplex, manage data governance at scale. For the sake of simplicity, let’s create a BigLake table based on a file stored in Cloud Storage containing global ice cream sales data.

As seen in the architecture diagram above, BigLake is not creating a copy of the data in the BigQuery storage layer. Data remains in Cloud Storage, but BigLake allows us to map it to the BigQuery security model and apply governance through Dataplex.

To satisfy our business requirement to control access to the data on a geographical basis, we can leverage row-level access policies. Members of the US Finance team will only have access to US data, while members of the Australia Finance team will only have access to Australian data.

Dataplex allows us to create policy tags to prevent access to specific columns. In this case, a policy tag called “Business Critical: Financial Data” is associated with discount and net revenue so that only finance teams can access this information.

Data Science with Dataproc on BigLake data

Dataproc allows customers to run workloads in several open-source engines, including Apache Spark. We will see in the rest of this blog how users can leverage Dataproc personal authentication to run data science workloads on Jupyter notebooks directly on the data lake, leveraging the governance and security features provided by BigLake and Dataplex.

For example, a member of the Australia finance team can only access data in their geographical area based on the row-level access policies defined on the BigLake table. Below, you can see the output of a simple operation reading the data from a Jupyter notebook running Spark on a Dataproc cluster with personal authentication:

As a reminder, even if we use the BigQuery connector to access the data via Spark, the data itself is still in the original file format on Cloud Storage. BigLake is creating a layer of abstraction that allows Dataproc to access the data while respecting all the governance rules defined on the data lake.

This member of the Australia finance team can leverage Spark to build a sales forecasting model, predicting sales of ice cream in the next six months:

Now, suppose a different user who is a member of the US Finance team tries to run a similar forecasting of ice cream sales based on the data she has access to, given the policies defined in BigLake and Dataplex. In that case, she will get very different results:

Sales of ice cream in the United States are expected to decline, while sales of ice cream in Australia will increase, all due to the different seasonal patterns in the Northern and Southern hemispheres. More importantly, each local team can independently operate on their regional data stored in a unified data lake, thanks to Dataplex on BigLake tables’ policies and Dataproc’s ability to run workloads with personal authentication.

Finally, users in the Marketing department will also be able to run Spark on Jupyter notebooks on Dataproc. Thanks to policy tags protecting financial data, they can only leverage the columns they have the right to access. For example, despite not having access to discount and revenue data, a marketing team member could leverage unit sales information to build a segmentation model using k-means clustering in Apache Spark on Dataproc.

Learn More

In this blog, we saw how Dataproc, BigLake, and Dataplex empower distributed data science teams with fine-grained access policies, governance, and the power of open-source data processing frameworks such as Apache Spark. To learn more about open-source data workloads on Google Cloud and governance at scale, please visit:

Create a lake in Dataplex

Create and manage BigLake tables

Dataproc Serverless Spark

Dataproc personal cluster authentication

Use policy tags to control column access in BigLake

Source : Data Analytics Read More

Built with BigQuery: Solving scale and complexity in retail pricing with BigQuery, Looker, Analytics Hub and more

Built with BigQuery: Solving scale and complexity in retail pricing with BigQuery, Looker, Analytics Hub and more

Context and Background

Maintaining optimal pricing in the retail industry can be challenging when relying on manual processes or platforms that are not equipped to handle the complexity and scale of the task. The ability to quickly adapt to factors affecting pricing has also become a critical success factor for retailers when pricing their products.

As a whole, the retail industry is working to absorb and respond to changes in the way customers buy and receive products, and to how pricing affects competitive advantage. For example, within some vertical domains, customers’ expectations for the price of products differ when buying online as compared to in-store, but are expected to align in others. Additionally, because of the ease with which shoppers can assess the prices across competitors, retailers are looking for ways to retain their most valuable customers; loyalty pricing, private label strategies, and bespoke promotional offers are seen as key aspects of possible solutions in this regard.

Whether in an everyday, promotional or clearance sense, the need to maintain optimal pricing requires a forward-looking mechanism that employs AI/ML capabilities based on multiple sources of input to provide prescriptive decision-making. Such an AI/ML platform can encourage specific buying behaviors aligned to a retailer’s strategy, for example, to rebalance inventory, expand basket sizes or increase private label brand sales. A pricing process can be thought of as having four stages:

Transfer and processing of information about a retailer’s operational structure, customer behavior related to its products, and other elements impacting supply and demand

Synthesis of information that represents the relationships between prices for products available across a retailers’ sales channels and business outcomes vis-a-vis financial metrics

Decision making, about price-related activities such as price increases / decreases or promotional offers, driven by human or software systems

Actions to actualize pricing decisions and to inform stakeholders affected by changes, driven by human or software systems 

In the Solution Architecture section we will examine these stages in detail. 

Use-cases: Challenges and Problems Resolved

The complexity of pricing in the retail industry, particularly for Fast Moving Consumer Goods (FMCG) retailers, can be significant. These retailers often have over 100,000 items in their assortments being sold at thousands of stores, and must also consider the impact of online shopping and customer segmentation on pricing decisions. Different buying behaviors across these dimensions can affect the recommendations of a pricing system, and it is important to take them into account in order to make accurate and effective pricing recommendations. An AI/ML-driven platform can provide greater agility and manage complexity to make more informed pricing decisions.

Speed is a critical factor in the retail industry, particularly for retailers selling Specialty Goods who face intense competitive pricing pressure in certain key products. In this environment, the ability to respond quickly to changes in the market and customer demand can be the difference between staying relevant and losing business to competitors. Automation using AI/ML, enabling real-time, on-demand price changes and promotions is a key factor in the evolving retail industry, particularly in the context of ecommerce and digital in-store systems like Electronic Shelf Labels (ESLs). These systems provide on-demand price changes and promotions that can positively alter customer behavior by increasing the basket size during a session. 

To make this possible, the decision-making and delivery mechanisms behind these systems need to be driven by a flexible, programmatically accessible AI/ML engines that learn and adapts over time. 

High level Architecture of Revionics Using GCP

Revionics’ product, Platform Built for Change, is a new platform that aims to address the significant changes occurring in the retail market by providing a flexible, scalable, intelligent and extensible solution for managing pricing processes. A foundational design principle for the platform is that it can be easily adapted, through configuration rather than code changes, to support a wide range of approaches and states of maturity in pricing practices. By externalizing dependencies of changes from the underlying code, the platform allows retailers to make changes more easily and quickly adapt to new requirements.

Figure 1: Revionics Solution on Google Cloud

The above diagram shows Revionics solution where we can see GCP serverless technologies used across all different layers, from ingestion to export. The key services used are:

Data Storage: Google BigQuery, GCS (blob storage), MongoDB Atlas

Data Processing: Google BigQuery, Google DataProc, Google Dataflow

Data Streaming: Kafka, Google PubSub

Orchestration: Cloud Composer (Airflow), Google Cloud Functions

Containerization & Infra Automation: GKE (Kubernetes)

Analytics: Google Looker

Data Sharing: Google Analytics Hub

Observability: Google Cloud Logger, Prometheus, Grafana

Solution Architecture

We will discuss the key problems, challenges to solve and how the various stages of the solution; Ingest, Process, Sync and Export have enabled Revionics to address the need for speed, scale, and automation, all while solving increasing complexity and evolving challenges in Retail pricing.

1. Transfer and processing of information

As Revionics is a SaaS provider, supporting retailers at the first stage of the pricing process – the transfer and processing of information – essentially boils down to overcoming one major challenge: wide variability. In our domain, variability comes from several data sources shown in Figure 2 below:

Each retailer’s pricing practice and technical environment exist in various states of maturity and sophistication

Entire sources of data may be included in some cases, but excluded in others

API usage and streaming may be plausible with some customers, whereas SFTP transfers are the only available means for others

Data quality, completeness and correctness vary by retailer according their upstream processes 

Both large batch and near real-time use cases need to be supported

The Data transformation logic that feeds into science modeling will differ according to a combination of grouping and configuration choices based on the retailer’s operations and objectives. Essentially, there is no single “golden data pipeline”.

Figure 2: The workflow of aggregating multiple data feeds for pricing

In order to describe how the variability challenges are addressed, let’s drill into the Ingest and Processing portions of the architecture in the above diagram of a Customer Workflow example. There are three primary concepts: 

a feed is a representation of a streamed or scheduled batch data source which has a number of methods for handling file formats and for plugging into various data technologies

a pipeline represents a combination of transformations taking in feeds and other pipelines

a DAG (directed acyclic graph) is generated by the configuration or wiring together of feeds and pipelines as well as supporting methods that execute validation and observability tasks. The generated DAG represents the full workflow for ingesting and processing information in preparation for Revionics’ Science platform 

Let’s explore the benefits of the solution that leverages GCP, as depicted in the DAG Generation flow diagram in Figure 3, below:

The logical flow is a combination of Templates, Configuration and a Library of modular processing methods to generate workflows for ingesting, combining, transforming and composing data in a variety of ways. These are abstracted in a set of human readable configurations, simplifying the setup and support accessible to non-engineers. The DAG Generator outputs a JSON file for the entire workflow that can easily be understood.

The Data Platform natively delivers support capabilities such as validations and observability. Configurable validation checks are insertable at various levels for inspecting schema, looking for anomalies, and running statistical checks. Similarly, event logs, metrics and traces are collected by Cloud Logger and consolidation within BigQuery, to easily explore building dashboards or building ML models.

From an architectural standpoint, there is very minimal intervention needed to scale, operate and manage infrastructure. The workflow logic is represented in an execution agnostic fashion by a DAG JSON file that orchestrates method calls and artifact creation such as Tables, Views, Stored Procedures etc. The DAGs become instruction sets for Airflow, ultimately executed on a Composer Cluster (serverless). DataProc reads from GCS or SQL procedures on BigQuery to do all the heavy lifting in terms of combining data, aggregating, ML feature prep, etc.

Figure 3: Logical DAG Generation Flow

The bulk of event-driven processing is achieved using Cloud Functions or DataFlow, PubSub or Kafka and Dataflow. The architecture provides a high level of reliability and zero degradation in performance as requirements for scale and speed vary. The platform thus provides the ability for the team to focus on building novel approaches to challenging problems and being cost conscious on infrastructure spend. 

2. Synthesis of information (balancing scale and skill)

At the heart of Revionics’ pricing process is an AI/ML engine used to synthesize the combined data signals that express the sources and contexts for demand of retail goods over zones, stores and customer segments. 

Trained models learn to explain the influence of a multitude of features, such as seasonal effects, inflation trends, cannibalization across products, and competitive pricing on the quantity of products sold at a particular location and/or to a particular customer segment. These models are then used to forecast the impact of changes in price or in the structure of promotional offers given all known contexts prevalent during the time interval of interest. 

The forecasts drive optimization processes that balance one or more business objectives (e.g. profit, margin, revenue or total units) while adhering to constraints based on the retailer’s operations and desired outcomes. 

For example, in Figure 4, we are showing the data science modeling aspects employed by Revionics, leveraging the benefits of the GCP platform:

A retailer may want to optimize profit on its private label brands, while maintaining a minimum and maximum relative price differential to certain premium brands in a product category.

A retailer may look to set a markdown or clearance schedule while capturing as much revenue as possible using only discount multiples of 10%. 

A retailer may want to optimize the discount level of a brand of flat screen TVs to maximize profitability over the whole category along with its related products.

Figure 4: Science Modeling & Price Recommendations

The primary technical challenge within the AI/ML domain for pricing is in balancing scalability with predictive skill during modeling, forecasting and optimization.  Keep in mind that our median customer has 500,000 independent models (see Figure 5) that need to be trained per the histogram below; each training run is inherently iterative and computationally intensive. While the details of the science are beyond the scope of this blog, from a system perspective, Revionics architecture combines: 

Two proprietary AI/ML frameworks, Probabilistic Programming Language (PPL) and the Grid, for expressing pricing domain behavior, and dynamically provisioning infrastructure to orchestrate separable modeling jobs based on statistical dependencies. PPL for expressing AI/ML models of pricing domain behavior; and the Grid infrastructure for orchestrating millions of parallel, separable modeling jobs.

Google-led open source platforms: TensorFlow for its rich machine learning library and framework and TensorFlow Probability for probabilistic modeling methods

Several of Google Cloud serverless services for data storage, compute, messaging, containerization & logging – GCS, BigQuery, PubSub, GKE & Logs Explorer.

Again, what is noteworthy about this solution is the breadth of challenges and capabilities that don’t have to be solved by Revionics because the services sit on infrastructure where sizing, configuration, monitoring, deployment and scaling are managed as a platform service.

Figure 5: Multiple ML Models Per Organization

Revionics’ modeling framework leverages a Hierarchical Bayesian methodology that can optimize each individual retailer’s product, store and customer relationships.  This is a key aspect of differentiation from a predictive skill perspective, as complex relationships between entities can be preserved with a learned reduction of an otherwise intractable problem space.

3. Decision making and Action

The DAG generation-based architectural pattern used in the last two steps of the pricing process, handles the need for Automated Intelligence described earlier. Note, the purpose of the final steps in the pricing process are:

To expose outputs from the data science modeling and price recommendations to teams or systems for decision making about price-related activities such as increases / decreases or promotional offers

To take actions, or to actualize pricing decisions, as well as to inform stakeholders affected by the changes being made

The Automated Intelligence enabled by the Google Cloud Platform enables us to scale these steps by providing well-designed APIs that allow users to integrate with and build on a platform’s output. However, AI/ML -based applications have specific challenges that need to be overcome in pricing.

One of these challenges is to build trust through explainability and greater transparency of the decisions behind the models. Because AI/ML software is non-deterministic, poorly understood by non-practitioners, and often replaces human processes, confidence isn’t easy to engender.

Visibility is the best asset for creating trust, which is why Revionics uses BigQuery and Looker at the center of our approach. With these technologies, reports and visualizations are interwoven into all aspects of the Revionics solution, creating a clear line of sight from data to decisions – this gives users visibility:

Forecasted business results from recommendations – for example, by making it easy for a user to understand the weighting of profit and revenue in a multi-objective optimization resulting in a particular price recommendation 

Statistical confidence around decision variables – by showing visualizations and metrics such as 95% credibility intervals around price elasticity, for example

AI/ML model output analytics over time – including histograms, statistical metrics, outlier detection and the like that express the health of the models

Figure 6: Analytics Embedded in Solution

Automated Intelligence. The Looker + BigQuery combination is particularly effective because:

Performance in high-dimensional analytics. Its ability to maintain performance at scale in the context of high-dimensional analytics without the need for manual intervention. 

Personalization. Its capacity for personalization and business relevance views and reports that reflect the specific conditions and metrics representative of the business.

Collaboration. In order to gain the confidence of people whose jobs are not pricing (e.g. category managers, merchandisers, executives), users have to be able to engage with and share analytical content in a completely frictionless way. In the Revionics solution, people within the organization who have no user login nor experience with the tool can view any analytical asset as well as engage with in-app comment threads. 

Triggers. Additionally, analytical assets, such as reports and datasets, can also be scheduled or triggered for export.

Enhancing Performance and Managing Cost. Due to the scale of the data that is needed to drive downstream processes, APIs or even distributed streaming-based egress approaches are not always ideal. To resolve this, Revionics is exploring the use of Google’s Analytics Hub. The service gives an ability to create and securely share BigQuery datasets, tables and views that are simply available within the customer’s environment, which is an incredibly powerful tool for increasing the impact of Automated Intelligence. The benefit of exchanging data via the Analytics Hub is that we can preserve flexibility for users at scale in the system they are already likely to be using to drive their analytics, stream, and execute large transformations. In addition, Analytics Hub provides the levers to create exchanges and listings in Revionics’ SaaS solution without having to move data, thereby being incredibly cost optimized.

In particular, use cases that require very granular data from Revionics’ pricing system to be combined with a retailer’s other source data are very well served here. In a more tangible sense, we foresee users automating eCommerce-related capabilities, merchandising processes and digital marketing systems, in addition to any number of operational use cases that we have yet to conceive.


By building on the Google Cloud Platform and data cloud, Revionics has built and hosted solutions that have yielded numerous benefits. Some of the notable outcomes are:

Enhanced speed and agility: By replacing customer-specific stored procedures and scripts with human readable configurations, the solution has lowered the barrier for variable data transformation logic and better validation logic. All of these have made the solution easier to configure and more agile.

Improved stability: As the reliance on data quality is fairly high, several constructs in the solution have ensured data hygiene has improved leading to meeting SLAs and reduced downtime. Collectively, the customer support issues have lowered.

Rapid Data processing: Below graphic shows the multi-fold improvement in various parts of the data processing pipeline with progressively increasing volumes.

Faster Technical implementations: The time to value has been quicker enabled by design and performance. For instance: Test cycles and customer feedback sped up leading to quality standards being achieved faster. Historical loads have run significantly faster.

Increased accuracy: The forecast accuracy grew by a greater percent while maintaining training and optimization over the first phase of customers migrating to the new platform.

Decision making: The rising statical and decision confidence led to higher-impact results and better SUS (system usability scores).

Click here to learn more about Revionics.

The Built with BigQuery advantage for ISVs 

Google is helping tech companies like Revionics build innovative applications on Google’s data cloud with simplified access to technology, helpful and dedicated engineering support, and joint go-to-market programs through the Built with BigQuery initiative, launched in April as part of the Google Data Cloud Summit. Participating companies can: 

Get started fast with a Google-funded, pre-configured sandbox. 

Accelerate product design and architecture through access to designated experts from the ISV Center of Excellence who can provide insight into key use cases, architectural patterns, and best practices. 

Amplify success with joint marketing programs to drive awareness, generate demand, and increase adoption.

BigQuery gives ISVs the advantage of a powerful, highly scalable data warehouse that’s integrated with Google Cloud’s open, secure, sustainable platform. And with a huge partner ecosystem and support for multi-cloud, open source tools and APIs, Google provides technology companies the portability and extensibility they need to avoid data lock-in. 

Click here to learn more about Built with BigQuery.

We thank the Google Cloud and Revionics team members who co-authored the blog: Revionics: Aakriti Bhargava, Director Platform Engineering. Google: Sujit Khasnis, Cloud Partner Engineering

Related Article

Built with BigQuery: Zeotap uses Google BigQuery to build highly customized audiences at scale

Zeotap is a next-generation Customer Data Platform hosted in Google Cloud aimed at building customized Audience segments and activating a…

Read Article

Source : Data Analytics Read More