How Renault solved scaling and cost challenges on its Industrial Data platform using BigQuery and Dataflow

How Renault solved scaling and cost challenges on its Industrial Data platform using BigQuery and Dataflow
Following the first significant bricks of data management implemented in our plants to serve initial use cases such as traceability and operational efficiency improvements, we confirmed we got the right solution to collect industrial data from our machines and operations at scale. We started to deploy that solution and we therefore needed a state-of-the-art data platform for contextualization, processing and hosting all the data gathered. This platform had to be scalable to deploy across our entire footprint, and also affordable and reliable, to foster the use of data in our operations. Our partnership with Google Cloud in 2020 was a key lever to reach this target. Jean-Marc Chatelanaz – Industry Data Management 4.0 Director, Renault

French multinational automotive manufacturer Renault Group has been investing in Industry 4.0 since the early days. A primary objective of this transformation has been to leverage manufacturing and industrial equipment data through a robust and scalable platform. Renault designed an industrial data acquisition layer and connected it to Google Cloud, using optimized big data products and services that together form Renault’s Industrial Data Platform.

In this blog, we’ll outline the evolution of that data platform on Google Cloud and how Renault worked with Google Cloud’s professional services to design and build a new architecture to achieve a secure, reliable, scalable, and cost-effective data management platform. Here’s how they’ve done it.

All started with Renault’s digital transformation

Renault Group has manufactured cars since 1898. Today, it is an international group with five brands that sold close to 3 million vehicles worldwide in 2020 and provide sustainable mobility throughout the world, with 40 manufacturing sites and more than 170,000 employees.

A leader in the Industry 4.0 movement, Renault launched its digital transformation plan in 2016 to achieve greater efficiency and modernize its traditional manufacturing and industrial practices. A key objective of this transformation was using the industrial data from up to 40 sites globally to promote data-based business decisions and create new opportunities.

Several initiatives were already growing in silos, such as Conditional Based Maintenance, Full Track and Trace, Industrial Data Capture and AI projects. In 2019, Renault kicked off a program named Industry Data Management 4.0 (IDM 4.0) to federate all those initiatives as well as future ones, and, most of all, to design and build a unique data platform and a framework for all Renault’s industrial data. 

The IDM 4.0 would ultimately enable Renault’s manufacturing, supply chain, and production engineering teams to quickly develop analytics, AI, and predictive applications based on a single industrial data capture and data referential platform as shown with these pillars:

Renault Group data transformation pillars

With Renault’s leadership in Industry 4.0 for manufacturing and supply chain, and with Google’s expertise in big data, analytics and machine learning since its foundations, Google Cloud solutions were a clear match for Renault’s ambitions.

Building IDM 4.0 in the cloud

Before IDM 4.0, Renault Digital was the internal team that incubated the idea of data in the cloud. Following some trials, the team was able to reduce storage costs by more than 50% by moving to Google Cloud. Key business requirements were data reliability, long-term storage, and unforgeability, all of which led the team to use Apache Spark running on fully-managed Dataproc for data processing, and Spanner as the main database.

The IDM 4.0 program aimed to scale and unfold more business use cases while keeping the platform cost-effective, with maximum performance and reliability. Meeting this goal required reviewing the architecture, as the costs projections of the business usage would go beyond the project budget, refraining from additional business explorations. 

Accelerating the redesign of that core data platform was a pillar of the Renault-Google Cloud partnership. . Google Cloud’s professional services organization (PSO) helped define the best architecture to meet requirements, solve challenges, and handle 40 plants’ worth of scaling data in a reliable and cost-sustainable way.

Renault Group IDM 4.0 layers

The data capture journey is complex and includes several components. Basically, Renault built a specific process relying on standard data models designed with OPC-UA, an open machine-to-machine communication protocol. In the new Google Cloud architecture, this data model is implemented in BigQuery such that most writes are append-only and it is not necessary to do lots of join operations to retrieve the required information. BigQuery is complemented with a cache in Memorystore to retrieve up-to-date states at all times. This architecture enabled high-performing and cost-efficient reads and writes.

The team also started developing with the Beam framework and Dataflow. Working with Beam was not only as easy as working with Spark, but also unlocked a number of benefits:

A unified model for batch and streaming processing: less coding variety and more flexibility in choosing the types of processing jobs depending on needs and cost target

A thriving ecosystem around Beam: many available connectors (PubsubIO, BigQueryIO, JMSIO, etc)

Simplified operations: efficient autoscaling, smoother updates, finer cost tracking, easy monitoring, end-to-end testing, and error management.

Dataflow has since become the go-to tool for most data processing needs in the platform. Pub/Sub was also a natural choice for streaming data processing, along with other third-party messaging systems. The teams are also maintaining their advanced use of Google Cloud’s managed services such as Google Kubernetes Engine (GKE), Cloud Composer, and Memorystore.

The new platform is built to scale, and as of Q1 2021, the IDM 4.0 team has been connecting more than 4,900 industrial appliances through Renault’s internal data acquisition solution, transmitting more than 1 billion messages per day to the platform in Google Cloud!

Focusing on cost efficiency

While the migration to the new architecture was ongoing, cost optimization became a key decision driver, especially with the COVID-19 pandemic and its impacts on the world’s economics.

Hence, Renault’s management team and Google Cloud’s PSO agreed to allocate a special task force to optimize costs on parts of the platform that were not yet migrated. 

This special task force discovered that one Apache Spark cluster continuously performed Cloud Storage Class A & B operations every few tens of milliseconds. This was inconsistent with the majority of incoming flows, which were hourly or daily batches. One correction we applied was  simply increasing the interval polling the filesystem, i.e. the spark.sql.streaming.pollingDelay parameter from Spark. Later, we confirmed a 100x drop in Cloud Storage Class A calls, and a meaningful drop in class B calls, which resulted in a 50% decrease in production costs.

Drop in cost of storage  following cost optimization actions

Renault now uses  Dataflow to ingest and transform data from manufacturing plants as well as from other connected key referential databases. BigQuery is used to store and analyze massive amounts of data, such as packages and vehicle tracking, with many more data sources in the works. 

 The below diagram shows last year’s impressive achievements!

Evolution of data ingestion and platform costs (not to scale)

The new IDM 4.0 platform started to run production workloads just a few months ago, and Renault expects to have 10 times more industrial data coming in the next two years, helping to keep costs, performance, and reliability optimum. 

Unfold new opportunities and democratize data access 

The IDM 4.0 team has succeeded in gathering industrial data from plants, and merging and harmonizing this data in a scalable, reliable, and cost-effective platform. The team also managed to expose data in a controlled and secure way to data scientists, business teams or any application. 

In this blog, we’ve overviewed use cases ranging  from tracking to conditional maintenance, but this platform can open many other possibilities for innovation and business impact, such as self-service analytics that will democratize the use of data and support data-driven business decisions.

This Renault data transformation with Google Cloud demonstrates the potential of Industry 4.0 to improve Renault’s manufacturing, process engineering, and supply chain processes—and also provides an example for other companies looking to leverage data and cloud platforms for improved results. To learn more about BigQuery, click here


We’d like to thank all collaborators who participated actively in writing these lines; both from Google and Renault working tremendously as one team for 1+ year, fully remote:

Writers from Google:

Jean-Francois Macresy, Consultant, IDM4.0 Architecture best practicesJeremie Gomez, Strategic Engineer, IDM4.0 Data specialistRifel Mahiou, Program Manager, IDM4.0 Partnership strategy & governanceRazvan Culea, Data Engineer, IDM4.0 Data optimization 

Reviewers from Renault:

Sebastien Aubron, IDM4.0 Program ManagerJean-Marc Chatelanaz, IDM4.0 DirectorMatthieu Dureau, IDM4.0 platform Product OwnerDavid Duarte, IDM4.0 platform Data Engineer

Related Article

Registration is open for Google Cloud Next: October 12–14

Register now for Google Cloud Next on October 12–14, 2021

Read Article

Source : Data Analytics Read More

Save messages, money, and time with Pub/Sub topic retention

Save messages, money, and time with Pub/Sub topic retention

Starting today, there is a simpler, more useful way to save and replay messages that are published to Pub/Sub: topic retention. Prior to topic retention, you needed to individually configure and pay for message retention in each subscription. Now, when you enable topic retention, all messages sent to the topic within the chosen retention window are accessible to all the topic’s subscriptions, without increasing your storage costs when you add subscriptions. Additionally, messages will be retained and available for replay even if there are no subscriptions attached to the topic at the time the messages are published. 

Topic retention extends Pub/Sub’s existing seek functionality—message replay is no longer constrained to the subscription’s acknowledged messages. You can initialize new subscriptions with data retained by the topic, and any subscription can replay previously published messages. This makes it safer than ever to update stream processing code without fear of data processing errors, or to deploy new AI models and services built on a history of messages. 

Topic retention explained

With topic retention, the topic is responsible for storing messages, independently of subscription retention settings. The topic owner has full control over the topic retention duration and pays the full cost associated with message storage by the topic. As a subscription owner, you can still configure subscription retention policies to meet your individual needs.

Topic-retained messages are available even when the subscription is not configured to retain messages

Initializing data for new use cases

As organizations become more mature at using streaming data, they often want to apply new use cases to existing data streams that they’ve published to Pub/Sub topics. With topic retention, you can access the history of this data stream for new use cases by creating a new subscription and seeking back to a desired point in time.

Using the GCloud CLI 

These two commands initialize a new subscription and replay data from two days in the past. Retained messages are available within a minute after the seek operation is performed.

Choosing the retention option that’s right for you

Pub/Sub lets you choose between several different retention policies for your messages; here’s an overview of how we recommend you should use each type.

Topic retention lets you pay just once for all attached subscriptions, regardless of when they were created, to replay all messages published within the retention window. We recommend topic retention in circumstances where it is desirable for the topic owner to manage shared storage.

Subscription retention allows subscription owners, in a multi-tenant configuration, to guarantee their retention needs independently of the retention settings configured by the topic owner.

Snapshots are best used to capture the state of a subscription at the time of an important event, e.g. an update to subscriber code when reading from the subscription.

Transitioning from subscription retention to topic retention

You can configure topic retention when creating a new topic or updating an existing topic via the Cloud Console or the gcloud CLI. In the CLI, the command would look like:

gcloud pubsub topics update myTopic –message-retention-duration 7d.

If you are migrating to topic retention from subscription storage, subscription storage can be safely disabled after 7 days.

What’s next

Pub/Sub topic retention makes reprocessing data with Pub/Sub simpler and more useful. To get started, you can read more about the feature, visit the pricing documentation, or simply enable topic retention on a topic using Cloud Console or the gcloud CLI.

Related Article

How to detect machine-learned anomalies in real-time foreign exchange data

Model the expected distribution of financial technical indicators to detect anomalies and show when the Relative Strength Indicator is un…

Read Article

Source : Data Analytics Read More

Converging architectures: Bringing data lakes and data warehouses together

Converging architectures: Bringing data lakes and data warehouses together

Historically, data warehouses have been painful to manage. The legacy, on-premises systems that worked well for the past 40 years have proved to be expensive and they had many challenges around data freshness, scaling, and high costs. Furthermore, they cannot easily provide AI or real-time capabilities that are needed by modern businesses. We even see this with the cloud newly created data warehouses as well. They do not have AI capabilities still, despite showing that or arguing that they are the modern data warehouses. They are really like the lift and shift version of the legacy on-premises environments over to cloud. 

At the same time, on-premises data lakes have other challenges. They promised a lot, looked really good on paper,  promised low cost and ability to scale. However, in reality this did not capitalize for many organizations. This was  mainly because they were not easily operationalized, productionized, or utilized. This in return increased the overall total cost of ownership. There are also significant data governance challenges created by the data lakes. They did not work well with the existing IAM and security models. Furthermore, they ended up creating data silos because data is not easily shared across through the hadoop environment.

With varying choices, customers would choose the environment that made sense, perhaps a pure data warehouse, or perhaps a pure data lake, or a combination. This leads to a set of tradeoffs for nearly any real-world customer working with real-world data and use cases. Therefore, this past approach has naturally set up a model where we see different and often disconnected teams setting up shop within organizations. Resulting in users split between their use of the data warehouse and the data lake. 

Data warehouse users tend to be closer to the business, and have ideas about how to improve analysis, often without the ability to explore the business to drive a deeper understanding. On the contrary, data lake users are closer to the raw data and have the tools and capabilities to explore the data. Since they spend so much time doing this, they are focused on the data itself, and less focused on the business. This disconnect robs the business of the opportunity to find insights that would drive the business forward to higher revenues, lower costs, lower risk, and new opportunities.

Since then the two systems co-existed and complemented each other as the two main data analytics systems of enterprises, residing side by side in the shared IT sphere. These are also the data systems at the heart of any digital transformation of the business and the move to a full data-driven culture. As more organizations are migrating their traditional on-premises systems to the cloud and SaaS solutions, this is a period during which enterprises are rethinking the boundaries of these systems toward a more converged analytics platform.

This rethinking has led to convergence of data lakes and warehouses, as well as data teams across organizations. The cloud offers managed services that help expedite the convergence so that any data person could start to get insight and value out of the data, regardless of the system. The benefits of the converged data lake and data warehouse environment present itself in several ways. Most of these are driven by the ability to provide managed, scalable, and serverless technologies. As a result, the notion of storage and computation is blurred. Now it is no longer important to explicitly manage where data is stored or what format it is stored. Users are democratized, they should be able to access the data regardless of the infrastructure limitations. From a data user perspective, it doesn’t really matter whether the data resides in a data lake or a data warehouse. They do not look into which system the data is coming from. They really care about what data that they have, and whether they can trust it. The volume of the data that they can ingest and whether it is real time or not. They are also discovering and managing data across varied datastores and taking them away from the siloed world into an integrated data ecosystem. Most importantly, analyze and process data with any person or tool.

At Google Cloud, we provide a cloud native, highly scalable and secure, converged solution that delivers choice and interoperability to customers. Our cloud native architecture reduces cost and improves efficiency for organizations. For example, BigQuery‘s full separation of storage and compute allows for BigQuery compute to be brought to other storage mechanisms through federated queries. BigQuery storage API allows treating a data warehouse like a data lake. It allows you to access the data residing in BigQuery. For example, you can use Spark to access data resigning in Data Warehouse without it affecting performance of any other  jobs accessing it. On top of this, Dataplex, our intelligent data fabric service, provides data governance and security capabilities across various storage tiers built on GCS and BigQuery.

There are many benefits achieved by the convergence of the data warehouses and data lakes and if you would like to find more, here’s the full paper.

Related Article

Registration is open for Google Cloud Next: October 12–14

Register now for Google Cloud Next on October 12–14, 2021

Read Article

Source : Data Analytics Read More

Single-cell genomic analysis accelerated by NVIDIA on Google Cloud

Single-cell genomic analysis accelerated by NVIDIA on Google Cloud

In the past decade, the Healthcare and Life Sciences industry has enjoyed a boon in technological and scientific advancement. New insights and possibilities are revealed almost daily. At Google Cloud, driving innovation in cloud computing is in our DNA. Our team is dedicated to sharing ways Google Cloud can be used to accelerate scientific discovery. For example, the recent announcement of AlphaFold2 showcases a scientific breakthrough, powered by Google Cloud, that will promote a quantum leap in the field of proteomics. In this blog, we’ll review another omics use case, single-cell analysis, and how Google Cloud’s Dataproc and NVIDIA GPUs can help accelerate that analysis.

The Need for Performance in Scientific Analysis

The ability to understand the causal relationship between genotypes and phenotypes is one of the long-standing challenges in biology and medicine. Understanding and drawing insights from the complexity of biological systems abounds from the actual code of life (DNA) through to expression of genes (RNA) to translation of gene transcripts into proteins that function in different pathways, cells, and tissues within an organism. Even the smallest of changes in our DNA can have large impacts on protein expression, structure, and function, which ultimately drives development and response – at both cellular and organism levels. And, as the omics space becomes increasingly data- and compute-intensive, research requires an adequate informatics infrastructure. An infrastructure that scales with growing data demands, enables a diverse range of resource-intensive computational activities, and is affordable and efficient – reducing data bottlenecks and enabling researchers to maximize insight. 

But where do all these data and compute challenges come from and what makes scientific study so arduous? The layers of biological complexity begin to be made apparent immediately when looking at not just the genes themselves, but their expression. Although all the cells in our body share nearly identical genotypes, our many diverse cell types (e.g. hepatocytes versus melanocytes) express a unique subset of genes necessary for specific functions, making transcriptomics a more powerful method of analysis by allowing researchers to map gene expression to observable traits. Studies have shown that gene expression is heterogeneous, even in similar cell types. Yet, conventional sequencing methods require DNA or RNA extracted from a cell population. The development of single-cell sequencing was pivotal to the omics field. Single-cell RNA sequencing has been critical in allowing scientists to study transcriptomes across large numbers of individual cells. 

Despite its potential, and the increasing availability of single-cell sequencing technology, there are several obstacles: an ever increasing volume of high-dimensionality data, the need to integrate data across different types of measurements (e.g. genetic variants, transcript and protein expression, epigenetics) and across samples or conditions, as well as varying levels of resolution and the granularity needed to map specific cell types or states. These challenges present themselves in a number of ways including background noise, signal dropouts requiring imputation, and limited bioinformatics pipelines that lack statistical flexibility. These and other challenges result in analysis workflows that are very slow, prohibiting the iterative, visual, and interactive analysis required to detect differential gene activity. 

Accelerating Performance

Cloud computing can help not only with data challenges, but with some of the biggest obstacles: scalability, performance, and automation of analysis. To address several of the data and infrastructure challenges facing single-cell analysis, NVIDIA developed end-to-end accelerated single-cell RNA sequencing workflows that can be paired with Google Cloud Dataproc, a fully-managed service for running open source frameworks like Spark, Hadoop, and RAPIDS. The Jupyter notebooks that power these workflows include examples using samples like human lung cells and mouse brains cells and demonstrate acceleration between CPU-based processing compared to GPU-based workflows. 

Google Cloud Dataproc powers the NVIDIA GPU-based approach and demonstrates data processing capabilities and acceleration, which in turn have the potential of delivering considerable performance gains.  When paired with RAPIDS, practitioners can accelerate data science pipelines on NVIDIA GPUs, reducing operations like data loading, processing, and training from hours to seconds. RAPIDS abstracts the complexities of accelerated data science by building upon popular Python and Java libraries effortlessly. When applying RAPIDS and NVIDIA accelerated compute to single-cell genomics use cases, practitioners can churn through analysis of a million cells in only a few minutes.

Give it a Try

The journey to realizing the full potential of omics is long; but through collaboration with industry experts, customers, and partners like NVIDIA, Google Cloud is here to help shine a light on the road ahead. To learn more about the notebook provided for single-cell genomic analysis, please take a look at NVIDIA’s walkthrough. To give this pattern a try on Dataproc, please visit our technical reference guide.

Related Article

Building a genomics analysis architecture with Hail, BigQuery, and Dataproc

Try a cloud architecture for creating large clusters for genomics analysis with BigQuery and Google-built healthcare tooling.

Read Article

Source : Data Analytics Read More

Handling duplicate data in streaming pipelines using Dataflow and Pub/Sub

Handling duplicate data in streaming pipelines using Dataflow and Pub/Sub


Processing streaming data to extract insights and powering real time applications is becoming more and more critical. Google Cloud Dataflow and Pub/Sub provides a highly scalable, reliable and mature streaming analytics platform to run mission critical pipelines. One very common challenge that developers often face when designing such pipelines is how to handle duplicate data. 

In this blog, I want to give an overview of common places where duplicate data may originate in your streaming pipelines and discuss various options that are available to you to handle them. You can also check out this tech talk on the same topic.

Origin of duplicates in streaming data pipelines

This section gives an overview of the places where duplicate data may originate in your streaming pipelines. Numbers in red boxes in the following diagram indicate where this may happen.

Some duplicates are automatically handled by Dataflow while for others developers may need to use some techniques to handle them. This is summarized in the following table.

1. Source generated duplicate
Your data source system may itself produce duplicate data. There could be several reasons like network failure, system errors etc that can produce duplicate data. Such duplicates are referred to as ‘source generated duplicates’.

One example where this could happen is when you set trigger notifications from Google Cloud Storage to Pub/Sub in response to object changes to GCS buckets. This feature guarantees at-least-once delivery to Pub/Sub and can produce duplicate notifications.

2. Publisher generated duplicates 
Your publisher when publishing messages to Pub/Sub can generate duplicates due to at-least-once publishing guarantees. Such duplicates are referred to as ‘publisher generated duplicates’. 

Pub/Sub automatically assigns a unique message_id to each message successfully published to a topic. Each message is considered successfully published by the publisher when Pub/Sub returns an acknowledgement to the publisher. Within a topic all messages have a unique message_id and no two messages have the same message_id. If success of the publish is not observed for some reason (network delays, interruptions etc) the same message payload may be retried by the publisher. If retries happen, we may end up with duplicate messages with different message_id in Pub/Sub. For Pub/Sub these are unique messages as they have different message_id.

3. Reading from Pub/Sub
Pub/Sub guarantees at least once delivery for every subscription. This means that a message may be delivered more than once by the same subscription if Pub/Sub doesn’t receive acknowledgement within the acknowledgement deadline. The subscriber may acknowledge after the acknowledgement deadline or the acknowledgement may be lost due to transient network issues. In such scenarios the same message would be redelivered and subscribers may see duplicate data. It is the responsibility of the subscribing system (for example Dataflow) to detect such duplicates and handle accordingly.

When Dataflow receives messages from Pub/Sub subscription, messages are acknowledged after they are successfully processed by the first fused stage. Dataflow does optimization called fusion where multiple stages can be combined into a single fused stage. A break in fusion happens when there is a shuffle which happens if you have transforms like GROUP BY, COMBINE or I/O transforms like BigQueryIO. If a message has not been acknowledged within its acknowledgement deadline, Dataflow attempts to maintain the lease on the message by repeatedly extending the acknowledgement deadline to prevent redelivery from Pub/Sub. However this is best effort and there is a possibility that messages may be redelivered. This can be monitored using metrics listed here.

However, because Pub/Sub provides each message with a unique message_id, Dataflow uses it to deduplicate messages by default if you use the built-in Apache Beam PubSubIO. Thus Dataflow filters out such duplicates originating from redelivery of the same message by Pub/Sub. You can read more about this topic on one of our earlier blog under the section “Example source: Cloud Pub/Sub”

4. Processing data in Dataflow
Due to the distributed nature of processing in Dataflow each message may be retried multiple times on different Dataflow workers. However Dataflow ensures that only one of those tries wins and the processing from the other tries does not affect downstream fused stages. Dataflow does guarantee exactly once processing by leveraging checkpointing at each stage to ensure such duplicates are not reprocessed affecting state or output. You can read more about how this is achieved in this blog.

5. Writing to a sink
Each element can be retried multiple times by Dataflow workers and may produce duplicate writes. It is the responsibility of the sink to detect these duplicates and handle them accordingly. Depending on the sink, duplicates may be filtered out, over-written or appear as duplicates.

File systems as sink
If you are writing files, exactly once is guaranteed as any retries by Dataflow workers in event of failure will overwrite the file. Beam provides several I/O connectors to write files, all of which guarantees exactly once processing.

BigQuery as sink

If you use the built-in Apache Beam BigQueryIO to write messages to BigQuery using streaming inserts, Dataflow provides a consistent insert_id (different from Pub/Sub message_id) for retries and this is used by BigQuery for deduplication. However, this deduplication is best effort and duplicate writes may appear. BigQuery provides other insert methods as well with different deduplication guarantees as listed below.

You can read more about BigQuery insert methods at the BigQueryIO Javadoc. Additionally for more information on BigQuery as a sink check out the section “Example sink: Google BigQuery” in one of our earlier blog

For duplicates originating from places discussed in points 3), 4) and 5) there are built-in mechanisms in place to remove such duplicates as discussed above, assuming BigQuery is a sink. In the following section we will discuss deduplication options for ‘source generated duplicates’ and ‘publisher generated duplicates’. In both cases, we have duplicate messages with different message_id, which for Pub/Sub and downstream systems like Dataflow are two unique messages.

Deduplication options for source generated duplicates and publisher generated duplicates

1. Use Pub/Sub message attributes

Each message published to a Pub/Sub topic can have some string key value pairs attached as metadata under the “attributes” field of PubsubMessage. These attributes are set when publishing to Pub/Sub. For example, if you are using the Python Pub/Sub Client Library, you can set the “attrs” parameter of the publish method when publishing messages. You can set the unique fields (e.g: event_id) from your message as attribute value and field name as attribute key.

Dataflow can be configured to use these fields to deduplicate messages instead of the default deduplication using Pub/Sub message_id. You can do this by specifying the attribute key when reading from Pub/Sub using the built-in PubSubIO.

For Java SDK, you can specify this attribute key in the withIdAttribute method of PubsubIO.Read() as shown below.

In the Python SDK, you can specify this in the id_label parameter of the ReadFromPubSub PTransform as shown below.

This deduplication using a Pub/Sub message attribute is only guaranteed to work for duplicate messages that are published to Pub/Sub within 10 minutes of each other.

2. Use Apache Beam Deduplicate PTransform
Apache Beam provides deduplicate PTransforms which can deduplicate incoming messages  over a time duration. Deduplication can be based on the message or a key of a key value pair, where the key could be derived from the message fields. The deduplication window can be configured using the withDuration method, which can be based on processing time or event time (specified using the withTimeDomain method). This has a default value of 10 mins.

You can read the Java documentation or the Python documentation of this PTransform for more details on how this works.

This PTransform uses the Stateful API under the hood and maintains a state for each key observed. Any duplicate message with the same key that appears within the deduplication window is discarded by this PTransform.

3. Do post-processing in sink
Deduplication can also be done in the sink. This could be done by running a scheduled job that periodically deduplicates rows using a unique identifier.

BigQuery as a sink
If BigQuery is the sink in your pipeline, scheduled query can be executed periodically that writes the deduplicated data to another table or updates the existing table. Depending on the complexity of the scheduling you may need orchestration tools like Cloud Composer or Dataform to schedule queries.

Deduplication can be done using a DISTINCT statement or DML like MERGE. You can find sample queries about these methods on these blogs (blog 1, blog 2).

Often in streaming pipelines you may need deduplicated data available in real time in BigQuery. You can achieve this by creating materialized views on top of underlying tables using a DISTINCT statement.

Any new updates to the underlying tables will be updated in real time to the materialized view with zero maintenance or orchestration.

Technical trade-offs of different deduplication options

Related Article

After Lambda: Exactly-once processing in Google Cloud Dataflow, Part 1

Learn the meaning of “exactly once” processing in Dataflow, its importance for stream processing overall and its implementation in stream…

Read Article

Source : Data Analytics Read More

Wrapping up the summer: A host of new stories and announcements from Data Analytics

Wrapping up the summer: A host of new stories and announcements from Data Analytics

August is the time to sit back, relax, and enjoy the last of the summer. Or, for those in the southern hemisphere, August is the month you start looking at your swimsuits and sunglasses with interest as the weather warms. But regardless of where you live, August also seems to be when Google produces a lot of interesting Data Analytics reading.

In this monthly recap, I’ll divide last month’s most interesting articles into three groups: New Features and Announcements, Customer Stories, and  How-Tos. You can read through in order or skip to the section that’s most interesting to you.

Features and announcements

Datasets and Demo Queries – Recursion is a powerful topic, but it’s also a marvelous metaphor.  During August, we launched our most self-referential datasetyet; Google Cloud Release Notes are now in BigQuery. Find the product and feature announcements you need and do it fast using the new Google Cloud Release Notes Dataset. Additionally, we also launched the Top 25 topics in Google Trends (Looker Dashboard).

Save Messages, money and time with Pub/Sub topic retention. When you enable topic retention, all messages sent to the topic within the chosen retention window are accessible to all the topic’s subscriptions —without increasing your storage costs when you add subscriptions. Additionally, messages will be retained and available for replay even if there are no subscriptions attached to the topic at the time the messages are published, allowing subscribers to see the entire history of messages sent to the topic.

Extend your Dataflow templates with UDFs. Google provides a set of Dataflow templates that customers commonly use for frequent data tasks but also as reference data pipelines that developers can extend. But what if you want to customize a Dataflow template without modifying or maintaining the Dataflow template code itself? With JavaScript user-defined functions (UDFs), customers can now extend certain Dataflow templates with custom logic to transform records on the fly. This is especially helpful for users who want to customize a pipeline’s output format without having to re-compile or maintain the template code itself.

The diagram below shows the process flow for UDF enabled Dataflow Templates

Customer stories

Renault uses BigQuery to improve its Industrial Data platform
Last month, Renault described their journey and the impact of establishing an industrial IoT analytics system using Dataflow, Pub/Sub, and BigQuery for traceability of tools and products, as well as measure operational efficiency.  By consolidating multiple data silos into BigQuery the IT Infrastructure team was able to reduce storage costs by 50% even while processing several times more data than on their legacy system.

The chart below shows the multifold growth in data that Renault processed each month (blue bars) along with the corresponding drop in cost (red line) between the previous system (shaded area) and the Google solution.

Constellation Brands chose Google Cloud to power Direct to Consumer (DTC) shift

Ryan Mason, Director and Head of Direct to Consumer Strategy from Constellation Brands authored a piece on the business value of DTC channels as well as the method and impact of how he implemented his pipelines. This story explained how to gather multiple streams from the Google Marketing platform (Analytics 360, Tag Manager 360) and land them in BigQuery.

A key differentiator for us is that all Google Marketing Platform data is natively accessible for analysis within BigQuery.

From there, Constellation Brands can calculate key performance indicators (KPIs), such as customer acquisition cost (CAC), Customer Lifetime Value (CLV), and Net Promoter Score (NPS), and broadcast them across the company using Looker dashboards. In this way, the entire organization can track the health of the business through common access to the same KPI’s.

The operational impact of Looker [dashboards] is also substantial: our team estimates that the number of hours needed to reach critical business decisions has been reduced by nearly 60%.


Our DevRel Rockstar Leigha Jarett has published four very useful articles in her continuing series on the BigQuery Administration Reference Guide. This month she covered: Monitoring, API Landscape, Data Governance, and Query Optimization.

I highly recommend reading the article on query optimization. It’s packed with good tips, from the most commonly used — partitioning and clustering to reduce the amount of data that the query has to scan – to some lesser-known tips like proper ordering of expressions.

This article on workload management in BigQuery has some very useful tips and explains the key constructs of Commitments, Reservations, and Assignments.

If streaming is your game then this post from Zeeshan Kahn on handling duplicate data in streaming pipelines using DF and PS will come in handy.

Zooming up a few thousand feet, Googlers Firat Tekiner and Susan Pierce offer some high level insights as they discuss theconvergence of data lakes and data warehouses. After all, who wants to manage two sets of infrastructure?   Aunified data platformis the way to go.

And that’s how we do a relaxing summer month here at Google Cloud! Stay tuned for more announcements, how-to blogs, and customer stories as we ramp up for Google Cloud Next coming in October.

Related Article

What are the newest datasets in Google Cloud?

Want to know about the latest datasets from Google Cloud? Find information here in one handy location. Check back regularly as we update …

Read Article

Source : Data Analytics Read More

Ad agencies choose BigQuery to drive campaign performance

Ad agencies choose BigQuery to drive campaign performance

Advertising agencies are faced with the challenge of providing the precision data that marketers require to make better decisions at a time when customers’ digital footprints are rapidly changing. They need to transform customer information and real-time data into actionable insights to inform clients what to execute to ensure the highest campaign performance.

In this post, we’ll explore how two of our advertising agency customers are turning to Google BigQuery to innovate, succeed, and meet the next generation of digital advertising head on. 

Net Conversion eliminated legacy toil to reach new heights

Paid marketing and comprehensive analytics agency Net Conversion has made a name for itself with its relentless attitude and data-driven mindset. But like many agencies, Net Conversion felt limited by traditional data management and reporting practices. 

A few years ago, Net Conversion was still using legacy data servers to mine and process data across the organization, and analysts relied heavily on Microsoft Excel spreadsheets to generate reports. The process was lengthy, fragmented, and slow—especially when spreadsheets exceeded the million-row limit.

To transform, Net Conversion built Conversionomics, a serverless platform that leverages BigQuery, Google Cloud’s enterprise data warehouse, to centralize all of its data and handle all of its data transformation and ETL processes. BigQuery was selected for its serverless architecture, high scalability, and integration with tools that analysts were already using daily, such as Google Ads, Google Analytics, and Data Hub. 

After moving to BigQuery, Net Conversion discovered surprising benefits that streamlined reporting processes beyond initial expectations. For instance, many analysts had started using Google Sheets for reports, and BigQuery’s native integration with Connected Sheets gave them the power to analyze billions of rows of data and generate visualizations right where they were already working.

If you’re still sending Excel files that are larger than 1MB, you should explore Google Cloud. Kenneth Eisinger
Manager of Paid Media Analytics at Net Conversion

Since modernizing their data analytics stack, Net Conversion has saved countless hours of time that can now be spent on taking insights to the next level. Plus, BigQuery’s advanced data analytics capabilities and robust integrations have opened up new roads to offer more dynamic insights that help clients better understand their audience.   

For instance, Net Conversion recently helped a large grocery retailer launch a more targeted campaign that significantly increased downloads of their mobile application. The agency was able to better understand and predict their customers’ needs by analyzing buyer behavior across the website, mobile application, and their purchase history. Net Conversion analyzed website data in real-time with BigQuery, ran analytics on their mobile app data through the Firebase’s integration with BigQuery, and enriched these insights with sales information from the grocery retailer’s CRM to generate propensity behavior models that  accurately predicted which customers would most likely install their mobile app.

WITHIN helped companies weather the COVID storm

WITHIN is a performance branding company, focused on helping brands maximize growth by fusing marketing and business goals together in a single funnel. During the COVID-19 health crisis, WITHIN became an innovator in the ad agency world by sharing real-time trends and insights with customers through its Marketing Pulse Dashboard. This dashboard was part of the company’s path to adopting BigQuery for data analytics transformation. 

Prior to using BigQuery, WITHIN used a PostgreSQL database to house its data and manual reporting. Not only was the team responsible for managing and maintaining the server, which took focus away from the data analytics, but query latency issues often slowed them down. 

BigQuery’s serverless architecture, blazing-fast compute, and rich ecosystem of integrations with other Google Cloud and partner solutions made it possible to rapidly query, automate reporting, and completely get rid of CSV files. 

Using BigQuery, WITHIN is able to run Customer Lifetime Value (LTV) analytics and quickly share the insights with their clients in a collaborative Google Sheet. In order to improve the effectiveness of their campaigns across their marketing channels, WITHIN further segments the data into high and low LTV cohorts and shares the predictive insights with their clients for in-platform optimizations.

By distilling these types of LTV insights from BigQuery, WITHIN has been able to use those to empower their campaigns on Google Ads with a few notable success stories.

WITHIN worked with a pet food company to analyze historical transactional data to model predicted LTV of new customers. They found significant differences between product category and autoship vs single order customers, and they implemented LTV-based optimization. As a result, they saw a 400% increase in average customer LTV. 
WITHIN helped a coffee brand increase their customer base by 560%, with the projected 12-month LTV of newly acquired customers jumping a staggering 1280%.

Through integration with Google AI Platform Notebooks, BigQuery also advanced WITHIN’s ability to use machine learning (ML) models. Today, the team can build and deploy models to predict dedicated campaign impact across channels without moving the data.  The integration of clients’ LTV data through Google Ads has also impacted how WITHIN structures their clients’ accounts and how they make performance optimization decisions.

Now, WITHIN can capitalize on the entire data lifecycle: ingesting data from multiple sources into BigQuery, running data analytics, and empowering people with data by automatically visualizing data right in Google Data Studio or Google Sheets.

A year ago, we delivered client reporting once a week. Now, it’s daily. Customers can view real-time campaign performance in Data Studio — all they have to do is refresh. Evan Vaughan
Head of Data Science at WITHIN

Having a consistent nomenclature and being able to stitch together a unified code name has allowed WITHIN to scale their analytics. Today, WITHIN is able to create an internal Media Mix Modeling (MMM) tool with the help of Google Cloud that they’re trialing with their clients.

The overall unseen benefit of BigQuery was that it put WITHIN in a position to remain nimble and spot trends before other agencies when COVID-19 hit. This aggregated view of data allowed WITHIN to provide unique insights to serve their customers better and advise them on rapidly evolving conditions.

Ready to modernize your data analytics? Learn more about how Google BigQuery unlocks the insights hidden in your data.

Related Article

Query BIG with BigQuery: A cheat sheet

Organizations rely on data warehouses to aggregate data from disparate sources, process it, and make it available for data analysis in s…

Read Article

Source : Data Analytics Read More

Optimizing your BigQuery incremental data ingestion pipelines

Optimizing your BigQuery incremental data ingestion pipelines

When you build a data warehouse, the important question is how to ingest data from the source system to the data warehouse. If the table is small you can fully reload a table on a regular basis, however, if the table is large a common technique is to perform incremental table updates. This post demonstrates how you can enhance incremental pipeline performance when you ingest data into BigQuery.

Setting up a standard incremental data ingestion pipeline

We will use the below example to illustrate a common ingestion pipeline that incrementally updates a data warehouse table. Let’s say that you ingest data into BigQuery from a large and frequently updated table in the source system, and you have Staging and Reporting areas (datasets) in BigQuery.

The Reporting area in BigQuery stores the most recent, full data that has been ingested from the source system tables. Usually you create the base table as a full snapshot of the source system table. In our running example, we use BigQuery public data as the source system and create reporting.base_table as shown below. In our example each row is identified by a unique key which consists of two columns: block_hash and log_index.

In data warehouses it is common to partition a large base table by a datetime column that has a business meaning. For example, it may be a transaction timestamp, or datetime when some business event happened, etc. The idea is that data analysts who use the data warehouse usually need to analyze only some range of dates and rarely need the full data. In our example, we partition the base table by block_timestamp which comes from the source system.

After ingesting the initial snapshot you need to capture changes that happen in the source system table and update the reporting base table accordingly. This is when the Staging area comes into the picture. The staging table will contain captured data changes that you will merge into the base table. Let’s say that in our source system on a regular basis we have a set of new rows and also some updated records. In our example we mock the staging data as follows: first, we create new data, than we mock the updated records:

Next, the pipeline merges the staging data into the base table. It joins two tables by unique key and than updates the changed value or inserts a new row

It is often the case that the staging table contains keys from various partitions but the number of those partitions are relatively small. It holds, for instance, because in the source system the recently added data may get changed due to some initial errors or ongoing processes but older records are rarely updated. However, when the above MERGE gets executed, BigQuery scans all partitions in the base table and processes 161 GB of data. You might add additional join condition on block_timestamp:

But BigQuery would still scan all partitions in the base table because condition T.block_timestamp = S.block_timestamp is a dynamic predicate and BigQuery doesn’t automatically push such predicates down from one table to another in MERGE.

Can you improve the MERGE efficiency by making it scan less data? The answer is Yes. 

As described in the MERGE documentation, pruning conditions may be located in a subquery filter, a merge_condition filter, or a search_condition filter. In this post we show how you can leverage the first two. The main idea is to turn a dynamic predicate into a static predicate.

Steps to enhance your ingestion pipeline

The initial step is to compute the range of partitions that will be updated during the MERGE and store it in a variable. As was mentioned above, in data ingestion pipelines, staging tables are usually small so the cost of the computation is relatively low.

Based on your existing ETL/ELT pipeline, you can add the above code as-is to your pipeline or you can compute date_min, data_max as part of some already existing transformation step. Alternatively, date_min, data_max can be computed on the Source System side while capturing the next ingestion data batch.

After computing date_min, date_max you pass those values to the MERGE statement as static predicates. There are several ways to enhance the MERGE and prune partitions in the base table based on precomputed date_min, data_max. 

If your initial MERGE statement uses a subquery, you can incorporate a new filter into it:

Note that you add the static filter to the staging table and keep T.block_timestamp = S.block_timestamp to convey to BigQuery that it can push that filter to the base table. This MERGE processes 41 GB of data in contrast to the initial 161 GB. You can see in the query plan that BigQuery pushes the partition filter from the staging table to the base table:

This type of optimization, when a pruning condition is pushed from a subquery to a large partitioned or clustered table, is not unique for MERGE. It also works for other types of queries. For instance:

And you can check the query plan to verify that BigQuery pushed down the partition filter from one table to another.

Moreover, for SELECT statements, BigQuery can automatically infer a filter predicate on a join column and push it down from one table to another if your query meets the following criteria:

The target table must be clustered or partitioned. The result size of the other table, i.e. after applying all filters, must qualify for broadcast join. Namly, the result set must be relatively small, less than ~100MB.

In our running example, reporting.base_table is partitioned by block_timestamp. If you define a selective filter on staging.load_delta and join two tables, you can see an inferred filter on the join key pushed to the target table

There is no requirement to join tables by partitioning or clustering key to kick off this type of optimization. However, in this case the pruning effect on the target table would be less significant.

But let us get back to the pipeline optimizations. Another way to enhance MERGE is to modify the merge_condition filter by adding static predicate on the base table:

To summarize, here are the steps that you can perform to enhance incremental ingestion pipelines in BigQuery. First you compute the range of updated partitions based on the small staging table. Next, you tweak the MERGE statement a bit to let BigQuery know to prune data in the base table.

All the enhanced MERGE statements scanned 41 GB of data, and setting up the src_range variable took 115 MB.  Compare it with the initial 161 GB scan. Moreover, given that computing src_range may be incorporated into some existing transformation in your ETL/ELT, it results in a good performance improvement which you can leverage in your pipelines. 

In this post we described how to enhance data ingestion pipelines by turning dynamic filter predicates into static predicates and letting BiQuery prune data for us. You can find more tips on BigQuery DML tuning here.

Special thanks to Daniel De Leo, who helped with examples and provided valuable feedback on this content.

Related Article

BigQuery explained: How to run data manipulation statements to add, modify and delete data stored in BigQuery

How do you delete all rows from a table in BigQuery? In this blog post, you’ll learn that and more, as we show you how you how to run dat…

Read Article

Source : Data Analytics Read More

Lower TCO for managing data pipelines by 80% with Cloud Data Fusion

Lower TCO for managing data pipelines by 80% with Cloud Data Fusion

In today’s data-driven environment, organizations need to use various different data sources available in order to extract timely and actionable insights.  Organizations are better off making data integration easier to get faster insights from data rather than spending time and effort in coding, testing complex data pipelines .  

Recently, Google Cloud sponsored Enterprise Strategy Group (ESG) on two whitepapers, conducting deep dives into the need for modern data integration, and how Cloud Data Fusion can address these challenges including the economic value of using Cloud Data Fusion as compared to other alternatives available.

Challenges in Data Integration

Data integration has notoriously been the greatest challenge data-driven organizations face as they work to better leverage data. On top of migrating certain, if not all, workloads to the cloud, areas like lack of metadata, combining distributed data sets, combining different data types, and handling the rate at which source data changes are directly leading to organizations prioritizing data integration.  In fact, this is the reason that improving data integration is where organizations expect to make the most significant investments in the next 12-18 months1. Organizations recognize that by simplifying and improving their data integration processes, they can enhance operational efficiency across data pipelines while ensuring they are on a path to data-driven success.

Cloud Strategy and Data Integration 

Based on the ESG report, the cloud strategy impacts the way in which organizations implement and utilize data integration today. Organizations can choose from, single cloud, multi-cloud or hybrid cloud strategy and in doing so, choosing the right data integration option can give organizations freedom and flexibility. Irrespective of the cloud strategy, organizations are embracing a centralized approach to data management to not only reduce costs but also to ensure greater efficiency in the creation and management of data pipelines. By standardizing on a centralized approach, data lifecycle management is streamlined through data unification. Further, with improved data access and availability, data and insight reusability are achieved.

Cross-environment integration and collaboration

Organizations are increasingly in search of services and platforms that minimize lock-in while promoting cross-environment integration and collaboration. As developers dedicate more time and effort into building modern applications heavily rooted in data, knowing the technology is underlined by open-source technology provides peace of mind knowing the application and underlying code can be run anywhere the open-source technology is supported. This desire for open-source technology extends to data pipelines too, where data teams have dedicated hours to optimally integrate a growing set of technologies and perfect ETL scripts. As new technologies, use cases, or business goals emerge, enabling environment flexibility ensures organizations can embrace data in the best way possible.

Cost savings with Cloud Data Fusion

ESG conducted interviews with customers to validate the savings and benefits that they have seen in practice and used these as the assumptions to compare Cloud Data Fusion. Aviv, a senior validation analyst with ESG has taken two use cases, building data warehouse and building data lake and compared on-prem, build yourself with Cloud Data Fusion. The research shows that customers can realize cost savings up to 88% to operate a hybrid cloud data lake and up to 80% to deploy, manage, and maintain data pipelines for cloud-based enterprise data warehouses in BigQuery. Here is a sneak peek into ROI calculations for building Data Warehouse in BigQuery using Cloud Data Fusion vs other alternatives.

The full whitepapers contain even more insight, as well as a thorough analysis of the data integration tools’ impact on businesses and recommended steps for unlocking its full potential. You can download the full reports below:

Accelerating Digital Transformation with Modern Data Integration

The Economic Benefits of Data Fusion versus other Alternatives 

Additionally, try a quickstart, or reach out to us for your cloud data integration needs.

1. ESG Master Survey Results, 2021 Technology Spending Intentions Survey, December 2020

Source : Data Analytics Read More

Google Cloud improves Healthcare Interoperability on FHIR

Google Cloud improves Healthcare Interoperability on FHIR

The Importance of Interoperability

In 2020 hospital systems were scrambling to prepare for COVID-19. Not just the clinicians preparing for a possible influx of patients, but also the infrastructure & analytics teams trying to navigate a maze of Electronic Health Records (EHR) systems. By default these EHRs are not interoperable, or able to speak to one another, so answering a relatively simple question “how many COVID-19 patients are in all of my hospitals?” can require many separate investigations. 

Typically, the more complex a dataset is, the more difficult it is to build interoperable systems around it. Clinical data is extremely complex (a patient has many diagnoses, procedures, visits, providers, prescriptions, etc.), and EHR vendors built and managed their own proprietary data models to handle those data challenges. This has made it much more difficult for hospitals to track a patient’s performance when they switch hospitals (even within the same hospital system) and especially difficult for multiple hospitals systems to coordinate on care for nationwide epidemics (e.g. COVID-19, opioid abuse), which makes care less effective for patients & more expensive for hospitals. 

A Big Leap Forward in Interoperability

Building an interoperable system requires: 

(1) A common data schema

(2) A mechanism for hospitals to bring their messy real-world data into that common data schema

(3) A mechanism for asking questions against that common data schema

In 2011 a common FHIR (Fast Healthcare Interoperability Resources) Data Model & API Standard provided an answer to (1): a single data schema for the industry to speak the same data language. In the past 18 months, Google Cloud has deployed several technologies to unlock the power of FHIR and solve for (2) and (3): 

Google Cloud’s Healthcare Data Engine (HDE) produces FHIR records from streaming clinical data (either HL7v2 messages out of EHR systems or legacy formats from EDWs). This technology then enables data use for other applications & analytics in Google BigQuery (BQ)

Google Cloud’s Looker enables anyone in a healthcare organization to ask any question against the complex FHIR schema in Google BigQuery

Now a hospital system can quickly ask & answer a question against records from several EHR systems at once.

This dashboard tracks a hospital system’s COVID-19 cases & volume across its hospitals.

Applications Seen So Far

In less than 18 months, GCP has seen dozens of applications for HDE, BigQuery, and Looker working together to improve clinical outcomes. A few applications that have been particularly successful so far have answered questions like: 

How many readmissions will a hospital expect in 30 days? 

How long will each inpatient patient stay in a hospital?

How can a hospital better track misuse & operational challenges in prescribing opioid drugs to my patients?

How can a hospital quickly identify anomalies in patients’ vital signs across my hospital system?

How can a hospital identify & minimize hospital-associated infections (e.g. CLABSI) in my hospital?

How can a hospital prepare for COVID-19 cases across a hospital system? And leverage what-if planning to prepare for the worst?

These use cases represent just the tip of the iceberg of possibilities for improving day-to-day operations for clinicians & hospitals.

Solving Major Challenges in Interoperability

Latency: Hospitals often rely on stale weekly reports for analytics; receiving analytics in near real-time enables hospitals to identify problems and make changes much more quickly. COVID-19 in particular highlighted the need for faster turnaround on analytics. GCP’s Healthcare Data Engine handles streaming clinical data in the form of HL7 messages. As soon as messages arrive, they are transformed into FHIR and sent over to the BigQuery database to be queried. There is minimal latency, and users are querying near real-time data.

Scale: The scale and scope of hospital data is rapidly increasing as hospitals track more clinical events and as hospitals consolidate into larger systems. Hospitals are adopting cloud-based systems that autonomously scale to handle the intensive computation necessary to take in millions of clinical records with growing hospital system datasets. GCP’s serverless, managed cloud is meeting these needs for many hospital systems today.

Manage Multiple Clinical Definitions: Today hospitals struggle to manage definitions of complex clinical KPIs. For example, hospitals had many different definitions for a positive COVID-19 result (based on a frequently changing set of lab results and symptoms), which creates inconsistencies in analytics. Additionally, those definitions are often buried in scripts that are hard to adjust and change. HDE has developed capabilities that consistently transform HL7 messages into the FHIR store in a scalable fashion. Looker then provides a single source of truth in an object-oriented, version-controlled semantic layer to define clinical KPIs and quickly update them.  

Represent FHIR Relationally: FHIR was originally intended for XML storage to maximize schema flexibility. However, this format is usually very difficult for analytical queries, which perform better with relational datasets. In particular, FHIR has rows of data buried (or “nested”) within a single record (e.g. a single patient record has many key-value pairs of diagnoses) that make FHIR difficult to ask questions against. BigQuery is an analytical database that combines the analytical power of OLAP databases with the flexibility of a No-SQL data schema by natively storing FHIR data in this “nested” structure and querying against it. 

Query Quickly against FHIR: Writing unnested queries to a schema as complex as FHIR can be challenging. GCP’s Looker solution writes “nested” queries natively to BigQuery, making it much simpler to ask & answer new questions. This also prevents the “cube / extract” problem so common in healthcare, where hospitals are forced to build, manage, and maintain hundreds of simplified data cubes to answer their questions.

Predict Clinical Outcomes: Predictive modelling with AI/ML workflows has matured significantly. Hospitals increasingly rely on AI/ML to guide patients & providers towards better outcomes. For example, predicting patient, staffing, and ventilator volumes 30 days in advance across a hospital system can minimize disruptions to care. Leveraging FHIR on GCP enables GCP’s full suite of managed AI/ML tools – in particular BQML (BigQuery Machine Learning) and AutoML.

Ensure 24/7 Data Availability: COVID-19 exposed the vulnerabilities of relying on staffed on-premise data centers; GCP’s cloud infrastructure ensures availability and security of all clinical data. 

Protect Patient Data: Interoperability blends the need for private patient data to stay private while allowing data to be shared across hospitals. Researchers in particular often require granular security rules to access clinical data. Today hospitals often use an extract-based approach that requires many copies of the data outside of the database, a potential security flaw. GCP’s approach ensures that hospitals can query the data where it resides – in a secure data warehouse. Additionally, every component of GCP’s FHIR solution (HDE, BQ, Looker) can be configured to be HIPAA-compliant and includes row-level, column-level, and field-level security that can be set by users to ensure cell-level control over PHI data. GCP’s Cloud Loss Prevention API also anonymizes data automatically.

Speed to Insights: The complexity of data can lead to long windows to build analytics pipelines, leading to delays in healthcare improvements. GCP’s FHIR solution is relatively low-effort to implement. HDE can be set up against streaming HL7 messages in a few days and weeks, not months and years. Looker has a pre-built FHIR Block (coming to the Looker Blocks Directory and Marketplace soon) that can be installed & configured for a hospital’s particular data needs.

Share Insights Broadly: Interoperability requires not just being able to query across multiple systems, but also to share those insights across multiple platforms. GCP’s FHIR solution allows hospital systems to analyze governed results on FHIR, then send them anywhere: to dashboards in Looker, other BI tools, embedded applications, mobile apps, etc. For example, the National Response Portal represents the promise of hospitals and other organizations sharing aggregated healthcare data for nationwide insights around COVID-19.

For a technical review of GCP’s Healthcare Data Engine against Azure’s & AWS’ solutions, see here and here.

The New Frontier

This new healthcare data stack at Google Cloud represents a significant step forward towards interoperability in healthcare. When hospitals can communicate more easily with each other and when complex analytics are easier to conduct, everyone wins. Patients have better healthcare outcomes, and hospitals can provide care more efficiently. 

Google Cloud is committed to continue partnering with the largest hospital systems in the country to solve the most challenging problems in healthcare. Patients’ lives depend on it.

Related Article

Registration is open for Google Cloud Next: October 12–14

Register now for Google Cloud Next on October 12–14, 2021

Read Article

Source : Data Analytics Read More