Privacy-preserving data sharing now generally available with BigQuery data clean rooms

The rise of data collaboration and use of external data sources highlights the need for robust privacy and compliance measures. In this evolving data ecosystem, businesses are turning to clean rooms to share data in low-trust environments. Clean rooms enable secure analysis of sensitive data assets, allowing organizations to unlock insights without compromising on privacy.

To facilitate this type of data collaboration, we launched the preview of data clean rooms last year. Today, we are excited to announce that BigQuery data clean rooms is now generally available.

Built on BigQuery, data clean rooms let customers share data in place, with analysis rules that protect the underlying data. This launch includes a streamlined data contributor and subscriber experience in the Google Cloud console, as well as highly requested capabilities such as:

Join restrictions: Limits the joins that can be performed on specific columns for data shared in a clean room, preventing unintended or unauthorized connections between datasets (see the sketch after this list).
Differential privacy analysis rule: Enforces that all queries on your shared data use differential privacy with the parameters that you specify. The privacy budget that you specify also prevents further queries on that data when the budget is exhausted.
List overlap analysis rule: Restricts the output to only display the intersecting rows between two or more views joined in a query.
Usage metrics on views: Data owners or contributors see aggregated metrics on the views and tables shared in a clean room.
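
To make the join restriction concrete, here is a rough sketch (not taken from the product documentation) of the kind of query a clean room subscriber might run against a shared view when a join restriction only permits joins on a user_id column. All project, dataset, table, and column names below are hypothetical.

-- Hypothetical subscriber query against a clean room view whose join
-- restriction only permits joins on user_id.
SELECT
  COUNT(DISTINCT exposures.user_id) AS overlapping_users
FROM `clean_room_project.shared_dataset.advertiser_audience` AS audience
INNER JOIN `subscriber_project.first_party.campaign_exposures` AS exposures
  ON audience.user_id = exposures.user_id;  -- joining on any other column would be rejected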

Using data clean rooms in BigQuery does not require creating copies of or moving sensitive data. Instead, the data can be shared directly from your BigQuery project and you remain in full control. Any updates you make to your shared data are reflected in the clean room in real-time, ensuring everyone is working with the most current data.

Create and deploy clean rooms in BigQuery

BigQuery data clean rooms are available in all BigQuery regions. You can set up a clean room environment using the Google Cloud console or using APIs. During this process, you set permissions and invite collaborators within or outside organizational boundaries to contribute or subscribe to the data.

Enforce analysis rules to protect underlying data

When sharing data into a clean room, you can configure analysis rules to protect the underlying data and determine how the data can be analyzed. BigQuery data clean rooms support multiple analysis rules including aggregation, differential privacy, list overlap, and join restrictions. The new user experience within Cloud console lets data contributors configure these rules without needing to use SQL.

Lastly, by default, a clean room employs restricted egress to prevent subscribers from exporting or copying the underlying data. However, data contributors can choose to allow the export and copying of query results for specific use cases, such as activation.

Monitor usage and stay in control of your data

The data owner or contributor is always in control of their respective data in a clean room. At any time, a data contributor can revoke access to their data. As the clean room owner, you can also adjust access using subscription management or privacy budgets to prevent subscribers from performing further analysis. Data contributors additionally receive aggregated logs and metrics, giving them insight into how their data is being used within the clean room. This promotes both transparency and a clearer understanding of the collaborative process.

What BigQuery data clean room customers are saying 

Customers across all industries are already seeing tremendous success with BigQuery data clean rooms. Here’s what some of our early adopters and partners had to say:

“With BigQuery data clean rooms, we are now able to share and monetize more impactful data with our partners while maintaining our customers’ and strategic data protection.” –  Guillaume Blaquiere, Group Data Architect, Carrefour

“Data clean rooms in BigQuery is a real accelerator for L’Oréal to be able to share, consume, and manage data in a secure and sustainable way with our partners.” –  Antoine Castex, Enterprise Data Architect, L’Oréal

“BigQuery data clean rooms equip marketing teams with a powerful tool for advancing privacy-focused data collaboration and advanced analytics in the face of growing signal loss. LiveRamp and Habu, which independently were each early partners of BigQuery data clean rooms, are excited to build on top of this foundation with our combined interoperable solutions: a powerful application layer, powered by Habu, accelerates the speed to value for technical and business users alike, while cloud-native identity, powered by RampID in Google Cloud, maximizes data fidelity and ecosystem connectivity for all collaborators. With BigQuery data clean rooms, enterprises will be empowered to drive more strategic decisions with actionable, data-driven insights.” – Roopak Gupta, VP of Engineering, LiveRamp

“In today’s marketing landscape, where resources are limited and the ecosystem is fragmented, solutions like the data clean room we are building with Google Cloud can help reduce friction for our clients. This collaborative clean room ensures privacy and security while allowing Stagwell to integrate our proprietary data to create custom audiences across our product and service offerings in the Stagwell Marketing Cloud. With the continued partnership of Google Cloud, we can offer our clients integrated Media Studio solutions that connect brands with relevant audiences, improving customer journeys and making media spend more efficient.” – Mansoor Basha, Chief Technology Officer, Stagwell Marketing Cloud

“We are extremely excited about the General Availability announcement of BigQuery data clean rooms. It’s been great collaborating with Google Cloud on this initiative and it is great to see it come to market. This release enables production-grade secure data collaboration for the media and advertising industry, unlocking more interoperable planning, activation and measurement use cases for our ecosystem.” – Bosko Milekic, Chief Product Officer, Optable

Next steps

Whether you’re an advertiser trying to optimize your advertising effectiveness with a publisher, or a retailer improving your promotional strategy with a CPG, BigQuery data clean rooms can help. Get started today by using this guide, starting a free trial with BigQuery, or contacting the Google Cloud sales team.


Get started with differential privacy and privacy budgeting in BigQuery data clean rooms

We are excited to announce that differential privacy enforcement with privacy budgeting is now available in BigQuery data clean rooms to help organizations prevent data from being reidentified when it is shared.

Differential privacy is an anonymization technique that limits the personal information that is revealed in a query output. Differential privacy is considered to be one of the strongest privacy protections that exists today because it:

is provably private
supports multiple differentially private queries on the same dataset
can be applied to many data types

Differential privacy is used by advertisers, healthcare companies, and education companies to perform analysis without exposing individual records. It is also used by public sector organizations that comply with the General Data Protection Regulation (GDPR), the Health Insurance Portability and Accountability Act (HIPAA), the Family Educational Rights and Privacy Act (FERPA), and the California Consumer Privacy Act (CCPA).

What can I do with differential privacy?

With differential privacy, you can:

protect individual records from re-identification without moving or copying your data
protect against privacy leaks and re-identification
use one of the anonymization standards most favored by regulators

BigQuery customers can use differential privacy to:

share data in BigQuery data clean rooms while preserving privacy
anonymize query results on AWS and Azure data with BigQuery Omni
share anonymized results with Apache Spark stored procedures and Dataform pipelines so they can be consumed by other applications
enhance differential privacy implementations with technology from Google Cloud partners Gretel.ai and Tumult Analytics
call frameworks like PipelineDP.io

So what is BigQuery differential privacy exactly?

BigQuery differential privacy comprises three capabilities:

Differential privacy in GoogleSQL – You can use differential privacy aggregate functions directly in GoogleSQL (see the sketch after this list)

Differential privacy enforcement in BigQuery data clean rooms – You can apply a differential privacy analysis rule to enforce that all queries on your shared data use differential privacy in GoogleSQL with the parameters that you specify

Parameter-driven privacy budgeting in BigQuery data clean rooms – When you apply a differential privacy analysis rule, you also set a privacy budget to limit the data that is revealed when your shared data is queried. BigQuery uses parameter-driven privacy budgeting to give you more granular control over your data than query thresholds do and to prevent further queries on that data when the budget is exhausted.
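
As an illustration of the first capability, here is a minimal sketch of a differentially private aggregation written directly in GoogleSQL. The epsilon, delta, and contribution bounds are placeholder values, and the table and column names are hypothetical.

-- Minimal sketch: a differentially private aggregation in GoogleSQL.
SELECT WITH DIFFERENTIAL_PRIVACY
  OPTIONS(epsilon = 1.0, delta = 1e-5, privacy_unit_column = customer_id)
  store_id,
  AVG(purchase_amount, contribution_bounds_per_group => (0, 500)) AS avg_purchase
FROM `my_project.my_dataset.transactions`
GROUP BY store_id;

When the differential privacy analysis rule is enforced in a clean room, queries that do not use differential privacy are rejected, and each successful query draws down the privacy budget configured by the data owner.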

BigQuery differential privacy enforcement in action

When you add data to a BigQuery data clean room, you can enable the differential privacy analysis rule and configure a privacy budget.

Subscribers of that clean room must then use differential privacy to query your shared data.

Subscribers of that clean room cannot query your shared data once the privacy budget is exhausted.

Get started with BigQuery differential privacy

BigQuery differential privacy is configured when a data owner or contributor shares data in a BigQuery data clean room. A data owner or contributor can share data using any compute pricing model and does not incur compute charges when a subscriber queries that data. Subscribers of a data clean room incur compute charges when querying shared data that is protected with a differential privacy analysis rule. Those subscribers are required to use on-demand pricing (charged per TB) or the Enterprise Plus edition (charged per slot hour).

Create a clean room where all queries are protected with differential privacy today and let us know where you need help.


Get excited about what’s coming for data professionals at Next ‘24

Google Cloud Next ’24 lands in Las Vegas on April 9, ready to unleash the future of cloud computing! This global event is where inspiration, innovation, and practical knowledge collide.  

Data cloud professionals, get ready to harness the power and scalability of Google Cloud operational databases – AlloyDB, Cloud SQL, and Spanner –  and leverage AI to streamline your data management and unlock real-time insights. You’ll learn how to master BigQuery for lightning-fast analytics on massive datasets, visualize your findings with the intuitive power of Looker, and use generative AI to get better insights from your data. With these cutting-edge Google Cloud tools, you’ll have everything you need to drive better, more informed decision-making.

Here are 10 must-attend sessions for data professionals to add to your Google Cloud Next ’24 agenda:

1. What’s next for Google Cloud databases • SPTL203

Dive into the world of Google Cloud databases and discover next-generation innovations that will help you modernize your database estate to easily build enterprise generative AI apps, unify your analytical and transactional workloads, and simplify database management with assistive AI. Join us to hear our vision for the future of Google Cloud databases and see how we’re pushing the boundaries alongside the AI ecosystem. 

2. What’s next for data analytics in the AI era • SPTL202

With the surge of new generative AI capabilities, companies and their customers can now interact with systems and data in new ways. To activate AI, organizations require a data foundation with the scale and efficiency to bring business data together with AI models and ground them in customer reality. Join this session to learn the latest innovations for data analytics and business intelligence, and see why tens of thousands of organizations are fueling their journey with BigQuery and Looker.

3. The future of databases and generative AI • DBS104

Learn how AI has the potential to revolutionize the way applications interact with databases and explore the exciting future of Google Cloud‘s managed databases, including Cloud SQL, AlloyDB, and Spanner. We will also delve into the most intriguing frontiers on the horizon, such as vector search capabilities, natural language processing in databases, and app migration with large language model-powered code migration.

4. What’s new with BigQuery • ANA112

Join this session to explore all the latest BigQuery innovations to support all structured or unstructured data, across multiple and open data formats, and cross-clouds; all workloads, whether SQL, Spark, or Python; and new generative AI use cases with built-in AI to supercharge the work of data teams. Learn how you can take advantage of BigQuery, a single offering that combines data processing, streaming, and governance capabilities to unlock the full power of your data.

5. A deep dive into AlloyDB for PostgreSQL • DBS216

Powered by Google-embedded storage for superior performance with full PostgreSQL compatibility, AlloyDB offers the best of the cloud. With scale-out architecture, a no-nonsense 99.99% availability SLA, intelligent caching, and ML-enabled adaptive systems, it’s got everything you need to simplify database management. In this session, we’ll cover what’s new in AlloyDB, plus take a deep dive into the technology that powers it.

6. BigQuery’s multimodal data foundation for the Gemini era • ANA116

In the era of multimodal generative AI, a unified, governance-focused data platform powered by Gemini is now paramount. Join this session to learn how BigQuery fuels your data and AI lifecycle, from training to inference, by unifying all your data — structured or unstructured — while addressing security and governance. 

7. Best practices to maximize the availability of your Cloud SQL databases • DBS210

Customers use Cloud SQL to run their business-critical applications. The session will dive deep into Enterprise Plus edition features, how Cloud SQL achieves near-zero downtime maintenance, and behaviors that affect availability and mitigations — all of which will prepare you to be an expert in configuring and monitoring Cloud SQL for maximum availability.

8. Make generative AI work: Best practices from data leaders • ANA122

Join experts from Databricks, MongoDB, Confluent, and Dataiku for an exclusive executive discussion on harnessing generative AI’s transformative potential. We’ll explore how generative AI breaks down multicloud data silos to enable informed decision-making and unlock your data’s full value. Discover strategies for integrating generative AI, addressing challenges, and building a future-proof, innovation-driven data culture.

9. Spanner: Beyond relational? Yahoo! Mail says yes • DBS203

Imagine running your non-relational workloads at relational consistency and unlimited scale. Yahoo! dared to dream it, and with Google Spanner, it plans to make it a reality. Dive into its modernization plans for the Mail platform, consolidating diverse databases, and unlocking innovation with unprecedented scale. 

10. Talk with your business data using generative AI • ANA113

Getting insights from your business data should be as easy as asking Google. That is the Looker mission – instant insights, timely alerts when it matters most, and faster, more impactful decisions, all powered by the most important information: yours. Join this session to learn how AI is reshaping our relationship with data and how Looker is leading the way.


Enrich your streaming data using Bigtable and Dataflow

Data engineers know that eventing is all about speed, scale, and efficiency. Event streams are high-volume data feeds coming off sources such as point-of-sale devices or websites logging stateless clickstream activity, and they carry lightweight event payloads that often lack the information needed to make each event actionable on its own. It is up to the consumers of the event stream to transform and enrich the events, followed by further processing as required for their particular use case.

Key-value stores such as Bigtable are the preferred choice for such workloads, with their ability to process hundreds of thousands of events per second at very low latencies. However, key-value lookups often require a lot of careful productionisation and scaling code to ensure the processing can happen with low latency and good operational performance.

With the new Apache Beam Enrichment transform, this process is now just a few lines of code: you can read events from messaging systems like Pub/Sub or Apache Kafka, enrich them with data in Bigtable, and send them along for further processing.

This is critical for streaming applications, as streaming joins enrich the data to give meaning to the streaming event. For example, knowing the contents of a user’s shopping cart, or whether they browsed similar items before, can bring valuable context to clickstream data that feeds into a recommendation model. Identifying a fraudulent in-store credit card transaction requires much more information than what’s in the current transaction, for example, the location of the prior purchase, count of recent transactions or whether a travel notice is in place. Similarly, enriching telemetry data from factory floor hardware with historical signals from the same device or overall fleet statistics can help a machine learning (ML) model predict failures before they happen.

The Apache Beam enrichment transform can take care of the client-side throttling to rate-limit the number of requests being sent to the Bigtable instance when necessary. It retries the requests with a configurable retry strategy, which by default is exponential backoff. If coupled with autoscaling, this allows Bigtable and Dataflow to scale up and down in tandem and automatically reach an equilibrium. Beam 2.54.0 supports exponential backoff, which can be disabled or replaced with a custom implementation.

Let’s see this in action:

with beam.Pipeline() as p:
    output = (
        p
        | "Read from PubSub" >> beam.io.ReadFromPubSub(subscription=SUBSCRIPTION)
        | "Convert bytes to Row" >> beam.ParDo(DecodeBytes())
        | "Enrichment" >> Enrichment(bigtable_handler)
        | "Run Inference" >> RunInference(model_handler)
    )

The above code runs a Dataflow job that reads from a Pub/Sub subscription and performs data enrichment by doing a key-value lookup against a Bigtable cluster. The enriched data is then fed to a machine learning model via RunInference.

Dataflow and Bigtable work in harmony to scale correctly based on the load. When the job starts, the Dataflow runner starts with one worker while the Bigtable cluster has three nodes, with autoscaling enabled for both Dataflow and Bigtable. We observe a spike in the input load for Dataflow at around 5:21 PM that leads it to scale to 40 workers.

This increases the number of reads to the Bigtable cluster. Bigtable automatically responds to the increased read traffic by scaling to 10 nodes to maintain the user-defined CPU utilization target.

The enriched events can then be used for inference, either with models embedded in the Dataflow worker or with Vertex AI.

This Apache Beam transform can also be useful for applications that serve mixed batch and real-time workloads from the same Bigtable database, for example multi-tenant SaaS products and interdepartmental line of business applications. These workloads often take advantage of built-in Bigtable mechanisms to minimize the impact of different workloads on one another. Latency-sensitive requests can be run at high priority on a cluster that is simultaneously serving large batch requests with low priority and throttling requests, while also automatically scaling the cluster up or down depending on demand. These capabilities come in handy when using Dataflow with Bigtable, whether it’s to bulk-ingest large amounts of data over many hours, or process streams in real-time.

Conclusion

With a few lines of code, we are able to build a production pipeline that translates to many thousands of lines of production code under the covers, allowing Pub/Sub, Dataflow, and Bigtable to seamlessly scale the system to meet your business needs! And as machine learning models evolve over time, it will be even more advantageous to use a NoSQL database like Bigtable which offers a flexible schema. With the upcoming Beam 2.55.0, the enrichment transform will also have caching support for Redis that you can configure for your specific cache. To get started, visit the documentation page.


Combine data across BigQuery and Salesforce Data Cloud securely with zero ETL

We are excited that bidirectional data sharing between BigQuery and Salesforce Data Cloud is now generally available. This will make it easy for customers to enrich their data use cases by combining data across different platforms securely, without the additional cost of building or managing data infrastructure and complex ETL (Extract, Transform, Load) pipelines.

Responding to customer needs quickly is more critical than ever, with a growing number of touchpoints and devices to deliver in-the-moment customer experiences. But it’s becoming increasingly challenging to do so, as the amount of data created and captured continues to grow, and is spread across multiple SaaS applications and analytics platforms. 

Last year, Google Cloud and Salesforce announced a partnership that makes it easy for customers to combine their data across BigQuery and Salesforce Data Cloud, and leverage the power of BigQuery and Vertex AI solutions to enrich and unlock new analytics and AI/ML scenarios. 

Today, we are launching the general availability of these capabilities, enabling joint Google Cloud and Salesforce customers to securely access their data across different platforms and clouds. Customers can access their Salesforce Data Cloud data in BigQuery without having to set up or manage infrastructure, as well as use their Google Cloud data to enrich Salesforce Customer 360 and other applications.

With this announcement, Google Cloud and Salesforce customers benefit from:

A single pane of glass and serverless data access across platforms with zero ETL

Governed and secure bi-directional access to their Salesforce Data Cloud data and BigQuery in near real time without needing to create data infrastructure and data pipelines.

The ability to use their Google Cloud data to enrich Salesforce Customer 360 and Salesforce Data Cloud, as well as to enrich customer data with other relevant datasets, including public datasets, with minimal data movement.

The ability to leverage differentiated Vertex AI and Cloud AI services for predictive analytics and churn modeling, with results flowing back to customer campaigns through the Vertex AI and Einstein Copilot Studio integration.

Customers who want to look at their data holistically across Salesforce and Google platforms, spanning cloud boundaries, can do so by leveraging BigQuery Omni and Analytics Hub. This integration allows data and business users including data analysts, marketing analysts, and data scientists, to combine data across Salesforce and Google platforms to analyze, derive insights and run AI/ML pipelines, all in a self-service capacity, without the need to involve data engineering or infrastructure teams.

This integration is fully managed and governed, allowing customers to focus on analytics and insights and avoid several critical business challenges that are typical when integrating critical enterprise systems. These innovations enforce data access and governance policy set for the data by admins. Only datasets that are explicitly shared are available for access and only authorized users are able to share and explore the data. With data spanning multiple clouds and platforms, relevant data is pre-filtered with minimal copying from Salesforce Data Cloud to BigQuery, reducing both egress costs and data engineering overhead.

“Trying to get a handle on our customer data was a nightmare until we seamlessly connected Google Cloud and Salesforce Data Cloud. No more copying data between platforms, no more struggling with complex APIs. It’s revolutionized how we do segmentation, understand our customers, and drive better marketing and service actions.” – Large insurance company in NorthAM

“We faced several challenges in making the most of our customer data. Enriching leads between our first-party BigQuery data and Salesforce was a slow, manual process. It also made creating timely, data-driven lifecycle marketing journeys difficult due to batch data transfers. By seamlessly integrating BigQuery and Salesforce, we’ve transformed these processes. The integration fuels our automated marketing campaigns with real-time data triggers, significantly enhancing customer engagement. Best of all, this solution eliminated the manual overhead of batch data transfers, saving us valuable time and resources. It’s a win-win for our marketing team and our bottom line.” – Large retailer in NorthAM

Easy and secure access to Salesforce Data Cloud from Google Cloud

Customers want to access and combine their marketing, commerce and service data in Salesforce Data Cloud with loyalty and point-of-sale data in Google analytics platforms to derive actionable insights about their customer behavior such as propensity to buy, cross-sell/up-sell recommendations and run highly personalized promotional campaigns. They also want to leverage differentiated Google AI services to build machine learning models on top of combined Salesforce and Google Cloud data for training/predictions, enabling use cases such as churn modeling, customer funnel analysis, market-mix modeling, price elasticity, and A/B test experimentation.

With the launch, customers can get access to their Salesforce Data Cloud data seamlessly through differentiated BigQuery cross-cloud and data sharing capabilities. They can access all the relevant information needed to perform cross-platform analytics in a privacy safe manner with other Google assets, and power ad campaigns. Salesforce Data Cloud Admins can easily share data, directly with the relevant BigQuery users or groups. BigQuery users can easily subscribe to shared datasets through the Analytics Hub UI.
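
As a rough sketch of what this looks like in practice, once a Salesforce Data Cloud share has been subscribed to as a linked dataset through Analytics Hub, it can be queried and joined like any other BigQuery dataset. All project, dataset, table, and column names below are hypothetical.

-- Hypothetical: join subscribed Salesforce Data Cloud data (exposed as an
-- Analytics Hub linked dataset) with first-party loyalty data in BigQuery.
SELECT
  sf.customer_segment,
  COUNT(DISTINCT loyalty.member_id) AS members,
  SUM(loyalty.points_redeemed) AS total_points_redeemed
FROM `my_project.salesforce_linked_dataset.unified_individuals` AS sf
JOIN `my_project.retail_analytics.loyalty_activity` AS loyalty
  ON sf.individual_id = loyalty.member_id
GROUP BY sf.customer_segment;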

There are several different ways to share information with this platform integration: 

For smaller datasets and ad hoc access, for example to find the store that had the largest sales last year, you can leverage a single cross-cloud join of your Salesforce Data Cloud and Google Cloud datasets, with minimal movement or duplication of data.

For larger data sets that are powering your executive update, weekly business review or marketing campaign dashboards, you can access the data using cross-cloud materialized views that are automatically and incrementally updated and only bring the incremental data periodically.

Enrich Salesforce Customer 360 with data stored on Google Cloud

We also hear from customers — especially retailers — that they want to access and combine their data in Salesforce Data Cloud with the behavioral data captured in Google Analytics from their websites and mobile apps, to build a richer customer 360 profile, derive actionable insights, and deliver personalized messaging with the rich capabilities of Salesforce Data Cloud. We are making it easier than ever to break down data silos and give customers seamless real-time access to Google Analytics data within Salesforce Data Cloud, so they can build richer customer profiles and personalized experiences.

Salesforce Data Cloud customers can use a simple point-and-click flow to connect to their Google Cloud account, select the relevant BigQuery datasets, and make them available as External Data Lake Objects, providing live access to the data. Once created, these Data Lake Objects behave like native Data Cloud objects, enriching Customer 360 data models and deriving insights that power real-time analytics and personalization. This integration eliminates the need to build and monitor ETL pipelines for data integration, removing the operational overhead and latency of the traditional ETL copy approach.

Breaking down walls between Salesforce and Google data

This Google Cloud and Salesforce Data Cloud platform integration empowers organizations to break down data silos, gain actionable insights, and deliver exceptional customer experiences. With seamless data sharing, unified access, and the power of Google AI, this partnership is transforming the way businesses leverage their data for success.

Through unique cross-cloud functionality of BigQuery Omni and data sharing capabilities of Analytics Hub, customers can directly access data stored in Salesforce Data Cloud and combine it with data in Google Cloud to enrich it further for business insights and activation. Customers are not only able to view their data across clouds but perform unparalleled cross-cloud analytics without the need to build custom ETL or move data. 

To learn more about the collaboration between Google and Salesforce, check out this partnership page, the introduction video and quick start guide.


At least once Streaming: Save up to 70% for Streaming ETL workloads

Historically, Dataflow Streaming Engine has offered exactly-once processing for streaming jobs. Recently, we launched at-least-once streaming mode as an alternative that lowers the latency and cost of streaming data ingestion. In this post, we explain both streaming modes and provide guidance on how to choose the right mode for your streaming use case.

Exactly-once: what it is and why it matters

Applications that react to incoming events sometimes require that each event be reflected in the output exactly once — meaning the event is not lost, nor accepted more than a single time. But as the processing pipeline scales, load-balances, or encounters faults, that deduplication of events imposes a computational cost, affecting overall cost and latency of the system.

Dataflow Streaming provides an exactly-once guarantee, meaning that the effects of data processed in the pipeline are reflected at least and at most once. Let’s unpack that a little bit. For every arriving message, whether it’s from an external source or an upstream shuffle, Dataflow ensures that the message will be processed and not lost (at-least-once). Additionally, results of that processing that remain within the pipeline, like state updates and outputs to a subsequent shuffle to the next pipeline stage, are also reflected at-most once. This guarantee enables, among other things, performing exact aggregations, such as exact sums or counts.

Exactly-once inside the pipeline is usually only half the story. As pipeline authors and runners, we really want to get the results of processing out of Dataflow and into a downstream system. Here we run into a common roadblock: no general at-most-once guarantee is made about side effects of the pipeline. Without further effort, any side effect, such as output to an external store, may generate duplicates. Careful work must be done to orchestrate the writes in a way that avoids duplicates. The key challenge is that, in the general case, it is not possible to implement exactly-once operation in a distributed system without a consensus protocol that involves all the actors. For internal state changes, such as state updates and shuffle, exactly-once is achieved by a careful protocol. With sufficient support from data sinks, we can thus have exactly-once all the way through the pipeline and to its output. An example is the Storage Write API version of the BigQueryIO.Write implementation, which ensures exactly-once delivery of data to BigQuery.

But even without exactly-once semantics at the sink, exactly-once semantics within the pipeline can be useful. Duplicates on the output may be acceptable, as long as they are duplicates of correct results — with exactly-once having been required to achieve this correctness.

At-least once: what it is and why it matters

There are other use cases where duplicates may be acceptable, for example ETL or Map-Only pipelines that are not performing any aggregation but rather only per-message operations. In these cases, duplicates are simple replays of data through the pipeline.

But why wouldn’t you choose to use exactly-once semantics? Aren’t stronger guarantees always better? The reason is that achieving exactly-once adds to pipeline latency and cost. This happens for several reasons, some obvious, and some quite subtle. Let’s take a deeper look.

In order to achieve exactly-once, we must store and read exactly-once metadata. In practice, the storage and read costs incurred to do this turn out to be quite expensive, especially in pipelines that otherwise perform very little I/O. Less intuitively, having to perform this metadata-based deduplication dictates how we implement the backend.

For example, in order to deduplicate messages across the shuffle we must ensure that all replays are idempotent — which means we must checkpoint the results of processing before they are sent to shuffle — which again increases cost and latency. 

Another example: to de-duplicate the input from Pub/Sub, we must first re-shuffle incoming messages on keys that are deterministically derived from a given message because the deduplication metadata is stored in the per-key state. Performing this shuffle using deterministic keys exposes us to additional cost and latency. In the next section we lay out the reasons for this in more detail.

We cannot assume that at-least-once semantics are acceptable for the user, so we default to the strictest semantics, i.e., exactly-once. If we know beforehand that at-least-once processing is acceptable, and we can relax our constraints, then we can make more cost- and latency-beneficial implementation decisions.

Exactly-once vs. at-least-once when reading from Pub/Sub

Pub/Sub reads’ latency and cost in particular benefit from at-least-once mode. To understand why, let’s look closer at how exactly-once deduplication is implemented. 

Pub/Sub reads are implemented in the Dataflow backend workers. To acquire new messages, each Dataflow backend worker makes remote procedure calls (RPCs) internally to the Pub/Sub service. Because RPCs can fail, workers can crash, and other failures are possible, messages are replayed until successful processing is acknowledged by the backend worker. Pub/Sub and backend workers are dynamic systems without static partitioning, meaning that replays of messages from Pub/Sub are not guaranteed to arrive at the same backend worker. This poses a challenge when deduplicating these messages.

In order to perform deduplication, the backend worker puts these messages through a shuffle, attaching a key internally to each message. The key is chosen deterministically based on the message, or on a message ID attribute if configured, so that a replay of a duplicate message is deterministically shuffled to the same key. Doing so allows deduplicating replays from Pub/Sub in the same manner that shuffle replays are deduplicated between stages in the Dataflow pipeline (see this detailed discussion).

This design contributes to cost and latency in two significant ways. First, as with all deduplication in the Dataflow backend, a read against the persistent store may be required. While heavily optimized with caches and bloom filters, it cannot be completely eliminated. Second, and often even more significant, is the need to shuffle the data on a deterministic key. If a particular key or worker is slow or becomes a bottleneck, this creates head-of-line blocking that prevents other traffic from flowing — an artificial constraint since the messages in the queue are not semantically tied to this key. 

When at-least-once processing is acceptable, we can both eliminate the cost associated with reads from the persistent store and the shuffling of messages on a deterministic key. In fact, we can do better — we still shuffle the messages, but the key we pick instead is the current “least-loaded” key, meaning the key that is currently experiencing the least queueing. In this way, we evenly distribute the incoming traffic to maximize throughput, even when some keys or workers are experiencing slowness. 

We can see this in action in a benchmark where we simulate stragglers, e.g., slow writes to an external sink, by artificially delaying arbitrary messages at low probability for multi-minute intervals.

In this benchmark, the at-least-once pipeline sustained much more consistent throughput than the exactly-once pipeline in the presence of such stragglers, dramatically decreasing average latency. In other words, even though both configurations still have high tail latency, the latency outliers no longer affect the bulk of the distribution in the at-least-once configuration.

Benchmarks: at-least-once vs. exactly-once

We ran three representative benchmark streaming jobs to evaluate the impact of the streaming-mode choice on cost. To maximize the cost benefit, we enabled resource-based billing and aligned the I/O to the streaming mode.

Note that cost depends on multiple factors such as data-load characteristics, the specific pipeline composition, configuration and the I/O used. Therefore, our benchmarking results may differ from what you observe in your test and production pipelines.

Spotify’s own testing supported the Dataflow team’s findings:

“By incorporating at-least-once mode in our platform that is built on Dataflow, Pub/Sub, and Bigtable, we have seen a portion of our Dataflow jobs cut costs by 50%. Since this is used by several consumers, 7 downstream systems are now cheaper overall with this simple change. Because of the way this system works, there has been 0 effects of duplicates! We plan on turning this feature on in more jobs that are compatible to cut down our Dataflow costs even more.” – Sahith Nallapareddy, Software Engineer, Spotify

Choose the right streaming mode for your job

When creating streaming pipelines, choosing the right mode is essential. The critical factor is to determine whether the pipeline can tolerate duplicate records in the output or any intermediate processing stages.

At-least-once mode can help optimize cost and performance in the following cases:

Map-only pipelines performing idempotent per-message operations, e.g. ETL jobs

When deduplication happens in the destination, e.g., in BigQuery or Bigtable (see the sketch after this list)

Pipelines that already use an at-least-once I/O sink, e.g., Storage API At Least Once or PubSub I/O
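
As a rough sketch of the second case above (deduplication in the destination), assuming each event carries a unique event_id and a publish_time, duplicates delivered at least once can be removed downstream in BigQuery. The table and column names are hypothetical.

-- Hypothetical: deduplicate at-least-once output in BigQuery, keeping the
-- earliest delivery of each event_id.
CREATE OR REPLACE TABLE mydataset.events_deduped AS
SELECT * EXCEPT(delivery_rank)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY publish_time) AS delivery_rank
  FROM mydataset.events_raw
)
WHERE delivery_rank = 1;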

Exactly-once mode is preferable in the following cases:

Use cases that cannot tolerate duplicates within Dataflow

Pipelines that perform exact aggregations as part of stream processing

Map-only jobs that perform non-idempotent per-message operations

At-least-once streaming mode is now generally available for Dataflow streaming customers. You can enable at-least-once mode by setting the at-least-once Dataflow service option when starting a new streaming job using the API or gcloud. To get started, we also offer a selection of commonly used Dataflow streaming templates that support the streaming modes.


Built with BigQuery: How Pendo Data Sync maximizes ROI on your data

Editor’s note: This post is part of a series showcasing partner solutions that are Built with BigQuery.

Data is paramount when making critical business decisions. Whether it’s identifying new opportunities, streamlining operations, or delivering more value to customers, leadership wants to know that they have the facts on their side when making crucial choices about the future. That means having and evaluating the right data – the data that comes from their products.

Pendo provides companies with the most effective way to infuse product data into their greater business strategy. As a company on a mission to elevate the world’s experience with software, Pendo has created Data Sync, a product that makes it easier to connect valuable behavioral insights from applications to the broader business. With Data Sync, teams have visibility into data about customer engagement and health, regardless of where it sits in an organization.

Siloed product data is wasted product data

Without the ability to incorporate product-level behavioral insights into business reporting, companies can’t understand the holistic journey of their users. Given the current macroeconomic conditions, businesses are under extra pressure to retain and grow their existing customer base, and not understanding user behavior in its full context is a clear disadvantage. 

Too often an incomplete picture of product usage is the result of compiling data together piece by piece or via an API pull, which doesn’t provide the value businesses need. This approach can distract and pull resources away from building a scalable, supported data strategy model. Without a single source of truth into product data, each team is left making crucial business decisions with only part of the picture. This results in siloed data and wasted insights that could help answer crucial questions, such as:

Are users deriving value from key product areas that align with commercial models? 

Are customer cohorts as engaged and healthy as their point of renewal approaches?

When looking to maximize the lifetime value of customer accounts, what behavior trends can inform sales strategy? 

Do customer teams have the insights they need to proactively address risk and churn?

Rich insights to answer crucial questions

There’s an alternative to this fragmented approach. With Pendo Data Sync for Google Cloud, teams can quickly make Pendo data available whenever they need it in Cloud Storage and access it in BigQuery, or any other data lake or warehouse. By connecting behavioral data from product applications (i.e., user interactions with them) with other key business data sources, Pendo enables organizations to make important decisions with confidence. 
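
As a rough sketch of that access pattern, files exported by Data Sync to Cloud Storage can be loaded into BigQuery with a LOAD DATA statement. The bucket path, file format, and table name below are hypothetical and depend on how your Data Sync export is configured.

-- Hypothetical: load Data Sync export files from Cloud Storage into BigQuery.
LOAD DATA INTO `my_project.product_analytics.pendo_events`
FROM FILES (
  format = 'AVRO',
  uris = ['gs://my-pendo-datasync-bucket/events/*.avro']
);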

Pendo makes it easy to:

Easily access all historical behavioral and product utilization data. Structure and take action on the valuable data across your entire data ecosystem. In Google Cloud, data teams can maximize the value of BigQuery and Looker by integrating these additional datasets to help generate business insights that were previously unavailable.

Democratize data to drive informed decision making. Instantly join product data with the full breadth of all your key business data at the depth required to support a modern data-driven organization.

Drive better resource utilization. Remove the need to manually pull siloed data sets and give your data and engineering teams back time to spend on more valuable activities.

Get maximum ROI on your product insights. Enable every department — all the way up to the C-suite — to speak the same universal language of the business.

Make faster, more confident decisions. Put product data at the center of your decision-making processes.

Democratizing data to buttress the business

Pendo data acts as our customers’ primary source of truth for the applications they build, sell, and deploy. With Data Sync, these insights are no longer confined to the product teams using Pendo directly — they become available for all. With a straightforward extract, transform, and load (ETL) process, every department and cross-functional initiative can easily access Pendo data in a centralized location. 

To date, Data Sync customers have used Pendo data for a wide variety of use cases. These include measuring the impact of product improvements on sales and renewals, identifying friction in the end-to-end user journey, and calculating a comprehensive customer health score across all data sources to shape renewal strategy. Other teams have also leveraged Pendo data to create a churn-risk model based on a foundation of product usage and sentiment signals, or to identify up-sell and cross-sell opportunities to drive data-informed account growth.

Pendo Data Sync is ‘Built with BigQuery’

Building Data Sync on top of Google Cloud made the development and initial rollout of the solution very straightforward. Pendo’s engineering team was able to leverage its existing knowledge of Google Cloud products and services to build the beta version of Data Sync. This version was then rolled out to design partners using Google Cloud, helping to standardize the methods and best practices for pulling bulk Pendo data from Cloud Storage into BigQuery. Additionally, Pendo’s data science team uses BigQuery as its data warehouse solution and Looker as its primary business intelligence tool. The data team became the first “alpha” customer for Data Sync, partnering with engineering to design the product’s data structure, ETL best practices, and help documentation. 

Looking ahead, Pendo is exploring even deeper integration into Google Cloud, including data sharing directly in Analytics Hub.

Below is a diagram of the architecture Data Sync uses for its data flow:

Utilize your product data to the fullest

By partnering with Pendo, you can take maximum advantage of your product utilization data. Combining quantitative and qualitative feedback with other related data sets can give you a holistic view into user engagement and health, empowering your teams to make strategic decisions about customer retention and growth — backed by data.

To learn more about Data Sync, check out Pendo in Google Cloud Marketplace or contact your Google or Pendo sales representative.

You can also request a custom demo with a Pendo team member to learn how your business can use Pendo to increase customer engagement and health.

The ‘Built with BigQuery’ advantage for ISVs and data providers

Built with BigQuery helps companies like Pendo build innovative applications with Google Data Cloud. Participating companies can: 

Accelerate product design and architecture through access to designated experts who can provide insight into key use cases, architectural patterns, and best practices. 

Amplify success with joint marketing programs to drive awareness, generate demand, and increase adoption. 

BigQuery gives ISVs the advantage of a powerful, highly scalable unified AI lakehouse that’s integrated with Google Cloud’s open, secure, sustainable platform. To find out how to harness the full potential of your data to drive innovation, learn more about Built with BigQuery.


Introducing new BigQuery features to simplify time-series data analysis

We are excited to invite customers to join a preview of new SQL features that simplify time series analysis in BigQuery. These new features simplify writing queries that perform two of the most common time series operations: windowing and gap filling. We are also introducing the RANGE data type and supporting functions. The RANGE type represents a continuous window of time and is useful for recording time-based state of a value. Combined, these features make it easier to write time series queries over your data in BigQuery.

Time-series analytics and data warehouses

Time-series data is an important, and growing class of data for many enterprises. Operational data like debug logs, metrics, and event logs are inherently time-oriented and full of valuable information. Additionally, connected devices common in manufacturing, scientific research, and other domains continually produce important metrics that need to be analyzed. All of this data follows the same pattern: the data is reported as values at a specific time, and deriving meaning from that data requires understanding how it changes over time.

Users with time-series analytics problems have traditionally relied on domain-specific solutions that result in harder to access data silos. Additionally, they often have their own query languages, which require additional expertise to use. Organizations looking to analyze their time-series data alongside the rest of their data must invest in pipelines that first clean and normalize the data in one store before exporting it to their data warehouse. This slows down time-to-insights and potentially hides data from the users who could build value from the data. Adding more time-series analytics features to BigQuery SQL democratizes access to this valuable data. Organizations can now leverage BigQuery’s scalable serverless architecture to store and analyze their time-series data alongside the rest of their business data without having to build and maintain expensive pipelines first.

Introducing time-series windowing and gap filling functions

The first step in most time-series queries is to map individual data points to output windows of a specific time duration and alignment. For example, you want to know the average of a sampled value every ten minutes over the past twelve hours. The underlying data is often sampled at arbitrary times so the user’s query must include logic to map the input times to the output time windows. BigQuery’s new time bucketing functions provide a full-featured method for performing this mapping.

Let’s look at an example using a time series data set showing the air quality index (AQI) and temperature recorded over part of the day. Each row in the table represents a single sample of time-series data. In this case, we have a single time series for the zip code 60606, but the table can easily hold data for multiple time series, all represented by their zip code.

CREATE OR REPLACE TABLE mydataset.environmental_data_hourly AS
SELECT * FROM UNNEST(
  ARRAY<STRUCT<zip_code INT64, time TIMESTAMP, aqi INT64, temperature INT64>>[
    STRUCT(60606, TIMESTAMP '2020-09-08 00:30:51', 22, 66),
    STRUCT(60606, TIMESTAMP '2020-09-08 01:32:10', 23, 63),
    STRUCT(60606, TIMESTAMP '2020-09-08 02:30:35', 22, 60),
    STRUCT(60606, TIMESTAMP '2020-09-08 03:29:39', 21, 58),
    STRUCT(60606, TIMESTAMP '2020-09-08 06:31:14', 22, 56),
    STRUCT(60606, TIMESTAMP '2020-09-08 07:31:06', 28, 55)
]);

Before doing additional processing we’d like to assign each row to a specific time bucket that is two-hours wide. We’ll also aggregate within those windows to compute the average AQI and maximum temperature.

SELECT
  TIMESTAMP_BUCKET(time, INTERVAL 2 HOUR) AS time,
  zip_code,
  CAST(AVG(aqi) AS INT64) AS aqi,
  MAX(temperature) AS max_temperature
FROM mydataset.environmental_data_hourly
GROUP BY zip_code, time
ORDER BY zip_code, time;

+---------------------+----------+-----+-----------------+
| time                | zip_code | aqi | max_temperature |
+---------------------+----------+-----+-----------------+
| 2020-09-08 00:00:00 | 60606    | 23  | 66              |
| 2020-09-08 02:00:00 | 60606    | 22  | 60              |
| 2020-09-08 06:00:00 | 60606    | 25  | 56              |
+---------------------+----------+-----+-----------------+

Note that we can use any valid INTERVAL to define the width of the output window. We can also define an arbitrary origin for aligning the windows if we want to start the windows at something other than 00:00:00.
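
For example, here is a minimal sketch of the same aggregation with two-hour buckets aligned to a 01:00 origin instead of midnight, using the optional third argument of TIMESTAMP_BUCKET:

-- Two-hour buckets aligned to an origin of 01:00 instead of 00:00.
SELECT
  TIMESTAMP_BUCKET(time, INTERVAL 2 HOUR, TIMESTAMP '2020-09-08 01:00:00') AS time,
  zip_code,
  CAST(AVG(aqi) AS INT64) AS aqi,
  MAX(temperature) AS max_temperature
FROM mydataset.environmental_data_hourly
GROUP BY zip_code, time
ORDER BY zip_code, time;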

Gap filling is another common step in time-series analysis that addresses gaps in time in the underlying data. A server producing metrics may reset, stopping the flow of data in the process. A lack of user traffic may mean gaps in event data. Or, data may be inherently sparse. Users want to control how these gaps are filled in before joining with other data or preparing it for graphing. The newly released GAP_FILL table-valued function does just that.

In our previous example, we have a gap between 02:00 and 06:00 in the output data because there is no raw data between 04:00:00 and 05:59:59. We would like to fill in that gap before joining this data with other time-series data. Below, we demonstrate two modes of backfill supported by the GAP_FILL TVF: “last observation carried forward” (locf) and linear interpolation. The TVF also supports inserting nulls.

WITH aggregated_2_hr AS (
  SELECT
    TIMESTAMP_BUCKET(time, INTERVAL 2 HOUR) AS time,
    zip_code,
    CAST(AVG(aqi) AS INT64) AS aqi,
    MAX(temperature) AS max_temperature
  FROM mydataset.environmental_data_hourly
  GROUP BY zip_code, time
  ORDER BY zip_code, time)

SELECT *
FROM GAP_FILL(
  TABLE aggregated_2_hr,
  ts_column => 'time',
  bucket_width => INTERVAL 2 HOUR,
  partitioning_columns => ['zip_code'],
  value_columns => [
    ('aqi', 'locf'),
    ('max_temperature', 'linear')
  ]
)
ORDER BY zip_code, time;

+---------------------+----------+-----+-----------------+
| time                | zip_code | aqi | max_temperature |
+---------------------+----------+-----+-----------------+
| 2020-09-08 00:00:00 | 60606    | 23  | 66              |
| 2020-09-08 02:00:00 | 60606    | 22  | 60              |
| 2020-09-08 04:00:00 | 60606    | 22  | 58              |
| 2020-09-08 06:00:00 | 60606    | 25  | 56              |
+---------------------+----------+-----+-----------------+

GAP_FILL can also be used to produce aligned output data without having to bucket the input data first.

Combined, these new bucketing and gap-filling operations provide key building blocks of time-series analytics. Users can now leverage BigQuery’s serverless architecture, high-throughput streaming ingestion, massive parallelism, and built-in AI to derive valuable insights from their time-series data alongside the rest of their business data, all without relying on a separate data silo.

Importantly, these new functions align with the standard SQL syntax and semantics that analytics experts are familiar with. There is no need to learn a new language or a new computation model. Simply add these functions into your existing workflows and start delivering insights.

Introducing the RANGE data type

The RANGE type and functions make it easier to work with contiguous windows of data. For example, RANGE<DATE> “[2024-01-01, 2024-02-01)” represents all DATE values starting from 2024-01-01 up to and excluding 2024-02-01. RANGE can also be unbounded on either end, representing the beginning and end of time respectively. RANGE is useful for representing a state that is true, or valid, over a contiguous period of time. Quota assigned to a user, the exchange rate of a currency over different times, the value of a system setting, or the version number of an algorithm are all great examples.

We are also introducing features that allow you to combine or expand RANGEs at query time. RANGE_SESSIONIZE is a TVF that allows you to combine overlapping or adjacent RANGEs into a single row. The table below shows how users can use this feature to create a smaller table with the same information:

CREATE OR REPLACE TABLE mydataset.sensor_metrics AS
SELECT * FROM UNNEST(
  ARRAY<STRUCT<sensor_id INT64, duration RANGE<DATETIME>, flow INT64, spins INT64>>[
    (1, RANGE<DATETIME> "[2020-01-01 12:00:01, 2020-01-01 12:05:23)", 10, 1),
    (1, RANGE<DATETIME> "[2020-01-01 12:05:12, 2020-01-01 12:10:46)", 10, 20),
    (1, RANGE<DATETIME> "[2020-01-01 12:10:27, 2020-01-01 12:15:56)", 11, 4),
    (2, RANGE<DATETIME> "[2020-01-01 12:05:08, 2020-01-01 12:10:30)", 21, 2),
    (2, RANGE<DATETIME> "[2020-01-01 12:10:22, 2020-01-01 12:15:42)", 21, 10)
]);

RANGE_SESSIONIZE finds all overlapping ranges and outputs the session range, which is the union of all ranges that overlap within each (sensor_id, flow) partition. We then group by the session range to produce a table that combines the overlapping rows:

SELECT sensor_id, session_range, flow
FROM RANGE_SESSIONIZE(
  (SELECT sensor_id, duration, flow FROM mydataset.sensor_metrics),
  "duration",
  ["sensor_id", "flow"],
  "OVERLAPS")
ORDER BY sensor_id, session_range;

+-----------+--------------------------------------------+------+
| sensor_id | session_range                              | flow |
+-----------+--------------------------------------------+------+
| 1         | [2020-01-01 12:00:01, 2020-01-01 12:10:46) | 10   |
| 1         | [2020-01-01 12:00:01, 2020-01-01 12:10:46) | 10   |
| 1         | [2020-01-01 12:10:27, 2020-01-01 12:15:56) | 11   |
| 2         | [2020-01-01 12:05:08, 2020-01-01 12:15:42) | 21   |
| 2         | [2020-01-01 12:05:08, 2020-01-01 12:15:42) | 21   |
+-----------+--------------------------------------------+------+

SELECT sensor_id, session_range, flow, SUM(spins) AS spins
FROM RANGE_SESSIONIZE(
  TABLE mydataset.sensor_metrics,
  "duration",
  ["sensor_id", "flow"],
  "OVERLAPS")
GROUP BY sensor_id, session_range, flow
ORDER BY sensor_id, session_range;

+-----------+--------------------------------------------+------+-------+
| sensor_id | session_range                              | flow | spins |
+-----------+--------------------------------------------+------+-------+
| 1         | [2020-01-01 12:00:01, 2020-01-01 12:10:46) | 10   | 21    |
| 1         | [2020-01-01 12:10:27, 2020-01-01 12:15:56) | 11   | 4     |
| 2         | [2020-01-01 12:05:08, 2020-01-01 12:15:42) | 21   | 12    |
+-----------+--------------------------------------------+------+-------+

You can do much more with RANGE. For example, you can use RANGE to JOIN a timestamp table with a RANGE table or emulate the behavior of an AS OF JOIN. See more examples in the RANGE and time series documentation.
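
For instance, a minimal sketch of such a join (the events table here is hypothetical) matches each timestamped event to the sensor_metrics row whose duration contains it:

-- Hypothetical events table joined to the RANGE table created above.
WITH events AS (
  SELECT 1 AS sensor_id, DATETIME '2020-01-01 12:03:10' AS event_time
  UNION ALL
  SELECT 2, DATETIME '2020-01-01 12:11:05'
)
SELECT e.sensor_id, e.event_time, m.flow, m.spins
FROM events AS e
JOIN mydataset.sensor_metrics AS m
  ON e.sensor_id = m.sensor_id
  AND RANGE_CONTAINS(m.duration, e.event_time);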

Getting started

To use the new time series analysis features, all you need is a regular table with some time series data in it. You can even try experimenting with one of the existing public datasets. Check out the documentation to learn more and get started. Give these new features a try and let us know if you have feedback.

Source: Data Analytics

How RealTruck drives data reliability and business growth with Masthead and BigQuery

One of the challenges organizations face today is harnessing the potential of their collected data. To do so, you need to invest in powerful data platforms that can efficiently manage, control, and coordinate complex data flows and access across various business domains.

RealTruck, a leader in aftermarket accessories for trucks and off-road vehicles, stands out for its omnichannel approach, which successfully integrates over 12,000 dealers and a robust online presence at RealTruck.com. Operating from 47 locations across North America, the company initially faced significant data challenges due to its extensive offline network and diverse customer touchpoints. To address these complexities, the data team at RealTruck decided to develop a data platform that could serve as a source of truth for executives and every manager in the organization, providing data to support business decision-making. The goal was to gain visibility into and control over all collected assets, monitor data flows, manage costs, and ensure the high reliability of the data platform. 

RealTruck’s data team chose BigQuery as the central element of their data platform for its high security standards, scalability, and ease of use. As a serverless data platform, BigQuery allows the team to focus on strategic analysis and insights rather than on managing infrastructure, thereby enhancing their efficiency in handling large volumes of data.

RealTruck data is gathered from various sources, including manufacturers, dealers, marketing campaigns, web and app customer interactions, and sales transactions. This data, along with the company’s data pipelines, varies in format, structure, and cadence. The diversity and number of external data sources present significant maintenance challenges and operational complexity.

RealTruck also added Masthead Data, a Google Cloud Ready partner for BigQuery, to help its data team identify any pipeline or data issues that affect business users or data consumers. When selecting a partner to integrate with BigQuery, RealTruck needed the ability to monitor the other solutions used to build its data platform, including Cloud Storage, BigQuery Data Transfer Service, Dataform, and other Google Cloud services, for errors that could result in downtime.

Together, BigQuery and Masthead enabled RealTruck’s data team to deliver on two of its biggest commitments — ensuring the accuracy of the company’s data and resolving any doubts about the performance of data pipelines.

Mastering data platform complexity: Visibility, cost efficiency, and anomaly detection

As RealTruck began building out its data platform with BigQuery, the data team realized that there were still some issues around complexity that needed to be solved.

Limited visibility of pipeline performance: Ingesting data into the platform from numerous sources using various solutions made it difficult to track pipeline failures or data system errors. This limitation hindered RealTruck’s ability to maintain reliable data.
Cost control: BigQuery enabled the data team to develop a decentralized data platform, boosting agility to create data pipelines and assets. However, this approach requires more refined management of resources to ensure cost-effectiveness, given the scalable processing power. To sustain efficiency, the team sought granular visibility into every process and its associated costs. 
Anomaly detection across the data platform: Tables in BigQuery are regularly used as sources for data products, requiring vigilant monitoring for issues like freshness, volume spikes, or missing values. The ability to automatically identify outliers or unexpected behavior is key for building trust in the data platform among business users.

Masthead and BigQuery: Achieving data platform reliability for RealTruck 

To overcome these challenges, RealTruck implemented Masthead Data to enhance the reliability of BigQuery data pipelines and assets in its data platform. 

Masthead provided visibility into potential syntax errors and system issues caused by using various ingestion tools. Automating observability enabled RealTruck to detect pipeline or data environment issues in real time, allowing the team to address them before they impacted downstream data products or platform users.

For example, Masthead provided real-time alerts and robust column-level lineage and data dictionary features to help troubleshoot downtime in the data platform within minutes. As a result, the RealTruck data team was able to trace an error or anomaly and assess the full impact of it on pipelines or BigQuery tables. Column-level lineage also made it easier for the team to respond quickly and collaborate more effectively when resolving issues.

In addition, Masthead’s unique approach of using logs to monitor time-series tables for freshness, volume, and schema changes allowed RealTruck to have an overarching view of the health of all its BigQuery tables without increasing compute costs. Masthead also integrates with Google Dataplex, enabling the RealTruck team to implement rule-based data quality checks to catch any anomalies in metrics.

RealTruck also leveraged Masthead’s Compute Cost Insights for BigQuery to gain granular visibility into BigQuery storage and pipeline costs, as well as the costs of any third-party solutions used in the data platform. These features have helped the data team identify and clean up orphaned processes and expired assets, making the costs of the data platform more manageable and transparent.

“One of the main reasons RealTruck chose Masthead was its unique architecture, which does not access our data. This was a critical factor in our decision, especially given our ambitious global growth plans and the increasingly complex data privacy regulations worldwide. Masthead, as a Google Cloud partner complementary to BigQuery, is compliant with data privacy and security regulations at the architectural level, ensuring that our data remains secure and aligning perfectly with our strategic objectives.

The ability to achieve comprehensive observability of all our BigQuery data pipelines and tables through a no-code integration, which was set up in just 15 minutes and began delivering value within a few hours, has been transformative. It has enabled the RealTruck team to gain valuable insights into pipeline costs and data flows swiftly across our entire data platform, reinforcing the reliability and strategic value of our data-driven initiatives.” – Chris Wall, Director of BI & Analytics, RealTruck

Google Cloud has become the backbone of RealTruck’s data infrastructure, providing efficient data governance and management with minimal configuration required. BigQuery offers Google Cloud’s world-class default encryption and sophisticated user access management features, which allow RealTruck to distribute, store, and process its data with confidence, knowing it is secure.

Masthead’s approach to processing logs and metadata also aligns well with Google Cloud’s approach to security and privacy, offering a single view of pipeline and data health across RealTruck’s entire data environment. This consolidated view has enabled the data team to shift from ad-hoc solutions to making strategic improvements to the data platform. This enhanced perspective has been vital for building a data platform that business users trust, allowing RealTruck to efficiently tackle data errors and manage costs. The efficient use of BigQuery in combination with Masthead significantly reduced the risk of unnoticed issues impacting business operations, reinforcing the importance of data in decision-making.

If you’re interested in using Masthead Data with BigQuery, visit the Google Cloud partner directory or Masthead Data’s Marketplace offerings. We also recommend checking out Google Cloud Ready – BigQuery to learn more about our Google Cloud Ready partners.

Source: Data Analytics

How Palo Alto Networks uses BigQuery ML to automate resource classification

At Palo Alto Networks, our mission is to enable secure digital transformation for all. Our growth through mergers and acquisitions has led to a large, decentralized structure with many engineering teams contributing to our world-renowned products. Our teams have more than 170,000 projects on Google Cloud, each with its own resource hierarchy and naming convention.

Our Cloud Center of Excellence team oversees the organization’s central cloud operations. We took over this complex brownfield landscape that has been growing exponentially, and it’s our job to make sure the growth is cost-effective, follows cloud hygiene, and is secure while still empowering all Palo Alto Networks product engineering teams to do their best work.

However, it was challenging to identify which project belonged to which team, cost center, and environment, which is a crucial starting point for our team’s work. We undertook a large automated labeling effort three years ago, which got us to over 95% coverage on tagging for team, owner, cost center, and environment. However, the last 5% turned out to be more difficult. That’s when we decided we could use machine learning to make our lives easier and our operations more efficient. This is the story of how we achieved that with BigQuery ML, BigQuery’s built-in machine learning feature.

Reducing ML prototyping turnaround from two weeks to two hours

Identifying the owner, environment, and cost center for each cloud project was challenging because of the sheer number of projects and their various naming conventions. We often found mislabeled projects that were assigned to incorrect teams or to no team at all. This made it difficult to determine how much teams were spending on cloud resources.

To correctly assign team owners on dashboards and reports, a finance team member had to sort hundreds of projects by hand and contact possible owners, a process taking weeks. If our investigation was inconclusive, the projects were marked as ‘undecided.’ As this list grew, we only looked into high-cost projects, leaving low-spend projects without a correct ownership label.

When questions regarding project ownership surfaced, our team looked for keywords in a project’s name or path that gave us clues about which team it was connected to. We were following keyword-based intuition, and we knew that machine learning could do the same. It was time to automate this manual process.

Initially, we used Scikit-learn and other Python libraries to write the code from scratch, and it took almost two weeks to build a working model and start training end-to-end prediction algorithms. While we got good results, it was a small-scale prototype that couldn’t handle the volumes of data we needed to ingest.

Palo Alto Networks already used BigQuery extensively, making it easy to access our data for this project. The Google Cloud team suggested we instead try BigQuery ML to prototype our project and it just made sense. With BigQuery ML, prototyping the entire project took a couple of hours. We were up and running within the same afternoon, with 99.9% accuracy. We tested it on hundreds of projects and got correct label predictions every time.

Boosting developer productivity while democratizing AI

Immediately after deploying BigQuery ML, we could use and test a variety of models that were readily available from its library to see what worked best for our project, eventually landing on the boosted trees model. Previously, with Scikit-learn in Python, each time we found that an algorithm wasn’t accurate enough, training a different one for testing took up to three hours. With BigQuery ML, that trial-and-error loop is much shorter: we simply replace the model keyword and run about an hour of training to try a new model.

Similarly, the developer time required for this project has reduced significantly. In our previous iteration, we had more than 300 lines of Python code. We’ve now turned that into 10 lines of SQL in BigQuery, which is much easier to read, understand, and manage.
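
As a rough sketch of what those few lines can look like in BigQuery ML (the dataset, table, and column names below are hypothetical, not the ones we actually use), a boosted tree classifier and its predictions fit in two short statements:

-- Hypothetical sketch: train a boosted tree classifier on already-labeled projects.
-- In practice, engineered keyword features from project names and paths would be used.
CREATE OR REPLACE MODEL mydataset.project_team_classifier
OPTIONS (
  model_type = 'BOOSTED_TREE_CLASSIFIER',
  input_label_cols = ['team']
) AS
SELECT project_name, folder_path, team
FROM mydataset.labeled_projects;

-- Predict the owning team for projects that are still unlabeled.
SELECT project_name, predicted_team
FROM ML.PREDICT(
  MODEL mydataset.project_team_classifier,
  (SELECT project_name, folder_path FROM mydataset.unlabeled_projects));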

This brings me to AI democratization. We initially assigned this prototype to an experienced colleague because a project like this used to require an in-depth machine learning and Python background. Reading 300 lines of ML Python code would take a while and explaining it would take even longer, so no one else on our team could have done this manually.

But with BigQuery ML, we can look at the code sequence and explain it in five minutes. Anyone on our team can understand and modify it by knowing just a little about what each algorithm does in theory. BigQuery ML makes this work much more accessible, even for people without years of machine learning training.

Solving for greater visibility with 99.9% accuracy

This label prediction project now supports the backend infrastructure for all cloud operations teams at Palo Alto Networks. It helps to identify which team each project belongs to and sorts mislabeled projects, giving financial teams visibility into cloud costs. Our new labeling system gives us accurate, reliable information about our cloud projects with minimal manual intervention.

For now, this solution can tell us with 99.9% accuracy which team any given project belongs to, in which cost center, and in which environment. This feels like a gateway introduction. Now that we’ve seen the usefulness of BigQuery ML, and how quickly it can make things happen, we’ve been talking about how to extend its benefits to more teams and use cases.

For example, we want to implement this model as a service for financial operations and information security teams who may need more information about any project. If there’s a breach or suspicious activity for a project that isn’t already mapped, they could quickly use our model to find out who the affected project belongs to. We have mapping for 95-98% of our projects, but that last bit of unknown territory is the most dangerous. If something happens in a place where no one knows who’s responsible, how can it be fixed? Ultimately, that’s what BigQuery ML will help us solve.

Excited for what’s ahead with generative AI

One other project we’re excited about combines BigQuery with generative AI to empower non-technical users to get business questions answered using natural language. We’re creating a financial operations companion that understands who employees are, what team they belong to, what projects that team owns, and what cloud resources it is using, to provide all the relevant cost, asset, and optimization information from our Data Lake stored in BigQuery.

Previously, searching for this kind of information would require knowing where and how to write a query in BigQuery. Now, anyone who isn’t familiar with SQL, from a director to an intern, can ask questions in plain English and get an appropriate answer. Generative AI democratizes access to information by using a natural language prompt to write queries and combine data from multiple BigQuery tables to surface a contextualized answer. Our alpha version for this project is out and already showing good results. We look forward to building this into all of our financial operations tools.
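
While this post doesn’t go into the implementation, one way to sketch the pattern with BigQuery’s generative AI support is ML.GENERATE_TEXT over a remote Vertex AI model; the connection name, endpoint, table, and prompt below are illustrative assumptions, not our production setup:

-- Hypothetical sketch: a remote model backed by Vertex AI, prompted to draft SQL
-- for a business question. Connection, endpoint, and schema are assumptions.
CREATE OR REPLACE MODEL mydataset.finops_text_model
  REMOTE WITH CONNECTION `us.vertex_ai_connection`
  OPTIONS (endpoint = 'gemini-1.5-flash');

SELECT ml_generate_text_llm_result AS draft_sql
FROM ML.GENERATE_TEXT(
  MODEL mydataset.finops_text_model,
  (
    SELECT CONCAT(
      'Write a BigQuery SQL query over mydataset.cloud_costs ',
      '(columns: team, project_id, month, cost) that answers: ',
      'What did the data platform team spend last month?') AS prompt
  ),
  STRUCT(TRUE AS flatten_json_output));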

Source: Data Analytics