How Gemini in BigQuery accelerates data and analytics workflows with AI

The journey from data to insights can be fragmented, complex, and time-consuming. Data teams spend time on repetitive and routine tasks such as ingesting structured and unstructured data, wrangling data in preparation for analysis, and optimizing and maintaining pipelines. They would rather spend that time on higher-value analysis and insights-led decision making.

At Next ‘23, we introduced Duet AI in BigQuery. This year at Next ‘24, Duet AI in BigQuery becomes Gemini in BigQuery which provides AI-powered experiences for data preparation, analysis and engineering as well as intelligent recommendations to enhance user productivity and optimize costs.

“With the new AI-powered assistive features in BigQuery and ease of integrating with other Google Workspace products, our teams can extract valuable insights from data. The natural language-based experiences, low-code data preparation tools, and automatic code generation features streamline high-priority analytics workflows, enhancing the productivity of data practitioners and providing the space to focus on high impact initiatives. Moreover, users with varying skill sets, including our business users, can leverage more accessible data insights to effect beneficial changes, fostering an inclusive data-driven culture within our organization,” said Tim Velasquez, Head of Analytics, Veo

Let’s take a closer look at the new features of Gemini in BigQuery.

Accelerate data preparation with AI

Your business insights are only as good as your data. When you work with large datasets that come from a variety of sources, there are often inconsistent formats, errors, and missing data. As such, cleaning, transforming, and structuring them can be a major hurdle.

To simplify data preparation, validation, and enrichment, BigQuery now includes AI-augmented data preparation that helps users cleanse and wrangle their data. Additionally, we are enabling users to build low-code visual data pipelines or rebuild legacy pipelines in BigQuery.

Once the pipelines are running in production, AI assists with finding and resolving issues such as schema or data drift, significantly reducing the toil associated with maintaining a data pipeline. Because the resulting pipelines run in BigQuery, users also benefit from integrated metadata management, automatic end-to-end data lineage, and capacity management.

Gemini in BigQuery provides AI-driven assistance for users to clean and wrangle data

Kickstart the data-to-insights journey

Most data analysis starts with exploration — finding the right dataset, understanding the data’s structure, identifying key patterns, and determining the most valuable insights you want to extract. This step can be cumbersome and time-consuming, especially if you are working with a new dataset or if you are new to the team.

To address this problem, Gemini in BigQuery provides new semantic search capabilities to help you pinpoint the most relevant tables for your tasks. Leveraging the metadata and profiling information of these tables from Dataplex, Gemini in BigQuery surfaces relevant, executable queries that you can run with just one click. You can learn more about BigQuery data insights here.

Gemini in BigQuery suggests executable queries for tables that you can run with a single click

Reimagine analytics workflows with natural language

To boost user productivity, we’re also rethinking the end-to-end user experience. The new BigQuery data canvas provides a reimagined natural language-based experience for data exploration, curation, wrangling, analysis, and visualization, allowing you to explore and scaffold your data journeys in a graphical workflow that mirrors your mental model. 

For example, to analyze a recent marketing campaign, you can use simple natural language prompts to discover campaign data sources, integrate with existing customer data, derive insights, and share visual reports with executives — all within a single experience. Watch this video for a quick overview of BigQuery data canvas.

BigQuery data canvas allows you to explore and analyze datasets, and create a customized visualization, all using natural language prompts within the same interface

Enhance productivity with SQL and Python code assistance 

Even advanced users sometimes struggle to remember all the details of SQL or Python syntax, and navigating through numerous tables, columns, and relationships can be daunting. 

Gemini in BigQuery helps you write and edit SQL or Python code using simple natural language prompts, referencing relevant schemas and metadata. You can also leverage BigQuery’s in-console chat interface to explore tutorials, documentation and best practices for specific tasks using simple prompts such as: “How can I use BigQuery materialized views?” “How do I ingest JSON data?” and “How can I improve query performance?”
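
For illustration, here is the kind of SQL such a prompt might produce and how you could run it with the BigQuery Python client; this is a hand-written sketch, and the project, dataset, and column names are hypothetical.

# Illustrative only: a prompt such as "top five products by revenue last month"
# might yield SQL like the query below, which you can review, adjust, and run.
from google.cloud import bigquery

client = bigquery.Client()  # uses your default project and credentials

generated_sql = """
SELECT
  product_name,
  SUM(sale_amount) AS revenue
FROM `my_project.sales.transactions`      -- hypothetical table
WHERE sale_date >= DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH), MONTH)
  AND sale_date < DATE_TRUNC(CURRENT_DATE(), MONTH)
GROUP BY product_name
ORDER BY revenue DESC
LIMIT 5
"""

for row in client.query(generated_sql).result():
    print(row.product_name, row.revenue)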

Optimize analytics for performance and speed 

With growing data volumes, analytics practitioners, including data administrators, find it increasingly challenging to effectively manage capacity and enhance query performance. We are introducing recommendations that can help continuously improve query performance, minimize errors, and optimize your platform costs.

With these recommendations, you can identify materialized views to create or delete based on your query patterns, along with partitioning or clustering opportunities for your tables. Additionally, you can autotune Spark pipelines and troubleshoot failures and performance issues.

Get started

To learn more about Gemini in BigQuery, watch this short overview video, refer to the documentation, and sign up to get early access to the preview features. If you’re at Next ‘24, join our data and analytics breakout sessions and stop by the demo stations to explore further and see these capabilities in action. Pricing details for Gemini in BigQuery will be shared when it becomes generally available to all customers.

Introducing Gemini in Looker to bring intelligent AI-powered BI to everyone

We are at a pivotal moment for business intelligence (BI). There’s more data than ever impacting all aspects of business, and organizations face increasing user demands for that data, with a wide range of access requirements. Then there’s AI, which is radically transforming how you create and think about every project. The delivery and adoption of generative AI is poised to bring the full benefits of BI to users who find a conversational experience more appealing than traditional methods. This week at Google Cloud Next, we introduced Conversational Analytics as part of Gemini in Looker, rethinking how we bring easy access to insights to our users and transforming the way we engage with our data in BI using natural language. In addition, we announced the preview of an array of capabilities for Looker that leverage Google’s work in generative AI and speed up your organization’s ability to dive deeper into the data that matters most, so you can rapidly create and share insights.

With Gemini in Looker, your relationship with your data and reporting goes from a slow-moving and high-friction process, limited by gatekeepers, to a collaborative and intelligent conversation – powered by AI. The deep integration of Gemini models in Looker brings insights to the major user flows that power your business, and establishes a single source of truth for your data with consistent metrics.

Conversational Analytics brings your data to life

Conversational Analytics is a dedicated space in Looker for you to chat with and engage with your business data, as simply as you would ask a colleague a question on chat. In combination with LookML semantic models available from Looker, we establish a single source of truth for your data, providing consistent metrics across your organization. Now, your entire company, including business and analyst teams, can chat with your data and obtain insights in seconds, fully enabling your data-driven decision-making culture.

You can leverage Conversational Analytics, using Gemini in Looker, to find top products, sales details, and dive deeper into the answers with follow-up questions.

With Conversational Analytics, everyone can uncover patterns and trends in data, as if you were speaking to your in-house data expert – and while the answers come in seconds, Looker shows you the data behind the insights, so you know the foundation is accurate and the method is true.

Smart and simple modeling on a trusted foundation

In the generative AI era, ensuring data authenticity and standardizing operational metrics is more than a nice-to-have – it’s critical for ensuring measures and comparisons across apps and teams are reliable and consistent. Looker’s semantic layer is at the heart of our modeling capabilities, powering the centrally defined metrics and data relationships that deliver truth and accuracy as you go through your workflows. With LookML, your analysts can work together seamlessly to create universal data and metrics definitions.

Gemini in Looker features LookML Assistant, which we hope will enable everyone to quickly leverage and improve their semantic models using natural language. Simply tell Gemini in Looker what you are looking to build, and the LookML code will be created for you automatically, making governed data, powered by generative AI, easier to set up than ever before.

Expanding intelligence for all Looker customers — and beyond

As the world of BI has evolved, so have our customers’ needs. They demand powerful and complete BI tools that are intuitive to use, with self-service exploration, seamless ad-hoc analysis, and high-quality visualizations all in a single platform, augmented by generative AI. 

We are now offering Looker Studio Pro to licensed Looker users (excluding Embed), at no additional cost, making getting started with BI easier than ever. 

Our vision is that Looker is the single source of truth for both modeled data and metrics that can be consumed anywhere — in our products, through partner BI tools or through our open SQL APIs. Looker’s modeling layer provides a single place to curate and govern the metrics most important to your business, meaning that customers can see consistent results no matter where they interact with their data.

Thanks to deep integration with Google Workspace, you can ask questions of your data with Gemini in Looker, helping you create reports easily and bring your creations to Slides.

Traditionally, BI tools take a user out of the flow of their work. We believe we can improve on this, helping users collaborate on their data where they are. With this in mind, we have extended our connections to Google Workspace, with the goal of meeting users where they are, across Slides, Sheets and Chat. Users will be able to automatically create Looker Studio reports from Google Sheets, helping you rapidly visualize and share insights on your data, while slide generation from Gemini in Looker eliminates starting from a blank deck, building on your visuals and reports with AI-generated summaries to kick off your presentation.

Business data insights as easy as asking Google

Gemini in Looker offers an array of new capabilities to speed up and simplify analytics tasks and workflows, including data modeling, chart creation, slide presentation generation, and more. As Google has done for decades in applications like Chrome, Gmail, and Google Maps, Gemini in Looker offers a customer experience that is intuitive and efficient. Conversational Analytics in Looker and LookML Assistant are joined by a set of capabilities that we first showcased at Next 2023, namely:

Report generation: Build an entire report, including multiple visualizations, a title, theme and layout, in seconds, by providing a one- to two-sentence prompt. Gemini in Looker is an AI analyst that can create entire reports, giving you a starting point that you can adjust by using natural language.

Advanced visualization assistant: Customize your visualizations using natural language. Gemini in Looker helps create JSON code configs, which you can modify as necessary, and generate a custom visualization.

Automatic slide generation: Create impactful presentations with insightful text summaries of your data. Gemini in Looker automatically exports a report into Google Slides, with text narratives that explain the data in charts and highlight key insights.

Formula assistant: Create calculated fields on-the-fly to extend and transform the information flowing from your data sources. Gemini in Looker removes the need for you to remember complicated formulas, and creates your formula for you, for ad-hoc analysis.

Each of these capabilities is now available in preview.

Reliable intelligence for the generative AI era

Looker plays a critical role in Google Cloud’s intelligence platform, unifying your data ecosystem. Bringing even more intelligence into Looker with Gemini makes it easier for our customers to understand and access their business data, for analysts to create dashboards and reports, and for developers to build new semantic models. Join us as we create a new experience with data and analytics — one defined by AI-powered conversational interfaces. It all starts with a simple chat box.

Grounding generative AI in enterprise truth

Generative AI continues to amaze users at organizations around the world. From helping marketers workshop ideas and creative campaigns, to recommending coding advice to developers, to assisting analysts with market research, the technology has captivated users with its ability to synthesize information and generate answers to questions.

But the arrival of generative AI hasn’t been without its challenges. 

Though foundation models that power generative AI develop a vast “world knowledge” during training, they’re only as up-to-date as their training data, and they can lack access to all the data sources pertinent to enterprise use cases. To adopt generative AI at full speed, businesses need to ground foundation model responses in enterprise systems and fresh data to ensure the most accurate and complete responses.

For instance, by grounding foundation model responses in ERP systems, businesses can create AI agents that provide accurate shipping predictions, and by grounding in documentation and manuals, they can deliver more helpful answers to product questions and for troubleshooting.

Similarly, research can be accelerated by grounding in analyst reports and studies, compliance can be strengthened by connecting foundation models to contracts, and employee training and onboarding can be improved by rooting agents in internal documents, knowledge bases, and HR systems.

Essentially, the more easily businesses can ground foundation models in their data, the more powerful their use cases can become.

At Google Cloud, we call this “enterprise truth” — the approach to grounding a foundation model in web information; enterprise data like databases and data warehouses; enterprise applications like ERP, CRM, and HR systems; and other sources of relevant information. Grounding in enterprise truth significantly improves the completeness and accuracy of responses, unlocking unique use cases across the business and laying the groundwork for the next generation of AI agents. 

Let’s explore how we do it! 

Grounding generative AI with Google Search and enterprise data 

Generative AI models produce the most probable response, which isn’t the same as being able to cite facts. This is why we’ve built — and continue to build — a variety of ways to help ensure each organization is able to ground its foundation models in the truth relevant to its use case.

Google Search is one of the world’s most trusted sources of factual and up-to-date information. Grounding with Google Search expands and enhances a model’s access to fresh, high-quality information, significantly improving the completeness and accuracy of responses.

Today, we are announcing the preview of Grounding with Google Search in Vertex AI. Businesses can now augment Gemini models with Google Search grounding, and can easily integrate the enhanced model into their AI agents.

When it comes to enterprise data, we offer multiple ways for businesses to ground model responses in enterprise data sources by leveraging retrieval augmented generation, or RAG. RAG helps improve the accuracy of model outputs by using vectors and embeddings to gather facts from relevant data sources.
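
To make that mechanism concrete, here is a minimal, hand-rolled RAG sketch using the Vertex AI SDK; the project, model names, and documents are illustrative, and this is not the out-of-the-box RAG solution described below.

# Minimal RAG sketch: embed a small corpus, retrieve the closest document to a
# question, and ground the generation step in that retrieved context.
import numpy as np
import vertexai
from vertexai.language_models import TextEmbeddingModel
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-project", location="us-central1")  # hypothetical project

docs = [
    "Standard shipping takes 3-5 business days within the US.",
    "Returns are accepted within 30 days with the original receipt.",
]

embedder = TextEmbeddingModel.from_pretrained("text-embedding-004")  # assumed model name
doc_vecs = np.array([e.values for e in embedder.get_embeddings(docs)])

question = "How long does shipping take?"
q_vec = np.array(embedder.get_embeddings([question])[0].values)

# Retrieve the most similar document by cosine similarity.
scores = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
context = docs[int(np.argmax(scores))]

# Generate an answer grounded in the retrieved context.
response = GenerativeModel("gemini-1.0-pro").generate_content(
    f"Answer using only this context:\n{context}\n\nQuestion: {question}"
)
print(response.text)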

Vertex AI includes not only a Google-quality, out-of-the-box RAG solution, but also a variety of component APIs for building bespoke retrieval, ranking, and document processing systems that enable enterprises to easily ground foundation models in their own data.

For organizations that need embeddings-based information retrieval, Vertex AI offers powerful vector search capabilities. Today, to enhance vector search, we are excited to announce the preview of our hybrid search feature, which integrates vector-based and keyword-based search techniques to ensure relevant and accurate responses for users.
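
As a toy illustration of the idea (not the Vertex AI implementation), hybrid scoring blends a keyword-match signal with an embedding-similarity signal, so a document can rank well on either:

# Toy hybrid-search scoring. The embeddings are random stand-ins; in practice
# they would come from an embedding model, as in the RAG sketch above.
import numpy as np

docs = ["running shoes for trail use", "leather dress shoes", "trail hiking boots"]
query = "trail running shoes"

rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(len(docs), 8))  # stand-in document embeddings
q_vec = rng.normal(size=8)                  # stand-in query embedding

def keyword_score(doc, q):
    doc_terms, query_terms = set(doc.split()), set(q.split())
    return len(doc_terms & query_terms) / len(query_terms)  # fraction of query terms matched

def vector_score(v, q):
    return float(v @ q / (np.linalg.norm(v) * np.linalg.norm(q)))

alpha = 0.5  # relative weight of the keyword signal
scores = [alpha * keyword_score(d, query) + (1 - alpha) * vector_score(v, q_vec)
          for d, v in zip(docs, doc_vecs)]
print(docs[int(np.argmax(scores))])  # best-scoring document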

Besides these, customers can connect models to Google databases like AlloyDB and BigQuery for the contextual retrieval of operational data and analytics like purchase preferences, rewards, basket analysis, interaction history, and more. To enable actions and transactions, we provide a host of data connectors to help businesses connect their models to enterprise applications like Workday, Salesforce, ServiceNow, Hadoop, Confluence, and JIRA to access the latest data on customer interactions and internal knowledge updates like issue tracking, program management, and employee records. 

With a comprehensive approach to grounding that covers web search, enterprise data, and third-party enterprise applications, businesses can ensure that their models will deliver enterprise truth – wherever it is hosted.

How more sources of truth create more value 

Let’s walk through an example to show how grounding lets organizations integrate sources of enterprise truth and create more helpful AI agents. 

Suppose an athletic brand wants to create an AI agent to help customers find and purchase shoes.

If the company just puts an interface atop a foundation model API, it won’t accomplish much. The resulting app would be able to discuss shoes generally, based on its training knowledge, but it wouldn’t have particular expertise in the brand’s shoes or any awareness of new footwear trends that emerged after its training cutoff date.

With grounding in Google Search, the shoe brand’s app can become much more functional, able to search the web for fresh information. However, it wouldn’t have insight into the brand’s internal data such as product information, inventory levels, and manufacturing timelines, nor would it be able to call functions for transactions — so the shoe company would still be dealing with a basic and rather limited agent. 

To cross this chasm, the company also needs to connect its gen AI models to enterprise data sources via RAG mechanisms, so the agent can ground its advice in the specificity and factuality of internal documents and databases. 

Imagine an advanced, proactive version of the shoe-recommending agent, with access to the full spectrum of aforementioned search, databases, and analytics. It would be able to observe patterns like the customer’s last several purchases all having green stripes. It would remember the earlier chat in which the customer said they dislike shoes that squeak on hardwood floors, and then go about reviewing customer reviews to purge squeakiness from its recommendations. It would also generate tables on the fly so the customer can more easily compare options, and it would know up-to-date inventory and shipping information to help execute transactions. With the right grounding in enterprise truth, the sky’s the limit — and so is the value the agent can create.

Enterprise truth: fueling gen AI innovation across businesses

Generative AI adoption isn’t just about access to capable models. It’s also about grounding foundation models in first-party data and high-quality external sources — and using these connections to steer model behavior, creating more accurate, relevant, and factual generative AI experiences for businesses to offer their customers, partners, and employees. 

With access to high-quality and relevant data, models can power experiences that move beyond traditional passive applications, giving rise to the next generation of AI agents grounded in enterprise truth. That’s the future we are rapidly moving towards. Backed by our commitment to this journey with our customers, we’re excited to help make the outputs of today’s agents factual, relevant, and actionable.

To learn more about Google Cloud’s AI news at Next, check out our Vertex AI Agent Builder announcement.

What’s next for data analytics at Google Cloud Next ’24

We’re entering a new era for data analytics, going from narrow insights to enterprise-wide transformation through a virtuous cycle of data, analytics, and AI. At the same time, analytics and AI are becoming widely accessible, providing insights and recommendations to anyone with a question. Ultimately, we’re going beyond our own human limitations to leverage AI-based data agents to find deeply hidden insights for us.

Organizations already recognize that data and AI can come together to unlock the value of AI for their business. Research from Google’s 2024 Data and AI Trends Report highlighted that 84% of data leaders believe generative AI will help their organization reduce time-to-insight, and 80% agree that the lines between data and AI are starting to blur.

Today at Google Cloud Next ’24, we’re announcing new innovations for BigQuery and Looker that will help activate all of your data with AI:

BigQuery is a unified AI-ready data platform with support for multimodal data, multiple serverless processing engines and built-in streaming and data governance to support the entire data-to-AI lifecycle. 

New BigQuery integrations with Gemini models in Vertex AI support multimodal analytics, vector embeddings, and fine-tuning of LLMs from within BigQuery, applied to your enterprise data.

Gemini in BigQuery provides AI-powered experiences for data preparation, analysis and engineering, as well as intelligent recommenders to optimize your data workloads.

Gemini in Looker enables business users to chat with their enterprise data and generate visualizations and reports—all powered by the Looker semantic data model that’s seamlessly integrated into Google Workspace.

Let’s take a deeper look at each of these developments.

BigQuery: the unified AI-ready data foundation

BigQuery is now Google Cloud’s single integrated platform for data-to-AI workloads. BigLake, BigQuery’s unified storage engine, provides a single interface across BigQuery native and open formats for analytics and AI workloads, giving you the choice of where your data is stored and access to all of your data, whether structured or unstructured. It also provides a universal view of data supported by a single runtime metastore, built-in governance, and fine-grained access controls.

Today we’re expanding open format support with the preview of a fully managed experience for Iceberg, with DDL, DML and high throughput support. In addition to support for Iceberg and Hudi, we’re also extending BigLake capabilities with native support for the Delta file format, now in preview. 

“At HCA Healthcare we are committed to the care and improvement of human life. We are on a mission to redesign the way care is delivered, letting clinicians focus on patient care and using data and AI where it can best support doctors and nurses. We are building our unified data and AI foundation using Google Cloud’s lakehouse stack, where BigQuery and BigLake enable us to securely discover and manage all data types and formats in a single platform to build the best possible experiences for our patients, doctors, and nurses. With our data in Google Cloud’s lakehouse stack, we’ve built a multimodal data foundation that will enable our data scientists, engineers, and analysts to rapidly innovate with AI.” – Mangesh Patil, Chief Analytics Officer, HCA Healthcare

We’re also extending our cross-cloud capabilities of BigQuery Omni. Through partnerships with leading organizations like Salesforce and our recent launch of bidirectional data sharing between BigQuery and Salesforce Data Cloud, customers can securely combine data across platforms with zero copy and zero ops to build AI models and predictions on combined Salesforce and BigQuery data. Customers can also enrich customer 360 profiles in Salesforce Data Cloud with data from BigQuery, driving additional personalization opportunities powered by data and AI. 

“It is great to collaborate without boundaries to unlock trapped data and deliver amazing customer experiences. This integration will help our joint customers tap into Salesforce Data Cloud’s rich capabilities and use zero copy data sharing and Google AI connected to trusted enterprise data.” – Rahul Auradkar, EVP and General Manager of United Data Services & Einstein at Salesforce

Building on this unified AI-ready data foundation, we are now making BigQuery Studio, which already has hundreds of thousands of active users, generally available. BigQuery Studio provides a collaborative data workspace across data and AI that all data teams and practitioners can use to accelerate their data-to-AI workflows, with the choice of SQL, Python, Spark, or natural language directly within BigQuery, as well as new integrations for real-time streaming and governance.

Customers’ use of serverless Apache Spark for data processing increased by over 500% in the past year1. Today, we are excited to announce the preview of our serverless engine for Apache Spark integrated within BigQuery Studio to help data teams work with Python as easily as they do with SQL, without having to manage infrastructure.

The data team at Snap Inc. uses these new capabilities to converge toward a common data and AI platform with multiple engines that work across a single copy of data. This gives them the ability to enforce fine-grained governance and track lineage close to the data to easily expand analytics and AI use cases needed to drive transformation.

To make data processing on real-time streams directly accessible from BigQuery, we’re announcing the preview of BigQuery continuous queries providing continuous SQL processing over data streams, enabling real-time pipelines with AI operators or reverse ETL. We are also announcing the preview of Apache Kafka for BigQuery as a managed service to enable streaming data workloads based on open-source APIs.

We’re expanding our governance capabilities with Dataplex with new innovations for data-to-AI governance available in preview. You can now perform integrated search and drive gen AI-powered insights on your enterprise data, including data and models from Vertex AI, with a fully integrated catalog in BigQuery. We’re introducing column-level lineage in BigQuery and expanding lineage capabilities to support Vertex AI pipelines (available in preview soon) to help you better understand data-to-AI workloads. Finally, to facilitate governance for data-access at scale, we are launching governance rules in Dataplex. 

Multimodal analytics with new BigQuery and Vertex AI integrations

With BigQuery’s direct integration with Vertex AI, we are now announcing the ability to connect models in Vertex AI with your enterprise data, without having to copy or move your data out of BigQuery. This enables multimodal analytics on unstructured data, fine-tuning of LLMs, and the use of vector embeddings in BigQuery.

Priceline, for instance, is using business data stored in BigQuery for LLMs across a wide range of applications. 

“BigQuery gave us a solid data foundation for AI. Our data was exactly where we needed it. We were able to connect millions of customer data points from hotel information, marketing content, and customer service chat and use our business data to ground LLMs.” – Allie Surina Dixon, Director of Data, Priceline 

The direct integration between BigQuery and Vertex AI now enables seamless preparation and analysis of multimodal data such as documents, audio and video files. BigQuery features rich support for analyzing unstructured data using object tables and Vertex AI Vision, Document AI and Speech-to-Text APIs. We are now enabling BigQuery to analyze images and video using Gemini 1.0 Pro Vision, making it easier than ever to combine structured with unstructured data in data pipelines using the generative AI capabilities of the latest Gemini models. 

BigQuery makes it easier than ever to execute AI on enterprise data by providing the ability to build prompts based on your BigQuery data and use LLMs for sentiment extraction, classification, topic detection, translation, data enrichment, and more.
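
As a hedged sketch of what this looks like in practice (the project, dataset, connection, and table names below are hypothetical, and option names can vary by release), you can register a remote Gemini model in BigQuery ML and apply it to rows of a table with ML.GENERATE_TEXT:

from google.cloud import bigquery

client = bigquery.Client()

# One-time setup: a remote model in BigQuery that points at a Vertex AI endpoint.
client.query("""
CREATE OR REPLACE MODEL `my_project.demo.gemini_model`
  REMOTE WITH CONNECTION `my_project.us.vertex_conn`
  OPTIONS (endpoint = 'gemini-pro')
""").result()

# Build a prompt per row and ask the model to classify sentiment.
rows = client.query("""
SELECT prompt, ml_generate_text_llm_result AS sentiment
FROM ML.GENERATE_TEXT(
  MODEL `my_project.demo.gemini_model`,
  (SELECT CONCAT('Classify the sentiment of this review as positive, negative, or neutral: ',
                 review_text) AS prompt
   FROM `my_project.demo.product_reviews`),
  STRUCT(0.2 AS temperature, TRUE AS flatten_json_output))
""").result()

for row in rows:
    print(row.sentiment)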

BigQuery now also supports generating vector embeddings and indexing them at scale using vector and semantic search. This enables new use cases that require similarity search, recommendations or retrieval of your BigQuery data, including documents, images or videos. Customers can use the semantic search in the BigQuery SQL interface or via our integration with gen AI frameworks such as LangChain and leverage Retrieval Augmented Generation based on their enterprise data.
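
For example (again with hypothetical names, and hedged because argument names can vary by release), you can materialize embeddings with ML.GENERATE_EMBEDDING and then retrieve the nearest rows with VECTOR_SEARCH:

from google.cloud import bigquery

client = bigquery.Client()

# Generate and store embeddings for a corpus of support articles
# (assumes a remote embedding model `my_project.demo.embedding_model` already exists).
client.query("""
CREATE OR REPLACE TABLE `my_project.demo.article_embeddings` AS
SELECT *
FROM ML.GENERATE_EMBEDDING(
  MODEL `my_project.demo.embedding_model`,
  (SELECT article_id, body AS content FROM `my_project.demo.support_articles`))
""").result()

# Find the articles most similar to a natural-language question.
results = client.query("""
SELECT base.article_id, distance
FROM VECTOR_SEARCH(
  TABLE `my_project.demo.article_embeddings`, 'ml_generate_embedding_result',
  (SELECT ml_generate_embedding_result
   FROM ML.GENERATE_EMBEDDING(
     MODEL `my_project.demo.embedding_model`,
     (SELECT 'How do I reset my password?' AS content))),
  top_k => 5)
""").result()

for row in results:
    print(row.article_id, row.distance)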

Gemini in BigQuery and Gemini in Looker for AI-powered assistance

Gen AI is creating new opportunities for rich data-driven experiences that enable business users to ask questions, build custom visualizations and reports, and surface new insights using natural language. In addition to business users, gen AI assistive and agent capabilities can also accelerate the work of data teams, spanning data exploration, analysis, governance, and optimization. In fact, more than 90% of organizations believe business intelligence and data analytics will change significantly due to AI. 

Today, we are announcing the public preview of Gemini in BigQuery, which provides AI-powered features that enhance user productivity and optimize costs throughout the analytics lifecycle, from ingestion and pipeline creation to deriving valuable insights. What makes Gemini in BigQuery unique is its contextual awareness of your business through access to metadata, usage data, and semantics. Gemini in BigQuery also goes beyond chat assistance to include new visual experiences such as data canvas, a new natural language-based experience for data exploration, curation, wrangling, analysis, and visualization workflows.

Imagine you are a data analyst at a bikeshare company. You can use the new data canvas of Gemini in BigQuery to explore the datasets, identify the top trips and create a customized visualization, all using natural language prompts within the same interface

Gemini in BigQuery capabilities extend to query recommendations, semantic search capabilities, low-code visual data pipeline development tools, and AI-powered recommendations for query performance improvement, error minimization, and cost optimization. Additionally, it allows users to create SQL or Python code using natural language prompts and get real-time suggestions while composing queries.

Today, we are also announcing the private preview of Gemini in Looker to enable business users and analysts to chat with their business data. Gemini in Looker capabilities include conversational analytics, report and formula generation, LookML and visualization assistance, and automated Google slide generation. What’s more, these capabilities are being integrated with Workspace to enable users to easily access beautiful data visualizations and insights right where they work.

Imagine you’re an ecommerce store. You can query Gemini in Looker to learn sales trends and market details and immediately explore the insights, with details on how the charts were created.

To learn more about our data analytics product innovations, hear customer stories, and gain hands-on knowledge from our developer experts, join our data analytics spotlights and breakout sessions at Google Cloud Next ‘24, or watch them on-demand.

1. Google internal data – YoY growth of data processed using Apache Spark on Google Cloud compared with Feb ‘23

Privacy-preserving data sharing now generally available with BigQuery data clean rooms

The rise of data collaboration and use of external data sources highlights the need for robust privacy and compliance measures. In this evolving data ecosystem, businesses are turning to clean rooms to share data in low-trust environments. Clean rooms enable secure analysis of sensitive data assets, allowing organizations to unlock insights without compromising on privacy.

To facilitate this type of data collaboration, we launched the preview of data clean rooms last year. Today, we are excited to announce that BigQuery data clean rooms is now generally available.

Backed by BigQuery, customers can now share data in place with analysis rules to protect the underlying data. This launch includes a streamlined data contributor and subscriber experience in the Google Cloud console, as well as highly requested capabilities such as:

Join restrictions: Limits the joins that can be performed on specific columns for data shared in a clean room, preventing unintended or unauthorized connections between data.
Differential privacy analysis rule: Enforces that all queries on your shared data use differential privacy with the parameters that you specify. The privacy budget that you specify also prevents further queries on that data when the budget is exhausted.
List overlap analysis rule: Restricts the output to only display the intersecting rows between two or more views joined in a query.
Usage metrics on views: Data owners or contributors see aggregated metrics on the views and tables shared in a clean room.

Using data clean rooms in BigQuery does not require creating copies of or moving sensitive data. Instead, the data can be shared directly from your BigQuery project and you remain in full control. Any updates you make to your shared data are reflected in the clean room in real-time, ensuring everyone is working with the most current data.

Create and deploy clean rooms in BigQuery

BigQuery data clean rooms are available in all BigQuery regions. You can set up a clean room environment using the Google Cloud console or using APIs. During this process, you set permissions and invite collaborators within or outside organizational boundaries to contribute or subscribe to the data.

Enforce analysis rules to protect underlying data

When sharing data into a clean room, you can configure analysis rules to protect the underlying data and determine how the data can be analyzed. BigQuery data clean rooms support multiple analysis rules including aggregation, differential privacy, list overlap, and join restrictions. The new user experience within Cloud console lets data contributors configure these rules without needing to use SQL.

Lastly, by default, a clean room employs restricted egress to prevent subscribers from exporting or copying the underlying data. However, data contributors can choose to allow the export and copying of query results for specific use cases, such as activation.

Monitor usage and stay in control of your data

The data owner or contributor is always in control of their respective data in a clean room. At any time, a data contributor can revoke access to their data. Additionally, as the clean room owner, you can adjust access using subscription management or privacy budgets to prevent subscribers from performing further analysis. Additionally, data contributors receive aggregated logs and metrics, giving them insights into how their data is being used within the clean room. This promotes both transparency and a clearer understanding of the collaborative process. 

What BigQuery data clean room customers are saying 

Customers across all industries are already seeing tremendous success with BigQuery data clean rooms. Here’s what some of our early adopters and partners had to say:

“With BigQuery data clean rooms, we are now able to share and monetize more impactful data with our partners while maintaining our customers’ and strategic data protection.” –  Guillaume Blaquiere, Group Data Architect, Carrefour

“Data clean rooms in BigQuery is a real accelerator for L’Oréal to be able to share, consume, and manage data in a secure and sustainable way with our partners.” –  Antoine Castex, Enterprise Data Architect, L’Oréal

“BigQuery data clean rooms equip marketing teams with a powerful tool for advancing privacy-focused data collaboration and advanced analytics in the face of growing signal loss. LiveRamp and Habu, which independently were each early partners of BigQuery data clean rooms, are excited to build on top of this foundation with our combined interoperable solutions: a powerful application layer, powered by Habu, accelerates the speed to value for technical and business users alike, while cloud-native identity, powered by RampID in Google Cloud, maximizes data fidelity and ecosystem connectivity for all collaborators. With BigQuery data clean rooms, enterprises will be empowered to drive more strategic decisions with actionable, data-driven insights.” – Roopak Gupta, VP of Engineering, LiveRamp

“In today’s marketing landscape, where resources are limited and the ecosystem is fragmented, solutions like the data clean room we are building with Google Cloud can help reduce friction for our clients. This collaborative clean room ensures privacy and security while allowing Stagwell to integrate our proprietary data to create custom audiences across our product and service offerings in the Stagwell Marketing Cloud. With the continued partnership of Google Cloud, we can offer our clients integrated Media Studio solutions that connect brands with relevant audiences, improving customer journeys and making media spend more efficient.” – Mansoor Basha, Chief Technology Officer, Stagwell Marketing Cloud

“We are extremely excited about the General Availability announcement of BigQuery data clean rooms. It’s been great collaborating with Google Cloud on this initiative and it is great to see it come to market. This release enables production-grade secure data collaboration for the media and advertising industry, unlocking more interoperable planning, activation and measurement use cases for our ecosystem.” – Bosko Milekic, Chief Product Officer, Optable

Next steps

Whether you’re an advertiser trying to optimize your advertising effectiveness with a publisher, or a retailer improving your promotional strategy with a CPG, BigQuery data clean rooms can help. Get started today by using this guide, starting a free trial with BigQuery, or contacting the Google Cloud sales team.

Get started with differential privacy and privacy budgeting in BigQuery data clean rooms

We are excited to announce that differential privacy enforcement with privacy budgeting is now available in BigQuery data clean rooms to help organizations prevent data from being reidentified when it is shared.

Differential privacy is an anonymization technique that limits the personal information that is revealed in a query output. Differential privacy is considered to be one of the strongest privacy protections that exist today because it:

is provably private
supports multiple differentially private queries on the same dataset
can be applied to many data types
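
As a toy illustration of the underlying idea (not BigQuery’s implementation), a differentially private count adds random noise calibrated to the query’s sensitivity and a privacy parameter epsilon, so any single individual’s presence or absence has only a bounded effect on the output:

# Toy Laplace-mechanism count; illustrative only, BigQuery handles this for you.
import numpy as np

def dp_count(records, epsilon=1.0, sensitivity=1.0):
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return len(records) + noise

users_who_purchased = ["u1", "u2", "u3", "u4"]     # fake data
print(dp_count(users_who_purchased, epsilon=0.5))  # more private, noisier
print(dp_count(users_who_purchased, epsilon=5.0))  # closer to the true count of 4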

Differential privacy is used by advertisers, healthcare companies, and education companies to perform analysis without exposing individual records. It is also used by public sector organizations that comply with the General Data Protection Regulation (GDPR), the Health Insurance Portability and Accountability Act (HIPAA), the Family Educational Rights and Privacy Act (FERPA), and the California Consumer Privacy Act (CCPA).

What can I do with differential privacy?

With differential privacy, you can:

protect individual records from re-identification without moving or copying your data
protect against privacy leak and re-identification
use one of the anonymization standards most favored by regulators

BigQuery customers can use differential privacy to:

share data in BigQuery data clean rooms while preserving privacy
anonymize query results on AWS and Azure data with BigQuery Omni
share anonymized results with Apache Spark stored procedures and Dataform pipelines so they can be consumed by other applications
enhance differential privacy implementations with technology from Google Cloud partners Gretel.ai and Tumult Analytics
call frameworks like PipelineDP.io

So what is BigQuery differential privacy exactly?

BigQuery differential privacy comprises three capabilities:

Differential privacy in GoogleSQL – You can use differential privacy aggregate functions directly in GoogleSQL (a sketch follows this list)

Differential privacy enforcement in BigQuery data clean rooms – You can apply a differential privacy analysis rule to enforce that all queries on your shared data use differential privacy in GoogleSQL with the parameters that you specify

Parameter-driven privacy budgeting in BigQuery data clean rooms – When you apply a differential privacy analysis rule, you also set a privacy budget to limit the data that is revealed when your shared data is queried. BigQuery uses parameter-driven privacy budgeting to give you more granular control over your data than query thresholds do and to prevent further queries on that data when the budget is exhausted.
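
For reference, here is a hedged sketch of what a subscriber’s differentially private query can look like in GoogleSQL, run through the BigQuery Python client; the table and column names are hypothetical, and the epsilon, delta, and privacy unit values are examples only:

from google.cloud import bigquery

client = bigquery.Client()

# A differentially private aggregation over data shared in a clean room.
dp_query = """
SELECT WITH DIFFERENTIAL_PRIVACY
  OPTIONS (epsilon = 1.0, delta = 1e-5, privacy_unit_column = customer_id)
  store_region,
  AVG(order_value) AS avg_order_value
FROM `shared_clean_room.orders`
GROUP BY store_region
"""

for row in client.query(dp_query).result():
    print(row.store_region, row.avg_order_value)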

BigQuery differential privacy enforcement in action

Here’s how to enable the differential privacy analysis rule and configure a privacy budget when you add data to a BigQuery data clean room.

Subscribers of that clean room must then use differential privacy to query your shared data.

Subscribers of that clean room cannot query your shared data once the privacy budget is exhausted.

Get started with BigQuery differential privacy

BigQuery differential privacy is configured when a data owner or contributor shares data in a BigQuery data clean room. A data owner or contributor can share data using any compute pricing model and does not incur compute charges when a subscriber queries that data. Subscribers of a data clean room incur compute charges when querying shared data that is protected with a differential privacy analysis rule. Those subscribers are required to use on-demand pricing (charged per TB) or the Enterprise Plus edition (charged per slot hour).

Create a clean room where all queries are protected with differential privacy today and let us know where you need help.

Get excited about what’s coming for data professionals at Next ‘24

Google Cloud Next ’24 lands in Las Vegas on April 9, ready to unleash the future of cloud computing! This global event is where inspiration, innovation, and practical knowledge collide.  

Data cloud professionals, get ready to harness the power and scalability of Google Cloud operational databases – AlloyDB, Cloud SQL, and Spanner –  and leverage AI to streamline your data management and unlock real-time insights. You’ll learn how to master BigQuery for lightning-fast analytics on massive datasets, visualize your findings with the intuitive power of Looker, and use generative AI to get better insights from your data. With these cutting-edge Google Cloud tools, you’ll have everything you need to drive better, more informed decision-making.

Here are 10 must-attend sessions for data professionals to add to your Google Cloud Next ’24 agenda:

1. What’s next for Google Cloud databases • SPTL203

Dive into the world of Google Cloud databases and discover next-generation innovations that will help you modernize your database estate to easily build enterprise generative AI apps, unify your analytical and transactional workloads, and simplify database management with assistive AI. Join us to hear our vision for the future of Google Cloud databases and see how we’re pushing the boundaries alongside the AI ecosystem. 
>> Add to my agenda <<

2. What’s next for data analytics in the AI era • SPTL202

With the surge of new generative AI capabilities, companies and their customers can now interact with systems and data in new ways. To activate AI, organizations require a data foundation with the scale and efficiency to bring business data together with AI models and ground them in customer reality. Join this session to learn the latest innovations for data analytics and business intelligence, and see why tens of thousands of organizations are fueling their journey with BigQuery and Looker.
>> Add to my agenda <<

3. The future of databases and generative AI • DBS104

Learn how AI has the potential to revolutionize the way applications interact with databases and explore the exciting future of Google Cloud‘s managed databases, including Cloud SQL, AlloyDB, and Spanner. We will also delve into the most intriguing frontiers on the horizon, such as vector search capabilities, natural language processing in databases, and app migration with large language model-powered code migration.
>> Add to my agenda <<

4. What’s new with BigQuery • ANA112

Join this session to explore all the latest BigQuery innovations to support all structured or unstructured data, across multiple and open data formats, and cross-clouds; all workloads, whether SQL, Spark, or Python; and new generative AI use cases with built-in AI to supercharge the work of data teams. Learn how you can take advantage of BigQuery, a single offering that combines data processing, streaming, and governance capabilities to unlock the full power of your data.
>> Add to my agenda <<

5. A deep dive into AlloyDB for PostgreSQL • DBS216

Powered by Google-embedded storage for superior performance with full PostgreSQL compatibility, AlloyDB offers the best of the cloud. With scale-out architecture, a no-nonsense 99.99% availability SLA, intelligent caching, and ML-enabled adaptive systems, it’s got everything you need to simplify database management. In this session, we’ll cover what’s new in AlloyDB, plus take a deep dive into the technology that powers it.
>> Add to my agenda <<

6. BigQuery’s multimodal data foundation for the Gemini era • ANA116

In the era of multimodal generative AI, a unified, governance-focused data platform powered by Gemini is now paramount. Join this session to learn how BigQuery fuels your data and AI lifecycle, from training to inference, by unifying all your data — structured or unstructured — while addressing security and governance. 
>> Add to my agenda <<

7. Best practices to maximize the availability of your Cloud SQL databases • DBS210

Customers use Cloud SQL to run their business-critical applications. The session will dive deep into Enterprise Plus edition features, how Cloud SQL achieves near-zero downtime maintenance, and behaviors that affect availability and mitigations — all of which will prepare you to be an expert in configuring and monitoring Cloud SQL for maximum availability.
>> Add to my agenda <<

8. Make generative AI work: Best practices from data leaders • ANA122

Join experts from Databricks, MongoDB, Confluent, and Dataiku for an exclusive executive discussion on harnessing generative AI’s transformative potential. We’ll explore how generative AI breaks down multicloud data silos to enable informed decision-making and unlock your data’s full value. Discover strategies for integrating generative AI, addressing challenges, and building a future-proof, innovation-driven data culture.
>> Add to my agenda <<

9. Spanner: Beyond relational? Yahoo! Mail says yes • DBS203

Imagine running your non-relational workloads at relational consistency and unlimited scale. Yahoo! dared to dream it, and with Google Spanner, it plans to make it a reality. Dive into its modernization plans for the Mail platform, consolidating diverse databases, and unlocking innovation with unprecedented scale. 
>> Add to my agenda <<

10. Talk with your business data using generative AI • ANA113

Getting insights from your business data should be as easy as asking Google. That is the Looker mission – instant insights, timely alerts when it matters most, and faster, more impactful decisions, all powered by the most important information: yours. Join this session to learn how AI is reshaping our relationship with data and how Looker is leading the way.
>> Add to my agenda <<

Enrich your streaming data using Bigtable and Dataflow

Data engineers know that eventing is all about speed, scale, and efficiency. Event streams — high-volume data feeds coming off of sources such as point-of-sale devices or websites logging stateless clickstream activity — carry lightweight event payloads that often lack the information to make each event actionable on its own. It is up to the consumers of the event stream to transform and enrich the events, followed by further processing as required for their particular use case.

Key-value stores such as Bigtable are the preferred choice for such workloads, with their ability to process hundreds of thousands of events per second at very low latencies. However, key-value lookups often require a lot of careful productionization and scaling code to ensure the processing can happen with low latency and good operational performance.

With the new Apache Beam Enrichment transform, this process is now just a few lines of code, allowing you to process events that are in messaging systems like Pub/Sub or Apache Kafka and enrich them with data in Bigtable before sending them along for further processing.

This is critical for streaming applications, as streaming joins enrich the data to give meaning to the streaming event. For example, knowing the contents of a user’s shopping cart, or whether they browsed similar items before, can bring valuable context to clickstream data that feeds into a recommendation model. Identifying a fraudulent in-store credit card transaction requires much more information than what’s in the current transaction, for example, the location of the prior purchase, count of recent transactions or whether a travel notice is in place. Similarly, enriching telemetry data from factory floor hardware with historical signals from the same device or overall fleet statistics can help a machine learning (ML) model predict failures before they happen.

The Apache Beam enrichment transform can take care of client-side throttling, rate-limiting the number of requests sent to the Bigtable instance when necessary. It retries requests with a configurable retry strategy, which by default is exponential backoff. Coupled with autoscaling, this allows Bigtable and Dataflow to scale up and down in tandem and automatically reach an equilibrium. Beam 2.54.0 supports exponential backoff, which can be disabled or replaced with a custom implementation.

Let’s see this in action:

with beam.Pipeline() as p:
  output = (p
    | "Read from PubSub" >> beam.io.ReadFromPubSub(subscription=SUBSCRIPTION)
    | "Convert bytes to Row" >> beam.ParDo(DecodeBytes())
    | "Enrichment" >> Enrichment(bigtable_handler)
    | "Run Inference" >> RunInference(model_handler)
  )

The above code runs a Dataflow job that reads from a Pub/Sub subscription and performs data enrichment by doing a key-value lookup against a Bigtable cluster. The enriched data is then fed to the machine learning model for inference using RunInference.
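
The snippet above assumes that bigtable_handler and model_handler were constructed earlier. A minimal sketch of that setup might look like the following; the project, instance, table, and model path are hypothetical, and the handler arguments reflect recent Beam releases, so check the version you are running:

# Hypothetical resource names; requires apache-beam[gcp] 2.54.0 or later.
from apache_beam.transforms.enrichment import Enrichment
from apache_beam.transforms.enrichment_handlers.bigtable import BigTableEnrichmentHandler
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerNumpy

bigtable_handler = BigTableEnrichmentHandler(
    project_id="my-project",
    instance_id="my-bigtable-instance",
    table_id="customer_profiles",
    row_key="customer_id",  # event field whose value is used as the Bigtable row key
)

model_handler = SklearnModelHandlerNumpy(model_uri="gs://my-bucket/model.pickle")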

The pictures below illustrate how Dataflow and Bigtable work in harmony to scale correctly based on the load. When the job starts, the Dataflow runner starts with one worker while the Bigtable cluster has three nodes and autoscaling enabled for Dataflow and Bigtable. We observe a spike in the input load for Dataflow at around 5:21 PM that leads it to scale to 40 workers.

This increases the number of reads to the Bigtable cluster. Bigtable automatically responds to the increased read traffic by scaling to 10 nodes to maintain the user-defined CPU utilization target.

The events can then be used for inference, with either embedded models in the Dataflow worker or with Vertex AI.

This Apache Beam transform can also be useful for applications that serve mixed batch and real-time workloads from the same Bigtable database, for example multi-tenant SaaS products and interdepartmental line of business applications. These workloads often take advantage of built-in Bigtable mechanisms to minimize the impact of different workloads on one another. Latency-sensitive requests can be run at high priority on a cluster that is simultaneously serving large batch requests with low priority and throttling requests, while also automatically scaling the cluster up or down depending on demand. These capabilities come in handy when using Dataflow with Bigtable, whether it’s to bulk-ingest large amounts of data over many hours, or process streams in real-time.

Conclusion

With a few lines of code, we are able to build a production pipeline that translates to many thousands of lines of production code under the covers, allowing Pub/Sub, Dataflow, and Bigtable to seamlessly scale the system to meet your business needs! And as machine learning models evolve over time, it will be even more advantageous to use a NoSQL database like Bigtable, which offers a flexible schema. With the upcoming Beam 2.55.0, the enrichment transform will also have caching support for Redis that you can configure for your specific cache. To get started, visit the documentation page.

Combine data across BigQuery and Salesforce Data Cloud securely with zero ETL

We are excited that bidirectional data sharing between BigQuery and Salesforce Data Cloud is now generally available. This will make it easy for customers to enrich their data use cases by combining data across different platforms securely, without the additional cost of building or managing data infrastructure and complex ETL (Extract, Transform, Load) pipelines.

Responding to customer needs quickly is more critical than ever, with a growing number of touchpoints and devices to deliver in-the-moment customer experiences. But it’s becoming increasingly challenging to do so, as the amount of data created and captured continues to grow, and is spread across multiple SaaS applications and analytics platforms. 

Last year, Google Cloud and Salesforce announced a partnership that makes it easy for customers to combine their data across BigQuery and Salesforce Data Cloud, and leverage the power of BigQuery and Vertex AI solutions to enrich and unlock new analytics and AI/ML scenarios. 

Today, we are launching general availability of these capabilities, enabling joint Google Cloud and Salesforce customers to securely access their data in the different platforms and different clouds. Customers will be able to access their Salesforce Data Cloud in BigQuery, without having to set up or manage infrastructure, as well as use their Google Cloud data to enrich Salesforce Customer 360 and other applications. 

With this announcement, Google Cloud and Salesforce customers benefit from:

A single pane of glass and serverless data access across platforms with zero ETL

Governed and secure bi-directional access to their Salesforce Data Cloud and BigQuery data in near real time, without needing to create data infrastructure and data pipelines.

Use their Google Cloud data to enrich Salesforce Customer 360 and Salesforce Data Cloud. Also, the ability to enrich customer data with other relevant datasets, public datasets with minimal data movement.

Leveraging differentiated Vertex AI and Cloud AI services for predictive analytics, churn modeling and flowing back to customer campaigns through Vertex AI and Einstein Copilot Studio’s integration.

Customers who want to look at their data holistically across Salesforce and Google platforms, spanning cloud boundaries, can do so by leveraging BigQuery Omni and Analytics Hub. This integration allows data and business users, including data analysts, marketing analysts, and data scientists, to combine data across Salesforce and Google platforms to analyze, derive insights, and run AI/ML pipelines, all in a self-service capacity, without the need to involve data engineering or infrastructure teams.

This integration is fully managed and governed, allowing customers to focus on analytics and insights and avoid several of the business challenges that are typical when integrating critical enterprise systems. The integration enforces the data access and governance policies that admins set for the data. Only datasets that are explicitly shared are available for access, and only authorized users are able to share and explore the data. With data spanning multiple clouds and platforms, relevant data is pre-filtered with minimal copying from Salesforce Data Cloud to BigQuery, reducing both egress costs and data engineering overhead.

“Trying to get a handle on our customer data was a nightmare until we seamlessly connected Google Cloud and Salesforce Data Cloud. No more copying data between platforms, no more struggling with complex APIs. It’s revolutionized how we do segmentation, understand our customers, and drive better marketing and service actions.” Large insurance company in NorthAM

“We faced several challenges in making the most of our customer data. Enriching leads between our first-party BigQuery data and Salesforce was a slow, manual process. It also made creating timely, data-driven lifecycle marketing journeys difficult due to batch data transfers. By seamlessly integrating BigQuery and Salesforce, we’ve transformed these processes. The integration fuels our automated marketing campaigns with real-time data triggers, significantly enhancing customer engagement. Best of all, this solution eliminated the manual overhead of batch data transfers, saving us valuable time and resources. It’s a win-win for our marketing team and our bottom line.” Large retailer in NorthAM

Easy and secure access to Salesforce Data Cloud from Google Cloud

Customers want to access and combine their marketing, commerce, and service data in Salesforce Data Cloud with loyalty and point-of-sale data in Google analytics platforms to derive actionable insights about customer behavior, such as propensity to buy and cross-sell/up-sell recommendations, and to run highly personalized promotional campaigns. They also want to leverage differentiated Google AI services to build machine learning models on top of the combined Salesforce and Google Cloud data for training and prediction, enabling use cases such as churn modeling, customer funnel analysis, market-mix modeling, price elasticity, and A/B test experimentation.

With the launch, customers can access their Salesforce Data Cloud data seamlessly through differentiated BigQuery cross-cloud and data sharing capabilities. They can access all the relevant information needed to perform cross-platform analytics in a privacy-safe manner with other Google assets, and power ad campaigns. Salesforce Data Cloud admins can easily share data directly with the relevant BigQuery users or groups, and BigQuery users can easily subscribe to shared datasets through the Analytics Hub UI.

There are several different ways to share information with this platform integration: 

For smaller datasets and ad hoc access, for example, to find the store with the largest sales last year, you can use a single cross-cloud join of your Salesforce Data Cloud and Google Cloud datasets, with minimal movement or duplication of data.

For larger datasets that power your executive updates, weekly business reviews, or marketing campaign dashboards, you can access the data using cross-cloud materialized views, which are automatically and incrementally updated and periodically transfer only the changed data. Both patterns are sketched below.
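As a hedged illustration of the two access patterns above, here is a minimal sketch using the BigQuery Python client. The project, dataset, and table names are hypothetical, and the Salesforce Data Cloud data is assumed to already be available as a shared dataset through Analytics Hub.

```python
from google.cloud import bigquery

client = bigquery.Client(project="your-project")  # placeholder project

# 1. Ad hoc cross-cloud join for a smaller result set, e.g., the store with the
#    largest sales last year. `salesforce_share.orders` is a hypothetical dataset
#    shared from Salesforce Data Cloud; `retail.stores` is a native BigQuery table.
adhoc_sql = """
SELECT s.store_id, SUM(o.order_total) AS total_sales
FROM `your-project.salesforce_share.orders` AS o
JOIN `your-project.retail.stores` AS s
  ON o.store_id = s.store_id
WHERE o.order_date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY s.store_id
ORDER BY total_sales DESC
LIMIT 1
"""
for row in client.query(adhoc_sql).result():
    print(row.store_id, row.total_sales)

# 2. Cross-cloud materialized view for recurring dashboards, refreshed incrementally.
mv_sql = """
CREATE MATERIALIZED VIEW `your-project.retail.weekly_sales_mv` AS
SELECT store_id, DATE_TRUNC(order_date, WEEK) AS week, SUM(order_total) AS total_sales
FROM `your-project.salesforce_share.orders`
GROUP BY store_id, week
"""
client.query(mv_sql).result()
```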

Enrich Salesforce Customer 360 with data stored on Google Cloud

We also hear from customers, especially retailers, that they want to combine their data in Salesforce Data Cloud with behavioral data captured in Google Analytics from their websites and mobile apps to build a richer customer 360 profile, derive actionable insights, and deliver personalized messaging using the rich capabilities of Salesforce Data Cloud. We are making it easier than ever to break down data silos and give customers seamless, real-time access to Google Analytics data within Salesforce Data Cloud, so they can build richer customer profiles and personalized experiences.

Salesforce Data Cloud customers can use a simple point-and-click experience to connect to their Google Cloud account, select relevant BigQuery datasets, and make them available as External Data Lake Objects, providing live access to the data. Once created, these objects behave like native Data Cloud objects, enriching Customer 360 data models and deriving insights that power real-time analytics and personalization. This integration eliminates the need to build and monitor ETL pipelines, removing the operational overhead and latency of the traditional ETL copy approach.

Breaking down walls between Salesforce and Google data

This Google Cloud and Salesforce Data Cloud platform integration empowers organizations to break down data silos, gain actionable insights, and deliver exceptional customer experiences. With seamless data sharing, unified access, and the power of Google AI, this partnership is transforming the way businesses leverage their data for success.

Through the unique cross-cloud functionality of BigQuery Omni and the data sharing capabilities of Analytics Hub, customers can directly access data stored in Salesforce Data Cloud and combine it with data in Google Cloud to enrich it further for business insights and activation. Customers are not only able to view their data across clouds but can also perform unparalleled cross-cloud analytics without the need to build custom ETL or move data.

To learn more about the collaboration between Google and Salesforce, check out this partnership page, the introduction video and quick start guide.


At least once Streaming: Save up to 70% for Streaming ETL workloads


Historically, Dataflow Streaming Engine has offered exactly-once processing for streaming jobs. Recently, we launched at-least-once streaming mode as an alternative that lowers the latency and cost of streaming data ingestion. In this post, we explain both streaming modes and provide guidance on how to choose the right mode for your streaming use case.

Exactly-once: what it is and why it matters

Applications that react to incoming events sometimes require that each event be reflected in the output exactly once — meaning the event is not lost, nor accepted more than a single time. But as the processing pipeline scales, load-balances, or encounters faults, that deduplication of events imposes a computational cost, affecting overall cost and latency of the system.

Dataflow Streaming provides an exactly-once guarantee, meaning that the effects of data processed in the pipeline are reflected at least once and at most once. Let’s unpack that a little bit. For every arriving message, whether it’s from an external source or an upstream shuffle, Dataflow ensures that the message will be processed and not lost (at least once). Additionally, the results of that processing that remain within the pipeline, such as state updates and outputs to a subsequent shuffle to the next pipeline stage, are reflected at most once. This guarantee enables, among other things, exact aggregations, such as exact sums or counts.

Exactly-once inside the pipeline is usually only half the story. As pipeline authors and runners, we really want to get the results of processing out of Dataflow and into a downstream system. Here we run into a common roadblock: no general at-most-once guarantee is made about side effects of the pipeline. Without further effort, any side effect, such as output to an external store, may generate duplicates. Careful work must be done to orchestrate the writes in a way that avoids duplicates. The key challenge is that, in the general case, it is not possible to implement exactly-once operation in a distributed system without a consensus protocol that involves all the actors. For internal state changes, such as state updates and shuffle, exactly-once is achieved by a careful protocol. With sufficient support from data sinks, we can thus have exactly-once all the way through the pipeline and to its output. An example is the Storage Write API version of the BigQueryIO.Write implementation, which ensures exactly-once delivery of data to BigQuery.
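As a hedged illustration, here is a minimal sketch of writing pipeline output to BigQuery with the Storage Write API method in the Beam Python SDK; the table, schema, and input rows are placeholders.

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "CreateRows" >> beam.Create([{"user_id": "u1", "clicks": 3}])  # placeholder rows
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            table="your-project:analytics.click_counts",  # placeholder table
            schema="user_id:STRING, clicks:INTEGER",
            method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```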

But even without exactly-once semantics at the sink, exactly-once semantics within the pipeline can be useful. Duplicates on the output may be acceptable, as long as they are duplicates of correct results — with exactly-once having been required to achieve this correctness.

At-least once: what it is and why it matters

There are other use cases where duplicates may be acceptable, for example ETL or Map-Only pipelines that are not performing any aggregation but rather only per-message operations. In these cases, duplicates are simple replays of data through the pipeline.

But why wouldn’t you choose to use exactly-once semantics? Aren’t stronger guarantees always better? The reason is that achieving exactly-once adds to pipeline latency and cost. This happens for several reasons, some obvious, and some quite subtle. Let’s take a deeper look.

In order to achieve exactly-once, we must store and read exactly-once metadata. In practice, the storage and read costs incurred to do this turn out to be significant, especially in pipelines that otherwise perform very little I/O. Less intuitively, having to perform this metadata-based deduplication also dictates how we implement the backend.

For example, in order to deduplicate messages across the shuffle we must ensure that all replays are idempotent — which means we must checkpoint the results of processing before they are sent to shuffle — which again increases cost and latency. 

Another example: to de-duplicate the input from Pub/Sub, we must first re-shuffle incoming messages on keys that are deterministically derived from a given message because the deduplication metadata is stored in the per-key state. Performing this shuffle using deterministic keys exposes us to additional cost and latency. In the next section we lay out the reasons for this in more detail.

We cannot assume that at-least-once semantics are acceptable for the user, so we default to the strictest semantics, i.e., exactly-once. If we know beforehand that at-least-once processing is acceptable, and we can relax our constraints, then we can make more cost- and latency-beneficial implementation decisions.

Exactly-once vs. at-least-once when reading from Pub/Sub

Pub/Sub reads’ latency and cost in particular benefit from at-least-once mode. To understand why, let’s look closer at how exactly-once deduplication is implemented. 

Pub/Sub reads are implemented in the Dataflow backend workers. To acquire new messages, each Dataflow backend worker makes remote procedure calls (RPCs) internally to the Pub/Sub service. Since RPCs can fail, workers can crash, and other failures are possible, messages are replayed until successful processing is acknowledged by the backend worker. Pub/Sub and backend workers are dynamic systems without static partitioning, meaning that replays of messages from Pub/Sub are not guaranteed to arrive at the same backend worker. This poses a challenge when deduplicating these messages.

In order to perform deduplication, the backend worker puts these messages through a shuffle, attaching a key internally to each message. The key is chosen deterministically based on the message, or on a message id attribute if configured, so that a replay of a duplicate message is deterministically shuffled to the same key. Doing so allows deduplicating replays from Pub/Sub in the same manner that shuffle replays are deduplicated between stages in the Dataflow pipeline (see this detailed discussion), as illustrated in the sequence diagram below:

This design contributes to cost and latency in two significant ways. First, as with all deduplication in the Dataflow backend, a read against the persistent store may be required. While heavily optimized with caches and bloom filters, it cannot be completely eliminated. Second, and often even more significant, is the need to shuffle the data on a deterministic key. If a particular key or worker is slow or becomes a bottleneck, this creates head-of-line blocking that prevents other traffic from flowing — an artificial constraint since the messages in the queue are not semantically tied to this key. 

When at-least-once processing is acceptable, we can both eliminate the cost associated with reads from the persistent store and the shuffling of messages on a deterministic key. In fact, we can do better — we still shuffle the messages, but the key we pick instead is the current “least-loaded” key, meaning the key that is currently experiencing the least queueing. In this way, we evenly distribute the incoming traffic to maximize throughput, even when some keys or workers are experiencing slowness. 

We can see this in action in a benchmark where we simulate stragglers, e.g., slow writes to an external sink, by artificially delaying arbitrary messages at low probability for multi-minute intervals.

Compare the throughput of the exactly-once pipeline on the left to the throughput of the at-least-once pipeline on the right. The at-least-once pipeline can sustain much more consistent throughput in the presence of such stragglers, dramatically decreasing average latency. In other words, even though both cases still have high tail latency, the latency outliers no longer affect the bulk of the distribution in the at-least-once configuration.

Benchmarks: at-least-once vs. exactly-once

We ran three representative benchmark streaming jobs to evaluate the impact of streaming-mode choice on cost. To maximize the cost benefit, we enabled resource-based billing and aligned the I/O to the streaming mode. Here is what we observed:

Note that cost depends on multiple factors such as data-load characteristics, the specific pipeline composition, configuration and the I/O used. Therefore, our benchmarking results may differ from what you observe in your test and production pipelines.

Spotify’s own testing supported the Dataflow team’s findings:

“By incorporating at-least-once mode in our platform that is built on Dataflow, Pub/Sub, and Bigtable, we have seen a portion of our Dataflow jobs cut costs by 50%. Since this is used by several consumers, 7 downstream systems are now cheaper overall with this simple change. Because of the way this system works, there has been 0 effects of duplicates! We plan on turning this feature on in more jobs that are compatible to cut down our Dataflow costs even more.” Sahith Nallapareddy, Software Engineer, Spotify

Choose the right streaming mode for your job

When creating streaming pipelines, choosing the right mode is essential. The critical factor is to determine whether the pipeline can tolerate duplicate records in the output or any intermediate processing stages.

At-least-once mode can help optimize cost and performance in the following cases:

Map-only pipelines performing idempotent per-message operations, e.g. ETL jobs

When deduplication happens at the destination, e.g., in BigQuery or Bigtable (see the sketch after this list)

Pipelines that already use an at-least-once I/O sink, e.g., the BigQuery Storage Write API in at-least-once mode or Pub/Sub I/O
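As a hedged illustration of deduplication at the destination, here is a minimal sketch that keeps only the latest row per event identifier in BigQuery. The table names and the event_id/ingest_ts columns are hypothetical and assume each record carries a unique identifier, so that duplicates introduced by at-least-once replays collapse to a single row.

```python
from google.cloud import bigquery

client = bigquery.Client(project="your-project")  # placeholder project

# Keep only the most recent row per event_id.
dedup_sql = """
CREATE OR REPLACE TABLE `your-project.analytics.events_dedup` AS
SELECT * EXCEPT(row_num)
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY ingest_ts DESC) AS row_num
  FROM `your-project.analytics.events_raw`
)
WHERE row_num = 1
"""
client.query(dedup_sql).result()
```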

Exactly-once mode is preferable in the following cases:

Use cases that cannot tolerate duplicates within Dataflow

Pipelines that perform exact aggregations as part of stream processing

Map-only jobs that perform non-idempotent per-message operations

At-least-once streaming mode is now generally available for Dataflow streaming customers. You can enable at-least-once mode by setting the at-least-once Dataflow service option when starting a new streaming job using the API or gcloud, as in the sketch below. To get started, we also offer a selection of commonly used Dataflow streaming templates that support both streaming modes.
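A minimal sketch of launching a streaming job in at-least-once mode from the Beam Python SDK, assuming the documented Dataflow service option name streaming_mode_at_least_once; the project, region, and topics are placeholders.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="your-project",    # placeholder
    region="us-central1",      # placeholder
    streaming=True,
    # Assumed documented service option name for at-least-once mode.
    dataflow_service_options=["streaming_mode_at_least_once"],
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromPubSub(
            topic="projects/your-project/topics/events")       # placeholder topic
        | "Transform" >> beam.Map(lambda msg: msg.upper())     # idempotent per-message op
        | "Write" >> beam.io.WriteToPubSub(
            topic="projects/your-project/topics/events-out")   # placeholder topic
    )
```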
