Introducing new ML model monitoring capabilities in BigQuery

Monitoring machine learning (ML) models in production is now as simple as using a function in BigQuery! Today we’re introducing a new set of functions that enable model monitoring directly within BigQuery. Now, you can describe data throughout the model workflow by profiling training or inference data, monitor skew between training and serving data, and monitor drift in serving data over time using SQL — for BigQuery ML models as well as any model whose feature training and serving data is available through BigQuery. With these new functions, you can ensure your production models continue to deliver value while simplifying their monitoring.

In this blog, we present two companion notebooks to help you get hands-on with these features today!

Companion Introduction – a fast introduction to all the new functions

Companion Tutorial – an in-depth tutorial covering many usage patterns for the new functions, including using Vertex AI Endpoints, monitoring feature attributions, and an overview of how monitoring metrics are calculated.

The foundation of a model: the data

A model is only as good as the data it learns from. Understanding the data deeply is essential for effective feature engineering, model selection, and ensuring quality through MLOps. BigQuery’s table-valued function ML.DESCRIBE_DATA provides a powerful tool for this, allowing you to summarize and describe an entire table with a single query.

Example: Identifying data issues

In the accompanying introduction notebook, we profile the training data (a penguin classification dataset) using the ML.DESCRIBE_DATA function and quickly identify a data issue.

SELECT *
FROM ML.DESCRIBE_DATA(
  TABLE `bigquery-public-data.ml_datasets.penguins`,
  STRUCT(3 AS top_k, 4 AS num_quantiles)
);

Here’s the resulting output table:

Notice that the min value for the sex column is '.'. Ideally, we'd see only the values MALE, FEMALE, or null, as indicated in the top_values.values column. This means that in addition to the 10 null values (indicated by the num_null column), there are also some missing values encoded as the string '.'. These should be corrected before using this table as training data.
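One quick way to address this (a minimal sketch; NULLIF turns the '.' placeholder into a true null so it is treated like the other missing values):

SELECT * REPLACE(NULLIF(sex, '.') AS sex)
FROM `bigquery-public-data.ml_datasets.penguins`;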

The ML.DESCRIBE_DATA function is especially helpful because it summarizes columns of every data type in a single table. There are also optional parameters to control the number of quantiles for numerical column types and the number of top values to return for categorical columns. The input data can be specified as a table or a query statement, allowing you to describe specific subsets of data (e.g., serving timeframes, or groups within your training data). The function's flexibility extends beyond ML tasks: it even allows you to describe data stored outside of BigQuery, facilitating quick analysis for both model-building and broader data exploration purposes.
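For example, here is a sketch of profiling only a recent slice of serving data with the optional parameters tuned (the serving table and its columns follow the examples later in this post):

SELECT *
FROM ML.DESCRIBE_DATA(
  (
    SELECT * EXCEPT(species)
    FROM `bqml_model_monitoring.serving`
    WHERE DATE(instance_timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
  ),
  STRUCT(5 AS top_k, 10 AS num_quantiles)
);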

Detect skew at a glance

A trained model will perform well only when the serving data is similar in distribution to the training data. Model monitoring helps ensure this by comparing training and serving data for shifts known as skew. BigQuery's ML.VALIDATE_DATA_SKEW table-valued function streamlines this process, allowing you to directly compare serving data to any BigQuery ML model's training data.

Let’s see it in action:

SELECT *
FROM ML.VALIDATE_DATA_SKEW(
  MODEL `bqml_model_monitoring.classify_species_logistic`,
  (
    SELECT *
    FROM `bqml_model_monitoring.serving`
    WHERE DATE(instance_timestamp) = CURRENT_DATE()
  )
);

This query directly compares the data in the serving table to the BigQuery ML model classify_species_logistic. The accompanying introduction notebook has the full code in an interactive example. In that notebook the serving data is simulated to create change in two of the features: body_mass_g and flipper_length_mm. The results of the ML.VALIDATE_DATA_SKEW function show anomalies detected for each of these features:

The detection of skew is as easy as comparing a model in BigQuery to a table of serving data. During training, BigQuery ML models automatically compute and store relevant statistics. This eliminates the need to reprocess the entire training dataset, making skew monitoring simple and cost-efficient. Importantly, the function intelligently focuses on features present in the model, further enhancing efficiency and workflow. With optional parameters, you can customize anomaly detection thresholds, metric types for categorical features, and even set different thresholds for specific features. Later, we'll demonstrate how easily you can monitor skew for any model!
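For instance, here is a sketch of tightening the default thresholds, assuming the skew function accepts the same categorical_default_threshold and numerical_default_threshold options shown for the drift function below (check the function's documentation for the full parameter list):

SELECT *
FROM ML.VALIDATE_DATA_SKEW(
  MODEL `bqml_model_monitoring.classify_species_logistic`,
  (
    SELECT *
    FROM `bqml_model_monitoring.serving`
    WHERE DATE(instance_timestamp) = CURRENT_DATE()
  ),
  STRUCT(
    0.02 AS categorical_default_threshold,  -- tighter threshold for categorical features
    0.02 AS numerical_default_threshold     -- tighter threshold for numerical features
  )
);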

Proactive monitoring for drift

Beyond comparing serving data to training data, it's also important to keep an eye on changes within serving data over time. Comparing recent serving data to previous serving data is another type of model monitoring known as drift detection. It uses the same detection technique: metrics that compare distributions between a baseline and a comparison dataset, flagging anomalies that exceed a set threshold. With the table-valued function ML.VALIDATE_DATA_DRIFT, you can compare any two tables, or query results, directly for detection.

Drift detection in action:

SELECT *
FROM ML.VALIDATE_DATA_DRIFT(
  (
    SELECT * EXCEPT(species, instance_timestamp)
    FROM `statmike-mlops-349915.bqml_model_monitoring.serving`
    WHERE DATE(instance_timestamp) = CURRENT_DATE()
  ),
  (
    SELECT * EXCEPT(species, instance_timestamp)
    FROM `statmike-mlops-349915.bqml_model_monitoring.serving`
    WHERE DATE(instance_timestamp) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
  ),
  STRUCT(
    0.03 AS categorical_default_threshold,
    0.03 AS numerical_default_threshold
  )
)

Here, the same serving table is used as both the baseline and the comparison, with different WHERE clauses filtering the rows to compare today against yesterday as an example. The results below show that while the detection values did not surpass the threshold, they are approaching it between two consecutive days for the features with simulated change.

Just like with skew detection, you can also adjust the default detection threshold for anomaly detection as well as the metric type used for categorical features, and specify different thresholds for different columns and feature types. There are additional parameters to control the binning of numerical features for the metrics calculations. 
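Day-over-day comparisons can be noisy, so a useful variation is comparing the most recent week against the week before it. Here is a sketch using the same parameters confirmed above:

SELECT *
FROM ML.VALIDATE_DATA_DRIFT(
  (
    -- baseline: the prior week of serving data
    SELECT * EXCEPT(species, instance_timestamp)
    FROM `bqml_model_monitoring.serving`
    WHERE DATE(instance_timestamp)
      BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 14 DAY)
          AND DATE_SUB(CURRENT_DATE(), INTERVAL 8 DAY)
  ),
  (
    -- comparison: the most recent week
    SELECT * EXCEPT(species, instance_timestamp)
    FROM `bqml_model_monitoring.serving`
    WHERE DATE(instance_timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
  ),
  STRUCT(
    0.03 AS categorical_default_threshold,
    0.03 AS numerical_default_threshold
  )
);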

Take TFDV monitoring to the next level

If you're already familiar with the TensorFlow Data Validation (TFDV) library, you'll appreciate how these new BigQuery functions enhance your model monitoring toolkit. They bring the power of TFDV directly into your BigQuery workflows, allowing you to generate rich statistics, detect anomalies, and leverage TFDV's powerful visualization tools — all with SQL. And the best part: it all runs on BigQuery's scalable, serverless compute, enabling near-instant analysis so you can take rapid action on model monitoring insights!

Let’s explore how it works:

Generate statistics with ML.TFDV_DESCRIBE

You can generate in-depth statistics summaries with the table-valued function ML.TFDV_DESCRIBE for any table or query, in the same format as TensorFlow's tfdv.generate_statistics_from_csv() API:

SELECT *
FROM ML.TFDV_DESCRIBE(
  (
    SELECT * EXCEPT(species)
    FROM `bqml_model_monitoring.training`
  )
)

The ML.TFDV_DESCRIBE function outputs statistics in a structured data format (a ‘proto’) that is directly compatible with TFDV: tfmd.proto.statistics_pb2.DatasetFeatureStatisticsList

Using a bit of Python code in a BigQuery notebook, we can import the TFDV package as well as the TensorFlow Metadata package and then call the tfdv.visualize_statistics method while converting the data to the expected format. The ML.TFDV_DESCRIBE results were loaded to Python for the training data as train_describe and for the current day's serving data as today_describe. See the accompanying tutorial for complete details.

import tensorflow_data_validation as tfdv
import tensorflow_metadata as tfmd
from google.protobuf import json_format

tfdv.visualize_statistics(
    lhs_statistics = json_format.ParseDict(train_describe, tfmd.proto.statistics_pb2.DatasetFeatureStatisticsList()),
    rhs_statistics = json_format.ParseDict(today_describe, tfmd.proto.statistics_pb2.DatasetFeatureStatisticsList()),
    lhs_name = 'Training Data Stats',
    rhs_name = 'Serving Data Stats – For Today'
)

This generates the visualizations shown below, which directly highlight shifts in the two features that we purposefully shifted in the serving data for this example: body_mass_g and flipper_length_mm.

This streamlined workflow brings the power and precision of TensorFlow Data Validation directly to BigQuery and enables you to quickly visualize how sets of data differ. This provides deeper insight into model health monitoring and informs how to proceed with model training iterations.

Detect anomalies with ML.TFDV_VALIDATE

You can also precisely detect skew or drift anomalies with the scalar function ML.TFDV_VALIDATE, which compares tables, or queries, pinpointing potential model-breaking shifts.

Example:

WITH
  TRAIN AS (
    SELECT * EXCEPT(species)
    FROM `bqml_model_monitoring.training`
  ),
  SERVE AS (
    SELECT * EXCEPT(species, instance_timestamp)
    FROM `bqml_model_monitoring.serving`
    WHERE DATE(instance_timestamp) = CURRENT_DATE()
  )
SELECT ML.TFDV_VALIDATE(
  (SELECT * FROM ML.TFDV_DESCRIBE(TABLE TRAIN)),
  (SELECT * FROM ML.TFDV_DESCRIBE(TABLE SERVE)),
  'SKEW', 0.03, 'L_INFTY', 0.03
) AS validate

These results are formatted in a structured data format ('proto') that is directly compatible with TFDV's display tools: tfmd.proto.anomalies_pb2.Anomalies. Passing this as input to the Python method tfdv.display_anomalies produces an easy-to-read table of anomaly detection results, shown after the code snippet:

tfdv.display_anomalies(
    anomalies = json_format.ParseDict(validate, tfmd.proto.anomalies_pb2.Anomalies())
)

Feature name | Anomaly short description | Anomaly long description
'culmen_depth_mm' | High approximate Jensen-Shannon divergence between training and serving | The approximate Jensen-Shannon divergence between training and serving is 0.0483968 (up to six significant digits), above the threshold 0.03.
'flipper_length_mm' | High approximate Jensen-Shannon divergence between training and serving | The approximate Jensen-Shannon divergence between training and serving is 0.917495 (up to six significant digits), above the threshold 0.03.
'body_mass_g' | High approximate Jensen-Shannon divergence between training and serving | The approximate Jensen-Shannon divergence between training and serving is 0.356159 (up to six significant digits), above the threshold 0.03.
'island' | High Linfty distance between training and serving | The Linfty distance between training and serving is 0.118041 (up to six significant digits), above the threshold 0.03. The feature value with maximum difference is: Dream
'culmen_length_mm' | High approximate Jensen-Shannon divergence between training and serving | The approximate Jensen-Shannon divergence between training and serving is 0.0594803 (up to six significant digits), above the threshold 0.03.
'sex' | High Linfty distance between training and serving | The Linfty distance between training and serving is 0.0513795 (up to six significant digits), above the threshold 0.03. The feature value with maximum difference is: FEMALE

The default detection methods for numerical and categorical data, as well as the default thresholds, are the same as for the other functions shown above. You can customize detection with parameters in the function for precision monitoring needs. For a deeper dive, the accompanying tutorial includes a section that demonstrates how these metrics are calculated manually and uses this function to compare to the manual calculation results as a validation.
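To build intuition for what these metrics measure, here is a minimal sketch of computing the Linfty distance by hand for one categorical feature: it is the maximum absolute difference in category proportions between the two datasets (the tutorial's manual-calculation section is the complete walkthrough; table names follow the examples above):

WITH
  base AS (
    SELECT island, COUNT(*) / SUM(COUNT(*)) OVER () AS p
    FROM `bqml_model_monitoring.training`
    GROUP BY island
  ),
  comp AS (
    SELECT island, COUNT(*) / SUM(COUNT(*)) OVER () AS p
    FROM `bqml_model_monitoring.serving`
    WHERE DATE(instance_timestamp) = CURRENT_DATE()
    GROUP BY island
  )
-- Linfty distance: the largest per-category gap in relative frequency
SELECT MAX(ABS(IFNULL(base.p, 0) - IFNULL(comp.p, 0))) AS l_infty_distance
FROM base
FULL OUTER JOIN comp USING (island);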

Online and batch serving: A unified model monitoring approach

BigQuery’s model monitoring functions offer a streamlined solution whether you’re working with models deployed on Vertex AI Prediction Endpoints or using batch serving data stored within BigQuery (as shown above). Here’s how:

Batch serving: For batch prediction data already stored or accessible by BigQuery, the monitoring features are readily accessible just as demonstrated previously in this blog.

Online serving: Directly monitor models deployed on Vertex AI Prediction Endpoints. By configuring request and response logging to BigQuery, you can easily apply BigQuery ML model monitoring functions to detect skew and drift.

The accompanying tutorial provides a step-by-step walkthrough, demonstrating endpoint creation, model deployment, logging setup (for Vertex AI to BigQuery), and how to monitor both online and batch serving data within BigQuery.

Automate for scale

To achieve truly scalable monitoring of shifts and drifts, automation is essential. BigQuery's procedural language offers a powerful way to streamline this process, as demonstrated in the SQL query from our introductory notebook. This automation isn't limited to monitoring; it can extend to continuous model retraining. In a production environment, continuous training would be accompanied by proactively identifying data quality issues, adapting to real-world changes, and maintaining a rigorous deployment strategy aligned with your organization's needs.

DECLARE skew_anomalies ARRAY<STRING>;

# Monitor Skew: latest serving compared to training
SET skew_anomalies = (
  SELECT ARRAY_AGG(input)
  FROM ML.VALIDATE_DATA_SKEW(
    MODEL `bqml_model_monitoring.classify_species_logistic`,
    (
      SELECT *
      FROM `bqml_model_monitoring.serving`
      WHERE DATE(instance_timestamp) >= CURRENT_DATE()
    )
  )
  WHERE is_anomaly = True
);

IF(ARRAY_LENGTH(skew_anomalies) > 0) THEN
  # retrain the model
  CREATE OR REPLACE MODEL `bqml_model_monitoring.classify_species_logistic`
  # find the full model training query in the introduction notebook
  ;

  # force alert with message
  SELECT ERROR(
    CONCAT(
      '\n\nFound data skew in features: ',
      ARRAY_TO_STRING(skew_anomalies, ', '),
      '. Model is retrained with latest up to date serving data.\n\n'
    )
  );

ELSE SET skew_anomalies = ['No skew detected.'];
END IF;

Let’s take a look at what the results look like:

Found data skew in features: body_mass_g, flipper_length_mm. Model is retrained with the latest serving data.

A skew anomaly was detected and successfully triggered model retraining, restoring accuracy after the data changes. This demonstrates the value of automated monitoring and retraining for maintaining model performance in dynamic production environments.

To streamline this process, Google Cloud offers several powerful automation options:

BigQuery Scheduled Queries

Dataform

Workflows

Cloud Composer

Vertex AI Pipelines

Want a hands-on demonstration? Our accompanying tutorial dives into BigQuery scheduled queries, including historical backfilling, daily monitoring, and setting up email alerts for detected shifts and drifts. We’ll also be releasing future tutorials covering the other automation tools.

The simplicity and power of model monitoring with BigQuery

Building trustworthy machine learning systems requires continuous monitoring. BigQuery’s new model monitoring functions streamline this to just a few SQL functions:

Deeply understand your data: ML.DESCRIBE_DATA provides a comprehensive view of your datasets, aiding in feature engineering and quality checks.

Detect skew between training and serving data: ML.VALIDATE_DATA_SKEW directly compares BigQuery ML models against their serving data.

Monitor data drift over time: ML.VALIDATE_DATA_DRIFT empowers you to track changes in serving data, ensuring your model’s performance remains consistent.

Enhance your TFDV workflow: ML.TFDV_DESCRIBE and ML.TFDV_VALIDATE bring the precision of TensorFlow Data Validation directly into BigQuery, enabling more detailed visualizations and anomaly detection while leveraging BigQuery's scalable and efficient compute.

Getting Started

Extend from BigQuery ML models to Vertex AI Models and automate these new functions with Google Cloud offerings like BigQuery scheduled queries, Dataform, Workflows, Cloud Composer, or Vertex AI Pipelines. Dive into our hands-on notebooks to get started today:

Companion Introduction – a fast introduction to all the new functions

Companion Tutorial – an in-depth tutorial covering many usage patterns for the new functions, including using Vertex AI Endpoints, monitoring feature attributions, and an overview of how monitoring metrics are calculated

Make data your competitive edge with new solutions from Cortex Framework

In today’s AI era, data is your competitive edge.

There has never been a more exciting time in technology, with AI creating entirely new ways to solve problems, engage customers, and work more efficiently. 

However, most enterprises still struggle with siloed data, which stifles innovation, keeps vital insights locked away, and can reduce the value AI has across the business. 

Discover a faster, smarter way to innovate

Google Cloud Cortex Framework accelerates your ability to unify enterprise data for connected insights, and provides new opportunities for AI to transform customer experiences, boost revenue, and reduce costs, opportunities that can otherwise remain hidden in your company's data and applications.

Built on an AI-ready Data Cloud foundation, Cortex Framework includes what you need to design, build, and deploy solutions for specific business problems and opportunities, including endorsed reference architectures and packaged business solution deployment content. In this blog, we provide an overview of Cortex Framework and highlight some recent enhancements.

Get a connected view of your business with one data foundation 

Cortex Framework enables one data foundation for businesses by bridging and enriching private, public, and community insights for deeper analysis.

Our latest release extends Cortex Data Foundation with new data and AI solutions for enterprise data sources including Salesforce Marketing Cloud, Meta, SAP ERP, and Dun & Bradstreet. Together this data unlocks insights across the enterprise and opens up opportunities for optimization and innovation. 

New intelligent marketing use cases

Drive more intelligent marketing strategies with one data foundation for your enterprise data, including integrated sources like Google Ads, Campaign Manager 360, TikTok, and now Salesforce Marketing Cloud and Meta, connected with BigQuery via Cortex Framework with predefined data ingestion templates, data models, and sample dashboards. Together with other business data available in Cortex Data Foundation, like sales and supply chain sources, you can accelerate insights and answer questions like: How does my overall campaign and audience performance relate to sales and supply chain?

New sustainability management use cases

Want more timely insights into environmental, social, and governance (ESG) risks and opportunities? You can now manage ESG performance and goals with new vendor ESG performance insights, using Dun & Bradstreet ESG ranking data connected with your SAP ERP supplier data, along with predefined data ingestion templates, data models, and a sample dashboard focused on sustainability insights for informed decision-making. Answer questions like: "What is my raw material suppliers' ESG performance against industry peers?" "What is their ability to measure and manage GHG emissions?" and "What is their adherence and commitment to environmental compliance and corporate governance?"

New simplified finance use cases

Simplify financial insights across the business to make informed decisions about liquidity, solvency, and financial flexibility to feed into strategic growth investment opportunities — now with predefined data ingestion templates, data models and sample dashboards to help you discern new insights with balance sheet and income statement reporting on SAP ERP data.

Accelerate AI innovation with secured data access

To help organizations build out a data mesh for more optimized data discovery, access control and governance when interacting with Cortex Data Foundation, our new solution content offers a metadata framework built on BigQuery and Dataplex that:

Organizes Cortex Data Foundation pre-defined data models into business domains

Augments Cortex Data Foundation tables, views and columns with semantic context to empower search and discovery of data assets

Enables natural language to SQL capabilities by providing logical context for Cortex Data Foundation content to LLMs and gen AI applications

Annotates data access policies to enable consistent enforcement of access controls and masking of sensitive columns

With a data mesh in place, Cortex Data Foundation models can allow for more efficiency in generative AI search and discovery, as well as fine-grained access policies and governance. 

Data and AI brings next-level innovation and efficiency 

Will your business lead the way? Learn more about our portfolio of solutions by tuning in to our latest Next ‘24 session – ANA107 and checking out our website.

Telegraph Media Group unlocks insights with a Single Customer View on Google Cloud

In today’s data-driven world, organizations across industries are seeking ways to gain a deeper understanding of their customers. A Single Customer View (SCV) — also known as a 360-degree customer view — has emerged as a powerful concept in data engineering, enabling companies to consolidate and unify customer data from multiple siloed sources. By integrating various data points into a single, comprehensive view, organizations can unlock valuable insights, drive personalized experiences, and make data-informed decisions. In this blog post, we will take a look at how Telegraph Media Group (TMG) built a SCV using Google Cloud and what we learned from our experience.

TMG is the publisher of The Daily Telegraph, The Sunday Telegraph, The Telegraph Magazine, Telegraph.co.uk, and the Telegraph app. We operate as a subscription-based business, offering news content through a combination of traditional print media and digital channels, including a website and various mobile applications. TMG initially operated a free-to-air, advertising-based revenue model, but over time, this model became increasingly challenging. Like many news media publishers, we saw long-term trends, such as a declining print readership, diminishing ad yields for content publishers, and volatility in ad revenue — all of which make revenue projections uncertain and growth unpredictable. 

In 2018, we set out a bold vision to become a subscriber-first business, with quality journalism at our heart, to build deeper connections with our subscribers at scale. By embracing a subscription approach, TMG aimed to establish a more predictable revenue stream and enhance its advertising offerings, which yield higher returns. Our goal was to reach one million subscriptions within five years, and we reached our milestone in August 2023.

The SCV platform we have engineered leverages two primary data resources: a customer’s digital behavior across all digital domains and TMG’s subscription data. Additionally, it integrates data from third-party sources, such as partner shopping websites and engagement products like fantasy football or puzzles. These diverse data sources play a vital role in enriching the platform’s understanding of our audience and delivering a comprehensive news experience.

We can conceptualize the entire process of building the SCV in a few stages:

Data collection: The initial stage involves gathering data from various sources and loading it into BigQuery, which serves as a data lake. Data is extracted from a variety of different sources, using multiple methods, such as databases, APIs, or files, and ingested into BigQuery for centralized storage and future processing.

Data transformation: In this stage, the data retrieved from BigQuery is processed and transformed according to defined business rules. The data is cleansed, standardized, and enriched to ensure its quality and consistency. The data is stored in a new dataset within BigQuery in a structured format known as the SCV data model, where it can be easily accessed and analyzed.

Data presentation: Once the data has been transformed and stored in the BigQuery data lake, it can be organized into smaller, specialized datasets, known as data marts. These data marts serve specific user groups or departments and provide them with tailored and pre-aggregated data for consumption by third-party activation tools, such as email marketing systems, alongside reporting and visualization tools that enable internal decision-making processes.

Stage 1: Data collection

All of TMG’s subscription data is stored in Salesforce. We implemented a streamlined process to gather this data and store it in BigQuery.

First, we utilize a pipeline comprising containerized Python applications that run on Apache Airflow (specifically, Cloud Composer) every minute. This pipeline retrieves updated data from the Salesforce API and transfers it to Pub/Sub, a messaging service within Google Cloud.

Second, we created a real-time pipeline with Dataflow that reads data from Pub/Sub and promptly updates various tables in BigQuery, enabling us to gain real-time insights into the data.

We also perform a daily batch ingestion from Salesforce to ensure data integrity. This practice allows us to have a comprehensive and complete view of the data, compensating for any potential data loss that may occur during real-time ingestion.

We employ a similar approach to ingesting data from both Adobe Analytics, which monitors user behavior on TMG websites and apps, and Adobe Campaign, which tracks user behavior on communication channels. Since real-time availability is not essential for these datasets, batch processing is deemed sufficient for their ingestion and processing. Additionally, similar ingestion methods are applied to other data sources to ensure a consistent and unified data pipeline.

Stage 2: Data transformation

We employ the open-source Data Build Tool (DBT) for transforming our data, utilizing the power of BigQuery through SQL. By leveraging DBT, we translate all of our business rules into SQL for efficient data transformation. The DBT pipelines are containerized applications that run on an hourly basis and are orchestrated using Cloud Composer, which is built on Apache Airflow. As a result, the output of these data pipelines is a relational model that resides in BigQuery, delivering streamlined and organized data for further analysis and processing. During data transformation, we employ several important pipelines (a sketch of one such model appears after this list), including:

Salesforce Snapshot: This pipeline generates a snapshot of the Salesforce data from both real-time and batch tables. The snapshot reflects the latest available data in Salesforce and serves as a valuable source for other pipelines in the transformation process.

Source: This pipeline creates a table to store the source data, including the source's original customer ID and the new customer ID. This information plays a crucial role in identifying customers in the original data source.

Customer: This pipeline creates a table that captures detailed customer information, providing a comprehensive view of each customer's attributes and characteristics.

Contact: This pipeline creates multiple tables that store various contact details of the customers.

Content Interaction: This pipeline generates a table that captures the digital behavior of customers, including their interactions with different content, enabling deeper analysis of customer engagement and preferences.

Subscription Interaction: This pipeline creates a table that tracks and stores subscription-related events and their associated details, providing insights into customer subscription behavior and patterns.

Campaign Interaction: This pipeline creates a table that stores detailed information about events related to communication behavior within different channels, enabling analysis of customer engagement with campaigns and marketing initiatives.
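To make the approach concrete, here is a minimal sketch of what a DBT model like the Salesforce Snapshot could look like; the ref names and columns are hypothetical, not TMG's actual models:

-- models/salesforce_snapshot.sql
-- Combine the daily batch load with real-time updates, keeping only the
-- most recent version of each Salesforce record.
WITH unioned AS (
  SELECT * FROM {{ ref('salesforce_batch') }}
  UNION ALL
  SELECT * FROM {{ ref('salesforce_realtime') }}
)
SELECT * EXCEPT(row_num)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY record_id ORDER BY updated_at DESC) AS row_num
  FROM unioned
)
WHERE row_num = 1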

Stage 3: Data presentation

Similar to the transformation layer, DBT pipelines play a crucial role in transforming data from the SCV data model into different data marts. These data marts serve as valuable resources for further analysis and are also consumed by third-party applications. Presently, the three main consumers of the SCV data are Adobe Campaign, Adobe Experience Platform, and Permutive.

Adobe Campaign utilizes the SCV data to effectively target customers by sending relevant campaigns and personalized offers. By leveraging the comprehensive customer insights derived from this data, Adobe Campaign optimizes customer engagement and facilitates targeted marketing efforts.

Adobe Experience Platform leverages the SCV data to deliver tailored experiences to customers visiting the website. By utilizing the rich customer information available, the Adobe Experience Platform customizes the website experience to cater to individual customer preferences, enhancing customer satisfaction and engagement.

Permutive primarily relies on SCV demographic data to target customers with tailored advertisements on the Telegraph website and application. Permutive creates customer segments and integrates with Google Ad Manager to deliver personalized ads.

Prior to the implementation of the SCV, these consumers depended on various data sources, which often resulted in using data that was several days old. This delay imposed limitations on their ability to target customers multiple times within a day. However, with the integration of the SCV, they now have direct access to near real-time data, allowing them to consume and utilize the data as frequently as every 30 minutes. This significant improvement in data freshness empowers TMG to deliver more timely and relevant experiences to our target audiences.

Challenges of creating a SCV

Building a Single Customer View brings forth various challenges, particularly in constructing a data model that meets current requirements while remaining adaptable for future needs. To address this, we prioritize careful extension of the data model, aiming to incorporate new requirements within the existing framework whenever possible. Additionally, determining the appropriate data to include in the SCV is also critical. While businesses may desire to include all customer data, we recognize the importance of avoiding noise and include only relevant and valuable data to maintain a clean SCV.

Managing customer preferences for communication is another significant issue to resolve. Within the SCV, for example, customer preferences dictate how TMG is authorized to engage with them. However, these preferences can vary across different channels, including third-party platforms, potentially conflicting with the preferences stored in our first-party data. To mitigate this, we establish and implement hierarchical rules to carefully define permissions for each communication channel, ensuring compliance and minimizing legal implications.

Efficiently matching customers from third-party data to TMG’s first-party data is also crucial for unifying customers across multiple sources. To tackle this challenge, we employ a combination of exact and fuzzy matching techniques. We implement fuzzy matching using BigQuery User-Defined Functions (UDFs), which allows us to apply various algorithms. However, processing fuzzy matching on large data volumes can be time-consuming. We are actively exploring different approaches to strike a balance between accuracy and processing time, optimizing the matching process and facilitating more efficient customer data integration.
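To give a flavor of the approach, here is a minimal sketch that combines an exact email match with a fuzzy name match using BigQuery's built-in EDIT_DISTANCE (Levenshtein) function; the table and column names are hypothetical, and TMG's actual UDFs apply more sophisticated algorithms:

-- Match third-party records to first-party customers on exact email,
-- or on names within a small edit distance of each other.
SELECT
  t.third_party_id,
  c.customer_id,
  EDIT_DISTANCE(LOWER(t.full_name), LOWER(c.full_name)) AS name_distance
FROM `scv.third_party_records` AS t
JOIN `scv.customers` AS c
  ON t.email = c.email
  OR EDIT_DISTANCE(LOWER(t.full_name), LOWER(c.full_name)) <= 2;

Note that the OR condition rules out an efficient hash join, so the fuzzy branch effectively scans many candidate pairs — exactly the accuracy-versus-processing-time tradeoff described above.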

In conclusion, implementing a SCV on Google Cloud empowers TMG to leverage customer data effectively, helping us to drive growth, enhance customer satisfaction, and stay competitive. By harnessing the rich insights derived from a SCV, companies can make data-informed decisions and deliver personalized experiences that resonate with their customers. Overcoming the challenges inherent in building an SCV enables businesses to unlock the full potential of their data and achieve meaningful outcomes.

What’s new in Cloud Pub/Sub at Next ’24

Organizations are increasingly adopting streaming technologies, and Google Cloud offers a comprehensive solution for streaming ingestion and analytics. Cloud Pub/Sub is Google Cloud’s simple, highly scalable and reliable global messaging service. It serves as the primary entry point for you to ingest your streaming data into Google Cloud and is natively integrated with BigQuery, Google Cloud’s unified, AI-ready data analytics platform. You can then use this data for downstream analytics, visualization, and AI applications. Today, we are excited to announce recent Pub/Sub innovations answering customer needs for simplified streaming data ingestion and analytics.

One-click Streaming Import (GA)

Multi-cloud workloads are becoming a reality for many organizations that want to run certain workloads (e.g., operational) on one public cloud and their analytical workloads on another. However, it can then be a challenge to gain a holistic view of business data across clouds. By consolidating data in one public cloud, organizations can run analytics across their entire data footprint. For Google Cloud customers it is common to consolidate data in BigQuery, providing a source of truth for the organization.

To ingest streaming data from external sources such as AWS Kinesis Data Streams into Google Cloud, you need to configure, deploy, run, manage and scale a custom connector. You also need to monitor and maintain the connector to ensure the streaming ingestion pipeline is running as expected. Last week, we launched a no-code, one-click capability to ingest streaming data into Pub/Sub topics from external sources, starting with Kinesis Data Streams. The Import Topics capability is now generally available (GA) and offers multiple benefits:

Simplified data pipelines: You can streamline your cross-cloud streaming data ingestion pipelines by using the Import Topics capability. This removes the overhead of running and managing a custom connector.

Auto-scaling: Streaming pipelines created with managed import topics scale up and down based on the incoming throughput.

Out-of-the-box monitoring: Three new Pub/Sub metrics are now available out-of-the-box to monitor your import topics.

Import Topics will support Cloud Storage as another external source later in the year.

Streaming analytics with Pub/Sub Apache Flink connector (GA)

Apache Flink is an open-source stream processing framework with powerful stream and batch processing capabilities, with growing adoption across enterprises. Customers often use Apache Flink with messaging services to power streaming analytics use cases. We are pleased to announce that a new version of the Pub/Sub Flink Connector is now GA with active support from the Google Cloud Pub/Sub team. The connector is fully open source under an Apache 2.0 license and hosted on our GitHub repository. With just a few steps, the connector allows you to connect your existing Apache Flink deployment to Pub/Sub. 

The connector allows you to publish an Apache Flink output into Pub/Sub topics or use Pub/Sub subscriptions as a source in Apache Flink applications. The new GA version of the connector comes with multiple enhancements. It now leverages the StreamingPull API to achieve maximum throughput and low latency. We also added support for automatic message lease extensions to enable setting longer checkpointing intervals. Finally, the connector supports the latest Apache Flink source streaming API.

Enhanced Export Subscriptions experience

Pub/Sub has two popular export subscriptions — BigQuery and Cloud Storage. BigQuery subscriptions can now be leveraged as a simple method to ingest streaming data into BigLake Managed Tables, BigQuery’s recently announced capability for building open-format lakehouses on Google Cloud. You can use this method to transform your streaming data into Parquet or Iceberg format files in your Cloud Storage buckets. We also launched a number of enhancements to these export subscriptions.

BigQuery subscriptions support a growing number of ways to move your structured data seamlessly. The biggest change is the ability to write JSON data into columns in BigQuery without defining a schema on the Pub/Sub topic. Previously, the only way to get data into columns was to define a schema on the topic and publish data that matched that schema. Now, with the use table schema feature, Pub/Sub can write JSON messages to the BigQuery table using its schema. Basic types are supported now and support for more advanced types like NUMERIC and DATETIME is coming soon.

Speaking of type support, BigQuery subscriptions now handle most Avro logical types. BigQuery subscriptions now support non-local timestamp types (compatible with the BigQuery TIMESTAMP type) and decimal types (compatible with the BigQuery NUMERIC and BIGNUMERIC types, coming soon). You can use these logical types to preserve the semantic meaning of fields across your pipelines.

Another highly requested feature coming soon to both BigQuery subscriptions and Cloud Storage subscriptions is the ability to specify a custom service account. Currently, only the per-project Pub/Sub service account can be used to write messages to your table or bucket. Therefore, when you grant access, you enable anyone who has permission to use this project-wide service account to write to the destination. You may prefer to limit access to a specific service account via this upcoming feature.

Cloud Storage subscriptions will be enhanced in the coming months with a new batching option allowing you to batch Cloud Storage files based on the number of Pub/Sub messages in each file. You will also be able to specify a custom datetime format in Cloud Storage filenames to support custom downstream data lake analysis pipelines. Finally, you’ll soon be able to use topic schema to write data to your Cloud Storage bucket.

Getting started

We’re excited to introduce a set of new capabilities to help you leverage your streaming data for a variety of use cases. You can now simplify your cross-cloud ingestion pipelines with Managed Import. You can also leverage Apache Flink with Pub/Sub for streaming analytics use cases. Finally, you can now use enhanced Export Subscriptions to seamlessly get data in either BigQuery or Cloud Storage. We are excited to see how you use these Pub/Sub features to solve your business challenges.

Introducing LLM fine-tuning and evaluation in BigQuery

BigQuery allows you to analyze your data using a range of large language models (LLMs) hosted in Vertex AI including Gemini 1.0 Pro, Gemini 1.0 Pro Vision and text-bison. These models work well for several tasks such as text summarization, sentiment analysis, etc. using only prompt engineering. However, in some scenarios, additional customization via model fine-tuning is needed, such as when the expected behavior of the model is hard to concisely define in a prompt, or when prompts do not produce expected results consistently enough. Fine-tuning also helps the model learn specific response styles (e.g., terse or verbose), new behaviors (e.g., answering as a specific persona), or to update itself with new information.

Today, we are announcing support for customizing LLMs in BigQuery with supervised fine-tuning. Supervised fine-tuning via BigQuery uses a dataset which has examples of input text (the prompt) and the expected ideal output text (the label), and fine-tunes the model to mimic the behavior or task implied by these examples. Let's see how this works.

Feature walkthrough

To illustrate model fine-tuning, let's look at a classification problem using text data. We'll use a medical transcription dataset and ask our model to classify a given transcript into one of 17 categories, e.g., 'Allergy / Immunology', 'Dentistry', 'Cardiovascular / Pulmonary', etc.

Dataset

Our dataset is from mtsamples.com as provided on Kaggle. To fine-tune and evaluate our model, we first create an evaluation table and a training table in BigQuery using a subset of this data available in Cloud Storage as follows:

-- Create an eval table

LOAD DATA INTO
  bqml_tutorial.medical_transcript_eval
FROM FILES( format='NEWLINE_DELIMITED_JSON',
  uris = ['gs://cloud-samples-data/vertex-ai/model-evaluation/peft_eval_sample.jsonl'] )

-- Create a train table

LOAD DATA INTO
  bqml_tutorial.medical_transcript_train
FROM FILES( format='NEWLINE_DELIMITED_JSON',
  uris = ['gs://cloud-samples-data/vertex-ai/model-evaluation/peft_train_sample.jsonl'] )

The training and evaluation datasets have an 'input_text' column that contains the transcript, and an 'output_text' column that contains the label, or ground truth.

Baseline performance of text-bison model

First, let’s establish a performance baseline for the text-bison model. You can create a remote text-bison model in BigQuery using a SQL statement like the one below. For more details on creating a connection and remote models refer to the documentation (1,2).

CREATE OR REPLACE MODEL
  `bqml_tutorial.text_bison_001` REMOTE
WITH CONNECTION `LOCATION.ConnectionID`
OPTIONS (ENDPOINT = 'text-bison@001')

For inference on the model, we first construct a prompt by concatenating the task description for our model and the transcript from the tables we created. We then use the ML.GENERATE_TEXT function to get the output. While the model gets many classifications correct out of the box, it classifies some transcripts erroneously. Here’s a sample response where it classifies incorrectly.

Prompt

Please assign a label for the given medical transcript from among these labels [Allergy / Immunology, Autopsy, Bariatrics, Cardiovascular / Pulmonary, Chiropractic, Consult – History and Phy., Cosmetic / Plastic Surgery, Dentistry, Dermatology, Diets and Nutritions, Discharge Summary, ENT – Otolaryngology, Emergency Room Reports, Endocrinology, Gastroenterology, General Medicine, Hematology – Oncology, Hospice – Palliative Care, IME-QME-Work Comp etc., Lab Medicine – Pathology, Letters, Nephrology, Neurology, Neurosurgery, Obstetrics / Gynecology, Office Notes, Ophthalmology, Orthopedic, Pain Management, Pediatrics – Neonatal, Physical Medicine – Rehab, Podiatry, Psychiatry / Psychology, Radiology, Rheumatology, SOAP / Chart / Progress Notes, Sleep Medicine, Speech – Language, Surgery, Urology]. TRANSCRIPT:
INDICATIONS FOR PROCEDURE:, The patient has presented with atypical type right arm discomfort and neck discomfort. She had noninvasive vascular imaging demonstrating suspected right subclavian stenosis. Of note, there was bidirectional flow in the right vertebral artery, as well as 250 cm per second velocities in the right subclavian. Duplex ultrasound showed at least a 50% stenosis.,APPROACH:, Right common femoral artery.,ANESTHESIA:, IV sedation with cardiac catheterization protocol. Local infiltration with 1% Xylocaine.,COMPLICATIONS:, None.,ESTIMATED BLOOD LOSS:, Less than 10 ml.,ESTIMATED CONTRAST:, Less than 250 ml.,PROCEDURE PERFORMED:, Right brachiocephalic angiography, right subclavian angiography, selective catheterization of the right subclavian, selective aortic arch angiogram, right iliofemoral angiogram, 6 French Angio-Seal placement.,DESCRIPTION OF PROCEDURE:, The patient was brought to the cardiac catheterization lab in the usual fasting state. She was laid supine on the cardiac catheterization table, and the right groin was prepped and draped in the usual sterile fashion. 1% Xylocaine was infiltrated into the right femoral vessels. Next, a #6 French sheath was introduced into the right femoral artery via the modified Seldinger technique.,AORTIC ARCH ANGIOGRAM:, Next, a pigtail catheter was advanced to the aortic arch. Aortic arch angiogram was then performed with injection of 45 ml of contrast, rate of 20 ml per second, maximum pressure 750 PSI in the 4 degree LAO view.,SELECTIVE SUBCLAVIAN ANGIOGRAPHY:, Next, the right subclavian was selectively cannulated. It was injected in the standard AP, as well as the RAO view. Next pull back pressures were measured across the right subclavian stenosis. No significant gradient was measured.,ANGIOGRAPHIC DETAILS:, The right brachiocephalic artery was patent. The proximal portion of the right carotid was patent. The proximal portion of the right subclavian prior to the origin of the vertebral and the internal mammary showed 50% stenosis.,IMPRESSION:,1. Moderate grade stenosis in the right subclavian artery.,2. Patent proximal edge of the right carotid.

Response

Radiology

In the above case the correct classification should have been 'Cardiovascular / Pulmonary'.
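For reference, the baseline inference call has roughly this shape (a sketch mirroring the fine-tuned inference query shown later; the label list in the prompt is abbreviated here):

SELECT
  ml_generate_text_llm_result,
  output_text AS label,
  prompt
FROM
  ml.generate_text(MODEL bqml_tutorial.text_bison_001,
    (
      SELECT
        CONCAT('Please assign a label for the given medical transcript from among these labels [...]. ', input_text) AS prompt,
        output_text
      FROM
        `bqml_tutorial.medical_transcript_eval`
    ),
    STRUCT(TRUE AS flatten_json_output))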

Metrics-based evaluation for base model

To perform a more robust evaluation of the model's performance, you can use BigQuery's ML.EVALUATE function to compute metrics on how the model responses compare against the ideal responses from a test/eval dataset. You can do so as follows:

-- Evaluate base model

SELECT
  *
FROM
  ml.evaluate(MODEL bqml_tutorial.text_bison_001,
    (
      SELECT
        CONCAT("Please assign a label for the given medical transcript from among these labels [Allergy / Immunology, Autopsy, Bariatrics, Cardiovascular / Pulmonary, Chiropractic, Consult – History and Phy., Cosmetic / Plastic Surgery, Dentistry, Dermatology, Diets and Nutritions, Discharge Summary, ENT – Otolaryngology, Emergency Room Reports, Endocrinology, Gastroenterology, General Medicine, Hematology – Oncology, Hospice – Palliative Care, IME-QME-Work Comp etc., Lab Medicine – Pathology, Letters, Nephrology, Neurology, Neurosurgery, Obstetrics / Gynecology, Office Notes, Ophthalmology, Orthopedic, Pain Management, Pediatrics – Neonatal, Physical Medicine – Rehab, Podiatry, Psychiatry / Psychology, Radiology, Rheumatology, SOAP / Chart / Progress Notes, Sleep Medicine, Speech – Language, Surgery, Urology]. ", input_text) AS input_text,
        output_text
      FROM
        `bqml_tutorial.medical_transcript_eval` ),
    STRUCT("classification" AS task_type))

In the above code we provided an evaluation table as input and chose ‘classification‘ as the task type on which we evaluate the model. We left other inference parameters at their defaults but they can be modified for the evaluation.

The evaluation metrics that are returned are computed for each class (label). The results look like the following:

Focusing on the F1 score (harmonic mean of precision and recall), you can see that the model performance varies between classes. For example, the baseline model performs well for ‘Autopsy’, ‘Diets and Nutritions’, and ‘Dentistry’, but performs poorly for ‘Consult – History and Phy.’, ‘Chiropractic’, and ‘Cardiovascular / Pulmonary’ classes.

Now let’s fine-tune our model and see if we can improve on this baseline performance.

Creating a fine-tuned model

Creating a fine-tuned model in BigQuery is simple. You can perform fine-tuning by specifying the training data, with 'prompt' and 'label' columns, in the CREATE MODEL statement. We use the same prompt for fine-tuning that we used in the evaluation earlier. Create a fine-tuned model as follows:

-- Fine tune a text-bison model

CREATE OR REPLACE MODEL
  `bqml_tutorial.text_bison_001_medical_transcript_finetuned` REMOTE
WITH CONNECTION `LOCATION.ConnectionID`
OPTIONS (endpoint="text-bison@001",
  max_iterations=300,
  data_split_method="no_split") AS
SELECT
  CONCAT("Please assign a label for the given medical transcript from among these labels [Allergy / Immunology, Autopsy, Bariatrics, Cardiovascular / Pulmonary, Chiropractic, Consult – History and Phy., Cosmetic / Plastic Surgery, Dentistry, Dermatology, Diets and Nutritions, Discharge Summary, ENT – Otolaryngology, Emergency Room Reports, Endocrinology, Gastroenterology, General Medicine, Hematology – Oncology, Hospice – Palliative Care, IME-QME-Work Comp etc., Lab Medicine – Pathology, Letters, Nephrology, Neurology, Neurosurgery, Obstetrics / Gynecology, Office Notes, Ophthalmology, Orthopedic, Pain Management, Pediatrics – Neonatal, Physical Medicine – Rehab, Podiatry, Psychiatry / Psychology, Radiology, Rheumatology, SOAP / Chart / Progress Notes, Sleep Medicine, Speech – Language, Surgery, Urology]. ", input_text) AS prompt,
  output_text AS label
FROM
  `bqml_tutorial.medical_transcript_train`

The CONNECTION you use to create the fine-tuned model should have (a) the Storage Object User and (b) the Vertex AI Service Agent roles attached. In addition, your Compute Engine (GCE) default service account should have editor access to the project. Refer to the documentation for guidance on working with BigQuery connections.

BigQuery performs model fine-tuning using a technique known as Low-Rank Adaptation (LoRA). LoRA tuning is a parameter-efficient tuning (PET) method that freezes the pretrained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture to reduce the number of trainable parameters. The model fine-tuning itself happens on Vertex AI compute, and you have the option to choose GPUs or TPUs as accelerators. You are billed by BigQuery for the data scanned or slots used, as well as by Vertex AI for the Vertex AI resources consumed. The fine-tuning job creates a new model endpoint that represents the learned weights. The Vertex AI inference charges you incur when querying the fine-tuned model are the same as for the baseline model.
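Conceptually (this is the standard LoRA formulation, not a BigQuery-specific detail), a frozen pretrained weight matrix W is augmented with a trainable low-rank update:

W' = W + ΔW = W + B·A,   with B ∈ R^(d×r), A ∈ R^(r×k), and rank r ≪ min(d, k)

Only A and B are trained, so the number of trainable parameters per layer drops from d·k to r·(d + k), which is what makes the method parameter-efficient.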

This fine-tuning job may take a couple of hours to complete, varying based on training options such as ‘max_iterations’. Once completed, you can find the details of your fine-tuned model in the BigQuery UI, where you will see a different remote endpoint for the fine-tuned model.

Endpoint for the baseline model vs. a fine-tuned model.

Currently, BigQuery supports fine-tuning of text-bison-001 and text-bison-002 models.

Evaluating performance of fine-tuned model

You can now generate predictions from the fine-tuned model using code such as following:

SELECT
  ml_generate_text_llm_result,
  label,
  prompt
FROM
  ml.generate_text(MODEL bqml_tutorial.text_bison_001_medical_transcript_finetuned,
    (
      SELECT
        CONCAT("Please assign a label for the given medical transcript from among these labels [Allergy / Immunology, Autopsy, Bariatrics, Cardiovascular / Pulmonary, Chiropractic, Consult – History and Phy., Cosmetic / Plastic Surgery, Dentistry, Dermatology, Diets and Nutritions, Discharge Summary, ENT – Otolaryngology, Emergency Room Reports, Endocrinology, Gastroenterology, General Medicine, Hematology – Oncology, Hospice – Palliative Care, IME-QME-Work Comp etc., Lab Medicine – Pathology, Letters, Nephrology, Neurology, Neurosurgery, Obstetrics / Gynecology, Office Notes, Ophthalmology, Orthopedic, Pain Management, Pediatrics – Neonatal, Physical Medicine – Rehab, Podiatry, Psychiatry / Psychology, Radiology, Rheumatology, SOAP / Chart / Progress Notes, Sleep Medicine, Speech – Language, Surgery, Urology]. ", input_text) AS prompt,
        output_text AS label
      FROM
        `bqml_tutorial.medical_transcript_eval`
    ),
    STRUCT(TRUE AS flatten_json_output))

Let us look at the response to the sample prompt we evaluated earlier. Using the same prompt, the model now classifies the transcript as ‘Cardiovascular / Pulmonary’ — the correct response.

Metrics-based evaluation for fine-tuned model

Now, we will compute metrics on the fine-tuned model using the same evaluation data and the same prompt we previously used for evaluating the base model.

-- Evaluate fine tuned model

SELECT
  *
FROM
  ml.evaluate(MODEL bqml_tutorial.text_bison_001_medical_transcript_finetuned,
    (
      SELECT
        CONCAT("Please assign a label for the given medical transcript from among these labels [Allergy / Immunology, Autopsy, Bariatrics, Cardiovascular / Pulmonary, Chiropractic, Consult – History and Phy., Cosmetic / Plastic Surgery, Dentistry, Dermatology, Diets and Nutritions, Discharge Summary, ENT – Otolaryngology, Emergency Room Reports, Endocrinology, Gastroenterology, General Medicine, Hematology – Oncology, Hospice – Palliative Care, IME-QME-Work Comp etc., Lab Medicine – Pathology, Letters, Nephrology, Neurology, Neurosurgery, Obstetrics / Gynecology, Office Notes, Ophthalmology, Orthopedic, Pain Management, Pediatrics – Neonatal, Physical Medicine – Rehab, Podiatry, Psychiatry / Psychology, Radiology, Rheumatology, SOAP / Chart / Progress Notes, Sleep Medicine, Speech – Language, Surgery, Urology]. ", input_text) AS prompt,
        output_text AS label
      FROM
        `bqml_tutorial.medical_transcript_eval`),
    STRUCT("classification" AS task_type))

The metrics from the fine-tuned model are below. Even though the fine-tuning (training) dataset we used for this blog contained only 519 examples, we already see a marked improvement in performance. F1 scores on the labels, where the model had performed poorly earlier, have improved, with the “macro” F1 score (a simple average of F1 score across all labels) jumping from 0.54 to 0.66.

Ready for inference

The fine-tuned model can now be used for inference using the ML.GENERATE_TEXT function, which we used in the previous steps to get the sample responses. You don’t need to manage any additional infrastructure for your fine-tuned model and you are charged the same inference price as you would have incurred for the base model.

To try fine-tuning for text-bison models in BigQuery, check out the documentation. Have feedback or need fine-tuning support for additional models? Let us know at bqml-feedback@google.com.

Special thanks to Tianxiang Gao for his contributions to this blog.

Google Cloud offers new AI, cybersecurity, and data analytics training to unlock job opportunities

Google Cloud is on a mission to help everyone build the skills they need for in-demand cloud jobs. Today, we’re excited to announce new learning opportunities that will help you gain these in-demand skills through new courses and certificates in AI, data analytics, and cybersecurity. Even better, we’re hearing from Google Cloud customers that they are eager to consider certificate completers for roles they’re actively hiring for, so don’t delay and start your learning today.

Google Cloud offers new generative AI courses

Introduction to Generative AI

Demand for AI skills is exploding in the market. There has been a staggering 21x increase in job postings that include AI technologies in 2023.1 To help prepare you for these roles, we’re announcing new generative AI courses on YouTube and Google Cloud Skills Boost, from introductory level to advanced. Once you complete the hands-on training, you can show off your new skill badges to employers.

Introductory (no cost!): This training will get you started with the basics of generative AI and responsible AI.  

Intermediate: For application developers. You will learn how to use Gemini for Google Cloud to work faster across networking, security, and infrastructure.

Advanced: For AI/ML engineers. You will learn how to integrate multimodal prompts in Gemini into your workflow.

New AI-powered, employer-recognized Google Cloud Certificates

Gen AI has triggered massive demand for skilling, especially in the areas of cybersecurity and data analytics,2 where there are significant employment opportunities. In the U.S. alone:

There are over 505,000 open entry-level roles3 related to a Cloud Cybersecurity Analyst, with a median annual salary of $135,000.4

There are more than 725,000 open entry-level roles5 related to a Cloud Data Analyst, with a median annual salary of $85,700.6

Building on the success of the Grow with Google Career Certificates, our new Google Cloud Certificates in Data Analytics and Cybersecurity can help prepare you for these high-growth, entry-level cloud jobs. 

A gen AI-powered learning journey 

What better way to understand just how much AI can do for you than by integrating it into your learning journey? You’ll get no-cost access to generative AI tools throughout your learning experience. For example, you can put your skills to use and rock your interviews with Interview Warmup, Google’s gen AI-powered interview prep tool.

Talent acquisition, reimagined 

And while we’re at it, we’ll help connect you to jobs. Our new Google Cloud Affiliate Employer program unlocks access for certificate completers to apply for jobs with some top cloud employers, like the U.S. Department of the Treasury, Rackspace, and Jack Henry.

We’re also taking it one step further. Together with the employers in the affiliate program, we’re helping reimagine talent acquisition through a new skills-based hiring effort. This new initiative uses Google Cloud technology to help move certificate completers through the hiring process. Here’s how it works: certificate completers in select hiring locations will have the chance to take custom labs that represent on-the-job scenarios, specific to each employer partner. These labs will be considered the first stage in the hiring process. By matching candidates who have the right skills to the right jobs, this initiative marks a major step forward in creating more access to job opportunities with cloud employers.

The U.S. Department of the Treasury will start using these new Google Cloud Certificates and labs for cyber and data analytics talent identification across the federal agency, per President Biden’s Executive Order on AI.

“In an age of rapid innovation and adoption of new technology offering the promise of improved productivity, it is imperative that we equip every worker with accessible training and development opportunities to understand and apply this new technology. We are partnering with Google to provide the new Cloud Certificates training for our current and future employees to accelerate their careers in cybersecurity and data analytics.” – Todd Conklin, Chief AI Officer and Deputy Assistant Secretary of Cybersecurity and Critical Infrastructure Protection, U.S. Department of the Treasury 

No-cost access for higher education institutions worldwide

To expand access to these programs, educational institutions, as well as government and nonprofit workforce development programs across the globe, can offer these new certificates and gen AI courses at no cost. Learn more and apply today.

And in the U.S., learners who successfully complete a Google Cloud Certificate can apply for college credit,7 to have a faster and more affordable pathway to a degree.

“Purdue Global’s students have benefited greatly from the strong working relationship between Purdue Global and Google. Together, they were the pioneers in stacking Grow with Google certificates into four types of degree-earning credit certificates over the past two years. We believe these new Google Cloud Cybersecurity and Data Analytics Certificates will equip our working adult learners with the essential skills to move forward and succeed in today’s cloud-driven market.” – Frank Dooley, Chancellor of Purdue Global

Take the next steps to upskill and identify cloud-skilled talent  

We’re helping to prepare new-to-industry talent for the most in-demand cloud jobs, expanding access to these opportunities globally, and pioneering a skills-based hiring effort with employers eager to hire them. Here’s how you can get started:

Learners: Preview the courses and certificates on Google Cloud YouTube and earn the full credential on Google Cloud Skills Boost to give yourself a head start as employers race to hire AI talent.

Higher education institutions and government / nonprofit workforce programs: Apply today to skill up your workforce at no cost. 

Employers: Express interest to become a Google Cloud Affiliate Employer and be considered for our skills-based hiring pilot to connect with cloud-skilled talent.

1. LinkedIn, Future of Work Report (2023)
2. CompTIA Survey (Feb 2024)
3. U.S. Bureau of Labor Statistics (2024)
4. (ISC)2 Cybersecurity Workforce Study (2022)
5. U.S. Bureau of Labor Statistics (2024)
6. U.S. Bureau of Labor Statistics (2024)
7. The Google Cloud Certificates offer a recommendation from the American Council on Education® of up to 10 college credits.

Introducing multimodal and structured data embedding support in BigQuery

Embeddings represent real-world objects, like entities, text, images, or videos, as arrays of numbers (a.k.a. vectors) that machine learning models can easily process. Embeddings are the building blocks of many ML applications such as semantic search, recommendations, clustering, outlier detection, named entity extraction, and more. Last year, we introduced support for text embeddings in BigQuery, allowing machine learning models to understand real-world data domains more effectively. Earlier this year, we introduced vector search, which lets you index and work with billions of embeddings and build generative AI applications on BigQuery.

At Next ’24, we announced further enhancement of embedding generation capabilities in BigQuery with support for:

Multimodal embeddings generation in BigQuery via Vertex AI’s multimodalembedding model, which lets you embed text and image data in the same semantic space

Embedding generation for structured data using PCA, Autoencoder or Matrix Factorization models that you train on your data in BigQuery

Multimodal embeddings

Multimodal embedding generates embedding vectors for text and image data in the same semantic space (vectors of items similar in meaning are closer together) and the generated embeddings have the same dimensionality (text and image embeddings are the same size). This enables a rich array of use cases such as embedding and indexing your images and then searching for them via text. 

You can start using multimodal embedding in BigQuery using the following simple flow. If you like, you can take a look at our overview video which walks through a similar example.

Step 0: Create an object table that points to your unstructured data
You can work with unstructured data in BigQuery via object tables. For example, if you have images stored in a Google Cloud Storage bucket that you want to generate embeddings on, you can create a BigQuery object table that points to this data without needing to move it.

To follow along with the steps in this blog, you will need to reuse an existing BigQuery CONNECTION or create a new one by following the instructions here. Ensure that the principal of the connection has the ‘Vertex AI User’ role and that the Vertex AI API is enabled for your project. Once the connection is created, you can create an object table as follows:

code_block
CREATE OR REPLACE EXTERNAL TABLE
  `bqml_tutorial.met_images`
WITH CONNECTION `LOCATION.CONNECTION_ID`
OPTIONS
  ( object_metadata = 'SIMPLE',
    uris = ['gs://gcs-public-data--met/*']
  );

In this example, we are creating an object table that contains public domain art images from The Metropolitan Museum of Art (a.k.a. “The Met”) using a public Cloud Storage bucket that contains this data. The resulting object table has the following schema:

Let’s look at a sample of these images. You can do this using a BigQuery Studio Colab notebook by following the instructions in this tutorial. As you can see, the images represent a wide range of objects and art pieces.

Image source: The Metropolitan Museum of Art

Now that we have the object table with images, let’s create embeddings for them.

Step 1: Create model
To generate embeddings, first create a BigQuery model that uses the Vertex AI hosted ‘multimodalembedding@001’ endpoint.

code_block
CREATE OR REPLACE MODEL
  bqml_tutorial.multimodal_embedding_model REMOTE
WITH CONNECTION `LOCATION.CONNECTION_ID`
OPTIONS (endpoint = 'multimodalembedding@001');

Note that while the multimodalembedding model supports embedding generation for text, it is specifically designed for cross-modal semantic search scenarios, such as searching images given text. For text-only use cases, we recommend using the textembedding-gecko model instead.
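
For instance, here is a hedged sketch of creating a text-only embedding model; the model name is illustrative, and the connection placeholder matches the one used above.

code_block
-- A sketch with illustrative names: a remote model for text-only
-- embeddings, using a textembedding-gecko endpoint instead of the
-- multimodal one.
CREATE OR REPLACE MODEL
  bqml_tutorial.text_embedding_model REMOTE
WITH CONNECTION `LOCATION.CONNECTION_ID`
OPTIONS (endpoint = 'textembedding-gecko');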

Step 2: Generate embeddings 
You can generate multimodal embeddings in BigQuery via the ML.GENERATE_EMBEDDING function. This function also works for generating text embeddings (via textembedding-gecko model) and structured data embeddings (via PCA, AutoEncoder and Matrix Factorization models). To generate embeddings, simply pass in the embedding model and the object table you created in previous steps to the ML.GENERATE_EMBEDDING function.

code_block
CREATE OR REPLACE TABLE `bqml_tutorial.met_image_embeddings`
AS
SELECT * FROM ML.GENERATE_EMBEDDING(
  MODEL `bqml_tutorial.multimodal_embedding_model`,
  TABLE `bqml_tutorial.met_images`)
WHERE content_type = 'image/jpeg'
LIMIT 10000;

To reduce the tutorial’s runtime, we limit embedding generation to 10,000 images. This query will take 30 minutes to 2 hours to run. Once this step is completed you can see a preview of the output in BigQuery Studio. The generated embeddings have a dimension of 1408.
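
If you want to verify this yourself, a quick sanity-check query along these lines should confirm the dimensionality of the persisted embeddings:

code_block
-- Optional sanity check: each embedding should have 1,408 dimensions.
SELECT
  ARRAY_LENGTH(ml_generate_embedding_result) AS embedding_dim,
  COUNT(*) AS num_images
FROM `bqml_tutorial.met_image_embeddings`
GROUP BY embedding_dim;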

Step 3 (optional): Create a vector index on generated embeddings
While the embeddings generated in the previous step can be persisted and used directly in downstream models and applications, we recommend creating a vector index for improving embedding search performance and enabling the nearest-neighbor query pattern. You can learn more about vector search in BigQuery here.

code_block
-- Create a vector index on the embeddings

CREATE OR REPLACE VECTOR INDEX `met_images_index`
ON bqml_tutorial.met_image_embeddings(ml_generate_embedding_result)
OPTIONS(index_type = 'IVF',
        distance_type = 'COSINE');

Step 4: Use embeddings for text-to-image (cross-modality) search
You can now use these embeddings in your applications. For example, to search for “pictures of white or cream colored dress from victorian era” you first embed the search string like so:

code_block
-- Embed the search string

CREATE OR REPLACE TABLE `bqml_tutorial.search_embedding`
AS
SELECT * FROM ML.GENERATE_EMBEDDING(
  MODEL `bqml_tutorial.multimodal_embedding_model`,
  (
    SELECT "pictures of white or cream colored dress from victorian era" AS content
  )
);

You can now use the embedded search string to find similar (nearest) image embeddings as follows:

code_block
-- Use the embedded search string to search for images

CREATE OR REPLACE TABLE
  `bqml_tutorial.vector_search_results` AS
SELECT
  base.uri AS gcs_uri,
  distance
FROM
  VECTOR_SEARCH(TABLE `bqml_tutorial.met_image_embeddings`,
    "ml_generate_embedding_result",
    TABLE `bqml_tutorial.search_embedding`,
    "ml_generate_embedding_result",
    top_k => 5);

Step 5: Visualize results
Now let’s visualize the results along with the computed distance and see how we performed on the search query “pictures of white or cream colored dress from victorian era”. Refer to the accompanying tutorial to see how to render this output using a BigQuery notebook.

Image source: The Metropolitan Museum of Art

The results look quite good!

Wrapping up

In this blog, we demonstrated a common vector search usage pattern, but there are many other use cases for embeddings. For example, with multimodal embeddings you can perform zero-shot classification of images: convert a table of images and a separate table of sentence-like labels to embeddings, then classify each image by computing the distance between it and each descriptive label’s embedding (see the sketch below). You can also use these embeddings as input for training other ML models, such as clustering models in BigQuery that help you discover hidden groupings in your data. Embeddings are also useful wherever you have free-text input as a feature: embeddings of user reviews or call transcripts can feed a churn prediction model, and embeddings of photos of a house can serve as input features in a price prediction model. You can even use embeddings instead of categorical text data when the categories carry semantic meaning, for example, product categories in a deep-learning recommendation model.
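
As a sketch of the zero-shot idea under stated assumptions: suppose you embed your candidate labels into a hypothetical table `bqml_tutorial.label_embeddings` (with a `label` column) using the same multimodal model. You could then assign each image its nearest label:

code_block
-- A hedged sketch of zero-shot image classification. Assumes a
-- hypothetical `label_embeddings` table produced by running
-- ML.GENERATE_EMBEDDING over sentence-like labels with the same model.
SELECT
  query.uri AS image_uri,
  base.label AS predicted_label,
  distance
FROM
  VECTOR_SEARCH(TABLE `bqml_tutorial.label_embeddings`,
    "ml_generate_embedding_result",
    TABLE `bqml_tutorial.met_image_embeddings`,
    "ml_generate_embedding_result",
    top_k => 1);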

In addition to multimodal and text embeddings, BigQuery also supports generating embeddings on structured data using PCA, AUTOENCODER and Matrix Factorization models that have been trained on your data in BigQuery. These embeddings have a wide range of use cases. For example, embeddings from PCA and AUTOENCODER models can be used for anomaly detection (embeddings further away from other embeddings are deemed anomalies) and as input features to other models, for example, a sentiment classification model trained on embeddings from an autoencoder. Matrix Factorization models are classically used for recommendation problems, and you can use them to generate user and item embeddings. Then, given a user embedding you can find the nearest item embeddings and recommend these items, or cluster users so that they can be targeted with specific promotions.

To generate such embeddings, first use the CREATE MODEL statement to train a PCA, Autoencoder, or Matrix Factorization model on your data, then call the ML.GENERATE_EMBEDDING function with that model and a table input to generate embeddings for the data, as sketched below.
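
A minimal sketch of this flow, where the dataset, table, and column names are hypothetical:

code_block
-- Train a PCA model on structured data, then embed the same table.
-- Table and column names below are illustrative.
CREATE OR REPLACE MODEL `bqml_tutorial.pca_embedding_model`
OPTIONS (model_type = 'PCA', num_principal_components = 16)
AS SELECT * EXCEPT(customer_id) FROM `bqml_tutorial.customer_features`;

SELECT *
FROM ML.GENERATE_EMBEDDING(
  MODEL `bqml_tutorial.pca_embedding_model`,
  TABLE `bqml_tutorial.customer_features`);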

Getting started

Support for multimodal embeddings and support for embeddings on structured data in BigQuery is now available in preview. Get started by following our documentation and tutorials. Have feedback? Let us know what you think at bqml-feedback@google.com.

Announcing Delta Lake support for BigQuery

Delta Lake is an open-source, optimized storage layer that provides a foundation for tables in lakehouses and brings reliability and performance improvements to existing data lakes. It sits on top of your data lake storage (like cloud object stores) and provides a performant and scalable metadata layer on top of data stored in the Parquet format.

Organizations use BigQuery to manage and analyze all data types, structured and unstructured, with fine-grained access controls. In the past year, customer use of BigQuery to process multiformat, multicloud, and multimodal data using BigLake has grown over 60x. Support for open table formats gives you the flexibility to use existing open source and legacy tools while getting the benefits of an integrated data platform. This is enabled via BigLake — a storage engine that allows you to store data in open file formats on cloud object stores such as Google Cloud Storage, and run Google-Cloud-native and open-source query engines on it in a secure, governed, and performant manner. BigLake unifies data warehouses and lakes by providing an advanced, uniform data governance model. 

This week at Google Cloud Next ’24, we announced that this support now extends to the Delta Lake format, enabling you to query Delta Lake tables stored in Cloud Storage or Amazon Web Services S3 directly from BigQuery, without having to export or copy data, or use manifest files.

Why is this important? 

If you have existing dependencies on Delta Lake and prefer to continue using it, you can now take advantage of BigQuery’s native support. Google Cloud provides an integrated and price-performant experience for Delta Lake workloads, encompassing unified data management, centralized security, and robust governance. Many customers already harness the capabilities of Dataproc or Serverless Spark to manage Delta Lake tables on Cloud Storage. Now, BigQuery’s native Delta Lake support enables seamless delivery of data to downstream applications such as business intelligence and reporting, as well as integration with Vertex AI. This lets you do a number of things, including:

Build a secure and governed lakehouse with BigLake’s fine-grained security model

Securely exchange Delta Lake data using Analytics Hub 

Run data science workloads on Delta Lake using BigQuery ML and Vertex AI 

How to use Delta Lake with BigQuery

Delta Lake tables follow the same table creation process as BigLake tables. 

Required roles

To create a BigLake table, you need the following BigQuery identity and access management (IAM) permissions: 

bigquery.tables.create 

bigquery.connections.delegate

Prerequisites

Before you create a BigLake table, you need to have a dataset and a Cloud resource connection that can access Cloud Storage.

Table creation using DDL

Here is the DDL statement to create a Delta Lake table:

code_block
CREATE EXTERNAL TABLE `PROJECT_ID.DATASET.DELTALAKE_TABLE_NAME`
WITH CONNECTION `PROJECT_ID.REGION.CONNECTION_ID`
OPTIONS (
  format = "DELTA_LAKE",
  uris = ['DELTA_TABLE_GCS_BASE_PATH']);

Querying Delta Lake tables

After creating a Delta Lake BigLake table, you can query it using GoogleSQL syntax, the same as you would a standard BigQuery table. For example:

code_block
SELECT FIELD1, FIELD2 FROM `PROJECT_ID.DATASET.DELTALAKE_TABLE_NAME`;

You can also enforce fine-grained security at the table level, including row-level and column-level security. For Delta Lake tables based on Cloud Storage, you can also use dynamic data masking.
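
As one hedged example of row-level security on the table created above, a row access policy might look like the following; the policy, group, and column names are illustrative:

code_block
-- Illustrative row-level security: only members of the group can
-- see rows where region = 'US'. Names here are hypothetical.
CREATE ROW ACCESS POLICY us_rows_only
ON `PROJECT_ID.DATASET.DELTALAKE_TABLE_NAME`
GRANT TO ('group:us-analysts@example.com')
FILTER USING (region = 'US');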

Conclusion

We believe that BigQuery’s support for Delta Lake is a major step forward for customers building lakehouses using Delta Lake. This integration will make it easier for you to get insights from your data and make data-driven decisions. We are excited to see how you use Delta Lake and BigQuery together to solve your business challenges. For more information on how to use Delta Lake with BigQuery, please refer to the documentation.

Acknowledgments: Mahesh Bogadi, Garrett Casto, Yuri Volobuev, Justin Levandoski, Gaurav Saxena, Manoj Gunti, Sami Akbay, Nic Smith and the rest of the BigQuery Engineering team.

Analyze images and videos in BigQuery using Gemini 1.0 Pro Vision

With the proliferation of digital devices and platforms, including social media, mobile devices, and IoT sensors, organizations are increasingly generating unstructured data in the form of images, audio files, videos, and documents. Over the last few months, we launched BigQuery integrations with Vertex AI to leverage Gemini 1.0 Pro, PaLM, Vision AI, Speech AI, Doc AI, Natural Language AI, and more to help you interpret and extract meaningful insights from unstructured data.

While Vision AI provides image classification and object recognition capabilities, large language models (LLMs) unlock new visual use cases. To that end, we are expanding BigQuery and Vertex AI integrations to support multimodal generative AI use cases with Gemini 1.0 Pro Vision. Using familiar SQL statements, you can take advantage of Gemini 1.0 Pro Vision directly in BigQuery to analyze both images and videos by combining them with your own text prompts.

A bird’s-eye view of Vertex AI integration capabilities for analyzing unstructured data in BigQuery

Within a data warehouse setting, multimodal capabilities can help enhance your unstructured data analysis across a variety of use cases: 

Object recognition: Answer questions related to fine-grained identification of the objects in images and videos.

Info seeking: Combine world knowledge with information extracted from the images and videos.

Captioning/description: Generate descriptions of images and videos with varying levels of detail.

Digital content understanding: Answer questions by extracting information from content like infographics, charts, figures, tables, and web pages.

Structured content generation: Generate responses in formats like HTML and JSON based on provided prompt instructions.

Turning unstructured data into structured data

With minimal prompt adjustments, Gemini 1.0 Pro Vision can produce structured responses in convenient formats like HTML or JSON, making them easy to consume in downstream tasks. In a data warehouse such as BigQuery, having structured data means you can use the results in SQL operations and combine it with other structured datasets for deeper analysis.

For example, imagine you have a large dataset that contains images of cars. You want to understand a few basic details about the car in each image. This is a use case that Gemini 1.0 Pro Vision can help with!

Combining text and image into a prompt for Gemini 1.0 Pro Vision, with a sample response.

Dataset from: 3D Object Representations for Fine-Grained Categorization. Jonathan Krause, Michael Stark, Jia Deng, Li Fei-Fei. 4th IEEE Workshop on 3D Representation and Recognition, at ICCV 2013 (3dRR-13). Sydney, Australia. Dec. 8, 2013.

As you can see, Gemini’s response is very thorough! But while the format and extra information are great if you’re a person, they’re not so great if you’re a data warehouse. Rather than turning unstructured data into more unstructured data, you can make changes to the prompt to direct the model on how to return a structured response.

Adjusting the text portion of the prompt to indicate a structured response from Gemini 1.0 Pro Vision, with a sample result.

You can see how this response would be much more useful in an environment like BigQuery.

Now let’s see how to prompt Gemini 1.0 Pro Vision directly in BigQuery to perform this analysis over thousands of images!

Accessing Gemini 1.0 Pro Vision from BigQuery ML

Gemini 1.0 Pro Vision is integrated with BigQuery through the ML.GENERATE_TEXT() function. To unlock this function in your BigQuery project, you will need to create a remote model that represents a hosted Vertex AI large language model. Fortunately, it’s just a few lines of SQL:

code_block
CREATE MODEL `mydataset.gemini_pro_vision_model`
REMOTE WITH CONNECTION `us.bqml_llm_connection`
OPTIONS(endpoint = 'gemini-pro-vision');

Once the model is created, you can combine your data with the ML.GENERATE_TEXT() function in your SQL queries to generate text. 

A few notes on the ML.GENERATE_TEXT() function syntax when it is pointing to a gemini-pro-vision model endpoint, as is the case in this example:

TABLE: takes an object table as input, which can contain different types of unstructured objects (e.g., images, videos).

PROMPT: takes a single text prompt that is placed inside the options STRUCT (unlike when using the gemini-pro model) and applies it, row by row, to each object contained in the object TABLE.

code_block
SELECT
  uri,
  ml_generate_text_llm_result AS brand_model_year
FROM
  ML.GENERATE_TEXT(
    MODEL `mydataset.gemini_pro_vision_model`,
    TABLE `mydataset.car_images_object_table`,
    STRUCT(
      'What is the brand, model, and year of this car? Answer in JSON format with three keys: brand, model, year. brand and model should be string, year should be integer.' AS prompt,
      TRUE AS flatten_json_output));

Let’s take a peek at the results.

We can add some SQL to this query to extract each of the values for brand, model, and year into new fields for use downstream.

code_block
WITH raw_json_result AS (
  SELECT
    uri,
    ml_generate_text_llm_result AS brand_model_year
  FROM
    ML.GENERATE_TEXT(
      MODEL `mydataset.gemini_pro_vision_model`,
      TABLE `mydataset.car_images_object_table`,
      STRUCT(
        'What is the brand, model, and year of this car? Answer in JSON format with three keys: brand, model, year. brand and model should be string, year should be integer.' AS prompt,
        TRUE AS flatten_json_output)))
SELECT
  uri,
  JSON_QUERY(RTRIM(LTRIM(raw_json_result.brand_model_year, " ```json"), "```"), "$.brand") AS brand,
  JSON_QUERY(RTRIM(LTRIM(raw_json_result.brand_model_year, " ```json"), "```"), "$.model") AS model,
  JSON_QUERY(RTRIM(LTRIM(raw_json_result.brand_model_year, " ```json"), "```"), "$.year") AS year
FROM raw_json_result;

Now the responses have been parsed into new, structured columns.

And there you have it. We’ve just turned a collection of unlabeled, raw images into structured data, fit for analysis in a data warehouse. Imagine joining this new table with other relevant enterprise data. With a dataset of historical car sales, for example, you could determine the average or median sale price for similar cars in a recent time period. This is just a taste of the possibilities that are uncovered by bringing unstructured data into your data workflows!
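
For example, a hedged sketch of such a join might look like the following; the sales table and its columns are hypothetical, and you may need to cast the extracted JSON values to matching types first:

code_block
-- Hypothetical join of extracted car attributes with a sales table
-- to compute a median sale price per brand, model, and year.
SELECT
  c.brand,
  c.model,
  c.year,
  APPROX_QUANTILES(s.sale_price, 100)[OFFSET(50)] AS median_sale_price
FROM `mydataset.car_attributes` AS c
JOIN `mydataset.car_sales_history` AS s
  ON s.brand = c.brand AND s.model = c.model AND s.year = c.year
GROUP BY c.brand, c.model, c.year;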

When getting started with Gemini 1.0 Pro Vision in BigQuery, there are a few important items to note:

You need an Enterprise or Enterprise Plus edition reservation to run Gemini 1.0 Pro Vision model inference over an object table. For reference, see the BigQuery editions documentation.

Limits apply to functions that use Vertex AI large language models (LLMs) and Cloud AI services, so review the current quota in place for the Gemini 1.0 Pro Vision model.

Next steps

Bringing generative AI directly into BigQuery has enormous benefits. Instead of writing custom Python code and building data pipelines between BigQuery and the generative AI model APIs, you can now just write a few lines of SQL! BigQuery manages the infrastructure and helps you scale from one prompt to thousands. Check out the overview and demo video, and the documentation to see more example queries using ML.GENERATE_TEXT() with Gemini 1.0 Pro Vision.

Coming to Next ‘24? Check out the session Power data analytics with generative AI using BigQuery and Gemini, where you can see Gemini 1.0 Pro Vision and BigQuery in action.

How Gemini in BigQuery accelerates data and analytics workflows with AI

The journey from data to insights can be fragmented, complex, and time consuming. Data teams spend time on repetitive and routine tasks such as ingesting structured and unstructured data, wrangling data in preparation for analysis, and optimizing and maintaining pipelines. They would much rather spend that time on higher-value analysis and insight-led decision making.

At Next ‘23, we introduced Duet AI in BigQuery. This year at Next ‘24, Duet AI in BigQuery becomes Gemini in BigQuery, which provides AI-powered experiences for data preparation, analysis, and engineering, as well as intelligent recommendations to enhance user productivity and optimize costs.

“With the new AI-powered assistive features in BigQuery and ease of integrating with other Google Workspace products, our teams can extract valuable insights from data. The natural language-based experiences, low-code data preparation tools, and automatic code generation features streamline high-priority analytics workflows, enhancing the productivity of data practitioners and providing the space to focus on high impact initiatives. Moreover, users with varying skill sets, including our business users, can leverage more accessible data insights to effect beneficial changes, fostering an inclusive data-driven culture within our organization.” said Tim Velasquez, Head of Analytics, Veo 

Let’s take a closer look at the new features of Gemini in BigQuery.

Accelerate data preparation with AI

Your business insights are only as good as your data. When you work with large datasets that come from a variety of sources, there are often inconsistent formats, errors, and  missing data. As such, cleaning, transforming, and structuring them can be a major hurdle.

To simplify data preparation, validation, and enrichment, BigQuery now includes AI-augmented data preparation that helps users cleanse and wrangle their data. Additionally, we are enabling users to build low-code visual data pipelines, or rebuild legacy pipelines, in BigQuery.

Once the pipelines are running in production, AI assists with finding and resolving issues such as schema or data drift, significantly reducing the toil associated with maintaining a data pipeline. Because the resulting pipelines run in BigQuery, users also benefit from integrated metadata management, automatic end-to-end data lineage, and capacity management.

Gemini in BigQuery provides AI-driven assistance for users to clean and wrangle data

Kickstart the data-to-insights journey

Most data analysis starts with exploration — finding the right dataset, understanding the data’s structure, identifying key patterns, and identifying the most valuable insights you want to extract. This step can be cumbersome and time-consuming, especially if you are working with a new dataset or if you are new to the team. 

To address this problem, Gemini in BigQuery provides new semantic search capabilities to help you pinpoint the most relevant tables for your tasks. Leveraging the metadata and profiling information of these tables from Dataplex, Gemini in BigQuery surfaces relevant, executable queries that you can run with just one click. You can learn more about BigQuery data insights here.

Gemini in BigQuery suggests executable queries for tables that you can run in a single click

Reimagine analytics workflows with natural language

To boost user productivity, we’re also rethinking the end-to-end user experience. The new BigQuery data canvas provides a reimagined natural language-based experience for data exploration, curation, wrangling, analysis, and visualization, allowing you to explore and scaffold your data journeys in a graphical workflow that mirrors your mental model. 

For example, to analyze a recent marketing campaign, you can use simple natural language prompts to discover campaign data sources, integrate with existing customer data, derive insights, and share visual reports with executives — all within a single experience. Watch this video for a quick overview of BigQuery data canvas.

BigQuery data canvas allows you to explore and analyze datasets, and create a customized visualization, all using natural language prompts within the same interface

Enhance productivity with SQL and Python code assistance 

Even advanced users sometimes struggle to remember all the details of SQL or Python syntax, and navigating through numerous tables, columns, and relationships can be daunting. 

Gemini in BigQuery helps you write and edit SQL or Python code using simple natural language prompts, referencing relevant schemas and metadata. You can also leverage BigQuery’s in-console chat interface to explore tutorials, documentation and best practices for specific tasks using simple prompts such as: “How can I use BigQuery materialized views?” “How do I ingest JSON data?” and “How can I improve query performance?”

Optimize analytics for performance and speed 

With growing data volumes, analytics practitioners, including data administrators, find it increasingly challenging to effectively manage capacity and enhance query performance. We are introducing recommendations that can help continuously improve query performance, minimize errors, and optimize your platform costs.

With these recommendations, you can identify materialized views that can be created or deleted based on your query patterns, and partition or cluster your tables. Additionally, you can autotune Spark pipelines and troubleshoot failures and performance issues.

Get started

To learn more about Gemini in BigQuery, watch this short overview video, refer to the documentation, and sign up to get early access to the preview features. If you’re at Next ‘24, join our data and analytics breakout sessions and stop by the demo stations to explore further and see these capabilities in action. Pricing details for Gemini in BigQuery will be shared when it becomes generally available to all customers.
