Go from logs to security insights faster with Dataform and Community Security Analytics

Making sense of security logs such as audit and network logs can be a challenge, given the volume, variety and velocity of valuable logs from your Google Cloud environment. To help accelerate the time to get security insights from your logs, the open-source Community Security Analytics (CSA) provides pre-built queries and reports you can use on top of Log Analytics powered by BigQuery. Customers and partners use CSA queries to help with data usage audits, threat detection and investigation, behavioral analytics and network forensics. It’s now easier than ever to deploy and operationalize CSA on BigQuery, with significant query performance gains and cost savings.

In collaboration with Onix, a premier Google Cloud service partner, we’re delighted to announce that CSA can now be deployed via Dataform, a BigQuery service and open-source data modeling framework that manages the Extract, Load, Transform (ELT) process for your data. Now, you can automate the rollout of CSA reports and alerts with cost-efficient summary tables and entity lookup tables (e.g., unique users and IP addresses seen). Dataform handles the infrastructure and orchestration of ELT pipelines that filter, normalize and model log data, turning the raw logs in Log Analytics into curated, up-to-date BigQuery tables and views for the approximately 50 CSA use cases, as shown in the dependency tree below.

The best of Google Cloud for your log analysis

Dataform, alongside Log Analytics in Cloud Logging and BigQuery, provides the best of Google Cloud for log management and analysis.

First, BigQuery provides a fully managed, petabyte-scale, centralized data warehouse to store all your logs (as well as other security data, such as your Security Command Center findings).

Then, Log Analytics from Cloud Logging provides a native and simple solution to route and analyze your logs in BigQuery by enabling you to analyze in place without exporting or duplicating logs, or worrying about partitioning, clustering or setting up search indexes.

Finally, Dataform sets up the log data modeling necessary to report, visualize, and alert on your logs using normalized, continuously updated summary tables derived from the raw logs.

Why deploy CSA with Dataform?

Optimize query cost and performance

By querying logs from the summary tables, the amount of data scanned is significantly reduced compared to querying the source BigQuery _AllLogs view. In our internal test environment, the data scanned was often less than 1% of the raw logs (example screenshot below), which is expected given how voluminous raw logs are.

Summary table for DNS logs enables 330x efficiency gains in data scanned by SQL queries

This leads to significant cost savings for read-heavy workloads such as reporting and alerting on logs. This is particularly important for customers leveraging BigQuery scheduled queries for continuous alerting, and/or a business intelligence (BI) tool on top of BigQuery such as Looker, or Grafana for monitoring.

Note: Unlike reporting and alerting, ad hoc search for troubleshooting or investigation can be done via the Log Analytics user interface at no additional query cost.
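
If you want to quantify this saving in your own environment before scheduling anything, a BigQuery dry run reports bytes scanned without running (or billing) the query. Below is a minimal sketch using the Python client library; the project, dataset, summary table, and column names are placeholders, so substitute the linked log dataset and the summary tables that your CSA Dataform deployment actually creates.

```python
from google.cloud import bigquery

client = bigquery.Client()
dry_run = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

# Placeholder names: replace with your linked log dataset and the
# summary table produced by the CSA Dataform pipeline.
queries = {
    "raw _AllLogs view": """
        SELECT COUNT(*)
        FROM `my-project.my_linked_logs._AllLogs`
        WHERE DATE(timestamp) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
    """,
    "DNS summary table": """
        SELECT COUNT(*)
        FROM `my-project.csa.dns_summary_daily`
        WHERE day = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
    """,
}

for label, sql in queries.items():
    job = client.query(sql, job_config=dry_run)
    # For a dry run, total_bytes_processed is set without executing the query.
    print(f"{label}: {job.total_bytes_processed / 1e6:.1f} MB scanned")
```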

Segregate logs by domains and drop sensitive fields

Logs commonly contain sensitive and confidential information. By storing your log data in separate domain-specific tables, you can help ensure that authorized users can view only the logs they need to perform their job. For example, a network forensics analyst may only need access to network logs, as opposed to other sensitive logs like data audit logs. With Dataform for CSA, you can enforce this separation of duties with table-level permissions, by granting such analysts read-only access to the network activity summary tables (for CSA 6.*) but not to the data usage summary tables (for CSA 5.*).
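
If you manage access with SQL, this boils down to a per-table GRANT on the relevant summary tables. The snippet below is a sketch only; the project, dataset, table, and group names are placeholders for the resources your Dataform deployment actually creates.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Grant read-only access to a single summary table (placeholder names);
# the rest of the dataset stays inaccessible to this group.
grant_sql = """
GRANT `roles/bigquery.dataViewer`
ON TABLE `my-project.csa.network_summary_daily`
TO "group:network-forensics@example.com"
"""
client.query(grant_sql).result()
```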

Furthermore, by summarizing the data over time (hourly or daily), you can eliminate potentially sensitive low-level information. For example, request metadata, including caller IP and user agent, is not captured in the user actions summary table (for CSA 4.01). This way, an ML researcher performing behavioral analytics can focus on user activities over time to look for anomalies, without accessing personal user details such as IP addresses.

Unlock AI/ML and gen AI capabilities

Normalizing log data into simpler and smaller tables greatly accelerates time to value. For example, analyzing the summarized and normalized BigQuery table for user actions (the 4_01_summary_daily table depicted below) is significantly simpler and delivers more insights than trying to analyze the _AllLogs BigQuery view in its original raw format. The latter has a complex (and sometimes obscure) schema including several nested records and JSON fields, which limits the ability to parse the logs and identify patterns.

A snippet of source _AllLogs schema (left) compared to low-dimensional summary table (right)

The normalized logs allow you to scale ML efforts both computationally (because the dataset is smaller and simpler) and organizationally: ML researchers don’t need to be familiar with the Cloud Logging log schema or LogEntry definition to analyze summary tables such as daily user actions.

This also enables gen AI opportunities, such as using LLMs to generate SQL queries from natural language based on a given database schema. There’s a lot of ongoing research into using LLMs for text-to-SQL applications. Early research has shown promising results, where simpler schemas and distinct domain-specific datasets yielded reasonably accurate SQL queries.
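
As a rough, hypothetical illustration of why the simpler schema matters, the sketch below asks the text-bison model (via the Vertex AI SDK) to write SQL against a flat user-actions summary table. The column names in the prompt are assumptions, not the actual CSA schema; the point is that a small, well-named schema fits comfortably into a short prompt, which the nested _AllLogs schema does not.

```python
import vertexai
from vertexai.language_models import TextGenerationModel

vertexai.init(project="my-project", location="us-central1")
model = TextGenerationModel.from_pretrained("text-bison")

# Hypothetical flat summary-table schema, small enough to describe in one prompt.
prompt = """Given a BigQuery table `csa.user_actions_summary_daily` with columns
(day DATE, principal_email STRING, action STRING, action_count INT64),
write a SQL query that returns the 10 users with the most actions yesterday."""

response = model.predict(prompt, temperature=0.2)
print(response.text)
```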

How to get started

Before leveraging BigQuery Dataform for CSA, aggregate your logs in a central log bucket and create a linked BigQuery dataset provided by Log Analytics. If you haven’t done so, follow the steps to route your logs to a log bucket (select the Log Analytics tab) as part of the security log analytics solution guide.

See Getting started in the CSA Dataform README to start building your CSA tables and views off of the source logs view, i.e., the BigQuery view _AllLogs from Log Analytics. You can run Dataform through the Google Cloud console (more common) or via the Dataform CLI.

You may want to use the Dataform CLI for a quick one-time or ad hoc Dataform execution to process historical logs, where you specify a desired lookback time window (default is 90 days). 

However, in most cases, you need to set up a Dataform repository via the Cloud console, as well as Dataform workflows for scheduled Dataform executions to process historical and streaming logs. Dataform workflows will continuously and incrementally update your target dataset with new data on a regular schedule, say hourly or daily. This enables you to continuously and cost-efficiently report on fresh data. The Cloud console Dataform page also allows you to manage your Dataform resources, edit your Dataform code inline, and visualize your dataset dependency tree (like the one shown above), all with access control via fine-grained Dataform IAM roles.

Leverage partner delivery services

Get on the fast path to building your own security data lake and monitoring on BigQuery by using Dataform for CSA and leveraging specialized Google Cloud partners.

Onix, a Premier Google Cloud partner and a leading provider of data management and security solutions, is available to help customers leverage this new Dataform for CSA functionality, including:

Implementing security foundations to deploy this CSA solution

Setting up your Google Cloud logs for security visibility and coverage

Deploying CSA with Dataform following Infrastructure as Code and data warehousing best practices

Managing and scaling your reporting and alerting layer as part of your security foundation

In summary, BigQuery natively stores your Google Cloud logs via Log Analytics, as well as your high-fidelity Security Command Center alerts. No matter the size of your organization, you can deploy CSA with Dataform today to report and alert on your Google Cloud security data. By leveraging specialized partners like Onix to help you design, build and implement your security analytics with CSA, your security data lake can be built to meet your specific security and compliance requirements.

Deliver trusted insights with Dataplex data profiling and automatic data quality

We are excited to announce the general availability of data profiling and automatic data quality (AutoDQ) in Dataplex. These features enable Google Cloud customers to build trust in their analytical data in a scalable and automated manner. 

Power innovation, decision-making, and differentiated customer experience with high-quality data

Data quality has always been an essential foundation for successful analytics and ML models. In the past six months, the rapid rise of artificial intelligence (AI) has led to an explosion in the use of machine learning (ML) models, making data quality even more critical. Data scientists and analysts need to understand their data more deeply before building models; that understanding ultimately leads to more accurate and reliable ML outcomes.

Dataplex data profiling and AutoDQ make it easy to build and maintain this information in a scalable and efficient manner. These features offer:

Reduction in time to insights about the data

Dataplex makes it easy and quick to go from data to its profile and quality. These features have zero-setup requirements and are easy to start within the BigQuery UI. Dataplex AutoDQ gets you started with intelligent rule recommendations and generates quality reports within a short time. Similarly, with a single click, Dataplex data profiling generates meaningful insights like data distribution, top-N values, and unique percentages.

Rich platform capabilities

The underlying capabilities of the platform allow users to build an end-to-end solution with the desired customizations. Dataplex AutoDQ enables a complete data quality solution, from rules to reports to alerts. AutoDQ rules can also incorporate BigQuery ML for advanced use cases. With the information data profiling generates, you can build custom AI/ML models, such as drift detection to catch meaningful shifts in your training data.

Secure, performant, and efficient execution

These features are designed to work with petabyte-scale data without any data copies. While they leverage the scale-out power of BigQuery behind the scenes, they have zero impact on customers’ BigQuery slots and reservations.

Dataplex data profiling and AutoDQ are powerful new tools that can help organizations improve their data quality and build more accurate and reliable insights and models.

What our customers have to say

Here is what some of our customers say about Dataplex data profiling and AutoDQ:

“At Paramount we have data coming from multiple vendors and data anomalies might occur from time to time with data from different sources and integration channels. We have started incorporating Dataplex AutoDQ and BigQuery ML to address the challenges to detect and get alerted on anomalies in real time. This is not only efficient but it will improve the accuracy of our data.”  – Bhaskara Peta, SVP Data Engineering, Paramount Global

“At Orange, we are always on the cutting edge of innovation and rely on trusted insights to power this innovation. As we move our data and AI workloads to GCP, we have been looking for an elegant, integrated data quality service to provide a seamless experience to our data engineering team. We started using Dataplex AutoDQ at a very early stage and we believe it could become a strong basis in our journey to Data Democracy. We are also excited to continue partnering with Google on building a stronger and innovative roadmap!” – Guillaume Prévost, Lead Tech, Data and AI, Orange

New features

New and exciting additional features since the public preview include: 

Configure and view results in BigQuery UI in addition to Dataplex 

You can now perform data quality and data profiling tasks directly from BigQuery in addition to Dataplex. Data owners can configure their data scans and publish the latest results of their data profile and data quality scans next to the table information in BigQuery. This information can then be viewed by any user with the appropriate authorization, regardless of the project in which the table resides. This makes it easier for users to get started with data quality and data profiling, and it provides a more consistent experience across all of the tools they use to manage their data.

New deployment options

In addition to our rich UI, we also added support for creating, managing, and deploying data quality and data profiling scans using a variety of methods, including:

Terraform: First-class Terraform resources for deploying and managing data quality and data profiling scans.

Python and Java client libraries: We provide client libraries for Python and Java that make it easy to interact with Dataplex data profiling and AutoDQ from your code.

CLI: We also have a comprehensive CLI that can be used to create, manage, and deploy scans from the command line.

YAML: When using the CLI or Terraform, you can create and manage your scans using a YAML-based specification.

We have also made Airflow operators available for data quality, allowing engineers to build data quality checks into their data production pipelines. The Airflow operator gives data engineers more flexibility in using AutoDQ, making it easier to integrate data quality checks with their existing data pipelines.

New configuration options to save costs and/or protect sensitive data

We have enhanced the core capabilities of Dataplex data profiling and AutoDQ to make them more flexible and scalable.

Row filters: Users can now specify row filters to focus on data from certain segments or to eliminate certain data from the scan. This can be useful for tuning scans for specific use cases, such as compliance or privacy.

Column filters: You can now specify column filters to avoid publishing stats on sensitive columns. This can help to protect sensitive data from unauthorized access.

Sampling: You can now sample your data for quick tests to save costs. This can be useful for getting a quick overview of the data quality and data profile without having to scan the entire dataset.

Build your reports or downstream ML models 

Dataplex data profiling and AutoDQ can also export metrics to a BigQuery table. This makes it easy to build downstream applications that use the metrics, such as:

Drift detection: You can use BQML to build a model that predicts the expected values for the metrics. You can then use this model to detect any changes in the metrics that indicate data drift, as sketched after this list.

Data-quality dashboard: You can build a dashboard that visualizes the metrics for a data domain. This can help you to identify any data quality issues.
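
Here is a minimal sketch of the drift-detection idea, assuming the exported metrics land in a table with one row per scan; the `metrics.profile_results` table and its `scan_time` and `null_ratio` columns are placeholders for whatever your export actually produces. It trains a time-series model on the metric history and then flags scans whose values fall outside the expected band.

```python
from google.cloud import bigquery

client = bigquery.Client()

# 1. Train a time-series model on the exported profile-metric history.
client.query("""
CREATE OR REPLACE MODEL `my-project.metrics.null_ratio_model`
OPTIONS (model_type = 'ARIMA_PLUS',
         time_series_timestamp_col = 'scan_time',
         time_series_data_col = 'null_ratio') AS
SELECT scan_time, null_ratio
FROM `my-project.metrics.profile_results`
""").result()

# 2. Flag scans where the observed metric is anomalous at 99% confidence.
anomalies = client.query("""
SELECT scan_time, null_ratio, is_anomaly
FROM ML.DETECT_ANOMALIES(
  MODEL `my-project.metrics.null_ratio_model`,
  STRUCT(0.99 AS anomaly_prob_threshold),
  (SELECT scan_time, null_ratio FROM `my-project.metrics.profile_results`))
WHERE is_anomaly
""").result()

for row in anomalies:
    print(row.scan_time, row.null_ratio)
```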

We are grateful to our customers for partnering with us to build data trust, and we are excited to make these features generally available so that even more customers can benefit.

Learn more 

Get started by creating a data profile or data quality scan on BigQuery public data 

Learn more about Dataplex Data profiling

Learn more about Dataplex AutoDQ

Git repo with sample scripts and an Airflow DAG

Enhancing Google Cloud’s blockchain data offering with 11 new chains in BigQuery

Early in 2018, Google Cloud worked with the community to democratize blockchain data via our BigQuery public datasets; in 2019, we expanded with six more datasets. Today, we’ve added eleven more of the most in-demand blockchains to the BigQuery public datasets, in preview. And we’re making improvements to existing datasets in the program, too.

We’re doing this because blockchain foundations, Web3 analytics firms, partners, developers, and customers tell us they want a more comprehensive view across the crypto landscape, and to be able to query more chains. They want to answer complex questions and verify subjective claims such as “How many NFTs were minted today across three specific chains?” “How do transaction fees compare across chains?” and “How many active wallets are on the top EVM chains?” 

Having a more robust list of chains accessible via BigQuery and new ways to access data will help the Web3 community better answer these questions and others, without the overhead of operating nodes or maintaining an indexer. Customers can now query full on-chain transaction history off-chain to understand the flow of assets from one wallet to another, which tokens are most popular, and how users are interacting with smart contracts. 

Chain expansion

Here are the 11 in-demand chains we’re adding to the BigQuery public datasets:

Avalanche

Arbitrum

Cronos

Ethereum (Görli)

Fantom (Opera) 

Near 

Optimism

Polkadot

Polygon Mainnet 

Polygon Mumbai 

Tron

We’re also improving the current Bitcoin BigQuery dataset by adding Satoshis (sats) / Ordinals to the open-source blockchain-ETL datasets for developers to query. Ordinals, in their simplest state, are a numbering scheme for sats. 

Google Cloud managed datasets 

We want to provide users with a range of data options. In addition to community managed datasets on BigQuery, we are creating first-party Google Cloud managed datasets that offer additional feature capabilities. For example, in addition to the existing Ethereum community dataset (crypto_ethereum), we created a Google Cloud managed Ethereum dataset (goog_ethereum_mainnet.us) which offers a full representation of the data model native to Ethereum with curated tables for events. Customers that are looking for richer analysis on Ethereum will be able to access derived data to easily query wallet balances, transactions related to specific tokens (ERC20, ERC721, ERC1155), or interactions with smart contracts.

We want to provide fast and reliable enterprise-grade results for our customers and the Web3 community. Here’s an example of a query against the goog_ethereum_mainnet.us dataset:

Let’s say we want to know “How many ETH transactions are executed daily (last 7 days)?”

```sql
SELECT DATE(block_timestamp) as date, COUNT(*) as txns
FROM `bigquery-public-data.goog_blockchain_ethereum_mainnet_us.transactions`
WHERE DATE(block_timestamp) > DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY 1
ORDER BY 1 DESC;

SELECT DATE(block_timestamp) as date, COUNT(*) as txns
FROM `bigquery-public-data.crypto_ethereum.transactions`
WHERE DATE(block_timestamp) > DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY 1
ORDER BY 1 DESC;
```

In the results above, you can see that using the goog_ dataset is faster and consumes less slot time, while remaining competitive in terms of bytes processed.

More precise data 

We gathered feedback from customers and developers to understand pain points from the community, and heard loud and clear that features such as numerical precision are important for more accurately calculating the pricing of certain coins. We are improving the precision of the blockchain datasets by launching UDFs for better UINT256 integration and BIGNUMERIC¹ support. This will give customers access to longer decimal digits for their blockchain data and reduce rounding errors in computation.

Making on-chain data more accessible off-chain

Today, customers interested in blockchain data must first get access to the right nodes, then develop and maintain an indexer that transforms the data into a queryable data model. They then repeat this process for every protocol they’re interested in. 

By leveraging our deep expertise in scalable data processing, we’re making on-chain data accessible off-chain for easier consumption and composability, enabling developers to access blockchain data without nodes. This means that customers can access blockchain data as easily as they would their own data. By joining chain data with application data, customers can get a complete picture of their users and their business.
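
As a hedged sketch of that kind of join, the query below combines the community `crypto_ethereum.transactions` public table with a hypothetical application-owned table that maps users to wallet addresses; the `my_app.users` table and its columns are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# `my-project.my_app.users(user_id, wallet_address)` is a placeholder table
# representing application data joined against public on-chain data.
sql = """
SELECT u.user_id,
       COUNT(t.hash) AS outgoing_txns,
       SUM(CAST(t.value AS BIGNUMERIC)) / 1e18 AS total_eth_sent
FROM `my-project.my_app.users` AS u
JOIN `bigquery-public-data.crypto_ethereum.transactions` AS t
  ON LOWER(t.from_address) = LOWER(u.wallet_address)
WHERE DATE(t.block_timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY u.user_id
ORDER BY total_eth_sent DESC
"""
for row in client.query(sql).result():
    print(row.user_id, row.outgoing_txns, row.total_eth_sent)
```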

Lastly, we have seen this data used in other end-user applications such as Looker and Google Sheets.

Building together 

For the past five years, we have supported the community through our public blockchain dataset offering, and we will continue to build on these efforts with a range of data options and user choice — from community-owned to Google managed high-quality alternatives and real-time data. We’re excited to work with partners who want to distribute public data for developers or monetize datasets for curated feeds and insights. We’re also here to partner with startups and data providers who want to build cloud-native distribution and syndication channels unique to Web3.

You can get started with our free tier and usage-based pricing. To gain early access to the new chains available on BigQuery, contact web3-trustedtester-data@google.com.

¹ 32 logical bytes

Built with BigQuery: Bloomreach Engagement brings power to marketers with advanced personalization

Editor’s note: The post is part of a series showcasing our partners, and their solutions, that are Built with BigQuery.

Your customers aren’t just looking for the best prices, they’re looking for better experiences, too. Personalization sets the great companies apart from the rest. But it’s a constant challenge for businesses to try and understand consumer behavior, preferences, and needs, to deliver memorable, personalized experiences. 

According to McKinsey, companies that are leaders in customer experience achieved more than double the revenue growth of “customer experience laggards.” And those that excel at personalization generate 40% more revenue from those activities than those that do not.

Data silos create missed opportunities

Without access to both real-time and historical customer data, such as purchase history, browsing behavior, in-store visits or personal preferences, marketers don’t have a singular view of their customers, making it difficult to optimize marketing strategies and identify revenue growth opportunities. 

While most marketers have the potential to be gathering valuable customer data at the many touch points along the purchasing journey, collecting, managing and analyzing large datasets requires costly infrastructure and complex data integration processes. So, these customer insights remain trapped in data silos, leaving marketers with missed opportunities and suboptimal customer experiences.

Unlock and utilize customer engagement data

Bloomreach Engagement is an omnichannel marketing automation platform that leverages data analytics and machine learning to deliver personalized and relevant content. It combines data-driven personalization, content optimization, and real-time recommendations to home in on individual customers and give them a highly tailored and engaging digital experience.

A core component of the platform is its Customer Data Engine (CDE), which is responsible for gathering and managing the customer data that’s key to building these personalized experiences. The CDE allows marketers to plug in individual data sources, existing data lakes and warehouses, and any CDP they already have, to gain a singular view of their customers that enables real-time personalization.

Engagement comes with its own real-time data storage, but for those who want to use Engagement data outside of the platform and store it within a wider data analytics stack, there’s Engagement BigQuery (EBQ). In this data integration package, the BigQuery platform is hosted and managed by Bloomreach, providing a fully featured data warehouse experience that’s pre-populated with the full Engagement data. This provides Bloomreach Engagement users with the power of Google’s BigQuery platform and ecosystem.

Specifically, it helps marketers transform customer experiences through:

Seamless integration with various data sources such as websites, CRM systems, POS terminals or loyalty programs, enabling businesses to consolidate customer data into a single customer view

Advanced marketing intelligence that provides a single source of truth for data-driven decision-making and driving business growth

Cloud-based infrastructure that eliminates the need for businesses to invest in and maintain their own data infrastructure, resulting in cost savings

Fast data processing and analytics that enable businesses to derive valuable insights from large datasets in seconds, improving time-to-insights

Marketers accessing these capabilities through Bloomreach see a range of benefits, including improved conversion rates, customer lifetime value (CLTV), customer engagement, faster time-to-insights, flexible data integration, increased revenue, and cost savings through mitigating media waste. 

Multiple Bloomreach customers are exporting data through EBQ every day. The biggest value they find is in the ability to have the data available in their Business Intelligence (BI) tools, structured in the same way as in Bloomreach Engagement. Their analyst teams are then able to connect it with other data sources within the company and create a single source of truth for the entire company. It makes Bloomreach data more accessible to the whole organization, with everyone being able to access reports and high-level data to make data-driven business decisions every day. And this impact is felt outside of marketing, as important business metrics such as average order amounts, return rates, customer churn, and more, are valuable for many other teams, e.g., customer service, logistics, sales, and operations.

Better together: Bloomreach Engagement and BigQuery

Bloomreach Engagement’s advanced personalization and marketing automation capabilities deliver hyper-personalized customer experiences powered by Google Cloud technologies, including BigQuery, smoothly integrating into the Bloomreach Engagement environment.

Bloomreach EBQ synchronizes the data collected and processed through Bloomreach Engagement with a dedicated instance of the secure, highly scalable BigQuery data warehouse that’s fully managed by Bloomreach for each customer.

Through EBQ, data scientists and analysts have access to comprehensive data about their customers and the full power of the BigQuery platform, including the built-in, easy-to-use and scalable machine learning capabilities from BigQuery ML. Through this, they are able to gain a deeper understanding of customer behavior, preferences, and engagement patterns that inform business decisions and strategies that can be activated by Bloomreach Engagement. 

Together, Bloomreach Engagement and BigQuery are empowering businesses to leverage customer data and unlock insights to improve customer experiences and boost revenue.

To see how Bloomreach Engagement works for yourself, book a demo. If you want to find more information about Bloomreach Engagement, visit the Bloomreach website to discover all the ways we can help you grow.

The Built with BigQuery advantage for ISVs and data providers

Google is helping companies like Bloomreach build innovative applications on Google’s data cloud with simplified access to technology, helpful and dedicated engineering support, and joint go-to-market programs through the Built with BigQuery initiative. Participating companies can: 

Accelerate product design and architecture through access to designated experts who can provide insight into key use cases, architectural patterns, and best practices. 

Amplify success with joint marketing programs to drive awareness, generate demand, and increase adoption.

BigQuery gives ISVs the advantage of a powerful, highly scalable data warehouse that’s integrated with Google Cloud’s open, secure, sustainable platform. And with a huge partner ecosystem and support for multi-cloud, open source tools and APIs, Google provides technology companies the portability and extensibility they need to avoid data lock-in.

Click here to learn more about Built with BigQuery.

Use a GitHub repository to manage pipelines across Data Fusion instances/namespaces

Data engineers, do you struggle to manage your data pipelines?

As data pipelines become more complex and involve multiple team members, it can be challenging to keep track of changes, collaborate effectively, and deploy pipelines to different environments in a controlled manner.

In this post, we will talk about how data engineers can manage their Cloud Data Fusion pipelines across instances and namespaces using the pipeline Git integration feature.

Background & Overview

As enterprises move towards modernizing their businesses with digital transformation, they’re faced with the challenge of adapting to the volume, velocity, and veracity of data. The only real way to address this challenge is to iterate fast, delivering data projects sustainably, predictably and quickly. To help customers achieve this goal, Cloud Data Fusion now supports iterative data pipeline design and team-based development/version control system (VCS) integration. In this post, we mainly focus on the Git integration feature: team-based development/VCS integration. To learn more about iterative data pipeline design in Cloud Data Fusion, see Edit Pipelines.

Nowadays, many developer tools integrate with version control systems, which improves development efficiency, assists CI/CD, and facilitates team collaboration. With the pipeline Git integration feature in Cloud Data Fusion, ETL developers can manage pipelines using GitHub, and implement proper development processes such as code reviews and promotion/demotion of pipelines between environments. Below we will showcase these user journeys.

Before you begin

You need a Cloud Data Fusion instance with version 6.9.1 or above.

Only GitHub is supported as the Git hosting provider.

Currently, Cloud Data Fusion only supports the personal access token (PAT) authentication mechanism. Please refer to Creating a fine-grained personal access token to create a PAT with limited permissions to read and write to the Git repository.

To connect to a GitHub server from a private Cloud Data Fusion instance, you must configure network settings for public source access. For more information, see Create a private instance and Connect to a public source from a private instance.

Link a GitHub repository

The first step is to link the GitHub repository. It can be a newly created repository or an existing one. Cloud Data Fusion lets you link a GitHub repository with a namespace. Once the repository is linked with a namespace, you can push deployed pipelines from the namespace to the repository, or pull and deploy pipelines from the repository to the namespace.

To link a GitHub repository, follow these steps:

1. In the Cloud Data Fusion web interface, click Menu > Namespace Admin.

2. On the Namespace Admin page, click the Source Control Management tab.

3. Click Link Repository and fill in the following fields:

Repository URL (required)
Default branch (optional)
Path prefix (optional)
Authentication type (optional)
Token name (required)
Token (required)
User name (optional)

To verify the configuration, click the VALIDATE button; you should see a green banner indicating a valid GitHub connection. Click the SAVE & CLOSE button to save the configuration.

4. You can always edit/delete the configuration later, as needed.

Note that unlinking the repository from a namespace will not delete the configuration present in GitHub.

Use case: Use a linked GitHub repository to manage pipelines across instances/namespaces

Imagine Bob is an IT admin at an ecommerce company. The team has already built several data pipelines. Recently, the company created a new Cloud Data Fusion instance for a newly opened company branch. Bob wants to replicate the existing pipelines from the existing instance to the new instance. In the past, Bob had to manually export and import those pipelines, which is cumbersome and error-prone. With Git integration, let’s see how Bob’s workflow has improved.

Pushing pipelines to GitHub repository

In the same Source Control Management page after the above configuration, Bob can see the configured repository as below:

To view the deployed pipelines in the current namespace, Bob clicks SYNC PIPELINES. Then, to push the DataFusionQuickStart pipeline config to the linked repository, Bob selects the PUSH TO REMOTE checkbox by the pipeline.

A dialog appears where Bob enters a commit message and clicks Push. They can see the pipeline is pushed successfully. Now Bob can switch to the GitHub repository page and check the pushed pipeline configuration JSON file:

Similarly, to see details about the pipeline that was pushed, Bob goes to the Cloud Data Fusion REMOTE PIPELINES tab.

Pulling pipelines from linked repository

To initialize a new instance with existing pipelines, Bob can link the same repository to a namespace in the new instance.

To deploy the pipelines that Bob pushed to the linked GitHub repository, Bob opens the Source Control Management page, clicks on the SYNC PIPELINES button and switches to the REMOTE PIPELINES tab. Now they can choose the pipeline of interest and click PULL TO NAMESPACE.

In the LOCAL PIPELINES tab, Bob can see the newly deployed pipeline. They could also see the new pipeline in the deployed pipeline list page:

Use case: Team-based development

Billie built several pipelines to perform data analytics in different environments, such as test, staging, and prod. One of their pipelines classifies on-time and delayed orders, based on whether shipping time takes more than two days. Due to the increased number of customer orders during Black Friday, Billie just received a change request from the business to increase the expected delivery time temporarily. 

Billie could edit the pipeline and modify it iteratively to find a proper increased time. But Billie doesn’t want to risk deploying the new changes into the prod environment without fully testing them.

Before, there was no easy way to promote the new changes from the test environment to staging, and finally to prod. With Git integration, let’s see how Billie can solve this problem.

Edit the pipeline

1. Billie opens the deployed pipeline in the Studio page

2. Billie clicks on the cog icon on the right side of the top bar. 

3. In the drop-down, Billie selects Edit, which starts the edit flow.

4. Billie makes the necessary changes in the plugin config. Once done, Billie clicks Deploy and a new version of the pipeline will be deployed. To learn more about iterative data pipeline design in Cloud Data Fusion, see Edit Pipelines.

Push the latest pipeline version to git

The namespace was already linked with a git repository and a previous version of the pipeline has already been pushed. 

Billie clicks on the cog icon on the right side of the top bar. In the drop-down, Billie selects Push to remote.

A dialog appears where Billie enters the commit message. Once confirmed, the pipeline push process begins.

On success, Billie sees a green banner at the top. Billie can now check in GitHub that the new pipeline config has been synced.

Merge the changes to main

For a proper review flow we suggest using different branches for different environments. Billie can push the changes to a development branch and then create a pull request to merge the changes to the main branch.

Pulling the latest pipeline version from GitHub

The production namespace has been linked with the main branch of the Git repository, and an older version of the pipeline already exists there.

Billie clicks on the cog icon on the right side of the top bar. In the drop-down, Billie selects Pull to namespace. The pull process takes some time to complete, as it also deploys the new version of the pipeline.

Once the pull succeeds, Billie can click on the history button in the top bar and see that a new version has been deployed.

Billie can now verify the change in the plugin config.  

In the above steps, we saw how Billie applies new pipeline changes from the test environment all the way to prod with the Git integration feature. Please visit https://cloud.google.com/data-fusion to learn more about Cloud Data Fusion features.

How to optimize your existing queries with search indexes

In October 2022, BigQuery launched search indexes and the SEARCH function, which enable using Google Standard SQL to efficiently pinpoint specific data elements in unstructured text and semi-structured data. In a previous blog post, we demonstrated the performance gains achievable by utilizing search indexes with the SEARCH function.

Today, BigQuery expands the optimization capabilities to a new set of SQL operators and functions, including the equal operator (=), the IN operator, the LIKE operator, and the STARTS_WITH function when used to compare string literals with indexed data. This means that if you have a search index on a table, and a query that compares a string literal to a value in the table, BigQuery can now use the index to find the matching rows more quickly and efficiently.

For more information about which existing functions/operators are eligible for search index optimization, refer to the Search indexed data documentation.

In this blog post, we cover the journey from creating an index to efficiently retrieving data, via a few illustrative examples, and share some measured performance gain numbers.

Take Advantage of Search Index Optimizations with Existing SQL Queries

Before this launch, the only way to take advantage of a BigQuery search index was to use the SEARCH function. The SEARCH function is powerful. In addition to column-specific search, it supports cross-column search, which is particularly helpful in cases of complex schemas with hundreds of columns, including nested ones. It also provides powerful case-sensitive and case-insensitive tokenized search semantics.

Even though the SEARCH function is very versatile and powerful, it may not always provide the exact result semantics one may be looking for. For example, consider the following table that contains a simplified access log of a file sharing system:

Table: Events

The SEARCH function allows searching for a token that appears anywhere in the table. For example, you can look for any events that involve “specialdir” with the following query:

```sql
-- Query 1
SELECT * FROM Events WHERE SEARCH(Events, "specialdir");
```

The above query will return all rows from the above table.

However, consider the case where you want a more specific result — only events related to the folder “/root/dir/specialdir”. Using the SEARCH function as in the following query will return more rows than desired:

```sql
-- Query 2
SELECT * FROM Events WHERE SEARCH(Filepath, "/root/dir/specialdir");
```

The above query also returns all rows except the one with Event ID 2. The reason is that SEARCH is a token search function: it returns true as long as the searched data contains all the searched tokens. That means SEARCH(“/root/dir/specialdir/file1.txt”, “/root/dir/specialdir”) returns true. Even using backticks to enforce case sensitivity and the exact order of tokens would not help: SEARCH(“/root/dir/specialdir/file1.txt”, “`/root/dir/specialdir`”) also returns true.

Instead, we can use the EQUAL operator to make sure that the result only contains the events related to the folder, not the files in the folder.

```sql
-- Query 3
SELECT * FROM Events WHERE Filepath = "/root/dir/specialdir";
```

Query 3 Results

With this launch, Query 3 can now utilize the search index for lower latency, fewer scanned bytes, and less slot usage.

Prefix search

At the moment, the SEARCH function does not support prefix search. With the newly added support for using indexes with the STARTS_WITH and (a limited form of) LIKE, you can run the following queries with index optimizations:

```sql
-- Query 4
SELECT * FROM Events WHERE STARTS_WITH(Filepath, "/dir/specialdir");
-- Query 5
SELECT * FROM Events WHERE Filepath LIKE "dir/specialdir%";
```

Both Query 4 and Query 5 return one row, with Event ID 2. The SEARCH function would not have been an ideal option in this case because every row contains both tokens “dir” and “specialdir”, and thus it would have returned all rows in the table.

Querying Genomic Data

In this section we share an example of retrieving information from a public dataset. BigQuery hosts a large number of public datasets, including bigquery-public-data.human_genome_variants — the 1000 Genomes dataset comprising the genomic data of roughly 2,500 individuals from 25 populations around the world. Specifically, the table 1000_genomes_phase_3_optimized_schema_variants_20150220 in the dataset contains the information of human genome variants published in phase 3 publications (https://cloud.google.com/life-sciences/docs/resources/public-datasets/1000-genomes). The table has 84,801,880 rows, with a logical size of 1.94 TB.

Suppose that a scientist aims to find information about a specific genomic variant such as rs573941896 in this cohort. The information includes the quality, the filter (PASS/FAIL), the DP (sequencing depth), and the call details (which individuals in the sample have this variant). They can issue a query as follows:

```sql
SELECT names, quality, filter, DP, call
FROM `bigquery-public-data.human_genome_variants.1000_genomes_phase_3_optimized_schema_variants_20150220`
WHERE names[safe_offset(0)] = 'rs573941896';
```

The query returns 1 row:

Without a search index on the table, the above query takes 5 secs and scans 294.7 GB, consuming 1 hr 1 min slot time:

In the next section, we demonstrate how to benefit from search indexes for this use case.

Create a Search Index for Faster String Data Search

Creating a BigQuery search index can accelerate the desired retrieval in this case. We made a copy of the public table to one of our datasets before creating the index. The copied table is now my_project.my_dataset.genome_variants.

We use the following DDL to create the search index on the names column in the table:

```sql
CREATE SEARCH INDEX my_index ON genome_variants(names);
```

The CREATE SEARCH INDEX command returns immediately, and the index will be asynchronously created in the background. The index creation progress can be tracked by querying the INFORMATION_SCHEMA.SEARCH_INDEXES view:

```sql
SELECT * FROM my_dataset.INFORMATION_SCHEMA.SEARCH_INDEXES
WHERE table_name = 'genome_variants';
```

The INFORMATION_SCHEMA.SEARCH_INDEXES view shows various metadata about the search index, including its last refresh time and coverage percentage. Note that the SEARCH function always returns correct results from all ingested data, even if some of the data isn’t indexed yet.

Once the indexing is complete, we perform the same query as above:

```sql
SELECT names, quality, filter, DP, call
FROM my_project.my_dataset.genome_variants
WHERE names[safe_offset(0)] = 'rs573941896';
```

We can see significant gains in 3 fronts:

Query latency: 725 ms (vs. 5 seconds without the search index)

Bytes processed: 60 MB (vs. 294.7 GB without the search index)

Slot time: 664 ms (vs. 1 hour 1 min without the search index).

Performance Gains When Using Search Indexes On Queries with String EQUAL

To benchmark the gains on larger and more realistic data, we ran a number of queries on Google Cloud Logging data from a Google internal test project (at 10 TB and 100 TB scales), and compared the performance with and without the index optimizations.

Rare string search

Common string search on most recent data (order-by+limit)

In our October 2022 launch, we unveiled an optimization for queries that use the SEARCH function on large partitioned tables with an ORDER BY on the partitioned column and a LIMIT clause. With this launch, the optimization is expanded to also cover queries with literal string comparisons using EQUAL, IN, STARTS_WITH and LIKE.
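
For reference, a query of that shape looks like the following sketch; the table, its `timestamp` partitioning column, and the indexed `trace_id` column are placeholders for your own partitioned log table.

```python
from google.cloud import bigquery

client = bigquery.Client()

# An indexed equality predicate plus ORDER BY on the partitioning column
# and a LIMIT: the query shape covered by this optimization.
sql = """
SELECT timestamp, severity, text_payload
FROM `my-project.my_dataset.log_table`
WHERE trace_id = 'projects/my-project/traces/abc123'
ORDER BY timestamp DESC
LIMIT 100
"""
rows = client.query(sql).result()
```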

IP address search

JSON field search

Using search index optimizations with equal (=), IN, LIKE, and STARTS_WITH is currently in preview. Please submit this allowlisting form if you would like to enable and use it for your project. More optimizations are still on the way. Stay tuned.

Related article: “Improved text analytics in BigQuery: search features now GA” covers the general availability of text indexes and search functions in BigQuery, which enable you to perform scalable text searches.

Expanding your Bigtable architecture with change streams

Engineers use Bigtable to hold vast amounts of transactional and analytical information as part of their data workflow. We are excited about the release of Bigtable change streams that will enhance these data workflows for event-based architectures and offline processing. In this article, we will cover the new feature and a few example applications that incorporate change streams.

Change streams

Change streams capture and output mutations made to Bigtable tables in real time. You can access the stream via the Data API, but we recommend using the Dataflow connector, which abstracts away the change streams Data API and the complexity of processing partitions using the Apache Beam SDK. Dataflow is a managed service that provisions and manages resources and assists with the scalability and reliability of stream data processing.

The connector allows you to focus on the business logic instead of having to worry about specific Bigtable details such as correctly tracking partitions over time and other non-functional requirements.

You can enable change streams on your table via the Console, gcloud CLI, Terraform, or the client libraries. Then you can follow our change stream quickstart to get started with development. 

Example architectures

Change streams enable you to track changes to data in real time and react quickly. You can more easily automate tasks based on data updates or add new capabilities to your application by using the data in new ways. Here are some example application architectures based on Bigtable use cases that incorporate change streams.

Data enrichment with modern AI

New AI APIs are rapidly being developed and can add great value to your application data. There are APIs for audio, images, translation and more that can enhance data for your customers. Bigtable change streams give you a clear path to enrich new data as it is added.

Here, we are transcribing and summarizing voice messages using pre-built models available in Vertex AI. We can use Bigtable to store the raw audio file as bytes, and when a new message is added, AI audio processing is kicked off via change streams. A Dataflow pipeline uses the Speech API to get a transcription of the message and the PaLM API to summarize that transcription. These can be written back to Bigtable so users can access the message in their preferred format.

Full-text search and autocomplete

Full-text search and autocomplete are common use cases for many applications from online retail to streaming media platforms. For this scenario, we have a music platform which is adding full-text search functionality to the music library, by indexing the album names, song titles and artists in Elasticsearch.

When new songs are added, the changes are captured by a pipeline in Dataflow. It will extract the data to be indexed and write it to Elasticsearch. This keeps the index up to date and users can query it via a search service hosted on Cloud Functions.

Event-based notifications

Processing events and alerting customers in real time is a valuable tool for application development. You can customize your architecture for pop-ups, push notifications, emails, texts, etc. Here is one example of what a logistics and shipping company could do.

Logistics and shipping companies have millions of packages traveling around the world at any given moment. They need to keep track of where each package is when it arrives at each new shipment center, so it can continue to the next location. You might be awaiting a fresh pair of shoes or maybe a hospital needs to know when their next shipment of gloves is arriving, so customers can sign up for either email or text notifications about their package status.

This is an event-based architecture that works great with Bigtable change streams. We have real-time data about the packages coming from shipment centers being written to Bigtable. The change stream is captured by our alerting system in Dataflow that uses APIs like SendGrid and Twilio for easy email and text notifications respectively. 

Real-time analytics

For any application using Bigtable, you will likely have heaps of data. Change streams can help you unlock real-time analytics use cases by allowing you to update metrics in small increments as the data arrives instead of large infrequent batch jobs. You could create a windowing scheme for regular intervals, then run aggregation queries on the data in the window and then write those results to another table for analytics and dashboarding.
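
Here is a minimal Apache Beam (Python) sketch of that windowing scheme. It assumes an upstream step has already decoded change stream records into `(storefront_id, 1)` elements (the change stream read itself uses the Dataflow connector mentioned earlier), and it only prints the per-window counts; writing them back to an analytics table is omitted.

```python
import apache_beam as beam
from apache_beam.transforms import window

def count_events_per_storefront(events):
    """events: PCollection of (storefront_id, 1) tuples carrying event timestamps."""
    return (
        events
        # Group incoming changes into fixed five-minute windows.
        | "FiveMinuteWindows" >> beam.WindowInto(window.FixedWindows(5 * 60))
        # Aggregate per storefront within each window.
        | "CountPerStorefront" >> beam.CombinePerKey(sum)
        # Placeholder sink: print instead of writing to an analytics table.
        | "Print" >> beam.Map(print)
    )
```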

This architecture shows a company that offers a SaaS platform for online retail and wants to show their customers the performance metrics for their online storefronts, like the number of visitors, conversion rates, abandoned carts, and most viewed items. They write that data to Bigtable and, every five minutes, aggregate it by the dimensions they’d like their users to slice and dice by, then write that to an analytics table. They can create real-time dashboards using libraries like D3.js on top of data from the analytics table and provide better insights to their users.

You can use our tutorial on collecting song analytics data to get started.

Next steps

Now you are familiar with new ways to use Bigtable in event-driven architectures and managing your data for analytics with change streams. Here are a few next steps for you to learn more and get started with this new feature:

Try change streams with our quickstart

Read the change streams documentation

Process a change stream pipeline following our tutorial

View the example code on GitHub

Introducing dynamic topic destinations in Pub/Sub using Dataflow

Pub/Sub plays a key role in powering data streaming insights across a number of verticals. With the increasing adoption of streaming analytics, we’re seeing companies with more complex and nuanced use cases. One particular Pub/Sub use case is ingesting data from various teams and data sources, and sorting that data into multiple topics based on particular message attributes while the data is in flight. Today, we’re introducing a new feature in Dataflow called dynamic destinations for Pub/Sub topics to help with this.

Today, when you want to publish messages to a single topic, you create an instance of the Beam Pub/Sub sink. When publishing to an additional topic, you repeat this pattern, or you can have multiple publishers publishing to a single topic. However, when you want to start publishing to tens or even hundreds of topics, using that many publisher clients becomes unwieldy and expensive. Dynamic destinations allow you to use a single publisher client and dictate which messages go to which topics.

There are two methods for using dynamic destinations: provide a function to extract the destination from an attribute in the Pub/Sub message, or put the topic in a PubsubMessage class and write to it using the writeMessagesDynamic method. Please note that the Pub/Sub topic needs to already exist before using dynamic destinations. Below are two examples of assigning a message to a topic based on the country in the message:

Function Extract:

```java
avros.apply(PubsubIO.writeAvros(MyType.class)
    .to((ValueInSingleWindow<Event> quote) -> {
      String country = quote.getCountry();
      return "projects/myproject/topics/events_" + country;
    }));
```

writeMessagesDynamic method:

```java
events.apply(MapElements.into(new TypeDescriptor<PubsubMessage>() {})
        .via(e -> new PubsubMessage(
            e.toByteString(), Collections.emptyMap()).withTopic(e.getCountry())))
    .apply(PubsubIO.writeMessagesDynamic());
```

Example: Secured Topics

With dynamic destinations, you can also start to enforce data security further up the pipeline. If you write all your data to a single topic, you can always filter the data when writing out to your final data sink, but that opens up the possibility of people having access to data they aren’t supposed to see. When you can delineate data based on message attributes, you can keep each team’s data restricted to its dedicated topic and then control end-user access to that topic. As an example, a gaming company can ingest data from Marketing, Player Analytics, and IT teams into a single pipeline. Each team tags its data with the appropriate attribute value, which in turn assigns the data to the appropriate topic, and each team can view only the information that’s relevant to it. Security concerns and chargeback capabilities become much easier to tackle with this new feature.

To get started with dynamic destinations, ensure that you’ve upgraded to the Apache Beam 2.50 release or later, and check out the PubsubIO documentation for dynamic destinations.

Applying Generative AI to product design with BigQuery DataFrames

For any company, naming a product or service is complex and time-consuming. This process is particularly challenging in the pharmaceutical industry. Typically, companies start by brainstorming and researching thousands of names. They must ensure that the names are unique, compliant with regulations, and easy to pronounce and remember. With so many factors to consider, multiplied across an entire product catalog, the process must be designed to scale.   

In this blog post, we will show how the power of data analytics and generative AI can help unleash the creative process, and accelerate testing. We will provide a step-by-step guide on how to generate potential drug names using BigQuery DataFrames. Please note that this blog post simply illustrates the concepts and does not address any regulatory requirements.

Background

Our goal in this demonstration is to generate a set of 10 brand names for an imaginary generic drug called “Entropofloxacin”, to be reviewed by a panel of experts. Drugs with the suffix -floxacin belong to the fluoroquinolone class of antibiotics.

We’ll use the text-bison model, a large language model that has been trained on a massive dataset of text and code. It can generate text, translate languages, write different kinds of creative content, and answer all kinds of questions.

We will also provide these indications & usage to the model: “Entropofloxacin is a fluoroquinolone antibiotic that is used to treat a variety of bacterial infections, including: pneumonia, streptococcus infections, salmonella infections, escherichia coli infections, and pseudomonas aeruginosa infections. It is taken by mouth or by injection. The dosage and frequency of administration will vary depending on the type of infection being treated. It should be taken for the full course of treatment, even if symptoms improve after a few days. Stopping the medication early may increase the risk of the infection coming back.”

Getting started

Throughout this blog post, we will use code from this Drug Name Generation notebook, in case you want to follow along. We will highlight the key steps here and leave some details to the notebook.

We will be using BigQuery DataFrames to perform generative AI operations. It’s a brand new way to access BigQuery, providing a DataFrame interface that Python developers and data scientists are familiar with. It brings compute capabilities directly to your data in the Cloud, enabling you to process massive datasets. BigQuery DataFrames directly supports a wide variety of ML use cases, which we will showcase here.
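
As a minimal setup sketch (the project ID and location below are placeholders, and the notebook walks through this configuration in more detail), a BigQuery DataFrames session needs little more than an import and a couple of options:

import bigframes.pandas as bpd

# Placeholder project and location; substitute your own values.
bpd.options.bigquery.project = "your-project-id"
bpd.options.bigquery.location = "US"

# Operations on this dataframe are pushed down to BigQuery rather than
# executed locally, so even very large tables stay in the cloud.
df_preview = bpd.read_gbq("bigquery-public-data.fda_drug.drug_label")
print(df_preview.shape)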

Zero-shot learning

Let’s start with a base case, where we simply ask the model a question, through a prompt. No examples, no chains, just a simple request and response scenario.

First, we will need to create a prompt template. You will notice that the prompt guides the model toward the precise outcomes we’re looking for. Also, it is parameterized, so that we can easily update the parameters to try out different scenarios and settings.

zero_shot_prompt = f"""Provide {NUM_NAMES} unique and modern brand names in Markdown bullet point format. Do not provide any additional explanation.

Be creative with the brand names. Don't use English words directly; use variants or invented words.

The generic name is: {GENERIC_NAME}

The indications and usage are: {USAGE}."""

print(zero_shot_prompt)

We can submit our prompt to the model using the `model.predict()` function, which takes a dataframe as input. For our simple scenario of one string in and one string out, we've created a helper function that wraps the input string in a dataframe and extracts the string value from the returned dataframe. It includes an optional temperature parameter to control the degree of randomness, which is helpful in a creative context.

def predict(prompt: str, temperature: float = TEMPERATURE) -> str:
    # Create dataframe
    input = bigframes.pandas.DataFrame(
        {
            "prompt": [prompt],
        }
    )

    # Return response
    return model.predict(input, temperature=temperature).ml_generate_text_llm_result.iloc[0]

To get a response, we first need to create a model reference using a BigQuery connection. Then we can pass the prompt to our helper method.

# Get BigFrames session
session = bigframes.pandas.get_global_session()

# Define the model
model = PaLM2TextGenerator(session=session, connection_name=connection_name)

# Invoke LLM with prompt
response = predict(zero_shot_prompt)

# Print results as Markdown
Markdown(response)

And now, the exciting part. Here are several responses we get:

Xylocin
Zervox
Zarox
Zeroxy
Xerozid

These names might work! You might notice that the names are very similar. Well, that might not actually be a problem. According to “The art and science of naming drugs”: “The letters “X,” “Y” and “Z” often appear in brand names because they give a drug a high-tech, sciency sounding name (Xanax, Xyrem, Zosyn). Conversely, “H,” “J” and “W” are sometimes avoided because they are difficult to pronounce in some languages.”

Few-shot learning

Next, let’s try expanding on this base case by providing a few examples. This is referred to as few-shot learning, in which the examples provide a little more context to help shape the answer. It’s like providing some training data without retraining the whole model.

Fortunately, there is a public BigQuery FDA dataset available at bigquery-public-data.fda_drug that can help us with this task!

We can easily extract a few useful columns from the dataset into a dataframe using BigFrames:

df = bpd.read_gbq("bigquery-public-data.fda_drug.drug_label",
                  col_order=["openfda_generic_name",
                             "openfda_brand_name",
                             "indications_and_usage"])

And it’s straightforward to sample the dataset for a few useful examples. Let’s run this code and peek at what we want to include in our prompt.

# Take a sample and convert to a Pandas dataframe for local usage.
df_examples = df.sample(NUM_EXAMPLES).to_pandas()

df_examples

We can create a more sophisticated prompt with 3 components (assembled programmatically in the sketch after this list):

General instructions (e.g. generate 𝑛 brand names)

Multiple examples generated above

Information about the drug we’d like to generate a name for (entropofloxacin)
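
Below is a hedged sketch of one way to assemble such a prompt from the sampled examples; the exact template used in the notebook may differ slightly.

# Assemble the few-shot block from the sampled examples (a sketch; the
# notebook's template may differ). Usage text is truncated to keep the
# prompt short.
few_shot_examples = "\n\n".join(
    f"Generic name: {row['openfda_generic_name']}\n"
    f"Usage: {row['indications_and_usage'][:300]}\n"
    f"Brand name: {row['openfda_brand_name']}"
    for _, row in df_examples.iterrows()
)

few_shot_prompt = f"""Provide {NUM_NAMES} unique and modern brand names in Markdown bullet point format, related to the drug at the bottom of this prompt.

Be creative with the brand names. Don't use English words directly; use variants or invented words.

First, we will provide {NUM_EXAMPLES} examples to help with your thought process.

Then, we will provide the generic name and usage for the drug we'd like you to generate brand names for.

{few_shot_examples}

Generic name: {GENERIC_NAME}
Usage: {USAGE}
Brand names:"""

response = predict(few_shot_prompt)
Markdown(response)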

Our prompt will now look like this, truncating some sections for readability:

Provide 10 unique and modern brand names in Markdown bullet point format, related to the drug at the bottom of this prompt.

Be creative with the brand names. Don’t use English words directly; use variants or invented words.

First, we will provide 3 examples to help with your thought process.

Then, we will provide the generic name and usage for the drug we’d like you to generate brand names for.
Generic name: BUPRENORPHINE HYDROCHLORIDE
Usage: 1 INDICATIONS AND USAGE BELBUCA is indicated for the management of pain…
Brand name: Belbuca

Generic name: DROSPIRENONE/ETHINYL ESTRADIOL/LEVOMEFOLATE CALCIUM AND LEVOMEFOLATE CALCIUM
Usage: 1 INDICATIONS AND USAGE Safyral is an estrogen/progestin COC containing a folate…
Brand name: Safyral

Generic name: FLUOCINOLONE ACETONIDE
Usage: INDICATIONS AND USAGE SYNALAR® Solution is indicated for the relief of the inflammatory and pruritic manifestations of corticosteroid-responsive dermatoses.
Brand name: Synalar

Generic name: Entropofloxacin
Usage: Entropofloxacin is a fluoroquinolone antibiotic that is used to treat a variety of bacterial…
Brand names:

With this prompt, we see a much different set of brand names generated. With the examples included, we see that the model is anchored on the generic name.

Entrol
Entromycin
Entrozol
Entroflox
Entroxil
Entrosyn

Bulk generation

Now that we’ve learned the fundamentals of prompts & responses with BigQuery DataFrames, let’s explore generating names at scale. How can you generate candidate names when you have thousands of products? We can perform multiple operations in the Cloud without bringing the data into local memory within the notebook.

Let’s start with querying for drugs that don’t have a brand name in the FDA dataset. Technically, we are querying for drugs where the brand name and generic name match.

# Query 3 columns of interest from drug label dataset
df_missing = bpd.read_gbq("bigquery-public-data.fda_drug.drug_label",
                          col_order=["openfda_generic_name",
                                     "openfda_brand_name",
                                     "indications_and_usage"])

# Exclude any rows with missing data
df_missing = df_missing.dropna()

# Include rows in which openfda_brand_name equals openfda_generic_name
df_missing = df_missing[
    df_missing["openfda_generic_name"] == df_missing["openfda_brand_name"]]

We’ll pass a whole dataframe column of prompts to BigFrames instead of a single string prompt. Let’s look at how we could construct this column.

df_missing["prompt"] = (
    "Provide a unique and modern brand name related to this pharmaceutical drug. "
    + "Don't use English words directly; use variants or invented words. The generic name is: "
    + df_missing["openfda_generic_name"]
    + ". The indications and usage are: "
    + df_missing["indications_and_usage"]
    + "."
)

Next, let’s create a new helper function for batch prediction. We’ll use the column as-is without any transformation from/to strings.

def batch_predict(
    input: bigframes.pandas.DataFrame, temperature: float = TEMPERATURE
) -> bigframes.pandas.DataFrame:
    return model.predict(input, temperature=temperature).ml_generate_text_llm_result


response = batch_predict(df_missing["prompt"])

After the operation completes, let’s take a look at one of the generated brand names for “alcohol free hand sanitizer”:

**Sani-Tize**

This is a modern and unique brand name for an alcohol-free hand sanitizer. It is derived from the words “sanitize” and “tize”, which give it a scientific and technical feel. The name is also easy to spell and pronounce, making it memorable and easy to market.
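
As a rough sketch of how the results could be reviewed (assuming the returned column aligns by index with the input prompts), the generated names can be attached back to the dataframe:

# Hedged sketch: attach the generated names back to the source rows.
# Assumes `response` preserves the index of df_missing["prompt"].
df_missing["suggested_brand_name"] = response

# Peek at a few generic-name / suggestion pairs; only this small slice is
# brought into local memory.
df_missing[["openfda_generic_name", "suggested_brand_name"]].head(5).to_pandas()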

In this scenario, we saw that Generative AI is a powerful tool for accelerating the branding process. While we walked through a pharmaceutical drug name scenario, these concepts could be applied to any industry. We also saw that BigQuery puts all of the tools in one place for multiple prompting styles, all with an intuitive DataFrame interface.

Enjoy applying these creative tools to your next project! For more information, feel free to check out the quickstart documentation.

Source : Data Analytics Read More

Reducing BigQuery physical storage cost with new billing model

Reducing BigQuery physical storage cost with new billing model

Introduction

Google Cloud BigQuery is a petabyte-scale data warehouse, renowned for its efficient Structured Query Language (SQL) capabilities. While BigQuery offers exceptional performance, cost optimization remains a critical aspect for any cloud-based service. 

This blog post focuses on reducing BigQuery storage costs through the newly introduced Physical Bytes Storage Billing (PBSB), as opposed to Logical Bytes Storage Billing (LBSB). PBSB, previously in public preview, reached general availability on July 5, 2023. We work with organizations actively exploring this feature to optimize storage costs. 

Drawing from the extensive experience Google Cloud has built with Deloitte in assisting clients, this blog post will share valuable insights and recommendations to help customers smoothly migrate to the PBSB model when designing storage to support BigQuery implementations. 

Design challenges

In today’s business landscape, we see large organizations accumulating extensive amounts of data, often measured in petabytes, within Google Cloud BigQuery data warehouses. This data is crucial for performing thorough business analysis and extracting valuable insights. However, the associated storage costs can be significant, sometimes exceeding millions of dollars annually. Consequently, minimizing these storage expenses has emerged as a challenge for many of our clients.

Solution

By default, when you create a dataset in BigQuery, the unit of consumption for storage billing is logical bytes. With the introduction of physical bytes storage billing, customers now have the option to choose that model instead, taking advantage of the cost savings offered by physical-byte compression without compromising data accessibility or query performance.

To address the challenge of high storage costs on Google Cloud BigQuery data warehouses, we implement a two-step approach that leverages the newly introduced physical bytes storage billing option:

First step: Conduct a BigQuery cost-benefit assessment between PBSB and LBSB. Google has provided an example of how to calculate the price difference at the dataset level using this query.
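
For orientation, here is a simplified, hypothetical sketch of that kind of assessment, written with the BigQuery Python client; it assumes the `region-us` INFORMATION_SCHEMA qualifier and is not the exact query from Google's documentation.

from google.cloud import bigquery

client = bigquery.Client()

# Simplified sketch (not the documented query): compare logical vs. physical
# bytes per dataset, estimate the compression ratio, and surface time travel
# usage. The region qualifier is an assumption; adjust it to where your
# datasets live.
assessment_sql = """
SELECT
  table_schema AS dataset_name,
  SUM(total_logical_bytes) / POW(1024, 3) AS logical_gib,
  SUM(total_physical_bytes) / POW(1024, 3) AS physical_gib,
  SUM(time_travel_physical_bytes) / POW(1024, 3) AS time_travel_gib,
  SAFE_DIVIDE(SUM(total_logical_bytes), SUM(total_physical_bytes)) AS compression_ratio
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
WHERE total_logical_bytes > 0
  AND total_physical_bytes > 0
GROUP BY dataset_name
ORDER BY compression_ratio DESC
"""

for row in client.query(assessment_sql).result():
    print(f"{row.dataset_name}: ratio={row.compression_ratio:.1f}, "
          f"logical={row.logical_gib:,.0f} GiB, physical={row.physical_gib:,.0f} GiB, "
          f"time travel={row.time_travel_gib:,.0f} GiB")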

Running the query from Google's example produces the details shown in the table below:

In this example, the first few datasets show impressive active and long-term compression ratios, ranging from 16 to 25. Because physical storage is billed at twice the logical rate, a compression ratio of about 16 translates into roughly an 8x storage cost reduction, bringing monthly costs down from $70,058 to $8,558. 

However, for the last dataset in this test, close to 11 TB, or 96% of its active physical storage, is consumed by time travel, making it a poor fit for PBSB. 

During the assessment, we observed the presence of “_session” and “_scripts” rows, which may impede CSV file downloads due to the 10 MB limit. The “_session” objects correspond to temporary tables generated within BigQuery sessions, while the “_scripts” objects pertain to intermediate objects produced as part of stored procedures or multi-statement queries. These objects are billed based on logical bytes and cannot be converted to physical bytes by users. To address this, customers can disregard them by modifying the query using this clause: 

where total_logical_bytes > 0 AND total_physical_bytes > 0 AND table_schema not like '_scripts%' AND table_schema not like '_session%'. [2, 7, 8]

Second step: Switch suitable datasets to physical storage billing. Customers should remember that they cannot enroll datasets in physical storage billing until all flat-rate commitments for their organization are no longer active. 

The simplest way to update the billing model for a dataset is to use the BigQuery (BQ) update command and set the storage_billing_model flag to PHYSICAL.

For example: bq update -d --storage_billing_model=PHYSICAL PROJECT_ID:DATASET_ID

After changing the billing model for a dataset, it takes 24 hours for the change to take effect. Another factor to consider when optimizing storage cost is time travel.

Time travel allows customers to query updated or deleted data, restore deleted tables, or restore expired tables. The default time travel window covers the past seven days, and you can configure it using the BQ command-line tool to balance storage costs with your data retention needs. Here’s an example: 

bq update --dataset=true --max_time_travel_hours=48 PROJECT_ID:DATASET_NAME

This command sets the time travel window to 48 hours (2 days) for the specified dataset.

“The --max_time_travel_hours value must be an integer expressed in multiples of 24 (48, 72, 96, 120, 144, 168) between 48 (2 days) and 168 (7 days).” [5]

Considerations

Switching to the physical storage billing model has some considerations. The table below shows the pricing:

Based on the price list in the table above, customers should consider the following:

Based on Google Cloud BigQuery Storage Pricing, the unit price of physical storage billing is twice that of logical storage billing.

If the compression ratio is less than 2, customers will not benefit from PBSB for their datasets (see the break-even sketch after this list).

In LBSB, customers are not billed for bytes used for time travel storage; in PBSB, they are billed for those bytes, and the same is true for fail-safe storage.

To ensure an accurate assessment, it is advisable to evaluate time travel storage utilization once the BigQuery workload has reached a stable state and established a predictable pattern. This is important because the bytes used for time travel storage can vary over time.

Customers have the flexibility to configure the time travel window according to specific data retention requirements while considering storage costs. For instance, customers can customize the time travel window, adjusting it from the default 7 days to a shorter duration such as 2 days.

A fail-safe period of 7 days is enforced and cannot be modified. On top of that, customers can configure the time travel window from 2 to 7 days according to their preference, giving a combined retention window (time travel plus fail-safe) of 9 to 14 days. If no action is taken, the default combined window remains 14 days (7 days of time travel plus 7 days of fail-safe).

Switching to physical bytes storage billing is a one-way operation. Once this change is made, customers cannot switch back to logical bytes storage billing.
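
Because the physical rate is twice the logical rate, the break-even point is a compression ratio of 2. The sketch below illustrates the arithmetic with assumed US list prices per GiB per month; confirm current prices on the BigQuery pricing page, and note that PBSB additionally bills time travel and fail-safe bytes, which are ignored here.

# Assumed (illustrative) US list prices per GiB per month for active storage;
# confirm against the current BigQuery pricing page.
ACTIVE_LOGICAL_PRICE = 0.02
ACTIVE_PHYSICAL_PRICE = 0.04  # twice the logical rate

def monthly_active_storage_cost(logical_gib: float, physical_gib: float) -> dict:
    """Compare active-storage cost under LBSB vs. PBSB for one dataset.

    Time travel and fail-safe bytes (billed only under PBSB) are ignored
    in this simplified comparison.
    """
    lbsb = logical_gib * ACTIVE_LOGICAL_PRICE
    pbsb = physical_gib * ACTIVE_PHYSICAL_PRICE
    return {
        "lbsb_usd": round(lbsb, 2),
        "pbsb_usd": round(pbsb, 2),
        "compression_ratio": round(logical_gib / physical_gib, 1),
        "pbsb_cheaper": pbsb < lbsb,  # true only when the ratio exceeds 2
    }

# A dataset compressing 16x clearly benefits; one compressing 1.5x does not.
print(monthly_active_storage_cost(logical_gib=160_000, physical_gib=10_000))
print(monthly_active_storage_cost(logical_gib=15_000, physical_gib=10_000))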

Let’s go building

Adopting BigQuery Physical Bytes Storage Billing (PBSB) presents a substantial opportunity for reducing storage costs within BigQuery. The process of assessing and transitioning to this cost-effective billing model is straightforward, allowing customers to maximize the benefits. We have provided comprehensive guidance on conducting assessments and making a seamless transition to the PBSB model. In our upcoming blog post, we will delve into leveraging the newly introduced BigQuery editions to further optimize BigQuery analysis costs from a compute perspective. Wishing you a productive and successful cost optimization journey! And you can always reach out to us here for support in your cloud journey.

Special thanks to Dylan Myers (Google) and Enlai Wang (Deloitte) for contributing to this article.

Source : Data Analytics Read More