Standardize your cloud billing data with the new FOCUS BigQuery view

Businesses today often rely on multiple cloud providers, making it crucial to have a unified view of their cloud spend. This is where the FinOps Open Cost and Usage Specification (FOCUS) comes in. Today, we’re excited to announce a new BigQuery view that leverages the recent FOCUS v1.0 Preview to help simplify cloud cost management across clouds.

What is FOCUS?

The FinOps Open Cost and Usage Specification (FOCUS) aims to deliver consistency and standardization across cloud billing data by unifying cloud cost and usage data into one common schema. Before FOCUS, there was no industry-standard way to normalize key cloud cost and usage measures across multiple cloud service providers (CSPs), making it challenging to understand how billing costs, credits, usage, and metrics map from one cloud provider to another (see the FinOps FAQs for more details).

FOCUS helps FinOps practitioners perform fundamental FinOps capabilities using a generic set of instructions and unified schema, regardless of the origin of the dataset. FOCUS is a living, breathing specification that is constantly being iterated on and improved by the Working Group, which consists of FinOps practitioners, CSP leaders, Software as a Service (SaaS) providers, and more. The FOCUS specification v1.0 Preview was launched in November 2023, paving the way for more efficient and transparent cloud cost management. If you’d like to read more or join the Working Group, here is a link to the FOCUS website.

Introducing a BigQuery view for FOCUS v1.0 Preview

Historically, we’ve offered three ways to export cost and usage-related Cloud Billing data to BigQuery: Standard Billing Export, Detailed Billing Export (resource-level data and price fields to join with Price Export table), and Price Export. Today, we are introducing a new BigQuery view that transforms this data so that it aligns with the data attributes and metrics defined in the FOCUS v1.0 Preview.

A BigQuery view is a virtual table that represents the results of a SQL query. This view is defined by a base query (see below for how to get access) that maps Google Cloud billing data into the display names, format, and behavior of the FOCUS Preview dimensions and metrics. BigQuery views are useful because the queryable virtual table contains only data from the tables and fields specified in the base query that defines the view. And because BigQuery views are virtual tables, they incur no additional data storage charges if you are already using Billing Export to BigQuery.

You should spend time optimizing costs, not mapping billing terminology across Cloud Providers. With the FOCUS BigQuery view, you can now…

View and query Google Cloud billing data that is adapted to the FOCUS specification
Use the BigQuery view as a data source for visualization tools like Looker Studio
Analyze your Google Cloud costs alongside data from other providers using the common FOCUS format

How it works

The FOCUS BigQuery view acts as a virtual table that sits on top of your existing Cloud Billing data. To use this feature, you will need Detailed Billing Export and Price Exports enabled. Follow these instructions to set up your billing exports to BigQuery. The FOCUS BigQuery view uses a base SQL query to map your Cloud Billing data into the FOCUS schema, presenting it in the specified format. This allows you to query and analyze your data as if it were native to FOCUS, making it easier to compare costs across different cloud providers.
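
To make the mapping concrete, here is a simplified, illustrative sketch of the kind of view the base query produces. This is not the official base query; the export table name and the handful of FOCUS-style output columns are assumptions for illustration only.

CREATE OR REPLACE VIEW `my_project.billing.focus_view` AS
SELECT
  'Google Cloud'       AS ProviderName,        -- constant for this provider
  service.description  AS ServiceName,
  sku.description      AS ChargeDescription,
  usage_start_time     AS ChargePeriodStart,
  usage_end_time       AS ChargePeriodEnd,
  currency             AS BillingCurrency,
  cost                 AS BilledCost
FROM `my_project.billing_export.gcp_billing_export_resource_v1_XXXXXX`;

Because it is a view, each query reads the underlying export table in place, so the FOCUS-shaped data stays as fresh as the export itself.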

We’ve made it easy to leverage the power of FOCUS with a step-by-step guide. To view this sample SQL query and follow the step-by-step guide, sign up here.

Looking ahead: A commitment to open standards and collaboration

At Google Cloud, open standards are part of our DNA. We were a founding member of the FinOps Foundation, the first CSP to join the Open Billing Standards Working Group, and a core contributor to the v0.5 and v1.0 specifications. As a strong advocate for open billing standards, we believe customers deserve a glimpse of what’s possible with Google Cloud Billing data aligned to the latest FOCUS specification.

We look forward to shaping the future of open billing standards alongside our customers, FinOps practitioners in the industry, the FinOps Foundation, CSPs, SaaS providers, and more. Get a unified view of your cloud costs today with the FOCUS BigQuery view. Sign up here to learn more and get started.

Serverless data architecture for trade surveillance at Deutsche Bank

Ensuring compliance with regulatory requirements is crucial for every bank’s business. While financial regulation is a broad area, detecting and preventing market manipulation and abuse is absolutely mission-critical for an investment bank of Deutsche Bank’s size. This is called trade surveillance.

At Deutsche Bank, the Compliance Technology division is responsible for the technical implementation of this control function. To do this, the Compliance Technology team retrieves data from various operational systems in the front office and performs scenario calculations to monitor the trades executed by all of the bank’s business lines. If any suspicious patterns are detected, a compliance officer receives an internal alert to investigate the issue for resolution.

The input data comes from a broad range of systems, but the most relevant are market, trade, and reference data. Historically, provisioning data for compliance technology applications from front-office systems required the team to copy data between, and often even within, many different analytical systems, leading to data quality and lineage issues as well as increased architectural complexity. At the same time, executing trade surveillance scenarios includes processing large volumes of data, which requires a solution that can store and process all the data using distributed compute frameworks like Apache Spark.

A new architectural approach

Google Cloud can help solve the complex issues of processing and sharing data at scale across a large organization with its comprehensive data analytics ecosystem of products and services. BigQuery, Google Cloud’s serverless data warehouse, and Dataproc, a managed service for running Apache Spark workloads, are well positioned to support data-heavy business use cases, such as trade surveillance.

The Compliance Technology team decided to leverage these managed services from Google Cloud in their new architecture for trade surveillance. In the new architecture, the operational front-office systems act as publishers that present their data in BigQuery tables. This includes trade, market, and reference data that is now available in BigQuery to various data consumers, including the Trade Surveillance application. As the Compliance Technology team doesn’t need all the data published from the front-office systems, they can create multiple views derived only from the input data that contains the information required to execute trade surveillance scenarios.

Scenario execution involves running trade surveillance business logic in the form of various data transformations in BigQuery, Spark on Dataproc, and other applications. This business logic is where suspicious trading patterns, indicating market abuse or market manipulation, can be detected. Suspicious cases are written to output BigQuery tables and then processed through research and investigation workflows, where compliance officers perform investigations, detect potential false positives, or file a Suspicious Activity Report with the regulator if the suspicious case indicates a compliance violation.
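
The post does not share Deutsche Bank’s actual scenario logic, so purely as an illustration, here is a minimal sketch of the kind of detection that can be expressed directly in BigQuery SQL; the table names, column names, and threshold are hypothetical.

INSERT INTO `surveillance.alerts` (trade_id, trader_id, alert_type, alert_ts)
SELECT
  t.trade_id,
  t.trader_id,
  'UNUSUAL_ORDER_SIZE' AS alert_type,
  CURRENT_TIMESTAMP()  AS alert_ts
FROM `surveillance.trades_view` AS t
JOIN (
  SELECT trader_id, AVG(quantity) AS avg_qty
  FROM `surveillance.trades_view`
  WHERE trade_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
  GROUP BY trader_id
) AS h USING (trader_id)
WHERE t.trade_date = CURRENT_DATE()
  AND t.quantity > 10 * h.avg_qty;  -- flag trades more than 10x the trader's 30-day average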

Surveillance alerts are also retained and persistently stored to measure how effective the detection is and to improve the rate at which false positives are identified. These calculations are run in Dataproc using Spark and in BigQuery using SQL. They are performed periodically and fed back into the trade surveillance scenario execution to further improve the surveillance mechanisms. Orchestration of the ETL processes that derive data for executing trade surveillance scenarios and effectiveness calibrations is done through Cloud Composer, a managed service for workflow orchestration using Apache Airflow.
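
As a similarly hypothetical sketch, an effectiveness calculation of this kind could be expressed in BigQuery SQL as a false-positive rate per scenario (table and column names are again illustrative):

SELECT
  scenario_id,
  COUNTIF(disposition = 'FALSE_POSITIVE') / COUNT(*) AS false_positive_rate,
  COUNT(*) AS total_alerts
FROM `surveillance.alert_outcomes`
WHERE alert_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
GROUP BY scenario_id
ORDER BY false_positive_rate DESC;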

Here is a simplified view of what the new architecture looks like:

This is how the Compliance Technology team at Deutsche Bank describes the new architecture: 

“This new architecture approach gives us agility and elasticity to roll out new changes and behaviors much faster based on market trends and new emerging risks as e.g. cross product market manipulation is a hot topic our industry is trying to address in line with regulator’s expectations.”
– Asis Mohanty, Global Head, Trade Surveillance, Unauthorized Principal Trading Activity Technology, Deutsche Bank AG

“The serverless BigQuery based architecture enabled Compliance Technology to simplify the sharing of data between the front- and back-office whilst having a zero-data copy approach and aligning with the strategic data architecture.” 
– Puspendra Kumar, Domain Architect, Compliance Technology, Deutsche Bank AG

The benefits of a serverless data architecture

As the architecture above shows, trade surveillance requires various input sources of data. A major benefit of leveraging BigQuery for sourcing this data is that there is no need to copy data to make it available to data consumers in Deutsche Bank. A simpler architecture improves data quality and lowers cost by minimizing the number of hops the data needs to take.

Data doesn’t need to be copied because BigQuery has no separate instances or clusters. Instead, every table is accessible to a data consumer as long as the consumer app has the right permissions and references the table’s fully qualified name in its queries (i.e., the Google Cloud project ID, the dataset name, and the table name). Thus, various consumers can access the data directly from their own Google Cloud projects without having to copy it and physically persist it there.
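
For example, a consumer can read a producer’s table in place simply by referencing its fully qualified name; the project, dataset, and table names below are illustrative:

SELECT trade_id, instrument, quantity, price, trade_ts
FROM `front-office-project.trading.trades`   -- producer's table, read in place, no copy
WHERE DATE(trade_ts) = CURRENT_DATE();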

For the Compliance Technology team to get the required input data to execute trade surveillance scenarios, they simply query the BigQuery views with the input data and the tables containing the data derived from the compliance-specific ETLs. This eliminates the need to copy data, making the data more reliable and the architecture more resilient thanks to fewer data hops. Above all, this zero-copy approach enables data consumers in other teams across the bank, beyond trade surveillance, to use market, trade, and reference data by following the same pattern in BigQuery.

In addition, BigQuery offers another advantage: it is closely integrated with other Google Cloud services, such as Dataproc and Cloud Composer, so orchestrating ETLs is seamless, leveraging Apache Airflow’s out-of-the-box operators for BigQuery. There is also no need to copy data in order to process it from BigQuery using Spark. Instead, an out-of-the-box connector allows data to be read via the BigQuery Storage API, which is optimized for streaming large volumes of data directly to Dataproc workers in parallel, ensuring fast processing.

Finally, storing data in BigQuery enables data producers to leverage Google Cloud’s native, out-of-the-box tooling for ensuring data quality, such as Dataplex automatic data quality. With this service, it’s possible to configure rules for data freshness, accuracy, uniqueness, completeness, timeliness, and various other dimensions, and then simply execute them against the data stored in BigQuery. This happens in a fully serverless and automated fashion, without the need to provision any infrastructure for rule execution and data quality enforcement. As a result, the Compliance Technology team can ensure that the data they receive from front-office systems complies with the required data quality standards, adding to the value of the new architecture.

Because the new architecture leverages integrated, serverless data analytics products and managed services from Google Cloud, the Compliance Technology team can now fully focus on the business logic of their Trade Surveillance application. BigQuery stands out here because it doesn’t require any maintenance windows, version upgrades, upfront sizing, or hardware replacements, as opposed to running a large-scale, on-premises Hadoop cluster.

This brings us to the final advantage, namely the cost-effectiveness of the new architecture. In addition to allowing team members to now focus on business-relevant features instead of dealing with infrastructure, the architecture makes use of services which are charged based on a pay-as-you-go model. Instead of running the underlying machines in 24/7 mode, compute power is only brought up when needed to perform compliance-specific ETLs, execute the trade surveillance scenarios, or perform effectiveness calibration, which are all batch processes. This again helps further reduce the cost compared to an always-on, on-prem solution. 

Here’s the view from Deutsche Bank’s Compliance Technology team about the associated benefits: 

“Our estimations show that we can potentially save up to 30% in IT Infrastructure cost and achieve better risk coverage and Time to Market when it comes to rolling out additional risk and behaviors with this new serverless architecture using BigQuery.” 
– Sanjay-Kumar Tripathi, Managing Director, Global Head of Communication Surveillance Technology & Compliance Cloud Transformation Lead, Deutsche Bank AG

Looker Hackathon 2023 results: Best hacks and more

In December, the Looker team invited our developer and data community to collaborate, learn, and inspire each other at our annual Looker Hackathon. More than 400 participants from 93 countries joined together, hacked away for 48 hours and created 52 applications, tools, and data experiences. The hacks use Looker and Looker Studio’s developer features, data modeling, visualizations and other Google Cloud services like BigQuery and Cloud Functions.

For the first time in Looker Hackathon history, we had two hacks tie for the award of the Best Hack. See the winners below and learn about the other finalists from the event. In every possible case, we have included links to code repositories or examples to enable you to reproduce these hacks.

Best Hack winners

DashNotes: Persistent dashboard annotations

By Ryan J, Bartosz G, Tristan F

Have you ever wanted to take note of a juicy data point you found after cycling through multiple filterings of your data? You could write your notes in an external notes application, but then you might lose the dashboard and filter context important to your discovery. This Best Hack allows you to take notes right from within your Looker dashboard. Using the Looker Custom Visualization API, it creates a dashboard tile for you to create and edit text notes. Each note records the context around its creation, including the original dashboard and filter context. The hack stores the notes in BigQuery to persist the notes across sessions. Check out the GitHub repository for more details.

Document repository sync automation

By Mehul S, Moksh Akash M, Rutuja G, Akash

Does your organization struggle to maintain documentation on an increasing number of ever-changing dashboards? This Best Hack helps your organization automatically generate current detailed documentation on all your dashboards, for simplified administration. The automation uses the Looker SDK, the Looker API, and serverless Cloud Functions to parse your LookML for useful metadata, and stores it in BigQuery. Then the hack uses LookML to model and display the metadata inside a Looker dashboard. Check out the GitHub repository for the backend service and the GitHub repository for the LookML for more details.

Nearly Best Hack winner

Querying Python services from a Looker dashboard

By Jacob B, Illya M

If your Looker dashboard had the power to query any external service, what would you build? This Nearly Best Hack explores how your Looker Dashboard can communicate with external Python services. It sets up a Python service to mimic a SQL server and serves it as a Looker database connection for your Looker dashboard to query. Then, clever LookML hacks enable your dashboard buttons to send data to the external Python service, creating a more interactive dashboard. This sets up a wide array of possibilities to enhance your Looker data experience. For example, with this hack, you can deploy a trained ML model from Google Cloud’s Vertex AI in your external service to deliver keen insights about your data. Check out the GitHub repository for more details.

Finalists

What do I watch?

By Hamsa N, Shilpa D

We’ve all had an evening when we didn’t know what movie to watch. You can now tap into a Looker dashboard that recommends ten movies you might like based on your most liked movie from IMDB’s top 1000 movies. The hack analyzes a combination of genre, director, stars, and movie descriptions, using natural language processing techniques. The resulting processed data resides in BigQuery, with LookML modeling the data. Check out the GitHub repository for more details.

Template analytics

By Ehsan S

If you need to determine which customer segment will be most effective to market to, check out this hack, which performs Recency, Frequency, Monetary (RFM) analysis on data from a Google Sheet to help you segment customers based on how recently they last transacted, how often they’ve purchased, and how much they’ve spent over time. You provide the hack’s custom Looker Studio Community Connector with a Google Sheet, and the connector performs RFM analysis on your sheet’s data. The hack’s Looker Studio report visualizes the results to give an overview of your customer segments and behavior. Check out the Google Apps Script code for more details.

LOV filter app

By Markus B

This hack implements a List of Values (LOV) filter that enables you to have the values of one dimension filter a second dimension. For example, take two related dimensions: “id” and “name”. The “name” dimension may change, while the “id” dimension always stays constant.

This hack uses Looker’s Extension Framework and Looker Components to show “name” values in the LOV filter that translate to “id” values in an embedded dashboard’s filter. This helps your stakeholders filter on values they’re familiar with and keeps your data model flexible and robust. Check out the GitLab repository for more details.

Looker accelerator

By Dmitri S, Joy S, Oleksandr K

This collection of open-source LookML dashboard templates provides insight into Looker project performance and usage. The dashboards use Looker’s System Activity data and are a great example of using LookML to create reusable dashboards. In addition, you can conveniently install the Looker Block of seven dashboards through the Looker Marketplace (pending approval) to help your Looker developers or admins optimize your Looker usage. Check out the GitHub repository for more details.

The SuperViz Earth Explorer

By Ralph S

With this hack, you can visually explore the population and locations of cities across the world on an interactive 3D globe, and can filter the size of the cities in real time as the globe spins. This custom visualization uses the Looker Studio Community Visualization framework with a combination of three.js, a 3D JavaScript library, and clever graphics hacks to create a visual experience. Check out the GitHub repository for more details.

dbt exposure generator

By Dana H.

Are you using dbt models with Looker? This hack automatically generates dbt exposures to help you debug and identify how your dbt models are used by Looker dashboards. This hack serves as a great example of how our Looker SDK and Looker API can help solve a common pain point for developers. Check out the GitHub repository for more details.

Hacking Looker for fun and community

At Looker Hackathon 2023, our developer community once again gave us a look into how talented, creative, and collaborative they are. We saw how our developer features like Looker Studio Community Visualizations, LookML, and Looker API, in combination with Google Cloud services like Cloud Functions and BigQuery, enable our developer community to build powerful, useful — and sometimes entertaining — tools and data experiences.

We hope these hackathon projects inspire you to build something fun, innovative, or useful for you. Tap into our linked documentation and code in this post to get started, and we will see you at the next hackathon!

Unlock Web3 data with BigQuery and Subsquid

Editor’s note: The post is part of a series showcasing partner solutions that are Built with BigQuery.

Blockchains generate a lot of data with every transaction. The beauty of Web3 is that all of that data is publicly available. But the multichain and modular expansion of the space has increased the complexity of accessing data, where any project looking to build cross-chain decentralized apps (DApps) has to figure out how to tap into on-chain data that is stored in varying locations and formats.

Meanwhile, running indexers to extract the data and make it readable is a time-consuming, resource-intensive endeavor often beyond small Web3 teams’ capabilities, since proficiency in coding smart contracts and building indexers are entirely different skills.

Having recognized the challenges for builders to leverage one of the most valuable pieces of Web3 (its data!), the Subsquid team set out to build a fully decentralized solution that drastically increases access to data in a permissionless manner.

Subsquid explained

One way to think about the Subsquid Network is as Web3’s largest decentralized data lake — existing to ingest, normalize, and structure data from over 100 Ethereum Virtual Machines (EVM) and non-EVM chains. It allows devs to quickly access (‘query’) data more granularly — and vastly more efficiently — than via legacy RPC node infrastructure.

Subsquid Network is horizontally scalable, meaning it can grow alongside archival blockchain data storage. Its query engine is optimized to extract large amounts of data and is a perfect fit for both dApp development (indexing) and for analytics. In fact, a total of over 11 billion dollars in decentralized application and L1/L2 value depends on Subsquid indexing.

Since September, Subsquid has been shifting from its initial architecture to a permissionless and decentralized format. So far during the testnet, 30,000 participants — including tens of thousands of developers — have built and deployed over 40,000 indexers. Now, the Subsquid team is determined to bring this user base and its data to Google BigQuery.

BigQuery and blockchain

BigQuery is a powerful enterprise data warehouse solution that allows companies and individuals to store and analyze petabytes of data. Designed for large-scale data analytics, BigQuery supports multi-cloud deployments and offers built-in machine learning capabilities, enabling data scientists to create ML models with simple SQL.

BigQuery is also fully integrated with Google’s own suite of business intelligence and external tools, empowering users to run their own code inside BigQuery using Jupyter Notebooks or Apache Zeppelin.

Since 2018, Google has added support for blockchains like Ethereum and Bitcoin to BigQuery. Then, earlier this year, the on-chain data of 11 additional layer-1 blockchain architectures was integrated into BigQuery, including Avalanche, Fantom, NEAR, Polkadot, and Tron.
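
For example, anyone can run aggregations over the public Ethereum dataset directly in BigQuery. The sketch below assumes the `bigquery-public-data.crypto_ethereum.transactions` table with its `block_timestamp` and `value` (wei) columns:

SELECT
  DATE(block_timestamp) AS day,
  COUNT(*)              AS transactions,
  SUM(value) / 1e18     AS total_eth_transferred   -- convert wei to ETH
FROM `bigquery-public-data.crypto_ethereum.transactions`
WHERE block_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY day
ORDER BY day;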

But while it’s great to be able to run analytics on public blockchain data, this might not always offer exactly the data a particular developer needs for their app. This is where Subsquid comes in.

Data superpowers for Web3 devs and analysts

Saving custom-curated data to BigQuery lets developers leverage Google’s analytics tools to gain insights into how their product is used, beyond the context of one chain or platform.

Multi-chain projects can leverage Subsquid in combination with BigQuery to quickly analyze their usage on different chains and gain insights into fees, operating costs, and trends. With BigQuery, they aren’t limited to on-chain data either. After all, Google is the company behind Google Analytics, an advanced analytics suite for website traffic.

Web3 Data Unlocked: Indexing Web3 Data with Subsquid & Google BigQuery

Subsquid developer relations engineer Daria A. demonstrates how to store data indexed with Subsquid in BigQuery and other tools.

Analyzing across domains by combining sets of on-chain activity with social media data and website traffic can help projects understand retention and conversion in their projects while identifying points where users drop off, to further improve their workflows.

“BigQuery is quickly becoming an essential product in Web3, as it enables builders to query and analyze one’s own data, as well as to leverage a rich collection of datasets already compiled by others. Since it’s SQL based, it’s extremely easy to explore any data and then run more and more complex queries. With a rich API and complete developer toolkit, it can be connected to virtually anything,” writes Dmitry Zhelezov, Subsquid CEO and co-founder.

“Now, with the addition of Subsquid indexing, Web3 developers literally have data superpowers. They can build a squid indexer from scratch or use an existing one to get exactly the data they need extremely efficiently. We can’t wait to see what this unlocks for builders.”

Get started with Subsquid on BigQuery today

Subsquid’s support for BigQuery is already feature-complete. Are you interested in incorporating this tool into your Web3 projects? Find out more in the documentation. You can also view an example project demoed on YouTube and open-sourced on GitHub.

The Built with BigQuery advantage for Data Providers and ISVs

Built with BigQuery helps companies like Subsquid build innovative applications with Google Data and AI Cloud. Participating companies can:

Accelerate product design and architecture through access to designated experts who can provide insight into key use cases, architectural patterns, and best practices.
Amplify success with joint marketing programs to drive awareness, generate demand, and increase adoption.

BigQuery gives Data Providers and ISVs the advantage of a powerful, highly scalable unified AI lakehouse that’s integrated with Google Cloud’s open, secure, sustainable platform. Click here to learn more about Built with BigQuery.

Introducing new vector search capabilities in BigQuery

The advent of advanced AI and machine learning (ML) technologies has revolutionized the way organizations leverage their data, offering new opportunities to unlock its potential. Today, we’re announcing the public preview of vector search in BigQuery, which enables vector similarity search on BigQuery data. This functionality, also commonly referred to as approximate nearest-neighbor search, is key to empowering numerous new data and AI use cases such as semantic search, similarity detection, and retrieval-augmented generation (RAG) with a large language model (LLM).

Vector search is often performed on high-dimensional numeric vectors, a.k.a. embeddings, which incorporate a semantic representation for an entity and can be generated from numerous sources, including text, image, or video. BigQuery vector search relies on an index to optimize the lookups and distance computations required to identify closely matching embeddings.

Here is an overview of BigQuery vector search:

It offers a simple and intuitive CREATE VECTOR INDEX and VECTOR_SEARCH syntax that is similar to BigQuery’s familiar text search functionality. This simplifies combining vector search operations with other SQL primitives, enabling you to process all your data at BigQuery scale.
It works with BigQuery’s embedding generation capabilities, notably via LLM-based or pre-trained models. Yet the generic interface allows you to use embeddings generated via other means as well.
BigQuery vector indexes are automatically updated as the underlying table data mutates, with the ability to easily monitor indexing progress. This extensible framework can support multiple vector index types, with the first implemented type (IVF) combining an optimized clustering model with an inverted row locator in a two-piece index.
The LangChain implementation simplifies Python-based integrations with other open-source and third-party frameworks.
The VECTOR_SEARCH function is optimized for analytical use cases and can efficiently process large batches of queries (rows). It also delivers low-latency inference results when handling small input data. Faster, ultra-low-latency online prediction can be performed on the same data through our integration with Vertex AI.
It’s integrated with BigQuery’s built-in governance capabilities, notably row-level, data masking, and column-level security policies.

Use cases

The combination of embedding generation and vector search enables many interesting use cases, with RAG being a canonical one. The examples below provide high-level algorithmic descriptions for what can be encoded in your data application or queries using vector search:

Given a new (batch of) support case(s), find ten closely-related previous cases, and pass them to an LLM as context to summarize and propose resolution suggestions.
Given an audit log entry, find the most closely matching entries in the past 30 days.
Generate embeddings from patient profile data (diagnosis, medical and medication history, current prescriptions, and other EMR data) to do similarity matching for patients with similar profiles and explore successful treatment plans prescribed to that patient cohort.
Given the embeddings representing pre-accident moments from all the sensors and cameras in a fleet of school buses, find similar moments from all other vehicles in the fleet for further analysis, tuning, and re-training of the models governing the safety feature engagements.
Given a picture, find the most closely-related images in the customer’s BigQuery object table, and pass them to a model to generate captions.

BigQuery-based RAG deep dive

BigQuery enables you to generate vector embeddings and perform vector similarity search to improve the quality of your generative AI deployments with RAG. You can find some steps and tips below:

You can generate vector embeddings from text data using a range of supported models, including LLM-based ones. These models effectively understand the context and semantics of words and phrases, allowing them to encode the text into vectors that represent its meaning in a high-dimensional space.
With BigQuery’s scale and ease of use, you can store these embeddings in a new column, right alongside the data they were generated from. You can then perform queries against these embeddings or build an index to improve retrieval performance.
Efficient and scalable similarity search is crucial for RAG, as it allows the system to quickly find the most relevant pieces of information based on the query’s semantic meaning. Vector similarity search involves efficiently searching through millions or billions of vectors from the vector data store to find the most similar vectors. BigQuery vector search uses its indexes to efficiently find the closest matching vectors according to a distance measurement technique such as cosine or euclidean.
When doing prompt engineering with RAG, the first step involves converting the input into a vector using the same (or a similar) model to that used for encoding the knowledge base. This ensures that the query and the stored information are in the same vector space, making it possible to measure similarity.
The vectors identified as most similar to the query are then mapped back to their corresponding text data. This text data represents the pieces of information from the knowledge base that are most likely to be relevant to the query.
The retrieved text data is then fed into a generative model. This model uses the additional context provided by the retrieved information to generate a response that is not only based on its pre-trained knowledge, but also enhanced by the specific information retrieved for the query. This is particularly useful for questions that require up-to-date information or detailed knowledge on specific topics.

The diagram below provides a simplified view of the RAG workflow in BigQuery:

Publication search and RAG examples

In the next three sections, we use the `patents-public-data.google_patents_research.publications` table in the Google Patents public dataset as a running example to highlight three (of the many) use cases BigQuery vector search enables.

Case 1: Patent search using pre-generated embedding

One of the most basic use cases for BigQuery vector search is performing similarity search using data with pre-generated embeddings. This is common when you intend to use embeddings that were previously generated from proprietary or pre-trained models. As an example, if you store your data and queries in <my_patents_table> and <query_table> respectively, the search journey would consist of index creation, followed by vector search:

CREATE OR REPLACE VECTOR INDEX `<index_name>`
ON `<my_patents_table>`(embedding_v1)
OPTIONS(distance_type='COSINE', index_type='IVF')

SELECT query.publication_number AS query_publication_number,
  query.title AS query_title, base.publication_number,
  base.title, distance
FROM VECTOR_SEARCH(
  TABLE `<my_patents_table>`, 'embedding_v1',
  TABLE `<query_table>`, top_k => 5, distance_type => 'COSINE')

Note that indexing is mainly a performance optimization mechanism for approximate nearest-neighbor search, and vector search queries can succeed and return correct results without an index as well. For more details, including specifics on recall calculation, please see this tutorial.

Case 2: Patent search with BigQuery embedding generation

You can achieve a more complete end-to-end semantic search journey by using BigQuery’s capabilities to generate embeddings. More specifically, you can generate embeddings in BigQuery via LLM-based foundational or pre-trained models. The SQL snippet below assumes you have already created a BigQuery <LLM_embedding_model> that references a Vertex AI text embedding foundation model via BigQuery (see this tutorial for more details):

CREATE TABLE `<patents_my_embeddings_table>` AS
SELECT * FROM ML.GENERATE_TEXT_EMBEDDING(
  MODEL `<LLM_embedding_model>`,
  (SELECT *, abstract AS content
   FROM `patents-public-data.google_patents_research.publications`
   WHERE LENGTH(abstract) > 0 AND LENGTH(title) > 0 AND country = 'Singapore'))
WHERE ARRAY_LENGTH(text_embedding) > 0;

We skip demonstrating the index-creation step, as it is similar to “Case 1” above. After the index is created, we can use VECTOR_SEARCH combined with ML.GENERATE_TEXT_EMBEDDING to search related patents. Below is an example query to search patents related to “improving password security”:

SELECT query.query, base.publication_number, base.title, base.abstract
FROM VECTOR_SEARCH(
  TABLE `<patents_my_embeddings_table>`, 'text_embedding',
  (SELECT text_embedding, content AS query
   FROM ML.GENERATE_TEXT_EMBEDDING(
     MODEL `<LLM_embedding_model>`,
     (SELECT 'improving password security' AS content))
  ), top_k => 5)

Case 3: RAG via integration with generative models

BigQuery’s advanced capabilities allow you to easily extend the search cases covered above into full RAG journeys. More specifically, you can use the output from the VECTOR_SEARCH queries as context for invoking Google’s natural language foundation (LLM) models via BigQuery’s ML.GENERATE_TEXT function (see this tutorial for more details).

The sample query below demonstrates how you can ask the LLM to propose project ideas to improve user password security. It uses the top_k patents retrieved via semantic similarity vector search as context passed to the LLM model to ground its response:

SELECT ml_generate_text_llm_result AS generated, prompt
FROM ML.GENERATE_TEXT(
  MODEL `<LLM_text_generation_model>`,
  (SELECT CONCAT(
     'Propose some project ideas to improve user password security using the context below: ',
     STRING_AGG(FORMAT("patent title: %s, patent abstract: %s", base.title, base.abstract), ',\n')) AS prompt
   FROM VECTOR_SEARCH(
     TABLE `<patents_my_embeddings_table>`, 'text_embedding',
     (SELECT text_embedding, content AS query
      FROM ML.GENERATE_TEXT_EMBEDDING(
        MODEL `<LLM_embedding_model>`,
        (SELECT 'improving password security' AS content))
     ), top_k => 5)
  ),
  STRUCT(0.4 AS temperature, 300 AS max_output_tokens, 0.5 AS top_p, 5 AS top_k,
         TRUE AS flatten_json_output));

Getting started

BigQuery vector search is now available in preview. Get started by following the documentation and tutorial.

Power self-serve analytics and generative AI with Sparkflows and Google Cloud

Self-service analytics powered by ML and generative AI is the new holy grail for data-driven enterprises, enabling enhanced decision-making through predictive insights, and providing a significant boost in operational efficiency and innovation. C-level executives increasingly see self-service analytics as the key driver of employee productivity and business efficiency.

Today, technical practitioners employ a variety of open-source libraries, including Apache Spark, Ray, pandas, scikit-learn, H2O, and many more to create analytics and ML applications. This entails writing a lot of code, which has a steep learning curve. Additionally, developing front-end interfaces for business users to interact with the systems in a secure and scalable manner takes a long time.

Enterprises also face challenges in hiring and retaining data-science experts and incur overhead costs for managing a large number of heterogeneous tools and technologies. Handling a growing variety and volume of data from siloed sources is a huge barrier to analytics initiatives. Lack of seamless workload scaling slows business solutions development.

Democratizing analytics and building ML applications are best done when business users and IT teams are empowered with services offered by cloud technology through intuitive, easy-to-use workflows, analytical apps, and conversational interfaces.

This brings out the strong need for a unified self-service platform made for all users to create and launch business solutions powered by cloud.

Sparkflows

Sparkflows is a Google Cloud partner that provides a powerful platform packed with self-service analytics, ML, and gen AI capabilities for building data products. Sparkflows helps integrate diverse open-source technologies through intuitive user-driven interfaces.

With Sparkflows, data analytics teams can turbocharge the development of ETL, exploratory analytics, feature engineering, ML models, and gen AI apps using 460+ no-code/low-code processors and various workbenches, as shown below.

Various AI and gen AI workbenches in Sparkflows

Self-service with Sparkflows and Google Cloud

Sparkflows running on Google Cloud provides unified self-serve data science capabilities with connectivity to BigQuery, Vertex AI, AlloyDB, and Cloud Storage. The solution automatically pushes down computation to high-performance distributed job execution engines like Dataproc and BigQuery. These automated integrations scale business solutions for very large datasets.

Interaction diagram: Sparkflows and Google Cloud

Sparkflows has developed a large number of solutions for the sales and marketing, manufacturing and supply chain departments of retail and CPG customers.

Business scenarios using Sparkflows and Google Cloud

Let’s assume the engineering team of a retail company needs to empower the marketing team with a self-service analytics tool that can identify the customers who are likely to churn, and measure the effectiveness of the campaigns by analyzing the coupon responsiveness, sales, and demographic data.

The team needs to ingest and prepare data quickly, build ML models, analytics reports and gen AI apps in an automated fashion where Spark code will be generated and jobs will be submitted to a Dataproc cluster effortlessly.

Installation

As the first step, Sparkflows is installed inside the customer’s secure VPC network either on a virtual machine or in a container running in Google Cloud. Sparkflows runs securely with built-in SSO integration.

Configuration

Admin users configure the Dataproc Serverless Spark cluster and various types of LLM services like PaLM API in Sparkflows admin console.

Self-service solution design & execution

Sparkflows enables a unified experience for continuous machine learning.

Let’s now discuss the steps required to identify customers who are likely to churn and the ability to analyze the reviews by customers to measure satisfaction. This process involves:

Dataset exploration
Data preparation
ML model training
ML model prediction
Visualizations
Creating analytical apps
Generative AI apps

Sparkflows connects with various Google Cloud services for performing the above operations (Ref: Interaction diagram: Sparkflows and Google Cloud).

Datasets

In this example, the datasets (customer transactions, campaigns, coupons and demographic info) are stored in BigQuery and product review data is in Cloud Storage. Business users can select a domain like retail and then view all the datasets stored in Google Cloud within Sparkflows. Users can browse files in Cloud Storage, explore and query BigQuery tables. Sparkflows dataset explorer seamlessly connects with Data Catalog.

Data preparation

Users can rapidly design various workflows for ingesting the datasets and performing data profiling, automated quality checks, cleaning and exploratory analysis using 350+ no-code/low-code data preparation processors. All these workflows help automate the Spark code generation and functionality development for the current business solution, cutting down the engineering time from weeks to hours.

Each of the visual workflows results in the automatic creation of a Spark job which is launched on Dataproc Serverless. Dataproc Serverless is an ideal platform for running these jobs. It is a highly performant and cost-effective distributed computing platform that is able to quickly spin up additional compute resources as needed. The platform is also very cost-effective as customers are only billed for resources for the duration of the job execution.

ML model training

Data scientists and analysts can perform feature engineering to calculate various aggregated metrics from the data processed by workflows designed in previous steps. Developers can leverage 80+ no-code/low-code ML processors to create an ML modeling workflow. The features are used for training a model which can predict customers most likely to churn.

The features, based on purchase pattern and coupon redemption information, are used to create customer segments.

ML model prediction

Below is an example of the Prediction workflow for churn prediction.

The Prediction workflow can be triggered manually, via the built-in scheduler, through the API, or using the Analytical App UI.

ML Model Prediction Workflow

Visualization – descriptive and predictive analytics

Business users can drag the nodes used in workflows in the report designer UI and create powerful reports, which allow data scientists to inspect profiling stats, data quality results, exploratory insights, training metrics and prediction outputs.

When the underlying workflows are executed in a Dataproc cluster, the reports are automatically refreshed.

Reports of descriptive and predictive analytics

Business analytical apps

Business analytical apps in Sparkflows let business users build front-end applications for data products. Business users interact with these apps using their browsers. The analytical apps are built with an interactive UI.

Gen AI apps

Now, let’s build a few gen AI apps to allow the business team to perform the following operations:

Ask questions from the product review data
Summarize, extract topics and translate texts

The first step is to configure the Vertex PaLM API connection in the admin console and select the connection in the Analytical App.

Allow users to query product reviews and gain insights
Allow users to translate and query documents

This is how Sparkflows helps sales and marketing teams of a retail company identify potential customer churn, measure campaign effectiveness, find target customer segments, and analyze product reviews and business documents.

ML solutions

Sparkflows enables a wide range of gen AI apps, from content synthesis, content generation, and NLQ-based reports, to prompt-based business solutions.

Generative AI solutions

Better together

Having the ability to move fast with AI and generative AI is of great value to all types of enterprises. The partnership between Sparkflows and Google Cloud puts powerful and affordable self-serve AI and gen AI capabilities in the hands of users in a secure and scalable way. Building gen AI solutions using Sparkflows and Google Cloud is highly affordable, thanks to Vertex AI’s cost-effective gen AI pricing model and Sparkflows’ discounted pricing package. Overall, Sparkflows with Google Cloud drives operational efficiencies, accelerates business solutions, and speeds up time to market, thereby propelling business growth.

Try out Sparkflows

Here are a few links to get started with Sparkflows and Google Cloud:

Get a sandbox instance in Google Cloud
Sign up for the playground
Ask for a demo
Learn more
Tech blogs

We thank the many Google Cloud and Sparkflows team members who contributed to this collaboration, especially Kaniska Mandal and Deb Dasgupta for their guidance during the process.

Leveraging streaming analytics for actionable insights with gen AI and Dataflow

In recent years, there’s been a surge in the adoption of streaming analytics for a variety of use cases, for instance predictive maintenance to identify operational anomalies, and online gaming, where player-centric games are created by optimizing experiences in real time. At the same time, the rise of generative AI and large language models (LLMs) capable of generating and understanding text has led us to explore new ways to combine the two to create innovative solutions.

In this blog post, we showcase how to get real-time LLM insights in an easy and scalable way using Dataflow. Our solution applies to a gameroom chat, but it could be used to gain insights into a variety of other types of data, such as customer support chat logs, social media posts, and product reviews — any other domain where real-time communication is prevalent.

Game chats: a goldmine of information

Consider a company seeking real-time insights from chat messages. A key challenge for many companies is understanding users’ evolving jargon and acronyms. This is especially true in the gaming industry, where “gg” means “good game” or “g2g” means “got to go.” The ideal solution would adapt to this linguistic fluidity without requiring pre-defined keywords.

For our solution, we looked at anonymized data from Kaggle of gamers chatting while playing Dota 2, conversing freely with one another via short text messages. Their conversations were nothing short of gold in our eyes. From gamers’ chats with one another, we identified an opportunity to quickly detect ongoing connection or delay issues and thereby ensure good quality of service (QoS). Similarly, gamers often talk about missing items such as tokens or game weapons, information we can also leverage to improve the gaming experience and its ROI.

At the same time, whatever solution we built had to be easy and quick to implement!

Solution components

The solution we built includes industry-leading Google Cloud data analytics and streaming tools, plus open-source gaming data and an LLM.

BigQuery stores the raw data and holds detection alerts.
Pub/Sub, a Google Cloud serverless message bus, is used to decouple the streamed chat messages and the Dataflow pipeline.
Dataflow, a Google Cloud managed service for building and running the distributed data processing pipeline, relies on the Beam RunInference transform for a simple and easy-to-use interface for performing local and remote inference.
The Dota 2 game chat dataset (game chats raw data) is taken from Kaggle.
Google/Flan-T5 is the LLM used for detection based on the prompt. It is hosted on Hugging Face.

Once we settled on the components, we had to choose the right prompt for the specific business use case. In this case, we chose game chat latency detection.

We analyzed our gaming data, looking for keywords such as connection, delay, latency, lag, etc.

Example:

SELECT text FROM GOSU_AI_Dota2_game_chats
WHERE text LIKE '%latency%' OR text LIKE '%connection%'

The following game id came up:

SELECT text FROM summit2023.GOSU_AI_Dota2_game_chats
WHERE match = 507332

Here, we spotted a lot of lag and server issues: 

match   time       slot  text
507332  68.31666   1     0 losees
507332  94.94348   3     i think server
507332  77.21448   4     quit plsss
507332  51.15418   3     lag
507332  319.15543  8     made in chine
507332  126.50245  9     i hope he doesnt reconnect
507332  315.65628  8     shir server
507332  62.85132   5     15 packet loses
507332  65.75062   3     wtfd lag
507332  289.02948  9     someone abandon
507332  445.02468  9     YESYESYES
507332  380.94034  3     lag wtf
507332  79.34728   4     quit lagger stupid shit
507332  75.21498   9     great game volvo
507332  295.66118  7     HAHAHA

After a few SQL query iterations, we managed to tune the prompt so that the true-positive rate was high enough to raise a detection alert, yet the prompt remained agnostic enough to spot delay issues without relying on specific keywords.

“Answer by [Yes|No] : does the following text, extracted from gaming chat room, can indicate a connection or delay issue : “

Our next challenge was to create a Dataflow pipeline that seamlessly integrated two key features:

Dynamic detection prompts: Users must be able to tailor detection prompts for diverse use cases, all while the pipeline is up and running — without writing any code.

Seamless model updates: We needed a way to swap in a better-trained model without interrupting the pipeline’s operation, ensuring continuous uptime — again, without writing any code.

To that end, we chose to use the Beam RunInference transform.

RunInference offers numerous advantages: 

Data preprocessing and post-processing are encapsulated within the RunInference function and treated as distinct stages in the process. Why is this important? RunInference effectively manages errors associated with these stages and automatically extracts them into separate PCollections, enabling seamless handling.

RunInference’s automated model refresh mechanism uses a watch file pattern. This enables us to update the model and load the newer version without halting and restarting the pipeline.

All with a few lines of code:

RunInference uses a “ModelHandler” object, which wraps the underlying model and provides a configurable way to handle the used model.
There are different types of ModelHandlers; which one you choose depends on the framework and type of data structure that contains the inputs. This makes it a powerful tool and simplifies the building of machine learning pipelines.

Solution architecture

We created a Dataflow pipeline to consume game chat messages from a Pub/Sub topic. In the solution, we simulated the stream by reading the data from a BigQuery table and pushing it to the topic.

The Flan-T5 model is loaded into the workers’ memory and provided with the following prompt:

“Answer by [Yes|No] : does the following text, extracted from gaming chat room, indicate a connection or delay issue : “

A Beam side input PCollection, read from a BigQuery table, empowers various business detections to be performed within the Beam pipeline.

The model generates a [Yes|No] response for each message within a 60-second fixed window. The number of Yes responses is counted, and if it exceeds 10% of the total number of responses, the window data is stored in BigQuery for further analysis.
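
The windowing and thresholding happen inside the Beam pipeline, but for intuition, the equivalent aggregation over a hypothetical table of per-message model responses could be written in SQL roughly like this:

SELECT
  TIMESTAMP_TRUNC(message_ts, MINUTE) AS window_start,   -- stand-in for the 60-second fixed window
  COUNTIF(llm_response = 'Yes') AS yes_count,
  COUNT(*) AS total_count
FROM `gaming.chat_llm_responses`
GROUP BY window_start
HAVING COUNTIF(llm_response = 'Yes') > 0.1 * COUNT(*);   -- alert when Yes responses exceed 10%

In the pipeline itself, the same logic is applied per 60-second window before the matching window data is written to BigQuery.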

Conclusion:

In this blog post, we showcased how to use LLMs with the Beam RunInference transform in Dataflow to gain insights from gamers chatting amongst themselves.

We used the RunInference transform with a loaded Google/Flan-t5 model to identify anything that indicates a system lag, without giving the model any specific words. In addition, the prompts can be changed in real time and be provided as a side input to the created pipeline. This approach can be used to gain insights into a variety of other types of data, such as customer support chat logs, social media posts, and product reviews.

Check out the Beam ML Documentation to learn more about integrating ML using the RunInference transform as part of your real-time Dataflow workstreams. For a Google Colab notebook on using RunInference for Generative AI, check out this link.

Appendix:
Use RunInference for Generative AI | Dataflow ML

LookML or ELT? Three reasons why you need LookML

Background

LookML is a powerful tool that applies business logic and governance to enterprise data analytics. However, LookML’s capabilities are often conflated with those of in-warehouse “ELT” transformation tools like Dataform and DBT.

As these tools appear to be similar in nature, it is often thought that users need to choose one over the other. This post outlines why customers should be using both LookML and ELT tools in their data analytics stack, with a specific focus on the importance of LookML. In a follow-up article, we will cover how you should architect your business logic and transformations between the LookML and ELT layers.

Quick background on LookML

If you are new to LookML, you will want to check out this video and help center article to get more familiar. But to quickly summarize:

LookML is a code-based modeling layer based on the principles of SQL that:

Allows developers to obfuscate the complexity behind their data and create an explorable environment for less-technical data personas

Brings governance and consistency to analytics because all queries are based off of the LookML model, acting as a “single source of truth”

Enables modern development through its git-integrated version control

LookML can power third-party tools

LookML was originally designed to enable analytics inside of Looker’s UI (or in custom data applications using Looker’s APIs). Looker has continued to evolve and announced several LookML integrations into other analytics tools, enabling customers to add governance and trust to more of their analytics, regardless of which user interface they are using. 

Increased adoption of “ELT” Transformation tools

Over the last few years, many organizations have adopted in-warehouse ELT transformation tools, like Dataform (from Google) and DBT, as an alternative to traditional ETL processes. ETL typically transforms data before it’s loaded into a data warehouse. ELT tools take a more simplified and agile approach by transforming data after it’s loaded in the warehouse. They also adhere to modern development practices.

Similarities with LookML

On the surface, characteristics of these ELT tools sound very similar to those of LookML. Both:

Are built on the foundations of SQL

Enable reusable and extendable code

Help define standardized data definitions

Are Git version-controlled and collaborative

Have dedicated IDEs with code validation and testing

The deeper value of LookML

LookML adds three critical capabilities to your organization that cannot be done solely with an ELT tool:

Flexible metrics and exploratory analysis

Consistency and governance in the “last mile” of your analytics

Agile development and maintenance of your BI layer

Flexible metrics and exploratory analysis

Many teams attempt to define and govern their data solely using their most familiar ELT tool. One reason you should avoid this is related to aggregated metrics (also known as “facts” or “measures”), specifically non-additive and semi-additive metrics. ELT tools are not designed to support these types of metrics efficiently without building a lot of unnecessary tables.

Types of BI metrics

Additive: Sum of Sales, Count of Orders

Semi-additive: Products in Stock, Account Balance

Non-additive: Daily Active Users (count distinct), Average Gross Margin (avg)

Flexible Metrics
A flexible metric is a metric that is dynamically calculated based on the scope of the request from the user. For example, let’s say your business wants to report on the Daily Active Users on your website, a non-additive metric. If you’re working inside your ELT tool, you may say: “Hey, that’s easy! I can build a new table or view.”

SELECT
  DATE(created_at) AS date,
  COUNT(DISTINCT user_id) AS daily_active_users
FROM `events`
GROUP BY 1

However, the business also wants to see Monthly Active Users. You can’t just sum up the daily active users because if the same user was active on multiple days, they would be counted multiple times for that month. This is an example of a non-additive metric, where the calculation is dependent on the dimensions it is being grouped by (in this case day or month).

You need to build another table or view that is grouped by month. But the business may also want to slice it by other dimensions, such as the specific pages that the users hit, or products they viewed. If the only tool you have available is the ELT tool, you’re going to end up creating separate tables for every possible combination of dimensions, which is a waste of time and storage space.

Exploratory analysis (powered by flexible metrics)
Even if you wanted to spend your time modeling every permutation, you’d be siloing your users into only analyzing one table or combination at a time. When they change how they want to look at the data, switching daily to monthly for example, they’d have to change the underlying table that they are querying, adding friction to the user experience and leading to less curious and less data-driven users.

LookML avoids this by enabling you to define one measure called “Active User Count” and allowing users to freely query it by day, month, or any other dimension without having to worry about which table and granularity they are querying (as shown below).

# LookML for non-additive measure, flexibly works with any dimension
measure: active_user_count {
  description: "For DAU: query w/ Date. For MAU: query w/ Month"
  type: count_distinct
  sql: ${user_id} ;;
}

Having flexible metrics can enable smoother exploratory analysis

Consistency and governance in the “last mile” of your analytics

Even if you believe you’ve perfectly manicured the data inside your data warehouse with your ELT tool, the “last mile” of analytics, between the data warehouse and your user, introduces another level of risk.

Inconsistent logic
No matter how much work you’ve put into your ELT procedures, analysts, who may be working in separate ungoverned BI or visualization tools, still have the ability to define new logic inside their individual report workbooks. What often happens here is users across different teams start to define logic differently or even incorrectly, as their metrics and tools are inconsistent.

Maintenance challenges
Even if you manage to do it correctly and consistently, you are likely duplicating logic in each workbook or dashboard, and inevitably, over time, something changes. Whether it’s the logic itself or a table or column name, you now have to find and edit every report that uses the outdated logic.

Agile development and maintenance in your BI layer

LookML itself is a “transformation” tool. LookML applies its transformations at query time, making it an agile option that doesn’t require persisting logic into a data-warehouse table before analysis. For example, a hypothetical ecommerce company may have some business logic around how it defines the “days to process” an order.

As an analyst, I have the autonomy to quickly apply this logic in LookML without having to wait for the data engineering team to bake it into the necessary warehouse tables.

# LookML example of agile transformation at query time
dimension: days_to_process {
  description: "Days to process each order"
  type: number
  sql:
    CASE
      WHEN ${status} = 'Processing'
        THEN TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), ${created_raw}, DAY)
      WHEN ${status} IN ('Shipped', 'Complete', 'Returned')
        THEN TIMESTAMP_DIFF(${shipped_raw}, ${created_raw}, DAY)
      WHEN ${status} = 'Canceled' THEN NULL
    END ;;
}

This is especially useful for use cases where the business requirements are fluid or not yet well defined (we’ve all been there). You can use LookML to accelerate “user acceptance testing” in these situations by getting reports and dashboards into users’ hands quickly, maybe even before the requirements are fully locked down.

If you’re solely using ELT tools and the requirements change, you will have to truncate and rebuild your tables each time. This is not only time-consuming but also expensive. Using LookML instead of ELT for quick, lightweight transformations puts data in the hands of your users faster and at lower cost.

On the other hand, adding too much transformation logic at query time could have a negative impact on the cost and performance of your queries. LookML helps here as well. Using the same example, let’s say the business requirements solidified after a couple of weeks and you’d like to move the “days to process” logic from LookML into your ELT jobs. You can swap out the logic for the new database column name, and all of the existing dashboards and reports will continue to work.

# Added the transformation logic to ELT layer, new column "days_to_process"
dimension: days_to_process {
  description: "Days to process each order"
  type: number
  sql: ${TABLE}.days_to_process ;;
}

LookML is a valuable tool for modern data teams that should be used in conjunction with your ELT tool of choice. In a follow-up post, we’ll take you through specific examples of how to architect your transformations between the LookML and ELT layers. As a sneak peek, here’s a high-level summary of what we would recommend.

Where should I do what?

ELT:

The “heavy lifting” transformations that need to be pre-calculated

Data cleansing

Data standardization

Multi-step transformations

Defining data types

Any logic needed outside of Looker or LookML

LookML:

The “last mile” of your data

Metric definitions

Table relationship definitions

Light-weight transformation (at query time)

Filters

Frequently changing or loosely defined business logic

To get started with Looker, with LookML at its core, learn more at cloud.google.com/looker.

Source : Data Analytics Read More

Synthesized creates accurate synthetic data to power innovation with BigQuery

Synthesized creates accurate synthetic data to power innovation with BigQuery

Editor’s note: The post is part of a series showcasing partner solutions that are Built with BigQuery.

Today, it’s data that powers the enterprise, helping to provide competitive advantage, inform business decisions, and drive innovation. However, accessing high-quality data can be costly and time-consuming, and using it often involves complying with strict data compliance regulations.

Synthesized helps organizations gain faster access to data and navigate compliance restrictions by using generative AI to create shareable, compliant snapshots of large datasets. These snapshots can then be used to make faster and more informed business decisions, and power application development and testing. It does this by helping organizations overcome many of the obstacles to fast and compliant insights:

Accessing compliant data – BigQuery provides a wide range of capabilities that help data to be stored and governed in a secure and compliant way. However, when that data is used in a different context, for example to train an ML model, for testing, or to share information with a different department with different clearance levels, ensuring that data is accessed in a compliant way can become complex. Confidential datasets, such as those with personally identifiable information (PII), medical records, financial data, or other sensitive information that should not be disclosed, are often subject to different restrictions due to industry and local governmental regulations. This can make it difficult for international offices managing access for various teams across regions and countries.

Ensuring data quality – One way to manage and protect confidential datasets is data masking, that is, obscuring data so that it cannot be read by certain users. While this is a powerful approach for many use cases, it is less suited to scenarios where visibility of the underlying data is required, for example when training a machine learning model. On top of this, organizations are also tasked with uncovering insights from low-quality or unbalanced data, which makes it difficult to land on accurate and representative data insights.

Unlocking data’s potential with accurate snapshots

Synthesized uses generative AI to help customers across healthcare, financial services, insurance, government, and more generate a new and accurate view of their data with confidentiality restrictions automatically applied.

The solution applies data transformations such as masking, subsetting, redaction, or generation to create high-fidelity snapshots of large datasets that can be used for modeling and testing. Synthesized uses generative AI to capture deep statistical properties that are often hidden in the data and recreate those patterns in synthetic data. At the same time, Synthesized helps ensure adherence to enterprise data privacy regulations, as the output data is programmatically designed to be fully anonymized, enabling easy and fast access to high-quality data and better decision-making.

With the click of a button, organizations can access insights from a synthetic snapshot that is representative of the entire original dataset, in a way that’s fast and compliant. In other words, the solution addresses the “chicken-and-egg” problem of data access: data consumers have to formulate their request for data access as a SQL query, but they can’t write the query without access to the data in the first place.

The newly generated synthetic data can be used for a variety of purposes, including:

Fast access to a compliant snapshot of the data for testing and development purposes.

Simplifying model training by programmatically creating diverse data snapshots that cover a wide range of scenarios, including edge cases and rare events. This diversity helps improve the robustness and generalization of machine learning models.

Accelerating and evaluating cloud migration with accurate test data that mimics the structure of cloud databases, so you can confidently add sanitized or synthetic data by extending existing CI/CD pipelines.

Creating full datasets from unbalanced data, when an original dataset has an unequal distribution of examples and analysis requires the extrapolation of additional reliable data points.

German bank gets compliant, high-quality synthetic data

One of the largest banks in Germany turned to Synthesized to give its engineers and data science teams fast access to synthetic test data. The bank wanted to reduce the preparation time needed to query the data so that it could speed up testing and time to market, and increase accuracy. Synthesized provided non-traceable snapshots of the original datasets, enabling the bank to start data analysis, app migration and testing in the cloud, and experiment with large datasets for new AI/ML use cases and technologies.

Insurance company accelerates product development

Likewise, a leading insurance company wanted to move away from highly manual and resource-intensive data processes to help it remain competitive. Synthesized helped the company generate millions of highly representative test datasets that could be shared safely with third-party vendors for product development. The company was able to accelerate product development, save 200 man-hours per project and drastically reduce its volume of work.

Built with BigQuery

Synthesized extends the functions already available in BigQuery. For example, BigQuery covers masking and data loss prevention for redaction, while Synthesized applies transformations like subsetting and generation. Integrating Synthesized and BigQuery can help organizations to gain fast and secure access to ready-to-query datasets, extracting only the snapshots they need to inform testing or business intelligence. Once the snapshots are ready to be shared safely from a compliance perspective, they can be stored in an organization’s own systems, or shared with third parties for analysis.

Because these snapshots remain in BigQuery, they can be easily used with the full range of Google Data and AI products, including training AI models with BigQuery ML and Vertex AI.

Synthesized has API access to BigQuery, so extracting snapshots and provisioning data is easy and automated. Synthesized also uses a generative model to synthesize data and create balanced datasets from unbalanced datasets, providing the necessary distribution of examples that are ready for sharing. This generative model is stored within the customer’s tenant and can also be shared along with the data.

Here is a simple illustrative example query to generate a fast and compliant snapshot with 1,000 rows from an input table:

SELECT dataset.synthesize(
  'project.dataset.input_table',
  'project.dataset.output_table',
  '{"synthesize": {"num_rows": 1000, "produce_nans": true}}'
);

The Synthesized Scientific Data Kit (SDK) is now available on Google Cloud Marketplace. Learn more by visiting Synthesized.io/bigquery.

The built with BigQuery advantage for ISVs and data providers

Built with BigQuery helps ISVs and data providers build innovative applications with Google’s Data Cloud. Participating companies can:

Accelerate product design and architecture through access to designated experts who can provide insight into key use cases, architectural patterns, and best practices

Amplify success with joint marketing programs to drive awareness, generate demand, and increase adoption

BigQuery gives ISVs the advantage of a powerful, highly scalable unified AI lakehouse that’s integrated with Google Cloud’s open, secure, sustainable platform. Click here to learn more about Built with BigQuery.

Source : Data Analytics Read More

Automated fraud detection with Fivetran and BigQuery

Automated fraud detection with Fivetran and BigQuery

In today’s dynamic landscape, businesses need faster data analysis and predictive insights to identify and address fraudulent transactions. Typically, tackling fraud through the lens of data engineering and machine learning boils down to these key steps:

Data acquisition and ingestion: Establishing pipelines across various disparate sources (file systems, databases, third-party APIs) to ingest and store the training data. This data is rich with meaningful information, fueling the development of fraud-prediction machine learning algorithms.

Data storage and analysis: Utilizing a scalable, reliable and high-performance enterprise cloud data platform to store and analyze the ingested data.

Machine-learning model development: Building training sets from the stored data and running machine learning models on them to build predictive models capable of differentiating fraudulent transactions from legitimate ones.

Common challenges in building data engineering pipelines for fraud detection include:

Scale and complexity: Data ingestion can be a complex endeavor, especially when organizations utilize data from diverse sources. Developing in-house ingestion pipelines can consume substantial data engineering resources (weeks or months), diverting valuable time from core data analysis activities.

Administrative effort and maintenance: Manual data storage and administration, including backup and disaster recovery, data governance and cluster sizing, can significantly impede business agility and delay the generation of valuable data insights.

Steep learning curve/skill requirements: Building a data science team to both create data pipelines and machine learning models can significantly extend the time required to implement and leverage fraud detection solutions.

Addressing these challenges requires a strategic approach focusing on three central themes: time to value, simplicity of design and the ability to scale. These can be addressed by leveraging Fivetran for data acquisition, ingestion and movement, and BigQuery for advanced data analytics and machine learning capabilities.

Streamlining data integration with Fivetran

It’s easy to underestimate the challenge of reliably persisting incremental source system changes to a cloud data platform unless you happen to be living it and dealing with it on a daily basis. In my previous role, I worked with an enterprise financial services firm that was stuck on legacy technology described as “slow and kludgy” by the lead architect. The addition of a new column to their DB2 source triggered a cumbersome process, and it took six months for the change to be reflected in their analytics platform.

This delay significantly hampered the firm’s ability to provide downstream data products with the freshest and most accurate data. Consequently, every alteration in the source’s data structure resulted in time-consuming and disruptive downtime for the analytics process. The data scientists at the firm were stuck wrangling incomplete and outdated information.

In order to build effective fraud detection models, they needed all of their data to be:

Curated, contextual: The data should be personalized and specific to their use case, while being high quality, believable, transparent, and trustworthy.

Accessible and timely: Data needs to be always available and high performance, offering frictionless access with familiar downstream data consumption tools.

The firm chose Fivetran in particular for its automatic and reliable handling of schema evolution and schema drift from multiple sources to their new cloud data platform. With over 450 source connectors, Fivetran allows the creation of datasets from various sources, including databases, applications, files and events.

The choice was game-changing. With Fivetran ensuring a constant flow of high-quality data, the firm’s data scientists could devote their time to rapidly testing and refining their models, closing the gap between insights and action and moving them closer to prevention.

Most importantly for this business, Fivetran automatically and reliably normalized the data and managed changes that were required from any of their on-premises or cloud-based sources as they moved to the new cloud destination. These included:

Schema changes (including schema additions)

Table changes within a schema (table adds, table deletes, etc.)

Column changes within a table (column adds, column deletes, soft deletes, etc.)

Data type transformation and mapping (here’s an example for SQL Server as a source)

The firm’s selection of a dataset for a new connector was a straightforward process of informing Fivetran how they wanted source system changes to be handled — without requiring any coding, configuration, or customization. Fivetran set up and automated this process, enabling the client to determine the frequency of changes moving to their cloud data platform based on specific use case requirements.

Fivetran demonstrated its ability to handle a wide variety of data sources beyond DB2, including other databases and a range of SaaS applications. For large data sources, especially relational databases, Fivetran accommodated significant incremental change volumes. The automation provided by Fivetran allowed the existing data engineering team to scale without the need for additional headcount. The simplicity and ease of use of Fivetran allowed business lines to initiate connector setup with proper governance and security measures in place.

In the context of financial services firms, governance and complete data provenance are critical. The recently released Fivetran Platform Connector addresses these concerns, providing simple, easy and near-instant access to rich metadata associated with each Fivetran connector, destination or even the entire account. The Platform Connector, which incurs zero Fivetran consumption costs, offers end-to-end visibility into metadata (26 tables are automatically created in your cloud data platform – see the ERD here) for the data pipelines, including:

Lineage for both source and destination: schema, table, column

Usage and volumes

Connector types

Logs

Accounts, teams, roles

This enhanced visibility allows financial service firms to better understand their data, fostering trust in their data programs. It serves as a valuable tool for providing governance and data provenance — crucial elements in the context of financial services and their data applications.

BigQuery’s scalable and efficient data warehouse for fraud detection

BigQuery is a serverless and cost-effective data warehouse designed for scalability and efficiency, making it a good fit for enterprise fraud detection. Its serverless architecture minimizes the need for infrastructure setup and ongoing maintenance, allowing data teams to focus on data analysis and fraud mitigation strategies.

Key benefits of BigQuery include:

Faster insights generation: BigQuery’s ability to run ad-hoc queries and experiments without capacity constraints allows for rapid data exploration and quicker identification of fraudulent patterns.

Scalability on demand: BigQuery’s serverless architecture automatically scales up or down based on demand, ensuring that resources are available when needed and avoiding over-provisioning. This removes the need for data teams to manually scale their infrastructure, which can be time-consuming and error-prone. A key differentiator from other modern cloud data warehouses is that BigQuery can scale while queries are running (in-flight).

Data analysis: BigQuery datasets can scale to petabytes, helping to store and analyze financial transaction data at near-limitless scale. This empowers you to uncover hidden patterns and trends within your data for effective fraud detection.

Machine learning: BigQuery ML offers a range of off-the-shelf fraud detection models, from anomaly detection to classification, all implemented through simple SQL queries (see the sketch after this list). This democratizes machine learning and enables rapid model development for your specific needs. The different types of models that BigQuery ML supports are listed here.

Model deployment for inference at scale: While BigQuery supports batch inference, Google Cloud’s Vertex AI can be leveraged for real-time predictions on streaming financial data. Deploy your BigQuery ML models on Vertex AI to gain immediate insights and actionable alerts, safeguarding your business in real time.
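As a rough illustration of the BigQuery ML workflow referenced above, the sketch below trains a logistic-regression fraud classifier and runs batch predictions from Python with the google-cloud-bigquery client. The project, dataset, table, and column names are hypothetical placeholders, and the feature list is illustrative rather than a recommended model design.

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

# Train a classifier with BigQuery ML on a (hypothetical) transactions table
# landed by Fivetran; the label column is assumed to be "is_fraud".
create_model_sql = """
CREATE OR REPLACE MODEL `my-project.fraud.txn_classifier`
OPTIONS (
  model_type = 'LOGISTIC_REG',
  input_label_cols = ['is_fraud']
) AS
SELECT
  amount,
  merchant_category,
  card_present,
  is_fraud
FROM `my-project.fraud.transactions`
"""
client.query(create_model_sql).result()  # blocks until training completes

# Score new transactions in batch with ML.PREDICT.
predict_sql = """
SELECT *
FROM ML.PREDICT(
  MODEL `my-project.fraud.txn_classifier`,
  (SELECT amount, merchant_category, card_present
   FROM `my-project.fraud.new_transactions`))
"""
for row in client.query(predict_sql).result():
    print(row)  # each row includes predicted_is_fraud plus the input columns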

The combination of Fivetran and BigQuery provides a simple design for a complex problem: an effective fraud detection solution capable of real-time, actionable alerts. In the next post in this series, we’ll focus on the hands-on implementation of the Fivetran-BigQuery integration using an actual dataset, and create ML models in BigQuery that can accurately predict fraudulent transactions.

Fivetran is available on Google Cloud Marketplace.

Source : Data Analytics Read More