Building trust in the data with Dataplex

Analytics data is growing exponentially, and so is our dependence on that data for critical business and product decisions. In fact, the best decisions are said to be the ones backed by data. In data, we trust!

But do we trust the data?

As data volumes have grown, one of the key challenges organizations face is how to maintain data quality in a scalable and consistent way across the organization. Data quality is not a new need, but it used to be contained when the data footprint was small and data consumers were few. In that world, consumers knew who the producers were, and producers knew what the consumers needed. Today, data ownership is becoming distributed and data consumption keeps finding new users and use cases, so existing data quality approaches find themselves limited and isolated to certain pockets of the organization. This often exposes data consumers to inconsistent and inaccurate data, which ultimately undermines the decisions made from that data. As a result, organizations today are losing tens of millions of dollars to low-quality data.

These organizations are looking for solutions that empower their data producers to consistently create high-quality data at cloud scale.

Building Trust with Dataplex data quality 

Earlier this year, at Google Cloud, we launched Dataplex, an intelligent data fabric that enables governance and data management across distributed data at scale. One of the key capabilities Dataplex provides out of the box is built-in data quality, so data producers can build trust in their data.

Dataplex data quality task delivers a declarative, DataOps-centric experience for validating data across BigQuery and Google Cloud Storage. Producers can easily build and publish quality reports, or include data validations as part of their data production pipelines. Reports can be aggregated across various data quality dimensions, and execution is entirely serverless.

Dataplex data quality task provides  – 

A declarative approach for defining “what good looks like” that can be managed as part of a CI/CD workflow.  

A serverless and managed execution with no infrastructure to provision. 

Ability to validate across data quality dimensions like  freshness, completeness, accuracy and validity.

Flexibility in execution – either by using Dataplex serverless scheduler (at no extra cost) or executing the data validations as part of a pipeline (e.g. Apache Airflow).

Incremental execution – so you save time and money by validating new data only. 

Secure and performant execution with zero data-copy from BigQuery environments and projects. 

Programmatic consumption of quality metrics for Dataops workflows. 

Users can also execute these checks on data that is stored in BigQuery and Google Cloud Storage but is not yet organized with Dataplex.  For Google Cloud Storage data that is managed by Dataplex, Dataplex auto-detects and auto-creates tables for structured and semi-structured data. These tables can be referenced with the Dataplex data quality task as well. 

Behind the scenes, Dataplex uses an open source data quality engine – Cloud Data Quality Engine – to run these checks. Providing an open platform is one of our key goals, and we have contributed to this engine so that it integrates seamlessly with Dataplex’s metadata and serverless environment.

You can learn more about this in our product documentation.

Building enterprise trust at American Eagle Outfitters 

One of our enterprise customers, American Eagle Outfitters (AEO), is continuing to build trust in their critical data using the Dataplex data quality task. Kanhu Badtia, lead data engineer at AEO, shares their rationale and experience:

“AEO is a leading global specialty retailer offering high-quality & on-trend clothing under its American Eagle® and Aerie® brands. Our company operates stores in the United States, Canada, Mexico, and Hong Kong, and ships to 81 countries worldwide through its websites. 

We are a data-driven organization that utilizes data from physical and digital store fronts, from social media channels, from logistics/delivery partners and many other sources through established compliant processes. We have a team of data scientists and analysts who create models, reports and dashboards that inform responsible business decision-making on such matters as inventory, promotions, new product launches and other internal business reviews. As the data engineering team at AEO, our goal is to provide highly trusted data for our internal data consumers. 

Before Dataplex, AEO had methods for maintaining data quality that were effective for their purpose. However, those methods were not scalable with the continual expansion of data volume and demand for quality results from our data consumers. Internal data consumers identified and reported quality issues where ‘bad data’ was impacting business-critical dashboards/reports. As a result, our teams were often in “fire-fighting” mode – finding & fixing bad data. We were looking for a solution that would standardize and scale data quality across the production data pipelines.

The majority of AEO’s business data is in Google’s BigQuery or in Google Cloud Storage (GCS). When Dataplex launched the data quality capabilities, we immediately started a proof-of-concept. After a careful evaluation, we decided to use it as the central data quality framework for production pipelines. We liked that – 

It provides an easy, declarative (YAML) and flexible way of defining data quality. We were able to parameterize it for use across multiple tables.

It allows validating data in any BigQuery table with a completely serverless, native execution using existing slot reservations.

It allows executing these checks as part of the ETL pipelines using Dataplex Airflow operators. This is a huge win, as pipelines can now pause further processing if critical rules do not pass.

Data quality checks are executed in parallel, which gives us the required execution efficiency in pipelines.

Data quality results are stored centrally in BigQuery and can be queried to identify which rules failed or succeeded and how many rows failed. This enables defining custom thresholds for success.

Organizing data in Dataplex lakes is optional when using Dataplex data quality.

Our team truly believes that data quality is an integral part of any data-driven organization and Dataplex DQ capabilities align perfectly with that fundamental principle. 

For example, here is a sample Google Cloud Composer / Airflow DAG  that loads & validates the “item_master” table and stops downstream processing if the validation fails.

It includes simple rules for uniqueness and completeness, and more complex rules for referential integrity or business rules, such as checking daily price variance. We publish all data quality results centrally to a BigQuery table, such as this:

Sample data quality output table

We query this output table for data quality issues & fail the pipeline in case of critical rule failure. This stops low quality data from flowing downstream. 

We now have a repeatable process for data validation that can be used across the key data production pipelines. It standardizes the data production process and effectively ensures that bad data doesn’t break downstream reports and analytics.”
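
To make this concrete, below is a minimal, illustrative sketch of the kind of Composer/Airflow DAG described in the quote above – not AEO’s actual pipeline. It assumes the apache-airflow-providers-google package, a Dataplex task body prepared for the open source data quality engine, and hypothetical project, lake, dataset, table, and column names.

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryCheckOperator,
    BigQueryInsertJobOperator,
)
from airflow.providers.google.cloud.operators.dataplex import DataplexCreateTaskOperator

PROJECT_ID = "my-project"   # hypothetical
REGION = "us-central1"
LAKE_ID = "my-lake"         # hypothetical Dataplex lake
# Placeholder: a Dataplex task spec pointing at the data quality (CloudDQ) driver
# and the YAML rule file for the item_master table.
DQ_TASK_BODY: dict = {}

with DAG(
    dag_id="item_master_load_and_validate",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Step 1: load/refresh the item_master table (hypothetical stored procedure).
    load_item_master = BigQueryInsertJobOperator(
        task_id="load_item_master",
        configuration={
            "query": {
                "query": "CALL `my-project.staging.load_item_master`()",
                "useLegacySql": False,
            }
        },
    )

    # Step 2: run the Dataplex data quality task against item_master.
    run_data_quality = DataplexCreateTaskOperator(
        task_id="run_data_quality",
        project_id=PROJECT_ID,
        region=REGION,
        lake_id=LAKE_ID,
        dataplex_task_id="item-master-dq-{{ ds_nodash }}",
        body=DQ_TASK_BODY,
    )

    # Step 3: fail the DAG (stopping downstream processing) if any critical rule failed.
    # The results table and its columns are hypothetical; adjust to your published schema.
    check_critical_rules = BigQueryCheckOperator(
        task_id="check_critical_rules",
        sql="""
            SELECT COUNT(*) = 0
            FROM `my-project.dq_results.item_master_summary`
            WHERE rule_severity = 'CRITICAL' AND failed_count > 0
        """,
        use_legacy_sql=False,
    )

    load_item_master >> run_data_quality >> check_critical_rules

Because the quality results land in a regular BigQuery table, the same pattern extends to ad hoc investigation: analysts can query the summary table directly to see which rules failed and how many rows they affected.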

Learn more

Here at Google, we are excited to enable our customers’ journey to high-quality, trusted data. To learn more about our current data quality capabilities, please refer to –

Dataplex Data Quality Overview

Sample Airflow DAG with Dataplex Data Quality task


Built with BigQuery: Retailers drive profitable growth with SoundCommerce

As economic conditions change, retail brands’ reliance on ever-growing customer demand puts these companies at financial and even existential risk. Top-line revenue and active customer growth do not equal profitable growth.

Despite multi-billion dollar valuations for some brands, especially those operating a direct-to-consumer model, the rising costs of meeting shoppers’ high expectations (e.g., free shipping and free returns), along with escalating costs of goods, fulfillment operations, and delivery, create pressure for brands to turn a profit. The only way brands drive profitable growth is by managing variable costs in real time. This, in turn, calls for modern data cloud infrastructure, including Google Cloud services such as BigQuery.

To unlock the value of data, Google Cloud has partnered with SoundCommerce, a retail data and analytics platform that offers a unique way of connecting marketing, merchandising, and operations data and modeling it within a retail context – all so brands can optimize profitability across the business.

Profitability can be measured per order through short-term metrics like contribution profit or long-term metrics like Customer Lifetime Value (CLV).  Often, retailers calculate CLV as a measure of revenue with no consideration for the variable costs of serving that customer, for instance: the costs of marketing, discounting, delivering orders to the doorstep, or post-conversion operational exceptions (e.g. cancellations, returns).

What may first appear to be a high lifetime value customer through revenue-based CLV models, may not be profitable at all. By connecting marketing, merchandising, and operations data together, brands can understand their most profitable programs, channels, and products through the lens of actual customer value – and optimize accordingly. 
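
As a minimal illustration of the difference, the sketch below contrasts a revenue-only CLV with a contribution-profit CLV for one customer. The cost categories and numbers are hypothetical and not SoundCommerce’s model.

from dataclasses import dataclass

@dataclass
class Order:
    revenue: float
    cogs: float          # cost of goods sold
    fulfillment: float   # pick/pack/ship and delivery
    marketing: float     # attributed acquisition/retargeting spend
    returns: float       # refunds and return processing

orders = [
    Order(revenue=120.0, cogs=55.0, fulfillment=12.0, marketing=20.0, returns=0.0),
    Order(revenue=80.0, cogs=35.0, fulfillment=10.0, marketing=5.0, returns=25.0),
]

revenue_clv = sum(o.revenue for o in orders)
contribution_profit_clv = sum(
    o.revenue - o.cogs - o.fulfillment - o.marketing - o.returns for o in orders
)

print(revenue_clv)              # 200.0 – looks like a "high value" customer
print(contribution_profit_clv)  # 38.0 – the profit picture is far less impressive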

The journey for brands starts with the awareness and data enablement of a more complex data set containing all variable revenue, cost, and margin inputs. What does a retailer need to do to achieve this?

All data together in one place

Matched and deduplicated disparate data

Data is organized into entities and concepts that business decision makers can understand

A common definition of key business metrics (this is especially important yet challenging for retailers because systems are siloed by department, and common KPIs like contribution profit per order may be defined differently across a company)

Branched outputs for actionability: BI dashboards vs. data activation to improve marginal profit.

Once brands understand these requirements, up next is execution. This responsibility may fall within a ‘build’ strategy on the shoulders of technical IT/data leadership and their team(s) within a brand. This offers maximum control but at maximum cost and time-to-value. Retail data sources are complicated and change often. Technical teams within brands can spend too much time building and maintaining the tactical data ingestion process, which means they are spending less time deriving business value from the data. 

But it doesn’t have to be this hard. There are other options in the market that brands can consider, such as a tool like SoundCommerce, which provides a library of data connectors, pre-built and customizable data mappings, and outbound data orchestrations, all tailor-made and ready to go for retail brands.

SoundCommerce empowers retail technology leaders to:

Maintain data ownership to allow users to send modeled data to external data warehouses or layer-on business analytics tools for greater flexibility

Provide universal access to data across the organization so every employee can have access to and participate in a low-code or no-code experience

Expand and democratize data exploration and activation among both technical and non-technical users

For retail business and marketing decision-makers, SoundCommerce makes it easy to:

Calculate individual customer lifetime value through the lens of profitability – not just revenue.

Evaluate and predict lifetime shopper performance – identify which programs drive the highest CLV impact

Set CAC and retention cost thresholds – determine optimal Customer Acquisition Costs (CAC) and retention costs that ensure marketing efforts are profitable through the lens of total lifetime transactions

Below is a sample data flow that illustrates how SoundCommerce connects to all the tools a Retailer is using and ingests the first-party data, agnostic of the platform. SoundCommerce then models and transforms the data within the retail context for brands to take immediate action on the insights they gain from the modeled data.

SoundCommerce built on Google Cloud Platform Services

SoundCommerce selected Google Cloud Platform and its services to achieve what it set out to do: drive profitability for retailers and brands. SoundCommerce addresses this very need of retailers to centralize and harmonize data, map it to business users’ needs, infer key metrics and insights by visualizing the data, and reuse the produced datasets for other retail-specific use cases. SoundCommerce built a cloud-native solution on Google Cloud, leveraging the data cloud platform. Data from various sources is ingested in raw format, parsed, and processed as messages in Cloud Storage buckets using Google Kubernetes Engine (GKE), and stored as individual events in Cloud Bigtable. A mapping engine maps the data to proprietary data models in BigQuery, which store the produced datasets. Customers use visualization dashboards in Looker to access the data, exposed as materialized views from within BigQuery. In many cases, these views are directly accessible by the customer for their own use cases.

Power Retail Profitable Growth with Analytics Hub 

SoundCommerce adopted BigQuery and the recent release of Analytics Hub – a data exchange that enables BigQuery users to efficiently and securely share data. This feature ensures a more scalable direct access experience for SoundCommerce’s current and future customers. It meets brands where they are in their data maturity by giving them the keys to own their data and control their analytics-driven business outcomes. With this feature, retailers can customize their analysis with additional data they own and manage. 

“Retail Brands need flexible data models to make key business decisions in real-time to optimize contribution profit and shopper lifetime value,” said SoundCommerce CEO Eric Best. “GCP and BigQuery with Analytics Hub make it easy for SoundCommerce to land and maintain complex data sets, so brands and retailers can drive profitable shopper experiences with every decision.”

SoundCommerce uses Analytics Hub to increase the pace of innovation by sharing datasets with its customers in real time, using BigQuery’s streaming functionality. Customers subscribe to specific datasets through a data exchange as data is generated from external sources and published into BigQuery. This leads to a natural flow of data that scales easily to hundreds of exchanges and thousands of listings. From the customer’s viewpoint, Analytics Hub enables them to search listings and combine data from other software vendors to produce richer insights. All of these benefits come on top of BigQuery features such as separation of compute and storage, a petabyte-scale serverless data warehouse, and tight integration with several Google Cloud products.

The below diagram shows a view of SoundCommerce sharing datasets with one of its customers:

A SoundCommerce GCP project that hosts the BigQuery instance contains one or more source datasets that are composed into a shared dataset for a specific customer. The shared dataset is built primarily from materialized views, but it can include other BigQuery objects such as tables, views, authorized views and datasets, BigQuery ML models, and external tables.

The same SoundCommerce GCP project contains the data exchange, which acts as a container for the shared datasets. The exchange is made private to securely share the curated dataset relevant to the customer. The shared dataset is published into the data exchange as a private listing, and the listing inherits the security permissions configured on the exchange.

SoundCommerce shares a direct link to the exchange with the customer, who can add it as a linked dataset in a project in their own Google Cloud organization. The shareable link can point to a private listing. From there, the dataset is visible in the customer’s project like any other dataset and is immediately available to accept queries and return results. Alternatively, the customer can view the listing in their own project under Analytics Hub and subscribe to it by adding it as a linked dataset.
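
For illustration, once the linked dataset appears in the customer’s project it can be queried like any other BigQuery dataset. The project, dataset, and table names below are hypothetical.

from google.cloud import bigquery

# Runs in the customer's own project, against the dataset linked via Analytics Hub.
client = bigquery.Client(project="customer-project")

query = """
    SELECT order_date, SUM(contribution_profit) AS daily_profit
    FROM `customer-project.soundcommerce_linked.orders`
    GROUP BY order_date
    ORDER BY order_date DESC
    LIMIT 30
"""

for row in client.query(query).result():
    print(row.order_date, row.daily_profit)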

SoundCommerce is incrementally onboarding customers to use Analytics Hub for all data sharing use cases. This enables brands to get business insights faster and gain an understanding of their profitable growth quickly. Plus, it gives them ownership of their own data and lets them manage it how they see fit for their business. From a technical standpoint, adopting Analytics Hub means leveraging an inherent BigQuery capability for data sharing, scaling faster, and reducing the operational overhead of onboarding customers.

The Built with BigQuery advantage for ISVs 

Through Built with BigQuery, launched in April as part of the Google Data Cloud Summit, Google is helping tech companies like SoundCommerce build innovative applications on Google’s data cloud with simplified access to technology, helpful and dedicated engineering support, and joint go-to-market programs. Participating companies can:

Get started fast with a Google-funded, pre-configured sandbox. 

Accelerate product design and architecture through access to designated experts from the ISV Center of Excellence who can provide insight into key use cases, architectural patterns, and best practices. 

Amplify success with joint marketing programs to drive awareness, generate demand, and increase adoption.

BigQuery gives ISVs the advantage of a powerful, highly scalable data warehouse that’s integrated with Google Cloud’s open, secure, sustainable platform. And with a huge partner ecosystem and support for multi-cloud, open source tools, and APIs, Google provides technology companies the portability and extensibility they need to avoid data lock-in. 

Click here to learn more about Built with BigQuery.

We thank the many Google Cloud team members, including Yemi Falokun in Partner Engineering, who collaborated closely with the SoundCommerce team.


How Atlas AI and Google Cloud are promoting a more inclusive economy

What if governments could quantify where improved access to safe drinking water would have the greatest benefit for communities? Or if telecom and energy companies could identify ways to expand infrastructure access and reduce regional wealth disparities?

Satellite imagery and data from other “earth observation” sensors, combined with the analytical power of artificial intelligence (AI), can unlock a detailed and dynamic understanding of development conditions in communities. This helps connect global investment and policy support to the places that need it most. Atlas AI, a public benefit startup, has developed an AI-enabled predictive analytics solution, powered by Google Cloud, that translates satellite imagery and other context-specific datasets into the basis for more inclusive investment in traditionally underserved communities around the world.

The company’s proprietary Atlas of Human Settlements, an interlinked collection of spatial datasets, makes Atlas AI’s platform possible. These datasets measure the changing footprint of communities around the world, from a village in rural Africa to a large European city, alongside a wide range of socioeconomic characteristics. These include the average spending capacity of local households, the extent of electricity access, and the productivity of key crops supplying local diets. Atlas AI, a Google Cloud Ready – Sustainability partner, provides access to software and streamed analytics that enable business, government, and social sector customers to address urgent real-world operating, investment, and policy challenges.

Identifying unmet and emerging demand for consumer-facing businesses

For businesses serving end consumers such as in the telecom, financial services and consumer goods sectors, Atlas AI’s Aperture™ platform can help to identify revenue expansion opportunities with existing customers, forecast future infrastructure demand and promote more inclusive market expansion strategies.  The company’s technology analyzes recent market trends based on socioeconomic forecasts from the Atlas of Human Settlements such as demographic, asset wealth, and consumption growth, alongside an organization’s internal operations and customer data to identify strategies for business expansion, customer growth and sustainable investment.

Cassava Technologies, Africa’s first integrated tech company of continental scale, with a vision of a digitally connected continent that leaves no African behind, became an early adopter of the Aperture platform. Through this initiative, Cassava, aided by the hyperlocal resolution of Atlas AI’s market data and predictive analytics, has identified pockets of high internet usage in the Democratic Republic of Congo. The success of this initiative in just one country reiterates the real-world use of AI and how businesses on the continent can make informed decisions based on available intelligence.

Commenting on this achievement, Hardy Pemhiwa, President and CEO of Cassava Technologies, said, “Cassava’s mission is to use technology as an enabler of social mobility and economic prosperity, transforming the lives of individuals and businesses across the African continent. Deploying billions of dollars into such a rapidly evolving market requires a level of predictive market and business intelligence that has historically been unavailable before Atlas AI.  We see great potential for this technology in promoting expanded global investment into Africa’s development.”

Mapping vulnerable communities to inform climate action

Another promising application of Atlas AI’s technology involves the assessment of how climate change is affecting vulnerable populations.

For example, Aquaya Institute, a nonprofit focused on universal access to safe water, used Atlas AI’s Aperture™ platform – in particular, its datasets on human and economic activity in Ghana – to identify the best areas to expand water piping, based on low existing coverage of piped water networks and the ability of customers in those regions to pay for those services.

Aquaya concluded that the methodology used in its project can be used for a wide range of water and sanitation projects.  Importantly, the authors note that “similar approaches could be extended to other development topics, including health care, climate change, conservation or emergency response.”

To study these impacts, researchers first need data on the sociodemographic characteristics of people living in affected areas – often areas lacking recent, granular statistical data. Stanford professors and Atlas AI co-founders David Lobell, Marshall Burke, and Stefano Ermon originally established through their academic research the efficacy of using satellite imagery and deep learning methods to identify impoverished zones in Africa. The team was able to correlate daytime imagery and brightly lit images of the Earth at night with areas of high economic activity. The researchers then used these visual representations to predict village-level development indicators such as household wealth.

Atlas AI arose from that body of research at Stanford, thanks to a partnership with The Rockefeller Foundation, with the aim of building the Atlas of Human Settlements to global scale and making that data informative and useful for real-world decision making, such as guiding sustainable infrastructure development. One constant from the Stanford research to Atlas AI’s current product development has been Google Earth Engine, which continues to power the company’s ingestion and processing of petabytes of satellite imagery.

Atlas AI and Google Cloud: Promoting sustainable investment globally 

New planetary-scale datasets such as satellite imagery, together with deep learning AI technologies, offer unprecedented capacity to generate economic estimates, guide investment, and improve the operating efficiency of organizations in countries where data is often scarce or out of date. These technologies can be used to improve the targeting of social programs, to route infrastructure to disconnected communities, and to gain a better understanding of how our activities are affecting the planet.

Google Cloud has been working with partners such as Atlas AI to help our mutual customers around the world meet their environmental, social, and governance (ESG) commitments. The Google Cloud Ready – Sustainability program is a recently announced validation program for partners, with the goal of accelerating the development of solutions that can drive ESG transformations.

Atlas AI is among the first Google Cloud partners to receive the Google Cloud Ready – Sustainability designation. Collaborative efforts among solution providers such as Atlas AI and other partners will continue to create innovative, intelligent technologies that can improve the outlook for millions of people around the world.

Learn more about Google Cloud Ready – Sustainability.


Announcing Pub/Sub metrics dashboards for improved observability

Pub/Sub offers a rich set of metrics for resource and usage monitoring. Previously, these metrics were like buried treasure: they were useful to understand Pub/Sub usage, but you had to dig around to find them. Today, we are announcing out-of-the-box Pub/Sub metrics dashboards that are accessible with one click from the Topics and Subscriptions pages in the Google Cloud Console. These dashboards provide more observability in context and help you build better solutions with Pub/Sub.

Check out our new one-click monitoring dashboards

The Overview section of the monitoring dashboard for all the topics in your project.

We added metrics dashboards to monitor the health of all your topics and subscriptions in one place, including dashboards for individual topics and subscriptions. Follow these steps to access the new monitoring dashboards:

To view the monitoring dashboard for all the topics in your project, open the Pub/Sub Topics page and click the Metrics tab. This dashboard has two sections: Overview and Quota.

To view the monitoring dashboard for a single topic, in the Pub/Sub Topics page, click any topic to display the topic detail page, and then click the Metrics tab. This dashboard has up to three sections: Overview, Subscriptions, and Retention (if topic retention is enabled).

To view the monitoring dashboard for all the subscriptions in your project, open the Pub/Sub Subscriptions page and click the Metrics tab. This dashboard has two sections: Overview and Quotas.

To view the monitoring dashboard for a single subscription, in the Pub/Sub Subscriptions page, click any subscription to display the subscription detail page, and then click the Metrics tab. This dashboard has up to four sections: Overview, Health, Retention (if acknowledged message retention is enabled), and either Pull or Push depending on the delivery type of your subscription.

A few highlights

When exploring the new Pub/Sub metrics dashboard, here are a few examples of things you can do. Please note that these dashboards are a work in progress, and we hope to update them based on your feedback. To learn about recent changes, please refer to the Pub/Sub monitoring documentation.

The Overview section of the monitoring dashboard for a single topic.

As you can see, the metrics available in the monitoring dashboard for a single topic are closely related to one another. Roughly speaking, you can obtain Publish throughput in bytes by multiplying Published message count by Average message size. Because a publish request is made up of a batch of messages, dividing Published message count by Publish request count gives you the Average number of messages per batch. Expect a higher number of published messages than publish requests if your publisher client has batching enabled.
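
As a quick worked example of these relationships (with made-up numbers rather than values from a real dashboard):

# Illustrative numbers only – read the real values from your dashboard.
published_message_count = 1_200        # messages per second
average_message_size = 2_500           # bytes
publish_request_count = 150            # requests per second

publish_throughput = published_message_count * average_message_size   # 3,000,000 bytes/s (~3 MB/s)
messages_per_batch = published_message_count / publish_request_count  # 8 messages per request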

Some interesting questions you can answer by looking at the monitoring dashboard for a single topic are: 

Did my message sizes change over time?

Is there a spike in publish requests?

Is my publish throughput in line with my expectations?

Is my batch size appropriate for the latency I want to achieve, given that larger batch sizes increase publish latency?

The Overview section of the monitoring dashboard for a single subscription.

You can find a few powerful composite metrics in the monitoring dashboard for a single subscription: Delivery metrics, Publish to ack delta, and Pull to ack delta. All three aim to give you a sense of how well your subscribers are keeping up with incoming messages. Delivery metrics displays your publish, pull, and acknowledge (ack) rates next to each other. Pull to ack delta and Publish to ack delta let you drill down into any specific bottlenecks. For example, if your subscribers are pulling messages a lot faster than they are acknowledging them, expect the values reported in Pull to ack delta to be mostly above zero. In this scenario, also expect both your Unacked messages by region and your Backlog bytes to grow. To remedy this situation, you can increase your message processing power or set up subscriber flow control.
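
For example, here is a minimal sketch of subscriber flow control with the Python client library; the project and subscription names are placeholders, and the limits should be tuned to your workload.

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

def callback(message):
    # Replace with your real processing logic.
    print(message.data)
    message.ack()

# Cap the number of outstanding (pulled but unacked) messages and bytes so the
# pull rate cannot run far ahead of the ack rate.
flow_control = pubsub_v1.types.FlowControl(max_messages=500, max_bytes=50 * 1024 * 1024)

streaming_pull_future = subscriber.subscribe(
    subscription_path, callback=callback, flow_control=flow_control
)

with subscriber:
    streaming_pull_future.result()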

The Health section of the monitoring dashboard for a single subscription.

Another powerful composite metric available in the monitoring dashboard for a single subscription is the Delivery latency health score in the Health section. You may treat this metric as a one-stop shop to examine the health of your subscription. This metric tracks a total of five properties; each can take a value of zero or one. If your subscribers are not keeping up, zero scores for “ack_latency” and/or “expired_ack_deadlines” effectively tell you that those properties are the reason why. We prescribe how to fix these failing scores in our documentation. If your subscription is run by a managed service like Dataflow, do not be alarmed by a “utilization” score of zero. With Dataflow, the number of streams open to receive messages is optimized, so the recommendation to open more streams does not apply. 

Some questions you can answer by looking at your monitoring dashboard for a single subscription are: 

What is the 99th percentile of my ack latencies? Is the majority of my messages getting acknowledged in under a second, allowing my application to run in near real-time? 

How well are my subscribers keeping up with my publishers? 

Which region has a growing backlog? 

How frequently are my subscribers allowing a message’s ack deadline to expire?

Customize your monitoring experience

Hopefully the existing dashboards are enough to diagnose a problem. But maybe you need something slightly different. If that’s the case, from the dropdown menu, click Save as Custom Dashboard to save an entire dashboard in your list of monitoring dashboards, or click Add to Custom Dashboard in a specific chart to save the chart to a custom dashboard. Then, in the custom dashboard, you can edit any chart configuration or MQL query. 

For example, by default, the Top 5 subscriptions by ack message count chart in the Subscriptions section of the monitoring dashboard for a single topic shows the five attached subscriptions with the highest rate of acked messages. You can modify the dashboard to show the top ten subscriptions instead. To make the change, export the chart, click on the chart, and edit the line of MQL “| top 5, .max()” to “| top 10, .max()”. To learn more about editing in MQL, see Using the Query Editor | Cloud Monitoring.

For a slightly more complex example, you can build a chart that compares current data to past data. For example, consider the Byte Cost chart in the Overview section of the monitoring dashboard for all topics. You can view the chart in Metrics Explorer. In the MQL tab, add the following lines at the end of the provided code snippet:

| {
  add [when: "now"]
  ;
  add [when: "then"] | time_shift 1d
  }
| union

The preceding lines turn the original chart to a comparison chart that compares data at the same time on the previous day. For example, if your Pub/Sub topic consists of application events like requests for cab rides, data from the previous day can be a nice baseline for current data and can help you set the right expectations for your business or application for the current day. If you’d prefer, update the chart type to a Line chart for easier comparison. 

Set alerts

Quota limits can creep up on you when you least expect it. To prevent this, you can set up alerts that will notify you once you hit certain thresholds. The Pub/Sub dashboards have a built-in function to help you set up these alerts. First, access the Quota section in one of the monitoring dashboards for topics or subscriptions. Then, click Create Alert inside the Set Quota Alert card at the top of the dashboard. This will take you to the alert creation form with an MQL query that triggers for any quota metric exceeding 80% capacity (the threshold can be modified).

The Quota section of the monitoring dashboard for all the topics in your project.

In fact, all the provided charts support setting alerting policies. You can set up your alerting policies by first exporting a chart to a custom dashboard and then selecting Convert a provided chart to an alert chart from the dropdown menu.

Convert a provided chart to an alert chart.

For example, you might want to trigger an alert if the pull to ack delta is positive more than 90% of the time during a 12-hour period. This would indicate that your subscription is frequently pulling messages faster than it is acknowledging them. First, export the Pull to Ack Delta chart to a custom dashboard, convert it to an alert chart, and add the following line of code at the end of the provided MQL query:

| condition gt(val(), 0)

Then, click Configure trigger. Set the Alert trigger to Percent of time series violates, the Minimum percent of time series in violation to 90%, and Trigger when condition is met for this amount of time to 12 hours. If the alert is created successfully, you should see a new chart with a red horizontal line representing the threshold and a text bubble that tells you whether there have been any open incidents violating the condition.

You can also add an alert for the Oldest unacked message metric. Pub/Sub lets you set a message retention period on your subscriptions. Aim to keep your oldest unacked messages within the configured subscription retention period, and fire an alert when messages are taking longer than expected to be processed. 

Making metrics dashboards that are easy to use and serve your needs is important to us. We welcome your feedback and suggestions for any of the provided dashboards and charts. You can reach us by clicking the question icon in the top right corner of the Cloud Console and choosing Send feedback. If you really like a chart, please let us know too! We will be delighted to hear from you.


Sign up for the Google Cloud Fly Cup Challenge

Are you ready to take your cloud skills to new heights? We’re excited to announce the Google Cloud Fly Cup Challenge, created in partnership with The Drone Racing League (DRL) and taking place at Next ‘22 to usher in the new era of tech-driven sports. Using DRL race data and Google Cloud analytics tools, developers of any skill level will be able to predict race outcomes and provide tips to DRL pilots to help enhance their season performance. Participants will compete for a chance to win an all-expenses-paid trip to the season finale of the DRL World Championship Race and be crowned the champion on stage.  

How it works: 

Register for Next 2022 and navigate to the Developer Zone challenges to unlock the game

Complete each stage of the challenge to advance and climb the leaderboard

Win prizes, boost skills and have fun!

There will be three stages of the competition, each increasing in difficulty. The first stage kicks off on September 15th, when developers will prepare data and become familiar with the tools for data-driven analysis and prediction using Google ML tools. There are over 500 prizes up for grabs, and all participants will receive an exclusive custom digital badge and an opportunity to be celebrated for their achievements alongside DRL pilots. A single leaderboard will accumulate scores throughout the competition, and prizes will be awarded as each stage is released.

Stage 1: 

DRL Recruit: Starting on September 15th, begin your journey here and get an understanding of DRL data by loading and querying race statistics. You will build simple reports to find top participants and the fastest race times. Once you pass this lab, you will officially be crowned a DRL Recruit and progress for a chance to build on your machine learning skills in two more challenge labs involving predictive ML models.

Prize: The top 25 on the leaderboard will win custom co-branded DRL + Google Cloud merchandise.

Stage 2: 

DRL Pilot: Opening in conjunction with the first day of Next 2022 on October 11, this stage has you develop a model that can predict the winner of a head-to-head competition and a score for each participant, based on a pilot’s profile and flight history. Build a “pilot profile card” that analyzes the number of crashes and lap times and compares them to other pilots. Fill out each pilot’s strengths and weaknesses, compare them to real-life performances, predict the winner of the DRL Race in the Cloud at Next 2022, and be crowned top developer for this stage.

Prize: The first 500 participants to complete stage 2 of the contest will receive codes to download DRL’s Simulator on Steam.

Stage 3: 

DRL Champion: Continue the journey throughout the DRL championship season using the model developed in Stage 2. Use data from past races to score participants and predict outcomes, and provide pilots with real-life tips and tricks to help improve their performance. The developer at the top of the leaderboard at the end of December 2022 will win an expenses-paid VIP trip to DRL’s final race in early 2023.

Prize: Finish in the top 3 for an opportunity to virtually present your tips and tricks to professional DRL Pilots before the end of the 2022-2023 race season

Top the leaderboard as the Grand Champion and win an expenses-paid VIP experience: travel to a DRL Championship Race in early 2023 and be celebrated on stage. For more information on prizes and terms, please visit the DRL and Google Cloud website.

Ready to Fly? 

The Google Cloud Fly Cup Challenge opens today and will remain available on the Next ‘22 portal through December 31, 2022 when the winner will be announced. We are looking forward to seeing how you innovate and build together for the next era of tech-driven sports. Let’s fly!


Introducing seamless database replication to BigQuery and new tiered volume pricing

In today’s competitive environment, organizations need to quickly and easily make decisions based on real-time data. That’s why we’re announcing Datastream for BigQuery, now available in preview, featuring seamless replication from operational database sources such as AlloyDB for PostgreSQL, PostgreSQL, MySQL, and Oracle directly into BigQuery, Google Cloud’s serverless data warehouse. Datastream for BigQuery is Google’s next big step toward realizing our vision for the unified data cloud, combining databases, analytics, and machine learning into a single platform that offers the scale, speed, security, and simplicity that modern businesses need. With a serverless, auto-scaling architecture, Datastream allows you to easily set up an ELT (Extract, Load, Transform) pipeline for low-latency data replication, enabling real-time insights.

Consider the case of a large supermarket chain with hundreds of stores spread across the region. Each store runs its own local point-of-sale and stock management systems, collecting data throughout the day about transactions and stock levels. To provide visibility and help streamline the chain’s daily operations, the IT department set up a nightly batch process to collect and consolidate all of the data from the stores into a central data warehouse, so that reports on the stores’ performance could be ready for review in the morning. Maintaining this process took time and resources from the data engineering team, and as the chain grew and more data needed to be processed, the process ended up taking so long that reports were only ready late in the day. Organizations like this are looking for a modern solution that allows effortless replication of operational data to their data warehouse, enabling real-time decision making; Datastream for BigQuery is that solution.

Datastream accelerates data-driven decision making in BigQuery

Developed in close partnership with Google Cloud’s BigQuery team, Datastream for BigQuery  delivers a unique, truly seamless and easy-to-use experience that enables real-time insights in BigQuery with just a few steps. 

Using BigQuery’s newly developed change data capture (CDC) and Storage Write API UPSERT functionality, Datastream efficiently replicates updates directly from source systems into BigQuery tables in real time. You no longer have to waste valuable resources building and managing complex data pipelines, self-managed staging tables, tricky DML merge logic, or manual conversions from database-specific data types into BigQuery data types. Just configure your source database, connection type, and destination in BigQuery, and you’re all set. Datastream for BigQuery will backfill historical data and continuously replicate new changes as they happen. And as database schemas shift, Datastream seamlessly handles schema changes and automatically adds new tables and columns to BigQuery.

New volume-based tiered pricing

We are also excited to announce the launch of volume-based tiered pricing, which makes Datastream more affordable for customers moving larger volumes of data. Volume-based tiers are applied automatically based on actual usage.

Klook, a leading travel and leisure e-commerce platform for experiences and services, processes vast amounts of data across a range of applications and databases. Using BigQuery, Klook’s data team produces daily reports and analysis for their management team to help drive better business decisions. “Dealing with complex data environments and ingesting data from different sources into our data warehouse is very challenging”, says Stacy Zhu, Senior Manager for Data at Klook. “Prior to adopting Datastream, we had a team of data engineers dedicated to the task of ingesting data into BigQuery, and we spent a lot of time and effort making sure that the data was accurate. With Datastream, our data analysts can have accurate data readily available to them in BigQuery with a simple click. We enjoy Datastream’s ease-of-use, and its performance helps us achieve large scale ELT data processing.”

Achievers, an award-winning employee engagement software and platform, is another customer who recently adopted Datastream. “Achievers had been heavily using Google Cloud VMs (GCE), and Google Kubernetes Engine (GKE)” says Daljeet Saini, Lead Data Architect at Achievers. “With the help of Datastream, Achievers will be streaming our data into BigQuery and enabling our analysts and data scientists to start using BigQuery for smart analytics, helping us take the data warehouse to the next level.”

Start using Datastream today

You can get started today with Datastream, available to all customers in all Google Cloud regions. For more information on Datastream for BigQuery, please check out the product page.


A search recipe for grocers: Ingredients for success in eCommerce

Today, more than half of grocery sales are influenced by digital, but only 4% of those sales are actually completed online, according to Deloitte research [1]. That leaves a tremendous growth opportunity for grocers to gain their share of digital shoppers. Consumers prefer a one-stop-shop experience, as it benefits both their time and wallet, and it offers numerous synergies to the grocer as well. While aggregators present growing competition, Google research shows that the majority of consumers still start with their neighborhood grocer and would select the stores they already visit [2]. Of those digital visits, research shows at least 40% (and rapidly growing) will use search. Pair that with the fact that conversions with search are 50% higher than without [3], and there is ample evidence that prioritizing search is one of the key ingredients for success.

Here we review the overall recipe your onsite search should follow to create a delightful experience for your shopper:

Findable 

According to KISSMetrics [4], 12% of shoppers will bounce to a competitor’s site when they encounter irrelevant results. Precision is critical when the shopper knows what they need – such as specific ingredients for a recipe, a specific product name, or even a stock keeping unit (SKU). The way a shopper describes an item may vary from the way it is labeled, making it all the more important to incorporate smart search technology to bridge the gap. The way queries are entered is also evolving toward more hands-free methods like voice and photo search. Vector-based search technology provides a mechanism to uncover substitute items when a specific item is unavailable, as well as complementary items to complete a meal. Recommenders and merchandising algorithms should be leveraged to showcase a store’s differentiators, such as specialty, house-brand, or unique products related to the shopper’s search intent.

Providing more usability through shopping tools that leverage search can help stores differentiate against aggregators. In addition to the preference for visiting local and familiar stores, research [5] shows that the repurchase rate of items is much higher in grocery than in other ecommerce verticals. Basing results on previous orders, such as “buy it again”, and influencing results based on the characteristics of past orders is what shoppers want. Convenience tools such as grocery list uploads and smart extraction of products from recipes benefit both new and existing shoppers. When shoppers decide to come into stores, using search to help them physically locate items in the store is a tool for winning shoppers with a high sense of urgency.

Discoverable  

40% of shoppers say online is better than in-person at helping them discover new products, according to The Food Industry Association [6]. When shoppers are not quite sure which items they need, what the store carries, or what the items are called (that is, which keywords to use), they lean more on navigation and browsing. This mirrors the behavior shoppers exhibit perusing store aisles. The digital experience can elevate that of the store through dynamically personalized navigation and browsing experiences powered by search. These drive a more empathetic experience that listens and adjusts to the shopper’s intent, such as reshuffling aisles based on the shopper’s diet and ingredient preferences. Introducing recommenders and chatbots into browsing simulates the in-store experience, where shoppers frequently interact with other shoppers and store floor staff to get recommendations. These can be influenced by the individual shopper’s intent signals and those of other, similar shoppers. You also want shoppers to find your products on external search engines; for example, smart enrichment can be used to augment the depth of information on navigation and browse pages for a specific item or category. Vector search and recommenders can also be used to nudge based on store-specific targets, such as relevant house brands or generics as substitutes, based on the shopper’s browse patterns. This is particularly valuable in an era of rising costs and growing numbers of cost-conscious consumers.

Informative 

Provide a rich set of information to help the shopper make a decision, which may include content not only from the manufacturer but also from the community of users. This does more than help the shopper – it helps shoppers find your products organically on search engines. Machine learning models can marry shopper signals with the underlying information provided to shoppers (such as ingredients and recipes) to inform models such as recommenders and personalization, and to generate insights. These signals can arrive from third-party sources such as diet and health/fitness apps, if the shopper gives permission for sharing.

Below are examples of common data points tracked in such apps, including calorie count, carb and fat intake, and diet goals.

[Image: example data points from a nutrition tracking app; source: LiveStrong MyPlate app [7]]

Data points can also be extracted from nutrition and recipe labels using ML-based entity detection and topic extraction. This eliminates the need for traditional data manipulation and enrichment, especially when product attributes and details are sparse. User-generated content such as reviews, posts, and videos can also be a valuable source to supplement the knowledge available about the product.

Pairing signals with entity detection and available product attributes can yield shopper intent. This can be used to drive individual real-time relevance, such as sorting products based on affinity to ingredients:

Shopper Views | Underlying Product Attributes | ML-Derived Shopper Intent
Product | Barilla, Corn Flour, Rice Flour | Gluten Free, Brand = Barilla
Ingredients | Oregano, Olive Oil, Garlic | Mediterranean Diet
Dietary Facts | No Cholesterol, 0 Sat Fat | Heart Healthy
Wild Caught Shrimp | Organic, Sustainable | Green Consumer
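
A minimal sketch of the idea behind this table: count attribute signals across a shopper’s recently viewed items and promote the strongest ones to intent tags. The attribute-to-intent mapping and the threshold here are hypothetical.

from collections import Counter

# Hypothetical mapping from product attributes to higher-level intent tags.
ATTRIBUTE_TO_INTENT = {
    "corn flour": "gluten_free",
    "rice flour": "gluten_free",
    "oregano": "mediterranean_diet",
    "olive oil": "mediterranean_diet",
    "no cholesterol": "heart_healthy",
    "0 sat fat": "heart_healthy",
    "organic": "green_consumer",
    "sustainable": "green_consumer",
}

def derive_intents(viewed_item_attributes, min_signals=2):
    """Return intent tags supported by at least `min_signals` attribute signals."""
    counts = Counter()
    for attributes in viewed_item_attributes:
        for attribute in attributes:
            intent = ATTRIBUTE_TO_INTENT.get(attribute.lower())
            if intent:
                counts[intent] += 1
    return [intent for intent, count in counts.items() if count >= min_signals]

recent_views = [
    ["Corn Flour", "Rice Flour", "Brand = Barilla"],
    ["Oregano", "Olive Oil", "Garlic"],
]
print(derive_intents(recent_views))  # ['gluten_free', 'mediterranean_diet']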

Illuminated 

Search engines can be pre-trained with knowledge graphs, such as a food knowledge graph, so the engine is already aware of the relationships between entities and topics for the given market. These graphs can also be used to generate additional labeling and tagging for product enrichment. Diet plans such as Keto, DASH, Mediterranean, and Vegan can be managed in these graphs for use across the shopper journey. This can also help surface items the shopper may not have considered or been aware of, driving both conversion and revenue. Deep learning technologies such as computer vision can be applied to user-generated images, such as photos of plates, to extract trends that enrich search.

[Image courtesy of ScienceDirect [8]]

Vector search can be applied to present items based on their topical distance. In grocery, this could mean finding alternatives to out-of-stock items based on the query, or complementary items to a query, such as meal or recipe detection. While vector search is primarily driven by keyword input, recommendations are generally based on the underlying result, such as a landing page, a product detail page, or shopper state such as what is in the cart.
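
As an illustration of ranking by topical distance, here is a small sketch using cosine similarity over precomputed item embeddings. The embeddings and catalog are stand-ins: a production system would use learned, high-dimensional embeddings and an approximate nearest-neighbor index.

import numpy as np

# Stand-in 4-dimensional embeddings; real systems use learned, high-dimensional vectors.
catalog = {
    "spaghetti (out of stock)": np.array([0.9, 0.1, 0.0, 0.2]),
    "linguine":                 np.array([0.85, 0.15, 0.05, 0.2]),
    "marinara sauce":           np.array([0.4, 0.7, 0.1, 0.1]),
    "paper towels":             np.array([0.0, 0.05, 0.95, 0.3]),
}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_items(query_item, k=2):
    query_vec = catalog[query_item]
    scores = {
        name: cosine_similarity(query_vec, vec)
        for name, vec in catalog.items()
        if name != query_item
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Substitutes/complements for an out-of-stock item, ranked by topical closeness.
print(nearest_items("spaghetti (out of stock)"))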

Personalized 

Search should align with the shopper based both on their current behavior and on their stated preferences, if available. As discussed in the Informative section, marrying shopper signals to underlying data points such as ingredients, nutrition, and product details provides preference and affinity data to incorporate into how results are shown. Inference models using knowledge graphs (including diet data) and corresponding collaborative shopper behavior can be used to detect a specific diet plan the shopper is following. Individual preferences shared through a profile, such as allergies and dietary restrictions, can also be used to filter results so irrelevant products are not shown. Personalization is shown to have a significant positive impact when applied across the search journey. The diagram below shows some common practices for employing personalization in grocery:

Connected 

Connected experiences between the store and digital touchpoints such as apps, email, the webstore, and campaigns are essential. Research [9] indicates that 61% of shoppers are willing to share personal information in exchange for a more relevant, personalized experience or one that drives convenience. Most touchpoints, such as health and nutrition apps, e-commerce platforms, and customer/marketing platforms, can exchange data with each other or through a central customer repository (such as a customer data platform). This can bring tremendous value to the shopper, as 91% of shoppers are likely to make a repeat purchase if they feel heard [10]. Consider integrating these sources, particularly apps, as a gateway to land new customers and further engage existing ones. Connecting back-office systems, such as supply chain and inventory, is also important to drive more traffic – particularly for impulse or high-urgency buyers who may otherwise go with an aggregator. Maintaining a continuous conversation across sessions and touchpoints, regardless of where the experience starts or ends, is now a necessity.

It’s all in the presentation

Presentation is equally important, whether it is a meal or an experience. While the focus so far has been on the underlying results shown to shoppers, it is important to also focus on how those results are presented. For example, highlighting the ingredients important to the shopper’s decision can dramatically assist decision making. If the shopper named specific ingredients in their search, the results are significantly more usable when those ingredients are highlighted. If the shopper’s preferences or inferred affinity to specific product features are used, those should be highlighted as well. App and page real estate is limited, so ordering results and navigation based on their utility to the user becomes all the more important.

1. Bridging the grocery digital divide – Deloitte Research
2. How omnichannel grocers can win as shopping moves online – Think With Google
3. Four reasons why site search is vital for online retailers
4. What Is Time On Site? Everything You Need to Know – KISSMetrics
5. Grocery and Food Delivery Site UX – Baymard
6. Has online grocery shopping hit its sales ceiling? – Retail Wire
7. Image source: LiveStrong MyPlate app
8. Image courtesy of ScienceDirect
9. Most are concerned about data privacy, but few are willing to change habits – Helpnet Security
10. Data Privacy Versus Personalization: How Do Consumers Really Feel? – AdTaxi


How Google scales ad personalization with Bigtable

Cloud Bigtable is a popular and widely used key-value database available on Google Cloud. The service provides scale elasticity, cost efficiency, excellent performance characteristics, and a 99.999% availability SLA. This has led to massive adoption, with thousands of customers trusting Bigtable to run a variety of their mission-critical workloads.

Bigtable has been in continuous production usage at Google for more than 15 years now. It processes more than 5 billion requests per second at peak and has more than 10 exabytes of data under management. It’s one of the largest semi-structured data storage services at Google. 

One of the key use cases for Bigtable at Google is ad personalization. This post describes the central role that Bigtable plays within ad personalization.

Ad personalization

Ad personalization aims to improve user experience by presenting topical and relevant ad content. For example, I often watch bread-making videos on YouTube. If ad personalization is enabled in my ad settings, my viewing history could indicate to YouTube that I'm interested in baking as a topic and would potentially be interested in ad content related to baking products.

Ad personalization requires large-scale data processing in near real-time for timely personalization with strict controls for user data handling and retention. System availability needs to be high, and serving latencies need to be low due to the narrow window within which decisions need to be made on what ad content to retrieve and serve. Sub-optimal serving decisions (e.g. falling back to generic ad content) could potentially impact user experience. Ad economics requires infrastructure costs to be kept as low as possible.

Google's ad personalization platform provides frameworks to develop and deploy machine learning models for relevance and ranking of ad content. The platform supports both real-time and batch personalization. It is built on Bigtable, allowing Google products to access data sources for ad personalization in a secure manner that is both privacy and policy compliant, all while honoring users' decisions about what data they want to provide to Google.

The output from personalization pipelines, such as advertising profiles, is stored back in Bigtable for further consumption. The ad serving stack retrieves these advertising profiles to drive the next set of ad serving decisions.

Some of the storage requirements of the personalization platform include:

Very high throughput access for batch and near real-time personalization

Low latency (<20 ms at p99) lookup for reads on the critical path for ad serving

Fast (i.e. in the order of seconds) incremental update of advertising models in order to reduce personalization delay

Bigtable 

Bigtable’s versatility in supporting both low-cost, high-throughput access to data for offline personalization as well as consistent low-latency access for online data serving makes it an excellent fit for the ads workloads. 

Personalization at Google-scale requires a very large storage footprint. Bigtable’s scalability, performance consistency and low cost required to meet a given performance curve are key differentiators for these workloads. 

Data model
The personalization platform stores objects in Bigtable as serialized protobufs keyed by Object IDs. Typical data sizes are less than 1 MB, and serving latency is less than 20 ms at p99. 

Data is organized as corpora, which correspond to distinct categories of data. A corpus maps to a replicated Bigtable.

Within a corpus, data is organized as DataTypes, logical groupings of data. Features, embeddings, and different flavors of advertising profiles are stored as DataTypes, which map to Bigtable column families. DataTypes are defined in schemas which describe the proto structure of the data and additional metadata indicating ownership and provenance. SubTypes map to Bigtable columns and are free-form. 

Each row of data is uniquely identified by a RowID, which is based on the Object ID. The personalization API identifies individual values by RowID (row key), DataType (column family), SubType (column part), and Timestamp.
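To make that mapping concrete, here is a minimal sketch using the Cloud Bigtable Python client. The instance, table, family, and qualifier names are hypothetical and this is not the internal platform's code, but the row layout follows the same corpus-to-table, DataType-to-column-family, SubType-to-column structure described above.

from google.cloud import bigtable

# Hypothetical names: a corpus maps to a table, a DataType to a column family,
# and a SubType to a column qualifier; the RowID is derived from the Object ID.
client = bigtable.Client(project="my-project", admin=False)
instance = client.instance("ads-personalization")
table = instance.table("user_profile_corpus")

row_key = b"object-id-12345"
profile_bytes = b"..."  # a serialized protobuf payload (placeholder)

# Write one value addressed by (RowID, DataType, SubType).
row = table.direct_row(row_key)
row.set_cell("embeddings", b"topic_affinity", profile_bytes)
row.commit()

# Read back the most recent cell for the same address.
result = table.read_row(row_key)
cell = result.cells["embeddings"][b"topic_affinity"][0]
print(cell.value, cell.timestamp)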

Consistency
The default consistency mode for operations is eventual. In this mode, data from the Bigtable replica nearest to the user is retrieved, providing the lowest median and tail latency.

Reads and writes to a single Bigtable replica are consistent. If there are multiple replicas of Bigtable in a region, traffic spillover across regions is more likely. To improve the likelihood of read-after-write consistency, the personalization platform uses a notion of row affinity. If there are multiple replicas in a region, one replica is preferentially selected for any given row, based on a hash of the Row ID. 
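As an illustration only (this is not a Bigtable API, just the idea of row affinity described above), the preferred replica for a row can be thought of as a deterministic hash of the Row ID over the replicas in a region:

import hashlib

def preferred_replica(row_id: bytes, regional_replicas: list) -> str:
    """Deterministically pick one replica per row so repeated reads of
    that row land on the same replica within the region."""
    digest = hashlib.sha256(row_id).digest()
    index = int.from_bytes(digest[:8], "big") % len(regional_replicas)
    return regional_replicas[index]

print(preferred_replica(b"object-id-12345", ["us-east1-a", "us-east1-b"]))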

For lookups with stricter consistency requirements, the platform first attempts to read from the nearest replica and requests that Bigtable return the current low watermark (LWM) for each replica. If the nearest replica happens to be the replica where the writes originated, or if the LWMs indicate that replication has caught up to the necessary timestamp, then the service returns a consistent response. If replication has not caught up, then the service issues a second lookup—this one targeted at the Bigtable replica where writes originated. That replica could be distant and the request could be slow. While waiting for a response, the platform may issue failover lookups to other replicas in case replication has caught up at those replicas.

Bigtable replication
The Ads personalization workloads use a Bigtable replication topology with more than 20 replicas, spread across four continents. 

Replication helps address the high availability needs for ad serving. Bigtable’s zonal monthly uptime percentage is in excess of 99.9%, and replication coupled with a multi-cluster routing policy allows for availability in excess of 99.999%.

A globe-spanning topology allows for data placement that is close to users, minimizing serving latencies. However, it also comes with challenges such as variability in network link costs and throughputs. Bigtable uses Minimum Spanning Tree-based routing algorithms and bandwidth-conserving proxy replicas to help reduce network costs. 

For ads personalization, reducing Bigtable replication delay is key to lowering the personalization delay (the time between a user's action and when that action has been incorporated into advertising models to show more relevant ads to the user). Faster replication is preferred, but we also need to balance serving traffic against replication traffic and make sure low-latency user-data serving is not disrupted by incoming or outgoing replication traffic flows. Under the hood, Bigtable implements complex flow control and priority boost mechanisms to manage global traffic flows and to balance serving and replication traffic priorities. 

Workload Isolation
Ad personalization batch workloads are isolated from serving workloads by pinning a given set of workloads onto certain replicas; some Bigtable replicas exclusively drive personalization pipelines while others drive user-data serving. This model allows for a continuous and near real-time feedback loop between serving systems and offline personalization pipelines, while protecting the two workloads from contending with each other.

For Cloud Bigtable users, AppProfiles and cluster-routing policies provide a way to confine and pin workloads to specific replicas to achieve coarse-grained isolation. 
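For example, here is a rough sketch of that pattern with the google-cloud-bigtable admin client; the instance, cluster, and profile names are made up, and you would size and place clusters to match your own serving and batch footprints.

from google.cloud import bigtable
from google.cloud.bigtable import enums

client = bigtable.Client(project="my-project", admin=True)
instance = client.instance("ads-personalization")

# Serving traffic: single-cluster routing pins reads to the serving replica.
serving = instance.app_profile(
    "serving",
    routing_policy_type=enums.RoutingPolicyType.SINGLE,
    cluster_id="serving-cluster-us-east1",
    allow_transactional_writes=False,
)
serving.create(ignore_warnings=True)

# Batch personalization pipelines: pinned to a separate replica.
batch = instance.app_profile(
    "batch-pipelines",
    routing_policy_type=enums.RoutingPolicyType.SINGLE,
    cluster_id="batch-cluster-us-central1",
    allow_transactional_writes=False,
)
batch.create(ignore_warnings=True)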

Data residency
By default, data is replicated to every replica, often spread out globally, which is wasteful for data that is only accessed regionally. Regionalization saves on storage and replication costs by confining data to the region where it is most likely to be accessed. It also helps with compliance with regulations that mandate that data pertaining to certain subjects be physically stored within a given geographical area.

The location of data can be either implicitly determined by the access location of requests or through location metadata and other product signals. Once the location for a user is determined, it is stored in a location metadata table which points to the Bigtable replicas that read requests should be routed to. Migration of data based on row-placement policies happens in the background, without downtime or serving performance regressions.

Conclusion

In this blog post, we looked at how Bigtable is used within Google to support an important use case—modeling user intent for ad personalization. 

Over the past decade, Bigtable has scaled as Google's personalization needs have scaled by orders of magnitude. For large-scale personalization workloads, Bigtable offers low-cost storage with excellent performance characteristics. It seamlessly handles global traffic flows with simple user configurations. Its ability to handle both low-latency serving and high-throughput batch computations makes it an excellent option for lambda-style data processing pipelines.

We continue to drive high levels of investment to further lower costs, improve performance, and bring new features to make Bigtable an even better choice for personalization workloads.

Learn more

To get started with Bigtable, try it out with a Qwiklab and learn more about the product here.

Acknowledgements
We’d like to thank Ashish Awasthi, Ashish Chopra, Jay Wylie, Phaneendhar Vemuru, Bora Beran, Elijah Lawal, Sean Rhee and other Googlers for their valuable feedback and suggestions.

Related Article

Moloco handles 5 million+ ad requests per second with Cloud Bigtable

Moloco uses Cloud Bigtable to build their ad tech platform and process 5+ million ad requests per second.



How to use Google Cloud to find and protect PII

How to use Google Cloud to find and protect PII

BigQuery is a leading data warehouse solution in the market today, valued by customers who need to gather insights and run advanced analytics on their data. Many common BigQuery use cases involve the storage and processing of Personally Identifiable Information (PII), data that needs to be protected within Google Cloud from unauthorized and malicious access.

Too often, the process of finding and identifying PII in BigQuery relies on manual discovery and duplication of that data. One common approach is to extract the columns that contain PII and copy them into a separate table with restricted access. However, creating unnecessary copies of this data and processing it manually to identify PII increases the risk of failure and subsequent security events.

In addition, the security of PII is often mandated by multiple regulations, and failure to apply appropriate safeguards may result in heavy penalties. To address this issue, customers need solutions that 1) identify PII in BigQuery and 2) automatically implement access control on that data to prevent unauthorized access and misuse, all without having to duplicate it.  

This blog discusses a solution developed by Google Professional Services that leverages Google Cloud DLP to inspect and classify sensitive data, and shows how to use those insights to automatically tag and protect data in BigQuery tables. 

BigQuery Auto Tagging solution overview

Automatic DLP can help identify sensitive data, such as PII, in BigQuery. Organizations can leverage Automatic DLP to automatically search across their entire BigQuery data warehouse for tables that contain sensitive data fields and report detailed findings in the console (see Figure 1 below), in Data Studio, and in a structured format (such as a BigQuery results table). Newly created and updated tables are discovered, scanned, and classified automatically in the background, without a user needing to invoke or schedule anything, so you have an ongoing view into your sensitive data.

Figure 1: Visibility of sensitive data fields in BigQuery

In this blog, we show how a new open source solution, the BigQuery Auto Tagging Solution, addresses our second goal: automating access control on data. The solution sits as a layer on top of Automatic DLP and automatically enforces column-level access controls to restrict access to specific sensitive data types, based on user-defined data classification taxonomies (such as high confidentiality or low confidentiality) and domains (such as Marketing, Finance, or an ERP system). This minimizes the risk of unrestricted access to PII and ensures that only one copy of the data is maintained, with appropriate access control applied down to the column level. 

The code for this solution is available on Github at GoogleCloudPlatform/bq-pii-classifier. Please note that while Google does not maintain this code, you can reach out to your Sales Representative to get in contact with our Professional Services team for guidance on how to implement it.  

BigQuery and Data Catalog Policy Tags (now Dataplex) have some limitations that you should be aware of before implementing this solution to ensure that it will work for your organization:

Taxonomies and Policy Tags are not shared across regions: If you have data in multiple regions, you will need to create or replicate your taxonomy in each region in which you want to apply policy tags. 

Maximum of 40 taxonomies per project: If you require different taxonomies for different business domains, or replicate taxonomies to support multiple Cloud regions, those all count against this quota. 

Maximum of 100 policy tags per taxonomy: Cloud DLP supports up to 150 infoTypes for classification; however, a single policy taxonomy can only contain up to 100 policy tags, including any nested categories. If you need to support more than 100 data types, you may need to split them across more than one taxonomy. 
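As a point of reference for these limits, a taxonomy and its policy tags are created per region with the Data Catalog client. The sketch below uses hypothetical names and a simple two-level hierarchy; it is not part of the bq-pii-classifier solution itself.

from google.cloud import datacatalog_v1

client = datacatalog_v1.PolicyTagManagerClient()
parent = "projects/my-project/locations/us"  # taxonomies are regional

taxonomy = client.create_taxonomy(
    parent=parent,
    taxonomy=datacatalog_v1.Taxonomy(
        display_name="marketing-classification",
        activated_policy_types=[
            datacatalog_v1.Taxonomy.PolicyType.FINE_GRAINED_ACCESS_CONTROL
        ],
    ),
)

high = client.create_policy_tag(
    parent=taxonomy.name,
    policy_tag=datacatalog_v1.PolicyTag(display_name="High confidentiality"),
)
client.create_policy_tag(
    parent=taxonomy.name,
    policy_tag=datacatalog_v1.PolicyTag(
        display_name="EMAIL_ADDRESS", parent_policy_tag=high.name
    ),
)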

High-level overview of the solution

Figure 2: High level architecture of solution

The solution is composed mainly of the following components: a Dispatcher Requests topic, a Dispatcher service, a BigQuery Policy Tagger Requests topic, a BigQuery Policy Tagger service, and logging components. 

The Dispatcher service is a Cloud Run service that expects a BigQuery scope expressed as inclusion and exclusion lists of projects, datasets, and tables. The Dispatcher queries Automatic DLP data profiles to check whether the in-scope tables have data profiles generated. For those tables, it publishes one request per table to the "BigQuery Policy Tagger Requests" Pub/Sub topic. This topic enables rate limiting of BigQuery column tagging operations and applies automatic retries with backoff.
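Conceptually, that fan-out looks like the following sketch, which uses the Pub/Sub client with a hypothetical topic name and message shape (the actual solution defines its own request format):

import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "bq-policy-tagger-requests")

# Tables for which Automatic DLP has already generated data profiles.
tables_with_profiles = [
    "my-project.marketing.customers",
    "my-project.finance.invoices",
]

for table in tables_with_profiles:
    payload = json.dumps({"table": table}).encode("utf-8")
    publisher.publish(topic_path, data=payload).result()  # one request per table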

The "BigQuery Policy Tagger" service is also a Cloud Run service; it receives the DLP scan results for a BigQuery table, determines the final InfoType of each column, and applies the appropriate policy tags as defined in the InfoType-to-Policy Tag mapping. Only one InfoType is selected per column, and the service assigns the associated policy tag.
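The end result on a table is equivalent to attaching a policy tag to the flagged column's schema. The sketch below shows that final step with the BigQuery Python client, using hypothetical table, column, and taxonomy names; the solution automates this based on the DLP findings and your InfoType mapping.

from google.cloud import bigquery

client = bigquery.Client(project="my-project")
table = client.get_table("my-project.marketing.customers")

policy_tag = (
    "projects/my-project/locations/us/"
    "taxonomies/1234567890/policyTags/9876543210"
)

new_schema = []
for field in table.schema:
    if field.name == "email":  # column whose InfoType maps to this policy tag
        field = bigquery.SchemaField(
            name=field.name,
            field_type=field.field_type,
            mode=field.mode,
            description=field.description,
            policy_tags=bigquery.PolicyTagList(names=[policy_tag]),
        )
    new_schema.append(field)

table.schema = new_schema
client.update_table(table, ["schema"])  # column-level security now applies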

Lastly, all Cloud Run services maintain structured logs that are exported by a log sink to BigQuery. There are multiple BigQuery views that help with monitoring and debugging Cloud Run call chains and tagging actions on columns.

Deployment options 

Once deployed, the solution can be used in two different ways:

[Option 1] Automatic DLP-triggered immediate tagging:

Figure 3: Deployment option 1 – Automatic DLP triggered tagging and inspection

Automatic DLP is configured to send a Pub/Sub notification on each inspection job completion. The Pub/Sub notification includes the resource name and it triggers the Tagger service directly. 

[Option 2] Scheduled tagging:

Figure 4: Deployment option 2 – Scheduled tagging and inspection

In this scenario, the Dispatcher service is invoked on a schedule with a payload representing a BigQuery scope; it lists the tables inspected by Automatic DLP and creates a tagging request per table. You could use Cloud Scheduler (or any orchestration tool) to invoke the Dispatcher service, as sketched below. If the solution is deployed within a VPC-SC perimeter, use a scheduler that supports VPC-SC (such as Cloud Composer or a custom app).

In addition, more than one Cloud Scheduler job or trigger could be defined to group projects, datasets, and tables that share the same tagging schedule (such as daily or monthly).
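For the scheduled option, a Cloud Scheduler job can post the BigQuery scope to the Dispatcher endpoint. The following sketch assumes hypothetical project, URL, service account, and payload values; check the solution's Terraform and README for the exact request format it expects.

import json
from google.cloud import scheduler_v1

client = scheduler_v1.CloudSchedulerClient()
parent = "projects/my-project/locations/us-central1"

# Hypothetical BigQuery scope payload; the real Dispatcher defines its own schema.
body = json.dumps({
    "projects_include_list": [],
    "datasets_include_list": ["my-project.marketing"],
    "datasets_exclude_list": [],
    "tables_include_list": [],
    "tables_exclude_list": [],
}).encode("utf-8")

job = scheduler_v1.Job(
    name=f"{parent}/jobs/bq-pii-tagging-daily",
    schedule="0 2 * * *",  # every day at 02:00
    time_zone="Etc/UTC",
    http_target=scheduler_v1.HttpTarget(
        uri="https://dispatcher-service-xyz-uc.a.run.app/",
        http_method=scheduler_v1.HttpMethod.POST,
        body=body,
        oidc_token=scheduler_v1.OidcToken(
            service_account_email="scheduler-invoker@my-project.iam.gserviceaccount.com"
        ),
    ),
)
client.create_job(parent=parent, job=job)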

To learn more about Automatic DLP, see our webpage. To learn more about the BQ classifier, see the open source project on Github: GoogleCloudPlatform/bq-pii-classifier and get started today! 

Related Article

Automatic data risk management for BigQuery using DLP

Automatic DLP for BigQuery, a fully managed service that continuously scans your data to give visibility of data risk, is now generally a…



Pro tools for Pros: Industry leading observability capabilities for Dataflow

Pro tools for Pros: Industry leading observability capabilities for Dataflow

Dataflow is the industry-leading unified platform for batch and stream processing. It is a fully managed service with flexible development options (from Flex Templates and Notebooks to the Apache Beam SDKs for Java, Python, and Go) and a rich set of built-in management tools. It integrates seamlessly with Google Cloud products such as Pub/Sub, BigQuery, Vertex AI, Cloud Storage, Spanner, and Bigtable, as well as third-party services and products such as Kafka and AWS S3, to best meet your data movement use cases.

While our customers value these capabilities, they continue to push us to innovate and provide more value as the best batch and streaming data processing service to meet their ever-changing business needs. 

Observability is a key area where the Dataflow team continues to invest more based on customer feedback. Adequate visibility into the state and performance of the Dataflow jobs is essential for business critical production pipelines. 

In this post, we will review Dataflow’s  key observability capabilities:

Job visualizers – job graphs and execution details

New metrics & logs

New troubleshooting tools – error reporting, profiling, insights

New Datadog dashboards & monitors

Dataflow observability at a glance

There is no need to configure or manually set up anything; Dataflow offers observability out of the box within the Google Cloud Console, from the time you deploy your job. Observability capabilities are seamlessly integrated with Google Cloud Monitoring and Logging along with other GCP products. This integration gives you a one-stop shop for observability across multiple GCP products, which you can use to meet your technical challenges and business goals.

Understanding your job’s execution: job visualizers

Questions: What does my pipeline look like? What’s happening in each step? Where’s the time spent?

Solution: Dataflow's Job graph and Execution details tabs answer these questions and help you understand the performance of the various stages and steps within the job.

Job graph illustrates the steps involved in the execution of your job, in the default Graph view. The graph shows how Dataflow has optimized your pipeline's code for execution by fusing (optimizing) steps into stages. The Table view adds detail about each step, its associated fused stages, the time spent in each step, and their statuses as the pipeline executes. Each step in the graph displays more information, such as the input and output collections and output data freshness; these help you analyze the amount of work done at each step (elements processed) and its throughput.

Fig 1. Job graph tab showing the DAG for a job and the key metrics for each stage on the right.

Execution Details has all the information to help you understand and debug the progress of each stage within your job. In the case of streaming jobs, you can view the data freshness of each stage. The Data freshness by stages chart includes anomaly detection: it highlights “potential slowness” and “potential stuckness” to help you narrow down your investigation to a particular stage. Learn more about using the Execution details tab for batch and streaming here.

Fig 2. The execution details tab showing data freshness by stage over time, providing anomaly warnings in data freshness.

Monitor your job with metrics and logs

Questions:  What’s the state and performance of my jobs? Are they healthy? Are there any errors? 

Solution:  Dataflow offers several metrics to help you monitor your jobs. 

A full list of Dataflow job metrics can be found in our metrics reference documentation. In addition to the Dataflow service metrics, you can view worker metrics, such as CPU utilization and memory usage. Lastly, you can generate Apache Beam custom metrics from your code.
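For example, a minimal sketch of custom metrics in the Beam Python SDK might look like this; the namespace and metric names are arbitrary, and Dataflow surfaces these counters and distributions alongside its built-in job metrics:

import json

import apache_beam as beam
from apache_beam.metrics import Metrics


class ParseEvent(beam.DoFn):
    """Parses JSON events and reports custom counters and a distribution."""

    def __init__(self):
        self.parsed_ok = Metrics.counter("parse_event", "parsed_ok")
        self.parse_errors = Metrics.counter("parse_event", "parse_errors")
        self.payload_bytes = Metrics.distribution("parse_event", "payload_bytes")

    def process(self, element):
        self.payload_bytes.update(len(element))
        try:
            record = json.loads(element)
        except ValueError:
            self.parse_errors.inc()
            return
        self.parsed_ok.inc()
        yield record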

Job metrics is the one-stop shop to access the most important metrics for reviewing the performance of a job or troubleshooting a job. Alternatively, you can access this data from the Metrics Explorer to build your own Cloud Monitoring dashboards and alerts. 

Job and worker logs are among the first things to look at when you deploy a pipeline. You can access both of these log types in the Logs panel on the Job details page. 

Job logs include information about startup tasks, fusion operations, autoscaling events, worker allocation, and more. Worker logs include information about work processed by each worker within each step in your pipeline.

You can configure and modify the logging level and route the logs using the guidance provided in our pipeline log documentation. 

Logs are seamlessly integrated into Cloud Logging. You can write Cloud Logging queries, create log-based metrics, and create alerts on these metrics. 
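As a quick example, the sketch below pulls recent error-level entries for a single job with the Cloud Logging Python client; the project and job ID are placeholders, and the same filter can be used in the Logs Explorer or as the basis for a log-based metric.

from google.cloud import logging

client = logging.Client(project="my-project")
log_filter = (
    'resource.type="dataflow_step" '
    'AND resource.labels.job_id="2022-06-01_00_00_00-1234567890123456789" '
    'AND severity>=ERROR'
)

for entry in client.list_entries(
    filter_=log_filter, order_by=logging.DESCENDING, max_results=20
):
    print(entry.timestamp, entry.severity, entry.payload)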

New: Metrics for streaming Jobs

Questions: Is my pipeline slowing down or getting stuck? How is my code impacting the job's performance? How are my sources and sinks performing with respect to my job?

Solution: We have introduced several new metrics for Streaming Engine jobs that help answer these questions, and all of them are instantly accessible from the Job metrics tab.

The engineering teams at the Renault Group have been using Dataflow for their streaming pipelines as a core part of their digital transformation journey.

"Deeper observability of our data pipelines is critical to track our application SLOs," said Elvio Borrelli, Tech Lead – Big Data at the Renault Digital Transformation & Data team. "The new metrics, such as backlog seconds and data freshness by stage, now provide much better visibility into our end-to-end pipeline latencies and areas of bottlenecks. We can now focus more on tuning our pipeline code and data sources for the necessary throughput and lower latency."

To learn more about using these metrics in the Cloud console, please see the Dataflow monitoring interface documentation.

Fig 3. The Job metrics tab showing the autoscaling chart and the various metrics categories for streaming jobs.

To learn how to use these metrics to troubleshoot common symptoms within your jobs, watch this webinar on Dataflow Observability: Dataflow Observability, Monitoring, and Troubleshooting 

Debug job health using Cloud Error Reporting

Problem: There are a couple of errors in my Dataflow job. Is it my code, data, or something else? How frequently are these happening?

Solution: Dataflow offers native integration with Google Cloud Error Reporting to help you identify and manage errors that impact your job’s performance.

In the Logs panel on the Job details page, the Diagnostics tab tracks the most frequently occurring errors. This is integrated with Google Cloud Error Reporting, enabling you to manage errors by creating bugs or work items or by setting up notifications. For certain types of Dataflow errors, Error Reporting provides a link to troubleshooting guides and solutions.

Fig 4. The diagnostics tab in the Log panel displaying top errors and their frequency.

New: Troubleshoot performance bottlenecks using Cloud Profiler

Problem: What part of my code is taking more time to process the data? What operations are consuming more CPU cycles or memory?

Solution: Dataflow offers native integration with Google Cloud Profiler, which lets you profile your jobs to understand the performance bottlenecks using CPU, memory, and I/O operation profiling support.

Is my pipeline's latency high? Is it CPU intensive, or is it spending time waiting for I/O operations? Or is it memory intensive? If so, which operations are driving this up? The flame graph helps you answer these questions. You can enable profiling for your Dataflow jobs by specifying a flag during job creation or while updating your job; a sketch of enabling this option follows below. To learn more, see the Monitor pipeline performance documentation.

Fig 5. The CPU time profiler showing the flame graph for a Dataflow job.
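As a rough sketch, enabling the profiler for a Python pipeline at submission time can look like the following; the project, region, and bucket are placeholders, and the equivalent Java option is --dataflowServiceOptions=enable_google_cloud_profiler.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
    dataflow_service_options=["enable_google_cloud_profiler"],
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Create" >> beam.Create(["hello", "world"])
        | "Upper" >> beam.Map(str.upper)
    )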

New: Optimize your jobs using Dataflow insights

Problem: What can Dataflow tell me about improving my job performance or reducing its costs?

Solution: You can review Dataflow Insights to improve performance or to reduce costs. Insights are enabled by default on your batch and streaming jobs; they are generated by auto-analyzing your jobs’ executions.

Dataflow Insights is powered by Google Cloud's Active Assist Recommender service. It is automatically enabled for all jobs and is available free of charge. Insights include recommendations such as enabling autoscaling, increasing the maximum number of workers, and increasing parallelism. Learn more in the Dataflow Insights documentation.

Fig 6. Dataflow Insights show up on the Jobs overview page next to the active jobs.

New: Datadog Dashboards & Recommended Monitors

Problem: I would like to monitor Dataflow in my existing monitoring tools, such as Datadog.

Solution: Dataflow’s metrics and logs are accessible in observability tools of your choice, via Google Cloud Monitoring and Logging APIs. Customers using Datadog can now leverage the out of the box Dataflow dashboards and recommended monitors to monitor their Dataflow jobs alongside other applications within the Datadog console. Learn more about Dataflow Dashboards and Recommended Monitors in their blog post on how to monitor your Dataflow pipelines with Datadog.

Fig 7. Datadog dashboard monitoring Dataflow jobs across projects

ZoomInfo, a global leader in modern go-to-market software, data, and intelligence, is partnering with Google Cloud to enable customers to easily integrate their business-to-business data into Google BigQuery. Dataflow is a critical piece of this data movement journey.

“We manage several hundreds of concurrent Dataflow jobs,” said Hasmik Sarkezians, ZoomInfo Engineering Fellow. “Datadog’s dashboards and monitors allow us to easily monitor all the jobs at scale in one place. And when we need to dig deeper into a particular job, we leverage the detailed troubleshooting tools in Dataflow such as Execution details, worker logs and job metrics to investigate and resolve the issues.”

What’s Next

Dataflow leads the batch and streaming data processing industry with best-in-class observability experiences. 

But we are just getting started. Over the next several months, we plan to introduce more capabilities such as:

Memory observability to detect and prevent potential out of memory errors.

Metrics for sources & sinks, end-to-end latency, bytes being processed by a PTransform, and more.

More insights – quota, memory usage, worker configurations & sizes.

Pipeline validation before job submission.

Debugging user-code and data issues using data sampling.

Autoscaling observability improvements.

Project-level monitoring, sample dashboards, and recommended alerts.

Got feedback or ideas? Shoot them over, or take this short survey.

Getting Started

To get started with Dataflow, see the Cloud Dataflow quickstarts.

To learn more about Dataflow observability, review these articles:

Using the Dataflow monitoring interface

Building production-ready data pipelines using Dataflow: Monitoring data pipelines

Beam College: Dataflow Monitoring

Beam College: Dataflow Logging 

Beam College: Troubleshooting and debugging Apache Beam and GCP Dataflow
