Accelerate time to value with Google’s Data Cloud for your industry

Many data analytics practitioners want to accelerate new scenarios and use cases that deliver business outcomes and competitive advantage. As enterprises rationalize their data investments and modernize their data analytics platform strategies, migrating to a new cloud-first data platform like Google’s Data Cloud can seem risky and daunting. The transition also carries real expense, from redesigning and remodeling legacy data models in traditional data warehouse platforms to refactoring analytics dashboards and reporting for end users. The time and cost involved are not trivial, and many enterprises are looking for ways to deliver innovation at cloud speed without a traditional replatforming effort that can cost millions. When the future depends on access to all data within the enterprise and beyond, it’s a big problem if you can’t use all of your data for insights at cloud scale because you’re stuck with technologies and approaches that aren’t designed for your unique industry requirements. So, what is out there to address these challenges?

Google’s Data Cloud for industries combines pre-built industry content, ecosystem integrations, and solution frameworks to accelerate your time to value. Google has developed a set of solutions and frameworks to address these issues as part of its latest offering called Google Cloud Cortex Framework, which is part of Google’s Data Cloud. Customers like Camanchaca accelerated build time for analytical models by 6x, and integrated Cortex content for improved supply chain and sustainability insights and saved 12,000 hours deploying 60 data models in less than 6 months. 

Accelerating time to value with Google Cloud Cortex Framework

Cortex Framework provides accelerators to simplify your cloud transition and data analytics journey in your industry. This blog explores the essentials you need to know about Cortex and how you can use its content to rapidly onboard enterprise data from key applications such as SAP and Salesforce, along with data from Google, third-party, public and community data sets. Cortex is available today. It helps enterprises accelerate time to value by providing endorsed connectors delivered by Google and our partners, reference architectures, ready-to-use data models and templates for BigQuery, Vertex AI examples, and an application layer that includes microservices templates for data sharing with BigQuery. Developers can easily deploy, enhance, and adapt these components depending on the scope of their data analytics project or use case. Cortex content helps you get there faster, with less time and complexity to implement. Let’s now explore some details of Cortex and how you can best take advantage of it with Google’s Data Cloud.

First, Cortex is both a framework for data analytics and a set of deployable accelerators. The image below provides an overview of the essentials of Cortex Framework, focusing on the key areas of endorsed connectors, reference architectures, deployment templates, and innovative solution accelerators delivered by Google and our partners. We’ll explore each of these focus areas in greater depth below.

Why Cortex Framework?  

Leading connectors: First, Cortex provides leading connectors delivered by Google and our partners. These connectors have been tested and validated for interoperability with Cortex data models in BigQuery, Google’s cloud-scale enterprise data warehouse. By taking the guesswork out of selecting tooling that integrates with Cortex and BigQuery, we remove the time, effort, and cost of evaluating the various tools available in the market.

Deployment accelerators: Cortex provides a set of predefined deployable templates and content for enterprise use cases with SAP and Salesforce that include BigQuery data models, Looker dashboards, Vertex AI examples, and microservices templates for synchronous and asynchronous data sharing with surrounding applications. These accelerators are available free of charge today via Cortex Foundation and can easily be deployed in hours. The figure below provides an overview of Cortex Foundation and focus areas for templates and content available today:

Reference architectures: Cortex provides reference architectures for integrating with leading enterprise applications such as SAP and Salesforce as well as Google and third-party data sets and data providers. Reference architectures include blueprints for integration and deployment with BigQuery that are based on best practices for integration with Google’s Data Cloud and partner solutions based on real-world deployments. Examples include best practices and reference architectures for CDC (Change Data Capture) processing and BigQuery architecture and deployment best practices. 

The image below shows an example of reference architectures based on Cortex published best practices and options for CDC processing with Salesforce. You can take advantage of reference architectures such as this one today and benefit from these best practices to reduce the time, effort and cost of implementation based on what works and has been successful in real-world customer deployments.

Innovative solutions: Cortex Foundation includes support for various use cases and insights across a variety of data sources. For example, Cortex Demand Sensing is a solution accelerator built on Google Cloud Cortex Framework that delivers accelerated value to Consumer Packaged Goods (CPG) customers who want to bring innovation to their Supply Chain Management and Demand Forecasting processes.

An accurate forecast is critical to reducing costs and maximizing profitability. One gap for many CPG organizations is a near-term forecast that uses all of the available information from internal and external data sources to predict near-term changes in demand. As an enhanced view of demand materializes, CPG companies also need to match demand and supply, identify near-term changes in demand and their root causes, and then shape supply and demand to improve SLAs and increase profitability.

Our approach shown below for Demand Sensing integrates SAP ERP and other data sets (e.g. Weather Trends, Demand Plan, etc) together with our Data Cloud solutions like BigQuery, Vertex AI and Looker to deliver extended insights and value to demand planners to improve the accuracy of demand predictions and help to defer cost and drive new revenue opportunities.

The ecosystem advantage

Building an ecosystem means connections with a diverse set of partners that accelerate your time to value. Google Cloud is excited to announce a range of new partner innovations that bring you more choice and optionality. 

Over 900 partners put trust in BigQuery and Vertex AI to power their business by being part of the “Built with” Google Cloud initiative. These partners build their business on top of our data platform, enabling them to scale at high performance – both their technology and their business.

In addition to this, more than 50 data platform partners offer fully validated integrations through our Google Cloud Ready – BigQuery initiative. 

A look ahead

Our solutions roadmap targets the expansion of Cortex Foundation templates and content to support additional solutions in sales and marketing and supply chain, along with more use cases and models for finance. You will also see significant expansion of predefined BigQuery data models and content for Google Ads, Google Marketing Platform, and other cross-media platforms and applications, improvements to the deployment experience, and new analytical accelerators that span data sets and industries. If you would like to hear more about what we are working on and our roadmap, we’re happy to engage with you! Please feel free to contact us to learn more about the work we are doing and how we might help with your specific use cases or project. We’d love to hear from you!

Ready to start your journey?

With Cortex Framework, you benefit from our open source Data Foundation solutions content and packaged industry solutions content, available on Google Cloud Marketplace and Looker. Cortex content is available free of charge, so you can easily get started on your Google Data Cloud journey today!

Learn more about Google Cloud Cortex Framework and how you can accelerate business outcomes with less risk, complexity and cost. Cortex will help you get there faster with your enterprise data sources and establish a cloud-first data foundation with Google’s Data Cloud. 

Join the Data Cloud Summit to learn how customers like Richemont & Cartier use Cortex Framework to speed up time to value.

Source : Data Analytics Read More

Workload Identity for GKE made easy with open source tools

Google Cloud offers a clever way of allowing Google Kubernetes Engine (GKE) workloads to safely and securely authenticate to Google APIs with minimal credentials exposure. I will illustrate this method using a tool called kaniko.

What is kaniko?

kaniko is an open source tool that allows you to build and push container images from Kubernetes pods when a Docker daemon is not easily accessible and you have no root access to the underlying machine. kaniko executes the build commands entirely in the userspace and has no dependency on the Docker daemon. This makes it a popular tool in continuous integration (CI) pipeline toolkits.

The dilemma

Suppose you want to access some Google Cloud services from your GKE workload, such as a secret from Secret Manager or, in our case, building and pushing a container image to Google’s Container Registry (GCR). This requires authorization through a Google service account (GSA), which is governed by Cloud IAM. A GSA is different from a Kubernetes service account (KSA), which provides an identity for pods and is governed by Kubernetes Role-Based Access Control (RBAC). So how would you give your GKE workloads access to those Google Cloud services in a secure manner?

1: Use the Compute Engine service account

The first option is to leverage the IAM service account used by the node pool(s). By default, this is the Compute Engine default service account. The downside of this method is that the permissions of the service account are shared by all workloads, violating the principle of least privilege. Because of this, it is recommended that you use a custom service account with the least privileged role and opt for a more granular approach when providing access to your workloads.

2: Use service account keys as Kubernetes secrets

The second, more secure option is the tried, tested, and true method of generating service account keys for a GSA with the permissions that you need and mounting them in your pod as a Kubernetes secret. The pod manifest to build and push an image to GCR would look something like the following:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kaniko-k8s-secret
spec:
  containers:
  - name: kaniko
    image:
    args: ["--dockerfile=Dockerfile",
           "--context=gs://${GCS_BUCKET}/path/to/context.tar.gz",
           "--${PROJECT}/${IMAGE_NAME}:${IMAGE_TAG}",
           "--cache=true"]
    volumeMounts:
    - name: kaniko-secret
      mountPath: /secret
    env:
    - name: GOOGLE_APPLICATION_CREDENTIALS
      value: /secret/kaniko-secret.json
  restartPolicy: Never
  volumes:
  - name: kaniko-secret
    secret:
      secretName: kaniko-secret
```

The environment variable GOOGLE_APPLICATION_CREDENTIALS contains the path to a Google Cloud credentials JSON file, which is mounted under /secret inside the pod. It is through this service account key that the Kubernetes pod is able to access the build context files and push the image to GCR.
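The kaniko-secret referenced in the manifest has to exist before the pod starts. Here is a minimal sketch of how it might be created. The service account name kaniko-gsa and the default project ID are hypothetical placeholders, not values from this walkthrough, and the script only prints the commands unless APPLY=1 is set:

```shell
#!/bin/sh
# Sketch only: kaniko-gsa and my-project are hypothetical placeholder names.
set -eu

PROJECT_ID="${PROJECT_ID:-my-project}"   # hypothetical project ID
GSA_NAME="${GSA_NAME:-kaniko-gsa}"       # hypothetical service account name
KEY_FILE="kaniko-secret.json"            # file name the manifest expects under /secret

# Print a command; execute it only when APPLY=1 is set.
run() {
  echo "+ $*"
  if [ "${APPLY:-0}" = "1" ]; then
    "$@"
  fi
}

create_key_secret() {
  # Export a long-lived JSON key for the service account.
  run gcloud iam service-accounts keys create "${KEY_FILE}" \
    --iam-account "${GSA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"
  # Store the key file in a Kubernetes secret named kaniko-secret.
  run kubectl create secret generic kaniko-secret \
    --from-file="${KEY_FILE}=${KEY_FILE}"
}

create_key_secret
```

The key file written by the first command is exactly the kind of long-lived credential that has to be protected and rotated by hand.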

The downside to this method is you have live, non-expiring keys floating around with a constant risk of being leaked, stolen or accidentally committed to a public code repository.

3: Use Workload Identity

The third option uses Workload Identity to provide the link between the GSA and the KSA. This grants the KSA the ability to act as the GSA when interacting with Google Cloud-native services and resources. This method still provides granular access through IAM without requiring any service account keys to be generated, thus closing the gap.


You will need to enable Workload Identity on your GKE cluster as well as configure the metadata server for your node pool(s). You will also need a GSA (I called mine kaniko-wi-gsa) and assign it the proper roles it needs:

```shell
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --role roles/storage.admin \
  --member "serviceAccount:kaniko-wi-gsa@${PROJECT_ID}"
```
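The prerequisites mentioned above (enabling Workload Identity on the cluster, configuring the metadata server on the node pool, and creating the GSA) could be sketched as follows. The cluster and node pool names are placeholders I made up, the cluster location is assumed to come from your gcloud configuration, and the script only prints the commands unless APPLY=1 is set:

```shell
#!/bin/sh
# Sketch only: my-project, my-cluster and default-pool are hypothetical names.
set -eu

PROJECT_ID="${PROJECT_ID:-my-project}"   # hypothetical project ID
CLUSTER="${CLUSTER:-my-cluster}"         # hypothetical cluster name
NODE_POOL="${NODE_POOL:-default-pool}"   # hypothetical node pool name

# Print a command; execute it only when APPLY=1 is set.
run() {
  echo "+ $*"
  if [ "${APPLY:-0}" = "1" ]; then
    "$@"
  fi
}

setup_workload_identity() {
  # Enable Workload Identity on the cluster by assigning its workload pool.
  run gcloud container clusters update "${CLUSTER}" \
    --workload-pool="${PROJECT_ID}.svc.id.goog"
  # Make the node pool expose the GKE metadata server to its pods.
  run gcloud container node-pools update "${NODE_POOL}" \
    --cluster="${CLUSTER}" \
    --workload-metadata=GKE_METADATA
  # Create the Google service account used throughout this walkthrough.
  run gcloud iam service-accounts create kaniko-wi-gsa \
    --project="${PROJECT_ID}"
}

setup_workload_identity
```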

On the Kubernetes side, create a KSA (I called mine kaniko-wi-ksa) and assign it the following binding, which allows it to impersonate the GSA that has the permissions to access the Google Cloud services you need:

```shell
gcloud iam service-accounts add-iam-policy-binding kaniko-wi-gsa@${PROJECT_ID} \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:${PROJECT_ID}[default/kaniko-wi-ksa]"
```

The last thing you need to do is annotate your KSA with the full email of your GSA:

```shell
kubectl annotate serviceaccount kaniko-wi-ksa \
  ${PROJECT_ID}
```
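To make the relationship between the two accounts concrete, here is a minimal sketch of the whole linking step. It assumes the naming patterns used by GKE Workload Identity as I understand them: a GSA email of the form NAME@PROJECT.iam.gserviceaccount.com, a workload pool member of the form PROJECT.svc.id.goog[NAMESPACE/KSA], and the iam.gke.io/gcp-service-account annotation. Treat these as assumptions and verify them against the current GKE documentation. The project ID is a placeholder, and the script only prints the commands unless APPLY=1 is set:

```shell
#!/bin/sh
# Sketch only: identifier formats are assumptions to verify against the GKE docs.
set -eu

PROJECT_ID="${PROJECT_ID:-my-project}"   # hypothetical project ID
NAMESPACE="${NAMESPACE:-default}"
GSA_NAME="kaniko-wi-gsa"
KSA_NAME="kaniko-wi-ksa"

# Full email address of a Google service account: gsa_email NAME PROJECT
gsa_email() {
  printf '%s@%s.iam.gserviceaccount.com' "$1" "$2"
}

# IAM member that represents a KSA: wi_member PROJECT NAMESPACE KSA
wi_member() {
  printf 'serviceAccount:%s.svc.id.goog[%s/%s]' "$1" "$2" "$3"
}

# Print a command; execute it only when APPLY=1 is set.
run() {
  echo "+ $*"
  if [ "${APPLY:-0}" = "1" ]; then
    "$@"
  fi
}

link_accounts() {
  email="$(gsa_email "${GSA_NAME}" "${PROJECT_ID}")"
  member="$(wi_member "${PROJECT_ID}" "${NAMESPACE}" "${KSA_NAME}")"
  # Create the Kubernetes service account the pod will run as.
  run kubectl create serviceaccount "${KSA_NAME}" --namespace "${NAMESPACE}"
  # Allow the KSA to impersonate the GSA.
  run gcloud iam service-accounts add-iam-policy-binding "${email}" \
    --role roles/iam.workloadIdentityUser \
    --member "${member}"
  # Point the KSA at the GSA it should act as.
  run kubectl annotate serviceaccount "${KSA_NAME}" --namespace "${NAMESPACE}" \
    "iam.gke.io/gcp-service-account=${email}"
}

link_accounts
```

Keeping the two string formats in small helper functions makes it easier to spot a mismatch between the namespace or KSA name in the IAM binding and the one the pod actually uses.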

Here is the pod manifest for the same image build job, but using Workload Identity instead:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kaniko-wi
spec:
  containers:
  - name: kaniko
    image:
    args: ["--dockerfile=Dockerfile",
           "--context=gs://${GCS_BUCKET}/path/to/context.tar.gz",
           "--${PROJECT_ID}/${IMAGE_NAME}:${IMAGE_TAG}",
           "--cache=true"]
  restartPolicy: Never
  serviceAccountName: kaniko-wi-ksa
  nodeSelector:
    "true"
```

Although using Workload Identity requires a little more initial setup, you no longer need to generate or rotate any service account keys.

What if you want to access services in another Google Cloud project?

Sometimes you may want to push your images to a central container registry located in a Google Cloud project that is different from the one your GKE cluster is in. Can you still use Workload Identity in this case?

Absolutely! Your GSA and the necessary IAM binding are created in your external Google Cloud project, but you still reference the Workload Identity pool and KSA that your GKE workload is running from.
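A minimal sketch of the cross-project case, assuming two hypothetical project IDs (cluster-project for the GKE cluster, registry-project for the central registry) and the same identifier formats as in the GKE Workload Identity documentation as I understand it. The script only prints the commands unless APPLY=1 is set:

```shell
#!/bin/sh
# Sketch only: cluster-project and registry-project are hypothetical project IDs.
set -eu

CLUSTER_PROJECT_ID="${CLUSTER_PROJECT_ID:-cluster-project}"     # where GKE runs
REGISTRY_PROJECT_ID="${REGISTRY_PROJECT_ID:-registry-project}"  # where images are pushed

# Print a command; execute it only when APPLY=1 is set.
run() {
  echo "+ $*"
  if [ "${APPLY:-0}" = "1" ]; then
    "$@"
  fi
}

cross_project_binding() {
  gsa="kaniko-wi-gsa@${REGISTRY_PROJECT_ID}.iam.gserviceaccount.com"
  # The GSA gets its storage role in the registry project.
  run gcloud projects add-iam-policy-binding "${REGISTRY_PROJECT_ID}" \
    --role roles/storage.admin \
    --member "serviceAccount:${gsa}"
  # The binding lives on the GSA in the registry project, but the member
  # still names the workload pool of the cluster's project.
  run gcloud iam service-accounts add-iam-policy-binding "${gsa}" \
    --project "${REGISTRY_PROJECT_ID}" \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:${CLUSTER_PROJECT_ID}.svc.id.goog[default/kaniko-wi-ksa]"
}

cross_project_binding
```

The only difference from the single-project setup is that the two project IDs no longer coincide: the GSA email carries the registry project, while the workload pool carries the cluster project.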

Now what

By using kaniko, we illustrated Workload Identity and how it allows more secure access when authenticating to Google APIs. Use recommended security practices to harden your GKE cluster and stop using node service accounts or exporting service account keys as Kubernetes secrets.

5 big things you can do at Google Data Cloud & AI Summit this week

Data is at the heart of digital transformation, and organizations are looking for new opportunities to transform customer experiences, boost revenue, and reduce costs. In a new study conducted by Harvard Business Review Analytic Services for Google Cloud, 91% of leaders say that democratizing access to data is imperative to business success, and 76% say democratized access to artificial intelligence (AI) is critical. Recent advancements in AI have companies building their generative AI strategy, including how to build generative AI applications to drive business value, customize foundation models to meet their needs, and help ensure control over their data privacy.

Join us at the Google Data Cloud & AI Summit (Mar 29 Americas, Mar 30 EMEA) to reveal opportunities to transform your business. You’ll gain expert insights and learn about the latest innovations in Google Data Cloud for AI, databases, data analytics, and business intelligence.

Here are five big things you can do at the digital event.

#1: Get inspired with opening keynotes

June Yang and Lisa O’Malley will kick off the summit with a keynote session on what’s new in generative AI from Google Cloud. You will learn about the latest trends in AI and get a closer look at our recent announcements such as Generative AI App Builder and Generative AI support on Vertex AI. The keynote will showcase real-world use cases and you will get to see these new products in action with demos. Gerrit Kazmaier and Andi Gutmans will deliver the second keynote session on the latest innovations in Google Data Cloud and how you can use data and AI to reveal the next era of innovation and efficiency.

#2: Catch the latest announcements

Get the scoop on Google Cloud’s vision for a unified, open, intelligent data cloud. You’ll also be one of the first to learn the newest innovations across products like BigQuery, AlloyDB, Looker, and Vertex AI. 

#3: Go deeper with top experts in data and AI

After the keynotes, drill deep on topics that matter to you with fourteen breakout sessions across two tracks — AI Innovation and Data Essentials. The AI Innovation track will feature topics such as how to build generative AI apps in minutes, build, customize and deploy foundation models in Vertex AI, and activate your data with AI. The Data Essentials track will cover topics such as what’s new in databases, BigQuery, and Looker, and how to bring cross-cloud analytics to your data with a unified analytics lakehouse.  

#4: Get insights from customers, industry experts and partners

You will hear from customers across the globe who are solving complex data challenges with Google Cloud, including Dun & Bradstreet, Orange, ShareChat, Oakbrook Finance, Richemont and CI&T.

Bruno Aziza will also host a Q&A with Baris Gul, Director of Engineering and Warren Qi, Engineering Manager at about how they’re unlocking new potential to accelerate development cycle, reducing time to market from months to weeks. Ritika Gunnar and Azitabh Ajit, Director of Engineering, Data & Tech Platform, ShareChat will share how organizations can build data-rich applications faster and enable innovation. 

Yasmeen Ahmad will discuss with Sanjeev Mohan, Founder at SanjMo & Former VP at Gartner, various cost-saving strategies and how to use the intelligent financial-operation capabilities of Google Data Cloud products to control, plan, forecast, and optimize data analytics and AI costs effectively.

You’ll also learn how you can innovate faster and accelerate digital transformation with solutions from partners such as SAP, Databricks, Tabnine, and MongoDB.

#5: See it in action through demos

Throughout the Summit experience, there’ll be many opportunities for hands-on learning. Jason Davenport and Gabe Weiss will walk through an end-to-end demo of how the latest innovations in the Data Cloud come together to improve the customer experience for an online boutique application, including code examples. This demo will show how these new capabilities are used and provide developers a reference to go play with the demo on their own. 

In addition, you’ll find exciting interactive demos, videos, hands-on labs, and learning paths on the Summit website to build your skills and continue your data and AI journey after attending the sessions. 

We are excited to share what we’ve been working on. Save your spot today by registering on the Data Cloud & AI Summit homepage (Mar 29 Americas, Mar 30 EMEA). We hope to see you there.

Effective strategies to closing the data-value gap

We have seen a phenomenal number of technological advances in big data and cloud, driven by organizations’ need to get value out of data. It’s a given that data drives innovation, but business needs are changing more rapidly than processes can accommodate, resulting in a widening gap between data and value. Luckily, there are proven strategies that organizations can use to close the data-value gap. According to McKinsey, top-performing organizations that see the highest returns from Artificial Intelligence (AI) are more likely to follow best practices for strategy, data, models, tools, technology and talent.

Our new Modern Data Strategy paper focuses on aligning data experiences, data economy and data ecosystems as part of the decision making process so that every organization can maximize ROI from their data and AI investments.

Figure: the three key areas of consideration to close a data-value gap

Data experiences to leave no user behind

Data experiences are key for organizations to not just democratize the reach of data and AI but also enable organizations to get the most out of their people. For example, a leading social networking service allowed a whole new group of less technically inclined users, including data analysts and product managers, to get insights from data using Google Cloud tools such as BigQuery and Looker. Users can access the data they need with self-service, adding agility across the business.

According to IDC, by 2026, 7 PB of data will be generated per second globally. At the same time, only 10% of the data generated each year is original, while the remaining 90% is replicated. This is because the organizational culture isn’t changing together with the new capabilities — this is holding back the outcomes and results that organizations expect. Users don’t have access to useful data experiences that match their maturity level.

To help solve these challenges, the Modern Data Strategy paper suggests different methods you can implement to help build personalized and self-service data experiences, giving everyone a chance to make data-informed decisions.

Data economy to capitalize on the value of data

Even though data platforms have evolved, the organizational model for becoming data driven, making data accessible and using it effectively has not. Organizations and ways of working are trying to keep up with this rapid change to close the data-value gap. Applying a DevOps mentality to data helps close this gap. Managing data as a product with clear ownership and SLAs allows organizations to get more value out of it.

Figure: The data economy potential is still in the early stages

Data consumers at Delivery Hero spent considerable time figuring out where the data was, how to access it, and what quality and policies applied to its usage and sharing. This was due to a fragmented setup where teams were building their own data silos. Delivery Hero developed their new data platform on Google Cloud using data product thinking and settled on a global catalog that documents all available data and its meaning, quality and source. Teams discover, use and build on datasets from other teams with ease and speed, and the time to access data has dropped from 48 hours to having live data available at all times. This has helped teams develop models to deliver growth and value across the business, from route management for drivers and order predictions in logistics to better recommendations and personalization on the website.

Data ecosystem to foster innovation

Closing the data-value gap can have a huge impact on an organization’s competitiveness. According to IDC, by 2025, at least 90% of new enterprise application releases will include embedded AI functionality. From a technology perspective, data platforms support these ambitions already. Choosing the right data ecosystem can improve the scalability, usability and time to insight of your analytical teams.

A modern data ecosystem is key to organizations becoming more effective and efficient. BigQuery is at the heart of Google Cloud’s analytics lakehouse with a strong ecosystem of tools around it. With such an open, unified and intelligent platform, your teams do not need to spend time reinventing the wheel, doing data plumbing, or tuning the underlying infrastructure.

For example, Mercado Libre was able to build a new solution that provided near real-time data monitoring and analytics for their transportation network and enabled data analysts to create, embed, and deliver valuable insights. This solution prevented them from having to maintain multiple tools and featured a more scalable architecture. 

Turn data into a competitive advantage with the right strategies

Data is the heart of digital transformation and offers incredible opportunities for organizations to accelerate the most strategic business outcomes, such as:

increasing revenue by understanding customer preferences and offering personalized experiences,

enhancing workforce productivity by making data-derived insights easily accessible to each worker to foster data-informed decision making.

Primarily, organizations need to be guided by a robust data strategy. This in turn will enable them to get value from data and turn it into competitive advantage. The bottom line is that data-driven companies innovate faster. First, they are able to continuously optimize their operational efficiencies regardless of the size and complexity of their organizations, and as a result they keep costs down. Second, they can adapt quicker as market conditions change. 

Here are four potential ways to drive new value using data: 

Applications: Accelerated product development and shorter time to market

Analytics: Greater organizational and operational efficiency, agility, and pace to execute innovative programs

Visualizations: Increased productivity through intuitive tools that make advanced analytics and AI accessible

Predictions: Creating differentiated solutions driven by Data and AI

Benefits of your data strategy

A data strategy helps you create the necessary alignment across your organization.

These are some examples of activities that your data strategy should drive:

Principles and processes to guide the organization toward faster decision-making and continued alignment with business goals and objectives

Continuous review of recommended systems and tools that align with your strategy’s vision while avoiding a one-size-fits-all approach

Clear and consistent policies and procedures for managing data securely throughout its lifecycle, from creation to disposal

Ensuring that data is used ethically and responsibly, in compliance with relevant laws and regulations

Creating and enabling a culture of data-driven decision-making through the use of accessible, governed data to drive business value

The responsible use of AI across your organization

A framework for building a modern data strategy

Organizations can follow the three pillars below for building a modern data strategy:

Data experiences: Productive user experiences enabling all users to access and create value from relevant data

Establish a data university to advance data literacy for everyone in your organization

Pivot your organization from role-oriented to product-oriented with cross-functional teams

Define data principles for your organization that align with priorities and provide clarity in decision making

Data economy: Principles and practices to ensure that data can be published, discovered, built on, and relied on

Staff a data platform team that is obsessed with the developer and analyst experience

Make sure you get external/enterprise source data into the data economy early, such as customers, transactions, and product data

Find teams that depend on each other for data and help them share their data as a data product

Implement basic data governance practices — stewardship, discovery, quality, reliability, and privacy 

Data ecosystem: A unified, open and intelligent platform with end-to-end data capabilities for all users and needs

Choose a data ecosystem that provides your organization with all the capabilities you need out of the box, with open standards to plug in other components where needed

Make a plan to enable everyone in the organization to apply AI in their work using the capabilities in the ecosystem

Keep your operational work to a minimum by choosing serverless tools for the capabilities in your data ecosystem

Getting started

These are some important questions to ask yourself before designing a modern data strategy for your organization:

Does everyone in the organization have positive experiences working with data?

Do you have dedicated learning paths for different personas in your organization to use data effectively?

Do you have democratized access to data products to build the data economy?

Does it support all your data people, including app developers, business analysts, data scientists and line of business users?

Can you enable the next generation of data driven solutions including Data Science / Machine Learning (ML) at scale?

Are you effectively using data and AI to fast-track your strategic business objectives?

Can your data platform provide the technological ecosystem that enables data processing at scale with modern tools?

Can you share data internally and externally in a governed way?

Read our new paper, Three pillars for building a modern data strategy, to find out how to get started and learn more about how to define a modern data strategy for your business.

It was an honor and a privilege to work on this with Thomas de Lazzari, Sina Nek Akhtar, David Montag, Andreas Ribbrock, and Zara Wells. Thank you for your support, the work you have done, and the discussions.

How SHAREit Group leverages Google’s Data Cloud to maximize the values of DataCake for cross cloud data analytics

Fully unleashing the power of data is an integral part of many companies’ digitalization goals. At SHAREit Group, we rely heavily on data analytics to keep optimizing our services and operations. Over the past years, our mobile app products, notably the file sharing platform SHAREit, which aims to make content equally accessible to everyone, have quickly gained popularity and reached more than 2.4 billion users around the world. We could never have achieved this without the insights we gained from our business data for product improvement and development.

SHAREit Group has adopted a multicloud strategy to deploy our products and run data analytics since our early days because we want to avoid provider lock-in, take advantage of the best-of-breed technologies, and avoid potential system failures in the event that one of our multiple cloud providers encounters technical issues. As an example, we use Google Cloud, because its infrastructure tools like Google Kubernetes Engine (GKE) and Spot VMs help us further lower our computing costs, while the combination of BigQuery and Firebase for data processing speeds up our data-driven decision-making.

To easily build a unified user experience across all our different cloud platforms, we rely on different open source tools. But using multiple public clouds and open source software inevitably complicates the ways we gather and manage our ever-increasing business data. That’s why we need a powerful big data platform that supports highly efficient data analytics, engineering and governance across different cloud platforms and data processing tools.

Current challenges for big data platforms

As the quantity of data continues to increase and the applications of data diversify, the technology for big data platforms has also evolved. However, many obstacles like poor data quality and long data pipelines are still preventing companies from getting the most value out of the data they have. On our journey to build an enterprise-level, multicloud big data platform, we’ve encountered the following challenges:

Long data cycle: A centralized data team can help process data across different company systems in a more organized way, but this centralization also prolongs data cycles. Data specialists might not be familiar with how different domain teams use data, and it requires a lot of back-and-forth communication before raw data are transformed into useful information, which results in low efficiency in decision-making.

Data silos: We use several online analytical processing (OLAP) engines for our different systems. Each OLAP tool has its own metadata, which brings about data silos that prevent cross-database queries and management.

Steep learning curve: To utilize data in different databases, users need a good command of various SQL dialects, which translates into a steep learning curve. On top of that, finding the best pairing of SQL dialect and query engine for a given data workload can be challenging.

High management costs: Enhancing the cost-effectiveness of our cloud infrastructure is one of the main reasons why we adopted a multicloud architecture. However, many cloud-based big data platforms lack a mechanism for using cloud resources cost-efficiently. Management costs could be significantly lower if we were able to avoid wasting CPU and memory.

Low transparency: The information about data assets and costs across different databases is often scattered on big data platforms, which makes it challenging to realize efficient data governance and cost management. We need a one-stop solution to fully eliminate excessive data and computing resources. 

DataCake: A highly efficient, automated one-stop big data platform supported by Google Cloud

To overcome the above-mentioned challenges, SHAREit Group started using DataCake in 2021 to support all our data-driven businesses. DataCake facilitates the implementation of the data mesh architecture, a domain-oriented, self-serve data platform design that enables domain teams to conduct cross-domain data analysis independently. By supporting highly automated, no-code data analytics and development, DataCake lowers the bar for any user who wants to make use of data.

In addition, DataCake is built on multicloud IaaS, which allows us to flexibly leverage leading cloud tools like GKE, the most scalable and automated Kubernetes, and Spot VMs to realize the most cost-effective use of cloud resources. DataCake also supports all types of open source query engines and business intelligence tools, facilitating our wide adoption of open source software.

Key benefits of using DataCake include:

Highly efficient data collaboration: While giving full data sovereignty to each domain team, DataCake offers several features to facilitate data collaboration. First, it provides standard APIs that allow different domain teams to easily share data with one click. Second, LakeCat, a unified metastore service in DataCake, gathers all metadata in one place to simplify management and enable quick metadata queries. Third, DataCake supports queries across 18 commonly used data sources, which enhances the efficiency of data exploration. According to the TPC-DS benchmark, DataCake delivers 4.5X higher performance than open source big data solutions.

Lower infrastructure costs: Leveraging multicloud IaaS means that DataCake gives its users full flexibility to choose the cloud infrastructure tools that are most cost-effective and best meet their needs. DataCake’s Autoscaler feature supports different virtual machine (VM) instance combinations and can help maintain a high usage rate of each instance. By optimizing the ways we use cloud infrastructure, DataCake has helped SHAREit Group lower data computing costs by 50%.

Fewer query failures: Choosing the most suitable query engine for workloads written in different SQL dialects can be a headache for data teams that leverage multiple query engines. At SHAREit Group, we employ not only open source data processing tools like Apache Spark and Apache Flink, but also cloud software including BigQuery. DataCake’s AI model, which is trained with SQL fragments, is able to select the most suitable query engine for a workload based on its SQL features. Overall, DataCake reduces query failures caused by unsuitable engines by more than 70%.

Simplified data analytics and engineering: DataCake makes data analytics and engineering feasible for everyone by adopting serverless PaaS and streamlining SQL use. With serverless PaaS, users can focus on data-related workloads without worrying about cluster management and resource scaling. At the same time, DataCake provides all types of development templates and a smart engine to automate SQL deployment, which allows users to complete the whole data engineering process without using any code.

Comprehensive data governance: On DataCake, users can see all their data assets and billing details in one place, which makes it easy to manage data catalogs and costs. With this high level of transparency, SHAREit Group has successfully saved 40% of storage costs.

How Google Cloud supports DataCake

In early 2022, SHAREit Group started incorporating Google Cloud into our multicloud architecture that underlies DataCake. We made this decision not only because we wanted to increase the diversity of our cloud infrastructure, but also because Google Cloud offers opportunities to maximize the benefits of using DataCake by further lowering costs and facilitating data analytics. Leveraging Google Cloud to support DataCake has given us the following advantages:

Lower computing costs: Google Cloud Spot VMs offer some of the best price-performance on the market, and DataCake’s Autoscaler feature makes the most of this advantage by predicting the health status of Spot VMs to reduce the probability of them being reclaimed and disrupting computing. On top of that, DataCake built an optimized offline computing mechanism that uses persistent volume claims to avoid redoing computation. All in all, we’ve reduced the execution time of the same computing task by 20% to 40% and lowered computing costs by 30% to 50%.

Lower cluster management costs: Google Cloud is highly compatible with open source tools and can help realize cost-effective open source cluster management. With the autoscaling feature of GKE, our clusters of Apache Spark, Apache Flink and Trino are automatically scaled up or down according to current needs, which helps us save 40% of cluster management costs.

More cost-effective queries: We use BigQuery as part of the PaaS layer supporting DataCake. Compared to other cloud warehouse tools, BigQuery offers more flexible pricing schemes that allow us to greatly reduce our data processing costs. BigQuery’s query saving and sharing feature also facilitates collaboration between different departments, while its ability to process several terabytes of data in only a few seconds speeds up our data processing.

By merging Google Cloud and DataCake, we’re able to take advantage of the powerful infrastructure of Google Cloud to fully benefit from DataCake’s features. Now, we can conduct data analytics and engineering in the most cost-effective way possible and have more resources to invest in product development and innovation.

Continue democratizing data

SHAREit Group is happy to be part of the journey for data democratization and automation. With the help of Funtech and Google Cloud, SHAREit will continue to innovate with better data analytics capabilities and we’ll keep finding new ways to strengthen the edges of our big data platform on DataCake by leveraging the cutting-edge technologies of public cloud platforms like Google Cloud. 

Special thanks to Quan Ding, Data Analyst from SHAREit, for contributing to this post.

Meet our Data Champions: Jan Riehle, at the intersection of beauty and data with Beauty for All (B4A)

Editor’s note: This blog is part of a series called Meet the Google Cloud Data Champions, a series celebrating the people behind data- and AI-driven transformations. Each blog features a champion’s career journey, lessons learned, advice they would give other leaders, and more. This story features Jan Riehle, Principal at Brazilian investment firm Rising Venture and founder and CEO of a Brazilian company that runs a beauty e-commerce platform, B4A.

Tell us about yourself. Where did you grow up? What did your journey into tech look like? 

I grew up in Karlsruhe, a town in southern Germany. Directly after high school, in the early 2000s, I opened my first tech company, an agency creating web technology for small and medium-sized businesses.

Several years later, I relocated to Switzerland, Singapore, and France, and I took roles in several tech companies while also collecting experiences in M&A and private equity. After pursuing an MBA at INSEAD, I relocated to Brazil in 2011, where I co-founded and ran various technology ventures between 2011 and 2015. 

In 2017, I started a search fund that acquired two businesses in the “beauty-tech” space in Sao Paulo, Brazil’s commercial capital. These two businesses were the beginning of what today is B4A (“Beauty for all”), a platform that creates an ecosystem, connecting and mutually benefiting consumers, beauty brands, and digital beauty influencers. 

What’s the coolest thing you and/or your team have accomplished by leveraging our Data Cloud solutions?

Our main objective is to provide value for our ecosystem participants, which are consumers, beauty brands and digital influencers. All of our technology efforts and success are made to serve that purpose and create value for these three groups.

The coolest thing we achieved by using Google’s Data Cloud solutions was to dramatically shorten load times for our consumer-facing ecommerce platform, B4A Commerce Connect. The performance gains are visible when you load heavy collections (like the “Para Ele” Collection on the Glambox website, for example). For such large collections, we were able to reduce load times by about 90% by implementing AlloyDB for PostgreSQL.

Our platform combines data about customer characteristics with machine-learning algorithms so that a user of our website only sees products that make sense for their individual profile. This raises a challenge because every load in our ecommerce platform requires more computing power than in a standard ecommerce platform, where such optimizations are not needed. Therefore, using the right tools and optimizing performance becomes absolutely crucial to provide smooth, fluid performance and a solid user experience. The user experience benefited dramatically from implementing AlloyDB. You can learn more about our journey in this blog.

Technology is one part of data-driven transformation. People and processes are others. How were you able to bring the three together? Were there adoption challenges within the organization, and if so, how did you overcome them?

B4A is a beauty company with technology in its DNA (we also call it a “beauty-tech”). We always strive to use the best technologies for the benefit of our ecosystem participants. Implementing solutions from Google Cloud was very beneficial to our processes and did not result in any additional challenges or resistance from the team. We actually had the opposite reaction, to be honest: infrastructure requirements and maintenance efforts decreased by more than 50%, which our IT Operations team very much welcomed. At the same time, and as I mentioned before, our customers also benefited from integrating Google Cloud, and specifically AlloyDB, creating a win-win situation for the organization.

What advice would you give people who want to start data initiatives in their company? 

The starting point is the most fundamental moment. It will determine the success of your implementation. After all, a small difference in your steering angle at the start will make a big impact at the end. It’s not easy to set a course when you don’t know where you want to go. So, even if you are far away, you need to have a clear vision about where you want to go and a roadmap of how to get there.

With that in mind, I recommend preparing a data organization framework that will be able to support your plan. Even if you start small, you’ll need to set aside time to document, and cross-functionally review what you’ve envisioned. 

Before you jump into action and develop a technology project, try to have a clear blueprint about your data structure. Map out all the data you need to track and try to predict what you will need in the future. The better you plan in the beginning, the better your end result will be. 

What’s an important lesson you learned along the way to becoming more data-driven? Were there challenges you had to overcome?

I think in terms of data, we are different from most organizations. Our relationship with data has always been pronounced, and even our organizational structures are designed to use data in the best possible way, with data-focused squads accompanying many organizational processes. We already have a five-year track-record of extensive data usage at B4A. In the end, we need data for our main products to work, and we also sell it in an aggregate form to beauty brands.  

The first important learning I already provided in my previous answer: above all, it is important to define data structures and oversee company-wide processes well before starting to implement an actual database. In this step, it is extremely important that business and tech teams work hand-in-hand. The business side needs to have a certain degree of technical thinking in order to make this integration work in a productive way. Overall, the better you plan in the beginning, the better your end result will be.

The second learning is more subtle and it’s something I only realized recently, after years of using data for everything at B4A. The learning is that looking at the data’s current trajectory, rather than envisioning its potential trajectory, can actually limit an organization.

Data-driven organizations can train themselves to assume past data will evolve in a linear fashion. The problem with this approach, when you are in technology, is that you often work on potentially disruptive products. Instead of just looking at problems in a linear fashion, you should also think of the potential exponential curves that may develop once your features or products achieve a sufficient degree of product-market fit. 

This different behavior is often not considered when making projections using classical regressions or linear projections on top of data. Therefore, I sometimes provoke the team to look at the data from a different angle—an angle of where we want to go and how strong the disruptive effect of a new feature potentially could be in the market. In the end, the organization needs to find a way to combine both approaches and balance one with the other. 
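To illustrate the point about linear projections, here is a minimal Python sketch. It assumes a hypothetical product whose user base grows 50% month over month, fits a straight line to the first six months with ordinary least squares, and compares the line's month-12 forecast with the compounding value. The numbers are invented for illustration and are not B4A data.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical history: 100 users at month 0, growing 50% per month.
months = list(range(6))
users = [100 * 1.5 ** m for m in months]

slope, intercept = linear_fit(months, users)

horizon = 12
linear_forecast = slope * horizon + intercept   # straight-line projection
compounding_value = 100 * 1.5 ** horizon        # value if the growth rate holds

print(f"linear forecast at month {horizon}: {linear_forecast:,.0f}")
print(f"compounding value at month {horizon}: {compounding_value:,.0f}")
```

Under these assumptions the straight line falls far short of the compounding curve, which is the gap a team can miss when it only extends the current trajectory.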

Another best practice is to avoid letting the organization become obsessed with certain key metrics, or KPIs. When looking at metrics, never forget that they often reflect simplifications of reality. The context in the real world can be more complex, integrating many more variables that should be considered to get a full picture of a situation. Applying common sense should always be more important than trying to judge a situation only using one or several metrics. 

Thinking ahead 5-10 years, what possibilities with data and/or AI are you most excited about?

I think we are only at the very beginning of AI having an impact on productivity. Looking at five or even ten-year periods, it is difficult to predict the amount of disruption that will happen from AI. It’s already starting with generative AI, and I think over time, the entire Software-as-a-Service industry will end up being disrupted by more “Model-as-a-Service” oriented companies. Instead of using a software product, you may be able to just ask the model to provide whatever you need at a specific moment in a very customized way.

At B4A, we always strive to be at the top of new developments. We consider how we can implement them for not only the interest of our ecosystem participants, but also the interest of our company’s efficiency. Again, looking at the next ten years, I think the disruption from data and AI will be immense, larger than anything we have seen over the last 50 years.

Want to learn more about the latest innovations in Google’s Data Cloud across databases, data analytics, business intelligence, and AI? Join us at the Google Data Cloud & AI Summit to gain expert insights and data strategies to drive transformation in your organization.

Document AI introduces powerful new Custom Document Classifier to automate document processing

Businesses rely on an inflow of documents to drive processes and make decisions. As documents flow into a business, many are not classified by type, which makes it difficult for businesses to manage at scale. 

At Google Cloud, we’re committed to solving these challenges with continued investment in our state-of-the-art machine learning product for document processing and insights: Document AI Workbench, which helps users quickly build models with world-class accuracy trained for their specific use cases. In February 2023, we launched the Custom Document Extractor (CDE) in GA to help users extract structured data from documents in production use cases. Today, we’re announcing the newest model type to help users automate document processing, Custom Document Classifier (CDC). With CDC, users can train highly accurate machine learning models to automatically classify document types.

CDC provides tangible business value to customers. For example, businesses can validate if users submit the right documents within an application, lowering review time and cost. In addition, accurate classification enables businesses to better automate downstream processes. This includes selecting the proper storage, analysis, or processing steps.  

In this blog post, we’ll give an overview of the Custom Document Classifier and ways customers are already benefiting from it. 

Benefits of classification models with Document AI Workbench 

Our customers use Document AI Workbench to save time and money, building models with state-of-the-art accuracy in a fraction of the time that traditional development methods require. CDC in turn helps businesses achieve higher automation rates to scale processes while lowering costs.

Chris Jangareddy, managing director for Artificial Intelligence & Data at Deloitte Consulting LLP said, “Google Cloud Document AI is a leading document processing solution packed with rich features like multi-step classify and text extraction to automated sorting, classification, extraction, and quality assurance. By combining Document AI with Workbench, Google Cloud has created a forward-thinking and powerful AI platform for intelligent document processing that will allow for process transformation at an enterprise scale with predictable outcomes that can benefit businesses.”

Rajnish Palande, VP, Google Business Unit for BFSI, TCS said, “Document AI Workbench leverages artificial intelligence to manage and glean insights from unstructured data. Workbench brings together the power of classification, auto-annotation, page number identification, and multi-language support to help organizations rapidly deliver enhanced accuracy, improved operational efficiency, higher confidence in the information extract, and increased return on investment.” 

Sean Earley, VP of Delivery Services of Zencore said, “Document AI Workbench allows us to develop highly accurate document parsing models in a matter of days. Our customers have automated tasks that formerly required significant human labor. For example, using Document AI Workbench, a team of two trained a model to split, classify, and extract data from 15 document types to automate Home Mortgage Disclosure Act reporting. The mean trained model accuracy was 94%, drastically reducing the operational cost of our customer’s compliance reporting procedures.”

How to use Custom Document Classifier

Users can leverage a simple interface in the Google Cloud Console to prepare training data, create and evaluate models, and deploy a model into production, at which point it can be called to classify document types. You can follow the documentation for instructions on how to create, train, evaluate, deploy, and run predictions with models.

Import and prepare training data

To get started, users import and label documents to train an ML model. Users can label documents in bulk at import to build the training and test datasets needed to build a model accurate enough for production workloads in hours. If documents are already labeled using other tools, users can simply import labels with JSON in the Document format. Users can initiate training with a click of a button. Once the user has trained a model, they can auto-label documents to build a more robust training dataset to improve model performance.

Evaluate a model and iterate

Once a model is trained, it’s time to evaluate it by looking at performance metrics such as F1 score, precision, and recall. Users can dive into specific instances where the model made an incorrect prediction, then provide additional training data to improve future performance.
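To make these metrics concrete, here is a minimal Python sketch that computes per-class precision, recall, and F1 from true labels and predictions. The document types and predictions are hypothetical sample data, and this is not how Document AI Workbench computes its evaluation; it only illustrates the standard definitions for a single-label classifier.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for a single-label classifier."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # the predicted label was wrong
            fn[truth] += 1  # the true label was missed
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


# Hypothetical test-set labels and model predictions.
y_true = ["invoice", "invoice", "receipt", "receipt", "w2", "w2"]
y_pred = ["invoice", "receipt", "receipt", "receipt", "w2", "invoice"]

results = per_class_metrics(y_true, y_pred)
for label, (p, r, f1) in results.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

A class with high recall but lower precision (here, "receipt") is one the model over-predicts, which suggests adding training examples of the classes it is being confused with.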

Going into production

Once a model meets accuracy targets, it’s time to deploy into production, after which the model endpoint can be called to classify document types. 

Getting started with Document AI Workbench 

Custom Document Classifier is generally available (GA) and ready to help customers automate document classification. Learn more via our Document AI Workbench web page, Document AI Workbench documentation or try it out in the Google Cloud Console.

Acknowledgements: Lukas Rutishauser, Software Engineering Manager; Michael Kwong, Software Engineering Manager; Rajagopal Janani, Software Engineering Manager; Michael Lanning, UX Designer; Shagun Lal, Product Marketing Manager; Tomas Moreno, Outbound Product Manager; Holt Skinner, Developer Advocate.

Pub/Sub schema evolution is now GA

Pub/Sub schemas are designed to allow safe, structured communication between publishers and subscribers. In particular, the use of schemas provides the guarantee that any message published adheres to a schema and encoding, which the subscriber can rely on when reading the data.

Schemas tend to evolve over time. For example, a retailer is capturing web events and sending them to Pub/Sub for downstream analytics with BigQuery. The schema now includes additional fields that need to be propagated through Pub/Sub. Up until now Pub/Sub has not allowed the schema associated with a topic to be altered. Instead, customers had to create new topics. That limitation changes today as the Pub/Sub team is excited to introduce schema evolution, designed to allow the safe and convenient update of schemas with zero downtime for publishers or subscribers.

Schema revisions

A new revision of a schema can now be created by updating an existing schema. Most often, schema updates only include adding or removing optional fields, which is considered a compatible change.

All the revisions of the schema are available on the schema details page. You can delete one or multiple revisions from a schema; however, you cannot delete a revision if it is the schema’s only revision. You can also quickly compare two revisions by using the view diff functionality.

Topic changes

Currently, you can attach an existing schema to a topic, or create a new one, so that Pub/Sub validates every message published to the topic against that schema. With schema evolution, you can now update a topic to specify a range of schema revisions against which Pub/Sub will try to validate messages, starting with the last revision and working towards the first revision. If the first revision is not specified, any revision <= the last revision is allowed, and if the last revision is not specified, any revision >= the first revision is allowed.
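The range semantics can be sketched in a few lines of Python. This is a toy model, not the Pub/Sub implementation: each schema revision is reduced to a set of required and optional field names, and the revision IDs (`rev-a`, `rev-b`, `rev-c`) and helper names are hypothetical. It only illustrates how an open-ended range and newest-first validation interact.

```python
def candidate_revisions(revision_ids, first=None, last=None):
    """Return the revision IDs a topic would try, newest first.

    revision_ids is ordered oldest to newest. Omitting `first` leaves the
    range open at the old end; omitting `last` leaves it open at the new end.
    """
    start = revision_ids.index(first) if first is not None else 0
    end = revision_ids.index(last) if last is not None else len(revision_ids) - 1
    if start > end:
        raise ValueError("first revision must not be newer than last revision")
    return list(reversed(revision_ids[start:end + 1]))


def validate(message, schemas, revision_ids, first=None, last=None):
    """Return the ID of the first revision that accepts the message, else None."""
    for rev in candidate_revisions(revision_ids, first, last):
        required, optional = schemas[rev]
        keys = set(message)
        # Accept when all required fields are present and nothing is unknown.
        if required <= keys <= required | optional:
            return rev
    return None


# Hypothetical revisions, oldest to newest: (required fields, optional fields).
revision_ids = ["rev-a", "rev-b", "rev-c"]
schemas = {
    "rev-a": ({"event_id"}, set()),
    "rev-b": ({"event_id"}, {"referrer"}),
    "rev-c": ({"event_id"}, {"referrer", "campaign"}),
}

print(candidate_revisions(revision_ids, last="rev-b"))
print(validate({"event_id": 1, "referrer": "ads"}, schemas, revision_ids, last="rev-b"))
print(validate({"event_id": 1, "campaign": "x"}, schemas, revision_ids, first="rev-a", last="rev-b"))
```

In this model, a message that uses a field introduced in `rev-c` is rejected while the topic's range stops at `rev-b`, and it would be accepted once the range is widened to include the newer revision.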

Schema evolution example

Let’s take a look at a typical way schema evolution may be used. You have a topic T that has a schema S associated with it. Publishers publish to the topic and subscribers subscribe to a subscription on the topic.

Now you wish to add a new field to the schema and you want publishers to start including that field in messages. As the topic and schema owner, you may not necessarily have control over updates to all of the subscribers nor the schedule on which they get updated. You may also not be able to update all of your publishers simultaneously to publish messages with the new schema. You want to update the schema and allow publishers and subscribers to be updated at their own pace to take advantage of the new field. With schema evolution, you can perform the following steps to ensure a zero-downtime update to add the new field:

1. Create a new schema revision that adds the field.

2. Ensure the new revision is included in the range of revisions accepted by the topic.

3. Update publishers to publish with the new schema revision.

4. Update subscribers to accept messages with the new schema revision.

Steps 3 and 4 can be interchanged since all schema updates ensure backwards and forwards compatibility. Once your migration to the new schema revision is complete, you may choose to update the topic to exclude the original revision, ensuring that publishers only use the new schema.

These steps work for both protocol buffer and Avro schemas. However, some extra care needs to be taken when using Avro schemas. Your subscriber likely has a version of the schema compiled into it (the “reader” schema), but messages must be parsed with the schema that was used to encode them (the “writer” schema). Avro defines the rules for translating from the writer schema to the reader schema. Pub/Sub only allows schema revisions where both the new schema and the old schema could be used as the reader or writer schema. However, you may still need to fetch the writer schema from Pub/Sub using the attributes passed in to identify the schema and then parse using both the reader and writer schema. Our documentation provides examples on the best way to do this.

BigQuery subscriptions

Pub/Sub schema evolution is also powerful when combined with BigQuery subscriptions, which allow you to write messages published to Pub/Sub directly to BigQuery. When using the topic schema to write data, Pub/Sub ensures that at least one of the revisions associated with the topic is compatible with the BigQuery table. If you want to update your messages to add a new field that should be written to BigQuery, you should do the following:

1. Add the OPTIONAL field to the BigQuery table schema.

2. Add the field to your Pub/Sub schema.

3. Ensure the new revision is included in the range of revisions accepted by the topic.

4. Start publishing messages with the new schema revision.

With these simple steps, you can evolve the data written to BigQuery as your needs change.

Quotas and limits

The schema evolution feature comes with the following limits:

A maximum of 20 revisions per schema name is allowed at any time.

Each individual schema revision does not count against the maximum 10,000 schemas per project.

Additional resources

Please check out the following additional resources to explore this feature further:

Client libraries

Solving for what’s next in Data and AI at this year’s Gartner Data & Analytics Summit

The largest gathering of data and analytics leaders in North America is happening March 20 – 22nd in Orlando, Florida. Over 4,000 attendees will join in person to learn and network with peers at the 2023 Gartner® Data & Analytics Summit. This year’s conference is expected to be bigger than ever, as is Google Cloud’s presence!

We simply can’t wait to share the lessons we’ve learned from customers, partners and analysts! We expect that many of you will want to talk about data governance, analytics, AI, BI, data management, data products, data fabrics and everything in between!

We’re going big!

That’s why we’ve prepared a program that is bound to create opportunities for you to learn and network with the industry’s best data innovators. Our presence at this event is focused on creating meaningful connections for you with the many customers and partners who make the Google Cloud Data community so great.

We’ll kick off with a session featuring Equifax’s Chief Product & Data Analytics Officer, Bryson Koehler, and Google Cloud’s Ritika Gunnar. Bryson will share how Equifax drove data transformation with the Equifax Cloud™. That session is on Monday, 3/20 at 4pm. After you attend it, you will realize why Bryson’s team earned the Google Cloud Customer of the Year award twice!

That night, from 7:30PM on, we will host a social gathering so you can meet with Googlers, SAP leaders and our common customers at the “Nothing But Net Value with Google Cloud and SAP” event.

On Tuesday, you’ll have at least 4 opportunities to catch me and the rest of the team:

At 10:35am, Starburst’s Head of Product, Vishal Singh & I will cover how companies can turn Data Into Value with Data Products. We’ll discuss the maturity phases organizations graduate through and will even give you a live demo!

At 11:00am, Dataiku’s Field CDO, Conor Jensen & I will join Johnson & Johnson’s Anuli Anyanwu-Ofili and Ameritas’ CDAO Lorenzo Ball to talk about Everyday AI and we will share the rules that leaders follow to succeed with Data & AI.

At 12:25pm, our panel of experts, LiveRamp’s Kannan D.R & Quantum Metric’s Russell Efird will join Google Cloud’s Stuart Moncada to discuss how companies can build intelligent data applications and how our “Built with BigQuery” program can help your team do the same.

That night, from 7PM on, I will be speaking at the CDO Club Networking event hosted by Team8, Dremio, and Manta.  Register here to attend! 

But wait, there is more!  

On Wednesday, our community will continue to feature great customer success stories and I’ll be there to support them.  

At 11:30am, Tamr’s Head of Corporate Strategy Matthew Holzapfel & I will join P360’s CEO Anupam Nandwana and Thermo Fisher Scientific’s VP of IT, Abhijeet Bhandare, to discuss what the move to data products means from a business perspective and how quality, trusted records underpin a Data Product strategy.

At 2:30pm, Nexla’s CEO and co-founder Saket Saurabh & I will join Johnson & Johnson’s Vasanth Thirugnanam & Stagwell’s CTO Mansoor Basha to discuss the Data Fabric architecture best practices innovators use to create secure and trusted Data Products.

And if all of this is not enough, you will find some of our partners present inside the Google Cloud booth (#434). LiveRamp, Neo4j, Nexla, Quantum Metric, and Striim have all prepared innovative lightning talks that are bound to make you want to ask questions.

There are over 900 software companies that have built data products on our platform, and while we don’t have 900 sessions at the event (we tried!), you can stop by our booth to inquire about the recent integrations we announced with Collibra, Elastic, MongoDB, Palantir, ServiceNow, Sisu, Reltio and more!

I can’t wait to see all of you in person and our team looks forward to hearing how we can help you and your company succeed with data.

Top 5 Gartner sessions

Beyond the above, there are of course many Gartner sessions that you should put on your schedule. In my opinion, there are at least 5 you can’t afford to miss:

Financial Governance and Recession Proofing Your Data Strategy with Adam Ronthal

What You Can’t Ignore in Machine Learning with Svetlana Sicular. I still remember attending her first session on this topic years ago — it’s always full of great stats and customer stories.

Ten Great Examples of Analytics in Action with Gareth Herschel. If you’re looking for case studies in success, sign up for this one!

Ask the Experts series, particularly the one on Cost Optimization with Allison Adams

Data Team Organizations and Efficiencies with Jorgen Heizenberg, Jim Hare and Debra Logan

I hope you’ve found this post useful.  If there is anything we can do to help, stop by the Google Data Cloud booth (#434).

GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.

BigQuery under the hood: Behind the serverless storage and query optimizations that supercharge performance

Customers love the way BigQuery makes it easy for them to do hard things — from BigQuery Machine Learning (BQML) SQL turning data analysts into data scientists, to rich text analytics using the SEARCH function that unlocks ad-hoc text searches on unstructured data. A key reason for BigQuery’s ease of use is its underlying serverless architecture, which supercharges your analytical queries while making them run faster over time, all without changing a single line of SQL. 

In this blog, we lift the curtain and share the magic behind BigQuery’s serverless architecture, such as storage and query optimizations as well as ecosystem improvements, and how they enable customers to work without limits in BigQuery to run their data analytics, data engineering and data science workloads.

Storage optimization

Improve query performance with adaptive storage file sizing 

BigQuery stores table data in a columnar file store called Capacitor. These Capacitor files initially had a fixed file size, on the order of hundreds of megabytes, to support BigQuery customers’ large data sets. The larger file sizes enabled fast and efficient querying of petabyte-scale data by reducing the number of files a query had to scan. But as customers moving from traditional data warehouses started bringing in smaller data sets — on the order of gigabytes and terabytes — the default “big” file sizes were no longer the optimal form factor for these smaller tables. Recognizing that the solution would need to scale for users with big and smaller query workloads, the BigQuery team came up with the concept of adaptive file sizing for Capacitor files to improve small query performance.

The BigQuery team developed an adaptive algorithm to dynamically assign the appropriate file size, ranging from tens to hundreds of megabytes, to new tables being created in BigQuery storage. For existing tables, the team added a background process that gradually migrates “fixed” file size tables to the more performance-efficient adaptive tables. Today, the background Capacitor process continues to track the growth of all tables and dynamically resizes them to ensure optimal performance.
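To make the idea concrete, here is a minimal sketch of size-aware file sizing. It is not BigQuery's actual algorithm, which is not published in this post: the function name, the target of 64 files per table, and the 16 MB and 512 MB bounds are all assumptions chosen only to stay within the "tens to hundreds of megabytes" range described above.

```python
MB = 1024 ** 2
GB = 1024 ** 3

def target_file_size(table_bytes: int,
                     min_size: int = 16 * MB,
                     max_size: int = 512 * MB,
                     target_files: int = 64) -> int:
    """Pick a file size that splits the table into roughly `target_files`
    files, clamped to [min_size, max_size]. All constants are illustrative."""
    ideal = table_bytes // target_files
    return max(min_size, min(max_size, ideal))

# A small table gets small files; a very large table is capped at max_size.
for table_bytes in (1 * GB, 8 * GB, 1024 * 1024 * GB):
    print(table_bytes // GB, "GB ->", target_file_size(table_bytes) // MB, "MB files")
```

The clamp is the point of the sketch: small tables avoid a single oversized file, while very large tables avoid an explosion in file count.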

“We have seen a greater than 90% reduction in the number of analytic queries in production that take more than one minute to run.” – Emily Pearson, Associate Director, Data Access and Visualization Platforms, Wayfair

Big metadata for performance boost

Reading from and writing to BigQuery tables maintained in storage files would quickly become inefficient if workloads had to scan all the files for every table. BigQuery, like most large data processing systems, has developed a rich store of information on the file contents, which is stored in the header of each Capacitor file. This information about data, called metadata, allows query planning, streaming and batch ingest, transaction processing and other read-write processes in BigQuery to quickly identify the relevant files within storage on which to perform the necessary operations, without wasting time reading non-relevant data files.

But while reading metadata for small tables is relatively simple and fast, large (petabyte-scale) fact tables can generate millions of metadata entries. For queries on such tables to return results quickly, the query optimizer needs a highly performant metadata storage system.

Based on the concepts proposed in their 2021 VLDB paper, “Big Metadata: When Metadata is BigData,” the BigQuery team developed a distributed metadata system, called CMETA, that features fine-grained column and block-level metadata, is capable of supporting very large tables, and is organized and accessible as a system table. When the query optimizer receives a query, it rewrites the query to apply a semi-join (WHERE EXISTS or WHERE IN) with the CMETA system tables. By adding the metadata lookup to the query predicate, the query optimizer dramatically increases the efficiency of the query.

In addition to managing metadata for BigQuery’s Capacitor-based storage, CMETA also extends to external tables through BigLake, improving the performance of lookups of large numbers of Hive partitioned tables.

The results shared in the VLDB paper demonstrate that query runtimes are accelerated by 5× to 10× for queries on tables ranging from 100GB to 10TB using the CMETA metadata system.

The three Cs of optimizing storage data: compact, coalesce, cluster

BigQuery has a built-in storage optimizer that continuously analyzes and optimizes data stored in storage files within Capacitor using various techniques:

Compact and Coalesce: BigQuery supports fast INSERTs using SQL or API interfaces. When data is initially inserted into tables, depending on the size of the inserts, there may be too many small files created. The Storage Optimizer merges many of these individual files into one, allowing efficient reading of table data without increasing the metadata overhead.

The files used to store table data over time may not be optimally sized. The storage optimizer analyzes this data and rewrites the files into the right-sized files so that queries can scan the appropriate number of these files, and retrieve data most efficiently. Why is the right size important? If the files are too big, then there’s overhead in eliminating unwanted rows from the larger files. If the files are too small, there’s overhead in reading and managing the metadata for the larger number of small files being read.

Cluster: Tables with user-defined column sort orders are called clustered tables; when you cluster a table using multiple columns, the column order determines which columns take precedence when BigQuery sorts and groups the data into storage blocks. BigQuery clustering accelerates queries that filter or aggregate by the clustered columns by only scanning the relevant files and blocks based on the clustered columns rather than the entire table or table partition. As data changes within the clustered table, the BigQuery storage optimizer automatically performs reclustering to ensure that the cluster definition is continuously updated, ensuring consistent query performance.
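The techniques above can be sketched in a few lines. This is an illustration under simplifying assumptions, not the storage optimizer itself: `coalesce`, `cluster`, and `blocks_to_scan` are hypothetical helpers, files are represented only by their sizes, and blocks hold a fixed number of rows.

```python
def coalesce(file_sizes, target):
    """Greedily merge consecutive small files until adding the next one
    would push the merged file past `target` bytes."""
    merged, current = [], 0
    for size in file_sizes:
        if current and current + size > target:
            merged.append(current)
            current = 0
        current += size
    if current:
        merged.append(current)
    return merged

def cluster(rows, columns, block_size):
    """Sort rows by the clustering columns (earlier columns take precedence)
    and cut the sorted rows into fixed-size blocks."""
    ordered = sorted(rows, key=lambda r: tuple(r[c] for c in columns))
    return [ordered[i:i + block_size] for i in range(0, len(ordered), block_size)]

def blocks_to_scan(blocks, column, value):
    """Indexes of blocks whose min/max range for `column` can contain `value`."""
    return [i for i, block in enumerate(blocks)
            if min(r[column] for r in block) <= value <= max(r[column] for r in block)]

print(coalesce([10, 20, 30, 40, 50, 5], target=60))

rows = [
    {"country": "IN", "day": 3}, {"country": "US", "day": 1},
    {"country": "AU", "day": 2}, {"country": "IN", "day": 1},
    {"country": "UK", "day": 5}, {"country": "AU", "day": 1},
    {"country": "US", "day": 4}, {"country": "IN", "day": 2},
]
blocks = cluster(rows, ["country", "day"], block_size=2)
print(blocks_to_scan(blocks, "country", "IN"))
```

Because the rows are sorted by `country` first, a filter on `country` touches only a contiguous run of blocks. A filter on `day` alone would prune far less, which is why the column order in a cluster definition matters.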

Query optimization

Join skew processing to reduce delays in analyzing skewed data

When a query begins execution in BigQuery, the query optimizer converts the query into a graph of execution, broken down into stages, each of which has steps. BigQuery uses dynamic query execution, which means the execution plan can evolve dynamically to adapt to different data sizes and key distributions, ensuring fast query response time and efficient resource allocation. When querying large fact tables, there is a strong likelihood that data may be skewed, meaning data is distributed asymmetrically over certain key values. Thus, a query on a skewed fact table is likely to return far more records for the skewed key values than for the others. When the query engine distributes the work of querying skewed tables, certain workers may take longer to complete their tasks because there are excess rows for certain key values, creating uneven wait times across the workers.

Figure: More worker capacity is allocated to the left or the right side of the join, depending on where the data skew is detected (the left side has data skew in task 2; the right side has data skew in task 1).

Let’s consider data that can show skew in its distribution. Cricket is an international team sport. However, it is only popular in certain countries around the world. If we were to maintain a list of cricket fans by country, the data will show that it is skewed to fans from full Member countries of the International Cricket Council and is not equally distributed across all countries.

Traditional databases have tried to handle this by maintaining data distribution statistics. However, in modern data warehouses, data distribution can change rapidly and data analysts can drive increasingly complex queries, rendering these statistics obsolete and thus less useful. Depending on the tables and join columns being queried, the skew may be in the column referenced on the left side of the join or the right side.

The BigQuery team addressed data skew by developing join skew processing techniques that detect data skew and allocate work proportionally, so that more workers are assigned to process the join over the skewed data. While processing joins, the query engine keeps monitoring join inputs for skewed data. If a skew is detected, the query engine changes the plan to process the joins over the skewed data. The query engine will further split the skewed data, creating equal distribution of processing across skewed and non-skewed data. This ensures that at execution time, the workers processing data from the table with data skew are proportionally allocated according to the detected skew. This allows all workers to complete their tasks at about the same time, thereby accelerating query runtime by eliminating delays caused by waits due to skewed data.
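The proportional allocation idea can be sketched with the cricket example. This is a simplified illustration, not the query engine's implementation: the real engine detects skew while the join is running, whereas this sketch counts keys up front. The helper names and the row counts are made up.

```python
from collections import Counter

def allocate_workers(keys, total_workers):
    """Give each join key a share of workers proportional to its row count,
    with at least one worker per key. Because of rounding, the shares are
    only approximately equal to `total_workers` in general."""
    counts = Counter(keys)
    total = sum(counts.values())
    return {k: max(1, round(total_workers * c / total)) for k, c in counts.items()}

def rows_per_worker(keys, allocation):
    """Rows each worker handles once a key's rows are split across its workers."""
    counts = Counter(keys)
    return {k: counts[k] / allocation[k] for k in counts}

# Hypothetical fan table: one country dominates the join key.
fans = ["IN"] * 80 + ["AU"] * 10 + ["US"] * 10

allocation = allocate_workers(fans, total_workers=10)
print(allocation)
print(rows_per_worker(fans, allocation))
```

With one worker per key, the worker handling "IN" would process eight times as many rows as the others. After the split, every worker handles the same number of rows, so none of them holds up the stage.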

“The ease to adopt BigQuery in the automation of data processing was an eye-opener. We don’t have to optimize queries ourselves. Instead, we can write programs that generate the queries, load them into BigQuery, and seconds later get the result.” – Peter De Jaeger, Chief Information Officer, AZ Delta

Dynamic concurrency with queuing

BigQuery’s documentation on Quotas and limits for Query jobs states “Your project can run up to 100 concurrent interactive queries.” BigQuery used the default setting of 100 for concurrency because it met requirements for 99.8% of customer workloads. Since it was a soft limit, the administrator could always increase this limit through a request process to increase the maximum concurrency. To support the ever-expanding range of workloads, such as data engineering, complex analysis, Spark and AI/ML processing, the BigQuery team developed dynamic concurrency with query queues to remove all practical limits on concurrency and eliminate the administrative burden. Dynamic concurrency with query queues is achieved with the following features:

Dynamic maximum concurrency setting: Customers start receiving the benefits of dynamic concurrency by default when they set the target concurrency to zero. BigQuery will automatically set and manage the concurrency based on reservation size and usage patterns. Experienced administrators who need the manual override option can specify the target concurrency limit, which replaces the dynamic concurrency setting. Note that the target concurrency limit is a function of available slots in the reservation and the admin-specified limit can’t exceed that. For on-demand workloads, this limit is computed dynamically and is not configurable by administrators.

Queuing for queries over concurrency limits: BigQuery now supports Query Queues to handle overflow scenarios when peak workloads generate a burst of queries that exceed the maximum concurrency limit. With Query Queues enabled, BigQuery can queue up to 1000 interactive queries so that they get scheduled for execution rather than being terminated due to concurrency limits, as they were previously. Now, users no longer have to scan for idle time periods or periods of low usage to optimize when to submit their workload requests. BigQuery automatically runs their requests or schedules them on a queue to run as soon as current running workloads have completed. You can learn about Query Queues here.
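The queuing behavior described above can be modeled with a toy scheduler. This is a sketch, not BigQuery's scheduler: `QueryScheduler` is a hypothetical class, first-in-first-out ordering is an assumption, and so is the "REJECTED" outcome for submissions beyond the queue limit, which this post does not describe.

```python
from collections import deque

class QueryScheduler:
    """Runs up to `max_concurrency` queries and queues the overflow."""

    def __init__(self, max_concurrency, max_queued=1000):
        self.max_concurrency = max_concurrency
        self.max_queued = max_queued
        self.running = set()
        self.queue = deque()

    def submit(self, query_id):
        if len(self.running) < self.max_concurrency:
            self.running.add(query_id)
            return "RUNNING"
        if len(self.queue) < self.max_queued:
            self.queue.append(query_id)  # wait for a slot instead of failing
            return "QUEUED"
        return "REJECTED"  # assumption: behavior past the queue limit

    def finish(self, query_id):
        """Mark a query done and promote the oldest queued query, if any."""
        self.running.discard(query_id)
        if self.queue and len(self.running) < self.max_concurrency:
            self.running.add(self.queue.popleft())

scheduler = QueryScheduler(max_concurrency=2)
print([scheduler.submit(q) for q in ("q1", "q2", "q3")])
scheduler.finish("q1")
print(sorted(scheduler.running))
```

The third query waits rather than failing, and it starts as soon as a running query completes, so users do not need to time their submissions around quiet periods.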

“BigQuery outperforms particularly strongly in very short and very complex queries. Half (47%) of the queries tested in BigQuery finished in less than 10 sec compared to only 20% on alternative solutions. Even more starkly, only 5% of the thousands of queries tested took more than 2 minutes to run on BigQuery whereas almost half (43%) of the queries tested on alternative solutions took 2 minutes or more to complete.” – Nikhil Mishra, Sr. Director of Engineering, Yahoo!

Colossus Flash Cache to serve data quickly and efficiently

Most distributed processing systems make a tradeoff between cost (querying data on hard disk) and performance (querying data in memory). The BigQuery team believes that this trade-off is a fallacy and that users can have both low cost and high performance, without having to choose between them. To achieve this, the team developed a disaggregated intermediate cache layer called Colossus Flash Cache which maintains a cache in flash storage for actively queried data. Based on access patterns, the underlying storage infrastructure caches data in Colossus Flash Cache. This way, queries rarely need to go to disk to retrieve data; the data is served up quickly and efficiently from Colossus Flash Cache.
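A read-through cache captures the general shape of this idea. The sketch below is only an analogy: the post says data is cached based on access patterns but does not name a policy, so least-recently-used eviction is used here purely as a stand-in, and `ReadThroughCache` and its `read_from_disk` callback are hypothetical.

```python
from collections import OrderedDict

class ReadThroughCache:
    """Serves blocks from a bounded cache and falls back to disk on a miss."""

    def __init__(self, capacity, read_from_disk):
        self.capacity = capacity
        self.read_from_disk = read_from_disk
        self.entries = OrderedDict()
        self.disk_reads = 0

    def get(self, block_id):
        if block_id in self.entries:
            self.entries.move_to_end(block_id)  # mark as recently used
            return self.entries[block_id]
        self.disk_reads += 1
        data = self.read_from_disk(block_id)
        self.entries[block_id] = data
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used (assumed policy)
        return data

cache = ReadThroughCache(capacity=2, read_from_disk=lambda b: f"data-{b}")
for block in ["a", "a", "b", "a"]:
    cache.get(block)
print(cache.disk_reads, "disk reads for 4 requests")
```

Repeated reads of actively queried blocks are served from the cache, so only the first access to each block pays the cost of going to disk.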

Optimized Shuffle to prevent excess resource usage

BigQuery achieves its highly scalable data processing capabilities through in-memory execution of queries. These in-memory operations bring data from disk and store intermediate results of the various stages of query processing in another in-memory distributed component called Shuffle. Analytical queries containing WITH clauses encompassing common table expressions (CTEs) often reference the same table through multiple subqueries, which can lead to the same work being repeated and extra shuffle capacity being consumed. To solve this, the BigQuery team built a duplicate CTE detection mechanism in the query optimizer. This algorithm reduces resource usage substantially, allowing more shuffle capacity to be shared across queries.

To further help customers understand their shuffle usage, the team also added PERIOD_SHUFFLE_RAM_USAGE_RATIO metrics to the JOBS INFORMATION_SCHEMA view and to Admin Resource Charts. You should see fewer Resource Exceeded errors as a result of these improvements and now have a tracking metric to take preemptive actions to prevent excess shuffle resource usage.
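As a rough illustration of duplicate detection, the sketch below groups CTEs whose text is identical once whitespace is normalized. This is a deliberately naive stand-in: the post does not describe how the optimizer compares CTEs, and a real optimizer would most likely compare query plans rather than text. `find_duplicate_ctes` and the sample queries are hypothetical.

```python
import re
from collections import defaultdict

def find_duplicate_ctes(ctes):
    """Group CTE names whose bodies match after collapsing whitespace.
    Text comparison is an assumption made for this sketch; it would miss
    CTEs that are equivalent but written differently."""
    groups = defaultdict(list)
    for name, body in ctes.items():
        normalized = re.sub(r"\s+", " ", body.strip())
        groups[normalized].append(name)
    return [names for names in groups.values() if len(names) > 1]

ctes = {
    "recent_orders": "SELECT id, total FROM orders  WHERE day > 10",
    "orders_again": "SELECT id, total\nFROM orders WHERE day > 10",
    "customers": "SELECT id FROM customers",
}
print(find_duplicate_ctes(ctes))
```

Once duplicates are known, the shared result only needs to be computed and held in shuffle once, which is the source of the resource savings.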

“Our teams wanted to do more with data to create better products and services, but the technology tools we had weren’t letting us grow and explore. And that data was growing continually. Just one of our data warehouses had grown 300% from 2014 to 2018. Cloud migration choices usually involve either re-engineering or lift-and-shift, but we decided on a different strategy for ours: move and improve. This allowed us to take full advantage of BigQuery’s capabilities, including its capacity and elasticity, to help solve our essential problem of capacity constraints.” – Srinivas Vaddadi, Delivery Head, Data Services Engineering, HSBC

Ecosystem optimization

Faster ingest, faster egress, faster federation

The performance improvements BigQuery users experience are not limited to BigQuery’s query engine. We know that customers use BigQuery with other cloud services to allow data analysts to ingest from or query other data sources with their BigQuery data. To enable better interoperability, the BigQuery team works closely with other cloud services teams on a variety of integrations:

BigQuery JDBC/ODBC drivers: The new versions of the ODBC / JDBC drivers support faster user account authentication using OAuth 2.0 (OAuthType=1) by processing authentication token refreshes in the background.

BigQuery with Bigtable: The GA release of Cloud Bigtable to BigQuery federation supports pushdown of queries for specific row keys to avoid full table scans.

BigQuery with Spanner: Federated queries against Spanner in BigQuery now allow users to specify the execution priority, thereby giving them control over whether federated queries should compete with transaction traffic if executed with high priority or if they can complete at lower-priority settings.

BigQuery with Pub/Sub: BigQuery now supports direct ingest of Pub/Sub events through a purpose-built “BigQuery subscription” that allows events to be directly written to BigQuery tables.

BigQuery with Dataproc: The Spark connector for BigQuery supports the DIRECT write method, using the BigQuery Storage Write API, avoiding the need to write the data to Cloud Storage.

What can BigQuery do for you? 

Taken together, these improvements to BigQuery translate into tangible performance results and business gains for customers around the world. For example, Camanchaca achieved 6x faster data processing time, Telus achieved 20x faster data processing and reduced costs by $5M, Vodafone saw a 70% reduction in data ops and engineering costs, and Crux achieved 10x faster load times.

“Being able to very quickly and efficiently load our data into BigQuery allows us to build more product offerings, makes us more efficient, and allows us to offer more value-added services. Having BigQuery as part of our toolkit enables us to think up more products that help solve our customers’ challenges.” – Ryan Haggerty, Head of Infrastructure and Operations, Crux

Want to hear more about how you can use BigQuery to drive similar results for your business? Join us at the Data Cloud and AI Summit ‘23 to learn what’s new in BigQuery and check out our roadmap of performance innovations using the power of serverless.