New This Month in Data Analytics: Taking Home the Gold in Data

As the Olympics kicked off in Tokyo at the end of July, we found ourselves reflecting on the beauty of diverse countries and cultures coming together to celebrate greatness and sportsmanship. For this month’s blog, we’d like to highlight some key data and analytics performances that should help inspire you to reach new heights in your data journey.

Let’s review the highlights!

BigQuery ML Anomaly Detection: A perfect 10 for augmented analytics

Identifying anomalous behavior at scale is a critical component of any analytics strategy. Whether you want to work with a single frame of data or a time series progression, BigQuery ML allows you to bring the power of machine learning to your data warehouse. 

In this blog released at the beginning of last month, our team walked through both non-time series and time-series approaches to anomaly detection in BigQuery ML:

Non-time series anomaly detection

Autoencoder model (now in Public Preview)

K-means model (already in GA)

Time-series anomaly detection

ARIMA_PLUS time series model (already in GA)

These approaches make it easy for your team to quickly experiment with data stored in BigQuery to identify what works best for your particular anomaly detection needs. Once a model has been identified as the right fit, you can easily port that model into the Vertex AI platform for real-time analysis or schedule it in BigQuery for continued batch processing.
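As a rough sketch of what the k-means approach can look like in practice, the snippet below trains a model and scores a table from Python with the BigQuery client library. The project, dataset, and table names are placeholders, and the contamination value is just an illustrative choice.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

# Train a k-means model on a (placeholder) transactions table.
train_sql = """
CREATE OR REPLACE MODEL `my_project.my_dataset.txn_kmeans`
OPTIONS (model_type = 'kmeans', num_clusters = 4) AS
SELECT * EXCEPT(transaction_id)
FROM `my_project.my_dataset.transactions`
"""
client.query(train_sql).result()  # wait for training to finish

# Score the same table and keep only the rows flagged as anomalous.
detect_sql = """
SELECT *
FROM ML.DETECT_ANOMALIES(
  MODEL `my_project.my_dataset.txn_kmeans`,
  STRUCT(0.02 AS contamination),            -- treat ~2% of rows as anomalous
  TABLE `my_project.my_dataset.transactions`)
WHERE is_anomaly
"""
for row in client.query(detect_sql).result():
    print(dict(row))
```

The same pattern applies to the autoencoder and ARIMA_PLUS models; only the model options and the detection threshold argument change.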

App Analytics: Winning the team event

Google provides a broad ecosystem of technologies and services aimed at solving modern day challenges. Some of the best solutions come when those technologies are combined with our data analytics offerings to surface additional insights and provide new opportunities. 

Firebase has deep adoption in the app development community and provides the technology backbone for many organizations' app strategies. This month we launched a design pattern that shows Firebase customers how to use Crashlytics data, CRM, issue tracking, and support data in BigQuery and Looker to identify opportunities to improve app quality and enhance customer experiences.

Crux on BigQuery: Taking gold in the all-around data competition

Crux Informatics provides data services to many large companies, helping their customers make smarter business decisions. While Crux was already operating on a modern stack and not on the hunt for a new data warehouse, BigQuery became an enticing option due to its performance and a more favorable pricing model. Crux also found advantages in lower-cost ingestion and processing engines like Dataflow that allow for streaming analytics.

“…when it came to building a centralized large-scale data cloud, we needed to invest in a solution that would not only suit our current data storage needs but also enable us to tackle what’s coming, supporting a massive ecosystem of data delivery and operations for thousands of companies.”
Mark Etherington, Chief Technology Officer, Crux Informatics

Technology is a team sport, and Crux found our support team responsive and ready to help. This decision to more deeply adopt Google Cloud’s data analytics offerings provides Crux with the flexibility to manage a constantly evolving data ecosystem and stay competitive.

You can read more about Crux’s decision to adopt BigQuery in this blog.

Google Trends: A classic emerges a champion

Following up on the launch of our Google Trends dataset in June, we delivered some examples of how to use that data to augment your decision making. 

As a quick recap, Google Cloud, and in particular BigQuery, provides access to the top 25 trending terms by Nielsen’s Designated Market Area® (DMA) at weekly granularity. These trending terms are based on search patterns and have historically only been available on the Google Trends website.

The Google Trends design pattern addresses some common business needs, such as identifying what’s trending geographically near your stores and how to match trending terms to products to identify potential campaigns. 

Dataflow GPU: More power than ever for those streaming sprints

Dataflow is our fully-managed data processing platform that supports both batch and streaming workloads. The ability of Dataflow to scale and easily manage unbounded data has made it the streaming solution of choice for large workloads with high-speed needs in Google Cloud. 

But what if we could take that speed and provide even more processing power for advanced use cases? Our team, in partnership with NVIDIA, did just that by adding GPU support to Dataflow. This allows our customers to easily accelerate compute-intensive processing like image analysis and predictive forecasting with amazing increases in efficiency and speed. 

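As a hedged sketch of how a Beam pipeline might request GPU workers, the options below use Dataflow's worker_accelerator service option, which requires Runner v2 and a custom container image with the GPU libraries installed. The project, bucket, and container names are placeholders, and the inference step is only a stand-in for a real model call.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project, bucket, and container image; the worker_accelerator
# service option is how GPU workers are requested on Dataflow.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    experiments=["use_runner_v2"],
    dataflow_service_options=[
        "worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver"
    ],
    sdk_container_image="gcr.io/my-project/beam-gpu:latest",  # image with CUDA libs
)

def gpu_inference(batch):
    # Stand-in for a compute-intensive step (e.g. image model inference)
    # that benefits from the attached GPU.
    return batch

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/images/index.txt")
     | "Batch" >> beam.BatchElements(min_batch_size=8, max_batch_size=64)
     | "Infer" >> beam.Map(gpu_inference)
     | "Write" >> beam.io.WriteToText("gs://my-bucket/output/predictions"))
```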

Data Fusion: A play-by-play for data integration’s winning performance

Data Fusion provides Google Cloud customers with a single place to perform all kinds of data integration activities. Whether it’s ETL, ELT, or simply integrating with a cloud application, Data Fusion provides a clean UI and streamlined experience with deep integrations to other Google Cloud data systems. Check out our team’s review of this tool and the capabilities it can bring to your organization.


What is Datastream?

With data volumes constantly growing, many companies find it difficult to use data effectively and gain insights from it. Often these organizations are burdened with cumbersome and difficult-to-maintain data architectures. 

One way that companies are addressing this challenge is with change streaming: the movement of data changes as they happen from a source (typically a database) to a destination. Powered by change data capture (CDC), change streaming has become a critical data architecture building block. We recently announced Datastream, a serverless change data capture and replication service. Datastream’s key capabilities include:

Replicate and synchronize data across your organization with minimal latency. You can synchronize data across heterogeneous databases and applications reliably, with low latency, and with minimal impact to the performance of your source. Unlock the power of data streams for analytics, database replication, cloud migration, and event-driven architectures across hybrid environments.

Scale up or down seamlessly with a serverless architecture. Get up and running fast with a serverless, easy-to-use service that scales seamlessly as your data volumes shift. Focus on deriving up-to-date insights from your data and responding to high-priority issues, instead of managing infrastructure, performance tuning, or resource provisioning.

Integrate with the Google Cloud data integration suite. Connect data across your organization with Google Cloud data integration products. Datastream leverages Dataflow templates to load data into BigQuery, Cloud Spanner, and Cloud SQL; it also powers Cloud Data Fusion’s CDC Replicator connectors for easier-than-ever data pipelining.


Datastream use cases

Datastream captures change streams from Oracle, MySQL, and other sources for destinations such as Cloud Storage, Pub/Sub, BigQuery, Spanner and more. Some use cases of Datastream:  

For analytics, use Datastream with a pre-built Dataflow template to create up-to-date replicated tables in BigQuery in a fully managed way.

For database replication, use Datastream with pre-built Dataflow templates to continuously replicate and synchronize database data into Cloud SQL for PostgreSQL or Spanner, powering low-downtime database migrations or hybrid-cloud configurations.

For building event-driven architectures, use Datastream to ingest changes from multiple sources into object stores like Cloud Storage or, in the future, messaging services such as Pub/Sub or Kafka.

To streamline real-time data pipelines, use Datastream to continually stream data from legacy relational stores (like Oracle and MySQL) into MongoDB.

How do you set up Datastream?

1. Create a source connection profile.
2. Create a destination connection profile.
3. Create a stream using the source and destination connection profiles, and define the objects to pull from the source.
4. Validate and start the stream.

Once started, a stream continuously streams data from the source to the destination. You can pause and then resume the stream. 

Connectivity options

To use Datastream to create a stream from the source database to the destination, you must establish connectivity to the source database. Datastream supports the IP allowlist, forward SSH tunnel, and VPC peering network connectivity methods.

Private connectivity configurations enable Datastream to communicate with a data source over a private network (internally within Google Cloud, or with external sources connected over VPN or Interconnect). This communication happens through a Virtual Private Cloud (VPC) peering connection. 

For a more in-depth look into Datastream check out the documentation.

For more #GCPSketchnote, follow the GitHub repo. For similar cloud content follow me on Twitter @pvergadia and keep an eye out on thecloudgirl.dev.


Building with Looker made easier with the Extension Framework

Our goal is to continue improving our platform functionality and to find new ways to empower Looker developers to build data experiences faster and at a lower upfront cost.

We’ve heard the developer community’s feedback, and we’re excited to announce the general availability of the Looker Extension Framework.

The Extension Framework is a fully hosted development platform that enables developers to build any data-powered application, workflow, or tool right in Looker. By eliminating the need to spin up and host infrastructure, the Extension Framework lets developers focus on building great experiences for their users. Traditionally, customers and partners who build custom applications with Looker have to assemble an entire development infrastructure before they can proceed with implementation. For instance, they might need to stand up both a back end and a front end, then implement services for hosting and authorization. This adds time and cost.

The Extension Framework eliminates much of this inefficiency and significantly reduces friction in the setup and development process, so developers can start building right away. Looker developers no longer need DevOps or infrastructure to host their data applications, and applications built on the Extension Framework can take full advantage of the power of Looker. To enable these efficiencies, the Looker Extension Framework includes a streamlined way to leverage the Looker APIs and SDKs, UI components for building the visual experience, as well as authentication, permission management, and application access control.

Streamlining the development process with the Extension Framework

Content created via the Extension Framework can be built as a full-screen experience or embedded into an external website or application. We will soon be adding functionality that allows extensions to be embedded inside Looker (as a custom tile you plug into your dashboard, for example). Through our Public Preview period we have already seen more than 150 extensions deployed to production users, with an additional 200+ extensions currently in development. These extensions include solutions such as enhanced navigation tools, customized navigation, and modified reporting applications, to name a few.

Extension Framework Feature Breakdown

The Looker Extension Framework includes the following features:

The Looker Extension SDK, which provides functions for Looker public API access and for interacting within the Looker environment (see the API sketch after this list).

Looker components, a library of pre-built React UI components you can use in your extensions.

The Embed SDK, a library you can use to embed dashboards, Looks, and Explores in your extension. 

The create-looker-extension utility, an extension starter kit that includes all the necessary extension files and dependencies.

Our Looker extension framework examples repo, with templates and sample extensions to assist you in getting started quickly.

The ability to access third-party API endpoints and add third-party data to your extension in building enhanced data experiences (e.g. Google Maps API).

The ability to create full-screen extensions within Looker. Full-screen extensions can be used for internal or external platform applications.

The ability to configure an access key for your extension so that users must enter a key to run the extension. 
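Extensions themselves are written in JavaScript or TypeScript against the Extension SDK, but the surface they wrap is Looker's public API. As a rough taste of that API, here is a minimal sketch using the Looker Python SDK, kept in Python for consistency with the other code sketches in this roundup; it assumes API credentials are configured in a looker.ini file, and the Look ID is hypothetical.

```python
import looker_sdk

# Assumes Looker API credentials are configured in looker.ini.
sdk = looker_sdk.init40()

me = sdk.me()
print(f"Authenticated as {me.display_name}")

# Run a saved Look and pull its results as JSON (Look ID is hypothetical).
results = sdk.run_look(look_id="42", result_format="json")
print(results)
```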

Next Steps

If you haven’t yet tried the Looker Extension Framework, we think you’ll find it to be a major upgrade to your data app development experience. Over the next few months, we will continue to make enhancements to the Extension Framework with the goal of significantly reducing the amount of code required, and eventually empowering our developers with a low-code, no-code framework.

Comprehensive details and examples that help you get started in developing with the Extension Framework are now available here. We hope that these new capabilities inspire your creativity and we’re super excited to see what you build with the Extension Framework!


Building a data science driven organization from first principles

In this blog series, a companion to our paper, we’re exploring different types of data-driven organizations. In the last blog, we discussed the main principles for building a data engineering driven organization. This second part of the series focuses on how to build a data science driven organization.

The emergence of big data, advances in machine learning, and the rise of cloud services have changed the technological landscape dramatically over the last decade and have pushed the boundaries of industries such as retail, chemistry, and healthcare. To create a sustainable competitive advantage from this paradigm shift, companies need to become ‘data science driven organizations’.1 In this article we discuss the socio-technical challenges these companies face and provide a conceptual framework, built on first principles, to help them on their journey. Finally, we show how those principles can be implemented on Google Cloud.

Challenges of data science driven organizations

A data science driven organization can be described as an entity that maximizes the value from the data available while using machine learning and analytics to create a sustainable competitive advantage. Becoming such an organization is more of a sociotechnical challenge rather than a purely technical one. In this context, we identified four main challenges:

Inflexibility: Many organizations have not built an infrastructure flexible enough to quickly adapt to a fast-changing technical landscape. This inflexibility comes with the cost of lock-in effects, outdated technology stacks, and poor signaling to potential candidates. While these effects might be less pronounced in more mature areas like data warehousing, they are particularly acute for data science and machine learning.

Disorder: Most organizations grow organically, which often results in a non-standardized technological infrastructure. While standardization and flexibility are often seen as diametrically opposed, a certain level of standardization is needed to establish a technical ‘lingua franca’ across the organization. Its absence harms collaboration and knowledge sharing between teams and hampers the modular approaches common in classical software engineering.

Opaqueness: Data science and machine learning are fundamentally data-driven (engineering) disciplines. Because data is in constant flux, accountability and explainability are pivotal for any data science driven organization.2 As a result, data science and machine learning workflows need to be defined with the same rigor as classical software engineering. Otherwise, such workflows turn into unpredictable black boxes.

Data Culture (or lack thereof): Most organizations have a permission culture in which data is managed by a single team, which then becomes a bottleneck for providing rapid access because it cannot scale with the volume of requests. In organizations driven by data culture, however, there are clear ways to access data while retaining governance. As a result, machine learning practitioners are not slowed down by politics and bureaucracy and can carry out their experiments.

Personas in data science driven organizations

Data science driven organizations are heterogeneous. Nevertheless, most of them leverage four core personas: data engineers, machine learning engineers, data scientists, and analysts. It is important to mention that those roles are not static and overlap to a certain extent. An organizational structure needs to be designed in such a way that it can leverage the collaboration and full potential of all personas. 

Data engineers take care of creating data pipelines and making sure that the available data meets hygiene requirements: for example, cleansing, joining, and enriching multiple data sources to turn data into information on which downstream intelligence is based.

Machine learning engineers develop and maintain complete machine learning models. While machine learning engineers are the rarest of the four personas, they become indispensable once an organization plans to run business critical workflows in production. 

Data scientists act as a nexus between data and machine learning engineers. Together with business stakeholders they translate business driven needs into testable hypotheses, make sure that value is derived from machine learning workloads and create reports to demonstrate value from the data.

Data analysts bring the business insight and make sure the data-driven solutions the business is seeking are implemented. They answer ad hoc questions and provide regular reports that analyze not only historical data but also recent events.

There are different arguments about whether a company should build centralized or decentralized data science teams. In both cases, teams face similar challenges to those outlined earlier. There are also hybrid models, such as a federated organization in which data scientists from a central team are embedded in the lines of business. Hence, it is more important to focus on how to tackle those challenges using first principles. In the following sections, we discuss those principles and show how a data science and machine learning platform needs to be designed to facilitate those goals.

First principles to build a data science driven organization

Adaptability: A platform needs to be flexible enough to enable all kinds of personas. While some data scientists/analysts, for example, are more geared toward developing custom models by themselves, others may prefer to use no-code solutions like AutoML or carry out analysis in SQL. This also includes the availability of different machine learning and data science tools like TensorFlow, R, Pytorch, Beam, or Spark. At the same time, the platform should be open enough to work in multi-cloud and on-premises environments while supporting open source technology when possible to prevent lock-in effects. Finally, resources should never become a bottleneck as the platform needs to scale quickly with an organization’s needs.

Activation: The ability to operationalize models by embedding analytics into the tools used by end users is key to scaling services to a broad set of users. Being able to send small batches of data to a service and get predictions back in the response allows developers with little data science expertise to use models. In addition, it is important to facilitate seamless deployment and monitoring of edge inferences and automated processes with flexible APIs. This allows you to distribute AI across your private and public cloud infrastructure, on-premises data centers, and edge devices.

Standardization: A high degree of standardization helps to increase a platform’s overall efficiency. A platform that supports standardized ways of sharing code and technical artifacts improves internal communication. Such platforms are expected to have built-in code repositories, feature stores, and metadata stores. Making those resources queryable and accessible boosts teams’ performance and creativity. Only when this kind of communication is possible can data science and machine learning teams work in the modular fashion that classical software engineering has enjoyed for years. An important aspect of standardization is enabled by using standard connectors so that you can rapidly connect to a source or target system; products such as Datastream and Data Fusion provide such capabilities. On top of this, a high degree of standardization avoids the ‘technical debt’ (e.g., glue code) that is still prevalent in most machine learning and data science workflows.3

Accountability: Data science and machine learning use cases often deal with sensitive topics like fraud detection, medical imaging, or risk calculation. Hence, it’s paramount that a data science and machine learning platform helps to make those workflows as transparent, explainable, and secure as possible. Openness is connected to operational excellence. Collecting and monitoring metadata during all stages of the data science and machine learning workflows is crucial to create a ‘paper trail’ allowing us to ask questions such as: 

Which data was used to train the model? 

Which hyperparameters were used? 

How is the model behaving in production? 

Did any form of data drift or model skew occur during the last period? 

Furthermore, a data science driven organization needs to have a clear understanding of its models. While this is less of an issue for classical statistical methods, machine learning models like deep neural networks are much more opaque. A platform needs to provide simple tools to analyze such models so they can be used with confidence. Finally, a mature data science platform needs to provide all the security measures to protect data and artifacts while managing resource usage at a granular level.

Business Impact: According to McKinsey, many data science projects fail to go beyond the pilot or POC stage.4 The ability to anticipate and measure the business impact of new efforts, and to prioritize ROI over the latest cool solution, is therefore critical. It is key to identify when to buy, build, or customize ML models and how to connect them together in a unified, integrated stack. For example, leveraging an out-of-the-box solution by simply calling an API, rather than building a model over months of development, can deliver higher ROI and demonstrate value sooner.

We conclude this part with a summary of the first principles. The next section shows how those principles can be applied on Google Cloud’s unified ML platform, Vertex AI.

First principles of a data science driven organization

Using first principles to build a data science platform on Google Cloud

Adaptability
With Vertex AI, we are providing a platform built on these first principles that covers the entire data science and machine learning journey, from data readiness to model management. Vertex AI opens up data science and machine learning by providing no-code, low-code, and custom-code paths for building workflows. For example, a data scientist who wants to build a classification model can use AutoML Tables to build an end-to-end model within minutes. Alternatively, they can start their own notebook on Vertex AI to develop custom code in their framework of choice (for instance, TensorFlow, PyTorch, or R). Reducing the entry barrier to building complete solutions not only saves developers time but also enables a wider range of personas (such as data or business analysts) to leverage these tools, helping the whole organization become more data science and machine learning driven.

We strongly believe in open source technology, as it provides higher flexibility, attracts talent, and reduces lock-in. With Vertex Pipelines, we echo the open source industry standards for data science and machine learning workflow orchestration, allowing data scientists and machine learning engineers to orchestrate their workflows in a containerized fashion. With Vertex AI, we reduced the engineering overhead for resource provisioning and provided a flexible and cost-effective way to scale up and down as needed. Data scientists and machine learning practitioners can, for example, run distributed training and prediction jobs with a few lines of Python code in their notebooks on Vertex AI.
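For illustration, here is a minimal sketch of such a notebook snippet using the Vertex AI SDK; the project, bucket, training script, and prebuilt container tags are assumptions rather than a prescribed setup.

```python
from google.cloud import aiplatform

# Hypothetical project, bucket, and training script.
aiplatform.init(
    project="my-project",
    location="us-central1",
    staging_bucket="gs://my-staging-bucket",
)

job = aiplatform.CustomTrainingJob(
    display_name="churn-tf-training",
    script_path="trainer/task.py",              # your own training code
    container_uri="us-docker.pkg.dev/vertex-ai/training/tf-gpu.2-8:latest",
    requirements=["tensorflow-datasets"],
    model_serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-8:latest"
    ),
)

# Kick off a small distributed training job with GPU workers.
model = job.run(
    replica_count=2,
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
)

endpoint = model.deploy(machine_type="n1-standard-4")
print(endpoint.resource_name)
```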

Vertex AI Architecture

Activation
It is important to operationalize your models by embedding analytics into the tools used by your end users. This allows scaling beyond traditional data science users and brings other users into data science applications. For example, you can train BigQuery ML models and scale them using Vertex AI predictions: business analysts running SQL queries can test candidate ML models and experiment to find the most suitable solution. This reduces time to activation because business impact can be observed sooner. On the other hand, Vertex Edge Manager lets you deploy, manage, and run ML models on edge devices with Vertex AI.
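A hedged sketch of that BigQuery ML-to-Vertex AI path might look like the following: export the trained model to Cloud Storage, upload and deploy it on Vertex AI, then request online predictions. The model, bucket, and feature names are placeholders, and it assumes a model type that exports as a TensorFlow SavedModel (such as a logistic regression or DNN model).

```python
from google.cloud import aiplatform, bigquery

bq = bigquery.Client(project="my-project")      # hypothetical project

# 1. Export a trained BigQuery ML model (SavedModel format) to Cloud Storage.
bq.query("""
EXPORT MODEL `my_project.my_dataset.churn_model`
OPTIONS (URI = 'gs://my-bucket/models/churn_model/')
""").result()

# 2. Upload the exported model to Vertex AI and deploy it to an endpoint.
aiplatform.init(project="my-project", location="us-central1")
model = aiplatform.Model.upload(
    display_name="churn-model",
    artifact_uri="gs://my-bucket/models/churn_model/",
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-8:latest"
    ),
)
endpoint = model.deploy(machine_type="n1-standard-2")

# 3. Any application can now request online predictions
#    (feature names here are hypothetical).
prediction = endpoint.predict(instances=[{"tenure": 12, "monthly_charges": 70.5}])
print(prediction.predictions)
```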

Standardization
With all AI services living under Vertex AI, we standardized data science and machine learning workflows. With Vertex Pipelines, every component in Vertex AI can be orchestrated, making glue code obsolete and helping to enhance operational excellence. Because Vertex Pipelines are based on components (containerized steps), parts of a pipeline can be shared with other teams. For example, suppose a data scientist has written an ETL pipeline for extracting data from a database, and this pipeline is then used to create features for downstream data science and machine learning tasks. Data scientists can package this component, share it using GitHub or Cloud Source Repositories, and make it available to other teams, who can seamlessly integrate it into their own workflows. This helps teams work in a more modular manner and fosters collaboration across the board. Such a standardized environment also makes it easier for data scientists and machine learning engineers to rotate between teams and avoids compatibility issues between workflows. New components like Vertex Feature Store further improve collaboration by helping to share features across the organization.
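To make the idea of shareable, containerized steps concrete, here is a minimal sketch using the KFP v2 SDK namespace and Vertex Pipelines; the component body, project, table, and bucket names are illustrative stand-ins for a team's real ETL step.

```python
from kfp.v2 import compiler, dsl
from google.cloud import aiplatform

# A tiny, shareable component: in practice this would be the team's packaged
# ETL / feature-engineering step.
@dsl.component(base_image="python:3.9")
def extract_features(source_table: str) -> str:
    # Placeholder for the real extraction logic.
    return f"features derived from {source_table}"

@dsl.pipeline(name="shared-feature-pipeline")
def pipeline(source_table: str = "my_project.my_dataset.raw_events"):
    extract_features(source_table=source_table)

# Compile the pipeline to a JSON spec that other teams can reuse.
compiler.Compiler().compile(pipeline_func=pipeline, package_path="pipeline.json")

# Run it on Vertex Pipelines.
aiplatform.init(project="my-project", location="us-central1")
job = aiplatform.PipelineJob(
    display_name="shared-feature-pipeline",
    template_path="pipeline.json",
    pipeline_root="gs://my-bucket/pipeline-root",
)
job.run()
```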

Accountability
Data science and machine learning projects are complex and dynamic, and therefore require a high degree of accountability. To achieve this, data science and machine learning projects need to create a ‘paper trail’ that captures the nuances of the whole process. Vertex ML Metadata automatically tracks the metadata of trained models and workflow runs. It provides a metadata store for understanding a workflow’s lineage (such as how a model was trained, which data was used, and how the model was deployed). A new model repository gives you a quick overview of all models trained within the project. Additional services like Vertex Explainable AI help you understand why a machine learning model made a certain prediction. Further, features like continuous monitoring, including the detection of prediction drift and training-serving skew, help you stay in control of productionized models.

Business Impact: As discussed earlier, it is key to identify when to buy, build, or customize ML models and how to connect them together in a unified, integrated stack. For example, if your company wants to make its services and products more accessible to a global clientele through translation, you could simply use the Cloud Translation API if you are translating websites and user comments: that’s exactly what it was trained on, and you probably don’t have an internet-sized dataset to train your own ML model on. On the other hand, you may choose to build a custom solution. Even though Google Cloud’s Vision API is trained on the same kind of input (photos) and labels, your organization may have a much larger dataset of such images, and a custom model might give better results for your particular use case. Of course, you can always compare the final model against the off-the-shelf solution to check that you made the right decision. Checking feasibility is important, so when we talk about building models, we always mention quick methods to verify that you are making the right decisions.

Conclusion

Building a data science driven organization comes with several socio-technical challenges. Often an organization’s infrastructure is not flexible enough to react to a fast-changing technological landscape. A platform also needs to provide enough standardization to foster communication between teams and establish a technical ‘lingua franca’; doing so is key to enabling modularized workflows between teams and operational excellence. In addition, complex data science and machine learning workflows are often too opaque to monitor securely. We argue that a data science driven organization should be built on a technical platform that is highly adaptable in terms of technological openness, enabling a wide set of personas and providing technological resources in a flexible and serverless manner. Whether to buy or build a solution is one of the key drivers of return on investment for the organization and will define the business impact of any AI solution. At the same time, enabling a broad number of users allows you to activate more use cases. Finally, a platform needs to provide the tools and resources to make data science and machine learning workflows open, explainable, and secure, providing the maximum degree of accountability. Vertex AI is built on those pillars, helping you become a data science driven organization. Visit our Vertex AI page to learn more.

1. Data science is used as an umbrella term for the interplay of big data, analytics and machine learning.
2. The Covid pandemic is a prime example as it has significantly affected our environment and therefore the data on which data-science and machine learning workflows are based.
3. Sculley et al. (2015). Hidden technical debt in machine learning.
4. https://www.mckinsey.com/business-functions/mckinsey-analytics/our-insights/global-survey-the-state-of-ai-in-2020


Manage Capacity with Pub/Sub Lite Reservations. It’s easier and cheaper.

If you need inexpensive managed messaging for streaming analytics, Pub/Sub Lite was made for you.  Lite can be as much as 10 times cheaper than Pub/Sub.  But, until now, the low price came with a lot more work. You had to manage the read and write throughput capacity of each partition of every topic.  Have 10 single-partition topics? Make sure you watch 10 write and another 10 read capacity utilization metrics or you might run out of capacity.

Hello, Reservations

We did not like this either. So we launched Pub/Sub Lite Reservations to manage throughput capacity for many topics with a single number.  A reservation is a regional pool of throughput capacity.  The capacity can be used interchangeably for read or write operations by any topic within the same project and region as the reservation. You can think of this as provisioning a cluster of machines and letting it handle the traffic.  Except instead of a cluster, there is just a single number. 

Less work is great, of course. It is even better when it saves you money.  Without reservations, you must provision each topic partition for the highest peaks in throughput.  Depending on how variable your traffic is this can mean that half or more of the provisioned capacity is unused most of the time. Unused, but not free. 

Reservations allow you to handle the same traffic spikes with less spare capacity.  Usually, the peaks in throughput are not perfectly correlated among topics. If so, the peaks in the aggregate throughput of a reservation are smaller, relative to the time average, than the peaks in individual topics.  This makes for less variable throughput and reduces the need for spare capacity. 

As a cost-saving bonus, reservations do away with the explicit minimum capacity per partition. There is still a limit on the number of partitions per reservation. With this limit, you pay for at least 1 MiB/s per topic partition.  This is not quite “scale to zero” of Pub/Sub, but beats the 4 MiB/s required for any partition without reservations.

An Illustration

Suppose you have three topics with traffic patterns that combine a diurnal rhythm with random surges.  The minimum capacity needed to accommodate this traffic is illustrated below.

“Before” you provision for the spikes in each topic independently. “After,” you aggregate the traffic, dampening most peaks.  In practice, you will provision more than shown here to anticipate the peaks you haven’t seen.  You will also likely have more topics. Both considerations increase the difference in favor of reservations. 
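A quick back-of-the-envelope calculation with made-up numbers shows why the aggregate peak is what matters:

```python
# Synthetic per-minute throughput samples (MiB/s) for three topics whose
# spikes don't line up; numbers are made up purely for illustration.
topic_a = [4, 5, 12, 5, 4, 6]
topic_b = [6, 14, 6, 5, 7, 6]
topic_c = [5, 6, 5, 13, 6, 5]

# Without a reservation: each topic must be provisioned for its own peak.
per_topic_capacity = max(topic_a) + max(topic_b) + max(topic_c)

# With a reservation: one pool sized for the peak of the combined traffic.
combined = [a + b + c for a, b, c in zip(topic_a, topic_b, topic_c)]
reservation_capacity = max(combined)

print(f"Sum of per-topic peaks: {per_topic_capacity} MiB/s")      # 39 MiB/s
print(f"Peak of combined traffic: {reservation_capacity} MiB/s")  # 25 MiB/s
```

With these synthetic traces, per-topic provisioning needs 39 MiB/s of capacity, while a single shared reservation covers the same traffic with 25 MiB/s.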

Are Reservations Always Best?

For all the benefits of shared capacity, it has the “noisy neighbor” problem.  A traffic peak on some topics can leave others without capacity.  This is a concern if your application critically depends on consistent latency and high availability.  In this case, you can isolate noisy topics in a separate reservation. In addition, you can limit the noise by setting throughput caps on individual topics. 

All in all, if you need to stream tens of megabytes per second at a low cost, Lite is now an even better option. 

Reservations are generally available to all Pub/Sub Lite users.  You can use reservations with existing topics. Start saving money and time by creating a Pub/Sub Lite reservation in the Cloud Console and let us know how it goes at pubsub-lite-helpline@google.com.


Building a unified analytics data platform on Google Cloud

Every company runs on data, but not every organization knows how to create value out of the data it generates. The first step to becoming a data driven company is to create the right ecosystem for data processing in a holistic way. Traditionally, organizations’ data ecosystems consisted of point solutions that provide data services. But that point solution approach is, for many companies, no longer sufficient. 

One of the most common questions we get from customers is, “Do I need a data lake, or should I consider a data warehouse? Do you recommend I consider both?” Traditionally, these two architectures have been viewed as separate systems, applicable to specific data types and user skill sets. Increasingly,  we see a blurring of lines between data warehouses and data lakes, which provides customers with an opportunity to create a more comprehensive platform that gives them the best of both worlds.

What if we don’t need to compromise, and we instead create an end-to-end solution covering the entire data management and processing stages, from data collection to data analysis and machine learning? The result is a data platform that can store vast amounts of data in varying formats and do so without compromising on latency. At the same time, this platform can satisfy the needs of all users throughout the data lifecycle. 

Emerging Trends 

There is no one-size-fits-all approach to building an end-to-end data solution. Emerging concepts include data lakehouses, data meshes, and data vaults that seek to meet specific technical and organizational needs. Some are not new and have been around in different shapes and formats, however, all of them work naturally within a Google Cloud environment. Let’s look into both ends of the spectrum of enabling data and enabling teams. 

Data mesh facilitates a decentralized approach to data ownership, allowing individual lines of business to publish and subscribe to data in a standardized manner, instead of forcing data access and stewardship through a single, centralized team. A data lakehouse, on the other hand, brings raw and processed data closer together, allowing for a more streamlined and centralized repository of the data needed throughout the organization. Processing can be done in transit via ELT, reducing the need to copy datasets across systems and allowing for easier data exploration and governance. The data lakehouse stores data as a single source of truth, making minimal copies of the data. This architecture offers low-cost storage in an open format accessible by a variety of processing engines like Spark, while also providing powerful management and optimization features; consistent security and governance are key to any lakehouse. Finally, a data vault is designed to separate data-driven and model-driven activities. Data integrated into the raw vault enables parallel loading to facilitate scaling of large implementations.

In Google Cloud, there is no need to keep them separate. In fact, with interoperability among our portfolio of data analytics products, you can easily provide access to data residing in different places, effectively bringing your data lake and data warehouse together on a single platform.

Let’s look at some of the technological innovations that make this a reality. The BigQuery Storage API allows you to treat the data warehouse like a data lake: for example, you can use Spark to access data residing in BigQuery without affecting the performance of any other jobs accessing it. This is made possible by the underlying architecture, which separates compute and storage. Likewise, Dataplex, our intelligent data fabric service, provides data governance and security capabilities across the various lakehouse storage tiers built on Cloud Storage and BigQuery.
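For example, a Spark job on Dataproc can read a BigQuery table directly through the spark-bigquery connector, which uses the Storage API under the hood; the table name below is a placeholder, and the connector is assumed to be available on the cluster (it ships with Dataproc).

```python
from pyspark.sql import SparkSession

# Assumes the spark-bigquery connector is available on the cluster.
spark = SparkSession.builder.appName("lakehouse-demo").getOrCreate()

orders = (
    spark.read.format("bigquery")
    .option("table", "my_project.sales.orders")   # placeholder table
    .load()
)

# The read is served by the BigQuery Storage API, so it doesn't compete
# with the SQL workloads running in the warehouse.
orders.groupBy("country").count().show()
```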

We will continue to offer specialized products and solutions around data lake and data warehouse functionality but over time we expect to see a significant enough convergence of the two systems that the terminology will change. At Google Cloud, we consider this combination an “analytics data platform”.

Tactical or Strategic

Google Cloud’s data analytics platform is differentiated by being open, intelligent, flexible, and tightly integrated. There are many technologies in the market which provide tactical solutions that may feel comfortable and familiar. However, this can be a rather short-term approach that simply lifts and shifts a siloed solution into the cloud. In contrast, an analytics data platform built on Google Cloud offers modern data warehousing and data lake capabilities with close integration to our AI Platform. It also provides built-in streaming, ML, and geospatial capabilities and an in-memory solution for BI use cases. Depending on your organizational data needs, Google Cloud has the set of products, tools, and services to create the right data platform for you. 

To become a truly data-driven organization, the first step is to design and implement an analytics data platform that meets your technical and business needs. Whether you want to empower teams to own, publish, and share their data across the organization, or you want to create a streamlined store of raw and processed data for easier discovery, there is a solution that best meets the needs of your company.

To learn more about the elements of a unified analytics data platform built on Google Cloud, and the differences in platform architectures and organizational structures, read our Unified Analytics Platform paper.


Transforming a Fortune 500 Alc-Bev Firm into a Nimble e-Commerce leader

Editor’s note: Today we’re hearing from Ryan Mason, Director and Head of DTC Growth & Strategy at alcoholic beverage firm Constellation Brands, on the company’s shift to Direct-to-Consumer (DTC) sales and how Google Cloud’s powerful technology stack helped with this transformation.

It’s no secret that consumer businesses have been up-ended in a lasting manner after 18 months of the pandemic. Consumers have been forced to shop differently over the past year – and as a result, they’ve evolved to be more comfortable with online spending and have grown to expect a certain level of convenience. While the e-commerce share of consumer sales has grown steadily over the past decade, the pandemic was the catalyst for the famous “10 years of growth in 3 months” which many argue is here to stay. 

Facing this reality head-on,  we placed a new emphasis on Direct-to-Consumer (DTC) with our acquisition of Empathy Wines, a DTC-native wine brand that sells directly to consumers via e-commerce. To accelerate our innovation in the DTC space, we added headcount and new functions to the existing Empathy team and empowered the newly-minted DTC group to apply their digital commerce operating model across the rest of the wine and spirits portfolio, which includes Robert Mondavi Winery, Meiomi Wines, The Prisoner Wine Company, High West Whiskey, and more. 

One pandemic and one year later, DTC sales have surged in the wine and spirits category with Constellation positioned as a leader armed with a unique and powerful cloud technology stack, best-in-class e-commerce user experiences, modernized fulfillment solutions, and data-driven growth marketing. 

Benefits of Going DTC 

A report from McKinsey estimates that the strategic business shift to DTC has been accelerated by two years because of the pandemic and argues that consumer brands that want to thrive will need to aim for a 20% DTC business or higher, which is already taking shape in the market: Nike’s direct digital channels are on track to make up 21.5% of the total business by the end of 2021, up from 15.5% in the last fiscal year, and Adidas is aiming for 50% DTC by 2025. But outside of the clear revenue upside, the auxiliary benefits of going DTC are robust.


For Constellation Brands, each of these four pillars ring true, and our shift toward DTC is as much about margin accretion and revenue mix management as it is about consumer insights and data. The added complexities of the alcohol space add wrinkles to our DTC approach and manifest in many areas like consumer shopping preference, shipping and logistics hurdles, and more. In order to win share early and continue to lead the category, we recognized the need to harness the immense amount of first-party data to power impactful and actionable insights. 

Our DTC technology architecture has fostered a value chain that is completely digitized: website traffic, marketing expenditures, tasting room transactions, e-commerce transactions, logistics and fulfillment events, cost of goods sold (COGS) and margin profiles, etc. are recorded and stored in a data warehouse in real time.  For the first time, at any given moment, we can easily and deterministically answer complex business questions like “what is the age and gender distribution of my customers from Los Angeles who have purchased SKU X from Brand.com Y in the last 6 months? What is the cohort net promoter score? Did that increase after we introduced same-day shipping in this zip code? By how much?” 

The ability to answer these questions and understand the root causes allows us to stay nimble with product offerings and iterate marketing strategies at the speed of consumer preference. Further, it enables us to optimize our omnichannel presence in the same manner by leaning on DTC consumer insights to develop valuable strategies with key wholesale distribution partners and 3-Tier eCommerce partners like Drizly and Instacart. At its core, Constellation’s DTC practice is designed to be the consumer-centric “tip-of-the-spear” responsible for generating insights from which all sales channels, including wholesale, can benefit. 

Constellation’s DTC technology approach prioritizes consumer-centricity and insights generation

We have taken a modern approach to building a digital commerce technology stack, leveraging a hub-and-spoke model built around Shopify Plus and other key emergent technology providers like email provider Klaviyo, loyalty platform Yotpo, Net Promoter Score measurer Delighted, Customer Service module Gorgias, payments processor Stripe, event reservations platform Tock, and many more. For digital marketing and analytics, we use Google Cloud and Google Marketing Platform, which includes products like Analytics 360, Tag Manager 360, and Search Ads 360.

To help gather, organize, and store all of the inbound data from the ecosystem, we partnered with SoundCommerce, a data processing platform for eCommerce businesses. Together with SoundCommerce, we are able to automate data ingestion from all endpoints into a central data warehouse in Google BigQuery. With BigQuery, our data team is able to break down data silos and quickly analyze large volumes of data that help unlock actionable insights about our business. BigQuery itself allows for out-of-the-box predictive analytics using SQL via BigQuery ML, and a key differentiator for us is that all Google Marketing Platform data is natively accessible for analysis within BigQuery.

But data possession only addresses half of the opportunity: we needed a powerful and modern business intelligence platform to help make sense of the vast amounts of data flowing into the system. Core to the search was finding a partner that approached BI in a way that fit with our forward-looking strategy.

Our DTC team relies on the accurate measurement of variable metrics like Customer Acquisition Cost (CAC), Customer Lifetime Value (CLV), Churn, and Net Promoter Score (NPS) as a bellwether of the health of the business and monitoring these figures on a daily basis is paramount to success. To enable us to keep an accurate pulse on strategic KPIs, we considered several incumbent BI platforms. Ultimately we selected Google Cloud’s Looker for a range of benefits that separated it from the rest of the pack.


From a vision perspective, in this particular case we felt Looker was most aligned with our belief that better decisions are made when everyone has access to accurate, up-to-date information. Looker allows us to realize that vision by surfacing data in a simple web-based interface that empowers everyone to take action with real-time data on critical commercial activities. Furthermore, Looker’s ability to automate and distribute formatted modules to a myriad of stakeholders on a regular cadence increases data literacy and business performance transparency.

From a product perspective, we chose Looker for its cloud offering, web-based interface, and centralized, agile modeling layer that creates a trusted environment for all users to confidently interact with data, without any actual data extraction. While other BI tools have centralized semantic layers that require skilled IT resources, in our experience those can lead to bottlenecks and limited agility. With Looker’s semantic layer, LookML, our BI team, led by Peter Donald, can easily build upon their SQL knowledge to add both a high degree of control and flexibility to our data model. The fully browser-based development environment allows the data team to rapidly develop, test, and deploy code, backed by robust and seamless Git source code management.

In parallel, LookML empowers  business users to collaborate without the need for advanced SQL knowledge. Our data team curates interactive data experiences with Looker to help scale access and adoption. Business users can explore ad hoc analysis, create dashboards, and develop custom data experiences in the web-based environment to get the answers they need without relying on IT resources each time they have a new question, while also maintaining the confidence that the underlying data will always be accurate. This helps us meet our primary goal of providing all businesses users with the data access they need to monitor the pulse of key metrics in near real-time.

Impact and future of DTC BI at Constellation

In short order, taking a modern and integrated approach to the DTC technology stack has delivered economic impact across the portfolio, helping our team understand and combat customer churn, increase conversion rates, and optimize the ratio of customer acquisition cost (CAC) to customer lifetime value (CLV). Perhaps most important is the benefit it can provide to the customer base. Mining customer data and consumer behavior generates insights into what our customers are seeking, showing us where to supply more of it, or less. For example, observing sales velocity and conversion rates by SKU or by region can help us better understand changes in customer taste profiles and fluctuations in demand, providing the foundation for a more powerful innovation pipeline and more effective sales and distribution tactics in wholesale. Our team has also been an early pilot tester for Looker’s new integration with Customer Match, which contributes to the virtuous cycle between data insight and data activation. In the future, our plan is to leverage this cycle to amplify the impact of Google Ads across Search, Shopping, and YouTube placements for the wine and spirits portfolio.

The operational impact of Looker is also substantial: our team estimates that the number of hours needed to reach critical business decisions has been reduced by nearly 60%, boosting productivity and accelerating the daily operating rhythm. A thoughtfully curated technology stack together with a modern BI solution allows us to stay at the vanguard of the industry. While the DTC sales channel is not designed to surpass the core business of wholesale for Constellation in terms of size, the approach enables unparalleled insights and measurement abilities that will pay dividends for the entire business for years to come.



BigQuery workload management best practices

In the most recent season of BigQuery Spotlight, we discussed key concepts like the BigQuery resource hierarchy, query processing, and the reservation model. This blog focuses on extending those concepts to operationalize workload management for various scenarios. We will discuss the following topics:

BigQuery’s flexible query cost options

Workload management key concepts

Reservation application patterns

Capacity planning best practices

Automation tips

BigQuery’s flexible query cost options

BigQuery provides predictable and flexible pricing models for workload management. There are mainly 2 types: On-demand pricing and Flat-rate pricing. You can easily mix and match these pricing models to get the best value for money. 

With on-demand pricing, you pay per query. This is suitable for initial experimentation or small workloads. Flat-rate pricing consists of short-term and long-term commitments. With short-term commitments, or flex slots, you can buy slots for durations as short as 60 seconds, which enables burst use cases like seasonal spikes. With long-term commitments, you buy slots per month or year. Monthly and annual commitments are the best choice for ongoing or complex workloads that need dedicated resources at a fixed cost.

Workload management

In this section we will cover three key concepts: commitments, reservations, and assignments.

With flat-rate pricing, you purchase a commitment: a dedicated number of BigQuery slots. The first time you buy a slot commitment, BigQuery creates a default reservation and assigns your entire Google Cloud organization to it. Commitments are purchased in a dedicated administration project, which centralizes the billing and management of purchased slots. Slots are a regional resource, meaning they are purchased in a specific region or multi-region (e.g. US) and can only be used for jobs run on data stored in that region.

A reservation is a pool of slots created from a commitment. An assignment allocates slots within a reservation to a project, folder, or the entire organization. If you don’t create any assignments, BigQuery automatically shares the slots across your organization. You can specify which jobs should use each reservation by indicating a job type of QUERY, PIPELINE (which includes LOAD, EXTRACT, and COPY jobs), or ML_EXTERNAL. You can also force a specific project to use on-demand slots by assigning it to a NONE reservation.

Check the managing your workloads and reservations documentation to learn more about using these concepts. 
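As a rough sketch of how those three concepts map to the Reservation API, the Python snippet below buys a flex commitment, creates a reservation, and assigns a project to it; the project names, region, and slot counts are placeholders.

```python
from google.cloud import bigquery_reservation_v1 as reservation

# Hypothetical admin project and region; all slot purchases and reservations
# live under this single administration project.
client = reservation.ReservationServiceClient()
parent = "projects/my-admin-project/locations/US"

# 1. Buy a commitment (here: 500 flex slots).
commitment = client.create_capacity_commitment(
    parent=parent,
    capacity_commitment=reservation.CapacityCommitment(
        slot_count=500,
        plan=reservation.CapacityCommitment.CommitmentPlan.FLEX,
    ),
)

# 2. Carve a reservation out of the committed capacity.
critical = client.create_reservation(
    parent=parent,
    reservation_id="critical-bi",
    reservation=reservation.Reservation(slot_capacity=300, ignore_idle_slots=False),
)

# 3. Assign a project's QUERY jobs to that reservation.
assignment = client.create_assignment(
    parent=critical.name,
    assignment=reservation.Assignment(
        assignee="projects/my-analytics-project",
        job_type=reservation.Assignment.JobType.QUERY,
    ),
)
print(commitment.name, critical.name, assignment.name)
```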

Resource Hierarchy

Each level in the GCP resource hierarchy inherits the assignment from the level above it, unless you override it. However, the lowest granularity of slot assignment always takes precedence. For example, let’s say the organization is assigned to the “default” reservation. Any folder or project (like Project F) in the org will use the corresponding 100 slots. However, the dedicated reservation assignments for Storage (300) and Compute (500) folders will take precedence over the “default” reservation. Similarly, Project E’s “compute-dev” assignment with 100 slots will take precedence.  In this case, precedence means that they will leverage the available slots from the “storage-prod” and “compute-prod” reservations before pulling from other reservations.


Idle slot sharing

BigQuery optimizes resource utilization with its unique idle slot sharing capability, not found in any other cloud based data warehouses, which allows any idle slots in a reservation to be available for other reservations to use. As soon as the reservation needs that capacity back, it gets it while queries consuming idle slots simply go back to using their resources as before. This happens in real-time for every slot. This means that all capacity in an organization is available to be used at any time.  

Reservation applications patterns

Priority based allocation

Organizations can implement priority-based slot consumption using reservations and idle slot sharing. High-priority and low-priority reservations can be used to move jobs in and out of critical and non-critical projects respectively. You can use reservations with a small number of slots, and with idle slot sharing disabled, to contain expensive queries or ad-hoc workloads. You can also disable idle slot sharing when you want to get slot estimates for proof-of-concept workloads. Finally, the default reservation, or reservations with no slots, can be used for running the lowest-priority jobs; projects assigned to these reservations will only use idle slots.


For example, 

A company has a 5000 slot annual commitment for their organization

All projects in the organization are sharing these 5000 slots (see BigQuery fair scheduling for more details)

Without dedicated reservations, they have found that some critical business reports are delayed, or run after the non-critical ones

Additionally, some unapproved or ad-hoc workloads are consuming a lot of slots

Instead, we would recommend that they create 3 compute projects

Critical –  assigned to a reservation with 3000 slots

Non-critical – assigned to a reservation with 1500 slots

Idle slots can be freely consumed by the two reservations above

Ad-hoc – assigned to a reservation with 500 slots and idle slots sharing disabled

With this method, critical workloads are guaranteed at least 3000 slots, non-critical workloads are guaranteed at least 1500 slots, and ad-hoc workloads are guaranteed to consume no more than 500 slots 

Mixed-mode reservation

Organizations do not need to pick just one pricing method; instead, they can leverage flat-rate for some use cases and on-demand for others. Many BigQuery administrators choose to use an on-demand project for loading data. However, if you need to guarantee that data is loaded using a certain number of slots (ensuring a faster turnaround time), you can use assignments for LOAD jobs.

Additionally, on-demand projects can be useful for predictable workloads where on-demand pricing is more cost effective. Below, we highlight an example of mixing and matching both pricing models in the same organization.


Folder 1 projects have access to all the idle slots from the 5k commitment.

Project B has been explicitly assigned to the ‘Executive BI’ reservation with 1000 slots – to make sure Project B gets a minimum of 1000 slots for critical analytics workloads.

Folder 2 projects also have access to all the idle slots from the 5k commitment.

Folder 2 has also been assigned to the ‘ML Projects’ reservation – to make sure that projects within the folder have access to a minimum of 2k slots for ML activities.

However, Project D has been explicitly assigned to the reservation called ‘none’ so that it uses on-demand slots instead of any slots from the commitment. This is because it is more cost effective for this team to run predictable transformation workloads for machine learning activity in this project, which will have access to a pool of 2k on-demand slots.

Folder 3 has been assigned to the ‘Load Jobs’ reservation for ingestion workloads. Therefore, Project E has access to a minimum of 500 slots for critical data loads, plus any additional idle slots from the org-level reservations.

Capacity planning best practices

The following are general guidelines for choosing pricing options for given workloads:

For highly interactive compute projects, we recommend that you test performance and concurrency needs to assign the proper number of committed slots to the project (more on this below).

For projects with low interactivity, i.e. mainly batch processes with heavy data processing, we recommend using on-demand slots as the more cost-effective option.

High-priority workloads with strict SLAs such as critical BI reports and ML models would benefit from using dedicated slots. 

During use case on-boarding, make sure to review the dataset sizes and understand the batch jobs. Potential load sizing can be done via estimation or through technical proof-of-concepts.

Actively monitor slot utilization to make sure you have purchased and assigned an optimal number of slots for given workloads.

Scaling throughput with slots

BigQuery dynamically re-assesses how many slots should be used to execute each stage of a query, which enables powerful performance in terms of both throughput and runtime. The chart below shows how BigQuery throughput scales with an increase in the number of available slots, compared against a traditional database (TD: black line). The test was done with more than 200 TB of data and queries of varying complexity, and throughput was measured as the number of queries completed within 20 minutes at the given slot capacity.

Leveraging the performance test metrics above, one can estimate the number of slots needed for simple to medium complexity queries to achieve the desired level of throughput. In general, BigQuery’s throughput increases with a small increase in the number of concurrent queries. However, for larger increases, there are other options to achieve the desired level of throughput. For example, in the chart above, if the number of concurrent queries increases from 200 to 300 for simple queries, there are two options to achieve the desired level of throughput:

Fixed slots: With fixed slot capacity, let’s say 10K slots, the throughput increases from 1000 to 1200 (as seen above). This is due to BigQuery’s fair resource sharing and dynamic optimization of each step of the query. So, if the average runtime is not impacted, you can continue to use the same capacity (or fixed slots). However, you need to monitor to make sure the average runtime stays within the acceptable SLA.

Increased slot capacity: If you need the same or better runtime and higher throughput for workloads with more concurrent queries, then you will need more slots. The chart shows how providing more slots results in more throughput for the same number of concurrent queries.

Scaling run-time with slots

BigQuery’s query runtime depends on four main factors: the number of slots, the number of concurrent queries, the amount of data scanned, and the complexity of the query. Increasing the number of slots results in a faster runtime, so long as the query work can continue to be parallelized. If part of the query cannot be delegated to “free” slots, then adding more slots will not make it run faster, even when additional slots are available. In the chart below, you can see that for complex queries the runtime drops from 50 seconds to 20 seconds when you increase slot capacity from 20k to 30k (with 100 concurrent queries).

You can test out your own query runtime and throughput to determine the optimal number of slots to purchase and reserve for certain workloads. Some tips for running BigQuery performance testing are:

Use large datasets (ideally more than 50 TB) for throughput testing

Use queries of varying complexity

Run jobs with a varying number of available slots  

Use JMeter for automation (check the resources on GitHub)

Create trend reports for:

Avg slot usage and query runtimes

Number of concurrent queries

Throughput (how many queries complete over X duration of time)

Slot utilization (total slot usage / total available capacity for X duration of time)

Avg. wait time
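As a starting point for those trend reports, here is a rough sketch of querying INFORMATION_SCHEMA job statistics with the BigQuery Python client. The project name and region qualifier are placeholders, and the aggregation window and metrics are examples you would adapt to your own testing setup.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-admin-project")  # hypothetical project

# Hourly trend of concurrency, average runtime, and slot usage over the last 7 days.
sql = """
SELECT
  TIMESTAMP_TRUNC(creation_time, HOUR) AS hour,
  COUNT(*) AS queries,
  AVG(TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)) / 1000 AS avg_runtime_sec,
  SUM(total_slot_ms) / (1000 * 60 * 60) AS slot_hours
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE'
GROUP BY hour
ORDER BY hour
"""

for row in client.query(sql).result():
    print(f"{row.hour}: {row.queries} queries, "
          f"{row.avg_runtime_sec:.1f}s avg runtime, {row.slot_hours:.1f} slot-hours")
```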

Load slots estimation workflow

If you are looking for guaranteed SLAs and better performance with your data ingestion, we recommend creating dedicated reservations for your load jobs. Estimating slots required for loading data is easy with this publicly available load slot calculator and the following estimation workflow.

The following factors need to be considered to get load slot estimations:

Dataset size 

Dataset Complexity: Number of fields | Number of nested/repeated fields

Data Format/ Conversion: Thrift LZO | Parquet LZO | Avro

Table Schema: Is the table Partitioned or Clustered?

Load frequency:  Hourly | Daily | Every n-hours

Load SLA: 1 hour for hourly partition loads | 4 hours for daily/snapshot loads

Historical Load Throughput: Estimated data size loaded per 2K slots per day
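As a back-of-the-envelope illustration of how these factors combine (the calculator linked above remains the authoritative tool, and all numbers below are hypothetical):

```python
# Rough illustration only: the public load slot calculator is the authoritative
# tool; this simple math just shows how the inputs relate.

historical_tb_per_day_per_2k_slots = 40  # observed TB loaded per day with 2,000 slots (hypothetical)
batch_size_tb = 5                        # size of a single hourly batch (hypothetical)
load_sla_hours = 1                       # the batch must land within 1 hour

# Throughput per slot, scaled to the SLA window.
tb_per_slot_per_hour = historical_tb_per_day_per_2k_slots / 2000 / 24

estimated_slots = batch_size_tb / (tb_per_slot_per_hour * load_sla_hours)
print(f"Estimated load slots: {estimated_slots:,.0f}")  # ~6,000 slots in this example
```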

Automation tips

Optimization with flex slots

Consider a scenario with a compute project that has spikes in analysis during the last five days of every month, something common in many financial use cases. This is a predictable need for extra compute over a short duration. In contrast, there could also be spikes from completely ad-hoc, non-seasonal workloads. The following automation can be applied to optimize cost and resource utilization without paying for peak capacity over long commitment periods.

From t0 to t1 everything is good: we are hitting SLAs, and we’re paying no more than we need. But t1 to t3 is our peak load time. If we size for steady state, performance suffers during peak demand and SLAs are missed. If we size for peak, we can make SLAs, but we pay too much when off-peak.

A better solution would be to monitor for a rise in slot consumption, purchase flex slots, either using the Reservation API or BigQuery reservation SQL statements, and then assign the slots to the necessary resources. You can use quota settings and automate the end-to-end flex slots cycle with alerts that trigger the flex slot purchase. For more details, check out this Practical Example for leveraging alerts and an example of putting everything together as a flow.
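For example, the alert’s handler (such as a Cloud Function) might purchase the flex commitment with the Reservation API along these lines. This is a minimal sketch; the admin project, location, and slot count are placeholders.

```python
from google.cloud import bigquery_reservation_v1 as reservation

client = reservation.ReservationServiceClient()
parent = client.common_location_path("admin-project", "US")  # hypothetical admin project

# Buy 500 flex slots when the alert fires; flex commitments can be deleted
# shortly after purchase, so they suit short peak windows.
commitment = client.create_capacity_commitment(
    parent=parent,
    capacity_commitment=reservation.CapacityCommitment(
        slot_count=500,
        plan=reservation.CapacityCommitment.CommitmentPlan.FLEX,
    ),
)
print("Purchased:", commitment.name)

# After the peak (t3), delete the flex commitment to stop the charges:
# client.delete_capacity_commitment(name=commitment.name)
```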

Take action

By default, BigQuery projects are assigned to the on-demand pricing model, where you pay for the amount of bytes scanned. Using BigQuery Reservations, you can switch to flat-rate pricing by purchasing commitments. Commitments are purchased in units of BigQuery slots. The cost of all bytes processed is included in the flat-rate price. Key benefits of using BigQuery Reservations include:

Predictability: Flat-rate pricing offers predictable and consistent costs. You know up-front what you are spending.

Flexibility: You choose how much capacity to purchase. You are billed a flat rate for slots until you delete the capacity commitment. You can also combine both the billing models!

Commitment discounts: BigQuery offers flat-rate pricing at a discounted rate if you purchase slots over a longer duration of time (monthly, annual). 

Workload management: Slot commitments can be further bucketed into reservations and assigned to BigQuery resources to provide dedicated capacity for various workloads, while allowing seamless sharing of any unused slots across workloads.

Centralized purchasing: You can purchase and allocate slots for your entire organization. You don’t need to purchase slots for each project that uses BigQuery.

Automation: By leveraging flex slots for seasonal spikes or ad-hoc rises in demand, you can scale capacity as needed. Additionally, you can automate the entire process!

With capacity planning in the works, it is important that you also have a framework in place for ongoing monitoring of slots for continuous optimization and efficiency improvements. Check out this blog for a deep dive on leveraging INFORMATION_SCHEMA, and use this Data Studio dashboard, or this Looker block, as a monitoring template.

Related Article

BigQuery Admin reference guide: Data governance

Learn how to ensure your data is discoverable, secure and usable inside of BigQuery.

Read Article


What’s new with Splunk Dataflow template: Automatic log parsing, UDF support, and more!


Last year, we released the Pub/Sub to Splunk Dataflow template to help customers easily and reliably export their high-volume Google Cloud logs and events into their Splunk Enterprise environment or their Splunk Cloud on Google Cloud (now in Google Cloud Marketplace). Since launch, we have seen great adoption across both enterprises and digital natives using the Pub/Sub to Splunk Dataflow template to get insights from their Google Cloud data.

Pub/Sub to Splunk Dataflow template used to export Google Cloud logs into Splunk HTTP Event Collector (HEC)

We have been working with many of these users to identify and add new capabilities that not only address their feedback but also reduce the effort needed to integrate and customize the Splunk Dataflow template. 

Here’s the list of feature updates which are covered in more detail below:

Automatic log parsing with improved compatibility with Splunk Add-on for GCP
More extensibility with user-defined functions (UDFs) for custom transforms
Reliability and fault tolerance enhancements

The theme behind these updates is to accelerate time to value (TTV) for customers by reducing both operational complexity on the Dataflow side and data wrangling (aka knowledge management) on the Splunk side.

We have a reliable deployment and testing framework in place and confidence that it [Splunk Dataflow pipeline] will scale to our demand. It’s probably the best kind of infrastructure, the one I don’t have to worry about. Lead Cloud Security Engineer
Life Sciences company on Google Cloud

We want to help businesses spend less time on managing infrastructure and integrations with third-party applications, and more time analyzing their valuable Google Cloud data, be it for business analytics, IT or security operations. “We have a reliable deployment and testing framework in place and confidence that it will scale to our demand. It’s probably the best kind of infrastructure, the one I don’t have to worry about.” says the lead Cloud Security Engineer of a major Life Sciences company that leverages Splunk Dataflow pipelines to export multi-TBs of logs per day to power their critical security operations.

To take advantage of all these features, make sure to update to the latest Splunk Dataflow template (gs://dataflow-templates/latest/Cloud_PubSub_to_Splunk), or, at the time of this writing, version 2021-08-02-00_RC00 (gs://dataflow-templates/2021-08-02-00_RC00/Cloud_PubSub_to_Splunk) or newer.

More compatibility with Splunk

“All we have to decide is what to do with the time that is given us.”

— Gandalf

For Splunk administrators and users, you now have more time to spend analyzing data, instead of parsing and extracting logs & fields.

Splunk Add-on for Google Cloud Platform

By default, the Pub/Sub to Splunk Dataflow template forwards only the Pub/Sub message body, as opposed to the entire Pub/Sub message and its attributes. You can change that behavior by setting the template parameter includePubsubMessage to true, to include the entire Pub/Sub message as expected by the Splunk Add-on for GCP.

However, in prior versions of the Splunk Dataflow template, when includePubsubMessage=true, the Pub/Sub message body was stringified and nested under the message field, whereas the Splunk Add-on expected a JSON object nested under the data field.

Message body stringified prior to Splunk Dataflow version 2021-05-03-00_RC00

This led customers to either customize their Splunk Add-on configurations (via props & transforms) to parse the payload, or use spath to explicitly parse JSON, and therefore maintain two flavors of Splunk searches (via macros) depending on whether data was pushed by Dataflow or pulled by the Add-on, which was far from an ideal experience. That’s no longer necessary, as the Splunk Dataflow template now serializes messages in a manner compatible with the Splunk Add-on for GCP. In other words, the default JSON parsing, built-in Add-on field extractions and data normalization work out of the box:

Message body as JSON payload as of Splunk Dataflow version 2021-05-03-00_RC00

Customers can readily take advantage of all the sourcetypes in the Splunk Add-on, including Common Information Model (CIM) compliance. Those CIM models are required for compatibility with premium applications like Splunk Enterprise Security (ES) and IT Service Intelligence (ITSI), and they now work without any extra effort on the customer's part, as long as includePubsubMessage=true is set in their Splunk Dataflow pipelines.

Note on updating existing pipelines with includePubsubMessage:

If you’re updating your pipelines from includePubsubMessage=false to includePubsubMessage=true and you are using a UDF, make sure to update your UDF implementation, since the function’s input argument is now the Pub/Sub message wrapper rather than the underlying message body, which is now in the nested data field. In your function, assuming you save the JSON-parsed version of the input argument in an obj variable, the body payload is now nested in obj.data. For example, if your UDF is processing a log entry from Cloud Logging, a reference to obj.protoPayload needs to be updated to obj.data.protoPayload. 

Splunk HTTP Event Collector

We also heard from customers who wanted to use Splunk HEC ‘fields’ metadata to set custom index-time field extractions in Splunk. We have therefore added support for that last remaining Splunk HEC metadata field. Customers can now easily set these index-time field extractions on the sender side (Dataflow pipeline) rather than configuring non-trivial props & transforms on the receiver (Splunk indexer or heavy forwarder). A common use case is to index metadata fields from Cloud Logging, namely resource labels such as project_id and instance_id to accelerate Splunk searches and correlations based on unique Project IDs and Instance IDs. See example 2.2 under ‘Pattern 2: Transform events’ in our blog about getting started with Dataflow UDFs for a sample UDF on how to set HEC fields metadata using resource.labels object.

More extensibility with utility UDFs

“I don’t know, and I would rather not guess.”

— Frodo

For Splunk Dataflow users who want to tweak the pipeline’s output format, you can do so without knowing Dataflow or Apache Beam programming, or even having a developer environment setup. You might want to enrich events with additional metadata fields, redact some sensitive fields, filter undesired events, or set Splunk metadata such as destination index to route events to.

When deploying the pipeline, you can reference a user-defined function (UDF), that is a small snippet of JavaScript code, to transform the events in-flight. The advantage is that you configure such UDF as a template parameter, without changing, re-compiling or maintaining the Dataflow template code itself. In other words, UDFs offer a simple hook to customize the data format while abstracting low-level template details.

When it comes to writing a UDF, you can eliminate guesswork by starting with one of the utility UDFs listed in Extend your Dataflow template with UDFs. That article also includes a practical guide on testing and deploying UDFs.

More reliability and error handling

“The board is set, the pieces are moving. We come to it at last, the great battle of our time.”

— Gandalf

Last but not least, the latest Dataflow template improves pipeline fault tolerance and provides a simplified Dataflow operator troubleshooting experience.

In particular, the Splunk Dataflow template’s retry capability (with exponential backoff) has been extended to cover transient network failures (e.g. Connection timeout), in addition to transient Splunk server errors (e.g. Server is busy). Previously, the pipeline would immediately drop these events in the dead-letter topic. While this avoids data loss, it added an unnecessary burden for the pipeline operator, who is responsible for replaying these undelivered messages stored in the dead-letter topic. The new Splunk Dataflow template minimizes this overhead by attempting retries whenever possible, and only dropping messages into the dead-letter topic when the issue is persistent (e.g. Invalid Splunk HEC token), or when the maximum retry elapsed time (15 min) has expired. For a breakdown of possible Splunk delivery errors, see Delivery error types in our Splunk Dataflow reference guide.

Finally, as more customers adopt UDFs to customize the behavior of their pipelines per the previous section, we’ve invested in better logging for UDF-based errors such as JavaScript syntax errors. Previously, you could only troubleshoot these errors by inspecting the undelivered messages in the dead-letter topic. You can now view these errors in the worker logs directly from the Dataflow job page in Cloud Console, or by filtering the worker logs in Logs Explorer.

You can also set up an alert policy in Cloud Monitoring to notify you whenever such a UDF error happens, so you can review the message payload in the dead-letter topic for further inspection. You can then either tweak the source log sink (if applicable) to filter out these unexpected logs, or revise your UDF logic to properly handle them.

What’s next?

“Home is behind, the world ahead, and there are many paths to tread through shadows to the edge of night, until the stars are all alight.”

— J. R. R. Tolkien

We hope this gives you a good overview of recent Splunk Dataflow enhancements. Our goal is to minimize your operational overhead for logs aggregation & export, so you can focus on getting real-time insights from your valuable logs.

To get started, check out our reference guide on deploying production-ready log exports to Splunk using Dataflow. Take advantage of the associated Splunk Dataflow Terraform module to automate deployment of your log export, as well as these sample UDF functions to customize your logs in-flight (if needed) before delivery to Splunk.

Be sure to keep an eye out for more Splunk Dataflow enhancements in our GitHub repo for Dataflow Templates. Every feature covered above is customer-driven, so please continue to submit your feature requests as GitHub repo issues, or directly from your Cloud Console as support cases.

Acknowledgements

Special thanks to Matt Hite from Splunk both for his close collaboration in product co-development, and for his partnership in serving joint customers.

Related Article

Extend your Dataflow template with UDFs

Learn how to easily extend a Cloud Dataflow template with user-defined functions (UDFs) to transform messages in-flight, without modifyi…

Read Article


BigQuery Admin reference guide: Data governance


Hopefully you’ve been following along with our BigQuery Admin series and are well on your way to getting ramped up with BigQuery. Now that you’re equipped with the fundamentals, let’s talk about something that’s relevant for all data professionals – data governance. 

What does data governance mean?

Data governance is everything you do to ensure your data is secure, private, accurate, available, and usable inside of BigQuery. With good governance, everyone in your organization can easily find – and leverage – the data they need to make effective decisions. All while minimizing the overall risk of data leakage or misuse, and ensuring regulatory compliance. 

BigQuery security features

Because BigQuery is a fully-managed service, we take care of a lot of the hard stuff for you! Like we talked about in our post on BigQuery Storage Internals, BigQuery data is replicated across data centers to ensure reliability and availability. Plus, data is always encrypted at rest. By default, we’ll manage encryption keys for you. However, you also have the option to use customer-managed encryption keys, leveraging Cloud KMS to automatically rotate and destroy encryption keys. 

You can also leverage Google Virtual Private Cloud (VPC) Service Controls to restrict traffic to BigQuery. When you correctly apply these controls, unauthorized networks can’t access BigQuery data, and data can’t be copied to unauthorized Google Cloud projects. Free communication can still occur within the perimeter, but communication is restricted across the perimeter.

Aside from leveraging BigQuery’s out-of-the-box security features, there are also ways to improve governance from a process perspective. In this post, we’ll walk you through the different tactics to ensure data governance at your organization. 

Dataset onboarding: Understanding & classifying data 

Data governance starts with dataset onboarding. Let’s say you just received a request from someone on your eCommerce team to add a new dataset that contains customer transactions. The first thing you’ll need to do is understand the data. You might start by asking questions like these:

What information does this contain?
How will it be used to make business decisions?
Who needs access to this data?
Where does the data come from, and how will analysts get access to it in BigQuery?

Understanding the data helps you make decisions on where the new table should live in BigQuery, who should have access to this data, and how you’ll plan to make the data accessible inside of BigQuery (e.g. leveraging an external table, batch loading data into native storage, etc.).  

For this example, the transactions live in an OLTP database. Let’s take a look at what information is contained in the existing table in our database. Below, we can see that this table has information about the order (when it was placed, who purchased it, any additional comments for the order), and details on the items that were purchased (the item ID, cost, category, etc.).


Now that we have an idea of what data exists in the source, and what information is relevant for the business, we can determine which fields we need in our BigQuery table and what transformations are necessary to push the data into a production environment.

Classifying information

Data classification means that you are identifying the types of information contained in the data, and storing it as searchable metadata.  By properly classifying data you can make sure that it’s handled and shared appropriately, and that data is discoverable across your organization. 

Since we know what the production table should look like, we can go ahead and create an empty BigQuery table, with the appropriate schema, that will house the transactions. 

As far as storing metadata about this new table, we have two different options. 

Using labels

On the one hand, we can leverage labels. Labels can be used on many BigQuery resources, including projects, datasets and tables. They are key:value pairs that can be used to filter data in Cloud Monitoring, or in queries against the Information Schema to find data that pertains to specific use cases.
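For instance, attaching labels to a table takes only a few lines with the BigQuery Python client (the dataset, table, and label values below are hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Attach key:value labels to the (hypothetical) transactions table.
table = client.get_table("ecommerce.transactions")
table.labels = {"team": "ecommerce", "contains_pii": "true"}
table = client.update_table(table, ["labels"])
print(table.labels)
```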


Although labels provide logical segregation and management of different business purposes in the Cloud ecosystem, they are not meant to be used in the context of data governance. Labels cannot specify a schema, and you can’t apply them to specific fields in your table.  Labels cannot be used to establish access policies or track resource hierarchy. 

It’s pretty clear that our transactions table may contain personally identifiable information (PII). Specifically, we may want to mark the email address column as “Has_PII” : True. Instead of using labels on our new table, we’ll leverage Data Catalog to establish a robust data governance policy, incorporating metadata tags on BigQuery resources and individual fields.

Using data catalog tags

Data Catalog is Google Cloud’s data discovery and metadata management service. As soon as you create a new table in BigQuery, it is automatically discoverable in Data Catalog. Data Catalog tracks all technical metadata related to a table, such as  name, description, time of creation, column names and datatypes, and others.

In addition to the metadata that is captured through the BigQuery integration, you can create schematized tags to track additional business information. For example, you may want to create a tag that tracks information about the source of the data, the analytics use case related to the data, or column-level information related to security and sharing. Going back to that email column we mentioned earlier, we can simply attach a column-level governance tag to the field and fill out the information by specifying that email_address is not encrypted, it does contain PII, and more specifically it contains an email address.

While this may seem like a fairly manual process, Data Catalog has a fully equipped API which allows for tags to be created, attached and updated programmatically. With tags and technical metadata captured in a single location, data consumers can come to Data Catalog and search for what they need.
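As a rough sketch of that programmatic path, the snippet below looks up the table’s automatically created entry and attaches a column-level tag using the Data Catalog Python client. The project, tag template, and field names are assumptions; you would create the tag template first.

```python
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

# Look up the Data Catalog entry that was auto-created for the BigQuery table
# (project, dataset, and table names are hypothetical).
entry = client.lookup_entry(
    request={
        "linked_resource": (
            "//bigquery.googleapis.com/projects/retail-prod"
            "/datasets/ecommerce/tables/transactions"
        )
    }
)

# Attach a column-level tag from an existing "data_governance" tag template
# (template name and fields are assumptions).
tag = datacatalog_v1.Tag(
    template="projects/retail-prod/locations/us/tagTemplates/data_governance",
    column="email_address",
    fields={
        "has_pii": datacatalog_v1.TagField(bool_value=True),
        "pii_type": datacatalog_v1.TagField(string_value="EMAIL"),
        "encrypted": datacatalog_v1.TagField(bool_value=False),
    },
)
tag = client.create_tag(parent=entry.name, tag=tag)
```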

Ingesting & staging data

With metadata for the production table in place, we need to focus on  how to push data into this new table. As you probably know, there are lots of different ways to pre-process and ingest data into BigQuery. Often customers choose to stage data in Google Cloud Services to kick off transformation, classification or de-identification workflows. There are two pretty common paths for staging data for batch loading:

Stage data in a Cloud Storage bucket: Pushing data into a Cloud Storage bucket before ingesting it into BigQuery offers flexibility in terms of data structure and may be less expensive for storing large amounts of information. Additionally, you can easily kick off workflows when new data lands in a bucket by using Pub/Sub to trigger transformation jobs. However, since transformations will happen outside of the BigQuery service, data engineers will need familiarity with other tools or languages. Blob storage also makes it difficult to track column-level metadata.

Stage data in a BigQuery staging container: Pushing data into BigQuery gives you the opportunity to track metadata for specific fields earlier in the funnel, through BigQuery’s integration with Data Catalog. When running scan jobs with Data Loss Prevention (we’ll cover this in the next section), you can leave out specific columns and store the results directly in the staging table’s metadata inside of Data Catalog. Additionally, transformations to prepare data for production can be done using SQL statements, which may make them easier to develop and manage. 

Identifying (and de-identifying) sensitive information 

One of the hardest problems related to data governance is identifying any sensitive information in new data. Earlier we talked through tracking known metadata in Data Catalog, but what happens if we don’t know whether data contains any sensitive information? This is especially relevant for free-form text fields, like the comments field in our transactions. With the data staged in Google Cloud, there’s an opportunity to programmatically identify any PII, or even remove sensitive information from the data, using Data Loss Prevention (DLP).

DLP can be used to scan data for different types of sensitive information such as names, email addresses, locations, credit card numbers and others. You can kick off a scan job directly from BigQuery, Data Catalog, or the DLP service or API. DLP can scan data that is staged in BigQuery or in Cloud Storage. Additionally, for data stored in BigQuery, you can have DLP push the results of the scan directly into Data Catalog.

You can also use the DLP API to de-identify data. For example, we may want to replace any instances of names, email addresses and locations with an asterisk (“*”). In our case, we can leverage DLP specifically to scan the comments column from our staging table in BigQuery, save the results in Data Catalog, and, if there are instances of sensitive data, run a de-identification workflow before pushing the sanitized data into the production table. Note that building a pipeline like the one we’re describing does require the use of some other tools. 

We could use a Cloud Function to make the API call, and an orchestration tool like Cloud Composer to run each step in the workflow (trying to decide on the right orchestration tool? Check out this post). You can walk through an example of running a de-identification workflow using DLP and Composer in this post.
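To make the de-identification step concrete, here is a minimal sketch of calling the DLP API from Python to replace names, email addresses, and locations with an asterisk. The project ID and sample text are hypothetical, and a production pipeline would wrap this call in the Cloud Function described above.

```python
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/retail-prod"  # hypothetical project

# Replace names, email addresses, and locations in a free-text comment with "*".
response = dlp.deidentify_content(
    request={
        "parent": parent,
        "inspect_config": {
            "info_types": [
                {"name": "PERSON_NAME"},
                {"name": "EMAIL_ADDRESS"},
                {"name": "LOCATION"},
            ]
        },
        "deidentify_config": {
            "info_type_transformations": {
                "transformations": [
                    {
                        "primitive_transformation": {
                            "replace_config": {"new_value": {"string_value": "*"}}
                        }
                    }
                ]
            }
        },
        "item": {"value": "Please ship to Jane Doe, jane@example.com, in Austin."},
    }
)
print(response.item.value)  # sanitized text, ready for the production table
```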

Data sharing

BigQuery Identity Access Management

Google Cloud as a whole leverages Identity Access Management (IAM) to manage permissions across cloud resources. With IAM, you manage access control by defining who (identity) has what access (role) for which resource. BigQuery, like other Google Cloud resources, has several predefined roles. Or you can create custom roles based on more granular permissions.

When it comes to granting access to BigQuery data, many administrators choose to grant Google Groups, representing your company’s different departments, access to specific datasets or projects – so policies are simple to manage. You can see some examples of different business scenarios and the recommended access policies here. 

In our retail use case, we have one project for each team. Each team’s Google Group would be granted the BigQuery Data Viewer role to access information stored in their team’s project. However, there may be cases where someone from the ecommerce team needs data from a different project – like the product development team project. One way to grant limited access to data is through the use of authorized views.

Protecting data with authorized views

Giving a view access to a dataset is also known as creating an authorized view in BigQuery. An authorized view allows you to share query results with particular users and groups without giving them access to the underlying source data. So in our case, we can simply write a query to grab the pieces of information the ecommerce team needs to effectively analyze the data and save that view into the existing ecommerce project that they already have access to.
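A minimal sketch of that flow with the BigQuery Python client might look like the following; the project, dataset, view, and column names are all hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client(project="retail-prod")  # hypothetical project

# 1. Create a view in the ecommerce team's project with only the columns they need.
view = bigquery.Table("retail-prod.ecommerce.active_orders_view")
view.view_query = """
    SELECT order_id, created_at, item_category, sale_price
    FROM `retail-prod.product_dev.transactions`
"""
view = client.create_table(view)

# 2. Authorize the view on the source dataset, without granting the team
#    direct access to the underlying table.
source_dataset = client.get_dataset("retail-prod.product_dev")
entries = list(source_dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role=None,
        entity_type="view",
        entity_id={
            "projectId": "retail-prod",
            "datasetId": "ecommerce",
            "tableId": "active_orders_view",
        },
    )
)
source_dataset.access_entries = entries
client.update_dataset(source_dataset, ["access_entries"])
```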

Column-level access policies

Aside from controlling access to data using standard IAM roles, or granting access to query results through authorized views, you also can leverage BigQuery’s column-level access policies. For example, remember that email address column we marked as containing PII earlier in this post? We may want to ensure that only members with high-security level clearance have access to query those columns. We can do this by:

First, defining a taxonomy in Data Catalog, including a “High” policy tag for fields with high-security level clearance
Next, adding our group of users who need access to highly sensitive data as Fine Grained Access Readers to the “High” resource
Finally, setting the policy tag on the email column

You can find some tips on creating column-level access policies in our documentation on best practices.
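Once the taxonomy exists, that last step, attaching the policy tag to the email column, can be scripted with the BigQuery Python client, as in this sketch (the taxonomy and policy tag IDs, project, and table names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client(project="retail-prod")  # hypothetical project

# Resource name of the "High" policy tag from the Data Catalog taxonomy
# (IDs are placeholders).
high_policy_tag = (
    "projects/retail-prod/locations/us/taxonomies/1234567890/policyTags/9876543210"
)

table = client.get_table("retail-prod.ecommerce.transactions")
new_schema = []
for field in table.schema:
    if field.name == "email_address":
        # Rebuild the field with the policy tag attached.
        field = bigquery.SchemaField(
            name=field.name,
            field_type=field.field_type,
            mode=field.mode,
            description=field.description,
            policy_tags=bigquery.PolicyTagList(names=[high_policy_tag]),
        )
    new_schema.append(field)

table.schema = new_schema
client.update_table(table, ["schema"])
```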

Row-level access policies

Aside from restricting access to certain fields in our new table, we may want to only grant users access to rows that are relevant to them. One example may be if analysts from different business units only get access to rows that represent transactions for that business unit. In this case, the Google Group that represents the Activewear Team should only have access to orders that were placed on items categorized as “Active”. In BigQuery, we can accomplish this by creating a row-level access policy on the transactions table.

You can find some tips on creating row-level access policies in our documentation on best practices.
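As an illustration, the row-level access policy for the Activewear team could be created with a DDL statement like the one below, here issued through the Python client. The group, table, and column names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client(project="retail-prod")  # hypothetical project

# Only the Activewear team's group can see rows for the "Active" category.
ddl = """
CREATE OR REPLACE ROW ACCESS POLICY activewear_filter
ON `retail-prod.ecommerce.transactions`
GRANT TO ('group:activewear-team@example.com')
FILTER USING (item_category = 'Active')
"""
client.query(ddl).result()
```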

When to use what for data sharing

At the end of the day, you can achieve your goal of securing data using one or more of the concepts we discussed earlier. Authorized Views add a layer of abstraction to sharing data by providing the necessary information to certain users without giving them direct access to the underlying dataset. For cases where you want to transform (e.g. pre-aggregate before sharing) – authorized views are ideal. While authorized views can be used for managing column-level access, it may be preferable to leverage Data Catalog as you can easily centralize access knowledge in a single table’s metadata and control access through hierarchical taxonomies. Similarly, leveraging row level access policies, instead of authorized views to filter out rows, may be preferable in cases where it is easier to manage a single table with multiple access policies instead of multiple authorized views in different places. 

Monitoring data quality

One last element of data governance that we’ll discuss here is monitoring data quality. The quality of your BigQuery data can drop for many different reasons – maybe there was a problem in the data source, or an error in your transformation pipeline. Either way, you’ll want to know if something is amiss and have a way to inform data consumers at your organization. Just like we described earlier, you can leverage an orchestration tool like Cloud Composer to create pipelines for running different SQL validation tests.

Validation tests can be created in a few different ways:

One option is to leverage open source frameworks, like this one that our professional services team put together. Using frameworks like these, you can declare rules for when validation tests pass or fail
Similarly, you can use a tool like Dataform – which offers the ability to leverage YAML files to declare validation rules. Dataform recently came under the Google Cloud umbrella and will be open to new customers soon, join the waitlist here!
Alternatively, you could always roll your own solution by programmatically running queries using built-in BigQuery functionality like ASSERT; if the assertion is not valid, BigQuery will return an error that can inform the next step in your pipeline (see the sketch after this list)
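Here is a minimal sketch of that ASSERT approach, run through the BigQuery Python client; the table and column names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client(project="retail-prod")  # hypothetical project

# A minimal ASSERT-based check: fail if any transaction has a negative sale price.
validation_sql = """
ASSERT NOT EXISTS (
  SELECT 1
  FROM `retail-prod.ecommerce.transactions`
  WHERE sale_price < 0
) AS 'Found transactions with a negative sale_price'
"""

try:
    client.query(validation_sql).result()
    print("Validation passed")
except Exception as err:  # a failed ASSERT surfaces as a query error
    print(f"Validation failed: {err}")
    # ...notify via Composer/Slack and update the Data Catalog quality tag here
```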

Based on the outcome of the validation test, you can have Composer send you a notification using Slack or other built-in notifiers. Finally, you can use Data Catalog’s API to update a tag that tracks the data quality for the given table. Check out some example code here! With this information added to Data Catalog, it becomes searchable by data consumers at your organization so that they can stay informed on the quality of information they use in their analysis. 

What’s next?

One thing that we didn’t mention in this post, but is certainly relevant to data governance, is ongoing monitoring around usage auditing and access policies. We’ll be going into more details on this in a few weeks when we cover BigQuery monitoring as a whole.  Be sure to keep an eye out for more in this series by following me on LinkedIn and Twitter!

Related Article

BigQuery Admin reference guide: Query processing

BigQuery is capable of some truly impressive feats, be it scanning billions of rows based on a regular expression, joining large tables, …

Read Article
