Teaming up with MLB for game-changing sports analytics

Editor’s note: Major League Baseball™ (MLB™) is working closely with Google Cloud to score more use cases for its massive amounts of data. Here’s how the organization is using Google Cloud services to change the game for players, fans, and broadcasters. MLB Senior Director of Software Engineering, Rob Engel, contributed to this blog.

Baseball and statistics go way back. Since the first pitch flew across home plate at a professional ballgame more than 150 years ago, practically every action on the field has been tallied, added, averaged, and saved to chronicle America’s favorite pastime. With explosive growth in the amount and variety of baseball data being collected, when cloud computing came around it was, well, a game changer.

These days, cloud technology enables MLB™ to collect and analyze 25 million unique data points from each of its 2,430 regular season games. From helping players perfect their game, to bringing fans closer to the game, we’ll walk through a few ways MLB is hitting a grand slam with data.

Computing in stadium faster than a fastball

From the second batting practice begins to the time a walk-off hit ends the game, MLB is collecting data on the field. Statcast player and ball tracking technology allows for collection and analysis of a massive amount of baseball data in ways that were never possible in the past. Since 2020, Statcast has been powered by a Hawk-Eye system that uses 12 high-resolution cameras at all 30 MLB stadiums to track every movement of the ball and every player at 30 frames per second, with 18 unique points on each player’s body. As soon as the ball leaves the pitcher’s hand, Hawk-Eye captures roughly 60 data points, including speed and break angle as it reaches the batter.

Through Anthos and a Google Kubernetes Engine cluster, those cameras use on-site processing to turn the video feeds into structured data that’s instantly transmitted to the scoreboard and broadcasters. The result is stats that display faster than a 95-mile-an-hour pitch. And for the fans watching at home, the data from Hawk-Eye enables a live strike-zone visualization centered over home plate.

“Using Anthos, we’re able to do that all on-premises and replicate the entire software infrastructure that we run in Google Cloud,” said Rob Engel, senior director of software engineering at MLB. “It’s deployed on-premises and we don’t have to do anything too different.” This uniformity across deployment environments is key for MLB developers who may be running in the cloud, in a data center, or in a stadium.

Anthos also enables a backup solution as a pinch-hitter if the in-stadium system fails. For example, if the broadcast at Yankee Stadium™ stopped, MLB could run its code across New York City at Citi Field™ where the Mets play, or even in the cloud, and continue broadcasting without interruption. “If we had any issue in any stadium, we can shoot that data up to Google Cloud and process it there,” Engel said. 

Adding context to amazing feats

But what about marrying all that in-stadium data with the years of historical Statcast data? Josh Frost, VP of product management at MLB, explains, “The exit velocity of a ball that was hit was 110 miles an hour—is that good? Is that bad? How does that compare across the league? That’s where we’re really focused as an organization—not just giving data to people but giving it context to make it information that can help them enjoy the game better.” 

While Hawk-Eye can clock a pitch at 95 miles an hour with precise location, it is up to umpires to call the shots and determine whether the pitch was a ball or a strike, or whether a player is safe at first base. That’s where manual operators come in. Before each pitch is thrown, MLB staff manually tag metadata, such as the current pitcher, batter, inning, and so on.

Throughout a game, MLB is constantly uploading game data into Google Cloud to the point where each season amounts to over 25 terabytes of information. The player positional pose tracking data is stored in Bigtable and all the other game data is stored in Cloud SQL for PostgreSQL. And every night MLB runs a batch job using Dataflow to move game data from Bigtable and Cloud SQL to Cloud Storage buckets and BigQuery.
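
To make the shape of that nightly job concrete, here is a minimal Apache Beam sketch in Python. It is not MLB’s actual pipeline: the Bigtable and Cloud SQL reads are stubbed with in-memory sample rows, and the project, bucket, and table names are placeholders.

```python
# A minimal sketch of a nightly Dataflow batch export, under stated assumptions:
# hypothetical project, bucket, and table names; the real sources (Bigtable and
# Cloud SQL) are stubbed with in-memory rows so the pipeline shape stays runnable.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    options = PipelineOptions(
        runner="DataflowRunner",                  # or "DirectRunner" for local testing
        project="my-project",                     # hypothetical project ID
        region="us-central1",
        temp_location="gs://my-temp-bucket/tmp",  # hypothetical bucket
    )
    with beam.Pipeline(options=options) as p:
        rows = p | "ReadGameData" >> beam.Create([
            {"game_pk": 662021, "pitch_speed_mph": 95.4, "exit_velocity_mph": 110.2},
            {"game_pk": 662021, "pitch_speed_mph": 88.1, "exit_velocity_mph": 97.6},
        ])

        # Archive raw JSON records in a Cloud Storage bucket.
        (rows
         | "ToJson" >> beam.Map(json.dumps)
         | "WriteToGCS" >> beam.io.WriteToText("gs://my-archive-bucket/games/part"))

        # Load the same rows into a BigQuery table for analysis.
        (rows
         | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
             "my-project:statcast.game_events",
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
             create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))


if __name__ == "__main__":
    run()
```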

In the MLB Gameday Engine, which has the 150-year-old rules of baseball codified into logic, the organization’s live tracking statistics combine with traditional statistics—including batting average, strikeouts, and at bats. So when a player decides to steal third and sprints at 30 feet per second, MLB can rank that speed and provide context to instantly see if the player is in the top echelon for runners. 

Pitching endless data possibilities

Everything—live, historical and in between—is fed into the MLB Stats API that populates consumer-facing tools like Baseball Savant, where fans can search for things like hit distance and launch angle. It also powers real-time use cases for broadcasters, as well as the MLB app and Film Room. “We’re pulling in data from the API for everything from reviewing major league on-field performance, to player acquisitions, to running our models, to how player performance is going. It’s endless,” said John Krazit, director of baseball systems at the Arizona Diamondbacks™.

With endless data possibilities, MLB is putting together some amazing new experiences. On deck for this year is bringing FieldVision to the next level. This technology gives fans a 3D look at the field using the Hawk-Eye pose tracking data on players’ movements that’s stored in Bigtable. With the ability to generate replays from any position on the field, FieldVision delivers a view beyond what MLB has offered in the past, bringing fans closer to the field right from their desktop or mobile apps.

Now that’s a home run for everyone.

Major League Baseball trademarks and copyrights are used with permission of Major League Baseball. Visit MLB.com.

Bringing together the best of both sides of BI with Looker and Data Studio

In today’s data world, companies have to consider many scenarios in order to deliver insights to users in a way that makes sense to them. On one hand, they must deliver trusted, governed reporting to inform mission-critical business functions. But on the other hand, they must enable a broad population to answer questions on their own, in an agile and self-serve manner. 

Data-driven companies want to democratize access to relevant data and empower their business users to get answers quickly. The trade-off is governance; it can be hard to maintain a single shared source of truth, granular control of data access, and secure centralized storage. Speed and convenience compete with trust and security. Companies are searching for a BI solution that can be easily used by everyone in an organization (both technical and non-technical), while still producing real-time governed metrics.

Today, a unified experience for both self-serve and governed BI gets one step closer, with our announcement of the first milestone in our integration journey between Looker and Data Studio. Users will now be able to access and import governed data from Looker within the Data Studio interface, and build visualizations and dashboards for further self-serve analysis.

This union allows us to better deliver a complete, integrated Google Cloud BI experience for our users. Business users will feel more empowered than ever, while data leaders will be able to preserve the trust and security their organization needs.

This combination of ​self-serve and governed BI together will help enterprises make better data-driven decisions. 

Looker and Data Studio, Better Together 

Looker is a modern enterprise platform for business intelligence and analytics that helps organizations build data-rich experiences tailored to every part of their business. Data Studio is an easy-to-use, self-serve BI solution enabling ad-hoc reporting, analysis, and data mashups across 500+ data sets.

Looker and Data Studio serve complementary use cases. Bringing Looker and Data Studio together opens up exciting opportunities to combine the strengths of both products and a broad range of BI and analytic capabilities to help customers reimagine the way they work with data. 

How will the integration work?

This first integration between these products allows users to connect to the Looker semantic model directly from Data Studio. Users can bring in the data they wish to analyze, connect it to other available data sources, and easily explore and build visualizations within Data Studio. The integration will follow three principles with respect to governance:

Access to data is enforced by Looker’s security features, the same way it is when using the Looker user interface.

Looker will continue to be the single access point for your data.

Administrators will have full capabilities to manage Data Studio and Looker together.

How are we integrating these products?

This is the first step in our roadmap that will bring these two products closer together. Looker’s semantic model allows metrics to be centrally defined and broadly used, ensuring a single version of the truth across all of your data. This integration allows the Looker semantic model to be used within Data Studio reports, allowing people to use the same tool to create reports that rely on both ad-hoc and governed data. This brings together the best of both worlds – a governed data layer, and a self-serve solution that allows analysis of both governed and ungoverned data.

With this announcement, the following use cases will be supported:

Users can turn their Looker-governed data into informative, highly customizable dashboards and reports in Data Studio.

Users can blend governed data from Looker with data available from over 500 data sources in Data Studio, to rapidly generate new insights.

Users can analyze and rapidly prototype ungoverned data (from spreadsheets, csv files, or other cloud sources) within Data Studio.

Users can collaborate in real-time to build dashboards with teammates or people outside the company. 

When will this integration be available to use?

The Data Studio connector for Looker is currently in preview. If you are interested in trying it out, please fill out this form.

Next Steps

This integration is the first of many in our effort to bring Looker and Data Studio closer together. Future releases will introduce additional features to create a more seamless user experience across these two products. We are very excited to roll out new capabilities in the coming months and will keep you updated on our future integrations of the two products.

Meet Google’s unified data and AI offering

Without AI, you’re not getting the most out of your data.
Without data, you risk stale, out-of-date, suboptimal models.

But most companies are still struggling with how to keep these highly interdependent technologies in sync and operationalize AI to take meaningful action from data.

We’ve learned from Google’s years of experience in AI development how to make data-to-AI workflows as cohesive as possible, and as a result our data cloud is the most complete and unified data and AI solution in the market. By bridging data and AI, data analysts can take advantage of user-friendly, accessible ML tools, and data scientists can get the most out of their organization’s data. All of this comes together with built-in MLOps to ensure all AI work — across teams — is ready for production use.

In this blog we’ll show you how all of this works, including exciting announcements from the Data Cloud Summit:

Vertex AI Workbench is now GA, bringing together Google Cloud’s data and ML systems into a single interface so that teams have a common toolset across data analytics, data science, and machine learning. With native integrations across BigQuery, Spark, Dataproc, and Dataplex, data scientists can build, train, and deploy ML models 5X faster than with traditional notebooks.

Introducing Vertex AI Model Registry, a central repository to manage and govern the lifecycle of your ML models. Designed to work with any type of model and deployment target, including BigQuery ML, Vertex AI Model Registry makes it easy to manage and deploy models. 

Use ML to get the most out of your data, no matter the format

Analyzing structured data in a data warehouse, like using SQL in BigQuery, is the bread and butter for many data analysts. Once you have data in a database, you can see trends, generate reports, and get a better sense of your business. Unfortunately, a lot of useful business data isn’t in the tidy tabular format of rows and columns. It’s often spread out across multiple locations and formats, frequently as so-called “unstructured data” (images, videos, audio transcripts, PDFs) that can be cumbersome and difficult to work with.

Here, AI can help. ML models can be used to transcribe audio and videos, analyze language, and extract text from images—that is, to translate elements of unstructured data into a form that can be stored and queried in a database like BigQuery. Google Cloud’s Document AI platform, for example, uses ML to understand documents like forms and contracts. Below, you can see how this platform is able to intelligently extract structured text data from an unstructured document like a resume. Once this data is extracted, it can be stored in a data warehouse like BigQuery.
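
As a rough sketch of what that looks like in practice, the snippet below sends a PDF to a Document AI processor and prints the extracted text and entities. The project, location, processor ID, and file path are all hypothetical placeholders; the processor itself (for example, a form parser) is assumed to have been created beforehand.

```python
# A hedged sketch of extracting structured text from a document with Document AI.
# All identifiers below are placeholders for illustration only.
from google.cloud import documentai_v1 as documentai

project_id, location, processor_id = "my-project", "us", "my-processor-id"

client = documentai.DocumentProcessorServiceClient()
name = client.processor_path(project_id, location, processor_id)

with open("resume.pdf", "rb") as f:  # hypothetical input file
    raw_document = documentai.RawDocument(content=f.read(), mime_type="application/pdf")

result = client.process_document(
    request=documentai.ProcessRequest(name=name, raw_document=raw_document)
)

document = result.document
print(document.text[:500])            # the full extracted text, truncated for display
for entity in document.entities:      # structured fields, if the processor extracts them
    print(entity.type_, entity.mention_text)
```

The extracted fields can then be loaded into BigQuery alongside the rest of your structured data.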

Bring machine learning to data analysts via familiar tools

Today, one of the biggest barriers to ML is that the tools and frameworks needed to do ML are new and unfamiliar. But this doesn’t have to be the case. BigQuery ML, for example, allows you to train sophisticated ML models at scale using SQL code, directly from within BigQuery. Bringing ML to your data warehouse alleviates the complexities of setting up additional infrastructure and writing model code. Anyone who can write SQL code can train an ML model quickly and easily.
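
For instance, here is a minimal sketch of training and evaluating a BigQuery ML model from Python. The project, dataset, and table names (and the `purchased` label column) are hypothetical.

```python
# A minimal BigQuery ML sketch, assuming a hypothetical `mydataset.visits` table
# with a boolean `purchased` label column.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

# Train a logistic regression model entirely inside BigQuery, using SQL.
train_sql = """
CREATE OR REPLACE MODEL `mydataset.purchase_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['purchased']) AS
SELECT * FROM `mydataset.visits`
"""
client.query(train_sql).result()  # blocks until training finishes

# Evaluate the trained model, still in SQL.
eval_sql = "SELECT * FROM ML.EVALUATE(MODEL `mydataset.purchase_model`)"
for row in client.query(eval_sql).result():
    print(dict(row))
```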

Easily access data with a unified notebook interface

One of the most popular ML interfaces today is the notebook: an interactive environment that allows you to write code, visualize and pre-process data, train models, and a whole lot more. Data scientists often spend most of their day building models within notebook environments. It’s crucial, then, that notebook environments have access to all of the data that makes your organization run, including tools that make that data easy to work with.

Vertex AI Workbench, now generally available, is the single development environment for the entire data science workflow. Integrations across Google Cloud’s data portfolio allow you to natively analyze your data without switching between services:

Cloud Storage: access unstructured data

BigQuery: access data with SQL, take advantage of models trained with BigQuery ML

Dataproc: execute your notebook using your Dataproc cluster for control

Spark: transform and prepare data with autoscaling serverless Spark

Below, you’ll see how you can easily run a SQL query on BigQuery data with Vertex AI Workbench.
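
As a small sketch of that workflow, the snippet below pulls query results straight into a pandas DataFrame from a Workbench notebook using the BigQuery client library (pre-installed on Workbench images). The project and table names are hypothetical.

```python
# A hedged sketch of querying BigQuery from a Vertex AI Workbench notebook.
from google.cloud import bigquery

client = bigquery.Client()  # uses the notebook instance's service account

df = client.query(
    """
    SELECT product_category, SUM(order_total) AS revenue
    FROM `my-project.sales.orders`        -- hypothetical table
    GROUP BY product_category
    ORDER BY revenue DESC
    """
).to_dataframe()

print(df.head())
```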

But what happens after you’ve trained the model? How can both data analysts and data scientists make sure their models can be utilized by application developers and maintained over time?

Go from prototyping to production with MLOps

While training accurate models is important, getting those models to be scalable, resilient, and accurate in production is its own art, known as MLOps. MLOps allows you to:

Know what data your models are trained on

Monitor models in production

Make the training process repeatable

Serve and scale model predictions

A whole lot more! (See the “Practitioners Guide to MLOps” whitepaper for a full and detailed overview of MLOps)

Built-in MLOps tools within Vertex AI’s unified platform remove the complexity of model maintenance. Practical tools help with training and hosting ML models, managing model metadata, governance, model monitoring, and running pipelines – all critical aspects of running ML in production and at scale.

And now, we’re extending our capabilities to make MLOps accessible to anyone working with ML in your organization. 

Easy handoff to MLOps with Vertex AI Model Registry

Today, we’re announcing Vertex AI Model Registry, a central repository that allows you to register, organize, track, and version trained ML models. It is designed to work with any type of model and deployment target, whether that’s BigQuery, Vertex AI, AutoML, custom deployments on Google Cloud, or even outside the cloud.

Vertex AI Model Registry is particularly beneficial for BigQuery ML. While BigQuery ML brings the powerful scalability of BigQuery to batch predictions, using a data warehouse engine for real-time predictions just isn’t practical. You might also wonder how to orchestrate ML workflows built in BigQuery. You can now discover and manage BigQuery ML models and easily deploy those models to Vertex AI for real-time predictions and MLOps tools.
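
To illustrate the register-then-deploy flow, here is a hedged sketch using the Vertex AI SDK for Python. The model artifacts, serving container, and names are placeholders, not the product’s only path; a BigQuery ML model handed off to the registry follows the same general pattern.

```python
# A hedged sketch of registering and deploying a model with the Vertex AI SDK.
# Project, bucket, display names, and the serving container are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Register a trained model in Vertex AI Model Registry.
model = aiplatform.Model.upload(
    display_name="churn-model",
    artifact_uri="gs://my-bucket/models/churn/",  # exported model artifacts
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest",
)

# Confirm the model is visible in the registry.
for m in aiplatform.Model.list(filter='display_name="churn-model"'):
    print(m.resource_name)

# Deploy the registered model to an endpoint for real-time predictions.
endpoint = model.deploy(machine_type="n1-standard-2")
print(endpoint.resource_name)
```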

End-to-End MLOps with pipelines

One of the most popular approaches to MLOps is the ML pipeline: each distinct step in your ML workflow, from data preparation to model training and deployment, is automated so it can be shared and reliably reproduced.

Vertex AI Pipelines is a serverless tool for orchestrating ML tasks using pre-built components or your own custom code. Now, you can easily process data and train models with BigQuery, BigQuery ML, and Dataproc directly within a pipeline. With this capability, you can combine familiar ML development within BigQuery and Dataproc into reproducible, resilient pipelines and orchestrate your ML workflows faster than ever.

See an example of how this works with the new BigQuery and BigQuery ML components.

Learn more about how to use BigQuery and BigQuery ML components with Vertex AI Pipelines.
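
In the meantime, here is a minimal sketch of the idea using the Kubeflow Pipelines (KFP) v2 SDK. The prebuilt BigQuery and BigQuery ML components mentioned above are the intended route; this sketch uses a small custom component and hypothetical project, dataset, and table names so it stays self-contained.

```python
# A hedged KFP v2 sketch: run a BigQuery ML training query, then a scoring query,
# as ordered pipeline steps. Names and queries are placeholders for illustration.
from kfp import dsl, compiler


@dsl.component(packages_to_install=["google-cloud-bigquery"])
def run_bigquery_sql(project: str, query: str) -> str:
    """Runs a SQL statement in BigQuery and returns the job ID."""
    from google.cloud import bigquery
    client = bigquery.Client(project=project)
    job = client.query(query)
    job.result()  # wait for completion
    return job.job_id


@dsl.pipeline(name="bq-train-and-score")
def pipeline(project: str = "my-project"):
    train = run_bigquery_sql(
        project=project,
        query=(
            "CREATE OR REPLACE MODEL demo.churn_model "
            "OPTIONS(model_type='logistic_reg', input_label_cols=['churned']) AS "
            "SELECT * FROM demo.training_data"
        ),
    )
    score = run_bigquery_sql(
        project=project,
        query=(
            "CREATE OR REPLACE TABLE demo.churn_scores AS "
            "SELECT * FROM ML.PREDICT(MODEL demo.churn_model, TABLE demo.new_customers)"
        ),
    )
    score.after(train)  # only score after training succeeds


# Compile to a pipeline spec that can be submitted to Vertex AI Pipelines.
compiler.Compiler().compile(pipeline, "bq_pipeline.json")
```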

Learn more and get started 

We’re excited to share more about our unified data and AI offering today at the Data Cloud Summit. Please join us for the spotlight session on our “AI/ML strategy and product roadmap” or the “AI/ML notebooks ‘how to’ session.”

And if you’re ready to get hands on with Vertex AI, check out these resources:

Codelab: Training an AutoML model in Vertex AI

Codelab: Intro to Vertex AI Workbench

Video Series: AI Simplified: Vertex AI

GitHub: Example Notebooks

Training: Vertex AI: Qwik Start

Build a secure data warehouse with the new security blueprint

As Google Cloud continues our efforts to be the industry’s most trusted cloud, we’re taking an active stake to help customers achieve better security for their cloud data warehouses. With our belief in shared fate driving us to make it easier to build strong security into deployments, we provide security best practices and opinionated guidance for customers in the form of security blueprints. Today, we’re excited to share a new addition to our portfolio of blueprints with the publication of our Secure Data Warehouse Blueprint guide and deployable Terraform.

Many enterprises take advantage of cloud capabilities to analyze their sensitive business data. However, customers have told us that their teams invest a great deal of time in protecting the sensitive data in cloud-based data warehouses. To help accelerate your data warehouse deployment and enable security controls, we’ve designed the secure data warehouse blueprint.

What is the secure data warehouse blueprint?

The secure data warehouse blueprint provides security best practices to help protect your data and accelerate your adoption of Google Cloud’s data, machine learning (ML), and artificial intelligence (AI) solutions. The blueprint’s architecture helps you not only cover your data’s life cycle, but also incorporate a governance and security posture.

The blueprint consists of multiple components: 

The landing area ingests batch or streaming data. 

The data warehouse component handles storage and de-identification of data, which can later be re-identified through a separate process.

The classification and data governance component manages your encryption keys, de-identification template, and data classification taxonomy.

The security posture component aids in detection, monitoring, and response.

The blueprint can create these components for you by showing you how to deploy and configure the appropriate cloud services in your environment. For the data presentation component, which is outside of the scope of the blueprint, use your team’s chosen business intelligence tools. Make sure the tools used, such as Looker, have appropriate access to the data warehouse. 

To get started, you can discuss recommended security controls with your team by using the blueprint as a framework. You can then adapt and customize the blueprint to your enterprise’s requirements for your most sensitive data.

Let’s explore some benefits this blueprint can provide for your organization: accelerate business analysis securely and provide strong baseline security controls for your data warehouse.

Accelerate business analysis

Limited security experience or knowledge of best practices can inhibit your data warehouse transformation plans. The blueprint helps address this need in different ways by providing you with code techniques, best practices on data governance, and example patterns to achieve your security goals.

This blueprint provides infrastructure as code (IaC) techniques, such as codifying your infrastructure and declaring your environment, that allow your teams to analyze the controls and compare them against your enterprise requirements for creating, deploying, and operating your data warehouse. IaC techniques can also help you simplify the regulatory and compliance reviews that your enterprise performs. The blueprint allows for flexibility – you can start a new initiative or configure it to deploy into your existing environment. For instance, you can choose to use the blueprint’s existing network and logging modules. Alternatively, you can keep your existing networking and logging configuration and compare it against the blueprint’s recommendations to further enhance your environment with best practices.

The blueprint also provides guidance and a set of Google best practices on data governance. It helps you implement Data Catalog policy tags with BigQuery column-level access controls. You can enforce separation-of-duty principles. The blueprint defines multiple personas and adds least-privileged IAM roles, so you can manage user identity permissions through groups.

Sometimes, seeing an example pattern and adapting it can help teams accelerate their use of new services. Your team can focus on customization to achieve your enterprise goals instead of working through setup details of unfamiliar services or concepts. For instance, this blueprint has examples on how to separate build concerns with your Dataflow flex template from your infrastructure. The example shows how you can:

Create a separate re-identification process and environment. 

Apply data protection and governance controls, such as de-identification with Cloud Data Loss Prevention (DLP) and customer-managed encryption keys (CMEK) with Cloud HSM.

Generate sample synthetic data that you can send through the system to observe how confidential data flows through the environment.

We’ve provided this guidance to help with faster reviews for the completeness and security of your data warehouse. 

Protect data with layered security controls

Using this blueprint, you can demonstrate to your security, risk, and compliance teams which security controls are implemented in the environment. The blueprint builds an architecture that can minimize the infrastructure you need to manage and uses many built-in security controls. The services in the architecture work together to help protect your data: VPC Service Controls creates perimeters to group services by functional concerns, and perimeter bridges are defined to allow and monitor communication between the perimeters.

The data governance perimeter controls your encryption keys that are stored in Cloud HSM, de-identification templates that are used by Cloud DLP, and data classification taxonomy that is defined in Data Catalog. This perimeter also serves as the central location for audit logging and monitoring.

The data ingestion perimeter uses Dataflow to de-identify your data based on your de-identification template and store the data in BigQuery.

The confidential data perimeter covers the case when sensitive data might need to be re-identified. A separate Dataflow pipeline is created to send the data to an isolated BigQuery dataset in a different project.

Additional layers such as IAM, organization policies, and networking are described in the Secure Data Warehouse Blueprint guide.

Let’s explore how these controls relate to three topics that often arise in security discussions: minimizing data exfiltration, configuring data warehouse security controls, and facilitating compliance. 

Minimize data exfiltration

The blueprint enables you to deploy multiple VPC Service Controls perimeters and corresponding bridges so that you can monitor and define where data can flow. These perimeters can constrain data to your specified projects and services. Access to the perimeter is augmented with context information from the Access Context Manager policy.  

The blueprint helps you create an environment with clear boundaries that define where data can flow and what data can be seen. You can customize the organization policies or use the provided organization policies that help create guardrails, such as preventing the use of external IPs. Your data in transit remains on trusted networks through private networks and private connectivity to services. You can use the provided Cloud DLP configuration to de-identify your data for additional protection in case the data is unintentionally accessed. The data is obfuscated through Cloud DLP’s cryptographic transformation method.
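
As a simplified sketch of that kind of transformation, the snippet below asks Cloud DLP to replace detected values with deterministic cryptographic surrogates. The project name and the transient (non-KMS) key are placeholders; the blueprint’s Terraform wires up the production configuration, including CMEK-protected keys and de-identification templates.

```python
# A hedged Cloud DLP de-identification sketch; identifiers are placeholders.
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/my-project"  # hypothetical project

item = {"value": "Contact Jane Doe at jane.doe@example.com"}
inspect_config = {"info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PERSON_NAME"}]}

# Deterministic cryptographic transformation: sensitive values become surrogate
# tokens that a separate, tightly controlled process can re-identify later.
deidentify_config = {
    "info_type_transformations": {
        "transformations": [
            {
                "primitive_transformation": {
                    "crypto_deterministic_config": {
                        "crypto_key": {"transient": {"name": "example-transient-key"}},
                        "surrogate_info_type": {"name": "TOKEN"},
                    }
                }
            }
        ]
    }
}

response = dlp.deidentify_content(
    request={
        "parent": parent,
        "deidentify_config": deidentify_config,
        "inspect_config": inspect_config,
        "item": item,
    }
)
print(response.item.value)  # de-identified text with surrogate tokens
```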

Limiting who can access your most sensitive data is an important consideration. The blueprint uses fine-grained IAM access policies to ensure least privilege and minimize lateral movement. These IAM policies are limited in scope and bound as close to the resource as possible rather than at the project level. For instance, an IAM access policy is bound to a key that is used to protect BigQuery. Also, service accounts, rather than user identities, are defined to perform operations on your data. These service accounts are granted predefined roles with least privilege in mind. The IAM bindings for these privileged accounts are transparently defined in the blueprint, so you can evaluate the IAM policies and monitor them. You can allow the correct users to see the re-identified data by adding additional access grants using column-level access controls with BigQuery.

Configure pervasive data warehouse controls

The data warehouse security controls can cover various aspects of your data warehouse across multiple resources rather than focusing on a single service. Various security controls are packaged into different modules. For instance, if you want to protect your trust perimeter, create Cloud DLP de-identification templates, use BigQuery controls, or build Dataflow pipelines for ingestion or re-identification, you can explore those particular modules. You can adjust those modules to match your requirements. 

Data encryption controls can be enabled on each service to protect your data with customer-managed encryption keys. Multiple keys are created, which define separate cryptographic boundaries for specific purposes. For instance, one of the keys may be used for services that handle ingestion, while a separate key may be used for protecting the data within BigQuery. These keys have an automatic rotation policy and are stored in Cloud HSM. 

The blueprint helps you build data governance by applying Data Catalog policy tags. It can create a taxonomy that you define – for example, a hierarchy where the highest level of access is tagged as “Confidential.”

These policy tags are applied to your BigQuery table schema and enable column-level access controls.
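
The sketch below shows the general flow: create a “Confidential” policy tag in a Data Catalog taxonomy and attach it to a BigQuery column. The project, location, table, and column names are hypothetical; in practice the blueprint’s Terraform performs these steps for you.

```python
# A hedged sketch of policy tags and BigQuery column-level access control.
from google.cloud import bigquery, datacatalog_v1

project, location = "my-project", "us"  # placeholders

# 1. Create a taxonomy and a "Confidential" policy tag.
ptm = datacatalog_v1.PolicyTagManagerClient()
taxonomy = ptm.create_taxonomy(
    parent=f"projects/{project}/locations/{location}",
    taxonomy=datacatalog_v1.Taxonomy(
        display_name="sensitivity",
        activated_policy_types=[
            datacatalog_v1.Taxonomy.PolicyType.FINE_GRAINED_ACCESS_CONTROL
        ],
    ),
)
confidential = ptm.create_policy_tag(
    parent=taxonomy.name,
    policy_tag=datacatalog_v1.PolicyTag(display_name="Confidential"),
)

# 2. Attach the policy tag to a column so BigQuery enforces column-level access.
bq = bigquery.Client(project=project)
table = bq.get_table("my-project.warehouse.customers")  # hypothetical table
schema = []
for field in table.schema:
    if field.name == "ssn":  # hypothetical sensitive column
        field = bigquery.SchemaField(
            field.name,
            field.field_type,
            mode=field.mode,
            policy_tags=bigquery.PolicyTagList(names=[confidential.name]),
        )
    schema.append(field)
table.schema = schema
bq.update_table(table, ["schema"])
```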

Beyond the IAM policies that limit who can access your data, additional controls like Dataflow’s Streaming Engine and organization policies are in place to help minimize what you need to manage. Multiple organization policies are configured, such as preventing service account creation, to make it obvious when a change occurs. These policies are applied at the project level to give you flexibility and more granularity over these controls.

Facilitate compliance needs

The blueprint helps address data minimization requirements with Cloud DLP’s cryptographic transformation methods. We have also recently added automatic DLP for BigQuery to provide a built-in capability to help detect unexpected types of data in your environment. This new integrated DLP capability also gives you visibility into your environment to help with assessments. Data encryption is handled in the deployed services using keys that you manage in Cloud HSM, which is built with clusters of FIPS 140-2 Level 3 certified HSMs.

Access controls to your data are configured with service accounts built around the principle of least privilege access to data. If a user identity needs to read confidential data, that identity must be explicitly granted access to both the dataset and column, and audit logs capture these IAM updates. Policies defined in Access Context Manager add extra context information to enforce more granular access. The Access Context Manager policy is configured with context information such as IP and user identity that you can further enhance.

You can use additional Google best practices from the Security Foundations blueprint. That blueprint uses built-in security controls such as Security Command Center, Cloud Logging, and Cloud Monitoring. The security foundation blueprint’s logging and monitoring section describes how Security Command Center helps you with your threat detection needs. Security Health Analytics is a built-in service of Security Command Center that monitors each project to minimize misconfigurations. Audit logs are centrally configured with CMEK to help with your access monitoring. In addition, Access Transparency can be configured for additional insight.

We also know that it’s helpful to get outside validation and perspective. Our Google Cybersecurity Action Team and a third-party security team have reviewed the security controls and posture established by this blueprint. Learn more about the reviews by reading the additional security considerations topic in the guide. These external reviews help you know you are applying best practices and strong controls to help protect your most sensitive data.

Explore best practices

To learn more about these best practices, read the Secure Data Warehouse Blueprint guide and get the deployable Terraform along with our guided tutorial. Listen to our security podcast titled “Rebuilding vs Forklifting and how to secure a data warehouse in the cloud.” Our ever-expanding portfolio of blueprints is available on our Google Cloud security best practices center. Build security into your Google Cloud deployments from the start and help make your data warehouse safer with Google.

Boost the power of your transactional data with Cloud Spanner change streams

Data is one of the most valuable assets in today’s digital economy. One way to unlock the value of your data is to give it life after it’s first collected. A transactional database, like Cloud Spanner, captures incremental changes to your data in real time, at scale, so you can leverage it in more powerful ways. Cloud Spanner is our fully managed relational database that offers near unlimited scale, strong consistency, and industry-leading high availability of up to 99.999%. 

The traditional way for downstream systems to use incremental data that’s been captured in a transactional database is through change data capture (CDC), which allows you to trigger behavior based on changes to your database, such as a deleted account or an updated inventory count.

Today, we are announcing Spanner change streams, coming soon, which let you capture change data from Spanner databases and easily integrate it with other systems to unlock new value.

Change streams for Spanner go above and beyond the traditional CDC capabilities of tracking inserts, updates, and deletes. Change streams are highly flexible and configurable, letting you track changes on exactly the tables and columns you choose, or across an entire database. You can replicate changes from Spanner to BigQuery for real-time analytics, trigger downstream application behavior using Pub/Sub, and store changes in Google Cloud Storage (GCS) for compliance. This ensures you have the freshest data to optimize business outcomes.

Change streams provide a wide range of options to integrate change data with other Google Cloud services and partner applications, through turnkey connectors, custom Dataflow processing pipelines, or the change streams read API.

Spanner consistently processes over 1.2 billion requests per second. Since change streams are built right into Spanner, you not only get industry-leading availability and global scale—you also don’t have to spin up any additional resources. The same IAM permissions that already protect your Spanner databases can be used to access change streams: change stream queries are protected by spanner.databases.select, and change stream DDL operations are protected by spanner.databases.updateDdl.

Change streams in action

In this section, we’ll look at how to set up a change stream that sends change data from Spanner to an analytic data warehouse in BigQuery.

Creating a change stream 

As discussed above, a change stream tracks changes on an entire database, a set of tables, or a set of columns in a database. Each change stream can have a retention period of anywhere from one day to seven days, and you can set up multiple change streams to track exactly what you need for your specific business objectives. 

First, we’ll create a change stream called InventoryStream on a table called InventoryLedger. It tracks inventory changes on two columns, InventoryLedgerProductSku and InventoryLedgerChangedUnits, with a seven-day retention period.
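
A minimal sketch of that DDL, issued through the Spanner client library for Python, might look like the following. The project, instance, and database names are hypothetical; the stream, table, and column names follow the example above.

```python
# A hedged sketch of creating the change stream described above.
from google.cloud import spanner

client = spanner.Client(project="my-project")            # hypothetical project
database = client.instance("my-instance").database("inventory-db")

ddl = """
CREATE CHANGE STREAM InventoryStream
  FOR InventoryLedger(InventoryLedgerProductSku, InventoryLedgerChangedUnits)
  OPTIONS (retention_period = '7d')
"""
operation = database.update_ddl([ddl])
operation.result()  # wait for the schema change to complete
print("Change stream InventoryStream created")
```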

Change records

Each change record contains a wealth of information, including primary key, the commit timestamp, transaction ID, and of course, the old and new values of the changed data, wherever applicable. This makes it easy to process change records as an entire transaction, in sequence based on their commit timestamp, or individually as they arrive, depending on your business needs. 

Back to the inventory example, now that we’ve created a change stream on the InventoryLedger table, all inserts, updates, and deletes on this table will be published to the InventoryStream change stream. These changes are strongly consistent with the commits on the InventoryLedger table: When a transaction commit succeeds, the relevant changes will automatically persist in the change stream. You never have to worry about missing a change record.

Processing a change stream

There are numerous ways that you can process change streams depending on the use case:

Analytics: You can send the change records to BigQuery, either as a set of change logs or by updating the tables.  

Event triggering: You can send change logs to Pub/Sub for further processing by downstream systems. 

Compliance: You can retain the change log to Google Cloud Storage for archiving purposes. 

The easiest way to process change stream data is to use our Spanner connector for Dataflow, where you can take advantage of Dataflow’s built-in pipelines to BigQuery, Pub/Sub, and Google Cloud Storage. For example, a Dataflow pipeline can process this change stream and import change data directly into BigQuery.

Alternatively, you can build a custom Dataflow pipeline to process change data with Apache Beam. In this case, we provide a Dataflow connector that outputs change data as an Apache Beam PCollection of DataChangeRecord objects. 

For even more flexibility, you can use the underlying change streams query API. The query API is a powerful interface that lets you read directly from a change stream to implement your own connector and stream changes to the pipeline of your choice. On the query API side, a change stream is divided into multiple partitions, which can be used to query a change stream in parallel for higher throughput. Spanner dynamically creates these partitions based on load and size. Partitions are associated with a Spanner database split, allowing change streams to scale as effortlessly as the rest of Spanner.

Get started with change streams

With change streams, your Spanner data follows you wherever you need it, whether that’s for analytics with BigQuery, for triggering events in downstream applications, or for compliance and archiving. Change streams are highly flexible and configurable, allowing you to capture change data for the exact data you care about, and for the exact period of time that matters for your business. And because change streams are built into Spanner, there’s no software to install, and you get external consistency, high scale, and up to 99.999% availability.

There’s no extra charge for using change streams, and you’ll pay only for extra compute and storage of the change data at the regular Spanner rates.

To get started with Spanner, create an instance, or try it out with a Spanner Qwiklab.

We’re excited to see how Spanner change streams will help you unlock more value out of your data!

BigLake: unifying data lakes and data warehouses across clouds

The volume of valuable data that organizations have to manage and analyze is growing at an incredible rate. This data is increasingly distributed across many locations, including data warehouses, data lakes, and NoSQL stores. As an organization’s data gets more complex and proliferates across disparate data environments, silos emerge, creating increased risk and cost, especially when that data needs to be moved. Our customers have made it clear: they need help.

That’s why today, we’re excited to announce BigLake, a storage engine that allows you to unify data warehouses and lakes. BigLake gives teams the power to analyze data without worrying about the underlying storage format or system, and eliminates the need to duplicate or move data, reducing cost and inefficiencies. 

With BigLake, users gain fine-grained access controls, along with performance acceleration across BigQuery and multicloud data lakes on AWS and Azure. BigLake also makes that data uniformly accessible across Google Cloud and open source engines with consistent security. 

BigLake extends a decade of innovations with BigQuery to data lakes on multicloud storage, with open formats to ensure a unified, flexible, and cost-effective lakehouse architecture.

BigLake architecture

BigLake enables you to:

Extend BigQuery to multicloud data lakes and open formats such as Parquet and ORC with fine-grained security controls, without needing to set up new infrastructure.

Keep a single copy of data and enforce consistent access controls across analytics engines of your choice, including Google Cloud and open-source technologies such as Spark, Presto, Trino, and Tensorflow.

Achieve unified governance and management at scale through seamless integration with Dataplex.

Bol.com, an early customer using BigLake, has been accelerating analytical outcomes while keeping their costs low:

“As a rapidly growing e-commerce company, we have seen rapid growth in data. BigLake allows us to unlock the value of data lakes by enabling access control on our views while providing a unified interface to our users and keeping data storage costs low. This in turn allows quicker analysis on our datasets by our users.”—Martin Cekodhima, Software Engineer, Bol.com

Extend BigQuery to unify data warehouses and lakes with governance across multicloud environments

By creating BigLake tables, BigQuery customers can extend their workloads to data lakes built on Google Cloud Storage (GCS), Amazon S3, and Azure Data Lake Storage Gen2. BigLake tables are created using a cloud resource connection, which is a service identity wrapper that enables governance capabilities. This allows administrators to manage access control for these tables just like BigQuery tables, and removes the need to grant object store access to end users.

Data administrators can configure security at the table, row, or column level on BigLake tables using policy tags. For BigLake tables defined over Google Cloud Storage, fine-grained security is consistently enforced across Google Cloud and supported open-source engines using BigLake connectors. For BigLake tables defined on Amazon S3 and Azure Data Lake Storage Gen2, BigQuery Omni enables governed multicloud analytics by enforcing security controls. This lets you manage a single copy of data that spans BigQuery and data lakes, and creates interoperability between data warehousing, data lake, and data science use cases.
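
As a rough sketch, defining a BigLake table over Parquet files in Cloud Storage uses standard BigQuery DDL plus a cloud resource connection, as below. The project, dataset, connection, and bucket names are placeholders.

```python
# A hedged sketch of creating and querying a BigLake table over GCS Parquet files.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

ddl = """
CREATE EXTERNAL TABLE `my-project.lake.orders`
  WITH CONNECTION `my-project.us.lake-connection`
  OPTIONS (
    format = 'PARQUET',
    uris = ['gs://my-data-lake/orders/*.parquet']
  )
"""
client.query(ddl).result()

# End users query it like any BigQuery table; access is governed by the
# connection's service identity plus any row- or column-level policies.
for row in client.query("SELECT COUNT(*) AS n FROM `my-project.lake.orders`").result():
    print(row.n)
```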

Open interface to work consistently across analytic runtimes spanning Google Cloud technologies and open source engines 

Customers running open source engines like Spark, Presto, Trino, and TensorFlow through Dataproc or self-managed deployments can now enable fine-grained access control over data lakes and accelerate the performance of their queries. This helps you build secure, governed data lakes and eliminates the need to create multiple views to serve different user groups. You can do this by creating BigLake tables from a supported query engine (for example, with Spark DDL) and using Dataplex to configure access policies. These access policies are then enforced consistently across the query engines that access this data, greatly simplifying access control management.

Achieve unified governance & management at scale through seamless integration with Dataplex

BigLake integrates with Dataplex to provide management-at-scale capabilities. Customers can logically organize data from BigQuery and GCS into lakes and zones that map to their data domains, and can centrally manage policies for governing that data. These policies are then uniformly enforced by Google Cloud and OSS query engines. Dataplex also makes management easier by automatically scanning Google Cloud storage to register BigLake table definitions in BigQuery, and makes them available via Dataproc Metastore. This helps end users discover these BigLake tables for exploration and querying using both OSS applications and BigQuery. 

Taken together, these capabilities enable you to run multiple analytic runtimes over data spanning lakes and warehouses in a governed manner. This breaks down data silos and significantly reduces the infrastructure management, helping you to advance your analytics stack and unlock new use cases.

What’s next?

If you would like to learn more about BigLake, please visit our website. Alternatively, get started with BigLake today by using this quickstart guide, or contact the Google Cloud sales team.

Vodafone optimizes the business value it extracts from its data with Google Cloud

Editor’s note:  Frustrated that its existing on-premises infrastructure was limiting the business intelligence it could derive from its data, Vodafone decided to switch to a more accessible, scalable cloud solution. Here, Osman Peermamode, Head of Data and Analytics at Vodafone, explains how the migration to Google Cloud is cost-effectively transforming its service offering to optimize its business value and provide enhanced solutions to its customers.

Most people know Vodafone as a mobile network operator. And, with more than 300 million customers worldwide, we’re indeed one of the largest mobile telecommunications companies in the world. But Vodafone is so much more than that. We provide broadband to more than 28 million people and TV to more than 22 million, while more than 140 million IoT connections help businesses with their digital transformation. M-Pesa, our mobile money transfer service, is used by more than 50 million people across Africa.

All these services serve a common purpose. They create a sustainable digital future that’s accessible to all, built on next-generation connectivity powered by Vodafone. The customer experience sits at the heart of this vision, and data is the key to a lasting relationship with our customers. But data alone isn’t enough – it must also be securely processed and made available across the organization. That’s why, in 2019, we decided to team up with Google Cloud to unlock the true value in our data.

Democratizing Vodafone’s data with Nucleus

Of course, leveraging data isn’t new territory for Vodafone. We’ve long used data insights to improve our customer retention by optimizing packages and personalizing our offering to our users, for example. But before our move into the cloud, the data we collected was very fragmented and costly. Data sat in silos on our on-premises infrastructure, where it quickly became outdated, and multiple copies of data sets weren’t just inefficient; they also decreased our data’s quality and credibility.

In comes Nucleus. Powered by Google Cloud products such as BigQuery, Dataproc, and Cloud Data Fusion, Nucleus is a unified data platform that integrates all of Vodafone’s data. Instead of reconciling between different data warehouses, as we did in the past, we have established a single source of truth, making data accessible across our organization. This supports fact-based decision-making and helps us to reduce costs, streamline operations, and deliver new services and products quickly across Vodafone’s international markets. 

Nucleus has three core components. First, there’s Neuron, which houses all the data – 70 petabytes and growing. Then there’s Dynamo, a pioneering hybrid cloud system that makes it easy for us to move data from all our on-premises repositories around the world to Google Cloud. It can process around 50 TB of data per day, the equivalent of 25,000 hours of HD film. Finally, there’s Vodafone’s common data platform, which provides access to both raw and organized data for all areas of the business, enabling Vodafone to build capabilities once and then instantly deploy them across markets.

Raising the bar on customer service

With Nucleus and our ongoing partnership with Google Cloud, we’re transforming our approach to data. We’re better equipped to meet new regulatory requirements at a much lower cost, and we’ve unlocked new capabilities by generating high quality insights that inform the next generation of data and analytics products. What’s most exciting is that we’ve already identified more than 700 concrete use cases that take advantage of those data capabilities to make Vodafone more customer friendly, sustainable, and profitable.

Let’s take customer loyalty, the key to our success. By reducing our data ingestion for business intelligence from 36 hours to 25 minutes, Nucleus has already made us much more responsive to changes in customer behavior. If mobile users are paying unnecessary roaming costs, for example, we can identify the issue in almost real-time and save them money. We can also automatically spot when a customer needs a speed boost, or proactively contact customers who might be experiencing an issue with their connection. Overall, we’ve already raised the bar on customer service.

Unlocking value, boosting profitability

Satisfied customers are profitable customers. We can offer highly personalized content, apps, and rewards. We’re also better at detecting fraud in almost real-time.

The data insights we’ve generated with Google Cloud also help us to gain a more holistic understanding of our customers. We can tailor our services to each household and suggest new services. By using the data insights from Nucleus to improve our campaigns and customer management, we’re unlocking untapped value. 

Our new data platform also gives us a better overview of the profitability of our channels and retail stores. This helps us to weed out unnecessary spending and makes sure our retail offering is centered around the needs of our customers. By channelling our retail footprint in places where they’re most needed, we’ve been able to increase our store profitability in some markets.

Accelerating change with data-driven sustainability

A better grasp on our data also supports us on our mission to drive positive change for society and our planet. Google Cloud helps us to better monitor all our sustainability KPIs, from our reduction in greenhouse gas emissions to the share of energy we use from renewable sources. We’re also helping to solve global health issues, by supporting governments and aid organizations, for example, with secure, anonymous and aggregated mobile phone signal data to help tackle COVID-19. Now, we can provide even deeper insights to help curb the spread of disease.

These were just a few of the 700 use cases that are helping us to deliver exciting new products, reduce costs, and centralize our operations. With Nucleus, we’re building the foundation for a digital future, and we’re thrilled to be building it with Google Cloud. Carried out by around 1,000 in-house employees from both companies, it’s a true joint effort, inspired by our shared vision of digital technology that’s accessible to all. By tapping into the collective power of Vodafone and Google Cloud, we’re transforming our services for the people, organizations, and communities we serve.

Find out more about how Google Cloud is helping the telecommunications industry here.

Investing in our data cloud partner ecosystem to accelerate data-driven transformations

By 2023, 60% of organizations will use three or more analytics solutions to build business applications that connect insights to actions. These multiple implementations add complexity and challenges: multiple data models, disparate toolsets, and a lack of integration and governance. To provide organizations with the flexibility, interoperability, and agility to accelerate data-driven transformations, we have significantly expanded our data cloud partner ecosystem and are increasing our partner investment across a number of new areas.

This week at the Data Cloud Summit, we are announcing a new Data Cloud Alliance, along with the founding partners Accenture, Confluent, Databricks, Dataiku, Deloitte, Elastic, Fivetran, MongoDB, Neo4j, Redis, and Starburst, to make data more portable and accessible across disparate business systems, platforms, and environments—with a goal of ensuring that access to data is never a barrier to digital transformation.

We are also rolling out updates to ensure that organizations can effectively utilize the expertise and power of our data cloud partners, including our new Google Cloud Ready – BigQuery initiative to help customers identify validated partner integrations with BigQuery; a public preview of our Analytics Hub to help partners share and monetize their data; a new Built with BigQuery initiative to highlight partner products that utilize our data cloud capabilities; and several new innovations and launches from our partners.

Helping customers identify validated partner integrations with the Google Cloud Ready – BigQuery initiative 

We strive to give customers the best experience when using partner solutions together with Google’s data cloud products. And as more and more customers deploy partner solutions alongside BigQuery, it’s critical that they are able to identify highly effective, validated, and trusted integrations to get the most out of their data. 

To enable this, we are launching a new Google Cloud Ready – BigQuery initiative. Google Cloud Ready – BigQuery is a validation program whereby Google Cloud engineering teams evaluate and validate BigQuery integrations and connectors using a series of data integration tests and benchmarks. Today, we’re announcing 25 launch partners whose integrations and connectors are validated as Google Cloud Ready – BigQuery:

Google Cloud Ready – BigQuery partners

For example, Google Cloud-validated connectors from Informatica help customers streamline data transformations and rapidly move data from any SaaS application, on-premises database, or big data source into Google BigQuery.

“Google Cloud and Informatica have been strategic cloud partners for the last five years, providing end-to-end, scalable enterprise-class data migration, integration and management solutions for customers. Being recognized as a Google Cloud Ready – BigQuery partner further validates Informatica’s ability to help customers be successful in their journey to cloud with Google,” said Jitesh Ghai, Chief Product Officer at Informatica.

Google Cloud-validated Fivetran connectors continuously replicate data from key applications, event streams, file stores, and more into BigQuery, helping turn big data into informed business decisions. Customers can keep up-to-date with the performance and health of the connectors through logs and metrics available through Google Cloud Monitoring. 

“Customers are looking to move data reliably and securely into Google BigQuery to meet the needs of their business,” said Fraser Harris, VP of Product at Fivetran. “We are proud to announce that we have achieved Google Cloud Ready – BigQuery Designation. This marks another milestone in our long-standing partnership with Google Cloud that provides our customers with further assurance that Fivetran products work seamlessly with BigQuery – today and into the future.”

Similarly, Google Cloud-validated BigQuery and Tableau integrations allow customers to analyze billions of rows in seconds without writing a single line of code and with zero server-side management. Organizations can create dashboards in minutes and share insights with users instantaneously.

“Tableau strives to meet customers where they are and for many organizations with large complex data problems, that’s on the Google Cloud Platform,” said Brian Matsubara, Vice President, Global Technology Alliances at Tableau. “Partnering with Google empowers our customers to explore their data in real-time to unlock actionable insights that can transform a business.”

If you are already a Google Cloud partner, sign up to get your product integration validated by our experts. To become a Google Cloud partner, click here to enroll.

Helping ISVs build and grow their applications with BigQuery

More than 700 partners power their applications with Google’s data cloud – including companies like ZoomInfo, Equifax, Exabeam, Bloomreach, and Quantum Metric. We’re committed to helping these partners both build effective products and go to market, and this week we’re excited to launch the Built with BigQuery initiative, which helps ISVs get started building applications using data and machine learning products like BigQuery, Looker, Spanner, and Vertex AI. The program provides dedicated access to Google Cloud expertise, training, and co-marketing support to help partners build capacity and go to market. Furthermore, Google Cloud engineering teams work closely with our partners on product design and optimization to share architecture patterns and best practices. This allows SaaS companies to harness the full potential of data to drive innovation at scale.

“Built with Google’s data cloud, Exabeam’s limitless-scale cybersecurity platform helps enterprises respond to security threats faster and more accurately,” said Sanjay Chaudhary, VP of Products at Exabeam. “We are able to ingest data from over 500 security vendors, convert unstructured data into security events, and create a common platform to store them in a cost-effective way. The scale and power of Google’s data cloud enables our customers to search multi-year data and detect threats in seconds.”

Click here to learn more and apply for the Built with BigQuery initiative.

Enhancing secure data sharing with Analytics Hub

We are also launching a public preview of Analytics Hub, a fully-managed service built on BigQuery that allows our data sharing partners to efficiently and securely exchange valuable data and analytics assets across any organizational boundary. With unique, always-synchronized datasets and bi-directional sharing, partners can create a rich and trusted data ecosystem.
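Once a subscription is in place, shared data appears as a linked dataset that can be queried like any other BigQuery data. Below is a minimal sketch of that pattern using the BigQuery Python client; the project, dataset, and table names are hypothetical.

```python
# A minimal sketch: query a linked dataset created by an Analytics Hub
# subscription, using standard SQL (all names below are hypothetical).
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

sql = """
    SELECT ticker, trade_date, close_price
    FROM `my-project.shared_market_data.daily_prices`  -- linked dataset from a subscription
    WHERE trade_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
"""

for row in client.query(sql).result():
    print(row.ticker, row.trade_date, row.close_price)
```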

“As external data becomes more critical to organizations across industries, the need for a unified experience between data integration and analytics has never been more important. We are proud to be working with Google Cloud to power the launch of Analytics Hub, feeding hundreds of pre-engineered data pipelines from hundreds of external datasets,” said Dan Lynn, SVP Product at Crux. “The sharing capabilities that Analytics Hub delivers will significantly enhance the data mobility requirements of practitioners.”

Click here to join the public preview of Analytics Hub.

New launches from our data cloud partners

We’re excited to highlight several important launches from our partners themselves. At Google Cloud, we’re proud to support the fastest-growing and most innovative data and analytics companies, whether they’re running applications on Google Cloud, launching new integrations or connectors, co-creating entirely new capabilities with BigQuery, or continually tweaking and updating their platforms to provide the best experience for customers.

This week our partners Databricks, Fivetran, MongoDB, Neo4j, and Starburst Data are all announcing new capabilities for customers, including:

Databricks SQL will be publicly available for all customers on Google Cloud this month, enabling customers to operate multicloud lakehouse architectures with performant query execution. Learn more here.

Fivetran, in addition to joining the Cloud Ready – BigQuery initiative, is now a partner for the Google Cloud Cortex Framework. With deep experience in moving data from a variety of SaaS and database sources – including SAP – Fivetran offers Google Cloud customers accelerated time to value in unlocking the Google Cloud Cortex Framework data models, driving real-time analytics and business insights.

MongoDB is working to launch real-time integration of operational data from Atlas to Google BigQuery (and vice versa) via Dataflow templates. This lets customers cross-reference operational data with BigQuery’s analytics and AI/ML tools to support use cases such as anomaly detection in IoT, product recommendations in retail, and fault detection in manufacturing, and then feed those insights back to MongoDB Atlas to power continuous intelligence for the modern, real-time enterprise. This integration is targeted to be available in Q3 2022.

Neo4j is launching a fully-managed graph technology service for data scientists and developers to build intelligent, algorithm-powered applications with Neo4j Graph Data Science on Google Cloud.

Starburst is announcing a packaged offer for customers to enrich their BigQuery data foundation with hybrid, cross-cloud data stores.

The depth and breadth of innovation and support from the Google Cloud ecosystem is a tremendous asset for customers as they accelerate their data-driven digital transformations. Our community of expert services partners and systems integrators is heavily engaged, too – to date, our partners have earned more than 80 Specializations and more than 200 Expertises pertaining to data cloud technologies on Google Cloud. Visit our partner directory to find partners specialized in Google’s data cloud.

If you are already a Google Cloud partner, sign up to get your product integration validated by our experts. If you are looking to build your applications on Google’s data cloud, apply for the Built with BigQuery initiative. To become a Google Cloud partner, click here to enroll.


Source: Data Analytics

Limitless Data. All Workloads. For Everyone

Limitless Data. All Workloads. For Everyone

Today, data exists in many formats, is provided in real-time streams, and stretches across many different data centers and clouds, all over the world. From analytics, to data engineering, to AI/ML, to data-driven applications, the ways in which we leverage and share data continue to expand. Data has moved beyond the analyst and now impacts every employee, every customer, and every partner. With the dramatic growth in the amount and types of data, workloads, and users, we are at a tipping point where traditional data architectures – even when deployed in the cloud – are unable to unlock data’s full potential. As a result, the data-to-value gap is growing. 

To address these challenges, we are unveiling several data cloud innovations today that allow our customers to work with limitless data, across all workloads, and extend access to everyone. These announcements include BigLake and Spanner change streams to further unify customer data while ensuring it’s delivered in real time, as well as Vertex AI Workbench and Model Registry to close the data-to-AI value gap. And to bring data within reach for anyone, we are announcing a unified business intelligence (BI) experience that includes a new Workspace integration, along with new programs that further enable our data cloud partner ecosystem. 

Removing all data limits 

Today, we are announcing the preview of BigLake, a data lake storage engine, to remove data limits by unifying data lakes and warehouses. Managing data across disparate lakes and warehouses creates silos and increases risk and cost, especially when data needs to be moved. BigLake allows companies to unify their data warehouses and lakes to analyze data without worrying about the underlying storage format or system, which eliminates the need to duplicate or move data from a source and reduces cost and inefficiencies. 

With BigLake, customers gain fine-grained access controls, with an API interface spanning Google Cloud and open file formats like Parquet, along with open-source processing engines like Apache Spark. These capabilities extend a decade’s worth of innovations with BigQuery to data lakes on Google Cloud Storage to enable a flexible and cost-effective open lakehouse architecture. 
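As a concrete sketch, a BigLake table over Parquet files in Cloud Storage can be defined with standard BigQuery DDL; in the example below, run through the BigQuery Python client, the project, connection, dataset, and bucket names are hypothetical placeholders.

```python
# A minimal sketch: define a BigLake table over Parquet files in Cloud Storage.
# The project, dataset, connection, and bucket names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

ddl = """
    CREATE EXTERNAL TABLE `my-project.lake.events`
    WITH CONNECTION `my-project.us.lake-connection`
    OPTIONS (
      format = 'PARQUET',
      uris = ['gs://my-data-lake/events/*.parquet']
    )
"""

client.query(ddl).result()  # run the DDL; the table can then be queried like any BigQuery table
```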

Twitter already uses these storage capabilities with BigQuery to remove data limits and better understand how people use its platform and what types of content they might be interested in. As a result, Twitter is able to serve content across trillions of events per day with an ads pipeline that runs more than 3M aggregations per second.

Another major innovation we’re announcing today is Spanner change streams. Coming soon, this new capability will further remove data limits for our customers by letting them track changes within their Spanner database in real time. Spanner change streams capture inserts, updates, and deletes and stream those changes across a customer’s entire Spanner database as they happen. Customers always have access to the freshest data: they can replicate changes from Spanner to BigQuery for real-time analytics, trigger downstream application behavior using Pub/Sub, or store changes in Google Cloud Storage (GCS) for compliance. With the addition of change streams, Spanner, which currently processes over 2 billion requests per second at peak with up to 99.999% availability, now gives customers endless possibilities for processing their data. 
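For illustration, adding a change stream is a schema change on an existing database. The sketch below uses the Spanner Python client to create a stream over a single table; the instance, database, and table names are hypothetical.

```python
# A minimal sketch: add a change stream to an existing Spanner database.
# The instance, database, and table names are placeholders.
from google.cloud import spanner

client = spanner.Client(project="my-project")
database = client.instance("orders-instance").database("orders-db")

operation = database.update_ddl(
    [
        # Track inserts, updates, and deletes on the Orders table;
        # "CREATE CHANGE STREAM EverythingStream FOR ALL" would watch every table.
        "CREATE CHANGE STREAM OrdersStream FOR Orders"
    ]
)
operation.result()  # wait for the schema change to complete
```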

Removing the limits of your data workloads

Our AI portfolio is powered by Vertex AI, a managed platform with every ML tool needed to build, deploy and scale models, and is optimized to work seamlessly with data workloads in BigQuery and beyond. Today, we’re announcing new Vertex AI innovations that will provide customers with an even more streamlined experience to get AI models into production faster and make maintenance even easier.

Vertex AI Workbench, which is now generally available, brings data and ML systems into a single interface so that teams have a common toolset across data analytics, data science, and machine learning. With native integrations across BigQuery, Serverless Spark, and Dataproc, Vertex AI Workbench enables teams to build, train and deploy ML models 5X faster than traditional notebooks. In fact, a global retailer was able to drive millions of dollars in incremental sales and deliver 15% faster speed to market with Vertex AI Workbench.
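As a simple sketch of that workflow, a notebook running in Workbench can pull a BigQuery result straight into a DataFrame and train a model in the same environment; the dataset and column names below are hypothetical.

```python
# A minimal sketch: query BigQuery into a DataFrame and fit a model in the
# same notebook session (dataset and column names are hypothetical).
from google.cloud import bigquery
from sklearn.linear_model import LogisticRegression

client = bigquery.Client()

df = client.query(
    "SELECT tenure_months, monthly_spend, churned FROM `my-project.crm.customers`"
).to_dataframe()

features = df[["tenure_months", "monthly_spend"]]
model = LogisticRegression().fit(features, df["churned"])
print("training accuracy:", model.score(features, df["churned"]))
```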

With Vertex AI, customers have the ability to regularly update their models. But managing the sheer number of artifacts involved can quickly get out of hand. To reduce the overhead of model maintenance, we are announcing new MLOps capabilities with Vertex AI Model Registry. Now in preview, Vertex AI Model Registry provides a central repository for discovering, using, and governing machine learning models, including those in BigQuery ML. This makes it easy for data scientists to share models and for application developers to use them, ultimately enabling teams to turn data into real-time decisions and be more agile in the face of shifting market dynamics.
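For illustration, here is a minimal sketch of registering a trained model with the Vertex AI SDK for Python; the project, artifact location, and model name are hypothetical placeholders.

```python
# A minimal sketch: register a trained model in Vertex AI Model Registry.
# The project, artifact URI, and display name are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

model = aiplatform.Model.upload(
    display_name="churn-classifier",
    artifact_uri="gs://my-models/churn/v1/",  # placeholder model artifacts
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest",
)
print(model.resource_name)

# Registered models can be discovered, governed, and deployed from the registry.
for m in aiplatform.Model.list(filter='display_name="churn-classifier"'):
    print(m.display_name, m.resource_name)
```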

Extending the reach of your data

Today, we are launching Connected Sheets for Looker, and the ability to access Looker data models within Data Studio. Customers now have the ability to interact with data however they choose, whether it be through Looker Explore, from Google Sheets, or using the drag-and-drop Data Studio interface. This will make it easier for everyone to access and unlock insights from data in order to drive innovation, and to make data-driven decisions with this new unified Google Cloud business intelligence (BI) platform. This unified BI experience makes it easy to tap into governed, trusted enterprise data, to incorporate new data sets and calculations, and to collaborate with peers.

Mercado Libre, the largest online commerce and payments ecosystem in Latin America, has been an early adopter of Connected Sheets for Looker. Using this integration, they have been able to provide broader access to data through a spreadsheet interface that their employees are already familiar with. By lowering the barrier to entry, they have been able to build a data-driven culture in which everyone can inform their decisions with data. 

Doubling down on the data cloud partner ecosystem

Closing the data-to-value gap with these data innovations would not be possible without our incredible partner ecosystem. Today, there are more than 700 software partners powering their applications using Google’s data cloud. Many partners like Bloomreach, Equifax, Exabeam, Quantum Metric, and ZoomInfo, have started using our data cloud capabilities with the Built with BigQuery initiative, which provides access to dedicated engineering teams, co-marketing, and go-to-market support. 

Our customers want partner solutions that are tightly integrated and optimized with products like BigQuery. So today, we’re announcing Google Cloud Ready – BigQuery, a new validation that recognizes partner solutions like those from Fivetran, Informatica, and Tableau that meet a core set of functional and interoperability requirements. We already recognize more than 25 partners in this new Google Cloud Ready – BigQuery program, which reduces the costs customers incur when evaluating new tools while also adding support for new customer use cases. 

We’re also announcing a new Database Migration Program to help our customers efficiently and effectively accelerate the move from on-premises environments and other clouds to Google’s industry-leading managed database services. This includes tooling, resources, and expertise from alliance partners like Deloitte, as well as incentives from Google to offset the cost of migrating databases.

We remain committed to continued innovation with the leading data and analytics companies where our customers are investing. This week Databricks, Fivetran, MongoDB, Neo4j, and Redis are all announcing significant new capabilities for customers on Google Cloud.

All of these announcements and more will be shared in detail at our Data Cloud Summit. Be sure to watch the data cloud strategy sessions and breakouts, and get access to hands-on content. There is no doubt the future of data holds limitless possibilities, and we are thrilled to be on this data cloud journey.


Source: Data Analytics

5 Hardware Accelerators Every Data Scientist Should Leverage

5 Hardware Accelerators Every Data Scientist Should Leverage

The data science profession has become highly complex in recent years. Data science companies are taking new initiatives to streamline many of their core functions and minimize some of the more common issues that they face. They are using tools like Amazon SageMaker to take advantage of more powerful machine learning capabilities.

Amazon SageMaker is a hardware accelerator platform that uses cloud-based machine learning technology. It helps developers create and maintain highly effective machine learning applications that operate in the cloud. Although it is primarily cloud-based, SageMaker also works on embedded systems.
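As a rough illustration of how teams typically interact with it, the sketch below uses the SageMaker Python SDK to launch a managed training job; the container image, IAM role, and S3 paths are placeholders.

```python
# A rough sketch: launch a managed SageMaker training job.
# The container image, IAM role ARN, and S3 paths are placeholders.
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",
    role="arn:aws:iam::123456789012:role/MySageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/model-artifacts/",
    sagemaker_session=session,
)

# Train against data already staged in S3; SageMaker provisions and tears down the infrastructure.
estimator.fit({"train": "s3://my-bucket/training-data/"})
```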

Although SageMaker has become a popular hardware accelerator since it was launched in 2017, there are plenty of other overlooked hardware accelerators on the market. If you want to streamline various parts of the data science development process, then you should be aware of all of your options. The right hardware accelerator can help significantly.

Here are some of the most common hardware accelerators.

Morphware

Morphware is a newer hardware accelerator, but it is already becoming very popular. It offers highly powerful computing capabilities that can handle state-of-the-art machine learning tasks, and it allows people with excess computing resources to sell them to data scientists in exchange for cryptocurrencies.

One of the biggest advantages of Morphware over many other hardware accelerators is that it is a two-sided marketplace. Data scientists can access remote computing power through sophisticated networks. Companies and individuals with the computing power that data scientists might need are able to sell it in exchange for cryptocurrencies.

There are significant benefits to this incentive-based approach to hardware acceleration. Among other things, it helps ensure global computing resources are used as efficiently as possible and allows data science companies to take advantage of those resources at a reduced cost.

IBM Watson Studio

IBM Watson Studio is a very popular solution for handling machine learning and data science tasks. It is highly popular among companies developing artificial intelligence tools. Companies working on AI technology can use it to improve scalability and optimize the decision-making process.

IBM Watson Studio has a number of major selling points, including:

- A feature known as AutoAI. This feature helps automate many parts of the data preparation and data model development process, significantly reducing the amount of time needed to engage in data science tasks.
- A text analytics interface that helps derive actionable insights from unstructured data sets.
- A data visualization interface known as SPSS Modeler.

There are a number of reasons that IBM Watson Studio is a highly popular hardware accelerator among data scientists.

Neptune.ai

Neptune.ai is another popular hardware accelerator. It allows data scientists to log, store, share, compare, and search the important metadata used to build models for data science applications.

Neptune.ai offers a number of great advantages. Some of the biggest benefits include the following:

- It can be integrated with over 25 tools that can be used to scale and simplify data science tasks.
- Data scientists can easily collaborate with each other and share insights.
- There are highly advanced filters that can be used to search for relevant metadata more easily and conduct useful experiments.

Neptune.ai might not have the same brand recognition as Amazon SageMaker or IBM Watson. However, it is still a very powerful hardware accelerator that offers great features for its price tag.
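For a sense of how this works in practice, here is a minimal sketch of logging run metadata with the Neptune Python client; the project name and API token are placeholders, and the exact API surface depends on the client version you install.

```python
# A minimal sketch: log parameters and a metric series for one training run.
# The project name and API token are placeholders; the API surface varies by client version.
import neptune.new as neptune

run = neptune.init(project="my-workspace/churn-model", api_token="<your-api-token>")

run["parameters"] = {"learning_rate": 0.01, "epochs": 20}

for accuracy in [0.71, 0.78, 0.84]:
    run["train/accuracy"].log(accuracy)  # builds a series you can compare across runs

run.stop()
```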

Google Cloud AI Platform

Google is a technology giant that requires no introduction. However, you still may have never heard of the Google Cloud AI Platform. This is a very popular hardware accelerator that offers a lot of great benefits to data scientists.

You can use Google Cloud AI Platform to construct intricate machine learning models. One of the biggest selling points of this interface is versatility. You can create machine learning models of any size that work with every type of data you might need.

Data scientists have used the Google Cloud AI Platform for many different applications. Some of the most popular have been interactive customer service tools like chatbots.
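As a rough sketch of building on the platform today, the example below uses the Vertex AI SDK for Python (the successor to the AI Platform SDK) to run a custom training job; the script, container image, and bucket names are hypothetical.

```python
# A rough sketch: run a local training script as a managed custom training job.
# The project, bucket, script, and container image are placeholders.
from google.cloud import aiplatform

aiplatform.init(
    project="my-project",
    location="us-central1",
    staging_bucket="gs://my-staging-bucket",
)

job = aiplatform.CustomTrainingJob(
    display_name="chatbot-intent-classifier",
    script_path="train.py",  # your local training script
    container_uri="us-docker.pkg.dev/vertex-ai/training/scikit-learn-cpu.0-23:latest",  # placeholder pre-built image
)

# Runs the script on managed infrastructure and returns when training completes.
job.run(replica_count=1, machine_type="n1-standard-4")
```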

If you went solely by reviews of its versatility and performance capabilities, you would think this hardware accelerator is hands-down the best on the market. However, it does have some downsides, and its performance comes at a cost. One of the biggest drawbacks is a convoluted user interface that is difficult to navigate, which some customers say makes the platform harder to use. The steeper learning curve means it may not be the best fit for inexperienced data scientists or smaller teams working on less complex projects.

Comet

Comet is a very powerful hardware accelerator that is used for various data science projects. It helps manage, streamline, and improve every stage of the machine learning lifecycle, including experiment tracking, data collection, and monitoring of model development.

One of the biggest benefits of Comet is that it allows you to handle tasks in real time. Data scientists can see how well various elements of the model perform at any stage of the lifecycle.

There are other benefits of using Comet as well. You can easily integrate the hardware accelerator with other tools. It also comes with a number of great workspaces and user management tools. Furthermore, there are powerful visualization tools for handling various workflows.
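As a brief illustration, experiment tracking with the Comet Python SDK looks roughly like the sketch below; the API key and project name are placeholders.

```python
# A minimal sketch: track parameters and metrics for a training run with Comet.
# The API key and project name are placeholders.
from comet_ml import Experiment

experiment = Experiment(api_key="<your-api-key>", project_name="demand-forecasting")

experiment.log_parameters({"learning_rate": 0.01, "batch_size": 64})

for step, loss in enumerate([0.92, 0.61, 0.44]):
    experiment.log_metric("train_loss", loss, step=step)  # streams metrics for near real-time monitoring

experiment.end()
```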


Source: SmartData Collective