IDC reveals 323% ROI for SAP customers using BigQuery

If the COVID-19 pandemic has taught us anything, it is that speed and intelligence are of the essence when it comes to making business decisions. Organizations must find ways of keeping ahead of competitors and disruptions by continually leveraging data to make smart decisions. The problem? Data may be everywhere, but it’s not always available in a form that businesses can use to generate analytics in real time. As a result, enterprise users are divided when it comes to data volume, with half telling IDC there’s too much data and nearly as many saying there isn’t enough. The first step toward extracting the full value of your company’s data, then, is to synthesize and process it at scale. 

In a study recently published by research firm IDC, organizations running SAP reported using BigQuery to optimize data from their SAP applications (including ERP, supply chain, and CRM) as well as from other internal and external data sources. IDC interviewed seven customers for this study, with an average annual revenue of $1.8B, 36,000 employees, and 1.5PB of data.

BigQuery, a fully managed, serverless cloud data warehouse within Google’s data cloud, helps companies integrate, aggregate, and analyze petabytes of data. SAP customers use BigQuery to generate more impactful analytics and improve business results while lowering platform and staffing costs, reaping average annual benefits of $6.4 million per organization, according to IDC’s findings.

BigQuery provides major productivity and infrastructure benefits

IDC found that the benefits that BigQuery offered organizations fell into three broad categories: increased business productivity, IT staff productivity, and reduced infrastructure costs. These three factors combined to give enterprises speed, efficiency, and power that led to improved business results. 

Business productivity benefits: Companies interviewed by IDC reported that BigQuery enhanced their ability to generate SAP-related business insights — including self-service options — by reducing query cycle times and delivering reports to line of business executives, managers, and end users more quickly. After adopting BigQuery, these organizations found that queries accelerated by 63% and business reports or dashboards were generated 77% faster. BigQuery provided data scientists, business intelligence teams, data/analytics engineers, and business analysts with easier access to more robust and timely data and insights derived from their SAP environments. This helped them work more effectively and deliver richer and more timely data-driven insights to line of business end users. “BigQuery for SAP has had a significant impact on our data analytics activities,” reports one respondent. “The speed of running queries is amazing. . . Our analytics teams have access to information and reports that they did not have before.”

IT staff productivity benefits: With BigQuery, organizations reported freeing up staff time to not only handle expanding data environments, but also support other data- and business-related initiatives. Since BigQuery is a fully managed solution, internal IT staff no longer had to conduct routine tasks such as data backups, allowing them to take a more proactive approach in leveraging their SAP data environments to support business operations. “Before BigQuery for SAP, we used to say that our team worked for analytics,” notes a survey respondent. “Our DevOps team was spending a lot of time doing care and feeding, firefighting, and a lot of reactive work like patching. Now, we have been able to focus the majority of our time on proactive roadmap development and future functionality.”

IT infrastructure cost reductions: The IDC study showed that BigQuery helped respondents reduce costs by lowering their platform expenses and making management and administration of their data warehousing environments more efficient. These cost savings came thanks to BigQuery’s ability to consolidate data on a single cloud-based platform, limiting overlap, redundancies, and overprovisioning. Organizations also reduced the costs associated with maintaining and running on-premises data warehousing environments, including retiring less cost-effective legacy solutions. “The biggest area of value of BigQuery for SAP for us is that we no longer need to deal with having to maintain our own data,” explained a survey respondent. “We don’t have to worry about updates or administration, and most maintenance is now outsourced to BigQuery, including backups and business continuity.”

Speed, efficiency and power for improved results

Overall, respondents said that the enhanced ability to get value from their SAP data using BigQuery led directly to improved business results. More efficient data usage meant better strategy execution. Faster and more precise analytics gave more employees the information and tools they need to create more value for their organizations. BigQuery’s scalability and speed improved business operations and shortened time to insight. Customers gained a clearer real-time view of business operations, which helped them win new business, differentiate themselves against competitors, and reduce customer churn. 

“The main benefit of BigQuery for SAP is real-time intelligence that helps us improve pricing, traffic management, routing configuration, and rating — the core of our business,” says a respondent. “This gives us the edge over competitors — when we have better intel and faster analytics, it allows us to improve revenue margins and business outcomes.” In fact, organizations increased revenue by an average $22.8 million. What could your company do with those kinds of results?

To learn more about the ways in which Google BigQuery can make your SAP organization more agile and efficient, read the full IDC report and check out additional SAP on Google Cloud resources.

Related Article

How SAP customers can accelerate analytics in the cloud

Google Cloud’s data analytics and AI and machine learning solutions can help SAP customers store, analyze, and derive insights from all t…

Read Article

Source: Data Analytics

BigQuery Admin reference guide: API landscape

So far in this series, we’ve been focused on generic concepts and console-based workflows. However, when you’re working with huge amounts of data or surfacing information to lots of different stakeholders, leveraging BigQuery programmatically becomes essential. In today’s post, we’re going to take a tour of BigQuery’s API landscape – so you can better understand what each API does and what types of workflows you can automate with it.

Leveraging Google Cloud APIs

You can access all the Google Cloud APIs from server applications with our client libraries in many popular programming languages, from mobile apps via the Firebase SDKs, or by using third-party clients. You can also access Cloud APIs with the Google Cloud SDK tools or the Google Cloud Console. If you are new to Cloud APIs, see Getting Started on how to use Cloud APIs.

All Cloud APIs provide a simple JSON HTTP interface that you can call directly or via Google API Client Libraries. Most Cloud APIs also provide a gRPC interface you can call via Google Cloud Client Libraries, which provide better performance and usability. For more information about our client libraries, check out Client Libraries Explained.

BigQuery v2 API

The BigQuery v2 API is where you can interact with the “core” of BigQuery. This API gives you the ability to manage data warehousing resources like datasets, tables (including both external tables and views), and routines (functions and procedures). You can also leverage BigQuery’s machine learning capabilities, and create or poll jobs for querying, loading, copying, or extracting data.

Programmatically getting query results

One common way to leverage this API is to programmatically get answers to business questions by running BigQuery queries and then doing something with the results. One example that quickly comes to mind is automatically filling in a Google Slides template. This can be especially useful if you’re preparing slides for something like a quarterly business review, where each team may need a slide that shows their sales performance for the last quarter. Many times an analyst is forced to manually run queries and copy-paste the results into the slide deck. However, with the BigQuery API, the Google Slides API, and a Google Apps Script we can automate this entire process!

If you’ve never used Google Apps Script before, you can use it to quickly build serverless functions that run inside of Google Drive. Apps Script already has the Google Workspace and Cloud libraries available, so you simply need to add the Slides and BigQuery services to your script.

In your script you can do something like loop through each team’s name and use it as a parameter to run a parameterized query, and then use the result to replace a template placeholder in that team’s slide within the deck. Check out some example code here, and look out for a future post with more details on the entire process!
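If you want to prototype the query side outside of Apps Script, here’s a minimal Python sketch of the same parameterized-query loop using the BigQuery client library. The project, dataset, table, and team names are placeholders, not the ones from the example code.

```python
from google.cloud import bigquery

client = bigquery.Client()

teams = ["north-america", "emea", "apac"]  # hypothetical team names
sql = """
    SELECT SUM(sales_amount) AS total_sales
    FROM `my-project.sales.quarterly_sales`  -- hypothetical table
    WHERE team = @team
"""

for team in teams:
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("team", "STRING", team)]
    )
    total = list(client.query(sql, job_config=job_config))[0]["total_sales"]
    # In the Apps Script version, this value would replace a placeholder
    # in the team's slide via the Slides API.
    print(team, total)
```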

Loading in new data

Aside from querying existing data available in BigQuery, you can also use the API to create and run a load job to add new data into BigQuery tables. This is a common scenario when building batch loading pipelines. One example might be transforming and bringing data into BigQuery from a transactional database each night. If you remember from our post on tables in BigQuery, you can actually run an external query against a Cloud SQL database. This means that we can simply send a query job, through BigQuery’s API, to grab new data from the Cloud SQL table. Below, we’re using the magics command from the google.cloud.bigquery Python library to save the results into a pandas dataframe.
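The original snippet isn’t reproduced here; as a rough sketch, the same thing using the client library directly (rather than the %%bigquery cell magic) might look like the following, where the connection ID and inner query are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

# EXTERNAL_QUERY pushes the inner SQL down to the Cloud SQL instance behind
# the named connection and returns the results to BigQuery.
sql = """
    SELECT *
    FROM EXTERNAL_QUERY(
        'my-project.us.my-cloudsql-connection',
        'SELECT customer_id, address FROM customers'
    )
"""
df = client.query(sql).to_dataframe()
```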

Next, we may need to transform the results. For example, we can use the Google Maps GeoCoding API to get the latitude and longitude coordinates for each customer in our data. 
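A sketch of that transformation step using the googlemaps client library; the API key and the address column are assumptions about the data, not part of the original example:

```python
import googlemaps

gmaps = googlemaps.Client(key="YOUR_API_KEY")  # hypothetical API key

def geocode_address(address):
    # Return (lat, lng) for the first geocoding match.
    location = gmaps.geocode(address)[0]["geometry"]["location"]
    return location["lat"], location["lng"]

# df is the dataframe from the previous step, assumed to have an "address" column.
df["lat"], df["lng"] = zip(*df["address"].map(geocode_address))
```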

Finally, we can create a load job to add the data, along with the coordinates, into our existing native BigQuery table.
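For instance, a minimal sketch of that load job using load_table_from_dataframe; the destination table name is a placeholder:

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.sales.customers_geocoded"  # hypothetical destination table

job_config = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
# df is the geocoded dataframe from the previous steps.
load_job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
load_job.result()  # wait for the load job to finish
```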

You can access this code in our sample Jupyter Notebook. However, if you were using this in production you may want to leverage something like a Google Cloud Function.

Reservations API

While the “core” of BigQuery is handled through the BigQuery v2 API, there are other APIs to manage tangential aspects of BigQuery. The Reservations API, for example, allows you to programmatically leverage workload management resources like capacity commitments, reservations, and assignments, as we discussed in a previous post.

Workload management

Let’s imagine that we have an important dashboard loading at 8am on the first Monday of each month. You’ve decided that you want to leverage flex slots to ensure that there are enough workers to make the dashboard load super fast for your CEO. So, you decide to write a program that purchases a flex slot commitment, creates a new reservation for loading the dashboard, and then assigns the project where the BI tool will run the dashboard to the new reservation. Check out the full sample code here!
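As a rough sketch of those three steps with the google-cloud-bigquery-reservation client library (the project IDs, reservation name, and slot counts are all hypothetical, and flex commitments are purchased in increments of 100 slots):

```python
from google.cloud import bigquery_reservation_v1 as reservation

client = reservation.ReservationServiceClient()
parent = "projects/admin-project/locations/US"  # hypothetical admin project

# 1. Purchase a flex slot commitment.
client.create_capacity_commitment(
    parent=parent,
    capacity_commitment=reservation.CapacityCommitment(
        plan=reservation.CapacityCommitment.CommitmentPlan.FLEX,
        slot_count=500,
    ),
)

# 2. Create a reservation dedicated to the dashboard workload.
dashboard_reservation = client.create_reservation(
    parent=parent,
    reservation_id="ceo-dashboard",
    reservation=reservation.Reservation(slot_capacity=500, ignore_idle_slots=False),
)

# 3. Assign the BI tool's project to the new reservation.
client.create_assignment(
    parent=dashboard_reservation.name,
    assignment=reservation.Assignment(
        job_type=reservation.Assignment.JobType.QUERY,
        assignee="projects/bi-tool-project",
    ),
)
```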

Storage API

Another relevant API for working with BigQuery is the Storage API. The Storage API allows you to use BigQuery like both a data warehouse and a data lake. It’s real-time, so you don’t have to wait for your data; it’s fast, so you don’t need to reduce or sample your data; and it’s efficient, so you only read the data you want. It’s broken down into two components.

The Read Client exposes a data stream suitable for reading large volumes of data. It also provides features for parallelizing reads, performing partial projections, filtering data, and offering precise control over snapshot time.

The Write Client (preview) is the successor to the streaming mechanism found in the BigQuery v2 API. It supports more advanced write patterns such as exactly-once semantics. More on this soon!

The Storage API was used to build a series of Hadoop connectors so that you can run your Spark workloads directly on your data in BigQuery. You can also build your own connectors using the Storage API!
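To give a feel for the Read Client, here’s a minimal Python sketch using the google-cloud-bigquery-storage library. The billing project is a placeholder, and the table is a BigQuery public dataset:

```python
from google.cloud.bigquery_storage import BigQueryReadClient, types

client = BigQueryReadClient()

requested_session = types.ReadSession(
    table="projects/bigquery-public-data/datasets/usa_names/tables/usa_1910_current",
    data_format=types.DataFormat.AVRO,
    read_options=types.ReadSession.TableReadOptions(
        selected_fields=["name", "number"],   # partial projection
        row_restriction='state = "WA"',       # server-side filtering
    ),
)
session = client.create_read_session(
    parent="projects/my-project",             # hypothetical billing project
    read_session=requested_session,
    max_stream_count=1,
)

reader = client.read_rows(session.streams[0].name)
for row in reader.rows(session):
    print(row["name"], row["number"])
```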

Connections API

The BigQuery Connections API is used to create a connection to external storage systems, like Cloud SQL. This enables BigQuery users to issue live, federated queries against other systems. It also supports BigQuery Omni for defining multi-cloud data sources and structures.

Programmatically Managing Federation Connections

Let’s imagine that you are embedding analytics for your customers. Your web application is structured such that each customer has a single-tenant Cloud SQL instance that houses their data. To perform analytics on top of this information, you may want to create connections to each Cloud SQL database. Instead of manually setting up each connection, one option could be using the Connections API to programmatically create new connections during the customer onboarding process.
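A sketch of what that onboarding step might look like with the google-cloud-bigquery-connection client library; the instance, database, and credentials are placeholders:

```python
from google.cloud import bigquery_connection_v1 as bq_connection

client = bq_connection.ConnectionServiceClient()
parent = "projects/my-project/locations/us"  # hypothetical project and location

connection = bq_connection.Connection(
    friendly_name="tenant-42",
    cloud_sql=bq_connection.CloudSqlProperties(
        instance_id="my-project:us-central1:tenant-42-instance",
        database="tenant_42",
        type_=bq_connection.CloudSqlProperties.DatabaseType.POSTGRES,
        credential=bq_connection.CloudSqlCredential(
            username="bq_federation", password="REPLACE_ME"
        ),
    ),
)

# Create the federation connection as part of customer onboarding.
created = client.create_connection(
    parent=parent, connection_id="tenant-42", connection=connection
)
print(created.name)
```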

Data Transfer Service API

The BigQuery Data Transfer Service allows you to automate work to ingest data from known data sources, with standard scheduling. DTS offers support for migration workloads, such as specialized integrations that help with synchronizing changes from a legacy, on-premises warehouse. Using the API you can do things like check credentials, trigger manual runs, and get responses from prior transfers.
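For example, triggering a manual run might look roughly like this with the google-cloud-bigquery-datatransfer library; the transfer config name is a placeholder:

```python
import time

from google.cloud import bigquery_datatransfer_v1 as transfer
from google.protobuf.timestamp_pb2 import Timestamp

client = transfer.DataTransferServiceClient()
config_name = "projects/my-project/locations/us/transferConfigs/1234"  # hypothetical

# Trigger a manual run "now" and print the runs that were started.
response = client.start_manual_transfer_runs(
    request={
        "parent": config_name,
        "requested_run_time": Timestamp(seconds=int(time.time())),
    }
)
for run in response.runs:
    print(run.name, run.state)
```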

Data QnA API

The last API I’ll mention is one of my favorites—Data QnA which is currently in preview. Are business users at your organization always pinging you to query data on their behalf? Well, with the QnA API you can convert natural language text inquiries into SQL – meaning you can build a super powerful chatbot that fulfills those query requests, or even give your business users access to connected sheets so they can ask analytics questions directly in a spreadsheet. Check out this post to learn more about how customers are using this API today!

Wrapping up

Hopefully this post gave you a good idea of how working with the BigQuery APIs can open up new doors for automating data-fueled workflows. Keep in mind that there are lots of other relevant APIs when working with BigQuery, like some of the platforms we discussed last week in our post on Data Governance, including IAM for managing access policies, DLP for managing sensitive data, and Data Catalog for tracking and searching metadata.

Next week is our last post for this season of our BigQuery Spotlight YouTube series, and we’ll be walking through monitoring use cases. Remember to follow me on LinkedIn and Twitter!

Related Article

BigQuery Admin reference guide: Data governance

Learn how to ensure your data is discoverable, secure and usable inside of BigQuery.

Read Article

Source: Data Analytics

Handling duplicate data in streaming pipelines using Dataflow and Pub/Sub

Purpose

Processing streaming data to extract insights and power real-time applications is becoming more and more critical. Google Cloud Dataflow and Pub/Sub provide a highly scalable, reliable, and mature streaming analytics platform to run mission-critical pipelines. One very common challenge that developers often face when designing such pipelines is how to handle duplicate data.

In this blog, I want to give an overview of common places where duplicate data may originate in your streaming pipelines and discuss various options that are available to you to handle them. You can also check out this tech talk on the same topic.

Origin of duplicates in streaming data pipelines

This section gives an overview of the places where duplicate data may originate in your streaming pipelines. Numbers in red boxes in the following diagram indicate where this may happen.

Some duplicates are automatically handled by Dataflow while for others developers may need to use some techniques to handle them. This is summarized in the following table.

1. Source generated duplicate
Your data source system may itself produce duplicate data. There can be several reasons, such as network failures or system errors, that lead to duplicates. Such duplicates are referred to as ‘source generated duplicates’.

One example where this could happen is when you set trigger notifications from Google Cloud Storage to Pub/Sub in response to object changes to GCS buckets. This feature guarantees at-least-once delivery to Pub/Sub and can produce duplicate notifications.

2. Publisher generated duplicates 
Your publisher when publishing messages to Pub/Sub can generate duplicates due to at-least-once publishing guarantees. Such duplicates are referred to as ‘publisher generated duplicates’. 

Pub/Sub automatically assigns a unique message_id to each message successfully published to a topic. Each message is considered successfully published by the publisher when Pub/Sub returns an acknowledgement to the publisher. Within a topic all messages have a unique message_id and no two messages have the same message_id. If success of the publish is not observed for some reason (network delays, interruptions etc) the same message payload may be retried by the publisher. If retries happen, we may end up with duplicate messages with different message_id in Pub/Sub. For Pub/Sub these are unique messages as they have different message_id.

3. Reading from Pub/Sub
Pub/Sub guarantees at least once delivery for every subscription. This means that a message may be delivered more than once by the same subscription if Pub/Sub doesn’t receive acknowledgement within the acknowledgement deadline. The subscriber may acknowledge after the acknowledgement deadline or the acknowledgement may be lost due to transient network issues. In such scenarios the same message would be redelivered and subscribers may see duplicate data. It is the responsibility of the subscribing system (for example Dataflow) to detect such duplicates and handle accordingly.

When Dataflow receives messages from a Pub/Sub subscription, messages are acknowledged after they are successfully processed by the first fused stage. Dataflow performs an optimization called fusion, where multiple stages can be combined into a single fused stage. A break in fusion happens when there is a shuffle, which happens if you have transforms like GROUP BY, COMBINE, or I/O transforms like BigQueryIO. If a message has not been acknowledged within its acknowledgement deadline, Dataflow attempts to maintain the lease on the message by repeatedly extending the acknowledgement deadline to prevent redelivery from Pub/Sub. However, this is best effort and there is a possibility that messages may be redelivered. This can be monitored using the metrics listed here.

However, because Pub/Sub provides each message with a unique message_id, Dataflow uses it to deduplicate messages by default if you use the built-in Apache Beam PubSubIO. Thus Dataflow filters out such duplicates originating from redelivery of the same message by Pub/Sub. You can read more about this topic in one of our earlier blog posts, under the section “Example source: Cloud Pub/Sub”.

4. Processing data in Dataflow
Due to the distributed nature of processing in Dataflow each message may be retried multiple times on different Dataflow workers. However Dataflow ensures that only one of those tries wins and the processing from the other tries does not affect downstream fused stages. Dataflow does guarantee exactly once processing by leveraging checkpointing at each stage to ensure such duplicates are not reprocessed affecting state or output. You can read more about how this is achieved in this blog.

5. Writing to a sink
Each element can be retried multiple times by Dataflow workers and may produce duplicate writes. It is the responsibility of the sink to detect these duplicates and handle them accordingly. Depending on the sink, duplicates may be filtered out, over-written or appear as duplicates.

File systems as sink
If you are writing files, exactly-once processing is guaranteed, as any retries by Dataflow workers in the event of failure will overwrite the file. Beam provides several I/O connectors to write files, all of which guarantee exactly-once processing.

BigQuery as sink

If you use the built-in Apache Beam BigQueryIO to write messages to BigQuery using streaming inserts, Dataflow provides a consistent insert_id (different from Pub/Sub message_id) for retries and this is used by BigQuery for deduplication. However, this deduplication is best effort and duplicate writes may appear. BigQuery provides other insert methods as well with different deduplication guarantees as listed below.

You can read more about BigQuery insert methods in the BigQueryIO Javadoc. Additionally, for more information on BigQuery as a sink, check out the section “Example sink: Google BigQuery” in one of our earlier blog posts.

For duplicates originating from places discussed in points 3), 4) and 5) there are built-in mechanisms in place to remove such duplicates as discussed above, assuming BigQuery is a sink. In the following section we will discuss deduplication options for ‘source generated duplicates’ and ‘publisher generated duplicates’. In both cases, we have duplicate messages with different message_id, which for Pub/Sub and downstream systems like Dataflow are two unique messages.

Deduplication options for source generated duplicates and publisher generated duplicates

1. Use Pub/Sub message attributes

Each message published to a Pub/Sub topic can have some string key-value pairs attached as metadata under the “attributes” field of PubsubMessage. These attributes are set when publishing to Pub/Sub. For example, if you are using the Python Pub/Sub client library, you can set the “attrs” parameter of the publish method when publishing messages. You can set a unique field (e.g. event_id) from your message as the attribute value and the field name as the attribute key.
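A minimal sketch of publishing with such an attribute in Python; the project, topic, payload, and event_id values are placeholders:

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")  # hypothetical names

# Attributes are passed as keyword arguments; here we attach a unique event_id
# so downstream consumers can deduplicate on it.
future = publisher.publish(
    topic_path,
    data=b'{"order_id": 123, "amount": 42.0}',
    event_id="order-123-created",
)
print(future.result())  # the message_id assigned by Pub/Sub
```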

Dataflow can be configured to use these fields to deduplicate messages instead of the default deduplication using Pub/Sub message_id. You can do this by specifying the attribute key when reading from Pub/Sub using the built-in PubSubIO.

For Java SDK, you can specify this attribute key in the withIdAttribute method of PubsubIO.Read() as shown below.

In the Python SDK, you can specify this in the id_label parameter of the ReadFromPubSub PTransform as shown below.
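The original snippets aren’t reproduced here, but a minimal sketch of the Python version might look like this; the subscription path and attribute key are placeholders:

```python
import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    messages = p | "ReadFromPubSub" >> ReadFromPubSub(
        subscription="projects/my-project/subscriptions/my-sub",
        with_attributes=True,
        id_label="event_id",  # attribute key used for deduplication on Dataflow
    )
```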

This deduplication using a Pub/Sub message attribute is only guaranteed to work for duplicate messages that are published to Pub/Sub within 10 minutes of each other.

2. Use Apache Beam Deduplicate PTransform
Apache Beam provides deduplicate PTransforms which can deduplicate incoming messages over a time duration. Deduplication can be based on the message or on a key of a key-value pair, where the key could be derived from the message fields. The deduplication window can be configured using the withDuration method, which can be based on processing time or event time (specified using the withTimeDomain method). This has a default value of 10 minutes.

You can read the Java documentation or the Python documentation of this PTransform for more details on how this works.

This PTransform uses the Stateful API under the hood and maintains a state for each key observed. Any duplicate message with the same key that appears within the deduplication window is discarded by this PTransform.
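As a rough sketch in the Python SDK (the 10-minute processing-time window and the sample input are illustrative):

```python
import apache_beam as beam
from apache_beam.transforms.deduplicate import Deduplicate
from apache_beam.utils.timestamp import Duration

with beam.Pipeline() as p:
    deduped = (
        p
        | beam.Create(["event-1", "event-2", "event-1"])
        # Discard repeats of the same element seen within a 10-minute
        # processing-time window.
        | "Dedup" >> Deduplicate(processing_time_duration=Duration(seconds=10 * 60))
    )
```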

3. Do post-processing in sink
Deduplication can also be done in the sink. This could be done by running a scheduled job that periodically deduplicates rows using a unique identifier.

BigQuery as a sink
If BigQuery is the sink in your pipeline, a scheduled query can be executed periodically to write the deduplicated data to another table or update the existing table. Depending on the complexity of the scheduling you may need orchestration tools like Cloud Composer or Dataform to schedule queries.

Deduplication can be done using a DISTINCT statement or DML like MERGE. You can find sample queries about these methods on these blogs (blog 1, blog 2).
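For example, a deduplication job that keeps the latest row per identifier could be expressed roughly like this; the table and column names are hypothetical, and the SQL could equally run as a BigQuery scheduled query:

```python
from google.cloud import bigquery

client = bigquery.Client()

dedup_sql = """
    CREATE OR REPLACE TABLE `my-project.analytics.events_deduped` AS
    SELECT * EXCEPT(row_num)
    FROM (
        SELECT
            *,
            ROW_NUMBER() OVER (
                PARTITION BY event_id
                ORDER BY publish_time DESC
            ) AS row_num
        FROM `my-project.analytics.events_raw`
    )
    WHERE row_num = 1
"""
client.query(dedup_sql).result()  # wait for the job to finish
```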

Often in streaming pipelines you may need deduplicated data available in real time in BigQuery. You can achieve this by creating materialized views on top of underlying tables using a DISTINCT statement.

Any new updates to the underlying tables will be updated in real time to the materialized view with zero maintenance or orchestration.

Technical trade-offs of different deduplication options

Related Article

After Lambda: Exactly-once processing in Google Cloud Dataflow, Part 1

Learn the meaning of “exactly once” processing in Dataflow, its importance for stream processing overall and its implementation in stream…

Read Article

Source: Data Analytics

Single-cell genomic analysis accelerated by NVIDIA on Google Cloud

In the past decade, the Healthcare and Life Sciences industry has seen a boom in technological and scientific advancement. New insights and possibilities are revealed almost daily. At Google Cloud, driving innovation in cloud computing is in our DNA. Our team is dedicated to sharing ways Google Cloud can be used to accelerate scientific discovery. For example, the recent announcement of AlphaFold2 showcases a scientific breakthrough, powered by Google Cloud, that will promote a quantum leap in the field of proteomics. In this blog, we’ll review another omics use case, single-cell analysis, and how Google Cloud’s Dataproc and NVIDIA GPUs can help accelerate that analysis.

The Need for Performance in Scientific Analysis

The ability to understand the causal relationship between genotypes and phenotypes is one of the long-standing challenges in biology and medicine. Understanding and drawing insights from the complexity of biological systems spans from the actual code of life (DNA), through the expression of genes (RNA), to the translation of gene transcripts into proteins that function in different pathways, cells, and tissues within an organism. Even the smallest of changes in our DNA can have large impacts on protein expression, structure, and function, which ultimately drives development and response at both the cellular and organism levels. And, as the omics space becomes increasingly data- and compute-intensive, research requires an adequate informatics infrastructure: one that scales with growing data demands, enables a diverse range of resource-intensive computational activities, and is affordable and efficient, reducing data bottlenecks and enabling researchers to maximize insight.

But where do all these data and compute challenges come from and what makes scientific study so arduous? The layers of biological complexity begin to be made apparent immediately when looking at not just the genes themselves, but their expression. Although all the cells in our body share nearly identical genotypes, our many diverse cell types (e.g. hepatocytes versus melanocytes) express a unique subset of genes necessary for specific functions, making transcriptomics a more powerful method of analysis by allowing researchers to map gene expression to observable traits. Studies have shown that gene expression is heterogeneous, even in similar cell types. Yet, conventional sequencing methods require DNA or RNA extracted from a cell population. The development of single-cell sequencing was pivotal to the omics field. Single-cell RNA sequencing has been critical in allowing scientists to study transcriptomes across large numbers of individual cells. 

Despite its potential, and the increasing availability of single-cell sequencing technology, there are several obstacles: an ever increasing volume of high-dimensionality data, the need to integrate data across different types of measurements (e.g. genetic variants, transcript and protein expression, epigenetics) and across samples or conditions, as well as varying levels of resolution and the granularity needed to map specific cell types or states. These challenges present themselves in a number of ways including background noise, signal dropouts requiring imputation, and limited bioinformatics pipelines that lack statistical flexibility. These and other challenges result in analysis workflows that are very slow, prohibiting the iterative, visual, and interactive analysis required to detect differential gene activity. 

Accelerating Performance

Cloud computing can help not only with data challenges, but with some of the biggest obstacles: scalability, performance, and automation of analysis. To address several of the data and infrastructure challenges facing single-cell analysis, NVIDIA developed end-to-end accelerated single-cell RNA sequencing workflows that can be paired with Google Cloud Dataproc, a fully-managed service for running open source frameworks like Spark, Hadoop, and RAPIDS. The Jupyter notebooks that power these workflows include examples using samples like human lung cells and mouse brains cells and demonstrate acceleration between CPU-based processing compared to GPU-based workflows. 

Google Cloud Dataproc powers the NVIDIA GPU-based approach and demonstrates data processing capabilities and acceleration, which in turn have the potential to deliver considerable performance gains. When paired with RAPIDS, practitioners can accelerate data science pipelines on NVIDIA GPUs, reducing operations like data loading, processing, and training from hours to seconds. RAPIDS abstracts away the complexities of accelerated data science by building upon popular Python and Java libraries. When applying RAPIDS and NVIDIA accelerated compute to single-cell genomics use cases, practitioners can churn through the analysis of a million cells in only a few minutes.

Give it a Try

The journey to realizing the full potential of omics is long; but through collaboration with industry experts, customers, and partners like NVIDIA, Google Cloud is here to help shine a light on the road ahead. To learn more about the notebook provided for single-cell genomic analysis, please take a look at NVIDIA’s walkthrough. To give this pattern a try on Dataproc, please visit our technical reference guide.

Related Article

Building a genomics analysis architecture with Hail, BigQuery, and Dataproc

Try a cloud architecture for creating large clusters for genomics analysis with BigQuery and Google-built healthcare tooling.

Read Article

Source: Data Analytics

Converging architectures: Bringing data lakes and data warehouses together

Historically, data warehouses have been painful to manage. The legacy, on-premises systems that worked well for the past 40 years have proved to be expensive, with many challenges around data freshness and scaling. Furthermore, they cannot easily provide the AI or real-time capabilities that modern businesses need. We even see this with newly created cloud data warehouses: many still lack AI capabilities, despite being positioned as modern data warehouses, and are really lift-and-shift versions of the legacy on-premises environments.

At the same time, on-premises data lakes have had other challenges. They promised a lot and looked really good on paper, with low cost and the ability to scale. In reality, however, those promises did not materialize for many organizations, mainly because the lakes were not easily operationalized, productionized, or utilized, which in turn increased the overall total cost of ownership. Data lakes also created significant data governance challenges: they did not work well with existing IAM and security models, and they ended up creating data silos because data is not easily shared across the Hadoop environment.

With varying choices, customers would pick the environment that made sense for them: perhaps a pure data warehouse, a pure data lake, or a combination. This leads to a set of tradeoffs for nearly any real-world customer working with real-world data and use cases. This approach has naturally set up a model where different, and often disconnected, teams set up shop within organizations, resulting in users split between the data warehouse and the data lake.

Data warehouse users tend to be closer to the business and have ideas about how to improve analysis, often without the ability to explore the underlying data to drive a deeper understanding. In contrast, data lake users are closer to the raw data and have the tools and capabilities to explore it. Since they spend so much time doing this, they are focused on the data itself and less focused on the business. This disconnect robs the business of the opportunity to find insights that would drive it forward to higher revenues, lower costs, lower risk, and new opportunities.

Since then, the two systems have co-existed and complemented each other as the two main data analytics systems of the enterprise, residing side by side in the shared IT sphere. These are also the data systems at the heart of any digital transformation of the business and the move to a fully data-driven culture. As more organizations migrate their traditional on-premises systems to the cloud and SaaS solutions, enterprises are rethinking the boundaries of these systems and moving toward a more converged analytics platform.

This rethinking has led to a convergence of data lakes and warehouses, as well as of data teams across organizations. The cloud offers managed services that help expedite this convergence so that any data person can start to get insight and value out of the data, regardless of the system. The benefits of a converged data lake and data warehouse environment present themselves in several ways, most of them driven by the ability to provide managed, scalable, and serverless technologies. As a result, the distinction between storage and computation is blurred: it is no longer important to explicitly manage where data is stored or in what format. Users are democratized and should be able to access the data regardless of infrastructure limitations. From a data user’s perspective, it doesn’t really matter whether the data resides in a data lake or a data warehouse, and they don’t look at which system it comes from. They care about what data they have, whether they can trust it, how much of it they can ingest, and whether it is available in real time. They also want to discover and manage data across varied datastores, moving away from a siloed world toward an integrated data ecosystem, and, most importantly, to analyze and process data with any person or tool.

At Google Cloud, we provide a cloud-native, highly scalable and secure, converged solution that delivers choice and interoperability to customers. Our cloud-native architecture reduces cost and improves efficiency for organizations. For example, BigQuery’s full separation of storage and compute allows BigQuery compute to be brought to other storage mechanisms through federated queries. The BigQuery Storage API lets you treat the data warehouse like a data lake: you can access the data residing in BigQuery directly, for example from Spark, without affecting the performance of any other jobs accessing it. On top of this, Dataplex, our intelligent data fabric service, provides data governance and security capabilities across various storage tiers built on GCS and BigQuery.
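To make that concrete, reading BigQuery data from Spark via the Storage API looks roughly like this with the spark-bigquery connector, assuming the connector is available on the cluster; the table name is a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-from-spark").getOrCreate()

# Reads directly from BigQuery storage via the Storage Read API,
# without exporting the table or impacting other BigQuery workloads.
orders = (
    spark.read.format("bigquery")
    .option("table", "my-project.sales.orders")
    .load()
)
orders.groupBy("country").count().show()
```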

There are many benefits achieved by the convergence of data warehouses and data lakes, and if you would like to learn more, here’s the full paper.

Related Article

Registration is open for Google Cloud Next: October 12–14

Register now for Google Cloud Next on October 12–14, 2021

Read Article

Source: Data Analytics

Save messages, money, and time with Pub/Sub topic retention

Starting today, there is a simpler, more useful way to save and replay messages that are published to Pub/Sub: topic retention. Prior to topic retention, you needed to individually configure and pay for message retention in each subscription. Now, when you enable topic retention, all messages sent to the topic within the chosen retention window are accessible to all the topic’s subscriptions, without increasing your storage costs when you add subscriptions. Additionally, messages will be retained and available for replay even if there are no subscriptions attached to the topic at the time the messages are published. 

Topic retention extends Pub/Sub’s existing seek functionality—message replay is no longer constrained to the subscription’s acknowledged messages. You can initialize new subscriptions with data retained by the topic, and any subscription can replay previously published messages. This makes it safer than ever to update stream processing code without fear of data processing errors, or to deploy new AI models and services built on a history of messages. 

Topic retention explained

With topic retention, the topic is responsible for storing messages, independently of subscription retention settings. The topic owner has full control over the topic retention duration and pays the full cost associated with message storage by the topic. As a subscription owner, you can still configure subscription retention policies to meet your individual needs.

Topic-retained messages are available even when the subscription is not configured to retain messages

Initializing data for new use cases

As organizations become more mature at using streaming data, they often want to apply new use cases to existing data streams that they’ve published to Pub/Sub topics. With topic retention, you can access the history of this data stream for new use cases by creating a new subscription and seeking back to a desired point in time.

Using the gcloud CLI

These two commands initialize a new subscription and replay data from two days in the past. Retained messages are available within a minute after the seek operation is performed.
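For example (the subscription and topic names are placeholders, and the timestamp should be set to roughly two days before the current time):

gcloud pubsub subscriptions create mySubscription --topic=myTopic

gcloud pubsub subscriptions seek mySubscription --time=2021-08-28T00:00:00Z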

Choosing the retention option that’s right for you

Pub/Sub lets you choose between several different retention policies for your messages; here’s an overview of how we recommend you use each type.

Topic retention lets you pay just once for all attached subscriptions, regardless of when they were created, to replay all messages published within the retention window. We recommend topic retention in circumstances where it is desirable for the topic owner to manage shared storage.

Subscription retention allows subscription owners, in a multi-tenant configuration, to guarantee their retention needs independently of the retention settings configured by the topic owner.

Snapshots are best used to capture the state of a subscription at the time of an important event, e.g. an update to subscriber code when reading from the subscription.

Transitioning from subscription retention to topic retention

You can configure topic retention when creating a new topic or updating an existing topic via the Cloud Console or the gcloud CLI. In the CLI, the command would look like:

gcloud pubsub topics update myTopic --message-retention-duration=7d

If you are migrating to topic retention from subscription storage, subscription storage can be safely disabled after 7 days.

What’s next

Pub/Sub topic retention makes reprocessing data with Pub/Sub simpler and more useful. To get started, you can read more about the feature, visit the pricing documentation, or simply enable topic retention on a topic using Cloud Console or the gcloud CLI.

Related Article

How to detect machine-learned anomalies in real-time foreign exchange data

Model the expected distribution of financial technical indicators to detect anomalies and show when the Relative Strength Indicator is un…

Read Article

Source: Data Analytics

How Renault solved scaling and cost challenges on its Industrial Data platform using BigQuery and Dataflow

“Following the first significant bricks of data management implemented in our plants to serve initial use cases such as traceability and operational efficiency improvements, we confirmed we got the right solution to collect industrial data from our machines and operations at scale. We started to deploy that solution and we therefore needed a state-of-the-art data platform for contextualization, processing and hosting all the data gathered. This platform had to be scalable to deploy across our entire footprint, and also affordable and reliable, to foster the use of data in our operations. Our partnership with Google Cloud in 2020 was a key lever to reach this target.” – Jean-Marc Chatelanaz, Industry Data Management 4.0 Director, Renault

French multinational automotive manufacturer Renault Group has been investing in Industry 4.0 since the early days. A primary objective of this transformation has been to leverage manufacturing and industrial equipment data through a robust and scalable platform. Renault designed an industrial data acquisition layer and connected it to Google Cloud, using optimized big data products and services that together form Renault’s Industrial Data Platform.

In this blog, we’ll outline the evolution of that data platform on Google Cloud and how Renault worked with Google Cloud’s professional services to design and build a new architecture to achieve a secure, reliable, scalable, and cost-effective data management platform. Here’s how they’ve done it.

All started with Renault’s digital transformation

Renault Group has manufactured cars since 1898. Today, it is an international group with five brands that sold close to 3 million vehicles worldwide in 2020, providing sustainable mobility throughout the world with 40 manufacturing sites and more than 170,000 employees.

A leader in the Industry 4.0 movement, Renault launched its digital transformation plan in 2016 to achieve greater efficiency and modernize its traditional manufacturing and industrial practices. A key objective of this transformation was using the industrial data from up to 40 sites globally to promote data-based business decisions and create new opportunities.

Several initiatives were already growing in silos, such as Conditional Based Maintenance, Full Track and Trace, Industrial Data Capture and AI projects. In 2019, Renault kicked off a program named Industry Data Management 4.0 (IDM 4.0) to federate all those initiatives as well as future ones, and, most of all, to design and build a unique data platform and a framework for all Renault’s industrial data. 

The IDM 4.0 would ultimately enable Renault’s manufacturing, supply chain, and production engineering teams to quickly develop analytics, AI, and predictive applications based on a single industrial data capture and data referential platform as shown with these pillars:

Renault Group data transformation pillars

With Renault’s leadership in Industry 4.0 for manufacturing and supply chain, and with Google’s expertise in big data, analytics and machine learning since its foundations, Google Cloud solutions were a clear match for Renault’s ambitions.

Building IDM 4.0 in the cloud

Before IDM 4.0, Renault Digital was the internal team that incubated the idea of data in the cloud. Following some trials, the team was able to reduce storage costs by more than 50% by moving to Google Cloud. Key business requirements were data reliability, long-term storage, and unforgeability, all of which led the team to use Apache Spark running on fully-managed Dataproc for data processing, and Spanner as the main database.

The IDM 4.0 program aimed to scale and unfold more business use cases while keeping the platform cost-effective, with maximum performance and reliability. Meeting this goal required reviewing the architecture, as cost projections for the expected business usage would have exceeded the project budget, holding back additional business exploration.

Accelerating the redesign of that core data platform was a pillar of the Renault-Google Cloud partnership. Google Cloud’s professional services organization (PSO) helped define the best architecture to meet requirements, solve challenges, and handle 40 plants’ worth of scaling data in a reliable and cost-sustainable way.

Renault Group IDM 4.0 layers

The data capture journey is complex and includes several components. Basically, Renault built a specific process relying on standard data models designed with OPC-UA, an open machine-to-machine communication protocol. In the new Google Cloud architecture, this data model is implemented in BigQuery such that most writes are append-only and it is not necessary to do lots of join operations to retrieve the required information. BigQuery is complemented with a cache in Memorystore to retrieve up-to-date states at all times. This architecture enabled high-performing and cost-efficient reads and writes.

The team also started developing with the Beam framework and Dataflow. Working with Beam was not only as easy as working with Spark, but also unlocked a number of benefits:

A unified model for batch and streaming processing: less coding variety and more flexibility in choosing the types of processing jobs depending on needs and cost target

A thriving ecosystem around Beam: many available connectors (PubsubIO, BigQueryIO, JMSIO, etc)

Simplified operations: efficient autoscaling, smoother updates, finer cost tracking, easy monitoring, end-to-end testing, and error management.

Dataflow has since become the go-to tool for most data processing needs in the platform. Pub/Sub was also a natural choice for streaming data processing, along with other third-party messaging systems. The teams are also maintaining their advanced use of Google Cloud’s managed services such as Google Kubernetes Engine (GKE), Cloud Composer, and Memorystore.

The new platform is built to scale, and as of Q1 2021, the IDM 4.0 team has been connecting more than 4,900 industrial appliances through Renault’s internal data acquisition solution, transmitting more than 1 billion messages per day to the platform in Google Cloud!

Focusing on cost efficiency

While the migration to the new architecture was ongoing, cost optimization became a key decision driver, especially with the COVID-19 pandemic and its impacts on the world’s economics.

Hence, Renault’s management team and Google Cloud’s PSO agreed to allocate a special task force to optimize costs on parts of the platform that were not yet migrated. 

This special task force discovered that one Apache Spark cluster continuously performed Cloud Storage Class A and B operations every few tens of milliseconds. This was inconsistent with the majority of incoming flows, which were hourly or daily batches. One correction we applied was simply increasing the interval for polling the filesystem, i.e. the spark.sql.streaming.pollingDelay parameter in Spark. Later, we confirmed a 100x drop in Cloud Storage Class A calls, and a meaningful drop in Class B calls, which resulted in a 50% decrease in production costs.
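For illustration only, that kind of change amounts to a one-line Spark configuration tweak; the value below is hypothetical, not the one Renault used:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("idm-stream").getOrCreate()

# Lengthen the interval (in milliseconds) at which Spark Structured Streaming
# polls for new data, reducing the rate of Cloud Storage operations.
spark.conf.set("spark.sql.streaming.pollingDelay", "60000")
```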

Drop in cost of storage  following cost optimization actions

Renault now uses  Dataflow to ingest and transform data from manufacturing plants as well as from other connected key referential databases. BigQuery is used to store and analyze massive amounts of data, such as packages and vehicle tracking, with many more data sources in the works. 

 The below diagram shows last year’s impressive achievements!

Evolution of data ingestion and platform costs (not to scale)

The new IDM 4.0 platform started to run production workloads just a few months ago, and Renault expects to have 10 times more industrial data coming in the next two years, helping to keep costs, performance, and reliability optimum. 

Unfold new opportunities and democratize data access 

The IDM 4.0 team has succeeded in gathering industrial data from plants, and merging and harmonizing this data in a scalable, reliable, and cost-effective platform. The team also managed to expose data in a controlled and secure way to data scientists, business teams or any application. 

In this blog, we’ve covered use cases ranging from tracking to conditional maintenance, but this platform can open many other possibilities for innovation and business impact, such as self-service analytics that will democratize the use of data and support data-driven business decisions.

This Renault data transformation with Google Cloud demonstrates the potential of Industry 4.0 to improve Renault’s manufacturing, process engineering, and supply chain processes—and also provides an example for other companies looking to leverage data and cloud platforms for improved results. To learn more about BigQuery, click here

Acknowledgments 

We’d like to thank all collaborators who participated actively in writing these lines; both from Google and Renault working tremendously as one team for 1+ year, fully remote:

Writers from Google:

Jean-Francois Macresy, Consultant, IDM4.0 Architecture best practices
Jeremie Gomez, Strategic Engineer, IDM4.0 Data specialist
Rifel Mahiou, Program Manager, IDM4.0 Partnership strategy & governance
Razvan Culea, Data Engineer, IDM4.0 Data optimization

Reviewers from Renault:

Sebastien Aubron, IDM4.0 Program Manager
Jean-Marc Chatelanaz, IDM4.0 Director
Matthieu Dureau, IDM4.0 platform Product Owner
David Duarte, IDM4.0 platform Data Engineer

Related Article

Registration is open for Google Cloud Next: October 12–14

Register now for Google Cloud Next on October 12–14, 2021

Read Article

Source: Data Analytics

BigQuery Admin reference guide: Monitoring

Last week, we shared information on BigQuery APIs and how to use them, along with another blog on workload management best practices. This blog focuses on effectively monitoring BigQuery usage and related metrics to operationalize the workload management practices we’ve discussed so far. We’ll cover:

Monitoring Options for BigQuery Resource

BigQuery Monitoring Best Practices

Visualization Options For Decision Making

Tips on Key Monitoring Metrics 

Monitoring options for BigQuery

Analyzing and monitoring BigQuery usage is critical for businesses for overall cost optimization and performance reporting. BigQuery provides its native admin panel with overview metrics for monitoring. BigQuery is also well integrated with existing GCP services like Cloud Logging to provide detailed logs of individual events and Cloud Monitoring dashboards for analytics, reporting and alerting on BigQuery usage and events. 

BigQuery Admin Panel

BigQuery natively provides an admin panel with overview metrics. This feature is currently in preview and only available to flat-rate customers within the Admin Project. This option is useful for organization administrators to analyze and monitor slot usage and overall performance at the organization, folder, and project levels. The admin panel provides real-time data for historical analysis and is recommended for capacity planning at the organization level. However, it only provides metrics for query jobs, and history is only available for up to 14 days.

Cloud Monitoring

Users can create custom monitoring dashboards for their projects using Cloud Monitoring. This provides high-level monitoring metrics, and options for alerting on key metrics and automated report exports. There is a subset of metrics that are particularly relevant to BigQuery including slots allocated, total slots available, slots available by job, etc. Cloud Monitoring also has a limit of 375 projects that can be monitored per workspace (as of August 2021). This limit can be increased upon request. Finally, there is limited information about reservations in this view and no side by side information about the current reservations and assignments.

Audit logs 

Google Cloud Audit Logs provide information regarding admin activities, system changes, data access, and data updates to comply with security and compliance needs. The BigQuery data activity logs provide the following key fields:

query – The BigQuery SQL executed

startTime – Time when the job started

endTime – Time when the job ended

totalProcessedBytes – Total bytes processed for a job

totalBilledBytes – Processed bytes, adjusted by the job’s CPU usage

totalSlotMs – The total slot time consumed by the query job

referencedFields – The columns of the underlying table that were accessed

Users can set up an aggregated logs sink at organization, folder or project level to get all the BigQuery related logs:
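For example, the following filter (used when creating the sink with gcloud logging sinks create or in the Cloud Console) captures BigQuery audit log entries:

protoPayload.serviceName=bigquery.googleapis.com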

Other Filters:

Logs from Data Transfer Service

protoPayload.serviceName=bigquerydatatransfer.googleapis.com

Logs from BigQuery Reservations API

protoPayload.serviceName=bigqueryreservation.googleapis.com

INFORMATION_SCHEMA VIEWS

BigQuery provides a set of INFORMATION_SCHEMA views, secured for different roles, to quickly get access to BigQuery job stats and related metadata. These views (also known as system tables) are partitioned and clustered for faster extraction of metadata and are updated in real time. With the right set of permissions and access levels, a user can monitor and review job information at the user, project, folder, and organization level. These views allow users to:

Create customized dashboards by connecting to any BI tool 

Quickly aggregate data across many dimensions such as user, project, reservation, etc.

Drill down into jobs to analyze total cost and time spent per stage

See holistic view of the entire organization

For example, the following query provides information about the top 2 jobs in the project with details on job id, user and bytes processed by each job.
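A sketch of such a query using the JOBS_BY_PROJECT view, wrapped in the Python client; the region qualifier and one-day lookback window are assumptions:

```python
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT job_id, user_email, total_bytes_processed
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
    ORDER BY total_bytes_processed DESC
    LIMIT 2
"""
for row in client.query(sql):
    print(row.job_id, row.user_email, row.total_bytes_processed)
```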

Data Studio

Leverage these easy to set up public Data Studio dashboards for monitoring slot and reservation usages,  query troubleshooting, load slot estimations, error reporting, etc. Check out this blog for more details on performance troubleshooting using Data Studio.

Looker 

The Looker Marketplace provides a BigQuery Performance Monitoring Block for monitoring BigQuery usage. Check out this blog for more details on performance monitoring using Looker.

Monitoring best practices

Key metrics to monitor

Typical questions administrators or workload owners want answered include:

What is my slots utilization for a given project?

How much data scanning and processing takes place during a given day or hour?

How many users are running jobs concurrently?

How are performance and throughput changing over time?

How can I appropriately perform cost analysis for showback and chargeback?

One of the most demanding analyses is understanding how many slots are right for a given workload, i.e. whether you need more or fewer slots as workload patterns change.

Below is a list of key metrics and trends to observe for better decision making on BigQuery resources:

Monitor slot usage and performance trends (week over week, month over month). Correlate trends with any workload pattern changes, for example:

Are more users being onboarded within the same slot allocation?

Are new workloads being enabled with the same slot allocation?

You may want to allocate more slots if you see:

Concurrency – consistently increasing

Throughput – consistently decreasing

Slot Utilization – consistently increasing or keeping beyond 90%

If slot utilization has spikes, are they on a regular frequency?

In this case, you may want to leverage flex slots for predictable spikes

Can some non-critical workloads be time shifted?

For a given set of jobs with the same priority, e.g.  for a specific group of queries or users:

Avg. Wait Time – consistently increasing

Avg. query run-time – consistently increasing

Concurrency and throughput

Concurrency is the number of queries that can run in parallel with the desired level of performance, for a set of fixed resources. In contrast, throughput is the number of completed queries for a given time duration and a fixed set of resources.

In the blog BigQuery workload management best practices, we discussed in detail how BigQuery leverages dynamic slot allocation at each step of query processing. The chart above reflects the slot replenishment process with respect to concurrency and throughput. More complex queries may require more slots, and hence leave fewer slots available for other queries. If there is a requirement for a certain level of concurrency and a minimum run-time, increased slot capacity may be required. In contrast, simple and smaller queries give you faster replenishment of slots, and hence higher throughput to start with for a given workload. Learn more about BigQuery’s fair scheduling and query processing in detail.

Slot utilization rate

Slot utilization rate is the ratio of slots used to the total available slot capacity over a given period of time. This metric surfaces opportunities for workload optimization, so you may want to dig into the utilization rate of available slots over a period. If, on average, a low percentage of available slots is being used during a certain hour, you can add more scheduled jobs within that hour to better utilize your available capacity. On the other hand, a high utilization rate means you should either move some scheduled workloads to different hours or purchase more slots.

For example: 

Given a 500-slot reservation (capacity), the following query can be used to find total_slot_ms over a period of time:
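
A minimal sketch of that query, assuming the US multi-region and the JOBS_BY_PROJECT view (use JOBS_BY_ORGANIZATION or a reservation filter if you want the reservation-wide total); the time window is a placeholder:

SELECT
  SUM(total_slot_ms) AS total_slot_ms
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  -- Example one-day window; narrow it to an hour or a second to reproduce
  -- the per-hour and per-second figures discussed below.
  creation_time >= TIMESTAMP '2021-08-01 00:00:00 UTC'
  AND creation_time < TIMESTAMP '2021-08-02 00:00:00 UTC'
  AND job_type = 'QUERY';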

Let’s say we have the following results from the query above:

sum(total_slot_ms) for a given second is 453,708 ms
sum(total_slot_ms) for a given hour is 1,350,964,880 ms
sum(total_slot_ms) for a given day is 27,110,589,760 ms

Therefore, the slot utilization rate can be calculated using the following formula:

Slot Utilization = sum(total_slot_ms) / slot capacity available in ms

By second: 453,708 / (500 * 1000) = 0.9074 => 90.74%
By hour: 1,350,964,880 / (500 * 1000 * 60 * 60) = 0.7505 => 75.05%
By day: 27,110,589,760 / (500 * 1000 * 60 * 60 * 24) = 0.6276 => 62.76%

Another common way to understand slot usage patterns is to look at the average slot time consumed over a period for a specific job or for workloads tied to a specific reservation.

Average slot usage over a period: Highly relevant for workloads with consistent usage

Metric:

SUM(total_slot_ms) / {time duration in milliseconds} => custom duration

Daily Average Usage: SUM(total_slot_ms) / (1000 * 60 * 60 * 24) => for a given day

Example Query:
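
One way to express this, shown as a sketch that computes the daily average for each of the last seven days (the seven-day window is an illustrative choice):

SELECT
  DATE(creation_time) AS usage_date,
  -- Total slot-milliseconds divided by milliseconds in a day = average slots.
  SUM(total_slot_ms) / (1000 * 60 * 60 * 24) AS avg_daily_slot_usage
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
GROUP BY
  usage_date
ORDER BY
  usage_date;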

Average slot usage for an individual job: Job level statistics

Average slot utilization over a specific time period is useful for monitoring trends and understanding how slot usage patterns are changing, or whether there is a notable change in a workload. You can find more details about trends in the ‘Take action’ section below. Average slot usage for an individual job is useful for understanding query run-time estimates, identifying outlier queries, and estimating slot capacity during capacity planning.
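
For the job-level statistic, one hedged approach is to divide a job’s total_slot_ms by its elapsed run time, which approximates the average number of slots the job held while running:

SELECT
  job_id,
  user_email,
  -- total_slot_ms / elapsed ms approximates the average slots held by the job.
  SAFE_DIVIDE(total_slot_ms,
              TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)) AS approx_avg_slots
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE'
ORDER BY
  approx_avg_slots DESC
LIMIT 10;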

Chargeback

As more users and projects are onboarded to BigQuery, it is important for administrators not only to monitor and alert on resource utilization, but also to help users and groups manage cost and performance efficiently. Many organizations require individual project owners to be responsible for resource management and optimization, so it is important to provide project-level reporting that summarizes costs and resources for decision makers.

Below is an example of a reference architecture that enables comprehensive reporting by leveraging audit logs, INFORMATION_SCHEMA, and billing data. The architecture highlights persona-based reporting for admins and for individual users or groups, using authorized-view-based access to datasets within a monitoring project.

Export audit log data to BigQuery for the specific resources you need (in this example, BigQuery itself). You can also export aggregated data at the organization level.

INFORMATION_SCHEMA provides BigQuery metadata and job execution details for the last six months. You may want to persist relevant information into a BigQuery dataset for longer-term reporting (see the scheduled-query sketch after this list).

Export billing data to BigQuery for cost analysis and spend optimization.

Within BigQuery, leverage security settings such as authorized views to separate data access by project or by persona (admins vs. users).

Analysis and reporting dashboards built with visualization tools such as Looker present the data from the BigQuery dataset(s) created for monitoring. In the chart above, examples of dashboards include:

KPIs for admins, such as usage and spend trends

Data governance and access reports

Showback/Chargeback by projects

Job level statistics 

User dashboards with relevant metrics such as query stats, data access stats and job performance
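
As referenced in the INFORMATION_SCHEMA step above, one simple way to persist job metadata beyond the six-month retention window is a scheduled query that appends recent rows into the monitoring project. The project, dataset, and table names below are hypothetical placeholders, and the destination table is assumed to already exist with a matching schema:

-- Run daily as a BigQuery scheduled query.
INSERT INTO `my-monitoring-project.bq_monitoring.jobs_history`
SELECT
  creation_time,
  project_id,
  user_email,
  job_id,
  job_type,
  total_bytes_processed,
  total_slot_ms
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY);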

Billing monitoring

To operationalize showback or chargeback reporting, cost metrics are important to monitor and include in your reporting application. BigQuery billing is associated with the project as the accounting entity. Google Cloud billing reports help you understand trends and project your resource costs, and they help answer questions such as:

What is my BigQuery project cost this month?

What is the cost trend for a resource with a specific label?

What is my forecasted future cost based on historical trends for a BigQuery project?

You can refer to these examples to get started with billing reports and understand which metrics to monitor. Additionally, you can export billing and audit metrics to a BigQuery dataset for comprehensive analysis alongside resource monitoring.

As a best practice, monitoring trends is important for optimizing spend on cloud resources. This article provides a visualization option with Looker for monitoring trends. You can take advantage of the readily available Looker block for spend analytics and the block for audit data visualization, and deploy them for your projects today!

When to use

The following tables provide guidance on using the right tool for monitoring based on the feature requirements and use cases.

The following features can be considered when choosing the mechanism to use for BigQuery monitoring:

Integration with BigQuery INFORMATION_SCHEMA – Leverage the data from INFORMATION_SCHEMA for monitoring

Integration with other data sources – Join this data with other sources such as business metadata or budgets stored in Google Sheets

Monitoring at Org Level – Monitor all of the organization’s projects together

Data/Filter based Alerts – Alert on specific filters or data selections in the dashboard; for example, send alerts for a chart filtered by a specific project or reservation

User based Alerts – Alert for a specific user

On-demand Report Exports – Export the report as PDF, CSV, etc.

1 BigQuery Admin Panel uses INFORMATION_SCHEMA under the hood.

2 Cloud Monitoring provides only limited integration, as it surfaces only high-level metrics.

3 You can monitor up to 375 projects at a time in a single Cloud Monitoring workspace.

BigQuery monitoring is important across different use cases and personas in the organization. 

Personas

Administrators – Primarily concerned with secure operations and health of the GCP fleet of resources. For example, SREs

Platform Operators – Often run the platform that serves internal customers. For example, Data Platform Leads

Data Owners / Users – Develop and operate applications, and manage a system that generates source data. This persona is mostly concerned with their specific workloads. For example, Developers

The following table provides guidance on the right tool to use for your specific requirements:

Take action

To get started quickly with monitoring on BigQuery, you can leverage the publicly available Data Studio dashboard and related GitHub resources. Looker also provides a BigQuery Performance Monitoring Block for monitoring BigQuery usage. To quickly deploy billing monitoring on GCP, see the reference blog and related GitHub resources.

The key to successful monitoring is enabling proactive alerts, for example, setting up an alert when the reservation slot utilization rate crosses a predetermined threshold. It’s also important to enable individual users and teams in the organization to monitor their own workloads using a self-service analytics framework or dashboard. This allows users to monitor trends to forecast resource needs and to troubleshoot overall performance.

Below are additional examples of monitoring dashboards and metrics:

Organization Admin Reporting (proactive monitoring)

Alert based on thresholds like 90% slot utilization rate 

Regular reviews of consuming projects

Monitor for seasonal peaks

Review job metadata from INFORMATION_SCHEMA for large queries using the total_bytes_processed and total_slot_ms metrics (see the example query after this list)

Develop data slice and dice strategies in the dashboard for appropriate chargeback

Leverage audit logs for data governance and access reporting
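
As an example of the large-query review above, the following sketch surfaces the most expensive jobs from the last six hours by bytes processed and slot time (the six-hour window and the limit are illustrative choices):

SELECT
  job_id,
  user_email,
  total_bytes_processed,
  total_slot_ms
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 6 HOUR)
  AND job_type = 'QUERY'
ORDER BY
  total_bytes_processed DESC
LIMIT 20;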

Specific Data Owner Reporting (self-service capabilities)

Monitor for large queries executed in the last X hours

Troubleshoot job performance using concurrency, slots used and time spent per job stage, etc.

Develop error reports and alert on critical job failures

Understand and leverage INFORMATION_SCHEMA for real-time reports and alerts. Review more examples of job stats and a technical deep dive into INFORMATION_SCHEMA in this blog.

Related Article

BigQuery Admin reference guide: API landscape

Explore the different BigQuery APIs that can help you programmatically manage and leverage your data.


Education is leveraging cloud and AI to ensure intelligent safety and incidence management for student success

Education is leveraging cloud and AI to ensure intelligent safety and incidence management for student success

Campus safety is increasingly important to students and parents. A student poll published by the College Board and Art & Sciences Group found that 72% of students said campus safety was very important to them in college consideration and choice, and 86% reported that it was very important to their parents.

The latest annual report of the National Center for Education Statistics (NCES) on indicators of school crime and safety found that there were 836,100 total victimizations on campuses in the United States, including theft, sexual assault, vandalism, and even violent deaths. Unfortunately, such acts of violence and threats to safety are a global challenge for the university staff responsible for student safety and wellbeing.

Beyond its importance for student success, campus safety is also a compliance requirement: the Clery Act requires all colleges and universities participating in federal financial aid programs to keep and disclose information about crime on and near their campuses. Compliance is monitored by the Department of Education, and violations of Clery Act requirements have resulted in multimillion-dollar civil penalties for several academic institutions.

Google Cloud is helping educational institutions put the right solution in place to ensure campus safety and facilitate compliance with Clery Act requirements

When we announced Student Success Services last November, our goal was to reinvent how educational institutions attract, support and engage with students and faculty. As part of the offering, we’ve partnered with NowIMS to provide a fully managed intelligent safety and incident management platform in compliance with FERPA, GDPR and COPPA.

NowIMS is provisioned to the customer’s own Google Cloud project through the Google Cloud Marketplace to facilitate automated deployment and provide integrated billing with the rest of the organization’s Google Cloud services. Some of the key functionality includes:

Automated capture and consolidation of data from different sources, including social media, websites, external alarm systems, video cameras and other sources that facilitate user-reported incidents 

Analysis and visualization of data at scale, aggregating all data and providing a single pane of glass view that ensures teams have access to the data and audit logs they need to get work done, complying with security and data governance policies.

Real-time tailored reporting and messaging: Teams will be notified as intelligence is collected and processed, and have access to self-service real-time reports. Each user group will receive tailored communication through their desired channels.

Automated generation of reports: Clery Act requirements bring significant operational overhead, and NowIMS helps automate the generation of applicable reports, giving campus safety team members time back to do what matters most: providing personalized support to students.

A customized approach that aligns with the needs of educational institutions

Although NowIMS is a fully managed solution, it integrates with the rest of Student Success Services to offer customization for any educational institution. Our solution is currently being used by higher ed and K-12 customers across the US. For example, Fort Bend Independent School District, which serves 78,500 students across the district’s 80 schools, is using NowIMS on Google Cloud to solve student safety and emergency communication challenges. And recently, Alabama A&M University deployed NowIMS to facilitate intelligence tracking for public safety, including issues such as drugs on campus, fights, suspicious packages, and more. Captain Ruble of Alabama A&M stated, “NowIMS is the first tool we have found that will provide a single pane of glass for our operators. Before NowIMS, we had to watch more than 6 other systems.” In addition to this public safety use case, the Information Technology department will also use NowIMS to track badge access entry events, as well as network anomaly and outage notifications.

To find out more about how NowIMS detects issues and rapidly responds to risks on and off campus, check out our session from Student Success Week.


What are the newest datasets in Google Cloud?

What are the newest datasets in Google Cloud?

Editor’s note:  With Google Cloud’s datasets solution, you can access an ever-expanding resource of the newest datasets to support and empower your analyses and ML models, as well as frequently updated best practices on how to get the most out of any of our datasets. We will be regularly updating this blog with new datasets and announcements, so be sure to bookmark this link and check back often.

August 2021

New dataset: Google Cloud Release Notes

Looking for a new way to access Google Cloud Release Notes besides the docs, the XML feed, and Cloud Console? Check out the Google Cloud Release Notes dataset. With up-to-date release notes for all generally available Google Cloud products, this dataset allows you to use BigQuery to programmatically analyze release notes across all products, exploring security bulletin updates, fixes, changes, and the latest feature releases. 

Access the dataset

Access the BigQuery release notes dataset from https://cloud.google.com/release-notes/all
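
As a quick illustration, a query along these lines counts recent release notes by type. It assumes the public table is exposed as bigquery-public-data.google_cloud_release_notes.release_notes with release_note_type and published_at columns, so check the dataset listing for the exact schema:

SELECT
  release_note_type,
  COUNT(*) AS note_count
FROM
  `bigquery-public-data.google_cloud_release_notes.release_notes`
WHERE
  -- Last 30 days of notes; published_at is assumed to be a DATE column.
  published_at >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY
  release_note_type
ORDER BY
  note_count DESC;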

July 2021

Best practice: Use Google Trends data for common business needs

The Google Trends dataset represents the first time we’re adding Google-owned Search data into Datasets for Google Cloud. The Trends data allows users to measure interest in a particular topic or search term across Google Search, from around the United States down to the city level. You can learn more about the dataset here, and check out the Looker dashboard here! These tables are valuable in their own right, but when you blend them with other actionable data, you can unlock whole new areas of opportunity for your team. To learn how to make informed decisions with Google Trends data, keep reading.

Access the dataset

New dataset: COVID-19 Vaccination Search Insights

With COVID-19 vaccinations being a topic of interest around the United States, this dataset shows aggregated, anonymized trends in searches related to COVID-19 vaccination and is intended to help public health officials design, target, and evaluate public education campaigns. Check out this interactive dashboard to explore searches for COVID-19 vaccination topics by region.

Access the dataset

Source: https://google-research.github.io/vaccination-search-insights/

June 2021

New dataset: Google Diversity Annual Report 2021

Since 2014, Google has disclosed data on the diversity of its workforce in an effort to bring candid transparency to the challenges technology companies like Google face in recruitment and retention of underrepresented communities. In an effort to make this data more accessible and useful, we’ve loaded it into BigQuery for the first time ever. To view Google’s Diversity Annual Report and learn more, check it out.

Access the dataset

New dataset: Google Trends Top 25 Search terms

The most popular and surging Google Search terms are now available in BigQuery as a public dataset. View the Top 25 and Top 25 rising queries from Google Trends for the past 30 days, including 5 years of historical data across the 210 Designated Market Areas (DMAs) in the US. Keep reading.

Access the dataset

Top 25 Google Search terms, ranked by search volume (1 through 25), with an average search index score across the geographic areas (DMAs) in which they were searched.
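
A hedged sketch of pulling the latest Top 25 terms for a single market follows; it assumes the public table bigquery-public-data.google_trends.top_terms with term, rank, dma_name, and refresh_date columns, and the DMA name shown is only an example:

SELECT DISTINCT
  term,
  rank
FROM
  `bigquery-public-data.google_trends.top_terms`
WHERE
  refresh_date = (
    -- Most recent refresh of the dataset.
    SELECT MAX(refresh_date)
    FROM `bigquery-public-data.google_trends.top_terms`
  )
  AND dma_name = 'New York NY'  -- example DMA; substitute your market
ORDER BY
  rank;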

New dataset: COVID-19 Vaccination Access

With metrics quantifying travel times to COVID-19 vaccination sites, this dataset is intended to help Public Health officials, researchers, and Healthcare Providers to identify areas with insufficient access, deploy interventions, and research these issues. Check out how this data is being used in a number of new tools.

Access the dataset

(Image courtesy of Vaccine Equity Planner, https://vaccineplanner.org/)

Best practice: Leveraging BigQuery Public Boundaries datasets for geospatial analytics 

Geospatial data is a critical component of a comprehensive analytics strategy. Whether you are trying to visualize data using geospatial parameters or to do deeper analysis or modeling of customer distribution or proximity, most organizations have some type of geospatial data they would like to use, such as customer zip codes, store locations, or shipping addresses. However, converting geographic data into the correct format for analysis and aggregation at different levels can be difficult. In this post, we’ll walk through some examples of how you can leverage Google Cloud alongside Google Cloud Public Datasets to perform robust analytics on geographic data. Keep reading.

Access the dataset
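
As a small illustration of the kind of geospatial query you can run, the sketch below finds ZIP Code areas within 10 km of a point using one of the public boundaries tables; it assumes the bigquery-public-data.geo_us_boundaries.zip_codes table with a zip_code_geom GEOGRAPHY column:

SELECT
  zip_code,
  city,
  state_code
FROM
  `bigquery-public-data.geo_us_boundaries.zip_codes`
WHERE
  -- ST_GEOGPOINT takes (longitude, latitude); the distance is in meters.
  ST_DWITHIN(zip_code_geom, ST_GEOGPOINT(-122.4194, 37.7749), 10000);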

Get the metadata and try BigQuery sandbox 

When you’ve learned about many of our datasets and pre-built solutions from across Google, you may be ready to start querying them. Check out the full dataset directory and read all the metadata at g.co/cloud/marketplace-datasets, then dig into the data with our free-to-use BigQuery sandbox account, or $300 in credits with our Google Cloud free trial.

Related Article

Discover datasets to enrich your analytics and AI initiatives

Discover and access datasets and pre-built dataset solutions from Google, or public and commercial data providers
