Preprocessing and transforming raw data into features is a critical but time-consuming step in the ML process. This is especially true when a data scientist or data engineer has to move data across different platforms for MLOps. In this blog post, we describe how we streamline this process by adding two feature engineering capabilities in BigQuery ML.
Our previous blog outlines the data to AI journey with BigQuery ML, highlighting two powerful features that simplify MLOps – data preprocessing functions for feature engineering and the ability to export BigQuery ML TRANSFORM statement as part of the model artifact. In this blog post, we share how to use these features for creating a seamless experience from BigQuery ML to Vertex AI.
Data Preprocessing Functions
Preprocessing and transforming raw data into features is a critical but time-consuming step when operationalizing ML. We recently announced the public preview of advanced feature engineering functions in BigQuery ML. These functions help you impute, normalize, or encode data. Because this work happens inside the database, BigQuery, the entire preprocessing step becomes easier, faster, and more secure.
Here is a list of the new functions we are introducing in this release. The full list of preprocessing functions can be found here.
ML.MAX_ABS_SCALER Scale a numerical column to the range [-1, 1] without centering by dividing by the maximum absolute value.
ML.ROBUST_SCALER Scale a numerical column by centering with the median (optional) and dividing by the quantile range of choice ([25, 75] by default).
ML.NORMALIZER Turn an input numerical array into a unit norm array for any p-norm: 0, 1, >1, +inf. The default is 2 resulting in a normalized array where the sum of squares is 1.
ML.IMPUTER Replace missing values in a numerical or categorical input with the mean, median or mode (most frequent).
ML.ONE_HOT_ENCODER One-hot encode a categorical input. Also, it optionally does dummy encoding by dropping the most frequent value. It is also possible to limit the size of the encoding by specifying k for the k most frequent categories and/or a lower threshold for the frequency of categories.
ML.LABEL_ENCODER Encode a categorical input to integer values [0, n categories] where 0 represents NULL and excluded categories. You can exclude categories by specifying k for k most frequent categories and/or a lower threshold for the frequency of categories.
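To build intuition for what two of these functions compute, here is a small plain-Python sketch of the ML.IMPUTER and ML.LABEL_ENCODER semantics described above. This is illustrative only (the alphabetical ordering of categories is an assumption); inside BigQuery you would simply call the SQL functions.

```python
from collections import Counter
from statistics import mean, median

def impute(values, strategy="mean"):
    """Replace missing (None) values with the mean, median, or mode
    of the non-missing values, mirroring ML.IMPUTER."""
    present = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(present)
    elif strategy == "median":
        fill = median(present)
    else:  # "most_frequent" (mode)
        fill = Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in values]

def label_encode(values):
    """Map categories to integers [1, n]; NULL (None) maps to 0,
    mirroring ML.LABEL_ENCODER. Category order assumed alphabetical."""
    categories = sorted({v for v in values if v is not None})
    mapping = {c: i + 1 for i, c in enumerate(categories)}
    return [mapping.get(v, 0) for v in values]
```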
Model Export with TRANSFORM Statement
You can now export BigQuery ML models that include a feature TRANSFORM statement. The ability to include TRANSFORM statements makes models more portable when exporting them for online prediction. This capability also works when BigQuery ML models are registered with Vertex AI Model Registry and deployed to Vertex AI Prediction endpoints. More details about exporting models can be found in BigQuery ML Exporting models.
In this tutorial, we will use the bread recipe competition dataset to predict judges' ratings using linear regression and boosted tree models.
Objective: To demonstrate how to preprocess data using the new functions, register the model with Vertex AI Model Registry, and deploy the model for online prediction with Vertex AI Prediction endpoints.
Dataset: Each row represents a bread recipe with columns for each ingredient (flour, salt, water, yeast) and procedure (mixing time, mixing speed, cooking temperature, resting time). There are also columns that include judges ratings of the final product from each recipe.
Overview of the tutorial: Steps 1 and 2 show how to use the TRANSFORM statement. Steps 3 and 4 demonstrate how to manually export and register the models. Steps 5 through 7 show how to deploy a model to Vertex AI Prediction endpoint.
For the best learning experience, follow this blog post alongside the tutorial notebook.
Step 1: Transform BigQuery columns into ML features with SQL
Before training an ML model, exploring the data within columns is essential to identifying the data type, distribution, scale, missing patterns, and extreme values. BigQuery ML enables this exploratory analysis with SQL. With the new preprocessing functions it is now even easier to transform BigQuery columns into ML features with SQL while iterating to find the optimal transformation. For example, when using the ML.MAX_ABS_SCALER function for an input column, each value is divided by the maximum absolute value (10 in the example):
```sql
SELECT
  input_column,
  ML.MAX_ABS_SCALER(input_column) OVER() AS scale_column
FROM
  UNNEST([0, -1, 2, -3, 4, -5, 6, -7, 8, -9, 10]) AS input_column
ORDER BY input_column
```
Once the input columns for an ML model are identified and the feature transformations are chosen, it is tempting to apply the transformations and save the output as a view. But this affects predictions later on, because the same transformations would need to be applied again before requesting predictions. Step 2 shows how to prevent this separation of preprocessing and model training.
Step 2: Iterate through multiple models with inline TRANSFORM functions
Building on the preprocessing explorations in Step 1, the chosen transformations are applied inline with model training using the TRANSFORM statement. This interlocks the model iteration with the preprocessing explorations while making any candidate ready for serving with BigQuery or beyond. This means you can immediately try multiple model types without any delayed impact of feature transformations on predictions. In this step, two models, linear regression and boosted tree, are trained side-by-side with identical TRANSFORM statements:
Training with linear regression – Model a
```sql
CREATE OR REPLACE MODEL `statmike-mlops-349915.feature_engineering.03_feature_engineering_2a`
TRANSFORM (
  JUDGE_A,

  ML.MIN_MAX_SCALER(flourAmt) OVER() AS scale_flourAmt,
  ML.ROBUST_SCALER(saltAmt) OVER() AS scale_saltAmt,
  ML.MAX_ABS_SCALER(yeastAmt) OVER() AS scale_yeastAmt,
  ML.STANDARD_SCALER(water1Amt) OVER() AS scale_water1Amt,
  ML.STANDARD_SCALER(water2Amt) OVER() AS scale_water2Amt,

  ML.STANDARD_SCALER(waterTemp) OVER() AS scale_waterTemp,
  ML.ROBUST_SCALER(bakeTemp) OVER() AS scale_bakeTemp,
  ML.MIN_MAX_SCALER(ambTemp) OVER() AS scale_ambTemp,
  ML.MAX_ABS_SCALER(ambHumidity) OVER() AS scale_ambHumidity,

  ML.ROBUST_SCALER(mix1Time) OVER() AS scale_mix1Time,
  ML.ROBUST_SCALER(mix2Time) OVER() AS scale_mix2Time,
  ML.ROBUST_SCALER(mix1Speed) OVER() AS scale_mix1Speed,
  ML.ROBUST_SCALER(mix2Speed) OVER() AS scale_mix2Speed,
  ML.STANDARD_SCALER(proveTime) OVER() AS scale_proveTime,
  ML.MAX_ABS_SCALER(restTime) OVER() AS scale_restTime,
  ML.MAX_ABS_SCALER(bakeTime) OVER() AS scale_bakeTime
)
OPTIONS (
  model_type = 'LINEAR_REG',
  input_label_cols = ['JUDGE_A'],
  enable_global_explain = TRUE,
  data_split_method = 'AUTO_SPLIT',
  MODEL_REGISTRY = 'VERTEX_AI',
  VERTEX_AI_MODEL_ID = 'bqml_03_feature_engineering_2a',
  VERTEX_AI_MODEL_VERSION_ALIASES = ['run-20230112234821']
) AS
SELECT * EXCEPT(Recipe, JUDGE_B)
FROM `statmike-mlops-349915.feature_engineering.bread`
```
Training with boosted tree – Model b
```sql
CREATE OR REPLACE MODEL `statmike-mlops-349915.feature_engineering.03_feature_engineering_2b`
TRANSFORM (
  JUDGE_A,

  ML.MIN_MAX_SCALER(flourAmt) OVER() AS scale_flourAmt,
  ML.ROBUST_SCALER(saltAmt) OVER() AS scale_saltAmt,
  ML.MAX_ABS_SCALER(yeastAmt) OVER() AS scale_yeastAmt,
  ML.STANDARD_SCALER(water1Amt) OVER() AS scale_water1Amt,
  ML.STANDARD_SCALER(water2Amt) OVER() AS scale_water2Amt,

  ML.STANDARD_SCALER(waterTemp) OVER() AS scale_waterTemp,
  ML.ROBUST_SCALER(bakeTemp) OVER() AS scale_bakeTemp,
  ML.MIN_MAX_SCALER(ambTemp) OVER() AS scale_ambTemp,
  ML.MAX_ABS_SCALER(ambHumidity) OVER() AS scale_ambHumidity,

  ML.ROBUST_SCALER(mix1Time) OVER() AS scale_mix1Time,
  ML.ROBUST_SCALER(mix2Time) OVER() AS scale_mix2Time,
  ML.ROBUST_SCALER(mix1Speed) OVER() AS scale_mix1Speed,
  ML.ROBUST_SCALER(mix2Speed) OVER() AS scale_mix2Speed,
  ML.STANDARD_SCALER(proveTime) OVER() AS scale_proveTime,
  ML.MAX_ABS_SCALER(restTime) OVER() AS scale_restTime,
  ML.MAX_ABS_SCALER(bakeTime) OVER() AS scale_bakeTime
)
OPTIONS (
  model_type = 'BOOSTED_TREE_REGRESSOR',
  booster_type = 'GBTREE',
  num_parallel_tree = 1,
  max_iterations = 30,
  early_stop = TRUE,
  min_rel_progress = 0.01,
  tree_method = 'HIST',
  subsample = 0.85,
  input_label_cols = ['JUDGE_A'],
  enable_global_explain = TRUE,
  data_split_method = 'AUTO_SPLIT',
  l1_reg = 10,
  l2_reg = 10,
  MODEL_REGISTRY = 'VERTEX_AI',
  VERTEX_AI_MODEL_ID = 'bqml_03_feature_engineering_2b',
  VERTEX_AI_MODEL_VERSION_ALIASES = ['run-20230112234926']
) AS
SELECT * EXCEPT(Recipe, JUDGE_B)
FROM `statmike-mlops-349915.feature_engineering.bread`
```
Because both models use identical input columns with identical preprocessing, you can directly compare their accuracy. The BigQuery ML function ML.EVALUATE makes this comparison as simple as a single SQL query that stacks the two sets of metrics with the UNION ALL set operator:
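As a sketch, the comparison query can be assembled and run with the google-cloud-bigquery Python client. The helper below is illustrative, not from the tutorial; only the model names come from the examples above.

```python
def evaluation_query(models):
    """Build a query that stacks ML.EVALUATE results for several models
    with UNION ALL, labeling each row with the model's short name."""
    selects = [
        f"SELECT '{m.split('.')[-1]}' AS model, * FROM ML.EVALUATE(MODEL `{m}`)"
        for m in models
    ]
    return "\nUNION ALL\n".join(selects)

query = evaluation_query([
    "statmike-mlops-349915.feature_engineering.03_feature_engineering_2a",
    "statmike-mlops-349915.feature_engineering.03_feature_engineering_2b",
])

# To execute (requires google-cloud-bigquery and credentials):
# from google.cloud import bigquery
# results = bigquery.Client().query(query).to_dataframe()
```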
The evaluation comparison shows that the boosted tree model is much better than linear regression, with drastically lower mean squared error and a higher R2.
Both models are ready to serve predictions, but the clear choice is the boosted tree regressor. Once you decide which model to use, you can predict directly within BigQuery ML using the ML.PREDICT function. In the rest of the tutorial, we show how to export the model outside of BigQuery ML and predict using Google Cloud Vertex AI.
Using BigQuery Models for Inference Outside of BigQuery
Once your model is trained, if you want low-latency online inference from your application, you have to deploy the model outside of BigQuery. The following steps demonstrate how to deploy the models to Vertex AI Prediction endpoints.
This can be accomplished in one of two ways:
Manually export the model from BigQuery ML and set up a Vertex AI Prediction Endpoint. To do this, you need to do steps 3 and 4 first.
Register the model and deploy from Vertex AI Model Registry automatically. This capability is not available yet but is planned for a forthcoming release; once it's available, steps 3 and 4 can be skipped.
Step 3. Manually export models from BigQuery
BigQuery ML supports an EXPORT MODEL statement to deploy models outside of BigQuery. A manual export includes two models – a preprocessing model that reflects the TRANSFORM statement and a prediction model. Both models are exported with a single export statement in BigQuery ML.
```sql
EXPORT MODEL `statmike-mlops-349915.feature_engineering.03_feature_engineering_2b`
  OPTIONS (URI = 'gs://statmike-mlops-349915-us-central1-bqml-exports/03/2b/model')
```
The preprocessing model that captures the TRANSFORM statement is exported as a TensorFlow SavedModel file. In this example it is exported to a GCS bucket located at gs://statmike-mlops-349915-us-central1-bqml-exports/03/2b/model/transform.
The prediction models are saved in portable formats that match the frameworks in which they were trained by BigQuery ML: the linear regression model is exported as a TensorFlow SavedModel, and the boosted tree regressor is exported as a Booster file (XGBoost). In this example, the boosted tree model is exported to a GCS bucket located at gs://statmike-mlops-349915-us-central1-bqml-exports/03/2b/model.
These export files are in standard open formats for their native model types, making them completely portable: they can be deployed to Vertex AI (Steps 4-7 below), on your own infrastructure, or even in edge applications.
Steps 4 through 7 show how to register and deploy a model to Vertex AI Prediction endpoint. These steps need to be repeated separately for the preprocessing models and the prediction models.
Step 4. Register models to Vertex AI Model Registry
To deploy the models in Vertex AI Prediction, they first need to be registered with the Vertex AI Model Registry. To do this, two inputs are needed: the links to the model files and a URI to a pre-built container. Go to Step 4 in the tutorial to see exactly how it's done.
The registration can be done with the Vertex AI console or programmatically with one of the clients. In the example below, the Python client for Vertex AI is used to register the models like this:
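A minimal sketch of registering the exported boosted tree model with the google-cloud-aiplatform client might look like the following; the display name and the pre-built serving container version are illustrative assumptions, and the artifact URI comes from the export in Step 3.

```python
# Sketch only: requires the google-cloud-aiplatform package and GCP credentials.
from google.cloud import aiplatform

aiplatform.init(project="statmike-mlops-349915", location="us-central1")

# Register the exported boosted tree model. The XGBoost serving container
# version is an assumption -- match it to the exported Booster file.
model = aiplatform.Model.upload(
    display_name="03_feature_engineering_2b",
    artifact_uri="gs://statmike-mlops-349915-us-central1-bqml-exports/03/2b/model",
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/xgboost-cpu.1-1:latest"
    ),
)
```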
Vertex AI includes a service for hosting models for online predictions. To host a model on a Vertex AI Prediction endpoint, you first create an endpoint. This can also be done directly from the Vertex AI Model Registry console or programmatically with one of the clients. In the example below, the Python client for Vertex AI is used to create the endpoint like this:
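A sketch of creating the endpoint with the Python client (the display name is an assumption):

```python
# Sketch only: requires the google-cloud-aiplatform package and GCP credentials.
from google.cloud import aiplatform

aiplatform.init(project="statmike-mlops-349915", location="us-central1")

# Create an empty endpoint; models are deployed to it in the next step.
endpoint = aiplatform.Endpoint.create(display_name="03_feature_engineering")
```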
Deploying a model from the Vertex AI Model Registry (Step 4) to a Vertex AI Prediction endpoint (Step 5) is done in a single deployment action where the model definition is supplied to the endpoint along with the type of machine to utilize. Vertex AI Prediction endpoints can automatically scale up or down to handle prediction traffic needs by providing the number of replicas to utilize (default is 1 for min and max). In the example below, the Python client for Vertex AI is being used with the deploy method for the endpoint (Step 5) using the models (Step 4):
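A sketch of the deploy call; the display-name lookups and machine type are illustrative assumptions.

```python
# Sketch only: requires the google-cloud-aiplatform package and GCP credentials.
from google.cloud import aiplatform

aiplatform.init(project="statmike-mlops-349915", location="us-central1")

# Look up the model and endpoint created in the earlier steps (names assumed).
model = aiplatform.Model.list(filter='display_name="03_feature_engineering_2b"')[0]
endpoint = aiplatform.Endpoint.list(filter='display_name="03_feature_engineering"')[0]

endpoint.deploy(
    model=model,
    machine_type="n1-standard-2",
    min_replica_count=1,  # autoscaling bounds; both default to 1
    max_replica_count=1,
)
```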
Once the model is deployed to a Vertex AI Prediction endpoint (Step 6) it can serve predictions. Rows of data, called instances, are passed to the endpoint and results are returned that include the processed information: preprocessing result or prediction. Getting prediction results from Vertex AI Prediction endpoints can be done with any of the Vertex AI API interfaces (REST, gRPC, gcloud, Python, Java, Node.js). Here, the request is demonstrated directly with the predict method of the endpoint (Step 6) using the Python client for Vertex AI as follows:
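A sketch of requesting a prediction with the Python client; the instance field values are illustrative, not real recipe data, and the instances must mirror the raw input columns the model's TRANSFORM statement expects.

```python
# Sketch only: requires the google-cloud-aiplatform package and GCP credentials.
from google.cloud import aiplatform

aiplatform.init(project="statmike-mlops-349915", location="us-central1")
endpoint = aiplatform.Endpoint.list(filter='display_name="03_feature_engineering"')[0]

# Field names match the raw bread-recipe columns; the values are made up.
instances = [{
    "flourAmt": 511.2, "saltAmt": 9, "yeastAmt": 11, "water1Amt": 338,
    "water2Amt": 105, "waterTemp": 48, "bakeTemp": 522, "ambTemp": 36,
    "ambHumidity": 23, "mix1Time": 6, "mix2Time": 5, "mix1Speed": 4,
    "mix2Speed": 6, "proveTime": 92, "restTime": 43, "bakeTime": 20,
}]
prediction = endpoint.predict(instances=instances)
print(prediction.predictions)
```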
The result of an endpoint with the preprocessing model will be identical to applying the TRANSFORM statement from BigQuery ML. The results can then be pipelined to an endpoint with the prediction model to serve predictions that match the results of the ML.PREDICT function in BigQuery ML. The results of both methods, Vertex AI Prediction endpoints and BigQuery ML with ML.PREDICT, are shown side-by-side in the tutorial to show that the model's results are replicated. Now the model can be used for online serving with extremely low latency, including via private endpoints for even lower latency and secure connections with VPC Network Peering.
With the new preprocessing functions, you can simplify data exploration and feature preprocessing. Further, by embedding preprocessing within model training using the TRANSFORM statement, the serving process is simplified by using prepped models without needing additional steps. In other words, predictions are done right inside BigQuery or alternatively the models can be exported to any location outside of BigQuery such as Vertex AI Prediction for online serving. The tutorial demonstrated how BigQuery ML works with Vertex AI Model Registry and Prediction to create a seamless end-to-end ML experience. In the future you can expect to see more capabilities that bring BigQuery, BigQuery ML and Vertex AI together.
While there has been a lot of attention given to wildfires, floods, and hurricanes, heat-related weather events are still understudied and underreported. Every summer, heat waves pose a major threat to the health of people and ecosystems. 83% of the North American population lives in cities, where the urban heat island (UHI) effect leads to higher local temperatures compared to surrounding rural areas.
But not everyone living in US cities experiences summer heat waves and urban heat islands equally. Communities with lower incomes or people of color are more likely to be impacted by extreme heat events, due both to fewer green spaces in urban areas and to a lack of access to air conditioning. While many studies have shed light on environmental inequities between neighborhoods of varying income levels, there has been little analysis of what it will take to provide all people with protection from severe heat.
In the summer of 2019, TC Chakraborty, then a PhD candidate at the Yale School of the Environment, and Tanushree Biswas, then Spatial Data Scientist at The Nature Conservancy, California, met at one of our Geo for Good Summits. The summits bring together policymakers, scientists, and other change-makers who use Google’s mapping tools. They wanted to share ideas in their areas of expertise (urban heat and tree cover, respectively) and explore a potential path to address urban climate change inequities using open tools and a suite of datasets. Given the ability of tree cover to help mitigate local heat in cities, they wondered how much space is actually available for trees in lower income urban neighborhoods.
Three years later, in June 2022, Chakraborty, Biswas, and co-authors L.S. Campbell, B. Franklin, S.S. Parker, and M. Tukman published a paper to address this challenge. The study combines medium-to-high-resolution satellite observations with census data to calculate the feasible area available for urban afforestation — planting new trees — for over 200 urban clusters in California. The paper demonstrates a systematic approach that leverages publicly available data on Google Earth Engine, Google's planetary-scale platform for Earth science data & analysis, which is free of charge for nonprofits, academics, and research use cases. Results from the study can be explored through an Earth Engine web application: Closing Urban Tree Cover Inequity (CUTI).
California is the most populated state in the United States, the fifth largest economy in the world and frequently impacted by heat waves. This makes California a prime location to demonstrate approaches to strategically reducing surface UHI (SUHI), which has the potential to positively impact millions, especially those vulnerable to heat risk. Chakraborty et al. (2022) found that underprivileged neighborhoods in California have 5.9% less tree cover (see Fig. 1 for an illustrative example for Sacramento) and 1.7 °C higher summer SUHI intensity than more affluent neighborhoods. This disparity in tree cover can be partially closed through targeted urban afforestation.
Fig 1. Street and satellite views of an underprivileged and a more affluent neighborhood in Sacramento.
Leveraging the wealth of data for cities in California, including heat-related mortality and morbidity data, sensitivity of residential energy demand to temperature, and carbon sequestration rates of California's forests, the researchers calculated co-benefits of several urban afforestation scenarios. For their maximum possible afforestation scenario, they found potential for an additional 36 million trees (1.28 million acres), which could provide economic co-benefits estimated to be worth as much as $1.1 billion annually:
4.5 million metric tons of annual CO2 sequestration
Reduction in heat-related medical visits (~4000 over 10 years)
Energy usage and cost reductions
Stormwater runoff reduction
Property value increase
With a focus on reducing disparities in SUHI and tree cover within these cities, the study provides suitability scores for scaling urban afforestation at the census-block group level across California. By focusing on California neighborhoods with high suitability scores, the authors estimate that an annual investment of $467 million in urban afforestation would both reduce heat disparities and generate $712 million of net annual benefits. Specifically, these benefits would go to 89% of the approximately 9 million residents in the lowest income quartiles of California cities. This annual investment equates to a 20-year commitment of $9.34 billion or 10,000 Electric Vehicles annually.
Fig. 2 Maps showing a median income, b current canopy cover percentage, c summer daytime surface urban heat island (SUHI) intensity, and d available area for planting trees as percentage of total area for census block groups in the Los Angeles urban cluster (Image source: TC Chakraborty and Tanushree Biswas).
The adverse effects of climate change disproportionately impact cities, so it’s critical to start thinking about viable solutions to address environmental disparities within urban communities. Providing urban planners with data-driven tools to design climate-resilient cities is a key first step. The Chakraborty et al. study leverages Earth Engine data, tech, and cloud compute resources to provide actionable insights to address environmental disparities in cities. It’s a great example of how Earth Engine can help inform urban policy and provide a bird’s-eye view of logistical support for scalable climate solutions that enable innovation and investment opportunities. In the future, Chakraborty and Biswas hope to further scale this analysis across U.S. cities to provide baseline data that can help move us towards equitable climate adaptation for everyone.
Google wants to support this kind of research! If you are a researcher working on climate impact, apply to the Google Cloud Climate Innovation Challenge in partnership with The National Science Foundation (NSF) and AI Institute for Research on Trustworthy AI in Weather, Climate, and Coastal Oceanography (AI2ES) for free credits to fuel your research.
Thanks to TC Chakraborty and Tanushree Biswas for their help in preparing this blog post.
The Oden Technologies solution is an analytics layer for manufacturers that combines and analyzes all process information from machines and production systems to give real-time visibility to the biggest causes of inefficiency and recommendations to address them. Oden empowers front-line plant teams to make effective decisions, such as prioritizing resources more effectively, solving issues faster, and realizing optimal behavior.
Use cases: Challenges and problems solved
Manufacturing plants have limited resources and want to use them optimally. That means eliminating inefficiencies, which requires recommendations and data points as key inputs for decision making. These data points are derived from a torrent of data coming from multiple devices.
Oden’s customers are manufacturers with continuous and batch processes, such as in plastics extrusion, paper and pulp, and chemicals. Oden powers real-time and historical dashboards and reports necessary for this decision-making through leveraging the underlying Google Cloud Platform.
Oden’s platform aggregates streaming, time-series data from multiple devices and instruments and processes them in real-time. This data is in the form of continuously sampled real-world sensor readings (metrics) that are ingested into CloudIoT and transformed in real-time using Dataflow before being written to Oden’s time series database. Transformations include data cleaning, normalization, synchronization, smoothing, outlier removal, and multi-metric calculations that are built in collaboration with manufacturing customers. The time-series database then powers real-time and historical dashboards and reports.
One of the major challenges of working with real-time manufacturing data is handling network disruptions. Manufacturing environments are often not well served by ISPs and can experience network issues due to environmental and process conditions or other factors. When this happens, data can be backed up locally and arrive late after the connection recovers. To avoid overloading real-time Dataflow jobs with this late data, Oden uses BigQuery to support late data handling and recoveries.
In addition to the sensor data, Oden collects metadata about the production process and factory operation such as products manufactured on each line, their specifications and quality. Integrations provide the metadata via Oden’s Integration APIs running on Google Kubernetes Engine (GKE), which then writes it to a PostgreSQL database hosted in CloudSQL. The solution then uses this metadata to contextualize the time-series data in manufacturing applications.
Oden uses this data in several ways, including real-time monitoring and alerting, dashboards for line operators and production managers, historical query tools for quality engineers, and machine learning models trained on historical data and scored in real-time to provide live predictions, recommendations, and insights. This is all served in an easy to access and understand UI, greatly empowering employees across the factory to use data to improve their lines of business.
The second major challenge in manufacturing systems is achieving quality specifications on the final product so it can be sold. Typically, quality assurance is conducted offline: after production has completed, a sample is taken from the final product and tested to determine its physical properties. However, this introduces a lag between the actual period of production and information about the effectiveness of that production, sometimes hours or even days after the fact. This prevents proactive adjustments that could correct quality failures, and results in considerable waste.
At the heart of the Oden platform is Google BigQuery, which plays an important backstage role in Oden's data-driven software. Metric data is written simultaneously to BigQuery via a BigQuery subscription through Cloud Pub/Sub, and metadata from Cloud SQL is accessible via BigQuery's federated queries. This makes BigQuery an exploratory engine for all customer data, allowing Oden's data scientists and engineers to support the data pipelines and build Oden's machine learning models.
Sometimes these queries are ad-hoc analyses that help understand the data better. For example, here's a BigQuery query joining the native BigQuery metrics table with a federated query against the metadata in PostgreSQL. The query determines the average lag between the event time and ingest time of customer metrics, by day, for November 2022:
```sql
SELECT
  customer_id,
  metric_id,
  TIMESTAMP_TRUNC(ingest_time, DAY) AS ingest_day,
  AVG(TIMESTAMP_DIFF(ingest_time, event_time, MILLISECOND)) AS diff_ms
FROM
  `oden-production.metrics.metrics`
JOIN
  EXTERNAL_QUERY("oden-production.us.metadatadb",
    "SELECT customer_id::text, metric_id::text FROM customer_metrics")
USING (customer_id, metric_id)
WHERE
  ingest_time >= '2022-11-01'
  AND ingest_time < '2022-12-01'
GROUP BY
  customer_id,
  metric_id,
  ingest_day
In addition to ad-hoc queries, several key features of Oden use BigQuery as their foundation. Below, we cover two major features that leverage BigQuery as a highly scalable source of truth.
Use case 1: The data reliability challenge of manufacturing and the cloud
As mentioned earlier, one of the major challenges of working with real-time manufacturing data is handling network disruptions. After the connection recovers, you encounter data that has been backed up and is out of temporal sequence. To avoid overloading real-time dataflow jobs with this late data, BigQuery is used to support late data handling and recoveries.
The data transformation jobs that run on Dataflow are written with the Apache Beam framework and usually perform their transformations by reading metrics from an input Pub/Sub topic and writing back to an output topic. This forms a directed acyclic graph (DAG) of transformation stages before the final metrics are written to the time-series database. But the streaming jobs degrade in performance when handling large bursts of late data, putting at risk the ability to meet Service Level Agreements (SLAs), which guarantee customers high availability and fast end-to-end delivery of real-time features.
A key tenet of the Apache Beam model is that transformations can be applied to both bounded and unbounded collections of data, so Apache Beam can be used for both streaming and batch processing. Oden takes this a step further with a universal shared connector for every transformation job, which allows an entire job to switch between a regular "streaming mode" and an alternative "Batch Mode." In "Batch Mode," the streaming jobs perform the same transformations but use Avro files or BigQuery as their data source.
This "Batch Mode" feature started as a method of testing and running large batch recoveries after outages, but it has since evolved into a solution to late data handling problems. All data that arrives "late" at Oden bypasses the real-time Dataflow streaming jobs and is written to a special "late metrics" Pub/Sub topic and then to BigQuery. Nightly, the "Batch Mode" jobs are deployed, any data that wasn't processed that day is queried out of BigQuery, and the results are written back to the time-series database. This creates two SLAs for customers: a real-time one of seconds for on-time data and a batch one of 24 hours for any data that arrives late.
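The routing decision itself is simple; here is a plain-Python sketch of the idea (the lateness threshold, record shape, and topic names are illustrative, not Oden's actual configuration):

```python
from datetime import datetime, timedelta, timezone

# Illustrative threshold: records older than this at arrival are "late".
ALLOWED_LATENESS = timedelta(minutes=5)

def route(record, now):
    """Send on-time records to the real-time pipeline; divert late arrivals
    to a separate topic that the nightly Batch Mode job drains via BigQuery."""
    lateness = now - record["event_time"]
    return "metrics" if lateness <= ALLOWED_LATENESS else "late-metrics"

now = datetime(2022, 11, 1, 12, 0, tzinfo=timezone.utc)
on_time = {"event_time": now - timedelta(seconds=30), "value": 1.0}
backed_up = {"event_time": now - timedelta(hours=2), "value": 2.0}
```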
Occasionally, there is a need to backfill transformations of these streaming jobs due to regressions or new features to backport over old data. In these cases, batch jobs are leveraged again. Additionally, jobs specific to customer data are joined with metrics and customer configuration data hosted in CloudSQL via BigQuery’s Federated queries to CloudSQL.
By using BigQuery for recoveries, dataflow jobs continue to run smoothly, even in the face of network disruptions. This allows maintaining high accuracy and reliability in real-time data analysis and reporting. Since moving to separate BigQuery-powered late-data handling, the median system latency of calculated metrics feature for real-time metrics is under 2s which allows customers to observe and respond to their custom multi-sensor calculated metrics instantly.
Use case 2: Building and delivering predictive quality models
The next use case deals with applying machine learning to manufacturing: predicting offline quality test results using real-time process metrics. This is a challenging problem in manufacturing environments, where not only are high accuracy and reliability necessary, but the data is also collected at different sampling rates (seconds, minutes, and hours) and stored in several different systems. The merged datasets represent the comprehensive view of the data available to factory personnel, who use the full set of context information to make operational decisions. This ensures Predictive Quality models access that same full picture of the manufacturing process when providing predictions.
At Oden, BigQuery addresses the two key challenges of machine learning in the manufacturing environment:
Using the time-series data stored in the time-series database, summary aggregates are computed to construct features as input for model training and scoring.
Using federated queries to access context metadata, the aggregates are merged with that metadata to fully characterize each production period. This makes it easy to combine data from both sources and use it to train machine learning models.
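A toy sketch of this feature construction in plain Python; the data structures and field names are invented for illustration.

```python
from statistics import mean

def summarize(samples):
    """Summary aggregates over one production run's sensor samples."""
    return {"mean": mean(samples), "min": min(samples), "max": max(samples)}

def build_features(sensor_runs, metadata):
    """Join per-run aggregates with production metadata, analogous to
    merging time-series aggregates with federated metadata queries."""
    rows = []
    for run_id, samples in sensor_runs.items():
        rows.append({"run_id": run_id, **summarize(samples), **metadata[run_id]})
    return rows

features = build_features(
    {"run-1": [1.0, 2.0, 3.0]},
    {"run-1": {"product": "cable-x", "line": "extruder-2"}},
)
```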
Oden uses a variety of models and embeddings, ranging from linear models (elastic nets, lasso) and ensemble models (boosted trees, random forests) to DNNs, which address customers' differing complexity-accuracy-interpretability requirements.
The chart shows out-of-sample predictions of offline quality test values, compared with the actual values that were observed after the end of production. The predicted values provide lead time of quality problems of up to one hour.
Models are trained using an automated pipeline based on Apache Airflow and scikit-learn, and are stored in Google Cloud Storage for versioning and retrieval. Once the models are trained, they can be used to predict the outcomes of quality tests in real time via a streaming Dataflow job. This allows factory floor operators to identify and address potential problems before they become more serious or have a large impact, improving the overall efficiency of the production process and reducing the amount of waste a factory generates. Factory floor operators receive up-to-date information about the quality characteristics of current production conditions, up to an hour before the actual test value is available for inspection. This gives early warning to help catch quality failures. In turn, this reduces material waste and machine downtime, metrics that are central to many manufacturers’ continuous improvement initiatives, as well as their day-to-day operations.
Operators come to rely upon predictive models to execute their roles effectively, regardless of their experience level or their familiarity with a specific type of production or machinery, and up-to-date models are critical to the success of the predictions in improving manufacturing processes. Hence, in addition to training, life-cycle management of models and ML ops are important considerations in deploying reliable models to the factory floor. Oden is focusing on leveraging Vertex AI to make the ML model lifecycle more simple and efficient.
Oden’s Predictive Quality model empowers operators to take proactive steps to optimize production on the factory floor, and allows for real-time reactions to changes in the manufacturing process. This contributes to cost reduction, energy savings, and reduced material waste.
The future of BigQuery at Oden
Actionable data, like the processed data generated by Oden, has become a critical part of making predictions and decisions to remain competitive in the manufacturing space. To use these insights to their full potential, businesses need a low barrier to accessing data, unifying it with other data sources, deriving richer insights, and making informed decisions. Oden already leads the market in delivering trustworthy, usable, and understandable insights from combined process, production, and machine data, accessible to everyone within the plant to improve their line of business. There is an opportunity to go beyond the Oden interface and integrate with even more business systems: the data can be made available in the form of finished datasets hosted in BigQuery. Google Cloud has launched a new service called Analytics Hub, powered by BigQuery, with the intent to make data sharing easier, more secure, searchable, reliable, and highly scalable.
Analytics Hub is based on a publish-subscribe model in which BigQuery datasets are listed in a Data exchange as a Shared dataset; an exchange can host hundreds of listings. It lets users share multiple BigQuery objects, such as views, tables, external tables, and models, into the Data exchange. A Data exchange can be marked public, or private for dedicated sharing. On the other end, businesses can subscribe to one or more listings in their BigQuery instance, where each is consumed as a Linked dataset to run queries against. Analytics Hub sets up a real-time data pipeline with a low-code/no-code approach to sharing data, while giving Oden complete control over what data is shared, for better governance.
This empowers advanced users, who have use-cases that exceed the common workflows already achievable with Oden’s configurable dashboard and query tools, to leverage the capabilities of BigQuery in their organization. This brings Oden’s internal success with BigQuery directly to advanced users. With BigQuery, they can join against datasets not in Oden, express complex BigQuery queries, load data directly with Google’s BigQuery client libraries, and integrate Oden data into third party Business Intelligence software such as Google Data Studio.
Google Cloud and Oden are forging a strong partnership in several areas, most of which are central to customers' needs. Oden has developed a turnkey solution using best-in-class Google Cloud tools and technologies, delivering pre-built models to accelerate time to value and enabling manufacturers to gain accessible and impactful insights without hiring a data science team. Together, Google and Oden are expanding the way manufacturers access and use data by creating a clear path to centralize production, machine, and process data into the larger enterprise data platform, paving the way for energy savings, material waste reduction, and cost optimization.
Click here to learn more about Oden Technologies or to request a demo.
The Built with BigQuery advantage for ISVs
Google is helping tech companies like Oden Technologies build innovative applications on Google’s data cloud with simplified access to technology, helpful and dedicated engineering support, and joint go-to-market programs through the Built with BigQuery initiative, launched in April ‘22 as part of the Google Data Cloud Summit. Participating companies can:
Get started fast with a Google-funded, pre-configured sandbox.
Accelerate product design and architecture through access to designated experts from the ISV Center of Excellence who can provide insight into key use cases, architectural patterns, and best practices.
Amplify success with joint marketing programs to drive awareness, generate demand, and increase adoption.
BigQuery gives ISVs the advantage of a powerful, highly scalable data warehouse that’s integrated with Google Cloud’s open, secure, sustainable platform. And with a huge partner ecosystem and support for multi-cloud, open source tools and APIs, Google provides technology companies the portability and extensibility they need to avoid data lock-in.
We thank the Google Cloud and Oden team members who co-authored the blog: Oden: Henry Linder, Staff Data Scientist & Deepak Turaga, SVP Data Science and Engineering. Google: Sujit Khasnis, Solutions Architect & Merlin Yammsi, Solutions Consultant
BigQuery BI Engine is a fast, in-memory analysis system for BigQuery, currently processing over 2 billion queries per month and growing. BigQuery has its roots in Google’s Dremel system and is a data warehouse built with scalability as a goal. BI Engine, on the other hand, was envisioned with data analysts in mind and focuses on providing value on gigabyte to sub-terabyte datasets, with minimal tuning, for real-time analytics and BI purposes.
Using BI Engine is simple – create a memory reservation on the project that runs BigQuery queries, and it will cache data and use the optimizations. This post is a deep dive into how BI Engine helps deliver blazing fast performance for your BigQuery queries and what users can do to leverage its full potential.
BI Engine optimizations
The two main pillars of BI Engine are in-memory caching of data and vectorized processing. Other optimizations include CMETA metadata pruning, single-node processing, and join optimizations for smaller tables.
BI Engine utilizes the “Superluminal” vectorized evaluation engine, which is also used for YouTube’s analytic data platform query engine, Procella. In BigQuery’s row-based evaluation, the engine processes all columns within a row for every row, potentially alternating between column types and memory locations before moving to the next row. In contrast, a vectorized engine like Superluminal processes a block of values of the same type from a single column for as long as possible, switching to the next column only when necessary. This way, hardware can run multiple operations at once using SIMD, reducing both latency and infrastructure costs. BI Engine dynamically chooses the block size to fit into caches and available memory.
For example, the query “SELECT AVG(word_count), MAX(word_count), MAX(corpus_date) FROM samples.shakespeare” will have the following vectorized plan. Note how the evaluation processes “word_count” separately from “corpus_date”.
BigQuery is a disaggregated storage and compute engine. Usually the data in BigQuery is stored on Google’s distributed file system – Colossus, most often in blocks in Capacitor format and the compute is represented by Borg tasks. This enables BigQuery’s scaling properties. To get the most out of vectorized processing, BI Engine needs to feed the raw data at CPU speeds, which is achievable only if the data is already in memory. BI Engine runs Borg tasks as well, but workers are more memory-heavy to be able to cache the data as it is being read from Colossus.
A single BigQuery query can either be sent to a single BI Engine worker, or sharded and sent to multiple BI Engine workers. Each worker receives a piece of the query to execute, along with the set of columns and rows necessary to answer it. If the data is not already cached in the worker's memory from a previous query, the worker loads it from Colossus into local RAM; subsequent requests for the same, or a subset of, columns and rows are served from memory only. Note that workers unload the contents if the data hasn't been used for over 24 hours. As multiple queries arrive, they might sometimes require more CPU time than is available on a worker; if there is still reservation capacity available, a new worker is assigned to the same blocks, and subsequent requests for those blocks are load-balanced between the workers.
BI Engine can also process super-fresh data that was streamed to a BigQuery table. Therefore, BI Engine workers currently support two formats: Capacitor and streaming.
In-memory capacitor blocks
Generally, data in a Capacitor block is heavily pre-processed and compressed during generation. There are a number of different ways the data from a Capacitor block can be cached; some are more memory efficient, while others are more CPU efficient. The BI Engine worker chooses intelligently among them, preferring latency- and CPU-efficient formats where possible. Thus, actual reservation memory usage might not match logical or physical storage usage, due to the different caching formats.
In-memory streaming data
Streaming data is stored in memory as blocks of native array-columns and is lazily unloaded when blocks get extracted into Capacitor by underlying storage processes. Note that for streaming, BI Engine workers need to either go to streaming storage every time to potentially obtain new blocks, or serve slightly stale data. BI Engine prefers serving slightly stale data and loading new streaming blocks in the background instead.
The BI Engine worker does this opportunistically during queries: if the worker detects streaming data and the cache is newer than one minute, a background refresh is launched in parallel with the query. In practice, this means that with enough requests the data is no more stale than the previous request time. For example, if a request arrives every second, then the streaming data will be around a second stale.
First requests loading data are slow
Due to these read-time optimizations, loading data from previously unseen columns can take longer than it would in BigQuery. Subsequent reads benefit from the optimizations.
For example, here are the backend times for a sample run of the example query with BI Engine off, on the first run, and on a subsequent run.
Multiple block processing and dynamic single worker execution
BI Engine workers are optimized for BI workloads where the output size will be small compared to the input size and the output will be mostly aggregated. In regular BigQuery execution, a single worker tries to minimize data loading due to network bandwidth limitations. Instead, BigQuery relies on massive parallelism to complete queries quickly. On the other hand, BI Engine prefers to process more data in parallel on a single machine. If the data has been cached, there is no network bandwidth limitation and BI Engine further reduces network utilization by reducing the number of intermediate “shuffle” layers between query stages.
With small enough inputs and a simple query, the entire query will be executed on a single worker and the query plan will have a single stage for the whole processing. We constantly work on making more tables and query shapes eligible for a single stage processing, as this is a very promising way to improve the latency of typical BI queries.
For the example query, which is very simple and the table is very small, here is a sample run of the same query with BI Engine distributed execution vs single node (default).
How to get most out of BI Engine
While we all want a switch that we can toggle and everything becomes fast, there are still some best practices to think about when using BI Engine.
Output data size
BI optimizations assume human eyes on the other side, and that the size of the output data is small enough to be comprehensible by a human. This limited output size is achieved by selective filters and aggregations. As a corollary, instead of SELECT * (even with a LIMIT), a better approach is to select only the fields one is interested in, with an appropriate filter and aggregation.
To show this with an example: the query “SELECT * FROM samples.shakespeare” processes about 6MB and takes over a second with both BigQuery and BI Engine. If we instead add MAX to every field, as in “SELECT MAX(word), MAX(word_count), MAX(corpus), MAX(corpus_date) FROM samples.shakespeare”, both engines will read all of the data, perform some simple comparisons, and finish 5 times faster on BigQuery and 50 times faster on BI Engine.
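The rewrite pattern above generalizes: name only the columns you need, and push filters and aggregation into the query. A sketch against the same public sample table:

```sql
-- Instead of pulling raw rows for a human to scan:
SELECT * FROM samples.shakespeare LIMIT 1000;

-- Prefer an explicit column list with a filter and an aggregation,
-- so the output stays small and BI Engine can do the heavy lifting:
SELECT corpus, SUM(word_count) AS total_words
FROM samples.shakespeare
WHERE corpus_date >= 1600
GROUP BY corpus;
```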
Help BigQuery with organizing your data
BI Engine uses query filters to narrow down the set of blocks to read, so partitioning and clustering your data will reduce the amount of data read, the latency, and slot usage. One caveat: “over-partitioning,” or having too many small partitions, might interfere with BI Engine's multi-block processing. For optimal BigQuery and BI Engine performance, partitions larger than one gigabyte are preferred.
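Partitioning and clustering are declared when the table is created. A minimal sketch, with hypothetical table and column names:

```sql
-- Hypothetical names; choose a partitioning granularity that keeps
-- individual partitions larger than ~1 GB, per the guidance above.
CREATE TABLE demo.sensor_readings
PARTITION BY DATE(event_time)      -- filters on event_time prune partitions
CLUSTER BY line_id, metric_name    -- filters on these columns prune blocks
AS
SELECT * FROM demo.sensor_readings_raw;
```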
BI Engine currently accelerates the stages of the query that read data from the table, which are typically the leaves of the query execution tree. In practice, this means almost every query will still use some BigQuery slots. That's why one gets the most speedup from BI Engine when a lot of time is spent on leaf stages. To mitigate this, BI Engine tries to push as many computations as possible down to the first stage, ideally executing them on a single worker, where the tree is just one node.
For example, Query 1 of the TPC-H 10G benchmark is relatively simple: it is three stages deep, with efficient filters and aggregations, and it processes 30 million rows but outputs just one.
Running this query with BI Engine, we see that the full query took 215 ms, with the “S00: Input” stage, the one accelerated by BI Engine, taking 26 ms.
Running the same query in BigQuery takes 583 ms, with “S00: Input” taking 229 ms.
What we see here is that the “S00: Input” stage run time went down 8x, but the overall query did not get 8x faster, as the other two stages were not accelerated and their run time remained roughly the same. The breakdown between stages is illustrated in the following figure.
In a perfect world, where BI Engine processes its part in 0 milliseconds, the query would still take 189 ms to complete, so the maximum speed gain for this query is about 2-3x.
If we, for example, make this query heavier on the first stage by running TPC-H 100G instead, we see that BI Engine finishes the query 6x faster than BigQuery, while the first stage is 30 times faster (vs 1 second on BigQuery).
Over time, our goal is to expand the eligible query and data shapes and collapse as many operations as feasible into a single BI Engine stage to realize maximum gains.
As previously noted, BI Engine accelerates the “leaf” stages of a query. However, there is one very common pattern used in BI tools that BI Engine optimizes: one large “fact” table joined with one or more smaller “dimension” tables. In that case, BI Engine can perform multiple joins, all in one leaf stage, using a so-called “broadcast” join execution strategy.
During the broadcast join, the fact table is sharded to be executed in parallel on multiple nodes, while the dimension tables are read on each node in their entirety.
For example, let’s run Query 3 from the TPC-DS 1G benchmark. The fact table is store_sales and the dimension tables are date_dim and item. In BigQuery the dimension tables will be loaded into shuffle first, then the “S03: Join+” stage will, for every parallel part of store_sales, read all necessary columns of two dimension tables, in their entirety, to join.
Note that filters on date_dim and item are very efficient, and the 2.9M row fact table is joined only with about 6000 rows. BI Engine plan will look a bit different, as BI Engine will cache the dimension tables directly, but the same principle applies.
For BI Engine, let’s assume two nodes process the query, because the store_sales table is too big for single-node processing. We can see in the image below that both nodes perform similar operations: reading the data, filtering, building the lookup table, and then performing the join. While only a subset of the store_sales data is processed on each node, all operations on the dimension tables are repeated:
The “build lookup table” operation is very CPU-intensive compared to filtering.
“Join” operation performance also suffers if the lookup tables are large, as this interferes with CPU cache locality.
Dimension tables need to be replicated to each “block” of the fact table.
The takeaway: when a join is performed by BI Engine, the fact table is sometimes split across nodes, while all other tables are copied in their entirety to every node to perform the join. Keeping dimension tables small, or filtering them selectively, helps ensure join performance is optimal.
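The star-schema shape discussed above looks roughly like the following simplified sketch of TPC-DS Query 3 (the exact predicates in the benchmark differ slightly): one large fact table joined with two small, selectively filtered dimension tables, which BI Engine can execute as broadcast joins in a single leaf stage.

```sql
-- Simplified sketch of the TPC-DS Query 3 shape; predicates are
-- illustrative. Selective dimension filters keep the broadcast
-- lookup tables small, which is what makes this pattern fast.
SELECT
  d.d_year,
  i.i_brand,
  SUM(s.ss_ext_sales_price) AS sum_agg
FROM store_sales AS s                          -- large fact table, sharded
JOIN date_dim AS d ON s.ss_sold_date_sk = d.d_date_sk  -- small dimension
JOIN item     AS i ON s.ss_item_sk     = i.i_item_sk   -- small dimension
WHERE d.d_moy = 12
  AND i.i_manufact_id = 128
GROUP BY d.d_year, i.i_brand;
```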
Summarizing everything above, there are some things one can do to make full use of BI Engine and make queries faster:
Less is more when it comes to data returned – make sure to filter and aggregate as much data as possible early in the query. Push down filters and computations into BI Engine.
Queries with a small number of stages get the best acceleration. Preprocessing the data to minimize query complexity will help with optimal performance. For example, using materialized views can be a good option.
Joins are sometimes expensive, but BI Engine may be very efficient in optimizing typical star schema queries.
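The preprocessing suggestion above can be sketched with a materialized view, which precomputes a common aggregation so that BI queries hit a smaller, simpler input. Table and column names here are hypothetical:

```sql
-- Hypothetical names; a materialized view keeps a common aggregate
-- up to date automatically, reducing query depth and input size.
CREATE MATERIALIZED VIEW demo.daily_line_totals AS
SELECT
  line_id,
  DATE(event_time) AS day,
  AVG(temperature) AS avg_temp,
  COUNT(*)         AS num_readings
FROM demo.sensor_readings
GROUP BY line_id, day;
```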
Data analytics technology is constantly shedding new insights into our lives. Many things are well observed through anecdotal experiences, but we have had a hard time proving them before data analytics technology became mainstream.
The value of technical MBAs and other degrees is a prime example. A growing number of experts are using data analytics technology to help illustrate the ROI that they offer.
What value does an MBA provide? How can data analytics technology help back it up?
Data Analytics Technology Proves Technical MBA Degrees Offer Huge Value
Getting a graduate degree is a big decision. It has its appeal, sure. Better pay. Career advancement. And, of course, you are also just sincerely interested in the subject matter. You may have chosen a career in healthcare or another admirable profession because you want to have a positive impact on people’s lives. You believe that getting an MBA might help you do that a little bit better.
On the other hand, you can’t really think of a bigger commitment. It’s expensive. It takes a lot of time. What is a motivated professional to do?
You might want to consider investing in a technical MBA. You might have been skeptical of the benefits in the past, but data analytics technology has helped many professionals get a better sense of the value that it offers.
Data analytics has helped companies evaluate the salaries of different professionals before and after they got their degrees. This has helped provide data-driven insights into the benefits of getting this specialized degree.
Data analytics has also helped professionals discover the unique opportunities that they can pursue with a technical MBA. They can use data mining tools to find job opportunities that are best suited for their credentials.
In this article, we take a look at how getting your MBA can positively impact your career trajectory. It will cover some of the factors that data analytics has evaluated to gauge the ROI of this degree.
Specialized MBAs in big data
You may also pursue a specialized MBA with a focus in big data or a similar field. As the linked article from Fortune shows, these degrees are in huge demand as more companies are looking for experts in business analytics.
Getting an MBA provides you with valuable career leverage. Your employer may not be interested in paying you more than they already are. However, now that you have an MBA you are a hotter commodity than you used to be. If the hospital that you work for now doesn’t want to fork over more money, there will probably be one somewhere that is more willing.
Of course, you can’t get too crazy. No one is going to double your salary overnight. But feel out the market a little bit. If there is a big demand for people with your specialty — right now there is — it can help with your negotiations.
The general rule of thumb is to start 5-10% higher than you actually expect to get. That way you can negotiate a little bit and still wind up with a number you are happy with.
Data analytics helps people with MBAs prove the value of their degree. They can use data mining to find meaningful insights on the benefits of their education, which they can bring up in their job interviews.
Networking isn’t exactly what people think it is. You don’t go to a party one day and meet Ms. HospitalOverlord. “An MBA, you say? You’ll come work for me, won’t you? How does one million dollars a year sound?”
You don’t get jobs just because you know people, but connections do help you stand out in otherwise crowded candidate pools. Graduate school is a particularly good place to meet people. Many of your fellow students will already be professionally situated. They may have firsthand knowledge about job openings. If not, they may be able to put you in contact with someone who does.
Similarly, your instructors will probably also work closely within the industry. If they are not currently involved professionally with hospitals, they will almost certainly know people who are.
These connections add up to new opportunities.
The benefits of networking are even more significant in the age of big data. People can use digital platforms like LinkedIn to establish more connections, which helps them leverage big data.
The most obvious benefit of graduate school for your career trajectory is that it equips you with new skills. Strip away everything else, and this is the real value of your degree. It’s what your employers are looking for in an MBA.
Well, ok. But I’ve been working in the industry for ten years. I learn on the job.
You do. But you do it through a lens that is at least ten years old. Most people stagnate with the same professional perspective they had when they left school. It’s not really their fault. It’s difficult to grow professionally while also trying to, you know, do your job.
Getting an MBA is a sure way to refresh your understanding of key concepts and get a new understanding of current trends. For example, people who graduated ten years ago entered a professional landscape with a much different understanding of digital technologies.
You may not have learned enough about data implementation or digital records to satisfy current expectations.
Of course, it’s a good idea to routinely update your skills, and you don’t always need to go in for a full degree to do it. There are upskilling curriculums that can be completed in weeks instead of years. They serve as an excellent way to brush up on new concepts and potentially acquire a little bit of professional momentum in the process.
But while these are handy opportunities to take advantage of from time to time, they don’t have quite the same career impact that an MBA has.
You Just Stand Out
Finally, MBAs stand out more than people without advanced degrees. If your objective is to move up the corporate ladder, you are going to need to be competitive. Though it may be cynical to say it, working hard only gets you so far. Everyone at your level works hard. That’s how they got there in the first place.
If you’re neck and neck with a coworker for a promotion, having an advanced degree may be just the thing that tips you over the edge. And again, it’s not only about moving up at the hospital where you are at right now.
MBAs can open up doors all over the place. If you aren’t getting the career traction you are hoping for at your current hospital, the MBA will make it much easier to change lanes and nab a better position for yourself.
Big Data Shows the True Value of an MBA
A growing number of people are discovering the value of MBAs. Data analytics technology has helped make the case that they are even more valuable than previously thought. People with specialized MBAs may benefit even more, since big data has become so important in business.
Nursing professionals will need to appreciate the importance of big data and know how to use it effectively. Keep reading to learn more.
The Growing Relevance of Big Data in the Nursing Profession
Healthcare technology is changing all the time. These developments creep into the hospital routine and have a direct impact not just on how nurses do their jobs, but also on how well the patients do in and out of the hospital.
It’s a big deal. So, what’s out there? And how is it impacting patients, and their eventual path to recovery? In this article, we talk about how big data technology impacts nurses and the communities they serve.
Data Processing and Implementation
Nurses and hospitals in general are using data to improve their ability to serve the community. During the height of Covid, data processing was used to anticipate surges and help hospital nurses and administrators plan out how to best use their limited resources.
It’s used now to predict other viral surges — the flu for example — and provide more general evaluations of local health. Big data is driving big changes in healthcare.
While hospitals mostly do the same things, the communities that they serve can be very different. Maybe Town X has significantly higher levels of obesity than the national averages would indicate. Data analysis allows Town X’s hospital to anticipate what sort of medical conditions these high obesity levels will produce, and plan accordingly.
Town Y, on the other hand, has a factory that produces high emissions. The data there has not only helped the hospital treat the various respiratory conditions that those emissions have produced, but it is also helping locals petition for increased sanctions on the factory.
Data is so important to modern healthcare that nurses can now specialize in it. Analytic nursing is still a relatively obscure facet of the industry. However, as data technology in healthcare continues to expand, data comprehension is a professional skill that will only grow in importance and demand.
The last twenty years of healthcare-related technological developments have given patients increased autonomy during their stays in the hospital. Being admitted can be a traumatic experience for many people, even setting aside the fact that it usually means their health is bad.
Once you’re admitted, you are at the mercy of the doctors and nurses working on your floor. You can’t eat without them bringing you food. You can’t even sleep uninterrupted without getting woken up every few hours for a test or a check-in.
It has to be this way of course. If you didn’t need these things, you probably wouldn’t need to be in the hospital at all. But it can be a demoralizing experience. Good thoughts breed good outcomes, so it is important to provide patients with as much independence as possible.
Smart hospital beds make this possible. For the patient, it means they can do things like adjust their position and comfort levels without calling in a nurse, a process that used to mean not only bothering the floor staff, but also sitting in bed for who knows how long, waiting for them to come by.
For the floor nurses, it means fewer interruptions for small tasks, so they can focus more on patient care.
An adjustable bed? That doesn’t sound terribly smart.
Well. Fair enough. The patient experiences the bed in terms of its adjustability. The increased level of autonomy and comfort that they are given. However, the adjustability features are the tip of the iceberg. The bed can also monitor patient activity and provide data on things like heart rate, or even sleep patterns — important metrics that can make a big difference in healthcare outcomes.
Remote Patient Monitoring
So while the patient is sleeping on their fancy bed, Drew is down the hall making his rounds. He’s a good nurse, but one of his most distinct limitations — often cited on his employee evaluations — is that he can’t be in two places at once.
So when he is checking in on Molly Gordon, a young woman who was recently diagnosed with type one diabetes, he has no idea that Malcolm Ingram’s heart rate just plummeted down the hall.
Until his remote monitoring technology sounds the alert. Drew, along with several other nurses respond to the call. Malcolm is saved thanks to remote monitoring technology.
Heart monitors and other wearable healthcare devices make it easier for doctors and nurses to see what their patients are doing from anywhere.
Better yet, you don’t have to be in a hospital to use wearables. Heart monitors, pacemakers, glucose monitors, and other common healthcare devices are now often infused with Bluetooth technology. Referred to as the Internet of Things (IoT), these devices communicate directly with doctors or the patients themselves to provide up-to-the-minute health updates that can be lifesaving.
Malcolm’s story could have played out just as easily at home. His heart rate plummets. His heart monitor sends out the alert. The company could theoretically call an ambulance. Or perhaps his emergency contact, a sister living down the street, gets the message and intervenes.
There are more ways than ever to provide high-quality healthcare evaluations and collect data remotely.
A Wearables Postscript
As a short epilogue for the wearables conversation, it’s worth remembering that you don’t need to have a pre-existing condition to get in on the wearables game. Fitness trackers are a common wearable that people all over the world use.
In fact, Malcolm’s FitBit was how he first learned that his heart rate tends to dip in his sleep. These devices provide people with important information about their sleeping habits, heart rate, and even their blood pressure. Data points they might otherwise have only gotten once a year.
Big Data is Crucial for the Nursing Profession
A growing number of nursing professionals are discovering the benefits of big data technology. The benefits of using data technology are well documented.
Any sports player or fan should be aware that data analysis has become a significant part of sports. As individuals and teams seek to gain an advantage over the opposition, analyzing data and using the results to shape your performance is one of the best ways to get ahead.
Whether you are an English Premier League soccer team, a Moneyball baseball team, a roller skater, a golfer, or a National Hockey League franchise, the collection and use of data in sports is now commonplace. According to Bloomberg, the market for sports research and data analysis is expected to be worth around $8.4 billion by 2026.
In this brief article, we explore the impact of data analytics on sports.
Data in Individual Sports
Data analytics are commonplace in individual sports such as roller skating and can be used in a variety of ways. For example, a skate fitting with a 3D skate lab scans your feet and suggests the most suitable skate style for you, such as the best roller skates for wide feet, sprinters, or roller hockey players.
In terms of performance, techniques such as tracking-learning-detection and video analysis give roller skate coaches the tools to convey suitable technical improvements to skaters.
In sports such as speed skating, knowing and learning the technical improvements required to maximize speed can be the difference between finishing in first or second place.
Helping You Win
Data is used in sports to gain an advantage over the opposition. Depending on the sport, data is used in different ways. For example, in soccer, management will use key performance indicators to review whether a prospective transfer of a player should take place. These indicators may include the player’s sprint speed or distance covered per goal.
Increasingly, soccer teams look at stats such as expected goals (xG) as a key metric for understanding underlying performance beyond the actual scoreline. Soccer stats have cut through so clearly that even post-match pundits now discuss xG as a tool to better describe the narrative of a game.
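To make the xG idea concrete, here is a minimal sketch in Python. The per-shot probabilities below are invented for illustration; real models estimate them from features such as shot distance, angle, and assist type.

```python
# Expected goals (xG): each shot is assigned a probability of scoring,
# estimated from historical shot data. A team's xG for a match is the
# sum of those per-shot probabilities.

def expected_goals(shot_probabilities):
    """Sum per-shot scoring probabilities to get a match xG."""
    return sum(shot_probabilities)

# Hypothetical match: five shots with model-estimated probabilities
# (e.g. 0.76 for a close-range one-on-one, 0.05 for a long-range effort).
shots = [0.76, 0.12, 0.07, 0.31, 0.05]
print(f"Match xG: {expected_goals(shots):.2f}")  # Match xG: 1.31
```

A team can lose 1-0 while posting a higher xG than its opponent, which is exactly why pundits reach for the metric when the scoreline hides the underlying performance.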
When Leicester City blew everyone away to win the English Premier League in 2015-16, it was credited largely to their extensive data analytics team, who managed to use data to craft a perfectly balanced squad that implemented the manager’s style of play.
Another famous pioneering example of using data analytics to improve performance was popularized by the film Moneyball. Billy Beane, General Manager of the Oakland Athletics baseball team, used sabermetrics to help recruit players despite a shoestring budget. The Athletics became famous in the 2002 season for a 20-game winning streak.
Driving Consumer Engagement
Data analytics don’t only impact the way individuals and teams play a sport, but also how the general public engages with the sport.
Most large-scale sports organizations have staff responsible for monitoring digital engagement and online content to boost engagement with a sporting individual or team. Using online tools such as social media, digital and communication experts use online analytics to best shape marketing campaigns.
For example, a sports team that wants to connect with millennials may use Instagram as the primary tool to roll out a marketing campaign. But if their focus was Gen Z, they may use other social media sites such as Snapchat to attract the younger audience.
Benefiting The Sport Ecosystem
The whole sporting ecosystem also gains a benefit from the increase in data analytics. For example, analytics are increasingly used to help manage crowds inside stadiums, manage congestion before and after matches, and even help sell more food and drink inside stadiums.
Many large-scale sports teams now partner with retailers, payment providers, and telecommunications firms that can provide real-time data on consumer behavior. Understanding crowd behavior in stadiums with large crowds can be particularly helpful as it can guide the messaging and communication strategy with the crowd.
Data and Equipment
Data is also widely used in individual sports such as golf and tennis to guide players’ equipment purchases.
For example, you may wish to purchase an expensive driver to boost your golf game. Many YouTubers and golf professionals use devices such as launch monitors to conduct product reviews on golf clubs, giving you quick insights into whether certain clubs match your playing style.
Launch monitors in golf are also great for assessing your performance. Real-time insights into your carry distance, club head speed, and spin revolutions can help you understand your golf game better.
Data in sports is a serious business. It can be used in many ways: understanding the strengths and weaknesses of the opposition, deepening the understanding of your own game, improving digital engagement for yourself or your team, helping you select the best equipment, and even helping with crowd management.
Arvind Ltd has been in the apparel industry for more than 90 years, with its retail powerhouse Arvind Fashions Ltd being the backbone of well-known names in the retail fashion industry in India.
Arvind Fashions Ltd (Arvind) has seen significant growth in its portfolio with new franchises being added every year. The six high conviction brands include Tommy Hilfiger, Calvin Klein, Sephora, Arrow, U.S. Polo Assn. & Flying Machine.
To secure a foundation for future growth, the company has embarked on a digital transformation (DX) journey, focusing on profitability and improving the customer experience. The key objectives for Arvind’s DX are to unlock the value of existing applications, gain new insights, and build a solid workflow with resilient systems.
Turning to Google Cloud to address its insights and analytics challenges was a natural next step, since Arvind had already built a relationship with Google Cloud, starting with its productivity and collaboration tools during the pandemic.
Arvind’s enterprise applications estate is a mix of SAP, Oracle POS, logistics management systems and other applications. Having so many different applications made it a challenge for the company to bring all of this data together to drive retail insights while keeping the data fresh.
As a case in point, the existing sales reporting and inventory reconciliation process had been enabled by a mix of automated and semi-automated desktop applications, and scaling the infrastructure to process large amounts of data at low latency was a challenge.
The synchronization of master data across functions was critical to build the data platform that provides consistent insights to multiple stakeholders across the organization.
Solution Approach – Modern Data Platform
There are several ways to solve the challenges above and do more by building a modern data analytics platform: for example, a data-lake-based approach built use case by use case, hybrid data estates, and so on. Regardless of the approach, it is important to define the solution based on certain principles.
In Arvind’s scenario, the key business principle is that the data platform should support Variety, Variability, Velocity and Volume. Each of these 4 V’s is a critical business pivot for successful fashion retailing: Variety in SKUs to deal with the myriad fashion trends every season, Variability in shopping footfall due to different festivities, weekends and special occasions, Velocity to be agile and responsive to customer needs, and Volumes of data that bring richer insights.
This is where a Google BigQuery-enabled data platform comes in, as it is able to meet all of the needs above.
Solution Architecture – Current Capabilities & Future Vision
BigQuery is the mothership of the data and analytics platform on Google Cloud. Its serverless construct ensures that data engineering teams focus only on insights and analytics. Storage and compute are decoupled and can be independently scaled. BigQuery has been leveraged to service both the raw and the curated data zones.
With BigQuery procedures, it is possible to process data natively within the data warehouse itself. Procedures have been leveraged to process the data at low latency using familiar SQL.
But what about advanced analytics and insights? With simplicity as our key guiding principle, BigQuery machine learning ensures that data analysts can create, train and deploy models even for complex requirements. Results can also be consumed in Looker Studio, which is seamlessly integrated with BigQuery.
Here are the key principles and highlights of the data platform that have been achieved:
Simple, yet exhaustive – We needed a solution with vast technical capabilities such as data lake and data warehouse, data processing, data consumption, and analytics, amongst others. At the same time, it needed to be simple to implement and operate.
Agility – High-quality analytics use cases typically require significant time, effort and skill. While keeping the solution simple, we made sure the selection of technology services guaranteed agility in the long term.
Security – An organization can be truly successful if the insights and analytics operations are democratized. But while data is made available to a wider community, we need to ensure data governance and security.
Ease of operations – Data engineering teams spend a lot of time on infrastructure setup and management. With BigQuery, teams can put more effort into building the data pipelines and models that feed analytics instead of worrying about infrastructure operations.
Costs – Decoupling storage and compute allows for flexible pricing. A pay-as-you-go model is the ideal solution to managing costs.
The ingestion frequency of the store-level inventory (~800 stores) has now been changed to daily. With the additional data volumes and processing, scaling on BigQuery has been seamless. There are new processes and dashboards to address reconciliation and root cause analysis. Operational efficiencies have improved, leading to better productivity and turnaround time for critical processes.
The discrepancies in various reconciliation activities have dropped roughly 300-fold thanks to the capabilities of the data platform. The platform not only identifies discrepancies but also helps identify their root causes.
Arvind Fashions Ltd has also been able to enhance some of its existing business processes and systems with insights from the data platform.
It’s going to be an exciting journey for Arvind Fashions Ltd and Google Cloud. There are several initiatives ready for kick off such as getting more apps on the edge devices, warehouse analytics, advanced customer data platforms, predicting the lifecycle of designs, style codes and other exciting initiatives.
However, there is a dark side to using big data technology: it also increases security risks. Spoofing, in particular, is more likely to be a problem for data-driven organizations.
The good news is that there are ways to mitigate these risks.
Anti-Spoofing Solutions and Guidelines for Data-Driven Organizations
Email is a vital communication tool for businesses, but it’s also a common target for hackers who use spoofing techniques to gain access to sensitive information. Hackers are more likely to engage in spoofing these days because companies store more data than ever, and that data is a goldmine in the wrong hands.
Spoofing is a type of cyberattack in which an attacker sends an email that appears to come from a legitimate source but is actually a fraudulent message designed to steal information or spread malware. It is often the first step toward a data breach.
To protect your business’s data from email spoofing, it’s important to understand the different types of spoofing and the solutions available to combat them. Here are six anti-spoofing solutions data-driven companies should consider implementing:
1. DNS spoofing protection
DNS spoofing is a type of attack in which an attacker redirects traffic from a legitimate domain to a fake destination. To protect your business from this type of attack, use DNS spoofing protection software that monitors your domain’s DNS records and alerts you if any unauthorized changes are made. MTA-STS is an email security standard that helps prevent DNS-based attacks on mail delivery and the data breaches that can follow.
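For a sense of how MTA-STS works mechanically: a domain publishes a plain-text policy file over HTTPS that receiving mail servers fetch and cache. Below is a minimal Python sketch of parsing such a policy; the domain and values are hypothetical, and no network fetch is performed.

```python
def parse_mta_sts(policy_text):
    """Parse an MTA-STS policy file ('key: value' lines; 'mx' may repeat)."""
    policy = {"mx": []}
    for line in policy_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "mx":
            policy["mx"].append(value)  # a policy may list several MX hosts
        elif key:
            policy[key] = value
    return policy

# Hypothetical policy, as it might be served at
# https://mta-sts.example.com/.well-known/mta-sts.txt
policy_text = """version: STSv1
mode: enforce
mx: mail.example.com
max_age: 86400"""

policy = parse_mta_sts(policy_text)
print(policy["mode"])  # enforce
```

In `enforce` mode, a sending server that has cached this policy will refuse to deliver mail to an MX host that is not listed or that fails TLS validation, which is what blunts DNS-level tampering with MX records.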
2. DMARC (Domain-based Message Authentication, Reporting and Conformance)
DMARC is an email authentication protocol that helps protect your business from the spoofed emails that are precursors to data breaches. It provides a mechanism for email receivers to check the authenticity of messages sent from your domain. DMARC lets you specify which email authentication mechanisms are used on your domain, and it also provides a way for you to receive reports on any suspicious activity.
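A DMARC policy is published as a DNS TXT record at `_dmarc.<domain>`, containing semicolon-separated `tag=value` pairs. The Python sketch below parses such a record into its tags; the record string is a hypothetical example, and no DNS lookup is performed.

```python
def parse_dmarc(record):
    """Parse a DMARC TXT record ('tag=value; tag=value; ...') into a dict."""
    tags = {}
    for part in record.split(";"):
        part = part.strip()
        if "=" in part:
            key, _, value = part.partition("=")
            tags[key.strip()] = value.strip()
    return tags

# Hypothetical policy: reject failing mail, send aggregate reports,
# apply the policy to 100% of messages.
record = "v=DMARC1; p=reject; rua=mailto:dmarc-reports@example.com; pct=100"
policy = parse_dmarc(record)
print(policy["p"])  # reject
```

The `p` tag is the part receivers act on: `none` only monitors, `quarantine` sends failing mail to spam, and `reject` refuses it outright.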
3. BIMI (Brand Indicators for Message Identification)
BIMI is a newer email standard that displays your company’s verified logo next to your messages in the recipient’s email client, making legitimate emails easier to identify. BIMI relies on DMARC to validate the email before the verified logo is shown.
4. Sender Policy Framework (SPF)
SPF is an email authentication protocol that allows you to specify which email servers are authorized to send emails from your domain. This helps to prevent email spoofing by verifying that the email is coming from a legitimate source. By publishing an SPF record in your domain’s DNS, you can specify which email servers are authorized to send email from your domain, and email receivers can use this information to verify the authenticity of the email.
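As a simplified illustration of the receiver’s side, the sketch below checks a sender IP against the `ip4:` mechanisms of a hypothetical SPF record. Real validators also resolve `include:`, `a`, and `mx` mechanisms via DNS and honor qualifier semantics such as `-all`; this pure-stdlib version only handles literal IPv4 ranges.

```python
import ipaddress

def spf_ip4_allows(record, sender_ip):
    """Check a sender IP against the ip4: mechanisms of an SPF TXT record.

    Simplified sketch: ignores include:, a, mx, and qualifier semantics.
    """
    for mechanism in record.split():
        if mechanism.startswith("ip4:"):
            network = ipaddress.ip_network(mechanism[len("ip4:"):])
            if ipaddress.ip_address(sender_ip) in network:
                return True
    return False

# Hypothetical record: only 203.0.113.0/24 may send for the domain.
record = "v=spf1 ip4:203.0.113.0/24 -all"
print(spf_ip4_allows(record, "203.0.113.17"))  # True
print(spf_ip4_allows(record, "198.51.100.5"))  # False
```

A receiver that gets a `False` here would then apply the record’s `all` qualifier, in this case `-all`, meaning hard-fail: the mail should be rejected.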
5. Certificates for Secure Email (S/MIME)
S/MIME is an email signing and encryption standard that protects messages from spoofing and other types of attacks. It uses public-key encryption: the sender encrypts the email, and the recipient uses their private key to decrypt it. This provides a secure way to send sensitive information and can help protect your business from email spoofing attacks.
6. Email security software
Email security software is designed to help protect your business from various types of email attacks, including spoofing. It can help detect and block malicious emails, as well as provide other security features such as encryption and spam filtering.
Data-Driven Companies Must Thwart Spoofing Attacks
Cyberattacks are more common than ever in the age of big data. A growing number of hackers use spoofing as part of their strategy to commit data breaches.
Implementing these anti-spoofing solutions will help protect your business from email spoofing attacks and keep your customers’ information secure. But these solutions are not a one-time setup; it’s crucial to monitor and update them continually to stay ahead of the evolving threat of email spoofing. Keep in mind that they should be used in conjunction with other security measures, such as firewalls and antivirus software, to provide the best possible protection for your business.
In conclusion, email spoofing is a serious threat to businesses, but with the right anti-spoofing solutions in place, you can protect your business from this type of attack. These are very important data security strategies to implement. Whether you’re using DMARC, BIMI, MTA-STS, SPF or email security software, it’s essential to stay vigilant and take the necessary steps to keep your business safe. Implementing these solutions will not only protect your business but also give your customers peace of mind that their personal information is safe with you.
Editor’s note: The post is part of a series highlighting our partners, and their solutions, that are Built with BigQuery.
Background and Context
A Customer Data Platform (CDP) provides a unified view of the customer by collecting data from various sources and making it accessible across different systems. CDPs are becoming increasingly vital as companies look to improve customer experience, engagement and personalization. Among the different CDP types, a composable CDP, commonly referred to as a deconstructed CDP, distinguishes itself by providing the flexibility and agility to customize or even interchange CDP components based on specific requirements such as security and governance. A key construct of a CDP is identity resolution, which follows a set of rules to decide how collected data is used to create a new entity or is merged into an existing one.
Lytics’ composable CDP approach
Lytics is a next-generation composable CDP that enables companies to deploy a scalable CDP around their existing data warehouses and lakes, without losing control over their customer data and without compromising security and privacy, helping them thrive in a cookieless future by tracking and identifying customer behavior. Lytics offers reverse-ETL capabilities and supports both private-instance (single-tenant hosted SaaS) and private-cloud deployments directly in your GCP project (VPC). Lytics’ data-driven approach enables enterprises to improve their data pipelines by simplifying the creation of a unified view, increase customer engagement by personalizing experiences, and drive more effective marketing ROI by delivering more value to customers.
Challenges in Data Sharing and Clean Rooms
More and more organizations have identified the benefits of securely sharing and organizing customer data as a means to improve business performance. Critical in this use case is the need to suppress PII during enrichment while fully enabling data-dependent teams, such as BI (Business Intelligence), data analysts, and data scientists, to stitch and enrich data from multiple sources without risking exposure of any sensitive data.
As data sharing and clean rooms become more prevalent, a set of new challenges needs to be addressed:
Data residing in silos across different sources needs a method for relating disparate datasets to each other via identifiers, enabling the creation of a single view of the customer and all related attributes. A traditional identity-matching solution, using a hashed identifier present in both the source data and the customer’s dataset, can help mitigate this technical challenge.
Latency in working with a data warehouse and exchange can directly impact the agility of business teams making real-time decisions. It is key to minimize latency in long-running workflows.
Ever-changing security, compliance, and regulation requirements can create major roadblocks and drastically slow down or prevent a business’s ability to engage with its customers.
Flexibility is a must. Partners, vendors, and everyone in between bring a wide variety of requirements to navigate, which rigid, traditional solutions cannot accommodate.
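The hashed-identifier matching mentioned in the first challenge can be sketched in a few lines of Python: each party hashes a normalized shared identifier (here an email address) with SHA-256 and joins on the digest, so the raw value is never exchanged. The names and records are invented for illustration; production clean rooms typically add salted or keyed hashing and stricter governance on top.

```python
import hashlib

def hash_id(identifier):
    """Normalize and hash an identifier so it can be joined without exposing PII."""
    return hashlib.sha256(identifier.strip().lower().encode()).hexdigest()

# Party A: CRM records keyed by (hashed) email.
crm = {hash_id("Ana@Example.com"): {"lifetime_value": 420}}

# Party B: engagement data keyed by the same email, hashed independently.
engagement = {hash_id("ana@example.com"): {"sessions": 12}}

# Join on the hashed identifier - the raw email never leaves either side.
matched = {h: {**crm[h], **engagement[h]} for h in crm.keys() & engagement.keys()}
print(matched)
```

Because both sides normalize before hashing, `Ana@Example.com` and `ana@example.com` produce the same digest and the two records stitch into one profile; joining on digests rather than raw identifiers is what lets the parties collaborate without exposing PII.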
Solution: Lytics on Google Cloud
Lytics chose to build on Google Cloud to leverage the Google Data Cloud, with BigQuery at its core, for the necessary infrastructure and scalability. This unlocks Google Cloud AI/ML capabilities for enhanced entity identification using a unique identifier to stitch a user profile, enrichment of first-party data with external sources, and ultimately highly targeted audience segments built from multiple behavioral factors and proprietary engagement-based behavioral scores. In addition to BigQuery, the Lytics solution integrates with a variety of Google Cloud products, including Cloud Pub/Sub and Google Kubernetes Engine (GKE). Lytics’ First-Party Modeled Audiences, powered by Google, is a one-of-a-kind solution that expands the reach of existing Google Ads audiences by accessing YouTube’s user data via a proprietary Lytics/Google API, creating Modeled Audiences using Google Tag Manager and Google Ads products.
Lytics and Google have been developing a scalable, repeatable offering for securely sharing data that utilizes Google Cloud (BigQuery) and the Lytics platform. BigQuery’s capabilities have enabled use cases from data aggregation and analysis to visualization and activation, all executed with extremely low latency. Furthermore, with the introduction of Analytics Hub, Lytics can now offer its own data clean room solution on Google Cloud for advertisers and media, built from the collaboration between Lytics, BigQuery, and Google Cloud Analytics Hub. The Lytics Data Clean Room offers a range of features for managing and processing data, including ingestion, consolidation, enrichment, stitching, and entity resolution. Its privacy-centric data enablement utilizes Google’s exchange-level permissions and the ability to unify data without exposing PII, allowing customers to leverage data across their organization while avoiding data duplication and limiting privacy risks.
Lytics Conductor’s strength is its ability to collect and unify first-party datasets into customer profiles, hosted in Google Cloud and integrated with BigQuery. That integration makes Conductor an ideal application for simplifying and unlocking data sharing by unifying and coalescing datasets, helping businesses build or expand existing BigQuery data warehouses.
Lytics Conductor fuels Analytics Hub to create master data exchanges for sharing within and between organizations, housing hundreds of listings focused on secure, controlled consumption of critical datasets by partners, parent companies and child brands.
Lytics Cloud Connect, Lytics’ reverse-ETL product, closes the activation loop by easily and securely associating this new data set with individual customer profiles. This enables further segmentation and activation across hundreds of out-of-the-box, real-time integrations with mission-critical marketing tools.
The key features of the solution are:
Gather and aggregate data from multiple data sources
Construct unified customer profiles using industry-leading identity resolution capabilities
Develop or augment BigQuery warehouse with event and profiles synchronization by coalescing data across systems
Build Data exchanges and Listings using Analytics Hub in a scalable, reliable and secure manner for both internal and external Partner consumption
Activate segments across critical marketing channels
Below is a block diagram of how Lytics, BigQuery, and Google Cloud Analytics Hub work together to power the solution:
Data sources (batch and stream) map to a pre-built schema managed by Conductor.
Conductor provides a vast ecosystem of pre-built connectors and APIs that unify disparate data sources into profiles, which are delivered to BigQuery. These profiles are the foundation of first-party data used to build the customer 360 view.
Analytics Hub helps in creating, publishing and searching listings, each a subset of profiles and their attributes consumable as a dataset. Analytics Hub establishes a natural data flow from producer to consumer with a low-code/no-code solution.
Cloud Connect consumes the listings via its inherent ability to use standard SQL to model data for direct activation across channel tools or enrichment of unified profiles in Conductor.
Conductor directly interfaces with Decision Engine to build and explore segments of unified profiles.
Cloud Connect and Decision Engine allow for direct activation of either SQL based data models or audience segments.
The joint partnership between Google Cloud and Lytics has yielded a highly scalable and secure solution with tighter control of mission-critical data for faster activation. BigQuery’s real-time streaming capability contributes greatly to the rapid activations. The solution is repeatable and reachable because it builds on Lytics functionality and Google Cloud infrastructure. In addition, it is future-proof in the wake of stringent privacy constraints, industry compliance standards and newer regulations.
How Lytics and Google Cloud are Better Together
The Lytics and Google Cloud story is one of collaboration and innovation on behalf of our current and future joint customers.
Built on Google Cloud, Lytics leverages the power of Google’s processing speed and capabilities with no technical lift from the customer. Lytics Conductor and Google Analytics Hub provide complementary capabilities that improve data management for our joint customers, focusing on getting maximum value out of data. In particular, Analytics Hub and BigQuery have helped decrease time to value in complex data-sharing scenarios, where partner collaboration is safe and secure and cross-brand activation can be done in hours rather than days or even weeks.
Lytics is a long-standing Google Cloud partner; the Lytics platform is built entirely on GCP. There are no retrofits to the solution described above, and it is uniquely positioned to seamlessly integrate GCP features with minimal to no technical lift for our joint customers. The solution described above is available today. Google Cloud gives Lytics a means to enable forward-thinking data strategies, future-proofing solutions with a privacy-forward approach to data management, unparalleled global scale, and access to the greatest technical team in the industry.
Click here to learn more about Lytics’ CDP capabilities or to request a demo.
The Built with BigQuery advantage for ISVs
Google is helping tech companies like Lytics build innovative Solutions on Google’s data cloud with simplified access to technology, helpful and dedicated engineering support, and joint go-to-market programs through the Built with BigQuery initiative, launched in April as part of the Google Data Cloud Summit. Participating companies can:
Get started fast with a Google-funded, pre-configured sandbox.
Accelerate product design and architecture through access to designated experts from the ISV Center of Excellence who can provide insight into key use cases, architectural patterns, and best practices.
Amplify success with joint marketing programs to drive awareness, generate demand, and increase adoption.
BigQuery gives ISVs the advantage of a powerful, highly scalable data warehouse that’s integrated with Google Cloud’s open, secure, sustainable platform. And with a huge partner ecosystem and support for multi-cloud, open source tools and APIs, Google provides technology companies the portability and extensibility they need to avoid data lock-in.