Looker Studio supports self-serve analytics for ad hoc data, and together with Looker, contributes to the more than 10 million users who access the Looker family of products each month. Today, we are introducing new ways for analysts to provide business users with options to explore data and self-serve business decisions, expanding ways all our users can analyze and explore data — leading to faster and more informed decisions.
Introducing personal report links
Business users often leverage shared dashboards from data analysts, which contain key company metrics and KPIs, as a starting point and want to explore beyond the curated analysis to arrive at more specific insights for their own data needs. The introduction of personal reports in Looker Studio enables this activity, delivering a private sandbox for exploration so users can self-serve their own questions and find insights faster – without modifying the original curated report.
Whether you share a report link in group chats or direct messages, an individual copy is created for each user that opens it so that everyone gets their own personal report.
Personal Looker Studio reports are designed to be ephemeral, meaning you don’t need to worry about creating unwanted content, but if you land on valuable insights that you want to keep, you can save and share these reports with new links, separate from the original report you built from.
You can learn more about how personal reports work and how to use them in our Help Center.
[Image: personal report links in Looker Studio]
Automated report updates
Your analysis and insights are only as good as the freshness of your reports. Looker Studio users can now enable their reports to auto-refresh data at a predefined cadence, so critical business decisions are based on current and updated information.
To learn more about how auto-refresh works, including details on how it works with cache, presentation mode, and existing data freshness settings, visit our Help Center.
[Image: auto-refresh feature in Looker Studio]
Faster filtering in reports
Quick filters enable powerful exploration to slice data and uncover hidden patterns and insights within the context of your report. Quick filters don’t affect other users’ views, so whether you are exploring in a shared or personal report, your unique view is only shared once you are ready. The filter bar also gives you a complete picture of whether applied filters originate from interactive cross-chart filtering or quick filters.
Learn more about how to add quick filters in reports in our Help Center.
[Image: quick filters and the filter bar in Looker Studio]
Pause updates
Configuring multiple filters and charts for exploration can quickly add to the query volume, even with a cache in place. We’ve heard from analysts that they want better control over running queries, so they can optimize query volume and, thus, query costs.
We have added the ability to pause updates, giving you the flexibility to fully configure chart elements like fields, filters, parameters, sorting, and calculated formulas before running any data updates. You can then simply resume updates to see the updated data. Pausing updates does not prevent any style changes, so you can continue to modify design elements and other detailed styles and formatting without running a single query. Learn more about this feature in our Help Center.
“The new pause report updates feature in Looker Studio has meaningfully improved the report creation experience. Asset producers can build and test reports without wasting database resources waiting for data to reload.” – Caroline Bollinger, BI Tooling Product, Wayfair
View underlying data
Data accuracy is one thing — being able to see its detail is another. As analysts configure charts to build reports and design information hierarchy, previewing the underlying data is important for understanding context and seeing what data is available and its structure so you can make the best decisions about what to include in your analysis. It’s also handy when troubleshooting or customizing your reports.
This feature allows analysts to preview all the data that appears in a chart, including the primary dimensions, breakdown dimensions, and metrics. Learn more about how to view underlying data in our Help Center.
[Image: data preview feature in Looker Studio]
With this collection of updates, Looker Studio users can now easily know the data they share is up-to-date, inspect it in detail, rapidly create filters, and share personal links to reports. The goal remains, as always, to empower users to make smart and impactful decisions based on their enterprise data. To stay on top of all our latest features, view our release notes. Access Looker Studio for free and learn more about Looker Studio Pro.
Geographical redundancy is one of the keys to designing a resilient data lake architecture in the cloud. Some of the use cases for customers to replicate data geographically are to provide for low-latency reads (where data is closer to end users), comply with regulatory requirements, colocate data with other services, and maintain data redundancy for mission-critical apps.
BigQuery already stores copies of your data in two different Google Cloud zones within a dataset region. In all regions, replication between zones uses synchronous dual writes. This ensures that in the event of either a soft (power failure, network partition) or hard (flood, earthquake, hurricane) zonal failure, no data loss is expected, and you will be back up and running almost immediately.
We are excited to take this a step further with the preview of cross-region dataset replication, which allows you to easily replicate any dataset, including ongoing changes, across cloud regions. In addition to ongoing replication use cases, you can use cross-region replication to migrate BigQuery datasets from one source region to another destination region.
How does it work?
BigQuery provides a primary and secondary configuration for replication across regions:
- Primary region: When you create a dataset, BigQuery designates the selected region as the location of the primary replica.
- Secondary region: When you add a dataset replica in a selected region, BigQuery designates this as a secondary replica. The secondary region could be a region of your choice. You can have more than one secondary replica.
The primary replica is writeable, and the secondary replica is read-only. Writes to the primary replica are asynchronously replicated to the secondary replica. Within each region, the data is stored redundantly in two zones. Network traffic never leaves the Google Cloud network.
Although replicas reside in different regions, they share the same name. This means that your queries do not need to change when referencing a replica in a different region.
The following diagram shows the replication that occurs when a dataset is replicated:
Replication in action
The following workflow shows how you can set up replication for your BigQuery datasets.
You can add a single replica to any dataset within each region or multi-region. After you add a replica, it takes time for the initial copy operation to complete. You can still run queries referencing the primary replica while the data is being replicated, with no reduction in query processing capacity.
-- Create the primary replica in the primary region.
CREATE SCHEMA my_dataset OPTIONS(location='us-west1');

-- Create a replica in the secondary region.
ALTER SCHEMA my_dataset
ADD REPLICA `us-east1`
OPTIONS(location='us-east1');
To confirm the status that the secondary replica has successfully been created, you can query the creation_complete column in the INFORMATION_SCHEMA.SCHEMATA_REPLICAS view.
-- Check the status of the replica in the secondary region.
SELECT creation_time, schema_name, replica_name, creation_complete
FROM `region-us-west1`.INFORMATION_SCHEMA.SCHEMATA_REPLICAS
WHERE schema_name = 'my_dataset';
Query the secondary replica
Once initial creation is complete, you can run read-only queries against a secondary replica. To do so, set the job location to the secondary region in query settings or the BigQuery API. If you do not specify a location, BigQuery automatically routes your queries to the location of the primary replica.
-- Query the data in the secondary region.
SELECT COUNT(*)
FROM my_dataset.my_table;
If you are using BigQuery’s capacity reservations, you will need to have a reservation in the location of the secondary replica. Otherwise, your queries will use BigQuery’s on-demand processing model.
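As an illustrative sketch only (the project, reservation, and assignment names, the edition, and the slot count are assumptions, not values from this scenario), creating a reservation and an assignment in the secondary region could look like this:

-- Create a reservation in the secondary region (illustrative names and sizing).
CREATE RESERVATION `admin-project.region-us-east1.secondary-reservation`
OPTIONS (
  edition = 'ENTERPRISE',
  slot_capacity = 100
);

-- Assign a project to that reservation for query jobs.
CREATE ASSIGNMENT `admin-project.region-us-east1.secondary-reservation.my-assignment`
OPTIONS (
  assignee = 'projects/my-project',
  job_type = 'QUERY'
);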
Promote the secondary replica to primary
To promote a replica to be the primary replica, use the ALTER SCHEMA SET OPTIONS DDL statement and set the primary_replica option. You must explicitly set the job location to the secondary region in query settings.
ALTER SCHEMA my_dataset SET OPTIONS(primary_replica = 'us-east1')
After a few seconds, the secondary replica becomes primary, and you can run both read and write operations in the new location. Similarly, the primary replica becomes secondary and only supports read operations.
Remove a dataset replica
To remove a replica and stop replicating the dataset, use the ALTER SCHEMA DROP REPLICA DDL statement. If you are using replication for migration from one region to another region, delete the replica after promoting the secondary to primary. This step is not required, but is useful if you don’t need a dataset replica beyond your migration needs.
ALTER SCHEMA my_dataset
DROP REPLICA IF EXISTS `us-west1`;
Getting started
We are super excited to make the preview for cross-region replication available for BigQuery, which will allow you to enhance your geo-redundancy and support region migration use cases. Looking ahead, we will include a console-based user interface for configuring and managing replicas. We will also offer a cross-region disaster recovery (DR) feature that extends cross-region replication to protect your workloads in the rare case of a total regional outage. You can also learn more about BigQuery and cross-region replication in the BigQuery cross-region dataset replication QuickStart.
As data pipelines become more complex and involve multiple team members, it can be challenging to keep track of changes, collaborate effectively, and deploy pipelines to different environments in a controlled manner.
We can help, with Google’s new pipeline edit feature that we introduced for Cloud Data Fusion (CDF) batch pipelines.
Better pipelines with versioning
A typical pipeline development process is iterative in nature. You make small unit changes to a pipeline, test them on small data, then on production data, then iteratively add features to the pipeline. Iterative pipeline design is also critical for a seamless Cloud Data Fusion user experience, as it reduces the overhead of developing and testing pipelines. An ETL developer can design a pipeline iteratively, adding improvements incrementally while maintaining a full history of changes.
You can edit pipelines starting in Cloud Data Fusion version 6.9. When you edit a pipeline you’ve already deployed, you don’t have to duplicate the pipeline and implement a versioning strategy across multiple pipelines. Instead, you edit a single pipeline and its versions are tracked for you. This enhances the user experience and productivity, and removes the need to maintain a correlation mapping between the various clones of a pipeline.
Benefits of pipeline editing
The pipeline edit feature lets you do the following:
- Incrementally make changes to any part of the deployed pipeline, such as the pipeline structure, configuration, metadata, preferences, and comments.
- Export an edited JSON file for a deployed pipeline.
How is it different from the CDF duplicate pipeline feature?
Duplicating a pipeline creates a new pipeline with a different name while editing a pipeline creates a new version of the same pipeline, which prevents proliferation of pipelines (as seen in figure below), allowing for better organization.
Editing also maintains a history of all edited versions of the pipeline. Similar to the Google Docs experience, you can view the older versions of the pipeline.
Before you begin
You need a Cloud Data Fusion instance with version 6.9.1 or above.
Upgrading to 6.9.1 or above will also unlock Source Control Management with GitHub. You can refer to the blog here.
NOTE: The pipeline edit feature is supported only for CDF batch pipelines.
How to use this feature?
When you edit a pipeline, Cloud Data Fusion creates a new draft; once deployed, it becomes the latest version of the pipeline (for upgraded instances, existing pipelines are upgraded to become the latest version). The latest version retains the triggers, pipeline configurations, runtime arguments, metadata, comments, and schedules from the previous version. The latest version is the active version of the pipeline, i.e., it can be run or scheduled to run.
To edit a deployed pipeline, follow these steps:
Go to the pipeline that you want to edit and click Edit. You can access this in the UI through both the Pipeline Studio page and the pipeline list page:
Edit through the pipeline studio page
Edit through the pipeline list page
A new draft of the pipeline is created. Edit your pipeline and make the necessary changes. Optional: To finish editing the pipeline later, click Save. Draft statuses are displayed to mitigate concurrency issues (discussed in more detail below).
Edit Draft opens for changes
“In-Progress” editing status for the edit draft that is yet to be deployed.
After you finish editing the pipeline, click Deploy. This opens the Enter Change Summary dialog box; enter a description of the changes you made to the pipeline and click Deploy. A best practice is to enter a descriptive change summary, as it identifies the edit version.
Note: You must make changes to your pipeline draft in order to deploy it; otherwise, an error message is displayed.
View version history
A History button has been introduced on the Pipeline Studio page. It displays a list of edit versions, through which you can access previous edit versions of the pipeline. The only actions that can be performed on an older edit version are view and restore. Older versions are identified by their creation date and change summary.
When you click view, you are redirected to the older pipeline version on the studio page. Keep in mind that this is an inactive version that cannot be run or scheduled for runs.
You can go back to the latest version through the return to latest version link.
When you click restore on an older version, that version is restored as the latest active version, so you can run it or schedule runs on it. Note: You cannot restore the latest version.
Export older edit version
When you want to view or manipulate an older version’s pipeline JSON, you can export it locally. The edited JSON can be imported back into the pipeline edit draft.
An orphaned edit draft
When a pipeline is deleted, all deployed versions of the pipeline are removed other than the ones that are open in draft status. The draft pipeline enters an orphaned status, since the associated pipeline is removed and the draft no longer belongs to an existing pipeline. Deploying the draft will deploy a brand new pipeline and resolve the orphaned status.
An obsolete edit draft
When a newer version of the pipeline that you are currently editing becomes available, your changes are out of date. This happens when another user deploys the pipeline before you finish editing. The draft then enters the out of date/obsolete status.
Deployment is blocked, and an error message prompts you to manually reconcile your changes.
To manually reconcile your pipeline, click Export and Rebase in the prompt. This exports your current JSON draft locally and rebases the studio to the latest version, resolving the out-of-date/obsolete status. Manually resolving the conflicts and importing the changes back into the draft is the recommended solution.
BigQuery stores massive amounts of unstructured and structured text data, including customer information and business operations. BigQuery’s Search and Machine Learning (ML) capabilities provide powerful analytical tools for extracting valuable business insights from these text data sets.
Text preprocessing is a critical step in text analysis and information retrieval pipelines, as it transforms raw natural language or unstructured text into machine-friendly formats. It is a prerequisite for many text-based operations, such as full-text search indexing and machine learning pipelines. In many cases, the effectiveness of a text search index is greatly influenced by the quality of the tokenization algorithm chosen. Similarly, the performance of machine learning models relies heavily on the quality of the preprocessed inputs.
Consider a hypothetical fraud investigation and prevention scenario. During the investigation, it would be beneficial to search business logs data to identify suspicious activities associated with the reported transactions. The process involves retrieving entries containing relevant customer information from a logs table that is generated by day-to-day business activities.
The following information may be of interest:
- customer ID
- access IP address
- email address
- last 4 digits of the credit card number
With the newly added powerful PATTERN_ANALYZER and its configurability, we will create a search index to help us search for the above specific information.
Constructing a search index configuration
A search index with the PATTERN_ANALYZER allows us to extract and index information that matches a specified RE2 regular expression. In this case, our regular expression will have the following form (note that “?:” denotes non-capturing group since we will capture the entire pattern as a whole):
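As a rough sketch (the sub-patterns below are simplified assumptions, and character classes such as [.] are used for literal dots so that no backslashes need to be escaped in the SQL string), the pattern could look like:

(?:[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})|(?:(?:[0-9]{1,3}[.]){3}[0-9]{1,3})|(?:[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,})|(?:XXXX-XXXX-XXXX-[0-9]{4})

The four alternatives match, in order, a customer UUID, an IP address, an email address, and an obfuscated credit card number ending in four digits.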
We then can make further steps on the captured tokens to improve the search index effectiveness and search usability:
- lowercase the text to enable case-insensitive searches
- remove some known email addresses, such as trusted system robot emails or testing emails
- remove some fixed/known IP addresses, like localhost
To do this, we can use the token filters option of the analyzer:
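A sketch of what those token filters might look like (the addresses listed as stop words are placeholders, not real values):

"token_filters": [
  { "normalizer": { "mode": "LOWER" } },
  { "stop_words": ["robot@example.com", "test@example.com"] },
  { "stop_words": ["127.0.0.1", "0.0.0.0"] }
]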
Experimenting with different configurations
Prior to creating our search index, which can be a time-consuming and expensive process, we use the newly added TEXT_ANALYZE function to experiment with various configurations to find the one that works as intended. Below we show some example queries to test the configurations.
Since we will use the same analyzer options in our example, we can declare a variable called CUSTOM_OPTIONS so that we can reuse it in multiple queries:
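A minimal sketch of such a declaration, combining the assumed pattern and token filters above (the option layout follows the PATTERN_ANALYZER’s patterns and token_filters options; all specific values are placeholders, and the declaration must run at the top of the same script as the queries that reference it):

DECLARE CUSTOM_OPTIONS STRING DEFAULT '''{
  "patterns": [
    "(?:[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})|(?:(?:[0-9]{1,3}[.]){3}[0-9]{1,3})|(?:[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,})|(?:XXXX-XXXX-XXXX-[0-9]{4})"
  ],
  "token_filters": [
    { "normalizer": { "mode": "LOWER" } },
    { "stop_words": ["robot@example.com", "test@example.com"] },
    { "stop_words": ["127.0.0.1", "0.0.0.0"] }
  ]
}''';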
Example 1: text containing both UUID and IP address
SELECT TEXT_ANALYZE(
  "the uuid is f8161363-6166-44ed-b99d-776a6ebd183c and the ip address is 198.95.180.234",
  analyzer => 'PATTERN_ANALYZER',
  analyzer_options => CUSTOM_OPTIONS
) AS tokens;
Result:
Note: in this example, we have two separate lists of stop words for illustration purposes. In practice, combining them into a single list would make the search execution faster.
Example 2: text containing multiple email addresses and an obfuscated credit card number
SELECT TEXT_ANALYZE(
  "email addresses are personA@Domain.com and personB@AnotherDomain.com and the credit card number is XXXX-XXXX-XXXX-7193",
  analyzer => 'PATTERN_ANALYZER',
  analyzer_options => CUSTOM_OPTIONS
) AS tokens;
Result:
We can continue iterating until we identify a good configuration for our use cases. Let’s say that in addition to the critical information for our investigation, we also want to add regular words to our index. This way, the searches for normal words can also make use of the search index. We can do this by appending the following regular expression to the end of our pattern:
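For example, a simple catch-all alternative for alphanumeric words, appended with another |, could be:

|(?:[a-zA-Z0-9]+)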
Since we are indexing normal words now, it is also useful to remove some common English words so they don’t pollute the index with unnecessary entries. We do this by specifying the stop_words list in the analyzer options:
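As a sketch, an extra stop-word filter added to the token_filters list might look like this (the word list is a placeholder, not an exhaustive set):

{ "stop_words": ["the", "a", "an", "and", "is", "are", "of", "to", "in"] }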
The latest configuration and verification using TEXT_ANALYZE function can be seen below (note that we need to escape the backslashes in our SQL statements):
SELECT TEXT_ANALYZE(
  "the uuid is f8161363-6166-44ed-b99d-776a6ebd183c and the ip address is 198.95.180.234",
  analyzer => 'PATTERN_ANALYZER',
  analyzer_options => CUSTOM_OPTIONS
) AS tokens;
Result:
SELECT TEXT_ANALYZE(
  "email addresses are personA@Domain.com and personB@AnotherDomain.com and the credit card number is XXXX-XXXX-XXXX-7193",
  analyzer => 'PATTERN_ANALYZER',
  analyzer_options => CUSTOM_OPTIONS
) AS tokens;
Result:
Creating the search index
Once we conclude our testing effort and find the suitable configuration, we can create the search index:
CREATE SEARCH INDEX MyIndex
ON MyTable
OPTIONS(
  analyzer = 'PATTERN_ANALYZER',
  analyzer_options = (CUSTOM_OPTIONS)
)
Using the search index
As a reminder of how the SEARCH function works: it applies the analyzer (with the specified configuration) to both the searched data and the input search query, and returns TRUE if the tokens from the search query are a subset of the tokens from the searched data.
While the SEARCH function works without indexes, it can take advantage of the search index for better performance if the function and the index share the same analyzer configurations. We set the analyzer and analyzer_options arguments in the SEARCH function below:
-- Search for all log entries containing a specific email address.
SELECT log_entry
FROM table
WHERE SEARCH(
  log_entry, "persona@domain.com",
  analyzer => "PATTERN_ANALYZER",
  analyzer_options => CUSTOM_OPTIONS
)
IP addresses, customer UUIDs, and other regular words (excluding the stop words) are also indexed, so searches for those can benefit from the search index in a similar manner.
In this example, we can see that the new PATTERN_ANALYZER is a powerful tool to help build an effective search index for our fraud investigation scenario. The text analyzers and their options are designed to be flexible to accommodate various use cases. With correct configurations for search indexes, query performance can be greatly improved as presented in this blog post.
Text analyzers with BigQuery ML
We also announce two new text preprocessing functions: ML.TF_IDF and ML.BAG_OF_WORDS. Similar to other preprocessing functions, these two new functions can be used in the TRANSFORM clause to create ML models with data preprocessing. The following example shows how to use these functions with text analyzers.
The ML use cases for text analyzers focus more on extracting all text tokens and performing Unicode normalization on them before vectorization. This is achieved via combining the above functions with the newly introduced TEXT_ANALYZE function.
While BigQuery offers the ability to use pre-trained ML models for text vectorization, the aforementioned statistics-based functions offer simplicity, interpretability, and lower computational requirements. Additionally, when dealing with new domains where extensive domain-specific data for fine-tuning is lacking, statistics-based methods often outperform pre-trained model-based methods.
In this example, we will explore building a machine learning model to classify news into 5 categories: tech, business, politics, sport, and entertainment. We will use the BBC news public dataset, which is hosted in BigQuery at bigquery-public-data.bbc_news, and carry out the following steps:
1. Tokenize news as raw strings and remove English stop words.
2. Train a classification model using the tokenized data and a custom vectorization function in BigQuery ML.
3. Make predictions on new test data to classify news into categories.
Raw text preprocessing with TEXT_ANALYZE
The first step in building the classifier is to tokenize the raw news text and preprocess the tokens. The default LOG_ANALYZER with its default list of delimiters is often good enough, so we can use them without any further configuration.
Additionally, our text data may contain Unicode text, so ICU normalization would be a useful preprocessing step.
{ "normalizer": { "mode": "ICU_NORMALIZE" } }
Next, we filter out some common English stop words from the tokens list:
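Putting the tokenization step together as a sketch (the public table name, column names, and stop-word list below are assumptions), the tokenized table used for training might be produced like this:

-- Illustrative analyzer options: default LOG_ANALYZER with ICU normalization
-- and a placeholder English stop-word list.
DECLARE CUSTOM_OPTIONS STRING DEFAULT '''{
  "token_filters": [
    { "normalizer": { "mode": "ICU_NORMALIZE" } },
    { "stop_words": ["the", "a", "an", "and", "is", "are", "of", "to", "in"] }
  ]
}''';

-- Tokenize the raw news text into an array of tokens per article.
-- The BBC news table and column names here are assumptions.
CREATE OR REPLACE TABLE `myproject`.`mydataset`.`tokenized` AS
SELECT
  TEXT_ANALYZE(body, analyzer_options => CUSTOM_OPTIONS) AS body,
  category
FROM `bigquery-public-data.bbc_news.fulltext`;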
Now we can use the tokenized data to train our classifier. In this example we use the random forest classifier together with the TRANSFORM clause using ML.BAG_OF_WORDS as the vectorization function:
CREATE OR REPLACE MODEL
  `myproject`.`mydataset`.`test_model`
  TRANSFORM(
    ML.BAG_OF_WORDS(body) OVER() AS body, label
  )
OPTIONS (
  MODEL_TYPE='RANDOM_FOREST_CLASSIFIER',
  NUM_PARALLEL_TREE = 50,
  TREE_METHOD = 'HIST',
  EARLY_STOP = FALSE,
  SUBSAMPLE = 0.85,
  INPUT_LABEL_COLS = ['label']
) AS
  SELECT
    body, category AS label
  FROM `myproject`.`mydataset`.`tokenized`;
After training, we can see that the created model has good performance.
Using the model for inference
Finally, we can test out the model using an excerpt from a sport article outside of the training data. Note that we will need to use the exact same TEXT_ANALYZE options for preprocessing.
WITH ExampleData AS (
  SELECT "Baseball, softball, flag football and lacrosse, sports with roots and a high level of popularity in the United States, have been proposed for the 2028 Games, alongside two sports with more international appeal, squash and cricket. The International Olympic Committee will vote on the proposals by the Los Angeles organizing committee later this month." AS body
)
SELECT
  predicted_label, predicted_label_probs, body
FROM
  ML.PREDICT(
    MODEL `myproject.mydataset.test_model`,
    (SELECT
       TEXT_ANALYZE(body, analyzer_options => CUSTOM_OPTIONS) AS body
     FROM ExampleData)
  );
Result:
This example illustrates that TEXT_ANALYZE and BigQuery ML vectorization functions are great tools for preprocessing data and creating machine learning models. This is a significant improvement compared to the multiple steps presented in our previous blog post on sentiment analysis with BigQuery ML.
Conclusion
The new features added to our text analysis toolset are valuable additions to the existing functionality. They provide users with more flexibility, power, and insight into their data. The ability to perform custom text analysis in a variety of ways makes our toolset even more comprehensive and user-friendly.
We invite you to try out the new features and see for yourself how they can enhance your text preprocessing and analysis workflow. We hope that these new features will help you to gain even deeper insights into your data. We are committed to continuing to develop and improve our text analysis tools to meet your needs.
Real-time stream processing can be a game-changer for organizations. Customers across industries have incredible stories of how it transformed patient monitoring in healthcare, fraud detection in financial services, personalized marketing in retail, and improved network reliability in telecommunications, among others.
However, stream processing can be quite complex. People who build stream processing pipelines need to consider multiple factors and the delicate balance between completeness, latency, and cost. Further, managing distributed infrastructure adds to the complexity of operating critical streaming pipelines at scale. Ensuring fault tolerance, guaranteeing processing accuracy, and finding the right balance between cost and latency when sizing and scaling the cluster can be challenging.
Dataflow Streaming Engine reduces this complexity and helps you operate streaming pipelines at scale by taking on some of the harder infrastructure problems. Responsive and efficient autoscaling is a critical part of Dataflow Streaming Engine and a big reason why customers choose Dataflow. We’re always working to improve autoscaling and have recently implemented a set of new capabilities for more intelligent autotuning of autoscaler behavior. Let’s take a closer look at some of them.
Asymmetric autoscaling
Dataflow has two pools of workers: user and backend. User workers execute user code, while backend workers handle data shuffling, deduplication, and checkpointing.
Previously, Dataflow scaled both pools proportionally to keep the ratio between their workers at a level that is optimal for an average job. However, no streaming job is average. For example, a shuffle-heavy job requires the backend to upscale, but not necessarily user workers.
Asymmetric autoscaling allows scaling the two worker pools independently to meet the requirements of a given streaming job without wasting resources. The following graph illustrates how a shuffle-heavy job benefits from asymmetric autoscaling:
- Fewer user workers are needed, saving costs
- Less frequent latency spikes, with more stable autoscaling decisions
Asymmetric autoscaling is GA and automatically enabled for Dataflow Streaming Engine jobs.
BigQuery autosharding
BigQuery is one of the common destinations for Dataflow streaming pipelines’ output. Beam SDK includes a built-in transform, known as BigQuery I/O connector, that can write data to BigQuery tables. Historically, Dataflow has relied on users to manually configure numFileShards or numStreamingKeys to parallelize BigQuery writes. This is painful because it is extremely challenging to get the configuration right and requires a lot of manual effort to maintain. Too low a number of shards limits the throughput. Too high adds overhead and risks exceeding BigQuery quotas and limits. With autosharding, Streaming Engine dynamically adjusts the number of shards for BQ writes so that the throughput keeps up with input rate.
Previously, autosharding was only available for BQ Streaming Inserts. With this launch, BigQuery Storage Write API also gets an autosharding option. It uses the streaming job’s throughput and backlog to determine the optimal number of shards per table, reducing resource waste. This feature is now GA.
In-flight job options update
Sometimes you may want to fine tune the autoscaling you get out of the box even further. For example, you may want to:
- Save costs during latency spikes by setting a smaller minimum/maximum number of workers
- Handle an anticipated input load spike with low latency by setting a larger minimum number of workers
- Keep latency low during traffic churn by defining a narrower range of minimum/maximum workers
Previously, Dataflow users could update auto-scaling parameters for long-running streaming jobs by doing a job update. However, this update operation requires stopping and restarting a job, which incurs minutes of downtime and doesn’t work well for pipelines with strict latency guarantees. In-flight Job Updates allow Streaming Engine users to adjust the min/max number of workers at runtime. This feature is now GA. Check out this blog to learn more about it.
Key-based upscaling cap
You may want to cap upscaling at some point to balance throughput with incremental cost. One way to do it is by using a CPU utilization cap, i.e., upscale until worker CPU utilization drops to a certain level. The theory is that user workers with low utilization should be able to handle more load. But this assumption is not true when the bottleneck isn’t user-worker CPU, but some other resource (e.g., I/O throughput). In some of these cases, upscaling can help increase throughput, even though the CPU utilization is relatively low.
Key-based upscaling cap is the new approach we introduce in the Dataflow Streaming Engine and replaces the previous cap derived from CPU-utilization. Event keys are the basic unit of parallelism in streaming. When there are more keys available to process in parallel, having more user workers can increase throughput, but only up to the limit of available keys. The key-based upscaling cap prevents upscaling beyond the number of available keys because extra workers sit idle and don’t help increase throughput. This approach offers two customer benefits:
- Upscaling achieves higher throughput for I/O-bound pipelines when there are more keys to be processed in parallel.
- It saves cost when adding more user workers doesn’t help improve throughput.
Key-based upscaling cap is GA and automatically enabled for Dataflow Streaming Engine jobs.
Downscale dampening
Customers need to achieve low latency for their streaming jobs at low cost. Traditionally, to save cost, the autoscaler tracked job state and upscaled when latency spiked or downscaled when CPU utilization dropped.
When input traffic is spiky, autoscaling based on static parameters often leads to a pattern we call ‘yo-yoing’. It happens when a too-aggressive downscale causes an immediate subsequent upscale. This leads to higher job latency and sometimes additional costs.
With downscale dampening, the autoscaler tracks the autoscaling history. When it detects yo-yoing, it applies downscale dampening to reduce latency at similar or lower job cost. This sample test job is an illustrative example. Downscale dampening is GA and automatically enabled for Dataflow Streaming Engine jobs.
Fast worker handover
When the autoscaler changes the number of workers, the new workers need to load the pipeline state. One way to do this is to run an initial scan from the persistence store. This initial scan operation can take minutes, which results in a drop in pipeline throughput and an increase in tail latency. Instead, Dataflow Streaming Engine jobs that have fast worker handover enabled transfer state in memory directly from “old” to “new” workers, avoiding the initial persistence store scans. This reduces the handover time from minutes to seconds. As a result, you can see significant improvements in both throughput and latency when autoscaling events happen. Fast worker handover is already rolling out transparently to Dataflow Streaming Engine users, so once you enable Streaming Engine, you don’t need to enable this feature separately.
Thank you for reading this far! We are thrilled to bring these capabilities to our Streaming Engine users and see different ways you use Dataflow to transform your businesses. Get started with Dataflow right from your console. Stay tuned for future updates and learn more by contacting the Google Cloud Sales team.
In today’s data-driven world, organizations rely on efficient and scalable data integration solutions to unlock valuable insights and drive informed decision-making. As technology evolves, so do the demands of managing data. Migrating from traditional data integration platforms to cloud-based solutions has become a strategic imperative, and Google Cloud provides several ways to approach the migration process. In this blog, we discuss the various strategies of migrating from on-prem IBM Datastage to Google Cloud and the advantages of the corresponding approaches.
IBM DataStage is a data integration tool that helps you design, build, and run jobs to move and transform data. It supports both ETL (extract, transform, load) and ELT (extract, load, transform) patterns. IBM DataStage supports a variety of data sources to a variety of targets, including the Netezza database.
IBM DataStage is a powerful data integration platform that can help businesses manage their data more effectively. However, there are some disadvantages to running it on-premises. It can be expensive to purchase and maintain: businesses need to invest in hardware, software, and licensing costs, as well as the cost of hiring and training staff to manage the system. It is also difficult to scale: as businesses grow, they need to purchase additional hardware and software to accommodate more data and users. And it is complex to set up and manage: businesses need a team of skilled IT professionals to keep the system running smoothly.
By adopting Google Cloud to run IBM DataStage transformations, businesses gain the agility and customization necessary to navigate the complex data landscape efficiently while maintaining compliance, keeping costs under control, and improving performance, ultimately positioning them to thrive in a dynamic environment.
Migration Strategy: Google Cloud equivalent architecture
In order to migrate to Google Cloud, the process involves:
- Data extraction: Replace DataStage’s data extraction functionality with Cloud Storage for batch data ingestion and Pub/Sub for real-time data ingestion.
- Data transformation: Transform data from Cloud Storage to BigQuery using BigQuery stored procedures, Dataflow, or serverless Dataproc, any of which can serve as alternatives for transforming and processing data.
- Data loading: BigQuery becomes the ideal counterpart for DataStage’s data loading capabilities, offering a serverless, highly scalable data warehouse solution.
- Workflow orchestration: Cloud Composer can be used to orchestrate data workflows, ensuring smooth execution.
1. Data extraction
IBM DataStage is a popular data integration platform that provides a wide range of features for data extraction, transformation, and loading (ETL). However, it can be expensive and complex to implement and maintain. Google Cloud offers a number of services that can be used to replace IBM DataStage’s data extraction functionality, including Cloud Storage for batch data ingestion and Pub/Sub for real-time data ingestion.
Batch data ingestion with Cloud Storage

Cloud Storage is a highly scalable and durable object storage service that can be used to store and manage large amounts of data. To ingest batch data into Cloud Storage, you can use a variety of methods, including:
- Transfer jobs: You can use the Google Cloud console or the gsutil command-line tool to create transfer jobs that will automatically transfer data from your on-premises systems to Cloud Storage.
- Dataproc: You can use Dataproc, a managed Hadoop and Spark service, to process and ingest batch data into Cloud Storage.
- Cloud Data Fusion: You can use Data Fusion, a fully-managed data integration service, to build and manage batch data pipelines that ingest data into Cloud Storage.

Real-time data ingestion with Pub/Sub

Pub/Sub is a fully-managed real-time messaging service that can be used to ingest data from a variety of sources, including sensors, devices, and applications. Once data is ingested into Pub/Sub, it can be processed and consumed by a variety of downstream applications, including data warehouses, data lakes, and streaming analytics platforms.
To publish real-time data into Pub/Sub, you can use a variety of methods, including:
- Pub/Sub client libraries: Google Cloud provides client libraries for a variety of programming languages, including Java, Python, and Go. You can use these client libraries to create publishers that will publish data to Pub/Sub.
- Cloud IoT Core: IoT Core is a fully-managed service that connects, manages, and secures millions of IoT devices. You can use IoT Core to publish data from IoT devices to Pub/Sub.
- Dataflow: Dataflow is a fully-managed streaming analytics service that can be used to process and ingest real-time data into Pub/Sub.
2. Data transformation
Google Cloud offers a variety of services that can be used to transform data from Netezza SQL to BigQuery SQL. Three of the most popular options are BigQuery stored procedures, Dataflow, and serverless Dataproc.
BigQuery stored procedures are user-defined routines that can be stored in BigQuery and executed on demand. They can be used to perform a variety of data transformation tasks, such as filtering, aggregating, and joining data. One of the benefits of using BigQuery stored procedures for data transformation is that they are very efficient. They are executed directly in the BigQuery engine, so there is no need to move data around. Additionally, BigQuery stored procedures can be parallelized, so they can handle large datasets quickly.
You can use a BigQuery stored procedure to filter data from a Cloud Storage file and load it into a BigQuery table. You can also use a BigQuery stored procedure to join two Cloud Storage files and load the joined data into a BigQuery table.
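As a rough sketch (the dataset, table, bucket, and column names are all assumptions, and the CSV schema is auto-detected), a procedure that filters a Cloud Storage file into a BigQuery table might look like this:

-- Illustrative stored procedure: read a CSV file in Cloud Storage through an
-- external table, filter the rows, and write the result to a native table.
CREATE OR REPLACE PROCEDURE mydataset.load_filtered_orders()
BEGIN
  -- External table over the Cloud Storage files (schema auto-detected).
  CREATE OR REPLACE EXTERNAL TABLE mydataset.ext_orders
  OPTIONS (
    format = 'CSV',
    uris = ['gs://my-bucket/orders/*.csv'],
    skip_leading_rows = 1
  );

  -- Keep only rows with a positive order amount.
  CREATE OR REPLACE TABLE mydataset.orders AS
  SELECT *
  FROM mydataset.ext_orders
  WHERE SAFE_CAST(order_amount AS NUMERIC) > 0;
END;

CALL mydataset.load_filtered_orders();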
Dataflow is a fully-managed streaming analytics service that can be used to process and transform data in real time or in batch mode. Dataflow pipelines can be used to transform data from Cloud Storage to BigQuery, and they can also be used to perform other data processing tasks, such as machine learning and data filtering. One of the benefits of using Dataflow for data transformation is that it is very scalable. Dataflow pipelines can be scaled up or down to meet the needs of your workload. Additionally, Dataflow pipelines are very reliable, and they can handle large datasets with ease.
You can use a Dataflow pipeline to read data from a Cloud Storage file, transform it using Apache Beam, and load it into a BigQuery table. For example, you can use a Dataflow pipeline to split a CSV file into individual rows and load the rows into a BigQuery table.
Serverless Dataproc is a fully-managed service that allows you to run Apache Spark workloads without having to provision and manage your own cluster. Serverless Dataproc can be used to transform data from Cloud Storage to BigQuery, and it can also be used to perform other data processing tasks. One of the benefits of using serverless Dataproc for data transformation is that it is very easy to use. You simply need to submit your Spark job to the serverless Dataproc service, and the service will take care of the rest. Additionally, serverless Dataproc is very scalable, so you can easily scale your data transformation pipelines to meet the needs of your workload.
You can use a serverless Dataproc Spark job to read data from a Cloud Storage file, transform it using Apache Spark, and load it into a BigQuery table. For example, you can use a serverless Dataproc Spark job to convert a JSON file to a Parquet file and load the Parquet file into a BigQuery table.
Which option should you choose?
The best option for transforming data from Cloud Storage to BigQuery will depend on your specific needs and requirements. If you need a highly efficient and scalable solution, then Google BigQuery stored procedures or Dataflow would be a good choice. If you need a solution that is easy to use and manage, then serverless Dataproc would be a good choice.
Here is a table that summarizes the key differences between the three options:
3. Data loading
BigQuery is a serverless, highly scalable data warehouse solution that can be used as a counterpart for DataStage’s data loading capabilities. BigQuery offers a superior data loading solution compared to IBM DataStage due to its scalability, speed, real-time data loading, seamless integration with the Google Cloud ecosystem, cost efficiency, user-friendliness, robust security, and management. BigQuery’s serverless infrastructure and pay-as-you-go pricing model reduces operational overhead and capital costs.
BigQuery is a fully-managed data warehouse that offers a number of benefits over traditional ETL tools, including:
- Serverless architecture: BigQuery is a serverless data warehouse, which means that you don’t need to provision or manage any infrastructure. BigQuery will automatically scale your resources up or down to meet the needs of your workload.
- High scalability: BigQuery is highly scalable, so you can easily handle large datasets. BigQuery can handle datasets of any size, from petabytes to exabytes.
- Real-time data loading: BigQuery supports real-time data loading, allowing organizations to ingest and analyze data as it arrives. This is crucial for applications requiring up-to-the-minute insights and responsive decision-making.
- Security and compliance: Google Cloud provides robust security features and compliance certifications, ensuring data loading and storage meet stringent security and regulatory requirements.
- Managed service: Google takes care of managing, monitoring, and maintaining the infrastructure, ensuring high availability and reliability. This allows organizations to focus on their core data loading and analytical tasks.
BigQuery can be used as a counterpart for DataStage’s data loading capabilities in a number of ways. For example, you can use BigQuery to:
- Load data from a variety of sources: BigQuery can load data from a variety of sources, including Cloud Storage, Cloud SQL, and Dataproc. This means that you can use BigQuery to load data from any system that you are currently using.
- Transform data during loading: BigQuery can transform data during loading using SQL. This means that you can clean, filter, and aggregate data as you load it into BigQuery (see the sketch after this list).
- Orchestrate data workloads: BigQuery can be used to schedule data loads. This means that you can automate the process of loading data into BigQuery using various orchestration tools based on your requirements.
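As a sketch of the transform-during-loading pattern (the table and column names are assumptions), a SQL statement that cleans and aggregates staged data while loading it into a reporting table might look like this:

-- Clean, filter, and aggregate staged data as it is loaded into a reporting table.
INSERT INTO mydataset.daily_sales (sale_date, region, total_amount)
SELECT
  DATE(sale_timestamp) AS sale_date,
  UPPER(TRIM(region)) AS region,
  SUM(amount) AS total_amount
FROM mydataset.staging_sales
WHERE amount IS NOT NULL
GROUP BY sale_date, region;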
Here are a few examples of how you can use BigQuery to load data from DataStage:
- Use the BigQuery Data Transfer Service: The BigQuery Data Transfer Service is a fully-managed service that allows you to automate the process of loading data from a variety of sources into BigQuery. You can use it to load data directly from DataStage into BigQuery, or from DataStage to Cloud Storage to BigQuery.
- Use the BigQuery API: The BigQuery API is a RESTful API that allows you to interact with BigQuery programmatically. You can use it to load data from DataStage into BigQuery.
- Use a third-party tool: A number of third-party tools can be used to load data from DataStage into BigQuery. For example, you can use the Fivetran connector for DataStage to load data from DataStage into BigQuery.
4. Workflow orchestration
The migration and ongoing synchronization of data involves hundreds of ETL processes. Many of these processes consist of (a) multiple steps that may be prone to failure and/or (b) dependencies that may require upstream jobs to complete successfully before a downstream job can run. As a result, if your organization has specific requirements, such as the ability to restart processes that have failed or need to be re-run because an upstream process ran again, then Google Cloud proposes Cloud Composer. Composer is Google’s fully managed workflow orchestration service built on Apache Airflow. ETL workflows are represented in Cloud Composer as Directed Acyclic Graphs (DAGs) that can be scheduled and manually triggered as needed.
While IBM Datastage does encompass workflow orchestration capabilities, Cloud Composer’s core advantage lies in its ability to provide an agile and scalable solution that leverages the strengths of Apache Airflow, making it an attractive choice for organizations seeking seamless, adaptable, and reliable data pipeline management.
Here are some specific examples of how Cloud Composer/Airflow can be used to orchestrate data workflows more effectively:
- Parallel execution: Cloud Composer/Airflow can execute tasks in parallel, which can significantly optimize environment performance and cost.
- Dependency management: Cloud Composer/Airflow can automatically manage dependencies between tasks, such as nested tasks in DAGs or a parent-child DAG relationship, which helps ensure that your data pipelines are executed in the right sequence reliably.
- Error handling: Cloud Composer/Airflow provides a number of features for handling errors during workflow execution, such as retrying failed tasks and notifying you of errors.
- Monitoring: Cloud Composer provides a built-in monitoring dashboard that gives you visibility into the status of your workflows and tasks.
Additional benefits of using Cloud Composer/Airflow
In addition to the benefits listed above, Cloud Composer/Airflow also offers a number of other benefits, including:
- Cloud-native: Cloud Composer/Airflow is a cloud-native solution, designed to be deployed and managed in the cloud. This can offer a number of advantages, such as improved scalability and reliability.
- Extensibility: Cloud Composer/Airflow is extensible, which means that you can add custom features and functionality to meet the specific needs of your business.
- No vendor lock-in: Cloud Composer/Airflow is based on Apache Airflow, which is an open-source project. This means that you are not locked into a proprietary vendor, and you can easily switch to a different solution if needed.
Benefits of migrating IBM Datastage to Google Cloud
Migrating ETL workloads from IBM DataStage to Google Cloud gives you the ability to tailor and scale resources as needed for each job. Instead of deploying a one-size-fits-all architecture targeting an “average” workload, Google Cloud enables a bespoke model where the resources for each workload can be customized based on your requirements, which helps you cost-efficiently meet SLA targets. This customization results in performance and cost benefits, including:
- Scalability: Google Cloud infrastructure provides on-demand scalability, allowing you to scale resources as your data processing needs evolve.
- Cost efficiency: With a pay-as-you-go model, Google Cloud helps control costs by optimizing resource allocation based on actual usage.
- Integrated ecosystem: Google Cloud offers a comprehensive ecosystem of tools that seamlessly work together, simplifying the design, deployment, and management of data workflows.
- Real-time insights: Google Cloud’s real-time data processing capabilities enable quick and informed decision-making.
Conclusion
Google Cloud provides a multifaceted approach to migration, with customization as its cornerstone. This tailoring of resources to match the unique requirements of each ETL job not only enhances performance but also ensures cost efficiency, which is further bolstered by the scalability and pay-as-you-go model. The integrated ecosystem of Google Cloud simplifies data workflow management and, combined with real-time data processing capabilities, empowers organizations to make swift, data-informed decisions.
Next Steps
If you want to start small, you can start exploring BigQuery and take advantage of its free-usage tier and no cost sandbox to get started. If you are considering migrating IBM DataStage to Google Cloud, please refer to the official documentation with more details on the prerequisites and foundational setup required in Google Cloud prior to getting started.
Looker Studio enables millions of users to bring their data to life with insightful dashboards and visualizations, connecting to more than 1,000 data sources and a host of community-sourced report templates. Looker Studio Pro expands on this self-service business intelligence platform with enterprise capabilities, including team content management and Google Cloud support. Today, we are bringing Looker Studio Pro to your mobile devices through a new application available for Android on Google Play and for iOS on the App Store, enabling you to view reports and get real-time data about your business from anywhere.
Looker Studio Pro mobile app features
Dynamic report layout: Visualize your data your way
There’s no need to build new mobile-specific layouts for your existing or new reports. If you choose the mobile-friendly view, reports are rendered to fit your mobile screen. This means that you can access all of the same information and functionality in a format that is optimized for viewing on a mobile screen.
The mobile-friendly view of your reports enables:
- Improved usability: A mobile-friendly view makes it easier to navigate and interact with your reports, even on a small screen.
- Enhanced readability: Mobile-friendly reports are designed to be easy to read on small screens, with larger fonts and more white space.
In the app, you can choose to view your reports in:
- Original view – optimized for desktop
- Mobile-friendly view – optimized for your mobile screen
You can easily switch between the two views in the app from ‘options’.
Access all your reports hassle-free
Given the large number of reports to sift through, we sought to simplify and expedite the search for data by integrating Looker Studio Pro’s categories into the app.
In the app, your reports are categorized:
- My workspace – access all reports you created here.
- Team workspaces – access all reports for teams you are part of.
- Recents – a handy option to quickly find and access the reports you’ve looked at recently.
- Shared with me – view and collaborate on reports shared with you.
Further, you can sort the reports by ‘Last opened by me,’ ‘Last modified by me,’ and ‘Last modified and Created’ to find the report you are looking for easily.
Share your reports with a simple tap
We know that collaboration and sharing insights with your team is important. Collaboration on mobile is made easy as you can now share reports with your team on the app of your choice with a single click. A link to the report will be generated that others can access easily on any device.
Moreover, you can access all the reports shared with you in ‘Team Workspaces’ and ‘Shared with me’ folders.
Seamlessly access interactive reports from your scheduled email or chat
Looker Studio Pro Mobile makes it easy to access your reports from your scheduled emails and chats. When you receive a scheduled report in your email/chat, tap the link to view and interact with your data live in the app. No more static PDFs!
How to get the Looker Studio Pro App
Getting your hands on the app is easy. Simply download the app from: Play Store or App Store and sign in with your corporate credentials.
Note: The mobile app is only available to Looker Studio Pro customers. Learn more about the Looker Studio Pro subscription here.
AI is already having a profound impact on how organizations operate. The power of AI allows you to reimagine what you do, how you do it and who you do it for. For many companies it feels like they’re just one step away from using AI to start solving real business problems — they just need to activate their data.
Google Cloud has a robust portfolio of platforms and tools for storing, transforming, and gaining insights from your data, and that can be activated for AI. In this blog, we will summarize the key innovations to our Data and AI Cloud in 2023 across three strategic areas.
- Interconnecting all your data – structured and unstructured, in any format, across all locations.
- Bringing AI to your data – securely and quickly build AI models with all your data.
- Boosting productivity – helping all data teams to analyze data, generate code and optimize data workloads.
Register for the upcoming webcast on Nov 13th to get a snapshot of our plans and investments in BigQuery, Streaming Analytics, Data Lakes, Data Integration, and GenAI.
Interconnecting all your data
Data is spread over tens and sometimes hundreds of data silos. Your data workloads are increasing, with new formats, mostly unstructured, across clouds and on-premises systems. There are just too many tools to learn and move between. With all these challenges, AI projects eventually become data projects in disguise.
Google’s Data and AI Cloud lets you interconnect your data at multiple levels.
Interconnect structured and unstructured data – To unlock 360-degree insights into your business, you need to combine and analyze unstructured data, such as images, voice, and documents, with your structured data.
We launched the general availability of BigLake Object Tables to help data users easily access, traverse, process and query unstructured data using SQL. We also launched support for the Hudi and Delta file formats in BigLake, now generally available. Taking BigLake one step further, we launched the preview of fully managed Iceberg tables in BigLake. With them, you can use high-throughput streaming ingestion for your data in Cloud Storage, get a fully managed experience with automatic storage optimizations for your lakehouse, and perform DML transactions for consistent modifications and improved data security, all while retaining full compatibility with the Iceberg reader.
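To make this concrete, here is a minimal sketch of defining and querying a BigLake object table over unstructured files with SQL. The project, dataset, table, connection, and bucket names are hypothetical, and the metadata columns shown follow the object table schema as we understand it; details may differ in your environment.

```python
# Sketch: create a BigLake object table over images in Cloud Storage and
# query its metadata with plain SQL. All resource names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # assumed project ID

# One-time setup: an object table that points at files in Cloud Storage.
client.query("""
    CREATE EXTERNAL TABLE IF NOT EXISTS `my-project.demo.product_images`
    WITH CONNECTION `us.biglake-conn`
    OPTIONS (object_metadata = 'SIMPLE',
             uris = ['gs://my-bucket/images/*'])
""").result()

# Query object metadata just like a structured table.
rows = client.query("""
    SELECT uri, content_type, size
    FROM `my-project.demo.product_images`
    WHERE content_type = 'image/png'
""").result()

for row in rows:
    print(row.uri, row.size)
```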
BigLake has seen hyper-growth, with a 27x increase in BigLake usage since the beginning of the year.
Interconnect data across clouds – Many customers manage and analyze their data on Google Cloud, AWS or Azure with BigQuery Omni, which provides a single pane of glass across clouds. Taking BigQuery Omni one step further, we added support for cross-cloud materialized views, and cross-cloud joins. We also extended analytics to on-prem data by bringing Dataproc Spark to Google Distributed Cloud. This allows you to run Spark on sensitive data in your data centers to support compliance or data sovereignty requirements and connect it with your BigQuery data in Google Cloud.
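As a rough illustration, a cross-cloud join looks like any other SQL join once the Omni dataset is set up; the query below is a sketch with hypothetical project, dataset, and table names, assuming one dataset lives in an AWS region via BigQuery Omni and cross-cloud joins are enabled.

```python
# Sketch: join a BigQuery Omni table (data stored in AWS) with a native
# BigQuery table. Names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

sql = """
    SELECT o.order_id, o.amount, c.segment
    FROM `my-project.aws_dataset.orders` AS o      -- BigQuery Omni (AWS)
    JOIN `my-project.gcp_dataset.customers` AS c   -- native BigQuery
    ON o.customer_id = c.customer_id
"""
for row in client.query(sql).result():
    print(row.order_id, row.segment)
```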
Interconnect data management and governance – We added intelligent data profiling and data quality capabilities to help you understand the completeness, accuracy and validity of your data. We also launched extended data management and governance capabilities in Dataplex. You get a single pane of glass for all your data and AI assets — including Vertex AI models and datasets, operational databases, and analytical data on Google Cloud and Omni.
Data sharing – In a given week, thousands of organizations share hundreds of petabytes of data across organizational boundaries using BigQuery. To further support interconnection of data, we launched BigQuery data clean rooms to share and match datasets across companies and collaborate on analysis with trusted partners, all while respecting user privacy.
Cost optimization – Interconnecting all your data shouldn’t be expensive and unpredictable. So, we introduced BigQuery pricing editions along with innovations for slots autoscaling and a new compressed storage billing model. BigQuery editions provide more choice and flexibility for you to select the right feature set for various workload requirements. You can mix and match among Standard, Enterprise, and Enterprise Plus editions to achieve the preferred price-performance by workload. BigQuery editions include the ability for single- or multi-year commitments at lower prices for predictable workloads, and new autoscaling capabilities that support unpredictable workloads by providing the option to pay only for the compute capacity you use.
Bringing AI to your data
AI provides numerous opportunities to activate your data. So, we made AI easily accessible to all your data teams and also made it easy to use your data to train AI models.
Customers already run hundreds of millions of predictions and training runs in BigQuery. In just the last six months, ML operations in BigQuery have grown more than 250% compared to last year.
Here are a few of the ways we’ve enhanced BigQuery to support AI.
Access to foundation models – We enabled users to access Vertex AI’s foundation models directly from BigQuery. With just a single statement you can connect a BigQuery table to a large language model (LLM) and tune prompts with your BigQuery data. This allows you to use generative AI capabilities such as text analysis on your data or generate new attributes to enrich your data model. With a few clicks, you can use the Vertex Doc AI workbench to deploy a personalized LLM extractor, which can then be directly accessed from BigQuery to extract specific knowledge from your text data.
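The sketch below shows what this pattern can look like: a remote model pointing at a Vertex AI foundation model, then text generation over a BigQuery table with ML.GENERATE_TEXT. The connection, dataset, table, and column names are hypothetical, and the remote model options follow the BigQuery ML remote-model pattern; the exact option values may differ in your environment.

```python
# Sketch: connect BigQuery to a Vertex AI foundation model and run it over
# rows in a table. Resource names and model options are assumptions.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Create a remote model backed by a Vertex AI text model (one-time setup).
client.query("""
    CREATE OR REPLACE MODEL `my-project.demo.text_llm`
    REMOTE WITH CONNECTION `us.vertex-conn`
    OPTIONS (REMOTE_SERVICE_TYPE = 'CLOUD_AI_LARGE_LANGUAGE_MODEL_V1')
""").result()

# Use the model for text analysis on data already in BigQuery.
rows = client.query("""
    SELECT ml_generate_text_result
    FROM ML.GENERATE_TEXT(
        MODEL `my-project.demo.text_llm`,
        (SELECT CONCAT('Summarize this review: ', review_text) AS prompt
         FROM `my-project.demo.product_reviews`),
        STRUCT(0.2 AS temperature, 256 AS max_output_tokens))
""").result()
```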
Expand the range of AI models – We also launched BigQuery ML inference engine, which allows you to access an ecosystem of pretrained models and open ML frameworks. Run predictions on Google vision, natural language and translation models in BigQuery, import models in additional formats like TensorFlow Lite, ONNX and XGBoost, and use models hosted in Vertex AI directly.
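For imported models, the flow is similar; here is a minimal sketch of loading a pretrained ONNX model from Cloud Storage into BigQuery ML and scoring rows in place. The bucket, dataset, table, and model names are hypothetical.

```python
# Sketch: import an ONNX model into BigQuery ML and run batch predictions.
# All resource names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Import the pretrained ONNX model from Cloud Storage.
client.query("""
    CREATE OR REPLACE MODEL `my-project.demo.churn_onnx`
    OPTIONS (MODEL_TYPE = 'ONNX',
             MODEL_PATH = 'gs://my-bucket/models/churn/model.onnx')
""").result()

# Score feature rows without moving them out of BigQuery.
rows = client.query("""
    SELECT *
    FROM ML.PREDICT(MODEL `my-project.demo.churn_onnx`,
                    (SELECT * FROM `my-project.demo.customer_features`))
""").result()
```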
Features and vector embeddings – BigQuery is now the place to store all your ML features and vector embeddings with the launch of BigQuery feature tables and vector embeddings in preview. By loading feature and vector embedding data into BigQuery, you can build powerful semantic searches and run recommendation queries at the scale of your BigQuery data, in real time. And you can manage your features the same way you manage your other data. Plus, we automatically synchronize the data into Vertex AI Feature Store to enable low-latency serving for your web applications without having to move any data.
Unified workspace for data and AI people – To bring AI and data together into one shared environment, we launched the preview of BigQuery Studio, which brings data engineering, analytics, and ML workloads together, so you can edit SQL, Python, Spark and other languages and easily run analytics, at petabyte scale and without additional infrastructure management overhead. BigQuery Studio gives you direct access to Colab Enterprise, a new offering that brings Google Cloud’s enterprise-level security and compliance to Colab.
We also launched the preview of BigQuery DataFrames API, which provides a simple way to run Python for data science directly in BigQuery, using familiar APIs for Pandas or Scikit. With the ability to write Python in BigQuery, you get an awesome notebook experience.
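The sketch below illustrates the BigQuery DataFrames style of working: pandas-like operations that are pushed down and executed in BigQuery. The project and table names are hypothetical.

```python
# Sketch: pandas-style analysis with BigQuery DataFrames (bigframes), where
# computation runs in BigQuery rather than in local memory. Names are hypothetical.
import bigframes.pandas as bpd

bpd.options.bigquery.project = "my-project"  # assumed project ID

# Lazily reference a BigQuery table as a DataFrame.
df = bpd.read_gbq("my-project.demo.sales")

# Familiar pandas-style transformations, executed at BigQuery scale.
monthly = (
    df[df["amount"] > 0]
    .groupby("month")["amount"]
    .sum()
)

# Bring only the small aggregated result back to the client.
print(monthly.to_pandas().head())
```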
Boosting productivity with AI
This year, we put our decades of investment and research in AI into action to help you boost productivity.
AI for data analytics – We launched Duet AI in BigQuery to simplify data analysis, generate code and optimize your data workloads. It can:
Assist with writing SQL queries and Python code, allowing you to focus more on logic and outcomes
Auto-suggest code in real time and generate full functions and code blocks
Help you through your data work with a chat experience
We also brought Duet AI to our data migration service to help you modernize legacy applications through automatic SQL translations.
AI for data governance – We also brought Duet AI to Dataplex. Duet AI in Dataplex can be used for metadata insights to solve the cold-start problem — how do I know which questions I can ask my data? Leveraging Duet AI, we help you jumpstart your analytics with a generated list of questions that you can ask of your data, based on metadata and usage patterns, with one-click access to SQL queries that you can run in BigQuery Studio.
AI for business intelligence – More than 10 million users access Looker each month and they can gain deeper insights with access to more than 1,000 data sources and 800+ community connectors.
Further, we launched Duet AI in Looker to help users do conversational data analysis in natural language. It allows you to:
Do conversational data analysis in natural language
Automatically create dashboards and reports by telling Looker the goal of your analysis
Generate Google Slides presentations with intelligent summaries from your Looker dashboards
Use natural language to quickly create calculations and visuals with our Duet Formula and Visual assistants
And rapidly create LookML code and specify the intent of your data model in natural language
What’s next
We have an amazing roadmap of innovations planned for 2024 and beyond. Register for the upcoming webcast on Nov 13th to get a snapshot of our plans and investments in BigQuery, Streaming Analytics, Data Lakes, Data Integration and GenAI. We can’t wait to see what you build and how you innovate with these new capabilities in Google’s Data and AI Cloud!
Retail brands face the challenges of data duplication and data silos as businesses and technology stacks grow — while consumer expectations remain high. These challenges not only increase the complexity of data management but can also lead to a rise in overall costs. By leveraging platform modernization services within Google Cloud, such as BigQuery, BigLake, and Dataplex, retail brands can drive faster analytics, reduce the risk of unauthorized data access, and simplify security and governance. Let’s break down what retail brands experience and how Google Cloud can help overcome these challenges.
Data silos can increase complexities and lower productivity
Data silos refer to the compartmentalization of information that arises when different departments within a retail organization manage their data independently. For example, marketing, sales, supply chain, logistics, and customer service teams often use distinct systems to track and store their data. This compartmentalization, although convenient initially, adds complexity as processes evolve and data volumes and business needs grow.
The retail industry is a mammoth data machine, reliant on a well-oiled data analytics engine at its heart. Data flows in from different sources and is stored in different systems and formats. It is also subject to change over time, which can make cross-departmental collaboration difficult and error prone. Here is how these challenges can impact retail businesses:
Limited visibility and insights: Data silos prevent retailers from having a comprehensive view of their customers, operations, and overall business performance. This makes it difficult to gain insights into customer behavior, identify trends, and make informed decisions.
Inefficient operations: Data silos can lead to inefficiencies in various operational areas, such as inventory management, supply chain optimization, and customer service. When data is scattered across different systems, it increases the risk of data inconsistencies.
Limited marketing reach: Data silos prevent retailers from effectively targeting and personalizing their marketing campaigns. Without a unified view of customer data, it’s challenging to understand customer preferences, interests, and purchase patterns.
Limited access control: Data silos can make it difficult to implement and enforce consistent access control policies. This can increase the risk of unauthorized access to sensitive data.
The most efficient way to address concerns about data silos and security while working with retail data from various sources is to build a data mesh.
What is a Data Mesh?
A data mesh is a decentralized, scalable, and modular approach to data architecture that emphasizes domain autonomy and governance. It is designed to address the challenges of traditional data architectures, which are often monolithic and inflexible, and of data silos, which can make it difficult to find, use, and share data across an organization.
In a data mesh, data is owned and managed by domain experts, who are responsible for ensuring that the data is accurate, complete, and up-to-date. The data is then published to a central catalog, where it can be discovered and used by other parts of the organization.
Building a secure data mesh with Google Cloud
A data mesh works well with an analytics lake house. An analytics lake house is an architecture that combines the benefits of a data lake and a data warehouse. It provides a centralized repository for data, while also allowing for the use of analytical tools and processes that are typically associated with data warehouses.
The data mesh can provide a way to manage the data in the lake house, while the lake house can provide a place to store and analyze the data. The key benefits of the lake house architecture in the retail industry are:
Improved customer experience: Customers will no longer need to reach out to multiple departments to get complete answers to questions that span department boundaries, as the data will be easily accessible to all departments that have jurisdiction over it.
Improved efficiency at lower costs: Retailers will no longer need to bear the extra expense of storing duplicate data and moving data across teams and platforms.
Securing data without silos: With fine-grained access control, data stewards can keep data secure, with access granted in accordance with the principle of least privilege.
Improved data governance: With Dataplex, data stewards and administrators can build a logical data mesh and governance layer to implement governance, validation and cataloging of data from a single interface.
Retail organizations often work with a centralized data lake where data from different sources such as sales, marketing, inventory, supply chain, etc., come in. This data often comes in unprocessed. From there, different analytics teams fetch the data, process it and then generate reports for business users.
BigQuery enables data stewards to import data into native tables, which are stored in BigQuery’s Capacitor columnar format. They can grant access to the data at various asset levels; for example, the finance team would not have access to HR data, and the supply chain team would not have access to sales data. For organizations with a multicloud architecture, BigLake also enables the creation of BigQuery assets on data residing outside of Google Cloud, in AWS and Azure.
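One common way to express this kind of dataset-level access is BigQuery's SQL data control statements; the sketch below grants a team read access to a single dataset. The dataset and group names are hypothetical.

```python
# Sketch: grant dataset-level read access so each team only sees its own data.
# Dataset and group names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Give the sales analytics team read access to the sales dataset only.
client.query("""
    GRANT `roles/bigquery.dataViewer`
    ON SCHEMA `my-project.sales_data`
    TO "group:sales-analytics@example.com"
""").result()
```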
What about data residing in Cloud Storage? Data in this object store is neither easily queryable nor suitable for running analytics on directly. BigLake solves this problem by allowing the creation of BigLake tables on Cloud Storage data. This avoids data duplication and movement. Finally, you can apply fine-grained access controls on these tables to securely share these assets with different analytics teams while the data stays in place.
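Here is a minimal sketch of that pattern: a BigLake table defined over Parquet files in Cloud Storage, queried in place. The bucket, dataset, and connection names are hypothetical.

```python
# Sketch: create a BigLake table over Parquet files in Cloud Storage and query
# it without copying the data. All resource names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

client.query("""
    CREATE EXTERNAL TABLE IF NOT EXISTS `my-project.supply_chain.shipments`
    WITH CONNECTION `us.biglake-conn`
    OPTIONS (format = 'PARQUET',
             uris = ['gs://my-retail-bucket/shipments/*.parquet'])
""").result()

# The table can now be queried like any other BigQuery table, with
# fine-grained access controls applied while the data stays in Cloud Storage.
rows = client.query("""
    SELECT warehouse_id, COUNT(*) AS shipment_count
    FROM `my-project.supply_chain.shipments`
    GROUP BY warehouse_id
""").result()
```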
Dataplex plays the governing role in building this data mesh by organizing data in Cloud Storage and BigQuery into a hierarchy of lakes, zones, and assets. Within each lake, you can create zones for subcategories such as region or business unit. You can also separate zones into raw and curated data, and run data validation and auto-discovery to flag pattern changes in incoming data that could break downstream processes.
Once the data mesh is built out, analysts can use BigQuery to run their analytics on popular open source engines for things like sales predictions, supply-chain analysis, or inventory estimations. The Google Cloud services discussed in this article integrate seamlessly with other Google Cloud data portfolio services like Dataflow and Dataproc, allowing users to build robust and powerful data pipelines.
Google Cloud can simplify data management, enabling retail brands to exceed consumer expectations. Using these platform modernization services within Google Cloud, businesses can not only avoid data duplication, but also reduce costs and break down silos without compromising on security.
To learn more about how TEKsystems Global Services is helping businesses take advantage of Google Cloud, please visit www.TEKsystems.com.