Troubleshoot and optimize your BigQuery analytics queries with query execution graph

We all want queries to run faster, better, and cheaper. BigQuery certainly does! But queries can get complicated quickly, which sometimes makes manual optimization unavoidable.

BigQuery is a complex distributed system, and numerous internal and external factors can influence query speed. Understanding what is going on might be a bit of a puzzle. To solve this, we launched the query execution graph with performance insights in preview a few months ago. Now, we are excited to announce its general availability.

Mercadolibre is a leading company in LATAM that processes millions of queries daily on BigQuery. “The query execution graph has helped us see where our queries are slowing down, and in a lot of cases, it’s pointed us in the right direction for making optimizations,” said Fernando Ariel Rodriguez, Data & Analytics expert at Mercadolibre.

Think of the query execution graph as your magnifying glass to zoom into those nitty-gritty details of your query execution. It transforms your query plan info into an easy-to-digest graphical format and gives detailed information for each step. Whether you’re dealing with a query in progress or a completed one (and soon a failed one as well), it’s a great way to understand what’s happening under the hood.

Here’s the cool part: the query execution graph also serves up performance insights. These are like friendly neighborhood tips, aiming to offer suggestions on enhancing your query performance. But remember, just like with any good detective work, getting the full picture involves looking at it from multiple angles, and these insights might only provide a piece of the puzzle.

Diving in

Now, let’s delve into the intricacies of query performance insights. When BigQuery is put to work, it transforms your SQL statement into a query plan, divided into stages, each made up of various execution steps. Each stage is unique — some might be resource-intensive and time-consuming. But with our execution graph, spotting these potential speed bumps is a breeze.

But, of course, we’re not stopping there. BigQuery also offers you insights into potential factors that might be causing your query to take the scenic route.

Familiar with slot contention? When you run a query, BigQuery tries to divide the work into manageable tasks. Each of these tasks is then assigned to a slot, and ideally the slots work on them in parallel for maximum efficiency. But if there aren’t enough slots to pick up tasks, you’ve got a slot contention situation.

Then there is the “insufficient shuffle quota” issue. Think of it this way: as a slot finishes a task, it stores the intermediate results in a “shuffle.” Future stages of your query then pull data from this shuffle. But if you have more data to write to the shuffle than there is capacity for, you run into the “insufficient shuffle quota” issue.

If your query encounters either of the above issues, consider these solutions: optimize the query to use fewer resources, allocate more resources, or distribute the workload to avoid peak demand.

We must also address the intricacies of data-intensive joins. If your query includes a join with non-unique keys on both sides, you might end up with an output table that’s vastly larger than the input tables. This disparity between output and input row counts indicates a significant skew. A word to wise analysts: meticulously review your JOIN conditions. Did you anticipate the bloated size of the output table? It’s best to avoid cross joins, but if you must use them, think about adding a GROUP BY clause for preliminary result aggregation, or a window function might come to the rescue.
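
As a hedged illustration of the pre-aggregation tip, the sketch below uses hypothetical tables and columns; the idea is simply to aggregate one side of the join so its keys become unique before joining.

-- Hypothetical example: orders joined to events on a non-unique key.
-- Pre-aggregating events by user_id keeps the output row count bounded
-- by the orders table instead of letting the join explode.
SELECT
  o.order_id,
  o.user_id,
  e.event_count
FROM my_dataset.orders AS o
JOIN (
  SELECT user_id, COUNT(*) AS event_count
  FROM my_dataset.events
  GROUP BY user_id
) AS e
ON o.user_id = e.user_id;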

Lastly, there’s the possibility of a data input scale change. This is essentially when your query ends up reading at least 50% more data from a table than the last time you ran the query. You might be asking, “Well, how’d that happen?” One possibility is that the size of the table used in the query has recently grown. You can use the table change history to double-check this.
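
If you want to confirm that table growth is the cause, one option is to compare the current row count with a historical one using time travel; this is a minimal sketch with a hypothetical table name, and it assumes the comparison point still falls inside your time travel window.

-- Compare today's row count with the count from 7 days ago.
SELECT
  (SELECT COUNT(*) FROM my_dataset.my_table) AS rows_now,
  (SELECT COUNT(*)
   FROM my_dataset.my_table
     FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  ) AS rows_7_days_ago;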

Viewing query performance insights across your entire organization

You can quickly retrieve insights for your entire organization by querying the INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION view as shown below:

SELECT
  `bigquery-public-data`.persistent_udfs.job_url(
    project_id || ':us.' || job_id) AS job_url,
  query_info.performance_insights
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION
WHERE
  DATE(creation_time) >= CURRENT_DATE - 30 -- scan 30 days of query history
  AND job_type = 'QUERY'
  AND state = 'DONE'
  AND error_result IS NULL
  AND statement_type != 'SCRIPT'
  AND EXISTS ( -- Only include queries which had performance insights
    SELECT 1
    FROM UNNEST(
      query_info.performance_insights.stage_performance_standalone_insights
    )
    UNION ALL
    SELECT 1
    FROM UNNEST(
      query_info.performance_insights.stage_performance_change_insights
    )
  );

The output of the above query will return all query jobs in your organization for which performance insights were generated, along with a generated URL that deep-links to the query execution graph in the Google Cloud console (that way you can visually inspect the query stages and their insights).

What’s next?

So, what’s on the horizon for our query execution graph and performance insights? Well, we’re continuously fine-tuning these features. Expect to see more metrics, additional performance insights, and an even more intuitive graph visualization. We’re just getting warmed up, so stay tuned for more exciting updates!

We hope you use the BigQuery query execution graph and performance insights to better understand and optimize your queries. If you have any feedback or thoughts about this feature, please feel free to reach out to us at bq-query-inspector-feedback@google.com. To learn about the feature in more detail, please see the public documentation.

Manage dynamic query concurrency with BigQuery query queues

BigQuery is a powerful cloud data warehouse that can handle demanding workloads. BigQuery users can get the benefit of continuous improvements in performance, durability, efficiency, and scalability, without downtime and upgrades.

Today, we are pleased to announce the general availability of query queues in BigQuery.

What are query queues and why use them?

BigQuery query queues introduce a dynamic concurrency limit and enable queueing, and they are enabled by default for all BigQuery customers. Previously, BigQuery supported a fixed concurrency limit of 100 queries per project; when the number of queries exceeded this limit, users received a quota exceeded error when attempting to submit an interactive job.

Concurrency is now calculated dynamically based on the available slot capacity and the number of queries that are currently running. While most customers will use the dynamic concurrency calculation, administrators can also choose to set a maximum concurrency target for a reservation to ensure that each query has enough slot capacity to run. This also means that queries that cannot be processed immediately are added to a queue and run as soon as resources become available, instead of failing.

Using query queues

Dynamic concurrency: BigQuery dynamically determines the concurrency based on available resources and can automatically set and manage the concurrency based on reservation size and usage patterns. While the default concurrency configuration is set to zero, which enables dynamic configuration, experienced administrators can manually override this option by specifying a target concurrency limit. The admin-specified limit can’t exceed the maximum concurrency provided by available slots. The limit is not configurable by administrators for on-demand workloads.

Queuing: Query queues help to manage scenarios where peak workloads generate a sudden increase in queries that exceeds the maximum concurrency limit. With queuing enabled, BigQuery can queue up to 1,000 interactive queries and 20,000 batch queries, ensuring that they are scheduled for execution rather than being terminated due to concurrency limits, as was previously the case. Users no longer need to search for idle times or periods of low usage to optimize when to submit their workload requests. BigQuery automatically runs their requests or schedules them in a queue to run as soon as the currently running workloads have finished.

Key metrics and highlights

Target job concurrency: Setting a lower target_job_concurrency for a reservation increases the minimum number of slots allocated per query, which potentially results in faster or more consistent performance, particularly for complex queries. Changes to concurrency are only supported at the reservation level (a sketch of setting this option follows this list).

Specs: Within each project, up to 1,000 interactive queries and 20,000 batch queries can be queued at once. Batch queries use the same resources as interactive queries.

Timeouts: Users can now configure a timeout value for each query/job queue. If a query can’t start executing within the specified time, BigQuery will attempt to cancel the query/job instead of queuing it for an extended amount of time. The default timeout value is 6 hours for interactive queries and 24 hours for batch queries, and it can be set at the organization or project level.
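
For administrators who use reservations, a minimal sketch of overriding the dynamic default with the reservation DDL is shown below; the project, region, and reservation names are hypothetical, and the statement assumes you have reservation administration permissions.

-- Set an explicit concurrency target on a reservation.
-- Per the dynamic-concurrency default described above, setting the option
-- back to 0 returns the reservation to dynamic concurrency.
ALTER RESERVATION `my-admin-project.region-us.prod-reservation`
SET OPTIONS (target_job_concurrency = 50);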

For more information, read the query queues documentation.

Actuate your data in real time with new Bigtable change streams

Cloud Bigtable is a highly scalable, fully managed NoSQL database service that offers single-digit millisecond latency and an availability SLA up to 99.999%. It is a good choice for applications that require high throughput and low latency, such as real-time analytics, gaming, and telecommunications.

Cloud Bigtable change streams is a feature that allows you to track changes to your Bigtable data and easily access and integrate this data with other systems. With change streams, you can replicate changes from Bigtable to BigQuery for real-time analytics, trigger downstream application behavior using Pub/Sub (for event-based data pipelines), or capture database changes for multi-cloud scenarios and migrations to Bigtable.

Cloud Bigtable change streams is a powerful tool that can help you unlock new value from your data.

NBCUniversal’s streaming service Peacock uses Bigtable for identity management across their platform. The Bigtable change streams feature helped them simplify and optimize their data pipeline. 

“Bigtable change streams was simple to integrate into our existing data pipeline leveraging the dataflow beam connector to alert on changes for downstream processing. This update saved us significant time and processing in our data normalization objectives.” – Baihe Liu, Peacock

Actuating your data changes

Enabling a change stream on your table can easily be done through the Google Cloud console, or via the API, client libraries, or declarative infrastructure tools like Terraform.

Once enabled on a particular table, all data changes to the table will be captured and stored for up to seven days. This is useful for tracking changes to data over time, or for auditing purposes. The retention period can be customized to meet your specific needs. You can build custom processing pipelines using the Bigtable connector for Dataflow. This allows you to process data in Bigtable in a variety of ways, including batch processing, streaming processing, and machine learning. Or, you can have even more flexibility and control by integrating with the Bigtable API directly.

Cloud Bigtable change streams use cases 

Change streams can be leveraged for a variety of use cases and business-critical workloads. 

Analytics and ML
Collect event data and analyze it in real time. This can be used to track customer behavior to update feature store embeddings for personalization, monitor system performance in IoT services for fault detection, identify security threats, or monitor events to detect fraud.

In the context of BigQuery, change streams can be used to track changes to data over time, identify trends, and generate reports. There are two main ways to send change records to BigQuery: as a set of change logs, or by mirroring your data in BigQuery for large-scale analytics.

Event-based applications 
Leverage change streams to trigger downstream processing of certain events, for example, in gaming, to keep track of player actions in real time. This can be used to update game state, provide feedback to players, or detect cheating.

Retail customers leverage change streams to monitor catalog changes like pricing or availability to trigger updates and alert customers.

Migration and multi-cloud
Capture Bigtable changes for multicloud or hybrid cloud scenarios. For example, leverage Bigtable HBase replication tooling and change streams to keep your data replicated across clouds or on-premises databases. This topology can also be leveraged for online migrations to Bigtable without disruption to serving activity.

Compliance
Compliance often refers to meeting the requirements of specific regulations, such as HIPAA or PCI DSS. Retaining the change log can help you to demonstrate compliance by providing a record of all changes that have been made to your data. This can be helpful in the event of an audit or if you need to investigate a security incident.

Learn more

Change streams is a powerful feature providing additional capability to actuate your data on Bigtable to meet your business requirements and optimize your data pipelines. To get started, check out our documentation for more details on Bigtable change streams, along with these additional resources:

Expanding your Bigtable architecture with change streams

Process a Bigtable change stream tutorial

Create a change stream-enabled table and capture changes quickstart

Bigtable change streams Code samples

Landis+Gyr: Securing the energy supply with AI and machine learning

With the global energy crisis driving a dramatic spike in the price of fossil fuels, countries around the world have urgently stepped up their transition to renewable power to reduce their reliance on imported, carbon-intense fuel sources. According to the International Energy Agency (IEA), renewables will account for over 90% of global electricity expansion over the next five years, overtaking coal to become the largest source of global electricity by early 2025. 

However, while investment in renewables improves countries’ long-term energy security, the rapid transition presents energy-security challenges of its own. As more renewable energy sources and Distributed Energy Resource (DER) assets (such as rooftop solar panels and microturbines) are added to the grid, this can lead to network congestion and imbalanced energy distribution, affecting the quality of supply. 

A new business model for a changing industry

An industry leader in energy management solutions for more than 125 years, Landis+Gyr helps utility companies and grid operators to manage their energy infrastructure more efficiently. For Landis+Gyr, the answer to the challenges presented by a rapid transition to renewables lies in modernizing the traditional smart metering business model by using artificial intelligence (AI) and machine learning (ML) to make the most of the huge amount of data that its utility customers are collecting every day. 

However, Landis+Gyr initially found it difficult to innovate at the pace required by the rapidly changing energy industry, as its legacy on-premises analytics infrastructure lacked AI and ML capabilities. 

“We wanted to build a new product that enabled advanced analytics, including AI and ML capabilities, which we could offer as a service to our customers,” says Antonio Hidalgo, VP Digital Solutions at Landis+Gyr. “We also needed a solution that was easy to set up, scalable, and which required very little management from our side. Those were the key criteria for us and Google Cloud met them all.”

Google Cloud partnered with Landis+Gyr, making it our mission to help the company to innovate faster with a unified and intelligent data platform.

Keeping the lights on with real-time data insights

Thanks to Landis+Gyr’s smart-metering infrastructure, utility companies all over the globe retrieve granular interval data for millions of devices every single day, creating a huge volume of data that needs to be uploaded and processed. Now, with the help of Cloud Storage, BigQuery, and Looker, Landis+Gyr can provide its utility customers with a comprehensive data lake and a complete analytics solution by using a cloud data warehouse, enabling them to make full use of their data to gain powerful insights in real time.

“BigQuery was super fast to set up and it scales seamlessly up and down, which makes the pricing structure very scale-friendly,” says Hidalgo. “Flexibility was also key for us. Our utility customers want to be able to define their own dashboards on a continuous basis, based on the information that is relevant to them. With Looker, we are now able to build and release something in a matter of days, not months. With our on-premises solutions, that was impossible. But on the cloud, with a modern CI/CD pipeline and the right level of automation, we can do it.”

Landis+Gyr is now able to provide its customers with data insights to manage all aspects of the power supply and distribution more effectively, from determining the quality of energy supply, through forecasting energy demand, to helping utilities maintain a distribution network with a growing volume of DER assets.

“Near real-time monitoring of sensor data streams enables our utility customers to predict many network issues in advance,” says Hidalgo. “By using Pub/Sub, we can ingest events and stream them in real time to BigQuery with Dataflow. Utilities can then analyze the network data to identify the problem and make decisions accordingly.”

Mapping the power grid with machine learning

Landis+Gyr is also making use of Google Cloud ML tools to improve the way utility companies maintain an accurate network connectivity model. Understanding the power system’s network topology, or how the various components such as generators, transformers and transmission lines are connected, is crucial for maintaining reliable and efficient grid operations, such as network planning, outage management and fault detection. This topology is traditionally maintained using Geographic Information Systems (GIS), which Hidalgo says are prone to inaccuracy. 

“On average, GIS systems typically have 5-20% inaccuracy in terms of where the end consumers are located in relation to the network,” Hidalgo notes. “With the introduction of new ML capabilities with Vertex AI, we can develop something far more accurate without the need to go to the field to confirm anything. A prediction with 97-98% accuracy is far more valuable from an operational point of view than the current GIS systems.”

Secure data for energy security

As data becomes increasingly important to utility companies, Landis+Gyr understands how critical it is to secure that data, and has been working with Google Cloud to apply cloud security best practices as a cornerstone of its new analytics solution. 

“Before partnering with Google Cloud, we made a careful assessment of the data protection safeguards and security compliance of the different cloud providers,” says Hidalgo. “Google Cloud offers us state-of-the-art data protection and security standards, which means that our customers are the only ones who are accessing their data.”

Future-proofing the industry for a greener tomorrow

The past two years of this partnership have built the foundations for future innovations, not only from a technology perspective, but also from an organizational and operational perspective. And as the transition to renewables continues to transform the energy industry as a whole, Hidalgo is convinced that Landis+Gyr can use ML and AI to transform the way the industry meets the ongoing challenges of this transition. 

“With Google Cloud, we are engineering a scalable, flexible solution that enables our customers to release new features quickly, to meet the fast-changing conditions we are now facing in the energy sector,” Hidalgo says. “And with AI and ML, we have the opportunity to create completely new insights with much more accuracy in terms of predictions, providing an entirely new way for the utility companies to operate the grid. ”

We all know how important the transition to renewables is. That’s why we are proud to partner with Landis+Gyr to lead the data transformation in the utilities sector and ensure that utilities can manage the security of supply with a powerful analytics solution.

Fine tune autoscaling for your Dataflow Streaming pipelines

Stream processing helps users get timely insights and act on data as it is generated. It is used for applications such as fraud detection, recommendation systems, IoT, and others. However, scaling live streaming pipelines as input load changes is a complex task, especially if you need to provide low-latency guarantees and keep costs in check. That’s why Dataflow has invested heavily in improving its autoscaling capabilities over the years, to help users by automatically adjusting compute capacity for the job. These capabilities include:

Horizontal auto-scaling: This lets the Dataflow service automatically choose the appropriate number of worker instances required for your job.

Streaming Engine: This provides smoother horizontal autoscaling in response to variations in incoming data volume.

Vertical auto-scaling (in Dataflow Prime): This dynamically adjusts the compute capacity allocated to each worker based on utilization.

Sometimes customers want to customize the autoscaling algorithm parameters. In particular, we see three common use cases when customers want to update min/max number of workers for a running streaming job:

Save cost when latency spikes: Latency spikes may cause excessive upscaling to handle the input load, which increases cost. In this case, customers may want to apply lower worker-count limits to reduce costs.

Keep latency low during expected increases in traffic: For example, a customer may have a stream that is known to spike in traffic every hour. It can take minutes for the autoscaler to respond to those spikes. Instead, users can proactively increase the number of workers ahead of the top of the hour.

Keep latency low during traffic churn: It can be hard for the default autoscaling algorithm to select the optimal number of workers during bursty traffic, which can lead to higher latency. Customers may want to apply a narrower range of min/max workers to make autoscaling less sensitive during these periods.

Introducing inflight streaming job updates for user-calibrated autoscaling

Dataflow already offers a way to update auto-scaling parameters for long-running streaming jobs by doing a job update. However, this update operation causes a pause in the data processing, which can last minutes and doesn’t work well for pipelines with strict latency guarantees.

This is why we are happy to announce the in-flight job option update feature. This feature allows Streaming Engine users to adjust the min/max number of workers at runtime. If the current number of workers is within the new minimum and maximum boundaries, this update will not cause any processing delays. Otherwise, the pipeline will start scaling up or down within a short period of time.

It is available for users through:

The gcloud CLI command:

gcloud dataflow jobs update-options \
  --region=REGION \
  --min-num-workers=MINIMUM_WORKERS \
  --max-num-workers=MAXIMUM_WORKERS \
  JOB_ID

Dataflow Update API

PUT https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/jobs/JOB_ID?updateMask=runtime_updatable_params.max_num_workers,runtime_updatable_params.min_num_workers
{
  "runtime_updatable_params": {
    "min_num_workers": MINIMUM_WORKERS,
    "max_num_workers": MAXIMUM_WORKERS
  }
}

Please note that the in-flight job updates feature is only available to pipelines using Streaming Engine. 

Once the update is applied, users can see the effects in the Autoscaling monitoring UI.

The “Pipeline options” section in the “Job info” panel will display the new values of “minNumberOfWorkers” and “maxNumberOfWorkers”.

Case Study: How Yahoo used this feature

Yahoo needs to frequently update their streaming pipelines that process Google Pub/Sub messages. This customer also has a very tight end-to-end processing SLA so they can’t afford to wait for the pipeline to be drained and replaced. If they were to follow the typical process, they would start missing their SLA. 

With the new in-flight update option, we proposed an alternative approach. Before the current pipeline’s drain is initiated, its maximum number of workers is set to the current number of workers using the new API. Then a replacement pipeline is launched with its maximum number of workers also equal to the current number of workers of the existing pipeline. This new pipeline is launched on the same Pub/Sub subscription as the existing one (note: in general, using the same subscription for multiple pipelines is not recommended, because there is no deduplication across separate pipelines and duplicates can occur; it works only when duplicates during the update are acceptable). Once the new pipeline starts processing the messages, the existing pipeline is drained. Finally, the new production pipeline is updated with the desired minimum and maximum number of workers.

Typically, we don’t recommend running more than one Dataflow pipeline on the same Pub/Sub subscription. It’s hard to predict how many Pub/Sub messages will be in the pipeline, so the pipelines might scale up too much. The new API lets you disable autoscaling during replacement, which has been shown to work well for this customer and helped them maintain the latency SLA. 

“With Yahoo mail moving to the Google Cloud Platform we are taking full advantage of the scale and power of Google’s data and analytics services. Streaming data analytics real time across hundreds of millions of mailboxes is key for Yahoo and we are using the simplicity and performance of Google’s Dataflow to make that a reality.” – Aaron Lake, SVP & CIO, Yahoo

You can see the source code of sample scripts to orchestrate a no-latency pipeline replacement, along with a simple test pipeline, in this GitHub repository.

What’s next

Autoscaling live streaming pipelines is important for achieving low-latency guarantees and meeting cost requirements. Doing it right can be challenging. That’s where the Dataflow Streaming Engine comes in.

Many autoscaling features are now available to all Streaming Engine users. With in-flight job updates, our users get an additional tool to fine-tune autoscaling for their requirements.

Stay tuned for future updates and learn more by contacting the Google Cloud Sales team.

Delivering greater insights for insurance underwriters with BigQuery geospatial analytics

About CNA

A key goal for insurers is to provide the best match between customer need, product offering and the premium charged. However, the price charged and eventual profitability is partially dependent on how well the insurer understands and assesses the risk being insured. This is usually a time-consuming and often subjective process that relies on the unique skills of the individual underwriter along with their ability to capture, analyze, and interpret vast amounts of associated data points. 

CNA is one of the largest commercial property and casualty insurance companies in the U.S. With over 120 years of experience, CNA provides a broad range of standard and specialized insurance products and services for businesses and professionals in the United States, Canada and Europe. Over the last three years, CNA has been building its data-analytics foundation on Google’s Data Cloud. This has been accomplished by consolidating hundreds of global data sources, leveraging Vertex AI automation to accelerate time-to-market for machine learning models, and delivering a series of executive reporting dashboards on key business performance metrics. 

Challenge: Underwriting flood risk for commercial properties

With the building of their analytics foundation complete, CNA has increasingly focused their attention on providing value-added analytic products at scale to their business units, specifically to enhance the underwriting experience by using advanced analytics technologies to deliver more robust insights within the underwriting ecosystem. This will enable underwriters to quickly and more accurately evaluate property risk, perform additional analysis to better understand the implications of various decisions, and provide quality insurance products.

Effectively assessing underwriting flood risk was a priority area for CNA. This is a difficult process that must take into consideration a couple of key factors: 

Flood hazard events are multidimensional in nature, with threats coming from coastal surges, fluvial (river), and pluvial (surface) sources. Coastal and fluvial risks are directly related to their proximity to water bodies. However, surface flooding is often spontaneous and typically occurs when intense rainfall or surface run-off overwhelms an urban drainage system. Pluvial flooding is often difficult to predict, and urban areas are particularly vulnerable and have high levels of exposure to this hazard event. 

Flood hazards require a comprehensive understanding of the complex spatial relationships that exist between the commercial property being insured and its surrounding environment. The ability to identify any associated risk due to adjacent property location is also important. 

“We needed a way to improve how we assessed and understood flood risk. Shifting to a process that incorporates more robust floodplain data sources and applying geospatial analytics technologies to better understand the complex spatial relationships that exist is critical to providing underwriters with the accurate and timely insights that they need.” – Tom Stone, VP Aggregation and Catastrophe Management, CNA 

CNA’s current process for assessing flood risk had limitations across the following areas: 

Limited data coverage: The data used for analysis had gaps and only took into consideration fluvial and coastal water sources. Pluvial sources that accounted for two thirds of flood losses were not a part of the risk assessment process. 

Limited geographic extent: The process relied on a geocoded point location (property address) that did not capture the geographic extent of the entire site, which made it challenging to assess the impact of a hazard on the entire site or even adjacent ones. 

Challenge with associated risk: Geocoded point locations could be anywhere on a property. These locations may intersect with existing floodplain data. However, in the absence of a precise intersection, the actual proximity of a property to a floodplain (close by, near to) could not be assessed. 

Solution: Powering flood risk assessment with BigQuery geospatial analytics

To solve this challenge, CNA worked with Google Cloud and several third-party data vendors to develop a more sophisticated solution that addressed the current challenges with underwriting flood risk assessment.

Floods and their associated impacts have a strong location component, so geospatial analytics, which uses location data (latitude, longitude) to better understand spatial relationships and patterns, is the foundation of the solution. Improving the accuracy of geospatial analysis requires access to comprehensive data. CNA acquired several datasets on the natural and anthropogenic environment, including buildings, flood risk (coastal, fluvial and pluvial), land parcels, and more. 

These were large datasets that had both national and global geographic coverage. Because of this, CNA needed a data analytics solution that could store, process, and perform large-scale and computationally expensive geospatial analytics to better understand the complex relationships that exist between these datasets and the combined impact to both their current and future portfolio of business. 

“Over the years we have worked with Google Cloud to accelerate our data journey. BigQuery geospatial analytics was a natural fit for this flood risk assessment challenge that requires robust data management, powerful geospatial analytics functionality, and extensive computational scale to quickly analyze these large datasets.” – Dr. Pierre Braganza, Chief Enterprise Architect, CNA

Based on previous strategic investments, CNA was in a unique position to solve this challenge. They had already achieved data sufficiency at scale, with 90% of their data residing in a highly governed data warehouse built on BigQuery. BigQuery has the added benefit of possessing powerful capabilities for both geospatial data management and geospatial analytics. Using BigQuery for geospatial analytics lets you analyze geographic data using standard SQL geography functions. As these functions execute, they harness the compute power of BigQuery, which makes this an ideal platform for rapidly analyzing the complex spatial relationships that exist within and between the datasets. 

Building an intelligent geospatial flood risk API 

The solution uses Dataflow and geobeam to land terabytes of geospatial data in raster and vector formats into BigQuery. Underpinning the solution is an intelligent API built on Google Kubernetes Engine that dynamically executes several BigQuery geospatial analytics functions to explore the spatial relationships between a potential underwriting location (latitude, longitude), the underlying associated parcel, the geographic extent of the related buildings, and their intersection with the floodplain. A multilevel flood risk score is automatically generated based on the locations’ spatial intersection (ST_Intersects) or proximity (ST_Buffer) to the floodplain zones, parcels and associated buildings.
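
A simplified sketch of the kind of check the API performs is shown below; the table names, geography columns, and the 100-meter proximity threshold are illustrative assumptions rather than CNA's actual schema.

-- Hypothetical tables: parcels(parcel_id, parcel_geom) and flood_zones(risk_level, zone_geom).
DECLARE location GEOGRAPHY DEFAULT ST_GEOGPOINT(-87.6298, 41.8781);

SELECT
  p.parcel_id,
  z.risk_level,
  ST_INTERSECTS(p.parcel_geom, z.zone_geom) AS parcel_in_zone,
  ST_DWITHIN(location, z.zone_geom, 100) AS location_within_100m
FROM my_dataset.parcels AS p
JOIN my_dataset.flood_zones AS z
  ON ST_INTERSECTS(ST_BUFFER(p.parcel_geom, 100), z.zone_geom)
WHERE ST_CONTAINS(p.parcel_geom, location);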

Potential underwriting location that is associated with a property that spans multiple floodplain risk zones. Areas in red and orange have higher flood risk scores than those in yellow and green. The location is found in a low risk flood area (shaded green). However, it is within a building that is in a high risk flood area (shaded red).

This will be a significant improvement over the old process. 

“Previously when we geocoded a property address (latitude, longitude) it could be found anywhere on a property, which may not overlay with the floodplain. Now we have an intelligent API that assesses risk at multiple levels based on the geographic extent of the entire property.” – Tom Stone 

CNA can now understand flood hazards that affect any portion of a property as opposed to a single point location. This is of particular importance in instances where an insured location may be found on a site that has a geographic scope that crosses multiple floodplain zones. In such a scenario, a seemingly low-risk property could be determined to be more high risk based on this spatial relationship. 

The API is being integrated with CNA’s underwriting system and related business data. During the underwriting process, underwriters will automatically be provided with the various risk scores so more informed decisions can be made about the commercial property to be insured. The API will also allow CNA to analyze its entire business portfolio to better understand flood risk. 

Results: Improved risk decisions with geospatial technology 

CNA’s underwriting flood risk solution will drive improved insights for underwriters, which should provide numerous benefits: 

Underwriters are empowered with new functionality that quickly and accurately alerts them to locations on a property schedule that are susceptible to flooding while allowing for further analysis on potential implications. 

Through the solution, CNA will gain a more complete flood risk dataset that includes pluvial risk information on which to make more informed decisions. This should help reduce losses that are occurring based on inadequate pricing for this type of hazard. 

CNA will better understand the relationship between a location and the associated property, buildings and floodplain zones to which it belongs. This is critical for being able to classify both the direct and associated risks for a property that may span multiple floodplain zones. Preliminary analysis has shown an approximate 25% increase in the ability to provide a comprehensive view of flood risk. 

In the future, CNA plans to build additional geospatial visualization capabilities to allow for further exploration. The solution is designed to incorporate additional hazard types as business needs arise. CNA also has the ability to quickly perform a scenario-based analysis on an entire portfolio of business and its susceptibility to flood risk. 

By leveraging BigQuery for geospatial analytics, CNA tackled the spatial problem of being able to better understand and measure flood risk. With 90% of all data possessing a location component, geospatial analytics can be applied to other business areas and problem sets.

BigQuery’s user-friendly SQL: Elevating analytics, data quality, and security

SQL is used by approximately 7 million people worldwide to manage and analyze data on a daily basis. Whether you are a data engineer or an analyst, how you manage and effectively use your data to provide business-driven insights has become more important than ever. 

BigQuery is an industry-leading, fully managed cloud data warehouse that helps simplify the end-to-end analytics experience, from data ingestion, preparation, and analysis all the way to ML training and inference using SQL. Today, we are excited to bring new SQL capabilities to BigQuery that extend our support for data quality, security, and flexibility. These new capabilities include: 

Schema operations for better data quality: create/alter views with column descriptions, flexible column names, and the LOAD DATA SQL statement

Sharing and managing data in a more secure way: authorized stored procedures

Analyzing data with more flexibility: LIKE ANY/SOME/ALL, ANY_VALUE (HAVING), and index support for arrays and structs 

Extending schema support for better data quality 

Here’s an overview of how we’ve extended schema support in BigQuery to make it easier for you to work with your data.

Create/alter views with column descriptions (preview)
We hear from customers that they frequently use views to provide data access to others, and the ability to provide detailed information about what is contained in the columns would be very useful. Similar to column descriptions of tables, we’ve extended the same capability for views. Instead of having to rely on Terraform to precreate views and populate column details, you can now directly create and modify column descriptions on views using CREATE/ALTER Views with Column Descriptions statements.

-- Create a view with a column description
CREATE VIEW view_name (column_name OPTIONS(description="col x")) AS ...

-- Alter a view's column description
ALTER VIEW view_name ALTER COLUMN column_name
SET OPTIONS(description="col x")

Flexible column name (preview)
To help you to improve data accessibility and usability, BigQuery now supports more flexibility for naming columns in your preferred international language and using special characters like ampersand (&) and percent sign (%) in the column name. This is especially important for customers with migration needs and international business data. Here is a partial list of the supported special characters: 

Any letter in any language 

Any numeric character in any language

Any connector punctuation character

Any kind of hyphen or dash

Any character intended to be combined with another character

Example column names: 

`0col1`

`姓名`

`int-col`

You can find a detailed list of the supported characters here.

LOAD DATA SQL statement (GA)

“In the past we mainly used the load API to load data into BigQuery, which required engineer expertise to learn about the API and do configurations. Since LOAD DATA was launched, we are now able to load data with SQL only statements, which made it much simpler, more compact and convenient.” – Steven Yampolsky, Director of Data Engineering, Northbeam 

Rather than using the load API or the CLI, BigQuery users like the compatibility and convenience of the SQL interface to load data as part of their SQL data pipeline. To make it even easier to load data into BigQuery, we have extended support for a few new use cases:

Load data with flexible column name (preview)

LOAD DATA INTO dataset_name.table_name
(`flexible column name 列` INT64)
FROM FILES (uris=["file_uri"], format="CSV");

Load into tables with renamed columns, or columns dropped and added in a short time

-- Create a table
CREATE TABLE dataset_name.table_name (col_name_1 INT64);

-- Rename a column in the table
ALTER TABLE dataset_name.table_name RENAME COLUMN col_name_1 TO col_name_1_renamed;

-- Load data into the table with the renamed column
LOAD DATA INTO dataset_name.table_name
(col_name_1_renamed INT64)
FROM FILES (uris=["file_uri"], format="CSV");

Load data into a table partitioned by ingestion time

LOAD DATA INTO dataset_name.table_name
(col_name_1 INT64)
PARTITION BY _PARTITIONDATE
FROM FILES (uris=["file_uri"], format="CSV");

Load data into or overwrite one selected partition

LOAD DATA INTO dataset_name.table_name
PARTITIONS(_PARTITIONTIME = TIMESTAMP('2023-01-01'))
(col_name_1 INT64)
PARTITION BY _PARTITIONDATE
FROM FILES (uris=["file_uri"], format="CSV");

Sharing and managing data in a more secure way 

Authorized stored procedures (preview) 
A stored procedure is a collection of statements that can be called from other queries. If you need to share query results from stored procedures with specific users without giving them read access to the underlying table, the newly introduced authorized stored procedure provides you with a convenient and secure way to share data access. 

How does it work?

Data engineers craft specific queries and grant permission on authorized stored procedures for specific analyst groups, who can then run and view query results without the read permission for the underlying table.

Analysts can then use authorized stored procedures to create query entities (tables, views, UDFs, etc.), call procedures, or perform DML operations.
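
As a rough sketch (dataset, table, and procedure names are hypothetical), the data engineer's side might look like the following; the final step of authorizing the procedure on the source dataset is done through the console, API, or bq tool rather than SQL.

-- Data engineer: a procedure in a shared dataset that exposes only an aggregate.
CREATE OR REPLACE PROCEDURE shared_ds.get_daily_revenue(report_date DATE)
BEGIN
  SELECT
    report_date AS day,
    SUM(amount) AS total_revenue
  FROM private_ds.transactions
  WHERE DATE(created_at) = report_date;
END;

-- Analyst: can run the authorized procedure without read access to private_ds.
CALL shared_ds.get_daily_revenue(DATE '2023-07-01');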

Extended support to analyze data with more flexibility 

LIKE ANY/SOME/ALL (preview)
Analysts frequently need to search against business information stored in string columns, e.g., customer names, reviews, or inventory names. Now you can use LIKE ANY/LIKE ALL to check against multiple patterns in one statement, with no need to combine multiple LIKE conditions in the WHERE clause.

With the newly introduced LIKE qualifiers ANY/SOME/ALL, you can filter rows on fields that match any/or all specified patterns. This can make it more efficient for analysts to filter data and generate insights based on their search criteria.

LIKE ANY (synonym for LIKE SOME): filter rows on fields that match any of one or more specified patterns

-- Filter rows that match any of the patterns 'Intend%', '%intention%'
WITH Words AS
  (SELECT 'Intend with clarity.' AS value UNION ALL
   SELECT 'Secure with intention.' UNION ALL
   SELECT 'Clarity and security.')
SELECT * FROM Words WHERE value LIKE ANY ('Intend%', '%intention%');

/*------------------------+
 | value                  |
 +------------------------+
 | Intend with clarity.   |
 | Secure with intention. |
 +------------------------*/

LIKE ALL: filter rows on fields that match all of the specified patterns

-- Filter rows that match all of the patterns '%ity%', '%ith%'
WITH Words AS
  (SELECT 'Intend with clarity.' AS value UNION ALL
   SELECT 'Secure with identity.' UNION ALL
   SELECT 'Clarity and security.')
SELECT * FROM Words WHERE value LIKE ALL ('%ity%', '%ith%');

/*-----------------------+
 | value                 |
 +-----------------------+
 | Intend with clarity.  |
 | Secure with identity. |
 +-----------------------*/

ANY_VALUE (HAVING MAX | MIN) (GA)
It’s common for customers to query for a value associated with the max or min value in a different column of the same row, e.g., to find the SKU of the best-selling product. Previously, you needed to use a combination of ARRAY_AGG() with ORDER BY, or LAST_VALUE() in a window function, to get the results, which is more complicated and less efficient, especially when there are duplicate records. 

With ANY_VALUE(x HAVING MAX/MIN y), as well as its synonyms MAX_BY and MIN_BY, you can now easily query a column associated with the max/min value of another column, with a much cleaner and readable SQL statement. 

Example: find the most recent contract value for each of your customers.
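
For instance, with a hypothetical contracts table holding one row per contract, the query might look like this:

-- Most recent contract value per customer (hypothetical table and columns).
SELECT
  customer_id,
  ANY_VALUE(contract_value HAVING MAX contract_date) AS latest_contract_value
FROM my_dataset.contracts
GROUP BY customer_id;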

Index support for arrays & struct (GA)
Array is an ordered list of values of the same data type. Currently, to access elements in an array, you can use either OFFSET(index) for zero-based indexes (start counting at 0), or ORDINAL(index) for one-based indexes (start counting at 1). To make it more concise, BigQuery now supports a[n] as a synonym for a[OFFSET(n)]. This makes it easier for users who are already familiar with such array index access conventions.

SELECT
  some_numbers,
  some_numbers[1] AS index_1,            -- index starting at 0
  some_numbers[OFFSET(1)] AS offset_1,   -- index starting at 0
  some_numbers[ORDINAL(1)] AS ordinal_1  -- index starting at 1
FROM Sequences

/*--------------------+---------+----------+-----------*
 | some_numbers       | index_1 | offset_1 | ordinal_1 |
 +--------------------+---------+----------+-----------+
 | [0, 1, 1, 2, 3, 5] | 1       | 1        | 0         |
 | [2, 4, 8, 16, 32]  | 4       | 4        | 2         |
 | [5, 10]            | 10      | 10       | 5         |
 *--------------------+---------+----------+-----------*/

Struct is a data type that represents an ordered tuple of fields of various data types. Today, if there are anonymous fields or duplicate field names, it can be challenging to access field values. Similar to arrays, we are introducing OFFSET(index) for zero-based indexes (start counting at 0) and ORDINAL(index) for one-based indexes (start counting at 1). With this index support, you can easily get the value of a field at a selected position in a struct.

WITH Items AS (SELECT STRUCT<INT64, STRING, BOOL>(23, "tea", FALSE) AS item_struct)
SELECT
  item_struct[0] AS field_index,            -- index starting at 0
  item_struct[OFFSET(0)] AS field_offset,   -- index starting at 0
  item_struct[ORDINAL(1)] AS field_ordinal  -- index starting at 1
FROM Items

/*-------------+--------------+---------------*
 | field_index | field_offset | field_ordinal |
 +-------------+--------------+---------------+
 | 23          | 23           | 23            |
 *-------------+--------------+---------------*/

More BigQuery features that are now GA

Finally, several BigQuery features have recently moved from preview to GA, and are now fully supported by Google Cloud. These include:

Drop column/rename column 
If you want to drop or rename a column, you can already run the zero-cost, metadata-only commands DROP COLUMN or RENAME COLUMN.

In GA, we have further extended the support to table copies and copy jobs. If you have a table with a column that was previously renamed or dropped, you can now make a copy of that table by using either a CREATE TABLE COPY statement or a copy job, with the most up-to-date schema information. 
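
A minimal sketch of the flow, using hypothetical table and column names:

-- Zero-cost, metadata-only schema changes.
ALTER TABLE my_dataset.sales DROP COLUMN legacy_flag;
ALTER TABLE my_dataset.sales RENAME COLUMN amt TO amount;

-- Copying the table now picks up the updated schema.
CREATE TABLE my_dataset.sales_copy
COPY my_dataset.sales;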

Case-insensitive string collation 
Today, you can compare or sort strings regardless of case sensitivity by specifying ‘und:ci’. This means [A, a] will be treated as equivalent characters and will precede [B, b] for string value operations. In GA, we have extended this support to aggregate functions (MIN, MAX, COUNT DISTINCT), creating views, materialized views, BI Engine, and many others. See more details here.
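
As a small hedged example (the table name is hypothetical), collation can be set when a column is created, after which comparisons and aggregates on that column ignore case:

-- Column-level case-insensitive collation.
CREATE TABLE my_dataset.products (
  name STRING COLLATE 'und:ci'
);

-- 'Apple' and 'apple' compare as equal, so COUNT(DISTINCT name) treats them as one value.
SELECT COUNT(DISTINCT name) AS distinct_names
FROM my_dataset.products;

-- Collation can also be applied ad hoc to an expression.
SELECT COLLATE('Apple', 'und:ci') = 'apple' AS is_equal;  -- returns true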

What’s next?

We will continue this journey focusing on building user-friendly SQL capabilities to help you load, analyze and manage data in BigQuery in the most efficient way. We would love to hear how you plan to use these features in your day to day. If you have any specific SQL features you want to use, please file a feature request here. To get started, try BigQuery for free.

Reducing BigQuery physical storage cost with new billing model

Introduction

BigQuery is a scalable petabyte-scale data warehouse, renowned for its efficient Structured Query Language (SQL) capabilities. While BigQuery offers exceptional performance, cost optimization remains a critical aspect for any cloud-based service. 

This blog post focuses on reducing BigQuery storage costs through the utilization of the newly introduced Physical Bytes Storage Billing (PBSB) as opposed to Logical Bytes Storage Billing (LBSB). PBSB is generally available as of July 5th, 2023. We work with organizations actively exploring this feature to optimize storage costs. 

Drawing from the extensive experience Google Cloud has built with Deloitte in assisting clients, this blog post will share valuable insights and recommendations to help customers to smoothly migrate to the PBSB model when designing storage to support BigQuery implementations. 

Design challenges

In today’s business landscape, we see large organizations accumulating extensive amounts of data, often measured in petabytes, within BigQuery data warehouses. This data is crucial for performing thorough business analysis and extracting valuable insights. However, the associated storage costs can be significant, sometimes exceeding millions of dollars annually. Consequently, minimizing these storage expenses has emerged as a challenge for many of our clients.

Solution

By default, when creating a dataset in BigQuery, the unit of consumption for storage billing is logical bytes. However, with the introduction of physical bytes storage billing, customers now have the option to choose this billing model. By choosing this billing model, customers can take advantage of the cost savings offered by the compression capabilities of physical bytes without compromising data accessibility or query performance.

To address the challenge of high storage costs on BigQuery data warehouses, we implement a two-step approach that leverages the newly introduced physical bytes storage billing option:

First approach: Conduct a BigQuery cost-benefit assessment between PBSB and LBSB. Google has provided an example of how to calculate the price difference at the dataset level using this query.
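
The full example lives in the documentation; the sketch below captures the same idea in simplified form, computing a per-dataset compression ratio from the TABLE_STORAGE_BY_PROJECT view (the region qualifier is illustrative, and datasets with a ratio well above 2 are the natural PBSB candidates).

-- Approximate compression ratio per dataset: logical bytes vs. physical bytes.
SELECT
  table_schema AS dataset_name,
  SUM(total_logical_bytes) / POW(1024, 3) AS logical_gib,
  SUM(total_physical_bytes) / POW(1024, 3) AS physical_gib,
  SAFE_DIVIDE(SUM(total_logical_bytes), SUM(total_physical_bytes)) AS compression_ratio
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE_BY_PROJECT
WHERE total_logical_bytes > 0
  AND total_physical_bytes > 0
GROUP BY dataset_name
ORDER BY compression_ratio DESC;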

Running the query in the example above returns, for each dataset, its logical and physical storage sizes, the resulting compression ratio, and the forecast monthly cost under each billing model.

In this example, the first datasets demonstrate an impressive active and long-term compression ratio, ranging from 16 to 25. As a result, there is a remarkable storage cost reduction of 8 times, leading to a substantial decrease in monthly costs from $70,058 to $8,558. 

However, for the last dataset in this test, close to 11 TB, or 96% of all active physical storage, is used for time travel data. This dataset is not suitable for PBSB. 

During the assessment, we observed the presence of “_session” and “_scripts” rows, which may impede CSV file downloads due to the 10 MB limit. The “_session” objects correspond to temporary tables generated within BigQuery sessions, while the “_scripts” objects pertain to intermediate objects produced as part of stored procedures or multi-statement queries. These objects are billed based on logical bytes and cannot be converted to physical bytes by users. To address this, customers can disregard them by modifying the query using this clause: 

WHERE total_logical_bytes > 0 AND total_physical_bytes > 0 AND table_schema NOT LIKE '_scripts%' AND table_schema NOT LIKE '_session%'. [2, 7, 8]

Second approach: Switch to physical storage billing for suitable datasets. Customers must remember that they will not be able to enroll datasets in physical storage billing until all flat-rate commitments for their organization are no longer active. 

The simplest way to update the billing model for a dataset is to use the BigQuery (BQ) update command and set the storage_billing_model flag to PHYSICAL.

For example: bq update -d --storage_billing_model=PHYSICAL PROJECT_ID:DATASET_ID

After changing the billing model for a dataset, it takes 24 hours for the change to take effect. Another factor to consider when optimizing storage cost is time travel.

Time travel allows customers to query updated or deleted data, restore deleted tables, or restore expired tables. The default time travel window covers the past seven days, and you can configure it using the BQ command-line tool to balance storage costs with your data retention needs. Here’s an example: 

bq update --dataset=true --max_time_travel_hours=48 PROJECT_ID:DATASET_NAME

This command sets the time travel window to 48 hours (2 days) for the specified dataset.

“The --max_time_travel_hours value must be an integer expressed in multiples of 24 (48, 72, 96, 120, 144, 168) between 48 (2 days) and 168 (7 days).” [5]

Considerations

Switching to the physical storage billing model has some pricing considerations.

Based on the BigQuery storage price list, customers should consider the following:

Based on BigQuery storage pricing, the unit price of physical storage billing is twice that of logical storage billing.

If the compression ratio is less than 2, customers will not benefit from PBSB for their datasets.

In LBSB, customers are not billed for bytes used for time travel storage; in PBSB, they are billed for time travel storage, and the same is true for fail-safe storage.

To ensure accurate assessment, it is advisable to conduct an evaluation of time travel storage utilization once BigQuery workload has reached a stable state and established a predictable pattern. This is important because the bytes utilized for time travel storage can vary over time.
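
One way to run that check, assuming the time travel and fail-safe byte columns of the TABLE_STORAGE view are available in your region, is a query along these lines:

-- Physical bytes attributable to time travel and fail-safe storage, per dataset.
SELECT
  table_schema AS dataset_name,
  SUM(active_physical_bytes) / POW(1024, 3) AS active_physical_gib,
  SUM(time_travel_physical_bytes) / POW(1024, 3) AS time_travel_gib,
  SUM(fail_safe_physical_bytes) / POW(1024, 3) AS fail_safe_gib
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE_BY_PROJECT
GROUP BY dataset_name
ORDER BY time_travel_gib DESC;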

Customers have the flexibility to configure the time travel window according to specific data retention requirements while considering storage costs. For instance, customers can customize the time travel window, adjusting it from the default 7 days to a shorter duration such as 2 days.

A fail-safe period of 7 days will be enforced, during which the time travel setting cannot be modified. However, once this period ends, customers have the flexibility to configure additional time travel days according to their preference. This configurable range extends from 2 to 7 days, allowing customers to establish an effective time travel window spanning between 9 and 14 days. If no action is taken, the default time travel window will remain set at 14 days.

Switching to physical bytes storage billing is possible; however, once you switch, you must wait 14 days before you can change the billing model again.

Let’s go building

Adopting BigQuery Physical Bytes Storage Billing (PBSB) presents a substantial opportunity for reducing storage costs within BigQuery. The process of assessing and transitioning to this cost-effective billing model is straightforward, allowing customers to maximize the benefits. We have provided comprehensive guidance on conducting assessments and making a seamless transition to the PBSB model. In our upcoming blog post, we will delve into leveraging the newly introduced BigQuery editions to further optimize BigQuery analysis costs from a compute perspective. Wishing you a productive and successful cost optimization journey! And as always, reach out to us for support on your cloud journey.

Special thanks to Dylan Myers (Google) and Enlai Wang (Deloitte) for contributing to this article.
