Learn how BI Engine enhances BigQuery query performance

Learn how BI Engine enhances BigQuery query performance

Learn how BI Engine enhances BigQuery query performance

BigQuery BI Engine is a fast, in-memory analysis service that lets users analyze data stored in BigQuery with rapid response times and with high concurrency to accelerate certain BigQuery SQL queries. BI Engine caches data instead of query results, allowing different queries over the same data to be accelerated as you look at different aspects of the data. By using BI Engine with BigQuery streaming, you can perform real-time data analysis over streaming data without sacrificing write speeds or data freshness.

​​BI Engine architecture

The BI Engine SQL interface expands BI Engine support to any business intelligence (BI) tool that works with BigQuery such as Looker, Tableau, Power BI, and custom applications to accelerate data exploration and analysis. With BI Engine, you can build rich, interactive dashboards and reports in BI tool of your choice without compromising performance, scale,security, or data freshness. To learn more about the BI Engine SQL interface, please refer here.

The following diagram shows the updated architecture for BI Engine:

Shown here is one simple example of a Looker dashboard that was created with BI Engine capacity reservation (top) versus the same dashboard without any reservation (bottom).This dashboard is created from the BigQuery public dataset `bigquery-public-data.chicago_taxi_trips.taxi_trips`  to analyze the Sum of total_trip cost and logarithmic average of total trip cost over time.

total_trip cost for past 5 years

BI Engine will cache the minimum amount of data possible to resolve a query to maximize the capacity of the reservation. Running business intelligence on big data can be tricky.

Here is a query against the same public dataset, ‘bigquery-public-data.chicago_taxi_trips.taxi_trips,’ to demonstrate BI Engine performance with/without reserved BigQuery slots.

Example Query

code_block[StructValue([(u’code’, u”SELECTrn (DATE(trip_end_timestamp , ‘America/Chicago’)) AS trip_end_timestamp_date,rn (DATE(trip_start_timestamp , ‘America/Chicago’)) AS trip_start_timestamp_date,rn COALESCE(SUM(CAST(trip_total AS FLOAT64)), 0) AS sum_trip_total,rn CONCAT (‘Hour :’,(DATETIME_DIFF(trip_end_timestamp,trip_start_timestamp,DAY) * 1440) ,’ , ‘,’Day :’,(DATETIME_DIFF(trip_end_timestamp,trip_start_timestamp,DAY)) ) AS trip_time,rn CASE WHENrn ROUND(fare + tips + tolls + extras) = trip_total THEN ‘Tallied’rn WHEN ROUND(fare + tips + tolls + extras) < trip_total THEN ‘Tallied Less’rn WHEN ROUND(fare + tips + tolls + extras) > trip_total THEN ‘Tallied More’rn WHEN (ROUND(fare + tips + tolls + extras) = 0.0 AND trip_total = 0.0) THEN ‘Tallied 0’rn ELSE ‘N/A’ END AS trip_total_tally,rn REGEXP_REPLACE(TRIM(company),’null’,’N/A’) as company,rn CASE WHENrn TRIM(payment_type) = ‘Unknown’ THEN ‘N/A’rn WHEN payment_type IS NULL THEN ‘N/A’ ELSE payment_type END AS payment_typern FROMrn `bigquery-public-data.chicago_taxi_trips.taxi_trips`rn GROUP BYrn 1,rn 2,rn 4,rn 5,rn 6,rn 7rnORDER BYrn 1 DESC,rn 2 ,rn 4 DESC,rn 5 ,rn 6 ,rn 7rnLIMIT 5000″), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3ee1a4db3b10>)])]

The above query was run with the below combinations: 

Without any BigQuery slot reservation/BI Engine reservation,  the query observed 7.6X more average slots and 6.3X more job run time compared to the run with reservations (last stats in the result). 

Without BI Engine reservation but with BigQuery slot reservation, the query observed 6.9X more average slots and 5.9X more job run time compared to the run with reservations (last stats in the result). 

With BI Engine reservation and no BigQuery slot reservation, the query observed 1.5 more average slots and the job completed in sub-seconds (868 ms). 

With both BI Engine reservation and BigQuery slot reservation, only 23 average slots were used and the job completed in sub-second as shown in results.This is the most cost effective way in regards to average slots and run time compared to all other options (23.27 avg_slots , 855 ms run time).

INFORMATION_SCHEMA is a series of views that provide access to metadata about datasets, routines, tables, views, jobs, reservations, and streaming data. You can query the INFORMATION_SCHEMA.JOBS_BY_* view to retrieve real-time metadata about BigQuery jobs. This view contains currently running jobs, and the history of jobs completed in the past 180 days.

Query to determine bi_engine_statistics and number of slots. More schema information can be found here.

code_block[StructValue([(u’code’, u”SELECTrn project_id,rn job_id,rn reservation_id,rn job_type,rn TIMESTAMP_DIFF(end_time, creation_time, MILLISECOND) AS job_duration_mseconds,rn CASErn WHEN job_id = ‘bquxjob_54033cc8_18164d54ada’ THEN ‘YES_BQ_RESERV_NO_BIENGINE’rn WHEN job_id = ‘bquxjob_202f17eb_18149bb47c3’ THEN ‘NO_BQ_RESERV_NO_BIENGINE’rn WHEN job_id = ‘bquxjob_404f2321_18164e0f801’ THEN ‘YES_BQ_RESERV_YES_BIENGINE’rnWHEN job_id = ‘bquxjob_48c8910d_18164e520ac’ THEN ‘NO_BQ_RESERV_YES_BIENGINE’ ELSE ‘NA’ END as query_method,rn bi_engine_statistics,rn — Average slot utilization per job is calculated by dividingrn– total_slot_ms by the millisecond duration of the jobrn SAFE_DIVIDE(total_slot_ms,(TIMESTAMP_DIFF(end_time, start_time, MILLISECOND))) AS avg_slotsrnFROMrnregion-us.INFORMATION_SCHEMA.JOBS_BY_PROJECTrnwhere creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 80 DAY) AND CURRENT_TIMESTAMP()rnAND end_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY) AND CURRENT_TIMESTAMP()rnANd job_id in (‘bquxjob_202f17eb_18149bb47c3′,’bquxjob_54033cc8_18164d54ada’,’bquxjob_404f2321_18164e0f801′,’bquxjob_48c8910d_18164e520ac’)rnORDER BY avg_slots DESC”), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3ee18b949590>)])]

From the observation, the most effective way of improving performance  for BI queries is to use BI ENGINE reservation along with BigQuery slot reservation.This will increase query performance, throughput and also utilizes less number of slots. Reserving BI Engine capacity will let you save on slots in your projects.

BigQuery BI Engine optimizes the standard SQL functions and operators when connecting business intelligence (BI) tools to BigQuery. Optimized SQL functions and operators for BI Engine are found here.

Monitor BI Engine with Cloud Monitoring

BigQuery BI Engine integrates with Cloud Monitoring so you can monitor BI Engine metrics and configure alerts.

For information on using Monitoring to create charts for your BI Engine metrics, see Creating charts in the Monitoring documentation.

We ran the same query without BI engine reservation and noticed 15.47 GB were processed.

After BI Engine capacity reservation, in Monitoring under BIE Reservation Used Bytes dashboard we got a compression ratio of ~11.74x (15.47 GB / 1.317 MB). However compression is very data dependent, primarily compression depends on the data cardinality. Customers should run tests on their data to determine their compression rate.

Monitoring metrics ‘Reservation Total Bytes’ gives information about the BI engine capacity reservation whereas ‘Reservation Used Bytes’ gives information about the total used_bytes. Customers can make use of these 2 metrics to come up with the right capacity for reservation. 

When a project has BI engine capacity reserved, queries running in BigQuery will use BI engine to accelerate the compatible subquery performance.​​The degree of acceleration of the query falls into one of the below mentioned modes:

BI Engine Mode FULL – BI Engine compute was used to accelerate leaf stages of the query but the data needed may be in memory or may need to be scanned from a disk. Even when BI Engine compute is utilized, BQ slots may also be used for parts of the query. The more complex the query,the more slots are used.This mode executes all leaf stages in BI Engine (and sometimes all stages).

BI Engine Mode PARTIAL – BI Engine accelerates compatible subqueries and BigQuery processes the subqueries that are not compatible with BI Engine.This mode also provides bi-engine-reason for not using BI Engine mode fully.This mode executes some leaf stages in BI Engine and rest in BigQuery.

BI Engine Mode DISABLED – When BI Engine process subqueries that are not compatible for acceleration, all leaf stages will get processed in BigQuery. This mode also provides bi-engine-reason for not using BI Engine mode fully/partially.

Note that when you purchase a flat rate reservation, BI Engine capacity (GB) will be provided as part of the monthly flat-rate price. You can get up to 100 GB of BI Engine capacity included for free with a 2000-slot annual commitment. As BI Engine reduces the number of slots processed for BI queries, purchasing less slots by topping up little BI Engine capacity along with freely offered capacity might suffice your requirement instead of going in for more slots!

References

bi-engine-introbi-engine-reserve-capacity streaming-apibi-engine-sql-interface-overview bi-engine-pricing bi-engine-sql-interface-overview 

To learn more about how BI Engine and BigQuery can help your enterprise, try out listed Quickstarts page 

bi-engine-data-studiobi-engine-looker Bi-engine-tableau

Related Article

Introducing Firehose: An open source tool from Gojek for seamless data ingestion to BigQuery and Cloud Storage

The Firehose open source tool allows Gojek to turbocharge the rate it streams its data into BigQuery and Cloud Storage.

Read Article

Source : Data Analytics Read More