Stop, visualize and listen, our Looker hackathon is back with a brand new edition.
This December 5th, we are kicking off Looker Hackathon 2023, a virtual two day event for developers, innovators and data scientists to collaborate, build, learn and inspire each other, as you design new innovative applications, tools and data experiences on Looker and Looker Studio. The best and most exciting entry will be awarded the title of “Best Hack”.
At the event, you can expect to:
Meet and team up with your developer community and Google Cloud InnovatorsGain hands-on experience and learn about the latest Looker capabilitiesMeet and talk to Google Cloud engineers and staff, and possibly play some trivia tooTurn your idea into reality and have fun along the way
In this post, we’ll be showing how to manage BigQuery costs with budgets and custom quota – keep reading, or jump directly into tutorials for creating budgetsor setting custom quota!
Early in your journey to build or modernize on the cloud, you’ll learn that cloud services are often pay-as-you-go; and running analytics on BigQuery is no exception. While BigQuery does offer several pricing models, the default on-demand pricing model (the one most new users start with) charges for queries by the number of bytes processed.
This pricing structure has some major benefits: you only pay for the services you use, and avoid termination charges and up-front fees. However, the elastic nature of BigQuery means that it’s important to understand and take advantage of the tools available to help you stay on top of your spending and prevent surprises on your cloud bill.
Budgets and custom quotas are two powerful tools provided by Google Cloud that you can (and I’d argue you should!) use to manage BigQuery costs. So let’s dive into how each of these work and help you get started.
As your cloud usage grows and changes over time, your costs will change too. Budgets allow you to monitor all of your Google Cloud charges in one place, including BigQuery. They can track both your actual and forecasted spend, and alert you when you’re exceeding your defined budgets, which helps you to both avoid unexpected expenses and plan for growth.
Budgets can be configured for a Cloud Billing account (that can include more than one project linked to the billing account), or for individual projects. To manage budgets for a Cloud Billing account, you need the Billing Account Administrator or Billing Account Costs Manager role on the Cloud Billing account. To manage budgets for an individual project, you need the Project Owner or Project Editor role on the project.
Budgets can be created within the Billing Console in the Budgets & alerts page. At a high-level, you will define the following areas when creating a budget:
ScopeBudgets are a tool that span Google Cloud, and you can scope budgets to apply to the spend in an entire Cloud Billing account, or narrow the scope by filtering on projects, services, or labels. To create a budget focused on BigQuery spend, you can scope it to the BigQuery service. Note that this scope includes both BigQuery on-demand query usage and BigQuery storage.Budget amountThe budget amount can be a total that you specify, or you can base the budget amount on the previous calendar period’s spend.ActionsAfter setting the budget amount, you can set multiple threshold rules to trigger email notifications. Each threshold can be customized as a percentage of the total budget amount, and can be based on actual costs (as you’re charged for using services) or forecasted costs (as Google Cloud forecasts that you’re going to be spending a certain amount). Using forecasted costs can alert you before you actually spend and stay ahead of any issues!
Screen capture of creating a budget in the Billing Console
You have several options for creating a budget: you can use the Cloud Console (as shown in the above screenshot), the gcloud command-line tool, or the Cloud Billing API.
Once your budget is in place, email alert notifications will be sent when you hit (or are forecasted to hit) your budget!
Monitoring spending with budgets
In addition to any of the alert actions you choose when setting up a budget, you can monitor your spending against your budgets using the Cloud Billing dashboard or the Budget API. You can see how much of your budget has been consumed, which resources are contributing the most to your costs, and where you might be able to optimize your usage.
Sample screen capture of of budget report
Tips for using budgets
Budget email alerts are sent to users who have Billing Account Administrator and Billing Account User roles by default. While this is a great first step, you can also configure budgets to notify users through Cloud Monitoring, or you can have regular budget updates sent to Pub/Sub for full customization over how you want to respond to budget updates such as sending messages to Slack.With the Cloud Billing Budget API, you can view, create, and manage budgets programmatically at scale. This is especially useful if you’re creating a large number of budgets across your organization.While this blog post focuses on using budgets for BigQuery usage, budgets are a tool that can be used across Google Cloud, so you can use this tool to manage Cloud spend as a whole or target budgets for particular services or projects.
Custom quotas are a powerful feature that allow you to set hard limits on specific resource usage. In the case of BigQuery, quotas allow you to control query usage (number of bytes processed) at a project- or user-level. Project-level custom quotas limit the aggregate usage of all users in that project, while user-level custom quotas are separately applied to each user or service account within a project.
Custom quotas are relevant when you are using BigQuery’s on-demand pricing model, which charges for the number of bytes processed by each query. When you are using the capacity pricing model, you are charged for compute capacity (measured in slots) used to run queries, so limiting the number of bytes processed is less useful.
By setting custom quotas, you can control the amount of query usage by different teams, applications, or users within your organization, preventing unexpected spikes in usage and costs.
Note that quotas are set within a project, and you must have the Owner, Editor, or Quota Administrator role on that project in order to set quotas.
Custom quota can be set by heading to the IAM & Admin page of the Cloud console, and then choosing Quotas. This page contains hundreds of various quota, so use the filter functionality with Metric: bigquery.googleapis.com/quota/query/usage to help you zero in on the two quota options for BigQuery query usage:
Query usage per day <- this is the project-level quotaQuery usage per user per day <- this is the user-level quota
Screen capture of the BigQuery usage quotas in the Cloud Console
After selecting one or both quotas, click toEdit Quotas. Here you will define your daily limits for each quota in tebibytes (TiB), so be sure to make any necessary conversions.
Screen capture of setting new custom quota amounts in the Cloud Console
To set custom quotas for BigQuery, you can use the Cloud Console (as described above), the gcloud command-line tool, or the Service Usage API. You can also monitor your quotas and usage within the Quotas page or using the Service Usage API.
Screen capture of monitoring quota usage within the Quota page of the Cloud Console
Tips for using custom quotas
You may use either project-level or user-level of these quota options, or both in tandem. Used in tandem, usage will count against both quotas, and adhere to the stricter of the two limits.Once quota is exceeded, the user will receive a usageQuotaExceeded error and the query will not execute. Quotas are proactive, meaning, for example, you can’t run an 11 TB query if you have a 10 TB quota.Daily quotas reset at midnight Pacific Time.Separate from setting a custom quota, you can also set a maximum bytes billed for a specific query (say, one you run on a schedule) to limit query costs.
Differences between budgets and custom quotas
Now that you’ve learned more about budgets and custom quota, let’s look at them side-by-side and note some of their differences:
Their scope: Budgets are tied to a billing account (which can be shared across projects), while quotas are set for individual projects.What they track: Budgets are set for a specific cost amount, while quotas are set for specific resource or service usage.How they are enforced: Budgets track your costs and alert you when you’re exceeding your budget, while quotas enforce a hard limit on the amount of resources that can be used in a project and will return an error when a user/service tries to exceed the limit.
Tracking and analyzing your BigQuery costs will make you feel more at ease when running queries within the on-demand pricing model, and it can help you make informed decisions, optimize your costs, and maximize the value of your cloud spend.
As you scale your BigQuery environment, you may want to move your workloads to the BigQuery editions pricing model, which charges by the amount of capacity allocated to your workload, measured in slots (a unit of measure for BigQuery compute power) rather than per each query. This model also can provide discounted capacity for long term commitments. One of BigQuery’s unique features is the ability to combine the two different pricing models (on-demand and capacity) to optimize your costs.
Smart business owners are realizing that it is virtually impossible to succeed in 2023 without investing in AI or big data technology. A survey from surveyOpens shows that 91.7% of executives at top companies are increasing investments in big data. The explosion of content created with AI tools like ChatGPT and MidJourney has made it […]
Thirty years ago, businesses were starting to recognize that data was the future. However, they never imagined that big data technology would have the impact that we have witnessed in recent years. More companies are using big data to drive business decisions than ever before. However, many companies still neglect to have formal data strategies […]
Google Cloud BigQuery is a key service that helps you create a Data Warehouse that provides the scale and ease of querying large data sets. Let’s say that you have standardized on using BigQuery and have set up data pipelines to maintain the datasets. The next question would be to determine how best to make this data available to applications. APIs are often the way forward for this and what I was looking to experiment with is to consider a service that helps me create an API around my data sources (BigQuery in this case) and do it easily.
In this blog post, we shall see how to use Hasura, an open-source solution, that helped me create an API around my BigQuery dataset.
The reason to go with Hasura is the ease with which you can expose your domain data via an API. Hasura supports a variety of data sources including BigQuery, Google Cloud SQL and AlloyDB. You control the model, relationships, validation and authorization logic through metadata configuration. Hasura consumes this metadata to generate your GraphQL and REST APIs. It’s a low-code data to API experience, without compromising any of the flexibility, performance or security you need in your data API.
While Hasura is open-source, it also has fully managed offerings on various cloud providers including Google Cloud.
You need to have a Google Cloud Project. Do note down the Project Id of the project since we will need to use that later in the configuration in Hasura.
BigQuery dataset – Google Trends dataset
Our final goal is to have a GraphQL API around our BigQuery dataset. So what we need to have in place is a BigQuery dataset. I have chosen the Google Trends database that is made available in the Public Datasets program in BigQuery. This is an interesting dataset that makes available (both US and Internationally), the top 25 overall or top 25 rising queries from Google Trends from the past 30 days.
I have created a sample dataset in BigQuery in my Google Cloud project named ‘google_trends’ and have copied the dataset and the tables from the bigquery-public-data dataset. The tables are shown below:
Google Trends dataset
What we are interested in is the international_top_terms that helps me to see the trends across countries that are supported in the Google Trends dataset that has been made available.
The schema for the international_top_terms dataset schema is shown below:
International Top Terms table schema
A sample BigQuery query (Search terms from the previous day in India) that we eventually would like to expose over the GraphQL API is shown below:
If I run this query in the BigQuery workspace, I get the following result (screenshot below):
International Trends sample data
Great ! This is all we need for now from a BigQuery point of view. Remember you are free to use your own dataset if you’d like.
We will come to the Hasura configuration in a while, but before that, do note that the integration between Hasura and Google Cloud will require that we generate a service account with the right permissions. We will provide that service account to Hasura, so that it can invoke the correct operations on BigQuery to configure and retrieve the results.
Service account creation in Google Cloud is straightforward and you can do that from the Google Cloud Console → IAM and Admin menu option.
Create a Service account with a name and description.
Service Account Creation
In the permissions for the service account, ensure that you have the following Google Cloud permissions, specific to BigQuery:
Service Account Permissions
Once the service account is created, you will need to export this account via its credentials (JSON) file. Keep that file safely as we will need that in the next section.
This completes the Google Cloud part of the configuration.
You need to sign up with Hasura as a first step. Once you have signed it, click on New Project and then choose the Free Tier and Google Cloud to host the Hasura API Layer, as shown below. You will also need to select the Google Cloud region to host the Hasura service in and then click on the Create Project button.
Hasura Project Creation
Setting up the data connection
Once the project is created, you need to establish the connectivity between Hasura and Google Cloud and specifically in this case, set up the Data Source that Hasura needs to configure and talk to.
For this, visit the Data section as shown below. This will show that currently there are no databases configured i.e. Databases(0). Click on the Connect Database button.
Hasura Data Source creation
From the list of options available, select BigQuery and then click on Connect Existing Database.
Hasura BigQuery Data Source creation
This will bring up a configuration screen (not shown here), where you will need to entire the service account, Google Project Id and BigQuery Dataset name.
Create an environment variable in the Hasura Settings that contains your Service Account Key (JSON file contents). A sample screenshot from my Hasura Project Settings is shown below. Note that the SERVICE_ACCOUNT_KEY variable below has the value of the JSON Key contents.
Hasura Project Settings
Coming back to the Database Connection configuration, you will see a screen as shown below. Fill out the Project Id and Dataset value accordingly.
Hasura BigQuery Datasource configuration
Once the data connection is successfully set up, you can now mark which tables need to be tracked. Go to the Datasource settings and you will see that Hasura queried the metadata to find the tables in the dataset. You will see the tables listed as shown below:
Hasura BigQuery Datasource tables
We select the table that we are interested in tracking i.e. select it and then click on the Track button.
Hasura BigQuery Datasource table tracking
This will mark the table as tracked and we will now be able to go to the GraphQL Test UI to test out the queries.
The API tab provides us with a nice Explorer UI where you can build out the GraphQL query in an intuitive manner.
One of the trickiest things for businesses to navigate in the age of social media is the customer complaint. On one hand, companies (especially startups) should take customer concerns into account when considering improvements or design changes to a product. Let’s take a look at why it matters and possible motivations for malicious complaints, and […]
Data analytics technology has helped change the future of modern business. The ecommerce sector is among those most affected by advances in analytics. We have previously pointed out that a number of ecommerce sites are using data analytics to optimize their business models. Therefore, it should be no surprise that the market for data analytics […]
AI technology has been one of the most disruptive technological changes of the past couple of years. One forecast reports that the market for AI will be worth over $594 billion by 2032. The growing number of people using AI services such as ChatGPT is a testament to how influential AI has become. Last month, […]
Enterprises are generating data at an exponential rate, spanning traditional structur transactional data, semi-structured data like JSON and unstructured data like images and audio. The diversity and volume of data presents complex architectural challenges for data processing, data storage and query engines, requiring developers to build custom transformation pipelines to deal with semi-structured and unstructured data.
In this post we will explore the architectural concepts that power BigQuery’s support for semi-structured JSON, which eliminates the need for complex preprocessing and provides schema flexibility, intuitive querying and the scalability benefits afforded to structured data. We will review the storage format optimizations, performance benefits afforded by the architecture and finally discuss how they affect the billing for queries that use JSON paths.
Integration with the Capacitor File Format
Capacitor, a columnar storage format, is the backbone of BigQuery’s underlying storage architecture. Based on over a decade of research and optimization, this format stores exabytes of data, serving millions of queries. Capacitor is optimized for storing structured data. Capacitor uses techniques like Dictionary, Run Length Encoding (RLE), Delta encoding etc. to store the column values optimally. To maximize RLE, it also employs record-reordering. Usually, row order in the table does not have significance, so Capacitor is free to permute rows to
improve RLE effectiveness. Columnar storage lends itself to block oriented vectorized processing which is employed by an embedded expression library.
To natively support semi-structured formats like JSON, we developed the next generation of the BigQuery Capacitor format optimized for sparse semi-structured data.
On ingestion, JSON data is shredded into virtual columns to the extent possible. JSON keys are typically only written once per column instead of once per row. Other non data characters such as colons and whitespace are not part of the data stored in the columns. Placing the values in columns allows us to apply the same encodings that are used for structured data such as Dictionary encoding, Run Length Encoding and Delta encoding. This greatly reduces storage and thus IO costs at query time. Additionally, JSON nulls and arrays are natively understood by the format leading to optimal storage of virtual columns. Fig. 1 shows the virtual columns for records R1 and R2. The records are fully shredded into nested columns.
Figure 1: Shows the virtual columns for given records R1 and R2
BigQuery’s native JSON data type maintains the nested structure of JSON by understanding JSON objects, arrays, scalar types, JSON nulls (‘null’) and empty arrays
Capacitor, the underlying file-format of the native JSON data type, uses record-reordering to keep similar data and types together. The objective of record-reordering is to find an optimal layout where Run Length Encoding across the rows is maximized resulting in smaller virtual columns. In the event of a particular key having different types, say integer and string across a range of rows in the file, record-reordering results in grouping the rows with the string data type together and the rows with the integer type together resulting in run length encoded spans of missing values in both virtual columns thus producing smaller columns. To illustrate the column sizes, Fig. 2 shows a single key `source_id` that is either an integer or a string across the source JSON data. This produces two virtual columns based on the types. It showcases how reordering and applying RLE results in small virtual columns. For simplicity, we use `kMissing` to denote a missing value.
Figure 2 : Shows the effect of row reordering on column sizes.
Capacitor was specifically designed to handle well structured datasets. This was challenging for JSON data which takes a variety of shapes with a mix of types. Here are some of the challenges that we overcame while building the next generation of Capacitor that natively supported the JSON data type.
Adding/Removing keys JSON keys are treated as optional elements thus marking them as missing in the rows that don’t have the keys.Changing Scalar Types Keys that change scalar types such as string, int, bool and float across rows are written into distinct virtual columns.Changing Types with non-scalar types For non-scalar values such as object and array, the values are stored in an optimized binary format that’s efficient to parse.
At ingestion time, after shredding the JSON data into virtual columns, the logical size of each virtual column is calculated based on the data size [https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#data_type_sizes] for specific types such as INT64, STRING, BOOL. Thus the size of the entire JSON column is the sum of the data sizes of each virtual column stored in it. This cumulative size is stored in metadata and becomes the upper bound logical estimated size when any query touches the JSON column.
These techniques produce individual virtual columns that are substantially smaller when compared to the original JSON data. For example, storing a migrated version of bigquery-public-data.hacker_news.full discussed below as a JSON column compared to STRING, leads to a direct saving of 25% for logical uncompressed bytes.
Improved Query Performance with JSON native data types
With the JSON data stored in a STRING column, if you wanted to filter on certain JSON paths or only project specific paths, the entire JSON STRING row would have to be loaded from storage, decompressed and then each filter and projection expression evaluated one row at a time.
In contrast, with the native JSON data type, only the necessary virtual columns are processed by BigQuery. To make projections and filter operations as efficient as possible, we added support for compute and filter pushdown of JSON operations. Projections and filter operations are pushed down to the embedded evaluation layer that can perform the operations in a vectorized manner over the virtual columns making these very efficient in comparison to the STRING type.
When the query is run, customers are only charged based on the size of the virtual columns that were scanned to return the JSON paths requested in the SQL query. For example, in the query shown below, if payload is a JSON column with keys `reference_id` and `id`, only the specific virtual columns representing the JSON keys namely `reference_id` and `id` are scanned across the data.
code_block<ListValue: [StructValue([(‘code’, ‘SELECT payload.reference_id FROM tablernWHERE Safe.INT64(payload.id) = 21’), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3ee5ea1a7610>)])]>
Note: The estimate always shows the upper bound of the total JSON size rather than specific information about the virtual columns being queried.
With BigQuery Editions, optimized virtual columns lend to queries that require far less IO and CPU in comparison to storing the JSON string unchanged since you’re scanning specific columns rather than loading the entire JSON blob and extracting the paths from the JSON string.
With the new type, BigQuery is now able to process just the paths within the JSON that are requested by the SQL query. This can substantially reduce query cost.
For instance running a query to project a few fields such as `$.id` and `$.parent` from a migrated version of the public dataset `biquery-public-data.hacker_news.full` containing both STRING and JSON type of the same data shows a reduction of processed bytes from 18.28GB to 514MB which is a 97% reduction.
The following query was used for creating a table from the public dataset that contains both a STRING column in `str_payload` and a JSON column in `json_payload` and saved to a destination table.
code_block<ListValue: [StructValue([(‘code’, ‘– Saved to a –destination_table my_project.hacker_news_json_strrnSELECT TO_JSON(t) AS json_payload, TO_JSON_STRING(t) AS str_payload FROM `bigquery-public-data.hacker_news.full` trnrn– Processed bytes for `str_payload.id` 18.28 GBrnSELECT JSON_QUERY(str_payload, “$.id”) FROM `my_project.hacker_news_json_and_str`rnWHERE JSON_VALUE(str_payload, “$.parent”) = “11787862” rnrnrn– processed bytes for `json_payload.id` 514 MBrnSELECT json_payload.id FROM `my_project.hacker_news_json_str`rnWHERE LAX_INT64(json_payload.parent) = 11787862’), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3ee5ea1a7310>)])]>
The improvements to the Capacitor file format have enabled the flexibility required by today’s semi-structured data workloads. Shredding the JSON data into virtual columns along with enhancements such as compute and filter pushdown has shown promising results. Since the JSON data is shredded, only the virtual columns matching the requested JSON paths are scanned by BigQuery. Compute and filter pushdowns make it possible for the runtime to do work closest to the data evaluation layer in a vectorized manner thus exhibiting better performance. Lastly these choices have made it possible to improve compression and reduce the IO/CPU usage (smaller columns = less IO) as a result, reducing the slot usage for fixed-rate reservations and the bytes billed for on-demand usage.
Beyond the storage and query optimizations, we’ve also invested in programmability and ease of use by providing a range of JSON SQL Functions. Also see the latest blog post about the new JSON SQL functions that were released recently.
Convert to the JSON native data type in BigQuery today!
To convert your table with a STRING JSON column into a table with a native JSON column, run the following query and save the results to a new table, or select a destination table.
code_block<ListValue: [StructValue([(‘code’, ‘SELECT * EXCEPT (payload), SAFE.PARSE_JSON (payload) as json_payloadrnFROM <table>’), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3ee5ea1a7a90>)])]>
Looker Studio supports self-serve analytics for ad hoc data, and together with Looker, contributes to the more than 10 million users who access the Looker family of products each month. Today, we are introducing new ways for analysts to provide business users with options to explore data and self-serve business decisions, expanding ways all our users can analyze and explore data — leading to faster and more informed decisions.
Introducing personal report links
Business users often leverage shared dashboards from data analysts, which contain key company metrics and KPIs, as a starting point and want to explore beyond the curated analysis to arrive at more specific insights for their own data needs. The introduction of personal reports in Looker Studio enables this activity, delivering a private sandbox for exploration so users can self-serve their own questions and find insights faster – without modifying the original curated report.
Whether you share a report link in group chats or direct messages, an individual copy is created for each user that opens it so that everyone gets their own personal report.
Personal Looker Studio reports are designed to be ephemeral, meaning you don’t need to worry about creating unwanted content, but if you land on valuable insights that you want to keep, you can save and share these reports with new links, separate from the original report you built from.
You can learn more about how personal reports work and how to use them in our Help Center.
Looker Studio Personal Link Visual
Automated report updates
Your analysis and insights are only as good as the freshness of your reports. Looker Studio users can now enable their reports to auto-refresh data at a predefined cadence, so critical business decisions are based on current and updated information.
To learn more about how auto-refresh works, including details on how it works with cache, presentation mode, and existing data freshness settings, visit our Help Center.
Looker Studio Auto refresh feature Visual
Faster filtering in reports
Quick filters enable powerful exploration to slice data and uncover hidden patterns and insights within the context of your report. Quick filters don’t affect other users’ views, so whether you are exploring in a shared or personal report, your unique view is only shared once you are ready. The filter bar also gives you a complete picture of whether applied filters originate from interactive cross-chart filtering or quick filters.
Learn more about how to add quick filters in reports in our Help Center.
Looker Studio Quick filters and filter bar feature Visual
Configuring multiple filters and charts for exploration can quickly add to the query volume, even with presence of a cache. We’ve heard from analysts that they want better control over running queries, so they can optimize query volume and, thus, query costs.
We have added the ability to pause updates, giving you the flexibility to fully configure chart elements like fields, filters, parameters, sorting, and calculated formulas before running any data updates. You can then simply resume updates to see the updated data. Pausing updates does not prevent any style changes, so you can continue to modify design elements and other detailed styles and formatting without running a single query. Learn more about this feature in our Help Center.
The new pause report updates feature in Looker Studio has meaningfully improved the report creation experience. Asset producers can build and test reports without wasting database resourcing waiting for data to reload. Caroline Bollinger BI Tooling Product, Wayfair
View underlying data
Data accuracy is one thing — being able to see its detail is another. As analysts configure charts to build reports and design information hierarchy, previewing the underlying data is important for understanding context and seeing what data is available and its structure so you can make the best decisions about what to include in your analysis. It’s also handy when troubleshooting or customizing your reports.
This feature allows analysts to preview all the data that appears in a chart, including the primary dimensions, breakdown dimensions, and metrics. Learn more about how to view underlying data in our Help Center.
Looker Studio Data preview feature Visual
With this collection of updates, Looker Studio users can now easily know the data they share is up-to-date, inspect it in detail, rapidly create filters, and share personal links to reports. The goal remains, as always, to empower users to make smart and impactful decisions based on their enterprise data. To stay on top of all our latest features, view our release notes. Access Looker Studio for free and learn more about Looker Studio Pro.