BigQuery is capable of some truly impressive feats, be it scanning billions of rows based on a regular expression, joining large tables, or completing complex ETL tasks with just a SQL query. One advantage of BigQuery (and SQL in general) is its declarative nature. Your SQL states your requirements, but the system is responsible for figuring out how to satisfy that request.
However, this approach also has its flaws – namely the problem of understanding intent. SQL represents a conversation between the author and the recipient (BigQuery, in this case). And factors such as fluency and translation can drastically affect how faithfully an author can encode their intent, and how effectively BigQuery can convert the query into a response.
In this week’s BigQuery Admin Reference Guide post, we’ll be providing a more in-depth view of query processing. Our hope is that this information will help developers integrating with BigQuery, practitioners looking to optimize queries, and administrators seeking guidance to understand how reservations and slots impact query performance.
A refresher on architecture
Before we go into an overview of query processing, let’s revisit BigQuery’s architecture. Last week, we spoke about BigQuery’s native storage on the left-hand side. Today, we’ll be focusing on Dremel, BigQuery’s query engine. Note that today we’re talking about BigQuery’s standard execution engine; BI Engine represents another query execution engine available for fast, in-memory analysis.
As you can see from the diagram, Dremel is made up of a cluster of workers. Each of these workers executes a part of a task independently and in parallel. BigQuery uses a distributed memory shuffle tier to store intermediate data produced from workers at various stages of execution. The shuffle leverages some fairly interesting Google technologies, such as our very fast petabit network and RAM wherever possible. Each shuffled row can be consumed by workers as soon as it’s created by the producers.
This makes it possible to execute distributed operations in a pipeline. Additionally, if a worker has partially written some of its output and then terminated (for example, the underlying hardware suffered a power event), that unit of work can simply be re-queued and sent to another worker. A failure of a single worker in a stage doesn’t mean all the workers need to re-run.
When a query is complete, the results are written out to persistent storage and returned to the user. This also enables us to serve up cached results the next time that query executes.
Overview of query processing
Now that you understand the query processing architecture, we’ll run through query execution at a high level, to see how each step comes together.
First, there’s some level of API processing that must occur: authenticating and authorizing the request, plus building and tracking associated metadata such as the SQL statement, cloud project, and query parameters.
Decoding the query text: Lexing and parsing
Lexing and parsing is a common task for programming languages, and SQL is no different. Lexing refers to the process of scanning an array of bytes (the raw SQL statement) and converting that into a series of tokens. Parsing is the process of consuming those tokens to build up a syntactical representation of the query that can be validated and understood by BigQuery’s software architecture.
If you’re super interested in this, we recommend checking out the ZetaSQL project, which includes the open source reference implementation of the SQL engine used by BigQuery and other GCP projects.
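To make lexing concrete, here is a toy tokenizer, purely illustrative and far simpler than ZetaSQL’s real lexer; the token categories and patterns are my own simplifications:

```python
import re

# A minimal SQL lexer sketch: scan the raw statement and emit
# (token_type, text) pairs. Real lexers handle many more token types.
TOKEN_SPEC = [
    ("KEYWORD", r"\b(?:SELECT|FROM|WHERE|LIKE|COUNT)\b"),
    ("IDENT",   r"[A-Za-z_][A-Za-z0-9_]*"),
    ("STRING",  r'"[^"]*"'),
    ("SYMBOL",  r"[(),.*%]"),
    ("WS",      r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC), re.IGNORECASE)

def lex(sql):
    tokens = []
    for m in MASTER.finditer(sql):
        if m.lastgroup != "WS":  # drop whitespace tokens
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(lex("SELECT COUNT(*) FROM trips"))
```

A parser would then consume this token stream to build the syntax tree that later stages validate and plan against.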
Referencing resources: Catalog resolution
SQL commonly contains references to entities retained by the BigQuery system – such as tables, views, stored procedures and functions. For BigQuery to process these references, it must resolve them into something more comprehensible. This stage helps the query processing system answer questions like:
Is this a valid identifier? What does it reference?
Is this entity a managed table, or a logical view?
What’s the SQL definition for this logical view?
What columns and data types are present in this table?
How do I read the data present in the table? Is there a set of URIs I should consume?
Resolutions are often interleaved through the parsing and planning phases of query execution.
Building a blueprint: Query planning
As a more fully-formed picture of the request is exposed via parsing and resolution, a query plan begins to emerge. Many techniques exist to refactor and improve a query plan to make it faster and more efficient. Algebraization, for example, converts the parse tree into a form that makes it possible to refactor and simplify subqueries. Other techniques can be used to optimize things further, moving tasks like pruning data closer to data reads (reducing the overall work of the system).
Another element of planning is adapting the query to run as a set of distributed execution tasks. As we mentioned at the beginning of this post, BigQuery leverages large pools of query computation nodes, or workers. So, it must coordinate how different stages of the query plan share data through reading and writing from storage, and how to stage temporary data within the shuffle system.
Doing the work: Query execution
Query execution is simply the process of working through the query stages in the execution graph, towards completion. A query stage may have a single unit of work, or it may be represented by many thousands of units of work, like when a query stage reads all the data in a large table composed of many separate columnar input files.
Query management: scheduling and dynamic planning
Besides the workers that perform the work of the query plan itself, additional workers monitor and direct the overall progress of work throughout the system. Scheduling is concerned with how aggressively work is queued, executed and completed.
However, an interesting property of the BigQuery query engine is that it has dynamic planning capabilities. A query plan often contains ambiguities, and as a query progresses it may need further adjustment to ensure success. Repartitioning data as it flows through the system is one example of a plan adaptation that may be added, as it helps ensure that data is properly balanced and sized for subsequent stages to consume.
Finishing up: finalizing results
As a query completes, it often yields output artifacts in the form of results, or changes to tables within the system. Finalizing results includes the work to commit these changes back to the storage layer. BigQuery also needs to communicate back to you, the user, that the system is done processing the query. The metadata around the query is updated to note the work is done, or the error stream is attached to indicate where things went wrong.
Understanding query execution
Armed with our new understanding of the life of a query, we can dive more deeply into query plans. First, let’s look at a simple plan. Here, we are running a query against a public BigQuery dataset to count the total number of Citi Bike trips that began at stations with “Broadway” in the name.
SELECT COUNT(*)
FROM `bigquery-public-data.new_york_citibike.citibike_trips`
WHERE start_station_name LIKE "%Broadway%"
Now let’s consider what is happening behind the scenes when BigQuery processes this query.
First, a set of workers access the distributed storage to read the table, filter the data, and generate partial counts. Next, these workers send their counts to the shuffle.
The second stage reads those shuffle records as its input, and sums them together. It then writes the output into a single file, which becomes accessible as the result of the query.
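The two stages above can be sketched as a toy simulation (illustrative only, not Dremel’s actual code): stage one workers each filter their shard and emit a partial count into the shuffle, and stage two sums the shuffle records.

```python
from concurrent.futures import ThreadPoolExecutor

def stage_one_worker(shard):
    """Read one shard of station names, filter, and emit a partial count."""
    return sum(1 for name in shard if "Broadway" in name)

def run_query(shards):
    # Stage 1: workers run independently and in parallel; each writes
    # one partial count into the shuffle tier.
    with ThreadPoolExecutor(max_workers=4) as pool:
        shuffle = list(pool.map(stage_one_worker, shards))
    # Stage 2: a single worker consumes the shuffle records and sums them.
    return sum(shuffle)

# Hypothetical shards of the start_station_name column:
shards = [
    ["Broadway & W 24 St", "Central Park S"],
    ["Broadway & E 14 St", "W 52 St", "Broadway & W 58 St"],
]
print(run_query(shards))  # 3
```

Note that the only communication between the two stages happens through the shuffle data, which mirrors how the real workers interact.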
You can clearly see that the workers don’t communicate directly with one another at all; they communicate through reading and writing data. After running the query in the BigQuery console, you can see the execution details and gather information about the query plan (note that the execution details shown below may be slightly different from what you see in the console, since this data changes).
Note that you can also get execution details from the information_schema tables or the Jobs API. For example, by running:
SELECT *
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_id = "bquxjob_49c5bc47_17ad3d7778f"
Interpreting the query statistics
Query statistics include information about how many work units existed in a stage, as well as how many were completed. For example, inspecting the result of the information schema query used earlier, we can get the following:
Input and output
Using the parallel_inputs field, we can see how finely divided the input is. In the case of a table read, it indicates how many distinct file blocks were in the input. In the case of a stage that reads from shuffle, the number of inputs tells us how many distinct data buckets are present. Each of these represents a distinct unit of work that can be scheduled independently. So, in our case, there are 57 different columnar file blocks in the table.
In this representation, we can also see the query scanned more than 33 million rows while processing the table. The second stage read 57 rows, as the shuffle system contained one row for each input from the first stage.
It’s also perfectly valid for a stage to finish with only a subset of the inputs processed. Cases where this happens tend to be execution stages where not all the inputs need to be processed to satisfy what output is needed; a common example of this might be a query stage that consumes part of the input and uses a LIMIT clause to restrict the output to some smaller number of rows.
It is also worth exploring the notion of parallelism. Having 57 inputs for a stage doesn’t mean the stage won’t start until there are 57 workers (slots) available. It means there’s a queue of work with 57 elements to work through. You can process that queue with a single worker, in which case you’ve essentially got serial execution. If you have multiple workers, you can process it faster, as they work through the units independently. However, more than 57 slots doesn’t do anything for you; the work cannot be more finely distributed.
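A rough back-of-the-envelope model of that queue (an illustration, not a scheduler implementation): with 57 equal units of work, the number of scheduling “waves” scales with the ceiling of 57 divided by the available slots, and slots beyond 57 sit idle.

```python
import math

def stage_waves(work_units, slots):
    """Number of waves needed to drain the work queue:
    each wave runs up to `slots` units in parallel."""
    return math.ceil(work_units / slots)

print(stage_waves(57, 1))    # 57 waves: serial execution
print(stage_waves(57, 8))    # 8 waves
print(stage_waves(57, 57))   # 1 wave
print(stage_waves(57, 100))  # still 1 wave: extra slots don't help
```

Real scheduling is more dynamic than this (units vary in size and workers are reassigned as they finish), but the upper bound on useful parallelism holds.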
Aside from reading from native distributed storage, and from shuffle, it’s also possible for BigQuery to perform data reads and writes from external sources, such as Cloud Storage (as we discussed in our earlier post). In such cases the notion of parallel access still applies, but it’s typically less performant.
BigQuery communicates resource usage through a computational unit known as a slot. It’s simplest to think of a slot as something like a virtual CPU: a measure of the number of workers available or used. When we talk about slots, we’re talking about overall computational throughput, or rate of change. For example, a single slot gives you the ability to make one slot-second of compute progress per second. As we just mentioned, having fewer workers, or fewer slots, doesn’t mean that a job won’t run. It simply means that it may run slower.
In the query statistics, we can see the amount of slot_ms (slot-milliseconds) consumed. If we divide this number by the amount of milliseconds it took for the query stage to execute, we can calculate how many fully saturated slots this stage represents.
SELECT job_stages.name, job_stages.slot_ms/(job_stages.end_ms - job_stages.start_ms) AS full_slots
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT, UNNEST(job_stages) AS job_stages
WHERE job_id = "bquxjob_49c5bc47_17ad3d7778f"
This information is helpful, as it gives us a view of how many slots are being used on average across different workloads or projects – which can be helpful for sizing reservations (more on that soon). If you see areas where there is a higher number of parallel inputs compared to fully saturated slots, that may represent a query that will run faster if it had access to more slots.
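As a concrete, hypothetical example of the calculation above (the numbers and function name are invented for illustration): a stage that consumed 12,000 slot-ms over a 3,000 ms wall-clock window averaged 4 fully saturated slots.

```python
def fully_saturated_slots(slot_ms, start_ms, end_ms):
    """Average number of slots the stage kept busy over its lifetime."""
    return slot_ms / (end_ms - start_ms)

# Hypothetical stage statistics, not from a real job:
print(fully_saturated_slots(slot_ms=12_000, start_ms=1_000, end_ms=4_000))  # 4.0
```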
Time spent in phases
We can also see the average and maximum time each of the workers spent in the wait, read, compute and write phase for each stage of the query execution:
Wait Phase: the engine is waiting for either workers to become available or for a previous stage to start writing results that it can begin consuming. A lot of time spent in the wait phase may indicate that more slots would result in faster processing time.
Read Phase: the slot is reading data either from distributed storage or from shuffle. A lot of time spent here indicates that there might be an opportunity to limit the amount of data consumed by the query (by limiting the result set or filtering data).
Compute Phase: where the actual processing takes place, such as evaluating SQL functions or expressions. A well-tuned query typically spends most of its time in the compute phase. Some ways to try and reduce time spent in the compute phase are to leverage approximation functions or investigate costly string manipulations like complex regexes.
Write Phase: where data is written, either to the next stage, shuffle, or the final output returned to the user. A lot of time spent here indicates that there might be an opportunity to limit the results of the stage (by limiting the result set or filtering data).
If you notice that the maximum time spent in each phase is much greater than the average time, there may be an uneven distribution of data coming out of the previous stage. One way to try and reduce data skew is by filtering early in the query.
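One simple way to quantify that max-versus-average comparison (an illustrative helper over per-worker timings, not a BigQuery API) is a skew ratio: values well above 1.0 suggest unevenly distributed data.

```python
def skew_ratio(worker_ms):
    """Max/avg worker time for a stage phase; a ratio well above 1.0
    suggests one worker received a disproportionate share of the data."""
    avg = sum(worker_ms) / len(worker_ms)
    return max(worker_ms) / avg

balanced = [100, 110, 95, 105]  # workers finish at roughly the same time
skewed = [100, 100, 100, 700]   # one worker got most of the data
print(round(skew_ratio(balanced), 2))  # 1.07
print(round(skew_ratio(skewed), 2))    # 2.8
```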
While many query patterns use reasonable volumes of shuffle, large queries may exhaust available shuffle resources. In particular, if you see that a query stage attributes much of its time to writing out to shuffle, take a look at the shuffle statistics. The shuffleOutputBytesSpilled statistic tells us whether the shuffle was forced to use disk resources beyond in-memory resources.
Note that a disk-based write takes longer than an in-memory write. To prevent this from happening, you’ll want to filter or limit the data so that less information is passed to the shuffle.
Tune in next week
Next week, we’ll be digging into more advanced queries and talking through tactical query optimization techniques, so make sure to tune in! You can keep an eye out for more in this series by following Leigha on LinkedIn and Twitter.
Editor’s note: Today we’re hearing from Sagar Batchu, Director of Engineering at LiveRamp. He shares how Google Cloud helped LiveRamp modernize its data analytics infrastructure to simplify its operations, lower support and infrastructure costs and enable its customers to connect, control, and activate customer data safely and securely.
LiveRamp is a data connectivity platform that provides best-in-class identity resolution, activation, and measurement for customer data so businesses can create a true 360-degree customer view. We run data engineering workloads at scale, often processing petabytes of customer data every day via the LiveRamp Connect platform APIs.
As we integrated more internal and external APIs and the sophistication of our product offering grew, the complexity of our data pipelines increased. The status quo for building data pipelines very quickly became painful and cumbersome as these processes take time and knowledge of an increasingly complex data engineering stack. Pipelines became harder to maintain as the dependencies grew and the codebase became increasingly unruly.
Beginning last year, we set out to improve these processes and re-envision how we reduce time to value for data teams by thinking of our canonical ETL/LT analytics pipelines as a set of reusable components. We wanted teams to spend their time adding new features which encapsulate business value rather than spending time figuring out how to run workloads at scale on cloud infrastructure. This was even more pertinent with data science, data analyst and services teams whose daily wheelhouse was not the nitty gritty of deploying pipelines.
With all this in mind, we decided to start a data operations initiative, a concept popularised in the last few years, which aims to accelerate the time to value for data-oriented teams by allowing different personas in the data engineering lifecycle to focus on the “what” rather than the “how.”
We chose Google Cloud to execute on this initiative to speed up our transformation. Our architectural optimizations, coupled with Google Cloud’s platform capabilities simplified our operational model, reduced time to value, and greatly improved the portability of our data ecosystem for easy collaboration. Today, we have ten teams across LiveRamp running hundreds of workloads a day, and in the next quarter, we plan to scale to thousands.
Why LiveRamp Chose Google Cloud
Google Cloud provides all the necessary services in a serverless fashion to build complex data applications and run massive infrastructure. Google Cloud offers data analytics capabilities that help organizations like LiveRamp to easily capture, manage, process and visualize data at scale. Many of the Google Cloud data processing platforms also have open source roots making them extremely collaborative. One such platform is CDAP (Cask Data Application Platform), which Cloud Data Fusion is built on. We were drawn to this for the following reasons:
CDAP is inherently multicloud. Pipeline building blocks known as Plugins define individual units of work. They can be run through different provisioners which implement managed cloud runtimes.
The control plane is a set of microservices hosted on Kubernetes, whereas the data plane leverages the best of breed big data cloud products such as Dataproc.
It is built as a framework and is inherently extensible, and decoupled from the underlying architecture. We can extend it both at the system and user-level through “extensions” and “plugins” respectively. For example, we were able to add a system extension for LiveRamp specific authorisation and build a plugin that encompasses common LiveRamp identity operations.
It is open sourced, and there is a dedicated team at Google Cloud building and maintaining the core codebase as well as a growing suite of source, transform and sink connectors.
It aligns with our remote execution and non-data-movement strategy. CDAP executes pipelines remotely and manages them through a stream of metadata via public cloud APIs.
CDAP supports an SRE mindset by providing out of the box monitoring and observability tooling.
It has a rich set of APIs backed by scalable microservices to provide ETL as a Service to other teams.
Cloud Data Fusion, Google Cloud’s fully managed, native data integration platform, is based on CDAP. We benefit from the managed security features of Data Fusion, like IAM integration, customer-managed encryption keys, role-based access controls, and data residency, to ensure stricter governance requirements around data isolation.
How are teams using the Data Operations Platform?
Through this initiative, we have encouraged data science and engineering teams to focus on business logic and leave data integrations and infrastructure as separate concerns. A centralised team runs CDAP as a service, and custom plugins are hosted in a democratized plugin marketplace where any team can contribute their canonical operations.
Adoption of the platform was driven by one of our most common patterns of data pipelining: The need to resolve customer data using our Identity APIs. LiveRamp Identity APIs connect fragmented and inaccurate customer identity by providing a way to resolve PII to pseudonymous identifiers. This enables client brands to connect, control, and activate customer data safely and securely.
The reality of customer data is that it lives in a variety of formats and storage locations, and often needs bespoke cleanup. Previously, technical services teams at LiveRamp had to develop expensive processes to manage hygiene and validation even before the data was resolved to an identity. Over time, a combination of bash and python scripts and custom ETL pipelines became untenable.
By implementing our most used Identity APIs as a series of CDAP plugins, our customers were able to operationalise their processes: they log into a low-code user interface, select a source of data, run standard validation and hygiene steps, visually inspect especially noisy cases using CDAP’s Wrangler interface, and channel data into our Identity API. As these workflows were validated, they were established as standard CDAP pipelines that can now be parameterized and distributed on the internal marketplace. These technical services teams have not only reduced their time to value but have also enabled future teams to leverage their customer pipelines without worrying about portability to other teams’ infrastructures.
What’s Next ?
With critical customer use cases now powered by CDAP, we plan on scaling out usage of the platform to the next batch of teams. We plan on taking on more complex pipelines, cross-team workloads, and adding support for the ever growing LiveRamp platform API suite.
In addition to the Google Cloud community and the external community, we have a growing base of LiveRamp developers building out plugins on CDAP to support routine transforms and APIs. These are used by other teams who push the limits and provide feedback — spinning a flywheel of collaboration between those who build and those who operate. Furthermore, teams internally can continue to use their other favorite data tools like BigQuery and Airflow as we continue to deeply integrate CDAP into our internal data engineering ecosystem.
Our data operations platform powered by CDAP is quickly becoming a center point for data teams – a place to ingest, hygiene, transform, and sink their data consistently.
We are excited by Google Cloud’s roadmap for CDAP and Data Fusion. Support for new execution engines, data sources and sinks, and new features like Datastream and Replication will mean LiveRamp teams can continue to trust that their applications will be able to interoperate with the ever evolving cloud data engineering ecosystem.
The Extension Framework is a fully hosted development platform that enables developers to build any data-powered application, workflow, or tool right in Looker. By eliminating the need to spin up and host infrastructure, the Extension Framework lets developers focus on building great experiences for their users. Traditionally, customers and partners who build custom applications with Looker have to assemble an entire development infrastructure before they can proceed with implementation. For instance, they might need to stand up both a back end and a front end, and then implement services for hosting and authorization. This adds time and cost.
The Extension Framework eliminates much of this inefficiency and significantly reduces friction in the setup and development process, so developers can start building right away. Looker developers no longer need DevOps or infrastructure to host their data applications, and these applications, when built on the Extension Framework, can take full advantage of the power of Looker. To enable these efficiencies, the Looker Extension Framework includes a streamlined way to leverage the Looker APIs and SDKs, UI components for building the visual experience, and authentication, permission management, and application access control.
Streamlining the development process with the Extension Framework
Content created via the Extension Framework can be built as a full-screen experience or embedded into an external website or application. We will soon be adding functionality to allow for the embedding of extensions inside Looker (as a custom tile you plug into your dashboard, for example). Through our Public Preview period, we have already seen more than 150 extensions deployed to production users, with an additional 200+ extensions currently in development. These extensions include solutions such as enhanced navigation tools, customized navigation, and modified reporting applications.
Extension Framework Feature Breakdown
The Looker Extension Framework includes the following features:
If you haven’t yet tried the Looker Extension Framework, we think you’ll find it to be a major upgrade to your data app development experience. Over the next few months, we will continue to make enhancements to the Extension Framework with the goal of significantly reducing the amount of code required, and eventually empowering our developers with a low-code, no-code framework.
Comprehensive details and examples that help you get started in developing with the Extension Framework are now available here. We hope that these new capabilities inspire your creativity and we’re super excited to see what you build with the Extension Framework!
With data volumes constantly growing, many companies find it difficult to use data effectively and gain insights from it. Often these organizations are burdened with cumbersome and difficult-to-maintain data architectures.
One way that companies are addressing this challenge is with change streaming: the movement of data changes as they happen from a source (typically a database) to a destination. Powered by change data capture (CDC), change streaming has become a critical data architecture building block. We recently announced Datastream, a serverless change data capture and replication service. Datastream’s key capabilities include:
Replicate and synchronize data across your organization with minimal latency. You can synchronize data across heterogeneous databases and applications reliably, with low latency, and with minimal impact to the performance of your source. Unlock the power of data streams for analytics, database replication, cloud migration, and event-driven architectures across hybrid environments.
Scale up or down seamlessly with a serverless architecture. Get up and running fast with a serverless and easy-to-use service that scales seamlessly as your data volumes shift. Focus on deriving up-to-date insights from your data and responding to high-priority issues, instead of managing infrastructure, performance tuning, or resource provisioning.
Integrate with the Google Cloud data integration suite. Connect data across your organization with Google Cloud data integration products. Datastream leverages Dataflow templates to load data into BigQuery, Cloud Spanner, and Cloud SQL; it also powers Cloud Data Fusion’s CDC Replicator connectors for easier-than-ever data pipelining.
Datastream use cases
Datastream captures change streams from Oracle, MySQL, and other sources for destinations such as Cloud Storage, Pub/Sub, BigQuery, Spanner and more. Some use cases of Datastream:
For analytics, use Datastream with a pre-built Dataflow template to create up-to-date replicated tables in BigQuery in a fully managed way.
For database replication, use Datastream with pre-built Dataflow templates to continuously replicate and synchronize database data into Cloud SQL for PostgreSQL or Spanner to power low-downtime database migration or hybrid-cloud configurations.
For building event-driven architectures, use Datastream to ingest changes from multiple sources into object stores like Google Cloud Storage or, in the future, messaging services such as Pub/Sub or Kafka.
Streamline real-time data pipelines that continually stream data from legacy relational data stores (like Oracle and MySQL) into MongoDB using Datastream.
How do you set up Datastream?
1. Create a source connection profile.
2. Create a destination connection profile.
3. Create a stream using the source and destination connection profiles, and define the objects to pull from the source.
4. Validate and start the stream.
Once started, a stream continuously streams data from the source to the destination. You can pause and then resume the stream.
To use Datastream to create a stream from the source database to the destination, you must establish connectivity to the source database. Datastream supports the IP allowlist, forward SSH tunnel, and VPC peering network connectivity methods.
Private connectivity configurations enable Datastream to communicate with a data source over a private network (internally within Google Cloud, or with external sources connected over VPN or Interconnect). This communication happens through a Virtual Private Cloud (VPC) peering connection.
For a more in-depth look into Datastream check out the documentation.
As the Olympics kicked off in Tokyo at the end of July, we found ourselves reflecting on the beauty of diverse countries and cultures coming together to celebrate greatness and sportsmanship. For this month’s blog, we’d like to highlight some key data and analytics performances that should help inspire you to reach new heights in your data journey.
Let’s review the highlights!
BigQuery ML Anomaly Detection: A perfect 10 for augmented analytics
Identifying anomalous behavior at scale is a critical component of any analytics strategy. Whether you want to work with a single frame of data or a time series progression, BigQuery ML allows you to bring the power of machine learning to your data warehouse.
In this blog released at the beginning of last month, our team walked through both non-time series and time-series approaches to anomaly detection in BigQuery ML:
These approaches make it easy for your team to quickly experiment with data stored in BigQuery to identify what works best for your particular anomaly detection needs. Once a model has been identified as the right fit, you can easily port that model into the Vertex AI platform for real-time analysis or schedule it in BigQuery for continued batch processing.
App Analytics: Winning the team event
Google provides a broad ecosystem of technologies and services aimed at solving modern day challenges. Some of the best solutions come when those technologies are combined with our data analytics offerings to surface additional insights and provide new opportunities.
Firebase has deep adoption in the app development community and provides the technology backbone for many organizations’ app strategies. This month we launched a design pattern that shows Firebase customers how to use Crashlytics data, CRM, issue tracking, and support data in BigQuery and Looker to identify opportunities to improve app quality and enhance customer experiences.
Crux on BigQuery: Taking gold in the all-around data competition
Crux Informatics provides data services to many large companies to help their customers make smarter business decisions. While they were already operating on a modern stack and not on the hunt for a modern data warehouse, BigQuery became an enticing option due to performance and a more optimal pricing model. Crux also found advantages with lower-cost ingestion and processing engines like Dataflow that allow for streaming analytics.
“… when it came to building a centralized large-scale data cloud, we needed to invest in a solution that would not only suit our current data storage needs but also enable us to tackle what’s coming, supporting a massive ecosystem of data delivery and operations for thousands of companies.” - Mark Etherington, Chief Technology Officer, Crux Informatics
Technology is a team sport, and Crux found our support team responsive and ready to help. This decision to more deeply adopt Google Cloud’s data analytics offerings provides Crux with the flexibility to manage a constantly evolving data ecosystem and stay competitive.
You can read more about Crux’s decision to adopt BigQuery in this blog.
As a quick recap of that dataset: Google Cloud, and in particular BigQuery, provides access to the top 25 trending terms by Nielsen’s Designated Market Area® (DMA) with weekly granularity. These trending terms are based on search patterns and have historically only been available on the Google Trends website.
The Google Trends design pattern addresses some common business needs, such as identifying what’s trending geographically near your stores and how to match trending terms to products to identify potential campaigns.
Dataflow GPU: More power than ever for those streaming sprints
Dataflow is our fully-managed data processing platform that supports both batch and streaming workloads. The ability of Dataflow to scale and easily manage unbounded data has made it the streaming solution of choice for large workloads with high-speed needs in Google Cloud.
But what if we could take that speed and provide even more processing power for advanced use cases? Our team, in partnership with NVIDIA, did just that by adding GPU support to Dataflow. This allows our customers to easily accelerate compute-intensive processing like image analysis and predictive forecasting with amazing increases in efficiency and speed.
Take a look at the times below:
Data Fusion: A play-by-play for data integration’s winning performance
Data Fusion provides Google Cloud customers with a single place to perform all kinds of data integration activities. Whether it’s ETL, ELT, or simply integrating with a cloud application, Data Fusion provides a clean UI and streamlined experience with deep integrations to other Google Cloud data systems. Check out our team’s review of this tool and the capabilities it can bring to your organization.
Google provides a set of Dataflow templates that customers commonly use for frequent data tasks, and also as reference data pipelines that developers can extend. But what if you want to customize a Dataflow template without modifying or maintaining the Dataflow template code itself? With user-defined functions (UDFs), customers can extend certain Dataflow templates with their custom logic to transform records on the fly:
Record transformation with Dataflow UDF
At the time of writing, the following Google-provided Dataflow templates support UDF:
Here’s the format of a Dataflow UDF function called process which you can reuse and insert your own custom transformation logic into:
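Though the exact snippet varies by template, the skeleton looks roughly like the sketch below. The `includePubsubMessage` detection heuristic shown here is an assumption for illustration; defer to your template's documentation for the exact contract.

```javascript
/**
 * Dataflow template UDF entry point (sketch).
 * @param {string} inJson - the in-flight record, serialized as a JSON string
 * @return {string} the transformed record, serialized as a JSON string
 */
function process(inJson) {
  var obj = JSON.parse(inJson),
      // Heuristic (assumption): a full Pub/Sub message payload carries both
      // 'data' and 'attributes' top-level fields
      includePubsubMessage = obj.data && obj.attributes,
      // Normalize so your transformation logic always operates on 'data'
      data = includePubsubMessage ? obj.data : obj;

  // INSERT CUSTOM TRANSFORMATION LOGIC HERE, operating on 'data'

  return JSON.stringify(obj);
}
```

The function takes a serialized record and must return a serialized record, which is why the examples below always begin with `JSON.parse` and end with `JSON.stringify`.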
Note: The includePubsubMessage variable is required if the UDF is applied to the Pub/Sub to Splunk Dataflow template, since that template supports two possible element formats: it can be configured to process the full Pub/Sub message payload or only the underlying Pub/Sub message data payload (the default behavior). The statement setting the data variable normalizes the UDF input payload in order to simplify your subsequent transformation logic, consistent with the examples below. For more context, see the includePubsubMessage parameter in the Pub/Sub to Splunk template documentation.
Common UDF patterns
The following code snippets are example transformation logic to be inserted in the above UDF process function. They are grouped below by common patterns.
Pattern 1: Enrich events
Follow this pattern to enrich events with new fields for more contextual information.
Add a new field as metadata to track pipeline’s input Pub/Sub subscription
Set Splunk HEC metadata source field to track pipeline’s input Pub/Sub subscription
Add new fields based on a user-defined local function e.g. callerToAppIdLookup() acting as a static mapping or lookup table
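The three enrichment bullets above can be sketched in one UDF body. Hedged: the subscription path, the `_metadata.source` override convention used by the Splunk template, and the `callerToAppIdLookup()` table are all illustrative placeholders, not part of any template contract.

```javascript
function process(inJson) {
  var obj = JSON.parse(inJson);

  // Illustrative local static lookup table mapping a caller to an app ID
  function callerToAppIdLookup(caller) {
    var table = { "svc-a@my-project.iam.gserviceaccount.com": "app-1" };
    return table[caller] || "unknown";
  }

  // 1) New metadata field tracking the pipeline's input Pub/Sub subscription
  obj.inputSubscription = "projects/my-project/subscriptions/my-sub";

  // 2) Splunk HEC 'source' metadata, via the _metadata override object
  obj._metadata = obj._metadata || {};
  obj._metadata.source = obj.inputSubscription;

  // 3) New field derived from the local lookup function
  obj.appId = callerToAppIdLookup(obj.caller);

  return JSON.stringify(obj);
}
```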
Pattern 2: Transform events
Follow this pattern to transform the entire event format depending on what your destination expects.
Revert logs from Cloud Logging log payload (LogEntry) to original raw log string. You may use this pattern with VM application or system logs (e.g. syslog or Windows Event Logs) to send source raw logs (instead of JSON payloads):
Transform logs from Cloud Logging log payload (LogEntry) to original raw log string by setting Splunk HEC event metadata. Use this pattern with application or VM logs (e.g. syslog or Windows Event Logs) to index original raw logs (instead of JSON payload) for compatibility with downstream analytics. This example also enriches logs by setting HEC fields metadata to incoming resource labels metadata:
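A sketch covering both variants above, under two assumptions: the input is a Cloud Logging LogEntry whose raw log line sits in `textPayload`, and the `_metadata` event/fields override convention follows the Splunk Dataflow template.

```javascript
function process(inJson) {
  var obj = JSON.parse(inJson);

  if (obj.textPayload) {
    // Variant A (raw-log destinations): return only the original raw string
    // return obj.textPayload;

    // Variant B (Splunk HEC): index the raw line as the event, and enrich
    // with the incoming resource labels as indexed HEC fields
    obj._metadata = {
      event: obj.textPayload,
      fields: (obj.resource && obj.resource.labels) || {}
    };
  }
  return JSON.stringify(obj);
}
```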
Pattern 3: Redact events
Follow this pattern to redact or remove a part of the event.
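A minimal redaction sketch; the field names `ssn` and `cardNumber` are purely illustrative.

```javascript
function process(inJson) {
  var obj = JSON.parse(inJson);

  // Remove a sensitive field entirely before the event leaves the pipeline
  delete obj.ssn;

  // Or mask part of a field while keeping its shape: star every digit
  // except the last four
  if (obj.cardNumber) {
    obj.cardNumber = obj.cardNumber.replace(/\d(?=\d{4})/g, "*");
  }
  return JSON.stringify(obj);
}
```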
An easy way to test your UDF on the Nashorn engine is to launch Cloud Shell, where JDK 11 is pre-installed, including the jjs command-line tool for invoking Nashorn.
In Cloud Shell, you can launch Nashorn in interactive mode as follows:
To test your UDF, define an arbitrary input JSON object depending on your pipeline’s expected in-flight messages. In this example, we’re using a snippet of a Dataflow job log message to be processed by our pipeline:
You can now invoke your UDF function to process that input object as follows:
Notice how the input object is serialized before being passed to the UDF, which expects an input string, as noted in the previous section.
Print the UDF output to view the transformed log with the appended inputSubscription field as expected:
Finally exit the interactive shell:
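End to end, the session looks roughly like the following, shown here as a self-contained script. At the jjs> prompt you would instead load your real UDF file; the enrichment UDF below is an assumed example that appends an inputSubscription field.

```javascript
// Stand-in for load('my_udf.js') at the jjs> prompt: an assumed
// enrichment UDF that appends an inputSubscription field
function process(inJson) {
  var obj = JSON.parse(inJson);
  obj.inputSubscription = "projects/my-project/subscriptions/my-sub";
  return JSON.stringify(obj);
}

// An arbitrary input object mimicking a Dataflow job log message
var input = {
  resource: { type: "dataflow_step" },
  textPayload: "Worker pool started."
};

// Serialize first: the UDF expects a string, not an object
var output = process(JSON.stringify(input));

// Print the transformed log (print() in jjs; console.log elsewhere)
(typeof print !== "undefined" ? print : console.log)(output);

// At the jjs> prompt you would then type: exit()
```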
The relevant parameters to configure:
gcs-location: the Cloud Storage path to the Dataflow template
As a Dataflow user or operator, you simply reference a pre-existing template URL (Google-hosted) and your custom UDF (customer-hosted), without needing to set up a Beam developer environment or maintain the template code itself.
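For instance, a deployment might look like the sketch below. The bucket, topic, table, region, and job names are placeholders; the UDF parameter names javascriptTextTransformGcsPath and javascriptTextTransformFunctionName follow the convention used by Google-provided templates, but check your specific template's reference for the exact names.

```shell
# Run a Google-hosted template, pointing it at a customer-hosted UDF
gcloud dataflow jobs run my-udf-job \
  --gcs-location=gs://dataflow-templates/latest/PubSub_to_BigQuery \
  --region=us-central1 \
  --parameters=\
inputTopic=projects/my-project/topics/my-topic,\
outputTableSpec=my-project:my_dataset.my_table,\
javascriptTextTransformGcsPath=gs://my-bucket/udf/my_udf.js,\
javascriptTextTransformFunctionName=process
```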
We hope this helps you get started with customizing some of the off-the-shelf Google-provided Dataflow templates, using one of the above utility UDFs or writing your own. As a technical artifact of your pipeline deployment, the UDF is a component of your infrastructure, so we recommend you follow Infrastructure-as-Code (IaC) best practices, including version-controlling your UDF. If you have questions or suggestions for other utility UDFs, we'd like to hear from you: create an issue directly in the GitHub repo, or ask away in our Stack Overflow forum.
In a follow-up blog post, we’ll dive deeper into testing UDFs (unit tests and end-to-end pipeline tests) as well as setting up a CI/CD pipeline (for your pipelines!) including triggering new deployment every time you update your UDFs – all without maintaining any Apache Beam code.
By interpreting and analyzing data, organizations can understand and predict trends, improve security, and make data-driven decisions. Big data and the artificial intelligence technologies used to leverage it can go beyond market predictions: you can use data to improve working processes and optimize your return on investment (ROI). In this post, we'll explore how organizations can leverage big data and AI instruments to improve their ROI.
How Big Data is changing the finance and retail scene
Let's start with a use case. The finance and retail sectors typically face challenges in optimizing their ROI. In retail in particular, while it is always possible to reach the customer, doing so with a minimum of time and money is a challenge. Leveraging big data helps by aggregating information about customer behavior and making predictions from it, which helps target promotions.
The finance sector, specifically banks, is using big data analytics to understand transactions and payments and help customers. Banks are transitioning into data-driven organizations, using big data solutions to expand their offers to digital wallets. Big data is helping banks to tie their offers beyond the typical bank card, transforming digitally and making payments more secure and simple for their users.
The benefits of big data analytics are not limited to finance and retail. Data analytics improves efficiency, performance, and productivity for every organization, regardless of size. One way big data technologies are helping companies is by simplifying payment processing.
Data analytics simplifies and personalizes payment methods
Two technologies are spreading due to their convenience and security: virtual cards and e-wallets.
What are Virtual cards?
A virtual card consists of a randomized credit card number that is used for payments and purchases. Companies use this unique 16-digit number for B2B payments and employee expenses. A virtual card program offers a secure payment product that can be redeemed instantly. Virtual cards are also a savvy way for data-driven companies to manage corporate expenses.
Companies like meshpayments.com offer a way to simplify corporate payments without corporate cards. Typically time-consuming processes, like credit card payment reconciliation, are automated and simplified. In addition, virtual cards integrate seamlessly with ERP and internal accounting systems via new data-driven capabilities.
What are e-wallets?
An e-wallet is an application that uses complex data algorithms to enable you to make online payments with an email address and a password. You can link the e-wallet to one or more accounts or cards and then spend money online without sharing sensitive information. Examples of e-wallets are PayPal, Google Pay, and Apple Pay. In some cases, you can use the e-wallet to pay for in-store purchases if the application is installed on your phone.
E-wallets are convenient since they can store money, loyalty cards, credit cards, driving licenses, and other details, and you can use them online and for in-store payments. In-store acceptance, however, is still not universal, which is a downside. E-wallets are also not very useful for business payments because they don't integrate well with internal accounting and ERP systems.
How virtual card numbers impact B2B payments
Companies are leveraging data to improve processes, streamline workflows, and reduce costs. One area that can be significantly improved with virtual cards is payments processing. Processing payments, expenses, and invoice reconciliation are among the most time-consuming activities.
Organizations across industries need to reconcile an increasing number of expense payments and purchases. More data means more time employees must spend matching records, more errors, and more overhead costs. With credit cards or cheques, reconciling transactions requires a great deal of manual intervention and is a pain point for many organizations.
One of the benefits of using virtual card numbers (VCNs) is the automation of the B2B payments process. Usually, virtual card numbers are single-use: the identifier is unique and linked to a specific transaction, supplier, and amount. VCNs provide security at a granular level that is not available for traditional credit card transactions. You can set company-specific information like cost and project code, transaction amount, and timeframe.
Some of the advantages of virtual credit cards that rely on big data include:
Safety: since there are no physical cards, transactions are more secure than with credit cards. This reduces the risk of payment fraud and, using big data capabilities, prevents card sharing among employees.
Better cash flows: it is a faster payment method, giving more insight into and control over the company's cash flow. Virtual payments optimize your company's working capital by processing payments immediately, which prevents accounts payable teams from holding on to funds longer than needed.
Budget management: virtual cards enable organizations to manage their expense budgets. You can allocate spending across different virtual cards, so you can deal with multiple payment accounts.
Virtual cards do have a downside: vendors need to accept this type of payment for it to work, and acceptance can depend on the type of big data technology vendors use.
Big Data has made virtual cards and e-wallets highly effective payment management options
Big data has significantly changed our approach to payment management. Virtual cards are one of the most effective ways to optimize payment management for organizations. A virtual cards platform enables the automatic generation of virtual card numbers, integrating it with internal accounting and resource management systems. Ultimately, implementing virtual cards helps better manage and control corporate spending, improving the ROI.
Artificial intelligence has led to a number of developments in many industries. A growing number of companies are using AI technology to transform many aspects of their workplace.
One of the biggest benefits of AI is that it has helped streamline many workplace functions. Many companies are using AI technology to make it easier for employees to work from home. Countless services like Zoom use AI to offer more robust features to their users, which helps companies offering work from home opportunities do so more efficiently.
AI is also helping solve some of the challenges that have come with working from home. We stated that AI is essential to fighting cybercrime during the pandemic and this will hold true after the pandemic ends. Cybercriminals have started scaling their cyberattacks to target people working from home, since they tend to have less reliable digital security. AI has led to some important advances that will shore up defenses against these criminals.
AI is Going to be Essential in the Fight Against Cybercrime as More People Work from Home
According to analysis from Cybersecurity Ventures, the yearly cost of cybercrime is expected to reach $10.5 trillion, and ransomware damage costs will reach $20 billion, by 2025. This indicates that businesses should do everything they can to protect their critical data, and AI is going to be more important than ever.
This article will help you understand how remote working has fueled cybercrime and its consequences, along with proactive measures, focused on AI-driven cybersecurity apps, for handling this critical issue.
Remote working – Pre and Post Pandemic
Remote working is not a wholly new concept; before the coronavirus pandemic, some companies had arrangements to work from home once or twice a week. However, COVID-19 has made this the rule rather than the exception.
As a consequence of the global pandemic, cybersecurity statistics show a significant increase in data breaches and hacking incidents originating from the devices employees increasingly use to complete their tasks, such as mobile and IoT devices. Accordingly, the value of a free VPN service that ensures cybersecurity has increased tremendously.
Remote Working and Use of Technology
According to Statista, 44% of U.S. employees are working from home after the pandemic. In addition, most company transactions rely on the internet and on devices such as laptops, desktops, Android devices, iPhones, iPods, Macs, and more.
Communication is the heart of every business, and these technologies make it possible for employees to communicate while working from anywhere. The reality is that without these devices, little or nothing can be accomplished.
Cybercrime and IoT devices
Businesses now have large numbers of employees working remotely, and these remote employees are more likely to be targeted by cybercriminals. Employees also sometimes rely on public Wi-Fi, which is notoriously insecure. Cybercriminals are quick to exploit these flaws, which is why cyberattacks are continuously increasing.
ZecOps, a mobile security forensics firm located in San Francisco, uncovered a problem in the Mail app for iPhones and iPads: a vulnerability that allows hackers to remotely take data from iPhones even if they are running the latest version of iOS. This has heightened security concerns for businesses.
Optimizing AI-Driven Cybersecurity Apps
AI has been incredibly important in the evolution of cybersecurity. While precautions such as VPNs and a zero-trust strategy are still important in preventing cyberattacks, you can also consider incorporating the following AI apps into your security network to improve threat detection, response, and reporting.
1. Intruder
Recent estimates uncover over 8,000 new vulnerabilities in mainstream software and hardware platforms every year on average. That's over 20 every single day. Intruder is designed to fix this issue. It uses sophisticated AI algorithms to scan for and detect vulnerabilities and cybersecurity flaws in your digital infrastructure, helping you avoid costly data breaches. It is one of the most effective cybersecurity apps that use AI to thwart hackers.
2. WiFi Proxy and Switcherry VPN
Switcherry VPN is stated to provide a one-touch connection. Its free mode will keep you private and anonymous on all of your devices. It will also use complex AI features to safeguard your data and privacy on public WiFi and prevent your connection from any tracking. It is compatible with Android, iPhone, Windows, iPad, and Google extensions.
3. Syxsense Secure
Syxsense Secure is a cybersecurity app that uses some of the most advanced machine learning tools to protect against cybercrime. This AI-powered software combines vulnerability detection, patch management, and endpoint security in a single cloud console, making Syxsense Secure the world's first combined IT management and security solution. It also works with Windows, macOS, and Linux.
With a drag-and-drop interface, the software streamlines complex IT and security operations. In addition, the pre-built templates can keep your organization secure in the absence of large teams. You can also check out 10 ways to ensure data security.
Businesses most often face phishing and ransomware cyberattacks. The details of each are presented below.
Phishing: Email plays an important role in any business, which is why phishing is likely to be on the rise, as seen during the pandemic. Even though phishing attacks decreased in 2019, they still accounted for one out of every 4,200 emails in 2020.
Ransomware: This cybercrime is becoming more sophisticated, and the consequences for businesses are becoming more severe. According to Cybersecurity Ventures, the average cost of a ransomware attack on a business is $133,000. Prominent businesses hit by ransomware include the following.
1. Software AG
Software AG is Germany’s second-largest software corporation and Europe’s seventh-largest. In October 2020, the company was subjected to a cyber-attack. Clop ransomware was distributed by the hackers, who wanted a $23 million ransom in exchange for the company’s records and personal information.
2. Sopra Steria
Sopra Steria, a French IT services provider, was also hit hard by the Ryuk ransomware.
3. Seyfarth Shaw
In October 2020, a ransomware attack targeted Seyfarth Shaw LLP, a renowned multinational law company in Chicago. Their entire email system was compromised as a result of the hack.
AI-Powered Apps Are the Secret to Stopping Cybercrime
As most businesses have adopted remote working, more devices are required to ensure that employees can communicate easily and that commercial transactions can flow freely. AI technology is helping fight the resulting cybercrime.
Cybersecurity apps are the most recent advancement in the fight against cybercrime. The apps discussed above are primarily intended to protect networks, websites, and wireless devices such as iPhones, iPads, Android phones, iPods, and Macs from malicious hackers.
Modern advances in big data technology, the internet, and the arrival of the digital age have driven a true revolution in the way the world communicates over recent years. The digital age is here to stay, and big data has changed how businesses operate forever.
Big Data is Transforming the Trading Profession
While traders used to spend their time in large office buildings and on hectic trading floors in places like New York's Wall Street and the City of London, a lot of trading is now done across digital platforms with a few simple clicks of a button. In some cases, traders no longer even need to perform these tasks themselves: big data technology has led to impressive advances in AI that can automate many of them.
Remote Virtual Private Servers (VPS) mean that traders don't necessarily need a home internet connection to trade on the foreign exchange market (Forex); they can do it from anywhere in the world. Data transmissibility has improved so much that people can place trades in a matter of minutes.
Here we have come up with a guide to how traders today can embrace big data technology to make their job easier in 2021.
Advanced Technology Trading Terminals
A trading terminal, or 'electronic trading platform', is computer software that allows traders to place orders and acts as a gateway to the markets. Trading terminals that utilize highly advanced big data technologies such as trading robots can be very beneficial, providing traders with the improved speed and reliability they crave. Download MetaTrader 5 if you want to make the most of a trading platform that uses innovative trading ideas and cutting-edge modern technologies; this wouldn't have been possible without new machine learning advances predicated on data technology. The MetaTrader 5 download comes with a multicurrency tester and supports six types of pending orders.
Traders should make sure they stay ahead of the game and embrace the latest technologies in the 21st century. Modern technology can provide traders with the best quality up-to-date, in-depth breakdown and analysis of global markets and international currencies. Sophisticated data analytics capabilities can handle this task in a fraction of the time that it used to take.
Modern traders should embrace and welcome the developments in big data technology as a new tool they can use to their advantage on a daily basis.
Analysis and Forecasting Markets with Modern Technology
One of the hardest parts of trading is predicting what markets will do in the future. When will a stock or commodity go up or down in price? When will an international currency decrease significantly in value? Accurate and reliable forecasting is therefore as important as ever, but thankfully traders nowadays have modern technologies to help them. There is now remarkable predictive analytics software, such as trading robots, that uses modern technologies to generate market forecasts. These are some of the most impressive developments brought on by advances in data analytics and AI.
However, robots and modern technology unfortunately could not have foreseen the extent of the chaos that the coronavirus caused to the global economy when everywhere started shutting down and imposing lockdowns in 2020. Traders should nevertheless not shy away from using modern technologies for forecasts and crucial trade tips.
Trading without the machine learning and big data technologies of today would be trickier and more time-consuming; the technology is a blessing for traders. Forex traders today can trade currencies from anywhere in the world. In the past you would have had to be in a sweaty suit and tie in the office, but today, thanks to modern technologies, you can even trade while relaxing on the beach using your phone.
But what if machine learning could be used beyond niche or individual contexts? Taking artificial intelligence a step further and implementing it into our cities and infrastructures has the potential for improving operating efficiencies, aiding in sustainability efforts, urban planning and more. Below, we’ll be exploring a few of the ways that machine learning can be used for improving our cities and making them smarter overall.
Using AI to account for carbon footprints
Often, we hear from various forms of media that we should aim to reduce our individual and collective carbon footprints. But how can cities and organizations accurately calculate their contributions to carbon emissions? Overall, a carbon footprint can be broken down into three categories: direct emissions from the organization or city's operations (scope 1 emissions), emissions related to the generation of electricity required to run the city (scope 2 emissions), and emissions from consumption and production of city products (scope 3 emissions), which involve upstream suppliers and downstream consumers (e.g., city inhabitants)1.
While obtaining and processing data is a challenge, several start-up companies are developing tools that will not only quantify emissions but also help develop plans (based on data) on how to reduce emissions, such as through laying out more sustainable and informed decision making or through switching to viable renewable energy sources. Many companies use platforms like Spark 3.0 to help with data processing, but it still proves challenging.
One particular company, Watershed, hopes to build a tool in which raw data yields insight and concrete actions that reduce carbon emissions.
Drought Risk Assessment and Prediction
With climate change on the rise, severe weather events such as droughts are becoming more prevalent. Overall, droughts have cost the world $1.5 million between 1988-2017, and the resulting food insecurity has caused hundreds of thousands of deaths, if not more.2 Through artificial intelligence-based prediction, decision making around droughts can improve, with better methods and timing employed to ensure optimal water resource allocation and to disseminate information ahead of drought events.
One such example of AI being used for prediction of high impact weather events is the Gradient Boosted Regression Trees (GBRT) algorithm, in which it was found that in 75% of cases, AI-based forecast was chosen over human intuition by professional forecasters.2
Wildlife Monitoring
There is growing evidence that big data and machine learning can help save the environment. Preserving habitats for animals is just as important within cities as it is in tropical rainforests.
Often, conservationists and ecologists set camera traps to get a better idea of what animals are living in an area and when they are active, as well as to monitor human impact on wildlife. Unfortunately, going through footage manually takes a tremendous amount of time and can delay actions that would benefit local flora and fauna. That's where AI algorithms such as the one created by RESOLVE come in: this algorithm can alert conservationists to the presence of animals in real time and identify detected animals almost immediately, so that appropriate action can be taken as soon as possible. Additionally, algorithms such as this one can be used to detect illegal activity in real time, meaning poachers will have a more difficult time capturing animals.
Air Quality Monitoring and Prediction
Air pollution, unfortunately, is a large issue globally. The United States alone produced around 68 million tons of pollution in 2020.4 Such pollution contributes to higher incidences of asthma and other respiratory issues, especially in vulnerable populations such as young children and the elderly. To help the general public better prepare for days of poor air quality, and to put effective countermeasures into place, air quality warning systems based on artificial intelligence may be implemented. In particular, the AI system proposed by Mo et al. (2019) in their article 'A Novel Air Quality Early-Warning System Based on Artificial Intelligence' is based on an air pollution prediction model as well as an air quality evaluation model.5 Through this system, early warnings about air quality can be issued, and the data can be analyzed to create reasonable countermeasures as well as predictions of future air quality.
AI-Based Parking Monitoring
One problem common to many cities is parking. If you've ever been frustrated by circling a packed parking lot looking for a spot, this particular application of artificial intelligence will probably interest you. Artificial intelligence can help by using monitors and sensors to assess real-time occupancy in parking garages; if there happens to be no vacancy, visitors are alerted so they won't waste time circling the lot.6 Additionally, in particularly large parking areas, AI algorithms can guide visitors to areas with vacancies, also saving time.
Smart parking systems can also be used to gauge times of high activity based on parking occupancy so businesses can better prepare for peak hours as well as times of low parking occupation and thus low customer turnout.
Optimization of Electric Vehicle Charging
As public transport vehicles move from traditional fossil fuels to electric power, quite a number of things need to be taken into consideration, such as battery storage, electric generator backup, and creating or adapting a charging system for these vehicles. Additionally, several variables feed into the amount and cost of energy a vehicle uses, such as weather and traffic conditions, in-house vs. on-the-go charging, and peak demand constraints, to name a few.7 If cities were to adopt an AI-enabled energy optimization system, expenses could be kept to a minimum by calculating the amount of energy sources and facilities required upfront, as well as by integrating renewable power sources to charge the vehicles as appropriate.
Additionally, artificial intelligence integration could also help extend battery life of electric vehicles through accounting for manufacturer-based constraints and real time conditions at the same time to optimize charge level as well as minimize degradation levels.7 One way of doing so would be AI algorithms alerting the public transit company of lower than usual electricity prices but also the amount that the vehicles should be charged such that none of the batteries are overcharged.
Improving Power Grid Performance
Depending on where you live in the world, you may already be familiar with smart grids. A smart grid refers to a modern electricity system in which there are sensors, automation, communication, and computers to improve the efficiency, reliability, and safety of an electricity system. Smart grid systems can benefit a city in numerous ways including8:
Automatic re-routing when there are abnormalities in the system
More integration of renewable energy systems and customer-owned power generation systems
More efficient electricity transmission
Reduced operation and management costs for utilities
Reduced peak demand rates
Improved grid security
Faster restoration of power after disturbances (critical in severe weather events like snowstorms or heatwaves)
Improving Public Safety
When it's impossible for human eyes to keep track of all the security feeds within a city, artificial intelligence can assist. For example, microphone input from street cameras can be interpreted by AI as gunshots or other sounds indicative of distress. In such situations, AI algorithms can alert emergency service operators with location data and any other data required to decide whether to dispatch emergency services. Digital signage can be updated in real time to alert the public to situations requiring attention, such as flooding or other emergencies. Another way AI can improve public safety is by controlling traffic lights to clear the way for first responders, rather than relying on police forces to arrive.