How Looker is helping marketers optimize workflows with first-party data

It’s no secret that first-party data will become increasingly important to brands and marketing teams as they brace for industry transformation and look for new, engaging ways to deliver consumer experiences. The pain points marketers experience today (siloed data that prevents a complete view of the customer, insights that aren’t actionable, and general data-access issues) are coming into focus as marketers prepare to wean themselves off the third-party data the industry has grown so accustomed to and begin to truly harness their organizations’ first-party data.

While gathering first-party data is a critical first step, simply collecting it doesn’t guarantee success. To pull away from the competition and truly move the needle on business objectives, marketers need to put their first-party data to work. How can they begin realizing value from it?

In May, Looker and Media.Monks (formerly known as MightyHive) hosted a webinar, “Using Data to Gain Marketing Intelligence and Drive Measurable Impact,” that shared strategies brands have used to take control of their data. Those brands rely on technology to make that happen, and in these cases the Looker analytics platform was key to their success.

However, technology alone isn’t a silver bullet for data challenges. The ability to align closely with the principles outlined below is what will determine a brand’s success in realizing the value of its first-party data. While technology is important, you need the right mix of talent, processes, and strategies to bring it to life.

If you have data that nobody can access, what use is it?

Siloed or difficult-to-access data has only a fraction of the value that portable, frictionless data has. And data, insights and capabilities that can’t be put in the right hands (or are difficult to understand) won’t remain competitive in a market where advertisers are getting nimbler and savvier every day.

A large part of making data frictionless and actionable lies in the platforms you use to access it. Many platforms are either too technical for teams to understand (SQL databases) or pose too much of a security risk (site analytics platforms) to be shared with the wider team of stakeholders who could benefit from data access. 

Ease of use becomes incredibly important when considering the velocity and agility that marketers require in a fast-changing world. In the webinar’s opening, Elena Rowell, Outbound Product Manager at Looker, noted that the foundational data requirement for brands is to have “the flexibility to match the complexity of the data.”

Knowing your customer 

While the transformation into a best-in-class, data-driven marketing organization may look like a long, arduous process, it doesn’t have to be. Elena instead views it as an iterative journey along which brands can quickly capture value. “Every incremental insight into customer behavior brings us one step closer to knowing the customer better,” she said. “It’s not a switch you flip – there’s so much value to be gained along the way.”

She showed how Simply Business, a UK-based insurance company, took this approach. First, they implemented more data-driven decision-making by building easy access to more trustworthy data using Looker. They could look in depth at marketing campaigns and make the changes that were needed along the way. They then started building lists in Looker that enabled better targeting and scheduled outbound communication campaigns.

The original goal was to better understand what was going on with marketing campaigns, but as they put the data and intelligence stored in it to work, Simply Business found value at every step.

Insights for impact   

Ritual, an e-commerce, subscription-based multivitamin service, needed a way to measure the true impact of its acquisition channels and advertising efforts. The team understood that an effective way to grow the business is to know which ads and messaging were creating the greatest impact and what was resonating with current and prospective customers.

Ritual chose to use Looker, which offers marketers interactive access to their company’s first-party data. During the webcast, A 360-View of Acquisition, Kira Furuichi, Data Scientist, and Divine Edem, Associate Manager of Acquisition at Ritual, explained how an e-commerce startup uses platform and on-site attribution data to develop a multifaceted understanding of customer acquisition and its performance.

Ritual now has insight not only into the traffic each channel is bringing to the site, but also into how customers from each of those channels behave after they arrive. This leads to a better overall understanding of how products are really resonating with customers. Their consumer insights team shares that information with the acquisition team so they can collaborate on decisions, especially in Google Ads. Deeper insights into ad copy and visuals help the teams home in on what prospective customers are looking for as they browse for new products to add to their daily routine. Ritual also found that this fosters collaboration with other internal teams invested in this part of the business, since growth and acquisition are a huge part of the overall company strategy.

Edem added, “Having an acquisition channel platform like Google being synced into our Looker helps other key stakeholders on the team outside of acquisition, whether it’s the product team, the engineering team, or the operations team – being able to understand the channel happenings and where there’s room for opportunity. Because it’s such a huge channel, having Looker being able to centralize and sync all that information makes the process a little less overwhelming and really helps us to see things through a nice magnifying glass.”

Access, analyze, and activate first-party data

Marketing requests typically sit at the back of the integration backlog, and IT teams often don’t have marketing domain knowledge. Every brand’s data needs are different. This is where Media.Monks can step in with expertise to guide the process and marshal technical resources to deliver against your roadmap. With a combination of marketing domain expertise and deep experience in engineering and data science, Media.Monks helps brands fill the void left by a lack of internal resources and accelerate their data strategies.

With the Looker extension framework, partners can develop custom applications that sit on top of and interact with Looker to fit unique customer needs. These custom Looker applications allow Media.Monks to equip customers with not just read-only dashboards, but tools and interfaces that actually trigger action across other platforms and ecosystems.

For instance, Media.Monks recently developed a Looker application that lets a marketer define a first-party audience from CRM data, submit that audience to Google’s Customer Match API, and ultimately send the audience to a specific Google Ads account for activation. The entire end-to-end process is performed within a single screen in Looker, turning a previously daunting and error-prone process into one a non-technical user can complete in a few minutes.

Media.Monks-built product for activating first-party data from Looker into Google Ads

We know that breaking down silos, making data available across your organization, and building paths to activation are critical. Looker as a technology makes this much easier, but it still takes expertise, effort, and time to make it work for your needs, and that is where our experience and the Google Cloud partner ecosystem come in.

With the Looker platform, the potential for new data experiences is endless, and with the right partner to support your adoption, deployment, and use cases, you can accelerate your speed to value at every step of your transformation journey.

With the right data foundation in place, marketing teams can:

Gain a more accurate understanding of ad channel performance, minimizing the risk of ad platforms over-reporting their own effectiveness

Uncover insights into what resonates with customers, enabling better optimization of ad copy and creative

Democratize data within the company for the stakeholders who need it. Integrations with collaboration tools like Google Sheets and Slack provide value for people on the team even if they have limited access to Looker.

To learn more about how to approach first-party data and how brands have found success using Looker, check out the webinar and see some of these strategies in action. You can also register to attend JOIN, Looker’s annual conference, where we will be presenting “3 Ways to Help Your Marketing Team Stay Ahead of the Curve.” Look for it under the Empowering Others with Data category.

There is no cost to attend JOIN. Register here to attend sessions on how Looker helps organizations build and deliver custom data-driven experiences that go beyond just reports and dashboards, scale and grow with your business, allow developers to build innovative data products faster, and ensure data reaches everyone.


Give your data processing a boost with Dataflow GPU

We are excited to bring GPUs to the world of big data processing, in partnership with NVIDIA, to unlock new possibilities for you. With Dataflow GPU, users can now leverage the power of NVIDIA GPUs in their data pipelines. This brings together the simplicity and richness of Apache Beam, the serverless and no-ops benefits of Dataflow, and the power of GPU-based computing. Dataflow GPUs are provisioned on demand, and you only pay for the duration of your job.

Businesses of all sizes and across all industries are going through data-driven transformations today. A key element of that transformation is using data processing in conjunction with machine learning to analyze and make decisions about your systems, users, devices, and the broader ecosystem they operate in.

Dataflow enables you to process vast amounts of data (including structured data, log data, sensor data, audio and video files, and other unstructured data) and use machine learning to make decisions that impact your business and users. For example, users rely on Dataflow to solve problems such as detecting credit card fraud, detecting physical intrusions by analyzing video streams, and detecting network intrusions by analyzing network logs.

Benefits of GPUs

Unlike CPUs, which are optimized for general purpose computation, GPUs are optimized for parallel processing. GPUs implement an SIMD (single instruction, multiple data) architecture, which makes them more efficient for algorithms that process large blocks of data in parallel.  Applications that need to process media and apply machine learning typically benefit from the highly parallel nature of GPUs.

Google Cloud customers can now use NVIDIA GPUs to accelerate data processing tasks as well as image processing and machine learning tasks such as predictions. To understand the potential benefits, NVIDIA ran tests comparing the performance of a Dataflow pipeline that uses a TensorRT-optimized BERT (Bidirectional Encoder Representations from Transformers) model for natural language processing. In these tests, using Dataflow GPU to accelerate the pipeline resulted in an order-of-magnitude reduction in CPU and memory usage for the pipeline.

You can see more details about the test setup and test parameters at the blog post by NVIDIA.

We recommend testing Dataflow GPU with your workloads since the extent of the benefit depends on the data and the type of computation that is performed.

What customers are saying 

Cloud to Street uses satellites and AI to track floods in near real time anywhere on Earth to insure risk and save lives. The company produces flood maps at scale for disaster analytics and response by using Dataflow pipelines to automate the batch processing and downloading of satellite data at large scale. Cloud to Street uses Dataflow GPU not only to process satellite imagery but also to apply resource-intensive machine learning tasks within the Dataflow pipeline itself.

“GPU-enabled Dataflow pipelines asynchronously apply machine learning algorithms to satellite imagery. As a result, we are able to easily produce maps at scale without wasting time manually scaling machines, maintaining our own clusters, distributing workloads, or monitoring processes,” said Veda Sunkara, Machine Learning Engineer, Cloud to Street.

Getting started with Dataflow GPU

With Dataflow GPU, customers have the choice and flexibility to use any of the following high performance NVIDIA GPUs: NVIDIA® T4 Tensor Core, NVIDIA® Tesla® P4, NVIDIA® V100 Tensor Core, NVIDIA® Tesla® P100, NVIDIA® Tesla® K80.

Using Dataflow GPU is straightforward. Users can specify the type and number of GPUs to attach to Dataflow workers using the worker_accelerator parameter. We have also made it easy to install GPU drivers by automating the installation process. You instruct Dataflow to automatically install required GPU drivers by specifying the install-nvidia-driver parameter. 
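
As a hedged sketch only (the project, bucket, and exact flag names below are placeholders, and the way the worker_accelerator setting is passed has evolved, so confirm against the current Dataflow GPU documentation), a Python pipeline that requests one NVIDIA T4 per worker with automatic driver installation might look like this:

```python
# Minimal sketch: launching a Beam pipeline on Dataflow with an attached GPU.
# The worker_accelerator value follows the
# "type:<gpu-type>;count:<n>;install-nvidia-driver" pattern from the docs;
# project, region, and bucket names are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                # placeholder
    region="us-central1",
    temp_location="gs://my-bucket/tmp",  # placeholder
    experiments=[
        "use_runner_v2",  # GPU support may require Runner v2
        "worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver",
    ],
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")
        # Stand-in for a GPU-backed inference step.
        | "Score" >> beam.Map(lambda line: line.upper())
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/results")
    )
```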

Apache Beam notebooks with GPU

Apache Beam notebooks enable you to iteratively develop pipelines and interactively inspect your pipeline graph using JupyterLab notebooks. We have added GPU support to Apache Beam notebooks, which lets you develop a new Apache Beam job that leverages GPUs and test it iteratively before deploying the job to Dataflow. Follow the instructions in the Apache Beam notebooks documentation to start a new notebook instance and walk through a built-in sample pipeline that uses Dataflow GPU.

Integrated monitoring

We have also integrated GPU monitoring into Cloud Monitoring, so you can easily monitor the performance and utilization of GPU resources in your pipeline and optimize accordingly.

Looking ahead: Right Fitting for GPU

We are also announcing a new breakthrough capability called Right Fitting as part of the Dataflow Prime preview. Right Fitting allows you to specify which stages of the pipeline need GPU resources, so the Dataflow service provisions GPUs only for those stages, substantially reducing the cost of your pipelines. You can learn more about the Right Fitting capability here, and find more details about Dataflow GPU at Dataflow support for GPU. Dataflow GPUs are priced on a usage basis; you can find pricing information at Dataflow Pricing.
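
As a rough illustration of per-stage requirements, the Apache Beam Python SDK already exposes per-transform resource hints; the sketch below uses that API with placeholder names, and the exact mechanism Dataflow Prime uses for Right Fitting may differ from what is shown here.

```python
# Hypothetical sketch: hinting that only the inference stage needs a GPU,
# using Apache Beam's per-transform resource hints. Names and the hint
# string are illustrative placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run_inference(record):
    # Stand-in for a GPU-accelerated model call.
    return {"input": record, "score": 0.0}

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                # placeholder
    region="us-central1",
    temp_location="gs://my-bucket/tmp",  # placeholder
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/events/*.json")
        | "Parse" >> beam.Map(str.strip)  # CPU-only stage: no GPU hint
        | "Infer" >> beam.Map(run_inference).with_resource_hints(
              accelerator="type:nvidia-tesla-t4;count:1;install-nvidia-driver")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/scored/output")
    )
```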



Crux chose BigQuery for rock-solid, cost-effective data delivery

Editor’s note: Today, we’re hearing from data engineering and operations company Crux Informatics about how BigQuery helps them achieve rock-solid, cost-effective data delivery. 

At Crux Informatics, our mission is to get data flowing by removing obstacles in the delivery and ingestion of data at scale. We want to remove any friction across the data supply chain that stops companies from getting the most value out of data, so they can make smarter business decisions. 

But as you may know, if you’re in the business of data, this industry never stands still. It’s constantly evolving and changing. Data at rest is a bit of a misnomer in our book—the data might be at rest, but the computation around it never is. It’s therefore critical that we have solid, scalable, and cost-effective infrastructure. Without it, we would never make it off the starting block. 

That’s why when it came to building a centralized large-scale data cloud, we needed to invest in a solution that would not only suit our current data storage needs but also enable us to tackle what’s coming, supporting a massive ecosystem of data delivery and operations for thousands of companies. 

Low-cost ingestion with uncompromised performance

At Crux, we use BigQuery as our data warehouse. An interesting note about our journey to BigQuery is that, unlike many other organizations, we weren’t looking for a new solution to modernize our data analytics platform. We already had a cloud-based modern infrastructure composed of Snowflake on AWS that had worked well for us. We had previously partnered up with Google to deliver high-performance data access and data processing to our customers for our data lake. We weren’t in a situation where we were under pressure to move because our legacy solutions were ill-suited or outdated for today’s fast-paced data analytics requirements. 

Though there are many additional advantages to choosing BigQuery, arguably the most significant factor was the pricing model. We don’t use technology because it’s cost-effective; rather, it has to be cost-effective in the way that we use it.

The volume of data that we consume, the number of suppliers that we add each day, the quantity of data users we support, and the increasing demand for sharing data mean that our data consumption is astronomical. But Crux doesn’t generate revenue from consuming data; our business value is in the description, delivery, validation, transformation, and distribution of data. Therefore, we had to have a core system where the cost of ingestion was close to zero.

There is no charge for loading data into BigQuery, which matches our own commercial model extremely well. Of course, we have to pay for storage, but not having to worry about additional computation costs to load data into BigQuery allows us to build a very solid foundation at a competitive price point that is not easily matched by other large-scale data warehouse solutions.

When you combine low-to-no-cost, high-speed ingestion with the ability to store all of our data in multiple formats in one place, you get some very tangible benefits and cost savings.

Platform integrations create a better data experience 

Beyond pricing, choosing BigQuery provided several other advantages to us, including the ability to integrate and access other Google Cloud Platform (GCP) tools and services. For instance, we can set up connections to external data sources that live in the larger GCP ecosystem and run federated queries or create external data tables. 
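
As a hypothetical illustration (the connection, dataset, and table names below are ours, not Crux’s), a federated query against a Cloud SQL source can be issued directly from BigQuery with EXTERNAL_QUERY:

```python
# Hypothetical sketch: a BigQuery federated query against a Cloud SQL source
# using EXTERNAL_QUERY. The connection ID, dataset, and table names are
# placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

sql = """
SELECT b.dataset_id, b.delivery_ts, s.supplier_name
FROM `my-project.catalog.deliveries` AS b
JOIN EXTERNAL_QUERY(
  'my-project.us.suppliers_cloudsql',   -- Cloud SQL connection resource
  'SELECT supplier_id, supplier_name FROM suppliers;') AS s
ON b.supplier_id = s.supplier_id
"""

for row in client.query(sql).result():
    print(row.dataset_id, row.supplier_name)
```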

Besides static and semi-static reference data, we also work with real-time streaming data. For instance, we wanted to ingest live market prices and stream them into BigQuery. This isn’t necessarily for immediate analysis; we see high demand for capturing data in real time so that it can be used later for historical analysis or, indeed, for near-real-time use.

We also leverage solutions like Dataflow, which gives us a lot of extra value without the extra effort. Using Dataflow with BigQuery for streaming data processing made the whole integration process seamless.

From the perspective of our end data users, BigQuery’s integrations also help us increase overall customer satisfaction. We have the ability to easily transfer data anywhere they need it, including Google Cloud Storage, Amazon, Azure, and more. Our customers can also integrate their favorite business intelligence tools without additional overhead to their own processes. One of our most common use cases is the ability to create instant ODBC connections to BigQuery. This has huge implications for our ability to deliver a better overall experience to the people using our data. 

In addition, BigQuery makes it easy to handle cross-region replication without any additional costs. BigQuery really is a cloud, rather than an application running on a cloud. Instead of having to organize database replication and pay additional storage and egress fees, we can host our datasets globally at a fraction of our previous costs.

The right customer support makes all the difference

The final crucial differentiator in our decision was the Google team itself. Google’s support is exceptional—and this is not always the case with other providers. It’s rare to have such a close relationship where you have the ability to report an issue and know that it will end up on the desk where it needs to be. 

Google Cloud premier partner SADA helped us successfully modernize with BigQuery. SADA and Google worked with us to develop a cost structure that met our IT budget requirement. SADA’s continuous support throughout the process provided a successful customer experience and enabled our company to reach its goals. 

The support teams at Google and SADA are always quick to respond and help us keep everything running smoothly. And to be honest, while cost may have been the primary concern when we chose BigQuery—it is the support teams that have solidified the relationship. 

In many ways, flexibility, reliability, and scalability are almost table stakes for many cloud providers. But this level of caring and attention from support is certainly not found everywhere. It is, and will continue to be, a key component of our success with BigQuery.

It’s the real cloud data deal

Rarely do organizations ever want one type of data access, and it’s rare that there is a consolidated approach to data even under one roof. Google Cloud and BigQuery are helping us provide our customers with all the benefits of a data cloud, including fully managed global scale, load management, high performance, and genuinely good support. We put data in and data comes back out, and that’s really the experience we want for our suppliers, our data users, and ourselves. This foundation enables our hub to meet our customers’ needs, delivering data from any source, validated and transformed as they want it, to the destination they want.



New This Month in Data Analytics: Simple, Sophisticated, and Secure

June is the month that holds the summer solstice, and some of us in the northern hemisphere get to enjoy the longest days of sunshine out of the entire year. We used all the hours we could in June to deliver a flurry of new features across BigQuery, Dataflow, Data Fusion, and more.  Let’s take a look!

Simple, Sophisticated, and Secure

Usability is a key tenet of our data analytics development efforts. Our new user-friendly BigQuery improvements this month include:

Flexible data type casting

Formatting to change column descriptions

GRANT/REVOKE access control commands using SQL

We hope these will delight data analysts, data scientists, DBAs, and SQL enthusiasts, who can find more details in our blog here.
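
For example, the new GRANT/REVOKE support means dataset-level access can be managed directly in SQL. Here is a minimal sketch run through the Python client; the dataset and principal names are placeholders.

```python
# A minimal sketch of BigQuery's SQL GRANT/REVOKE statements, run through the
# Python client. Dataset and principal names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

# Grant read access on a dataset to a user, then revoke it.
client.query("""
    GRANT `roles/bigquery.dataViewer`
    ON SCHEMA `my-project.marketing`
    TO "user:analyst@example.com";
""").result()

client.query("""
    REVOKE `roles/bigquery.dataViewer`
    ON SCHEMA `my-project.marketing`
    FROM "user:analyst@example.com";
""").result()
```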

Beyond simplifying commands, we also recognize that it’s equally important to have more sophistication when dealing with transactions. That’s why we introduced multi-statement transactions in BigQuery.

As you probably know, BigQuery has long supported single-statement transactions through DML statements, such as INSERT, UPDATE, DELETE, MERGE and TRUNCATE, applied to one table per transaction. With multi-statement transactions, you can now use multiple SQL statements, including DML, spanning multiple tables in a single transaction. 

This means that any data changes across multiple tables associated with all statements in a given transaction are committed atomically (all at once) if successful—or all rolled back atomically in the event of a failure. 
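
To make that concrete, here is a minimal sketch of a two-table transaction submitted as a single BigQuery script via the Python client; the table and column names are placeholders.

```python
# A minimal sketch of a multi-statement transaction spanning two tables,
# submitted as a single BigQuery script. Table and column names are
# placeholders.
from google.cloud import bigquery

client = bigquery.Client()

client.query("""
    BEGIN TRANSACTION;

    -- Move fulfilled orders into an archive table...
    INSERT INTO `my-project.sales.orders_archive`
    SELECT * FROM `my-project.sales.orders` WHERE status = 'FULFILLED';

    -- ...and remove them from the live table.
    DELETE FROM `my-project.sales.orders` WHERE status = 'FULFILLED';

    COMMIT TRANSACTION;
""").result()  # both statements commit atomically, or neither does
```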

We also know that organizations need to control access to data down to a granular level, and that, with the complexity of data platforms increasing day by day, it’s become even more critical to identify and monitor who has access to sensitive data.

To help address these needs, we announced the general availability of BigQuery row-level security. This capability gives customers a way to control access to subsets of data in the same table for different groups of users. Row-level security in BigQuery gives different user personas access to different subsets of data in the same table, and policies can easily be created, updated, and dropped using DDL statements. To learn more, check out the documentation and best practices.
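
As a hedged illustration, a row access policy that restricts a sales table by region might look like the following; the project, table, and group names are placeholders.

```python
# A minimal sketch of a row-level access policy: members of the US sales group
# only see rows where region = 'US'. Names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

client.query("""
    CREATE ROW ACCESS POLICY us_sales_filter
    ON `my-project.sales.orders`
    GRANT TO ("group:us-sales@example.com")
    FILTER USING (region = 'US');
""").result()

# Dropping the policy later is also a one-line DDL statement:
# DROP ROW ACCESS POLICY us_sales_filter ON `my-project.sales.orders`;
```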

Simple, Safe, and Smart

Beyond building a simpler, more sophisticated and more secure data platform for customers, our team has been focused on providing solutions powered by built-in intelligence. One of our core beliefs is that for machine learning to be adopted and useful at scale, it must be easy to use and deploy.  

BigQuery ML, our embedded machine learning capability, has been adopted by 80% of our top customers around the globe and has become a cornerstone of their data-to-value journey.

As part of our efforts, we announced the general availability of AutoML Tables in BigQuery ML. This no-code solution lets customers automatically build and deploy state-of-the-art machine learning models on structured data. With easy integration with Vertex AI, AutoML in BQML makes it simple to achieve machine learning magic in the background. From preprocessing data to feature engineering and model tuning all the way to cross-validation, AutoML will “automagically” select and ensemble models so everyone (even non-data scientists) can use it.

Want to take this feature for a test drive? Try it today on BigQuery’s NYC Taxi public dataset following the instructions in this blog! 
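
If you’d like to see the shape of it first, here is a hedged sketch of training an AutoML Tables model in BigQuery ML against the NYC taxi public data; the column names and training budget are illustrative, so check the dataset schema (and the blog above) before running.

```python
# A hedged sketch of training an AutoML Tables model in BigQuery ML on the
# public NYC taxi data. Columns and training budget are illustrative.
from google.cloud import bigquery

client = bigquery.Client()

client.query("""
    CREATE OR REPLACE MODEL `my-project.demo.taxi_fare_automl`
    OPTIONS (
      model_type = 'AUTOML_REGRESSOR',
      input_label_cols = ['fare_amount'],
      budget_hours = 1.0
    ) AS
    SELECT
      passenger_count,
      trip_distance,
      fare_amount
    FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018`
    WHERE fare_amount > 0 AND trip_distance > 0
    LIMIT 100000
""").result()

# Once trained, predictions use the usual ML.PREDICT interface:
# SELECT * FROM ML.PREDICT(MODEL `my-project.demo.taxi_fare_automl`, (SELECT ...));
```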

Speaking of public datasets, we also introduced the availability of Google Trends data in BigQuery to enable customers to measure interest in a topic or search term across Google Search.  This new dataset will soon be available in Analytics Hub and will be anonymized, indexed, normalized, and aggregated prior to publication. 

Want to ensure your end-cap displays are relevant to your local audience?  You can take signals from what people are looking for in your market area to inform what items to place. Want to understand what new features could be incorporated into an existing product based on what people are searching for?  Terms that appear in these datasets could be an indicator of what you should be paying attention to.

All this data and technology can be put to use to deploy critical solutions to grow and protect your business. For example,  it can be difficult to know how to define anomalies during detection. If you have labeled data with known anomalies, then you can choose from a variety of supervised machine learning model types that are already supported in BigQuery ML. 

But what if you don’t know what kind of anomaly to expect, and you don’t have labeled data? Unlike typical predictive techniques that leverage supervised learning, organizations may need to be able to detect anomalies in the absence of labeled data. 

That’s why we were particularly excited to announce the public preview of new anomaly detection capabilities in BigQuery ML that leverage unsupervised machine learning to help you detect anomalies without needing labeled data.
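
As a rough sketch of the pattern (not the exact walkthrough from the announcement), you can train an unsupervised model on unlabeled data and then call ML.DETECT_ANOMALIES against new rows; the table and column names below are placeholders.

```python
# A minimal sketch of unsupervised anomaly detection in BigQuery ML: train a
# k-means model on unlabeled transactions, then flag outliers with
# ML.DETECT_ANOMALIES. Table and column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

# 1. Train an unsupervised model on unlabeled data.
client.query("""
    CREATE OR REPLACE MODEL `my-project.risk.txn_kmeans`
    OPTIONS (model_type = 'KMEANS', num_clusters = 8) AS
    SELECT amount, merchant_category, hour_of_day
    FROM `my-project.risk.transactions`
""").result()

# 2. Flag the most unusual rows (here, roughly the top 2% by distance).
rows = client.query("""
    SELECT *
    FROM ML.DETECT_ANOMALIES(
      MODEL `my-project.risk.txn_kmeans`,
      STRUCT(0.02 AS contamination),
      (SELECT amount, merchant_category, hour_of_day
       FROM `my-project.risk.new_transactions`))
    WHERE is_anomaly
""").result()
```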

Our team has been working with a large number of enterprises who leverage machine learning for better anomaly detection. In financial services for example, customers have used our technology to detect machine-learned anomalies in real-time foreign exchange data.  

To make it easier for you to take advantage of their best practices, we teamed up with Kasna to develop sample code, architecture guidance, and a data synthesizer that generates data so you can test these innovations right away. 

Simple, Scalable, and Speedy

Capturing, processing and analyzing data in motion has become an important component of our customer architecture choices. Along with batch processing, many of you need the flexibility to stream records into BigQuery so they can become available for query as they are written.  

Our new BigQuery Storage Write API combines the functionality of streaming ingestion and batch loading into a single API. You can use it to stream records into BigQuery or even batch process an arbitrarily large number of records and commit them in a single atomic operation.

Flexible systems that can handle batch and real-time processing in the same environment are in our DNA: Dataflow, our serverless data processing service for streaming and batch data, was built with flexibility in mind.

This principle applies not just to what Dataflow does but also how you can leverage it—whether you prefer using Dataflow SQL right from the BigQuery web UI, Vertex AI notebooks from the Dataflow interface, or the vast collection of pre-built templates to develop streaming pipelines.

Dataflow has been in the news quite a bit recently. You might have noted the recent introduction of Dataflow Prime, a new no-ops, auto-tuning functionality that optimizes resource utilization and further simplifies big data processing. You might have also read that Google Dataflow is a Leader in The 2021 Forrester Wave™: Streaming Analytics, which gave Dataflow a score of 5 out of 5 across 12 different criteria.

We couldn’t be more excited about the support the community has provided to this platform. The scalability of Dataflow is unparalleled and as you set your company up for more scale, more speed, and “streaming that screams”, we suggest you take a look at what leaders at Sky, RVU or Palo Alto Networks have already accomplished.

If you’re new to Dataflow, you’re in for a treat: this past month, Priyanka Vergadia (AKA CloudGirl) released a great set of resources to get you started. Read her blog here and watch her introduction video below!

What is Google Cloud Dataflow?

Cloud Dataflow, a fully managed data processing pipeline.

Simple structure that sticks together

We strive to be the partner of choice for your transformation journey, regardless of where your data comes from and how you choose to unify your data stack.

Our partners at Tata Consultancy Services (TCS) recently released research that highlights the importance of a unifying digital fabric and how data integration services like Google Cloud Data Fusion can enable their clients to achieve this vision.

We also announced SAP integration with Cloud Data Fusion, Google Cloud’s native data integration platform, to seamlessly move data out of SAP Business Suite, SAP ERP, and S/4HANA. To date, we provide more than 50 pipelines in Cloud Data Fusion to rapidly onboard SAP data.

This past month, we introduced our SAP Accelerator for Order to Cash.  This accelerator is a sample implementation of the SAP Table Batch Source feature in Cloud Data Fusion and will help you get started with your end-to-end order to cash process and analytics. 

It includes sample Cloud Data Fusion pipelines that you can configure to connect to your SAP data source, perform transformations, store data in BigQuery, and set up analytics in Looker. It also comes with LookML dashboards which you can access on Github.

Countless great organizations have chosen to work with Google for their SAP data. In June, we wrote about ATB Financial’s journey and how the company uses data to better serve over 800,000 customers, save over CA$2.24 million in productivity, and realize more than CA$4 million in operating revenue through “D.E.E.P”, a data exposure enablement platform built around BigQuery.

Finally, if you are an application developer looking for a unified platform that brings together data from Firebase Crashlytics, Google Analytics, Cloud Firestore, and third party datasets, we have good news!  

This past month, we released a unified analytics platform that combines Firebase, BigQuery, Looker, and Fivetran to easily integrate disparate data sources and infuse data into operational workflows for greater product development insights and an improved customer experience. This resource comes with sample code, a reference guide, and a great blog! We hope you enjoy it. See you all next month!

Creating a unified app analytics platform

This video explains how to centralize common data sources for digital platforms into BigQuery and dig deeper into customer behavior to make informed business decisions.


Creating a unified analytics platform for digital natives

Digital native companies have no shortage of data, which is often spread across different platforms and Software-as-a-service (SaaS) tools. As an increasing amount of data about the business is collected, democratizing access to this information becomes all the more important. While many tools offer in-application statistics and visualizations, centralizing data sources for cross-platform analytics allows everyone at the organization to get an accurate picture of the entire business. With Firebase, BigQuery and Looker, digital platforms can easily integrate disparate data sources and infuse data into operational workflows – leading to better product development and increased customer happiness.

How it works

In this architecture, BigQuery becomes the single source of truth for analytics, receiving data from various sources on a regular basis. Here, we can take advantage of the broad Google ecosystem to directly import data from Firebase Crashlytics, Google Analytics, and Cloud Firestore, and to query data within Google Sheets. Additionally, third-party datasets can easily be pushed into BigQuery with data integration tools like Fivetran.

Within Looker, data analysts can leverage pre-built dashboards and data models, or LookML, through source-specific Looker Blocks. By combining these accelerators with custom, first party LookML models, analysts can join across the data sources for more meaningful analytics. Using Looker Actions, data consumers can leverage insights to automate workflows and improve overall application health.
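
For instance, a cross-source query in BigQuery might join Crashlytics crash events with CRM accounts to surface the customers most affected by a release. The sketch below uses entirely hypothetical table and column names rather than the real export schemas.

```python
# A hypothetical example of the kind of cross-source analysis this
# architecture enables: joining a Crashlytics BigQuery export with CRM data
# synced by Fivetran to see which accounts are most affected by recent
# crashes. All table and column names are illustrative.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  crm.account_name,
  crm.annual_contract_value,
  COUNT(*) AS crash_events_last_7d
FROM `my-project.firebase_crashlytics.app_events` AS crash      -- placeholder
JOIN `my-project.fivetran_salesforce.accounts` AS crm           -- placeholder
  ON crash.user_account_id = crm.account_id
WHERE crash.event_timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY crm.account_name, crm.annual_contract_value
ORDER BY crash_events_last_7d DESC
LIMIT 20
"""

for row in client.query(sql).result():
    print(row.account_name, row.crash_events_last_7d)
```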

The architecture’s components are described below:

Google data sources

Google Analytics 4: Tracks customer interactions in your application.

Firebase Crashlytics: Collects and organizes Firebase application crash information.

Cloud Firestore: Backend database for your Firebase application.

Google Sheets: Spreadsheet service that can be used to collect manually entered, first-party data.

Third-party data sources

Customer Relationship Management (CRM) platform: Manages customer data. (While we use Salesforce as a reference, the same ideas can be applied to other tools.)

Issue tracking or project management software: Helps product and engineering teams track bug fixes and new feature development in applications. (While we use JIRA as a reference, the same ideas can be applied to other tools.)

Customer support software or help desk: Organizes customer communications to help businesses respond to customers more quickly and effectively. (While we use Zendesk as a reference, the same ideas can be applied to other tools.)

Cross-functional analytics

With the various data sources centralized into BigQuery, members across different teams can use the data to make informed decisions. Executives may want to combine business goals from a Google Sheet with CRM data to understand how the organization is tracking towards revenue goals. In preparation for board or team meetings, business leaders can use Looker’s integrations with Google Workspace, to send query results into Google Sheets and populate a chart inside a Google Slide deck. 

Technical program managers and site reliability engineers may want to combine Crashlytics, CRM and customer support data to prioritize bugs in the application that are affecting the highest value customers, or are often brought up inside support tickets. Not only can these users easily link back to the Crashlytics console for deeper investigation into the error, they can also use Looker’s JIRA action to automatically create JIRA issues based on thresholds across multiple data sources. 

Account and customer success managers (CSMs) can use a central dashboard to track the health of their customers using inputs like usage trends in the application, customer satisfaction scores and crash reports. With Looker alerts, CSMs can be immediately notified of problems with an account and proactively reach out to customer contacts.

Getting started

To get started creating a unified application analytics platform, be sure to check out our technical reference guide. If you’re new to Firebase, you can learn more here. To get started with BigQuery, check out the BigQuery Sandbox and these guides. For more information on Looker, sign up for a free trial here.



Introducing Looker Incremental PDTs: benefits and use cases

With the cost of cloud-based services making up an increasingly large part of companies’ budgets — and with businesses fueled by access to vital data resources — it makes sense that cost and performance optimization is top-of-mind for most data teams.

Transformation and aggregation of data can reduce the total amount of data queried. This improves performance and, for those on pay-for-use cloud billing plans, also keeps expenses low. As an analyst, you can utilize Looker’s persistent derived tables (PDTs) and the new incremental PDTs as tools in your performance-improving, cost-saving efforts.

Faster query responses and lower costs with PDTs

Persistent derived tables are the materialized results of a query, written to the Looker scratch schema in the connected database and rebuilt on a defined schedule. Commonly used to reduce database load and increase query performance, PDTs mean fewer rows of data are queried for subsequent responses. PDTs are also the underlying mechanism for Looker’s aggregate awareness, allowing Looker to intelligently query the smallest possible dataset for any query using computational algebra.

PDT use cases

PDTs are highly flexible and agile because they are built and controlled directly by data analysts. Changes can be implemented in moments, avoiding lengthy ticketing processes with data engineering teams. Analysts do not need to rely on data engineering for every iterative improvement to materialized transformation. Plus, implementing PDTs can be accomplished with only a few lines of Looker’s modeling language (LookML). If necessary, PDTs can be optimized for your specific database dialect using customized persistence and performance strategies, too.

One common use case for PDTs is data transformation or in-warehouse data post-ETL processes. Data teams use PDTs to normalize data, as well as for data cleansing and quality improvements. Because PDTs are rebuilt on a defined persistence schedule, the data in these persistent tables remains fresh and relevant, never stale.

Incremental PDTs: the faster and more efficient option

Data teams who have implemented PDTs have found them to be extremely powerful. But there were some use cases for which PDTs were not well suited: data teams using prior versions of Looker PDTs found that the scheduled rebuilding needed to keep data fresh became cumbersome when data tables were especially large (the rebuilds themselves were resource intensive) or when data was changed or appended frequently (rebuilds had to happen often). You can now use incremental PDTs to overcome these challenges and keep PDTs useful as datasets get larger and less mutable.

How are incremental PDTs different?

Standard PDTs remain extremely valuable for small datasets, or for data that is frequently changing. But, as data sets get bigger and less mutable, PDTs can start taking a long time to build and become expensive to compute, particularly because it becomes necessary to frequently rebuild standard PDTs from scratch. This is where incremental PDTs come in. They allow you to append fresh data to the PDT without a complete rebuild, skirting the need for an expensive rebuild of the table in its entirety. This can save considerable time and further drive down costs. In fact, companies who have helped test the feature have reported that their queries return data 10x-20x faster.

How to use incremental PDTs

Implementing incremental PDTs requires a single simple parameter (increment_key) in LookML. Incremental PDTs are supported in a range of database dialects, but be sure to check our list of supported dialects before beginning implementation. Incremental PDTs are supported for all types of PDTs: Native (LookML), SQL-based, and aggregate tables.

You can even account for late arriving data by using an increment offset (via the optional increment_offset parameter), which offers the flexibility to rebuild defined portions of the table independently. This helps ensure tables are not missing any data or compromising data accuracy.
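
A minimal LookML sketch is shown below, assuming an event-style source table with a created_at timestamp; the view, datagroup, table, and column names are illustrative, so adapt them to your own model and check the incremental PDT documentation for your dialect.

```lookml
# A minimal sketch of an incremental PDT, assuming an event-stream source
# table with a created_at timestamp. View, datagroup, and table names are
# illustrative.
view: events_incremental_pdt {
  derived_table: {
    datagroup_trigger: daily_datagroup   # the PDT must be persisted
    increment_key: "created_at"          # time-based column that drives increments
    increment_offset: 3                  # rebuild recent periods to catch late-arriving data
    sql:
      SELECT event_id, user_id, event_type, created_at
      FROM analytics.events ;;
  }

  dimension_group: created {
    type: time
    timeframes: [raw, date, week, month]
    sql: ${TABLE}.created_at ;;
  }
}
```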

When to use incremental PDTs

Identifying candidate queries for incremental PDTs can be as simple as finding PDTs with long build times in the “PDT Overview” section of the System Activity Database Performance Dashboard. This pre-curated dashboard is a good starting point for identifying PDTs that are slow or expensive to build.

To be built incrementally, a PDT needs to be persisted (providing a consistent source for the data) and must have a timestamp column. Tables with immutable rows, such as those resembling event streams, are excellent candidates for incremental PDTs. A common use case is event streams and other data where the bulk of the data remains unchanged but new data is appended frequently to the source table.

For more example incremental PDT use cases and for details on how to test the behavior of incremental PDTs in Looker development mode, please refer to the incremental PDT documentation.

Learn more about Looker PDTs

Some resources to help you with performance and cost efficiency and PDTs include:

Implementing Incremental PDTs with Kai Banks (video)

Incremental PDTs (Looker Docs)

The Power of Looker Transformations with Aleks Aleksic (video, JOIN 2020)

Identifying and Building PDTs for Performance Optimization by Mike DeAngelo (Looker Help)

Derived Tables in Looker (Looker Docs)

Optimizing Queries for Aggregate Awareness (video, JOIN 2020)

Aggregate Awareness Tutorial by Sean Higgins (Looker Help)

Aggregate Awareness (Looker Docs)

Want to know more about Looker and how to create fast, efficient, powerful analytics for your organization? You can request a Looker demo to learn how Looker uses an in-database architecture to get the most out of your database investment.


ATB Financial boosts SAP data insights and business outcomes with BigQuery

When ATB Financial decided to migrate its vast SAP landscape to the cloud, the primary goal was to focus on things that matter to customers as opposed to IT infrastructure. Based in Alberta, Canada, ATB Financial serves over 800,000 customers through hundreds of branches as well as digital banking options. To keep pace with competition from large banks and FinTech startups and to meet the increasing 24/7 demands of customers, digital transformation was a must. To support this new mandate, in 2019, ATB migrated its extensive SAP backbone to Google Cloud. In addition to SAP S/4 HANA, ATB runs SAP financial services, core banking, payment engine, CRM and business warehouse on Google Cloud. 

In parallel, changes were needed to ATB’s legacy data platform. The platform had stability and reliability issues and also suffered from a lack of historical data governance. Analytics processes were ad hoc and manual. The legacy data environment was also not set up to tackle future business requirements that come with a high dependency on real-time data analysis and insights.

After evaluating several potential solutions, ATB chose BigQuery as a serverless data warehouse and data lake for its next-generation, cloud-native architecture. “BigQuery is a core component of what we call our data exposure enablement platform, or DEEP,” explains Dan Semmens, Head of Data and AI at ATB Financial. According to Semmens, DEEP consists of four pillars, all of which depend on Google Cloud and BigQuery to be successful:

Real-time data acquisition: ATB uses BigQuery throughout its data pipeline, starting with sourcing, processing, and preparation, moving along to storage and organization, then discovery and access, and finally consumption and servicing. So far, ATB has ingested and classified 80% of its core SAP banking data as well as data from a number of its third-party partners, such as its treasury and cash management platform provider, its credit card provider, and its call center software. 

Data enrichment: Before migrating to Google Cloud, ATB managed a number of disconnected technologies that made data consolidation difficult. The legacy environment could handle only structured data, whereas Google Cloud and BigQuery lets the bank incorporate unstructured data sets, including sensor data, social network activity, voice, text, and images. ATB’s data enrichment program has enabled more than 160 of the bank’s top-priority insights running on BigQuery, including credit health decision models, financial reporting, and forecasting, as well as operational reporting for departments across the organization. Jobs such as marketing campaigns and month-end processes that used to take five to eight hours now run in seconds, saving over CA$2.24 million in productivity. 

Self-service analytics: Data for self-service reporting, dashboarding, and visualization is now available for ATB’s 400+ business users and data analysts. Previously, bringing data and analytics to the business users who needed it while ensuring security was burdensome for IT, fraught with recurrent data preparation and other highly manual elements. Now, ATB automates much of its data protection and governance controls through the entire data lifecycle management process. Data access is not only open to more team members but it is faster and easier to acquire without compromising security. And it’s not just raw data that users can access. ATB uses BigQuery to define its enterprise data models and create what it calls its data service layer to make it easier for team members to visualize their data.

AI-assisted analytics and automation: Through Google Cloud and BigQuery, ATB has been able to publish data and ML models that provide alerts and notifications via APIs to customer service agents. These real-time recommendations allow customer service agents to provide more tailored service with contextualized advice and suggested new services. So far, the company has deployed more than 40 ML models to generate over 20,000 AI-assisted conversations per month. Thanks to improved customer advocacy and less churn, the bank has realized more than CA$4 million in operating revenue. During the ongoing COVID crisis, the system was also able to predict when business and personal banking customers were experiencing financial distress so that a relationship manager could proactively reach out to offer support, such as payment deferral or loan restructuring. The AI tools provided by BigQuery are also helping ATB detect fraud that previously evaded rules-based fraud detection by using broader sets of timely and accurate data. 

Thanks to the speed and ease of moving data from SAP to BigQuery, ATB is using artificial intelligence (AI) and machine learning (ML) to do things it previously hadn’t thought possible, including sophisticated fraud prevention models, product recommendations, and enriched CRM data that improves the customer experience. 

Using the power of Google Cloud and BigQuery, ATB Financial has been able to draw more value from its SAP data while lowering cost and improving security and reliability. Speed to provide data sets and insights to internal team members has improved 30%. The bank also has seen a 15x reduction in performance incidents while improving data governance and security. Dan Semmens projects that the digital transformation strategy built on Google Cloud and BigQuery has both saved millions compared to its on-premises environment and has also realized millions in new business opportunities. 

Semmens is looking toward the future that includes initiatives like Open Banking and greater ability to provide real time personalized advice for customers to drive revenue growth. “We see our data platform as foundational to ATB’s 10-year strategy,” he says. “The work we’ve undertaken over the past 18 months has enabled critical functionality for that future.” 

Learn more about how ATB Financial is leveraging BigQuery to gain more from SAP data. Visit us here to explore how Google Cloud, BigQuery, and other tools can unlock the full value of your SAP enterprise data.



Fresh updates: Google Cloud 2021 Summits

There are a lot of great things happening at Google Cloud, and we’re delighted to share new product announcements, customer perspectives, interactive demos, and more through our Google Cloud Summit series, a collection of digital events taking place over the coming months.

Join us to learn more about how Google Cloud is transforming businesses in various industries, including Manufacturing & Supply Chain, Retail & Consumer Goods, and Financial Services. We’ll also be highlighting the latest innovations in data, artificial intelligence (AI) and machine learning (ML), security and more. 

Content will be available for on-demand viewing immediately following the live broadcast of each event. Bookmark this page to easily find updates as news develops, and don’t forget to register today or watch summits on demand by visiting the Summit series website.

Upcoming events

Government & Education Summit | Nov 3-4, 2021

Mark your calendars – registration is open for Google Cloud’s Government and Education Summit, November 3–4, 2021.

Government and education leaders have seen their vision become reality faster than they ever thought possible. Public sector leaders embraced a spirit of openness and created avenues to digital transformation, accepting bold ideas and uncovering new methods to provide public services, deliver education and achieve groundbreaking research. At Google Cloud, we partnered with public sector leaders to deliver an agile and open architecture, smart analytics to make data more accessible, and productivity tools to support remote work and the hybrid workforce. 

The pandemic has served as a catalyst for new ideas and creative solutions to long-standing global issues, including climate change, public health, and resource assistance. We’ve seen all levels of government and education leverage cloud technology to meet these challenges with a fervor and determination not seen since the industrial revolution. We can’t wait to bring those stories to you at the 2021 Google Cloud Government and Education Summit.

The event will open doors to digital transformation with live Q&As, problem-solving workshops and leadership sessions, designed to bring forward the strongest talent, the most inclusive teams, and the boldest ideas. Interactive, digital experiences and sessions that align with your schedule and interests will be available, including dedicated sessions and programming for our global audiences.

Register today for the 2021 Google Cloud Government and Education Summit. Moving into the next period of modernization, we feel equipped with not just the technology, but also the confidence to innovate and the experience to deliver the next wave of critical digital transformation solutions.

Digital Manufacturer Summit | June 22, 2021

Together we can transform the future of our industry. At Google Cloud’s Digital Manufacturer Summit customers will hear from Porsche, Renault Group, Doosan Heavy Industries & Construction, GE Appliances and Landis+Gyr who are boosting productivity across their enterprise with digital solutions powered by AI and analytics. 

Google Cloud recently launched a report and blog on AI Acceleration, which reveals that the COVID-19 pandemic may have spurred a significant increase in the use of AI and other digital enablers among manufacturers. We will continue this thought leadership in the summit.

Hear from forward-thinking business executives as they discuss the latest trends and the future of the industry. Participate in focused sessions and gain game-changing insights that dive deep into customer experience, product development, manufacturing operations, and supply chain operations. 

Register now: Global & EMEA

APAC Technical Series | June 22 – 24, 2021

IT and business professionals located in the Asia Pacific region can continue their cloud technology learnings by taking part in a three-day deep-dive into the latest data and machine learning technologies. This event will help you harness data and unlock innovation to build, iterate, and scale faster and with confidence. 

Register now: APAC

Security Summit | July 20, 2021

At Google Cloud Security Summit, security professionals can learn why many of the world’s leading companies trust Google Cloud infrastructure, and how organizations can leverage Google’s cloud-native technology to keep their organization secure in the cloud, on-premises, or in hybrid environments. 

During the opening keynote, engaging sessions, and live Q&A, customers will learn about how our Trusted Cloud can help them build a zero-trust architecture, implement shared-fate risk management, and achieve digital sovereignty. Join us to hear from some of the most passionate voices exploring how to make every day safer with Google. Together, we’ll reimagine how security should work in the cloud.

Register now: NORTHAM & EMEA

Retail & Consumer Goods Summit | July 27, 2021

Are you ready for the continued growth in digital shopping? Do you understand how leveraging AI and ML can improve your business? Join your peers and thought leaders for engaging keynotes and breakout sessions designed for the Retail and CPG industries at our upcoming Retail and Consumer Goods Summit on July 27th.

You’ll learn how some of the world’s leading retail and consumer goods companies like Ulta, Crate & Barrel, Albertsons, IKEA, and L’Oreal are using Google Cloud AI, machine learning, and data analytics technology to accelerate their digital transformation. In addition, we’ll share announcements on new products and solutions to help retailers and brands succeed in today’s landscape. 

Register now: Global

Now available on demand

Data Cloud Summit | May 26, 2021

In case you missed it, check out content from the Google Data Cloud Summit, which featured the launch of three new solutions – Dataplex, Analytics Hub and Datastream – to provide organizations with a unified data platform. The summit also featured a number of engaging discussions with customers including Zebra Technologies, Deutsche Bank, Paypal, Wayfair and more.

Watch on-demand now: Global

Financial Services Summit | May 27, 2021

We launched Datashare at the Financial Services Summit; this solution is designed to help capital markets firms share market data more securely and efficiently. Attendees can also view sessions on a range of topics including sustainability, the future of home buying, embedded finance, dynamic pricing for insurance, managing transaction surges in payments, the market data revolution, and more. Customers such as Deutsche Bank, BNY Mellon, HSBC, Credit Suisse, PayPal, Global Payments, Roostify, AXA, Santander, and Mr Cooper shared their insights as well. 

Watch on-demand now: NORTHAM & EMEA

We have also recently launched several new blog posts tied to the Financial Services Summit: 

Introducing Datashare solution for financial services for licensed market data discovery, access and analytics on Google Cloud

Google Cloud for financial services: driving your transformation cloud journey

How insurers can use severe storm data for dynamic pricing

Why embedding financial services into digital experiences can generate new revenue 

Applied ML Summit | June 10, 2021

The Google Applied ML Summit featured a range of sessions to help data scientists and ML engineers explore the power of Google’s Vertex AI platform, and learn how to accelerate experimentation and production of ML models.  Besides prominent Google AI/ML experts and speakers, the event also featured over 16 ML leaders from customers and partners like Spotify, Uber, Mr. Cooper, Sabre, PyTorch, L’Oreal, Vodafone and WPP/Essence.

Watch on-demand now: Global

Related Article

Save the date for Google Cloud Next ‘21: October 12-14, 2021

Join us and learn how the most successful companies have transformed their businesses with Google Cloud. Sign-up at g.co/cloudnext for up…

Read Article


Quickly, easily and affordably back up your data with BigQuery table snapshots

Mistakes are part of human nature. Who hasn’t left their car unlocked or accidentally hit “reply all” on an email intended to be private? But making mistakes in your enterprise data warehouse, such as accidentally deleting or modifying data, can have a major impact on your business. 

BigQuery time travel, which is automatically enabled for all datasets, lets you quickly access the state of a table as of any point in time within the last 7 days. However, recovering tables using this feature can be tricky, as you need to keep track of the “last known good” time. You may also want to maintain the state of your data beyond the 7-day window, for example for auditing or regulatory compliance requirements. This is where the new BigQuery table snapshots feature comes into play.
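For example, a time travel query can read a table as it existed at an earlier point; this is a minimal sketch, and mydataset.my_table is a placeholder name rather than a table from this walkthrough:

  -- Read the table as it existed one hour ago using time travel.
  SELECT *
  FROM mydataset.my_table
  FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);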

Table snapshots are available via the BigQuery API, SQL, command line interface, or the Google Cloud Console. Let’s look at a quick example in the Cloud Console.

First, we’ll create a new dataset and table to test out the snapshot functionality:
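(A minimal sketch; the dataset name backup_demo, the table schema, and the seven sample rows are illustrative assumptions.)

  CREATE SCHEMA IF NOT EXISTS backup_demo;

  -- A small inventory table with seven rows to snapshot; names and values are made up.
  CREATE OR REPLACE TABLE backup_demo.inventory (
    item_id INT64,
    item_name STRING,
    quantity INT64
  );

  INSERT INTO backup_demo.inventory (item_id, item_name, quantity)
  VALUES
    (1, 'anvil', 10),
    (2, 'hammer', 25),
    (3, 'nails', 500),
    (4, 'saw', 8),
    (5, 'drill', 12),
    (6, 'ladder', 4),
    (7, 'rope', 30);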

Next, open the properties page for the newly created table by selecting it in the Explorer pane. The source table for a snapshot is called the base table.

While you can use SQL or the BigQuery command line tool to create a snapshot, for this example we’ll create a snapshot of the inventory table using the Snapshot button in the Cloud Console toolbar.

BigQuery has introduced a new IAM permission (bigquery.tables.createSnapshot) that is required on the base table in addition to the existing bigquery.tables.get and bigquery.tables.getData permissions. This new permission has been added to the bigquery.dataViewer and bigquery.dataEditor roles, but will need to be added to any custom roles that you have created.

Table snapshots are treated just like regular tables, except that you can’t make any modifications to them (either to the data or the schema). If you create a snapshot in the same dataset as the base table, you will need to give it a unique name or use the suggested name, which appends a timestamp to the end of the table name. 

If you want to use the original table name as the snapshot name, you will need to create it in a new dataset so there will be no naming conflicts. For example, you could write a script to create a new dataset and create snapshots of all of the tables from a source dataset, preserving their original names. Note that when you create a snapshot in another dataset, it will inherit the security configuration of the destination dataset, not the source.
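One way to sketch such a script is with BigQuery multi-statement scripting; the dataset names backup_demo and backup_demo_snapshots below are illustrative assumptions:

  -- Snapshot every base table in backup_demo into backup_demo_snapshots,
  -- preserving the original table names.
  CREATE SCHEMA IF NOT EXISTS backup_demo_snapshots;

  FOR t IN (
    SELECT table_name
    FROM backup_demo.INFORMATION_SCHEMA.TABLES
    WHERE table_type = 'BASE TABLE'
  )
  DO
    EXECUTE IMMEDIATE FORMAT(
      'CREATE SNAPSHOT TABLE backup_demo_snapshots.%s CLONE backup_demo.%s',
      t.table_name, t.table_name);
  END FOR;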

You can optionally enter a value into the Expiration time field and have BigQuery automatically delete the snapshot at that point in time. You can also optionally specify a value in the Snapshot time field to create the snapshot from a historical version of the base table within the time travel window. For example, you could create a snapshot from the state of a base table as of 3 hours ago.

For this example, I’ll use the name inventory-snapshot. A few seconds after I click Save, the snapshot is created. It will appear in the list of tables in the Explorer pane with a different icon.

The equivalent SQL for this operation would be:
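(A sketch of that statement; the dataset name backup_demo is assumed, and the snapshot name is written with an underscore so that it is a valid identifier in this example. The commented lines show the optional settings described above.)

  CREATE SNAPSHOT TABLE backup_demo.inventory_snapshot
  CLONE backup_demo.inventory
  -- To snapshot a historical state within the time travel window, add for example:
  --   FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 3 HOUR)
  OPTIONS (
    expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)  -- optional expiration
  );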

Now, let’s take a look at the properties page for the new table snapshot in the Cloud Console.

In addition to the general snapshot table information, you see information about the base table that was used to create the snapshot, as well as the date and time that the snapshot was created. This information remains available even if the base table is later deleted. Although the snapshot size displays the size of the full table, you will only be billed (using standard BigQuery pricing) for the difference between the data maintained in the snapshot and the data currently maintained in the base table. If no data in the base table is removed or changed, there is no additional charge for the snapshot.

Because a snapshot is read-only, attempting to modify its data via DML or change its schema via DDL results in an error. However, you can change snapshot properties such as the description, expiration time, or labels. You can also use table access controls to change who has access to the snapshot, just as with any other table. 

Let’s say we accidentally deleted some data from the base table. You can simulate this by running the following commands in the SQL workspace.
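A sketch, assuming the backup_demo.inventory table from earlier:

  -- Simulate an accidental deletion: remove one of the seven rows from the base table.
  DELETE FROM backup_demo.inventory
  WHERE item_id = 7;

  -- Check the base table: it now contains 6 rows.
  SELECT COUNT(*) AS row_count FROM backup_demo.inventory;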

You will see that the base table now has only 6 rows, while the row count and size of the snapshot have not changed. If you need to access the deleted data, you can query the snapshot directly. For example, the following query will show that the snapshot still has 7 rows:
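(Using the assumed names from the sketches above.)

  -- The snapshot preserves the original 7 rows even after the base-table delete.
  SELECT COUNT(*) AS row_count
  FROM backup_demo.inventory_snapshot;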

However, if you want to update the data in a snapshot, you will need to restore it to a writable table. To do this, click the Restore button in the Cloud Console.

By default, the snapshot will be restored into a new table. However, if you would like to replace an existing table, you can use the existing table name and select the Overwrite table if it exists checkbox.

This operation can also be performed with the BigQuery API, SQL, or CLI. The equivalent SQL for this operation would be:
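(A sketch with the assumed names; the OR REPLACE form corresponds to the overwrite option, while the commented variant restores into a new table.)

  -- Restore the snapshot over the existing base table.
  CREATE OR REPLACE TABLE backup_demo.inventory
  CLONE backup_demo.inventory_snapshot;

  -- Or restore into a new, separately named table instead:
  -- CREATE TABLE backup_demo.inventory_restored
  -- CLONE backup_demo.inventory_snapshot;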

In this post, we’ve demonstrated how to use the Google Cloud Console and the new table snapshots feature to easily create backups of your BigQuery tables. You can also create periodic (daily, monthly, etc.) snapshots of tables using the BigQuery scheduled query functionality. Learn more about table snapshots in the BigQuery documentation.
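As a rough sketch of that scheduled-query approach, using BigQuery scripting (the dataset names, the snapshot naming pattern, and the 30-day expiration are assumptions, and the backup_demo_snapshots dataset is assumed to exist):

  -- Create a date-stamped snapshot of the inventory table each time the scheduled script runs.
  DECLARE snapshot_name STRING DEFAULT FORMAT(
    'backup_demo_snapshots.inventory_%s',
    FORMAT_TIMESTAMP('%Y%m%d', CURRENT_TIMESTAMP()));

  EXECUTE IMMEDIATE FORMAT("""
    CREATE SNAPSHOT TABLE %s
    CLONE backup_demo.inventory
    OPTIONS (expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 30 DAY))
  """, snapshot_name);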

Related Article

BigQuery Admin reference guide: Storage internals

Learn how BigQuery stores your data for optimal analysis, and what levers you can pull to further improve performance.

Read Article


Open data lakehouse on Google Cloud

For more than a decade, the technology industry has been searching for optimal ways to store and analyze vast amounts of data, approaches that can handle the variety, volume, latency, resilience, and varying data access requirements demanded by organizations.

Historically, organizations have implemented siloed and separate architectures: data warehouses to store structured, aggregated data used primarily for BI and reporting, and data lakes to store large volumes of unstructured and semi-structured data used primarily for ML workloads. This approach often resulted in extensive data movement, processing, and duplication, requiring complex ETL pipelines. Operationalizing and governing this architecture was challenging and costly, and it reduced agility. As organizations move to the cloud, they want to break down these silos. 

To address some of these issues, a new architecture choice has emerged: the data lakehouse, which combines the key benefits of data lakes and data warehouses. This architecture offers low-cost storage in an open format accessible by a variety of processing engines, such as Spark, while also providing powerful management and optimization features. 

At Google Cloud, we believe in providing choice to our customers. Organizations that want to build their data lakehouse using only open source technologies can do so by using low-cost object storage in Google Cloud Storage, storing data in open formats like Parquet, using processing engines like Spark, and enabling transactions with frameworks like Delta, Iceberg, or Hudi through Dataproc. This open-source-based approach is still evolving, however, and requires significant effort to configure, tune, and scale. 

At Google Cloud, we provide a cloud-native, highly scalable, and secure data lakehouse solution that delivers choice and interoperability to customers. Our cloud-native architecture reduces cost and improves efficiency for organizations. Our solution is based on:

Storage: A choice of storage, from low-cost object storage in Google Cloud Storage to highly optimized analytical storage in BigQuery

Compute: Serverless compute that provides different engines for different workloads

BigQuery, our serverless cloud data warehouse, provides an ANSI SQL-compatible engine that enables analytics on petabytes of data.

Dataproc, our managed Hadoop and Spark service, enables the use of various open source frameworks.

Serverless Spark allows customers to submit their workloads to a managed service that takes care of job execution. 

Vertex AI, our unified MLOps platform, enables building large-scale ML models with very limited coding.

Additionally, you can use many of our partner products, like Databricks, Starburst, or Elastic, for various workloads.

Management: Dataplex enables a metadata-led data management fabric across data in Google Cloud Storage (object storage) and BigQuery (highly optimized analytical storage). Organizations can create, manage, secure, organize and analyze data in the lakehouse using Dataplex.

Let’s take a closer look at some key characteristics of a data lakehouse architecture and how customers have been building this on GCP at scale. 

Storage Optionality

At Google Cloud, our core principle is delivering an open platform. We want to give customers the choice of storing their data in low-cost object storage in Google Cloud Storage, in highly optimized analytical storage in BigQuery, or in other storage options available on GCP. We recommend that organizations store their structured data in BigQuery storage, which also provides a streaming API that enables organizations to ingest large amounts of data in real time and analyze it. We recommend storing unstructured data in Google Cloud Storage. In cases where organizations need to access their structured data in OSS formats like Parquet or ORC, they can store it in Google Cloud Storage as well. 
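As an illustration of that choice, here is a hedged sketch of querying Parquet files kept in Cloud Storage directly from BigQuery through an external table; the dataset, table, and bucket path are hypothetical:

  -- Expose Parquet files in Cloud Storage to BigQuery as an external table.
  CREATE EXTERNAL TABLE mydataset.sales_external
  OPTIONS (
    format = 'PARQUET',
    uris = ['gs://my-bucket/sales/*.parquet']  -- placeholder bucket and path
  );

  -- Query the files in place, alongside native BigQuery tables.
  SELECT COUNT(*) AS row_count FROM mydataset.sales_external;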

At Google Cloud, we have invested in building the Data Lake Storage API, also known as the BigQuery Storage API, to provide consistent capabilities for structured data across both the BigQuery and GCS storage tiers. This API enables users to access BigQuery storage and GCS through any open source engine, such as Spark or Flink. The Storage API also enables users to apply fine-grained access control on data in BigQuery and GCS storage (coming soon).

Serverless Compute

The data lakehouse enables organizations to break data silos and centralize data, which facilitates many different types of use cases across the organization. To get maximum value from data, Google Cloud allows organizations to use different execution engines, optimized for different workloads and personas, on top of the same data tiers. This is made possible by the complete separation of compute and storage on Google Cloud. Meeting users at their level of data access, whether SQL, Python, or more GUI-based methods, means that technical skills do not limit their ability to use data for any job. Data scientists may be working outside traditional SQL-based or BI tools; because BigQuery has the Storage API, tools such as AI notebooks, Spark running on Dataproc, or Serverless Spark can easily be integrated into the workflow. The paradigm shift here is that the data lakehouse architecture supports bringing the compute to the data rather than moving the data around. With Serverless Spark and BigQuery, data engineers can spend all their time on code and logic. They do not need to manage clusters or tune infrastructure. They submit SQL or PySpark jobs from their interface of choice, and processing is auto-scaled to match the needs of the job.

BigQuery leverages a serverless architecture to enable organizations to run large-scale analytics using a familiar SQL interface. Organizations can use BigQuery SQL to run analytics on petabyte-scale datasets. In addition, BigQuery ML democratizes machine learning by letting SQL practitioners build models using existing SQL tools and skills. BigQuery ML is another example of how customers’ development speed can be increased by using familiar dialects, without the need to move data.  
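For instance, here is a hedged sketch of training and using a model entirely in SQL with BigQuery ML; the mydataset.customer_events table and its columns are hypothetical:

  -- Train a logistic regression model directly on data in BigQuery, with no data movement.
  CREATE OR REPLACE MODEL mydataset.churn_model
  OPTIONS (
    model_type = 'logistic_reg',
    input_label_cols = ['churned']
  ) AS
  SELECT
    tenure_months,
    monthly_spend,
    support_tickets,
    churned
  FROM mydataset.customer_events;

  -- Score rows with the trained model.
  SELECT *
  FROM ML.PREDICT(MODEL mydataset.churn_model,
                  (SELECT tenure_months, monthly_spend, support_tickets
                   FROM mydataset.customer_events));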

Dataproc, Google Cloud’s managed Hadoop and Spark service, can read data directly from lakehouse storage (BigQuery or GCS), run its computations, and write the results back. In effect, users are free to choose where and how to store the data and how to process it, depending on their needs and skills. Dataproc enables organizations to leverage all major OSS engines, such as Spark, Flink, Presto, and Hive.  

Vertex AI is a managed machine learning (ML) platform that allows companies to accelerate the deployment and maintenance of artificial intelligence (AI) models. Vertex AI natively integrates with BigQuery Storage and GCS to process both structured and unstructured data. It enables data scientists and ML engineers across all levels of expertise to implement Machine Learning Operations (MLOps) and thus efficiently build and manage ML projects throughout the entire development lifecycle. 

Intelligent data management and governance

The data lakehouse stores data as a single source of truth, making minimal copies of the data. Consistent security and governance are key to any lakehouse. Dataplex, our intelligent data fabric service, provides data governance and security capabilities across the lakehouse storage tiers built on GCS and BigQuery. Dataplex uses metadata associated with the underlying data to let organizations logically organize their data assets into lakes and data zones. This logical organization can span data stored in BigQuery and GCS. 

Dataplex sits on top of the entire data stack to unify governance and data management. It provides a unified data fabric that enables enterprises to intelligently curate, secure, and govern data at scale, with an integrated analytics experience. It provides automatic data discovery and schema inference across different systems, and complements this with automatic registration of metadata as tables and filesets into metastores. With built-in data classification and data quality checks in Dataplex, customers have access to data they can trust.

Data sharing is one of the key promises of the evolved data lake: different teams and different personas can share data across the organization in a timely manner. To make this a reality and break down organizational barriers, Google offers a layer on top of BigQuery called Analytics Hub. Analytics Hub provides the ability to create private data exchanges, in which exchange administrators (also known as data curators) grant publish and subscribe permissions to specific individuals or groups, both inside the company and externally to business partners or buyers. 

Open and flexible

In the ever-evolving world of data architectures and ecosystems, there is a growing suite of tools on offer to enable data management, governance, scalability, and even machine learning. 

With promises of digital transformation and evolution, organizations often find themselves with sophisticated solutions that carry a significant amount of bolted-on functionality. The ultimate goal, however, should be to simplify the underlying infrastructure and enable teams to focus on their core responsibilities: data engineers make raw data more useful to the organization, data scientists explore the data and produce predictive models, and business users make the right decisions for their domains.

Google Cloud has taken an approach anchored in openness, choice, and simplicity, and offers a planet-scale analytics platform that brings together two of the core tenets of enterprise data operations, data lakes and data warehouses, into a unified data ecosystem.  

The data lakehouse is the culmination of this architectural effort, and we look forward to working with you to enable it at your organization. For more insights on the lakehouse, you can read the full whitepaper here.
