Use graphs for smarter AI with Neo4j and Google Cloud Vertex AI

In this blog post, we’re going to show you how to use two technologies together: Google Cloud Vertex AI, an ML development platform, and Neo4j, a graph database. Together these technologies can be used to build and deploy graph-based machine learning models.

The code underlying this blog post is available in a notebook here.

Why should you use graphs for data science?

Many critical business problems use data that can be expressed as graphs. Graphs are data structures that describe the relationships between data points as much as the data points themselves.

An easy way to think about graphs is as analogous to the relationship between nouns and verbs. Nodes, or the nouns, are things such as people, places, and items. Relationships, or the verbs, are how they’re connected. People know each other and items are sent to places. The signal in those relationships is powerful.

Graph data can be huge and messy to deal with, which makes it nearly impossible to use directly in traditional machine learning tasks.

Google Cloud and Neo4j offer scalable, intelligent tools for making the most of graph data. Neo4j Graph Data Science and Google Cloud Vertex AI make building AI models on top of graph data fast and easy.

Dataset – Identify Fraud with PaySim

Graph-based machine learning has numerous applications. One common application is combating fraud in its many forms. Credit card companies identify fake transactions, insurers face false claims, and lenders look out for stolen credentials.

Statistics and machine learning have been used to fight fraud for decades. A common approach is to build a classification model on individual features of a payment and its users. For example, data scientists might train an XGBoost model to predict whether a transaction is fraudulent using the transaction amount, its date and time, the origin and target accounts, and the resulting balances.

These models miss a lot of fraud. By channeling transactions through a network of fraudulent actors, fraudsters can beat checks that look only at a single transaction. A successful model needs to understand the relationships between fraudulent transactions, legitimate transactions and actors.

Graph techniques are perfect for these kinds of problems. In this example, we’ll show you how graphs apply in this situation, and then walk through constructing an end-to-end pipeline that trains a complete model using Neo4j and Vertex AI. For this example, we’re using a variation on the PaySim dataset from Kaggle that includes graph features.

Loading Data into Neo4j

First off, we need to load the dataset into Neo4j. For this example, we’re using AuraDS. AuraDS offers Neo4j Graph Database and Neo4j Graph Data Science running as a managed service on top of GCP. It’s currently in a limited preview that you can sign up for here.

AuraDS is a great way to get started on GCP because the service is fully managed. To set up a running database with the PaySim data, all we need to do is click through a few screens and load the database dump file.

Once the data is loaded, there are many ways to explore it with Neo4j. One is to use the Python API in a notebook to run queries.

For instance, we can see the node labels by running the query:
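The query itself isn’t reproduced in this excerpt. As a minimal sketch, listing the node labels from a notebook with the official Neo4j Python driver might look like this (the connection URI and credentials are placeholders):

```
from neo4j import GraphDatabase

# Placeholders: use your AuraDS connection URI and credentials.
driver = GraphDatabase.driver("neo4j+s://<your-aurads-host>", auth=("neo4j", "<password>"))

with driver.session() as session:
    result = session.run("CALL db.labels()")      # built-in procedure that lists node labels
    labels = [record["label"] for record in result]

print(labels)
```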

In our notebook, this returns the list of node labels present in the PaySim graph.

The notebook gives examples of other queries including relationship types and transaction types as well. You can explore those yourself here.

Generating Embeddings with Neo4j

After you’ve explored your data set, a common next step is to use the algorithms in Neo4j Graph Data Science to engineer features that encode complex, high-dimensional graph data into values that tabular machine learning algorithms can use.

Many users start with basic graph algorithms to identify patterns. You can look at weakly connected components to find disjoint communities of account holders sharing common logins. The Louvain method is useful for finding rings of fraudsters laundering money. PageRank can be used to figure out which accounts are most important. However, these techniques require you to know exactly the pattern you’re looking for.

visualization of two weakly connected components

A different approach is to use Neo4j to generate graph embeddings. Graph embeddings boil down complex topological information in your graph into a fixed length vector where related nodes in the graph have proximal vectors. If graph topology, for example who fraudsters interact with and how they behave, is an important signal, the embeddings will capture that so that previously undetectable fraudsters can be identified because they have similar embeddings to known fraudsters.

graph embeddings, showing how graph topology translates into a fixed dimensional vector space

Some techniques make use of the embeddings on their own, for instance using a t-SNE plot to find clusters visually or computing raw similarity scores. The magic really happens when you combine your embeddings with Google Cloud Vertex AI to train a supervised model.

For our PaySim example, we can create a graph embedding with the following call:
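The exact call isn’t reproduced here. As a hedged sketch, it might look like the following Cypher call to the Graph Data Science library issued from Python; the projected graph name, property name, and nodeSelfInfluence value are assumptions:

```
# Reuses the driver from the earlier snippet. The projected in-memory graph name
# ('client_graph'), the mutated property name, and the nodeSelfInfluence value are assumptions.
embedding_query = """
CALL gds.fastRP.mutate('client_graph', {
  embeddingDimension: 16,
  nodeSelfInfluence: 0.8,
  mutateProperty: 'embedding'
})
YIELD nodePropertiesWritten
"""

with driver.session() as session:
    written = session.run(embedding_query).single()["nodePropertiesWritten"]
    print(f"Wrote embeddings for {written} nodes")
```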

That creates a 16-dimensional graph embedding using the Fast Random Projection (FastRP) algorithm. One neat feature here is the nodeSelfInfluence parameter. It helps us tune how much nodes further out in the graph influence the embedding.

With the embedding calculated, we can now dump it into a pandas dataframe, write that to a CSV and push that to a cloud storage bucket where Google Cloud’s Vertex AI can work with it. As before, these steps are detailed in the notebook here.
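A hedged sketch of those steps follows; the Cypher query, node label, property names, and bucket name are assumptions:

```
import pandas as pd
from google.cloud import storage

# Reuses the driver from the earlier snippets; the node label, property names,
# and bucket name are assumptions.
with driver.session() as session:
    records = session.run(
        "MATCH (c:Client) "
        "RETURN c.id AS id, c.embedding AS embedding, c.is_fraudster AS is_fraudster"
    )
    df = pd.DataFrame([r.data() for r in records])

df.to_csv("paysim_embeddings.csv", index=False)

storage.Client().bucket("<your-bucket>").blob("paysim_embeddings.csv").upload_from_filename(
    "paysim_embeddings.csv"
)
```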

Machine Learning with Vertex AI

Now that we’ve encoded the graph dynamics into vectors, we can use tabular methods in Google Cloud’s Vertex AI to train a machine learning model.

First off, we pull the data from a cloud storage bucket and use that to create a dataset in Vertex AI. The Python call looks like this:
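The exact call isn’t shown in this excerpt. A minimal sketch using the Vertex AI Python SDK might look like this, with project, region, and GCS path as placeholders:

```
from google.cloud import aiplatform

# Project, region, and GCS path are placeholders.
aiplatform.init(project="<your-project>", location="us-central1")

dataset = aiplatform.TabularDataset.create(
    display_name="paysim-embeddings",
    gcs_source=["gs://<your-bucket>/paysim_embeddings.csv"],
)
```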

With the dataset created, we can then train a model on it. That Python call looks like this:
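Again as a hedged sketch, assuming an AutoML tabular classification job and a target column named is_fraudster from the CSV sketched earlier:

```
from google.cloud import aiplatform

# 'dataset' is the TabularDataset created in the previous snippet;
# the target column name is an assumption based on the CSV sketched earlier.
job = aiplatform.AutoMLTabularTrainingJob(
    display_name="paysim-fraud-classifier",
    optimization_prediction_type="classification",
)

model = job.run(
    dataset=dataset,
    target_column="is_fraudster",
    budget_milli_node_hours=1000,   # one node hour of training budget
)
```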

You can view the results of that call in the notebook. Alternatively, you can log in to the GCP console and view the results in the Vertex AI GUI.

The console view is nice because it includes things like ROC curves and the confusion matrix. These can assist in understanding how the model is performing.

Vertex AI also offers helpful tooling for deploying the trained model. The dataset can be loaded into a Vertex AI Feature Store. Then an endpoint can be deployed. New predictions can be computed by calling that endpoint. This is detailed in the notebook here.
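As a rough sketch of those last two steps (the machine type and the feature names in the instance payload are assumptions):

```
# 'model' is the AutoML model trained in the previous snippet; the machine type
# and the feature names in the instance payload are assumptions.
endpoint = model.deploy(machine_type="n1-standard-4")

prediction = endpoint.predict(
    instances=[{"embedding_0": "0.12", "embedding_1": "-0.04", "amount": "1234.5"}]
)
print(prediction.predictions)
```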

Future Work

Working on this notebook, we quickly realized that there is an enormous amount of potential work that could be done in this area. Machine learning with graphs is a relatively new field, particularly when compared to the study of methods for tabular data.

Specific areas we’d like to explore in future work include:

Improved Dataset – For data privacy reasons, it’s very difficult to publicly share fraud datasets. That led us to use the PaySim dataset in this example, which is a synthetic dataset. From our investigation of both the dataset and the generator that creates it, there seems to be very little information in the data. A real dataset would likely have more structure to explore.

In future work we’d like to explore the graph of SEC EDGAR Form 4 transactions. Those forms show the trades that officers of public companies make. Many of those people are officers at multiple companies, so we anticipate the graph being quite interesting. We’re planning workshops for 2022 where attendees can explore this data together using Neo4j and Vertex AI. There is already a loader that pulls that data into Google BigQuery here.

Boosting and Embedding – Graph embeddings like Fast Random Projection duplicate data because copies of subgraphs end up in each tabular data point. XGBoost and other boosting methods also duplicate data to improve results, and Vertex AI uses XGBoost in this example. The result is that the models in this example likely have excessive data duplication. It’s quite possible we’d see better results with other machine learning methods, such as neural networks.

Graph Features – In this example we automatically generated graph features using the embedding. It’s also possible to manually engineer new graph features. Combining these two approaches would probably give us richer features.

Next Steps

If you found this blog post interesting and want to learn more, please sign up for the AuraDS preview here. Learn more about Vertex AI here. The notebook we’ve worked through is here. We hope you fork it and modify it to meet your needs. Pull requests are always welcome!


How Bayer Crop Science uses BigQuery and geobeam to improve soil health

Bayer Crop Science uses Google Cloud to analyze billions of acres of land to better understand the characteristics of the soil that produces our food crops. Bayer’s teams of data scientists are leveraging services from across  Google Cloud to load, store, analyze, and visualize geospatial data to develop unique business insights. And because much of this important work is done using publicly-available data, you can too!

Agencies such as the United States Geological Survey (USGS), National Oceanic and Atmospheric Administration (NOAA), and the National Weather Service (NWS) perform measurements of the earth’s surface and atmosphere on a vast scale, and make this data available to the public. But it is up to the public to turn this data into insights and information. In this post, we’ll walk you through some ways that Google Cloud services such as BigQuery and Dataflow make it easy for anyone to analyze earth observation data at scale. 

Bringing data together

First, let’s look at some of the datasets we have available. For this project, the Bayer team was very interested in one dataset in particular from ISRIC, a custodian of global soil information. ISRIC maps the spatial distribution of soil properties across the globe, and collects soil measurements such as pH, organic matter content, nitrogen levels, and much more. These measurements are encoded into “raster” files, which are large images where each pixel represents a location on the earth, and the “color” of the pixel represents the measured value at that location. You can think of each raster as a layer, which typically corresponds to a table in a database. Many earth observation datasets are made available as rasters, and they are excellent for storage of gridded data such as point measurements, but it can be difficult to understand spatial relationships between different areas of a raster, and between multiple raster tiles and layers.

Processing data into insights

To help with this, Bayer used Dataflow with geobeam to do the heavy-lifting of converting the rasters into vector data by turning them into polygons, reprojecting them to the WGS 84 coordinate system used by BigQuery, and generating h3 indexes to help us connect the dots — literally. Polygonization in particular is a very complex operation and its difficulty scales exponentially with file size, but Dataflow is able to divide and conquer by splitting large raster files into smaller blocks and processing them in parallel at massive scale. You can process any amount of data this way, at a scale and speed that is not possible on any single machine using traditional GIS tools. What’s best is that this is all done on the fly with minimal custom programming. Once the raster data is polygonized, reprojected, and fully discombobulated, the vector data is written directly to BigQuery tables from Dataflow.
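The pipeline code itself isn’t shown in this excerpt. As a rough, hedged sketch of what such a Dataflow plus geobeam pipeline can look like (the module and function names follow geobeam’s published examples and should be treated as assumptions; bucket, table, and band names are placeholders):

```
# Rough sketch of a Dataflow + geobeam raster-to-vector pipeline.
# GeotiffSource, make_valid, filter_invalid, and format_record follow geobeam's published
# examples and should be treated as assumptions; bucket, table, and band names are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from geobeam.io import GeotiffSource
from geobeam.fn import make_valid, filter_invalid, format_record

options = PipelineOptions(
    runner="DataflowRunner",
    project="your-project",
    region="us-central1",
    temp_location="gs://your-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadRaster" >> beam.io.Read(GeotiffSource("gs://your-bucket/soilgrids_ph.tif"))
        | "MakeValid" >> beam.Map(make_valid)            # repair geometries produced by polygonization
        | "FilterInvalid" >> beam.Filter(filter_invalid)
        | "Format" >> beam.Map(format_record, "ph", "float")
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "your-project:soil.ph_polygons",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```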

Once the data is loaded into BigQuery, Bayer uses BigQuery GIS and the h3 indexes computed by geobeam to join the data across multiple tables and create a single view of all of their soil layers. From this single view, Bayer can analyze the combined data, visualize all the layers at once using BigQuery GeoViz, and apply machine learning models to look for patterns that humans might not see.
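As a hedged illustration of that kind of join (the table and column names are hypothetical):

```
from google.cloud import bigquery

# Table and column names are hypothetical.
client = bigquery.Client()

query = """
SELECT
  ph.h3_index,
  AVG(ph.value)       AS avg_ph,
  AVG(nitrogen.value) AS avg_nitrogen
FROM `your-project.soil.ph_polygons` AS ph
JOIN `your-project.soil.nitrogen_polygons` AS nitrogen
  ON ph.h3_index = nitrogen.h3_index
GROUP BY ph.h3_index
"""

for row in client.query(query).result():
    print(row.h3_index, row.avg_ph, row.avg_nitrogen)
```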

Screenshot of Bayer’s soil analysis in GeoViz

Using geospatial insights to improve the business

The soil grid data is essential to characterizing the soil properties of the crop growth environments experienced by Bayer’s customers. Bayer can compute soil environmental scenarios for global crop lands to better understand what their customers experience in order to aid in testing network optimization, product characterization, and precision product design. It also supports Bayer’s real-world objectives by enabling them to characterize the soil properties of their internal testing network fields to help establish a global testing network and enable environmental similarity calculations and historical modeling.

It’s easy to see why developing spatial insights for planting crops is game-changing for Bayer Crop Science, and these same strategies and tools can be used across a variety of industries and businesses.

Google’s mission is to organize the world’s information and make it universally accessible and useful, and we’re excited to work with customers like Bayer Crop Science who want to harness their data to build products that are beneficial to their customers and the environment. To get started building amazing geospatial applications for your business, check out our reference guide to learn more about geospatial capabilities in Google Cloud, and open BigQuery in the Google Cloud console to get started using BigQuery and geobeam for your geospatial workloads.


Data governance in the cloud – part 1 – People and processes

In this blog, we’ll cover data governance as it relates to managing data in the cloud. We’ll discuss the operating model, which is independent of technology choices whether on-prem or cloud, the processes that ensure governance, and finally the technologies that are available to ensure data governance in the cloud. This is a two-part blog on data governance. In this first part, we’ll discuss the role of data governance, why it’s important, and the processes that need to be implemented to run an effective data governance program.

In the second part, we’ll dive into the tools and technologies that are available to implement data governance processes, e.g. data quality, data discovery, tracking lineage, and security.

For an in-depth and comprehensive text on Data governance, check Data Governance: People, Processes, and Tools to Operationalize Data Trustworthiness.

What is Data Governance?

Data governance is a function of data management that creates value for the organization by implementing processes to ensure high data quality and by providing a platform that makes it easier to share data securely across the organization while ensuring compliance with all applicable regulations.

The goal of data governance is to maximize the value derived from data, build user trust, and ensure compliance by implementing required security measures.

Data governance needs to be in place from the time a piece of data is collected or generated until the point at which that data is retired. Along the way, across this full lifecycle of the data, data governance focuses on making the data available to all stakeholders in a form that they can readily access and use in a manner that generates the desired business outcomes (insights, analysis) and, if relevant, conforms to regulatory standards. These regulatory standards are often an intersection of industry (e.g. healthcare), government (e.g. privacy), and company (e.g. non-partisan) rules and codes of behavior. See more details here.

Why is Data Governance Important?

In the last decade, data generated by users on mobile phones, health and fitness trackers, IoT devices, retail beacons, and similar sources has grown exponentially. At the same time, the cloud has made it easier to collect, store, and analyze this data at a lower cost. As the volume of data and adoption of the cloud continue to grow, organizations are challenged with a dual mandate to democratize and embed data in all decision making while ensuring that it is secured and protected from unauthorized use.

An effective data governance program is needed to implement this dual mandate: making the organization data driven on one hand and securing data from unauthorized use on the other. Organizations without an effective data governance program will suffer from compliance violations that lead to fines; poor data quality that lowers the quality of insights and weakens business decisions; difficulty finding data, which delays analysis and causes missed business opportunities; and poorly trained AI models, which reduce model accuracy and the benefits of using AI.

An effective data governance strategy encompasses people, processes, and tools & technologies. It drives data democratization to embed data in all decision making, builds user trust, increases brand value, reduces the chances of compliance violations which can lead to substantial fines, and loss of business.

Components of Data Governance

People and Roles in Data Governance

A comprehensive data governance program starts with a data governance council composed of leaders representing each business unit in the organization. This council establishes the high-level governing principles for how data will be used to drive business decisions. The council, with the help of key people in each business function, identifies the data domains, e.g. customer, product, patient, and provider. The council then assigns data ownership and stewardship roles for each data domain. These are senior-level roles, and each owner is held accountable and accordingly rewarded for driving the data goals set by the data governance council. Data owners and stewards are assigned from the business; for example, the customer data owner may come from marketing or sales, the finance data owner from finance, and the HR data owner from HR.

The role of IT is that of data custodian. IT ensures the data is acquired, protected, stored, and shared according to the policies specified by data owners. As data custodians, IT does not make the decisions on data access or data sharing. IT’s role is limited to managing technology to support the implementation of data management policies set by data owners.

Processes in Data Governance

Each organization will establish processes to drive towards the implementation of goals set by the data governance council. The processes are established by data owners and data stewards for each of their data domains. The processes focus on the following high level goals:

1. Data meets the specified data quality standards – e.g. 98% completeness, no more than 0.1% duplicate values, 99.99% consistent data across different tables, and what constitutes on-time delivery

2. Data security policies to ensure compliance with internal and external policies

Data is encrypted at rest and in transit

Data access is limited to authorized users only

All sensitive data fields are redacted or encrypted and dynamically decrypted only for authorized users

Data can be joined for analytics in de-identified form, e.g. using deterministic encryption or hashing (see the sketch after this list)

Audits are available for authorized access as well as unauthorized attempts

3. Data sharing with external partners is available securely via APIs

4. Compliance with industry and geo specific regulations, e.g. HIPAA, PCI DSS, GDPR, CCPA, LGPD

5. Data replication is minimized

6. Centralized data discovery for data users via data catalogs

7. Trace data lineage to identify data quality issues, data replication sources, and help with audits
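As a minimal illustration of the de-identified join technique mentioned in item 2 above, a keyed hash can turn a raw identifier into a deterministic token; the key handling shown here is a placeholder:

```
import hashlib
import hmac

# Placeholder key: in practice the key is managed centrally (for example in Cloud KMS)
# and never stored with the data.
SECRET_KEY = b"<key-managed-by-your-security-team>"

def deidentified_key(raw_identifier: str) -> str:
    """Deterministic keyed hash so the same input always yields the same token."""
    return hmac.new(SECRET_KEY, raw_identifier.encode("utf-8"), hashlib.sha256).hexdigest()

# Two tables that each store deidentified_key(customer_id) can still be joined on the token
# without either side exposing the raw identifier.
print(deidentified_key("customer-12345"))
```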

Technology

Implementing the processes specified in the data governance program requires the use of technology. From securing data and retaining and reporting audit logs to automating monitoring and alerts, multiple technologies are integrated to manage the data lifecycle.

In Google Cloud, a comprehensive set of tools enables organizations to manage their data securely and drive data democratization. Data Catalog enables users to easily find data from one centralized place across Google Cloud. Data Fusion tracks lineage so data owners can trace data at every point in the data life cycle and fix issues that may be corrupting data. Cloud Audit Logs  retain audits needed for compliance. Dataplex provides intelligent data management, centralized security and governance, automatic data discovery, metadata harvesting, lifecycle management, and data quality with built-in AI-driven intelligence.

We will discuss the use of tools and technologies to implement governance in part 2 of this blog.


Carrefour Belgium: Driving a seamless digital experience with SAP on Google Cloud

Stijn Stabel’s first days as CTO of Carrefour Belgium were… challenging. “The data center was out of date with a roof that leaked,” he recalls. Not quite what one would expect from the eighth-largest retailer in the world. 

Carrefour is a household name in Europe, Asia, and the Middle East, operating over 12,000 hypermarkets, groceries, and convenience stores in more than 30 countries. Carrefour Belgium has more than 10,000 employees operating 700 stores along with an eCommerce business. Nonetheless, Stabel’s goals were ambitious: “Our goal is to become a digital retail company,” he says. “We want to move quickly from being a slow mover in digital transformation to becoming trailblazing. That’s one of the coolest challenges you can have as a CTO.”

Three years later, Carrefour Belgium is well along the path to achieving that goal, having migrated nearly all of its SAP application stack to Google Cloud, including finance, HR, and others. “We’re really going headfirst and full-on. It’s a huge challenge, but it’s definitely one of the most exciting transformations I have seen so far,” he says. 

The challenges that Carrefour Belgium faced were more than just an aging data center. With systems divided between two private clouds, there was no efficient way to leverage data for the advanced analytics Stabel knew the company would need to compete. “The world is changing at a record pace,” he says. “Either you’re keeping up with that or you’re not. Standing still means basically choosing to give up on the company’s progress — and at some point, to really give up on the company altogether.” 

This is especially true, he says, when it comes to creating a seamless customer experience both online and in stores. “Everything in the store will be digital,” he says. “How much longer are people going to put up with price tags printed on paper that have to be changed over and over? How long will it be acceptable to maintain a large ecological footprint? Sustainability will be increasingly important.”

Olivier Luxon, Carrefour Belgium’s CIO, agrees, emphasizing the centrality of customer experience in everything the company does and how quickly customer needs shifted due to the global pandemic. “What we really saw with COVID was customers seeking more digital services. That had been a trend previously, but it accelerated dramatically with COVID.”

The decision to move SAP to Google Cloud

After researching its options, Carrefour Belgium chose a “Google-first” cloud strategy for four reasons:

Partnership: “Google listens to our needs and adapts to them instead of trying to box us in,” Stabel says. 

Technology: Google’s analytical and other tools, particularly BigQuery, were the deciding factor when it came to choosing a cloud provider for its SAP systems. “There’s no sound reason to move your SAP environment to a cloud host other than the one you’re going to use it for analytics,” he says, noting that the company will eventually bring all of its SAP data into BigQuery for analytics. “Our data team is working on building out a foundation based on BigQuery, which will enable them to work on multiple use cases such as assortment planning and promotion. The goal is to become a truly data-driven company.” 

Security: Carrefour Belgium is implementing BeyondCorp, the zero trust security solution from Google Cloud. By shifting access controls from the network perimeter to individual users, BeyondCorp enables secure work from virtually any location without the need for a traditional VPN. ”We’re going to be the first Carrefour division moving to that platform, so we’re very excited to be blazing the trail,” Stabel says.

Value: “I have to report to a CFO, so partnership and technology alone are not enough to make a business case,” Stabel says. “I have to show demonstrable business value, and that’s what we get with Google Cloud.”

Carrefour Belgium’s migration strategy has been to lift and shift its legacy SAP ERP Central Component (ECC) environment before doing a greenfield implementation of S/4HANA on Google Cloud, a process that is already underway. Currently, the HR module has been upgraded to S/4HANA, with retail operations to follow. 

More performance, better experience, greater insights

It’s still early days, but the move to Google Cloud has already paid dividends in improved performance for back office operations, which, Olivier points out, frees time and resources to devote to serving customers better. Eventually, he feels that SAP on Google Cloud will have a direct impact on customer experience, particularly given the opportunities that data analytics will provide to better understand customer needs and meet them more effectively. “Data is becoming more and more important, not only for Carrefour, but for all the companies in the world,” Olivier says. “It drives personalized customer experience, promotions, operational efficiency, and finance. If you don’t set up the right data architecture on Day One, it will be close to impossible to be efficient as a company in a few years from now.”

In the end, the goal is to provide Carrefour Belgium the tools it needs to serve customers better. “SAP supports our business by giving us the right tools and processes to manage areas including supply chain, finance, HR, and retail,” Olivier says. “What was missing, however, was the availability, scalability, and security we needed to better serve our employees, stores and customers, and that’s something we got by moving to Google Cloud.” And by moving to Google Cloud — which has been carbon neutral since 2007 and is committed to operating entirely carbon-free by 2030 — Carrefour is also able to pursue its sustainability objectives simply by modernizing its business operations in the cloud.

“Google is a company that eats, breathes, and sleeps digital,” Stabel says. “At its heart, Carrefour is a retail company. We know how to be a retailer. Our partnership is a cross-pollination. What I’m really looking forward to is continuing to learn from Google Cloud and see what other solutions we can adopt to improve Carrefour Belgium and better serve our users &  customers.”

Learn more about Carrefour Belgium’s deployment and how you can accelerate your organization’s digital transformation by moving your SAP environment to Google Cloud.



Top 10 takeaways from Looker’s 2021 JOIN@Home conference

JOIN@Home was an incredible celebration of the achievements that the Looker community made in the last year, and I was proud to be a part of it. Prominent leaders in the data world shared their successes, tips, and plans for the future. In the spirit of keeping the learning alive, I summarized the top two takeaways from each of the keynotes. They’re accompanied by illustrations that were captured live during the sessions by a local artist. 

Plus, there’s a fun surprise for you at the end. 

“Celebrating Data Heroes – Transforming Our World with Data”

Our opening keynote featured a number of inspiring data professionals who use Looker in their work every day to see trends, drive decision making, and grow their customer base.

Some of their main takeaways were:

You can use analytics to make change for the greater good.
Surgeon scientist Dr. Cherisse Berry spoke of cross-referencing healthcare outcomes data like trauma care survival rates, how long patients wait before being seen, and whether patients were appropriately triaged with demographic data to find gender and racial disparities in healthcare. For instance, she found that critically injured women receive trauma care less often than men. Because her analysis made the disparity known, informed decisions and actions can be taken to bring greater equality to New York state’s trauma care system. 

Provide templates to make insights more easily available to more users, especially non-technical ones.
Michelle Yurovsky of UiPath, an automation platform that helps customers avoid repetitive tasks, shared one of the key ways UiPath gets customers engaged: by providing dashboard templates that answer common automation questions. Customers get useful insights the second they click on the product. They can copy and modify the templates according to their business needs, so they’re less intimidated to start working with analytics – especially if they have no previous experience building dashboards.


“Developing a Better Future with Data”

This keynote looked to the future of analytics.

Two major themes were:

Composable analytics capabilities help make application development faster, easier and more accessible.
Composable analytics means creating a custom analytics solution using readily available components. You have access to composable analytics with Looker through the extension framework, which offers downloadable components you can use to build your application right on top of the Looker platform. Filter and visualization components enable you to more easily create the visual side of these data experiences.

Augmented analytics help make it easier to handle the scale and complexity of data in modern business – and to make smarter decisions about probable future outcomes.
Augmented analytics generate sophisticated analyses, created by integrating machine learning (ML) and artificial intelligence (AI) with data. The Looker team has worked to make augmented analytics more accessible to everyone this year. In particular, new Blocks give you access to ML  insights through the familiar Looker interface, enabling you to more quickly prototype ML- and AI-driven solutions. For instance, the Time-series Forecasting Block (which uses BigQuery ML) can be installed to give analysts deeper insights into future demand for better inventory and supply chain management. CCAI Insights gives call centers access to Contact Center AI Insights data with analysis they can use immediately. 
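As a rough illustration of the kind of BigQuery ML time-series model the Time-series Forecasting Block builds on (the table, column, and model names here are hypothetical):

```
from google.cloud import bigquery

# Table, column, and model names are hypothetical.
client = bigquery.Client()

client.query("""
CREATE OR REPLACE MODEL `your-project.demand.daily_forecast`
OPTIONS (
  model_type = 'ARIMA_PLUS',
  time_series_timestamp_col = 'order_date',
  time_series_data_col = 'units_sold'
) AS
SELECT order_date, units_sold
FROM `your-project.demand.daily_sales`
""").result()

# Forecast the next 30 days.
for row in client.query(
    "SELECT * FROM ML.FORECAST(MODEL `your-project.demand.daily_forecast`, STRUCT(30 AS horizon))"
).result():
    print(row.forecast_timestamp, row.forecast_value)
```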

“The Looker Difference”

Product Managers Ani Jain and Tej Toor highlighted many recent features you might find useful for activating and enabling users with Looker.

Here are two moments that stood out:

Giving your teams better starting points can lead to more engagement with analytics.
Two improved ways to find insights from this year: Quick Starts and board improvements. Quick Starts function as pre-built Explore pages that your users can open with a click, helping to make ad hoc analysis more accessible and less intimidating. They’re also a convenient way to save an analysis you find yourself doing frequently – and they even save your filter settings. And, with new navigation improvements in Looker, boards are easier to find and use. Now you can pin links to a board, whether it’s a dashboard, a Look, an Explore, or something else, including external links. So go ahead. Try your hand at creating a useful data hub for your team with a new board.

Natural language processing and Looker can help you make sense of relationships within data, quickly.
A great example of this is the Healthcare NLP API Block, which creates an interactive user interface where healthcare providers, payers, pharma companies, and others in the healthcare industry can more easily access intelligent insights. Under the hood, this Block works on top of GCP Healthcare NLP API, an API offering pre-trained natural language models to extract medical concepts and relationships from medical text. The NLP API helps to structure the data, and the Looker Block can make the insights within that data more accessible.

“Building and Monetizing Custom Data Experiences with Looker” 

Pedro Arellano, Product Director at Looker, and Jawad Laraqui, CEO of Boston-based consultancy Data Driven, chatted about embedded analytics, the remarkable speed at which one can build data applications with Looker, and monetization strategies.

Points you don’t want to miss from this one:

Looker can help you augment an existing customer experience and create a new revenue stream with embedded data.
For example, you can provide personalized insights to a customer engaged with your product, or automate business processes such as using data to trigger a service order workflow when an issue is encountered with a particular product. Embedding data in these ways can make the customer experience smoother all around. To take it a step further, you can monetize a data product you build to help create a new revenue stream.

Building for Looker Marketplace can help you find more customers for your app and can promote a better user experience.
Jawad compared using the extension framework to build for the Looker Marketplace to having an app in the Apple App Store. Being in the Marketplace is a way for customers to find and use his product organically, and it helps give the end users a streamlined experience. He said: “We were able to quickly copy and paste our whole application from a stand-alone website into something that is inside of Looker. And we did this quickly—one developer did this in one day. It’s a lot easier than you think, so I encourage everyone to give it a try. Just go build!”

“Looker for Modelers: What’s New and What’s Next”

Adam Wilson, Product Manager at Looker, covered the latest upgrades and future plans for Looker’s semantic data model. This layer sits atop multiple sources of data and standardizes common metrics and definitions, so it can be governed and fed into modern, built-in business intelligence (BI) interactive dashboards, connected to familiar tools such as Google Sheets, and surfaced in other BI tools where users work —we’re calling this the unified semantic model.

Capabilities to look out for:

Take advantage of Persistent Derived Table (PDT) upgrades that facilitate the end-user experience.
You can use incremental PDTs to capture data updates without rebuilding the whole table, meaning your users get fresh data on a more regular basis with a lower load on your data warehouse. And it’s now possible to validate PDT build status in development mode, giving you the visibility needed to determine when to push updates to production. Coming soon, you’ll be able to do an impact analysis on proposed changes with visualized dependencies between PDTs.

Reach users where they are with Connected Sheets and other BI tools.
Coming soon, you’ll be able to explore Looker data in Google Sheets and share charts to Slides, too. And with Governed BI Connectors, Looker can act as a source of truth for users who are accustomed to interacting with data in Tableau, Power BI, and Google Data Studio.  You can sign up to hear when the Connected Sheets and Looker integration is available or separately to hear about preview availability for Governed BI Connectors.


Hackathon

Speaking of interesting new developments, here’s your fun surprise: a hackathon recap with a new chart you can use in your own analytics.

The Looker developer community came together to create innovative Looker projects at this year’s JOIN hackathon, Hack@Home 2021. The event provided the participants access to the latest Looker features and resources to create tools useful for all Looker developers. The Nearly Best Hack Winner project demonstrated how easy it is to make custom visualizations by creating an animated bar chart race visualization that anyone can use. The Best Hack Winner showcased the power of the Looker extension framework with a Looker application that conveniently writes CSV data into Looker database connections.

You can still view all the keynotes, as well as the breakout sessions and learning deep dives, on-demand on the JOIN@Home content hub. These will be available through the end of the month, so go soak up the learning while you can.


Announcing preview of BigQuery’s native support for semi-structured data

Today we’re announcing a public preview for the BigQuery native JSON data type, a capability which brings support for storing and analyzing semi-structured data in BigQuery. 

With this new JSON storage type and advanced JSON features like JSON dot notation support, adaptable data type changes, and new JSON functions, semi-structured data in BigQuery is now intuitive to use and query in its native format. 

You can enroll in the feature preview by signing up here.

The challenge with changing data

Building a data pipeline involves many decisions. Where will my data be ingested from? Does my application require data to be loaded as a batch job or real-time streaming ingest? How should my tables be structured? Many of these decisions are often made up front before a data pipeline is built, meaning table or data type changes down the road can unfortunately be complex and/or costly. 

To handle such events, customers have traditionally had to build complex change-handling automation, pause data ingest to allow for manual intervention, or write unplanned data to a catch-all String field which later has to be parsed in a post-process manner. 

These approaches all add cost and complexity, and they slow down your ability to generate data-driven insights.

Native JSON to the rescue

JSON is a widely used format that allows for semi-structured data, because it does not require a schema. This offers you added flexibility to store and query data that doesn’t always adhere to fixed schemas and data types. By ingesting semi-structured data as a JSON data type, BigQuery allows each JSON field to be encoded and processed independently. You can then query the values of fields within the JSON data individually via dot notation, which makes JSON queries easy to use. This new JSON functionality is also cost efficient compared to previous methods of extracting JSON elements from String fields, which requires processing entire blocks of data.

Thanks to BigQuery’s native JSON support, customers can now write to BigQuery without worrying about future changes to their data. Customers like DeNA, a mobile gaming and e-commerce services provider, see value in leveraging this new capability because it provides faster time to value.

“Agility is key to our business. We believe Native JSON functionality will enable us to handle changes in data models more quickly and shorten the lead time to pull insights from our data.”—Ryoji Hasegawa, Data Engineer, DeNA Co Ltd.

JSON in action

The best way to learn is often by doing, so let’s see native JSON in action. Suppose we have two ingestion pipelines, one performing batch ingest and the other performing real-time streaming ingest, both of which ingest application login events into BigQuery for further analysis. By leveraging the native JSON feature, we can now embrace upstream data evolution and changes to our application. 

Batch ingesting JSON as a CSV

JSON types are currently supported via batch load jobs of CSV-formatted files. So as an example, let’s create a new table called json_example.batch_events and then ingest this correctly escaped login_events.csv file into BigQuery with a bq load command (a Python-client equivalent is sketched below). You’ll notice the batch_events table has both structured columns and a labels field which uses the new JSON type for our semi-structured fields. In this example some application values will remain highly structured, such as event creationTime, event ID, event name, etc., so we’ll define this table as storing both structured data and semi-structured data.
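As a hedged sketch, the Python-client equivalent of that load might look like this; the structured column names are assumptions:

```
from google.cloud import bigquery

# The structured column names are assumptions; the labels column uses the new JSON type.
client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("eventId", "STRING"),
        bigquery.SchemaField("eventName", "STRING"),
        bigquery.SchemaField("creationTime", "TIMESTAMP"),
        bigquery.SchemaField("labels", "JSON"),
    ],
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
)

with open("login_events.csv", "rb") as source_file:
    client.load_table_from_file(
        source_file, "your-project.json_example.batch_events", job_config=job_config
    ).result()
```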

We’ll look at how to run queries using the new JSON functions a bit later in this blog, but first let’s also explore how we might stream semi-structured real-time events into BigQuery using the JSON type too.

Real-Time Streaming JSON Events

Now let’s walk through an example of how to stream the same semi-structured application login events into BigQuery. We’ll first create a new table called json_example.streaming_events which leverages the same combination of structured and semi-structured columns. However, instead of using the bq command line, we’ll create this table by running the SQL Data definition language (DDL) statement:
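The statement itself isn’t reproduced in this excerpt. A hedged sketch of what it might look like, run here through the BigQuery Python client (the structured column names are assumptions; the JSON column is the relevant part):

```
from google.cloud import bigquery

# The structured column names are assumptions; the JSON column is the relevant part.
client = bigquery.Client()

client.query("""
CREATE TABLE IF NOT EXISTS json_example.streaming_events (
  eventId      STRING,
  eventName    STRING,
  creationTime TIMESTAMP,
  labels       JSON
)
""").result()
```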

BigQuery supports two forms of real-time ingestion: the BigQuery Storage Write API and the legacy streaming API. The Storage Write API provides a unified data-ingestion API to write data into BigQuery via gRPC, offers advanced features like exactly-once delivery semantics, stream-level transactions, and support for multiple workers, and is generally recommended over the legacy streaming API. However, because the legacy streaming API is still in use by some customers, let’s walk through both examples: ingesting JSON data through the Storage Write API and ingesting JSON data through the legacy insertAll streaming API.

JSON via the Storage Write API

To ingest data via the Storage Write API, we’ll stream data as protocol buffers. For a quick refresher on working with protocol buffers, here’s a great tutorial.

We’ll first define our message format for writing into the json_example.streaming_events table using a .proto file in proto2. You can copy the file from here, then run the following command within a Linux environment to update your protocol buffer definition:

We’ll then use this sample Python code to stream both structured and semi-structured data into the streaming_events table. This code streams a batch of row data by appending proto2 serialized bytes to the serialized_rows repeated field like the example below. Of particular note is the labels field, which was defined within our table to be JSON.
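The full sample isn’t reproduced in this excerpt. As a hedged fragment showing just the part described above (the generated module and message names come from the .proto file and are assumptions):

```
# Fragment only: the write-stream setup around it is omitted. The generated module and
# message names come from the .proto file and are assumptions.
from google.cloud.bigquery_storage_v1 import types

import login_events_pb2  # generated by protoc from the .proto file (assumed name)

row = login_events_pb2.LoginEvent()   # assumed message name
row.event_id = "evt-001"
row.event_name = "login"
row.labels = '{"property": "authFailure", "threatRating": 7.5}'  # JSON column sent as a string field

proto_rows = types.ProtoRows()
proto_rows.serialized_rows.append(row.SerializeToString())  # proto2-serialized bytes, as described above

request = types.AppendRowsRequest(
    proto_rows=types.AppendRowsRequest.ProtoData(rows=proto_rows)
)
# 'request' is then sent on an AppendRowsStream bound to json_example.streaming_events.
```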

Once executed, we can see our table now has ingested a few rows from the Storage Write API!

Preview of the json_example.streaming_events table in BigQuery after data ingestion.

JSON via the legacy insertAll streaming API

And lastly, let’s explore streaming data to the same streaming_events table with the legacy insertAll API. With the insertAll API approach, we’ll ingest a set of JSON events stored within a local file in real time to our same streaming_events table. The events will be structured like the example below, with the labels field being highly variable and semi-structured:

Example JSON events to be ingested into BigQuery, with the field labels as a highly variable input which is constantly being updated.

Now run the following Python code which reads data from the local JSON events file and streams it into BigQuery.
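The original script isn’t reproduced here. A hedged sketch using the BigQuery Python client’s legacy streaming method might look like this (the file and field names are assumptions):

```
import json
from google.cloud import bigquery

# File name and field names are assumptions; one JSON event per line is expected.
client = bigquery.Client()
table_id = "your-project.json_example.streaming_events"

with open("login_events.json") as events_file:
    rows = [json.loads(line) for line in events_file]

errors = client.insert_rows_json(table_id, rows)   # legacy insertAll streaming API
if errors:
    print("Errors while inserting rows:", errors)
```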

Now that our JSON events have successfully been ingested into BigQuery (through batch ingest, the Storage Write API, the legacy streaming API, or even all three), we’re ready to query our semi-structured data in BigQuery!

Preview of the json_example.streaming_events table in BigQuery highlighting the semi-structured nature of the labels JSON field.

Querying your new JSON data

With the introduction of the native JSON type, we’re also introducing new JSON functions to easily and efficiently query data in its native format. 

For instance, we can get a count of the events we ingested which encountered a login authentication failure by filtering on the labels.property field of the JSON value using dot notation:
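The query isn’t reproduced in this excerpt. A hedged sketch, assuming the labels.property value used in the sample events is 'authFailure':

```
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT COUNT(*) AS failed_logins
FROM json_example.streaming_events
WHERE JSON_VALUE(labels.property) = 'authFailure'   -- dot notation into the JSON column
"""
for row in client.query(query).result():
    print(row.failed_logins)
```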

We can also perform aggregations by averaging event threats caused by login failures within our data set by natively casting a threatRating field within labels as a FLOAT:
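Again as a hedged sketch, using the threatRating field mentioned above and the FLOAT64 JSON conversion function:

```
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT AVG(FLOAT64(labels.threatRating)) AS avg_threat_rating
FROM json_example.streaming_events
WHERE JSON_VALUE(labels.property) = 'authFailure'
"""
for row in client.query(query).result():
    print(row.avg_threat_rating)
```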

Native JSON with existing tables

What if you have existing tables? Can you take advantage of the native JSON type without rewriting all your data? Yes!

BigQuery makes operations like modifying existing table schemas a snap through DDL statements like the one below, which adds a new JSON column titled newJSONField to an existing table:
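The statement isn’t shown in this excerpt; a hedged sketch, with the table name as a placeholder:

```
from google.cloud import bigquery

client = bigquery.Client()
client.query("""
ALTER TABLE json_example.existing_table
ADD COLUMN newJSONField JSON
""").result()
```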

From here, you can decide how you want to leverage the newJSONField column, either by converting existing data (perhaps existing JSON data stored as a String) into the newJSONField column or by ingesting net-new data into it.

To convert existing data into JSON, you can use an UPDATE DML statement together with either the PARSE_JSON function, which converts a String into a JSON type, or the TO_JSON function, which converts any data type into a JSON type. Here are examples of each below:

Converting a String into JSON:
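(Hedged sketch; oldStringField and the table name are placeholders.)

```
from google.cloud import bigquery

client = bigquery.Client()
client.query("""
UPDATE json_example.existing_table
SET newJSONField = PARSE_JSON(oldStringField)
WHERE TRUE
""").result()
```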

Converting existing data stored as a nested and repeated STRUCT, like the example here, into JSON:
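(Hedged sketch; addressStruct is a placeholder STRUCT column.)

```
from google.cloud import bigquery

client = bigquery.Client()
client.query("""
UPDATE json_example.existing_table
SET newJSONField = TO_JSON(addressStruct)
WHERE TRUE
""").result()
```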

How can you get started with native JSON in BigQuery?

Data comes in all forms, shapes, sizes, and never stops evolving. If you’d like to support your data and its future evolution with the BigQuery native JSON preview feature, please complete the sign up form here.


Google Cloud Data Analytics 2021: The year in review

As I look back on 2021 I’m proud to see a fast growing number of companies use our data platform to unlock new insights, build new business models and help improve their employees’ and their customers’ experience.

Data itself is just inactive information, useless without activation. The true power of data comes when it’s being used to build intelligent applications, help people make better decisions, increase automation and ultimately change how value is being created.

This year, tens of thousands of customers unlocked their data advantage with Google Cloud’s unified data platform: breaking down data silos, building internet-scale applications, building smart processes with AI, and creating data meshes that span beyond their enterprise and turn data into an asset.

These customers all used Google Cloud’s unified data platform to remove barriers across data silos, accelerate existing analytic investments, and achieve business outcomes faster. 

I’m truly honored to share some of the most important moments from our partners, customers, and practitioners this year. Thank you for your trust and commitment and for choosing Google Cloud as your innovation partner — to break down silos and turn data into value.

One retailer using these solutions is Carrefour, one of the largest grocery retailers in France. Carrefour needed to ensure it had the right products, in front of the right shoppers, at the right store location. With Google Cloud, Carrefour developed an assortment recommendation tool that helped the chain support a more personalized selection at the store level, giving store directors the autonomy to influence inventory needs. The tool also gives Carrefour headquarters visibility into the merchandising decisions by each of their franchise stores.

Enabling the real-time enterprise 

In 2021, more customers looked to shift to real-time data processing and insights with Google Cloud so that they could make decisions at the speed of their business and deliver excellent customer experiences. For example, Twitter’s data platform ingests trillions of events, processes hundreds of petabytes of data, and runs tens of thousands of jobs on over a dozen clusters every day. With this expanded partnership, Twitter is adopting Google’s Data Cloud including BigQuery, Dataflow, Cloud Bigtable and machine learning (ML) tools. These tools not only power the company’s rapidly growing data ecosystem to enable faster data-informed decisions, but also to enable deeper ML-driven product innovation.

Another great example is the story of Verizon Media, who switched to Google Cloud from another provider to ingest 200TB daily, store 100PB in BigQuery, stream 300MB per second, and achieve a 90+% productivity improvement by combining Looker, BigQuery, and the rest of our data platform.

Finally, another great journey is that of ATB Financial, who migrated its extensive SAP backbone that supports its 800,000+ customers to Google Cloud, and built a system on BigQuery for real-time data acquisition, enrichment, and AI-assisted and self-service analytics.

Going BIG with healthcare and life sciences

HCA Healthcare is using BigQuery to analyze data from its 32 million annual encounters and identify opportunities to improve clinical care. I was particularly pleased about our partnership and how Sam Hazen, the company’s CEO, explained that “next-generation care demands data science-informed decision support,” and how he described our partnership and shared passion for “innovation and continual improvement” as foundational to our efforts.

Moderna relies on data to respond to the disproportionate impact the pandemic has had on minority groups, utilizing insights from Looker to increase diversity in their COVID-19 vaccine trials and improve representation. “Looker has a depth to it — it’s not just a visualization that you look at. People can go deeper as they learn more,” said Dave Johnson, VP of Informatics, Data Science, and AI at Moderna.

We were also incredibly honored to work with the National Cancer Institute to support breast cancer research with fast and secure data sharing. Combining our AI Platform and BigQuery to work with large and heterogeneous data, the team was able to successfully demonstrate that “researchers can inexpensively analyze large amounts of data, and do so faster than ever before.”

Increasing enterprise agility

Niantic Labs built a globally scalable game for millions of users on Google Cloud. In this video, they share their experience scaling with Google Kubernetes Engine (GKE) and Spanner, and describe how their data science team works with BigQuery, Dataflow, and Pub/Sub for their data analytics needs.

Finally, Telefónica partnered with Google Cloud to foster Spain’s digital transformation and advance 5G mobile edge computing. As part of this partnership, Google Cloud will launch a new cloud region in Spain to assist the country in economic recovery amidst the COVID-19 crisis. Telefónica will also use Google Cloud services to boost its own digital capabilities—in areas such as machine learning, artificial intelligence (AI), data analytics, and application development—to continue to provide new services and tools to its global customer base. 

2021 Data Cloud momentum, thanks to all our customers

Some of my favorite customer stories of this year describe how our customers inspired and pushed us to take our products and offerings to new heights. 

For me, this was most apparent at our inaugural Data Cloud Summit in late May, where we were able to unveil to our customers the latest product and feature announcements of everything having to do with data. We launched Dataplex, allowing customers and users to centrally manage, monitor, and govern data across data lakes, warehouses, and marts – all from one single viewpoint. We also announced Datastream, our serverless change data capture (CDC) and replication service, as well as Analytics Hub, a fully-managed service built on BigQuery that allows our customers to create safe and governable data sharing ecosystems. We’ve made migrations to Cloud SQL easier and faster with the Database Migration Service. More than 85% of all migrations are underway in under an hour, with the majority of customers migrating their databases from other clouds. 

At Google Cloud Next ‘21, we also had devoted space to share our product improvements and iterations back to our customers. Amidst many announcements, I was most proud to speak about Spark on Google Cloud, the world’s first autoscaling and serverless Spark service for the Google Cloud data platform, BigQuery Omni, our cross-cloud analytics solution, Google Earth Engine on Google Cloud, a launch that brings together Google Earth Engine’s 50+ petabyte catalog of satellite imagery and geospatial datasets for planetary-scale analysis, and Spanner PostgreSQL interface, which allows enterprises to take advantage of Spanner’s unmatched global scale, 99.999% availability, and strong consistency using skills and tools from the popular PostgreSQL ecosystem. 

We also held the sixth edition of JOIN, Looker’s annual user conference, which included 3 days of live educational content with over 15 customers and partners participating, spanning 5 keynotes, 33 breakouts, 12 how-to’s, 27 Data Circles of Success, and our popular Hackathon. Content focused on activating users with data experiences, composable analytics, and our unified semantic model. All sessions from JOIN are now available on-demand.

Tapping into the data ecosystem

One of our customers’ most recurring themes was an interest in expanding their data aperture and tapping into the data ecosystem around them. We addressed this feedback in three main ways. 

First, we activated an ecosystem for collective intelligence on BigQuery. Today, more than 3,000 organizations share more than 250 petabytes of data, and Google Cloud has shared more than 150 public datasets that can be used across a myriad of use cases.

Second, we leaned into this spirit of knowledge sharing by packaging over 30 architecture design patterns which include code, data models, and industry best practices. We’ve also increased industry domain expertise in areas such as retail, financial services, healthcare, and gaming, and are continuing to develop industry white papers such as this one – How to develop Global Multiplayer Games using Cloud Spanner – to reduce the time to value for customers.

Third, we continue to support an open ecosystem of data partners, including Neo4j, Databricks, MongoDB, Informatica, Tableau, and C3.ai, giving customers the flexibility of choice to build their data clouds without being locked into a single approach.

Going further than we imagined, together

We are incredibly grateful to our customers for choosing Google Cloud to write their data story, and we can’t wait to see what you do next. Learn more about how organizations are building their data clouds with Google Cloud solutions.



Leveraging BigQuery Audit Log pipelines for Usage Analytics

In the BigQuery Spotlight series, we talked about Monitoring. This post focuses on using Audit Logs for deep-dive monitoring. BigQuery Audit Logs are a collection of logs provided by Google Cloud that give insight into operations related to your use of BigQuery. A wealth of information is available to you in the Audit Logs. Cloud Logging captures events which can show “who” performed “what” activity and “how” the system behaved.

BigQuery Audit Logs Overview 

Google Cloud Audit logs provide the following capabilities:

Audit logs are useful in monitoring specific BigQuery use cases which may not be possible with other options like INFORMATION_SCHEMA or the BigQuery Admin Panel. Read more about available monitoring options in this blog. Following are some important use cases where we can leverage audit logs for monitoring BigQuery:

Identify and address data breaches and other security incidents

Understand user data access patterns (including column access information)

Perform user behavior analytics

Identify bad actors across regions using IP address analysis

BigQuery provides three different types of audit log messages: 

Audit Data – Older version of logs which reports API invocations 

BigQueryAuditMetadata – Reports resource interactions such as table read and table expiration

AuditLogs – The logs that BigQuery Reservations and BigQuery Connections use when reporting requests.

BigQuery Audit logs are categorized into the following streams:

Admin Activity Logs: events like PatchDataset, UpdateTable, DeleteTable, PatchTable, etc.

Data Access Logs: events like Query, TableDataChange, TableDataRead, etc.

System Events: events like internal table expiration

Policy Denied Logs: events related to BigQuery permissions 

Types of BigQuery Log Events

For new workloads, use only the new versions of the log events, which start with the prefix google.cloud.bigquery.v2. Older versions of the log events, such as datasetservice and tabledataservice, can be ignored.
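For example, you can spot-check that only new-format events are being produced by filtering on that method-name prefix. A minimal sketch, assuming a hypothetical project ID:

# List a few recent BigQuery audit entries whose method name uses the new prefix.
gcloud logging read \
  'protoPayload.serviceName="bigquery.googleapis.com" AND protoPayload.methodName:"google.cloud.bigquery.v2."' \
  --project=my-project --freshness=1d --limit=5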

Personas and Use Cases

Personas and use cases are very important for understanding the analysis requirements and access levels for monitoring with BigQuery Audit Logs. Here is a list of some common personas and their use cases:

Data Owners / Users – Develop and operate applications, and manage a system that generates source data. This persona is mostly concerned with their specific workloads. For example, Developers.

Platform Operators – Often run the platform that serves internal customers. For example, Data Platform Leads.

Administrators – Primarily concerned with secure operations and the health of the GCP fleet of resources. For example, SREs.

BigQuery Audit Logs Export Options 

To support the above personas and use cases, logs can be exported to various destinations beyond just the Logs Explorer. The following destinations are supported:

Cloud Storage – JSON-formatted files stored in GCS buckets 

BigQuery –  Log Tables created in BigQuery datasets

Pub/Sub –  JSON-formatted messages delivered to Pub/Sub topics

Log Buckets – JSON-formatted logs available for further analysis via Cloud Monitoring

The following factors can be considered when choosing the right destination for exporting BigQuery logs for monitoring:

Export Requirement – Use cases for exporting to different destinations

Language – Language support to analyze the BigQuery audit logs

Pipeline Setup – Options available for setting up the export pipeline

Maintenance Overhead – Effort required to maintain and manage the export pipeline

Retention Period/Wipeout Support – Supported data retention and expiration policies

Pipeline Setup 

Using Aggregated Sink

Most companies have many projects, which leads to various islands of logs. We recommend that Platform Operators and Administrators use an aggregated sink to export organization-level audit logs into a centralized administration project. Data Owners can also export logs using aggregated sinks at the project level if needed.

Logging Service Account

Cloud Logging uses a default service account to create and export log data in real time. If VPC Service Controls constraints apply, you need to add this service account to an access level and then assign it to the destination service perimeter. For more details, see VPC Service Controls: Cloud Logging.

Export Option 

Logs can be exported using various filters written in the Logging Query Language and exported to various destinations. The diagram below shows a sample architecture to set up log exports to specific destinations based on various use cases.

Let’s take a closer look at pipeline setup steps for each of these destinations.

BigQuery Export 

For BigQuery export,  log data is streamed into BigQuery one record at a time instead of using load jobs. This approach enables querying data in BigQuery in near-real time without the delay of running a load job or maintaining a data ingestion pipeline. 

When creating a sink to export your logs to BigQuery, you can use either date-sharded or partitioned tables. Both table types partition data based on the log entries’ timestamp field. The default selection is date-sharded tables, but the recommended approach is partitioned tables for easier access and management and better performance.

The aggregated sink option creates the following audit log tables in BigQuery based on corresponding event types:

cloudaudit_googleapis_com_system_event

cloudaudit_googleapis_com_policy

cloudaudit_googleapis_com_data_access

cloudaudit_googleapis_com_activity

Setup Steps

Step 1: Create an aggregated sink at the organization level to route the logs to the BigQuery destination (a sample gcloud command is shown after the filters below)

Other filters, to also capture BigQuery Data Transfer Service and Reservations logs:
protoPayload.serviceName=bigquerydatatransfer.googleapis.com
protoPayload.serviceName=bigqueryreservation.googleapis.com
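A minimal sketch of creating such a sink with gcloud, assuming hypothetical organization ID, project, and dataset names; adjust the filter to the services you need:

gcloud logging sinks create bq-audit-sink \
  bigquery.googleapis.com/projects/my-admin-project/datasets/bq_logs \
  --organization=123456789012 --include-children --use-partitioned-tables \
  --log-filter='protoPayload.serviceName="bigquery.googleapis.com" OR
    protoPayload.serviceName="bigquerydatatransfer.googleapis.com" OR
    protoPayload.serviceName="bigqueryreservation.googleapis.com"'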

Step 2: Provide service account access
Grant the BigQuery Data Editor role to the default logging service account on the BigQuery destination dataset. 
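One way to do this from the command line, sketched with the hypothetical sink and project names used above (a dataset-level grant via the console or bq is more restrictive than the project-level grant shown here):

# Look up the sink's writer identity, then grant it BigQuery Data Editor.
WRITER=$(gcloud logging sinks describe bq-audit-sink \
  --organization=123456789012 --format='value(writerIdentity)')
gcloud projects add-iam-policy-binding my-admin-project \
  --member="$WRITER" --role='roles/bigquery.dataEditor'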

Step 3: Set retention policy
Set a partition expiration on the destination dataset (or its tables) to limit the size of the logging export storage.
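For example, a 30-day default partition expiration can be applied to the destination dataset with the bq CLI (hypothetical dataset name; the value is in seconds):

# New partitions in tables under this dataset will expire after 30 days.
bq update --default_partition_expiration 2592000 my-admin-project:bq_logs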

Step 4: Create derived tables or views for further analysis
Here’s a sample SQL query that provides details on who accessed which datasets, tables, and columns, and what permissions they used to perform that operation.
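A minimal sketch of such a query, assuming the standard Cloud Logging BigQuery export schema and hypothetical project and dataset names (verify field names against your own export):

bq query --nouse_legacy_sql '
SELECT
  timestamp,
  protopayload_auditlog.authenticationInfo.principalEmail AS principal,
  protopayload_auditlog.resourceName AS resource,
  auth.permission,
  auth.granted
FROM `my-admin-project.bq_logs.cloudaudit_googleapis_com_data_access`,
  UNNEST(protopayload_auditlog.authorizationInfo) AS auth
WHERE protopayload_auditlog.serviceName = "bigquery.googleapis.com"
ORDER BY timestamp DESC
LIMIT 100'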

This sample query provides details on who performed various dataset activities, such as dataset reads, dataset creation, and dataset deletion, and how.
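A sketch along the same lines, again with hypothetical names; dataset creation and deletion events land in the activity table, while dataset reads appear in the data access table:

bq query --nouse_legacy_sql '
SELECT
  timestamp,
  protopayload_auditlog.authenticationInfo.principalEmail AS principal,
  protopayload_auditlog.methodName AS method,
  JSON_EXTRACT(protopayload_auditlog.metadataJson, "$.datasetCreation") IS NOT NULL AS dataset_created,
  JSON_EXTRACT(protopayload_auditlog.metadataJson, "$.datasetDeletion") IS NOT NULL AS dataset_deleted
FROM `my-admin-project.bq_logs.cloudaudit_googleapis_com_activity`
WHERE protopayload_auditlog.serviceName = "bigquery.googleapis.com"
  AND (JSON_EXTRACT(protopayload_auditlog.metadataJson, "$.datasetCreation") IS NOT NULL
       OR JSON_EXTRACT(protopayload_auditlog.metadataJson, "$.datasetDeletion") IS NOT NULL)
ORDER BY timestamp DESC'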

This sample query provides details on expired tables.
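Expired tables show up as tableDeletion events in the system event table. A sketch with hypothetical names; verify the reason values against the BigQueryAuditMetadata reference for your logs:

bq query --nouse_legacy_sql '
SELECT
  timestamp,
  protopayload_auditlog.resourceName AS table_resource,
  JSON_EXTRACT_SCALAR(protopayload_auditlog.metadataJson, "$.tableDeletion.reason") AS reason
FROM `my-admin-project.bq_logs.cloudaudit_googleapis_com_system_event`
WHERE JSON_EXTRACT_SCALAR(protopayload_auditlog.metadataJson, "$.tableDeletion.reason") = "TABLE_EXPIRED"
ORDER BY timestamp DESC'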

Google Cloud Storage Export 

Google Cloud Storage provides inexpensive, long-term storage of logs. Moving logs to Nearline or Coldline and then deleting them helps you manage the ongoing operational cost of maintaining the logs. If you want to analyze these logs further, you can load the JSON-formatted log files into BigQuery or create external tables over the log data in GCS.

Setup Steps

Step 1: Create an aggregated sink at the organization level to route the logs to the GCS sink
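A minimal sketch with gcloud, assuming hypothetical organization ID and bucket name:

gcloud logging sinks create bq-audit-gcs-sink \
  storage.googleapis.com/my-bq-audit-logs-bucket \
  --organization=123456789012 --include-children \
  --log-filter='protoPayload.serviceName="bigquery.googleapis.com"'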

Step 2: Provide service account access
Grant the Storage Object Creator role to the default logging service account so it can write to the bucket.
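Sketched below with the hypothetical sink and bucket names from the previous step; the writer identity returned by gcloud already carries the serviceAccount: prefix:

WRITER=$(gcloud logging sinks describe bq-audit-gcs-sink \
  --organization=123456789012 --format='value(writerIdentity)')
# objectCreator is the legacy shorthand for the Storage Object Creator role.
gsutil iam ch "${WRITER}:objectCreator" gs://my-bq-audit-logs-bucket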

Step 3: Set retention policy
Configure a retention policy using object lifecycle management on GCS.
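As an illustration, a lifecycle configuration that moves log files to Coldline after 90 days and deletes them after a year (hypothetical bucket name and ages):

cat > lifecycle.json <<'EOF'
{
  "rule": [
    {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
     "condition": {"age": 90}},
    {"action": {"type": "Delete"}, "condition": {"age": 365}}
  ]
}
EOF
gsutil lifecycle set lifecycle.json gs://my-bq-audit-logs-bucket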

Step 4: External Tables
In some cases you might need to query the audit log data stored in GCS. You can use external tables in BigQuery to explore the data further. However, the query performance for external tables is not the same as that of native BigQuery tables.

Step 4.1 Create the external table
bq mk \
  --external_table_definition=[SOURCE_FORMAT]=[CLOUD_STORAGE_URI] [DATASET].[TABLE]

Step 4.2 Create a view for querying 
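For example, a simple view over the external table created above. This is only a sketch with hypothetical table and field names; the exact columns depend on how the external table definition was created, and they differ from the renamed protopayload_auditlog field used by the direct BigQuery export:

bq query --nouse_legacy_sql '
CREATE OR REPLACE VIEW `my-admin-project.bq_logs.v_gcs_audit_logs` AS
SELECT
  timestamp,
  protoPayload.authenticationInfo.principalEmail AS principal,
  protoPayload.methodName AS method,
  protoPayload.resourceName AS resource
FROM `my-admin-project.bq_logs.audit_logs_external`'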

Pub/Sub Export

Logs can be exported from Cloud Logging to third-party tools like Splunk in real time using a Pub/Sub sink. Splunk enables you to search, analyze, and visualize logs, events, and metrics gathered from your on-premises and cloud deployments for IT and security monitoring. You can also create a Dataflow pipeline from Pub/Sub to BigQuery to perform transformations and aggregations, and eventually load the results into BigQuery for data analytics.

Setup Steps

Step 1: Create an aggregated sink at the organization level to route the logs to the Pub/Sub sink
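A minimal sketch with gcloud, assuming hypothetical organization ID, project, and topic names:

gcloud logging sinks create bq-audit-pubsub-sink \
  pubsub.googleapis.com/projects/my-admin-project/topics/bq-audit-logs \
  --organization=123456789012 --include-children \
  --log-filter='protoPayload.serviceName="bigquery.googleapis.com"'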

Step 2: Provide service account access
Grant the Pub/Sub Publisher role on the topic to the default logging service account 
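Sketched with the hypothetical sink, topic, and project names used above; the writer identity returned by gcloud already includes the serviceAccount: prefix:

WRITER=$(gcloud logging sinks describe bq-audit-pubsub-sink \
  --organization=123456789012 --format='value(writerIdentity)')
gcloud pubsub topics add-iam-policy-binding bq-audit-logs \
  --project=my-admin-project \
  --member="$WRITER" --role='roles/pubsub.publisher'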

Step 3: Set up a subscription to pull log messages from the topic created 
Use the command line to pull messages via a Pub/Sub subscription pull, as sketched below. Alternatively, you could implement a simple subscriber. Check out these code samples for more details.
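A quick check from the command line, assuming hypothetical subscription, topic, and project names:

# Create a subscription on the log topic and pull a few messages to verify flow.
gcloud pubsub subscriptions create bq-audit-logs-sub \
  --topic=bq-audit-logs --project=my-admin-project
gcloud pubsub subscriptions pull bq-audit-logs-sub \
  --project=my-admin-project --auto-ack --limit=10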

Step 4: Set up the third party integration
You can ingest the log messages into third-party tools like Splunk either using Dataflow or by pulling logs directly using the Splunk Add-on for Google Cloud Platform.

Cloud Monitoring

Cloud Monitoring is part of the Cloud Logging and Monitoring framework. The log information from log buckets is automatically synced to the Cloud Monitoring workspace and is available for high-level analytics. It also provides built-in high-level log metrics and alerting features.

However, it cannot be customized for advanced analytics requirements. It is also difficult to configure granular permissions and access control levels in a monitoring workspace.

Pipeline Automation

The pipeline setup steps can be easily automated, version controlled, and managed using Terraform. Sample Terraform scripts for setting up an aggregated sink to BigQuery are available in the GitHub repository linked below.

What’s next?

You can leverage the audit logs data in the BigQuery sink to create more advanced analytics dashboards using Data Studio or Looker. Here are some sample use cases for monitoring BigQuery using audit logs:

Access Reports for Project Owners
Project owners would like to see who is accessing which datasets in their projects, and when. For example, who is accessing a particular dataset and from which location. This helps identify anomalies and proactively report on any such access issues.

Usage Reports for Data Owners
Data owners are usually interested in getting more details on how others are using their datasets and tables in a project. For example, they’d like to see how often a column in a specific table is being accessed and by which users, and how many tables are expiring.

Here’s a sample dashboard example using Data Studio:

Check out this GitHub repository for more SQL scripts for querying the log information in BigQuery and Terraform scripts for automating the overall log export pipeline.

Related Article

Monitoring BigQuery reservations and slot utilization with INFORMATION_SCHEMA

BigQuery Reservations help manage your BigQuery workloads. Learn how to use BigQuery’s INFORMATION_SCHEMA system tables to create the Sys…



The top three insights we learned from data analytics customers in 2021

The top three insights we learned from data analytics customers in 2021

As you know, we’re obsessed with learning from customers. That’s why we made it a point to sit down with a new Google Cloud customer every week this year to share their journey to our unified data cloud platform and what they’ve learned from their experience along the way.

What began as a simple exercise of routinely keeping up with our customers quickly evolved into a complete video series that I have the pleasure of hosting every Tuesday. This year you learned from companies of all sizes, across all industries, and around the globe.

In this series, we learn that succeeding with data requires tough business decisions, a lot of creativity and a specific vision. Perhaps for the first time, this series, aptly titled “Data Journeys,” spotlights our community voices from the data space in a genuine, grassroots, and unfiltered way.

We’ve learned so much from each customer guest and their company’s unique challenges, trials, and triumphs – all things that make up a great journey or adventure. So, as we put the wrappings on another eventful year, we’re thrilled to tie a bow around our 2021 customer “Data Journeys” and share the top 3 learnings from their journeys with you.

Lesson 1: Customers migrating to the cloud aren’t simply looking for a provider that can successfully execute a lift-and-shift motion. Instead, they look to take advantage of their cloud migration as an opportunity to rethink, relearn, and rebuild. 

As they were reaching the end of their contract with a different data warehousing provider with a fixed compute model, Keybank, a regional financial institution with over 1,000 branches managing over $145 billion in assets, decided to migrate to the cloud. I was fortunate to speak with Mike Onders, EVP Chief Data Officer, Divisional CIO, and Head of Enterprise Architecture for Keybank, who led the migration charge.

According to Onders, he decided Google Cloud Platform was the right fit for the dynamic, security-sensitive nature of the banking company’s business because Google’s data suite provided an elastic, fast, and consistent environment for data. 

When selecting Google Cloud, however, Keybank was clear they did not want to simply lift-and-shift their Teradata warehouse, analytics users, and Hadoop data lake to Google and call it a day. Instead, they saw the migration as an opportunity to reinvent old processes and traditional ways of doing banking. “We really spun up Python and Spark clusters to do some new fraud modeling, especially around smart holds and check deposits and how to use fraud algorithms to determine whether we should hold this money or release it to you when you deposit that check,” Onders said.


In addition to smart check hold modeling, Keybank has re-engineered how they conduct attrition modeling, next-best-offer/product recommendations, and credit risk predictions using Google Cloud’s intelligent data cloud platform. Now, they are eyeing even more innovation, particularly around how to use data to drive customer satisfaction and differentiate their services from their competitors.

For more insights, watch Keybank’s episode on Data Journeys. 

Lesson 2: Customers are passionate about applying Google’s technology, from predictive analytics to AI and automation, to ‘data for good’ initiatives.

TELUS is one of Canada’s largest telecommunications companies, with over 9.5 million subscribers. The powerhouse manages geolocated data generated by talk and text signals, networks, and cellular towers across the country. For some perspective on the sheer amount of data they handle, TELUS analyzed over 1.2 petabytes of data last year and expects that number to grow each year.

In March of 2020, when it became clear that COVID-19 was more than an outbreak and was growing into a global pandemic, TELUS faced a choice: carry on business as usual or serve their community at a larger scale. With Google Cloud by their side to support them, the choice was a no-brainer.

TELUS realized they could leverage their data residing within Google Cloud’s data cloud in a privacy-preserving manner to launch a data for good initiative and support all Canadians during a time of need. Within a matter of just 3 weeks using tools like BigQuery and Data Studio, TELUS launched a new platform that empowered the Canadian government to make better strategic decisions about how to combat COVID within its borders – all while keeping individual subscribers’ data private.

For their efforts, TELUS earned a HPE-IAPP Privacy Innovation Award and helped the government contain and minimize the spread of the virus.

After speaking with Michael Ames, Senior Director of Healthcare and Life Science at SADA, I realized their company was similarly focused on using data analytics technology for community good. SADA is a Google Cloud Premier Partner that offers consultation and implementation services. Ames’ day-to-day involves helping hospitals and healthcare providers migrate their data to the cloud. 

Hospitals and healthcare systems around the United States undoubtedly handle massive amounts of data from symptom records to patient history to insurance and payment records, and more. SADA realized that increasing the rate that the right treatment would be matched with the right person during their first visit would help hospitals run their business better, increase patient satisfaction and health, save both parties time and money, and therefore maximize profits that hospitals could reinvest in care for their community. 

But after conducting a study with the journal Nature that took the top ten biggest drugs by revenue and analyzed how they affected patients to determine whether they achieved their intended purpose, Ames knew that matching the right drug with the right patient was a major issue.

Once SADA used data to create predictive models within Google’s intelligent data cloud stack, they were able to help doctors prescribe medications that better matched individual patient needs and health profiles, providing far more personalized medicine. 

Looking forward, SADA is excited to use a data-driven approach to improve other aspects of health care. “Take that idea of prescribing drugs and expand that to physical therapy treatments, behavioral therapy treatments, decisions we make about where and how people should live, whether we should build a park in a certain neighborhood, how the air quality is affecting health, and we start to get a sense of the need to bring data together in a way that will be… much more powerful in treating the health of people” Ames said.


For more insights, watch TELUS’ episode and SADA’s episode on Data Journeys. 

Lesson 3: Customers choose Google Cloud to address today’s needs, but also because they are gearing up with technology that can solve their unknown problems of tomorrow. 

Delivery Hero is the largest food delivery network outside of China. The food delivery company operates in over 50 countries and processed over 663 million orders in just the first quarter of 2021.

I was shocked to learn that millions of orders equal 174 data sets, 5,000 tables, and upwards of 7 million queries a month. Hence, it was my pleasure to speak with Matteo Fava, Delivery Hero’s Senior Director, Global Data Products and Analytics, and understand how they manage their immense data volume.

A significant part of Delivery Hero’s data strategy involves making sure the right teams have access to the right data on the backend, and that their customers have a safe, efficient, and enjoyable experience receiving their food on the frontend. In fact, that was the main reason why they partnered with Google Cloud in the first place: to ensure business would run as usual with no hiccups.

As Fava’s team gained a deeper understanding of user behavior and delivery routes through Google Analytics and BigQuery, Delivery Hero discovered a puzzle they hadn’t previously anticipated: how to deliver food in a consistent way despite significant cross-cultural differences between countries, or even between cities and neighborhoods within the same country.

One such puzzle was how to optimize “horizontal” deliveries, such as delivering to standard homes, versus “vertical” deliveries, such as delivering to apartment buildings, while often keeping a 15-minute delivery promise to their customers. Delivery Hero ultimately used real-time insights into deliveries and predictive modeling to ensure their estimated time of food delivery took into account longer elevator wait times in populated cities like Hong Kong and the time it takes to enter and navigate condominium complexes in places like Dubai.

Delivery Hero knew it was one thing to have an efficient new platform for managing data, but it’s another to trust that this platform can solve unexpected challenges that may not be apparent today. 

For more insights, watch Delivery Hero’s episode on Data Journeys. 

Wrapping it up

We can’t wait to explore new customer data journeys next year and continue to share insights with the community.

We hope you’ll follow the Data Journeys series by subscribing to the playlist, and if you have a great suggestion for a guest (or want to be a guest yourself), please let us know!


How to migrate an on-premises data warehouse to BigQuery on Google Cloud

How to migrate an on-premises data warehouse to BigQuery on Google Cloud

Data teams across companies face continuous challenges in consolidating data, processing it, and making it useful. They deal with challenges such as a mixture of multiple ETL jobs, long ETL windows, capacity-bound on-premises data warehouses, and ever-increasing demands from users. They also need to make sure that the downstream requirements of ML, reporting, and analytics are met by the data processing. And they need to plan for the future: how will more data be handled, and how will new downstream teams be supported?

Check out how Independence Health Group is addressing their enterprise data warehouse (EDW) migration in the video above.

Why BigQuery?

On-premises data warehouses become difficult to scale, so most companies’ biggest goal is to create a forward-looking system to store data that is secure, scalable, and cost-effective. GCP’s BigQuery is serverless, highly scalable, and cost-effective, and it is a great technical fit for the EDW use case. It’s a multicloud data warehouse designed for business agility. But migrating a large, highly integrated data warehouse from on-premises to BigQuery is not a flip-a-switch kind of migration. You need to make sure your downstream systems don’t break due to inconsistent results in migrated datasets, both during and after the migration. So you have to plan your migration.

Data warehouse migration strategy

 The following steps are typical for a successful migration: 

Assessment and planning: Find the scope in advance to plan the migration of the legacy data warehouse 

Identify data groupings, application access patterns and capacities

Use tools and utilities to identify unknown complexities and dependencies 

Identify required application conversions and testing

Determine initial processing and storage capacity for budget forecasting and capacity planning 

Consider growth and changes anticipated during the migration period 

Develop a future state strategy and vision to guide design

Migration: Establish GCP foundation and begin migration

As the cloud foundation is being set up, consider running focused POCs to validate data migration processes and timelines

Look for automated utilities to help with any required code migration

Plan to maintain data synchronization between legacy and target EDW during the duration of the migration. This becomes a critical business process to keep the project on schedule.

Plan to integrate some enterprise tooling to help existing teams span both environments

Consider current data access patterns among EDW user communities and how they will map to similar controls available in BigQuery.

Key scope includes code integration and data model conversions

Expect to refine capacity forecasts and refine allocation design. In BigQuery there are many options to balance cost and performance to maximize business value. For example, you can use either on-demand or flat-rate slot pricing, or a combination of both.

Validation and testing

 Look for tools to allow automated, intelligent data validation 

Scope must include both schema and data validation

Ideally solutions will allow continuous validation from source to target system during migration

Testing complexity and duration will be driven by number and complexity of applications consuming data from the EDW and rate of change of those applications 

A key to successful migration is finding Google Cloud partners with experience migrating EDW workloads. For example, our Google Cloud partner Datametica offers services and specialized Migration Accelerators for each of these migration stages to make it more efficient to plan and execute migrations.


Data warehouse migration: Things to consider

Financial benefits of open source: Target moving to ‘open source’ where none of the services have license fees. For example, BigQuery uses standard SQL, Cloud Composer is managed Apache Airflow, and Dataflow is based on Apache Beam. Consuming these as managed services provides the financial benefits of open source while avoiding the burden of maintaining open source platforms internally.

Serverless: Move to “serverless” big data services. The majority of the services used in a recommended GCP data architecture scale on demand allowing more cost effective alignment with needs. Using fully managed services lets you focus engineering time on business roadmap priorities, not building and maintaining infrastructure. 

Efficiencies of a Unified platform: Any data warehouse migration involves integration with services that surround the EDW for data ingest and pre-processing and advanced analytics on the data stored in the EDW to maximize business value. A cloud provider like GCP offers a full breadth of integrated and managed ‘big data’ services with built-in machine learning. This can yield significantly reduced long-term TCO by increasing both operational and cost efficiency when compared to EDW-specific point solutions. 

Establishing a solid cloud foundation: From the beginning, take the time to design a secure foundation that will serve the business and technical needs for workloads to follow. Key features include a scalable resource hierarchy, multi-layer security, a multi-tiered network and data center strategy, and automation using Infrastructure as Code. Also allow time to integrate cloud-based services into existing enterprise systems such as CI/CD pipelines, monitoring, alerting, logging, process scheduling, and service request management.

Unlimited expansion capacity: Moving to the cloud sounds like a major step, but really look at this as adding more data centers accessible to your teams. Of course, these data centers offer many new services that are very difficult to develop in-house and provide nearly unlimited expansion capacity with minimal up-front financial commitment.

Patience and interim platforms: Migrating an EDW is typically a long-running project. Be ready to design and operate interim platforms for data synchronization, validation, and application testing. Consider the impact on upstream and downstream systems. It might make sense to migrate and modernize these systems concurrently with the EDW migration, since they are probably data sources and sinks and may be facing similar growth challenges. Also be ready to accommodate new business requirements that develop during the migration. Take advantage of the long duration to have your existing operational teams learn new services from the partner leading the deployment, so your teams are ready to take over post-migration.

Experienced partner: An EDW migration can be a major undertaking with challenges and risks during migration, but offers tremendous opportunities to reduce costs, simplify operations and offer dramatically improved capacities to internal and external EDW users. Selecting the right partner reduces the technical and financial risks, and allows you to plan for and possibly start leveraging these long-term benefits early in the migration process.


Example Data Warehouse Migration Architecture

Setup foundational elements. In GCP these include IAM for authorization and access, the cloud resource hierarchy, billing, networking, code pipelines, Infrastructure as Code using Cloud Build with Terraform (GCP Foundation Toolkit), Cloud DNS, and a Dedicated/Partner Interconnect to connect to the current data centers.

Activate monitoring and security scanning services before real user data is loaded, using Cloud Operations for monitoring and logging and Security Command Center for security monitoring.

Extract files from the on-premises legacy EDW, move them to Cloud Storage, and establish ongoing synchronization using BigQuery Transfer services.

From Cloud Storage, process the data in Dataflow and load/export the data to BigQuery.

Validate the export using Datametica’s validation utilities running in a GKE cluster and Cloud SQL for auditing and historical data synchronization as needed. Application teams test against the validated data sets throughout the migration process. 

Orchestrate the entire pipeline using Cloud Composer, integrated with on-prem scheduling services as needed to leverage established processes and keep legacy and new systems in sync. 

Maintain close coordination with teams/services ingesting new data into the EDW and with downstream analytics teams relying on the EDW data for ongoing advanced analytics.

Establish fine-grained access controls to datasets and start making the data in BigQuery available to existing reporting, visualization, and application consumption tools using BigQuery data connectors for downstream user access and testing.

Incrementally increase BigQuery flat-rate processing capacity to provide the most cost-effective utilization of resources during migration (see the sketch below).
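As one possibility, flat-rate capacity can be added incrementally with the bq CLI. This is only a sketch with hypothetical project names, location, and slot counts; commitment plans and minimums change over time, so check the current BigQuery Reservations documentation:

# Purchase a commitment, create a reservation, and assign a project to it.
bq mk --project_id=my-admin-project --location=US \
  --capacity_commitment --plan=FLEX --slots=100
bq mk --project_id=my-admin-project --location=US \
  --reservation --slots=100 edw_migration
bq mk --project_id=my-admin-project --location=US \
  --reservation_assignment \
  --reservation_id=my-admin-project:US.edw_migration \
  --assignee_type=PROJECT --assignee_id=my-edw-project --job_type=QUERY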

To learn more about migrating from on-premises enterprise data warehouses (EDW) to BigQuery and GCP, start here.

Related Article

5 reasons your legacy data warehouse won’t cut it

Legacy data warehouses are expensive and hard to maintain. Here are 5 reasons they aren’t cutting it, and why migrating to the cloud open…

