Enriching Knowledge Graphs in Neo4j with Google Enterprise Knowledge Graph

Knowledge graphs are inevitably built from incomplete information; complete knowledge is hard to come by. Knowledge graphs stored in Neo4j Graph Database can be enriched with data from Google Enterprise Knowledge Graph to provide more complete information.

Knowledge graphs enriched in this way contain more information, enabling the systems that use them to answer more questions. Those downstream systems can then make more accurate business decisions and provide better customer experiences.

In this blog post, we’ll demonstrate how to use Neo4j in conjunction with Google Enterprise Knowledge Graph to enrich a knowledge graph.

Neo4j

Neo4j Aura is the leading graph database. It is available on Google Cloud Marketplace and is the only graph database embedded in the Google Cloud console. Neo4j AuraDS (Data Science) includes three components:

Neo4j Graph Database (GDB) – Graph database for storing, retrieving and managing data natively as graphs

Neo4j Graph Data Science (GDS) – A collection of 60+ graph algorithms for computing centrality, embeddings, and more

Neo4j Bloom – A graph-specific business intelligence (BI) tool

Google Enterprise Knowledge Graph

Google Enterprise Knowledge Graph consists of two services: 

The Entity Reconciliation API lets customers build their own private Knowledge Graph with data stored in BigQuery. 

Google Knowledge Graph Search API lets customers search for more information about their entities, such as official name, types, description, etc. 

Architecture

The easiest way to enrich a knowledge graph stored in Neo4j Aura with Google Enterprise Knowledge Graph is to access both tools via Python APIs from Vertex AI Workbench. The architecture for such a system is shown below.

Example with Private Equity Data

To illustrate this architecture, we’re going to look at some data from ApeVue, a data provider for private equity markets information. ApeVue aggregates price data on pre-IPO companies from a variety of private equity brokers and distributes market consensus prices and index benchmarks based on that information.

For this example we're going to construct a knowledge graph in Neo4j Aura from some CSV files provided by ApeVue. These files cover the ApeVue50 Benchmark Index and its constituent companies. An IPython notebook that runs the load is available here. You're more than welcome to run it yourself!

When complete, we can inspect the graph and see the index and subindices. One way to do that graphically is with Neo4j Browser. You can open that from within the Neo4j Aura console.

Now, we run the query: 

MATCH (n:Sector) RETURN n

The result is a view of all the sectors in the data set, in this case 14 of them.

We can click on any of those sectors to view the companies which are part of it as well:

Similarly, we can search for specific items within our knowledge graph and then view the neighborhood around them. For instance, we can search for Neo4j by running the Cypher query: 

MATCH (n:Company {name: 'Neo4j'}) RETURN n

When complete, expanding the neighborhood around the Neo4j node gives an interesting view of how it relates to other firms and investors:

In this case, we can see that Neo4j has a number of investors (the blue nodes). One of these is named “GV Management Company, LLC,” previously known as Google Ventures.

Enriching the Data with Enterprise Knowledge Graph

Our knowledge graph is pretty interesting in and of itself. But we can make it better. There’s no way to view the graph underlying Enterprise Knowledge Graph. However, we can query that graph and use it to enrich the knowledge graph we’ve built in Neo4j. In this case, we’re going to pass the company name into the API and get the description of that company back. We’ll then add a new property called “description” to each “Company” node in the Neo4j database.

That’s done via some Python code running in Vertex AI Workbench. A notebook showing an example of this is available here. The main part of that is the call to the Cloud Knowledge Graph API here:
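
The notebook shows that call in full; as a rough sketch of the pattern (not the exact code from the notebook), the enrichment loop might look like the following. The API key, Aura URI, and credentials are placeholders, and the public Knowledge Graph Search API endpoint stands in here for the Enterprise Knowledge Graph Search call:

# Minimal sketch: enrich Company nodes with descriptions from the
# Knowledge Graph Search API. API key, URI and credentials are placeholders.
import requests
from neo4j import GraphDatabase

KG_SEARCH_URL = "https://kgsearch.googleapis.com/v1/entities:search"
API_KEY = "YOUR_API_KEY"                          # placeholder
AURA_URI = "neo4j+s://<id>.databases.neo4j.io"    # placeholder
AURA_AUTH = ("neo4j", "YOUR_PASSWORD")            # placeholder

def fetch_description(company_name):
    """Return the first description the Knowledge Graph Search API offers, if any."""
    params = {"query": company_name, "key": API_KEY, "limit": 1, "types": "Organization"}
    items = requests.get(KG_SEARCH_URL, params=params).json().get("itemListElement", [])
    if items:
        return items[0]["result"].get("detailedDescription", {}).get("articleBody")
    return None

driver = GraphDatabase.driver(AURA_URI, auth=AURA_AUTH)
with driver.session() as session:
    names = [r["name"] for r in session.run("MATCH (c:Company) RETURN c.name AS name")]
    for name in names:
        description = fetch_description(name)
        if description:
            # Add the description as a new property on the Company node.
            session.run(
                "MATCH (c:Company {name: $name}) SET c.description = $description",
                name=name, description=description)
driver.close()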

Once we’ve run through the notebook, it’s possible to inspect the enriched data set in Neo4j. For instance, if we query for “Neo4j” again in our knowledge graph, we’ll see it now has a property called description:

What we've done here is take the best of both worlds. We've leveraged Google's incredible expertise in search and knowledge aggregation with Neo4j's market-leading ability to store, query, and visually represent data as graphs. The result is a very powerful way to build custom knowledge graphs.

Exploring the Knowledge Graph

Now that we've loaded and enriched our data, let's explore it. To do so we'll use Neo4j Bloom, a graph-specific business intelligence tool.

First off, let's take a look at the companies and sectors:

Here we can see our 14 sectors as the orange dots. A few sectors only have one company as part of them but others have many. Rerunning the search to include our investors, we now get a much more crowded view:

We can use the Neo4j Graph Data Science component to better make sense of this picture. Computing the betweenness centrality on the graph gives us this view.

In it we've sized nodes according to their score. One surprising finding is that the investors aren't what's most important; rather, the companies and sectors are the main connecting points in our graph.
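
If you'd like to reproduce the centrality computation yourself, here is a minimal sketch that projects the graph and streams betweenness scores via the Python driver. The projection name, labels, and connection details are assumptions based on the data model described above, not the exact code we ran:

# Minimal sketch: project the graph and stream betweenness centrality scores.
# Connection details are placeholders; labels follow the data model described above.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("neo4j+s://<id>.databases.neo4j.io",
                              auth=("neo4j", "YOUR_PASSWORD"))
with driver.session() as session:
    # Project companies, sectors and investors with all relationships between them.
    session.run("CALL gds.graph.project('pe', ['Company', 'Sector', 'Investor'], '*')")
    result = session.run("""
        CALL gds.betweenness.stream('pe')
        YIELD nodeId, score
        RETURN gds.util.asNode(nodeId).name AS name, score
        ORDER BY score DESC LIMIT 10
    """)
    for record in result:
        print(record["name"], record["score"])
driver.close()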

We can also add filters to our perspective. For instance, we can filter to see what companies had a return of 10% or greater in the last month. That gives us this view:

Zooming in on that view, we can identify a firm that invested in four of those companies that returned 10% or greater.

This is a small subset of what’s possible with Bloom. We invite you to explore this data on your own!

Conclusion

In this blog post we loaded a dataset provided by ApeVue into Neo4j AuraDS. AuraDS is Neo4j’s managed service, running on Google Cloud Platform. The data we loaded consisted of the ApeVue50 index, including return, depth and open interest data. We queried the data using Neo4j Cypher and Browser. We enriched that dataset with queries to Google Enterprise Knowledge Graph. Finally, we explored the dataset using Neo4j Bloom.

The enriched knowledge graph gives us a more complete view of the private equity data. Downstream systems using this knowledge graph would have access not just to return, investor, and sector information, but also to a natural language description of what each company does. This could be used to power conversational agents or be fed into full text searches of the data set.

Systems that use Neo4j and Google Enterprise Knowledge Graph can provide better insights and understanding of connected information than otherwise possible. Downstream systems can then make more accurate business decisions and provide better customer experiences.

You can find the notebooks we used on GitHub here.

If you’d like to learn more about the dataset, please reach out to Nick Fusco and the team at ApeVue (contact@apevue.com). For information on Neo4j, please reach out to ecosystem@neo4j.com with any questions.

We thank the many Google Cloud and Neo4j team members who contributed to this collaboration.

At Box, a game plan for migrating critical storage services from HBase to Cloud Bigtable

Introduction

When it comes to cloud-based content management, collaboration, and file sharing tools for businesses and individuals, Box, Inc. is a recognized leader. Recently, we decided to migrate from Apache HBase, a distributed, scalable, big data store deployed on-premises, to Cloud Bigtable, Google Cloud’s HBase-compatible NoSQL database. By doing so, we achieved the many benefits of a cloud-managed database: reduced operational maintenance work on HBase, flexible scaling, decreased costs and an 85% smaller storage footprint. At the same time, the move allowed us to enable BigQuery, Google Cloud’s enterprise data warehouse, and run our database across multiple geographical regions.

But how? Adopting Cloud Bigtable meant migrating one of Box’s most critical services, whose secure file upload and download functionality is core to its content cloud. It also meant migrating over 600 TB of data with zero downtime. Read on to learn how we did it, and the benefits we’re ultimately enjoying. 

Background

Historically, Box has used HBase to store customer file metadata with the schema in the table below. This provides us with a mapping from a file to that file's physical storage locations. The metadata is managed by a service called Storage Service, which runs on Kubernetes, and is used on every upload and download at Box. For some context on our scale: at the start of the migration we had multiple HBase clusters that each stored over 600 billion rows and 200 terabytes of data. Additionally, these clusters received around 15,000 writes per second and 20,000 reads per second, but could scale to serve millions of requests for analytical jobs or higher loads.

Our HBase architecture consisted of three fully replicated clusters spread across different geographical regions: Two active clusters for high availability, and another to handle routine maintenance. Each regional Storage Service wrote to its local HBase cluster and those modifications were replicated out to other regions. On reads, Storage Service first fetched from the local HBase cluster and fell back onto other clusters if there was a replication delay.

Preparing to migrate

To choose the best Bigtable cluster configuration for our use case, we ran performance tests and asynchronous reads and writes before the migration. You can learn more about this on the Box blog here.

Since Bigtable requires no maintenance downtime, we decided to merge our three HBase clusters down to just two Bigtable clusters in separate regions for disaster recovery. That was a big benefit, but now we needed to figure out the best way to merge three replicas into two replicas!

Theoretically, the metadata in all three of our HBase clusters should have been the same because of partitioned writes and guaranteed replication. However, in practice, metadata across all the clusters had drifted, and Box’s Storage Service handled these inconsistencies upon read. Thus, during the backfill phase of the migration, we decided to take snapshots of each HBase cluster and import them into Bigtable. But we were unsure about whether to overlay the snapshots or to import the snapshots to separate clusters.

To decide how to merge three clusters into two, we ran the Google-provided Multiverse Scan Job, a customized MapReduce job that sequentially scans HBase table snapshots in parallel. This allowed us to effectively perform a sort-merge-join of the three tables and compare rows and cells for differences between the three HBase clusters. While the job scanned the entire table, a random 10% of critical rows were compared. This job took 160 Dataproc worker nodes and ran for four days. Then, we imported the differences into BigQuery for analysis.

We found that inconsistencies fell into three categories:

Missing rows in an HBase cluster

A row existed, but was missing columns in an HBase cluster

A row existed, but had differing non-critical columns in an HBase cluster

This exercise helped us decide to consolidate all three snapshots into one, which would provide us with the most consistent copy, and to let Bigtable replication handle importing the data into the secondary Bigtable cluster. This would resolve any issues with missing columns or rows.

Migration plan

So, how do you migrate trillions of rows into a live database? Based on our previous experience migrating a smaller database into Bigtable, we decided to implement synchronous modifications. In other words, every successful HBase modification would result in the same Bigtable modification. If either step failed, the overall request would be considered a failure, guaranteeing atomicity. For example, when a write to HBase succeeded, we would issue a write to Bigtable, serializing the operations. This increased the total latency of writes to the sum of a write to HBase and Bigtable. However, we determined that was an acceptable tradeoff, as doing parallel writes to both these databases would have introduced complex logic in Box’s Storage Service.
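
To make the pattern concrete, here is a minimal sketch of a synchronous dual write in Python. Box's Storage Service is not written in Python, and the table names, column family, and connection details are purely illustrative:

# Minimal sketch of the synchronous dual-write pattern: write to HBase first,
# then mirror the write to Bigtable; fail the request if either write fails.
# Table names, column family and connection details are illustrative placeholders.
import happybase
from google.cloud import bigtable

hbase_conn = happybase.Connection("hbase-host")               # placeholder
hbase_table = hbase_conn.table("file_metadata")

bt_client = bigtable.Client(project="my-project")             # placeholder
bt_table = bt_client.instance("my-instance").table("file_metadata")

def write_metadata(row_key: bytes, column: bytes, value: bytes) -> None:
    # 1. Write to HBase; any exception propagates and fails the request.
    hbase_table.put(row_key, {b"cf:" + column: value})
    # 2. Only after HBase succeeds, mirror the write to Bigtable.
    bt_row = bt_table.direct_row(row_key)
    bt_row.set_cell("cf", column, value)
    bt_row.commit()
    # If the Bigtable write raises, the overall request is treated as failed.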

One complexity was that Box's Storage Service performed many check-and-modify operations. These couldn't be mirrored in Bigtable during the part of the migration when Bigtable had not yet been backfilled, because the Bigtable check-and-modify results would differ from the HBase ones. For this reason, we decided to rely on the result of the HBase check-and-modify, and only performed the Bigtable modification if the HBase check-and-modify succeeded.

Rollout plan

To roll out synchronous modifications safely, we needed to control it by both percentage and region. For example, our rollout plan for a region looked like the following:

1% of region 1 → 5% of region 1 → 25% of region 1 → 50% of region 1 → 100% of region 1

Synchronous modifications ensured that Bigtable had all new data written to it. However, we still needed to backfill the old data. After running synchronous modifications for a week and observing no instabilities, we were ready to take the three HBase snapshots and move onto the import phase.

Bigtable import: Backfilling data

We had three HBase snapshots of 200 TB each. We needed to import these into Bigtable using the Google-provided Dataproc Import Job. This job had to be run carefully since we were fully dependent on the performance of the Bigtable cluster. If we overloaded our Bigtable cluster, we would immediately see adverse customer impact — an increase in user traffic latency. In fact, our snapshots were so large that we scaled up our Bigtable cluster to 500 nodes to avoid any performance issues. We then began to import each snapshot sequentially. An import of this size was completely new territory for us, so we controlled the rate of import by slowly increasing the size of the Dataproc cluster and monitoring Bigtable user traffic latencies.

Validation

Before we could start relying on reads from Bigtable, there was a sequence of validations that had to happen. If any row was incorrect, this could lead to negative customer impact. The size of our clusters made it impossible to do validation on every single row. Instead, we took three separate approaches to validation to gain confidence in the migration:

1. Async read validation: Optimistic customer-download-driven validation

On every read, we asynchronously read from Bigtable and added logging and metrics to notify us of any differences. The one caveat with this approach was that we have a lot of reads that are immediately followed by an update. This approach created a lot of noise, since all of the differences we surfaced were from modifications that happened in between the HBase read and the Bigtable read.

During this read validation we discovered that Bigtable regex scans behave differently from HBase regex scans. For one, Bigtable only supports “equals” regex comparators. Also, Bigtable regexes use RE2, which treats “.” (any character except newline, unless configured otherwise) differently than HBase. Thus, we had to roll out a specific regex for Bigtable scans and validate that it returned the expected results.
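
To illustrate, here is a hedged Python sketch of a Bigtable row-key regex scan. The row-key format is hypothetical, and the “(?s)” flag is what restores the dot-matches-newline behavior an HBase scan would have had by default:

# Minimal sketch: a Bigtable scan expressed with a row-key regex filter.
# RE2's "." excludes newline by default, so "(?s)" is added to match any byte.
# Row-key format, project, instance and table names are hypothetical.
from google.cloud import bigtable
from google.cloud.bigtable.row_filters import RowKeyRegexFilter

client = bigtable.Client(project="my-project")            # placeholder
table = client.instance("my-instance").table("file_metadata")

# "(?s)" makes "." also match newline bytes, as the HBase scans did.
regex_filter = RowKeyRegexFilter(b"(?s)^customer123\\..*")

for row in table.read_rows(filter_=regex_filter):
    print(row.row_key)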

2. Sync validation: Run a Dataproc job with hash comparison between Bigtable and HBase

This validation job, similar to the one found here, performed a comparison of hashes across Bigtable and HBase rows. We ran it on a sample of 3% of the rows and uncovered a 0.1% mismatch. We printed these mismatches and analyzed them. Most of them were from optimistic modifications to certain columns, and we found that no re-import or correction was needed.

3. Customer perspective validation 

We wanted to perform an application-level validation to see what customers would be seeing instead of a database-level validation.

We wrote a job that scanned the whole filesystem and queued up objects; for each object, it called an endpoint in Storage Service that compared the corresponding entries in Bigtable and HBase. For more information, check out this Box blog.

This validation supported the output of the Sync Validation job. We didn’t find any differences that weren’t explained above.

Flipping to Bigtable

All these validations gave us the confidence to return reads from Bigtable instead of HBase. We kept synchronous dual modifications to HBase on as a backup, in case we needed to roll anything back. After returning only Bigtable data, we were finally ready to turn off modifications to HBase. At this point Bigtable became our source of truth.

Thumbs up to Bigtable

Since completing the migration to Bigtable, here are some benefits we’ve observed. 

Speed of development

We now have full control of scaling up and down Bigtable clusters. We turned on Bigtable autoscaling, which automatically increases or decreases our clusters given CPU and storage utilization parameters. We were never able to do this before with physical hardware. This has allowed our team to develop quickly without impacting our customers.

Our team now has much less overhead related to managing our database. In the past, we would constantly have to move around HBase traffic to perform security patches. Now, we don’t need to worry about managing that at all.

Finally, MapReduce jobs that would take days in the past now finish in under 24 hours.

Cost savings

Before Bigtable, we were running three fully replicated clusters. With Bigtable, we are able to run one primary cluster that takes in all the requests, and one replicated secondary cluster that we could use if there were any issues with the primary cluster. Besides disaster recovery, the secondary cluster is also extremely useful to our team for running data analysis jobs.

And with autoscaling, we can run our secondary cluster much more lightly until we need to run a job, at which point it scales itself up. The secondary cluster runs with 25% fewer nodes than the primary cluster. When we used HBase, all three of our clusters were sized evenly.

New analysis tools

We ported all our HBase MapReduce jobs over to Bigtable, and found that Bigtable has provided us with parity in functionality with minor configuration changes to our existing jobs.

Bigtable has also enabled us to use the Google Cloud ecosystem:

We were able to add Bigtable as an external BigQuery source. This allowed us to query our tables in real time, which was never possible in HBase. This application was best suited to our small tables. Care should be taken with running queries on a production Bigtable cluster due to impact on CPU utilization. App profiles may be used to isolate traffic to secondary clusters.

For our larger tables we decided to import them into BigQuery through a Dataproc job. This enabled us to pull ad hoc analytics data without running any extra jobs. Further, querying BigQuery is also much faster than running MapReduce jobs.

Long story short, migrating to Bigtable was a big job, but with all the benefits we’ve gained, we’re very glad we did! 

Considering a move to Bigtable? Find more information about migrations and Google Cloud supported tools:

Learn about our live migration tools such as HBase Replication and the HBase mirroring client

Walk through the migration guide for step-by-step instructions

Watch Box’s presentation, How Box modernized their NoSQL databases with minimal effort and downtime

Unlocking Retail Location Data with CARTO and BigQuery

Retail companies put geospatial data to use to solve all manner of challenges. Data that is tied to location is vital to understanding customers and solving business problems. Say you’re a Director of Leasing who needs to choose the location for a new store. You’ll need to know your potential customers’ demographics, the products being sold by competitors, foot-traffic patterns — but all that data would be essentially useless if it wasn’t tied to a spatial location.

Adding the spatial dimension to your data unlocks new potential, but also adds a degree of complexity. Geospatial data requires map-based visualizations, unique functions and procedures, and far more storage space than your average data. This is why Location Intelligence platforms like CARTO and petabyte-scale data warehouses like BigQuery are essential for a business that wants to use geospatial data to solve its business problems.

CARTO for Retail is a set of platform components and analytical functions that have been developed to assist Retail companies with their geospatial data. The CARTO for Retail functions are deployed directly in BigQuery, and combine spatial analytics with BigQuery Machine Learning tools to run predictions and analysis in the same location as your data. The CARTO for Retail Reference Guide goes into extensive detail about this solution, which we'll dive into below.

CARTO Platform Components

The CARTO platform provides a set of capabilities that, when combined with the processing power of BigQuery, form the CARTO for Retail Solution. Below is an illustration of the components in play:

Visualization

The CARTO Builder is a web-based drag-and-drop analysis tool that allows you to quickly see your geospatial data on a map. You can do discovery analyses with the built-in spatial functions, which push the processing down to BigQuery without any configuration required on your end. If you do want to put your hands on the commands that are sent to BigQuery, you can open the SQL interface and edit the code directly. This makes the CARTO Builder an excellent tool for rapidly prototyping geospatial applications.

Once you’re ready to add advanced application features like fine-grained access controls or custom filtering, you can add to your prototype’s code using the deck.gl framework (this is what CARTO uses on the backend) and CARTO for React. They also provide some helpful templates at that link to get you started.

Data 

While most companies generate some of their own geospatial data, very few have the means (think hundreds of thousands of drones in the sky on a daily basis) to generate the full picture. How about adding some location data to your location data? CARTO’s Data Observatory provides curated 3rd party data (some free, most for purchase) including socio-demographics, Points of Interest, foot & road traffic, behavioral data or credit card transactions. All the data is already hosted in BigQuery, so it’s easy to merge with your existing data. BigQuery itself also has a number of publicly available geospatial datasets, including OpenStreetMap.

Analytics

There is a series of retail-specific user-defined functions and stored procedures within CARTO's Analytics Toolbox for BigQuery. These procedures can be accessed through the CARTO platform, or directly in the BigQuery console. Leveraging the massive computing power of BigQuery, you can do the following analyses:

Clustering
Analysis to identify the optimal store locations by geographically clustering customers, competitors and existing stores.

Commercial Hotspots
Models to focus on the most effective areas for expansion, based on the surrounding retail fabric.

Whitespace Analysis
Routines to identify the best potential locations, where expected revenue is higher than that of top-performing stores, and where key business criteria are met.

Twin Areas Analysis
ML-driven analytics to focus network expansion strategies on the most similar locations to the best performing stores.

Store Revenue Prediction
A trained Machine Learning model to predict the annual revenue of a planned store location.

Store Cannibalization
A model to estimate the overlap of areas and spatial features that a new store would have with the existing store network.

Example

Now let’s see the CARTO for Retail components in action. Our goal for this example is to identify similar areas (known as twin areas) in Texas to match a particular high-performing store in Austin. We first create a connection to BigQuery using a service account.

Next, we need to create our data model using the carto.BUILD_REVENUE_MODEL_DATA function. This function takes in stores, revenue data, and competitors, then creates an evaluation grid to find twin areas, trade areas (which can be any polygon, such as a radius, drive time, or custom-created polygon), and desired enrichment variables. Below is an example of this function:

CALL `carto-un`.carto.BUILD_REVENUE_MODEL_DATA(
    -- Stores: revenue, store, geom
    '''
    SELECT id as store, revenue, geom FROM project.dataset.stores_table
    ''',
    -- Stores information variables - optional
    NULL,
    -- Competitors: competitor, geom
    '''
    SELECT id as competitor, geom FROM project.dataset.competitors_table
    ''',
    -- Area of interest: geom
    '''
    SELECT state_geom as geom FROM `bigquery-public-data.geo_us_boundaries.states` WHERE state_name IN ('Texas', 'Arizona')
    ''',
    -- Grid params: grid type and level
    'h3', 7,
    -- Decay params: kring size and decay function
    1, '',
    -- Data Observatory enrichment
    [('total_pop_3200daaa', 'sum'), ('households_3cda72a3', 'sum'), ('median_age_fb9fb9a', 'sum'), ('pop_25_years_ov_3eb5c867', 'sum'), ('median_income_20186485', 'avg'), ('income_200000_o_6cda1f8a', 'sum'), ('median_rent_37636cdd', 'sum'), ('families_with_y_228e5b1c', 'sum'), ('employed_pop_7fe50b6c', 'sum')],
    'carto-data.CARTO_DO_USER_ID',
    -- Custom data enrichment
    NULL, NULL,
    -- Output destination prefix
    'project.dataset.retail_model_texas'
);

Next, we need to build out the revenue model using carto.BUILD_REVENUE_MODEL. This uses BigQuery ML to perform the model predictions and supports LINEAR_REG and BOOSTED_TREE_REGRESSOR. Check the model documentation for more information.

This will output the model, SHAP values, and model statistics to understand the model performance. Below is an example query to run this:

CALL `carto-un`.carto.BUILD_REVENUE_MODEL(
    -- Model data
    'cartodb-gcp-solutions-eng-team.forrest.retail_model_texas',
    -- Options
    '{"MAX_ITERATIONS": 20}',
    -- Output destination prefix
    'cartodb-gcp-solutions-eng-team.forrest.retail_model_texas_model'
);

Finally, we can predict our twin areas. We pick a target index which we can identify from our map. As we can see here, this cell is the top performing store we want to find similar areas to.

From here we can run our Twin Areas model, which is based on Principal Component Analysis (PCA). We provide a query containing our target H3 cell, a second query of the cells we want to study (any cell without a store in Texas), and several other arguments to fine-tune our results:

CALL `carto-un`.carto.FIND_TWIN_AREAS(
    -- Input queries
    '''SELECT * FROM `project.dataset.retail_model_texas` WHERE index = '87489e262ffffff' ''',
    '''SELECT * FROM `project.dataset.forrest.retail_model_texas` WHERE revenue_avg = 0''',
    -- Twin areas model inputs
    'index',
    0.75,
    NULL,
    'project.dataset.twin_area_texas'
);

The result is this interactive map which shows us the top areas that will likely perform the most similar to our target store based on our geospatial factors. We can also include other store factors in our first step to add site specific details like square footage or year built.

What’s Next

Believe it or not, there are even more tools and functions that can help you make the most of your geospatial data, which are explored in the CARTO for Retail Reference Guide. There's the BigQuery Tiler, and CARTO's out-of-the-box Site Selection Application, which includes relevant 3rd party data, advanced map visualizations, and embedded models to pinpoint the best locations for network expansion.

In addition to the CARTO Analytics Toolbox, BigQuery also has many additional GIS functions for analyzing your geospatial data. Check out this blog on optimizing your spatial storage clustering in BigQuery. If you’re looking to analyze raster data, or can’t find the dataset you need in CARTO’s data observatory, consider trying Google Earth Engine.

Shorten the path to insights with Aiven for Apache Kafka and Google BigQuery

Every company aims to be data driven, but bringing accurate data in front of the right stakeholders in a timely manner can be quite complex. The challenge is even greater when the source data resides in different technologies, with various access interfaces and data formats.

This is where the combination of Aiven for Apache Kafka® and Google BigQuery excels, by providing the ability to source the data from a wide ecosystem of tools and, in streaming mode, push it to BigQuery where datasets can be organized, manipulated and queried.

From data sources to Apache Kafka with Kafka Connect

Alongside Apache Kafka, Aiven offers the ability to create a managed Kafka Connect cluster. The range of 30+ available connectors enables integrating Kafka with a wide set of technologies, as both source and sink, using a JSON configuration file. And if the connector for a particular technology isn't in the list, an integration with a self-managed Kafka Connect cluster provides complete freedom in connector selection, while keeping the benefit of the fully managed Apache Kafka cluster.

If the datasource is a database, connectors like the Debezium source for PostgreSQL can enable a reliable and fast change data capture mechanism using the native database replication features, thereby adding minimal load on the source system.
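
As a hedged illustration (not Aiven-specific code), registering a Debezium PostgreSQL source connector against a self-managed Kafka Connect cluster's REST API could look like the sketch below. Hostnames, credentials, and some property names depend on your setup and Debezium version; Aiven's managed Kafka Connect exposes the same configuration through its console and API:

# Minimal sketch: register a Debezium PostgreSQL source connector via the
# Kafka Connect REST API. Hostnames, credentials and some property names are
# placeholders and may vary with the Debezium version in use.
import json
import requests

connector = {
    "name": "pg-source",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "pg-host",        # placeholder
        "database.port": "5432",
        "database.user": "replicator",         # placeholder
        "database.password": "secret",         # placeholder
        "database.dbname": "orders",           # placeholder
        "topic.prefix": "shop",                # Debezium 2.x; older versions use database.server.name
        "plugin.name": "pgoutput",
    },
}

resp = requests.post(
    "http://kafka-connect:8083/connectors",    # placeholder Connect REST endpoint
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()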

Data in Apache Kafka

During the ingestion phase, to optimize throughput, connectors can use the Avro data format and store the data’s schema in Karapace, Aiven’s open source tool for schema registry and REST API endpoints.

Data in Apache Kafka is stored in topics, which can have an associated retention period defining the amount of time or space for which the data is kept. Topics can be read by one or more consumers, either independently or in competition as part of the same application (a "consumer group" in Apache Kafka terms).

If some reshaping of the data is needed before it lands in the target datastore, Aiven for Apache Flink allows you to perform such transformations in streaming mode using SQL statements. Cleansing or enrichment with data coming from different technologies are common examples.

Push data to Google BigQuery

Once the data is in the right shape to be analyzed, the Apache Kafka topic can be pushed to BigQuery in streaming mode using the dedicated sink connector. The connector has a wide range of configuration options including the timestamp to be used for partitioning and the thread pool size defining the number of concurrent writing threads.
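
The sink side follows the same pattern. The sketch below uses property names from the widely used open-source BigQuery sink connector; exact names (including the partitioning timestamp and thread pool settings mentioned above) vary between connector versions, so treat it as illustrative:

# Minimal sketch: push a topic to BigQuery with the BigQuery sink connector.
# Project, dataset, topic and tuning settings are placeholders; check the
# documentation of your connector version for the exact property names.
import json
import requests

connector = {
    "name": "bq-sink",
    "config": {
        "connector.class": "com.wepay.kafka.connect.bigquery.BigQuerySinkConnector",
        "topics": "shop.public.orders",        # placeholder topic name
        "project": "my-gcp-project",           # placeholder
        "defaultDataset": "kafka_landing",     # placeholder
        # Partitioning and parallelism knobs mentioned above (names may vary):
        "timestampPartitionFieldName": "created_at",
        "threadPoolSize": "8",
    },
}

resp = requests.post(
    "http://kafka-connect:8083/connectors",    # placeholder Connect REST endpoint
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()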

The data, arriving in streaming mode via Apache Kafka, now lands in one or more BigQuery tables, ready for further analysis and processing. BigQuery offers a rich set of SQL functions that let you parse nested datasets, apply complex geographical transformations, and even train and use machine learning models, among other things. The depth of BigQuery's SQL functions enables analysts, data engineers and scientists to perform their work in a single platform using the common SQL language.

A streaming solution for fast analytics

With the wide set of source connectors available and its streaming capabilities, Aiven for Apache Kafka is the perfect fit to enable the data to flow from a huge variety of data sources to BigQuery for analytics.

One example of a customer using this pattern is the retail media platform Streaem, part of Wunderman Thompson. Streaem provides a self-service retail media platform for retailers and their brands to monetise areas of their site and in store digital assets by combining intelligence about their products and signals from their customers along with campaign information provided by advertisers. For example, a user might type “Coke” into a search box, and as well as seeing the regular results they will also see some sponsored listings. Then, as they browse around the site, there could be promoted products based on their previous interaction.

Streaem are fully committed to using Google Cloud as their platform of choice, but their technology is event-driven and based around Kafka as a message broker which is not natively available. Using Aiven’s Apache Kafka offering on top of Google Cloud lets Streaem get the best of both worlds; industry-standard event streaming on their preferred cloud, without the headache of managing Kafka themselves. With multiple microservices deployed, all of which need a consistent and up-to-date view of the world, Kafka is an obvious service to place at the center of their world to make sure everything has the latest information in a way which will scale effortlessly as Streaem itself reaches new heights.

“At Streaem we use Kafka as a core part of our platform where event-based data is a key enabler for the services we deliver to our clients” says Garry Turkington, CTO. “Using hosted data services on GCP allows us to focus on our core business logic while we rely on Aiven to deliver a high-performance and reliable platform we can trust.”

Analytics is still a challenge in a Kafka-only world, so Streaem uses a managed open-source Kafka Connector on the Aiven platform to stream the microservices data into Google BigQuery. This means that data about customer activity or keyword auctions or anything else in the live platform are available with low latency into BigQuery, powering Streaem’s reporting dashboards and providing up-to-date aggregations for live decisions to be made. By using Google Cloud Platform, Aiven for Apache Kafka, and BigQuery, Streaem can be confident that their production systems are running smoothly whilst they concentrate their efforts on growing their business.

Other use cases

Aiven for Apache Kafka along with Google Cloud BigQuery is driving crucial insights across a range of industry verticals and use cases. For example: 

Retail: Demand Planning with BQML, Recommendation Engines, Product Search

Aiven is leveraged at a large European retail chain for open source database and event streaming infrastructure (Aiven for Apache Kafka, Aiven for OpenSearch, Aiven for Postgres, Aiven for Redis). The data is then fed to trained models in BigQuery ML to recommend products to purchase. These models can be exposed as APIs managed in Vertex AI for production applications.

E-commerce: Real-Time Dynamic Pricing

A global travel booking site uses Aiven for data streaming infrastructure (Aiven for Apache Kafka), handling global pricing and demand data in near real-time, and Aiven for OpenSearch for SIEM and application search use cases. Data then flows into BigQuery for analytics, giving the customer a best-in-class enterprise data warehouse.

Gaming: Player Analytics

Aiven powers data streaming (via Aiven for Apache Kafka) for a Fortune 500 gaming company, supporting multiple gaming titles and more than 100 million players globally. Analytics in BigQuery drives critical insights using player metadata.

Conclusion / Next Steps

The combination of Aiven for Apache Kafka and Google BigQuery drives analytics on the latest data in near real time, minimizing the time to insight and maximizing the impact. Customers of Aiven and Google are already taking advantage of this powerful combination, and seeing the benefits to their business. If you would like to experience this for yourself, sign up for Aiven and use the following links to learn more:

Aiven for Apache Kafka to discover the features, plans and options available for a managed Apache Kafka service

Apache Kafka BigQuery sink connector to review the settings and examples of pushing data from Apache Kafka to BigQuery

To learn more about Google Cloud BigQuery, click here.

Ready to give it a try? Click here to check out Aiven’s listing on Google Cloud Marketplace, and let us know what you think.

Built with BigQuery and Google AI: How Glean enhances enterprise search quality and relevance for teams

Context

About Glean

Glean searches across all your company's apps to help you find exactly what you need and discover the information you need to do your best work. It delivers powerful, unified enterprise search across all workplace applications, websites, and data sources used within an enterprise. Search results respect the existing permissions from your company's systems, so users only see what they already have permission to see. Glean's enterprise search also takes into account your role, projects, collaborators, and the language and acronyms specific to your company to deliver highly personalized results that provide you with the information most pertinent to you and your work. This greatly reduces time spent searching, helping you be more productive and experience less frustration at work finding what you need to progress.

Why Google Cloud is foundational for Glean

Crucial to the performance of Glean's powerful and personalized enterprise search is the technology behind it. Glean is built on Google Cloud (see diagram 1) and leverages Google's data cloud – a modern data stack with components such as BigQuery, Dataflow, and Vertex AI.

Use case 1: Processing and enriching pipelines of data through Dataflow.

Glean uses Google Cloud Dataflow to extract the relevant pieces from the content indexed from different sources of workplace knowledge. It then augments the data with various relevance signals before storing them in the search index that’s hosted in the project’s Google Kubernetes Engine. Additionally, Glean uses Dataflow to generate training data at scale for our models (also trained on Google Cloud). As a whole, Google Cloud Dataflow enables Glean to build complex and flexible data processing pipelines that autoscale efficiently when processing large corpuses of data.

Use case 2: Running analytical workloads with BigQuery and Looker Studio

Glean closely measures and optimizes the satisfaction of users who are using the product to find information. This involves understanding the actions taken by the Glean user in a session and identifying when the user was able to find content useful to them in the search results, as opposed to when the results were not helpful for the user. In order to compute this metric, Glean stores the anonymized actions taken in the product in BigQuery and uses BigQuery queries to compute user satisfaction metrics. These metrics are then visualized by building a product dashboard over the BigQuery data using Looker Studio.
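
As an illustration of this kind of workload, a satisfaction query run from Python might look like the sketch below; the events table and its schema are hypothetical, not Glean's actual data model:

# Minimal sketch: compute a simple per-day "searcher satisfaction" ratio in
# BigQuery. The project, table name and schema are hypothetical illustrations.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")   # placeholder

query = """
SELECT
  DATE(event_time) AS day,
  COUNTIF(action = 'result_click') / COUNT(DISTINCT session_id) AS satisfied_ratio
FROM `my-project.analytics.search_events`
GROUP BY day
ORDER BY day
"""

for row in client.query(query).result():
    print(row.day, round(row.satisfied_ratio, 3))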

Use case 3: Running ML models with VertexAI.

Glean is able to train state-of-the-art language models adapted to enterprise/domain-specific language at scale by using TPUs through Vertex AI. 

TPUs, or Tensor Processing Units, are custom-designed hardware accelerators developed by Google specifically for machine learning workloads. TPUs are designed to speed up and optimize the training and inference of deep neural networks.

Google offers TPUs as a cloud-based service to its customers, which enables users to train and run machine learning models at a much faster rate than traditional CPUs or GPUs. Compared to traditional hardware, TPUs have several advantages, including higher performance, lower power consumption, and better cost efficiency. TPUs are also designed to work seamlessly with other Google Cloud services, such as TensorFlow, which is a popular open-source machine learning framework. This makes it easy for developers and data scientists to build, train, and deploy machine learning models using TPUs on Google Cloud.

Training data derived from the enterprise corpora is used to do domain-adaptive pretraining and task-specific fine-tuning on large-scale models with flexibility enabled by Vertex AI. Search is additionally powered by vector search served with encoders and approximate nearest neighbor (ANN) indices trained and built through Vertex AI.

A collaborative solution

Glean offers a variety of features and services for its users: 

Feature 1: Search across all your company’s apps. 

Glean understands context, language, behavior, and relationships with others, to constantly learn about what you need and instantly find personalized answers to your questions. For instance, Glean factors in signals like docs shared with you, documents trending in your team, the office you are based in, the applications you use the most, and the most common questions being asked and answered in the various communication apps to surface documents and answers that are relevant to your search query. In order to provide this personalized experience, Glean understands the interactions between enterprise users as well as actions performed relevant to information discovery. It uses Cloud Dataflow to join these signals with the parsed information from the different applications, and trains semantic embeddings using the information on Vertex AI.

Feature 2: Discover the things you should know

Glean makes it easy to get things done by surfacing the information you need to know and the people you need to meet. This is effectively a recommendation engine for end users that surfaces data and knowledge that is contextually relevant for them. Glean leverages the vector embeddings trained using Vertex AI to recommend enterprise information to employees that is timely and contextually relevant.

Feature 3: Glean is easy to use and ready to go, right out of the box. 

It connects with all the apps you already use, so employees can continue working with the tools they already know and love. Glean uses fully-managed and auto-scalable Google Cloud components like App Engine, Kubernetes and Cloud Tasks to ensure high reliability and low operational overhead for the infrastructure components of the search stack.

Glean uses a variety of Google Cloud components to build the product, including: 

Cloud GKE

Cloud Dataflow

Cloud SQL

Cloud Storage

Cloud PubSub

Cloud KMS

Cloud Tasks

Cloud DNS

Cloud IAM

Compute Engine

Vertex AI

BigQuery

Stackdriver Logging/Monitoring/Tracing

Building the product on top of these components provides Glean with a reliable, secure, scalable and cost-effective platform. It enables us to focus on the core application and relevance features, and helps us stand out.

Building better data products with Google Cloud

Glean trusts Google Cloud as our principal and sole cloud provider. This is mainly because of four factors:

Factor 1: Security 

Google Cloud provides fine-grained IAM roles as well as various security features such as Cloud Armor, IAP based authentication, encryption by default, key management service, shielded VMs and private Google access that enables Glean to have a hardened, least-privilege configuration where the customer is fully in control of the data and has a full view into the access to the system.

Factor 2: Reliable and scalable infrastructure services 

With fully managed services that auto-scale like GKE, Cloud SQL, Cloud Storage, and Cloud Dataflow, we can focus on the core application logic and not worry about the system being unable to handle peak load or uptime of the system, nor worry about needing to manually down-scale the system during periods of low use for cost efficiency.

Factor 3: Advanced data processing, analytics and AI/ML capabilities 

For a search and discovery product like Glean, it’s very important to be able to make use of flexible data processing and analytics features in a cost effective manner. Glean builds on top of Google Cloud features like Cloud Dataflow, Vertex AI, and BigQuery to provide a highly personalized and relevant product experience to its users.

Factor 4: Support 

The Google Cloud team has been a true partner to Glean and has been providing prompt support of any production issues or questions we have about the Google Cloud feature set. They’re also highly receptive to feedback and direct interaction with the product group to influence the product roadmap through new features.

Conclusion

At the time of writing, Glean is one of over 800 tech companies powering their products and businesses using data cloud products from Google, such as BigQuery, Dataflow, and Vertex AI. Google's Built with BigQuery initiative helps ISVs like Glean get started building applications using data and machine learning products, and continue to add capabilities with new product features. By providing dedicated access to technology, expertise, and go-to-market programs, the Google Built-with initiatives (BigQuery, Google AI, etc.) can help tech companies accelerate, optimize and amplify their success.

Glean’s enterprise search and knowledge management solutions are built on Google Cloud. By partnering with Google Cloud, Glean leverages an all-in-one cloud platform for data collection, data transformation, storage, analytics and machine learning capabilities.

Through Built with BigQuery, Google is enabling, and co-innovating with tech companies like Glean to unlock product capabilities and build innovative applications using Google’s data cloud and AI products that simplify access to the underlying technology, receive helpful and dedicated engineering support, and engage in joint go-to-market programs. Participating companies can:

Get started fast with a Google-funded, pre-configured sandbox. 

Accelerate product design and architecture through access to designated experts from the ISV Center of Excellence for Data Analytics and AI, who can provide insight into key use cases, architectural patterns and best practices

Amplify success with joint marketing programs to drive awareness and generate demand, and increase adoption

We would like to thank Smitha Venkat and Eugenio Scafati on the Google team for contributing to this blog.

Introducing Telecom Subscriber Insights: Helping CSPs grow business with AI-driven digitization and personalization

Communication Service providers (CSPs) are facing a new dynamic where they have a digitally savvy customer base and market competition is higher than ever before. Understanding customers and being able to present them with relevant products and services in a contextual manner has become the focus of CSPs. Digitization and hyper-personalization will be powering the growth of CSPs in the new world.  

Data shows that personalized offers drive up to a 30% increase in sales and a 35% increase in customer lifetime value. In fact, over 90% of the CSPs surveyed in a Delta Partners Global Telecom Executive Survey expect sales and queries to increase digitally in the post-pandemic world, and 97% are showing a commitment to improving and investing in digital growth channels.

At Google Cloud, we are harnessing Google’s digital marketing learnings, capabilities, and technology, and combining them with our products and solutions to enable customers across industries — including CSPs — to help accelerate their digital transformation as they prepare for the next wave of growth.

Today at Mobile World Congress 2023 Barcelona, we announced Telecom Subscriber Insights, a new service designed to help CSPs accelerate subscriber growth, engagement, and retention. It does so by taking insights from CSPs’ existing permissible data sources such as usage records, subscriber plan/billing information, Customer Relationship Management (CRM), app usage statistics, and others. With access to these insights, CSPs can better recommend subscriber actions, and activate them across multiple channels. 

Key use cases

Telecom Subscriber Insights is an artificial intelligence (AI) powered product that CSPs’ line of business (LOB) owners can use to improve key performance indicators (KPIs) across the subscriber lifecycle — from acquisition and engagement, to up-sell/cross-sell, and reducing customer churn.

Subscriber acquisition

Increasing the number of new acquired subscribers, while also reducing the cost of acquisition, is a key goal for CSPs. Telecom Subscriber Insights helps CSPs achieve this goal by helping to reduce friction to subscribers throughout the onboarding process, and make the experience simple and elegant with one-click provisioning. 

Subscriber engagement

As CSPs have expanded beyond physical stores into the digital world, they have changed the way they engage with consumers — using digital channels to learn more about subscriber interests, needs, and to help personalize their offerings. In this new engagement model, CSPs are not only looking to increase digital engagement through their app, but also reimagine and increase the quality of that engagement. 

Telecom Subscriber Insights provides contextual and personalized recommendations that can enable CSPs to interact with subscribers in their app, helping to improve key KPIs associated with digital engagement.

Up-sell/cross-sell

Quality subscriber engagement is rewarding on both sides: subscribers are delighted by more contextual and personalized offers, and business owners can improve customer loyalty and growing revenue through product cross-sell/up-sell opportunities.

Telecom Subscriber Insights supports contextual and hyper-personalized offers, and can enable CSPs to deliver targeted engagements that help meet the needs of subscribers with product and services offers, while helping to ensure subscribers are not overloaded with unwanted offers.

Reducing customer churn

Throughout the customer lifecycle, CSPs are focused on delivering quality products and services to their customers to help drive customer lifetime value and reduce potential churn. 

Telecom Subscriber Insights helps CSPs identify segments of subscribers who have a high propensity for churn, and recommends the next best offers to help retain customers.

The technology driving insights

Telecom Subscriber Insights is a cloud-based, AI-powered, first-party product built on proven Google Cloud tools that ingests data from various sources to help forecast subscriber behavior and recommend subscriber segments that are likely to upgrade or likely to churn. The solution also recommends appropriate actions to CSPs, including identifying Next Best Offers (NBOs) based on the identified segments. CSPs are provided with integrations to multiple activation channels, helping them close the engagement with subscribers based on recommended actions. Google Cloud values other Independent Software Vendors (ISVs) who continue to complement Telecom Subscriber Insights with their own offerings built using the same Google Cloud tools.

Telecom Subscriber Insights offers a variety of capabilities, as follows:

Ingesting and normalizing Telecom subscriber data

Data from sources such as CSPs' OSS systems, CRM, and others is ingested into BigQuery and then normalized to pre-defined data models. The data can be ingested with simple APIs through batch upload, or streamed using standard Google Cloud technologies.
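
As a hedged illustration of the batch path (the product wraps this behind its own APIs), loading a usage-records export into BigQuery from Python could look like this, with the bucket, dataset, and schema as placeholders:

# Minimal sketch: batch-load a CSV export of usage records into BigQuery.
# Bucket, dataset, table and schema settings are placeholders for illustration.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")    # placeholder

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,                               # or supply an explicit schema
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/usage_records/2023-02-*.csv",  # placeholder export location
    "my-project.subscriber_insights.usage_records",
    job_config=job_config,
)
load_job.result()   # wait for the batch load to finish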

Pre-trained AI models

Predefined data models help in continuous training of various machine learning (ML) models to provide results with high efficacy. The model can be further fine-tuned on a per-deployment basis for granular performance. ML models including propensity, churn, NBO, segmentation, and many more are carefully selected to meet CSPs’ business objectives. Telecom Subscriber Insights can integrate with some other proven ML models, where applicable, with an ecosystem approach to help enrich our offering to CSPs.

Automated data pipeline

Telecom Subscriber Insights provides an automated data pipeline that stitches aspects of data processing —  from data ingestion, transformation, normalization, storage, query, to visualization — helping to ensure customers need to expend near-zero effort to get started.

Multi-channel activation

Subscribers can be reached on various channels including SMS, in-app notifications, and various other interfaces. To make integration easy without introducing new interfaces to subscribers, CSPs can embed Software Development Kits (SDKs) as part of their current apps and interfaces. Telecom Subscriber Insights is built on the premise of invoking multiple activation channels using a one-click activation on the marketing dashboard, so the CSP can reach the subscriber in a contextually appropriate manner.

One-click deployment and configuration 

Telecom Subscriber Insights is available as a customer-hosted and operated model. The software is packaged in an easy-to-consume Google Cloud project that is built and transferred to the customer when they purchase the product. The Google Cloud project includes the necessary software components, which Google remotely manages and upgrades over a control plane. This model helps ensure that customers have to do near zero work on an ongoing basis to have the latest AI models and software, while benefiting from operating the software on their own, with full control.

Get started with Telecom Subscriber Insights today

Telecom Subscriber Insights features a very simple pay-as-you-go pricing model that charges only for the population of subscriber data that is analyzed. This enables customers to prove value with a proof of concept (PoC) with a small subscriber footprint before rolling it out across the entire subscriber base. Find out more about Telecom Subscriber Insights by visiting the product page, and reach out to us at telecom_subscriber_insights@google.com

Getting started with Terraform and Datastream: Replicating Postgres data to BigQuery

Consider the case of an organization that has accumulated a lot of data, stored in various databases and applications. Producing analytics reports takes a long time because of the complexity of the data storage. The team decides to replicate all the data to BigQuery in order to increase reporting efficiency. Traditionally this would be a large, complex and costly project that can take a long time to complete.

Instead of painstakingly setting up replication for each data source, the team can now use Datastream and Terraform. They compile a list of data sources, create a few configuration files according to the organization’s setup, and voila! Replication begins and data starts to appear in BigQuery within minutes.

Datastream is Google Cloud’s serverless and easy-to-use change data capture and replication service. If you are unfamiliar with Datastream, we recommend this post for a high-level overview, or read our latest Datastream for BigQuery launch announcement.

Terraform is a popular Infrastructure as Code (IaC) tool. Terraform enables infrastructure management through configuration files, which makes management safer, more consistent and easily automatable.

Launched in mid-February 2023, Terraform support in Datastream unblocks and simplifies some exciting use cases, such as:

Policy compliance management – Terraform can be used to enforce compliance and governance policies on the resources that teams provision.

Automated replication process – Terraform can be used to automate Datastream operations. This can be useful when you need automated replication, replication from many data sources, or replication of a single data source to multiple destinations.

Using Terraform to set up Datastream replication from PostgreSQL to BigQuery

Let’s look at an example where the data source is a PostgreSQL database, and review the process step by step.

Limitations

Datastream can only replicate data to a BigQuery destination in the same Google Cloud project as the Datastream resources, so make sure to create those resources in the project where you want your data.

Requirements

The Datastream API needs to be enabled on the project before we continue. Go to the APIs & Services page in the Google Cloud console, and make sure that the Datastream API is enabled.

Make sure you have the Terraform CLI installed – see the Terraform CLI installation guide.

It’s possible to follow the steps in this blog with either a MySQL or Oracle database with a few slight modifications. Just skip the Postgres configuration section, and use our MySQL or Oracle configuration guides instead.

You will naturally need a Postgres database instance with some initial data. If you want to set up a new Postgres instance, you can follow the steps in the Cloud SQL for Postgres quickstart guide. 

We will need to make sure that PostgreSQL is configured for replication with Datastream. This includes enabling logical replication and optionally creating a dedicated user for Datastream. See our PostgreSQL configuration guide. Make sure to note the replication slot and publication names from this step, as we will need them to configure the replication later on.

You will also need to set up connectivity between your database and Datastream. Check the Network connectivity options guide, and find the connectivity type that fits your setup.

Configuring the replication with Terraform

We will start by creating Datastream Connection Profiles, which store the information needed to connect to the source and destination (hostname, port, user, and so on).

To do this, we will start by creating a new .tf file in an empty directory, and adding the following configurations to it:

resource "google_datastream_connection_profile" "source" {
  display_name          = "Postgresql Source"
  location              = "us-central1"
  connection_profile_id = "source"

  postgresql_profile {
    hostname = "HOSTNAME"
    port     = 5432
    username = "USERNAME"
    password = "PASSWORD"
    database = "postgres"
  }
}

In this example, we create a new Connection Profile that points to a Postgres instance. Edit the configuration with your source information and save it. For other sources and configurations see the Terraform Datastream Connection Profile documentation.

Next, let’s define the Connection Profile for the BigQuery destination:

resource "google_datastream_connection_profile" "destination" {
  display_name          = "BigQuery Destination"
  location              = "us-central1"
  connection_profile_id = "destination"

  bigquery_profile {}
}

We now have the source and destination configured, and are ready to configure the replication process between them. We will do that by defining a Stream, which is a Datastream resource representing the source and destination replication.

resource "google_datastream_stream" "stream" {
  display_name  = "Postgres to BigQuery Stream"
  location      = "us-central1"
  stream_id     = "stream"
  desired_state = "RUNNING"

  source_config {
    source_connection_profile = google_datastream_connection_profile.source.id
    postgresql_source_config {
      publication      = "PUBLICATION_NAME"
      replication_slot = "REPLICATION_SLOT"
    }
  }

  destination_config {
    destination_connection_profile = google_datastream_connection_profile.destination.id
    bigquery_destination_config {
      data_freshness = "900s"
      source_hierarchy_datasets {
        dataset_template {
          location = "us-central1"
        }
      }
    }
  }

  backfill_all {}
}

In this configuration, we are creating a new Stream and configuring the source and destination Connection Profiles and properties. Some features to note here:

Backfill (backfill_all) – Datastream will replicate an initial snapshot of historical data. This can be configured to exclude specific tables.

Replicating a subset of the source – you can specify which data should be included in or excluded from the Stream using the include/exclude lists. See the API docs for more.

Edit the configuration with the source Postgres publication and replication slot that you created in the initial setup. For other sources and configurations see the Terraform Stream documentation.

Running the Terraform configuration

Now that we have our configuration ready, it's time to run it with the Terraform CLI. For that, we can use Cloud Shell, which has the Terraform CLI installed and configured with the permissions needed for your project.

You can also prepare a local environment by adding this to your .tf file:

terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "4.53.0"
    }
  }
}

provider "google" {
  credentials = file("PATH/TO/KEY.json")
  project     = "PROJECT_ID"
  region      = "us-central1"
}

Start by running the terraform init command to initialize Terraform in your configuration directory. Then run the terraform plan command to check and validate the configuration.

Now let's run terraform apply to apply the new configuration.

If all went well, you should now have a running Datastream Stream! Go to the Datastream console to manage your Streams, and to the BigQuery console to check that the appropriate datasets were created.

When you’re done, you can use terraform destroy to remove the created Datastream resources.

Something went wrong?

You can set the TF_LOG environment variable (export TF_LOG=DEBUG) to see debug logs for the Terraform CLI. See Debugging Terraform for more.

Automating multiple replications

To automate Terraform configurations, we can use input variables, the count argument, and the element function. A simple variable that holds a list of sources can look like this:

variable "sources" {
  type = list(object({
    name             = string
    hostname         = string
    …
    publication      = string
    replication_slot = string
  }))
}

Add a count argument to the resources you want to automate (source, stream, and possibly destination), and replace hard-coded values with references to the variable:

count = length(var.sources)
…
connection_profile_id = "source_${element(var.sources, count.index).name}"

Building streaming data pipelines on Google Cloud

Many customers build streaming data pipelines to ingest, process and then store data for later analysis. We’ll focus on a common pipeline design shown below. It consists of three steps: 

1. Data sources send messages with data to a Pub/Sub topic.

2. Pub/Sub buffers the messages and forwards them to a processing component.

3. After processing, the processing component stores the data in BigQuery.

For the processing component, we’ll review three alternatives, ranging from basic to advanced: 

A. BigQuery subscription.

B. Cloud Run service.

C. Dataflow pipeline.

Example use cases

Before we dive deeper into the implementation details, let’s look at a few example use cases of streaming data pipelines:

Processing ad clicks. Receiving ad clicks, running fraud prediction heuristics on a click-by-click basis, and discarding or storing them for further analysis.

Canonicalizing data formats. Receiving data from different sources, canonicalizing them into a single data model, and storing them for later analysis or further processing. 

Capturing telemetry. Storing user interactions and displaying real-time statistics, such as active users, or the average session length grouped by device type.

Keeping a change data capture log. Logging all database updates from a database to BigQuery through Pub/Sub. 

Ingesting data with Pub/Sub

Let's start at the beginning. You have one or more data sources that publish messages to a Pub/Sub topic. Pub/Sub is a fully managed messaging service. You publish messages, and Pub/Sub takes care of delivering the messages to one or many subscribers. The most convenient way to publish messages to Pub/Sub is to use the client library.
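
For illustration, here is a minimal sketch of publishing a JSON message with the Python client library; the project ID, topic name, payload fields, and the "source" attribute are placeholders, not values from this post.

import json
from google.cloud import pubsub_v1

# Hypothetical project and topic names.
PROJECT_ID = "my-project"
TOPIC_ID = "clickstream-events"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

# Pub/Sub messages carry bytes; UTF-8 encoded JSON is a common convention.
event = {"user_id": "u-123", "action": "ad_click", "ts": "2023-03-01T12:00:00Z"}
future = publisher.publish(topic_path, data=json.dumps(event).encode("utf-8"), source="web")

# result() blocks until the message is published and returns its message ID.
print(f"Published message {future.result()}")

Extra keyword arguments passed to publish() become message attributes, which subscribers can later use for filtering or routing.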

To authenticate with Pub/Sub you need to provide credentials. If your data producer runs on Google Cloud, the client libraries take care of this for you and use the built-in service identity. If your workload doesn’t run on Google Cloud, you should use identity federation, or as a last resort, download a service account key (but make sure to have a strategy to rotate these long-lived credentials). 

Three alternatives for processing

It's important to realize that some pipelines are straightforward, and some are complex. Straightforward pipelines do little or no processing before persisting the data. Advanced pipelines aggregate groups of data to reduce storage requirements and can have multiple processing steps.

We'll cover how to do the processing using one of the following three options:

A BigQuery subscription, a no-code pass-through solution that stores messages unchanged in a BigQuery dataset.

A Cloud Run service, for lightweight processing of individual messages without aggregation.

A Dataflow pipeline, for advanced processing (more on that later). 

Approach 1: Storing data unchanged using a BigQuery subscription

The first approach is the most straightforward one. You can stream messages from a Pub/Sub topic directly into a BigQuery dataset using a BigQuery subscription. Use it when you’re ingesting messages and don’t need to perform any processing before storing the data. 

When setting up a new subscription to a topic, you select the Write to BigQuery option, as shown here:

The details of how this subscription is implemented are completely abstracted away from users, which means there is no way to execute any code on the incoming data. In essence, it is a no-code solution, so you can't apply filtering to the data before it is stored.
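
If you prefer to script the setup rather than use the console, the subscription can also be created with the Pub/Sub client library. Here is a minimal sketch, assuming a recent version of the Python library that supports BigQuery subscriptions; the project, topic, subscription, and table names are placeholders.

from google.cloud import pubsub_v1

# Hypothetical resource names.
project_id = "my-project"
topic_id = "clickstream-events"
subscription_id = "clickstream-to-bq"
bigquery_table_id = "my-project.my_dataset.raw_events"  # project.dataset.table

subscriber = pubsub_v1.SubscriberClient()
topic_path = subscriber.topic_path(project_id, topic_id)
subscription_path = subscriber.subscription_path(project_id, subscription_id)

# write_metadata adds columns such as message_id and publish_time to each row,
# which helps with deduplication later.
bigquery_config = pubsub_v1.types.BigQueryConfig(
    table=bigquery_table_id, write_metadata=True
)

with subscriber:
    subscription = subscriber.create_subscription(
        request={
            "name": subscription_path,
            "topic": topic_path,
            "bigquery_config": bigquery_config,
        }
    )
    print(f"Created BigQuery subscription: {subscription.name}")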

You can also use this pattern if you want to store the data first and perform processing later in BigQuery. This is commonly referred to as ELT (extract, load, transform).

Tip: One thing to keep in mind is that there are no guarantees that messages are written to BigQuery exactly once, so make sure to deduplicate the data when you’re querying it later. 
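
As a hedged example of that deduplication, assuming the subscription was created with write metadata enabled (so each row has message_id and publish_time columns) and an illustrative table name, a query run through the BigQuery Python client could look like this:

from google.cloud import bigquery

client = bigquery.Client()

# Keep a single row per Pub/Sub message_id, preferring the latest publish_time.
query = """
SELECT * EXCEPT(row_num)
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY message_id ORDER BY publish_time DESC) AS row_num
  FROM `my-project.my_dataset.raw_events`
)
WHERE row_num = 1
"""

for row in client.query(query).result():
    print(row)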

Approach 2: Processing messages individually using Cloud Run 

Use Cloud Run if you do need to perform some lightweight processing on the individual messages before storing them. A good example of a lightweight transformation is canonicalizing data formats – where every data source uses its own format and fields, but you want to store the data in one data format.

Cloud Run lets you run your code as a web service directly on top of Google’s infrastructure. You can configure Pub/Sub to send every message as an HTTP request using a push subscription to the Cloud Run service’s HTTPS endpoint. When a request comes in, your code does its processing and calls the BigQuery Storage Write API to insert data into a BigQuery table. You can use any programming language and framework you want on Cloud Run.
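
As an illustration, a minimal Python sketch of such a service is shown below. The table name, field names, and canonical model are hypothetical, and for brevity the sketch uses the BigQuery client's insert_rows_json call; a production service would typically use the more efficient Storage Write API mentioned in the tip at the end of this section.

import base64
import json

from flask import Flask, request
from google.cloud import bigquery

app = Flask(__name__)
bq = bigquery.Client()
TABLE_ID = "my-project.my_dataset.canonical_events"  # hypothetical table

@app.route("/", methods=["POST"])
def handle_push():
    # Pub/Sub push requests wrap the message in a JSON envelope.
    envelope = request.get_json()
    if not envelope or "message" not in envelope:
        return ("Bad Request: not a Pub/Sub push payload", 400)

    message = envelope["message"]
    payload = json.loads(base64.b64decode(message["data"]).decode("utf-8"))

    # Lightweight, per-message processing: map each source format onto one canonical model.
    row = {
        "event_id": message.get("messageId"),
        "source": message.get("attributes", {}).get("source", "unknown"),
        "payload": json.dumps(payload),
    }

    errors = bq.insert_rows_json(TABLE_ID, [row])
    if errors:
        # A non-2xx response makes Pub/Sub retry the delivery (and eventually
        # route the message to a dead-letter topic, if one is configured).
        return (f"BigQuery insert failed: {errors}", 500)
    return ("", 204)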

As of February 2022, push subscriptions are the recommended way to integrate Pub/Sub with Cloud Run. A push subscription automatically retries requests if they fail and you can set a dead-letter topic to receive messages that failed all delivery attempts. Refer to handling message failures to learn more. 

There might be moments when no data is submitted to your pipeline. In this case, Cloud Run automatically scales the number of instances to zero. Conversely, it scales all the way up to 1,000 container instances to handle peak load. If you’re concerned about costs, you can set a maximum number of instances. 

It’s easier to evolve the data schema with Cloud Run. You can use established tools to define and manage data schema migrations like Liquibase. Read more on using Liquibase with BigQuery. 

For added security, set the ingress policy on your Cloud Run microservices to internal so that they can only be reached from Pub/Sub (and other internal services), create a service account for the subscription, and give only that service account access to the Cloud Run service. Read more about setting up push subscriptions in a secure way.

Consider using Cloud Run as the processing component in your pipeline in these cases:

You can process messages individually, without requiring grouping and aggregating messages.

You prefer using a general programming model over using a specialized SDK.

You’re already using Cloud Run to serve web applications and prefer simplicity and consistency in your solution architecture. 

Tip: The Storage Write API is more efficient than the older insertAll method because it uses gRPC streaming rather than REST over HTTP. 

Approach 3: Advanced processing and aggregation of messages using Dataflow

Cloud Dataflow, a fully managed service for executing Apache Beam pipelines, has long been the bedrock of building streaming pipelines on Google Cloud. It is a good choice for pipelines that aggregate groups of data to reduce its volume and for pipelines with multiple processing steps. Cloud Dataflow also has a UI that makes it easier to troubleshoot issues in multi-step pipelines. 

In a data stream, grouping is done using windowing. Windowing functions group unbounded collections by their timestamps. There are multiple windowing strategies available, including tumbling, hopping, and session windows. Refer to the documentation on data streaming to learn more. 
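
As a sketch of what that looks like in code, the following Apache Beam (Python) snippet groups telemetry events into 60-second tumbling windows and counts them per device type; the topic, table, schema, and field names are placeholders rather than a definitive pipeline.

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

# In practice you would also pass runner, project, region, and temp_location options.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/telemetry")
        | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "Window" >> beam.WindowInto(window.FixedWindows(60))  # 60-second tumbling windows
        | "KeyByDevice" >> beam.Map(lambda event: (event["device_type"], 1))
        | "CountPerDevice" >> beam.CombinePerKey(sum)
        | "ToRow" >> beam.Map(lambda kv: {"device_type": kv[0], "event_count": kv[1]})
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.device_counts",
            schema="device_type:STRING,event_count:INTEGER",
        )
    )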

Cloud Dataflow can also be leveraged for AI/ML workloads and is suited for users who want to preprocess, train, and make predictions on a machine learning model using TensorFlow. Here's a list of great tutorials that integrate Dataflow into end-to-end machine learning workflows.

Cloud Dataflow is geared toward massive scale data processing. Spotify notably uses it to compute its yearly personalized Wrapped playlists. Read this insightful blogpost about the 2020 Wrapped pipeline on the Spotify engineering blog. 

Dataflow can autoscale its clusters both vertically and horizontally. Users can even use GPU-powered instances in their clusters; Cloud Dataflow takes care of bringing new workers into the cluster to meet demand and destroys them when they are no longer needed.

Tip: Cap the maximum number of workers in the cluster to reduce cost and set up billing alerts. 

Which approach should you choose?

The three tools have different capabilities and levels of complexity. Dataflow is the most powerful option and the most complex, requiring users to use a specialized SDK (Apache Beam) to build their pipelines. On the other end, a BigQuery subscription doesn’t allow any processing logic and can be configured using the web console. Choosing the tool that best suits your needs will help you get better results faster. 

For massive (Spotify-scale) pipelines, when you need to reduce data using windowing, or when you have a complex multi-step pipeline, choose Dataflow. In all other cases, starting with Cloud Run is best, unless you're looking for a no-code solution to connect Pub/Sub to BigQuery. In that case, choose the BigQuery subscription.

Cost is another factor to consider. Cloud Dataflow does apply automatic scaling, but won’t scale to zero instances when there is no incoming data. For some teams, this is a reason to choose Cloud Run over Dataflow.  

This comparison table summarizes the key differences.

Next steps

Read more about BigQuery subscriptions, Cloud Run, and Dataflow.

Check out this hands-on tutorial on GitHub by Jakob Pörschmann that explores all three types of processing.

I’d like to thank my co-author Graham Polley from Zencore for his contributions to this post – find him on LinkedIn or Twitter. I also want to thank Mete, Sara, Jakob, Valentin, Guillaume, Sean, Kobe, Christopher, Jason, and Wei for their review feedback.

No cash to tip? No problem. How TackPay built its digital tipping platform on Google Cloud

Society is going cashless. While convenient for consumers, that has caused a drastic decrease in income for tipped workers, and this is the problem TackPay addresses. TackPay is a mobile platform that allows users to send, receive, and manage tips in a completely digital way, providing tipped workers with a virtual tip jar that makes it easy for them to receive cashless tips directly.

Digitizing the tipping process not only allows individuals to receive tips without cash, but also streamlines a process that has frequently been unfair, inefficient, and opaque, especially in restaurants and hotels. Through TackPay's algorithm, venues can define the rules of distribution and automate the tip management process, saving them time. And because tips no longer go through a company's books but through TackPay, it simplifies companies' tax accounting, too.

With a simple, fast, web-based experience accessible by QR code, customers can leave a cashless tip with total flexibility and transparency.

Technology in TackPay

Without question, our main competitor is cash. From the very beginning, TackPay has worked to make the tipping experience as easy and as fast as giving a cash tip. For this reason, the underlying technology has to deliver the highest level of performance to ensure customer satisfaction and increase their tipping potential.

For example, we need to be able to balance request load across countries at peak times to avoid congestion. Serving the page within a few milliseconds helps us avoid a high dropout rate and user frustration. Transactions can also take place in remote locations with poor connectivity, so it is crucial for the business to offer a performant, accessible service with offline availability options. These are a few of the reasons TackPay chose Google Cloud.

Functional components

TackPay interfaces include a website, web application and a mobile app. The website is mostly informational, containing sign-up, login and forms for mailing list subscription and partnerships. The web app is the application’s functional interface itself. It has four different user experiences based on the user’s persona: partner, tipper, tipped, and group.

The partner persona has a customized web dashboard.

The tipper sees the tipping page, the application’s core functionality. It is designed to provide a light-weight and low-latency transaction to encourage the tipper to tip more efficiently and frequently.

The tipped, i.e. the receiver, can use the application to onboard into the system, manage their tip fund, and track their transactions via a dashboard.

The group persona allows the user to combine tips for multiple tip receivers across several services as an entity. 

The mobile interface offers an experience similar to the web for the tipped and group personas. For the tipped persona, a dashboard covers feedback, wallet, transactions, network, profile, settings, bank details, withdrawal, and document features. For the group persona, the dashboard additionally covers venue details.

Technical architecture to enable cashless tipping

Below is the technical architecture diagram at a high level:

Ingestion
Data comes in from the web application, mobile app, third-party finance application APIs and Google Analytics. The web application and mobile app perform the core business functionality. The website and Google Analytics serve as the entry point for business analytics and marketing data. 

Application
The web application and mobile app provide the platform’s core functionality and share the same database — Cloud Firestore.

The tipper persona typically is not required to install the mobile app; they interact with the web application, which they reach by scanning a QR code, to tip for the service. The mobile app is mainly for the tipped and group personas. 

Some important functional triggers between the database and the application are implemented with Google Cloud Functions (2nd gen). The application also uses Firebase Authentication, Cloud IAM, and Cloud Logging.
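
As an illustration only (not TackPay's actual code), a 2nd gen Cloud Function that reacts to writes in a hypothetical Firestore "tips" collection could be sketched in Python with the Functions Framework:

import functions_framework

# Deployed with a Firestore event trigger (2nd gen), for example on writes
# matching the hypothetical document path "tips/{tipId}".
@functions_framework.cloud_event
def on_tip_written(cloud_event):
    # The CloudEvent subject identifies the document that changed,
    # e.g. "documents/tips/<tipId>".
    print(f"Firestore change detected: {cloud_event['subject']}")
    # Application-specific logic, such as updating tip-distribution aggregates
    # or refreshing the venue dashboard, would go here.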

Database and storage
Firestore collections are used to hold functional data. The collections include payment data for businesses, teams, the tipped, and tippers, as well as data for users, partners, feedback, social features, and so on. BigQuery stores and processes all Google Analytics and website data, while Cloud Storage for Firebase stores and serves user data generated from the app. 

Analytics and ML
We use BigQuery data for analytics and Vertex AI AutoML for machine learning. At this stage, we're using Data Studio for on-demand, self-serve reporting, analysis, and data mashups across the data sets. The goal is to eventually integrate it with Google Cloud's Looker in order to bring in the semantic layer and standardize on a single data access layer for all analytics in TackPay. 

Building towards a future of digital tipping

The TackPay product has been online for a few months and is actively processing tips in many countries, including Italy, Hungary, the UK, Spain, and Canada. The solution has recently been adopted by leading companies in the European hospitality industry, becoming a reliable partner for them. There is an ambitious plan to expand into the Middle East market in the coming months. 

To enable this expansion, we'll need to validate product engagement in specific target countries and scale up by growing the team and the product. Our technical collaboration with Google Cloud will help make that scaling process effortless. If you are interested in tech considerations for startups, fundamentals of database design with Google Cloud, and other developer/startup topics, check out my blog.

If you want to learn more about how Google Cloud can help your startup, visit our page here to get more information about our program, and sign up for our communications to get a look at our community activities, digital events, special offers, and more.

The Denodo Platform meets BigQuery

It's only natural that the Denodo Platform would achieve the Google Cloud Ready – BigQuery designation earlier this month; after all, the Denodo Platform and Google BigQuery have much in common.

The Denodo Platform, powered by data virtualization, enables real-time access across disparate on-premises and cloud data sources, without replication, and BigQuery, the cloud-based enterprise data warehouse (EDW) on Google Cloud, enables blazing-fast query response across petabytes of data, even when some of that data is stored outside of BigQuery in on-premises systems.

For users of the Denodo Platform on Google Cloud, BigQuery certification offers confidence that the Denodo Platform’s data integration and data management capabilities work seamlessly with BigQuery, as Google only confers this designation on technology that meets stringent functional and interoperability requirements.

In addition to storage “elbow room,” BigQuery brings new analytical capabilities to Denodo Platform users on Google Cloud, including out-of-the-box machine learning (ML) capabilities like Apache Zeppelin for Denodo, as well as geospatial, business intelligence (BI), and other types of data analysis tools.

But it gets better.

The Denodo Platform on Google Cloud + BigQuery

Combining the full power of the Denodo Platform with BigQuery enables easy access to a wider breadth of data, all with a single tool. The Denodo Platform’s ability to deliver data in real time over BigQuery cloud-native APIs enables frictionless data movement between on-premises, cloud, and Google Cloud Storage data sources.

Enhanced BigQuery support combines Google's native connectivity with the Denodo Platform's query pushdown optimization features to process massive big-data workloads with better performance and efficiency. For further performance gains, BigQuery can be leveraged as a high-performance caching database for the Denodo Platform in the cloud. This supports advanced optimization techniques like multi-pass executions based on intermediate temporary tables.

Users also benefit from the same flexible pricing available on Google Cloud, letting them start small with BigQuery, and scale as needed.

Use Cases Abound

Combining the Denodo Platform with BigQuery enables a wide variety of use cases, such as:

Machine Learning/Artificial Intelligence (ML/AI) and Data Science in the Cloud

Users can leverage the Denodo Platform’s data catalog to search the available datasets and tag the right ones for analytics and ML projects. This also helps data scientists to combine data stored in BigQuery and data virtualization layers to build models in a quick and easy manner, putting cloud elasticity to work. Using the metadata and data lineage capabilities of the Denodo Platform, users can access all of the data in a governed fashion.

Zero-Downtime Migrations and Modernizations

The Denodo Platform acts as a common access point between two or more data sources, providing access to multiple sources, simultaneously, even when the sources are moved, while hiding the complexities of access from the data consumers. This enables seamless, zero-downtime migrations from on-premises systems or other cloud data warehouses (such as Oracle or SQL Server) to BigQuery. Similarly, the Denodo Platform makes it possible for stakeholders to modernize their systems, in this case their BigQuery instance, with zero impact on users.

Data Lake Creation 

Users can easily create virtual data lakes, which combine data across sources, regardless of type or location, while also enabling the definition of a common semantic model across all of the disparate sources.

Data-as-a-Service (DaaS) 

The Denodo Platform also facilitates easy delivery of BigQuery and Google Cloud Storage data (structured and semi-structured) to users as an API endpoint. With this support, the platform lets companies expose data in a controlled, curated manner, delivering only the data that is suitable for specific business partners and other external companies, and easily monetizing relevant datasets when needed.

The Dream of a Hybrid Data Warehouse, Realized

Let’s look at one way that the Denodo Platform and BigQuery can work together on Google Cloud. In the architecture illustrated below, the two technologies enable a hybrid (on-premises/cloud) data warehouse configuration.

I’d like to point out a few things in this diagram (see the numbered circles). You can:

Move your relational data for interactive querying and offline analytics to BigQuery.

Move your relational data from large scale databases and applications to Google Spanner, when you need high I/O and global consistency.

Move your relational data from Web frameworks and existing applications to Google Cloud SQL.

Combine all of these sources with the relational data sitting on-premises in a traditional data warehouse, creating a single centralized data hub.

Run real-time queries on virtual data from other applications.

Build operational reports and analytical dashboards on top of the Denodo Platform to gain insights from the data, and use Looker or other BI tools to serve thousands of end users.

Getting Started

BigQuery certification provides Denodo Platform users on Google Cloud with yet another reason to appreciate Google Cloud. Visit the Denodo Platform for Google Cloud page for more information.

If you are new to the Denodo Platform on Google Cloud, there is no better way to discover its power than to try it out for yourself. Denodo offers not only a way to do that, for free for 30 days, but also built-in guidance and support.
