Blog

Inside the eDreams ODIGEO Data Mesh — a platform engineering view

Inside the eDreams ODIGEO Data Mesh — a platform engineering view

Editor’s note: Headquartered in Barcelona, eDreams ODIGEO (or eDO for short) is one of the biggest online travel companies in the world, serving more than 20 million customers in 44 countries. Read on to learn about how they modernized their legacy data warehouse environment to a data mesh built on BigQuery.

Data analytics has always been an essential part of eDO’s culture. To cope with exponential business growth and the shift to a customer-centered approach, eDO evolved its data architecture significantly over the years, transitioning from a centralized data warehouse to a modern data mesh architecture powered by BigQuery. In this blog, a companion to a technical customer case study, we will cover key insights from eDO’s data mesh journey.

Evolution of data platform architecture at eDO

Initially, eDO had a centralized data team that managed a data warehouse. Data was readily available and trusted. However, as the company grew, adding new data and adapting to increasing complexity became a bottleneck. Some analytics teams built their own pipelines and silos, which improved data availability and team autonomy but introduced challenges with data trust, quality, and ownership.

As teams and complexity grew, data silos became intolerable. AI and data science needs only made the need for a unified data model more pressing. In 2019, building a centralized model with a data lake would have been the natural next step, but it would have been at odds with the distributed e-commerce platform, which is organized around business domains and autonomous teams — not to mention the pressure of data changes imposed by its very high development velocity.

Instead, eDO embraced the data mesh paradigm, deploying a business-agnostic data team and a distributed data architecture aligned with the platform and teams. The data team no longer “owns” data, but instead focuses on building self-service components for producers and consumers.

The eDO data mesh and data products architecture

In the end, eDO implemented a business-agnostic data platform that operates in self-service mode. This approach isolates development teams from data infrastructure concerns, as is the case with a traditional centralized data team. It also isolates the data platform team from any business knowledge, unlike a traditional centralized data team. This prevents the data team from becoming overloaded by the ever-increasing functional complexity of the system.

The biggest challenges for adopting data mesh are cultural, organizational, and ownership-related. However, engineering challenges are also significant. Data mesh systems must be able to handle large volumes of data, link and join datasets from different business domains efficiently, manage multiple versions of datasets, and ensure that documentation is available and trustworthy. Additionally, these systems must minimize maintenance costs and scale independently of the number of teams and domains using them.

To build its data mesh, eDO chose a number of Google Cloud products, with BigQuery at the heart of the architecture. These technologies were chosen for their serverless nature, support for standard protocols, and ease of integration with the existing e-commerce platform.

One key enabler of this architecture is the deployment of data contracts. These contracts are:

Owned exclusively by a single microservice and its teamImplemented declaratively using Avro and YAML filesDocumented in the Avro schemaManaged in the source code repository of the owning microservice

Declarative contracts enable the automation of data ingestion and quality checks without requiring knowledge of business domains. Exclusive ownership of contracts by microservices and human-readable documentation in a git repository help producers take ownership of their contracts.

Finally, when it comes to data architecture, the platform has four types of data products: domain events, snapshots, derivatives and quality records that fulfill different access patterns and needs.

Lessons learned and the future of the platform

eDO learned a lot during the four years since the project started and have many improvements and new features planned for the near future.

Lessons learned:

Hand-pick consumer stakeholders who are willing and fit to be early adopters.Make onboarding optional.Foster feedback loops with data producers.Data quality is variable and should be agreed upon between producers and consumers, rather than by a central team.Plan for iterative creation of data products, driven by real consumption.Data backlogs are part of teams’ product backlogs.Include modeling of data contracts in your federated governance, as it is the hardest and riskiest part.Consider Data Mesh adoption a difficult, challenging engineering project, in domain and platform teams.Documentation should be embedded into data contracts, and tailored to readers who are knowledgeable about the domain of the contract.

Work on improving the data mesh continues! Some of the areas include:

Improved data-quality SLAs – working on ways to measure data accuracy and consistency, which will enable better SLAs between data consumers and producers.Self-motivated decommissions – experimenting with a new “Data Maturity Assessment” framework to encourage analytical teams to migrate their data silos to the data mesh.Third-party-aligned data contracts – creating different pipelines or types of data contracts for data that comes from third-party providers.Faster data timeliness – working on reducing the latency of getting data from the streaming platform to BigQuery.Multi-layered documentation – deploying additional levels of documentation to document not only data contracts, but also concepts and processes of business domains.

Conclusion

Data mesh is a promising fresh approach to data management that can help organizations overcome the challenges of traditional centralized data architectures. By following the lessons learned from eDO’s experience, we hope organizations can successfully implement their own Data Mesh initiatives.

Want to learn more about deploying Data Mesh in Google Cloud? See this technical white paper today.

Source : Data Analytics Read More

Accelerate data-driven growth with Google Cloud and Fivetran

Accelerate data-driven growth with Google Cloud and Fivetran

Success is often measured by one’s ability to harness and interpret data effectively. The journey begins with centralizing and analyzing marketing data in real time, a game-changing approach that can drive impactful decisions and elevate campaign strategies.

In this blog post, we’ll explore how Google Cloud and Fivetran, a SaaS data integration platform, can help you centralize and analyze your marketing data in real time. Fivetran can take care of multiple steps in the ELT and automated data ingestion journey. We’ll share real-world examples of how organizations are using real-time analytics to improve their results, and we’ll discuss the benefits of this approach for businesses of all sizes.

Centralizing data is the first step to informed decisions

A large healthcare organization recognized the transformative power of data and analytics, and embarked on a mission to harness their expertise and insights to fuel exponential growth and success. Their challenge was despite their successful marketing strategies that increased demand generation, concerns arose due to rising budgetary expenses and reductions in campaign spending. The marketing team encountered difficulties in accurately measuring campaign performance and efficiency, identifying the specific initiatives responsible for marketing outcomes, and procuring timely insights from their agency for informed decision-making.

The marketing team recognized that the challenge was with the data scattered across various applications, databases, and file systems. This fragmentation and lack of centralization of marketing data hindered their ability to analyze and optimize campaign performance, track and optimize spending across different channels, and accurately measure healthcare enrollment attribution and conversion rates. Consequently, optimizing advertising expenditures and acquiring holistic performance metrics remained unattainable.

The lack of a centralized data repository became acutely apparent when they were asked a question: ‘What’s working well?’ Their response, ‘nothing,’ underscored the critical need for a unified data strategy. This realization marked a turning point, igniting a journey towards data-driven marketing excellence.”

The team’s aspiration was clear to create a data-driven decision-making process that would seamlessly integrate with their datasets and deliver measurable benefits. Key considerations to maximize data value included data usage patterns, data sources from both external and internal sources, primary workloads, data platforms, and enrichment and transformation requirements.

To resolve their data quandary, this forward-thinking company turned to Google Cloud and Fivetran, whose partnership provided a swift and effective means to eliminate data silos and centralize their marketing data. This harmonious integration unlocked a wealth of possibilities, depicted in the diagram below:

Fivetran’s automated data movement platform supports a wide range of marketing analytics data sources with out-of-the-box, no-code, self-serve connectors. These connectors include social media and ad platforms, email, marketing analytics platforms, CRM solutions, and various databases and object stores.

The integration with Google BigQuery and other Google Ads services seamlessly provided end-to-end automation and reliability, complete with customizable sync schedules, normalized schemas for organized and understandable data, and even pre-built data models tailored for marketing data sources. This approach empowers the marketing team to swiftly act on crucial metrics like click-through rates, ad spend efficiency, and more while enhancing performance and efficiency.

Centralizing their data enhances efficiency, allowing the team to prioritize strategic planning, campaign optimization, and improving customer engagement. By making data accessible and organized, the CMO and the marketing team can focus on delivering even better service and engagement to their audience.

If you’d like to dig deeper into how Fivetran and Google Cloud are working together to enhance marketing analytics, check out our “Unlocking the Power of Marketing Analytics with Fivetran and Google Cloud” on-demand hands-on lab.

Real-time (or continuous) analytics with Fivetran and BigQuery

When faced with the challenge of “real-time analytics,” the initial response often triggers a fundamental question: What exactly defines “real-time”? Does it mean millisecond-level latency, subsecond response times, seconds, minutes, hours, or even daily updates? The answers can vary widely based on the specific use case and the perspective of those asked.

Traditionally, analytics platforms don’t dive into millisecond-level operational surveillance of individual events or sensor data — that’s the realm of highly specialized hardware and software designed for specific physical domains where data is generated, typically found in operational technology environments.

When discussing common analytics workloads like cloud data warehousing, data lakes, data applications, data engineering, AI/ML, and more, the term “real-time” can be somewhat misleading. Instead, a more precise description for what Google Cloud and Fivetran offer in analytics workloads is “continuous” data processing, aligning more closely with the reality of data flow and insights.

By focusing on specific use cases and desired outcomes, the importance of continuous data movement at intervals ranging from one minute to 24 hours becomes evident. It’s crucial to align this requirement with business justification, weighing the cost and effort against the value and necessity.

Frequently, scenarios arise where continuous replication and change data capture are essential, especially when dealing with significant incremental changes to data sources, often measuring in tens or hundreds of gigabytes per hour.

For example, let’s consider the case of a typical SAP ERP customer, such as a global logistics company. They operate a substantial SAP S/4HANA environment, generating a significant volume of transactions, resulting in high change data volumes exceeding 100 gigabytes per hour due to updates, inserts, and deletes. Fivetran continuously replicates this S/4HANA data to Google Cloud BigQuery, integrating it with other datasets Fivetran ingests, including operational databases and their enterprise customer management system. The aim is to provide a comprehensive global view of supply chain effectiveness and operational efficiency, supporting continuous decision-making throughout the day.

The term “real-time” can vary among organizations and use cases. However, in the realm of analytics, continuous data movement proves to be a reliable method for attaining desired results. Leveraging Fivetran to consistently replicate SAP data into Google Cloud BigQuery enables organizations to gain a comprehensive perspective of their operations and make faster, more insightful decisions.

Next steps

Our customers are transforming their businesses by integrating their applications with Google Cloud and Fivetran offerings, such as Oakbrook Financial. Our goal is to make this journey easier for our customers and help them to grow their data analytics business on Google cloud. With Google Cloud’s holistic cloud migration and modernization program, Google Cloud RaMP, you can easily gain insight into your current landscape and estimate the total cost of migration. You can learn more about how Google Cloud can help accelerate your data modernization journey, visit our page here. To know more about using Fivetran to unlock business intelligence in BigQuery. review the documentation available here or visit the Fivetran resources page.

Source : Data Analytics Read More

Kingfisher: building the future of home improvement with a powerful new data platform

Kingfisher: building the future of home improvement with a powerful new data platform

In retail, the distinction between physical and digital has blurred, and the home improvement sector is no exception. At Kingfisher, our retail banners include B&Q, Castorama, Brico Dépôt, and Screwfix, and our customers expect a seamless shopping experience whether they shop with us online or in store.

From product recommendations to delivery suggestions to location-based services, the technology required to meet these changing expectations is all powered by data. And to effectively deliver for our customers, that data needs to be highly available, highly accurate, and highly optimized. It needs to be gold standard.

In November 2022, we signed a five-year strategic partnership with Google Cloud to accelerate the migration of our huge on-premises SAP workloads to the cloud. By moving away from a tightly coupled, monolithic architecture, our goal was to improve our resilience and strengthen our data platform to unleash the full potential of our data, giving us the resilience and agility we need to power our growth.

The right platform — and people — to help us do it ourselves

Data speed and quality are critical to helping us achieve our goals, with data platforms, recommendation engines, sophisticated machine learning, and the ability to quickly ingest and curate content, all key to our future plans. Partnering with Google Cloud helps us to unlock this potential, with Google Cloud Marketplace also offering a critical capability to help us simplify procurement.

One of the most important factors in our decision to partner with Google Cloud was the expertise. Google Cloud not only brings a specific set of functionality and services that are well suited to our needs, but a team of people behind those solutions who understand our industry. They know our needs and goals, and can help us work through our challenges. And as we continue to migrate our workloads and build our architecture, the Google Cloud team is fully engaged, working closely with us in a meeting of minds.

Unleashing data’s potential to meet customer expectations

This partnership has already started to deliver results. By abstracting the data away from our core platform, we are now able to use it more productively for our business needs, and we’re currently working on some use cases around supply chain visibility. One such project enables us to dynamically price products according to various live promotions, so that we can react to the market with more agility.

This is only possible because we’ve been able to separate the data from our SAP system and run it through Google Cloud solutions such as BigQuery and Cloud Storage, enabling us to provide a more personalized shopping experience. Plans are now underway to use our product and sales data to create a new, real-time order system to greatly improve stock accuracy, as well as scaling up B&Q’s online marketplace from 700,000 products to 4 million (Kingfisher’s H1 2023/23 Financial Results).

Facing the future with a solid foundation

The future of digital retail is highly compute-intensive, requiring a level of real-time data processing near equivalent to that needed to power a major stock market. Working with Google Cloud gives us the processing power to use our data to its full potential, freeing us to focus on what we do best: offering the best home improvement proposition to our customers.

The growing suite of AI solutions in Google Cloud adds another dimension to our ability to leverage data, which will be increasingly important to enhancing the customer experience. Vertex AI, for example, is already allowing us to optimize our product pages, so that customers can visualize products in different scenarios and make a more informed purchasing decision. Throughout the business, AI is improving and speeding up our internal processes, which ultimately allows us to better serve our customers.

It’s no secret the retail industry is evolving at pace. With Google Cloud, we now have the agility, the support, and the data platform to create more seamless customer experiences both in stores and online.

Source : Data Analytics Read More

How data can drive stronger safeguarding initiatives in the energy and utilities sector

How data can drive stronger safeguarding initiatives in the energy and utilities sector

In the UK there is a requirement for certain utility businesses to maintain a Priority Services Register (PSR) — a free support service to provide extra help to customers who need it or have particular personal circumstances which mean they may be harmed in case of an interruption to their utilities service.

We started looking into PSRs and other safeguarding initiatives. Information is often siloed in different organisations, which makes it difficult to access the data needed to provide critical support. Recent research indicates fewer than 20% of UK adults are aware of the PSR, which can mean records are incomplete, limiting both the delivery of the service and how well it is delivered.

A potential approach to serving the more vulnerable while addressing security and data privacy challenges uses a publish subscribe model (illustrated in the diagram below). People who own information (publishers) put it into the exchange for other people (subscribers) to access. In many cases organizations will be both. Organizations contribute their part of the jigsaw puzzle to help build a more comprehensive picture for all. Participants retain control of the information they publish.

Data is published on an exchange, which authorized subscribers can access, with privacy and security controls built in. Subscribers can then overlay their own data and perform analytics.

The private exchange model makes it possible to restrict access to the system, which in turn allows participants to share controlled elements of a PSR. We are doing this in trials underway with a few organizations, including the Citizens Advice Greater Manchester.

Using an exchange hub, different organizations can publish their PSR information. Data that was previously only accessible by each then becomes available to everyone on the exchange to use for the benefit of all. This can give subscribers a bigger and more comprehensive view, with controls being constantly applied to help ensure that Personal Identifiable Information is being managed appropriately.

Putting this into practice means, for example, you can display your PSR on a map. You can then gather and add additional information from electricity, gas, water utilities, emergency services, local authorities, and charities. The image below provides an example of PSR data being combined with Office for National Statistics data on average income to help better identify areas of potential fuel poverty.

It is then possible to go a step further by applying predictive analytics techniques. For example, what if inflation were to increase to 10%? The system would show you what that could look like and places where fuel poverty is likely to increase. There are many data sets available which can deliver incredible value when applied in this way. Another example might be to identify those in an area living in poorly insulated homes or on floodplains.

The technology offers a wealth of unexplored opportunities to make life better for people who need extra support.

Contact us to discover how you can embark on a journey towards a safer, and more inclusive future.

Source : Data Analytics Read More

Data-driven decisions with YugabyteDB and BigQuery

Data-driven decisions with YugabyteDB and BigQuery

YugabyteDB, a distributed SQL database, when combined with BigQuery, tackles data fragmentation, data integration, and scalability issues businesses face.

It ensures that data is well-organized and not scattered across different places, making it easier to use in BigQuery for analytics. YugabyteDB’s ability to grow with the amount of data helps companies handle more information without slowing down. It also maintains consistency in data access, crucial for reliable results in BigQuery’s complex queries. The streamlined integration enhances overall efficiency by managing and analyzing data from a single source, reducing complexity. This combination empowers organizations to make better decisions based on up-to-date and accurate information, contributing to their success in the data-driven landscape.

BigQuery is renowned for its ability to store, process, and analyze vast datasets using SQL-like queries. As a fully managed service with seamless integration into other Google Cloud Platform services, BigQuery lets users scale their data operations to petabytes. It also provides advanced analytics, real-time dashboards, and machine learning capabilities. By incorporating YugabyteDB into the mix, users can further enhance their analytical capabilities and streamline their data pipeline.

YugabyteDB is designed to deliver scale, availability and enterprise-grade RDBMS capabilities for mission-critical transactional applications. It is a powerful cloud-native database that thrives in any cloud environment. With its support for multi-cloud and hybrid cloud deployment, YugabyteDB seamlessly integrates with BigQuery using the YugabyteDB CDC Connector and Apache Kafka. This integration enables the real-time consumption of data for analysis without the need for additional processing or batch programs.

Benefits of BigQuery integration with YugabyteDB

Integrating YugabyteDB with BigQuery offers numerous benefits. Here are the top six.

Real-time data integration: YugabyteDB’s Change Data Capture (CDC) feature synchronizes data changes in real-time between YugabyteDB and BigQuery tables. This seamless flow of data enables real-time analytics, ensuring that users have access to the most up-to-date information for timely insights.Data accuracy: YugabyteDB’s CDC connector ensures accurate and up-to-date data in BigQuery. By capturing and replicating data changes in real-time, the integration guarantees that decision-makers have reliable information at their disposal, enabling confident and informed choices.Scalability: Both YugabyteDB and BigQuery are horizontally scalable solutions capable of handling the growing demands of businesses. As data volumes increase, these platforms can seamlessly accommodate bigger workloads, ensuring efficient data processing and analysis.Predictive analytics: By combining YugabyteDB’s transactional data with the analytical capabilities of BigQuery, businesses can unlock the potential of predictive analytics. Applications can forecast trends, predict future performance, and proactively address issues before they occur, gaining a competitive edge in the market.Multi-cloud and hybrid cloud deployment: YugabyteDB’s support for multi-cloud and hybrid cloud deployments adds flexibility to the data ecosystem. This allows businesses to retrieve data from various environments and combine it with BigQuery, creating a unified and comprehensive view of their data.

By harnessing the benefits of YugabyteDB and BigQuery integration, businesses can supercharge their analytical capabilities, streamline their data pipelines, and gain actionable insights from their large datasets. Whether you’re looking to make data-driven decisions, perform real-time analytics, or leverage predictive analytics, combining YugabyteDB and BigQuery is a winning combination for your data operations.

Key use cases for YugabyteDB and BigQuery integration

YugabyteDB’s Change Data Capture (CDC) with BigQuery serves multiple essential use cases. Let’s focus on two key examples.

1.Industrial IoT (IIoT): In IIoT, streaming analytics is the continuous analysis of data records as they are generated. Unlike traditional batch processing, which involves collecting and analyzing data at regular intervals, streaming analytics enables real-time analysis of data from sources like sensors/actuators, IoT devices and IoT gateways. This data can then be written into YugabyteDB with high throughput and then streamed continuously to BigQuery Tables for advanced analytics using Google Vertex AI or other AI programs.

Examples of IIoT Stream Analytics

BigQuery can process and analyze data from industrial IoT devices, enabling efficient operations and maintenance. Two real-world examples include:

Supply chain optimization: Analyze data from IoT-enabled tracking devices to monitor inventory, track shipments, and optimize logistics operations.Energy efficiency: Analyze data from IoT sensors and meters to identify energy consumption patterns to optimize usage and reduce costs.

2. Equipment predictive maintenance analytics: In various industries such as manufacturing, telecom, and instrumentation, equipment predictive maintenance analytics is a common and important use case. YugabyteDB plays a crucial role in collecting and storing equipment notifications and performance data in real time. This enables the seamless generation of operational reports for on-site operators, providing them with current work orders and notifications.

Maintenance analytics is important for determining equipment lifespan and identifying maintenance requirements. YugabyteDB CDC facilitates the integration of analytics data residing in BigQuery. By pushing the stream of notifications and work orders to BigQuery tables, historical data accumulates, enabling the development of machine learning or AI models. These models can be fed back to YugabyteDB for tracking purposes, including failure probabilities, risk ratings, equipment health scores, and forecasted inventory levels for parts. This integration not only enables advanced analytics but also helps the OLTP database (in this case, YugabyteDB) store the right data for the site engineers or maintenance personnel in the field.

So now let’s walk through how easy it is to integrate YugabyteDB with BigQuery using CDC.

Integration architecture of YugabyteDB to BigQuery

The diagram below (Figure 1) shows the end-to-end integration architecture of YugabyteDB to BigQuery.

Figure 1 – End to End Architecture

The table below shows the data flow sequences with their operations and tasks performed.

Note: Link for creating the stream ID mentioned in #1

1. Set up BigQuery Sink

Install YugabyteDB

You can deploy YugabyteDB available on Google Cloud Marketplace using this link. Alternatively, there are other options to install or deploy YugabyteDB. If you are running in Windows, you can leverage Docker on Windows with YugabyteDB.

Install and set up YugabyteDB CDC and Debezium connector

Ensure YugabyteDB CDC is configured to capture changes in the database and is running (as per the above architecture diagram) along with its dependent components. You should see a Kafka topic name and group name as per this document; it will appear in the streaming logs either through CLI or via Kafka UI (e.g. if you used KOwl).

Download and configure BigQuery Connector

Download the BigQuery Connector . Unzip and add the JAR files to YugabyteDB’s CDC (Debezium Connector) Libs folder (e.g. /kafka/libs). Then restart the Docker container

Set up BigQuery in Google Cloud

Setting up BigQuery in Google Cloud benefits both developers and DBAs by providing a scalable, serverless, and integrated analytics platform. It simplifies data analytics processes, enhances collaboration, ensures security and compliance, and supports real-time processing, ultimately contributing to more effective and efficient data-driven decision-making.

Follow these five steps to set up BigQuery in Google Cloud.

Create a Google Cloud Platform account. If you don’t already have one, create a Google Cloud Platform account by visiting the Google Cloud Console and following the prompts.Create a new project (if you don’t already have one). Once you’re logged into the Google Cloud Console, create a new project by clicking the “Select a Project” dropdown menu at the top of the page and clicking on “New Project”. Follow the prompts to set up your new project.Enable billing (if you haven’t done it already). NOTE: Before you can use BigQuery, you need to enable billing for your Google Cloud account. To do this, navigate to the “Billing” section of the Google Cloud Console and follow the prompts.Enable the BigQuery API: To use BigQuery, you need to enable the BigQuery API for your project. To do this, navigate to the “APIs & Services” section of the Google Cloud Console and click on “Enable APIs and Services”. Search for “BigQuery” and click the “Enable” button..Create a Service Account and assign BigQuery roles in IAM (Identity and Access Management): As shown in Figure 4, create a service account and assign the IAM role for Big Query. The following roles are mandatory to create a BigQuery table:BigQuery Data EditorBigQuery Data Owner

Figure 2 – Google Cloud – Service Account for BigQuery

After creating a service account, you will see the details (as shown in Figure 3). Create a private and public key for the service account and download it in your local machine. It needs to be copied to YugabyteDB’s CDC Debezium Docker container in a designated folder (e.g., “/kafka/balabigquerytesting-cc6cbd51077d.json”). This is what you refer to while deploying the connector.

Figure 3 – Google Cloud – Generate Key for the Service Account

2. Deploy the configuration for the BigQuery connector

Source Connector

Create and deploy the source connector (as shown below), change the database hostname, database master addresses, database user, password, database name, logical server name and table include list and StreamID according to your CDC configuration (refer code block highlighted in yellow).

code_block<ListValue: [StructValue([(‘code’, ‘curl -i -X POST -H “Accept:application/json” -H “Content-Type:application/json” localhost:8083/connectors/ -d ‘{rn “name”: “srcdb”,rn “config”: {rn “connector.class”: “io.debezium.connector.yugabytedb.YugabyteDBConnector”,rn “database.hostname”:”10.9.205.161″,rn “database.port”:”5433″,rn “database.master.addresses”: “10.9.205.161:7100”,rn “database.user”: “yugabyte”,rn “database.password”: “xxxx”,rn “database.dbname” : “yugabyte”,rn “database.server.name”: “dbeserver5”,rn “table.include.list”:”public.testcdc”,rn “database.streamid”:”d36ef18084ed4ad3989dfbb193dd2546″,rn “snapshot.mode”:”initial”,rn “transforms”: “unwrap”, rn “transforms.unwrap.type”: “io.debezium.connector.yugabytedb.transforms.YBExtractNewRecordState”, rn “transforms.unwrap.drop.tombstones”: “false”,rn “time.precision.mode”: “connect”,rn “key.converter”:”io.confluent.connect.avro.AvroConverter”,rn “key.converter.schema.registry.url”:”http://localhost:18081″,rn “key.converter.enhanced.avro.schema.support”:”true”,rn “value.converter”:”io.confluent.connect.avro.AvroConverter”,rn “value.converter.schema.registry.url”:”http://localhost:18081″,rn “value.converter.enhanced.avro.schema.support”:”true”rn rn }rn }”), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e5809686e80>)])]>

Target Connector (BigQuery Connector):

The configuration below shows a sample BigQuerySink connector. The topic name, Google project, dataset name, default dataset name — highlighted in yellow — need to be replaced according to your specific configuration.

The key file fields contain the private key location of your Google project, and it needs to be kept in the YugabyteDB’s CDC Debezium Docker connector folder. (e.g. /kafka)

code_block<ListValue: [StructValue([(‘code’, ‘curl -i -X POST -H “Accept:application/json” -H “Content-Type:application/json” localhost:8083/connectors/ -d ‘{rn “name”: “bigquerysinktests”,rn “config”: {rn “connector.class”: “com.wepay.kafka.connect.bigquery.BigQuerySinkConnector”,rn “tasks.max” : “1”,rn “topics” : “dbserver11.public.testcdc”,rn “sanitizeTopics” : “true”,rn “autoCreateTables” : “true”,rn “allowNewBigQueryFields” : “true”,rn “allowBigQueryRequiredFieldRelaxation” : “true”,rn “schemaRetriever” : “com.wepay.kafka.connect.bigquery.retrieve.IdentitySchemaRetriever”,rn “project”:”balabigquerytesting”,rn “datasets”:”.*=testcdc”,rn “defaultDataset” : “testcdc”,rn “keyfile” : “/kafka/balabigquerytesting-cc6cbd51077d.json”rn }rn}”), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e58086639d0>)])]>

3. Output in BigQuery Console

After deployment, the table name (e.g. dbserver11_public_testcdc) will be created in BigQuery automatically (see below).

Figure 4 – Data stored in the BigQuery Table

Figure 5 – BigQuery – Execution Details for a sample BigQuery table is shown below.

Figure 6 – Query Execution Details

Conclusion and summary

Following the steps above is all it takes to integrate YugabyteDB’s CDC connector with BigQuery.

Combining YugabyteDB OLTP data with BigQuery data can benefit an application in a number of ways (e.g. real-time analytics, advanced analytics with machine learning, historical analysis and data warehousing and reporting). YugabyteDB holds the transactional data that is generated by an application in real-time, while BigQuery data is typically large, historical data sets that are most often used for analysis. By combining these two types of data, an application can leverage the strengths of both to provide real-time insights and support better, quicker, more accurate and informed decision-making.

Next steps:

Discover more about YugabyteDBDive into distributed SQL with Distributed SQL for DummiesReady to give it a try? Procure YugabyteDB through Google Cloud Marketplace, and build your first geo-distributed application.

Source : Data Analytics Read More

Built with BigQuery: LiveRamp’s open approach to optimizing customer experiences

Built with BigQuery: LiveRamp’s open approach to optimizing customer experiences

Editor’s note: The post is part of a series showcasing our partners, and their solutions, that are Built with BigQuery.

Data collaboration, the act of gathering and connecting data from various sources to unlock combined data insights, is the key to reaching, understanding, and expanding your audience. And enabling global businesses to accurately connect, unify, control, and activate data across different channels and devices will ultimately optimize customer experiences and drive better results.

LiveRamp is a leader in data collaboration, with clients that span every major industry. One of LiveRamp’s enterprise platforms, Safe Haven, helps enterprises do more with their data, and is especially valuable for brands constructing clean room environments for their partners and retail media networks. It enables four universal use cases to facilitate better customer experiences:

The core challenge: Creating accurate cross-channel marketing analytics

As brand marketers accelerate their move to the cloud, they struggle to execute media campaigns guided by data and insights. This is due to challenges in building effective cross-channel marketing analytics, which need to overcome the following hurdles:

Lack of a common key to accurately consolidate and connect data elements and behavioral reports from different data sources that should be tied to the same consumer identity. Such a “join key” can not be a constructed internal record ID, as it must be semantically rich enough to work for both individual and household-level data across all the brand’s own prospects and customers, and across all of the brand’s partner data (e.g., data from publishers, data providers, co-marketing partners, supply-chain providers, agency teams).Reduced data availability from rising consumer authentication requirements makes it difficult to reach a sufficient sample volume for accurately driving recommendation and personalization engines, creating lookalikes, or for creating unbiased data inputs for incorporating into machine learning training.Brand restrictions on analytic operations to guard against data leaks of sensitive consumer personally-identifiable information (PII). By reducing operational access to consumer data, brands increase their data protections but decrease their data science team’s ability to discover key insights and perform cross-partner audience measurements.Decreased partner collaboration from using weak individual record identifiers, such as hashed emails, as the basis for record matching. Hashed emails are often produced by default by many customer data platforms (CDPs), but these identifiers are insecure due to their easy-reversibility and have limited capacity to connect the same individual and household across partners.

LiveRamp solves these challenges for marketers by building a suite of dedicated data connectivity tools and services centered on Google’s BigQuery ecosystem — with the objective of creating the ultimate open data-science environment for marketing analytics on Google Cloud.

The resulting LiveRamp Safe Haven environment is deployed, configured and customized for each client’s Google Cloud instance. The solution is scalable, secure and future-proof by being deployed alongside Google’s BigQuery ecosystem. As Google Cloud technology innovation continues, users of Safe Haven are able to naturally adopt new analytic tools and libraries enabled by BigQuery.

All personally-identifiable data in the environment is automatically processed and replaced with LiveRamp’s brand-encoded pseudonymized identifiers, known as RampIDs:

These identifiers are derived from LiveRamp’s decades of dedicated work on consumer knowledge and device-identity knowledge graphs. LiveRamp’s identifiers represent secure people-based individual and household-level IDs that let data scientists connect audience records, transaction records, and media behavior records across publishers and platforms.Because these RampIDs are based on actual demographic knowledge, these identifiers can connect data sets with real person-centered accuracy and higher connectivity than solutions that rely on string matching alone.RampIDs are supported in Google Ads, Google’s Ads Data Hub, and hundreds of additional leading destinations including TV and Connected TV, Walled Gardens, ecommerce platforms, and all leading social and programmatic channels.

The Safe Haven data, because of its pseudonymization, presents a much safer profile for analysts working in the environment, with little risk to insider threats due to PII removal, a lockdown of data exports, transparent activity logging, and Google Cloud’s powerful encryption and role-based permissioning.

LiveRamp’s Safe Haven solutions on Google Cloud have been deployed by many leading brands globally, especially brands in retail, CPG, pharma, travel, and entertainment. Success for all of these brands is due in large part to the combination of the secure BigQuery environment and the ability to increase data connectivity with LiveRamp’s RampID ecosystem partners.

One powerful example in the CPG space is the success achieved by a large CPG client who needed to enrich their understanding of consumer product preferences, and piloted a focused effort to assess the impact of digital advertising on audience segments and their path to purchase at one large retailer.

Using Safe Haven running on BigQuery, they were able to develop powerful person-level insights, create new optimized audience segments based on actual in-store product affinities, and greatly increase their direct addressability to over a third of their regional purchasers. The net result was a remarkable 24.7% incremental lift over their previous campaigns running on Google and on Facebook.

Built with BigQuery: How Safe Haven empowers analysts and marketers

Whether you’re a marketer activating media audience segments, or a data scientist using analytics across the pseudonymized and connected data sets, LiveRamp Safe Haven delivers the power of BigQuery to either end of the marketing function.

Delivering BigQuery to data scientists and analysts

Creating and configuring an ideal environment for marketing analysts is a matter of selecting and integrating from the wealth of powerful Google and partner applications, and uniting them with common data pipelines, data schemas, and processing pipelines. An example configuration LiveRamp has used for retail analysts combines Jupyter, Tableau, Dataproc and BigQuery as shown below:

Data scientists and analysts need to work iteratively and interactively to analyze and model the LiveRamp-connected data. To do this, they have the option of using either the SQL interface through the standard BigQuery console, or for more complex tasks, they can write Python spark jobs inside a custom JupyterLab environment hosted on the same VM that utilizes a Dataproc cluster for scale.

They also need to be able to automate, schedule and monitor jobs to provide insights throughout the organization. This is solved by a combination of BigQuery scheduling (for SQL jobs) and Google Cloud Scheduler (for Python Spark jobs), both standard features of Google Cloud Platform.

Performing marketing analytics at scale utilizes the power of Google Cloud’s elasticity. LiveRamp Safe Haven is currently running on over 300 tenants workspaces deployed across multiple regions today. In total, these BigQuery instances contain more than 350,000 tables, and over 200,000 load jobs and 400,000 SQL jobs execute per month — all configured via job management within BigQuery.

Delivering BigQuery to marketers

SQL is a barrier for most marketers, and LiveRamp faced the challenge of unlocking the power of BigQuery for this key persona. TheAdvanced Audience Builder is one of the custom applications that LiveRamp created to address this need. It generates queries automatically and auto-executes them on a continuous schedule to help marketers examine key attributes and correlations of their most important marketing segments.

Queries are created visually off of the customers’ preferred product schema:

Location qualification, purchase criteria, time windows and many other factors can be easily selected through a series of purpose-built screens that marketers, not technical analysts, find easy to navigate and which quickly unlock the value of scalable BigQuery processing to all team members.

By involving business and marketing experts to work and contribute insights alongside the dedicated analysts, team collaboration is enhanced and project goals and handoffs are much more easily communicated across team members.

What’s next for LiveRamp and Safe Haven?

We’re excited to announce that LiveRamp was recently named Cloud Partner of the Year at Google Cloud Next 2023. This award celebrates the achievements of top partners working with Google Cloud to solve some of today’s biggest challenges.

Safe Haven is the first version of LiveRamp’s identity-based platform. Version 2 of the platform, currently in development, is designed to have even more cloud-native integrations within Google Cloud. There will be more updates on the next version soon.

For more information on LiveRamp Safe Haven, visit the website here.

The Built with BigQuery advantage for ISVs and data providers

Built with BigQuery helps companies like LiveRamp build innovative applications with Google Data and AI Cloud. Participating companies can:

Accelerate product design and architecture through access to designated experts who can provide insight into key use cases, architectural patterns, and best practices.Amplify success with joint marketing programs to drive awareness, generate demand, and increase adoption.

BigQuery gives ISVs the advantage of a powerful, highly scalable unified AI lakehouse that’s integrated with Google Cloud’s open, secure, sustainable platform. Click here to learn more about Built with BigQuery.

Source : Data Analytics Read More

Top 5 AI-Driven Furniture Engineering Design Applications

Top 5 AI-Driven Furniture Engineering Design Applications

Anybody familiar with the nature of technology recognizes the contributions of artificial intelligence. AI technology has been instrumental in transforming the healthcare and financial industries, as well as many other sectors. Fewer people talk about the role that AI has played in the creative arts professions. However, there are a number of reasons that AI […]

Source : SmartData Collective Read More