A data pipeline for MongoDB Atlas and BigQuery using Dataflow

Data is critical for any organization that wants to build and operationalize a comprehensive analytics strategy. For example, each transaction in the BFSI (Banking, Financial Services, and Insurance) sector produces data. In manufacturing, sensor data can be vast and heterogeneous. Most organizations maintain many different systems, and each organization has unique rules and processes for handling the data contained within those systems.

Google Cloud provides end-to-end data cloud solutions to store, manage, process, and activate data, starting with BigQuery. BigQuery is a fully managed data warehouse designed for running analytical processing (OLAP) at any scale, with built-in features like machine learning, geospatial analysis, data sharing, log analytics, and business intelligence. MongoDB is a document-based database that handles real-time operational applications with thousands of concurrent sessions and millisecond response times. Often, curated subsets of data from MongoDB are replicated to BigQuery for aggregation and complex analytics, and to further enrich the operational data and the end-customer experience. In this way, MongoDB Atlas and Google Cloud BigQuery are complementary technologies.

Introduction to Google Cloud Dataflow

Dataflow is a truly unified stream and batch data processing system that's serverless, fast, and cost-effective. Dataflow lets teams focus on programming instead of managing server clusters, as its serverless approach removes operational overhead from data engineering workloads. Dataflow is very efficient at implementing streaming transformations, which makes it a great fit for moving data from one platform to another with any required changes to the data model. As part of data movement with Dataflow, you can also implement additional use cases such as identifying fraudulent transactions, generating real-time recommendations, and more.

Announcing new Dataflow Templates for MongoDB Atlas and BigQuery

Customers have been using Dataflow widely to move and transform data from Atlas to BigQuery and vice versa. To do this, they have been writing custom code using Apache Beam libraries and deploying it on the Dataflow runtime.

To make moving and transforming data between Atlas and BigQuery easier, the MongoDB and Google teams worked together to build these templates and make them available on the Dataflow page in the Google Cloud console. Dataflow templates allow you to package a Dataflow pipeline for deployment, and they have several advantages over deploying a pipeline to Dataflow directly. The Dataflow templates and the Dataflow page make it easier to define the source, target, transformations, and other logic to apply to the data. You can key in all the connection parameters through the Dataflow page, and with a click, the Dataflow job is triggered to move the data.

To start with, we have built three templates. Two are batch templates that move data from MongoDB to BigQuery and vice versa; the third is a streaming template that reads change stream data pushed to Pub/Sub and writes it to BigQuery. The templates for interacting with MongoDB and Google Cloud native services currently available are:

1. MongoDB to BigQuery template:
The MongoDB to BigQuery template is a batch pipeline that reads documents from MongoDB and writes them to BigQuery.

2. BigQuery to MongoDB template:
The BigQuery to MongoDB template is a batch pipeline that reads tables from BigQuery and writes them to MongoDB.

3. MongoDB to BigQuery CDC template:
The MongoDB to BigQuery CDC (Change Data Capture) template is a streaming pipeline that works together with MongoDB change streams. The pipeline reads the JSON records pushed to Pub/Sub via a MongoDB change stream and writes them to BigQuery.

The Dataflow page in the Google Cloud console can help accelerate job creation and eliminates the need to set up a Java environment and other dependencies. Users can instantly create a job by passing in parameters such as the MongoDB connection URI, database name, collection name, and BigQuery table name through the UI.
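If you prefer to script job creation instead of using the console UI, the same launch can be done through the Dataflow API. Below is a minimal sketch using the Google API Python client; the template path and parameter names (mongoDbUri, database, collection, outputTableSpec) are assumptions to verify against the template documentation for your region and template version.

    # Minimal sketch: launch the MongoDB to BigQuery Flex Template via the Dataflow API.
    # Template path and parameter names are assumptions -- verify them against the
    # Google-provided template documentation before use.
    from googleapiclient.discovery import build

    PROJECT = "my-project"   # hypothetical project ID
    REGION = "us-central1"

    dataflow = build("dataflow", "v1b3")
    request = dataflow.projects().locations().flexTemplates().launch(
        projectId=PROJECT,
        location=REGION,
        body={
            "launchParameter": {
                "jobName": "mongodb-to-bigquery-batch",
                # Path of the Google-provided Flex Template container spec (assumed).
                "containerSpecGcsPath": f"gs://dataflow-templates-{REGION}/latest/flex/MongoDB_to_BigQuery",
                "parameters": {
                    "mongoDbUri": "mongodb+srv://user:password@cluster.example.mongodb.net",
                    "database": "sample_db",
                    "collection": "sample_collection",
                    "outputTableSpec": f"{PROJECT}:analytics.mongodb_export",
                },
            }
        },
    )
    response = request.execute()
    print(response["job"]["id"])  # ID of the launched Dataflow job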

These new MongoDB templates appear on the Dataflow page alongside the other Google-provided templates.

Selecting the MongoDB to BigQuery (Batch) template brings up its parameter configuration screen; the required parameters vary based on the template you select.

Getting started

Refer to the Google-provided Dataflow templates documentation page for more information on these templates. If you have any questions, feel free to contact us or engage with the Google Cloud Community Forum.

Reference

Apache Beam I/O connectors

Acknowledgement: We thank the many Google Cloud and MongoDB team members who contributed to this collaboration and review, led by Paresh Saraf from MongoDB and Maruti C from Google Cloud.

Related Article

Simplify data processing and data science jobs with Serverless Spark, now available on Google Cloud

Spark on Google Cloud, Serverless and Integrated for Data Science and ETL jobs.


Lufthansa increases on-time flights by wind forecasting with Google Cloud ML

The magnitude and direction of wind significantly impact airport operations, and Lufthansa Group Airlines are no exception. A particularly troublesome kind is called BISE: a cold, dry wind that blows from northeast to southwest across the Swiss Plateau in Switzerland. Its effects on flight schedules can be severe, such as forcing planes to change runways, which can create a chain reaction of flight delays and possible cancellations. At Zurich Airport in particular, BISE can potentially reduce capacity by up to 30%, leading to further flight delays and cancellations, and to millions in lost revenue for Lufthansa (as well as dissatisfaction among its passengers).

Being able to predict this kind of wind well in advance lets the Network Operations Control team schedule flight operations optimally across runways and timeslots, minimizing disruptions to the schedule. However, wind speed and direction are incredibly difficult to model and thus to predict, which is why Lufthansa reached out to Google Cloud.

Machine learning (ML) can help airports and airlines to better anticipate and manage these types of disruptive weather events. In this blog post, we’ll explore an experiment Lufthansa did together with Google Cloud and its Vertex AI Forecast service, accurately predicting BISE hours in advance, with more than 40% relative improvement in accuracy over internal heuristics, all within days instead of the months it often takes to do ML projects of this magnitude and performance.

“Being impressed with Google’s technology and prowess in the field of AI and machine learning, we were certain that by working together with their experts to combine our domain expertise with their technology, we would achieve the best results possible,” said Christian Most, Senior Director, Digital Operations Optimization at Lufthansa Group.

Collecting and preparing the dataset

The goal of Lufthansa and Google Cloud's project was to forecast the BISE wind for Zurich's Kloten Airport using deep learning-based ML approaches, then to see whether the predictions surpass internal heuristics-driven solutions and to gauge the ease of use and practicality of the deep learning approach in production.

Since deep learning-based techniques require large datasets, the project relied on MeteoSwiss simulation data: a dataset consisting of multiple meteorological sensor measurements collected from several weather stations across Switzerland over the past five years. From this dataset we obtained factors like wind direction, speed, pressure, temperature, and humidity at a 10-minute resolution, along with information about the location of the weather stations, such as altitude. These factors, which we hypothesized to be predictive of BISE, ended up carrying valuable signal, as we would see later.

This collected data was next subjected to an extensive cleaning and feature engineering process using Vertex AI Workbench in order to prepare the final dataset for training. The cleaning phase included steps to drop features or rows that contained too many missing values or that failed statistical tests (for entropy, for example). Since wind direction is a circular feature (between 0 and 360 degrees), this column was replaced with two features: the corresponding sine and cosine embeddings. The dataset was then flattened so that the columns contained all the relevant features and sensor measurements from all the weather stations at a particular 10-minute interval.
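As an illustration of the circular-feature step, here is a minimal pandas sketch (the column names are hypothetical, not the actual MeteoSwiss schema):

    # Minimal sketch of the wind-direction encoding described above.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"wind_direction_deg": [0.0, 45.0, 180.0, 350.0]})

    # Replace the circular 0-360 degree feature with sine/cosine embeddings so that
    # 359 degrees and 1 degree end up close together in feature space.
    radians = np.deg2rad(df["wind_direction_deg"])
    df["wind_dir_sin"] = np.sin(radians)
    df["wind_dir_cos"] = np.cos(radians)
    df = df.drop(columns=["wind_direction_deg"])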

Since the target variable (BISE) was not directly available, we engineered a proxy target variable called "tailwind speed around runway," which above a certain threshold indicates the presence of BISE along the runway.

Forecasting wind in the Cloud

Once the dataset was ready, Lufthansa and Google Cloud evaluated several options before deciding to experiment with and tune Vertex AI Forecast, Google's AutoML-powered forecasting service. Vertex AI Forecast handles the required feature engineering, neural architecture search, and hyperparameter tuning in a completely automated fashion, and was measured by Google Cloud to score in the top 2.5% of the M5 Forecasting Competition on Kaggle. These qualities made it an excellent choice for Lufthansa, reducing the manual overhead of creating, deploying, and maintaining top-performing deep learning models.

The raw data files were loaded from Cloud Storage and preprocessed on Vertex AI Workbench. Then, a training pipeline was initiated on Vertex AI Pipelines, which performed the following steps in sequence (sketched in code after the list):

The .csv data file was loaded from Cloud Storage into a Vertex AI managed dataset.

A Vertex AI forecasting training job was initiated with the dataset, and the resulting model was registered in the Vertex AI Model Registry.

Upon completion, the model was evaluated on the test set, and the model's predictions, along with the input features and ground truth of the test set, were stored in a user-defined table in BigQuery. Several test metrics were also available on the service and model dashboards.
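A condensed sketch of these steps using the Vertex AI SDK for Python is shown below; the paths, column names, horizon, and budget are placeholders rather than the actual study values, and parameter names should be verified against the current SDK.

    # Sketch of the training pipeline steps using the Vertex AI SDK for Python.
    # Paths, column names, and budgets are placeholders, not the actual study values.
    from google.cloud import aiplatform

    aiplatform.init(project="my-project", location="europe-west4")

    # 1. Load the prepared CSV from Cloud Storage into a managed time-series dataset.
    dataset = aiplatform.TimeSeriesDataset.create(
        display_name="bise-weather-features",
        gcs_source=["gs://my-bucket/bise/training_data.csv"],
    )

    # 2. Launch a Vertex AI Forecast (AutoML forecasting) training job; the resulting
    #    model is registered in the Vertex AI Model Registry.
    job = aiplatform.AutoMLForecastingTrainingJob(
        display_name="bise-forecast",
        optimization_objective="minimize-rmse",
    )
    model = job.run(
        dataset=dataset,
        target_column="tailwind_speed",           # proxy target for BISE
        time_column="timestamp",
        time_series_identifier_column="station_id",
        unavailable_at_forecast_columns=["tailwind_speed"],
        available_at_forecast_columns=["timestamp"],
        forecast_horizon=12,                      # e.g. 12 x 10-minute steps = 2 hours
        data_granularity_unit="minute",
        data_granularity_count=10,
        weight_column="sample_weight",            # the reweighting column described below
        budget_milli_node_hours=1000,
    )

    # 3. Batch-predict on the held-out test set and write results to BigQuery for evaluation.
    model.batch_predict(
        job_display_name="bise-test-eval",
        bigquery_source="bq://my-project.bise.test_set",
        bigquery_destination_prefix="bq://my-project.bise_eval",
        instances_format="bigquery",
        predictions_format="bigquery",
    )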

One of the biggest challenges was the severe imbalance in the dataset, as measurements with BISE were few and far between. To account for this, instances where BISE occurred, as well as measurements temporally close to them, were upweighted using weights calculated with methods including Inverse of Square Root of Number of Samples (ISNS), Effective Number of Samples (ENS), and Gaussian reweighting. The formulas for these methods are given below. The weights were supplied as separate columns in the dataset and were used iteratively thereafter by the service as the "weight" column.

ISNS

ENS

Weighted Gaussian
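In their standard forms (a reference sketch; the exact parameterization used in the study may differ), these methods assign weights as:

    w_c^{\mathrm{ISNS}} \propto \frac{1}{\sqrt{n_c}}, \qquad
    w_c^{\mathrm{ENS}} \propto \frac{1 - \beta}{1 - \beta^{n_c}}, \qquad
    w_i^{\mathrm{Gauss}} \propto \exp\!\left(-\frac{d_i^{2}}{2\sigma^{2}}\right)

where n_c is the number of samples in class c (BISE vs. no BISE), \beta \in [0, 1) controls the effective number of samples, d_i is the temporal distance of sample i from the nearest BISE occurrence, and \sigma is the kernel width.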

Results and next steps

Fig 1. Recall for a 2-hour horizon
Fig 2. F1 score for a 2-hour horizon

In the above figures, the x-axis represents the forecast horizon and the y-axis shows the respective metric (recall or F1 score). After multiple experiments, Vertex AI Forecast (red bars) achieved higher recall and precision, outperforming Lufthansa's internal baseline heuristics, with the performance gap widening steadily as the forecast horizon extends further into the future. At the two-hour mark, the custom-configured Vertex AI Forecast model improved on the internal heuristics by 40% in relative terms, and on the random-guess baseline by 1700%. As we saw in other experiments, at a six-hour forecast horizon the performance gap widens even more, with Vertex AI Forecast in the lead. Since forecasting BISE a few hours in advance is very valuable for preventing flight delays, this was a great solution for Lufthansa.

“We are very excited to be able to not only do accurate long-term forecasts for the BISE, but also that Vertex AI Forecasting makes training and deploying such models much easier and faster, allowing us to innovate rapidly to serve our customers and stakeholders in the best possible manner,” said Oliver Rueegg, Product Owner, Swiss International Airlines.

Lufthansa plans to explore productionizing this solution by integrating it into its Operations Decision Support Suite, which is used by the network controllers in the Operations Control Center in Kloten, and to work closely with Google's specialists to integrate Vertex AI Forecast and other Google AI/ML offerings for its use cases.

Related Article

Accelerate the deployment of ML in production with Vertex AI

Google Cloud expands Vertex AI to help customers accelerate deployment of ML models into production.


Gyfted uses Google Cloud AI/ML tools to match tech workers with the best jobs

It’s no secret that many organizations and job seekers find the hiring process exhausting. It can be time consuming, costly, and somewhat risky for both parties. Those are just some of the experiences we wanted to change when we started Gyfted, a pre-vetted talent marketplace for people who complete tech training or degree programs and are looking for the right career move. At the same time, we’re helping businesses save time and improve recruiting outcomes with our automated candidate screening and sourcing tools. 

Our vision is clear: to take a candidate through one structured hiring process, and then put them in front of thousands of companies. It's similar to the common app system in higher education. It sounds simple, but it is a herculean technical and UX task. To succeed, we had to combine advanced psychometric testing, machine learning, and the latest in behavioral design, and develop the highest-quality structured, relational dataset to represent candidate and manager profiles and preferences on our network. Fortunately, we were cofounded by world-leading experts in these areas, including Dr. Michal Kosinski, one of the world's top computational psychologists, and Adam Szefer, a gifted young technologist. We've been joined by a group of equally talented employees, most of whom work remotely in Poland, the US, Switzerland, the UK, Israel, and Ukraine.

The influence of dating platforms and matching

When seeking inspiration, we were influenced by the success of dating platforms, especially Bumble with its focus on commitment. These platforms have done a great job using design to match people together.

We like to think we’re doing the same for recruiters and candidates in terms of not only role and culture fit matching, but also through a fundamental feature of Gyfted, which is that our job-seekers are anonymous. This helps recruiters meet one of their goals today, which is to minimize bias in the hiring process and enable objective, diversity-oriented recruiting.

Another of our unique selling points is that we conduct candidate screening that is gamified and automated, with a structured interview that remains with the candidate's profile. We roughly estimate that just for the 15 million open jobs on LinkedIn, if companies fill 50% of those via external recruiting and conduct a screening interview with 10 candidates per job filled, at $50/hour paid to an employee to do the screening, that's $3.75 billion and 75 million hours in direct costs. This is on top of applying for jobs, selecting CVs, and coordinating the process, which takes an even bigger financial and time toll on both applicants and recruiters. Instead, it would be better to take one interview that counts for 1,000 companies. The impact of what we want to achieve with our vision is enormous.

We also offer career discovery and career search tools for job candidates. This includes free, personalized feedback for every job-seeker. Right now, we’re aiming the service at students, bootcamp graduates and juniors, helping them to land jobs in tech and the creative industry at large. Next, we’ll expand into mid and senior roles. In the long run we want to reshape how recruiting happens through a common app that saves everyone in the market significant time and resources, helping people find jobs not only faster, but jobs that truly fit them. 

Developing advanced AI applications with Google Cloud

We obtained our original funding from angel and institutional investors, and we were selected into the Fall 2021 batch of StartX, the non-profit start-up accelerator and founder community associated with Stanford University. But like most startups our budgets are tight, and we need to find ways to operate as efficiently as possible, especially when building out our technology stack and developer environment.

That's where Google Cloud comes in. It's a lot more affordable and flexible than competing solutions, and our developers love it. We use Google App Engine for hosting and developing our applications, giving us enormous flexibility. Vertex AI enables us to build, deploy, and scale machine learning models faster, within a unified artificial intelligence platform. On top of that, we use Vertex AI Workbench as the development environment for our data science workflow, which gives us everything we need to host and develop innovative AI-based applications.

BigQuery, Google Cloud’s serverless data warehouse, is another stand-out solution for us. We use it to crunch big data from all our systems and the UX is very intuitive and easy to use, allowing us to use it across the business and get insights from a wide range of employees, not just technical experts.

Above all, Google Cloud helps us solve the main platform challenges facing Gyfted including scalability and identity management, so we are perfectly positioned for growth. Right now, we handle about 2 million candidate interactions, a volume we expect to grow exponentially. As that number grows, we rely on Google Cloud to help us scale securely and with reliability.

Eliminating bias from the hiring process

Our technology partners have also been integral to helping us get to an advanced stage of our beta program. MongoDB on Google Cloud takes the data burden off our teams and reduces time to value of our applications. We can stay nimble and can scale database capacity at the push of a button.

Our collaboration with the Google team has been fantastic. Our Startup Success Manager is an expert when it comes to Google Cloud solutions, and he also understands our business from his own experience as an entrepreneur and an investor. It’s great to have an internal point of contact who can help us navigate all of Google’s resources.

I’d also stress the extent to which Google Cloud values align with ours. For example, a key benefit for our customers is the ability to strip unconscious bias out of the hiring process. Google Cloud tools support this commitment to diversity, especially when we are building out our AI models.

On a team level, we also appreciate the support that Google Cloud has shown through its Google Support Fund for Start-ups in Ukraine. This has helped many Ukrainian businesses to continue to operate at a very challenging time, including startups with remote, distributed teams in Poland where most of us stem from.

If I had to sum up Google Cloud and our collaboration with Google for Startups in a phrase, I’d say that it adds enormous value to our business while removing much of the risk when scaling up a start-up. We’ve seen the addition of many new tools and features in the past two years and our Google mentors are always looking at the best way these can be integrated with Gyfted’s own roadmap. That means that we can continue to transform recruiting and hiring processes with the support of one of the world’s most advanced tech companies as a strategic growth partner.

Gyfted Team Members

If you want to learn more about how Google Cloud can help your startup, visit our page here to get more information about our program, and sign up for our communications to get a look at our community activities, digital events, special offers, and more.

Related Article

Divercity uses Google Cloud to build more inclusive and sustainable workforces

Learn how Divercity uses Google Cloud to help tech companies measure employee diversity, recruit underrepresented talent, and significant…


Built with BigQuery: BigQuery ML enables Faraday to make predictions for any US consumer brand

In 2022, digital natives and traditional enterprises alike have a better understanding of data warehousing, protection, and governance. But machine learning and the ethical application of artificial intelligence and machine learning (AI/ML) remain open questions, promising to drive better results if only their power can be safely harnessed. Customers on Google Cloud Platform (GCP) have access to industry-leading technology, for example BigQuery in a serverless, zero-ETL environment, but it's still hard to know how to start.

While Google Cloud provides customers with a multitude of built-in options for data analytics and AI/ML, Google relies on technology partners to provide customized solutions for fit-for-purpose customer use cases.

Faraday is a Google technology partner focused on helping brands of all sizes engage customers more effectively with the practical power of prediction. Since 2012, Faraday has been standardizing a set of patterns that any business can follow to look for signal in its consumer data – and activate on that signal with a wide variety of execution partners.

Crucial to Faraday’s success is how it uses Google BigQuery, one of the crown jewels of GCP’s data cloud. BigQuery is a serverless data warehouse that provides data-local compute for both analytic and machine learning workloads. One of BigQuery’s core abstractions is the use of SQL to declare business logic across all of its functions. This is a design choice with wide implications: if you can write SQL, then BigQuery will take care of parallelizing it across virtually unlimited resources. This presents a very clear path from a Data Engineer persona to the use of machine learning without the need for deep expertise in ML.

On BigQuery, Faraday can ingest virtually unlimited amounts of client data, protect and govern it with best-in-breed Google tools, transform it into a standard schema, calculate a wide variety of features that are relevant to consumer predictions, and run data-local machine learning modeling and prediction using BigQuery ML. BigQuery ML lets you create and execute machine learning models in BigQuery using standard SQL queries. BigQuery ML democratizes machine learning by letting SQL practitioners build models using existing SQL tools and skills. BigQuery ML increases development speed by eliminating the need to move data.

Below, Faraday gives some examples of this work and describes the comparative advantage of being built with BigQuery.

Real examples of BigQuery ML

Use case 1: increase conversion with personalization

Clients who provide known identity to Faraday in any form – email, email hash, or physical address – can have their customers segmented into a brand-specific set of personas. Then they can personalize outreach to these personas to increase conversion. This is facilitated by Faraday’s “batteries included” database of 260 million US adults, with more than 600 features covering demographic, psychographic, property, and life event data. 

Once the client requests a "persona set" in the Faraday API or application, Faraday joins any available "first party data" provided by the client with the national dataset ("third party data") and declares a BigQuery ML statement along the following lines:
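As an illustrative sketch of such a statement, issued here through the BigQuery client library for Python (the dataset, table, column names, and model options are assumptions, not Faraday's actual schema or configuration):

    # Illustrative sketch: create a k-means clustering model ("persona set") in BigQuery ML.
    # Dataset/table/column names and the number of clusters are assumptions.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")

    create_persona_model = """
    CREATE OR REPLACE MODEL `my-project.faraday_demo.persona_set`
    OPTIONS (
      model_type = 'KMEANS',
      num_clusters = 6,
      standardize_features = TRUE
    ) AS
    SELECT
      age, household_income, home_value, tenure_years   -- joined 1st/3rd party features
    FROM `my-project.faraday_demo.first_and_third_party_features`;
    """
    client.query(create_persona_model).result()  # waits for the model to be created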

What’s unique about BigQuery ML is that Faraday is able to do all data prep in SQL, and from that point on, Google is in charge of data movement, scaling and computation. The resulting cluster model can be used to predictively segment the entire US population, so that the client can personalize outreach in any channel.

Use case 2: lead scoring

As long as the client is able to provide a form of known identity for their customers, leads, or prospects, Faraday can construct a rich training dataset from first and third party data. This dataset can be used to predict the likelihood of leads becoming customers, of customers becoming great (high-spending) customers, or of customers churning or otherwise becoming inactive.

Once the client requests an "outcome" in the Faraday API or application, Faraday again joins first and third party data, computes relevant machine learning features including time-based differentials, and declares a BigQuery ML statement along the following lines:
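Again as an illustrative sketch, this time training a boosted tree classifier and scoring leads with it (table and column names are assumptions):

    # Illustrative sketch: train a boosted tree classifier in BigQuery ML to score leads.
    # Table/column names are assumptions; the label marks which leads became customers.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")

    create_outcome_model = """
    CREATE OR REPLACE MODEL `my-project.faraday_demo.lead_to_customer`
    OPTIONS (
      model_type = 'BOOSTED_TREE_CLASSIFIER',
      input_label_cols = ['became_customer'],
      enable_global_explain = TRUE
    ) AS
    SELECT
      became_customer,
      age, household_income, days_since_first_touch, email_engagement_score
    FROM `my-project.faraday_demo.lead_training_set`;
    """
    client.query(create_outcome_model).result()

    # Score new leads with ML.PREDICT once the model exists.
    score_leads = """
    SELECT lead_id, predicted_became_customer_probs
    FROM ML.PREDICT(
      MODEL `my-project.faraday_demo.lead_to_customer`,
      TABLE `my-project.faraday_demo.current_leads`);
    """
    results = client.query(score_leads).result()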

There are a couple of points to note. First, in this use case and the previous one (personalization), Faraday applies other optimizations using BigQuery ML, but they are a simple expression of normal data science practice to enhance feature selection using different forms of regression. In all cases, the SQL is straightforward, and perhaps more accessible than data pipelines expressed in Python, Spark, Airflow, or other technologies.

Second, Faraday is not asking the client to act as a data scientist. Thanks to the explainability of BigQuery ML boosted tree models, the output to the client is an extensive report on feature importances and possible biases, but the initial input from the client is to simply select a population they would like to see more of. For example, if they can define what a “high spending customer” is, they can simply ask Faraday to predict more of those.

Use case 3: spend forecasting and LTV

Say a client wants to know what particular customers or customer segments (personas) will spend with them over the next 12 or 36 months. When the client requests a "forecast" in the Faraday API or application, Faraday performs the aforementioned data joining and feature generation and then declares a BigQuery ML statement along the following lines:
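An illustrative sketch of such a forecast expressed as a regression model, as noted below (names and options are assumptions):

    # Illustrative sketch: spend forecasting / LTV as a BigQuery ML regression model.
    # Table and column names are assumptions; the label is future spend over the window.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")

    create_forecast_model = """
    CREATE OR REPLACE MODEL `my-project.faraday_demo.spend_forecast_12mo`
    OPTIONS (
      model_type = 'BOOSTED_TREE_REGRESSOR',
      input_label_cols = ['next_12mo_spend']
    ) AS
    SELECT
      next_12mo_spend,
      age, household_income, orders_last_12mo, avg_order_value, days_since_last_order
    FROM `my-project.faraday_demo.customer_spend_training_set`;
    """
    client.query(create_forecast_model).result()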

Currently, Faraday implements spend forecasting and LTV (lifetime value) as a regression model, but even better options may become available in the future as BigQuery is under active development. Faraday clients would see this as an improvement in the signal that Faraday provides to them.

Why GCP is the best data cloud for building predictive products

In the first half of 2022, Faraday ran more than 1 trillion predictions for US consumer businesses. This was only possible due to a number of factors that make GCP the best data cloud for building predictive products.

Factor 1: Zero ETL

Did you know that when you build a BigQuery ML model, you are actually creating a Vertex AI model? Probably not – and it doesn’t matter in most cases. Google’s industry leading data cloud architecture means that the client (and Faraday) is not responsible for data movement, RAM allocation, disk expansion, or sharding. You simply declare what you want in SQL, and Google ensures that it happens.

Factor 2: Serverless, data-local compute

“Data locality” is not just a buzzword – ever since Faraday came to BigQuery in 2018, bringing the compute to the data instead of the other way around has enabled Faraday to scale its predictive capability by two orders of magnitude compared to its previous machine learning solution. Previously, Faraday had to build highly complex data copying and retry logic; now, the retry logic has been deleted and scaling problems are solved by increasing slot reservations (or rethinking SQL).

Factor 3: Model diversity and active development

If you want to model something, there is probably an appropriate model type already available in BigQuery ML. But if there’s not, Google’s continuing investment means that data pipelines built on BigQuery will grow in value over time – without the cognitive dissonance that arises from needing to learn languages and frameworks outside of SQL just to accomplish a particular task.

Conclusion

Digital natives and traditional enterprises alike will benefit from predictions made about their customers and potential customers. Faraday can provide a ready-made solution to this problem, both to enable immediate activation and to inspire and benchmark clients on their own data science journey. Google BigQuery’s scale, convenience, and active investment make GCP the best data cloud for Faraday to build its product – and provide a compelling reason for clients to consider it for their own architecture.

The Built with BigQuery advantage for ISVs 

Through Built with BigQuery, launched in April as part of the Google Data Cloud Summit, Google is helping tech companies like Faraday build innovative applications on Google's data cloud with simplified access to technology, helpful and dedicated engineering support, and joint go-to-market programs. Participating companies can:

Get started fast with a Google-funded, pre-configured sandbox. 

Accelerate product design and architecture through access to designated experts from the ISV Center of Excellence who can provide insight into key use cases, architectural patterns, and best practices. 

Amplify success with joint marketing programs to drive awareness, generate demand, and increase adoption.

BigQuery gives ISVs the advantage of a powerful, highly scalable data warehouse that’s integrated with Google Cloud’s open, secure, sustainable platform. And with a huge partner ecosystem and support for multi-cloud, open source tools and APIs, Google provides technology companies the portability and extensibility they need to avoid data lock-in. 

Click here to learn more about Built with BigQuery.

Related Article

Securely exchange data and analytics assets at scale with Analytics Hub, now available in preview

Efficiently and securely exchange valuable data and analytics assets across organizational boundaries with Analytics Hub. Start your free…


Adore Me embraces the power and flexibility of Looker and Google Cloud

You don’t have to work in the women’s clothing business to know that one size doesn’t fit all. Adore Me pioneered the try-at-home shopping service, helping to ensure that every woman can feel good in what she wears. I’ve been lucky enough to have played a part in our growth and success over the years. Now, data is transforming every aspect of how we work, shop, and do business, making these last few years especially exciting. But I’m often asked about how we utilize data here at Adore Me, so I thought I’d share some of the obstacles we’ve encountered, how we resolved them, and offer up a few pointers that I hope others will find helpful. 

Freeing teams from getting to the data so they can use it more effectively

It’s no secret: The less time we spend getting to the data, the more time we have to actually use it to support our business. Getting an online shopping service off the ground brings complexity into every part of our business. We quickly discovered that providing everyone in-house with the ability to make smart, data-driven decisions resulted in fewer errors and fewer choices that slowed down the business, driving better results for the company and our customers. 

Once I got my nose out of code and started looking around for ways I could help the business make the most of its data, Looker and BigQuery quickly fell into place as the solutions we needed. In BigQuery, we found a centralized, self-managed database that reduced management overhead. And once all our data was in place, Looker had the most significant impact on our overall productivity, particularly around efficiency and reducing human hours previously spent waiting for data and sharing results between teams. With Looker, we saved time on both ends: in the gathering of data as well as in sharing the insights it revealed with those who needed them most. 

What's remarkable about the BigQuery and Looker combination is how much we can accomplish with relatively small teams. We have our Business Intelligence team, the Data Engineering team, and the Data Science team. These are our 'data people', who bring in the data rather than consume it. Then we have our power users, who need quick insights from that data and therefore rely on Looker to access up-to-the-minute data when they need it. Empowering everyone with data consistently pays off, and it's a much better use of our time than hammering away at SQL.

Surfacing data insights that lead to action

Data permeates everything we do at Adore Me because we believe that a smarter business results in happier customers. Data helps us run interference, identify problems, and find a fix in real-time, whether that’s optimizing our delivery times or tracking lost packages. On the business planning side, our data reveals what our customers are looking at on our site. This gives us insight into their interests, what’s trending, and what they want to see more of, which in turn also helps to inform our marketing strategies. 

As an online shop, driving traffic to the site is critical to Adore Me’s business. With real-time data at our disposal, we’re able to determine which campaigns are the most effective and which markets are best suited for a specific message, so we can intelligently refine our campaigns during peak seasons. With the data in BigQuery and insights surfaced by Looker, we can deliver the products and services our customers want most on our site.

Enabling continuous improvement with a flexible infrastructure

Ultimately, we want to have all of our production-critical data in BigQuery and Looker, acting as an easy-to-manage single source of truth. Data lives where we can easily access it, see it, and analyze it. We can set the rules for all of our KPIs, and everyone is able to look at the same data in order to work towards achieving them together. 

What makes Google Cloud Platform so powerful is the suite of products and services that allow our teams to experiment with data in ways that are relevant to our particular business needs. For example, when working with new data sources, we need the ability to quickly visualize a .csv file, and Google Data Studio is the perfect tool for enabling that. If we find something that we want to bring into production, BigQuery makes it easy while modeling it in Looker speeds up the process. This is one way we are constantly improving and enriching our organization’s data capabilities.  

Making it easy to find the right tools for the job

Our teams have discovered that the variety of solutions offered by Google Cloud is ideal for addressing the evolving data challenges we face. Flexibility is critical in business today, and Google Cloud provides a major advantage to those who embrace a proof-of-concept mentality, which is why we take advantage of the free Google Cloud trials offered. They allow us to roll a product into a project, test drive it for a few days, and fail fast if necessary. No contracts. No hassle. Better still, the variety of products, their ease of use, and overall versatility make it a good bet that we'll find a solution that works for us.

Anyone with experience working with data will tell you that there’s no shortage of fly-by-night tools out there. But personal experience has shown us that, at the end of the day, success comes down to the strength of your team and choosing the right tools to get the job done. At Adore Me, we’ve built a fantastic team and, with the power of Looker and BigQuery, the sky’s the limit.

Introducing Device Connect for Fitbit: How Google Cloud and Fitbit are working together to help people live healthier lives

Healthcare is at the beginning of a fundamental transformation to become more patient-centered and data-driven than ever before. We now have better access to healthcare, thanks to improved virtual care, while wearables and other tools have dramatically increased our ability to take control of our own health and wellness. 

Healthcare alone generates as much as 30% of the world’s data and much of this will come from the Internet of Medical Things (IoMT) and consumer wearable devices. Gaining insights from wearable data can be challenging, however, due to the lack of a common data standard for health devices resulting in different data types and formats. So what do we do with all this data, and how do we make it most useful?  

Today, Fitbit Health Solutions and Google Cloud are introducing Device Connect for Fitbit, which empowers healthcare and life sciences enterprises with accelerated analytics and insights to help people live healthier lives. Fitbit data from their consenting users is made available through the Fitbit Web API, providing users with control over what data they choose to share and ensuring secure data storage and protection. Unlocking actionable insights about patients can help support management of chronic conditions, help drive population health impact, and advance clinical research to help transform lives. 

With this solution, healthcare organizations will be increasingly able to gain a more holistic view of their patients outside of clinical care settings. These insights can enhance understanding of patient behaviors and trends while at home, enabling healthcare and life science organizations to better support care teams, researchers, and patients themselves. Based on a recent Harris poll, more than 9 in 10 physicians (92%) believe technology can have a positive impact on improving patient experiences, and 96% agree that easier access to critical information may help save someone’s life.

Help people live healthier lives

This new solution can support care teams and empower patients to live healthier lives in several critical ways:

Pre- and post-surgery: Supporting the patient journey before and after surgery can lead to higher patient engagement and more successful outcomes.1 However, many organizations lack a holistic view of patients. Fitbit tracks multiple behavioral metrics of interest, including activity level, sleep, weight, and stress, and can provide care teams with visibility and new insights into what's happening with patients outside of the hospital.

Chronic condition management: For people living with diabetes, maintaining their blood glucose levels within an acceptable range is a constant concern. It's just one of countless examples, from heart disease to high blood pressure, where care teams want to promote healthy behaviors and habits to improve outcomes. Better understanding how lifestyle factors may impact disease indicators such as blood glucose levels can enable organizations to deliver more personalized care and tools to support healthy lifestyle changes.

Population health: Supporting better management of community health outcomes with a focus on preventive care can help reduce the likelihood of getting a chronic disease and improve quality of life.2 Fitbit users can choose to share their data with organizations that deliver lifestyle behavior change programs aimed at both prevention and management of chronic or acute conditions.

Clinical research: Clinical trials depend on rich patient data. Collection in a physician's office captures a snapshot of the participant's data at one point in time and doesn't necessarily account for daily lifestyle variables. Fitbit, used in more than 1,500 published studies (more than any other wearable device), can enrich clinical trial endpoints with new insights from longitudinal lifestyle data, which can help improve patient retention and compliance with study protocols.

Health equity: Addressing healthcare disparities is a priority across the healthcare ecosystem. Analyzing a variety of datasets, such as demographic and social determinants of health (SDOH) data alongside Fitbit data, has the potential to provide organizations and researchers with new insights regarding disparities that may exist across populations, such as obesity disparities among children in low-income families, or increased risk of complications among Black women related to pregnancy and childbirth. Learn more about Fitbit's commitment to health equity research here.

Accelerate time to insight

Gaining a more holistic view of the patient can better support people on their health and wellness journeys, identify potential health issues earlier, and provide clinicians with actionable insights to help increase care team efficiency. Device Connect for Fitbit addresses data interoperability to “make the invisible visible” for organizations, providing users with consent management and control over their data. Leveraging world-class Google Cloud technologies, Device Connect for Fitbit offers several pre-built components that help make Fitbit data accessible, interoperable and useful—with security and privacy as foundational features.

Enrollment & consent app for web and mobile: The pre-built patient enrollment and consent app enables organizations to provide their users with the permissions, transparency, and frictionless experience they expect. For example, users have control over what data they share and how that data is used.

Data connector: Device Connect for Fitbit offers an open-source data connector3, with automated data normalization and integration with Google Cloud BigQuery for advanced analytics. The data connector can support emerging standards like Open mHealth and enables interoperability with clinical data when used with the Cloud Healthcare API for cohort building and AI training pipelines.

Pre-built analytics dashboard: The pre-built Looker interactive visualization dashboard can be easily customized for different clinical settings and use cases to provide faster time to insights.

AI and machine learning tools: Use AutoML Tables to build advanced models directly from BigQuery, or build custom models with 80% fewer lines of code using Vertex AI, the groundbreaking ML tools that power Google, developed by Google Research.

Google Cloud’s ecosystem of delivery partners will provide expert implementation of services for Device Connect for Fitbit to help customers deploy at scale, and includes BlueVector AI, CitiusTech, Deloitte, and Omnigen.

Potential to help predict and prevent disease

The Hague’s Haga Teaching Hospital in the Netherlands is one of the first organizations to use Device Connect for Fitbit. The solution is helping the organization support a new study on early identification and prevention of vascular disease. 

“Collaborating with Google Cloud allows us to do our research, with the help of data analytics and AI, on a much greater scale,” cardiologist Dr. Ivo van der Bilt said. “Being able to leverage the new solution makes it easier than ever to gain the insights that will make this trial a success. Health is a precious commodity. You realize that all the more if you are struck down by an illness. If you can prevent it or catch it in time so that it can be treated, you have gained a great deal.” 

Fitbit innovation continues

Since becoming part of the Google family in January 2021, Fitbit has continued to help people around the world live healthier, more active lives and to introduce innovative devices and features, including FDA clearance for the new PPG AFib algorithm for irregular heart rhythm detection, released in April of this year. Fitbit metrics including activity, sleep, breathing rate, cardio fitness score (VO2 Max), heart rate variability, weight, nutrition, SpO2, and more will be accessible through Device Connect for Fitbit. Google's interactions with Fitbit are subject to strict legal requirements, including with respect to how Google accesses and handles relevant Fitbit health and wellness data. You can find details on these obligations here.

We look forward to empowering our customers to create more patient-centered, data-driven healthcare. Read more about Haga Teaching Hospital’s work to predict heart disease on the Google Cloud blog, and visit cloud.google.com/device-connect to learn more about Device Connect for Fitbit.

1. Harris Poll
2. CDC
3. Device Connect for Fitbit is built on the Fitbit Web API and data available from consenting users is the same as that made available for third parties through the Fitbit Web API, and enables the enterprise customer services through Google Cloud.


How Google Cloud and Fitbit are building a better view of health for hospitals, with analytics and insights in the cloud

Great technology gives us new ways of seeing and working with the world. The microscope enabled new scientific understanding. Trains and telegraphs, in different ways, changed the way we think about distance. Today, cloud computing is changing how we can assist in improving human health.

When you think of the healthcare system, it has historically included a visit to the doctor, sometimes coupled with a hospital stay. These are deeply important events, where tests are done, information on the patient is gathered, and a consultation is set up. But this structure also has limits: multiple visits are inconvenient and potentially distressing for patients, expensive for the healthcare system and, at best, provide a view of patient health only at a specific point in time.

But what if that snapshot of health could be supplemented with a stream of patient information that the doctor could observe and use to help predict and prevent diseases? By harnessing advancements in wearables—devices that sense temperature, heart rate, and oxygen levels—combined with  the power of cloud and artificial intelligence (AI) technologies, it is possible to develop a more accurate understanding of patient health.

This broader perspective is the goal of a collaboration between cardiologists at The Hague’s Haga Teaching Hospital, Fitbit—one of the world’s leading wearables that tracks activity, sleep, stress, heart rate, and more—and Google Cloud.

Initially focusing on 100 individuals who have been identified as at-risk of developing heart disease, during a pilot study (ME-TIME), cardiologists at the hospital will give patients a Fitbit Charge 5—Fitbit’s latest activity and health tracker with ECG monitoring1—to wear at home after an initial consultation.

With user consent, the devices will send information about certain patient behavioural metrics to the hospital via the cloud, in an encrypted state. 

This data is only accessed by (Haga Teaching Hospital-approved) physicians and data scientists at the hospital and is not used by Haga for any purposes other than medical research during the study.2 With user consent, the data, which includes the amount of physical activity a patient is undertaking, will be monitored by Haga's physicians against other clinical information already gathered about the individual by the hospital during prior consultations.

With user consent, Haga Teaching Hospital will also compare the data against its other relevant pseudonymized experience data, so the hospital can learn more about potential patterns and abnormalities associated with certain heart conditions. This is made possible by Google Cloud’s infrastructure, which will be used to store the encrypted data at scale, while artificial intelligence (AI) and data analytics tools will power near real-time analysis. For example, predictive analytics on this data could help identify early signs of a life-threatening disease such as a heart attack or stroke, so doctors can investigate further and provide preventative treatment—even before symptoms arise. 

Haga is using Device Connect for Fitbit, a new solution from Google Cloud, as part of the trial. Now available for healthcare and life sciences enterprises, the solution empowers business leaders and clinicians with accelerated analytics and insights from consenting users’ Fitbit data, powered by Google Cloud.3

The project is in collaboration with partner Omnigen, which has supported Haga with deployment, as well as with the processing and analysis of data.

Other hospitals in the Netherlands are already expressing interest in participating in similar projects. Longer term, we see applications to help with deeper understanding of overall population health for healthcare professionals, reducing unnecessary visits to the hospital – and better operation of the wider healthcare system. Preliminary results of the project may be available as early as the end of this year.

“Health is a precious commodity. You realise that all the more if you are struck down by an illness. If you can prevent it or catch it in time so that it can be treated, you have gained a great deal,” said cardiologist, Dr. Ivo van der Bilt of Haga Teaching Hospital, who has been leading on this collaboration. “Digital tools and technologies like those provided by Google Cloud and Fitbit open up a world of possibilities for healthcare and a new era of even more accessible medicine is possible.”

“This collaboration shows how Fitbit can help support innovation in population health, helping healthcare systems & care programmes create more efficient and effective care pathways that aren't always tied to primary or secondary care settings. Plus, it provides patients with tools to help them with their health and wellbeing each day, with metrics which can be overseen by clinical care teams,” said Nicola Maxwell, Head of Fitbit Health Solutions, Europe, Middle East & Africa.

This collaboration is an important step towards a goal of creating a more dynamic, rich, and holistic understanding of human health for hospitals, carried out with a strong emphasis on transparency. We are proud to be part of a project that we expect can help patients and healthcare workers alike. We believe this is only the start of what’s possible in healthcare with digital tools like Fitbit and cloud computing. 

1. The Fitbit ECG app is only available in select countries. Not intended for use by people under 22 years old. See fitbit.com/ecg for additional details.
2. Haga Teaching Hospital is responsible for any consents, notices or other specific conditions as may be required to permit any accessing, storing, and other processing of this data. Google Cloud does not have control over the data used in this study, which belongs to Haga Teaching Hospital. More generally, Google’s interactions with Fitbit are subject to strict legal requirements, including with respect to how Google accesses and handles relevant Fitbit health and wellness data. Details on these obligations can be found here
3. This is the same data as that made available through the Fitbit Web API, which the Device Connect integration is built on.

Related Article

Introducing Device Connect for Fitbit: How Google Cloud and Fitbit are working together to help people live healthier lives

How Google Cloud and Fitbit are working together to help people live healthier lives with Device Connect for Fitbit.


Benchmarking your Dataflow jobs for performance, cost and capacity planning

Calling all Dataflow developers, operators and users…

So you've developed your Dataflow job, and you're now wondering exactly how it will perform in the wild. In particular:

How many workers does it need to handle your peak load and is there sufficient capacity (e.g. CPU quota)?

What is your pipeline’s total cost of ownership (TCO), and is there room to optimize performance/cost ratio?

Will the pipeline meet your expected service-level objectives (SLOs) e.g. daily volume, event throughput and/or end-to-end latency?

To answer all these questions, you need to performance test your pipeline with real data to measure things like throughput and the expected number of workers. Only then can you optimize performance and cost. However, performance testing data pipelines has historically been hard, as it involves everything from 1) configuring non-trivial environments, including sources and sinks, to 2) staging realistic datasets, to 3) setting up and running a variety of tests, both batch and streaming, to 4) collecting relevant metrics, to 5) finally analyzing and reporting on all test results.

We’re excited to share that PerfKit Benchmarker (PKB) now supports testing Dataflow jobs! As an open-source benchmarking tool used to measure and compare cloud offerings, PKB takes care of provisioning (and cleaning up) resources in the cloud, selecting and executing benchmark tests, as well as collecting and publishing results for actionable reporting. PKB is a mature toolset that has been around since 2015 with community effort from over 30 industry and academic participants such as Intel, ARM, Canonical, Cisco, Stanford, MIT and many more.

We'll go over the testing methodology and how to use PKB to benchmark a Dataflow job. As an example, we'll present sample test results from benchmarking one of the popular Google-provided Dataflow templates, the Pub/Sub Subscription to BigQuery template, and show how we identified its throughput and optimum worker size. There are no performance or cost guarantees, since the results presented are specific to this demo use case.

Quantifying pipeline performance

“You can’t improve what you don’t measure.” 

One common way to quantify pipeline performance is to measure its throughput per vCPU core in elements per second (EPS). This throughput value depends on your specific pipeline and your data, such as:

Pipeline’s data processing steps

Pipeline’s sources/sinks (and their configurations/limits)

Worker machine size

Data element size

It’s important to test your pipeline with your expected real-world data (type and size), and in a testbed that mirrors your actual environment including similarly configured network, sources and sinks. You can then benchmark your pipeline by varying several parameters such as worker machine size. PKB makes it easy to A/B test different machine sizes and determine which one provides the maximum throughput per vCPU.

Note: What about measuring pipeline throughput in MB/s instead of EPS?

While either of these units works, measuring throughput in EPS draws a clear line to both:

1) the underlying performance dependency (i.e. the element size in your particular data), and 2) the target performance requirement (i.e. the number of individual elements processed by your pipeline). Similar to how disk performance depends on I/O block size (KB), pipeline throughput depends on element size (KB). With pipelines that process primarily small elements (on the order of KBs), EPS is likely the limiting performance factor. The ultimate choice between EPS and MB/s depends on your use case and data.

Note: The approach presented here expands on this prior post from 2020 on predicting Dataflow cost. However, we also recommend varying worker machine sizes to identify any potential CPU/network/memory bottlenecks and to determine the optimum machine size for your specific job and input profile, rather than assuming the default machine size (i.e. n1-standard-2). The same applies to any other relevant pipeline configuration option, such as custom parameters.

The following are sample PKB results from benchmarking the Pub/Sub Subscription to BigQuery Dataflow template across n1-standard-{2,4,8,16} using the same input data: logs with an element size of ~1 KB. While n1-standard-16 offers the maximum throughput at 28.9k EPS, the maximum throughput per vCPU is provided by n1-standard-4 at around 3.8k EPS/core, slightly beating n1-standard-2, which is at 3.7k EPS/core, by 2.6%.

Latency & throughput results from PKB testing of Pub/Sub to BigQuery Dataflow template

What about pipeline cost? Which machine size offers the best performance/cost ratio?

Let's look at resource utilization and total cost to quantify this. After each test run, PKB collects standard Dataflow metrics such as average CPU utilization and calculates the total cost based on the resources the job reports using. In our case, jobs running on n1-standard-4 incurred on average 5.3% more cost than jobs running on n1-standard-2. With an increased performance of only 2.6%, one might argue that from a performance/cost point of view, n1-standard-4 is less optimal than n1-standard-2. However, looking at CPU utilization, n1-standard-2 was highly utilized at more than 80% on average, while n1-standard-4 utilization was at a healthy average of 68.57%, offering room to respond faster to small load changes without potentially spinning up a new instance.

Utilization and cost results from PKB testing of Pub/Sub to BigQuery Dataflow template
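Put differently, relative to n1-standard-2, n1-standard-4 delivered roughly 1.026 / 1.053 ≈ 0.97 times the throughput per dollar in this test, i.e. about 2.5% less, which is the tradeoff we weighed against the extra utilization headroom described above.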

Choosing the optimum worker size sometimes involves a tradeoff between cost, throughput, and freshness of data. The choice depends on your specific workload profile and target requirements, namely throughput and event latency. In our case, the extra 5.3% in cost for n1-standard-4 is worth it, given the added performance and responsiveness. Therefore, for our specific use case and input data, we chose n1-standard-4 as the pipeline unit worker size, with a throughput of 3.8k EPS per vCPU.

Sizing & costing pipelines

“Provision for peak, and pay only for what you need.”

Now that you’ve measured (and hopefully optimized) your pipeline’s throughput per vCPU, you can deduce the pipeline size necessary to process your expected input workload as follows:
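Informally, the sizing arithmetic looks like this (a sketch; substitute your own measured numbers):

required vCPUs = target throughput (EPS) ÷ measured throughput per vCPU (EPS per vCPU)

number of workers = required vCPUs ÷ vCPUs per worker, rounded up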

Since your pipeline’s input workload is likely variable, you need to calculate both the average and the maximum pipeline size. The maximum pipeline size helps with capacity planning for peak load. The average pipeline size is necessary for cost estimation: you can plug the average number of workers and your chosen instance type into the Google Cloud Pricing Calculator to determine TCO.

Let’s go through an example. For our specific use case, let’s assume the following about our input workload profile:

Daily volume to be processed: 10 TB/day

Average element size: 1 KB

Target steady-state throughput: 125k EPS

Target peak throughput: 500k EPS (or 4x steady-state)

Peak load occurs 10% of the time

In other words, the average throughput is expected to be around 90% × 125k + 10% × 500k = 162.5k EPS.

Let’s calculate the average pipeline size:
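Below is a minimal sketch of that arithmetic in Python, assuming the measured unit throughput of 3.8k EPS per vCPU from the benchmark above and n1-standard-4 workers (4 vCPUs each); the numbers are illustrative for this example workload:

import math

# Measured pipeline unit performance (from the PKB results above).
eps_per_vcpu = 3800        # elements/sec sustained per vCPU on n1-standard-4
vcpus_per_worker = 4       # n1-standard-4 has 4 vCPUs

# Example workload profile assumed in this post.
avg_eps = 0.9 * 125000 + 0.1 * 500000   # = 162,500 EPS on average
peak_eps = 500000                        # peak throughput target

# Pipeline size = target throughput / per-vCPU throughput, converted to workers.
avg_workers = math.ceil(avg_eps / eps_per_vcpu / vcpus_per_worker)    # 11 workers
peak_workers = math.ceil(peak_eps / eps_per_vcpu / vcpus_per_worker)  # 33 workers

print(avg_workers, peak_workers)  # 11 33

The average size (11 workers) drives the cost estimate below, while the peak size (33 workers) is what you would use for capacity planning at peak load.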

To determine the pipeline’s monthly cost, we can now plug the average number of workers (11 workers) and the instance type (n1-standard-4) into the pricing calculator. Note the number of hours per month (730 on average), given this is a continuously running streaming pipeline.

How to get started

To get up and running with PKB, refer to the public PKB docs. If you prefer walkthrough tutorials, check out this beginner lab, which goes over PKB setup, PKB command-line options, and how to visualize test results in Data Studio, similar to what we did above.

The repo includes example PKB config files, including dataflow_template.yaml, which you can use to re-run the sequence of tests above. You need to replace all <MY_PROJECT> and <MY_BUCKET> instances with your own GCP project and bucket. You also need to create an input Pub/Sub subscription with your own test data preprovisioned (since test results vary based on your data), and an output BigQuery table with the correct schema to receive the test data. The PKB benchmark handles saving and restoring a snapshot of that Pub/Sub subscription for every test run iteration. You can run the entire benchmark directly from the PKB root directory:

./pkb.py --project=$PROJECT_ID \
    --benchmark_config_file=dataflow_template.yaml

To benchmark Dataflow jobs from a jar file (instead of a staged Dataflow template), refer to the wordcount_template.yaml PKB config file as an example, which you can run as follows:

./pkb.py --project=$PROJECT_ID \
    --benchmark_config_file=wordcount_template.yaml

To publish test results in BigQuery for further analysis, you need to append BigQuery-specific arguments to the above commands. For example:

./pkb.py --project=$PROJECT_ID \
    --benchmark_config_file=dataflow_template.yaml \
    --bq_project=$PROJECT_ID \
    --bigquery_table=example_dataset.dataflow_tests

What’s next?

We’ve covered how performance benchmarking can help ensure your pipeline is properly sized and configured, in order to:

meet your expected data volumes,

without hitting capacity limits, and

without breaking your cost budget

In practice, there may be many more parameters that impact your pipeline’s performance beyond just the machine size, so we encourage you to take advantage of PKB to benchmark different configurations of your pipeline and to help you make data-driven decisions around things like:

Planned pipeline feature development

Default and recommended values for your pipeline parameters. See this sizing guideline for one of the Google-provided Dataflow templates as an example of PKB benchmark results synthesized into deployment best practices.

You can also incorporate these performance tests in your pipeline development process to quickly identify and avoid performance regressions. You can automate such pipeline regression testing as part of your CI/CD pipeline – no pun intended.

Finally, there’s a lot of opportunity to further enhance PKB for Dataflow benchmarking, such as collecting more stats and adding more realistic benchmarks that are in line with your pipeline’s expected input workload. While we have tested the pipeline’s unit performance (max EPS per vCPU) under peak load here, you might want to test your pipeline’s autoscaling and responsiveness (e.g. 95th percentile for event latency) under varying load, which could be just as critical for your use case. You can file tickets to suggest features or submit pull requests and join the 100+ PKB developer community.

On that note, we’d like to acknowledge the following individuals who helped make PKB available to Dataflow end-users:

Diego Orellana, Software Engineer @ Google, PerfKit Benchmarker

Rodd Zurcher, Cloud Solutions Architect @ Google, App/Infra Modernization

Pablo Rodriguez Defino, PSO Cloud Consultant @ Google, Data & Analytics


Meet Optimus, Gojek’s open-source cloud data transformation tool

Editor’s note: Earlier this year, we heard from Gojek, the on-demand services platform, about the open-source data ingestion tool it developed for use with data warehouses like BigQuery. Today, Gojek VP of Engineering Ravi Suhag is back to discuss the open-source data transformation tool it is building.

In a recent post, we introduced Firehose, an open source solution by Gojek for ingesting data into cloud storage and data warehouse destinations like Cloud Storage and BigQuery. Today, we take a look at another project within the data transformation and data processing flow.

As Indonesia’s largest hyperlocal on-demand services platform, Gojek has diverse data needs across transportation, logistics, food delivery, and payments processing. We also run hundreds of microservices across billions of application events. While Firehose solved our need for smarter data ingestion across different use cases, our data transformation tool, Optimus, ensures the data is ready to be accessed with precision wherever it is needed.

The challenges in implementing simplicity

At Gojek, we run our data warehousing across a large number of data layers within BigQuery to standardize and model data that’s on its way to being ready for use across our apps and services. 

Gojek’s data warehouse has thousands of BigQuery tables. More than 100 analytics engineers run nearly 4,000 jobs on a daily basis to transform data across these tables. These transformation jobs process more than 1 petabyte of data every day. 

Apart from the transformation of data within BigQuery tables, teams also regularly export the cleaned data to other storage locations to unlock features across various apps and services.

This process presents a number of challenges:

Complex workflows: The large number of BigQuery tables and hundreds of analytics engineers writing transformation jobs simultaneously create a huge dependency on very complex directed acyclic graphs (DAGs) that must be scheduled and processed reliably.

Support for different programming languages: Data transformation tools must ensure standardization of inputs and job configurations, but they must also comfortably support the needs of all data users. They cannot, for instance, limit users to only a single programming language. 

Difficult-to-use transformation tools: Some transformation tools are hard to use for anyone who is not a data warehouse engineer. Having easy-to-use tools helps remove bottlenecks and ensures that every data user can produce their own analytical tables.

Integrating changes to data governance rules: Decentralizing access to transformation tools requires strict adherence to data governance rules. The transformation tool needs to ensure that columns and tables are correctly classified as personally identifiable information (PII) or non-PII, across a high volume of tables.

Time-consuming manual feature updates: New requirements for data extraction and transformation, for use in new applications and storage locations, are part of Gojek’s operational routine. We needed to design a data transformation tool that could be updated and extended with minimal development time and disruption to existing use cases.

Enabling reliable data transformation on data warehouses like BigQuery 

With Optimus, Gojek created an easy-to-use, reliable, high-performance workflow orchestrator for data transformation, data modeling, data pipelines, and data quality management. If you’re using BigQuery as your data warehouse, Optimus makes data transformation more accessible for your analysts and engineers. This is made possible through simple SQL queries and YAML configurations, with Optimus handling key demands such as dependency management and scheduling data transformation jobs to run at scale.

Key features include:

Command line interface (CLI): The Optimus command line tool offers effective access to services and job specifications. Users can create, run, and replay jobs, dump a compiled specification for a scheduler, create resource specifications for data stores, add hooks to existing jobs, and more.

Optimized scheduling: Optimus offers an easy way to schedule SQL transformations through YAML-based configuration. While it recommends Airflow by default, it is extensible enough to support other schedulers that can execute Docker containers.

Dependency resolution and dry runs: Optimus parses data transformation queries and builds dependency graphs automatically. Deployment queries are given a dry-run to ensure they pass basic sanity checks.

Powerful templating: Users can write complex transformation logic with compile time template options for variables, loops, IF statements, macros, and more.

Cross-tenant dependency: With more than two tenants registered, Optimus can resolve cross-tenant dependencies automatically.

Built-in hooks: If you need to sink a BigQuery table to Kafka, Optimus can make it happen thanks to hooks for post-transformation logic that extend the functionality of your transformations.

Extensibility with plugins: By focusing on the building blocks, Optimus leaves governance for how to execute a transformation to its plugin system. Each plugin features an adapter and a Docker image, and Optimus supports Python transformation for easy custom plugin development.

Key advantages of Optimus

Like Google Cloud, Gojek is all about flexibility and agility, so we love to see open source software like Optimus helping users take full advantage of multi-tenancy solutions to meet their specific needs. 

Through a variety of configuration options and a robust CLI, Optimus ensures that data transformation remains fast and focused by preparing SQL correctly. Optimus handles all scheduling, dependencies, and table creation. With the capability to build custom features quickly through Optimus plugins as new needs arise, you can explore more possibilities. Errors are also minimized with a configurable alert system that flags job failures immediately. Whether to email or Slack, you can trigger alerts based on your specific needs, from point-of-failure notifications to SLA-based warnings.

How you can contribute

With Firehose and Optimus working in tandem with Google Cloud, Gojek is helping pave the way in building tools that enable data users and engineers to achieve fast results in complex data environments.

Optimus is developed and maintained on GitHub and uses Requests for Comments (RFCs) to communicate ideas for its ongoing development. The team is always keen to receive bug reports, feature requests, help with documentation, and general discussion as part of its Slack community.


What’s New with Google’s Unified, Open and Intelligent Data Cloud

We’re fortunate to work with some of the world’s most innovative customers on a daily basis, many of whom come to Google Cloud for our well-established expertise in data analytics and AI. As we’ve worked and partnered with these data leaders, we have encountered similar priorities among many of them: to remove the barriers of data complexity, unlock new use cases, and reach more people with more impact. 

These innovators and industry disruptors power their data innovation with a data cloud that lets their people work with data of any type, any source, any size, and at any speed, without capacity limits. A data cloud that lets them easily and securely move across workloads: from SQL to Spark, from business intelligence to machine learning, with little infrastructure setup required. A data cloud that acts as the open data ecosystem foundation needed to create data products that employees, customers, and partners use to drive meaningful decisions at scale.

On October 11, we will be unveiling a series of new capabilities at Google Cloud Next ‘22 that continue to support this vision. If you haven’t registered yet for the Data Cloud track at Google Next, grab your spot today! 

But I know you data devotees probably can’t wait until then. So, we wanted to take some time before Next to share some recent innovations for data cloud that are generally available today. Consider these the data hors d’oeuvres to your October 11 data buffet.

Removing the barriers of data sharing, real-time insights, and open ecosystems

The data you need is rarely stored in one place. More often than not, data is scattered across multiple sources and in various formats. While data exchanges were introduced decades ago, their results have been mixed. Traditional data exchanges often require painful data movement and can be mired in security and regulatory issues.

This unique use case led us to design Analytics Hub, now generally available, as the data sharing platform for teams and organizations who want to curate internal and external exchanges securely and reliably. 

This innovation not only allows for the curation and sharing of a large selection of analytics-ready datasets globally, it also enables teams to tap into the unique datasets only Google provides, such as Google Search Trends or the Data Commons knowledge graph.

Analytics Hub is a first-class experience within BigQuery. This means you can try it now for free using BigQuery, without having to enter any credit card information. 

Analytics Hub is not the only way to bring data into your analytical environment rapidly. We recently launched a new way to extract, load, and transform data in real time into BigQuery: the Pub/Sub “BigQuery subscription.” This new ELT capability streamlines streaming ingestion workloads: it is simpler to implement and more economical, since you don’t need to spin up new compute to move data and you no longer need to pay for streaming ingestion into BigQuery.

But what if your data is distributed across lakes, warehouses, multiple clouds, and file formats? As more users demand more use cases, the traditional approach to build data movement infrastructure can prove difficult to scale, can be costly, and introduces risk. 

That’s why we introduced BigLake, a new storage engine that extends BigQuery storage innovation to open file formats running on public cloud object stores. BigLake lets customers build secure data lakes over open file formats. And, because it provides consistent, fine-grained security controls for Google Cloud and open-source query engines, security only needs to be configured in one place to be enforced everywhere.

Customers like Deutsche Bank, Synapse LLC, and Wizard have been taking advantage of BigLake in preview. Now that BigLake is generally available, I invite you to learn how it can help to build your own data ecosystem.

Unlocking the ways of working with data

When data ecosystems expand to data of all shape, size, type, and format, organizations struggle to innovate quickly because their people have to move from one interface to the next, based on their workloads. 

This problem is often encountered in the field of machine learning, where the interface for ML is often different than that of business analysis. Our experience with BigQuery ML has been quite different: customers have been able to accelerate their path to innovation drastically because machine learning capabilities are built-in as part of BigQuery (as opposed to “bolted-on” in the case of alternative solutions).

We’re now applying the same philosophy to log data by offering a Log Analytics service in Cloud Logging. This new capability, currently in preview, gives users the ability to gain deeper insights into their logging data with BigQuery. Log Analytics comes at no additional charge beyond existing Cloud Logging fees and takes advantage of soon-to-be generally available BigQuery features designed for analytics on logs: search indexes, a JSON data type, and the Storage Write API.

Customers that store, explore, and analyze their own machine generated data from servers, sensors, and other devices can tap into these same BigQuery features to make querying their logs a breeze. Users simply use standard BigQuery SQL to analyze operational log data alongside the rest of their business data!

And there’s still more to come. We can’t wait to engage with you on Oct 11, during Next’22, to share more of the next generation of data cloud solutions. To tune into sessions tailored to your particular interests or roles, you can find top Next sessions for Data Engineers, Data Scientists, and Data Analysts — or create and share your own.

Join us at Next’22 to hear how leaders like Boeing, Twitter, CNA Insurance, Telus, L’Oreal, and Wayfair, are transforming data-driven insights with Google’s data cloud.
