Built with BigQuery: How Exabeam delivers a petabyte-scale cybersecurity solution

Editor’s note: This post is part of a series highlighting our partners, and their solutions, that are Built with BigQuery.

Exabeam, a leader in SIEM and XDR, provides security operations teams with end-to-end Threat Detection, Investigation, and Response (TDIR). By combining user and entity behavioral analytics (UEBA) with security orchestration, automation, and response (SOAR), Exabeam allows organizations to quickly resolve cybersecurity threats. As the company looked to take its cybersecurity solution to the next level, it partnered with Google Cloud to unlock its ability to scale for storage, ingestion, and analysis of security data.

Harnessing the power of Google Cloud products including BigQuery, Dataflow, Looker, Spanner, and Bigtable, the company is now able to ingest data from more than 500 security vendors, convert unstructured data into security events, and store them on a common, cost-effective platform. The scale and power of Google Cloud enable Exabeam customers to search multi-year data and detect threats in seconds.

Google Cloud provides Exabeam with three critical benefits.  

Global scale security platform. Exabeam leveraged serverless Google Cloud data products to speed up platform development. The Exabeam platform supports horizontal scale with built-in resiliency (backed by 99.99% reliability) and data backups across three other zones per region. Multi-tenancy with tenant data separation, data masking, and encryption in transit and at rest is likewise backed by the data cloud products Exabeam uses from Google Cloud.

Scale data ingestion and processing. By leveraging Google’s compute capabilities, Exabeam can differentiate itself from other security vendors that are still struggling to process large volumes of data. With Google Cloud, Exabeam can provide a path to scale data processing pipelines. This allows Exabeam to offer robust processing to model threat scenarios with data from more than 500 security and IT vendors in near-real time. 

Search and detection in seconds. Traditionally, security solutions break down data into silos to offer efficient and cost-effective search. Thanks to the speed and capacity of BigQuery, Security Operations teams can search across different tiers of data in near real time. The ability to search data more than a year old in seconds, for example, can help security teams hunt for threats simultaneously across recent and historical data. 

Exabeam joins more than 700 tech companies powering their products and businesses using data cloud products from Google, such as BigQuery, Looker, Spanner, and Vertex AI. Google Cloud announced the Built with BigQuery initiative at the Google Data Cloud Summit in April, which helps Independent Software Vendors like Exabeam build applications using data and machine learning products. By providing dedicated access to technology, expertise, and go-to-market programs, this initiative can help tech companies accelerate, optimize, and amplify their success.

Google’s data cloud provides a complete platform for building data-driven applications like those from Exabeam — from simplified data ingestion, processing, and storage to powerful analytics, AI, ML, and data sharing capabilities — all integrated with the open, secure, and sustainable Google Cloud platform. With a diverse partner ecosystem and support for multi-cloud, open-source tools, and APIs, Google Cloud can help provide technology companies the portability and the extensibility they need to avoid data lock-in.   

To learn more about Exabeam on Google Cloud, visit www.exabeam.com. Click here to learn more about Google Cloud’s Built with BigQuery initiative. 

We thank the many Google Cloud team members who contributed to this ongoing security collaboration and review, including Tom Cannon and Ashish Verma in Partner Engineering.


Now in preview, BigQuery BI Engine Preferred Tables

Earlier this quarter, we announced that BigQuery BI Engine support for all BI and custom applications was generally available. Today we are excited to announce the preview launch of Preferred Tables support in BigQuery BI Engine! BI Engine is an in-memory analysis service that helps customers get low-latency performance for their queries across all BI tools that connect to BigQuery. With support for preferred tables, BigQuery customers can now prioritize specific tables for acceleration, achieving predictable performance and optimized use of their BI Engine resources.

BigQuery BI Engine is designed to help deliver the freshest insights without sacrificing query performance by accelerating your most popular dashboards and reports. It provides intelligent scaling and ease of configuration: customers do not have to change their BI tools or the way they interact with BigQuery; they simply create a project-level memory reservation. BI Engine’s smart caching algorithm keeps frequently queried data in memory for faster response times. BI Engine also creates replicas of the data being queried to support concurrent access; this is based on query patterns and does not require manual tuning from the administrator.

However, some workloads are more latency sensitive than others. Customers therefore want more control over which tables are accelerated within a project, to ensure reliable performance and better utilization of their BI Engine reservations. Before this feature, BigQuery BI Engine customers could achieve this by using separate projects for only those tables that need acceleration. However, that requires additional configuration and is not a good reason to maintain separate projects.

With the launch of preferred tables in BI Engine, you can now tell BI Engine which tables should be accelerated. For example, suppose two types of tables are queried from your project: a set of pre-aggregated or dimension tables queried by dashboards for executive reporting, and the tables used for ad hoc analysis. You can now ensure that your reporting dashboards get predictable performance by configuring the first set as ‘preferred tables’ in the BigQuery project. That way, other workloads from the same project will not consume the memory required for interactive use cases.

Getting started

To use preferred tables, you can use the Cloud console, the BigQuery Reservation API, or a data definition language (DDL) statement in SQL. We will show the UI experience below. You can find detailed documentation of the preview feature here.

You can simply edit the existing BI Engine configuration in the project. You will see an optional step for specifying preferred tables, followed by a box where you specify the tables you want to set as preferred.

The next step is to confirm and submit the configuration and you will be ready to go! 

Alternatively, you can also achieve this by issuing a DDL statement in SQL editor as follows:

ALTER BI_CAPACITY `<PROJECT_ID>.region-<REGION>.default`
SET OPTIONS(
  size_gb = 100,
  preferred_tables = ["bienginedemo.faadata.faadata1"]);
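If you want to confirm the configuration from SQL, one option is to query the BI Engine capacity metadata. The sketch below is an assumption-labeled example: it relies on the INFORMATION_SCHEMA.BI_CAPACITIES view and its size and preferred_tables columns, so check the current BigQuery documentation for the exact view and column names in your region.

-- Hedged sketch: inspect the current BI Engine reservation for this project.
-- Assumes the INFORMATION_SCHEMA.BI_CAPACITIES view exposes size and preferred_tables;
-- verify against the documentation before relying on it.
SELECT
  project_id,
  size,              -- reserved memory for BI Engine
  preferred_tables   -- tables currently prioritized for acceleration
FROM
  `region-us`.INFORMATION_SCHEMA.BI_CAPACITIES;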

This feature is available in all regions today and rolled out to all BigQuery customers. Please give it a spin!


Twitter: gaining insights from Tweets with an API for Google Cloud

Editor’s note: Although Twitter has long been considered a treasure trove of data, the task of analyzing Tweets in order to understand what’s happening in the world, what people are talking about right now, and how this information can support business use cases has historically been highly technical and time-consuming. Not anymore. Twitter recently launched an API toolkit for Google Cloud which helps developers to harness insights from Tweets, at scale, within minutes. This blog is based on a conversation with the Twitter team who’ve made this possible. The authors would like to thank Prasanna Selvaraj and Nikki Golding from Twitter for contributions to this blog. 

Businesses and brands consistently monitor Twitter for a variety of reasons: from tracking the latest consumer trends and analyzing competitors, to staying ahead of breaking news and responding to customer service requests. With 229 million monetizable daily active users, it’s no wonder companies, small and large, consider Twitter a treasure trove of data with huge potential to support business intelligence. 

But language is complex, and the journey towards transforming social media conversations into insightful data involves first processing large amounts of Tweets by ways of organizing, sorting, and filtering them. Crucial to this process are Twitter APIs: a set of programmatic endpoints that allow developers to find, retrieve, and engage with real-time public conversations happening on the platform. 

In this blog, we learn from the Twitter Developer Platform Solutions Architecture team about the Twitter API toolkit for Google Cloud, a new framework for quickly ingesting, processing, and analyzing high volumes of Tweets to help developers harness the power of Twitter. 

Making it easier for developers to surface valuable insights from Tweets 

Two versions of the toolkit are currently available: The Twitter API Toolkit for Google Cloud Filtered Stream and the Twitter API Toolkit for Google Cloud Recent Search.

The Twitter API Toolkit for Google Cloud Filtered Stream supports developers with a trend detection framework that can be installed on Google Cloud in 60 minutes or less. It automates the data pipeline process to ingest Tweets into Google Cloud, and offers visualization of trends in an easy-to-use dashboard that illustrates real-time trends for configured rules as they unfold on Twitter. This tool can be used to detect macro- and micro-level trends across domains and industry verticals, and can scale horizontally to process millions of Tweets per day.

“Detecting trends from Twitter requires listening to real-time Twitter APIs and processing Tweets on the fly,” explains Prasanna Selvaraj, Solutions Architect at Twitter and author of this toolkit. “And while trend detection can be complex work, in order to categorize trends, tweet themes and topics must also be identified. This is another complex endeavor as it involves integrating with NER (Named Entity Recognition) and/or NLP (Natural Language Processing) services. This toolkit helps solve these challenges.”

Meanwhile, the Twitter API Toolkit for Google Cloud Recent Search returns Tweets from the last seven days that match a specific search query. “Anyone with 30 minutes to spare can learn the basics about this Twitter API and, as a side benefit, also learn about Google Cloud Analytics and the foundations of data science,” says Prasanna.

The toolkits leverage Twitter’s new API v2 (Recent Search & Filtered Stream) and use BigQuery for Tweet storage, Data Studio for business intelligence and visualizations, and App Engine for the data pipeline on the Google Cloud Platform.

“We needed a solution that is not only serverless but also can support multi-cardinality, because all Twitter APIs that return Tweets provide data encoded using JavaScript Object Notation (JSON). This has a complex structure, and we needed a database that can easily translate it into its own schema. BigQuery is the perfect solution for this,” says Prasanna. “Once in BigQuery, one can visualize that data in under 10 minutes with Data Studio, be it in a graphic, spreadsheet, or Tableau form. This eliminates friction in Twitter data API consumption and significantly improves the developer experience.” 
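To make that concrete, here is a minimal, hypothetical sketch of the kind of query an analyst might run once Tweets land in BigQuery. The table tweets_dataset.raw_tweets and its payload and ingest_time columns are illustrative assumptions rather than the toolkit’s documented schema; the example simply uses BigQuery’s JSON functions to pull a field out of the Tweet JSON.

-- Hedged sketch: hourly Tweet counts for a keyword.
-- `tweets_dataset.raw_tweets`, `payload`, and `ingest_time` are hypothetical names.
SELECT
  TIMESTAMP_TRUNC(ingest_time, HOUR) AS tweet_hour,
  COUNT(*) AS tweet_count
FROM
  tweets_dataset.raw_tweets
WHERE
  LOWER(JSON_VALUE(payload, '$.text')) LIKE '%bigquery%'
GROUP BY
  tweet_hour
ORDER BY
  tweet_hour DESC;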

Accelerating time to value from 60 hours to 60 minutes

Historically, Twitter API developers have often grappled with processing, analyzing, and visualizing a higher volume of Tweets to derive insights from Twitter data. They’ve had to build data pipelines, select storage solutions, and choose analytics and visualization tools as the first step before they can start validating the value of Twitter data. 

“The whole process of choosing technologies and building data pipelines to look for insights that can support a business use case can take more than 60 hours of a developer’s time,” explains Prasanna. “And after investing that time in setting up the stack they still need to sort through the data to see if what they are looking for actually exists.”

Now, the toolkit enables data processing automation at the click of a button because it provisions the underlying infrastructure it needs to work, such as BigQuery as a database and the compute layer with App Engine. This enables developers to install, configure, and visualize Tweets in a business intelligence tool using Data Studio in less than 60 minutes.

“While we have partners who are very well equipped to connect, consume, store, and analyze data, we also collaborate with developers from organizations who don’t have a myriad of resources to work with. This toolkit is aimed at helping them to rapidly prototype and realize value from Tweets before making a commitment,” explains Nikki Golding, Head of Solutions Architecture at Twitter.

Continuing to build what’s next for developers

As they collaborated with Google Cloud to bring the toolkit to life, the Twitter team started to think about what public datasets exist within the Google Cloud Platform and how they can complement some of the topics that Twitter has a lot of conversations about, from crypto to weather. “We thought, what are some interesting ways developers can access and leverage what both platforms have to offer?” shares Nikki. “Twitter data on its own has high value, but there’s also data that is resident in Google Cloud Platform that can further support users of the toolkit. The combination of Google Cloud Platform infrastructure and application as a service with Twitter’s data as a service is the vision we’re marching towards.”

Next, the Twitter team aims to place these data analytics tools in the hands of any decision-maker, both in technical and non-technical teams. “To help brands visualize, slice, and dice data on their own, we’re looking at self-serve tools that are tailored to the non-technical person to democratize the value of data across organizations,” explains Nikki. “Google Cloud was the platform that allowed us to build the easiest low-code solution relative to others in the market so far, so our aim is to continue collaborating with Google Cloud to eventually launch a no-code solution that helps people to find the content and information they need without depending on developers. Watch this space!”


Earn Google Cloud swag when you complete the #LearnToEarn challenge

The MLOps market is expected to grow to around $700m by 2025¹. With the Google Cloud Professional Data Engineer certification topping the list of highest paying IT certifications in 2021², there has never been a better time to grow your data and ML skills with Google Cloud. 

Introducing the Google Cloud #LearnToEarn challenge 

Starting today, you’re invited to join the data and ML #LearnToEarn challenge, a high-intensity workout for your brain. Get the ML, data, and AI skills you need to drive speedy transformation in your current and future roles with no-cost access to over 50 hands-on labs on Google Cloud Skills Boost. Race the clock with players around the world, collect badges, and earn special swag!

How to complete the #LearnToEarn challenge

The challenge will begin with a core data analyst learning track. Then each week you’ll get new tracks designed to help you explore a variety of career paths and skill sets. Keep an eye out for trivia and flash challenges too!  

As you progress through the challenge and collect badges, you’ll qualify for rewards at each step of your journey. But time and supplies are limited – so join today and complete by July 19! 

What’s involved in the challenge? 

Labs range from introductory to expert level. You’ll get hands-on experience with cutting-edge tech like Vertex AI and Looker, plus data differentiators like BigQuery, TensorFlow, integrations with Workspace, and AutoML Vision. The challenge starts with the basics, then gets gradually more complex as you reach each milestone. One lab takes anywhere from ten minutes to about an hour to complete. You do not have to finish all the labs at once – but do keep an eye on start and end dates. 

Ready to take on the challenge?

Join the #LearnToEarn challenge today!

1. IDC, Market Analysis Perspective: Worldwide AI Life-Cycle Software, September 2021
2. Skillsoft Global Knowledge, 15 top-paying IT certifications list 2021, August 2021


Learn how BI Engine enhances BigQuery query performance

BigQuery BI Engine is a fast, in-memory analysis service that lets users analyze data stored in BigQuery with rapid response times and with high concurrency to accelerate certain BigQuery SQL queries. BI Engine caches data instead of query results, allowing different queries over the same data to be accelerated as you look at different aspects of the data. By using BI Engine with BigQuery streaming, you can perform real-time data analysis over streaming data without sacrificing write speeds or data freshness.

BI Engine architecture

The BI Engine SQL interface expands BI Engine support to any business intelligence (BI) tool that works with BigQuery, such as Looker, Tableau, Power BI, and custom applications, to accelerate data exploration and analysis. With BI Engine, you can build rich, interactive dashboards and reports in the BI tool of your choice without compromising performance, scale, security, or data freshness. To learn more about the BI Engine SQL interface, please refer to the documentation here.

The following diagram shows the updated architecture for BI Engine:

Shown here is one simple example of a Looker dashboard that was created with a BI Engine capacity reservation (top) versus the same dashboard without any reservation (bottom). This dashboard is created from the BigQuery public dataset `bigquery-public-data.chicago_taxi_trips.taxi_trips` to analyze the sum of total trip cost and the logarithmic average of total trip cost over time.

total_trip cost for past 5 years

Running business intelligence on big data can be tricky. BI Engine caches the minimum amount of data needed to resolve a query, maximizing the capacity of the reservation.

Here is a query against the same public dataset, `bigquery-public-data.chicago_taxi_trips.taxi_trips`, to demonstrate BI Engine performance with and without reserved BigQuery slots.

Example Query

SELECT
  (DATE(trip_end_timestamp, 'America/Chicago')) AS trip_end_timestamp_date,
  (DATE(trip_start_timestamp, 'America/Chicago')) AS trip_start_timestamp_date,
  COALESCE(SUM(CAST(trip_total AS FLOAT64)), 0) AS sum_trip_total,
  CONCAT('Hour :', (DATETIME_DIFF(trip_end_timestamp, trip_start_timestamp, DAY) * 1440), ' , ', 'Day :', (DATETIME_DIFF(trip_end_timestamp, trip_start_timestamp, DAY))) AS trip_time,
  CASE
    WHEN ROUND(fare + tips + tolls + extras) = trip_total THEN 'Tallied'
    WHEN ROUND(fare + tips + tolls + extras) < trip_total THEN 'Tallied Less'
    WHEN ROUND(fare + tips + tolls + extras) > trip_total THEN 'Tallied More'
    WHEN (ROUND(fare + tips + tolls + extras) = 0.0 AND trip_total = 0.0) THEN 'Tallied 0'
    ELSE 'N/A' END AS trip_total_tally,
  REGEXP_REPLACE(TRIM(company), 'null', 'N/A') AS company,
  CASE
    WHEN TRIM(payment_type) = 'Unknown' THEN 'N/A'
    WHEN payment_type IS NULL THEN 'N/A'
    ELSE payment_type END AS payment_type
FROM
  `bigquery-public-data.chicago_taxi_trips.taxi_trips`
GROUP BY
  1, 2, 4, 5, 6, 7
ORDER BY
  1 DESC, 2, 4 DESC, 5, 6, 7
LIMIT 5000

The above query was run with the following combinations: 

Without any BigQuery slot reservation/BI Engine reservation,  the query observed 7.6X more average slots and 6.3X more job run time compared to the run with reservations (last stats in the result). 

Without BI Engine reservation but with BigQuery slot reservation, the query observed 6.9X more average slots and 5.9X more job run time compared to the run with reservations (last stats in the result). 

With BI Engine reservation and no BigQuery slot reservation, the query observed 1.5X more average slots and the job completed in sub-seconds (868 ms). 

With both BI Engine reservation and BigQuery slot reservation, only 23 average slots were used and the job completed in sub-seconds, as shown in the results. This is the most cost-effective option with regard to average slots and run time compared to all other options (23.27 avg_slots, 855 ms run time).
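To put those multipliers in absolute terms (taking the fully reserved run of 23.27 average slots and 855 ms as the baseline, which is how the comparisons above are stated): 23.27 × 7.6 ≈ 177 average slots and 855 ms × 6.3 ≈ 5.4 seconds for the run with no reservations at all, and 23.27 × 6.9 ≈ 161 average slots and 855 ms × 5.9 ≈ 5.0 seconds for the run with only a BigQuery slot reservation.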

INFORMATION_SCHEMA is a series of views that provide access to metadata about datasets, routines, tables, views, jobs, reservations, and streaming data. You can query the INFORMATION_SCHEMA.JOBS_BY_* view to retrieve real-time metadata about BigQuery jobs. This view contains currently running jobs, and the history of jobs completed in the past 180 days.

The following query determines bi_engine_statistics and the number of slots used per job. More schema information can be found here.

SELECT
  project_id,
  job_id,
  reservation_id,
  job_type,
  TIMESTAMP_DIFF(end_time, creation_time, MILLISECOND) AS job_duration_mseconds,
  CASE
    WHEN job_id = 'bquxjob_54033cc8_18164d54ada' THEN 'YES_BQ_RESERV_NO_BIENGINE'
    WHEN job_id = 'bquxjob_202f17eb_18149bb47c3' THEN 'NO_BQ_RESERV_NO_BIENGINE'
    WHEN job_id = 'bquxjob_404f2321_18164e0f801' THEN 'YES_BQ_RESERV_YES_BIENGINE'
    WHEN job_id = 'bquxjob_48c8910d_18164e520ac' THEN 'NO_BQ_RESERV_YES_BIENGINE'
    ELSE 'NA' END AS query_method,
  bi_engine_statistics,
  -- Average slot utilization per job is calculated by dividing
  -- total_slot_ms by the millisecond duration of the job
  SAFE_DIVIDE(total_slot_ms, (TIMESTAMP_DIFF(end_time, start_time, MILLISECOND))) AS avg_slots
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 80 DAY) AND CURRENT_TIMESTAMP()
  AND end_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY) AND CURRENT_TIMESTAMP()
  AND job_id IN ('bquxjob_202f17eb_18149bb47c3', 'bquxjob_54033cc8_18164d54ada', 'bquxjob_404f2321_18164e0f801', 'bquxjob_48c8910d_18164e520ac')
ORDER BY avg_slots DESC

Based on these observations, the most effective way to improve performance for BI queries is to use a BI Engine reservation along with a BigQuery slot reservation. This increases query performance and throughput while using fewer slots. Reserving BI Engine capacity lets you save on slots in your projects.

BigQuery BI Engine optimizes the standard SQL functions and operators when connecting business intelligence (BI) tools to BigQuery. Optimized SQL functions and operators for BI Engine are found here.

Monitor BI Engine with Cloud Monitoring

BigQuery BI Engine integrates with Cloud Monitoring so you can monitor BI Engine metrics and configure alerts.

For information on using Monitoring to create charts for your BI Engine metrics, see Creating charts in the Monitoring documentation.

We ran the same query without a BI Engine reservation and noticed that 15.47 GB were processed.

After BI Engine capacity reservation, in Monitoring under the BIE Reservation Used Bytes dashboard, we saw a compression ratio of ~11.74x (15.47 GB / 1.317 GB). However, compression is very data-dependent; it depends primarily on data cardinality. Customers should run tests on their own data to determine their compression rate.

The monitoring metric ‘Reservation Total Bytes’ gives information about the BI Engine capacity reservation, whereas ‘Reservation Used Bytes’ gives information about the total bytes used. Customers can use these two metrics to arrive at the right capacity to reserve. 

When a project has BI Engine capacity reserved, queries running in BigQuery use BI Engine to accelerate compatible subqueries. The degree of acceleration of the query falls into one of the following modes:

BI Engine Mode FULL – BI Engine compute was used to accelerate leaf stages of the query, but the data needed may be in memory or may need to be scanned from disk. Even when BI Engine compute is utilized, BigQuery slots may also be used for parts of the query; the more complex the query, the more slots are used. This mode executes all leaf stages in BI Engine (and sometimes all stages).

BI Engine Mode PARTIAL – BI Engine accelerates compatible subqueries, and BigQuery processes the subqueries that are not compatible with BI Engine. This mode also provides a bi-engine-reason for not using BI Engine fully. It executes some leaf stages in BI Engine and the rest in BigQuery.

BI Engine Mode DISABLED – When the subqueries are not compatible with acceleration, all leaf stages are processed in BigQuery. This mode also provides a bi-engine-reason for not using BI Engine fully or partially.
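To see which mode a given job ran in, you can extend the INFORMATION_SCHEMA query shown earlier. The sketch below assumes the bi_engine_statistics record in the JOBS_BY_PROJECT view exposes a bi_engine_mode field, as described in the BigQuery documentation; verify the exact field names before using it.

-- Hedged sketch: list recent jobs with their BI Engine acceleration mode.
-- Assumes bi_engine_statistics.bi_engine_mode exists on the JOBS_BY_PROJECT view.
SELECT
  job_id,
  creation_time,
  bi_engine_statistics.bi_engine_mode AS bi_engine_mode
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND bi_engine_statistics IS NOT NULL
ORDER BY
  creation_time DESC;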

Note that when you purchase a flat-rate reservation, BI Engine capacity (GB) is provided as part of the monthly flat-rate price. You can get up to 100 GB of BI Engine capacity included for free with a 2,000-slot annual commitment. Because BI Engine reduces the number of slots consumed by BI queries, purchasing fewer slots and topping up the included BI Engine capacity with a little more may satisfy your requirements instead of buying additional slots!

References

bi-engine-intro, bi-engine-reserve-capacity, streaming-api, bi-engine-sql-interface-overview, bi-engine-pricing

To learn more about how BI Engine and BigQuery can help your enterprise, try out the Quickstarts listed below:

bi-engine-data-studio, bi-engine-looker, bi-engine-tableau


Google Cloud Data Heroes Series: Meet Francisco, the Ecuadorian American

Google Cloud Data Heroes is a series where we share stories of the everyday heroes who use our data analytics tools to do incredible things. Like any good superhero tale, we explore our Google Cloud Data Heroes’ origin stories, how they moved from data chaos to a data-driven environment, what projects and challenges they are overcoming now, and how they give back to the community.

In this month’s edition, we’re pleased to introduce Francisco! He is based out of Austin, Texas, but you’ll often find him in Miami, Mexico City, or Bogotá, Colombia. Francisco is the founder of Direcly, a Google Marketing Platform and Google Cloud Consulting/Sales Partner with a presence in the US and Latin America.

Francisco was born in Quito, Ecuador, and at age 13, came to the US to live with his father in Miami, Florida. He studied Marketing at Saint Thomas University, and his skills in math landed him a job as Teaching Assistant for Statistics & Calculus. After graduation, his professional career began at some of the nation’s leading ad agencies before he eventually transitioned into the ad tech space. In 2016, he ventured into the entrepreneurial world and founded Direcly, a Google Marketing Platform, Google Cloud, and Looker Sales/Consulting partner obsessed with using innovative technological solutions to solve business challenges. Against many odds and with no external funding since its inception, Direcly became part of a select group of Google Cloud and Google Marketing Platform partners. Francisco’s story was even featured in a Forbes Ecuador article.

Outside of the office, Francisco is an avid comic book reader/collector, a golfer, and fantasy adventure book reader. His favorite comic book is The Amazing Spider-Man #252, and his favorite book is The Hobbit. He says he isn’t the best golfer, but can ride the cart like a pro.

When were you introduced to the cloud, tech, or data field? What made you pursue this in your career? 

I began my career in marketing/advertising, and I was quickly drawn to the tech/data space, seeing the critical role it played. I’ve always been fascinated by technology and how fast it evolves. My skills in math and tech ended up being a good combination. 

I began learning some open source solutions like Hadoop, Spark, and MySQL for fun and started to apply them in roles I had throughout my career.  After my time in the ad agency world, I transitioned into the ad tech industry, where I was introduced to how cloud solutions were powering ad tech solutions like demand side, data management, and supply side platforms. 

I’m the type of person that can get easily bored doing the same thing day in and day out, so I pursued a career in data/tech because it’s always evolving. As a result, it forces you to evolve with it. I love the feeling of starting something from scratch and slowly mastering a skill.

What courses, studies, degrees, or certifications were instrumental to your progression and success in the field? In your opinion, what data skills or competencies should data practitioners be focusing on acquiring to be successful in 2022 and why? 

My foundation in math, calculus, and statistics was instrumental for me.  Learning at my own pace and getting to know the open source solutions was a plus. What I love about Google is that it provides you with an abundance of resources and information to get started, become proficient, and master skills. Coursera is a great place to get familiar with Google Cloud and prepare for certifications. Quests in Qwiklabs are probably one of my favorite ways of learning because you actually have to put in the work and experience first hand what it’s like to use Google Cloud solutions. Lastly, I would also say that just going to the Google Cloud internal documentation and spending some time reading and getting familiar with all the use cases can make a huge difference. 

For those who want to acquire the right skills, I would suggest starting with the fundamentals. Before jumping into Google Cloud, make sure you have a good understanding of Python, SQL, data, and some popular open source tools. From there, start mastering Google Cloud by first learning the fundamentals and then putting things into practice with Labs. Obtain a professional certification — it can be quite challenging but it is rewarding once you’ve earned it. If possible, add more dimension to your data expertise by studying real life applications in an industry that you are passionate about. 

I am fortunate to be a Google Cloud Certified Professional Data Engineer and hold certifications in Looker, Google Analytics, Tag Manager, Display and Video 360, Campaign Manager 360, Search Ads 360, and Google Ads. I am also currently working to obtain my Google Cloud Machine Learning Engineer Certification. Combining data applications with analytics and marketing has proven instrumental throughout my career. The ultimate skill is not knowledge or competency in a specific topic, but the ability to have a varied range of abilities and views in order to solve complicated challenges.

You’re no doubt a thought leader in the field. What drew you to Google Cloud? How have you given back to your community with your Google Cloud learnings?

Google Cloud solutions are highly distributed, allowing companies to use the same resources an organization like Google uses internally, but for their own business needs. With Google being a clear leader in the analytics/marketing space, the possibilities and applications are endless. As a Google Marketing Platform Partner and having worked with the various ad tech stacks Google has to offer, merging Google Cloud and GMP for disruptive outcomes and solutions is really exciting.  

I consider myself to be a very fortunate person, who came from a developing country, and was given amazing opportunities from both an educational and career standpoint. I have always wanted to give back in the form of teaching and creating opportunities, especially for Latinos / US Hispanics. Since 2018, I’ve partnered with Florida International University Honors College and Google to create industry relevant courses. I’ve had the privilege to co-create the curriculum and teach on quite a variety of topics. We introduced a class called Marketing for the 21st Century, which had a heavy emphasis on the Google Marketing Platform. Given its success, in 2020, we introduced Analytics for the 21st Century, where we incorporated key components of Google Cloud into the curriculum. Students were even fortunate enough to learn from Googlers like Rob Milks (Data Analytics Specialist) and Carlos Augusto (Customer Engineer).

What are 1-2 of your favorite projects you’ve done with Google Cloud’s data products? 

My favorite project to date is the work we have done with Royal Caribbean International (RCI) and Roar Media. Back in 2018, we were able to transition RCI efforts from a fragmented ad tech stack into a consolidated one within the Google Marketing Platform. Moreover, we were able to centralize attribution across all the paid marketing channels. With the vast amount of data we were capturing (17+ markets), it was only logical to leverage Google Cloud solutions in the next step of our journey. We centralized all data sources in the warehouse and deployed business intelligence across business units. 

The biggest challenge from the start was designing an architecture that would meet both business and technical requirements. We had to consider the best way to ingest data from several different sources, unify them, have the ability to transform data as needed, visualize it for decision makers, and set the foundations to apply machine learning. Having a deep expertise in marketing/analytics platforms combined with an understanding of data engineering helped me tremendously in leading the process, designing/implementing the ideal architecture, and being able to present end users with information that makes a difference in their daily jobs.      

We utilized BigQuery as a centralized data warehouse to integrate all marketing sources (paid, organic, and research) though custom built pipelines. From there we created data driven dashboards within Looker, de-centralizing data and giving end users the ability to explore and answer key questions and make real time data driven business decisions. An evolution of this initiative has been able to go beyond marketing data and apply machine learning. We have created dashboards that look into covid trends, competitive pricing, SEO optimizations, and data feeds for dynamic ads. From the ML aspect, we have created predictive models on the revenue side, mixed marketing modeling, and applied machine learning to translate English language ads to over 17 languages leveraging historical data.

What are your favorite Google Cloud data products within the data analytics, databases, and/or AI/ML categories? What use case(s) do you most focus on in your work? What stands out about Google Cloud’s offerings?

I am a big fan of BigQuery (BQ) and Looker. Traditional data warehouses are no match for the cloud – they’re not built to accommodate the exponential growth of today’s data and the sophisticated analytics required. BQ offers a fast, highly scalable, cost-effective and fully controlled cloud data warehouse for integrated machine learning analytics and the implementation of AI. 

Looker, on the other hand, is truly next generation BI. We all love Structured Query Language (SQL), but I think many of us have been in the position of writing dense queries and forgetting how some aspects of the code work, experiencing the limited collaboration options, knowing that people write queries in different ways, and how difficult it can be to track changes in a query if you changed your mind on a measure. I love how LookML solves all those challenges, and how it helps one reuse, control, and separate SQL into building blocks. Not to mention how easy it is to give end users with limited technical knowledge the ability to look at data on their terms.

What’s next for you?

I am really excited about everything we are doing at Direcly. We have come a long way, and I’m optimistic that we can go even further. Next for me is just to keep on working with a group of incredibly bright people who are obsessed with using innovative technological solutions to solve business challenges faced by other incredibly bright people.

From this story I would like to tell those that are pursuing a dream, that are looking to provide a better life for themselves and their loved ones, to do it, take risks, never stop learning, and put in the work. Things may or may not go your way, but keep persevering — you’ll be surprised with how it becomes more about the journey than the destination. And whether things don’t go as planned, or you have a lot of success, you will remember everything you’ve been through and how far you’ve come from where you started.  

Want to join the Data Engineer Community?

Register for the Data Engineer Spotlight, where attendees have the chance to learn from four technical how-to sessions and hear from Google Cloud Experts on the latest product innovations that can help you manage your growing data.


Four back-to-school and off-to-college consumer trends retailers should know

Is it September yet? Hardly! School is barely out for the summer. But according to Google and Quantum Metric research, the back-to-school and off-to-college shopping season – which in the U.S. is second only to the holidays in terms of purchasing volume¹ – has already begun. For retailers, that means planning for this peak season has kicked off as well.

We’d like to share four key trends that emerged from Google research and Quantum Metric’s Back-to-School Retail Benchmarks study of U.S. retail data, explore the reasons behind them, and outline the key takeaways.

1. Out-of-stock and inflation concerns are changing the way consumers shop. Back-to-school shoppers are starting earlier every year, with 41% beginning even before school is out – even more so when buying for college¹. Why? The behavior is driven in large part by consumers’ concerns that they won’t be able to get what they need if they wait too long. 29% of shoppers start looking a full month before they need something¹.

Back-to-school purchasing volume is quite high, with the majority spending up to $500 and 21% spending more than $1,000¹. In fact, looking at year-over-year data, we see that average cart values have not only doubled since November 2021, but increased since the holidays¹. And keep in mind that back-to-school spending is a key indicator leading into the holiday season.

That said, as people are reacting to inflation, they are comparing prices, hunting for bargains, and generally taking more time to plan. This is borne out by the fact that 76% of online shoppers are adding items to their carts and waiting to see if they go on sale before making the purchase¹. And, to help stay on budget and reduce shipping costs, 74% plan to make multiple purchases in one checkout¹. That carries over to in-store shopping, when consumers are buying more in one visit to reduce trips and save on gas.

2. The omnichannel theme continues. Consumers continue to use multiple channels in their shopping experience. As the pandemic has abated, some 82% expect that their back-to-school buying will be in-store, and 60% plan to purchase online. But in any case, 45% of consumers report that they will use both channels; more than 50% research online first before ever setting foot in a store². Some use as many as five channels, including video and social media, and these 54% of consumers spend 1.5 times more compared to those who use only two channels⁴.

And mobile is a big part of the journey. Shoppers are using their phones to make purchases, especially for deadline-driven, last-minute needs, and often check prices on other retailers’ websites while shopping in-store. Anecdotally, mobile is a big part of how we ourselves shop with our children, who like to swipe on the phone through different options for colors and styles. We use our desktops when shopping on our own, especially for items that require research and represent a larger investment – and our study shows that’s quite common.

3. Consumers are making frequent use of wish lists. One trend we have observed is a higher abandonment rate, especially for apparel and general home and school supplies, compared to bigger-ticket items that require more research. But that can be attributed in part to the increasing use of wish lists. Online shoppers are picking a few things that look appealing or items on sale, saving them in wish lists, and then choosing just a few to purchase. Our research shows that 39% of consumers build one or two wish lists per month, while 28% said they build one or two each week, often using their lists to help with budgeting¹.

4. Frustration rates have dropped significantly. Abandonment rates aside, shopper annoyance rates are down by 41%, year over year¹. This is despite out-of-stock concerns and higher prices. But one key finding showed that both cart abandonment and “rage clicks” are more frequent on desktops, possibly because people investing time on search also have more time to complain to customer service.

And frustration does still exist. Some $300 billion is lost each year in the U.S. from bad search experiences⁵. Data collected internationally shows that 80% of consumers view a brand differently after experiencing search difficulties, and 97% favor websites where they can quickly find what they are looking for⁵.

Lessons to Learn

What are the key takeaways for retailers? In general, consider the sources of customer pain points and find ways to erase friction. Improve search and personalization. And focus on improving the customer experience and building loyalty. Specifically:

80% of shoppers want personalization⁶. Think about how you can drive personalized promotions or experiences that will drive higher engagement with your brand. 

46% of consumers want more time to research¹. Drive toward providing more robust research and product information points, like comparison charts, images, and specific product details.

43% of consumers want a discount¹, but given current economic trends, retailers may not be offering discounts. In order to appease budget-conscious shoppers, retailers can consider other retention strategies such as driving loyalty using points, rewards, or faster-shipping perks.

Be sure to keep returns as simple as possible so consumers feel confident when making a purchase, and reduce possible friction points if a consumer decides to make a return. 43% of shoppers return at least a quarter of the products they buy and do not want to pay for shipping or jump through hoops¹.

How We Can Help

Google-sponsored research shows that price, deals, and promotions are important to 68% of back-to-school shoppers.⁷ In addition, shoppers want certainty that they will get what they want. Google Cloud can make it easier for retailers to enable customers to find the right products with discovery solutions. These solutions provide Google-quality search and recommendations on a retailer’s own digital properties, helping to increase conversions and reduce search abandonment. In addition, Quantum Metric solutions, available on the Google Cloud Marketplace, are built with BigQuery, which helps retailers consolidate and unlock the power of their raw data to identify areas of friction and deliver improved digital shopping experiences.

We invite you to watch the Total Retail webinar “4 ways retailers can get ready for back-to-school, off-to-college” on demand and to view the full Back-to-School Retail Benchmarks report from Quantum Metric.

Sources:
1. Back-to-School Retail Benchmarks report from Quantum Metric
2. Google/Ipsos, Moments 2021, Jun 2021, Online survey, US, n=335 Back to School shoppers
3. Google/Ipsos, Moments 2021, Jun 2021, Online survey, US, n=2,006 American general population 18+
4. Google/Ipsos, Holiday Shopping Study, Oct 2021 – Jan 2022, Online survey, US, n=7,253, Americans 18+ who conducted holiday shopping activities in past two days
5. Google Cloud Blog, Nov 2021, “Research: Search abandonment has a lasting impact on brand loyalty”
6. McKinsey & Company, “Personalizing the customer experience: Driving differentiation in retail”
7. Think with Google, July 2021, “What to expect from shoppers this back-to-school season”


Announcing new BigQuery capabilities to help secure sensitive data

In order to better serve their customers and users, digital applications and platforms continue to store and use sensitive data such as Personally Identifiable Information (PII), genetic and biometric information, and credit card information. Many organizations that provide data for analytics use cases face evolving regulatory and privacy mandates, ongoing risks from data breaches and data leakage, and a growing need to control data access. 

Data access control and masking of sensitive information is even more complex for large enterprises that are building massive data ecosystems. Copies of datasets often are created to manage access to different groups. Sometimes, copies of data are obfuscated while other copies aren’t. This creates an inconsistent approach to protecting data, which can be expensive to manage. To fully address these concerns, sensitive data needs to be protected with the right defense mechanism at the base table itself so that data can be kept secure throughout its entire lifecycle.

Today, we’re excited to introduce two new capabilities in BigQuery that add a second layer of defense on top of access controls to help secure and manage sensitive data. 

1. General availability of BigQuery column-level encryption functions

BigQuery column-level encryption SQL functions enable you to encrypt and decrypt data at the column level in BigQuery. These functions unlock use cases where data is natively encrypted in BigQuery and must be decrypted when accessed. They also support use cases where data is externally encrypted, stored in BigQuery, and must then be decrypted when accessed. The SQL functions support the industry-standard encryption algorithms AES-GCM (non-deterministic) and AES-SIV (deterministic). Functions supporting AES-SIV allow for grouping, aggregation, and joins on encrypted data. 

In addition to these SQL functions, we also integrated BigQuery with Cloud Key Management Service (Cloud KMS). This gives you additional control, allows you to manage your encryption keys in KMS, and enables on-access secure key retrieval as well as detailed logging. An additional layer of envelope encryption enables generation of wrapped keysets used to decrypt data. Only users with permission to access the Cloud KMS key and the wrapped keyset can unwrap the keyset and decrypt the ciphertext. 

“Enabling dynamic field level encryption is paramount for our data fabric platform to manage highly secure, regulated assets with rigorous security policies complying with several regulations including FedRAMP, PCI, GDPR, CCPA and more. BigQuery column-level encryption capability provides us with a secure path for decrypting externally encrypted data in BigQuery unblocking analytical use cases across more than 800+ analysts,” said Kumar Menon, CTO of Equifax.

Users can also leverage available SQL functions to support both non-deterministic encryption and deterministic encryption to enable joins and grouping of encrypted data columns.

The following query sample uses non-deterministic SQL functions to decrypt ciphertext.

SELECT
  AEAD.DECRYPT_STRING(KEYS.KEYSET_CHAIN(
    @kms_resource_name,
    @wrapped_keyset),
  ciphertext,
  additional_data)
FROM
  ciphertext_table
WHERE
  ...

The following query sample uses deterministic SQL functions to decrypt ciphertext.

SELECT
  DETERMINISTIC_DECRYPT_STRING(KEYS.KEYSET_CHAIN(
    @kms_resource_name,
    @wrapped_keyset),
  ciphertext,
  additional_data)
FROM
  ciphertext_table
WHERE
  ...
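For context, the encryption side looks similar. The following is a minimal, hedged sketch (not taken from the original announcement) showing how a plaintext column could be encrypted with the deterministic AES-SIV function and a Cloud KMS-wrapped keyset; the table and column names are illustrative only.

-- Hedged sketch: encrypt a plaintext column deterministically so the resulting
-- ciphertext can still be grouped and joined on. Table and column names are
-- illustrative, not from the original post.
SELECT
  customer_id,
  DETERMINISTIC_ENCRYPT(
    KEYS.KEYSET_CHAIN(@kms_resource_name, @wrapped_keyset),
    ssn,                          -- plaintext STRING column to protect
    CAST(customer_id AS STRING)   -- additional authenticated data
  ) AS ssn_encrypted
FROM
  plaintext_table;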

2. Preview of dynamic data masking in BigQuery

Extending BigQuery’s column-level security, dynamic data masking allows you to obfuscate sensitive data and control user access while mitigating the risk of data leakage. This capability selectively masks column-level data at query time based on the defined masking rules, user roles, and privileges. Masking eliminates the need to duplicate data and allows you to define different masking rules on a single copy of data, to desensitize data, simplify user access to sensitive data, and help meet compliance, privacy, and confidentiality requirements. 

Dynamic data masking allows for different transformations of underlying sensitive data to obfuscate data at query time. Masking rules can be defined on the policy tag in the taxonomy to grant varying levels of access based on the role and function of the user and the type of sensitive data. Masking adds to the existing access controls to allow customers a wide gamut of options around controlling access. An administrator can grant a user full access, no access or partial access with a particular masked value based on data sharing use case.

For the preview of data masking, three masking policies are supported: 

ALWAYS_NULL. Nullifies the content regardless of column data types.

SHA256. Applies SHA256 to STRING or BYTES data types. Note that the same restrictions apply to the SHA256 function.

Default_VALUE. Returns the default value based on the data type.

A user must first have all of the permissions necessary to run a query job against a BigQuery table in order to query it. In addition, to view the masked data of a column tagged with a policy tag, users need the MaskedReader role.
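As an illustration, consider a hypothetical customers table whose email column is tagged with a policy tag carrying the SHA256 masking rule. The dataset, table, and column names below are made up for the example; the point is that the query itself does not change, only what each caller is allowed to see.

-- Hedged sketch: the same query returns different results depending on the caller's role.
-- A reader with fine-grained access sees real email values; a user holding only the
-- MaskedReader role sees SHA256 hashes instead. `mydataset.customers` is hypothetical.
SELECT
  customer_id,
  email   -- returned hashed for MaskedReader users, in clear text for authorized readers
FROM
  mydataset.customers
LIMIT 10;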

When to use dynamic data masking vs encryption functions?

Common scenarios for using data masking or column level encryption are: 

protect against unauthorized data leakage 

access control management 

compliance against data privacy laws for PII, PHI, PCI data

create safe test datasets

Specifically, masking can be used for real-time transactions whereas encryption provides additional security for data at rest or in motion where real-time usability is not required.  

Any masking policies or encryption applied on the base tables are carried over to authorized views and materialized views, and masking or encryption is compatible with other security features such as row-level security. 

These newly added BigQuery security features, along with automatic DLP, can help you scan data across your entire organization, give you visibility into where sensitive data is stored, and enable you to manage access to and usability of data for different use cases across your user base. We’re always working to enhance BigQuery’s (and Google Cloud’s) data governance capabilities to enable end-to-end management of your sensitive data. With these new releases, we are adding deeper protections for your data in BigQuery.


Introducing Firehose: An open source tool from Gojek for seamless data ingestion to BigQuery and Cloud Storage

Indonesia’s largest hyperlocal company, Gojek has evolved from a motorcycle ride-hailing service into an on-demand mobile platform, providing a range of services that include transportation, logistics, food delivery, and payments. A total of 2 million driver-partners collectively cover an average distance of 16.5 million kilometers each day, making Gojek Indonesia’s de-facto transportation partner.

To continue supporting this growth, Gojek runs hundreds of microservices that communicate across multiple data centers. Applications are based on an event-driven architecture and produce billions of events every day. To empower data-driven decision-making, Gojek uses these events across products and services for analytics, machine learning, and more.

Data warehouse ingestion challenges 

To make sense of large amounts of data — and to better understand customers for the purpose of app development, customer support, growth, and marketing purposes — data must first be ingested into a data warehouse. Gojek uses BigQuery as its primary data warehouse. But ingesting events at Gojek’s scale, with rapid changes, poses the following challenges:

With multiple products and microservices offered, Gojek releases new Kafka topics almost every day and they need to be ingested for analytical purposes. This can quickly result in significant operational overhead for the data engineering team that is deploying new jobs to load data into BigQuery and Cloud Storage. 

Frequent schema changes in Kafka topics require consumers of those topics to load the new schema to avoid data loss and capture more recent changes. 

Data volumes can vary and grow exponentially as people start building new products and logging new activities on top of a new topic. Each topic can also have a different load during peak business hours. Customers need to handle the rising volume of data to quickly scale per their business needs.

Firehose and Google Cloud to the rescue 

To solve these challenges, Gojek uses Firehose, a cloud-native service to deliver real-time streaming data to destinations like service endpoints, managed databases, data lakes, and data warehouses like Cloud Storage and BigQuery. Firehose is part of the Open Data Ops Foundation (ODPF), and is fully open source. Gojek is one of the major contributors to ODPF.

Here are Firehose’s key features:

Sinks – Firehose supports sinking stream data to the log console, HTTP, GRPC, PostgresDB (JDBC), InfluxDB, Elastic Search, Redis, Prometheus, MongoDB, GCS, and BigQuery.

Extensibility – Firehose allows users to add a custom sink with a clearly defined interface, or choose from existing sinks.

Scale – Firehose scales in an instant, both vertically and horizontally, for a high-performance streaming sink with zero data drops.

Runtime – Firehose can run inside containers or VMs in a fully-managed runtime environment like Kubernetes.

Metrics – Firehose always lets you know what’s going on with your deployment, with built-in monitoring of throughput, response times, errors, and more.

Key advantages

Using Firehose to ingest data into BigQuery and Cloud Storage has multiple advantages. 

Reliability 
Firehose is battle-tested for large-scale data ingestion. At Gojek, Firehose streams 600 Kafka topics into BigQuery and 700 Kafka topics into Cloud Storage. On average, 6 billion events are ingested into BigQuery every day, amounting to more than 10 terabytes of daily data ingestion.  

Streaming ingestion
A single Kafka topic can produce up to billions of records in a day. Depending on the nature of the business, scalability and data freshness are key to ensuring the usability of that data, regardless of the load. Firehose uses BigQuery streaming ingestion to load data in near-real-time. This allows analysts to query data within five minutes of it being produced.
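
Firehose itself is a Java service, but the underlying BigQuery operation is an ordinary streaming insert. As a minimal sketch of that pattern (not Firehose’s own code; the table and field names are hypothetical), using the BigQuery Python client:

```python
# Minimal sketch of BigQuery streaming ingestion (illustrative only;
# the table and field names here are hypothetical, not Gojek's).
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.events.booking_events"

rows = [
    {"event_id": "abc-123", "driver_id": 42, "event_timestamp": "2022-06-01T10:00:00Z"},
    {"event_id": "abc-124", "driver_id": 43, "event_timestamp": "2022-06-01T10:00:01Z"},
]

# insert_rows_json uses the streaming insert API, so rows become queryable
# within seconds instead of waiting for a batch load job to finish.
errors = client.insert_rows_json(table_id, rows)
if errors:
    raise RuntimeError(f"Streaming insert failed: {errors}")
```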

Schema evolution
With multiple products and microservices on offer, new Kafka topics are released almost every day, and the schema of existing topics constantly evolves as new data is produced. A common challenge is ensuring that, as these topics evolve, their schema changes are reflected in the corresponding BigQuery tables and Cloud Storage files. Firehose tracks schema changes by integrating with Stencil, a cloud-native schema registry, and automatically updates the schema of BigQuery tables without human intervention. This reduces data errors and saves developers hundreds of hours. 
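
Firehose and Stencil perform this step automatically; purely to illustrate the BigQuery-side operation that schema evolution implies, the sketch below adds a newly discovered field as a nullable column with the Python client (the field name is hypothetical):

```python
# Illustrative sketch of the schema update behind schema evolution
# (Firehose/Stencil do this automatically; the new field is hypothetical).
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my-project.events.booking_events")

# Append the new field as NULLABLE so rows written before the change stay valid.
new_schema = list(table.schema)
new_schema.append(bigquery.SchemaField("payment_method", "STRING", mode="NULLABLE"))

table.schema = new_schema
client.update_table(table, ["schema"])
```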

Elastic infrastructure
Firehose can be deployed on Kubernetes and runs as a stateless service. This allows Firehose to scale horizontally as data volumes vary.

Organizing data in cloud storage 
The Firehose GCS sink can store data based on timestamp information in the records, allowing users to customize how their data is partitioned in Cloud Storage.
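
The exact layout is configured in the sink; as a rough illustration of timestamp-based partitioning (the bucket name and path convention below are invented for the example), records can be grouped under date and hour prefixes:

```python
# Illustrative sketch of a timestamp-partitioned object layout in Cloud Storage
# (bucket name and path convention are hypothetical, not Firehose's fixed layout).
import json
from datetime import datetime, timezone
from google.cloud import storage

records = [{"event_id": "abc-123", "driver_id": 42}]
now = datetime.now(timezone.utc)

# For example: booking-events/dt=2022-06-01/hr=10/part-0001.json
object_name = f"booking-events/dt={now:%Y-%m-%d}/hr={now:%H}/part-0001.json"

bucket = storage.Client().bucket("my-raw-events-bucket")
bucket.blob(object_name).upload_from_string(
    "\n".join(json.dumps(r) for r in records),
    content_type="application/json",
)
```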

Supporting a wide range of open source software

Google Cloud products like BigQuery and Cloud Storage are built for flexibility and reliability and are designed to support multi-cloud architectures. Open source software like Firehose is just one of many examples that can help developers and engineers optimize productivity. Taken together, these tools can deliver a seamless data ingestion process, with less maintenance and better automation.

How you can contribute

Development of Firehose happens in the open on GitHub, and we are grateful to the community for contributing bug fixes and improvements. We would love to hear your feedback via GitHub discussions or Slack.

Related Article

Transform satellite imagery from Earth Engine into tabular data in BigQuery

With Geobeam on Dataflow, you can transform Geospatial data from raster format in Earth Engine to vector format in BigQuery.

Read Article

Source : Data Analytics Read More

Wayfair: Accelerating MLOps to power great experiences at scale

Wayfair: Accelerating MLOps to power great experiences at scale

Machine Learning (ML) is part of everything we do at Wayfair to support each of the 30 million active customers on our website. It enables us to make context-aware, real-time, and intelligent decisions across every aspect of our business. We use ML models to forecast product demand across the globe, ensuring our customers can quickly access what they’re looking for. Natural language processing (NLP) models analyze chat messages on our website so customers can be redirected to the appropriate customer support team as quickly as possible, without having to wait for a human assistant to become available. 

ML is an integral part of our strategy for remaining competitive as a business and supports a wide range of eCommerce engineering processes at Wayfair. As an online furniture and home goods retailer, the steps we take to make the experience of our customers as smooth, convenient, and pleasant as possible determine how successful we are. This vision inspires our approach to technology and we’re proud of our heritage as a tech company, with more than 3,000 in-house engineers and data scientists working on the development and maintenance of our platform. 

We’ve been building ML models for years, as well as other homegrown tools and technologies, to help solve the challenges we’ve faced along the way. We began on-prem but decided to migrate to Google Cloud in 2019, using a lift-and-shift strategy to minimize the number of changes needed to move multiple workloads into the cloud. Among other things, that meant deploying Apache Airflow clusters on Google Cloud infrastructure and retrofitting our homegrown technologies to ensure compatibility. 

While some of the challenges we faced with our legacy infrastructure were resolved immediately, such as lack of scalability, others remained for our data scientists. For example, we lacked a central feature store and relied on a shared cluster with a shared environment for workflow orchestration, which caused noisy neighbor problems. 

As a Google Cloud customer, however, we can easily access new solutions as they become available. So in 2021, when Google Cloud launched Vertex AI, we didn’t hesitate to try it out as an end-to-end ML platform to support the work of our data scientists.

One AI platform with all the ML tools needed

As big fans of open-source, platform-agnostic software, we were impressed by Vertex AI Pipelines and how it works on top of open-source frameworks like Kubeflow. This enables us to build software that runs on any infrastructure. We also liked how the tool looks, feels, and operates. Within six months, we moved from configuring our infrastructure manually, to conducting a POC, to a first production release.
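
To give a flavor of what this looks like in practice, here is a minimal sketch of a Kubeflow-based pipeline compiled for Vertex AI Pipelines, using the Kubeflow Pipelines SDK v2 syntax; the component, pipeline, and parameter names are invented for the example:

```python
# Minimal sketch of a Kubeflow pipeline compiled for Vertex AI Pipelines.
# Component, pipeline, and parameter names are hypothetical.
from kfp.v2 import compiler, dsl

@dsl.component
def train_model(learning_rate: float) -> str:
    # Placeholder step; a real component would train and export a model.
    return f"trained with lr={learning_rate}"

@dsl.pipeline(name="demand-forecast-pipeline")
def pipeline(learning_rate: float = 0.01):
    train_model(learning_rate=learning_rate)

# Produces a pipeline spec that Vertex AI Pipelines can execute.
compiler.Compiler().compile(pipeline_func=pipeline, package_path="pipeline.json")
```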

Next on our priority list was Vertex AI Feature Store, which lets us serve and consume ML features in real time, or in batch, with a single line of code. Vertex AI Feature Store fully manages and scales its underlying infrastructure, such as storage and compute resources. That means our data scientists can now focus on feature computation logic, instead of worrying about the challenges of storing features for offline and online usage.
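
As a rough sketch of what that online-serving call can look like with the Vertex AI SDK (the featurestore, entity type, and feature IDs below are hypothetical, and the exact SDK surface may vary by version):

```python
# Illustrative sketch of online feature serving from Vertex AI Feature Store.
# Featurestore, entity type, and feature IDs are hypothetical.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

featurestore = aiplatform.Featurestore(featurestore_name="customer_features")
users = featurestore.get_entity_type(entity_type_id="user")

# One call returns the latest feature values for online prediction.
feature_df = users.read(
    entity_ids=["user_123"],
    feature_ids=["lifetime_value", "days_since_last_order"],
)
print(feature_df)
```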

While our data scientists are proficient in building and training models, they are less comfortable setting up the infrastructure and bringing the models to production. So, when we embarked on an MLOps transformation, it was important for us to enable data scientists to leverage the platform as seamlessly as possible without having to know all about its underlying infrastructure. To that end, our goal was to build an abstraction on top of Vertex AI. Our simple Python-based library interacts with Vertex AI Pipelines and Vertex AI Feature Store, and a typical data scientist can use this setup without having to know how Vertex AI works in the backend. That’s the vision we’re marching towards, and we’ve already started to notice its benefits.
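
Wayfair’s internal library isn’t public, so the snippet below is only a hypothetical sketch of the kind of thin abstraction described here: a wrapper that hides project, region, and pipeline-root details behind a single call (all names and defaults are invented):

```python
# Hypothetical sketch of a thin internal wrapper around Vertex AI Pipelines.
# This is not Wayfair's actual library; names and defaults are invented.
from google.cloud import aiplatform

class MLPlatform:
    """Hides project, region, and pipeline-root details from data scientists."""

    def __init__(self, project: str, region: str, pipeline_root: str):
        aiplatform.init(project=project, location=region)
        self._pipeline_root = pipeline_root

    def run_pipeline(self, name: str, template_path: str, params: dict):
        job = aiplatform.PipelineJob(
            display_name=name,
            template_path=template_path,   # compiled Kubeflow pipeline spec
            pipeline_root=self._pipeline_root,
            parameter_values=params,
        )
        job.submit()                       # non-blocking submission
        return job

# A data scientist only supplies the compiled spec and the parameters.
platform = MLPlatform("my-project", "us-central1", "gs://my-bucket/pipeline-root")
platform.run_pipeline("demand-forecast", "pipeline.json", {"learning_rate": 0.01})
```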

Reducing hyperparameter tuning from two weeks to under one hour

While we enjoy using open source tools such as Apache Airflow, the way we were using it was creating issues for our data scientists, and we frequently ran into infrastructure challenges carried over from our legacy technologies, such as support issues and failed jobs. So we built a CI/CD pipeline using Vertex AI Pipelines, based on Kubeflow, to remove the complexity of model maintenance.

Now everything is well documented, scalable, easy to test, and organized according to best practices. This incentivizes people to adopt a new, standardized way of working, which in turn brings its own benefits. One example that illustrates this is hyperparameter tuning, an essential part of controlling the behavior of a machine learning model. 

In machine learning, hyperparameter tuning (or optimization) is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value controls the learning process and is set before training begins. Every machine learning model has its own hyperparameters, and a good choice of values can make an algorithm perform optimally. 

But while hyperparameter tuning is a very common process in data science, there are no standards in terms of how this should be done. Doing it in Python using a legacy infrastructure would take a data scientist on average two weeks. We have over 100 data scientists at Wayfair, so standardizing this practice and making it more efficient was a priority for us. 

With a standardized way of working on Vertex AI, all our data scientists can now leverage our code to access CI/CD, monitoring, and analytics out-of-the-box to conduct hyperparameter tuning in just one day. 
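
As an illustration of what that standardized tuning step can look like on Vertex AI (a sketch only; the container image, metric, and parameter names are hypothetical):

```python
# Illustrative sketch of a Vertex AI hyperparameter tuning job.
# Container image, metric name, and parameter names are hypothetical.
from google.cloud import aiplatform
from google.cloud.aiplatform import hyperparameter_tuning as hpt

aiplatform.init(
    project="my-project",
    location="us-central1",
    staging_bucket="gs://my-bucket/staging",
)

worker_pool_specs = [{
    "machine_spec": {"machine_type": "n1-standard-4"},
    "replica_count": 1,
    "container_spec": {"image_uri": "gcr.io/my-project/trainer:latest"},
}]

custom_job = aiplatform.CustomJob(
    display_name="demand-forecast-trial",
    worker_pool_specs=worker_pool_specs,
)

tuning_job = aiplatform.HyperparameterTuningJob(
    display_name="demand-forecast-hpt",
    custom_job=custom_job,
    metric_spec={"rmse": "minimize"},  # metric reported by the training code
    parameter_spec={
        "learning_rate": hpt.DoubleParameterSpec(min=1e-4, max=1e-1, scale="log"),
        "num_trees": hpt.IntegerParameterSpec(min=50, max=500, scale="linear"),
    },
    max_trial_count=20,
    parallel_trial_count=4,
)

tuning_job.run()
```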

Powering great customer experiences with more ML-based functionalities

Next, we’re working on a Docker container template that will enable data scientists to deploy a running ‘hello world’ Vertex AI pipeline. On average, it can take a data science team more than two months to get an ML model fully operational. With Vertex AI, we expect to cut that time down to two weeks. Like most of the things we do, this will have a direct impact on our customer experience. 

It’s important to remember that some ML models are more complex than others. Those that have an output that the customer immediately sees while navigating the website, such as when an item will be delivered to their door, are more complicated. This prediction is made by ML models and automated by Vertex AI. It must be accurate, and it must appear on-screen extremely quickly while customers browse the website. That means these models have the highest requirements and are the most difficult to publish to production. 
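
Purely as an illustration of the low-latency online-serving pattern such models require (the endpoint ID and request payload below are hypothetical), a prediction from a deployed Vertex AI endpoint looks like this:

```python
# Illustrative sketch of low-latency online prediction from a deployed
# Vertex AI endpoint (endpoint ID and request payload are hypothetical).
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

endpoint = aiplatform.Endpoint(
    "projects/my-project/locations/us-central1/endpoints/1234567890"
)

# One instance per item shown on the page; the model returns a delivery estimate.
prediction = endpoint.predict(instances=[{"sku": "SOFA-123", "zip_code": "02116"}])
print(prediction.predictions[0])
```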

We’re actively working on building and implementing tools to streamline and enable continuous monitoring of our data and models in production, which we want to integrate with Vertex AI. We believe in the power of AutoML to build models faster, so our goal is to evaluate all these services in GCP and then find a way to leverage them internally. 

And it’s already clear that the new ways of working enabled by Vertex AI not only make the lives of our data scientists easier, but also have a ripple effect that directly impacts the experience of millions of shoppers who visit our website daily. They’re all experiencing better technology and more functionalities, faster. 

For a more detailed dive on how our data scientists are using Vertex AI, look for part two of this blog coming soon.

Related Article

How Wayfair says yes with BigQuery—without breaking the bank

BigQuery’s performance and cost optimization have transformed Wayfair’s internal analytics to create an environment of “yes”.

Read Article

Source : Data Analytics Read More