Editor’s note: Soundtrack Your Brand is an award-winning streaming service with the world’s largest licensed music catalog built just for businesses, backed by Spotify. Today, we hear how BigQuery has been a foundational component in helping them transform big data into music.
Soundtrack Your Brand is a music company at its heart, but big data is our soul. Playing the right music at the right time has a huge influence on the emotions a brand inspires, the overall customer experience, and sales. We have a catalog of over 58 million songs and their associated metadata from our music providers and a vast amount of user data that helps us deliver personalized recommendations, curate playlists and stations, and even generate listening schedules. As an example, through our Schedules feature our customers can set up what to play during the week. Taking that one step further, we provide suggestions on what to use in different time slots and recommend entire schedules.
Using BigQuery, we built a data lake to empower our employees to access all this content and metadata in a structured way. Ensuring that our data is easily discoverable and accessible allows us to build any type of analytics or machine learning (ML) use case and run queries reliably and consistently across the complete data set. Today, our users are benefiting from this advanced analytics through the personalized recommendations we offer across our core features: Home, Search, Playlists, Stations, and Schedules.
Fine-tuning developer productivity
The biggest business value that comes from BigQuery is how much it speeds up our development capabilities and allows us to ship features faster. In the past 3 years, we have built more than 150 pipelines and more than 30 new APIs within our ML and data teams that total about 10 people. That is an impressive rate of a new pipeline every week and a new API every month. With everything in BigQuery, it’s easy to simply write SQL and have it be orchestrated within a CI/CD toolchain to automate our data processing pipelines. An in-house tool built as a github template, in many ways very similar to Dataform, helps us build very complex ETL processes in minutes, significantly reducing the time spent on data wrangling.
BigQuery acts as a cornerstone for our entire data ecosystem, a place to anchor all our data and be our single source of truth. This single source of truth has expanded the limits of what we can do with our data. Most of our pipelines start from a data lake, or end at a data lake, increasing re-usability of data and collaboration. For example, one of our interns built an entire churn prediction pipeline in a couple of days on top of existing tables that are produced daily. Nearly a year later, this pipeline is still running without failure largely due to its simplicity. The pipeline is BigQuery queries chained together into a BigQuery ML model running on a schedule withKubeflow Pipelines.
Once we made BigQuery the anchor for our data operations, we discovered we could apply it to use cases that you might not expect, such as maintaining our configurations or supporting our content management system. For instance, we created a Google Sheet where our music experts are able to correct genre classification mistakes for songs by simply adding a row to a Google Sheet. Instead of hours or days to create a bespoke tool, we were able to set everything up in a few minutes.
BigQuery’s ability to consume Excel spreadsheets allows business users who play key roles in improving our recommendations engine and curating our music, such as our content managers and DJs, to contribute to the data pipeline.
Another example is our use of BigQuery as an index for some of our large Cloud Storage buckets. By using cloud functions to subscribe to read/write events for a bucket, and writing those events to partitioned tables, our pipelines can easily and in a natural way quickly search and access files, such as downloading and processing the audio of new track releases. We also make use of Log Events when a table is added to a dataset to trigger pipelines that process data on demand, such as JSON/CSV files from some of our data providers that are newly imported into BQ. Being the place for all file integration and processing, BQ allows new data to be quickly available to our entire data ecosystem in a timely and cost effective manner while allowing for data retention, ETL, ACL and easy introspection.
BigQuery makes everything simple. We can make a quick partitioned table and run queries that use thousands of CPU hours to sift through a massive volume of data in seconds — and only pay a few dollars for the service. The result? Very quick, cost-effective ETL pipelines.
In addition, centralizing all of our data in BigQuery makes it possible to easily establish connections between pipelines providing developers with a clear understanding of what specific type of data a pipeline will produce. If a developer wants a different outcome, she can copy the github template and change some settings to create a new, independent pipeline.
Another benefit is that developers don’t have to coordinate schedules or sync with each other’s pipelines: they just need to know that a table that is updated daily exists and can be relied on as a data source for an application. Each developer can progress their work independently without worrying about interfering with other developers’ use of the platform.
Making iteration our forte
Out of the box, BigQuery met and exceeded our performance expectations, but ML performance was the area that really took us by surprise. Suddenly, we found ourselves going through millions of rows in a few seconds, where the previous method might have taken an hour. This performance boost ultimately led to us improving our artist clustering workload from more than 24 hours on a job running 100 CPU workers to 10 minutes on a BigQuery pipeline running inference queries in a loop until convergence. This more than 140x performance improvement also came at 3% of the cost.
Currently we have more than 100 Neural Network ML models being trained and run regularly in batch in BQML. This setup has become our favorite method for both fast prototyping and creating production ready models. Not only is it fast and easy to hypertune in BQML, but our benchmarks show comparable performance metrics to using our own Tensorflow code. We now use Tensorflow sparingly. Differences in input data can have an even greater impact on the experience of the end user than individual tweaks to the models.
BigQuery’s performance makes it easy to iterate with the domain experts who help shape our recommendations engine or who are concerned about churn, as we are able to show them the outcome on our recommendations from changes to input data in real time. One of our favorite things to do is to build a Data Studio report that has the ML.predict query as part of its data source query. This report shows examples of good/bad predictions in the report along with bias/variance summaries and a series of drop-downs, thresholds and toggles to control the input features and the output threshold. We give that report to our team of domain experts to help manually tune the models, putting the model tuning right in the hands of the domain experts. Having humans in the loop has become trivial for our team. In addition to fast iteration, the BigQuery ML approach is also very low maintenance. You don’t need to write a lot of Python or Scala code or maintain and update multiple frameworks—everything can be written as SQL queries run against the data store.
Helping brands to beat the band—and the competition
BigQuery has allowed us to establish a single source of truth for our company that our developers and domain experts can build on to create new and innovative applications that help our customers find the sound that fits their brand.
Instead of cobbling together data from arbitrary sources, our developers now always start with a data set from BigQuery and build forward. This guarantees the stability of our data pipeline and makes it possible to build outward into new applications with confidence. Moreover, the performance of BigQuery means domain experts can interact with the analytics and applications that developers create more easily and see the results of their recommended improvements to ML models or data inputs quickly. This rapid iteration drives better business results, keeps our developers and domain experts aligned, and ensures Soundtrack Your Brand keeps delivering sound that stands out from the crowd.
Source : Data Analytics Read More