Using Data from OKRs To Improve Business Growth

OKRs (Objectives and Key Results) are a simple yet powerful management framework that helps managers align employee goals with those of the organization while fostering transparency and streamlining goal attainment. Monitoring employee progress toward those goals is an important part of the process.

Measuring results and performance is critical to business success. However, it is often challenging to ensure that your metrics are aligned with the C-suite objectives and that all your employees are engaged and connected in meeting these targets.

This is where the data from OKRs comes into play.

Why Are OKR Platforms Trending?

OKRs allow you to build a solid mission, elevate employee engagement, and set a path towards your ultimate company target.

Additionally, using the best OKR software for your business enables you to structure teams and their daily work around attaining common objectives. This brings companies benefits such as improved focus, increased transparency, and better alignment. Because it emphasizes measurable, time-bound indicators, the OKR technique stays grounded in reality.

Here are several reasons why many companies implement OKRs and how the data derived can help your business drive growth:

Alignment Of Business-Related Goals

Your employees must understand how their work aligns with and connects to the organization’s overall vision. When you share the company goals throughout the organization, you create transparency, so every employee feels connected and invested.

According to the Harvard Business Review, organizations with highly aligned employees are twice as likely to be top performers.

Regular Progress Tracking

Tracking progress from the start of the work through to the result makes OKRs a popular choice among progressive organizations. Frequent status updates and check-ins let your teams focus on the big picture and ensure every conversation ends with a clear action plan.

Increased Productivity And Focus

OKRs allow your employees to prioritize better and stay focused on critical matters. In addition, OKRs give an overview of the areas where employees need more support and extra resources, thereby improving resource management.

The data from OKRs also allows you to make prompt, informed decisions. And with routine progress tracking, you can get valuable insights into development and learning issues and act on them accordingly.

Better Teamwork And Transparency

With OKRs, company goals are visible to everyone, which creates better collaboration and transparency between departments and teams.

The OKR framework allows leaders and employees to work cohesively to accomplish larger goals.

Leveraging OKR Data To Obtain Your KPIs

After you set the organizational Objectives and Key Results, the process flows down to the departments in the company. Each department can then select its own OKRs to meet (or even exceed) those expectations. Finally, you should derive your KPIs by aggregating the Key Results gathered from the OKR data and analytics.

If you look only to external sources in your industry for this information, you are likely to end up with misguided outcomes.

Adopting other companies’ KPIs can create unrealistic expectations and lead to burnout and failure in your own company.

When the chosen KPIs reward only efficiency and volume, you increase the risk of poor decisions made just to hit specific metrics. That risk is real in today’s environment, where businesses are always expediting their processes.

Here are some ways data from OKRs help you drive business growth and desired outcomes:

Monitor in real time. Tracking OKR progress in real time, and backing your communication and collaboration practices with insightful data, gives managers adequate knowledge of multiple aspects of the business while allowing employees to work toward their individual goals.

Easily accessible. The ability to obtain OKR data quickly and efficiently empowers employees to use that data in their decision-making.

Competitive advantage. Companies that implement OKRs have a competitive advantage over sluggish, data-frugal enterprises. Capitalizing on insights into performance and goal progress can lift your company to new heights of productivity and improvement and help you identify potential opportunities for business growth.

Also, when employees can get OKR data directly, they need less time and fewer resources to reach critical insights. You can choose the best OKR software for your business according to your specific needs and requirements.

Wrapping Up

By following the structured OKR system, your company will be able to set long-term goals that you can track, measure, and analyze through the data obtained from OKRs, while allotting resources to the business-critical work that drives success. You can learn more about these employee monitoring tools to see how this can be put into practice.

When goals are clearly defined, all team members can focus on the most crucial ones. Well-planned KPIs and OKRs also boost collaboration, as everyone is well informed and prioritizes their tasks accordingly.

The post Using Data from OKRs To Improve Business Growth appeared first on SmartData Collective.

Source: SmartData Collective

The 5 Best Methods Utilized for Data Collection

Collecting data is a necessary step that companies must take to reach their desired standards or keep from declining in quality. Not only that, but the public’s perception of a brand is shaped primarily by the product or service it offers, so gathering data on customers’ level of satisfaction is extremely important.

But how should companies go about it? Here are five methods used in data collection.

Conduct Surveys

Surveys are a tried and true way for companies to gather customer information. This method allows companies to ask their customers directly about whatever they want insight into: products, services, prices, and more. Surveys are also very versatile. They can be conducted online and via email to reach a wide range of customers, or in person to gain a deeper and more personalized understanding of the answers. They can also be performed over the phone with a live correspondent or an automated system by implementing tools like Votacall. Questionnaires and social media are often used, too.

Using Registrations and Subscriptions

Many platforms today require customers to subscribe or register to use their services, which can prove to be a valuable way of collecting information about your consumer base. For example, those signing up for email lists or recurring purchases will likely be asked to provide a small amount of data, such as their date of birth, gender, name, email, and so on. When done correctly, this user data can give companies an idea of the demographics using their services and products. However, when a form asks too many questions or requests information that is too personal, people tend to reject the service.

Performing Online Tracking

More people are online today than ever before, so online tracking is widely used to obtain statistics and data for websites. Tracking will tell you how many people have visited your site, which pages they went to, and how long they stayed there. Analyzing these statistics will help teams decide what needs to be addressed or what is working well for the site. For example, what is garnering the most traffic? Why do people back-click out of the site so often? Online tracking can help answer these questions and more.

Scraping Online Directories

There is a lot of content on the Internet, and much of it is publicly available. You can use a scraping tool to gather content that will be useful for whatever projects you are working on.

Scraping tools like Parsehub, Scrapy and Octoparse can be very useful. You will be able to use them to scrape data on the public web and then categorize it with your data archiving tools.

CRM Tools

There are a lot of tools that make it easier to gather and categorize customer data. You can try using customer relationship management tools like HubSpot CRM, Zendesk Sell and Pipedrive. These CRM tools are able to organize and aggregate customer data easily. You will be able to leverage it to its full effectiveness.

Closing Thoughts

While these data collection methods are widespread and still used to great success, several other strategies are used as well, such as tracking transactions and analyzing market data. Understanding your audience is one of the most critical aspects of the process, regardless of which method you choose to employ. You’ll often need a combination of several methods to obtain a solid and accurate picture from the collected data.

You will want to use these data collection techniques as effectively as possible. You may be surprised by the value that they will provide when you are creating a data-driven business.

The post The 5 Best Methods Utilized for Data Collection appeared first on SmartData Collective.

Source: SmartData Collective

7 Ways Data Analytics Is Boosting the ROI Of Digital Marketing

Digital marketers work online and leverage online tools to drive sales. They make up an aspect of marketing focused on using the internet and cloud-based technology to promote brands. Marketers all share the same goal: reach the target audience and make more profit. But driving sales through the maximization of profit and minimization of cost is impossible without data analytics.

Data analytics is the process of drawing inferences from datasets to understand the information they contain. It helps marketers with insight, which is the capacity to gain a deep understanding of their target audience. Whether marketers intend to reach new customers or persuade the existing ones, here are ways analytics is boosting returns on investment (ROI):

1. Increased Customer Growth

The two primary objectives of digital marketing, in Scotland or any other country, are acquiring new clients and retaining existing customers. Data analytics helps digital marketers acquire new customers by spotting trends and retain existing ones by assessing their behaviors. That way, they can widen their customer base, improve reach, and effectively engage buyers with more marketing products.

They use the patterns they’ve learned to trigger brand loyalty for customer growth and thereby boost returns on marketing investment. 

2. Improved Marketing Campaigns 

Campaigns kick off digital marketing. Creating a strategy that promotes your products and services is the first attempt at reaching your target audience. However, many marketing campaigns are poorly targeted because marketers do not use data analytics to understand customers’ behaviors. Marketers who deploy analytics make a comprehensive analysis of customer trends to improve their campaigns. Focused and targeted campaigns boost sales by engaging optimally with the audience. When buyers connect with a campaign and make purchase decisions, there are positive returns on investment.

3. Personalized Services

Personalization is among the prime drivers of digital marketing, thanks to data analytics. Gathered data enables business owners to understand the needs of buyers. This includes the understanding of the message they’d love to read, the medium they’re comfortable with, and the time they’re mostly online. 

Leveraging these metrics, digital marketers can draft personalized campaigns that meet customers’ needs and eliminate budget waste. At reduced costs and efforts, they can make more sales and boost returns on investment.

4. Reduced Risks

Aside from the assessment function of analytics, there’s also the security benefit. As you optimize your marketing campaigns, analytics helps you identify risks and quickly address them. By providing meaningful and actionable insight, it also reduces customer churn, which is often caused by repetitive tasks.

When your campaign is faulty, it’s usually due to a lack of data. A non-functional marketing operation, for instance, may cause potential buyers to explore other options. But if your strategy is data-driven, you not only engage customers but also boost the return on whatever amount you invested in running that program.

5. Innovative Products

When you start creating a product, you must consider more than its features: you must create something that customers want. Consumers today are not solely after the satisfaction of wants; they want products that meet their demands at the right place and time.

Therefore, any digital marketing process that does not consider the right goods will miss the target. Marketers can no longer base campaigns on instinct and intuition. Commodities have to be innovative, whether in labeling, filtering, or description. Data analytics matches the right products with customers’ needs for maximum engagement and favorable returns.

6. Enhanced Supply Management

If production is not complete until goods get to the consumers, what about marketing? Of course, the ultimate goal of any marketing strategy is to persuade consumers to make purchases and start using the products. If the goods have yet to reach the consumer, marketing is incomplete.

In digital marketing, there are many bottlenecks regarding the delivery of goods that data analytics has resolved. With greater precision and insight, digital marketers can deliver orders within a few days. Buyers can also track their orders. This process leads to improved satisfaction, increased customer growth, and positive returns on investment.

7. Effective Decision Making 

Savvy digital marketers appreciate the role of data reporting in the success of marketing strategies. More importantly, they never take lightly the need to make informed decisions from the reports to avoid analysis paralysis. Fortunately, today, there are significant data analytics tools with incredible visualization designs for the ease and convenience of decision-making. Improved decisions make the buyer’s journey more efficient, leading to more sales, profits, and mouthwatering returns.


Digital marketing is a significant investment. It requires time, effort, and money. However, smart marketers know that deploying data analytics is the way to boost returns.

Analytics helps business owners to understand customers’ trends and behaviors better. This understanding is then used towards persuading buyers to engage with products, buy the goods, and start using them.

The post 7 Ways Data Analytics Is Boosting the ROI Of Digital Marketing appeared first on SmartData Collective.

Source: SmartData Collective

The Evolving Role of Analytics in Supply Chain Security

Analytics technology has become fundamental to many aspects of organizational management. Some of the benefits of analytics overlap with each other.

For example, more companies than ever are using analytics to bolster their security. They are also using data analytics tools to help streamline many logistical processes and make sure supply chains operate more efficiently. These companies have since realized that analytics can be invaluable to helping improve the security of supply chain systems.

The market for security analytics is projected to be worth over $25 billion by 2026. You can learn more about the benefits by reading below.

Analytics is Making the Security of Supply Chains Far More Robust

Supply chain refers to the ecosystem of resources used in designing, manufacturing, and distributing a product. For example, hardware and software, the cloud and local storage, and delivery systems are components of the cybersecurity supply chain. Supply chain analytics also plays a huge role.

Every organization requires various third-party services and software to carry out its daily operations. Many of these third-party organizations offer invaluable analytics capabilities, which can help address countless logistical issues. This means that the organization must rely on other organizations for many things, such as third-party chat applications for internal communication. The term supply chain applies because many of these items are procured from outside sources.

Supply chain attacks are typically concerned with cyber security, and a single attack on a single supplier might compromise a large number of organizations, for example, by spreading malware throughout the supply chain.

Analytics technology can help identify some of the security threats that businesses are encountering. A number of tools combine AI and analytics algorithms to improve their threat scoring and take automated prevention measures as hackers try to orchestrate these attacks. This is leading to a new era of security analytics.

Supply Chain Security with Analytics Capabilities

Software supply chain security is primarily concerned with securing your process to ensure you can provide customers with what they require at the right time and price and with adequate protection. Any disruption or risk to the integrity of the delivered products or services compromises the organization’s privacy and its data. So, you must implement a wide range of cyber security practices to establish trust and keep your organization operating efficiently.

You will be able to address these threats more easily with the right analytics-based cybersecurity strategy in place. However, you must first educate yourself about the different cybersecurity threats and analytics tools that can help prevent them.

Organizations must adhere to the following analytics-driven tactics and procedures to reduce the risks associated with the supply chain.

Performing Vendor Review

Every organization should conduct a vendor review of the third-party service provider before implementing any software in their organization. This provides essential clarity regarding the software supply chain security of that vendor. 

The vendor review should include an evaluation of how the data is handled by the vendor and what data protection methods are in operation on the vendor’s premises. This provides a clear picture of the third-party provider and its policies, allowing businesses to make more informed decisions about whether or not they should include the software in their operations.

Analytics technology can make it easier to learn more about different vendors. There are a lot of data mining tools that can analyze ratings on different vendor review sites, which can help you more quickly identify the best candidates to handle the job.

Performing Vulnerability Assessment

To be considered complete, the vulnerability assessment should be done on all devices that are part of the infrastructure. Assessments can be performed on both new and old apps to ensure that the applications are secure and that the individuals in charge are doing an excellent job of administering them. Organizations can also enlist the assistance of a purple or red team simulation to assess the level of cyber security knowledge among their workforce.

Analytics technology has become a lot more important with vulnerability assessments. You will be able to use analytics tools to evaluate the security architecture of your cybersecurity defense system and come up with actionable strategies to address concerns.

Modernization and Digitization

There are few things that cannot be digitized. If the business is still reliant on paper, for example, access control and security monitoring will be challenging to manage, because paper documents are often generated by third-party processes. It is recommended that, instead of making physical copies, you use digital versions to prevent the exposure of sensitive information. Digitization also facilitates access control for both the products and the data.


To mitigate the risks resulting from supply chain attacks, any data that contains critical information about customers should be encrypted at all times. That way, customer data stays protected in the event of a malware attack. Additionally, all sensitive data that can be accessed by third-party systems should be protected using encryption or additional authentication factors.

Analytics is Essential for Improving Supply Chain Security

Attacks on the supply chain are a growing worry because every organization relies on third-party vendors to maintain its day-to-day operations. Attackers are now focusing their attention on vendors, as compromising one vendor allows them to infect a large number of companies. The good news is that analytics technology is becoming very useful in thwarting these problems. Even so, it is critical to implement controls and conduct regular reviews to ensure that the organization is not adversely affected if any vulnerability in a supply chain is exploited.

The post The Evolving Role of Analytics in Supply Chain Security appeared first on SmartData Collective.

Source: SmartData Collective

Learn Beam patterns with Clickstream processing of Google Tag Manager data

Building your first pipeline can be a daunting task for a developer. When there are so many tools that can get the job done, how should you go about getting started? For power, flexibility, and ease of use, we find that the combination of Apache Beam and Dataflow offers developers a variety of ways to create data processing pipelines that meet their unique needs. And to make it easier, we’ve built a sample e-commerce analytics application to provide a reference implementation that works for developers who are starting out with Apache Beam as well as those who are proficient but may have advanced problems they need to solve. In some cases the tasks can be accomplished in many valid ways; however, we’ll focus on what we think is the easiest path. 

This blog will cover some of the patterns used in the sample application, for example the initial processing of the clickstream data from server-side Google Tag Manager. The sample application can also be used to find examples of working with JSON and Apache Beam Schemas for both de-serialization and serialization within the pipeline. 

Note: while the application is only implemented in Java, many of the patterns in this post will also be useful for other SDK languages. 

About the sample application

Our sample application implements processing of clickstream data from Google Tag Manager and shows how to dynamically respond to customer actions by analyzing and responding to events as they happen. The application also uses BigQuery as a sink which is updated in real time, and stores, analyzes, and visualizes event data for strategic decision making. Like most real world enterprise applications, it interacts with other products. You’ll find we also use PubSub as a source and Cloud Bigtable as a sink. 

Clickstream processing using Dataflow

Imagine you’re the developer for a retail company that has an e-commerce site, a mobile application, and brick-and-mortar stores. For now, let’s focus on the e-commerce and mobile applications, which send clickstream events to a web tier that then forwards them to PubSub via Server-side Tagging (GTM). Server-side Tagging allows customers to use a single Google tag on their website to collect the superset of clickstream events needed to enable all of their measurement tools, then send this data to a customer-owned server, and finally fan it out to numerous destinations, such as Google Analytics and other third parties, or other data storage and dissemination tools such as BigQuery or PubSub. (Learn how to get started with server-side GTM here.)

The architecture diagram below illustrates our integration with server-side Tagging:

The events are published to PubSub in JSON format, such as the following example:

Now that we know where our clickstream data is coming from and where it’s going, we’ll need to accomplish two tasks to make it useful: 

Validate and correct the data

Create session windows to ensure activities are grouped correctly

First task: Validate and correct the data 

Just because data gets parsed doesn’t mean that the data types and schema were the right match for our business domain. The data could have missing properties or inconsistencies, which is why a good data pipeline should always include validation and correction. We’ve modeled these as separate transforms in our retail application, which allows for clean data to simply pass through the validation step.

There are a few common approaches to data validation when it comes to the pipeline shape. We’ll illustrate some options, along with their pros and cons, and then show how we decide to do it within our retail application.

Let’s use an object Foo which has properties a,b for example: {Foo:{a:’bar’,b:’baz’}}. The business has validation logic that needs to check the values of a and b properties for correctness.

Approach A: Single transform

In this approach, the validation and correction code is all built into a single DoFn, where items that cannot be corrected are passed to a deadletter queue. 

Pros: This is a straightforward approach that will work very well for many cases. 

Cons: This Beam transform is specific to Foo and does not lend itself well to a transform library, which could be reused in different parts of the business. For example, property b might be something that is used in many objects within the retail organization, and this approach would potentially involve duplicate efforts down the line.
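To make the shape of Approach A concrete, here is a minimal sketch in plain Java rather than in Beam. It is not the sample application's code: the `Foo` class, the `Result` holder, and the validation rules (a missing `a` is unfixable, a blank `b` is replaced with a default) are all assumptions made for illustration. In a real pipeline the same logic would live inside a single DoFn with a second output for the deadletter queue.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of Approach A: validate and correct in one step. */
class SingleStepValidation {

    /** Hypothetical element type mirroring the Foo example (properties a and b). */
    static final class Foo {
        final String a;
        final String b;
        Foo(String a, String b) { this.a = a; this.b = b; }
    }

    /** Collects the two outputs that a single transform would emit. */
    static final class Result {
        final List<Foo> valid = new ArrayList<>();
        final List<Foo> deadLetter = new ArrayList<>();
    }

    /** Assumed rules: 'a' is required and cannot be repaired; a blank 'b' gets a default. */
    static Result validateAndCorrect(List<Foo> input) {
        Result result = new Result();
        for (Foo foo : input) {
            if (foo.a == null || foo.a.trim().isEmpty()) {
                result.deadLetter.add(foo);                   // not fixable
            } else if (foo.b == null || foo.b.trim().isEmpty()) {
                result.valid.add(new Foo(foo.a, "unknown"));  // corrected copy
            } else {
                result.valid.add(foo);                        // already clean
            }
        }
        return result;
    }
}
```

Because the rules for `a` and `b` are hard-wired into one method, reusing the check for `b` on another object type would mean copying it, which is the drawback described above.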

Approach B: Router validation transform and branch correction

In this approach, different properties (or groups of similar properties) are validated in one transform and then corrected in other transforms.

Pros: This approach allows you to integrate the transforms into a larger transform library. 

Cons: Compared to Approach A, this requires writing more boilerplate code. Moreover, there might also be a performance disadvantage to this approach (although Dataflow fusion will make it fairly negligible). Another disadvantage to this approach is that if the element requires both property a and b to be corrected, then you’ll need a join operation downstream to bring back together all parts of the element. A join will require a shuffle operation (which is a redistribution of data based on keys), which, since it can be a relatively slow and expensive operation, is good to avoid if possible.

Approach C: Serial validation and correction transforms

In this approach, the element passes through a series of validation steps, followed by a series of correction steps. Using a multi-output transform, any element which has no errors is passed directly to the transformations downstream. If an element does have errors, then that element is sent on to correction transformations (unless it is deemed not fixable, in which case it is sent to the deadletter sink). Thanks to Dataflow fusion, these elements can all be efficiently processed through all of these transformations.
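The three-way routing decision at the heart of this approach can be sketched in a few lines of plain Java. This is an illustration only, not the application's code: the `Route` enum and the error codes in `FIXABLE` are hypothetical names, and in Beam the three outcomes would correspond to the outputs of a multi-output transform.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Plain-Java sketch of the three-way routing used in Approach C. */
class SerialRouting {

    enum Route { CLEAN, NEEDS_CORRECTION, DEAD_LETTER }

    /** Hypothetical error codes that a correction step knows how to repair. */
    static final Set<String> FIXABLE =
        new HashSet<>(Arrays.asList("MISSING_ITEM_NAME", "FUTURE_TIMESTAMP"));

    /** Decides which output an element goes to, given the errors found by validation. */
    static Route route(List<String> errors) {
        if (errors.isEmpty()) {
            return Route.CLEAN;              // bypasses the correction steps
        }
        if (FIXABLE.containsAll(errors)) {
            return Route.NEEDS_CORRECTION;   // every problem has a known fix
        }
        return Route.DEAD_LETTER;            // at least one problem is not fixable
    }
}
```

Keeping the decision separate from the validators and correctors means that clean elements never pay for correction logic, and new error codes only need to be registered in one place.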

While there are many more variations of this process (and a combination of approaches A and C can solve the majority of complex real-life scenarios), we’ll be using Approach C for our retail application, since it offers the most code to illustrate the process. Now that we know what we want to do with our element properties, let’s walk through an example that we’ll want to check and correct if needed:

Event DateTime

Every clickstream object will have an Event DateTime property attached to it. As this date is set client-side in our application, there are several issues that could arise that the business would like to check:

Date format: the string itself should be correct. Since this property is a string in the schema, it would not have failed, even if the date format is incorrect.

Date accuracy: the date should not be in the future. Given that this date is coming from the client-side, it could be incorrectly set. How to deal with incorrect date-times is a choice to be made by the business. For our purposes we’ll correct the time to be set to when the element was published to PubSub (the publish time), as opposed to when it was first seen by the pipeline (the processing time).

Note that we’re not correcting the event datetime field itself, but rather the timestamp property of the clickstream event. This lets us keep the original data intact, while still ensuring that any resulting analytics with this data will use the correct time values.
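The two checks can be sketched with `java.time` as follows. This is a hedged illustration, not the application's transform: it assumes the client sends ISO-8601 instants such as `2022-03-01T08:00:00Z`, and it assumes that a malformed date is handled the same way as a future date (by falling back to the publish time), which is a business decision the text leaves open. The class and method names are hypothetical.

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;

/** Plain-Java sketch of the two Event DateTime checks. */
class EventDateTimeCheck {

    /** Returns true when the string parses as an ISO-8601 instant (assumed client format). */
    static boolean hasValidFormat(String eventDateTime) {
        if (eventDateTime == null) {
            return false;
        }
        try {
            Instant.parse(eventDateTime);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    /**
     * Chooses the timestamp to attach to the event. The original string is never
     * modified; a malformed or future date falls back to the publish time.
     */
    static Instant effectiveTimestamp(String eventDateTime, Instant publishTime) {
        if (!hasValidFormat(eventDateTime)) {
            return publishTime;
        }
        Instant claimed = Instant.parse(eventDateTime);
        return claimed.isAfter(publishTime) ? publishTime : claimed;
    }
}
```

Returning a separate timestamp, rather than rewriting the string, mirrors the point above: the original field stays intact while downstream analytics use the corrected value.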

Certain events, such as add_to_cart and purchase, require the Items array to be populated with at least one item. These items normally have all the fields correctly populated; however, there have been situations where this was not done for some of the descriptive fields, such as item_name, item_brand, and item_category. If any of these are not populated, they need to be fixed.

A serial set of transforms can help check whether or not an element needs to be corrected. To do this, we’ll need a way to tag elements with problems. Adding a new property to the ClickStreamEvent element and all other elements that require validation is one way to go; however, this changes the element itself and will show up in any output downstream (e.g., if the element is passed to BigQuery). Instead, we decided to wrap the element in another schema, one which includes an array to tag elements with problems.
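As a rough illustration of the wrapping idea, the following plain-Java sketch pairs an untouched event with a list of errors. It is not the application's schema definition, which uses Beam Schemas: the `ClickstreamEvent` and `Validated` classes and their fields are hypothetical and heavily trimmed down.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Plain-Java sketch of wrapping an element together with a list of validation errors. */
class ErrorTaggingWrapper {

    /** Hypothetical, trimmed-down clickstream event. */
    static final class ClickstreamEvent {
        final String clientId;
        final String event;
        ClickstreamEvent(String clientId, String event) {
            this.clientId = clientId;
            this.event = event;
        }
    }

    /** Wrapper carrying the untouched event plus the problems found so far. */
    static final class Validated {
        final ClickstreamEvent data;
        final List<String> errors;
        Validated(ClickstreamEvent data, List<String> errors) {
            this.data = data;
            this.errors = Collections.unmodifiableList(new ArrayList<>(errors));
        }
        /** Returns a new wrapper with one more error; the event itself is not changed. */
        Validated withError(String error) {
            List<String> updated = new ArrayList<>(errors);
            updated.add(error);
            return new Validated(data, updated);
        }
        /** What a sink such as BigQuery would receive: the event without the errors. */
        ClickstreamEvent unwrap() {
            return data;
        }
    }
}
```

Since the errors live on the wrapper and not on the event, unwrapping before the sink is enough to keep the validation bookkeeping out of the stored data.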

In the above code snippet, the errors field lets us add a list of problems to our clickstream, which is then checked for within the correction transforms. ValidateAndCorrectCSEvt transform is then used to compose all of the validation and correction transforms. A section of this is shown below. 

Note that we make use of Row.withSchema to create a new row element. Also note that PCollectionTuple is not able to infer the row schema, therefore we use withSchema to explicitly set the row schema.

Once the validation is complete, we’re ready to put the corrections into place:

Items which are not fixable follow the deadletter pattern and are sent to a sink for future correction, which is often a manual process.

Another modular pattern that could have been used is to create a DoFn in a way that allows injecting multiple ‘verifiers’/‘correctors’. The verifiers/correctors could be a library (i.e. not a PTransform) that is reused across the system. Only a failed corrector would send the object to the deadletter queue.
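One way to picture the injectable verifier/corrector pattern is the plain-Java sketch below. It is an assumption about how such a library might look, not code from the sample application: the `Check` class and the `process` method are hypothetical, and an empty `Optional` stands in for sending the element to the deadletter queue.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Predicate;

/** Plain-Java sketch of a processing step that takes injected verifiers and correctors. */
class PluggableChecks {

    /** Pairs a verifier with the corrector to run when verification fails. */
    static final class Check<T> {
        final Predicate<T> verifier;
        final Function<T, Optional<T>> corrector;   // empty result means "could not fix"
        Check(Predicate<T> verifier, Function<T, Optional<T>> corrector) {
            this.verifier = verifier;
            this.corrector = corrector;
        }
    }

    /** Applies each check in order; returns empty if any corrector fails (dead letter). */
    static <T> Optional<T> process(T element, List<Check<T>> checks) {
        T current = element;
        for (Check<T> check : checks) {
            if (check.verifier.test(current)) {
                continue;                            // nothing to fix for this check
            }
            Optional<T> fixed = check.corrector.apply(current);
            if (!fixed.isPresent()) {
                return Optional.empty();             // failed corrector: dead letter
            }
            current = fixed.get();
        }
        return Optional.of(current);
    }
}
```

A caller would build the list of checks once (for example, one check per descriptive item field) and pass it to whichever step needs it, so the same verifiers can be shared across pipelines without being tied to a particular transform.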

With that step, we finish validation and correction, and can move on to creating session windows.

Second task: Creating Session windows

Session windows are a special type of data window, which, unlike fixed windows (which rely on discrete time intervals), make use of the data itself to determine when a window should be closed. From the Apache Beam documentation:

“A session window function defines windows that contain elements that are within a certain gap duration of another element.” 

To understand what this means, let’s imagine a customer logs into your company’s mobile shopping application in the morning before work and orders items to be delivered which will be used for the evening meal. Later on in that day, the customer logs in and browses some sports equipment to use over the weekend. Other windowing options (for example, a fixed window) might miss part of a session, or would require you to manually sort through the data to find the time gap and define these sessions. But with a session window, we can treat these two activities as two distinct sessions. As there is a long delay between the end of the morning activities and the start of the evening activities (the ‘gap’ duration), the session window will output the activities from the morning as one distinct session and start a new session when the user begins the evening activities.

Note that we specify the value of the gap duration when we apply the window transform to our pipeline. While there’s no perfect value for this duration, in practice, something in the range of low single-digit hours works reasonably well.

We can use Apache Beam to create such session windows, and the code is fairly simple (see the ClickStreamSessions transform). We’ll apply a session window to our validated and cleansed data, then do a grouping operation to bring all the events of a session together. The grouping key for this will be the client_id, which is provided by Google Tag Manager.

The ITERABLE[ROW[ClickstreamEvent]] will hold all the events for that session and can be used to carry out further analysis work. For example, you might want to study what sequence of browsing events ultimately led to the purchase. This can then be found in the transform ClickStreamSessions within the retail application.
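The gap-based grouping described above can be illustrated without Beam. The following is a rough sketch in plain Python, not the ClickStreamSessions transform itself: the `sessionize` function, the sample timestamps and the two-hour gap are assumptions chosen for the example, and it ignores streaming concerns such as late data and watermarks.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def sessionize(events, gap):
    """Group (client_id, timestamp) pairs into per-client sessions.

    A new session starts when the time since the client's previous
    event is at least `gap`.
    """
    by_client = defaultdict(list)
    for client_id, ts in events:
        by_client[client_id].append(ts)

    sessions = {}
    for client_id, stamps in by_client.items():
        stamps.sort()
        client_sessions = [[stamps[0]]]
        for ts in stamps[1:]:
            if ts - client_sessions[-1][-1] >= gap:
                client_sessions.append([ts])      # gap exceeded: new session
            else:
                client_sessions[-1].append(ts)    # still the same session
        sessions[client_id] = client_sessions
    return sessions


day = datetime(2021, 11, 1)
events = [
    ("c1", day.replace(hour=7, minute=30)),   # morning: grocery order
    ("c1", day.replace(hour=7, minute=40)),
    ("c1", day.replace(hour=19, minute=0)),   # evening: sports equipment
    ("c1", day.replace(hour=19, minute=5)),
]
sessions = sessionize(events, gap=timedelta(hours=2))
```

With a two-hour gap, the morning and evening events of client `c1` end up in two separate sessions, each holding two events.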


We covered a few core concepts and patterns around data validation and correction that are necessary for a healthy streaming pipeline by making use of clickstream data from Google Tag Manager. We also touched on how session windows can help separate the different journeys for a user and segment your data in meaningful ways. 

In our next post, we’ll explore data ingestion flows and how to use Apache Beam schemas to represent the elements within a pipeline. We’ll also look at how to use schemas to write succinct, manageable pipelines. In the meantime, let us know how these approaches are helping you with your day-to-day workflows, and what else you’d like to learn about Dataflow pipelines and applications.


Easier administration and management of BigQuery with Resource Charts and Slot Estimator


As customers grow their analytical workloads and footprint on BigQuery, their monitoring and management requirements evolve: they want to manage their environments at scale and take action in context. They also want capacity management capabilities to optimize their BigQuery environments. With our BigQuery Administrator Hub capabilities, customers can now better manage BigQuery at scale. Two key features of BigQuery Administrator Hub are Resource Charts and Slot Estimator, which help administrators understand their BigQuery environments like never before.

Resource Charts gives administrators a native, out-of-the-box experience to monitor slot usage, manage capacity based on historical consumption, troubleshoot job performance, self-diagnose queries, and take corrective action as needed. The charts provide visibility into key metrics such as slot consumption, job performance, concurrency, bytes processed, and failed jobs. Resource Charts are built and rendered using INFORMATION_SCHEMA tables, so customers can understand their data through these purpose-built dashboards or query the data directly to build their own dashboards and monitoring processes.

BigQuery customer Snap is an early adopter of Resource Charts; “Admin Resource Charts is a great tool that helps us understand how we’re consuming our slots and which workloads/queries are driving usage. It has provided us better visibility into our BigQuery environment”—Muthu Hariharasubramanian, Engineering Manager, BigData Infrastructure, Snap, Inc. 

Slot Estimator is an interactive capacity management tool that helps administrators estimate and optimize their BigQuery capacity based on performance. It helps customers make informed capacity planning decisions based on their historical usage and workloads.

PayPal is a preview customer of Slot Estimator: “Slots Estimator is just amazing and a differentiator for BigQuery. Our trials with this feature have shown very good results in predicting slot requirements for critical analytical workloads,” says Bala Natarajan, Sr. Director of Data Infrastructure and Cloud Engineering, PayPal.

Let’s look at a day in the life of a BigQuery administrator and see how these features can help you. When you come in in the morning and log into the BigQuery UI, you will see Administrator Hub as the central home to understand, manage, and monitor your queries, capacity, and BigQuery environments.

As you monitor your environment in real time in Resource Charts, you see a decline in slot usage a couple of hours later and decide to investigate further.

You can look at the new Errors chart and see a sharp increase in permission denied and invalid errors. You can investigate these errors further using filters such as projects, reservations, users, and job priorities to understand what changed since the morning, and fix them so that you use your slots efficiently.

Later in the day, a data analyst comes to you after noticing that their jobs have been gradually running slower over the week. Using Resource Charts, you see that slot utilization has hit the maximum capacity. On further drill-down, you learn that ramping up a new workflow has caused a steady increase in slot usage, and all the slots are now constantly fully utilized.

You can now switch over to the Slot Estimator tab and see a similar full slot utilization view at 100% and how the slot usage has ramped up over the week. You can look at the reservation data and analyze how much you can improve performance by adding different amounts of slots. Once you decide to add slots, you can directly buy slots for a specific reservation in-context.

Resource Charts is generally available and Slot Estimator is available in preview for customers using Reservations. We hope these Administration capabilities help you better monitor and manage your BigQuery workloads at scale! 



Unlock the power of change data capture and replication with new, serverless Datastream, now GA


We’re excited to announce that Datastream, Google Cloud’s serverless change data capture (CDC) and replication service, is now generally available. Datastream allows you to synchronize data across disparate databases, storage systems, and applications reliably and with minimal latency to support real-time analytics, database replication, and event-driven architectures. You can easily and seamlessly deliver change streams from Oracle and MySQL databases into Google Cloud services such as BigQuery, Cloud SQL, Google Cloud Storage and Cloud Spanner, saving time and resources and ensuring your data is accurate and up to date. Get started with Datastream today.

Datastream provides an integrated solution for CDC replication use cases with custom sources and destinations

*Check the documentation page for all supported sources and destinations.

Since our public preview launch earlier this year, we’ve seen Datastream used across a variety of industries, by customers such as Cogeco, Schnuck Markets, and MuchBetter. This early adoption strengthens the message we’ve been hearing from customers about the demand for change data capture to provide replication and streaming capabilities for real-time analytics and business operations.

MuchBetter is a multi-award-winning e-wallet app, providing a truly secure and enjoyable banking alternative for customers all over the world. Working with Google Cloud Premier Partner Datatonic, they’re leveraging Datastream to replicate real-time data from MySQL OLTP databases into a BigQuery data warehouse to power their analytics needs. According to Andrew McBrearty, Head of Technology at MuchBetter, “from MuchBetter’s point of view, leveraging Dataflow, BigQuery and Looker has unlocked additional insights from our ever-increasing operational data. Using Datastream in our solution ensured continued real-time capability – we now have trend analysis in place, improved efficiency across the business, and the ability to use our data to derive actionable insights and to make data-driven decisions. This means we can continue to grow and adapt at a pace our customers have come to expect from MuchBetter. And for the first time, the world of ML and AI is open to us.”

Getting to know Datastream

Google Cloud customers are choosing Datastream for real-time change data capture because of its differentiated approach:

Simple experience
Real-time replication of change data shouldn’t be complicated: database preparation documentation, secure connectivity setup, and stream validation should be built right into the flow. Datastream delivers on this experience, as MuchBetter discovered during their evaluation of the product. “Datastream’s ease-of-use and immediate availability (serverless) meant we could start our evaluation and immediately see results”, says Mark Venables, Principal Data Engineer at MuchBetter. “For us, this meant getting rid of the considerable pre-work needed to align proof of concept tests with third-party CDC suppliers.”

Datastream guides you to success by providing detailed pre-requisites and step-by-step configuration guidelines to prepare your source database for CDC ingestion.

End-to-end solution
Building pipelines to replicate changes from your source database shouldn’t take up all of your team’s time. Use pre-built Dataflow templates to easily replicate data into BigQuery, Cloud Spanner or Cloud SQL. Out of the box, these Dataflow templates will automatically create the tables and update the data at the destination, taking care of any out-of-order or duplicate events, and providing error resolution capabilities. Leverage the templates’ flexibility to fine-tune Dataflow to fit your specific needs. “Google-managed Dataflow templates meant getting our pipelines up and running with minimal effort and fuss – this allowed more time to be spent on more complex pipeline development whilst tactically delivering solutions to our users,” says Venables.
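To see what "taking care of out-of-order or duplicate events" can mean in practice, here is a small plain-Python sketch of applying change events to a destination table. It is not how the Dataflow templates are implemented; the event fields (`key`, `op`, `ts`, `row`) and the `apply_changes` function are hypothetical names chosen for this illustration, and a dict stands in for the destination table.

```python
def apply_changes(table, events):
    """Apply change events to `table`, a dict keyed by primary key.

    Each event is a dict with hypothetical fields: `key`, `op`
    ("upsert" or "delete"), `ts` (source commit timestamp) and `row`.
    An event is skipped when an event with the same or a newer `ts`
    has already been applied for that key.
    """
    last_applied = {}
    for event in events:
        key, ts = event["key"], event["ts"]
        if key in last_applied and ts <= last_applied[key]:
            continue  # duplicate or stale (out-of-order) event
        last_applied[key] = ts
        if event["op"] == "delete":
            table.pop(key, None)
        else:
            table[key] = event["row"]
    return table


events = [
    {"key": 1, "op": "upsert", "ts": 10, "row": {"name": "a"}},
    {"key": 1, "op": "upsert", "ts": 30, "row": {"name": "c"}},
    {"key": 1, "op": "upsert", "ts": 20, "row": {"name": "b"}},  # arrives late
    {"key": 1, "op": "upsert", "ts": 30, "row": {"name": "c"}},  # duplicate
    {"key": 2, "op": "upsert", "ts": 15, "row": {"name": "x"}},
    {"key": 2, "op": "delete", "ts": 25, "row": None},
]
table = apply_changes({}, events)
```

The late event with `ts` 20 and the repeated event with `ts` 30 are both ignored, so key 1 keeps the newest row and key 2 is removed by its delete.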

Datastream keeps your migrated data secure, supporting private connectivity between source and destination databases. “Establishing connectivity is often viewed as hard. Datastream surprised us with its ease of use & setup, even in more secure modes,” says Grzegorz Dlugolecki, Principal Cloud Architect at, a leading online chess community and mobile application, hosting more than ten million chess games every day. “Datastream’s private connectivity configuration allowed us to easily create a private connection between our source and the destination, and ensure our data is safe and secure.”

Datastream provides a simple wizard to automatically set up private, secure connectivity to your source database

High throughput, low latency
With Datastream’s serverless architecture, you don’t need to worry about provisioning, managing machines, or scaling up resources to meet fluctuations in data throughput. Datastream guarantees high performance: a single stream can process tens of MBs per second while ensuring minimal latency. “We evaluated several market-leading ETL solutions,” says Dlugolecki. “Datastream was the only tool able to successfully sync our complex, single-table datasets, doing this in weeks instead of years estimated by the other vendors.”

Getting started with Datastream

You can start streaming real-time changes from your Oracle and MySQL databases today using Datastream:

Navigate to the Datastream area of your Google Cloud console, under Big Data, and click Create Stream.

Choose the source database type, and see what actions you need to take to set up your source.

Create your source connection profile, which can later be used for additional streams.

Define how you want to connect your source.

Create and configure your destination connection profile.

Validate your stream and make sure the test was successful. Start the stream when you’re ready.

Once the stream is started, Datastream will backfill historical data and will continuously replicate new changes as they happen. 

Learn more and start using Datastream today

Datastream is now generally available for Oracle and MySQL sources. Datastream supports sources both on-premises and in the cloud, and captures historical data and changes into Cloud Storage. Integrations with Cloud Data Fusion and Cloud Dataflow (our data integration and stream processing products, respectively) replicate changes to other Google Cloud destinations, including: BigQuery, Cloud Spanner, and Cloud SQL.

For more information, head on over to the Datastream documentation, see our step-by-step Datastream + Dataflow to BigQuery tutorial, or start training with this Datastream Qwiklab.



Unlocking opportunities with data transformation


One of the biggest challenges data executives have today is turning the immense amount of information that their organization, customers and partners — or rather their whole ecosystem — are creating into a competitive advantage. 

In my role here at Google Cloud, I specialize in everything data — from analytics, to business intelligence, data science and AI. 

My team’s role is split into 3 main activities:

Engagement with customers and partner community. About 70% of my time is spent with customers. And it’s where I’ve gathered all these insights that I’m going to share with you today.  

Product strategy and execution. This time is for strategizing and planning around all our new Cloud launches and products.

Go-To-Market globally. This is where we ask all the tough questions: How do we make it easier for our customers to onboard? And get the most out of our services? To transform and innovate? And then we solve for them.

It’s safe to say data-driven transformation is my bread and butter. And I want it to be yours too. My aim is to help people think about data in a new way — not something to be afraid of, but something to leverage and grow with. There are still lots of problems to be solved in our industry. But data is helping us unlock a world of opportunities. I talked with Stephanie Wong, Google Cloud’s Developer Advocate, and we’d like to share some key learnings that can help you plan for change and transform your business with your data.

What modern data architectures look like today

There’s a treasure trove of new technologies that are transforming the way companies do business at incredible speeds. I think of companies like PayPal, which migrated over 20 petabytes of data to serve its 3,000+ users, and Verizon Media, which ingested 200 terabytes of data daily and stored 100 petabytes in BigQuery. Even traditional retailers like Crate & Barrel are making strides in the cloud, doubling their return-on-ad-spend (ROAS) while only increasing investment by 20%.

But what do these companies all have in common? A modern approach to their data practices and platforms. And there are three attributes that I think all organizations should take into account: 

1. Embrace the old with the new.  Every single one of the most important brands on earth has legacy systems. They’ve developed leadership over decades and these systems (before the cloud came along) got them there.

2. Don’t discard what’s going to get you there (i.e., multi-cloud). All modern architectures today are multi-cloud by default. According to Flexera, over 80% of businesses reported using a multi-cloud strategy this year and over 90% have a hybrid strategy in place.

3. Data is no longer a stagnant asset. Organizations that win with data think about it as part of an ‘ecosystem’ of opportunity, where insights arise from emerging data — whether it be from interconnected data networks or the data from their partners. And this is a trend organizations should keep their eye on. A study from Gartner predicts that by 2023, organizations that promote data sharing will outperform their peers on most business value metrics.

How to make the best hires for your data team

Leaders often say that their competitive advantage comes from their people, not just services or products. While most companies are now recognizing the importance of data and analytics, many still struggle to get the right people in place. 

The best way to look at how many data people to hire is to ask yourself, what percentage of my total employee base should they make up? I agree with Kirk Borne, Chief Scientist Officer at DataPrime Solutions, who says that your entire organization should be ‘data literate’. And when we say literate, we mean recognize, understand and talk data. 

One third of your company should be ‘data fluent’ — meaning able to analyze and present informed results with data. And finally, 10% of employees should be ‘data professionals’ that are paid to create value from data. That’s where all your chief scientists, data analysts, engineers and Business Intelligence specialists come into play.

The ideal data team structure of course depends on the type and size of the company. Furniture and home e-commerce company Wayfair for instance has approximately 3,000 engineers and data scientists — close to 18% of its total workforce. 

Who should own the data? 

There are a lot of questions around who data leaders should work for and who should own that data. It’s tough to answer because there are so many choices. Should it be the CTO? Or the CFO, whose initiatives are around cost reduction? Or the CPO, who may focus on product analytics only?

When we ask customers at scale, data ownership typically sits under the CFO or CTO. And while that makes sense, I think there’s something else we should be asking: How should data be approached so that companies are enabled to innovate with it?

A trend we’re hearing a lot more about is data mesh. This data ownership approach basically centralizes data and decentralizes analytics through ‘data neighborhoods.’ This allows business users and data scientists to access, analyze, and augment insights, but in a way that’s connected to the centralized strategy and abides by corporate rules and policies.

Data neighborhoods

Data: 2022 and beyond

Data analytics, data integration and data processing can be very complex, especially as we begin to modernize. So I’d like to leave you with a ‘gotcha’ moment — and that’s data sharing. 

You can’t expect to reap the benefits of data instantly. First you have to work with it, clean it up and analyze it. The real innovators are those looking at the wider picture — considering analytics solutions and sharing and combining datasets.

My advice for people who want to get started? Forget the notion of new and existing use cases and focus on business value from day one. How are you going to measure that? And how are you sharing that with leaders that are supporting your initiative? 

Data is constantly growing and trends are always shifting. So we need to stay on our toes. Data-driven transformation gives businesses real-time insights and prepares you for the unpredictable. So looking forward to 2022, I’d say use data to plan for change and plan for the unexpected.

A data cloud offers a comprehensive and proven approach to cloud, allowing you to increase agility, innovate faster, get value from your data, and support business transformation. Google Cloud is uniquely positioned to help businesses get there.


Tokopedia’s journey to creating a Customer Data Platform (CDP) on Google Cloud Platform


Founded in 2009, Tokopedia is an ecommerce platform that enables millions of Indonesians to transact online. As the company grows, there is an urgent need to better understand customer behavior in order to improve the customer experience across the platform. Tokopedia now has more than 100 million monthly active users, whose demographics and preferences all differ. One way to meet their needs is through personalization.

Normally, a user needs to browse through thousands of products in order to find the item they are looking for. By creating product recommendations that are relevant to each user, we shorten their search journey and hopefully increase conversion early on in the journey. To build personalization, the Data Engineering team’s Customer Data Platform (CDP) provides access to user attributes. These attributes, developed by the Data Engineering team, come in handy for different use cases across functions and teams.

Previously, two main challenges were observed:

The need for speed and answers caused an increase in data silos. As the need for personalization increased across the company, different teams built their own personalization features. However, limited time and the need to simplify communication across teams led each team to create its own data pipeline. This caused redundancies, because similar data was developed by different teams, and these redundancies slowed the development of new personalized features, even though some of the attributes had already been built in a different module.

Inconsistent data definitions. As each team created its own data pipeline, there were many cases where teams had different definitions of a user’s attributes. On several occasions, this caused misunderstandings during meetings and unsynchronized user journeys, because different teams applied different attribute values to the same user. For example, team A evaluated user_id 001 as a woman in her 20s. Meanwhile, team B, having a different set of attributes and definitions, evaluated user_id 001 as a woman in her 30s. These differences in definitions and attributes can lead to different conclusions and results, and consequently to different personalizations. As a result, customers might have an inconsistent and poor experience during their journey in Tokopedia. Imagine being shown content related to college necessities in one module, and then content related to mom and baby products in a different module.

Previous State of Data Distribution

Currently, with CDP, different teams do not have to constantly rebuild the infrastructure. The same attributes only need to be processed once and can be used by different teams across the company. This optimizes development time, cost, and effort. Another advantage of having CDP is the single definition of attributes across services and teams. Since different teams look at the same attributes inside the CDP, this reduces the chance of misunderstanding and strengthens synchronization between teams. It gives customers a consistent experience across the Tokopedia platform and enables teams to display relevant content.

CDP High level Concept

Moreover, there are several key factors required in building the CDP platform in Tokopedia. The journey is as follows:

1. Define and Make a List of Attributes
During this phase, we worked with the Product and Analyst teams to define all of the user attributes required to build the CDP. Our product team interviewed several stakeholders to understand different perspectives regarding user attributes. As a result, an initial attribute list was made that included gender, age group, location, etc. This process was done iteratively in order to reach the best understanding of the user attributes.

2. Platform Design
After doing comprehensive reviews, we decided to build our CDP platform using several GCP tech stacks.

CDP Architecture

BigQuery was chosen as the analytics backend of our CDP self-service. Meanwhile, Cloud Bigtable was selected as the backend with which our services interact to enable personalization. In developing the storage for Bigtable, the design of the schema is very important. The frequency and categorization affect how we design the column qualifier, while the CDP attribute affects how we design the row key.

We also opted to create a caching mechanism to reduce the load on Bigtable for similar read activity. We built the cache system using Redis with a Time to Live (TTL) on each entry to ensure optimized performance. In addition, we applied a Role-Based Access Control (RBAC) mechanism on the CDP API to control which services can access which attributes in the CDP.
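The caching idea can be sketched as a small read-through cache with a TTL. This is an illustration under stated assumptions, not Tokopedia's implementation: an in-process dict stands in for Redis, `load_from_backend` is a hypothetical stand-in for a Bigtable read, and the 60-second TTL and the injected clock are choices made only so the example is easy to follow.

```python
import time


class TTLCache:
    """Read-through cache: entries expire `ttl_seconds` after being loaded."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh hit: the backend is not touched
        value = self._loader(key)  # miss or expired: read from the backend
        self._entries[key] = (now + self._ttl, value)
        return value


backend_reads = []


def load_from_backend(user_id):
    """Hypothetical backend read; records each call so reads can be counted."""
    backend_reads.append(user_id)
    return {"user_id": user_id, "age_group": "20-29"}


fake_now = [0.0]  # controllable clock so expiry can be shown without sleeping
cache = TTLCache(load_from_backend, ttl_seconds=60, clock=lambda: fake_now[0])

cache.get("001")    # miss: loads from the backend
cache.get("001")    # hit: served from the cache
fake_now[0] = 61.0  # move past the TTL
cache.get("001")    # expired: loads from the backend again
```

Three reads of the same user result in only two backend calls; a shorter TTL trades backend load for fresher attributes.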

3. Monitoring and alerting
Another important point in building a CDP is developing the right monitoring and alerting system to keep our platform stable. A soft and a hard threshold are established and monitored for each metric. Once a threshold is reached, alerts are sent through our communication channel. Based on the current architecture, there are several parts for which we need to enable monitoring and alerting.
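As a rough illustration of soft and hard thresholds, the sketch below classifies a metric reading into an alert level. The metric names, threshold values and the `alert_level` function are all hypothetical; real thresholds depend on the workload and would normally live in the monitoring system rather than in application code.

```python
def alert_level(value, soft, hard):
    """Return "ok", "warning" (soft threshold reached) or "critical" (hard)."""
    if value >= hard:
        return "critical"
    if value >= soft:
        return "warning"
    return "ok"


# Hypothetical (soft, hard) thresholds per metric.
thresholds = {
    "cpu_utilization": (0.70, 0.90),
    "p99_latency_ms": (50, 200),
}
observed = {"cpu_utilization": 0.75, "p99_latency_ms": 320}

alerts = {
    metric: alert_level(observed[metric], *thresholds[metric])
    for metric in thresholds
}
```

Here the CPU reading crosses only the soft threshold while the latency reading crosses the hard one, so the two metrics would trigger alerts of different severity.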

Data Pipeline
One of the things we need to monitor is resource consumption during computation and in the data pipeline from the data sources to the CDP storage, as we use BigQuery and Dataflow for data computation and the data pipeline. In BigQuery, we need to monitor the slot utilization used to compute the data aggregations or manipulations that produce the attributes.

Data Quality
When building the CDP, high-quality data was important for it to be a trusted platform. Several metrics are important in terms of data quality: Data Completeness, Data Validity, Data Anomaly, and Data Consistency. Therefore, several monitors need to be enabled to track these metrics.
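Two of these metrics, completeness and validity, can be expressed as simple ratios. The sketch below uses one possible definition of each; the definitions, the `age_group` attribute and its allowed values are assumptions for illustration, not the checks actually used in the CDP.

```python
def completeness(rows, field):
    """Share of rows where `field` is present and not None."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(field) is not None)
    return filled / len(rows)


def validity(rows, field, allowed):
    """Share of non-null values of `field` that belong to `allowed`."""
    values = [row[field] for row in rows if row.get(field) is not None]
    if not values:
        return 0.0
    return sum(1 for value in values if value in allowed) / len(values)


rows = [
    {"user_id": "001", "age_group": "20-29"},
    {"user_id": "002", "age_group": None},        # incomplete
    {"user_id": "003", "age_group": "unknown?"},  # present but invalid
    {"user_id": "004", "age_group": "30-39"},
]
allowed_age_groups = {"20-29", "30-39"}

age_completeness = completeness(rows, "age_group")
age_validity = validity(rows, "age_group", allowed_age_groups)
```

In this sample, three of the four rows carry an age group and two of those three values are valid; ratios like these are the kind of number a monitor could compare against a threshold.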

Storage and API Performance
Since CDP’s backend and API directly interact with several front-facing features, we have to ensure the availability of the CDP service. Since we’re using Bigtable as the backend, we need to monitor CPU, latency, and RPS. These metrics are provided by default in Bigtable monitoring.

4. Discoverability across company
Many users have been asking how they can browse the attributes that our CDP offers. Initially, we started out by documenting our attributes and sharing the documentation with our stakeholders. However, as the number of attributes increased, it became increasingly hard for people to go through our documentation. This pushed us to start integrating the CDP terminology into our Data Catalog. Our Data Catalog now plays an important role in enabling users to browse the attributes in the CDP, including the definition of each attribute and how to retrieve the data.

5. Implementation and adoption of the platform 
Another key point for a successful CDP implementation is collaboration across teams on the front end services. There are several types of CDP implementation in Tokopedia: Personalization, Marketing Analytics, and Self Service Analytics.

The most common usage of CDP would be in personalizing a user’s journey. One example of personalization is the search feature. The product team personalizes the user’s search result based on the user’s address, so that the user will be able to find products that are in proximity to their location. After discussing the definition of user address, we created a CDP API contract with the Search team, so the development can run in parallel. As a result, today our users are able to have a better user experience based on their location.

Marketing Analytics
When we started building the CDP platform, we discussed existing use cases with the Marketing team. One of their goals was to personalize and optimize marketing efforts, such as sending notifications to the right users based on their attributes, to reduce unnecessary notification costs for unrelated users and to enhance the overall user experience by avoiding spam notifications. Once we understood their needs, we looked at the ways in which CDP could cater to those needs. We discussed with the relevant team how to integrate the segmentation engine and communication channel with the CDP platform, and which user attributes to use when sending marketing push notifications.

Self-Service Analytics
CDP is also often used for self-service analytics, to enable quick insights into user demographics and behavior in certain segments. To build this self-service analytics tool, our team consulted with the Product and Analyst teams to define the user demographic attributes that business and product users often select for insights. After understanding the attributes required, we worked with the Business Intelligence team to enable the visualization for the end user. This allowed different teams to understand our users better and gain insights on how we can improve our platform.

The CDP implementation has created a significant impact on different use cases and helped Tokopedia become a more data-driven company. Through CDP, we are also able to strengthen one of the core parts of our DNA, which is Focus on Consumer. By sharing the CDP framework, we hope to bring value and help others more easily create a thriving CDP platform.


Store more and worry less with 31 day retention in Pub/Sub


Pub/Sub stores messages reliably at any scale. Publish a log entry or an event and you don’t have to worry about when it is processed. If your subscribers, event handlers, or consumers are not keeping up, we’ll hold the messages until they are ready. Bugs made it past your integration test? Not to worry: just seek back and replay.

But, you had to worry a little: until today, you had up to a week to fix the code and process everything. And if you wanted to use historical data to compute some aggregate state, such as a search index, you had to use a separate storage system. In fact, we noticed that many of our users stored raw copies of all messages in GCS or BigQuery, just in case. This is reliable and inexpensive, but requires a separate reprocessing setup in case you actually need to look at older data. 

Starting today, Pub/Sub can store your messages in a topic for up to 31 days. This gives you more time to debug subscribers. This also gives you a longer time horizon of events to backtest streaming applications or initialize state in applications that compute state from an event log. 

Using the feature is simple. The interfaces and pricing are unchanged. You can just set a larger value for a topic’s message retention duration. For example, you can configure extended retention of an existing topic using the gcloud CLI:

gcloud pubsub topics update myTopic --message-retention-duration 31d

Or use the settings in the Topic Details page in Cloud Console.

One limitation of this feature is that you cannot extend storage retention for an individual subscription beyond 7 days. This limits the control individual subscription owners have over storage. The limit comes with benefits: controlling storage costs is simpler and so is limiting access to older data across multiple applications.

We’d love to hear how you’ve used this feature or how it fell short of your needs. Let us know by posting a message to the pubsub-discuss mailing list or creating a bug.
