Meet Optimus, Gojek’s open-source cloud data transformation tool

Editor’s note: Earlier this year, we heard from Gojek, the on-demand services platform, about the open-source data ingestion tool it developed for use with data warehouses like BigQuery. Today, Gojek VP of Engineering Ravi Suhag is back to discuss the open-source data transformation tool it is building.

In a recent post, we introduced Firehose, an open source solution by Gojek for ingesting data to cloud data warehouses like Cloud Storage and BigQuery. Today, we take a look at another project within the data transformation and data processing flow.

As Indonesia’s largest hyperlocal on-demand services platform, Gojek has diverse data needs across transportation, logistics, food delivery, and payments processing. We also run hundreds of microservices across billions of application events. While Firehose solved our need for smarter data ingestion across different use cases, our data transformation tool, Optimus, ensures the data is ready to be accessed with precision wherever it is needed.

The challenges in implementing simplicity

At Gojek, we run our data warehousing across a large number of data layers within BigQuery to standardize and model data that’s on its way to being ready for use across our apps and services. 

Gojek’s data warehouse has thousands of BigQuery tables. More than 100 analytics engineers run nearly 4,000 jobs on a daily basis to transform data across these tables. These transformation jobs process more than 1 petabyte of data every day. 

Apart from the transformation of data within BigQuery tables, teams also regularly export the cleaned data to other storage locations to unlock features across various apps and services.

This process addresses a number of challenges:

Complex workflows: With a large number of BigQuery tables and hundreds of analytics engineers writing transformation jobs simultaneously, we depend heavily on very complex directed acyclic graphs (DAGs) being scheduled and processed reliably.

Support for different programming languages: Data transformation tools must ensure standardization of inputs and job configurations, but they must also comfortably support the needs of all data users. They cannot, for instance, limit users to only a single programming language. 

Difficult to use transformation tools: Some transformation tools are hard to use for anyone that’s not a data warehouse engineer. Having easy-to-use tools helps remove bottlenecks and ensure that every data user can produce their own analytical tables.

Integrating changes to data governance rules: Decentralizing access to transformation tools requires strict adherence to data governance rules. The transformation tool needs to ensure columns and tables have personally identifiable information (PII) and non-PII data classifications correctly inserted, across a high volume of tables. 

Time-consuming manual feature updates: New requirements for data extraction and transformation for use in new applications and storage locations are part of Gojek’s operational routine. We needed to design a data transformation tool that could be updated and extended with minimal development time and disruption to existing use cases.

Enabling reliable data transformation on data warehouses like BigQuery 

With Optimus, Gojek created an easy-to-use, reliable, and performant workflow orchestrator for data transformation, data modeling, data pipelines, and data quality management. If you’re using BigQuery as your data warehouse, Optimus makes data transformation more accessible for your analysts and engineers. This is made possible through simple SQL queries and YAML configurations, with Optimus handling key demands such as dependency management and scheduling data transformation jobs to run at scale.

Key features include:

Command line interface (CLI): The Optimus command line tool offers effective access to services and job specifications. Users can create, run, and replay jobs, dump a compiled specification for a scheduler, create resource specifications for data stores, add hooks to existing jobs, and more.

Optimized scheduling: Optimus offers an easy way to schedule SQL transformation through YAML based configuration. While it recommends Airflow by default, it is extensible enough to support other schedulers that can execute Docker containers.

Dependency resolution and dry runs: Optimus parses data transformation queries and builds dependency graphs automatically. Deployment queries are given a dry-run to ensure they pass basic sanity checks.

Powerful templating: Users can write complex transformation logic with compile time template options for variables, loops, IF statements, macros, and more.

Cross-tenant dependency: With two or more tenants registered, Optimus can resolve cross-tenant dependencies automatically.

Built-in hooks: If you need to sink a BigQuery table to Kafka, Optimus can make it happen thanks to hooks for post-transformation logic that extend the functionality of your transformations.

Extensibility with plugins: By focusing on the building blocks, Optimus leaves governance for how to execute a transformation to its plugin system. Each plugin features an adapter and a Docker image, and Optimus supports Python transformation for easy custom plugin development.
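
To make the dependency-resolution feature above more concrete, here is a minimal Python sketch of the general idea: read the tables each query references, build a graph, and derive a safe run order. This is not Optimus’s implementation. The job names, the SQL, and the regular expression are all hypothetical, and a real tool would use a proper SQL parser rather than pattern matching.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical jobs: each writes one destination table from a SQL query.
JOBS = {
    "staging.orders": "SELECT * FROM raw.orders WHERE status IS NOT NULL",
    "staging.payments": "SELECT * FROM raw.payments",
    "marts.daily_revenue": (
        "SELECT o.order_date, SUM(p.amount) AS revenue "
        "FROM staging.orders o JOIN staging.payments p ON o.id = p.order_id "
        "GROUP BY o.order_date"
    ),
}

# Naive assumption: a table name is whatever follows FROM or JOIN.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def upstream_tables(sql):
    """Return the set of table names a query reads from."""
    return set(TABLE_REF.findall(sql))


def build_graph(jobs):
    """Map each job to the upstream jobs it depends on.

    Tables that no job produces (for example raw sources) are treated as
    external and left out of the graph.
    """
    return {
        dest: {t for t in upstream_tables(sql) if t in jobs and t != dest}
        for dest, sql in jobs.items()
    }


def execution_order(jobs):
    """Return jobs in an order where every dependency runs first.

    TopologicalSorter raises graphlib.CycleError if jobs depend on each
    other in a loop.
    """
    return list(TopologicalSorter(build_graph(jobs)).static_order())


if __name__ == "__main__":
    for job in execution_order(JOBS):
        print(job)
```

In this sketch the two staging jobs can run in either order, while the revenue job always comes last because it reads from both of them.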

Key advantages of Optimus

Like Google Cloud, Gojek is all about flexibility and agility, so we love to see open source software like Optimus helping users take full advantage of multi-tenancy solutions to meet their specific needs. 

Through a variety of configuration options and a robust CLI, Optimus ensures that data transformation remains fast and focused by preparing SQL correctly. Optimus handles all scheduling, dependencies, and table creation. With the capability to build custom features quickly based on new needs through Optimus plugins, you can explore more possibilities. Errors are also minimized with a configurable alert system that flags job failures immediately. Whether by email or Slack, you can trigger alerts based on specific requirements, from point of failure to warnings based on SLA requirements.

How you can contribute

With Firehose and Optimus working in tandem with Google Cloud, Gojek is helping pave the way in building tools that enable data users and engineers to achieve fast results in complex data environments.

Optimus is developed and maintained on GitHub and uses Requests for Comments (RFCs) to communicate ideas for its ongoing development. The team is always keen to receive bug reports, feature requests, assistance with documentation, and general discussion as part of its Slack community.

Source : Data Analytics Read More

What’s New with Google’s Unified, Open and Intelligent Data Cloud

We’re fortunate to work with some of the world’s most innovative customers on a daily basis, many of whom come to Google Cloud for our well-established expertise in data analytics and AI. As we’ve worked and partnered with these data leaders, we have encountered similar priorities among many of them: to remove the barriers of data complexity, unlock new use cases, and reach more people with more impact. 

These innovators and industry disruptors power their data innovation with a data cloud that lets their people work with data of any type, any source, any size, and at any speed, without capacity limits. A data cloud that lets them easily and securely move across workloads: from SQL to Spark, from business intelligence to machine learning, with little infrastructure set up required. A data cloud that acts as the open data ecosystem foundation needed to create data products that employees, customers, and partners use to drive meaningful decisions at scale.

On October 11, we will be unveiling a series of new capabilities at Google Cloud Next ’22 that continue to support this vision. If you haven’t registered yet for the Data Cloud track at Google Cloud Next, grab your spot today!

But I know you data devotees probably can’t wait until then. So, we wanted to take some time before Next to share some recent innovations for data cloud that are generally available today. Consider these the data hors d’oeuvres to your October 11 data buffet.

Removing the barriers of data sharing, real-time insights, and open ecosystems

The data you need is rarely stored in one place. More often than not data is scattered across multiple sources and in various formats. While data exchanges were introduced decades ago, their results have been mixed. Traditional data exchanges often require painful data movement and can be mired with security and regulatory issues. 

This unique use case led us to design Analytics Hub, now generally available, as the data sharing platform for teams and organizations who want to curate internal and external exchanges securely and reliably. 

This innovation not only allows for the curation and sharing of a large selection of analytics-ready datasets globally, it also enables teams to tap into the unique datasets only Google provides, such as Google Search Trends or the Data Commons knowledge graph.

Analytics Hub is a first-class experience within BigQuery. This means you can try it now for free using BigQuery, without having to enter any credit card information. 

Analytics Hub is not the only way to bring data into your analytical environment rapidly. We recently launched a new way to extract, load, and transform data in real-time into BigQuery: the Pub/Sub “BigQuery subscription.” This new ELT innovation simplifies streaming ingestion workloads, is simpler to implement, and is more economical since you don’t need to spin up new compute to move data and you no longer need to pay for streaming ingestion into BigQuery. 

But what if your data is distributed across lakes, warehouses, multiple clouds, and file formats? As more users demand more use cases, the traditional approach to build data movement infrastructure can prove difficult to scale, can be costly, and introduces risk. 

That’s why we introduced BigLake, a new storage engine that extends BigQuery storage innovation to open file formats running on public cloud object stores. BigLake lets customers build secure data lakes over open file formats. And, because it provides consistent, fine-grained security controls for Google Cloud and open-source query engines, security only needs to be configured in one place to be enforced everywhere.

Customers like Deutsche Bank, Synapse LLC, and Wizard have been taking advantage of BigLake in preview. Now that BigLake is generally available, I invite you to learn how it can help to build your own data ecosystem.

Unlocking the ways of working with data

When data ecosystems expand to data of all shapes, sizes, types, and formats, organizations struggle to innovate quickly because their people have to move from one interface to the next, based on their workloads.

This problem is often encountered in the field of machine learning, where the interface for ML is often different than that of business analysis. Our experience with BigQuery ML has been quite different: customers have been able to accelerate their path to innovation drastically because machine learning capabilities are built-in as part of BigQuery (as opposed to “bolted-on” in the case of alternative solutions).

We’re now applying the same philosophy to log data by offering a Log Analytics service in Cloud Logging. This new capability, currently in preview, gives users the ability to gain deeper insights into their logging data with BigQuery. Log Analytics comes at no additional charge beyond existing Cloud Logging fees and takes advantage of soon-to-be generally available BigQuery features designed for analytics on logs: search indexes, a JSON data type, and the Storage Write API.

Customers that store, explore, and analyze their own machine generated data from servers, sensors, and other devices can tap into these same BigQuery features to make querying their logs a breeze. Users simply use standard BigQuery SQL to analyze operational log data alongside the rest of their business data!

And there’s still more to come. We can’t wait to engage with you on Oct 11, during Next’22, to share more of the next generation of data cloud solutions. To tune into sessions tailored to your particular interests or roles, you can find top Next sessions for Data Engineers, Data Scientists, and Data Analysts — or create and share your own.

Join us at Next’22 to hear how leaders like Boeing, Twitter, CNA Insurance, Telus, L’Oreal, and Wayfair are transforming data-driven insights with Google’s data cloud.

Source : Data Analytics Read More

Building trust in the data with Dataplex

Analytics data is growing exponentially and so is the dependence on the data in making critical business and product decisions. In fact, the best decisions are said to be the ones which are backed by data. In data, we trust!

But do we trust the data?

As data volumes have grown, one of the key challenges organizations face is how to maintain data quality in a scalable and consistent way across the organization. While data quality is not a new need, it used to be contained when the data footprint was small and data consumers were few. In such a world, data consumers knew who the producers were and producers knew what the consumers needed. But today, data ownership is becoming distributed and data consumption is finding new users and use cases. Existing data quality approaches are therefore limited and isolated to certain pockets of the organization. This often exposes data consumers to inconsistent and inaccurate data, which ultimately impacts the decisions made from that data. As a result, organizations today are losing tens of millions of dollars due to low-quality data.

These organizations are looking for solutions that empower their data producers to consistently create high-quality data at cloud scale.

Building Trust with Dataplex data quality 

Earlier this year, at Google Cloud, we launched Dataplex, an intelligent data fabric that enables governance and data management across distributed data at scale. One of the key things Dataplex enables out of the box is for data producers to build trust in the data with built-in data quality.

The Dataplex data quality task delivers a declarative, DataOps-centric experience for validating data across BigQuery and Google Cloud Storage. Producers can now easily build and publish quality reports or include data validations as part of their data production pipeline. Reports can be aggregated across various data quality dimensions and the execution is entirely serverless.

The Dataplex data quality task provides:

A declarative approach for defining “what good looks like” that can be managed as part of a CI/CD workflow.  

A serverless and managed execution with no infrastructure to provision. 

Ability to validate across data quality dimensions like freshness, completeness, accuracy and validity.

Flexibility in execution – either by using the Dataplex serverless scheduler (at no extra cost) or executing the data validations as part of a pipeline (e.g. Apache Airflow).

Incremental execution – so you save time and money by validating new data only.

Secure and performant execution with zero data-copy from BigQuery environments and projects.

Programmatic consumption of quality metrics for DataOps workflows.

Users can also execute these checks on data that is stored in BigQuery and Google Cloud Storage but is not yet organized with Dataplex.  For Google Cloud Storage data that is managed by Dataplex, Dataplex auto-detects and auto-creates tables for structured and semi-structured data. These tables can be referenced with the Dataplex data quality task as well. 

Behind the scenes, Dataplex makes use of an open source data quality engine, Cloud Data Quality Engine, to run these checks. Providing an open platform is one of our key goals and we have made contributions to this engine to integrate seamlessly with Dataplex’s metadata and serverless environment.

You can learn more about this in our product documentation.
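
As an illustration of the declarative approach described above, the following Python sketch keeps rule definitions as plain data and evaluates them against a handful of in-memory rows. It is only a sketch under stated assumptions: the rule schema, the check names, and the result fields are hypothetical and do not reproduce the Dataplex or Cloud Data Quality Engine configuration format, and a real task would run the checks inside BigQuery rather than in Python.

```python
# Hypothetical rule definitions: data rather than code, so they can be
# reviewed and versioned like any other configuration.
RULES = [
    {"name": "item_id_not_null", "dimension": "completeness",
     "column": "item_id", "check": "not_null"},
    {"name": "item_id_unique", "dimension": "uniqueness",
     "column": "item_id", "check": "unique"},
    {"name": "price_non_negative", "dimension": "validity",
     "column": "price", "check": "non_negative"},
]


def _not_null(values):
    """One boolean per value: True when the value is present."""
    return [v is not None for v in values]


def _unique(values):
    """One boolean per value: True when the value occurs exactly once."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return [counts[v] == 1 for v in values]


def _non_negative(values):
    """One boolean per value: True when the value is present and >= 0."""
    return [v is not None and v >= 0 for v in values]


# Registry that maps a check name used in RULES to its implementation.
CHECKS = {
    "not_null": _not_null,
    "unique": _unique,
    "non_negative": _non_negative,
}


def run_rules(rows, rules):
    """Evaluate every rule and return one summary record per rule."""
    results = []
    for rule in rules:
        values = [row.get(rule["column"]) for row in rows]
        passed = CHECKS[rule["check"]](values)
        results.append({
            "rule": rule["name"],
            "dimension": rule["dimension"],
            "rows_validated": len(values),
            "rows_failed": passed.count(False),
        })
    return results


if __name__ == "__main__":
    sample_rows = [
        {"item_id": 1, "price": 9.5},
        {"item_id": 2, "price": -1.0},
        {"item_id": 2, "price": 3.0},
        {"item_id": None, "price": 4.0},
    ]
    for result in run_rules(sample_rows, RULES):
        print(result)
```

Because the rules are data, adding a new validation means adding an entry to the list, which is what makes this style easy to manage in a CI/CD workflow.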

Building enterprise trust at American Eagle Outfitters 

One of our enterprise customers, American Eagle Outfitters (AEO), is continuing to build trust in their critical data using the Dataplex data quality task. Kanhu Badtia, lead data engineer from AEO, shares their rationale and experience with the Dataplex data quality task:

“AEO is a leading global specialty retailer offering high-quality & on-trend clothing under its American Eagle® and Aerie® brands. Our company operates stores in the United States, Canada, Mexico, and Hong Kong, and ships to 81 countries worldwide through its websites. 

We are a data-driven organization that utilizes data from physical and digital store fronts, from social media channels, from logistics/delivery partners and many other sources through established compliant processes. We have a team of data scientists and analysts who create models, reports and dashboards that inform responsible business decision-making on such matters as inventory, promotions, new product launches and other internal business reviews. As the data engineering team at AEO, our goal is to provide highly trusted data for our internal data consumers. 

Before Dataplex, AEO had methods for maintaining data quality that were effective for their purpose. However, those methods were not scalable with the continual expansion of data volume and demand for quality results from our data consumers. Internal data consumers identified and reported quality issues where ‘bad data’ was impacting business critical dashboards/reports. As a result, our teams were often in “fire-fighting” mode – finding & fixing bad data. We were looking for a solution that would standardize and scale data quality across the production data pipelines.

The majority of AEO’s business data is in Google’s BigQuery or in Google Cloud Storage (GCS). When Dataplex launched the data quality capabilities, we immediately started a proof-of-concept. After a careful evaluation, we decided to use it as the central data quality framework for production pipelines. We liked that – 

It provides an easy declarative (YAML) & flexible way of defining data quality. We were able to parameterize it to use across multiple tables.

It allows validating data in any BigQuery table with a completely serverless and native execution using existing slot reservations.

It allows executing these checks as part of the ETL pipelines using Dataplex Airflow Operators. This is a huge win as pipelines can now pause further processing if critical rules do not pass.

Data quality checks are executed in parallel which gives us the required execution efficiency in pipelines.

Data quality results are stored centrally in BigQuery & can be queried to identify which rules failed/succeeded and how many rows failed. This enables defining custom thresholds for success.

Organizing data in Dataplex Lakes is optional when using Dataplex data quality.

Our team truly believes that data quality is an integral part of any data-driven organization and Dataplex DQ capabilities align perfectly with that fundamental principle. 

For example, here is a sample Google Cloud Composer / Airflow DAG  that loads & validates the “item_master” table and stops downstream processing if the validation fails.

It includes simple rules for uniqueness, completeness and more complex rules for referential integrity or business rules such as checking daily price variance. We publish all data quality results centrally to a BigQuery table, such as this:

Sample data quality output table

We query this output table for data quality issues & fail the pipeline in case of critical rule failure. This stops low quality data from flowing downstream. 

We now have a repeatable process for data validation that can be used across the key data production pipelines. It standardizes the data production process and effectively ensures that bad data doesn’t break downstream reports and analytics.”
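
The gating step AEO describes, querying the results and failing the pipeline when a critical rule does not pass, could look roughly like the Python sketch below. It is an assumption-laden illustration rather than AEO’s code: the result field names (`rule`, `rows_validated`, `rows_failed`), the rule names, and the thresholds are all hypothetical, and the result rows are passed in as plain dictionaries instead of being read from BigQuery.

```python
class DataQualityError(Exception):
    """Raised when a critical rule exceeds its allowed failure rate."""


# Hypothetical per-rule thresholds: the maximum tolerated share of failed
# rows. Rules not listed here are treated as non-critical.
CRITICAL_THRESHOLDS = {
    "item_id_not_null": 0.0,
    "price_variance_within_limit": 0.01,
}


def failed_critical_rules(results, thresholds):
    """Return (rule, failure_rate, limit) for each breached critical rule."""
    breaches = []
    for row in results:
        limit = thresholds.get(row["rule"])
        if limit is None:
            continue  # non-critical rule: reported elsewhere, never blocks
        validated = row["rows_validated"]
        rate = row["rows_failed"] / validated if validated else 0.0
        if rate > limit:
            breaches.append((row["rule"], rate, limit))
    return breaches


def enforce_quality_gate(results, thresholds):
    """Raise DataQualityError if any critical rule breaches its threshold."""
    breaches = failed_critical_rules(results, thresholds)
    if breaches:
        details = ", ".join(
            f"{rule}: {rate:.2%} failed (limit {limit:.2%})"
            for rule, rate, limit in breaches
        )
        raise DataQualityError(f"Critical data quality rules failed: {details}")


if __name__ == "__main__":
    sample_results = [
        {"rule": "item_id_not_null", "rows_validated": 100, "rows_failed": 3},
        {"rule": "description_present", "rows_validated": 100, "rows_failed": 40},
    ]
    try:
        enforce_quality_gate(sample_results, CRITICAL_THRESHOLDS)
    except DataQualityError as err:
        print(err)
```

Run inside an orchestrator task, an uncaught exception like this one marks the task as failed, which is one way to stop downstream steps from consuming low quality data.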

Learn more

Here at Google, we are excited to enable our customers’ journey to high quality, trusted data. To learn more about our current data quality capabilities, please refer to:

Dataplex Data Quality Overview

Sample Airflow DAG with Dataplex Data Quality task

Source : Data Analytics Read More

Key Criteria When Hiring AI Software Development Agency

There are a number of amazing AI companies. You may want to create an equally successful one. However, you won’t be able to do this without the best experts at your side.

There are a lot of factors that you have to take into consideration when hiring a company to help you develop AI applications. There is a huge lack of qualified AI developers, as a recent report from ZDNet showed.

Many companies will outsource their AI development projects, since they have trouble finding qualified experts on their own. Fortunately, this process can be a lot easier when you work with the right agency.

However, large companies and enterprises must be especially careful when outsourcing software development services to handle their AI projects. The right IT services & staff augmentation can make all the difference in the product you produce, which affects your bottom line.

With this in mind, let’s take a look at some key criteria when hiring a custom software development agency to help you make your AI startup succeed.

How Do You Find the Right Software Development Agency for Your AI Startup?

AI startups need the most qualified experts to help them create new applications. Unfortunately, choosing the best AI software development agency is easier said than done. The good news is that the process will be a lot easier if you follow these guidelines.

Choose The Right Service Type

There is a huge difference between ‘outstaffing’ and ‘outsourcing’ that many hiring agents do not fully understand until it is too late. Outstaffing means hiring employees housed under another company. Outsourcing is the more commonly used technique, which involves hiring independent contractors. Both are viable methods of IT staffing augmentation; deciding which to use is usually a function of the following metrics:

Decide On The Budget

Your budget will be one of the deciding factors in whether you outstaff or outsource. Outstaffing, because you are hiring employees, is much more expensive than outsourcing. Depending on the budget, outstaffing may be out of reach, and the software development staff augmentation decision could be made for you!

If you have a choice of whether to outstaff or outsource, then you may want to pay more attention to the other metrics below. However, most companies that hire out for IT staff augmentation services do prefer to outstaff for long-term needs. Although employees cost more, the benefits far outweigh the costs of having to re-teach company culture and projects to new people over and over again.

Rates & Business Goals

Developing AI software applications is not cheap. You want to make sure that you aren’t skimping on rates. However, you also want to make sure that you aren’t overpaying for a service either.

What are the reasons that your business is in need of staff augmentation? Does the work require highly skilled or low skilled labor? Does the scope of work involve deep dives into sensitive material that your company needs to keep private? If you are looking for highly skilled labor or the work involves company secrets, then you may want to consider outstaffing over outsourcing.

All told, how much productive work are you getting for the money you are spending? When you consider rates, consider not only the wages you are paying, but also the potential cost of insurance and education. Education is a cost that many companies overlook. The learning curve that every new hire must take on, during which they are learning and low on productivity, is a real cost of business. There is also an opportunity cost here of dealing with long-term employees that must be carefully considered.


You want to make sure that the software development team that you hire understands the best practices. This includes ensuring that they know the relevance of agile development when developing AI applications.

You have to make sure that they actually have a good reputation for doing a good job in these regards. You should really only deal with companies that have been rated as a top software development agency. Software is an intense industry that obfuscates a great deal of itself from the outside world. If you are not an insider, you can lose a lot of money and time before you find the right fit, because everyone knows the lingo that makes an agency sound good. If you aren’t a tech person by trade, you can easily be fooled into thinking an agency can get you the right people when they do not really have the specialty you need.

The reputation of a software agency is essential to consider. This is the way that you, as a tech outsider, can know how effective an agency is in creating true synergy between the employees who will be helping you and your core staff. This synergy is essential in getting work done in a productive way.


Communication

You may not think that communication is a big deal when you are dealing with employees or contractors that you do not expect to make a part of your core staff. Nothing could be further from the truth. In order to generate successful projects in a productive way, you must have good communication between your new hires and your core staff.

If your core staff does not have a tech background, it is essential that your software agency bring you people who know how to explain what they are doing in a non-technical fashion. After all, the products that you come out with as a group most likely need to be usable to a non-technical audience. Why not start with you?

Even if your software agency is creating software for internal use, the communication aspect of the process is still a huge part of success. Your employees need to know what the new software can do in layman’s terms so that they can make best use of it when the software developers are not on hand. This is especially true if you are making remote hires — these people will not be in the office, and may not even be in the same time zone to ask questions of.

Troubleshooting capabilities

There is no software project that goes perfectly from the very beginning. A few hiccups here and there should not upset you or make you believe that you have hired the wrong team. You should look at how the team responds to the inevitable bugs in the system, however.

Where to Find the Best Software Development Agency for Your AI Startup?

When you are running an AI startup, word of mouth is the best way to find a software development agency that you can trust. You may also take a look at successful products that come from your competitors. If you can find out who built their software, then you may have a great candidate for your own.

Make sure that you take the appropriate time to go through the entire hiring process for a software agency. You may take a bit more time up front, but this due diligence will definitely pay off in the end. It is much better to spend a bit of money and time looking for that perfect fit than it is to hire quickly and face communication issues or a team that cannot handle troubleshooting.

Even worse is if you outsource when you should have outstaffed and vice versa — make sure that you understand the goals of your company inside and out before making this crucial decision. Keep this in mind, and you can help your company immensely with the right staff augmentation!

The post Key Criteria When Hiring AI Software Development Agency appeared first on SmartData Collective.

Source : SmartData Collective Read More

Traits AI Startups Seek When Hiring New Employees

Are you launching a new AI startup? You will discover that there are a number of opportunities and challenges in creating a company that develops new AI algorithms to solve problems.

The demand for AI technology has surged in recent years. One analysis indicates that 90% of companies have made investments in AI and 37% actively deploy it. AI startups have a burgeoning market that they can serve.

Unfortunately, they also have challenges, such as choosing the right business model for their AI startup. One of the biggest issues that AI entrepreneurs must deal with is finding the right employees, which is easier said than done.

In May, ZDNet author Owen Hughes talked about the shortage of skilled AI developers. Hughes points out that this has made it more difficult for tech companies to innovate effectively.

Of course, the skill shortage doesn’t just apply to AI developers. There is also a shortage of skilled marketers, financial professionals and other experts that AI startups depend on.

Despite the skill shortage, it is essential for AI companies to find the right employees to operate effectively. They will need to know what skills to look for when building a pool of talented employees to serve their business.

Seek Applicants with the Right Skills for Your AI Startup

As an entrepreneur striving to find employees for your AI startup, you will have to read many applications and find the most suitable candidates for certain positions. You can’t waste time overthinking and overanalyzing CVs and cover letters.

Here are some useful suggestions to understand what to focus on when scanning an applicant’s resume and message. Read more about what personal and technical skills you should be looking for in a candidate in the following guide that New Millennia helped create.

Personal Skills

Many AI entrepreneurs and founders of other tech companies make the mistake of putting all of their emphasis on technical skills. This is a huge mistake, because even technology companies need employees with soft skills as well. In fact, one expert points out that 85% of the success in the technology sector can be attributed to soft skills like good communication. AI companies are no exception.

Personal skills are soft skills that people either possess naturally or develop and improve over time. They refer to personal qualities that are transferable to any type of role. This skillset helps an individual to perform better at work. Read the following examples of such personal skills you should search for in an applicant to understand why they’re necessary. 


Self-Motivation

The AI sector has become very competitive. Entrepreneurs in this sphere need employees that are self-motivated and ambitious to help them succeed.

A motivated employee has an internal drive to perform well at work. These employees help save time and money, as they require less guidance. They have a greater level of independence regarding their duties and projects, which is highly productive and effective for the company. A self-motivated employee will inspire their colleagues to do their best when it comes to their roles, as well.


Communication

Communication is the ability to convey factual and complex information in a clear and concise manner. People who communicate well can use many channels, like phone, email, and face-to-face talks, to get their message across. A good communicator is also someone who understands how to listen actively.


Flexibility

Flexibility is another key skill, and it means being able to adapt easily. Be it new duties, positions, or obstacles, a good employee should know how to adapt to various situations quickly. As a recruiter, you should always look for applicants who can respond to complex scenarios in an efficient manner. What’s more, a positive attitude helps showcase the candidate’s ability to adapt, so keep this aspect in mind when you’re interviewing somebody.

Problem Solving

Problem solving refers to the ability to find solutions to issues in a timely manner. Organizations need employees with problem-solving skills to evaluate situations and come up with strategies to fix the issue. It’s important to find candidates who can look at a situation from more than one perspective and think of the best tactics to handle the problem.

Technical Skills Such as AI Programming

Developing AI technology requires decent programming skills. You obviously need to hire developers that understand the programming languages that help create AI applications. Python is one of the best languages for data science and AI, so it is a good idea to find Python programmers for your AI startup.

Technical skills are hard skills that people acquire through learning and practice. People can develop technical skills through courses, various forms of education, and actual work experience. This skillset is typically particular to certain jobs and fields. Read examples of some key technical skills candidates should showcase in order for you to consider them for the role and get a better understanding of their importance.

Industry Specific Skills

Below are a few examples of job-specific skills you should look for when you read applicants’ CVs and cover letters, depending on the role you’re hiring for:

Data Analysis

As far as Data Analysis is concerned, potential employees should have an extensive knowledge of quantitative research, quantitative reporting, compiling statistics, statistical analysis, data mining, and big data.


Marketing

The old adage that you can build a better mousetrap and the world will beat a path to your door doesn’t hold up. Even when you are developing stellar AI applications that blow your competitors out of the water, you need to have a sound marketing team by your side.

In the marketing industry, some key technical skills you should look for are expert knowledge of different social media platforms, a good understanding of Search Engine Optimization (SEO), copywriting, and the ability to operate Content Management Systems (CMS). 

Graphic Design

Regarding Graphic Design, it’s essential for applicants to be familiar with branding, Adobe Suite software, user modelling, responsive design, print design, and typography. Graphic design skills are essential for many Internet-based AI startups, since they need to make sure that their interface is easy to navigate and aesthetically appealing.

Software Development

In the Software Development field, it’s important for candidates to know coding, algorithms, applications, design, security, testing, debugging, modelling, languages, and documentation. This is essential for AI startups.

Technical Support Skills 

People in charge of technical support are vital in every company, as they keep computer systems safe by troubleshooting any issues that may arise. These skills are also essential when configuring new hardware, performing regular updates, and assisting other employees in creating accounts, resetting passwords, and dealing with any difficulty regarding the online system. Technical support teams also help restock equipment and maintain records of software licenses. This is why anyone with this type of skill is a great asset to any organization.

Project Management Skills

Being able to manage people, budgets, and resources in an efficient manner is one of the most important technical skills a candidate can have. Applicants with project management skills are always in high demand in different fields, such as construction and digital marketing. A few key project management technical skills you should look for are project planning, task management, budget planning, risk management, and knowledge of project management software.

When working in-house, recruiters often have several other duties, aside from reviewing CVs, which requires a great deal of attention in itself. Such responsibilities cover various aspects, including the finances of the recruitment agency. For smaller recruitment agencies, there are outsourcing options available to save time and resources, so that recruiters can focus on reading applications properly.

Find the Right Skills for Employees When Growing Your AI Startup

AI companies need the right employees to thrive. You will have an easier time growing your startup if you follow these guidelines to hire the best talent.

The post Traits AI Startups Seek When Hiring New Employees appeared first on SmartData Collective.

Source: SmartData Collective

Built with BigQuery: Retailers drive profitable growth with SoundCommerce

As economic conditions change, retail brands’ reliance on ever-growing customer demand puts these companies at financial and even existential risk. Top-line revenue and active customer growth do not equal profitable growth.

Despite multi-billion dollar valuations for some brands, especially those operating the direct-to-consumer model, the rising costs of meeting shoppers’ high expectations (e.g. free shipping, free returns), along with the escalating costs of goods, fulfillment operations, and delivery, create pressure for brands to turn a profit. The ONLY way brands drive profitable growth is by managing variable costs in real time. This in turn mandates adopting modern data cloud infrastructure including Google Cloud services such as BigQuery.

To unlock the value of data, Google Cloud has partnered with SoundCommerce, a retail data and analytics platform that offers a unique way of connecting marketing, merchandising, and operations data and modeling it within a retail context – all so brands can optimize profitability across the business.

Profitability can be measured per order through short-term metrics like contribution profit or long-term metrics like Customer Lifetime Value (CLV).  Often, retailers calculate CLV as a measure of revenue with no consideration for the variable costs of serving that customer, for instance: the costs of marketing, discounting, delivering orders to the doorstep, or post-conversion operational exceptions (e.g. cancellations, returns).

A customer who first appears to have a high lifetime value under a revenue-based CLV model may not be profitable at all. By connecting marketing, merchandising, and operations data together, brands can understand their most profitable programs, channels, and products through the lens of actual customer value, and optimize accordingly.
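The difference between revenue-based and profit-based CLV can be sketched in a few lines of Python. This is a minimal illustration, not SoundCommerce's model: the `Order` fields and function names are hypothetical, revenue is assumed to be recorded before discounts, and the cost categories simply mirror the ones listed above.

```python
from dataclasses import dataclass


@dataclass
class Order:
    """One order with its variable costs (all field names are hypothetical)."""
    revenue: float                      # assumed to be recorded before discounts
    cost_of_goods: float
    marketing: float
    discount: float
    fulfillment_and_delivery: float
    returns_and_cancellations: float = 0.0


def contribution_profit(order: Order) -> float:
    """Revenue minus every variable cost attributed to the order."""
    variable_costs = (
        order.cost_of_goods
        + order.marketing
        + order.discount
        + order.fulfillment_and_delivery
        + order.returns_and_cancellations
    )
    return order.revenue - variable_costs


def revenue_clv(orders: list[Order]) -> float:
    """Lifetime value measured as total revenue only."""
    return sum(o.revenue for o in orders)


def profit_clv(orders: list[Order]) -> float:
    """Lifetime value measured as total contribution profit."""
    return sum(contribution_profit(o) for o in orders)
```

A customer whose orders carry heavy discounts or a costly return can rank high on `revenue_clv` while `profit_clv` stays close to zero or turns negative, which is the gap described above.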

The journey for brands starts with the awareness and data enablement of a more complex data set containing all variable revenue, cost, and margin inputs. What does a retailer need to do to achieve this?

All data together in one place

Matched and deduplicated disparate data

Data is organized into entities and concepts that business decision makers can understand

A common definition of key business metrics (this is especially important yet challenging for retailers because systems are siloed by department and common KPIs like contribution profit per order may be defined differently across a company)

Branched outputs for actionability: BI dashboards vs. data activation to improve marginal profit.

Once brands understand these requirements, up next is execution. This responsibility may fall within a ‘build’ strategy on the shoulders of technical IT/data leadership and their team(s) within a brand. This offers maximum control but at maximum cost and time-to-value. Retail data sources are complicated and change often. Technical teams within brands can spend too much time building and maintaining the tactical data ingestion process, which means they are spending less time deriving business value from the data. 

But it doesn’t have to be this hard. There are other options in the market that brands can consider, such as SoundCommerce, which provides a library of data connectors, pre-built and customizable data mappings, and outbound data orchestrations, all tailor-made and ready to go for retail brands.

SoundCommerce empowers retail technology leaders to:

Maintain data ownership to allow users to send modeled data to external data warehouses or layer-on business analytics tools for greater flexibility

Provide universal access to data across the organization so every employee can have access to and participate in a low-code or no-code experience

Expand and democratize data exploration and activation among both technical and non-technical users

For retail business and marketing decision-makers, SoundCommerce makes it easy to:

Calculate individual customer lifetime value through the lens of profitability – not just revenue.

Evaluate and predict lifetime shopper performance – identify which programs drive the highest CLV impact

Set CAC and retention cost thresholds – determine optimal Customer Acquisition Costs (CAC) and retention costs that ensure marketing efforts are profitable through the lens of total lifetime transactions

Below is a sample data flow that illustrates how SoundCommerce connects to all the tools a retailer is using and ingests the first-party data, agnostic of the platform. SoundCommerce then models and transforms the data within the retail context so that brands can take immediate action on the insights they gain from the modeled data.

SoundCommerce built on Google Cloud Platform Services

SoundCommerce selected Google Cloud and its services to achieve what it set out to do: drive profitability for retailers and brands. Its platform addresses retailers’ need to centralize and harmonize data, map it to business users’ needs, infer key metrics and insights by visualizing the data, and reuse the produced datasets to build other retail-specific use cases. SoundCommerce built a cloud-native solution on Google Cloud leveraging the data cloud platform. Data from various sources is ingested in raw format, parsed, and processed as messages in Cloud Storage buckets using Google Kubernetes Engine (GKE), then stored as individual events in Cloud Bigtable. A mapping engine maps the data to proprietary data models in BigQuery and stores the results as datasets. Customers use visualization dashboards in Looker to access the data, which is exposed as materialized views within BigQuery. In many cases, these views are directly accessible by the customer per their use case.
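To make the parse-then-map step concrete, here is a minimal in-memory sketch. It does not call any Google Cloud service and does not reflect SoundCommerce's proprietary models: the source names, field mappings, and function names are all hypothetical, and raw messages are assumed to be JSON.

```python
import json

# Hypothetical mappings: source system -> {raw field name: model field name}.
FIELD_MAPPINGS = {
    "storefront": {"order_no": "order_id", "total": "revenue", "email": "customer_email"},
    "warehouse": {"orderRef": "order_id", "ship_cost": "delivery_cost"},
}


def parse_message(raw: str) -> dict:
    """Parse one raw message (assumed to be JSON) into an event dict."""
    return json.loads(raw)


def map_event(source: str, event: dict) -> dict:
    """Rename the raw fields of one event to the common model's field names."""
    mapping = FIELD_MAPPINGS[source]
    return {model: event[raw] for raw, model in mapping.items() if raw in event}


def merge_by_order(rows: list[dict]) -> dict[str, dict]:
    """Combine mapped rows that share an order_id into one record per order."""
    merged: dict[str, dict] = {}
    for row in rows:
        merged.setdefault(row["order_id"], {}).update(row)
    return merged
```

Keeping the mappings in a plain dictionary means a changed source field can be handled by editing data rather than code, which is one way to cope with retail sources that change often.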

Power Retail Profitable Growth with Analytics Hub 

SoundCommerce adopted BigQuery and the recent release of Analytics Hub – a data exchange that enables BigQuery users to efficiently and securely share data. This feature ensures a more scalable direct access experience for SoundCommerce’s current and future customers. It meets brands where they are in their data maturity by giving them the keys to own their data and control their analytics-driven business outcomes. With this feature, retailers can customize their analysis with additional data they own and manage. 

“Retail Brands need flexible data models to make key business decisions in real-time to optimize contribution profit and shopper lifetime value,” said SoundCommerce CEO Eric Best. “GCP and BigQuery with Analytics Hub make it easy for SoundCommerce to land and maintain complex data sets, so brands and retailers can drive profitable shopper experiences with every decision.”

SoundCommerce uses Analytics Hub to increase the pace of innovation by sharing datasets with its customers in real time by using the streaming functionality of BigQuery. Customers subscribe to specific datasets through a data exchange as data is generated from external data sources and published into BigQuery. This leads to a natural flow of data that scales easily to hundreds of exchanges and thousands of listings. From the customer’s viewpoint, Analytics Hub enables them to search listings and combine data from other software vendors to produce richer insights. These benefits come on top of existing BigQuery features such as separation of compute and storage, a petabyte-scale serverless data warehouse, and tight integration with several Google Cloud products.

The below diagram shows a view of SoundCommerce sharing datasets with one of its customers:

A SoundCommerce GCP project that hosts the BigQuery instance contains one or more Source Datasets that are composed into a Shared Dataset for a specific customer. The dataset is wrapped around Materialized views but can include other BigQuery objects such as Tables, Views, Authorized views and datasets, BigQuery ML models, external Tables, etc.

The same SoundCommerce GCP project contains the data exchange that acts as a container for the shared datasets. The exchange is made private to securely share the curated dataset relevant to the customer. The shared dataset is published into the data exchange as a private listing. The listing inherits the security permissions that are configured on the exchange.

SoundCommerce shares a direct link to the exchange with the customer, who can add it as a linked dataset in a project in their own Google Cloud organization. The shareable link can point to a private listing. From here on, the dataset is visible in the customer project like any other dataset and is immediately available to accept queries and return results. Alternatively, the customer can view the listing in their own project under Analytics Hub and subscribe to it by adding it as a linked dataset.

SoundCommerce is incrementally onboarding customers to use Analytics Hub for all data sharing use cases. This enables brands to get business insights faster and gain an understanding of their profitable growth quickly. Plus, it gives them ownership of their data so they can manage it how they see fit for their business. From a technical standpoint, adopting Analytics Hub means using a data sharing capability that is built into BigQuery, which has led to faster scaling and less operational overhead when onboarding customers.

The Built with BigQuery advantage for ISVs 

Through Built with BigQuery, launched in April as part of the Google Data Cloud Summit, Google is helping tech companies like SoundCommerce build innovative applications on Google’s data cloud with simplified access to technology, helpful and dedicated engineering support, and joint go-to-market programs. Participating companies can:

Get started fast with a Google-funded, pre-configured sandbox. 

Accelerate product design and architecture through access to designated experts from the ISV Center of Excellence who can provide insight into key use cases, architectural patterns, and best practices. 

Amplify success with joint marketing programs to drive awareness, generate demand, and increase adoption.

BigQuery gives ISVs the advantage of a powerful, highly scalable data warehouse that’s integrated with Google Cloud’s open, secure, sustainable platform. And with a huge partner ecosystem and support for multi-cloud, open source tools, and APIs, Google provides technology companies the portability and extensibility they need to avoid data lock-in. 

Click here to learn more about Built with BigQuery.

We thank the many Google Cloud team members including Yemi Falokun in Partner Engineering who contributed to close collaboration with the SoundCommerce team.

Source: Data Analytics

How Atlas AI and Google Cloud are promoting a more inclusive economy

What if governments could quantify where improved access to safe drinking water would have the greatest benefit on communities?  Or if telecom and energy companies could identify ways to  expand infrastructure access and improve regional wealth disparities?  

Satellite imagery and data from other “earth observation” sensors,  combined with the analytical power of artificial intelligence (AI) can unlock a detailed and dynamic understanding of development conditions in communities. This helps connect global investment and policy support to the places that need it most. Atlas AI, a public benefit startup, has developed an AI-enabled predictive analytics solution powered by Google Cloud that translates satellite imagery and other context-specific datasets into the basis for more inclusive investment in traditionally underserved communities around the world.  

The company’s proprietary Atlas of Human Settlements, an interlinking collection of spatial datasets, makes Atlas AI’s platform possible. These datasets measure the changing footprint of communities around the world, from a village in rural Africa to a large European city, alongside a wide range of socioeconomic characteristics. These include the average spending capacity of local households, the extent of electricity access, and the productivity of key crops supplying local diets. Atlas AI, a Google Cloud Ready – Sustainability partner, provides access to software and streamed analytics that enable business, government and social sector customers to address urgent real-world operating, investment and policy challenges.

Identifying unmet and emerging demand for consumer-facing businesses

For businesses serving end consumers such as in the telecom, financial services and consumer goods sectors, Atlas AI’s Aperture™ platform can help to identify revenue expansion opportunities with existing customers, forecast future infrastructure demand and promote more inclusive market expansion strategies.  The company’s technology analyzes recent market trends based on socioeconomic forecasts from the Atlas of Human Settlements such as demographic, asset wealth, and consumption growth, alongside an organization’s internal operations and customer data to identify strategies for business expansion, customer growth and sustainable investment.

Cassava Technologies, Africa’s first integrated tech company of continental scale, with a vision of a digitally connected continent that leaves no African behind, became an early adopter of the Aperture platform. Through this initiative, Cassava, aided by the hyperlocal resolution of Atlas AI’s market data and predictive analytics, has identified pockets of high internet usage in the Democratic Republic of Congo. The success of this initiative in just one country reiterates the real-world use of AI and how businesses on the continent can make informed decisions based on available intelligence.

Commenting on this achievement, Hardy Pemhiwa, President and CEO of Cassava Technologies, said, “Cassava’s mission is to use technology as an enabler of social mobility and economic prosperity, transforming the lives of individuals and businesses across the African continent. Deploying billions of dollars into such a rapidly evolving market requires a level of predictive market and business intelligence that has historically been unavailable before Atlas AI.  We see great potential for this technology in promoting expanded global investment into Africa’s development.”

Mapping vulnerable communities to inform climate action

Another promising application of Atlas AI’s technology involves the assessment of how climate change is affecting vulnerable populations.

For example, the Aquaya Institute, a nonprofit focused on universal access to safe water, used Atlas AI’s Aperture™ platform, and in particular datasets on human and economic activity in Ghana, to identify the best areas to expand water piping based on low existing coverage of piped water networks and the ability of customers in those regions to pay for those services.

Aquaya concluded that the methodology used in its project can be used for a wide range of water and sanitation projects.  Importantly, the authors note that “similar approaches could be extended to other development topics, including health care, climate change, conservation or emergency response.”

To study these impacts, researchers first need data on the sociodemographic characteristics of people living in affected areas, which often lack recent granular statistical data. Stanford professors and Atlas AI Co-Founders David Lobell, Marshall Burke and Stefano Ermon originally established via their academic research the efficacy of using satellite imagery and deep learning AI methods to identify impoverished zones in Africa. The team was able to correlate daytime imagery and brightly lit images of the Earth at night with areas of high economic activity. The researchers then used the visual representation to predict village-level development indicators such as household wealth.

Atlas AI arose from that body of research at Stanford thanks to partnership with The Rockefeller Foundation, with an aim of building the Atlas of Human Settlements to global scale and making that data informative and useful for real-world decision making such as guiding sustainable infrastructure development.  One constant from the Stanford research to Atlas AI’s current product development has been Google Earth Engine, which continues to power the company’s ingest and processing of petabytes of satellite imagery.  

Atlas AI and Google Cloud: Promoting sustainable investment globally 

New planetary scale data sets such as satellite imagery and deep learning AI technologies offer unprecedented capacity to generate economic estimates, guide investment and improve operating efficiency of organizations in countries where data is often scarce or out of date. These technologies can be used to improve targeting of social programs, to route infrastructure to disconnected communities and to gain a better understanding of how our activities are affecting the planet.  

Google Cloud has been working with partners such as Atlas AI to help our mutual customers around the world meet their environmental, social, and governance (ESG) related commitments. The Google Cloud Ready – Sustainability program is a recently announced validation program for partners with the goal of accelerating the development of solutions that can drive ESG transformations.

Atlas AI is among the first Google Cloud partners to receive the Google Cloud Ready – Sustainability designation. Collaborative efforts among solutions providers, such as Atlas AI and other partners, will continue to create  innovative, intelligent technologies that can improve the outlook for millions of people around the world. 

Learn more about Google Cloud Ready – Sustainability.

Source: Data Analytics

Ways Big Data Creates a Better Customer Experience In Fintech

Big data has led to many important breakthroughs in the Fintech sector. The industry is growing at a remarkable rate due to this new technology.

A positive customer experience is among the most valuable assets for the longevity of any business. It helps build brand reputation, enhances a company’s visibility, and encourages customer loyalty, which translates to increased revenues.

Statistics show that 93% of customers will offer repeat business when they encounter a positive customer experience. For these reasons, fintech companies actively seek opportunities to nurture better customer experiences.

Global companies are projected to spend $19.8 billion on financial analytics by 2030. The fintech sector will be among the biggest proponents.

And Big Data is one such excellent opportunity!

Big Data is the collection and processing of huge volumes of different data types, which financial institutions use to gain insights into their business processes and make key company decisions.

This article focuses on big data in the financial industry, its role, and how it helps fintech companies protect their customers and improve the customer experience.

The Role Of Big Data In Fintech

We have witnessed huge advancements in the financial industry’s service provision, thanks to big data.

Big data in fintech plays a vital role, providing crucial content that impacts service delivery. Through big data insights, financial institutions can offer personalized services as well as predict consumer behavior. They can also anticipate industry trends, assess risks, and make strategic steps to elevate the customer experience.  

How Big Data Helps Fintech Companies And Startups To Better Serve And Protect Their Customers

Fintech analytics helps businesses in the financial and banking industry offer satisfactory services by:

Enhancing View Of Customer Profiling

Big Data provides data that fintech companies can leverage to build customer profiles. Through segmentation, these institutions can easily understand customer wants, needs, and expectations. They can also use this information to analyze consumer behavior and create tailored services.     

Improving Risk Assessment

Data analytics in fintech provides crucial information financial institutions need to build a robust risk assessment strategy. This allows businesses to identify potential risks fast and either avoid them or immediately find appropriate mitigation strategies.

Improving Security

Fraud is a cause for concern in the banking industry, especially now that mobile banking takes center stage. However, fintech businesses can use big data and machine learning to build fraud detection systems that uncover anomalies in real time. These systems detect illicit activities such as suspicious transactions, logins, and bot activity.
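As a rough illustration of what flagging an anomaly can mean, the sketch below uses a simple z-score rule in place of a trained machine learning model. It is a deliberately simplified stand-in, not a production fraud detector; the function name and the default threshold of 3 standard deviations are assumptions made for this example.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], amount: float, threshold: float = 3.0) -> bool:
    """Flag `amount` if it lies more than `threshold` sample standard
    deviations away from the mean of the customer's past amounts."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return amount != mu  # every past amount was identical
    return abs(amount - mu) / sigma > threshold
```

For a customer whose past transactions cluster around 20 to 25, a transaction of 500 would be flagged while one of 23 would not. Real systems combine many more signals, such as device, location, and login patterns.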

Forecasting Future Market Trends

Start-ups and established fintech companies can use big data to understand the changing financial industry. With access to previous data, these companies can monitor purchasing behavior and predict future trends. As a result, they can make crucial decisions that elevate customer experience, based on these facts. 

Personalizing Assistance With Chatbots

Businesses in the Fintech industry can harness the power of big data to personalize chatbot customer service. AI-powered chatbots will access raw data, allowing them to answer customer questions accurately and straight to the point.  

Ensuring Friction-less Multi-channel Experience

Changing consumer preferences and the need to capture market share drove financial institutions to embrace multi-channel service delivery. To ensure their customers have a satisfactory experience, financial businesses will use big data analytics to tweak their services across various platforms to suit a customer’s needs. They will also use historical and real-time data to identify possible customer challenges.    

How Can Big Data In Fintech Influence The Customer Experience?

Data science in fintech has influenced customer experience in more ways than one. Thanks to it, the financial industry can now:

Analyze customer behavior to propose new products

Customer likes and dislikes shift depending on need. Historical financial big data helps businesses scrutinize evolving customer behaviors, allowing them to come up with invaluable products and services that streamline banking processes.   

An excellent example is how the Oversea-Chinese Banking Corporation (OCBC) designed a successful event-based marketing strategy based on the high amounts of historical customer data they collected.

Better UI/UX based on A/B testing

Thanks to big data, Fintech businesses can access real-time data that shows how users interact with their products, the average time spent on the portal/system/app, and the most-used features.

With such information, these businesses can assess two product versions to see which offers a superior UI/UX design. Additionally, they understand in-depth the differences between the products and how they affect the customer experience.
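One common way to compare two product versions is a two-proportion z-test on their conversion rates. The sketch below assumes each version records how many users converted out of how many saw it; the function name and arguments are hypothetical, and the test assumes samples large enough for the normal approximation to hold.

```python
from math import erf, sqrt


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in conversion
    rate between version B and version A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("conversion rates of exactly 0% or 100% cannot be tested")
    z = (p_b - p_a) / se
    # Standard normal CDF evaluated at |z|, via the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - cdf)
```

A positive `z` with a small p-value suggests version B converts better than version A by more than chance alone would explain. The significance cut-off is a choice the team has to make in advance.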

Analyze customer satisfaction survey results

Big data evaluates customer satisfaction rates from survey results. For instance, it helps financial institutions identify the rate of and reasons for customer churn, helping them devise newer ways to keep their audience interested in their services. Also, it has been used in the management of product and feature requests, as well as in analyzing customer support ticket trends.


Credit scoring

Financial companies can provide accurate credit ratings based on the number of missed or delayed payments, how much money a customer owes, and how promptly they make payments.

Fraud detection

Big data for financial services in conjunction with digital technologies such as machine learning has proved fruitful in the detection of suspicious activities. They prevent various types of sophisticated fraud and elaborate hacking attempts.

Deutsche Bank is one such financial institution that is taking advantage of big data analytics to identify techniques used in money laundering, secure the know-your-customer processes, and prevent credit card theft.

Measure the ROI from delivering a great customer experience

With insights from big data, fintech companies can measure the success of their efforts geared toward providing a positive customer experience. By measuring ROI, they can identify where to improve and what to focus on.   
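ROI here is the usual ratio of net gain to cost. A minimal sketch, assuming each customer experience initiative has an estimated gain and a known cost (the initiative names and figures in the usage note are placeholders, not data from the article):

```python
def roi(gain: float, cost: float) -> float:
    """Return on investment as a fraction: (gain - cost) / cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (gain - cost) / cost


def rank_by_roi(initiatives: dict[str, tuple[float, float]]) -> list[tuple[str, float]]:
    """Sort initiatives (name -> (gain, cost)) from highest to lowest ROI."""
    scored = [(name, roi(gain, cost)) for name, (gain, cost) in initiatives.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Calling `rank_by_roi({"chatbot": (150, 100), "app redesign": (300, 250)})` puts the chatbot first, since it returns 50% on its cost while the redesign returns 20%, even though the redesign has the larger absolute gain.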

The Fintech Sector is Exploding Due to Big Data

Big data is, without a doubt, a tech advancement revolutionizing the Fintech industry. It allows access to large data volumes that can be used to improve a customer’s user experience in retail banking, online trading, and other financial processes. However, to take full advantage of big data’s powerful capabilities, the importance of choosing the right BI and ETL solutions cannot be overemphasized.

ETL and Business Intelligence solutions make dealing with large volumes of data easy. They support system integrations, helping create reliable data pipelines that deliver actionable insights. Additionally, they help fintech companies predict market trends, driving profitability.

The post Ways Big Data Creates a Better Customer Experience In Fintech appeared first on SmartData Collective.

Source: SmartData Collective

ML is a Vital Defense Against Thwart Digital Attack Surfaces

Machine learning technology has become invaluable in many facets of the IT sector. A study by Markets and Markets shows that the market for machine learning technology is growing over 44% a year.

One of the biggest factors driving the demand for machine learning technology is a growing need for cybersecurity solutions. Cyberattacks are becoming more common each year. Fortunately, machine learning advances have made it easier to stop them in their tracks.

One of the biggest applications of machine learning in cybersecurity is defending digital attack surfaces. In order to appreciate the benefits of machine learning in this application, it is important to understand the nature of these cyberattacks and the best ways to prevent them.

How Can Machine Learning Technology Stop a Digital Attack Surface?

With organizations expanding their digital footprint to reach more customers on more devices across more countries, their exposure (attack surface) to both internal and external threat actors increases. To make matters worse, a number of cybercriminals are using AI technology to conduct more devastating cyberattacks than ever before.

The good news is that cybersecurity professionals can use machine learning as well. There are a growing number of ways in which they can fortify their defenses with machine learning. This includes using machine learning to stop digital attack surfaces.

But what are digital attack surfaces and what can machine learning really do to stop them?

Overview of Digital Attack Surfaces

It might seem like an increasing attack surface is simply a recipe for disaster where security breaches are inevitable. Luckily, this is not the case. Many organizations join hands with attack surface mapping and monitoring specialists to quantify their risk and introduce remedial steps to protect against breaches.

The term digital attack surface refers to the sum of all the possible attack vectors that your organization has exposed to threat actors and that could be used to launch a malicious attack against it. Simply put, what technologies can threat actors use to gain access to your organization?

At first glance, it might seem easy to simply list all networked nodes. On closer inspection, though, you will find many possible vectors that you did not previously consider vulnerabilities.

The most common kind of attack surface vector consists of the nodes that we know of. This includes all of the organization’s managed technologies, from workstations and servers to outward-facing websites and web services hosting public APIs.

The second kind of attack surface vector consists of technologies that have fallen outside the organization’s direct reach of influence, whether because risks were introduced without the knowledge of the IT team (shadow IT, for example) or because online resources have been forgotten about.

Thirdly, as if the areas mentioned above were not enough, organizations still need to deal with threat actors who create resources of their own, from malware and social engineering to resources specifically created to masquerade as your organization in order to harvest credentials and other sensitive information.

How Can Machine Learning Stop Attack Vectors?

There are a lot of benefits of using machine learning technology to stop cyberattacks. Some of them are listed below:

Machine learning helps cybersecurity professionals automate certain tasks that would otherwise be very repetitive. This frees their time to focus on more essential threat analysis tasks.

Machine learning technology can be trained to recognize threats that would otherwise be difficult to detect. For example, it can perform risk scoring analyses on emails that might be used for phishing.

Machine learning helps identify weak points in the cybersecurity infrastructure, such as outdated firewalls. It can ping the cybersecurity team to make appropriate modifications.

As a result, machine learning is invaluable in stopping attack vectors of all types.

Five common attack vectors that machine learning must be taught to fight

There are a number of different attack vectors that cybercriminals use. Machine learning technology must be trained to address them. The biggest are listed below.

User and cloud credentials

Account restrictions and password policies are among the most neglected security mechanisms, and this neglect poses a great risk to organizations globally. Users get into the habit of reusing their organizational credentials on their social media profiles, so those credentials are unintentionally exposed when a data leak occurs. The other dimension is administrators who do not apply the principle of least privilege. The combination of these vectors can result in devastating data breaches.

Third-party APIs and web applications

APIs are an attractive target for hackers because they can give attackers access to otherwise secure systems and let them exploit weaknesses. APIs are frequently susceptible to the same vulnerabilities as web applications, such as broken access controls, injections, and security misconfigurations, in part because of the automated nature of their users. Newer machine learning-driven cybersecurity tools are trained to recognize these threats.

Email Security

Email security is too often overlooked. You might be more appreciative of the need to train your machine learning tools to stop phishing attacks if you realize that one out of every 99 emails is a phishing attempt.

Security policy frameworks and similar email authentication measures need to be in place to protect against email spoofing by threat actors. The second major risk introduced by email is malware. Servers that are not configured to scan for and eliminate high-risk attachments open the door for external threat actors to gain access through social engineering and malicious attachments.

Shadow IT

The use of computer systems, hardware, applications, and resources without express IT department authority is known as shadow IT. With the popularity of cloud-based apps and services in recent years, it has risen at an exponential rate. While shadow IT can potentially boost employee productivity and promote innovation, it can also pose major security concerns to your organization by leaking data and potentially violating regulatory compliance standards. You need to make sure that machine learning tools are trained to recognize the weak points in your shadow IT system.

Unmanaged tech assets

As cloud technologies advance, organizations may still have connections to legacy systems and vice versa. These could be previously approved connections from enterprise applications to decommissioned third-party suppliers. They could also be internal linkages to firm IP addresses or expired storage domains. These unmanaged assets are almost always running outdated software with known vulnerabilities that have never been fixed, making them easy for skilled threat actors to exploit.

Machine Learning is Crucial for Stopping Digital Surface Attacks

To take back control of your digital attack surface, you must acquire holistic visibility of it. Machine learning technology makes this task much easier and allows you to efficiently identify and manage the risks that your attack vectors pose. Cybersecurity visibility can be rapidly attained by partnering with an industry security specialist who can provide real-time monitoring tools to remediate risks before breaches occur.

The post ML is a Vital Defense Against Thwart Digital Attack Surfaces appeared first on SmartData Collective.

Announcing Pub/Sub metrics dashboards for improved observability

Pub/Sub offers a rich set of metrics for resource and usage monitoring. Previously, these metrics were like buried treasure: they were useful to understand Pub/Sub usage, but you had to dig around to find them. Today, we are announcing out-of-the-box Pub/Sub metrics dashboards that are accessible with one click from the Topics and Subscriptions pages in the Google Cloud Console. These dashboards provide more observability in context and help you build better solutions with Pub/Sub.

Check out our new one-click monitoring dashboards

The Overview section of the monitoring dashboard for all the topics in your project.

We added metrics dashboards to monitor the health of all your topics and subscriptions in one place, including dashboards for individual topics and subscriptions. Follow these steps to access the new monitoring dashboards:

To view the monitoring dashboard for all the topics in your project, open the Pub/Sub Topics page and click the Metrics tab. This dashboard has two sections: Overview and Quota.

To view the monitoring dashboard for a single topic, in the Pub/Sub Topics page, click any topic to display the topic detail page, and then click the Metrics tab. This dashboard has up to three sections: Overview, Subscriptions, and Retention (if topic retention is enabled).

To view the monitoring dashboard for all the subscriptions in your project, open the Pub/Sub Subscriptions page and click the Metrics tab. This dashboard has two sections: Overview and Quota.

To view the monitoring dashboard for a single subscription, in the Pub/Sub Subscriptions page, click any subscription to display the subscription detail page, and then click the Metrics tab. This dashboard has up to four sections: Overview, Health, Retention (if acknowledged message retention is enabled), and either Pull or Push depending on the delivery type of your subscription.

A few highlights

When exploring the new Pub/Sub metrics dashboard, here are a few examples of things you can do. Please note that these dashboards are a work in progress, and we hope to update them based on your feedback. To learn about recent changes, please refer to the Pub/Sub monitoring documentation.

The Overview section of the monitoring dashboard for a single topic.

As you can see, the metrics available in the monitoring dashboard for a single topic are closely related to one another. Roughly speaking, you can obtain Publish throughput in bytes by multiplying Published message count and Average message size. Because a publish request is made up of a batch of messages, dividing Published messages by Publish request count gets you Average number of messages per batch. Expect a higher number of published messages than publish requests if your publisher client has batching enabled. 
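The relationships above are simple arithmetic, and a small Python sketch can make them concrete. The function names and sample numbers below are hypothetical and chosen only for illustration; they are not part of any Pub/Sub client library.

```python
def publish_throughput_bytes(published_message_count: float,
                             average_message_size: float) -> float:
    """Approximate publish throughput in bytes for one time window."""
    return published_message_count * average_message_size


def average_messages_per_batch(published_message_count: float,
                               publish_request_count: float) -> float:
    """Average number of messages carried by each publish request."""
    if publish_request_count == 0:
        return 0.0  # no requests in the window, so no batches to average
    return published_message_count / publish_request_count


# Hypothetical readings for a single window (illustrative values only).
messages = 12_000
requests = 400
avg_size_bytes = 512

throughput = publish_throughput_bytes(messages, avg_size_bytes)
batch_size = average_messages_per_batch(messages, requests)
print(f"throughput={throughput} bytes, messages per batch={batch_size}")
```

With batching enabled, the second value should be greater than one, which matches the expectation that published messages outnumber publish requests.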

Some interesting questions you can answer by looking at the monitoring dashboard for a single topic are: 

Did my message sizes change over time?

Is there a spike in publish requests?

Is my publish throughput in line with my expectations?

Is my batch size appropriate for the latency I want to achieve, given that larger batch sizes increase publish latency?

The Overview section of the monitoring dashboard for a single subscription.

You can find a few powerful composite metrics in the monitoring dashboard for a single subscription. These metrics are Delivery metrics, Publish to ack delta, and Pull to ack delta. All three aim to give you a sense of how well your subscribers are keeping up with incoming messages. Delivery metrics display your publish, pull, and acknowledge (ack) rate next to each other. Pull to ack delta and Publish to ack delta offer you the opportunity to drill down to any specific bottlenecks. For example, if your subscribers are pulling messages a lot faster than they are acknowledging them, expect the values reported in Pull to ack delta to be mostly above zero. In this scenario, also expect both your Unacked messages by region and your Backlog bytes to grow. To remedy this situation, you can increase your message processing power or set up subscriber flow control.
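To build intuition for Pull to ack delta, here is a minimal Python sketch. It assumes the delta for each interval is simply the pull count minus the ack count, which is consistent with the description above but is our assumption rather than the dashboard's exact query. The function names and sample values are hypothetical.

```python
from itertools import accumulate
from typing import List, Sequence


def pull_to_ack_delta(pulled: Sequence[int], acked: Sequence[int]) -> List[int]:
    """Per-interval difference between pulled and acknowledged messages."""
    if len(pulled) != len(acked):
        raise ValueError("pulled and acked must cover the same intervals")
    return [p - a for p, a in zip(pulled, acked)]


def outstanding_messages(deltas: Sequence[int]) -> List[int]:
    """Running total of messages pulled but not yet acknowledged."""
    return list(accumulate(deltas))


# Hypothetical per-minute counts (illustrative values only).
pulled = [100, 120, 110, 130]
acked = [90, 100, 115, 100]

deltas = pull_to_ack_delta(pulled, acked)
outstanding = outstanding_messages(deltas)
print(deltas, outstanding)
```

When the deltas are mostly positive, the running total keeps climbing, which is the same pattern you would expect to see in the unacked message and backlog charts.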

The Health section of the monitoring dashboard for a single subscription.

Another powerful composite metric available in the monitoring dashboard for a single subscription is the Delivery latency health score in the Health section. You may treat this metric as a one-stop shop to examine the health of your subscription. This metric tracks a total of five properties; each can take a value of zero or one. If your subscribers are not keeping up, zero scores for “ack_latency” and/or “expired_ack_deadlines” effectively tell you that those properties are the reason why. We prescribe how to fix these failing scores in our documentation. If your subscription is run by a managed service like Dataflow, do not be alarmed by a “utilization” score of zero. With Dataflow, the number of streams open to receive messages is optimized, so the recommendation to open more streams does not apply. 
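The way you read the health score can be sketched in a few lines of Python. The sketch below uses only the three property names mentioned above; the remaining properties are omitted because they are not named here. The function and its `managed_by_dataflow` flag are hypothetical helpers, not part of any Google Cloud API.

```python
from typing import Dict, List


def failing_properties(scores: Dict[str, int],
                       managed_by_dataflow: bool = False) -> List[str]:
    """Return the properties that scored zero and deserve a closer look."""
    failing = [name for name, score in scores.items() if score == 0]
    if managed_by_dataflow:
        # A zero "utilization" score is expected for Dataflow-run subscriptions.
        failing = [name for name in failing if name != "utilization"]
    return failing


# Hypothetical scores read off the dashboard (illustrative values only).
scores = {"ack_latency": 0, "expired_ack_deadlines": 1, "utilization": 0}

to_investigate = failing_properties(scores, managed_by_dataflow=True)
print(to_investigate)
```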

Some questions you can answer by looking at your monitoring dashboard for a single subscription are: 

What is the 99th percentile of my ack latencies? Is the majority of my messages getting acknowledged in under a second, allowing my application to run in near real-time? 

How well are my subscribers keeping up with my publishers? 

Which region has a growing backlog? 

How frequently are my subscribers allowing a message’s ack deadline to expire?

Customize your monitoring experience

Hopefully the existing dashboards are enough to diagnose a problem. But maybe you need something slightly different. If that’s the case, from the dropdown menu, click Save as Custom Dashboard to save an entire dashboard in your list of monitoring dashboards, or click Add to Custom Dashboard in a specific chart to save the chart to a custom dashboard. Then, in the custom dashboard, you can edit any chart configuration or MQL query. 

For example, by default, the Top 5 subscriptions by ack message count chart in the Subscriptions section of the monitoring dashboard for a single topic shows the top five attached subscriptions with the highest rate of acked messages. You can modify the dashboard to show the top ten subscriptions. To make the change, export the chart, click on the chart, and edit the line of MQL “| top 5, .max()” to “| top 10, .max()”. To know more about editing in MQL, see Using the Query Editor | Cloud Monitoring.

For a slightly more complex example, you can build a chart that compares current data to past data. For example, consider the Byte Cost chart in the Overview section of the monitoring dashboard for all topics. You can view the chart in Metrics Explorer. In the MQL tab, add the following lines at the end of the provided code snippet:

| {
    add [when: "now"]
  ;
    add [when: "then"] | time_shift 1d
  }
| union

The preceding lines turn the original chart to a comparison chart that compares data at the same time on the previous day. For example, if your Pub/Sub topic consists of application events like requests for cab rides, data from the previous day can be a nice baseline for current data and can help you set the right expectations for your business or application for the current day. If you’d prefer, update the chart type to a Line chart for easier comparison. 
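If it helps to see the idea outside of MQL, the following Python sketch pairs each data point with the value recorded exactly one day earlier. It is only an illustration of the comparison, not a description of how Cloud Monitoring evaluates the query; the function name and sample data are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


def compare_with_previous_day(
    series: Dict[datetime, float],
) -> List[Tuple[datetime, float, float]]:
    """Return (timestamp, value now, value 24 hours earlier) for matching points."""
    shift = timedelta(days=1)
    rows = []
    for ts in sorted(series):
        earlier = ts - shift
        if earlier in series:  # skip points with no baseline a day earlier
            rows.append((ts, series[ts], series[earlier]))
    return rows


# Hypothetical hourly byte-cost samples (illustrative values only).
series = {
    datetime(2022, 3, 1, 9): 100.0,
    datetime(2022, 3, 2, 9): 130.0,
    datetime(2022, 3, 2, 10): 90.0,
}

comparison = compare_with_previous_day(series)
print(comparison)
```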

Set alerts

Quota limits can creep up on you when you least expect it. To prevent this, you can set up alerts that will notify you once you hit certain thresholds. The Pub/Sub dashboards have a built-in function to help you set up these alerts. First, access the Quota section in one of the monitoring dashboards for topics or subscriptions. Then, click Create Alert inside the Set Quota Alert card at the top of the dashboard. This will take you to the alert creation form with an MQL query that triggers for any quota metric exceeding 80% capacity (the threshold can be modified).
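The 80% condition can be pictured as a simple ratio check. The Python sketch below is a stand-in for the generated MQL query, not a copy of it; the function name, the default threshold argument, and the sample numbers are hypothetical.

```python
def exceeds_quota_threshold(usage: float, limit: float,
                            threshold: float = 0.8) -> bool:
    """True when usage is above the given fraction of the quota limit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return usage / limit > threshold


# Hypothetical quota readings (illustrative values only).
high_usage = exceeds_quota_threshold(usage=850, limit=1000)
normal_usage = exceeds_quota_threshold(usage=700, limit=1000)
print(high_usage, normal_usage)
```

Lowering the `threshold` argument mirrors editing the threshold in the alert creation form if you want an earlier warning.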

The Quota section of the monitoring dashboard for all the topics in your project.

In fact, all the provided charts support setting alerting policies. You can set up your alerting policies by first exporting a chart to a custom dashboard and then selecting Convert to alert chart from the dropdown menu.

Convert a provided chart to an alert chart.

For example, you might want to trigger an alert if the pull to ack delta is positive more than 90% of the time during a 12-hour period. This would indicate that your subscription is frequently pulling messages faster than it is acknowledging them. First, export the Pull to Ack Delta chart to a custom dashboard, convert it to an alert chart, and add the following line of code at the end of the provided MQL query:

| condition gt(val(), 0)

Then, click Configure trigger. Set the Alert trigger to Percent of time series violates, the Minimum percent of time series in violation to 90%, and Trigger when condition is met for this amount of time to 12 hours. If the alert is created successfully, you should see a new chart with a red horizontal line representing the threshold with a text bubble that tells you if there have been any open incidents violating the condition. 
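The intent of this alert, a delta that is positive more than 90% of the time, can be sketched in Python as follows. This only illustrates the stated goal and is not how Cloud Monitoring evaluates the policy; the function names and sample window are hypothetical.

```python
from typing import Sequence


def fraction_positive(deltas: Sequence[float]) -> float:
    """Share of samples in the window where the delta is above zero."""
    if not deltas:
        return 0.0
    return sum(1 for d in deltas if d > 0) / len(deltas)


def should_alert(deltas: Sequence[float], min_fraction: float = 0.9) -> bool:
    """True when the delta is positive for more than min_fraction of samples."""
    return fraction_positive(deltas) > min_fraction


# Hypothetical samples covering the evaluation window (illustrative values only).
window = [5.0] * 19 + [-2.0]

alert = should_alert(window)
print(fraction_positive(window), alert)
```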

You can also add an alert for the Oldest unacked message metric. Pub/Sub lets you set a message retention period on your subscriptions. Aim to keep your oldest unacked messages within the configured subscription retention period, and fire an alert when messages are taking longer than expected to be processed. 

Making metrics dashboards that are easy to use and serve your needs is important for us. We welcome your feedback and suggestions for any of the provided dashboards and charts. You can reach us by clicking on the question icon on the top right corner in Cloud Console and choosing Send feedback. If you really like a chart, please let us know too! We will be delighted to hear from you.

Source: Data Analytics