Big Data Skills Must Be Utilized in a Cybersecurity Role
As far as computer and information technology occupations go, security awareness training is a key starting point for anyone interested in the bright future that this sector offers. The need for cybersecurity personnel, technicians, officers, developers, and trainers has never been greater. As the need for these professions grows, it also becomes more important for them to have a background in big data and other forms of technology.
There has never been a more relevant time to change careers and aim towards cybersecurity, especially as the sector has risen to new heights in the past few years. There are two reasons for this. First, the cybercrime environment has become more intense, which has heightened the need for a background in data analytics and AI skills. That is to say, malicious actors have become more sophisticated and have spread their tentacles everywhere. Second, the more tech we make as a society, the more a data-driven, hands-on approach is needed to ensure a good level of safety.
Brilliant Growth and Wages
The projections for the growth of the cybersecurity sector are very strong, at around 15% between now and 2030. In North America alone, the cybersecurity market size is projected to grow from $150 billion in 2020 to over $350 billion by 2028. The emphasis for the sector is on information and data security, big data storage and collection, cloud computing, and cloud security. Not only is the sector drawing thousands of applicants every day, but wages are on a steep rise as well. Cybersecurity is currently the highest-paid sector in IT. As far as median wages are concerned, the highest-paid personnel are data scientists and computer researchers (security managers), making well over $100,000 and even up to $220,000, followed by network architects, programmers, and analysts. As far as the CAGR (Compound Annual Growth Rate) is concerned, the largest growth is forecast for the cybersecurity services segment (management, consulting, and maintenance), especially for SMBs (small-to-medium businesses).
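To make the CAGR idea concrete, here is a minimal Python sketch that works out the annual growth rate implied by the North American market figures quoted above ($150 billion in 2020, over $350 billion by 2028). The function name `implied_cagr` is our own, and because the 2028 figure is stated as "over $350 billion", the result is best read as a lower bound of roughly 11% per year.

```python
def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Return the compound annual growth rate as a fraction (0.10 means 10%)."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    # CAGR = (end / start) ** (1 / years) - 1
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted in the text, in billions of dollars.
north_america_cagr = implied_cagr(150.0, 350.0, 2028 - 2020)
print(f"Implied CAGR: {north_america_cagr:.1%}")
```

The same function can be used to sanity-check any pair of market size figures, as long as both are expressed in the same units.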
The Reason For So Much Demand
According to IBM, a single data breach can cost organizations over $3 million, which is a significant increase compared to the mid-2010s. Furthermore, cyberattacks have been getting more sophisticated and much nastier over the years, with heavy breaches affecting big corporations like Facebook, T-Mobile, and Equifax, as well as several crypto exchanges (and many more). As a result, cybersecurity personnel are more in demand than ever. Cybersecurity is a form of security, so paying top dollar for peace of mind when it comes to sensitive and valuable business and personal assets makes a lot of sense. This rings especially true at a time when nation-state APT (Advanced Persistent Threat) groups, as well as cyber-physical attacks, are becoming a threat to humanity.
Cybersecurity Market Drivers
Let’s look at what is driving the growth and changes in the cybersecurity industry. The number of e-commerce platforms, which is rising steeply, is one of the drivers. Furthermore, cloud storage, blockchain, artificial intelligence, and IoT are big drivers as well. Network security itself is advancing and growing and is another driver. Countries and regions like Brazil, Israel, Germany, India, the United Kingdom, and of course North America are investing billions into cybersecurity.
As far as market share goes, the global cybersecurity market can be divided into multiple categories, such as retail, healthcare, government, manufacturing, transportation, financial services, and more. Financial services are by far going to take the largest share of the cybersecurity market, followed closely by IT and healthcare. Some of the market leaders in cybersecurity are Cisco Systems, Inc., FireEye, IBM, Palo Alto Networks, Inc., Zscaler, Inc., and Fortinet, Inc. (to name a few).
What Skills Are Required for a Career in Cybersecurity?
The cybersecurity industry needs personnel, as there is currently a shortage of specialists in this sector. Now that we’ve looked at how bright the future is for cybersecurity, it is time to understand what kind of skills are going to be required for a career in this sector. Here are some of those in-demand skills at the moment;
These are the most in-demand skill sets that are required for a highly paid cybersecurity career. On top of this, it is also good to have a developed set of soft skills such as:
Communication and empathy
Team collaboration
Good problem-solving skills
Where Do You Start?
So, you have decided to pursue a career in cybersecurity. Now, there are a few things you need to take into consideration:
College degree
Cybersecurity boot camp
Specialized certifications
Primarily, it is important to research what is required for a specific job position. Each organization will have slightly different requirements for applicants, so it is important to research these criteria well. Of course, a passion for cybersecurity and IT helps a lot with these jobs. The good news is that apart from industry-leading wages, statistically speaking most people that work in cybersecurity are satisfied with their job positions (based on a survey conducted in the United States.) Cybersecurity is only going to grow and expand more each year, so the sooner individuals interested in this line of work enter the workforce, the better it is for both sides.
As Gen Z is now beginning to make radical financial decisions for themselves, we’ve seen a rise in the number of platforms and applications that are now automating the process of insurance policies. With such platforms, powered by AI and data analysis techniques, insurance companies are slowly changing the way they function, bidding farewell to the pre-set traditional insurance schemes for people to choose from.
Most companies talk about the benefits of AI in marketing and management, but it can be essential in other aspects of the insurance industry as well. These new-age AI-backed insurance plans are making the consumers’ lives simpler and better, which has resulted in a stronger competitive advantage for insurance companies using it. Here’s how.
AI Provides Better Accessibility for Insurance Customers
What’s better than having all your essential finance-related documents and details in your hand at all times? These online platforms make it incredibly easy for you to get the required insurance quickly. They eliminate the tedious step of processing paperwork by collecting customer details automatically, so you don’t have to fill out numerous forms.
Since you get your insurance entirely online, you don’t even need to carry around physical proof for the same. The platform gives you e-proofs for everything related to your insurance policies.
Artificial Intelligence Means it Takes Less Paperwork to Get Insured
We’ve all been there – tired from standing in long queues, figuring out complicated insurance terms, filling out forms while making sure not to make any mistakes; it’s draining and tiresome. But platforms such as Salty can help you.
These AI-backed insurance platforms allow you to move the process online. You give them access to your information, which helps them suggest a personalized plan for you. All the necessary details are also stored online in the cloud, with maximum customer privacy and security to ensure your details are in the right hands. Once you’re done with the formalities, you’re insured. It’s just that simple.
With no in-hand documents or submission of the same to the company, you can enjoy the security of your insurance without the hassle of paperwork. This is especially beneficial when insuring utilities such as a house or vehicle.
Personalized Insurance Schemes
Gone are the days of traditional insurance schemes, when people were bound to sacrifice some of their requirements to attain insurance through rigid plans. Nowadays, insurance companies have integrated technologically advanced techniques into their architecture. This allows them to truly understand their customers through efficient data analysis. When you give them your basic details such as name, phone number, email address, etc., they read through your smart devices, transaction history, bank history, SMS, etc., to determine what kind of insurance plan would be ideal for your needs.
With personalization comes room for customization of these schemes! Insurance plans on such platforms allow customers to easily alter their plans based on recent changes.
For instance, when a customer gets healthcare insurance, it covers their needs and is customized for their requirements. However, a new addition to the family or the demise of an older member calls for a change in the plan. They can alter their policy to cover their family members without a lot of hassle or paperwork by simply completing an online procedure.
This is also incredibly useful when buying a costly home appliance or automobile as these utilities require insurance.
Flexibility When Choosing Plans
Designed to give the customer complete freedom, these AI-backed policies genuinely deliver what they promise. These plans aren’t solely for healthcare or life insurance; they can also be short-term, event-based, or utility-based.
For instance, if you don’t get embedded insurance for an appliance as a native feature, you can buy utility-based insurance on a low premium; you can also get short-term insurance for a small business—the possibilities are endless.
Getting insurance can sound daunting. However, the new-age AI-driven platforms which leverage data analysis to provide you with the best customer experience make the process incredibly easy to grasp, fit for anybody seeking insurance.
Email marketing ranks among the best ways to stay in touch with an audience and potentially to build one too. However, like so many digital marketing tasks, it’s something that undergoes constant evolution and development. Even with the initial tasks out of the way, such as deciding on a tone and template and testing your email servers, it requires regular work to keep people engaged.
It’s also a discipline that involves massive amounts of data. Anyone with even a passing interest in using email as a marketing tool should have a good idea of typical open rates and bounce rates, so they have something to compare campaigns to. It’s then far easier to understand where a business stands and can guide what comes next.
Utilizing this data to drive decisions and other elements of the overall marketing plan is what inevitably takes up the most time – even more so than crafting the emails themselves. However, great campaigns don’t necessarily rely on marketers who understand their data’s comprehensive ins and outs. So instead, we’re going to focus on four critical aspects of that data that can yield the best results.
Starting off with the most basic data, it’s vital to understand how your current efforts perform if you’re to stand a chance of improving them. Most email marketing tools provide this data as standard, spanning core metrics such as click and open rates, together with the number of people that unsubscribe with each email.
Naturally, knowing this information is merely part of the job. The actual skill stems from applying context and then making a decision. For example, when it comes to people unsubscribing from your list, the most common reason is almost always too many emails. It’s not difficult to apply context to that one, but someone needs to decide whether the current schedule brings in as much value as it could.
With open rates, keep an eye on how different styles and tones in the subject line have an impact. With click rates, monitor whether in-content links perform better than images and buttons or vice versa.
While basic, this insight should be a priority, and marketers should monitor performance with each email they send to build up an overall picture of subscriber behaviors. From there, they can deploy their knowledge from the moment the following email goes out.
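As a rough illustration of the core metrics described above, the Python sketch below computes open, click, and unsubscribe rates for a single campaign. The `CampaignStats` fields and the sample numbers are made up for the example, and the rates are assumed to be measured against delivered emails; some tools measure click rate against opens instead, so check how your own platform defines it.

```python
from dataclasses import dataclass


@dataclass
class CampaignStats:
    """Raw counts for one email send (hypothetical field names)."""
    delivered: int
    opens: int
    clicks: int
    unsubscribes: int


def rate(count: int, total: int) -> float:
    """Return count / total, or 0.0 when nothing was delivered."""
    return count / total if total else 0.0


def summarize(stats: CampaignStats) -> dict[str, float]:
    """Express each raw count as a fraction of delivered emails."""
    return {
        "open_rate": rate(stats.opens, stats.delivered),
        "click_rate": rate(stats.clicks, stats.delivered),
        "unsubscribe_rate": rate(stats.unsubscribes, stats.delivered),
    }


# Illustrative numbers only.
example = CampaignStats(delivered=2000, opens=500, clicks=80, unsubscribes=10)
summary = summarize(example)
for name, value in summary.items():
    print(f"{name}: {value:.1%}")
```

Running the same summary after every send, and keeping the results side by side, is one simple way to build the overall picture of subscriber behavior.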
Most of your email subscribers will already have something in common – likely at least a passing interest in your brand, products, or services. However, unless you do just one thing, there will be differences.
While an extensive list is always beneficial, most email marketers accept that not every email needs to go to every subscriber. There’ll always be generalized updates, such as the latest news from your brand. However, with a robust handle on how your audience differs, you can start to divide up marketing campaigns based on their interests.
Big data is at the core of audience targeting across all channels, and few are as specific and accessible as information related to your mailing list.
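Dividing a list by interest can be sketched in a few lines of Python. The subscriber records and interest tags below are invented for the example; in practice they would come from your email tool's export or from sign-up preferences.

```python
from collections import defaultdict


def segment_by_interest(subscribers: list[dict]) -> dict[str, list[str]]:
    """Group subscriber emails under each interest tag they carry."""
    segments: dict[str, list[str]] = defaultdict(list)
    for subscriber in subscribers:
        for interest in subscriber["interests"]:
            segments[interest].append(subscriber["email"])
    return dict(segments)


# Hypothetical subscribers; a person can belong to more than one segment.
subscribers = [
    {"email": "a@example.com", "interests": ["news", "offers"]},
    {"email": "b@example.com", "interests": ["offers"]},
    {"email": "c@example.com", "interests": ["news"]},
]
segments = segment_by_interest(subscribers)
print(segments["offers"])
```

A generalized update would still go to everyone, while a discount campaign could go only to the addresses in the "offers" segment.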
Emails form a vital part of the overall sales funnel, and the data behind them can reveal where list prospects reside on the buying journey. Social data is valuable in the funneling process, and it works exceptionally well with email data as someone on your list can already be considered at least a partially warm lead.
Your email strategy will be driven by your information on how people interact with what you send them. For example, some users will make a purchase every time you email them special offers and discounts. Others will open every email that provides a story or industry commentary but demonstrate less interest in your sales messages. This ties back into audience segmentation but differs in that it provides valuable insight into which customers are just browsing and which are ready to buy.
Wrapping up with an oft-overlooked influence of data on email marketing, we consider how likely someone is to pass along something you’ve sent them. There was previously a trend where marketers would directly encourage readers to forward emails to their friends. Although a few holdouts remain, it’s mainly fallen out of fashion, primarily due to brands preferring to nudge people towards social media.
Whether you ask, imply or let nature take its course, it’s always a great idea to keep an eye on which content resonates so much with your readers that they cannot help but tell someone else about it. This might involve a behind-the-scenes look at the business, a special offer that matches what someone they know wants to buy, or anything else. Again, the key is knowing what and who.
This form of word-of-mouth marketing is exceptionally valuable in that it can not only boost sales but also grow your list organically. Both undoubtedly reside near the top of your KPI list, so a renewed focus on passing the message on, with data to make it happen, can make all the difference.
Business Owners Lean on Big Data to Deal with Cybercrime Threats
It’s no secret that the COVID pandemic caused a lot of industries to get flipped on their head or at least make some major organizational changes in order to stay afloat during the peak of shelter-in-place orders and other legalities aimed to stop the spread of the virus. The great thing about living in a world governed by advances in big data technology is that it was possible to offer these services remotely.
For many businesses, there was a move to the remote workplace for those employees who could do most of their work on a computer. In a classic “one thing leading to another” scenario, there was a boom in ecommerce and web-based sharing of data for businesses, which also led to a spike in cybersecurity breaches. For business owners looking to keep all or some of their business in the remote office (it’s a big money saver), it’s important to understand the seriousness of cyber threats and the possibility of a data breach, especially in a post-pandemic world with a heavier reliance on web-based interactions.
Types of Attacks
There is a near-endless list of different types of hackers, but they don’t all take aim at small businesses. Some try to infiltrate home networks to steal data, and others even do what is called “hacktivism,” in a “steal from the rich and give to the poor” type of scenario that includes things like publishing information on corrupt politicians.
Here is a list of some of the types of attacks hackers use to orchestrate data breaches that you need to be the most wary of at your small business.
Phishing – These attacks are most often conducted via email and focus on pulling at the heartstrings of the receiver in order to get them to share information that can be used to infiltrate a network. When the pandemic first hit, phishing attacks increased by a whopping 600%, most via unsecured home networks being relied upon for work activities that would otherwise have been conducted at the office. Phishing emails will often disguise themselves as an organization looking for financial help, and last year, a Texas school lost more than $2 million after a hacker forged an email from the World Health Organization to steal data from the public.
Software Vendors – Part of the move to remote work was a heavier reliance on software to help with project management and communication, and these vendors became targets of many cyberattacks focused on stealing sensitive data. A silver lining to the pandemic was a heavy increase in capabilities offered by software, but no matter what a product offers your company, be sure to look up how well the vendor secures its information and whether it has been the victim of any attacks.
Cloud Storage – Companies needed to rely on cloud storage during the pandemic as well, and this led to a spike in cloud-based hacking. Technology allows hackers to scan cloud servers to find openings that don’t have passwords or have very simple-to-break ones. Ultimately, any server is vulnerable, so ensuring you protect your information within the server is important and should be something you train your team on.
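To show what looking for missing or simple-to-break passwords might involve, here is a small Python sketch that audits a list of shared storage folders. Everything in it is hypothetical: the folder names, the tiny list of common passwords, and the 12-character minimum are assumptions for illustration, and a real audit would rely on your cloud provider's own security tooling rather than a script holding plain-text passwords.

```python
from typing import Optional

# A deliberately tiny sample; real lists of common passwords are far longer.
COMMON_PASSWORDS = {"password", "123456", "admin", "letmein"}
MIN_LENGTH = 12  # assumed policy, not an official standard


def password_problem(password: str) -> Optional[str]:
    """Return a short reason if the password is missing or weak, else None."""
    if not password:
        return "no password set"
    if password.lower() in COMMON_PASSWORDS:
        return "commonly used password"
    if len(password) < MIN_LENGTH:
        return f"shorter than {MIN_LENGTH} characters"
    return None


def audit(shares: dict[str, str]) -> dict[str, str]:
    """Map each problematic share name to the reason it was flagged."""
    findings = {}
    for name, password in shares.items():
        problem = password_problem(password)
        if problem:
            findings[name] = problem
    return findings


# Made-up shares for the example.
shares = {"invoices": "", "backups": "admin", "designs": "tr4in-Y0ur-te4m-well"}
findings = audit(shares)
print(findings)
```

Even a simple check like this makes a useful talking point in team training, because it shows how quickly an open or weakly protected share can be spotted.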
Take Adequate Steps to Prevent Data Breaches as a Small Business Owner
Speaking of training, it’s your responsibility to protect your clients’ and customers’ data. Unfortunately, a data breach of any sort can be catastrophic to your bank account and your image. More than half of employees working remotely during the pandemic said they probably did some things that made company information more vulnerable than it would have been if they were in an office, so training, training, and more training should be the first three things on your cybersecurity list. Investments in security software are also generally sound.
The world we live in keeps facing unprecedented and rapid phase changes when it comes to business verticals and innovations. In such an era, data provides a competitive edge for businesses to stay at the forefront in their respective fields. Satisfying end-customer needs within the given time limits has also become a main priority. According to Forrester’s reports, the rate of insight-driven businesses is growing at an average of 30% per year.
Recognizing the potential of data, organizations are trying to extract value from their data in various ways to create new revenue streams and reduce the cost and resources required for operations. With the increased adoption of cloud and emerging technologies like the Internet of Things, data is no longer confined to the boundaries of organizations. The increased amounts and types of data, stored in various locations, have made the management of data more challenging.
Challenges in maintaining data
As organizations keep using several applications, the data collected becomes unmanageable and inaccessible in the long run. Legacy systems and infrastructures are no longer capable of handling such massive amounts of data. Shifting the data to the cloud from the existing legacy systems has its own challenges. Additionally, data sharing between different public cloud platforms or on-premise platforms can be difficult.
Companies these days have multiple on-premise as well as cloud platforms to store their data. The data contained can be both structured and unstructured and available in a variety of formats such as files, database applications, SaaS applications, etc. Processing such kinds of data requires advanced technologies, from ELT processing to real-time streaming. The daunting amounts of data make it very difficult for companies to quickly ingest, integrate, analyze, and share new data resources.
As the amount of data increases, the complexity of managing it keeps increasing too. It has been found that data professionals end up spending 75% of their time on tasks other than data analysis. Manually extracting the most out of their data is highly time- and resource-consuming for organizations.
Advantages of data fabric for data management
Data fabric is an architecture and set of data services that provide capabilities to seamlessly integrate and access data from multiple data sources like on-premise and cloud-native platforms. The data can also be processed, managed and stored within the data fabric. Using data fabric also provides advanced analytics for market forecasting, product development, sales and marketing. Moreover, it is important to note that data fabric is not a one-time solution to fix data integration and management issues. It is rather a permanent and flexible solution to manage data under a single environment. Other important advantages of data fabric are as follows:
Data fabric applications provide a unified environment that caters to all the needs of the organization to transform raw data into valuable and healthy data. It also eliminates the need for the integration of multiple applications and tools for the product, contract and support mechanisms. Data fabric helps at every stage, from discovery to integration of data gathered from various sources. Data fabric also helps with cleansing the data, analyzing its integrity, and sharing the trusted data with all the stakeholders.
Native code generation
A data fabric solution must be capable of optimizing code natively using preferred programming languages in the data pipeline to be easily integrated into cloud platforms such as Amazon Web Services, Azure, Google Cloud, etc. Also, the solution must have multiple built-in connectors and components that can function as intended for many environments and applications. This will enable the users to seamlessly work with code while developing data pipelines.
On-premise and cloud-native environment
Since a wide range of organizations store data in both on-premise and cloud environments, a data fabric solution must be developed in such a way that it is natively capable of working in both environments. These solutions must also be able to ingest and integrate data from both on-premise and cloud environments such as Oracle, SAP and AWS, Google, Snowflake, etc. The data fabric solution must also embrace and adapt itself to new emerging technologies such as Docker, Kubernetes, serverless computing, etc.
Data quality and governance
Data fabric solutions must integrate data quality into each step of the data management process right from the initial stages. Separate roles have to be set out for cleansing data and tracing the source of data to maintain data integrity and compliance.
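The idea of building quality checks and source tracing into the pipeline can be sketched in Python as follows. This is only an illustration under our own assumptions: the field names (`customer_id`, `email`), the two validation rules, and the `lineage` tag are invented for the example and do not reflect how any particular data fabric product works.

```python
def check_record(record: dict) -> list[str]:
    """Return a list of quality problems found in one record."""
    problems = []
    if not record.get("customer_id"):
        problems.append("missing customer_id")
    if "@" not in (record.get("email") or ""):
        problems.append("invalid email")
    return problems


def cleanse(records: list[dict], source: str) -> tuple[list[dict], list[tuple[dict, list[str]]]]:
    """Split records into clean and rejected, tagging each with its source."""
    clean, rejected = [], []
    for record in records:
        tagged = {**record, "lineage": source}  # remember where the row came from
        problems = check_record(record)
        if problems:
            rejected.append((tagged, problems))
        else:
            clean.append(tagged)
    return clean, rejected


# Hypothetical rows from an on-premise CRM export.
rows = [
    {"customer_id": "C1", "email": "ann@example.com"},
    {"customer_id": "", "email": "bob-at-example.com"},
]
clean, rejected = cleanse(rows, source="on_prem_crm")
print(len(clean), "clean;", len(rejected), "rejected")
```

Keeping the rejected rows together with their reasons and source, instead of silently dropping them, is what lets a separate role follow up on integrity and compliance issues.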
Best Data Fabric Tools for Enterprises – Tried and Tested
Atlan’s data fabric solution focuses primarily on four major areas: data cataloging & data discovery, data quality & profiling, data lineage & governance, and data exploration & integration. This product offers a search feature that is as sophisticated as Google, along with automatic data profiling. Atlan’s data fabric solution lets the user manage data usage across the ecosystem using governance and access controls.
K2View’s data fabric solution organizes isolated data sets from various data sources according to the digital entity. Each business entity has its own hyper-performance micro-database. The digital entity unifies all the known data related to the business entity. This data fabric solution ingests, transforms, orchestrates, and secures all the data in the micro-database. This solution can also be integrated with the source system and can be scaled up to support millions of micro-databases at the same time. This high-performance architecture can also be integrated into on-premise and cloud-native environments.
Cinchy offers a data collaboration platform that can handle enterprise applications and data integration. The product was originally developed as a secure tool to solve data access challenges and provide real-time governance and effective data delivery. Cinchy’s solution can seamlessly integrate fragmented data sets into its network architecture. The ‘autonomous data’ feature enables the platform to be self-describing, self-protecting, self-connecting and self-managing.
Data fabric ultimately enables organizations to extract the most out of the collected data and meet business demands while maintaining a competitive advantage among companies in similar fields. Data fabric also helps with data maintenance and modernizes data storage methodologies. Additionally, companies can also leverage the advantages of hybrid cloud environments with the right data fabric tools.
As one of Canada’s largest telecommunications companies, TELUS has over 9.5 million subscribers, and those subscribers generate a lot of data. It’s up to Joe Bettridge, Consulting Lead of TELUS Insights, to use that data to drive innovation. Google Cloud’s suite of services, including BigQuery and Data Studio, has proven instrumental to the company’s overall data strategy.
“Our division was founded in 2015 and really focused on trying to commercialize data in a privacy-preserving manner,” Bettridge says, “That means taking data we have as a company and taking it to market in a way that the privacy of the individual or the devices the data comes from isn’t being impacted negatively.”
TELUS manages geolocated data generated by network locations, talk and text signals, and data migrations between towers, representing over 275,000 points on a map. Last year the company analyzed over 1.2 petabytes of data, a number that is very much expected to grow as more subscribers upload more photos and texts, not to mention voice calls.
In March 2020, when the pandemic changed everything, TELUS found itself in a position to leverage its data for the good of all Canadians. With respect for privacy of paramount concern, TELUS used Google Cloud to launch a new platform in three weeks that empowered the Canadian government to make better strategic decisions in confronting Covid-19 within its borders.
Government and health care industry leaders needed to better understand how far people were traveling, how different communities interacted with one another, and how efficiently messaging about the pandemic was reaching citizens.
TELUS devised a solution that drew insights from large, aggregate data samples and employed a de-identification process, which ensured that individual subscribers’ data remained private. The platform was a success and earned the company a HPE-IAPP Privacy Innovation Award.
The award “really shows that our privacy differentiator is core to our product, and it’s really helped us gain traction with these government agencies,” says Bettridge.
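The article does not describe how the de-identification process works internally, so the Python sketch below only illustrates the general idea of drawing insights from aggregates: subscriber identifiers are dropped, movements are counted per region pair, and any group smaller than a minimum size is suppressed. The field names, the sample trips, and the threshold of five are all assumptions for the example, not the company's actual method.

```python
from collections import defaultdict

MIN_GROUP_SIZE = 5  # assumed suppression threshold


def aggregate_trips(trips: list[dict], min_group_size: int = MIN_GROUP_SIZE) -> dict[tuple[str, str], int]:
    """Count distinct subscribers per (origin, destination) pair.

    The output contains no subscriber identifiers, and pairs seen for
    fewer than min_group_size subscribers are left out entirely.
    """
    groups: dict[tuple[str, str], set[str]] = defaultdict(set)
    for trip in trips:
        groups[(trip["origin"], trip["destination"])].add(trip["subscriber_id"])
    return {
        pair: len(ids)
        for pair, ids in groups.items()
        if len(ids) >= min_group_size
    }


# Made-up data: six subscribers on one route, two on another.
trips = [
    {"subscriber_id": f"s{i}", "origin": "Vancouver", "destination": "Burnaby"}
    for i in range(6)
] + [
    {"subscriber_id": f"t{i}", "origin": "Vancouver", "destination": "Whistler"}
    for i in range(2)
]
report = aggregate_trips(trips)
print(report)
```

The small route disappears from the report altogether, which is the point: a count of two could be traced back to individuals far more easily than a count of six.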
Before migrating to Google Cloud, TELUS managed an on-prem data environment that was proving too cumbersome to manage the volume of data they were ingesting by the day.
BigQuery marked the beginning of a different approach to data: Processes that used to take a month are now taking only a day. Ultimately, this allowed Bettridge to accomplish much more with a much smaller team.
“We engaged Google and they were very good at providing specialists who would help us with architecture, help us give our teams training, get us up to speed and comfortable. A lot of what we built in the past could be ripped out of the system it was in and placed up in the cloud quite quickly and quite successfully,” says Bettridge.
For more on how Joe Bettridge is leveraging Google Cloud to protect privacy and manage geolocated data, watch the latest episode of Google Cloud’s Data Journeys below.
Data Journey: Episode 1: Telus
Welcome to Data Journeys, the series where we invite customers to share their company’s data journey, along with key lessons learned along the way. This week, Bruno is joined by Joe Bettridge of TELUS, a leading telecommunications company in Canada whose wireless division serves 9.5 million subscribers. Joe discusses TELUS’s recent innovative work: providing de-identified data to researchers and government agencies to provide insight on public health while protecting each individual’s privacy. Listen to the story of how TELUS stood up their new platform in just three weeks with the help of Google Cloud, and be sure to check back in for a new data journey next week.
If the COVID-19 pandemic has taught us anything, it is that speed and intelligence are of the essence when it comes to making business decisions. Organizations must find ways of keeping ahead of competitors and disruptions by continually leveraging data to make smart decisions. The problem? Data may be everywhere, but it’s not always available in a form that businesses can use to generate analytics in real time. As a result, enterprise users are divided when it comes to data volume, with half telling IDC there’s too much data and nearly as many saying there isn’t enough. The first step toward extracting the full value of your company’s data, then, is to synthesize and process it at scale.
In a study recently published by research firm IDC, organizations running SAP reported using BigQuery to optimize data from their SAP applications including ERP, supply chain, CRM, and others as well as other internal and external data sources. IDC interviewed seven customers for this study with an average annual revenue of $1.8B, 36,000 employees and 1.5PB of data.
BigQuery, a fully managed serverless cloud data warehouse within Google’s data cloud, helps companies integrate, aggregate, and analyze petabytes of data. SAP customers use BigQuery to generate more impactful analytics and improve business results while lowering platform and staffing costs, and they have reaped average annual benefits of $6.4 million per organization, according to IDC’s findings.
BigQuery provides major productivity and infrastructure benefits
IDC found that the benefits that BigQuery offered organizations fell into three broad categories: increased business productivity, IT staff productivity, and reduced infrastructure costs. These three factors combined to give enterprises speed, efficiency, and power that led to improved business results.
Business productivity benefits: Companies interviewed by IDC reported that BigQuery enhanced their ability to generate SAP-related business insights — including self-service options — by reducing query cycle times and delivering reports to line of business executives, managers, and end users more quickly. After adopting BigQuery, these organizations found that queries accelerated by 63% and business reports or dashboards were generated 77% faster. BigQuery provided data scientists, business intelligence teams, data/analytics engineers, and business analysts with easier access to more robust and timely data and insights derived from their SAP environments. This helped them work more effectively and deliver richer and more timely data-driven insights to line of business end users. “BigQuery for SAP has had a significant impact on our data analytics activities,” reports one respondent. “The speed of running queries is amazing. . . Our analytics teams have access to information and reports that they did not have before.”
IT staff productivity benefits: With BigQuery, organizations reported freeing up staff time to not only handle expanding data environments, but also support other data- and business-related initiatives. Since BigQuery is a fully managed solution, internal IT staff no longer had to conduct routine tasks such as data backups, allowing them to take a more proactive approach in leveraging their SAP data environments to support business operations. “Before BigQuery for SAP, we used to say that our team worked for analytics,” notes a survey respondent. “Our DevOps team was spending a lot of time doing carrot feeding, firefighting, and a lot of reactive work like patching. Now, we have been able to focus the majority of our time on proactive roadmap development and future functionality.”
IT infrastructure cost reductions: The IDC study showed that BigQuery helped respondents reduce costs by lowering their platform expenses and making management and administration of their data warehousing environments more efficient. These cost savings came thanks to BigQuery’s ability to consolidate data on a single cloud-based platform, limiting overlap, redundancies, and overprovisioning. Organizations also reduced the costs associated with maintaining and running on-premises data warehousing environments, including retiring less cost-effective legacy solutions. “The biggest area of value of BigQuery for SAP for us is that we no longer need to deal with having to maintain our own data,” explained a survey respondent. “We don’t have to worry about updates or administration, and most maintenance is now outsourced to BigQuery, including backups and business continuity.”
Speed, efficiency and power for improved results
Overall, respondents said that the enhanced ability to get value from their SAP data using BigQuery led directly to improved business results. More efficient data usage meant better strategy execution. Faster and more precise analytics gave more employees the information and tools they need to create more value for their organizations. BigQuery’s scalability and speed improved business operations and shortened time to insight. Customers gained a clearer real-time view of business operations, which helped them win new business, differentiate themselves against competitors, and reduce customer churn.
“The main benefit of BigQuery for SAP is real-time intelligence that helps us improve pricing, traffic management, routing configuration, and rating — the core of our business,” says a respondent. “This gives us the edge over competitors — when we have better intel and faster analytics, it allows us to improve revenue margins and business outcomes.” In fact, organizations increased revenue by an average $22.8 million. What could your company do with those kinds of results?
So far in this series, we’ve been focused on generic concepts and console-based workflows. However, when you’re working with huge amounts of data or surfacing information to lots of different stakeholders, leveraging BigQuery programmatically becomes essential. In today’s post, we’re going to take a tour of BigQuery’s API landscape – so you can better understand what each API does and what types of workflows you can automate with it.
The BigQuery v2 API is where you can interact with the “core” of BigQuery. This API gives you the ability to manage data warehousing resources like datasets, tables (including both external tables and views), and routines (functions and procedures). You can also leverage BigQuery’s machine learning capabilities, and create or poll jobs for querying, loading, copying or extracting data.
Programmatically getting query results
One common way to leverage this API is to programmatically get answers to business questions by running BigQuery queries and then doing something with the results. One example that quickly came to mind was automatically filling in a Google Slides template. This can be especially useful if you’re preparing slides for something like a quarterly business review, where each team may need a slide that shows their sales performance for the last quarter. Many times an analyst is forced to manually run queries and copy-paste the results into the slide deck. However, with the BigQuery API, the Google Slides API and a Google Apps Script we can automate this entire process!
If you’ve never used Google Apps Script before, you can use it to quickly build serverless functions that run inside of Google Drive. Apps Script projects already have the Google Workspace and Cloud libraries available, so you simply need to add the Slides and BigQuery services to your script.
In your script you can do something like loop through each team’s name and use it as a parameter to run a parameterized query. Finally, you can use the result to fill in the template placeholders on that team’s slide within the deck. Check out some example code here, and look out for a future post on more details on the entire process!
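The loop described above can be sketched roughly as follows. This is only an illustration, written in Python rather than Apps Script: the standard-library sqlite3 module stands in for BigQuery, a plain string stands in for the slide, and the table, column, and placeholder names are all hypothetical.

```python
import sqlite3

# Hypothetical slide text with placeholders to be replaced per team.
SLIDE_TEMPLATE = "{{TEAM}} closed {{TOTAL_SALES}} in sales last quarter."

# In-memory database standing in for a BigQuery table (assumption for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (team TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 120), ("East", 80), ("West", 150)],
)


def render_slide(conn, team):
    """Run a parameterized query for one team and fill in the template."""
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM sales WHERE team = ?",
        (team,),  # the team name is passed as a query parameter, not concatenated
    ).fetchone()
    return (
        SLIDE_TEMPLATE
        .replace("{{TEAM}}", team)
        .replace("{{TOTAL_SALES}}", str(total))
    )


# Loop through each team's name and build that team's slide text.
slides = {team: render_slide(conn, team) for team in ["East", "West"]}
for text in slides.values():
    print(text)
```

Passing the team name as a query parameter instead of building the SQL string by hand keeps the query text identical for every team and avoids quoting problems.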
Loading in new data
Aside from querying existing data available in BigQuery, you can also use the API to create and run a load job to add new data into BigQuery tables. This is a common scenario when building batch loading pipelines. One example might be if you’re transforming and bringing data into BigQuery from a transactional database each night. If you remember from our post on tables in BigQuery, you can actually run an external query against a Cloud SQL database. This means that we can simply send a query job, through BigQuery’s API, to grab new data from the Cloud SQL table. Here, we use the magics command from the google.cloud.bigquery Python library to save the results into a pandas DataFrame.
Next, we may need to transform the results. For example, we can use the Google Maps GeoCoding API to get the latitude and longitude coordinates for each customer in our data.
Finally, we can create a load job to add the data, along with the coordinates, into our existing native BigQuery table.
While the “core” of BigQuery is handled through the BigQuery v2 API, there are other APIs to manage tangential aspects of BigQuery. The Reservations API, for example, allows you to programmatically leverage workload management resources like capacity commitments, reservations and assignments as we discussed in a previous post.
Let’s imagine that we have an important dashboard loading at 8am on the first Monday of each month. You’ve decided that you want to leverage flex slots to ensure that there are enough workers to make the dashboard load super fast for your CEO. So, you decide to write a program that purchases a flex slot commitment, creates a new reservation for loading the dashboard and then assigns the project where the BI tool will run the dashboard to the new reservation. Check out the full sample code here!
Another relevant API for working with BigQuery is the Storage API. The Storage API allows you to use BigQuery like a Data Warehouse and a Data Lake. It’s real-time so that you don’t have to wait for your data, it’s fast so that you don’t need to reduce or sample your data, and it’s efficient so that you read only the data you want. It’s broken down into two components.
The Read Client exposes a data stream suitable for reading large volumes of data. It also provides features for parallelizing reads, performing partial projections, filtering data, and offering precise control over snapshot time.
The Write Client (preview) is the successor to the streaming mechanism found in the BigQuery v2 API. It supports more advanced write patterns such as exactly-once semantics. More on this soon!
The Storage API was used to build a series of Hadoop connectors so that you can run your Spark workloads directly on your data in BigQuery. You can also build your own connectors using the Storage API!
The BigQuery Connections API is used to create a connection to external storage systems, like Cloud SQL. This enables BigQuery users to issue live, federated queries against other systems. It also supports BigQuery Omni to define multi-cloud data sources and structures.
Programmatically Managing Federation Connections
Let’s imagine that you are embedding analytics for your customers. Your web application is structured such that each customer has a single-tenant Cloud SQL instance that houses their data. To perform analytics on top of this information, you may want to create connections to each Cloud SQL database. Instead of manually setting up each connection, one option could be using the Connections API to programmatically create new connections during the customer onboarding process.
The last API I’ll mention is one of my favorites—Data QnA which is currently in preview. Are business users at your organization always pinging you to query data on their behalf? Well, with the QnA API you can convert natural language text inquiries into SQL – meaning you can build a super powerful chatbot that fulfills those query requests, or even give your business users access to connected sheets so they can ask analytics questions directly in a spreadsheet. Check out this post to learn more about how customers are using this API today!
Processing streaming data to extract insights and power real-time applications is becoming more and more critical. Google Cloud Dataflow and Pub/Sub provide a highly scalable, reliable and mature streaming analytics platform to run mission-critical pipelines. One very common challenge that developers often face when designing such pipelines is how to handle duplicate data.
In this blog, I want to give an overview of common places where duplicate data may originate in your streaming pipelines and discuss various options that are available to you to handle them. You can also check out this tech talk on the same topic.
Origin of duplicates in streaming data pipelines
This section gives an overview of the places where duplicate data may originate in your streaming pipelines. Numbers in red boxes in the following diagram indicate where this may happen.
Some duplicates are automatically handled by Dataflow while for others developers may need to use some techniques to handle them. This is summarized in the following table.
1. Source generated duplicates
Your data source system may itself produce duplicate data. This can happen for several reasons, such as network failures or system errors. Such duplicates are referred to as ‘source generated duplicates’.
2. Publisher generated duplicates
Your publisher can generate duplicates when publishing messages to Pub/Sub because publishing is guaranteed at least once. Such duplicates are referred to as ‘publisher generated duplicates’.
Pub/Sub automatically assigns a unique message_id to each message successfully published to a topic. A message is considered successfully published when Pub/Sub returns an acknowledgement to the publisher. Within a topic, no two messages have the same message_id. If the publisher does not observe the success of a publish for some reason (network delays, interruptions, etc.), it may retry the same message payload. If retries happen, we may end up with duplicate messages with different message_id values in Pub/Sub. For Pub/Sub these are unique messages, as they have different message_id values.
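To make the retry scenario concrete, here is a toy simulation in plain Python. It does not use the Pub/Sub client library: FakeTopic, publish_with_lost_ack, and the event_id attribute are hypothetical names invented for this sketch, and the only behavior modeled is that every accepted publish receives a fresh message_id.

```python
import itertools


class FakeTopic:
    """Toy stand-in for a topic: each accepted publish gets a new message_id."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.messages = []

    def publish(self, data, attributes):
        message_id = str(next(self._ids))
        self.messages.append(
            {"message_id": message_id, "data": data, "attributes": attributes}
        )
        return message_id


def publish_with_lost_ack(topic, data, attributes):
    """Publish once, pretend the acknowledgement was lost, then retry."""
    topic.publish(data, attributes)         # stored, but the ack never arrives
    return topic.publish(data, attributes)  # the retry stores the payload again


def dedupe(messages, key):
    """Keep the first message seen for each key value."""
    seen, unique = set(), []
    for message in messages:
        k = key(message)
        if k not in seen:
            seen.add(k)
            unique.append(message)
    return unique


topic = FakeTopic()
publish_with_lost_ack(topic, b'{"order": 42}', {"event_id": "order-42"})

by_message_id = dedupe(topic.messages, key=lambda m: m["message_id"])
by_event_id = dedupe(topic.messages, key=lambda m: m["attributes"]["event_id"])
print(len(by_message_id), len(by_event_id))  # prints "2 1"
```

Deduplicating on message_id keeps both copies because the retry received a new identifier, while deduplicating on an identifier carried with the payload collapses them into one.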
3. Reading from Pub/Sub
Pub/Sub guarantees at-least-once delivery for every subscription. This means that a message may be delivered more than once by the same subscription if Pub/Sub doesn’t receive an acknowledgement within the acknowledgement deadline. The subscriber may acknowledge after the acknowledgement deadline, or the acknowledgement may be lost due to transient network issues. In such scenarios the same message would be redelivered and subscribers may see duplicate data. It is the responsibility of the subscribing system (for example Dataflow) to detect such duplicates and handle them accordingly.
When Dataflow receives messages from a Pub/Sub subscription, messages are acknowledged after they are successfully processed by the first fused stage. Dataflow performs an optimization called fusion, where multiple stages can be combined into a single fused stage. A break in fusion happens when there is a shuffle, which occurs if you have transforms like GROUP BY, COMBINE or I/O transforms like BigQueryIO. If a message has not been acknowledged within its acknowledgement deadline, Dataflow attempts to maintain the lease on the message by repeatedly extending the acknowledgement deadline to prevent redelivery from Pub/Sub. However, this is best effort and there is a possibility that messages may be redelivered. This can be monitored using metrics listed here.
However, because Pub/Sub provides each message with a unique message_id, Dataflow uses it to deduplicate messages by default if you use the built-in Apache Beam PubSubIO. Thus Dataflow filters out such duplicates originating from redelivery of the same message by Pub/Sub. You can read more about this topic in one of our earlier blogs under the section “Example source: Cloud Pub/Sub”.
4. Processing data in Dataflow
Due to the distributed nature of processing in Dataflow, each message may be retried multiple times on different Dataflow workers. However, Dataflow ensures that only one of those tries wins and that the processing from the other tries does not affect downstream fused stages. Dataflow guarantees exactly-once processing by checkpointing at each stage, so that such duplicates are not reprocessed in a way that affects state or output. You can read more about how this is achieved in this blog.
5. Writing to a sink
Each element can be retried multiple times by Dataflow workers and may produce duplicate writes. It is the responsibility of the sink to detect these duplicates and handle them accordingly. Depending on the sink, duplicates may be filtered out, overwritten or appear as duplicates.
File systems as sink
If you are writing files, exactly-once is guaranteed, as any retries by Dataflow workers in the event of failure will overwrite the file. Beam provides several I/O connectors to write files, all of which guarantee exactly-once processing.
BigQuery as sink
If you use the built-in Apache Beam BigQueryIO to write messages to BigQuery using streaming inserts, Dataflow provides a consistent insert_id (different from Pub/Sub message_id) for retries and this is used by BigQuery for deduplication. However, this deduplication is best effort and duplicate writes may appear. BigQuery provides other insert methods as well with different deduplication guarantees as listed below.
You can read more about BigQuery insert methods at the BigQueryIO Javadoc. Additionally, for more information on BigQuery as a sink, check out the section “Example sink: Google BigQuery” in one of our earlier blogs.
For duplicates originating from places discussed in points 3), 4) and 5) there are built-in mechanisms in place to remove such duplicates as discussed above, assuming BigQuery is a sink. In the following section we will discuss deduplication options for ‘source generated duplicates’ and ‘publisher generated duplicates’. In both cases, we have duplicate messages with different message_id, which for Pub/Sub and downstream systems like Dataflow are two unique messages.
Deduplication options for source generated duplicates and publisher generated duplicates
1. Use Pub/Sub message attributes
Each message published to a Pub/Sub topic can have string key-value pairs attached as metadata under the “attributes” field of PubsubMessage. These attributes are set when publishing to Pub/Sub. For example, if you are using the Python Pub/Sub Client Library, you can set the “attrs” parameter of the publish method when publishing messages. You can use a unique field from your message (e.g. event_id) as an attribute, with the field name as the attribute key and the field value as the attribute value.
Dataflow can be configured to use these fields to deduplicate messages instead of the default deduplication using Pub/Sub message_id. You can do this by specifying the attribute key when reading from Pub/Sub using the built-in PubSubIO.
For the Java SDK, you can specify this attribute key in the withIdAttribute method of PubsubIO.Read().
In the Python SDK, you can specify this in the id_label parameter of the ReadFromPubSub PTransform.
This deduplication using a Pub/Sub message attribute is only guaranteed to work for duplicate messages that are published to Pub/Sub within 10 minutes of each other.
2. Use Apache Beam Deduplicate PTransform
Apache Beam provides Deduplicate PTransforms, which can deduplicate incoming messages over a time duration. Deduplication can be based on the message itself or on the key of a key-value pair, where the key could be derived from the message fields. The deduplication window can be configured using the withDuration method and can be based on processing time or event time (specified using the withTimeDomain method). The default duration is 10 minutes.
This PTransform uses the Stateful API under the hood and maintains a state for each key observed. Any duplicate message with the same key that appears within the deduplication window is discarded by this PTransform.
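The idea of keeping per-key state for a limited window can be sketched in plain Python. This is not Beam's implementation and does not use the Beam API: WindowedDeduplicator and its accept method are hypothetical names, timestamps are plain seconds, and elements are assumed to arrive in non-decreasing time order.

```python
class WindowedDeduplicator:
    """Drop elements whose key was already seen within the last window_seconds."""

    def __init__(self, window_seconds=600.0):  # 600 s = 10 minutes
        self.window_seconds = window_seconds
        self._first_seen = {}  # key -> timestamp at which the key was last emitted

    def accept(self, key, timestamp):
        """Return True if the element should be emitted, False if it is a duplicate."""
        first = self._first_seen.get(key)
        if first is not None and timestamp - first < self.window_seconds:
            return False  # same key inside the window: discard
        # New key, or the window for this key has expired: emit and restart it.
        self._first_seen[key] = timestamp
        return True


dedup = WindowedDeduplicator()
events = [("order-42", 0), ("order-42", 30), ("order-7", 45), ("order-42", 700)]
emitted = [(key, ts) for key, ts in events if dedup.accept(key, ts)]
print(emitted)  # [('order-42', 0), ('order-7', 45), ('order-42', 700)]
```

The last element shows the trade-off of a bounded window: a repeat that arrives after the window has expired is treated as new, which is why the window should be at least as long as the largest expected gap between duplicates.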
3. Do post-processing in sink
Deduplication can also be done in the sink. This could be done by running a scheduled job that periodically deduplicates rows using a unique identifier.
BigQuery as a sink
If BigQuery is the sink in your pipeline, a scheduled query can be run periodically that writes the deduplicated data to another table or updates the existing table. Depending on the complexity of the scheduling, you may need orchestration tools like Cloud Composer or Dataform to schedule queries.
Deduplication can be done using a DISTINCT statement or DML like MERGE. You can find sample queries about these methods on these blogs (blog 1, blog 2).
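As a small, hedged illustration of the DISTINCT approach, the sketch below writes deduplicated rows to a second table. It uses Python's standard-library sqlite3 module as a stand-in for BigQuery, so the SQL dialect and the table and column names (events, events_dedup, event_id, amount) are assumptions for the example rather than the queries from the linked posts.

```python
import sqlite3

# In-memory database standing in for the sink (assumption for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("e1", 10), ("e1", 10), ("e2", 25)],  # "e1" was written twice
)

# Write the deduplicated rows to another table, as a scheduled job might do.
conn.execute(
    "CREATE TABLE events_dedup AS SELECT DISTINCT event_id, amount FROM events"
)

rows = conn.execute(
    "SELECT event_id, amount FROM events_dedup ORDER BY event_id"
).fetchall()
print(rows)  # [('e1', 10), ('e2', 25)]
```

DISTINCT only removes rows that are identical in every selected column; if duplicates can differ in some field, such as an ingestion timestamp, that column has to be left out of the selection or a key-based approach such as MERGE is needed.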
Often in streaming pipelines you may need deduplicated data available in real time in BigQuery. You can achieve this by creating materialized views on top of underlying tables using a DISTINCT statement.
Any new updates to the underlying tables are reflected in the materialized view in real time, with zero maintenance or orchestration.
Technical trade-offs of different deduplication options
In the past decade, the Healthcare and Life Sciences industry has enjoyed a boon in technological and scientific advancement. New insights and possibilities are revealed almost daily. At Google Cloud, driving innovation in cloud computing is in our DNA. Our team is dedicated to sharing ways Google Cloud can be used to accelerate scientific discovery. For example, the recent announcement of AlphaFold2 showcases a scientific breakthrough, powered by Google Cloud, that will promote a quantum leap in the field of proteomics. In this blog, we’ll review another omics use case, single-cell analysis, and how Google Cloud’s Dataproc and NVIDIA GPUs can help accelerate that analysis.
The Need for Performance in Scientific Analysis
The ability to understand the causal relationship between genotypes and phenotypes is one of the long-standing challenges in biology and medicine. The complexity of biological systems runs from the actual code of life (DNA), through the expression of genes (RNA), to the translation of gene transcripts into proteins that function in different pathways, cells, and tissues within an organism. Even the smallest of changes in our DNA can have large impacts on protein expression, structure, and function, which ultimately drive development and response at both cellular and organism levels. And, as the omics space becomes increasingly data- and compute-intensive, research requires an adequate informatics infrastructure: one that scales with growing data demands, enables a diverse range of resource-intensive computational activities, and is affordable and efficient, reducing data bottlenecks and enabling researchers to maximize insight.
But where do all these data and compute challenges come from and what makes scientific study so arduous? The layers of biological complexity begin to be made apparent immediately when looking at not just the genes themselves, but their expression. Although all the cells in our body share nearly identical genotypes, our many diverse cell types (e.g. hepatocytes versus melanocytes) express a unique subset of genes necessary for specific functions, making transcriptomics a more powerful method of analysis by allowing researchers to map gene expression to observable traits. Studies have shown that gene expression is heterogeneous, even in similar cell types. Yet, conventional sequencing methods require DNA or RNA extracted from a cell population. The development of single-cell sequencing was pivotal to the omics field. Single-cell RNA sequencing has been critical in allowing scientists to study transcriptomes across large numbers of individual cells.
Despite its potential, and the increasing availability of single-cell sequencing technology, there are several obstacles: an ever increasing volume of high-dimensionality data, the need to integrate data across different types of measurements (e.g. genetic variants, transcript and protein expression, epigenetics) and across samples or conditions, as well as varying levels of resolution and the granularity needed to map specific cell types or states. These challenges present themselves in a number of ways including background noise, signal dropouts requiring imputation, and limited bioinformatics pipelines that lack statistical flexibility. These and other challenges result in analysis workflows that are very slow, prohibiting the iterative, visual, and interactive analysis required to detect differential gene activity.
Cloud computing can help not only with data challenges, but with some of the biggest obstacles: scalability, performance, and automation of analysis. To address several of the data and infrastructure challenges facing single-cell analysis, NVIDIA developed end-to-end accelerated single-cell RNA sequencing workflows that can be paired with Google Cloud Dataproc, a fully-managed service for running open source frameworks like Spark, Hadoop, and RAPIDS. The Jupyter notebooks that power these workflows include examples using samples like human lung cells and mouse brain cells and demonstrate the acceleration of GPU-based workflows compared to CPU-based processing.
Google Cloud Dataproc powers the NVIDIA GPU-based approach and demonstrates data processing capabilities and acceleration, which in turn have the potential to deliver considerable performance gains. When paired with RAPIDS, practitioners can accelerate data science pipelines on NVIDIA GPUs, reducing operations like data loading, processing, and training from hours to seconds. RAPIDS abstracts the complexities of accelerated data science by building on popular Python and Java libraries. When applying RAPIDS and NVIDIA accelerated compute to single-cell genomics use cases, practitioners can churn through analysis of a million cells in only a few minutes.
Give it a Try
The journey to realizing the full potential of omics is long; but through collaboration with industry experts, customers, and partners like NVIDIA, Google Cloud is here to help shine a light on the road ahead. To learn more about the notebook provided for single-cell genomic analysis, please take a look at NVIDIA’s walkthrough. To give this pattern a try on Dataproc, please visit our technical reference guide.