Archives January 2023

Cloud-Centric Companies Discover Benefits & Pitfalls of Network Relocation

Keeping your IT infrastructure intact while moving to a new location is a top priority at most organizations, especially cloud-centric companies. Given that IT increasingly affects every touchpoint of your business, any downtime can derail operations across the board or, worse, result in theft or loss of data with far-reaching consequences.

Network relocation to the cloud comes with its own set of challenges, benefits, and pitfalls that all business owners and IT professionals should be aware of. Moving networks and IT infrastructure calls for professional move management services that are well versed in the checks and best practices involved in a network relocation.

In this article, we underscore a few benefits and possible pitfalls that businesses need to consider before planning a network relocation to the cloud.

Benefits of Network Relocation

On the face of it, the benefits of relocating your network to the cloud seem fairly obvious and are increasingly the hallmark of any cloud-centric organization.

Easy Deployment & Adoption – Relocating your network to the cloud brings a host of benefits pertaining to the deployment and adoption of new technologies going forward.

Unlike a traditional IT network or physical presence, the technical requirements are fairly thin, allowing businesses to curtail the substantial overheads of maintaining on-premises infrastructure.

The initial deployment is just as straightforward, as it mostly involves training and onboarding team members on a new tool, instead of installing it individually on each system or upgrading hardware to match system requirements.

Remote Work & Location Independence – Office relocations, work-from-home arrangements, or incidents such as fire and water damage are hardly an issue for cloud-centric organizations. With data, systems, and processes hosted entirely in the cloud, disaster preparedness remains strong and business continuity is far easier to maintain.

The value of cloud deployments was further evidenced over the course of the pandemic, when organizations across the world were forced to shut physical office spaces overnight. Having a cloud infrastructure in place helped companies maintain their operations without any major changes.

Scalability – SaaS solutions deployed in the cloud scale readily, since limitations arising from local hardware specs are no longer an issue.

If an organization continues to scale, it need not invest in and expand its local infrastructure; it can simply buy additional seats with the cloud-based service, substantially lowering overheads. This is often cited as one of the key benefits of cloud computing for businesses of all sizes.

Scaling is also far easier and more straightforward, requiring limited technical know-how and time compared with legacy systems.

Pitfalls of Network Relocation

Despite the substantial advantages, relocating your network to the cloud comes with its share of disadvantages and pitfalls that you have to consider.

Downtime – While cloud networks are far more reliable than local ones, working with third-party providers does involve some downtime, especially if there are issues with your local internet connectivity.

Such uncertainties have to be accounted for, and they take some of the shine off cloud networks.

Security Issues – Cloud-centric organizations face substantial security threats such as phishing attacks, ransomware, and denial of service, all of which can become an existential threat for a business without the necessary protections and precautions.

Reliance On Third-Party Vendors – A cloud-centric approach requires a third-party service provider to take care of hosting, maintenance, and customization.

Even a simple network relocation from your existing setup to the cloud requires trained professionals to move your IT infrastructure expertly. Many small and medium-sized business owners are uncomfortable with this kind of reliance.

Conclusion

While it is increasingly evident that cloud-based systems are the future, businesses should make a point of understanding what such a move entails, along with its benefits, opportunities, and limitations, before pressing ahead.

Source: SmartData Collective

Better together: Looker connector for Looker Studio now generally available

Today’s leading organizations want to ensure their business users get fast access to data with real-time governed metrics, so they can make better business decisions. Last April, we announced our unified BI experience, bringing together both self-serve and governed BI. Now, we are making our Looker connector to Looker Studio generally available, enabling you to access your Looker modeled data in your preferred environment.

Connecting people to answers quickly and accurately to empower informed decisions is a primary goal for any successful business, and more than ten million users turn to Looker each month to easily explore and visualize their data from hundreds of different data sources. Now you can join the many Google Cloud customers who have benefited from early access to this connector by connecting your data in a few steps.*

How do I turn on the integration between Looker and Looker Studio?

You can connect to any Google Cloud-hosted Looker instance immediately after your Looker admin turns on its Looker Studio integration.

Once the integration is turned on, you create a new Data Source, select the Looker connector, choose an Explore in your connected Looker instance, and start analyzing your modeled data.

You can explore your company’s modeled data in the Looker Studio report editor and share results with other users in your organization.

When can I access the Looker connector?

The Looker connector is now available for Looker Studio and Looker Studio Pro, which includes expanded enterprise support and compliance.

Learn more about the connector here.

* A Google Cloud-hosted Looker instance with Looker 23.0 or higher is required to use the Looker connector. A Looker admin must enable the Looker Studio BI connector before users can access modeled data in Looker Studio.

Source: Data Analytics

How innovative startups are growing their businesses on Google’s open data cloud

Data is one of the single most valuable assets for organizations today. It can empower businesses to do incredible things like create better views of health for hospitals, enable people to share timely insights with their colleagues, and — increasingly — be a foundational building block for startups who build their products and businesses in a data cloud.

Last year, we shared that more than 800 software companies are building their products and businesses with Google’s data cloud. Many of these are fast-growing startups. These companies are creating entirely new products with technologies like AI, ML and data analytics that help their customers turn data into real-time value. In turn, Google’s data cloud and services like Google BigQuery, Cloud Storage, and Vertex AI are helping startups build their own thriving businesses. 

We’re committed to supporting these innovative, fast-growing startups and helping them grow within our open data cloud ecosystem. That’s why today, I’m excited to share how three innovative data companies – Ocient, SingleStore, and Glean – are now building on Google’s data cloud as they grow in the market and deliver scalable data solutions to more customers around the world.

Founded in 2016, Ocient is a hyperscale data warehousing and analytics startup that is helping enterprises analyze and gain real-time value from trillions of data records by enabling massively parallelized processing in a matter of seconds. By designing its data warehouse architecture with compute adjacent to storage on NVMe solid state drives, continuous ingest on high-volume data sets, and intra-database ELT and machine learning, Ocient's technology enables users to transform, load, and analyze otherwise infeasible data queries at 10x-100x the price performance of other cloud data warehouse providers. To help more enterprises scale their data intelligence to drive business growth, Ocient chose to bring its platform to Google Cloud's flexible and scalable infrastructure earlier this year via Google Cloud Marketplace. In addition to bringing its solution to Google Cloud Marketplace, Ocient is using Google Cloud technologies including Google Cloud Storage for file loading, Google Compute Engine (GCE) for running its managed hyperscale data analytics solutions, and Google Cloud networking tools for scalability, increased security, and analyzing hyperscale data sets with greater speed. In just three months, Ocient more than doubled its Google Cloud usage in order to support the transformation workloads of enterprises on Google Cloud.

Another fast-growing company that recently brought its solution to Google Cloud Marketplace to reach more customers on Google Cloud's scalable, secure, and global infrastructure is SingleStore. Built with developers and database architects in mind, SingleStore helps companies provide low-latency access to large datasets and simplify the development of enterprise applications by bringing transactions and analytics together in a single, unified data engine (SingleStoreDB). SingleStore integrates with Google Cloud services to enable a scalable and highly available implementation. In addition to growing its business by reaching more customers on Google Cloud Marketplace, SingleStore is today announcing the establishment of its go-to-market strategy with Google Cloud, which will further enable the company to deliver its database solution to customers around the world.

I'm also excited to share how Glean is leveraging our solutions to scale its business and support more customers. Founded in 2019, Glean is a powerful, unified search tool built to search across all deployed applications at an organization. Glean's platform understands context, language, behavior, and relationships, which in turn enables users to find personalized answers to questions, instantly. To achieve this, the Glean team built its enterprise search and knowledge discovery product with Google managed services, including Cloud SQL and Kubernetes, along with solutions from Google Cloud like Vertex AI, Dataflow, and BigQuery. By creating its product with technologies from Google Cloud, Glean has the capabilities needed to be agile and iterate quickly. This also gives Glean's developer team more time to focus on developing the core application aspects of its product, like relevance, performance, ease of use, and delivering a magical search experience to users. To support the growing needs of enterprises and bring its product to more customers at scale, Glean is today announcing its formal partnership with Google Cloud and the availability of its product on Google Cloud Marketplace.

We’re proud to support innovative startups with the data cloud capabilities they need to help their customers thrive and to build and grow their own businesses, and we’re committed to providing them with an open and extensible data ecosystem so they can continue helping their customers realize the full value of their data.

Source: Data Analytics

Scaling machine learning inference with NVIDIA TensorRT and Google Dataflow

A collaboration between Google Cloud and NVIDIA has enabled Apache Beam users to maximize the performance of ML models within their data processing pipelines, using NVIDIA TensorRT and NVIDIA GPUs alongside the new Apache Beam TensorRTEngineHandler.

The NVIDIA TensorRT SDK provides high-performance neural network inference that lets developers optimize and deploy trained ML models on NVIDIA GPUs with the highest throughput and lowest latency, while preserving model prediction accuracy. TensorRT was specifically designed to support multiple classes of deep learning models, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and Transformer-based models.

Deploying and managing end-to-end ML inference pipelines while maximizing infrastructure utilization and minimizing total costs is a hard problem. Integrating ML models in a production data processing pipeline to extract insights requires addressing challenges associated with the three main workflow segments: 

Preprocess large volumes of raw data from multiple data sources to use as inputs to train ML models to “infer / predict” results, and then leverage the ML model outputs downstream for incorporation into business processes. 

Call ML models within data processing pipelines while supporting different inference use-cases: batch, streaming, ensemble models, remote inference, or local inference. Pipelines are not limited to a single model and often require an ensemble of models to produce the desired business outcomes.

Optimize the performance of the ML models to deliver results within the application's accuracy, throughput, and latency constraints. For pipelines that use complex, compute-intensive models for use cases like NLP or that require multiple ML models together, the response time of these models often becomes a performance bottleneck. This can cause poor hardware utilization and requires more compute resources to deploy your pipelines in production, leading to potentially higher costs of operations.

Google Cloud Dataflow is a fully managed runner for stream or batch processing pipelines written with Apache Beam. To enable developers to easily incorporate ML models in data processing pipelines, Dataflow recently announced support for Apache Beam’s generic machine learning prediction and inference transform, RunInference. The RunInference transform simplifies the ML pipeline creation process by allowing developers to use models in production pipelines without needing lots of boilerplate code. 

You can see an example of its usage with Apache Beam in the following code sample. Note that the engine_handler is passed as a configuration to the RunInference transform, which abstracts the user from the implementation details of running the model.

engine_handler = TensorRTEngineHandlerNumPy(
    min_batch_size=4,
    max_batch_size=4,
    engine_path='gs://gcp_bucket/single_tensor_features_engine.trt')

pcoll = pipeline | beam.Create(SINGLE_FEATURE_EXAMPLES)
predictions = pcoll | RunInference(engine_handler)

Along with the Dataflow runner and the TensorRT engine, Apache Beam enables users to address the three main challenges. The Dataflow runner takes care of pre-processing data at scale, preparing the data for use as model input. Apache Beam’s single API for batch and streaming pipelines means that RunInference is automatically available for both use cases. Apache Beam’s ability to define complex multi-path pipelines also makes it easier to create pipelines that have multiple models. With TensorRT support, Dataflow now also has the ability to optimize the inference performance of models on NVIDIA GPUs. 
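
As a rough illustration of that multi-model point, here is a minimal, hypothetical sketch of a Beam pipeline that feeds one preprocessed PCollection into two RunInference branches; the engine paths, sample input, and preprocessing function are placeholders rather than code from the collaboration itself:

import numpy as np
import apache_beam as beam
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.tensorrt_inference import TensorRTEngineHandlerNumPy

# Placeholder input and preprocessing; replace with your own data and logic.
EXAMPLES = [np.zeros(28, dtype=np.float32)]

def preprocess(row):
    return row  # e.g. normalization or feature extraction

# Two hypothetical TensorRT engines forming a simple ensemble.
model_a = TensorRTEngineHandlerNumPy(
    min_batch_size=4, max_batch_size=4,
    engine_path='gs://your_bucket/model_a.trt')
model_b = TensorRTEngineHandlerNumPy(
    min_batch_size=4, max_batch_size=4,
    engine_path='gs://your_bucket/model_b.trt')

with beam.Pipeline() as pipeline:
    features = (
        pipeline
        | 'Read' >> beam.Create(EXAMPLES)
        | 'Preprocess' >> beam.Map(preprocess))

    # The same PCollection feeds two inference branches of one pipeline.
    predictions_a = features | 'ModelA' >> RunInference(model_a)
    predictions_b = features | 'ModelB' >> RunInference(model_b)

    # Bring both prediction streams back together for downstream handling.
    _ = ((predictions_a, predictions_b)
         | 'Combine' >> beam.Flatten()
         | 'Log' >> beam.Map(print))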

For more details and samples to start using this feature today, have a look at the NVIDIA Technical Blog, “Simplifying and Accelerating Machine Learning Predictions in Apache Beam with NVIDIA TensorRT.” Documentation for RunInference is available on the Apache Beam documentation site and in the Dataflow docs.

Source: Data Analytics

Transforming customer experiences with modern cloud database capabilities

Editor’s note: Six customers, across a range of industries, share their success stories with Google Cloud databases.

From professional sports leagues to kidney care and digital commerce, Google Cloud databases enable organizations to develop radically transformative experiences for their users. The stories of how Google Cloud databases have helped Box, Credit Karma, DaVita, Forbes, MLB, and PLAID build data-driven applications are truly remarkable – from unifying data lifecycles for intelligent applications to reducing, and even eliminating, operational burden. Here are some of the key stories that customers shared at Google Cloud Next.

Box modernizes its NoSQL databases with zero downtime using Bigtable

A content cloud, Box enables users to securely create, share, co-edit, and retain their content online. While moving its core infrastructure from on-premises data centers to the cloud, Box chose to migrate its NoSQL infrastructure to Cloud Bigtable. To meet the company's user request needs, the NoSQL infrastructure must serve requests with latencies measured in the tens of milliseconds. “File metadata like location, size, and more, are stored in a NoSQL table and accessed at every download. This table is about 150 terabytes in size and spans over 600 billion rows. Hosting this on Bigtable removes the operational burden of infrastructure management. Using Bigtable, Box gains automatic replication with eventual consistency, an HBase-compliant library, and managed backup and restore features to support critical data.” Axatha Jayadev Jalimarada, Staff Software Engineer at Box, was enthusiastic about these Bigtable benefits: “We no longer need manual interventions by SREs to scale our clusters, and that’s been a huge operational relief. We see around 80 millisecond latencies to Bigtable from our on-prem services. We see sub-20 millisecond latencies from our Google Cloud resident services, especially when the Bigtable cluster is in the same region. Finally, most of our big NoSQL use cases have been migrated to Bigtable and I’m happy to report that some have been successfully running for over a year now.”

Axatha Jayadev Jalimarada walks through “how Box modernized their NoSQL databases with minimal effort and downtime” with Jordan Hambleton, Bigtable Solutions Architect at Google Cloud.

Credit Karma deploys models faster with Cloud Bigtable and BigQuery

Credit Karma, a consumer technology platform helping consumers in the US, UK and Canada make financial progress, is reliant on its data models and systems to deliver a personalized experience for its nearly 130 million members. Given its scale, Credit Karma recognized the need to cater to the growing volume, complexity, and speed of data, and began moving its technology stack to Google Cloud in 2016. 

Using Cloud Bigtable and BigQuery, Credit Karma registered a 7x increase in the number of experiments compared with pre-migration levels, and began deploying 700 models per week compared to 10 per quarter. Additionally, Credit Karma was able to push recommendations through its model scoring service built on a reverse extract, transform, load (ETL) process on BigQuery, Cloud Bigtable, and Google Kubernetes Engine. Powering Credit Karma's recommendations are machine learning models at scale — the team runs about 58 billion model predictions each day.

Looking to learn “what’s next for engineers”? Check out the conversation between Scott Wong and Andi Gutmans, General Manager and Vice President of Engineering for Databases at Google.

DaVita leverages Spanner and BigQuery to centralize health data and analytics for clinician enablement

As a leading global kidney care company, DaVita spans the gamut of kidney care from chronic kidney disease to transplants. As part of its digital transformation strategy, DaVita was looking to centralize all electronic health records (EHRs) and related care activities into a single system that would not only embed workflows, but also save clinicians time and enable them to focus on their core competencies. Jay Richardson, VP, Application Development at DaVita, spoke to the magnitude of the task: “Creating a seamless, real-time data flow across 600,000 treatments on 200,000 patients and 45,000 clinicians was a tall engineering order.” The architecture was set up with Cloud Spanner housing all the EHRs and related care activities, and BigQuery handling the analytics. Spanner change streams replicated data changes to BigQuery with a 75 percent reduction in replication time (from 60 to 15 seconds), enabling both a simpler integration process and a highly scalable solution. DaVita also gained deep, relevant insights (about 200,000 a day) and full aggregation for key patient meds and labs data. This helps equip physicians with additional tools to care for their patients, without inundating them with numbers.

Jerene Yang, Senior Software Engineering Manager at Google Cloud, joins Jay Richardson to discuss how to “see the whole picture by unifying operational data with analytics”.

Forbes fires up digital transformation with Firestore

A leading media and information company, Forbes is plugged into an ecosystem of about 140 million employees, contributors, and readers across the globe. It recently underwent a successful digital transformation effort to support its rapidly scaling business. This included a swift, six-month migration to Google Cloud and integration with the full suite of Google Cloud products, from BigQuery to Firestore, a NoSQL document database. Speaking of Firestore, Vadim Supitskiy, Chief Digital & Information Officer at Forbes, explained, “We love that it’s a managed service, we do not want to be in the business of managing databases. It has a flexible document model, which makes it very easy for developers to use and it integrates really, really, well with the products that GCP has to offer.” Firestore powers the Forbes insights and analytics platform to give its journalists and contributors comprehensive, real-time suggestions that help content creators author relevant content, and analytics to assess the performance of published articles. At the backend, Firestore seamlessly integrates with Firebase Auth, Google Kubernetes Engine, Cloud Functions, BigQuery, and Google Analytics, while reducing maintenance overheads. As a cloud-native database that requires no configuration or management, it is inexpensive to store data in and executes low-latency queries.

Minh Nguyen, Senior Product Manager at Google Cloud, discusses “serverless application development with a document database” with Vadim Supitskiy here.

MLB hits a home run by moving to Cloud SQL

When you think of Major League Baseball (MLB), you think of star players and home runs. But as Joseph Zirilli, senior software engineer at MLB, explained, behind-the-scenes technology is critical to the game, whether it is the TV streaming service or the on-field technology that captures statistics data. And that's a heavy lift, especially since MLB was running its player scouting and management system for player transactions on a legacy, on-premises database. This, in combination with the limitations of conventional licensing, was adversely impacting the business. The lack of in-house expertise in the legacy database, coupled with the team's small size, made routine tasks challenging.

Having initiated the move to Google Cloud a few years ago, MLB was already using Cloud SQL for some of its newer products. It was also looking to standardize its relational database management system around PostgreSQL so it could build in-house expertise around a single database. MLB selected Cloud SQL, which supported its needs and also offered high availability and automation.

Today, with drastically improved database performance and automatic rightsizing of database instances, MLB is looking forward to keeping its operational costs low and hitting it out of the park for fan experience.

Sujatha Mandava, Director, Product Management, SQL Databases at Google Cloud, and Joseph Zirilli discuss “why now is the time to migrate your apps to managed databases”.

Major League Baseball trademarks and copyrights are used with permission of Major League Baseball. Visit MLB.com.

PLAID allies with AlloyDB to enhance the KARTE website and native app experience for customer engagement

PLAID, a Tokyo-based startup, hosts KARTE, an engagement platform focused on customer experience that tracks customers in real time, supports flexible interactions, and provides wide-ranging analytics functionality. To support hybrid transactional and analytical processing (HTAP) at scale, KARTE was using a combination of BigQuery, Bigtable, and Spanner in the backend. This enabled KARTE to process over 100,000 transactions per second and store over 10 petabytes of data. Adding AlloyDB for PostgreSQL to the mix has provided KARTE with the ability to answer flexible analytical queries. In addition to the range of queries that KARTE can now handle, AlloyDB has brought in expanded capacity with low-latency analysis in a simplified system. As Yuki Makino, CTO at PLAID, pointed out, “With the current (columnar) engine and AlloyDB performance is about 100 times faster than earlier.”

Yuki Makino, in conversation with Sandy Ghai, Product Manager at Google Cloud, says “goodbye, expensive legacy database, hello next-gen PostgreSQL database” here.

Implement a modern database strategy

Transformation hinges on new cloud database capabilities. Whether you want to increase your agility and pace of innovation, better manage your costs, or entirely shut down data centers, we can help you accelerate your move to cloud. From integration into a connected environment, to disruption-free migration, and automation to free up developers for creative work, Google Cloud databases offer unified, open, and intelligent building blocks to enable a modern database strategy.

Download the complimentary 2022 Gartner Magic Quadrant for Cloud Database Management Systems report. 

Learn more about Google Cloud databases.

Learn why customers choose Google Cloud databases in this e-book.

Source: Data Analytics

Optimize Cloud Composer via Better Airflow DAGs

Hosting, orchestrating, and managing data pipelines is a complex process for any business.  Google Cloud offers Cloud Composer – a fully managed workflow orchestration service – enabling businesses to create, schedule, monitor, and manage workflows that span across clouds and on-premises data centers. Cloud Composer is built on the popular Apache Airflow open source project and operates using the Python programming language.  Apache Airflow allows users to create directed acyclic graphs (DAGs) of tasks, which can be scheduled to run at specific intervals or triggered by external events.

This guide contains a generalized checklist of activities when authoring Apache Airflow DAGs.  These items follow best practices determined by Google Cloud and the open source community.  A collection of performant DAGs will enable Cloud Composer to work optimally and standardized authoring will help developers manage hundreds or even thousands of DAGs.  Each item will benefit your Cloud Composer environment and your development process.

Get Started

1. Standardize file names. Help other developers browse your collection of DAG files.
a. ex) team_project_workflow_version.py

2. DAGs should be deterministic.
a. A given input will always produce the same output.

3. DAGs should be idempotent. 
a. Triggering the DAG multiple times has the same effect/outcome.

4. Tasks should be atomic and idempotent. 
a. Each task should be responsible for one operation that can be re-run independently of the others. In an atomized task, a success in part of the task means a success of the entire task.

5. Simplify DAGs as much as possible.
a. Simpler DAGs with fewer dependencies between tasks tend to have better scheduling performance because they have less overhead. A linear structure (e.g. A -> B -> C) is generally more efficient than a deeply nested tree structure with many dependencies. 
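
As an illustration of the difference, the fragment below contrasts a linear chain with a heavily cross-connected structure that gives the Scheduler more work; the task names are hypothetical and each is assumed to be an already-defined operator:

# Preferred: a linear chain with minimal dependency overhead.
extract >> transform >> load

# Harder to schedule: a fan-out/fan-in with many cross-dependencies.
extract >> [clean_a, clean_b, clean_c]
clean_a >> [join_ab, join_ac]
clean_b >> [join_ab, join_bc]
clean_c >> [join_ac, join_bc]
[join_ab, join_ac, join_bc] >> load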

Standardize DAG Creation

6. Add an owner to your default_args.
a. Determine whether you’d prefer the email address / id of a developer, or a distribution list / team name.

7. Use with DAG() as dag: instead of dag = DAG()
a. Prevent the need to pass the dag object to every operator or task group.
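
For instance, a minimal sketch (the DAG ID, default_args, and command are illustrative):

# Without the context manager, every operator needs dag=... passed explicitly.
dag = DAG('my_dag_id_v1_0_0', default_args=default_args)
task = BashOperator(task_id='print_date', bash_command='date', dag=dag)

# With the context manager, tasks defined inside the block attach automatically.
with DAG('my_dag_id_v1_0_0', default_args=default_args) as dag:
    task = BashOperator(task_id='print_date', bash_command='date')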

8. Set a version in the DAG ID. 
a. Update the version after any code change in the DAG.
b. This prevents deleted Task logs from vanishing from the UI, no-status tasks generated for old dag runs, and general confusion of when DAGs have changed.
c. Airflow open-source has plans to implement versioning in the future. 

9. Add tags to your DAGs.
a. Help developers navigate the Airflow UI via tag filtering.
b. Group DAGs by organization, team, project, application, etc. 

10. Add a DAG description. 
a. Help other developers understand your DAG.

11. Pause your DAGs on creation. 
a. This will help avoid accidental DAG runs that add load to the Cloud Composer environment.

12. Set catchup=False to avoid automatic catch ups overloading your Cloud Composer Environment.

13. Set a dagrun_timeout to avoid dags not finishing, and holding Cloud Composer Environment resources or introducing collisions on retries.

14. Set SLAs at the DAG level to receive alerts for long-running DAGs.
a. Airflow SLAs are always defined relative to the start time of the DAG, not to individual tasks.
b. Ensure that sla_miss_timeout is less than the dagrun_timeout.
c. Example: If your DAG usually takes 5 minutes to successfully finish, set the sla_miss_timeout to 7 minutes and the dagrun_timeout to 10 minutes.  Determine these thresholds based on the priority of your DAGs.

15. Ensure all tasks have the same start_date by default by passing the argument to the DAG during instantiation.

16. Use a static start_date with your DAGs. 
a. A dynamic start_date is misleading, and can cause failures when clearing out failed task instances and missing DAG runs.

17. Set retries as a default_arg applied at the DAG level and get more granular for specific tasks only where necessary. 
a. A good range is 1–4 retries. Too many retries will add unnecessary load to the Cloud Composer environment.

Example putting all the above together:

import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta  # needed for start_date, timeouts, and retry_delay

# Define default_args dictionary to specify default parameters of the DAG, such as the start date, frequency, and other settings
default_args = {
    'owner': 'me',
    'retries': 2, # 2-4 retries max
    'retry_delay': timedelta(minutes=5),
    'is_paused_upon_creation': True,
    'catchup': False,
}

# Use the `with` statement to define the DAG object and specify the unique DAG ID and default_args dictionary
with DAG(
    'dag_id_v1_0_0', # versioned ID
    default_args=default_args,
    description='This is a detailed description of the DAG', # detailed description
    start_date=datetime(2022, 1, 1), # static start date
    dagrun_timeout=timedelta(minutes=10), # timeout specific to this dag
    sla_miss_timeout=timedelta(minutes=7), # sla miss less than timeout
    tags=['example', 'versioned_dag_id'], # tags specific to this dag
    schedule_interval=None,
) as dag:
    # Define a task using the BashOperator
    task = BashOperator(
        task_id='bash_task',
        bash_command='echo "Hello World"'
    )

18. Define what should occur for each callback function. (send an email, log a context, message slack channel, etc.).  Depending on the DAG you may be comfortable doing nothing. 
a. success
b. failure
c. sla_miss
d. retry

Example:

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta  # needed for start_date, timeouts, and retry_delay

default_args = {
    'owner': 'me',
    'retries': 2, # 2-4 retries max
    'retry_delay': timedelta(minutes=5),
    'is_paused_upon_creation': True,
    'catchup': False,
}

def on_success_callback(context):
    # when a task in the DAG succeeds
    print(f"Task {context['task_instance_key_str']} succeeded!")

def on_sla_miss_callback(context):
    # when a task in the DAG misses its SLA
    print(f"Task {context['task_instance_key_str']} missed its SLA!")

def on_retry_callback(context):
    # when a task in the DAG retries
    print(f"Task {context['task_instance_key_str']} retrying...")

def on_failure_callback(context):
    # when a task in the DAG fails
    print(f"Task {context['task_instance_key_str']} failed!")

# Create a DAG and set the callbacks
with DAG(
    'dag_id_v1_0_0',
    default_args=default_args,
    description='This is a detailed description of the DAG',
    start_date=datetime(2022, 1, 1),
    dagrun_timeout=timedelta(minutes=10),
    sla_miss_timeout=timedelta(minutes=7),
    tags=['example', 'versioned_dag_id'],
    schedule_interval=None,
    on_success_callback=on_success_callback, # what to do on success
    on_sla_miss_callback=on_sla_miss_callback, # what to do on sla miss
    on_retry_callback=on_retry_callback, # what to do on retry
    on_failure_callback=on_failure_callback # what to do on failure
) as dag:

    def example_task(**kwargs):
        # This is an example task that will be part of the DAG
        print(f"Running example task with context: {kwargs}")

    # Create a task and add it to the DAG
    task = PythonOperator(
        task_id="example_task",
        python_callable=example_task,
        provide_context=True,
    )

19. Use Task Groups to organize Tasks.

Example:

from datetime import timedelta
from airflow.utils.task_group import TaskGroup

# Use the `with` statement to define the DAG object and specify the unique DAG ID and default_args dictionary
with DAG(
    'example_dag',
    default_args=default_args,
    schedule_interval=timedelta(hours=1),
) as dag:
    # Define the first task group
    with TaskGroup(group_id='task_group_1') as tg1:
        # Define the first task in the first task group
        task_1_1 = BashOperator(
            task_id='task_1_1',
            bash_command='echo "Task 1.1"',
            dag=dag,
        )

Reduce the Load on Your Composer Environment

20. Use Jinja Templating / Macros instead of python functions.
a. Airflow’s template fields allow you to incorporate values from environment variables and jinja templates into your DAGs. This helps make your DAGs idempotent (meaning multiple invocations do not change the result) and prevents unnecessary function execution during Scheduler heartbeats.
b. The Airflow engine passes a few variables by default that are accessible in all templates.

Contrary to best practices, the following example defines variables based on datetime Python functions:

# Variables used by tasks
# Bad example - Define today's and yesterday's date using datetime module
today = datetime.today()
yesterday = datetime.today() - timedelta(1)

If this code is in a DAG file, these functions execute on every Scheduler heartbeat, which may not be performant. Even more importantly, this doesn’t produce an idempotent DAG. You can’t rerun a previously failed DAG run for a past date because datetime.today() is relative to the current date, not the DAG execution date.

A better way of implementing this is by using one of Airflow's built-in template variables (macros):

# Variables used by tasks
# Good example - Define yesterday's date with an Airflow template variable
yesterday = {{ yesterday_ds_nodash }}

21. Avoid creating your own additional Airflow Variables. 
a. The metadata database stores these variables and requires database connections to retrieve them. This can affect the performance of the Cloud Composer Environment. Use Environment Variables or Google Cloud Secrets instead.
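
For example, a minimal sketch that reads configuration from an environment variable set on the Composer environment, so DAG parsing never touches the metadata database (the variable name and default are illustrative):

import os

# Resolved at DAG parse time without querying the Airflow metadata database.
OUTPUT_BUCKET = os.environ.get("OUTPUT_BUCKET", "gs://my-default-bucket")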

22. Avoid running all DAGs on the exact same schedules (disperse workload as much as possible). 
a. Prefer cron expressions for schedule intervals rather than Airflow macros or time deltas. This allows a more rigid schedule, and it's easier to spread workloads throughout the day, making it easier on your Cloud Composer environment.
b. Crontab.guru can help with generating specific cron expression schedules.  Check out the examples here.

Examples:

schedule_interval="*/5 * * * *",  # every 5 minutes.

schedule_interval="0 */6 * * *",  # at minute 0 of every 6th hour.

23. Avoid XComs except for small amounts of data. 
a. These add storage and introduce more connections to the database. 
b. Use JSON dicts as values if absolutely necessary. (one connection for many values inside dict)
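
For example, a minimal sketch that pushes a single small dict through XCom instead of several separate values; the task IDs, key, and values are illustrative:

def produce_metadata(**context):
    # One XCom row instead of three separate pushes.
    context["ti"].xcom_push(
        key="run_metadata",
        value={"row_count": 1024, "output_path": "gs://bucket/out", "status": "ok"},
    )

def consume_metadata(**context):
    metadata = context["ti"].xcom_pull(task_ids="produce_metadata_task", key="run_metadata")
    print(metadata["row_count"], metadata["output_path"])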

24. Avoid adding unnecessary objects in the dags/ Google Cloud Storage path. 
a. If you must, add an .airflowignore file to GCS paths that the Airflow Scheduler does not need to parse. (sql, plug-ins, etc.)
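
For example, a hypothetical .airflowignore placed in the dags/ folder; by default each non-comment line is treated as a regular expression against file paths, and these entries are illustrative:

# Paths matching these patterns are skipped by the Airflow Scheduler's DAG parser.
sql/
plugins/
.*_helper\.py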

25. Set execution timeouts for tasks.

Example:

# Use the `PythonOperator` to define the task
task = PythonOperator(
    task_id='my_task',
    python_callable=my_task_function,
    execution_timeout=timedelta(minutes=30), # Set the execution timeout to 30 minutes
    dag=dag,
)

26. Use Deferrable Operators over Sensors when possible. 
a. A deferrable operator can suspend itself and free up the worker when it knows it has to wait, and hand off the job of resuming it to a Trigger. As a result, while it suspends (defers), it is not taking up a worker slot, and your cluster wastes fewer resources on idle Operators or Sensors.

Example:

PYSPARK_JOB = {
    "reference": { "project_id": "PROJECT_ID" },
    "placement": { "cluster_name": "PYSPARK_CLUSTER_NAME" },
    "pyspark_job": {
        "main_python_file_uri": "gs://dataproc-examples/pyspark/hello-world/hello-world.py"
    },
}

DataprocSubmitJobOperator(
    task_id="dataproc-deferrable-example",
    job=PYSPARK_JOB,
    deferrable=True,
)

27. When using Sensors, always define mode, poke_interval, and timeout. 
a. Sensors require Airflow workers to run.
b. Sensor checking every n seconds (i.e. poke_interval < 60)? Use mode=poke. A sensor in mode=poke will continuously poll every n seconds and hold Airflow worker resources. 
c. Sensor checking every n minutes (i.e. poke_interval >= 60)? Use mode=reschedule. A sensor in mode=reschedule will free up Airflow worker resources between poke intervals.

Example:

table_partition_sensor = BigQueryTablePartitionExistenceSensor(
    project_id="{{ project_id }}",
    task_id="bq_check_table_partition",
    dataset_id="{{ dataset }}",
    table_id="comments_partitioned",
    partition_id="{{ ds_nodash }}",
    mode="reschedule",
    poke_interval=60,
    timeout=60 * 5
)

28. Offload processing to external services (BigQuery, Dataproc, Cloud Functions, etc.) to minimize load on the Cloud Composer environment.
a. These services usually have their own Airflow Operators for you to utilize.

29. Do not use sub-DAGs.
a. Sub-DAGs were a feature in older versions of Airflow that allowed users to create reusable groups of tasks within DAGs. However, Airflow 2.0 deprecated sub-DAGs because they caused performance and functional issues.

30. Use Pub/Sub for DAG-to-DAG dependencies.
a. Here is an example for multi-cluster / dag-to-dag dependencies. 
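
The linked example covers the full pattern; as a rough, hypothetical sketch, the upstream DAG could publish a completion message and the downstream DAG could wait on the corresponding subscription with a rescheduling sensor. Operator availability and the exact message format depend on your apache-airflow-providers-google version, and the project, topic, and subscription names are placeholders:

from airflow.providers.google.cloud.operators.pubsub import PubSubPublishMessageOperator
from airflow.providers.google.cloud.sensors.pubsub import PubSubPullSensor

# In the upstream DAG: signal that the run finished.
publish_done = PubSubPublishMessageOperator(
    task_id="publish_dag_a_done",
    project_id="my-project",
    topic="dag_a_complete",
    messages=[{"data": b"dag_a finished"}],
)

# In the downstream DAG: wait for the signal without holding a worker slot between pokes.
wait_for_dag_a = PubSubPullSensor(
    task_id="wait_for_dag_a",
    project_id="my-project",
    subscription="dag_a_complete-sub",
    max_messages=1,
    ack_messages=True,
    mode="reschedule",
    poke_interval=60,
)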

31. Make DAGs load faster.
a. Avoid unnecessary “Top-level” Python code. DAGs with many imports, variables, functions outside of the DAG will introduce greater parse times for the Airflow Scheduler and in turn reduce the performance and scalability of Cloud Composer / Airflow.
b. Moving imports and functions within the DAG can reduce parse time (in the order of seconds).
c. Ensure that developed DAGs do not increase DAG parse times too much.

Example:

import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta  # needed for start_date and schedule_interval

# Define default_args dictionary
default_args = {
    'owner': 'me',
    'start_date': datetime(2022, 11, 17),
}

# Use with statement and DAG context manager to instantiate the DAG
with DAG(
    'my_dag_id',
    default_args=default_args,
    schedule_interval=timedelta(days=1),
) as dag:
    # Import module within DAG block
    import my_module # DO THIS

    # Define function within DAG block
    def greet(): # DO THIS
        greeting = my_module.generate_greeting()
        print(greeting)

    # Use the PythonOperator to execute the function
    greet_task = PythonOperator(
        task_id='greet_task',
        python_callable=greet
    )

Improve Development and Testing

32. Implement “self-checks” (via Sensors or Deferrable Operators).
a. To ensure that tasks are functioning as expected, you can add checks to your DAG. For example, if a task pushes data to a BigQuery partition, you can add a check in the next task to verify that the partition generates and that the data is correct.

Example:

# ------------------------------------------------------------
# Transform source data and transfer to partitioned table
# ------------------------------------------------------------

create_or_replace_partitioned_table_job = BigQueryInsertJobOperator(
    task_id="create_or_replace_comments_partitioned_query_job",
    configuration={
        "query": {
            "query": 'sql/create_or_replace_comments_partitioned.sql',
            "useLegacySql": False,
        }
    },
    location="US",
)

create_or_replace_partitioned_table_job_error = dummy_operator.DummyOperator(
    task_id="create_or_replace_partitioned_table_job_error",
    trigger_rule="one_failed",
)

create_or_replace_partitioned_table_job_ok = dummy_operator.DummyOperator(
    task_id="create_or_replace_partitioned_table_job_ok", trigger_rule="one_success"
)

# ------------------------------------------------------------
# Determine if today's partition exists in comments_partitioned
# ------------------------------------------------------------

table_partition_sensor = BigQueryTablePartitionExistenceSensor(
    project_id="{{ project_id }}",
    task_id="bq_check_table_partition",
    dataset_id="{{ dataset }}",
    table_id="comments_partitioned",
    partition_id="{{ ds_nodash }}",
    mode="reschedule",
    poke_interval=60,
    timeout=60 * 5
)

create_or_replace_partitioned_table_job >> [
    create_or_replace_partitioned_table_job_error,
    create_or_replace_partitioned_table_job_ok,
]
create_or_replace_partitioned_table_job_ok >> table_partition_sensor

33. Look for opportunities to dynamically generate similar tasks/task groups/DAGs via Python code.
a. This can simplify and standardize the development process for DAGs. 

Example:

import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def create_dag(dag_id, default_args, task_1_func, task_2_func):
    with DAG(dag_id, default_args=default_args) as dag:
        task_1 = PythonOperator(
            task_id='task_1',
            python_callable=task_1_func,
            dag=dag
        )
        task_2 = PythonOperator(
            task_id='task_2',
            python_callable=task_2_func,
            dag=dag
        )
        task_1 >> task_2
    return dag

def task_1_func():
    print("Executing task 1")

def task_2_func():
    print("Executing task 2")

default_args = {
    'owner': 'me',
    'start_date': airflow.utils.dates.days_ago(2),
}

my_dag_id = create_dag(
    dag_id='my_dag_id',
    default_args=default_args,
    task_1_func=task_1_func,
    task_2_func=task_2_func
)

34. Implement unit-testing for your DAGs

Example:

from airflow import models
from airflow.utils.dag_cycle_tester import test_cycle


def assert_has_valid_dag(module):
    """Assert that a module contains a valid DAG."""

    no_dag_found = True

    for dag in vars(module).values():
        if isinstance(dag, models.DAG):
            no_dag_found = False
            test_cycle(dag)  # Throws if a task cycle is found.

    if no_dag_found:
        raise AssertionError('module does not contain a valid DAG')

35. Perform local development via the Composer Local Development CLI Tool.
a. Composer Local Development CLI tool streamlines Apache Airflow DAG development for Cloud Composer 2 by running an Airflow environment locally. This local Airflow environment uses an image of a specific Cloud Composer version.

36. If possible, keep a staging Cloud Composer Environment to fully test the complete DAG run before deploying in the production.
a. Parameterize your DAG to change variables such as the output path of a Google Cloud Storage operation or the database used to read the configuration. Do not hard-code values inside the DAG and then change them manually according to the environment.
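
For example, a minimal sketch that derives environment-specific values from a single environment variable set per Composer environment; the variable, bucket, and dataset names are illustrative:

import os

DEPLOY_ENV = os.environ.get("DEPLOY_ENV", "staging")  # e.g. "staging" or "prod"

# Environment-specific values derived once, never hard-coded in the DAG body.
OUTPUT_PATH = f"gs://my-company-{DEPLOY_ENV}-exports/"
BQ_DATASET = f"analytics_{DEPLOY_ENV}"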

37. Use a Python linting tool such as Pylint or Flake8 for standardized code.

38. Use a Python formatting tool such as Black or YAPF for standardized code.

Next Steps

In summary, this blog provides a comprehensive checklist of best practices for developing Airflow DAGs for use in Google Cloud Composer. By following these best practices, developers can help ensure that Cloud Composer is working optimally and that their DAGs are well-organized and easy to manage.

For more information about Cloud Composer, check out the following related blog posts and documentation pages:

What is Cloud Composer? 

Deutsche Bank uses Cloud Composer workload automation

Using Cloud Build to keep Airflow Operators up-to-date in your Composer environment

Writing DAGs (workflows) | Cloud Composer

Source: Data Analytics

Built with BigQuery: How Tamr delivers Master Data Management at scale and what this means for a data product strategy

Master data is a holistic view of your key business entities, providing a consistent set of identifiers and attributes that give context to the business data that matters most to your organization. It’s about ensuring that clean, accurate, curated data – the best available – is accessible throughout the company to manage operations and make critical business decisions. Having well-defined master data is essential to running your business operations. 

Master data undergoes a far more enriched and refined  process than other types of data captured across the organization. For instance, it’s not the same as the transactional data generated by applications. Instead, master data gives context to the transaction itself by providing the fundamental business objects – like the Customer, Product, Patient, or Supplier – on which the transactions are performed. 

Without master data, enterprise applications are left with potentially inconsistent data living in disparate systems, with an unclear picture of whether multiple records are related. And without it, gaining essential business insight may be difficult, if not impossible, to attain: for example, “which customers generate the most revenue?” or “which suppliers do we do the most business with?”

Master data is a critical element of treating data as an enterprise asset and as a product. A data product strategy requires that the data remain clean, integrated, and freshly updated with appropriate frequency. Without this additional preparation and enrichment, data becomes stale and incomplete, leading to an inability to provide the necessary insights for timely business decisions. Data preparation, consolidation, and enrichment should be part of a data product strategy, since consolidating a complete set of external data sources provides more complete and accurate insights for business decisions. This work requires the right infrastructure, tools, and processes; otherwise it becomes an additional burden on already thinly stretched data management teams.

This is why it is necessary to adopt and implement a next-generation master data management platform that enables a data product strategy to be operationalized. This in turn enables the acquisition of trusted records to drive business outcomes. 

The Challenge: A Single Source of Truth – The Unified “Golden” Record 

Many companies have built or are working on rolling out data lakes, lakehouses, data marts, or data warehouses to address data integration challenges. However, when multiple data sets from disparate sources are combined, there is a high likelihood of introducing problems, which Tamr and Google Cloud are partnering to address and alleviate:

Data duplication: the same semantic/physical entity, such as a customer, appearing with different keys

Inconsistency: the same entity having partial and/or mismatching properties (like different phone numbers or addresses for the same customer)

Reduced insight accuracy: duplicates skew key analytic figures (for example, the count of distinct customers is higher with duplicates than without them)

Timeliness impact: manual efforts to reach a consistent, rationalized core set of data entities used for application input and analytics cause significant delays in processing and, ultimately, decision making

Solution

Tamr is the leader in data mastering and next-generation master data management, delivering data products that provide clean, consolidated, and curated data to help businesses stay ahead in a rapidly changing world. Organizations benefit from Tamr's integrated, turn-key solution that combines machine learning with humans-in-the-loop, a low-code/no-code environment, and integrated data enrichment to streamline operations. The outcome is higher-quality data, delivered faster and with less manual work.

Tamr takes multiple source records, identifies duplicates, enriches data, assigns a unique ID, and provides a unified, mastered “golden record” while maintaining all source information for analysis and review. Once cleansed, data can be utilized in the downstream analytics and applications, enabling more informed decisions.

A successful data product strategy requires consistently cleaning and integrating data, a task that's ideal for data mastering. Machine learning-based capabilities in a data mastering platform can handle increases in data volume and variety, as well as data enrichment, to ensure that the data stays fresh and accurate so it can be trusted by business consumers.

With accurate key entity data, companies can unlock the bigger picture of data insights. The term “key” signifies entities that are most important to an organization. For example, for healthcare organizations, this could mean patients and providers; for manufacturers, it could mean suppliers;  for financial services firms, it could mean customers. 

Below are examples of key business entities after they’ve been cleaned, enriched, and curated with Tamr:

Better Together: How Tamr leverages Google Cloud to differentiate their next-gen MDM

Tamr Mastering, a template-based SaaS MDM solution, is built on Google Cloud Platform technologies such as Cloud Dataproc, Cloud Bigtable and BigQuery, allowing customers to scale modern data pipelines with excellent performance while controlling costs.

The control plane (application layer) is built on Google Compute Engine (GCE) to leverage its scalability. The data plane utilizes a full suite of interconnected Google Cloud Platform services such as Google Dataproc for distributed processing, allowing for a flexible and sustainable way to bridge the gap between the analytics powers of distributed TensorFlow and the scaling capabilities of Hadoop in a managed offering. Google Cloud Storage is used for data movement/staging. 

Google Cloud Run, which enables Tamr to deploy containers directly on top of Google's scalable infrastructure, is used in the data enrichment process. This approach allows serverless deployments without the need to create a stateful cluster or manage infrastructure in order to be productive with container deployments. Google Bigtable is used for data-scale storage, providing high throughput and scalability for key/value data. Data that doesn't fall into the key/value lookup schema is retrieved in batches or used for analytical purposes. Google BigQuery is the ideal storage for this type of data, as well as for the golden copy of the data discussed earlier in this blog post. Additionally, Tamr chose BigQuery as its central data storage solution because BigQuery's native support for nested and repeated fields allows denormalized data storage and increases query performance.
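
As a rough illustration of that last point (a hypothetical schema, not Tamr's actual data model), nested and repeated fields let a single BigQuery row carry a denormalized golden record together with all of its source-system identifiers:

from google.cloud import bigquery

schema = [
    bigquery.SchemaField("golden_record_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("customer_name", "STRING"),
    # Repeated RECORD field: every contributing source record stays in the same row.
    bigquery.SchemaField(
        "source_records", "RECORD", mode="REPEATED",
        fields=[
            bigquery.SchemaField("source_system", "STRING"),
            bigquery.SchemaField("source_key", "STRING"),
        ],
    ),
]

client = bigquery.Client()
table = bigquery.Table("my-project.mdm.customers_golden", schema=schema)
client.create_table(table, exists_ok=True)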

On top of that, Tamr Mastering utilizes Cloud IAM for access control, authn/authz, configuration and observability. Deploying across the Google framework provides key advantages such as better performance due to higher bandwidth, lower management overhead, and autoscaling and resource adjustment, among other value drivers, all resulting in lower TCO.

The architecture above illustrates the different layers of functionality, starting from the front-end deployment at the top down to the core layers at the bottom of the diagram. To scale the overall MDM architecture depicted in the diagram efficiently, Tamr has partnered with Google Cloud to focus on three core capabilities:

Capability One: Machine learning optimized for scale and accuracy

Traditionally, organizing and mastering data in most organizations’ legacy infrastructure has been done using a rules-based approach (if <condition> then <action>). Conventional rules-based systems can be effective on a small scale, relying on human-built logic implemented in the rules to generate master records. However, such rules fail to scale when tasked with connecting and reconciling large amounts of highly variable data. 

Machine learning, on the other hand, becomes more efficient at matching records across datasets as more data is added. In fact, huge amounts of data (more than 1 million records across dozens of systems) provide more signal, so  the machine learning models are able to identify patterns, matches, and relationships, accelerating years of human effort down to days. Google’s high performance per core on Compute Engine, high network throughput and lower provisioning times across both storage and compute are all differentiating factors in Tamr’s optimized machine learning architecture on Google Cloud.

Capability Two: Ensure there is sufficient human input

While machine learning is critical, so is keeping humans in the loop and letting them provide feedback. Engaging business users and subject matter experts is key to building trust in the data. A middle ground where machines take the lead and humans provide guidance and feedback to make the machine – and the results – better is the data mastering approach that delivers the best outcomes. Not only will human input improve machine learning models, but it will also foster tighter alignment between the data and business outcomes that require curated data. 

Capability Three: Enrichment built in the workflow

As a final step in the process, data enrichment integrates internal data assets with external data to increase the value of these assets. It adds additional relevant or missing information so that the data is more complete – and thus more usable. Enriching data improves its quality, making it a more valuable asset to an organization. Combining data enrichment with data mastering means that not only are data sources automatically cleaned, they are also enhanced with valuable commercial information while avoiding the incredibly time-consuming and manual work that goes into consolidating or stitching internal data with  external data.

Below is an example of how these three core capabilities are incorporated into the Tamr MDM architecture:

Building the data foundation for connected customer experiences at P360

When a major pharmaceutical company approached P360 for help with a digital transformation project aimed at better reaching the medical providers they count as customers, P360 realized that building a solid data foundation with a modern master data management (MDM) solution was the first step. 

“One of the customer’s challenges was master data management, which was the core component of rebuilding their data infrastructure. Everything revolves around data so not having a solid data infrastructure is a non-starter. Without it, you can’t compete, you can’t understand your customers and how they use your products,” said Anupam Nandwana, CEO of P360, a technology solutions provider for the pharmaceutical industry.

To develop that foundation of trusted data, P360 turned to Tamr Mastering. By using Tamr Mastering, the pharmaceutical company is quickly unifying internal and external data on millions of healthcare providers to create golden records that power downstream applications, including a new CRM system. Like other business-to-business companies, P360’s customer has diverse and expansive data from a variety of sources. From internal data like physician names and addresses to external data like prescription histories and claims information, this top pharmaceutical company has 150 data sources to master in order to get complete views of its customers, including records on 1 million healthcare providers (as well as 2 million provider addresses) and on more than 100,000 healthcare organizations.

“For the modern data platform, cloud is the only answer. To provide the scale, flexibility and speed that’s needed, it’s just not pragmatic to leverage other infrastructure. The cloud gives us the opportunity to do things faster. Completing this project in a short amount of time was a key criteria for success and that would have only been possible with the cloud. Using it was an easy decision,” Nandwana said. 

With Tamr Mastering, P360 helped their customer master millions of provider records in weeks and create golden records containing unique customer IDs as a consistent identifier and single source of truth. 

Conclusion

Google’s data cloud provides a complete platform for building data-driven applications like Tamr’s MDM solution on Google Cloud. From simplified data ingestion, processing, and storage to powerful analytics, AI, ML, and data sharing capabilities, these services are integrated into the open, secure, and sustainable Google Cloud platform. With a diverse partner ecosystem, open-source tools, and APIs, Google Cloud gives technology companies a platform with the portability and differentiators they need to build their products and serve the next generation of customers.

Learn more about Tamr on Google Cloud. 

Learn more about Google Cloud’s Built with BigQuery initiative.

We thank the Google Cloud team member who co-authored the blog: Christian Williams, Principal Architect, Cloud Partner Engineering

Source : Data Analytics Read More

7 Mistakes to Avoid When Using Machine Learning for SEO

7 Mistakes to Avoid When Using Machine Learning for SEO

Companies around the world are projected to spend over $300 billion on machine learning technology by 2030. There are a growing number of reasons that companies are investing in machine learning, but digital marketing is at the top of the list. SEO, in particular, relies more heavily on machine learning these days.

More Companies Are Discovering the Benefits of Using Machine Learning for SEO

A few years ago, we wrote an article pointing out that machine learning is rewriting the rules of SEO. The shift towards using machine learning for SEO and other marketing tasks has accelerated recently.

Almost every business now relies on digital marketing to promote its products; in 2023 it is often the most effective way to sell. Traditional marketing alone no longer delivers the same reach, which is why businesses have shifted toward digital channels and increasingly partner with digital marketing agencies that offer SEO consulting services.

Digital marketing is a broad field with many monotonous tasks. Email marketing, for example, requires the marketer to perform the same tasks repeatedly.

This is where machine learning and AI come into play in digital marketing. Machine learning is a branch of artificial intelligence that excels at automating monotonous tasks.

The popularity of machine learning is growing rapidly in digital marketing. You might have heard about ChatGPT, an AI writing tool that is already helping many marketers.

Machine learning is helping marketers, but they could get far more out of it. Marketers commonly make mistakes that keep them from reaching its full potential. Here are some common mistakes to avoid when using machine learning to boost SEO.

Mistakes to Avoid When Using Machine Learning

Machine learning and AI can make SEO simpler and faster. However, you have to avoid the following mistakes.

Not Understanding the Data

Data is an important factor in both digital marketing and machine learning. When the two are combined, you need to provide proper data for the system to work properly. You should also collect data on how the AI performs on a given task; with that data, marketers can optimize accordingly and improve performance. Not understanding the data sharply limits what the AI can do.

Not Staying Updated on the Technology

Like digital marketing, machine learning is a technology that evolves almost daily. It changes constantly, and you need to stay on top of it. Failing to keep up with the technology will cost you dearly, and using the wrong technology hurts not just the marketer but also their clients.

Relying too Much on Machine Learning

Yes, machine learning and AI are great, but they are not free from errors. Take ChatGPT, for example: it is a useful tool and often produces good content, but it sometimes provides wrong information or plagiarized content. Using content from ChatGPT without reviewing it will hurt a website’s SEO.

Not Having the Right Resources

Machine learning tools are only effective when you know how to use them, so solid knowledge of a specific tool is important. Before incorporating machine learning into the company, give employees proper training so they can use the tools effectively.

Ethical Issues

Some people use AI tools for unethical purposes, which can hurt a website’s ranking heavily in the long run. Using AI tools like Quillbot to spin content from another website and posting it on your own site while claiming it is unique will not help you in any way.

Not Having a Clear Goal

Before using any tool, you need to set a clear goal. Having a clear goal is important for using AI tools effectively. When you have a goal, you can measure the performance of the tool and optimize it accordingly to reach the goal. If you don’t set a goal, even the best tools will not be able to help you.

Not Monitoring the Results

When you are using machine learning for a task, you need to pay close attention to the results. In the early stages you will encounter many errors, and you need to fix them. Not paying attention will hurt your chances of boosting your SEO.

Final Thoughts

Machine learning in digital marketing can be revolutionary, but only when you really know how to take advantage of it. It is important to study and research machine learning before using it in your business. With adequate knowledge, you can easily avoid these common mistakes and use the technology effectively.

The post 7 Mistakes to Avoid When Using Machine Learning for SEO appeared first on SmartData Collective.

Source : SmartData Collective Read More

Optimizing Cost with DevOps on the Cloud

Optimizing Cost with DevOps on the Cloud

Are you looking for a way to reduce the cost of your development efforts? DevOps on the cloud could be the answer! DevOps is an increasingly popular approach that combines software development and operations, allowing developers and IT professionals to work together more efficiently. With DevOps on the cloud, businesses can take advantage of faster, more flexible computing environments without having to invest in expensive hardware and software.

The cost savings associated with DevOps on the cloud are significant. By leveraging existing cloud-based services, businesses can save time and money by avoiding costly IT overhead like maintenance, licensing fees, and server infrastructure setup. They also benefit from increased agility thanks to shorter release cycles. And because servers can be scaled up or down as needed in the cloud environment, businesses only pay for what they need when they need it.

Many cloud providers in the IT industry support DevOps, including Amazon Web Services (AWS), Microsoft Azure, and Google Cloud.

Reasons for Cost Optimization

Cost optimization is an important part of any organization’s DevOps strategy. By optimizing costs, organizations can maximize their profits and keep up with the ever-changing business landscape. But what are some of the reasons why DevOps teams should consider cost optimization?

One key reason for cost optimization in a DevOps environment is to reduce operational expenses. Costly manual processes can be automated, allowing organizations to increase efficiency while also reducing their overall spending. Additionally, teams can look at consolidating services and resources, as well as other operational activities that may be eating away at their budget. Automation tools and cloud technology platforms such as Amazon Web Services (AWS) provide organizations with the ability to optimize their costs without sacrificing quality or performance.

The DevOps model of software development has revolutionized the way we create, distribute, and deploy software. It’s a model that merges development and operations teams and allows for continuous delivery of complex applications. But cost optimization is key in this digital age and that’s why more companies are turning to outsourcing developers for their projects.

Outsourcing developers can be more cost-effective than managing an in-house team due to lower overhead costs such as equipment, licenses, insurance, and office space. Plus, you have access to a global pool of talent, which means you can find the best person for the job regardless of location. Outsourcing also shortens project timelines since it eliminates tedious tasks like onboarding new employees or waiting for existing employees to learn new technologies.

Benefits of DevOps on the Cloud

The possibilities of DevOps on the cloud are endless. This transformative technology can help businesses become more agile, efficient, and cost-effective. With DevOps on the cloud, organizations can benefit from faster software development cycles and gain real-time insights into customer behavior and market trends.

DevOps eliminates manual processes associated with infrastructure and application deployments. This allows teams to focus their efforts on creating innovative solutions while streamlining IT operations across multiple clouds or hybrid environments. It also creates a platform for collaboration between developers, operations staff, and other stakeholders as they work together to create applications that are secure, reliable, scalable, and cost-effective. With DevOps on the cloud, organizations have full control over their development lifecycle from end-to-end – no matter how complex or distributed it may be!

Implementing Automation for Efficiency

Automation can dramatically improve the efficiency of many tasks and processes, from providing software updates to running tests and deploying code. It can even help teams better manage complex projects, making it a great solution for businesses looking to increase their efficiency.

Automation helps reduce the amount of manual labor required to complete a task or process, as well as eliminating human error from the equation. This allows teams to work faster and smarter while using fewer resources overall. Additionally, automated processes are typically more reliable since they don’t have any potential for miscommunication or misinterpretation that might occur when humans are involved in the process. Automating repetitive tasks can also free up valuable time for team members, allowing them to focus on more complex tasks that require higher-level thinking and creativity.

Security Measures in DevOps on the Cloud

The integration of DevOps and the cloud is changing the way companies operate their software development processes. With that much power comes great responsibility, and security measures need to be taken to ensure that all data is safe.

Companies can take advantage of the benefits of DevOps while maintaining high levels of security with several strategies: strong authentication protocols such as two-factor authentication, encryption for data both in transit and at rest, rigorous access control methods for verifying users, logging of system activity for auditing purposes, and active monitoring by security staff from the organization or from a third-party company providing DevOps services on the cloud. By following these guidelines, companies can make sure their data remains secure in an ever-changing technological environment.

Conclusion

Cloud computing has revolutionized the way businesses operate by providing access to cost-effective solutions. DevOps, an approach to software development that brings together operations and development teams, is one of the most efficient solutions for businesses that are looking for cost savings.

The combination of DevOps on the cloud can help companies save costs in several ways. First, it reduces overhead infrastructure and hardware expenses because companies no longer need to invest in their own server farms or related hardware. Second, it reduces IT labor costs since fewer personnel are needed to manage cloud operations and applications. Third, it minimizes software licensing fees since users only pay for what they use. Finally, DevOps on the cloud ensures fast scalability which means businesses can easily add more resources as needed without having to invest in new hardware or infrastructure right away.

The post Optimizing Cost with DevOps on the Cloud appeared first on SmartData Collective.

Source : SmartData Collective Read More

How to do multivariate time series forecasting in BigQuery ML

How to do multivariate time series forecasting in BigQuery ML

Companies across industries rely heavily on time series forecasting to project product demand, forecast sales, project online subscription/cancellation, and for many other use cases. This makes time series forecasting one of the most popular models in BigQuery ML. 

What is multivariate time series forecasting? It is forecasting that uses external covariates in addition to the target metric. For example, if you want to forecast ice cream sales, it helps to forecast using the external covariate “weather” along with the target metric “past sales.” Multivariate time series forecasting in BigQuery lets you create more accurate forecasting models without having to move data out of BigQuery.

When it comes to time series forecasting, covariates or features besides the target time series are often used to provide better forecasting. Up until now, BigQuery ML has only supported univariate time series modeling using the ARIMA_PLUS model (documentation). It is one of the most popular BigQuery ML models.

While ARIMA_PLUS is widely used, forecasting using only the target variable is sometimes not sufficient: some patterns inside the time series depend strongly on other features. We have seen strong customer demand for multivariate time series forecasting support that lets you forecast using covariates and features.

We recently announced the public preview of multivariate time series forecasting with external regressors. We are introducing a new model type ARIMA_PLUS_XREG, where the XREG refers to external regressors or side features. You can use the SELECT statement to choose side features with the target time series. This new model leverages the BigQuery ML linear regression model to include the side features and the BigQuery ML ARIMA_PLUS model to model the linear regression residuals.
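
Conceptually (a simplified sketch of the decomposition described above, not the exact internal formulation), the model can be written as

    y_t = β·x_t + e_t

where x_t are the side features selected alongside the target, β·x_t is the linear regression component, and e_t is the residual series modeled by ARIMA_PLUS. A forecast is then the sum of the regression prediction on the supplied future covariates and the ARIMA_PLUS forecast of the residuals.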

The ARIMA_PLUS_XREG model supports the following capabilities: 

Automatic feature engineering for numerical, categorical, and array features.

All the model capabilities of the ARIMA_PLUS model, such as detecting seasonal trends, holidays, etc.

Headlight, an AI-powered ad agency, is using a multivariate forecasting model to determine conversion volumes for down-funnel metrics like subscriptions, cancellations, etc. based on cohort age. You can check out the customer video and demo here.

The following sections show an example of the new ARIMA_PLUS_XREG model in BigQuery ML. We explore the bigquery-public-data.epa_historical_air_quality dataset, which has daily air quality and weather information, and use the model to forecast PM2.5¹ based on its historical data and some covariates, such as temperature and wind speed.

An example: forecast Seattle’s air quality with weather information

Step 1. Create the dataset

The PM2.5, temperature, and wind speed data are in separate tables. To simplify the queries, join those tables into a new table, “bqml_test.seattle_air_quality_daily,” with the following columns:

date: the date of the observation

pm25: the average PM2.5 value for each day

wind_speed: the average wind speed for each day

temperature: the highest temperature for each day

The new table has daily data from 2009-08-11 to 2022-01-31.

CREATE TABLE `bqml_test.seattle_air_quality_daily`
AS
WITH
  pm25_daily AS (
    SELECT
      avg(arithmetic_mean) AS pm25, date_local AS date
    FROM
      `bigquery-public-data.epa_historical_air_quality.pm25_nonfrm_daily_summary`
    WHERE
      city_name = 'Seattle'
      AND parameter_name = 'Acceptable PM2.5 AQI & Speciation Mass'
    GROUP BY date_local
  ),
  wind_speed_daily AS (
    SELECT
      avg(arithmetic_mean) AS wind_speed, date_local AS date
    FROM
      `bigquery-public-data.epa_historical_air_quality.wind_daily_summary`
    WHERE
      city_name = 'Seattle' AND parameter_name = 'Wind Speed - Resultant'
    GROUP BY date_local
  ),
  temperature_daily AS (
    SELECT
      avg(first_max_value) AS temperature, date_local AS date
    FROM
      `bigquery-public-data.epa_historical_air_quality.temperature_daily_summary`
    WHERE
      city_name = 'Seattle' AND parameter_name = 'Outdoor Temperature'
    GROUP BY date_local
  )
SELECT
  pm25_daily.date AS date, pm25, wind_speed, temperature
FROM pm25_daily
JOIN wind_speed_daily USING (date)
JOIN temperature_daily USING (date)

Here is a preview of the data:
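
To reproduce this preview yourself, a quick query against the joined table (using the table name created above) looks like this:

-- Peek at the first few rows of the joined daily table.
SELECT
  date,
  pm25,
  wind_speed,
  temperature
FROM
  `bqml_test.seattle_air_quality_daily`
ORDER BY date
LIMIT 5;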

Step 2. Create Model

The “CREATE MODEL” query for the new multivariate model, ARIMA_PLUS_XREG, is very similar to that of the current ARIMA_PLUS model. The major differences are the MODEL_TYPE and the inclusion of feature columns in the SELECT statement.

CREATE OR REPLACE MODEL
  `bqml_test.seattle_pm25_xreg_model`
OPTIONS (
  MODEL_TYPE = 'ARIMA_PLUS_XREG',
  time_series_timestamp_col = 'date',
  time_series_data_col = 'pm25')
AS
SELECT
  date,
  pm25,
  temperature,
  wind_speed
FROM
  `bqml_test.seattle_air_quality_daily`
WHERE
  date BETWEEN DATE('2012-01-01') AND DATE('2020-12-31')

Step 3. Forecast the future data

With the created model, you can use the ML.FORECAST function to forecast future data. Compared to the ARIMA_PLUS model, you additionally have to provide the future values of the covariates as input.

SELECT
  *
FROM
  ML.FORECAST(
    MODEL `bqml_test.seattle_pm25_xreg_model`,
    STRUCT(30 AS horizon),
    (
      SELECT
        date,
        temperature,
        wind_speed
      FROM
        `bqml_test.seattle_air_quality_daily`
      WHERE
        date > DATE('2020-12-31')
    ))

After running the above query, you can see the forecasting results:

Step 4. Evaluate the model

You can use the ML.EVALUATE function to evaluate the forecasting errors. Set perform_aggregation to TRUE to get an aggregated error metric, or to FALSE to see the per-timestamp errors.

SELECT
  *
FROM
  ML.EVALUATE(
    MODEL `bqml_test.seattle_pm25_xreg_model`,
    (
      SELECT
        date,
        pm25,
        temperature,
        wind_speed
      FROM
        `bqml_test.seattle_air_quality_daily`
      WHERE
        date > DATE('2020-12-31')
    ),
    STRUCT(
      TRUE AS perform_aggregation,
      30 AS horizon))

The evaluation result of ARIMA_PLUS_XREG is as follows:

As a comparison, we also show the univariate forecasting ARIMA_PLUS result in the following table:

Compared to ARIMA_PLUS, ARIMA_PLUS_XREG performs better on all measured metrics on this specific dataset and date range.
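
If you want to reproduce the univariate baseline yourself, the sketch below shows one way to do it (the baseline model name is our own choice; otherwise it reuses the table and date range from the example). The baseline uses only the target column, and the same ML.EVALUATE call produces comparable error metrics:

-- Univariate baseline: same target and date range, no covariates.
CREATE OR REPLACE MODEL `bqml_test.seattle_pm25_baseline_model`
OPTIONS (
  MODEL_TYPE = 'ARIMA_PLUS',
  time_series_timestamp_col = 'date',
  time_series_data_col = 'pm25')
AS
SELECT
  date,
  pm25
FROM
  `bqml_test.seattle_air_quality_daily`
WHERE
  date BETWEEN DATE('2012-01-01') AND DATE('2020-12-31');

-- Evaluate over the same 30-day horizon to compare error metrics.
SELECT
  *
FROM
  ML.EVALUATE(
    MODEL `bqml_test.seattle_pm25_baseline_model`,
    (
      SELECT date, pm25
      FROM `bqml_test.seattle_air_quality_daily`
      WHERE date > DATE('2020-12-31')
    ),
    STRUCT(TRUE AS perform_aggregation, 30 AS horizon));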

Conclusion

In the example above, we demonstrated how to create a multivariate time series forecasting model, forecast future values with it, and evaluate the results. The ML.ARIMA_EVALUATE and ML.ARIMA_COEFFICIENTS table-valued functions are also helpful for investigating your model; a minimal sketch of both calls follows the list below. Based on user feedback, the model improves productivity in the following ways:

It shortens the time spent preprocessing data and lets users keep their data in BigQuery when doing machine learning. 

It reduces overhead for users who know SQL and want to do machine learning work in BigQuery.
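
As referenced above, both inspection functions can be called directly on the fitted model. This is a minimal sketch using the model name from this example; the exact output columns depend on the model type:

-- Inspect ARIMA model statistics such as AIC, variance, and detected seasonal periods.
SELECT *
FROM ML.ARIMA_EVALUATE(MODEL `bqml_test.seattle_pm25_xreg_model`);

-- Inspect the fitted coefficients, including those of the external regressors.
SELECT *
FROM ML.ARIMA_COEFFICIENTS(MODEL `bqml_test.seattle_pm25_xreg_model`);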

For more information about the ARIMA_PLUS_XREG model, please see the documentation here.

What’s Next?

In this blog post, we described the BigQuery ML multivariate time series forecasting model, which is now available in public preview. We also showed a code demo that data scientists, data engineers, and data analysts can use to enable multivariate time series forecasting.

The following features are coming soon:

Large-scale multivariate time series, i.e., training millions of models for millions of multivariate time series in a single CREATE MODEL statement

Multivariate time series anomaly detection

Thanks to Xi Cheng, Honglin Zheng, Jiashang Liu, Amir Hormati, Mingge Deng and Abhinav Khushraj from the BigQuery ML team. Also thanks to Weijie Shen from the Google Resource Efficiency Data Science team.

1. A measure of air pollution from fine particulate matter

Source : Data Analytics Read More