Carbon Health transforms operating outcomes with Connected Sheets for Looker

Everyone wants affordable, quality healthcare, but not everyone has it. A 2021 report by the Commonwealth Fund ranked the U.S. in last place among 11 high-income countries in healthcare access.1 Carbon Health is working to change that. We are doing so by combining the best of virtual care, in-person visits, and technology to support patients with their everyday physical and mental health needs.

Rethinking how data and analytics are accessed at Carbon Health 

Delivering premium healthcare for the masses that’s accessible and affordable is an ambitious undertaking. It requires a commitment to operating the business in an efficient and disciplined way. To meet our goals, our teams across the company require detailed, daily insights into operating results.

In the last year, we realized our existing BI platform was inaccessible to most of our employees outside of R&D. Creating the analytics, dashboards, and reports needed by our clinic leaders and executives required direct help from our data scientists. 

However, this has all changed since deploying Looker as our new BI platform. We initially used Looker to build tables, charts, and graphs that improved how people could access and analyze data about our operating efficiency. As we continued to evaluate how our data and analytics should be experienced by our in-clinic staff, we learned about Connected Sheets for Looker, which has unlocked an entirely new way of sharing insights across the company.

A new way to deliver performance reporting and drive results

Connected Sheets for Looker gives Carbon Health employees who work in Google Sheets—practically everyone—a familiar tool for working with Looker data. For instance, one of our first outputs using the Connected Sheets integration has been a daily and weekly performance push-report for the clinic’s operating leaders, including providers. 

Essentially a scorecard, the report tracks the most important KPIs for measuring clinics’ successes, including appointment volume, patient satisfaction metrics such as net promoter score (NPS) and reviews, phone call answer rates, and even metrics about billing and collections. To provide easy access, we built a workflow with Google Apps Script that takes our daily performance report and automatically emails a PDF to key clinic leaders each morning.

Within the first 30 days of the report’s creation, clinic leaders were able to drive noticeable improvements in operating results. For instance, actively tracking clinic volume has enabled us to manage our schedules more effectively, which in turn drives more visits and enables us to better communicate expectations with our patients. Other clinics have dramatically improved their call answer rates by tracking inbound call volume, which has also led to better patient satisfaction. 

Greater accountability, greater collaboration

As you can imagine, a report that holds people accountable for outcomes in such a visible way can create some anxiety. We’ve eased those concerns by using the information constructively, with the goal to use reporting as a positive feedback mechanism to bolster open collaboration and identify operational processes that need improvement. For example, data about our call answer rates initiated an investigation that led to an operational redesign of how phones are deployed and managed at more than 120 clinics across the U.S.

Looker as a scalable solution with endless applications

We’re now rolling out Connected Sheets for Looker to deliver performance push-reporting across all teams at Carbon Health. Additionally, we continue to find new ways to leverage Connected Sheets for Looker to meet other needs of the business. 

For instance, we’ve recently been able to better understand our software costs by analyzing vendor spend from our accounting systems directly in Google Sheets. Going forward, this will allow us to build a basic workflow to monitor subscription spend and employee application usage, which will lead to us saving money on unnecessary licenses and underutilized software. 

We’ve come a long way in the last year. Between Looker and its integration with Google Sheets, we can meet the data needs of all our stakeholders at Carbon Health. Connected Sheets for Looker has been an impactful solution that’s going to help us drive measurable results in how we deliver premium healthcare to the masses.

1. Mirror, Mirror 2021: Reflecting Poorly
2. Meet The Immigrant Entrepreneurs Who Raised $350 Million To Rethink U.S. Primary Care

Source : Data Analytics Read More

How StreamNative facilitates integrated use of Apache Pulsar through Google Cloud

StreamNative, a company founded by the original developers of Apache Pulsar and Apache BookKeeper, is partnering with Google Cloud to build a streaming platform on open source technologies. We are dedicated to helping businesses generate maximum value from their enterprise data by offering effortless ways to realize real-time data streaming. Following the release of StreamNative Cloud in August 2020, which provides scalable and reliable Pulsar-Cluster-as-a-Service, we introduced StreamNative Cloud for Kafka to enable a seamless switch between the Kafka API and Pulsar. We then launched StreamNative Platform to support global event streaming data platforms in multi-cloud and hybrid-cloud environments.

By leveraging our fully managed Pulsar infrastructure services, our enterprise customers can easily build event-driven applications with Apache Pulsar and get real-time value from their data. There are solid reasons why Apache Pulsar has become one of the most popular messaging platforms in modern cloud environments, and we strongly believe in its ability to simplify building complex event-driven applications. The most prominent benefits of using Apache Pulsar to manage real-time events include:

Single API: Building a complex event-driven application traditionally requires linking multiple systems to support queuing, streaming, and table semantics. Apache Pulsar frees developers from the headache of managing multiple APIs by offering a single API that supports all messaging-related workloads.

Multi-tenancy: With the built-in multi-tenancy feature, Apache Pulsar enables secure data sharing across different departments with one global cluster. This architecture not only helps reduce infrastructure costs, but also avoids data silos (see the admin sketch after this list).

Simplified application architecture: Pulsar clusters can scale to millions of topics while delivering consistent performance, which means that developers don’t have to restructure their applications when the number of topic-partitions surpasses hundreds. The application architecture can therefore be simplified.

Geo-replication: Apache Pulsar supports both synchronous and asynchronous geo-replication out-of-the-box, which makes building event-driven applications in multi-cloud and hybrid-cloud environments very easy.
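To make the multi-tenancy model concrete, here is a small, hedged sketch of how departments might be separated into tenants and namespaces on one shared cluster using pulsar-admin; the tenant, namespace, cluster, and role names are placeholders:

# A hedged sketch: one tenant and namespace per department on a shared cluster.
# Tenant, namespace, cluster, and role names are placeholders.
bin/pulsar-admin tenants create finance --allowed-clusters my-cluster
bin/pulsar-admin namespaces create finance/payments
bin/pulsar-admin namespaces grant-permission finance/payments \
  --role analytics-service \
  --actions consume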

Facilitating integration between Apache Pulsar and Google Cloud

To allow our customers to fully enjoy the benefits of Apache Pulsar, we’ve been working on expanding the Apache Pulsar ecosystem by improving the integration between Apache Pulsar and powerful cloud platforms like Google Cloud. In mid-2022, we added Google Cloud Pub/Sub Connector for Apache Pulsar, which enables seamless data replication between Pub/Sub and Apache Pulsar, and Google Cloud BigQuery Sink Connector for Apache Pulsar, which synchronizes Pulsar data to BigQuery in real time, to the Apache Pulsar ecosystem.

Google Cloud Pub/Sub Connector for Apache Pulsar uses Pulsar IO components to provide fully featured messaging and streaming between Pub/Sub and Apache Pulsar, each of which has its own distinctive features. Using Pub/Sub and Apache Pulsar together lets developers build comprehensive data streaming capabilities into their applications. Without a connector, however, integrating the two systems requires significant development effort, because data synchronization between different messaging systems depends on the functioning of custom applications: when those applications stop working, message data can no longer be passed on to the other system.

Our connector solves this problem by fully integrating with Pulsar’s system. There are two ways to import and export data between Pub/Sub and Pulsar: the Google Cloud Pub/Sub source feeds data from Pub/Sub topics and writes it to Pulsar topics, while the Google Cloud Pub/Sub sink pulls data from Pulsar topics and persists it to Pub/Sub topics. Using Google Cloud Pub/Sub Connector for Apache Pulsar brings three key advantages (a deployment sketch follows the list):

Code-free integration: No code-writing is needed to move data between Apache Pulsar and Pub/Sub.

High scalability: The connector can be run on both standalone and distributed nodes, which allows developers to build reactive data pipelines in real time to meet operational needs.

Fewer DevOps resources required: The DevOps workload of setting up data synchronization is greatly reduced, freeing resources to be invested in unleashing the value of data.
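As a rough illustration of how the sink side might be deployed as a Pulsar IO connector, the sketch below uses pulsar-admin; the archive file name and the keys passed to --sink-config are assumptions, so check the connector documentation in our GitHub repository for the exact settings your version expects:

# Hypothetical deployment of the Pub/Sub sink as a Pulsar IO connector.
# The .nar file name and --sink-config keys are assumptions.
bin/pulsar-admin sinks create \
  --name pubsub-sink \
  --archive connectors/pulsar-io-google-pubsub.nar \
  --tenant public \
  --namespace default \
  --inputs my-pulsar-topic \
  --sink-config '{"pubsubProjectId": "my-gcp-project", "pubsubTopicId": "my-pubsub-topic"}'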

By using the BigQuery Sink Connector for Apache Pulsar, organizations can write data from Pulsar directly to BigQuery. Previously, developers could only use the Cloud Storage Sink Connector for Pulsar to move data to Cloud Storage and then query the imported data through BigQuery external tables, an approach with many limitations, including low query performance and no support for clustered tables.

Pulling data from Pulsar topics and persisting data to BigQuery tables, our BigQuery sink connector supports real-time data synchronization between Apache Pulsar and BigQuery. Just like our Pub/Sub connector, Google Cloud BigQuery Sink Connector for Apache Pulsar is a low-code solution that supports high scalability and greatly reduces DevOps workloads. Furthermore, our BigQuery connector possesses the Auto Schema feature, which automatically creates and updates BigQuery table structures based on the Pulsar topic schemas to ensure smooth and continuous data synchronization.

Simplifying Pulsar resource management on Kubernetes

All StreamNative products are built on Kubernetes, and we’ve been developing tools that can simplify resource management on Kubernetes platforms like Google Kubernetes Engine (GKE). In August 2022, we introduced Pulsar Resources Operator for Kubernetes, an independent controller that provides automatic full lifecycle management for Pulsar resources on Kubernetes.

Pulsar Resources Operator uses manifest files to manage Pulsar resources, allowing developers to get and edit resource policies through the Topic Custom Resources, which render the full field information of Pulsar policies. It makes Pulsar resource management easier than using command line interface (CLI) tools, because developers no longer need to remember numerous commands and flags to retrieve policy information. Key advantages of using Pulsar Resources Operator for Kubernetes include the following (a hedged manifest sketch follows the list):

Easy creation of Pulsar resources: By applying manifest files, developers can swiftly initialize basic Pulsar resources in their continuous integration (CI) workflows when creating a new Pulsar cluster.

Full integration with Helm: Helm is widely used as a package management tool in cloud-native environments. Pulsar Resource Operator can seamlessly integrate with Helm, which allows developers to manage their Pulsar resources through Helm templates.
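The sketch below shows roughly what such a manifest might look like for a topic; the apiVersion, kind, and field names are assumptions based on the operator's resource naming, so consult the Pulsar Resources Operator documentation for the exact CRD schema:

# Hypothetical PulsarTopic manifest; field names are assumptions based on the
# operator's CRD naming and should be checked against the published schema.
kubectl apply -f - <<EOF
apiVersion: resource.streamnative.io/v1alpha1
kind: PulsarTopic
metadata:
  name: orders-topic
  namespace: pulsar
spec:
  name: persistent://my-tenant/my-namespace/orders
  partitions: 4
  connectionRef:
    name: my-pulsar-connection
EOF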

How you can contribute

With the release of Google Cloud Pub/Sub Connector for Apache Pulsar, Google Cloud BigQuery Sink Connector for Apache Pulsar, and Pulsar Resources Operator for Kubernetes, we have unlocked the application potential of open tools like Apache Pulsar by making them simpler to build with, easier to manage, and more capable. Now, developers can build and run Pulsar clusters more efficiently and maximize the value of their enterprise data.

These three tools are community-driven services and have their source code hosted in the StreamNative GitHub repository. Our team welcomes all types of contributions to the evolution of our tools. We’re always keen to receive feature requests, bug reports, and documentation inquiries through GitHub, email, or Twitter.

Source : Data Analytics Read More

How to build comprehensive customer financial profiles with Elastic Cloud and Google Cloud

Financial institutions have vast amounts of data about their customers. However, many of them struggle to leverage data to their advantage. Data may be sitting in silos or trapped on costly mainframes. Customers may only have access to a limited quantity of data, or service providers may need to search through multiple systems of record to handle a simple customer inquiry. This creates a hazard for providers and a headache for customers. 

Elastic and Google Cloud enable institutions to manage this information. Powerful search tools allow data to be surfaced faster than ever, whether it’s card payments, ACH (Automated Clearing House), wires, bank transfers, real-time payments, or another payment method. This information can be correlated with customer profiles, cash balances, merchant info, purchase history, and other relevant information to support the customer or business objective.

This reference architecture enables these use cases:

1. Offering a great customer experience: Customers expect immediate access to their entire payment history, with the ability to recognize anomalies, not just through digital channels but through omnichannel experiences (e.g., customer service interactions).

2. Customer 360: Real-time dashboards that correlate transaction information across multiple variables, offering the business a better view into their customer base and driving efforts for sales, marketing, and product innovation.

Customer 360: The dashboard above looks at 1.2 billion bank transactions and gives a breakdown of what they are, who executes them, where they go, when, and more. At a glance we can see who our wealthiest customers are, which merchants our customers send the most money to, how many unusual transactions there are (based on transaction frequency and amount), when people spend money, and what kind of spending and income they have.

3. Partnership management: Merchant acceptance is key for payment providers. Having better access to present and historical merchant transactions can enhance relationships or provide leverage in negotiations. With that, banks can create and monetize new services.

4. Cost optimization: Mainframes are not designed for internet-scale access. Alongside the technological limitations, cost becomes a prohibitive factor. While mainframes will not be replaced any time soon, this architecture helps avoid costly mainframe data access when serving new applications.

5. Risk reduction: By standardizing on the Elastic Stack, banks are no longer limited in the number of data sources they can ingest. With this, banks can better respond to call center delays and potential customer-facing impacts like natural disasters. By deploying machine learning and alerting features, banks can detect and stamp out financial fraud before it impacts member accounts.

Fraud detection: The Graph feature of Elastic helped a financial services company to identify additional cards that were linked via phone numbers and amalgamations of the original billing address on file with those two cards. The team realized that several credit unions, not just the original one where the alert originated from, were being scammed by the same fraud ring.

Architecture

The following diagram shows the steps to move data from Mainframe to Google Cloud, process and enrich the data in BigQuery, then provide comprehensive search capabilities through Elastic Cloud.

This architecture includes the following components:

Move Data from Mainframe to Google Cloud

Moving data from IBM z/OS to Google Cloud is straightforward with the Mainframe Connector, by following simple steps and defining configurations. The connector runs in z/OS batch job steps and includes a shell interpreter and JVM-based implementations of gsutil, bq and gcloud command-line utilities. This makes it possible to create and run a complete ELT pipeline from JCL, both for the initial batch data migration and ongoing delta updates.

A typical flow of the connector includes:

Reading the mainframe dataset

Transcoding the dataset to ORC

Uploading ORC file to Cloud Storage

Registering the ORC file as an external table or loading it as a native table

Submitting a query job containing a MERGE DML statement to upsert incremental data into a target table, or a SELECT statement to append to or replace an existing table

Here are the steps to install the BQ Mainframe Connector:

Copy the Mainframe Connector JAR to a Unix filesystem on z/OS

Copy the BQSH JCL procedure to a PDS on z/OS

Edit the BQSH JCL to set site-specific environment variables

Please refer to the BQ Mainframe connector blog for example configuration and commands.
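As an illustrative sketch only (not the connector's exact syntax), the commands below show the shape of a BQSH job step that stages an ORC file, loads it, and merges it into a target table; bucket, dataset, table, and column names are placeholders:

# Illustrative sketch of a job step body: stage, load, and merge.
# Names are placeholders; the connector's own gsutil/bq implementations
# handle transcoding and authentication inside the z/OS batch job.
gsutil cp mydata.orc gs://my-bucket/staging/mydata.orc

bq load --source_format=ORC \
  my_dataset.staging_table \
  gs://my-bucket/staging/mydata.orc

bq query --use_legacy_sql=false '
  MERGE my_dataset.target_table T
  USING my_dataset.staging_table S ON T.id = S.id
  WHEN MATCHED THEN UPDATE SET T.value = S.value
  WHEN NOT MATCHED THEN INSERT (id, value) VALUES (S.id, S.value)'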

Process and Enrich Data in BigQuery

BigQuery is a completely serverless and cost-effective enterprise data warehouse. Its serverless architecture lets you use SQL to query and enrich enterprise-scale data, and its scalable, distributed analysis engine lets you query terabytes in seconds and petabytes in minutes. Integrated BigQuery ML (BQML) and BI Engine enable you to analyze the data and gain business insights.

Ingest Data from BQ to Elastic Cloud

Dataflow is used here to ingest data from BQ to Elastic Cloud. It’s a serverless, fast, and cost-effective stream and batch data processing service. Dataflow provides an Elasticsearch Flex Template which can be easily configured to create the streaming pipeline. This blog from Elastic shows an example on how to configure the template.

Cloud Orchestration from Mainframe

It’s possible to load both BigQuery and Elastic Cloud entirely from a mainframe job, with no need for an external job scheduler.

To launch the Dataflow flex template directly, you can invoke the gcloud dataflow flex-template run command in a z/OS batch job step.

If you require additional actions beyond simply launching the template, you can instead invoke the gcloud pubsub topics publish command in a batch job step after your BigQuery ELT steps are completed, using the --attribute option to include your BigQuery table name and any other template parameters. The Pub/Sub message can be used to trigger any additional actions within your cloud environment.
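For illustration, the two options might look like the following in a batch job step; the template file location, template parameters, topic name, and attribute keys are placeholders rather than values from this architecture:

# Option 1: launch the Dataflow flex template directly from a batch job step.
# The template file location and parameter names are placeholders.
gcloud dataflow flex-template run "bq-to-elastic-$(date +%Y%m%d%H%M%S)" \
  --region us-central1 \
  --template-file-gcs-location gs://my-bucket/templates/my_template.json \
  --parameters inputTableSpec=my-project:my_dataset.my_table

# Option 2: publish a Pub/Sub message after the BigQuery ELT steps complete and
# let a Cloud Build pipeline, Cloud Run service, or GKE service react to it.
gcloud pubsub topics publish mainframe-elt-done \
  --message "ELT complete" \
  --attribute bq_table=my_dataset.my_table,run_date=2023-01-31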

To take action in response to the Pub/Sub message sent from your mainframe job, create a Cloud Build pipeline with a Pub/Sub trigger, and include a pipeline step that uses the gcloud builder to invoke gcloud dataflow flex-template run and launch the template with the parameters copied from the Pub/Sub message. If you need a custom Dataflow template rather than the public template, you can use the git builder to check out your code, followed by the maven builder to compile and launch a custom Dataflow pipeline. Additional pipeline steps can be added for any other actions you require.

The Pub/Sub messages sent from your batch job can also be used to trigger a Cloud Run service or a GKE service via Eventarc, and may also be consumed directly by a Dataflow pipeline or any other application.

Mainframe Capacity Planning

CPU consumption is a major factor in mainframe workload cost. In the basic architecture above, the Mainframe Connector runs on the JVM and executes on zIIP processors. Relative to simply uploading data to Cloud Storage, ORC encoding consumes much more CPU time. When processing large amounts of data, it’s possible to exhaust zIIP capacity and spill workloads onto GP processors. You can apply the following advanced architecture to reduce CPU consumption and avoid increased z/OS processing costs.

Remote Dataset Transcoding on Compute Engine VM

To reduce mainframe CPU consumption, ORC file transcoding can be delegated to a GCE instance. A gRPC service is included with the mainframe connector specifically for this purpose. Instructions for setup can be found in the mainframe connector documentation. Using remote ORC transcoding will significantly reduce CPU usage of the Mainframe Connector batch jobs and is recommended for all production level BigQuery workloads. Multiple instances of the gRPC service can be deployed behind a load balancer and shared by all Mainframe Connector batch jobs.

Transfer Data via FICON and Interconnect

Google Cloud technology partners offer products to enable transfer of mainframe datasets via FICON and 10G ethernet to Cloud Storage. Obtaining a hardware FICON appliance and Interconnect is a practical requirement for workloads that transfer in excess of 500GB daily. This architecture is ideal for integration of z/OS and Google Cloud because it largely eliminates data transfer related CPU utilization concerns.

We really appreciate Jason Mar from Google Cloud, who provided rich context and technical guidance regarding the Mainframe Connector; Eric Lowry from Elastic for his suggestions and recommendations; and the Google Cloud and Elastic team members who contributed to this collaboration.

Source : Data Analytics Read More

How NTUC created a centralized Data Portal for frictionless access to data across business lines

As a network of social businesses, NTUC Enterprise is on a mission to harness the capabilities of its multiple units to meet pressing social needs in areas like healthcare, childcare, daily essentials, cooked food, and financial services. Serving over two million customers annually, we seek to enable and empower everyone in Singapore to live better and more meaningful lives.

With so many lines of business, each running on different computing architectures, we found ourselves struggling to integrate data across our enterprise ecosystem and enable internal stakeholders to access the data. We deemed this essential to our mission of empowering our staff to collaborate on digital enterprise transformation in ways that enable tailor-made solutions for customers. 

The central issue was that our five main business lines (retail, health, food, supply chain, and finance) were operating on different combinations of Google Cloud, on-premises, and Amazon Web Services (AWS) infrastructure. This complex setup drove us to create a unified data portal that would integrate data from across our ecosystem, so business units could create inter-platform data solutions and analytics, and democratize data access for more than 1,400 NTUC data citizens. In essence, we sought to create a one-stop platform where internal stakeholders can easily access any assets they require from over 25,000 BigQuery tables and more than 10,000 Looker Studio dashboards.

Here is a step-by-step summary of how we deployed DataHub, an open-source metadata platform, alongside Google Cloud solutions to establish a unified Data Portal that allows seamless access for NTUC employees across business lines, while enabling secure data ingestion and robust data quality.

DataHub’s built-in data discovery function provides basic functionality to locate specific data assets from BigQuery tables and Looker Studio dashboards for storage on DataHub. However, we needed a more seamless way to ingest the metadata of all data assets automatically and systematically.

We therefore carried out customizations and enhancements on Cloud Composer, a fully managed workflow orchestration service built on Apache Airflow, and Google Kubernetes Engine (GKE) Autopilot, which helps us scale out easily and efficiently based on our dynamic needs.

Next, we built data lineage, which traces the end-to-end flow of data across our tech stack, from Cloud SQL into Cloud Storage, then back through BigQuery into Looker Studio dashboards for easy visibility. This was instrumental in enabling users across NTUC’s business lines to access data securely and intuitively on Looker Studio.

Having set up the basic platform architecture, our next task was to enable secure data ingestion. Sensitive data needed to be encrypted and stored in Cloud Storage before populating BigQuery tables. The system needed to be flexible enough to securely ingest data in a multi-cloud environment, including Google Cloud, AWS, and our on-premises infrastructure.

Our solution was to build an in-house framework around Python and YAML, as well as GKE and Cloud Composer. We created the equivalent of a Collibra data management platform to suit NTUC’s data flow (from Cloud Storage to BigQuery). The system also needed to conform to NTUC data principles, which are as follows:

All data in our Cloud Storage data lake must be stored in a compressed format such as Avro

Sensitive columns must be hashed using the Secure Hash Algorithm 256-bit (SHA-256); a minimal example follows this list

The solution must be flexible for customization depending on needs

Connection must be made by username and password

Connection must be made with certificates (public key and private key), including override functions in code

Connections must be able to present hundreds of physical tables (MSSQL sharded tables) as one logical table
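Here is the minimal example referenced above: it hashes a sensitive column with BigQuery's SHA256 and TO_HEX functions while building a curated table. The dataset, table, and column names are hypothetical.

# Hashing a sensitive column with SHA-256 while building a curated table.
# Dataset, table, and column names are hypothetical.
bq query --use_legacy_sql=false '
  CREATE OR REPLACE TABLE curated.customers AS
  SELECT
    TO_HEX(SHA256(email)) AS email_hash,  -- sensitive value stored only as a hash
    customer_id,
    signup_date
  FROM staging.customers'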

Our next task for the Data Portal was creating an automated Data Quality Control service to enable us to check data in real-time whenever a BigQuery table is updated or modified. This liberates our data engineers, who were previously building BigQuery tables by manually monitoring hundreds of table columns for changes or anomalies. This was a task that used to take an entire day, but is now reduced to just five minutes. We enable seamless data quality in the following way: 

Activity in BigQuery tables is automatically written into Cloud Logging, a fully managed, real-time log management service with storage, search, analysis, and alerts 

The logging service then filters BigQuery events into Pub/Sub as data streams that are channeled into Looker Studio, where users can easily access the specific data they need (a sketch of such a log sink follows this list)

In addition, the Data Quality Control service sends notifications to users whenever someone updates BigQuery tables incorrectly or against set rules, whether that means deleting, changing, or adding data to columns. This enables automated data discovery, without engineers needing to go into BigQuery to look up tables
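As a hedged sketch of the log sink referenced above, the commands below create a Pub/Sub topic and route BigQuery log entries from Cloud Logging into it; the log filter is a simplified assumption and should be tuned to the exact events you want to monitor:

# Create the destination topic, then route matching BigQuery log entries to it.
# The filter is a simplified assumption; grant the sink's writer identity the
# Pub/Sub Publisher role on the topic after creating the sink.
gcloud pubsub topics create bq-table-activity

gcloud logging sinks create bq-quality-sink \
  pubsub.googleapis.com/projects/my-project/topics/bq-table-activity \
  --log-filter='resource.type="bigquery_resource" AND protoPayload.methodName="tableservice.update"'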

These steps enable NTUC to create a flexible, dynamic, and user-friendly Data Portal that democratizes data access across business lines for more than 1,400 platform users, opening up vast potential for creative collaboration and digital solution development. In the future, we plan to look at how we can integrate even more data services into the Data Portal, and leverage Google Cloud to help develop more in-house solutions.

Source : Data Analytics Read More

Best practices of Dataproc Persistent History Server

When running Apache Hadoop and Spark, it is important to tune the configs, perform cluster planning, and right-size compute. Thorough benchmarking is required to make sure the utilization and performance are optimized. In Dataproc, you can run Spark jobs in a semi-long-running cluster or ephemeral Cloud Dataproc on Google Compute Engine (DPGCE) cluster or via Dataproc Serverless Spark. Dataproc Serverless for Spark runs a workload on an ephemeral cluster. An ephemeral cluster means the cluster’s lifecycle is tied to the job. A cluster is started, used to run the job, and then destroyed after completion. Ephemeral clusters are easier to configure, since they run a single workload or a few workloads in a sequence. You can leverage Dataproc Workflow Templates to orchestrate this. Ephemeral clusters can be sized to match the job’s requirements. This job-scoped cluster model is effective for batch processing. You can create an ephemeral cluster and configure it to run specific Hive workloads, Apache Pig scripts, Presto queries, etc., and then delete the cluster when the job is completed.  

Ephemeral clusters have some compelling advantages:

They reduce unnecessary storage and service costs from idle clusters and worker machines.

Every set of jobs runs on a job-scoped ephemeral cluster with job-scoped cluster specifications, image version, and operating system. 

Since each job gets a dedicated cluster, the performance of one job does not impact other jobs.

Persistent History Server (PHS)

The challenge with ephemeral clusters and Dataproc Serverless for Spark is that you will lose the application logs when the cluster machines are deleted after the job. Persistent History Server (PHS) enables access to the completed Hadoop and Spark application details for the jobs executed on different ephemeral clusters or serverless Spark. It can list running and completed applications. PHS keeps the history (event logs) of all completed applications and its runtime information in the GCS bucket, and it allows you to review metrics and monitor the application at a later time. PHS is nothing but a standalone cluster. It reads the Spark events from GCS, then parses and presents application details, scheduler stages, and task level details, as well as environment and executor information, in the Spark UI. These metrics are helpful for improving the performance of the application. Both the application event logs and the YARN container logs of the ephemeral clusters are collected in the GCS bucket. These log files are important for engaging Google Cloud Technical Support to troubleshoot and explore. If PHS is not set up, you have to re-run the workload, which adds to support cycle time. If you have set up PHS, you can provide the logs directly to Technical Support.

The following diagram depicts the flow of events logged from ephemeral clusters to the PHS server:

In this blog, we will focus on Dataproc PHS best practices. To set up PHS to access web interfaces of MapReduce and Spark job history files, please refer to Dataproc documentation.

PHS Best Practices

Cluster Planning and Maintenance

It’s common to have a single PHS for a given GCP project. If needed, you can create two or more PHSs pointing to different GCS buckets in a project. This allows you to isolate and monitor specific business applications that run multiple ephemeral jobs and require a dedicated PHS. 

For disaster recovery, you can quickly spin up a new PHS in another region in the event of a zonal or regional failure. 

If you require High Availability (HA), you can spin up two or more PHS instances across zones or regions. All instances can be backed by the dual-regional or multi-regional GCS bucket.

You can run PHS on a single-node Dataproc cluster, as it is not running large-scale parallel processing jobs. For the PHS machine type:

N2 are the most cost-effective and performant machines for Dataproc. We also recommend 500-1000GB pd-standard disks.

For <1,000 apps, including apps with 50K-100K tasks, we suggest n2-highmem-8.

For >10,000 apps, we suggest n2-highmem-16.

We recommend you benchmark with your Spark applications in the testing environment before configuring PHS in production. Once in production, we recommend monitoring your GCE backed PHS instance for memory and CPU utilization and tweaking machine shape as required.

In the event of significant performance degradation within the Spark UI due to a large number of applications or large jobs generating large event logs, you can recreate the PHS with a larger machine size and more memory.

As Dataproc releases new sub-minor versions on a bi-weekly cadence or greater, we recommend recreating your PHS instance so it has access to the latest Dataproc binaries and OS security patches. 

As PHS services (e.g. Spark UI, MapReduce History Server) are backwards compatible, it’s suggested to create a Dataproc 2.0+ based PHS cluster for all instances.

Logs Storage 

Configure spark:spark.history.fs.logDirectory to specify where to store event log history written by ephemeral clusters or serverless Spark. You need to create the GCS bucket in advance. 

Event logs are critical for PHS servers. As the event logs are stored in a GCS bucket, it is recommended to use a multi-Region GCS bucket for high availability. Objects inside the multi-region bucket are stored redundantly in at least two separate geographic places separated by at least 100 miles.

Configuration

PHS is stateless and it constructs the Spark UI of the applications by reading the application’s event logs from the GCS bucket. SPARK_DAEMON_MEMORY is the memory to allocate to the history server and has a default of 3840m. If too many users are trying to access the Spark UI and access job application details, or if there are long-running Spark jobs (iterated through several stages with 50K or 100K tasks), the heap size is probably too small. Since there is no way to limit the number of tasks stored on the heap, try increasing the heap size to 8g or 16g until you find a number that works for your scenario.

If you’ve increased the heap size and still see performance issues, you can configure spark.history.retainedApplications and lower the number of retained applications in the PHS.   

Configure mapred:mapreduce.jobhistory.read-only.dir-pattern to access MapReduce job history logs written by ephemeral clusters. 

By default, spark:spark.history.fs.gs.outputstream.type is set to BASIC. The job cluster will send data to GCS after job completion. Set this to FLUSHABLE_COMPOSITE to copy data to GCS at regular intervals while the job is running. 

Configure spark:spark.history.fs.gs.outputstream.sync.min.interval.ms to control the frequency at which the job cluster transfers data to GCS. 

To enable the executor logs in PHS, specify the custom Spark executor log URL for supporting external log service. Configure the following properties:

spark:spark.history.custom.executor.log.url={{YARN_LOG_SERVER_URL}}/{{NM_HOST}}:{{NM_PORT}}/{{CONTAINER_ID}}/{{CONTAINER_ID}}/{{USER}}/{{FILE_NAME}}
spark:spark.history.custom.executor.log.url.applyIncompleteApplication=false

Lifecycle Management of Logs

Using GCS Object Lifecycle Management, configure a 30d lifecycle policy to periodically clean up the MapReduce job history logs and Spark event logs from the GCS bucket. This will improve the performance of the PHS UI considerably.

Note: Before doing the cleanup, you can back up the logs to a separate GCS bucket for long-term storage.
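A minimal sketch of such a 30-day deletion policy, assuming a placeholder bucket name, looks like this:

# Delete PHS log objects older than 30 days; the bucket name is a placeholder.
cat > lifecycle.json <<EOF
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 30}
    }
  ]
}
EOF

gsutil lifecycle set lifecycle.json gs://my-phs-log-bucket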

PHS Setup Sample Codes

The following code block creates a Persistent History server with the best practices suggested above.

export PHS_CLUSTER_NAME=<CLUSTER NAME>
export REGION=<REGION>
export ZONE=<ZONE>
export GCS_BUCKET=<GCS BUCKET>
export PROJECT_NAME=<PROJECT NAME>

gcloud dataproc clusters create $PHS_CLUSTER_NAME \
  --enable-component-gateway \
  --region ${REGION} --zone $ZONE \
  --single-node \
  --master-machine-type n2-highmem-4 \
  --master-boot-disk-size 500 \
  --image-version 2.0-debian10 \
  --properties \
yarn:yarn.nodemanager.remote-app-log-dir=gs://$GCS_BUCKET/yarn-logs,\
mapred:mapreduce.jobhistory.done-dir=gs://$GCS_BUCKET/events/mapreduce-job-history/done,\
mapred:mapreduce.jobhistory.intermediate-done-dir=gs://$GCS_BUCKET/events/mapreduce-job-history/intermediate-done,\
spark:spark.eventLog.dir=gs://$GCS_BUCKET/events/spark-job-history,\
spark:spark.history.fs.logDirectory=gs://$GCS_BUCKET/events/spark-job-history,\
spark:SPARK_DAEMON_MEMORY=16000m,\
spark:spark.history.custom.executor.log.url.applyIncompleteApplication=false,\
spark:spark.history.custom.executor.log.url={{YARN_LOG_SERVER_URL}}/{{NM_HOST}}:{{NM_PORT}}/{{CONTAINER_ID}}/{{CONTAINER_ID}}/{{USER}}/{{FILE_NAME}} \
  --project $PROJECT_NAME

The following code block creates an ephemeral cluster that logs events to the GCS bucket.

export JOB_CLUSTER_NAME=<CLUSTER NAME>

gcloud dataproc clusters create $JOB_CLUSTER_NAME \
  --enable-component-gateway \
  --region $REGION --zone $ZONE \
  --master-machine-type n1-standard-4 \
  --master-boot-disk-size 500 --num-workers 2 \
  --worker-machine-type n1-standard-4 \
  --worker-boot-disk-size 500 \
  --image-version 2.0-debian10 \
  --properties \
yarn:yarn.nodemanager.remote-app-log-dir=gs://$GCS_BUCKET/yarn-logs,\
mapred:mapreduce.jobhistory.done-dir=gs://$GCS_BUCKET/events/mapreduce-job-history/done,\
mapred:mapreduce.jobhistory.intermediate-done-dir=gs://$GCS_BUCKET/events/mapreduce-job-history/intermediate-done,\
spark:spark.eventLog.dir=gs://$GCS_BUCKET/events/spark-job-history,\
spark:spark.history.fs.logDirectory=gs://$GCS_BUCKET/events/spark-job-history,\
spark:spark.history.fs.gs.outputstream.type=FLUSHABLE_COMPOSITE,\
spark:spark.history.fs.gs.outputstream.sync.min.interval.ms=1000ms \
  --project $PROJECT_NAME

With the Persistent History Server, you can monitor and analyze all completed applications. You will also be able to use the logs and metrics to optimize performance and to troubleshoot issues related to straggler tasks, scheduler delays, and out-of-memory errors.

Additional Resources

Dataproc PHS Documentation

PHS with Terraform

Spark Monitoring and Instrumentation

Run a Spark Workload

Spark History Web Interface

Monitoring Compute Engine VMs

Source : Data Analytics Read More

The business value of Cloud SQL: how companies speed up deployments, lower costs and boost agility

If you’re self-managing relational databases such as MySQL, PostgreSQL or SQL Server, you may be thinking about the pros and cons of cloud-based database services. Regardless of whether you’re running your databases on premises or in the cloud, self-managed databases can be inefficient and expensive, requiring significant effort around patching, hardware maintenance, backups, and tuning. Are managed database services a better option?

To answer this question, Google Cloud sponsored a business value white paper by IDC, based on the real-life experiences of eight Cloud SQL customers. Cloud SQL is an easy-to-use, fully-managed database service for running MySQL, PostgreSQL and SQL Server workloads. More than 90% of the top 100 Google Cloud customers use Cloud SQL.

The study found that migration to Cloud SQL unlocked significant efficiencies and cost reductions for these customers. Let’s take a look at the key benefits in this infographic.

Infographic: IDC business value study highlights the business benefits of migrating to Cloud SQL

A deeper dive into Cloud SQL benefits

To read the full IDC white paper, you can download it here: The Business Value of Cloud SQL: Google Cloud’s Relational Database Service for MySQL, PostgreSQL, and SQL Server, by Carl W. Olofson, Research Vice President, Data Management Software, IDC and Matthew Marden, Research Vice President, Business Value Strategy Practice, IDC.

Looking for commentary from IDC? Listen to the on-demand webinar, How Enterprises Have Achieved Greater Efficiency and Improved Business Performance using Google Cloud SQL, where Carl Olofson discusses the downsides of self-managed databases and the benefits of managed services like Cloud SQL, including the cost savings and improved business performance realized by the customers interviewed in the study.

Getting started

You can use our Database Migration Service for an easy, secure migration to Cloud SQL. Since Cloud SQL supports the same database versions, extensions and configuration flags as your existing MySQL, PostgreSQL and SQL Server instances, a simple lift-and-shift migration is usually all you need. So let Google Cloud take routine database administration tasks off your hands, and enjoy the scalability, reliability and openness that the cloud has to offer.

Start your journey with a Cloud SQL free trial.

Source : Data Analytics Read More

Performance considerations for loading data into BigQuery

Customers have been using BigQuery for their data warehousing needs since it was introduced. Many of these customers routinely load very large data sets into their enterprise data warehouse. Whether you are doing an initial data ingestion of hundreds of TB or incrementally loading from systems of record, the performance of bulk inserts is key to quicker insights from the data. The most common architecture for batch data loads uses Google Cloud Storage (object storage) as the staging area for all bulk loads, and all the different file formats are converted into an optimized columnar format called ‘Capacitor’ inside BigQuery.

This blog focuses on which file types give the best load performance. Data files uploaded to BigQuery typically come in comma-separated values (CSV), Avro, Parquet, JSON, or ORC formats. We use two large datasets to compare and contrast each of these file formats, and we explore the loading efficiency of compressed vs. uncompressed data for each of them. Data can be loaded into BigQuery using multiple tools in the GCP ecosystem: the Google Cloud console, the bq load command, the BigQuery API, or the client libraries. This blog attempts to elucidate the various options for bulk data loading into BigQuery and also provides data on the performance of each file type and loading mechanism.

Introduction

There are various factors you need to consider when loading data into BigQuery. 

Data file format

Data compression

Level of parallelization of data load

Schema autodetect ‘ON’ or ‘OFF’

Wide tables vs. narrow (fewer columns) tables

Data file format

Bulk insert into BigQuery is the fastest and most cost-efficient way to load data. Streaming inserts, however, are the better option when you need to report on the data immediately. Today, data files come in many different formats, including comma-separated values (CSV), JSON, Parquet, and Avro, to name a few. We are often asked whether the file format matters and whether there are any advantages in choosing one file format over another.

CSV files (comma-separated values) contain tabular data with a header row naming the columns. When loading, you can use the header row for schema autodetection to pick up the column names; with schema autodetection off, you can skip the header row and create a schema manually, using the column names from the header. CSV files can also use other field separators (like ; or |), since many data outputs already contain commas in the data. You cannot store nested or repeated data in the CSV file format.
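For illustration, the bq commands below load CSV files both with schema autodetection and with a manual schema file while skipping the header row; the dataset, table, bucket, and schema file names are placeholders:

# Schema autodetection reads column names from the header row.
bq load --source_format=CSV --autodetect \
  my_dataset.my_table gs://my-bucket/data/part-*.csv

# Manual schema variant: skip the header row and supply a schema file.
bq load --source_format=CSV --skip_leading_rows=1 \
  my_dataset.my_table gs://my-bucket/data/part-*.csv \
  ./schema.json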

JSON (JavaScript object notation) data is stored as a key-value pair in a semi structured format. JSON is preferred as a file type because it can store data in a hierarchical format. The schemaless nature of JSON data rows gives the flexibility to evolve the schema and thus change the payload. JSON formats are user-readable. REST-based web services use json over other file types.

PARQUET is a column-oriented data file format designed for efficient storage and retrieval of data.  PARQUET compression and encoding is very efficient and provides improved performance to handle complex data in bulk.

AVRO: The data is stored in a binary format and the schema is stored in JSON format. This helps minimize the file size and maximize efficiency.

From a data loading perspective, we ran various tests with millions to hundreds of billions of rows, with narrow to wide column data. We ran these tests with the public datasets `bigquery-public-data.samples.github_timeline` and `bigquery-public-data.wikipedia.pageviews_2022`. We used 1,000 flex slots for the test; the number of loading slots (called PIPELINE slots) is limited to the number of slots you have allocated for your environment. Schema autodetection was set to ‘NO’. For effective parallelization, each data file should typically be less than 256MB uncompressed for faster throughput. Here is a summary of our findings:

Do I compress the data? 

Sometimes batch files are compressed for faster network transfers to the cloud. Especially for large data files, it is faster to compress the data before sending it over a Cloud Interconnect or VPN connection. In such cases, is it better to uncompress the data before loading it into BigQuery? Here are the tests we did for various file types, with different file sizes and different compression algorithms. The results shown are the average of five runs:

How do I load the data?

There are various ways to load data into BigQuery: the Google Cloud console, the command line, the client libraries, or the REST API. Since all of these invoke the same API under the hood, there is no advantage to picking one over the other. We used a 1,000 PIPELINE slot reservation for the data loads shown above. For workloads that require predictable load times, it is imperative to use PIPELINE slot reservations, so that load jobs do not depend on the vagaries of available slots in the default pool. In the real world, many of our customers have multiple load jobs running concurrently; in those cases, assigning PIPELINE slots to individual jobs has to be done carefully, balancing load times against slot efficiency.
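As a hedged sketch of dedicating slots to load jobs, the commands below create a reservation and assign it to a project for PIPELINE (load) jobs; the reservation, location, and project values are placeholders, and purchasing the underlying slot capacity is a separate step:

# Create a 1,000-slot reservation and assign it to a project for load jobs.
# Reservation, location, and project values are placeholders.
bq mk --reservation --location=US --slots=1000 pipeline_reservation

bq mk --reservation_assignment \
  --reservation_id=my-project:US.pipeline_reservation \
  --job_type=PIPELINE \
  --assignee_type=PROJECT \
  --assignee_id=my-project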

Conclusion: There is no distinct advantage in loading time when the source file is in compressed format for the tests that we did. In fact for the most part uncompressed data loads in the same or faster time than compressed data. For all file types including AVRO, PARQUET and JSON it takes longer to load the data when the file is compressed. Decompression is a CPU bound activity and your mileage varies based on the amount of PIPELINE slots assigned to your load job. Data loading slots(PIPELINE slots) are different from the data querying slots. For compressed files, you should parallelize the load operation, so as to make sure that data loads are efficient. Split the data files to 256MB or less to speed up the parallelization of the data load.

From a performance perspective, AVRO and PARQUET files have similar load times. Fixing your schema loads the data faster than leaving schema autodetect set to ‘ON’. For ETL jobs, it is faster and simpler to do your transformation inside BigQuery using SQL; but if you have complex transformation needs that cannot be done with SQL, use Dataflow for unified batch and streaming, Dataproc for streaming-based pipelines, or Cloud Data Fusion for no-code/low-code transformation needs. Wherever possible, avoid implicit or explicit data type conversions for faster load times. Please also refer to the BigQuery documentation for details on loading data into BigQuery.

To learn more about how Google BigQuery can help your enterprise, try out the Quickstarts page here.

Disclaimer: These tests were done with limited resources for BigQuery in a test environment during different times of the day with noisy neighbors, so the actual timings and the number of rows might not be reflective of your test results. The numbers provided here are for comparison sake only, so that you can choose the right file types, compression for your workload.  This testing was done with two tables, one with 199 columns (wide table) and another with 4 columns (narrow table). Your results will vary based on the datatypes, number of columns, amount of data, assignment of PIPELINE slots and various file types. We recommend that you test with your own data before coming to any conclusion.

Source : Data Analytics Read More

Getting started with Looker Studio Pro

Today, millions of users turn to Looker Studio for self-serve business intelligence (BI) to explore data, answer business questions, build visualizations and share insights in beautiful dashboards with others. Recently, we introduced enterprise capabilities, including the options to manage team content and gain access to enterprise support, with Looker Studio Pro.

Looker Studio Pro is designed to support medium and large scale enterprise environments, delivering team-level collaboration and sharing, while honoring corporate security and document management policies – all at a low price per user.

Looker Studio Pro empowers teams to work together on content, without worrying about visibility and manageability issues that can arise with self-serve BI. So, the next time you want to onboard a new hire and give them access to fifty BI reports at once, or want to know the downstream impact of deleting a BigQuery table on your reports, or get admin permissions to all reports in your organization, Looker Studio Pro enables you to do so in a matter of minutes. You also no longer need to go through ownership transfers every time an employee leaves the organization, since all assets are owned by the organization. Plus, Google Cloud’s support and service level agreements ensure you have the help you need available when you need it. 

Let’s take a deeper dive into the benefits of Looker Studio Pro:

Bulk manage access to all content in Looker Studio

Team workspaces enable team collaboration in a shared space. Team workspaces provide new roles with granular permissions to manage content and members inside a team workspace.

A team workspace can be shared with individuals or Google Groups by adding members and assigning roles to the members of a team workspace. When you create new content in a team workspace or move existing content into a team workspace, all the members of that team workspace get access to that content.

The specific permissions provided depend on which role a user is granted, including Manager, Content Manager or Contributor. For a full list of detailed permissions aligned with each role, see our help center.

Additionally, with Looker Studio Pro, if employees leave the company, you no longer need to transfer ownership for assets they create. The organization owns content created in Looker Studio Pro, managed through the Google Cloud Platform project by default.

Get visibility across the organization 

You can now link a Google Cloud project to Looker Studio, so everything that people in your organization create is in one place that you control. 

Grant administrators permission to view or modify all assets in your organization using Identity and Access Management (IAM) — meaning no more orphaned reports or access headaches.

We are also working to bring Looker Studio content into Dataplex, so you can understand your company’s full data landscape in one place – including seeing how Looker Studio reports are linked to BigQuery tables. With Dataplex, you can better understand how Looker Studio and BigQuery connect, run impact analysis and see how data is transformed before it is used in reports. Integration with Looker Studio and Dataplex is currently in private preview, so reach out to your account team to learn more.

With these new integrations, we are making all the tools you need work better together, so you can realize the full value of your data cloud investments. 

Support

Looker Studio Pro customers receive support through existing Google Cloud Customer Care channels so you can rely on Looker Studio Pro for business-critical reporting. 

Getting started and next steps

If you are already a Google Cloud Platform customer, speak to your account team to sign up and get access today. Otherwise, complete this form to be notified when you can sign up for Looker Studio Pro.

Source : Data Analytics Read More

Pub/Sub Group Kafka Connector now GA: a drop-in solution for data movement

We’re excited to announce that the Pub/Sub Group Kafka Connector is now Generally Available with active support from the Google Cloud Pub/Sub team. The Connector (packaged in a single JAR file) is fully open source under an Apache 2.0 license and hosted on our GitHub repository. The packaged binaries are available on GitHub and Maven Central.

The source and sink connectors packaged in the Connector JAR allow you to connect your existing Apache Kafka deployment to Pub/Sub or Pub/Sub Lite in just a few steps.

Simplifying data movement

As you migrate to the cloud, it can be challenging to keep systems deployed on Google Cloud in sync with those running on-premises. Using the sink connector, you can easily relay data from an on-prem Kafka cluster to Pub/Sub or Pub/Sub Lite, allowing different Google Cloud services as well as your own applications hosted on Google Cloud to consume data at scale. For instance, you can stream Pub/Sub data straight to BigQuery, enabling analytics teams to perform their workloads on BigQuery tables.

If you have existing analytics tied to your on-prem Kafka cluster, you can easily bring any data you need from microservices deployed on Google Cloud or your favorite Google Cloud services using the source connector. This way you can have a unified view across your on-prem and Google Cloud data sources.

The Pub/Sub Group Kafka Connector is implemented using Kafka Connect, a framework for developing and deploying solutions that reliably stream data between Kafka and other systems. Using Kafka Connect opens up the rich ecosystem of connectors for use with Pub/Sub or Pub/Sub Lite. Search your favorite source or destination system on Confluent Hub.

Flexibility and scale

You can configure exactly how you want messages from Kafka to be converted to Pub/Sub messages and vice versa with the available configuration options. You can also choose your desired Kafka serialization format by specifying which key/value converters to use. For use cases where message order is important, the sink connectors transmit the Kafka record key as the Pub/Sub message `ordering_key`, allowing you to use Pub/Sub ordered delivery and ensuring compatibility with Pub/Sub Lite order guarantees. To keep the message order when sending data to Kafka using the source connector, you can set the Kafka record key as a desired field.

The Connector can also take advantage of Pub/Sub’s and Pub/Sub Lite’s high-throughput messaging capabilities and scale up or down dynamically as stream throughput requires. This is achieved by running the Kafka Connect cluster in distributed mode. In distributed mode, Kafka Connect runs multiple worker processes on separate servers, each of which can host source or sink connector tasks. Configuring the `tasks.max` setting to greater than 1 allows Kafka Connect to enable parallelism and shard relay work for a given Kafka topic across multiple tasks. As message throughput increases, Kafka Connect spawns more tasks, increasing concurrency and thereby increasing total throughput.
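For illustration, a sink connector instance might be registered on a distributed Kafka Connect cluster through its REST API as shown below; the topic, project, and converter choices are placeholders, and you should verify the property names against the connector’s documentation in our GitHub repository:

# Register the sink on a Kafka Connect cluster running in distributed mode.
# Topic, project, and converter settings are placeholders.
curl -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "pubsub-sink",
    "config": {
      "connector.class": "com.google.pubsub.kafka.sink.CloudPubSubSinkConnector",
      "tasks.max": "4",
      "topics": "orders",
      "cps.project": "my-gcp-project",
      "cps.topic": "orders-pubsub-topic",
      "key.converter": "org.apache.kafka.connect.storage.StringConverter",
      "value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter"
    }
  }'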

A better approach

Compared to existing ways of moving data between Kafka and Google Cloud, the connectors are a step-change improvement.

One option for connecting Kafka to Pub/Sub or Pub/Sub Lite is to write a custom relay application that reads data from the source and writes to the destination system. For developers with Kafka experience who want to connect to Pub/Sub Lite, we provide a Kafka Shim Client that makes consuming from and producing to a Pub/Sub Lite topic easier using the familiar Kafka API. This approach has a couple of downsides: it can take significant effort to develop, and it can be challenging for high-throughput use cases since there is no out-of-the-box horizontal scaling. You’ll also need to learn to operate this custom solution from scratch and add your own monitoring to ensure data is relayed smoothly. There are easier options that build on existing frameworks.

Pub/Sub, Pub/Sub Lite, and Kafka all have I/O connectors for Apache Beam. You can write a Beam pipeline using KafkaIO to move data between a Kafka cluster and Pub/Sub or Pub/Sub Lite, and then run it on an execution engine such as Dataflow. This requires some familiarity with the Beam programming model, writing code to create the pipeline, and possibly expanding your architecture to include a supported runner like Dataflow. Using the Beam programming model with Dataflow gives you the flexibility to perform transformations on the streams connecting your Kafka cluster to Pub/Sub, or to create complex topologies such as fan-out to multiple topics. For simple data movement, however, especially when using an existing Kafka Connect cluster, the connectors offer a simpler experience that requires no development and has low operational overhead.
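For comparison, a minimal Beam pipeline for the same Kafka-to-Pub/Sub movement might look like the sketch below. It assumes the Beam Python SDK’s cross-language KafkaIO transform is available and that you supply appropriate pipeline options for a runner such as Dataflow; the broker address, topic names, and project are placeholders.

```python
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.io.gcp.pubsub import WriteToPubSub
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder options; in practice you would also pass --runner=DataflowRunner,
# --project, --region, and related flags.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        # Read (key, value) byte pairs from a Kafka topic.
        | "ReadFromKafka" >> ReadFromKafka(
            consumer_config={"bootstrap.servers": "kafka-broker:9092"},
            topics=["on-prem-events"],
        )
        # Keep only the record value; WriteToPubSub expects bytes payloads.
        | "ExtractValue" >> beam.Map(lambda record: record[1])
        | "WriteToPubSub" >> WriteToPubSub(
            topic="projects/my-gcp-project/topics/events"
        )
    )
```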

No code is required to set up a data integration pipeline in Cloud Data Fusion between Kafka and Pub/Sub, thanks to plugins that support all three products. Like a Beam pipeline, a Data Fusion pipeline needs somewhere to execute, in this case a Dataproc cluster. It is a valid option best suited to cloud-native data practitioners who prefer a drag-and-drop GUI and who do not manage Kafka clusters directly. If you already manage Kafka clusters, you may prefer the native approach of deploying the connector directly into a Kafka Connect cluster between your sources/sinks and your Kafka cluster, which gives you more direct control.

To give the Pub/Sub connector a try, head over to the how-to guide.

Built with BigQuery: How Datalaksa provides a unified marketing and customer data warehouse for brands in South East Asia

Editor’s note: This post is part of a series highlighting our partners, and their solutions, that are Built with BigQuery.

Datalaksa is a unified marketing and customer data warehouse created by Persuasion Technologies, a Big Data Analytics & Digital Marketing consultancy serving clients throughout South East Asia. It enables marketing teams to optimize campaigns by combining data from across their marketing channels and to drive insight-driven actions through marketing automation and delivery systems.

In this post, we explore how they have leveraged Google BigQuery and Google’s other data cloud products to build a solution that is quick to set up, highly flexible, and able to scale with the needs of their customers.

Through close collaboration with their customers, Persuasion Technologies gained first-hand experience of the challenges those customers face when trying to optimize campaigns across multiple channels. “Marketing and CRM teams find it difficult to gain the insights that drive decisions across their marketing channels,” said Tzu Ming Chu, Director, Persuasion Technologies. “An ever-increasing variety of valuable data resides in siloed systems, while the teams that can integrate and analyze that data have never been more in demand. All too frequently this means that campaign planning is incomplete or too slow and campaign execution is less effective, ultimately resulting in lower sales and missed opportunities.”

Marketing teams of all sizes face similar challenges:

Access to technical skills and resources. Integrating data from various sources requires skilled, and scarce, technical resources to scope out requirements, design solutions, build the pipelines that connect data sources, develop data models, and ensure data quality. Machine learning (ML) requires data scientists to develop models that generate advanced insights, and MLOps engineers to make sure those models stay up to date and can be used for scoring at the needed scale.

Access to technology. Smaller companies may not have a data warehouse at all, and even in large companies that do, gaining access to it and having resources allocated can be a long and difficult process, often with little flexibility to accommodate local needs and limits on what can be provided.

Ease of use. Even a well-architected data warehouse may see little usage if data or marketing teams can’t figure out how to deep-dive into the data. Without an intuitive data model and an easy-to-use interface that lets business users query, transform, and visualize data and leverage AI models that automate insights and predict outcomes, the full benefits will not be realized.

Flexibility. Each marketing team is different: each has its own set of requirements, data sources, and use cases, and these continue to evolve and scale over time. Many off-the-shelf solutions lack the flexibility to accommodate the unique needs of each business.

In these challenges, the Persuasion Technologies team saw an opportunity: a chance to help their customers in a repeatable way, ensuring they all had easy access to rich data warehouse capabilities, while creating a new product-centric business and revenue stream for themselves.

Datalaksa, a unified marketing and customer data warehouse

Datalaksa is a solution that lets marketing teams easily, securely, and scalably bring together marketing and customer data from multiple channels into a cloud data warehouse, and gives them advanced capabilities to derive actionable insights and take actions that increase campaign efficiency and effectiveness.

Out of the box, Datalaksa includes data connectors that enable data to be imported from a wide range of platforms such as Google Marketing Platform, Facebook Ads and eCommerce systems, which means that marketing teams can unify data from across channels quickly and easily without reliance on scarce and costly technical resources to build and maintain integrations.

To accelerate time-to-insight, Datalaksa provides pre-built data models, machine learning models, and analytical templates for key marketing use cases such as cohort analysis, customer clustering, campaign recommendation, and lifetime value models, all wrapped within a simple and intuitive user interface that enables marketing teams to easily query, transform, enrich, and analyze their data, decreasing the time from data to value.

It’s often said that “insight without action is worthless.” To ensure this is not the case for Datalaksa users, the solution prompts action through notifications and provides audience segmentation tools and integrations back to marketing automation systems such as Salesforce Marketing Cloud, Google Ads, and eCommerce systems.

For example, teams can set thresholds and conditions using SQL queries to send ‘out of stock’ or ‘low stock’ notification emails to the relevant teams and to automatically update product recommendation algorithms to offer in-stock items. Through built-in connectors, customer audience segments can be activated by automatically updating ad-buying audiences in platforms including TikTok, Google Ads, LinkedIn, Facebook, and Instagram. These updates can be scheduled to run regularly.
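Datalaksa’s implementation isn’t public, but to make the threshold-and-notify pattern concrete, here is a purely illustrative sketch using the BigQuery Python client: a SQL query flags low-stock products and, if any are found, an email goes to the relevant team. The table, columns, threshold, addresses, and SMTP relay are all hypothetical.

```python
import smtplib
from email.message import EmailMessage

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical inventory table and threshold; adjust to your own schema.
LOW_STOCK_QUERY = """
    SELECT product_id, product_name, units_in_stock
    FROM `my-gcp-project.marketing_dw.inventory`
    WHERE units_in_stock < 10
    ORDER BY units_in_stock
"""

rows = list(client.query(LOW_STOCK_QUERY).result())

if rows:
    body = "\n".join(
        f"{row.product_name} ({row.product_id}): {row.units_in_stock} left"
        for row in rows
    )
    msg = EmailMessage()
    msg["Subject"] = f"Low stock alert: {len(rows)} products below threshold"
    msg["From"] = "alerts@example.com"
    msg["To"] = "merchandising-team@example.com"
    msg.set_content(body)

    # Hypothetical SMTP relay; a real deployment might use an email API instead.
    with smtplib.SMTP("smtp.example.com") as smtp:
        smtp.send_message(msg)
```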

All of this is built using Google’s BigQuery and data cloud suite of products.

Why Datalaksa chose Google Cloud and BigQuery

The decision to use Google Cloud and BigQuery for Datalaksa was an easy one, according to Tzu: “Not only did it accelerate our ability to provide our customers with industry-leading data warehousing and analytical capabilities, it’s incredibly easy to integrate with many key marketing systems, including those from Google. This equates directly to saved time and cost, not just during the initial design and build, but in the ongoing support and maintenance.”

Persuasion Technologies’ story is one of deep expertise, customer empathy, and innovative thinking, but BigQuery and Google Cloud’s end-to-end platform for building data-driven applications is also a key part of their success:

World-class analytics. By leveraging BigQuery as the core of Datalaksa, they were immediately able to provide their customers with a fully managed, petabyte-scale, world-class analytics solution backed by a 99.99% SLA. Additionally, integrated, fully managed services like Cloud Data Loss Prevention help their users discover, classify, and protect their most sensitive data. This is a huge advantage for a startup, and it lets them focus their time on creating value for customers by building their expertise into the product.

Built-in industry-leading ML/AI. To deliver advanced machine learning capabilities to its customers, Datalaksa uses BigQuery ML. As the name suggests, BigQuery ML is built right into BigQuery, so not only does it let them leverage a wide range of advanced ML models, it further decreases development time and cost by eliminating the need to move data between the data warehouse and a separate ML system, while enabling people with no coding skills to gain extra insights by building machine learning models with SQL constructs (a minimal, illustrative sketch of this workflow appears after this list).

Serverless scalability and efficiency. Because all of the services Datalaksa uses are serverless or fully managed, they offer high levels of resiliency and scale up and down effortlessly with customers’ needs, while keeping total cost of ownership low by minimizing operational overhead.

Simplified data integration. Datalaksa is rapidly adding connections to Google data sources such as Google Ads and YouTube, and to hundreds of other SaaS services, through the BigQuery Data Transfer Service (DTS) and through a wide range of third-party connectors in the Google Cloud Marketplace, including Facebook Ads and eCommerce cart connectors.
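Datalaksa’s own models aren’t public, so the following is only a general illustration of the BigQuery ML workflow mentioned above: a k-means customer-clustering model is trained and applied entirely with SQL, run here through the BigQuery Python client. The project, dataset, table, and column names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Train a k-means clustering model over customer behavior features.
# Dataset, table, and column names are hypothetical.
client.query("""
    CREATE OR REPLACE MODEL `my-gcp-project.marketing_dw.customer_clusters`
    OPTIONS (model_type = 'kmeans', num_clusters = 4) AS
    SELECT
      total_orders,
      avg_order_value,
      days_since_last_purchase
    FROM `my-gcp-project.marketing_dw.customer_features`
""").result()

# Assign every customer to a cluster (centroid) for downstream segmentation.
segments = client.query("""
    SELECT customer_id, CENTROID_ID AS segment
    FROM ML.PREDICT(
      MODEL `my-gcp-project.marketing_dw.customer_clusters`,
      (SELECT customer_id, total_orders, avg_order_value, days_since_last_purchase
       FROM `my-gcp-project.marketing_dw.customer_features`))
""").result()

for row in segments:
    print(row.customer_id, row.segment)
```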

The Built with BigQuery advantage for ISVs

Through Built with BigQuery, Google is helping tech companies like Persuasion Technologies build innovative applications on Google’s data cloud with simplified access to technology, helpful and dedicated engineering support, and joint go-to-market programs. Participating companies can: 

Get started fast with a Google-funded, pre-configured sandbox. 

Accelerate product design and architecture through access to designated experts from the ISV Center of Excellence who can provide insight into key use cases, architectural patterns, and best practices. 

Amplify success with joint marketing programs to drive awareness, generate demand, and increase adoption.

BigQuery gives ISVs the advantage of a powerful, highly scalable data warehouse that’s integrated with Google Cloud’s open, secure, sustainable platform. And with a huge partner ecosystem and support for multicloud, open-source tools, and APIs, Google gives technology companies the portability and extensibility they need to avoid data lock-in.

Click these links to learn more about Datalaksa and Built with BigQuery.
