Archives January 2024

Automated fraud detection with Fivetran and BigQuery

In today’s dynamic landscape, businesses need faster data analysis and predictive insights to identify and address fraudulent transactions. Typically, tackling fraud through the lens of data engineering and machine learning boils down to these key steps:

Data acquisition and ingestion: Establishing pipelines across various disparate sources (file systems, databases, third-party APIs) to ingest and store the training data. This data is rich with meaningful information, fueling the development of fraud-prediction machine learning algorithms.
Data storage and analysis: Utilizing a scalable, reliable and high-performance enterprise cloud data platform to store and analyze the ingested data.
Machine-learning model development: Building training sets from the stored data and running machine learning models on them to build predictive models capable of differentiating fraudulent transactions from legitimate ones.

Common challenges in building data engineering pipelines for fraud detection include:

Scale and complexity: Data ingestion can be a complex endeavor, especially when organizations utilize data from diverse sources. Developing in-house ingestion pipelines can consume substantial data engineering resources (weeks or months), diverting valuable time from core data analysis activities.
Administrative effort and maintenance: Manual data storage and administration, including backup and disaster recovery, data governance and cluster sizing, can significantly impede business agility and delay the generation of valuable data insights.
Steep learning curve/skill requirements: Building a data science team to both create data pipelines and machine learning models can significantly extend the time required to implement and leverage fraud detection solutions.

Addressing these challenges requires a strategic approach focusing on three central themes: time to value, simplicity of design and the ability to scale. These can be addressed by leveraging Fivetran for data acquisition, ingestion and movement, and BigQuery for advanced data analytics and machine learning capabilities.

Streamlining data integration with Fivetran

It’s easy to underestimate the challenge of reliably persisting incremental source system changes to a cloud data platform unless you happen to be living it and dealing with it on a daily basis. In my previous role, I worked with an enterprise financial services firm that was stuck on legacy technology described as “slow and kludgy” by the lead architect. The addition of a new column to their DB2 source triggered a cumbersome process, and it took six months for the change to be reflected in their analytics platform.

This delay significantly hampered the firm’s ability to provide downstream data products with the freshest and most accurate data. Consequently, every alteration in the source’s data structure resulted in time-consuming and disruptive downtime for the analytics process. The data scientists at the firm were stuck wrangling incomplete and outdated information.

In order to build effective fraud detection models, they needed all of their data to be:

Curated, contextual: The data should be personalized and specific to their use case, while being high quality, believable, transparent, and trustworthy.
Accessible and timely: Data needs to be always available and high-performance, with frictionless access through familiar downstream data consumption tools.

The firm chose Fivetran notably for its automatic and reliable handling of schema evolution and schema drift from multiple sources to their new cloud data platform. With over 450 source connectors, Fivetran allows the creation of datasets from various sources, including databases, applications, files and events.

The choice was game-changing. With Fivetran ensuring a constant flow of high-quality data, the firm’s data scientists could devote their time to rapidly testing and refining their models, closing the gap between insights and action and moving them closer to prevention.

Most importantly for this business, Fivetran automatically and reliably normalized the data and managed changes that were required from any of their on-premises or cloud-based sources as they moved to the new cloud destination. These included:

Schema changes (including schema additions)
Table changes within a schema (table adds, table deletes, etc.)
Column changes within a table (column adds, column deletes, soft deletes, etc.)
Data type transformation and mapping (here’s an example for SQL Server as a source)

The firm’s selection of a dataset for a new connector was a straightforward process of informing Fivetran how they wanted source system changes to be handled — without requiring any coding, configuration, or customization. Fivetran set up and automated this process, enabling the client to determine the frequency of changes moving to their cloud data platform based on specific use case requirements.

Fivetran demonstrated its ability to handle a wide variety of data sources beyond DB2, including other databases and a range of SaaS applications. For large data sources, especially relational databases, Fivetran accommodated significant incremental change volumes. The automation provided by Fivetran allowed the existing data engineering team to scale without the need for additional headcount. The simplicity and ease of use of Fivetran allowed business lines to initiate connector setup with proper governance and security measures in place.

In the context of financial services firms, governance and complete data provenance are critical. The recently released Fivetran Platform Connector addresses these concerns, providing simple, easy and near-instant access to rich metadata associated with each Fivetran connector, destination or even the entire account. The Platform Connector, which incurs zero Fivetran consumption costs, offers end-to-end visibility into metadata (26 tables are automatically created in your cloud data platform – see the ERD here) for the data pipelines, including:

Lineage for both source and destination: schema, table, column
Usage and volumes
Connector types
Logs
Accounts, teams, roles

This enhanced visibility allows financial service firms to better understand their data, fostering trust in their data programs. It serves as a valuable tool for providing governance and data provenance — crucial elements in the context of financial services and their data applications.

BigQuery’s scalable and efficient data warehouse for fraud detection

BigQuery is a serverless and cost-effective data warehouse designed for scalability and efficiency, making it a good fit for enterprise fraud detection. Its serverless architecture minimizes the need for infrastructure setup and ongoing maintenance, allowing data teams to focus on data analysis and fraud mitigation strategies.

Key benefits of BigQuery include:

Faster insights generation: BigQuery’s ability to run ad-hoc queries and experiments without capacity constraints allows for rapid data exploration and quicker identification of fraudulent patterns.
Scalability on demand: BigQuery’s serverless architecture automatically scales up or down based on demand, ensuring that resources are available when needed and avoiding over-provisioning. This removes the need for data teams to manually scale their infrastructure, which can be time-consuming and error-prone. A key part here to understand is that BigQuery can scale while the queries are running (in flight), a clear differentiator from other modern cloud data warehouses.
Data analysis: BigQuery datasets can scale to petabytes, helping to store and analyze financial transactions data at near-limitless scale. This empowers you to uncover hidden patterns and trends within your data, for effective fraud detection.
Machine learning: BigQuery ML offers a range of off-the-shelf fraud detection models, from anomaly detection to classification, all implemented through simple SQL queries. This democratizes machine learning and enables rapid model development for your specific needs. Different types of models that BigQuery ML supports are listed here.
Model deployment for inference at scale: While BigQuery supports batch inference, Google Cloud’s Vertex AI can be leveraged for real-time predictions on streaming financial data. Deploy your BigQuery ML models on Vertex AI to gain immediate insights and actionable alerts, safeguarding your business in real-time.

The combination of Fivetran and BigQuery provides a simple design for a complex problem: an effective fraud detection solution capable of real-time, actionable alerts. In the next post in this series, we’ll focus on the hands-on implementation of the Fivetran-BigQuery integration using an actual dataset and create ML models in BigQuery that can accurately predict fraudulent transactions.

Fivetran is available on Google Cloud Marketplace.

Source: Data Analytics

Vector Embeddings: The Upcoming Building Blocks for Generative AI

The AI domain is undergoing a remarkable upswing in both expansion and inventiveness. This surge is driven by advancements across various subfields and increasing adoption in diverse sectors. Global AI market projections anticipate a substantial CAGR of 37.3% within the 2023-2030 timeframe. This translates to a projected market size of approximately $1.81 trillion by the […]

Source: SmartData Collective

Game on: Aiven for Apache Kafka and BigQuery – your ultimate gaming cheat code

The games industry is one of the most data-driven industries in the world. Games generate massive amounts of data every second, from player behavior and in-game transactions to social media engagement and customer support tickets. This data can be used to improve games, make better business decisions, and create new and innovative experiences for players.

However, the games industry also faces a unique challenge: how to analyze this data in real-time or near-real-time. This is because games are now constantly changing and evolving, and players expect a personalized and highly responsive experience. Imagine, you and your friends are all set to join your favorite game’s new version launch! You take the day off, get your snacks ready and POOF! Server unresponsive… Reload… Still nothing.

Massively Multiplayer Online Role-Playing Games (MMORPGs), for instance, need to be able to handle a large number of concurrent players while simulating a virtual world in real time. This can put a strain on game server infrastructure, and it can be difficult to scale the infrastructure to meet the needs of a growing player base. Here is where real-time analytics plays a role in auto-scaling infrastructure in response to these events.

Automating infrastructure scaling in real-time

Aiven for Apache Kafka provides time-value to player-volume-based data, allowing automation of infrastructure scaling, based on traffic patterns and load. In addition, with Aiven for InfluxDB and Aiven for Grafana, data infrastructure teams have insights into the health of gaming services — as the gameplay is happening. Once certain thresholds are detected, automation scripts employing the Kubernetes Operator or Terraform Provider can spin up new game services to answer demand.
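The threshold logic described above can be illustrated with a small sketch. It is not Aiven's automation, nor any Kubernetes Operator or Terraform Provider API: the function name, the per-server capacity, the headroom percentage, and the server limits are all hypothetical values chosen for illustration. It only shows how an observed player count (for example, one derived from a Kafka topic) might be turned into a target number of game servers.

```python
def desired_servers(concurrent_players, players_per_server=500,
                    headroom_pct=20, min_servers=2, max_servers=50):
    """Return a target game-server count for the observed player count.

    All defaults are hypothetical; real values depend on the game.
    """
    if concurrent_players < 0:
        raise ValueError("concurrent_players must be non-negative")
    # Load including a safety margin, kept in integers to avoid float rounding.
    scaled_load = concurrent_players * (100 + headroom_pct)
    capacity = players_per_server * 100
    # Ceiling division: a partially filled server still counts as one server.
    needed = (scaled_load + capacity - 1) // capacity
    # Clamp to the allowed range so scaling never drops to zero or runs away.
    return max(min_servers, min(max_servers, needed))


for players in (0, 10_000, 1_000_000):
    print(players, "players ->", desired_servers(players), "servers")
```

Clamping to a minimum and a maximum keeps a quiet period from scaling to zero and keeps a traffic spike from provisioning without bound; the scaling action itself would still be carried out by whatever automation the team already uses.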

As one of the most demanding industries for computing power, online or mobile games need to be able to handle a large number of concurrent players. This can put a strain on game server infrastructure, and it can be difficult to scale the infrastructure to meet the needs of a growing player base.

The Aiven Platform offers multiple features, across all managed services, that make it well-suited for automated game service scaling, including:

Scalability: Aiven services can be scaled to handle any amount of data, making it ideal for high player volume gaming scenarios, such as a highly anticipated version launch.
High reliability and availability: Via the management plane (the Aiven Console), highly reliable services are designed to be always available. (Typically, even smaller plans provided by Aiven for Apache Kafka include three highly available nodes, out of the box.)
Security: The Aiven Console offers a number of security features and compliance by default to protect your data, including encryption and authentication.

Some of the benefits gained when using the Aiven Stack, shown in our reference architecture, to scale game servers are:

Improved performance: By automatically scaling the number of game servers up or down as needed, you can ensure that your game servers are always operating at optimal capacity.
Cost optimization: You can save money on your cloud computing costs by only running the number of game servers that you need.
Improved scalability: Aiven for Apache Kafka, coupled with Aiven observability services (Aiven for InfluxDB and Aiven for Grafana), can be used to scale your game server infrastructure to meet the needs of your growing player base.

Future-proofing automated scaling

Google Pub/Sub and BigQuery can be used together to perform longer-term analytics for the games industry. Pub/Sub is a near-real-time messaging service that can be used to collect data from game servers, in our case, messages from Aiven for Apache Kafka.

There are a variety of use cases beyond the scope of auto-scaling infrastructure that can be leveraged when using Pub/Sub and BigQuery for longer-term analytics of player telemetry data:

Player behavior analysis: By tracking player behavior over time, game companies can identify trends and patterns in how players are playing their games. This information can be used to improve the player experience, develop new content, and balance in-game economies.
Game performance analysis: By tracking game performance over time, game companies can identify areas where technical performance is struggling. This information can be used to fix bugs, optimize performance, and improve the overall quality of the game experience.
Business intelligence: By analyzing data on player engagement, revenue, and other metrics, companies can make better business decisions. For example, a gaming company could use this data to identify their most popular and profitable titles.

Games industry customers who use Pub/Sub and BigQuery for longer-term analytics will see several benefits:

Scalability and reliability: Pub/Sub and BigQuery are both highly scalable and highly available services that can handle any amount of data.
Security: Pub/Sub and BigQuery offer a number of security features to help protect your data, including encryption and authentication.
Cost-optimization: By analyzing longer term data points, Pub/Sub and BigQuery can help forecast future player workloads and enable adjustments to auto-scaling behavior.

Aiven + Google Cloud = better together

By partnering with Google Cloud and Aiven on Google Cloud, the games industry can prepare for a worry-free launch while having the data to understand players and keep them coming back for more. Service reliability is key — a hassle-free experience dictates the success of the game! — but cost should always be considered. By lowering the total cost of operations, and right-sizing in real-time, you can achieve greater game-play usability while minimizing unnecessary overscaling.

Predictive analytics in BigQuery allows you to tweak the scaling parameters based on past data, enabling greater control of future volumes that would otherwise be lost. The combination of managed services from Aiven and Google Cloud adds time value to data, along with increased revenue from a successful launch. Game on!

Conclusion and next steps

As the games industry continues to evolve and embrace data-driven decision-making, the combination of Aiven for Kafka and Pub/Sub capabilities within the BigQuery suite will become increasingly essential for success. By harnessing the power of real-time data, games companies can unlock new opportunities, enhance player experiences, and drive sustainable growth. If you’re ready to learn more, check out the links below:

For further reading on connecting Aiven with Google-native services: Shorten the path to insights with Aiven for Apache Kafka and BigQuery.
How Google Cloud empowers the games industry to achieve success: Game on and on and on: Google Cloud’s strategy for live service games
Ready to give it a try? Click here to check out Aiven’s listing on Google Cloud Marketplace, and let us know what you think.

Source: Data Analytics

Real-time data processing for machine learning with Striim and BigQuery

In today’s data-driven world, the ability to leverage real-time data for machine learning applications is a game-changer. Two key players in this field, Striim and Google Cloud with BigQuery, offer a powerful combination to make this possible. Striim serves as a real-time data integration platform that seamlessly and continuously moves data from diverse sources to destinations such as cloud databases, messaging systems, and data warehouses, making it a vital component in modern data architectures. BigQuery is an enterprise data platform with best-in-class capabilities to unify all data and workloads in multi-format, multi-storage and multi-engine. BigQuery ML is built into the BigQuery environment, allowing you to create and deploy machine learning models using SQL-like syntax in a single, unified experience.

Real-time data processing lets data scientists and machine learning (ML) engineers focus on model development and monitoring. Traditionally, they had to manually execute workflows and code to gather, clean, and label their raw data through batch processing, which often involved delays and reduced responsiveness. Striim’s strength lies in its capacity to connect to over 150 data sources, enabling real-time data acquisition from virtually any location and simplifying data transformations. This empowers businesses to expedite the creation of machine learning models and make data-driven decisions and predictions swiftly, ultimately enhancing customer experiences and optimizing operations. By incorporating the most current data, organizations can further boost the accuracy of their decision-making processes, ensuring that insights are derived from the latest information available, leading to more informed and strategic business outcomes.


Before we embark on the journey of integrating Striim with BigQuery ML for real-time data processing in machine learning, there are a few prerequisites that you should ensure are in place.

Striim instance: To get started, you need to have a Striim instance created and have access to it. Striim is the backbone of this integration, and having a working Striim instance is essential for setting up the data pipelines and connecting to your source databases. For a free trial, you can sign up for Striim Cloud on Google Cloud.
Understanding of Striim: Familiarity with the basic concepts of Striim and the ability to create data pipelines is crucial. You should understand how to navigate the Striim environment, configure data sources, and set up data flows. If you’re new to Striim or need a refresher on its core functionalities, you can review Striim’s documentation and resources.

In the forthcoming sections of this blog post, we will guide you through the seamless integration of Striim with BigQuery ML, showcasing a step-by-step process from connecting to a Postgres database to deploying machine learning models. The integration of Striim’s real-time data integration capabilities with BigQuery ML’s powerful machine learning services empowers users to not only move data seamlessly but also harness the latest data for building and deploying machine learning models. Our demonstration will highlight how these tools facilitate real-time data acquisition, transformation, and model deployment, ultimately enabling organizations to make quick, data-driven decisions and predictions while optimizing their operational efficiency.

Section 1: Connecting to the source database

The first step in this integration journey is connecting Striim to a database that contains raw machine learning data. In this blog, we will focus on a PostgreSQL database. Inside this database, we have an iris_dataset table with the following column structure.

```text
Table: dms_sample.iris_dataset

| Column       | Type             | Collation | Nullable | Default                                  |
|--------------|------------------|-----------|----------|------------------------------------------|
| id           | integer          |           | not null | nextval('iris_dataset_id_seq'::regclass) |
| sepal_length | double precision |           |          |                                          |
| sepal_width  | double precision |           |          |                                          |
| petal_length | double precision |           |          |                                          |
| petal_width  | double precision |           |          |                                          |
| species      | text             |           |          |                                          |
```

This table contains raw data related to the characteristics of different species of iris flowers. It’s worth noting that this data has been gathered from a public source, and as a result, there are NULL values in some fields, and the labels for the species are represented numerically. Specifically, in this dataset, 1 represents “setosa,” 2 represents “versicolor,” and 3 represents “virginica.”

To read raw data from our PostgreSQL database, we will use Striim’s PostgreSQL Reader adapter, which captures all operations and changes from the PostgreSQL log files.

To get the PostgreSQL Reader adapter created, we will drag and drop it from the Component section, provide the Connection URL, username, and password, and specify the iris_dataset table in the Tables property. The PostgreSQL Reader adapter utilizes the wal2json plugin of the PostgreSQL database to read the log files and capture the changes. Therefore, as a part of the setup, we need to create a replication slot in the source database and then provide its name in the replication slot property.
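As a rough illustration of the replication slot step, the sketch below builds the SQL statement that could be run on the source database. `pg_create_logical_replication_slot` is PostgreSQL's built-in function for this; the slot name `striim_slot` and the helper name are hypothetical, and the sketch assumes the wal2json plugin is installed and that `wal_level` is set to `logical` on the server.

```python
import re


def create_slot_sql(slot_name, plugin="wal2json"):
    """Build the statement that creates a logical replication slot.

    Assumes the plugin is already installed on the PostgreSQL server.
    """
    # PostgreSQL slot names may contain only lower-case letters,
    # digits, and underscores.
    if not re.fullmatch(r"[a-z0-9_]+", slot_name):
        raise ValueError(f"invalid replication slot name: {slot_name!r}")
    if not re.fullmatch(r"[a-z0-9_]+", plugin):
        raise ValueError(f"invalid plugin name: {plugin!r}")
    return f"SELECT pg_create_logical_replication_slot('{slot_name}', '{plugin}');"


# 'striim_slot' is a hypothetical name; use the same value in the
# adapter's replication slot property.
print(create_slot_sql("striim_slot"))
```

Validating the name before building the statement avoids both a server-side error and accidental quoting problems, since the value is interpolated directly into the SQL text.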

Section 2: Creating Striim Continuous Query (CQ) Adapters

In the context of Striim, CQ refers to continuously running queries that transform data in-flight by using Striim queries, which are similar to SQL queries. These adapters can be used to filter, aggregate, join, enrich, and transform events.

This adapter plays a crucial role in this integration, as it helps transform and prepare the data for machine learning in BigQuery ML. In order for us to create and attach a CQ adapter under the previous adapter, we have to click on the ‘Wave’ icon and ‘+’ sign, then select ‘Connect next CQ component’:

We will now walk you through the steps of writing SQL-like queries in the CQ adapters and how Striim transforms the data in-flight once we read it from the Postgres database.

1. Handling NULL Values:

We build a CQ adapter that transforms NULL values into a float 0.0, ensuring the consistency and integrity of your data. Here’s the SQL query for this transformation:

```sql
SELECT * FROM pg_output_ml
MODIFY(
  data[1] = CASE WHEN data[1] IS NULL THEN TO_FLOAT(0.0) ELSE data[1] END,
  data[2] = CASE WHEN data[2] IS NULL THEN TO_FLOAT(0.0) ELSE data[2] END,
  data[3] = CASE WHEN data[3] IS NULL THEN TO_FLOAT(0.0) ELSE data[3] END,
  data[4] = CASE WHEN data[4] IS NULL THEN TO_FLOAT(0.0) ELSE data[4] END
);
```
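To make the effect of this step concrete, here is a minimal Python sketch of the same rule applied to a single row held as a list. It is not Striim code: the function name is hypothetical, and the row layout (id at position 0, the four measurements at positions 1 to 4, species at position 5) is assumed from the table definition above.

```python
def fill_null_measurements(row, positions=(1, 2, 3, 4), default=0.0):
    """Return a copy of the row with missing measurements set to a default.

    Positions 1-4 are assumed to hold the four numeric measurements.
    """
    cleaned = list(row)  # copy so the caller's row is left untouched
    for index in positions:
        if cleaned[index] is None:
            cleaned[index] = float(default)
    return cleaned


raw = [1, 5.1, 3.5, 1.4, None, "1"]
print(fill_null_measurements(raw))
```

Note that a filled-in 0.0 is indistinguishable from a genuine measurement once it reaches the model, so this choice trades completeness of rows against a possible skew in the affected feature.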

We will attach the PostgreSQL Reader adapter to this CQ for seamless data processing:

2. Converting numeric species classes to text:

We build another CQ adapter to convert numeric species classes to text classes, making the data more human-readable and interpretable for the ML model.

```sql
SELECT replaceString(
  replaceString(
    replaceString(t, '1', 'setosa'),
    '2', 'versicolor'
  ),
  '3', 'virginica'
)
FROM pg_ml_data_output t;
```
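The label conversion can also be pictured as a plain lookup. The sketch below is an illustration in Python rather than Striim code; it uses the mapping stated earlier in this post (1 is setosa, 2 is versicolor, 3 is virginica), and the names `SPECIES_BY_CODE` and `decode_species` are hypothetical.

```python
# Mapping taken from the dataset description earlier in the post.
SPECIES_BY_CODE = {"1": "setosa", "2": "versicolor", "3": "virginica"}


def decode_species(code):
    """Translate a numeric species code into its text label."""
    key = str(code).strip()
    if key not in SPECIES_BY_CODE:
        # Fail loudly instead of passing an unknown class to the model.
        raise ValueError(f"unknown species code: {code!r}")
    return SPECIES_BY_CODE[key]


print([decode_species(c) for c in ("1", "2", "3")])
```

An exact-match lookup only ever rewrites a whole code, whereas substring replacement would also change a matching digit that appears inside any other string value, so the lookup form is the more defensive of the two when other text fields are present.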

We attach the Data_ML_Transform CQ adapter to this CQ for label processing:

3. Data transformation:

Finally, we create the last CQ adapter to extract the final data and assign it to variable/column names, making it ready for integration with BigQuery ML.

```sql
SELECT
  data[0] as id,
  data[1] as sepal_length,
  data[2] as sepal_width,
  data[3] as petal_length,
  data[4] as petal_width,
  data[5] as species
FROM transformed_data t;
```

We attach the Label_ML_Transform CQ adapter to this CQ to assign data field to variables:

Section 3: Attaching CQ to BigQuery Writer adapter

Now that we’ve prepared the data using CQ adapters, we need to connect them to the BigQuery Writer adapter, the gateway for streaming data into BigQuery. By clicking on the ‘Wave’ icon, and attaching the BigQuery adapter, you establish a connection between the previous adapters and BigQuery.

In the Tables property, we use the ColumnMap to connect the transformed data with the appropriate BigQuery columns:

```text
DMS_SAMPLE.iris_dataset ColumnMap(
  id = id,
  sepal_length = sepal_length,
  sepal_width = sepal_width,
  petal_length = petal_length,
  petal_width = petal_width,
  species = species
)
```

To complete the BigQuery Writer adapter setup, you need to create a service account in your Google Cloud account. This service account requires specific roles within BigQuery (see BigQuery > Documentation > Guides > Introduction to IAM > BigQuery predefined Cloud IAM roles):

bigquery.dataEditor for the target project or dataset
bigquery.jobUser for the target project

For more information, please visit this link.
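One way to grant these roles is with the gcloud CLI. The sketch below only assembles the `gcloud projects add-iam-policy-binding` commands as argument lists and does not run them; the project ID and service account email are hypothetical placeholders, and it grants both roles at the project level even though `bigquery.dataEditor` could instead be granted on the target dataset alone.

```python
def grant_commands(project_id, service_account_email,
                   roles=("roles/bigquery.dataEditor", "roles/bigquery.jobUser")):
    """Build one gcloud command per role, as argument lists for subprocess."""
    member = f"serviceAccount:{service_account_email}"
    return [
        ["gcloud", "projects", "add-iam-policy-binding", project_id,
         f"--member={member}", f"--role={role}"]
        for role in roles
    ]


# Hypothetical project and service account names, for illustration only.
for command in grant_commands(
        "my-project", "striim-writer@my-project.iam.gserviceaccount.com"):
    print(" ".join(command))
```

Returning argument lists rather than shell strings means the commands can be passed to `subprocess.run` without shell quoting concerns once they have been reviewed.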

After we create the service account key, we specify the Project ID and supply the Service Account Key JSON file to give Striim permission to connect to BigQuery:

Section 4: Execute the CDC data pipeline to replicate the data to BigQuery

To execute the CDC data pipeline, click the top dropdown labeled ‘Created’, select ‘Deploy App’, and then choose ‘Start App’ to initiate the data pipeline.

After successfully executing the CDC data pipeline, the Application Progress page indicates that we’ve read 30 ongoing changes from our source database and written these 30 records and changes to our BigQuery dataset. At the bottom of the Application Progress page, you can also preview the data flowing from the source to the target by clicking on the ‘Wave’ icon and then the ‘Eye’ icon located between the source and target components. This is one sample of the raw data:

```text
id | sepal_length | sepal_width | petal_length | petal_width | species
1  | 5.1          | 3.5         | 1.4          | NULL        | "1"
```

This is the processed data after undergoing the CQ transformations. Please observe how we transformed the NULL value in the petal_width to 0.0 and changed the numeric class ‘1’ to ‘setosa’ for the species.

```text
id | sepal_length | sepal_width | petal_length | petal_width | species
1  | 5.1          | 3.5         | 1.4          | 0.0         | "setosa"
```

Section 5: Building a BigQuery ML model

With your data flowing seamlessly into BigQuery, it’s time to harness the power of Google Cloud’s machine learning service. BigQuery ML provides a user-friendly environment for creating machine learning models, without the need for extensive coding or external tools. We provide you with step-by-step instructions on building a logistic regression model within BigQuery. This includes examples of model creation, training, and making predictions, giving you a comprehensive overview of the process.

1. Verify that the data has been populated correctly in the BigQuery iris_dataset table. Note that ‘ancient-yeti-175123’ represents the name of our project, and ‘DMS_SAMPLE’ is the designated dataset. It is important to acknowledge that individual project and dataset names may vary.

2. Create a logistic regression model from the iris_dataset table by executing this query:

Logistic regression is a statistical method used for classification tasks, making it suitable for predicting outcomes that take one of a fixed set of possible values. In the context of our iris dataset, logistic regression can be used to predict the probability of a given iris flower belonging to a particular species based on its features. This model is particularly useful when dealing with problems where the dependent variable is categorical, providing valuable insights into classification scenarios.
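A statement consistent with the options and columns that the breakdown in this section lists could be assembled as in the sketch below. It is a reconstruction under that assumption, not necessarily the exact statement used here; the helper name is hypothetical, and the project, dataset, and model names are the ones that appear in the evaluation query later in this section.

```python
def build_create_model_sql(model, source_table):
    """Assemble a BigQuery ML CREATE MODEL statement from the listed options."""
    options = [
        "model_type='logistic_reg'",
        "ls_init_learn_rate=.15",
        "l1_reg=1",
        "max_iterations=20",
        "input_label_cols=['species']",
        "data_split_method='seq'",
        "data_split_eval_fraction=0.3",
        "data_split_col='id'",
    ]
    columns = ["id", "sepal_length", "sepal_width",
               "petal_length", "petal_width", "species"]
    return (
        f"CREATE MODEL IF NOT EXISTS `{model}`\n"
        "OPTIONS (\n  " + ",\n  ".join(options) + "\n)\nAS\n"
        "SELECT " + ", ".join(columns) + f"\nFROM `{source_table}`"
    )


sql = build_create_model_sql(
    "ancient-yeti-175123.DMS_SAMPLE.striim_bq_model",
    "ancient-yeti-175123.DMS_SAMPLE.iris_dataset",
)
print(sql)
```

The printed statement can be pasted into the BigQuery console; replace the project and dataset names with your own.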

Here’s a breakdown of what this query is doing:

CREATE MODEL IF NOT EXISTS: This part of the query creates a machine learning model if it doesn’t already exist with the specified name, which is `striim_bq_model` in this case.

OPTIONS: This section defines various options and hyperparameters for the model. Here’s what each of these options means:

model_type='logistic_reg': Specifies that you are creating a logistic regression model.
ls_init_learn_rate=.15: Sets the initial learning rate for the model to 0.15.
l1_reg=1: Applies L1 regularization with a regularization strength of 1.
max_iterations=20: Limits the number of training iterations to 20.
input_label_cols=['species']: Specifies the target variable for the logistic regression, which is ‘species’ in this case.
data_split_method='seq': Uses a sequential data split method for model training and evaluation.
data_split_eval_fraction=0.3: Allocates 30% of the data for model evaluation.
data_split_col='id': Uses the ‘id’ column to split the data into training and evaluation sets.

AS: This keyword indicates the start of the SELECT statement, where you define the data source for your model.

SELECT: This part of the query selects the features and target variable from the iris_dataset table, which is the data used for training and evaluating the logistic regression model.

sepal_length, sepal_width, petal_length, and petal_width are the feature columns used for model training. id is selected because data_split_col uses it to split the data; BigQuery ML excludes the split column from the features.
species is the target variable or label column that the model will predict.

In summary, this query creates a logistic regression model named striim_bq_model using the iris_dataset data in BigQuery ML. It specifies various model settings and hyperparameters to train and evaluate the model. The model’s goal is to predict the ‘species’ based on the other specified columns as features.
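The sequential split options can be pictured with a short sketch: order the rows by the split column, then hold out the last 30% for evaluation. This is a simplified illustration in Python with hypothetical names; it does not reproduce the exact threshold rule BigQuery ML applies, for instance when split values are tied or NULL.

```python
def sequential_split(rows, split_key, eval_fraction=0.3):
    """Split rows into (training, evaluation) by order of the split column.

    Rows with the largest split values go to the evaluation set.
    """
    if not 0 <= eval_fraction <= 1:
        raise ValueError("eval_fraction must be between 0 and 1")
    ordered = sorted(rows, key=split_key)
    eval_count = round(len(ordered) * eval_fraction)
    cut = len(ordered) - eval_count
    return ordered[:cut], ordered[cut:]


rows = [{"id": i} for i in (5, 3, 1, 4, 2, 10, 9, 8, 7, 6)]
train, evaluation = sequential_split(rows, split_key=lambda r: r["id"])
print([r["id"] for r in train])
print([r["id"] for r in evaluation])
```

With a continuously replicated table, splitting on an increasing id means the most recently arrived rows are the ones held out, which is worth keeping in mind if the classes do not arrive in a mixed order.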

3. Evaluate the model by executing this query:

```sql
SELECT *
FROM ML.EVALUATE(
  MODEL `ancient-yeti-175123.DMS_SAMPLE.striim_bq_model`,
  (SELECT * FROM `ancient-yeti-175123.DMS_SAMPLE.iris_dataset`)
)
```

Evaluating the performance of an ML model is a critical step in gauging its effectiveness in generalizing to new, unseen data. This process includes quantifying the model’s predictive accuracy and gaining insights into its strengths and weaknesses.

This query performs an evaluation of the machine learning model called striim_bq_model that we previously created. Here’s a breakdown of what this query does:

SELECT * FROM ML.EVALUATE: This part of the query is using the ML.EVALUATE function, which is a BigQuery ML function used to assess the performance of a machine learning model. It evaluates the model’s predictions against the actual values in the test dataset.

(MODEL `ancient-yeti-175123.DMS_SAMPLE.striim_bq_model`, … ): Here, you specify the model to be evaluated. The model being evaluated is named striim_bq_model, and it resides in the dataset ancient-yeti-175123.DMS_SAMPLE.

(SELECT * FROM `ancient-yeti-175123.DMS_SAMPLE.iris_dataset`): This part of the query selects the data from the iris_dataset, which is used as the test dataset. The model’s predictions will be compared to the actual values in this dataset to assess its performance.

In summary, this query evaluates the striim_bq_model using the data from the iris_dataset to assess how well the model makes predictions. The results of this evaluation will provide insights into the model’s accuracy and performance.
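For a classification model, ML.EVALUATE reports metrics such as accuracy, precision, and recall. As a rough illustration of what those numbers mean, the sketch below computes them from a list of actual labels and a list of predicted labels. It is plain Python with hypothetical names and made-up sample labels, and it averages precision and recall equally across classes (macro-averaging), which is an assumption of this sketch rather than a statement about how BigQuery ML aggregates them.

```python
def classification_metrics(actual, predicted):
    """Compute accuracy plus macro-averaged precision and recall."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equal length")
    pairs = list(zip(actual, predicted))
    labels = sorted(set(actual) | set(predicted))
    precisions, recalls = [], []
    for label in labels:
        true_pos = sum(1 for a, p in pairs if a == label and p == label)
        predicted_as = sum(1 for _, p in pairs if p == label)
        actually = sum(1 for a, _ in pairs if a == label)
        # A class that was never predicted (or never present) scores 0.0.
        precisions.append(true_pos / predicted_as if predicted_as else 0.0)
        recalls.append(true_pos / actually if actually else 0.0)
    return {
        "accuracy": sum(1 for a, p in pairs if a == p) / len(pairs),
        "precision": sum(precisions) / len(labels),
        "recall": sum(recalls) / len(labels),
    }


# Made-up labels for illustration; not output from the model in this post.
actual = ["setosa", "setosa", "versicolor", "virginica"]
predicted = ["setosa", "versicolor", "versicolor", "virginica"]
metrics = classification_metrics(actual, predicted)
print(metrics)
```

Because the evaluation query above reads the whole iris_dataset table, it includes rows the model was trained on, so the reported metrics are likely to be more optimistic than metrics computed on held-out rows alone.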

4. Now, we will predict the type of Iris based on the features of sepal_length, petal_length, sepal_width, and petal_width using the model we trained in the previous step:

```sql
SELECT *
FROM ML.PREDICT(
  MODEL `ancient-yeti-175123.DMS_SAMPLE.striim_bq_model`,
  (SELECT 5.1 as sepal_length, 2.5 as petal_length, 3.0 as petal_width, 1.1 as sepal_width)
)
```

In the query results, you can see that the `striim_bq_model` provided us with information such as the predicted species, probabilities for the predicted species, and the feature column values used in our ML.PREDICT function.


Integrating Striim with BigQuery ML enhances the capabilities of data scientists and ML engineers by eliminating the need to repeatedly gather data from the source and execute the same data cleaning processes. Instead, they can focus solely on building and monitoring machine learning models. This powerful combination accelerates decision-making, enhances customer experiences, and streamlines operations. We invite you to explore this integration for your real-time machine learning projects, as it has the potential to revolutionize how you leverage data for business insights and predictions. Embrace the future of real-time data processing and machine learning with Striim and BigQuery ML!

Refer to this link to learn more about what you can do with Striim and Google Cloud.

We thank the many Google Cloud and Striim team members who contributed to this collaboration, especially Bruce Sandell and Purav Shah for their guidance during the process.

Source: Data Analytics

Pokerstars Gambling enterprise

Content Racy Las vegas Casino Our very own Better Gambling enterprises Which have one hundred 100 percent free Spins No deposit Incentive Codes No-deposit Added bonus 100 percent free Revolves Inside Michigan Totally free Ports Having Extra Advantages So you can Winnings Real cash encourages you to get already been with a big acceptance […]

Source: SmartData Collective

Fortunate Larrys pokies online Lobstermania 3 Slot

Content Far more Gambling enterprise Ports Courses Practical Gamble Articles Is intended To own Persons 18 Decades Or Old Lobstermania Slot machine Lobstermania Spread Pay Probabilities And you can Productivity It is because of your impossibility to improve the amount, the money try gone to live in an individual’s membership once the formation of profitable […]

Source: SmartData Collective