
Introducing LLM fine-tuning and evaluation in BigQuery


BigQuery allows you to analyze your data using a range of large language models (LLMs) hosted in Vertex AI including Gemini 1.0 Pro, Gemini 1.0 Pro Vision and text-bison. These models work well for several tasks such as text summarization, sentiment analysis, etc. using only prompt engineering. However, in some scenarios, additional customization via model fine-tuning is needed, such as when the expected behavior of the model is hard to concisely define in a prompt, or when prompts do not produce expected results consistently enough. Fine-tuning also helps the model learn specific response styles (e.g., terse or verbose), new behaviors (e.g., answering as a specific persona), or to update itself with new information.

Today, we are announcing support for customizing LLMs in BigQuery with supervised fine-tuning. Supervised fine-tuning via BigQuery uses a dataset of example input text (the prompt) and the expected ideal output text (the label), and fine-tunes the model to mimic the behavior or task implied by these examples. Let's see how this works.

Feature walkthrough

To illustrate model fine-tuning, let's look at a classification problem using text data. We'll use a medical transcription dataset and ask our model to classify a given transcript into one of 17 categories, e.g. 'Allergy / Immunology', 'Dentistry', 'Cardiovascular / Pulmonary', etc.

Dataset

Our dataset is from mtsamples.com as provided on Kaggle. To fine-tune and evaluate our model, we first create an evaluation table and a training table in BigQuery using a subset of this data available in Cloud Storage as follows:

```sql
-- Create an eval table

LOAD DATA INTO
  bqml_tutorial.medical_transcript_eval
FROM FILES( format='NEWLINE_DELIMITED_JSON',
  uris = ['gs://cloud-samples-data/vertex-ai/model-evaluation/peft_eval_sample.jsonl'] )

-- Create a train table

LOAD DATA INTO
  bqml_tutorial.medical_transcript_train
FROM FILES( format='NEWLINE_DELIMITED_JSON',
  uris = ['gs://cloud-samples-data/vertex-ai/model-evaluation/peft_train_sample.jsonl'] )
```

The training and evaluation tables each have an 'input_text' column that contains the transcript and an 'output_text' column that contains the label, or ground truth.

Baseline performance of text-bison model

First, let’s establish a performance baseline for the text-bison model. You can create a remote text-bison model in BigQuery using a SQL statement like the one below. For more details on creating a connection and remote models refer to the documentation (1,2).

```sql
CREATE OR REPLACE MODEL
  `bqml_tutorial.text_bison_001` REMOTE
WITH CONNECTION `LOCATION.CONNECTION_ID`
OPTIONS (ENDPOINT = 'text-bison@001')
```

For inference on the model, we first construct a prompt by concatenating the task description for our model and the transcript from the tables we created. We then use the ML.GENERATE_TEXT function to get the output. While the model gets many classifications correct out of the box, it classifies some transcripts erroneously. Here’s a sample response where it classifies incorrectly.

```
Prompt

Please assign a label for the given medical transcript from among these labels [Allergy / Immunology, Autopsy, Bariatrics, Cardiovascular / Pulmonary, Chiropractic, Consult - History and Phy., Cosmetic / Plastic Surgery, Dentistry, Dermatology, Diets and Nutritions, Discharge Summary, ENT - Otolaryngology, Emergency Room Reports, Endocrinology, Gastroenterology, General Medicine, Hematology - Oncology, Hospice - Palliative Care, IME-QME-Work Comp etc., Lab Medicine - Pathology, Letters, Nephrology, Neurology, Neurosurgery, Obstetrics / Gynecology, Office Notes, Ophthalmology, Orthopedic, Pain Management, Pediatrics - Neonatal, Physical Medicine - Rehab, Podiatry, Psychiatry / Psychology, Radiology, Rheumatology, SOAP / Chart / Progress Notes, Sleep Medicine, Speech - Language, Surgery, Urology]. TRANSCRIPT:
INDICATIONS FOR PROCEDURE:, The patient has presented with atypical type right arm discomfort and neck discomfort. She had noninvasive vascular imaging demonstrating suspected right subclavian stenosis. Of note, there was bidirectional flow in the right vertebral artery, as well as 250 cm per second velocities in the right subclavian. Duplex ultrasound showed at least a 50% stenosis.,APPROACH:, Right common femoral artery.,ANESTHESIA:, IV sedation with cardiac catheterization protocol. Local infiltration with 1% Xylocaine.,COMPLICATIONS:, None.,ESTIMATED BLOOD LOSS:, Less than 10 ml.,ESTIMATED CONTRAST:, Less than 250 ml.,PROCEDURE PERFORMED:, Right brachiocephalic angiography, right subclavian angiography, selective catheterization of the right subclavian, selective aortic arch angiogram, right iliofemoral angiogram, 6 French Angio-Seal placement.,DESCRIPTION OF PROCEDURE:, The patient was brought to the cardiac catheterization lab in the usual fasting state. She was laid supine on the cardiac catheterization table, and the right groin was prepped and draped in the usual sterile fashion. 1% Xylocaine was infiltrated into the right femoral vessels. Next, a #6 French sheath was introduced into the right femoral artery via the modified Seldinger technique.,AORTIC ARCH ANGIOGRAM:, Next, a pigtail catheter was advanced to the aortic arch. Aortic arch angiogram was then performed with injection of 45 ml of contrast, rate of 20 ml per second, maximum pressure 750 PSI in the 4 degree LAO view.,SELECTIVE SUBCLAVIAN ANGIOGRAPHY:, Next, the right subclavian was selectively cannulated. It was injected in the standard AP, as well as the RAO view. Next pull back pressures were measured across the right subclavian stenosis. No significant gradient was measured.,ANGIOGRAPHIC DETAILS:, The right brachiocephalic artery was patent. The proximal portion of the right carotid was patent. The proximal portion of the right subclavian prior to the origin of the vertebral and the internal mammary showed 50% stenosis.,IMPRESSION:,1. Moderate grade stenosis in the right subclavian artery.,2. Patent proximal edge of the right carotid.

Response

Radiology
```

In the above case the correct classification should have been 'Cardiovascular / Pulmonary'.

Metrics-based evaluation for the base model

To perform a more robust evaluation of the model's performance, you can use BigQuery's ML.EVALUATE function to compute metrics on how the model responses compare against the ideal responses from a test/eval dataset. You can do so as follows:

```sql
-- Evaluate base model

SELECT
  *
FROM
  ml.evaluate(MODEL bqml_tutorial.text_bison_001,
    (
    SELECT
      CONCAT("Please assign a label for the given medical transcript from among these labels [Allergy / Immunology, Autopsy, Bariatrics, Cardiovascular / Pulmonary, Chiropractic, Consult - History and Phy., Cosmetic / Plastic Surgery, Dentistry, Dermatology, Diets and Nutritions, Discharge Summary, ENT - Otolaryngology, Emergency Room Reports, Endocrinology, Gastroenterology, General Medicine, Hematology - Oncology, Hospice - Palliative Care, IME-QME-Work Comp etc., Lab Medicine - Pathology, Letters, Nephrology, Neurology, Neurosurgery, Obstetrics / Gynecology, Office Notes, Ophthalmology, Orthopedic, Pain Management, Pediatrics - Neonatal, Physical Medicine - Rehab, Podiatry, Psychiatry / Psychology, Radiology, Rheumatology, SOAP / Chart / Progress Notes, Sleep Medicine, Speech - Language, Surgery, Urology]. ", input_text) AS input_text,
      output_text
    FROM
      `bqml_tutorial.medical_transcript_eval` ),
    STRUCT("classification" AS task_type))
```

In the code above, we provided the evaluation table as input and chose 'classification' as the task type for evaluating the model. We left the other inference parameters at their defaults, but they can be modified for the evaluation.

The evaluation metrics that are returned are computed for each class (label). The results look like the following:

Focusing on the F1 score (harmonic mean of precision and recall), you can see that the model performance varies between classes. For example, the baseline model performs well for ‘Autopsy’, ‘Diets and Nutritions’, and ‘Dentistry’, but performs poorly for ‘Consult – History and Phy.’, ‘Chiropractic’, and ‘Cardiovascular / Pulmonary’ classes.
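To make the metric concrete, here is a small Python sketch of the per-class F1 score (the harmonic mean of precision and recall) and its macro average. The precision/recall values below are made up for illustration; the real numbers come from ML.EVALUATE.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical per-class (precision, recall) pairs keyed by label.
per_class = {
    "Autopsy": (0.90, 0.95),
    "Dentistry": (0.88, 0.80),
    "Cardiovascular / Pulmonary": (0.40, 0.35),
}

f1_scores = {label: f1(p, r) for label, (p, r) in per_class.items()}

# Macro F1: simple unweighted average of per-class F1 scores.
macro_f1 = sum(f1_scores.values()) / len(f1_scores)
```

A model can have a high F1 on easy, well-represented classes while macro F1 stays low, which is why the per-class breakdown above is worth inspecting.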

Now let’s fine-tune our model and see if we can improve on this baseline performance.

Creating a fine-tuned model

Creating a fine-tuned model in BigQuery is simple. You can perform fine-tuning by specifying your training data, with 'prompt' and 'label' columns, in the CREATE MODEL statement. We use the same prompt for fine-tuning that we used in the evaluation earlier. Create a fine-tuned model as follows:

```sql
-- Fine-tune a text-bison model

CREATE OR REPLACE MODEL
  `bqml_tutorial.text_bison_001_medical_transcript_finetuned` REMOTE
WITH CONNECTION `LOCATION.CONNECTION_ID`
OPTIONS (endpoint="text-bison@001",
  max_iterations=300,
  data_split_method="no_split") AS
SELECT
  CONCAT("Please assign a label for the given medical transcript from among these labels [Allergy / Immunology, Autopsy, Bariatrics, Cardiovascular / Pulmonary, Chiropractic, Consult - History and Phy., Cosmetic / Plastic Surgery, Dentistry, Dermatology, Diets and Nutritions, Discharge Summary, ENT - Otolaryngology, Emergency Room Reports, Endocrinology, Gastroenterology, General Medicine, Hematology - Oncology, Hospice - Palliative Care, IME-QME-Work Comp etc., Lab Medicine - Pathology, Letters, Nephrology, Neurology, Neurosurgery, Obstetrics / Gynecology, Office Notes, Ophthalmology, Orthopedic, Pain Management, Pediatrics - Neonatal, Physical Medicine - Rehab, Podiatry, Psychiatry / Psychology, Radiology, Rheumatology, SOAP / Chart / Progress Notes, Sleep Medicine, Speech - Language, Surgery, Urology]. ", input_text) AS prompt,
  output_text AS label
FROM
  `bqml_tutorial.medical_transcript_train`
```

The CONNECTION you use to create the fine-tuned model should have (a) the Storage Object User and (b) the Vertex AI Service Agent roles attached. In addition, your Compute Engine (GCE) default service account should have Editor access to the project. Refer to the documentation for guidance on working with BigQuery connections.

BigQuery performs model fine-tuning using a technique known as Low-Rank Adaptation (LoRA). LoRA tuning is a parameter-efficient tuning (PET) method that freezes the pretrained model weights and injects trainable rank-decomposition matrices into each layer of the Transformer architecture to reduce the number of trainable parameters. The model fine-tuning itself happens on Vertex AI compute, and you have the option to choose GPUs or TPUs as accelerators. You are billed by BigQuery for the data scanned or slots used, as well as by Vertex AI for the Vertex AI resources consumed. The fine-tuning job creates a new model endpoint that represents the learned weights. The Vertex AI inference charges you incur when querying the fine-tuned model are the same as for the baseline model.
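As a rough NumPy sketch of the LoRA idea (illustrative only, not the actual Vertex AI implementation; the dimensions and rank below are made up): the pretrained weight matrix W is frozen, and only two small rank-r matrices A and B are trained, so the effective weight becomes W + B @ A.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r = 1024, 1024, 8          # hypothetical layer dims and a small adapter rank
W = rng.normal(size=(d, k))      # frozen pretrained weights
B = np.zeros((d, r))             # trainable, zero-initialized (standard LoRA init)
A = rng.normal(size=(r, k))      # trainable

def lora_forward(x: np.ndarray) -> np.ndarray:
    # Frozen path plus low-rank update; with B = 0 the output equals the base model's.
    return x @ (W + B @ A).T

trainable = B.size + A.size      # 2 * d * r parameters actually updated
full = W.size                    # d * k parameters in full fine-tuning
# Here the adapter trains 64x fewer parameters than full fine-tuning.
```

Because only A and B are stored per tuned model, serving a LoRA-tuned endpoint costs roughly the same as the base model, which matches the pricing note above.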

This fine-tuning job may take a couple of hours to complete, varying based on training options such as ‘max_iterations’. Once completed, you can find the details of your fine-tuned model in the BigQuery UI, where you will see a different remote endpoint for the fine-tuned model.

Endpoint for the baseline model vs. a fine-tuned model.

Currently, BigQuery supports fine-tuning of text-bison-001 and text-bison-002 models.

Evaluating performance of fine-tuned model

You can now generate predictions from the fine-tuned model using a query such as the following:

```sql
SELECT
  ml_generate_text_llm_result,
  label,
  prompt
FROM
  ml.generate_text(MODEL bqml_tutorial.text_bison_001_medical_transcript_finetuned,
    (
    SELECT
      CONCAT("Please assign a label for the given medical transcript from among these labels [Allergy / Immunology, Autopsy, Bariatrics, Cardiovascular / Pulmonary, Chiropractic, Consult - History and Phy., Cosmetic / Plastic Surgery, Dentistry, Dermatology, Diets and Nutritions, Discharge Summary, ENT - Otolaryngology, Emergency Room Reports, Endocrinology, Gastroenterology, General Medicine, Hematology - Oncology, Hospice - Palliative Care, IME-QME-Work Comp etc., Lab Medicine - Pathology, Letters, Nephrology, Neurology, Neurosurgery, Obstetrics / Gynecology, Office Notes, Ophthalmology, Orthopedic, Pain Management, Pediatrics - Neonatal, Physical Medicine - Rehab, Podiatry, Psychiatry / Psychology, Radiology, Rheumatology, SOAP / Chart / Progress Notes, Sleep Medicine, Speech - Language, Surgery, Urology]. ", input_text) AS prompt,
      output_text AS label
    FROM
      `bqml_tutorial.medical_transcript_eval`
    ),
    STRUCT(TRUE AS flatten_json_output))
```

Let us look at the response to the sample prompt we evaluated earlier. Using the same prompt, the model now classifies the transcript as ‘Cardiovascular / Pulmonary’ — the correct response.

Metrics-based evaluation for the fine-tuned model

Now, we will compute metrics on the fine-tuned model using the same evaluation data and the same prompt we previously used for evaluating the base model.

```sql
-- Evaluate fine-tuned model

SELECT
  *
FROM
  ml.evaluate(MODEL bqml_tutorial.text_bison_001_medical_transcript_finetuned,
    (
    SELECT
      CONCAT("Please assign a label for the given medical transcript from among these labels [Allergy / Immunology, Autopsy, Bariatrics, Cardiovascular / Pulmonary, Chiropractic, Consult - History and Phy., Cosmetic / Plastic Surgery, Dentistry, Dermatology, Diets and Nutritions, Discharge Summary, ENT - Otolaryngology, Emergency Room Reports, Endocrinology, Gastroenterology, General Medicine, Hematology - Oncology, Hospice - Palliative Care, IME-QME-Work Comp etc., Lab Medicine - Pathology, Letters, Nephrology, Neurology, Neurosurgery, Obstetrics / Gynecology, Office Notes, Ophthalmology, Orthopedic, Pain Management, Pediatrics - Neonatal, Physical Medicine - Rehab, Podiatry, Psychiatry / Psychology, Radiology, Rheumatology, SOAP / Chart / Progress Notes, Sleep Medicine, Speech - Language, Surgery, Urology]. ", input_text) AS prompt,
      output_text AS label
    FROM
      `bqml_tutorial.medical_transcript_eval`), STRUCT("classification" AS task_type))
```

The metrics from the fine-tuned model are below. Even though the fine-tuning (training) dataset we used for this blog contained only 519 examples, we already see a marked improvement in performance. F1 scores on the labels where the model had performed poorly earlier have improved, with the "macro" F1 score (a simple average of the F1 scores across all labels) jumping from 0.54 to 0.66.

Ready for inference

The fine-tuned model can now be used for inference using the ML.GENERATE_TEXT function, which we used in the previous steps to get the sample responses. You don’t need to manage any additional infrastructure for your fine-tuned model and you are charged the same inference price as you would have incurred for the base model.

To try fine-tuning for text-bison models in BigQuery, check out the documentation. Have feedback or need fine-tuning support for additional models? Let us know at bqml-feedback@google.com.

Special thanks to Tianxiang Gao for his contributions to this blog.


Google Cloud offers new AI, cybersecurity, and data analytics training to unlock job opportunities


Google Cloud is on a mission to help everyone build the skills they need for in-demand cloud jobs. Today, we're excited to announce new learning opportunities that will help you gain these in-demand skills through new courses and certificates in AI, data analytics, and cybersecurity. Even better, we're hearing from Google Cloud customers that they are eager to consider certificate completers for roles they're actively hiring for, so don't delay and start your learning today.

Google Cloud offers new generative AI courses

Introduction to Generative AI

Demand for AI skills is exploding in the market. There has been a staggering 21x increase in job postings that include AI technologies in 2023.1 To help prepare you for these roles, we’re announcing new generative AI courses on YouTube and Google Cloud Skills Boost, from introductory level to advanced. Once you complete the hands-on training, you can show off your new skill badges to employers.

Introductory (no cost!): This training will get you started with the basics of generative AI and responsible AI.  

Intermediate: For application developers; you will learn how to use Gemini for Google Cloud to work faster across networking, security, and infrastructure.

Advanced: For AI/ML engineers; you will learn how to integrate multimodal prompts in Gemini into your workflow.

New AI-powered, employer-recognized Google Cloud Certificates

Gen AI has triggered massive demand for skilling, especially in the areas of cybersecurity and data analytics,2 where there are significant employment opportunities. In the U.S. alone:

There are over 505,000 open entry-level roles3 related to a Cloud Cybersecurity Analyst, with a median annual salary of $135,000.4

There are more than 725,000 open entry-level roles5 related to a Cloud Data Analyst, with a median annual salary of $85,700.6

Building on the success of the Grow with Google Career Certificates, our new Google Cloud Certificates in Data Analytics and Cybersecurity can help prepare you for these high-growth, entry-level cloud jobs. 

A gen AI-powered learning journey 

What better way to understand just how much AI can do for you than integrating it into your learning journey? You’ll get no-cost access to generative AI tools throughout your learning experience. For example, you can put your skills to use and rock your interviews with Interview Warmup, Google’s gen AI-powered interview prep tool.

Talent acquisition, reimagined 

And while we’re at it, we’ll help connect you to jobs. Our new Google Cloud Affiliate Employer program unlocks access for certificate completers to apply for jobs with some top cloud employers, like the U.S. Department of the Treasury, Rackspace, and Jack Henry.

We’re also taking it one step further. Together, with the employers in the affiliate program, we’re helping  reimagine talent acquisition through a new skills-based hiring effort. This new initiative uses Google Cloud technology to help move certificate completers through the hiring process. Here’s how it works: Certificate completers in select hiring locations will have the chance to take custom labs that represent on-the-job scenarios, specific to each employer partner. These labs will be considered the first stage in their hiring process. By matching candidates with the right skills to the right jobs, this initiative marks a major step forward in creating more access to job opportunities for cloud employers.

The U.S. Department of the Treasury will start using these new Google Cloud Certificates and labs for cyber and data analytics talent identification across the federal agency, per President Biden’s Executive Order on AI.

“In an age of rapid innovation and adoption of new technology offering the promise of improved productivity, it is imperative that we equip every worker with accessible training and development opportunities to understand and apply this new technology. We are partnering with Google to provide the new Cloud Certificates training for our current and future employees to accelerate their careers in cybersecurity and data analytics.” – Todd Conklin, Chief AI Officer and Deputy Assistant Secretary of Cybersecurity and Critical Infrastructure Protection, U.S. Department of the Treasury 

No-cost access for higher education institutions worldwide

To expand access to these programs, educational institutions, as well as government and nonprofit workforce development programs across the globe, can offer these new certificates and gen AI courses at no cost. Learn more and apply today.

And in the U.S., learners who successfully complete a Google Cloud Certificate can apply for college credit,7 to have a faster and more affordable pathway to a degree.

“Purdue Global’s students have benefited greatly from the strong working relationship between Purdue Global and Google. Together, they were the pioneers in stacking Grow with Google certificates into four types of degree-earning credit certificates over the past two years. We believe these new Google Cloud Cybersecurity and Data Analytics Certificates will equip our working adult learners with the essential skills to move forward and succeed in today’s cloud-driven market.” – Frank Dooley, Chancellor of Purdue Global

Take the next steps to upskill and identify cloud-skilled talent  

We’re helping to prepare new-to-industry talent for the most in-demand cloud jobs, expanding access to these opportunities globally, and pioneering a skills-based hiring effort with employers eager to hire them. Here’s how you can get started:

Learners: Preview the courses and certificates on Google Cloud YouTube and earn the full credential on Google Cloud Skills Boost to give yourself a head start as employers race to hire AI talent.

Higher education institutions and government / nonprofit workforce programs: Apply today to skill up your workforce at no cost. 

Employers: Express interest to become a Google Cloud Affiliate Employer and be considered for our skills-based hiring pilot to connect with cloud-skilled talent.

1. LinkedIn, Future of Work Report (2023)
2. CompTIA Survey (Feb 2024)
3. U.S. Bureau of Labor Statistics (2024)
4. (ISC)2 Cybersecurity Workforce Study (2022)
5. U.S. Bureau of Labor Statistics (2024)
6. U.S. Bureau of Labor Statistics (2024)
7. The Google Cloud Certificates offer a recommendation from the American Council on Education® of up to 10 college credits.


Introducing multimodal and structured data embedding support in BigQuery


Embeddings represent real-world objects, such as entities, text, images, or videos, as arrays of numbers (a.k.a. vectors) that machine learning models can easily process. Embeddings are the building blocks of many ML applications, such as semantic search, recommendations, clustering, outlier detection, and named entity extraction. Last year, we introduced support for text embeddings in BigQuery, allowing machine learning models to understand real-world data domains more effectively. Earlier this year, we introduced vector search, which lets you index and work with billions of embeddings and build generative AI applications on BigQuery.

At Next ’24, we announced further enhancement of embedding generation capabilities in BigQuery with support for:

Multimodal embeddings generation in BigQuery via Vertex AI’s multimodalembedding model, which lets you embed text and image data in the same semantic space

Embedding generation for structured data using PCA, Autoencoder or Matrix Factorization models that you train on your data in BigQuery

Multimodal embeddings

Multimodal embedding generates embedding vectors for text and image data in the same semantic space (vectors of items similar in meaning are closer together) and the generated embeddings have the same dimensionality (text and image embeddings are the same size). This enables a rich array of use cases such as embedding and indexing your images and then searching for them via text. 

You can start using multimodal embedding in BigQuery using the following simple flow. If you like, you can take a look at our overview video which walks through a similar example.

Step 0: Create an object table which points to your unstructured data
You can work with unstructured data in BigQuery via object tables. For example, if you have your images stored in a Google Cloud Storage bucket on which you want to generate embeddings, you can create a BigQuery object table that points to this data without needing to move it. 

To follow along with the steps in this blog, you will need to reuse an existing BigQuery CONNECTION or create a new one following the instructions here. Ensure that the principal of the connection has the 'Vertex AI User' role and that the Vertex AI API is enabled for your project. Once the connection is created, you can create an object table as follows:

```sql
CREATE OR REPLACE EXTERNAL TABLE
  `bqml_tutorial.met_images`
WITH CONNECTION `LOCATION.CONNECTION_ID`
OPTIONS
  ( object_metadata = 'SIMPLE',
    uris = ['gs://gcs-public-data--met/*']
  );
```

In this example, we are creating an object table that contains public domain art images from The Metropolitan Museum of Art (a.k.a. “The Met”) using a public Cloud Storage bucket that contains this data. The resulting object table has the following schema:

Let’s look at a sample of these images. You can do this using a BigQuery Studio Colab notebook by following instructions in this tutorial. As you can see, the images represent a wide range of objects and art pieces.

Image source: The Metropolitan Museum of Art

Now that we have the object table with images, let’s create embeddings for them.

Step 1: Create model
To generate embeddings, first create a BigQuery model that uses the Vertex AI hosted ‘multimodalembedding@001’ endpoint.

```sql
CREATE OR REPLACE MODEL
  bqml_tutorial.multimodal_embedding_model REMOTE
WITH CONNECTION `LOCATION.CONNECTION_ID`
OPTIONS (endpoint = 'multimodalembedding@001')
```

Note that while the multimodalembedding model supports embedding generation for text, it is specifically designed for cross-modal semantic search scenarios, such as searching images given text. For text-only use cases, we recommend using the textembedding-gecko model instead.

Step 2: Generate embeddings 
You can generate multimodal embeddings in BigQuery via the ML.GENERATE_EMBEDDING function. This function also works for generating text embeddings (via textembedding-gecko model) and structured data embeddings (via PCA, AutoEncoder and Matrix Factorization models). To generate embeddings, simply pass in the embedding model and the object table you created in previous steps to the ML.GENERATE_EMBEDDING function.

```sql
CREATE OR REPLACE TABLE `bqml_tutorial.met_image_embeddings`
AS
SELECT * FROM ML.GENERATE_EMBEDDING(
  MODEL `bqml_tutorial.multimodal_embedding_model`,
  TABLE `bqml_tutorial.met_images`)
WHERE content_type = 'image/jpeg'
LIMIT 10000
```

To reduce the tutorial’s runtime, we limit embedding generation to 10,000 images. This query will take 30 minutes to 2 hours to run. Once this step is completed you can see a preview of the output in BigQuery Studio. The generated embeddings have a dimension of 1408.

Step 3 (optional): Create a vector index on generated embeddings
While the embeddings generated in the previous step can be persisted and used directly in downstream models and applications, we recommend creating a vector index for improving embedding search performance and enabling the nearest-neighbor query pattern. You can learn more about vector search in BigQuery here.

```sql
-- Create a vector index on the embeddings

CREATE OR REPLACE VECTOR INDEX `met_images_index`
ON bqml_tutorial.met_image_embeddings(ml_generate_embedding_result)
OPTIONS(index_type = 'IVF',
  distance_type = 'COSINE')
```

Step 4: Use embeddings for text-to-image (cross-modality) search
You can now use these embeddings in your applications. For example, to search for “pictures of white or cream colored dress from victorian era” you first embed the search string like so:

```sql
-- Embed the search string

CREATE OR REPLACE TABLE `bqml_tutorial.search_embedding`
AS
SELECT * FROM ML.GENERATE_EMBEDDING(
  MODEL `bqml_tutorial.multimodal_embedding_model`,
  (
    SELECT "pictures of white or cream colored dress from victorian era" AS content
  )
)
```

You can now use the embedded search string to find similar (nearest) image embeddings as follows:

```sql
-- Use the embedded search string to search for images

CREATE OR REPLACE TABLE
  `bqml_tutorial.vector_search_results` AS
SELECT
  base.uri AS gcs_uri,
  distance
FROM
  VECTOR_SEARCH( TABLE `bqml_tutorial.met_image_embeddings`,
    "ml_generate_embedding_result",
    TABLE `bqml_tutorial.search_embedding`,
    "ml_generate_embedding_result",
    top_k => 5)
```

Step 5: Visualize results
Now let's visualize the results along with the computed distance and see how we performed on the search query "pictures of white or cream colored dress from victorian era". Refer to the accompanying tutorial on how to render this output using a BigQuery notebook.

Image source: The Metropolitan Museum of Art

The results look quite good!

Wrapping up

In this blog, we demonstrated a common vector search usage pattern but there are many other use cases for embeddings. For example, with multimodal embeddings you can perform zero-shot classification of images by converting a table of images and a separate table containing sentence-like labels to embeddings. You can then classify images by computing distance between images and each descriptive label’s embedding. You can also use these embeddings as input for training other ML models, such as clustering models in BigQuery to help you discover hidden groupings in your data. Embeddings are also useful wherever you have free text input as a feature, for example, embeddings of user reviews or call transcripts can be used in a churn prediction model, embeddings of images of a house can be used as input features in a price prediction model etc. You can even use embeddings instead of categorical text data when such categories have semantic meaning, for example, product categories in a deep-learning recommendation model.

In addition to multimodal and text embeddings, BigQuery also supports generating embeddings on structured data using PCA, AUTOENCODER and Matrix Factorization models that have been trained on your data in BigQuery. These embeddings have a wide range of use cases. For example, embeddings from PCA and AUTOENCODER models can be used for anomaly detection (embeddings further away from other embeddings are deemed anomalies) and as input features to other models, for example, a sentiment classification model trained on embeddings from an autoencoder. Matrix Factorization models are classically used for recommendation problems, and you can use them to generate user and item embeddings. Then, given a user embedding you can find the nearest item embeddings and recommend these items, or cluster users so that they can be targeted with specific promotions.
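To make the anomaly-detection idea concrete, here is a hedged NumPy sketch (plain SVD-based PCA, not BigQuery's PCA model; the data is synthetic): fit PCA on presumed-clean rows, embed every row into a low-dimensional space, reconstruct, and flag rows with large reconstruction error as anomalies.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic "structured" rows that live near a 2-d subspace, plus one injected outlier.
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 10)) + 0.01 * rng.normal(size=(200, 10))
X[0] = 10.0  # injected anomaly: far from the subspace the rest of the data spans

# Fit PCA (via SVD) on the presumed-clean rows and keep 2 components as the "embedding".
mu = X[1:].mean(axis=0)
_, _, Vt = np.linalg.svd(X[1:] - mu, full_matrices=False)
components = Vt[:2]

# Embed every row, reconstruct, and score each row by its reconstruction error.
embedding = (X - mu) @ components.T
errors = np.linalg.norm((X - mu) - embedding @ components, axis=1)

anomaly_idx = int(np.argmax(errors))  # the injected row stands out
```

The same embeddings could also feed a downstream model, which is the second usage pattern described above.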

To generate such embeddings, first use the CREATE MODEL statement to create a PCA, autoencoder, or matrix factorization model over your data, and then call the ML.GENERATE_EMBEDDING function with that model and an input table to generate embeddings for the data.
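The two steps above can be sketched as follows. This is a minimal sketch using a PCA model; the dataset, model, and table names (`my_dataset.features`, `my_dataset.pca_model`) are illustrative assumptions.

```sql
-- Step 1: train a PCA model on a table of numeric features.
CREATE OR REPLACE MODEL `my_dataset.pca_model`
OPTIONS (
  model_type = 'PCA',
  num_principal_components = 3
) AS
SELECT * FROM `my_dataset.features`;

-- Step 2: generate an embedding (the principal-component projection) per row.
SELECT *
FROM ML.GENERATE_EMBEDDING(
  MODEL `my_dataset.pca_model`,
  TABLE `my_dataset.features`);
```

The same two-step pattern applies to autoencoder and matrix factorization models, with the corresponding `model_type` and training options.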

Getting started

Support for multimodal embeddings and for embeddings on structured data in BigQuery is now available in preview. Get started by following our documentation and tutorials. Have feedback? Let us know what you think at bqml-feedback@google.com.


Announcing Delta Lake support for BigQuery


Delta Lake is an open-source optimized storage layer that provides a foundation for tables in lakehouses and brings reliability and performance improvements to existing data lakes. It sits on top of your data lake storage (such as cloud object stores) and provides a performant and scalable metadata layer on top of data stored in the Parquet format.

Organizations use BigQuery to manage and analyze all data types, structured and unstructured, with fine-grained access controls. In the past year, customer use of BigQuery to process multiformat, multicloud, and multimodal data using BigLake has grown over 60x. Support for open table formats gives you the flexibility to use existing open source and legacy tools while getting the benefits of an integrated data platform. This is enabled via BigLake — a storage engine that allows you to store data in open file formats on cloud object stores such as Google Cloud Storage, and run Google-Cloud-native and open-source query engines on it in a secure, governed, and performant manner. BigLake unifies data warehouses and lakes by providing an advanced, uniform data governance model. 

This week at Google Cloud Next ’24, we announced that this support now extends to the Delta Lake format, enabling you to query Delta Lake tables stored in Cloud Storage or Amazon Web Services S3 directly from BigQuery, without having to export or copy the data, or use manifest files to query it.

Why is this important? 

If you have existing dependencies on Delta Lake and prefer to continue utilizing Delta Lake, you can now leverage BigQuery native support. Google Cloud provides an integrated and price-performant experience for Delta Lake workloads, encompassing unified data management, centralized security, and robust governance. Many customers already harness the capabilities of Dataproc or Serverless Spark to manage Delta Lake tables on Cloud Storage. Now, BigQuery’s native Delta Lake support enables seamless delivery of data for downstream applications such as business intelligence, reporting, as well as integration with Vertex AI. This lets you do a number of things, including: 

Build a secure and governed lakehouse with BigLake’s fine-grained security model

Securely exchange Delta Lake data using Analytics Hub 

Run data science workloads on Delta Lake using BigQuery ML and Vertex AI 

How to use Delta Lake with BigQuery

Delta Lake tables follow the same table creation process as BigLake tables. 

Required roles

To create a BigLake table, you need the following BigQuery identity and access management (IAM) permissions: 

bigquery.tables.create 

bigquery.connections.delegate

Prerequisites

Before you create a BigLake table, you need to have a dataset and a Cloud resource connection that can access Cloud Storage.

Table creation using DDL

Here is the DDL statement to create a Delta Lake table:

```sql
CREATE EXTERNAL TABLE `PROJECT_ID.DATASET.DELTALAKE_TABLE_NAME`
WITH CONNECTION `PROJECT_ID.REGION.CONNECTION_ID`
OPTIONS (
  format = "DELTA_LAKE",
  uris = ['DELTA_TABLE_GCS_BASE_PATH']
);
```

Querying Delta Lake tables

After creating a Delta Lake BigLake table, you can query it using GoogleSQL syntax, the same as you would a standard BigQuery table. For example:

```sql
SELECT FIELD1, FIELD2
FROM `PROJECT_ID.DATASET.DELTALAKE_TABLE_NAME`;
```

You can also enforce fine-grained security at the table level, including row-level and column-level security. For Delta Lake tables based on Cloud Storage, you can also use dynamic data masking.
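As a sketch of what row-level security looks like on such a table, the statement below restricts which rows a group can see. The policy name, group, and the `region` column are illustrative assumptions; the table name follows the placeholder used earlier.

```sql
-- Only members of the (hypothetical) us-sales group see US rows.
CREATE ROW ACCESS POLICY us_rows_only
ON `PROJECT_ID.DATASET.DELTALAKE_TABLE_NAME`
GRANT TO ('group:us-sales@example.com')
FILTER USING (region = 'US');
```

Column-level security and dynamic data masking are configured separately, via policy tags on the table schema.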

Conclusion

We believe that BigQuery’s support for Delta Lake is a major step forward for customers building lakehouses with Delta Lake. This integration makes it easier for you to get insights from your data and make data-driven decisions. We are excited to see how you use Delta Lake and BigQuery together to solve your business challenges. For more information on how to use Delta Lake with BigQuery, please refer to the documentation.

Acknowledgments: Mahesh Bogadi, Garrett Casto, Yuri Volobuev, Justin Levandoski, Gaurav Saxena, Manoj Gunti, Sami Akbay, Nic Smith and the rest of the BigQuery Engineering team.


BigQuery is now your single, unified AI-ready data platform


Eighty percent of data leaders believe that the lines between data and AI are blurring. Using large language models (LLMs) with your business data can give you a competitive advantage, but to realize this advantage, how you structure, prepare, govern, model, and scale your data matters. 

Tens of thousands of organizations already choose BigQuery and its integrated AI capabilities to power their data clouds. But in a data-driven AI era, organizations need a simple way to manage all of their data workloads. Today, we’re going a step further and unifying key Google Cloud data analytics capabilities under BigQuery, which is now the single, AI-ready data analytics platform. BigQuery incorporates key capabilities from multiple Google Cloud analytics services into a single product experience that offers the simplicity and scale you need to manage structured data in BigQuery tables, unstructured data like images, audio, and documents, and streaming workloads, all with the best price-performance.

BigQuery helps you:

Scale your data and AI foundation with support for all data types and open formats

Eliminate the need for upfront sizing and simply bring your data, at any scale, with a fully managed, serverless workload management model and universal metastore

Increase flexibility and agility for data teams to collaborate by bringing multiple languages and engines (SQL, Spark, Python) to a single copy of data 

Support the end-to-end data to AI lifecycle with built-in high availability, data governance, and enterprise security features

Simplify analytics with a unified product experience designed for all data users and AI-powered assistive and collaboration features

With your data in BigQuery, you can quickly and efficiently bring gen AI to your data and take advantage of LLMs. BigQuery simplifies multimodal generative AI for the enterprise by making Gemini models available through BigQuery ML and BigQuery DataFrames. It helps you unlock value from your unstructured data, with its expanded integration with Vertex AI’s document processing and speech-to-text APIs, and its vector capabilities to enable AI-powered search for your business data. The insights from combining your structured and unstructured data can be used to further fine-tune your LLMs.
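As a minimal sketch of calling a Gemini model through BigQuery ML, the statements below create a remote model over a Vertex AI endpoint and run it on an inline prompt. The dataset and model names are illustrative, and a Cloud resource connection (`PROJECT_ID.REGION.CONNECTION_ID`) is assumed to already exist.

```sql
-- Register a remote Gemini model in BigQuery ML.
CREATE OR REPLACE MODEL `my_dataset.gemini_model`
  REMOTE WITH CONNECTION `PROJECT_ID.REGION.CONNECTION_ID`
  OPTIONS (ENDPOINT = 'gemini-pro');

-- Generate text from SQL; flatten_json_output returns a plain string column.
SELECT ml_generate_text_llm_result
FROM ML.GENERATE_TEXT(
  MODEL `my_dataset.gemini_model`,
  (SELECT 'Summarize this review: great product, fast delivery.' AS prompt),
  STRUCT(TRUE AS flatten_json_output));
```

In practice the second argument is usually a table with a `prompt` column, so one statement can run inference over many rows at once.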

Support for all data types and open formats

Customers use BigQuery to manage all data types, structured and unstructured, with fine-grained access controls and integrated governance. BigLake, BigQuery’s unified storage engine, supports open table formats which let you use existing open-source and legacy tools to access structured and unstructured data while benefiting from an integrated data platform. BigLake supports all major open table formats, including Apache Iceberg, Apache Hudi and now Delta Lake natively integrated with BigQuery. It provides a fully managed experience for Iceberg, including DDL, DML and streaming support. 
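The fully managed Iceberg experience can be sketched with the preview DDL below. This is an assumed illustration of the preview syntax for BigQuery-managed Iceberg tables; the project, dataset, connection, and bucket names are placeholders.

```sql
-- Create a BigQuery-managed table stored in the Iceberg format on Cloud Storage.
CREATE TABLE `PROJECT_ID.DATASET.iceberg_orders` (
  order_id INT64,
  order_value FLOAT64
)
WITH CONNECTION `PROJECT_ID.REGION.CONNECTION_ID`
OPTIONS (
  file_format = 'PARQUET',
  table_format = 'ICEBERG',
  storage_uri = 'gs://my-bucket/iceberg/orders'
);
```

Once created, the table supports DML and streaming writes from BigQuery while remaining readable by Iceberg-compatible engines.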

Your data teams need access to a universal definition of data, whether in structured, unstructured, or open formats. To support this, we are launching BigQuery metastore, a managed, scalable runtime metadata service that provides universal table definitions and enforces fine-grained access control policies for analytics and AI runtimes. Supported runtimes include Google Cloud, open-source engines (through connectors), and third-party partner engines.

Use multiple languages and serverless engines on a single copy of data

Customers increasingly want to run multiple languages and engines on a single copy of their data, but the fragmented nature of today’s analytics and AI systems makes this challenging. You can now bring the programmatic power of Python and PySpark right to your data without having to leave BigQuery. 

BigQuery DataFrames brings the power of Python together with the scale and ease of BigQuery, with a minimal learning curve. It implements over 400 common APIs from pandas and scikit-learn by transparently and optimally converting method calls to BigQuery SQL and BigQuery ML SQL. This removes the limits of client-side processing, allowing data scientists to explore, transform, and train on terabytes of data with the processing horsepower of BigQuery.

Apache Spark has become a popular data processing runtime, especially for data engineering tasks. In fact, customers’ use of serverless Apache Spark in Google Cloud increased by over 500% in the past year.1 BigQuery’s newly integrated Spark engine lets you process data using PySpark as you do with SQL. Like the rest of BigQuery, the Spark engine is completely serverless — no need to manage compute infrastructure. You can even create stored procedures using PySpark and call them from your SQL-based pipelines. 
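A PySpark stored procedure of the kind described above can be sketched as follows. The procedure, connection, and table names are illustrative placeholders, and a Spark-enabled Cloud resource connection is assumed to exist.

```sql
-- Define a serverless PySpark stored procedure in BigQuery.
CREATE OR REPLACE PROCEDURE `PROJECT_ID.DATASET.my_spark_proc`()
WITH CONNECTION `PROJECT_ID.REGION.CONNECTION_ID`
OPTIONS (engine = 'SPARK', runtime_version = '1.1')
LANGUAGE PYTHON AS R"""
from pyspark.sql import SparkSession

# The Spark session and BigQuery connector are provided by the runtime.
spark = SparkSession.builder.appName("bq-spark-proc").getOrCreate()
df = spark.read.format("bigquery").option("table", "DATASET.SOURCE_TABLE").load()
df.groupBy("category").count().show()
""";

-- Invoke it from SQL, just like any other stored procedure.
CALL `PROJECT_ID.DATASET.my_spark_proc`();
```

Because the engine is serverless, the `CALL` provisions Spark capacity on demand; there is no cluster to create or tear down.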

Make decisions and feed ML models in near real-time

Data teams are also increasingly being asked to deliver real-time analytics and AI solutions, reducing the time between signal, insight, and action. BigQuery now makes real-time streaming data processing easy with new support for continuous queries: unbounded SQL statements that process data the moment it arrives. BigQuery continuous queries can feed downstream SaaS applications, like Salesforce, with the real-time enterprise knowledge of your data and AI platform. In addition, to support open-source streaming workloads, we are announcing a preview of Apache Kafka for BigQuery. Customers can use Apache Kafka to manage streaming data workloads and feed ML models without worrying about version upgrades, rebalancing, monitoring, and other operational headaches.
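A continuous query can be sketched as below. This is an illustrative sketch, assuming the statement is submitted as a continuous-mode job; the dataset, table, topic, and `order_value` column are placeholder assumptions.

```sql
-- Watch an ingestion table and publish each new high-value order to Pub/Sub.
EXPORT DATA
OPTIONS (
  format = 'CLOUD_PUBSUB',
  uri = 'https://pubsub.googleapis.com/projects/PROJECT_ID/topics/orders-topic'
) AS
SELECT TO_JSON_STRING(t) AS message
FROM `my_dataset.orders` AS t
WHERE t.order_value > 1000;
```

Unlike a normal query, this statement never finishes: it keeps running and emits a message for each qualifying row as it arrives.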

Scale analytics and AI with governance and enterprise features 

To make it easier for you to manage, discover, and govern data, last year we brought data governance capabilities like data quality, lineage and profiling from Dataplex directly into BigQuery. We will be expanding BigQuery to include Dataplex’s enhanced search capabilities, powered by a unified metadata catalog, to help data users discover data and AI assets, including models and datasets from Vertex AI. Column-level lineage tracking in BigQuery is now available in preview, which will be followed by a preview for lineage for Vertex AI pipelines. Governance rules for fine-grained access control are also in preview, allowing businesses to define governance policies based on metadata.

For customers looking for enhanced redundancy across geographic regions, we are introducing managed disaster recovery for BigQuery. This feature, now in preview, offers automated failover of compute and storage and will offer a new cross-regional service level agreement (SLA) tailored for business-critical workloads. The managed disaster recovery feature provides standby compute capacity in the secondary region included in the price of BigQuery’s Enterprise Plus edition.

A unified experience for all data users

As Google Cloud’s single integrated platform for data analytics, BigQuery unifies how data teams work together with BigQuery Studio. Now generally available, BigQuery Studio gives data teams a collaborative data workspace that all data practitioners can use to accelerate their data-to-AI workflows. BigQuery Studio lets you use SQL, Python, PySpark, and natural language in a single unified analytics workspace, regardless of the data’s scale, format, or location. All development assets in BigQuery Studio are enabled with full lifecycle capabilities, including team collaboration and version control. Since BigQuery Studio’s launch at Next ‘23, hundreds of thousands of users have actively used the new interface.2

Gemini in BigQuery for AI assistive and collaborative experiences

We announced several new innovations for Gemini in BigQuery that help data teams with AI-powered experiences for data preparation, analysis, and engineering, as well as intelligent recommendations to enhance user productivity and optimize costs. BigQuery data canvas, an AI-centric experience with natural language input, makes data discovery, exploration, and analysis faster and more intuitive. AI-augmented data preparation in BigQuery helps users cleanse and wrangle their data and build low-code visual data pipelines, or rebuild legacy pipelines. Gemini in BigQuery also helps you write and edit SQL or Python code using simple natural language prompts, referencing relevant schemas and metadata.

How Deutsche Telekom is innovating with the BigQuery platform

“Deutsche Telekom built a horizontally scalable data platform in an innovative way that was designed to meet our current and future business needs. With BigQuery at the center of our enterprise’s One Data Ecosystem, we created a unified approach to maintain a single source of truth while fostering de-centralized usage of data across all of our data teams. With BigQuery and Vertex AI, we built a governed and scalable space for data scientists to experiment and productionize AI models while maintaining data sovereignty and federated access controls. This has allowed us to quickly deploy practical usage of LLMs to turbocharge our data engineering life cycle and unleash new business opportunities.” – Ashutosh Mishra, VP of Data Architecture, Deutsche Telekom

Start building your AI-ready data platform

To learn more and start building your AI-ready data platform, start exploring the next generation of BigQuery today. Read more about the latest innovations for Gemini in BigQuery and an overview of what’s next for data analytics at Google Cloud.

1. Google internal data – YoY growth of data processed using Apache Spark on Google Cloud compared with Feb ‘23. 
2. Since the August 2023 announcement of BigQuery Studio, monthly active users have continued to grow.
