Machine Learning using SQL commands on GCP

3 min readDec 14, 2020

Many of us might have came across Machine Learning using some scripting languages. But what if I tell you that you can build your machine learning models and predict your labels using SQL commands as well. Feeling excited, let’s get started.

So before going ahead, let us look at different cloud products and services that are offered by Google.

These products includes Cloud Storage, Pub/Sub, BigQuery, Cloud Functions, Datastore, Scheduler and many more. All these products and services are integrated under a platform known as Google Cloud Platform (GCP).

If we see BigQuery in a nutshell, it is an analytical data warehouse. Internally, it is a combination of Fast SQL Query Engine and Data warehouse. Both are integrated as one, using petabyte speed Google networking service.

If we see the BigQuery UI, it will look something like this :

In this article, we will be discussing about BigQuery ML and see some of the SQL commands that will meet our requirements.

Model Creation

CREATE MODEL statement will create a ML model as per specification given. Let us take an example here:

CREATE MODEL `<dataset_name>.<model_name>`
OPTIONS
(
model_type=’<type_of_model>’,
labels = [‘<label_name>’]
)
AS
#standardSQL to fetch the data for training the model

Change ‘dataset_name’, ‘model_name’, ‘type_of_model’, ‘label_name’ as per your model specification.

Here, standard SQL will be any ‘Select’ command that would meet your data requirement from the dataset.

Now, let’s see the better way of creating model into your dataset :

CREATE OR REPLACE MODEL `<dataset_name>.<model_name>`
OPTIONS
(
model_type=’<type_of_model>’,
labels = [‘<label_name>’]
)
AS
#standardSQL to fetch the data for training the model

Here, it will first check the presence of model in our dataset. If already present, then replace it, else create a new model.

Model Evaluation

In model evaluation, we basically try to minimize the False Positives and maximize the True Positives. Here, we check the ROC (Receiver Operating Characteristic) area under curve.

For this, we can apply condition on ‘roc_auc’ in BigQuery ML while evaluating the model. (‘roc_auc’ is actually a queryable field in BQ.)

So for evaluation, we have ML.EVALUATE() which will take the argument as <model_name> and <data>. Here, <data> = fetched data using standard SQL query from dataset.

SELECT
 roc_auc,
 CASE
 WHEN roc_auc > .9 THEN ‘good’
 WHEN roc_auc > .8 THEN ‘fair’
 WHEN roc_auc > .7 THEN ‘not great’
 ELSE ‘poor’ END AS model_quality
FROM
 ML.EVALUATE(
    MODEL <dataset_name>.<model_name>, (
    #standardSQL query to fetch the data for evaluation
   )
);

Here, you can apply your own condition for ‘roc_auc’.

This will return your ‘roc_auc’ value and you will be able to know how much your model has fitted into your data points.

If this model don’t work for you, try doing some feature engineering and fitting some good model or stack of models for better estimation.

Prediction

Once we are done with model evaluation, then comes the prediction phase.

For predicting label values in test dataset, we have ML.PREDICT() which takes two arguments <model_name> and <test_data> and will return the predicted value of the label in the test dataset. Here, <test_data> = fetched data using standard SQL statements from your dataset.

SELECT
*
FROM
ml.PREDICT(
  MODEL `<dataset_name>.<model_name>`,(
  #standardSQL query to fetch the test data
  )
);

With this, you are done with the prediction.

I hope you have got an idea of creating a Machine Learning model using standard SQL queries on BigQuery.

See you all in the next article. Till then, happy learning !!

Machine Learning using SQL commands on GCP

Written by Niket Kumar