This is the second part of the ML series. For your convenience, you can find the other parts in the table of contents in Part 1 – Linear regression in MXNet.

Imagine that you have only a data warehouse with SQL capabilities to train and evaluate your models. Last time we ran Python code to calculate linear regression for the Iris dataset; today we are going to do exactly the same, but in SQL.

The code provided below is for MS SQL 2017.

Let’s start with the dataset and schema:
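The original listing is not reproduced here, but a minimal T-SQL sketch of such a schema might look like this (the table and column names are assumptions, not the post’s originals):

```sql
-- Hypothetical schema for the Iris data; names are illustrative only.
CREATE TABLE samples (
    id           INT IDENTITY(1, 1) PRIMARY KEY,
    sepal_length FLOAT NOT NULL,
    sepal_width  FLOAT NOT NULL,
    petal_length FLOAT NOT NULL,
    petal_width  FLOAT NOT NULL,
    species      VARCHAR(20) NOT NULL
);

-- The rows are then loaded with plain INSERTs, e.g.:
INSERT INTO samples (sepal_length, sepal_width, petal_length, petal_width, species)
VALUES (5.1, 3.5, 1.4, 0.2, 'setosa');
```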

Nothing fancy, just a table with the Iris data. Next, the training:

Whoa, looks terrible. Let’s go step by step.

First, we get a transformed table with the samples in randomized order and two new features, the same as in the Python code.
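A hedged sketch of what such a transformation could look like in T-SQL; which two features Part 1 engineered is not shown here, so the products below are pure assumptions, as is petal_width being the target:

```sql
-- Randomize the order and add two engineered features (assumed, for illustration).
SELECT
    ROW_NUMBER() OVER (ORDER BY NEWID()) AS row_no,  -- ordering by NEWID() shuffles the rows
    sepal_length,
    sepal_width,
    petal_length,
    sepal_length * petal_length AS x1,  -- assumed engineered feature
    sepal_width  * petal_length AS x2,  -- assumed engineered feature
    petal_width AS target               -- assumed target variable
FROM samples
```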

Next, we get training and test tables representing the datasets for training and evaluation respectively.
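For instance, a split by the randomized row number could be sketched like this (the 80/20 ratio over Iris’s 150 rows is an assumption):

```sql
-- Training set: the first 120 of the 150 shuffled rows (assumed 80/20 split).
SELECT * FROM transformed WHERE row_no <= 120;

-- Test set: the remaining 30 rows.
SELECT * FROM transformed WHERE row_no > 120;
```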

Next, the learning table. We want to represent the formula Aw + b - y, where A is the matrix of samples, w and b are the parameters we calculate with linear regression (representing the line), and y is the vector of target variables. gw# and bw# are variables representing the gradient, and mse is the mean squared error. dummy is just a helper variable we need in the window functions, since we cannot use grouping.
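To illustrate the shape of such a row (with only two features x1, x2 for brevity; the real query carries one w#/gw# pair per feature), the starting values might look like:

```sql
-- Hypothetical anchor row of the learning set: constant starting coefficients,
-- zero gradients, and the helper columns described above.
SELECT
    0                  AS iteration,
    CAST(1.0 AS FLOAT) AS w1,    -- coefficient for feature x1
    CAST(1.0 AS FLOAT) AS w2,    -- coefficient for feature x2
    CAST(1.0 AS FLOAT) AS b,     -- intercept
    CAST(0.0 AS FLOAT) AS gw1,   -- gradient for w1
    CAST(0.0 AS FLOAT) AS gw2,   -- gradient for w2
    CAST(0.0 AS FLOAT) AS bw,    -- gradient for b
    CAST(0.0 AS FLOAT) AS mse,   -- mean squared error of the iteration
    1                  AS dummy  -- constant used only for the window functions
```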

Next, we go with recursive CTE part. Let’s start from the most nested part.

Our initial learning values represent some coefficients with the gradients calculated in the last iteration. We could start with random values as well; here we start with constants. In the innermost view we do the actual training: for every feature we subtract the gradient multiplied by the learning rate (0.01 here), and this is how we calculate the new coefficients. For performance reasons we also calculate the highest iteration available so far.
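The update step could be sketched like this (two features for brevity; the learning itself is just the coefficient update, the rest is bookkeeping):

```sql
-- Gradient descent step: new coefficient = old coefficient - 0.01 * gradient.
SELECT
    L.iteration + 1     AS iteration,
    L.w1 - 0.01 * L.gw1 AS w1,
    L.w2 - 0.01 * L.gw2 AS w2,
    L.b  - 0.01 * L.bw  AS b,
    MAX(L.iteration) OVER (PARTITION BY L.dummy) AS max_iteration  -- highest iteration so far
FROM learning AS L
```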

Next, we join the training samples with the coefficients and calculate the actual L2 metric: we multiply the coefficients by the feature values and finally subtract the target variable. Just before that we filter only the last iteration (with WHERE L.iteration = max_iteration) to decrease the dataset size. We also limit the number of iterations.
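A sketch of that step, again with two features x1 and x2 standing in for the full feature set; the iteration limit of 100 is an assumption:

```sql
-- Join every training sample with the latest coefficients and compute
-- distance = x1*w1 + x2*w2 + b - y, i.e. one row of Aw + b - y.
SELECT
    T.row_no,
    L.iteration,
    T.x1 * L.w1 + T.x2 * L.w2 + L.b - T.target AS distance
FROM learning AS L
INNER JOIN training AS T
    ON 1 = 1                         -- effectively a cross join
WHERE L.iteration = L.max_iteration  -- keep only the newest coefficients
  AND L.iteration < 100              -- iteration limit (assumed value)
```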

Now we have the distance calculated. We compute the squared distance and the components of the gradient. Since we need to find the derivatives on our own (and we know the result, don’t we?), we multiply the distance by the features to get the partial derivatives for w, and take just the distance for the partial derivative for b.
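In SQL this could be sketched as below; the constant factor from differentiating the square is assumed to be folded into the learning rate:

```sql
-- Per-sample contributions: squared distance for the mse, and the gradient
-- components of the squared error (x*w + b - y)^2.
SELECT
    distance * distance AS squared_distance,  -- contribution to mse
    distance * x1       AS g1,                -- partial derivative w.r.t. w1
    distance * x2       AS g2,                -- partial derivative w.r.t. w2
    distance            AS gb                 -- partial derivative w.r.t. b
FROM distances  -- hypothetical view produced by the previous step
```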

Next, we do a lot of ugly casting to match the CTE requirement of uniform data types. We also calculate the averages of the gradients for every feature. We partition the dataset by iteration; in practice there is just one iteration left at this point, but we need some partition for syntax purposes. We could use Z.dummy as well.
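That stage might be sketched like this (column and view names are illustrative):

```sql
-- Average the per-sample components into gradients; the window runs over the
-- iteration (a single partition in practice), and the CASTs keep the column
-- types identical between the CTE anchor and the recursive member.
SELECT
    Z.iteration,
    Z.row_no,
    CAST(AVG(Z.g1) OVER (PARTITION BY Z.iteration) AS FLOAT)               AS gw1,
    CAST(AVG(Z.g2) OVER (PARTITION BY Z.iteration) AS FLOAT)               AS gw2,
    CAST(AVG(Z.gb) OVER (PARTITION BY Z.iteration) AS FLOAT)               AS bw,
    CAST(AVG(Z.squared_distance) OVER (PARTITION BY Z.iteration) AS FLOAT) AS mse
FROM components AS Z  -- hypothetical view produced by the previous step
```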

Ultimately, we just take the values from the first row, as all the rows carry the same values. We could skip this filtering, but then our dataset would be very big and training would take much longer.
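Sketched, the final filter is just:

```sql
-- Every row of an iteration now carries identical aggregates; keep one of them.
SELECT iteration, w1, w2, b, gw1, gw2, bw, mse, dummy
FROM aggregated   -- hypothetical view produced by the previous step
WHERE row_no = 1  -- any single row would do
```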

And here are the results of the fiddle:

You can evaluate the dataset now.
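Evaluation can be sketched as applying the final coefficients to the test table (two features again; names are assumptions):

```sql
-- Take the last iteration's coefficients and compute the MSE on the test set.
SELECT AVG(POWER(T.x1 * L.w1 + T.x2 * L.w2 + L.b - T.target, 2)) AS test_mse
FROM test AS T
CROSS JOIN (
    SELECT TOP 1 w1, w2, b
    FROM learning_results  -- hypothetical table holding the trained coefficients
    ORDER BY iteration DESC
) AS L;
```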

This query works but has a lot of drawbacks. In the next parts we will try to fix some of them.