Machine Learning Part 3 — Linear regression in SQL revisited

This is the third part of the ML series. For your convenience you can find other parts in the table of contents in Part 1 – Linear regression in MXNet

Last time we saw how to calculate linear regression for the Iris dataset. However, we had to hardcode all the featuers. Today we are going to make our query much more flexible. Let’s begin.

We start with the same schema as before. Now, the query, this time for PostgreSQL 10:

Uuu, terrible. Let’s go step by step.

First, constants is just a table with some numbers. We have 5 features so we have 5 rows there. This could be done much easier with recursive CTE or any other dynamic solution.

Next, extended: we just add two more features and randomize the rows.

training, test, and numbered are just tables for bookkeeping the samples.

pivoted_training: here comes some magic. We don’t want to have rows with all the features inside, we want to have one row for each sample’s feature. So we do the translation and emit rows with sample number, feature id, feature value, and target variable.

Next comes our recursive CTE for training. We start with some rows representing coefficients for each feature. We initialize w and b, as well as iteration and distances. We have dummy column again.

Next, we do the calculation. Let’s go from the middle.

We do the training in similar way as before. Assuming we have 100 samples, each has 5 features, the inner join target has 5\cdot i rows where i stands for iterations. We join this with samples (100 \cdot 5 rows) based on the feature id and maximum iteration. So finally, we should have 100 \cdot 5 rows per each iteration.

Next, we calculate the distance for each sample (this is the PARTITION BY K.sample part).

Next, we square the distance and calculate gradient coefficients for each sample.

Finally, we cast variables and calculate the final gradients by taking averages over features. We also calculate mse and we are almost done.

The only tricky part is how to get exactly 5 rows representing new coefficients. This is done by WHERE R.sample = 1 as for each sample we have exactly the same results so we can just take any of them.

Finally, we get our training results with SELECT * FROM learning. You can see it here:

Now you can evaluate the model.