vanheusden.com

Calculating the Pearson correlation coefficient in SQL

I learned that you should always (well, mostly) try to let the database do most if not all of the work if possible for efficiency reasons.
After reading chapter 2 of the book Programming Collective Intelligence (by Toby Segaran) I picked up SQL for dummies and started typing.


Input data

For this query I created a relative simple table with only 3 columns: one for the user, one for the movie he/she saw and one with the rating as a floating point value:
+------+-----------------------+--------+
| user | movie                 | rating |
+------+-----------------------+--------+
| John | Donny Darko           |      3 | 
| John | Existenze             |      5 | 
| John | Sex and the city      |      4 | 
| John | The Hours             |    2.5 | 
| John | The Matrix            |      0 | 
| Tim  | Blade                 |      4 | 
| Tim  | Donny Darko           |      2 | 
| Tim  | Starwars 4            |      5 | 
| Tim  | The Matrix            |      5 | 
| Toby | Alien versus Predator |      3 | 
| Toby | Donny Darko           |      5 | 
| Toby | The Hours             |    4.5 | 
| Toby | The Matrix            |    2.5 | 
+------+-----------------------+--------+

The query explained

The query itself is relatively simple: one query goes through all data and calculates the sum/sum square/psum for each user/user combination:
+-------+-------+------+------+--------+--------+-------+---+
| user1 | user2 | sum1 | sum2 | sum1sq | sum2sq | psum  | n |
+-------+-------+------+------+--------+--------+-------+---+
| Tim   | John  |    7 |    3 |     29 |      9 |     6 | 2 | 
| Toby  | John  |   12 |  5.5 |   51.5 |  15.25 | 26.25 | 3 | 
| Toby  | Tim   |  7.5 |    7 |  31.25 |     29 |  22.5 | 2 | 
+-------+-------+------+------+--------+--------+-------+---+
The query around that then picks the results and calculates the Pearson coefficient from that:
+-------+-------+------------------+---+
| user1 | user2 | r                | n |
+-------+-------+------------------+---+
| Toby  | John  | 0.99942379712877 | 3 | 
| Tim   | John  |               -1 | 2 | 
| Toby  | Tim   |               -1 | 2 | 
+-------+-------+------------------+---+

The query

SELECT  
        user1, user2,
        ((psum - (sum1 * sum2 / n)) / sqrt((sum1sq - pow(sum1, 2.0) / n) * (sum2sq - pow(sum2, 2.0) / n))) AS r,
        n
FROM
        (SELECT 
                n1.user AS user1,
                n2.user AS user2,
                SUM(n1.rating) AS sum1,
                SUM(n2.rating) AS sum2,
                SUM(n1.rating * n1.rating) AS sum1sq,
                SUM(n2.rating * n2.rating) AS sum2sq,
                SUM(n1.rating * n2.rating) AS psum,
                COUNT(*) AS n
        FROM
                testdata AS n1
	LEFT JOIN
		testdata AS n2
	ON
		n1.movie = n2.movie
        WHERE   
                n1.user > n2.user
	GROUP BY
		n1.user, n2.user) AS step1
ORDER BY
        r DESC,
        n DESC



For contact info, see this page.