Introduction
In this article, we explore some simple statistical functions implementated in SQL, using the sqldf
package and mtcars
dataset.
Histogram 0-100, 100-200, 200-300, 300+
sqldf("SELECT COUNT(CASE WHEN hp >= 0 and hp < 100 THEN 1 END) AS _0_100,
COUNT(CASE WHEN hp >= 100 and hp < 200 THEN 1 END) AS _100_200,
COUNT(CASE WHEN hp >= 200 and hp < 300 THEN 1 END) AS _200_300,
COUNT(CASE WHEN hp >= 300 and hp < 999 THEN 1 END) AS _300_999
FROM mtcars
") %>% print(row.names=F)
## _0_100 _100_200 _200_300 _300_999
## 9 16 6 1
Simple Linear Regression (wt ~ hp)
slope = sqldf("SELECT
(SELECT SUM((hp-(SELECT AVG(hp) FROM mtcars))*(wt-(SELECT AVG(wt) FROM mtcars)))/(COUNT(hp)-1)
AS covariance FROM mtcars)/
(SELECT SUM((hp-(SELECT AVG(hp) FROM mtcars))*(hp-(SELECT AVG(hp) FROM mtcars)))/(COUNT(hp)-1)
AS variance FROM mtcars)
AS slope")
#intercept
intercept = sqldf("SELECT AVG(wt) as intercept FROM mtcars") - slope*sqldf("SELECT AVG(hp) FROM mtcars")
lm.fit <- lm(wt ~ hp, mtcars)
lm.intercept <- lm.fit$coefficients[[1]]
lm.slope <- lm.fit$coefficients[[2]]
## LM Slope = 0.00940096 , LM Intercept = 1.838247
## SQL Slope = 0.00940096 , SQL Intercept = 1.838247