Regression using the Housing data

Lampros Mouselimis

2019-04-14

The following examples illustrate the functionality of the KernelKnn package for regression tasks. I’ll use the Boston Housing data set, which is included in the package,

data(Boston, package = 'KernelKnn')

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...


When using an algorithm whose output depends on distance calculations (as is the case with k-nearest-neighbors), it is recommended to scale the data first,

X = scale(Boston[, -ncol(Boston)])
y = Boston[, ncol(Boston)]

# random split of data in train and test

spl_train = sample(1:length(y), round(length(y) * 0.75))
spl_test = setdiff(1:length(y), spl_train)
str(spl_train)
##  int [1:380] 9 57 407 476 41 305 281 430 177 380 ...
str(spl_test)
##  int [1:126] 2 3 12 13 14 23 27 28 36 38 ...
# evaluation metric

mse = function (y_true, y_pred) {
  
  out = mean((y_true - y_pred)^2)
  
  out
}
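
To see why scaling matters for distance-based methods, here is a minimal sketch on a toy matrix. The values and the names toy, d_raw and d_scaled are made up for illustration and are not taken from the Boston data. Before scaling, the feature with the larger numeric range decides which point is closest; after scaling, both features contribute.

```r
# Toy data (hypothetical values): 'tax' is in the hundreds, 'nox' is below 1
toy = rbind(a = c(tax = 300, nox = 0.40),
            b = c(tax = 310, nox = 0.90),
            c = c(tax = 315, nox = 0.41))

# Euclidean distances before and after scaling
d_raw = as.matrix(dist(toy))
d_scaled = as.matrix(dist(scale(toy)))

# nearest neighbour of point 'a' (column 1 is 'a' itself, so it is dropped)
nn_raw = names(which.min(d_raw['a', -1]))
nn_scaled = names(which.min(d_scaled['a', -1]))

cat('nearest to a (raw)    :', nn_raw, '\n')
cat('nearest to a (scaled) :', nn_scaled, '\n')
```

On the raw values point b is closest to a because only the 'tax' difference matters; once both columns have unit variance, the large 'nox' gap between a and b counts as well and c becomes the nearest point.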

The KernelKnn function

The KernelKnn function takes a number of arguments. For details on each argument, type ?KernelKnn::KernelKnn in the console.

A plain k-nearest-neighbors model can be run with weights_function = NULL (the parameter ‘regression’ should be set to TRUE for regression tasks),

library(KernelKnn)

preds_TEST = KernelKnn(X[spl_train, ], TEST_data = X[spl_test, ], y[spl_train], k = 5 , 
                       
                       method = 'euclidean', weights_function = NULL, regression = T)
str(preds_TEST)
##  num [1:126] 23.3 31.3 20.8 21.5 19.6 ...
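
Conceptually, unweighted k-nearest-neighbors regression predicts each test point as the mean response of its k closest training points. The base R sketch below illustrates that idea only. The function knn_reg_sketch and the toy data are hypothetical, and this is not how the package is implemented.

```r
# Minimal unweighted kNN regression with euclidean distance (illustrative only)
knn_reg_sketch = function(train_x, train_y, test_x, k = 5) {
  apply(test_x, 1, function(row) {
    # distance from this test row to every training row
    d = sqrt(colSums((t(train_x) - row)^2))
    # indices of the k closest training rows
    idx = order(d)[1:k]
    # prediction = plain average of their responses
    mean(train_y[idx])
  })
}

# Toy data: two well separated groups of three points each
train_x = cbind(c(0, 1, 2, 10, 11, 12), c(0, 1, 0, 10, 11, 10))
train_y = c(1, 2, 3, 10, 11, 12)
test_x = rbind(c(1, 0), c(11, 10))

preds_sketch = knn_reg_sketch(train_x, train_y, test_x, k = 3)
preds_sketch
```

Each test point lies inside one of the two groups, so with k = 3 its prediction is the mean response of that group.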


With transf_categ_cols = TRUE, categorical features are encoded either as dummy or as numeric features, depending on the number of unique values (here I convert the ‘chas’ and ‘rad’ features to factors so that the transf_categ_cols parameter applies to them),

apply(Boston, 2, function(x) length(unique(x)))
##    crim      zn   indus    chas     nox      rm     age     dis     rad 
##     504      26      76       2      81     446     356     412       9 
##     tax ptratio   black   lstat    medv 
##      66      46     357     455     229
tmp_bst = Boston
tmp_bst$chas = as.factor(tmp_bst$chas)
tmp_bst$rad = as.factor(tmp_bst$rad)

preds_TEST = KernelKnn(tmp_bst[spl_train, -ncol(tmp_bst)], 
                       
                       TEST_data = tmp_bst[spl_test, -ncol(tmp_bst)], 
                       
                       y[spl_train], k = 5 , method = 'euclidean', 
                       
                       regression = T, transf_categ_cols = T)
str(preds_TEST)
##  num [1:126] 20.7 22.9 20.2 24 21 ...
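
For readers unfamiliar with the two encodings mentioned above, the following base R sketch shows what a dummy encoding and a numeric encoding of a factor look like. The toy data frame is hypothetical, and the rule the package uses to pick between the encodings is not reproduced here.

```r
# Toy factors (hypothetical values mimicking 'chas' and 'rad')
toy = data.frame(chas = factor(c(0, 1, 0, 1)),
                 rad  = factor(c(1, 5, 24, 5)))

# dummy encoding: one indicator column per factor level
dummies = model.matrix(~ chas - 1, data = toy)
dummies

# numeric encoding, variant 1: the integer level codes of the factor
rad_codes = as.integer(toy$rad)

# numeric encoding, variant 2: the original values stored in the level labels
rad_values = as.numeric(as.character(toy$rad))

rad_codes
rad_values
```

Note that the two numeric variants differ: level codes only preserve the ordering of the levels, whereas converting through as.character() recovers the original values.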


There are two ways to use a kernel in the KernelKnn function. The first option is to choose one of the existing kernels (uniform, triangular, epanechnikov, biweight, triweight, tricube, gaussian, cosine, logistic, silverman, inverse, gaussianSimple, exponential). Here, I use the mahalanobis metric (which takes the covariance matrix of the data into account, at the cost of somewhat slower computation than the other distance metrics) and the biweight kernel, because this combination gave the best results in a search with my RandomSearchR package,

preds_TEST_biw = KernelKnn(X[spl_train, ], TEST_data = X[spl_test, ], y[spl_train], k = 5, 
                           
                           method = 'mahalanobis', weights_function = 'biweight', 
                           
                           regression = T, transf_categ_cols = F)
str(preds_TEST_biw)
##  num [1:126] 22.7 35.1 22.1 20.8 19.3 ...


The second option is to supply a user-defined kernel function. Here, I’ll pick the density function of the normal distribution with mean = 0.0 and standard deviation = 1.0 (the data are scaled to have mean zero and unit variance),

norm_kernel = function(W) {
  
  W = dnorm(W, mean = 0, sd = 1.0)
  
  W = W / rowSums(W)
  
  return(W)
}


preds_TEST_norm = KernelKnn(X[spl_train, ], TEST_data = X[spl_test, ], y[spl_train], k = 5,
                            
                            method = 'mahalanobis', weights_function = norm_kernel, 
                            
                            regression = T, transf_categ_cols = F)
str(preds_TEST_norm)
##  num [1:126] 23.6 31.1 22.3 22.1 20.3 ...
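
To make the effect of such a kernel concrete, the sketch below applies the same norm_kernel to a small, made-up matrix of neighbor distances (one row per test observation, one column per neighbor). The final step, a weighted sum of the neighbors' responses, is an assumption about how the weights are used and is shown for illustration only.

```r
# same kernel as above: normal density, then row-wise normalisation
norm_kernel = function(W) {
  W = dnorm(W, mean = 0, sd = 1.0)
  W / rowSums(W)
}

# hypothetical distances to k = 3 neighbors for 2 test observations
dists = matrix(c(0.1, 0.5, 2.0,
                 1.0, 1.0, 1.0), nrow = 2, byrow = TRUE)

# hypothetical responses of those neighbors
neigh_y = matrix(c(20, 22, 30,
                   10, 20, 30), nrow = 2, byrow = TRUE)

w = norm_kernel(dists)
rowSums(w)                      # every row of weights sums to 1

# assumed prediction rule: weighted sum of the neighbor responses
pred = rowSums(w * neigh_y)
pred
```

In the first row the closest neighbor receives the largest weight, so the prediction is pulled below the plain average of 24. In the second row all distances are equal, the weights are uniform, and the prediction equals the plain average of 20.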


The computations can be sped up by using the parameter threads (multiple cores can be run in parallel). There is also the option to exclude extrema (minimum and maximum distances) during the calculation of the k-nearest-neighbor distances using extrema = TRUE. The bandwidth of the existing kernels can be tuned using the h parameter.

K-nearest-neighbor calculations in the KernelKnn function can be accomplished using the following distance metrics: euclidean, manhattan, chebyshev, canberra, braycurtis, minkowski (by default the order ‘p’ of the minkowski metric equals k), hamming, mahalanobis, pearson_correlation, simple_matching_coefficient, jaccard_coefficient and Rao_coefficient. The last four are similarity measures and are appropriate for binary data [0,1].
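
As a quick reminder of how some of these metrics differ, the sketch below computes a few of them for a single pair of made-up points with base R's dist(). This is a stand-in used only for illustration, not the package's own distance code; note that base R calls the chebyshev distance 'maximum'.

```r
# two hypothetical points with three features each
pts = rbind(c(1, 2, 3),
            c(4, 6, 3))

metrics = c('euclidean', 'manhattan', 'maximum', 'canberra')

# one distance per metric, named after the metric
d = sapply(metrics, function(m) as.numeric(dist(pts, method = m)))
d

# minkowski distance of order p = 3
d_mink = as.numeric(dist(pts, method = 'minkowski', p = 3))
d_mink
```

The coordinate differences are 3, 4 and 0, so the euclidean distance is 5, the manhattan distance is 7 and the chebyshev distance is 4, which shows how strongly the choice of metric can change which neighbors count as close.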

I used my RandomSearchR package to search for good parameters for the KernelKnn function, and the following two parameter settings gave the best mean-squared-error,

k     method         kernel
9     mahalanobis    triweight
3     canberra       cosine

The KernelKnnCV function

I’ll use the KernelKnnCV function to calculate the mean-squared-error using 3-fold cross-validation for the previously mentioned parameter settings,

fit_cv_pair1 = KernelKnnCV(X, y, k = 9, folds = 3, method = 'mahalanobis', 
                           
                           weights_function = 'triweight', regression = T, 
                           
                           threads = 5, seed_num = 3)
str(fit_cv_pair1)
## List of 2
##  $ preds:List of 3
##   ..$ : num [1:168] 24.1 32.4 17.6 16.6 19.4 ...
##   ..$ : num [1:169] 30 21.8 21.5 20.5 22.6 ...
##   ..$ : num [1:169] 23.3 29.7 30.8 20.4 22.6 ...
##  $ folds:List of 3
##   ..$ fold_1: int [1:168] 2 5 8 9 14 23 24 29 34 35 ...
##   ..$ fold_2: int [1:169] 6 12 13 16 17 19 20 22 30 36 ...
##   ..$ fold_3: int [1:169] 1 3 4 7 10 11 15 18 21 25 ...
fit_cv_pair2 = KernelKnnCV(X, y, k = 3, folds = 3, method = 'canberra',
                           
                           weights_function = 'cosine', regression = T, 
                           
                           threads = 5, seed_num = 3)
str(fit_cv_pair2)


Each cross-validated object is a list of length 2: the first sublist holds the predictions for each fold, whereas the second holds the indices of the folds,

mse_pair1 = unlist(lapply(1:length(fit_cv_pair1$preds), 
                          
                          function(x) mse(y[fit_cv_pair1$folds[[x]]], 
                                          
                                          fit_cv_pair1$preds[[x]])))
mse_pair1
## [1] 18.17392 18.29160 11.01078
cat('mse for params_pair1 is :', mean(mse_pair1), '\n')
## mse for params_pair1 is : 15.82543
mse_pair2 = unlist(lapply(1:length(fit_cv_pair2$preds), 
                          
                          function(x) mse(y[fit_cv_pair2$folds[[x]]], 
                                          
                                          fit_cv_pair2$preds[[x]])))
mse_pair2
## [1] 26.84027 22.75759 18.53496
cat('mse for params_pair2 is :', mean(mse_pair2), '\n')
## mse for params_pair2 is : 22.71094
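
The per-fold computation above can also be wrapped in a small helper. The sketch below is one possible way to do it: cv_mse is a hypothetical name, and it relies only on the list structure shown by str() above (a preds sublist and a folds sublist). A tiny mock object stands in for a real KernelKnnCV result so that the example runs on its own.

```r
# per-fold mean-squared-error for an object with $preds and $folds
cv_mse = function(fit, y) {
  mapply(function(idx, pred) mean((y[idx] - pred)^2),
         fit$folds, fit$preds)
}

# mock object with the same structure as a cross-validated fit (made-up values)
y_toy = c(1, 2, 3, 4)
mock_fit = list(preds = list(c(1, 3), c(3, 6)),
                folds = list(fold_1 = c(1, 2), fold_2 = c(3, 4)))

fold_mse = cv_mse(mock_fit, y_toy)
fold_mse
mean(fold_mse)
```

With a real fit, cv_mse(fit_cv_pair1, y) would be expected to return the same per-fold values as the lapply version, with the fold names attached.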


Adding or multiplying kernels

The KernelKnn package also offers the option to combine existing kernels by adding or multiplying them. For instance, to multiply the tricube with the gaussian kernel, I’ll pass the character string “tricube_gaussian_MULT” to weights_function. To add the same kernels instead, the weights_function will be “tricube_gaussian_ADD”. I experimented with my RandomSearchR package, combining the different kernels, and the following two parameter settings gave the best results,


k      method         kernel
19     mahalanobis    triangular_triweight_MULT
18     mahalanobis    biweight_triweight_gaussian_MULT


fit_cv_pair1 = KernelKnnCV(X, y, k = 19, folds = 3, method = 'mahalanobis', 
                           
                           weights_function = 'triangular_triweight_MULT', 
                           
                           regression = T, threads = 5, seed_num = 3)
str(fit_cv_pair1)
## List of 2
##  $ preds:List of 3
##   ..$ : num [1:168] 24.3 32.4 17.8 17.1 19.5 ...
##   ..$ : num [1:169] 28.6 21.5 21.9 20.4 22.2 ...
##   ..$ : num [1:169] 23.2 27.6 31.1 20.7 22.4 ...
##  $ folds:List of 3
##   ..$ fold_1: int [1:168] 2 5 8 9 14 23 24 29 34 35 ...
##   ..$ fold_2: int [1:169] 6 12 13 16 17 19 20 22 30 36 ...
##   ..$ fold_3: int [1:169] 1 3 4 7 10 11 15 18 21 25 ...
fit_cv_pair2 = KernelKnnCV(X, y, k = 18, folds = 3, method = 'mahalanobis', 
                           
                           weights_function = 'biweight_triweight_gaussian_MULT', 
                           
                           regression = T, threads = 5, seed_num = 3)
str(fit_cv_pair2)
## List of 2
##  $ preds:List of 3
##   ..$ : num [1:168] 24.2 32.5 17.8 17.1 19.4 ...
##   ..$ : num [1:169] 28.9 21.7 21.8 20.4 22.1 ...
##   ..$ : num [1:169] 23.3 27.7 31.2 20.6 22.4 ...
##  $ folds:List of 3
##   ..$ fold_1: int [1:168] 2 5 8 9 14 23 24 29 34 35 ...
##   ..$ fold_2: int [1:169] 6 12 13 16 17 19 20 22 30 36 ...
##   ..$ fold_3: int [1:169] 1 3 4 7 10 11 15 18 21 25 ...


mse_pair1 = unlist(lapply(1:length(fit_cv_pair1$preds), 
                          
                          function(x) mse(y[fit_cv_pair1$folds[[x]]], 
                                          
                                          fit_cv_pair1$preds[[x]])))
mse_pair1
## [1] 18.73885 18.64990 11.21641
cat('mse for params_pair1 is :', mean(mse_pair1), '\n')
## mse for params_pair1 is : 16.20172
mse_pair2 = unlist(lapply(1:length(fit_cv_pair2$preds), 
                          
                          function(x) mse(y[fit_cv_pair2$folds[[x]]], 
                                          
                                          fit_cv_pair2$preds[[x]])))
mse_pair2
## [1] 19.12602 18.63044 11.21680
cat('mse for params_pair2 is :', mean(mse_pair2), '\n')
## mse for params_pair2 is : 16.32442
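
The idea behind combined kernels can be pictured with the following base R sketch, which multiplies and adds the textbook tricube and gaussian kernels on a made-up matrix of distances and then normalises each row. The function names, the toy distances and the normalisation step are assumptions for illustration; the package's internal treatment of distances and bandwidth may differ.

```r
# textbook kernel definitions (u = distance, assumed to lie in [0, 1] here)
tricube = function(u) ifelse(abs(u) <= 1, (70 / 81) * (1 - abs(u)^3)^3, 0)
gaussian = function(u) dnorm(u)

# row-wise normalisation so that the weights of each observation sum to 1
normalise = function(W) W / rowSums(W)

# hypothetical distances to k = 3 neighbors for 2 test observations
u = matrix(c(0.1, 0.4, 0.9,
             0.2, 0.2, 0.2), nrow = 2, byrow = TRUE)

w_mult = normalise(tricube(u) * gaussian(u))   # analogous to '..._MULT'
w_add  = normalise(tricube(u) + gaussian(u))   # analogous to '..._ADD'

round(w_mult, 3)
round(w_add, 3)
```

In this toy example the product down-weights the farthest neighbor much more strongly than the sum does, which is one reason the two ways of combining kernels can lead to different predictions.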