Experimenting with a Custom Distance Based on Haversine with K Nearest Neighbors

In this blog post, I will discuss: (1) the Haversine distance, a distance metric designed for measuring distances between places on earth, (2) a customized distance metric I implemented, “HaversineEuclidean”, which I felt would be more appropriate in an analysis of the California Housing data, and (3) how to implement this custom metric in K-Nearest Neighbors (KNN) regression with scikit-learn. The implementation is also relevant to other functionality in sklearn.neighbors.

I became curious about using a potentially more appropriate metric in KNN, and this blog post logs an exploration, analysis, and findings.

An overview:

  1. Background and motivation
  2. The Haversine distance and a new custom “HaversineEuclidean” distance
  3. Implementation of this custom metric and an analysis with scikit-learn
  4. Analysis of CA Housing data with only latitude and longitude features
  5. Analysis of CA Housing data with “all features”
  6. Conclusions and lessons
  7. Related Work
  8. An appendix with mathematical details and other info

The raw markdown file, where it may be easier to copy and paste some of the Python code, is available here.

Background and Motivation

I am reading through the excellent book Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow (2nd Edition). In Chapter 2, we use the California Housing dataset to predict the median housing price from attributes like longitude, latitude, and income. That chapter used models like Random Forests but not another popular, simple approach: K-Nearest Neighbors (KNN). KNN models are simple and still used in practice. They have fallen slightly out of vogue because the computation does not scale well to problems with many features or many observations. However, in this problem, both the number of features and the training-set size are modest. Moreover, there is a natural sense of “distance” between districts, which made me curious about how KNN would perform; houses within a wealthy neighborhood will be more similar in price to one another than to houses in less affluent neighborhoods.

When contemplating this, I realized that standard metrics like the Euclidean distance (often used as a default in KNN) are not ideal for the latitude/longitude features. In contrast, the Haversine distance computes the distance between two locations specified by latitude/longitude and accounts for the spherical shape of the planet. I thought it would be interesting to use a custom metric with KNN that computes the Haversine distance on latitude/longitude but the Euclidean (or another standard metric) on the remaining features. More broadly, the exploration of different metrics is a worthwhile area of research; academic papers (e.g., 1, 2) have demonstrated the importance of the chosen distance metric in KNN regression.

I did not find any prior work that combined the Haversine and Euclidean distances, so I decided to do a little project for fun.

One important detail about the Haversine formula we will be using is that it expects latitude/longitude to be measured in radians. For more information on the Haversine Distance, refer to the Wikipedia article or Scikit-Learn’s metrics docs.
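For reference, the Haversine great-circle distance between two points $(\varphi_1, \lambda_1)$ and $(\varphi_2, \lambda_2)$, given as (latitude, longitude) in radians on a sphere of radius $r$, is

\[\begin{equation*} d = 2r \arcsin\left( \sqrt{ \sin^{2}\left( \frac{\varphi_{2}-\varphi_{1}}{2} \right) + \cos\varphi_{1}\cos\varphi_{2}\sin^{2}\left( \frac{\lambda_{2}-\lambda_{1}}{2} \right) } \right). \end{equation*}\]

Scikit-learn returns the distance on the unit sphere ($r=1$); multiply by the Earth’s radius (roughly 6371 km) to obtain kilometers.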

Custom Distance when Feature Vector Includes Latitude/Longitude

Suppose our training data contains feature vectors $x$ of the form $x = (\text{latitude}, \text{longitude}, x_{3},\ldots, x_{p})$. How do we compute the distance $d(x,y)$ between such vectors for a KNN model? The first two attributes should be handled by the Haversine metric, but for the remaining features we can use a standard approach like the Euclidean distance. (The Euclidean distance is not designed to account for the curvature of the earth.)

I am not aware of a “right way” to approach this, but my first instinct is to:

  1. Compute the Haversine distance on the first two coordinates, e.g., $(\text{latitude}, \text{longitude})$
  2. Compute the Euclidean distance on the remaining, e.g., $(x_{3},\ldots, x_{p})$
  3. Take a weighted average (i.e., convex combination) of the two

In other words, I am experimenting with the following distance:

\[D(x,y) = w\,\text{Haversine}(x,y) + (1-w)\,\text{Euclidean}(x,y)\]

where $Haversine$ and $Euclidean$ represent the respective distance metrics, $w \in [0,1]$, and $x,y$ are feature vectors “mixed” with latitude/longitude and other attributes. It can be shown that $D$ is a valid distance metric; see the appendix of this blog for details.
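For example, with $w = 0.9$, a Haversine component of $0.1$, and a Euclidean component of $4.0$, we get $D = 0.9 \times 0.1 + 0.1 \times 4.0 = 0.49$; the weight controls how strongly each component “counts” toward the total.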

Implementing HaversineEuclidean Distance

Let’s implement this in code! To work with scikit-learn modules, our distance function should take in two one-dimensional numpy arrays and return a scalar. I provide a class with the Euclidean distance, Haversine distance, and a weighted average of the two.

I also include an empirical verification that $D$, the combination metric above, satisfies the requirements of a metric. It’s always a good idea to include such sanity checks.
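Before the full class, here is the bare interface scikit-learn expects from a callable metric, as a minimal sketch (the toy_metric name is just for illustration):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def toy_metric(x, y):
    """ A callable metric maps two 1-D arrays to a scalar distance """
    return np.sqrt(np.sum((x - y)**2))

# Any such callable can be passed through the metric parameter
knn = KNeighborsRegressor(n_neighbors=3, metric=toy_metric)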

# All imports for entire analysis

import os
import tarfile
import urllib.request
import joblib

import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV, cross_val_score
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.neighbors import KNeighborsRegressor, DistanceMetric
from sklearn.linear_model import LinearRegression
from functools import partial
from sklearn.metrics import mean_squared_error

# i/o constants
# Dataset, online
DOWNLOAD_ROOT = 'https://raw.githubusercontent.com/ageron/handson-ml2/master/'
HOUSING_URL = DOWNLOAD_ROOT + 'datasets/housing/housing.tgz'

# Local (Relative paths)
HOUSING_PATH = 'dataset'  # Directory where data is saved
RESULTS = 'results'  

# tuning constants
CV = 5
NJOBS = 2  # Change if your system does not support parallelism
VERBOSITY = 9
SCORING = 'neg_mean_squared_error'

The following class implements different distance metrics that share methods like input checking.

class Distances:
    """ User distance metrics to use with scikit-learn.  Includes our HaversineEuclidean Distance  """
    
    def __init__(self, metric):
        """
        Initialize distances class by verifying the metric string
        Parameters:
            metric, str.
        """
        # Valid metrics.  We are shortening Euclidean to l2.  The l2 norm induces the Euclidean metric.
        self.metrics = {
            'l2': self._l2,  
            'haversine': self._haversine,
            'haversine_l2': self._haversine_l2
        }
        
        if metric not in self.metrics.keys():
            raise ValueError(f"The metric {metric} is not recognized.")
        else:
            self.metric = self.metrics[metric]
            
        if 'haversine' in metric:
            # Load sklearn's haversine metric in advance
            self.skhaversine = DistanceMetric.get_metric('haversine')
    
    def distance(self, x, y, **kwargs):
        """ 
        * Compute distance between vectors x and y after validation checks, using self.metric
        * We use one main function to avoid redundant things like input validation checks.        
        
        Parameters:
            x, array 
            y, array
            
        Returns:
            Scalar value representing distance between x and y
        """
        self._check_dist_inputs(x, y)
        return self.metric(x, y, **kwargs)
    
    @staticmethod
    def _check_dist_inputs(x, y):
        """ Verify that the two inputs are valid. """
        for vec in [x, y]:
            if not isinstance(vec, np.ndarray):
                raise ValueError("Inputs expected to be np arrays")
            if vec.ndim != 1:
                raise ValueError("Inputs expected to have one dimension")

        if len(x) != len(y):
            raise ValueError("x and y must have same number of values")

    
    def _l2(self, x, y):
        """ Return Euclidean distance, or distance induced by l2 norm """
        return np.sqrt(np.sum((x-y)**2))
    
    def _haversine(self, x, y):
        """ 
        Compute Haversine distance between x and y
        
        Parameters:
            x = (latitude1, longitude1) in radians
            y = (latitude2, longitude2) in radians
            
        Note:
            Just uses functionality from sklearn.neighbors, to keep things conceptually contained
        """
        # Note, we already know the inputs are 1 dimensional arrays of same length
        if len(x) != 2:
            raise ValueError("Input vectors to Haversine distance must have length 2: (latitude, longitude)")

        # Get a 1 by 1 distance matrix (the Haversine distance between x and y)
        one_by_one_dist_matrix = self.skhaversine.pairwise(x[np.newaxis, :], y[np.newaxis, :])
        return one_by_one_dist_matrix.item()
        
    def _haversine_l2(self, x, y, w_h=0.5):
        """ 
        Combines the Haversine distance of latitude and longitude features with Euclidean of remaining features.

        Note:  This assumes vectors are of the form (latitude, longitude, x3, ..., xp).
        """
        # Note, we already know the inputs are 1 dimensional arrays of same length
        if len(x) < 3:
            raise ValueError(f"input to haversine_l2 has length {len(x)}.  Should be at least 3 to use this method")
        
        if not isinstance(w_h, float):
            raise TypeError('w_h should be a float')
        if w_h < 0. or w_h > 1.:
            raise ValueError("w_h should be in [0,1]")
        
        w_l2 = 1.0 - w_h
        return w_h*self._haversine(x[0:2], y[0:2]) + w_l2*self._l2(x[2:], y[2:])
        
    @staticmethod
    def check_is_metric(func, niter=1000, dd=10):
        """ 
        Empirically check whether func is a valid distance metric 
        It is always good to include a quick empirical check. I've included the math in the appendix.
        """
        print(f"Checking whether {func} is a distance")
        
        for test_iter in range(niter):
            x = np.random.randn(dd)
            y = np.random.randn(dd)
            z = np.random.randn(dd)
            
            # d(x,x) = 0  
            assert func(x, x) == 0., "d(x,x) != 0"
            
            # Symmetry
            assert func(x, y) == func(y, x), "d(x,y) != d(y,x)"
            
            # Non-negative
            assert func(x,y) >= 0, "d(x,y) is negative"
            
            # Triangle inequality
            assert func(x, y) + func(y, z) >= func(x, z), "Failed triangle ineq"
            
            # The other check, that d(x,y)=0 implies x=y is hard empirically.
            # This will suffice as a first check.  The math confirms this property.
        
        print(f"Passed {niter} empirical checks\n")

The following is our empirical check that HaversineEuclidean (or _haversine_l2()) satisfies the metric requirements (except that $D(x,y) = 0 \implies x = y$ which is hard to check empirically).

for _ in range(3):  # You can run more, I'm only using three to avoid too much output.
    dist = Distances('haversine_l2')
    haversine_l2_func = partial(dist.distance,
                                w_h=np.random.rand())
    Distances.check_is_metric(haversine_l2_func)
Checking whether functools.partial(<bound method Distances.distance of <__main__.Distances object at 0x7f2d0c179710>>, w_h=0.4040077860780521) is a distance
Passed 1000 empirical checks

Checking whether functools.partial(<bound method Distances.distance of <__main__.Distances object at 0x7f2d0c179690>>, w_h=0.21491843314170378) is a distance
Passed 1000 empirical checks

Checking whether functools.partial(<bound method Distances.distance of <__main__.Distances object at 0x7f2d0c1798d0>>, w_h=0.10112062962565771) is a distance
Passed 1000 empirical checks

Ready to move on!

Now we have a new metric. Let’s apply it to some data.
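As a quick preview of how the new metric plugs in (a sketch; the w_h value below is arbitrary), note that extra keyword arguments for a callable metric are supplied through metric_params:

# Sketch: using the custom callable metric with KNeighborsRegressor.
# Inputs must be ordered (latitude_rad, longitude_rad, x3, ..., xp).
dist = Distances('haversine_l2')
knn = KNeighborsRegressor(n_neighbors=5,
                          metric=dist.distance,
                          metric_params={'w_h': 0.9})  # w_h chosen arbitrarily here

We will do exactly this in the grid searches below.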

Getting Started With the Data Analysis

To follow along, you can install the necessary packages from the book’s GitHub page, https://GitHub.com/ageron/handson-ml2.

We will use the dataset and pre-processing steps from Chapter 2 of the aforementioned textbook, available as a notebook at https://GitHub.com/ageron/handson-ml2/blob/master/02_end_to_end_machine_learning_project.ipynb. Please refer to that notebook for further explanations of the preprocessing steps. Fortunately, it contains code to download the data directly from GitHub, so you don’t have to do that manually.

Note that the California Housing dataset available here is slightly different from what you would obtain by running

from sklearn.datasets import fetch_california_housing

We will not be visualizing or exploring the data too much here because that has already been done in the notebook above.

Download and Check

Let’s download the data.

# Directly taken from notebook mentioned before
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    """ From Hands-on Machine Learning """
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, 'housing.tgz')
    urllib.request.urlretrieve(housing_url, tgz_path)

    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()
    print(f"Dataset downloaded, extracted, and saved in {tgz_path}")


def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, 'housing.csv')
    return pd.read_csv(csv_path)
        
    
fetch_housing_data()
housing = load_housing_data()
Dataset downloaded, extracted, and saved in dataset/housing.tgz

The following brief inspection indicates that we have downloaded the data correctly.

housing.head()
|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|-----------|----------|--------------------|-------------|----------------|------------|------------|---------------|--------------------|-----------------|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
housing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
housing.describe()
|       | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|-------|-----------|----------|--------------------|-------------|----------------|------------|------------|---------------|--------------------|
| count | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20433.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.539680 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.385070 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.350000 | 32.540000 | 1.000000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 | 0.499900 | 14999.000000 |
| 25% | -121.800000 | 33.930000 | 18.000000 | 1447.750000 | 296.000000 | 787.000000 | 280.000000 | 2.563400 | 119600.000000 |
| 50% | -118.490000 | 34.260000 | 29.000000 | 2127.000000 | 435.000000 | 1166.000000 | 409.000000 | 3.534800 | 179700.000000 |
| 75% | -118.010000 | 37.710000 | 37.000000 | 3148.000000 | 647.000000 | 1725.000000 | 605.000000 | 4.743250 | 264725.000000 |
| max | -114.310000 | 41.950000 | 52.000000 | 39320.000000 | 6445.000000 | 35682.000000 | 6082.000000 | 15.000100 | 500001.000000 |
housing['ocean_proximity'].value_counts()
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64

Set Aside Test Data

Next we must set aside a hold-out test set. This is done with a stratified split so that different income categories are represented similarly between training and test.

housing['income_cat'] = pd.cut(housing['median_income'],
                              bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                              labels=[1,2,3,4,5])

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
splits_gen = split.split(housing, housing['income_cat'])
train_index, test_index = next(splits_gen)

housing_train = housing.loc[train_index].copy()
housing_test = housing.loc[test_index].copy()

# Pull out the targets
train_labels = housing_train['median_house_value'].copy()
test_labels = housing_test['median_house_value'].copy()

for set_ in (housing_train, housing_test):
    set_.drop('income_cat', axis=1, inplace=True)
    set_.drop('median_house_value', axis=1, inplace=True)
print(f"X train: {housing_train.shape}")
print(f"X test: {housing_test.shape}")
print(f"y train: {train_labels.shape}")
print(f"y test: {test_labels.shape}")
X train: (16512, 9)
X test: (4128, 9)
y train: (16512,)
y test: (4128,)

Analysis and Evaluation of Different Distances in KNN

Now we can evaluate different modeling approaches!

First, we should compare using the Euclidean and Haversine metrics in KNN on only latitude and longitude information. Does the choice of metric make any difference?

Second, I will use all features, and compare the performance of KNN when using the Euclidean and the new “HaversineEuclidean” distances.

To keep things short and focused, we will only run a grid search and will not use other techniques like learning curves. This should not have an undue impact on the comparisons of interest.

The following helper functions are used in both analyses.

# Helper functions for evaluation

def get_rmse(score):
    """ 
    Given a negative Mean Squared Error, return the (positive) Root Mean Squared Error
    (sklearn prefers to return -MSE)
    """
    return np.sqrt(-score)


def print_results(searcher):
    """
    Print key results of a grid search, sorted by performance
    Parameters:
        searcher: A fitted GridSearchCV object
    """
    res_zip = zip(searcher.cv_results_['mean_test_score'], searcher.cv_results_['params'])
    for result, param in sorted(res_zip):
        print(get_rmse(result), param)


def styled_df(df):
    """ Display a data frame with commas separating thousands places """
    display(df.style.format("{:,.1f}"))

Comparison on Latitude and Longitude Only

Let’s subset the data.

lat_long_train = housing_train[['latitude', 'longitude']]  # We'll need the data in this order: (latitude, longitude)
lat_long_test = housing_test[['latitude', 'longitude']]

print(lat_long_train.shape, lat_long_test.shape)
lat_long_train.head()
(16512, 2) (4128, 2)
|       | latitude | longitude |
|-------|----------|-----------|
| 17606 | 37.29 | -121.89 |
| 18632 | 37.05 | -121.93 |
| 14650 | 32.77 | -117.20 |
| 3230  | 36.31 | -119.61 |
| 3555  | 34.23 | -118.59 |

The Haversine formula assumes the inputs are given in the order (latitude, longitude) and in radians. We need to convert degrees (the current units) to radians. Even though the conversion is simple, we will use a class (specifically, an sklearn Transformer); classes are easier to extend and help with input validation checks.

class DegreesToRadians(BaseEstimator, TransformerMixin):
    """ Transformer to convert degrees to radians, with special checks for valid latitude and longitude """
    def fit(self, X, y=None):
        return self  # Nothing else to do
    
    def transform(self, X):
        """ 
        Do conversion Radians = Degrees * pi / 180 
        Parameters:
            X, numpy array or pandas df containing latitude/longitude values in degrees
        Returns:
            numpy array with radian values
        """
        if isinstance(X, pd.core.frame.DataFrame):
            X = X.values  # Get a numpy array
        elif not isinstance(X, np.ndarray):
            raise ValueError(f"Unrecognized type for degrees->radians conversion: {type(X)}")

        # Loose sanity check: longitudes must lie in [-180, 180] degrees
        # (valid latitudes lie in [-90, 90], so they pass this check as well)
        if (np.min(X) < -180.) or (np.max(X) > 180.):
            raise ValueError("Latitude and longitude values are expected to be in the range [-180, 180]")
        else:
            return X*np.pi/180.
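A quick sanity check of the transformer, with values chosen for easy mental verification:

# 180 degrees -> pi radians; -90 degrees -> -pi/2 radians
deg2rad = DegreesToRadians()
print(deg2rad.fit_transform(np.array([[180.0, -90.0]])))
# Expected: approximately [[ 3.14159265 -1.57079633]]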

Now, we set up two analyses:

  1. Run KNN using Euclidean distance. In this case, the inputs should be scaled in the usual way: zero mean, unit variance.
  2. Run KNN using Haversine. In this case, we need to convert to radians.

In both cases, we sweep across the same hyperparameters. Since the relevance of other districts’ house prices diminishes as we move farther from the district of interest, we should not need to consider very large values of n_neighbors. Note that we also tune the weights parameter of KNeighborsRegressor; when weights='distance', predictions are a distance-weighted mean of the nearest neighbors’ targets, as shown below.
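Concretely, with weights='distance', scikit-learn predicts an inverse-distance weighted mean of the $k$ nearest targets,

\[\begin{equation*} \hat{y}(x) = \frac{\sum_{i=1}^{k} y_{(i)}/d(x, x_{(i)})}{\sum_{i=1}^{k} 1/d(x, x_{(i)})}, \end{equation*}\]

where the $x_{(i)}$ are the $k$ nearest neighbors of $x$ and the $y_{(i)}$ are their targets; with weights='uniform', it is a plain average.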

# Data pre-processing pipelines
# (ll = latitude/longitude)
ll_euclidean_pipeline = Pipeline([('scale', StandardScaler())])
ll_haversine_pipeline = Pipeline([('degToRad', DegreesToRadians())])

# Grids
ll_base_grid = {'n_neighbors': np.arange(2, 15), 'weights': ['uniform', 'distance']}

ll_euclid_grid = ll_base_grid.copy()
ll_euclid_grid['metric'] = ['euclidean']
ll_haversine_grid = ll_base_grid.copy()
ll_haversine_grid['metric'] = ['haversine']
print(ll_euclid_grid)
print(ll_haversine_grid)
{'n_neighbors': array([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14]), 'weights': ['uniform', 'distance'], 'metric': ['euclidean']}
{'n_neighbors': array([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14]), 'weights': ['uniform', 'distance'], 'metric': ['haversine']}
def ll_grid_search(pipe, param_grid, ntrain=None):
    """ 
    Train a model on latitude and longitude (ll) only.
    The pre-processing steps will differ, depending on whether we convert to radians for Haversine or something else.  
    Then, compute test-set performance
    
    Parameters:
        pipe: A Pipeline object to preprocess the inputs
        param_grid: A dictionary containing hyperparameters to tune
        ntrain: The number of observations to train on.  Setting to None uses the entire training set.
    """
    # Pre-process data
    train_prepared = pipe.fit_transform(lat_long_train)
    print("\nGrid Searching")
    print(f"\tInput data:\n\t Shape: {train_prepared.shape}\n\t Feature means: {np.mean(train_prepared, axis=0)}")
    print(f"\tTargets:\n\t Shape: {train_labels.shape}")
    
    # Run grid-search
    # (If you get an error, try changing NJOBS=1)
    grid_searcher = GridSearchCV(KNeighborsRegressor(),
                                 param_grid,
                                 cv=CV,
                                 n_jobs=NJOBS,
                                 verbose=VERBOSITY,
                                 scoring=SCORING)
    
    grid_searcher.fit(train_prepared[0:ntrain], train_labels[0:ntrain])
    
    # Save key results
    best_model = grid_searcher.best_estimator_
    best_cv_rmse = get_rmse(grid_searcher.best_score_)
    
    # Prepare test set
    test_prepared = pipe.transform(lat_long_test)
    print(f"\tTest data:\n\t X shape: {test_prepared.shape}\n\t Feature means: {np.mean(test_prepared, axis=0)}")
    print(f"\tTest targets:\n\t  y shape: {test_labels.shape}")

    # Test performance
    yhat = best_model.predict(test_prepared)
    test_score = np.sqrt(mean_squared_error(test_labels, yhat)) 
    
    return grid_searcher, best_model, best_cv_rmse, test_score


ll_euclid_results = ll_grid_search(pipe=ll_euclidean_pipeline, param_grid=ll_euclid_grid)
ll_haversine_results = ll_grid_search(pipe=ll_haversine_pipeline, param_grid=ll_haversine_grid)

ll_euclid_searcher, ll_euclid_best_model, ll_euclid_cv_rmse, ll_euclid_test_rmse = ll_euclid_results
ll_haversine_searcher, ll_haversine_best_model, ll_haversine_cv_rmse, ll_haversine_test_rmse = ll_haversine_results
Grid Searching
	Input data:
	 Shape: (16512, 2)
	 Feature means: [ 2.28456358e-15 -4.35310702e-15]
	Targets:
	 Shape: (16512,)
Fitting 5 folds for each of 26 candidates, totalling 130 fits
	Test data:
	 X shape: (4128, 2)
	 Feature means: [-0.0180446   0.01530993]
	Test targets:
	  y shape: (4128,)

Grid Searching
	Input data:
	 Shape: (16512, 2)
	 Feature means: [ 0.62202797 -2.08699201]
	Targets:
	 Shape: (16512,)
Fitting 5 folds for each of 26 candidates, totalling 130 fits
	Test data:
	 X shape: (4128, 2)
	 Feature means: [ 0.62135463 -2.08645711]
	Test targets:
	  y shape: (4128,)

First, let’s have a look at the optimal hyperparameters.

ll_euclid_best_model.get_params()
{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'euclidean',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 6,
 'p': 2,
 'weights': 'uniform'}
ll_haversine_best_model.get_params()
{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'haversine',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 5,
 'p': 2,
 'weights': 'uniform'}

Interestingly, both searches selected weights='uniform', and the optimal n_neighbors values differ by only 1.

Next, let’s look at the mean RMSE across cross validation folds. Recall that “less is better”.

styled_df(
    pd.DataFrame([[ll_euclid_cv_rmse, ll_haversine_cv_rmse]],
                columns = ['Euclid CV RMSE', 'Haversine CV RMSE'])
)
|   | Euclid CV RMSE | Haversine CV RMSE |
|---|----------------|-------------------|
| 0 | 53,865.6 | 53,106.6 |

Important note: We cannot conclude from this analysis that using the Haversine distance is “better” for predictive performance. For that, we’d need a more careful statistical test. However, if we think of these results as coming from one big grid that combines both metrics, the results would suggest that we choose the model with the Haversine metric before proceeding.

That said, what if we used the Euclidean-KNN’s optimal hyperparameters with the Haversine-KNN? That is, what happens if we plug the optimal n_neighbors from ll_euclid_best_model into ll_haversine_best_model?

#
# Extract performance of Haversine-KNN at the optimal hyperparameters of Euclid-KNN
#
optimal_euclid_n_neighbors = ll_euclid_best_model.get_params()['n_neighbors']
target_params = {'metric': 'haversine', 'n_neighbors': optimal_euclid_n_neighbors, 'weights': 'uniform'}
for ii, params in enumerate(ll_haversine_searcher.cv_results_['params']):
    if params == target_params:
        print(ii)
        break

get_rmse(ll_haversine_searcher.cv_results_['mean_test_score'][ii])
8

53114.30402038107

That is, when we use the same hyperparameters, the Haversine-KNN gets a smaller average RMSE (across folds) than the Euclid-KNN.

Let’s look at the performance on the test set:

styled_df(
    pd.DataFrame([[ll_euclid_test_rmse, ll_haversine_test_rmse]],
                columns = ['Euclid Test RMSE', 'Haversine Test RMSE'])
)
|   | Euclid Test RMSE | Haversine Test RMSE |
|---|------------------|---------------------|
| 0 | 51,677.9 | 51,005.3 |

The Haversine-KNN attained better performance on the test data.

In fact, the performance is quite close to that of the Random Forest test-set results – using all the features – presented in the textbook (RMSE of 49,682).

Analysis with all features

Now we will use all the features used in the analysis from Hands-on Machine Learning. One difference is that we are treating latitude and longitude separately. These will be converted to radians rather than scaled to have zero mean and unit variance.

Our preprocessing follows these steps:

  • Convert latitude and longitude to radians and make sure they are in the first two columns
  • Convert the categorical variable to one-hot
  • Apply the same preprocessing to the remaining numerical variables as in the aforementioned Jupyter notebook.

Speaking of the categorical variables, we could go down an entirely different rabbit hole regarding the appropriate distance metric for those. However, to keep this exploration contained, I will use the Euclidean distance on those as well.
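One consequence worth noting: between two one-hot encoded categories, the Euclidean distance is $0$ when the categories match and $\sqrt{2}$ when they differ, no matter which two categories are involved.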

# A preprocessing Transformer from the textbook

# Hard-code the column indices, as in the textbook
# The indices differ by 2 because we're treating latitude and longitude differently: they get "taken out" of this calculation
rooms_ix, bedrooms_ix, population_ix, households_ix = 1, 2, 3, 4
# rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    """ Creates new features.  From page 68 of textbook and in the aforementioned Jupyter notebook """
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        rooms_per_household = X[:, rooms_ix]/X[:, households_ix]
        population_per_household = X[:, population_ix]/X[:, households_ix]
        
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix]/X[:, rooms_ix]
            return np.c_[X, 
                         rooms_per_household,
                         population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, 
                         rooms_per_household,
                         population_per_household] 
        
# Get a list of numerical attributes other than latitude and longitude
remaining_num_attribs = list(housing_train.columns)
for _del in ('ocean_proximity', 'latitude', 'longitude'):
    remaining_num_attribs.remove(_del)

num_pipeline = Pipeline(  # Pipeline used in the aforementioned Jupyter notebook
    [
        ('imputer', SimpleImputer(strategy='median')),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler())
    ],
    verbose=True
)
    
full_pipeline = ColumnTransformer(
    [
        ('latitudeLongitudeToRads', DegreesToRadians(), ['latitude', 'longitude']), 
        ('num', num_pipeline, remaining_num_attribs),
        ('cat', OneHotEncoder(), ['ocean_proximity'])
    ],
    verbose=True
)
housing_prepared = full_pipeline.fit_transform(housing_train)
housing_test_prepared = full_pipeline.transform(housing_test)
[ColumnTransformer]  (1 of 3) Processing latitudeLongitudeToRads, total=   0.0s
[Pipeline] ........... (step 1 of 3) Processing imputer, total=   0.0s
[Pipeline] ..... (step 2 of 3) Processing attribs_adder, total=   0.0s
[Pipeline] ........ (step 3 of 3) Processing std_scaler, total=   0.0s
[ColumnTransformer] ........... (2 of 3) Processing num, total=   0.0s
[ColumnTransformer] ........... (3 of 3) Processing cat, total=   0.0s

Check the data: Before running models, let’s make sure the pipeline’s outputs are reasonable. We note that the first two columns (latitude and longitude in radians) do not have mean 0 and variance 1. Similarly for the last columns, the one-hot encodings.

for name, prepared_dataset in \
    {'housing_prepared': housing_prepared, 'housing_test_prepared': housing_test_prepared}.items():
    
    print('-'*50)
    print(f'dataset info for {name}')
    
    print(type(prepared_dataset))
    print(f"Number of missing values: {np.isnan(prepared_dataset).sum()}")
    print(f"Variable means: {np.mean(prepared_dataset, axis=0)}")  
    print(f"Variable stds: {np.std(prepared_dataset, axis=0)}")
    print(f"Shape: {prepared_dataset.shape}")
    print("cols 0:5\n", prepared_dataset[0:3,0:5],
      "\ncols 5:10\n", prepared_dataset[0:3,5:10],
     "\ncols 10:\n", prepared_dataset[0:3,10:])

--------------------------------------------------
dataset info for housing_prepared
<class 'numpy.ndarray'>
Number of missing values: 0
Variable means: [ 6.22027968e-01 -2.08699201e+00 -2.70052069e-16  8.60688432e-17
  1.18156185e-16 -4.34773476e-17  9.61494020e-18 -2.24102068e-16
  7.25894366e-17 -1.59251649e-17 -5.10020468e-17  4.40649225e-01
  3.18737888e-01  1.21124031e-04  1.11858043e-01  1.28633721e-01]
Variable stds: [0.03731501 0.03493799 1.         1.         1.         1.
 1.         1.         1.         1.         1.         0.49646499
 0.46598717 0.01100497 0.31519172 0.3347941 ]
Shape: (16512, 16)
cols 0:5
 [[ 0.65083328 -2.12738183  0.74333089 -0.49323393 -0.44543821]
 [ 0.64664449 -2.12807996 -1.1653172  -0.90896655 -1.0369278 ]
 [ 0.5719444  -2.04552588  0.18664186 -0.31365989 -0.15334458]] 
cols 5:10
 [[-0.63621141 -0.42069842 -0.61493744 -0.31205452 -0.08649871]
 [-0.99833135 -1.02222705  1.33645936  0.21768338 -0.03353391]
 [-0.43363936 -0.0933178  -0.5320456  -0.46531516 -0.09240499]] 
cols 10:
 [[ 0.15531753  1.          0.          0.          0.          0.        ]
 [-0.83628902  1.          0.          0.          0.          0.        ]
 [ 0.4222004   0.          0.          0.          0.          1.        ]]
--------------------------------------------------
dataset info for housing_test_prepared
<class 'numpy.ndarray'>
Number of missing values: 0
Variable means: [ 6.21354633e-01 -2.08645711e+00 -5.41353879e-03  3.04779303e-02
  3.43292266e-02  2.54825119e-02  3.29949085e-02 -1.29098202e-02
 -2.17120579e-02 -1.11279202e-02  5.94440623e-03  4.50581395e-01
  3.12015504e-01  7.26744186e-04  1.07315891e-01  1.29360465e-01]
Variable stds: [0.0371272  0.0350814  1.00431805 1.09689067 1.09980658 1.072864
 1.08472956 0.98643014 0.69779731 0.13677257 0.98668233 0.49755181
 0.46331612 0.0269484  0.30951444 0.33559847]
Shape: (4128, 16)
cols 0:5
 [[ 0.59550634 -2.0662953   0.02758786  1.78838525  1.16351084]
 [ 0.58939769 -2.05704506  0.8228579   0.71842323  0.29453231]
 [ 0.59707714 -2.07781447 -0.13146615  0.8110161   0.95417708]] 
cols 5:10
 [[ 0.68498857  1.23217448  2.31299771  0.48830927 -0.07090847]
 [ 0.22337528  0.40973048  0.38611673  0.36310326 -0.04598303]
 [ 0.61865967  1.00859747 -0.45340597 -0.17866074 -0.05936925]] 
cols 10:
 [[-0.86820063  1.          0.          0.          0.          0.        ]
 [-0.86028018  1.          0.          0.          0.          0.        ]
 [-0.01792937  1.          0.          0.          0.          0.        ]]

Hyperparameter Tuning on All Features

When tuning hyperparameters on all the features, computational efficiency concerns become more important.

Speed issue

When using a custom metric, we will face a slowdown. From the docs on custom metrics:

Because of the Python object overhead involved in calling the python function, this will be fairly slow, but it will have the same scaling as other distances.

Furthermore, with a custom metric the nearest-neighbor search falls back to brute-force search rather than faster algorithms like the K-D tree or ball tree. See the KNeighbors user guide for more details.

I ran experiments indicating that sweeping a hyperparameter grid with HaversineEuclidean-KNN is prohibitively slower than with plain Euclidean. Thus, I will use a hack to get a sense of a good range of hyperparameters: tune with the Euclidean metric first. We saw above that the two metrics did not differ much in their optimal hyperparameters. I admit that this is not the gold-standard approach we would take with more computational resources. But, I’m doing this on my personal machine and for fun :).
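To get a feel for the overhead yourself, here is a rough timing sketch on small subsets (illustrative only; absolute timings will vary by machine):

import time

# Compare prediction time: built-in Euclidean vs. our Python-level callable.
# The small subsets keep this quick; the gap only widens with more data.
knn_fast = KNeighborsRegressor(n_neighbors=5, metric='euclidean')
knn_slow = KNeighborsRegressor(n_neighbors=5,
                               metric=Distances('haversine_l2').distance,
                               metric_params={'w_h': 0.5})

for name, model in [('euclidean', knn_fast), ('haversine_l2', knn_slow)]:
    model.fit(housing_prepared[:1000], train_labels[:1000])
    start = time.perf_counter()
    model.predict(housing_prepared[:100])
    print(f"{name}: {time.perf_counter() - start:.4f}s for 100 predictions")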

Model Tuning

We will again vary n_neighbors and weights. Ultimately, we will also tune the value $w_h$, the weight in the weighted average between the Haversine and Euclidean metrics. (See class Distances above).

def all_features_search(out_file_name, grid_dict, num_train=None):
    """
    
    Given a pre-processed dataset and a hyperparameter grid, run a grid over KNN Regression models.
    This is written for the case where the input contains all features, not just latitude and longitude.
    
    Since this can be computationally intensive, we save the results to disk.  
    If the file at out_file_name exists, we load it rather than train.  Otherwise, we create it after training.
    
    We can set num_train to be a subset to speed things up for development or exploratory purposes.
    In that case, it's added to the file path.
    
    For simplicity, I don't use grid information in the file path.  It's not needed for the notebook.
    
    Parameters:
        out_file_name: str, path to saved model file, without suffix.  
        grid_dict: dictionary, hyperparameter grid
        num_train: int or None.  If None, the entire train set is used.
    """
    if num_train is not None:
        assert isinstance(num_train, int)
        out_file_name += f"_n_train_{num_train}"
    out_file_name += ".pkl"
    
    if os.path.exists(out_file_name):
        print(f"Reading {out_file_name} from disk")
        grid_searcher = joblib.load(out_file_name)
    else:
        print("Training grid from scratch")
        # If you get an error, try changing NJOBS to 1.
        grid_searcher = GridSearchCV(KNeighborsRegressor(),
                                     grid_dict,
                                     cv=CV,
                                     n_jobs=NJOBS,
                                     verbose=VERBOSITY,
                                     scoring=SCORING)

        grid_searcher.fit(housing_prepared[0:num_train], train_labels[0:num_train])
        
        print(f"Done.\nWriting to disk at {out_file_name}")
        joblib.dump(grid_searcher, out_file_name)

        
    return grid_searcher

First we sweep over a coarse grid of n_neighbors, including both uniform and distance weighting, to narrow in on better hyperparameters.

coarse_euclidean_grid = {'n_neighbors': np.arange(5, 30, 5),
                         'weights': ['uniform', 'distance'],
                         'metric': ['euclidean']}

coarse_euclidean_outfile = os.path.join(RESULTS, 'out_coarse_euclidean')
coarse_euclidean_searcher = all_features_search(out_file_name=coarse_euclidean_outfile,
                                                grid_dict=coarse_euclidean_grid)
Reading results/out_coarse_euclidean.pkl from disk

The output indicates that I already completed the grid search on my machine at the time this notebook was built. Let’s look at the results.

print_results(coarse_euclidean_searcher)
66731.06667079251 {'metric': 'euclidean', 'n_neighbors': 5, 'weights': 'uniform'}
66567.8944134118 {'metric': 'euclidean', 'n_neighbors': 5, 'weights': 'distance'}
65026.17831401409 {'metric': 'euclidean', 'n_neighbors': 10, 'weights': 'uniform'}
65009.489294396895 {'metric': 'euclidean', 'n_neighbors': 25, 'weights': 'uniform'}
64882.87027501116 {'metric': 'euclidean', 'n_neighbors': 20, 'weights': 'uniform'}
64728.92659504293 {'metric': 'euclidean', 'n_neighbors': 15, 'weights': 'uniform'}
64718.262472793926 {'metric': 'euclidean', 'n_neighbors': 10, 'weights': 'distance'}
64516.65431267559 {'metric': 'euclidean', 'n_neighbors': 25, 'weights': 'distance'}
64428.1763520563 {'metric': 'euclidean', 'n_neighbors': 20, 'weights': 'distance'}
64351.96881036284 {'metric': 'euclidean', 'n_neighbors': 15, 'weights': 'distance'}

It seems clear that we should use weights='distance'. Also, we may not need values of n_neighbors less than 10. Let’s run a finer search over n_neighbors now.

# There are some redundant values, but it does not cause a big problem in terms of speed here (Euclidean is much faster than HaversineEuclidean)
fine_euclidean_grid = {'n_neighbors': np.arange(10, 21),
                         'weights': ['distance'],
                         'metric': ['euclidean']}

fine_euclidean_outfile = os.path.join(RESULTS, 'out_fine_euclidean')
fine_euclidean_searcher = all_features_search(out_file_name=fine_euclidean_outfile,
                                           grid_dict=fine_euclidean_grid)
Reading results/out_fine_euclidean.pkl from disk
print_results(fine_euclidean_searcher)
64718.262472793926 {'metric': 'euclidean', 'n_neighbors': 10, 'weights': 'distance'}
64593.690188574445 {'metric': 'euclidean', 'n_neighbors': 11, 'weights': 'distance'}
64475.579108402184 {'metric': 'euclidean', 'n_neighbors': 12, 'weights': 'distance'}
64450.109896533315 {'metric': 'euclidean', 'n_neighbors': 13, 'weights': 'distance'}
64428.1763520563 {'metric': 'euclidean', 'n_neighbors': 20, 'weights': 'distance'}
64399.26763071997 {'metric': 'euclidean', 'n_neighbors': 19, 'weights': 'distance'}
64387.78656241735 {'metric': 'euclidean', 'n_neighbors': 14, 'weights': 'distance'}
64358.68569576212 {'metric': 'euclidean', 'n_neighbors': 18, 'weights': 'distance'}
64351.96881036284 {'metric': 'euclidean', 'n_neighbors': 15, 'weights': 'distance'}
64321.36152841593 {'metric': 'euclidean', 'n_neighbors': 17, 'weights': 'distance'}
64297.415818766334 {'metric': 'euclidean', 'n_neighbors': 16, 'weights': 'distance'}

Interestingly, these results are worse than what we got from only using latitude and longitude!

We can use these optimal values to train KNNs with the HaversineEuclidean distance. Let’s run a coarse grid over the $w_h$ hyperparameter.

hyper_ws = fine_euclidean_searcher.best_params_  # Like "warm starting" in a way. 
coarse_haversine_grid = {'n_neighbors': [hyper_ws['n_neighbors']],
                         'weights': [hyper_ws['weights']],
                         'metric': [Distances('haversine_l2').distance],
                         'metric_params': [{'w_h': 0.25}, {'w_h': 0.5}, {'w_h': 0.75}]}



coarse_haversine_outfile = os.path.join(RESULTS, 'out_coarse_haversine')
print(coarse_haversine_grid)
print(coarse_haversine_outfile)

{'n_neighbors': [16], 'weights': ['distance'], 'metric': [<bound method Distances.distance of <__main__.Distances object at 0x7f2ccbaf0750>>], 'metric_params': [{'w_h': 0.25}, {'w_h': 0.5}, {'w_h': 0.75}]}
results/out_coarse_haversine
coarse_haversine_searcher = all_features_search(out_file_name=coarse_haversine_outfile,
                                                grid_dict=coarse_haversine_grid)
Reading results/out_coarse_haversine.pkl from disk
print_results(coarse_haversine_searcher)
64145.307672109586 {'metric': <bound method Distances.distance of <__main__.Distances object at 0x7f2ccbaf0b90>>, 'metric_params': {'w_h': 0.25}, 'n_neighbors': 16, 'weights': 'distance'}
63699.17123660246 {'metric': <bound method Distances.distance of <__main__.Distances object at 0x7f2ccbaf0b90>>, 'metric_params': {'w_h': 0.5}, 'n_neighbors': 16, 'weights': 'distance'}
62685.911669884495 {'metric': <bound method Distances.distance of <__main__.Distances object at 0x7f2ccbaf0b90>>, 'metric_params': {'w_h': 0.75}, 'n_neighbors': 16, 'weights': 'distance'}

I expected the larger values of $w_h$ to perform better, for a reason I will now explain. The Euclidean distance between two points generally grows as dimensions are added (holding the values at the existing coordinates fixed), and far more of our dimensions are handled by the Euclidean part. With a small weight, the Haversine component has too little influence on the total distance; to “give it a voice”, its weight should be larger.
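A rough numerical illustration of this scale mismatch (a sketch; 14 matches the number of non-lat/long columns in our prepared data, and standard normals stand in for standardized features):

# Typical size of the Euclidean component on 14 standardized features
a, b = np.random.randn(14), np.random.randn(14)
print(np.sqrt(np.sum((a - b)**2)))  # typically around 5

# By contrast, the Haversine component is bounded by pi on the unit sphere,
# and within California it is at most roughly 1300 km / 6371 km ~ 0.2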

fine_haversine_grid = coarse_haversine_grid.copy()
fine_haversine_grid['metric_params'] = [{'w_h': 0.85}, {'w_h': 0.90}, {'w_h': 0.95}, {'w_h': 0.99}]

fine_haversine_outfile = os.path.join(RESULTS, 'out_fine_haversine')
print(fine_haversine_grid)
print(fine_haversine_outfile)

{'n_neighbors': [16], 'weights': ['distance'], 'metric': [<bound method Distances.distance of <__main__.Distances object at 0x7f2ccbaf0750>>], 'metric_params': [{'w_h': 0.85}, {'w_h': 0.9}, {'w_h': 0.95}, {'w_h': 0.99}]}
results/out_fine_haversine
fine_haversine_searcher = all_features_search(out_file_name=fine_haversine_outfile,
                                              grid_dict=fine_haversine_grid)
Reading results/out_fine_haversine.pkl from disk
print_results(fine_haversine_searcher)
61762.46933054633 {'metric': <bound method Distances.distance of <__main__.Distances object at 0x7f2ccbaf0710>>, 'metric_params': {'w_h': 0.85}, 'n_neighbors': 16, 'weights': 'distance'}
60950.55819622307 {'metric': <bound method Distances.distance of <__main__.Distances object at 0x7f2ccbaf0710>>, 'metric_params': {'w_h': 0.9}, 'n_neighbors': 16, 'weights': 'distance'}
59349.7906665763 {'metric': <bound method Distances.distance of <__main__.Distances object at 0x7f2ccbaf0710>>, 'metric_params': {'w_h': 0.95}, 'n_neighbors': 16, 'weights': 'distance'}
53663.336178831676 {'metric': <bound method Distances.distance of <__main__.Distances object at 0x7f2ccbaf0710>>, 'metric_params': {'w_h': 0.99}, 'n_neighbors': 16, 'weights': 'distance'}

This suggests that performance increases as $w_h$, the weight given to the Haversine metric, increases. This could be explained by two factors:

  1. The information contained in latitude/longitude is more helpful, so ignoring the other features leads to better performance.
  2. The Haversine distance reflects only two coordinates and has a smaller range to begin with, so it needs a larger weight.

These errors from HaversineEuclidean are greater than, but approaching, our errors from using latitude and longitude alone. This suggests to me that the features measured by the Euclidean component are not helping much, so I think explanation (1) is more likely.

At the same time, we warm-started this grid from the results of the Euclidean grid. Perhaps a more exhaustive search using the Haversine metric would obtain better results. Maybe that’s an experiment for the future :D.

# Save the cv score for later.
ff_haversine_cv_rmse = get_rmse(fine_haversine_searcher.best_score_)
ff_haversine_cv_rmse
53663.336178831676

A note about timing

HaversineEuclidean-KNN takes about 25 minutes per fold on my machine.

Test-Set Performance: All Features Analysis

Let’s get the test-set performance of the HaversineEuclidean-KNN. We will also compare against a simple baseline, a linear regression model.

def all_features_test_set(estimator, out_file_name):
    """ 
    Predict on the test set, using all the features, and report the RMSE score 
    Parameters:
        estimator: A scikit-learn estimator
        out_file_name: A string where predictions from a fitted model are saved
    """
    print(f"Estimator: {estimator}")
    
    # Fitting is really cheap for KNeighborsRegressor.  We can do it even if 
    # we've done it before and saved to disk.
    # The expensive operation is prediction: computing distances to all observations in the training set
    print("\nFitting model on prepared train set")
    print(type(housing_prepared))
    print(f"Input shape: {housing_prepared.shape}")
    print(f"Targets shape: {train_labels.shape}")
    estimator.fit(housing_prepared, train_labels)
    
    # Obtain predictions
    if os.path.exists(out_file_name):
        print(f"\nReading predictions from disk at {out_file_name}")
        yhat = joblib.load(out_file_name)
    else:
        print("\nObtaining predictions on prepared test set")
        print(type(housing_test_prepared))
        print(f"Input shape: {housing_test_prepared.shape}")
        print(f"Targets shape: {test_labels.shape}")
        
        yhat = estimator.predict(housing_test_prepared)
        
        print(f"Saving predictions at {out_file_name}")
        joblib.dump(yhat, out_file_name)
        
    mse = mean_squared_error(test_labels, yhat)
    
    return np.sqrt(mse)
    
haversine_outfile = os.path.join(RESULTS, 'final_haversine_preds.pkl')
ff_haversine_model_test_rmse = all_features_test_set(
    fine_haversine_searcher.best_estimator_,
    haversine_outfile
)
ff_haversine_model_test_rmse  # Full features = ff
Estimator: KNeighborsRegressor(metric=<bound method Distances.distance of <__main__.Distances object at 0x7f2ccb8a8f90>>,
                    metric_params={'w_h': 0.99}, n_neighbors=16,
                    weights='distance')

Fitting model on prepared train set
<class 'numpy.ndarray'>
Input shape: (16512, 16)
Targets shape: (16512,)

Reading predictions from disk at results/final_haversine_preds.pkl


51596.72885134255
# For comparison, a linear regression model
linear_model = LinearRegression()
linear_outfile = os.path.join(RESULTS, 'final_linear_preds.pkl')
linear_model_test_rmse = all_features_test_set(linear_model, linear_outfile)
print(linear_model_test_rmse)
Estimator: LinearRegression()

Fitting model on prepared train set
<class 'numpy.ndarray'>
Input shape: (16512, 16)
Targets shape: (16512,)

Reading predictions from disk at results/final_linear_preds.pkl
66911.98070857555

Let’s synthesize our findings.

Overall Results and Interpretation

Let’s tabulate the results, using Pandas for fun.

table_columns = ['Dataset',   'Distance',  'CV Mean RMSE', 'Test Score']
# Latitude/Longitude = ll
ll_e = pd.Series(['Lat/Long', 'Euclidean', ll_euclid_cv_rmse,        ll_euclid_test_rmse],
                 index=table_columns)
ll_h = pd.Series(['Lat/Long', 'Haversine', ll_haversine_cv_rmse,     ll_haversine_test_rmse],
                 index=table_columns)

# Full features = ff
ff_e = pd.Series(['All Features', 'HaversineEuclidean', ff_haversine_cv_rmse, ff_haversine_model_test_rmse],
                 index=table_columns)


pd.DataFrame([ll_e, ll_h, ff_e])

|   | Dataset | Distance | CV Mean RMSE | Test Score |
|---|---------|----------|--------------|------------|
| 0 | Lat/Long | Euclidean | 53865.601218 | 51677.949371 |
| 1 | Lat/Long | Haversine | 53106.619083 | 51005.349319 |
| 2 | All Features | HaversineEuclidean | 53663.336179 | 51596.728851 |

(Recall that a smaller value of RMSE is “better”)

Conclusions/Lessons

The best results came from using only latitude and longitude with the Haversine metric. Increasing the importance of the Haversine metric in HaversineEuclidean also improved performance when using all features. These test-set estimates of generalization error are point estimates only, and we’ve only tried one dataset, but I feel that the overall results are interesting and show some promise. That is, further exploration of “mixed metrics” like this one may be useful. However, it would need to be made computationally efficient before having an impact.

Interesting room for future work:

  • Making all this computationally efficient
    • Perhaps calling C to do the hard work is the only way.
  • Implementing a less brittle way to use different metrics for different features. Here, we needed to know the location of latitude and longitude in the data frame.
  • Exploring custom metrics that handle one-hot features differently (we just used Euclidean), and mixing them into the distance metric.
  • Devising a theoretically sound way of combining Euclidean and Haversine. Or, proving that weighted averages are a “right way”. I just tried it as it was the first thing to come to my mind.

Related Work

A few other blogs have studied the CA Housing dataset with KNN.

Alternative techniques for leveraging spatial information

Our idea was to use the Haversine distance on latitude/longitude as a more appropriate metric for KNN. Other approaches are designed entirely around leveraging spatial information, such as methods from spatial statistics (see this textbook, for example).

Appendix: HaversineEuclidean is a valid metric

We will prove the following more general result. If $d_1$ and $d_2$ are distance metrics, then a distance defined as a convex combination of them,

\[\begin{equation*} D(x,y) = w d_{1}(x,y) + (1-w)d_{2}(x,y), \end{equation*}\]

for fixed $w \in [0,1]$, is also a valid metric.

Since the Haversine and Euclidean distances are valid metrics, it will follow that our haversine_l2 is a valid metric. (One subtlety: in haversine_l2 each component distance sees only a subset of the coordinates, so the identity-of-indiscernibles property on the full feature vector requires $w \in (0,1)$; at the endpoints $w = 0$ and $w = 1$, the combination is only a pseudometric on the full vector.)

We proceed by verifying the properties of a metric.

If $x=y$ then $D(x, y) = 0$

Since $d_{1}$ and $d_{2}$ are valid metrics, $d_{1}(x,x) = 0 = d_{2}(x,x)$. Therefore,

\[\begin{equation*} D(x,x) = w d_{1}(x,x) + (1-w) d_{2}(x,x) = 0. \end{equation*}\]

If $D(x, y) = 0$ then $x = y$

$D(x, y) = 0$ implies $w d_{1}(x,y) + (1-w) d_{2}(x,y) = 0$. Since $w \ge 0$ and $1-w \ge 0$, and since $d_{1}$ and $d_{2}$ are non-negative (being valid metrics), this is a sum of two non-negative terms; the sum can be zero only if each term is zero.

If $0 < w < 1$, then $d_{1}(x,y) = 0$ and $d_{2}(x,y) = 0$, and either equality implies $x = y$ because $d_{1}$ and $d_{2}$ are metrics. If $w = 1$, then $D = d_{1}$, so $d_{1}(x,y) = 0$ and again $x = y$; the case $w = 0$ is identical with $d_{2}$.

Symmetry: $D(x,y) = D(y,x)$

Since $d_{i}$ are valid metrics, $d_{i}(x,y) = d_{i}(y, x)$. So, we can write

\[\begin{aligned} D(x,y) &= w d_{1}(x,y) + (1-w) d_{2}(x,y) \\ &= w d_{1}(y,x) + (1-w) d_{2}(y,x) \\ &= D(y,x) \end{aligned}\]

Triangle inequality: $ D(x, z) + D(z,y) \ge D(x,y) $

\[\begin{aligned} D(x, z) + D(z,y) &= \big( w d_{1}(x,z) + (1-w) d_{2}(x,z) \big) + \big( w d_{1}(z,y) + (1-w) d_{2}(z,y) \big) \\ &= w \big( d_{1}(x,z) + d_{1}(z,y) \big) + (1-w) \big( d_{2}(x,z) + d_{2}(z,y) \big) \\ &\ge w d_{1}(x, y) + (1-w) d_{2}(x,y) \\ &= D(x,y). \end{aligned}\]