This article introduces the k-Nearest Neighbors (k-NN) regressor as the first topic in a Machine Learning “Advent Calendar” series focused on implementing machine learning and deep learning techniques in Excel. The k-NN regressor is one of the simplest predictive models, which makes it an ideal starting point for learners. The article encourages readers, even those already familiar with k-NN, to revisit important nuances: scaling continuous features, handling categorical variables, and choosing a distance measure suited to the application context, for example housing price prediction, where geographical location plays a vital role.
The fundamental concept of k-NN is illustrated through a practical example: estimating apartment prices from nearby similar properties. The method is intuitive: the prediction for a new case is the average of the target values of its closest neighbors in the dataset. To demonstrate this, the article uses the California Housing Dataset, which contains census data from California block groups along with median house values and features such as median income, house age, average rooms, population, and geographical coordinates.
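In formula form, this is the standard k-NN regression average; the notation below is a reference statement rather than something taken from the article itself, with N_k(x) denoting the set of the k training points closest to the new point x under the chosen distance measure:

```latex
% k-NN regression: predict the mean target value over the k nearest
% training points, where N_k(x) is the neighbor set of x.
\hat{y}(x) = \frac{1}{k} \sum_{i \in N_k(x)} y_i
```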
Initially, the focus is on using a single continuous feature to predict house prices. The article demonstrates the end-to-end k-NN process: exploring the dataset, setting the hyperparameter k (the number of neighbors), and making predictions. Unlike many machine learning models, k-NN requires no explicit training, since the entire dataset acts as the model. Distances from the new observation to all training points are calculated as the absolute difference in the feature value. With k = 3, the three closest neighbors are identified using Excel functions such as RANK and SUMPRODUCT, and their target values are averaged to produce the prediction, which can be visualized with color-coded plots showing the training data, the new point, and its neighbors.
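As a rough sketch of how this can look on the worksheet (the cell addresses and ranges here are illustrative assumptions, not the article's actual layout):

```excel
' Hypothetical layout: feature in A2:A101, target (median house value)
' in B2:B101, new observation's feature value in E2, k = 3.

' Distance to each training point: absolute difference in the feature
' value. Enter in C2 and fill down to C101:
=ABS(A2 - $E$2)

' Rank each distance in ascending order (1 = nearest neighbor).
' Enter in D2 and fill down to D101:
=RANK(C2, $C$2:$C$101, 1)

' Prediction: average the target values of the k = 3 nearest neighbors.
' (Assumes no tied distances; ties make RANK repeat values and would
' need a tie-breaker in practice.)
=SUMPRODUCT((D2:D101 <= 3) * B2:B101) / 3
```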
The article then expands the prediction task to multiple new observations by repeating the distance calculations and prediction steps within Excel. It also explores the effect of the hyperparameter k: small values of k can lead to overfitting, where the model captures noise rather than the underlying pattern, while large values of k risk underfitting, where the model oversimplifies the data by averaging over too many points. In the extreme case, when k equals the size of the training dataset, the prediction collapses to the global average and no longer adapts to the features of individual observations.
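To experiment with different values of k without rewriting formulas, k can be kept in its own cell; this is an illustrative extension of the hypothetical layout above, not the article's exact setup:

```excel
' Put k in H1 (e.g. 3), then replace the hard-coded 3 in the prediction:
=SUMPRODUCT((D2:D101 <= $H$1) * B2:B101) / $H$1

' Setting H1 to the full training size (here 100) reproduces the extreme
' case: the prediction becomes the global average of all targets,
=AVERAGE(B2:B101)
' and is identical for every new observation.
```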
Looking ahead, the tutorial transitions to a more complex scenario involving two continuous features and discusses how the approach differs from the single-feature case (a sketch of the two-feature distance follows below). The article also highlights the opportunity for readers to deepen their Excel skills with formulas such as IF, RANK, and SUMPRODUCT, which make it possible to implement k-NN without any programming language. Overall, the tutorial combines conceptual understanding, practical Excel application, and critical considerations such as feature scaling and distance metrics, providing a solid foundation for beginners exploring machine learning in a familiar spreadsheet environment.
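For the two-feature case, the only formula that really changes is the distance. A plausible sketch, again with assumed cell positions (features in A2:A101 and B2:B101, targets in C2:C101, the new point's two feature values in F2 and G2):

```excel
' Euclidean distance over two continuous features.
' Enter in D2 and fill down to D101:
=SQRT((A2 - $F$2)^2 + (B2 - $G$2)^2)

' The rest of the pipeline (RANK, SUMPRODUCT, averaging) is unchanged.
' Note: with two features on different scales, rescaling matters,
' otherwise the larger-scaled feature dominates the distance.
```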