This article introduces the k-Nearest Neighbors (k-NN) regressor as the first topic in a Machine Learning “Advent Calendar” series focused on implementing machine learning and deep learning techniques in Excel. The k-NN regressor is one of the simplest predictive models, which makes it an ideal starting point for learners. The article encourages readers, even those already familiar with k-NN, to revisit important nuances: scaling continuous features, handling categorical variables, and choosing a distance measure suited to the application context, for example housing price prediction, where geographical location plays a vital role.
The fundamental concept of k-NN is illustrated through a practical example: estimating apartment prices from nearby similar properties. The method is intuitive: the prediction for a new case is the average of the target values of its closest neighbors in the dataset. To demonstrate this, the article uses the California Housing Dataset, which contains census data from California block groups along with median house values and features such as median income, house age, average rooms, population, and geographical coordinates.
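In formula form, this is the standard k-NN regression average; the notation below is a reference statement rather than something taken from the article itself, with N_k(x) denoting the set of the k training points closest to the new point x under the chosen distance measure:

```latex
% k-NN regression: predict the mean target value over the k nearest
% training points, where N_k(x) is the neighbor set of x.
\hat{y}(x) = \frac{1}{k} \sum_{i \in N_k(x)} y_i
```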
Initially, the focus is on using a single continuous feature to predict house prices. The article demonstrates the end-to-end k-NN process: exploring the dataset, setting the hyperparameter k (the number of neighbors), and making predictions. Unlike many machine learning models, k-NN requires no explicit training, since the entire dataset acts as the model. Distances from the new observation to all training points are calculated as the absolute difference in the feature value. With k = 3, the three closest neighbors are identified using Excel functions such as RANK and SUMPRODUCT, and their target values are averaged to produce the prediction, which can be visualized with color-coded plots showing the training data, the new point, and its neighbors.
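As a rough sketch of how this can look on the worksheet (the cell addresses and ranges here are illustrative assumptions, not the article's actual layout):

```excel
' Hypothetical layout: feature in A2:A101, target (median house value)
' in B2:B101, new observation's feature value in E2, k = 3.

' Distance to each training point: absolute difference in the feature
' value. Enter in C2 and fill down to C101:
=ABS(A2 - $E$2)

' Rank each distance in ascending order (1 = nearest neighbor).
' Enter in D2 and fill down to D101:
=RANK(C2, $C$2:$C$101, 1)

' Prediction: average the target values of the k = 3 nearest neighbors.
' (Assumes no tied distances; ties make RANK repeat values and would
' need a tie-breaker in practice.)
=SUMPRODUCT((D2:D101 <= 3) * B2:B101) / 3
```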
The article then expands the prediction task to multiple new observations by repeating the distance calculations and prediction steps within Excel. It also explores the effect of the hyperparameter k: small values of k can lead to overfitting, where the model captures noise rather than the underlying pattern, while large values of k risk underfitting, where the model oversimplifies the data by averaging over too many points. In the extreme case, when k equals the size of the training dataset, the prediction collapses to the global average and no longer adapts to the features of individual observations.
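To experiment with different values of k without rewriting formulas, k can be kept in its own cell; this is an illustrative extension of the hypothetical layout above, not the article's exact setup:

```excel
' Put k in H1 (e.g. 3), then replace the hard-coded 3 in the prediction:
=SUMPRODUCT((D2:D101 <= $H$1) * B2:B101) / $H$1

' Setting H1 to the full training size (here 100) reproduces the extreme
' case: the prediction becomes the global average of all targets,
=AVERAGE(B2:B101)
' and is identical for every new observation.
```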
Looking ahead, the tutorial transitions to a more complex scenario involving two continuous features and discusses how the approach differs from the single-feature case (a sketch of the two-feature distance follows below). The article also highlights the opportunity for readers to deepen their Excel skills with formulas such as IF, RANK, and SUMPRODUCT, which make it possible to implement k-NN without any programming language. Overall, the tutorial combines conceptual understanding, practical Excel application, and critical considerations such as feature scaling and distance metrics, providing a solid foundation for beginners exploring machine learning in a familiar spreadsheet environment.
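For the two-feature case, the only formula that really changes is the distance. A plausible sketch, again with assumed cell positions (features in A2:A101 and B2:B101, targets in C2:C101, the new point's two feature values in F2 and G2):

```excel
' Euclidean distance over two continuous features.
' Enter in D2 and fill down to D101:
=SQRT((A2 - $F$2)^2 + (B2 - $G$2)^2)

' The rest of the pipeline (RANK, SUMPRODUCT, averaging) is unchanged.
' Note: with two features on different scales, rescaling matters,
' otherwise the larger-scaled feature dominates the distance.
```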