This article is by Ben Moss and originally appeared on the Alteryx Data Science Blog here: https://community.alteryx.com/t5/Data-Science-Blog/Alteryx-Your-Discover-Weekly-A-Practical-Application-of-K/ba-p/563359
What Is K-Nearest Neighbours (KNN)?
The Wikipedia definition is:
In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space.
That’s quite a statistical explanation, right? To most people, those words sound complicated… but conceptually, KNN is relatively straightforward.
As we’re all about conveying information visually at The Information Lab, let’s share some of Gwilym’s finest drawings to help illustrate exactly what KNN is and how it works (at a high level).
I like animals, so let’s explore KNN by looking at the differences between cats and dogs. They’re both four-legged mammals with tails, which some people keep as pets, but there are a few ways in which they differ. Most dogs are bigger than most cats, and most dogs are friendlier than most cats. That’s shown on this graph here - the red dots represent dogs and the blue dots represent cats:
I’ve just received a phone call from somebody who doesn’t really know much about animals. They’ve just found some animals in their house, and they don’t know if they’re cats or dogs. I can’t see what they’re looking at, but if they tell me how big they are and how friendly they are, I can use KNN to compare their mystery animal to some real ones.
The first animal is pretty small, and not very friendly:
When we plot this on the graph, its closest neighbours are three cats and one dog. Based on the identity of its nearest neighbours, I reckon this is probably a cat, or maybe a small dog.
The second animal is really big and really friendly:
This animal’s closest neighbours are all dogs, so it’s almost definitely a big friendly dog.
Finally, this animal is quite friendly, and quite small:
Its closest neighbours are a mix of cats and dogs, so it’s hard to tell - there are three cats that it’s close to, so it could well be a big friendly cat, but it’s also pretty close to two dogs, so it could also be a dog.
KNN does come with limitations, though - if we’re only comparing these animals to cats and dogs, then we’ll only get predictions based around cats and dogs. This final animal is really big and really unfriendly:
KNN based on cats and dogs will say it has an approximately even chance of being a massive unfriendly cat or a massive unfriendly dog. It’s actually a bear, but the KNN algorithm doesn’t know that because we haven’t included bears in the input. In other words, KNN can only find neighbours based on things it already knows - if we haven’t told the KNN tool anything about bears, then it won’t recognise a bear.
In short, KNN allows you to find the distance between a thing and some other things based on a shared set of properties. The closest other things are the nearest neighbours, and you can use the identities of the nearest neighbours to make predictions about what the initial thing is. KNN can work on as few as two variables, and it’s easiest to illustrate with two because they’re easy to draw on a graph. In most cases, we use more than two variables. There isn’t technically an upper limit, but the more variables there are, the longer the process will take (which you don’t want), and the more likely it is that several of them will be highly correlated (which you also don’t want).
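To make the cats-and-dogs idea concrete, here is a minimal sketch in plain Python. It is not the Alteryx implementation: the animal measurements are made up for illustration, the 0-10 scales for size and friendliness are an assumption, and `knn_predict` is a hypothetical helper that simply takes a majority vote among the k closest points.

```python
import math
from collections import Counter

# Hypothetical training data: (size, friendliness) on an assumed 0-10 scale.
ANIMALS = [
    ((2.0, 3.0), "cat"),
    ((3.0, 2.0), "cat"),
    ((2.5, 4.0), "cat"),
    ((3.5, 3.5), "cat"),
    ((6.0, 7.0), "dog"),
    ((7.0, 8.0), "dog"),
    ((8.0, 6.5), "dog"),
    ((4.0, 6.0), "dog"),
]

def knn_predict(point, training, k=3):
    """Return (majority label, labels of the k nearest neighbours)."""
    # Sort the known animals by Euclidean distance to the mystery animal.
    by_distance = sorted(training, key=lambda item: math.dist(point, item[0]))
    nearest_labels = [label for _, label in by_distance[:k]]
    majority = Counter(nearest_labels).most_common(1)[0][0]
    return majority, nearest_labels

small_unfriendly = knn_predict((2.5, 2.5), ANIMALS)
big_friendly = knn_predict((7.5, 7.5), ANIMALS)
# A "bear" still comes back as a cat or a dog, because those are the only labels known.
bear = knn_predict((10.0, 0.5), ANIMALS)
```

The bear case shows the limitation described above: whatever point we pass in, the answer can only ever be one of the labels in the training data.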
Business Use Cases
Now, before we get into the fun of applying KNN within Alteryx, we thought we would also share a couple of real business examples from work we have done with some of our clients.
The first use case is Field Identification, by which we mean “What is this data supposed to be?” This is useful for a business process where different people submit a similar Excel file for processing each month. It uses KNN to automatically label data without the need for your users to adhere to a specific template (this may help with the adoption of your data transformation process).
Let’s look at the problem in more detail. In the first input I have some historical transactional sales data in a nice clean format. In contrast, our second input dataset, which we want to union against our existing dataset, has no headers and the fields seem to be presented in a different order.
To match these two sources together, we can use the actual data that exists in each field to create a profile of the kind of data we would expect to be present, then we can do the same to our unknown datasource, and use KNN to find the best match. Some examples of metrics we could use might be:
- How many values in the field are unique?
- What’s the average length of the values in each field?
- What’s the min length of the values in each field?
- What’s the max length of the values in each field?
- What’s the standard deviation of the length in each field?
- How much of the text is uppercase?
- Does the string contain punctuation?
Taken together, these values can help narrow down what sort of data is in a field. For example, if the field is a country name, the average string length will be maybe 6-10 letters, and the max length probably won’t be more than about 15. There will probably be relatively few unique values, as there are only about 200 countries in the world to choose from. In comparison, if the field is a customer name, the average, minimum, and maximum string lengths will probably be longer, and the number of unique values will probably be much higher.
These are of course questions that would allow us to profile string variables, but you could also come up with rules to apply against fields of numeric type, such as “Does the field contain both positive and negative values?” (which might indicate profit and rule out sales, which can’t be negative), or “What are the min and max numeric values?” (which might indicate a percentage if it’s between 0 and 100).
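As a rough illustration of the string metrics listed above, the sketch below profiles a single field in plain Python. The function name `profile_field` and the metric names are our own, not anything from the Alteryx workflow, and we have assumed that “unique”, “uppercase” and “punctuation” are best expressed as ratios so that fields with different row counts stay comparable.

```python
import statistics
import string

def profile_field(values):
    """Summarise a non-empty list of strings as a dictionary of simple metrics."""
    lengths = [len(v) for v in values]
    letters = [ch for v in values for ch in v if ch.isalpha()]
    return {
        # Share of rows that are distinct values.
        "unique_ratio": len(set(values)) / len(values),
        "avg_length": statistics.mean(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
        # Population standard deviation of the string lengths.
        "std_length": statistics.pstdev(lengths),
        # Share of letters that are uppercase (0.0 if there are no letters).
        "upper_ratio": sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0,
        # Share of rows containing at least one punctuation character.
        "punct_ratio": sum(any(ch in string.punctuation for ch in v) for v in values) / len(values),
    }

countries = ["France", "Germany", "France", "Spain"]  # hypothetical sample field
country_profile = profile_field(countries)
```

Running the same function over every known and unknown field gives each one a numeric profile that a nearest neighbours search can compare.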
Once we have profiled our data from each input, it’s then a case of applying our KNN model to find the best match for each unknown field from our historical transactional data.
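The matching step might look something like the sketch below. The field names and profile numbers are invented for illustration, and only three metrics are used to keep it short. We have also assumed that each metric should be standardised (z-scored) first, so that a metric measured in characters does not swamp one measured as a ratio; the original workflow may handle scaling differently.

```python
import math
import statistics

# Hypothetical profiles: [unique ratio, average length, maximum length].
known = {
    "Country": [0.02, 7.5, 14.0],
    "Customer Name": [0.85, 15.0, 32.0],
    "Order ID": [1.00, 10.0, 10.0],
}
unknown = {
    "Field_1": [0.99, 10.0, 10.0],
    "Field_2": [0.03, 8.0, 13.0],
    "Field_3": [0.80, 16.5, 30.0],
}

def standardise(profiles, reference):
    """Z-score each metric using the mean and spread of the reference profiles."""
    columns = list(zip(*reference.values()))
    means = [statistics.mean(c) for c in columns]
    # Fall back to 1.0 when a metric has no spread, to avoid dividing by zero.
    spreads = [statistics.pstdev(c) or 1.0 for c in columns]
    return {
        name: [(x - m) / s for x, m, s in zip(vec, means, spreads)]
        for name, vec in profiles.items()
    }

def best_matches(known, unknown):
    """Map each unknown field to the known field with the closest profile."""
    known_z = standardise(known, known)
    unknown_z = standardise(unknown, known)
    return {
        name: min(known_z, key=lambda k: math.dist(vec, known_z[k]))
        for name, vec in unknown_z.items()
    }

matches = best_matches(known, unknown)
```

This is KNN with k = 1: each unknown field simply takes the label of its single nearest neighbour.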
Even in this simple workflow, with a limited profile, we successfully match all unknown string fields against their equivalent from our historical data. You can find a copy of this example in your folder Example 1 - Field Identification.yxmd.
Another example of KNN analysis we’ve done at clients is in manufacturing.
Let’s say we’ve got a factory that makes big, complicated machines. We measure everything we can about these machines, as everything has to be within specifications. For example, if we’re building, say, a car, the axle should be a fixed diameter, so we’ll measure the axle diameter at three different points to make sure it’s within spec. That’s three data points already. We also want to make sure the axle is consistently circular and not elliptical or deformed, so we’ll also measure the eccentricity of the axle at those same points. That’s another three data points. You can see how this can build up to the point where there are thousands of dimensions which are recorded for a machine going through assembly. Once a machine has been assembled, it goes to test to see if it performs well enough to sell to a customer.
Here’s an example of how that data might look - there’s a field for Machine ID, the test result, and a column for all the dimensions (although we would recommend storing this data in a long, transposed format if there are hundreds or thousands of dimensions):
Machines 010X and 011X are currently being built and haven’t yet gone to test. Building and testing a machine is a long and expensive process, so what we want to do is get an idea of what kind of machines 010X and 011X are before we test them - there’s no point wasting time and money in testing a machine that has a critical flaw that we haven’t detected. Ideally, we want to iron out manufacturing issues before the machine goes to test.
So, we can use a KNN analysis to assess which previous machines 010X and 011X look like. If the nearest neighbours are machines which all passed test, that’s a good sign. If the nearest neighbours include machines which failed test, then we might want to take a closer look at the manufacturing data, find the causes of previous failures, and see if those patterns exist in our machines which are currently in assembly.
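A minimal sketch of that idea in plain Python follows. The measurements, the historical machine IDs and the `nearest_machines` helper are all hypothetical, and only three dimensions are used where a real machine might have thousands. With real data you would probably also want to scale the dimensions before measuring distance.

```python
import math

# Hypothetical history: machine ID -> (dimension measurements, test result).
history = {
    "001X": ([10.0, 0.10, 5.0], "Pass"),
    "002X": ([10.2, 0.12, 5.1], "Pass"),
    "003X": ([9.9, 0.11, 4.9], "Pass"),
    "004X": ([11.5, 0.40, 6.0], "Fail"),
    "005X": ([11.8, 0.45, 6.2], "Fail"),
}
# Machines still in assembly, with no test result yet (values are made up).
in_assembly = {
    "010X": [11.4, 0.38, 5.9],
    "011X": [10.1, 0.11, 5.0],
}

def nearest_machines(measurements, history, k=3):
    """Return the k closest past machines as (id, result, distance) tuples."""
    ranked = sorted(
        (math.dist(measurements, dims), machine_id, result)
        for machine_id, (dims, result) in history.items()
    )
    return [(machine_id, result, round(d, 2)) for d, machine_id, result in ranked[:k]]

for machine_id, dims in in_assembly.items():
    neighbours = nearest_machines(dims, history)
    fails = sum(result == "Fail" for _, result, _ in neighbours)
    print(machine_id, neighbours, f"{fails} of {len(neighbours)} neighbours failed")
```

A machine whose neighbours include failures is not necessarily faulty, but it is a sensible candidate for a closer look before it goes to test.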
Here is a simple workflow which does just that:
The results are informative. The nearest neighbours to 011X are all machines which passed test, so we can be more confident that 011X is a good machine. But 010X is a concern - two of its three nearest neighbours failed test, so 010X may have some issues which we haven’t detected yet. A quick look at the distance is also useful - 011X is a Euclidean distance of 3.63 away from its nearest neighbour, whereas 010X is a Euclidean distance of 5.27 away from its nearest neighbour. We might run this KNN analysis again to look at the average distance of all machines from their nearest neighbours - if 5.27 is still the biggest first neighbour distance, then this KNN analysis is also telling us that not only is 010X closer to some failed machines, it’s also generally further away from the rest of the group, meaning that something about its manufacturing characteristics is a bit different in general, which gives us some more food for thought.
You can find a copy of this example in your folder Example 2 - Manufacturing Characteristics.yxmd
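The follow-up check suggested for 010X, comparing every machine’s distance to its own nearest neighbour, can be sketched in a few lines of Python. The measurements here are invented, and `first_neighbour_distances` is a hypothetical helper rather than part of the workflow.

```python
import math

# Hypothetical measurements for every machine, built or in assembly.
machines = {
    "001X": [10.0, 5.0],
    "002X": [10.3, 5.1],
    "003X": [9.8, 4.9],
    "004X": [10.1, 5.2],
    "010X": [12.5, 6.4],
}

def first_neighbour_distances(machines):
    """Map each machine ID to the Euclidean distance to its closest other machine."""
    return {
        machine_id: min(
            math.dist(dims, other_dims)
            for other_id, other_dims in machines.items()
            if other_id != machine_id
        )
        for machine_id, dims in machines.items()
    }

distances = first_neighbour_distances(machines)
# The machine furthest from its nearest neighbour is the most unusual one.
most_isolated = max(distances, key=distances.get)
```

If the machine we are worried about also turns out to be the most isolated one, that is a second, independent hint that something about its build is different.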
Bye For Now!
That’s the end of this first blog, but the real fun starts in the next one, where we look to use Alteryx to build a nearest neighbours model which allows us to label songs by different artists, a variation on the example of cats and dogs that Gwilym introduced earlier!