A Practical Application of K-Nearest Neighbours Analysis II

May 20th, 2020

This article is by Ben Moss and originally appeared on the Alteryx Data Science Blog here: https://community.alteryx.com/t5/Data-Science-Blog/Alteryx-Your-Discover-Weekly-A-Practical-Application-of-K/ba-p/564059


Carly Rae Jepsen or Napalm Death

So, let’s review how we can apply KNN in Alteryx in order to allow us to determine whether a track is by either pop artist Carly Rae Jepsen or grindcore band Napalm Death. It’s a slightly different topic to the dogs and cats one outlined above, but we are trying to solve the same kind of question - we are trying to classify whether an item is one of two things.

When we listen to tracks by each artist, we can generally tell that these two songs are from completely different artists and genres:

When listening to a track, we assess a number of different attributes that help us to classify these songs into different genres and artists - things like how abrasive the vocals are, how fast the beat is, what the textures of the sounds are like. In order for a computer to make the same judgements, we need to have these sound attributes codified as particular data points.

Spotify have created a series of metrics called Audio Features which we can use to train Alteryx to tell the difference between Carly Rae Jepsen and Napalm Death. Audio features cover a range of quantitative metrics, and taken altogether, they create a profile of a track. Some audio features are one-dimensional, like tempo, which is a simple measure of how fast the beat is. Other features are a composite score of various dimensions into a relatable concept, such as the metric danceability, which is a composite score of aspects like tempo, rhythm, and beat strength to indicate how suitable a song is for dancing.
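To make the idea of a track profile concrete, here is a sketch of what an audio-features record can look like as a plain Python dict. The field names match Spotify's Audio Features endpoint, but the values below are invented for illustration, not taken from a real track:

```python
# A track profile as a plain dict. Field names follow Spotify's Audio
# Features endpoint; the values are invented for this illustration.
track_features = {
    "danceability": 0.78,   # composite 0-1 score: how suitable for dancing
    "energy": 0.65,         # composite 0-1 score: intensity and activity
    "tempo": 120.0,         # one-dimensional: beats per minute
    "valence": 0.83,        # 0-1 score: musical positiveness
    "acousticness": 0.12,   # 0-1 confidence that the track is acoustic
}

# Taken together, these numbers form the feature vector that a KNN
# analysis compares between tracks.
feature_vector = list(track_features.values())
print(feature_vector)
```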


Spotify already use these audio features, along with additional metrics about us as individuals, to help generate the content that is recommended to us via curated playlists. But handily, they also make the audio features accessible to the rest of us through their API, so we can use them for our own KNN analysis.

Like with the cats and dogs example earlier, to build a KNN analysis to tell the difference between Carly Rae Jepsen and Napalm Death, we need a data set of known songs where we have the artist and the audio features of each track. We also need a data set of mystery songs without the artist information, but still with the same audio features. We’ve already used the Spotify API to create the data needed for this exercise.

These inputs already exist on our canvas in preparation for this exercise (workflow: Exercise 1 - Starter - Carly Rae Jepsen or Napalm Death).


The next stage of our analysis is to build our KNN model. In the image above I have already brought the Find Nearest Neighbours tool onto our canvas. I have connected our Known tracks, which already have the artist information attached to them, to the D or Data anchor, while I have put our unlabelled track data into the Q or Query anchor.

The tool returns the N nearest neighbours for each record in the Q anchor, so placing the data streams into the correct input anchors is important! What we want is the nearest known songs to each unknown song, so that we can use those known songs to categorise the unknown one. If we put them the other way around, we’ll get the nearest unknown songs to each known song, which isn’t much use to us.
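Outside of Alteryx, the D/Q relationship can be sketched in a few lines of plain Python. This is a hand-rolled illustration of the same idea, not the Find Nearest Neighbours tool itself, and the artist labels and feature values are invented:

```python
import math

# Known ("D" anchor) tracks: artist label plus two illustrative audio
# features - labels and values are invented for this sketch.
known = [
    ("Carly Rae Jepsen", [0.75, 0.20]),
    ("Carly Rae Jepsen", [0.80, 0.25]),
    ("Napalm Death",     [0.30, 0.95]),
    ("Napalm Death",     [0.25, 0.90]),
]

def nearest_neighbours(query_features, known_tracks, n=3):
    """Return the n known tracks closest (by Euclidean distance) to the query."""
    scored = [(math.dist(query_features, feats), artist)
              for artist, feats in known_tracks]
    return sorted(scored)[:n]

# Unknown ("Q" anchor) track: its features sit near the Napalm Death tracks.
mystery = [0.28, 0.88]
print(nearest_neighbours(mystery, known))
```

Searching for each query track against the labelled data mirrors plugging the known tracks into the Data anchor and the mystery tracks into the Query anchor.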


In the configuration window there are a few aspects to set up. Firstly, we need to choose the fields to do our KNN analysis on. The drop-downs are populated by the data in our D stream, but all these fields must also be present in our Q stream or the tool will throw an error message.

The fields to select are the ones we want to compare between the two sources. It is only possible to select numeric variables (if you need to use boolean variables, convert the values to integers first). The tool will also error on null values, so you will need to either impute or filter out any nulls beforehand.

You must then determine whether you need to standardise your data (you almost definitely do need to do this, unless you’ve deliberately weighted your data in some way or other). This is an important step when building machine learning algorithms with variables that have different scales, as in this case. This process is designed to prevent the model being dominated by feature variables that exist in larger scales. In this case, we will standardise our data using z-scores.
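As a quick sketch of what z-score standardisation does (the feature values here are invented), each column is centred on its mean and divided by its standard deviation, so columns with large raw numbers like tempo no longer dominate the distance calculation:

```python
from statistics import mean, pstdev

# Two illustrative feature columns on very different scales.
tempo = [120.0, 125.0, 240.0, 235.0]      # beats per minute
danceability = [0.75, 0.80, 0.30, 0.25]   # 0-1 composite score

def z_scores(values):
    """Standardise a column: subtract the mean, divide by the std deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# After standardising, both columns sit on the same scale.
print(z_scores(tempo))
print(z_scores(danceability))
```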

The number of nearest neighbours you wish to find really depends on the use case in hand. The more neighbours you look for, the more potential insight you’ll get, but the longer the processing will take. And after a certain point, looking for a lot of neighbours will actually make things less clear. For example, with the dogs and cats example, if we look for the nearest twenty neighbours, we’re going to return the entire data set of ten dogs and ten cats, which will make it harder to answer our question.

The last thing we need to choose is our search algorithm. We’re using KD-Tree, which is a good trade-off between speed and accuracy. A linear search will find you the truest neighbours, but it may take a lot longer (and I’ve found that in the client projects I’ve worked on, the overall difference in results when using KD-Tree and linear search is minimal, but that could just be the projects I’ve worked on). My favourite explanation of how KD-Tree search algorithms work is in this blog by Vladimir Agafonkin from Mapbox. There are a handful of other algorithms available if you fancy playing around with those to see how your results differ.

Once you have run your workflow, your results will be passed into the N or Neighbours anchor.


You’ll notice the output structure isn’t perfect. We almost always use the same transformation process to put the data into a cleaner long (rather than wide) structure, before bringing back fields beyond just the track identifier.


The results in this case are quite promising. The image below shows that for our first two unknown tracks, all five nearest neighbour tracks are from the same artist. If all five nearest neighbours are Napalm Death tracks, then the mystery track is probably a Napalm Death track:


There could be cases where the neighbour tracks are split across multiple artists. In this case, there are a couple of methods you could apply in order to determine which artist’s track it is.

The first method is to take the most common artist (or modal artist) of the neighbour tracks. This is one reason we chose an odd number of neighbours for our KNN - with only two artists to choose from, it’s a best-of-five scenario, so one artist will always match more often than the other. This can be done with a single Summarize tool.
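The modal-artist step amounts to a simple vote count. Here is a minimal sketch in Python (the neighbour labels are invented):

```python
from collections import Counter

# Artists of the five nearest neighbours for one mystery track
# (labels invented for this sketch).
neighbour_artists = ["Napalm Death", "Napalm Death", "Carly Rae Jepsen",
                     "Napalm Death", "Carly Rae Jepsen"]

def modal_artist(artists):
    """Pick the most common artist among the neighbours."""
    return Counter(artists).most_common(1)[0][0]

print(modal_artist(neighbour_artists))  # → Napalm Death
```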


However, there might be more complicated situations. What if the closest two neighbours are Napalm Death, and the next three are Carly Rae Jepsen? The mode will return Carly Rae Jepsen, as she makes up three of the five nearest neighbours, but that doesn’t account for the fact that the closest two are Napalm Death.

This is illustrated below - the nearest two neighbours to the white dot are red, so it’s probably a red dot, but if we’re only taking the mode neighbour colour, our analysis will tell us that it’s probably a blue dot instead:


Alternatively, we could make use of the distance field that is part of the N output from our Find Nearest Neighbours tool.

This value represents the Euclidean distance between the mystery track and the neighbour track in question. The shorter the distance, the closer the two tracks are across all of the dimensions we selected, so we could average the distances per artist and pick the artist with the lowest average. In this case, the average distance for the red dots is lower than the average distance for the blue dots, so we can use that to decide that the white dot is probably red.
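The average-distance method can be sketched like this, using the tricky two-versus-three split described above (artists and distances are invented):

```python
from collections import defaultdict
from statistics import mean

# The two closest neighbours are Napalm Death, but three of the five
# are Carly Rae Jepsen (distances invented for this sketch).
neighbours = [
    ("Napalm Death", 0.10),
    ("Napalm Death", 0.12),
    ("Carly Rae Jepsen", 0.40),
    ("Carly Rae Jepsen", 0.45),
    ("Carly Rae Jepsen", 0.50),
]

def closest_artist_by_average_distance(neighbours):
    """Group distances by artist and pick the artist with the lowest mean."""
    by_artist = defaultdict(list)
    for artist, dist in neighbours:
        by_artist[artist].append(dist)
    return min(by_artist, key=lambda a: mean(by_artist[a]))

print(closest_artist_by_average_distance(neighbours))  # → Napalm Death
```

Here the mode would say Carly Rae Jepsen, but the much shorter Napalm Death distances win on average.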


Although this isn’t always going to be perfect either - if, for example, the KNN results return one red dot and four blue dots like this, the red dot will still have the lowest average distance and we’ll conclude that the white dot is probably red, even though 80% of its five nearest neighbours are blue:


Ultimately, the KNN tool will only give you the nearest neighbours and the distances - it’s then up to you to choose how you use that information.

Finally, let’s validate our results. As mentioned at the very start of this exercise, Carly Rae Jepsen and Napalm Death are extremely different artists, so while it’s positive that our model has successfully differentiated all of the tracks thrown at it, it’s certainly not a definitive test.

As a result, we decided to build the exact same workflow, this time focused on two artists that have some level of cross-over in terms of the music they produce: Metallica and Led Zeppelin. You’ll be able to hear from these two songs that the two styles of music are much more similar:

...which means that the data in the audio features will be more similar, too.

The solution can be accessed by looking at the Exercise 2 - Solution - Metallica vs Led Zeppelin Alteryx workflow file.

The results are still positive. The same model with the same settings successfully classified 13 of 15 songs. However, different tweaks to the model (such as including or not including specific variables) and different tweaks to the post-model processing (such as using average distance instead of modal artist) can vary this hugely, so it’s important to check that your KNN process is robust enough to work on different bits of data without being too specifically set up (or overfitted) for one particular set of data.

In this second use case we are going to explore how we can apply KNN in Alteryx in order to create our own Discover Weekly playlist.

Again, we need some data to work with. Firstly, we need a dataset which highlights Gwilym’s existing taste in music - for this we took the 100 songs Gwilym listened to most often in 2019, from Spotify’s auto-generated Your Top Songs 2019 playlist. Secondly, we need a dataset of other tracks from Spotify’s library, which we took from various genre playlists and friends’ Your Top Songs 2019 playlists. From this second data source, we can use KNN to predict which 30 tracks are most likely to match Gwilym’s music taste.

These inputs already exist on our canvas in preparation for this exercise (workflow: Exercise 3 - Starter - Build Gwilym’s Discover Weekly).

In this case, there are a number of different solutions to the problem, and each of these affects the way we would look to use the Find Nearest Neighbours tool.

The first approach is to find the N closest tracks to each of the tracks in Gwilym’s library. From those matches, we could then restrict the output to 30 by keeping only each candidate track’s closest match, and then taking the 30 candidates with the lowest Euclidean distances. This is a simple way of finding only the closest neighbours, but it’s also potentially restrictive - one library track may have several neighbours that are all closer than another track’s nearest neighbour, and we still want to consider those.
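That filtering step can be sketched as follows (candidate names and distances are invented): keep each candidate’s single closest match, then take the N lowest distances.

```python
# Matches from a nearest-neighbours run: (candidate_track, distance) pairs,
# where the same candidate can match more than one of Gwilym's tracks.
# Names and distances are invented for this sketch.
matches = [
    ("Track A", 0.20), ("Track B", 0.05), ("Track A", 0.15),
    ("Track C", 0.30), ("Track D", 0.10), ("Track B", 0.25),
]

def top_candidates(matches, n):
    """Keep each candidate's closest match, then take the n lowest distances."""
    best = {}
    for track, dist in matches:
        best[track] = min(dist, best.get(track, float("inf")))
    return sorted(best, key=best.get)[:n]

print(top_candidates(matches, 3))  # → ['Track B', 'Track D', 'Track A']
```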

Another way is to look at which candidate tracks have the shortest Euclidean distance to any one track in Gwilym’s library, or even the shortest average distance if a candidate has matched against multiple tracks.


We could also try to make a broader profile of what Gwilym likes to listen to. Rather than finding the nearest neighbours to all 100 tracks he likes, we could find “Gwilym’s Average Track” by averaging together the audio features. Then we can use the KNN tool to find the nearest 30 tracks to this average track. For large datasets, this will be a lot quicker than repeating a KNN search across hundreds or thousands of tracks.
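The average-track approach reduces to one centroid query. Here is a minimal sketch (all track names and feature values are invented):

```python
import math

# Gwilym's liked tracks as illustrative feature vectors.
liked = [[0.70, 0.30], [0.80, 0.20], [0.75, 0.25]]

# Candidate tracks from the wider library (names and values invented).
candidates = {
    "Candidate 1": [0.74, 0.26],
    "Candidate 2": [0.10, 0.90],
    "Candidate 3": [0.60, 0.40],
}

# "Gwilym's Average Track": the mean of each audio feature across his tracks.
average_track = [sum(col) / len(liked) for col in zip(*liked)]

def recommend(candidates, centre, n=2):
    """Rank candidates by distance to the average track; take the closest n."""
    return sorted(candidates, key=lambda t: math.dist(candidates[t], centre))[:n]

print(recommend(candidates, average_track))  # → ['Candidate 1', 'Candidate 3']
```

One query against the centroid replaces a hundred separate neighbour searches, which is where the speed-up comes from.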


However, the downside is that averaging across all tracks will lose some potentially important information. For example, what if Gwilym had a bipolar music taste, and only listened to death metal and classical music? If we average together these very different styles of music, we’ll artificially create an “average track” that’s about halfway between the two styles and is nothing like what Gwilym actually listens to.

So, as a way of taking this further, we could use clustering to separate out the tracks that Gwilym likes into different genres, find the average track profile within a genre, and then use a KNN analysis to find the nearest tracks to the average track of each genre.
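As a toy sketch of that clustering idea, here is a minimal hand-rolled 2-means on an exaggerated "bipolar taste" dataset - in practice you would use a proper clustering tool, and the feature values here are invented:

```python
import math

# Two well-separated groups of liked tracks (values invented):
# the first three are "death-metal-ish", the last three "classical-ish".
liked = [[0.10, 0.90], [0.15, 0.85], [0.12, 0.88],
         [0.90, 0.10], [0.85, 0.15], [0.88, 0.12]]

def two_means(points, iterations=10):
    """Minimal 2-cluster k-means: assign to the nearer centre, re-average.

    Assumes neither cluster ever empties, which holds for this toy data."""
    centres = [points[0], points[-1]]  # naive but deterministic start
    for _ in range(iterations):
        clusters = ([], [])
        for p in points:
            nearer = 0 if math.dist(p, centres[0]) <= math.dist(p, centres[1]) else 1
            clusters[nearer].append(p)
        centres = [[sum(col) / len(c) for col in zip(*c)] for c in clusters]
    return centres

# One "average track" per genre cluster, rather than one overall average;
# each centre can then be the query point in its own nearest-neighbour search.
print(two_means(liked))
```

Averaging per cluster keeps each genre’s profile intact instead of blending them into a track that matches neither.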


And those are a few different ways we can create Gwilym’s Discover Weekly in Alteryx. We thought it would be well worth you having a go at this yourself with your own data!