Ok, so full disclosure: at Angoss we have an idea wall for new product features, and I’m the one who wrote “Filtering Join” on a pink sticky note and placed it on the wall. So maybe I was destined to love the Filtering Join node even if it wasn’t great. But I do think it’s great. It’s an incredibly valuable node, and I use it all the time, and today I want to tell you what I love about it.
Before I get started, I think it’s worth mentioning just what the filtering join node does.
The purpose of the filtering join node is to reduce the number of records in a dataset by comparing its records with those in a second dataset. The resulting subset consists of all of the rows from the first dataset that either match or do not match the second dataset (in a particular column). In this sense, it is not a traditional join; it is more of a filter, but the filtering is done on the basis of matching rows from a separate dataset (the kind of thing you do in a join).
There are two kinds of filtering-join actions: semi-join and anti-join. A semi-join keeps all the records from the first dataset that have a match in the second input dataset, and an anti-join keeps all the records from the first dataset that do not have a match in the second. Here’s a simple example of a semi-join, joining on the field “Country”:
And here’s a simple example of an anti-join (also joining on the field “Country”):
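The Filtering Join node itself is a workflow (GUI) operation, but for readers who think in code, the two actions above can be sketched in pandas. The dataset contents here are made up purely for illustration:

```python
# Illustrative sketch of a semi-join and an anti-join on "Country".
# The data is hypothetical; only the join-on-"Country" idea comes
# from the examples above.
import pandas as pd

sales = pd.DataFrame({
    "Country": ["Canada", "Mexico", "Brazil", "France"],
    "Sales":   [100, 200, 150, 300],
})
in_scope = pd.DataFrame({"Country": ["Canada", "France"]})

# Semi-join: keep rows of `sales` whose Country appears in `in_scope`
semi = sales[sales["Country"].isin(in_scope["Country"])]

# Anti-join: keep rows of `sales` whose Country does NOT appear
anti = sales[~sales["Country"].isin(in_scope["Country"])]

print(semi["Country"].tolist())  # ['Canada', 'France']
print(anti["Country"].tolist())  # ['Mexico', 'Brazil']
```

Note that in both cases the output contains only rows (and columns) from the first dataset; the second dataset acts purely as the filter.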
So, now that you know what you can do with the Filtering Join node, let me tell you why I love it. When I started to write this blog entry, I thought it would be easy to explain what’s so great about the Filtering Join node, and that I would have tons of easily accessible examples to choose from, but it turned out to be a little trickier than I expected to keep my thoughts coherent.
The problem was that in every example where I felt a filtering join saved me a ton of steps, if I thought hard about alternatives, there was usually some other slick option that, with a little cleverness, could do things almost as efficiently. So it might seem like the filtering join node really only saves one or two steps, and in that case, what’s so special about it? Is saving a step or two every so often really worth the trouble of learning new functionality? In general, this may be a tricky and highly personal question, but two things make my answer a definitive YES in the case of Filtering Join.
- I find that the filtering join saves me a step or two very often, and in a few different settings (examples coming later). Saving a step or two infrequently may not be worth the time it takes to save them, but saving a couple steps a couple times on nearly every project I work on is definitely worth it in the long run.
- I find the filtering join the most natural way to do almost all the things I use it for. That means I save a lot of time by using filtering join and not having to spend the time to come up with clever work-arounds. This one pays off extra, because doing things the most natural way means it’s easier for me to remember what I did when looking back over my work, it makes my documentation easier, and it makes sharing my work with others easier as well.
So, with that high-level explanation out of the way, you’re probably thinking “just give me some examples already, will ya?”. I sure will! Here are two examples ripped from a real project I was recently working on. I also created a little challenge problem / tutorial based on this project for those who want to give it a try. That’s at the end.
I was recently working on a project where we built a model to estimate the effect of trade spend (advertising) on end sales (in-store sales), across a number of different types of retailers in many different cities all under the purview of a single sales organization.
If that language is unfamiliar to you, don’t worry, it’s not relevant for this example.
In this example, we had multiple different datasets, and each contained information for a large number of cities, some of which were in scope for the project, but some of which were out of scope. During the course of our analysis, we needed to filter the data to include only the in-scope cities, and we needed to do this many times.
Our solution: create a tiny dataset with just one field (city name), and just the set of cities in scope for the project. Every time we needed to filter a new dataset, we just used a filtering join with that same tiny dataset and moved on.
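In code terms, the pattern is: build the tiny scope list once, then apply the same semi-join to each dataset as it comes up. A minimal pandas sketch, with hypothetical dataset and city names:

```python
# Sketch of the "tiny city-list dataset" pattern. The names
# (in_scope, spend_df, sales_df) and cities are all hypothetical.
import pandas as pd

# The one tiny dataset: a single field, just the in-scope cities.
in_scope = pd.DataFrame({"City": ["Toronto", "Chicago"]})

spend_df = pd.DataFrame({"City": ["Toronto", "Boston", "Chicago"],
                         "TradeSpend": [5.0, 2.0, 7.5]})
sales_df = pd.DataFrame({"City": ["Boston", "Toronto", "Chicago"],
                         "EndSales": [40, 90, 120]})

def keep_in_scope(df, scope=in_scope, key="City"):
    """Semi-join: keep only rows whose key appears in the scope list."""
    return df[df[key].isin(scope[key])]

# Every time a new dataset needs filtering, it's the same one step.
spend_df = keep_in_scope(spend_df)
sales_df = keep_in_scope(sales_df)
print(sorted(spend_df["City"]))  # ['Chicago', 'Toronto']
```

The point of the pattern is that the definition of “in scope” lives in exactly one place, so every filter is a one-step, repeatable operation.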
Why do I find this such a great (and natural) solution? Let’s consider some of the other options.
- We could use a partition node to select the cities we wanted to keep directly, using the (stratified) partition wizard. This would avoid the need for an additional dataset, but every time we needed to apply the filter we would have to re-select the in-scope cities from a list. This would take far more time in the long run, and be more error prone.
- Instead of needing to select the cities from a list in the partition wizard, we could use binning within the variable transformations node to create a flag for all the relevant cities in each dataset, and then partition using that flag field. This is far less error prone than selecting from a list in the wizard every time, and with the ability to import/export expressions, probably faster as well. This is likely the solution we would have used before the Filtering Join node existed.
But this is 2 steps for every dataset we need to filter instead of 1. This also leaves lots of additional flag variables floating around that need documentation. This solution would also be a little harder to follow on the workflow, as a variable transformation node is more general and could be used for any number of things, or multiple things at a time (a great feature of the variable transformation node in many situations, just not this one). To see this for yourself, consider the two possible workflows below:
Filtering two datasets to only include cities in scope, with Filtering Join
Filtering two datasets to only include cities in scope, without Filtering Join
Not only is the first workflow simpler, but I also think that it’s easier to figure out what is going on just from looking at the workflow in the first case. This type of simplicity is a great (and often under-appreciated) property to strive for in your workflows.
Also of note: on the second workflow above, the final two datasets are carrying around an extra flag column. This can be removed in a later step (or by creating an extra step specifically for this purpose), but that also makes it a little harder to follow what’s going on.
- As one of my colleagues, Alex Gobolos, pointed out, there is another good option: using the (regular) Join node to perform an inner join and keeping only the columns from the first dataset. This is also a great solution, and is almost as quick. The wizard is slightly more cumbersome to navigate in this case because it’s more general, and it’s a little less clear from the workflow alone what you’re doing (the Join node doesn’t make it clear that you’re performing a filter, as opposed to joining in new information from a separate dataset, until you open the node up), but it’s still a very reasonable option in this example. On the other hand, an inner join can’t accomplish the functionality of the anti-join part of the filtering join, which will be discussed in the second example.
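One technical wrinkle with the inner-join alternative is worth knowing: if the filter dataset happens to contain duplicate key values, an inner join will duplicate the matching rows, while a semi-join never does. A small pandas sketch with hypothetical data:

```python
# Inner join vs. semi-join when the filter dataset has a duplicate
# key. All data here is hypothetical.
import pandas as pd

data  = pd.DataFrame({"City": ["Toronto", "Boston"], "Sales": [90, 40]})
scope = pd.DataFrame({"City": ["Toronto", "Toronto"]})  # duplicate key

# Inner join: Toronto matches twice, so its row appears twice.
inner = data.merge(scope, on="City", how="inner")

# Semi-join: each matching row is kept exactly once.
semi = data[data["City"].isin(scope["City"])]

print(len(inner), len(semi))  # 2 1
```

With a clean, deduplicated scope list the two approaches give the same rows, but the semi-join is the safer default for a pure filter.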
One final point I want to address: in this example (and in many cases in general), we will join these two datasets eventually, and so if we wait until after the join to do the filtering, we can do all the filtering in a single step (which seems simpler). This argument seems compelling at first, but in practice, this isn’t really how interactive data science is done. There may be a point at the end of a project where I can build a “final workflow”, and on that workflow doing all the joins before any of the filters may be appropriate. During my daily process of working, on the other hand, I do a lot of exploring and investigation, and I don’t always know exactly what all the data prep will be before I do it. In other words, during interactive data science, I will almost certainly be wanting to filter to the appropriate number of cities multiple times, for which the filtering join node is ideal. (And of course, the filtering join is still a very appropriate method for one single join on a “final workflow”).
For the second example, let me start off at a slightly higher level. One part of my job often involves hunting down data quality issues, and figuring out how to deal with them. This often means doing some exploratory data analysis to find problematic groups of records. This exploratory analysis frequently leads me to a situation where I have a single dataset with a bunch of “bad records” (at precisely the point where I finally figure out what makes that group of records “bad”). Once all that’s happened, the most common solution is just to exclude that small set of bad records from the data and continue. Even in cases where I ultimately want to do something other than just excluding the records, it’s still common in the early phases to exclude them to keep things moving quickly, and circle back for a better solution later on. (In simpler terms, this is a great feature for prototyping.)
Well, if I have just created a dataset full of records that I want to exclude from my original dataset, then the exact functionality I want is that of the filtering join node (specifically the anti-join feature). It’s just a single step to go from the exploration I was just doing to pushing aside the problematic records and getting back to the primary work I want to focus on.
For a more specific example, let me return to the same project I was discussing above, where the goal was to find the relationship between trade spend and end sales. Even though we were working at a city level, many of the datasets we had were at a lower level (the level of individual stores), and we had to roll that data up to the city level. In a few cases, we noticed that there were problems with the number of records after the roll-up, and went to do some investigation. The problem turned out to be a handful of problematic stores which were accidentally considered to be part of two different cities at the same time.
Of course, this problem led to some discussion with the data-owner to figure out why such a thing might happen, and to see if we could figure out which city each store actually belonged to. However, it turned out that there were relatively few of these stores, and the solution we chose was just to exclude all of these bad stores from the dataset and proceed with the roll up and model building. Since we had just done the investigation, we already had in our project a dataset with only the “problematic” stores. So, we threw on a filtering join, and got back to business.
Exploratory Data Science: What stores are causing problems?
Back to Business: Wrap up the exploration and move forward
Even more importantly, if we ever need to do a similar task again within the same project, we don’t need to re-do the part of the workflow where we search for bad stores, we can just import that “BadStores” dataset to another workflow, and use a filtering join in one step to exclude the problematic records. Bonus!
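That “exclude the bad records and move on” step is exactly an anti-join. A minimal pandas sketch of the pattern, with made-up store IDs and column names:

```python
# Sketch of excluding a "bad records" dataset via anti-join.
# Store IDs and columns are hypothetical.
import pandas as pd

stores = pd.DataFrame({
    "StoreID": [1, 2, 3, 4, 5],
    "City": ["Toronto", "Toronto", "Boston", "Boston", "Chicago"],
})

# The dataset produced by the exploratory work: just the bad stores.
bad_stores = pd.DataFrame({"StoreID": [2, 4]})

# Anti-join: drop every store that appears in the bad-store dataset.
clean = stores[~stores["StoreID"].isin(bad_stores["StoreID"])]
print(clean["StoreID"].tolist())  # [1, 3, 5]
```

And because `bad_stores` is its own dataset, the same one-step exclusion can be reused anywhere else in the project.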
So, there are some examples of where I use filtering join to great effect in my daily life. Hopefully you found that useful, and it inspired you to use the filtering join in a new way, or maybe even to learn about it and start using it in the first place. In either case, I hope you find it as invaluable as I do.
Additionally, I’ve created a small data prep exercise (which I used filtering join to help me complete, of course). Below, you can read about that exercise, get the necessary datasets to try it yourself, see my explanation of how I solved it, and even download a sample project file to play with.
Do you love filtering join too? Do you have examples where it saves you a lot of time that you want to share? I’d love to hear about them! Let me know at firstname.lastname@example.org
Thanks for reading!