A Join Gone Wrong
Have you found yourself working with ‘bad’ data, only to learn it is so because of an erroneous data join? Worse, have you shared this data, or resulting analysis, with others, before realizing your mistake and suffered some form of regret and loss of credibility? This mistake is common, which is why we’re writing this article which we hope will prevent similar future mistakes.
In my colleague, Sharon Resheff, and I's previous two posts, The Join Pitfall to Look Out For, and Data Grain: Why Does It Matter, we discussed best practices and pitfalls to avoid when joining two tables as part of your data analysis and visualization work.
In this third installment, we will help you identify when a join has gone wrong so that even if you joined two tables quickly or received data from someone else, you can get ahead of faulty analysis.
Why this is important
As we discussed in The Join Pitfall to Look Out For, if you are joining tables of different grains, you will inadvertently cause aggregations of measures to be incorrect. This in turn can lead to a ‘garbage in garbage out’ situation, meaning that your analysis will be based on erroneous data, leading to inaccurate conclusions and lost credibility.
Getting ahead of / identify an erroneous join
1. Check the number of records before and after the join
If creating an inner join: you should expect to have at most the maximum number of records in any of the tables you joined.
Let’s assume you’re creating an inner-join* of table A, with 1,500 records, and table B, with 350 records. If the resulting joined data has more than 1,500 records, this is a strong indication your join does not have the result you want.
If there was no duplication of records, you should expect to have at most the maximum number of records in any of the tables you joined.
2. Check for duplicate records
A duplicate record is a record that appears more than once in the new table that is created post-join. Example: table A consists of date, category, and sales while table B consists of date, category, manufacturer, and profits. Records in table A may duplicate after an inner join of table A and B when using date and category as joining fields.
3. Examining Duplicate Records
The aggregate values are off compared to what you would expect, likely inflated, due to the record duplication. If you did not proactively check the number of records and/or uniqueness of records post-join, you can still prevent the join mishap by looking at the resulting values of measures like sales or profit from the above example, and comparing that to what you might expect them to be.
Curious? Confused? Reach out and let us know what confounds you in creating joins. Our next blog post will demonstrate how to identify erroneous joins in Alteryx, SQL, and Tableau.