From the course: Learning Data Analytics Part 2: Extending and Applying Core Knowledge

Building queries with Joins

- [Instructor] The way we store data from a design perspective and the way we structure it with queries are different. For example, we normalize data for storage and optimization when we design databases. For reporting, we typically de-normalize it, meaning we bring that data back together into a single dataset. You've already discovered how important joins are, and having some understanding of how databases are designed will help you be a better data analyst. The library has a lot of options on database design and relational concepts.

We need to determine what orders have been invoiced, and because of the way the data is stored, we can't get this answer from a single table. We need both the orders and the invoices combined. This company has also recently moved to a new system, and there's a little bit of concern that some of the historical data hasn't been carried over into the new system. We need all of the data to achieve the results that we're trying to (indistinct) customers, so we might have to do a little bit of investigating on our own using joins.

It's important to know the workflow. At this company, the customer places an order and then the company invoices it. Because we understand that workflow, we can easily look at the records to tell what's ordered and what's invoiced by joining the two tables.

Let's take a look at orders. So I'll double-click orders, and I can see all of the orders. There are 7500 of them, but this doesn't tell me whether each one has been invoiced or not. I'll go ahead and open up invoices. Now, invoices will tell me if it's been invoiced, and I can see the associated order ID. That means if I join these two tables together, and I have an invoice with an order ID, then I've completed the process.

Okay, let's go ahead and choose create, and choose query design. I'll go ahead and bring in my orders and my invoices. I'll join orders to invoices based on the order ID.
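The join on order ID described here can also be sketched outside of Access. The snippet below is a minimal illustration using Python's built-in `sqlite3` module, not the course's own method. The table names `orders` and `invoices`, the column names `OrderID` and `InvoiceID`, and the sample rows are all assumptions made for the example. It shows that a plain inner join returns only the orders that have a matching invoice.

```python
import sqlite3

# Hypothetical miniature versions of the orders and invoices tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (OrderID INTEGER PRIMARY KEY);
    CREATE TABLE invoices (InvoiceID INTEGER PRIMARY KEY, OrderID INTEGER);
    INSERT INTO orders (OrderID) VALUES (1), (2), (3);
    INSERT INTO invoices (InvoiceID, OrderID) VALUES (101, 1), (102, 2);
""")

# Inner join: only orders that have a matching invoice come back.
inner_rows = conn.execute("""
    SELECT o.OrderID, i.InvoiceID
    FROM orders AS o
    INNER JOIN invoices AS i ON o.OrderID = i.OrderID
    ORDER BY o.OrderID
""").fetchall()

print(inner_rows)
```

In this sample, order 3 has no invoice, so it does not appear in the result at all. That is why an inner join alone cannot show which orders are still waiting to be invoiced.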
And we know that by default the join type is an inner join. I'll go ahead and double-click that join line. Now what I want to see is, are there orders that haven't been invoiced yet? So looking at my three join options, I'll probably choose number two first: include all records from orders and only those records from invoices where the join fields are equal. I'll go ahead and click OK.

Okay, I'll double-click my order ID and I'll double-click my invoice ID. These two fields together tell me if it's been ordered and if it's been invoiced. Okay, so I'll go ahead and run my query. And immediately, I see a null value. So I have 7500 records, and of the 7500, not all of them have an invoice ID, and that's okay, because maybe they were ordered and haven't been invoiced yet. Let's go ahead and save this as OrdersNotInvoiced.

Okay, now it's important that we do see an invoice ID, so I want to go ahead and flip over and use the null value to my benefit. I'll go to the design view and, in the criteria for invoice ID, I'll enter Is Null, and then I'll run it again. Now I see only the orders that don't have invoices associated with them. I'll go ahead and save my query there.

Now we need to explore this data set like an investigator. Remember, the concern is that not all of the historical data transferred over. Based on the rules, an order produces an invoice, which means every invoice record should have an order associated with it. So it'll be interesting to see if we have invoices with no associated orders. This will tell us if we're missing data.

Okay, so I'm going to go ahead and create a new query. I'll bring in the same tables. So we should have an invoice with an order ID, period. We should have no invoices without associated orders. So I'll go ahead and join order ID to order ID. I'll go ahead and adjust my join type. I want to see all records from invoices and only those records from orders where the join fields are equal.
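As a rough equivalent of the OrdersNotInvoiced query, join option two corresponds to a left outer join from orders to invoices, and the Is Null criterion corresponds to a `WHERE ... IS NULL` filter. The sketch below uses Python's built-in `sqlite3` module. The helper `orders_with_invoice_status`, the column names, and the sample rows are hypothetical and exist only for illustration.

```python
import sqlite3

# Hypothetical sample data: orders 2 and 4 have not been invoiced.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (OrderID INTEGER PRIMARY KEY);
    CREATE TABLE invoices (InvoiceID INTEGER PRIMARY KEY, OrderID INTEGER);
    INSERT INTO orders (OrderID) VALUES (1), (2), (3), (4);
    INSERT INTO invoices (InvoiceID, OrderID) VALUES (101, 1), (102, 3);
""")

def orders_with_invoice_status(conn, only_missing=False):
    """Return (OrderID, InvoiceID) for every order; InvoiceID is None if absent."""
    sql = """
        SELECT o.OrderID, i.InvoiceID
        FROM orders AS o
        LEFT JOIN invoices AS i ON o.OrderID = i.OrderID
    """
    if only_missing:
        # Same idea as entering Is Null in the invoice ID criteria.
        sql += " WHERE i.InvoiceID IS NULL"
    sql += " ORDER BY o.OrderID"
    return conn.execute(sql).fetchall()

all_orders = orders_with_invoice_status(conn)
not_invoiced = orders_with_invoice_status(conn, only_missing=True)

print(all_orders)
print(not_invoiced)
```

The first call keeps every order and leaves the invoice ID empty where no invoice exists. The second call keeps only those empty rows, which is the list of orders still to be invoiced.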
Remember, I should have no invoices without an order record. I'll go ahead and click OK. I'll go ahead and double-click invoice ID and order ID. Now if I run this query and scroll down, I'm just spot-checking at this point. I want to see, are there any invoices that do not have an order ID? I could sort, or I could try a number filter, but I'm just going to keep scrolling. Again, I never not look. Oh. As I start pulling down, I notice I have invoices without an order ID.

Now I know this is a problem because, again, it's a break in the process, right? You should have an order. It's the order that produces an invoice, not the other way around. Let me go to the design view. I'm going to enter Is Null in the criteria for order ID, and I'll run it again. And I notice that I have 6422 invoices without an associated order ID. This is definitely a problem. I'll go ahead and save my query. We'll call this InvoicesWithoutOrders. Go ahead and click OK.

Because we know we have an issue, we really can't move forward until it's corrected. Remember, I need all of this information to achieve the original ask, which is trying to get to a list of top customers. We need to return to the source data to determine what was missed. The process of returning to the source data can be vastly different across organizations. It might be as simple as not all of the data we needed being provided, which is a quick, easy fix. Or, worst case, the data is lost and not retrievable, but at least we know about it. This is what makes for a great data analyst. You're analyzing data at every turn, not just at the end.
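The InvoicesWithoutOrders check reverses the direction of the join: keep every invoice, then look for rows where no order matched. Here is a minimal sketch with Python's built-in `sqlite3` module. The table and column names and the sample rows are assumptions for illustration, and the counts have nothing to do with the 6422 records found in the course data.

```python
import sqlite3

# Hypothetical sample data: invoices 103 and 104 point at orders
# that are not present in the orders table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (OrderID INTEGER PRIMARY KEY);
    CREATE TABLE invoices (InvoiceID INTEGER PRIMARY KEY, OrderID INTEGER);
    INSERT INTO orders (OrderID) VALUES (1), (2);
    INSERT INTO invoices (InvoiceID, OrderID)
        VALUES (101, 1), (102, 2), (103, 7), (104, 8);
""")

# Keep all invoices; the order side is NULL when no order record matches.
orphan_invoices = conn.execute("""
    SELECT i.InvoiceID,
           i.OrderID AS InvoiceOrderID,
           o.OrderID AS MatchedOrderID
    FROM invoices AS i
    LEFT JOIN orders AS o ON i.OrderID = o.OrderID
    WHERE o.OrderID IS NULL
    ORDER BY i.InvoiceID
""").fetchall()

orphan_count = len(orphan_invoices)

print(orphan_invoices)
print(orphan_count)
```

A count above zero signals a break in the order-then-invoice workflow, which is the cue to go back to the source data before building anything on top of these tables.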
