From the course: Hands-On PostgreSQL Project: Spatial Data Science

Data exploration with SQL

- [Instructor] Geospatial data science is exciting and generates powerful results, but the insights we can get from non-spatial SQL queries are just as powerful, and often the first step to answering key questions for stakeholders. Let's start by answering the following questions. A stakeholder might want to know the scale of BigApple BikeShare company's operation, or in other words, how many stations does BigApple BikeShare company have? How many trips were taken on September 17? What percentage of bike trips used an e-bike? Which bike station had the most starting trips on that day? What's the average length of a bike trip? These are all great questions we can answer quickly with basic SQL queries. If you're feeling a little rusty with SQL, this is a great opportunity to brush up on your skills. Pause the video now and take a few minutes to create your own SQL solutions to these questions. Then I'll walk you through how I approached and solved each one. You can use the file 03_01_workbook.sql under the SQL Script folder in your Codespaces to get started. I'm using DBeaver here to interact with my database, but you can use any IDE that supports Postgres databases. All right, let's start by answering the first question. How many stations does BigApple BikeShare company have? For this, we want to count all the stations in the stations table. Since we know each row in this data is a unique station, we can just count all by using the star notation, select count (*). Let's name this column station_count. And we're pulling this data from the public schema and stations table. Listing the schema here is optional, but it's generally a good practice and especially helpful when you have identical table names in different schemas within the same database. Let's run this query, and we can see that there are 520 unique bike stations operated by BigApple BikeShare company. The second question was, how many trips were taken on September 17? The solution to this question is very similar to the solution for the previous question, except now we are using the trip_data table. We know each row contains a unique trip. Again, we can use count all. Name this column trip_count. From public schema and trip_data table. If we had more than one day of data to work with, we would have to filter the date by using a where clause on the start_time column. And since we inputted this column as a character varying type, we need to cast this column to date format so that it will match our input value of 2024-09-17. End this query with a semicolon. Now we can see that we have just over 12,000 trips that occurred in this dataset. Moving forward in the project, we do not have to use the where clause to filter out the date since the data provided in this course include only bike trips from September 17. Next, let's find out what percentage of the total bike trips used an e-bike. Let's set up the basic structure of this query. Select something from public schema, trip_data. The column we want to interact with here is called bike_type, which has two unique values, e-bike and mechanical. In this query, I'm using a case when statement to count only the rows where the bike type is e-bike. Then I divide this count by the total number of trips and multiplying it by 100 to give us the percentage. Notice that I'm using 100.0 here just to make sure that we have a float type in the return number. Now let's run the query. We see that 64% of all bike trips were taken with an e-bike. To solve question number four, which bike station had the most starting trips on September 17? In the trip_data table, we need to group the data by the start_station_id and count how many trips started at each station. select start_station_id, count(ride_id), and save this column as trip_count, from public schema trip_data, and group by start_station_id. Then we'll sort the results in descending order based on the trip_count. And let's limit to one result to find the station with the highest count. We should see that in the result console, Station 7Y6 had the most amount of starting trip on that day, totalling 131 rides. If we want to see trip counts for all station, we can just eliminate the phrase limit 1. Run this query, and we should see the trip count for all of station IDs in descending order of trip_count column. Finally, let's calculate the average length of a bike trip. To do this, we will subtract the start_time from the end_time for each trip to get the duration from the trip_data table. select end_time - start_time. Let's name this avg_trip_duration, from public schema and trip_data table. Next, we need to calculate the average of those durations, so we use the function avg. Because we imported start_time and end_time as character varying type, we can double-check this by checking the Properties tab of the data that start_time and end_time are indeed character varying type. Because of this, we need to cast both start_time and end_time columns on the fly to timestamp type in order to make the subtraction. And this is how we find out that the average bike trip on that day was 12 and half minutes.

Contents