JOIN
A SELECT statement with the JOIN clause tells the database server to stitch together two tables that share a column.
Let's walk through an example to see why we would want to join tables. Consider the following table called stats:
| player_id | team | year | home_runs |
|---|---|---|---|
| 1 | Angels | 2015 | 40 |
| 1 | Angels | 2016 | 32 |
| 1 | Angels | 2017 | 51 |
| 2 | Astros | 2015 | 3 |
| 2 | Rangers | 2016 | 5 |
Each row in the stats table contains statistics of a baseball player in some year. For example, the player with ID 1 hit 40 home runs for the Angels in year 2015.
If you want to know the name of the player with ID 1, then you can look in the players table:
| id | given_name | family_name |
|---|---|---|
| 1 | Mike | Trout |
| 2 | Max | Scherzer |
Each row in the players table contains the ID and the name of a baseball player. This table shows that the name of the player with ID 1 is Mike Trout.
Next, take a look at the following row:
[ 2, Rangers, 2016, 5 ]
This is the last row in the stats table. This row says that a player with ID 2 hit 5 home runs for the Rangers in 2016. And which player has an ID of 2? To find out, we again look in the players table. Here is the second row:
[2, "Max", "Scherzer" ]
This row tells us that the player with ID 2 is Max Scherzer.
As we can see, in order to retrieve the name of a player with a particular set of statistics, we need to look in both the players and the stats tables. But if we add the players' names to the stats table, then we can see everything in one table.
To add names to the stats table, we retrieve the player ID from each row of the stats table and look for the same ID in the players table. We start with the first row of the stats table, which is:
[1, 'Angels', 2015, 40]
The player ID is 1, so we look in the players table for the row with ID 1. Here is the first row:
[1, "Mike", "Trout" ]
We now add the given_name and family_name to the first row of the stats table as follows:
[1, 'Angels', 2015, 40, "Mike", "Trout"]
We are now finished with the first row. Next, we repeat the process with the second row of the stats table, which is:
[1, 'Angels', 2016, 32]
The ID is 1 again, so we add Mike Trout as follows:
[1, 'Angels', 2016, 32, "Mike", "Trout"]
Then we repeat the process for the 3rd row, which is:
[1, 'Angels', 2017, 51]
Once again the ID is 1, so we add Mike Trout as follows:
[1, 'Angels', 2017, 51, "Mike", "Trout"]
The fourth row is:
[2, 'Astros', 2015, 3]
The player ID is 2, so let's look in the players table for the player with ID 2. Here is the row:
[2, "Max", "Scherzer" ]
As before, we add the given_name and the family_name to the stats table. Here is the new row:
[2, 'Astros', 2015, 3, "Max", "Scherzer"]
Finally, the fifth row is:
[2, 'Rangers', 2016, 5 ]
The ID is 2, so we add Max Scherzer as follows:
[2, 'Rangers', 2016, 5, "Max", "Scherzer"],
We have joined the stats table and the players table by matching the player_id column of the stats table and id column of the players table. Here is the joined table:
[
[1, 'Angels', 2015, 40, "Mike", "Trout"],
[1, 'Angels', 2016, 32, "Mike", "Trout"],
[1, 'Angels', 2017, "Mike", "Trout"],
[2, 'Astros', 2015, 3, "Max", "Scherzer"],
[2, 'Rangers', 2016, 5, "Max", "Scherzer"],
]
The JOIN Query
To join two tables, serve_db needs to know the table names and the columns to join on. The format of the database query with the JOIN clause is:
SELECT < columns > FROM < table1 > JOIN < table2 > ON < join_condition >
- table1 and table2 are the names of the two tables to join.
- join_condition is a Boolean expression that contains the names of the columns to join on.
Here is an example query:
SELECT players.id,players.family_name,stats.team,stats.home_runs FROM players JOIN stats ON players.id=stats.player_id
The two tables may contain the same column names, so we prepend the table name to the column name. Here is the name format:
< table >.< column >
For example, players.id is id column of the players table, and stats.team is the team column of the stats table.
The example query joins the players table and the stats table. The rows of the two tables will be attached when the id column of the row from the players table is equal to the player_id column of the row from the stats table.
Next, let's see how serve_db parses the query. As before, serve_db first parses the selected columns. serve_db converts players.id to the following map:
- type: attribute
- id: players
- attr: id
After parsing all the columns, serve_db parses the JOIN clause. The JOIN clause is:
stats ON players.id=stats.player_id
serve_db converts the JOIN clause to the following map:
- join:
- table: stats
- left:
- type: attribute
- id: players
- attr: id
- right:
- type: attribute
- id: stats
- attr: player_id
The name of the second table is stored in Query Map -> join -> table. The JOIN condition is a binary operation. serve_db stores the left operand in Query Map -> join -> left and the right operand in Query map -> join -> right.
Here is the full Query Map for our example query:
- select:
- Index: 0
- type: 'attribute'
- id: players
- attr: id
- Index: 1
- type: 'attribute'
- id: players
- attr: family_name
- Index: 2
- type: 'attribute'
- id: stats
- attr: team
- Index: 3
- type: 'attribute'
- id: stats
- attr: home_runs
- Index: 0
- where: null
- table: players
- group by: null
- join:
- table: stats
- left:
- type: attribute
- id: players
- attr: id
- right:
- type: attribute
- id: stats
- attr: player_id
serve_db is now ready to execute the query.
Query Execution
If Query Map -> join is not null, then the database server performs the JOIN operation first. serve_db first retrieves the two tables players and stats from the database. Then it joins the column names of the two tables.
The colum names of the players table are:
['id', 'given_name', 'family_name']
serve_db first prepends 'players' to the columns as follows:
['players.id', 'players.given_name', 'players.family_name']
The column names of the stats table are:
['player_id', 'team', 'year', 'home_runs']
serve_db prepends 'stats' to the columns as follows:
['stats.player_id', 'stats.team', 'stats.year', 'stats.home_runs'],
Then serve_db joins the two sets of columns as follows:
['players.id', 'players.given_name', 'players.family_name', 'stats.player_id', 'stats.team', 'stats.year', 'stats.home_runs']
Next, serve_db joins the two tables.
For each row in the first table, serve_db cycle through all the rows in the second table. For each row combination, serve_db checks the JOIN condition (players.id=stats.player_id). If the condition is true, then serve_db appends the row combination to the output.
players.id is the first column in the players table, and stats.player_id is the first column in the stats table. Thus, serve_db compares the first column of the row from the players table against the first column of the row from the stats table.
The first row of the players table is:
[1, "Mike", "Trout"]
players.id is 1.
The first row of the stats table is:
[1, 'Angels', 2015, 40]
stats.player_id is also 1. Thus, the IDs are equal and the JOIN condition is true. Thus, serve_db joins the two rows as follows and appends it to the output:
[1, "Mike", "Trout", 1, 'Angels', 2015, 40]
The second row of the stats table is:
[1, 'Angels', 2016, 32]
The player_id column is again 1. Thus, serve_db joins the first row of the players table with this row as follows:
[1, "Mike", "Trout", 1, 'Angels', 2016, 32]
The player_id column of the third row of the stats table is also 1. Thus, serve_db appends the following row to the output:
[1, "Mike", "Trout", 1, 'Angels', 2017, 51]
The fourth row of the stats table is:
[2, 'Astros', 2015, 3]
The player_id in this row is 2, so the JOIN condition is false. Thus, serve_db skips this row.
The player_id in the last row is also 2, so serve_db skips the last row.
So far, the output of the JOIN operation is:
[
[1, "Mike", "Trout", 1, 'Angels', 2015, 40],
[1, "Mike", "Trout", 1, 'Angels', 2016, 32],
[1, "Mike", "Trout", 1, 'Angels', 2017, 51],
]
serve_db now moves on to the second row of the players table:
[2, "Max", "Scherzer"]
The id is now 2. serve_db once again cycles through the rows of the stats table.
The first 3 rows are:
[1, 'Angels', 2015, 40]
[1, 'Angels', 2016, 32]
[1, 'Angels', 2017, 51]
The player_id in these rows are 1, which does not match the value of players.id (2). Thus, serve_db skips these rows.
The fourth row is:
[2, 'Astros', 2015, 3]
player_id is 2, so the second row of the players table is joined with this row as follows:
[2, 'Max', 'Scherzer', 2, 'Astros', 2015, 3]
The last row of the stats table is:
[2, 'Rangers', 2016, 5]
The player_id is again 2, so the second row of the players table is joined with this row as follows:
[2, 'Max', 'Scherzer', 2, 'Rangers', 2016, 5]
serve_db has now attempted to join every row in the players table with every row in the stats table. The final output is as follows:
[
[1, "Mike", "Trout", 1, 'Angels', 2015, 40],
[1, "Mike", "Trout", 1, 'Angels', 2016, 32],
[1, "Mike", "Trout", 1, 'Angels', 2017, 51],
[2, "Max", "Scherzer",2, 'Rangers', 2016, 5],
[2, "Max", "Scherzer", 2, 'Astros', 2015, 3]
]
As a reminder, the column names are:
['players.id', 'players.given_name', 'players.family_name', 'stats.player_id', 'stats.team', 'stats.year', 'stats.home_runs']
The JOIN operation is now complete. Next, serve_db iterates over the output and selects the columns that we want, which are:
players.id,players.family_name,stats.team,stats.home_runs
serve_db starts with the first row, which is:
[1, "Mike", "Trout", 1, 'Angels', 2015, 40],
players.id is the first column, players.family_name is the 3rd column, stats.team is the 5th column, and stats.home_runs is the 6th column. After serve_db filters out the other columns, the result is:
[1, "Trout", 'Angels', 40]
serve_db repeats this process for every row in the output. Here is the final result:
[
[1, "Trout", 'Angels', 40],
[1, "Trout", 'Angels', 32],
[1, "Trout", 'Angels', 51],
[2, Scherzer",'Rangers', 5],
[2, Scherzer", 'Astros', 3]
]
serve_db has completed execution of the JOIN query.
Try joining tables in TERMINAL_A (start serve_db in TERMINAL_B first):
Suppose that the profiles table has a column called address. What is the new column name after we prepend the table name to the column name?
Consider the following table called students:
| id | given_name | family_name |
|---|---|---|
| 451834 | Mary | Smith |
| 134922 | Emma | Lin |
| 652331 | Lucas | Smalls |
| 820980 | Billy | Homes |
And another table called enrollments:
| course_id | student_id | year | grade |
|---|---|---|---|
| CS101 | 451834 | 2015 | A |
| CS101 | 134922 | 2016 | B |
| MATH152 | 652331 | 2017 | A |
| MATH152 | 134922 | 2017 | C |
| MATH152 | 820980 | 2016 | A |
Suppose that we join the two tables by the student ID. How many rows will the joined table have?
Please write the query that joins the students table with the enrollments table by the student ID and produces the following output:
| given_name | family_name | course_id | grade |
|---|---|---|---|
| Mary | Smith | CS101 | A |
| Emma | Lin | CS101 | B |
| Emma | Lin | MATH152 | C |
| Lucas | Smalls | MATH152 | A |
| Billy | Homes | MATH152 | A |
Consider the following table called orders:
| order_id | customer_id | item |
|---|---|---|
| 14083 | 2902 | soap |
| 38515 | 6351 | shampoo |
| 17124 | 2902 | towel |
And another table called customers:
| id | given_name | family_name |
|---|---|---|
| 2902 | Mary | Smith |
| 6351 | Emma | Lin |
What is the output of the following query:
SELECT orders.item,customers.given_name,customers.family_name FROM orders JOIN customers on orders.customer_id=customers.id
Hint: the answer is a list of lists. Also, please surround text columns with quotation marks.
Comments
Please log in to add comments