This article talks about the very basics of writing SQL queries, but we assume that you have already read the formal description of the structure of SQL queries somewhere. It is assumed that you know what the query “SELECT * FROM Students” returns, but that you do not feel very confident when you need to write a SQL query that receives data from multiple tables using multiple INNER JOINs or LEFT JOINs. Let's take a look at how SQL works in practice.
When looking at the text of SQL queries, you may find yourself wondering, “Why was SQL invented at all? What problem does it solve? Maybe this would have been easier without SQL?” Of course, SQL is designed to save program data to and read from a file. To save data to a file, the program must be able to insert a new chunk of data at the end of the file, or to update existing data in the middle of the file. Moreover, it is important that the program doing this can be multithreaded. How, then, can we write data to a file so that when the program closes unexpectedly, the half-saved record is marked as invalid?
When reading data from a file, the program usually wants to read only part of the data. To do this, it must fetch data from different places in the file, and then use filtering algorithms to return only the necessary entries, and it must do all of this as fast as possible. All this is achieved by rather complex algorithms, but the SQL language allows us to declaratively describe read operations, eliminating painstaking work and implementing fast filtering algorithms. Without SQL, programs would be much more complex.
SQL itself consists of four statements: SELECT, DELETE, UPDATE, and INSERT.
INSERT is easy enough to understand and I'll touch on it at the end of the article. The SELECT, DELETE, and UPDATE statements share a common idea and a very similar structure. The SELECT statement has richer capabilities and we can say that if you understand how SELECT works, you will understand how DELETE, UPDATE, andINSERT work. This is an interesting point: in order to understand how SQL queries work, you need to understand only the use of the SELECT statement. Let's take a closer look at it in practice.
Suppose we have 2 tables with numbers.
Table A with column X
Table B with column Y
What will the SQL query
SELECT * FROM A, B return to us? Any query with a SELECT statement will return data in only one table. The answer is hidden under the “View Answer” button. Try to write the answer on a piece of paper, then check the result and look at the explanation below.
The result of the query will be this table:
Why is the result exactly like this? Since we requested data from 2 tables, SQL must somehow combine them. It could print the rows of Table A first and then the rows of Table B, for example:
But let's remember that, usually, different tables store different types of records. For example, let’s say that table A stores information about students, and table B stores the results of their exams. It will be of little use if the result of the query is one table, in which first there are rows with students and then there are rows with the results of their exams. There are other cases demonstrating that the simple union of two tables does not give much benefit
(Besides, the table of students and the table of exams will probably have different sets of columns and we will get a mess of data. Another argument demonstrating that there is no need for a sequential concatenation of tables: what if table A stores current students and table B stores graduated students? It would be nice to get a list in which there will first be students and then graduates. In this case, this is a design error. It would be more correct to use only one table and to include a boolean field “IsStudentStudying.” If you store similar data in different tables, you will need to move records between them, while the IDs of students will change. Each object/data type of the program must be stored in its own separate and unique table)
However, it would be more useful if you could get a table containing information concatenated from two tables, such that each row of the new table contains both the student’s name and the result of his exam.. The columns of this new table would be the student's first name, the student's surname, the name of the subject, the score obtained, and the date of the exam.
Linking rows from different tables and filtering them like this is the central idea of SQL. To get this result, we need to indicate that the student table is the main one, and then somehow tell SQL how exactly the rows of the student table and the rows of the exam results table are related. SELECT queries are a means to get a new table, the rows of which are composed of related rows from different tables. In our simple query, we did not indicate which table is the main one and did not indicate how the data in the different tables is related. Therefore, SQL assumes that every record in table A is associated with every record in table B, and simply multiplies sets of records. The resulting table is called the Cartesian product. This result is the basis for linking data from different tables. It is worth remembering this result, and always thinking of it when you compose a SQL query.
For example, if we want to to combine our tables A and B to get this result:
Then, having imagined in our head the Cartesian product of our original tables, we can write
SELECT * FROM A, B WHERE B.y - A.x = 3
When executing this query, SQL will first build the Cartesian product, then perform the WHERE operation for each row, leaving only the lines of interest to us where B.y - A.x = 3
The Cartesian product is always built. Let's consider 2 examples:
When executing the query
SELECT A.x FROM A, B WHERE B.y - A.x = 3 SQL will build the same Cartesian product in memory, but will return only 1 requested column.
It also becomes clear that
SELECT A.x FROM A, B WHERE A.x = 2
Our Cartesian product works with JOIN too. For example, the query
SELECT * FROM Table1 T1 INNER JOIN Table2 T2 ON T1.ID = T2.ID
Is absolutely identical to the query
SELECT * FROM Table1 T1, Table2 T2 WHERE T1.ID = T2.ID
By the way, the word INNER in “INNER JOIN” can be omitted, and SQL will by default understand that this is an INNER JOIN.
But why do we need an INNER JOIN if the same thing can be written using WHERE? In the INNER JOIN, the binding condition T1.ID = T2.ID stands alone. This lets us easily see what the binding condition is.. When we look at the query that uses WHERE, we will need to understand what the expression T1.ID = T2.ID is: a binding condition or an expression for filtering data. Also, when multiple tables are involved in a join, the JOIN syntax will be much clearer.
If we redo the query with INNER JOIN for our tables A and B:
SELECT * FROM A INNER JOIN B ON A.x = B.y
Then the result will be empty - there will be no rows in the Cartesian product with the same values of columns x and y.
You cannot specify a main table with INNER JOIN, but LEFT OUTER JOIN allows this to be done. The word OUTER can be omitted, and SQL will interpret this as a LEFT OUTER JOIN by default. The left table will be the main one. This means that when you link rows from the left (main) table, the records will always end up in the resulting table.
Let's see this in practice. Let's create 2 tables.
SELECT Students.Name, ExamResults.Subject, ExamResults.Score FROM Students LEFT JOIN ExamResults ON Students.Id = ExamResults.StudentId
SQL constructs exactly the same Cartesian product here as for the query that used INNER JOIN. Further, all records for which Students.Id is not equal to ExamResults.StudentId are eliminated. However, since the Students table is the main one (as we are using LEFT JOIN), SQL will add a row for the student with Id 2 (Maxim) when linking. For this student, all columns from the ExamResults table will have NULL values.
LEFT JOIN differs from INNER JOIN in that all rows from the left (main) table are present in the result.
If we remove the word LEFT in the previous query, then we get an INNER JOIN:
SELECT Students.Name, ExamResults.Subject, ExamResults.Score FROM Students JOIN ExamResults ON Students.Id = ExamResults.StudentId
The result will be the same, except that there will be no line with Maxim from the main table.
If in the binding condition for LEFT JOIN we put a condition that will not be fulfilled for any row:
SELECT * FROM Students LEFT JOIN ExamResults ON 1 = 2
Then we get the following result:
Of course, here SQL built a Cartesian product, filtered out all rows, and then since none of the conditions were met, it then added the missing rows (all rows) from the main table to the result.
If we want to get the table:
Then we need INNER JOIN, for example:
SELECT Students.Name, ExamResults.Subject, ExamResults.Score FROM Students JOIN ExamResults ON Students.Id = ExamResults.StudentId WHERE Students.Id = 1
In SQL, there is also a RIGHT OUTER JOIN. This is the same as a LEFT JOIN, except the main table is on the right. RIGHT JOIN is rarely encountered in practice; this is probably due to the fact that programmers who write from left to right choose LEFT JOIN.
There is also an even more rarely used FULL JOIN in SQL. Both tables in FULL JOIN are the main ones. For tables A and B from our first example:
Table A with column X
Table B with column Y
SELECT * FROM A FULL JOIN B ON A.x = B.y
Constructs the Cartesian product:
FULL JOIN can be implemented independently using RIGHT JOIN and LEFT JOIN.
Let's solve a problem for which SQL is not designed:
You have two db tables A and B. Each table has a single column C with type integer. Write a SQL query that selects the difference between both datasets and creates the difference set with a single column.
For example: A.C = [1, 2, 3, 4]; B.C = [1, 4, 5]; Result D.C = [2, 3, 5]
In this problem, we are asked to combine data from different tables and delete data that is found in both tables.
There are many ways to accomplish this. I'll give you a few.
Method 1. A FULL JOIN is very suitable for our task, probably because this task is not typical for SQL.
The Cartesian product for the query:
SELECT * FROM A FULL JOIN B ON A.C = B.C
on our data would look like this:
From the result, we now need to cut off rows 1 and 4 and combine non-empty values into 1 column A.C and B.C.
This can be done with a query like:
SELECT COALESCE (A.C, B.C) FROM A FULL JOIN B ON A.c = B.c WHERE A.c IS NULL OR B.c IS NULL
MS SQL Server has the COALESCE function, which returns the first non-nullable value. For example, COALESCE (null, null, 45, null, null 36, null) will return 45. There is a similar function in many other databases.
A variation of this method: instead of using FULL JOIN, you can use a combination of RIGHT JOIN and LEFT JOIN.
Method 2. With LEFT JOIN, we can find numbers present only in table A, for example:
SELECT * FROM A LEFT JOIN B ON A.c = B.c WHERE B.c IS NULL
This gets the Cartesian product, and then filters it to give you the result. Next, let’s write a mirror query to find rows from table B, and combine the results together:
SELECT * FROM A LEFT JOIN B ON A.c = B.c WHERE B.c IS NULL UNION ALL SELECT * FROM B LEFT JOIN A ON A.c = B.c WHERE A.C IS NULL
Method 3 and Method 4. We can use UNION as in the previous approach and use the predicates EXISTS or NOT IN.
If you ever forget how to do SQL joins, just do an image search for “SQL JOIN”, and you should be able to easily find diagrams like this one:
Let’s consider how to join multiple tables. Let’s say we have these tables:
Let’s say we want to get a table containing the scores of all students who sat the math exam, and also containing the names of all students who didn’t sit the math exam at all.
To do this, you need to write a query that joins 3 tables. First, write a query that will combine 2 tables and return the result:
SELECT st.Name, er.Score FROM Students st LEFT JOIN (SELECT * FROM ExamResults WHERE SubjectId=1) er ON st.Id = er.StudentId
Now we need to display the name of the item, which is stored in the third table. Mentally, you can imagine that:
Students st LEFT JOIN (SELECT * FROM ExamResults WHERE SubjectId=1) er ON st.Id = er.StudentId
is one table which we will call “A”, and we need to join it with the Subjects table. If we use “A” JOIN Subjects sb ON er.SubjectId = sb.Id, then in the returned result the lines where er.SubjectId are null will have disappeared. Therefore, we use LEFT JOIN:
SELECT st.Name, er.Score FROM Students st LEFT JOIN (SELECT * FROM ExamResults WHERE SubjectId=1) er ON st.Id = er.StudentId LEFT JOIN Subjects sb ON er.SubjectId = sb.Id
If we need to join the fourth table, then:
Students st LEFT JOIN (SELECT * FROM ExamResults WHERE SubjectId=1) er ON st.Id = er.StudentId LEFT JOIN Subjects sb ON er.SubjectId = sb.Id
We will mentally imagine this as one table. We are now done with the SELECT overview.
UPDATE and DELETE statements are very similar to SELECT statements - so much so, that before executing a DELETE or UPDATE statement, you can easily replace its header with a SELECT, run it, and see which rows are displayed. This way you can check what will be removed or updated.
The INSERT operator inserts the data only in one table and may take several forms.
To insert one line of data:
INSERT INTO table_name ([column_name1], [column_name2], [column_name3]... ) VALUES (value_or_expression1, value_or_expression2, value_or_expression3 ...)
You can also omit the column names:
INSERT INTO table_name VALUES (value_or_expression1, value_or_expression2, value_or_expression3 ...)
There are special functions to find out the ID of the last inserted record in different SQL databases. In MSSQL, this is SCOPE_IDENTITY (). You can run INSERT and then launch SELECT SCOPE_IDENTITY ();
You can insert multiple lines at once. For example, to insert 3 rows in the table, write:
INSERT INTO table_name VALUES (value_or_expression1, value_or_expression2, value_or_expression3 ...), (value_or_expression1, value_or_expression2, value_or_expression3 ...), (value_or_expression1, value_or_expression2, value_or_expression3 ...)
You can even insert one or several records obtained by the SELECT statement:
INSERT INTO table_name SELECT column_name, ... FROM table_name
If you need to insert data into several tables, you can use the construction:
BEGIN TRANSACTION; INSERT INTO table1 VALUES ('1', '2', '3'); INSERT INTO table2 VALUES ('bob', 'smith'); COMMIT TRANSACTION;
If the insertion adds rows into the main and child tables at the same time, then you need to use the functions that return the ID of the inserted record. Returning to our Students example from earlier, using MSSQL, we could add a student together with their exam scores:
DECLARE @StudentId int; BEGIN TRANSACTION; INSERT INTO Students ([Name], [Birthday]) VALUES ('Lisa', '2002-10-29'); SET @StudentId = SCOPE_IDENTITY (); -- Student Id Lisa INSERT INTO ExamResults ([StudentId], [Subject], [Date], [Score]) VALUES (@StudentId, 'Geometry', '2021-02-20', 5); COMMIT TRANSACTION;
To consolidate your understanding of SQL, I recommend that you look into several educational problems on writing SQL queries yourself. These can be from any resources (for example, the free site http://sql-ex.ru), but any will do. If you have the ability to practice writing SQL queries at work, then that is even better.
After you learn how to write SQL queries with ease, you can move on to learning how to optimize them. To do this, you can watch videos on youtube about how your database works with the query execution plan, how you can view the plan, and how to understand it and find the bottleneck(s). In the future, you may encounter deadlocks in the database and you will need to figure out how to work with them. Good luck with learning SQL, and practice!