SQL
For Data Science
Matt Deitke
2019
ID Name
1 ‘a’
2 ‘b’
3 ‘c’
4 ‘d’
5 ‘e’
ID Name
4 ‘v’
5 ‘w’
6 ‘x’
7 ‘y’
8 ‘z’
Last updated: August 6, 2019
Acknowledgements
The following is a collection of my lecture notes on the topic of SQL. These notes primarily
follow the Coursera course “SQL for Data Science” from the University of California, Davis
taught by Sadie St. Lawrence,
1
with the exception of my choice of using Python as a running
example for demonstration.
With Google Colab, I have uploaded the code for each section to a Jupyter Notebook,
2
which
will allow anybody to follow along and run the code locally. The perminant URL to these
notes, and others I have written about, can be found on my website mattdeitke.github.io.
If I have made an error assigning credit to somebody’s work, a technical error, or a grammat-
ical error, please email me mdeitke23@gmail.com or Tweet me @MattDeitke. More notes
and information about the author are located on his website at mattdeitke.github.io.
1
http://sadiestlawrence.com/
2
https://colab.research.google.com/drive/1CPQmPIx -r9SnCrf08IiC7PvMp3Ra9W6
i
Contents
1 Introduction 2
1.1 ER Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Connecting a database to Python . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Creating tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Retrieving data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Temporary tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.6 Adding comments to SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.7 Pandas dataframe from SQLite database . . . . . . . . . . . . . . . . . . . . . 8
2 Filtering, sorting, and calculations 10
2.1 WHERE operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 IN, OR, AND, and NOT operators . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Wildcards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 ORDER BY operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5 Math operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.6 Aggregate functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.7 GROUP BY and HAVING command . . . . . . . . . . . . . . . . . . . . . . . 15
2.8 Order of clauses in a SELECT statement . . . . . . . . . . . . . . . . . . . . 15
3 Subqueries and joins 17
3.1 Using subqueries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2 Joining tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.1 Cartesian (cross) joins . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.2 Inner joins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2.3 Aliases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2.4 Self join . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.5 Left join . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.6 Right join . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2.7 Full outer join . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.8 Union . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
ii
CONTENTS iii
4 Modifying and analyzing data 26
4.1 Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.1.1 Concatenations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.1.2 Trimming strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1.3 Substring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1.4 Uppercase and lowercase . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.1.5 Date and time strings . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.2 Case statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.3 Views . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Notation
We will be using the monospace font to denote methods, function calls, and variable names.
Code blocks will have a background. Each Python code block will be in gray with an
occassional output block in white.
Python code block
Output
Each SQL code block will have a yellow background with the output in a white background.
SQL code block
Output
An output block will always follow an SQL or Python code block, so there is no need
to change the style for an output block in each language. We will also be running SQL
commands inside of a Python environment, in which case we will use a Python code block.
1
Chapter 1
Introduction
SQL
1
(Structured Query Language) is a programming language that will allow us to ma-
nipulate large amounts of data. In particular, we can query, insert, update, and modify
data in a database. Unlike most languages, SQL is not used to write applications; instead,
it is used to communicate with databases. The primary goal we have when using SQL is to
read or retrieve data, write dat a, and update data. To carry out these tasks, we use
SQL with a relational database management system, which include some popular ones like
Microsoft SQL Server, MySQL, IBM DB2 Oracle, Sybase ASE, PostgreSQL, SQLite, and
Apache Open Office Base. When using different database management systems, the specific
syntax for a SQL task may vary slightly, but will generally follow the same structure.
Each database can be thought of as a collection of tables. In each table we will have rows
and columns. The columns will store the different variables and the rows will record distinct
observations. As we have gotten more data in the recent years, other types of relational
databases have come up, including NoSQL (Not Only SQL), which gives users a way to
store and retrieve unstructured data without tabular relations.
The main two types of database models include relational models and transactional models.
With relational models, data is neatly stored and organized into a table, allowing the user to
easily look-up and access a particular piece of information. In contrast, transactional models
are much less organized and often need to be translated into a relational model in order to
easily access the information. For example, a doctor scribbling down everything that the
patient says is unstructured and would be added to a transactional model. If somebody
went in and organized the doctors noted into distinct categories, such as symptoms, blood
pressure, temperature, then that data would be stored like a relational model.
1.1 ER Diagrams
Relational models specify three things: entities, attributes, and relationships. An entity
is used to differentiate between observations, which could be a name or an ID. The at-
tributes are the values of the variables for each entity. The relationships of an entity and
attributes can be described using graph-like models. We call the graph-like models Entity-
Relationship (ER) Diagrams. If each entry maps to many other entries, we say the
relationship is one-to-many. Similarily, there exist relationships that are many-to-many and
one-to-one. An example of a one-to-many relationship would be tracking a single customers
entire invoice history.
We use ER diagrams to show how different tables are linked together. ER diagram notations
may vary. The primary notations are the Chen Notation, Crow’s Foot Notation, and UML
Class Diagram Notation, which are shown in Figure 1.1.2.
1
Pronounced “see-quel” or “S-Q-L”
2
1.2. CONNECTING A DATABASE TO PYTHON 3
Figure 1.1.1: ER diagram
Figure 1.1.2: Different types of ER diagram notations
1.2 Connecting a database to Python
In Python, we will be using the relational database management system SQLite with the
module sqlite3. To use the module, we simply can import it.
import sqlite3
We will be using a local database as an example throughout this text, but one could find
many ways to connect to a database on the cloud in practice. To connect to a database,
we will need to specify the path using the connect(path) method. For now, let us call our
database data.db. If there is no file in the path specified, the file will automatically be
created.
conn = sqlite3.connect(‘data.db’)
For developmental purposes, we could also define a database inside of Python memory by
specifying the path to be :memory:. Now, instead of creating a new database for develop-
mental purposes, Python will create and destroy a new database after each session.
conn = sqlite3.connect(‘:memory:’)
4 CHAPTER 1. INTRODUCTION
It is typical to use the variable name conn when connecting to a database. Once we have
connected to the database, we need to create a cursor with the cursor() method, which
will enable us to run SQL commands.
c = conn.cursor()
It is also typical to use c as the variable name when creating a cursor, but it does not
matter as long as the variable name is consistent throughout the script. Now, we have
setup everything necessary in order to run an SQL command, which can be done using the
execute() method from the cursor object.
c.execute(<INSERT SQL COMMAND>)
We will come back to specific commands we can run in our database soon, but for now, we
can end our session by closing the connection to the database with the close() command.
conn.close()
1.3 Creating tables
To start exploring SQL commands, we will begin with an empty database and create a table
named Shoes using the SQL command CREATE TABLE.
CREATE TABLE Shoes (
ID char(12) PRIMARY KEY,
Brand char(10) NOT NULL,
Type char(250) NULL,
Color char(250) NULL,
Price decimal(8,2) NULL,
Desc Varchar(750) NULL
);
Each parameter that we specify inside of the CREATE TABLE method creates a new column.
We use PRIMARY KEY to characterize the entity. We also define the variable types that the
values in the column will take on along with if null values are accepted. The variable type
char must have a fixed character length, while the length of Vachar can differ. We would
use null values if we did not have all the information for a particular observation. Primary
keys can never have a null value.
To add an observation into our table we will use the INSERT INTO command.
INSERT INTO Shoes VALUES (
‘1234567890’,
‘Nike’,
‘Air Jordan’,
‘Red’,
‘450.00’,
NULL
);
Notice that when using the INSERT INTO command, each value must correspond to the
corect column. It may be hard to remember or know the order of the columns, so instead,
we can be more explicit when adding values, by first specifying the column order.
1.4. RETRIEVING DATA 5
INSERT INTO Shoes (
ID,
Brand,
Type,
Color,
Price,
Desc
) VALUES (
‘1234567890’,
‘Nike’,
‘Air Jordan’,
‘Red’,
‘450.00’,
NULL
);
This also gives us the flexibity to insert less values than there are columns.
In order to run SQL commands from a Python interface, we must use the execute(SQL
command) command. We will assume a database has been connected with the variable conn
and a cursor, c, was created from that connection.
c.execute(‘‘‘
CREATE TABLE Shoes (
ID char(12) PRIMARY KEY,
Brand char(10) NOT NULL,
Type char(250) NULL,
Color char(250) NULL,
Price decimal(8,2) NULL,
Desc Varchar(750) NULL
);
’’’)
Note that the triple quote block allows us to write strings in a multiline environment.
However, even after running this code our database has not been modified. In order to
permanently change the database from python we must call commit() from the variable
connected to the database. We can think of a database commit similar to a commit when
using Git.
conn.commit()
Regardless of the SQL command, we will use the same execute(SQL command) method in
order to stage changes to our database. Therefore, if we do not explicitly write the Python
code block after an SQL command is newly introduced, then it is implied that in order
to use it in Python, one must call c.execute(SQL command). We do this because SQL is
frequently used outside of Python and this will allow us to focus our attention more on SQL
at times.
1.4 Retrieving data
To get data from a table, we must specify at least two things: what information we would
like and the name of the table. Using the Shoes table we have just created, we will populate
it with a few more observations. For each of these observations, we will only specify the ID
and Brand, where the Brand name must be unique.
6 CHAPTER 1. INTRODUCTION
INSERT INTO Shoes (ID, Brand) VALUES (‘2345678901’, ‘LeBron’);
INSERT INTO Shoes (ID, Brand) VALUES (‘3456789012’, ‘Adidas’);
INSERT INTO Shoes (ID, Brand) VALUES (‘4567890123’, ‘Yeezy’);
INSERT INTO Shoes (ID, Brand) VALUES (‘5678901234’, ‘Puma’);
INSERT INTO Shoes (ID, Brand) VALUES (‘6789012345’, ‘New Balance’);
Now we will be able to get more information when retrieving data from our table. To retrieve
a variable, such as ID, from the Shoes table, we use the SELECT and FROM commands inside
of SQL.
SELECT ID FROM Shoes;
In Python, after running this command inside of the c.execute() method nothing will be
returned nor printed. In order to access the variables we are selecting we must fetch them.
For now, we will use fetchall, which will return all of the selected variables.
c.execute("SELECT ID FROM Shoes;")
c.fetchall()
[(‘1234567890’,),
(‘2345678901’,),
(‘3456789012’,),
(‘4567890123’,),
(‘5678901234’,),
(‘6789012345’,)]
We now have a Python list of tuples corresponding to each observation. In each observation
tuple, we see the values for each ID variable we have set. SQL stores this information as a
tuple in order to more easily let us retrieve multiple variables from a table, which can be
done by specifying all the variables after the SELECT command, separated with a comma.
c.execute("SELECT ID, Brand FROM Shoes;")
c.fetchall()
[(‘1234567890’, ‘Nike’),
(‘2345678901’, ‘LeBron’),
(‘3456789012’, ‘Adidas’),
(‘4567890123’, ‘Yeezy’),
(‘5678901234’, ‘Puma’),
(‘6789012345’, ‘New Balance’)]
If we want to retrieve every single value from the table, we can specify the asterisk (*)
character after the SELECT command.
c.execute("SELECT * FROM Shoes;")
c.fetchall()
1.5. TEMPORARY TABLES 7
[(‘1234567890’, ‘Nike’, ‘Air Jordan’, ‘Red’, 450, None),
(‘2345678901’, ‘LeBron’, None, None, None, None),
(‘3456789012’, ‘Adidas’, None, None, None, None),
(‘4567890123’, ‘Yeezy’, None, None, None, None),
(‘5678901234’, ‘Puma’, None, None, None, None),
(‘6789012345’, ‘New Balance’, None, None, None, None)]
When working with larger databases, it is often not a good choice to view the entire dataset,
because it stores so many observations. Instead, we can specify how many observations we
want to see using the LIMIT operator.
c.execute("SELECT * FROM Shoes LIMIT 3;")
c.fetchall()
[(‘1234567890’, ‘Nike’, ‘Air Jordan’, ‘Red’, 450, None),
(‘2345678901’, ‘LeBron’, None, None, None, None),
(‘3456789012’, ‘Adidas’, None, None, None, None)]
1.5 Temporary tables
We often do not need to want to create an entirely new table for many tasks. For example, if
we are creating a new table to store intermediate calculations before inserting that resultant
value into a table, then we will not need the table once the session is over. So, instead of
creating a table, we will use temporary tables, which are faster to create and removed from
memory after a session is closed. To create a temporary table, we use the SQL command
CREATE TEMPORARY TABLE.
CREATE TEMPORARY TABLE Nikes AS
SELECT *
FROM shoes
WHERE Brand = ‘Nike’;
Here, we have created a temporary table called Nikes that stores only the observations from
the table that have the value of the Brand variable as Nike.
1.6 Adding comments to SQL
Inside of an SQL environment, there are two says to add comments: single-line comments
and multiline comments. To add a single like comment to SQL, we add two minus signs
(--) to the beginning of the part we want commented out. For example, the below example
will comment out the selection of Brand.
select ID,
--Brand,
Color
FROM Shoes
);
To comment out multiple lines, we use /* < COMMENTED OUT AREA > */.
8 CHAPTER 1. INTRODUCTION
select ID,
/*Brand,
Color*/
FROM Shoes
);
1.7 Pandas dataframe from SQLite database
Pandas is a module in Python that is commonly used to represent and analyze dataframes.
Instead of using SQL from Python to analyze data, it is often faster and easier to ana-
lyze data natively on Pandas. We only need the sqlite3 and pandas modules for this
demonstration, which can be imported once they are installed locally.
import sqlite3
import pandas as pd
In the remainder of this section, we will be looking at how to load a larger SQLite database
with multiple tables. We will use the Kaggle database titled “The History of Baseball” from
Sean Lahman, which can be accessed here.
2
The database contains 26 unique tables inside
of the database.sqlite file. The filetype .sqlite specifies that a database must be of
type SQLite, while a filetype of .db allows for many other types of databases, in addition
to SQLite.
With multiple tables, it is often a good idea to store each of the tables inside of a dictionary.
As we have previously done, we can set a connection and cursor to the new databack with
the connect() and cursor() commands.
conn = sqlite3.connect(‘the-history-of-baseball/database.sqlite’)
c = conn.cursor()
We can get the names of all the tables with a SELECT command and fetchall() method,
where we are choosing from sqlite master.
c.execute("SELECT name FROM sqlite_master WHERE type=’table’;")
c.fetchall()
[(‘all_star’,),
(‘appearances’,),
(‘manager_award’,),
(‘player_award’,),
(‘manager_award_vote’,),
(‘player_award_vote’,),
(‘batting’,),
(‘batting_postseason’,),
(‘player_college’,),
(‘fielding’,),
(‘fielding_outfield’,),
(‘fielding_postseason’,),
(‘hall_of_fame’,),
(‘home_game’,),
(‘manager’,),
(‘manager_half’,),
2
https://www.kaggle.com/seanlahman/the-history-of-baseball
1.7. PANDAS DATAFRAME FROM SQLITE DATABASE 9
(‘player’,),
(‘park’,),
(‘pitching’,),
(‘pitching_postseason’,),
(‘salary’,),
(‘college’,),
(‘postseason’,),
(‘team’,),
(‘team_franchise’,),
(‘team_half’,)]
We can create a dataframe from each table using the Pandas method read sql query().
We will save all of the tables into a dictionary with the name tables.
tables = {}
c.execute("SELECT name FROM sqlite_master WHERE type=‘table’;")
for table in c.fetchall():
tables[table[0]] = pd.read_sql_query(‘SELECT * FROM + table[0], conn)
Now we have stored all the tables as values in the dictionary, which can be acceessed like
any other Python dictionary.
Chapter 2
Filtering, sorting, and
calculations
2.1 WHERE operator
Filtering through a database gives us a way to look at only specific observations and variables
for a given table. By lowering the number of observations and variables we look at, our
analysis will run faster. It is also best to first filter a database from SQL, before creating
a dataframe from Python if there is data that needs to be removed, since not all the data
would need to originally be copied to Python.
The WHERE operator in SQL gives us a way to filter from a SELECT command. With the
WHERE operator we must specify the column name, operator, and value we want to compare.
SELECT column_name1, column_name2
FROM table_name,
WHERE column_name1 <operator> value;
Operator Description
= Equal
<> Not equal
> Greater than
< Less than
>= Greater than or equal
<= Less than or equal
BETWEEN Inclusive range
IS NULL Test for null values
NOT NULL Tests for non-null values
Table 2.1: Operator types in SQLite
The primary operators in SQLite are shown in Table 2.1. Using the Shoes table we pre-
viously worked with, we can access only the shoes from the “Nike” Brand with Brand =
‘Nike’.
10
2.2. IN, OR, AND, AND NOT OPERATORS 11
SELECT *
FROM Shoes
WHERE Brand = ‘Nike’;
If we are using numerical variables, we could filter for values in a range. For example, if we
want the observations where the price is greater than 45, we would write statement Price
> 45.
SELECT *
FROM Shoes
WHERE Price > 45;
To set the upper and lower bounds of a value, we use the BETWEEN operator, with the syntax
BETWEEN lower bound AND upper bound.
SELECT *
FROM Shoes
WHERE Price BETWEEN 300 AND 500;
When using the IS NULL or NOT NULL operators, we do not provide a value to test against,
since we are only checking if the value is null.
SELECT *
FROM Shoes
WHERE Price IS NULL;
2.2 IN, OR, AND, and NOT operators
The IN operator in SQL works similar to finding if α S, where S is a mathematical set.
To define the set, we must specify all the possible values that we would like. For example,
in the following code block, we check in each observation if the Price is 99, 199, or 299.
SELECT *
FROM Shoes
WHERE Price IN (99, 199, 299);
The OR operator works similar to the IN operator, in that it checks if any of the conditions
specified hold true. The order that the OR operator checks is from left-to-right, and if a
condition is true, then we do not have to test through all the other conditional statements,
since one true statement in an OR operator results in the entire operation being true.
SELECT *
FROM Shoes
WHERE Price = 99 OR Price = 199;
We typically prefer using the IN operator, since it allows us to more easily specify many
conditions and runs faster. With the IN operator, the order in which we add the items does
not matter.
The AND operator will only result in true if both of the conditions are true. We can string
together multiple AND and OR operations into a single WHERE statement. And has a higher
precedence than the OR operator.
12 CHAPTER 2. FILTERING, SORTING, AND CALCULATIONS
SELECT *
FROM Shoes
WHERE (Price = 99 OR Price = 199) AND Brand = ‘Nike’;
The NOT operator typically is used to negate a conditional statement. We can use either the
NOT operator or <> operator to achieve the same results of not filtering for a specific value.
The performance of each operator is the same, so there is really no benefit when using one
over the other.
SELECT *
FROM Shoes
WHERE NOT ID = ‘0123456789’ AND NOT Price = 99;
The NOT operator has higher precedence than the AND operator, which also has higher
precedence than the OR operator.
2.3 Wildcards
Wildcards give us a way to filter for a particular value, if we do not know all the information
about that value. There are two wildcard characters that do the same thing, the underscore
( ) and percent (%) characters. If, for example, we have multiple Brand values that have the
string “Jordan” inside of it, such as “Air Jordan” or “Jordan Max,” then we could search
if the string “Jordan” exists in the Brand value with the following code.
SELECT *
FROM Shoes
WHERE Brand LIKE ‘%Jordan%’;
Instead of using the % character, we can achieve the same results using the character.
SELECT *
FROM Shoes
WHERE Brand LIKE ‘_Jordan_’;
The wildcard characters are specifiying that any value can come before or after we see the
string “Jordan,” which means we are just looking for if the word “Jordan” exists in the
value. If, instead, we want to check and see if the word starts with “Jordan,” we could run
the following code.
SELECT *
FROM Shoes
WHERE Brand LIKE ‘Jordan_’;
Similarily, to check if “Jordan,” is at the end of the value, we could run the following.
SELECT *
FROM Shoes
WHERE Brand LIKE ‘_Jordan’;
While wildcard characters work great, they take longer to run than the comparison operators
we have previously introduced. Also, wildcard characters can only be used with varaibles
that hold string values.
2.4. ORDER BY OPERATOR 13
2.4 ORDER BY operator
We can sort the data we retrieve from the SELECT statement using the ORDER BY operator.
The two ways we can sort data are in ascending and descending order, with the ASC and
DESC operators, respectively. The ORDER BY operator always comes at the end of the SELECT
statement. By default, if we do not specify which order to sort, the sorting will occur in
ascending order, as shown.
SELECT *
FROM Shoes
ORDER BY Brand;
To specify ascending or descending order, that command must follow the column name.
SELECT *
FROM Shoes
ORDER BY Brand DESC;
We can chain together multiple ORDER BY statements. Chaining the statements together will
first order by the first variable specified, then order the ties based on the second varaible
specified, and continue this pattern for all the order specified. For example, the code block
below orders first by the Brand name, and then if two Brand names are equal, the tiebreaker
will be the ID.
SELECT *
FROM Shoes
ORDER BY Brand, ID ASC;
2.5 Math operators
Inside of a SELECT statement, we can add additional variables, which can be formed using
arithmetic operations. For example, if we had the variables Price and Units from the
Shoes table, then we could track the Total Cost as follows.
SELECT Price,
Units,
Units * Price AS Total_Cost
FROM Shoes;
By specifying Total Cost after the AS operator, we are storing that as the varible name, and
the calculation is done for each row. We can use all the basic arithmetic operations, which
include parentheses (()), multiplication (*), division (/), addition (+), and subtraction (-).
2.6 Aggregate functions
Aggregate functions in SQL allow us a quick way to summarize our data. The primary
aggregate functions are shown in Table 2.2. Aggregate functions ignore null values.
Notice that for the COUNT() method, we are able to use two different types of parameters:
the column name and all the columns. The specific syntax when using aggregate functions
is shown below.
SELECT COUNT(ID) AS id_count
FROM Shoes;
14 CHAPTER 2. FILTERING, SORTING, AND CALCULATIONS
Function Description
AVG(X) Average value in a column
COUNT(X) Number of values in a column
COUNT(*) Number of rows in the table
MIN(X) Minimum value in a column
MAX(X) Maximum value in a column
SUM(X) Integer valued sum over each value in a column
TOTAL(X) Floating point valued sum over each value in a column
Table 2.2: Primary aggregate functions. We are using X to denote the name of a column.
We can run multiple aggregate functions at once in Python. We will be using the batting
table from “The History of Baseball” dataset we download earlier. In this dataset, they
store empty strings instead of null values, so running the code NULLIF(var, ’’) will set
empty string values to null in the var column. We can run aggregate functions with the
following code block.
c.execute("""
SELECT
AVG(NULLIF(h, ’’)),
MAX(NULLIF(h, ’’)),
MIN(NULLIF(h, ’’)),
COUNT(*),
COUNT(NULLIF(h, ’’)),
SUM(NULLIF(h, ’’)),
TOTAL(NULLIF(h, ’’))
FROM batting;""")
c.fetchall()
[(39.261647068608795, 262, 0, 101332, 96183, 3776303, 3776303.0)]
Notice that we do not need to specify a variable name for each select statement. However,
if we would like to access that data at a later time, then we must assign it to a variable. We
can also use previously introduced filtering operations to analyze more specific data. For
example, we can filter for the year 2000 as follows.
c.execute("""
SELECT
AVG(NULLIF(h, ’’)),
MAX(NULLIF(h, ’’)),
MIN(NULLIF(h, ’’)),
COUNT(*),
COUNT(NULLIF(h, ’’)),
SUM(NULLIF(h, ’’)),
TOTAL(NULLIF(h, ’’))
FROM batting
WHERE year = 2000;""")
c.fetchall()
[(32.69219653179191, 240, 0, 1384, 1384, 45246, 45246.0)]
2.7. GROUP BY AND HAVING COMMAND 15
If we want the total number of distinct values, such as the total number of unique Brand
names in our Shoes table, we can use the DISTINCT operator. The DISTINCT operator will
count each category only once and can be used with the syntax that follows.
SELECT COUNT(DISTINCT Brand)
FROM Shoes;
2.7 GROUP BY and HAVING command
If we want to create groups to organize and filter different row categories, then we can use
the GROUP BY and HAVING commands. When using the GROUP BY command, we must specify
how we want to group our items. If, for example, we wanted to group our Shoes table by
brand, we could write GROUP BY Brand.
SELECT *
FROM Shoes
GROUP BY Brand;
The HAVING clause follows the grouping and allows us to filter based on the groups we have
assigned. If we want to remove all of the rows that have a unique brand name, then we
could write HAVING count(Brand) > 1.
SELECT *
FROM Shoes
GROUP BY Brand
HAVING count(Brand) > 1;
We can also change the columns we select to display information about the groups. If we
wanted to know the number of items in the group, maximum value, and minimum value,
we count write the following.
SELECT COUNT(Brand), MAX(Brand), MIN(Brand)
FROM Shoes
GROUP BY Brand
HAVING count(Brand) > 1;
2.8 Order of clauses in a SELECT statement
When using a SELECT statement in SQL, each clause must follow a specific order. The
particular order and descriptions of each clause is shown in Table 2.3.
16 CHAPTER 2. FILTERING, SORTING, AND CALCULATIONS
Order Clause Description Required
1 SELECT Columns to be returned 3
2 FROM Table to retrieve data from Only when using tables
3 WHERE Row-level filtering 7
4 GROUP BY Create groups 7
5 HAVING Group-level filtering 7
6 ORDER BY Sorting specification 7
7 LIMIT Caps the maximum number of observations 7
Table 2.3: Clause ordering with a SELECT statement
Chapter 3
Subqueries and joins
3.1 Using subqueries
As we have seen with “The History of Baseball” database, a database may contain multiple
tables. Subqueries allow us a way to combine data from different tables to analyze it all
in one table. A subquery works by embedding one query inside of another query. To set
up why subqueries could be useful, let us consider finding the number of all-stars who were
born in January from our history of baseball database. We also know that the birth month
is a variable in the player table, and we can find out if somebody was an all star if their
player id is in the all star table. Without subqueries, we would first need to filter for
birth month = 1.
SELECT player_id
FROM player
WHERE birth_month = 1;
[(‘abercda01’,),
(‘abernbi01’,),
(‘abreujo02’,),
(‘actama99’,),
.
.
.
(‘youngan01’,),
(‘yellela01’,),
(‘ydeem01’,),
(‘wynnea01’,)]
The result turns out to be around 1612 players, which we now need to copy into the WHERE
clause when filtering through all stars.
SELECT *
FROM all_star
WHERE player_id IN (‘abercda01’, ‘abernbi01’, ‘abreujo02’, ‘actama99’,
..., ‘youngan01’, ‘yellela01’, ‘ydeem01’, ‘wynnea01’);
After copying and formatting all 1612 player IDs into the WHERE clause, we will get the
combination of all-star players who are born in January. This process required plenty of
manual assistance, which does is not ideal.
17
18 CHAPTER 3. SUBQUERIES AND JOINS
The subquery approach to finding the all-star players who are born in January is as follows.
SELECT player_id
FROM player
WHERE birth_month = 1 AND player_id IN (
SELECT player_id
FROM all_star);
Now, not only have we gotten the same result with less code and manual work, but our
approach will work as the information is updated in the table. There is not a limit to the
number of subqueries that we can use, however, as the nesting of SELECT statements grows,
the performance slows down. With subqueries, we are only able to select one column at
a time. As we go deeper into nested queries, it may be difficult to understand what is
happening if the indentation is poor. The website PoorSQL can be useful to reorganize
SQL commands, in order to improve readibility.
3.2 Joining tables
Joining tables gives us a way to combine tables based on the primary key attribute. It is
important to note that databases do not always come with joined tables, primarily because
it is more efficient to store and index into tables separately. In addition, multiple tables
inside of a database allows for greater scalability. Joining tables do not permanent records
of information. Instead, they are only used to view data once a query is executed.
For our discussion on joining tables, we will be using two tables, T1 and T2. Each table
will store an ID (integer) and Name (text) attribute. The specific observations we have are
shown in Figure 3.2.1.
ID Name
1 ‘a’
2 ‘b’
3 ‘c’
4 ‘d’
5 ‘e’
(a) Contents of T1
ID Name
4 ‘v’
5 ‘w’
6 ‘x’
7 ‘y’
8 ‘z’
(b) Contents of T2
Figure 3.2.1: Contents of our example tables.
In the tables we have created, the ID attribute is a primary key that must be unique for
each observation in each table. We can use a venn diagram as shown in Figure 3.2.2 to show
the overlapping and uniqueness of the keys in both tables.
3.2.1 Cartesian (cross) joins
One of the easiest ways to join two databases is to combine each item in one database with
each item in the other database. This is known as cartesian or cross join and is shown in
Figure 3.2.3. We can create that specific cross join between two tables, T1 and T2 as follows.
SELECT *
FROM T1 CROSS JOIN T2;
With one table having m rows and the other table having n rows, the cartesian join will
result in mn rows. This can cause the number of rows to be large, even with relatively small
databases, so this method for joining databases is seldom used in practice.
3.2. JOINING TABLES 19
Table 1
1, 2, 3 4, 5
6, 7, 8
Table 2
Figure 3.2.2: Venn diagram showing the relationship between the IDs in T1 and T2.
ID Name
1 ‘a’
2 ‘b’
3 ‘c’
4 ‘d’
5 ‘e’
ID Name
4 ‘v’
5 ‘w’
6 ‘x’
7 ‘y’
8 ‘z’
T1.id T1.Name T2.id T2.Name
1 ‘a’ 4 ‘v’
1 ‘a’ 5 ‘w’
1 ‘a’ 6 ‘x’
1 ‘a’ 7 ‘y’
1 ‘a’ 8 ‘z’
2 ‘b’ 4 ‘v’
2 ‘b’ 5 ‘w’
2 ‘b’ 6 ‘x’
2 ‘b’ 7 ‘y’
2 ‘b’ 8 ‘z’
3 ‘c’ 4 ‘v’
3 ‘c’ 5 ‘w’
3 ‘c’ 6 ‘x’
3 ‘c’ 7 ‘y’
3 ‘c’ 8 ‘z’
4 ‘d’ 4 ‘v’
4 ‘d’ 5 ‘w’
4 ‘d’ 6 ‘x’
4 ‘d’ 7 ‘y’
4 ‘d’ 8 ‘z’
5 ‘e’ 4 ‘v’
5 ‘e’ 5 ‘w’
5 ‘e’ 6 ‘x’
5 ‘e’ 7 ‘y’
5 ‘e’ 8 ‘z’
=
Figure 3.2.3: Resultant table after a cartesian (cross) join between T1 and T2.
20 CHAPTER 3. SUBQUERIES AND JOINS
3.2.2 Inner joins
If we have keys for each observation in two tables, T1 and T2, the inner join will keep only
the observations that have keys in both T1 and T2. If we look at the table keys as sets, this
means we are finding T1.id T2.id. The SQL syntax for inner joins is as follows.
SELECT *
FROM T1 INNER JOIN T2
ON T1.id = T2.id;
[(4, ‘d’, 4, ‘v’), (5, ‘e’, 5, ‘w’)]
We could also combine more than two tables using multiple inner join statements in the
SELECT clause as shown.
SELECT *
FROM T1
INNER JOIN T2 ON T1.keyid = T2.keyid
INNER JOIN T3 ON T1.keyid = T3.keyid;
Table 1
1, 2, 3 4, 5
6, 7, 8
Table 2
Figure 3.2.4: The selection from inner join takes the intersection of keys from two tables.
3.2.3 Aliases
Consider, for example, if we had longer table names, such as table name1, table name2,
and table name3 and performed an inner join on all three of the tables as shown.
SELECT table_name1.attr, table_name2.attr, table_name3.attr
FROM table_name1
INNER JOIN table_name2 ON table_name1.keyid = table_name2.keyid
INNER JOIN table_name3 ON table_name1.keyid = table_name3.keyid;
Using an alias, we can temporarily set the references to the variable names to be shorter,
making it easier to write compact code. For example, we can set the alias for table name1,
table name2, and table name3 to be T1, T2, and T3, respectively, with the code that follows.
3.2. JOINING TABLES 21
SELECT T1.attr, T2.attr, T3.attr
FROM table_name1 T1
INNER JOIN table_name2 T2 ON T1.keyid = T2.keyid
INNER JOIN table_name3 T3 ON T1.keyid = T3.keyid;
3.2.4 Self join
In this section, we will use the table that results from the cross join between T1 and T2 as
shown in Figure 3.2.3. We will label the cross joined table as CJ, the ID from T1 as ID1 and
the ID from T2 as ID2. A self join will join an attribute with itself, generally with a specific
filtering in mind. For example, if we have a table named people that stores what city a
given person is in, along with their name, we can SELECT people from the same city using a
self join technique that requires creating a copy of the table. If we want to use self join on
our table, CJ, where we are looking to combine all the attributes that have ID1 = ID2 and
ID1 <> ID2, then we can use the self join that follows.
SELECT *
FROM CJ cj1
INNER JOIN CJ cj2
on (cj1.id1 = cj2.id1) and (cj1.id2 <> cj2.id2)
[(1, 4, ‘a’, ‘v’, 1, 5, ‘a’, ‘w’),
(1, 4, ‘a’, ‘v’, 1, 6, ‘a’, ‘x’),
(1, 4, ‘a’, ‘v’, 1, 7, ‘a’, ‘y’),
(1, 4, ‘a’, ‘v’, 1, 8, ‘a’, ‘z’),
(1, 5, ‘a’, ‘w’, 1, 4, ‘a’, ‘v’),
(1, 5, ‘a’, ‘w’, 1, 6, ‘a’, ‘x’),
(1, 5, ‘a’, ‘w’, 1, 7, ‘a’, ‘y’),
(1, 5, ‘a’, ‘w’, 1, 8, ‘a’, ‘z’),
(1, 6, ‘a’, ‘x’, 1, 4, ‘a’, ‘v’),
(1, 6, ‘a’, ‘x’, 1, 5, ‘a’, ‘w’),
(1, 6, ‘a’, ‘x’, 1, 7, ‘a’, ‘y’),
(1, 6, ‘a’, ‘x’, 1, 8, ‘a’, ‘z’),
(1, 7, ‘a’, ‘y’, 1, 4, ‘a’, ‘v’),
(1, 7, ‘a’, ‘y’, 1, 5, ‘a’, ‘w’),
(1, 7, ‘a’, ‘y’, 1, 6, ‘a’, ‘x’),
(1, 7, ‘a’, ‘y’, 1, 8, ‘a’, ‘z’),
(1, 8, ‘a’, ‘z’, 1, 4, ‘a’, ‘v’),
(1, 8, ‘a’, ‘z’, 1, 5, ‘a’, ‘w’),
(1, 8, ‘a’, ‘z’, 1, 6, ‘a’, ‘x’),
(1, 8, ‘a’, ‘z’, 1, 7, ‘a’, ‘y’),
...
(5, 4, ‘e’, ‘v’, 5, 5, ‘e’, ‘w’),
(5, 4, ‘e’, ‘v’, 5, 6, ‘e’, ‘x’),
(5, 4, ‘e’, ‘v’, 5, 7, ‘e’, ‘y’),
(5, 4, ‘e’, ‘v’, 5, 8, ‘e’, ‘z’),
(5, 5, ‘e’, ‘w’, 5, 4, ‘e’, ‘v’),
(5, 5, ‘e’, ‘w’, 5, 6, ‘e’, ‘x’),
(5, 5, ‘e’, ‘w’, 5, 7, ‘e’, ‘y’),
(5, 5, ‘e’, ‘w’, 5, 8, ‘e’, ‘z’),
(5, 6, ‘e’, ‘x’, 5, 4, ‘e’, ‘v’),
(5, 6, ‘e’, ‘x’, 5, 5, ‘e’, ‘w’),
(5, 6, ‘e’, ‘x’, 5, 7, ‘e’, ‘y’),
(5, 6, ‘e’, ‘x’, 5, 8, ‘e’, ‘z’),
(5, 7, ‘e’, ‘y’, 5, 4, ‘e’, ‘v’),
22 CHAPTER 3. SUBQUERIES AND JOINS
(5, 7, ‘e’, ‘y’, 5, 5, ‘e’, ‘w’),
(5, 7, ‘e’, ‘y’, 5, 6, ‘e’, ‘x’),
(5, 7, ‘e’, ‘y’, 5, 8, ‘e’, ‘z’),
(5, 8, ‘e’, ‘z’, 5, 4, ‘e’, ‘v’),
(5, 8, ‘e’, ‘z’, 5, 5, ‘e’, ‘w’),
(5, 8, ‘e’, ‘z’, 5, 6, ‘e’, ‘x’),
(5, 8, ‘e’, ‘z’, 5, 7, ‘e’, ‘y’)]
Here, we have shortened the output to put it on only a single page. Notice that no pair of
IDs from cj1 is equal to the pair of IDs from cj2.
3.2.5 Left join
The left join keeps all the elements from the left table, which maintains the same number
of rows in the left table before and after the observation. However, the left join will also
add missing information from the right table to the observations in the left table, if the
observation keys intersect.
3.2. JOINING TABLES 23
Table 1
1, 2, 3 4, 5
6, 7, 8
Table 2
Figure 3.2.5: Left join keeps all the elements from the left table and adds additional information
to each observation from the right table, if available.
Using our example data with tables T1 and T2, we can perform a left join as follows.
SELECT *
FROM T1 LEFT JOIN T2
ON T1.id = T2.id;
[(1, ‘a’, None, None),
(2, ‘b’, None, None),
(3, ‘c’, None, None),
(4, ‘d’, 4, ‘v’),
(5, ‘e’, 5, ‘w’)]
Notice that all the left observations remain, and if the id attribute does not overlap between
the two tables, then a null value (None) is filled in. A left join could be useful, for example,
if we had tables of both customers and orders and would like to keep all the customers, as
well as match the orders with the customers who have made orders.
3.2.6 Right join
Table 1
1, 2, 3 4, 5
6, 7, 8
Table 2
Figure 3.2.6: A right join works similar to a left join, just the roles of left and right are switched.
A right join works just like a left join, in that all the observations from the right table remain,
with the addition of variables to the column space. However, right join is not in SQLite,
which is what we have been using to program our database in Python. The functionality is
24 CHAPTER 3. SUBQUERIES AND JOINS
purposfully not added because any left join can be made into a right join by switching the
ordering of the tables.
3.2.7 Full outer join
The outer join of two tables gives both the left join and right join between the IDs of two
tables. For overlapping IDs, the observations will be combined, and unique IDs will remain
unique IDs. However, like the right join, the full outer join is not available as a command
when using SQLite.
Table 1
1, 2, 3 4, 5
6, 7, 8
Table 2
Figure 3.2.7: The outer join takes the left join and right join between the keys of two tables.
3.2.8 Union
The union operator does not work in the traditional probabilitic sense that one may think.
From a probabilistic perspective, the full outer join appears to be most similar to the
union operator as shown in Figure 3.2.7. What the union operator does in SQL is stack
the information from one table onto the other as shown in Figure 3.2.8. Thus, the union
operator requires the number of columns to be the same, along with the type of variables
that can be in each column.
City
New York
Seattle
Atlanta
City
Palo Alto
Berkeley
Austin
UNION
City
New York
Seattle
Atlanta
Palo Alto
Berkeley
Austin
Figure 3.2.8: Union between two tables.
To find the union between the two tables, T1 and T2, we can use the UNION command. By
default, SQL will sort rows based on the ordering of the key.
3.2. JOINING TABLES 25
SELECT * FROM T1
UNION
SELECT * FROM T2;
[(1, ‘a’),
(2, ‘b’),
(3, ‘c’),
(4, ‘d’),
(4, ‘v’),
(5, ‘e’),
(5, ‘w’),
(6, ‘x’),
(7, ‘y’),
(8, ‘z’)]
Chapter 4
Modifying and analyzing data
4.1 Strings
Just like most programming languages, SQL comes with a variety of ways to modify strings
of text. A few of the common functions that work on strings include concatenate, substring,
trim, upper, and lower. For our discussion on strings, we will be using the Names table in
Figure 4.1.1
First Last
John’ ‘Doe
‘Bill’ ‘Gates’
‘LeBron’ ‘James ’
Figure 4.1.1: Names table that we will be using for our discussion on strings.
4.1.1 Concatenations
A concatenation between two strings takes the contents from the first string and adds to it
the contents of the second string. For example, if the first string was “this and and the
second string was “that”, then the concatenation between the first and second string would
be “this and that”. The concatenate operator in SQLite is ||. If we wanted to combine the
First and Last names in our Names table, we could call the following.
SELECT First || || Last
FROM Names;
[(‘ John Doe ’,), (‘Bill Gates’,), (‘LeBron James ’,)]
Notice that we are also adding in a space between the first and last names. If we did not
want to add a space, we could call First || Last.
SELECT First || Last
FROM Names;
[(‘ JohnDoe ’,), (‘BillGates’,), (‘LeBronJames ’,)]
26
4.1. STRINGS 27
4.1.2 Trimming strings
Notice that in our Names table, there appears to be leading and trailing whitespace. We
can remove all the whitespace on the lefthand and righthand side of the string using the
TRIM() method.
SELECT TRIM(First) || || TRIM(Last)
FROM Names;
[(‘John Doe’,), (‘Bill Gates’,), (‘LeBron James’,)]
We could also remove only the whitespace on the lefthand side of the strings using the
LTRIM() method.
SELECT LTRIM(First) || || LTRIM(Last)
FROM Names;
[(‘John Doe ’,), (‘Bill Gates’,), (‘LeBron James ’,)]
Similarily, we could also remove the whitespace on the righthand side of the string using
the RTRIM() method.
SELECT RTRIM(First) || || RTRIM(Last)
FROM Names;
[(‘ John Doe’,), (‘Bill Gates’,), (‘LeBron James’,)]
4.1.3 Substring
Taking a substring of a particular string will give you back part of the original string. In
SQL, We use the SUBSTR(string name, starting position, number of characters)
method. SQL starts strings at index 1, as opposed to many programming languages that
start a string with index 0. Assuming our strings are trimmed, if we wanted to get the first
letter of the first name and concatenate it with the first three letters of the last name, we
could call the following.
SELECT SUBSTR(First, 1, 1) || SUBSTR(Last, 1, 3)
FROM Names;
[(‘JDoe’,), (‘BGat’,), (‘LJam’,)]
If we specify a string length that is too long, for example if we try to get the first five
characters from the name “John”, then the string will still work, but it will only return
characters until it gets to the end of the string.
SELECT SUBSTR(First, 1, 5)
FROM Names;
28 CHAPTER 4. MODIFYING AND ANALYZING DATA
[(‘John’,), (‘Bill’,), (‘LeBro’,)]
4.1.4 Uppercase and lowercase
To convert a string to uppercase or lowercase, we will using the UPPER() and LOWER()
commands, respectively. We can call UPPER() on our first names as follows.
SELECT UPPER(First)
FROM Names;
[(‘JOHN’,), (‘BILL’,), (‘LEBRON’,)]
Similarily, we can also call LOWER() on a string to convert each character to be lowercase.
SELECT LOWER(First)
FROM Names;
[(‘john’,), (‘bill’,), (‘lebron’,)]
4.1.5 Date and time strings
Working with dates and times in SQL may be tricky because they are not stored the same
in every DBMS. Table 4.1 shows the 5 functions associated with dates and times in SQLite.
Function Name
DATE(timestring, modifiers...)
TIME(timestring, modifiers...)
DATETIME(timestring, modifiers...)
JULIANDAY(timestring, modifiers...)
STRFTIME(format, timestring, modifiers...)
Table 4.1: SQLite date and time functions
Table 4.2 shows the formats a string must be in, in order for SQLite to pick it up as a
timestring. For example, we can represent the date December 11, 1980 at 5:30 PM as
1980-12-11 17:30, where we have converted the time to a 24 hour scale. We can also refer
to the current time as ‘now’.
We can also add modifiers to our timestrings that modify the time we have specified. In
particular, we can add and subtract time, move the time to the beginning of the day/mon-
th/year, and change which day we want to use. Table 4.3 shows the strings needed for
each modifier. For example, if we want to refer to the Monday of the current week, where
weekday N starts at Sunday = 0 and goes to Saturday = 6, we can call the following:
SELECT date(‘now’, ‘weekday 1’);
4.2. CASE STATEMENTS 29
YYYY-MM-DD
YYYY-MM-DD HH:MM
YYYY-MM-DD HH:MM:SS.SSS
YYYY-MM-DD‘T’HH:MM
YYYY-MM-DD‘T’HH:MM:SS
YYYY-MM-DD‘T’HH:MM:SS.SSS
HH:MM
HH:MM:SS
HH:MM:SS.SSS
Table 4.2: Different types of timestrings (note that the ‘T’ character in the middle timestrings is
the literal T character)
±N N N days start of day weekday N
±N N N hours start of month
±N N N minutes start of year
±N N N.N N N N seconds
±N N N months
±N N N years
Table 4.3: Different types of modifiers
If we want to add 2 hours, 20 minutes, and 15 seconds to the start of the current month,
then we can call:
SELECT date(‘now’, ‘start of month’, ‘+2 hours’, ‘+20 minutes’,
‘+15 seconds’);
4.2 Case statements
Case statements allow us to use conditional statements in the form of if/then/else inside of
SQL. We can use CASE statements inside of SELECT, INSERT, UPDATE, and DELETE statements.
The syntax to write a case statement follows.
CASE
WHEN C1 THEN E1
WHEN C2 THEN E2
...
ELSE
<else result>
END
We can also use case statements to act like switch functions in other programming languages,
as we show in the following example.
30 CHAPTER 4. MODIFYING AND ANALYZING DATA
CASE <input expression>
WHEN <when expression> THEN E1
...
ELSE
<else result>
END
For example, suppose we want to separate our data that has a city category into ‘San
Francisco’ and ‘Other’ categories. Here, we can set our new category name to be SF and
use the select statement that follows.
SELECT
CASE city
WHEN ‘San Francisco’ THEN ‘San Francisco’
ELSE ‘Other’
END SF
FROM <table>;
4.3 Views
Views give us a way to save our searches in SQL in a variable, which other users can easily
reference. For example, suppose we have a list of NBA players and we have a strict criteria to
determine if they have had a successful career based on if they meet the following conditions
NBA years 10
All NBA appearances 5
All Star appearances 3
NBA Championships 1
. (4.3.1)
Here, we could create a VIEW using the statement that follows.
CREATE VIEW successful AS
SELECT *
FROM <players table>
WHERE
NBA_years >= 10
All_NBA_appearances >= 5
All_Star_appearances >= 3
NBA_Championships >= 1;
After creating that view, we can reference it in the future just like we reference tables.
SELECT * FROM successful;
We can also find all of the created views from a single table using the dot VIEWS command.
SELECT * FROM <table name>.VIEWS;