Up and Running with SQL

Databases

SQL

Data engineering

Published

2 December 2023

Introduction to databases

We do realise that most of the softwares around are designed to solve a problem!. To better understand what databases are ,we need to first understand the problems that necessitated the advent of databases. The answer to this is quite obvious ! you have some data or large amounts of information that you need to store, this data could be about:

Customers
Products
Employees etc

But why worry? , we could store the data in text files or even spreadsheets or if they are documents we can just organise them in folders…. is it! . Just having data is not a good enough reason to have a database , having data is not the problem but the problem is what comes next. Below are some of the problems that a database can solve

so what`s a database?

What is a database?

A structured storage space where the data is kept in many tables and organized so that the necessary information can be easily fetched, manipulated, and summarized.

tables and fields

A table is an organized set of related data stored in a tabular form, i.e., in rows and columns. A field is another term for a column of a table and record is the other name for a row

Relational Data

Relational data are related by unique identifiers

To better understand this, let’s consider a scenario were Bongani Has a business. say I have a number of different restaurants. In one table I might have information about these restaurants including, where they are located and what type of food they serve. I may then have a second table where information about health and safety inspections is stored. Each inspection is a different row and the date of the inspection, the inspector, and the safety rating are stored in this table. Finally, I might have a third table. This third table contains information pulled from an API, regarding the number of stars given to each restaurant, as rated by people online. Each table contains different bits of information; however, there is a common column id in each of the tables. This allows the information to be linked between the tables. The restaurant with the id “JJ29JJ” in the restaurant table would refer to the same restaurant with the id “JJ29JJ” in the health inspections table, and so on. The values in this id column are known as unique identifiers because they uniquely identify each restaurant. No two restaurants will have the same id, and the same restaurant will always have the same id, no matter what table you’re looking at. The fact that these tables have unique identifiers connecting each table to all the other tables makes this example what we call relational data.

Unique identifiers help link entries across tables

Why relational data?

Storing data in this way has a number of advantages; however, the three most important are:

Efficient Data Storage
Avoids Ambiguity
Privacy

DBMS

DDMS stands for Database Management System, a software package used to perform various operations on the data stored in a database, such as accessing, updating, wrangling, inserting, and removing data. There are various types of DBMS, such as relational, hierarchical, network, graph, or object-oriented. These types are based on the way the data is organized, structured, and stored in the system.

RDBMS

RDBMS stands for Relational Database Management System. It’s the most common type of DBMS used for working with data stored in multiple tables related to each other by means of shared keys. The SQL programming language is particularly designed to interact with RDBMS. Some examples of RDBMS are MySQL, PostgreSQL, Oracle, MariaDB, etc.

SQL constraints

constraints	Description
DEFAULT	provides a default value for a column
UNIQUE	allows only unique values
NOT NULL	allows only non-null values.
PRIMARY KEY	allows only unique and strictly non-null values (NOT NULL and UNIQUE).
FOREIGN KEY	provides shared keys between two and more tables

unique key

A column (or multiple columns) of a table to which the UNIQUE constraint was imposed to ensure unique values in that column, including a possible NULL value (the only one).

foreign key

A column (or multiple columns) of a table to which the FOREIGN KEY constraint was imposed to link this column to the primary key in another table (or several tables). The purpose of foreign keys is to keep connected various tables of a database.

SQL queries

A query is a piece of code written in SQL to access the data from a database or to modify the data. Correspondingly, there are two types of SQL queries: select and action queries. The first ones are used to retrieve the necessary data (this also includes limiting, grouping, ordering the data, extracting the data from multiple tables, etc.), while the second ones are used to create, add, delete, update, rename the data, etc.

SQL comments

A human-readable clarification on what a particular piece of code does. SQL code comments can be single-line (preceded by a double dash –) or span over multiple lines (as follows: /comment_text/). When the SQL engine runs, it ignores code comments. The purpose of adding SQL code comments is to make the code more comprehensive for those people who will read it in the future.

types of SQL commands (or SQL subsets)

Command	Explanation
Data Definition Language (DDL)	to define and modify the structure of a database.
Data Manipulation Language (DML)	to access, manipulate, and modify data in a database.
Data Control Language (DCL)	to control user access to the data in the database and give or, revoke privileges to a specific user or a group of users.
Transaction Control Language (TCL)	to control transactions in a database.
Data Query Language (DQL)	to perform queries on the data in a database to retrieve the necessary information from it.

creating a table

any computer system that deals with storing data needs to have the four fundamental functions that have the ability to:

C-REATE
R-EAD
U-UPDATE
D-ELETE

commonly pronounced as CRUD

we use the CREATE TABLE statement.

CREATE TABLE table_name (col_1 datatype,
                           col_2 datatype,
                           col_3 datatype);

for instance if run the following query !

CREATE TABLE my_table (age INTERGER,
                       sex TEXT);

we add the values to the table using INSERT INTO .... VALUES (....)

INSERT INTO my_table VALUES (34,'male');
INSERT INTO my_table VALUES (47,'female');
INSERT INTO my_table VALUES (32,'male');

then call the database using SELECT

SELECT * FROM my_table;

#> # A tibble: 3 × 2
#>     age sex   
#>   <dbl> <chr> 
#> 1    34 male  
#> 2    47 female
#> 3    32 male

Querying a database

we will use the Employee attrition database
A query is a request for data from a database table (or combination of tables).
Querying is an essential skill for a data scientist, since the data you need for your analyses will often live in databases.

SELECT statement

select all with *

In SQL, you can select data from a table using a SELECT statement. For example, the following query selects the name column from the people table: SELECT name FROM people;
you may want to select all columns from a table use * rather than typing all the columns

SELECT * FROM dat_new

#>    EmployeeNumber Attrition Age    BusinessTravel DailyRate
#> 1               1       Yes  41     Travel_Rarely      1102
#> 2               2        No  49 Travel_Frequently       279
#> 3               4       Yes  37     Travel_Rarely      1373
#> 4               5        No  33 Travel_Frequently      1392
#> 5               7        No  27     Travel_Rarely       591
#> 6               8        No  32 Travel_Frequently      1005
#> 7              10        No  59     Travel_Rarely      1324
#> 8              11        No  30     Travel_Rarely      1358
#> 9              12        No  38 Travel_Frequently       216
#> 10             13        No  36     Travel_Rarely      1299
#>                Department DistanceFromHome Education EducationField
#> 1                   Sales                1         2  Life Sciences
#> 2  Research & Development                8         1  Life Sciences
#> 3  Research & Development                2         2          Other
#> 4  Research & Development                3         4  Life Sciences
#> 5  Research & Development                2         1        Medical
#> 6  Research & Development                2         2  Life Sciences
#> 7  Research & Development                3         3        Medical
#> 8  Research & Development               24         1  Life Sciences
#> 9  Research & Development               23         3  Life Sciences
#> 10 Research & Development               27         3        Medical
#>    EmployeeCount EnvironmentSatisfaction Gender HourlyRate JobInvolvement
#> 1              1                       2 Female         94              3
#> 2              1                       3   Male         61              2
#> 3              1                       4   Male         92              2
#> 4              1                       4 Female         56              3
#> 5              1                       1   Male         40              3
#> 6              1                       4   Male         79              3
#> 7              1                       3 Female         81              4
#> 8              1                       4   Male         67              3
#> 9              1                       4   Male         44              2
#> 10             1                       3   Male         94              3
#>    JobLevel                   JobRole JobSatisfaction MaritalStatus
#> 1         2           Sales Executive               4        Single
#> 2         2        Research Scientist               2       Married
#> 3         1     Laboratory Technician               3        Single
#> 4         1        Research Scientist               3       Married
#> 5         1     Laboratory Technician               2       Married
#> 6         1     Laboratory Technician               4        Single
#> 7         1     Laboratory Technician               1       Married
#> 8         1     Laboratory Technician               3      Divorced
#> 9         3    Manufacturing Director               3        Single
#> 10        2 Healthcare Representative               3       Married
#>    MonthlyIncome MonthlyRate NumCompaniesWorked Over18 OverTime
#> 1           5993       19479                  8      Y      Yes
#> 2           5130       24907                  1      Y       No
#> 3           2090        2396                  6      Y      Yes
#> 4           2909       23159                  1      Y      Yes
#> 5           3468       16632                  9      Y       No
#> 6           3068       11864                  0      Y       No
#> 7           2670        9964                  4      Y      Yes
#> 8           2693       13335                  1      Y       No
#> 9           9526        8787                  0      Y       No
#> 10          5237       16577                  6      Y       No
#>    PercentSalaryHike PerformanceRating RelationshipSatisfaction StandardHours
#> 1                 11                 3                        1            80
#> 2                 23                 4                        4            80
#> 3                 15                 3                        2            80
#> 4                 11                 3                        3            80
#> 5                 12                 3                        4            80
#> 6                 13                 3                        3            80
#> 7                 20                 4                        1            80
#> 8                 22                 4                        2            80
#> 9                 21                 4                        2            80
#> 10                13                 3                        2            80
#>    StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
#> 1                 0                 8                     0               1
#> 2                 1                10                     3               3
#> 3                 0                 7                     3               3
#> 4                 0                 8                     3               3
#> 5                 1                 6                     3               3
#> 6                 0                 8                     2               2
#> 7                 3                12                     3               2
#> 8                 1                 1                     2               3
#> 9                 0                10                     2               3
#> 10                2                17                     3               2
#>    YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion
#> 1               6                  4                       0
#> 2              10                  7                       1
#> 3               0                  0                       0
#> 4               8                  7                       3
#> 5               2                  2                       2
#> 6               7                  7                       3
#> 7               1                  0                       0
#> 8               1                  0                       0
#> 9               9                  7                       1
#> 10              7                  7                       7
#>    YearsWithCurrManager
#> 1                     5
#> 2                     7
#> 3                     0
#> 4                     0
#> 5                     2
#> 6                     6
#> 7                     0
#> 8                     0
#> 9                     8
#> 10                    7

selecting a subset

you can achieve this by defining the columns to be selected e.g

SELECT col_1,col_2,col_3 
FROM my_table;

using the database we get

SELECT Attrition,Department,Education 
FROM dat_new;

#>    Attrition             Department Education
#> 1        Yes                  Sales         2
#> 2         No Research & Development         1
#> 3        Yes Research & Development         2
#> 4         No Research & Development         4
#> 5         No Research & Development         1
#> 6         No Research & Development         2
#> 7         No Research & Development         3
#> 8         No Research & Development         1
#> 9         No Research & Development         3
#> 10        No Research & Development         3

select distinct items in SQL

Often your results will include many duplicate values. If you want to select all the unique values from a column, you can use the DISTINCT keyword. for instance if you have the following database table:

#> # A tibble: 5 × 3
#>   Name      age sex   
#>   <chr>   <dbl> <chr> 
#> 1 Bongani    34 male  
#> 2 Eliza      47 female
#> 3 Bongani    34 male  
#> 4 Peter      27 male  
#> 5 Fadzie     19 female

we can see that Bongani has been recorded twice in the dataset

SELECT DISTINCT name,age,sex 
FROM database_table

#>      Name age    sex
#> 1 Bongani  34   male
#> 2   Eliza  47 female
#> 3   Peter  27   male
#> 4  Fadzie  19 female

filtering using WHERE in SQL

WHERE is a filtering clause
In SQL, the WHERE keyword allows you to filter based on both text and numeric values in a table. There are a few different comparison operators you can use:
= equal
<> not equal
< less than
> greater than
<= less than or equal to
>= greater than or equal to
You can build up your WHERE queries by combining multiple conditions with the AND keyword.

SELECT DISTINCT name,age,sex
FROM df
WHERE sex='male';

#>      Name age  sex
#> 1 Bongani  34 male
#> 2   Peter  27 male

COUNT and AS alias in SQL

The COUNT statement lets you count then returning the number of rows in one or more columns.
we use AS to rename default name
COUNT(*) tells you how many rows are in a table

SELECT COUNT(*) AS pple_count
FROM database_table;

#>   pple_count
#> 1          5

the above query calculates the number of rows in the data and finds 5 rows

we can use COUNT(DISTINCT column) to get the number of unique rows

SELECT COUNT(DISTINCT name) AS unique_count
FROM database_table;

#>   unique_count
#> 1            4

the query above will calculate rows with distinct names and finds them to be 4

We can use COUNT with WHERE

We can do this to get the number of terms after filtering
how many people are male?

SELECT COUNT(*) AS num_male
FROM database_table
WHERE sex='male';

#>   num_male
#> 1        3

Manipulating and Aggregating

Create a database for the exercises

con<-dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con,data1)
copy_to(con,data2)

filtering using `WHERE` in SQL

the following points were outlined previously

WHERE is a filtering clause
In SQL, the WHERE keyword allows you to filter based on both text and numeric values in a table. There are a few different comparison operators you can use:
= equal
<> not equal
< less than
> greater than
<= less than or equal to
>= greater than or equal to
You can build up your WHERE queries by combining multiple conditions with the AND keyword.

SELECT * 
FROM data2 
WHERE category='Potato'

1 records
recipe	calories	sugar	protein	category
624	236.62	0.81	9.07	Potato

filtering using `IN` and `WHERE` in SQL

SELECT * 
FROM data2 
WHERE category IN ('Chicken','Chicken Breast')

5 records
recipe	calories	sugar	protein	category
609	NA	NA	NA	Chicken Breast
671	481.88	1.34	51.90	Chicken
683	147.24	0.94	54.00	Chicken
419	1830.28	1.83	44.74	Chicken Breast
117	NA	NA	NA	Chicken Breast

`COUNT` and `AS` alias in SQL

The COUNT statement lets you count then returning the number of rows in one or more columns.
we use AS to rename default name
COUNT(*) tells you how many rows are in a table

SELECT COUNT(*) AS n 
FROM data2

1 records
n
25

count filters in SQL

SELECT COUNT(category) AS n_potato
FROM data2 
WHERE category IN ('Chicken Breast','Chicken')

1 records
n_potato
5

Updating a database using `dbExecute`

categories before

#>        category n percent
#>       Beverages 1    0.04
#>       Breakfast 4    0.16
#>         Chicken 2    0.08
#>  Chicken Breast 3    0.12
#>         Dessert 3    0.12
#>            Meat 4    0.16
#>   One Dish Meal 3    0.12
#>            Pork 1    0.04
#>          Potato 1    0.04
#>       Vegetable 3    0.12

Updating a database

we need to change a category called Breakfast to Breakfast meal
we shall use REPLACE

UPDATE data2 
SET category= REPLACE(category,'Breakfast','Breakfast meal')
WHERE category='Breakfast'

Querying to check results

SELECT category, COUNT(category) as n_per_category
FROM data2
GROUP BY category

Displaying records 1 - 10
category	n_per_category
Beverages	1
Breakfast meal	4
Chicken	2
Chicken Breast	3
Dessert	3
Meat	4
One Dish Meal	3
Pork	1
Potato	1
Vegetable	3

we see that Breakfast has updated to Breakfast meal

Missing data in SQL

in SQL, NULL represents a missing or unknown value. You can check for NULL values using the expression IS NULL
Use IS NULL AND IS NOT NULL

filtering missing data

SELECT * 
FROM data2 
WHERE Protein IS NULL

3 records
recipe	calories	sugar	protein	category
912	NA	NA	NA	Dessert
609	NA	NA	NA	Chicken Breast
117	NA	NA	NA	Chicken Breast

filtering nonmissing data

SELECT * 
FROM data2 
WHERE Protein IS NOT NULL

Displaying records 1 - 10
recipe	calories	sugar	protein	category
624	236.62	0.81	9.07	Potato
102	198.98	0.39	39.12	Vegetable
427	187.41	86.97	4.49	Dessert
324	49.50	4.69	53.33	Breakfast meal
380	75.89	8.18	17.64	Meat
671	481.88	1.34	51.90	Chicken
683	147.24	0.94	54.00	Chicken
208	20.98	9.80	0.82	Beverages
633	105.95	0.07	7.55	Vegetable
859	388.44	4.62	13.93	Pork

How many categories do we have

select `distinct` items in SQL

Often your results will include many duplicate values. If you want to select all the unique values from a column, you can use the DISTINCT keyword.

SELECT COUNT(DISTINCT category) AS unique_categories 
FROM data2

1 records
unique_categories
10

Advanced filters

filtering in SQL

SELECT * 
FROM data2 WHERE sugar > 1 AND sugar < 5 
AND category='Breakfast meal'

3 records
recipe	calories	sugar	protein	category
324	49.5	4.69	53.33	Breakfast meal
598	212.4	2.25	23.78	Breakfast meal
526	93.5	2.57	23.68	Breakfast meal

also achieved by using BETWEEN and AND

SELECT * 
FROM data2 
WHERE sugar BETWEEN 1 AND 5 
AND category='Breakfast meal'

3 records
recipe	calories	sugar	protein	category
324	49.5	4.69	53.33	Breakfast meal
598	212.4	2.25	23.78	Breakfast meal
526	93.5	2.57	23.68	Breakfast meal

filtering text data in SQL

use LIKE
the LIKE operator can be used in a WHERE clause to search for a pattern in a column.
To accomplish this, you use something called a wildcard as a placeholder for some other values. There are two wildcards you can use with LIKE:

The % wildcard will match zero, one, or many characters in text

The _ wildcard will match a single character

SELECT * 
FROM data2 
WHERE category LIKE 'Chicken%'

5 records
recipe	calories	sugar	protein	category
609	NA	NA	NA	Chicken Breast
671	481.88	1.34	51.90	Chicken
683	147.24	0.94	54.00	Chicken
419	1830.28	1.83	44.74	Chicken Breast
117	NA	NA	NA	Chicken Breast

Aggregate functions in SQL

SELECT AVG(sugar) AS avg_sugar,
       MAX(sugar) AS max_sugar,
       MIN(sugar) AS min_sugar,
       VARIANCE(sugar) AS variance,
       STDEV(sugar) AS stnd_deviation
FROM data2;

1 records
avg_sugar	max_sugar	min_sugar	variance	stnd_deviation
11.24864	86.97	0.07	426.5449	20.65296

Grouping and aggregating in SQL

SELECT category,AVG(sugar) AS avg_sugar,
                       MAX(sugar) AS max_sugar,
                       MIN(sugar) AS min_sugar
FROM data2
GROUP BY category;

Displaying records 1 - 10
category	avg_sugar	max_sugar	min_sugar
Beverages	9.8000000	9.80	9.80
Breakfast meal	5.1600000	11.13	2.25
Chicken	1.1400000	1.34	0.94
Chicken Breast	1.8300000	1.83	1.83
Dessert	69.5700000	86.97	52.17
Meat	13.5525000	30.06	5.21
One Dish Meal	3.9066667	6.09	1.46
Pork	4.6200000	4.62	4.62
Potato	0.8100000	0.81	0.81
Vegetable	0.8066667	1.96	0.07

we can order the data by a certain column using ORDER BY clause

SELECT category,AVG(sugar) AS avg_sugar,
                       MAX(sugar) AS max_sugar,
                       MIN(sugar) AS min_sugar
FROM data2
GROUP BY category
ORDER BY avg_sugar;

Displaying records 1 - 10
category	avg_sugar	max_sugar	min_sugar
Vegetable	0.8066667	1.96	0.07
Potato	0.8100000	0.81	0.81
Chicken	1.1400000	1.34	0.94
Chicken Breast	1.8300000	1.83	1.83
One Dish Meal	3.9066667	6.09	1.46
Pork	4.6200000	4.62	4.62
Breakfast meal	5.1600000	11.13	2.25
Beverages	9.8000000	9.80	9.80
Meat	13.5525000	30.06	5.21
Dessert	69.5700000	86.97	52.17

the above output shows results arranged in ascending order of avg_sugar , if we needed to begin with the largest descending we could useORDER BY...DESC

SELECT category,AVG(sugar) AS avg_sugar,
                       MAX(sugar) AS max_sugar,
                       MIN(sugar) AS min_sugar
FROM data2
GROUP BY category
ORDER BY avg_sugar DESC;

Displaying records 1 - 10
category	avg_sugar	max_sugar	min_sugar
Dessert	69.5700000	86.97	52.17
Meat	13.5525000	30.06	5.21
Beverages	9.8000000	9.80	9.80
Breakfast meal	5.1600000	11.13	2.25
Pork	4.6200000	4.62	4.62
One Dish Meal	3.9066667	6.09	1.46
Chicken Breast	1.8300000	1.83	1.83
Chicken	1.1400000	1.34	0.94
Potato	0.8100000	0.81	0.81
Vegetable	0.8066667	1.96	0.07

Using `UPPER()`

Sometimes you may want to change a column to all upper cases

SELECT recipe,category , UPPER(category) AS upper_cat
FROM data2;

Displaying records 1 - 10
recipe	category	upper_cat
624	Potato	POTATO
102	Vegetable	VEGETABLE
427	Dessert	DESSERT
324	Breakfast meal	BREAKFAST MEAL
912	Dessert	DESSERT
380	Meat	MEAT
609	Chicken Breast	CHICKEN BREAST
671	Chicken	CHICKEN
683	Chicken	CHICKEN
208	Beverages	BEVERAGES

Basic arithmetic

In addition to using aggregate functions, you can perform basic arithmetic with symbols like +, -, *, and /.
lets create weird variables here

SELECT category,(sugar-protein) AS diff_sugar_protein 
FROM data2;

Displaying records 1 - 10
category	diff_sugar_protein
Potato	-8.26
Vegetable	-38.73
Dessert	82.48
Breakfast meal	-48.64
Dessert	NA
Meat	-9.46
Chicken Breast	NA
Chicken	-50.56
Chicken	-53.06
Beverages	8.98

`CASE` statements

this contains a WHEN,THEN and ELSE statement ,finished with END

Take the stair `CASE`

using CASE WHEN to categorise or group data as follows
SELECT….
CASE WHEN condition THEN category
WHEN condition THEN category
…………………………………..
ELSE some_category
END AS variable name

SELECT recipe , sugar,
   CASE WHEN sugar BETWEEN 0 AND 50 THEN '0-50'
        WHEN sugar BETWEEN 51 AND 100 THEN '51-100'
        WHEN sugar BETWEEN 101 AND 200 THEN '101-200'
        WHEN sugar IS NULL THEN 'missing'
        ELSE '>200'
        END AS sugar_category
FROM data2;

Displaying records 1 - 10
recipe	sugar	sugar_category
624	0.81	0-50
102	0.39	0-50
427	86.97	51-100
324	4.69	0-50
912	NA	missing
380	8.18	0-50
609	NA	missing
671	1.34	0-50
683	0.94	0-50
208	9.80	0-50

we can letter perform aggregations on the new column for instance get the count per each category

SELECT COUNT(*) AS count_per_category,
   CASE WHEN sugar BETWEEN 0 AND 50 THEN '0-50'
        WHEN sugar BETWEEN 51 AND 100 THEN '51-100'
        WHEN sugar BETWEEN 101 AND 200 THEN '101-200'
        WHEN sugar IS NULL THEN 'missing'
        ELSE '>200'
        END AS sugar_category
FROM data2
GROUP BY sugar_category;

3 records
count_per_category	sugar_category
20	0-50
2	51-100
3	missing

SQL subqueries

A subquery is a query nested in another query and is useful for intermediate transformations

SELECT column
FROM (SELECT column 
      FROM table) AS subquery;

subqueries can be found anywhere within SELECT,FROM,WHERE,GROUP BY

use cases

comparing groups to summarised values
Reshaping Data
Combining data that cannot be joined

for instance

use data1 and data2 databases for this task
the following is just an example and calories in data2 and data1 are the same so the result is the same

SELECT recipe,calories
FROM data2 
WHERE calories > 
      (SELECT AVG(calories) FROM data1);

9 records
recipe	calories
671	481.88
859	388.44
171	431.28
282	409.99
91	388.37
868	307.36
885	504.09
419	1830.28
639	321.95

Summarize group statistics using subqueries

Sometimes you want to understand how a value varies across groups. For example, how does the maximum value per group vary across groups?

To find out, first summarize by group, and then compute summary statistics of the group results. One way to do this is to compute group values in a subquery, and then summarize the results of the subquery.

what is the standard deviation across meal cateries in the maximum amount of sugar? What about the mean, min, and max of the maximums as well?

Start by writing a subquery to compute the max() per categoy; alias the subquery result as sugar_max. Then compute the standard deviation of sugar_max with stddev(). Compute the min(), max(), and avg() of sugar_max too.

SELECT AVG(sugar_max) AS avg_sugar,
       MAX(sugar_max) AS max_sugar,
       MIN(sugar_max) AS min_sugar,
       VARIANCE(sugar_max) AS variance,
       STDEV(sugar_max) AS stnd_deviation
FROM (SELECT MAX(sugar) AS sugar_max
      FROM data2
         -- Compute max by...
      GROUP BY category) AS max_results

1 records
avg_sugar	max_sugar	min_sugar	variance	stnd_deviation
15.461	86.97	0.81	707.2168	26.59355

SQL JOINS

setting up the data

to aid with illustrations , I have created some fake datasets to assist in explaining and these are join_df1.csv and join_df2.csv datasets
read in CSV files and store them as databases in R

# read in example csv files (join_df1 and join_df2)
join_df1 <- read.csv("join_df1.csv") |> 
  dplyr::rename(A=bongi)
join_df2 <- read.csv("join_df2.csv")|> 
  dplyr::rename(A=bongi)

join<-dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(join,join_df1)
copy_to(join,join_df2)

join_df1

SELECT * FROM join_df1

8 records
A	B	C
red	2	3
orange	4	6
yellow	8	9
green	0	0
indigo	3	3
blue	1	1
purple	5	5
white	8	2

this data has 8 rows and 3 columns

join_df2

SELECT * FROM join_df2

7 records
A	D
red	3
orange	5
yellow	7
green	1
indigo	3
blue	6
pink	9

this has 7 rows and 2 columns

note that we will use the column A as our unique identifier

Querying A join

generally the following step show how a join is done in SQL

SELECT t1.comon_column1,t1.common_column2,…
FROM table1 AS t1
<type> JOIN table2 AS t2
ON t1.common_unique_column = t2.common_unique_column;

Inner Join

When talking about inner joins, we are only going to keep an observation if it is found in all of the tables we’re combining. Here, we’re combining the tables based on the ArtistId column. In our dummy example, there are only two artists that are found in both tables. These are highlighted in green and will be the rows used to join the two tables. Then, once the inner join happens, only these artists’ data will be included after the inner join.

when doing an inner join, data from any observation found in all the tables being joined are included in the output. Here, ArtistIds “1” and “2” are in both the artists and albums tables. Thus, those will be the only ArtistIds in the output from the inner join.

And, since it’s a mutating join, our new table will have information from both tables! We now have ArtistId, Name, AlbumId, and Title in a single table! We’ve joined the two tables, based on the column ArtistId!

See this in `ACTION`

SELECT * 
FROM join_df1
INNER JOIN join_df2
USING(A)

6 records
A	B	C	D
red	2	3	3
orange	4	6	5
yellow	8	9	7
green	0	0	1
indigo	3	3	3
blue	1	1	6

we note from the above that only the colours that are common to both tables are returned. however instead of using USING we can use ON table1.columnname=table2.columnname where columnname is the column that tables to be matched by and in our example this is column A . when using this method it is important to give each table an ALIAS which is a shorthand for our tables

SELECT * 
FROM join_df1 AS j1
INNER JOIN join_df2 AS j2
ON j1.A=j2.A;

6 records
A	B	C	A	D
red	2	3	red	3
orange	4	6	orange	5
yellow	8	9	yellow	7
green	0	0	green	1
indigo	3	3	indigo	3
blue	1	1	blue	6

looking at the table above ,we note that column A has been repeated twice , thus we need to only SELECT the column A From only one of our tables hence we specify the repeated columns in the SELECT statement using the table.columnname method . So in the example above SELECT j1.A,B,C,D specifies that we want we want column A to come from table j1 (j1.A)

SELECT j1.A,B,C,D 
FROM join_df1 AS j1
INNER JOIN join_df2 AS j2
ON j1.A=j2.A;

6 records
A	B	C	D
red	2	3	3
orange	4	6	5
yellow	8	9	7
green	0	0	1
indigo	3	3	3
blue	1	1	6

great :)! , generally we can apply the same scenario to all kinds of joins

left join in SQL

For a left join, all rows in the first table specified will be included in the output. Any row in the second table that is not in the first table will not be included.

In our toy example this means that ArtistIDs 1, 2, and 3 will be included in the output; however, ArtistID 4 will not.

Thus, our output will again include all the columns from both tables combined into a single table; however, for ArtistId 3, there will be NAs for AlbumId and Title. NAs will be filled in for any observations in the first table specified that are missing in the second table.

Let us see it in ACTION

SELECT j1.A,B,C,D 
FROM join_df1 AS j1
LEFT JOIN join_df2 AS j2
ON j1.A=j2.A;

8 records
A	B	C	D
red	2	3	3
orange	4	6	5
yellow	8	9	7
green	0	0	1
indigo	3	3	3
blue	1	1	6
purple	5	5	NA
white	8	2	NA

great , we notice the NA values found in the last column

RIGHT JOIN IN SQL

Right Join is similar to what we just discussed; however, in the output from a right join, all rows in the final table specified are included in the output. NAs will be included for any observations found in the last specified table but not in the other tables.

In our toy example, that means, information about ArtistIDs 1, 2, and 4 will be included.

Again, in our toy example, we see that right join combines the information across tables; however, in this case, ArtistId 4 is included, but Name is an NA, as this information was not in the artists table for this artist.

Let us see a RIGHT JOIN in action

SELECT j1.A,B,C,D 
FROM join_df1 AS j1
RIGHT JOIN join_df2 AS j2
ON j1.A=j2.A;

7 records
A	B	C	D
red	2	3	3
orange	4	6	5
yellow	8	9	7
green	0	0	1
indigo	3	3	3
blue	1	1	6
NA	NA	NA	9

great , notice now where the NA is displayed

full join in SQL

Finally, a full join will take every observation from every table and include it in the output.

Thus, in our toy example, this join produces five rows, including all the observations from either table. NAs are filled in when data are missing for an observation.

let us see a FULL JOIN in ACTION

SELECT j1.A,B,C,D 
FROM join_df1 AS j1
FULL JOIN join_df2 AS j2
ON j1.A=j2.A;

9 records
A	B	C	D
red	2	3	3
orange	4	6	5
yellow	8	9	7
green	0	0	1
indigo	3	3	3
blue	1	1	6
purple	5	5	NA
white	8	2	NA
NA	NA	NA	9

Window functions

A window function performs an aggregate-like operation on a set of query rows.

However, whereas an aggregate operation groups query rows into a single result row, a window function produces a result for each query row:

Anatomy of a window function

FUNCTION_NAME() OVER()
ORDER BY
PARTITION BY
ROWS/RANGE PRECEDING/FOLLOWING/UNBOUNDED

Creating a fake dataset in R

library(tidyverse)

sales<-tribble(~ year , ~country , ~product    , ~profit ,~Own ,~Time,
                 2000 , "Finland", "Computer"  ,   1500  ,"Y"  ,"D",
                 2000 , "Finland" ,"Phone"      ,   100   ,"Y"  ,"D",
                 2001 , "Finland" ,"Phone"      ,     10  ,"N"  ,"D",
                 2000 , "India"   ,"Calculator" ,     75  ,"Y"  ,"E",
                 2000 , "India"   ,"Calculator" ,     75  ,"N"  ,"E",
                 2000 , "India"   ,"Computer"   ,   1200  ,"N"  ,"E",
                 2000 , "USA"     , "Calculator",     75  ,"Y"  ,"E",
                 2000 , "USA"     , "Computer"  ,   1500  ,"N"  ,"E",
                 2001 , "USA"     ,"Calculator" ,     50  ,"Y"  ,"E",
                 2001 , "USA"     ,"Computer"   ,   1500  ,"Y"  ,"E",
                 2001 , "USA"     ,"Computer"   ,   1200  ,"Y"  ,"D",
                 2001 , "USA"     , "TV"        ,    150  ,"Y"  ,"D",
                 2001 , "USA"     , "TV"        ,    100  ,"Y"  ,"D")
sales
#> # A tibble: 13 × 6
#>     year country product    profit Own   Time 
#>    <dbl> <chr>   <chr>       <dbl> <chr> <chr>
#>  1  2000 Finland Computer     1500 Y     D    
#>  2  2000 Finland Phone         100 Y     D    
#>  3  2001 Finland Phone          10 N     D    
#>  4  2000 India   Calculator     75 Y     E    
#>  5  2000 India   Calculator     75 N     E    
#>  6  2000 India   Computer     1200 N     E    
#>  7  2000 USA     Calculator     75 Y     E    
#>  8  2000 USA     Computer     1500 N     E    
#>  9  2001 USA     Calculator     50 Y     E    
#> 10  2001 USA     Computer     1500 Y     E    
#> 11  2001 USA     Computer     1200 Y     D    
#> 12  2001 USA     TV            150 Y     D    
#> 13  2001 USA     TV            100 Y     D

Load in the necessary packages

library(odbc)
library(DBI)
library(RSQLite)

Set up a database using the datafile

wind<-dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(wind,sales)

Lets get started

The row for which function evaluation occurs is called the current row.
The query rows related to the current row over which function evaluation occurs comprise the window for the current row.

`ROW_NUMBER() OVER()`

use it if maybe you have duplicates or intend to create rowids for your data

SELECT ROW_NUMBER() OVER () AS row_id , year, country, product, profit
FROM sales;

Displaying records 1 - 10
row_id	year	country	product	profit
1	2000	Finland	Computer	1500
2	2000	Finland	Phone	100
3	2001	Finland	Phone	10
4	2000	India	Calculator	75
5	2000	India	Calculator	75
6	2000	India	Computer	1200
7	2000	USA	Calculator	75
8	2000	USA	Computer	1500
9	2001	USA	Calculator	50
10	2001	USA	Computer	1500

PARTITION BY

This splits the table into partitions based on a column’s unique values and results aren’t rolled into one column

therefore ROW_NUMBER() With PARTITION BY produces the row number of each row within its partition. In this case, rows are numbered per country. By default, partition rows are unordered and row numbering is nondeterministic.

SELECT ROW_NUMBER() OVER(PARTITION BY country) AS row_num1,
       year, country, product, profit
FROM sales;

Displaying records 1 - 10
row_num1	year	country	product	profit
1	2000	Finland	Computer	1500
2	2000	Finland	Phone	100
3	2001	Finland	Phone	10
1	2000	India	Calculator	75
2	2000	India	Calculator	75
3	2000	India	Computer	1200
1	2000	USA	Calculator	75
2	2000	USA	Computer	1500
3	2001	USA	Calculator	50
4	2001	USA	Computer	1500

its also possible to use ROW_NUMBER() With ORDER BY to determine ranks

SELECT ROW_NUMBER() OVER(ORDER BY profit DESC) AS row_num1,
       year, country, product, profit
FROM sales;

Displaying records 1 - 10
row_num1	year	country	product	profit
1	2000	Finland	Computer	1500
2	2000	USA	Computer	1500
3	2001	USA	Computer	1500
4	2000	India	Computer	1200
5	2001	USA	Computer	1200
6	2001	USA	TV	150
7	2000	Finland	Phone	100
8	2001	USA	TV	100
9	2000	India	Calculator	75
10	2000	India	Calculator	75

in the above example , we see that we have ranked our data according to profits implying that each rowid also signifies the rank of profit

RANK()

ROW_NUMBER assigns different numbers to profits with the same amount, so it’s not a useful ranking function; if profits are the same , they should have the same rank.

in such a case we use RANK()

SELECT RANK() OVER(ORDER BY profit DESC) AS RANK, 
     profit ,year, country, product
FROM sales;

Displaying records 1 - 10
RANK	profit	year	country	product
1	1500	2000	Finland	Computer
1	1500	2000	USA	Computer
1	1500	2001	USA	Computer
4	1200	2000	India	Computer
4	1200	2001	USA	Computer
6	150	2001	USA	TV
7	100	2000	Finland	Phone
7	100	2001	USA	TV
9	75	2000	India	Calculator
9	75	2000	India	Calculator

`DENSE_RANK()`

we can use DENSE_RANK() if we need ranks to be ordered or require a further partition

SELECT DENSE_RANK() OVER(PARTITION BY country ORDER BY profit DESC) AS RANK, 
     profit ,year, country, product
FROM sales;

Displaying records 1 - 10
RANK	profit	year	country	product
1	1500	2000	Finland	Computer
2	100	2000	Finland	Phone
3	10	2001	Finland	Phone
1	1200	2000	India	Computer
2	75	2000	India	Calculator
2	75	2000	India	Calculator
1	1500	2000	USA	Computer
1	1500	2001	USA	Computer
2	1200	2001	USA	Computer
3	150	2001	USA	TV

now we have ranked our profits by country starting with the highest profit in each country

Aggregations

window functions are similar to GROUP BY aggregate function but all rows stay in the output

using the sales information table, these two queries perform aggregate operations that produce a single global sum for all rows taken as a group, and sums grouped per country:

SELECT SUM(profit) AS total_profit
FROM sales;

1 records
total_profit
7535

lets us see a GROUP BY in action

SELECT country, SUM(profit) AS country_profit
FROM sales
GROUP BY country;

3 records
country	country_profit
Finland	1610
India	1350
USA	4575

On the other hand , window operations do not collapse groups of query rows to a single output row. Instead, they produce a result for each row. Like the preceding queries, the following query uses SUM(), but this time as a window function:

SELECT
year, country, product, profit,
         SUM(profit) OVER() AS total_profit,
         SUM(profit) OVER(PARTITION BY country) AS country_profit
FROM sales
ORDER BY country, year, product, profit;

Displaying records 1 - 10
year	country	product	profit	total_profit	country_profit
2000	Finland	Computer	1500	7535	1610
2000	Finland	Phone	100	7535	1610
2001	Finland	Phone	10	7535	1610
2000	India	Calculator	75	7535	1350
2000	India	Calculator	75	7535	1350
2000	India	Computer	1200	7535	1350
2000	USA	Calculator	75	7535	4575
2000	USA	Computer	1500	7535	4575
2001	USA	Calculator	50	7535	4575
2001	USA	Computer	1200	7535	4575

we have included OVER clause that specifies how to partition query rows into groups for processing by the window function:

The first OVER clause is empty, which treats the entire set of query rows as a single partition. The window function thus produces a global sum, but does so for each row.
The second OVER clause partitions rows by country, producing a sum per partition (per country). The function produces this sum for each partition row.

Window functions are permitted only in the select list and ORDER BY clause. Query result rows are determined from the FROM clause, after WHERE, GROUP BY, and HAVING processing, and windowing execution occurs before ORDER BY, LIMIT, and SELECT DISTINCT.

WITH CLAUSE

This is used to create a temporary relation such that the output of this temporary relation is available and is used by the query that is associated with the WITH clause. for instance the following table can be created as a temporary table

FRAMES

Frames allow you to “peek” forwards or backward without first using the relative fetching functions, LAG and LEAD, to fetch previous rows’ values into the current row.

ROWS BETWEEN

ROWS BETWEEN [START] AND [FINISH]
n PRECEDING : n rows before current row
CURRENT ROW : the current row
n FOLLOWING : n rows after the current row

the following query calculates the number of sales per year and per country(i.e grouped by year and country)

SELECT
    year, country, COUNT(*) AS num_sales
  FROM sales
  GROUP BY year, country;

5 records
year	country	num_sales
2000	Finland	2
2000	India	3
2000	USA	2
2001	Finland	1
2001	USA	5

like soo…

`WITH` and a `FRAME`

We create a temporary table called country_sales

WITH Country_sales AS (
  SELECT
    year, country, COUNT(*) AS num_sales
  FROM sales
  GROUP BY year, country)

SELECT
  year, country, num_sales,
  -- Calculate each country's 3s-ales moving total
  SUM(num_sales) OVER
    (PARTITION BY country
     ORDER BY year ASC
     ROWS BETWEEN
     2 PRECEDING AND CURRENT ROW) AS sales_MA
FROM Country_sales
ORDER BY country ASC, year ASC;

5 records
year	country	num_sales	sales_MA
2000	Finland	2	2
2001	Finland	1	3
2000	India	3	3
2000	USA	2	2
2001	USA	5	7