R’s Essential Data Wrangling Verbs

Bongani Ncube
Statistician/Data Scientist
May 19, 2024

Housekeeping

Intro 👋
Workshop materials ⬇️

Today’s plan

What is R? How can it ease the burden of repeated reporting?
Basic functions for manipulating data

What is R?

1 2 3 4 5 6

What is R?

1 2 3 4 5 6

R is an open-source (free!) scripting language for working with data

The benefits of R

1 2 3 4 5 6

The magic of R is that it’s reproducible (by someone else or by yourself in six months)

Keeps data separate from code (data preparation steps)

Getting R

1 2 3 4 5 6

You need the R language

And also the software

Navigating RStudio

1 2 3 4 5 6

project files are here

imported data shows up here

code can go here

Navigating RStudio

1 2 3 4 5 6

project files are here

imported data shows up here

code can also
go here

Using R

1 2 3 4 5 6

You use R via packages

…which contain functions

…which are just verbs

Today’s data

1 2 3 4 5 6

faculty

year	id	rank	dept1	dept2
2021-22	1005	Lecturer	Chemistry
2021-22	1022	Professor	Physics	Engineering
2021-22	1059	Professor	Physics
2021-22	1079	Lecturer	Music
2021-22	1086	Assistant Professor	Music
2021-22	1095	Adjunct Instructor	Sociology

Today’s data

1 2 3 4 5 6

courses

semester	course_id	faculty_id	dept	enrollment	level
20212202	10605	1772	Physics	7	UG
20212202	10605	1772	Physics	32	GR
20212202	11426	1820	Political Science	8	UG
20212202	12048	1914	English	24	UG
20212202	13269	1095	Sociology	48	UG
20212202	13517	1086	Music	17	UG

Basic data manipulation

1 2 3 4 5 6

Useful operators

1 2 3 4 5 6

<-

“save as”

opt + -

%>%

“and then”

Cmd + shift + m

Common functions

1 2 3 4 5 6

filter keeps or discards rows (aka observations)

select keeps or discards columns (aka variables)

arrange sorts data set by certain variable(s)

count tallies data set by certain variable(s)

mutate creates new variables

group_by/summarize aggregates data (pivot tables!)

str_* functions work easily with text

Syntax of a function

1 2 3 4 5 6

function(data, argument(s))

is the same as

data %>%

function(argument(s))

Filter

1 2 3 4 5 6

filter keeps or discards rows (aka observations)

the == operator tests for equality

faculty %>% 
  filter(dept1 == "Sociology")

year	id	rank	dept1	dept2
2021-22	1095	Adjunct Instructor	Sociology
2021-22	1118	Assistant Professor	Sociology
2021-22	1161	Assistant Professor	Sociology
2021-22	1191	Professor	Sociology
2021-22	1216	Associate Professor	Sociology	American Studies
2021-22	1273	Assistant Professor	Sociology

Filter

1 2 3 4 5 6

the | operator signifies “or”

faculty %>% 
  filter(dept1 == "Sociology" | 
           dept1 == "Physics")

year	id	rank	dept1	dept2
2021-22	1022	Professor	Physics	Engineering
2021-22	1059	Professor	Physics
2021-22	1095	Adjunct Instructor	Sociology
2021-22	1118	Assistant Professor	Sociology
2021-22	1161	Assistant Professor	Sociology
2021-22	1191	Professor	Sociology

Filter

1 2 3 4 5 6

the %in% operator allows for multiple options in a list

faculty %>% 
  filter(dept1 %in% c("Sociology",
                      "Physics",
                      "Music"))

year	id	rank	dept1	dept2
2021-22	1022	Professor	Physics	Engineering
2021-22	1059	Professor	Physics
2021-22	1079	Lecturer	Music
2021-22	1086	Assistant Professor	Music
2021-22	1095	Adjunct Instructor	Sociology
2021-22	1118	Assistant Professor	Sociology

Filter

1 2 3 4 5 6

the & operator combines conditions

faculty %>% 
  filter(dept1 %in% c("Sociology",
                      "Physics",
                      "Music") &
         rank == "Professor")

year	id	rank	dept1	dept2
2021-22	1022	Professor	Physics	Engineering
2021-22	1059	Professor	Physics
2021-22	1191	Professor	Sociology
2021-22	1201	Professor	Physics
2021-22	1209	Professor	Music
2021-22	1421	Professor	Physics	Engineering

Select

1 2 3 4 5 6

select keeps or discards columns (aka variables)

faculty %>% 
  select(id, dept1, rank)

id	dept1	rank
1005	Chemistry	Lecturer
1022	Physics	Professor
1059	Physics	Professor
1079	Music	Lecturer
1086	Music	Assistant Professor
1095	Sociology	Adjunct Instructor

Select

1 2 3 4 5 6

can drop columns with -column

faculty %>% 
  select(-dept2)

year	id	rank	dept1
2021-22	1005	Lecturer	Chemistry
2021-22	1022	Professor	Physics
2021-22	1059	Professor	Physics
2021-22	1079	Lecturer	Music
2021-22	1086	Assistant Professor	Music
2021-22	1095	Adjunct Instructor	Sociology

Select

1 2 3 4 5 6

the pipe %>% chains multiple functions together

faculty %>% 
  select(id, dept1, rank) %>% 
  filter(rank == "Professor")

id	dept1	rank
1022	Physics	Professor
1059	Physics	Professor
1191	Sociology	Professor
1201	Physics	Professor
1209	Music	Professor
1407	English	Professor

Arrange

1 2 3 4 5 6

arrange sorts data set by certain variable(s)

use desc() to get descending order

courses %>% 
  arrange(desc(enrollment))

semester	course_id	faculty_id	dept	enrollment	level
20212201	10511	1005	Chemistry	50	UG
20212201	15934	1421	Physics	50	UG
20192002	13850	1105	Chemistry	50	UG
20181901	17773	1942	Music	50	UG
20212202	13269	1095	Sociology	48	UG
20202101	16202	1816	Political Science	48	UG

Arrange

1 2 3 4 5 6

can sort by multiple variables

courses %>% 
  arrange(dept, desc(enrollment))

semester	course_id	faculty_id	dept	enrollment	level
20212201	10511	1005	Chemistry	50	UG
20192002	13850	1105	Chemistry	50	UG
20202102	13850	1258	Chemistry	39	UG
20202102	16606	1393	Chemistry	38	UG
20202101	16540	1784	Chemistry	38	UG
20181901	10511	1829	Chemistry	36	UG

Count

1 2 3 4 5 6

count tallies data set by certain variable(s) (very useful for familiarizing yourself with data)

courses %>% 
  count(dept)

dept	n
Chemistry	16
English	18
Music	17
Physics	19
Political Science	17
Sociology	17

Count

1 2 3 4 5 6

can use sort = TRUE to order results

courses %>% 
  count(dept, level, sort = TRUE)

dept	level	n
Chemistry	UG	16
English	UG	16
Music	UG	16
Physics	UG	16
Political Science	UG	16
Sociology	UG	16
Physics	GR	3
English	GR	2
Music	GR	1
Political Science	GR	1
Sociology	GR	1

Mutate

1 2 3 4 5 6

mutate creates new variables (with a single =)

faculty %>% 
  mutate(new = "hello!")

year	id	rank	dept1	dept2	new
2021-22	1005	Lecturer	Chemistry		hello!
2021-22	1022	Professor	Physics	Engineering	hello!
2021-22	1059	Professor	Physics		hello!
2021-22	1079	Lecturer	Music		hello!
2021-22	1086	Assistant Professor	Music		hello!
2021-22	1095	Adjunct Instructor	Sociology		hello!

Mutate

1 2 3 4 5 6

much more useful with a conditional such as ifelse(), which has three arguments:

condition, value if true, value if false

faculty %>% 
  mutate(prof = ifelse(rank == "Professor",
                       1, 0)) %>% 
  select(rank, prof)

rank	prof
Lecturer	0
Professor	1
Professor	1
Lecturer	0
Assistant Professor	0
Adjunct Instructor	0

Mutate

1 2 3 4 5 6

the ! operator means not

is.na() identifies null values

faculty %>% 
  mutate(joint = ifelse(!is.na(dept2),
                        "joint", NA)) %>% 
  select(dept1, dept2, joint)

dept1	dept2	joint
Chemistry
Physics	Engineering	joint
Physics
Music
Music
Sociology

Mutate

1 2 3 4 5 6

with multiple conditions, case_when() is much easier!

faculty %>% 
  mutate(division = case_when(dept1 %in% c("Sociology","Political Science") ~
                                "Social Sciences",
                              dept1 %in% c("Music","English") ~
                                "Humanities",
                              dept1 %in% c("Chemistry","Physics") ~
                                "Sciences")) %>% 
  select(dept1, division)

dept1	division
Chemistry	Sciences
Physics	Sciences
Physics	Sciences
Music	Humanities
Music	Humanities
Sociology	Social Sciences

Group by / summarize

1 2 3 4 5 6

group_by/summarize aggregates data (pivot tables!)

group_by() identifies the grouping variable(s) and summarize() specifies the aggregation

courses %>% 
  group_by(dept, semester) %>% 
  summarize(enr = sum(enrollment))

dept	semester	enr
Chemistry	20181901	59
Chemistry	20181902	44
Chemistry	20192001	47
Chemistry	20192002	68
Chemistry	20202101	69
Chemistry	20202102	77

Group by / summarize

1 2 3 4 5 6

useful arguments within summarize:

mean, median, sd, min, max, n

courses %>% 
  group_by(dept, semester) %>% 
  summarize(enr = sum(enrollment),
            count = n_distinct(course_id))

dept	semester	enr	courses
Chemistry	20181901	59	2
Chemistry	20181902	44	2
Chemistry	20192001	47	2
Chemistry	20192002	68	2
Chemistry	20202101	69	2
Chemistry	20202102	77	2