Common functions
filter
keeps or discards rows (aka observations)
select
keeps or discards columns (aka variables)
arrange
sorts data set by certain variable(s)
count
tallies data set by certain variable(s)
mutate
creates new variables
group_by
/summarize
aggregates data (pivot tables!)
str_*
functions work easily with text
Syntax of a function
function(data, argument(s))
is the same as
data %>%
function(argument(s))
Filter
filter
keeps or discards rows (aka observations)
the ==
operator tests for equality
faculty %>%
filter(dept1 == "Sociology")
2021-22 |
1095 |
Adjunct Instructor |
Sociology |
|
2021-22 |
1118 |
Assistant Professor |
Sociology |
|
2021-22 |
1161 |
Assistant Professor |
Sociology |
|
2021-22 |
1191 |
Professor |
Sociology |
|
2021-22 |
1216 |
Associate Professor |
Sociology |
American Studies |
2021-22 |
1273 |
Assistant Professor |
Sociology |
|
Filter
the |
operator signifies “or”
faculty %>%
filter(dept1 == "Sociology" |
dept1 == "Physics")
2021-22 |
1022 |
Professor |
Physics |
Engineering |
2021-22 |
1059 |
Professor |
Physics |
|
2021-22 |
1095 |
Adjunct Instructor |
Sociology |
|
2021-22 |
1118 |
Assistant Professor |
Sociology |
|
2021-22 |
1161 |
Assistant Professor |
Sociology |
|
2021-22 |
1191 |
Professor |
Sociology |
|
Filter
the %in%
operator allows for multiple options in a list
faculty %>%
filter(dept1 %in% c("Sociology",
"Physics",
"Music"))
2021-22 |
1022 |
Professor |
Physics |
Engineering |
2021-22 |
1059 |
Professor |
Physics |
|
2021-22 |
1079 |
Lecturer |
Music |
|
2021-22 |
1086 |
Assistant Professor |
Music |
|
2021-22 |
1095 |
Adjunct Instructor |
Sociology |
|
2021-22 |
1118 |
Assistant Professor |
Sociology |
|
Filter
the &
operator combines conditions
faculty %>%
filter(dept1 %in% c("Sociology",
"Physics",
"Music") &
rank == "Professor")
2021-22 |
1022 |
Professor |
Physics |
Engineering |
2021-22 |
1059 |
Professor |
Physics |
|
2021-22 |
1191 |
Professor |
Sociology |
|
2021-22 |
1201 |
Professor |
Physics |
|
2021-22 |
1209 |
Professor |
Music |
|
2021-22 |
1421 |
Professor |
Physics |
Engineering |
Select
select
keeps or discards columns (aka variables)
faculty %>%
select(id, dept1, rank)
1005 |
Chemistry |
Lecturer |
1022 |
Physics |
Professor |
1059 |
Physics |
Professor |
1079 |
Music |
Lecturer |
1086 |
Music |
Assistant Professor |
1095 |
Sociology |
Adjunct Instructor |
Select
can drop columns with -column
faculty %>%
select(-dept2)
2021-22 |
1005 |
Lecturer |
Chemistry |
2021-22 |
1022 |
Professor |
Physics |
2021-22 |
1059 |
Professor |
Physics |
2021-22 |
1079 |
Lecturer |
Music |
2021-22 |
1086 |
Assistant Professor |
Music |
2021-22 |
1095 |
Adjunct Instructor |
Sociology |
Select
the pipe %>%
chains multiple functions together
faculty %>%
select(id, dept1, rank) %>%
filter(rank == "Professor")
1022 |
Physics |
Professor |
1059 |
Physics |
Professor |
1191 |
Sociology |
Professor |
1201 |
Physics |
Professor |
1209 |
Music |
Professor |
1407 |
English |
Professor |
Arrange
arrange
sorts data set by certain variable(s)
use desc()
to get descending order
courses %>%
arrange(desc(enrollment))
20212201 |
10511 |
1005 |
Chemistry |
50 |
UG |
20212201 |
15934 |
1421 |
Physics |
50 |
UG |
20192002 |
13850 |
1105 |
Chemistry |
50 |
UG |
20181901 |
17773 |
1942 |
Music |
50 |
UG |
20212202 |
13269 |
1095 |
Sociology |
48 |
UG |
20202101 |
16202 |
1816 |
Political Science |
48 |
UG |
Arrange
can sort by multiple variables
courses %>%
arrange(dept, desc(enrollment))
20212201 |
10511 |
1005 |
Chemistry |
50 |
UG |
20192002 |
13850 |
1105 |
Chemistry |
50 |
UG |
20202102 |
13850 |
1258 |
Chemistry |
39 |
UG |
20202102 |
16606 |
1393 |
Chemistry |
38 |
UG |
20202101 |
16540 |
1784 |
Chemistry |
38 |
UG |
20181901 |
10511 |
1829 |
Chemistry |
36 |
UG |
Count
count
tallies data set by certain variable(s) (very useful for familiarizing yourself with data)
Chemistry |
16 |
English |
18 |
Music |
17 |
Physics |
19 |
Political Science |
17 |
Sociology |
17 |
Count
can use sort = TRUE
to order results
courses %>%
count(dept, level, sort = TRUE)
Chemistry |
UG |
16 |
English |
UG |
16 |
Music |
UG |
16 |
Physics |
UG |
16 |
Political Science |
UG |
16 |
Sociology |
UG |
16 |
Physics |
GR |
3 |
English |
GR |
2 |
Music |
GR |
1 |
Political Science |
GR |
1 |
Sociology |
GR |
1 |
Mutate
mutate
creates new variables (with a single =
)
faculty %>%
mutate(new = "hello!")
2021-22 |
1005 |
Lecturer |
Chemistry |
|
hello! |
2021-22 |
1022 |
Professor |
Physics |
Engineering |
hello! |
2021-22 |
1059 |
Professor |
Physics |
|
hello! |
2021-22 |
1079 |
Lecturer |
Music |
|
hello! |
2021-22 |
1086 |
Assistant Professor |
Music |
|
hello! |
2021-22 |
1095 |
Adjunct Instructor |
Sociology |
|
hello! |
Mutate
much more useful with a conditional such as ifelse()
, which has three arguments:
condition, value if true, value if false
faculty %>%
mutate(prof = ifelse(rank == "Professor",
1, 0)) %>%
select(rank, prof)
Lecturer |
0 |
Professor |
1 |
Professor |
1 |
Lecturer |
0 |
Assistant Professor |
0 |
Adjunct Instructor |
0 |
Mutate
the !
operator means not
is.na()
identifies null values
faculty %>%
mutate(joint = ifelse(!is.na(dept2),
"joint", NA)) %>%
select(dept1, dept2, joint)
Chemistry |
|
|
Physics |
Engineering |
joint |
Physics |
|
|
Music |
|
|
Music |
|
|
Sociology |
|
|
Mutate
with multiple conditions, case_when()
is much easier!
faculty %>%
mutate(division = case_when(dept1 %in% c("Sociology","Political Science") ~
"Social Sciences",
dept1 %in% c("Music","English") ~
"Humanities",
dept1 %in% c("Chemistry","Physics") ~
"Sciences")) %>%
select(dept1, division)
Chemistry |
Sciences |
Physics |
Sciences |
Physics |
Sciences |
Music |
Humanities |
Music |
Humanities |
Sociology |
Social Sciences |
Group by / summarize
group_by
/summarize
aggregates data (pivot tables!)
group_by()
identifies the grouping variable(s) and summarize()
specifies the aggregation
courses %>%
group_by(dept, semester) %>%
summarize(enr = sum(enrollment))
Chemistry |
20181901 |
59 |
Chemistry |
20181902 |
44 |
Chemistry |
20192001 |
47 |
Chemistry |
20192002 |
68 |
Chemistry |
20202101 |
69 |
Chemistry |
20202102 |
77 |
Group by / summarize
useful arguments within summarize
:
mean
, median
, sd
, min
, max
, n
courses %>%
group_by(dept, semester) %>%
summarize(enr = sum(enrollment),
count = n_distinct(course_id))
Chemistry |
20181901 |
59 |
2 |
Chemistry |
20181902 |
44 |
2 |
Chemistry |
20192001 |
47 |
2 |
Chemistry |
20192002 |
68 |
2 |
Chemistry |
20202101 |
69 |
2 |
Chemistry |
20202102 |
77 |
2 |