| year | id | rank | dept1 | dept2 |
|---|---|---|---|---|
| 2021-22 | 1005 | Lecturer | Chemistry | |
| 2021-22 | 1022 | Professor | Physics | Engineering |
| 2021-22 | 1059 | Professor | Physics | |
| 2021-22 | 1079 | Lecturer | Music | |
| 2021-22 | 1086 | Assistant Professor | Music | |
| 2021-22 | 1095 | Adjunct Instructor | Sociology | |
R is an open-source (free!) scripting language for working with data
The magic of R is that it’s reproducible (by someone else or by yourself in six months)
Keeps data separate from code (data preparation steps)
You need the R language
And also the software

project files are here
imported data shows up here
code can go here
code can also go here
You use R via packages
…which contain functions
…which are just verbs

faculty
| year | id | rank | dept1 | dept2 |
|---|---|---|---|---|
| 2021-22 | 1005 | Lecturer | Chemistry | |
| 2021-22 | 1022 | Professor | Physics | Engineering |
| 2021-22 | 1059 | Professor | Physics | |
| 2021-22 | 1079 | Lecturer | Music | |
| 2021-22 | 1086 | Assistant Professor | Music | |
| 2021-22 | 1095 | Adjunct Instructor | Sociology | |
courses
| semester | course_id | faculty_id | dept | enrollment | level |
|---|---|---|---|---|---|
| 20212202 | 10605 | 1772 | Physics | 7 | UG |
| 20212202 | 10605 | 1772 | Physics | 32 | GR |
| 20212202 | 11426 | 1820 | Political Science | 8 | UG |
| 20212202 | 12048 | 1914 | English | 24 | UG |
| 20212202 | 13269 | 1095 | Sociology | 48 | UG |
| 20212202 | 13517 | 1086 | Music | 17 | UG |
<- means “save as” (shortcut: opt + -)
%>% means “and then” (shortcut: Cmd + Shift + M)
filter keeps or discards rows (aka observations)
select keeps or discards columns (aka variables)
arrange sorts data set by certain variable(s)
count tallies data set by certain variable(s)
mutate creates new variables
group_by/summarize aggregates data (pivot tables!)
str_* functions work easily with text
function(data, argument(s))
is the same as
data %>%
function(argument(s))
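A minimal sketch of the two equivalent forms, assuming the dplyr package is installed. The small `faculty` data frame is a stand-in built from the first rows shown in these slides, and the object names `nested` and `piped` are made up for illustration.

```r
library(dplyr)

# Stand-in for the faculty table (three of the rows shown in the slides)
faculty <- data.frame(
  id    = c(1005, 1022, 1059),
  rank  = c("Lecturer", "Professor", "Professor"),
  dept1 = c("Chemistry", "Physics", "Physics")
)

# Nested form: the data is the first argument of the function
nested <- filter(faculty, dept1 == "Physics")

# Piped form: "take faculty, and then filter"; <- saves the result
piped <- faculty %>%
  filter(dept1 == "Physics")

# Both forms should give the same result
identical(nested, piped)
```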
filter keeps or discards rows (aka observations)
the == operator tests for equality
| year | id | rank | dept1 | dept2 |
|---|---|---|---|---|
| 2021-22 | 1095 | Adjunct Instructor | Sociology | |
| 2021-22 | 1118 | Assistant Professor | Sociology | |
| 2021-22 | 1161 | Assistant Professor | Sociology | |
| 2021-22 | 1191 | Professor | Sociology | |
| 2021-22 | 1216 | Associate Professor | Sociology | American Studies |
| 2021-22 | 1273 | Assistant Professor | Sociology | |
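A call that could produce a table like the one above might look like this. It is a sketch, assuming dplyr is installed; the `faculty` data frame is rebuilt from the six rows shown earlier, with empty `dept2` cells treated as NA (an assumption), and `sociology` is a made-up object name.

```r
library(dplyr)

# Stand-in faculty table; empty dept2 cells are assumed to be NA
faculty <- data.frame(
  year  = "2021-22",
  id    = c(1005, 1022, 1059, 1079, 1086, 1095),
  rank  = c("Lecturer", "Professor", "Professor", "Lecturer",
            "Assistant Professor", "Adjunct Instructor"),
  dept1 = c("Chemistry", "Physics", "Physics", "Music", "Music", "Sociology"),
  dept2 = c(NA, "Engineering", NA, NA, NA, NA)
)

# Keep only the rows where dept1 is exactly "Sociology"
sociology <- faculty %>%
  filter(dept1 == "Sociology")

sociology
```

Note the double `==` for testing equality; a single `=` would be read as naming an argument.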
the | operator signifies “or”
| year | id | rank | dept1 | dept2 |
|---|---|---|---|---|
| 2021-22 | 1022 | Professor | Physics | Engineering |
| 2021-22 | 1059 | Professor | Physics | |
| 2021-22 | 1095 | Adjunct Instructor | Sociology | |
| 2021-22 | 1118 | Assistant Professor | Sociology | |
| 2021-22 | 1161 | Assistant Professor | Sociology | |
| 2021-22 | 1191 | Professor | Sociology | |
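A sketch of an "or" filter, assuming dplyr is installed. The `faculty` data frame is a stand-in built from the six rows shown earlier (empty `dept2` assumed to be NA), and `soc_or_phys` is a made-up name.

```r
library(dplyr)

# Stand-in faculty table; empty dept2 cells are assumed to be NA
faculty <- data.frame(
  year  = "2021-22",
  id    = c(1005, 1022, 1059, 1079, 1086, 1095),
  rank  = c("Lecturer", "Professor", "Professor", "Lecturer",
            "Assistant Professor", "Adjunct Instructor"),
  dept1 = c("Chemistry", "Physics", "Physics", "Music", "Music", "Sociology"),
  dept2 = c(NA, "Engineering", NA, NA, NA, NA)
)

# Keep rows where dept1 is Sociology OR Physics
soc_or_phys <- faculty %>%
  filter(dept1 == "Sociology" | dept1 == "Physics")

soc_or_phys
```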
the %in% operator allows for multiple options in a list
| year | id | rank | dept1 | dept2 |
|---|---|---|---|---|
| 2021-22 | 1022 | Professor | Physics | Engineering |
| 2021-22 | 1059 | Professor | Physics | |
| 2021-22 | 1079 | Lecturer | Music | |
| 2021-22 | 1086 | Assistant Professor | Music | |
| 2021-22 | 1095 | Adjunct Instructor | Sociology | |
| 2021-22 | 1118 | Assistant Professor | Sociology | |
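With more than two options, `%in%` is shorter than a chain of `|`. A sketch, assuming dplyr is installed; the `faculty` data frame is a stand-in built from the six rows shown earlier (empty `dept2` assumed to be NA), and `three_depts` is a made-up name.

```r
library(dplyr)

# Stand-in faculty table; empty dept2 cells are assumed to be NA
faculty <- data.frame(
  year  = "2021-22",
  id    = c(1005, 1022, 1059, 1079, 1086, 1095),
  rank  = c("Lecturer", "Professor", "Professor", "Lecturer",
            "Assistant Professor", "Adjunct Instructor"),
  dept1 = c("Chemistry", "Physics", "Physics", "Music", "Music", "Sociology"),
  dept2 = c(NA, "Engineering", NA, NA, NA, NA)
)

# Keep rows whose dept1 matches any value in the vector built with c()
three_depts <- faculty %>%
  filter(dept1 %in% c("Sociology", "Physics", "Music"))

three_depts
```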
the & operator combines conditions
| year | id | rank | dept1 | dept2 |
|---|---|---|---|---|
| 2021-22 | 1022 | Professor | Physics | Engineering |
| 2021-22 | 1059 | Professor | Physics | |
| 2021-22 | 1191 | Professor | Sociology | |
| 2021-22 | 1201 | Professor | Physics | |
| 2021-22 | 1209 | Professor | Music | |
| 2021-22 | 1421 | Professor | Physics | Engineering |
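The exact conditions behind the table above are not shown, so this is a guess at one combination that fits it: a department list AND a rank. A sketch, assuming dplyr is installed; the `faculty` data frame is a stand-in built from the six rows shown earlier (empty `dept2` assumed to be NA), and `professors` is a made-up name.

```r
library(dplyr)

# Stand-in faculty table; empty dept2 cells are assumed to be NA
faculty <- data.frame(
  year  = "2021-22",
  id    = c(1005, 1022, 1059, 1079, 1086, 1095),
  rank  = c("Lecturer", "Professor", "Professor", "Lecturer",
            "Assistant Professor", "Adjunct Instructor"),
  dept1 = c("Chemistry", "Physics", "Physics", "Music", "Music", "Sociology"),
  dept2 = c(NA, "Engineering", NA, NA, NA, NA)
)

# Both conditions must be true for a row to be kept
professors <- faculty %>%
  filter(dept1 %in% c("Sociology", "Physics", "Music") & rank == "Professor")

professors
```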
select keeps or discards columns (aka variables)
can drop columns with -column
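A sketch of both uses of select, assuming dplyr is installed. The `faculty` data frame is a stand-in built from the six rows shown earlier (empty `dept2` assumed to be NA); `kept` and `dropped` are made-up names.

```r
library(dplyr)

# Stand-in faculty table; empty dept2 cells are assumed to be NA
faculty <- data.frame(
  year  = "2021-22",
  id    = c(1005, 1022, 1059, 1079, 1086, 1095),
  rank  = c("Lecturer", "Professor", "Professor", "Lecturer",
            "Assistant Professor", "Adjunct Instructor"),
  dept1 = c("Chemistry", "Physics", "Physics", "Music", "Music", "Sociology"),
  dept2 = c(NA, "Engineering", NA, NA, NA, NA)
)

# Keep only the named columns
kept <- faculty %>%
  select(id, rank, dept1)

# Drop one column and keep everything else
dropped <- faculty %>%
  select(-dept2)

names(kept)
names(dropped)
```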
the pipe %>% chains multiple functions together
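A sketch of a two-step chain, assuming dplyr is installed. The `faculty` data frame is a stand-in built from the six rows shown earlier (empty `dept2` assumed to be NA), and `physics_ranks` is a made-up name.

```r
library(dplyr)

# Stand-in faculty table; empty dept2 cells are assumed to be NA
faculty <- data.frame(
  year  = "2021-22",
  id    = c(1005, 1022, 1059, 1079, 1086, 1095),
  rank  = c("Lecturer", "Professor", "Professor", "Lecturer",
            "Assistant Professor", "Adjunct Instructor"),
  dept1 = c("Chemistry", "Physics", "Physics", "Music", "Music", "Sociology"),
  dept2 = c(NA, "Engineering", NA, NA, NA, NA)
)

# Take faculty, and then keep Physics rows, and then keep two columns
physics_ranks <- faculty %>%
  filter(dept1 == "Physics") %>%
  select(id, rank)

physics_ranks
```

Each step receives the result of the step before it, so the order of the verbs matters: filtering on `dept1` has to come before `dept1` is dropped by select.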
arrange sorts data set by certain variable(s)
use desc() to get descending order
| semester | course_id | faculty_id | dept | enrollment | level |
|---|---|---|---|---|---|
| 20212201 | 10511 | 1005 | Chemistry | 50 | UG |
| 20212201 | 15934 | 1421 | Physics | 50 | UG |
| 20192002 | 13850 | 1105 | Chemistry | 50 | UG |
| 20181901 | 17773 | 1942 | Music | 50 | UG |
| 20212202 | 13269 | 1095 | Sociology | 48 | UG |
| 20202101 | 16202 | 1816 | Political Science | 48 | UG |
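A sketch of a descending sort, assuming dplyr is installed. The `courses` data frame is a stand-in built from the six rows of the courses table shown earlier, and `largest_first` is a made-up name.

```r
library(dplyr)

# Stand-in courses table (the six rows shown earlier in the slides)
courses <- data.frame(
  semester   = 20212202,
  course_id  = c(10605, 10605, 11426, 12048, 13269, 13517),
  faculty_id = c(1772, 1772, 1820, 1914, 1095, 1086),
  dept       = c("Physics", "Physics", "Political Science",
                 "English", "Sociology", "Music"),
  enrollment = c(7, 32, 8, 24, 48, 17),
  level      = c("UG", "GR", "UG", "UG", "UG", "UG")
)

# Largest enrollment first; without desc() the smallest would come first
largest_first <- courses %>%
  arrange(desc(enrollment))

largest_first
```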
can sort by multiple variables
| semester | course_id | faculty_id | dept | enrollment | level |
|---|---|---|---|---|---|
| 20212201 | 10511 | 1005 | Chemistry | 50 | UG |
| 20192002 | 13850 | 1105 | Chemistry | 50 | UG |
| 20202102 | 13850 | 1258 | Chemistry | 39 | UG |
| 20202102 | 16606 | 1393 | Chemistry | 38 | UG |
| 20202101 | 16540 | 1784 | Chemistry | 38 | UG |
| 20181901 | 10511 | 1829 | Chemistry | 36 | UG |
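The table above looks like it is sorted by department first and then by enrollment, largest first. A sketch of that kind of sort, assuming dplyr is installed; the `courses` data frame is a stand-in built from the six rows of the courses table shown earlier, and `by_dept_then_size` is a made-up name.

```r
library(dplyr)

# Stand-in courses table (the six rows shown earlier in the slides)
courses <- data.frame(
  semester   = 20212202,
  course_id  = c(10605, 10605, 11426, 12048, 13269, 13517),
  faculty_id = c(1772, 1772, 1820, 1914, 1095, 1086),
  dept       = c("Physics", "Physics", "Political Science",
                 "English", "Sociology", "Music"),
  enrollment = c(7, 32, 8, 24, 48, 17),
  level      = c("UG", "GR", "UG", "UG", "UG", "UG")
)

# Sort by dept (A to Z); within each dept, largest enrollment first
by_dept_then_size <- courses %>%
  arrange(dept, desc(enrollment))

by_dept_then_size
```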
count tallies data set by certain variable(s) (very useful for familiarizing yourself with data)
can use sort = TRUE to order results
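A sketch of a tally, assuming dplyr is installed. The `faculty` data frame is a stand-in built from the six rows shown earlier (empty `dept2` assumed to be NA), and `by_dept` is a made-up name.

```r
library(dplyr)

# Stand-in faculty table; empty dept2 cells are assumed to be NA
faculty <- data.frame(
  year  = "2021-22",
  id    = c(1005, 1022, 1059, 1079, 1086, 1095),
  rank  = c("Lecturer", "Professor", "Professor", "Lecturer",
            "Assistant Professor", "Adjunct Instructor"),
  dept1 = c("Chemistry", "Physics", "Physics", "Music", "Music", "Sociology"),
  dept2 = c(NA, "Engineering", NA, NA, NA, NA)
)

# One row per department, with the number of faculty in a column named n;
# sort = TRUE puts the largest counts at the top
by_dept <- faculty %>%
  count(dept1, sort = TRUE)

by_dept
```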
mutate creates new variables (with a single =)
| year | id | rank | dept1 | dept2 | new |
|---|---|---|---|---|---|
| 2021-22 | 1005 | Lecturer | Chemistry | | hello! |
| 2021-22 | 1022 | Professor | Physics | Engineering | hello! |
| 2021-22 | 1059 | Professor | Physics | | hello! |
| 2021-22 | 1079 | Lecturer | Music | | hello! |
| 2021-22 | 1086 | Assistant Professor | Music | | hello! |
| 2021-22 | 1095 | Adjunct Instructor | Sociology | | hello! |
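A sketch of the kind of call that adds a constant column like `new`, assuming dplyr is installed. The `faculty` data frame is a stand-in built from the six rows shown earlier (empty `dept2` assumed to be NA), and `with_new` is a made-up name.

```r
library(dplyr)

# Stand-in faculty table; empty dept2 cells are assumed to be NA
faculty <- data.frame(
  year  = "2021-22",
  id    = c(1005, 1022, 1059, 1079, 1086, 1095),
  rank  = c("Lecturer", "Professor", "Professor", "Lecturer",
            "Assistant Professor", "Adjunct Instructor"),
  dept1 = c("Chemistry", "Physics", "Physics", "Music", "Music", "Sociology"),
  dept2 = c(NA, "Engineering", NA, NA, NA, NA)
)

# Single = names the new column; the value is repeated on every row
with_new <- faculty %>%
  mutate(new = "hello!")

with_new
```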
much more useful with a conditional such as ifelse(), which has three arguments:
condition, value if true, value if false
the ! operator means not
is.na() identifies missing (NA) values
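A sketch that combines ifelse(), ! and is.na(), assuming dplyr is installed. The `faculty` data frame is a stand-in built from the six rows shown earlier, with empty `dept2` cells assumed to be NA. The column name `appointment` and its two labels are made up for illustration.

```r
library(dplyr)

# Stand-in faculty table; empty dept2 cells are assumed to be NA
faculty <- data.frame(
  year  = "2021-22",
  id    = c(1005, 1022, 1059, 1079, 1086, 1095),
  rank  = c("Lecturer", "Professor", "Professor", "Lecturer",
            "Assistant Professor", "Adjunct Instructor"),
  dept1 = c("Chemistry", "Physics", "Physics", "Music", "Music", "Sociology"),
  dept2 = c(NA, "Engineering", NA, NA, NA, NA)
)

# ifelse(condition, value if true, value if false)
# !is.na(dept2) is TRUE when a second department is present
flagged <- faculty %>%
  mutate(appointment = ifelse(!is.na(dept2), "joint", "single"))

flagged
```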
with multiple conditions, case_when() is much easier!
| dept1 | division |
|---|---|
| Chemistry | Sciences |
| Physics | Sciences |
| Physics | Sciences |
| Music | Humanities |
| Music | Humanities |
| Sociology | Social Sciences |
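A sketch of a case_when() that would give the mapping in the table above, assuming dplyr is installed. The `faculty` data frame is a stand-in built from the six rows shown earlier; the fallback of NA for unlisted departments is an assumption, and `divisions` is a made-up name.

```r
library(dplyr)

# Stand-in faculty table; empty dept2 cells are assumed to be NA
faculty <- data.frame(
  year  = "2021-22",
  id    = c(1005, 1022, 1059, 1079, 1086, 1095),
  rank  = c("Lecturer", "Professor", "Professor", "Lecturer",
            "Assistant Professor", "Adjunct Instructor"),
  dept1 = c("Chemistry", "Physics", "Physics", "Music", "Music", "Sociology"),
  dept2 = c(NA, "Engineering", NA, NA, NA, NA)
)

# Each line reads: condition ~ value; the first matching condition wins
divisions <- faculty %>%
  mutate(division = case_when(
    dept1 %in% c("Chemistry", "Physics") ~ "Sciences",
    dept1 == "Music"                     ~ "Humanities",
    dept1 == "Sociology"                 ~ "Social Sciences",
    TRUE                                 ~ NA_character_  # anything else
  )) %>%
  select(dept1, division)

divisions
```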
group_by/summarize aggregates data (pivot tables!)
group_by() identifies the grouping variable(s) and summarize() specifies the aggregation
useful functions within summarize:
mean(), median(), sd(), min(), max(), n()
| dept | semester | enr | courses |
|---|---|---|---|
| Chemistry | 20181901 | 59 | 2 |
| Chemistry | 20181902 | 44 | 2 |
| Chemistry | 20192001 | 47 | 2 |
| Chemistry | 20192002 | 68 | 2 |
| Chemistry | 20202101 | 69 | 2 |
| Chemistry | 20202102 | 77 | 2 |
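The code behind the table above is not shown, so this is only a guess at its shape: `enr` as total enrollment and `courses` as a row count per department and semester. A sketch, assuming dplyr 1.0 or later is installed (for the `.groups` argument). The `courses` data frame is a stand-in built from the six rows of the courses table shown earlier, and `by_dept_sem` is a made-up name.

```r
library(dplyr)

# Stand-in courses table (the six rows shown earlier in the slides)
courses <- data.frame(
  semester   = 20212202,
  course_id  = c(10605, 10605, 11426, 12048, 13269, 13517),
  faculty_id = c(1772, 1772, 1820, 1914, 1095, 1086),
  dept       = c("Physics", "Physics", "Political Science",
                 "English", "Sociology", "Music"),
  enrollment = c(7, 32, 8, 24, 48, 17),
  level      = c("UG", "GR", "UG", "UG", "UG", "UG")
)

# One output row per dept/semester combination
by_dept_sem <- courses %>%
  group_by(dept, semester) %>%
  summarize(
    enr     = sum(enrollment),  # assumed: total enrollment in the group
    courses = n(),              # assumed: number of rows in the group
    .groups = "drop"            # return an ungrouped result
  )

by_dept_sem
```

Swapping `sum(enrollment)` for `mean(enrollment)` or `max(enrollment)` changes the aggregation without changing the grouping.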