R’s Essential Data Wrangling Verbs

Bongani Ncube
Statistician/Data Scientist
May 19, 2024

Housekeeping


  • Intro 👋
  • Workshop materials ⬇️

Today’s plan


  1. What is R? How can it ease the burden of repeated reporting?
  2. Basic functions for manipulating data

What is R?

1 2 3 4 5 6

What is R?

1 2 3 4 5 6





R is an open-source (free!) scripting language for working with data

The benefits of R

1 2 3 4 5 6



The magic of R is that it’s reproducible (by someone else or by yourself in six months)


Keeps data separate from code (data preparation steps)

Getting R

1 2 3 4 5 6




You need the R language

And also the software

Using R

1 2 3 4 5 6



You use R via packages


…which contain functions


…which are just verbs

Today’s data

1 2 3 4 5 6

faculty


year id rank dept1 dept2
2021-22 1005 Lecturer Chemistry
2021-22 1022 Professor Physics Engineering
2021-22 1059 Professor Physics
2021-22 1079 Lecturer Music
2021-22 1086 Assistant Professor Music
2021-22 1095 Adjunct Instructor Sociology

Today’s data

1 2 3 4 5 6

courses


semester course_id faculty_id dept enrollment level
20212202 10605 1772 Physics 7 UG
20212202 10605 1772 Physics 32 GR
20212202 11426 1820 Political Science 8 UG
20212202 12048 1914 English 24 UG
20212202 13269 1095 Sociology 48 UG
20212202 13517 1086 Music 17 UG

Basic data manipulation

1 2 3 4 5 6

Useful operators

1 2 3 4 5 6



<-

“save as”

opt + -

%>%

“and then”

Cmd + shift + m

Common functions

1 2 3 4 5 6


filter keeps or discards rows (aka observations)

select keeps or discards columns (aka variables)

arrange sorts data set by certain variable(s)

count tallies data set by certain variable(s)

mutate creates new variables

group_by/summarize aggregates data (pivot tables!)

str_* functions work easily with text

Syntax of a function

1 2 3 4 5 6



function(data, argument(s))


is the same as


data %>%

    function(argument(s))

Filter

1 2 3 4 5 6


filter keeps or discards rows (aka observations)

the == operator tests for equality


faculty %>% 
  filter(dept1 == "Sociology")
year id rank dept1 dept2
2021-22 1095 Adjunct Instructor Sociology
2021-22 1118 Assistant Professor Sociology
2021-22 1161 Assistant Professor Sociology
2021-22 1191 Professor Sociology
2021-22 1216 Associate Professor Sociology American Studies
2021-22 1273 Assistant Professor Sociology

Filter

1 2 3 4 5 6


the | operator signifies “or”


faculty %>% 
  filter(dept1 == "Sociology" | 
           dept1 == "Physics")
year id rank dept1 dept2
2021-22 1022 Professor Physics Engineering
2021-22 1059 Professor Physics
2021-22 1095 Adjunct Instructor Sociology
2021-22 1118 Assistant Professor Sociology
2021-22 1161 Assistant Professor Sociology
2021-22 1191 Professor Sociology

Filter

1 2 3 4 5 6


the %in% operator allows for multiple options in a list


faculty %>% 
  filter(dept1 %in% c("Sociology",
                      "Physics",
                      "Music"))
year id rank dept1 dept2
2021-22 1022 Professor Physics Engineering
2021-22 1059 Professor Physics
2021-22 1079 Lecturer Music
2021-22 1086 Assistant Professor Music
2021-22 1095 Adjunct Instructor Sociology
2021-22 1118 Assistant Professor Sociology

Filter

1 2 3 4 5 6


the & operator combines conditions


faculty %>% 
  filter(dept1 %in% c("Sociology",
                      "Physics",
                      "Music") &
         rank == "Professor")
year id rank dept1 dept2
2021-22 1022 Professor Physics Engineering
2021-22 1059 Professor Physics
2021-22 1191 Professor Sociology
2021-22 1201 Professor Physics
2021-22 1209 Professor Music
2021-22 1421 Professor Physics Engineering

Select

1 2 3 4 5 6


select keeps or discards columns (aka variables)


faculty %>% 
  select(id, dept1, rank)
id dept1 rank
1005 Chemistry Lecturer
1022 Physics Professor
1059 Physics Professor
1079 Music Lecturer
1086 Music Assistant Professor
1095 Sociology Adjunct Instructor

Select

1 2 3 4 5 6


can drop columns with -column


faculty %>% 
  select(-dept2)
year id rank dept1
2021-22 1005 Lecturer Chemistry
2021-22 1022 Professor Physics
2021-22 1059 Professor Physics
2021-22 1079 Lecturer Music
2021-22 1086 Assistant Professor Music
2021-22 1095 Adjunct Instructor Sociology

Select

1 2 3 4 5 6


the pipe %>% chains multiple functions together


faculty %>% 
  select(id, dept1, rank) %>% 
  filter(rank == "Professor")
id dept1 rank
1022 Physics Professor
1059 Physics Professor
1191 Sociology Professor
1201 Physics Professor
1209 Music Professor
1407 English Professor

Arrange

1 2 3 4 5 6


arrange sorts data set by certain variable(s)

use desc() to get descending order


courses %>% 
  arrange(desc(enrollment))
semester course_id faculty_id dept enrollment level
20212201 10511 1005 Chemistry 50 UG
20212201 15934 1421 Physics 50 UG
20192002 13850 1105 Chemistry 50 UG
20181901 17773 1942 Music 50 UG
20212202 13269 1095 Sociology 48 UG
20202101 16202 1816 Political Science 48 UG

Arrange

1 2 3 4 5 6


can sort by multiple variables


courses %>% 
  arrange(dept, desc(enrollment))
semester course_id faculty_id dept enrollment level
20212201 10511 1005 Chemistry 50 UG
20192002 13850 1105 Chemistry 50 UG
20202102 13850 1258 Chemistry 39 UG
20202102 16606 1393 Chemistry 38 UG
20202101 16540 1784 Chemistry 38 UG
20181901 10511 1829 Chemistry 36 UG

Count

1 2 3 4 5 6


count tallies data set by certain variable(s) (very useful for familiarizing yourself with data)


courses %>% 
  count(dept)
dept n
Chemistry 16
English 18
Music 17
Physics 19
Political Science 17
Sociology 17

Count

1 2 3 4 5 6


can use sort = TRUE to order results


courses %>% 
  count(dept, level, sort = TRUE)
dept level n
Chemistry UG 16
English UG 16
Music UG 16
Physics UG 16
Political Science UG 16
Sociology UG 16
Physics GR 3
English GR 2
Music GR 1
Political Science GR 1
Sociology GR 1

Mutate

1 2 3 4 5 6


mutate creates new variables (with a single =)


faculty %>% 
  mutate(new = "hello!")
year id rank dept1 dept2 new
2021-22 1005 Lecturer Chemistry hello!
2021-22 1022 Professor Physics Engineering hello!
2021-22 1059 Professor Physics hello!
2021-22 1079 Lecturer Music hello!
2021-22 1086 Assistant Professor Music hello!
2021-22 1095 Adjunct Instructor Sociology hello!

Mutate

1 2 3 4 5 6


much more useful with a conditional such as ifelse(), which has three arguments:

condition, value if true, value if false


faculty %>% 
  mutate(prof = ifelse(rank == "Professor",
                       1, 0)) %>% 
  select(rank, prof)
rank prof
Lecturer 0
Professor 1
Professor 1
Lecturer 0
Assistant Professor 0
Adjunct Instructor 0

Mutate

1 2 3 4 5 6


the ! operator means not

is.na() identifies null values


faculty %>% 
  mutate(joint = ifelse(!is.na(dept2),
                        "joint", NA)) %>% 
  select(dept1, dept2, joint)
dept1 dept2 joint
Chemistry
Physics Engineering joint
Physics
Music
Music
Sociology

Mutate

1 2 3 4 5 6


with multiple conditions, case_when() is much easier!

faculty %>% 
  mutate(division = case_when(dept1 %in% c("Sociology","Political Science") ~
                                "Social Sciences",
                              dept1 %in% c("Music","English") ~
                                "Humanities",
                              dept1 %in% c("Chemistry","Physics") ~
                                "Sciences")) %>% 
  select(dept1, division)
dept1 division
Chemistry Sciences
Physics Sciences
Physics Sciences
Music Humanities
Music Humanities
Sociology Social Sciences

Group by / summarize

1 2 3 4 5 6


group_by/summarize aggregates data (pivot tables!)

group_by() identifies the grouping variable(s) and summarize() specifies the aggregation


courses %>% 
  group_by(dept, semester) %>% 
  summarize(enr = sum(enrollment))
dept semester enr
Chemistry 20181901 59
Chemistry 20181902 44
Chemistry 20192001 47
Chemistry 20192002 68
Chemistry 20202101 69
Chemistry 20202102 77

Group by / summarize

1 2 3 4 5 6


useful arguments within summarize:

mean, median, sd, min, max, n


courses %>% 
  group_by(dept, semester) %>% 
  summarize(enr = sum(enrollment),
            count = n_distinct(course_id))
dept semester enr courses
Chemistry 20181901 59 2
Chemistry 20181902 44 2
Chemistry 20192001 47 2
Chemistry 20192002 68 2
Chemistry 20202101 69 2
Chemistry 20202102 77 2