A tutorial on writing a function to group your data and generate basic descriptives.
For the tutorial, we will use the COVID Community Measures dataset from Kaggle. We hope that you will be able to generalize the function we covered to be applicable to your own dataset. The current dataset contains Community Mobility Reports describing community activities in a variety of settings from Google and the incidence of COVID-19 diagnosis and fatalities from John Hopkins CSSE.
Community activity is defined as the change in activity from the baseline days. A baseline day is the median value from the 5‑week period between January 3rd to February 6th, 2020.
To create a function that can read in our dataframe and output descriptives, we will need the following packages. If you have not already isntalled them, please use install.packages(" ") prior to calling library().
#At later time, identify what aspects of each library are used for the function
library(tidyverse)
library(janitor)
library(lubridate)
library(purrr)
Once the data is loaded, we can look at the dataframe by calling reactable() on our dataframe. Notice that you can sort each column by clicking on the column header.
We can also look at each of the variable names by calling the R base function names().
names(d)
[1] "iso" "country" "date"
[4] "grocery_pharmacy" "parks" "residential"
[7] "retail_recreation" "transit_stations" "workplaces"
[10] "total_cases" "fatalities"
& we can look at all the different countries:
#using tidyverse
d %>%
count(country,
sort = TRUE)
# A tibble: 19 x 2
country n
<chr> <int>
1 Argentina 43
2 Australia 43
3 Brazil 43
4 Canada 43
5 France 43
6 Germany 43
7 India 43
8 Indonesia 43
9 Italy 43
10 Japan 43
11 Mexico 43
12 Saudi Arabia 43
13 South Africa 43
14 South Korea 43
15 Spain 43
16 Sweden 43
17 Turkey 43
18 United Kingdom 43
19 US 43
For this dataset, it would be helpful to create a ‘week’ variable that extracts each day of the week, so we can group our descriptives by week which would allow us to examine for example the average change in activities across weeks for different settings. Here, we’ve also created a ‘day_of_week’ variable to double-check our work.
If you are working with date-time variables in your data, the lubridate package is quite powerful at manipulating date-time data.
#cut.Date, labels = TRUE will print actual day of the week, e.g., Sunday (1)
#use lubridate wday and base R cut.Date
#?cut.Date
#?wday
d <- d %>%
mutate(week = cut.Date(date, "week", start.on.monday = FALSE, labels = FALSE),
day_of_week = wday(date, label = TRUE))
#taking a look at our new variables `week` and `day_of_week`
d %>% select(country, date, week, day_of_week)
# A tibble: 817 x 4
country date week day_of_week
<chr> <date> <int> <ord>
1 Argentina 2020-02-23 1 Sun
2 Argentina 2020-02-24 1 Mon
3 Argentina 2020-02-25 1 Tue
4 Argentina 2020-02-26 1 Wed
5 Argentina 2020-02-27 1 Thu
6 Argentina 2020-02-28 1 Fri
7 Argentina 2020-02-29 1 Sat
8 Argentina 2020-03-01 2 Sun
9 Argentina 2020-03-02 2 Mon
10 Argentina 2020-03-03 2 Tue
# ... with 807 more rows
Now that we’ve created our week variable, we can use tidyr::nest to group our dataset by week, by country, or both. This will allow us to loop our function through each grouping variable and provide unique descriptives for each.
We will nest our data by the variables country and week
#week 7 only has one day -- filter out here if it becomes a problem
d2 <- d %>%
group_by(country, week) %>%
nest()
print(d2)
# A tibble: 133 x 3
# Groups: country, week [133]
country week data
<chr> <int> <list>
1 Argentina 1 <tibble [7 x 11]>
2 Argentina 2 <tibble [7 x 11]>
3 Argentina 3 <tibble [7 x 11]>
4 Argentina 4 <tibble [7 x 11]>
5 Argentina 5 <tibble [7 x 11]>
6 Argentina 6 <tibble [7 x 11]>
7 Argentina 7 <tibble [1 x 11]>
8 Australia 1 <tibble [7 x 11]>
9 Australia 2 <tibble [7 x 11]>
10 Australia 3 <tibble [7 x 11]>
# ... with 123 more rows
After nesting, you see that our new dataframe now contains a list-column named data. We can call the Argentina Week 1 data tibble by specifying d2$data[[1]] to look at it in more details.
d2$data[[1]]
# A tibble: 7 x 11
iso date grocery_pharmacy parks residential retail_recreati~
<chr> <date> <dbl> <dbl> <dbl> <dbl>
1 AR 2020-02-23 8.18 17.6 0.437 13.6
2 AR 2020-02-24 -15.9 25.6 6.32 -9.97
3 AR 2020-02-25 -17.1 1.32 6.90 -19.7
4 AR 2020-02-26 2.30 5.40 -2.27 0.845
5 AR 2020-02-27 -0.404 -5.27 -0.493 -0.584
6 AR 2020-02-28 2.08 -7.86 -0.577 2.88
7 AR 2020-02-29 4.68 -7.58 1.41 5.26
# ... with 5 more variables: transit_stations <dbl>,
# workplaces <dbl>, total_cases <dbl>, fatalities <dbl>,
# day_of_week <ord>
You can see that inside the 7 x 11 tibble for Argentina week 1, the community activities, diagnosis, and fatalities data from that week (week of February 23 to February 29, 2020) are listed.
We can also perform other functions to check our nested dataframe:
#checking length of variables
map_dbl(d2$data, nrow)
[1] 7 7 7 7 7 7 1 7 7 7 7 7 7 1 7 7 7 7 7 7 1 7 7 7 7 7 7 1 7 7 7 7
[33] 7 7 1 7 7 7 7 7 7 1 7 7 7 7 7 7 1 7 7 7 7 7 7 1 7 7 7 7 7 7 1 7
[65] 7 7 7 7 7 1 7 7 7 7 7 7 1 7 7 7 7 7 7 1 7 7 7 7 7 7 1 7 7 7 7 7
[97] 7 1 7 7 7 7 7 7 1 7 7 7 7 7 7 1 7 7 7 7 7 7 1 7 7 7 7 7 7 1 7 7
[129] 7 7 7 7 1
map_dbl(d2$data, ncol)
[1] 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
[22] 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
[43] 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
[64] 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
[85] 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
[106] 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
[127] 11 11 11 11 11 11 11
#if you want to find the mean for just one specific column, by each country and week.
#Here we look at the mean for residential for each country across all 7 weeks.
avg <- map_dbl(d2$data, ~mean(.x$residential))
avg
[1] 1.67414286 -0.63714286 0.59157143 15.39671429 29.97485714
[6] 28.50571429 25.87300000 -1.32342857 -0.81500000 0.44171429
[11] 4.16142857 12.63700000 17.63757143 14.31900000 4.18628571
[16] -0.01671429 -0.76614286 7.83928571 20.75071429 18.92214286
[21] 14.97400000 0.64442857 0.41400000 1.76100000 13.42957143
[26] 20.44085714 21.86371429 14.08800000 2.00785714 0.97028571
[31] 2.10400000 10.38700000 15.75300000 15.62085714 7.61100000
[36] -0.80771429 -0.45228571 3.36014286 25.70114286 28.44842857
[41] 30.36214286 22.64600000 2.45942857 1.76314286 2.05214286
[46] 23.13114286 30.17714286 29.84257143 17.48900000 0.56071429
[51] 0.14157143 1.28285714 8.34442857 21.51128571 24.88914286
[56] 14.84200000 0.88385714 0.44857143 0.83642857 7.05357143
[61] 14.78771429 15.42585714 14.35400000 0.10814286 0.35414286
[66] 2.92942857 4.64042857 26.67171429 28.29157143 21.09400000
[71] 4.16500000 5.45200000 18.36900000 28.09785714 31.50657143
[76] 30.93385714 24.03100000 3.39742857 4.30500000 4.78357143
[81] 5.13257143 4.07328571 6.58700000 7.80600000 10.03400000
[86] 9.88071429 8.51985714 7.77971429 7.38671429 6.53257143
[91] 5.72100000 -1.15700000 -1.24385714 -0.10271429 6.35128571
[96] 13.76957143 16.55714286 12.77800000 1.53228571 1.35285714
[101] 4.75042857 12.01185714 21.12371429 22.34842857 23.31700000
[106] 0.56885714 0.85028571 2.46128571 8.77828571 9.42542857
[111] 10.07814286 5.71000000 -0.72157143 -1.80671429 -1.57442857
[116] 9.76814286 18.24400000 21.63442857 18.88000000 -0.73600000
[121] -0.88142857 1.00300000 10.96157143 17.33842857 18.50142857
[126] 12.55000000 -0.52328571 -0.42128571 -0.93985714 6.29614286
[131] 15.84242857 33.08314286 22.35100000
#if we want to create a country name and week index
index <- d2 %>% select(country, week)
index
# A tibble: 133 x 2
# Groups: country, week [133]
country week
<chr> <int>
1 Argentina 1
2 Argentina 2
3 Argentina 3
4 Argentina 4
5 Argentina 5
6 Argentina 6
7 Argentina 7
8 Australia 1
9 Australia 2
10 Australia 3
# ... with 123 more rows
#we can create a tibble of our index with our residential average data
avg_tibble <- tibble(index, `residential mean` = avg)
avg_tibble
# A tibble: 133 x 2
index$country $week `residential mean`
<chr> <int> <dbl>
1 Argentina 1 1.67
2 Argentina 2 -0.637
3 Argentina 3 0.592
4 Argentina 4 15.4
5 Argentina 5 30.0
6 Argentina 6 28.5
7 Argentina 7 25.9
8 Australia 1 -1.32
9 Australia 2 -0.815
10 Australia 3 0.442
# ... with 123 more rows
In order to figure out how to create a function that can be used for any variable in a dataset, we first need to do it for one.
mean is a base R function.d2 is our nested data set.data is a dataframe that contains all our other variables, for each country and week. You’ll notice there’s a different dataset for Argentina week1, Argentina week2, and so on.[[1]] says: give me the contents within data$residential specifies which variable within data we are interested in.
mean(d2$data[[1]]$residential)
[1] 1.674143
Next, we can attempt to loop through each cell in the ‘data’ column (that is, grouped by country and week) to find the residential means for each.
residential_mean <- d2 %>%
mutate(mean = map_dbl(data, ~mean(.x$residential)))
print(residential_mean)
# A tibble: 133 x 4
# Groups: country, week [133]
country week data mean
<chr> <int> <list> <dbl>
1 Argentina 1 <tibble [7 x 11]> 1.67
2 Argentina 2 <tibble [7 x 11]> -0.637
3 Argentina 3 <tibble [7 x 11]> 0.592
4 Argentina 4 <tibble [7 x 11]> 15.4
5 Argentina 5 <tibble [7 x 11]> 30.0
6 Argentina 6 <tibble [7 x 11]> 28.5
7 Argentina 7 <tibble [1 x 11]> 25.9
8 Australia 1 <tibble [7 x 11]> -1.32
9 Australia 2 <tibble [7 x 11]> -0.815
10 Australia 3 <tibble [7 x 11]> 0.442
# ... with 123 more rows
Finally, we can generalize this further to loop through each cell in the data column, to get the mean for each variable in the data frame (not just residential).
d2.
tmp <- d2[[3]][[1]]
tmp
# A tibble: 7 x 11
iso date grocery_pharmacy parks residential retail_recreati~
<chr> <date> <dbl> <dbl> <dbl> <dbl>
1 AR 2020-02-23 8.18 17.6 0.437 13.6
2 AR 2020-02-24 -15.9 25.6 6.32 -9.97
3 AR 2020-02-25 -17.1 1.32 6.90 -19.7
4 AR 2020-02-26 2.30 5.40 -2.27 0.845
5 AR 2020-02-27 -0.404 -5.27 -0.493 -0.584
6 AR 2020-02-28 2.08 -7.86 -0.577 2.88
7 AR 2020-02-29 4.68 -7.58 1.41 5.26
# ... with 5 more variables: transit_stations <dbl>,
# workplaces <dbl>, total_cases <dbl>, fatalities <dbl>,
# day_of_week <ord>
map_lgl.
#map_lgl will first return a TRUE/FALSE of whether a column is numeric
map_lgl(tmp, is.numeric)
iso date grocery_pharmacy
FALSE FALSE TRUE
parks residential retail_recreation
TRUE TRUE TRUE
transit_stations workplaces total_cases
TRUE TRUE TRUE
fatalities day_of_week
TRUE FALSE
#Can use map_lgl to subset our tmp data for only the numeric columns
tmp2 <- tmp[map_lgl(tmp, is.numeric)]
tmp2
# A tibble: 7 x 8
grocery_pharmacy parks residential retail_recreati~ transit_stations
<dbl> <dbl> <dbl> <dbl> <dbl>
1 8.18 17.6 0.437 13.6 5.33
2 -15.9 25.6 6.32 -9.97 -26.8
3 -17.1 1.32 6.90 -19.7 -28.5
4 2.30 5.40 -2.27 0.845 9.46
5 -0.404 -5.27 -0.493 -0.584 7.98
6 2.08 -7.86 -0.577 2.88 7.97
7 4.68 -7.58 1.41 5.26 5.60
# ... with 3 more variables: workplaces <dbl>, total_cases <dbl>,
# fatalities <dbl>
map_dbl or map_df.
#map_dbl - returns a vector
map_dbl(tmp2, mean, na.rm = TRUE)
grocery_pharmacy parks residential
-2.309286 4.169143 1.674143
retail_recreation transit_stations workplaces
-1.091857 -2.703429 -9.460286
total_cases fatalities
0.000000 0.000000
#map_df - returns a data frame
map_df(tmp2, mean, na.rm = TRUE)
# A tibble: 1 x 8
grocery_pharmacy parks residential retail_recreati~ transit_stations
<dbl> <dbl> <dbl> <dbl> <dbl>
1 -2.31 4.17 1.67 -1.09 -2.70
# ... with 3 more variables: workplaces <dbl>, total_cases <dbl>,
# fatalities <dbl>
Let’s start by creating a list of all the operations that we will later supply to our function to run with our dataset. We will call the list funs_to_apply. Here, we are specifying that we want the total n for each column, the mean, standard deviation sd, min, and max.
Remember, if you want to create a function that you can share with members of your team, you should not assign this list to an object. Instead you can input the list directly into the function, as you’ll see at the very end of this post.
listof_fun <- list(
n = function(x) length(x),
mean = function(x) mean(x, na.rm = TRUE),
sd = function(x) sd(x, na.rm = TRUE),
min = function(x) min(x, na.rm = TRUE),
max = function(x) max(x, na.rm = TRUE)
)
tmp because that was the specific dataset we were interested in making numeric. Now we can generalize to any dataset by replacing tmp with df:numeric_df <- df[map_lgl(df, is.numeric)]
Here, we are using df; once we put this mini-function into our larger function, df will refer to all the datasets within our nested dataframe (d2). (This will make more sense later)
Next, we’ve broadened the function from step 3 so that it can take any dataframe and function as an input, and loop that function throughout the dataset.
#estimating just one function
est_one_fun <- function(df, f) {
map_dbl(df, ~f(.x))
}
descriptives. Our arguments are:df = placeholder for a dataframe.
funs_to_apply = a list of functions. We can call this anything. Here we are saying that the list of functions we want is what we created earlier, listof_fun.
descriptives <- function(df, funs_to_apply = listof_fun) {
numeric_df function we created in step 5.
descriptives <- function(df, funs_to_apply = listof_fun) {
numeric_df <- df[map_lgl(df, is.numeric)]
map_df AND our est_one_fun function within our descriptives function. Let’s focus just on this for now. We are saying: loop through each of our operations in funs_to_apply, so that we can apply the est_one_fun function to use each of those operations on our numeric_df.
#descriptives <- function(df, funs_to_apply = listof_fun) {
# numeric_df <- df[map_lgl(df, is.numeric)]
map_df(funs_to_apply, ~est_one_fun(numeric_df, .x))
dplyr code to make our output look nicer. We are using mutate to create a vector with each of the column names. Then we are using select to change the order of the output so that our names column (place) is first.
descriptives <- function(df, funs_to_apply = listof_fun) {
numeric_df <- df[map_lgl(df, is.numeric)]
map_df(funs_to_apply, ~est_one_fun(numeric_df, .x)) %>%
mutate(place = names(numeric_df)) %>%
select(place, everything())
}
descriptives(tmp)
# A tibble: 8 x 6
place n mean sd min max
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 grocery_pharmacy 7 -2.31 10.1 -17.1 8.18
2 parks 7 4.17 13.0 -7.86 25.6
3 residential 7 1.67 3.55 -2.27 6.90
4 retail_recreation 7 -1.09 10.8 -19.7 13.6
5 transit_stations 7 -2.70 17.1 -28.5 9.46
6 workplaces 7 -9.46 29.6 -53.7 12.9
7 total_cases 7 0 0 0 0
8 fatalities 7 0 0 0 0
descriptives function to each element of our list. Let’s pipe everything to d2 and create a new column called Output that contains the output from our function.
Descriptive_output <- d2 %>%
mutate(Output = map(data, descriptives)) %>%
select(-data) %>%
unnest(Output)
Note, we are using data within map because everything is piped to d2. Next, we deselect data and unnest so that all we see is our output, country, and week.
listof_fun), you can input the list directly into the function like below:
descriptives2 <- function(df, funs_to_apply = list(
n = function(x) length(x),
mean = function(x) mean(x, na.rm = TRUE),
sd = function(x) sd(x, na.rm = TRUE),
min = function(x) min(x, na.rm = TRUE),
max = function(x) max(x, na.rm = TRUE)
)) {
numeric_df <- df[map_lgl(df, is.numeric)]
map_df(funs_to_apply, ~est_one_fun(numeric_df, .x)) %>%
mutate(place = names(numeric_df)) %>%
select(place, everything())
}
Let’s test it out!
Descriptive_output1 <- d2 %>%
mutate(Output = map(data, descriptives2)) %>%
select(-data) %>%
unnest(Output)
Descriptive_output1
# A tibble: 1,064 x 8
# Groups: country, week [133]
country week place n mean sd min max
<chr> <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Argentina 1 grocery_pharmacy 7 -2.31 10.1 -17.1 8.18
2 Argentina 1 parks 7 4.17 13.0 -7.86 25.6
3 Argentina 1 residential 7 1.67 3.55 -2.27 6.90
4 Argentina 1 retail_recreation 7 -1.09 10.8 -19.7 13.6
5 Argentina 1 transit_stations 7 -2.70 17.1 -28.5 9.46
6 Argentina 1 workplaces 7 -9.46 29.6 -53.7 12.9
7 Argentina 1 total_cases 7 0 0 0 0
8 Argentina 1 fatalities 7 0 0 0 0
9 Argentina 2 grocery_pharmacy 7 4.84 1.93 2.87 7.63
10 Argentina 2 parks 7 -7.27 3.31 -10.5 -0.579
# ... with 1,054 more rows
We did it again! Woohoo!
#a. We specify that if the type of function(s) supplied by the user is not a list, that we should make it a list, even if this is a list of one.
#b. Next we specify that if the functions within the list have no names that we should create names, and order them according to the input.
descriptives3 <- function(df, funs_to_apply = list(
n = function(x) length(x),
mean = function(x) mean(x, na.rm = TRUE),
sd = function(x) sd(x, na.rm = TRUE),
min = function(x) min(x, na.rm = TRUE),
max = function(x) max(x, na.rm = TRUE)
)) {
if(!typeof(funs_to_apply) == "list") { #a
funs_to_apply <- list(funs_to_apply)
}
if(is.null(names(funs_to_apply))) { #b
names(funs_to_apply) <- paste0("Descriptive", seq_along(funs_to_apply))
}
numeric_df <- df[map_lgl(df, is.numeric)]
map_df(funs_to_apply, ~est_one_fun(numeric_df, .x)) %>%
mutate(place = names(numeric_df)) %>%
select(place, everything())
}
Again, let’s test it out.
# with one function
descriptives3(tmp2, mean)
# A tibble: 8 x 2
place Descriptive1
<chr> <dbl>
1 grocery_pharmacy -2.31
2 parks 4.17
3 residential 1.67
4 retail_recreation -1.09
5 transit_stations -2.70
6 workplaces -9.46
7 total_cases 0
8 fatalities 0
#with two functions
descriptives3(tmp2, list(mean, sd, min, max))
# A tibble: 8 x 5
place Descriptive1 Descriptive2 Descriptive3 Descriptive4
<chr> <dbl> <dbl> <dbl> <dbl>
1 grocery_pharmacy -2.31 10.1 -17.1 8.18
2 parks 4.17 13.0 -7.86 25.6
3 residential 1.67 3.55 -2.27 6.90
4 retail_recreati~ -1.09 10.8 -19.7 13.6
5 transit_stations -2.70 17.1 -28.5 9.46
6 workplaces -9.46 29.6 -53.7 12.9
7 total_cases 0 0 0 0
8 fatalities 0 0 0 0
To learn more about how to plot this data with iterations, check out Joanna’s blog here: Post #3 by Joanna.