Post 2 - How to Subset Data and Output Basic Descriptives

A tutorial on writing a function to group your data and generate basic descriptives.

Stephanie Gluck & Akhila Nekkanti https://s-gluck.github.io/funprog_blog
06-08-2020

Table of Contents


Introduction

For the tutorial, we will use the COVID Community Measures dataset from Kaggle. We hope that you will be able to generalize the function we covered to be applicable to your own dataset. The current dataset contains Community Mobility Reports describing community activities in a variety of settings from Google and the incidence of COVID-19 diagnosis and fatalities from John Hopkins CSSE.

Community activity is defined as the change in activity from the baseline days. A baseline day is the median value from the 5‑week period between January 3rd to February 6th, 2020.

Loading the Data

Please refer to our blogpost on how to automate your data loading. You can check out the post here

To create a function that can read in our dataframe and output descriptives, we will need the following packages. If you have not already isntalled them, please use install.packages(" ") prior to calling library().


#At later time, identify what aspects of each library are used for the function
library(tidyverse) 
library(janitor)
library(lubridate)
library(purrr)

Once the data is loaded, we can look at the dataframe by calling reactable() on our dataframe. Notice that you can sort each column by clicking on the column header.

We can also look at each of the variable names by calling the R base function names().


names(d)

 [1] "iso"               "country"           "date"             
 [4] "grocery_pharmacy"  "parks"             "residential"      
 [7] "retail_recreation" "transit_stations"  "workplaces"       
[10] "total_cases"       "fatalities"       

& we can look at all the different countries:


#using tidyverse

d %>% 
  count(country, 
        sort = TRUE)

# A tibble: 19 x 2
   country            n
   <chr>          <int>
 1 Argentina         43
 2 Australia         43
 3 Brazil            43
 4 Canada            43
 5 France            43
 6 Germany           43
 7 India             43
 8 Indonesia         43
 9 Italy             43
10 Japan             43
11 Mexico            43
12 Saudi Arabia      43
13 South Africa      43
14 South Korea       43
15 Spain             43
16 Sweden            43
17 Turkey            43
18 United Kingdom    43
19 US                43

Function Tutorial

Subset Data

For this dataset, it would be helpful to create a ‘week’ variable that extracts each day of the week, so we can group our descriptives by week which would allow us to examine for example the average change in activities across weeks for different settings. Here, we’ve also created a ‘day_of_week’ variable to double-check our work.

If you are working with date-time variables in your data, the lubridate package is quite powerful at manipulating date-time data.


#cut.Date, labels = TRUE will print actual day of the week, e.g., Sunday (1)

#use lubridate wday and base R cut.Date

#?cut.Date
#?wday

d <- d %>% 
  mutate(week = cut.Date(date, "week", start.on.monday = FALSE, labels = FALSE), 
         day_of_week = wday(date, label = TRUE))

#taking a look at our new variables `week` and `day_of_week`
d %>% select(country, date, week, day_of_week)

# A tibble: 817 x 4
   country   date        week day_of_week
   <chr>     <date>     <int> <ord>      
 1 Argentina 2020-02-23     1 Sun        
 2 Argentina 2020-02-24     1 Mon        
 3 Argentina 2020-02-25     1 Tue        
 4 Argentina 2020-02-26     1 Wed        
 5 Argentina 2020-02-27     1 Thu        
 6 Argentina 2020-02-28     1 Fri        
 7 Argentina 2020-02-29     1 Sat        
 8 Argentina 2020-03-01     2 Sun        
 9 Argentina 2020-03-02     2 Mon        
10 Argentina 2020-03-03     2 Tue        
# ... with 807 more rows

Now that we’ve created our week variable, we can use tidyr::nest to group our dataset by week, by country, or both. This will allow us to loop our function through each grouping variable and provide unique descriptives for each.

We will nest our data by the variables country and week


#week 7 only has one day  -- filter out here if it becomes a problem

d2 <- d %>%
  group_by(country, week) %>% 
  nest()

print(d2)

# A tibble: 133 x 3
# Groups:   country, week [133]
   country    week data             
   <chr>     <int> <list>           
 1 Argentina     1 <tibble [7 x 11]>
 2 Argentina     2 <tibble [7 x 11]>
 3 Argentina     3 <tibble [7 x 11]>
 4 Argentina     4 <tibble [7 x 11]>
 5 Argentina     5 <tibble [7 x 11]>
 6 Argentina     6 <tibble [7 x 11]>
 7 Argentina     7 <tibble [1 x 11]>
 8 Australia     1 <tibble [7 x 11]>
 9 Australia     2 <tibble [7 x 11]>
10 Australia     3 <tibble [7 x 11]>
# ... with 123 more rows

After nesting, you see that our new dataframe now contains a list-column named data. We can call the Argentina Week 1 data tibble by specifying d2$data[[1]] to look at it in more details.


d2$data[[1]]

# A tibble: 7 x 11
  iso   date       grocery_pharmacy parks residential retail_recreati~
  <chr> <date>                <dbl> <dbl>       <dbl>            <dbl>
1 AR    2020-02-23            8.18  17.6        0.437           13.6  
2 AR    2020-02-24          -15.9   25.6        6.32            -9.97 
3 AR    2020-02-25          -17.1    1.32       6.90           -19.7  
4 AR    2020-02-26            2.30   5.40      -2.27             0.845
5 AR    2020-02-27           -0.404 -5.27      -0.493           -0.584
6 AR    2020-02-28            2.08  -7.86      -0.577            2.88 
7 AR    2020-02-29            4.68  -7.58       1.41             5.26 
# ... with 5 more variables: transit_stations <dbl>,
#   workplaces <dbl>, total_cases <dbl>, fatalities <dbl>,
#   day_of_week <ord>

You can see that inside the 7 x 11 tibble for Argentina week 1, the community activities, diagnosis, and fatalities data from that week (week of February 23 to February 29, 2020) are listed.

We can also perform other functions to check our nested dataframe:


#checking length of variables
map_dbl(d2$data, nrow)

  [1] 7 7 7 7 7 7 1 7 7 7 7 7 7 1 7 7 7 7 7 7 1 7 7 7 7 7 7 1 7 7 7 7
 [33] 7 7 1 7 7 7 7 7 7 1 7 7 7 7 7 7 1 7 7 7 7 7 7 1 7 7 7 7 7 7 1 7
 [65] 7 7 7 7 7 1 7 7 7 7 7 7 1 7 7 7 7 7 7 1 7 7 7 7 7 7 1 7 7 7 7 7
 [97] 7 1 7 7 7 7 7 7 1 7 7 7 7 7 7 1 7 7 7 7 7 7 1 7 7 7 7 7 7 1 7 7
[129] 7 7 7 7 1

map_dbl(d2$data, ncol)

  [1] 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
 [22] 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
 [43] 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
 [64] 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
 [85] 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
[106] 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
[127] 11 11 11 11 11 11 11

#if you want to find the mean for just one specific column, by each country and week. 
#Here we look at the mean for residential for each country across all 7 weeks. 
avg <- map_dbl(d2$data, ~mean(.x$residential))
avg

  [1]  1.67414286 -0.63714286  0.59157143 15.39671429 29.97485714
  [6] 28.50571429 25.87300000 -1.32342857 -0.81500000  0.44171429
 [11]  4.16142857 12.63700000 17.63757143 14.31900000  4.18628571
 [16] -0.01671429 -0.76614286  7.83928571 20.75071429 18.92214286
 [21] 14.97400000  0.64442857  0.41400000  1.76100000 13.42957143
 [26] 20.44085714 21.86371429 14.08800000  2.00785714  0.97028571
 [31]  2.10400000 10.38700000 15.75300000 15.62085714  7.61100000
 [36] -0.80771429 -0.45228571  3.36014286 25.70114286 28.44842857
 [41] 30.36214286 22.64600000  2.45942857  1.76314286  2.05214286
 [46] 23.13114286 30.17714286 29.84257143 17.48900000  0.56071429
 [51]  0.14157143  1.28285714  8.34442857 21.51128571 24.88914286
 [56] 14.84200000  0.88385714  0.44857143  0.83642857  7.05357143
 [61] 14.78771429 15.42585714 14.35400000  0.10814286  0.35414286
 [66]  2.92942857  4.64042857 26.67171429 28.29157143 21.09400000
 [71]  4.16500000  5.45200000 18.36900000 28.09785714 31.50657143
 [76] 30.93385714 24.03100000  3.39742857  4.30500000  4.78357143
 [81]  5.13257143  4.07328571  6.58700000  7.80600000 10.03400000
 [86]  9.88071429  8.51985714  7.77971429  7.38671429  6.53257143
 [91]  5.72100000 -1.15700000 -1.24385714 -0.10271429  6.35128571
 [96] 13.76957143 16.55714286 12.77800000  1.53228571  1.35285714
[101]  4.75042857 12.01185714 21.12371429 22.34842857 23.31700000
[106]  0.56885714  0.85028571  2.46128571  8.77828571  9.42542857
[111] 10.07814286  5.71000000 -0.72157143 -1.80671429 -1.57442857
[116]  9.76814286 18.24400000 21.63442857 18.88000000 -0.73600000
[121] -0.88142857  1.00300000 10.96157143 17.33842857 18.50142857
[126] 12.55000000 -0.52328571 -0.42128571 -0.93985714  6.29614286
[131] 15.84242857 33.08314286 22.35100000

#if we want to create a country name and week index
index <- d2 %>% select(country, week)
index

# A tibble: 133 x 2
# Groups:   country, week [133]
   country    week
   <chr>     <int>
 1 Argentina     1
 2 Argentina     2
 3 Argentina     3
 4 Argentina     4
 5 Argentina     5
 6 Argentina     6
 7 Argentina     7
 8 Australia     1
 9 Australia     2
10 Australia     3
# ... with 123 more rows

#we can create a tibble of our index with our residential average data
avg_tibble <- tibble(index, `residential mean` = avg)
avg_tibble

# A tibble: 133 x 2
   index$country $week `residential mean`
   <chr>         <int>              <dbl>
 1 Argentina         1              1.67 
 2 Argentina         2             -0.637
 3 Argentina         3              0.592
 4 Argentina         4             15.4  
 5 Argentina         5             30.0  
 6 Argentina         6             28.5  
 7 Argentina         7             25.9  
 8 Australia         1             -1.32 
 9 Australia         2             -0.815
10 Australia         3              0.442
# ... with 123 more rows

Descriptives with purrr:map and Custom Function

In order to figure out how to create a function that can be used for any variable in a dataset, we first need to do it for one.


mean(d2$data[[1]]$residential)

[1] 1.674143

Next, we can attempt to loop through each cell in the ‘data’ column (that is, grouped by country and week) to find the residential means for each.


residential_mean <- d2 %>% 
  mutate(mean = map_dbl(data, ~mean(.x$residential)))

print(residential_mean)

# A tibble: 133 x 4
# Groups:   country, week [133]
   country    week data                mean
   <chr>     <int> <list>             <dbl>
 1 Argentina     1 <tibble [7 x 11]>  1.67 
 2 Argentina     2 <tibble [7 x 11]> -0.637
 3 Argentina     3 <tibble [7 x 11]>  0.592
 4 Argentina     4 <tibble [7 x 11]> 15.4  
 5 Argentina     5 <tibble [7 x 11]> 30.0  
 6 Argentina     6 <tibble [7 x 11]> 28.5  
 7 Argentina     7 <tibble [1 x 11]> 25.9  
 8 Australia     1 <tibble [7 x 11]> -1.32 
 9 Australia     2 <tibble [7 x 11]> -0.815
10 Australia     3 <tibble [7 x 11]>  0.442
# ... with 123 more rows

Finally, we can generalize this further to loop through each cell in the data column, to get the mean for each variable in the data frame (not just residential).

  1. First, as with any function, we will attempt to create it for just one data frame. Let’s call the contents of the first dataset (Argentina Week1) in our nested list-column data frame d2.

tmp <- d2[[3]][[1]]
tmp

# A tibble: 7 x 11
  iso   date       grocery_pharmacy parks residential retail_recreati~
  <chr> <date>                <dbl> <dbl>       <dbl>            <dbl>
1 AR    2020-02-23            8.18  17.6        0.437           13.6  
2 AR    2020-02-24          -15.9   25.6        6.32            -9.97 
3 AR    2020-02-25          -17.1    1.32       6.90           -19.7  
4 AR    2020-02-26            2.30   5.40      -2.27             0.845
5 AR    2020-02-27           -0.404 -5.27      -0.493           -0.584
6 AR    2020-02-28            2.08  -7.86      -0.577            2.88 
7 AR    2020-02-29            4.68  -7.58       1.41             5.26 
# ... with 5 more variables: transit_stations <dbl>,
#   workplaces <dbl>, total_cases <dbl>, fatalities <dbl>,
#   day_of_week <ord>
  1. Now you’ll notice that some of the columns are not numeric. So we will create a second temporary dataset that calls only the numeric columns by using map_lgl.

#map_lgl will first return a TRUE/FALSE of whether a column is numeric
map_lgl(tmp, is.numeric)

              iso              date  grocery_pharmacy 
            FALSE             FALSE              TRUE 
            parks       residential retail_recreation 
             TRUE              TRUE              TRUE 
 transit_stations        workplaces       total_cases 
             TRUE              TRUE              TRUE 
       fatalities       day_of_week 
             TRUE             FALSE 

#Can use map_lgl to subset our tmp data for only the numeric columns
tmp2 <- tmp[map_lgl(tmp, is.numeric)]
tmp2

# A tibble: 7 x 8
  grocery_pharmacy parks residential retail_recreati~ transit_stations
             <dbl> <dbl>       <dbl>            <dbl>            <dbl>
1            8.18  17.6        0.437           13.6               5.33
2          -15.9   25.6        6.32            -9.97            -26.8 
3          -17.1    1.32       6.90           -19.7             -28.5 
4            2.30   5.40      -2.27             0.845             9.46
5           -0.404 -5.27      -0.493           -0.584             7.98
6            2.08  -7.86      -0.577            2.88              7.97
7            4.68  -7.58       1.41             5.26              5.60
# ... with 3 more variables: workplaces <dbl>, total_cases <dbl>,
#   fatalities <dbl>
  1. Okay, so now that we have a dataframe with only numeric columns. Let’s write a function that gives us just the mean for each column, using map_dbl or map_df.

#map_dbl - returns a vector 
map_dbl(tmp2, mean, na.rm = TRUE)

 grocery_pharmacy             parks       residential 
        -2.309286          4.169143          1.674143 
retail_recreation  transit_stations        workplaces 
        -1.091857         -2.703429         -9.460286 
      total_cases        fatalities 
         0.000000          0.000000 

#map_df - returns a data frame
map_df(tmp2, mean, na.rm = TRUE)

# A tibble: 1 x 8
  grocery_pharmacy parks residential retail_recreati~ transit_stations
             <dbl> <dbl>       <dbl>            <dbl>            <dbl>
1            -2.31  4.17        1.67            -1.09            -2.70
# ... with 3 more variables: workplaces <dbl>, total_cases <dbl>,
#   fatalities <dbl>
  1. Now we’re ready to create a function that does all of the above, for each column!

Let’s start by creating a list of all the operations that we will later supply to our function to run with our dataset. We will call the list funs_to_apply. Here, we are specifying that we want the total n for each column, the mean, standard deviation sd, min, and max.

Remember, if you want to create a function that you can share with members of your team, you should not assign this list to an object. Instead you can input the list directly into the function, as you’ll see at the very end of this post.


listof_fun <- list(
  n = function(x) length(x),
  mean = function(x) mean(x, na.rm = TRUE),
  sd = function(x) sd(x, na.rm = TRUE),
  min = function(x) min(x, na.rm = TRUE),
  max = function(x) max(x, na.rm = TRUE)
  )
  1. Let’s take the functions we wrote in steps 2 and 3. We can change the object names to make them more descriptive. Above we used tmp because that was the specific dataset we were interested in making numeric. Now we can generalize to any dataset by replacing tmp with df:

numeric_df <- df[map_lgl(df, is.numeric)]

Here, we are using df; once we put this mini-function into our larger function, df will refer to all the datasets within our nested dataframe (d2). (This will make more sense later)

Next, we’ve broadened the function from step 3 so that it can take any dataframe and function as an input, and loop that function throughout the dataset.


#estimating just one function
est_one_fun <- function(df, f) {
 map_dbl(df, ~f(.x))
}
  1. Almost there! Now let’s create our larger function. We’ll assign it to the name descriptives. Our arguments are:

descriptives <- function(df, funs_to_apply = listof_fun) {
  1. Great, now let’s add in the numeric_df function we created in step 5.

descriptives <- function(df, funs_to_apply = listof_fun) {
  
  numeric_df <- df[map_lgl(df, is.numeric)] 
  1. Now comes the tricky part. We will use map_df AND our est_one_fun function within our descriptives function. Let’s focus just on this for now. We are saying: loop through each of our operations in funs_to_apply, so that we can apply the est_one_fun function to use each of those operations on our numeric_df.

#descriptives <- function(df, funs_to_apply = listof_fun) {
  
#  numeric_df <- df[map_lgl(df, is.numeric)] 

  map_df(funs_to_apply, ~est_one_fun(numeric_df, .x)) 
  1. Phew! Now let’s use a bit of dplyr code to make our output look nicer. We are using mutate to create a vector with each of the column names. Then we are using select to change the order of the output so that our names column (place) is first.

descriptives <- function(df, funs_to_apply = listof_fun) {
  
  numeric_df <- df[map_lgl(df, is.numeric)] 

  map_df(funs_to_apply, ~est_one_fun(numeric_df, .x)) %>% 
    mutate(place = names(numeric_df)) %>% 
    select(place, everything())
}
  1. Before we celebrate…let’s test it! First, on a single dataframe. We can use the one we created in step 1.

descriptives(tmp)

# A tibble: 8 x 6
  place                 n  mean    sd    min   max
  <chr>             <dbl> <dbl> <dbl>  <dbl> <dbl>
1 grocery_pharmacy      7 -2.31 10.1  -17.1   8.18
2 parks                 7  4.17 13.0   -7.86 25.6 
3 residential           7  1.67  3.55  -2.27  6.90
4 retail_recreation     7 -1.09 10.8  -19.7  13.6 
5 transit_stations      7 -2.70 17.1  -28.5   9.46
6 workplaces            7 -9.46 29.6  -53.7  12.9 
7 total_cases           7  0     0      0     0   
8 fatalities            7  0     0      0     0   
  1. It works!! But can we make it work with our nested dataframe? We’ll have to use map again, so that we can loop our descriptives function to each element of our list. Let’s pipe everything to d2 and create a new column called Output that contains the output from our function.

Descriptive_output <- d2 %>%
  mutate(Output = map(data, descriptives)) %>% 
  select(-data)               %>% 
  unnest(Output)

Note, we are using data within map because everything is piped to d2. Next, we deselect data and unnest so that all we see is our output, country, and week.

WOOHOO!! WE DID IT.

Now let’s take a look at some extra perks.

  1. If you want to create a function that you can share with your team without requiring them to create extra objects (e.g., listof_fun), you can input the list directly into the function like below:

descriptives2 <- function(df, funs_to_apply = list(
                                               n = function(x) length(x),
                                               mean = function(x) mean(x, na.rm = TRUE),
                                               sd = function(x) sd(x, na.rm = TRUE),
                                               min = function(x) min(x, na.rm = TRUE),
                                               max = function(x) max(x, na.rm = TRUE)
  )) {
  
  numeric_df <- df[map_lgl(df, is.numeric)] 

  map_df(funs_to_apply, ~est_one_fun(numeric_df, .x)) %>% 
    mutate(place = names(numeric_df)) %>% 
    select(place, everything())
}

Let’s test it out!


Descriptive_output1 <- d2 %>%
  mutate(Output = map(data, descriptives2)) %>% 
  select(-data) %>% 
  unnest(Output)

Descriptive_output1

# A tibble: 1,064 x 8
# Groups:   country, week [133]
   country    week place                 n  mean    sd    min    max
   <chr>     <int> <chr>             <dbl> <dbl> <dbl>  <dbl>  <dbl>
 1 Argentina     1 grocery_pharmacy      7 -2.31 10.1  -17.1   8.18 
 2 Argentina     1 parks                 7  4.17 13.0   -7.86 25.6  
 3 Argentina     1 residential           7  1.67  3.55  -2.27  6.90 
 4 Argentina     1 retail_recreation     7 -1.09 10.8  -19.7  13.6  
 5 Argentina     1 transit_stations      7 -2.70 17.1  -28.5   9.46 
 6 Argentina     1 workplaces            7 -9.46 29.6  -53.7  12.9  
 7 Argentina     1 total_cases           7  0     0      0     0    
 8 Argentina     1 fatalities            7  0     0      0     0    
 9 Argentina     2 grocery_pharmacy      7  4.84  1.93   2.87  7.63 
10 Argentina     2 parks                 7 -7.27  3.31 -10.5  -0.579
# ... with 1,054 more rows

We did it again! Woohoo!

  1. Now let’s modify the function so that the person using it can specify exactly what operations they want in their output.

descriptives3 <- function(df, funs_to_apply = list(
                                               n = function(x) length(x),
                                               mean = function(x) mean(x, na.rm = TRUE),
                                               sd = function(x) sd(x, na.rm = TRUE),
                                               min = function(x) min(x, na.rm = TRUE),
                                               max = function(x) max(x, na.rm = TRUE)
  )) {
  
  if(!typeof(funs_to_apply) == "list") { #a
    funs_to_apply <- list(funs_to_apply) 
  }
  if(is.null(names(funs_to_apply))) { #b
    names(funs_to_apply) <- paste0("Descriptive", seq_along(funs_to_apply))
  }
  
  numeric_df <- df[map_lgl(df, is.numeric)] 

  map_df(funs_to_apply, ~est_one_fun(numeric_df, .x)) %>% 
    mutate(place = names(numeric_df)) %>% 
    select(place, everything())
}

Again, let’s test it out.


# with one function
descriptives3(tmp2, mean)

# A tibble: 8 x 2
  place             Descriptive1
  <chr>                    <dbl>
1 grocery_pharmacy         -2.31
2 parks                     4.17
3 residential               1.67
4 retail_recreation        -1.09
5 transit_stations         -2.70
6 workplaces               -9.46
7 total_cases               0   
8 fatalities                0   

#with two functions
descriptives3(tmp2, list(mean, sd, min, max))

# A tibble: 8 x 5
  place            Descriptive1 Descriptive2 Descriptive3 Descriptive4
  <chr>                   <dbl>        <dbl>        <dbl>        <dbl>
1 grocery_pharmacy        -2.31        10.1        -17.1          8.18
2 parks                    4.17        13.0         -7.86        25.6 
3 residential              1.67         3.55        -2.27         6.90
4 retail_recreati~        -1.09        10.8        -19.7         13.6 
5 transit_stations        -2.70        17.1        -28.5          9.46
6 workplaces              -9.46        29.6        -53.7         12.9 
7 total_cases              0            0            0            0   
8 fatalities               0            0            0            0   

CONGRATULATIONS! WE WROTE SOME REALLY COMPLICATED FUNCTIONS. You’re a rockstar, seriously.

To learn more about how to plot this data with iterations, check out Joanna’s blog here: Post #3 by Joanna.