Notes from the Messyverse: How to tidy nested lists in R
Hi! I'm Ryan Moore, NBA fan & PhD candidate in Eric Wommack's viral ecology lab @ UD. Follow me on Twitter!
You’re a fairly recent convert to the Tidyverse, and you’re still using an unholy amalgamation of Tidy verbs and base R throughout your code. What’s more, you’ve got reams of legacy code that’s not going to magically tidy itself up anytime soon. Don’t feel bad about your heathen love of the apply family (after all, even Hadley says that `map` is just a fancier version of `lapply`). Rather, embrace those old-school lists of lists! All your nice, modern tibbles are just a few short verbs away.
So you’re walking down the hall to the office breakroom when you overhear a conversation some of your colleagues are having. You’re coming closer and things must be starting to get a little intense in there. One of them starts to shout. “By Jove, it’s lists all the way down!” By this time you’re starting to hurry…are they really talking about…? You fly around the corner and see a grad student with Professor Gamgee and his flowing grey beard huddled in front of an old CRT workstation, with what looks like–gasp–an Emacs buffer filled top to bottom with–no, wait is that a JSON file?–the biggest, most nested, list-of-lists you’ve ever seen!
“Oh good, you made it,” one of them says.
“Who, me?”
“Yes, you’re the one who makes all those pretty graphs right? With something called the ggplot?”
You cough, “Maybe a little.”
“Well take a seat then! Take a seat!”
The CRT flickers as you take the offered chair. Gamgee gives it a quick thwack. It clears up, and you give it a look:
“What’s with the all-caps variable names?”
“Huh? Oh that, well, you see, I used to love Common Lisp, and, well, the reader was always case converting, so….”
“Wow, Common Lisp, eh? I guess, that explains the Emacs….”
“Alright kid, so we’ve got the watermelon data in…you know the experiment, right? No? Well, we were testing out some new fertilizer on our melons. As you can see, we did a couple of different experiments. Each experiment has two groups. The `Control` group used the standard husbandry procedures, and the `Treatment` group got our new fertilization strategy. Got it? Alright then, we’ll leave the plotting to you. See you in an hour or so.”
Stomach growling, you shoot a sideways glance at the fridge. Soups and sandwiches will just have to wait.
First things first, you get rid of all those upcased variable names. A few quick Emacs macros and you’re in business:
Being a lover of all things Tidy, you think, “Wow! lists are okay, but I really prefer tibbles….it’s soo easy to plot them with ggplot!” You remember someone mentioning a function to convert untidy things to tibbles. What was that again? Oh yeah, `as_tibble`. Well, why not give it a try? You decide to take it step-by-step so you pull out the first experiment and work with that to start.
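Here is a rough sketch of that first attempt. The original watermelon data isn't shown here, so the `experiment_a` list below (its sample names, measurement names, and numbers) is invented purely for illustration.

```r
library(tibble)

# Hypothetical stand-in for the first experiment (names and numbers are invented).
experiment_a <- list(
  control   = list(weight_kg = 5.1, diameter_cm = 24.0),
  treatment = list(weight_kg = 6.3, diameter_cm = 27.5)
)

# as_tibble() treats each top-level element as a column,
# so every sublist ends up as its own list-column.
first_try <- as_tibble(experiment_a)
names(first_try)
```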
Hmm, that’s not quite what you want–it looks like there is a column for each list. You seem to remember a function for applying other functions to elements of vectors. Ah yes, it’s called `map` from the purrr package. But wait, you think, I have lists not vectors, and the title of the help page is clear:
Apply a function to each element of a vector
It turns out that lists are just vectors. (Try running `list() %>% is.vector`.) Oh and look at that, there are special versions of `map` called `map_dfr` and `map_dfc`, which return data frames by row-binding and column-binding respectively.
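If you'd rather check the "lists are vectors" claim without loading the pipe, a base R sketch like this does the job (the variable name is just for illustration):

```r
# A list is a generic vector, so purrr's "vector" wording covers it.
empty_list <- list()

is.vector(empty_list)  # TRUE: lists count as vectors
is.atomic(empty_list)  # FALSE: but they are not atomic vectors like c(1, 2, 3)
```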
In general, `map` returns a vector the same length as its first argument (e.g., `map` returns a list, `map_int` returns an integer vector, and `map_dfr` returns a data frame by row-binding). Since you want each of the nested lists to be a row in your data frame, you’ll need `map_dfr`. But what kind of function do you need to apply to each of the lists? It turns out that you need a function that returns its argument as is, something like `function(l) l`. You give that a try.
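A minimal sketch of that attempt, again using an invented `experiment_a` in place of the real data. Note that `map_dfr` relies on dplyr being installed, and recent purrr releases mark it as superseded, though it still works.

```r
library(purrr)  # map_dfr() needs dplyr installed for the row-binding step

# Hypothetical stand-in for the first experiment (names and numbers are invented).
experiment_a <- list(
  control   = list(weight_kg = 5.1, diameter_cm = 24.0),
  treatment = list(weight_kg = 6.3, diameter_cm = 27.5)
)

# Return each sublist untouched and let map_dfr() row-bind the results.
tidy_a <- map_dfr(experiment_a, function(l) l)
tidy_a
```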
Hey that worked! But writing that whole `function(l) l` seems kind of unnecessary, so you wonder if there is a better way. The help page for `map` mentions that you can write anonymous functions with formulas (e.g., `~ .x + 2` would be converted to `function(x) x + 2`). Given your love of syntax sugar, you give it a shot.
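Something along these lines, with the same made-up `experiment_a` standing in for the real data. The `map_dbl` line is only there to show the help page's `~ .x + 2` example in action.

```r
library(purrr)  # map_dfr() needs dplyr installed for the row-binding step

# The help-page style formula: ~ .x + 2 behaves like function(x) x + 2.
plus_two <- map_dbl(c(1, 2, 3), ~ .x + 2)

# Hypothetical stand-in for the first experiment (names and numbers are invented).
experiment_a <- list(
  control   = list(weight_kg = 5.1, diameter_cm = 24.0),
  treatment = list(weight_kg = 6.3, diameter_cm = 27.5)
)

# ~ .x is the formula spelling of function(l) l.
tidy_a <- map_dfr(experiment_a, ~ .x)
tidy_a
```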
Now, you could write `~ .` since it is just a single argument function, but Hadley says you should avoid this. The `~ .x` thing looks kind of cool, but it is definitely a little obscure for someone who isn’t too familiar with the Tidyverse.
Just then, you remember that base R has a function called `identity`, which returns its argument as is. Identity functions are much beloved by functional programmers and mathematicians alike, and Tidyverse feels pretty functional. Not to mention that it’s clearer to be explicit about what you’re doing rather than to use sweet syntax sugar. So you adjust your code once more.
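Sketched with the same invented `experiment_a`, the adjusted version might look like this:

```r
library(purrr)  # map_dfr() needs dplyr installed for the row-binding step

# Hypothetical stand-in for the first experiment (names and numbers are invented).
experiment_a <- list(
  control   = list(weight_kg = 5.1, diameter_cm = 24.0),
  treatment = list(weight_kg = 6.3, diameter_cm = 27.5)
)

# base::identity() hands each sublist back unchanged.
tidy_a <- map_dfr(experiment_a, identity)
tidy_a
```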
That’s not too shabby. But wait…isn’t it kind of overkill to use `map_dfr` if you’re just passing in the identity function anyway? Back on the help page you see this:
`map_dfr()` and `map_dfc()` return data frames created by row-binding and column-binding respectively.
So if you’re only passing in the identity function to `map_dfr`, then really you’re just exploiting `map_dfr` for its ability to make data frames (or tibbles!) through row-binding. In that case, couldn’t you just use `bind_rows` directly?
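A quick sketch of the direct route, using the same made-up `experiment_a` as a stand-in for the real data:

```r
library(dplyr)

# Hypothetical stand-in for the first experiment (names and numbers are invented).
experiment_a <- list(
  control   = list(weight_kg = 5.1, diameter_cm = 24.0),
  treatment = list(weight_kg = 6.3, diameter_cm = 27.5)
)

# Each named sublist becomes one row; no map needed.
tidy_a <- bind_rows(experiment_a)
tidy_a
```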
Yes! Now this is programming!
The `bind_rows` trick works with your data because each sublist has names.
If you remove the names and try it again, you’ll see that `bind_rows` doesn’t work anymore.
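One way to see this for yourself, assuming a reasonably recent dplyr: strip the names from each sublist of an invented `experiment_a` and catch whatever `bind_rows` throws. The exact error message varies by dplyr version, so the sketch only checks that an error occurred.

```r
library(dplyr)

# Hypothetical stand-in for the first experiment (names and numbers are invented).
experiment_a <- list(
  control   = list(weight_kg = 5.1, diameter_cm = 24.0),
  treatment = list(weight_kg = 6.3, diameter_cm = 27.5)
)

# Drop the measurement names inside each sublist.
unnamed_a <- lapply(experiment_a, unname)

# Without names, bind_rows() has no column names to work with and fails.
result <- tryCatch(bind_rows(unnamed_a), error = function(e) e)
inherits(result, "error")
```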
Luckily, your colleagues are very well-organized and each of the lists you were given has names, so you stick with the `bind_rows` method.
Now, you’re thinking that it would be a good idea to include the sample name and the experiment name in the tibble. That will make it easier to color and group elements of the figure. For that, you use mutate.
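A minimal sketch of that step. The data and the new column names (`sample`, `experiment`) are assumptions made for illustration, not the names used in the real analysis.

```r
library(dplyr)

# Hypothetical stand-in for the first experiment (names and numbers are invented).
experiment_a <- list(
  control   = list(weight_kg = 5.1, diameter_cm = 24.0),
  treatment = list(weight_kg = 6.3, diameter_cm = 27.5)
)

# Row-bind the sublists, then record which sample and experiment each row came from.
tidy_a <- bind_rows(experiment_a) %>%
  mutate(
    sample     = names(experiment_a),  # one name per row, in list order
    experiment = "experiment_a"
  )
tidy_a
```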
While that’s a good solution for a single experiment, you remember that you’ve got a whole list of experiments. Remember that `map` will return something that is the same size as your input, which will be the `experiments` list.
Now, that list contains three experiments, each of which is just like your `experiment_a` test case. What you want to do is map that little pipeline that you used to convert `experiment_a` into a tibble onto each of the lists in `experiments`. Sound good? But first, you decide to encapsulate the process into a named function so it’s easier to work with.
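The wrapper might look roughly like this. The function name `experiment_to_tibble`, its argument names, and the sample data are all hypothetical.

```r
library(dplyr)

# Hypothetical helper: turn one experiment (a list of named sublists) into a tibble.
experiment_to_tibble <- function(samples, experiment_name) {
  bind_rows(samples) %>%
    mutate(sample = names(samples), experiment = experiment_name)
}

# Hypothetical stand-in for the first experiment (names and numbers are invented).
experiment_a <- list(
  control   = list(weight_kg = 5.1, diameter_cm = 24.0),
  treatment = list(weight_kg = 6.3, diameter_cm = 27.5)
)

tidy_a <- experiment_to_tibble(experiment_a, "experiment_a")
tidy_a
```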
Now you’re ready to map this function onto each of the elements of the `experiments` list.
Whoops, that’s not quite right. Back to the `map` help page. Okay, so it looks like you can pass additional arguments to the mapped function. Something like this:
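In sketch form, with a hypothetical `experiment_to_tibble` helper and an invented three-experiment list, anything after the function is forwarded unchanged to every call:

```r
library(dplyr)
library(purrr)

# Hypothetical helper: turn one experiment (a list of named sublists) into a tibble.
experiment_to_tibble <- function(samples, experiment_name) {
  bind_rows(samples) %>%
    mutate(sample = names(samples), experiment = experiment_name)
}

# Hypothetical stand-in for the full data set (names and numbers are invented).
experiments <- list(
  experiment_a = list(control = list(weight_kg = 5.1), treatment = list(weight_kg = 6.3)),
  experiment_b = list(control = list(weight_kg = 4.8), treatment = list(weight_kg = 6.0)),
  experiment_c = list(control = list(weight_kg = 5.4), treatment = list(weight_kg = 6.9))
)

# The extra argument is the same fixed string for every experiment.
all_samples <- map_dfr(experiments, experiment_to_tibble, experiment_name = "some experiment")
all_samples
```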
It worked at least, but you want the actual experiment names in there. So you try something else:
Nope, that doesn’t work either. Hold on…what you need is a function sort of like `map` except that it can map over multiple inputs instead of just one. You pop over to your web browser and search for “map over multiple arguments tidyverse”.
The purrr package has a function for this: `map2`. It works more or less the same way as the regular `map` variants, except that you can iterate over two arguments simultaneously. This way, you can pass in the experiment names as an additional argument before the function argument. (Arguments that come before the function argument `.f` are iterated over in parallel, while arguments coming after `.f` are supplied to each call as is.) Something like this:
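A rough version, again with a hypothetical helper and invented data. Here `names(experiments)` is walked in step with `experiments` itself, so each call gets its own experiment name.

```r
library(dplyr)
library(purrr)

# Hypothetical helper: turn one experiment (a list of named sublists) into a tibble.
experiment_to_tibble <- function(samples, experiment_name) {
  bind_rows(samples) %>%
    mutate(sample = names(samples), experiment = experiment_name)
}

# Hypothetical stand-in for the full data set (names and numbers are invented).
experiments <- list(
  experiment_a = list(control = list(weight_kg = 5.1), treatment = list(weight_kg = 6.3)),
  experiment_b = list(control = list(weight_kg = 4.8), treatment = list(weight_kg = 6.0)),
  experiment_c = list(control = list(weight_kg = 5.4), treatment = list(weight_kg = 6.9))
)

# .x is each experiment, .y is its name; both are passed positionally to the helper.
all_samples <- map2_dfr(experiments, names(experiments), experiment_to_tibble)
all_samples
```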
That works, but wouldn’t it be nice if there was a `map` variant that could apply a function to each element in `experiments` along with the name of the experiment automatically? You flip over to Google once more and are rewarded with the `imap` function. The first line of the help page looks promising:
Apply a function to each element of a vector, and its index
As before, you want to iterate over a list and not a vector, but that’s okay, since you know that deep down, lists are just special vectors. In fact, the help page assures you that all will be well:
imap_xxx(x, …), an indexed map, is short hand for map2(x, names(x), …) if x has names
That’s perfect for your use case. In fact, that’s exactly what you were doing, so you just replace the `map2_dfr` with `imap_dfr` and drop the `names(experiments)` bit like so:
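Here is how that swap could look, using the same hypothetical helper and made-up data as a stand-in for the real thing:

```r
library(dplyr)
library(purrr)

# Hypothetical helper: turn one experiment (a list of named sublists) into a tibble.
experiment_to_tibble <- function(samples, experiment_name) {
  bind_rows(samples) %>%
    mutate(sample = names(samples), experiment = experiment_name)
}

# Hypothetical stand-in for the full data set (names and numbers are invented).
experiments <- list(
  experiment_a = list(control = list(weight_kg = 5.1), treatment = list(weight_kg = 6.3)),
  experiment_b = list(control = list(weight_kg = 4.8), treatment = list(weight_kg = 6.0)),
  experiment_c = list(control = list(weight_kg = 5.4), treatment = list(weight_kg = 6.9))
)

# imap_dfr() supplies each element's name as the second argument automatically.
all_samples <- imap_dfr(experiments, experiment_to_tibble)
all_samples
```

Because `experiments` has names, this gives the same rows as the `map2_dfr` call with `names(experiments)` spelled out.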
Alright, now you’re cooking with Crisco! Finally you’ve got a nice tibble ready to pipe into ggplot.
Phew! That wasn’t so bad, was it? Right on cue, Professor Gamgee turns the corner into the breakroom and sits down at your table. “How’s the data looking, kid?”
“Tidy, professor. It’s looking Tidy.”