
dplyr (0.4.3 as of September 2015) and tidyr (0.4.1 as of February 2016) are two R packages that cover most of what you need for data munging in R. Data scientists often use them together to get the most out of both.

(dplyr is the successor to the well-known plyr package, focused on data frames.)

This is a short overview of the basic capabilities provided by these packages.

The pipe operator %>%, first introduced in the magrittr package, is supported by both of these packages and quickly becomes addictive. We will discuss the pipe operator first.

The R pipe: %>%

Using the pipe makes code much easier to read, because a chain of operations is written in the order in which it runs. It also avoids creating temporary variables that are used only once.

The basic meaning of %>% is:

| R pipe code | Equivalent code |
|---|---|
| x %>% f | f(x) |
| x %>% f(y) | f(x, y) |
| x %>% f(., y) | f(x, y) |
| x %>% f(y, .) | f(y, x) |
| x %>% f(y, z = .) | f(y, z = x) |
| x %>% f %>% g | g(f(x)) |
| z <- x %>% f | z <- f(x) |

Note that the value x on the left side of %>% is passed as the first argument to the function on the right side. This default behaviour can be changed using . (a dot), which is called a placeholder.

One important thing to remember, however, is that when the . appears only inside a nested expression, the first-argument rule still applies. This behaviour can be suppressed by wrapping the right-hand side in curly braces { }. That is,
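The rows of the table can be checked directly in an R session. The following is a minimal sketch, assuming the magrittr package is installed; the vector x and the functions chosen (sum, round, paste, sqrt) are only illustrations and do not come from the original post.

```r
library(magrittr)  # provides %>%

x <- c(1.234, 5.678)  # made-up values for illustration

# x %>% f  is  f(x)
a_pipe  <- x %>% sum
a_plain <- sum(x)

# x %>% f(y)  is  f(x, y): x becomes the first argument
b_pipe  <- x %>% round(1)
b_plain <- round(x, 1)

# x %>% f(y, .)  is  f(y, x): the placeholder moves x to another position
c_pipe  <- "a" %>% paste("b", .)
c_plain <- paste("b", "a")

# x %>% f(y, z = .)  is  f(y, z = x): the placeholder as a named argument
d_pipe  <- 1 %>% round(3.14159, digits = .)
d_plain <- round(3.14159, digits = 1)

# x %>% f %>% g  is  g(f(x))
e_pipe  <- x %>% sum %>% sqrt
e_plain <- sqrt(sum(x))
```

Each *_pipe value should be identical to its *_plain counterpart.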

| R pipe code | Equivalent code |
|---|---|
| data %>% f(x = ncol(.)) | f(data, x = ncol(data)) |
| data %>% { f(x = ncol(.)) } | f(x = ncol(data)) |
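To see the difference between the two forms, here is a small sketch, assuming magrittr is installed. The helper describe() and the data frame are hypothetical and exist only to reveal which arguments the function actually receives.

```r
library(magrittr)

data <- data.frame(a = 1:3, b = 4:6)  # made-up data for illustration

# Hypothetical helper: reports whether a data frame was passed as the
# first argument, and echoes the value of x.
describe <- function(df = NULL, x) {
  list(got_df = !is.null(df), x = x)
}

# The dot is nested inside ncol(), so data is still inserted as the
# first argument: describe(data, x = ncol(data))
with_first_arg <- data %>% describe(x = ncol(.))

# Curly braces suppress the first-argument rule: describe(x = ncol(data))
without_first_arg <- data %>% { describe(x = ncol(.)) }
```

In the first call got_df is TRUE, in the second it is FALSE; x is 2 in both.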

That is all you need to know about the pipe.

dplyr

This package provides intuitive “verbs” for working with data frames in R (for example, subsetting, summarizing and rearranging data frames).

The basic verbs/functions of the dplyr package are:

| dplyr verb | Meaning |
|---|---|
| select() | keep only the columns you mention |
| rename() | rename the columns you mention |
| mutate() | add new columns and keep existing ones |
| filter() | returns subset of rows with matching conditions |
| arrange() | re-order the rows |

Let us look at a few examples to see how these verbs and the pipe can be used together. We will be using the sensor.csv data located in the repository here, so download this file into a folder first.

This data contains the columns timestamp, id, temperature, humidity and precipitation. The different sensors are identified by the id column.

We load the data first.
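A minimal sketch of the loading step follows. Because the contents of the real sensor.csv are not shown here, the example first writes a few made-up rows to a temporary file and then reads them back; the values, the sensor ids and the spelling of the temperature column are assumptions. With the downloaded file you would simply point read.csv() at its location.

```r
# Made-up rows standing in for the real sensor.csv (illustrative only)
csv_lines <- c(
  "timestamp,id,temperature,humidity,precipitation",
  "2015-09-01 00:00:00,s1,21.5,40,0.0",
  "2015-09-01 01:00:00,s1,20.0,42,0.2",
  "2015-09-01 00:00:00,s2,25.0,55,0.0",
  "2015-09-01 01:00:00,s2,26.5,53,0.0"
)
path <- file.path(tempdir(), "sensor.csv")
writeLines(csv_lines, path)

# Read the CSV into a data frame; keep text columns as character
sensor <- read.csv(path, stringsAsFactors = FALSE)

str(sensor)  # inspect column names and types
```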

Now we show the use of each of the dplyr verbs one by one.
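The sketch below applies the five verbs in turn, assuming dplyr is installed. The data frame is a small made-up stand-in for sensor.csv (values, ids and the temperature spelling are assumptions), and the new column names temp_c and temp_f are chosen only for illustration.

```r
library(dplyr)

# Made-up stand-in for the sensor data (illustrative values only)
sensor <- data.frame(
  timestamp = rep(c("2015-09-01 00:00:00", "2015-09-01 01:00:00"), 2),
  id = c("s1", "s1", "s2", "s2"),
  temperature = c(21.5, 20.0, 25.0, 26.5),
  humidity = c(40, 42, 55, 53),
  precipitation = c(0, 0.2, 0, 0),
  stringsAsFactors = FALSE
)

# select(): keep only the columns you mention
step1 <- select(sensor, timestamp, id, temperature)

# rename(): the syntax is new_name = old_name
step2 <- rename(step1, temp_c = temperature)

# mutate(): add a new column and keep the existing ones
step3 <- mutate(step2, temp_f = temp_c * 9 / 5 + 32)

# filter(): keep the rows that match the condition
step4 <- filter(step3, temp_c > 21)

# arrange(): re-order the rows, here by descending temperature
step5 <- arrange(step4, desc(temp_c))
```

Every step needs its own temporary variable, which is exactly what the pipe helps to avoid.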

In one line using the pipe, the above code would be:
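As a sketch of what such a pipeline can look like, the chain below runs select, rename, mutate, filter and arrange in one expression. It assumes dplyr is installed; the data frame is a made-up stand-in for sensor.csv, and the names temp_c and temp_f are illustrative.

```r
library(dplyr)

# Made-up stand-in for the sensor data (illustrative values only)
sensor <- data.frame(
  timestamp = rep(c("2015-09-01 00:00:00", "2015-09-01 01:00:00"), 2),
  id = c("s1", "s1", "s2", "s2"),
  temperature = c(21.5, 20.0, 25.0, 26.5),
  humidity = c(40, 42, 55, 53),
  precipitation = c(0, 0.2, 0, 0),
  stringsAsFactors = FALSE
)

# Each verb receives the result of the previous one as its first argument
result <- sensor %>%
  select(timestamp, id, temperature) %>%
  rename(temp_c = temperature) %>%
  mutate(temp_f = temp_c * 9 / 5 + 32) %>%
  filter(temp_c > 21) %>%
  arrange(desc(temp_c))
```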

Pretty clean!

tidyr

This package provides verbs to clean/tidy the data.

The main verbs here are:

| tidyr verb | Meaning |
|---|---|
| gather() | makes wide data longer |
| spread() | makes long data wider |

There are two additional verbs which are sometimes useful:

| tidyr verb | Meaning |
|---|---|
| unite() | paste together multiple columns into one |
| separate() | separate one column into several |

Here are some examples on the same sensor.csv dataset:
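The following sketch shows each of the four tidyr verbs on its own, assuming tidyr is installed. The data frame is a made-up stand-in for sensor.csv, and the new column names (measurement, value, reading_id, date, time) are chosen only for illustration.

```r
library(tidyr)

# Made-up stand-in for the sensor data (illustrative values only)
sensor <- data.frame(
  timestamp = rep(c("2015-09-01 00:00:00", "2015-09-01 01:00:00"), 2),
  id = c("s1", "s1", "s2", "s2"),
  temperature = c(21.5, 20.0, 25.0, 26.5),
  humidity = c(40, 42, 55, 53),
  precipitation = c(0, 0.2, 0, 0),
  stringsAsFactors = FALSE
)

# gather(): wide to long, one row per (timestamp, id, measurement)
long <- gather(sensor, measurement, value,
               temperature, humidity, precipitation)

# spread(): long to wide, one column per distinct measurement
wide <- spread(long, measurement, value)

# unite(): paste id and timestamp together into a single column
united <- unite(sensor, reading_id, id, timestamp, sep = "_")

# separate(): split timestamp into a date column and a time column
separated <- separate(sensor, timestamp,
                      into = c("date", "time"), sep = " ")
```

Here gather() turns 4 rows into 12 (4 readings times 3 measurements), and spread() brings the data back to 4 rows.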

Combining the above code into a single pipeline:
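One possible pipeline, sketched under the same assumptions (tidyr and magrittr installed, a made-up stand-in for sensor.csv, illustrative column names), chains separate, gather, unite and spread to produce one row per timestamp and one column per sensor and measurement. It is not necessarily the chain used in the original post.

```r
library(magrittr)  # provides %>%
library(tidyr)

# Made-up stand-in for the sensor data (illustrative values only)
sensor <- data.frame(
  timestamp = rep(c("2015-09-01 00:00:00", "2015-09-01 01:00:00"), 2),
  id = c("s1", "s1", "s2", "s2"),
  temperature = c(21.5, 20.0, 25.0, 26.5),
  humidity = c(40, 42, 55, 53),
  precipitation = c(0, 0.2, 0, 0),
  stringsAsFactors = FALSE
)

by_sensor <- sensor %>%
  # split the timestamp into date and time
  separate(timestamp, into = c("date", "time"), sep = " ") %>%
  # stack the three measurement columns into key/value pairs
  gather(measurement, value, temperature, humidity, precipitation) %>%
  # build a combined key such as "s1_temperature"
  unite(series, id, measurement, sep = "_") %>%
  # one column per combined key
  spread(series, value)
```

The result has 2 rows (one per time) and 8 columns: date, time and six sensor/measurement columns.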

Lastly, here is a cheatsheet for quick reference when using these packages in tandem.
