Dance with R: Unleash productivity with these tips and tricks!

Anjan Dhungana
7 min read · Feb 12, 2024

Photo by Annie Spratt on Unsplash

R is one of the most widely used programming languages in science and has a large user base worldwide. Learning R is fairly easy: if you have heard that Python is one of the easiest programming languages to learn, I would say R isn’t that different.

Whether you are a beginner trying to learn R or a long-time user, little tips and tricks are always nice for increasing productivity. Here, I have summarized a few tips and tricks that might be useful when you are working with data.

We will use the built-in dataset ‘mtcars’ for most of the following examples. The dataset contains information about 32 different car models from the 1970s and is very useful for practicing data analysis and visualization.

You won’t need to do anything to load the dataset as it’s already built-in. Just type mtcars and run the code.

The head() function

If you have a large dataset and want to familiarize yourself with its structure, use the head() function with your dataset name as an argument. It will return the first six rows of the dataset along with column names.

head(mtcars)
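head() also accepts an n argument, and tail(), str(), and dim() are handy companions when exploring a dataset. Here is a small sketch using only base R functions; the variable names first_three and last_three are just illustrative.

```r
# First three rows instead of the default six
first_three <- head(mtcars, n = 3)
print(first_three)

# Last three rows of the dataset
last_three <- tail(mtcars, n = 3)
print(last_three)

# Compact overview: one line per column with its type and first few values
str(mtcars)

# Number of rows and columns
dim(mtcars)
```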

Making a new dataset from the existing one

You might need only some of the columns for your analysis, so you can select just the ones you want. Let’s say you want mpg, hp, and wt in your new dataset.

cars_subset <- mtcars[, c("mpg", "hp", "wt")]

If you type cars_subset and run the code, you will see a table with the car names on the left and only the mpg, hp, and wt columns.

If you are wondering about the comma (,) before c(), the comma separates row and column indexes. Add a numeric value (1 or 2) before the comma in the above code, and see what happens!
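To make the role of the comma concrete, here is a short sketch of a few indexing patterns. The names first_row, top_rows, and powerful are only illustrative.

```r
cars_subset <- mtcars[, c("mpg", "hp", "wt")]

# Row index before the comma, nothing after it: first row, all columns
first_row <- cars_subset[1, ]

# Rows 1 to 3, and only the mpg and hp columns
top_rows <- cars_subset[1:3, c("mpg", "hp")]

# A logical condition also works as a row index: cars with more than 200 hp
powerful <- cars_subset[cars_subset$hp > 200, ]

print(first_row)
print(top_rows)
print(powerful)
```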

Changing column names

Sometimes, column names can be confusing, and you must be clear on what a column name means to avoid making mistakes. Let’s say you want to change the name of the column wt to weight in cars_subset:

colnames(cars_subset)[colnames(cars_subset) == "wt"] <- "weight"

Alternatively, you can do:

colnames(cars_subset)[3] <- "weight"

Here, [3] refers to the column at index 3, which is the wt column in our cars_subset dataset. Try running colnames(cars_subset) and see what happens!

Something about indexing

In many programming languages like Python, C, and JavaScript, indexing starts at 0, but in R, it starts at 1.

But if it starts at 1, why is weight, which looks like the fourth column when you print cars_subset, referred to by the index 3? This is because the car names on the left are row names, not a regular column, so R does not count them. The real columns are mpg (1), hp (2), and weight (3).
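You can check this yourself. The sketch below shows that the row names live outside the columns, and how to turn them into a regular column if you want one. The column name model is just an illustrative choice.

```r
cars_subset <- mtcars[, c("mpg", "hp", "wt")]

# Only three real columns, even though the printout shows car names on the left
ncol(cars_subset)        # 3
colnames(cars_subset)    # "mpg" "hp" "wt"

# The car names are stored separately as row names
head(rownames(cars_subset), 3)

# To get the car names as a regular column, add them explicitly
cars_with_model <- cbind(model = rownames(cars_subset), cars_subset)
rownames(cars_with_model) <- NULL
colnames(cars_with_model)   # "model" "mpg" "hp" "wt"
```

After this change, wt really is the fourth column, because model now occupies index 1.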

Using pipe operator

A pipe operator (denoted by %>%, and available after loading the dplyr package) is exactly what it says; it allows us to create a pipeline of operations through which the data flows. Without it, we tend to run operations on data one after another, storing each result in an intermediate variable:

library(dplyr)

#Create a new dataset with cars having mpg >= 25
filtered_data <- filter(cars_subset, mpg >= 25)

#Select only the mpg and hp columns for a new dataset
selected_data <- select(filtered_data, mpg, hp)

#Sort the selected data by mpg (ascending order is the default)
sorted_data <- arrange(selected_data, mpg)

The above code would give the same result as the following code:

library(dplyr)

final_result <- cars_subset %>%
  filter(mpg >= 25) %>%
  select(mpg, hp) %>%
  arrange(mpg)

Here, the filter() function returns cars with mpg ≥ 25. The select() line below takes the output of the filter() line and selects only the mpg and hp columns. The last line arranges the data according to mpg in ascending order.

Compared with our first code, the pipe operator removes the need for intermediate variables like filtered_data and selected_data, which we will probably never use again.

Furthermore, the readability of the code increases significantly. Notice how the second code is more straightforward to comprehend than the first one. Basically, a pipe operator chains together a set of commands without using multiple variables in the middle.

Shortcut key for pipe operator: In RStudio, use Ctrl + Shift + M (Cmd + Shift + M on Mac) to insert the pipe operator instead of typing %>%
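If you would rather not load a package, R 4.1.0 and later also include a native pipe, written |>. The sketch below repeats the same pipeline with base R functions only. Note that sort_by_mpg() is a hypothetical helper defined here for illustration, not part of any package.

```r
cars_subset <- mtcars[, c("mpg", "hp", "wt")]

# Hypothetical helper: sort a data frame by its mpg column (ascending)
sort_by_mpg <- function(df) df[order(df$mpg), ]

final_result_base <- cars_subset |>
  subset(mpg >= 25, select = c(mpg, hp)) |>  # keep rows with mpg >= 25, columns mpg and hp
  sort_by_mpg()

print(final_result_base)
```

The native pipe passes the left-hand side as the first argument of the function on the right, which is the same behavior the dplyr pipeline above relies on.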

Vectorization

One of R’s most useful features is vectorization. A vector in R is a data structure whose elements all have the same type. You can create a vector as:

fruits <- c("Apple", "Banana", "Mango")
x <- c(1,2,3,4,5)

How is it useful? If you know about constructs like loops, vectorization lets you apply an operation to every element of a vector without writing one. Many functions and operators in R are vectorized, meaning they work on all elements of a vector at once. For example, you can do this:

x <- c(1,2,3,4,5)
x * 5

Now, if you look at the output of the code, you will see that each number is multiplied by 5, and you did that in a single line of code without the need for loops. This feature simplifies things and makes the code concise and easy to write. This is a classic example of a scalar operation on a vector.
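For comparison, here is a rough sketch of the same multiplication written with an explicit loop, next to the vectorized version. The names loop_result and vector_result are illustrative.

```r
x <- c(1, 2, 3, 4, 5)

# Loop version: fill a result vector one element at a time
loop_result <- numeric(length(x))
for (i in seq_along(x)) {
  loop_result[i] <- x[i] * 5
}

# Vectorized version: one expression, no explicit loop
vector_result <- x * 5

# Both approaches give the same values
identical(loop_result, vector_result)
```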

Another example is an element-wise operation between two vectors of the same length.

x <- c(1,2,3,4,5)
y <- c(3,4,5,6,7)
x+y

Try the code above and see what you find.

You can perform other operations as well:

#dot product
sum(x*y)
#matrix cross product t(x) %*% y; for two vectors this is a 1 x 1 matrix holding the dot product
crossprod(x,y)
#element-wise comparison
x == y
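For reference, here is what these operations return for the x and y defined above, worked out by hand. Since crossprod() returns a 1 x 1 matrix, this sketch uses drop() to turn it into a plain number; the names dot and cp are illustrative.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(3, 4, 5, 6, 7)

x + y            # 4 6 8 10 12 (element-wise sum)
x * y            # 3 8 15 24 35 (element-wise product)

dot <- sum(x * y)             # 3 + 8 + 15 + 24 + 35 = 85
cp  <- drop(crossprod(x, y))  # same value, 85, after dropping the matrix dimensions

x == y           # FALSE at all five positions, since no elements match
any(x == y)      # FALSE
```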

For vectors containing text, you can do something like this:

states1 <- c("Tennessee", "Kentucky", "Georgia")
states2 <- c("Florida", "Alabama", "Georgia", "Kentucky")

intersect(states1, states2)
union(states1, states2)

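This is useful in instances like measuring the Jaccard similarity index between two sets, which is the number of shared elements divided by the number of distinct elements across both sets: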

jaccard_similarity <- length(intersect(states1,states2))/length(union(states1,states2))

Some use cases of the Jaccard index are movie recommendation algorithms, e-commerce recommendation systems, genomic sequence comparison in bioinformatics, plagiarism detection, etc.
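If you need the index more than once, it can be wrapped in a small function. This is only a sketch: jaccard() is a hypothetical helper defined here, not a built-in R function.

```r
# Hypothetical helper: Jaccard similarity between two vectors treated as sets
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

states1 <- c("Tennessee", "Kentucky", "Georgia")
states2 <- c("Florida", "Alabama", "Georgia", "Kentucky")

jaccard(states1, states2)   # 2 shared states / 5 distinct states = 0.4
jaccard(states1, states1)   # identical sets give 1
```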

Wide and long data formats

Understanding data formats in R is key to data manipulation. In a wide format, each row represents an observation, and each column represents a variable. Run the code:

wide_data <- data.frame(
  Participant_ID = c(1, 2, 3, 4),
  Age = c(25, 30, 28, 35),
  Weight = c(70, 65, 80, 75),
  Height = c(175, 168, 180, 172),
  Group = c("Control", "Treatment", "Control", "Treatment")
)
wide_data

You’ll see the following output:

  Participant_ID Age Weight Height     Group
1              1  25     70    175   Control
2              2  30     65    168 Treatment
3              3  28     80    180   Control
4              4  35     75    172 Treatment

In the above output, you can see that each row, 1 through 4, represents age, height, weight, and group for each participant. The columns represent the variables Participant_ID, Age,…, Group. This is a wide data format.

Although this format is easier to comprehend and is useful for preliminary data exploration, many R functions, particularly for plotting and modeling, expect long data as input because it’s easier to manipulate and visualize.

Converting from wide to long data format inside R is useful and is more efficient than changing it manually in a spreadsheet. Use the reshape2 library to melt the wide data format to the long format (like melting a piece of a big plastic sheet to make it longer).

You might need to run install.packages("reshape2") to install the library.

library(reshape2)
long_data <- melt(wide_data, id.vars = c("Participant_ID", "Group"))
long_data

Output:

   Participant_ID     Group variable value
1               1   Control      Age    25
2               2 Treatment      Age    30
3               3   Control      Age    28
4               4 Treatment      Age    35
5               1   Control   Weight    70
6               2 Treatment   Weight    65
7               3   Control   Weight    80
8               4 Treatment   Weight    75
9               1   Control   Height   175
10              2 Treatment   Height   168
11              3   Control   Height   180
12              4 Treatment   Height   172

Here, wide_data has been converted to a long format. Participant_ID and Group are kept as identifier columns, while the Age, Weight, and Height columns are stacked into variable and value pairs.

To convert long data format to wide format, use the dcast() function of reshape2 library:

wide_data_back <- dcast(long_data, Participant_ID + Group ~ variable, value.var = "value")
print(wide_data_back)

Here, the columns to the left of ~ (Participant_ID and Group) identify each row, and ‘variable’ specifies which values become column names in the wide data format. ‘value.var’ names the column whose values fill up the wide data format. The values in our ‘value’ column in long data format will be distributed across the columns created by the dcast() function.

This article about reshape2 library might be helpful if you want to learn more about conversion between wide and long data formats.
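reshape2 is not the only option. As a rough sketch, base R’s reshape() function can produce a similar long layout without installing anything. The names measure_cols and long_base are illustrative only, and the column names "variable" and "value" are chosen here simply to mirror the melt() output.

```r
wide_data <- data.frame(
  Participant_ID = c(1, 2, 3, 4),
  Age = c(25, 30, 28, 35),
  Weight = c(70, 65, 80, 75),
  Height = c(175, 168, 180, 172),
  Group = c("Control", "Treatment", "Control", "Treatment")
)

measure_cols <- c("Age", "Weight", "Height")

# Stack the three measurement columns into variable/value pairs
long_base <- reshape(
  wide_data,
  direction = "long",
  varying   = list(measure_cols),  # columns to stack
  v.names   = "value",             # name of the stacked value column
  timevar   = "variable",          # name of the column holding the old column names
  times     = measure_cols,
  idvar     = "Participant_ID"
)
rownames(long_base) <- NULL        # replace the generated row names with 1, 2, 3, ...

print(long_base)
```

One difference from melt() is that the variable column here is a character vector rather than a factor.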

Some useful shortcuts

Shortcuts are a great way to maximize productivity while writing code. Here are some shortcuts to use in RStudio.

Ctrl/Command + 1: Source/Script Editor
Ctrl/Command + 2: Console
Ctrl/Command + 3: Help
Ctrl/Command + 4: History
Ctrl/Command + 5: Files
Ctrl/Command + 6: Plots
Ctrl/Command + 7: Packages
Ctrl/Command + 8: Environment
Ctrl/Command + 9: Viewer

Alt/Option + - : Assignment operator (<-)
Ctrl + L: Clear the console
Ctrl/Command + Shift + C: Comment/Uncomment Lines or Selection
Ctrl/Command + Enter: Run the current line of code / Selection

Hope this helps. Don’t forget to like and follow if you want more content like this. Happy dancing :)

Written by Anjan Dhungana

Graduate @KYSU conducting research on ruminants. I talk about everything food and tech.
