Data Science for Business Analytics > Assignments > HW0: R Basics

HW0: R Basics

R Foundations

In this tutorial, you’ll get an overview of the basic programming concepts in R and main data types. It’s just enough to get you up and running essential R code. However, for true “beginners”, we highly recommend going through Advanced R - Chapter ‘Foundations’ from which the content of this assignment is (mostly) inspired by.

R, RStudio, Installation

R is a programming language for statistical analysis
RStudio is the integrated development environment (IDE) for R in which we write and execute R code, plot things and write reports.
Installation guidelines and details (Mac, Windows, Linux)

To follow the tutorial, you can start R Studio and execute statements from the code chunks in the R Console.

Libraries

R uses different libraries or packages to load specific functions (read excel files, talk to Twitter, generate plots).

# To install package from the console, note the quotation marks!
# install.packages("name_of_package")

# load package in environment
library(mgcv)

Ex. 0

As a starting point you can install a package that we’ll extensively use throughout the semester:

tidyverse

Assignment

In R, we assign values (numbers, characters, data frames) to objects (vectors, matrices, variables). To do so, we use the <- operator:

# name_of_object <- value
an_object <- 2
another_object <- "some string"
# inspect object's value
an_object
#> [1] 2
print(another_object)
#> [1] "some string"

Data Structures

R’s base data structures can be organised by their dimensionality (1d, 2d, or nd) and whether they’re homogeneous (all contents must be of the same type) or heterogeneous (the contents can be of different types).

	Homogeneous	Heterogeneous
1d	Atomic vector	List
2d	Matrix	Data frame
nd	Array

Vectors

The basic data structure in R is the vector. Vectors can be of two kinds: atomic vectors and lists. They have three common properties:

Type, typeof(), what it is.
Length, length(), how many elements it contains.
Attributes, attributes(), additional arbitrary metadata.

However, atomic vectors and lists differ in the types of their elements: all elements of an atomic vector must be the same type, whereas the elements of a list can have different types.

There are four common types of atomic vectors:

logical
integer
double (often called numeric)
character

Atomic vectors are usually created with c(), short for combine:

dbl_var <- c(1, 2.5, 4.5)
# with the L suffix, you get an integer rather than a double
int_var <- c(1L, 6L, 10L)
# use TRUE and FALSE (or T and F) to create logical vectors
log_var <- c(TRUE, FALSE, T, F)
chr_var <- c("these are", "some strings")

int_var <- c(1L, 6L, 10L)
typeof(int_var)
#> [1] "integer"

is.integer(int_var)
#> [1] TRUE

Lists

List objects can hold elements of any type, including lists. You construct lists by using list() instead of c():

x <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9))
str(x)
#> List of 4
#>  $ : int [1:3] 1 2 3
#>  $ : chr "a"
#>  $ : logi [1:3] TRUE FALSE TRUE
#>  $ : num [1:2] 2.3 5.9

Attributes

All objects can have arbitrary additional attributes, used to store metadata about the object. Attributes can be thought of as a named list (with unique names). They can be accessed individually with attr() or all at once (as a list) with attributes().

y <- 1:10
attr(y, "my_attribute") <- "This is a vector"

# inspect the attribute of y
attr(y, "my_attribute")
#> [1] "This is a vector"

Matrices and arrays

Adding a dim attribute to an atomic vector allows it to behave like a multi-dimensional array. A special case of the array is the matrix, which has two dimensions. Matrices and arrays are created with matrix() and array(), or by using the assignment form of dim():

# two scalar arguments to specify rows and columns
a <- matrix(1:6, ncol = 3, nrow = 2)
# one vector argument to describe all dimensions
b <- array(1:12, c(2, 3, 2))

# you can also modify an object in place by setting dim()
c <- 1:6
dim(c) <- c(3, 2)
c
#>      [,1] [,2]
#> [1,]    1    4
#> [2,]    2    5
#> [3,]    3    6

length() and names() have high-dimensional generalisations:

length() generalises to nrow() and ncol() for matrices, and dim() for arrays.
names() generalises to rownames() and colnames() for matrices, and dimnames(), a list of character vectors, for arrays.

c() generalises to cbind() and rbind() for matrices, and to abind() (provided by the abind package) for arrays. You can transpose a matrix with t(); the generalised equivalent for arrays is aperm().

You can test if an object is a matrix or array using is.matrix() and is.array(), or by looking at the length of the dim(). as.matrix() and as.array() make it easy to turn an existing vector into a matrix or array.

Data frames

A data frame is the most common way of storing data in R, and if used systematically makes data analysis easier. Under the hood, a data frame is a list of equal-length vectors. This makes it a 2-dimensional structure, so it shares properties of both the matrix and the list. This means that a data frame has names(), colnames(), and rownames(), although names() and colnames() are the same thing. The length() of a data frame is the length of the underlying list and so is the same as ncol(); nrow() gives the number of rows.

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
str(df)
#> 'data.frame':    3 obs. of  2 variables:
#>  $ x: int  1 2 3
#>  $ y: chr  "a" "b" "c"

You can combine data frames using cbind() and rbind():

cbind(df, data.frame(z = 3:1))
#>   x y z
#> 1 1 a 3
#> 2 2 b 2
#> 3 3 c 1

Subsetting vectors

Let’s explore the different types of subsetting with a simple vector, x.

x <- c(2, 4, 3, 5)

Positive integers return elements at the specified positions:

x[c(3, 1)]
#> [1] 3 2

Duplicated indices yield duplicated values:

x[c(1, 1)]
#> [1] 2 2

Real numbers are silently truncated to integers:

x[c(2, 9)]
#> [1]  4 NA

Negative integers omit elements at the specified positions:

x[-c(3, 1)]
#> [1] 4 5

You can’t mix positive and negative integers in a single subset: x[c(-1, 2)] is not allowed.

Logical vectors select elements where the corresponding logical value is TRUE. This is probably the most useful type of subsetting because you write the expression that creates the logical vector:

x[c(TRUE, TRUE, FALSE, FALSE)]
#> [1] 2 4

x[x > 3]
#> [1] 4 5

A missing value in the index always yields a missing value in the output:

x[c(TRUE, TRUE, NA, FALSE)]
#> [1]  2  4 NA

Nothing returns the original vector. This is not useful for vectors but is very useful for matrices, data frames, and arrays. It can also be useful in conjunction with assignment.

x[]
#> [1] 2 4 3 5

Zero returns a zero-length vector. This is not something you usually do on purpose, but it can be helpful for generating test data.

x[0]
#> numeric(0)

If the vector is named, you can also use character vectors to return elements with matching names:

(y <- setNames(x, letters[1:4]))
#> a b c d 
#> 2 4 3 5

y[c("d", "c", "a")]
#> d c a 
#> 5 3 2

Like integer indices, you can repeat indices:

y[c("a", "a", "a")]
#> a a a 
#> 2 2 2

When subsetting with [ names are always matched exactly

z <- c(abc = 1, def = 2)
z[c("a", "d")]
#> <NA> <NA> 
#>   NA   NA

Subsetting lists, matricies and data frames

Subsetting a list works in the same way as subsetting an atomic vector. Using [ will always return a list; [[ and $, as described below, let you pull out the components of the list.

You can subset higher-dimensional structures in three ways:

With multiple vectors.
With a single vector.
With a matrix.

a <- matrix(1:9, nrow = 3)
colnames(a) <- c("A", "B", "C")
# multiple vectors
a[1:2, ]
#>      A B C
#> [1,] 1 4 7
#> [2,] 2 5 8

df <- data.frame(x = 1:3, y = 3:1, z = letters[1:3])
# selecting by value of certain vector
df[df$x == 2, ]
#>   x y z
#> 2 2 2 b

Importing data in R

The following checklist makes it easier to import data correctly into R:

The first row is maybe reserved for the header, while the first column is used to identify the sampling unit;
Avoid names, values or fields with blank spaces, otherwise each word will be interpreted as a separate variable, resulting in errors that are related to the number of elements per line in your data set;
Short names are preferred over longer names;
Try to avoid using names that contain symbols such as ?, $,%, ^, &, *, (, ),-,#, ?,,,<,>, /, |, \, [ ,] ,{, and };
Make sure that any missing values in your data set are indicated with NA.

library(readr)

# import data from .txt file
df <- read_table(
  "https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test.txt",
  col_names = FALSE)
df
#> # A tibble: 5 x 3
#>      X1    X2 X3   
#>   <dbl> <dbl> <chr>
#> 1     1     6 a    
#> 2     2     7 b    
#> 3     3     8 c    
#> 4     4     9 d    
#> 5     5    10 e

# import data from .csv file
df <- read.table(
  "https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test.csv",
  header = TRUE,
  sep = ",")

Functions

Standard format for defining a function in R:

my_function_name <- function(arg1 = "default", arg2 = "default") {
  
  # 'cat' is used for concatenating strings
  merged_string <- cat(arg1, arg2) 
  
  # if not specified, last evaluated object is returned
  return(merged_string) 
  }
  
  # call a function elsewhere from code
  arg1 <- "Hello"
  arg2 <- "World!"
  a_greeting <- my_function_name(arg1, arg2)
#> Hello World!
  print(a_greeting)
#> NULL

CRAN - the curated repository of R packages provides millions of functions that you could use to tackle data. You simply need to install a package, and then call the function from your R code function_name(somearguments). For example, the package stats helps you in fitting linear models through the function lm():

library(stats)

x <- rnorm(500)
y <- x*4 + rnorm(500)

lm.fit <- lm(y~x, data = data.frame(x, y))

print(lm.fit)
#> 
#> Call:
#> lm(formula = y ~ x, data = data.frame(x, y))
#> 
#> Coefficients:
#> (Intercept)            x  
#>    0.002114     3.941450

How many functions have been used in the example? What does rnorm mean? You can get informed about any R function by using its documentation ?function_name or ?packageName::function_name.

Cheat sheets

Base R Cheat sheet

Data Types Cheat sheet

Try it yourself!

Exercise 1

Try to figure out the answers without executing the code. Check your answers in R Studio.

Given the vector: x <- c("ww", "ee", "ff", "uu", "kk"), what will be the output for x[c(2,3)] ?
Let a <- c(2, 4, 6, 8) and b <- c(TRUE, FALSE, TRUE, FALSE), what will be the output for the R expression max(a[b])?
Is it possible to apply the function my_function_name using x and a as arguments?

Exercise 2

Consider a vector x such that: x <- c(1, 3, 4, 7, 11, 18, 29) Write an R statement that will return a list X2 with components of value: x*2, x/2, sqrt(x) and names ‘x*2’, ‘x/2’, ‘sqrt(x)’.

Exercise 3

Read the file Table0.txt into an object DS.

What is the data type for the object DS?
Change the names of the columns to Name, Age, Height, Weight and Sex.
Change the row names so that they are the same as Name, and remove the variable Name.
Get the number of rows and columns of the data.

Exercise 4

Convert DS from the previous exercise to a data frame DF.
Add an additional column “zeros” in DF with all elements 0.
Remove the Weight column from DF.