R Programming: Part 1 - Nuts and Bolts

  • Part 1 - Nuts and Bolts
    • Getting Started
      • R Console Input
      • Working Directory and Files
      • R Objects and Attributes
      • Sequence of Numbers
    • Basic Data Types in R
      • Vectors & Lists
      • Missing Values
      • Subsetting Vectors and Lists
      • Matrices & Data Frames
      • Factors
    • Reading Data to R
      • Tabular Data & Textual Data Formats
      • Connections: Interfaces to the Outside World

Getting Started

R Console Input

Input an expression and R will print the result immediately.
When an assignment operator ("<-" or "->") is used, R stores the result and does not print it unless you type the variable name or call the print() function.
Comments start with "#".

> x <- 5
> x    ## Or print(x)
[1] 5

Working Directory and Files

These functions work much like terminal command-line tools.
Basic commands:

  • getwd(): gets the current working directory
  • ls(): lists the objects in the current workspace
  • list.files(): lists all the files in the current working directory
  • args(function_name): shows which parameters a function takes
  • setwd("dir"): sets the working directory to the specified directory

More functions about directory and files:

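  • dir.create("dir_name", recursive = FALSE): to create a directory
  • file.create("file_name"): to create an empty file
  • file.exists("f"), file.info("f"), file.rename("f1", "f2"), file.copy("f1", "f2"): to test for, inspect, rename and copy files
  • file.path("f1", "f2", "f3"): to build the relative path f1/f2/f3
  • unlink("f", recursive = FALSE): to delete files and directories (recursive = TRUE is required for directories)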
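
The functions above can be tried safely with a short sketch like the following. It works inside a temporary directory created with tempfile(), and the directory and file names (demo_dir, notes.txt and so on) are made up for illustration only.

```r
# Work inside a throwaway directory so the real working directory
# is left untouched (all names here are illustrative).
base <- tempfile("demo_dir")
dir.create(base)

# Build a nested path and create every level in one call.
nested <- file.path(base, "a", "b")
dir.create(nested, recursive = TRUE)

# Create a file and check that it exists.
f1 <- file.path(nested, "notes.txt")
file.create(f1)
exists_before <- file.exists(f1)

# Copy the file, then rename the copy.
f2 <- file.path(nested, "copy.txt")
file.copy(f1, f2)
f3 <- file.path(nested, "renamed.txt")
file.rename(f2, f3)

files_now <- sort(list.files(nested))

# Delete the whole tree; recursive = TRUE is needed for directories.
unlink(base, recursive = TRUE)
exists_after <- file.exists(base)
```

After running it, files_now holds the two file names and exists_after is FALSE, so nothing is left behind.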

Tab completion works in R as well.

R Objects and Attributes

Atomic classes of objects in R:

  • Character
  • Numeric (double precision real numbers)
  • Integer
  • Complex
  • Logical

Numbers in R are treated as numeric (double precision) by default. Add the "L" suffix if you explicitly want an integer, e.g. 1L. The special values Inf (infinity) and NaN (not a number) are also defined in R.
R objects can have attributes: names, dimnames, dimensions, class, length and others. They can be accessed with the attributes() function.
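
As a small sketch of the points above, class() reports the atomic class of a value and attributes() lists whatever attributes an object carries. The variable names m and v are arbitrary.

```r
# Atomic classes
class("a")      # "character"
class(1)        # "numeric"
class(1L)       # "integer"
class(1 + 2i)   # "complex"
class(TRUE)     # "logical"

# Special numeric values
1 / 0           # Inf
0 / 0           # NaN

# Attributes: a matrix carries dim, a named vector carries names
m <- matrix(1:6, nrow = 2)
attributes(m)   # a list with one component, dim = c(2, 3)

v <- c(a = 1, b = 2)
attributes(v)   # a list with one component, names = c("a", "b")
```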

Sequence of Numbers

The colon operator ":" is the most common way to create a sequence.

> 1:10
 [1]  1  2  3  4  5  6  7  8  9 10
> pi:10
 [1] 3.141593 4.141593 5.141593 6.141593 7.141593 8.141593 9.141593
> 15:1
 [1] 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1

The [1] at the start of the output shows that the result is a vector (whose elements all share the same class) and that the element following it is the first element of that vector. If the output wraps onto two lines, each line starts with the index of its first element:

 [1] 15 14 13 12 11 10  9  8
 [9]  7  6  5  4  3  2  1

The seq() function does similar work, with the advantage that it can control the increment and the length of the sequence, e.g.

> seq(1, 5, by = 0.5)
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
> seq(1, 10, length = 7)
[1]  1.0  2.5  4.0  5.5  7.0  8.5 10.0

rep() (replicate) is another function for creating sequences.

> rep(0, times = 10)
[1] 0 0 0 0 0 0 0 0 0 0
> rep(c(0, 1, 2), times = 5)
[1] 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2
> rep(c(0, 1, 2), each = 5)
[1] 0 0 0 0 0 1 1 1 1 1 2 2 2 2 2


Basic Data Types in R

Vectors & Lists

The vector is the most common object in R, and it can only contain objects of the same class. A list is similar to a vector but can contain objects of different classes.
The c() function (combine / concatenate) can be used to create vectors.

> x <- c(0.5, 0.6)
> x <- c("a", "b", "c")
> x <- c(1+0i, 3+4i)

The vector() function works as well.

> x <- vector("numeric", length = 10)

The vector x is then initialized with the default value (0 for a numeric vector).

Vectors can be used in arithmetic expressions. Common arithmetic operators include "+", "-", "*", "/" and "^" (power), and functions such as sqrt() and abs() work on vectors as well, e.g.

> z <- c(1, 2, 3)
> z + 100
[1] 101 102 103
> sqrt(z - 1)
[1] 0.000000 1.000000 1.414214

Other functions for vectors include max(), min(), range() (returns c(min, max)), length(), sum(), prod(), mean() (returns the average), var() (returns the variance), sort(), etc.
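
A quick sketch of these functions applied to one small vector; the vector z here is just an illustrative choice.

```r
z <- c(4, 1, 3, 2)

max(z)      # 4
min(z)      # 1
range(z)    # 1 4
length(z)   # 4
sum(z)      # 10
prod(z)     # 24
mean(z)     # 2.5
var(z)      # sample variance (divides by n - 1)
sort(z)     # 1 2 3 4
```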

When two vectors of the same length are involved in an arithmetic expression, R performs the operation element by element (vectorized operations).
If they are of different lengths, R recycles the shorter vector (note that a single number can be viewed as a vector of length 1), and it gives a warning if the longer length is not a multiple of the shorter length, e.g.

> x <- c(1, 2, 3, 4, 5, 6)
> y <- c(1, 10, 1, 10, 1, 10)
> x + y
[1]  2 12  4 14  6 16

> y <- c(1, 10, 100)
> x + y
[1]   2  12 103   5  15 106

> y <- c(1, 10, 100, 1000)
> x + y
[1]    2   12  103 1004    6   16
Warning message:
In x + y : longer object length is not a multiple of shorter object length

Logical vectors:

> x <- c("a", "b", "c", "c", "d", "a")
> u <- x > "a"
> u
[1] FALSE  TRUE  TRUE  TRUE  TRUE FALSE

Logical operators: >, <, ==, >=, <=, !=, &, |, !, xor()
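There are also && and ||, which are intended for single logical values. Older versions of R silently used only the first element of each operand; since R 4.3.0, an operand of length greater than one is an error.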
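
The following sketch contrasts the element-wise operators with the single-value form. The vectors a and b and the value x are arbitrary examples.

```r
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

a & b        # TRUE FALSE FALSE FALSE
a | b        # TRUE  TRUE  TRUE FALSE
!a           # FALSE FALSE  TRUE  TRUE
xor(a, b)    # FALSE  TRUE  TRUE FALSE

# && and || combine single values, e.g. in an if() condition
x <- 7
in_range <- x > 0 && x < 10   # TRUE
```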

Character vectors can be combined using both c() and paste() functions.

> my_char <- c("My", "name", "is")
> paste(my_char, collapse = " ")
[1] "My name is"
> c(my_char, "Niwatori")
[1] "My"       "name"     "is"       "Niwatori"
> paste("Hello", "world!", sep = " ")
[1] "Hello world!"
> paste(1:3, c("X", "Y", "Z"), sep = "")
[1] "1X" "2Y" "3Z"

When you try to mix objects of different classes in a vector, implicit coercion converts them all to one common class. The coercion order is logical < integer < numeric < complex < character, so the result takes the most general class present.

> c(1.7, "a")
[1] "1.7" "a"
> c(TRUE, 2)
[1] 1 2

Explicit coercion is done with the as.* functions.

> x <- 0:4
> as.numeric(x)
[1] 0 1 2 3 4
> as.character(x)
[1] "0" "1" "2" "3" "4"

Lists are similar to vectors except that they can contain objects of different classes. Each element of a list is held as a separate object, here a vector of length 1.

> list(1, "a", TRUE, 1+4i)
[[1]]
[1] 1

[[2]]
[1] "a"

[[3]]
[1] TRUE

[[4]]
[1] 1+4i

Missing Values

Missing values are denoted by NA (Not Available), and the results of undefined mathematical operations by NaN (Not a Number).
NaN occurs if you try to compute 0 / 0 or Inf - Inf, where Inf stands for infinity.
The function is.na() tests whether the elements of an object are NA, and is.nan() tests for NaN. A NaN value is also NA, but not vice versa.

> x <- c(1, 2, NA, NaN, 3)
> is.na(x)
[1] FALSE FALSE  TRUE  TRUE FALSE
> is.nan(x)
[1] FALSE FALSE FALSE  TRUE FALSE

Note that the command "x == NA" does NOT behave like "is.na(x)". With "x == NA", each element of x is compared with NA. Because NA stands for an unknown value, the result of every such comparison is itself unknown, so R returns NA each time, i.e.

> x == NA
[1] NA NA NA NA NA
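
A short sketch of the same idea: any comparison involving NA is NA, even NA == NA, so is.na() is the tool for locating or counting missing values. The vector below repeats the one used earlier.

```r
x <- c(1, 2, NA, NaN, 3)

NA > 1            # NA
NA == NA          # NA, not TRUE

any(is.na(x))     # TRUE
sum(is.na(x))     # 2, counting both NA and NaN
sum(is.nan(x))    # 1, counting only NaN
```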

To remove missing values, logical vectors with is.na() and complete.cases() functions are often used.

> x <- c(1, 2, NA, 4, NA, 6)
> x[!is.na(x)]
[1] 1 2 4 6
> y <- c("a", NA, "c", "d", NA, "f")
> good <- complete.cases(x, y)
> x[good]
[1] 1 4 6
> y[good]
[1] "a" "d" "f"
> myd       ## A data frame
  Names First Second Third
1 Alice     1      2     3
2   Bob     2      3     4
3 Carol    NA      4     5
4  Dave     4     NA     6

> good <- complete.cases(myd)
> myd[good, ]
  Names First Second Third
1 Alice     1      2     3
2   Bob     2      3     4

Subsetting Vectors and Lists

For subsetting vectors, the single square bracket operator [] is most commonly used.

> x <- c(1, 2, 3, 4, 5, 5, 5, 5, 5, NA, NA, NA, 6, 7, 8, 9)
> x[2]                ## Positive integer index
[1] 2
> x[1:5]
[1] 1 2 3 4 5
> x[c(3, 5, 7, 9, 11)]
[1]  3  5  5  5 NA

> y <- x[!is.na(x)]   ## Logical index
> y
[1] 1 2 3 4 5 5 5 5 5 6 7 8 9
> y[y > 5]
[1] 6 7 8 9
> x[!is.na(x) & x > 5]
[1] 6 7 8 9

The which() function returns the indices of the elements for which the expression is TRUE.
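
A minimal sketch of which() on an illustrative vector. Note that positions where the condition is NA are skipped.

```r
x <- c(3, NA, 8, 1, 9)

big <- which(x > 5)    # 3 5: the NA at position 2 is skipped
which(is.na(x))        # 2
x[big]                 # 8 9
```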

You'll get nothing useful if you ask for index 0 (the result is an empty vector) or for an index beyond the end of the vector (the result is NA). Be cautious! Negative indices, however, do make sense: they exclude the given elements.

> x <- 1:10
> x[c(-2, -7)]                  ## Negative integer index
[1]  1  3  4  5  6  8  9 10     ## All numbers except x[2] & x[7]
> x[-c(2, 7)]                   ## Putting the negative sign in front also works
[1]  1  3  4  5  6  8  9 10
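
The edge cases mentioned before the negative-index example can be checked with a sketch like this one. The tryCatch() wrapper is only there so that the error from mixing positive and negative indices does not stop the script.

```r
x <- 1:10

x[0]           # integer(0): an empty vector
x[11]          # NA: past the end of the vector
x[c(1, 11)]    # 1 NA

# Positive and negative indices cannot be mixed; R raises an error.
mixed <- tryCatch(x[c(-1, 2)], error = function(e) "error")
mixed          # "error"
```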

Modifying subsets:

> x <- c(-2:5, rep(NA, 4))
> x
 [1] -2 -1  0  1  2  3  4  5 NA NA NA NA
> x[is.na(x)] <- -1
> x
 [1] -2 -1  0  1  2  3  4  5 -1 -1 -1 -1
> x[x < 0] <- -x[x < 0]         ## Same as x <- abs(x)
> x
 [1] 2 1 0 1 2 3 4 5 1 1 1 1

R objects can have names, which helps in writing readable code.
The names of a vector can be read and set with the names() function.

> x <- c(foo = 1, bar = 2, norf = 3)
> x
 foo  bar norf 
   1    2    3 
> names(x)
[1] "foo"  "bar"  "norf"

Equivalently, the names can be assigned afterwards:

> x <- c(1, 2, 3)
> names(x) <- c("foo", "bar", "norf")
> x
 foo  bar norf 
   1    2    3 

Now we can subset the vector through names.

> x["bar"]
bar 
  2 
> x[c("foo", "bar")]
foo bar 
  1   2 

Other operators used for extracting subsets of R objects:

  • []: returns an object of the same class, can extract multiple elements
  • [[]]: extracts a single element of a list or a data frame, returns an object with a type not necessarily the same as the original
  • $: extracts a single element of a list or a data frame by name

Examples of [[]] and $ operators for subsetting lists:

> x <- list(1:4, 0.6)
> x
[[1]]
[1] 1 2 3 4

[[2]]
[1] 0.6

> x[1]       ## Returns a list containing a numeric vector
[[1]]
[1] 1 2 3 4

> x[[1]]     ## Returns simply a numeric vector
[1] 1 2 3 4

> names(x) <- c("foo", "bar")
> x
$foo
[1] 1 2 3 4

$bar
[1] 0.6

> x$foo      ## x$foo == x[["foo"]] == x[[1]]
[1] 1 2 3 4
> x[1:2]     ## Returns a list
$foo
[1] 1 2 3 4

$bar
[1] 0.6

Differences between [[]] and $ operators when subsetting by names:

  • The [[]] operator can be used with computed indices, e.g. x[[name]] where name is a variable holding "foo", while the $ operator can only be used with literal names, as in x$foo.
  • The $ operator supports partial matching of names, while the [[]] operator does not unless you set exact = FALSE.
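
Both differences can be seen in a small sketch. The list and its element names (foobar, baz) are made up for illustration.

```r
x <- list(foobar = 1:4, baz = 0.6)

# Computed index: the name is held in a variable
name <- "foobar"
x[[name]]                 # 1 2 3 4
x$name                    # NULL: $ looks for an element literally called "name"

# Partial matching
x$f                       # 1 2 3 4: "f" uniquely matches "foobar"
x[["f"]]                  # NULL: [[ needs an exact match by default
x[["f", exact = FALSE]]   # 1 2 3 4
```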

The [[]] operator can take an integer vector to extract a single element from nested lists, which is equivalent to applying the operator several times in a row.

> x <- list(a = list(2, 3, 4), b = c(5, 6))
> x[[1]][1]
[[1]]
[1] 2
> x[[1]][[1]]
[1] 2
> x[[c(1, 1)]]
[1] 2
> x$a[[1]]
[1] 2

Matrices & Data Frames

Matrices are vectors with a dimension attribute, which is an integer vector of length 2 (nrow, ncol). So the first way to create a matrix from a vector is to add a dimension attribute. Note that matrices are filled column-wise.

> m <- 1:10
> dim(m) <- c(2, 5)
> m
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10

Matrices can also be created using the matrix() function.

> m <- matrix(1:6, nrow = 2, ncol = 3)
> m
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

Matrices can also be created by column-binding or row-binding vectors with the cbind() or rbind() functions.

> x <- 1:3
> y <- 10:12
> cbind(x, y)
     x  y
[1,] 1 10
[2,] 2 11
[3,] 3 12
> rbind(x, y)
  [,1] [,2] [,3]
x    1    2    3
y   10   11   12

Matrices can be subset with indices of the form x[i, j], where either i or j may be omitted to select a whole column or row.

When a single element of a matrix is extracted, it is returned as a vector of length 1 rather than a 1 x 1 matrix. This behavior can be turned off by setting drop = FALSE. The same applies when extracting a single row or a single column.

> x <- matrix(1:6, 2, 3)
> x
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> x[1, 2]
[1] 3
> x[1, ]
[1] 1 3 5
> x[1, , drop = FALSE]
     [,1] [,2] [,3]
[1,]    1    3    5

Vectorized operations work for matrices as well. Note that x * y multiplies the entries of x and y element by element, while x %*% y performs true matrix multiplication.

> x <- matrix(1:4, 2, 2)
> y <- x
> x * y
     [,1] [,2]
[1,]    1    9
[2,]    4   16
> x %*% y
     [,1] [,2]
[1,]    7   15
[2,]   10   22

The row and column names of a matrix can be set with dimnames(), which takes a list containing the row names and the column names.

> x <- matrix(1:6, nrow = 2, ncol = 3)
> x
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> dimnames(x) <- list(c("r1", "r2"), c("c1", "c2", "c3"))
> x
   c1 c2 c3
r1  1  3  5
r2  2  4  6

Like matrices, data frames are used to store tabular data, but their columns can hold objects of different classes, whereas all entries of a matrix share one class.
Data frames have row names and column names, accessible with rownames() and colnames(). Row names are 1, 2, 3, etc. by default.

> my_matrix <- matrix(1:20, nrow = 4, ncol = 5)
> patients <- c("Bill", "Gina", "Kelly", "Sean")
> cbind(patients, my_matrix)    ## Wrong! Implicit coercion from numeric to character
     patients
[1,] "Bill"   "1" "5" "9"  "13" "17"
[2,] "Gina"   "2" "6" "10" "14" "18"
[3,] "Kelly"  "3" "7" "11" "15" "19"
[4,] "Sean"   "4" "8" "12" "16" "20"

> my_data <- data.frame(patients, my_matrix)
> my_data
  patients X1 X2 X3 X4 X5
1     Bill  1  5  9 13 17
2     Gina  2  6 10 14 18
3    Kelly  3  7 11 15 19
4     Sean  4  8 12 16 20

> colnames(my_data) <- c("patient", "age", "weight", "bp", "rating", "test")
> my_data
  patient age weight bp rating test
1    Bill   1      5  9     13   17
2    Gina   2      6 10     14   18
3   Kelly   3      7 11     15   19
4    Sean   4      8 12     16   20

Factors

Factors are used to represent categorical data. Each value is a label drawn from a fixed set of categories, which is stored in the levels attribute.

> x <- factor(c("y", "y", "n", "y", "n"))
> x
[1] y y n y n
Levels: n y

> table(x)      ## Show how many objects of each level
x
n y
2 3

> unclass(x)    ## Strip the classes out of objects
[1] 2 2 1 2 1
attr(,"levels")
[1] "n" "y"

The order of the levels can be set with the levels argument of factor(). (Assigning to levels() renames the levels rather than reordering them.) The order can be important because the first level is sometimes used as the baseline level, e.g.

> x <- factor(c("y", "y", "n", "y", "n"), levels = c("y", "n"))
> x
[1] y y n y n
Levels: y n
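
As a rough sketch of what the level order changes, the integer codes behind the factor follow the order of the levels, and so does the order in which table() reports counts. The names f_default and f_ordered are arbitrary.

```r
v <- c("y", "y", "n", "y", "n")

f_default <- factor(v)                         # levels sorted: n, y
f_ordered <- factor(v, levels = c("y", "n"))   # levels as given: y, n

levels(f_default)       # "n" "y"
levels(f_ordered)       # "y" "n"

as.integer(f_default)   # 2 2 1 2 1
as.integer(f_ordered)   # 1 1 2 1 2

table(f_ordered)        # counts in level order: y = 3, n = 2
```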


Reading Data to R

Tabular Data & Textual Data Formats

The most commonly used functions for reading tabular data are read.table() and read.csv(). The two are almost identical, except that the default separator is whitespace for the former and a comma for the latter, and read.csv() assumes a header row by default.

The read.table() function takes quite a few parameters, many of which have default values. Specifying options such as colClasses and nrows explicitly, instead of relying on the defaults, can make it run faster.

> data <- read.table("foo.txt")
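
A self-contained sketch of read.table() with some of its options set explicitly. The file contents and column names are invented for illustration, and the file is written to a temporary path so the example does not depend on foo.txt existing.

```r
# Write a small whitespace-separated file to a temporary location
# (the contents are made up for this example).
path <- tempfile(fileext = ".txt")
writeLines(c("name age score",
             "Alice 30 8.5",
             "Bob 25 7.0"), path)

# Telling read.table() the column classes and row count up front
# saves it from having to work them out itself.
data <- read.table(path,
                   header = TRUE,
                   colClasses = c("character", "integer", "numeric"),
                   nrows = 2,
                   comment.char = "")
str(data)

unlink(path)   # clean up the temporary file
```

On a file this small the speed difference is not noticeable; the options matter mainly for large files.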

The dump() and dput() functions write objects in a textual format that preserves metadata such as classes and names, at the cost of some readability and storage space. Because the metadata travels with the data, other users do not have to specify it all over again, and a text file is potentially recoverable in case of corruption.

> y <- data.frame(a = 1, b = "a")
> dput(y)
structure(list(a = 1, b = structure(1L, .Label = "a", class = "factor")), .Names = c("a", 
"b"), row.names = c(NA, -1L), class = "data.frame")
> dput(y, file = "test.R")
> newy <- dget("test.R")
> newy
  a b
1 1 a

dput() and dget() are used to write and read a single object in textual format. dump() and source() do the same job, but the difference is that they can handle multiple objects.

> x <- "foo"
> y <- data.frame(a = 1, b = "a")
> dump(c("x", "y"), file = "test.R")
> rm(x, y)           ## Remove variables x and y
> source("test.R")
> x
[1] "foo"
> y
  a b
1 1 a

Connections: Interfaces to the Outside World

Data are read in through connection interfaces.

  • file: opens a connection to a file
  • url: opens a connection to a webpage
  • gzfile: opens a connection to a file compressed with gzip
    etc.

The file() function takes several parameters, of which description (the name of the file) and open (the mode) are the most commonly used. The open modes are "r", "w" and "a" for reading, writing and appending in text mode, and "rb", "wb" and "ab" for the same operations in binary mode.
Here are two examples:

> con <- file("foo.txt", "r")
> data <- read.csv(con)
> close(con)

is the same as

> data <- read.csv("foo.txt")
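
Opening a connection explicitly becomes useful when you want to control the mode or read only part of a file. The sketch below, which uses a temporary file and made-up contents, writes with "w", appends with "a", and then reads just the first two lines with "r".

```r
path <- tempfile(fileext = ".txt")   # illustrative temporary file

# Open for writing, write two lines, and close.
con <- file(path, "w")
writeLines(c("first line", "second line"), con)
close(con)

# Open for appending and add one more line.
con <- file(path, "a")
writeLines("third line", con)
close(con)

# Open for reading and read only the first two lines.
con <- file(path, "r")
first_two <- readLines(con, n = 2)
close(con)

# Passing the path directly reads the whole file.
all_lines <- readLines(path)

unlink(path)   # clean up
```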

Reading webpages:

> con <- url("http://www.baidu.com/", "r")
> x <- readLines(con)
> head(x)
[1] " " ...
[2] "