Creating a Data Frame in R – R Tutorial

Data in R is held in a wide variety of objects such as vectors, matrices, arrays, data frames, and lists. Let’s see how to create or assign a data frame in R.

What is a data frame?

A data frame is more general than a matrix: different columns can contain different modes of data (numeric, character, etc.). It’s similar to the datasets you typically see and use in statistical packages like IBM SPSS, SAS, and Stata. Data frames are the most common data structure you will deal with in R. A data frame is created with the function data.frame(). The general format is

mydata <- data.frame(col1, col2, col3,…)

where col1, col2, col3, … are column vectors of any type (such as character, numeric, or logical). Names for each column can be provided with the names() function. The following example shows how this works.

Example

Let us create a students data frame.

> rollno <- c(1201,1202,1203,1204,1205)
> age <- c(19,20,22,19,20)
> results <- c("Pass","Pass","Fail","Pass","Fail")
> gender <- c("Male","Female","Male","Male","Female")
> students <- data.frame(rollno,age,gender,results)
> students
  rollno age gender results
1   1201  19   Male    Pass
2   1202  20 Female    Pass
3   1203  22   Male    Fail
4   1204  19   Male    Pass
5   1205  20 Female    Fail
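
The names() function mentioned earlier can also rename columns after the data frame exists. The following is a minimal sketch, not part of the original example: the replacement names (roll_number, age_years, outcome) are hypothetical and chosen only for illustration.

```r
# Rebuild the students data frame from the example above.
rollno  <- c(1201, 1202, 1203, 1204, 1205)
age     <- c(19, 20, 22, 19, 20)
gender  <- c("Male", "Female", "Male", "Male", "Female")
results <- c("Pass", "Pass", "Fail", "Pass", "Fail")
students <- data.frame(rollno, age, gender, results)

# Work on a copy so the original column names stay intact.
renamed <- students

# Hypothetical new names, one per column, in column order.
names(renamed) <- c("roll_number", "age_years", "gender", "outcome")

names(students)  # names taken from the vectors passed to data.frame()
names(renamed)   # names assigned with names()
```

By default, data.frame() takes the column names from the vectors you pass in, so names() is only needed when you want something different.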

Points to remember: each column must hold only one data type, but columns of different data types can be combined to form the data frame. Because data frames are close to what analysts typically think of as datasets, we’ll use the terms columns and variables interchangeably when discussing data frames. There are several ways to identify the elements of a data frame.
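
The one-type-per-column rule can be checked directly. The sketch below rebuilds the students data frame and asks R for the class of each column. It sets stringsAsFactors = FALSE explicitly, because versions of R before 4.0.0 converted character vectors to factors by default; the variable name col_classes is just an illustrative choice.

```r
# Rebuild the students data frame with named columns.
students <- data.frame(
  rollno  = c(1201, 1202, 1203, 1204, 1205),
  age     = c(19, 20, 22, 19, 20),
  gender  = c("Male", "Female", "Male", "Male", "Female"),
  results = c("Pass", "Pass", "Fail", "Pass", "Fail"),
  stringsAsFactors = FALSE  # keep text columns as character vectors
)

# One class per column: each column holds a single data type.
col_classes <- sapply(students, class)
col_classes

# str() gives a compact overview of the columns and their types.
str(students)
```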

You can use subscript notation or you can specify column names. The following examples illustrate both, using the students data frame created above.

Three ways to subset a data frame in R: examples

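In these examples, let’s subset the rollno and results columns from the data frame above, first by position and then by name. The third example extracts the single age column with the $ notation.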

> students[c(1,4)]
  rollno results
1   1201    Pass
2   1202    Pass
3   1203    Fail
4   1204    Pass
5   1205    Fail
> students[c("rollno","results")]
  rollno results
1   1201    Pass
2   1202    Pass
3   1203    Fail
4   1204    Pass
5   1205    Fail
> students$age
[1] 19 20 22 19 20
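
One difference between these forms is worth seeing in code: single brackets return a data frame, while $ returns the column itself as a vector. The sketch below also shows the double-bracket form and a combined row and column subscript, neither of which appears in the examples above; the variable names (age_df, age_vec, age_vec2, passed) are illustrative only.

```r
# Rebuild the students data frame with named columns.
students <- data.frame(
  rollno  = c(1201, 1202, 1203, 1204, 1205),
  age     = c(19, 20, 22, 19, 20),
  gender  = c("Male", "Female", "Male", "Male", "Female"),
  results = c("Pass", "Pass", "Fail", "Pass", "Fail"),
  stringsAsFactors = FALSE
)

age_df   <- students["age"]    # single bracket: a data frame with one column
age_vec  <- students$age       # $ notation: the column as a numeric vector
age_vec2 <- students[["age"]]  # double bracket: the same vector as $

is.data.frame(age_df)          # TRUE
is.data.frame(age_vec)         # FALSE
identical(age_vec, age_vec2)   # TRUE

# Row and column subscripts together: keep rows where results is "Pass",
# and only the rollno and results columns.
passed <- students[students$results == "Pass", c("rollno", "results")]
passed
```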

The $ notation in the third example is used to indicate a particular variable from a given data frame. For example, if you want to cross-tabulate students’ gender by results, you could use the following code:

> table(students$gender,students$results)
        
         Fail Pass
  Female    1    1
  Male      1    2

You may get tired of typing students$ at the beginning of every variable name, so R has shortcuts for that too. You can use either the attach() and detach() functions or the with() function to simplify your code.
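
As a rough sketch of both shortcuts, the code below repeats the gender-by-results cross-tabulation without the students$ prefix. The data frame is built with named arguments so that no standalone gender or results vectors exist in the workspace to clash with the attached columns; the variable names tab_with, tab_attach, and mean_age are illustrative only.

```r
# Rebuild the students data frame with named columns.
students <- data.frame(
  rollno  = c(1201, 1202, 1203, 1204, 1205),
  age     = c(19, 20, 22, 19, 20),
  gender  = c("Male", "Female", "Male", "Male", "Female"),
  results = c("Pass", "Pass", "Fail", "Pass", "Fail"),
  stringsAsFactors = FALSE
)

# with() evaluates an expression using the data frame's columns as variables.
tab_with <- with(students, table(gender, results))
mean_age <- with(students, mean(age))

# attach() places the columns on the search path; detach() removes them again.
attach(students)
tab_attach <- table(gender, results)
detach(students)

tab_with
tab_attach
mean_age
```

with() keeps the shortcut limited to a single expression, whereas attach() stays in effect until detach() is called, so it is easy to forget and can hide objects that share a name with a column.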