*Kickstarting R*

Massaging data in **R**

## Specifying subsets of data

This is known as extraction in
**R**. Most users will use the extraction
operator "[]" to select values from a data object. The flexibility of the
extraction operator does cause a bit of confusion for many new users.
To extract a single value from a matrix or array of **m** dimensions,
append the brackets to the name of the object with **m** integers separated
by commas, e.g.:
`> xmat[2,3]`
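
As a quick illustration of the rule above, the sketch below builds a small matrix and a three-dimensional array and pulls one value out of each. The objects `xmat` and `xarr` and their contents are made up for the example.

```r
# Hypothetical 3 x 4 matrix holding 1 to 12 (R fills matrices column by column)
xmat <- matrix(1:12, nrow = 3, ncol = 4)

# Two dimensions, so two indices: row 2, column 3
xmat[2, 3]      # 8

# Hypothetical 2 x 3 x 4 array: three dimensions need three indices
xarr <- array(1:24, dim = c(2, 3, 4))
xarr[2, 3, 4]   # 24
```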

Easy. What about a data frame? As the columns of a data frame can contain
different modes of data, they may be specified differently:

> xdf$age[3]
> xdf[[2]][3]
> xdf[,2][3]
> xdf[3,2]

In the first line, the name of the column was used, along with the
index of the desired value. In the second line, the index of the
column was used in double brackets, then the index of the value in single
brackets. In the third, the second column was specified by omitting the first index
in single brackets, again followed by a second index in single brackets. Finally, the
same row-then-column system shown for matrices or arrays will work. Note that if the
column is a factor object, each of these forms returns a factor object with a single
value rather than a bare value; wrap the result in `as.character()` to get the label alone.
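
The forms described above can be compared side by side. This is a minimal sketch on an invented data frame: `xdf`, its columns and its values are assumptions for illustration only. The matrix-style form takes the row first and the column second.

```r
# Hypothetical data frame: column 2 (age) is numeric, column 3 (sex) is a factor
xdf <- data.frame(
  id  = 1:6,
  age = c(23, 35, 41, 29, 52, 38),
  sex = factor(c("F", "M", "F", "F", "M", "M"))
)

# Four ways to reach the third value of the second column
by_name   <- xdf$age[3]     # column name, then position
by_double <- xdf[[2]][3]    # column index in double brackets, then position
by_column <- xdf[, 2][3]    # whole second column, then position
by_matrix <- xdf[3, 2]      # row 3, column 2, as for a matrix

# Extracting from the factor column gives a factor of length one
sex3 <- xdf[3, 3]
class(sex3)           # "factor"
as.character(sex3)    # the label alone
```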

To extract more than one value, use a vector rather than a single integer.

> xdf$age[3:6]
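> xdf$age[c(FALSE,FALSE,TRUE,TRUE,TRUE,TRUE)]

The vector can be explicit indices, as shown in the first line, or a vector of
logicals that will return the elements corresponding to TRUE values. The logicals
must really be logical: a numeric vector of zeros and ones is read as positions,
so the zeros are dropped and each 1 returns the first element.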
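
The two kinds of index vector can be tried on a plain vector. The sketch below uses invented values in a hypothetical vector `age`; it also shows that zeros and ones have to be converted with `as.logical()` before they behave as a logical index.

```r
age <- c(23, 35, 41, 29, 52, 38)   # hypothetical values

# Explicit positions
by_position <- age[3:6]

# Logical index: TRUE keeps the element, FALSE drops it
keep <- c(FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
by_logical <- age[keep]

# A comparison builds the logical index for you
over_30 <- age[age > 30]

# Zeros and ones are numeric positions, not logicals:
# index 0 is dropped and index 1 repeats the first element
as_numbers <- age[c(0, 0, 1, 1, 1, 1)]

# Converting them to logicals gives the intended selection
as_flags <- age[as.logical(c(0, 0, 1, 1, 1, 1))]
```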

## Changing the value of a variable

The convention of aligning variables as columns and subjects as rows is pretty
well established and will be observed here unless specified otherwise. Say that
you have one or more variables that you would like to normalize (i.e. transform
so that each variable has a mean of 0 (zero) and a variance of 1 (one)).
`> mydata$myvar<-(mydata$myvar-mean(mydata$myvar))/sd(mydata$myvar)`

The original `myvar` has been replaced by the normalized values.
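
To check that the transformation does what is claimed, here is a small sketch on invented numbers. The data frame `mydata` and its values are assumptions; the comparison with base R's `scale()` is added only as a cross-check.

```r
# Hypothetical data: five measurements of one variable
mydata <- data.frame(myvar = c(12, 15, 9, 20, 14))
raw <- mydata$myvar   # keep a copy of the original values

# Subtract the mean, then divide by the standard deviation
mydata$myvar <- (mydata$myvar - mean(mydata$myvar)) / sd(mydata$myvar)

mean(mydata$myvar)   # zero, up to rounding error
sd(mydata$myvar)     # 1

# scale() applies the same centring and scaling and returns a one-column matrix
scaled <- as.numeric(scale(raw))
all.equal(scaled, mydata$myvar)
```
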
Numeric transformations like this are relatively simple, as is generating
categories from continuous measurements:

`> mydata$tertiaryed<-ifelse(mydata$yearseduc > 12,"Y","N")`

or recategorizations of factors:

`> mydata$tertiaryed<-ifelse(mydata$education == "UNI" | mydata$education == "COL","Y","N")`

Notice how this time, the new values were stored in a new variable rather than
overwriting the previous values. You can either append the new variable to the
original data frame, as in the example, or just make it a separate variable.
Obviously, if you want to save your data in a compact form for further analysis,
appending makes it easier to manage.
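
Both kinds of recoding can be tried on a toy data frame. The sketch below invents the data and the level codes `"HS"` and `"PRI"`; only `"UNI"` and `"COL"` come from the text. It uses the element-wise `|` operator, since `ifelse()` needs one TRUE or FALSE per row, and shows `%in%` as an equivalent shorthand.

```r
# Hypothetical data: years of education and an education-level factor
mydata <- data.frame(
  yearseduc = c(10, 12, 16, 14, 9),
  education = factor(c("HS", "HS", "UNI", "COL", "PRI"))
)

# Category from a continuous measurement
from_years <- ifelse(mydata$yearseduc > 12, "Y", "N")

# Recategorization of a factor: | compares element by element
from_level <- ifelse(mydata$education == "UNI" | mydata$education == "COL",
                     "Y", "N")

# %in% tests membership in a set of levels and gives the same result
from_set <- ifelse(mydata$education %in% c("UNI", "COL"), "Y", "N")

# Append the new variable to the data frame
mydata$tertiaryed <- from_set
```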

## Dealing with NAs (missing values)

The NA (datum Not Available) is **R**'s way of
dealing with missing data. NAs can give you trouble unless you explicitly tell
functions to ignore them, or pass the data through `na.omit()` (drop all NAs
in the data, which for a data frame means every row that contains one) or
`na.exclude()`. In some cases you may
wish to give the NAs a specific value. For example, you may know that only
non-smokers did not complete a "How many cigarettes?" item, and want to replace
the NAs that were generated with zeros.
`> mydata$ncigs[is.na(mydata$ncigs)]<-0`

Notice here that an equality test is not appropriate for NAs, because they don't
equal anything. The `is.na()` function returns a logical vector that is TRUE
for the elements in `mydata$ncigs` that are NAs. Those
elements are then replaced with zeros. You can also replace NAs with potentially
more informative values by using a data imputation function.
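
The points in this section can be seen together in a short sketch. The data frame below is invented for illustration, with a hypothetical `age` column added next to `ncigs`. The `na.rm = TRUE` argument is one common way of telling a function such as `mean()` to ignore NAs.

```r
# Hypothetical survey data with missing values in both columns
mydata <- data.frame(
  ncigs = c(20, NA, 5, NA, 0),
  age   = c(34, 51, NA, 28, 45)
)

# An NA in the input makes the result NA unless the function is told to skip it
mean_with_na    <- mean(mydata$age)
mean_without_na <- mean(mydata$age, na.rm = TRUE)

# na.omit() on a data frame drops every row that contains an NA
complete <- na.omit(mydata)

# An equality test cannot find NAs: every comparison with NA is itself NA
eq_test <- mydata$ncigs == NA

# is.na() returns TRUE or FALSE for each element
missing <- is.na(mydata$ncigs)

# Replace the missing cigarette counts with zero
mydata$ncigs[is.na(mydata$ncigs)] <- 0
```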

For more information, see __An Introduction to R__:
Simple manipulations; numbers and vectors
