R Tutorial 1

Sep 9, 2018

Three intro points about R.

1) R is case sensitive. TyPiNG LIKE ThIs =/= typing like this. My general recommendation is just to pick a style and stick with it - do you like capitalizing all your variable names? That’s fine! Do you want everything lowercase? Cool, you do you. There’s no right or wrong choice; don’t expend any energy trying to figure out what “everyone else does” on this front.

2) R is an object-oriented language. This will mean more in the future, but in short, it means that EVERYTHING is customizable; it’s one of the things that makes R wildly popular. It also means that there are almost always about 3000 different ways to accomplish the same thing, depending on how you’re customizing/editing your R objects. If you can accomplish something I’m doing with less code or simply differently, that’s fine! There is no one “right” answer when it comes to R (though there are definitely still many, many wrong answers…).

3) R is open source (and as previously mentioned, hugely popular). This means that it’s growing every day - people can make their own packages and functions and share them online, and then people like us can use them in our R sessions. One consequence of this is that when you don’t know how to do something in R, if you can articulate sensible Google search terms, you can very easily find the answer to your question 9 times out of 10.
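To make the case sensitivity point concrete, here is a minimal sketch. The variable names are made up for illustration; the only thing it shows is that R treats differently capitalized names as completely separate objects.

```r
# Illustrative variable names; nothing here comes from a real dataset.
myvar <- 10
MyVar <- 20

# R treats these as two separate objects with their own values.
myvar == MyVar      # FALSE

# No object was ever created with this exact capitalization.
exists("MYVAR")     # FALSE
```

If you ever get an “object not found” error for a variable you are sure you created, checking the capitalization is a good first step.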

Reading data into R often ends up being one of the barriers people run into when they start using R. Here are a few examples of how to read in data (hopefully) without issue!

For SPSS files

My personal preference for reading in SPSS datasets is using the ‘foreign’ package, but it’s not the only one. You could also use the ‘haven’ package if you prefer. There’s no right answer! Just pick one. Remember, if you don’t yet have a package installed, you have to use the ‘install.packages()’ command (make sure to put whatever the package name is in quotation marks). Then, you have to actually load the package by using the ‘library()’ command. You have to use the ‘library()’ command to load packages because otherwise, R would have to load absolutely everything it has associated with it every time you open R, and it would take as long as SPSS does. SPSS takes so long to open because it’s effectively using its internal ‘library()’ command, but on every. single. thing. it can do.

I’ve commented out the install.packages commands because I already have them installed on my computer, but if you do not, remove the hashtag to run the code. In subsequent parts of this code, I will only use the “library()” command; if you get the error that there is no package named XYZ, it means you need to first use install.packages() before doing library().

`# install.packages("foreign")`

`# install.packages("haven")`

`library(foreign)`

`library(haven)`
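If you would rather not remember which packages you have already installed, one option is to let R check for you. This is only a sketch: `missing_packages()` is a hypothetical helper written for this example, not a function from any package, and the actual install line is left commented out so nothing gets installed by accident.

```r
# Hypothetical helper: returns the names of packages that are not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("foreign", "haven"))

# Only bother with install.packages() when something is actually missing.
if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)
}
```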

Most times you run a library() command (or a lot of other commands) in R, you’ll see red text show up. This doesn’t necessarily mean something went wrong! Oftentimes it is just drawing your attention to something, like what version of R that package was built under. It’s up to you whether you think whatever it says is actually a problem for you or not. >90% of the time, it’s no problem at all for you; R just wants you to stay informed.
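As a rough illustration of why red text is not automatically bad news, R has three different kinds of notices: messages, warnings, and errors. The sketch below uses made-up text for each one; only the error actually stops code from running.

```r
# A message: purely informational, and the code keeps running.
message("Loading something... this is only a note.")

# A warning: worth reading, but the code still returns a value.
x <- as.numeric("abc")   # gives NA, plus a warning about coercion
is.na(x)

# An error: the only one of the three that actually stops the code.
# tryCatch() is used here so the example can keep going afterwards.
outcome <- tryCatch(
  stop("this is a genuine error"),
  error = function(e) "stopped by an error"
)
outcome

# Startup messages from a package can be silenced if they get noisy.
suppressPackageStartupMessages(library(stats))
```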

I prefer to use the entirety of the file path of my SPSS dataset to read it in. You don’t HAVE to do it this way; I’ll show some other options below. The “read.spss” function comes from the foreign package; the “read_sav” function comes from the haven package. Notice the direction of the slashes - R expects forward slashes in file paths. However, you can use double backslashes and it will then work. The reason for this is that the backslash is the escape character in R strings (for example, “\n” means a new line), so when R sees a single backslash it thinks you are starting a special character rather than typing a folder separator.

`dataset <- read.spss("C:/Users/keana/downloads/dataset.sav")`

`dataset2 <- read_sav("C:\\Users\\keana\\downloads\\dataset.sav")`
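To see what R actually does with those slashes, here is a small sketch using the same example path. It does not read any file; it only manipulates the path as text. `file.path()` is a base R function that glues folder names together with forward slashes, which saves you from typing the separators yourself.

```r
# The example path from above, written with doubled backslashes.
win_path <- "C:\\Users\\keana\\downloads\\dataset.sav"

# Each doubled backslash is stored as a single backslash character.
cat(win_path, "\n")

# Converting backslashes to forward slashes gives the other spelling.
fwd_path <- gsub("\\", "/", win_path, fixed = TRUE)

# file.path() builds the same thing from its pieces.
built <- file.path("C:", "Users", "keana", "downloads", "dataset.sav")
identical(fwd_path, built)
```

This is handy if you copy a path out of Windows Explorer, since those come with single backslashes that R will not accept as typed.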

If you just run this code, read.spss() will hand your dataset back as a list, and read_sav() will read it in as a “tibble”. Some people prefer tibbles, but I personally find them a little confusing, so I like to add an argument to get the result to resemble what you might be used to seeing in SPSS. Setting the ‘to.data.frame’ argument of read.spss() to TRUE will give you this.

`dataset <- read.spss("C:/Users/keana/downloads/dataset.sav", to.data.frame = TRUE)`
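If you want to see the difference without your own .sav file, the sketch below uses `electric.sav`, an example file that (to my knowledge) ships with the foreign package. It is not the dataset from this tutorial, and the object names are made up. The `suppressWarnings()` wrapper is only there to keep the output tidy.

```r
library(foreign)

# Example file bundled with the foreign package (not this tutorial's data).
sav <- system.file("files", "electric.sav", package = "foreign")

# Default behaviour: a plain list of variables.
as_list <- suppressWarnings(read.spss(sav))

# With to.data.frame = TRUE: a regular data frame with rows and columns.
as_df <- suppressWarnings(read.spss(sav, to.data.frame = TRUE))

class(as_list)
class(as_df)
head(as_df)
```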

Sometimes, you may also run into an issue with variables that are text strings. For example, your gender variable may contain the text strings ‘male’ and ‘female’ instead of being coded ‘0’ and ‘1’. In particular, this happens when you have value labels in SPSS. This will sometimes cause R to read in your gender variable as what’s called a “factor”. Factors can’t be used in certain analyses and functions in R later, so it’s something to know about. To get around this for text variables, you can use another argument called ‘stringsAsFactors’ and set it to FALSE. (For variables whose text comes from SPSS value labels, read.spss() also has a ‘use.value.labels’ argument; setting it to FALSE keeps the underlying numeric codes instead.)

`dataset <- read.spss("C:/Users/keana/downloads/dataset.sav", to.data.frame = TRUE, stringsAsFactors = FALSE)`
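If a variable has already come in as a factor, you can also convert it after the fact. The sketch below builds a tiny made-up data frame (the `survey` object and its values are invented for illustration) and shows one way to get both a plain text version and a 0/1 coded version.

```r
# Made-up data: gender read in as a factor.
survey <- data.frame(gender = c("male", "female", "female"),
                     stringsAsFactors = TRUE)

is.factor(survey$gender)    # TRUE
levels(survey$gender)       # the distinct categories R has stored

# Convert the factor to ordinary text.
survey$gender_chr <- as.character(survey$gender)

# Or recode it explicitly as numbers (here: male = 0, female = 1).
survey$gender_code <- ifelse(survey$gender_chr == "male", 0L, 1L)

str(survey)
```

Spelling out the recode explicitly, rather than relying on the factor’s internal numbering, makes it obvious later which category got which number.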

You can also read in other dataset types, such as Excel files. Reading in Excel data works almost exactly the same way, just with a new function to call. However, this one doesn’t have a to.data.frame argument (it returns a tibble), so we’ll use a work-around.

For Excel Files

`library(readxl)`

`dataset <- data.frame(read_xlsx("C:/Users/keana/downloads/dataset.xlsx"))`
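Here is a sketch of the same work-around that you can run without your own spreadsheet. It assumes the readxl package is installed and uses `datasets.xlsx`, an example workbook that ships with readxl (not this tutorial’s data). The whole thing is wrapped in a check so it quietly does nothing if readxl is missing.

```r
if (requireNamespace("readxl", quietly = TRUE)) {
  # Example workbook bundled with readxl (not this tutorial's data).
  xlsx_path <- readxl::readxl_example("datasets.xlsx")

  # read_xlsx() on its own returns a tibble.
  as_tibble <- readxl::read_xlsx(xlsx_path)

  # Wrapping the result converts it to a regular data frame.
  as_df <- as.data.frame(as_tibble)

  print(class(as_tibble))
  print(class(as_df))
}
```

Using `as.data.frame()` here instead of `data.frame()` is a matter of taste; one practical difference is that `data.frame()` may alter column names that contain spaces, while `as.data.frame()` leaves them alone.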

What about .csv, .txt, or .dat files? Well, it depends on what type of separator you have in your data. The “sep =” argument within the ‘read.csv()’ function lets you tell R what kind of separator it is. If it’s a .csv file, it’s (typically) a “comma separated” file, so you’d put a comma in there. If the values are separated by tabs, you’d put “\t” in there. If they’re separated by spaces (or any mix of spaces and tabs), you leave the quotes empty (as shown below), which tells R to split on any run of whitespace. Then, you indicate with the ‘header’ argument whether the first row of your data contains variable names.

For general delimited text files

`dataset_fromdat <- read.csv("C:/Users/keana/downloads/dataset.dat",
                    sep = "", header = FALSE)`
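To try the different separators side by side, the sketch below writes three tiny made-up files to a temporary folder and reads each one back. The file contents and object names are invented for illustration; nothing on your computer outside the temporary folder is touched.

```r
# Made-up data written to temporary files, so no real files are assumed.
csv_file <- tempfile(fileext = ".csv")
tab_file <- tempfile(fileext = ".txt")
dat_file <- tempfile(fileext = ".dat")

writeLines(c("id,score", "1,10", "2,20"), csv_file)      # comma separated, with header
writeLines(c("id\tscore", "1\t10", "2\t20"), tab_file)   # tab separated, with header
writeLines(c("1   10", "2   20"), dat_file)              # spaces, no header row

from_csv <- read.csv(csv_file, sep = ",", header = TRUE)
from_tab <- read.csv(tab_file, sep = "\t", header = TRUE)
from_dat <- read.csv(dat_file, sep = "", header = FALSE)

# Without a header row, R makes up column names for you.
names(from_dat)
```

If the columns come out as one long smushed-together column, the separator is almost always the culprit.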

If all else fails, you can also use “file.choose()”: a pop-up window will let you pick your file, and R gets back the file path for your data, which you can hand to whichever read function you’re using. Honestly though, I’ve had way more issues using the pop-up windows/drop-down menus than I have just with using code to load in data.

As I alluded to earlier, you don’t actually have to put in the whole file path to load in your data. As long as your data is in R’s current working directory (often the folder your saved R script is in), it can be loaded with just e.g. “read.spss(“dataset.sav”)”. However, I generally advise against this for two reasons.

1) If you open R from anywhere other than that folder, you cannot load the data that way. So if you click the RStudio shortcut, R boots from your documents folder (or wherever you installed R), and the working directory is not the folder your R script is in, even if that R script is still open when you reopen R! This is incredibly confusing. You can always check what your current working directory is using “getwd()”, and you can change it using “setwd()”.

2) Reproducibility, which is the much more important reason. One of the unrivaled benefits of R is reproducibility. After your article has been under review for like 6 months, and you can’t remember which variable was named what in SPSS that had these outliers removed for that reason, and the reviewer wants you to do one tiny thing differently…that sucks. You don’t have that in R. I have one published paper whose analyses I did (almost) entirely in R, and I can still open my script back up and reproduce everything for that manuscript down to the figure. And, because I included the entire file path for where I got my dataset(s), I can re-locate which exact files I used, rather than having to remember whether I had re-saved an SPSS file since then and if I used dataset_6_19.sav or dataset_6_18.sav.
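The working directory issue is easier to see than to describe, so here is a small sketch. It creates a made-up file in a temporary folder (the folder and file name are invented for illustration) and checks whether R can find it by its full path versus by its bare name.

```r
# Made-up file in a temporary folder, standing in for a real data folder.
data_dir  <- normalizePath(tempdir(), winslash = "/")
data_path <- file.path(data_dir, "dataset_demo.csv")
write.csv(data.frame(id = 1:3), data_path, row.names = FALSE)

getwd()   # where R is currently looking for bare file names

# The full path works no matter what the working directory is.
found_by_full_path <- file.exists(data_path)

# The bare file name only works while the working directory is that folder.
old_wd <- setwd(data_dir)              # setwd() hands back the previous directory
found_by_name <- file.exists("dataset_demo.csv")
setwd(old_wd)                          # put things back the way they were

found_by_full_path
found_by_name
```

Running `file.exists()` on a path before trying to read it is also a quick way to tell a path problem apart from a problem with the file itself.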