Once your data is loaded, you’ll want to explore it and make sure everything came in correctly. I usually go through a set of three functions to do this. The str() function lists your variables along with the first few observations of each (skip this one if you have more than about 100 variables in your dataset; the output gets annoyingly long). The describe() function, from the psych package, gives you basic descriptive statistics. The table() function gives you a frequency table à la SPSS. These functions are SUPER useful, so try to commit them to memory!
To ask about a specific variable within a dataset, put a “$” after the dataset name, followed by the variable name, as shown below. This tells R to select that object from within the data frame you’ve indicated.
library(psych)
library(tidyverse)
str(dataset)
table(dataset$GENDER)
describe(dataset$AGE)
What about some data wrangling/data management?
Okay, let’s say you at some point decided that you’re ready to start shifting /a little bit/ of your data management from SPSS to R to better document how you’re computing variables, etc.
#creating a log transformed variable - don't forget to make sure none of your observations are negative before transforming them! (adding 1 keeps zeros from turning into -Inf)
dataset$SUD_comp_log <- log(dataset$SUD_comp + 1)
#square root transformed
dataset$SUD_comp_sqrt <- sqrt(dataset$SUD_comp)
#creating a difference score, e.g., the RewP
dataset$RewP_Cz <- dataset$Cz_Gain_w2 - dataset$Cz_Loss_w2
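If you want to check the "no negative values" condition before transforming, here is a minimal sketch. The `toy` data frame and its values are made up for illustration; only the variable name `SUD_comp` comes from the example above. It uses base R's log1p(), which computes log(x + 1).

```r
# Hypothetical toy data standing in for dataset$SUD_comp
toy <- data.frame(SUD_comp = c(0, 1, 4, 9, NA))

# Check the smallest value first; na.rm = TRUE ignores missing values
min_val <- min(toy$SUD_comp, na.rm = TRUE)

# log1p(x) is the same as log(x + 1), so a value of 0 becomes 0 rather than -Inf
toy$SUD_comp_log <- log1p(toy$SUD_comp)

# sqrt() returns NaN (with a warning) for negative values, so the same check applies
toy$SUD_comp_sqrt <- sqrt(toy$SUD_comp)
```

Missing values simply stay missing after either transformation.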
Keeping only certain variables from a larger dataset can be done the tidyverse way (using the select() function), or the base R way. This is also a great example of one challenge you sometimes hit in R: the order in which you load packages matters. Originally, I loaded the tidyverse package, and dplyr is one of the packages it loads. dplyr has a function called “select” that we can use to take only certain variables from a dataset. BUT, the package MASS also has a function called “select” that does something totally different. When I was first making this document, I had it arranged differently, and by the time I got to this section I had loaded MASS after loading tidyverse. When two loaded packages share a function name, R uses the one from the package loaded most recently, so a plain “select” tried to use the MASS version and didn’t work. To get around this, I can directly name the package I want the function to come from (dplyr) followed by two colons, and then R will use the correct “select” function. Also, if you know which packages your functions come from, you could hypothetically skip the “library()” calls and just name the package for everything outside of base R. I don’t recommend this in general though; who can keep track of all these functions??
#the tidyverse way
dataset_new <- dplyr::select(dataset, SUBNAME, GENDER, AGE, RewP_Cz)
#the base R way: all rows (nothing before the comma), only the named columns
dataset_new2 <- dataset[,c("SUBNAME","SUD_comp","dis1_item")]
Standardizing (Z-scoring) variables and centering variables
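#scale() returns a one-column matrix, so as.numeric() turns it back into a regular variable
dataset$ZRewP_Cz <- as.numeric(scale(dataset$RewP_Cz))
#na.rm = TRUE keeps a single missing value from turning the mean (and every centered score) into NA
dataset$RewP_Cz_cent <- dataset$RewP_Cz - mean(dataset$RewP_Cz, na.rm = TRUE)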
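To see what standardizing and centering do when there are missing values, here is a small sketch on made-up numbers. The `toy` data frame is hypothetical; only the variable names are borrowed from the example above.

```r
# Hypothetical toy data with one missing value
toy <- data.frame(RewP_Cz = c(2, 4, 6, NA))

# scale() z-scores using the mean and SD of the non-missing values;
# as.numeric() flattens the one-column matrix it returns
toy$ZRewP_Cz <- as.numeric(scale(toy$RewP_Cz))

# Centering by hand: without na.rm = TRUE the mean would be NA
toy$RewP_Cz_cent <- toy$RewP_Cz - mean(toy$RewP_Cz, na.rm = TRUE)

# Same subtraction without na.rm, to show the problem
all_na <- toy$RewP_Cz - mean(toy$RewP_Cz)
```

Here the non-missing values have a mean of 4 and an SD of 2, so the z-scores come out as -1, 0, and 1, while `all_na` is missing for every case.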
Changing variable names
colnames(dataset_new2) <- c("Subname","SUDs","ESI-disinhibition")
View(dataset_new2)
Merging files together
#give the ID variable the same name in both data frames, then merge on it with 'by'
colnames(dataset_new2) <- c("SUBNAME","SUDs","ESI-disinhibition")
dataset_merged <- merge(dataset_new, dataset_new2, by = "SUBNAME")
colnames(dataset_new2) <- c("Subname","SUDs","ESI-disinhibition")
#if your ID variable has the same contents but simply has a different variable name, then you can use 'by.x' and 'by.y' instead of 'by'
dataset_merged2 <- merge(dataset_new, dataset_new2, by.x = "SUBNAME", by.y = "Subname")
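One thing worth knowing about merge(): by default it keeps only the cases that appear in both files. Here is a minimal sketch with two made-up data frames (`left` and `right` are hypothetical names, and the values are invented) showing the default next to `all = TRUE`, which keeps everyone.

```r
# Two hypothetical files that only partly overlap on the ID variable
left  <- data.frame(SUBNAME = c("s01", "s02", "s03"), AGE = c(19, 22, 25))
right <- data.frame(Subname = c("s02", "s03", "s04"), SUDs = c(3, 0, 7))

# Default: only IDs present in BOTH data frames are kept
inner <- merge(left, right, by.x = "SUBNAME", by.y = "Subname")

# all = TRUE keeps every ID from either data frame and fills the gaps with NA
full <- merge(left, right, by.x = "SUBNAME", by.y = "Subname", all = TRUE)
```

Comparing nrow() before and after a merge is a quick way to catch cases that silently dropped out.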
What about taking your dataset from wide to long and vice versa?
This is SUCH an annoying task to do in SPSS. Luckily, R makes your life much easier with this!
library(foreign) #read.spss() comes from the foreign package
library(reshape2) #not strictly needed here: the reshape() function used below is base R
iCAD <- read.spss("C:/Users/keana/OneDrive - Florida State University/Data/iCAD data/round 2/iCAD_r2.sav", stringsAsFactors = FALSE, to.data.frame = TRUE)
audit_vars <- c("cAUDIT_tot","dAUDIT_tot","eAUDIT_tot","fAUDIT_tot","gAUDIT_tot","hAUDIT_tot")
ddq_vars <- c("cddq_sum","dDDQ_sum","eDDQ_sum","fDDQ_sum","gDDQ_sum","hDDQ_sum")
iCAD <- dplyr::select(iCAD, dplyr::all_of(c("subject", audit_vars, ddq_vars)))
dataset_long <- reshape(iCAD, direction = "long",
                        #one vector of columns per new variable, in the same order as v.names;
                        #a single flat vector would need the columns interleaved (AUDIT, DDQ, AUDIT, DDQ, ...)
                        varying = list(audit_vars, ddq_vars),
                        v.names = c("AUDIT","DDQ"),
                        timevar = "Time", #naming NEW variable that will represent time
                        times = 1:6,
                        idvar = "subject") #just your actual id variable
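Since the heading promises "vice versa", here is a small self-contained sketch of going wide to long and back again with base R's reshape(). The `wide` data frame, its column names, and its values are all invented for illustration; this is one way to do it, not the only one.

```r
# Hypothetical wide data: two measures (AUDIT, DDQ) at two waves (a, b)
wide <- data.frame(
  subject = c(1, 2),
  aAUDIT = c(10, 20), bAUDIT = c(11, 21),
  aDDQ   = c(5, 6),   bDDQ   = c(7, 8)
)

# Wide to long: one list element per new variable, matching the order of v.names
long <- reshape(
  wide,
  direction = "long",
  varying = list(c("aAUDIT", "bAUDIT"), c("aDDQ", "bDDQ")),
  v.names = c("AUDIT", "DDQ"),
  timevar = "Time",
  times = 1:2,
  idvar = "subject"
)

# Long back to wide: new columns are named from v.names, sep, and the time value
wide_again <- reshape(
  long,
  direction = "wide",
  idvar = "subject",
  timevar = "Time",
  v.names = c("AUDIT", "DDQ"),
  sep = "_"
)
```

After reshaping, spot-check a couple of cases against the original file to confirm each score landed at the right time point.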
You might decide that you only want to analyze data within one gender. But you don’t want to delete the other cases forever! One advantage of R is that once you’ve loaded your data, nothing you do changes the actual data file, unlike in SPSS. Commands only affect the data frame that exists in the R environment. If you reload your data, those cases will still be there! And if you want to drop them again, you’ll have to re-run this command.
So we can drop the men by keeping only the cases with a ‘GENDER’ value of 2. As in most programming languages, ‘==’ tests whether two things are equal, while a single ‘=’ (or, more commonly in R, ‘<-’) is for assignment. So this line is quite literally saying: if it’s true that this case’s value for gender is equal to 2, then keep all their variables (the empty space after the comma just means take every column). For every case where this is true, we keep all their variables, which results in a “subsetted” dataset of only women. The way I’ve just shown you is base R; base R also has a subset() function, and the tidyverse equivalent is dplyr’s filter(). You can look into those for doing this too!
dataset_ladiesnight <- dataset[dataset$GENDER == 2,]
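One gotcha with the bracket approach: if GENDER has missing values, those cases come along as rows full of NA. A minimal sketch on made-up data (the `toy` data frame is hypothetical) showing the problem and two base R ways around it:

```r
# Hypothetical toy data with one missing GENDER value
toy <- data.frame(id = 1:4, GENDER = c(1, 2, NA, 2))

# Plain logical indexing: the NA comparison produces an extra all-NA row
with_na_rows <- toy[toy$GENDER == 2, ]

# which() keeps only the positions where the test is TRUE
women_which <- toy[which(toy$GENDER == 2), ]

# subset() also drops cases where the test is NA
women_subset <- subset(toy, GENDER == 2)
```

Here `with_na_rows` has three rows (one of them entirely NA), while the other two versions have just the two women.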