This is aimed at individuals who already understand background on latent variable modeling, but just want to know how to do it in R. I’ll try to have a short explanation of each model for folks who are less familiar with latent variables, but I’d direct you to other resources for a comprehensive understanding of latent variable modeling.
So you know, it’s funny, when I first learned how to do latent variable modeling in R, I told everyone about how they should totally jump ship from Mplus to R, and now after a few years, I’ve come back around to prefer doing my latent variable modeling in Mplus actually. But, there’s a great R package called ‘MplusAutomation’ that allows you to use Mplus through R (mind blown, I know!). I’ll make a separate guide for using that package.
Latent Variable Modeling
Latent variables refer to an entity that is not directly observable, but is posited to cause variation across multiple measures that is shared and reflected in those items’ covariance with one another.
Almost all latent variable models that can be implemented in Mplus can be implemented in R in the ‘lavaan’ package. Also, as you’ve picked up by now, R is incredibly popular, so there are SO many examples of how to run latent variable models in lavaan online. The creator of the lavaan package has extensive tutorials here also: http://lavaan.ugent.be/tutorial/index.html.
A simple CFA
This is an example of what a 4-indicator CFA looks like: https://images.slideplayer.com/25/7929454/slides/slide_21.jpg.
You have to estimate 8 paths in this model - 4 loadings and 4 disturbances (alternatively, you could estimate 1 variance of the latent variable, 3 loadings, and 4 disturbances). Loadings are interpreted essentially the same way as in an exploratory factor analysis; practically speaking, they’re denoting the degree to which the latent factor is defined by that specific indicator (or more accurately, the degree to which that indicator is caused by the latent variable). Disturbances are also known as residuals in certain cases; these refer to the variance unaccounted for by the latent variable in that indicator, or, in SEM models, the amount of variance in the latent variable DV (referred to as an endogenous variable) that is unaccounted for by the predictor variables (referred to as exogenous variable).
To do a latent variable modeling in lavaan, you want to specify your syntax very similarly to Mplus. You have 3 parts to a latent model - 1) your syntax, 2) calling your syntax and your dataset using the ‘cfa()’, ‘sem()’, ‘growth()’, or ‘lavaan()’ command, depending on the model being run, and 3) a results summary call. As with anything in R, you could theoretically nest all of these three parts within each other, but your code would look super ugly and be confusing to try and follow, so it’s my recommendation to do each part separately.
To specify the syntax, you just create a text object, starting and ending with an apostrophe and saved into an object. I always keep my apostrophes on different lines than the actual syntax for some reason; I think it helps to keep things more tidy. The latent variable you’re creating you name on the left side of the “=~“, which signifies that what’s to the left of that will be the latent variable and what’s on the right of it are the indicators that will define that latent variable to the left.
To run the model, you create another object where the results will be saved into (you’ll want to do this as a habit instead of just doing a summary of the results every time; there are reasons to have an object saved and be able to manipulate it, etc.). You first say what your syntax object is called, then what dataset you’re using, then how you want to handle missing data (typical to use full information maximum likelihood “FIML”, but could also use random effects maximum likelihood “REML”), then what estimator you want to use (typically just maximum likelihood, but also could be maximum likelihood with robust standard errors “MLR”).
To view the results of your model, you use the “summary()” command, where you say what the object your latent model is saved into is, then what type of results you want (I like to have standardized included), and you can additionally request modification indices, etc.
library(lavaan) library(tidyverse) library(foreign) dataset <- read.spss("dataset.sav", to.data.frame = T) cfa_syntax <- 'DIS =~ plnful + prbimp + irresp + impurg' fit_cfa <- cfa(cfa_syntax, dataset, missing = "FIML", estimator = "ML") summary(fit_cfa, standardized = T, fit = T)
Multiple Latent variable model
This is an example of what modeling multiple latent variables looks like: https://i0.wp.com/www.theanalysisfactor.com/wp-content/uploads/2016/03/sem-cfa-e1458568506324.png.
To add another latent variable, simply define it in the model in the same way as the first one. Lavaan will automatically estimate the covariance between all latent variables in the model unless you tell it not to by setting the covariance to 0. Keep in mind that this will have an effect on the loadings in your model; the way that latent variable modeling works is basically sophisticated guessing/plug and chug (this is ultimately what maximum likelihood estimation boils down to). It’s estimating all paths in your model simultaneously. Thus, when you add a second variable in your model, it will “bias” (not in a technical sense, but in a colloquial/practical one) your loadings for each of those factors to account for the specific slice of variance in each of their individual indicators that will cause the two latent variables to be more highly correlated with one another. Check out what happens to your loadings on your factor when you add a second related factor compared to when you model the first factor alone.
cfa_syntax <- 'DIS =~ plnful + prbimp + irresp + impurg SUD =~ alcprb + mrjprb + drgprb' fit_cfa <- cfa(cfa_syntax, dataset, missing = "FIML", estimator = "ML") summary(fit_cfa, standardized = T, fit = T)
Another important thing that’s not necessarily latent variable modeling but is a component of structural equation modeling is path analysis. It looks like this: https://stats.idre.ucla.edu/wp-content/uploads/2016/02/tikz5.png.
It simply estimates the “structural paths” between variables. In a regression, all variables are entered at the same time. Path analysis allows you to specify “at what point” you might put in a variable and then it gives you the regression path estimates.
path_syntax <- ' alcprb ~ dis1_item + mrjprb mrjprb ~~ dis1_item ' fit_path <- cfa(path_syntax, dataset, missing = "FIML", estimator = "ML") summary(fit_path, standardized = T) summary(lm(alcprb ~ mrjprb + dis1_item, data = dataset))
Structural Equation Model
Structural equation modeling is the combination of a CFA and a path analysis. This is an example of what a SEM looks like: https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSy3lD5hoV4rMhctSQI0coPBmBbnYte0Z-MFcPxXNuybGqrYX8q.
This is the same as a CFA, but it also involves exogenous predictors. This will now involve ‘regression’ paths, denoted by ‘~‘. You can do correlations instead of regression paths by using ‘~~’ (and keep in mind lavaan automatically calculates covariances among exogenous variables unless you tell it not to by saying “auto.var = F”).
sem_syntax <- ' DIS =~ plnful + prbimp + irresp + impurg SUD =~ alcprb + mrjprb + drgprb SUD ~ DIS + mpq_NEt ' fit_sem <- sem(sem_syntax, dataset, missing = "FIML", estimator = "ML") summary(fit_sem, standardized = T, fit = T)
Up to this point, we’ve used FIML to handle missingness on endogenous variables. However, we also sometimes want to use FIML to deal with missingness on exogenous variables. We can do this by specifying “fixed.x = FALSE” in the model command.
sem_syntax <- ' DIS =~ plnful + prbimp + irresp + impurg SUD =~ alcprb + mrjprb + drgprb SUD ~ DIS + mpq_NEt ' fit_sem <- sem(sem_syntax, dataset, missing = "FIML", estimator = "ML", fixed.x = F) summary(fit_sem, standardized = T, fit = T)
Higher order model
This is an example of what a higher order model looks like: https://ars.els-cdn.com/content/image/1-s2.0-S1048984310000287-gr1.jpg.
Notice that I’m defining latent variables with “=~” and then also defining another latent variable with “=~” and the variables to the right of this latent variable includes the latent variables I just defined in the first step.
sem_syntax <- ' DIS =~ plnful + prbimp + irresp + impurg SUD =~ alcprb + mrjprb + drgprb CALL =~ empath + honest + phyagg EXT =~ DIS + SUD + CALL ' fit_sem <- sem(sem_syntax, dataset, missing = "FIML", estimator = "ML", fixed.x = F) summary(fit_sem, standardized = T)
This is an example of what a bifactor model looks like: https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRXLSuDGpGqugTJxXv000AFUK8bJgIxWpEArkW6EYVGX5-azXRh.
To do a bifactor model, I create a latent variable that is defined by all the manifest indicators in the model, then additional latent variables that are defined by a subset of those indicators. Then, I say that there is 0 association between the higher-order latent variable of all indicators and the lower-order latent variable of a subset of indicators. Then, it is typical to also set the covariance between all lower-order factors to 0 as well, though I personally am not sure if I think this should always be the case.
One cautionary note - bifactor models have a very high “fit propensity.” Fit propensity refers to the artifical inflation of how well a model will fit data even if the data are not generated by that process. What that means practically is that just because a bifactor model fits the data better than a correlated factors, correlated residuals, or higher-order model does not necessarily mean the bifactor model is necessarily the “true” model/data generating mechanism. That doesn’t mean you shouldn’t run or select a bifactor model, it just means that you should be aware of it and really have very solid hypotheses about what a lower-order factor might represent because you can’t rely on fit indices to do that work for you like you can with other model comparisons.
BF_syntax <- ' EXT =~ plnful + prbimp + irresp + impurg + alcprb + mrjprb + drgprb + honest + phyagg SUD =~ alcprb + mrjprb + drgprb CALL =~ honest + phyagg EXT ~~ 0*SUD EXT ~~ 0*CALL SUD ~~ 0*CALL ' fit_BF <- sem(BF_syntax, dataset, missing = "FIML", estimator = "ML", fixed.x = F) summary(fit_BF, standardized = T, fit = T)
This is an example of what a correlated residuals model looks like: http://davidakenny.net/img/cfa.gif.
CR_syntax <- ' DIS =~ plnful + prbimp + irresp + impurg SUD =~ alcprb + mrjprb + drgprb ASPD =~ honest + phyagg plnful ~~ alcprb honest ~~ irresp ' fit_CR <- sem(CR_syntax, dataset, missing = "FIML", estimator = "ML", fixed.x = F) summary(fit_CR, standardized = T)
There are also many path modeling applications where you want to examine an indirect effect path. This is done by computing the paths you’re interested in by using the ‘:=’ notation to define a parameter.
mediation <-' DIS =~ plnful + prbimp + irresp + impurg SUD =~ alcprb + mrjprb + drgprb SUD ~ c1*mpq_NEt DIS ~ a1*mpq_NEt SUD ~ b1*DIS dir := c1 indir := a1*b1 total := a1*b1 + c1 ' fit_mediation <- sem(mediation, dataset, meanstructure=TRUE, se="bootstrap", bootstrap=1000) parameterEstimates(fit_mediation, boot.ci.type = "perc")
That’s it for this guide! Next, we’ll talk about how to run all of these models using MplusAutomation, and more advanced SEMs.