Specify Core Inputs • devMSMs

This vignette guides users through the specification of the core inputs required by the deMSMs functions (and helper functions). Users should first view the Terminology and Data Requirements vignettes. This vignette helps users specify core inputs (e.g., home directory, exposure, time-varying confounders, time invariant confounders, outcome, reference, comparison, high/low exemplar cutoff for continuous exposure, and balance threshold) that are used by multiple functions in the devMSMs package (see Workflows vignettes). Their use and values should be kept consistent throughout the package. Additionally, all functions have two optional inputs allowing the user to override defaults to suppress all messaging to the console (verbose = FALSE) and elect not to save out locally any intermediate and final output (save.out = FALSE).

Following completion of this vignette, the user should complete a Workflow vignette to implement devMSMs with their longitudinal data. Many of these core inputs will be specified at the beginning of the workflow in the MSM object.

The code contained in this vignette is also available, integrated code from the other vignettes, in the ExampleWorkflow.rmd file.

P3.1 Recommended: Home Directory

Users are required to specify home directory, or in quotations, a path to a designated folder for all output from the package (without the final forward slash), if they plan to save out intermediary and final outputs in the package (default) by setting save.out = TRUE for any of the functions. All sub directories will be created within this home directory by the devMSMs functions automatically when save.out = TRUE.

home_dir <- NA
# home_dir <- '/Users/isabella/Library/CloudStorage/Box-Box/BSL General/MSMs/testing/isa'

P3.2 Recommended: Time Point Delimiter

Users are required to use a consistent delimiter for parsing the names of their time-varying variables (i.e., exposure, time-varying confounders, outcome) into the name of the construct and the time point at which it was measured. The default is a period but the user can specify any delimiter. devMSMs will assume the integer after the final instance of the delimiter wihin the variable name corresponds to the time point at which it was measured. The same delimiter shoudl be used for all time-varying variables.

Users can specify a different delimiter as a character string in regular expression (regex) form.

Below, we use the default period as a time delimiter.

sep <- "\\."

P3.3 Required: Exposure Variable

The exposure refers to the developmental construct that is hypothesized to cause the outcome. Users are required to specify an exposure variable for use in all functions in devMSMs. The exposure can be either continuously distributed or binary, which the package automatically detects. The user must specify exposure, or a list of character strings corresponding to the exposure variable in wide format, as it appears in your dataset. The time at which the exposure was measured should appear as a suffix with a delimiter of the user’s choosing (that should be the same for time-varying confounders).

The user has two options when specifying the exposure variablee, and should select the option that best serves their theory regarding developmental timing and the practical constraints of their data and the modeling process.

First, they may specify exposure at all time points at which the exposure was measured in their data. This means that balancing formulas will be created (Steps 1a, 2a, 3b in the Workflows vignettes) and IPTW weights will be created (Steps 2b, 3c in the Workflows vignettes) and assessed (Steps 2c, 3a, 4 in the Workflows vignettes) for all time points. In this case, if no epochs are specified, all time points will be included as exposure main effects in the final substantive model and history comparison (Steps 5 & 6 in the Workflows vignettes).

Second, they may specify a subset of theoretically important time points at which the exposure was measured in their data. This means that balancing formulas will be created and IPTW weights will be created and assessed for only those time points. Further, if no epochs are specified, only those subsetted time points will be included as exposure main effects in the final substantive models. Importantly, exposure variables at time point not included in exposure should be included as time-varying confounders, to be used for balancing purposes only.

The specification of exposure epochs should be kept consistent throughout the use of the devMSMs package. If the user intends to specify exposure epochs, the user should include all time points encompassed in the epochs in. If the user does not intend to specify exposure epochs, the exposure will constitute the exposure main effects in the final outcome model and form the basis of the histories used for history comparison. In this case, if the user specifies more than 4 exposure time points, they will be required to conduct only a subset of history comparisons (Step 6 in the Workflows vignettes), given that the base code (see the hypotheses() function from the marginaleffects package) cannot accommodate all pairwise history comparisons for more than 5 time points.

For this example, we elected to create epochs of infancy (6 and 15 months), toddlerhood (24 and 35 months), and early childhood (58 months) so we include all exposure time points.

exposure <- c("ESETA1.6", "ESETA1.15", "ESETA1.24", "ESETA1.35", "ESETA1.58")

P3.4 Required for Continuous Exposure: High and Low Exemplar Cutoff Values

If your exposure variable is continuously distributed (e.g., economic strain), there is not inherent demarcation of what is considered high versus low levels of exposure. Given that the final step of MSM process involves examining how different exposure histories –that vary in dose and timing – over development result in different outcome levels, we need to specify what is considered high versus low levels of exposure. Critically, these ‘high’ and ‘low’ values are merely examples to be plugged into the fitted regression equation to obtain a model-implied predicted value of the outcome for that particular exposure. For continuous exposures, this is required for the recommended preliminary step from Workflows vignette and the compareHistories() devMSMs function (see Workflows vignettes).

We note that specifying exemplar values is not unique to MSMs–we do this every time we probe the simple slopes of a statistical interaction at ‘high’ and ‘low’ values of a continuous moderator. It is often even implicit in the creation of binary exposure variables (e.g., how many doses of an intervention or drug trial constitute an ‘effective dose?’). Importantly, decision-making around exemplar values can be principled and transparent, drawing on theory, expertise, and data-derived criteria. Sometimes a continuous exposure has pre-established exemplar values, such as diagnostic thresholds for a clinical measure. Other times the researcher is tasked with identifying “reasonable” exemplar values. Ideally, this should be informed purely by theory; however, the user must also choose values that yield reasonable representation of the exposure histories in their data. For example, if one selected the 90th and 10th percentiles as ‘high’ and ‘low’ exposure, respectively, and there was some rank-order stability over time, then the representation of participants experiencing different exposure histories (e.g., ‘high’ at one time point yet ‘low’ at the others) within the data will be quite sparse. Thus, when these exposure histories are used to estimate the causal effects, the estimates will be imprecise (i.e., large standard errors) and risk extrapolation. We recommend first drawing upon theory and expertise, and then visually inspecting (see Step P4 of the Workflow for Continuous Exposures vignette) and adjusting to ensure adequate representation of participants within each history in your data, given the exemplar values. We also strongly recommend sensitivity analyses to probe the robustness of findings using both slightly more and less stringent exemplar values.

We specify to hi_lo_cut, a list of two quantile values (0-1): the first represents the value above which is considered high levels exposure and the second represents the followed below which is considered low levels of exposure. These values may have to be revised following inspection of the sample distribution across the resulting exposure histories in the recommended preliminary steps of the Workflows vignettes. These final exemplar values will be used in creating and modeling the effects of exposure histories in Steps 5 and 6 of the Workflows vignettes.

Below, we specify the 60th and 30th percentiles to demarcate high and low levels of economic strain exposure, respectively.

hi_lo_cut <- c(0.6, 0.3)

P3.5 Optional: Exposure Epochs

When modeling the effect of exposure on outcome in the MSM process, the user has the option to specify epochs of exposure, grouping the time points at which the exposure into more meaningful categories. This may be useful in cases when the user wishes to establish balance using all time points at which exposure was measured, but has substantive hypotheses at coarser timescales (e.g., infancy). If no epochs are specified, the time points at which exposure was measured will be used in the creation of exposure histories in the final step of the process. Each specified epoch must have a corresponding value (but the values can differ in number of entries as shown below).

To specify epochs, users utilize the optional epoch argument by providing, as a character string, a list of user-created epoch labels, one for each exposure time point in the order in which they are listed in exposure. If the user specifies exposure epochs, exposure main effects are created for each epoch, and exposure levels are averaged for any epochs that consist of two or more time point values.

The exposure epochs will be arguments in the fitModel() and compareHistories() devMSMs functions (see Workflows vignettes) and their specification should be kept consistent throughout use of the package and its vignettes. They will constitute the main effects variables when modeling the relation between exposure and outcome (Workflows vignettes Step 5) and form the basis for estimating and comparing exposure histories (Workflows vignettes Step 6). If no epochs are specified, the exposure time points from the exposure field will be used at the aforementioned steps.

Below, we specify ‘infancy’ (“ESETA1.6”, “ESETA1.15), ‘toddlerhood’ (”ESETA1.24”, “ESETA1.35”), and ‘childhood’ (“ESETA1.58”).

epoch <- c("Infancy", "Infancy", "Toddlerhood", "Toddlerhood", "Childhood")

P3.6 Recommended: Hypotheses-Relevant Exposure Histories

We strongly recommend users be selective about which histories, or developmental sequences of high and low exposure (at exposure time points or epochs), are vital for testing their hypotheses. We recommend that the user estimates and compares only a subset of all possible exposure histories using the reference and comparison fields (rather than comparing all possible exposure histories). The user can specify a custom subset of exposure histories using the reference and comparison fields as optional inputs to the compareHistories() devMSMs function (see Workflows vignettes). To conduct these customized comparisons, users must provide at least one unique valid history (e.g., “l-l-l”) as a reference by providing a character string (or a list of strings) of lowercase l’s and h’s (each separated by -), each corresponding to each exposure epoch (or exposure time point), that signify the sequence of exposure levels (“low” or “high”, respectively).

If you supply a reference history, in comparisons provide at least one unique and valid history for comparison by, in quotations, providing a string (or list of strings) of l’s and h’s (each separated by “-”), with each corresponding to each exposure epoch, that signify the sequence of exposure levels (“low” or “high”, respectively) that constitutes the comparison exposure history/histories to be compared to the reference in Step 6 of the Workflows vignettes. If you supply one or more comparisons, at least one reference must be specified. Each reference exposure history will be compared to each comparison history and all comparisons will be supplied for multiple comparison correction. If no reference or comparison is specified, all histories will be compared to each other in Step 6 of the Workflows vignettes. However, if there are more than 4 exposure main effects (either as epochs or exposure time points), the user is required to select a subset of history comparisons (Step 6 of the Workflows vignettes), given that the base code (see the hypotheses() function from the marginaleffects package) cannot accommodate all pairwise history comparisons for more than 5 time points).

These final reference and comparison values established at this step should be used for estimating and comparing exposure histories in Step 6 of the Workflows vignettes.

Below, we specify low economic strain at all epochs (“l-l-l”) as our reference event in comparison to high levels at all epochs (“h-h-h”) as well as all histories that contain 1 dose of exposure to high economic strain at different epochs.

reference <- c("l-l-l")

comparison <- c("h-h-h", 
                "l-l-h", 
                "h-l-l", 
                "l-h-l")

P3.7 Required: Outcome Variable

The outcome refers to the developmental construct that is hypothesized to be caused by the exposure. Users are also required to specify an outcome variable at a designated final time point, as required input to the fitModel() and compareHistories() functions in the devMSMs package. This final time point should be equal to (or, ideally greater than) the final exposure time point. Note that instances of the outcome variable measured at any prior time points should be included as time-varying confounders for balancing purposes. Users must specify an outcome, as a character string with a suffix of time point at which it was collected, corresponding to the variable name in the wide data. Any outcome variables measured at prior time points should be included as time-varying confounders.

For this example, we specify behavior problems measured at 58 months as the outcome.

outcome <- "StrDif_Tot.58"

P3.8 Recommended: Confounders

Specifying confounders is critical to the successful implmentation of MSMs. Confounders refer to variables that could cause either or both exposure and outcome, and thus could cause a spurious relation between exposure and outcome. MSMs assume that all possible confounders are measured.

P3.8a Recommended: Time Invariant Confounders

Time invariant confounders are those that are considered to be stable across the developmental period you are studying. Specifying time invariant confounders in ti_conf is recommended in the use of this package as input for the createFormulas() function in devMSMS. Time invariant confounders could include core demographic or birth characteristics (e.g., sex, racial group membership, birth complications) that might cause either the exposure or outcome, either directly or as a proxy, as suggested by theory and/or evidenced by strong associations in the existing literature.

If the user wishes to specify any interactions, they need to manually create them in the data before listing them here. Here, the user can also include any interaction terms between time invariant variables (e.g., “variable:variable”) for inclusion in the balancing formula. Keep in mind that any interactions that include factor variables will be decomposed into interactions at each factor level.

For ti_conf, as a list of character strings, provide the names of all confounders in your dataset that are time invariant. These should not have a time suffix. Below, we specify 18 time invariant confounders that were measured at the baseline study visit, prior to the first exposure time point.

ti_conf <- c( "state", "BioDadInHH2", "PmAge2", "PmBlac2", "TcBlac2", "PmMrSt2", "PmEd2", "KFASTScr",
  "RMomAgeU", "RHealth", "HomeOwnd", "SWghtLB", "SurpPreg", "SmokTotl", "DrnkFreq",
  "peri_health", "caregiv_health", "gov_assist")

P3.8b Recommended: Time-varying Confounders

Time-varying confounders are those that are considered to vary over development, regardless of how many times they are measured in your dataset. Specifying time-varying confounders is recommended for the use of this package as input for the createFormulas() devMSMs function (see Workflows vignettes). In tv_conf, as a list of character strings, provide the names of all variables in wide format (e.g., “variable.t”) in your dataset that are time-varying, including any exposure and outcome variables not specified in their respective fields (see above). They should have a time suffix following a delimiter.

Note that time-varying confounders should also include any confounders measured repeatedly at all time points (e.g., InRatioCor) or any collected only at one or several specific time points, and missing at other time points, but are not time invariant. Additionally, note that the user does not specify exposure time points to tv_conf, as lagged values of exposure will be automatically included as time-varying confounders in the creation of formulas.

tv_conf <- c("SAAmylase.6", "SAAmylase.15", "SAAmylase.24",
  "MDI.6", "MDI.15",
  "RHasSO.6", "RHasSO.15", "RHasSO.24", "RHasSO.35", 
  "WndNbrhood.6", "WndNbrhood.24", "WndNbrhood.35",
  "IBRAttn.6", "IBRAttn.15", "IBRAttn.24",
  "B18Raw.6", "B18Raw.15", "B18Raw.24", 
  "HOMEETA1.6", "HOMEETA1.15", "HOMEETA1.24", "HOMEETA1.35", 
  "InRatioCor.6", "InRatioCor.15", "InRatioCor.24", "InRatioCor.35",
  "CORTB.6", "CORTB.15", "CORTB.24",
  "EARS_TJo.24", "EARS_TJo.35",
  "LESMnPos.24", "LESMnPos.35",
  "LESMnNeg.24", "LESMnNeg.35",
  "StrDif_Tot.35", 
  "fscore.35")

P3.8c Optional: Concurrent Confounders

When creating balancing formula at each exposure time point, the default of devMSMs is to only include time-varying confounders measured before the exposure time point. This is because it is often difficult to differentiate confounders measured contemporaneously from mediators, and balancing on a mediator is ill-advised. However, sometimes a user may have good evidence to know that a given time-varying confounder is not a mediator of exposure even when measured contemporaneously. If this is the case, they may specify to override the package default an retain a given variable/set of time-varying confoudners in the formulas of the exposure measured contemporaneously.

For concur_conf: as a list of character strings, provide the names of any time-varying confounders (in wide format) that you wish to be included contemporaneously in the balancing formulas (overriding the default which is to only include lagged confounders).

concur_conf <- "B18Raw.6"

The user can now proceed to the Data Requirements or Workflows vignettes.

References

Arel-Bundock, Vincent. 2024. marginaleffects: Predictions, Comparisons, Slopes, Marginal Means, and Hypothesis Tests. https://CRAN.R-project.org/package=marginaleffects.