- Longitudinal data analysis for rare variants detection with penalized quadratic inference function
- Longitudinal and clustered data analysis books
- Longitudinal Data Analysis for Social Science Researchers
This chapter will shift from continuous to binary outcomes. Binary outcomes are those that fall into one of two categories. Special considerations are needed to model such data appropriately and obtain valid statistical results.
Longitudinal data analysis for rare variants detection with penalized quadratic inference function
For example, the cohort allelic sum test (CAST) [31] collapses multiple rare variants into one binary variable that indicates whether an individual carries any rare variant. The combined multivariate and collapsing method [33] first collapses the variants into several subgroups based on some predefined criteria.
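The CAST-style collapsing step can be illustrated with a minimal Python sketch. The function name `cast_indicator`, the 0/1/2 genotype coding, and the 1% frequency threshold for calling a variant rare are assumptions made for illustration; the source does not specify them.

```python
import numpy as np

def cast_indicator(genotypes, maf, rare_threshold=0.01):
    """Collapse rare variants into one binary indicator per individual.

    genotypes: (n_individuals, n_variants) minor-allele counts (0, 1, or 2).
    maf: (n_variants,) minor allele frequencies.
    rare_threshold: assumed cutoff below which a variant counts as rare.
    """
    genotypes = np.asarray(genotypes)
    rare = np.asarray(maf) < rare_threshold
    # 1 if the individual carries at least one rare allele, else 0
    return (genotypes[:, rare].sum(axis=1) > 0).astype(int)

# Toy data: 4 individuals, 3 variants (variants 0 and 2 are rare)
genotypes = np.array([[0, 0, 1],
                      [0, 1, 0],
                      [0, 0, 0],
                      [2, 0, 0]])
maf = np.array([0.005, 0.20, 0.008])
carriers = cast_indicator(genotypes, maf)
```

The second individual carries only a common allele, so the indicator ignores it; this is the information loss that weighted approaches try to reduce.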
The WSS method weights each variant differently when determining the genetic load of an individual. By weighting the signal from each variant, the WSS gives more emphasis to rare variants [9]. The variable threshold method [34] selects the optimal rare-frequency threshold on a grid of points and estimates the p-value by a permutation procedure.
Longitudinal and clustered data analysis books
All these burden tests assume that the variants share the same effect direction and magnitude after the weights are incorporated. Any violation of this assumption can result in a loss of power [8, 10]. To overcome the limitations of the burden tests, the data-adaptive sum test (aSum) was proposed. Specifically, the aSum method first estimates the direction of effect for each variant using a marginal regression model, then recodes the variants accordingly, and finally applies the same procedure as the burden test to test for any association.
However, aSum is computationally intensive because it obtains the p-value via permutations. Moreover, estimating regression coefficients for single rare variants is typically difficult and unstable [8].
Variance-component-based methods such as SKAT are an alternative. It has been demonstrated that burden tests were more powerful than SKAT when most of the rare variants were causal and had the same effect direction, whereas SKAT outperformed burden tests when the effects of rare variants were heterogeneous. Hybrid methods that combine the two approaches were more robust across a range of scenarios, but were less powerful than either test under that test's own assumptions [8]. Other dimension-reduction techniques are available for rare variant analysis, such as FPCA [11] and the adaptive ridge regression method (Luo et al.).
However, the performance of the dimension reduction techniques and variance components-based tests is not clearly known. Borrowing the idea of the WSS, we proposed to adopt the collapsing idea to collapse both rare and common variants over a gene or region into a single genetic score for further estimation and gene selection. Without loss of generality, here we focused on a gene to describe the method.
Suppose J is the total number of variants in a gene. Each individual is then scored by a single weighted average of the number of alleles in that gene. The weighting function assumes that rare variants have larger effect sizes than common variants [9]. A weighted gene score can be obtained for each gene.
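As a rough illustration of such a gene score, the Python sketch below up-weights rare variants with inverse-standard-deviation weights, w_j = 1 / sqrt(q_j (1 - q_j)), where q_j is a smoothed allele frequency. This weight form, the smoothing, and the normalization by the sum of weights are assumptions in the spirit of the WSS; the exact weighting function used in the source is not reproduced here.

```python
import numpy as np

def weighted_gene_score(genotypes):
    """Weighted average of allele counts over the J variants of one gene.

    genotypes: (n_individuals, J) minor-allele counts (0, 1, or 2).
    Returns one score per individual.
    """
    g = np.asarray(genotypes, dtype=float)
    n = g.shape[0]
    # Smoothed allele frequency so that no variant gets q = 0 (assumed form)
    q = (g.sum(axis=0) + 1.0) / (2.0 * n + 2.0)
    # Rare variants (small q) receive larger weights (assumed form)
    w = 1.0 / np.sqrt(q * (1.0 - q))
    return g @ w / w.sum()

# Toy gene with J = 2 variants: column 0 is rarer than column 1
genotypes = np.array([[1, 0],
                      [0, 1],
                      [0, 2],
                      [0, 1],
                      [0, 0]])
scores = weighted_gene_score(genotypes)
```

Under these assumed weights, one copy of the rarer allele contributes more to the score than one copy of the more common allele.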
Longitudinal Data Analysis for Social Science Researchers
The gene-based scores are then entered into the pQIF model to select the genes associated with a longitudinal disease trait. After the common and rare variants in each gene are collapsed with the weighted sum, a longitudinal mean model is defined on the gene scores and fitted with the pQIF method for estimation and gene selection. In real longitudinal studies, unbalanced data, in which the cluster size varies across individuals, are commonplace because of missing data.
In such cases, a transformation matrix H_i can be applied for each subject to fit the pQIF model. For the i-th subject with missing measurements, H_i is generated by deleting the columns that correspond to the missing measurements. Consider a study with a total of three time points.

We performed extensive simulations to examine the performance of the pQIF for longitudinal sequencing association studies. We examined the pQIF under different sample sizes and different variable dimensions.
The performance of the pQIF under misspecified correlation structures was also evaluated using three working correlations: independent, AR(1), and exchangeable.
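The transformation matrix H_i described above can be sketched for the three-time-point case. The sketch assumes that H_i starts from an identity matrix before columns are deleted and that it maps the observed responses back to full length with zeros at missing visits; the helper name `transformation_matrix` and the response values are hypothetical.

```python
import numpy as np

def transformation_matrix(observed):
    """Build H_i by deleting identity-matrix columns for missing visits.

    observed: boolean sequence, True where the visit was measured.
    Returns a (n_times, n_observed) matrix.
    """
    observed = np.asarray(observed, dtype=bool)
    return np.eye(observed.size)[:, observed]

# Three time points; the subject missed the second visit
H = transformation_matrix([True, False, True])

# Illustrative observed responses at visits 1 and 3
y_obs = np.array([2.0, 3.0])
# Expanded to full length, with 0 in the missing slot
y_full = H @ y_obs
```

With this construction every subject contributes a vector of the same length, so a common working correlation dimension can be used across subjects.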
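For reference, the three working correlation structures can be written as matrices. The Python sketch below is illustrative only: the function name and the value of rho are assumptions, and it does not show how the pQIF itself uses these structures.

```python
import numpy as np

def working_correlation(structure, n_times, rho=0.0):
    """Return an n_times x n_times working correlation matrix."""
    if structure == "independent":
        return np.eye(n_times)
    if structure == "exchangeable":
        # Same correlation rho between any two time points
        R = np.full((n_times, n_times), rho)
        np.fill_diagonal(R, 1.0)
        return R
    if structure == "ar1":
        # Correlation decays as rho^|s - t| with the lag between visits
        idx = np.arange(n_times)
        return rho ** np.abs(idx[:, None] - idx[None, :])
    raise ValueError(f"unknown structure: {structure}")

R_ind = working_correlation("independent", 3)
R_exch = working_correlation("exchangeable", 3, rho=0.5)
R_ar1 = working_correlation("ar1", 3, rho=0.5)
```

The exchangeable and AR(1) matrices agree at lag 1 but differ at longer lags, which is where misspecification between the two has an effect.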
The simulation was based on the GAW18 real sequencing data. The GAW18 dataset came from a longitudinal study consisting of whole-genome sequencing of individuals from pedigrees in the San Antonio Family Studies. A subset of these individuals are unrelated and had both real phenotype data and imputed sequence data. The sequencing data for GAW18 were provided only for markers on odd-numbered autosomes. Two phenotype datasets were provided. The first was the real phenotype data, including systolic and diastolic blood pressure along with covariates such as current use of antihypertensive medications, sex, age, and smoking status, at up to four time points. The second was simulated longitudinal phenotype data based on the real genotype data.
Here we focused on the unrelated individuals in both the simulation and real data analyses.
Because the sample size was not large enough to demonstrate the performance of the pQIF, we bootstrapped additional samples, assuming that the individuals represented the population. For each bootstrapped sample, we kept the original sequencing data fixed and simulated new binary longitudinal responses Y_it from a specified model. We also simulated noisy genes with no genetic effect. Each noisy gene consisted of 10 SNP variants with a fixed proportion of rare and common variants. Both the rare and common variants were collapsed over genes into a weighted score using the WSS method.
Ages were taken from the original dataset, and missing age values at exams were filled in by adding or subtracting 3. Tobacco smoking status was generated to follow the same quitting rate as in the real dataset. All variables were standardized to have mean zero and standard deviation one before further analysis. The R package mvtBinaryEP was used to generate the longitudinal binary responses. Simulation runs were conducted under each scenario. True positive (TP) and false positive (FP) rates were calculated to evaluate the model selection performance.
Data were simulated under an AR(1) correlation structure and then analyzed with an AR(1) working correlation, that is, assuming the true correlation structure was known. As the sample size increased, the TP selection rate also increased.
Thus, in real applications with a large number of genes, the pQIF may not be useful because of computational limitations, especially when the sample size is small.

Figure: Performance of the pQIF for different sample sizes and different dimensions. The true and working correlation structures were set as AR(1).
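One way to compute TP and FP selection rates for a single simulation run is sketched below in Python. The definitions used here (TP rate as the fraction of causal genes selected, FP rate as the fraction of noisy genes selected) are an assumption; the source does not state its exact formulas, and the function name is hypothetical.

```python
import numpy as np

def selection_rates(selected, causal):
    """TP and FP rates for one run of gene selection.

    selected: boolean sequence, True where the method selected the gene.
    causal: boolean sequence, True where the gene truly has an effect.
    """
    selected = np.asarray(selected, dtype=bool)
    causal = np.asarray(causal, dtype=bool)
    tp_rate = (selected & causal).sum() / causal.sum()
    fp_rate = (selected & ~causal).sum() / (~causal).sum()
    return tp_rate, fp_rate

# Toy run: 2 causal genes and 4 noisy genes
causal = [True, True, False, False, False, False]
selected = [True, False, True, False, False, False]
tp_rate, fp_rate = selection_rates(selected, causal)
```

Averaging these two rates over all runs in a scenario gives one summary pair per scenario.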