Introduction

This vignette explores two toy examples (binary data and count data) to illustrate how the functions provided in GINAX perform the GINA-X procedure. Data has been simulated under a generalized linear mixed model from 9,000 SNPs of 328 A. Thaliana ecotypes. The GINAX package includes as R objects the simulated data; 9,000 SNPs, the simulated phenotypes (both binary and Poisson), and the kinship matrix used to simulate the data. Further, the Github repo that contains the GINAX package also contains the data for the A. Thaliana case study.

The GINAX package implements the novel Genome-wide Iterative fiNe-mApping method for non-Gaussian data (GINA-X) proposed by Xu et al. (2025). As shown in that paper, traditional fine-mapping methods often fail to identify causal variants with smaller effect sizes, and do not properly correct for multiple comparisons. In contrast, GINA-X efficiently extracts information from GWAS data in a way that reduces FDR and increases recall.

Functions

The function implemented in GINAX is described below:

GINAX Performs GINA-X, using generalized linear mixed models for a given numeric phenotype vector, either binary or Poisson distributed Y, a SNP matrix encoded numerically SNPs, fixed covariates Fixed, and random effects and their projection matrices (covariance and Z respectively). The GINAX function returns the indices of the SNP matrix that were identified in the best model found by the GINA-X procedure.

Model/Model Assumptions

The model used in the GINAX package is

\[\begin{equation*} \textbf{y} \sim F(\cdot|\theta) // g(\theta) = X \boldsymbol{\beta} + X_f \boldsymbol{\beta}_f + Z_1 \boldsymbol{\alpha}_1 + \ldots + Z_l \boldsymbol{\alpha}_l \end{equation*}\]

where

\(g()\) is the link function.
\(\theta\) is the parameter, related to the mean of the distribution.
\(\textbf{y}\) is the vector of phenotype responses. Either binary or Poisson.
\(X\) is the matrix of SNPs (single nucleotide polymorphisms).
\(\boldsymbol{\beta}\) is the vector of regression coefficients that contains the effects of the SNPs.
\(X_f\) is a matrix of fixed covariates.
\(\boldsymbol{\beta}_f\) is the vector of regression coefficients that contains the effects of the fixed covariates.
\(Z_i\) is an incidence matrix relating the random effects \(\boldsymbol{\alpha}_i\) to the phenotype response.
\(\boldsymbol{\alpha}_i\) is a vector of random effects with covariance matrix \(\Sigma_i\). Common covariance structures include the identity matrix and kinship matrix.

Currently, GINAX can analyze binary responses (family = "bernoulli") and Poisson responses (family = "poisson").

Examples

Simulated Data

The GINAX function requires a vector of observed phenotypes (either binary or assumed Poisson distributed), a matrix of SNPs, and the specification of the random effects. First, the vector of observed phenotypes must be a numeric vector or a numeric \(n \times 1\) matrix. In the GINAX package, there are two simulated phenotype vectors. The first simulated phenotype vector comes from a Poisson generalized linear mixed model with both a kinship random effect and an overdispersion random effect. The data is assumed to have 15 replicates for each A. Thaliana ecotype. The first five elements of the Poisson simulated vector of phenotypes are

data("Y_poisson")
Y_poisson[1:5]
#> [1]  2387   179   299   139 59805

The second simulated phenotype vector comes from a binary generalized linear mixed model with only a kinship random effect. The first five elements of the binary simulated vector of phenotypes are

data("Y_binary")
Y_binary[1:5]
#> [1] 0 0 1 1 0

Second, the SNP matrix has to contain numeric values where each column corresponds to a SNP of interest and the \(i\)th row corresponds to the \(i\)th observation. In this example, the SNPs are a subset of the A. Thaliana TAIR9 genotype dataset and all SNPs have minor allele frequency greater than 0.01. Each simulated phenotype vector is simulated using this SNP matrix. Here are the first five rows and five columns of the SNP matrix:

data("SNPs")
SNPs[1:5,1:5]
#>      SNP2555 SNP2556 SNP2557 SNP2558 SNP2559
#> [1,]       1       1       1       0       0
#> [2,]       0       1       1       1       1
#> [3,]       0       0       1       1       1
#> [4,]       1       1       0       0       1
#> [5,]       1       1       1       1       1

Third, the kinship matrix is an \(n \times n\) positive semi-definite matrix containing only numeric values. The \(i\)th row or \(i\)th column quantifies how observation \(i\) is related to other observations. Since both simulated phenotype vectors are simulated from the same SNP matrix, they have the same kinship structure. The first five rows and five columns of the kinship matrix are

data("kinship")
kinship[1:5,1:5]
#>              V1         V2         V3         V4         V5
#> [1,] 0.78515873 0.15800700 0.04264546 0.02057071 0.05643574
#> [2,] 0.15800700 0.78146476 0.05135891 0.01476357 0.05482448
#> [3,] 0.04264546 0.05135891 0.80199976 0.10558970 0.04888596
#> [4,] 0.02057071 0.01476357 0.10558970 0.80030413 0.02935703
#> [5,] 0.05643574 0.05482448 0.04888596 0.02935703 0.78401489

The function GINAX implements the GINA-X method for generalized linear mixed models with either Poisson or Bernoulli distributed responses. This function takes as inputs the observed phenotypes, the SNPs coded numerically, the distributional family of the phenotype, a matrix of fixed covariates, the covariance matrices of the random effects, the design matrices of the random effects, and an offset. Further, the other inputs of GINAX are the FDR nominal level, the maximum number of iterations of the genetic algorithm in the model selection step, and the number of consecutive iterations of the genetic algorithm with the same best model for convergence.

GINAX Poisson Example

Here we illustrate the use of GINAX with a nominal FDR of 0.05 with Poisson count data. First we specify the covariance matrices for the random effects. The first random effect is assumed to be \(\boldsymbol{\alpha}_1 \sim N(0,\kappa_1 K)\), where \(K\) is the realized relationship matrix or kinship matrix. The second random effect is assumed to be \(\boldsymbol{\alpha}_1 \sim N(0,\kappa_2 I)\), where the covariance matrix is an identity matrix times a scalar. This second random effect is to account for overdispersion in the Poisson model. The Covariance argument takes a list of random effect covariance matrices. For this example, the list of covariance matrices is set as:

n <- length(Y_poisson)
covariance <- list()
covariance[[1]] <- kinship
covariance[[2]] <- diag(1, nrow = n, ncol = n)

The design matrices \(Z_i\) do not need to be specified in Z as the observations have no other structure such as a grouping structure. Z is set to be NULL implying that \(Z_i = I_{n \times n}\). Further, because the number of ecotype replications is 15, in this example we set the offset to log(15). The call to the GINAX function is

# This example is computationally intensive and is shown but not evaluated in this vignette.
# You can run it manually in your R session.
output_poisson <- GINAX(Y=Y_poisson, Covariance=covariance, SNPs=SNPs, family="poisson", Z=NULL, offset=log(15),FDR_Nominal = 0.05, maxiterations = 1000, runs_til_stop = 200)
output_poisson

The data was generated with causal SNPs at positions 450, 1350, 2250, 3150, 4050, 4950, 5850, 6750, 7650, and 8550. GINAX outputs the column indices of the SNPs matrix that are in best model or column indices of SNPs perfectly correlated to SNPs in the best model.

GINAX Example

Here we illustrate the use of GINAX with a nominal FDR of 0.05 with binary data. First we specify the covariance matrices for the random effects. The only random effect is assumed to be \(\boldsymbol{\alpha} \sim N(0,\kappa_1 K)\), where \(K\) is the realized relationship matrix or kinship matrix. For this example, the list of covariance matrices is set as:

covariance <- list()
covariance[[1]] <- kinship

In this example, the design matrices \(Z_i\) do not need to be specified in Z as the observations have no other structure such as a grouping structure. Z is set to be NULL implying that \(Z_i = I_{n \times n}\). With binary data, setting the number of replicates provides no computation gain and is not required.

# This example is computationally intensive and is shown but not evaluated in this vignette.
# You can run it manually in your R session.
output_binary <- GINAX(Y=Y_binary, Covariance=covariance, SNPs = SNPs, family = "bernoulli", Z=NULL, offset=NULL, FDR_Nominal = 0.05, maxiterations = 2000, runs_til_stop = 400)
output_binary

Similarly to the Poisson example, the data was generated with causal SNPs at positions 450, 1350, 2250, 3150, 4050, 4950, 5850, 6750, 7650,and 8550. GINAX identifies 1 false SNPs and 4 true causal SNPs.

GINAX package

Shuangshuang Xu, Jacob Williams, Allison N. Tegge, and Marco A. R. Ferreira

2025-10-08