Input File Format

Input Genotype

The pipeline requires genotype information as VCF files, either uncompressed (VCF) or compressed (VCF.gz).

If you are working with imputed genotyping data, specify the optional parameter --r2thres to filter out poorly imputed variants.

Input Covariates

Per-subject covariates to be passed into the model are provided in a tab-delimited file (*.tsv).

For both cross-sectional and longitudinal analyses, the pipeline expects covariates in the format shown below.

Note: the Plink-style columns #FID and PHENO must be present, but both can be populated with 0.

#FID	IID	SEX	PHENO	study_arm	apoe4	levodopa_usage	age_at_baseline
0	sid-1	1	0	control	0	0	35
0	sid-2	1	0	control	0	0	40
0	sid-3	0	0	control	1	0	32
.
.
.
0	sid-98	1	0	PD	1	0	55
0	sid-99	0	0	PD	0	1	66
0	sid-100	1	0	PD	0	0	58
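
A quick header check before launching the pipeline can catch a missing Plink-style column early. The following is a minimal sketch, not part of the pipeline: the function name `missing_covariate_columns` is hypothetical, and treating IID as required alongside #FID and PHENO is an assumption based on the example above.

```python
import csv
import io

# #FID and PHENO are required per the note above; IID is assumed to be required too.
REQUIRED_COVARIATE_COLUMNS = ("#FID", "IID", "PHENO")


def missing_covariate_columns(handle):
    """Return the required columns that are absent from a tab-delimited header."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, [])  # an empty file yields an empty header
    return [name for name in REQUIRED_COVARIATE_COLUMNS if name not in header]


# Two-line stand-in for a covariates file, mirroring the example above.
example = (
    "#FID\tIID\tSEX\tPHENO\tstudy_arm\tapoe4\tlevodopa_usage\tage_at_baseline\n"
    "0\tsid-1\t1\t0\tcontrol\t0\t0\t35\n"
)
print(missing_covariate_columns(io.StringIO(example)))
```

For a real file, pass an open file handle instead of the `io.StringIO` stand-in, for example `open(path, newline="")`.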

Input Phenotype / Outcomes

Phenotypes and measured outcomes are passed into the pipeline via a tab-delimited file (*.tsv).

For cross-sectional analyses, the pipeline expects a minimum of two columns in the following format:

IID	y
sid-1	1
sid-2	0
sid-3	1
.
.
.
sid-98	0
sid-99	0
sid-100	1

For longitudinal analyses, the input phenotype file must contain a column specifying the days passed since the start of the study, as shown here under study_days:

IID	y	study_days
sid-414	95.2206895626295	0.0
sid-414	102.30085429524436	182.625
sid-414	114.79879923749795	365.25
.
.
.
sid-204	81.12926520637295	730.5
sid-204	104.79350859619713	1095.75
sid-204	91.52620016527872	1461.0
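
If your source data records visit dates rather than elapsed days, the study_days value has to be derived first. The sketch below shows one way to do that, assuming each visit has a calendar date and the study has a single known start date. The function name `days_since_start` is hypothetical, and rejecting visits before the start date is an assumption rather than a documented rule of the pipeline.

```python
from datetime import date


def days_since_start(visit: date, study_start: date) -> float:
    """Return the number of whole days between the study start and a visit."""
    elapsed = (visit - study_start).days
    if elapsed < 0:
        # Assumption: a visit before the study start indicates a data error.
        raise ValueError("visit date precedes study start")
    return float(elapsed)


# Baseline visit and a follow-up roughly six months later.
print(days_since_start(date(2021, 1, 1), date(2021, 1, 1)))
print(days_since_start(date(2021, 7, 1), date(2021, 1, 1)))
```

This yields whole-day counts. The fractional values in the example above (such as 182.625) suggest a different derivation, so match whatever convention your study already uses.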

For survival analyses, in addition to the time-to-event column (study_days), the file must contain a column specifying whether the event outcome was reached (0/1), as shown here under surv_y:

IID	y	study_days	surv_y	tstart	tend
sid-1	115.97392441028578	730.5	1	0	730.5
sid-10	86.8996551417189	182.625	1	0	182.625
sid-100	150.4507827126939	2556.75	1	0	2556.75
.
.
.
sid-103	66.2393727266641	365.25	1	0	365.25
sid-104	137.98518827246627	730.5	1	0	730.5
sid-105	102.64954194365411	1826.25	1	0	1826.25
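
The event indicator can be checked before running a survival analysis. This is a minimal sketch under stated assumptions: the function name `invalid_survival_rows` is hypothetical, the column names are taken from the example above, and only the 0/1 rule for surv_y is checked because it is the only constraint stated here.

```python
import csv
import io


def invalid_survival_rows(handle):
    """Return (line_number, IID) pairs for rows whose surv_y is not 0 or 1."""
    problems = []
    reader = csv.DictReader(handle, delimiter="\t")
    # Data rows start on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        if row["surv_y"] not in ("0", "1"):
            problems.append((line_number, row["IID"]))
    return problems


# Stand-in for a survival phenotype file; the second data row is deliberately invalid.
example = (
    "IID\ty\tstudy_days\tsurv_y\ttstart\ttend\n"
    "sid-1\t115.97\t730.5\t1\t0\t730.5\n"
    "sid-10\t86.90\t182.625\t2\t0\t182.625\n"
)
print(invalid_survival_rows(io.StringIO(example)))
```

An empty list means every row carries a valid event indicator.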