Input File Format

Input Genotype

The pipeline requires genotype data as VCF files, either uncompressed (.vcf) or compressed (.vcf.gz).

If you are working with imputed genotype data, set the optional parameter --r2thres to filter out poorly imputed variants.
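
To illustrate what an imputation-quality cutoff does, the sketch below keeps only VCF records whose INFO column carries an R2 value at or above a threshold. It is not the pipeline's own implementation of --r2thres. It assumes the imputation tool writes quality as an R2= entry in INFO, that records without that entry should be kept, and that the comparison is inclusive; the function names are hypothetical.

```python
import gzip


def open_vcf(path):
    """Open a plain or compressed VCF in text mode, based on the file extension."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def passes_r2(vcf_line, threshold):
    """Return True if the record's INFO R2 value is >= threshold.

    Assumption: records without an R2 entry are kept unchanged.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    info = fields[7]  # INFO is the eighth fixed VCF column
    for entry in info.split(";"):
        if entry.startswith("R2="):
            return float(entry[len("R2="):]) >= threshold
    return True


def filter_vcf_lines(lines, threshold):
    """Yield header lines unchanged and data lines that pass the R2 cutoff."""
    for line in lines:
        if line.startswith("#") or passes_r2(line, threshold):
            yield line
```

With a file on disk, the lines could be streamed with open_vcf(path) and passed to filter_vcf_lines together with the chosen threshold.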

Input Covariates

Per-subject covariates for the model are provided in a tab-delimited file (*.tsv).

For both cross-sectional and longitudinal analyses, the pipeline expects covariates in the format below.

Note: the Plink-style columns #FID and PHENO must be present, but they can be populated with 0.

#FID	IID	SEX	PHENO	study_arm	apoe4	levodopa_usage	age_at_baseline
0	sid-1	1	0	control	0	0	35
0	sid-2	1	0	control	0	0	40
0	sid-3	0	0	control	1	0	32
.
.
.
0	sid-98	1	0	PD	1	0	55
0	sid-99	0	0	PD	0	1	66
0	sid-100	1	0	PD	0	0	58
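
A covariates file in this layout can be produced with Python's standard csv module. The sketch below is one possible way to do it, not part of the pipeline: the function name covariates_tsv and the dictionary-based input are hypothetical, and the column list is copied from the example above. The #FID and PHENO columns default to 0 when the caller does not supply them.

```python
import csv
import io

# Column order taken from the example covariates table.
COLUMNS = ["#FID", "IID", "SEX", "PHENO", "study_arm",
           "apoe4", "levodopa_usage", "age_at_baseline"]


def covariates_tsv(subjects):
    """Render a list of per-subject dicts as tab-delimited text.

    #FID and PHENO are filled with 0 unless a subject dict overrides them.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for subject in subjects:
        row = {"#FID": 0, "PHENO": 0, **subject}
        writer.writerow([row[name] for name in COLUMNS])
    return buffer.getvalue()
```

The returned string can then be written to a *.tsv file with an ordinary open(path, "w") call.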

Input Phenotype / Outcomes

Phenotypes and measured outcomes are passed to the pipeline in a tab-delimited file (*.tsv).

For cross-sectional analysis, the pipeline expects at least two columns in the following format:

IID	y
sid-1	1
sid-2	0
sid-3	1
.
.
.
sid-98	0
sid-99	0
sid-100	1
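
Before running an analysis it can help to check that a phenotype file has the expected shape. The following is a minimal sketch of such a check, not a pipeline feature: the function name read_cross_sectional is hypothetical, the column names IID and y follow the example above, and the one-row-per-subject rule is an assumption about cross-sectional input.

```python
import csv
import io


def read_cross_sectional(text):
    """Parse tab-delimited phenotype text into a {IID: y} mapping.

    Raises ValueError if a required column is missing or an IID repeats
    (assumption: cross-sectional data has one row per subject).
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    missing = {"IID", "y"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    outcomes = {}
    for row in reader:
        if row["IID"] in outcomes:
            raise ValueError(f"duplicate IID: {row['IID']}")
        outcomes[row["IID"]] = float(row["y"])
    return outcomes
```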

For longitudinal analyses, the input phenotype file must also contain a column giving the number of days elapsed since the start of the study, as shown here under study_days:

IID	y	study_days
sid-414	95.2206895626295	0.0
sid-414	102.30085429524436	182.625
sid-414	114.79879923749795	365.25
.
.
.
sid-204	81.12926520637295	730.5
sid-204	104.79350859619713	1095.75
sid-204	91.52620016527872	1461.0
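
Longitudinal input has several rows per subject, one per visit. The sketch below groups rows by IID and orders each subject's visits by study_days, which makes it easy to inspect a file before analysis. It is illustrative only: group_visits is a hypothetical name, and rejecting negative study_days values is an assumption rather than a documented pipeline rule.

```python
import csv
import io
from collections import defaultdict


def group_visits(text):
    """Return {IID: [(study_days, y), ...]} with visits sorted by day."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    missing = {"IID", "y", "study_days"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    visits = defaultdict(list)
    for row in reader:
        day = float(row["study_days"])
        if day < 0:  # assumption: days are counted forward from study start
            raise ValueError(f"negative study_days for {row['IID']}")
        visits[row["IID"]].append((day, float(row["y"])))
    # Sorting tuples orders each subject's visits by day first.
    return {iid: sorted(rows) for iid, rows in visits.items()}
```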

For survival analyses, in addition to the time-to-event column (study_days), the file must contain a column indicating whether the event outcome was reached (0/1), as shown here under surv_y:

IID	y	study_days	surv_y	tstart	tend
sid-1	115.97392441028578	730.5	1	0	730.5
sid-10	86.8996551417189	182.625	1	0	182.625
sid-100	150.4507827126939	2556.75	1	0	2556.75
.
.
.
sid-103	66.2393727266641	365.25	1	0	365.25
sid-104	137.98518827246627	730.5	1	0	730.5
sid-105	102.64954194365411	1826.25	1	0	1826.25
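
A survival file can be checked in the same spirit. The sketch below reads the time-to-event and event-indicator columns and rejects any surv_y value other than 0 or 1. The name read_survival is hypothetical, the reading of 1 as "event reached" is an assumption, and the other columns in the example (y, tstart, tend) are passed over without interpretation.

```python
import csv
import io


def read_survival(text):
    """Return a list of (IID, study_days, event_reached) tuples.

    Assumption: surv_y == 1 means the event was reached, 0 means it was not.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    missing = {"IID", "study_days", "surv_y"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    records = []
    for row in reader:
        if row["surv_y"] not in ("0", "1"):
            raise ValueError(f"surv_y must be 0 or 1 for {row['IID']}")
        records.append(
            (row["IID"], float(row["study_days"]), row["surv_y"] == "1")
        )
    return records
```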