IGVF Additional Guidance on Sharing Software, Models, and Intermediate Analyses

Last updated 2025-03-28

Software and Models

Developers of significant new IGVF-related software or models will make their programs, including source code and parameters, publicly available using a permissive open-source license. Examples include data processing pipelines; implementations of statistical, visualization, and modeling tools developed to process or analyze IGVF data; and predictive models.

What to release: IGVF requires the release of (1) analysis pipelines used for major IGVF products such as the IGVF Catalog; (2) software, computational tools, models, and pipelines used for major analyses in planned IGVF publications; and (3) other software likely to be useful to multiple groups, either within IGVF or in the broader community.

When to release: Decisions about when to release software, pipelines, and models should balance the benefit to the community against the labor involved in release and maintenance. Major software tools should be released as soon as they are sufficiently stable, and no later than the time they are first used in preprints or included in manuscript submissions.

How to release: Software, pipelines, and models should be released via version-controlled public repositories (e.g., GitHub, Kipoi). The entries in these repositories should be linked from the IGVF data portal. Software should be well documented, and there should be a contact person for questions. Software development should continue through version-controlled deposition, and the software version used to generate each dataset should be documented. The precise version used in any publication should also be archived, at the time of manuscript acceptance, in a permanent repository that issues a DOI (such as Zenodo), and the IGVF portal links should be updated accordingly. For additional guidance on scientific data management and stewardship of research software, see the FAIR Principles for Research Software (FAIR4RS Principles) and the NIH Best Practices for Sharing Research Software.
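
To make the documented "version used to generate each dataset" easy to report, one option is to write a small provenance record alongside each processed output at run time. The sketch below assumes a Python-based pipeline tracked in git; the pipeline name, version string, and output filename are hypothetical, and the DOI field would only be filled in once a release has been archived (e.g., on Zenodo).

```python
# Minimal sketch (hypothetical names): write a provenance record next to each
# processed dataset so the exact software version used to generate it is traceable.
import json
import subprocess

PIPELINE_NAME = "example-igvf-pipeline"  # hypothetical pipeline name
PIPELINE_VERSION = "1.2.0"               # the tagged release used for this run


def current_git_commit(repo_path="."):
    """Return the commit hash of the pipeline checkout used for this run."""
    result = subprocess.run(
        ["git", "-C", repo_path, "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def write_provenance(output_path="provenance.json"):
    """Write a small provenance record to deposit alongside the processed dataset."""
    record = {
        "software": PIPELINE_NAME,
        "version": PIPELINE_VERSION,
        "git_commit": current_git_commit(),
        "archive_doi": None,  # filled in once the release is archived, e.g. on Zenodo
    }
    with open(output_path, "w") as fh:
        json.dump(record, fh, indent=2)


if __name__ == "__main__":
    write_provenance()
```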

Accompanying publications: In addition to the release of well-documented code, we strongly encourage developers to publish citable descriptions of their software. We recommend that authors describe their software in methodological papers so that they can receive credit for their work. These can be published in conventional journals, and/or disseminated pre-publication through preprint servers (e.g. bioRxiv).

Good Software Design Practices: We recommend (1) using strict versioning and version control; (2) providing easy installation and compilation steps (e.g. via Docker/Singularity containers, Conda, Bioconductor packages, etc.); (3) including simple test input datasets with matched outputs, and small unit tests for software with multiple components; (4) including realistic “production” tests, possibly matching publication figures; (5) specifying any unusual hardware requirements (e.g. CPU, RAM, disk). Software packages should be downloadable and installable with all dependencies specified.
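
As an illustration of practice (3), a unit test can pair a tiny bundled input dataset with its matched expected output, so that any user or continuous-integration job can check an installation in seconds. The following is a minimal pytest sketch; normalize_counts is a hypothetical pipeline component used only for illustration.

```python
# Minimal sketch of practice (3): a tiny test input with its matched expected output.
# "normalize_counts" is a hypothetical pipeline component used only for illustration.
import pytest


def normalize_counts(counts):
    """Toy component: scale raw counts so they sum to 1.0."""
    total = sum(counts)
    if total == 0:
        raise ValueError("counts must not sum to zero")
    return [c / total for c in counts]


def test_normalize_counts_matches_expected_output():
    # Tiny input dataset shipped with the package, and its matched expected output.
    test_input = [2, 3, 5]
    expected_output = [0.2, 0.3, 0.5]
    assert normalize_counts(test_input) == pytest.approx(expected_output)


def test_normalize_counts_rejects_all_zero_input():
    with pytest.raises(ValueError):
        normalize_counts([0, 0, 0])
```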

Dissemination of More Complex Pipelines

For most complex analyses, multiple software components are routinely combined to generate intermediate datasets. To keep these results reproducible, analysts should document all software components used and their specific versions. We encourage (1) documenting these components (e.g. in a machine-readable manifest, as sketched below); (2) providing scripts that reproduce key figures in IGVF publications; (3) establishing reusable, publicly accessible, containerized analysis pipelines (e.g. Galaxy, virtual machines, Docker/Singularity, DNAnexus sessions); and (4) linking these through the DACC website.
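
One lightweight way to implement point (1) is to generate a machine-readable manifest of every component and its exact version at analysis time and deposit it with the intermediate results. The sketch below assumes a Python-driven analysis; the package and tool names (numpy, pandas, samtools) are placeholders for whatever components a given pipeline actually uses.

```python
# Sketch: capture the components and exact versions used in a multi-step analysis
# into a machine-readable manifest. The tool lists below are placeholders.
import json
import shutil
import subprocess
from importlib import metadata

PYTHON_PACKAGES = ["numpy", "pandas"]             # example Python dependencies
COMMAND_LINE_TOOLS = [("samtools", "--version")]  # example external tools


def collect_versions():
    manifest = {"python_packages": {}, "command_line_tools": {}}
    for pkg in PYTHON_PACKAGES:
        try:
            manifest["python_packages"][pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            manifest["python_packages"][pkg] = "not installed"
    for tool, version_flag in COMMAND_LINE_TOOLS:
        if shutil.which(tool) is None:
            manifest["command_line_tools"][tool] = "not installed"
            continue
        result = subprocess.run([tool, version_flag], capture_output=True, text=True)
        # Most tools print their version on the first line of stdout or stderr.
        text = (result.stdout or result.stderr).strip()
        manifest["command_line_tools"][tool] = text.splitlines()[0] if text else "unknown"
    return manifest


if __name__ == "__main__":
    with open("analysis_software_manifest.json", "w") as fh:
        json.dump(collect_versions(), fh, indent=2)
```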

Dissemination of Intermediate Data Analysis Results

All analysis results and data analysis products generated by the IGVF consortium that will be of broad use to the community must be registered at the IGVF data portal under unique accession numbers as soon as they are stable, and no later than the time of preprint or manuscript submission.

What to share with the DACC and release: Examples of types of analysis results to be released include (1) calls and/or predictions of elements and/or their activity; (2) calls and/or predictions of variant effects; and (3) other analyses that are significant elements of planned manuscripts. Analyses of IGVF-generated data must use uniformly processed data unless a justification is provided and approved by the DACC in advance.

When possible, analysis results should be released using standard IGVF file formats (to be agreed on by the consortium). Analyses comprising major IGVF deliverables (e.g. the Catalog) may be subject to pre-release vetting by consortium members.

Analyses should be released in an unrestricted manner via the IGVF data portal when they are free of personally identifiable information (to current standards). When IRB rules apply, analysis products should be released via controlled access (such as dbGaP) using the appropriate sharing mechanism.

Analyses should be accompanied by written documentation, ideally in the form of a publication or a preprint (e.g. bioRxiv), with a clearly specified contact author. The release should specify and provide links to: (1) the datasets used, (2) the software used and its specific version, and (3) the pipelines used and their specific versions.
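
As a concrete illustration of these three required links, a release could be accompanied by a small machine-readable record like the sketch below. All identifiers, names, and URLs in it are placeholders, and the exact fields would need to follow whatever metadata conventions the IGVF data portal adopts.

```python
# Sketch of a release record capturing the required links for an analysis product.
# Every identifier, name, and URL below is a placeholder, not a real IGVF entry.
import json

release_record = {
    "contact_author": "Jane Doe <jane.doe@example.org>",
    "documentation": None,  # DOI or preprint link, filled in once available
    "input_datasets": ["EXAMPLE-DATASET-ACCESSION-001", "EXAMPLE-DATASET-ACCESSION-002"],
    "software": [
        {"name": "example-variant-effect-caller", "version": "0.4.1",
         "repository": "https://example.org/example-variant-effect-caller"},
    ],
    "pipelines": [
        {"name": "example-uniform-processing-pipeline", "version": "2.0.0",
         "repository": "https://example.org/example-uniform-processing-pipeline"},
    ],
}

with open("analysis_release_record.json", "w") as fh:
    json.dump(release_record, fh, indent=2)
```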