Test-data Specifications
Specifications for writing nf-core test dataset files

The key words “MUST”, “MUST NOT”, “SHOULD”, etc. are to be interpreted as described in RFC 2119.

A new test data file within a branch (modules or pipelines) SHOULD NOT replicate existing test-data unless absolutely necessary.

If you need a new file that can be generated from an upstream file, generate it from a file that already exists in the repository.
For example, if you need a particular bioinformatic index file for a tool, index an existing FASTA file on the test-datasets branch.

Info

CI tests for nf-core modules, subworkflows, or pipelines are not required to produce meaningful output.

The main goal of nf-core CI tests is to ensure that a given tool ‘happily’ executes without errors.

It is OK for a test to produce nonsense output, or find ‘nothing’, as long as the tool does not crash or produce an error.

You SHOULD therefore reuse existing test data as far as possible to reduce the size of our test dataset repository.

You SHOULD only upload new test data if there is absolutely no other option within the existing test-data archive.

Test data SHOULD be as small as possible.

Test data MUST be publicly available and carry a license that allows public reuse.

Test data files SHOULD be described in the given branch’s README file, including their source, how they were generated, and their license.

In order to keep the size of the test data repository as minimal as possible, pre-existing files from nf-core/test-datasets MUST be reused if at all possible.
If the appropriate test data doesn’t exist in the modules branch of nf-core/test-datasets, please contact us on the nf-core Slack #modules channel (you can join with this invite) to discuss possible options.
It may not be possible to add test data for some modules e.g. if the input data is too large or requires a local database. In these scenarios, it is recommended to use the Nextflow stub feature to test the module. Please refer to the gtdbtk/classify module and its corresponding test script to understand how to use this feature for your module development.

Files SHOULD generally be organised according to the existing structure, typically (for bioinformatics pipelines) by discipline, organism, platform or format.

Downstream or related test-data files SHOULD be named based on the upstream file name.

For example, if you used genome.fasta as the upstream file, your output file should be called genome.<new_extension>.
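
The naming rule above can be sketched in a few lines of Python. This is only an illustration, not part of any nf-core tooling: the helper name `downstream_name` is hypothetical, and it assumes that only the final extension of the upstream file name is replaced.

```python
from pathlib import PurePosixPath


def downstream_name(upstream: str, new_extension: str) -> str:
    """Return the upstream file name with its final extension replaced.

    Hypothetical helper for illustration; it accepts the new extension
    with or without a leading dot.
    """
    ext = new_extension if new_extension.startswith(".") else f".{new_extension}"
    # with_suffix swaps only the last suffix, so directories are preserved.
    return str(PurePosixPath(upstream).with_suffix(ext))


if __name__ == "__main__":
    print(downstream_name("genome.fasta", "dict"))
```

Because only the last suffix is swapped, a compressed upstream file such as `genome.fasta.gz` would need extra handling if the new name should drop both extensions.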

Test data files MUST have an entry in the nf-core/test-datasets repo README.
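
One way to check the README requirement before opening a pull request is a small script like the following. It is a rough sketch under stated assumptions: the function name `files_missing_from_readme` is hypothetical, the README content and file names are passed in by the caller, and a file counts as documented if its name appears anywhere in the README text.

```python
def files_missing_from_readme(readme_text: str, file_names: list[str]) -> list[str]:
    """Return the file names that are not mentioned in the README text.

    Hypothetical helper for illustration; it uses a plain substring match,
    so it cannot tell a real entry from an incidental mention.
    """
    return [name for name in file_names if name not in readme_text]


if __name__ == "__main__":
    # Example inputs invented for this sketch.
    readme = "genome.fasta: reference genome\ngenome.dict: sequence dictionary\n"
    print(files_missing_from_readme(readme, ["genome.fasta", "genome.sizes"]))
```

The substring match is deliberately simple; a stricter check would parse the README structure, which depends on how the given branch formats its entries.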