Basic usage

hashFrag supports 3 primary use cases.

If you already have train/test data splits

Existing train-test data splits can be handled by providing two separate FASTA files to hashFrag as input. This limits the homology search process to inter-data split comparisons. Specifically, a BLAST database is constructed over the train sequences and the test sequences are queried against this database to identify pairs with high local alignment scores.
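The train-versus-test search described above can be approximated by hand with the standard BLAST+ command-line tools. The sketch below only assembles the argument lists for `makeblastdb` and `blastn`; the helper names, file names, and the choice of a tabular output with raw scores are assumptions for illustration, and the exact invocation hashFrag uses internally may differ.

```python
import shlex

def makeblastdb_cmd(train_fasta, db_label):
    """Argument list to build a nucleotide BLAST database over the train split."""
    return ["makeblastdb", "-in", train_fasta, "-dbtype", "nucl", "-out", db_label]

def blastn_cmd(test_fasta, db_label, out_path, word_size=11, reward=1,
               penalty=-1, gapopen=2, gapextend=1, max_target_seqs=500,
               evalue=10.0, threads=1):
    """Argument list to query the test split against the train database.

    Defaults mirror the defaults listed in the hashFrag argument tables.
    """
    return [
        "blastn", "-query", test_fasta, "-db", db_label, "-out", out_path,
        # Tabular output: query id, subject id, raw alignment score (assumed layout).
        "-outfmt", "6 qseqid sseqid score",
        "-word_size", str(word_size),
        "-reward", str(reward), "-penalty", str(penalty),
        "-gapopen", str(gapopen), "-gapextend", str(gapextend),
        "-max_target_seqs", str(max_target_seqs),
        "-evalue", str(evalue), "-num_threads", str(threads),
    ]

if __name__ == "__main__":
    # Hypothetical file names; nothing is executed, the commands are only printed.
    print(shlex.join(makeblastdb_cmd("example_train_split.fa", "train_db")))
    print(shlex.join(blastn_cmd("example_test_split.fa", "train_db", "hits.tsv")))
```

The printed commands can be inspected before being run, for example with `subprocess.run(cmd, check=True)`, on a machine where BLAST+ is installed.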

Filter existing splits: remove test sequences exhibiting homology to the training set
Stratify existing splits: stratify the test split based on homology to the training set

If you want to create new splits

When a single FASTA file is provided as input, hashFrag characterizes homology for all pairwise comparisons. This involves constructing a BLAST database over all sequences in the population and then querying each sequence against the database.

Create orthogonal, homology-aware train-test splits

Example dataset

Example usage of hashFrag is provided below on an example dataset composed of 10,000 sequences (each 200 base pairs in length).

Note that the filter_existing_splits and create_orthogonal_splits pipelines require specification of a pairwise alignment score threshold to define homology between sequences.

If users do not have a predefined threshold for homology, we recommend computing pairwise alignment scores between a set of random (e.g., dinucleotide shuffled) genomic sequences, and then defining a threshold above the distribution of values observed.
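As a rough illustration of this recommendation, the sketch below picks a threshold that sits strictly above every alignment score observed between random sequences. The function name, the `margin` parameter, and the example scores are hypothetical; the scores themselves would come from a separate alignment run over shuffled sequences.

```python
import math

def threshold_from_null(null_scores, margin=1.0):
    """Return an integer threshold strictly above the largest null score.

    null_scores: alignment scores computed between random (e.g., shuffled)
                 sequences, which are assumed to share no true homology.
    margin:      hypothetical safety margin added on top of the maximum.
    """
    if not null_scores:
        raise ValueError("need at least one null score")
    return math.ceil(max(null_scores) + margin)

if __name__ == "__main__":
    # Made-up scores standing in for a real null distribution.
    null_scores = [18, 22, 25, 19, 31, 27]
    print(threshold_from_null(null_scores))
```

A high percentile of the null distribution could be used in place of the maximum if a few outliers are acceptable.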

Defining homology

Defining homology in terms of alignment scores requires the specification of scoring parameters (e.g., penalty, gapopen, gapextend, and reward values).

To remain consistent with BLAST scoring parameter specification, hashFrag expects a negative value for penalty but a positive value for gapopen and gapextend arguments (gap penalties will be subtracted from the alignment score during calculation). A positive value is expected for reward.
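The sign convention can be made concrete with a small scoring function. This is a minimal sketch, not hashFrag's implementation: it scores an alignment that has already been computed (gaps written as `-`), using BLAST-style affine gap costs in which a gap of length k costs gapopen + k * gapextend. The function name and example strings are hypothetical.

```python
def alignment_score(aligned_a, aligned_b, reward=1, penalty=-1,
                    gapopen=2, gapextend=1):
    """Score a pre-computed pairwise alignment.

    reward is positive and added per match; penalty is negative and added
    per mismatch; gapopen and gapextend are positive and subtracted.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    gap_side = None  # which sequence currently carries the open gap, if any
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            side = "a" if a == "-" else "b"
            if side != gap_side:
                score -= gapopen      # a new gap starts here
            score -= gapextend        # every gap position pays the extension cost
            gap_side = side
        else:
            score += reward if a == b else penalty
            gap_side = None
    return score

if __name__ == "__main__":
    # 6 matches, 1 mismatch, 1 gap of length 1: 6*1 + (-1) - (2 + 1) = 2
    print(alignment_score("ACGTACGT", "ACG-ACTT"))
```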

Changing these parameters can drastically change which sequence pairs are identified as homologous. Please see permissible scoring parameter combinations for the BLASTn algorithm here (Table D1).

hashFrag syntax

By default, hashFrag generates reverse complement sequences when creating the BLAST database. This ensures that query sequences are assessed for homology in both sequence orientations. If the input FASTA files already contain both forward and reverse sequence orientations, include the --skip-revcomp argument to skip this step.

Currently, hashFrag expects reverse complementary sequences to be denoted with a _Reversed suffix in the sequence header.

If both sequence orientations already exist in the input FASTA file(s), make sure one of the orientations is denoted with the _Reversed suffix. For example, if a sequence has a header such as seq_A, ensure that its reverse complement has the header seq_A_Reversed.

Note that a warning will be generated if sequences with the _Reversed suffix already exist and the --skip-revcomp argument is NOT specified.
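For users preparing both orientations themselves before passing --skip-revcomp, the header convention can be sketched as follows. The helper names and the in-memory dictionary of records are assumptions for illustration; only the _Reversed suffix comes from the description above.

```python
# Translation table for complementing DNA; other characters (e.g., N) pass through.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def with_reverse_complements(records, suffix="_Reversed"):
    """Add a '<header>_Reversed' entry for each forward record.

    records: dict mapping FASTA header to sequence.
    Entries that already carry the suffix are kept and never overwritten.
    """
    out = dict(records)
    for header, seq in records.items():
        if not header.endswith(suffix):
            out.setdefault(header + suffix, reverse_complement(seq))
    return out

if __name__ == "__main__":
    for header, seq in with_reverse_complements({"seq_A": "AACG"}).items():
        print(f">{header}\n{seq}")
```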

hashFrag commands

Filter existing splits

Remove sequences from the test split that exhibit homology with any sequence in the train split.

hashFrag filter_existing_splits \
--train-fasta-path example_train_split.fa \
--test-fasta-path example_test_split.fa \
-t 60 \
--skip-revcomp \
-o filter_existing_splits.work
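Conceptually, the filtering step reduces to dropping every test sequence that has at least one train hit at or above the threshold. The sketch below illustrates that logic on in-memory data; the function name, the tuple layout of `hits`, and the use of `>=` rather than `>` are assumptions and may not match hashFrag's internals.

```python
def filter_test_ids(test_ids, hits, threshold):
    """Return the test sequence ids with no train hit at or above threshold.

    test_ids:  iterable of test sequence ids, in the order to preserve.
    hits:      iterable of (test_id, train_id, alignment_score) tuples.
    threshold: alignment score at or above which a pair counts as homologous
               (assumed inclusive).
    """
    leaked = {test_id for test_id, _train_id, score in hits if score >= threshold}
    return [t for t in test_ids if t not in leaked]

if __name__ == "__main__":
    # Made-up hits for illustration.
    hits = [("test_1", "train_7", 85), ("test_2", "train_3", 41),
            ("test_3", "train_7", 60)]
    print(filter_test_ids(["test_1", "test_2", "test_3", "test_4"], hits, 60))
```

Test sequences with no reported hit at all (such as `test_4` here) are retained, since nothing links them to the train split.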

Full table of arguments

Argument	Description	Expected input
`--train-fasta-path`	Input file containing train split sequences.	FASTA file path (unzipped)
`--test-fasta-path`	Input file containing test split sequences.	FASTA file path (unzipped)
`-w`, `--word-size`	Length of exact match to initialize alignment score calculation (`blastn_module`).	integer (Default: 11)
`-g`, `--gapopen`	Penalty for opening a gap in the alignment (`blastn_module`).	positive integer (Default: 2)
`-x`, `--gapextend`	Penalty for extending an existing gap in the alignment (`blastn_module`).	positive integer (Default: 1)
`-p`, `--penalty`	Nucleotide mismatch penalty (`blastn_module`).	negative integer (Default: -1)
`-r`, `--reward`	Nucleotide match reward (`blastn_module`).	positive integer (Default: 1)
`-m`, `--max-target-seqs`	Maximum number of target sequences that can be returned per query sequence (`blastn_module`).	positive integer (Default: 500)
`--exec-makeblastdb-only`	Only run the makeblastdb command (`blastn_module`).	Boolean (Default: False, set to True when specified)
`--skip-revcomp`	Do not generate reverse complement of sequences comprising the BLAST database (`blastn_module`).	Boolean (Default: False, generated if not skipped)
`--xdrop-ungap`	X-drop threshold (heuristic value in bits) for ungapped alignment extension (`blastn_module`).	real number (Default: 20)
`--xdrop-gap`	X-drop threshold (heuristic value in bits) for gapped alignment extension (`blastn_module`).	real number (Default: 30)
`--xdrop-gap-final`	X-drop threshold (heuristic value in bits) for final alignment extension (`blastn_module`).	real number (Default: 100)
`-e`, `--e-value`	Expect value (E-value) threshold required to report a sequence as a match (`blastn_module`).	real number (Default: 10.0)
`-d`, `--dust`	Filter for low-complexity (i.e., repetitive) regions (`blastn_module`).	Permissible values: {‘yes’, ‘no’} (Default: ‘no’)
`--blastdb-label`	Label for the BLAST database (`blastn_module`).	string (Default: None)
`-T`, `--threads`	Number of threads to use for `blastn_module` execution.	positive integer (Default: 1)
`-t`, `--threshold`	Alignment score threshold to define a pair of sequences as similar, or homologous (`filter_candidates_module`).	all real numbers (Required)
`--force`	Force overwrite existing `blastn_module` output files.	Boolean (Default: False, set to True when specified)
`-o`, `--output-dir`	Directory to write intermediate results.	string (Default: ‘.’)

Stratify test split

Stratify the test split sequences into an arbitrary number of levels based on their maximum alignment scores to the train split sequences.

hashFrag stratify_test_split \
--train-fasta-path example_train_split.fa \
--test-fasta-path example_test_split.fa \
--skip-revcomp \
-o stratify_test_split.work

Stratification can be useful to better understand a model’s behaviour over test sequences at varying levels of orthogonality to the sequences the model was trained on.
Note that the stratified levels will not necessarily be balanced in size.
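The binning idea behind stratification can be sketched as follows: each test sequence is assigned to a level according to its maximum alignment score against the train split, with each level spanning `step` score units. The function name and the half-open bin boundaries are assumptions for illustration and may differ from how hashFrag defines its levels.

```python
from collections import defaultdict

def stratify_by_max_score(max_scores, step=10):
    """Group sequence ids into score ranges of width `step`.

    max_scores: dict mapping test sequence id to its maximum alignment
                score against any train sequence.
    Returns a dict mapping (lower, upper) to the ids whose score falls in
    the half-open range [lower, upper), ordered by increasing lower bound.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    levels = defaultdict(list)
    for seq_id, score in max_scores.items():
        lower = int(score // step) * step
        levels[(lower, lower + step)].append(seq_id)
    return dict(sorted(levels.items()))

if __name__ == "__main__":
    # Made-up maximum scores for four test sequences.
    scores = {"t1": 12, "t2": 18, "t3": 47, "t4": 5}
    for (lower, upper), ids in stratify_by_max_score(scores).items():
        print(f"[{lower}, {upper}): {ids}")
```

As the example shows, some ranges can be empty or hold a single sequence, which is why level sizes are generally unbalanced.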

Full table of arguments

Argument	Description	Expected input
`--train-fasta-path`	Input file containing train split sequences.	FASTA file path (unzipped)
`--test-fasta-path`	Input file containing test split sequences.	FASTA file path (unzipped)
`-w`, `--word-size`	Length of exact match to initialize alignment score calculation (`blastn_module`).	integer (Default: 11)
`-g`, `--gapopen`	Penalty for opening a gap in the alignment (`blastn_module`).	positive integer (Default: 2)
`-x`, `--gapextend`	Penalty for extending an existing gap in the alignment (`blastn_module`).	positive integer (Default: 1)
`-p`, `--penalty`	Nucleotide mismatch penalty (`blastn_module`).	negative integer (Default: -1)
`-r`, `--reward`	Nucleotide match reward (`blastn_module`).	positive integer (Default: 1)
`-m`, `--max-target-seqs`	Maximum number of target sequences that can be returned per query sequence (`blastn_module`).	positive integer (Default: 500)
`--exec-makeblastdb-only`	Only run the makeblastdb command (`blastn_module`).	Boolean (Default: False, set to True when specified)
`--skip-revcomp`	Do not generate reverse complement of sequences comprising the BLAST database (`blastn_module`).	Boolean (Default: False, generated if not skipped)
`--xdrop-ungap`	X-drop threshold (heuristic value in bits) for ungapped alignment extension (`blastn_module`).	real number (Default: 20)
`--xdrop-gap`	X-drop threshold (heuristic value in bits) for gapped alignment extension (`blastn_module`).	real number (Default: 30)
`--xdrop-gap-final`	X-drop threshold (heuristic value in bits) for final alignment extension (`blastn_module`).	real number (Default: 100)
`-e`, `--e-value`	Expect value (E-value) threshold required to report a sequence as a match (`blastn_module`).	real number (Default: 10.0)
`-d`, `--dust`	Filter for low-complexity (i.e., repetitive) regions (`blastn_module`).	Permissible values: {‘yes’, ‘no’} (Default: ‘no’)
`--blastdb-label`	Label for the BLAST database (`blastn_module`).	string (Default: None)
`-T`, `--threads`	Number of threads to use for `blastn_module` execution.	positive integer (Default: 1)
`-s`, `--step`	Width of the alignment score range covered by each stratified level (`stratify_test_split_module`).	positive integer (Default: 10)
`--force`	Force overwrite existing `blastn_module` output files.	Boolean (Default: False)
`-o`, `--output-dir`	Directory to write the created train-test splits.	string (Default: ‘.’)

Creating orthogonal splits

Create homology-aware (i.e., orthogonal) train-test data splits.

hashFrag create_orthogonal_splits \
-f example_full_dataset.fa \
-t 60 \
--skip-revcomp \
-o create_orthogonal_splits.work

The creation of orthogonal train-test splits involves determining disjoint sets of sequences with respect to homology. This is accomplished using the union-find data structure (also referred to as disjoint-set or merge-find set), which groups sequences into homology clusters. Splits in the requested proportions can then be created by assigning whole clusters to either the train or the test split, so that no homologous pair is divided between them (i.e., no leakage).
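The clustering idea can be illustrated with a compact union-find. This is a minimal sketch under stated assumptions, not hashFrag's implementation: the class, the function name, and the greedy assignment of shuffled clusters to the test split are all hypothetical, and the resulting test proportion is only approximate because clusters are never divided.

```python
import random

class UnionFind:
    """Disjoint-set structure over hashable items, with path compression."""

    def __init__(self, items):
        self.parent = {x: x for x in items}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

def orthogonal_split(seq_ids, homologous_pairs, p_test=0.2, seed=21):
    """Split ids into (train, test) without separating any homologous pair."""
    uf = UnionFind(seq_ids)
    for a, b in homologous_pairs:
        uf.union(a, b)

    # Collect the homology clusters (connected components).
    clusters = {}
    for seq_id in seq_ids:
        clusters.setdefault(uf.find(seq_id), []).append(seq_id)
    groups = list(clusters.values())
    random.Random(seed).shuffle(groups)

    # Greedily fill the test split with whole clusters up to the target size.
    target = round(p_test * len(seq_ids))
    train, test = [], []
    for group in groups:
        if len(test) + len(group) <= target:
            test.extend(group)
        else:
            train.extend(group)
    return train, test

if __name__ == "__main__":
    ids = [f"s{i}" for i in range(10)]
    pairs = [("s0", "s1"), ("s1", "s2"), ("s5", "s6")]  # made-up homologous pairs
    train, test = orthogonal_split(ids, pairs)
    print("train:", sorted(train))
    print("test:", sorted(test))
```

Because homology is treated transitively here, `s0` and `s2` end up in the same cluster even though only `s0`-`s1` and `s1`-`s2` were reported as pairs.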

Full table of arguments

Argument	Description	Expected input
`-f`, `--fasta-path`	Input file containing all sequences in the dataset.	FASTA file path (unzipped)
`-w`, `--word-size`	Length of exact match to initialize alignment score calculation (`blastn_module`).	integer (Default: 11)
`-g`, `--gapopen`	Penalty for opening a gap in the alignment (`blastn_module`).	positive integer (Default: 2)
`-x`, `--gapextend`	Penalty for extending an existing gap in the alignment (`blastn_module`).	positive integer (Default: 1)
`-p`, `--penalty`	Nucleotide mismatch penalty (`blastn_module`).	negative integer (Default: -1)
`-r`, `--reward`	Nucleotide match reward (`blastn_module`).	positive integer (Default: 1)
`-m`, `--max-target-seqs`	Maximum number of target sequences that can be returned per query sequence (`blastn_module`).	positive integer (Default: 500)
`--exec-makeblastdb-only`	Only run the makeblastdb command (`blastn_module`).	Boolean (Default: False, set to True when specified)
`--skip-revcomp`	Do not generate reverse complement of sequences comprising the BLAST database (`blastn_module`).	Boolean (Default: False, generated if not skipped)
`--xdrop-ungap`	X-drop threshold (heuristic value in bits) for ungapped alignment extension (`blastn_module`).	real number (Default: 20)
`--xdrop-gap`	X-drop threshold (heuristic value in bits) for gapped alignment extension (`blastn_module`).	real number (Default: 30)
`--xdrop-gap-final`	X-drop threshold (heuristic value in bits) for final alignment extension (`blastn_module`).	real number (Default: 100)
`-e`, `--e-value`	Expect value (E-value) threshold required to report a sequence as a match (`blastn_module`).	real number (Default: 10.0)
`-d`, `--dust`	Filter for low-complexity (i.e., repetitive) regions (`blastn_module`).	Permissible values: {‘yes’, ‘no’} (Default: ‘no’)
`--blastdb-label`	Label for the BLAST database (`blastn_module`).	string (Default: None)
`-T`, `--threads`	Number of threads to use for `blastn_module` execution.	positive integer (Default: 1)
`-t`, `--threshold`	Alignment score threshold to define a pair of sequences as similar, or homologous (`filter_candidates_module`).	all real numbers (Required)
`--p-train`	Proportion of sequences for the newly-created train data split (`create_orthogonal_splits_module`).	float (Default: 0.8)
`--p-test`	Proportion of sequences for the newly-created test data split (`create_orthogonal_splits_module`).	float (Default: 0.2)
`-n`, `--n-splits`	Number of train-test split replicates to create (`create_orthogonal_splits_module`).	positive integer (Default: 1)
`-s`, `--seed`	Random seed for creation of homology-aware train-test splits (`create_orthogonal_splits_module`).	positive integer (Default: 21)
`--force`	Force overwrite existing `blastn_module` output files.	Boolean (Default: False)
`-o`, `--output-dir`	Directory to write the created train-test splits.	string (Default: ‘.’)