Pipeline to produce dataset of pairs low and high temperature proteins

Getting started

Environment

Create and activate the environment specified in environment.yml

conda env create --file environment.yml conda activate learn2thermDB pip install .

Ensure that the following are set in order access full functionality

ENV_EMAIL - The email will be used to access NCBI FTP in s0.0 and Bacdive in s0.1
NCBI_API_KEY - API key from NCBI. Needed to get 16s sequences in s0.0. See here
LOGLEVEL (optional) - Specified logging level to run the package. eg 'INFO' or 'DEBUG'
FATCAT_EXEC - Necessary for struvtura alignment steps of validation, s2.11 and s2.12

Config

DVC

Configure DVC as desired. This is not required, however here are a few recommendations:

Add a remote data repository
Use hardlinking instead of copying to the DVC cache to increase speed. See here

Parameters

params.yaml contains all of the tunable parameters used to run the pipeline. Modifying these parameters will product a different result than in the presented paper. Note that DVC tracks input and output states, thus by changing a parameter, DVC whill know which stages need to be run. The parameters are discussed in the context of their pipeline stages in ./docs/compspec/pipeline_components.md. The params file itself has comments indicating what the parameter does.

Dask

Dask configuration is required for s1.3. This script stands up a Dask cluster of workers to conduct protein alignment. Each worker runs alignment between a one taxa pair at a time. In order to conduct this step, .config/dask/jobqueue.yaml must be updated. The pipeline was initially ran using slurm, and the config file is provided, however names and accounts will necessarily need to be changed for your cluster. If using a distributed scheduler other than SLURM, it must be supported by dask and the appropriate configurations made. See here.

If the pipeline is erroring out at this stage, it is likely an issue with the cluster configuration. Common issues experienced are workers not being able to find executables. Ensure that the workers have appropriate environment setups. If using SLURM, the existing config file is a working example: note the sourceing of bashrc, environment activation, and environment variable exports in the job preludes/directives.

Execution

Data Version Control (DVC) is used to track data, parameters, metrics, and execution pipelines.

To use a DVC remote, see the the documentation.

DVC tracked data, metrics, and models are found in ./data while scripts and parameters can be found in ./pipeline. To execute pipeline steps, run dvc repro <stage-name> where stages are listed below, and details on stagess can be found in ./docs/compspec/pipeline_components.md:

get_raw_data_taxa
get_raw_data_proteins
get_proteome_mdata
parse_proteins
label_taxa
get_16s_blast_scores
label_all_pairs
protein_alignment
make_database
get_hait_pairs
compare_to_Tm
run_hait_alignment
compare_hait_alignment
get_HMM_profiles
hmmer_hait
run_hmmer
parse_hmmer_result
compare_hait_hmmer
sample_data_for_structure
structure_hait
structure_l2t

Dependancies between stages is shown below:

flowchart TD node1["../../analysis_pipeline/dvc.yaml:chosen_protein_search_space"] node2["../../analysis_pipeline/dvc.yaml:filtered_protein_search_space"] node3["../../analysis_pipeline/dvc.yaml:full_protein_search_space"] node4["../../dvc.yaml:compare_hait_alignment"] node5["../../dvc.yaml:compare_hait_hmmer"] node6["../../dvc.yaml:compare_to_Tm"] node7["../../dvc.yaml:get_16s_blast_scores"] node8["../../dvc.yaml:get_HMM_profiles"] node9["../../dvc.yaml:get_hait_pairs"] node10["../../dvc.yaml:get_proteome_mdata"] node11["../../dvc.yaml:get_raw_data_proteins"] node12["../../dvc.yaml:get_raw_data_taxa"] node13["../../dvc.yaml:hmmer_hait"] node14["../../dvc.yaml:label_all_pairs"] node15["../../dvc.yaml:label_taxa"] node16["../../dvc.yaml:make_database"] node17["../../dvc.yaml:parse_hmmer_result"] node18["../../dvc.yaml:parse_proteins"] node19["../../dvc.yaml:protein_alignment"] node20["../../dvc.yaml:run_hait_alignment"] node21["../../dvc.yaml:run_hmmer"] node22["../../dvc.yaml:sample_data_for_structure"] node23["../../dvc.yaml:structure_hait"] node24["../../dvc.yaml:structure_l2t"] node7-->node2 node7-->node14 node7-->node16 node7-->node19 node8-->node13 node8-->node21 node9-->node13 node9-->node20 node9-->node23 node10-->node18 node11-->node18 node12-->node3 node12-->node7 node12-->node15 node12-->node16 node12-->node18 node14-->node2 node14-->node16 node14-->node19 node15-->node1 node15-->node7 node15-->node16 node16-->node4 node16-->node5 node16-->node6 node16-->node21 node16-->node22 node17-->node5 node18-->node1 node18-->node2 node18-->node3 node18-->node7 node18-->node16 node18-->node19 node19-->node16 node19-->node17 node20-->node4 node21-->node17 node22-->node24

Note that script execution is expected to occur with the top level as the current working directory, and paths are specified with respect to the repo top level.

The entire pipeline can be run in a single command using dvc exp run. Note however that many steps are resource intensive. At least 8 cores and 60GB of RAM is recommended. For s1.3, Dask must be configured. This stage requires a distributed computing cluster.

Python package

Installable, importable code used in the pipeline is found in learn2therm and should be installed given the above steps in the Environment section.

Name		Name	Last commit message	Last commit date
Latest commit History 432 Commits
.config/dask		.config/dask
.dvc		.dvc
analysis_pipeline		analysis_pipeline
data		data
docs/compspec		docs/compspec
external		external
learn2therm		learn2therm
logs		logs
notebooks		notebooks
pipeline		pipeline
plotting		plotting
testing		testing
tmp		tmp
.dvcignore		.dvcignore
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
dvc.lock		dvc.lock
dvc.yaml		dvc.yaml
environment.yml		environment.yml
params.yaml		params.yaml
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Pipeline to produce dataset of pairs low and high temperature proteins

Getting started

Environment

Config

DVC

Parameters

Dask

Execution

Python package

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Pipeline to produce dataset of pairs low and high temperature proteins

Getting started

Environment

Config

DVC

Parameters

Dask

Execution

Python package

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages