BeckResearchLab/learn2thermDB

Pipeline to produce a dataset of pairs of low- and high-temperature proteins

Getting started

Environment

Create and activate the environment specified in environment.yml

conda env create --file environment.yml
conda activate learn2thermDB
pip install .

Ensure that the following environment variables are set in order to access full functionality:

  • ENV_EMAIL - The email address used to access NCBI FTP in s0.0 and BacDive in s0.1
  • NCBI_API_KEY - API key from NCBI. Needed to get 16S sequences in s0.0. See here
  • LOGLEVEL (optional) - Logging level used when running the package, e.g. 'INFO' or 'DEBUG'
  • FATCAT_EXEC - Necessary for the structural alignment steps of validation, s2.11 and s2.12
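Before starting a long run, it can help to confirm that these variables are visible to Python. The following is a minimal sketch, not part of the learn2therm package: the variable names come from the list above, while the helper `missing_env_vars` is a hypothetical name introduced here for illustration.

```python
import os

# Variable names taken from the list above; LOGLEVEL is optional and not checked.
REQUIRED_VARS = ("ENV_EMAIL", "NCBI_API_KEY", "FATCAT_EXEC")


def missing_env_vars(environ=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to try without touching the real environment.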

Config

DVC

Configure DVC as desired. This is not required; however, here are a few recommendations:

  • Add a remote data repository
  • Use hardlinking instead of copying to the DVC cache to increase speed. See here

Parameters

params.yaml contains all of the tunable parameters used to run the pipeline. Modifying these parameters will produce a different result than in the presented paper. Note that DVC tracks input and output states; thus, when a parameter changes, DVC will know which stages need to be rerun. The parameters are discussed in the context of their pipeline stages in ./docs/compspec/pipeline_components.md. The params file itself has comments indicating what each parameter does.

Dask

Dask configuration is required for s1.3. This script stands up a Dask cluster of workers to conduct protein alignment. Each worker runs alignment for one taxa pair at a time. In order to conduct this step, .config/dask/jobqueue.yaml must be updated. The pipeline was initially run using SLURM, and the config file is provided; however, names and accounts will need to be changed for your cluster. If using a distributed scheduler other than SLURM, it must be supported by Dask and the appropriate configurations made. See here.

If the pipeline errors out at this stage, it is likely an issue with the cluster configuration. A common issue is workers not being able to find executables. Ensure that the workers have appropriate environment setups. If using SLURM, the existing config file is a working example: note the sourcing of bashrc, environment activation, and environment variable exports in the job preludes/directives.
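One way to diagnose the missing-executable problem is to ask each environment which programs it can actually resolve on its PATH. The sketch below uses only the standard library; `missing_executables` is a hypothetical helper, and the executable names you pass in are your own choice, since this README does not list them.

```python
import shutil


def missing_executables(names):
    """Return the executable names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # Placeholder names for illustration; substitute the tools your stage calls.
    print(missing_executables(["python", "some-alignment-tool"]))
```

Assuming a `dask.distributed` client is already connected to the cluster, `client.run(missing_executables, names)` would run the same check on every worker and return the results keyed by worker address, which shows whether the job prelude set up the environment as intended.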

Execution

Data Version Control (DVC) is used to track data, parameters, metrics, and execution pipelines.

To use a DVC remote, see the documentation.

DVC-tracked data, metrics, and models are found in ./data, while scripts and parameters can be found in ./pipeline. To execute pipeline steps, run dvc repro <stage-name>. Details on the stages can be found in ./docs/compspec/pipeline_components.md. The stages are:

  • get_raw_data_taxa
  • get_raw_data_proteins
  • get_proteome_mdata
  • parse_proteins
  • label_taxa
  • get_16s_blast_scores
  • label_all_pairs
  • protein_alignment
  • make_database
  • get_hait_pairs
  • compare_to_Tm
  • run_hait_alignment
  • compare_hait_alignment
  • get_HMM_profiles
  • hmmer_hait
  • run_hmmer
  • parse_hmmer_result
  • compare_hait_hmmer
  • sample_data_for_structure
  • structure_hait
  • structure_l2t
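When driving stages from a script rather than the shell, the same `dvc repro <stage-name>` call can be wrapped with `subprocess`. This is a minimal sketch under a few assumptions: `dvc` is on the PATH, `repo_root` points at the repository top level, and the helper names `repro_command` and `run_stage` are hypothetical. It defaults to DVC's `--dry` flag, which reports what would run without executing it.

```python
import subprocess


def repro_command(stage, dry_run=True):
    """Build the argument list for `dvc repro`, optionally as a dry run."""
    command = ["dvc", "repro"]
    if dry_run:
        command.append("--dry")  # report what would run without executing it
    command.append(stage)
    return command


def run_stage(stage, repo_root=".", dry_run=True):
    """Run one stage with the repository top level as the working directory."""
    return subprocess.run(repro_command(stage, dry_run), cwd=repo_root, check=True)
```

For example, `run_stage("label_taxa")` performs a dry run of that stage, and `run_stage("label_taxa", dry_run=False)` executes it.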

Dependencies between stages are shown below:

flowchart TD
    node1["../../analysis_pipeline/dvc.yaml:chosen_protein_search_space"]
    node2["../../analysis_pipeline/dvc.yaml:filtered_protein_search_space"]
    node3["../../analysis_pipeline/dvc.yaml:full_protein_search_space"]
    node4["../../dvc.yaml:compare_hait_alignment"]
    node5["../../dvc.yaml:compare_hait_hmmer"]
    node6["../../dvc.yaml:compare_to_Tm"]
    node7["../../dvc.yaml:get_16s_blast_scores"]
    node8["../../dvc.yaml:get_HMM_profiles"]
    node9["../../dvc.yaml:get_hait_pairs"]
    node10["../../dvc.yaml:get_proteome_mdata"]
    node11["../../dvc.yaml:get_raw_data_proteins"]
    node12["../../dvc.yaml:get_raw_data_taxa"]
    node13["../../dvc.yaml:hmmer_hait"]
    node14["../../dvc.yaml:label_all_pairs"]
    node15["../../dvc.yaml:label_taxa"]
    node16["../../dvc.yaml:make_database"]
    node17["../../dvc.yaml:parse_hmmer_result"]
    node18["../../dvc.yaml:parse_proteins"]
    node19["../../dvc.yaml:protein_alignment"]
    node20["../../dvc.yaml:run_hait_alignment"]
    node21["../../dvc.yaml:run_hmmer"]
    node22["../../dvc.yaml:sample_data_for_structure"]
    node23["../../dvc.yaml:structure_hait"]
    node24["../../dvc.yaml:structure_l2t"]
    node7-->node2
    node7-->node14
    node7-->node16
    node7-->node19
    node8-->node13
    node8-->node21
    node9-->node13
    node9-->node20
    node9-->node23
    node10-->node18
    node11-->node18
    node12-->node3
    node12-->node7
    node12-->node15
    node12-->node16
    node12-->node18
    node14-->node2
    node14-->node16
    node14-->node19
    node15-->node1
    node15-->node7
    node15-->node16
    node16-->node4
    node16-->node5
    node16-->node6
    node16-->node21
    node16-->node22
    node17-->node5
    node18-->node1
    node18-->node2
    node18-->node3
    node18-->node7
    node18-->node16
    node18-->node19
    node19-->node16
    node19-->node17
    node20-->node4
    node21-->node17
    node22-->node24
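To see which stages must finish before a given stage can run, the edges of the flowchart can be loaded into a small graph. The sketch below transcribes those edges by hand, using stage names without their dvc.yaml prefixes, and orders the upstream stages with the standard library's `graphlib` (Python 3.9+). The names `DOWNSTREAM` and `upstream_stages` are hypothetical, and `dvc dag` remains the authoritative view of the pipeline.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Edges transcribed from the flowchart: stage -> stages that depend on it.
DOWNSTREAM = {
    "get_16s_blast_scores": ["filtered_protein_search_space", "label_all_pairs",
                             "make_database", "protein_alignment"],
    "get_HMM_profiles": ["hmmer_hait", "run_hmmer"],
    "get_hait_pairs": ["hmmer_hait", "run_hait_alignment", "structure_hait"],
    "get_proteome_mdata": ["parse_proteins"],
    "get_raw_data_proteins": ["parse_proteins"],
    "get_raw_data_taxa": ["full_protein_search_space", "get_16s_blast_scores",
                          "label_taxa", "make_database", "parse_proteins"],
    "label_all_pairs": ["filtered_protein_search_space", "make_database",
                        "protein_alignment"],
    "label_taxa": ["chosen_protein_search_space", "get_16s_blast_scores",
                   "make_database"],
    "make_database": ["compare_hait_alignment", "compare_hait_hmmer",
                      "compare_to_Tm", "run_hmmer", "sample_data_for_structure"],
    "parse_hmmer_result": ["compare_hait_hmmer"],
    "parse_proteins": ["chosen_protein_search_space", "filtered_protein_search_space",
                       "full_protein_search_space", "get_16s_blast_scores",
                       "make_database", "protein_alignment"],
    "protein_alignment": ["make_database", "parse_hmmer_result"],
    "run_hait_alignment": ["compare_hait_alignment"],
    "run_hmmer": ["parse_hmmer_result"],
    "sample_data_for_structure": ["structure_l2t"],
}


def predecessors(downstream=DOWNSTREAM):
    """Invert the edge map: stage -> set of stages it directly depends on."""
    preds = {}
    for upstream, dependents in downstream.items():
        preds.setdefault(upstream, set())
        for dependent in dependents:
            preds.setdefault(dependent, set()).add(upstream)
    return preds


def upstream_stages(stage, downstream=DOWNSTREAM):
    """Return every stage that must run before `stage`, in a valid run order."""
    preds = predecessors(downstream)
    if stage not in preds:
        raise KeyError(f"unknown stage: {stage}")
    seen, stack = set(), [stage]
    while stack:  # walk back through direct and indirect dependencies
        for parent in preds[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    order = TopologicalSorter(preds).static_order()
    return [name for name in order if name in seen]


if __name__ == "__main__":
    print(upstream_stages("protein_alignment"))
```

Because several orderings can be valid for the same graph, the exact sequence printed may vary, but every stage always appears after the stages it depends on.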

Note that script execution is expected to occur with the top level as the current working directory, and paths are specified with respect to the repo top level.

The entire pipeline can be run in a single command using dvc exp run. Note, however, that many steps are resource intensive. At least 8 cores and 60 GB of RAM are recommended. For s1.3, Dask must be configured, as this stage requires a distributed computing cluster.
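A quick preflight check against these recommendations might look like the sketch below. It is illustrative only: the thresholds are the 8 cores and 60 GB quoted above, GB is assumed to mean 10^9 bytes, and the memory query relies on `os.sysconf`, which is not available on every platform (the helper returns `None` in that case). The function names are hypothetical.

```python
import os

MIN_CORES = 8     # recommendation quoted above
MIN_RAM_GB = 60   # recommendation quoted above, read as 10**9-byte gigabytes


def total_ram_gb():
    """Total physical memory in GB, or None where sysconf does not expose it."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9
    except (AttributeError, ValueError, OSError):
        return None


def resource_warnings(cores, ram_gb, min_cores=MIN_CORES, min_ram_gb=MIN_RAM_GB):
    """Return human-readable warnings for resources below the recommendation."""
    warnings = []
    if cores is not None and cores < min_cores:
        warnings.append(f"{cores} cores available, {min_cores} recommended")
    if ram_gb is not None and ram_gb < min_ram_gb:
        warnings.append(f"{ram_gb:.0f} GB RAM available, {min_ram_gb} GB recommended")
    return warnings


if __name__ == "__main__":
    for message in resource_warnings(os.cpu_count(), total_ram_gb()):
        print("Warning: " + message)
```

Unknown values are skipped rather than treated as failures, so the check stays quiet on platforms where the hardware cannot be queried.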

Python package

Installable, importable code used in the pipeline is found in learn2therm and is installed by the steps in the Environment section above.

About

Prediction of high temperature stable protein structures approached as a natural language problem
