12

I'm trying to write a quick parser for .pdb files (they show protein structure). An example of a protein I am looking at is KRAS (common in cancer) and is here: http://www.rcsb.org/pdb/files/3GFT.pdb

If you scroll down far enough you will get to a line that looks like this: ATOM 1 N MET A 1 63.645 97.355 31.526 1.00 33.80 N

The first element "atom" means this relates to an actual atom in the protein. The 1 relates to a general count, N relates to the type of atom, "MET" is the name of the residue, "A" relates to the the type of chain, 1 (the second "1") is the atom count and then the next 3 numbers are the x-y-z positions in space.

What I need output is something like this (where the "1" below corresponds to the atom count, not the general count): MET A 1 63.645 97.355 31.526

To make matters more complicated, sometimes the atom count (the second "1" in this case) is negative. In those cases I want to skip that line an continue until I hit a positive entry as those elements relate to the biochemistry needed to find the positions and not the actual protein. To make matters even worse, sometimes you get a line as such:

ATOM 139 CA AILE A 21 63.260 111.496 12.203 0.50 12.87 C
ATOM 140 CA BILE A 21 63.275 111.495 12.201 0.50 12.17 C

While they both refer to residue 21, the biochem isn't precise enough to get an exact position, so they give two options. Ideally, I would specify "1", "2" or whatever, but if I just take the first option that would be ok. Finally, for the type of atom ("N") in my original example, I only want to get those lines with a "CA".

I'm a newbie to python, and my training is in biostats so I was wondering what's the best way to do this? Do I parse this line by line with a for loop? Is there a way to do it faster in python? How do I handle the double entries for some atoms?

I realize it's a bit to ask, but some guidance would be a ton of help! I've programmed all the statistics bits in using R, but now I just need to get my files in the right format!

Thanks!

1

4 Answers 4

18

I am mildly surprised that no one mentioned the Bio.PDB package from BioPython. Writing a PDB parser on your own is a rather serious case of unnecessarily reinventing, I mean reimplementing the wheel.

BioPython is a useful collection of packages for working with other kinds of biological data (e.g. protein or nucleic acid sequences) as well.

Update

I added a few functions I use for downloading and reading PDB files by way of example:

import os import sys import urllib.request import Bio import Bio.PDB import Bio.SeqRecord def download_read_pdb(pdbcode, datadir, keepfile=True): """ Downloads a PDB file from the Internet and saves it in a data directory. Then it reads and returns the structure inside. :param pdbcode: The standard PDB ID e.g. '3ICB' :param datadir: The directory where the downloaded file will be saved :param keepfile: if False, then the downloaded file will be deleted (default: keep the downloaded file) :return: a Bio.PDB Structure object or None if something went wrong """ pdbfilenm = download_pdb(pdbcode, datadir) if pdbfilenm is None: return None struct = read_pdb(pdbcode, pdbfilenm) if not keepfile: os.remove(pdbfilenm) return struct def download_pdb(pdbcode, datadir, downloadurl="http://files.rcsb.org/download/"): """ Downloads a PDB file from the Internet and saves it in a data directory. :param pdbcode: The standard PDB ID e.g. '3ICB' or '3icb' :param datadir: The directory where the downloaded file will be saved :param downloadurl: The base PDB download URL, cf. `https://www.rcsb.org/pages/download/http#structures` for details Note that the unencrypted HTTP protocol is used by default to avoid spurious OpenSSL errors... :return: the full path to the downloaded PDB file or None if something went wrong """ pdbfn = pdbcode + ".pdb" url = downloadurl + pdbfn outfnm = os.path.join(datadir, pdbfn) try: urllib.request.urlretrieve(url, outfnm) return outfnm except Exception as err: # all sorts of things could have gone wrong... print(str(err), file=sys.stderr) return None def read_pdb(pdbcode, pdbfilenm): """ Read a PDB structure from a file. :param pdbcode: A PDB ID string :param pdbfilenm: The PDB file :return: a Bio.PDB.Structure object or None if something went wrong """ try: pdbparser = Bio.PDB.PDBParser(QUIET=True) # suppress PDBConstructionWarning struct = pdbparser.get_structure(pdbcode, pdbfilenm) return struct except Exception as err: print(str(err), file=sys.stderr) return None def extract_seqrecords(pdbcode, struct): """ Extracts the sequence records from a Bio.PDB structure. :param pdbcode: the PDB ID of the structure, needed to add a sequence ID to the result :param struct: a Bio.PDB.Structure object :return: a list of Bio.SeqRecord objects """ ppb = Bio.PDB.PPBuilder() seqrecords = [] for i, chain in enumerate(struct.get_chains()): # extract and store sequences as list of SeqRecord objects pps = ppb.build_peptides(chain) # polypeptides seq = pps[0].get_sequence() # just take the first, hope there's no chain break seqid = pdbcode + chain.id seqrec = Bio.SeqRecord.SeqRecord(seq, id=seqid, description="Sequence #{}, {}".format(i+1, seqid)) seqrecords.append(seqrec) return seqrecords def get_calphas(struct): """ Extracts the C-alpha atoms from a PDB structure. :param struct: A Bio.PDB.Structure object. :return: A list of Bio.PDB.Atom objects representing the C-alpha atoms in `struct`. """ calphas = [ atom for atom in struct.get_atoms() if atom.get_fullname() == " CA " ] return calphas 

I am aware that this was an old question but maybe someone still finds this pointer helpful.

Sign up to request clarification or add additional context in comments.

4 Comments

Is Bio.PDB crossplatform? Or it's yet another wrapper for DIA SDK?
It's written in Python AFAIK, so it's exactly as cross-platform as Python itself.
I would really appreciate a 5-10 liner of 'how to load using BioPDB', it's documentation is deeply nested and expects one to guess where to look for an example
@Alleo I added some code examples to my answer as requested.
8

It looks like a library to parse PDB files already exists. Check out:

http://prody.csb.pitt.edu/

Briefly looking through the tutorials, it seems like this is what you're talking about no?

http://www.csb.pitt.edu/prody/tutorial.html#atomic-data

3 Comments

It seems like with windows 7 x64, the registry is a bit different so the installer is complaining. Thanks for pointing me to this, I will take a look a bit later.
This is outdated as resource is removed.
This answer should be the preferred one as the ProDy parser is 2x faster than BioPython (even though they are both in python). Here is the new link: prody.csb.pitt.edu
7

For large pdb files with a lot of atoms, there is no blank left between fields, so you cannot use the split command. Instead, you can use the Protein Data Bank format definition to parse the pdb file:

with open('min.pdb') as pdbfile: for line in pdbfile: if line[:4] == 'ATOM' or line[:6] == "HETATM": print line # Split the line splitted_line = [line[:6], line[6:11], line[12:16], line[17:20], line[21], line[22:26], line[30:38], line[38:46], line[46:54]] print splitted_line # To format again the pdb file with the fields extracted print "%-6s%5s %4s %3s %s%4s %8s%8s%8s\n"%tuple(splitted_line) 

Below is a sample output:

# Original line: HETATM10000 O HOH B3257 2.509 40.006 -4.097 1.00 0.00 O # Fields extracted: ['HETATM', '10000', ' O ', 'HOH', 'B', '3257', ' 2.509', ' 40.006', ' -4.097'] # Reformatted line: HETATM10000 O HOH B3257 2.509 40.006 -4.097 

Comments

6

That's a long description. I am not sure I got all that correctly :-) If the fields (for the lines start with ATOM) are fixed, you could use a split and some comaprisons. I've used a hash to see if the entry is already seen to eliminate the duplicate as you wanted. Hope this would give you a start,

visited = {} for line in open('3GFT.pdb'): list = line.split() id = list[0] if id == 'ATOM': type = list[2] if type == 'CA': residue = list[3] type_of_chain = list[4] atom_count = int(list[5]) position = list[6:8] if atom_count >= 0: if type_of_chain not in visited: visited[type_of_chain] = 1 print residue,type_of_chain,atom_count,' '.join(position) 

Will output,

MET A 1 62.935 97.579 GLY B 0 39.524 105.916 GLY C 0 67.295 110.376 MET D 1 59.311 124.106 GLY E 0 44.038 96.819 GLY F 0 44.187 123.590 

3 Comments

Hi, Thanks for this! It wasn't quite what I needed but I was able to to piece together the correct result! Thank you!
Slight corrections: 1) list[6:8] should be list[6:9] 2) the final if block should be: if(atom_count >= 1): if atom_count not in visited and type_of_chain==chain_required:
Just a heads up that split() should not be used on PDB files since the data is based on the column position. ATOM 5762 CA LEU B3854 19.760... and ATOM 5762 CA LEU B 3854 19.760... would cause errors.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.