I'm trying to write a quick parser for .pdb files (they show protein structure). An example of a protein I am looking at is KRAS (common in cancer) and is here: http://www.rcsb.org/pdb/files/3GFT.pdb
If you scroll down far enough you will get to a line that looks like this: ATOM 1 N MET A 1 63.645 97.355 31.526 1.00 33.80 N
The first element "atom" means this relates to an actual atom in the protein. The 1 relates to a general count, N relates to the type of atom, "MET" is the name of the residue, "A" relates to the the type of chain, 1 (the second "1") is the atom count and then the next 3 numbers are the x-y-z positions in space.
What I need output is something like this (where the "1" below corresponds to the atom count, not the general count): MET A 1 63.645 97.355 31.526
To make matters more complicated, sometimes the atom count (the second "1" in this case) is negative. In those cases I want to skip that line an continue until I hit a positive entry as those elements relate to the biochemistry needed to find the positions and not the actual protein. To make matters even worse, sometimes you get a line as such:
ATOM 139 CA AILE A 21 63.260 111.496 12.203 0.50 12.87 C
ATOM 140 CA BILE A 21 63.275 111.495 12.201 0.50 12.17 C
While they both refer to residue 21, the biochem isn't precise enough to get an exact position, so they give two options. Ideally, I would specify "1", "2" or whatever, but if I just take the first option that would be ok. Finally, for the type of atom ("N") in my original example, I only want to get those lines with a "CA".
I'm a newbie to python, and my training is in biostats so I was wondering what's the best way to do this? Do I parse this line by line with a for loop? Is there a way to do it faster in python? How do I handle the double entries for some atoms?
I realize it's a bit to ask, but some guidance would be a ton of help! I've programmed all the statistics bits in using R, but now I just need to get my files in the right format!
Thanks!