
I have a list of genes, their coordinates, and their expression (right now just looking at the top 500 most highly expressed genes) and 12 files corresponding to DNA reads. I have a Python script that searches for reads overlapping each gene's coordinates and stores the counts in a dictionary. I then use this dictionary to create a Pandas dataframe and save it as a CSV. (I will be using these to create a scatterplot.)

The RNA file looks like this (the headers are gene name, chromosome, start, stop, gene coverage/enrichment):

MSTRG.38    NC_008066.1 9204    9987    48395.347656
MSTRG.36    NC_008066.1 7582    8265    47979.933594
MSTRG.33    NC_008066.1 5899    7437    43807.781250
MSTRG.49    NC_008066.1 14732   15872   26669.763672
MSTRG.38    NC_008066.1 8363    9203    19514.273438
MSTRG.34    NC_008066.1 7439    7510    16855.662109

And the DNA file looks like this (the headers are chromosome, start, stop, gene name, coverage, strand):

JQ673480.1  697     778     SRX6359746.5505370/2    8   +
JQ673480.1  744     824     SRX6359746.5505370/1    8   -
JQ673480.1  1712    1791    SRX6359746.2565519/2    27  +
JQ673480.1  3445    3525    SRX6359746.7028440/2    23  -
JQ673480.1  4815    4873    SRX6359746.6742605/2    37  +
JQ673480.1  5055    5092    SRX6359746.5420114/2    40  -
JQ673480.1  5108    5187    SRX6359746.2349349/2    24  -
JQ673480.1  7139    7219    SRX6359746.3831446/2    22  +

The RNA file has more than 9,000 lines, and each DNA file has more than 12,000,000 lines.

I originally had a for-loop that would generate a dictionary containing all values for all 12 files in one go, but it runs extremely slowly. Since I have access to a computing system with multiple cores, I've decided to run a script that only calculates coverage one DNA file at a time, like so:

```python
#import modules
import csv
import pandas as pd
import matplotlib.pyplot as plt

#set sample name
sample='CON-2'
#set fraction number
f=6
#dictionary to store values
d={}

#load file name into variable
fileRNA="top500_R8_7-{}-RNA.gtf".format(sample)
print(fileRNA)
#read tsv file
tsvRNA = open(fileRNA)
readRNA = csv.reader(tsvRNA, delimiter="\t")
expGenes=[]
#convert tsv file into Python list
for row in readRNA:
    gene=row[0],row[1],row[2],row[3],row[4]
    expGenes.append(row)
#print(expGenes)

#establish file name for DNA reads
fileDNA="D2_7-{}-{}.bed".format(sample,f)
print(fileDNA)
tsvDNA = open(fileDNA)
readDNA = csv.reader(tsvDNA, delimiter="\t")
#put file into Python list
MCNgenes=[]
for row in readDNA:
    read=row[0],row[1],row[2]
    MCNgenes.append(read)

#find read counts
n=0  # counter (was previously used uninitialized)
for r in expGenes:
    #include FPKM in the dictionary
    d[r[0]]=[r[4]]
    regionCount=0
    #set start and stop points based on transcript file
    chr=r[1]
    start=int(r[2])
    stop=int(r[3])
    #print("start:",start,"stop:",stop)
    for row in MCNgenes:
        if start < int(row[1]) < stop:
            regionCount+=1
    d[r[0]].append(regionCount)
    n+=1

df=pd.DataFrame.from_dict(d)
#convert to heatmap
df.to_csv("7-CON-2-6_forHeatmap.csv")
```

This script also runs quite slowly, however. Are there any changes I can make to get it to run more efficiently?

  • Have you considered using Dask? Commented Dec 9, 2020 at 23:58
  • For performance issues, it's usually better to run a profiler than to guess where the bottleneck is. Commented Dec 10, 2020 at 0:00
  • Which columns are you using as 'coordinates'? Are they always called 'start' and 'stop'? Commented Dec 10, 2020 at 0:06
  • Can you please post an example of your desired output? In essence, part of 7-CON-2-6_forHeatmap.csv. Commented Dec 10, 2020 at 0:22

2 Answers


If I understood correctly, you are trying to match coordinates of genes between different files, and I believe the best option would be to use something like a KDTree partitioning algorithm.

You can use a KDTree to partition your DNA and RNA data. I'm assuming you're using 'start' and 'stop' as coordinates:

```python
import pandas as pd
import numpy as np
from sklearn.neighbors import KDTree

dna = pd.DataFrame()  # this is your dataframe with DNA data
rna = pd.DataFrame()  # same for RNA

# Let's assume you are using 'start' and 'stop' columns as coordinates
dna_coord = dna.loc[:, ['start', 'stop']]
rna_coord = rna.loc[:, ['start', 'stop']]

dna_kd = KDTree(dna_coord)
rna_kd = KDTree(rna_coord)

# Now you can go through your data and match with DNA:
my_data = pd.DataFrame()
for start, stop in zip(my_data.start, my_data.stop):
    coord = np.array([[start, stop]])  # query expects a 2-D array of points
    dist, idx = dna_kd.query(coord, k=1)
    # Assuming you need an exact match
    if np.isclose(dist[0][0], 0):
        # Now that you have the index of the matching row in DNA data
        # you can extract information using the index and do whatever
        # you want with it
        dna_gene_data = dna.loc[idx[0][0], :]
```

You can adjust your search parameters to get the desired results, but this will be much faster than searching every time.
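To make the query pattern above concrete, here is a minimal runnable sketch using toy coordinates standing in for the real (start, stop) pairs; the numbers are taken from the question's sample data and are purely illustrative:

```python
import numpy as np
from sklearn.neighbors import KDTree

# Toy (start, stop) pairs standing in for the DNA coordinates
dna_coord = np.array([[697, 778], [1712, 1791], [3445, 3525]])
tree = KDTree(dna_coord)

# Query for the single nearest stored pair to a candidate coordinate;
# KDTree.query expects a 2-D array of query points
dist, idx = tree.query(np.array([[1712, 1791]]), k=1)

# An exact match comes back with distance 0 and the row index of the match
print(dist[0][0], idx[0][0])  # → 0.0 1
```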


1 Comment

Thank you, I will definitely give that a try!

Generally, Python is extremely easy to work with at the cost of being inefficient! Scientific libraries (such as pandas and numpy) help here by paying the Python overhead only a limited number of times to map the work into a convenient space, then doing the "heavy lifting" in a more efficient language (which may be quite painful/inconvenient to work with).

General advice

  • try to get data into a dataframe whenever possible and keep it there (do not convert data into some intermediate Python object like a list or dict)
  • try to use methods of the dataframe or parts of it to do work (such as .apply() and .map()-like methods)
  • whenever you must iterate in native Python, iterate on the shorter side of a dataframe (i.e. there may be only 10 columns but 10,000 rows; go over the columns)
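As a sketch of the first two points applied to the question's inner loop: because only the read start is compared against each gene's interval, the per-gene count can be computed with two binary searches on a sorted array of read starts (numpy's searchsorted) instead of a native-Python loop over millions of rows. The toy data and column names below are illustrative:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real RNA and DNA files
genes = pd.DataFrame({
    "name": ["MSTRG.38", "MSTRG.36"],
    "start": [9204, 7582],
    "stop": [9987, 8265],
})
reads = pd.DataFrame({"start": [9300, 9500, 7600, 20000]})

# Sort read starts once, then count reads per gene with two binary
# searches; this reproduces the strict start < read_start < stop test
starts = np.sort(reads["start"].to_numpy())
genes["count"] = (
    np.searchsorted(starts, genes["stop"], side="left")
    - np.searchsorted(starts, genes["start"], side="right")
)
print(genes)  # MSTRG.38 gets 2 reads, MSTRG.36 gets 1
```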

More on this topic here:

How to iterate over rows in a DataFrame in Pandas?
Answer: DON'T*!


Once you have a program, you can benchmark it by collecting runtime information. There are many libraries for this, but there is also a built-in one called cProfile which may work for you.

docs: https://docs.python.org/3/library/profile.html

```shell
python3 -m cProfile -o profile.out myscript.py
```
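cProfile can also be driven from inside a script and its results inspected with the standard-library pstats module; a self-contained sketch (slow_sum is just a stand-in for whatever you want to measure):

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # deliberately loopy function to give the profiler something to see
    total = 0
    for i in range(n):
        total += i
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_sum(100_000)
profiler.disable()

# Print the five most expensive calls by cumulative time
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```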

2 Comments

why iterate over columns with for and not .apply()?
There is no strict rule, but it can be easier to reason about, and not too expensive (a list of Python objects and references for each column vs. one for each row). I've updated my example to be clearer that the case is when you have some reason to work in native Python!
