I have a list of genes, their coordinates, and their expression (right now just looking at the top 500 most highly expressed genes) and 12 files corresponding to DNA reads. I have a python script that searches for reads overlapping with each gene's coordinates and storing the values in a dictionary. I then use this dictionary to create a Pandas dataframe and save this as a csv. (I will be using these to create a scatterplot.)
The RNA file looks like this (the headers are gene name, chromosome, start, stop, gene coverage/enrichment):
MSTRG.38 NC_008066.1 9204 9987 48395.347656 MSTRG.36 NC_008066.1 7582 8265 47979.933594 MSTRG.33 NC_008066.1 5899 7437 43807.781250 MSTRG.49 NC_008066.1 14732 15872 26669.763672 MSTRG.38 NC_008066.1 8363 9203 19514.273438 MSTRG.34 NC_008066.1 7439 7510 16855.662109 And the DNA file looks like this (the headers are chromosome, start, stop, gene name, coverage, strand):
JQ673480.1 697 778 SRX6359746.5505370/2 8 + JQ673480.1 744 824 SRX6359746.5505370/1 8 - JQ673480.1 1712 1791 SRX6359746.2565519/2 27 + JQ673480.1 3445 3525 SRX6359746.7028440/2 23 - JQ673480.1 4815 4873 SRX6359746.6742605/2 37 + JQ673480.1 5055 5092 SRX6359746.5420114/2 40 - JQ673480.1 5108 5187 SRX6359746.2349349/2 24 - JQ673480.1 7139 7219 SRX6359746.3831446/2 22 + The RNA file has >9,000 lines, and the DNA files have > 12,000,000 lines.
I originally had a for-loop that would generate a dictionary containing all values for all 12 files in one go, but it runs extremely slowly. Since I have access to a computing system with multiple cores, I've decided to run a script that only calculates coverage one DNA file at a time, like so:
#import modules import csv import pandas as pd import matplotlib.pyplot as plt #set sample name sample='CON-2' #set fraction number f=6 #dictionary to store values d={} #load file name into variable fileRNA="top500_R8_7-{}-RNA.gtf".format(sample) print(fileRNA) #read tsv file tsvRNA = open(fileRNA) readRNA = csv.reader(tsvRNA, delimiter="\t") expGenes=[] #convert tsv file into Python list for row in readRNA: gene=row[0],row[1],row[2],row[3],row[4] expGenes.append(row) #print(expGenes) #establish file name for DNA reads fileDNA="D2_7-{}-{}.bed".format(sample,f) print(fileDNA) tsvDNA = open(fileDNA) readDNA = csv.reader(tsvDNA, delimiter="\t") #put file into Python list MCNgenes=[] for row in readDNA: read=row[0],row[1],row[2] MCNgenes.append(read) #find read counts for r in expGenes: #include FPKM in the dictionary d[r[0]]=[r[4]] regionCount=0 #set start and stop points based on transcript file chr=r[1] start=int(r[2]) stop=int(r[3]) #print("start:",start,"stop:",stop) for row in MCNgenes: if start < int(row[1]) < stop: regionCount+=1 d[r[0]].append(regionCount) n+=1 df=pd.DataFrame.from_dict(d) #convert to heatmap df.to_csv("7-CON-2-6_forHeatmap.csv") This script also runs quite slowly, however. Are there any changes I can make to get it run more efficiently?
7-CON-2-6_forHeatmap.csv.