I have a dataset in a .csv file in this format:
    id,interaction_flag,x_coordinate,y_coordinate,z_coordinate,hydrophobicity_kd,hydrophobicity_ww,hydrophobicity_hh,surface_tension,charge_cooh,charge_nh3,charge_r,alpha_helix,beta_strand,turn,van_der_walls,mol_wt,solublity
    229810,1,-33.8675148907451,-110.273691995647,100.021824089754,0.129381338742408,0.129381338742408,0.129381338742408,57.9996957403639,2.20539553752535,9.55985801217038,4.47146044624688,1.08064908722114,1.20135902636915,0.611653144016251,145.232251521298,107.951643002026,21.5344036511141
    229811,1,-26.9070290467923,-117.172163712053,106.980243932766,0.922048681541592,0.922048681541592,0.922048681541592,58.5383367139972,2.03983772819472,9.23210953346856,1.58401622717997,0.84178498985806,1.0387626774848,0.921703853955354,124.73630831643,84.1570182555755,10.7648600405665

I am trying to compute a Receiver Operating Characteristic (ROC) curve from this data, following this example: http://scikit-learn.org/0.11/auto_examples/plot_roc.html
My target is the interaction_flag column, and the features are all the columns after interaction_flag. However, my program keeps running and never finishes.
When I run the example given in that link, it finishes almost immediately.
Can anyone tell me what I am doing wrong? Or do I need to do something else to load my data the way the iris data is loaded?
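For reference, one way to load a file with the header shown above into a feature list and a label list, in the spirit of `iris.data` and `iris.target`, might look like the sketch below. It assumes a current Python 3, a header row, `interaction_flag` as the label, and every column after it as a feature. The helper name `load_xy` and the inline two-row sample (first five columns only, values rounded) are illustrative and not part of my script.

```python
import csv
import io

# Illustrative sample: first five columns of the dataset, values rounded.
# A real file would be opened with open(path, newline="") instead.
SAMPLE = """id,interaction_flag,x_coordinate,y_coordinate,z_coordinate
229810,1,-33.87,-110.27,100.02
229811,1,-26.91,-117.17,106.98
"""


def load_xy(handle, label_column="interaction_flag"):
    """Return (X, y, feature_names) from a CSV file object with a header row."""
    reader = csv.reader(handle)
    header = next(reader)                    # first row holds the column names
    label_idx = header.index(label_column)   # position of the label column
    feature_names = header[label_idx + 1:]   # every column after the label
    X, y = [], []
    for row in reader:
        if not row:                          # skip blank lines
            continue
        y.append(int(row[label_idx]))
        X.append([float(v) for v in row[label_idx + 1:]])
    return X, y, feature_names


X, y, feature_names = load_xy(io.StringIO(SAMPLE))
print(feature_names)
print(X)
print(y)
```

Wrapping the results in `np.array(X)` and `np.array(y)` would then give arrays of the kind passed to `classifier.fit`.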
My code:
    import numpy as np
    import pylab as pl
    from sklearn import svm, datasets
    from sklearn.utils import shuffle
    from sklearn.metrics import roc_curve, auc

    training = 'dataset/training_5000_col.csv'
    test = 'dataset/test_5000_col.csv'

    random_state = np.random.RandomState(0)

    # Import some data to play with
    #iris = datasets.load_iris()
    #X = iris.data
    #y = iris.target

    X = []
    y = []
    for line in open(training):
        z = line.rstrip().split(',')
        y.append(int(z[2]))
        tmp = []
        for a in range(5, 15):
            tmp.append(float(z[a]))
        X.append(tmp)
    X_train = np.array(X)
    y_train = np.array(y)

    X1 = []
    y1 = []
    for line in open(test):
        z = line.rstrip().split(',')
        y1.append(int(z[2]))
        tmp = []
        for a in range(5, 15):
            tmp.append(float(z[a]))
        X1.append(tmp)
    X_test = np.array(X1)
    y_test = np.array(y1)

    # Run classifier
    classifier = svm.SVC(kernel='linear', probability=True)
    probas_ = classifier.fit(X_train, y_train).predict_proba(X_test)

    # Compute ROC curve and area under the curve
    fpr, tpr, thresholds = roc_curve(y_test, probas_[:, 1])
    print "y_test : ", y_test
    print "fpr : ", fpr
    print "tpr : ", tpr
    roc_auc = auc(fpr, tpr)
    print "Area under the ROC curve : %f" % roc_auc

    # Plot ROC curve
    pl.clf()
    pl.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
    pl.plot([0, 1], [0, 1], 'k--')
    pl.xlim([0.0, 1.0])
    pl.ylim([0.0, 1.0])
    pl.xlabel('False Positive Rate')
    pl.ylabel('True Positive Rate')
    pl.title('Receiver operating characteristic example')
    pl.legend(loc="lower right")
    pl.show()

My .csv file is at: http://pastebin.com/iet5xQW2

How can I plot the ROC curve with this .csv?
shape, dtype and the first 5 lines of the X_train and y_train arrays.
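Those details could be printed with something like the following sketch. The small arrays are placeholders standing in for the `X_train` and `y_train` built by the script above, and their values are made up purely for illustration.

```python
import numpy as np

# Placeholder arrays (made-up values); in the real script these come
# from the loop that reads the training CSV.
X_train = np.array([[0.1, 2.0], [0.9, 1.5], [0.4, 3.2],
                    [0.7, 0.8], [0.2, 2.9], [0.6, 1.1]])
y_train = np.array([1, 0, 1, 0, 1, 0])

for name, arr in (("X_train", X_train), ("y_train", y_train)):
    print(name, "shape:", arr.shape, "dtype:", arr.dtype)
    print(arr[:5])  # first five rows (or elements for a 1-D array)
```

A dtype of `object` or a string type here, rather than a float or integer type, would suggest that some values were not parsed as numbers.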