Working with a dataset in sklearn

Question

I have a dataset in .csv format that looks like this:

    id,interaction_flag,x_coordinate,y_coordinate,z_coordinate,hydrophobicity_kd,hydrophobicity_ww,hydrophobicity_hh,surface_tension,charge_cooh,charge_nh3,charge_r,alpha_helix,beta_strand,turn,van_der_walls,mol_wt,solublity
    229810,1,-33.8675148907451,-110.273691995647,100.021824089754,0.129381338742408,0.129381338742408,0.129381338742408,57.9996957403639,2.20539553752535,9.55985801217038,4.47146044624688,1.08064908722114,1.20135902636915,0.611653144016251,145.232251521298,107.951643002026,21.5344036511141
    229811,1,-26.9070290467923,-117.172163712053,106.980243932766,0.922048681541592,0.922048681541592,0.922048681541592,58.5383367139972,2.03983772819472,9.23210953346856,1.58401622717997,0.84178498985806,1.0387626774848,0.921703853955354,124.73630831643,84.1570182555755,10.7648600405665

I am trying to plot a Receiver Operating Characteristic (ROC) curve from this data, following this example: http://scikit-learn.org/0.11/auto_examples/plot_roc.html

My target is the interaction_flag column, and the features are all the columns after interaction_flag. However, my program keeps running and never finishes.
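As a rough illustration of the split described above (target = interaction_flag, features = every column after it), here is a minimal sketch using only the standard library. The function name `load_xy` is hypothetical, and `SAMPLE` is the sample from the question trimmed to its first five columns for brevity. The sketch consumes the header row and looks the target column up by name instead of using a fixed index.

```python
import csv
import io

# Two rows copied from the sample in the question, trimmed to five columns.
SAMPLE = """id,interaction_flag,x_coordinate,y_coordinate,z_coordinate
229810,1,-33.8675148907451,-110.273691995647,100.021824089754
229811,1,-26.9070290467923,-117.172163712053,106.980243932766
"""


def load_xy(handle, target="interaction_flag"):
    """Split CSV rows into features (columns after the target) and labels."""
    reader = csv.reader(handle)
    header = next(reader)              # consume the header so it is not parsed as data
    target_idx = header.index(target)  # find the target column by name
    X, y = [], []
    for row in reader:
        if not row:                    # skip blank lines
            continue
        y.append(int(row[target_idx]))
        X.append([float(value) for value in row[target_idx + 1:]])
    return X, y, header[target_idx + 1:]


X, y, feature_names = load_xy(io.StringIO(SAMPLE))
print(feature_names)
print(y)
print(X[0])
```

For a file on disk, the same function should work with `open(path, newline="")` in place of the `io.StringIO` object.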

When I run the example given at that link, it finishes almost immediately.

Can anyone tell me what I am doing wrong? Or do I need to do something else to load my data the way the iris dataset is loaded?

My code:

    import numpy as np
    import pylab as pl
    from sklearn import svm, datasets
    from sklearn.utils import shuffle
    from sklearn.metrics import roc_curve, auc

    training = 'dataset/training_5000_col.csv'
    test = 'dataset/test_5000_col.csv'

    random_state = np.random.RandomState(0)

    # Import some data to play with
    #iris = datasets.load_iris()
    #X = iris.data
    #y = iris.target

    X = []
    y = []
    for line in open(training):
        z = line.rstrip().split(',')
        y.append(int(z[2]))
        tmp = []
        for a in range(5, 15):
            tmp.append(float(z[a]))
        X.append(tmp)

    X_train = np.array(X)
    y_train = np.array(y)

    X1 = []
    y1 = []
    for line in open(test):
        z = line.rstrip().split(',')
        y1.append(int(z[2]))
        tmp = []
        for a in range(5, 15):
            tmp.append(float(z[a]))
        X1.append(tmp)

    X_test = np.array(X1)
    y_test = np.array(y1)

    # Run classifier
    classifier = svm.SVC(kernel='linear', probability=True)
    probas_ = classifier.fit(X_train, y_train).predict_proba(X_test)

    # Compute ROC curve and area the curve
    fpr, tpr, thresholds = roc_curve(y_test, probas_[:, 1])
    print "y_test : ", y_test
    print "fpr : ", fpr
    print "tpr : ", tpr
    roc_auc = auc(fpr, tpr)
    print "Area under the ROC curve : %f" % roc_auc

    # Plot ROC curve
    pl.clf()
    pl.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
    pl.plot([0, 1], [0, 1], 'k--')
    pl.xlim([0.0, 1.0])
    pl.ylim([0.0, 1.0])
    pl.xlabel('False Positive Rate')
    pl.ylabel('True Positive Rate')
    pl.title('Receiver operating characteristic example')
    pl.legend(loc="lower right")
    pl.show()

My .csv file is at http://pastebin.com/iet5xQW2. How can I plot a ROC curve from this .csv?

In my program I first read interaction_flag into one list and the remaining data into another list, and then passed both to the fit function.
– veena, Commented Feb 5, 2014 at 7:36
First comment: the link you provide references the documentation of a very old version of scikit-learn. Replace 0.11 with stable in the URL to get the up-to-date documentation. Then please edit your question to print the shape, dtype and first 5 rows of the X_train and y_train arrays.
– ogrisel, Commented Feb 5, 2014 at 7:55
Please guide me step by step. I have been stuck on this for a long time and cannot find a way to get a ROC curve from the .csv.
– veena, Commented Feb 5, 2014 at 8:06
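
The check ogrisel asks for in the comments (shape, dtype and first 5 rows of X_train and y_train) could look roughly like the sketch below. The helper name `describe` is hypothetical, and the two small arrays are made-up placeholders standing in for the arrays built from the CSV in the question.

```python
import numpy as np


def describe(name, arr, n=5):
    """Print the shape, dtype and first n rows of an array."""
    arr = np.asarray(arr)
    print("%s: shape=%s dtype=%s" % (name, arr.shape, arr.dtype))
    print(arr[:n])
    return arr.shape, arr.dtype


# Placeholder values (assumption): replace with the X_train and y_train
# arrays built from the training file.
X_train = np.array([[0.13, 58.0, 2.21], [0.92, 58.5, 2.04]])
y_train = np.array([1, 1])

describe("X_train", X_train)
describe("y_train", y_train)
```

A feature matrix whose dtype is `object` or a string type instead of a float type, or whose shape differs from what is expected, usually points at a parsing problem in the loading loop.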

Abhishek Thakur · Accepted Answer · 2014-02-05 08:38:29Z

You need two different labels in order to plot the ROC curve. The following example works for me if I add some 0 labels to your data. I used pandas to read the data; the rest is the same as the sklearn example.

Also, you need to split the dataset into training and test sets so that the ROC curve is plotted on the test set.

    import pandas as pd
    import numpy as np
    from scipy import interp
    import pylab as pl
    from sklearn import svm
    from sklearn.metrics import roc_curve, auc
    from sklearn.cross_validation import StratifiedKFold

    def data(filename):
        X = pd.read_table(filename, sep=',', warn_bad_lines=True, error_bad_lines=True, low_memory = False)
        X = np.asarray(X)
        data = X[:,2:]
        labels = X[:,1]
        print np.unique(labels)
        return data, labels

    filename = '../data/sodata.csv'
    X, y = data(filename)

    ###############################################################################
    # Classification and ROC analysis

    # Run classifier with cross-validation and plot ROC curves
    cv = StratifiedKFold(y, n_folds=6)
    classifier = svm.SVC(kernel='linear', probability=True, random_state=0)

    mean_tpr = 0.0
    mean_fpr = np.linspace(0, 1, 100)
    all_tpr = []

    for i, (train, test) in enumerate(cv):
        probas_ = classifier.fit(X[train], y[train]).predict_proba(X[test])
        # Compute ROC curve and area the curve
        fpr, tpr, thresholds = roc_curve(y[test], probas_[:, 1])
        mean_tpr += interp(mean_fpr, fpr, tpr)
        mean_tpr[0] = 0.0
        roc_auc = auc(fpr, tpr)
        pl.plot(fpr, tpr, lw=1, label='ROC fold %d (area = %0.2f)' % (i, roc_auc))

    pl.plot([0, 1], [0, 1], '--', color=(0.6, 0.6, 0.6), label='Luck')

    mean_tpr /= len(cv)
    mean_tpr[-1] = 1.0
    mean_auc = auc(mean_fpr, mean_tpr)
    pl.plot(mean_fpr, mean_tpr, 'k--',
            label='Mean ROC (area = %0.2f)' % mean_auc, lw=2)

    pl.xlim([-0.05, 1.05])
    pl.ylim([-0.05, 1.05])
    pl.xlabel('False Positive Rate')
    pl.ylabel('True Positive Rate')
    pl.title('Receiver operating characteristic example')
    pl.legend(loc="lower right")
    pl.show()
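
The code above targets the 2014 APIs (Python 2, `sklearn.cross_validation`). For readers on a recent scikit-learn, here is a hedged sketch of the same idea using `sklearn.model_selection.StratifiedKFold` and `np.interp`. It is not the answerer's code: the function name `cross_validated_roc` is hypothetical, plotting is left out, and the data is a synthetic stand-in generated with NumPy, not the CSV from the question. It also makes the two-label requirement explicit before any fitting starts.

```python
import numpy as np
from sklearn import svm
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import StratifiedKFold


def cross_validated_roc(X, y, n_splits=6, random_state=0):
    """Return per-fold AUCs and the mean ROC curve on a fixed FPR grid."""
    classes = np.unique(y)
    if classes.size != 2:
        # A ROC curve needs both positive and negative examples.
        raise ValueError("need exactly two label values, got %r" % (classes,))

    cv = StratifiedKFold(n_splits=n_splits)
    classifier = svm.SVC(kernel="linear", probability=True,
                         random_state=random_state)

    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = np.zeros_like(mean_fpr)
    fold_aucs = []
    for train, test in cv.split(X, y):
        # Fit on the training fold, score the held-out fold.
        probas = classifier.fit(X[train], y[train]).predict_proba(X[test])
        fpr, tpr, _ = roc_curve(y[test], probas[:, 1])
        mean_tpr += np.interp(mean_fpr, fpr, tpr)
        fold_aucs.append(auc(fpr, tpr))

    mean_tpr /= n_splits
    mean_tpr[0], mean_tpr[-1] = 0.0, 1.0  # pin the curve's end points
    return fold_aucs, mean_fpr, mean_tpr


# Synthetic stand-in data (assumption): 120 rows, 4 features, balanced 0/1 labels.
rng = np.random.RandomState(0)
y_demo = np.array([0, 1] * 60)
X_demo = rng.normal(size=(120, 4)) + y_demo[:, None]

fold_aucs, mean_fpr, mean_tpr = cross_validated_roc(X_demo, y_demo)
print("per-fold AUC:", np.round(fold_aucs, 2))
print("mean AUC: %.2f" % auc(mean_fpr, mean_tpr))
```

Passing a label array that contains only 1s makes the function raise a `ValueError` immediately, which is the situation the answer describes for the posted sample.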

Thanks, I will check. How long does computing probas_ take? I have been running it for 3-4 hours and it is still running.
It takes me a few seconds on the dataset you provided. You must be doing something wrong.
It has been running for the whole day. Can you tell me in detail what you did with the dataset I provided in the pastebin link?
I got it. Can you tell me the same about scikit-learn.org/stable/auto_examples/…
