PyOD is a Python-based toolkit for identifying anomalies in data with unsupervised and supervised approaches. The toolkit consists of two major functionalities:
- Individual Algorithms
  - Local Outlier Factor (wrapper of the sklearn implementation)
  - Isolation Forest (wrapper of the sklearn implementation)
  - One-Class Support Vector Machines (wrapper of the sklearn implementation)
- KNN Outlier Detection (implemented)
- Average KNN Outlier Detection (implemented)
- Median KNN Outlier Detection (implemented)
- Global-Local Outlier Score From Hierarchies (implemented)
- More to add
- Ensemble Framework
- Feature bagging
- More to add
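The kNN-based detectors listed above all derive an outlier score from the distances to a point's nearest neighbors: the distance to the k-th neighbor ("largest"), or the mean or median of the k nearest distances. Below is a minimal numpy-only sketch of that idea; the function name and details are illustrative, not PyOD's actual implementation:

```python
import numpy as np

def knn_outlier_scores(X, k=5, method="largest"):
    """Outlier score per point: distance to its k-th nearest neighbor
    ("largest"), or the mean/median of its k nearest-neighbor distances.
    Illustrative sketch only, not PyOD's implementation."""
    X = np.asarray(X, dtype=float)
    # pairwise Euclidean distances (n x n)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(d, np.inf)          # ignore each point's self-distance
    knn = np.sort(d, axis=1)[:, :k]      # k smallest distances per point
    if method == "largest":
        return knn[:, -1]
    if method == "mean":
        return knn.mean(axis=1)
    return np.median(knn, axis=1)        # "median"

# a tight cluster plus one obvious outlier far away
rng = np.random.RandomState(42)
X = np.vstack([rng.randn(20, 2), [[8.0, 8.0]]])
scores = knn_outlier_scores(X, k=3)
print(scores.argmax())  # → 20 (the far-away point gets the largest score)
```

Points in dense regions have small neighbor distances and hence low scores; isolated points score high, which is exactly the intuition behind the kNN family of detectors.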
Before using the toolkit, please be advised that it is intended for quick exploration only. Its output should be treated with caution, and fine-tuning may be needed to produce meaningful results. I would recommend using it as a first-step data exploration tool, then building your own model or reusing this one to get more accurate results.
"example.py" is an example to demo the basic API of PyOD. It first generate some sample data to run. normal data is generated by a 2-d gaussian distribution, and outliers are generated by a 2-d uniform distribution.
```python
# percentage of outliers
contamination = 0.1
n_train = 1000
n_test = 500

# generate sample data
X_train, y_train, c_train, X_test, y_test, c_test = generate_data(
    n=n_train, contamination=contamination, n_test=n_test)
```

Then it initializes the classifier, fits the model, and makes predictions.
```python
# train a HBOS detector
clf = Hbos(contamination=0.1)
clf.fit(X_train)

# get the outlier scores of the training data
y_train_pred = clf.y_pred
y_train_score = clf.decision_scores

# make the prediction on the test data
y_test_pred = clf.predict(X_test)
y_test_score = clf.decision_function(X_test)
```

The evaluation of the results is generated by:
```python
print('Precision@n on train data is', get_precn(y_train, y_train_score))
print('ROC on train data is', roc_auc_score(y_train, y_train_score))
print('Precision@n on test data is', get_precn(y_test, y_test_score))
print('ROC on test data is', roc_auc_score(y_test, y_test_score))
```

Here is a sample output:
```
Precision@n on train data is 0.78
ROC on train data is 0.9360
Precision@n on test data is 0.8780
ROC on test data is 0.9872
```
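Precision@n ranks all points by their outlier score and measures what fraction of the n highest-scoring points are true outliers, where n is the number of true outliers in the labels. A minimal sketch of one common definition follows; PyOD's `get_precn` may differ in details, and the function below is illustrative only:

```python
import numpy as np

def precision_at_n(y_true, scores):
    """Among the n highest-scoring points, the fraction that are true
    outliers, with n = number of true outliers in y_true.
    Illustrative sketch; PyOD's get_precn may differ in details."""
    y_true = np.asarray(y_true)
    n = int(y_true.sum())                    # number of true outliers
    top_n = np.argsort(scores)[::-1][:n]     # indices of the n largest scores
    return y_true[top_n].mean()

# toy example: 2 true outliers; the top-2 scores catch one of them
y_true = np.array([0, 0, 0, 1, 1])
scores = np.array([0.1, 0.3, 0.9, 0.8, 0.7])
print(precision_at_n(y_true, scores))  # → 0.5
```

Unlike ROC AUC, which evaluates the full ranking, precision@n only looks at the top of the ranking, which is often what matters when a fixed budget of candidate anomalies will be inspected.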
