Python Outlier Detection (PyOD)

Note: this project is still under development (as of Feb 7, 2018).


Quick Introduction

PyOD is a Python-based toolkit for identifying anomalies in data with unsupervised and supervised approaches. The toolkit consists of two major functionalities:

  • Individual Algorithms
    1. Local Outlier Factor (wraps the scikit-learn implementation)
    2. Isolation Forest (wraps the scikit-learn implementation)
    3. One-Class Support Vector Machines (wraps the scikit-learn implementation)
    4. KNN Outlier Detection (implemented in PyOD)
    5. Average KNN Outlier Detection (implemented in PyOD)
    6. Median KNN Outlier Detection (implemented in PyOD)
    7. Global-Local Outlier Score from Hierarchies (implemented in PyOD)
    8. More to add
  • Ensemble Framework
    1. Feature bagging
    2. More to add
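
To illustrate the KNN family listed above, here is a minimal sketch of KNN-based outlier scoring built on scikit-learn's `NearestNeighbors` (this is not PyOD's actual implementation; the function name and `method` parameter are illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_outlier_scores(X, k=5, method="largest"):
    """Score each point by its distance to its k nearest neighbors.

    method: "largest" uses the distance to the k-th neighbor (KNN),
            "mean" the average distance (Average KNN),
            "median" the median distance (Median KNN).
    Higher scores indicate more anomalous points.
    """
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nn.kneighbors(X)
    dist = dist[:, 1:]  # drop the self-distance (always 0)
    if method == "largest":
        return dist[:, -1]
    if method == "mean":
        return dist.mean(axis=1)
    return np.median(dist, axis=1)
```

Points far from any dense neighborhood have large neighbor distances and therefore large scores under all three variants.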

Before using the toolkit, please note that it is intended for quick data exploration only. Treat its raw output with caution, as fine-tuning is usually needed to produce meaningful results. I recommend using it as a first-step exploration tool, then building on or refining the model to get more accurate results.


Quick Start

"example.py" is an example to demo the basic API of PyOD. It first generate some sample data to run. normal data is generated by a 2-d gaussian distribution, and outliers are generated by a 2-d uniform distribution.

```python
# percentage of outliers
contamination = 0.1
n_train = 1000
n_test = 500

# generate sample data
X_train, y_train, c_train, X_test, y_test, c_test = \
    generate_data(n=n_train, contamination=contamination, n_test=n_test)
```
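
The sampling scheme described above can be sketched as follows (a minimal stand-in, not PyOD's `generate_data`; the function name, uniform range, and label convention are assumptions):

```python
import numpy as np

def make_sample_data(n=1000, contamination=0.1, random_state=42):
    """Sketch of the sampling scheme: inliers from a 2-d Gaussian,
    outliers from a 2-d uniform distribution."""
    rng = np.random.RandomState(random_state)
    n_out = int(n * contamination)
    n_in = n - n_out
    X_in = rng.randn(n_in, 2)                # 2-d Gaussian inliers
    X_out = rng.uniform(-6, 6, (n_out, 2))   # 2-d uniform outliers
    X = np.vstack([X_in, X_out])
    y = np.hstack([np.zeros(n_in), np.ones(n_out)])  # 1 = outlier
    return X, y
```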

Then it initializes the classifier, fits the model, and makes predictions.

```python
# train a HBOS detector
clf = Hbos(contamination=0.1)
clf.fit(X_train)

# get the outlier scores of the training data
y_train_pred = clf.y_pred
y_train_score = clf.decision_scores

# make predictions on the test data
y_test_pred = clf.predict(X_test)
y_test_score = clf.decision_function(X_test)
```
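
For intuition about what the HBOS detector computes, here is a minimal sketch of histogram-based outlier scoring: build a histogram per feature and sum the negative log densities of each point's bins. This is an illustrative sketch of the general HBOS idea, not PyOD's `Hbos` class:

```python
import numpy as np

def hbos_scores(X, n_bins=10):
    """Histogram-based outlier scores (HBOS-style sketch).

    For each feature, estimate density with a histogram; a point's score
    is the sum over features of -log(density of its bin).
    Higher scores indicate more anomalous points.
    """
    X = np.asarray(X, dtype=float)
    scores = np.zeros(X.shape[0])
    for j in range(X.shape[1]):
        hist, edges = np.histogram(X[:, j], bins=n_bins, density=True)
        # map each value to its bin index (0 .. n_bins - 1)
        idx = np.clip(np.digitize(X[:, j], edges[1:-1]), 0, n_bins - 1)
        density = np.maximum(hist[idx], 1e-12)  # guard against log(0)
        scores += -np.log(density)
    return scores
```

Points falling in sparse bins accumulate large scores, which is why HBOS is fast: it only needs one histogram pass per feature.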

The evaluation metrics are computed by:

```python
print('Precision@n on train data is', get_precn(y_train, y_train_score))
print('ROC on train data is', roc_auc_score(y_train, y_train_score))
print('Precision@n on test data is', get_precn(y_test, y_test_score))
print('ROC on test data is', roc_auc_score(y_test, y_test_score))
```
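
ROC AUC comes from scikit-learn; precision@n ranks points by score and measures precision among the top n, where n is the number of true outliers. A sketch of this metric (the assumed behaviour of `get_precn`, not its actual code):

```python
import numpy as np

def precision_at_n(y_true, scores):
    """Precision among the n highest-scored points,
    where n = number of true outliers (labels: 1 = outlier, 0 = normal)."""
    y_true = np.asarray(y_true)
    n = int(y_true.sum())
    top = np.argsort(scores)[-n:]  # indices of the n largest scores
    return y_true[top].mean()
```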

Here is a sample output:

```
Precision@n on train data is 0.78
ROC on train data is 0.9360
Precision@n on test data is 0.8780
ROC on test data is 0.9872
```

To check the result of the classification visually, see the sample figure in the repository.
