LazyText


LazyText is inspired by the idea of lazypredict, a library which helps build a lot of basic models without much code. LazyText is for text what lazypredict is for numeric data.

  • Free Software: MIT licence

Installation

To install LazyText, run:

pip install lazytext

Usage

To use LazyText, import it in your project:

from lazytext.supervised import LazyTextPredict

Text Classification

This example classifies BBC News articles by category.

    import re

    import nltk
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split

    from lazytext.supervised import LazyTextPredict

    # Load the dataset
    df = pd.read_csv("tests/assets/bbc-text.csv")
    df.dropna(inplace=True)

    # Download models required for text cleaning
    nltk.download('punkt')
    nltk.download('stopwords')
    nltk.download('wordnet')
    nltk.download('omw-1.4')

    # Split the data into train set and test set
    df_train, df_test = train_test_split(df, test_size=0.3, random_state=13)

    # Tokenize the words
    df_train['clean_text'] = df_train['text'].apply(nltk.word_tokenize)
    df_test['clean_text'] = df_test['text'].apply(nltk.word_tokenize)

    # Remove stop words
    stop_words = set(nltk.corpus.stopwords.words("english"))
    df_train['text_clean'] = df_train['clean_text'].apply(lambda x: [item for item in x if item not in stop_words])
    df_test['text_clean'] = df_test['clean_text'].apply(lambda x: [item for item in x if item not in stop_words])

    # Remove numbers, punctuation and special characters (only keep words)
    regex = '[a-z]+'
    df_train['text_clean'] = df_train['text_clean'].apply(lambda x: [item for item in x if re.match(regex, item)])
    df_test['text_clean'] = df_test['text_clean'].apply(lambda x: [item for item in x if re.match(regex, item)])

    # Lemmatization
    lem = nltk.stem.wordnet.WordNetLemmatizer()
    df_train['text_clean'] = df_train['text_clean'].apply(lambda x: [lem.lemmatize(item, pos='v') for item in x])
    df_test['text_clean'] = df_test['text_clean'].apply(lambda x: [lem.lemmatize(item, pos='v') for item in x])

    # Join the words again to form sentences
    df_train["clean_text"] = df_train.text_clean.apply(lambda x: " ".join(x))
    df_test["clean_text"] = df_test.text_clean.apply(lambda x: " ".join(x))

    # Tfidf vectorization
    vectorizer = TfidfVectorizer()
    x_train = vectorizer.fit_transform(df_train.clean_text)
    x_test = vectorizer.transform(df_test.clean_text)
    y_train = df_train.category.tolist()
    y_test = df_test.category.tolist()

    lazy_text = LazyTextPredict(
        classification_type="multiclass",
    )
    models = lazy_text.fit(x_train, x_test, y_train, y_test)

Label Analysis

    ┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
    ┃ Classes       ┃ Weights            ┃
    ┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
    │ business      │ 0.8725490196078431 │
    │ sport         │ 1.1528497409326426 │
    │ politics      │ 1.0671462829736211 │
    │ entertainment │ 0.8708414872798435 │
    │ tech          │ 1.1097256857855362 │
    └───────────────┴────────────────────┘

Result Analysis

    ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
    ┃ Model                       ┃ Accuracy           ┃ Balanced Accuracy  ┃ F1 Score            ┃ Custom Metric Score ┃ Time Taken           ┃
    ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩
    │ AdaBoostClassifier          │ 0.7260479041916168 │ 0.717737172132769  │ 0.7248335989941609  │ NA                  │ 1.4244091510772705   │
    │ BaggingClassifier           │ 0.8817365269461078 │ 0.8796633962363677 │ 0.8814695332332374  │ NA                  │ 2.422576904296875    │
    │ BernoulliNB                 │ 0.9535928143712575 │ 0.9505929193425733 │ 0.9533647387436917  │ NA                  │ 0.015914201736450195 │
    │ CalibratedClassifierCV      │ 0.9760479041916168 │ 0.9760018220340847 │ 0.9755904096436046  │ NA                  │ 0.36926722526550293  │
    │ ComplementNB                │ 0.9760479041916168 │ 0.9752329192546583 │ 0.9754237510855159  │ NA                  │ 0.009947061538696289 │
    │ DecisionTreeClassifier      │ 0.8532934131736527 │ 0.8473956671194278 │ 0.8496464898940103  │ NA                  │ 0.34440088272094727  │
    │ DummyClassifier             │ 0.2155688622754491 │ 0.2                │ 0.07093596059113301 │ NA                  │ 0.005555868148803711 │
    │ ExtraTreeClassifier         │ 0.7275449101796407 │ 0.7253518459908658 │ 0.7255575847020816  │ NA                  │ 0.018934965133666992 │
    │ ExtraTreesClassifier        │ 0.9655688622754491 │ 0.9635363285903302 │ 0.9649837485086689  │ NA                  │ 1.2101161479949951   │
    │ GradientBoostingClassifier  │ 0.9550898203592815 │ 0.9526333887196529 │ 0.9539060578037555  │ NA                  │ 30.256237030029297   │
    │ KNeighborsClassifier        │ 0.938622754491018  │ 0.9370053693959814 │ 0.9367294513157219  │ NA                  │ 0.12071108818054199  │
    │ LinearSVC                   │ 0.9745508982035929 │ 0.974262691599302  │ 0.9740343976103922  │ NA                  │ 0.11713886260986328  │
    │ LogisticRegression          │ 0.968562874251497  │ 0.9668995859213251 │ 0.9678778814908909  │ NA                  │ 0.8916082382202148   │
    │ LogisticRegressionCV        │ 0.9715568862275449 │ 0.9708896757262861 │ 0.971147482393915   │ NA                  │ 37.82431483268738    │
    │ MLPClassifier               │ 0.9760479041916168 │ 0.9753381642512078 │ 0.9752912960666735  │ NA                  │ 30.700589656829834   │
    │ MultinomialNB               │ 0.9700598802395209 │ 0.9678795721187026 │ 0.9689200656860745  │ NA                  │ 0.01410818099975586  │
    │ NearestCentroid             │ 0.9520958083832335 │ 0.9499045135454718 │ 0.9515097876015481  │ NA                  │ 0.018617868423461914 │
    │ NuSVC                       │ 0.9670658682634731 │ 0.9656159420289855 │ 0.9669719954040374  │ NA                  │ 6.941549062728882    │
    │ PassiveAggressiveClassifier │ 0.9775449101796407 │ 0.9772388820754925 │ 0.9770812340935414  │ NA                  │ 0.05249309539794922  │
    │ Perceptron                  │ 0.9775449101796407 │ 0.9769254658385094 │ 0.9768161404324825  │ NA                  │ 0.030637741088867188 │
    │ RandomForestClassifier      │ 0.9625748502994012 │ 0.9605135542632081 │ 0.9624462948504477  │ NA                  │ 0.9921820163726807   │
    │ RidgeClassifier             │ 0.9775449101796407 │ 0.9769254658385093 │ 0.9769176825464448  │ NA                  │ 0.09582686424255371  │
    │ SGDClassifier               │ 0.9700598802395209 │ 0.9695007868373973 │ 0.969787370271274   │ NA                  │ 0.04686570167541504  │
    │ SVC                         │ 0.9715568862275449 │ 0.9703778467908902 │ 0.9713021262026043  │ NA                  │ 6.64256477355957     │
    └─────────────────────────────┴────────────────────┴────────────────────┴─────────────────────┴─────────────────────┴──────────────────────┘

The result for each estimator is stored in models, which is a list. Each entry also holds the trained estimator, which can be used for further analysis.

The confusion matrix and classification report are also included in each entry of models, if they are needed.

    print(models[0])

    {
        'name': 'AdaBoostClassifier',
        'accuracy': 0.7260479041916168,
        'balanced_accuracy': 0.717737172132769,
        'f1_score': 0.7248335989941609,
        'custom_metric_score': 'NA',
        'time': 1.829047679901123,
        'model': AdaBoostClassifier(),
        'confusion_matrix': array([
            [ 89,   5,  12,  35,   3],
            [  8,  58,   5,  44,   0],
            [  5,   2, 108,  10,   1],
            [  5,   7,   5, 138,   2],
            [ 25,   5,   1,   3,  92]]),
        'classification_report': """
                      precision    recall  f1-score   support

                   0       0.67      0.62      0.64       144
                   1       0.75      0.50      0.60       115
                   2       0.82      0.86      0.84       126
                   3       0.60      0.88      0.71       157
                   4       0.94      0.73      0.82       126

            accuracy                           0.73       668
           macro avg       0.76      0.72      0.72       668
        weighted avg       0.75      0.73      0.72       668
        """
    }
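
Since each entry in models carries its scores under keys such as 'name' and 'f1_score', the list can be ranked with a few lines of plain Python. The following is a minimal sketch: `rank_models` is a hypothetical helper, not part of LazyText, and `sample_models` is a trimmed stand-in that copies three rows from the results table above.

```python
# Hypothetical helper: not part of LazyText itself.
def rank_models(models, key="f1_score"):
    """Return result entries sorted from best to worst on the given metric."""
    # Skip entries whose metric is not numeric, e.g. custom_metric_score == "NA".
    scored = [m for m in models if isinstance(m.get(key), (int, float))]
    return sorted(scored, key=lambda m: m[key], reverse=True)


# Trimmed stand-in for the list returned by lazy_text.fit(...);
# real entries also carry 'model', 'confusion_matrix', and more.
sample_models = [
    {"name": "AdaBoostClassifier", "f1_score": 0.7248335989941609, "custom_metric_score": "NA"},
    {"name": "LinearSVC", "f1_score": 0.9740343976103922, "custom_metric_score": "NA"},
    {"name": "DummyClassifier", "f1_score": 0.07093596059113301, "custom_metric_score": "NA"},
]

ranked = rank_models(sample_models)
for entry in ranked:
    print(f"{entry['name']:<20} {entry['f1_score']:.4f}")
```

With the real list, `rank_models(models)[0]['model']` would give the trained estimator of the top-ranked entry, assuming the entries follow the layout printed above.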

Custom metrics

LazyText also supports a custom metric for evaluation, which can be set up as follows:

    from lazytext.supervised import LazyTextPredict


    # Custom metric: takes the true and predicted labels and returns a score
    def my_custom_metric(y_true, y_pred):
        score = ...  # do your stuff: replace with your own computation
        return score


    lazy_text = LazyTextPredict(custom_metric=my_custom_metric)
    lazy_text.fit(x_train, x_test, y_train, y_test)

If the signature of the custom metric function does not match the one given above, the custom metric is ignored even though it was provided.
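
As a concrete illustration of a function with the `(y_true, y_pred)` signature, here is a minimal sketch of a macro-averaged recall written in plain Python. The name `macro_recall` and the choice of metric are illustrative assumptions, and the sketch assumes a non-empty label list.

```python
from collections import defaultdict


def macro_recall(y_true, y_pred):
    """Average per-class recall; keeps the (y_true, y_pred) signature."""
    hits = defaultdict(int)    # correct predictions per true class
    totals = defaultdict(int)  # number of samples per true class
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    # Recall per class, then an unweighted mean over the classes seen in y_true.
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


score = macro_recall(
    ["sport", "sport", "tech", "tech"],
    ["sport", "tech", "tech", "tech"],
)
print(score)  # recall is 0.5 for sport and 1.0 for tech, so the mean is 0.75

# Assumed usage, following the pattern shown in this section:
# lazy_text = LazyTextPredict(custom_metric=macro_recall)
```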

Custom model parameters

LazyText also supports passing parameters to the estimators. Provide a list of dictionaries, each giving the name of an estimator and the parameters to apply to it, as shown below.

The following example changes the default parameters of the SVC classifier.

LazyText still fits all the models; in this case only the default parameters for SVC are changed.

    from lazytext.supervised import LazyTextPredict

    custom_parameters = [
        {
            "name": "SVC",
            "parameters": {
                "C": 0.5,
                "kernel": "poly",
                "degree": 5,
            },
        }
    ]

    lazy_text = LazyTextPredict(
        classification_type="multiclass",
        custom_parameters=custom_parameters,
    )
    lazy_text.fit(x_train, x_test, y_train, y_test)
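
When several estimators need overrides, writing the list of dictionaries by hand gets repetitive. The sketch below builds the same layout from a plain mapping. `build_custom_parameters` is a hypothetical helper, not part of LazyText, and it is an assumption that LazyText accepts more than one entry in the list (the example above only shows one).

```python
# Hypothetical helper: not part of LazyText itself.
def build_custom_parameters(overrides):
    """Turn {"SVC": {...}} into the list-of-dicts layout shown above."""
    return [
        {"name": name, "parameters": dict(params)}
        for name, params in overrides.items()
    ]


custom_parameters = build_custom_parameters(
    {
        "SVC": {"C": 0.5, "kernel": "poly", "degree": 5},
        # Assumed second entry, using standard scikit-learn parameter names.
        "LogisticRegression": {"C": 2.0, "max_iter": 500},
    }
)
print(custom_parameters)
```

The resulting list could then be passed as `custom_parameters=custom_parameters` in the same way as the hand-written one.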
