- The code has been run on Google Colab, which provides free GPU resources
-
Natural Language Processing
-
- IMDB(English Data)
-
-
SNLI(English Data)
-
WeBank Intelligent Customer Service (微众银行智能客服, Chinese Data)
-
-
Spoken Language Understanding
- ATIS(English Data)
-
-
Qingyun Corpus (青云语料, Chinese Data)
-
Python Inference
-
Java Inference
-
-
-
Multi-turn Dialogue Rewriting
-
WeChat AI Research Data (微信 AI 研发数据, Chinese Data)
-
Python Inference
-
Java Inference
-
-
-
- Facebook AI Research Data(English Data)
-
- bAbI (English Data)
-
-
Word Extraction
-
Word Segmentation
-
-
-
Knowledge Graph
-
- Movielens 1M(English Data)
└── finch/tensorflow2/text_classification/imdb
    │
    ├── data
    │   └── glove.840B.300d.txt     # pretrained embedding, download and put here
    │   └── make_data.ipynb         # step 1. make data and vocab: train.txt, test.txt, word.txt
    │   └── train.txt               # incomplete sample, format <label, text> separated by \t
    │   └── test.txt                # incomplete sample, format <label, text> separated by \t
    │   └── train_bt_part1.txt      # (back-translated) incomplete sample, format <label, text> separated by \t
    │
    ├── vocab
    │   └── word.txt                # incomplete sample, list of words in vocabulary
    │
    └── main
        └── attention_linear.ipynb  # step 2: train and evaluate model
        └── attention_conv.ipynb    # step 2: train and evaluate model
        └── fasttext_unigram.ipynb  # step 2: train and evaluate model
        └── fasttext_bigram.ipynb   # step 2: train and evaluate model
        └── sliced_rnn.ipynb        # step 2: train and evaluate model
        └── sliced_rnn_bt.ipynb     # step 2: train and evaluate model
-
Task: IMDB(English Data)
Training Data: 25000, Testing Data: 25000, Labels: 2
-
Model: TF-IDF + Logistic Regression
-
Model: FastText
-
Model: Feedforward Attention
-
Model: Sliced RNN
-
TensorFlow 2
-
<Notebook> Sliced LSTM + Back-Translation -> 91.7 % Testing Accuracy
-
<Notebook> Sliced LSTM + Back-Translation + Char Embedding -> 92.3 % Testing Accuracy
-
<Notebook> Sliced LSTM + Back-Translation + Char Embedding + Label Smoothing
-> 92.5 % Testing Accuracy
-
<Notebook> Sliced LSTM + Back-Translation + Char Embedding + Label Smoothing + Cyclical LR
-> 92.6 % Testing Accuracy
This result (without transfer learning) is higher than CoVe (with transfer learning)
-
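Two of the training tricks named in the results above, label smoothing and a cyclical learning rate, can be sketched in a few lines of plain Python. This is an illustrative sketch, not the notebooks' code: the function names `smooth_labels` and `triangular_lr`, the smoothing factor, and the learning-rate bounds are all assumptions chosen for the example, and the triangular schedule is only one common form of cyclical learning rate.

```python
import math


def smooth_labels(one_hot, epsilon=0.1):
    """Mix a one-hot target with the uniform distribution over classes.

    epsilon is a hypothetical smoothing factor, not a value taken from the notebooks.
    """
    num_classes = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / num_classes for y in one_hot]


def triangular_lr(step, base_lr=1e-4, max_lr=1e-3, step_size=2000):
    """Triangular cyclical learning rate.

    The rate rises linearly from base_lr to max_lr over step_size steps,
    then falls back to base_lr over the next step_size steps, and repeats.
    All default values are assumptions for illustration.
    """
    cycle = math.floor(1 + step / (2 * step_size))
    x = abs(step / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


if __name__ == "__main__":
    print(smooth_labels([0.0, 1.0]))          # softened binary target
    for s in (0, 1000, 2000, 3000, 4000):
        print(s, triangular_lr(s))            # one full cycle
```

Smoothing keeps the targets summing to one while discouraging over-confident predictions; the cyclical schedule periodically raises the learning rate instead of decaying it monotonically.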
└── finch/tensorflow2/text_matching/snli
    │
    ├── data
    │   └── glove.840B.300d.txt  # pretrained embedding, download and put here
    │   └── download_data.ipynb  # step 1. run this to download snli dataset
    │   └── make_data.ipynb      # step 2. run this to generate train.txt, test.txt, word.txt
    │   └── train.txt            # incomplete sample, format <label, text1, text2> separated by \t
    │   └── test.txt             # incomplete sample, format <label, text1, text2> separated by \t
    │
    ├── vocab
    │   └── word.txt             # incomplete sample, list of words in vocabulary
    │
    └── main
        └── dam.ipynb            # step 3. train and evaluate model
        └── esim.ipynb           # step 3. train and evaluate model
        └── ......
-
Task: SNLI(English Data)
Training Data: 550152, Testing Data: 10000, Labels: 3
-
TensorFlow 2
-
Model: DAM
-
<Notebook> DAM -> 85.3% Testing Accuracy
The accuracy of this implementation is higher than UCL MR Group's implementation (84.6%)
-
-
Model: Match Pyramid
-
<Notebook> Pyramid -> 87.1% Testing Accuracy
The accuracy of this model is 0.3 percentage points below ESIM; however, it is 1x faster than ESIM (about twice the speed)
-
-
Model: ESIM
-
<Notebook> ESIM -> 87.4% Testing Accuracy
The accuracy of this implementation is comparable to UCL MR Group's implementation (87.2%)
-
-
Model: RE2
-
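The SNLI files described above store one example per line as `<label, text1, text2>` separated by tabs. A minimal sketch of reading that format and building a word list might look as follows. The helper names `parse_line` and `build_vocab`, the whitespace tokenization, and the `<pad>` entry are assumptions for illustration; the actual preprocessing lives in `make_data.ipynb` and may differ.

```python
from collections import Counter


def parse_line(line):
    """Split one '<label>\\t<text1>\\t<text2>' line into a label and two token lists.

    Whitespace tokenization is an assumption made for this sketch.
    """
    label, text1, text2 = line.rstrip("\n").split("\t")
    return label, text1.split(), text2.split()


def build_vocab(lines, min_freq=1):
    """Collect words from both sentences, most frequent first.

    The leading '<pad>' token and the min_freq cutoff are hypothetical choices.
    """
    counter = Counter()
    for line in lines:
        _, tokens1, tokens2 = parse_line(line)
        counter.update(tokens1)
        counter.update(tokens2)
    return ["<pad>"] + [w for w, c in counter.most_common() if c >= min_freq]


if __name__ == "__main__":
    sample = [
        "entailment\ta dog runs\tan animal moves\n",
        "neutral\ta dog\ta cat\n",
    ]
    print(parse_line(sample[0]))
    print(build_vocab(sample, min_freq=2))
```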
└── finch/tensorflow2/text_matching/chinese
    │
    ├── data
    │   └── make_data.ipynb  # step 1. run this to generate char.txt and char.npy
    │   └── train.csv        # incomplete sample, format <text1, text2, label> separated by comma
    │   └── test.csv         # incomplete sample, format <text1, text2, label> separated by comma
    │
    ├── vocab
    │   └── cc.zh.300.vec    # pretrained embedding, download and put here
    │   └── char.txt         # incomplete sample, list of chinese characters
    │   └── char.npy         # saved pretrained embedding matrix for this task
    │
    └── main
        └── pyramid.ipynb    # step 2. train and evaluate model
        └── esim.ipynb       # step 2. train and evaluate model
        └── ......
-
Task: WeBank Intelligent Customer Service (微众银行智能客服, Chinese Data)
Training Data: 100000, Testing Data: 10000, Labels: 2
-
Model
-
TensorFlow 2
These results are higher than the results here and the result here
-
TensorFlow 1 + bert4keras
-
<Notebook> BERT -> 85.0% Testing Accuracy
Weights downloaded from here
-
-
-
Data: 2373 Lines of Book Titles(English Data)
-
Model: TF-IDF + LDA
-
PySpark
-
Sklearn + pyLDAvis
-
-
└── finch/tensorflow2/spoken_language_understanding/atis
    │
    ├── data
    │   └── glove.840B.300d.txt        # pretrained embedding, download and put here
    │   └── make_data.ipynb            # step 1. run this to generate vocab: word.txt, intent.txt, slot.txt
    │   └── atis.train.w-intent.iob    # incomplete sample, format <text, slot, intent>
    │   └── atis.test.w-intent.iob     # incomplete sample, format <text, slot, intent>
    │
    ├── vocab
    │   └── word.txt                   # list of words in vocabulary
    │   └── intent.txt                 # list of intents in vocabulary
    │   └── slot.txt                   # list of slots in vocabulary
    │
    └── main
        └── bigru.ipynb                # step 2. train and evaluate model
        └── bigru_self_attn.ipynb      # step 2. train and evaluate model
        └── transformer.ipynb          # step 2. train and evaluate model
        └── transformer_elu.ipynb      # step 2. train and evaluate model
-
Task: ATIS(English Data)
Training Data: 4978, Testing Data: 893
-
Model: Bi-directional RNN
-
TensorFlow 2
-
97.4% Intent Acc, 95.4% Slot Micro-F1 on Testing Data
-
<Notebook> Bi-GRU + Self-Attention
97.6% Intent Acc, 95.7% Slot Micro-F1 on Testing Data
-
-
-
Model: ELMo Embedding
-
TensorFlow 1
-
97.5% Intent Acc, 96.1% Slot Micro-F1 on Testing Data
-
-
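The ATIS results above report two numbers: intent accuracy and slot micro-F1. The sketch below shows one way these could be computed. It assumes a token-level micro-F1 that ignores the outside tag `O`; the notebooks may compute the score differently (for example at the entity level), so treat the function names and the exact counting rule as assumptions.

```python
def intent_accuracy(gold_intents, pred_intents):
    """Fraction of utterances whose predicted intent equals the gold intent."""
    correct = sum(g == p for g, p in zip(gold_intents, pred_intents))
    return correct / len(gold_intents)


def slot_micro_f1(gold_slots, pred_slots, outside="O"):
    """Token-level micro-F1 over slot tags, pooling counts across all tags.

    Assumption: tokens tagged 'O' in both gold and prediction are ignored.
    """
    tp = fp = fn = 0
    for gold_seq, pred_seq in zip(gold_slots, pred_slots):
        for g, p in zip(gold_seq, pred_seq):
            if p != outside and p == g:
                tp += 1
            else:
                if p != outside:
                    fp += 1  # predicted a slot that is wrong or absent in gold
                if g != outside:
                    fn += 1  # missed or mislabeled a gold slot
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = [["O", "B-city", "O"]]
    pred = [["O", "B-city", "B-date"]]
    print(intent_accuracy(["flight", "fare"], ["flight", "flight"]))
    print(slot_micro_f1(gold, pred))
```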
└── finch/tensorflow1/free_chat/chinese_qingyun
    │
    ├── data
    │   └── raw_data.csv                # raw data downloaded from external
    │   └── make_data.ipynb             # step 1. run this to generate vocab {char.txt} and data {train.txt & test.txt}
    │   └── train.txt                   # processed text file generated by {make_data.ipynb}
    │
    ├── vocab
    │   └── char.txt                    # list of chars in vocabulary for chinese
    │   └── cc.zh.300.vec               # fastText pretrained embedding downloaded from external
    │   └── char.npy                    # chinese characters and their embedding values (300 dim)
    │
    └── main
        └── lstm_seq2seq_train.ipynb    # step 2. train and evaluate model
        └── lstm_seq2seq_export.ipynb   # step 3. export model
        └── lstm_seq2seq_infer.ipynb    # step 4. model inference
        └── transformer_train.ipynb     # step 2. train and evaluate model
        └── transformer_export.ipynb    # step 3. export model
        └── transformer_infer.ipynb     # step 4. model inference
-
Task: Qingyun Corpus (青云语料, Chinese Data)
Training Data: 107687, Testing Data: 3350
-
Data
-
Model: RNN Seq2Seq + Attention
-
TensorFlow 1
-
LSTM + Attention + Beam Search -> 3.540 Testing Perplexity
-
-
Model Inference
-
-
Model: Transformer
-
TensorFlow 1 + texar
-
Transformer (6 Layers, 8 Heads) -> 3.540 Testing Perplexity
-
-
Model Inference
-
-
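The chat models above are compared by testing perplexity. As a reminder of what that number means, the sketch below computes perplexity as the exponential of the average negative log-likelihood per token. The function name and the input format (a flat list of natural-log token probabilities) are assumptions for illustration, not the notebooks' evaluation code.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(-mean log p(token)), using natural logarithms.

    token_log_probs: log-probabilities the model assigned to each gold token.
    """
    if not token_log_probs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


if __name__ == "__main__":
    # A model that gives every gold token probability 0.25 has perplexity of about 4.
    print(perplexity([math.log(0.25)] * 4))
```

A perplexity of 3.540 therefore means the model is, on average, about as uncertain as if it were choosing uniformly among roughly 3.5 tokens at each step.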
└── FreeChatInference
    │
    ├── data
    │   └── transformer_export/
    │   └── char.txt
    │   └── libtensorflow-1.14.0.jar
    │   └── tensorflow_jni.dll
    │
    └── src
        └── ModelInference.java

└── finch/tensorflow2/semantic_parsing/tree_slu
    │
    ├── data
    │   └── glove.840B.300d.txt           # pretrained embedding, download and put here
    │   └── make_data.ipynb               # step 1. run this to generate vocab: word.txt, intent.txt, slot.txt
    │   └── train.tsv                     # incomplete sample, format <text, tokenized_text, tree>
    │   └── test.tsv                      # incomplete sample, format <text, tokenized_text, tree>
    │
    ├── vocab
    │   └── source.txt                    # list of words in vocabulary for source (of seq2seq)
    │   └── target.txt                    # list of words in vocabulary for target (of seq2seq)
    │
    └── main
        └── lstm_seq2seq_tf_addons.ipynb  # step 2. train and evaluate model
        └── ......
-
Task: Semantic Parsing for Task Oriented Dialog(English Data)
Training Data: 31279, Testing Data: 9042
-
Model: RNN Seq2Seq + Attention
-
TensorFlow 2
-
<Notebook> LSTM + Attention + Beam Search ->
72.4% Exact Match Accuracy on Testing Data
-
<Notebook> LSTM + Attention + Beam Search + Cyclical LR + Label Smoothing ->
74.1% Exact Match Accuracy on Testing Data
-
-
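Exact match accuracy, the metric reported for the semantic parsing models above, counts a prediction as correct only when the whole predicted tree equals the reference. A minimal sketch, assuming trees are compared as whitespace-tokenized strings (the function name and that normalization are assumptions, and the bracketed strings below are made-up examples):

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions whose token sequence equals the reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one example")
    matches = sum(p.split() == r.split() for p, r in zip(predictions, references))
    return matches / len(references)


if __name__ == "__main__":
    refs = ["[IN:GET_EVENT concerts [SL:DATE_TIME tonight ] ]", "[IN:GET_INFO traffic ]"]
    preds = ["[IN:GET_EVENT concerts [SL:DATE_TIME tonight ] ]", "[IN:GET_INFO weather ]"]
    print(exact_match_accuracy(preds, refs))
```

Because a single wrong token makes the whole example count as a miss, this metric is much stricter than token-level accuracy.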
└── finch/tensorflow2/knowledge_graph_completion/wn18
    │
    ├── data
    │   └── download_data.ipynb    # step 1. run this to download wn18 dataset
    │   └── make_data.ipynb        # step 2. run this to generate vocabulary: entity.txt, relation.txt
    │   └── wn18                   # wn18 folder (will be auto created by download_data.ipynb)
    │       └── train.txt          # incomplete sample, format <entity1, relation, entity2> separated by \t
    │       └── valid.txt          # incomplete sample, format <entity1, relation, entity2> separated by \t
    │       └── test.txt           # incomplete sample, format <entity1, relation, entity2> separated by \t
    │
    ├── vocab
    │   └── entity.txt             # incomplete sample, list of entities in vocabulary
    │   └── relation.txt           # incomplete sample, list of relations in vocabulary
    │
    └── main
        └── distmult_1-N.ipynb     # step 3. train and evaluate model
-
Task: WN18
Training Data: 141442, Testing Data: 5000
-
We use 1-N fast evaluation to greatly accelerate the evaluation process
MRR: Mean Reciprocal Rank
-
Model: DistMult
-
TensorFlow 2
-
-
Model: TuckER
-
TensorFlow 2
-
-
Model: ComplEx
-
TensorFlow 2
-
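The 1-N evaluation and MRR metric mentioned for WN18 can be illustrated with a small pure-Python sketch. In 1-N scoring, one (subject, relation) query is scored against every entity in a single pass instead of one triple at a time. The example uses the DistMult scoring function; the function names, the toy embeddings, and the unfiltered ranking are assumptions for illustration and not the notebooks' implementation.

```python
def distmult_scores_1_to_n(subject_vec, relation_vec, entity_matrix):
    """Score one (subject, relation) query against all candidate objects at once.

    DistMult: score(s, r, o) = sum_i s_i * r_i * o_i.
    """
    query = [s * r for s, r in zip(subject_vec, relation_vec)]
    return [sum(q * e for q, e in zip(query, entity)) for entity in entity_matrix]


def reciprocal_rank(scores, target_index):
    """1 / rank of the target; rank is 1 plus the number of strictly higher scores."""
    rank = 1 + sum(1 for s in scores if s > scores[target_index])
    return 1.0 / rank


def mean_reciprocal_rank(queries, entity_matrix):
    """Average reciprocal rank over (subject_vec, relation_vec, target_index) queries.

    Unfiltered setting: other known true objects are not removed from the ranking.
    """
    ranks = [
        reciprocal_rank(distmult_scores_1_to_n(s, r, entity_matrix), t)
        for s, r, t in queries
    ]
    return sum(ranks) / len(ranks)


if __name__ == "__main__":
    entities = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy entity embeddings
    queries = [([1.0, 0.0], [1.0, 1.0], 0), ([1.0, 0.0], [1.0, 1.0], 1)]
    print(mean_reciprocal_rank(queries, entities))
```

With real embeddings the inner loop becomes a single matrix product between the query vector and the entity embedding matrix, which is where the speed-up comes from.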
-
Data Scraping
-
SPARQL
-
Neo4j + Cypher
└── finch/tensorflow1/question_answering/babi
    │
    ├── data
    │   └── make_data.ipynb                      # step 1. run this to generate vocabulary: word.txt
    │   └── qa5_three-arg-relations_train.txt    # one complete example of babi dataset
    │   └── qa5_three-arg-relations_test.txt     # one complete example of babi dataset
    │
    ├── vocab
    │   └── word.txt                             # complete list of words in vocabulary
    │
    └── main
        └── dmn_train.ipynb
        └── dmn_serve.ipynb
        └── attn_gru_cell.py
-
Task: bAbI(English Data)
-
Word Extraction
-
Chinese
-
-
Word Segmentation
-
Chinese
-
Custom TensorFlow Op added by applenob
-
-
└── finch/tensorflow1/recommender/movielens
    │
    ├── data
    │   └── make_data.ipynb    # run this to generate vocabulary
    │
    ├── vocab
    │   └── user_job.txt
    │   └── user_id.txt
    │   └── user_gender.txt
    │   └── user_age.txt
    │   └── movie_types.txt
    │   └── movie_title.txt
    │   └── movie_id.txt
    │
    └── main
        └── dnn_softmax.ipynb
        └── ......
-
Task: Movielens 1M(English Data)
Training Data: 900228, Testing Data: 99981, Users: 6000, Movies: 4000, Rating: 1-5
-
Model: Fusion
-
TensorFlow 1
MAE: Mean Absolute Error
-
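For reference, the MAE metric used for the Movielens rating predictions is the average absolute difference between predicted and true ratings. A minimal sketch (the function name and sample ratings are illustrative assumptions):

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted| over all rating pairs."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    true_ratings = [5, 3, 1, 4]          # ratings on the 1-5 scale
    predicted = [4.5, 3.0, 2.0, 3.5]
    print(mean_absolute_error(true_ratings, predicted))
```

On a 1-5 scale, an MAE of 0.5 means predictions are off by half a star on average.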
└── finch/tensorflow1/multi_turn_rewrite/chinese/
    │
    ├── data
    │   └── make_data.ipynb               # run this to generate vocab, split train & test data, make pretrained embedding
    │   └── corpus.txt                    # original data downloaded from external
    │   └── train_pos.txt                 # processed positive training data after {make_data.ipynb}
    │   └── train_neg.txt                 # processed negative training data after {make_data.ipynb}
    │   └── test_pos.txt                  # processed positive testing data after {make_data.ipynb}
    │   └── test_neg.txt                  # processed negative testing data after {make_data.ipynb}
    │
    ├── vocab
    │   └── cc.zh.300.vec                 # fastText pretrained embedding downloaded from external
    │   └── char.npy                      # chinese characters and their embedding values (300 dim)
    │   └── char.txt                      # list of chinese characters used in this project
    │
    └── main
        └── baseline_lstm_train.ipynb
        └── baseline_lstm_export.ipynb
        └── baseline_lstm_predict.ipynb
-
Task: Multi-turn Dialogue Rewriting(Chinese Data)
Training Data (Positive): 18986, Testing Data (Positive): 1008
Training Data = 2 * 18986 because of 1:1 Negative Sampling
-
Model: RNN Seq2Seq + Attention + Dynamic Memory
-
TensorFlow 1
-
<Notebook> LSTM Seq2Seq + Attention + Memory + Beam Search
-> BLEU-1: 94.6, BLEU-2: 89.1, BLEU-4: 78.5, EM: 56.2%
This result (without BERT) is comparable to the result here with BERT
-
-
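The BLEU-N scores reported above measure n-gram overlap between a rewritten utterance and its reference. The sketch below is a simplified sentence-level BLEU with a single reference, clipped n-gram precision, a brevity penalty, and no smoothing. It is an assumption-laden illustration: the function names are hypothetical, characters are used as tokens, and the notebook's actual scoring (likely corpus-level, via a library) may give different numbers.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all contiguous n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU-max_n for one candidate against one reference.

    Returns 0.0 if any n-gram order has no overlap (no smoothing is applied).
    """
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        total = sum(cand.values())
        if total == 0:
            return 0.0
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1.0 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    # Characters act as tokens, as is common for Chinese text.
    print(sentence_bleu("abcd", "abcd", max_n=4))
    print(sentence_bleu("abcd", "abce", max_n=1))
```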
└── MultiDialogInference
    │
    ├── data
    │   └── baseline_lstm_export/
    │   └── char.txt
    │   └── libtensorflow-1.14.0.jar
    │   └── tensorflow_jni.dll
    │
    └── src
        └── ModelInference.java
-
Rule-based System






