yueyu1030/STEAM

STEAM

This is the code repository for our KDD'20 paper STEAM: Self-Supervised Taxonomy Expansion with Mini-Paths.

Requirements

  • Python >= 3.6
  • PyTorch >= 1.2
  • tqdm
  • Scipy
  • Numpy
  • transformers

Usage

Use run.sh from model/ to run the code. Some key parameters:

    {
      "epochs": 20,                                 // number of training epochs
      "lr": 1e-3,                                   // learning rate
      "cudaid": 1,                                  // GPU id
      "dropout": 0.4,                               // dropout rate
      "hidden": 200,                                // hidden layer size
      "weight_decay": 5e-4,                         // L2 regularization
      "fp": "../data_environment_eurovoc_en_0.2",   // file path
      "path_len": 3,                                // length of mini-path
      "lambda1": 0.1,                               // weight of loss 1 (regularization for base classifiers)
      "lambda2": 0.1,                               // weight of loss 2 (regularization for consistency)
      "taxi_feature": 1,                            // whether to load lexico-syntactic embeddings
      "load_gcn": 1                                 // whether to load GNN-propagated term embeddings
    }
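The parameter block above uses `//` comments, which strict JSON parsers reject. If you keep your settings in a file of that shape, the following is a minimal sketch of one way to load it in Python. The helper name `load_commented_json` is hypothetical and not part of this repository, and the sketch assumes no string value contains ` //` preceded by whitespace.

```python
import json
import re


def load_commented_json(text):
    """Parse JSON-like text whose lines may end in `// ...` comments.

    Hypothetical helper, not part of the STEAM code base.
    """
    cleaned_lines = []
    for line in text.splitlines():
        # Remove `//` comments that start a line or follow whitespace.
        # Limitation: a string value containing " //" would be truncated.
        cleaned_lines.append(re.sub(r"(^|\s)//.*$", "", line))
    return json.loads("\n".join(cleaned_lines))


example = """
{
  "epochs": 20,      // number of training epochs
  "lr": 1e-3,        // learning rate
  "path_len": 3      // length of mini-path
}
"""

config = load_commented_json(example)
print(config["epochs"], config["lr"], config["path_len"])
```

How run.sh actually consumes these values is not shown here, so treat the parsed dictionary only as a convenient way to keep experiment settings in one place.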

Folder Structure

    ├── model/ - models, losses, and metrics
    │   ├── model_fuse.py    // main modules of STEAM
    │   ├── layers_path.py   // neural layers of STEAM
    │   ├── run_fuse.sh      // script to run the code
    │   ├── utils_path.py    // utility functions: loading training data, loading test data, and sampling mini-paths
    │   └── test_fuse.py     // script for testing the model
    ├── data_science_wordnet_en_0.2/ - folder for science wordnet
    │   ├── score_gnn.txt - scores for PGAT-propagated embeddings
    │   ├── LD.txt, gene_diff.txt, nfd_norm.txt, LCS.txt, Contains.txt, Suffix.txt, Ends.txt - value matrices of term pairs for 7 lexico-syntactic patterns
    │   ├── paths.json - dependency path information for all possible paths
    │   ├── paths_index.json - the index information for all dependency paths
    │   ├── taxo_path.json - all the paths from the training set of the seed taxonomy
    │   └── taxo_node_info.json - all the term information in the seed taxonomy
    ├── data_environment_eurovoc_en_0.2/ - folder for the environment (EuroVoc) dataset
    │   └── structure similar to the one above
    └── log_results/ - stores results

Processing Text Data on Your Own

The steps for processing your own corpus are as follows:

  • For GNN-propagated embeddings:
    • Use model/bert_emb_extractor.py to obtain the BERT embeddings of terms.
    • Then follow the link of the TaxoExpan paper to generate the GNN-propagated embeddings for terms.
  • For text corpus / contextual features:
  • For Lexico-Syntactic Features:
    • Use model/gen_lexico_features.py to generate linguistic patterns based on the surface names of terms.
    • For term frequency patterns from TAXI, please refer to the instructions here.
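To give a rough idea of what surface-name patterns between two terms can look like, here is a small illustrative sketch. It is not the code in model/gen_lexico_features.py: the function names are hypothetical, and reading "LD" as Levenshtein distance and "LCS" as longest common substring is an assumption based only on the file names listed in the folder structure.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def longest_common_substring(a, b):
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            length = prev[j - 1] + 1 if ca == cb else 0
            curr.append(length)
            best = max(best, length)
        prev = curr
    return best


def surface_features(child, parent):
    """Hypothetical surface-name features for a (child, parent) term pair."""
    child, parent = child.lower(), parent.lower()
    return {
        "contains": int(parent in child),          # parent appears inside child
        "ends_with": int(child.endswith(parent)),  # child ends with parent
        "edit_distance": levenshtein(child, parent),
        "lcs_length": longest_common_substring(child, parent),
    }


features = surface_features("organic chemistry", "chemistry")
print(features)
```

Features like these are cheap to compute for every candidate pair, which is why they are typically stored as precomputed term-pair matrices rather than recomputed during training.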

TODOs

  • Support more tensorboard functions
  • Use a fixed random seed
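For the fixed-seed item, one common approach is sketched below. It is not yet part of this repository; the `set_seed` helper is hypothetical, and NumPy and PyTorch are imported lazily so that the snippet also runs where they are not installed.

```python
import random


def set_seed(seed):
    """Seed the Python, NumPy, and PyTorch random number generators.

    Hypothetical helper; libraries that are not installed are skipped.
    """
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            # Seed every visible GPU as well.
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass


set_seed(42)
first = random.random()
set_seed(42)
second = random.random()
print(first == second)
```

Seeding alone may not make GPU runs fully deterministic, since some CUDA kernels are non-deterministic by default.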

Acknowledgements

If you find this paper useful for your research, please cite the following paper in your publication:

    @inproceedings{yu2020steam,
      title = {STEAM: Self-Supervised Taxonomy Expansion with Mini-Paths},
      author = {Yu, Yue and Li, Yinghao and Shen, Jiaming and Feng, Hao and Sun, Jimeng and Zhang, Chao},
      booktitle = {Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
      publisher = {ACM},
      year = {2020}
    }
