This is the code repository for our KDD'20 paper STEAM: Self-Supervised Taxonomy Expansion with Mini-Paths.
- Python >= 3.6
- PyTorch >= 1.2
- tqdm
- SciPy
- NumPy
- transformers
Use `run.sh` from `model/` to run the code. Some key parameters:
```json
{
    "epochs": 20,          // number of training epochs
    "lr": 1e-3,            // learning rate
    "cudaid": 1,           // id of the GPU
    "dropout": 0.4,        // dropout rate
    "hidden": 200,         // hidden dimension size
    "weight_decay": 5e-4,  // L2 regularization
    "fp": "../data_environment_eurovoc_en_0.2",  // file path
    "path_len": 3,         // length of mini-path
    "lambda1": 0.1,        // weight of loss 1 (regularization for base classifiers)
    "lambda2": 0.1,        // weight of loss 2 (regularization for consistency)
    "taxi_feature": 1,     // whether to load lexico-syntactic embeddings
    "load_gcn": 1          // whether to load GNN-propagated term embeddings
}
```

```
├── model/ - models, losses, and metrics
│   ├── model_fuse.py    // main modules of STEAM
│   ├── layers_path.py   // neural layers of STEAM
│   ├── run_fuse.sh      // script to run the code
│   ├── utils_path.py    // utility functions: loading train/test data and sampling mini-paths
│   └── test_fuse.py     // script for testing the model
├── data_science_wordnet_en_0.2/ - folder for the science WordNet dataset
│   ├── score_gnn.txt    - scores for PGAT-propagated embeddings
│   ├── LD.txt, gene_diff.txt, nfd_norm.txt, LCS.txt, Contains.txt, Suffix.txt, Ends.txt
│   │                    - value matrices of term pairs for 7 lexico-syntactic patterns
│   ├── paths.json       - dependency path information for all possible paths
│   ├── paths_index.json - index information for all dependency paths
│   ├── taxo_path.json   - all paths from the training set of the seed taxonomy
│   └── taxo_node_info.json - all term information in the seed taxonomy
├── data_environment_eurovoc_en_0.2/ - folder for the environment EuroVoc dataset
│   └── structure similar to the above
└── log_results/ - stores results
```

The way to build the features for your own corpus is described as follows:
- For GNN-propagated embeddings:
  - Use `model/bert_emb_extractor.py` to obtain the BERT embeddings of terms.
  - Please follow the link of the paper TaxoExpan to generate the GNN-propagated embeddings for terms.
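Since BERT splits a multi-word term into several subword tokens, a single term vector is typically obtained by pooling them. Below is a minimal sketch of that pooling step in plain PyTorch; `pool_term_embedding` is a hypothetical helper for illustration, not necessarily what `model/bert_emb_extractor.py` does.

```python
import torch

def pool_term_embedding(subword_embs: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool subword embeddings into a single term vector.

    subword_embs: (seq_len, hidden) BERT outputs for one term
    mask:         (seq_len,) 1 for subwords belonging to the term, 0 elsewhere
    """
    mask = mask.unsqueeze(-1).float()          # (seq_len, 1)
    summed = (subword_embs * mask).sum(dim=0)  # (hidden,)
    count = mask.sum().clamp(min=1.0)          # avoid division by zero
    return summed / count

# toy example: 4 subwords, hidden size 3; only the middle two belong to the term
embs = torch.tensor([[1., 1., 1.],
                     [2., 0., 2.],
                     [4., 2., 0.],
                     [9., 9., 9.]])
mask = torch.tensor([0, 1, 1, 0])
term_vec = pool_term_embedding(embs, mask)     # mean of rows 1 and 2 -> [3., 1., 1.]
```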
- For text corpus / contextual features:
- To build everything from scratch, first download corpora such as Wikipedia, UMBC, and the 1 Billion Word Language Model Benchmark.
- To preprocess the corpus, generate a vocabulary file and run the scripts based on LexNET. Please refer to the instructions here; this process may take several hours to finish.
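As a rough illustration of the vocabulary step, a frequency-thresholded vocabulary can be built from a tokenized corpus as below. This is a sketch only: LexNET's actual scripts and thresholds differ, and `build_vocab` / `min_count` are assumed names.

```python
from collections import Counter

def build_vocab(corpus_lines, min_count=5):
    """Count whitespace-separated tokens and keep those above a frequency threshold."""
    counts = Counter()
    for line in corpus_lines:
        counts.update(line.lower().split())
    # most_common() keeps the result sorted by descending frequency
    return [w for w, c in counts.most_common() if c >= min_count]

corpus = ["Natural language processing", "language models process language"]
vocab = build_vocab(corpus, min_count=2)   # -> ["language"]
```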
- For Lexico-Syntactic Features:
  - Use `model/gen_lexico_features.py` to generate linguistic patterns based on the surface names of terms.
  - For term frequency patterns from TAXI, please refer to the instructions here.
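For intuition, a few of the surface-name patterns listed above (the `Contains.txt`, `Ends.txt`, and `LCS.txt` files) can be sketched as follows. The exact definitions in `gen_lexico_features.py` may differ; in particular, the normalization of the longest-common-substring score here is an assumption.

```python
from difflib import SequenceMatcher

def surface_features(hypo: str, hyper: str) -> dict:
    """A few surface-name patterns between a candidate hyponym and hypernym.

    Assumed meanings: Contains -> hypernym string appears inside the hyponym,
    Ends -> hyponym ends with the hypernym (e.g. "machine learning"/"learning"),
    LCS -> longest common substring length, normalized by the longer term.
    """
    h, p = hypo.lower(), hyper.lower()
    match = SequenceMatcher(None, h, p).find_longest_match(0, len(h), 0, len(p))
    return {
        "Contains": float(p in h),
        "Ends": float(h.endswith(p)),
        "LCS": match.size / max(len(h), len(p)),
    }

feats = surface_features("machine learning", "learning")
# "machine learning" both contains and ends with "learning"
```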
- Support more TensorBoard functions
- Use a fixed random seed
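Fixing the random seed usually means seeding every RNG in play. A common pattern (not necessarily this repository's exact code) is:

```python
import random

import numpy as np
import torch

def set_seed(seed: int = 42):
    """Fix all relevant RNGs so runs are reproducible."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed(42)
a = torch.rand(3)
set_seed(42)
b = torch.rand(3)   # identical to a, since the seed was reset
```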
If you find this paper useful for your research, please cite it in your publication:
```bibtex
@inproceedings{yu2020steam,
  title     = {STEAM: Self-Supervised Taxonomy Expansion with Mini-Paths},
  author    = {Yu, Yue and Li, Yinghao and Shen, Jiaming and Feng, Hao and Sun, Jimeng and Zhang, Chao},
  booktitle = {Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
  publisher = {ACM},
  year      = {2020}
}
```