
TakeLab Podium

Home of the TakeLab Podium project. Podium is a framework-agnostic Python natural language processing library that standardizes data loading and preprocessing as well as model training and selection. Our goal is to accelerate the development of NLP models, whichever aspect of the library users decide to adopt.

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

To build this project you need Python installed (the test setup in this README uses Python 3.6), along with the dependencies listed in requirements.txt.

We also recommend using a virtual environment:
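One way to set this up is with the standard-library venv module (an alternative to the virtualenv tool used later in this README; the environment name env is just a convention):

```shell
# Create an isolated environment in the ./env directory
python3 -m venv env
# Activate it (on Windows: env\Scripts\activate)
source env/bin/activate
```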

Installing from source

To install Podium from source, run the following in your terminal:

  1. Clone the repository: git clone git@github.com:mttk/podium.git && cd podium
  2. Install requirements: pip install -r requirements.txt
  3. Install podium: python setup.py install

Installing package from pip/wheel

Coming soon!

Usage examples

For detailed usage examples see podium/examples

Loading datasets

Use some of our pre-defined datasets:

>>> from podium.datasets import SST
>>> sst_train, sst_test, sst_dev = SST.get_dataset_splits()
>>> print(sst_train)
SST[Size: 6920, Fields: ['text', 'label']]
>>> print(sst_train[222]) # A short example
Example[label: ('positive', None); text: (None, ['A', 'slick', ',', 'engrossing', 'melodrama', '.'])]

Load your own dataset from a standardized format (csv, tsv or jsonl):

>>> from podium.datasets import TabularDataset
>>> from podium.storage import Vocab, Field, LabelField
>>> fields = {'premise':    Field('premise', vocab=Vocab()),
...           'hypothesis': Field('hypothesis', vocab=Vocab()),
...           'label':      LabelField('label')}
>>> dataset = TabularDataset('my_dataset.csv', format='csv', fields=fields)
>>> print(dataset)
TabularDataset[Size: 1, Fields: ['premise', 'hypothesis', 'label']]

Or define your own Dataset subclass (tutorial coming soon)

Define your preprocessing

We wrap dataset pre-processing in customizable Field classes. Each Field has an optional Vocab instance which automatically handles token-to-index conversion.

>>> from podium.storage import Vocab, Field, LabelField
>>> vocab = Vocab(max_size=5000, min_freq=2)
>>> text = Field(name='text', vocab=vocab)
>>> label = LabelField(name='label')
>>> fields = {'text': text, 'label': label}
>>> sst_train, sst_test, sst_dev = SST.get_dataset_splits(fields=fields)
>>> print(vocab)
Vocab[finalized: True, size: 5000]

Each Field gives the user full flexibility to modify the data in multiple stages:

  • Prior to tokenization (by using pre-tokenization hooks)
  • During tokenization (by using your own tokenizer)
  • Post tokenization (by using post-tokenization hooks)

You can also completely disregard our preprocessing and define your own:

  • Set your custom_numericalize
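A custom_numericalize is a callable that maps tokens to integer ids, as the BERT example further below shows with convert_tokens_to_ids. As a minimal sketch (the mapping here is hypothetical, not part of Podium's API):

```python
# Hypothetical token-to-id table; in practice this comes from whatever
# numericalization scheme you use (e.g. a pretrained tokenizer).
token_to_id = {"<unk>": 0, "a": 1, "slick": 2, "melodrama": 3}

def my_numericalize(token):
    # Map a single token to its id, falling back to <unk> for
    # out-of-vocabulary tokens.
    return token_to_id.get(token, token_to_id["<unk>"])
```

Such a callable can then be passed to a Field as custom_numericalize=my_numericalize, in which case the Field does not need a Vocab.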

You could decide to lowercase all the characters and filter out all non-alphanumeric tokens:

>>> def lowercase(raw):
...     return raw.lower()
>>> def filter_alnum(raw, tokenized):
...     filtered_tokens = [token for token in tokenized
...                        if any(char.isalnum() for char in token)]
...     return raw, filtered_tokens
>>> text.add_pretokenize_hook(lowercase)
>>> text.add_posttokenize_hook(filter_alnum)
>>> # ...
>>> print(sst_train[222])
Example[label: ('positive', None); text: (None, ['a', 'slick', 'engrossing', 'melodrama'])]

Pre-tokenization hooks do not see the tokenized data; they are applied to (and modify) only the raw data. Post-tokenization hooks have access to the tokenized data and can modify both the raw and the tokenized data.
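The data flow described above can be sketched in plain Python (this illustrates how the two hook stages compose, not Podium's internal implementation):

```python
def preprocess(raw, pretokenize_hooks, tokenizer, posttokenize_hooks):
    # Stage 1: pre-tokenization hooks see and modify only the raw text.
    for hook in pretokenize_hooks:
        raw = hook(raw)
    # Stage 2: the tokenizer runs on the (possibly modified) raw text.
    tokenized = tokenizer(raw)
    # Stage 3: post-tokenization hooks see both raw and tokenized data.
    for hook in posttokenize_hooks:
        raw, tokenized = hook(raw, tokenized)
    return raw, tokenized

def strip_punct(raw, tokenized):
    # Drop leading/trailing punctuation and remove emptied tokens.
    stripped = [tok.strip(",.") for tok in tokenized]
    return raw, [tok for tok in stripped if tok]

raw, tokens = preprocess(
    "A slick, engrossing melodrama.",
    pretokenize_hooks=[str.lower],
    tokenizer=str.split,
    posttokenize_hooks=[strip_punct],
)
# tokens is now ['a', 'slick', 'engrossing', 'melodrama']
```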

Use preprocessing from other libraries

A common use case is incorporating existing components of pretrained language models, such as BERT's subword tokenizer. This is simple to do with our Fields. The following snippet requires the transformers library (pip install transformers).

>>> from transformers import BertTokenizer
>>> # Load the tokenizer and fetch the pad index
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> pad_index = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
>>> # Define a BERT subword Field
>>> bert_field = Field("subword",
...                    vocab=None,
...                    padding_token=pad_index,
...                    tokenizer=tokenizer.tokenize,
...                    custom_numericalize=tokenizer.convert_tokens_to_ids)
>>> # ...
>>> print(sst_train[222])
Example[label: ('positive', None); subword: (None, ['a', 'slick', ',', 'eng', '##ross', '##ing', 'mel', '##od', '##rama', '.'])]

Contributing

Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.

Code style standards

In this repository we use numpydoc as the documentation standard and Flake8 for code style checks. Code style references are Flake8 and PEP 8.

To check flake8 compliance for the library code and tests:

flake8 podium
flake8 test

Building and running unit tests

You will work in a virtual environment and keep a list of required dependencies in a requirements.txt file. The master branch of the project must be buildable with passing tests all the time. Code coverage should be kept as high as possible (preferably >95%).

Commands to set up the virtual environment and run the tests:

virtualenv -p python3.6 env
source env/bin/activate
python setup.py install
py.test --cov-report=term-missing --cov=podium

If you intend to develop parts of Podium, install it with the following command instead:

python setup.py develop 

In other cases it is enough to run python setup.py install for Podium to be added to the Python environment.

The project is packaged according to official Python packaging guidelines.

We recommend use of pytest and pytest-mock library for testing when developing new parts of the library.
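As a sketch of the test style pytest supports (the helper function here is hypothetical, not part of Podium; pytest-mock's mocker fixture slots into such tests the same way):

```python
# pytest collects any function named test_* and treats bare assert
# statements as test checks.

def tokenize_whitespace(text):
    # Hypothetical helper under test, not part of Podium's API.
    return text.split()

def test_tokenize_whitespace():
    assert tokenize_whitespace("a slick melodrama") == ["a", "slick", "melodrama"]
    assert tokenize_whitespace("") == []
```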

Adding new dependencies

Add a new library to the project via pip install <new_framework>. Don't forget to add it to requirements.txt.

Prefer adding dependencies to the requirements.txt file manually over using pip freeze > requirements.txt, which also pins every transitive dependency and makes the file hard to maintain.
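For example, a hand-maintained requirements.txt lists only direct dependencies, with version bounds where needed (the package names and versions below are illustrative, not Podium's actual dependencies):

```
numpy>=1.17
nltk>=3.4
```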

Versioning

We use SemVer for versioning. For the versions available, see the tags on this repository.

Authors

See also the list of contributors who participated in this project.

License

This project is licensed under the BSD 3-Clause License - see the LICENSE file for details.
