Trajectory is a software platform for automatically extracting topics from university course descriptions. It includes modules for data ingestion, learning, and visualization.
The basic requirements are Java JDK 7 or newer, Python 3.0 or newer, virtualenv, and Maven. The database layer requires a system installation of MySQL, PostgreSQL, SQLite, or similar software. The visualization layer requires a proxy web server (e.g. Apache, Nginx).
Note that this project contains an unholy combination of Bash scripts, Python tools, and Java code. Proceed with setup carefully.
Begin by cloning the repository and exporting the $TRJ_HOME path variable.

```shell
$ git clone http://github.com/jrouly/trajectory
$ cd trajectory
$ export TRJ_HOME=$(pwd)
```

Install Python dependencies by calling the bin/util/pysetup script. Java code will be compiled on demand.
To specify or change the database URI and scheme, modify the config.py file. Specifically, look for DATABASE_URI. It defaults to a SQLite file named data.db.
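If DATABASE_URI follows the common SQLAlchemy-style URI format (an assumption; check config.py for the exact convention), values might look like the following. Every value other than the documented SQLite default is an illustrative placeholder:

```python
# Illustrative database URIs. Only the sqlite value reflects the
# documented default; the others are hypothetical examples.
sqlite_uri = "sqlite:///data.db"                                   # default: local SQLite file
postgres_uri = "postgresql://user:password@localhost/trajectory"   # assumed PostgreSQL form
mysql_uri = "mysql://user:password@localhost/trajectory"           # assumed MySQL form

# The URI's scheme (everything before "://") selects the backend:
scheme = sqlite_uri.split("://", 1)[0]
print(scheme)  # sqlite
```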
To serve the visualization layer behind Nginx, a proxy configuration like the following can be used:

```
server {
    listen 80;

    location ^~ /static/ {
        root /TRJ_HOME/src/main/resources/web/static;
    }

    location / {
        proxy_pass http://localhost;
        proxy_redirect off;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Host $server_name;
    }
}
```

Download raw course data with the Scrape module:

```shell
$ bin/scrape download [-h] {targets}
```

Export downloaded data for the Learn module:

```shell
$ bin/scrape export [-h] [--data-directory <directory>] [--departments <departments>] [--cs]
```

This exports data in a format that can be read by the Learn module. The data directory defaults to data/. Subjects can be selectively filtered using the --departments flag.
```shell
$ bin/learn -in <path> -out <path> [-iterations <n>] [-debug] [-threads <n>] [-topics <n>] [-words <n>] [-alpha <alpha>] [-beta <beta>]
```

The -in parameter must point to an export location from the Scrape module. Results are stored within a timestamped subdirectory of the -out directory. All other parameters are optional.
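As a sketch of the output layout: each run lands in its own timestamped subdirectory of the -out directory. Assuming a name format like YYYYMMDD-HHMMSS (the exact format used by bin/learn is an assumption), the run directory could be derived like this:

```python
import datetime
import os

# Hypothetical illustration of a timestamped results subdirectory;
# the actual naming scheme used by bin/learn may differ.
out_dir = "results"
stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
run_dir = os.path.join(out_dir, stamp)
```

Because each run gets a fresh subdirectory, repeated runs against the same -out directory never overwrite earlier results.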
```shell
$ bin/scrape import-results [-h] --topic-file <file> --course-file <file> [--alpha <alpha>] [--beta <beta>] [--iterations <iterations>]
```

This reads the results of the Learn module (inferred topics) back into the database and pairs them with existing course data. Multiple imports simply add ResultSets to the existing database.
```shell
$ bin/web
```

This activates the visualization server. See gunicorn.py for configuration settings. Note that the PID and log files are stored in $TRJ_HOME.
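A minimal gunicorn.py consistent with that behavior might look like the sketch below. The setting names (bind, workers, pidfile, accesslog, errorlog) are standard Gunicorn configuration keys, but the specific values are assumptions, not the project's shipped configuration:

```python
import os

# Hypothetical gunicorn.py sketch; the project's actual settings live in
# its own gunicorn.py. PID and log files are placed under TRJ_HOME, as
# described above.
TRJ_HOME = os.environ.get("TRJ_HOME", ".")

bind = "127.0.0.1:8000"                           # assumed local bind address
workers = 2                                       # assumed worker count
pidfile = os.path.join(TRJ_HOME, "web.pid")       # PID file under TRJ_HOME
accesslog = os.path.join(TRJ_HOME, "access.log")  # HTTP access log
errorlog = os.path.join(TRJ_HOME, "error.log")    # server error log
```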
- Create common engine frameworks for catalog installs to be more DRY.
- Refactor configuration objects as a module.