justext-java is a library for removing boilerplate content, such as navigation links, headers, and footers, from HTML pages. It is designed to preserve mainly text that contains full sentences, which makes it well suited for creating linguistic resources such as Web corpora. This implementation is a Java port of the Python library jusText (https://github.com/miso-belica/jusText).
See what is kept and what is discarded from a typical web page. Read a description of the jusText algorithm.
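To give a feel for why text with full sentences tends to survive, here is a minimal sketch of one signal the jusText approach relies on: the share of stop words in a block of text. This is not the library's own code. The class name `StopWordDensitySketch`, the method `stopWordDensity`, and the tiny stop word list are all hypothetical and chosen only for illustration. The real algorithm combines this kind of signal with others, such as paragraph length and link density.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative only: a hypothetical helper, not part of justext-java.
public class StopWordDensitySketch {

    // Returns the fraction of whitespace-separated tokens that are stop words.
    // Blank input yields 0.0. Tokens are lower-cased before the lookup;
    // punctuation is NOT stripped, so "the," would not match "the".
    public static double stopWordDensity(String text, Set<String> stopWords) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0.0;
        }
        String[] tokens = trimmed.split("\\s+");
        int hits = 0;
        for (String token : tokens) {
            if (stopWords.contains(token.toLowerCase(Locale.ROOT))) {
                hits++;
            }
        }
        return (double) hits / tokens.length;
    }

    public static void main(String[] args) {
        // A toy stop word list (assumption); real lists are much longer.
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "is", "a", "of"));

        // A sentence-like block: 5 of its 8 tokens are stop words.
        System.out.println(stopWordDensity("The library is a port of the original", stopWords));

        // A navigation-like block: none of its tokens are stop words.
        System.out.println(stopWordDensity("Home | Products | Contact", stopWords));
    }
}
```

Sentence-like text scores high on this measure, while menus and link bars score near zero, which is the intuition behind keeping the former and discarding the latter.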
TBD
    % mvn clean install

    import nl.wizenoze.justext.JusText
    import nl.wizenoze.justext.paragraph.Paragraph
    import nl.wizenoze.justext.util.StopWordsUtil

    JusText jusText = new JusText()
    String rawHtml = "http://www.devx.com/wireless/remote-work-and-the-social-forces-and-technologies-that-enable-it.html".toURL().getText()
    Set<String> stopWords = StopWordsUtil.getStopWords("en")
    List<Paragraph> paragraphs = jusText.extract(rawHtml, stopWords)
    paragraphs.each { println(it.getText()) }

Please refer to CONTRIBUTING.
justext-java is available under the GNU Lesser General Public License v3.0.
Copyright (c) 2016-present WizeNoze B.V. All rights reserved.