I have written a program to perform a task, but I suspect it is not optimized. I want to know whether there are ways to improve the efficiency and performance of this program.
This program reads a set of .gz files from a directory, parses each file, and writes the filtered content into an .xml file in the results directory.
For example, the contents of 1.gz are as follows:
```
URL:http://www.samplePage1.com
HTTP/1.1 200 OK
Content-Type: application/vnd.ms-excel
Content-Length: 46592
Last-Modified: Mon, 08 Mar 2010 18:48:10 GMT

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="title" content="Internet Infrastructure Vendors (Vendor SIG)" />
<meta name="description" content="Sample page1" />
<title>My title1</title>
</head>
<body class="home">
<p> body content of this sample page 1 </p>
</body>
</html>
```

This would be read, parsed, and written out as an XML file (one XML file is created for each .gz file in the input folder) as follows:
```
<docHead>
<doc>
<field name="url">http://www.samplePage1.com</field>
<field name="meta">Sample page1</field>
<field name="title">My title1</field>
<field name="body">body content of this sample page 1 </field>
<field name="lastmodified">Mon, 08 Mar 2010 18:48:10 GMT</field>
</doc>
<doc>
...another doc
</doc>
...
...
</docHead>
```

The Java code is as follows:
```java
import java.io.BufferedWriter;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.file.DirectoryStream;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class WarcToXML {

    static Pattern lstModPattern = Pattern.compile("last-modified:.*?\r?\n", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws IOException {
        String in_directory = args[0];
        String result_dir = args[1];
        String resFileName = null;

        // Path to create newly created xml files
        Path outPath = Paths.get(result_dir);

        // Create new directory if it does not exist
        if (!Files.exists(outPath)) {
            try {
                Files.createDirectory(outPath);
            } catch (IOException e) {
                System.err.println(e);
            }
        }

        int fileCount = 1;
        Path dir = FileSystems.getDefault().getPath(in_directory);
        DirectoryStream<Path> stream = null;
        try {
            stream = Files.newDirectoryStream(dir);
            for (Path path : stream) {
                if ((path.getFileName().toString()).endsWith(".gz"))
                    resFileName = result_dir + "\\" + fileCount + ".xml";
                try {
                    parseFile(path.toFile(), resFileName);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
            stream.close();
        } catch (Exception e1) {
            e1.printStackTrace();
        }
    }

    public static void parseFile(File inputFile, String resFileName) throws IOException {
        // open the gzip input stream
        GZIPInputStream gzStream = new GZIPInputStream(new FileInputStream(inputFile));
        DataInputStream inStream = new DataInputStream(gzStream);

        int i = 0;
        String pageContent;
        String thisTargetURI = null;
        BufferedWriter writer = null;
        try {
            writer = new BufferedWriter(new OutputStreamWriter(
                    new FileOutputStream(resFileName), "utf-8"));
            writer.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
            writer.newLine();
            writer.write("<docHead>");
            writer.newLine();
            writer.close();

            // PageRepository iterates through the inStream and returns each WARC record as a String
            while ((pageContent = PageRepository.readPage(inStream)) != null) {
                int startOfHtmlContent = 0;
                if (pageContent.toLowerCase().indexOf("<!doctype html") != -1)
                    startOfHtmlContent = pageContent.toLowerCase().indexOf("<!doctype html");
                else
                    startOfHtmlContent = pageContent.toLowerCase().indexOf("<html");
                pageContent = pageContent.substring(startOfHtmlContent, pageContent.length() - 1);

                // Start - get value of last-modified header
                int endOfHeader = startOfHtmlContent;
                String headersBlock = pageContent.substring(0, endOfHeader);
                String lastModified = null;
                Pattern pattern = Pattern.compile("last-modified:.*?\r?\n", Pattern.CASE_INSENSITIVE);
                Matcher matcher = pattern.matcher(headersBlock);
                if (matcher.find()) {
                    lastModified = (matcher.group(0).substring(14)).trim();
                }
                // End - get value of last-modified header

                String pageTitle = null;
                String h1 = null;
                Element firstH1 = null;

                // Parsing the html content using Jsoup
                Document doc = Jsoup.parse(pageContent);

                /** Extracting the document title; if no title is present, select the text
                 *  inside the first h1 or h2 tag as the title.
                 *  If that too is not found, take the URL as the title. */
                if (doc.title() != null && !doc.title().isEmpty()) {
                    pageTitle = doc.title();
                } else {
                    if (doc.select("h1").first() != null)
                        firstH1 = doc.select("h1").first();
                    else if (doc.select("h2").first() != null)
                        firstH1 = doc.select("h2").first();
                    if (firstH1 != null)
                        h1 = firstH1.text();
                    else
                        h1 = thisTargetURI;
                    pageTitle = h1;
                }
                /** End of extracting title */

                // getting meta data
                String metaInfo = "";
                Elements metalinks = doc.select("meta");
                for (Element ele : metalinks) {
                    if (ele.attr("name").equalsIgnoreCase("keywords")
                            || ele.attr("name").equalsIgnoreCase("description"))
                        metaInfo = metaInfo + " " + ele.attr("content");
                }

                writeToXml(thisTargetURI, metaInfo, pageTitle, doc.text(), lastModified, resFileName);
            }

            writer = new BufferedWriter(new OutputStreamWriter(
                    new FileOutputStream(resFileName, true), "utf-8"));
            writer.write("</docHead>");
            writer.close();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            ...
        }
    }

    public static void writeToXml(String url, String metaKeywords, String title,
                                  String content, String lastModified, String resFileName) {
        BufferedWriter writer = null;
        try {
            url = url.replace("<", "&lt;");
            url = url.replace(">", "&gt;");
            url = url.replace("'", "&apos;");
            if (metaKeywords != null) {
                metaKeywords = metaKeywords.replace("<", "&lt;");
                metaKeywords = metaKeywords.replace(">", "&gt;");
                metaKeywords = metaKeywords.replace("'", "&apos;");
            }
            if (title != null) {
                title = title.replace("<", "&lt;");
                title = title.replace(">", "&gt;");
                title = title.replace("'", "&apos;");
            }
            if (content != null) {
                content = content.replace("<", "&lt;");
                content = content.replace(">", "&gt;");
                content = content.replace("'", "&apos;");
            }

            writer = new BufferedWriter(new OutputStreamWriter(
                    new FileOutputStream(resFileName, true), "utf-8"));
            writer.write("<doc>");
            writer.newLine();
            writer.write("<field name=\"url\">" + url + "</field>");
            writer.newLine();
            writer.write("<field name=\"meta\">" + metaKeywords + "</field>");
            writer.newLine();
            writer.write("<field name=\"title\">" + title + "</field>");
            writer.newLine();
            writer.write("<field name=\"body\">" + content + "</field>");
            writer.newLine();
            writer.write("<field name=\"lastmodified\">" + lastModified + "</field>");
            writer.newLine();
            writer.write("</doc>");
            writer.newLine();
            writer.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        } finally {
            ...
        }
    }
}
```

Is there a better way to do this task? I'm guessing threads would help to read and process multiple files at once, but I'm not really sure how to use them.
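For instance, is an `ExecutorService` with one task per .gz file roughly the right shape? Below is a minimal sketch of what I have in mind; the `ParallelRunner` class name and the thread count are just placeholders, and it assumes `parseFile` is safe to call from several threads at once (which seems plausible, since each call only touches its own streams and its own output file):

```java
import java.io.File;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelRunner {
    public static void main(String[] args) throws Exception {
        Path inDir = Paths.get(args[0]);
        String resultDir = args[1];

        // One worker thread per core; each .gz file becomes one independent task
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        int fileCount = 1;
        // The "*.gz" glob filters out non-gzip entries up front
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(inDir, "*.gz")) {
            for (Path path : stream) {
                String resFileName = resultDir + File.separator + (fileCount++) + ".xml";
                pool.submit(() -> {
                    try {
                        WarcToXML.parseFile(path.toFile(), resFileName);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            }
        }

        pool.shutdown();                          // stop accepting new tasks
        pool.awaitTermination(1, TimeUnit.HOURS); // wait for all queued files to finish
    }
}
```

Each task would write to its own numbered output file, so the tasks shouldn't contend with each other on output. Is this a reasonable direction, or is there a better approach?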
Comments:

- What is in the `catch` and `finally` blocks?
- The imports as well. It appears that this code requires jsoup. But what library is `PageRepository` from?
- `PageRepository` is coming from this same project, but is not included here. What it does is perfectly clear; I think it's good enough.
- Note that `thisTargetURI` is never set to anything other than `null`!