I have written a program to perform a task, but I suspect it is not optimized. I want to know whether there are ways to improve the efficiency and performance of this program.
This program reads a set of .gz files from a directory, parses each file, and writes the filtered content into an .xml file in the results directory.
For example, the contents of 1.gz are as follows:
```
URL:http://www.samplePage1.com
HTTP/1.1 200 OK
Content-Type: application/vnd.ms-excel
Content-Length: 46592
Last-Modified: Mon, 08 Mar 2010 18:48:10 GMT

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="title" content="Internet Infrastructure Vendors (Vendor SIG)" />
<meta name="description" content="Sample page1" />
<title>My title1</title>
</head>
<body class="home">
<p> body content of this sample page 1 </p>
</body>
</html>
```

This would be read, parsed, and written out as an XML file (one XML file is created for each .gz file in the input folder) as follows:
```
<docHead>
<doc>
<field name="url">http://www.samplePage1.com</field>
<field name="meta">Sample page1</field>
<field name="title">My title1</field>
<field name="body">body content of this sample page 1 </field>
<field name="lastmodified">Mon, 08 Mar 2010 18:48:10 GMT</field>
</doc>
<doc>
...another doc
</doc>
...
...
</docHead>
```

The Java code is as follows:
```java
import java.io.BufferedWriter;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.file.DirectoryStream;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class WarcToXML {

    static Pattern lstModPattern = Pattern.compile("last-modified:.*?\r?\n", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws IOException {
        String in_directory = args[0];
        String result_dir = args[1];
        String resFileName = null;

        // Path to create newly created xml files
        Path outPath = Paths.get(result_dir);

        // Create new directory if it does not exist
        if (!Files.exists(outPath)) {
            try {
                Files.createDirectory(outPath);
            } catch (IOException e) {
                System.err.println(e);
            }
        }

        int fileCount = 1;
        Path dir = FileSystems.getDefault().getPath(in_directory);
        DirectoryStream<Path> stream = null;
        try {
            stream = Files.newDirectoryStream(dir);
            for (Path path : stream) {
                if ((path.getFileName().toString()).endsWith(".gz"))
                    resFileName = result_dir + "\\" + fileCount + ".xml";
                try {
                    parseFile(path.toFile(), resFileName);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
            stream.close();
        } catch (Exception e1) {
            e1.printStackTrace();
        }
    }

    public static void parseFile(File inputFile, String resFileName) throws IOException {
        // open the gzip input stream
        GZIPInputStream gzStream = new GZIPInputStream(new FileInputStream(inputFile));
        DataInputStream inStream = new DataInputStream(gzStream);

        int i = 0;
        String pageContent;
        String thisTargetURI = null;
        BufferedWriter writer = null;
        try {
            writer = new BufferedWriter(new OutputStreamWriter(
                    new FileOutputStream(resFileName), "utf-8"));
            writer.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
            writer.newLine();
            writer.write("<docHead>");
            writer.newLine();
            writer.close();

            // PageRepository iterates through the inStream and returns each WARC record as a String
            while ((pageContent = PageRepository.readPage(inStream)) != null) {
                int startOfHtmlContent = 0;
                if (pageContent.toLowerCase().indexOf("<!doctype html") != -1)
                    startOfHtmlContent = pageContent.toLowerCase().indexOf("<!doctype html");
                else
                    startOfHtmlContent = pageContent.toLowerCase().indexOf("<html");
                pageContent = pageContent.substring(startOfHtmlContent, pageContent.length() - 1);

                // Start - get value of last-modified header
                int endOfHeader = startOfHtmlContent;
                String headersBlock = pageContent.substring(0, endOfHeader);
                String lastModified = null;
                Pattern pattern = Pattern.compile("last-modified:.*?\r?\n", Pattern.CASE_INSENSITIVE);
                Matcher matcher = pattern.matcher(headersBlock);
                if (matcher.find()) {
                    lastModified = (matcher.group(0).substring(14)).trim();
                }
                // End - get value of last-modified header

                String pageTitle = null;
                String h1 = null;
                Element firstH1 = null;

                // Parsing the html content using Jsoup
                Document doc = Jsoup.parse(pageContent);

                /** Extracting the document title; if no title is present, select the text
                 *  inside the first h1 or h2 tag as the title.
                 *  If that too is not found, take the URL as the title. */
                if (doc.title() != null && !doc.title().isEmpty()) {
                    pageTitle = doc.title();
                } else {
                    if (doc.select("h1").first() != null)
                        firstH1 = doc.select("h1").first();
                    else if (doc.select("h2").first() != null)
                        firstH1 = doc.select("h2").first();
                    if (firstH1 != null)
                        h1 = firstH1.text();
                    else
                        h1 = thisTargetURI;
                    pageTitle = h1;
                }
                /** End of extracting title */

                // getting meta data
                String metaInfo = "";
                Elements metalinks = doc.select("meta");
                for (Element ele : metalinks) {
                    if (ele.attr("name").equalsIgnoreCase("keywords")
                            || ele.attr("name").equalsIgnoreCase("description"))
                        metaInfo = metaInfo + " " + ele.attr("content");
                }

                writeToXml(thisTargetURI, metaInfo, pageTitle, doc.text(), lastModified, resFileName);
            }

            writer = new BufferedWriter(new OutputStreamWriter(
                    new FileOutputStream(resFileName, true), "utf-8"));
            writer.write("</docHead>");
            writer.close();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            ...
        }
    }

    public static void writeToXml(String url, String metaKeywords, String title,
                                  String content, String lastModified, String resFileName) {
        BufferedWriter writer = null;
        try {
            url = url.replace("<", "&lt;");
            url = url.replace(">", "&gt;");
            url = url.replace("'", "&apos;");
            if (metaKeywords != null) {
                metaKeywords = metaKeywords.replace("<", "&lt;");
                metaKeywords = metaKeywords.replace(">", "&gt;");
                metaKeywords = metaKeywords.replace("'", "&apos;");
            }
            if (title != null) {
                title = title.replace("<", "&lt;");
                title = title.replace(">", "&gt;");
                title = title.replace("'", "&apos;");
            }
            if (content != null) {
                content = content.replace("<", "&lt;");
                content = content.replace(">", "&gt;");
                content = content.replace("'", "&apos;");
            }

            writer = new BufferedWriter(new OutputStreamWriter(
                    new FileOutputStream(resFileName, true), "utf-8"));
            writer.write("<doc>");
            writer.newLine();
            writer.write("<field name=\"url\">" + url + "</field>");
            writer.newLine();
            writer.write("<field name=\"meta\">" + metaKeywords + "</field>");
            writer.newLine();
            writer.write("<field name=\"title\">" + title + "</field>");
            writer.newLine();
            writer.write("<field name=\"body\">" + content + "</field>");
            writer.newLine();
            writer.write("<field name=\"lastmodified\">" + lastModified + "</field>");
            writer.newLine();
            writer.write("</doc>");
            writer.newLine();
            writer.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        } finally {
            ...
        }
    }
}
```

Is there a better way to do this task? I'm guessing threads would help to read and process multiple files at once, but I'm not really sure how to use them.
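For instance, is an `ExecutorService` with one task per .gz file roughly the right shape? Below is a minimal sketch of what I have in mind; the `ParallelRunner` class name and the thread count are just placeholders, and it assumes `parseFile` is safe to call from several threads at once (which seems plausible, since each call only touches its own streams and its own output file):

```java
import java.io.File;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelRunner {
    public static void main(String[] args) throws Exception {
        Path inDir = Paths.get(args[0]);
        String resultDir = args[1];

        // One worker thread per core; each .gz file becomes one independent task
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        int fileCount = 1;
        // The "*.gz" glob filters out non-gzip entries up front
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(inDir, "*.gz")) {
            for (Path path : stream) {
                String resFileName = resultDir + File.separator + (fileCount++) + ".xml";
                pool.submit(() -> {
                    try {
                        WarcToXML.parseFile(path.toFile(), resFileName);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            }
        }

        pool.shutdown();                          // stop accepting new tasks
        pool.awaitTermination(1, TimeUnit.HOURS); // wait for all queued files to finish
    }
}
```

Each task would write to its own numbered output file, so the tasks shouldn't contend with each other on output. Is this a reasonable direction, or is there a better approach?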
Comments:

- What is in the `catch` and `finally` blocks?
- The imports as well. It appears that this code requires jsoup. But what library is `PageRepository` from?
- `PageRepository` is coming from this same project, but is not included here. What it does is perfectly clear; I think it's good enough.
- Note that `thisTargetURI` is never set to anything other than `null`!