
(X)HTML to PDF with Java

By Tedla Brandsema

Introduction

There are many ways to create a PDF with Java. In my opinion, by far the easiest is Flying Saucer.

Flying Saucer renders XML and XHTML styled with CSS 2.1 compliant stylesheets. Through iText, it can render that XML/XHTML and CSS combination to PDF.

There are many tutorials and articles on the subject, but they tend to be outdated or to tackle niche complexities. The goal here is a hands-on tutorial in which we create a PDF with the February 2011 release of Flying Saucer, R8, in a very straightforward way.

Prerequisites

Installing the JDK and the IDE is not covered here, as that is beyond the scope of this tutorial.

Creating our project

Create a new Java project called 'xhtml2pdf' (very original) in your IDE of choice. We are going to keep it as basic as we can, so no dynamic web project, just a plain old Java application. How the project looks will differ depending on the IDE you are using, but you should create a source folder structure with src/main/java and src/main/resources.

Next, create a file called source.html in the resources folder. It should look like this:

<!DOCTYPE html>

<html>
<head>
    <title>Let's create a PDF!</title>
  
    <style type="text/css">
        body{
            font-family: Helvetica;
            width: 700px;
            margin: 0 auto;
            color: #DCDDDB;
            background-color: #40404A;
        }

        section {
            border-radius: 2px 2px 2px 2px;
            box-shadow: -16px 0 20px -20px #FFF, 16px 0 20px -20px #FFF;
        }

        h1, footer{
            color: #FFF;
        }

        article {
            padding: 10px;
        }

        #dayon_link{
            display: inline-block;
            background: no-repeat center #FFF url("http://www.dayon.nl/sites/all/themes/dayon/images/logo.png");
            height: 95px;
            width: 100%;
        }
    </style>

</head>
<body>
<section>

    <a id="dayon_link" href="http://www.dayon.nl"></a>
    <article>
        <footer>
            <time pubdate datetime="2011-11-15T10:21">15 november 2011 10:21</time>
            Written by Cicero
        </footer>
        <h1>
            Lorem ipsum
        </h1>
        <img src="http://sanatyasami.com/wp-content/uploads/2010/12/Lorem-ipsum-dolor-sit-...">
        <p>
            Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vestibulum vitae nisi ipsum, id interdum eros.
            Suspendisse libero sapien, dictum nec luctus ac, adipiscing in massa. Etiam odio sem,
            porttitor in dignissim id, convallis ut augue. Vestibulum ante ipsum primis in faucibus orci luctus
            et ultrices posuere cubilia Curae; Aliquam et neque lacus. Ut quam ligula, eleifend eget cursus vitae,
            adipiscing non risus. In suscipit fermentum quam, vel auctor dolor congue vitae. Nullam aliquet nulla
            ut sem elementum lobortis at eget odio. Pellentesque euismod, erat imperdiet sollicitudin imperdiet,
            nulla ante cursus odio, vel lobortis massa velit at nisl. Aliquam et arcu sed massa aliquam placerat
            ac vel velit. Mauris fringilla leo in augue dignissim quis placerat nisl ultricies.
            Proin ut odio ac lorem faucibus sollicitudin. Integer vitae est nulla. Phasellus lacinia posuere convallis.
        </p>

        <p>
            Phasellus posuere sodales lorem a euismod. Duis et nibh nulla. Nullam at justo augue, quis molestie arcu.
            Aenean eu lacus dolor, vitae ultricies lorem. Fusce et mauris non magna sodales congue et id nibh.
            Nulla consectetur, sem id vulputate mattis, leo quam posuere leo, id pretium purus mi sed leo.
            Donec nibh metus, tincidunt sit amet semper vitae, consequat non sapien. Sed in eros sed magna
            faucibus posuere. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque urna ante,
            aliquam eget varius ac, congue ac elit. Maecenas vitae tempus est. Donec auctor volutpat nunc, non
            dictum sem ultrices viverra. Mauris ipsum orci, tempor ac venenatis sit amet, adipiscing et nisi.
            Nunc interdum leo ac magna sollicitudin ut fermentum nisi viverra. Aenean congue enim ut neque
            aliquet in venenatis diam eleifend.
        </p>

    </article>
</section>
</body>
</html>


If you look closely, you will notice that the HTML is not well formed. Things like <link href="style.css" rel="stylesheet" type="text/css">, <br>, the unclosed <img> above and the valueless pubdate attribute break the rules of well-formedness in XML terms: every element must be closed, and every attribute must have a value. These minor defects are within the tolerance of any modern browser and do not affect how the page is rendered. An XML parser, however, will not be so tolerant.
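To see that intolerance first hand, here is a minimal sketch that feeds a few fragments to the XML parser that ships with the JDK. The class name WellFormedCheck and the sample fragments are made up for illustration and are not part of the xhtml2pdf project.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, only here to illustrate well-formedness.
class WellFormedCheck {

    // Returns true when the JDK's XML parser accepts the markup as well-formed XML.
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the markup.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<p>line one<br>line two</p>"));        // false: <br> is never closed
        System.out.println(isWellFormed("<time pubdate>15 november</time>"));   // false: attribute without a value
        System.out.println(isWellFormed("<p>line one<br/>line two</p>"));       // true
    }
}
```

A browser would happily render all three fragments. The XML parser only accepts the last one.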

This is where HTMLCleaner comes in. It is a library that cleans up not-so-well-formed input into well-formed output. Let's take it for a spin.

First of all, add the HTMLCleaner jar to your classpath. (If you do not know how to do this, do a quick Google search on 'adding external libraries <insert your IDE of choice here>'. If you have created a Maven project, chances are you already know how to add dependencies to your project.)
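If you are unsure whether the jars actually ended up on the classpath, a quick check like the sketch below can help. ClasspathCheck is a hypothetical helper name. The two class names it looks up are the ones imported in the code of this tutorial.

```java
// Hypothetical helper class for checking whether a library is on the classpath.
class ClasspathCheck {

    // Returns true when the named class can be found, without initializing it.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("HTMLCleaner:   " + isOnClasspath("org.htmlcleaner.HtmlCleaner"));
        System.out.println("Flying Saucer: " + isOnClasspath("org.xhtmlrenderer.pdf.ITextRenderer"));
    }
}
```

Both lines should report true once the libraries have been added to the project.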

In your src/main/java source folder, create a file called PdfRenderer.java. This is where our main method lives and where we will be doing our coding.

With the HTMLCleaner code added, PdfRenderer.java should look like this:

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.PrettyXmlSerializer;
import org.htmlcleaner.TagNode;

import java.io.*;

public class PdfRenderer {

    public static void main(String[] args) throws IOException {

        // Clean up the HTML to be well formed
        HtmlCleaner cleaner = new HtmlCleaner();
        CleanerProperties props = cleaner.getProperties();
        TagNode node = cleaner.clean(PdfRenderer.class.getResourceAsStream("source.html"));
        new PrettyXmlSerializer(props).writeToStream(node, System.out);
    }
}

First we create a cleaner and fetch its default properties. Then we open source.html as an input stream and pass it to the cleaner, which returns the root node of the cleaned document. For the purposes of this tutorial we initially serialize that node to System.out. That way we can verify that the cleaner did make the HTML well formed and thus produced true XHTML.
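One thing to watch out for: getResourceAsStream returns null when the resource cannot be found, for example when source.html did not end up on the classpath. A small guard like the sketch below turns that into a readable error. ResourceLoader and openResource are hypothetical names, not part of HTMLCleaner or the tutorial code.

```java
import java.io.FileNotFoundException;
import java.io.InputStream;

// Hypothetical helper that fails loudly when a classpath resource is missing.
class ResourceLoader {

    // Opens the resource relative to the given class, or throws if it does not exist.
    static InputStream openResource(Class<?> anchor, String name) throws FileNotFoundException {
        InputStream in = anchor.getResourceAsStream(name);
        if (in == null) {
            throw new FileNotFoundException("Resource not found on the classpath: " + name);
        }
        return in;
    }
}
```

In PdfRenderer you could then pass ResourceLoader.openResource(PdfRenderer.class, "source.html") to the cleaner instead of calling getResourceAsStream directly.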


Final step

Now that we have checked that the HTMLCleaner library does its job, we can safely create a PDF with Flying Saucer and iText. Let's take a look at the final code.

import com.lowagie.text.DocumentException;
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.PrettyXmlSerializer;
import org.htmlcleaner.TagNode;
import org.xhtmlrenderer.pdf.ITextRenderer;

import org.xml.sax.SAXException;

import javax.xml.parsers.ParserConfigurationException;
import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;


public class PdfRenderer {

    public static void main(String[] args) throws IOException, DocumentException, ParserConfigurationException, SAXException {

        // Create a buffer to hold the cleaned up HTML
        ByteArrayOutputStream out = new ByteArrayOutputStream();

        // Clean up the HTML to be well formed
        HtmlCleaner cleaner = new HtmlCleaner();
        CleanerProperties props = cleaner.getProperties();
        TagNode node = cleaner.clean(PdfRenderer.class.getResourceAsStream("source.html"));
        // Instead of writing to System.out we now write to the ByteArray buffer
        new PrettyXmlSerializer(props).writeToStream(node, out);

        // Create the PDF
        ITextRenderer renderer = new ITextRenderer();
        renderer.setDocumentFromString(new String(out.toByteArray()));
        renderer.layout();
        OutputStream outputStream = new FileOutputStream("HTMLasPDF.pdf");
        renderer.createPDF(outputStream);

        // Finishing up
        renderer.finishPDF();
        out.flush();
        out.close();
    }
}

Instead of writing to System.out, we now write to a ByteArrayOutputStream. We create an instance of a PDF renderer object. Since our cleaned up HTML is written to the ByteArrayOutputStream, we need a way to feed it to the renderer. We could have created a W3C Document and fed it to the renderer's setDocument() method. But that would have taken a few more lines of code, and since we already know the result is XHTML, thanks to HTMLCleaner's XML serializer, there is no added value in parsing our output into a Document object first.
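For the curious, the Document route could look roughly like the sketch below. It uses only the JDK's own parser. XhtmlToDocument is a hypothetical helper name, and the call into Flying Saucer is left as a comment that assumes the setDocument(Document, String) variant of ITextRenderer.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper showing the W3C Document alternative.
class XhtmlToDocument {

    // Parses well-formed XHTML bytes, such as out.toByteArray(), into a W3C Document.
    static Document parse(byte[] xhtmlBytes)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new ByteArrayInputStream(xhtmlBytes));
    }

    public static void main(String[] args) throws Exception {
        byte[] xhtml = "<html><head><title>Test</title></head><body><p>Hello</p></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        Document document = parse(xhtml);
        System.out.println(document.getDocumentElement().getTagName()); // html

        // With Flying Saucer on the classpath, the Document would replace the String
        // (assumed signature, the second argument is the base URL):
        // renderer.setDocument(document, null);
    }
}
```

One side effect of this route is that the bytes go straight to the parser, so no conversion to a String with the platform's default character set is involved.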

After that we simply tell the renderer to write the PDF to the specified OutputStream. Then all that is left to do is tie up some loose ends. Run the code again and you should find a PDF named HTMLasPDF.pdf in the root of your project.
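If you are on Java 7 or later, try-with-resources is one way to tie up those loose ends even when something goes wrong halfway. The sketch below is an illustration, not the tutorial's code. SafeFileWrite and StreamWriter are hypothetical names, and the callback merely stands in for the call to the renderer.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that always closes the file stream it opens.
class SafeFileWrite {

    // Stand-in for whatever writes to the stream, e.g. the PDF renderer.
    interface StreamWriter {
        void writeTo(OutputStream out) throws IOException;
    }

    static void writeFile(File target, StreamWriter writer) throws IOException {
        // The stream is closed automatically, even if writeTo throws.
        try (OutputStream out = new FileOutputStream(target)) {
            writer.writeTo(out);
        }
    }

    public static void main(String[] args) throws IOException {
        File target = File.createTempFile("demo", ".bin");
        writeFile(target, out -> out.write("placeholder bytes".getBytes(StandardCharsets.UTF_8)));
        System.out.println(target.length() + " bytes written to " + target);
        target.delete();
    }
}
```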
