Web Development (Martin Trummer)

In this post, I'll try to explain some of the problems that I encountered when implementing WYSIWYG functionality.

"Adding WYSIWYG functionality to my GWT/GXT application is very easy and it'll take no more than 1 or 2 days."
This was the biggest problem I had: my naive time estimate :)

My implementation requirements

implement a WYSIWYG editor in an existing GWT/GXT application: no plug-ins (applets, Flash, ...) shall be required
users may create rich text, which will be stored in the database and later displayed to other users, thus security concerns must be considered
performance: reading the rich text should be as cheap as possible (writing is not critical)
e-mail: the rich text may also be sent via e-mail in plain-text format

Which WYSIWYG editor to use
There are numerous editors out there and they all have different pros and cons. Since no plug-in techniques are allowed, we only have to consider JavaScript-based solutions.
I am not going to compare or list all of them: just google it, you'll find plenty of information.
In a first attempt, I decided to use TinyMCE.
It turned out that integrating TinyMCE into a GXT application is not trivial, but it can be done:
see this GXT Forum Post: "Integration TinyMCE in GXT" for details
Anyway, I was not happy with TinyMCE in my GXT application, so I was eagerly waiting for the new native GXT Html Editor that comes with GXT 2.x.
Pros of the native editor:

it integrates well with the existing application
it is very easy to use
it has the same look & feel as the other GXT elements

Security
When you store user-generated HTML content in the database and then display this text to other users, you must guard against cross-site scripting (XSS).
Like all other security measures, this must be done on the server side. To also respect my performance requirement, I want to check the text once, when it is stored in the database.
Making sure that the user text is XSS-safe turns out to be very tricky and difficult to implement. Fortunately, OWASP's AntiSamy project comes to the rescue.
It makes it really easy to configure a policy that is tailored to your very specific needs (or you can just take one of the predefined policies).

Security-Implementation
Once you have implemented AntiSamy to clean your users' input, you'll need a convenient and reliable way to enforce that this cleanup always happens when the relevant field is stored to the database.
If you are using Hibernate, you should take a close look at Hibernate's user types.
Then it could be as easy as adding a single annotation to your entity-field:

@Column(name = "description", length = 4000)
@org.hibernate.annotations.Type(type = "com.yourdomain.YourUserType")
private java.lang.String description;

The class YourUserType would implement Hibernate's UserType interface, and in the nullSafeSet() method you'd call AntiSamy to do its magic.
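To make the idea a bit more concrete, here is a minimal, JDK-only sketch of the logic that nullSafeSet() could delegate to. It is not the real YourUserType: Hibernate and AntiSamy are left out on purpose, the class name SanitizingBinder is made up, and the sanitizer parameter is only a stand-in (an assumption) for whatever AntiSamy call you end up using.

```java
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Types;
import java.util.function.UnaryOperator;

/** Hypothetical helper: the cleanup step a sanitizing nullSafeSet() could delegate to. */
final class SanitizingBinder {

    private SanitizingBinder() {}

    /**
     * Binds a cleaned copy of the value to the given statement parameter.
     * The sanitizer is a placeholder for the AntiSamy call (assumption).
     */
    static void bind(PreparedStatement st, Object value, int index,
                     UnaryOperator<String> sanitizer) throws SQLException {
        if (value == null) {
            // Keep NULL values NULL instead of storing an empty string.
            st.setNull(index, Types.VARCHAR);
            return;
        }
        // Every write goes through the sanitizer, so no caller can forget it.
        st.setString(index, sanitizer.apply(value.toString()));
    }
}
```

Because the cleanup sits in the write path of the type itself, every entity field annotated with the user type gets sanitized without any extra code in the calling services.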

Styles and Formatting
Once this is implemented, you'll do your first test.
GXT styles
In the editor, write one word, select it, and click the B button on the editor's toolbar to make the text bold.
This works as expected and the text is also displayed in bold in the editor.
Then you save the text and display it in a GXT Html Component and may notice that the text is NOT displayed in bold.
In my case, I was testing with Internet Explorer 8 and found out that:

the editor is not using a <b> tag, as expected, but <strong> (which is fine, just not what I expected) - anyway, this should still display the text in bold, but it doesn't, because
GXT resets a lot of the common HTML styles in its gxt-all.css.
in this case the following definition is relevant:
...,strong,th,var{font-style:normal;font-weight:normal;}...
see: GXT Forum discussion: Rich text usage

This will also be the case for other tags: e.g. ol, ul, etc.
Workaround (for GXT < 2.1)

set the style for the element:
Html html = cp.addText("");
html.setStylePrimaryName("my-html");
define the style in your .css file:
div.my-html ul {
    list-style-type: disc;
}

Note: in GXT >= 2.1 the HTML element should have a distinct style that you can use:
see Forum Discussion default style for Html component
(then you don't need to set the HTML component's style name on your own - you just have to adapt the style name in your own .css file to whatever the GXT default style name of HTML elements will be)

Inconsistent HTML on different browsers
The next problem you may notice is that the Rich Text component will produce different HTML on different browsers (see: GXT Forum discussion: Rich Text: inconsistent output on different browsers).
When you, for example, write a text in the GXT HTML editor, select it and make it bold and underlined, then the HTML text that you will get depends on the browser - some examples:

Chrome 4.0:
bold&underlined
Opera 10.00:
bold&underlined
Safari 4.0.3:
Bbold&underlined
IE6:
bold&underlined
Iron 2.0:
bold&underlined
FF 3.5.2:
bold&underlined

I also created a short presentation with a more detailed example, which you can find here:
Presentation: WYSIWYG text handling

Well, that one hits you like a hammer and makes it very difficult to apply a consistent look across browsers.
What we want is, of course, one single way to handle e.g. bold text.
I chose to always use a b tag (I think that strong would be a better fit, but b is just shorter).

Unify HTML code
As you can see from the examples above, a myriad of different ways to express boldness exist (and I'm not really sure that I've covered all of them). Thus I need a very sophisticated and easy-to-adapt way to do this unification.
I use XSLT for this task (see Forum Discussion in Java Technology & XML - which xml api to use).
This tricky conversion requires XSLT 2.0, which leads to another problem.
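To give an idea of what such a unification looks like, here is a deliberately small sketch. It is not my real stylesheet: it only uses XSLT 1.0 (so it runs with the processor built into the JDK), it only covers two variants (a strong tag and a span with an inline font-weight: bold style), and the class name BoldUnifier is made up for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical example: rewrites two kinds of "bold" markup to a plain b tag. */
final class BoldUnifier {

    // Identity transform plus two more specific rules that emit <b>.
    private static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match='strong'><b><xsl:apply-templates/></b></xsl:template>"
      // translate() drops blanks so 'font-weight: bold' and 'font-weight:bold' both match
      + "<xsl:template match=\"span[contains(translate(@style,' ',''),'font-weight:bold')]\">"
      + "<b><xsl:apply-templates/></b></xsl:template>"
      + "</xsl:stylesheet>";

    private BoldUnifier() {}

    /** Expects well-formed XML with a single root element. */
    static String unify(String wellFormedXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(wellFormedXml)),
                    new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("input could not be transformed", e);
        }
    }
}
```

Note that the span rule throws away the whole span, including any other inline styles it carries. Handling combinations like bold plus underline in one style attribute is exactly where this simple approach stops being enough.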

Using XSLT 2.0 (in Java 1.6)
Java 1.6 is shipped with JAXP 1.4, which by default uses the Xalan XSLT processor that can only do XSLT 1.0 transformations (AFAIK). So I need another processor.
I chose Saxon's Home Edition, which is open source and free.

Using Saxon (but not as default)
The first thing I did was download the Saxon-HE jar and add it to my project's class path.
And: surprise - just by adding it to my class path, some other parts of the build process and application stopped working or did weird things.
e.g. the xtext XML-beautifier of Open Architecture Ware now produced different XML files where some of the tags had empty xmlns attributes.
A look at the documentation of TransformerFactory.newInstance() soon explained why this is the case:
the saxon-jar includes a META-INF/services/javax.xml.transform.TransformerFactory file, which will be checked by the TransformerFactory.newInstance() method and thus will override the previous platform default.
So I need to restore the platform default: This can "easily" be done by setting the system property "javax.xml.transform.TransformerFactory".
Default TransformerFactory for Eclipse
For all run/debug configurations inside Eclipse, you can add this property to the VM arguments:
-Djavax.xml.transform.TransformerFactory=com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl
Default TransformerFactory for Maven
For the Maven build it is not that easy to set a default system property.
I could not find a way to set system properties in a Maven POM or Maven profile. It seems the only way to do this is to pass the system property via -D to the Maven command when you run it. This is of course not an option in this case:

it would be very annoying to always type this property
I would often forget it
you could set it in the MAVEN_OPTS environment variable, but you would also have to do this on all machines (developers, cim-build, integration tests, ...)

So I had to implement my own Maven plugin that can set system properties
(if anyone knows an alternative to this, please post it in the comments).

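OK - so now I finally have the Saxon jar in the build path and everything uses the platform default as before.
So when I now need to do an XSLT 2.0 transformation, I explicitly ask for Saxon's factory class (net.sf.saxon.TransformerFactoryImpl), for example via the two-argument overload
javax.xml.transform.TransformerFactory.newInstance(String factoryClassName, ClassLoader classLoader)
to make sure that I get a Saxon TransformerFactory (instead of the system default TransformerFactory). The no-argument newInstance() would return the platform default, because that is what the system property now selects.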
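Since the system property now points at the platform default, the no-argument newInstance() returns that default. One way to request a specific implementation is the two-argument overload of newInstance() that exists since Java 1.6. The following is a minimal sketch: the class name XsltFactories and its constants are made up for this example, and the Saxon variant only works when the Saxon jar is actually on the class path.

```java
import javax.xml.transform.TransformerFactory;

/** Hypothetical helper for picking a TransformerFactory implementation by name. */
final class XsltFactories {

    /** Saxon's factory class; loadable only when the Saxon jar is on the class path. */
    static final String SAXON = "net.sf.saxon.TransformerFactoryImpl";

    /** The JDK's built-in XSLT 1.0 factory, as used in the -D setting for Eclipse. */
    static final String JDK_DEFAULT =
            "com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl";

    private XsltFactories() {}

    /**
     * Returns the named implementation without consulting the system property
     * or META-INF/services. Fails with a TransformerFactoryConfigurationError
     * if the class cannot be loaded.
     */
    static TransformerFactory byClassName(String factoryClassName) {
        // null means: load the class with the current thread's context class loader
        return TransformerFactory.newInstance(factoryClassName, null);
    }
}
```

With this in place, only the code that really needs XSLT 2.0 calls XsltFactories.byClassName(XsltFactories.SAXON); everything else keeps calling TransformerFactory.newInstance() and gets the platform default.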

Prepare user input for XSLT
The XSLT transformation needs a valid XML file as input, which is also a problem in our case.
The user input that we get out of GXT's HTML editor is HTML (and thus not valid XML).
Fortunately, you can tell AntiSamy to produce well-formed XHTML:

<directive name="useXHTML" value="true"/>

Fine, now AntiSamy will convert the following HTML input

valid html is not valid xml

to well-formed XHTML:

valid html is not valid xml

Next, a valid XML file needs an XML Declaration, so we prepend it to the user-input:

<?xml version="1.0" encoding="utf-16"?>

AFAIK, all Java strings are UTF-16 encoded, so this is used as the encoding of the XML document.

One more thing to keep in mind is that all tags in an XML file must be enclosed in ONE root tag.
So we simply surround all the user input with an arbitrary tag, e.g.

<body>USER_INPUT</body>

Note: You can tell the XSLT-stylesheet to remove this tag completely during the transformation.
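The two preparation steps (XML declaration plus a single root tag) can be sketched in a few lines. This is only an illustration with a made-up class name (XsltInput); it assumes that the fragment has already been turned into well-formed XHTML by AntiSamy, and it uses the JDK's DOM parser just to check that the result really parses.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical helper that turns a cleaned XHTML fragment into a parseable XML document. */
final class XsltInput {

    static final String XML_DECLARATION = "<?xml version=\"1.0\" encoding=\"utf-16\"?>";

    private XsltInput() {}

    /** Prepends the XML declaration and wraps the fragment in one root tag. */
    static String wrap(String xhtmlFragment) {
        return XML_DECLARATION + "<body>" + xhtmlFragment + "</body>";
    }

    /** Parses the wrapped fragment; throws if it is not well-formed. */
    static Document parse(String xhtmlFragment) {
        try {
            // Reading from a character stream: the parser takes the characters as they are
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wrap(xhtmlFragment))));
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalArgumentException("fragment is not well-formed XML", e);
        }
    }
}
```

For example, XsltInput.parse("one<br />two") succeeds, while the same text with an unclosed br tag is rejected by the parser.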

There's one more problem that we might run into: the user input may include character entity references that are defined in HTML, but not in XML, e.g. &nbsp;
If the XML parser finds such entities in the source (and has no definition), it will stop with an error.
A quick workaround is to replace all & with &amp; before the transformation and undo this replacement after the transformation (the clean solution, which is way more work, will be presented in the e-mail transformation section below).
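In code, this quick workaround is nothing more than two string replacements around the transformation. A minimal sketch (the class and method names are made up):

```java
/** Hypothetical helper for the quick ampersand workaround. */
final class AmpersandWorkaround {

    private AmpersandWorkaround() {}

    /** Call before the transformation: every & becomes &amp;, so &nbsp; turns into plain text. */
    static String protect(String xml) {
        return xml.replace("&", "&amp;");
    }

    /** Call on the transformation result to get the original references back. */
    static String restore(String xml) {
        return xml.replace("&amp;", "&");
    }
}
```

The parser then sees &amp;nbsp; which is just the text "&nbsp;" to it, and the serializer writes it back out as &amp;nbsp; so that restore() can undo the change.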

Converting HTML to plain text
This is also quite tricky and you have to take care not to change the semantics of the text.
e.g. when the HTML text contains struck-out text, it could be a disaster if you simply stripped the <strike> tag.
What I do is allow only a minimum of tags that I absolutely need and which I can easily support: e.g. bold and italic can easily be omitted in the output without a (major) change of the semantics.
I also want to support lists (both ordered and unordered).

For the actual conversion of the HTML text (which has already gone through the validation and conversion described in the sections above) to plain text (that can be used in e-mails), I use XSLT again, which makes sense since I have already set up all the infrastructure and acquired some basic knowledge.
This transformation is somewhat easier, because I can already rely on the preceding cleanup: so I only need to handle the b-tag and not strong, font-weight: bold; etc.
On the other hand, now I really need to handle all those character entity references that the HTML text may still contain.
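Here is a small sketch of what such a plain-text stylesheet could look like. It is not the stylesheet I actually use: it is XSLT 1.0 so that it runs with the JDK's built-in processor, the class name PlainTextConverter is made up, and the choice of "* " for unordered items and "1. " for ordered items is just an assumption for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical example: converts already-unified XHTML to plain text. */
final class PlainTextConverter {

    // Tags without a rule (b, i, body, ...) fall back to XSLT's built-in
    // templates, which drop the tag and keep the text inside it.
    private static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      // start every list on a new line
      + "<xsl:template match='ul|ol'><xsl:text>&#10;</xsl:text><xsl:apply-templates/></xsl:template>"
      + "<xsl:template match='ul/li'><xsl:text>* </xsl:text><xsl:apply-templates/>"
      + "<xsl:text>&#10;</xsl:text></xsl:template>"
      // number ordered items by counting the li siblings in front of them
      + "<xsl:template match='ol/li'><xsl:value-of select='count(preceding-sibling::li) + 1'/>"
      + "<xsl:text>. </xsl:text><xsl:apply-templates/><xsl:text>&#10;</xsl:text></xsl:template>"
      + "<xsl:template match='br'><xsl:text>&#10;</xsl:text></xsl:template>"
      + "</xsl:stylesheet>";

    private PlainTextConverter() {}

    /** Expects well-formed XML with a single root element. */
    static String toPlainText(String wellFormedXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(wellFormedXml)),
                    new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("input could not be transformed", e);
        }
    }
}
```

This sketch silently keeps the text of every tag it does not know, which is exactly what you do not want for something like strike. So it only makes sense after a cleanup step that has already reduced the input to the few supported tags.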

First step is to construct a valid XHTML file for the XML parser.
So, we need an XML-Declaration (which is the same as before).

Next we need to handle the character entity references.
Thus we need to provide the correct Document Type Declaration.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

This will point to the XHTML document type definition (DTD) of the W3C: if you take a look into this DTD, you'll see that some character entity definitions are imported at the top of the file. These .ent files contain the information the XML parser needs to replace the character entities with the corresponding characters.

Then I wrap the input text in html, head and body tags and we have a valid XHTML file that can be parsed and transformed.

So now it should work - but of course: it doesn't.

Use an EntityResolver
The problem is that the XML parser will try to read the document type definition file, but (at least on my machine) it can't, because it has no access to the internet. So I need to make the parser read those files locally instead of trying to reach out to the internet.
Note: even if the parser could fetch the file from the w3c-URL you should use local files instead, because local access is faster and will make your application independent of the w3c site.

The solution for this is to use an EntityResolver.
First, I had to fetch the relevant resource files and copy them into my project. These are the relevant URLs:

To use this EntityResolver, you must set it on the DocumentBuilder instance by calling DocumentBuilder.setEntityResolver().
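A resolver of this kind can be quite small. The following is a minimal sketch under a few assumptions: the class name LocalEntityResolver is made up, the local copies are expected in one directory under the same file names as in the original URLs, and unknown resources are rejected instead of being fetched from the network.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical resolver that serves DTD and .ent files from a local directory. */
final class LocalEntityResolver implements EntityResolver {

    private final Path localDir;

    LocalEntityResolver(Path localDir) {
        this.localDir = localDir;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) throws IOException {
        // e.g. ".../DTD/xhtml1-transitional.dtd" maps to <localDir>/xhtml1-transitional.dtd
        String fileName = systemId.substring(systemId.lastIndexOf('/') + 1);
        Path local = localDir.resolve(fileName);
        if (!Files.isRegularFile(local)) {
            // Design choice of this sketch: fail fast rather than go to the network
            throw new IOException("no local copy of " + systemId);
        }
        InputSource source = new InputSource(Files.newInputStream(local));
        // Keep the original URL as base, so relative references inside the DTD
        // are resolved against it and passed to this resolver again
        source.setSystemId(systemId);
        return source;
    }

    /** Parses the document with this resolver installed on the DocumentBuilder. */
    static Document parse(String xml, Path localDir) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setEntityResolver(new LocalEntityResolver(localDir));
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalArgumentException("document could not be parsed", e);
        }
    }
}
```

Matching on the file name alone is the simplest thing that works here. If you have to support several DTD versions with identical file names, a lookup keyed on the public ID would be the more robust choice.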

Length-Check Problem
Another problem to consider when using HTML text is that you cannot easily check the length of the user input on the client side.
e.g. for a plain text field, it's easy to just count the number of characters and validate against the size of your database field.

But in the case of HTML text, this is problematic, because the number of characters that the user can see does not match the number of characters that are required to store the text:
e.g. when the user writes the text 123 and applies bold formatting, the required number of characters to store the text may be

10: <b>123</b>
or maybe
20: <strong>123</strong>

And another fact to consider is that the backend-validation and transformation steps may alter the number of characters again!
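The difference between what the user sees and what has to be stored is easy to demonstrate. The sketch below is only an illustration: the class name RichTextLength is made up, and the tag stripping is a naive regular expression that is good enough for counting, but not for anything security-related.

```java
/** Hypothetical helper comparing visible and stored length of rich text. */
final class RichTextLength {

    private RichTextLength() {}

    /** Number of characters that must fit into the database column. */
    static int storedLength(String html) {
        return html.length();
    }

    /** Rough count of what the user sees: tags removed, entities not decoded. */
    static int visibleLength(String html) {
        return html.replaceAll("<[^>]*>", "").length();
    }

    /** The check that finally counts: does the stored form fit into the column? */
    static boolean fitsColumn(String html, int columnLength) {
        return storedLength(html) <= columnLength;
    }
}
```

Since the stored form is only known after the backend cleanup and unification, a check like fitsColumn() (e.g. against the 4000 characters of the description column) has to run on the server after these steps; a client-side check can only be a rough early warning.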


Monday, November 2, 2009

WYSIWYG gotchas
