You want to allow untrusted users to supply HTML for output on your website (e.g. as comment submission). You need to clean this HTML to avoid cross-site scripting (XSS) attacks.
Use the jsoup HTML Cleaner
with a configuration specified by a Whitelist
.
String unsafe = "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>"; String safe = Jsoup.clean(unsafe, Whitelist.basic()); // now: <p><a href="http://example.com/" rel="nofollow">Link</a></p>
A cross-site scripting attack against your site can really ruin your day, not to mention your users'. Many sites avoid XSS attacks by not allowing HTML in user submitted content: they enforce plain text only, or use an alternative markup syntax like wiki-text or Markdown. These are seldom optimal solutions for the user, as they lower expressiveness, and force the user to learn a new syntax.
A better solution may be to use a rich text WYSIWYG editor (like CKEditor or TinyMCE). These output HTML, and allow the user to work visually. However, their validation is done on the client side: you need to apply a server-side validation to clean up the input and ensure the HTML is safe to place on your site. Otherwise, an attacker can avoid the client-side Javascript validation and inject unsafe HMTL directly into your site
The jsoup whitelist sanitizer works by parsing the input HTML (in a safe, sand-boxed environment), and then iterating through the parse tree and only allowing known-safe tags and attributes (and values) through into the cleaned output.
It does not use regular expressions, which are inappropriate for this task.
jsoup provides a range of Whitelist
configurations to suit most requirements; they can be modified if necessary, but take care.
The cleaner is useful not only for avoiding XSS, but also in limiting the range of elements the user can provide: you may be OK with textual a
, strong
elements, but not structural div
ortable
elements.
See the XSS cheat sheet and filter evasion guide, as an example of how regular-expression filters don't work, and why a safe whitelist parser-based sanitizer is the correct approach.
See the Cleaner
reference if you want to get a Document
instead of a String return
See the Whitelist
reference for the different canned options, and to create a custom whitelist
The nofollow link attribute