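The simplest and most convenient way to escape XML is to call the escapeXml method of StringEscapeUtils from Apache's commons-lang jar. However, the method behaves differently in commons lang 2.x and commons lang 3.x.
In commons lang 2.x, StringEscapeUtils.escapeXml escapes the five XML characters ", &, <, > and ', and it also escapes every character whose Unicode value is greater than 0x7F.
StringEscapeUtils uses an Entities object for XML (Entities.XML). The characters defined in BASIC_ARRAY and APOS_ARRAY are added to that object, and any of them is escaped when it is encountered.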
BASIC_ARRAY is defined as:
private static final String[][] BASIC_ARRAY = {
    {"quot", "34"}, // " - double-quote
    {"amp", "38"},  // & - ampersand
    {"lt", "60"},   // < - less-than
    {"gt", "62"},   // > - greater-than
};
APOS_ARRAY is defined as:
private static final String[][] APOS_ARRAY = {
    {"apos", "39"}, // XML apostrophe
};
These are therefore the characters that are escaped by name. escapeXml delegates the actual work to Entities.XML.escape:
public void escape(Writer writer, String str) throws IOException {
    int len = str.length();
    for (int i = 0; i < len; i++) {
        char c = str.charAt(i);
        String entityName = this.entityName(c);
        if (entityName == null) {
            if (c > 0x7F) {
                writer.write("&#");
                writer.write(Integer.toString(c, 10));
                writer.write(';');
            } else {
                writer.write(c);
            }
        } else {
            writer.write('&');
            writer.write(entityName);
            writer.write(';');
        }
    }
}
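To see what this loop does to Chinese text, here is a minimal JDK-only sketch that mirrors it. The class name Lang2StyleEscape and the NAMES map are hypothetical stand-ins for the library's Entities object, not the library code itself.

```java
import java.util.Map;

// Hypothetical JDK-only re-creation of the commons lang 2.x loop shown above;
// it is an illustration, not the library class.
class Lang2StyleEscape {
    // The five named entities from BASIC_ARRAY and APOS_ARRAY.
    private static final Map<Character, String> NAMES = Map.of(
            '"', "quot", '&', "amp", '<', "lt", '>', "gt", '\'', "apos");

    static String escape(String str) {
        StringBuilder out = new StringBuilder(str.length() * 2);
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            String name = NAMES.get(c);
            if (name != null) {
                out.append('&').append(name).append(';');
            } else if (c > 0x7F) {
                // Every char above 0x7F becomes a decimal numeric reference.
                out.append("&#").append((int) c).append(';');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is the two-character Chinese word for "Chinese".
        System.out.println(escape("<a>\u4e2d\u6587</a>"));
        // prints &lt;a&gt;&#20013;&#25991;&lt;/a&gt;
    }
}
```

The Chinese characters come out as &#20013; and &#25991;, which is still valid XML but hard to read in the raw file.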
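If you do not want Chinese characters to be escaped, you can either write your own version based on the code above and drop the branch that escapes characters greater than 0x7F, or use the escapeXml family of methods in commons lang3. In commons lang3 these methods were redesigned around the strategy pattern. The relevant methods are escapeXml, escapeXml10 and escapeXml11.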
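The first option, writing your own version, could look roughly like the sketch below. FiveCharXmlEscape is a hypothetical helper name; it only escapes the five special characters and copies everything else unchanged.

```java
// Hypothetical helper, not part of commons lang: escapes only the five
// XML special characters and copies every other char, including CJK text.
class FiveCharXmlEscape {
    static String escape(String str) {
        StringBuilder out = new StringBuilder(str.length() + 16);
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            switch (c) {
                case '"':  out.append("&quot;"); break;
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);        // no special case for c > 0x7F
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The Chinese characters are passed through untouched.
        System.out.println(escape("<a>\u4e2d\u6587 & 'x'</a>"));
    }
}
```

Because chars are copied one by one without inspection, surrogate pairs also pass through intact. This sketch does not filter characters that are illegal in XML.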
Of these, escapeXml has been deprecated. It escapes only the five XML characters ", &, <, > and '. It does so by registering two translators, new LookupTranslator(EntityArrays.BASIC_ESCAPE()) and new LookupTranslator(EntityArrays.APOS_ESCAPE()), on ESCAPE_XML.
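In addition to those five characters, escapeXml10 replaces the control characters that XML 1.0 cannot represent, such as \b (\u0008), with the empty string. Note that \t, \n and \r are legal in XML 1.0 and are kept, which is why they are missing from the table below. XML 1.0 is a plain-text format and cannot represent the other control characters. Unpaired surrogate code points cannot be represented either, so they are removed as well. The translators registered for escapeXml10 therefore include, besides new LookupTranslator(EntityArrays.BASIC_ESCAPE()) and new LookupTranslator(EntityArrays.APOS_ESCAPE()), the following: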
new LookupTranslator(
new String[][] {
{ "\u0000", "" }, { "\u0001", "" }, { "\u0002", "" }, { "\u0003", "" }, { "\u0004", "" }, { "\u0005", "" }, { "\u0006", "" }, { "\u0007", "" }, { "\u0008", "" },
{ "\u000b", "" }, { "\u000c", "" }, { "\u000e", "" }, { "\u000f", "" }, { "\u0010", "" }, { "\u0011", "" }, { "\u0012", "" }, { "\u0013", "" }, { "\u0014", "" },
{ "\u0015", "" }, { "\u0016", "" }, { "\u0017", "" }, { "\u0018", "" }, { "\u0019", "" }, { "\u001a", "" }, { "\u001b", "" }, { "\u001c", "" }, { "\u001d", "" },
{ "\u001e", "" }, { "\u001f", "" }, { "\ufffe", "" }, { "\uffff", "" }
}),
and
new UnicodeUnpairedSurrogateRemover().
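The first handles control characters; the second handles unpaired surrogates by removing code units in the range [#xD800, #xDFFF] that are not part of a pair. In other words, escapeXml10 removes every code point outside the following ranges:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
escapeXml10 also registers the two translators NumericEntityEscaper.between(0x7f, 0x84) and NumericEntityEscaper.between(0x86, 0x9f), which escape the characters in the ranges [#x7F-#x84] | [#x86-#x9F] as numeric character references.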
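The removal and numeric-escaping rules for XML 1.0 can be approximated with plain JDK code. The following is a sketch under that assumption; Xml10Rules and its method names are hypothetical, and the named-entity escaping of the five special characters is deliberately left out.

```java
// Hypothetical JDK-only approximation of the escapeXml10 rules described above.
// The five named entities (&quot; &amp; &lt; &gt; &apos;) are not handled here.
class Xml10Rules {
    // Code points that XML 1.0 allows in a document.
    static boolean isValid(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Valid code points that are nevertheless written as numeric references.
    static boolean isNumericEscaped(int cp) {
        return (cp >= 0x7F && cp <= 0x84) || (cp >= 0x86 && cp <= 0x9F);
    }

    static String apply(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            // For an unpaired surrogate, codePointAt returns the surrogate itself.
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (!isValid(cp)) {
                continue; // dropped, like the "" entries in the lookup table
            }
            if (isNumericEscaped(cp)) {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u0008 and the lone \ud800 are removed, \t is kept, \u007f is escaped.
        System.out.println(apply("a\u0008b\tc\u007f\ud800d"));
    }
}
```

Iterating by code point rather than by char is what lets a valid surrogate pair pass through while a lone surrogate is dropped.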
For escapeXml11, XML 1.1 can represent some control characters, so the translator that handles control characters differs from the one used by escapeXml10:
new LookupTranslator(
new String[][] {
{ "\u0000", "" },
{ "\u000b", "" },
{ "\u000c", "" },
{ "\ufffe", "" },
{ "\uffff", "" }
})
escapeXml11 removes every code point outside the following ranges:
[#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
escapeXml11 also registers
NumericEntityEscaper.between(0x1, 0x8),
NumericEntityEscaper.between(0xe, 0x1f),
NumericEntityEscaper.between(0x7f, 0x84),
NumericEntityEscaper.between(0x86, 0x9f),
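These four translators escape the code points in [#x1-#x8] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F] as numeric character references. Note that [#xB-#xC] is not among them: judging by the translators listed here, the lookup table above maps \u000b and \u000c to the empty string, so those two characters are removed rather than escaped.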
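As a rough summary of the XML 1.1 rules, the sketch below classifies a single code point according to the translators listed above. Xml11Rules and Action are hypothetical names, and U+000B and U+000C are treated as removed because that is what the lookup table does.

```java
// Hypothetical classification of one code point under the escapeXml11
// translators listed above (named entities for the five special chars aside).
class Xml11Rules {
    enum Action { REMOVE, NUMERIC_ESCAPE, KEEP }

    private static boolean between(int cp, int low, int high) {
        return cp >= low && cp <= high;
    }

    static Action classify(int cp) {
        // Entries of the lookup table that map to the empty string.
        if (cp == 0x0 || cp == 0xB || cp == 0xC || cp == 0xFFFE || cp == 0xFFFF) {
            return Action.REMOVE;
        }
        // A surrogate seen as a code point on its own is unpaired.
        if (between(cp, 0xD800, 0xDFFF)) {
            return Action.REMOVE;
        }
        // The four NumericEntityEscaper ranges.
        if (between(cp, 0x1, 0x8) || between(cp, 0xE, 0x1F)
                || between(cp, 0x7F, 0x84) || between(cp, 0x86, 0x9F)) {
            return Action.NUMERIC_ESCAPE;
        }
        return Action.KEEP;
    }

    public static void main(String[] args) {
        System.out.println(classify(0x1));   // NUMERIC_ESCAPE
        System.out.println(classify(0xB));   // REMOVE
        System.out.println(classify('\t'));  // KEEP
    }
}
```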
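Those three are the main functions you will use. Next, a rough outline of how they work.
Each of the three functions uses a different translator, but all of them are instances of the AggregateTranslator class. As the name suggests, it is an aggregating translator whose job is to call a group of registered translators. Every translator extends the abstract class CharSequenceTranslator, and the escape methods all directly call the following method of CharSequenceTranslator: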
/**
 * Helper for non-Writer usage.
 * @param input CharSequence to be translated
 * @return String output of translation
 */
public final String translate(final CharSequence input) {
    if (input == null) {
        return null;
    }
    try {
        final StringWriter writer = new StringWriter(input.length() * 2);
        translate(input, writer);
        return writer.toString();
    } catch (final IOException ioe) {
        // this should never ever happen while writing to a StringWriter
        throw new RuntimeException(ioe);
    }
}
This method in turn calls:
/**
 * Translate an input onto a Writer. This is intentionally final as its algorithm is
 * tightly coupled with the abstract method of this class.
 *
 * @param input CharSequence that is being translated
 * @param out Writer to translate the text to
 * @throws IOException if and only if the Writer produces an IOException
 */
public final void translate(final CharSequence input, final Writer out) throws IOException {
    if (out == null) {
        throw new IllegalArgumentException("The Writer must not be null");
    }
    if (input == null) {
        return;
    }
    int pos = 0;
    final int len = input.length();
    while (pos < len) {
        // Starting at pos, try to translate the characters at that position and return
        // the number of code points consumed. Note: code points, not chars or code units.
        // This method is abstract in CharSequenceTranslator and must be implemented by
        // every subclass, and by contract every subclass has to handle surrogate pairs.
        // For background on surrogate pairs, see my other blog post on the length
        // concepts involved in Java char and String.
        final int consumed = translate(input, pos, out);
        if (consumed == 0) {
            // The translator had nothing to escape at this position.
            // inlined implementation of Character.toChars(Character.codePointAt(input, pos))
            // avoids allocating temp char arrays and duplicate checks
            char c1 = input.charAt(pos);
            out.write(c1);
            pos++;
            // If this position starts a surrogate pair, write both halves of the
            // supplementary character together.
            if (Character.isHighSurrogate(c1) && pos < len) {
                char c2 = input.charAt(pos);
                if (Character.isLowSurrogate(c2)) {
                    out.write(c2);
                    pos++;
                }
            }
            continue;
        }
        // contract with translators is that they have to understand codepoints
        // and they just took care of a surrogate pair
        // consumed is meant to be a count of code points, so look up how many code units
        // each code point occupies and move pos to the next code point to process.
        for (int pt = 0; pt < consumed; pt++) {
            pos += Character.charCount(Character.codePointAt(input, pos));
        }
    }
}
/**
 * Translate a set of codepoints, represented by an int index into a CharSequence,
 * into another set of codepoints. The number of codepoints consumed must be returned,
 * and the only IOExceptions thrown must be from interacting with the Writer so that
 * the top level API may reliably ignore StringWriter IOExceptions.
 *
 * @param input CharSequence that is being translated
 * @param index int representing the current point of translation
 * @param out Writer to translate the text to
 * @return int count of codepoints consumed
 * @throws IOException if and only if the Writer produces an IOException
 */
public abstract int translate(CharSequence input, int index, Writer out) throws IOException;
This is an abstract method that every subclass must implement. Through it, the translate method of AggregateTranslator can directly call the translate methods of the translators it aggregates.
The translate method of AggregateTranslator is as follows:
/**
 * The first translator to consume codepoints from the input is the 'winner'.
 * Execution stops with the number of consumed codepoints being returned.
 * {@inheritDoc}
 */
@Override
public int translate(final CharSequence input, final int index, final Writer out) throws IOException {
    for (final CharSequenceTranslator translator : translators) {
        final int consumed = translator.translate(input, index, out);
        if (consumed != 0) {
            return consumed;
        }
    }
    return 0;
}
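The "first translator wins" behaviour can be tried out without the library. The sketch below is a simplified, JDK-only model: the Translator interface and the replace, aggregate and run helpers are hypothetical stand-ins for CharSequenceTranslator, LookupTranslator, AggregateTranslator and the driver loop.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.List;

class FirstWinnerDemo {
    // Hypothetical stand-in for CharSequenceTranslator's abstract method.
    interface Translator {
        int translate(CharSequence input, int index, Writer out) throws IOException;
    }

    // Replaces one fixed char; reports one consumed code point on a match.
    static Translator replace(char from, String to) {
        return (input, index, out) -> {
            if (input.charAt(index) != from) {
                return 0;
            }
            out.write(to);
            return 1;
        };
    }

    // Same idea as AggregateTranslator: stop at the first nonzero result.
    static Translator aggregate(List<Translator> translators) {
        return (input, index, out) -> {
            for (Translator t : translators) {
                int consumed = t.translate(input, index, out);
                if (consumed != 0) {
                    return consumed;
                }
            }
            return 0;
        };
    }

    // Simplified driver loop: copy untouched code points, skip consumed ones.
    static String run(Translator t, CharSequence input) {
        StringWriter out = new StringWriter();
        try {
            int pos = 0;
            while (pos < input.length()) {
                int consumed = t.translate(input, pos, out);
                if (consumed == 0) {
                    int cp = Character.codePointAt(input, pos);
                    out.write(Character.toChars(cp));
                    pos += Character.charCount(cp);
                } else {
                    for (int pt = 0; pt < consumed; pt++) {
                        pos += Character.charCount(Character.codePointAt(input, pos));
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected with a StringWriter
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Translator xml = aggregate(List.of(
                replace('<', "&lt;"),
                replace('<', "IGNORED"), // never reached: the first match wins
                replace('&', "&amp;")));
        System.out.println(run(xml, "a<b&c")); // prints a&lt;b&amp;c
    }
}
```

The order of registration matters: a translator placed later in the list never sees input that an earlier one has already consumed.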
The translate method of LookupTranslator is as follows:
@Override
public int translate(final CharSequence input, final int index, final Writer out) throws IOException {
    // Compare starting at position index of input and return as soon as one match is found.
    // check if translation exists for the input at position index
    if (prefixSet.contains(input.charAt(index))) {
        int max = longest;
        if (index + longest > input.length()) {
            max = input.length() - index;
        }
        // Try the longest candidate string first.
        // implement greedy algorithm by trying maximum match first
        for (int i = max; i >= shortest; i--) {
            final CharSequence subSeq = input.subSequence(index, index + i);
            final String result = lookupMap.get(subSeq.toString());
            if (result != null) {
                out.write(result);
                return i;
            }
        }
    }
    return 0;
}
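That is the concrete implementation. However, I think this function has a problem: it returns the length in chars, not the number of code points. If a key in the lookup table contains a supplementary character, then consider this part of the translate method of CharSequenceTranslator: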
// contract with translators is that they have to understand codepoints
// and they just took care of a surrogate pair
for (int pt = 0; pt < consumed; pt++) {
    pos += Character.charCount(Character.codePointAt(input, pos));
}
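To make the concern concrete, the sketch below runs that same advancing loop with a key made of one supplementary character. ConsumedCountDemo and advance are hypothetical names; the loop body is copied from the snippet above.

```java
class ConsumedCountDemo {
    // Same advancing loop as above: consumed is interpreted as code points.
    static int advance(CharSequence input, int pos, int consumed) {
        for (int pt = 0; pt < consumed; pt++) {
            pos += Character.charCount(Character.codePointAt(input, pos));
        }
        return pos;
    }

    public static void main(String[] args) {
        String key = "\ud83d\ude00";      // U+1F600: one code point, two chars
        String input = key + "ab";

        int asChars = key.length();                              // 2
        int asCodePoints = key.codePointCount(0, key.length());  // 1

        // Returning the char length moves pos past 'a' as well.
        System.out.println(advance(input, 0, asChars));       // prints 3
        // Returning the code point count stops right after the key.
        System.out.println(advance(input, 0, asCodePoints));  // prints 2
    }
}
```

With the char length as the return value, the character 'a' that follows the key is skipped and never written to the output. If the key were the last thing in the input, the second iteration would call Character.codePointAt past the end and throw an IndexOutOfBoundsException.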