Normalize strings before validating them

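Applications routinely validate strings against various rules. Strings in Java are based on Unicode, however: Java 6 uses Unicode 4.0 and Java 7 uses Unicode 6.0.0. The same text can have more than one Unicode representation, so a string should be normalized before it is validated.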

Unicode defines several normalization forms, and each one processes text somewhat differently.

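	NFC
	Unicode Normalization Form C: canonical decomposition, followed by canonical composition.
	NFD
	Unicode Normalization Form D: canonical decomposition.
	NFKC
	Unicode Normalization Form KC: compatibility decomposition, followed by canonical composition.
	NFKD
	Unicode Normalization Form KD: compatibility decomposition.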
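
To make the difference between the four forms concrete, here is a minimal sketch that applies each form to two sample strings and prints the resulting code points. The class name `NormalizationForms`, the helper `codePoints`, and the sample characters (U+00E9, a precomposed "e" with acute accent, and U+FB01, the "fi" ligature) are chosen purely for illustration and do not come from the original post.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class NormalizationForms {

    // Illustrative helper: renders the code points of s as "U+XXXX" labels.
    public static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample inputs: a precomposed accented letter and a ligature.
        String[] samples = { "\u00E9", "\uFB01" };
        for (String sample : samples) {
            for (Form form : Form.values()) {
                String normalized = Normalizer.normalize(sample, form);
                System.out.println(form + ": " + codePoints(sample)
                        + " -> " + codePoints(normalized));
            }
        }
    }
}
```

The canonical forms (NFC, NFD) leave the ligature U+FB01 untouched, while the compatibility forms (NFKC, NFKD) replace it with the two letters U+0066 U+0069. The accented letter U+00E9 is split into U+0065 U+0301 by the decomposing forms (NFD, NFKD) and kept precomposed by NFC and NFKC. Compatibility mappings like these are why NFKC is the form used for validation in the examples that follow.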
	
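If the input string is validated first and normalized afterwards, the check can be bypassed. Normalizer.normalize converts Unicode text into its equivalent normalized form. In the code below, Pattern.compile("[<>]") finds no match in the raw input, so the check lets it through, and the NFKC normalization that runs afterwards turns the string into <script>: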

	// String s may be user controllable
	// \uFE64 is normalized to < and \uFE65 is normalized to > using NFKC
	String s = "\uFE64" + "script" + "\uFE65";
	// Validate
	Pattern pattern = Pattern.compile("[<>]"); // Check for angle brackets
	Matcher matcher = pattern.matcher(s);
	if (matcher.find()) {  
	  // Found black listed tag
	  throw new IllegalStateException();
	} else {
	  // . . .
	}
	// Normalize
	s = Normalizer.normalize(s, Form.NFKC);
	
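If the input string is normalized first and validated afterwards, Pattern.compile("[<>]") detects the angle brackets, an IllegalStateException is thrown, and the malicious input is correctly rejected: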

	String s = "\uFE64" + "script" + "\uFE65";
	// Normalize
	s = Normalizer.normalize(s, Form.NFKC);
	// Validate
	Pattern pattern = Pattern.compile("[<>]");
	Matcher matcher = pattern.matcher(s);
	if (matcher.find()) {
	  // Found black listed tag
	  throw new IllegalStateException();
	} else {
	  // . . .
	}
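
The two fragments above can be combined into one runnable sketch that shows both behaviors side by side. The class name `NormalizeThenValidate` and the helpers `rawInputPassesCheck` and `sanitize` are hypothetical names introduced for this illustration; the pattern, the NFKC form, and the sample input are taken from the snippets above.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;

public class NormalizeThenValidate {

    // Same check as in the snippets above: look for angle brackets.
    private static final Pattern ANGLE_BRACKETS = Pattern.compile("[<>]");

    // Validation on the raw, non-normalized input (the flawed order).
    public static boolean rawInputPassesCheck(String input) {
        return !ANGLE_BRACKETS.matcher(input).find();
    }

    // Hypothetical helper: normalize to NFKC first, then validate.
    public static String sanitize(String input) {
        String normalized = Normalizer.normalize(input, Form.NFKC);
        if (ANGLE_BRACKETS.matcher(normalized).find()) {
            throw new IllegalStateException("angle bracket found after normalization");
        }
        return normalized;
    }

    public static void main(String[] args) {
        String s = "\uFE64" + "script" + "\uFE65";

        // The raw input contains no ASCII angle brackets, so it passes.
        System.out.println("raw input passes check: " + rawInputPassesCheck(s));

        // After NFKC normalization the same input is rejected.
        try {
            sanitize(s);
            System.out.println("accepted");
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Returning the normalized string from `sanitize` is a deliberate choice in this sketch: the rest of the program then works with exactly the text that was validated, not with the raw input.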
	
The Normalizer class in Java

	public final class Normalizer {

	   private Normalizer() {};

		/**
		 * This enum provides constants of the four Unicode normalization forms
		 * that are described in
		 * <a href="http://www.unicode.org/unicode/reports/tr15/tr15-23.html">
		 * Unicode Standard Annex #15 &mdash; Unicode Normalization Forms</a>
		 * and two methods to access them.
		 *
		 * @since 1.6
		 */
		public static enum Form {

			/**
			 * Canonical decomposition.
			 */
			NFD,

			/**
			 * Canonical decomposition, followed by canonical composition.
			 */
			NFC,

			/**
			 * Compatibility decomposition.
			 */
			NFKD,

			/**
			 * Compatibility decomposition, followed by canonical composition.
			 */
			NFKC
		}

		/**
		 * Normalize a sequence of char values.
		 * The sequence will be normalized according to the specified normalization
		 * from.
		 * @param src        The sequence of char values to normalize.
		 * @param form       The normalization form; one of
		 *                   {@link java.text.Normalizer.Form#NFC},
		 *                   {@link java.text.Normalizer.Form#NFD},
		 *                   {@link java.text.Normalizer.Form#NFKC},
		 *                   {@link java.text.Normalizer.Form#NFKD}
		 * @return The normalized String
		 * @throws NullPointerException If <code>src</code> or <code>form</code>
		 * is null.
		 */
		public static String normalize(CharSequence src, Form form) {
			return NormalizerBase.normalize(src.toString(), form);
		}

		/**
		 * Determines if the given sequence of char values is normalized.
		 * @param src        The sequence of char values to be checked.
		 * @param form       The normalization form; one of
		 *                   {@link java.text.Normalizer.Form#NFC},
		 *                   {@link java.text.Normalizer.Form#NFD},
		 *                   {@link java.text.Normalizer.Form#NFKC},
		 *                   {@link java.text.Normalizer.Form#NFKD}
		 * @return true if the sequence of char values is normalized;
		 * false otherwise.
		 * @throws NullPointerException If <code>src</code> or <code>form</code>
		 * is null.
		 */
		public static boolean isNormalized(CharSequence src, Form form) {
			return NormalizerBase.isNormalized(src.toString(), form);
		}
	}
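
Besides normalize, the class listed above also exposes isNormalized. A minimal sketch of how the two might be used together follows; the class name `IsNormalizedDemo` and the helper `toNfkc` are hypothetical and only illustrate the calls.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class IsNormalizedDemo {

    // Hypothetical helper: returns the input unchanged when it is already
    // in NFKC, and the NFKC-normalized string otherwise.
    public static String toNfkc(String input) {
        if (Normalizer.isNormalized(input, Form.NFKC)) {
            return input;
        }
        return Normalizer.normalize(input, Form.NFKC);
    }

    public static void main(String[] args) {
        // Plain ASCII text is already normalized.
        System.out.println(Normalizer.isNormalized("script", Form.NFKC));
        // U+FE64 has a compatibility mapping, so it is not in NFKC.
        System.out.println(Normalizer.isNormalized("\uFE64", Form.NFKC));
        System.out.println(toNfkc("\uFE64" + "script" + "\uFE65"));
    }
}
```

Both methods throw NullPointerException when either argument is null, as the Javadoc above states, so a null check belongs before the call when the input may be absent.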
