Reposted from: http://www.cnblogs.com/itech/archive/2011/03/27/1996883.html
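I. Strings in Python 2.6
1) String types and how they relate (in 2.x, the default string type is str)
2) The Python built-in functions documentation describes basestring, str and unicode as follows: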
basestring()
This abstract type is the superclass for str and unicode. It cannot be called or instantiated, but it can be used to test whether an object is an instance of str or unicode. isinstance(obj, basestring) is equivalent to isinstance(obj, (str, unicode)).
str([object])
Return a string containing a nicely printable representation of an object. For strings, this returns the string itself. The difference with repr(object) is that str(object) does not always attempt to return a string that is acceptable to eval(); its goal is to return a printable string. If no argument is given, returns the empty string, ''.
unicode([object[, encoding[, errors]]])
Return the Unicode string version of object using one of the following modes:
If encoding and/or errors are given, unicode() will decode the object which can either be an 8-bit string or a character buffer using the codec for encoding. The encoding parameter is a string giving the name of an encoding; if the encoding is not known, LookupError is raised. Error handling is done according to errors; this specifies the treatment of characters which are invalid in the input encoding. If errors is 'strict' (the default), a ValueError is raised on errors, while a value of 'ignore' causes errors to be silently ignored, and a value of 'replace' causes the official Unicode replacement character, U+FFFD, to be used to replace input characters which cannot be decoded. See also the codecs module.
If no optional parameters are given, unicode() will mimic the behaviour of str() except that it returns Unicode strings instead of 8-bit strings. More precisely, if object is a Unicode string or subclass it will return that Unicode string without any additional decoding applied.
For objects which provide a __unicode__() method, it will call this method without arguments to create a Unicode string. For all other objects, the 8-bit string version or representation is requested and then converted to a Unicode string using the codec for the default encoding in 'strict' mode.
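The error modes described above can be tried out with a short sketch. It uses bytes.decode(encoding, errors), which takes the same encoding and errors arguments as unicode(object, encoding, errors), so that it runs on both Python 2.6+ and Python 3. The sample bytes and the codec name 'no-such-codec' are made up for illustration.

```python
# -*- coding: utf-8 -*-
# Sketch of the 'strict', 'ignore' and 'replace' error modes.

gbk_bytes = b'ab\xd6\xd0'  # 'ab' followed by one GBK-encoded character (U+4E2D)

decoded = gbk_bytes.decode('gbk')                # correct codec: decodes cleanly
ignored = gbk_bytes.decode('ascii', 'ignore')    # undecodable bytes are dropped
replaced = gbk_bytes.decode('ascii', 'replace')  # each bad byte becomes U+FFFD

try:
    gbk_bytes.decode('ascii')                    # 'strict' is the default mode
    strict_error = None
except ValueError as exc:                        # UnicodeDecodeError subclasses ValueError
    strict_error = type(exc).__name__

try:
    gbk_bytes.decode('no-such-codec')            # hypothetical, unknown encoding name
    lookup_error = None
except LookupError as exc:
    lookup_error = type(exc).__name__

print(repr((decoded, ignored, replaced, strict_error, lookup_error)))
```

Only repr() output is printed, so the result does not depend on the console encoding.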
II. print
Help for the print function in 2.6 (the print() function is essentially equivalent to the print '' statement):
print([object, ...][, sep=' '][, end='\n'][, file=sys.stdout])
Print object(s) to the stream file, separated by sep and followed by end. sep, end and file, if present, must be given as keyword arguments.
All non-keyword arguments are converted to strings like str() does and written to the stream, separated by sep and followed by end. Both sep and end must be strings; they can also be None, which means to use the default values. If no object is given, print() will just write end.
The file argument must be an object with a write(string) method; if it is not present or None, sys.stdout will be used.
Note
This function is not normally available as a builtin since the name print is recognized as the print statement. To disable the statement and use the print() function, use this future statement at the top of your module:
from __future__ import print_function
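As a small sketch of the signature quoted above, the following passes sep, end and file as keyword arguments. Collector is a hypothetical class introduced only for this example; any object with a write(string) method should do.

```python
from __future__ import print_function  # needed on 2.6+, harmless on 3.x


class Collector(object):
    """Hypothetical minimal stream: records everything passed to write()."""

    def __init__(self):
        self.parts = []

    def write(self, s):
        self.parts.append(s)


out = Collector()
print('a', 1, None, sep='-', end='!', file=out)  # objects are converted like str()
print(file=out)                                  # no objects: only end is written
result = ''.join(out.parts)
```

Here result collects everything print() wrote, which makes the effect of sep and end easy to inspect.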
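print supports both str and unicode. Python's print automatically encodes the text it outputs (a unicode object is encoded with the output stream's encoding), whereas a file object's write method performs no such conversion. This is why text that mixes Chinese and English can be printed correctly without encoding it by hand.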
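The difference between print and write can be sketched as follows, under the assumption that an in-memory byte stream (io.BytesIO) stands in for a file opened in binary mode. The fallback to UTF-8 when the stream reports no encoding is my own choice for the example, not something print itself is documented to do.

```python
# -*- coding: utf-8 -*-
import io
import sys

text = u'Python \u4e2d\u56fd'  # mixed English and Chinese text

# A byte stream stores exactly what it is given, so the text is encoded
# explicitly before write() is called.
raw = io.BytesIO()
raw.write(text.encode('utf-8'))
stored = raw.getvalue()

# Roughly what print does for us: pick the output stream's encoding.
# The 'utf-8' fallback is an assumption made for this sketch.
stdout_encoding = getattr(sys.stdout, 'encoding', None) or 'utf-8'
preview = text.encode(stdout_encoding, 'replace')
print(repr(preview))
```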
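III. codecs
The decode(char_set) method converts text in another encoding to Unicode, and the encode(char_set) method converts Unicode to another encoding.
For working with character encodings, the codecs module provides a lookup function: it takes the name of an encoding and returns references to the corresponding encoder and decoder functions and the StreamReader and StreamWriter classes. To simplify calls to lookup, codecs also provides getencoder(encoding), getdecoder(encoding), getreader(encoding) and getwriter(encoding). Going one step further, to simplify access to the StreamReader, StreamWriter and StreamReaderWriter for a particular encoding, codecs provides open directly: pass the encoding name through the encoding parameter and you get encoding and decoding in both directions.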
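The functions named above can be exercised with a minimal sketch. It uses in-memory byte streams instead of real files; codecs.open applies the same stream classes to a file opened by path. The sample text is chosen for illustration.

```python
# -*- coding: utf-8 -*-
import codecs
import io

text = u'\u4e2d\u56fd'  # the two characters U+4E2D U+56FD

info = codecs.lookup('gbk')               # CodecInfo bundling encoder, decoder and stream classes
encoded, consumed = info.encode(text)     # stateless encoder returns (bytes, characters consumed)

utf8_encode = codecs.getencoder('utf-8')  # shortcut for lookup('utf-8').encode
gbk_decode = codecs.getdecoder('gbk')     # shortcut for lookup('gbk').decode
utf8_bytes, _ = utf8_encode(text)
decoded, _ = gbk_decode(encoded)

# The stream classes wrap a byte stream and transcode on write and read.
raw = io.BytesIO()
writer = codecs.getwriter('gbk')(raw)
writer.write(text)                        # text in, GBK bytes stored in raw
reader = codecs.getreader('gbk')(io.BytesIO(raw.getvalue()))
read_back = reader.read()                 # GBK bytes in, text out
```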
IV. Example
Code:
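What follows is an illustrative sketch, not the article's original listing. It assumes the file is saved as UTF-8 to match its coding line, and it is written to run on Python 2.6+ as well as Python 3, where bytes plays the role of the 2.x str and str the role of the 2.x unicode. The name string_types is introduced here only for portability.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch only; the file is assumed to be saved as UTF-8.

try:
    string_types = (basestring,)      # Python 2: base class of str and unicode
except NameError:
    string_types = (str, bytes)       # Python 3 stand-in (assumption for portability)

us = u'中国'                           # decoded using the coding line above
gbk_bytes = us.encode('gbk')          # unicode -> GBK-encoded byte string
utf8_bytes = us.encode('utf-8')       # unicode -> UTF-8-encoded byte string

for value in (us, gbk_bytes, utf8_bytes):
    # Only ASCII is printed here, so the console encoding does not matter.
    print('%-8s is_string=%s len=%d' % (type(value).__name__,
                                        isinstance(value, string_types),
                                        len(value)))

round_trip = gbk_bytes.decode('gbk')  # GBK bytes -> unicode again
print(round_trip == us)
```

The same two characters occupy 2, 4 and 6 units as unicode, GBK bytes and UTF-8 bytes respectively, which is a quick way to tell which kind of string you are holding.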
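V. Summary
1) If a Python source file contains Chinese string literals, it must include # -*- coding: utf-8 -*- at the very top (on the first or second line) so that Python parses the file as UTF-8; the file itself must also be saved as UTF-8;
2) Use isinstance(obj, basestring), which is equivalent to isinstance(obj, (str, unicode)), to test whether an object is a string;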
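3) us = u'中国' gave a wrong result in the author's test; this happens when the encoding the file is actually saved in does not match its coding declaration. In that case use us2 = unicode('中国', 'gbk'), naming the encoding the bytes are really in, to decode the Chinese text into a correct unicode string;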
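4) str and unicode strings cannot be safely concatenated or compared when the str contains non-ASCII bytes: Python 2 implicitly decodes the str with the default (ASCII) codec, so concatenation raises UnicodeDecodeError and comparison issues a UnicodeWarning and treats the two as unequal;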
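5) print supports both str and unicode, and encodes and outputs the strings correctly as long as the output stream's encoding can represent them;
6) Use unicode.encode or str.decode to convert between unicode and str; the encode and decode functions in codecs can also be used for the conversion.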
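7) It appears that GBK text only works properly on a Chinese-language system or after a Chinese language pack has been installed. Since the gbk codec ships with Python itself, this most likely concerns the console's ability to display the characters rather than the decoding step.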
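Points 4) and 6) can be sketched as follows. The code is written to run on both Python 2.6+ and Python 3, so the failure when mixing the two string kinds shows up as UnicodeDecodeError on 2.x and as TypeError on 3.x. The variable names are chosen for illustration.

```python
# -*- coding: utf-8 -*-
import codecs

text = u'\u4e2d\u56fd'
gbk_bytes = text.encode('gbk')           # unicode.encode: text -> bytes

try:
    combined = gbk_bytes + text          # mixing encoded bytes with text
    failure = None
except (UnicodeDecodeError, TypeError) as exc:
    combined = None
    failure = type(exc).__name__         # UnicodeDecodeError on 2.x, TypeError on 3.x

joined = gbk_bytes.decode('gbk') + text  # decode first, then combine

# The module-level codecs functions perform the same conversions.
via_codecs = codecs.decode(codecs.encode(text, 'utf-8'), 'utf-8')
```

Decoding at the boundary and working with unicode everywhere else avoids the implicit ASCII conversion altogether.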
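This is something you must always do.
Typing a u before the opening quote feels unnatural at first, and you will often forget it and have to go back and add it, but doing so can eliminate about 90% of encoding problems. If encoding issues are not causing you much trouble, you can ignore this item.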
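The point about MBCS does not mean that encodings such as GBK cannot be used; it means you should not use the Python codec named 'MBCS', which exists only on Windows, unless the program will never be ported.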
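The end!
Thanks!
Author: iTech
Source: http://itech.cnblogs.com/
Copyright of this article belongs to the author, iTech. Reposts must include the author's signature and the source, and commercial use is not permitted; otherwise legal action will be pursued.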