吓人的鸟

Python study notes

Official site: http://www.python.org/

Official library reference: http://docs.python.org/library/

PyPI: https://pypi.python.org/pypi

Chinese manual, good for getting started quickly: http://download.csdn.net/detail/xiarendeniao/4236870

Python Cookbook, Chinese edition: http://download.csdn.net/detail/XIARENDENIAO/3231793

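1. Numbers (complex numbers in particular) are convenient, string operations are slick, lists

   a = complex(1,0.4)
   a.real
   a.imag

unicode()

Prefixing a string with r/R makes it a raw string; prefixing it with u/U makes it a unicode string

The list method append() adds a new element to the end of the list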

2. Flow control

while:
if:
	if xxx:
		...
	elif yyy:
		...
	elif zzz:
		...
	else:
		...
for
range()
break  continue  else clause on loops
pass

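3. Functions
   1) def funA(para): with no return statement the function returns None; arguments are passed as object references
   2) Default arguments: be careful when the default value is a list, a dict, or a class instance (the default is evaluated only once, when the function is defined)
   3) Variable arguments: def funB(king, *arguments, **keywords). Positional values without a keyword are collected in the tuple arguments; keyword/value pairs are collected in the dict keywords. This is really a combination of tuple packing and sequence unpacking.
   4) Given def funC(para1, para2, para3), the call funC(*list) spreads the list elements out into separate function arguments
   5) Anonymous functions: lambda arg1, arg2, ...: expression
      Characteristics: creates a function object without binding it to an identifier (unlike def); lambda is an expression, not a statement; only a single expression may follow the ":"
   6) if 'ok' in ('y', 'ye', 'yes'): xxxxx    usage of the keyword in
   7) f = lambda x: x*2 is equivalent to def f(x): return x*2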
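The default-argument pitfall from point 2) and the packing/unpacking from points 3) and 4) can be sketched as follows. This is a minimal illustration in Python 3 syntax (an assumption; the notes themselves target Python 2), and the names append_shared, append_fresh and show_args are made up for the example.

```python
def append_shared(item, bucket=[]):
    # Pitfall: the default list is created once, at definition time,
    # and is shared by every call that omits `bucket`.
    bucket.append(item)
    return bucket


def append_fresh(item, bucket=None):
    # Common workaround: use None as a sentinel and build a new list per call.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


def show_args(king, *arguments, **keywords):
    # Extra positional values arrive as a tuple, extra keyword values as a dict.
    return king, arguments, keywords


first = append_shared(1)
second = append_shared(2)          # same list object as `first`
fresh_a = append_fresh(1)
fresh_b = append_fresh(2)          # independent lists

packed = show_args("k", 1, 2, x=3)
spread = show_args(*["k", 1, 2])   # the list is unpacked into arguments
```

Using None as the sentinel keeps the function signature the same while avoiding state that leaks between calls.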

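4. Data structures
   1) [] help(list) append(x) extend(L) insert(i,x) remove(x) pop([i]) index(x) count(x) sort() reverse()
   2) Functional-style tools for lists: filter() map() reduce()
   3) List comprehension: aimTags = [aimTag for aimTag in aimTags if aimTag not in filterAimTags]
   4) del removes a list slice or an entire variable
   5) () help(tuple) Tuples: like strings, their elements cannot be changed. Tuples, strings and lists are all sequences. Python requires a trailing comma in a single-element tuple to remove the ambiguity with a parenthesized expression. Beginners often get this wrong.
   6) {} help(dict) Dictionaries: keys() has_key(). A dict can be built directly from a list of key/value tuples.
   7) Looping over a dict: for k, v in xxx.iteritems():... for item in xxx.items():... Over a sequence: for i, v in enumerate(['tic', 'tac', 'toe']):... Over several sequences at once: for q, a in zip(questions, answers):...
   8) in   not in   is   is not
   9) Sequence objects of the same type can be compared with < > ==
   10) Two ways to check a variable's type: isinstance(var, int) and type(var).__name__ == "int"
       To check against several types: isinstance(s, (str, unicode)) returns True when s is either a regular string or a unicode string
   11) Be especially careful when deleting list elements inside a loop. for i in listA:... listA.remove(i) goes wrong, because after a deletion the following elements shift forward and the loop skips one; for i in range(len(listA)):... del listA[i] also goes wrong, because the length changes after a deletion and the index runs out of range
       filter(lambda x: x != 4, listA) is a more elegant way
       listA = [i for i in listA if i != 4] is also fine; either way, simply build a new list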
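The deletion problem in point 11) is easy to reproduce. The sketch below (Python 3 syntax, an assumption; function names are illustrative only) shows the skipped element, then two ways around it: rebuilding the list, and deleting by index from the back so that earlier indexes stay valid.

```python
def remove_while_iterating(values, unwanted):
    # Deliberately buggy: mutating the list being iterated skips elements.
    values = list(values)
    for v in values:
        if v == unwanted:
            values.remove(v)
    return values


def remove_by_rebuilding(values, unwanted):
    # Build a new list instead of mutating the old one.
    return [v for v in values if v != unwanted]


def remove_in_place(values, unwanted):
    # Walk the indexes backwards so a deletion never shifts unvisited items.
    for i in range(len(values) - 1, -1, -1):
        if values[i] == unwanted:
            del values[i]
    return values


data = [1, 4, 4, 5, 4]
buggy = remove_while_iterating(data, 4)      # one 4 survives
rebuilt = remove_by_rebuilding(data, 4)
in_place = remove_in_place(list(data), 4)
```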

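Efficiency:
1) "if k in my_dict" is better than "if my_dict.has_key(k)"

2) "for k in my_dict" is better than "for k in my_dict.keys()", and also better than "for k in [....]"

12) A set is implemented much like a dict (a hash table that stores only keys) https://docs.python.org/2/library/stdtypes.html#set-types-set-frozenset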

>>> s1 = set([1,2,3,4,5]) 
>>> s2 = set([3,4,5,6,7,8]) 
>>> s1|s2
set([1, 2, 3, 4, 5, 6, 7, 8])
>>> s1-s2
set([1, 2])
>>> s2-s1
set([8, 6, 7])

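5. Modules
   1) A module's name is available in the global variable __name__. The file fibo.py can be imported as the module fibo with import fibo, from another file or from the interpreter; its functions are then called with the module prefix, e.g. fibo.fib()
   2) Import variant: from fibo import fib, fib2, after which the functions can be used directly without the prefix
   3) sys.path   sys.ps1   sys.ps2
   4) The built-in function dir() lists the names a module defines; it returns a sorted list of strings naming everything: variables, modules, functions, and so on
      help() serves a similar purpose
   5) Packages: import packet1.packet2.module       from packet1.packet2 import module       from packet1.packet2.module import functionA
   6) When from package import * is executed, if the package's __init__.py defines a list named __all__, the modules named in that list are imported
   7) sys.path holds the paths currently searched for Python libraries; a program can add a new search path with sys.path.append("/xxx/xxx/xxx")
   8) Python modules can be installed with easy_install; to uninstall, use easy_install -m pkg_name
   9) __doc__ gives the documentation string of a module, function or object; __name__ gives its name (typical usage: if __name__ == '__main__': ...)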

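6. IO

1) str() unicode() repr() print rjust() ljust() center() zfill() xxx%v xxx%(v1,v2)  The pprint module is handy for printing complex objects (very useful when debugging)

For a user-defined type to work with pprint, it needs to provide a __repr__ method. If you do not want the result sent straight to standard output (pprint.pprint(var)), use pprint.pformat(var) to get it as a string.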
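As a small illustration of the pprint remark: the sketch below defines a hypothetical Point class with a __repr__ and formats a nested structure with pprint.pformat, so the text can go to a log instead of standard output. Python 3 syntax is assumed.

```python
import pprint


class Point(object):
    """Hypothetical user-defined type used only for this example."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        # pprint relies on repr() for objects it does not know how to lay out.
        return "Point(x=%r, y=%r)" % (self.x, self.y)


config = {"origin": Point(0, 0), "path": [Point(1, 2), Point(3, 4)]}

# pformat returns the formatted text instead of printing it.
text = pprint.pformat(config)
```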

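   2) f = open("fileName", "w")  modes: w r a r+; on Windows and Macintosh there is also a "b" (binary) mode
      f.read(size)
      f.readline()
      f.write(string)
      f.writelines(list)
      f.tell()
      f.seek(offset, from_what)  from_what: 0 = start of file, 1 = current position, 2 = end of file; offset: number of bytes  http://www.linuxidc.com/Linux/2007-12/9644p3.htm
      f.close()

The linecache module makes it easy to fetch a given line of a file, but be careful using it in an HTTP server: it caches whole files in memory, so with large files and concurrent requests it can easily exhaust the machine's memory and bring the system down (a lesson I learned the hard way)

shutil is convenient for file operations

os.walk() traverses all files under a directory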
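The seek()/tell() arguments and os.walk() can be tried out on a throwaway file. This is a sketch in Python 3 syntax (an assumption), using a temporary directory so nothing is left behind; the file name demo.bin is arbitrary.

```python
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "demo.bin")

with open(path, "wb") as f:
    f.write(b"0123456789")

# Binary mode, so relative seeks (from_what 1 and 2) are allowed.
with open(path, "rb") as f:
    f.seek(3, 0)              # 3 bytes from the start of the file
    from_start = f.read(2)    # reads bytes 3 and 4
    f.seek(2, 1)              # skip 2 more bytes from the current position
    after_skip = f.tell()
    f.seek(-1, 2)             # 1 byte before the end of the file
    last_byte = f.read()

# os.walk yields (directory, subdirectories, files) for every directory level.
found = []
for dirpath, dirnames, filenames in os.walk(workdir):
    for name in filenames:
        found.append(os.path.join(dirpath, name))

shutil.rmtree(workdir)        # shutil removes the whole tree in one call
```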

   3) The pickle module (it is not limited to writing to files)
   Pickling is similar to serialization in PHP: pickle.dump(objectX, fileHandle)
   Unpickling is similar to deserialization in PHP: objectX = pickle.load(fileHandle)

   msgpack (easy_install msgpack-python) is somewhat nicer to use than both pickle and cPickle, and faster
   msgpack.dump(my_var, file('test_file_name','wb'))
   msgpack.load(file('test_file_name','rb'))
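To illustrate that pickle is not tied to files, the sketch below serializes to an in-memory byte string with pickle.dumps/pickle.loads and to a file-like io.BytesIO object with pickle.dump/pickle.load. Python 3 syntax is assumed and the record contents are made up.

```python
import io
import pickle

record = {"name": "xds", "scores": [1, 2, 3], "nested": ("a", None)}

# dumps/loads work on byte strings, no file involved.
blob = pickle.dumps(record)
restored = pickle.loads(blob)

# dump/load accept any file-like object opened for binary I/O.
buffer = io.BytesIO()
pickle.dump(record, buffer)
buffer.seek(0)
from_stream = pickle.load(buffer)
```

One caveat worth keeping in mind: unpickling data from an untrusted source can execute arbitrary code, so this is only suitable for data you produced yourself.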

4) raw_input() reads input from the user

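7. class
1) Member variables and member functions whose names start with two underscores and end with at most one underscore are private (the name is mangled to _ClassName__name), so a parent class's private members cannot be accessed under their plain name in a subclass

2) Calling a parent class's method: 1> ParentClass.FuncName(self, args) 2> super(ChildName, self).FuncName(args). The second form requires that the class ultimately inherits from object; otherwise super raises an error

3) To define a static method, put @staticmethod on the line before the method definition. It can then be called directly through the class name.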

#!/bin/python
#encoding=utf8
class A(object):
        def __init__(self, a, b):
                self.a = a
                self.b = b
        def show(self):
                print "A::show() a=%s b=%s" % (self.a,self.b)

class B(A):
        def __init__(self, a, b, c):
                #A.__init__(self,a,b)
                super(B,self).__init__(a,b) # this use of super requires the parent class to inherit from object
                self.c = c

if __name__ == "__main__":
        b = B(1,2,3) 
        print b.a,b.b,b.c
        b.show()

# Output
xudongsong@sysdev:~$ python class_test.py 
1 2 3
A::show() a=1 b=2

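8. Encoding
   Common encoding conversions fall into the following cases:

        unicode -> another encoding
        Example: a is a unicode string to be converted to gb2312: a.encode('gb2312')

        another encoding -> unicode
        Example: a is gb2312-encoded and is to be converted to unicode: unicode(a, 'gb2312') or a.decode('gb2312')

        encoding 1 -> encoding 2
        Convert to unicode first, then to encoding 2

        e.g. gb2312 to big5
        unicode(a, 'gb2312').encode('big5')

        Checking what kind of string you have
        isinstance(s, str) tests whether it is a regular (byte) string
        isinstance(s, unicode) tests whether it is a unicode string

If a string is already unicode, running the unicode conversion on it again sometimes fails (not always): unicode(s, encoding) raises a TypeError, and s.decode(...) first encodes s with ascii, which fails for non-ASCII text

>>> str2 = u"sfdasfafasf"
>>> type(str2)
<type 'unicode'>
>>> isinstance(str2,str)
False
>>> isinstance(str2,unicode)
True
>>> type(str2)
<type 'unicode'>
>>> str3 = "safafasdf"
>>> type(str3)
<type 'str'>
>>> isinstance(str3,unicode)
False
>>> isinstance(str3,str)
True
>>> str4 = r'asdfafadf'
>>> isinstance(str4,str)
True
>>> isinstance(str4,unicode)
False
>>> type(str4)
<type 'str'>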

A general-purpose convert-to-unicode function can be written like this:
        def u(s, encoding):
            if isinstance(s, unicode):
                return s
            else:
                return unicode(s, encoding)
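For comparison, here is roughly how the same helper and the "encoding 1 -> encoding 2" conversion look in Python 3, where str is the text type and bytes is the encoded type. This is a sketch under that assumption; to_text and transcode are illustrative names, not part of any library.

```python
def to_text(value, encoding="utf-8"):
    """Return text whether given text or encoded bytes."""
    if isinstance(value, str):
        return value              # already text, nothing to decode
    return value.decode(encoding)


def transcode(data, source, target):
    # encoding 1 -> text -> encoding 2
    return data.decode(source).encode(target)


text = "\u4e2d\u6587"                      # two Chinese characters
gb_bytes = text.encode("gb2312")
big5_bytes = transcode(gb_bytes, "gb2312", "big5")
```

Going through text as the intermediate form means only one decode and one encode are ever needed, whatever the pair of encodings.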

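9. Threads
   1) To make a child thread exit together with its parent thread, call setDaemon(True) on the child thread before starting it
   2) Calling join() on a child thread makes the parent thread wait for the child to exit before exiting itself

3) Ctrl+C can only be caught by the parent (main) thread (a child thread cannot call the handler-registration function signal.signal(signal, function)). Calling join() on a child thread keeps the parent thread from catching Ctrl+C until the child thread has exited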
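One common way around the blocking join() described in point 3) is to join with a short timeout in a loop, so the main thread keeps returning to Python code where KeyboardInterrupt can be delivered. The sketch below is in Python 3 syntax (an assumption; the behaviour described above is for Python 2), and the worker function, stop_event and the 0.1 second deadline are all made up for the demonstration.

```python
import threading
import time

stop_event = threading.Event()
ticks = []


def worker():
    # Runs until the main thread asks it to stop.
    while not stop_event.is_set():
        ticks.append(time.time())
        time.sleep(0.01)


t = threading.Thread(target=worker)
t.daemon = True          # same effect as t.setDaemon(True): dies with the main thread
t.start()

# Join in short slices instead of one blocking join().
deadline = time.time() + 0.1
try:
    while t.is_alive() and time.time() < deadline:
        t.join(0.02)
except KeyboardInterrupt:
    pass                 # Ctrl+C lands here, in the main thread
finally:
    stop_event.set()     # ask the worker to finish

t.join(1.0)
```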

Appendix: an email from Cheng Yingyuan about signals in Python
See http://stackoverflow.com/questions/631441/interruptible-thread-join-in-python
From http://docs.python.org/library/signal.html#module-signal:
Some care must be taken if both signals and threads are used in the same program. The fundamental thing to remember in using signals and threads simultaneously is: always perform signal() operations in the main thread of execution. Any thread can perform an alarm(), getsignal(), pause(), setitimer() or getitimer(); only the main thread can set a new signal handler, and the main thread will be the only one to receive signals (this is enforced by the Python signal module, even if the underlying thread implementation supports sending signals to individual threads). This means that signals can’t be used as a means of inter-thread communication. Use locks instead.
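In short: always call signal() to install signal handlers from the main thread; the main thread is the only thread that handles signals. So do not rely on signals for inter-thread communication; use locks instead.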
The second, from http://docs.python.org/library/thread.html#module-thread:
Threads interact strangely with interrupts: the KeyboardInterrupt exception will be received by an arbitrary thread. (When the signal module is available, interrupts always go to the main thread.)
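When the signal module is available, the KeyboardInterrupt exception is always received by the main thread; otherwise it may be received by an arbitrary thread.
Pressing Ctrl+C makes Python receive SIGINT, which is turned into a KeyboardInterrupt exception raised in some thread; any threads that have not been marked with setDaemon keep running regardless. If kill sends a signal other than SIGINT, no handler has been set for it, and its default action is to terminate, the whole process dies no matter how many threads are still unfinished.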

Below is an example of using signal:

>>> import signal, time
>>> def f():
...     signal.signal(signal.SIGINT, sighandler)
...     signal.signal(signal.SIGTERM, sighandler)
...     while True:
...             time.sleep(1)
... 
>>> def sighandler(signum,frame):
...     print signum,frame
... 
>>> f()
^C2 
^C2 
^C2 
^C2

Setting and clearing signal handlers:

import signal, time

term = False

def sighandler(signum, frame):
        print "terminate signal received..."
        global term
        term = True

def set_signal():
        signal.signal(signal.SIGTERM, sighandler)
        signal.signal(signal.SIGINT, sighandler)

def clear_signal():
        # SIG_DFL restores the operating system's default action for the signal
        signal.signal(signal.SIGTERM, signal.SIG_DFL)
        signal.signal(signal.SIGINT, signal.SIG_DFL)


set_signal()
while not term:
        print "hello"
        time.sleep(1)

print "jumped out of while loop"

clear_signal()
term = False
for i in range(5):
        if term:
                break
        else:
                print "hello, again"
                time.sleep(1)

[dongsong@bogon python_study]$ python signal_test.py 
hello
hello
hello
^Cterminate signal received...
jumped out of while loop
hello, again
hello, again
^C
[dongsong@bogon python_study]$

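In a multi-process program that uses signals, if the parent process is meant to catch a signal and then act on its children, register the signal handler only after the child processes have been started. Otherwise each child inherits the parent's address space, handlers included, and the program becomes a mess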

from multiprocessing import Process, Pipe
import logging, time, signal

g_logLevel = logging.DEBUG
g_logFormat = "%(asctime)s %(levelname)s [%(filename)s:%(lineno)d]%(message)s"

def f(conn):
    conn.send([42, None, 'hello'])
    #conn.close()
    logging.basicConfig(level=g_logLevel,format=g_logFormat,stream=None)
    logging.debug("hello,world")

def f2():
    while True:
        print "hello,world"
        time.sleep(1)

termFlag = False
def sighandler(signum, frame):
    print "terminate signal received..."
    global termFlag
    termFlag = True

if __name__ == '__main__':
#    parent_conn, child_conn = Pipe()
#    p = Process(target=f, args=(child_conn,))
#    p.start()
#    print parent_conn.recv()   # prints "[42, None, 'hello']"
#    print parent_conn.recv()
#    p.join()

    p = Process(target=f2)
    p.start()
    signal.signal(signal.SIGTERM, sighandler)
    signal.signal(signal.SIGINT, sighandler)

    while not termFlag:
        time.sleep(0.5)
    print "jump out of the main loop"
    p.terminate()
    p.join()

10. Python's built-in function locals() returns a dict that maps the names of all local variables to their values

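11. Extended positional arguments

def func(*args): ...

Putting one asterisk before a parameter name lets the function accept any number of positional arguments.

Python collects these arguments into a tuple and binds it to the variable args. If there are no positional arguments beyond the explicitly declared parameters, args is an empty tuple.

Related: item 3, point 4)

12. Extended keyword arguments

def accept(**kwargs): ...

Python uses two asterisks before a parameter name to support any number of keyword arguments.

Note: kwargs is an ordinary Python dict containing the argument names and values. If there are no extra keyword arguments, kwargs is an empty dict.

For positional and keyword arguments, see this article: http://blog.csdn.net/qinyilang/article/details/5484415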

>>> def func(arg1, arg2 = "hello", *arg3, **arg4):
...     print arg1
...     print arg2
...     print arg3
...     print arg4
... 

>>> func("xds","t1",t2="t2",t3="t3")
xds
t1
()
{'t2': 't2', 't3': 't3'}

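13. Decorators: putting @another_method before a function wraps the existing function, e.g. for precondition checks and similar work. This article explains it thoroughly: http://daqinbuyi.iteye.com/blog/1161274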
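A minimal sketch of the precondition-check use case follows, in Python 3 syntax (an assumption). The decorator require_positive and the function halve are hypothetical names chosen for the example.

```python
import functools


def require_positive(func):
    """Hypothetical decorator: reject a non-positive first argument."""

    @functools.wraps(func)          # keep the wrapped function's name and docstring
    def wrapper(n, *args, **kwargs):
        if n <= 0:
            raise ValueError("%s() needs a positive number, got %r" % (func.__name__, n))
        return func(n, *args, **kwargs)

    return wrapper


@require_positive                    # same as: halve = require_positive(halve)
def halve(n):
    """Return n divided by two."""
    return n / 2.0


ok = halve(8)
try:
    halve(-1)
    rejected = False
except ValueError:
    rejected = True
```

functools.wraps is optional, but without it the decorated function would report the name "wrapper", which makes tracebacks and help() output harder to read.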

14. Exception-handling syntax

import sys

try:
    f = open('myfile.txt')
    s = f.readline()
    i = int(s.strip())
except IOError, (errno, strerror):
    print "I/O error(%s): %s" % (errno, strerror)
except ValueError:
    print "Could not convert data to an integer."
except:
    print "Unexpected error:", sys.exc_info()[0]
    raise

>>> try:
...    raise Exception('spam', 'eggs')
... except Exception, inst:
...    print "error %s" % str(inst)
...    print type(inst)     # the exception instance
...    print inst.args      # arguments stored in .args
...    print inst           # __str__ allows args to be printed directly
...    x, y = inst          # __getitem__ allows args to be unpacked directly
...    print 'x =', x
...    print 'y =', y
...
error ('spam', 'eggs')
<type 'exceptions.Exception'>
('spam', 'eggs')
('spam', 'eggs')
x = spam
y = eggs

15. Command-line argument handling: use Python's optparse library; for details see this article http://blog.chinaunix.net/space.php?uid=16981447&do=blog&id=2840082

from optparse import OptionParser
[...]
def main():
    usage = "usage: %prog [options] arg"
    parser = OptionParser(usage)
    parser.add_option("-f", "--file", dest="filename",
                      help="read data from FILENAME")
    parser.add_option("-v", "--verbose",
                      action="store_true", dest="verbose")
    parser.add_option("-q", "--quiet",
                      action="store_false", dest="verbose")
    [...]
    (options, args) = parser.parse_args()
    if len(args) != 1:
        parser.error("incorrect number of arguments")
    if options.verbose:
        print "reading %s..." % options.filename
    [...]

if __name__ == "__main__":
    main()

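Put simply, make_option() and add_option() define how one command-line option of the script is parsed. After parse_args(), positional arguments end up in the args list and option values end up in options. dest names the attribute under which the value is stored; if omitted, the option's long name is used. help is the text shown for the option when the script is run with --help/-h. action describes how the option is handled: the default, store, stores the value following the option under dest; store_true stores True under dest when the option is present; store_false stores False under dest when the option is present.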
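The description above can be checked without a real command line by handing parse_args() an explicit argument list. This sketch uses Python 3 syntax (an assumption; optparse is still shipped there); the --retries option and the file names are invented for the example.

```python
from optparse import OptionParser


def build_parser():
    parser = OptionParser(usage="usage: %prog [options] arg")
    parser.add_option("-f", "--file", dest="filename",
                      help="read data from FILENAME")
    parser.add_option("-v", "--verbose", action="store_true",
                      dest="verbose", default=False)
    parser.add_option("-q", "--quiet", action="store_false", dest="verbose")
    # No dest given: the long option name "retries" becomes the attribute name.
    parser.add_option("--retries", type="int", default=3)
    return parser


parser = build_parser()
# Passing a list avoids reading sys.argv, which is convenient for experiments.
options, args = parser.parse_args(
    ["-v", "-f", "data.txt", "--retries", "5", "input.csv"])
```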

16. I recently started using BeautifulSoup v4 and ran into the following error (I had been using an older version of BeautifulSoup and never saw it)

HTMLParser.HTMLParseError: malformed start tag

Fix: run easy_install html5lib to install html5lib, which replaces HTMLParser

Reference: http://topic.csdn.net/u/20090531/09/956454dd-ba13-4fa3-af3c-6bf7af5726dc.html

BeautifulSoup official site: http://www.crummy.com/software/BeautifulSoup/

BeautifulSoup documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

Chinese documentation (for getting started quickly): http://www.leeon.me/upload/other/beautifulsoup-documentation-zh.html

Below are some examples of BeautifulSoup usage

[dongsong@localhost boosenspider]$ vpython
Python 2.6.6 (r266:84292, Dec  7 2011, 20:48:22) 
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 
>>> 
>>> from bs4 import BeautifulSoup as soup
>>> s = soup('打卡')
>>> s
打卡
>>> type(s)

>>> 
>>> 
>>> t = s.body.contents[0]
>>> t
打卡
>>> import re
>>> t.findAll(name='a',attrs={'class':re.compile(r"btn_dks")})
[]
>>> t.findAll(name='a',attrs={'class':re.compile(r"btn_dk")}) 
[打卡]
>>> t.findAll(name='a',attrs={'class':re.compile(r"btn_dk"),'href':None})
[]
>>> t.findAll(name='a',attrs={'class':re.compile(r"btn_dk"),'href':re.compile('')})
[打卡]
>>> t.contents[0]
打卡
>>> t.contents[0].string = "hello"
>>> t
hello
>>> t.contents[0].text
u'hello'
>>> t.contents[0].string
u'hello'
>>> t.findAll(name='a',attrs={'class':re.compile(r"btn_dk"),'text':re.compile('')})
[]
>>> t.findAll(name='a',attrs={'class':re.compile(r"btn_dk"),'text':re.compile('h')})
[]
>>> t.findAll(name='a',attrs={'class':re.compile(r"btn_dk"),'text':re.compile('^h')})
[]
>>> t.findAll(name='a',attrs={'class':re.compile(r"btn_dk")})                        
[hello]
>>> t.findAll(name='a',attrs={'class':re.compile(r"btn_dk")},text=re.compile(r'')) 
[hello]
>>> t.findAll(name='a',attrs={'class':re.compile(r"btn_dk")},text=re.compile(r'a'))   
[]
>>> t.findAll(name='a',attrs={'class':re.compile(r"btn_dk")},text=re.compile(r'^hell')) 
[hello]
>>> t.findAll(name='a',attrs={'class':re.compile(r"btn_dk")},text=re.compile(r'^hello$'))
[hello]
>>> 
>>> t.findAll(name='a',attrs={},text=re.compile(r'^hello$'))                             
[hello]
>>> 
>>> t
hello
>>> t1 = soup('hello').body.contents[0]
>>> 
>>> t1
hello
>>> t == t1
True
>>> re.search(r'(^hello)|(^bbb)','hello')
<_sre.SRE_Match object at 0x25ef718>
>>> re.search(r'(^hello)|(^bbb)','hellosdfsd')
<_sre.SRE_Match object at 0x25ef7a0>
>>> re.search(r'(^hello)|(^bbb)','bbbsdfsdf') 
<_sre.SRE_Match object at 0x25ef718>
>>> t2 = t1.contents[0]
>>> t2
hello
>>> t2.findAll(name='a')
[] 
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 
>>> 
>>> from bs4 import BeautifulSoup as soup
>>> s = soup('天涯婚礼堂') 
>>> s.findAll(name='a',attrs={'href':None})
[]
>>> s.findAll(name='a',attrs={'href':True})
[天涯婚礼堂]
>>> import re
>>> s.findAll(name='a',attrs={'href':re.compile(r'')})
[天涯婚礼堂]
>>> s1 =s
>>> s1
天涯婚礼堂
>>> id(s1)
140598579280080
>>> id(s)
140598579280080
>>> s1.body.contents[0].contents[0]['href']=None
>>> s1
天涯婚礼堂
>>> s
天涯婚礼堂
>>> id(s)
140598579280080
>>> id(s1)
140598579280080
>>> s.findAll(name='a',attrs={'href':re.compile(r'')})
[]
>>> s.findAll(name='a',attrs={'href':True})           
[]
>>> s.findAll(name='a',attrs={'href':None})
[天涯婚礼堂]
>>> s.findAll(name='a')                    
[天涯婚礼堂]
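#text is a parameter for searching NavigableString objects. Its value can be a string, a regular expression, a list or dictionary, True or None, or a callable that takes a NavigableString as its argument
#None, False and '' mean no requirement; re.compile('') and True mean a NavigableString must be present (unlike attrs, where an attribute given as False in the dict must not exist)
#Note how the text parameter of findAll is used, as follows: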
>>> rts = s2.findAll(name=u'ul',attrs={u'id': u'contentbar', u'st_type': 'nav'}, text=re.compile(r''))
>>> len(rts)
0
>>> rts = s2.findAll(name=u'ul',attrs={u'id': u'contentbar', u'st_type': 'nav'}, text='')
>>> len(rts)
1
>>> rts = s2.findAll(name=u'ul',attrs={u'id': u'contentbar', u'st_type': 'nav'}, text=True)
>>> len(rts)
0
>>> rts = s2.findAll(name=u'ul',attrs={u'id': u'contentbar', u'st_type': 'nav'}, text=False)
>>> len(rts)
1
>>> rts = s2.findAll(name=u'ul',attrs={u'id': u'contentbar', u'st_type': 'nav'}, text=None) 
>>> len(rts)
1 
#How the string attribute is used, and on which kinds of element it appears
>>> from bs4 import BeautifulSoup as soup
>>> soup1 = soup('hello,aaaa').body.contents[0]
>>> soup1
hello,aaaa
>>> soup1.string
>>> soup1.name
u'b'
>>> soup1.text
u'hello,aaaa'
>>> type(soup1)

>>> soup1.contents[0]
u'hello,'
>>> type(soup1.contents[0])

>>> soup1.contents[0].string
u'hello,'
>>> soup2 = soup('hello').body.contents[0]
>>> type(soup2)

>>> soup2.string
u'hello'
#Usage of limit; zero means no limit
>>> soup2.findAll(name='a',text=False,limit=0)
[, 匆匆那年]
>>> soup2.findAll(name='a',text=False,limit=1)
[]

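BeautifulSoup's performance is mediocre, but it is very good at repairing and tolerating invalid HTML tags. As for encodings: when the source page's encoding is known, it can be passed to the BeautifulSoup constructor (parameter from_encoding); for example, I pass from_encoding='gbk' when parsing Tianya pages. When it is not known, you can rely on bs's automatic encoding detection and conversion (which may produce garbled text; a machine is not as smart as a person, after all).

The objects BeautifulSoup returns, and the data inside each of their nodes, are all unicode after its conversion.

---------->

Ran into a small problem today

A piece of HTML source failed to build a bs object under bs3.2.1, raising UnicodeEncodeError, regardless of whether the source was passed in as unicode, utf-8 or latin1; and the bs3.2.1 constructor does not even have a from_encoding parameter

Under bs4 it goes through without a hitch, whether passed in as unicode or as utf-8, and without specifying from_encoding (with utf-8 input and no from_encoding the output is garbled, but at least it does not raise; nothing is as fragile as bs3!)

The lesson: once code has been tested and is stable against a particular library version, just install that version wherever the code is used. Why bend over backwards for compatibility? If an older version of the library has a bug, should I be compatible with that too?

<--------------------2012-06-08 18:20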

Building objects with bs4:

[dongsong@bogon boosenspider]$ cat bs_constrator.py                                        
#encoding=utf-8

from bs4 import BeautifulSoup as soup
from bs4 import Tag

if __name__ == '__main__':
        sou = soup('')

        tag1 = Tag(sou, name='div')
        tag1['id'] = 'gentie1'
        tag1.string = 'hello,tag1'
        sou.div.insert(0,tag1)

        tag2 = Tag(sou, name='div')
        tag2['id'] = 'gentie2'
        tag2.string = 'hello,tag2'
        sou.div.insert(1,tag2)

        print sou

[dongsong@bogon boosenspider]$ vpython bs_constrator.py
hello,tag1
hello,tag2

cgi can escape HTML strings (escape); HTMLParser can undo HTML escaping (unescape)

>>> t = Tag(name='t')                            
>>> t.string=""                          
>>> t

>>> str(t)
""
>>> t.string
u""
>>> HTMLParser.HTMLParser().unescape(str(t))
u""
>>> s1
u""
>>> 
>>> s2 = cgi.escape(s1)
>>> s2
u"<t><img src='www.baidu.com'/></t>"
>>> HTMLParser.HTMLParser().unescape(s2)
u""

17. Hashing: the md5 module or the hashlib module

>>> import md5
>>> md5.md5("asdfadf").hexdigest()
'aee0014b14124efe03c361e1eed93589'
>>> import hashlib
>>> hashlib.md5("asdfadf").hexdigest()
'aee0014b14124efe03c361e1eed93589'

18. Without a timeout, urllib2.urlopen(url) may wait forever for the remote server's response and hang the program

urlFile = urllib2.urlopen(url, timeout=g_url_timeout)
urlData = urlFile.read()

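19. Regular expression matching: the re module

A string enclosed in three single quotes can span lines; the resulting string contains \n, which is something to watch out for

A string in single or double quotes can also be continued across lines with a trailing \; the resulting string contains neither the \ nor a \n. In a raw string (r'...'), however, both the backslash and the newline stay in the string, so do not write a regular expression this way, or it will fail to match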

>>> ss = '''
... hell0,a
... shhh
... liumingdong
... xudongsong
... hello
... '''
>>> ss
'\nhell0,a\nshhh\nliumingdong\nxudongsong\nhello\n'
SyntaxError: EOL while scanning string literal
>>> sss = 'aaaa\
... bbbb\
... cccccc'
>>> sss
'aaaabbbbcccccc'
>>> s3 = r'(^hello)|\
... (abc$)'
>>> 
>>> re.search(s3,'hello,world')
<_sre.SRE_Match object at 0x7f95233047a0>
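#the pattern on the first line matches
>>> re.search(s3,'aaa,hello,worldabc')
#the pattern on the second line fails to match
>>> s4 = r'(^hello)|(abc$)'
#s4 is not split across lines with a quote plus \, so both patterns match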
>>> re.search(s4,"hello,world")
<_sre.SRE_Match object at 0x182e690>
>>> re.search(s4,"aaa,hello,worldabc")
<_sre.SRE_Match object at 0x7f95233047a0>
>>> 
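#Note how to extract the matched substrings (wrap the part of the pattern to extract in parentheses; groups numbered from 1 correspond to the parenthesized parts)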
>>> re.search(r'^(\d+)abc(\d+)$','232abc1').group(0,1,2)
('232abc1', '232', '1')
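If a pattern really has to be spread over several lines, the re.VERBOSE flag is one option: whitespace and # comments inside the pattern are ignored, so the line breaks do not become part of the regular expression. A sketch in Python 3 syntax (an assumption), reusing the two alternatives from the session above:

```python
import re

# With re.VERBOSE, whitespace and "#" comments in the pattern are ignored.
pattern = re.compile(r"""
    (^hello)    # alternative 1: the text starts with hello
    |
    (abc$)      # alternative 2: the text ends with abc
""", re.VERBOSE)

starts = pattern.search("hello,world")
ends = pattern.search("aaa,hello,worldabc")
neither = pattern.search("nothing here")

# group(0) is the whole match; groups from 1 on are the parenthesized parts.
counts = re.search(r"^(\d+)abc(\d+)$", "232abc1")
```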

#Below is an example that combines re and lambda

#encoding=utf-8

import re

f = lambda arg: re.search(u'^(\d+)\w+',arg).group(1)
print f(u'1111条评论')
try:
        f(u'aaaa')
except AttributeError,e:
        print str(e)
:!python re_lambda.py
111
'NoneType' object has no attribute 'group'

re.findall() is very handy

>>> re.findall(r'\\@[A-Za-z0-9]+', s)
['\\@userA', '\\@userB']
>>> s
'hello,world,\\@userA\\@userB'
>>> re.findall(r'\\@([A-Za-z0-9]+)', s)
['userA', 'userB']

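20. I wrote a crawler. I used to join URLs by hand, handling every case myself (./xxx, #xxxx, /xxx and so on), which was tedious. It turns out there is a ready-made tool for this

>>> from urlparse import urljoin
>>> import urllib
>>> url = urljoin(r"http://book.douban.com/tag/?view=type",u"./网络小说")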
>>> url
u'http://book.douban.com/tag/\u7f51\u7edc\u5c0f\u8bf4'
>>> conn2 = urllib.urlopen(url)               
Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/lib64/python2.6/urllib.py", line 86, in urlopen
    return opener.open(url)
  File "/usr/lib64/python2.6/urllib.py", line 179, in open
    fullurl = unwrap(toBytes(fullurl))
  File "/usr/lib64/python2.6/urllib.py", line 1041, in toBytes
    " contains non-ASCII characters")
UnicodeError: URL u'http://book.douban.com/tag/\u7f51\u7edc\u5c0f\u8bf4' contains non-ASCII characters
>>> conn2 = urllib.urlopen(url.encode('utf-8'))

21. With urllib2, how to add headers to an HTTP request and how to read cookie values

>>> request = urllib2.Request("http://img1.gtimg.com/finance/pics/hv1/46/178/1031/67086211.jpg",headers={'If-Modified-Since':'Wed, 02 May 2012 18:32:20 GMT'})
#equivalent to request.add_header('If-Modified-Since','Wed, 02 May 2012 18:32:20 GMT')
>>> urllib2.urlopen(request)
Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/lib64/python2.6/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib64/python2.6/urllib2.py", line 397, in open
    response = meth(req, response)
  File "/usr/lib64/python2.6/urllib2.py", line 510, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.6/urllib2.py", line 435, in error
    return self._call_chain(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 518, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 304: Not Modified
>>> urllib.urlencode({"aaa":"bbb"})
'aaa=bbb'
>>> urllib.urlencode([("aaa","bbb")])
'aaa=bbb'
#Using urlencode: when submitting a POST form, the key/value parameters are run through urlencode and sent as the request body
#urllib2.urlopen(url,data=urllib.urlencode(...))
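The URL joining from item 20 and the urlencode step above can be tried without any network access. In Python 3 (an assumption; the sessions above are Python 2) these helpers live in urllib.parse. The sketch also percent-encodes the non-ASCII path with quote(), which is another way to obtain an ASCII-only URL.

```python
from urllib.parse import quote, urlencode, urljoin

base = "http://book.douban.com/tag/?view=type"

# urljoin resolves a relative reference against the base URL.
joined = urljoin(base, "./\u7f51\u7edc\u5c0f\u8bf4")
relative_cases = [urljoin(base, ref) for ref in ("./xxx", "/xxx", "#frag")]

# quote() percent-encodes the UTF-8 bytes of non-ASCII characters;
# the characters listed in `safe` are left untouched.
ascii_url = quote(joined, safe=":/?=&")

# urlencode accepts a dict or a list of pairs; POST data must be bytes.
query = urlencode({"aaa": "bbb"})
post_body = urlencode([("aaa", "bbb"), ("n", 1)]).encode("ascii")
```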

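Today (2013-07-04) I hit a problem: logging in to a certain site requires sending the csrftoken that the server set on the first visit back to the server as part of the POST data, so I looked into how to read cookie values. I can't share the actual code, so here is an example from Stack Overflow (look mainly at the lines that read the cookie data)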

http://stackoverflow.com/questions/10247054/http-post-and-get-with-cookies-for-authentication-in-python

[dongsong@localhost python_study]$ cat cookie.py 
from urllib2 import Request, build_opener, HTTPCookieProcessor, HTTPHandler
import httplib, urllib, cookielib, Cookie, os

conn = httplib.HTTPConnection('webapp.pucrs.br')

#COOKIE FINDER
cj = cookielib.CookieJar()
opener = build_opener(HTTPCookieProcessor(cj),HTTPHandler())
req = Request('http://webapp.pucrs.br/consulta/principal.jsp')
f = opener.open(req)
html = f.read()
import pdb
pdb.set_trace()
for cookie in cj:
    c = cookie
#FIM COOKIE FINDER

params = urllib.urlencode ({'pr1':111049631, 'pr2':'sssssss'})
headers = {"Content-type":"text/html",
           "Set-Cookie" : "JSESSIONID=70E78D6970373C07A81302C7CF800349"}
            # I couldn't set the value automaticaly here, the cookie object can't be converted to string, so I change this value on every session to the new cookie's value. Any solutions?

conn.request ("POST", "/consulta/servlet/consulta.aluno.ValidaAluno",params, headers) # Validation page
resp = conn.getresponse()

temp = conn.request("GET","/consulta/servlet/consulta.aluno.Publicacoes") # desired content page
resp = conn.getresponse()

print resp.read()

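22. How to change the file that logging writes to. This becomes more pressing when doing multi-process programming with the multiprocessing module, because child processes inherit the parent's log output file and format....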

def change_log_file(fileName):
	h = logging.FileHandler(fileName)
	h.setLevel(g_logLevel)
	h.setFormatter(logging.Formatter(g_logFormat))
	
	logger = logging.getLogger()
	#print logger.handlers
	for handler in logger.handlers:
		handler.close()
	while len(logger.handlers) > 0:
		logger.removeHandler(logger.handlers[0])
		
	logger.addHandler(h)

For setting up loggers, handlers and formatters in logging, Django's settings file is a good reference; below is a small example of my own

[dongsong@localhost python_study]$ cat logging_test.py 
#encoding=utf-8
import logging, sys

if __name__ == '__main__':
        logger = logging.getLogger('test')
        logger.setLevel(logging.DEBUG)
        print 'log handlers: %s' % str(logger.manager.loggerDict)
        logger.error('here')
        logger.warning('here')
        logger.info('here')
        logger.debug('here')

        #handler = logging.FileHandler('test.log')
        handler = logging.StreamHandler(sys.stdout)
        handler.setLevel(logging.DEBUG)
        formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
        handler.setFormatter(formatter)
        logger.addHandler(handler)
        #logging.getLogger('test').addHandler(logging.NullHandler()) # python 2.7+
        logger.error('here')
        logger.warning('here')
        logger.info('here')
        logger.debug('here')
[dongsong@localhost python_study]$ vpython logging_test.py 
log handlers: {'test': }
No handlers could be found for logger "test"
2012-12-26 11:30:48,725 - test - ERROR - here
2012-12-26 11:30:48,725 - test - WARNING - here
2012-12-26 11:30:48,725 - test - INFO - here
2012-12-26 11:30:48,725 - test - DEBUG - here

23. multiprocessing module usage demo

import multiprocessing
from multiprocessing import Process
import time

def func():
        for i in range(3):
                print "hello"
                time.sleep(1)

proc = Process(target = func)
proc.start()

while True:
        childList = multiprocessing.active_children()
        print childList
        if len(childList) == 0:
                break
        time.sleep(1)

[dongsong@bogon python_study]$ python multiprocessing_children.py 
[]
hello
[]
hello
[]
hello
[]
[]
[dongsong@bogon python_study]$ fg

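multiprocessing's Pool (process pool) is very handy; today I nearly reinvented it myself (writing one is easy enough, of course, but it would certainly not be as well thought out as the official one)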

[dongsong@bogon python_study]$ vpython
Python 2.6.6 (r266:84292, Jun 18 2012, 14:18:47) 
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from multiprocessing import Pool
>>> import time
>>> poolObj = Pool(processes = 10)
>>> procObj = poolObj.apply_async(time.sleep, (20,))
>>> procObj.get(timeout = 1)
Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/lib64/python2.6/multiprocessing/pool.py", line 418, in get
    raise TimeoutError
multiprocessing.TimeoutError
>>> print procObj.get(timeout = 21)
None
>>> poolObj.__dict__['_pool']
[, , , , , , , , , ]
>>> poolObj.close()
>>> poolObj.join()

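24. The demo below gives a glimpse of the encoding issues with bs and the str() function (the built-in counterpart of str() is unicode())

#encoding=utf-8
from bs4 import BeautifulSoup as soup

tag = soup((u"白痴代码"),from_encoding='unicode').body.contents[0]
newStr = str(tag) #tag's __str__() returns a utf-8 encoded string (if tag did not implement __str__(), it would behave as described in item 38 of these notes)
print type(newStr),isinstance(newStr,unicode),newStr
try:
        print u"[unicode]hello," + newStr #newStr is implicitly decoded as ascii to build a unicode result, which raises an error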
except Exception,e:
        print str(e)
print "[utf-8]hello," + newStr
print u"[unicode]hello," + newStr.decode('utf-8')

[dongsong@bogon python_study]$ vpython tag_str_test.py 
<type 'str'> False 白痴代码
'ascii' codec can't decode byte 0xe7 in position 3: ordinal not in range(128)
[utf-8]hello,白痴代码
[unicode]hello,白痴代码

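25. Some issues with using MySQLdb http://mysql-python.sourceforge.net/
1> Here is database.py, a database access interface I wrapped for a project in 2011; concrete database operations can inherit from the class and implement the business-specific interfaces
2> Results from cursor.execute() and cursor.fetchall() come back as unicode, even when the connection's charset is set to utf8

3> For pitfalls in query statements, see the test code below. The recommended way to call cursor.execute() is cursor.execute(sql, args), because the underlying layer escapes the strings automatically

If you're not familiar with the Python DB-API, note that the SQL statement in cursor.execute() uses placeholders, "%s", rather than adding parameters directly within the SQL. If you use this technique, the underlying database library will automatically add quotes and escaping to your parameter(s) as necessary. (Also note that Django expects the "%s" placeholder, not the "?" placeholder, which is used by the SQLite Python bindings. This is for the sake of consistency and sanity.)

4> The proper practice is to call conn.commit() after conn.cursor().execute(); otherwise there will be problems on database versions that do not autocommit

5> After a successful insert, the auto-increment primary key of the new row can be obtained with MySQLdb.connections.Connection.insert_id() (MySQLdb.connections.Connection is the MySQL connection returned by MySQLdb.connect()) (2014.5.29)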

#encoding=utf-8
import MySQLdb

conn = MySQLdb.connect(host = "127.0.0.1", port = 3306, user = "xds", passwd = "xds", db = "xds_db", charset = 'utf8')
cursor = conn.cursor()
print cursor

siteName = u"百度贴吧"
bbsNames = [u"明星", u"影视"]


siteName = siteName.encode('utf-8')
for index in range(len(bbsNames)):
        bbsNames[index] = bbsNames[index].encode('utf-8')

#Correct usage (recommended: the driver does the quoting and escaping)
#args = tuple([siteName] + bbsNames)
#sql = "select bbs from t_site_bbs where site = %s and bbs in (%s,%s)"
#rts = cursor.execute(sql,args)
#print rts

#Also works, but the values are not escaped (open to SQL injection)
args = tuple([siteName] + bbsNames)
sql = "select bbs from t_site_bbs where site = '%s' and bbs in ('%s','%s')" % args
print sql
rts = cursor.execute(sql)
print rts

#Wrong usage, raises an error (the values are not quoted)
#args = tuple([siteName] + bbsNames)
#sql = "select bbs from t_site_bbs where site = %s and bbs in (%s,%s)" % args
#rts = cursor.execute(sql)
#print rts

#Wrong usage: no error, but no rows are found (it does work when the members of bbsNames are digit strings or English strings)
#sql = "select bbs from t_site_bbs where site = '%s' and bbs in %s" % (siteName, str(tuple(bbsNames)))
#print sql
#rts = cursor.execute(sql)
#print rts


rts = cursor.fetchall()
for rt in rts:
        print rt[0]

For a table with an auto-increment column, cursor.lastrowid gives the auto-increment id of the row just inserted; it does not work for update

Reference: http://stackoverflow.com/questions/706755/how-do-you-safely-and-efficiently-get-the-row-id-after-an-insert-with-mysql-usin
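The parameterized-query advice and cursor.lastrowid can be tried without a MySQL server. The sketch below substitutes the standard-library sqlite3 module for MySQLdb (a deliberate swap so that it runs anywhere); note that sqlite3 uses "?" placeholders where MySQLdb uses "%s". The table contents are invented, and Python 3 syntax is assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table t_site_bbs ("
            "id integer primary key autoincrement, site text, bbs text)")

rows = [("tieba", "stars"), ("tieba", "movies"), ("tieba", "o'neil")]
for site, bbs in rows:
    # Values are passed separately, so the quote in "o'neil" needs no escaping.
    cur.execute("insert into t_site_bbs (site, bbs) values (?, ?)", (site, bbs))
last_id = cur.lastrowid        # id of the row inserted by the last execute()
conn.commit()

# For an IN list, generate one placeholder per value, then pass the values.
wanted = ["stars", "o'neil"]
placeholders = ",".join("?" for _ in wanted)
sql = ("select bbs from t_site_bbs where site = ? and bbs in (%s) order by id"
       % placeholders)
found = [row[0] for row in cur.execute(sql, ["tieba"] + wanted)]
conn.close()
```

Only the placeholder characters are formatted into the SQL text; the values themselves never are, which is what keeps the query safe from injection.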

26. Time

[dongsong@bogon boosencms]$ vpython
Python 2.6.6 (r266:84292, Dec  7 2011, 20:48:22) 
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import time
>>> time.gmtime()
time.struct_time(tm_year=2012, tm_mon=5, tm_mday=18, tm_hour=4, tm_min=14, tm_sec=55, tm_wday=4, tm_yday=139, tm_isdst=0)
>>> time.localtime()
time.struct_time(tm_year=2012, tm_mon=5, tm_mday=18, tm_hour=12, tm_min=15, tm_sec=2, tm_wday=4, tm_yday=139, tm_isdst=0)
>>> time.time()
1337314595.7790151
>>> time.timezone
-28800
>>> time.gmtime(time.time())
time.struct_time(tm_year=2012, tm_mon=5, tm_mday=18, tm_hour=4, tm_min=19, tm_sec=45, tm_wday=4, tm_yday=139, tm_isdst=0)
>>> time.localtime(time.time())
time.struct_time(tm_year=2012, tm_mon=5, tm_mday=18, tm_hour=12, tm_min=19, tm_sec=54, tm_wday=4, tm_yday=139, tm_isdst=0)
>>> time.strftime("%a, %d %b %Y %H:%M:%S +0800", time.localtime(time.time()))
'Fri, 18 May 2012 12:21:20 +0800'
>>> time.strftime("%a, %d %b %Y %H:%M:%S +0000", time.gmtime(time.time()))   
'Fri, 18 May 2012 04:21:36 +0000'
#Not sure how %Z is really supposed to be used; the lines below did not clear it up either
>>> time.strftime("%a, %d %b %Y %H:%M:%S %Z", time.gmtime(time.time()))
'Fri, 18 May 2012 04:23:09 CST'
>>> time.strftime("%a, %d %b %Y %H:%M:%S %Z", time.localtime(time.time()))
'Fri, 18 May 2012 12:23:31 CST'
>>> timeStr = time.strftime("%a, %d %b %Y %H:%M:%S +0000", time.gmtime(time.time()))         
>>> timeStr
'Fri, 18 May 2012 04:24:29 +0000'
>>> t = time.strptime(timeStr, "%a, %d %b %Y %H:%M:%S %Z")              
Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/lib64/python2.6/_strptime.py", line 454, in _strptime_time
    return _strptime(data_string, format)[0]
  File "/usr/lib64/python2.6/_strptime.py", line 325, in _strptime
    (data_string, format))
ValueError: time data 'Fri, 18 May 2012 04:24:29 +0000' does not match format '%a, %d %b %Y %H:%M:%S %Z'
>>> t = time.strptime(timeStr, "%a, %d %b %Y %H:%M:%S +0000")
>>> t
time.struct_time(tm_year=2012, tm_mon=5, tm_mday=18, tm_hour=4, tm_min=24, tm_sec=29, tm_wday=4, tm_yday=139, tm_isdst=-1)
#datetime usage follows
>>> import datetime
>>> datetime.datetime.today()
datetime.datetime(2012, 5, 18, 12, 28, 25, 892141)
>>> datetime.datetime(2012,12,12,23,54)
datetime.datetime(2012, 12, 12, 23, 54)
>>> datetime.datetime(2012,12,12,23,54,32)
datetime.datetime(2012, 12, 12, 23, 54, 32)
>>> datetime.datetime.fromtimestamp(time.time())
datetime.datetime(2012, 5, 18, 12, 29, 15, 130257)
>>> datetime.datetime.utcfromtimestamp(time.time())
datetime.datetime(2012, 5, 18, 4, 29, 34, 897017)
>>> datetime.datetime.now()
datetime.datetime(2012, 5, 18, 12, 29, 52, 558249)
>>> datetime.datetime.utcnow()
datetime.datetime(2012, 5, 18, 4, 30, 6, 164009)
>>> datetime.datetime.fromtimestamp(time.time()).strftime("%a, %d %b %Y %H:%M:%S")                                                    
'Fri, 18 May 2012 17:05:30'
>>> datetime.datetime.today().strftime("%a, %d %b %Y %H:%M:%S")                          
'Fri, 18 May 2012 17:05:44'
>>> datetime.datetime.strptime('Fri, 18 May 2012 04:24:29', "%a, %d %b %Y %H:%M:%S")    
datetime.datetime(2012, 5, 18, 4, 24, 29)

>>> datetime.datetime.fromtimestamp(time.time()).strftime('%X')  
'17:07:14'
>>> datetime.datetime.fromtimestamp(time.time()).strftime('%x')  
'02/28/15'
>>> datetime.datetime.fromtimestamp(time.time()).strftime('%c')  
'Sat Feb 28 17:07:24 2015'

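%a abbreviated weekday name
%A full weekday name
%b abbreviated month name
%B full month name
%c locale's date and time representation
%d day of the month, 01-31
%H hour (24-hour clock), 00-23
%I hour (12-hour clock), 01-12
%m month, 01-12
%M minute, 00-59
%S second, 00-61 (as stated in the official docs)
%j day of the year, 001-366
%w weekday as a number, 0 (Sunday) to 6
%W week number of the year (Monday as the first day of the week)
%x locale's date representation
%X locale's time representation
%y year without century, 00-99
%Y year with century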
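A few of the directives above, tried on a fixed timestamp. This is a small sketch written for Python 3 (the sessions in these notes are Python 2); it sticks to numeric directives so the output does not depend on the locale, and the variable names are made up for the illustration.

```python
from datetime import datetime

# Fixed timestamp taken from the session above: Fri, 18 May 2012 04:24:29
moment = datetime(2012, 5, 18, 4, 24, 29)

# Numeric directives only, so the result is the same in every locale
stamp = moment.strftime("%Y-%m-%d %H:%M:%S")
day_of_year = moment.strftime("%j")   # day of the year, zero-padded to 3 digits
weekday_num = moment.strftime("%w")   # 0 is Sunday, so Friday is 5
hour_12 = moment.strftime("%I")       # 12-hour clock, zero-padded

# strptime() is the inverse: the same format string parses the text back
parsed = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")

print(stamp, day_of_year, weekday_num, hour_12)
```

The day-of-year value agrees with tm_yday=139 in the struct_time output earlier in this item.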

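27. A pitfall when converting integers to strings

Some integers are int and some are long. repr() of a long is the digits followed by L. str() of the long itself does not add the L, but str() of a container (a list, etc.) holding the long does, because containers format their elements with repr(). Be careful!
As for when an integer is an int and when it is a long, I am still looking into it; in Python 2 a value larger than sys.maxint is promoted to long automatically, and long(x) or an L suffix always gives a long.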

28. join() usage (the elements of the list must be strings)

>>> l = ['a','b','c','d']
>>> '&'.join(l)
'a&b&c&d'
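Since the elements must be strings, a list with numbers in it has to be converted first. A minimal sketch (Python 3 syntax; the list contents are made up):

```python
# Hypothetical mixed list: join() raises TypeError on the numbers as-is
items = ['a', 1, 2.5]

try:
    '&'.join(items)
    joined_directly = True
except TypeError:
    joined_directly = False

# Convert every element to a string first, then join
joined = '&'.join(str(item) for item in items)
print(joined)
```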

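29. Debugging Python with pdb

http://www.ibm.com/developerworks/cn/linux/l-cn-pythondebugger/

Very similar to gdb:

b line_number sets a breakpoint; a file or a function can also be given

b 180, childWeiboRt.retweetedId == 3508203280986906 sets a conditional breakpoint

b lists all breakpoints

cl breakpoint_number clears one breakpoint

cl clears all breakpoints

c continue

n next line

s step into a function

bt call stack

whatis obj shows the type of a variable (equivalent to Python's built-in type())

up moves one level up the call stack (frame), where you can inspect the code and variables at that call site (of course, where the program has actually run to does not change)

down moves one level down the call stack (frame), where you can inspect the code and variables at that call site (of course, where the program has actually run to does not change)

...

To inspect the attribute values of an instance (instanceObj) while debugging, use: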

for it in [(attr,getattr(instanceObj,attr)) for attr in dir(instanceObj)]: print it[0],'-->',it[1]

30. Getting a function's name from inside the function

>>> import sys
>>> def f2():
...     print sys._getframe().f_code.co_name
... 
>>> f2()
f2

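31. Handling spaces and other special characters in URLs

When a URL contains special characters such as +, space, /, ?, %, #, & or =, the server may not receive the correct parameter values. What to do?
Solution
Convert these characters into a form the server can recognize. The mapping is:
URL character escaping
(Alternatively, replace them with other characters, or use their full-width forms.)
+      means a space in a URL                          %2B
space  can be written as + or percent-encoded          %20
/      separates directories and subdirectories        %2F
?      separates the URL proper from its parameters    %3F
%      introduces an escaped character                 %25
#      marks a bookmark (fragment)                     %23
&      separates parameters in the URL                 %26
=      separates a parameter name from its value       %3D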

>>> import urllib
>>> import urlparse
>>> urlparse.urljoin('http://s.weibo.com/weibo/',urllib.quote('python c++')) 
'http://s.weibo.com/weibo/python%20c%2B%2B'

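When a URL with special characters meets a parameter that is passed on to a search engine with special characters of its own (Lucene, etc.)...

The value has to be escaped twice; otherwise the special characters get through HTTP safely and then reach the search engine unescaped, and the results will not be what you wanted...

Reference: http://stackoverflow.com/questions/688766/getting-401-on-twitter-oauth-post-requests

Looking at its URLs, the browser-side script of http://s.weibo.com does the same thing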

[dongsong@bogon python_study]$ cat url.py 
#encoding=utf-8

import urllib, urlparse

if __name__ == '__main__':
        baseUrl = 'http://s.weibo.com/weibo/'
        url = urlparse.urljoin(baseUrl, urllib.quote(urllib.quote('python c++')))
        print url
        conn = urllib.urlopen(url)
        data = conn.read()
        f = file('/tmp/d.html', 'w')
        f.write(data)
        f.close()

[dongsong@bogon python_study]$ vpython url.py 
http://s.weibo.com/weibo/python%2520c%252B%252B
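For reference, the same single and double escaping in Python 3, where quote() and urljoin() live in urllib.parse. This is only a sketch of the escaping step; it makes no network request, and the variable names are made up.

```python
from urllib.parse import quote, quote_plus, unquote, urljoin

keyword = 'python c++'

once = quote(keyword)            # space -> %20, + -> %2B
twice = quote(once)              # the % signs themselves become %25
plus_form = quote_plus(keyword)  # form style: space -> +

url = urljoin('http://s.weibo.com/weibo/', twice)
print(url)

# Undoing two levels of quoting gives the keyword back
restored = unquote(unquote(twice))
```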

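32. Encoding issues in the json module

Default behaviour of json.dumps():

It converts every string in the data structure to unicode, escapes the unicode strings (\u56fd becomes \\u56fd), and then emits the whole result as a utf-8 encoded JSON string (controlled by the encoding parameter, which defaults to utf-8; there is no need to touch it)

Inconsistent encodings among the elements of the data structure do not affect dumps(), because every element is converted to unicode before the JSON string is emitted

The ensure_ascii parameter defaults to True; setting it to False changes the behaviour of dumps():

If the strings in the data structure are unicode, the resulting JSON string is a unicode string and the unicode strings inside are not escaped (\u56fd stays \u56fd);

If the strings in the data structure are utf-8, the resulting JSON string is a utf-8 string and the utf-8 strings inside are not escaped (\xe5\x9b\xbd stays \xe5\x9b\xbd);

If the encodings of the elements are inconsistent, dumps() raises an error

A JSON string obtained this way can be transcoded; one obtained with the default behaviour cannot (the string elements of the data structure have been escaped, so transcoding the JSON string as a whole does not reach them)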
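The effect of ensure_ascii is easy to see in Python 3, where every str is unicode and the str/unicode mixing described above no longer arises. A minimal sketch with a made-up one-entry dict:

```python
import json

capitals = {'中国': '北京'}

# Default (ensure_ascii=True): non-ASCII characters are written as \uXXXX escapes
escaped = json.dumps(capitals)

# ensure_ascii=False: the characters are kept as they are
readable = json.dumps(capitals, ensure_ascii=False)

print(escaped)
print(readable)

# Both forms load back to the same dict
same = json.loads(escaped) == json.loads(readable) == capitals
```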

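warning--->2012-07-11 10:00:

Today I hit a problem: converting a dict containing traditional Chinese characters this way succeeded, but storing the JSON string in the database raised

_mysql_exceptions.Warning: Incorrect string value: '\xF0\x9F\x91\x91\xE7\xAC...' for column 'detail' at row 1

Storing the output of the first approach works fine. My first guess was that json.dumps(ensure_ascii = False) mishandles traditional characters; note, however, that the rejected bytes \xF0\x9F\x91\x91 are a 4-byte UTF-8 sequence, which is the case described in item 63

For data with messy encodings, json.loads() may raise UnicodeDecodeError (for example, today (2013.3.19) the utf8 JSON strings returned by the QQ open platform API kept failing to parse this way). A workaround:

myString = jsonStr.decode('utf-8', 'ignore') # convert to unicode, ignoring errors

jsonObj = json.loads(myString)

Some data may be lost, but that is better than doing nothing.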
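The decode-and-ignore workaround, as a Python 3 sketch. The payload here is invented for the illustration (valid UTF-8 JSON with one stray byte added); it is not data from the QQ API.

```python
import json

# Hypothetical payload: valid UTF-8 JSON with one stray byte (\xff) inside a string
raw = b'{"city": "\xe5\x8c\x97\xe4\xba\xac\xff"}'

try:
    raw.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors='ignore' silently drops the bytes that cannot be decoded
text = raw.decode('utf-8', 'ignore')
obj = json.loads(text)
print(obj)
```

Passing 'replace' instead of 'ignore' would keep a visible marker (U+FFFD) where bytes were dropped, which can make the data loss easier to spot later.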

#encoding=utf-8

import json
from pprint import pprint

def show_rt(rt):
        pprint(rt)
        print rt
        print "type(rt) is %s" % type(rt)

if __name__ == '__main__':
        unDic = {
                        u'中国':u'北京',
                        u'日本':u'东京',
                        u'法国':u'巴黎'
                }
        utf8Dic = {
                        r'中国':r'北京',
                        r'日本':r'东京',
                        r'法国':r'巴黎'
                }

        pprint(unDic)
        pprint(utf8Dic)

        print "\nunicode instance dumps to string:"
        rt = json.dumps(unDic)
        show_rt(rt)
        print "utf-8 instance dumps to string:"
        rt = json.dumps(utf8Dic)
        show_rt(rt)

        #encoding is the character encoding for str instances, default is UTF-8
        #If ensure_ascii is False, then the return value will be a unicode instance, default is True
        print "\nunicode instance dumps(ensure_ascii=False) to string:"
        rt = json.dumps(unDic,ensure_ascii=False)
        show_rt(rt)
        print "utf-8 instance dumps(ensure_ascii=False) to string:"
        rt = json.dumps(utf8Dic,ensure_ascii=False)
        show_rt(rt)

        print "\n-----------------数据结构混杂编码-----------------"
        unDic[u'日本'] = r'东京'
        utf8Dic[r'日本'] = u'东京'
        pprint(unDic)
        pprint(utf8Dic)

        print "\nunicode instance dumps to string:"
        try:
                rt = json.dumps(unDic)
        except Exception,e:
                print "%s:%s" % (type(e),str(e))
        else:
                show_rt(rt)
        print "utf-8 instance dumps to string:"
        try:
                rt = json.dumps(utf8Dic)
        except Exception,e:
                print "%s:%s" % (type(e),str(e))
        else:
                show_rt(rt)

        print "\nunicode instance dumps(ensure_ascii=False) to string:"
        try:
                rt = json.dumps(unDic, ensure_ascii=False)
        except Exception,e:
                print "%s:%s" % (type(e),str(e))
        else:
                show_rt(rt)
        print "utf-8 instance dumps to string:"
        try:
                rt = json.dumps(utf8Dic, ensure_ascii=False)
        except Exception,e:
                print "%s:%s" % (type(e),str(e))
        else:
                show_rt(rt)

[dongsong@bogon python_study]$ vpython json_test.py 
{u'\u4e2d\u56fd': u'\u5317\u4eac',
 u'\u65e5\u672c': u'\u4e1c\u4eac',
 u'\u6cd5\u56fd': u'\u5df4\u9ece'}
{'\xe4\xb8\xad\xe5\x9b\xbd': '\xe5\x8c\x97\xe4\xba\xac',
 '\xe6\x97\xa5\xe6\x9c\xac': '\xe4\xb8\x9c\xe4\xba\xac',
 '\xe6\xb3\x95\xe5\x9b\xbd': '\xe5\xb7\xb4\xe9\xbb\x8e'}

unicode instance dumps to string:
'{"\\u4e2d\\u56fd": "\\u5317\\u4eac", "\\u65e5\\u672c": "\\u4e1c\\u4eac", "\\u6cd5\\u56fd": "\\u5df4\\u9ece"}'
{"\u4e2d\u56fd": "\u5317\u4eac", "\u65e5\u672c": "\u4e1c\u4eac", "\u6cd5\u56fd": "\u5df4\u9ece"}
type(rt) is <type 'str'>
utf-8 instance dumps to string:
'{"\\u4e2d\\u56fd": "\\u5317\\u4eac", "\\u6cd5\\u56fd": "\\u5df4\\u9ece", "\\u65e5\\u672c": "\\u4e1c\\u4eac"}'
{"\u4e2d\u56fd": "\u5317\u4eac", "\u6cd5\u56fd": "\u5df4\u9ece", "\u65e5\u672c": "\u4e1c\u4eac"}
type(rt) is <type 'str'>

unicode instance dumps(ensure_ascii=False) to string:
u'{"\u4e2d\u56fd": "\u5317\u4eac", "\u65e5\u672c": "\u4e1c\u4eac", "\u6cd5\u56fd": "\u5df4\u9ece"}'
{"中国": "北京", "日本": "东京", "法国": "巴黎"}
type(rt) is <type 'unicode'>
utf-8 instance dumps(ensure_ascii=False) to string:
'{"\xe4\xb8\xad\xe5\x9b\xbd": "\xe5\x8c\x97\xe4\xba\xac", "\xe6\xb3\x95\xe5\x9b\xbd": "\xe5\xb7\xb4\xe9\xbb\x8e", "\xe6\x97\xa5\xe6\x9c\xac": "\xe4\xb8\x9c\xe4\xba\xac"}'
{"中国": "北京", "法国": "巴黎", "日本": "东京"}
type(rt) is <type 'str'>

-----------------数据结构混杂编码-----------------
{u'\u4e2d\u56fd': u'\u5317\u4eac',
 u'\u65e5\u672c': '\xe4\xb8\x9c\xe4\xba\xac',
 u'\u6cd5\u56fd': u'\u5df4\u9ece'}
{'\xe4\xb8\xad\xe5\x9b\xbd': '\xe5\x8c\x97\xe4\xba\xac',
 '\xe6\x97\xa5\xe6\x9c\xac': u'\u4e1c\u4eac',
 '\xe6\xb3\x95\xe5\x9b\xbd': '\xe5\xb7\xb4\xe9\xbb\x8e'}

unicode instance dumps to string:
'{"\\u4e2d\\u56fd": "\\u5317\\u4eac", "\\u65e5\\u672c": "\\u4e1c\\u4eac", "\\u6cd5\\u56fd": "\\u5df4\\u9ece"}'
{"\u4e2d\u56fd": "\u5317\u4eac", "\u65e5\u672c": "\u4e1c\u4eac", "\u6cd5\u56fd": "\u5df4\u9ece"}
type(rt) is <type 'str'>
utf-8 instance dumps to string:
'{"\\u4e2d\\u56fd": "\\u5317\\u4eac", "\\u6cd5\\u56fd": "\\u5df4\\u9ece", "\\u65e5\\u672c": "\\u4e1c\\u4eac"}'
{"\u4e2d\u56fd": "\u5317\u4eac", "\u6cd5\u56fd": "\u5df4\u9ece", "\u65e5\u672c": "\u4e1c\u4eac"}
type(rt) is <type 'str'>

unicode instance dumps(ensure_ascii=False) to string:
<type 'exceptions.UnicodeDecodeError'>:'ascii' codec can't decode byte 0xe4 in position 1: ordinal not in range(128)
utf-8 instance dumps to string:
<type 'exceptions.UnicodeDecodeError'>:'ascii' codec can't decode byte 0xe4 in position 1: ordinal not in range(128)

33. JSON serialization turns numeric dict keys into strings

>>> import json
>>> d = {1:[1,2,3,4],0:()}
>>> d
{0: (), 1: [1, 2, 3, 4]}
>>> s = json.dumps(d)
>>> s
'{"0": [], "1": [1, 2, 3, 4]}'
>>> json.loads(s)
{u'1': [1, 2, 3, 4], u'0': []}

From the official docs:

Keys in key/value pairs of JSON are always of the type str. When a dictionary is converted into JSON, all the keys of the dictionary are coerced to strings. As a result of this, if a dictionary is converted into JSON and then back into a dictionary, the dictionary may not equal the original one. That is, loads(dumps(x)) != x if x has non-string keys.
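If the keys are known to have been integers, they can be converted back by hand after loading. A minimal sketch (Python 3), assuming every key in the original dict was an int; note that the tuple still comes back as a list.

```python
import json

original = {1: [1, 2, 3, 4], 0: ()}
text = json.dumps(original)       # keys become "1" and "0", the tuple becomes []
loaded = json.loads(text)

# Convert the keys back by hand; this assumes every key was an int originally
restored = {int(key): value for key, value in loaded.items()}
print(restored)
```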

34. In interactive mode, _ holds the result of the last expression evaluated

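35. Comparing the multi-process modules

os.popen() and popen2.* are not what the official docs recommend; subprocess is

If the command passed to os.popen() does not end with an ampersand (&), the parent process gets blocked. The call is very convenient, but all it returns is a pipe to the child (the default mode is read, which reads the child's stdout); there is no direct way to kill the child or get information about it (you could write to the pipe to ask the child to terminate itself, but that is pretty silly, you know). Calling close() on the pipe returns the child's exit status (I have never used it, ^_^). I used this call heavily in my previous projects because process control there was coarse-grained

I have not used popen2.* yet, but as the names suggest popen2.popen2() returns the child's stdout and stdin, and popen2.popen3() returns stdout, stdin and stderr... not much of an improvement over os.popen, it seems

multiprocessing is a multi-process module modelled on the threading interface; watch out for file descriptors and database connections being shared. It is a different kind of thing from the modules that start child processes by running shell commands

With subprocess, watch out for zombie processes. After a child exits, the system keeps a small structure (exit status and so on) for the parent to read; when the parent wait()s for the child, the system knows the structure is no longer needed and frees it. If the parent keeps running and never wait()s, the exited child stays a zombie and occupies a process ID

subprocess usage: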

>>> obj2 = subprocess.Popen('python /home/dongsong/python_study/child2.py', shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
>>> dir(obj2)
['__class__', '__del__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_check_timeout', '_child_created', '_close_fds', '_communicate', '_communicate_with_poll', '_communicate_with_select', '_communication_started', '_execute_child', '_get_handles', '_handle_exitstatus', '_input', '_internal_poll', '_remaining_time', '_set_cloexec_flag', '_translate_newlines', 'communicate', 'kill', 'pid', 'poll', 'returncode', 'send_signal', 'stderr', 'stdin', 'stdout', 'terminate', 'universal_newlines', 'wait']
>>> dir(obj2.stdout)
['__class__', '__delattr__', '__doc__', '__enter__', '__exit__', '__format__', '__getattribute__', '__hash__', '__init__', '__iter__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'close', 'closed', 'encoding', 'errors', 'fileno', 'flush', 'isatty', 'mode', 'name', 'newlines', 'next', 'read', 'readinto', 'readline', 'readlines', 'seek', 'softspace', 'tell', 'truncate', 'write', 'writelines', 'xreadlines']
>>> obj2.stdout.read()
'[]\naaaaa\naaaaa\naaaaa\naaaaa\naaaaa\naaaaa\naaaaa\naaaaa\naaaaa\naaaaa\n'
>>> obj2.stdout.read()
''
>>> obj2.communicate()[0]
''
>>> obj2.communicate()[1]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.6/subprocess.py", line 729, in communicate
    stdout, stderr = self._communicate(input, endtime)
  File "/usr/lib64/python2.6/subprocess.py", line 1310, in _communicate
    stdout, stderr = self._communicate_with_poll(input, endtime)
  File "/usr/lib64/python2.6/subprocess.py", line 1364, in _communicate_with_poll
    register_and_append(self.stdout, select_POLLIN_POLLPRI)
  File "/usr/lib64/python2.6/subprocess.py", line 1343, in register_and_append
    poller.register(file_obj.fileno(), eventmask)
ValueError: I/O operation on closed file
>>> obj2.stderr.read()   
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: I/O operation on closed file
>>> args = shlex.split('python /home/dongsong/python_study/child2.py')
>>> obj = subprocess.Popen(args)
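The session above runs into errors because the pipes were read or closed before the later calls. A more usual pattern is a single communicate() call, which reads both pipes and waits for the child. A sketch in Python 3; the child command is invented for the illustration (the current interpreter running a one-liner), not the child2.py script from the session.

```python
import subprocess
import sys

# Hypothetical child: the current interpreter running a one-line script
args = [sys.executable, '-c', 'import sys; print("hello"); sys.stderr.write("oops")']

proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

# communicate() reads both pipes to EOF and waits for the child,
# so the child is reaped and does not linger as a zombie
out, err = proc.communicate()

print(out, err, proc.returncode)
```

Passing the command as a list, as here and as with shlex.split() above, avoids going through the shell.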

36. Making reads on a file object non-blocking

flags = fcntl.fcntl(procObj.stdout.fileno(), fcntl.F_GETFL)
fcntl.fcntl(procObj.stdout.fileno(), fcntl.F_SETFL, flags|os.O_NONBLOCK)
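To see what the flag changes, here is the same fcntl() pair applied to a plain os.pipe() instead of a child's stdout. This is a Unix-only sketch in Python 3; the pipe stands in for procObj.stdout.

```python
import fcntl
import os

read_fd, write_fd = os.pipe()

# Add O_NONBLOCK to the flags the descriptor already has
flags = fcntl.fcntl(read_fd, fcntl.F_GETFL)
fcntl.fcntl(read_fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)

# Nothing has been written yet: a blocking read would hang here,
# a non-blocking read raises BlockingIOError instead (Python 3)
try:
    os.read(read_fd, 1024)
    would_block = False
except BlockingIOError:
    would_block = True

os.write(write_fd, b'data')
chunk = os.read(read_fd, 1024)   # data is available now

os.close(read_fd)
os.close(write_fd)
```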

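37. How to create a daemon process (which also avoids zombies)

The principle is described in the encyclopedia entry on zombie processes: fork twice. The parent forks a child and carries on working; the child forks a grandchild and exits; the grandchild is then adopted by init, which reaps it when it finishes. The parent still has to reap the child itself.

This implementation can serve as a reference; it is only good for learning and has no practical use: http://blog.csdn.net/snleo/article/details/4410305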

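38. The default encoding and the built-in str()

str(xx) converts xx to a printable string in the system default encoding (sys.getdefaultencoding()). The default is usually ascii, so if xx is a unicode string of Chinese characters an error is raised; with the default encoding changed to utf-8 there is no error

Changing the default encoding is not recommended because it affects some libraries; if you really must, the methods below work. Note that sys.setdefaultencoding() is not available in every situation (The setdefaultencoding is used in python-installed-dir/site-packages/pyanaconda/sitecustomize.py)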

[dongsong@bogon python_study]$ vpython
Python 2.6.6 (r266:84292, Dec  7 2011, 20:48:22) 
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> s = u'中国'
>>> str(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
>>> s.encode('utf-8')
'\xe4\xb8\xad\xe5\x9b\xbd'
>>> sys.setdefaultencoding('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'setdefaultencoding'
>>> d = {u'中国':u'北京'}
>>> d
{u'\u4e2d\u56fd': u'\u5317\u4eac'}
>>> str(d)
"{u'\\u4e2d\\u56fd': u'\\u5317\\u4eac'}"
# Change the default encoding
[dongsong@bogon python_study]$ cat ~/venv/lib/python2.6/site-packages/sitecustomize.py
import sys
sys.setdefaultencoding('utf-8')
[dongsong@bogon python_study]$ vpython -c 'import sys; print sys.getdefaultencoding();'
utf-8
[dongsong@bogon python_study]$ vpython
Python 2.6.6 (r266:84292, Dec  7 2011, 20:48:22) 
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s = u'中国' 
>>> str(s)
'\xe4\xb8\xad\xe5\x9b\xbd'
>>> import sys
>>> print sys.getdefaultencoding()
utf-8
>>> d = {u'中国':u'北京'}
>>> d
{u'\u4e2d\u56fd': u'\u5317\u4eac'}
>>> str(d)
"{u'\\u4e2d\\u56fd': u'\\u5317\\u4eac'}"

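python -S skips site.py (the contents of site.py in the Python source are worth reading; it is site.py that removes setdefaultencoding() from sys at startup), and then the sys module offers setdefaultencoding() directly.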

39. traceback

...
except Exception,e:
                if not isinstance(e, APIError):
                    traceback.print_exc(file=sys.stderr)

or

import sys
tp,val,td = sys.exc_info()

sys.exc_info() returns a tuple (type, value/message, traceback):
type ---- the type of the exception
value/message ---- the exception's message or arguments
traceback ---- an object holding the call stack information.

The traceback module works on traceback objects: traceback.print_tb() prints one and traceback.format_tb() returns it in printable form (a list of strings)
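Putting sys.exc_info() and the traceback module together, as a small Python 3 sketch. The divide() function is made up just to have something that fails.

```python
import sys
import traceback

def divide(a, b):
    return a / b

try:
    divide(1, 0)
except Exception:
    exc_type, exc_value, tb = sys.exc_info()
    # format_tb() returns a list of strings, one per stack frame
    frames = traceback.format_tb(tb)
    # format_exc() returns the whole report as one string
    report = traceback.format_exc()

print(exc_type.__name__, exc_value)
print(''.join(frames))
```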

Reference: http://hi.baidu.com/whaway/item/8136af0b404dd1813c42e207

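40. Some options for GUI development in Python: GUI Programming in Python (http://wiki.python.org/moin/GuiProgramming)

cocos2d: the past and present of the Cocos2D family

cocos2d official site

cocos2d-x

pygame: pygame wiki

pygame official site

tkinter: tkinter tutorial

tkinter official site

wxpython: wxpython official site

For image processing and charts see another article: http://blog.csdn.net/xiarendeniao/article/details/7991305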

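41. Static methods and class methods (methods wrapped with the built-ins staticmethod() and classmethod())

In Python, both static methods and class methods can be called through the class and through an instance of the class. The differences are:

1> A method decorated with @classmethod is a class method; its first parameter cls receives the class. When a subclass inherits it and calls it, the cls passed in is the subclass, not the parent. This differs from static member functions in C++. Calling: ClassA.func() or ClassA().func() (in the latter the instance is ignored except for its class). classmethod() is useful for creating alternate class constructors.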

>>> class A:
...     @classmethod
...     def func(cls):
...             import pdb
...             pdb.set_trace()
...             pass
... 
>>> A.func()
> (6)func()
(Pdb) cls

(Pdb) type(cls)

(Pdb) 
>>> type(A())

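2> A method decorated with @staticmethod is a static method; it receives no implicit first argument. It is basically the same as a global function and very similar to a static member function in C++. Calling: ClassA.func() or ClassA().func() (in the latter the instance is ignored)

3> A method without either decorator is an ordinary method (instance method); its first parameter is self, which receives the instance. Calling: ClassA().func()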
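The three kinds of method side by side, including the point that cls is the subclass when the call goes through a subclass. A minimal sketch; the class and method names are made up.

```python
class Base(object):
    def __init__(self, value):
        self.value = value

    @classmethod
    def from_text(cls, text):
        # cls is whichever class the call was made through
        return cls(int(text))

    @staticmethod
    def double(number):
        # no implicit first argument at all
        return number * 2

    def describe(self):
        # ordinary instance method: self is the instance
        return '%s(%d)' % (type(self).__name__, self.value)


class Child(Base):
    pass


obj = Child.from_text('21')
print(obj.describe(), Base.double(obj.value), obj.double(obj.value))
```

from_text() is an example of the "alternate constructor" use mentioned in point 1: called on Child, it builds a Child without Child having to override anything.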

42. Merging dictionaries

>>> d1
{1: 6, 11: 12, 12: 13, 13: 14}
>>> d2
{1: 2, 2: 3, 3: 4}
>>> dict(d2, **d1)
{1: 6, 2: 3, 3: 4, 11: 12, 12: 13, 13: 14}
>>> dict(d1,**d2) 
{1: 2, 2: 3, 3: 4, 11: 12, 12: 13, 13: 14}
>>> d = dict(d1)
>>> d
{1: 6, 11: 12, 12: 13, 13: 14}
>>> d2
{1: 2, 2: 3, 3: 4}
>>> d.update(d2)
>>> d
{1: 2, 2: 3, 3: 4, 11: 12, 12: 13, 13: 14}
>>> d = dict(d2)
>>> d
{1: 2, 2: 3, 3: 4}
>>> d1
{1: 6, 11: 12, 12: 13, 13: 14}
>>> d.update(d1)
>>> d
{1: 6, 2: 3, 3: 4, 11: 12, 12: 13, 13: 14}
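One caveat if these notes are reused on Python 3: the dict(d2, **d1) form only accepts string keys there, so with the integer keys above it fails, while copy-and-update keeps working. A short sketch:

```python
d1 = {1: 6, 11: 12, 12: 13, 13: 14}
d2 = {1: 2, 2: 3, 3: 4}

# dict(d2, **d1) needs string keys in Python 3, so with int keys it fails
try:
    dict(d2, **d1)
    kwargs_merge_ok = True
except TypeError:
    kwargs_merge_ok = False

# copy + update works for any hashable keys; d1's values win on conflicts
merged = dict(d2)
merged.update(d1)

# Same result with unpacking (Python 3.5+)
merged_unpacked = {**d2, **d1}
```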

43. Handling network timeouts

1>>urllib2.urlopen(url,timeout=xx)

2>>socket.setdefaulttimeout(xx) # global socket timeout setting

3>>a timer

from urllib2 import urlopen
from threading import Timer
url = "http://www.python.org"
def handler(fh):
        fh.close()
fh = urlopen(url)
t = Timer(20.0, handler,[fh])
t.start()
data = fh.read()
t.cancel()

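44. Working with Excel

I used to use the csv module to read and write csv files, then open them in Excel and save them as xls

Today (2012.10.30) I found a library that is more direct and more powerful: http://www.python-excel.org/

Versions I use: (xlwt-0.7.4 xlrd-0.8.0 xlutils-1.5.2)

Row height can be set with sheetObj.row(index).set_style(easyxf('font:height 720;')) and column width with sheetObj.col(index).width = 1000. Most of the other methods are buggy and have no effect: http://reliablybroken.com/b/2011/10/widths-heights-with-xlwt-python/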

#encoding=utf-8
from xlwt import Workbook, easyxf

book = Workbook(encoding='utf-8')
sheet1 = book.add_sheet('Sheet 1')
sheet1.col_width(20000)
book.add_sheet('Sheet 2')
sheet1.write(0,0,'起点')
sheet1.write(0,1,'B1')
row1 = sheet1.row(1)
row1.write(0,'Ai2')
row1.write(1,'B2')
sheet1.col(0).width = 10000
sheet1.col(1).width = 20000
#sheet1.default_col_width = 20000 #bug invalid
#sheet1.col_width(30000) #bug invalid
#sheet1.default_row_height = 5000 #bug invalid
#sheet1.row(0).height = 5000 #bug invalid
sheet1.row(0).set_style(easyxf('font:height 400;'))
style = easyxf('pattern: pattern solid, fore_colour red;'
                'align: vertical center, horizontal center;'
                'font: bold true;')
sheet1.write_merge(2,5,2,5,'Merged',style)
sheet2 = book.get_sheet(1)
sheet2.row(0).write(0,'Sheet 2 A1')
sheet2.row(0).write(1,'Sheet 2 B1')
sheet2.flush_row_data()
sheet2.write(1,0,'Sheet 2 A3')
sheet2.col(0).width = 5000
sheet2.col(0).hidden = True
book.save('simple.xls')

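An annoying thing about this library is not knowing what a given width, height or colour will actually look like. I wrote a script that prints all the supported colours and the commonly used widths and heights for reference; see http://blog.csdn.net/xiarendeniao/article/details/8276957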

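45. When the machine has several IP addresses, how do you choose which one urllib2 uses for an HTTP request? There are two ways. The convenient, slightly hacky one is to replace the socket() function of the socket module (the code below does this). The other: A better way is to extend connect() method in subclass of HTTPConnection and redefine http_open() method in subclass of HTTPHandler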

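import socket

def bind_alt_socket(alt_ip):
	# Keep a reference to the real constructor before replacing it
	true_socket = socket.socket
	def bound_socket(*a, **k):
		sock = true_socket(*a, **k)
		# Port 0 lets the OS pick any free local port on alt_ip
		sock.bind((alt_ip, 0))
		return sock
	socket.socket = bound_socket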

References: http://www.rossbates.com/2009/10/urllib2-with-multiple-network-interfaces/

http://stackoverflow.com/questions/1150332/source-interface-with-python-and-urllib2

46. Installing PyQt4:

1. Install sip
wget http://sourceforge.net/projects/pyqt/files/sip/sip-4.14.1/sip-4.14.1.tar.gz
vpython configure.py
make
sudo make install

2.sudo yum install qt qt-devel -y
  sudo yum install qtwebkit qtwebkit-devel -y // without this step, the configure step below will not generate a Makefile for QtWebKit

3. Install pyqt
wget http://sourceforge.net/projects/pyqt/files/PyQt4/PyQt-4.9.5/PyQt-x11-gpl-4.9.5.tar.gz
vpython configure.py -q/usr/bin/qmake-qt4 -g
make 
make install

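A module that dir(PyQt4) does not show is not necessarily missing! The .so extension modules can be imported with from PyQt4 import QtGui or import PyQt4.QtGui. Damn, I kept thinking the installation had failed and went through all sorts of attempts and root-cause hunting, going crazy...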

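47. Making one Python interpreter use another interpreter's environment (its installed modules)

References: http://mydjangoblog.com/2009/03/30/django-mod_python-and-virtualenv/ https://pypi.python.org/pypi/virtualenv

The example below uses, from the default Python environment, the callme module installed in a virtualenv Python: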

[dongsong@localhost ~]$ python
Python 2.6.6 (r266:84292, Jun 18 2012, 14:18:47) 
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import callme
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named callme
>>> activate_this = '/home/dongsong/venv/bin/activate_this.py'            
>>> execfile(activate_this, dict(__file__=activate_this))
>>> import callme
>>>

As for making mod_python use the virtualenv Python environment, see the link above:

#myvirtualdjango.py

activate_this = '/home/django/progopedia.ru/ve/bin/activate_this.py'
execfile(activate_this, dict(__file__=activate_this))

from django.core.handlers.modpython import handler


    ServerName progopedia.ru
    ServerAdmin [email protected]

    
        SetHandler python-program
        PythonPath "['/home/django/progopedia.ru/ve/bin', '/home/django/progopedia.ru/src/progopedia_ru_project/'] + sys.path"
        PythonHandler myvirtualdjango
        SetEnv DJANGO_SETTINGS_MODULE settings
        SetEnv PYTHON_EGG_CACHE /var/tmp/egg
        PythonInterpreter polyprog_ru

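48. Formatted output

%r is a catch-all format specifier: it prints the argument as is, with type information

print automatically adds a newline at the end of the line; to suppress it, end the print statement with a comma ","

More neat usages at http://www.pythonclub.org/python-basic/print

%r uses the object's repr() form, %s uses its str() form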

49. finally is easy to get wrong!

[dongsong@localhost python_study]$ cat finally_test.py 
#encoding=utf-8

def func():
        a = 1
        try:
                return a
        except Exception,e:
                print '%r' % e
        else:
                print 'no exception'
        finally:
                print 'finally'
                a += 1

a = func()
print 'func returned %s' % a
[dongsong@localhost python_study]$ vpython finally_test.py 
finally
func returned 1
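What happens above: the return value is computed when the return statement runs, then the finally block executes, and rebinding a afterwards does not touch the value already on its way out. Mutating a returned object is different. A sketch in Python 3 with made-up function names:

```python
def return_then_rebind():
    a = 1
    try:
        return a         # the value 1 is computed here...
    finally:
        a += 1           # ...so rebinding a afterwards changes nothing


def return_then_mutate():
    items = [1]
    try:
        return items     # the list object itself is what gets returned
    finally:
        items.append(2)  # mutating that same object is visible to the caller


print(return_then_rebind(), return_then_mutate())
```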

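50. stackless

Official site: http://www.stackless.com/

Material in Chinese (with examples): http://gashero.yeax.com/?p=30

1> When stackless.schedule() is called, the currently active tasklet pauses and re-inserts itself at the end of the scheduler queue so that the next tasklet can run.
Once all the tasklets ahead of it have run, it resumes from where it stopped. This goes on until every active tasklet has finished. This is how stackless achieves cooperative multitasking.
2> When the receiving tasklet calls channel.receive() it blocks, meaning it is suspended until something is sent over the channel. Apart from sending on this channel, there is no other way to resume the tasklet.
If another tasklet sends on the channel, the receiving tasklet resumes immediately, wherever the scheduler currently is; the sending tasklet is moved to the end of the scheduling list, just as if it had called stackless.schedule().
Note also that sending when no tasklet is receiving on the channel blocks the current tasklet as well.
A sending tasklet is re-inserted into the scheduler only after its data has been successfully delivered to another tasklet.
3> Getting rid of stack overflow: remember that I mentioned earlier that an experienced programmer would spot the flaw in the recursive version of that code at a glance. Honestly, though, there is no "computer science" reason that stops it from working; what people firmly believe is really just an implementation detail, there only because most traditional programming languages use a stack. In a sense, experienced programmers have been brainwashed into believing this is an acceptable problem. Stackless actually recognized the problem and removed it.
4> Microthreads, lightweight threads: compared with the ordinary threads built into today's operating systems and supported by standard Python code, "microthreads" are more lightweight, as the name suggests. They use less memory than traditional threads, and switching between microthreads costs less than switching between traditional threads.
5> Timing: now we time several experimental runs. The Python standard library has a timeit.py program that can be used for this.
6> We set the channel's preference to 1, so that the task is not blocked after calling send and keeps running, and the correct warehouse information is printed afterwards.
7> In stackless, the balance of a channel is how many tasklets are waiting to send or receive on it. A positive number is the count of waiting senders, a negative number the count of waiting receivers, and 0 means nobody is waiting.

Summary: stackless python is still limited by the GIL and cannot use multiple cores; it is only an improvement over Python's traditional threads (http://stackoverflow.com/questions/377254/stackless-python-and-multicores). So using multiprocessing for multiple processes and stackless microthreads inside each process is a good combination. The EVE server side is built with stackless (apparently C++/stackless python); I would love to see their code, hahaha.

Installing stackless python: see http://opensource.hyves.org/concurrence/install.html#installing-stackless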

sudo yum install readline-devel -y
./configure --prefix=/opt/stackless --with-readline --with-zlib=/usr/include
make
make install

51. Loading modules dynamically

The built-in __import__()

[dongsong@localhost python_study]$ touch mds/__init__.py
[dongsong@localhost python_study]$ vpython
Python 2.6.6 (r266:84292, Jun 18 2012, 14:18:47) 
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> m = __import__('mds.m1', globals(), locals(), fromlist=[], level = 0)
>>> m

The first time I used this function in my own code (2014.6.25) I found quite a few things to watch out for; read the official documentation carefully

class RobotMeta(type):
    def __new__(cls, name, bases, attrs):
        newbases = list(bases)
        import testcase
        import pkgutil
        for importer, modname, ispkg in pkgutil.iter_modules(testcase.__path__):
            if ispkg: continue
            mod = __import__('testcase.'+modname, globals(), locals(), fromlist=(modname,), level=1)
            if hasattr(mod, 'Robot'):
                newbases.append(mod.Robot)
        return super(RobotMeta, cls).__new__(cls, name, tuple(newbases), attrs)

The importlib library: importlib.import_module()

[dongsong@localhost python_study]$ touch mds/__init__.py
[dongsong@localhost python_study]$ vpython
Python 2.6.6 (r266:84292, Jun 18 2012, 14:18:47) 
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import importlib
>>> m = importlib.import_module('mds.m1')
>>> m

>>>

52. How do you make a user-defined class support pickle and cPickle? (Below is the change made to a project class that inherits from dict and holds parsed JSON; see http://stackoverflow.com/questions/5247250/why-does-pickle-getstate-accept-as-a-return-value-the-very-instance-it-requi)

def __getstate__(self): 
        return dict(self)
    
def __setstate__(self, state):
        return self.update(state)

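53. Checking what a string is made of

s.isalnum()  all characters are digits or letters
s.isalpha()  all characters are letters
s.isdigit()  all characters are digits
s.islower()  all cased characters are lowercase
s.isupper()  all cased characters are uppercase
s.istitle()  every word starts with an uppercase letter, like a title
s.isspace()  all characters are whitespace (space, \t, \n, \r)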
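A quick check of these predicates on a few made-up strings. One detail worth knowing is that they all return False for the empty string.

```python
alnum = 'abc123'.isalnum()         # letters and digits only
alpha = 'abc123'.isalpha()         # False: contains digits
digit = '2012'.isdigit()
lower = 'hello world'.islower()    # the space is uncased and ignored
title = 'Hello World'.istitle()
space = ' \t\n\r'.isspace()
empty = ''.isdigit()               # every is*() method is False for ''

print(alnum, alpha, digit, lower, title, space, empty)
```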

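54. python networking frameworks. Python concurrency cannot be covered in a few words, so it gets its own article: http://blog.csdn.net/xiarendeniao/article/details/9143059

Twisted is the common, widely used one (module index)

concurrence is tied to stackless (a combination of stackless and libevent), so it appeals to me

cogen is similar to the one above and more portable

gevent is a combination of greenlet and libevent (greenlet is a by-product of stackless, more primitive than stackless and better at satisfying a coder's urge to control coroutines), so its principle looks much the same as concurrence's

Source material for the summary above: http://stackoverflow.com/questions/1824418/a-clean-lightweight-alternative-to-pythons-twisted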

55. Environment variables in Python

import os
if not os.environ.has_key('DJANGO_SETTINGS_MODULE'):
    os.environ['DJANGO_SETTINGS_MODULE'] = 'boosencms.settings'
else:
    print 'DJANGO_SETTINGS_MODULE: %s' % os.environ['DJANGO_SETTINGS_MODULE']
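has_key() no longer exists in Python 3; the in operator or setdefault() covers the same check. A small sketch that uses an invented variable name (DEMO_SETTINGS_MODULE) so that it does not touch any real setting:

```python
import os

# Hypothetical variable name, used only for this sketch
name = 'DEMO_SETTINGS_MODULE'
os.environ.pop(name, None)          # start from a clean state

# "in" replaces has_key(), which no longer exists in Python 3
was_set = name in os.environ

# setdefault() assigns only when the variable is missing
first = os.environ.setdefault(name, 'boosencms.settings')
second = os.environ.setdefault(name, 'something.else')

print(was_set, first, second)
os.environ.pop(name, None)          # clean up
```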

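56. yield is the syntax for creating a generator. A generator is an object that can be iterated once; its advantage over structures such as list and tuple for iteration is that not all the data has to be in memory. For details see the official docs and the Stack Overflow discussion thread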

[dongsong@localhost python-study]$ !cat
cat yield.py 
def echo(value=None):
    print "Execution starts when 'next()' is called for the first time."
    try:
        while True:
            try:
                value = (yield value)
            except Exception, e:
                print "catched an exception", e
                value = e
            else:
                print "yield received ", value
    finally:
        print "Don't forget to clean up when 'close()' is called."

generator = echo(1)
print generator.next()
print generator.next()
print generator.send(2)
generator.throw(TypeError, "spam")
generator.close()
[dongsong@localhost python-study]$ 
[dongsong@localhost python-study]$ 
[dongsong@localhost python-study]$ !python
python yield.py 
Execution starts when 'next()' is called for the first time.
1
yield received  None
None
yield received  2
2
catched an exception spam
Don't forget to clean up when 'close()' is called.
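The example above exercises send(), throw() and close(). The more common everyday use is plain lazy iteration, sketched here in Python 3 (where generator.next() is spelled next(generator)); the squares() function is made up for the illustration.

```python
def squares(limit):
    """Yield 0, 1, 4, ... one value at a time instead of building a list."""
    for number in range(limit):
        yield number * number


gen = squares(4)
first = next(gen)            # runs the body up to the first yield
rest = list(gen)             # consumes what is left
again = list(gen)            # a generator can only be iterated once

print(first, rest, again)
```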

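57. Metaclasses: explained in detail in the article http://blog.csdn.net/xiarendeniao/article/details/9232021

58. Implementing the singleton pattern. This Stack Overflow thread describes four ways; I prefer the third: http://stackoverflow.com/questions/6760685/creating-a-singleton-in-python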

[dongsong@localhost python_study]$ cat singleton3.py 
#encoding=utf-8

class Singleton(type):
        _instances = {}
        def __call__(cls, *args, **kwargs):
                if cls not in cls._instances:
                        cls._instances[cls] = super(Singleton, cls).__call__(*args, **kwargs)
                return cls._instances[cls]

class MyClass(object):
        __metaclass__ = Singleton

singletonObj = Singleton('Test',(),{})
myClassObj1 = MyClass()
myClassObj2 = MyClass()
print singletonObj, singletonObj.__class__
print id(myClassObj1),myClassObj1,myClassObj1.__class__
print id(myClassObj2),myClassObj2,myClassObj2.__class__
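[dongsong@localhost python_study]$ vpython singleton3.py 
<class '__main__.Test'> <class '__main__.Singleton'>
139799414931408 <__main__.MyClass object at 0x7f2596777fd0> <class '__main__.MyClass'>
139799414931408 <__main__.MyClass object at 0x7f2596777fd0> <class '__main__.MyClass'>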
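In Python 3 the __metaclass__ attribute is ignored and the metaclass is given in the class header instead. The same idea as a sketch; the Config class and its name argument are invented for the example.

```python
class Singleton(type):
    _instances = {}

    def __call__(cls, *args, **kwargs):
        # Create the instance on the first call only, then reuse it
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]


# Python 3 spelling; the __metaclass__ attribute is ignored there
class Config(metaclass=Singleton):
    def __init__(self, name='default'):
        self.name = name


first = Config('a')
second = Config('b')     # arguments are ignored: the first instance is returned
print(first is second, second.name)
```

Note the side effect visible in the last line: constructor arguments on later calls are silently dropped, which may or may not be what you want.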

59. python magic methods: rather long, so it gets its own article http://blog.csdn.net/xiarendeniao/article/details/9270407

60. struct, binary data; official docs http://docs.python.org/3/library/struct.html

Character	Byte order	Size	Alignment
`@`	native	native	native
`=`	native	standard	none
`<`	little-endian	standard	none
`>`	big-endian	standard	none
`!`	network (= big-endian)	standard	none

Format	C Type	Python type	Standard size	Notes
`x`	pad byte	no value
`c`	`char`	bytes of length 1	1
`b`	`signed char`	integer	1	(1),(3)
`B`	`unsigned char`	integer	1	(3)
`?`	`_Bool`	bool	1	(1)
`h`	`short`	integer	2	(3)
`H`	`unsigned short`	integer	2	(3)
`i`	`int`	integer	4	(3)
`I`	`unsigned int`	integer	4	(3)
`l`	`long`	integer	4	(3)
`L`	`unsigned long`	integer	4	(3)
`q`	`long long`	integer	8	(2), (3)
`Q`	`unsigned long long`	integer	8	(2), (3)
`n`	`ssize_t`	integer		(4)
`N`	`size_t`	integer		(4)
`f`	`float`	float	4	(5)
`d`	`double`	float	8	(5)
`s`	`char[]`	bytes
`p`	`char[]`	bytes
`P`	`void *`	integer		(6)

>>> import struct
>>> struct.pack('HH',1,2)
'\x01\x00\x02\x00'
>>> struct.pack('<HH',1,2)
'\x01\x00\x02\x00'
>>> struct.pack('>HH',1,2)
'\x00\x01\x00\x02'
>>> s= struct.pack('HH',1,2)
>>> s
'\x01\x00\x02\x00'
>>> len(s)
4
>>> struct.unpack('HH',s)  
(1, 2)
>>> struct.unpack_from('H', s, 2) 
(2,)
>>> struct.unpack('H',s[0:2])
(1,)
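The same calls in Python 3, where pack() returns bytes. This sketch always spells out the byte order so the results do not depend on the machine it runs on.

```python
import struct

# Explicit byte order makes the result independent of the machine
little = struct.pack('<HH', 1, 2)
big = struct.pack('>HH', 1, 2)
network = struct.pack('!I', 1)        # '!' is big-endian, standard sizes

# calcsize() reports how many bytes a format occupies
size = struct.calcsize('<HH')

# unpack_from() reads at an offset without slicing the buffer
second, = struct.unpack_from('<H', little, 2)

print(little, big, network, size, second)
```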

61. Closures

[dongsong@localhost python_study]$ cat enclosing_1.py 
#encoding=utf8
a = 1
b = 2

def f(v = 0):
        a = 2
        c = list()
        def g():
                print 'a = %s' % a
                print 'b = %s' % b
                print 'c = %r' % c

        if v == 0:
                a += 1
        else:
                a += v
        c.append(111)
        return g

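g = f() # f returns the function object g, assigned to g here; it is bound to a (3) and c ([111]), forming a closure
f(10)() # the inner function is bound to a (12) and c ([111]); prints: a = 12, b = 2, c = [111]
f()     # prints nothing: the inner function bound to a/c is never called
g()     # prints: a = 3, b = 2, c = [111]
b = 3
g() # prints: a = 3, b = 3, c = [111] (b is a global variable)

print a # prints the global variable a: 1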
[dongsong@localhost python_study]$ vpython enclosing_1.py 
a = 12
b = 2
c = [111]
a = 3
b = 2
c = [111]
a = 3
b = 3
c = [111]
1
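The example above only reads the enclosed variables from the inner function. Rebinding one from inside needs the nonlocal statement, which exists only in Python 3. A sketch with made-up names:

```python
def make_counter(start=0):
    count = start

    def increment(step=1):
        nonlocal count      # rebind the enclosing variable (Python 3 only)
        count += step
        return count

    return increment


counter_a = make_counter()
counter_b = make_counter(10)   # each call to make_counter() gets its own count

results = [counter_a(), counter_a(), counter_b(5)]
# The captured variable lives in a cell attached to the function object
cell_value = counter_a.__closure__[0].cell_contents
print(results, cell_value)
```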

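62. How do you keep pyc files from living next to the py files? See the Stack Overflow thread http://stackoverflow.com/questions/3522079/changing-the-directory-where-pyc-files-are-created

From Python 3.2 on, pyc files are written to a __pycache__ directory inside the source directory instead of next to the py files (I have not used Python 3 myself)

With Python 2 you can start the interpreter with -B to stop pyc bytecode files from being written to disk, though import will then be slower (modules are recompiled each time)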

63. Storing Weibo data (account descriptions) in the database raises a warning and the data is truncated:

[dongsong@localhost tfengyun_py]$ vpython new_user.py debug 1852589841
/data/weibofengyun/workspace-php/tfengyun_py/utils.py:26: Warning: Incorrect string value: '\xF0\x9F\x92\x91\xE4\xBD...' for column 'description' at row 1
  try: affectCount = self.cursor.execute(sql)

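Final solution (copied straight from a Python chat group):

吓人的鸟(362278013) 11:27:58 
I have more or less figured out yesterday's problem with the Warning when storing data in MySQL. Sharing it here; many thanks to @墨迹 !!

http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html 
MySQL before 5.5.3 does not support utf8mb4. Last Friday's warning occurred because some unicode characters (emoji from iOS devices) take four bytes when encoded as utf-8 (normally no more than three):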
>>> u'\u8bb0'.encode('utf-8')
'\xe8\xae\xb0'
>>> u'\U0001f497'.encode('utf-8')
'\xf0\x9f\x92\x97'
If upgrading MySQL is not an option, such characters can be filtered out; there is a related discussion on Stack Overflow
http://stackoverflow.com/questions/10798605/warning-raised-by-inserting-4-byte-unicode-to-mysql
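One way to do that filtering, as a rough Python 3 sketch rather than the exact code from the thread: characters outside the Basic Multilingual Plane (code points above U+FFFF) are the ones that need four bytes in UTF-8, so dropping them leaves only what a 3-byte utf8 column can hold. The function name is made up.

```python
def strip_four_byte_chars(text):
    """Drop characters outside the Basic Multilingual Plane.

    Those are exactly the characters that need four bytes in UTF-8.
    """
    return ''.join(ch for ch in text if ord(ch) <= 0xFFFF)


sample = '\u8bb0\U0001f497ok'          # a CJK character, an emoji, ASCII
lengths = [len(ch.encode('utf-8')) for ch in sample]
cleaned = strip_four_byte_chars(sample)
print(lengths, cleaned.encode('utf-8'))
```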

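So why can a PHP program store the same data in the same MySQL database without trouble (no error, no warning, no truncation)?
It turns out that it silently filters out the four-byte utf8 sequences. After storing, a query in the mysql command line shows the emoji are gone, and reading the data back with the PHP program confirms it

PS: To know what you know and to admit what you do not know, that is knowledge. People who come to ask are usually in a hurry; I hope colleagues will lecture less and offer more practical, effective suggestions.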

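64. (2014.4.25) Mixing Python with C/C++ (extending Python with C/C++, embedding Python in C/C++). The most basic approach is of course to follow the official docs. I have two translations of the relevant official documents; it is a huge hassle! There are too many rules about references and the like, and such a low-level interface is not suitable for direct use in a project.

The first choice in a project is Boost.Python (http://www.boost.org/doc/libs/1_55_0/libs/python/doc/). Anyone who has used C++ should know Boost; I think of Boost as the standard library second only to the C++ standard library (the chat server 老成 wrote at Kunlun in 2009 used boost.asio). It includes support for Python. Kingsoft's C++/Python game server uses this library for the interaction between C++ and Python.

Next, a classmate told me their project (apparently not a game) uses Pyrex (http://www.cosc.canterbury.ac.nz/greg.ewing/python/Pyrex/version/Doc/About.html), a new language whose syntax mixes C and Python. I have not looked into it deeply, so I will set it aside for now; I am more interested in Boost.Python.

Cython (http://cython.org/) is based on Pyrex and is designed for writing C extensions for Python

At this point PyPy (http://pypy.org/) has to be mentioned (although PyPy is not for interacting with C/C++). PyPy is a Python interpreter implemented in Python; its JIT (just-in-time compilation) makes it run faster than CPython (the official interpreter, the one most people use). It supports stackless and provides cooperating microthreads; the future looks bright! There is news that PyPy will drop the GIL to improve the performance of multithreaded programs, but the official docs do not seem to say so (http://pypy.org/tmdonate2.html#what-is-the-global-interpreter-lock).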

65. exec executes a code snippet directly.

eval evaluates a single expression.

compile turns a code snippet or a source file into a code object; both exec and eval can execute a code object.

https://docs.python.org/2/library/functions.html#compile

[dongsong@localhost python-study]$ python
Python 2.6.6 (r266:84292, Jan 22 2014, 09:42:36) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s = file("code.py").read()
>>> print s
def func():
        print "i am in function func()"
        return 1,2,3

>>> codeObj = compile(s,"","exec")  
>>> dir()
['__builtins__', '__doc__', '__name__', '__package__', 'codeObj', 's']
>>> codeObj
<code object <module> at 0x7f761cd74738, file "", line 1>
>>> eval(codeObj)
>>> dir()
['__builtins__', '__doc__', '__name__', '__package__', 'codeObj', 'func', 's']
>>> func()
i am in function func()
(1, 2, 3)

[dongsong@localhost python-study]$ python
Python 2.6.6 (r266:84292, Jan 22 2014, 09:42:36) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s = file("code.py").read()
>>> exec(s)
>>> dir()
['__builtins__', '__doc__', '__name__', '__package__', 'func', 's']
>>> func()
i am in function func()
(1, 2, 3)
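The sessions above run the code in the interpreter's own namespace. As a small sketch, assuming Python 3 syntax (exec is a function there) and made-up file names "<snippet>" and "<expr>", the same three built-ins can be pointed at an explicit dictionary so the executed code does not leak names into the caller:

```python
source = "def func():\n    return 1, 2, 3\n"

# mode "exec" accepts statements; the resulting code object defines func when run
code_obj = compile(source, "<snippet>", "exec")
namespace = {}
exec(code_obj, namespace)          # names defined by the snippet land in namespace
result = namespace["func"]()

# mode "eval" accepts exactly one expression and eval() returns its value
expr_obj = compile("1 + 2 * 3", "<expr>", "eval")
value = eval(expr_obj)

print(result, value)
```

Passing a statement such as "x = 1" with mode "eval" raises SyntaxError, which is the practical difference between the two modes.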

 
66. Random strings http://stackoverflow.com/questions/2257441/random-string-generation-with-upper-case-letters-and-digits-in-python

>>> import string
>>> import random
>>> def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
...    return ''.join(random.choice(chars) for _ in range(size))
...
>>> id_generator()
'G5G74W'
>>> id_generator(3, "6793YUIO")
'Y3U'
>>> string.ascii_uppercase
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> string.digits
'0123456789'
>>> string.ascii_uppercase + string.digits
'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
>>> string.lowercase
'abcdefghijklmnopqrstuvwxyz' 
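
67. The built-in hasattr cannot find an object's private attributes by the name written in the class (2014.6.18). Inside the class body self.__a is name-mangled and stored as _A__a, while the string '__a' passed to hasattr is not mangled.

[dongsong@localhost python-study]$ cat hasattr.py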
#encoding=utf-8

class A(object):
    def __init__(self):
        self.__a = 100
        self.a = 200
    def test(self):
        if hasattr(self,'__a'): print 'found self.__a:',self.__a
        else: print 'not found self.__a'
        if hasattr(self,'a'): print 'found self.a:', self.a
        else: print 'not found self.a'

if __name__ == '__main__':
    t = A()
    t.test()
[dongsong@localhost python-study]$ 
[dongsong@localhost python-study]$ python hasattr.py 
not found self.__a
found self.a: 200 
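A short sketch of why the lookup fails, assuming the same class A as in the file above (written so it also runs on Python 3): the attribute does exist, only under its mangled name.

```python
class A(object):
    def __init__(self):
        self.__a = 100   # name mangling stores this on the instance as _A__a

obj = A()
found_plain = hasattr(obj, "__a")       # the string is looked up literally
found_mangled = hasattr(obj, "_A__a")   # the name that is really in obj.__dict__
mangled_value = getattr(obj, "_A__a")

print(found_plain, found_mangled, mangled_value)
```

Mangling is applied by the compiler to identifiers inside the class body, not to strings, which is why hasattr(self, '__a') misses even when called from a method of A.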
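
68. Circular (or cyclic) imports in Python

http://stackoverflow.com/questions/744373/circular-or-cyclic-imports-in-python
Put plainly: if a imports b and b imports a, then using symbols from module b (b.xx, from b import xx) in a's top-level code (the code that runs on "import a") can fail, because the other module may be only partially initialized at that moment.
Also, when you run python a.py, a.py is first loaded as the __main__ module, so a later "import a" executes a.py a second time under the name a (the Python source-code analysis book mentions this; it is the reason for the if __name__ == '__main__' check).

[root@test-22 xds]# cat maintest.py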
import maintest
print 'main test in ..'
if __name__ == '__main__':
    print 'aaaa'
print 'main test out..'
[root@test-22 xds]# 
[root@test-22 xds]# python maintest.py
main test in ..
main test out..
main test in ..
aaaa
main test out..
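The failure described at the start of this item can be reproduced with two throwaway modules. This is a sketch under several assumptions: it targets Python 3 (tempfile.TemporaryDirectory), the module names circ_demo_a and circ_demo_b are invented for the demonstration, and the files are written to a temporary directory that is removed afterwards.

```python
import importlib
import os
import sys
import tempfile

# circ_demo_a imports circ_demo_b before defining value_a;
# circ_demo_b then touches value_a while circ_demo_a is only half initialized.
MOD_A = "import circ_demo_b\nvalue_a = 1\n"
MOD_B = "import circ_demo_a\nvalue_b = circ_demo_a.value_a\n"

def try_circular_import():
    """Import circ_demo_a; return the exception class name, or None on success."""
    with tempfile.TemporaryDirectory() as tmp:
        for name, text in (("circ_demo_a", MOD_A), ("circ_demo_b", MOD_B)):
            with open(os.path.join(tmp, name + ".py"), "w") as fh:
                fh.write(text)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            importlib.import_module("circ_demo_a")
            return None
        except AttributeError as exc:
            return type(exc).__name__
        finally:
            # leave the interpreter as we found it
            sys.path.remove(tmp)
            for name in ("circ_demo_a", "circ_demo_b"):
                sys.modules.pop(name, None)

error_name = try_circular_import()
print(error_name)
```

Moving the use of circ_demo_a.value_a into a function that is called later, or moving the import below the definition, avoids the error because the attribute exists by the time it is read.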
