Python 3 + itchat Crawler in Practice

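This post records how to use Python with itchat to fetch your WeChat friends' information, then draw a pie chart of their genders and a word cloud of their signatures. The following modules are involved:

itchat: an open-source interface for personal WeChat accounts; it can send and receive messages, fetch the friend list, and more.

jieba: a Chinese word-segmentation component for Python, used when building the word cloud.

matplotlib: a Python plotting library.

wordcloud: used to generate the word cloud.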

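How do you download them?

How do you install them??

Where are the details???

Click any of the module names above to find out~ (All four are also published on PyPI under the same names.)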

OK! Let's get started.

Environment: Python 3 + Windows 10

Step 1: Log in to WeChat from Python and fetch all friends' information

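import itchat

def my_friends():
    # Log in by scanning a QR code
    itchat.auto_login()
    # Fetch the friend list; update=True refreshes it instead of using the local copy
    friends = itchat.get_friends(update=True)
    return friends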

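Running this function pops up a QR code on screen; scan it with WeChat on your phone to complete the login. Meanwhile the terminal prints the following: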

    Getting uuid of QR code.
    Downloading QR code.
    Please scan the QR code to log in.
    Please press confirm on your phone.
    Loading the contact, this may take a little while.
    Login successfully as (your nickname)

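itchat's get_friends method returns the information of all friends. Note that the returned friends is a list whose elements are dictionaries, and element 0 of the list is yourself, which matters in the data processing later. That completes step 1.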
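To make the shape of that return value concrete, here is a minimal sketch that uses a hand-written stand-in for the list. The `sample_friends` data is invented for illustration; real records carry many more keys.

```python
# Hypothetical stand-in for the list returned by itchat.get_friends();
# the names and values are invented for illustration only.
sample_friends = [
    {"NickName": "me", "Sex": 1, "Signature": "hello"},   # index 0: the logged-in account
    {"NickName": "friend_a", "Sex": 2, "Signature": ""},
    {"NickName": "friend_b", "Sex": 0, "Signature": "hi"},
]

me = sample_friends[0]        # your own record
others = sample_friends[1:]   # everyone else

print(me["NickName"])                     # me
print([f["NickName"] for f in others])    # ['friend_a', 'friend_b']
```

Slicing with `[1:]` is the same trick the gender statistics function uses to leave yourself out of the count.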

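Step 2: Extract the data

In step 1 all the friend data was put into the friends list; next we iterate over the list and pull out what we need.

1. Friend gender statistics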

def my_friends_sex(friends):
    # Dictionary that stores the gender counts
    friends_sex = dict()
    # Keys of the dictionary: male, female, other
    male = "男性"
    female = "女性"
    other = "其他"

    # Walk through every friend; friends[0] is yourself, so start at index 1
    for i in friends[1:]:
        sex = i["Sex"]
        if sex == 1:
            # Dictionary operation: look up the key and add 1 to its value
            friends_sex[male] = friends_sex.get(male, 0) + 1
        elif sex == 2:
            friends_sex[female] = friends_sex.get(female, 0) + 1
        elif sex == 0:
            friends_sex[other] = friends_sex.get(other, 0) + 1
    # Print the gender dictionary
    # print(friends_sex)
    # Total number of friends, excluding yourself
    total = len(friends[1:])
    if total == 0:
        # Nothing to report; also avoids dividing by zero below
        return friends_sex

    # get() makes a category with no friends count as 0 instead of raising KeyError
    proportion = [friends_sex.get(k, 0) / total * 100 for k in (male, female, other)]
    print(
        "男性好友：%.2f%% " % proportion[0] + '\n' +
        "女性好友：%.2f%% " % proportion[1] + '\n' +
        "其他：%.2f%% " % proportion[2]
    )
    return friends_sex

Er, the comments are detailed enough, right? Mostly because I'm afraid I'll forget it all in a couple of days...

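While iterating over the friends list, this function reads the Sex key of each element, simply because Sex holds the gender. A few other commonly used keys:

       'NickName'      friend's nickname
       'RemarkName'    remark name (the alias you set)
       'Signature'     signature
       'Province'      province
       'City'          city
       'Sex'           gender: 1 male, 2 female, 0 other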

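The returned friends_sex is a dictionary with up to three keys, "男性" (male), "女性" (female), and "其他" (other), each mapped to a head count. Since the goal is a chart of friends' genders, the count for each gender is exactly what we need.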
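As an alternative to the `get(key, 0) + 1` pattern, the same tally can be sketched with `collections.Counter` from the standard library. The helper names `count_sex` and `percentages` and the sample data are assumptions made for this illustration, not part of the original code.

```python
from collections import Counter

# Map the numeric Sex code to the same labels the article uses.
SEX_LABELS = {1: "男性", 2: "女性", 0: "其他"}

def count_sex(friends):
    """Tally friends by sex, skipping index 0 (the logged-in account)."""
    counts = Counter(SEX_LABELS.get(f.get("Sex", 0), "其他") for f in friends[1:])
    # Make sure all three labels exist, even when a count is zero.
    return {label: counts.get(label, 0) for label in SEX_LABELS.values()}

def percentages(counts):
    """Convert counts to percentages; returns zeros when there are no friends."""
    total = sum(counts.values())
    return {k: (v / total * 100 if total else 0.0) for k, v in counts.items()}

# Invented sample data: yourself first, then two men and one woman.
sample = [{"Sex": 1}, {"Sex": 1}, {"Sex": 1}, {"Sex": 2}]
counts = count_sex(sample)
print(counts)                # {'男性': 2, '女性': 1, '其他': 0}
print(percentages(counts))
```

Filling in the zero counts up front means later code can index every label directly without guarding against a missing key.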

2. Fetching friends' signatures

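import io
import re

import jieba

def my_friends_style(friends):
    # List that collects the signatures
    style = []
    # Regex that removes emoji codes (1f...) and leftover markup symbols
    rep = re.compile(r'1f\d+\w*|[<>/=]')
    for friend in friends:
        # Each friend is a dict; the signature is stored under the Signature key.
        # strip() trims whitespace at both ends; replace() drops the tag words
        # left over from emoji markup (span, class, emoji)
        signature = friend['Signature'].strip().replace('span', '').replace('class', '').replace('emoji', '')
        # rep.sub works like str.replace, but with a pattern
        signature = rep.sub('', signature)
        # Add it to the list
        style.append(signature)
    # join() concatenates the elements of a sequence into one new string;
    # here all the cleaned signatures are glued together
    text = ''.join(style)
    # Segment with jieba and save the result to a file
    # (mode 'a' appends, so delete the file before re-running)
    with io.open(r'F:\python_实战\itchat\微信好友个性签名词云\text.txt', 'a', encoding='utf-8') as f:
        wordlist = jieba.cut(text, cut_all=False)
        word_space_split = ' '.join(wordlist)
        f.write(word_space_split)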

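Processing signatures is a bit more involved than counting genders. Signatures tend to be quirky and often contain emoji or special symbols, so after reading Signature we trim the whitespace at both ends with strip, remove the special symbols with a regular expression, segment the text with jieba, and finally write the result to a file for the word cloud step.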
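The cleaning steps can be tried on a single string without logging in. The sketch below applies the same strip, replace, and regex steps to one invented signature; the helper names and the sample are assumptions, and `strip_emoji_tags` is an alternative idea rather than the approach used in this post.

```python
import re

# Same cleaning steps as my_friends_style, applied to one signature.
TAG_WORDS = ("span", "class", "emoji")
SYMBOLS = re.compile(r"1f\d+\w*|[<>/=]")

def clean_signature(raw):
    text = raw.strip()
    for word in TAG_WORDS:
        text = text.replace(word, "")
    return SYMBOLS.sub("", text)

# Alternative (an assumption, not the original approach): drop the whole tag at once.
SPAN_TAG = re.compile(r"<span[^>]*>.*?</span>")

def strip_emoji_tags(raw):
    return SPAN_TAG.sub("", raw).strip()

# Invented signature containing one emoji tag.
sample = '  加油<span class="emoji emoji1f60a"></span>  '
print(repr(clean_signature(sample)))    # '加油 " "'
print(repr(strip_emoji_tags(sample)))   # '加油'
```

The first version leaves the quote characters behind because the character class only covers `<>/=`; removing the whole tag sidesteps that.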

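In jieba, cut_all=False selects accurate mode. If you set it to True (full mode), jieba emits every word it can find, including overlapping ones, and the word cloud gets very... well, try it and see.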

Step 3: Draw the charts

1. Friend gender pie chart

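import matplotlib.pyplot as plt

def drow_sex(friends_sex):
    # Collect the labels and sizes for the pie chart
    labels = []
    sizes = []
    for key in friends_sex:
        labels.append(key)
        sizes.append(friends_sex[key])
    # Wedge colors; they are cycled if there are fewer colors than wedges
    colors = ['red', 'yellow', 'blue']
    # Offset of each wedge from the center; must have the same length as sizes
    explode = [0.1 if n == 0 else 0 for n in range(len(sizes))]
    # autopct='%1.2f%%' shows percentages with two decimals; shadow=True adds a shadow for depth
    # startangle is the starting angle, 0 degrees by default; 90 usually looks better
    plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.2f%%', shadow=True, startangle=90)
    # Equal scaling on the x and y axes so the pie is a circle
    plt.axis('equal')
    # Show which color belongs to which label
    plt.legend()
    # Add a title; garbled Chinese text is a pitfall, but there is a fix (see below)
    plt.suptitle("微信好友性别统计图")
    # Save to disk; show() leaves a blank figure afterwards, so save before calling it
    plt.savefig(r'F:\python_实战\itchat\好友性别饼状图.png')
    plt.show()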

It's all plain matplotlib usage, nothing much to add.

If the Chinese title shows up garbled, add this at the start of the program:
import matplotlib as mpl
mpl.rcParams['font.sans-serif'] = ['SimHei']
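One detail of the pie chart is worth a sketch: `plt.pie` requires `explode` to have the same length as `sizes`, so a fixed three-element tuple fails when one gender has no entry in the dictionary. The helper below, `pie_inputs`, is a hypothetical name introduced here; it only prepares the arguments and does not draw anything.

```python
def pie_inputs(friends_sex, highlight_first=0.1):
    """Build labels, sizes and a matching explode tuple from the counts dict."""
    # Skip empty categories so no zero-width wedge (and stray label) is drawn.
    items = [(k, v) for k, v in friends_sex.items() if v > 0]
    labels = [k for k, _ in items]
    sizes = [v for _, v in items]
    # Offset only the first wedge; the tuple always matches len(sizes).
    explode = tuple(highlight_first if i == 0 else 0 for i in range(len(sizes)))
    return labels, sizes, explode

# Invented counts with only two categories present.
labels, sizes, explode = pie_inputs({"男性": 12, "女性": 8})
print(labels, sizes, explode)   # ['男性', '女性'] [12, 8] (0.1, 0)
```

The three return values can be passed straight to `plt.pie(sizes, explode=explode, labels=labels, ...)`.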

2. Friend signature word cloud

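import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator

def wordart():
    # Load the mask image as an array
    back_color = np.array(Image.open(r'F:\python_实战\itchat\微信好友个性签名词云\猫咪.png'))
    wc = WordCloud(background_color='white',    # background color
                   max_words=1000,
                   mask=back_color,     # the word cloud is drawn inside this mask
                   max_font_size=100,
                   font_path="C:/Windows/Fonts/STFANGSO.ttf", # font with Chinese glyphs, otherwise the text is garbled
                   random_state=42, # fixed seed so the layout is reproducible
            )

    # Read the segmented text
    with open(r"F:\python_实战\itchat\微信好友个性签名词云\text.txt", encoding='utf-8') as f:
        text = f.read()
    # Build the word cloud
    wc.generate(text)
    # Color generator based on the mask image; to apply it,
    # call wc.recolor(color_func=image_colors)
    image_colors = ImageColorGenerator(back_color)
    # Show the image
    plt.imshow(wc)
    # Hide the axes
    plt.axis("off")
    # Save the image
    wc.to_file(r"F:\python_实战\itchat\微信好友个性签名词云\词云.png")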

Done~~~

Python basics refresher:

1. Dictionary operations

Example
    b={'A':1,'B':2,'C':3,'D':4}
    b['A']
    Out[28]: 1
    b['D']
    Out[29]: 4

2. The dictionary get method

Syntax of get():
dict.get(key, default=None)
Parameters
key -- the key to look up in the dictionary.
default -- the value returned when the key does not exist.
Example
info = {'Name': 'Zara', 'Age': 27}
print("Value : %s" % info.get('Age'))
print("Value : %s" % info.get('Sex', "Never"))
Output:
Value : 27
Value : Never
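To show why get() matters for counting, here is a small sketch contrasting it with plain indexing. The variable names and sample string are invented for the example.

```python
tally = {}

# Indexing a missing key raises KeyError, which get() avoids.
try:
    tally["a"]
    missing = None
except KeyError as exc:
    missing = exc.args[0]
print(missing)            # a

# Without a default, get() returns None for a missing key.
print(tally.get("a"))     # None

# The counting idiom used in my_friends_sex: start from 0, then add 1.
for letter in "abca":
    tally[letter] = tally.get(letter, 0) + 1
print(tally)              # {'a': 2, 'b': 1, 'c': 1}
```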

3. Writing a list's contents straight to a file

with open(r'F:\python_实战\itchat\friends.txt', 'a+', encoding='utf-8') as f:
    for i in range(len(friends)):
        f.write(str(friends[i]))
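Writing `str()` of each record with no separator makes the file hard to read back. A possible variation, sketched here with invented sample data and a temporary directory in place of the real folder, writes one JSON object per line. Real itchat records may contain values that need converting before `json.dumps` accepts them; that is not handled here.

```python
import json
import os
import tempfile

# Invented sample data standing in for the friends list.
sample_friends = [{"NickName": "me", "Sex": 1}, {"NickName": "朋友", "Sex": 2}]

# A temporary directory stands in for the real output folder.
out_path = os.path.join(tempfile.mkdtemp(), "friends.txt")

with open(out_path, "w", encoding="utf-8") as f:
    for friend in sample_friends:
        # ensure_ascii=False keeps Chinese text readable in the file.
        f.write(json.dumps(friend, ensure_ascii=False) + "\n")

# Read the file back, one record per line.
with open(out_path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded == sample_friends)   # True
```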

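4. The strip() method

Removes the specified characters from both ends of a string; by default it removes whitespace. It returns a new string and leaves the original unchanged.
    a = "assdgheas"
    b = a.strip('as')
    print(b)
    Output: dghe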
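A point that is easy to miss: the argument to strip() is a set of characters, not a prefix or suffix. A short sketch with made-up strings:

```python
url = "www.example.com"

# Every leading/trailing character found in "w.moc" is removed,
# stopping at the first character that is not in the set.
core = url.strip("w.moc")
print(core)                     # example

# The original string is untouched; strings are immutable.
print(url)                      # www.example.com

# lstrip / rstrip work on one end only.
print("xxhixx".lstrip("x"))     # hixx
print("xxhixx".rstrip("x"))     # xxhi

# With no argument, all surrounding whitespace goes, including newlines.
print(repr("  hi \n".strip()))  # 'hi'
```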