中央情报局关键词提取——Unicode码

Dataset

本文的任务是学习计算机在内存中如何存储一个值。本文的数据集sentences_cia.csv是中央情报局备忘录的一个摘录,描述了酷刑和其他秘密活动的细节。数据格式如下:
year,statement,,,
1997,”The FBI information included that al-Mairi’s brother “”traveled to Afghanistan in 1997-1998 to train in Bin - Ladencamps.”“”,,,

  • 整个csv文件就是一个长的字符串,我们之前讨论了字符串的应用,但一直不知道它们是如何被存储在一台电脑中。文件是存储在硬盘上的,硬盘通常是磁存储,将数据存储到磁条中。磁条只能存储两个值:up和down也就是高电压低电压,代表了计算机的0,1值。因此在存储我们的字符串前,我们需要将其转换为二进制,然后才能存储到磁条中。

Intro To Binary

  • 下面这段代码,首先b是一个二进制数据的字符串形式,然后将其转换为整数型二进制数再打印,同理打印了一个100的二进制数。
# Let's say a is a binary number.  In python, we have to store binary numbers as strings
# Trying to say b = 10 directly will assume base 10, so strings are needed
b = "10"

# We can convert b to a binary number from a string using the int function -- the optional second argument base is set to 2 (binary is base two)
print(int(b, 2))
'''
2
'''
base_10_100 = int("100", 2)
'''
base_10_100 :4
'''
  • 此处打印的结果人就是显示的十进制

Binary Addition

  • 二进制相加和十进制类似
b = "1"

# We'll add binary values using a binary_add function that was made just for this exercise
# It's not extremely important to know how it works right this second
def binary_add(a, b):
    return bin(int(a, 2) + int(b, 2))[2:]

c = binary_add(b, "1")

# We now see that c equals "10", which is exactly what happens in base 10 when we reach the highest possible digit.
print(c)
''' 10 '''
  • bin这个函数导致打印出来的形式是二进制,且由于前面会添加0b,因此打印的时候从第3个位置开始打印。

Converting Binary Values

  • 现在知道了利用int()函数可以将数据转换为其它进制。
def binary_add(a, b):
    return bin(int(a, 2) + int(b, 2))[2:]

# Start both at 0
a = 0
b = "0"

# Loop 10 times
for i in range(0, 10):
    # Add 1 to each
    a += 1
    b = binary_add(b, "1")

    # Check if they are equal
    print(int(b, 2) == a)
''' True True True True True True True True True True '''
  • 上面这个例子表达了,无论是什么进制的数据,相加的量相同,他们不是一个进制但是任然数值是相等的。

Characters To Binary

字符串首先拆分成单个字符,然后存储为整型数据然后转换为二进制存储起来。

# We can use the ord() function to get the integer associated with an ascii character.
ord('a')

# Then we use the bin() function to convert to binary
# The bin function adds "0b" to the start of strings to indicate that they contain binary values bin(ord('a')) print(bin(ord('a'))) ''' 0b1100001 ''' # ÿ is the "last" ascii character -- it has the highest integer value of any ascii character # This is because 255 is the highest value that can be represented with 8 binary digits ord('ÿ') # As you can see, we get 8 1's, which shows that this is the highest possible 8 digit value bin(ord('ÿ')) print(bin(ord('ÿ'))) ''' 0b11111111 ''' # Why is this? It's because a single binary digit is called a bit, and computers store values in sequences of bytes, which are 8 bits together. # You might be more familiar with kilobytes or megabytes -- a kilobyte is 1000 bytes, and a megabyte is 1000 kilobytes. # There are 256 different ascii symbols, because the largest amount of storage any single ascii character can take up is one byte. binary_w = bin(ord("w")) ''' str (<class 'str'>) '0b1110111' '''
  • 上面这段代码显示:字符类型数据总共是8位,字符数据需要转换为对应的ascii码,然后再转化为二进制数据存储起来,8位的二进制最多能存储255个字符。通过ord()函数可以获取一个字符的ascii码,而通过bin函数将数据转化为二进制,并且在数据的前面添加了0b表明这是个二进制字符串。

Intro To Unicode

  • ASCII码

我们知道,在计算机内部,所有的信息最终都表示为一个二进制的字符串。每一个二进制位(bit)有0和1两种状态,因此八个二进制位就可以组合出256种状态,这被称为一个字节(byte)。也就是说,一个字节一共可以用来表示256种不同的状态,每一个状态对应一个符号,就是256个符号,从0000000到11111111。 上个世纪60年代,美国制定了一套字符编码,对英语字符与二进制位之间的关系,做了统一规定。这被称为ASCII码,一直沿用至今。 ASCII码一共规定了128个字符的编码,比如空格”SPACE”是32(二进制00100000),大写的字母A是65(二进制01000001)。这128个符号(包括32个不能打印出来的控制符号),只占用了一个字节的后面7位,最前面的1位统一规定为0。

  • Unicode

正如上一节所说,世界上存在着多种编码方式,同一个二进制数字可以被解释成不同的符号。因此,要想打开一个文本文件,就必须知道它的编码方式,否则用错误的编码方式解读,就会出现乱码。为什么电子邮件常常出现乱码?就是因为发信人和收信人使用的编码方式不一样。 可以想象,如果有一种编码,将世界上所有的符号都纳入其中。每一个符号都给予一个独一无二的编码,那么乱码问题就会消失。这就是Unicode,就像它的名字都表示的,这是一种所有符号的编码。 Unicode当然是一个很大的集合,现在的规模可以容纳100多万个符号。每个符号的编码都不一样,比如,U+0639表示阿拉伯字母Ain,U+0041表示英语的大写字母A,U+4E25表示汉字”严”。具体的符号对应表,可以查询unicode.org,或者专门的汉字对应表。

  • Unicode的问题

需要注意的是,Unicode只是一个符号集,它只规定了符号的二进制代码,却没有规定这个二进制代码应该如何存储。这里就有两个严重的问题,第一个问题是,如何才能区别Unicode和ASCII?计算机怎么知道三个字节表示一个符号,而不是分别表示三个符号呢?第二个问题是,我们已经知道,英文字母只用一个字节表示就够了,如果Unicode统一规定,每个符号用三个或四个字节表示,那么每个英文字母前都必然有二到三个字节是0,这对于存储来说是极大的浪费,文本文件的大小会因此大出二三倍,这是无法接受的。

  • UTF-8

UTF-8是Unicode的实现方式之一。UTF-8最大的一个特点,就是它是一种变长的编码方式。它可以使用1~4个字节表示一个符号,根据不同的符号而变化字节长度。UTF-8的编码规则很简单,只有二条:

  • 对于单字节的符号,字节的第一位设为0,后面7位为这个符号的unicode码。因此对于英语字母,UTF-8编码和ASCII码是相同的。
  • 对于n字节的符号(n>1),第一个字节的前n位都设为1,第n+1位设为0,后面字节的前两位一律设为10。剩下的没有提及的二进制位,全部为这个符号的unicode码。

中央情报局关键词提取——Unicode码_第1张图片
跟据上表,解读UTF-8编码非常简单。如果一个字节的第一位是0,则这个字节单独就是一个字符;如果第一位是1,则连续有多少个1,就表示当前字符占用多少个字节。
已知”严”的unicode是4E25(100111000100101),根据上表,可以发现4E25处在第三行的范围内(0000 0800-0000 FFFF),因此”严”的UTF-8编码需要三个字节,即格式是”1110xxxx 10xxxxxx 10xxxxxx”。然后,从”严”的最后一个二进制位开始,依次从后向前填入格式中的x,多出的位补0。这样就得到了,”严”的UTF-8编码是”11100100 10111000 10100101”,转换成十六进制就是E4B8A5。可以看到”严”的Unicode码是4E25,UTF-8编码是E4B8A5,两者是不一样的。

# We can initialize unicode code points (the value for this code point is \u27F6, but you see it as a character because it is being automatically converted)
code_point = "→"

# This particular code point maps to a right arrow character
print(code_point)

# We can get the base 10 integer value of the code point with the ord function
print(ord(code_point))

# As you can see, this takes up a lot more than 1 byte
print(bin(ord(code_point)))
''' → 10230 0b10011111110110 '''

Strings With Unicode

由于ascii 是Unicode的子集,因此在python3中,默认所有的字符串都是用Unicode,并且用utf-8编码。所以我们可以直接使用Unicode的码点和字符。

s1 = "café"
# The \u prefix means "the next 4 digits are a unicode code point"
# It doesn't change the value at all (the last character in the string below is \u00e9)
s2 = "café"

# These strings are the same, because code points are equal to their corresponding unicode character.
# \u00e9 and é are equivalent.
print(s1 == s2)
'''
True
'''

The Bytes Type

  • encode(“utf-8”)可以对字符串进行编码将其转换为bytes型数据。
# We can make a string with some unicode values
superman = "Clark Kent□"
# This tells python to encode the string superman into unicode using the utf-8 encoding
# We end up with a sequence of bytes instead of a string
superman_bytes = "Clark Kent␦".encode("utf-8") print(superman_bytes)
'''
b'Clark Kent\xe2\x90\xa6'
'''

batman = "Bruce Wayne□"
batman_bytes = batman.encode("utf-8") print(batman_bytes)
'''
bytes (<class 'bytes'>)
b'Bruce Wayne\xe2\x90\xa6'
'''

Hexadecimal Intro

\u是unicode码的前缀,说明这代表一个unicode码。\x是十六进制的前缀,代表后面两个数字是16进制的。两个十六进制数等于8个二进制数。

# F is the highest single digit in hexadecimal (base 16)
# Its value is 15 in base 10
print(int("F", 16))

# A in base 16 has the value 10 in base 10
print(int("A", 16))

# Just like the earlier binary_add function, this adds two hex numbers
def hexadecimal_add(a, b):
    return hex(int(a, 16) + int(b, 16))[2:]

# When we add 1 to 9 in hexadecimal, it becomes "a"
value = "9"
value = hexadecimal_add(value, "1")
print(value)
hex_ea = hexadecimal_add("2", "ea")
''' hex_ea :str (<class 'str'>) 'ec' '''
hex_ef = hexadecimal_add("e", "f")
''' hex_ef :str (<class 'str'>) '1d' '''
''' 15 10 a '''

Hex To Binary

# One byte (8 bits) in hexadecimal (the value of the byte below is \xe2)
hex_byte = "â"

# Print the base 10 integer value for the hex byte
print(ord(hex_byte))

# This gives the exact same value -- remember than \x is just a prefix, and doesn't affect the value
print(int("e2", 16))

# Convert the base 10 integer to binary
print(bin(ord("â")))
binary_aa = bin(ord("ª"))
''' str (<class 'str'>) '0b10101010' '''
binary_ab = bin(ord("\xab"))
''' str (<class 'str'>) '0b10101011' '''
''' 226 226 0b11100010 '''

Bytes And Strings

  • bytes和strings对象不可以混在一起,用encode(“utf-8”)会将一个strings对象转换为一个bytes对象,然后不能在其中插入strings对象,但是可以插入形如b”“这样的bytes对象
hulk_bytes = "Bruce Banner␦".encode("utf-8")
print(type(hulk_bytes))
# We can't mix strings and bytes
# For instance, if we try to replace the unicode □ character as a string, it won't work, because that value has been encoded to bytes
try:
    hulk_bytes.replace("Banner", "")
except Exception:
    print("TypeError with replacement")

# We can create objects of the bytes datatype by putting a b in front of the quotation marks in a string
hulk_bytes = b"Bruce Banner"
# Now, instead of mixing strings and bytes, we can use the replace method with bytes objects instead
hulk_bytes.replace(b"Banner", b"")
thor_bytes = b"Thor"
'''
<class 'bytes'>
TypeError with replacement
'''

Decode Bytes To Strings

  • decode(“utf-8”)可以将bytes对象解码为strings对象。
# Make a bytes object with aquaman's secret identity
aquaman_bytes = b"Who knows?" # Now, we can use the decode method, along with the encoding (utf-8) to turn it into a string.
aquaman = aquaman_bytes.decode("utf-8") # We can print the value and type out to verify that it is a string.
print(aquaman)
print(type(aquaman))
'''
Who knows?
<class 'str'>
'''

Read In File Data

  • 到目前为止,对unicode有一定的了解了,现在开始处理数据了。
  • sentences_cia.csv”文件的第一行是列标签:[‘year’, ‘statement’, ”, ”, ”]
  • 第二行:[‘1997’, ‘The FBI information included that al-Mairi\’s brother “traveled to Afghanistan in 1997-1998 to train in Bin - Ladencamps.”’, ”, ”, ”]
# We can read our data in using csvreader
import csv
# When we open a file, we can specify the encoding that it's in. In this case, utf-8
f = open("sentences_cia.csv", 'r', encoding="utf-8")
csvreader = csv.reader(f)
sentences_cia = list(csvreader)

# The data is two columns
# First column is year, second is a sentence from a CIA report written that year
# Print the first column of the second row
print(sentences_cia[1][0])

# Print the second column of the second row
print(sentences_cia[1][1])
'''
1997
The FBI information included that al-Mairi's brother "traveled to Afghanistan in 1997-1998 to train in Bin - Ladencamps."
'''

Convert To A Dataframe

  • 将sentences_cia 转换为DataFrame对象,并且将legislators也同样处理。
import csv
# Let's read in the legislators data from a few missions ago
f = open("legislators.csv", 'r', encoding="utf-8")
csvreader = csv.reader(f)
legislators = list(csvreader)

# Now, we can import pandas and use the DataFrame class to convert the list of lists to a dataframe
import pandas as pd

legislators_df = pd.DataFrame(legislators)

# As you can see, the first row is the headers, which we don't want (it's not actually data, it's just headers)
print(legislators_df.iloc[0,:])

# In order to remove the headers, we'll subset the df and pass them in separately
# This code removes the headers from legislators, and instead passes them into the columns argument
# The columns argument specifies column names
legislators_df = pd.DataFrame(legislators[1:], columns=legislators[0])
# We now have the right data in the first row, and the proper headers
print(legislators_df.iloc[0,:])

# The sentences_cia data from last screen is available.
sentences_cia_df = pd.DataFrame(sentences_cia[1:], columns=sentences_cia[0])
'''
0     last_name
1    first_name
2      birthday
3        gender
4          type
5         state
6         party
Name: 0, dtype: object
last_name                 Bassett
first_name                Richard
birthday               1745-04-02
gender                          M
type                          sen
state                          DE
party         Anti-Administration
Name: 0, dtype: object
'''

Clean Up Sentences

  • “statement”列是一个陈述句,必须对其进行处理然后才能进行分析。首先需要将strings中无关的符号剔除,我们只关心单词,数字和空格。所以我们先利用ord()查阅以上每个字符的整型码。good_characters列出了那些有用的码,保留这些字符,然后通过空格将其连接起来。
def clean_statement(row):
    # The integer codes for all the characters we want to keep
    good_characters = [48, 49, 50, 51, 52, 53, 54, 55, 56, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 32]
    statement = row["statement"]
    clean_statement_list = [s for s in statement if ord(s) in good_characters]
    # Join the list together, separated by "" (no space), which creates a string again
    return "".join(clean_statement_list)

sentences_cia["cleaned_statement"] = sentences_cia.apply(clean_statement, axis=1)

Tokenize Statements

  • 剔除掉无关单词后,需要将拆分单词,然后计算每个单词的频率,所以首先将所有的statement通过join函数连接在一起,在通过空格将其拆分为一个字符数组:
# We can use the .join() method on strings to join lists together.
# The string we use the method on will be used as the separator -- the character(s) between each string when they are joined.
combined_statements = " ".join(sentences_cia["cleaned_statement"])
statement_tokens = combined_statements.split(" ")
''' list (<class 'list'>) ['The', 'FBI', 'information', 'included', 'that', 'alMairis', 'brother', 'traveled', ... '''

Filter The Tokens

  • 现在得到的是一个词袋形式,需要找到里面有用的单词的个数,但是在英语中最常见的单词是一些连接词,比如that or and等等,这些单词被称为停顿词,因此需要将这些单词过滤掉。但是为了简单起见,我们这里只是简单的将单词长度小于5的过滤掉:
# statement_tokens has been loaded in.
filtered_tokens = [s for s in statement_tokens if len(s) > 4]

Count The Tokens

  • 现在引入一个新的包collections,这个包里面有一个Counter函数,返回一个字典格式,键值为每个元素的频数。
from collections import Counter

# filtered_tokens has been loaded in filtered_token_counts = Counter(filtered_tokens)
'''
Counter({'interrogation': 391, 'REDACTED': 375, 'information': 375, 'Zubaydah': 328, 'Committee': 327, ...
'''
  • 然后计算其中最常见的3个单词:
common_tokens = filtered_token_counts.most_common(3)
'''
[('interrogation', 391), ('REDACTED', 375), ('information', 375)]
'''

Finding The Most Common Tokens By Year

# sentences_cia has been loaded in. # It already has the cleaned_statement column.
from collections import Counter
def find_most_common_by_year(year, sentences_cia): data = sentences_cia[sentences_cia["year"] == year] combined_statement = " ".join(data["cleaned_statement"]) statement_split = combined_statement.split(" ") counter = Counter([s for s in statement_split if len(s) > 4])
 return counter.most_common(2)

common_2000 = find_most_common_by_year("2000", sentences_cia)
'''
[('terrorist', 9), ('Ahmad', 9)]
'''
common_2002 = find_most_common_by_year("2002", sentences_cia)
'''
[('interrogation', 275), ('Zubaydah', 252)]
'''
common_2013 = find_most_common_by_year("2013", sentences_cia)
'''
[('Response', 196), ('states', 111)]
'''

你可能感兴趣的:(中央情报局关键词提取——Unicode码)