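The goal of this post is to learn how a computer stores a value in memory. The dataset, sentences_cia.csv, is an excerpt from a CIA memo that describes torture and other covert activities. The data looks like this: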
year,statement,,,
1997,"The FBI information included that al-Mairi's brother ""traveled to Afghanistan in 1997-1998 to train in Bin - Ladencamps.""",,,
# Let's say b is a binary number. In python, we store binary numbers as strings here
# Trying to say b = 10 directly will assume base 10, so strings are needed
b = "10"
# We can convert b to a binary number from a string using the int function -- the optional second argument base is set to 2 (binary is base two)
print(int(b, 2))
'''
2
'''
base_10_100 = int("100", 2)
'''
base_10_100 :4
'''
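To make the conversion less of a black box, here is a minimal sketch of what `int(b, 2)` computes, done by hand with positional weights. The helper name `binary_to_int` is hypothetical and only used for illustration.

```python
def binary_to_int(bits):
    """Convert a string of 0s and 1s to an integer (hypothetical helper)."""
    total = 0
    for digit in bits:
        # Moving one position to the left doubles the running total,
        # then the new digit (0 or 1) is added.
        total = total * 2 + int(digit)
    return total

print(binary_to_int("10"))                    # 2
print(binary_to_int("100"))                   # 4
print(binary_to_int("100") == int("100", 2))  # True
```

Each step doubles the total because every binary position is worth twice the position to its right.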
b = "1"
# We'll add binary values using a binary_add function that was made just for this exercise
# It's not extremely important to know how it works right this second
def binary_add(a, b):
    return bin(int(a, 2) + int(b, 2))[2:]
c = binary_add(b, "1")
# We now see that c equals "10", which is exactly what happens in base 10 when we reach the highest possible digit.
print(c)
''' 10 '''
def binary_add(a, b):
    return bin(int(a, 2) + int(b, 2))[2:]
# Start both at 0
a = 0
b = "0"
# Loop 10 times
for i in range(0, 10):
    # Add 1 to each
    a += 1
    b = binary_add(b, "1")
    # Check if they are equal
    print(int(b, 2) == a)
''' True True True True True True True True True True '''
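The `binary_add` function above converts to integers and back. As a rough sketch of what happens digit by digit, the addition can also be done with an explicit carry, the way it is done on paper. The name `binary_add_by_hand` is hypothetical.

```python
def binary_add_by_hand(a, b):
    """Add two binary strings column by column (hypothetical helper)."""
    # Pad the shorter string with leading zeros so both have equal length
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    carry = 0
    digits = []
    # Walk from the rightmost (least significant) digit to the left
    for x, y in zip(reversed(a), reversed(b)):
        total = int(x) + int(y) + carry
        digits.append(str(total % 2))  # digit that stays in this column
        carry = total // 2             # 1 if the column overflowed
    if carry:
        digits.append("1")
    return "".join(reversed(digits))

print(binary_add_by_hand("1", "1"))       # 10
print(binary_add_by_hand("1011", "110"))  # 10001
```

The carry plays the same role as in base 10: when a column exceeds the highest digit, one unit moves to the next column.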
A string is first split into individual characters. Each character is mapped to an integer, and that integer is then stored in binary.
# We can use the ord() function to get the integer associated with an ascii character.
ord('a')
# Then we use the bin() function to convert to binary
# The bin function adds "0b" to the start of strings to indicate that they contain binary values
bin(ord('a'))
print(bin(ord('a')))
''' 0b1100001 '''
# ÿ has the integer value 255, the highest value that fits in a single byte
# (strictly speaking, ASCII only defines the values 0-127; 128-255 belong to 8-bit extensions such as Latin-1)
# 255 is the highest value that can be represented with 8 binary digits
ord('ÿ')
# As you can see, we get 8 1's, which shows that this is the highest possible 8 digit value
bin(ord('ÿ'))
print(bin(ord('ÿ')))
''' 0b11111111 '''
# Why is this? A single binary digit is called a bit, and computers store values in sequences of bytes, which are 8 bits each.
# You might be more familiar with kilobytes or megabytes -- a kilobyte is 1000 bytes, and a megabyte is 1000 kilobytes.
# One byte can take 256 different values, so a single-byte character encoding can hold at most 256 symbols.
binary_w = bin(ord("w"))
''' str (<class 'str'>) '0b1110111' '''
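As we know, inside a computer all information is ultimately represented as a string of binary digits. Each binary digit (bit) has two states, 0 and 1, so eight bits can be combined into 256 states. A group of eight bits is called a byte. In other words, one byte can represent 256 different states, each corresponding to one symbol, for 256 symbols in total, from 00000000 to 11111111.

In the 1960s, the United States defined a character encoding that standardized the relationship between English characters and binary digits. It is called ASCII and is still in use today.

ASCII specifies encodings for 128 characters. For example, the space character "SPACE" is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 33 non-printable control characters: codes 0-31 and 127) use only the lower 7 bits of a byte; the leading bit is always 0.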
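The ASCII values quoted above can be checked directly. This is a small sketch; the helper name `byte_bits` is hypothetical.

```python
def byte_bits(ch):
    """Return the code of a character as a full 8-bit string (hypothetical helper)."""
    # format(..., "08b") pads the binary form with leading zeros to 8 digits
    return format(ord(ch), "08b")

for ch in [" ", "A"]:
    print(repr(ch), ord(ch), byte_bits(ch))

# Every ASCII code is below 128, so the leading bit of its byte is always 0
print(all(format(code, "08b")[0] == "0" for code in range(128)))  # True
```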
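Many different encodings exist in the world, and the same binary number can be interpreted as different symbols. To open a text file you therefore have to know its encoding; reading it with the wrong encoding produces garbled text. Why do emails so often arrive garbled? Because the sender and the recipient use different encodings.

Now imagine a single encoding that includes every symbol in the world, each with its own unique code. The garbled-text problem would disappear. That is Unicode: as its name suggests, it is an encoding for all symbols.

Unicode is of course a very large set, with room for more than one million symbols. Every symbol has a different code: for example, U+0639 is the Arabic letter Ain, U+0041 is the English uppercase letter A, and U+4E25 is the Chinese character "严". The full symbol tables can be looked up at unicode.org, or in dedicated tables for Chinese characters.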
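Both points can be tried out in Python: the code points quoted above, and the garbling that appears when bytes are read with the wrong encoding. The sketch below assumes a sender that writes UTF-8 and a receiver that wrongly reads Latin-1; that pairing is only an illustration.

```python
import unicodedata

# The code points mentioned in the text
print(hex(ord("严")))               # 0x4e25
print(ord("A") == 0x0041)           # True
print(unicodedata.name("\u0639"))   # ARABIC LETTER AIN

# Simulate garbled text: bytes written as UTF-8, read as Latin-1
text = "严"
raw = text.encode("utf-8")
garbled = raw.decode("latin-1")
print(garbled == text)              # False

# Reversing the wrong step recovers the original text
restored = garbled.encode("latin-1").decode("utf-8")
print(restored == text)             # True
```

The bytes never changed; only the interpretation did, which is why knowing the encoding matters.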
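Note that Unicode is only a character set: it specifies the binary code of each symbol, but not how that code should be stored. This raises two serious problems. First, how can Unicode be distinguished from ASCII? How does the computer know that three bytes represent one symbol rather than three separate symbols? Second, as we have seen, one byte is enough for an English letter. If Unicode required every symbol to use three or four bytes, every English letter would be preceded by two or three zero bytes. That is a huge waste of storage: text files would become two to three times larger, which is unacceptable.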
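The storage cost of a fixed-width scheme is easy to measure. The sketch below uses UTF-32 as a stand-in for "every symbol takes four bytes" and compares it with UTF-8 on plain English text.

```python
english = "hello world"

fixed_width = english.encode("utf-32-be")  # always 4 bytes per character
variable_width = english.encode("utf-8")   # 1 byte per ASCII character

print(len(english))         # 11 characters
print(len(fixed_width))     # 44 bytes
print(len(variable_width))  # 11 bytes
# The first character: three zero bytes followed by the byte for "h"
print(fixed_width[:4])      # b'\x00\x00\x00h'
```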
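UTF-8 is one of the implementations of Unicode. Its defining feature is that it is a variable-length encoding: it uses 1 to 4 bytes per symbol, depending on the symbol. The encoding rules of UTF-8 are simple, and there are only two:
- For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code point. For English letters, UTF-8 is therefore identical to ASCII.
- For a symbol of n bytes (n > 1), the first n bits of the first byte are set to 1, bit n+1 is set to 0, and the first two bits of every following byte are always 10. All remaining bits hold the symbol's Unicode code point.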
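The table below summarizes the rules, where x marks the bits available for the code point.

Unicode code point range (hex) | UTF-8 byte layout (binary)
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

With this table, reading UTF-8 is straightforward. If the first bit of a byte is 0, that byte is a character on its own. If the first bit is 1, the number of consecutive 1s tells you how many bytes the current character occupies.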
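The decoding rule (count the leading 1s of the first byte) can be sketched as a small function. The name `utf8_sequence_length` is hypothetical, and the sketch only inspects the first byte; it does not validate the continuation bytes.

```python
def utf8_sequence_length(first_byte):
    """Return how many bytes a UTF-8 character occupies, given its first byte."""
    if (first_byte & 0b10000000) == 0:
        return 1  # 0xxxxxxx: a single-byte character
    if (first_byte & 0b11100000) == 0b11000000:
        return 2  # 110xxxxx
    if (first_byte & 0b11110000) == 0b11100000:
        return 3  # 1110xxxx
    if (first_byte & 0b11111000) == 0b11110000:
        return 4  # 11110xxx
    raise ValueError("not a valid first byte (continuation bytes start with 10)")

for ch in ["A", "é", "严", "😀"]:
    encoded = ch.encode("utf-8")
    # The length predicted from the first byte matches the real length
    print(ch, len(encoded), utf8_sequence_length(encoded[0]))
```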
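The Unicode code point of "严" is 4E25 (100111000100101). From the table above, 4E25 falls in the range of the third row (0000 0800-0000 FFFF), so the UTF-8 encoding of "严" needs three bytes, in the format "1110xxxx 10xxxxxx 10xxxxxx". Then, starting from the last binary digit of "严", fill in the x positions from back to front and pad the remaining positions with 0. The resulting UTF-8 encoding of "严" is "11100100 10111000 10100101", which is E4B8A5 in hexadecimal. Note that the Unicode code point of "严" is 4E25 while its UTF-8 encoding is E4B8A5; the two are different.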
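The worked example for "严" can be reproduced step by step. This is a minimal sketch that only covers the three-byte range; `encode_three_bytes` is a hypothetical name, and the surrogate range (U+D800 to U+DFFF), which a real encoder rejects, is not handled.

```python
def encode_three_bytes(code_point):
    """Encode a code point in U+0800..U+FFFF as three UTF-8 bytes (sketch only)."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only handles the three-byte range")
    # Pad with leading zeros to the 16 bits the template has room for
    bits = format(code_point, "016b")
    # Fill the template 1110xxxx 10xxxxxx 10xxxxxx with 4 + 6 + 6 bits
    first = "1110" + bits[:4]
    second = "10" + bits[4:10]
    third = "10" + bits[10:]
    return bytes(int(b, 2) for b in (first, second, third))

encoded = encode_three_bytes(0x4E25)
print(encoded.hex())                    # e4b8a5
print(encoded == "严".encode("utf-8"))  # True
```

Padding to 16 bits first is equivalent to filling the x positions from the right and setting the leftover positions to 0.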
# We can initialize unicode code points (the value for this code point is \u27F6, but you see it as a character because it is being automatically converted)
code_point = "⟶"
# This particular code point maps to a long right arrow character
print(code_point)
# We can get the base 10 integer value of the code point with the ord function
print(ord(code_point))
# As you can see, this takes up more than 1 byte
print(bin(ord(code_point)))
''' ⟶ 10230 0b10011111110110 '''
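ASCII is a subset of Unicode, and in Python 3 every string is Unicode by default, with UTF-8 as the default encoding for source files and for str.encode(). We can therefore use Unicode code points and characters directly.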
s1 = "café"
# The \u prefix means "the next 4 digits are a unicode code point"
# It doesn't change the value at all (the last character in the string below is \u00e9)
s2 = "caf\u00e9"
# These strings are the same, because code points are equal to their corresponding unicode character.
# \u00e9 and é are equivalent.
print(s1 == s2)
'''
True
'''
# We can make a string with some unicode values
superman = "Clark Kent␦"
# This tells python to encode the string superman into bytes using the utf-8 encoding
# We end up with a sequence of bytes instead of a string
superman_bytes = superman.encode("utf-8")
print(superman_bytes)
'''
b'Clark Kent\xe2\x90\xa6'
'''
batman = "Bruce Wayne␦"
batman_bytes = batman.encode("utf-8")
print(batman_bytes)
'''
bytes (<class 'bytes'>)
b'Bruce Wayne\xe2\x90\xa6'
'''
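\u is the prefix for a Unicode code point: it signals that what follows is a Unicode code. \x is the hexadecimal prefix: it signals that the next two digits are hexadecimal. Two hexadecimal digits correspond to eight binary digits, that is, one byte.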
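The relationship between the two prefixes can be seen on the three bytes printed for ␦ earlier. This short sketch shows each byte as two hexadecimal digits and as eight binary digits, then decodes the bytes back to the \u2426 code point.

```python
raw = b"\xe2\x90\xa6"

for byte in raw:
    # Iterating over a bytes object yields integers from 0 to 255
    print(format(byte, "02x"), format(byte, "08b"))

# The three bytes together are the UTF-8 encoding of one code point
print(raw.decode("utf-8") == "\u2426")  # True
# bytes.fromhex builds the same bytes from plain hexadecimal digits
print(bytes.fromhex("e290a6") == raw)   # True
```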
# F is the highest single digit in hexadecimal (base 16)
# Its value is 15 in base 10
print(int("F", 16))
# A in base 16 has the value 10 in base 10
print(int("A", 16))
# Just like the earlier binary_add function, this adds two hex numbers
def hexadecimal_add(a, b):
    return hex(int(a, 16) + int(b, 16))[2:]
# When we add 1 to 9 in hexadecimal, it becomes "a"
value = "9"
value = hexadecimal_add(value, "1")
print(value)
hex_ea = hexadecimal_add("2", "ea")
''' hex_ea :str (<class 'str'>) 'ec' '''
hex_ef = hexadecimal_add("e", "f")
''' hex_ef :str (<class 'str'>) '1d' '''
''' 15 10 a '''
# One byte (8 bits) in hexadecimal (the value of the byte below is \xe2)
hex_byte = "â"
# Print the base 10 integer value for the hex byte
print(ord(hex_byte))
# This gives the exact same value -- remember that \x is just a prefix, and doesn't affect the value
print(int("e2", 16))
# Convert the base 10 integer to binary
print(bin(ord("â")))
binary_aa = bin(ord("ª"))
''' str (<class 'str'>) '0b10101010' '''
binary_ab = bin(ord("\xab"))
''' str (<class 'str'>) '0b10101011' '''
''' 226 226 0b11100010 '''
hulk_bytes = "Bruce Banner␦".encode("utf-8")
print(type(hulk_bytes))
# We can't mix strings and bytes
# For instance, if we try to call replace with string arguments, it won't work, because hulk_bytes has been encoded to bytes
try:
    hulk_bytes.replace("Banner", "")
except TypeError:
    print("TypeError with replacement")
# We can create objects of the bytes datatype by putting a b in front of the quotation marks in a string
hulk_bytes = b"Bruce Banner"
# Now, instead of mixing strings and bytes, we can use the replace method with bytes objects instead
hulk_bytes.replace(b"Banner", b"")
thor_bytes = b"Thor"
'''
<class 'bytes'>
TypeError with replacement
'''
# Make a bytes object with aquaman's secret identity
aquaman_bytes = b"Who knows?"
# Now, we can use the decode method, along with the encoding (utf-8) to turn it into a string.
aquaman = aquaman_bytes.decode("utf-8")
# We can print the value and type out to verify that it is a string.
print(aquaman)
print(type(aquaman))
'''
Who knows?
<class 'str'>
'''
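As a small sketch of how encode and decode relate, the example below round-trips a string with a non-ASCII character and shows that the character count and the byte count differ. It also shows what happens when the bytes are decoded with an encoding that cannot represent them.

```python
name = "caf\u00e9"
name_bytes = name.encode("utf-8")

print(len(name))        # 4 characters
print(len(name_bytes))  # 5 bytes, because é needs two bytes in UTF-8
# Decoding with the same encoding gives back the original string
print(name_bytes.decode("utf-8") == name)  # True

# ASCII has no mapping for bytes above 127, so decoding fails
try:
    name_bytes.decode("ascii")
except UnicodeDecodeError as error:
    print("decode failed:", error.reason)
```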
# We can read our data in using csvreader
import csv
# When we open a file, we can specify the encoding that it's in. In this case, utf-8
f = open("sentences_cia.csv", 'r', encoding="utf-8")
csvreader = csv.reader(f)
sentences_cia = list(csvreader)
# The data is two columns
# First column is year, second is a sentence from a CIA report written that year
# Print the first column of the second row
print(sentences_cia[1][0])
# Print the second column of the second row
print(sentences_cia[1][1])
'''
1997
The FBI information included that al-Mairi's brother "traveled to Afghanistan in 1997-1998 to train in Bin - Ladencamps."
'''
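The snippet above never closes the file. One possible variant, sketched below, uses a `with` block so the file is closed automatically. To stay runnable without sentences_cia.csv, it writes a tiny stand-in file first; the file name sample.csv and its contents are made up for illustration.

```python
import csv
import os
import tempfile

# Write a tiny stand-in file (hypothetical data, not the real dataset)
rows = [["year", "statement"], ["1997", 'He said "caf\u00e9" twice.']]
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerows(rows)

# The with block closes the file even if reading raises an error
with open(path, "r", encoding="utf-8", newline="") as f:
    loaded = list(csv.reader(f))

print(loaded[1][0])
print(loaded[1][1])
print(loaded == rows)  # True
```

Passing `newline=""` follows the recommendation in the csv module documentation, so that quoted fields containing line breaks are handled correctly.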
import csv
# Let's read in the legislators data from a few missions ago
f = open("legislators.csv", 'r', encoding="utf-8")
csvreader = csv.reader(f)
legislators = list(csvreader)
# Now, we can import pandas and use the DataFrame class to convert the list of lists to a dataframe
import pandas as pd
legislators_df = pd.DataFrame(legislators)
# As you can see, the first row is the headers, which we don't want (it's not actually data, it's just headers)
print(legislators_df.iloc[0,:])
# In order to remove the headers, we'll subset the df and pass them in separately
# This code removes the headers from legislators, and instead passes them into the columns argument
# The columns argument specifies column names
legislators_df = pd.DataFrame(legislators[1:], columns=legislators[0])
# We now have the right data in the first row, and the proper headers
print(legislators_df.iloc[0,:])
# The sentences_cia data from last screen is available.
sentences_cia_df = pd.DataFrame(sentences_cia[1:], columns=sentences_cia[0])
'''
0 last_name
1 first_name
2 birthday
3 gender
4 type
5 state
6 party
Name: 0, dtype: object
last_name Bassett
first_name Richard
birthday 1745-04-02
gender M
type sen
state DE
party Anti-Administration
Name: 0, dtype: object
'''
def clean_statement(row):
    # The integer codes for all the characters we want to keep (0-9, A-Z, a-z and space)
    good_characters = [48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 32]
    statement = row["statement"]
    clean_statement_list = [s for s in statement if ord(s) in good_characters]
    # Join the list together, separated by "" (no space), which creates a string again
    return "".join(clean_statement_list)
# From here on, sentences_cia refers to the dataframe built earlier (sentences_cia_df), not the list of rows
sentences_cia = sentences_cia_df
sentences_cia["cleaned_statement"] = sentences_cia.apply(clean_statement, axis=1)
# We can use the .join() method on strings to join lists together.
# The string we use the method on will be used as the separator -- the character(s) between each string when they are joined.
combined_statements = " ".join(sentences_cia["cleaned_statement"])
statement_tokens = combined_statements.split(" ")
''' list (<class 'list'>) ['The', 'FBI', 'information', 'included', 'that', 'alMairis', 'brother', 'traveled', ... '''
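The same filtering idea can be written without listing integer codes by hand. The sketch below is an alternative, not the method used above: it builds the set of allowed characters from the standard `string` module. The names `ALLOWED` and `keep_allowed` are hypothetical.

```python
import string

# ASCII letters, digits and the space character
ALLOWED = set(string.ascii_letters + string.digits + " ")

def keep_allowed(text):
    """Drop every character that is not an ASCII letter, digit or space."""
    return "".join(ch for ch in text if ch in ALLOWED)

sample = "al-Mairi's brother \"traveled\" in 1997-1998."
cleaned = keep_allowed(sample)
print(cleaned)             # alMairis brother traveled in 19971998
print(cleaned.split(" "))  # ['alMairis', 'brother', 'traveled', 'in', '19971998']
```

A set also makes each membership check constant time, whereas looking up a value in a list scans it from the start.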
# statement_tokens has been loaded in.
filtered_tokens = [s for s in statement_tokens if len(s) > 4]
from collections import Counter
# filtered_tokens has been loaded in
filtered_token_counts = Counter(filtered_tokens)
'''
Counter({'interrogation': 391, 'REDACTED': 375, 'information': 375, 'Zubaydah': 328, 'Committee': 327, ...
'''
common_tokens = filtered_token_counts.most_common(3)
'''
[('interrogation', 391), ('REDACTED', 375), ('information', 375)]
'''
# sentences_cia has been loaded in.
# It already has the cleaned_statement column.
from collections import Counter
def find_most_common_by_year(year, sentences_cia):
    data = sentences_cia[sentences_cia["year"] == year]
    combined_statement = " ".join(data["cleaned_statement"])
    statement_split = combined_statement.split(" ")
    counter = Counter([s for s in statement_split if len(s) > 4])
    return counter.most_common(2)
common_2000 = find_most_common_by_year("2000", sentences_cia)
'''
[('terrorist', 9), ('Ahmad', 9)]
'''
common_2002 = find_most_common_by_year("2002", sentences_cia)
'''
[('interrogation', 275), ('Zubaydah', 252)]
'''
common_2013 = find_most_common_by_year("2013", sentences_cia)
'''
[('Response', 196), ('states', 111)]
'''