Natural Language Processing with Python study notes (4): 1.2 A Closer Look at Python: Texts as Lists of Words

I am new to this; if any of my translations are off, please point them out. Many thanks.

 

Update log

1st: 2011/8/6

2nd: swapped in a new icon, since I really did not like the original one

~I expect quite a few readers will like it~

 

1.2 A Closer Look at Python: Texts as Lists of Words

You’ve seen some important elements of the Python programming language. Let’s take a few moments to review them systematically.

Lists

What is a text? At one level, it is a sequence of symbols on a page such as this one. At another level, it is a sequence of chapters, made up of a sequence of sections, where each section is a sequence of paragraphs, and so on. However, for our purposes, we will think of a text as nothing more than a sequence of words and punctuation. Here’s how we represent text in Python, in this case the opening sentence of Moby Dick:

 

>>> sent1 = ['Call', 'me', 'Ishmael', '.']
>>>

After the prompt we've given a name we made up, sent1, followed by the equals sign, and then some quoted words, separated with commas, and surrounded with brackets. This bracketed material is known as a list in Python: it is how we store a text. We can inspect it by typing the name①. We can ask for its length②. We can even apply our own lexical_diversity() function to it③.

 

>>> sent1 ①
['Call', 'me', 'Ishmael', '.']
>>> len(sent1) ②
4
>>> lexical_diversity(sent1) ③
1.0
>>>
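The session above relies on a lexical_diversity() function from Section 1.1 that is not reproduced in these notes. The sketch below assumes one plausible definition (tokens divided by distinct tokens); that definition is an assumption, so treat the numbers as illustrative only.

```python
# Assumed definition (not shown in these notes): number of tokens
# divided by the number of distinct tokens.
def lexical_diversity(text):
    return len(text) / len(set(text))

sent1 = ['Call', 'me', 'Ishmael', '.']
length = len(sent1)                      # 4 tokens
diversity = lexical_diversity(sent1)     # 4 tokens / 4 distinct tokens

# A sentence with a repeated word: 5 tokens, 4 distinct tokens.
repeated = ['the', 'cat', 'saw', 'the', 'dog']
repeated_diversity = lexical_diversity(repeated)

print(length, diversity, repeated_diversity)   # 4 1.0 1.25
```

Every word in sent1 is distinct, which is why the ratio is exactly 1.0 under this definition.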

 

Some more lists have been defined for you, one for the opening sentence of each of our texts, sent2 … sent9. We inspect two of them here; you can see the rest for yourself using the Python interpreter (if you get an error saying that sent2 is not defined, you need to first type from nltk.book import *).

 

>>> sent2
['The', 'family', 'of', 'Dashwood', 'had', 'long',
'been', 'settled', 'in', 'Sussex', '.']
>>> sent3
['In', 'the', 'beginning', 'God', 'created', 'the',
'heaven', 'and', 'the', 'earth', '.']

Your Turn: Make up a few sentences of your own, by typing a name, equals sign, and a list of words, like this: ex1 = ['Monty', 'Python', 'and', 'the', 'Holy', 'Grail']. Repeat some of the other Python operations we saw earlier in Section 1.1, e.g., sorted(ex1), len(set(ex1)), ex1.count('the').
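As a rough worked version of this exercise, the following sketch runs the three suggested operations on ex1. It only uses built-in list operations; the variable names alphabetical, distinct and the_count are made up for illustration.

```python
ex1 = ['Monty', 'Python', 'and', 'the', 'Holy', 'Grail']

alphabetical = sorted(ex1)     # returns a new sorted list; ex1 keeps its order
distinct = len(set(ex1))       # number of different words
the_count = ex1.count('the')   # how many times 'the' appears

print(alphabetical)   # ['Grail', 'Holy', 'Monty', 'Python', 'and', 'the']
print(distinct)       # 6
print(the_count)      # 1
```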

  

A pleasant surprise is that we can use Python’s addition operator on lists. Adding two lists① creates a new list with everything from the first list, followed by everything from the second list:

 

>>> ['Monty', 'Python'] + ['and', 'the', 'Holy', 'Grail'] ①
['Monty', 'Python', 'and', 'the', 'Holy', 'Grail']

This special use of the addition operation is called concatenation; it combines the lists together into a single list. We can concatenate sentences to build up a text.

We don’t have to literally type the lists either; we can use short names that refer to predefined lists.

>>> sent4 + sent1
['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', 'and', 'of', 'the', 'House', 'of', 'Representatives', ':', 'Call', 'me', 'Ishmael', '.']
>>>
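One detail worth checking for yourself is that concatenation builds a new list and leaves both operands as they were. A minimal sketch, using small hand-typed lists in place of the predefined sent4 and sent1:

```python
first = ['Monty', 'Python']
second = ['and', 'the', 'Holy', 'Grail']

combined = first + second      # a brand-new list

print(combined)   # ['Monty', 'Python', 'and', 'the', 'Holy', 'Grail']
print(first)      # ['Monty', 'Python']  (unchanged)
print(second)     # ['and', 'the', 'Holy', 'Grail']  (unchanged)
print(len(combined) == len(first) + len(second))   # True
```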

What if we want to add a single item to a list? This is known as appending. When we append() to a list, the list itself is updated as a result of the operation.

 

>>> sent1.append("Some")
>>> sent1
['Call', 'me', 'Ishmael', '.', 'Some']
>>>
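The difference between appending and concatenating can be sketched as follows: append() changes the existing list in place and returns None, whereas + produces a separate list. The names returned, same_object and longer are only for illustration.

```python
sent = ['Call', 'me', 'Ishmael', '.']
original_id = id(sent)

returned = sent.append('Some')          # modifies sent itself; returns None
same_object = id(sent) == original_id   # still the very same list object

longer = sent + ['years', 'ago']        # + builds a separate list

print(returned)      # None
print(sent)          # ['Call', 'me', 'Ishmael', '.', 'Some']
print(same_object)   # True
print(len(longer))   # 7
```

A common slip is writing sent = sent.append('Some'), which would replace the list with None.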

Indexing Lists

 

As we have seen, a text in Python is a list of words, represented using a combination of brackets and quotes. Just as with an ordinary page of text, we can count up the total number of words in text1 with len(text1), and count the occurrences in a text of a particular word—say, heaven—using text1.count('heaven').

 

With some patience, we can pick out the 1st, 173rd, or even 14,278th word in a printed text. Analogously, we can identify the elements of a Python list by their order of occurrence in the list. The number that represents this position is the item’s index. We instruct Python to show us the item that occurs at an index such as 173 in a text by writing the name of the text followed by the index inside square brackets:

 

>>> text4[173]
'awaken'
>>>

 

We can do the converse; given a word, find the index at which it first occurs:

 

>>> text4.index('awaken')
173
>>>
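The same round trip between a word and its index works on any list. The sketch below uses the short sent3 list rather than text4, so it runs without loading the NLTK texts; the membership check before index() is an addition of mine, since index() raises ValueError for a word that is not present.

```python
sent3 = ['In', 'the', 'beginning', 'God', 'created', 'the',
         'heaven', 'and', 'the', 'earth', '.']

position = sent3.index('the')     # index of the FIRST 'the' only
word = sent3[position]            # indexing takes us back to the word
occurrences = sent3.count('the')  # 'the' appears more than once

# index() raises ValueError for a missing word, so test membership first.
has_whale = 'whale' in sent3
whale_position = sent3.index('whale') if has_whale else None

print(position, word, occurrences)   # 1 the 3
print(whale_position)                # None
```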

Indexes are a common way to access the words of a text, or, more generally, the elements of any list. Python permits us to access sublists as well, extracting manageable pieces of language from large texts, a technique known as slicing:

 

>>> text5[16715:16735]
['U86', 'thats', 'why', 'something', 'like', 'gamefly', 'is', 'so', 'good',
'because', 'you', 'can', 'actually', 'play', 'a', 'full', 'game', 'without',
'buying', 'it']
>>> text6[1600:1625]
['We', "'", 're', 'an', 'anarcho', '-', 'syndicalist', 'commune', '.', 'We',
'take', 'it', 'in', 'turns', 'to', 'act', 'as', 'a', 'sort', 'of', 'executive',
'officer', 'for', 'the', 'week']
>>>

 

Indexes have some subtleties, and we’ll explore these with the help of an artificial sentence:

 

>>> sent = ['word1', 'word2', 'word3', 'word4', 'word5',
... 'word6', 'word7', 'word8', 'word9', 'word10']
>>> sent[0]
'word1'
>>> sent[9]
'word10'
>>>

Notice that our indexes start from zero: sent element zero, written sent[0], is the first word, 'word1', whereas sent element 9 is 'word10'. The reason is simple: the moment Python accesses the content of a list from the computer’s memory, it is already at the first element; we have to tell it how many elements forward to go. Thus, zero steps forward leaves it at the first element.

 

This practice of counting from zero is initially confusing, but typical of modern programming languages. You’ll quickly get the hang of it if you’ve mastered the system of counting centuries where 19XY is a year in the 20th century, or if you live in a country where the floors of a building are numbered from 1, and so walking up n-1 flights of stairs takes you to level n.

 

Now, if we accidentally use an index that is too large, we get an error:

 

>>> sent[10]
Traceback (most recent call last):
File "<stdin>", line 1, in ?
IndexError: list index out of range
>>>

This time it is not a syntax error, because the program fragment is syntactically correct. Instead, it is a runtime error, and it produces a Traceback message that shows the context of the error, followed by the name of the error, IndexError, and a brief explanation.

Let's take a closer look at slicing, using our artificial sentence again. Here we verify that the slice 5:8 includes sent elements at indexes 5, 6, and 7:
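To see the runtime error without stopping the program, one option is to catch it. The try/except construct is not covered in this section, so take the following as a small sketch beyond the text: it shows that the largest valid index is len(sent) - 1 and that index 10 raises IndexError.

```python
sent = ['word1', 'word2', 'word3', 'word4', 'word5',
        'word6', 'word7', 'word8', 'word9', 'word10']

last_valid = len(sent) - 1    # 10 elements, so valid indexes run 0..9
last_word = sent[last_valid]

try:
    sent[10]                  # one step past the end
    message = None
except IndexError as err:     # raised when the code runs, not when it is parsed
    message = str(err)

print(last_valid, last_word)  # 9 word10
print(message)                # list index out of range
```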

 

>>> sent[5:8]
['word6', 'word7', 'word8']
>>> sent[5]
'word6'
>>> sent[6]
'word7'
>>> sent[7]
'word8'
>>>

By convention, m:n means elements m…n-1. As the next example shows, we can omit the first number if the slice begins at the start of the list①, and we can omit the second number if the slice goes to the end②:

 

>>> sent[:3] ①
['word1', 'word2', 'word3']
>>> text2[141525:] ②
['among', 'the', 'merits', 'and', 'the', 'happiness', 'of', 'Elinor', 'and', 'Marianne',
',', 'let', 'it', 'not', 'be', 'ranked', 'as', 'the', 'least', 'considerable', ',',
'that', 'though', 'sisters', ',', 'and', 'living', 'almost', 'within', 'sight', 'of',
'each', 'other', ',', 'they', 'could', 'live', 'without', 'disagreement', 'between',
'themselves', ',', 'or', 'producing', 'coolness', 'between', 'their', 'husbands', '.',
'THE', 'END']
>>>
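Two consequences of the m:n convention are easy to verify on the artificial sentence: a slice m:n contains n - m elements, and splitting a list at any point with [:k] and [k:] loses nothing. A minimal sketch:

```python
sent = ['word1', 'word2', 'word3', 'word4', 'word5',
        'word6', 'word7', 'word8', 'word9', 'word10']

middle = sent[5:8]     # indexes 5, 6 and 7; index 8 is excluded
head = sent[:3]        # same as sent[0:3]
tail = sent[3:]        # same as sent[3:len(sent)]

print(middle)                  # ['word6', 'word7', 'word8']
print(len(middle) == 8 - 5)    # True
print(head + tail == sent)     # True
```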

We can modify an element of a list by assigning to one of its index values. In the next example, we put sent[0] on the left of the equals sign①. We can also replace an entire slice with new material②. A consequence of this last change is that the list only has four elements, and accessing a later value generates an error③.

 

>>> sent[0] = 'First' ①
>>> sent[9] = 'Last'
>>> len(sent)
10
>>> sent[1:9] = ['Second', 'Third'] ②
>>> sent
['First', 'Second', 'Third', 'Last']
>>> sent[9] ③
Traceback (most recent call last):
File "<stdin>", line 1, in ?
IndexError: list index out of range
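The length change is the part that tends to surprise people: the replacement does not need to be the same size as the slice it replaces. The sketch below uses a made-up five-letter list to show a slice assignment that shrinks the list and one that grows it.

```python
words = ['a', 'b', 'c', 'd', 'e']

words[0] = 'A'               # replace a single element; length unchanged
words[1:4] = ['X']           # replace three elements with one; list shrinks
after_shrink = list(words)   # copy for inspection

words[1:2] = ['p', 'q', 'r'] # replace one element with three; list grows
after_grow = list(words)

print(after_shrink)   # ['A', 'X', 'e']
print(after_grow)     # ['A', 'p', 'q', 'r', 'e']
```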

Your Turn: Take a few minutes to define a sentence of your own and modify individual words and groups of words (slices) using the same methods used earlier. Check your understanding by trying the exercises on lists at the end of this chapter.

For more on list operations, see chapter 9 ("List") of How to Think Like a Computer Scientist, or consult the documentation that ships with Python.

 

Variables

 

From the start of Section 1.1, you have had access to texts called text1, text2, and so on. It saved a lot of typing to be able to refer to a 250,000-word book with a short name like this! In general, we can make up names for anything we care to calculate. We did this ourselves in the previous sections, e.g., defining a variable sent1, as follows:

 

>>> sent1 = ['Call', 'me', 'Ishmael', '.']

Such lines have the form: variable = expression. Python will evaluate the expression, and save its result to the variable. This process is called assignment. It does not generate any output; you have to type the variable on a line of its own to inspect its contents. The equals sign is slightly misleading, since information is moving from the right side to the left. It might help to think of it as a left-arrow. The name of the variable can be anything you like, e.g., my_sent, sentence, xyzzy. It must start with a letter, and can include numbers and underscores. Here are some examples of variables and assignments:

 

>>> my_sent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode',
... 'forth', 'from', 'Camelot', '.']
>>> noun_phrase = my_sent[1:4]
>>> noun_phrase
['bold', 'Sir', 'Robin']
>>> wOrDs = sorted(noun_phrase)
>>> wOrDs
['Robin', 'Sir', 'bold']

 

Remember that capitalized words appear before lowercase words in sorted lists. (This is because sorting follows ASCII order, in which uppercase letters have smaller code values than lowercase letters.)
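The ordering can be checked directly with the built-in ord(), which returns a character's code value. The last line uses sorted()'s key parameter to get a case-insensitive order; that option is not discussed in this section and is shown only as a possible workaround.

```python
noun_phrase = ['bold', 'Sir', 'Robin']

default_order = sorted(noun_phrase)                 # compares code values
codes = [ord('R'), ord('S'), ord('b')]              # uppercase codes are smaller
ignore_case = sorted(noun_phrase, key=str.lower)    # compare lowercased copies

print(default_order)   # ['Robin', 'Sir', 'bold']
print(codes)           # [82, 83, 98]
print(ignore_case)     # ['bold', 'Robin', 'Sir']
```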

Notice in the previous example that we split the definition of my_sent over two lines. Python expressions can be split across multiple lines, so long as this happens within any kind of brackets. Python uses the ... prompt to indicate that more input is expected. It doesn’t matter how much indentation is used in these continuation lines, but some indentation usually makes them easier to read.

 

It is good to choose meaningful variable names to remind you, and to help anyone else who reads your Python code, what your code is meant to do (an important point). Python does not try to make sense of the names; it blindly follows your instructions, and does not object if you do something confusing, such as one = 'two' or two = 3. The only restriction is that a variable name cannot be any of Python’s reserved words, such as def, if, not, and import. If you use a reserved word, Python will produce a syntax error:

 

>>> not = 'Camelot'
File "<stdin>", line 1
not = 'Camelot'
^
SyntaxError: invalid syntax
>>>

We will often use variables to hold intermediate steps of a computation, especially when this makes the code easier to follow. Thus len(set(text1)) could also be written:

 

>>> vocab = set(text1)
>>> vocab_size = len(vocab)
>>> vocab_size
19317
>>>
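The same two-step computation can be tried on a short list when text1 is not loaded. In this sketch the list text is a stand-in (the opening sentence stored in sent3), and the final comparison confirms that the intermediate variable gives the same answer as the one-line form.

```python
text = ['In', 'the', 'beginning', 'God', 'created', 'the',
        'heaven', 'and', 'the', 'earth', '.']

vocab = set(text)          # intermediate step: the distinct words
vocab_size = len(vocab)    # final result

print(vocab_size)                     # 9
print(vocab_size == len(set(text)))   # True
```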

Caution!

Take care with your choice of names (or identifiers) for Python variables. First, you should start the name with a letter, optionally followed by digits (0 to 9) or letters. Thus, abc23 is fine, but 23abc will cause a syntax error. Names are case-sensitive, which means that myVar and myvar are distinct variables. Variable names cannot contain whitespace, but you can separate words using an underscore, e.g., my_var. Be careful not to insert a hyphen instead of an underscore: my-var is wrong, since Python interprets the - as a minus sign.
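These naming rules can be tested without triggering a syntax error, using str.isidentifier() and the standard keyword module. The helper is_usable_name is a hypothetical name of mine, and note that isidentifier() is somewhat more permissive than the rule of thumb above (it also accepts a leading underscore, for instance).

```python
import keyword

def is_usable_name(name):
    """Return True if name is a valid identifier and not a reserved word."""
    return name.isidentifier() and not keyword.iskeyword(name)

candidates = ['abc23', '23abc', 'my_var', 'my-var', 'my var', 'not']
results = {name: is_usable_name(name) for name in candidates}

for name, ok in results.items():
    print(repr(name), ok)

# Names are case-sensitive: these are two separate variables.
myVar = 1
myvar = 2
print(myVar, myvar)   # 1 2
```

Only abc23 and my_var pass: 23abc starts with a digit, my-var and "my var" contain characters that are not allowed, and not is a reserved word.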

 

Strings

 

Some of the methods we used to access the elements of a list also work with individual words, or strings. For example, we can assign a string to a variable①, index a string ②, and slice a string③.

>>> name = 'Monty' ①
>>> name[0] ②
'M'
>>> name[:4] ③
'Mont'
>>>

 

We can also perform multiplication and addition with strings:

>>> name * 2
'MontyMonty'
>>> name + '!'
'Monty!'
>>>

 

We can join the words of a list to make a single string, or split a string into a list, as follows:

>>> ' '.join(['Monty', 'Python'])
'Monty Python'
>>> 'Monty Python'.split()
['Monty', 'Python']
>>>
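join() and split() are close to being inverses of each other, with one caveat sketched below: split() with no argument treats any run of whitespace as a single separator, so extra spaces in the original string are not recovered. The variable names are illustrative only.

```python
words = ['Monty', 'Python']

joined = ' '.join(words)        # the string before .join is the separator
hyphenated = '-'.join(words)    # any separator string works
round_trip = joined.split()     # back to the list of words

# Runs of spaces and trailing spaces are discarded by split().
messy = 'Monty    Python  '.split()

print(joined)        # Monty Python
print(hyphenated)    # Monty-Python
print(round_trip)    # ['Monty', 'Python']
print(messy)         # ['Monty', 'Python']
```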

We will come back to the topic of strings in Chapter 3. For the time being, we have two important building blocks—lists and strings—and are ready to get back to some language analysis.
