2012-4-6 Reading the preface
update time:2012-4-6
This is a book about Natural Language Processing. By “natural language” we mean a language that is used for everyday communication by humans; languages such as English, Hindi, or Portuguese. In contrast to artificial languages such as programming languages and mathematical notations, natural languages have evolved as they pass from generation to generation, and are hard to pin down with explicit rules. We will take Natural Language Processing (NLP for short) in a wide sense to cover any kind of computer manipulation of natural language. At one extreme, it could be as simple as counting word frequencies to compare different writing styles. At the other extreme, NLP involves “understanding” complete human utterances [expressions; speech; ways of speaking], at least to the extent of being able to give useful responses to them.
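To make the simple end of that spectrum concrete, here is a minimal sketch of counting word frequencies in two short texts. It is not taken from the book: the sample sentences and the helper name `word_frequencies` are made up for illustration, the tokenization is a naive lowercase whitespace split, and the code is written for Python 3 rather than the Python 2 the book assumes.

```python
from collections import Counter

def word_frequencies(text):
    """Count how often each lowercased, whitespace-separated token occurs."""
    return Counter(text.lower().split())

# Two made-up snippets standing in for different writing styles.
formal = "the committee shall review the proposal and the budget"
casual = "we will look at the plan and then we will decide"

formal_counts = word_frequencies(formal)
casual_counts = word_frequencies(casual)

# Compare how often a few common words appear in each snippet.
for word in ("the", "we", "will"):
    print(word, formal_counts[word], casual_counts[word])
```

A `Counter` returns 0 for words it has never seen, which is why the comparison loop needs no special case for missing words.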
Technologies based on NLP are becoming increasingly widespread. For example, phones and handheld computers support predictive text and handwriting recognition; web search engines give access to information locked up in unstructured text; machine translation allows us to retrieve texts written in Chinese and read them in Spanish. By providing more natural human-machine interfaces, and more sophisticated access to stored information, language processing has come to play a central role in the multilingual [using multiple languages] information society.
This book provides a highly accessible introduction to the field of NLP. It can be used for individual study or as the textbook for a course on natural language processing or computational linguistics, or as a supplement to courses in artificial intelligence, text mining, or corpus linguistics. The book is intensely practical, containing hundreds of fully worked examples and graded exercises. The book is based on the Python programming language together with an open source library called the Natural Language Toolkit (NLTK). NLTK includes extensive software, data, and documentation, all freely downloadable from http://www.nltk.org/. Distributions are provided for Windows, Macintosh, and Unix platforms. We strongly encourage you to download Python and NLTK, and try out the examples and exercises along the way.
Audience
NLP is important for scientific, economic, social, and cultural reasons. NLP is experiencing rapid growth as its theories and methods are deployed in a variety of new language technologies. For this reason it is important for a wide range of people to have a working knowledge of NLP. Within industry, this includes people in human-computer interaction, business information analysis, and web software development. Within academia, it includes people in areas from humanities computing and corpus linguistics through to computer science and artificial intelligence. (To many people in academia, NLP is known by the name of “Computational Linguistics.”)
This book is intended for a diverse range of people who want to learn how to write programs that analyze written language, regardless of previous programming experience:
New to programming?
The early chapters of the book are suitable for readers with no prior knowledge of programming, so long as you aren’t afraid to tackle new concepts and develop new computing skills. The book is full of examples that you can copy and try for yourself, together with hundreds of graded exercises. If you need a more general introduction to Python, see the list of Python resources at http://docs.python.org/.
New to Python?
Experienced programmers can quickly learn enough Python using this book to get immersed in natural language processing. All relevant Python features are carefully explained and exemplified, and you will quickly come to appreciate Python’s suitability for this application area. The language index will help you locate relevant discussions in the book.
Already dreaming in Python?
Skim the Python examples and dig into the interesting language analysis material that starts in Chapter 1. You’ll soon be applying your skills to this fascinating domain.
Emphasis
This book is a practical introduction to NLP. You will learn by example, write real programs, and grasp the value of being able to test an idea through implementation. If you haven’t learned already, this book will teach you programming. Unlike other programming books, we provide extensive illustrations and exercises from NLP. The approach we have taken is also principled, in that we cover the theoretical underpinnings and don’t shy away from careful linguistic and computational analysis. We have tried to be pragmatic in striking a balance between theory and application, identifying the connections and the tensions. Finally, we recognize that you won’t get through this unless it is also pleasurable, so we have tried to include many applications and examples that are interesting and entertaining, and sometimes whimsical.
Note that this book is not a reference work. Its coverage of Python and NLP is selective, and presented in a tutorial style. For reference material, please consult the substantial quantity of searchable resources available at http://python.org/ and http://www.nltk.org/. This book is not an advanced computer science text. The content ranges from introductory to intermediate, and is directed at readers who want to learn how to analyze text using Python and the Natural Language Toolkit. To learn about advanced algorithms implemented in NLTK, you can examine the Python code linked from http://www.nltk.org/, and consult the other materials cited in this book.
What You Will Learn
By digging into the material presented here, you will learn:
• How simple programs can help you manipulate and analyze language data, and how to write these programs
• How key concepts from NLP and linguistics are used to describe and analyze language
• How data structures and algorithms are used in NLP
• How language data is stored in standard formats, and how data can be used to evaluate the performance of NLP techniques
Depending on your background, and your motivation for being interested in NLP, you will gain different kinds of skills and knowledge from this book, as set out in Table P-1.
Goals | Background in arts and humanities | Background in science and engineering |
Language analysis | Manipulating large corpora, exploring linguistic models, and testing empirical claims. | Using techniques in data modeling, data mining, and knowledge discovery to analyze natural language. |
Language technology | Building robust systems to perform linguistic tasks with technological applications. | Using linguistic algorithms and data structures in robust language processing software. |
Table P-1. Skills and knowledge to be gained from reading this book, depending on readers’ goals and background
Organization
The early chapters are organized in order of conceptual difficulty, starting with a practical introduction to language processing that shows how to explore interesting bodies of text using tiny Python programs (Chapters 1–3). This is followed by a chapter on structured programming (Chapter 4) that consolidates [strengthens] the programming topics scattered across the preceding chapters. After this, the pace picks up, and we move on to a series of chapters covering fundamental topics in language processing: tagging, classification, and information extraction (Chapters 5–7). The next three chapters look at ways to parse a sentence, recognize its syntactic structure, and construct representations of meaning (Chapters 8–10). The final chapter is devoted to linguistic data and how it can be managed effectively (Chapter 11). The book concludes with an Afterword, briefly discussing the past and future of the field.
Within each chapter, we switch between different styles of presentation. In one style, natural language is the driver. We analyze language, explore linguistic concepts, and use programming examples to support the discussion. We often employ Python constructs that have not been introduced systematically, so you can see their purpose before delving [studying closely; exploring] into the details of how and why they work. This is just like learning idiomatic [customary; in keeping with how a language is actually used] expressions in a foreign language: you’re able to buy a nice pastry [shortcrust pastry; baked flour goods] without first having learned the intricacies [complicated, hard-to-grasp details] of question formation. In the other style of presentation, the programming language will be the driver. We’ll analyze programs, explore algorithms, and the linguistic examples will play a supporting role.
Each chapter ends with a series of graded exercises, which are useful for consolidating the material. The exercises are graded according to the following scheme:
○ is for easy exercises that involve minor modifications to supplied code samples or other simple activities;
◑ is for intermediate exercises that explore an aspect of the material in more depth, requiring careful analysis and design;
● is for difficult, open-ended tasks that will challenge your understanding of the material and force you to think independently (readers new to programming should skip these).
Each chapter has a further reading section and an online “extras” section at http://www.nltk.org/, with pointers to more advanced materials and online resources. Online versions of all the code examples are also available there.
Why Python?
Python is a simple yet powerful programming language with excellent functionality for processing linguistic data. Python can be downloaded for free from http://www.python.org/. Installers are available for all platforms. Here is a four-line Python program that processes file.txt and prints all the words ending in ing:
>>> for line in open("file.txt"):
...     for word in line.split():
...         if word.endswith('ing'):
...             print(word)
This program illustrates some of the main features of Python. First, whitespace is used to nest lines of code; thus the line starting with if falls inside the scope of the previous line starting with for; this ensures that the ing test is performed for each word. Second, Python is object-oriented; each variable is an entity that has certain defined attributes and methods. For example, the value of the variable line is more than a sequence of characters. It is a string object that has a “method” (or operation) called split() that we can use to break a line into its words. To apply a method to an object, we write the object name, followed by a period, followed by the method name, i.e., line.split().
Third, methods have arguments expressed inside parentheses. For instance, in the example, word.endswith('ing') had the argument 'ing' to indicate that we wanted words ending with ing and not something else. Finally, and most importantly, Python is highly readable, so much so that it is fairly easy to guess what this program does even if you have never written a program before.
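The two string methods discussed above can be tried without any input file. The following is a small sketch, not part of the book: the sample sentence is made up, and the code is written for Python 3.

```python
# A made-up line of text; in the program above it would come from file.txt.
line = "He was singing and dancing in the morning rain"

# split() is a string method: with no argument it breaks the line on whitespace.
words = line.split()

# endswith() takes an argument naming the suffix to test for.
ing_words = []
for word in words:
    if word.endswith("ing"):
        ing_words.append(word)

print(words)
print(ing_words)
```

Collecting the matches in a list instead of printing them directly makes it easy to inspect or reuse the result afterwards.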
We chose Python because it has a shallow learning curve, its syntax and semantics are transparent, and it has good string-handling functionality. As an interpreted language, Python facilitates interactive exploration. As an object-oriented language, Python permits data and methods to be encapsulated and re-used easily. As a dynamic language, Python permits attributes to be added to objects on the fly, and permits variables to be typed dynamically, facilitating rapid development. Python comes with an extensive standard library, including components for graphical programming, numerical processing, and web connectivity.
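As a rough illustration of the "dynamic language" points, the sketch below adds an attribute to an object after it has been created and rebinds a variable to a value of a different type. The `Token` class and the tag value are hypothetical and are not part of NLTK; the code is written for Python 3.

```python
class Token:
    """A hypothetical container for a word; not an NLTK class."""

    def __init__(self, text):
        self.text = text

token = Token("running")

# Attributes can be added to an existing object on the fly.
token.tag = "VBG"

# Variables are typed dynamically: the same name can hold values of different types.
value = "42"        # a string
value = int(value)  # now an integer

print(token.text, token.tag, value + 1)
```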
Python is heavily used in industry, scientific research, and education around the world. Python is often praised for the way it facilitates productivity, quality, and maintainability of software. A collection of Python success stories is posted at http://www.python.org/about/success/.
NLTK defines an infrastructure that can be used to build NLP programs in Python. It provides basic classes for representing data relevant to natural language processing; standard interfaces for performing tasks such as part-of-speech tagging, syntactic parsing, and text classification; and standard implementations for each task that can be combined to solve complex problems.
NLTK comes with extensive documentation. In addition to this book, the website at http://www.nltk.org/ provides API documentation that covers every module, class, and function in the toolkit, specifying parameters and giving examples of usage. The website also provides many HOWTOs with extensive examples and test cases, intended for users, developers, and instructors.
Software Requirements
To get the most out of this book, you should install several free software packages. Current download pointers and instructions are available at http://www.nltk.org/.
Python
The material presented in this book assumes that you are using Python version 2.4 or 2.5. We are committed to porting NLTK to Python 3.0 once the libraries that NLTK depends on have been ported.
NLTK
The code examples in this book use NLTK version 2.0. Subsequent releases of NLTK will be backward-compatible.
NLTK-Data
This contains the linguistic corpora that are analyzed and processed in the book.
NumPy (recommended)
This is a scientific computing library with support for multidimensional arrays and linear algebra, required for certain probability, tagging, clustering, and classification tasks.
Matplotlib (recommended)
This is a 2D plotting library for data visualization, and is used in some of the book’s code samples that produce line graphs and bar charts.
NetworkX (optional)
This is a library for storing and manipulating network structures consisting of nodes and edges. For visualizing semantic networks, also install the Graphviz library.
Prover9 (optional)
This is an automated theorem prover for first-order and equational logic, used to support inference in language processing.
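One way to see which of the Python packages above are already available is to ask the import system whether it can locate them. This is a sketch and not from the book: the helper name `missing_packages` is made up, it relies on `importlib.util.find_spec` from the Python 3 standard library (so it does not apply to the Python 2.4/2.5 setup the book describes), and it only covers importable packages, not NLTK-Data or the external Prover9 program.

```python
import importlib.util

def missing_packages(names):
    """Return the names from `names` that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Top-level import names of the Python packages listed above.
wanted = ["nltk", "numpy", "matplotlib", "networkx"]
print("Not installed:", missing_packages(wanted))
```

The check only looks the packages up and does not import them, so it is quick and has no side effects.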
Natural Language Toolkit (NLTK)
NLTK was originally created in 2001 as part of a computational linguistics course in the Department of Computer and Information Science at the University of Pennsylvania. Since then it has been developed and expanded with the help of dozens of contributors. It has now been adopted in courses in dozens of universities, and serves as the basis of many research projects. Table P-2 lists the most important NLTK modules.
NLTK was designed with four primary goals in mind:
Simplicity
To provide an intuitive framework along with substantial building blocks, giving users a practical knowledge of NLP without getting bogged down in the tedious housekeeping usually associated with processing annotated language data
Consistency
To provide a uniform framework with consistent interfaces and data structures, and easily guessable method names
Extensibility
To provide a structure into which new software modules can be easily accommodated, including alternative implementations and competing approaches to the same task
Modularity
To provide components that can be used independently without needing to understand the rest of the toolkit
Contrasting with these goals are three non-requirements: potentially useful qualities that we have deliberately avoided. First, while the toolkit provides a wide range of functions, it is not encyclopedic [like an encyclopedia]; it is a toolkit, not a system, and it will continue to evolve with the field of NLP. Second, while the toolkit is efficient enough to support meaningful tasks, it is not highly optimized for runtime performance; such optimizations often involve more complex algorithms, or implementations in lower-level programming languages such as C or C++. This would make the software less readable and more difficult to install. Third, we have tried to avoid clever programming tricks, since we believe that clear implementations are preferable to ingenious yet indecipherable ones.
For Instructors
Natural Language Processing is often taught within the confines of a single-semester course at the advanced undergraduate level or postgraduate level. Many instructors have found that it is difficult to cover both the theoretical and practical sides of the subject in such a short span of time. Some courses focus on theory to the exclusion of practical exercises, and deprive students of the challenge and excitement of writing programs to automatically process language. Other courses are simply designed to teach programming for linguists, and do not manage to cover any significant NLP content.
NLTK was originally developed to address this problem, making it feasible to cover a substantial amount of theory and practice within a single-semester course, even if students have no prior programming experience.
A significant fraction of any NLP syllabus deals with algorithms and data structures. On their own these can be rather dry, but NLTK brings them to life with the help of interactive graphical user interfaces that make it possible to view algorithms step-by-step. Most NLTK components include a demonstration that performs an interesting task without requiring any special input from the user. An effective way to deliver the materials is through interactive presentation of the examples in this book, entering them in a Python session, observing what they do, and modifying them to explore some empirical or theoretical issue.
This book contains hundreds of exercises that can be used as the basis for student assignments. The simplest exercises involve modifying a supplied program fragment in a specified way in order to answer a concrete question. At the other end of the spectrum, NLTK provides a flexible framework for graduate-level research projects, with standard implementations of all the basic data structures and algorithms, interfaces to dozens of widely used datasets (corpora), and a flexible and extensible architecture. Additional support for teaching using NLTK is available on the NLTK website.
We believe this book is unique in providing a comprehensive framework for students to learn about NLP in the context of learning to program. What sets these materials apart is the tight coupling of the chapters and exercises with NLTK, giving students, even those with no prior programming experience, a practical introduction to NLP. After completing these materials, students will be ready to attempt one of the more advanced textbooks, such as Speech and Language Processing, by Jurafsky and Martin (Prentice Hall, 2008).
This book presents programming concepts in an unusual order, beginning with a nontrivial data type (lists of strings), then introducing nontrivial control structures such as comprehensions and conditionals. These idioms permit us to do useful language processing from the start. Once this motivation is in place, we return to a systematic presentation of fundamental concepts such as strings, loops, files, and so forth. In this way, we cover the same ground as more conventional approaches, without expecting readers to be interested in the programming language for its own sake.
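To give a feel for the idioms named here, the following sketch applies comprehensions with a conditional to a list of strings. It is an illustration rather than an example from the book: the sample sentence is made up, and the code is written for Python 3.

```python
# A list of strings: a sentence that has already been split into words.
sentence = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

# A comprehension with a conditional: keep the words longer than four letters.
long_words = [word for word in sentence if len(word) > 4]

# Another comprehension: normalize every word to lowercase.
lowercased = [word.lower() for word in sentence]

print(long_words)
print(len(set(lowercased)))  # number of distinct words once case is ignored
```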
Two possible course plans are illustrated in Table P-3. The first one presumes an arts/humanities audience, whereas the second one presumes a science/engineering audience. Other course plans could cover the first five chapters, then devote the remaining time to a single area, such as text classification (Chapters 6 and 7), syntax (Chapters 8 and 9), semantics (Chapter 10), or linguistic data management (Chapter 11).
Conventions Used in This Book
The following typographical conventions are used in this book:
Bold
Indicates new terms.
Italic
Used within paragraphs to refer to linguistic examples, the names of texts, and URLs; also used for filenames and file extensions.
Constant width [monospaced font]
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, statements, and keywords; also used for program names.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context; also used for metavariables within program code examples.
Using Code Examples
This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Natural Language Processing with Python, by Steven Bird, Ewan Klein, and Edward Loper. Copyright 2009 Steven Bird, Ewan Klein, and Edward Loper, 978-0-596-51649-9.”
If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at [email protected].
Acknowledgments
The authors are indebted to the following people for feedback on earlier drafts of this book: Doug Arnold, Michaela Atterer, Greg Aumann, Kenneth Beesley, Steven Bethard, Ondrej Bojar, Chris Cieri, Robin Cooper, Grev Corbett, James Curran, Dan Garrette, Jean Mark Gawron, Doug Hellmann, Nitin Indurkhya, Mark Liberman, Peter Ljunglöf, Stefan Müller, Robin Munn, Joel Nothman, Adam Przepiorkowski, Brandon Rhodes, Stuart Robinson, Jussi Salmela, Kyle Schlansker, Rob Speer, and Richard Sproat. We are thankful to many students and colleagues for their comments on the class materials that evolved into these chapters, including participants at NLP and linguistics summer schools in Brazil, India, and the USA. This book would not exist without the members of the nltk-dev developer community, named on the NLTK website, who have given so freely of their time and expertise in building and extending NLTK. We are grateful to the U.S. National Science Foundation, the Linguistic Data Consortium, an Edward Clarence Dyason Fellowship, and the Universities of Pennsylvania, Edinburgh, and Melbourne for supporting our work on this book. We thank Julie Steele, Abby Fox, Loranah Dimant, and the rest of the O’Reilly team, for organizing comprehensive reviews of our drafts from people across the NLP and Python communities, for cheerfully customizing O’Reilly’s production tools to accommodate our needs, and for meticulous copyediting work. Finally, we owe a huge debt of gratitude to our partners, Kay, Mimo, and Jee, for their love, patience, and support over the many years that we worked on this book. We hope that our children (Andrew, Alison, Kirsten, Leonie, and Maaike) catch our enthusiasm for language and computation from these pages.
Royalties [payments to an author from sales]
Royalties from the sale of this book are being used to support the development of the Natural Language Toolkit.