Author: Yoav Goldberg
Publisher: Morgan & Claypool
Published: April 17th 2017
Source: downloaded PDF version
Goodreads: 4.85 (13 Ratings)
Douban: 9.4 (15 ratings)
Excerpts:
Natural language processing (NLP) is a collective term referring to automatic computational processing of human languages. This includes both algorithms that take human-produced text as input, and algorithms that produce natural looking text as outputs. The need for such algorithms is ever increasing: humans produce ever increasing amounts of text each year, and expect computer interfaces to communicate with them in their own language.
Natural language processing is also very challenging, as human language is inherently ambiguous, ever changing, and not well defined. Natural language is symbolic in nature, and the first attempts at processing language were symbolic: based on logic, rules, and ontologies. However, natural language is also highly ambiguous and highly variable, calling for a more statistical algorithmic approach. Indeed, the current day dominant approaches to language processing are all based on statistical machine learning. For over a decade, core NLP techniques were dominated by linear modeling approaches to supervised learning, centered around algorithms such as Perceptrons, linear Support Vector Machines, and Logistic Regression, trained over very high dimensional yet very sparse feature vectors.
Around 2014, the field started to see some success in switching from such linear models over sparse inputs to nonlinear neural network models over dense inputs. Some of the neural network techniques are simple generalizations of the linear models and can be used as almost drop-in replacements for the linear classifiers. Others are more advanced, require a change of mindset, and provide new modeling opportunities. In particular, a family of approaches based on recurrent neural networks (RNNs) alleviates the reliance on the Markov Assumption that was prevalent in sequence models, allowing models to condition on arbitrarily long sequences and produce effective feature extractors. These advances led to breakthroughs in language modeling, automatic machine translation, and various other applications.
While powerful, the neural network methods exhibit a rather strong barrier of entry, for various reasons. In this book, I attempt to provide NLP practitioners as well as newcomers with the basic background, jargon, tools, and methodologies that will allow them to understand the principles behind neural network models for language, and apply them in their own work. I also hope to provide machine learning and neural network practitioners with the background, jargon, tools, and mindset that will allow them to effectively work with language data.
Finally, I hope this book can also serve as a relatively gentle (if somewhat incomplete) introduction to both NLP and machine learning for people who are newcomers to both fields.
Natural language processing (NLP) is the field of designing methods and algorithms that take as input or produce as output unstructured, natural language data. Human language is highly ambiguous (consider the sentence I ate pizza with friends, and compare it to I ate pizza with olives), and also highly variable (the core message of I ate pizza with friends can also be expressed as friends and I shared some pizza). It is also ever changing and evolving. People are great at producing language and understanding language, and are capable of expressing, perceiving, and interpreting very elaborate and nuanced meanings. At the same time, while we humans are great users of language, we are also very poor at formally understanding and describing the rules that govern language.
Understanding and producing language using computers is thus highly challenging. Indeed, the best known set of methods for dealing with language data uses supervised machine learning algorithms that attempt to infer usage patterns and regularities from a set of pre-annotated input and output pairs. Consider, for example, the task of classifying a document into one of four categories: Sports, Politics, Gossip, and Economy. Obviously, the words in the documents provide very strong hints, but which words provide what hints? Writing up rules for this task is rather challenging. However, readers can easily categorize a document into its topic, and then, based on a few hundred human-categorized examples in each category, let a supervised machine learning algorithm come up with the patterns of word usage that help categorize the documents. Machine learning methods excel at problem domains where a good set of rules is very hard to define but annotating the expected output for a given input is relatively simple.
Besides the challenges of dealing with ambiguous and variable inputs in a system with ill-defined and unspecified set of rules, natural language exhibits an additional set of properties that make it even more challenging for computational approaches, including machine learning: it is discrete, compositional, and sparse.
Language is symbolic and discrete. The basic elements of written language are characters. Characters form words that in turn denote objects, concepts, events, actions, and ideas. Both characters and words are discrete symbols: words such as “hamburger” or “pizza” each evoke in us a certain mental representation, but they are also distinct symbols, whose meaning is external to them and left to be interpreted in our heads. There is no inherent relation between “hamburger” and “pizza” that can be inferred from the symbols themselves, or from the individual letters they are made of. Compare that to concepts such as color, prevalent in machine vision, or acoustic signals: these concepts are continuous, allowing us, for example, to move from a colorful image to a gray-scale one using a simple mathematical operation, or to compare two different colors based on inherent properties such as hue and intensity. This cannot be easily done with words: there is no simple operation that will allow us to move from the word “red” to the word “pink” without using a large lookup table or a dictionary.
Language is also compositional: letters form words, and words form phrases and sentences. The meaning of a phrase can be larger than the meaning of the individual words that comprise it, and follows a set of intricate rules. In order to interpret a text, we thus need to work beyond the level of letters and words, and look at long sequences of words such as sentences, or even complete documents.
The combination of the above properties leads to data sparseness. The way in which words (discrete symbols) can be combined to form meanings is practically infinite. The number of possible valid sentences is tremendous: we could never hope to enumerate all of them. Open a random book, and the vast majority of sentences within it you have not seen or heard before. Moreover, it is likely that many four-word sequences that appear in the book are also novel to you. If you were to look at a newspaper from just 10 years ago, or imagine one 10 years in the future, many of the words, in particular names of persons, brands, and corporations, but also slang words and technical terms, will be novel as well. There is no clear way of generalizing from one sentence to another, or defining the similarity between sentences, that does not depend on their meaning, which is unobserved to us. This is very challenging when we come to learn from examples: even with a huge example set we are very likely to observe events that never occurred in the example set, and that are very different than all the examples that did occur in it.
Deep learning is a branch of machine learning. It is a re-branded name for neural networks, a family of learning techniques that was historically inspired by the way computation works in the brain, and which can be characterized as learning of parameterized differentiable mathematical functions. The name deep-learning stems from the fact that many layers of these differentiable functions are often chained together.
While all of machine learning can be characterized as learning to make predictions based on past observations, deep learning approaches work by learning to not only predict but also to correctly represent the data, such that it is suitable for prediction. Given a large set of desired input-output mappings, deep learning approaches work by feeding the data into a network that produces successive transformations of the input data until a final transformation predicts the output. The transformations produced by the network are learned from the given input-output mappings, such that each transformation makes it easier to relate the data to the desired label.
While the human designer is in charge of designing the network architecture and training regime, providing the network with a proper set of input-output examples, and encoding the input data in a suitable way, a lot of the heavy-lifting of learning the correct representation is performed automatically by the network, supported by the network’s architecture.
There are two major kinds of neural network architectures, that can be combined in various ways: feed-forward networks and recurrent/recursive networks.
Feed-forward networks, in particular multi-layer perceptrons (MLPs), allow working with fixed sized inputs, or with variable length inputs in which we can disregard the order of the elements. When feeding the network with a set of input components, it learns to combine them in a meaningful way. MLPs can be used whenever a linear model was previously used. The nonlinearity of the network, as well as the ability to easily integrate pre-trained word embeddings, often leads to superior classification accuracy.
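As a rough illustration of the "variable length inputs in which we can disregard the order" case, the sketch below averages word vectors into a fixed-size input and passes it through a single tanh hidden layer. Everything here is an assumption made for illustration and not taken from the book: the vocabulary, the dimensions, the random (untrained) weights, and the names `EMBEDDINGS`, `affine`, and `mlp_scores`.

```python
import math
import random

random.seed(0)
EMB_DIM, HIDDEN_DIM, NUM_CLASSES = 4, 5, 3
VOCAB = ["i", "ate", "pizza", "with", "friends", "olives"]

def rand_matrix(rows, cols):
    # Untrained weights; a real model would learn these from labeled data.
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# Stand-in for pre-trained word embeddings (random here, for illustration only).
EMBEDDINGS = {w: [random.uniform(-0.5, 0.5) for _ in range(EMB_DIM)] for w in VOCAB}
W1, b1 = rand_matrix(HIDDEN_DIM, EMB_DIM), [0.0] * HIDDEN_DIM
W2, b2 = rand_matrix(NUM_CLASSES, HIDDEN_DIM), [0.0] * NUM_CLASSES

def affine(W, b, x):
    # Computes W x + b for a matrix stored as a list of rows.
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]

def mlp_scores(words):
    """Score a bag of words: average embeddings, one tanh layer, linear output."""
    vectors = [EMBEDDINGS[w] for w in words]
    # Averaging turns any number of words into one fixed-size vector
    # and discards word order.
    avg = [sum(dim) / len(vectors) for dim in zip(*vectors)]
    hidden = [math.tanh(h) for h in affine(W1, b1, avg)]
    return affine(W2, b2, hidden)
```

Calling `mlp_scores(["i", "ate", "pizza"])` returns one score per class; because of the averaging step, reordering the words changes the scores by at most floating-point rounding.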
Convolutional feed-forward networks are specialized architectures that excel at extracting local patterns in the data: they are fed arbitrarily sized inputs, and are capable of extracting meaningful local patterns that are sensitive to word order, regardless of where they appear in the input. These work very well for identifying indicative phrases or idioms of up to a fixed length in long sentences or documents.
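A minimal sketch of the convolution-and-pooling idea, under assumed toy settings: each filter is applied to every window of two consecutive word vectors, and max-pooling keeps the strongest response per filter regardless of position. The vocabulary, dimensions, random untrained weights, and the names `EMB`, `FILTERS`, and `conv_max_pool` are hypothetical and chosen only for this example.

```python
import math
import random

random.seed(0)
EMB_DIM, WINDOW, NUM_FILTERS = 3, 2, 4
VOCAB = ["the", "movie", "was", "not", "good", "at", "all"]

# Randomly initialised (untrained) word vectors and filters, for illustration only.
EMB = {w: [random.uniform(-1, 1) for _ in range(EMB_DIM)] for w in VOCAB}
FILTERS = [[random.uniform(-1, 1) for _ in range(WINDOW * EMB_DIM)]
           for _ in range(NUM_FILTERS)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conv_max_pool(words):
    """Return one pooled value per filter; requires len(words) >= WINDOW."""
    vectors = [EMB[w] for w in words]
    # Each window concatenates WINDOW consecutive word vectors,
    # so local word order is preserved inside the window.
    windows = [sum(vectors[i:i + WINDOW], [])
               for i in range(len(vectors) - WINDOW + 1)]
    # Max-pooling keeps each filter's strongest response, wherever it occurred.
    return [max(math.tanh(dot(f, win)) for win in windows) for f in FILTERS]
```

The output always has `NUM_FILTERS` entries, whether the input has two words or seven, which is what lets a fixed-size classifier sit on top of arbitrarily sized text.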
Recurrent neural networks (RNNs) are specialized models for sequential data. These are network components that take as input a sequence of items, and produce a fixed size vector that summarizes that sequence. As “summarizing a sequence” means different things for different tasks (e.g., the information needed to answer a question about the sentiment of a sentence is different from the information needed to answer a question about its grammaticality), recurrent networks are rarely used as standalone components, and their power is in being trainable components that can be fed into other network components, and trained to work in tandem with them. For example, the output of a recurrent network can be fed into a feed-forward network that will try to predict some value. The recurrent network is used as an input-transformer that is trained to produce informative representations for the feed-forward network that will operate on top of it. Recurrent networks are very impressive models for sequences, and are arguably the most exciting offering of neural networks for language processing. They allow abandoning the Markov assumption that was prevalent in NLP for decades, and designing models that can condition on entire sentences, while taking word order into account when it is needed, and not suffering much from statistical estimation problems stemming from data sparsity. This capability leads to impressive gains in language-modeling, the task of predicting the probability of the next word in a sequence (or, equivalently, the probability of a sequence), which is a cornerstone of many NLP applications. Recursive networks extend recurrent networks from sequences to trees.
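The "sequence in, fixed size vector out" behaviour can be sketched with a simple Elman-style recurrence, where each new state is computed from the current word vector and the previous state. This is only one possible RNN formulation, and the vocabulary, dimensions, random untrained weights, and the name `rnn_encode` are assumptions for illustration.

```python
import math
import random

random.seed(0)
EMB_DIM, STATE_DIM = 3, 4
VOCAB = ["dog", "bites", "man", "the", "old"]

def rand_matrix(rows, cols):
    # Untrained weights; in practice these are learned together with
    # whatever network consumes the RNN output.
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

EMB = {w: [random.uniform(-0.5, 0.5) for _ in range(EMB_DIM)] for w in VOCAB}
W_X = rand_matrix(STATE_DIM, EMB_DIM)    # input-to-state weights
W_S = rand_matrix(STATE_DIM, STATE_DIM)  # state-to-state weights
BIAS = [0.0] * STATE_DIM

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rnn_encode(words):
    """Read words left to right and return the final state vector."""
    state = [0.0] * STATE_DIM
    for w in words:
        x = EMB[w]
        # New state depends on the current word and on everything read so far.
        state = [math.tanh(dot(W_X[i], x) + dot(W_S[i], state) + BIAS[i])
                 for i in range(STATE_DIM)]
    return state
```

The returned vector has `STATE_DIM` entries for any input length, and since each state depends on the previous one, "dog bites man" and "man bites dog" are encoded differently. In a full model this vector would be passed to another component, such as a feed-forward classifier, and both would be trained jointly.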
The essence of supervised machine learning is the creation of mechanisms that can look at examples and produce generalizations. More concretely, rather than designing an algorithm to perform a task (“distinguish spam from non-spam email”), we design an algorithm whose input is a set of labeled examples (“This pile of emails are spam. This other pile of emails are not spam.”), and its output is a function (or a program) that receives an instance (an email) and produces the desired label (spam or not-spam). It is expected that the resulting function will produce correct label predictions also for instances it has not seen during training.
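To make the "input is labeled examples, output is a function" idea concrete, here is a small sketch using a perceptron over bag-of-words features. The perceptron is just one convenient choice of learner, and the four training emails, the label encoding (+1 for spam, -1 for not spam), and the name `train_perceptron` are made up for illustration.

```python
from collections import defaultdict

def train_perceptron(examples, epochs=10):
    """Learn from (text, label) pairs, label +1 (spam) or -1 (not spam).

    Returns a function that maps a new text to "spam" or "not-spam".
    """
    weights = defaultdict(float)  # one weight per word
    bias = 0.0
    for _ in range(epochs):
        for text, label in examples:
            words = text.lower().split()
            score = bias + sum(weights[w] for w in words)
            predicted = 1 if score > 0 else -1
            if predicted != label:
                # Perceptron update: nudge weights toward the correct label.
                for w in words:
                    weights[w] += label
                bias += label

    def classify(text):
        # Words never seen in training contribute nothing to the score.
        score = bias + sum(weights.get(w, 0.0) for w in text.lower().split())
        return "spam" if score > 0 else "not-spam"

    return classify

# Hypothetical toy training set.
examples = [
    ("win free money now", 1),
    ("free prize claim now", 1),
    ("meeting agenda for monday", -1),
    ("lunch with the team monday", -1),
]
classify = train_perceptron(examples)
```

The returned `classify` function can then be applied to emails that were not in the training set, for example `classify("claim your free money")` or `classify("agenda for the meeting")`.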
It is clear that no straight line can separate the two classes.
What is a word? We are using the term word rather loosely. The question “what is a word?” is a matter of debate among linguists, and the answer is not always clear.
One definition (which is the one being loosely followed in this book) is that words are sequences of letters that are separated by whitespace. This definition is very simplistic. First, punctuation in English is not separated by whitespace, so according to our definition dog, dog?, dog. and dog) are all different words. Our corrected definition is then words separated by whitespace or punctuation. A process called tokenization is in charge of splitting text into tokens (what we call here words) based on whitespace and punctuation. In English, the job of the tokenizer is quite simple, although it does need to consider cases such as abbreviations (I.B.M) and titles (Mr.) that needn’t be split. In other languages, things can become much trickier: in Hebrew and Arabic some words attach to the next one without whitespace, and in Chinese there are no whitespaces at all. These are just a few examples.
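A minimal sketch of tokenizing on whitespace and punctuation, with a small exception list for forms that should stay whole. The `KEEP_WHOLE` set and the `tokenize` function are hypothetical; a practical tokenizer would need a far more complete treatment of abbreviations and other corner cases.

```python
import re

# Hypothetical, deliberately tiny list of abbreviations and titles to keep intact.
KEEP_WHOLE = {"Mr.", "Mrs.", "Dr.", "I.B.M"}

# A token is either a run of word characters or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    tokens = []
    for chunk in text.split():  # first split on whitespace
        if chunk in KEEP_WHOLE:
            tokens.append(chunk)  # do not split known abbreviations
        else:
            tokens.extend(TOKEN_RE.findall(chunk))  # peel off punctuation
    return tokens
```

With this sketch, `tokenize("dog, dog? dog. dog)")` yields the same token `dog` four times, each followed by its own punctuation token, while `Mr.` is kept as a single token. An abbreviation followed directly by punctuation (such as `I.B.M,`) would still be split, which is one of the corner cases a fuller tokenizer has to handle.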
When working in English or a similar language (as this book assumes), tokenizing on whitespace and punctuation (while handling a few corner cases) can provide a good approximation of words. However, our definition of word is still quite technical: it is derived from the way things are written. Another common (and better) definition takes a word to be “the smallest unit of meaning.” By following this definition, we see that our whitespace-based definition is problematic. After splitting by whitespace and punctuation, we still remain with sequences such as don’t, that are actually two words, do not, that got merged into one symbol. It is common for English tokenizers to handle these cases as well. The symbols cat and Cat have the same meaning, but are they the same word? More interestingly, take something like New York, is it two words, or one? What about ice cream? Is it the same as ice-cream or icecream? And what about idioms such as kick the bucket?
Lemmas and Stems. We often look at the lemma (the dictionary entry) of the word, mapping forms such as booking, booked, books to their common lemma book. This mapping is usually performed using lemma lexicons or morphological analyzers, that are available for many languages. The lemma of a word can be ambiguous, and lemmatizing is more accurate when the word is given in context. Lemmatization is a linguistically defined process, and may not work well for forms that are not in the lemmatization lexicon, or for mis-spelling. A coarser process than lemmatization, that can work on any sequence of letters, is called stemming. A stemmer maps sequences of words to shorter sequences, based on some language-specific heuristics, such that different inflections will map to the same sequence. Note that the result of stemming need not be a valid word: picture and pictures and pictured will all be stemmed to pictur. Various stemmers exist, with different levels of aggressiveness.
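The heuristic flavour of stemming can be sketched with a toy suffix stripper. This is not any published stemming algorithm: the suffix list, the minimum stem length, and the name `toy_stem` are assumptions picked so that the examples from the passage come out as described.

```python
# Hypothetical suffix list, tried in this order; the first match is stripped.
SUFFIXES = ("ing", "ed", "es", "s", "e")

def toy_stem(word, min_stem_len=3):
    """Strip one common English suffix, keeping at least min_stem_len letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no suffix matched; return the word unchanged
```

Here `toy_stem` maps picture, pictures, and pictured to `pictur`, and booking, booked, and books to `book`. Unlike a lemmatizer, it needs no lexicon and works on any letter sequence, but it will happily produce non-words and conflate unrelated forms.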
In particular, we will discuss 1D convolutional-and-pooling architectures (CNNs), and recurrent neural networks (RNNs). CNNs are neural architectures that are specialized at identifying informative ngrams and gappy-ngrams in a sequence of text, regardless of their position, but while taking local ordering patterns into account. RNNs are neural architectures that are designed to capture subtle patterns and regularities in sequences, and that allow modeling non-Markovian dependencies looking at “infinite windows” around a focus word, while zooming-in on informative sequential patterns in that window.
CNNs and RNNs as Lego Bricks. When learning about the CNN and RNN architectures, it is useful to think about them as “Lego Bricks,” that can be mixed and matched to create a desired structure and to achieve a desired behavior.
This Lego-bricks-like mixing-and-matching is facilitated by the computation-graph mechanism and gradient-based optimization. It allows treating network architectures such as MLPs, CNNs and RNNs as components, or blocks, that can be mixed and matched to create larger and larger structures: one just needs to make sure that the input and output dimensions of the different components match, and the computation graph and gradient-based training will take care of the rest.
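The dimension-matching requirement can be illustrated with a small sketch in which each component declares its input and output sizes and a helper refuses to connect mismatched pieces. This covers only the "make sure the dimensions match" part; it does not build a computation graph or perform gradient-based training. The `Block` class, the `chain` helper, and the two toy components are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Block:
    name: str
    in_dim: int
    out_dim: int
    fn: Callable[[List[float]], List[float]]

def chain(*blocks):
    """Compose blocks left to right, checking that adjacent dimensions agree."""
    for left, right in zip(blocks, blocks[1:]):
        if left.out_dim != right.in_dim:
            raise ValueError(
                f"{left.name} outputs {left.out_dim} dims "
                f"but {right.name} expects {right.in_dim}"
            )

    def run(x):
        for block in blocks:
            x = block.fn(x)
        return x

    # The composition is itself a Block, so it can be chained further.
    return Block("+".join(b.name for b in blocks),
                 blocks[0].in_dim, blocks[-1].out_dim, run)

# Toy stand-ins for real components (e.g. an encoder feeding a scorer).
encoder = Block("encoder", 4, 2, lambda x: [x[0] + x[1], x[2] + x[3]])
scorer = Block("scorer", 2, 1, lambda x: [x[0] - x[1]])
model = chain(encoder, scorer)
```

Because `chain` returns another `Block`, larger structures can be assembled from smaller ones in the same way, while an attempt such as `chain(scorer, encoder)` fails immediately with a `ValueError` about the mismatched sizes.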
This allows us to create large and elaborate network structures, with multiple layers of MLPs, CNNs and RNNs feeding into each other, and to train the entire network in an end-to-end fashion. Several examples are explored in later chapters, but many others are possible, and different tasks may benefit from different architectures. When learning about a new architecture, don’t think “which existing component does it replace?” or “how do I use it to solve a task?” but rather “how can I integrate it into my arsenal of building blocks, and combine it with the other components in order to achieve a desired result?”