I recently needed to work on some NER-related tasks, so I'm studying chapter 11 of this book.
This chapter covers
- Introducing named-entity recognition (NER)
- Overviewing sequence labeling approaches in NLP
- Integrating NER into downstream tasks
What is a downstream task?
Downstream tasks refer to tasks that depend on the output of a previous task. For example, if task A is to extract data from a website, and task B is to analyze the data that was extracted, then task B is a downstream task of task A. Downstream tasks are often used in natural language processing (NLP) to refer to tasks that use the output of a natural language model as input. For example, once a natural language model has generated a set of language inputs, downstream tasks might include tasks like text classification, question answering, or machine translation.
- Introducing further data preprocessing tools and techniques
In this chapter, you will be working with the task of named-entity recognition (NER), concerned with detection and type identification of named entities (NEs). Named entities are real-world objects (people, locations, organizations) that can be referred to with a proper name. The most widely used entity types include person, location (abbreviated as LOC), organization (abbreviated as ORG), and geopolitical entity (abbreviated as GPE).
In practice, the set of named entity types is extended with further expressions such as dates, time references, numerical expressions (e.g., referring to money and currency indicators), and so on. Moreover, the types listed so far are quite general, but NER can also be adapted to other domains and application areas. For example, in biomedical applications, “entities” can denote different types of proteins and genes; in the financial domain, they can cover specific types of products; and so on.
Named entities play an important role in natural language understanding (you have already seen examples from question answering and information extraction) and can be combined with the tasks that you addressed earlier in this book. Such tasks, which rely on the output of NLP tools (e.g., NER models) are technically called downstream tasks, since they aim to solve a problem different from, say, NER itself, but at the same time they benefit from knowing about named entities in text. For instance, identifying entities related to people, locations, organizations, and products in reviews can help better understand users’ or customers’ sentiments toward particular aspects of the product or service.
Examples: the use of NER for two downstream tasks.
In the context of question answering, NER helps to identify the chunks of text that can answer a specific type of a question.
In the context of information extraction, NER can help identify useful characteristics of a product that may be informative on their own or as features in a sentiment analysis or another related task.
Another example of a downstream task in which NER plays a central role is stock market movement prediction. It is widely known that certain types of events influence the trends in stock price movements (for more examples and justification, see Ding et al. [2014], Using Structured Events to Predict Stock Price Movement: An Empirical Investigation, which you can access at https://aclanthology.org/D14-1148.pdf). For instance, the news about Steve Jobs’s death negatively impacted Apple’s stock price immediately after the event, while the news about Microsoft buying a new mobile phone business positively impacted its stock price. Suppose your goal is to build an application that can extract relevant facts from the news (e.g., “Apple’s CEO died”; “Microsoft buys mobile phone business”; “Oracle sues Google”) and then use these facts to predict stock prices for these companies. Figure 11.3 visualizes this idea.
cardinal (also cardinal number)
a number that represents amount, such as 1, 2, 3, rather than order, such as 1st, 2nd, 3rd
(In Chinese: 基数.)
In natural language processing, “cardinal” is a type of entity that refers to a numerical value. In named entity recognition (NER), cardinal entities are numbers that represent a specific quantity, such as “five” or “twenty-three.” Cardinal entities are often distinguished from other types of numerical entities, such as ordinal entities (which indicate a position in a sequence, such as “first” or “third”) and percentage entities (which represent a ratio or proportion, such as “50%”). NER systems may use machine learning algorithms to identify and classify cardinal entities in text data.
The notation used in this sentence is standard for the task of named entity recognition: some labels like DATE and PERSON are self-explanatory; others are abbreviations or short forms of the full labels (e.g., ORG for organization). The set of labels comes from the widely adopted annotation scheme in OntoNotes (see full documentation at http://mng.bz/Qv5R). What is important from a practitioner’s point of view is that this is the scheme that is used in NLP tools, including spaCy.
Table 11.1 lists all named entity types typically used in practice and identified in text by spaCy’s NER component and provides a description and some illustrative examples for each of them.
A couple of observations are due at this point.
The NE labeling presented in table 11.1 is used in spaCy. Familiarize yourself with the types and annotation by running spaCy’s NER on a selected set of examples. You can use the sentences from table 11.1 or experiment with your own set of sentences. Do you disagree with any of the results?
code
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("I bought two books on Amazon")
for ent in doc.ents:
    print(ent.text, ent.label_)
The output is:
two CARDINAL
Amazon ORG
If the model is not installed yet, download it first:
python -m spacy download en_core_web_sm
NOTE Check out the different language models available for use with spaCy: https://spacy.io/models/en.
- The small model (en_core_web_sm) is suitable for most purposes and is more efficient to load and use.
- However, larger models like en_core_web_md (medium) and en_core_web_lg (large) are more powerful, and some NLP tasks will require the use of such larger models. The models should be installed prior to running the code examples with spaCy. You can also install the models from within the Jupyter Notebook using the command
!python -m spacy download en_core_web_md
By the way, spaCy's official documentation is quite good.
NER is a task concerned with identification of a word or phrase that constitutes an entity and with detection of the type of the identified entity. As examples from the previous section suggest, each of these steps has its own challenges.
The first task that an NER algorithm solves is full entity span identification. As you can see in the examples from figure 11.2 and table 11.1, some entities consist of a single word, while others may span whole expressions, and it is not always trivial to identify where an expression starts and where it finishes. For instance, does the full entity consist of Amazon or Amazon River? It would seem reasonable to select the longest sequence of words that are likely to constitute a single named entity. However, compare the following two sentences:
- Amazon River maps from ABC Publishers are sold for $5.
- On Amazon River maps from ABC Publishers are sold for $5.

(If the second sentence baffles you, try adding a comma, as in “On Amazon, River maps from ABC Publishers are sold for $5.”) The first sentence contains a named entity of the type location (Amazon River). Even though the second sentence contains the same sequence of words, each of these two words actually belongs to a different named entity—Amazon is an organization, while River is part of a product name, River maps.
These examples illustrate one of the core reasons why natural language processing is challenging—ambiguity. You have seen some examples of sentences with ambiguous analysis before (e.g., when we discussed parsing and part-of-speech tagging in chapter 4).
For NER, ambiguity poses a number of challenges: one is related to span identification, as just demonstrated. Another one is related to the fact that the same words and phrases may or may not be named entities.
For some examples, consider the following pairs, where the first sentence in each pair contains a word used as a common, general noun, and the second sentence contains the same word being used as (part of) a named entity:
Can you spot any characteristics distinguishing the two types of word usage (as a common noun versus as a named entity) that may help the algorithm distinguish between the two? Think about this question, and we will discuss the answer to it in the next section. (We will come back to this question later.)
Ambiguity in NER poses a challenge, and not only when the algorithm needs to define whether a word or a phrase is a named entity or not. Even if a word or a phrase is identified to be a named entity, the same entity may belong to different NE types.
So,an algorithm has to identify the span of a potential named entity and make sure the identified expression or word is indeed a named entity, since the same phrase or word may or may not be an NE, depending on the context of use. But even when it is established that an expression or a word is a named entity, this named entity may still belong to different types. How does the algorithm deal with these various levels of complexity?
First of all, a typical NER algorithm combines the span identification and the named entity type identification steps into a single, joint task.
Second, it approaches this task as a sequence labeling problem: specifically, it goes through the running text word by word and tries to decide whether a word is part of a specific type of a named entity. Figure 11.5 provides the mental model for this process.
I think "mental model" here might be better translated as 概念模型 (conceptual model)? (A mind map could also be considered a kind of mental model.)
A mental model is a simplified representation of a complex system or concept that we use to understand and reason about that system or concept. Mental models help us process and make sense of new information by allowing us to relate it to concepts and information that we already know. They also help us anticipate how a system or concept is likely to behave in the future based on our understanding of it. Mental models are often used in problem-solving and decision-making, as they allow us to evaluate different options and predict the likely outcomes of different courses of action. They are also an important component of learning and knowledge acquisition, as they help us organize and integrate new information into our existing understanding of the world.
Not surprisingly, named-entity recognition is addressed using machine-learning algorithms, which are capable of learning useful characteristics of the context. NER is typically addressed with supervised machine-learning algorithms, which means that such algorithms are trained on annotated data. To that end, let’s start with the questions of how the data should be labeled for sequential tasks, such as NER, in a way that the algorithm can benefit from the most.
Consider this example:
“Inc.” is an abbreviation for “Incorporated.” It is a legal designation that is used by businesses to indicate that they are a corporation.
We said before that the way the NER algorithm identifies named entities and their types is by considering every word in sequence and deciding whether this word belongs to a named entity of a particular type.
For instance, in Apple Inc., the word Apple is at the beginning of a named entity of type ORG and Inc. is at its end. Explicitly annotating the beginning and the end of a named-entity expression and training the algorithm on such annotation helps it capture the information that if a word Apple is classified as beginning a named entity ORG, it is very likely that it will be followed by a word that finishes this named-entity expression.
The labeling scheme that is widely used for NER and similar sequence labeling tasks is called the BIO scheme, since it comprises three types of tags: beginning, inside, and outside. We said that the goal of an NER algorithm is to jointly assign to every word its position in a named entity and its type, so in fact this scheme is expanded to accommodate the type tags too.
For instance, there are tags B-PER and I-PER for the words beginning and inside of a named entity of the PER type; similarly, there are B-ORG, I-ORG, B-LOC, I-LOC tags, and so on.
O-tag is reserved for all words that are outside of any named entity, and for that reason, it does not have a type extension. Figure 11.6 shows the application of this scheme to the short example “tech giant Apple Inc.”
In total, there are 2n+1 tags for n named entity types plus a single O-tag: for the 18 NE types from the OntoNotes presented in table 11.1, this amounts to 37 tags in total.
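As a quick check on that arithmetic, here is a short snippet of my own (not from the book) that builds the BIO tag set from the 18 OntoNotes types that appear later in this chapter and counts them:

```python
# The 18 OntoNotes entity types used by spaCy (the same ones that show up in the output below).
NE_TYPES = ["PERSON", "NORP", "FAC", "ORG", "GPE", "LOC", "PRODUCT", "EVENT",
            "WORK_OF_ART", "LAW", "LANGUAGE", "DATE", "TIME", "PERCENT",
            "MONEY", "QUANTITY", "ORDINAL", "CARDINAL"]

# One B- and one I- tag per type, plus a single O tag: 2n + 1 tags in total.
bio_tags = ["O"] + [f"{prefix}-{t}" for t in NE_TYPES for prefix in ("B", "I")]
print(len(bio_tags))  # 37
```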
The BIO scheme has two further extensions that you might encounter in practice: a less fine-grained IO scheme, which distinguishes between the inside and outside tags only, and a more fine-grained BIOES scheme, which also adds an end-of-entity tag for each type and a single-word-entity tag for entities that consist of a single word. Table 11.2 illustrates the application of these annotation schemes to the beginning of our example.
The IO, BIO, and BIOES schemes are all annotation schemes that are used to label the words in a text as part of natural language processing tasks, such as named entity recognition (NER). These schemes are used to create training data for machine learning models that are used to identify named entities in text.
The IO (Inside, Outside) scheme is a simple labeling scheme that consists of only two tags: I (Inside) and O (Outside). It is used to indicate whether a word is part of a named entity or not.
The BIO (Beginning, Inside, Outside) scheme is similar to the IO scheme, but it includes an additional tag for the beginning of a named entity. This scheme is often used to identify the boundaries of named entities in text.
The BIOES (Beginning, Inside, Outside, End, Single) scheme is an extension of the BIO scheme that includes additional tags for the end of a named entity and for named entities that consist of a single word. This scheme is used to more accurately identify the boundaries and types of named entities in text.
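To see these BIO-style tags in practice, you can inspect spaCy's per-token annotations: token.ent_iob_ returns "B", "I", or "O", and token.ent_type_ returns the entity type. A minimal sketch (the example sentence is my own; the actual tags depend on the model):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Tech giant Apple Inc. is based in Cupertino.")

# Build a BIO-style tag for every token from spaCy's IOB and type attributes.
for token in doc:
    if token.ent_iob_ == "O":
        tag = "O"
    else:
        tag = f"{token.ent_iob_}-{token.ent_type_}"
    print(f"{token.text:12} {tag}")
```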
Many real-world tasks have a sequential nature.
The author gives two examples:
Example 1: how water temperature changes
As an illustrative example, let’s consider how the temperature of water changes with respect to various possible actions applied to it. Suppose water can stay in one of three states—cold, warm, or hot, as figure 11.7 (left) illustrates. You can apply different actions to it. For example, heat it up or let it cool down. Let’s call a change from one state to another a state transition. Suppose you start heating cold water up and measure water temperature at regular intervals, say, every minute. Most likely you would observe the following sequence of states: cold → . . . → cold → warm → . . . → warm → hot. In other words, to get to the “hot” state, you would first stay in the “cold” state for some time; then you would need to transition through the “warm” state, and finally you would reach the “hot” state. At the same time, it is physically impossible to transition from the “cold” to the “hot” state immediately, bypassing the “warm” state. The reverse is true as well: if you let water cool down, the most likely sequence will be hot → . . . → hot → warm → . . . → warm → cold, but not hot → cold
In fact, these types of observations can be formalized and expressed as probabilities. For example, to estimate how probable it is to transition from the “cold” state to the “warm” state, you use your timed measurements and calculate the proportion of times that the temperature transitioned cold → warm among all the observations made for the “cold” state:
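The formula itself didn't make it into my notes; written out as a relative frequency, it is simply

$$P(\text{cold} \to \text{warm}) = \frac{\text{count}(\text{cold} \to \text{warm})}{\sum_{s \in \{\text{cold},\,\text{warm},\,\text{hot}\}} \text{count}(\text{cold} \to s)}$$

With the figure 11.7 numbers (80% of the time cold water stays cold, and it never jumps straight to hot), this estimate comes out to 0.2.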
Such probabilities estimated from the data and observations simply reflect how often certain events occur compared to other events and all possible outcomes. Figure 11.7 (left) shows the probabilities on the directed edges. The edges between hot → cold and cold → hot are marked with 0.0, reflecting that it is impossible for the temperature to change between “hot” and “cold” directly, bypassing the “warm” state. At the same time, you can see that the edges from a state back to itself are assigned quite high probabilities: P(hot → hot) = 0.8 means that 80% of the time, if water temperature is hot at this particular point in time, it will still be hot at the next time step (e.g., in a minute). Similarly, 60% of the time water will be warm at the next time step if it is currently warm, and in 80% of the cases water will still be cold in a minute from now if it is currently cold.
Also note that this scheme describes the set of possibilities fully: suppose water is currently hot. What temperature will it be in a minute? Follow the arrows in figure 11.7 (left) and you will see that with a probability of 0.8 (or in 80% of the cases), it will still be hot and with a probability of 0.2 (i.e., in the other 20%), it will be warm.
What if it is currently warm? Then, with a probability of 0.6, it will still be warm in a minute, but there is a 20% chance that it will change to hot and a 20% chance that it will change to cold.
Example 2: language
Where do language tasks fit into this? As a matter of fact, language is a highly structured, sequential system. For instance, you can say “Albert Einstein was born in Ulm” or “In Ulm, Albert Einstein was born,” but “Was Ulm Einstein born Albert in” is definitely weird if not nonsensical and can be understood only because we know what each word means and, thus, can still try to make sense of such word salad. At the same time, if you shuffle the words in other expressions like “Ann gave Bob a book,” you might end up not understanding what exactly is being said. In “A Bob book Ann gave,” who did what to whom? This shows that language has a specific structure to it, and if this structure is violated, it is hard to make sense of the result. Figure 11.7 (right) shows a transition system for language, which follows a very similar strategy to the water temperature example from figure 11.7 (left).
It shows that if you see a word “a,” the next word may be “book” (“a book”) with a probability of 0.14, “new” (“a new house”) with a 15% chance, or some other word. If you see a word “new,” with a probability of 0.05, it may be followed by another “new” (“a new, new house”), with an 8% chance it may be followed by “a” (“no matter how new a car is, . . .”), in 17% of the cases it will be followed by “book” (“a new book”), and so on. Finally, if the word that you currently see is “book,” it will be followed by “a” (“book a flight”) 13% of the time, by “new” (“book new flights”) 10% of the time, or by some other word (note that in the language example, not all possible transitions are visualized in figure 11.7). Such predictions on the likely sequences of words are behind many NLP applications. For instance, word prediction is used in predictive keyboards, query completion, and so on. Note that in the examples presented in figure 11.7, the sequential models take into account a single previous state to predict the current state.
Technically, such models are called first-order Markov models or Markov chains.
(In Chinese: 马尔科夫模型或马尔科夫链.)
In a Markov chain, the probabilistic transitions between states are described by a transition matrix, which specifies the probability of transitioning from one state to another. The state of the system at any given time is called a Markov state, and the set of all possible states is called the state space. The behavior of the system over time is described by a sequence of random variables, called a Markov process.
It is also possible to take into account a longer history of events.
For example, second-order Markov models look at the two previous states to predict the current state, and so on.
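As a small illustration of a first-order Markov chain, here is a sketch of my own (using the water-temperature probabilities from figure 11.7) that samples a sequence of states where each next state depends only on the current one:

```python
import random

# Transition probabilities from figure 11.7 (left): a first-order Markov chain
# over the three states cold / warm / hot.
transitions = {
    "cold": {"cold": 0.8, "warm": 0.2, "hot": 0.0},
    "warm": {"cold": 0.2, "warm": 0.6, "hot": 0.2},
    "hot":  {"cold": 0.0, "warm": 0.2, "hot": 0.8},
}

def sample_sequence(start, steps=10):
    state = start
    sequence = [state]
    for _ in range(steps):
        # The next state depends only on the current state (first-order assumption).
        next_states = list(transitions[state])
        weights = [transitions[state][s] for s in next_states]
        state = random.choices(next_states, weights=weights)[0]
        sequence.append(state)
    return sequence

print(sample_sequence("cold"))
# e.g., ['cold', 'cold', 'cold', 'warm', 'warm', 'hot', ...] -- never cold -> hot directly
```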
NLP models that do not observe word order and shuffle words freely (as in “A Bob book Ann gave”) are called bag-of-words models. The analogy is that when you put words in a “bag,” their relative order is lost, and they get mixed among themselves like individual items in a bag. A number of NLP tasks use bag-of-words models. The tasks that you worked on before made little if any use of the sequential nature of language. Sometimes the presence of individual words is informative enough for the algorithm to identify a class (e.g., lottery strongly suggests spam, amazing is a strong signal of a positive sentiment, and rugby has a strong association with the sports topic). Yet, as we have noted earlier in this chapter, for NER it might not be enough to just observe a word (is “Apple” a fruit or a company?) or even a combination of words (as in “Amazon River Maps”). More information needs to be extracted from the context and the way the previous words are labeled with NER tags. In the next section, you will look closely into how NER uses sequential information and how sequential information is encoded as features for the algorithm to make its decisions.
Just like water temperature cannot change from “cold” immediately to “hot” or vice versa without going through the state of being “warm,” and just like there are certain sequential rules to how words are put together in a sentence (with “a new book” being much more likely in English than “a book new”), there are certain sequential rules to be observed in NER.
For instance, if a certain word is labeled as beginning a particular type of an entity (e.g., B-GPE for “New” in “New York”), it cannot be directly followed by an NE tag denoting inside of an entity of another type (e.g., I-EVENT cannot be assigned to “York” in “New York” when “New” is already labeled as B-GPE, as I-GPE is the correct tag).
In contrast, I-EVENT is applicable to “Year” in “New Year” after “New” being tagged as B-EVENT. To make such decisions, an NER algorithm takes into account the context, the labels assigned to the previous words, and the current word and its properties.
Let’s consider two examples with somewhat similar contexts:
Your goal in the NER task is to assign the most likely sequence of tags to each sentence. Ideally, you would like to end up with the following labeling for the sentences: O – O – B-EVENT – I-EVENT for “They celebrated New Year” and O – O – O – B-GPE – I-GPE for “They live in New York.”
Figure 11.8 visualizes such “ideal” labeling for “They celebrated New Year” (using an abbreviation EVT for EVENT for space reasons
As figure 11.8 shows, it is possible to start a sentence with a word labeled as a beginning of some named entity, such as B-EVENT or B-EVT (as in “Christmas B-EVT is celebrated on December 25”).
However, it is not possible to start a sentence with I-EVT (the tag for inside the EVENT entity), which is why it is grayed out in figure 11.8 and there is no arrow connecting the beginning of the sentence (the START state) to I-EVT. Since the second word, “celebrated,” is a verb, it is unlikely that it belongs to any named entity type; therefore, the most likely tag for it is O.
“New” can be at the beginning of event (B-EVT as in “New Year”) or another entity type (e.g., B-GPE as in “New York”), or it can be a word used outside any entity (O).
Finally, the only two possible transitions after tag B-EVT are O (if an event is named with a single word, like “Christmas”) or I-EVT. All possible transitions are marked with arrows in figure 11.8; all impossible states are grayed out with the impossible transitions dropped (i.e., no connecting arrows); and the states and transitions highlighted in bold are the preferred ones.
As you can see, there are multiple sources of information that are taken into account here: word position in the sentence matters (tags of the types O and B-ENTITY—outside an entity and beginning an entity, respectively—can apply to the first word in a sentence, but I-ENTITY cannot); word characteristics matter (a verb like “celebrate” is unlikely to be part of any entity); the previous word and tag matter (if the previous tag is B-EVENT, the current tag is either I-EVENT or O); the word shape matters (capital N in “New” makes it a better candidate for being part of an entity, while the most likely tag for “new” is O); and so on.
This is, essentially(本质上), how the algorithm tries to assign the correct tag to each word in the sequence.
For instance, suppose you have assigned tags O – O – B-EVENT to the sequence “They celebrated New” and your current goal is to assign an NE tag to the word “Year”. The algorithm may consider a whole set of characteristic rules—let’s call them features by analogy with the features used by supervised machine-learning algorithms in other tasks. The features in NER can use any information related to the current NE tag and previous NE tags, the current word and the preceding context, and the position of the word in the sentence.
Let’s define some feature templates for the features helping the algorithm predict that $\text{word}_4$ in “They celebrated New Year” (i.e., $\text{word}_4$ = “Year”) should be assigned the tag I-EVENT after the previous word “New” is assigned B-EVENT. It is common to use the notation $y_i$ for the current tag, $y_{i-1}$ for the previous one, $X$ for the input, and $i$ for the position, so let’s use this notation in the feature templates.
A gazetteer (e.g., www.geonames.org) is a list of place names with millions of entries for locations, including detailed geographical and political information. It is a very useful resource for identification of LOC, GPE, and some other types of named entities.

Word shape is determined as follows: capital letters are replaced with X, lowercase letters are replaced with x, numbers are replaced with d, and punctuation marks are preserved; for example, “U.S.A.” can be represented as “X.X.X.” and “11–12p.m.” as “d–dx.x.” This helps capture useful generalizable information.
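A minimal word-shape function following this description might look like the sketch below (my own code, with run-collapsing so the output matches the book's "d–dx.x." example; spaCy exposes a related attribute, token.shape_, which behaves slightly differently):

```python
def word_shape(word):
    # Map each character: capital letter -> X, lowercase letter -> x,
    # digit -> d; punctuation and anything else is kept as-is.
    mapped = []
    for ch in word:
        if ch.isupper():
            mapped.append("X")
        elif ch.islower():
            mapped.append("x")
        elif ch.isdigit():
            mapped.append("d")
        else:
            mapped.append(ch)
    # Collapse runs of identical shape characters, so "11" becomes "d".
    shape = []
    for ch in mapped:
        if not shape or shape[-1] != ch:
            shape.append(ch)
    return "".join(shape)

print(word_shape("U.S.A."))      # X.X.X.
print(word_shape("11-12p.m."))   # d-dx.x.
```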
Feature indexes used in this list are made up, and as you can see, the list of features grows quickly with the examples from the data. When applied to our example, the features will yield the following values:
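The book's list of feature templates and their values isn't reproduced in my notes, so the sketch below (my own; the names f1–f3 are made up, just like the book's feature indexes) shows what such binary features could look like as functions of $y_{i-1}$, $y_i$, the input $X$, and the position $i$:

```python
# A few illustrative binary feature functions f_k(y_prev, y_curr, X, i).
def f1(y_prev, y_curr, X, i):
    # Fires when the previous tag is B-EVENT and the current tag is I-EVENT.
    return 1 if y_prev == "B-EVENT" and y_curr == "I-EVENT" else 0

def f2(y_prev, y_curr, X, i):
    # Fires when the current word starts with a capital letter and the tag is not O.
    return 1 if X[i][0].isupper() and y_curr != "O" else 0

def f3(y_prev, y_curr, X, i):
    # Fires when the current word is "Year" and the current tag is I-EVENT.
    return 1 if X[i] == "Year" and y_curr == "I-EVENT" else 0

X = ["They", "celebrated", "New", "Year"]
# For word_4 = "Year" (position i = 3, 0-indexed), with "New" already tagged B-EVENT:
for f in (f1, f2, f3):
    print(f.__name__, f("B-EVENT", "I-EVENT", X, 3))  # each of these fires (prints 1)
```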
It should be noted that no single feature is capable of correctly identifying an NE tag in all cases; moreover, some features may be more informative than others. What the algorithm does in practice is it weighs the contribution from each feature according to its informativeness and then it combines the values from all features, ranging from feature $k = 1$ to feature $k = K$ (where $k$ is just an index), by summing the individual contributions as follows.
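The summation formula itself isn't reproduced in my notes; in the standard linear-chain notation it is roughly the weighted sum

$$\sum_{k=1}^{K} w_k \, f_k(y_{i-1}, y_i, X, i)$$

where $w_k$ is the weight learned for feature $f_k$.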
The appropriate weights in this equation are learned from labeled data, as is normally done for supervised machine-learning algorithms. As was pointed out earlier, the ultimate goal of the algorithm is to assign the correct tags to all words in the sequence, so the expression is actually applied to each word in sequence, from $i = 1$ (i.e., the first word in the sentence) to $i = n$ (the last word); that is:
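Reconstructing the missing formula from the surrounding description, the sentence-level score is the same weighted sum taken over every position:

$$\sum_{i=1}^{n} \sum_{k=1}^{K} w_k \, f_k(y_{i-1}, y_i, X, i)$$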
Specifically, this means that the algorithm is not only concerned with the correct assignment of the tag I-EVENT to “Year” in “They celebrated New Year”, but also with the correct assignment of the whole sequence of tags O – O – B-EVENT – I-EVENT to “They celebrated New Year”.

However, originally, the algorithm knows nothing about the correct tag for “They” and the correct tag for “celebrated” following “They”, and so on. Since originally the algorithm doesn’t know about the correct tags for the previous words, it actually considers all possible tags for the first word, then all possible tags for the second word, and so on. In other words, for the first word, it considers whether “They” can be tagged as B-EVENT, I-EVENT, B-GPE, I-GPE, . . . , O, as figure 11.8 demonstrated earlier; then for each tag applied to “They”, the algorithm moves on and considers whether “celebrated” can be tagged as B-EVENT, I-EVENT, B-GPE, I-GPE, . . . , O; and so on.
In the end, the result you are interested in is the sequence of all NE tags for all words that is most probable; that is
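Equation 11.3 isn't copied in my notes either; based on the description below, it should look roughly like

$$\hat{Y} = \arg\max_{Y \in \mathcal{Y}} \sum_{i=1}^{n} \sum_{k=1}^{K} w_k \, f_k(y_{i-1}, y_i, X, i)$$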
The formula in Equation 11.3 is exactly the same as the one in Equation 11.2, with just one modification: argmax means that you are looking for the sequence that results in the highest probability estimated by the rest of the formula; $Y$ stands for the whole sequence of tags for all words in the input sentence; and the calligraphic $\mathcal{Y}$ denotes the full set of possible combinations of tags.
Recall the three BIO-style schemes introduced earlier in this chapter: the most coarse-grained IO scheme has 19 tags, which means that the total number of possible tag combinations for the sentence “They celebrated New Year”, consisting of 4 words, is $19^4 = 130,321$; the middle-range BIO scheme contains 37 distinct tags and results in $37^4 = 1,874,161$ possible combinations; and finally, the most fine-grained BIOES scheme results in $73^4 = 28,398,241$ possible tag combinations for a sentence consisting of 4 words.
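These counts are easy to verify (a quick check of my own):

```python
# Number of possible tag sequences for a 4-word sentence under each scheme.
print(19 ** 4)  # 130321   (IO: 18 I- tags + O)
print(37 ** 4)  # 1874161  (BIO: 2 * 18 + 1 tags)
print(73 ** 4)  # 28398241 (BIOES: 4 * 18 + 1 tags)
```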
Note that a sentence consisting of 4 words is a relatively short sentence, yet the brute-force algorithm (i.e., the one that simply iterates through each possible combination at each step) rapidly becomes highly inefficient. After all, some tag combinations (like O → I-EVENT) are impossible, so there is no point in wasting effort on even considering them. In practice, instead of a brute-force algorithm, more efficient algorithms based on dynamic programming are used; the algorithm that is widely used for language-related sequence labeling tasks is the Viterbi algorithm.
Instead of exhaustively considering all possible combinations, at each step a dynamic programming algorithm calculates the probability of all possible solutions given only the best, most optimal solution for the previous step. The algorithm then calculates the best move at the current point and stores it as the current best solution. When it moves to the next step, it again considers only this best solution rather than all possible solutions, thus considerably reducing the number of overall possibilities to only the most promising ones. Figure 11.9 demonstrates the intuition behind dynamic estimation of the best NE tag that should be selected for “Year” given that the optimal solution O – O – B-EVENT is found for “They celebrated New”.
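To make the dynamic programming idea concrete, here is a small Viterbi sketch (my own code, not the book's or spaCy's implementation; the tag set, the transition scores, and the hand-crafted emission scores are toy assumptions standing in for learned feature weights):

```python
def viterbi(words, tags, start_scores, transition_scores, emission_scores):
    """Return the highest-scoring tag sequence, keeping only the best
    partial solution per tag at every step instead of all combinations."""
    # best[i][t]: score of the best tag sequence for words[: i + 1] ending in tag t
    best = [{t: start_scores[t] + emission_scores(words[0], t) for t in tags}]
    backptr = [{}]
    for i in range(1, len(words)):
        best.append({})
        backptr.append({})
        for t in tags:
            scores = {p: best[i - 1][p] + transition_scores[p][t] for p in tags}
            best_prev = max(scores, key=scores.get)
            best[i][t] = scores[best_prev] + emission_scores(words[i], t)
            backptr[i][t] = best_prev
    # Trace the best path back from the final position.
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(backptr[i][path[-1]])
    return list(reversed(path))

# Toy setup: three tags; impossible moves (e.g., O -> I-EVT) get minus infinity.
NEG_INF = float("-inf")
tags = ["B-EVT", "I-EVT", "O"]
start = {"B-EVT": 0.0, "I-EVT": NEG_INF, "O": 0.0}
trans = {
    "B-EVT": {"B-EVT": NEG_INF, "I-EVT": 1.0, "O": 0.0},
    "I-EVT": {"B-EVT": NEG_INF, "I-EVT": 0.5, "O": 0.0},
    "O":     {"B-EVT": 0.5, "I-EVT": NEG_INF, "O": 0.5},
}
# Hand-crafted emission scores standing in for the weighted feature sums.
emission = {
    "They":       {"B-EVT": -1.0, "I-EVT": -1.0, "O": 1.0},
    "celebrated": {"B-EVT": -1.0, "I-EVT": -1.0, "O": 1.0},
    "New":        {"B-EVT": 1.0, "I-EVT": 0.5, "O": 0.0},
    "Year":       {"B-EVT": 0.0, "I-EVT": 1.0, "O": 0.0},
}

result = viterbi(["They", "celebrated", "New", "Year"], tags, start, trans,
                 lambda word, tag: emission[word][tag])
print(result)  # ['O', 'O', 'B-EVT', 'I-EVT']
```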
This, in a nutshell, is how a sequence labeling algorithm solves the task of tag assignment. As was highlighted before, NER is not the only task that demonstrates sequential effects, and a number of other tasks in NLP are solved this way.
The approach to sequence labeling outlined in this section is used by machine-learning algorithms, most notably conditional random fields, although you don’t need to implement your own NER to be able to benefit from the results of this step in the NLP pipeline. For instance, spaCy has an NER implementation that you are going to rely on to solve the task set out in the scenario for this chapter. The next section delves into implementation details.
Conditional random fields (CRFs) are a type of probabilistic graphical model used for modeling and predicting structured data, such as sequences or sets of interconnected items. Like other graphical models, CRFs use a graph structure to represent the relationships between different variables and their dependencies. However, unlike many other graphical models, CRFs are specifically designed to handle structured data and make use of the relationships between variables to improve prediction accuracy.
CRFs are often used in natural language processing tasks, such as part-of-speech tagging and named entity recognition, where the input data consists of a sequence of words or other tokens and the output is a sequence of tags or labels. CRFs can also be applied to other types of structured data, such as biological sequences, and have been used in a variety of other applications, including image segmentation and handwritten character recognition.
CRFs are related to other probabilistic models, such as hidden Markov models and Markov random fields, and can be trained using a variety of algorithms, including gradient descent and the Expectation-Maximization (EM) algorithm.
The author now brings us back to the stock market example from the beginning of the chapter.
Let’s remind ourselves of the scenario for this chapter. It is widely known that certain events influence the trends of stock price movements. Specifically, you can extract relevant facts from the news and then use these facts to predict company stock prices.
Suppose you have access to a large collection of news; now your task is to extract the relevant events and facts that can be linked to the stock market in the downstream (stock market price prediction) application. How will you do that? This means that you have access to a collection of news texts, and among other preprocessing steps, you apply NER. Then you can focus only on the texts and sentences that are relevant for your task.
For instance, if you are interested in the recent events, in which a particular company (e.g., “Apple”) participated, you can easily identify such texts, sentences, and contexts. Figure 11.10 shows a flow diagram for this process.
The dataset used is this one from Kaggle: All the news | Kaggle
The dataset consists of 143,000 articles scraped from 15 news websites, including the New York Times, CNN, Business Insider, Washington Post, and so on.
The dataset is quite big and is split into three comma-separated values (CSV) files. In the examples in this chapter, you are going to be working with the file called articles1.csv, but you are free to use other files in your own experiments.
csv
Comma-separated values (CSV) is a simple file format used to store tabular data, such as a spreadsheet or database. A CSV file stores data in plain text, with each line representing a row of the table and each field (column) within that row separated by a comma.
Many datasets available via Kaggle and similar platforms are stored in the .csv format. This basically means that the data is stored as a big spreadsheet file, where information is split between different rows and columns. For instance, in articles1.csv, each row represents a single news article, described with a set of columns containing information on its title, author, the source website, the date of publication, its full content, and so on. The separator used to define the boundary between the information belonging to different data fields in .csv files is a comma. It’s time now to familiarize yourselves with pandas, a useful data-preprocessing toolkit that helps you work with files in such formats as .csv and easily extract information from them.
import pandas as pd
path = "all-the-news/"
df = pd.read_csv(path + "articles1.csv")
The df obtained from reading the file is a DataFrame, a labeled data structure with columns of potentially different types.
df.shape
df.shape returns the dimensionality of the data structure:
(50000, 10)
df.head()
sources = df["publication"].unique()
print(sources)
['New York Times' 'Breitbart' 'CNN' 'Business Insider' 'Atlantic']
Since the DataFrame contains as many as 50,000 articles, for the sake of this application, let's focus on some articles only. We will extract the text (content) of the first 1,000 articles published in the New York Times.
condition = df["publication"].isin(["New York Times"])
content_df = df.loc[condition, :]["content"][:1000]
content_df.shape
content_df.head()
In this code, df.loc[condition, :] is used to select rows from the DataFrame df based on the condition provided. The : after the comma indicates that all columns should be selected for the rows that match the condition.
type(content_df):pandas.core.series.Series
(1000,)
0 WASHINGTON — Congressional Republicans have...
1 After the bullet shells get counted, the blood...
2 When Walt Disney’s “Bambi” opened in 1942, cri...
3 Death may be the great equalizer, but it isn’t...
4 SEOUL, South Korea — North Korea’s leader, ...
Name: content, dtype: object
Let's take a look at what the intermediate variables contain.
Now that the data is loaded, let’s explore what entity types it contains.
Let’s start by iterating through the news articles, collecting all named entities identified in texts and storing the number of occurrences in a Python dictionary.
import spacy
nlp = spacy.load("en_core_web_md")

def collect_entites(data_frame):
    named_entities = {}
    processed_docs = []
    for item in data_frame:
        doc = nlp(item)
        processed_docs.append(doc)
        for ent in doc.ents:
            # For each entity, extract the text with ent.text (e.g., Apple) and store it as entity_text.
            entity_text = ent.text  # e.g., Apple
            # Identify the type of the entity with ent.label_ (e.g., ORG) and store it as entity_type.
            entity_type = str(ent.label_)  # e.g., ORG
            current_ents = {}  # e.g., {"Apple": 1, "Facebook": 2, ...}
            # For each entity type, extract the dictionary of currently stored entities with their counts.
            if entity_type in named_entities.keys():
                current_ents = named_entities.get(entity_type)
            # Update the counts in current_ents, incrementing the count for the entity stored as entity_text.
            current_ents[entity_text] = current_ents.get(entity_text, 0) + 1
            named_entities[entity_type] = current_ents
    return named_entities, processed_docs
named_entities, processed_docs = collect_entites(content_df)
What does this code do?
This code uses the spaCy library to extract named entities from a data frame of text content. The collect_entites function takes a data frame as an input and returns two outputs: a dictionary of named entities and a list of processed documents. First, it initializes two empty variables, named_entities and processed_docs. Then, it iterates through each item in the input data frame and applies spaCy's natural language processing (NLP) model to each item, creating a spaCy "doc" object for each one. These doc objects are then appended to the processed_docs list.

Next, for each named entity in the doc object, the code extracts the entity's text and label and uses them to update the named_entities dictionary. The dictionary is structured with entity labels as keys (e.g., "ORG" for organizations) and values that are another dictionary containing the entities as keys and their counts as values (e.g., {"Apple": 2, "Facebook": 3} for the "ORG" label). If a new entity label is encountered, it is added as a new key in the named_entities dictionary with an empty dictionary as its value.

Finally, the function returns both the named_entities dictionary and the processed_docs list. It is important to note that this code extracts the entities and counts them, but it doesn't handle lowercase named entities or common English words such as "The" and "And", which might be extracted as entities in some scenarios.
At this step I tried many fixes, such as installing ipywidgets, but none of them worked.
I suspected the md model was too large for the Codespace to handle; switching to sm still didn't help.
Exporting the Codespace to a repo also failed…
Let's try Colab instead…
I first forked the Codespace into one of my repos and then pulled it into Colab with git.
Hmm… shall we try Kaggle instead?
When reading files on Kaggle, remember to prefix the path with ../input.
By the way, both Kaggle and Colab can be linked to GitHub.
def print_out(named_entities):
    for key in named_entities.keys():
        print(key)
        entities = named_entities.get(key)
        sorted_keys = sorted(entities, key=entities.get, reverse=True)
        for item in sorted_keys[:10]:
            if entities.get(item) > 1:
                print(" " + item + ": " + str(entities.get(item)))
print_out(named_entities)
This code defines a function called print_out, which takes a named_entities dictionary as an input and prints out the top 10 entities for each entity label, sorted by count in descending order. First, the function uses a for loop to iterate through the keys of the named_entities dictionary. For each key, it gets the corresponding dictionary of entities and counts (e.g., {"Apple": 2, "Facebook": 3} for the "ORG" label) and assigns it to the variable entities.

Next, it uses the sorted function with the key parameter set to entities.get to sort the entities by count; the reverse=True argument makes the order descending. Then, it iterates through the top 10 entities in the sorted list and checks if their count is greater than 1, and if so, it prints out the entity and its count, indented for readability. This will help you identify the most common entities in the text.
The results are as follows:
ORG
Senate: 373
Congress: 352
the White House: 258
White House: 249
The New York Times: 221
House: 203
Times: 192
WASHINGTON: 125
Volkswagen: 118
the European Union: 111
NORP
American: 984
Republicans: 525
Republican: 473
Democrats: 400
Russian: 337
Chinese: 288
Americans: 270
British: 185
Muslim: 170
Democrat: 165
MONEY
1: 66
2: 23
millions of dollars: 20
10: 19
3: 19
billions of dollars: 18
100: 17
5: 17
4: 15
$1 billion: 14
CARDINAL
one: 1006
two: 908
three: 353
One: 333
seven: 170
four: 168
1: 141
•: 137
five: 132
thousands: 122
DATE
Friday: 436
Wednesday: 374
Tuesday: 338
Monday: 295
Sunday: 279
Thursday: 279
last year: 254
Saturday: 243
years: 205
2014: 205
PERSON
Trump: 3401
Obama: 372
Donald J. Trump: 199
Clinton: 185
Spicer: 136
Sessions: 123
Gorsuch: 118
Hillary Clinton: 114
Kushner: 103
Putin: 99
GPE
the United States: 1173
Russia: 526
China: 515
Washington: 499
New York: 364
America: 358
Iran: 294
Mexico: 264
Britain: 236
California: 206
LAW
the Affordable Care Act: 114
Constitution: 93
Roe v. Wade: 13
the Clean Air Act: 12
the Geneva Conventions: 9
the First Amendment: 6
The Affordable Care Act: 5
the Affordable Care Act’s: 4
the First Amendment’s: 3
the Johnson Amendment: 3
LOC
Europe: 150
Africa: 54
Asia: 52
Silicon Valley: 47
Earth: 40
the Middle East: 39
North: 37
South: 30
Pacific: 25
West: 23
ORDINAL
first: 1222
second: 224
third: 89
First: 38
fourth: 38
fifth: 21
45th: 12
Second: 10
sixth: 10
40th: 7
TIME
morning: 150
night: 133
evening: 75
hours: 73
afternoon: 64
that night: 23
the morning: 22
last night: 21
tonight: 20
overnight: 17
FAC
Broadway: 76
the White House: 48
Trump Tower: 45
Capitol: 31
Vatican: 31
Fifth Avenue: 13
the C. I. A.: 13
Kennedy Airport: 13
the Oval Office: 11
Times Square: 11
PRODUCT
Twitter: 76
Obamas: 14
Cowboys: 13
Facebook: 12
Adumim: 11
Manchester: 8
Boko Haram: 6
Cellectis: 6
Air Force One: 5
Grammy: 5
PERCENT
5 percent: 23
20 percent: 21
3 percent: 15
4 percent: 15
6 percent: 13
2 percent: 13
1 percent: 13
40 percent: 12
100 percent: 10
9 percent: 10
EVENT
World War II: 49
the Super Bowl: 36
Super Bowl: 28
New Year’s Eve: 18
Inauguration Day: 17
Olympic: 14
the Vietnam War: 12
Wimbledon: 12
New Year’s Day: 11
the Cold War: 11
WORK_OF_ART
Saturday Night Live: 25
Moonlight: 20
La La Land: 19
Bible: 17
The Daily: 16
Your Morning Briefing: 15
Hidden Figures: 15
The Mary Tyler Moore Show: 14
La La Land”: 14
Meet the Press: 11
QUANTITY
about 30 miles: 3
40 miles: 3
hundreds of miles: 3
less than a mile: 3
150 miles: 3
about 600 miles: 3
more than 100 pounds: 2
20 pounds: 2
hundreds of feet: 2
Three miles: 2
LANGUAGE
English: 46
Arabic: 9
Spanish: 6
Hebrew: 5
Mandarin: 3
Filipino: 3
French: 2
Excedrin: 2
Perhaps not surprisingly, the most frequent geopolitical entity mentioned in the New York Times articles is the United States. It is mentioned 1,148 times in total; in fact, this can be combined with the counts for other expressions used for the same entity (e.g., America and the like). It is followed by Russia (526 times) and China (515). As you can see, there is a lot of information contained in this dictionary.
Another way in which you can explore the statistics on various NE types is to aggregate the counts on the types and print out the number of unique entries (e.g., Apple and Facebook would be counted as two separate named entities under the ORG type), as well as the total number of occurrences of each type (e.g., 175 counts for Facebook and 63 for Apple would result in the total number of 238 occurrences of the type ORG). Listing 11.7 shows how to do that. It suggests that you print out information on the entity type (e.g., ORG), the number of unique entries belonging to a particular type (e.g., Apple and Facebook would contribute as two different entries for ORG), and the total number of occurrences of the entities of that particular type. To do that, you extract and aggregate the statistics for each NE type, and in the end, you print out the results in a tabulated format, with each row storing the statistics on a separate NE type.
rows = []
rows.append(["Type:", "Entries:", "Total:"])
for ent_type in named_entities.keys():
    # Extract and aggregate the statistics for each NE type.
    rows.append([ent_type, str(len(named_entities.get(ent_type))),
                 str(sum(named_entities.get(ent_type).values()))])
columns = zip(*rows)
column_widths = [max(len(item) for item in col) for col in columns]
for row in rows:
    print(''.join(' {:{width}} '.format(row[i], width=column_widths[i])
                  for i in range(0, len(row))))
This code defines a list called rows and appends a header row to it. Then, it uses a for loop to iterate through the keys of the named_entities dictionary (the entity types); for each type, it calculates the number of unique entities and the total count of that type and appends those values to the rows list as a new row. Then, it uses the zip function to transpose the rows so that the columns of data can be processed. Next, it uses a list comprehension to determine the maximum length of each column of data and saves it to the variable column_widths. Finally, it uses a nested for loop to iterate through the rows and columns and print out the data, formatted to the widths specified in column_widths. This will create a cleanly formatted table with three columns ("Type", "Entries", and "Total") that shows the number of unique entities and the total count for each entity label. This way, you can easily spot which label types have more unique entities, or the larger total count of entities, than others.
The output is as follows:
As this table shows, the most frequently used named entities in the news articles are entities of the following types: PERSON, GPE, ORG, and DATE. This is, perhaps, not very surprising. After all, most often news reports on the events that are related to people (PERSON), companies (ORG), and countries (GPE), and usually news articles include references to specific dates. At the same time, the least frequently used entities are the ones of the type LANGUAGE: there are only 12 unique languages mentioned in this news dataset, and in total they are mentioned 85 times. Among the most frequently mentioned are English (48 times), Arabic (8), and Spanish (7). You may also note that the ORDINAL type has only 68 unique entries: it is, naturally, a very compact list of items including entries like first, second, third, and so on.
Consider the scenario again: your task is to build an information extraction application focused on companies and the news that reports on these companies. The dataset at hand, according to table 11.4, contains information on as many as 4,892 companies. Of course, not all of them will be of interest to you, so it would make sense to select a few and extract information on them.
Chapter 4 looked into the information extraction task, which was concerned with the extraction of relevant facts (e.g., actions in which certain personalities of interest are involved). Let’s revisit this task here, making the necessary modifications. Specifically, the following ones:
entity = "The New York Times"
sentences = ["The New York Times wrote about Apple"]
def extract_span(sent, entity):
indexes = []
for ent in sent.ents:
if ent.text==entity:
for i in range(int(ent.start), int(ent.end)):
indexes.append(i)
return indexes
def extract_information(sent, entity, indexes):
# Initialize the list of actions and an action with two participants.
actions = []
action = ""
participant1 = ""
participant2 = ""
for token in sent:
if token.pos_=="VERB" and token.dep_=="ROOT":
# Initialize the indexes for the subject and the object related to the main verb.
subj_ind = -1
obj_ind = -1
# Store the main verb itself in the action variable.
action = token.text
children = [child for child in token.children]
for child1 in children:
# Find the subject via the nsubj relation and store it as participant1 and its index as subj_ind.
if child1.dep_=="nsubj":
participant1 = child1.text
subj_ind = int(child1.i)
if child1.dep_=="prep":
participant2 = ""
child1_children = [child for child in child1.children]
for child2 in child1_children:
if child2.pos_ == "NOUN" or child2.pos_ == "PROPN":
participant2 = child2.text
obj_ind = int(child2.i)
if not participant2=="":
if subj_ind in indexes:
actions.append(entity + " " + action + " " + child1.text + " " + participant2)
elif obj_ind in indexes:
actions.append(participant1 + " " + action + " " + child1.text + " " + entity)
# If both participants of the action are identified,add the action with two participants to the list of actions.
if child1.dep_=="dobj" and (child1.pos_ == "NOUN"
or child1.pos_ == "PROPN"):
participant2 = child1.text
obj_ind = int(child1.i)
if subj_ind in indexes:
actions.append(entity + " " + action + " " + participant2)
elif obj_ind in indexes:
# If there is no preposition attached to the verb, find a direct object of the main verb via the dobj relation.
actions.append(participant1 + " " + action + " " + entity)
# If the final list of actions is not empty, print out the sentence and all actions together with the participants.
if not len(actions)==0:
print (f"\nSentence = {sent}")
for item in actions:
print(item)
for sent in sentences:
doc = nlp(sent)
indexes = extract_span(doc, entity)
print(indexes)
extract_information(doc, entity, indexes)
Now let’s apply this code to your texts extracted from the news articles. Note, however, that the code in listing 11.9 applies to the sentence level, since it relies on the information extracted from the parser (which applies to each sentence rather than the whole text). In addition, if you are only interested in a particular entity, it doesn’t make sense to waste the algorithm’s efforts on the texts and sentences that don’t mention this entity at all. To this end, let’s first extract all sentences that mention the entity in question from processed_docs and then apply the extract_information method to extract all tuples (participant1 + action + participant2) from the sentences, where either participant1 or participant2 is the entity you are interested in.
def entity_detector(processed_docs, entity, ent_type):
    output_sentences = []
    for doc in processed_docs:
        for sent in doc.sents:
            if entity in [ent.text for ent in sent.ents if ent.label_==ent_type]:
                output_sentences.append(sent)
    return output_sentences

entity = "Apple"
ent_sentences = entity_detector(processed_docs, entity, "ORG")
print(len(ent_sentences))

for sent in ent_sentences:
    indexes = extract_span(sent, entity)
    extract_information(sent, entity, indexes)
This code uses “Apple” as the entity of interest and specifically looks for sentences, in which the company (ORG) Apple is mentioned. As the printout message shows, there are 59 such sentences. Not all sentences among these 59 sentences mention Apple as a subject or object of the main action, but the last line of code returns a number of such sentences with the tuples summarizing the main content:
47
Sentence = Apple, complying with what it said was a request from Chinese authorities, removed news apps created by The New York Times from its app store in China late last month.
Apple removed apps
Sentence = Apple removed both the and apps from the app store in China on Dec. 23.
Apple removed apps
Apple removed from store
Apple removed on Dec.
Sentence = Apple has previously removed other, less prominent media apps from its China store.
Apple removed apps
Apple removed from store
Sentence = On Friday, Apple, its longtime partner, sued Qualcomm over what it said was $1 billion in withheld rebates.
Apple sued Qualcomm
Only at this point did I understand what the earlier extract_information function is actually for.
One of the most useful ways to explore named entities contained in text and to extract relevant information is to visualize the results of NER.
Next, we use spaCy's visualization tool, displaCy.
"Entity of interest" in the book roughly means this: a specific named entity that the user wants to extract information about from a text.
from spacy import displacy
text = "When Sebastian Thrun started working on self-driving cars at Google in 2007, \
few people outside of the company took him seriously."
doc = nlp(text)
displacy.render(doc, style="ent")
For the Chinese example below, you first need to run
!python -m spacy download zh_core_web_sm
(choose sm, md, etc. as you like); I'll start with the small model.
from spacy import displacy
text2 = "2010年4月,小米成立于中华人民共和国北京市。[5],并于2011年8月发布小米手机进军手机市场[6]。据全球市场调研机构Canalys的统计,在2021年第二季度,小米智能手机市场占有率位居全球第二,占比17%[7]。小米还是继苹果、三星、华为之后第四家拥有手机芯片自研能力的手机厂商[8]。 小米旗下拥有多个子品牌,面向不同产品品类、地区市场及消费人群。通过与其生态链企业的研发与合作,其旗下产品涵盖了智能手机、小米手环、小米电视、小米空气净化器等多种智能化的消费电子产品[9]。小米拥有其直接控股或间接控制的生态链企业多达近400家,产业覆盖智能硬件、生活消费用品、\
教育、游戏、社交网络、文化娱乐、医疗健康、汽车交通、金融等多个领域[10]。"
nlp2 = spacy.load("zh_core_web_sm")
doc = nlp2(text2)
displacy.render(doc, style="ent")
def visualize(processed_docs, entity, ent_type):
    for doc in processed_docs:
        for sent in doc.sents:
            if entity in [ent.text for ent in sent.ents if ent.label_==ent_type]:
                displacy.render(sent, style="ent")

visualize(processed_docs, "Apple", "ORG")
This code displays all sentences, in which the company (ORG) Apple is mentioned. Other entities are highlighted with distinct colors.
Finally, you might be interested specifically in the contexts in which the company Apple is mentioned alongside other companies. Let’s filter out all other information and highlight only named entities of the same type as the entity in question (i.e., all ORG NEs in this case).
def count_ents(sent, ent_type):
    return len([ent.text for ent in sent.ents if ent.label_==ent_type])

def entity_detector_custom(processed_docs, entity, ent_type):
    output_sentences = []
    for doc in processed_docs:
        for sent in doc.sents:
            if entity in [ent.text for ent in sent.ents if ent.label_==ent_type and
                          count_ents(sent, ent_type)>1]:
                output_sentences.append(sent)
    return output_sentences

output_sentences = entity_detector_custom(processed_docs, "Apple", "ORG")
print(len(output_sentences))
With an updated entity_detector_custom function, you extract only the sentences that mention the input entity of a specified type as well as at least one other entity of the same type. You can print out the number of sentences identified this way as a sanity check. Then you define the visualize_type function that applies visualization to the entities of a predefined type only. spaCy allows you to customize the colors for visualization and to apply gradient (you can choose other colors from https://htmlcolorcodes.com/color-chart/), and using this customized color scheme, you can finally visualize the results.
def visualize_type(sents, entity, ent_type):
    colors = {"ORG": "linear-gradient(90deg, #64B5F6, #E0F7FA)"}
    options = {"ents": ["ORG"], "colors": colors}
    for sent in sents:
        displacy.render(sent, style="ent", options=options)

visualize_type(output_sentences, "Apple", "ORG")
What did I learn from this chapter?
You can extract relevant events from news articles and summarize the actions taken by participants of interest (e.g., specific companies). These events can then be used in downstream tasks: for example, if you also collect data on stock price movements, you can link the events extracted from the news to the stock price changes that follow them, which helps predict how stock prices may change after similar events in the future.
Another highlight of this book is the recommended material; links to papers are provided throughout.
I also found a very good official spaCy course: https://course.spacy.io/zh/
This one is also nice:
displaCy:https://demos.explosion.ai/displacy-ent