chencjiajy

低成本指令数据集构建：《Self-Instruct: Aligning Language Model with Self Generated Instructions》阅读笔记

最近有点好奇指令数据集是如何构建的，就读了一下SELF-INSTRUCT的论文

简介

摘要翻译：大型“指令微调”语言模型（即经过微调以响应指令）已表现出对于新任务的zero-shot泛化的非凡能力。然而，它们严重依赖于人工编写的指令数据，而这些数据通常在数量、多样性和创造力方面受到限制，因此阻碍了微调模型的通用性。我们引入了 SELF-INSTRUCT，这是一个用自己生成的数据自举来提高预训练语言模型的指令跟踪能力的框架。我们的pipeline从语言模型生成指令、输入和输出样本，在过滤掉无效或相似的样本后，使用它们来微调原始模型。将我们的方法应用于普通 GPT3后，在SUPER-NATURALINSTRUCTIONS数据集上比原始模型绝对提高了 33%。与使用私人用户数据和人工标注进行训练的 $InstructGPT_{001}$ 的性能相当。为了进一步评估，我们为新任务创建了一组专家编写的指令，并通过人工评估表明，使用 SELF-INSTRUCT 微调后的 GPT3 的性能大幅优于使用现有公共指令数据集，与 $InstructGPT_{001}$ 相比仅 5% 的绝对差距。 SELF-INSTRUCT 提供了一种几乎无需标注的方法，用于将预训练语言模型与指令对齐，我们同时发布了大型合成数据集，以促进未来指令微调的研究。

近期NLP文献见证了构建遵循自然语言指令的模型的活跃，这包含着两个关键组件：大型预训练语言模型和人工编写指令数据集（如PROMPTSOURCE和SUPERNATURALINSTRUCTIONS）。但是收集这些人工编写的指令数据有几个缺点：

成本高昂
缺乏多样性

这篇文章提出了一种半自动的方法来改进人工生成指令数据集，文章贡献有：

提出了一种框架：SELF-INSTRUCT，该框架可以仅使用最少的人工标注，生成大量的用于指令调优的数据；
通过多个实验验证了文章方法的有效性；
文章发布了52K的使用SELF-INSTRUCT指令数据集，以及一份人工手动编写的新任务数据集用来构建和评估指令调优模型。

方法

SELF-INSTRUCT的流程框架图如上图，主要分为四个步骤。

在解释每个步骤的详情之前，先看一下如何指令数据的定义。

The instruction data we want to generate contains a set of instructions ${I_t\}$ , each of which defines a task t in natural language. Task t has $n_t \ge 1$ input-output instances $\{(X_{t,i},Y_{t,i})\}^{n_t}_{i=1}$ . A model is expected to produce the output, given the task instruction and the corresponding input: $M(I_t,X_{t,i}) = Y_{t,i}, for \ i \in \{1, \cdots, n_t\}$

一条指令数据集由instruction、input、output三个部分组成。需要注意的是insturction和input之间没有严格的区分，比如下面两个例子表达的意思是一样的，只是在形式上不一样。

# 例子1
instruction: "write an essay about school safety"
input:""
output:"...."

# 例子2
instruction: "write an essay about the following topic"
input:"school safety"
output:"...."

步骤一指令生成

作者构建了175个种子任务（每个包含1个指令和一个实例），用这175个任务来初始化任务池，每次从任务池中采样8个任务作为大语言模型的输入数据样例，这8个任务中，有6个是从175个人工手写任务中抽取的，2个是在前面的步骤中由模型生成的任务中抽取的。（一开始还没有模型生成的数据时，8个任务全部都是人工手写的）

请求的prompt模板如下图（论文中Table 5）

注：在斯坦福羊驼中使用模型是 text-davinci-003 ，prompt模板如下（这里的序号有点问题，在issue中有说明）：

You are asked to come up with a set of 30 diverse task instructions. These task instructions will be given to a GPT model and we will evaluate the GPT model for completing the instructions.

Here are the requirements:
1. Try not to repeat the verb for each instruction to maximize diversity.
2. The language used for the instruction also should be diverse. For example, you should combine questions with imperative instrucitons.
3. The type of instructions should be diverse. The list should include diverse types of tasks like open-ended generation, classification, editing, etc.
4. A GPT language model should be able to complete the instruction. For example, do not ask the assistant to create any visual or audio output. For another example, do not ask the assistant to wake you up at 5pm or set a reminder because it cannot perform any action.
5. The instructions should be in English.
6. The instructions should be 1 to 2 sentences long. Either an imperative sentence or a question is permitted.
7. You should generate an appropriate input to the instruction. The input field should contain a specific example provided for the instruction. It should involve realistic data and should not contain simple placeholders. The input should provide substantial content to make the instruction challenging but should ideally not exceed 100 words.
8. Not all instructions require input. For example, when a instruction asks about some general information, "what is the highest peak in the world", it is not necssary to provide a specific context. In this case, we simply put "" in the input field.
9. The output should be an appropriate response to the instruction and the input. Make sure the output is less than 100 words.
10. Make sure the output is gramatically correct with punctuation if needed.
List of 30 tasks:

步骤二分类任务识别

判断生成的指令是否是一个分类任务，将分类任务定义为output的标签是一个有限的小的标签集合的任务。

是否是分类任务也是由模型来判断的，使用few-shot方式来实现，用种子任务集合中的12个分类任务指令和19个非分类任务指令作为few-shot的例子。具体模板如下（对应论文中的Table 6）：

Can the following task be regarded as a classification task with finite output labels?

Task: Given my personality and the job, tell me if I would be suitable.
Is it classification? Yes

Task: Give me an example of a time when you had to use your sense of humor.
Is it classification? No

Task: Replace the placeholders in the given text with appropriate named entities.
Is it classification? No

Task: Fact checking - tell me if the statement is true, false, or unknown, based on your knowledge and common sense.
Is it classification? Yes

Task: Return the SSN number for the person.
Is it classification? No

Task: Detect if the Reddit thread contains hate speech.
Is it classification? Yes

Task: Analyze the sentences below to identify biases.
Is it classification? No

Task: Select the longest sentence in terms of the number of words in the paragraph, output the sentence index.
Is it classification? Yes

Task: Find out the toxic word or phrase in the sentence.
Is it classification? No

Task: Rank these countries by their population.
Is it classification? No

Task: You are provided with a news article, and you need to identify all the categories that this article belongs to. Possible categories include: Music, Sports, Politics, Tech, Finance, Basketball, Soccer, Tennis, Entertainment, Digital Game, World News. Output its categories one by one, seperated by comma.
Is it classification? Yes

Task: Given the name of an exercise, explain how to do it.
Is it classification? No

Task: Select the oldest person from the list.
Is it classification? Yes

Task: Find the four smallest perfect numbers.
Is it classification? No

Task: Does the information in the document supports the claim? You can answer "Support" or "Unsupport".
Is it classification? Yes

Task: Create a detailed budget for the given hypothetical trip.
Is it classification? No

Task: Given a sentence, detect if there is any potential stereotype in it. If so, you should explain the stereotype. Else, output no.
Is it classification? No

Task: Explain the following idiom to me, and try to give me some examples.
Is it classification? No

Task: Is there anything I can eat for a breakfast that doesn't include eggs, yet includes protein, and has roughly 700-1000 calories?
Is it classification? No

Task: Answer the following multiple choice question. Select A, B, C, or D for the final answer.
Is it classification? Yes

Task: Decide whether the syllogism is logically sound.
Is it classification? Yes

Task: How can individuals and organizations reduce unconscious bias?
Is it classification? No

Task: What are some things you can do to de-stress?
Is it classification? No

Task: Find out the largest one from a set of numbers. Output the number directly.
Is it classification? Yes

Task: Replace the  token in the text with proper words that are consistent with the context. You can use multiple words for each  token.
Is it classification? No

Task: Write a cover letter based on the given facts.
Is it classification? No

Task: Identify the pos tag of the word in the given sentence.
Is it classification? Yes

Task: Write a program to compute the sum of integers from k to n.
Is it classification? No

Task: In this task, you need to compare the meaning of the two sentences and tell if they are the same. Output yes or no.
Is it classification? Yes

Task: To make the pairs have the same analogy, write the fourth word.
Is it classification? No

Task: Given a set of numbers, find all possible subsets that sum to a given number.
Is it classification? No

Task: {instruction for the target task}

步骤三实例生成

给定指令及其任务类别，为每个指令生成实例。根据指令是否是分类任务，采取不同的prompt模板。

对于非分类任务，采用Input-first Approach，也就是先让LLM对于指令先生成一个输入(input)，再生成对应的输出(output)（如果任务不需要input，就直接生成结果）。对应的prompt模板如下（论文中的table 7)：

Come up with examples for the following tasks. Try to generate multiple examples when possible. If the task doesn't require additional input, you can generate the output directly.

Task: Which exercises are best for reducing belly fat at home?
Output:
- Lying Leg Raises
- Leg In And Out
- Plank
- Side Plank
- Sit-ups

Task: Extract all the country names in the paragraph, list them separated by commas.
Example 1
Paragraph: Dr. No is the sixth novel by the English author Ian Fleming to feature his British Secret Service agent James Bond. Written at Fleming's Goldeneye estate in Jamaica, it was first published in the United Kingdom by Jonathan Cape in 1958. In the novel Bond looks into the disappearance in Jamaica of two fellow MI6 operatives who had been investigating Doctor No. Bond travels to No's Caribbean island and meets Honeychile Rider, who is there to collect shells. They are captured and taken to a luxurious facility carved into a mountain. The character of Doctor No, the son of a German missionary and a Chinese woman, was influenced by Sax Rohmer's Fu Manchu stories. Dr. No was the first of Fleming's novels to face widespread negative reviews in Britain, but it was received more favourably in the United States.
Output: English, British, Jamaica, the United Kingdom, German, Chinese, Britain, the United States.

Task: Converting 85 F to Celsius.
Output: 85°F = 29.44°C

Task: Sort the given list ascendingly. 
Example 1
List: [10, 92, 2, 5, -4, 92, 5, 101]
Output: [-4, 2, 5, 5, 10, 92, 92, 101]
Example 2
Input 2 - List: [9.99, 10, -5, -1000, 5e6, 999]
Output: [-1000, -5, 9.99, 10, 999, 5e6]

Task: Suggest a better and more professional rephrasing of the following sentence.
Example 1
Sentence: This house is surprisingly not constructed very well, and you probably need more money to fix it after you buy it. If you ask me, I would suggest you to consider other candidates.
Output: This house does not seem to be constructed well, so you may need to spend more money to fix it after you purchase it. I would suggest that you look at other properties.
Example 2
Sentence: Just so you know, we did an experiment last week and found really surprising results - language model can improve itself!
Output: Our experiments last week demonstrated surprising results, proving that the language model can improve itself.

Task: Read the following paragraph and answer a math question about the paragraph. You need to write out the calculation for getting the final answer.
Example 1
Paragraph: Gun violence in the United States results in tens of thousands of deaths and injuries annually, and was the leading cause of death for children 19 and younger in 2020. In 2018, the most recent year for which data are available as of 2021, the Centers for Disease Control and Prevention's (CDC) National Center for Health Statistics reports 38,390 deaths by firearm, of which 24,432 were by suicide. The rate of firearm deaths per 100,000 people rose from 10.3 per 100,000 in 1999 to 12 per 100,000 in 2017, with 109 people dying per day or about 14,542 homicides in total, being 11.9 per 100,000 in 2018. In 2010, there were 19,392 firearm-related suicides, and 11,078 firearm-related homicides in the U.S. In 2010, 358 murders were reported involving a rifle while 6,009 were reported involving a handgun; another 1,939 were reported with an unspecified type of firearm. In 2011, a total of 478,400 fatal and nonfatal violent crimes were committed with a firearm.
Question: How many more firearm-related deaths were there in 2018 compared to 2010?
Output:
38390 - (19392 + 11078) = 38390 - 30470 = 7920. 
So, in 2018, there were 7920 more deaths by firearm than in 2010.

Task: Write Python code to solve this leetcode problem.
Example 1
Problem: You are given two non-empty linked lists representing two non-negative integers. The digits are stored in reverse order, and each of their nodes contains a single digit. Add the two numbers and return the sum as a linked list. You may assume the two numbers do not contain any leading zero, except the number 0 itself.
Output:
class Solution(object):
    def addTwoNumbers(self, l1, l2):
        carry = 0
        root = n = ListNode(0)
        while l1 or l2 or carry:
            v1 = v2 = 0
            if l1:
                v1 = l1.val
                l1 = l1.next
            if l2:
                v2 = l2.val
                l2 = l2.next
            carry, val = divmod(v1+v2+carry, 10)
            n.next = ListNode(val)
            n = n.next
        return root.next

Task: Solve the equation and find the value of X. Show your steps.
Example 1
Equation: 10X + 5 = 10
Output: 10X = 5,  X = 0.5
Example 2
Equation: X + Y + 120 = 100
Output: X + Y = -20, X = -20 - Y

Task: Write a program to compute the sum of integers from k to n.
Output:
def sum(k, n):
    sum = 0
    for i in range(k, n+1):
        sum += i
    return sum

Task: Select the oldest person from the given list.
Example 1
List: George Washington, Confucius, Michael Jordan, Michelangelo
Output: Confucious
Example 2
List: Alan Turing, Geoffrey Hinton, Yann LeCun, Yoshua Bengio
Output: Alan Turing

Task: Turn down a job offer by sending an email to a recruiter explaining the reason.
Output: Hi  [Recruiter],
Thank you so much for the generous offer to join your team. As we discussed, I’ve admired the company for a number of years, and am a proud endorser of its products. However, after further consideration of where I currently am in my career, I’ve decided to accept an offer at another company.
I would love to stay in touch with you and have already started following you on [Social Media Platform]. Again, thank you so much for your time and consideration.
Thanks again,
[Your Name]

Task: {Instruction for the target task}

对于分类任务，采用Output-first Approach，先生成一个可能的类别标签，再在类别标签的条件下生成input，prompt模板如下（Table 8）：

Given the classification task definition and the class labels, generate an input that corresponds to each of the class labels. If the task doesn't require input, just generate possible class labels.

Task: Classify the sentiment of the sentence into positive, negative, or mixed.
Class label: mixed
Sentence: I enjoy the flavor of the restaurant but their service is too slow.
Class label: Positive
Sentence: I had a great day today. The weather was beautiful and I spent time with friends and family.
Class label: Negative
Sentence: I was really disappointed by the latest superhero movie. I would not recommend it to anyone.

Task: Given a dialogue, classify whether the user is satisfied with the service. You should respond with "Satisfied" or "Unsatisfied".
Class label: Satisfied
Dialogue:
- Agent: Thank you for your feedback. We will work to improve our service in the future.
- Customer: I am happy with the service you provided. Thank you for your help.
Class label: Unsatisfied
Dialogue:
- Agent: I am sorry we will cancel that order for you, and you will get a refund within 7 business days.
- Customer: oh that takes too long. I want you to take quicker action on this.

Task: Given some political opinions, classify whether the person belongs to Democrats or Republicans.
Class label: Democrats
Opinion: I believe that everyone should have access to quality healthcare regardless of their income level.
Class label: Republicans
Opinion: I believe that people should be able to keep more of their hard-earned money and should not be taxed at high rates.

Task: Tell me if the following email is a promotion email or not.
Class label: Promotion
Email: Check out our amazing new sale! We've got discounts on all of your favorite products.
Class label: Not Promotion
Email: We hope you are doing well. Let us know if you need any help.

Task: Detect if the Reddit thread contains hate speech.
Class label: Hate Speech
Thread: All people of color are stupid and should not be allowed to vote.
Class label: Not Hate Speech
Thread: The best way to cook a steak on the grill.

Task:  Does the information in the document supports the claim? You can answer "Support" or "Unsupport".
Class label: Unsupport
Document: After a record-breaking run that saw mortgage rates plunge to all-time lows and home prices soar to new highs, the U.S. housing market finally is slowing. While demand and price gains are cooling, any correction is likely to be a modest one, housing economists and analysts say. No one expects price drops on the scale of the declines experienced during the Great Recession.
Claim: The US housing market is going to crash soon.
Class label: Support
Document: The U.S. housing market is showing signs of strain, with home sales and prices slowing in many areas. Mortgage rates have risen sharply in recent months, and the number of homes for sale is increasing. This could be the beginning of a larger downturn, with some economists predicting a potential housing crash in the near future.
Claim: The US housing market is going to crash soon.

Task: Answer the following multiple-choice question. Select A, B, C, or D for the final answer.
Class label: C
Question: What is the capital of Germany?
A. London
B. Paris
C. Berlin
D. Rome
Class label: D
Question: What is the largest planet in our solar system?
A) Earth
B) Saturn
C) Mars
D) Jupiter
Class label: A
Question: What is the process by which plants make their own food through photosynthesis?
A) Respiration
B) Fermentation
C) Digestion
D) Metabolism
Class label: B
Question: Who wrote the novel "The Great Gatsby"?
A) Ernest Hemingway
B) F. Scott Fitzgerald
C) J.D. Salinger
D) Mark Twain

Task: You need to read a code and detect if there is a syntax error or not. Output true if there is an error, output false if there is not.
Class label: true
Code:
def quick_sort(arr):
    if len(arr) < 2
        return arr
Class label: False
Code:
def calculate_average(numbers):
    total = 0
    for number in numbers:
        total += number
    return total / len(numbers)

Task: You are provided with a news article, and you need to identify all the categories that this article belongs to. Possible categories include Sports and Politics. Output its categories one by one, separated by a comma.
Class label: Sports
Article: The Golden State Warriors have won the NBA championship for the second year in a row.
Class label: Politics
Article: The United States has withdrawn from the Paris Climate Agreement.
Class label: Politics, Sports
Article: The government has proposed cutting funding for youth sports programs.

Task: Given a credit card statement, the cardholder's spending habits, and the account balance, classify whether the cardholder is at risk of defaulting on their payments or not.
Class label: At risk
Credit card statement: Purchases at high-end clothing stores and luxury hotels.
Cardholder's spending habits: Frequent purchases at luxury brands and high-end establishments.
Account balance: Over the credit limit and multiple missed payments.
Class label: Not at risk
Credit card statement: Purchases at grocery stores and gas stations.
Cardholder's spending habits: Regular purchases for necessary expenses and occasional dining out.
Account balance: Slightly below the credit limit and no missed payments.

Task: Given a social media post, the hashtags used, and a topic. classify whether the post is relevant to the topic or not.
Class label: Relevant
Post: I can't believe the government is still not taking action on climate change. It's time for us to take matters into our own hands.
Hashtags: #climatechange #actnow
Topic: Climate change
Class label: Not relevant 
Post: I just bought the new iPhone and it is amazing!
Hashtags: #apple #technology
Topic: Travel

Task: The answer will be 'yes' if the provided sentence contains an explicit mention that answers the given question. Otherwise, answer 'no'. 
Class label: Yes
Sentence: Jack played basketball for an hour after school.
Question: How long did Jack play basketball?
Class label: No
Sentence: The leaders of the Department of Homeland Security now appear before 88 committees and subcommittees of Congress.
Question: How often are they required to appear?

Task: Tell me what's the second largest city by population in Canada.
Class label: Montreal

Task: Classifying different types of mathematical equations, such as linear, and quadratic equations, based on the coefficients and terms in the equation.
Class label: Linear equation
Equation: y = 2x + 5
Class label: Quadratic equation
Equation: y = x^2 - 4x + 3

Task: Tell me the first number of the given list.
Class label: 1
List: 1, 2, 3
Class label: 2
List: 2, 9, 10

Task: Which of the following is not an input type? (a) number (b) date (c) phone number (d) email address (e) all of these are valid inputs.
Class label: (e)

Task: {Instruction for the target task}

步骤四过滤和后处理

为了保证指令的多样性，一个指令只有与任务池中已有的指令的ROUGE-L小于0.7才会被考虑加入到任务池，同时过滤掉包含一些特殊词（如image/picture/graph等）的生成指令，因为这些指令LM还不会处理。（注：虽然过滤是在论文中的第四个步骤，但是代码实现时，对于指令的过滤是在第一步生成指令的时候就过滤了不满足要求的指令，当任务池中的指令达到生成数量比如52k后就停止第一步的指令生成）

## 开源代码 https://github.com/yizhongw/self-instruct/blob/main/self_instruct/bootstrap_instructions.py#L41C1-L71C1
def post_process_gpt3_response(response):
    if response is None or response["choices"][0]["finish_reason"] == "length":
        return []
    raw_instructions = re.split(r"\n\d+\s?\. ", response["choices"][0]["text"])
    instructions = []
    for inst in raw_instructions:
        inst = re.sub(r"\s+", " ", inst).strip()
        inst = inst.strip().capitalize()
        if inst == "":
            continue
        # filter out too short or too long instructions
        if len(inst.split()) <= 3 or len(inst.split()) > 150:
            continue
        # filter based on keywords that are not suitable for language models.
        if any(find_word_in_string(word, inst) for word in ["image", "images", "graph", "graphs", "picture", "pictures", "file", "files", "map", "maps", "draw", "plot", "go to"]):
            continue
        # We found that the model tends to add "write a program" to some existing instructions, which lead to a lot of such instructions.
        # And it's a bit comfusing whether the model need to write a program or directly output the result. 
        # Here we filter them out.
        # Note this is not a comprehensive filtering for all programming instructions.
        if inst.startswith("Write a program"):
            continue
        # filter those starting with punctuation
        if inst[0] in string.punctuation:
            continue
        # filter those starting with non-english character
        if not inst[0].isascii():
            continue
        instructions.append(inst)
    return instructions

对于生成的实例，过滤掉重复的、过滤掉与相同input不同output的实例。

并用启发式方法识别和过滤掉一些实例，比如是否太长、是否太短、输出是输入的重复等。

## 论文开源代码 https://github.com/yizhongw/self-instruct/blob/main/self_instruct/prepare_for_finetuning.py#L110C1-L192C1
def filter_duplicate_instances(instances):
    # if the instances have same non-empty input, but different output, we will not use such instances
    same_input_diff_output = False
    for i in range(1, len(instances)):
        for j in range(0, i):
            if instances[i][1] == "":
                continue
            if instances[i][1] == instances[j][1] and instances[i][2] != instances[j][2]:
                same_input_diff_output = True
                break
    if same_input_diff_output:
        return []

    # remove duplicate instances
    instances = list(set(instances))
    return instances

def filter_invalid_instances(instances):
    filtered_instances = []
    for instance in instances:
        # if input and output are the same, we will not use such instances
        if instance[1] == instance[2]:
            continue
        # if output is empty, we will not use such instances
        if instance[2] == "":
            continue
        # if input or output ends with a colon, these are usually imcomplete generation. We will not use such instances
        if instance[1].strip().endswith(":") or instance[2].strip().endswith(":"):
            continue
        filtered_instances.append(instance)
    return filtered_instances

def parse_instances_for_generation_task(raw_text, instruction, response_metadata):
    instances = []
    raw_text = raw_text.strip()
    if re.findall("Example\s?\d*\.?", raw_text):
        instance_texts = re.split(r"Example\s?\d*\.?", raw_text)
        instance_texts = [it.strip() for it in instance_texts if it.strip() != ""]
        for instance_text in instance_texts:
            inst_input, inst_output = parse_input_output(instance_text)
            instances.append((instruction.strip(), inst_input.strip(), inst_output.strip()))
    elif re.findall(r"Output\s*\d*\s*:", raw_text):
        # we assume only one input/output pair in this case
        inst_input, inst_output = parse_input_output(raw_text)
        instances.append((instruction.strip(), inst_input.strip(), inst_output.strip()))
    else:
        return []
    # if the generation stops because of length, we remove the last instance
    if response_metadata["response"]["choices"][0]["finish_reason"] == "length":
        instances = instances[:-1]
    
    instances = filter_invalid_instances(instances)
    instances = filter_duplicate_instances(instances)
    return instances

def parse_instances_for_classification_task(raw_text, instruction, response_metadata):
    instances = []
    if not "Class label:" in raw_text:
        return []
    instance_texts = raw_text.split("Class label:")[1:]
    for instance_text in instance_texts:
        instance_text = instance_text.strip()
        fields = instance_text.split("\n", 1)
        if len(fields) == 2:
            # the first field split by \n is the class label
            class_label = fields[0].strip()
            # the rest is the input
            input_text = fields[1].strip()
        elif len(fields) == 1:
            # the first field split by \n is the input
            class_label = fields[0].strip()
            input_text = ""
        else:
            raise ValueError("Invalid instance text: {}".format(instance_text))
        instances.append((instruction.strip(), input_text.strip(), class_label.strip()))

    # if the generation stops because of length, we remove the last instance
    if response_metadata["response"]["choices"][0]["finish_reason"] == "length":
        instances = instances[:-1]
    instances = filter_invalid_instances(instances)
    instances = filter_duplicate_instances(instances)
    return instances

GPT3+SELF-INSTRUCT生成数据

将GPT3（OpenAI 的"davinci"引擎）作为提出的self-instruct方法的试验大语言模型，在论文试验进行的2022年12月，GPT3的api收费标准为$0.02/1000 token，生成整个数据集花费约600美元。调用API的超参如下图（论文的Table 4）

生成数据的统计数据如下图（论文Table 1）

指令数据多样性：

通过对生成的指令进行句法分析，绘制最常见的20个动词和它们的4个最常见直接名词如下图（论文Figure 3)，这些数据占整个数据集的14%。

比较生成的数据与种子指令数据的相似性，对于每一个生成的指令，计算与175个种子指令的最大ROUGE-L，绘制结果如下图所示（论文Figure 4）。
指令长度、input长度、output长度的分布如下图（论文Figure 5）

数据质量：

为了验证生成数据的质量，随机从生成的样本中抽取200个指令，对于每个指令随机选择一个实例。然后让一个专家标注员（论文作者）来标注实例是否正确。其结果如下图(论文Table 2)

试验部分

论文作者做了几个试验来验证生成的指令数据对于提升模型的性能是有帮助的。因为使用GPT3生成了指令数据集，训练模型还是GPT3，使用OpenAI的finetune API对模型进行微调。（试验部分目前不太关注，就不详细记录了）

参考资料

开源代码：https://github.com/yizhongw/self-instruct
论文：Wang, Yizhong, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, NoahA. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. “Self-Instruct: Aligning Language Model with Self Generated Instructions,” December.
https://github.com/tatsu-lab/stanford_alpaca

你可能感兴趣的:(深度学习,语言模型,人工智能,论文阅读)

PyTorch & TensorFlow速成复习：从基础语法到模型部署实战（附FPGA移植衔接）阿牛的药铺算法移植部署 pytorch tensorflow fpga开发
PyTorch&TensorFlow速成复习：从基础语法到模型部署实战（附FPGA移植衔接）引言：为什么算法移植工程师必须掌握框架基础？针对光学类产品算法FPGA移植岗位需求（如可见光/红外图像处理），深度学习框架是算法落地的"桥梁"——既要用PyTorch/TensorFlow验证算法可行性，又要将训练好的模型（如CNN、目标检测）转换为FPGA可部署的格式（ONNX、TFLite）。本文采用"
算法学习笔记：17.蒙特卡洛算法 ——从原理到实战，涵盖 LeetCode 与考研 408 例题
在计算机科学和数学领域，蒙特卡洛算法（MonteCarloAlgorithm）以其独特的随机抽样思想，成为解决复杂问题的有力工具。从圆周率的计算到金融风险评估，从物理模拟到人工智能，蒙特卡洛算法都发挥着不可替代的作用。本文将深入剖析蒙特卡洛算法的思想、解题思路，结合实际应用场景与Java代码实现，并融入考研408的相关考点，穿插图片辅助理解，帮助你全面掌握这一重要算法。蒙特卡洛算法的基本概念蒙特卡
AI音乐模拟器：AIGC时代的智能音乐创作革命 lauo 人工智能 AIGC 开源前端机器人
AI音乐模拟器：AIGC时代的智能音乐创作革命引言：AIGC浪潮下的音乐创作新范式在数字化转型的浪潮中，人工智能生成内容（AIGC）正在重塑各个创意领域。音乐产业作为创意经济的重要组成部分，正经历着前所未有的变革。据最新市场研究数据显示，全球AI音乐市场规模预计将从2023年的5.8亿美元增长到2030年的26.8亿美元，年复合增长率高达24.3%。这一快速增长的市场背后，是AI音乐技术正在打破传
LLM中最后一个词语的表征（隐藏状态）通常会融合前面所有词语的信息吗？ ZhangJiQun&MXP 教学 2024大模型以及算力 2021 AI python 机器学习算法深度学习人工智能
LLM中最后一个词语的表征（隐藏状态）通常会融合前面所有词语的信息吗？在大语言模型（LLM）中，最后一个词语的表征（隐藏状态）通常会融合前面所有词语的信息，这是由LLM的核心架构（以Transformer为基础）决定的，具体可以从以下角度理解：1.核心机制：自注意力（Self-Attention）的作用现代LLM（如GPT系列、Qwen等）均基于Transformer架构，其核心是自注意力机制。在
深度学习模型表征提取全解析 ZhangJiQun&MXP 教学 2024大模型以及算力 2021 AI python 深度学习人工智能 python embedding 语言模型
模型内部进行表征提取的方法在自然语言处理（NLP）中，“表征（Representation）”指将文本（词、短语、句子、文档等）转化为计算机可理解的数值形式（如向量、矩阵），核心目标是捕捉语言的语义、语法、上下文依赖等信息。自然语言表征技术可按“静态/动态”“有无上下文”“是否融入知识”等维度划分一、传统静态表征（无上下文，词级为主）这类方法为每个词分配固定向量，不考虑其在具体语境中的含义（无法解
视频分析：让AI看懂动态画面随机森林404 计算机视觉音视频人工智能 microsoft
引言：动态视觉理解的革命在数字信息爆炸的时代，视频已成为最主要的媒介形式。据统计，每分钟有超过500小时的视频内容被上传到YouTube平台，而全球互联网流量的82%来自视频数据传输。面对如此海量的视频内容，传统的人工处理方式已无法满足需求，这正是人工智能视频分析技术大显身手的舞台。视频分析技术赋予机器"看懂"动态画面的能力，使其能够自动理解、解释甚至预测视频中的内容，这一突破正在彻底改变我们与视
【Qualcomm】高通SNPE框架简介、下载与使用 Jackilina_Stone 人工智能 Qualcomm SNPE
目录一高通SNPE框架1SNPE简介2QNN与SNPE3Capabilities4工作流程二SNPE的安装与使用1下载2Setup3SNPE的使用概述一高通SNPE框架1SNPE简介SNPE（SnapdragonNeuralProcessingEngine），是高通公司推出的面向移动端和物联网设备的深度学习推理框架。SNPE提供了一套完整的深度学习推理框架，能够支持多种深度学习模型，包括Pytor
深度学习篇---昇腾NPU&CANN 工具包 Atticus-Orion 上位机知识篇图像处理篇深度学习篇深度学习人工智能 NPU 昇腾 CANN
介绍昇腾NPU是华为推出的神经网络处理器，具有强大的AI计算能力，而CANN工具包则是面向AI场景的异构计算架构，用于发挥昇腾NPU的性能优势。以下是详细介绍：昇腾NPU架构设计：采用达芬奇架构，是一个片上系统，主要由特制的计算单元、大容量的存储单元和相应的控制单元组成。集成了多个CPU核心，包括控制CPU和AICPU，前者用于控制处理器整体运行，后者承担非矩阵类复杂计算。此外，还拥有AICore
深度学习图像分类数据集—桃子识别分类 AI街潜水的八角深度学习图像数据集深度学习分类人工智能
该数据集为图像分类数据集，适用于ResNet、VGG等卷积神经网络，SENet、CBAM等注意力机制相关算法，VisionTransformer等Transformer相关算法。数据集信息介绍：桃子识别分类：['B1','M2','R0','S3']训练数据集总共有6637张图片，每个文件夹单独放一种数据各子文件夹图片统计:·B1:1601张图片·M2:1800张图片·R0:1601张图片·S3:
法律科技领域人工智能代理构建的十个经验教训，一位人工智能工程师通过构建、部署和维护智能代理的经验教训来优化法律工作流程的历程。知识大胖 NVIDIA GPU和大语言模型开发教程人工智能 ai
目录介绍什么是代理人？为什么它对法律如此重要？法律技术中代理用例示例-合同审查代理-法律研究代理在LegalTech中使用代理的十个教训-教训1：即使代理很酷，它们也不能解决所有问题-教训2：选择最适合您用例的框架-教训3：能够快速迭代不同的模型-教训4：从简单开始，必要时扩展-教训5：使用跟踪解决方案；您将需要它-教训6：确保跟踪成本，代理循环可能很昂贵-教训7：将控制权交给最终用户（人在环路中
AI MCP教程之什么是 MCP？利用本地 LLM 、MCP、DeepSeek 集成构建您自己的 AI 驱动工具知识大胖 NVIDIA GPU和大语言模型开发教程人工智能 mcp deepseek
介绍利用模型上下文协议(MCP)的工具吸引了我们的注意力—将AI变成触手可及的生产力引擎。它们巧妙、高效，让人难以抗拒。但如果您可以将这样的功能添加到自己的工具中，会怎么样呢？在本指南中，我将引导您构建一个具有本地运行的大型语言模型(LLM)和MCP集成的AI工具-让您以类似的方式自动执行利用MCP的工具您喜欢的任务。推荐文章《AnythingLLM教程系列之12AnythingLLM上的Olla
24GB GPU 中的 DeepSeek R1：Unsloth AI 针对 671B 参数模型进行动态量化知识大胖 NVIDIA GPU和大语言模型开发教程人工智能 deepseek ollama
简介最初的DeepSeekR1是一个拥有6710亿个参数的语言模型，UnslothAI团队对其进行了动态量化，将模型大小减少了80%（从720GB减少到131GB），同时保持了强大的性能。当添加模型卸载功能时，该模型可以在24GBVRAM下以低令牌/秒的推理速度运行。推荐文章《本地构建AI智能分析助手之01快速安装，使用PandasAI和Ollama进行数据分析，用自然语言向你公司的数据提问为决策
Llama-Omni会说话的人工智能“语音到语音LLM” 利用低延迟、高质量语音转语音 AI 彻底改变对话方式（教程含源码）知识大胖 NVIDIA GPU和大语言模型开发教程 llama 人工智能 nvidia llm
介绍“单靠技术是不够的——技术与文科、人文学科的结合，才能产生让我们心花怒放的成果。”——史蒂夫·乔布斯近年来，人机交互领域发生了重大变化，尤其是随着ChatGPT、GPT-4等大型语言模型(LLM)的出现。虽然这些模型主要基于文本，但人们对语音交互的兴趣日益浓厚，以使人机对话更加无缝和自然。然而，实现语音交互而不受语音转文本处理中常见的延迟和错误的影响仍然是一个挑战。关键字：Llama-Omni
什么是热力学计算？它如何帮助人工智能发展？知识大胖 NVIDIA GPU和大语言模型开发教程人工智能量子计算
现代计算的基础是晶体管，这是一种微型电子开关，可以用它构建逻辑门，从而创建CPU或GPU等复杂的数字电路。随着技术的进步，晶体管变得越来越小。根据摩尔定律，集成电路中晶体管的数量大约每两年增加一倍。这种指数级增长使得计算技术呈指数级发展。然而，晶体管尺寸的缩小是有限度的。我们很快就会达到晶体管无法工作的阈值。此外，人工智能的进步使得对计算能力的需求比以往任何时候都更加迫切。根本问题是自然是随机的（
上海交大：工具增强推理agent
标题：SciMaster:TowardsGeneral-PurposeScientificAIAgentsPartI.X-MasterasFoundation-CanWeLeadonHumanity’sLastExam?来源：arXiv,2507.05241摘要人工智能代理的快速发展激发了利用它们加速科学发现的长期雄心。实现这一目标需要深入了解人类知识的前沿。因此，人类的最后一次考试（HLE）为评
微算法科技的前沿探索：量子机器学习算法在视觉任务中的革新应用 MicroTech2025 量子计算算法
在信息技术飞速发展的今天，计算机视觉作为人工智能领域的重要分支，正逐步渗透到我们生活的方方面面。从自动驾驶到人脸识别，从医疗影像分析到安防监控，计算机视觉技术展现了巨大的应用潜力。然而，随着视觉任务复杂度的不断提升，传统机器学习算法在处理大规模、高维度数据时遇到了计算瓶颈。在此背景下，量子计算作为一种颠覆性的计算模式，以其独特的并行处理能力和指数级增长的计算空间，为解决这一难题提供了新的思路。微算
中国银联豪掷1亿采购海光C86架构服务器信创新态势海光芯片 C86 国产芯片海光信息
近日，中国银联国产服务器采购大单正式敲定，基于海光C86架构的服务器产品中标，项目金额超过1亿元。接下来，C86服务器将用于支撑中国银联的虚拟化、大数据、人工智能、研发测试等技术场景，进一步提升其业务处理能力、用户服务效率和信息安全水平。作为我国重要的银行卡组织和金融基础设施，中国银联在全球183个国家和地区设有银联受理网络，境内外成员机构超过2600家，是世界三大银行卡品牌之一。此次中国银联发力
【AI大模型】LLM模型架构深度解析：BERT vs. GPT vs. T5 我爱一条柴ya 学习AI记录 ai 人工智能 AI编程 python
引言Transformer架构的诞生（Vaswanietal.,2017）彻底改变了自然语言处理（NLP）。在其基础上，BERT、GPT和T5分别代表了三种不同的模型范式，主导了预训练语言模型的演进。理解它们的差异是LLM开发和学习的基石。一、核心架构对比特性BERT(BidirectionalEncoder)GPT(GenerativePre-trainedTransformer)T5(Text
[论文阅读]Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smal 0x211 论文阅读语言模型人工智能自然语言处理
中文译名：逐步蒸馏！以较少的训练数据和较小的模型规模超越较大的语言模型发布链接：http://arxiv.org/abs/2305.02301AcceptedtoFindingsofACL2023阅读原因：近期任务需要用到蒸馏操作，了解相关知识核心思想：改变视角。原来的视角：把LLMs视为噪声标签的来源。现在的视角：把LLMs视为能够推理的代理。方法好在哪？需要的数据量少，得到的结果好。文章的方法
AI人工智能浪潮中文心一言的独特优势
AI人工智能浪潮中文心一言的独特优势：为什么它是中国市场的“AI主力军”？关键词：文心一言,AI大模型,中文处理,多模态融合,产业落地,安全可控,百度ERNIE摘要：在全球AI大模型浪潮中，百度文心一言（ERNIEBot）凭借“懂中文、会多模态、能落地、守规矩”的四大核心优势，成为中国市场最具竞争力的AI产品之一。本文将用“超级大脑”的比喻，从中文理解、多模态能力、产业生态融合、安全可控性四个维度
正义的算法迷宫—人工智能重构司法体系的技术悖论与文明试炼
一、法庭的数字化迁徙当美国威斯康星州法院采纳COMPAS算法评估被告再犯风险，当中国"智慧法院"系统年处理1.2亿件案件，司法体系正经历从石柱法典到代码裁判的范式革命。这场转型的核心驱动力是司法效率与公正的永恒张力：美国重罪案件平均审理周期达18个月，中国基层法官年人均结案357件（是德国同行的6倍），而算法能在0.3秒内完成百万份文书比对。人工智能渗透司法引发三重裂变：证据分析从经验推断转向数据
【实战AI】macbook M1 本地ollama运行deepseek 东方鲤鱼 chat AI macos ai llama AIGC chatgpt
由于deepseek官网或者Aapi调用会有网络延迟或不响应的情况，故在本地搭建部署；前提条件1.由于需要拉取开源镜像，受网络限制，部分资源在前提中会下载的更快！请自行；2.设备macbookM132G下载ollamaOllama是一款跨平台推理框架客户端（MacOS、Windows、Linux），专为无缝部署大型语言模型（LLM）（如Llama2、Mistral、Llava等）而设计。通过一键式
NumPy-@运算符详解 GG不是gg numpy numpy
NumPy-@运算符详解一、@运算符的起源与设计目标1.从数学到代码：符号的统一2.设计目标二、@运算符的核心语法与运算规则1.基础用法：二维矩阵乘法2.一维向量的矩阵语义3.高维数组：批次矩阵运算4.广播机制：灵活的形状匹配三、@运算符与其他乘法方式的核心区别1.对比`np.dot()`2.对比元素级乘法`*`3.对比`np.matrix`的`*`运算符四、典型应用场景：从基础到高阶1.深度学习
【python实战】不玩微博，一封邮件就能知道实时热榜，天秀吃瓜一条coding 从实战学python 人工智能 python linux 爬虫
❤️欢迎订阅《从实战学python》专栏，用python实现办公自动化、数据可视化、人工智能等各个方向的实战案例，有趣又有用！❤️更多精品专栏简介点这里有的人金玉其表败絮其中，有的人却若彩虹般绚烂，怦然心动前言哈喽，大家好，我是一条。在生活中我是一个不太喜欢逛娱乐平台的人，抖音、快手、微博我手机里都没装，甚至微信朋友圈都不看，但是自从开始写博客，有些热度不得不蹭。所以就有了这样一个需求，能不能让微
NLP_知识图谱_大模型——个人学习记录 macken9999 自然语言处理知识图谱大模型自然语言处理知识图谱学习
1.自然语言处理、知识图谱、对话系统三大技术研究与应用https://github.com/lihanghang/NLP-Knowledge-Graph深度学习-自然语言处理(NLP)-知识图谱：知识图谱构建流程【本体构建、知识抽取（实体抽取、关系抽取、属性抽取）、知识表示、知识融合、知识存储】-元気森林-博客园https://www.cnblogs.com/-402/p/16529422.htm
解决 Python 包安装失败问题：以 accelerate 为例
在使用Python开发项目时，我们经常会遇到依赖包安装失败的问题。今天，我们就以accelerate包为例，详细探讨一下可能的原因以及解决方法。通过这篇文章，你将了解到Python包安装失败的常见原因、如何切换镜像源、如何手动安装包，以及一些实用的注意事项。一、问题背景在开发一个深度学习项目时，我需要安装accelerate包来优化模型的训练过程。然而，当我运行以下命令时：bash复制pipins
LLaMA-Omni 深度解析：打开通往无缝人机语音交互的大门 kakaZhui 前沿多模态大模型：论文与实战 llama 交互 LLM TTS 语音识别语音合成人工智能
一、引言：语音交互大模型今天我们来看语音交互大模型LLaMA-Omni，它由中国科学院计算技术研究所的研究者们推出，是一个基于强大的Llama-3.1-8B-Instruct构建的语音语言模型。LLaMA-Omni不仅实现了低至226ms的惊人交互延迟，还能同时生成高质量的文本与语音回复，真正意义上让大语言模型（LLM）具备了“听说”的能力。这篇博客将带你由浅入深，全方位地探索LLaMA-Omni
MCP协议：AI时代的“万能插座”如何重构IT生态与未来
MCP协议：AI时代的“万能插座”如何重构IT生态与未来在人工智能技术爆炸式发展的浪潮中，一个名为ModelContextProtocol（MCP）的技术协议正以惊人的速度重塑IT行业的底层逻辑。2024年11月由Anthropic首次发布，MCP在短短半年内获得OpenAI、谷歌、亚马逊、阿里、腾讯等全球科技巨头的支持，被业内誉为AI时代的HTTP协议或USB-C接口，正在成为连接大模型与现实世
从RNN循环神经网络到Transformer注意力机制：解析神经网络架构的华丽蜕变熊猫钓鱼>_> 神经网络 rnn transformer
1.引言在自然语言处理和序列建模领域，神经网络架构经历了显著的演变。从早期的循环神经网络（RNN）到现代的Transformer架构，这一演变代表了深度学习方法在处理序列数据方面的重大进步。本文将深入比较这两种架构，分析它们的工作原理、优缺点，并通过实验结果展示它们在实际应用中的性能差异。2.循环神经网络（RNN）2.1基本原理循环神经网络是专门为处理序列数据而设计的神经网络架构。RNN的核心思想
《算法备案全攻略：规范与流程引领数字时代新秩序》算法及大模型备案顾问刘老师算法备案深度学习 AIGC 语言模型算法人工智能
一、算法备案：开启合规新征程（一）备案规定的起源与发展2022年国家互联网信息办公室、工业和信息化部、公安部、国家市场监督管理总局联合发布《互联网信息服务算法推荐管理规定》，自2022年3月1日起施行。此后，相关规定不断完善和演进。如国家网信办于2022年8月、10月及2023年1月先后三次公布了《境内互联网信息服务算法备案清单》。同时，2022年发布的最高人民法院《关于规范和加强人工智能司法应用
项目中枚举与注解的结合使用飞翔的马甲 java enum annotation
前言：版本兼容，一直是迭代开发头疼的事，最近新版本加上了支持新题型，如果新创建一份问卷包含了新题型，那旧版本客户端就不支持，如果新创建的问卷不包含新题型，那么新旧客户端都支持。这里面我们通过给问卷类型枚举增加自定义注解的方式完成。顺便巩固下枚举与注解。一、枚举 1.在创建枚举类的时候，该类已继承java.lang.Enum类，所以自定义枚举类无法继承别的类，但可以实现接口。
【Scala十七】Scala核心十一：下划线_的用法 bit1129 scala
下划线_在Scala中广泛应用，_的基本含义是作为占位符使用。_在使用时是出问题非常多的地方，本文将不断完善_的使用场景以及所表达的含义 1. 在高阶函数中使用 scala> val list = List(-3,8,7,9) list: List[Int] = List(-3, 8, 7, 9) scala> list.filter(_ > 7) r
web缓存基础：术语、http报头和缓存策略 dalan_123 Web
对于很多人来说，去访问某一个站点，若是该站点能够提供智能化的内容缓存来提高用户体验，那么最终该站点的访问者将络绎不绝。缓存或者对之前的请求临时存储，是http协议实现中最核心的内容分发策略之一。分发路径中的组件均可以缓存内容来加速后续的请求，这是受控于对该内容所声明的缓存策略。接下来将讨web内容缓存策略的基本概念，具体包括如如何选择缓存策略以保证互联网范围内的缓存能够正确处理的您的内容，并谈论下
crontab 问题周凡杨 linux crontab unix
一： 0481-079 Reached a symbol that is not expected. 背景： */5 * * * * /usr/IBMIHS/rsync.sh
让tomcat支持2级域名共享session g21121 session
tomcat默认情况下是不支持2级域名共享session的，所有有些情况下登陆后从主域名跳转到子域名会发生链接session不相同的情况，但是只需修改几处配置就可以了。打开tomcat下conf下context.xml文件找到Context标签,修改为如下内容如果你的域名是www.test.com <Context sessionCookiePath="/path&q
web报表工具FineReport常用函数的用法总结（数学和三角函数）老A不折腾 Web finereport 总结
ABS ABS(number):返回指定数字的绝对值。绝对值是指没有正负符号的数值。 Number:需要求出绝对值的任意实数。示例: ABS(-1.5)等于1.5。 ABS(0)等于0。 ABS(2.5)等于2.5。 ACOS ACOS(number):返回指定数值的反余弦值。反余弦值为一个角度，返回角度以弧度形式表示。 Number:需要返回角
linux 启动java进程 sh文件墙头上一根草 linux shell jar
#!/bin/bash #初始化服务器的进程PId变量 user_pid=0; robot_pid=0; loadlort_pid=0; gateway_pid=0; ######### #检查相关服务器是否启动成功 #说明： #使用JDK自带的JPS命令及grep命令组合，准确查找pid #jps 加 l 参数，表示显示java的完整包路径 #使用awk，分割出pid
我的spring学习笔记5-如何使用ApplicationContext替换BeanFactory aijuans Spring 3 系列
如何使用ApplicationContext替换BeanFactory？ package onlyfun.caterpillar.device; import org.springframework.beans.factory.BeanFactory; import org.springframework.beans.factory.xml.XmlBeanFactory; import
Linux 内存使用方法详细解析 annan211 linux 内存 Linux内存解析
来源 http://blog.jobbole.com/45748/ 我是一名程序员，那么我在这里以一个程序员的角度来讲解Linux内存的使用。一提到内存管理，我们头脑中闪出的两个概念，就是虚拟内存，与物理内存。这两个概念主要来自于linux内核的支持。 Linux在内存管理上份为两级，一级是线性区，类似于00c73000-00c88000，对应于虚拟内存，它实际上不占用
数据库的单表查询常用命令及使用方法(-) 百合不是茶 oracle 函数单表查询
创建数据库; --建表 create table bloguser(username varchar2(20),userage number(10),usersex char(2)); 创建bloguser表,里面有三个字段 &nbs
多线程基础知识 bijian1013 java 多线程 thread java多线程
一．进程和线程进程就是一个在内存中独立运行的程序，有自己的地址空间。如正在运行的写字板程序就是一个进程。 “多任务”：指操作系统能同时运行多个进程（程序）。如WINDOWS系统可以同时运行写字板程序、画图程序、WORD、Eclipse等。线程：是进程内部单一的一个顺序控制流。线程和进程 a. 每个进程都有独立的
fastjson简单使用实例 bijian1013 fastjson
一.简介阿里巴巴fastjson是一个Java语言编写的高性能功能完善的JSON库。它采用一种“假定有序快速匹配”的算法，把JSON Parse的性能提升到极致，是目前Java语言中最快的JSON库；包括“序列化”和“反序列化”两部分，它具备如下特征：
【RPC框架Burlap】Spring集成Burlap bit1129 spring
Burlap和Hessian同属于codehaus的RPC调用框架，但是Burlap已经几年不更新，所以Spring在4.0里已经将Burlap的支持置为Deprecated,所以在选择RPC框架时，不应该考虑Burlap了。这篇文章还是记录下Burlap的用法吧，主要是复制粘贴了Hessian与Spring集成一文，【RPC框架Hessian四】Hessian与Spring集成
【Mahout一】基于Mahout 命令参数含义 bit1129 Mahout
1. mahout seqdirectory $ mahout seqdirectory --input (-i) input Path to job input directory(原始文本文件). --output (-o) output The directory pathna
linux使用flock文件锁解决脚本重复执行问题 ronin47 linux lock　重复执行
linux的crontab命令，可以定时执行操作，最小周期是每分钟执行一次。关于crontab实现每秒执行可参考我之前的文章《linux crontab 实现每秒执行》现在有个问题，如果设定了任务每分钟执行一次，但有可能一分钟内任务并没有执行完成，这时系统会再执行任务。导致两个相同的任务在执行。例如： <? // test .php
java-74-数组中有一个数字出现的次数超过了数组长度的一半，找出这个数字 bylijinnan java
public class OcuppyMoreThanHalf { /** * Q74 数组中有一个数字出现的次数超过了数组长度的一半，找出这个数字 * two solutions: * 1.O(n) * see <beauty of coding>--每次删除两个不同的数字，不改变数组的特性 * 2.O(nlogn) * 排序。中间
linux 系统相关命令 candiio linux
系统参数 cat /proc/cpuinfo cpu相关参数 cat /proc/meminfo 内存相关参数 cat /proc/loadavg 负载情况性能参数 1）top M：按内存使用排序 P：按CPU占用排序 1：显示各CPU的使用情况 k：kill进程 o：更多排序规则回车：刷新数据 2）ulimit ulimit -a：显示本用户的系统限制参
[经营与资产]保持独立性和稳定性对于软件开发的重要意义 comsci 软件开发
一个软件的架构从诞生到成熟，中间要经过很多次的修正和改造如果在这个过程中，外界的其它行业的资本不断的介入这种软件架构的升级过程中那么软件开发者原有的设计思想和开发路线
在CentOS5.5上编译OpenJDK6 Cwind linux OpenJDK
几番周折终于在自己的CentOS5.5上编译成功了OpenJDK6，将编译过程和遇到的问题作一简要记录，备查。 0. OpenJDK介绍 OpenJDK是Sun（现Oracle）公司发布的基于GPL许可的Java平台的实现。其优点： 1、它的核心代码与同时期Sun（-> Oracle）的产品版基本上是一样的，血统纯正，不用担心性能问题，也基本上没什么兼容性问题；（代码上最主要的差异是
java乱码问题 dashuaifu java乱码问题 js中文乱码
swfupload上传文件参数值为中文传递到后台接收中文乱码在js中用setPostParams（{"tag" : encodeURI( document.getElementByIdx_x("filetag").value，"utf-8")}）; 然后在servlet中String t
cygwin很多命令显示command not found的解决办法 dcj3sjt126com cygwin
cygwin很多命令显示command not found的解决办法修改cygwin.BAT文件如下 @echo off D: set CYGWIN=tty notitle glob set PATH=%PATH%;d:\cygwin\bin;d:\cygwin\sbin;d:\cygwin\usr\bin;d:\cygwin\usr\sbin;d:\cygwin\us
[介绍]从 Yii 1.1 升级 dcj3sjt126com PHP yii2
2.0 版框架是完全重写的，在 1.1 和 2.0 两个版本之间存在相当多差异。因此从 1.1 版升级并不像小版本间的跨越那么简单，通过本指南你将会了解两个版本间主要的不同之处。如果你之前没有用过 Yii 1.1，可以跳过本章，直接从"入门篇"开始读起。请注意，Yii 2.0 引入了很多本章并没有涉及到的新功能。强烈建议你通读整部权威指南来了解所有新特性。这样有可能会发
Linux SSH免登录配置总结 eksliang ssh-keygen Linux SSH免登录认证 Linux SSH互信
转载请出自出处：http://eksliang.iteye.com/blog/2187265 一、原理我们使用ssh-keygen在ServerA上生成私钥跟公钥，将生成的公钥拷贝到远程机器ServerB上后,就可以使用ssh命令无需密码登录到另外一台机器ServerB上。生成公钥与私钥有两种加密方式，第一种是
手势滑动销毁Activity gundumw100 android
老是效仿ios，做android的真悲催！有需求：需要手势滑动销毁一个Activity 怎么办尼？自己写？不用~，网上先问一下百度。结果： http://blog.csdn.net/xiaanming/article/details/20934541 首先将你需要的Activity继承SwipeBackActivity，它会在你的布局根目录新增一层SwipeBackLay
JavaScript变换表格边框颜色 ini JavaScript html Web html5 css
效果查看：http://hovertree.com/texiao/js/2.htm代码如下，保存到HTML文件也可以查看效果： <html> <head> <meta charset="utf-8"> <title>表格边框变换颜色代码-何问起</title> </head> <body&
Kafka Rest : Confluent kane_xie kafka REST confluent
最近拿到一个kafka rest的需求，但kafka暂时还没有提供rest api（应该是有在开发中，毕竟rest这么火），上网搜了一下，找到一个Confluent Platform，本文简单介绍一下安装。这里插一句，给大家推荐一个九尾搜索，原名叫谷粉SOSO，不想fanqiang谷歌的可以用这个。以前在外企用谷歌用习惯了，出来之后用度娘搜技术问题，那匹配度简直感人。环境声明：Ubu
Calender不是单例 men4661273 单例 Calender
在我们使用Calender的时候，使用过Calendar.getInstance()来获取一个日期类的对象，这种方式跟单例的获取方式一样，那么它到底是不是单例呢，如果是单例的话，一个对象修改内容之后，另外一个线程中的数据不久乱套了吗？从试验以及源码中可以得出，Calendar不是单例。测试： Calendar c1 =
线程内存和主内存之间联系 qifeifei java thread
1， java多线程共享主内存中变量的时候，一共会经过几个阶段， lock:将主内存中的变量锁定，为一个线程所独占。 unclock:将lock加的锁定解除，此时其它的线程可以有机会访问此变量。 read:将主内存中的变量值读到工作内存当中。 load:将read读取的值保存到工作内存中的变量副本中。
schedule和scheduleAtFixedRate tangqi609567707 java timer schedule
原文地址：http://blog.csdn.net/weidan1121/article/details/527307 import java.util.Timer;import java.util.TimerTask;import java.util.Date; /** * @author vincent */public class TimerTest {
erlang 部署 wudixiaotie erlang
1.如果在启动节点的时候报这个错： {"init terminating in do_boot",{'cannot load',elf_format,get_files}} 则需要在reltool.config中加入 {app, hipe, [{incl_cond, exclude}]}, 2.当generate时，遇到： ERROR

低成本指令数据集构建：《Self-Instruct: Aligning Language Model with Self Generated Instructions》阅读笔记

简介

方法

步骤一 指令生成

步骤二 分类任务识别

步骤三 实例生成

步骤四 过滤和后处理