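I'm back after a long absence!
Between final exams and the Spring Festival, I stopped updating this blog for quite a while. In February I started an internship at NetEase's AI Lab as a natural language processing intern. Because an upcoming project involves text generation, my manager asked me to enter a shared task at an academic conference to get familiar with the area. I picked the SDP 2021@NAACL LongSumm task. Another intern and I worked hard on it for a month and, representing NetEase AI Lab, finished first on the leaderboard. Our SBAS model scored 0.02 higher than the runner-up on every metric.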
{
"id": "79792577",
"blog_id": "4d803bc021f579d4aa3b24cec5b994",
"summary": [
"Task of translating natural language queries into regular expressions ...",
"Proposes a methodology for collecting a large corpus of regular expressions ...",
"Reports performance gain of 19.6% over state-of-the-art models.",
"Architecture LSTM based sequence to sequence neural network (with attention) Six layers ...",
"Attention over encoder layer.",
"...."
],
"author_id": "shugan",
"pdf_url": "http://arxiv.org/pdf/1608.03000v1",
"author_full_name": "Shagun Sodhani",
"source_website": "https://github.com/shagunsodhani/papers-I-read"
}
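As a rough illustration of how a record like the one above could be turned into a single training target, here is a minimal sketch. The field names (`id`, `summary`, `pdf_url`) come from the sample; the helper `load_abstractive_record` and the choice to join the summary sentences with spaces are my own assumptions, not part of the official task tooling.

```python
import json

# Trimmed copy of the record shown above; the "..." inside the strings is
# kept exactly as it appears in the sample.
RECORD_JSON = """
{
  "id": "79792577",
  "summary": [
    "Task of translating natural language queries into regular expressions ...",
    "Reports performance gain of 19.6% over state-of-the-art models.",
    "Attention over encoder layer."
  ],
  "pdf_url": "http://arxiv.org/pdf/1608.03000v1"
}
"""


def load_abstractive_record(raw: str) -> dict:
    """Hypothetical helper: parse one record and flatten its summary list."""
    record = json.loads(raw)
    # Drop empty entries and surrounding whitespace before joining.
    sentences = [s.strip() for s in record["summary"] if s.strip()]
    return {
        "id": record["id"],
        "pdf_url": record["pdf_url"],
        "target": " ".join(sentences),  # assumed target format
        "num_sentences": len(sentences),
    }


example = load_abstractive_record(RECORD_JSON)
print(example["id"], example["num_sentences"])
```

The source paper itself still has to be fetched from `pdf_url` and converted to text separately; that step is not shown here.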
Title: A Binarized Neural Network Joint Model for Machine Translation
Url: https://doi.org/10.18653/v1/d15-1250
Extractive summaries:
0 23 Supertagging in lexicalized grammar parsing is known as almost parsing (Bangalore and Joshi, 1999), in that each supertag is syntactically informative and most ambiguities are resolved once a correct supertag is assigned to every word.
16 22 Our model does not resort to the recursive networks while modeling tree structures via dependencies.
23 19 CCG has a nice property that since every category is highly informative about attachment decisions, assigning it to every word (supertagging) resolves most of its syntactic structure.
24 14 Lewis and Steedman (2014) utilize this characteristics of the grammar.
27 16 Their model looks for the most probable y given a sentence x of length N from the set Y (x) of possible CCG trees under the model of Eq.
28 38 Since this score is factored into each supertag, they call the model a supertag-factored model.
30 58 ci,j is the sequence of categories on such Viterbi parse, and thus b is called the Viterbi inside score, while a is the approximation (upper bound) of the Viterbi outside score.
31 19 A* parsing is a kind of CKY chart parsing augmented with an agenda, a priority queue that keeps the edges to be explored.
48 37 As we will see this keeps our joint model still locally factored and A* search tractable.
50 11 We define a CCG tree y for a sentence x = ⟨xi, .
… …
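The extractive sample above can be read line by line with a small parser. This is only a sketch: I am assuming that the first number on each line is the sentence's position in the paper and that the second is an importance score, and the names `ExtractiveLine`, `parse_extractive` and `top_k` are made up for illustration.

```python
from typing import List, NamedTuple


class ExtractiveLine(NamedTuple):
    index: int  # assumed: position of the sentence in the source paper
    score: int  # assumed: importance score attached to the sentence
    text: str


# Three lines copied from the sample above.
SAMPLE = """\
16 22 Our model does not resort to the recursive networks while modeling tree structures via dependencies.
24 14 Lewis and Steedman (2014) utilize this characteristics of the grammar.
28 38 Since this score is factored into each supertag, they call the model a supertag-factored model.
"""


def parse_extractive(block: str) -> List[ExtractiveLine]:
    """Parse 'index score sentence' lines; skip anything else (Title:, Url:, ...)."""
    rows = []
    for line in block.splitlines():
        parts = line.split(maxsplit=2)
        if len(parts) < 3 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue
        rows.append(ExtractiveLine(int(parts[0]), int(parts[1]), parts[2]))
    return rows


def top_k(rows: List[ExtractiveLine], k: int) -> List[ExtractiveLine]:
    """Keep the k highest-scoring sentences, restored to document order."""
    chosen = sorted(rows, key=lambda r: r.score, reverse=True)[:k]
    return sorted(chosen, key=lambda r: r.index)


rows = parse_extractive(SAMPLE)
best = top_k(rows, 2)
print([r.index for r in best])
```

Selecting by score and then re-sorting by index is just one way the second column might be used to control summary length while keeping the sentences in reading order.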
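This post focuses on how the model evolved and on our final solution. Details on data conversion and preprocessing, the models we tried, and the comparison experiments will be available in the forthcoming paper.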
Fine-tune BERT for Extractive Summarization
PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization
Big Bird: Transformers for Longer Sequences
SBAS is the model architecture we designed for this task; the remaining entries are results from some of the other participants.
"1005": "to make the architecture task adaptive , the paper proposes the concept of met-adapt controller modules these modules are added to the model and are meta-trained to predict the optimal network connections for a given novel task a few-shot classification (fsc) is a popular method for approaching fsc . in meta-learning , the inputs to both train and test phases are not images but instead a set of few-shot tasks ti each k-shot / n-way task containing a small amount of k (usually) of labeled support images and some amount of unlabeled query images for each of the n categories of the task the goal of meta-learning is to find a base model that is easily adapted to the specific task at hand so that it will generalize well to tasks built from novel unseen categories and fulfill the goal of fsc it seems that larger architectures increase fsc performance up to a certain size where performance seems to saturate or even degrade this happens since bigger backbones carry higher risk of over-fitting it seems the overall performance of the fsc techniques cannot continue to grow by simply expanding the backbone size in light of the above set to explore methods for architecture search their meta-adaptation and optimization for fsc a few-shot learning problem is one where you train a model on a small number of examples , and you try to classify the new examples according to their proximity to the ones that have already been trained on the previous task . sub-models metadapt controllers predict the change in connectivity that is needed in the learned graph as a function of the current task replacing simple sequence of convolutional layers with the suggested dag , with its many layers and parameters in conventional gradient descent training will result in a larger over-fitting . 
this is even worse for fsl where it is harder to achieve generalization due to scarcity of the data and the domain differences between the training and test sets the weights are optimized using sgd optimizer with learning rate 0 . tldr; the authors propose met modifiers , a few-shot learning approach that enables meta-learned network architecture that is adaptive to novel few-shot tasks . the goal of meta-learning is to find a base model that is easily adapted to the specific task at hand, so that it will generalize well to tasks built from novel unseen categories and fulfill the goal of fsc (see section for further review). one of such major factors is the cnn backbone architecture at the basis of all the modern fsc methods. to summarize, our contributions in this work are as follows: (1) show that darts-like bi-level iterative optimization of layer weights and network connections performs well for few-shot classification without suffering from overfitting due to over-parameterization; (2) show that adding small neural networks, metadapt controllers, that adapt the connections in the main network according to the given task further (and significantly) improves performance; (3) using the proposed method, obtain improvements over fsc state-of-the-art on two popular fsc benchmarks: miniimagenet and fc100 these approaches include: (i) semi-supervised approaches using additional unlabeled data 9,14; (ii) fine tuning from pre-trained models 31,62,63; (iii) applying domain transfer by borrowing examples from relevant categories or using semantic vocabularies 3,15; (iv) rendering synthetic examples 42,10,56; (v) augmenting the training examples using geometric and photometric transformations or learning adaptive augmentation strategies 21; (vi) example synthesis using generative adversarial networks (gans) 69,25,20,48,45,35,11,23,2. 
in 22,54 additional examples are synthesized via extracting, encoding, and transferring to the novel category instances, of the intra-class relations between pairs of instances of reference categories. the list of search space operations used in our experiments is provided in table this list includes the zero-operation and identity-operation that can fully or partially (depending on the corresponding (i,j) o ) cut the connection or make it a residual one (skip-connection). darts, at search time the training is done on the full model at each iteration where each edge is a weighted-sum of its operations according to i,j contrarily, in snas i,j are treated as probabilities of a multinomial distribution and at each iteration a single operation is sampled accordingly. (9) here i,j are after softmax normalization and summed to at test time, rather than the one-hot approximation, use the operation with the top probability zi,jk 1, if k argmax(i,j) 0, otherwise (10) using this method get better results for fc100 1-shot and comparable results for 5-shot, compared to vanilla metaoptnet. the proposed approach effectively applies tools from the neural architecture search (nas) literature, extended with the concept of metadapt controllers’, in order to learn adaptive architectures. these tools help mitigate over-fitting to the extremely small data of the few-shot tasks and domain shift between the training set and the test set. demonstrate that the proposed approach successfully improves state-of-the-art results on two popular few-shot benchmarks, miniimagenet and fc100, and carefully ablate the different optimization steps and design choices of the proposed approach. 
some interesting future work directions include extending the proposed approach to progressively searching the full network architecture (instead of just the last block), applying the approach to other few-shot tasks such as detection and segmentation, and researching into different variants of task-adaptivity including global connections modifiers and inter block adaptive wiring.",
"1002": "in this paper , propose a novel method that automatically generates summaries for scientific papers by utilizing videos of talks at scientific conferences . hypothesize that such talks constitute a coherent and concise description of the papers content and can form the basis for good summaries collected papers and their corresponding videos and created a dataset of paper summaries a model trained on this dataset achieves similar performance as models trained on a dataset of summaries created manually in addition validated the quality of our summaries by human experts the rate of publications of scientific papers is increasing and it is almost impossible for researchers to keep up with relevant research . the paper proposes talksumm (acronym for talk-based summarization) , a method to automatically generate extractive content-based summaries for scientific papers based on video talks the approach utilizes the transcripts of video conference talks and treat them as spoken summaries of pa-s then for summaries using unsupervised alignment rithms map the transcripts to the corresponding text and create extractive summaries alignment between text and videos was studied by bojanowski et al . downloaded the 4www cleo org igem org/videos/videos extracted the speech data then via a publicly available asr service extracted transcripts of the speech and based on the video metadata (e g title) retrieved the corresponding paper (in pdf format) used science parse7 to extract the text of the paper and applied a simple processing in order to filter out some noise (e starting with the word copyright) at the end of this process the text of each paper is associated with the corresponding transcript of the corresponding talk during the talk , the speaker generates words for describing ver-vite sentences from the paper one word at each time step . 
thus at each time step the speaker has a single sentence from the paper in mind and produces a word that constitutes a part of its ver-vite description then at the next time-step the speaker either stays with the same sentence or moves on to describing another sentence and so on given the transcript aim to retrieve those source sentences and use them as the summary the number of words uttered to describe each sentence can serve as importance score in dicating the amount of time the speaker spent describing the sentence this enables to control the summary length by considering the only the most important sentences up to some threshold use an hmm to model the assumed stay-tive process each hidden state of the hmm corresponds to a single sentence each hidden state of the hmm is conditioned over the sentences appearing in the start of the paper and the average value of the hmm over all papers is 0 5 where s is the average value of the sentences appearing in the start of the paper the model is evaluated on clsm , scisumm talksum and evalnet . automatic summarization is studied exhaustively for the news domain (cheng and lapata, 2016; see et al., 2017), while summarization of scientific papers is less studied, mainly due to the lack of large-scale training data. in such talks, the presenter (usually a co-author) must describe their paper coherently and concisely (since there is a time limit), providing a good basis for generating summaries. table gives an example of an alignment between 1vimeo.com/aclweb icml.cc/conferences/2017/videos a paper and its talk transcript (see table in the appendix for a complete example). summaries generated with our approach can then be used to train more complex and data- demanding summarization models. 
our main contributions are as follows: (1) propose a new approach to automatically generate summaries for scientific papers based on video talks; (2) create a new dataset, that contains summaries for papers from several computer science conferences, that can be used as training data; (3) show both automatic and human evaluations for our approach. the transcript itself cannot serve as a good summary for the corresponding paper, as it constitutes only one modality of the talk (which also consists of slides, for example), and hence cannot stand by itself and form a coherent written text. thus, to create an extractive paper summary based on the transcript, model the alignment between spoken words and sentences in the paper, assuming the following generative process: during the talk, the speaker generates words for describing verbally sentences from the paper, one word at each time step. training data using the hmm importance scores, create four training sets, two with fixed-length summaries (150 and words), and two with fixed ratio between summary and paper lengths (0.3 and 0.4). automatic evaluation table summarizes the results: both gcn cited text spans and talksumm-only models, are not able to obtain better performance than abstract8 however, for the hybrid approach, where the abstract is augmented with sentences from the summaries emitted by the models, our talksumm-hybrid outperforms both gcn hybrid and abstract. randomly selected presenters from our corpus and asked them to perform two tasks, given the generated summary of their paper: (1) for each sentence in the summary, asked them to indicate whether they considered it when preparing the talk (yes/no question); (2) asked them to globally evaluate the quality of the summary (1-5 scale, ranging from very bad to excellent, means good).",
Yang Liu. 2019. Fine-tune BERT for extractive summarization. arXiv preprint arXiv:1903.10318.
Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. 2020. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In International Conference on Machine Learning, pages 11328–11339. PMLR.
Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big Bird: Transformers for longer sequences. arXiv preprint arXiv:2007.14062.
Su Jianlin (苏剑林). (Jan. 01, 2021). SPACES: "Extract-then-Generate" Long Text Summarization (CAIL Competition Summary) [Blog post, in Chinese]. Retrieved from https://kexue.fm/archives/8046