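This public resource is cited by many open-source projects and papers related to natural language processing (NLP),
so I read the readme carefully and noted the key points.
Fields in every file are separated by the delimiter " +++$+++ ".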
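As a minimal sketch of how the delimiter can be handled, the snippet below splits one raw line into its fields. The names `SEP` and `split_fields` are illustrative choices, not part of the corpus or its readme.

```python
# Field delimiter used by the corpus files (note the surrounding spaces).
SEP = " +++$+++ "

def split_fields(line):
    """Split one raw corpus line into a list of field strings."""
    # Strip only the newline so that an empty trailing field is preserved.
    return line.rstrip("\n").split(SEP)

# Sample line copied from movie_lines.txt.
sample = "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!"
fields = split_fields(sample)
print(fields)
```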
- movie_titles_metadata.txt
- contains information about each movie title
- fields:
- movieID,
- movie title,
- movie year,
- IMDB rating,
- no. IMDB votes,
- genres in the format ['genre1','genre2',...,'genreN']
- movie_characters_metadata.txt
- contains information about each movie character
- fields:
- characterID
- character name
- movieID
- movie title
- gender ("?" for unlabeled cases)
- position in credits ("?" for unlabeled cases)
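The two metadata layouts above could be parsed roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, the two sample rows are invented for illustration (only the IDs `m0`, `u0` and the name BIANCA appear in the samples in this note), the rating and vote count are assumed to be numeric, and the genres field is assumed to be a valid Python list literal.

```python
import ast

SEP = " +++$+++ "

def parse_title(line):
    """Parse one line of movie_titles_metadata.txt into a dict."""
    movie_id, title, year, rating, votes, genres = line.rstrip("\n").split(SEP)
    return {
        "movieID": movie_id,
        "title": title,
        "year": year,                        # kept as a string
        "rating": float(rating),             # assumes a numeric rating
        "votes": int(votes),                 # assumes an integer vote count
        "genres": ast.literal_eval(genres),  # "['a', 'b']" -> ['a', 'b']
    }

def parse_character(line):
    """Parse one line of movie_characters_metadata.txt into a dict."""
    char_id, name, movie_id, title, gender, position = line.rstrip("\n").split(SEP)
    return {
        "characterID": char_id,
        "name": name,
        "movieID": movie_id,
        "title": title,
        # "?" marks unlabeled cases; map it to None.
        "gender": None if gender == "?" else gender,
        "position": None if position == "?" else position,
    }

# Invented rows, for illustration only.
title_row = parse_title(
    "m0 +++$+++ example movie +++$+++ 1999 +++$+++ 6.90 +++$+++ 1000 +++$+++ ['comedy', 'romance']"
)
char_row = parse_character(
    "u0 +++$+++ BIANCA +++$+++ m0 +++$+++ example movie +++$+++ ? +++$+++ ?"
)
print(title_row["genres"], char_row["gender"])
```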
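The two files below are the key ones: one contains all of the text, and the other describes how the pieces of text relate to each other.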
- movie_lines.txt
- contains the actual text of each utterance
- fields:
- lineID
- characterID (who uttered this phrase)
- movieID
- character name
- text of the utterance
The first 5 samples:
L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!
L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!
L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.
L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?
L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.
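One way to make these lines usable is to index them by lineID. The sketch below does that for the five samples above; `parse_lines` and `load_lines` are hypothetical helper names, and the `iso-8859-1` default encoding is an assumption that the readme summary here does not confirm.

```python
SEP = " +++$+++ "

def parse_lines(raw_lines):
    """Map each lineID to its speaker information and utterance text."""
    lines = {}
    for raw in raw_lines:
        # maxsplit=4 keeps the utterance text intact as the fifth field.
        line_id, char_id, movie_id, name, text = raw.rstrip("\n").split(SEP, 4)
        lines[line_id] = {
            "characterID": char_id,
            "movieID": movie_id,
            "character": name,
            "text": text,
        }
    return lines

def load_lines(path, encoding="iso-8859-1"):
    """Read movie_lines.txt from disk; the encoding is an assumption."""
    with open(path, encoding=encoding) as f:
        return parse_lines(f)

samples = [
    "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!",
    "L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!",
    "L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.",
    "L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?",
    "L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.",
]
lines = parse_lines(samples)
print(lines["L984"]["character"], "->", lines["L984"]["text"])
```

A dictionary keyed by lineID keeps each later lookup from a conversation entry to a single step.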
- movie_conversations.txt
- the structure of the conversations
- fields:
- characterID of the first character involved in the conversation
- characterID of the second character involved in the conversation
- movieID of the movie in which the conversation occurred
- list of the utterances that make up the conversation, in chronological
order: ['lineID1','lineID2',...,'lineIDN']
has to be matched with movie_lines.txt to reconstruct the actual content
The first 5 samples:
u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']
u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']
u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L200', 'L201', 'L202', 'L203']
u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L204', 'L205', 'L206']
u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L207', 'L208']
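Matching a conversation entry against movie_lines.txt could look like the sketch below. The function names are hypothetical, and the utterance texts are placeholders because the lines L198 and L199 are not among the samples shown in this note. The lineID list is assumed to be a valid Python list literal, which is why `ast.literal_eval` is used.

```python
import ast

SEP = " +++$+++ "

def parse_conversation(raw):
    """Parse one line of movie_conversations.txt into a dict."""
    first_id, second_id, movie_id, line_ids = raw.rstrip("\n").split(SEP)
    return {
        "first": first_id,
        "second": second_id,
        "movieID": movie_id,
        # "['L198', 'L199']" -> ['L198', 'L199']
        "lineIDs": ast.literal_eval(line_ids),
    }

def reconstruct(conversation, line_text):
    """Return the utterance texts of a conversation in chronological order."""
    return [line_text[line_id] for line_id in conversation["lineIDs"]]

# Placeholder texts standing in for the real content of movie_lines.txt.
line_text = {"L198": "<text of L198>", "L199": "<text of L199>"}

conv = parse_conversation("u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']")
dialogue = reconstruct(conv, line_text)
print(dialogue)
```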
- raw_script_urls.txt
- the URLs from which the raw sources were retrieved
Source:
http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html