Commonly used for testing named entity recognition (NER) models.
Context:
An annotated corpus for named entity recognition, built from the GMB (Groningen Meaning Bank) corpus. It is intended for entity classification and comes with an enhanced set of popular natural language processing features applied to the data set.
Tip: If you are using Python, load the dataset into a pandas DataFrame for convenience.
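As a minimal sketch of the tip above, the snippet below reads a CSV into a pandas DataFrame and counts the values of the target column. The helper names `load_ner_table` and `tag_counts` are hypothetical, and the three-row sample is invented purely so the snippet runs without the real file. The column name "tag" follows the target column named in this description; the actual files may spell or capitalize it differently, so check `df.columns` first.

```python
import io

import pandas as pd


def load_ner_table(source, encoding=None):
    """Read a CSV into a DataFrame.

    `source` may be a file path (for example "ner_dataset.csv") or a
    file-like object. `encoding` is left at the pandas default; passing
    another value such as "latin-1" is an assumption you may need to
    try if the default fails to decode the file.
    """
    return pd.read_csv(source, encoding=encoding)


def tag_counts(df, column="tag"):
    """Return {tag: number of rows} for the given column."""
    return df[column].value_counts().to_dict()


# Invented in-memory sample standing in for the real file.
sample = io.StringIO("word,tag\nLondon,geo-nam\nis,O\nrainy,O\n")
df = load_ner_table(sample)
counts = tag_counts(df)
print(counts)
```

For the real data, replace the in-memory sample with the path to the downloaded CSV.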
Content:
This is an extract from the GMB corpus that has been tagged, annotated, and built specifically to train a classifier to predict named entities such as names, locations, etc.
Number of tagged entities:
‘O’: 1146068, ‘geo-nam’: 58388, ‘org-nam’: 48034, ‘per-nam’: 23790, ‘gpe-nam’: 20680, ‘tim-dat’: 12786, ‘tim-dow’: 11404, ‘per-tit’: 9800, ‘per-fam’: 8152, ‘tim-yoc’: 5290, ‘tim-moy’: 4262, ‘per-giv’: 2413, ‘tim-clo’: 891, ‘art-nam’: 866, ‘eve-nam’: 602, ‘nat-nam’: 300, ‘tim-nam’: 146, ‘eve-ord’: 107, ‘per-ini’: 60, ‘org-leg’: 60, ‘per-ord’: 38, ‘tim-dom’: 10, ‘per-mid’: 1, ‘art-add’: 1
Essential info about entities:
geo = Geographical Entity
org = Organization
per = Person
gpe = Geopolitical Entity
tim = Time indicator
art = Artifact
eve = Event
nat = Natural Phenomenon
Total word count = 1354149
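The per-tag counts listed above can be cross-checked against the stated total and rolled up by entity type. The sketch below copies the counts from this description, groups them by the prefix before the hyphen, and sums them. The helper name `counts_by_entity_type` is hypothetical, and the reading of ‘O’ as "outside any entity" is the usual NER convention rather than something stated here.

```python
from collections import defaultdict

# Tag counts copied from the list in this description.
TAG_COUNTS = {
    "O": 1146068, "geo-nam": 58388, "org-nam": 48034, "per-nam": 23790,
    "gpe-nam": 20680, "tim-dat": 12786, "tim-dow": 11404, "per-tit": 9800,
    "per-fam": 8152, "tim-yoc": 5290, "tim-moy": 4262, "per-giv": 2413,
    "tim-clo": 891, "art-nam": 866, "eve-nam": 602, "nat-nam": 300,
    "tim-nam": 146, "eve-ord": 107, "per-ini": 60, "org-leg": 60,
    "per-ord": 38, "tim-dom": 10, "per-mid": 1, "art-add": 1,
}

# Prefix meanings as given in the entity list.
ENTITY_TYPES = {
    "geo": "Geographical Entity", "org": "Organization", "per": "Person",
    "gpe": "Geopolitical Entity", "tim": "Time indicator",
    "art": "Artifact", "eve": "Event", "nat": "Natural Phenomenon",
}


def counts_by_entity_type(tag_counts):
    """Sum counts per prefix; 'O' has no hyphen and stays as 'O'."""
    totals = defaultdict(int)
    for tag, n in tag_counts.items():
        totals[tag.split("-")[0]] += n
    return dict(totals)


by_type = counts_by_entity_type(TAG_COUNTS)
total_words = sum(TAG_COUNTS.values())
entity_tokens = total_words - TAG_COUNTS["O"]

for prefix, n in sorted(by_type.items(), key=lambda kv: -kv[1]):
    # Assumption: 'O' marks tokens outside any entity (usual NER convention).
    label = ENTITY_TYPES.get(prefix, "Outside any entity")
    print(f"{prefix:>3}  {label:<20} {n:>9,}  {n / total_words:7.2%}")

print(total_words)    # 1354149, matching the stated total word count
print(entity_tokens)  # 208081 tokens carry an entity tag
```

The sum of the per-tag counts equals the stated total of 1354149 words, and about 15% of tokens (208081) carry an entity tag, so the classes are heavily imbalanced toward ‘O’.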
Target Data Column: “tag”
Inspiration: This dataset has become more interesting because of the additional features in its recent version. It also helps build a broad view of feature engineering with respect to this dataset.
Why is this dataset helpful or fun to work with?
Earlier versions might not sound so interesting, but once you can pick out intent and custom named entities from your own sentences using more features, it gets interesting and helps you solve real business problems (such as extracting entities from electronic medical records).
Please feel free to ask questions, try variations, and let’s experiment together!
Data Explorer (164.26 MB):
ner.csv
ner_dataset.csv
Download link:
https://download.csdn.net/download/immenselee/13773860