Data Asset Exchange - Groningen Meaning Bank (Modified)

Sample Records

Term            POS Tags  Entity Tags
VOA             NNP       B-ORG
's              POS       O
Mil             NNP       B-PER
Arcega          NNP       I-PER
reports         VBZ       O
.               .         O
Mr.             NNP       B-PER
Blair           NNP       B-PER
left            VBD       O
for             IN        O
Turkey          NNP       B-GEO
Friday          NNP       B-TIM
from            IN        I-TIM
Brussels        NNP       I-TIM
,               ,         O
where           WRB       O
he              PRP       O
was             VBD       O
attending       VBG       O
a               DT        O
European        NNP       B-ORG
union           NNP       I-ORG
summit          NN        O
.               .         O
Indian          JJ        B-GPE
Foreign         NNP       O
secretary       NNP       O
shyam           NNP       B-PER
Saran           NNP       I-PER
and             CC        O
U.S.            NNP       B-ORG
Undersecretary  NNP       I-ORG
of              IN        I-ORG
State           NNP       I-ORG
Nicholas        NNP       I-ORG
Burns           NNP       I-ORG
are             VBP       O
are             VBP       O
leading         VBG       O
the             DT        O
negotiations    NNS       O
.               .         O
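
As an illustration of how records in this three-column layout might be read, here is a minimal Python sketch. It assumes whitespace-separated rows of term, POS tag, and entity tag, as in the sample above; the names `Record`, `parse_records`, and `split_sentences` are hypothetical, and splitting sentences on a "." token is a heuristic chosen for this sketch, not something the dataset documentation specifies.

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    term: str
    pos: str
    entity: str


# First sentence of the sample records shown above.
SAMPLE = """\
VOA NNP B-ORG
's POS O
Mil NNP B-PER
Arcega NNP I-PER
reports VBZ O
. . O
"""


def parse_records(text: str) -> List[Record]:
    """Parse whitespace-separated rows into (term, pos, entity) records."""
    records = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed rows
        records.append(Record(*parts))
    return records


def split_sentences(records: List[Record]) -> List[List[Record]]:
    """Group records into sentences, ending each at a '.' token (heuristic)."""
    sentences, current = [], []
    for rec in records:
        current.append(rec)
        if rec.term == "." and rec.pos == ".":
            sentences.append(current)
            current = []
    if current:  # keep a trailing sentence that has no final period
        sentences.append(current)
    return sentences


records = parse_records(SAMPLE)
sentences = split_sentences(records)
print(len(records), len(sentences))  # prints: 6 1
```

Abbreviations such as "Mr." are single tokens and therefore do not trigger a sentence break under this heuristic.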

Metadata

Format: IOB format
License: CDLA-Sharing
Domain: Natural Language Processing
Number of Records: 1,314,115 terms
Size: 10 MB
Origin: University of Groningen
Dataset Version Update: Version 2 - May 14, 2020; Version 1 - December 19, 2019
Dataset Coverage: The dataset is limited to documents authored by Voice of America (VOA), documents from the MASC dataset, and the CIA World Factbook.
Business Use Case

    Linguistics

  • Can be used to generate new text features and train a model to perform named entity recognition.
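
One way the text features mentioned above could be generated is sketched below in Python. The feature set (casing, suffix, POS tag, and neighboring tokens) is a common choice for sequence-labeling models and is an assumption of this sketch, not something the dataset prescribes; `token_features` is a hypothetical helper name.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (term, POS tag)


def token_features(sentence: List[Token], i: int) -> Dict[str, object]:
    """Build a feature dictionary for the token at position i."""
    term, pos = sentence[i]
    features: Dict[str, object] = {
        "lower": term.lower(),
        "is_title": term.istitle(),
        "is_upper": term.isupper(),
        "is_digit": term.isdigit(),
        "suffix3": term[-3:],
        "pos": pos,
    }
    if i > 0:
        prev_term, prev_pos = sentence[i - 1]
        features["prev_lower"] = prev_term.lower()
        features["prev_pos"] = prev_pos
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sentence) - 1:
        next_term, next_pos = sentence[i + 1]
        features["next_lower"] = next_term.lower()
        features["next_pos"] = next_pos
    else:
        features["EOS"] = True  # end of sentence
    return features


# First sentence of the sample records, without the entity tags.
sentence = [
    ("VOA", "NNP"), ("'s", "POS"), ("Mil", "NNP"),
    ("Arcega", "NNP"), ("reports", "VBZ"), (".", "."),
]
X = [token_features(sentence, i) for i in range(len(sentence))]
print(X[2])
```

Feature dictionaries of this shape, paired with the entity tags as labels, can then be passed to a sequence model of your choice.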

Dataset Feature Definition

Feature Description
Term Word
POS Tags

Indicates the part-of-speech tag of each term. For example:

  • CC: Coordinating conjunction

  • CD: Cardinal number

  • DT: Determiner

  • EX: Existential there

  • FW: Foreign word

These are only a few of the POS tags. For the full set, refer to the alphabetical list of Penn Treebank part-of-speech tags from UPenn.
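
For quick inspection of records, a small lookup table can translate tags into descriptions. The Python sketch below is partial: it covers the five tags listed above plus the other tags that occur in the sample records, with descriptions following the Penn Treebank tag set. The names `POS_DESCRIPTIONS` and `describe_pos` are hypothetical.

```python
# Partial lookup of Penn Treebank POS tags (not the full tag set).
POS_DESCRIPTIONS = {
    "CC": "Coordinating conjunction",
    "CD": "Cardinal number",
    "DT": "Determiner",
    "EX": "Existential there",
    "FW": "Foreign word",
    "IN": "Preposition or subordinating conjunction",
    "JJ": "Adjective",
    "NN": "Noun, singular or mass",
    "NNS": "Noun, plural",
    "NNP": "Proper noun, singular",
    "POS": "Possessive ending",
    "PRP": "Personal pronoun",
    "VBD": "Verb, past tense",
    "VBG": "Verb, gerund or present participle",
    "VBP": "Verb, non-3rd person singular present",
    "VBZ": "Verb, 3rd person singular present",
    "WRB": "Wh-adverb",
}


def describe_pos(tag: str) -> str:
    """Return a description for a POS tag, or a fallback if it is not listed."""
    return POS_DESCRIPTIONS.get(tag, "Not in this partial table")


print(describe_pos("NNP"))  # prints: Proper noun, singular
```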

Entity Tags

The entity tags cover eight types of named entities (persons, locations, organizations, geo-political entities, artifacts, events, natural objects, and time), plus a tag for 'no entity'.

Each entity type is further marked with either a 'B-' or an 'I-' prefix.

A 'B-' tag indicates the first term of a new entity (or only term of a single-term entity), while subsequent terms in an entity will have an 'I-' tag. For example, 'New York' would be tagged as ['B-GEO', 'I-GEO'] while 'London' would be tagged as 'B-GEO'.
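
The B-/I- convention described above can be decoded back into whole entities. The following Python sketch shows one way to do it; `iob_to_entities` is a hypothetical helper, the example sentence is made up around the 'New York' and 'London' examples, and treating a stray 'I-' tag as the start of a new entity is an assumption of this sketch.

```python
from typing import List, Tuple


def iob_to_entities(terms: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (entity text, entity type) pairs from parallel term/tag lists."""
    entities: List[Tuple[str, str]] = []
    current_terms: List[str] = []
    current_type = None

    def flush() -> None:
        # Emit the entity collected so far, if any.
        if current_terms:
            entities.append((" ".join(current_terms), current_type))

    for term, tag in zip(terms, tags):
        if tag.startswith("B-"):
            flush()
            current_terms, current_type = [term], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_terms.append(term)  # continue the open entity
        else:
            flush()
            current_terms, current_type = [], None
            if tag.startswith("I-"):
                # Stray 'I-' tag: start a new entity (assumption of this sketch).
                current_terms, current_type = [term], tag[2:]
    flush()
    return entities


terms = ["He", "flew", "from", "New", "York", "to", "London", "."]
tags = ["O", "O", "O", "B-GEO", "I-GEO", "O", "B-GEO", "O"]
print(iob_to_entities(terms, tags))
# prints: [('New York', 'GEO'), ('London', 'GEO')]
```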

The annotation scheme for named entities in the GMB distinguishes the following eight classes:

  • Person (PER) - Person entities are limited to individuals that are human or have human characteristics, such as divine entities.

  • Location (GEO) - Location entities are limited to geographical entities such as geographical areas and landmasses, bodies of water, and geological formations.

  • Organization (ORG) - Organization entities are limited to corporations, agencies, and other groups of people defined by an established organizational structure.

  • Geo-political Entity (GPE) - GPE entities are geographical regions defined by political and/or social groups. A GPE entity subsumes and does not distinguish between a city, a nation, its region, its government, or its people (LOC/ORG).

  • Artifact (ART) - Artifacts are limited to manmade objects, structures and abstract entities, including buildings, facilities, art and scientific theories.

  • Event (EVE) - Events are incidents and occasions that occur during a particular time.

  • Natural Object (NAT) - Natural objects are entities that occur naturally and are not manmade, such as diseases, biological entities and other living things.

  • Time (TIM) - Time entities are limited to references to certain temporal entities that have a name, such as the days of the week and months of a year. For all other temporal expressions the tagging layer timex is used (see below).

  • Other (O) - The 'no entity' tag, used for all other words that do not fall into any of the categories above.
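
The classes above, combined with the 'B-' and 'I-' prefixes, determine the full label set a model has to predict. The Python sketch below builds that label set and counts entities per class in a tag sequence; `ENTITY_CLASSES`, `LABELS`, and `count_entities` are hypothetical names, and counting one entity per 'B-' tag is an assumption that holds only for well-formed IOB sequences.

```python
from collections import Counter
from typing import Iterable

# Entity type codes and class names as listed above.
ENTITY_CLASSES = {
    "PER": "Person",
    "GEO": "Location",
    "ORG": "Organization",
    "GPE": "Geo-political Entity",
    "ART": "Artifact",
    "EVE": "Event",
    "NAT": "Natural Object",
    "TIM": "Time",
}

# Full label set: 'O' plus a B- and an I- variant of each entity type.
LABELS = ["O"] + [f"{p}-{code}" for code in ENTITY_CLASSES for p in ("B", "I")]


def count_entities(tags: Iterable[str]) -> Counter:
    """Count entities per class, one per 'B-' tag."""
    counts: Counter = Counter()
    for tag in tags:
        prefix, _, code = tag.upper().partition("-")
        if prefix == "B" and code in ENTITY_CLASSES:
            counts[ENTITY_CLASSES[code]] += 1
    return counts


# Entity tags of the first sentence in the sample records.
tags = ["B-ORG", "O", "B-PER", "I-PER", "O", "O"]
print(len(LABELS))           # prints: 17
print(count_entities(tags))
```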