Developed at the University of Lisbon by NLX/FCUL and CLUL
CINTIL is a corpus of Portuguese with 1 million annotated tokens, each of which has been verified by human expert annotators. The annotation covers part of speech, lemma and inflection for the open classes, multi-word expressions belonging to the class of adverbs and to the closed POS classes, and multi-word proper names (for named entity recognition).
Over one third of the corpus consists of transcribed spoken materials, and about half of that is the transcription of informal conversations.
The rest of the corpus consists of written materials. The majority (58.73%) of the written part consists of articles from newspapers and magazines, such as Jornal Público, Diário de Notícias and Revista Visão. The remainder of the written part is mostly fiction.
A more detailed breakdown of the composition of CINTIL is presented in the following table:
Source | Share of corpus | Tokens |
---|---|---|
Written | | 689,124 |
News | 33.96% | 404,690 |
Fiction | 16.80% | 200,194 |
Other | 7.07% | 84,240 |
Spoken | | 502,622 |
Formal/Natural | 8.18% | 97,499 |
Formal/Media | 7.45% | 88,727 |
Formal/Phone | 4.05% | 48,284 |
Informal/Private | 18.26% | 217,604 |
Informal/Public | 4.05% | 48,221 |
Informal/Phone | 0.19% | 2,287 |
Total | | 1,191,746 |
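The figures in the table can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the token counts are copied from the table, while the dictionary and function names are made up for this example.

```python
# Token counts copied from the composition table; names are illustrative only.
WRITTEN = {"News": 404_690, "Fiction": 200_194, "Other": 84_240}
SPOKEN = {
    "Formal/Natural": 97_499,
    "Formal/Media": 88_727,
    "Formal/Phone": 48_284,
    "Informal/Private": 217_604,
    "Informal/Public": 48_221,
    "Informal/Phone": 2_287,
}


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


written_total = sum(WRITTEN.values())
spoken_total = sum(SPOKEN.values())
corpus_total = written_total + spoken_total

# Share of news within the written part (the 58.73% quoted in the text).
news_of_written = share(WRITTEN["News"], written_total)

# Share of news within the whole corpus (the 33.96% in the table).
news_of_corpus = share(WRITTEN["News"], corpus_total)

# Share of informal conversation within the spoken part.
informal_tokens = sum(n for name, n in SPOKEN.items() if name.startswith("Informal/"))
informal_of_spoken = share(informal_tokens, spoken_total)

print(written_total, spoken_total, corpus_total)
print(news_of_written, news_of_corpus, informal_of_spoken)
```

The percentages in the table are relative to the whole corpus, whereas the 58.73% quoted for news is relative to the written part only.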
The part-of-speech tags used in the annotation are the following:
Tag | Category | Examples |
---|---|---|
ADJ | Adjectives | bom, brilhante, eficaz, … |
ADV | Adverbs | hoje, já, sim, felizmente, … |
CARD | Cardinals | zero, dez, cem, mil, … |
CJ | Conjunctions | e, ou, tal como, … |
CL | Clitics | o, lhe, se, … |
CN | Common Nouns | computador, cidade, ideia, … |
DA | Definite Articles | o, os, … |
DEM | Demonstratives | este, esses, aquele, … |
DFR | Denominators of Fractions | meio, terço, décimo, %, … |
DGTR | Roman Numerals | VI, LX, MMIII, MCMXCIX, … |
DGT | Arabic Numerals | 0, 1, 42, 12345, 67890, … |
DM | Discourse Marker | olá, … |
EADR | Electronic Addresses | http://www.di.fc.ul.pt, … |
EOE | End of Enumeration | etc |
EXC | Exclamation | ah, ei, … |
GER | Gerunds | sendo, afirmando, vivendo, … |
GERAUX | Gerund "ter"/"haver" in compound tenses | tendo, havendo |
IA | Indefinite Articles | uns, umas, … |
IND | Indefinites | tudo, alguém, ninguém, … |
INF | Infinitive | ser, afirmar, viver, … |
INFAUX | Infinitive "ter"/"haver" in compound tenses | ter, haver, … |
INT | Interrogatives | quem, como, quando, … |
ITJ | Interjection | bolas, caramba, … |
LTR | Letters | a, b, c, … |
MGT | Magnitude Classes | unidade, dezena, dúzia, resma, … |
MTH | Months | Janeiro, Dezembro, … |
NP | Noun Phrases | idem, … |
ORD | Ordinals | primeiro, centésimo, penúltimo, … |
PADR | Part of Address | Rua, av., rot., … |
PNM | Part of Name | Lisboa, António, João, … |
PNT | Punctuation Marks | ., ?, (, … |
POSS | Possessives | meu, teu, seu, … |
PPA | Past Participles not in compound tenses | sido, afirmados, vivida, … |
PP | Prepositional Phrases | algures, … |
PPT | Past Participle in compound tenses | sido, afirmado, vivido, … |
PREP | Prepositions | de, para, em redor de, … |
PRS | Personals | eu, tu, ele, … |
QNT | Quantifiers | todos, muitos, nenhum, … |
REL | Relatives | que, cujo, tal que, … |
STT | Social Titles | Presidente, drª., prof., … |
SYB | Symbols | @, #, &, … |
TERMN | Optional Terminations | (s), (as), … |
UM | "um" or "uma" | um, uma |
UNIT | Abbreviated Measurement Unit | kg., km., … |
VAUX | Finite "ter" or "haver" in compound tenses | temos, haveriam, … |
V | Verbs (other than PPA, PPT, INF or GER) | falou, falaria, … |
WD | Week Days | segunda, terça-feira, sábado, … |
Tags for multi-word expressions | | |
LADV1…LADVn | Multi-Word Adverbs | de facto, em suma, um pouco, … |
LCJ1…LCJn | Multi-Word Conjunctions | assim como, já que, … |
LDEM1…LDEMn | Multi-Word Demonstratives | o mesmo, … |
LDFR1…LDFRn | Multi-Word Denominators of Fractions | por cento |
LDM1…LDMn | Multi-Word Discourse Markers | pois não, até logo, … |
LITJ1…LITJn | Multi-Word Interjections | meu Deus |
LPRS1…LPRSn | Multi-Word Personals | a gente, si mesmo, V. Exa., … |
LPREP1…LPREPn | Multi-Word Prepositions | através de, a partir de, … |
LQD1…LQDn | Multi-Word Quantifiers | uns quantos, … |
LREL1…LRELn | Multi-Word Relatives | tal como, … |
Tags specific to the spoken corpus | | |
EMP | Emphasis | |
EL | Extra-linguistic | |
PL | Para-linguistic | |
FRG | Fragment | |
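The numbered multi-word tags (LADV1…LADVn and so on) mark each token's position inside an expression, so consecutive tokens can be regrouped into a single unit. The following is a minimal sketch of that regrouping. It assumes the input is already available as (token, tag) pairs, which is a hypothetical in-memory format and not the corpus file format. Labelling the merged expression with the tag minus its `L` prefix and index is likewise a choice made for this example.

```python
import re
from typing import List, Optional, Tuple

# Multi-word tags look like LADV1, LPREP3: "L" + base tag + position index.
MWE_TAG = re.compile(r"^L([A-Z]+)(\d+)$")


def group_multiword(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge runs tagged L<TAG>1..L<TAG>n into single (expression, TAG) pairs."""
    merged: List[Tuple[str, str]] = []
    open_base: Optional[str] = None  # base tag of the expression being built
    open_index = 0                   # index of the last token added to it
    for token, tag in tagged:
        match = MWE_TAG.match(tag)
        if match is None:
            # Ordinary single-token tag: copy through and close any open run.
            merged.append((token, tag))
            open_base = None
            continue
        base, index = match.group(1), int(match.group(2))
        if index > 1 and base == open_base and index == open_index + 1:
            # Continuation of the current expression: extend the last entry.
            previous_text, _ = merged[-1]
            merged[-1] = (previous_text + " " + token, base)
        else:
            # Index 1, or a sequence that does not continue the open run:
            # start a new expression.
            merged.append((token, base))
        open_base, open_index = base, index
    return merged


example = [
    ("de", "LADV1"), ("facto", "LADV2"), (",", "PNT"),
    ("a", "LPREP1"), ("partir", "LPREP2"), ("de", "LPREP3"),
    ("hoje", "ADV"),
]
print(group_multiword(example))
```

Single-word tags such as LTR contain no digits, so the pattern does not mistake them for multi-word tags.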
Tag | Description |
---|---|
Tags for nominal categories | |
m | Masculine |
f | Feminine |
g | Underspecified gender |
s | Singular |
p | Plural |
n | Underspecified number |
dim | Diminutive |
sup | Superlative |
comp | Comparative |
Tags for verbs | |
1 | First Person |
2 | Second Person |
3 | Third Person |
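pi | Presente do Indicativo (present indicative) |
ppi | Pretérito Perfeito do Indicativo (simple past indicative) |
ii | Pretérito Imperfeito do Indicativo (imperfect indicative) |
mpi | Pretérito Mais que Perfeito do Indicativo (pluperfect indicative) |
fi | Futuro do Indicativo (future indicative) |
c | Condicional (conditional) |
pc | Presente do Conjuntivo (present subjunctive) |
ic | Pretérito Imperfeito do Conjuntivo (imperfect subjunctive) |
fc | Futuro do Conjuntivo (future subjunctive) |
imp | Imperativo (imperative) |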
Tags for infinitive verbs | |
ifl | Inflected |
nifl | Not Inflected |
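The inflection tags can be expanded into readable descriptions with simple lookup tables. The sketch below uses the tag values and the Portuguese tense names from the table. How the tags are combined on a token is not specified here, so the two layouts used, gender followed by number (for example `ms`) and tense, hyphen, person, number (for example `ppi-3s`), are assumptions made for illustration, as are the function names.

```python
from typing import List

GENDER = {"m": "Masculine", "f": "Feminine", "g": "Underspecified gender"}
NUMBER = {"s": "Singular", "p": "Plural", "n": "Underspecified number"}
PERSON = {"1": "First Person", "2": "Second Person", "3": "Third Person"}
TENSE = {
    "pi": "Presente do Indicativo",
    "ppi": "Pretérito Perfeito do Indicativo",
    "ii": "Pretérito Imperfeito do Indicativo",
    "mpi": "Pretérito Mais que Perfeito do Indicativo",
    "fi": "Futuro do Indicativo",
    "c": "Condicional",
    "pc": "Presente do Conjuntivo",
    "ic": "Pretérito Imperfeito do Conjuntivo",
    "fc": "Futuro do Conjuntivo",
    "imp": "Imperativo",
}


def describe_nominal(features: str) -> List[str]:
    """Decode an ASSUMED gender+number string such as 'ms' or 'fp'."""
    gender, number = features[0], features[1]
    return [GENDER[gender], NUMBER[number]]


def describe_verb(features: str) -> List[str]:
    """Decode an ASSUMED 'tense-personnumber' string such as 'ppi-3s'."""
    tense, person_number = features.split("-")
    return [TENSE[tense], PERSON[person_number[0]], NUMBER[person_number[1]]]


print(describe_nominal("fp"))
print(describe_verb("ppi-3s"))
```

An unknown code raises a `KeyError`, which makes unexpected values easy to spot when checking annotated data.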
Position | Description | Semantic type | Description | Example |
---|---|---|---|---|
B- | beginning | | | |
I- | inside | | | |
O | outside | | | |
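The position prefixes follow the usual begin/inside/outside scheme, so named-entity spans can be recovered by scanning the tag sequence. The following is a minimal sketch of that scan. It assumes each tag is either `O` or a position prefix followed by a semantic type; the type labels `T1` and `T2` in the example are placeholders, not the corpus's actual semantic types.

```python
from typing import List, Optional, Tuple


def bio_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, semantic_type) for each B-/I- run."""
    spans: List[Tuple[int, int, str]] = []
    start: Optional[int] = None    # index where the open span began
    current: Optional[str] = None  # semantic type of the open span
    for i, tag in enumerate(tags):
        if tag.startswith("I-") and current is not None and tag[2:] == current:
            # Token continues the open span.
            continue
        if start is not None:
            # Any other tag closes the open span.
            spans.append((start, i, current))
            start, current = None, None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
        # "O" and stray "I-" tags with no matching open span are skipped.
    if start is not None:
        spans.append((start, len(tags), current))
    return spans


# Placeholder type labels T1 and T2 are hypothetical.
tags = ["B-T1", "I-T1", "O", "B-T2", "O"]
print(bio_spans(tags))
```

Each span is reported with an exclusive end index, so `tokens[start:end]` yields the tokens of the entity.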