Developed at the University of Lisbon by NLX/FCUL and CLUL
How to?
Table of contents
Cheat sheet / Quick reference
Query syntax cheat sheet
Basic query
a word matches itself
Query modifiers
/i | case-insensitive match
/x | sub-sequence matching
Character expressions
. | any single character
[ ] | character from a set
[^ ] | character from a negated set
Repetition operators
? | optional
* | zero or more times
+ | one or more times
{n} | exactly n times
{n,} | n or more times
{,n} | up to n times
{m,n} | from m to n times
Combining expressions
e1e2 | e1 followed by e2
e1|e2 | alternation
( ) | grouping
Search by annotation
[keyword=expression]
[keyword!=expression]
[key1=exp1 & key2=exp2]
[key1=exp1 | key2=exp2]
Regular expressions must be enclosed in quotes.
Contractions are expanded and encoded as two tokens, the first of which ends with an underscore.
Quick reference
Field | Keyword | Values
Orthographic form | orth | any
Part-of-speech tag | pos | full table
Inflection feature | gender | f, m, g
Inflection feature | number | s, p, n
Inflection feature | degree | dim, sup, comp
Inflection feature | person | 1, 2, 3
Inflection feature | time | full table
Inflection feature | inflection | ifl, nifl
Lemma (base form) | base | any
Named-entity | iob | full table
Metadata | source | writtennews, writtenfiction, writtenother, spoken
Query outcome
The CINTIL Online Concordancer retrieves passages containing
occurrences of a given target expression in the CINTIL corpus.
The target expression is entered in the query text box, and the retrieved
passages are displayed below that box.
When the "Show tags" box is checked, the concordancer also displays
the linguistic annotation.
For each token, this annotation is displayed between square brackets, with a
colon separating the fields. For instance, the annotation for the
common noun gatas is displayed as follows:
occurrence with annotation | → | gatas | [ | gato | : | cn | : | f | : | p | : | O | ]
keywords | → | orth | | base | | pos | | gender | | number | | IOB |
Note that this annotation is displayed in a format slightly
different from the one used in the corpus release, which is
described in the release documentation.
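The displayed annotation can be read back into keyword/value pairs. The following is a minimal Python sketch, assuming the field order shown above for a common noun (other categories may display different fields); the function name parse_displayed and the list NOUN_FIELDS are hypothetical and not part of the concordancer.

```python
# Field order as displayed for the common noun example (an assumption:
# tokens of other categories may show a different set of fields).
NOUN_FIELDS = ["base", "pos", "gender", "number", "iob"]

def parse_displayed(text):
    """Split 'gatas [ gato : cn : f : p : O ]' into a dict keyed by query keywords."""
    orth, _, rest = text.partition("[")
    # Drop the closing bracket, then split the fields on the colon separator.
    values = [v.strip() for v in rest.rstrip().rstrip("]").split(":")]
    token = {"orth": orth.strip()}
    token.update(zip(NOUN_FIELDS, values))
    return token

token = parse_displayed("gatas [ gato : cn : f : p : O ]")
print(token)
```

The resulting dictionary uses the same keywords (orth, base, pos, gender, number, iob) that are used when searching by annotation.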
For practical reasons, each passage returned with an occurrence contains at most 10 tokens.
Also for practical reasons, not all passages with occurrences of
the target expression in the CINTIL corpus are returned, and
the order in which the passages are displayed does not
correspond to the order of their occurrence
in the corpus. Note, however, that the outcome of the CINTIL
online concordancer can be used as a reference in research,
given that identical queries return identical outcomes.
When it is imperative to have access
to every occurrence,
the interested user can acquire a copy
of the corpus and run a concordancer of their choice over
that local copy.
User interface
The user interface of the online concordancer is largely self-explanatory.
The "Sort" buttons sort the results alphabetically according
to their context:
The right-hand button sorts the passages by their right context.
The left-hand button sorts the passages by their left context,
read from right to left (that is, starting with the word closest to the match).
The following example illustrates the use of these two buttons
on the outcome of the same search for carro (with 2 words of left
context and 1 word of right context):
no sort
...guiar um | carro | novo...
...ir de | carro | para...
...levar o | carro | até...
right sort
...levar o | carro | até...
...guiar um | carro | novo...
...ir de | carro | para...
left sort
...ir de | carro | para...
...levar o | carro | até...
...guiar um | carro | novo...
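The two sort orders can be reproduced with a short Python sketch. It is only an illustration of the behaviour described above, not the concordancer's implementation; the function names right_sort and left_sort are hypothetical, and plain string comparison is assumed to be close enough to alphabetical order for this lowercase example.

```python
# Each passage is (left context words, match, right context words).
passages = [
    (["guiar", "um"], "carro", ["novo"]),
    (["ir", "de"], "carro", ["para"]),
    (["levar", "o"], "carro", ["até"]),
]

def right_sort(rows):
    # Compare right contexts word by word, reading left to right.
    return sorted(rows, key=lambda row: row[2])

def left_sort(rows):
    # Compare left contexts starting from the word closest to the match,
    # i.e. reading the left context from right to left.
    return sorted(rows, key=lambda row: row[0][::-1])

for left, match, right in left_sort(passages):
    print(" ".join(left), "|", match, "|", " ".join(right))
```

With the three passages above, right_sort orders them by até, novo, para, and left_sort by de, o, um, which reproduces the two sorted tables.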
Searching orthographic forms
- Case sensitivity
- Search is case-sensitive. For a case-insensitive search, append /i to the orthographic form:
- gato returns occurrences of gato only
- gato/i returns occurrences of gato, Gato, GATO, etc.
- Sub-sequence matching
- A query expression matches whole tokens. For instance, gato
does not match parts of words, and will not return regato or
obrigatoriamente.
To allow sub-sequence matching, append /x to the orthographic form (it can be combined with the
/i modifier mentioned above).
For instance:
- gato will only match gato
- gato/x will match any word containing the string
gato, such as obrigatoriamente
- gato/xi is as above, but case-insensitive
- Contractions
- Note that in the CINTIL corpus contractions (e.g. daquela, aos, nas)
are expanded and encoded as two tokens, the first of which ends with an underscore symbol
(e.g. de_ aquela, a_ os, em_ as)
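The effect of the /i and /x modifiers on a plain word query can be emulated in Python. This is a sketch of the matching behaviour described above, under the assumption that a plain word is compared literally against each token; the function name matches and its flags argument are hypothetical.

```python
import re

def matches(form, token, flags=""):
    """Emulate a plain word query with optional 'i' and 'x' modifiers."""
    re_flags = re.IGNORECASE if "i" in flags else 0
    pattern = re.escape(form)  # a plain word is taken literally
    if "x" in flags:
        # /x: the form may occur anywhere inside the token
        return re.search(pattern, token, re_flags) is not None
    # default: the form must cover the whole token
    return re.fullmatch(pattern, token, re_flags) is not None

print(matches("gato", "Gato"))                    # case-sensitive by default
print(matches("gato", "Gato", "i"))               # gato/i
print(matches("gato", "obrigatoriamente", "x"))   # gato/x
```

Passing "xi" as flags combines both modifiers, as in gato/xi.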
Searching for patterns
It is possible to search with general patterns (also known as regular expressions). A query can
thus include a regular expression, provided it is enclosed in quotes. The usual notational conventions are followed:
- Alternation
- Alternatives are introduced by the | (vertical bar) character:
- "gato|peixe" matches all occurrences of gato and all occurrences of peixe
- Character sets
- A set of characters within square brackets matches an occurrence of
any one of those characters:
- "gat[ao]" matches occurrences of gata and
gato
- "[pg]at[ao]" will match occurrences of gata,
gato, pata and pato
A set can be negated by placing a ^ (caret) symbol immediately after the opening bracket.
- "[^abcd][efg]" matches tokens with two characters, the
first one not being a, b, c or d and the
second one being e, f or g
- Period
- The "." (period) matches any single character (letter,
digit or symbol):
- "gat.s" will match
gatas, gatbs, gatcs, gat1s, etc.
- Optionality
- The "?" (question mark) makes the
character/expression preceding it optional:
- "gatos?" matches gato and gatos.
- Iteration
- There are three ways of expressing iteration. The * (star)
operator matches the character/expression preceding it zero or more times:
- "gat.*" matches any word starting with gat,
including gat itself
- ".*gato.*" matches any word containing the string
gato (this is equivalent to gato/x)
The + (plus) operator is similar, but requires
at least one occurrence of the character/expression preceding it:
- "gat.+" matches any word starting with gat, but not
gat itself, since + requires at least one occurrence
Finally, {l,u} bounds the number of
iterations by a lower (l) and an upper
(u) value. Either bound may be omitted:
{l,} means "at least l times" and {,u}
means "at most u times". A single value, {n}, means "exactly
n times":
- "gat.{2,4}" matches words that start with gat and
that have 2 to 4 additional characters
- "[^aer]{5,}" matches words without a, e or
r that have 5 or more characters.
- Grouping
- Parentheses are used to group expressions. The operators
described above can then be applied to the whole expression in
parentheses as if it were a single character:
- "gat(inh)?o" matches gato and gatinho
(i.e. the sequence inh that follows t is optional)
- "ga(to)*" matches ga, gato, gatoto,
gatototo, etc. (i.e. to may occur zero or more times)
Note that any of these expressions may also be combined with the
/i and /x modifiers described previously.
For instance:
- "ga.*"/i matches words starting with ga,
Ga, gA or GA
- "(ra){2}"/x matches words that contain two consecutive occurrences of
ra (e.g. rara, mostraram, etc.)
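The pattern examples above can be tried out with Python's re module. This sketch assumes that, for the operators listed here, the concordancer's notation coincides with Python's regular expression syntax, and it uses re.fullmatch to mimic whole-token matching; the word list and the helper name select are hypothetical.

```python
import re

words = ["gato", "gata", "gatos", "gatinho", "pato", "regato", "rara", "mostraram"]

def select(pattern):
    # A quoted pattern must cover the whole token, hence fullmatch.
    return [w for w in words if re.fullmatch(pattern, w)]

print(select("[pg]at[ao]"))    # character sets
print(select("gat(inh)?o"))    # optional group
print(select("gat.{2,4}"))     # bounded iteration
print(select(".*(ra){2}.*"))   # what "(ra){2}"/x amounts to
```

Note that "gat.{2,4}" does not select gato or gata here, since only one character follows gat in those words.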
Searching through linguistic information
Each token is associated with linguistic information, encoded
by means of annotation tags. Each tag is composed of a field
and its value in square brackets ([field=value]), for example
[gender=m], [time=pi], etc.
Each field is identified by a keyword.
The values can be matched with any of the methods described above:
- [field=pattern] is the format for such queries.
Field-pattern pairs can be combined by using logical operators:
& (ampersand) for conjunction and | (vertical bar)
for disjunction:
- [field=pattern & field=pattern]
- [field=pattern | field=pattern]
In addition, the negation symbol ! (exclamation mark)
matches tokens whose field values do not conform to a given pattern:
- [!field=pattern] is one format for such negation
- [field!=pattern] is equivalent to the
previous query.
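How conjunction, disjunction and negation combine can be sketched in Python over a single annotated token. The dictionary representation, the function name field_matches and the treatment of a missing field as an empty value are assumptions made for illustration only.

```python
import re

def field_matches(token, field, pattern, negate=False):
    """True if token[field] fully matches pattern (or fails to, when negated)."""
    value = token.get(field, "")  # assumption: a missing field is an empty value
    hit = re.fullmatch(pattern, value) is not None
    return not hit if negate else hit

token = {"orth": "gatas", "base": "gato", "pos": "cn", "gender": "f", "number": "p"}

# [pos=cn & gender=f]
both = field_matches(token, "pos", "cn") and field_matches(token, "gender", "f")
# [pos=v | number=p]
either = field_matches(token, "pos", "v") or field_matches(token, "number", "p")
# [gender!=m]
not_masculine = field_matches(token, "gender", "m", negate=True)

print(both, either, not_masculine)
```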
Orthographic form (again)
The orthographic form itself can be matched via the keyword orth:
- [orth=gato] matches tokens with the orthographic form
gato. This returns the same result as simply searching
for gato. Using this alternative but equivalent way
is useful when combining orth with other fields (to be discussed below)
- [orth="gat.*" & orth!=gato] matches tokens that begin
with gat but that are not gato
Part-of-speech
Occurrences with a given part-of-speech (POS) category are selected with the pos keyword:
- [pos=cn] matches tokens with the POS tag cn (common noun)
- [pos=cn & orth="ga.*"] matches tokens that are common nouns and begin with ga
- [pos="d.*"] matches tokens with any POS tag whose name begins with d
- [pos!=pnt] matches tokens that are not punctuation (the pnt tag)
Here is the list of POS tags.
Nominal inflection
The keywords gender and number take, respectively, the values f
(feminine) or m (masculine), and the values s (singular) or p (plural).
They match occurrences with the selected inflection features:
- [gender=f] matches all tokens with feminine inflection
- [number=s & orth=".*s"] matches all tokens with singular
inflection that end in s
- [gender!=m] matches tokens that do not have masculine
inflection. Note that this also matches those tokens to which gender
inflection is not even applicable, such as prepositions, punctuation,
symbols, etc.
Some tokens may bear degree features, accessed through the degree keyword:
- [degree=dim] matches all tokens with diminutive degree
Verbal inflection
To match tokens according to their verbal inflection features,
use the person, time and number keywords:
- [person="1"] matches tokens inflected for first person
- [time="ppi"] matches tokens inflected for the Pretérito Perfeito Indicativo
- [person="3" & number="s" & time="fc"] matches all forms expressing the third person singular of Futuro Conjuntivo
- [person!="1"] matches tokens that do not have 1st-person
inflection. Note that this also matches those tokens to which person
inflection is not even applicable, such as prepositions, punctuation,
symbols, etc.
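The side effect of negation noted above can be illustrated with a short Python sketch. The three tokens are hand-made examples, tokens without a person feature are assumed to simply lack that field, and the function name person_is_not is hypothetical.

```python
import re

tokens = [
    {"orth": "falo", "pos": "v", "person": "1"},
    {"orth": "fala", "pos": "v", "person": "3"},
    {"orth": ",", "pos": "pnt"},  # punctuation: no person feature at all
]

def person_is_not(token, value):
    # Emulates [person!=value]: true whenever the person field does not match,
    # including tokens that carry no person feature (treated as empty here).
    return re.fullmatch(value, token.get("person", "")) is None

# Negation alone also returns the punctuation token.
loose = [t["orth"] for t in tokens if person_is_not(t, "1")]
# Adding a positive condition on pos keeps only verbs.
strict = [t["orth"] for t in tokens if t["pos"] == "v" and person_is_not(t, "1")]

print(loose)
print(strict)
```

The second selection corresponds to combining the negated condition with a positive one in a single tag, along the lines of [person!="1" & pos=v].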
Here is the list of verbal inflection tags.
Infinitives can occur inflected or not inflected. This information is matched through the inflection keyword.
Lemma
In order to match tokens by their lemma, the base keyword can be used:
- [base=rato] matches words with rato as their base form
(lemma), such as rato, ratos or ratinho, etc.
- [pos=cn & base=".*s"] finds common nouns whose lemma ends
in s
- [orth=foi & pos=v & base!=ir] matches occurrences of the verb form
foi that do not belong to verb ir
Named-entity
To match tokens that are part of an expression naming an entity, use the iob keyword:
- [iob=B-LOC] matches tokens that are the beginning (B-) of an expression naming an entity whose semantic type is "location" (LOC).
- [iob=I-PER] matches tokens that are inside (I-) an expression naming an entity of type "person" (PER).
Here is the list of named-entity tags.
Metadata
Metadata can be used to restrict the match to a given
type of text by means of the meta command:
- gato meta source=writtennews matches gato only in
the news portion (writtennews) of the corpus
- gato meta source="written.*" matches gato only in
the written portion of the corpus (includes writtennews, writtenfiction and writtenother)
For a list of metadata fields and values, see here.
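The effect of a pattern in the source restriction can be sketched in Python. The passages below are invented for illustration, and the function name restrict is hypothetical; only the source values come from the quick reference.

```python
import re

# (source value, passage) pairs; the passages are made-up examples.
hits = [
    ("writtennews", "o gato fugiu"),
    ("writtenfiction", "um gato preto"),
    ("spoken", "o gato é meu"),
]

def restrict(rows, source_pattern):
    # Keep passages whose source value is fully matched by the pattern.
    return [text for source, text in rows if re.fullmatch(source_pattern, source)]

news_only = restrict(hits, "writtennews")   # meta source=writtennews
written = restrict(hits, "written.*")       # meta source="written.*"

print(news_only)
print(written)
```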
Advanced queries
Through the combination of the different search options described above,
it is possible to construct advanced queries and uncover relevant
linguistic information:
- situação[pos=adj] returns the occurrences of the word
situação followed by an adjective
- [pos=da][pos=cn] returns the occurrences of a definite article (the
da tag) followed by a common noun
- [pos=da][pos=adj]?[pos=cn] is similar to the previous
query, but allows a single, optional adjective (indicated by the adj tag)
between the definite article and the common noun
- [pos="cn|adj"]{3,} returns sequences with at least 3
consecutive adjectives and common nouns (in any relative order)
- [pos=da][pos!=cn]{2,3}[pos=adj] returns sequences of a definite
article followed by 2 or 3 tokens that are not common nouns and that are followed
by an adjective
- ... etc.
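To make the token-sequence reading of such queries concrete, the following Python sketch scans a list of POS tags for the pattern [pos=da][pos=adj]?[pos=cn]. It is an illustration only: the sentence and its tags are hand-annotated for this example, and the function name article_noun_spans is hypothetical.

```python
def article_noun_spans(tags):
    """Return (start, end) token spans matching [pos=da][pos=adj]?[pos=cn]."""
    spans = []
    for i, tag in enumerate(tags):
        if tag != "da":
            continue
        j = i + 1
        if j < len(tags) and tags[j] == "adj":
            j += 1  # consume the single optional adjective
        if j < len(tags) and tags[j] == "cn":
            spans.append((i, j + 1))
    return spans

words = ["o", "gato", "preto", "viu", "a", "pequena", "casa"]
tags = ["da", "cn", "adj", "v", "da", "adj", "cn"]

phrases = [" ".join(words[s:e]) for s, e in article_noun_spans(tags)]
print(phrases)
```

Each tag in square brackets consumes exactly one token, and a repetition operator placed after a tag applies to that whole token position rather than to a character.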
Aligning matches
It is possible to split the outcome of the query into two columns
to make it more readable by using the ^ (caret) symbol:
- [pos=da][pos!=cn]{2}^[pos=adj] matches a definite article
followed by 2 tokens that are not common nouns, followed by an
adjective. The definite article and the following 2 tokens will be
displayed in a column while the final adjective will be shown in a
column by itself.
© All rights reserved