PaCQL tutorial

Treebank Studio (PREVIEW)

PaCQL tutorial (for "pack-well", the Parsed Corpus Query Language)

This is an introduction to the PaCQL query language using examples from IcePaHC, the Icelandic Parsed Historical Corpus. The best way to use this document is to have the PaCQL search box for the corpus open in another window and try out the queries as you go through the material. Note that although this introduction will get you started, you will need to consult further sources of information in order to use parsed corpora for advanced work.

PaCQL ("pack-well") is designed to meet the following criteria:

Like CorpusSearch, PaCQL expresses complex linguistic relationships in a way that is natural to linguists and makes it easy to extract rich tabulated results for further processing in one's favorite statistical package.
Like Python and Ruby, PaCQL emphasizes productivity and readability.
Like any index-based search engine, the search engine behind PaCQL is pretty fast. Real research queries which code several variables at a time will in most cases not keep you waiting for long.

Because many linguistic researchers are familiar with CorpusSearch (henceforth CS), this guide will occasionally point out similarities and differences between the two systems. There is considerable overlap in the functionality of the two systems and it should be easy to transfer knowledge of one system to the other.

Basic concepts

We will start right away with the first query. Type the following PaCQL query into the search box and click the Submit button.

NP-SBJ idoms N-D

In the IcePaHC 0.9 corpus, this query matches 852 noun phrases (NP) which are subjects (NP-SBJ) and which immediately dominate (idoms) a singular common noun (N) in the dative case (N-D). Dative subjects are a well-known property of Icelandic syntax, so we are already seeing something linguistically interesting!

If you are familiar with CorpusSearch, this first query will look familiar. For some simple queries, such as this one, there is no difference between the CorpusSearch query language and PaCQL.

The first label you type into the query box, NP-SBJ in this case, becomes the "anchor" for the query. The query answers the question: For which anchors in the corpus does the pattern described by the query hold? By default, the box "Anchors only" is checked and this means that the output will display only the part of the tree which is the matched anchor.

(NP-SBJ (Q-D öllu)     'all'-dative
        (N-D mál$)     'language'-dative
        (D-D $inu))    'the'-dative

The glosses above are unfortunately not included in the corpus; they are just added here for clarification. Furthermore, Q-D stands for "quantifier-dative" and D-D for "determiner-dative". Dollar signs indicate that an orthographic word has been split in the annotation. The suffixed definite article is the element which is most commonly split this way in the corpus.
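If you download result trees for further processing, the labeled bracketing is easy to read into a program. The following is a minimal Python sketch, not part of Treebank Studio: the function names parse, leaves and join_split_words are invented for this illustration, and the rejoining step assumes that a word ending in $ followed by a word starting with $ are the two halves of one orthographic word.

```python
import re

def parse(text):
    """Parse one labeled-bracket tree into nested (label, body) tuples.

    For a leaf, body is the word (a string); otherwise it is a list of subtrees.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        if tokens[pos] != "(":        # leaf: (LABEL word)
            word = tokens[pos]
            pos += 2                  # skip the word and ")"
            return (label, word)
        children = []
        while tokens[pos] == "(":
            children.append(node())
        pos += 1                      # skip ")"
        return (label, children)

    return node()

def leaves(tree):
    """Return the words of the tree from left to right."""
    label, body = tree
    if isinstance(body, str):
        return [body]
    return [word for child in body for word in leaves(child)]

def join_split_words(words):
    """Rejoin halves such as 'mál$' + '$inu' (assumed reading of the $ marks)."""
    out = []
    for word in words:
        if word.startswith("$") and out and out[-1].endswith("$"):
            out[-1] = out[-1][:-1] + word[1:]
        else:
            out.append(word)
    return out

tree = parse("(NP-SBJ (Q-D öllu) (N-D mál$) (D-D $inu))")
print(leaves(tree))                    # ['öllu', 'mál$', '$inu']
print(join_split_words(leaves(tree)))  # ['öllu', 'málinu']
```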

The full text of the tree which contains the anchor is displayed above the anchor tree. Note that the left bracket of any matching label is highlighted inside the full text for the matching tree when you are looking at the results. Try unselecting "Anchors only" before running the query again. Each tree in which there is a matching anchor will now be displayed in full so you may have to scroll down to find the dative subject. The labels NP-SBJ and N-D in the example above are displayed in a red color in the tree so they will still stand out.

In addition to using colors to highlight important parts of the results, the query box highlights the names of search functions like "idoms". If they turn red as you type, you can be sure that you typed them correctly. The top right corner of the query box has options for the output format and the default is "Web output". You can run the query with "TSV" selected, in which case you will get tab-separated values corresponding to the coding table on the right of each result tree, as well as the raw text for the relevant tree. The current query only gives us unique identifiers for the text and the tree, but the TSV format will become more interesting as we start coding for more variables.

Unlike in CS, every query in PaCQL is technically a "coding query"; there are no "simple queries". However, unless you specify variables to code for, the output will be much like that of a simple CS query.

Below the query box, next to where it says "852 results" after you run the query (remember to change "TSV" back to "Web output"), there is a link "Display Summary". Clicking the link will give you a breakdown of the results by text and century. The following columns are shown in the text table.

Column	Interpretation
r/T	results per text
rt/T	result trees per text (fewer than r/T if >1 result per tree)
t/T	total trees per text
r/1Kw	results per 1000 words for this text

The last of these, r/1Kw, can be used as a rough measure of the frequency of a particular syntactic pattern. The section about coding queries below will introduce more sophisticated ways of studying syntactic change. The table which shows results by century is similar.
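As a worked example of the normalization behind r/1Kw, the sketch below divides a result count by a word count and scales to 1000 words. The function name and the counts are hypothetical and are not taken from IcePaHC; how Treebank Studio counts words internally is not described here.

```python
def results_per_1k_words(results, total_words):
    """Normalized frequency: results per 1000 words of text."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 1000 * results / total_words

# Hypothetical counts for one text: 30 results in 20000 words.
print(results_per_1k_words(30, 20000))   # 1.5
```

Normalizing this way lets you compare texts of very different lengths, which raw result counts do not allow.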

Column	Interpretation
r/C	results per century
rt/C	result trees per century (fewer than r/C if >1 result per tree)
t/C	total trees per century
r/1Kw	results per 1000 words for this century

The century table is accompanied by a visualization which displays the development of r/1Kw over time. The drop in dative subjects in the 16th century is due to the New Testament translation of Oddur Gottskálksson. He moved to Norway at a young age and perhaps it is for that reason that diagnostics for oblique subjects are weak in his language. Translation effects from Luther are also a plausible source of odd patterns. The reader may feel free to use Treebank Studio to develop her own explanation of these facts!

Be warned that it often takes a few tries to get the query to match exactly what we want! Take your time studying the results by scrolling down and looking at the individual examples. To speed things up, Treebank Studio initially only loads the first 10 results but more will appear when you scroll down to the bottom of the page.

Adding conditions

If we want to look at dative subjects in finite subordinate clauses, we can use the following query.

IP-SUB idoms NP-SBJ
NP-SBJ idoms N-D

IP-SUB matches finite subordinate clauses and the two instances of NP-SBJ in the query are taken to be the same noun phrase. The query therefore means that the IP-SUB should immediately dominate a phrase NP-SBJ and this same phrase should immediately dominate a dative noun N-D. IP-SUB is the first label in the query and it is therefore the anchor. There are 269 results in the corpus which match this pattern, one of which is the following:

(IP-SUB (NEG eigi)
        (BEPS sé)
        (NP-SBJ (Q-D öllu)
                (N-D mál$)
                (D-D $inu))
        (ADVP (ADV orðfimlega)
              (CONJP *ICH*-4))
        (VAN farið)
        (CONJP-4 (CONJ eða)
                 (ADVP (ADV skörulega))))

Note that unlike CS, PaCQL has no AND between conjoined query expressions. The line break between the expressions is interpreted the same way as AND in CS. For a behavior similar to OR in CS, see the coding query syntax which is introduced below.
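To make the conjunction concrete, here is a small Python sketch of what the two-line query above asks for. It is only an illustration of the logic, not how the PaCQL engine works internally; the tuple representation, the helper names, and the simplified example trees (including the N-N tag in the second tree) are assumptions made for this sketch.

```python
import re

# Trees as (label, children) tuples; leaves as (label, word).
# Simplified version of the result tree shown above.
DATIVE = ("IP-SUB", [
    ("NEG", "eigi"),
    ("BEPS", "sé"),
    ("NP-SBJ", [("Q-D", "öllu"), ("N-D", "mál$"), ("D-D", "$inu")]),
    ("VAN", "farið"),
])
# Hypothetical tree whose subject does not contain a dative noun.
OTHER = ("IP-SUB", [("NP-SBJ", [("N-N", "maður")]), ("BEPS", "sé")])

def children(node):
    return node[1] if isinstance(node[1], list) else []

def nodes(node):
    """Yield the node and everything below it."""
    yield node
    for child in children(node):
        yield from nodes(child)

def idoms(parent, pattern):
    """Children of parent whose full label matches the pattern."""
    return [c for c in children(parent) if re.fullmatch(pattern, c[0])]

def match_anchors(tree):
    """Anchors for:  IP-SUB idoms NP-SBJ  AND  NP-SBJ idoms N-D."""
    anchors = []
    for anchor in nodes(tree):
        if not re.fullmatch("IP-SUB", anchor[0]):
            continue
        # Both lines must hold for the *same* NP-SBJ node.
        if any(idoms(subj, "N-D") for subj in idoms(anchor, "NP-SBJ")):
            anchors.append(anchor)
    return anchors

print(len(match_anchors(DATIVE)))  # 1
print(len(match_anchors(OTHER)))   # 0
```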

If you want to search specifically for dative subjects which include quantifiers or determiners, whether or not they include other elements, you can use the following queries.

IP-SUB idoms NP-SBJ
NP-SBJ idoms Q-D

IP-SUB idoms NP-SBJ
NP-SBJ idoms D-D

However, if you want to capture all dative subjects in the same query, you will need to use more advanced pattern matching. For this purpose, we can use regular expressions.

Using regular expressions

In PaCQL, every label specification is a regular expression. A full treatment of regular expressions is beyond the scope of this guide, but a few basic patterns demonstrate the power of this tool. First, a list of characters inside square brackets matches any one of these characters. The following therefore matches all IP-SUB clauses whose subject immediately dominates one of the labels N-D, Q-D or D-D.

IP-SUB idoms NP-SBJ
NP-SBJ idoms [NQD]-D

A parenthesized list of strings of characters where the strings are separated by the pipe symbol (|) matches any of the strings on the list. We can use this technique to expand the search to include finite main clauses IP-MAT. Notice how the anchor indicator updates automatically as you type to reflect the first label specification in the query.

IP-(MAT|SUB) idoms NP-SBJ
NP-SBJ idoms [NQD]-D

In fact, there are some more part-of-speech tags beyond N-D, Q-D and D-D which are used to annotate dative elements in the corpus. For example, we have not yet looked at dative pronouns, PRO-D. As the reader may have noticed, all the dative tags end with the "dash tag" -D, and we can use the regular expression .*-D (the .* part is read "dot star") to capture all of these dative tags in a simple manner. The formal meaning of the star is "zero or more repetitions of the preceding character" and the dot is a special symbol which matches any character! We are now up to 3808 dative subjects!

IP-(MAT|SUB) idoms NP-SBJ
NP-SBJ idoms .*-D

The regular expressions in PaCQL always match full labels. Therefore, if there existed a label that includes -D but does not end in -D, it would not be matched by .*-D in the above query. Also, PaCQL uses standard regular expressions without any shortcuts, so *-D will not work as it would in CS. The dot is required!
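You can try out these patterns outside the search box as well. The sketch below uses Python's re module, whose syntax agrees with the patterns used in this tutorial; whether the PaCQL engine uses exactly this regular expression dialect is an assumption. re.fullmatch is used because a pattern has to cover the whole label, and X-D-Y is a made-up label that contains -D without ending in it.

```python
import re

labels = ["N-D", "Q-D", "D-D", "PRO-D", "NP-SBJ", "IP-MAT", "IP-SUB", "X-D-Y"]

def matching(pattern):
    # fullmatch mirrors the rule that a pattern must cover the whole label.
    return [label for label in labels if re.fullmatch(pattern, label)]

print(matching("[NQD]-D"))        # ['N-D', 'Q-D', 'D-D']
print(matching(".*-D"))           # ['N-D', 'Q-D', 'D-D', 'PRO-D']
print(matching("IP-(MAT|SUB)"))   # ['IP-MAT', 'IP-SUB']

# A bare star has nothing to repeat, so "*-D" is not a valid pattern.
try:
    re.compile("*-D")
except re.error as err:
    print("invalid pattern:", err)
```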

Syntactic relationships

Basic syntactic relations (a query must contain at least one of these):

idoms - immediately dominates
idomsonly - immediately dominates x and nothing else
idomsfirst - immediately dominates the leftmost child x
idomslast - immediately dominates the rightmost child x
doms - dominates at an arbitrary depth
sprec - sisterwise precedence
precedes - precedence regardless of embedding
hassister - sisterhood
sameindex - A has the same index as B
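The definitions in this list can be spelled out on a toy tree. The following Python sketch is one reading of the descriptions above and not the engine's actual implementation; the Node class is invented for the illustration, and sprec is taken to mean "A and B are sisters and A is somewhere to the left of B", which is an assumption. precedes and sameindex are left out because they need word order and index information that this toy tree does not carry.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)              # eq=False: nodes are compared by identity
class Node:
    label: str
    children: list = field(default_factory=list)
    parent: "Node | None" = field(default=None, repr=False)

    def add(self, *kids):
        for kid in kids:
            kid.parent = self
            self.children.append(kid)
        return self

def idoms(a, b):
    return b.parent is a

def idomsonly(a, b):
    return len(a.children) == 1 and a.children[0] is b

def idomsfirst(a, b):
    return bool(a.children) and a.children[0] is b

def idomslast(a, b):
    return bool(a.children) and a.children[-1] is b

def doms(a, b):
    node = b.parent
    while node is not None:       # walk upwards from b looking for a
        if node is a:
            return True
        node = node.parent
    return False

def hassister(a, b):
    return a is not b and a.parent is not None and a.parent is b.parent

def sprec(a, b):
    if not hassister(a, b):
        return False
    sisters = a.parent.children
    return sisters.index(a) < sisters.index(b)

q, n, d = Node("Q-D"), Node("N-D"), Node("D-D")
subj = Node("NP-SBJ").add(q, n, d)
clause = Node("IP-SUB").add(Node("NEG"), Node("BEPS"), subj)

print(idoms(clause, subj), doms(clause, n), idoms(clause, n))  # True True False
print(idomsfirst(subj, q), idomslast(subj, d))                 # True True
print(sprec(q, n), sprec(n, q), hassister(q, d))               # True False True
```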

Special syntactic relations (these specify conditions on what the basic relations matched):

haslabel
domswords
domswords<
domswords>
idomsleaf
idomslemma

Coding queries

The following query defines the dependent variable "ov", whose value is 1 if the conditions listed under "ov:1" hold and 0 if the conditions listed under "ov:0" hold.

IP-(MAT|SUB) idoms MDPI
ov:1
MDPI sprec NP-OB[12]
NP-OB[12] sprec VB
ov:0
MDPI sprec VB
VB sprec NP-OB[12]
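The logic of this coding query can be sketched in Python as follows. This is an illustration under several assumptions, not a description of the engine: a clause is reduced to the labels of its immediate daughters, sprec is read as "is a sister somewhere to the left", the ov:1 block is tried before the ov:0 block, and clauses that satisfy neither block get no value (None here). The helper names and the daughter sequences are hypothetical.

```python
import re

def ordered(sisters, *patterns):
    """True if the patterns match some sisters in this left-to-right order."""
    pos = 0
    for pattern in patterns:
        # Advance to the next sister whose full label matches the pattern.
        while pos < len(sisters) and not re.fullmatch(pattern, sisters[pos]):
            pos += 1
        if pos == len(sisters):
            return False
        pos += 1
    return True

def code_ov(sisters):
    """Code one clause for the variable ov."""
    if "MDPI" not in sisters:
        return None                                   # not a match for the anchor line
    if ordered(sisters, "MDPI", "NP-OB[12]", "VB"):   # object before verb
        return 1
    if ordered(sisters, "MDPI", "VB", "NP-OB[12]"):   # verb before object
        return 0
    return None                                       # neither block applies

# Hypothetical daughter sequences of three clauses.
print(code_ov(["NP-SBJ", "MDPI", "NP-OB1", "VB"]))   # 1
print(code_ov(["NP-SBJ", "MDPI", "VB", "NP-OB1"]))   # 0
print(code_ov(["NP-SBJ", "MDPI", "VB"]))             # None
```

Coding each clause as 1 or 0 in this way is what makes the tabulated output directly usable as a dependent variable in a statistical package.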

Types of noun phrases. Elsewhere condition.

Meta coding

Meta coding adds columns to the output. The meta block in the query has to be separated from the rest of the query by a blank line. The first line of the meta section has to be "meta:" and this first line must be followed by one or more of the meta coding commands below.

Text level meta coding:

text textid - id of the text
text year - (estimated) year the text was written
text century - century the text was written
text genre - main genre of the text
text subgenre - subgenre of the text
text postnt - 0 if written before New Testament translation, 1 otherwise
text texttrees - total number of trees in the text
text meantreewords - mean number of words per tree in the text
text mediantreewords - median number of words per tree in the text
text meanwordletters - mean number of letters per word in the text
text lexicaldiversity - type frequency of word forms divided by the total number of words in the text
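The numeric text-level measures can be illustrated with a short Python sketch. It follows the descriptions in the list above but is only an approximation: how the corpus tools tokenize words, treat split words and punctuation, or handle upper and lower case is not specified here, so those details are assumptions. The function name and the toy input are hypothetical.

```python
from statistics import mean, median

def text_statistics(trees):
    """trees: one text given as a list of trees, each a list of word forms."""
    words = [word for tree in trees for word in tree]
    return {
        "texttrees": len(trees),
        "meantreewords": mean(len(tree) for tree in trees),
        "mediantreewords": median(len(tree) for tree in trees),
        "meanwordletters": mean(len(word) for word in words),
        # distinct word forms (types) divided by all words (tokens)
        "lexicaldiversity": len(set(words)) / len(words),
    }

# Toy text with two trees and five words, four of them distinct.
stats = text_statistics([["hann", "kom", "heim"], ["hann", "las"]])
print(stats["meantreewords"])      # 2.5
print(stats["meanwordletters"])    # 3.6
print(stats["lexicaldiversity"])   # 0.8
```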

Tree level meta coding:

tree treeid - unique id for the tree
tree treewords - number of words in the tree

Node level meta coding:

node label A - the label matched by A
node nodestring A - the string of leaves dominated by A
node nodewords A - the number of words dominated by A
