
Friday, July 20, 2012

Visibility ambiguity

I'm now dealing with quite an interesting kind of ambiguity, which I call visibility ambiguity. Consider the following three sentences (I might have put in too many commas; imagine it's Russian, where they're obligatory):
  1. But there, thinking about her words, we got sad.
  2. But there, thinking about her words, which were so cruel, we got sad.
  3. But there, thinking about her words, intonation and gestures, we got sad.
The sentences are the same until the second comma, after which they diverge dramatically. To process the next word, the parser should determine which previous state it can attach to. In 1, we is attached to there and the top-level sentence is continued. In 2, the word which relates to words plus the comma, and a relative clause begins. In 3, there's a conjunction: words gets merged with intonation, and the merged group is re-attached to about and her.

Obviously, all three attachment sites should be visible to the parser: there, words and the comma. And all three continuation possibilities should be explored.

Which is where it gets complicated, because these attachment sites are also incompatible with each other. Once you've attached a word following route 1, you cannot attach the next one to route 2. The parser should regard such attachments as mutually incompatible. There are at least three diverging routes after every comma, each defined by its own set of constructions. And all constructions from different sets should be marked as contradictory, pair-wise. Which is quite a few pairs.
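Just to get a feeling for the bookkeeping, here is a tiny Python sketch (with made-up construction names, not my parser's real data structures) of what the pair-wise marking amounts to if each route is simply a set of constructions:

from itertools import product

def incompatible_pairs(routes):
    # every pair of constructions taken from two different routes
    for i, route_a in enumerate(routes):
        for route_b in routes[i + 1:]:
            for pair in product(route_a, route_b):
                yield pair

# three routes diverging after "thinking about her words,"
routes = [
    {"we-attaches-to-there", "top-level-continues"},      # route 1
    {"relative-clause", "which-relates-to-words"},        # route 2
    {"conjunction", "words-merges-with-intonation"},      # route 3
]

print(len(list(incompatible_pairs(routes))))  # 12 pairs already, for 3 routes of 2 constructions each

Even in this toy setting, three small routes yield a dozen pairs; with realistic construction sets and several commas per sentence, the numbers add up quickly.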

So it's not only that the number of possible variants will be big. It's also that the number of incompatibility relations to track will be much bigger, growing pretty much exponentially. Given that I normally log all the visible attachment sites after each word during parsing, the log will be hard to analyze.

And what troubles me: people don't do this: they don't track all the possibilities. Instead, they quite quickly choose the variant that suits best, purge everything else from memory and pursue only that parsing route. If it eventually goes wrong, they perform some kind of reanalysis. But the heuristics are tuned well enough that normally such a need never arises.

Right now my parser maintains all the structures needed for all possible analyses, and switching between them is very simple. But that just doesn't scale well enough to support visibility ambiguities. So it seems that the time has come to teach my parser the human strategy: choosing one route and then reparsing when anything doesn't fit well.

Saturday, July 16, 2011

Ellipsis

One of the great things about constructions is that you can do cool things with them that are not so easy with other parsing techniques. One example is ellipsis, where you may just omit any number of words if they are the same as in the previous clause:

По мнению одних дальше следовало 7, по мнению других — 8
PREP opinion some next went 7 PREP opinion others — 8
In the opinion of some, a 7 went next; but in opinion of others an 8 did

To translate this correctly, several things should be done. The parser should recognize that the dash "—" actually stands for went next. The semantic model should somehow incorporate this, as well as the meta-knowledge that this went next is actually not specified explicitly (otherwise the generator would produce but in opinion of others an 8 went next). Finally, the generator should be able to replace the full verb phrase with did.

Parser


There's clearly some parallelism here. After processing in opinion of others the parser should already know that this is similar to in the opinion of some. The same happens with 8, which is similar to 7. Now the parser should fill in the dash gap with the missing information taken from the previous clause. This means it should have access to all actions it performed during processing of went next, and it should repeat them.

So, there are two problems. One is how to find similar fragments, given that they consist of different words. The other is how to be aware of the parser's earlier activity and be able to reproduce it in new contexts. The two problems are quite tightly connected: it's in the earlier activity that you search for something similar. And no surprise that both are solved by the same means, namely keeping a change log.

Every change in the parsing state is kept in a log. Luckily, the construction architecture is very suitable for this. Remember, each word tells the parser which constructions it participates in and which of them are mutually incompatible. It also contributes arguments to the constructions: specifying the noun of a nom(inative) construction is completely different from contributing a head (verb). Based on this information, the parser resolves the possible conflicts, determines the winning constructions and executes the code attached to them, which makes the changes to the semantic model.
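Purely as an illustration (the names below are invented for this post, not the parser's actual API), the log could be sketched like this:

from dataclasses import dataclass, field

@dataclass
class Contribution:
    word: str            # the word that made the contribution
    construction: str    # e.g. "nom", "gen", "acc"
    arguments: dict      # role -> value, e.g. {"noun": "7"} or {"head": "went"}

@dataclass
class ParserState:
    log: list = field(default_factory=list)

    def contribute(self, word, construction, **arguments):
        # record the contribution; conflict resolution and execution of the
        # winning constructions (updating the semantic model) would follow
        self.log.append(Contribution(word, construction, dict(arguments)))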

The log consists of the contributions each word has made. And by allowing constructions to have access to this log, the parser can tackle the two problems posed by ellipsis:

1. Similarity

7 and 8 are different words, but the contributions they make are quite similar: they invoke the same constructions (nom, gen, acc, ... — all cases) but provide different arguments to them: one creates a frame whose type is 7, the other does the same with 8. So I consider two contributions similar if they contribute to the same construction and supply the same kinds of arguments.
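In terms of the sketch above, the similarity test could be as small as this:

def similar(c1, c2):
    # same construction, same argument roles; the values (7 vs. 8) may differ
    return (c1.construction == c2.construction
            and c1.arguments.keys() == c2.arguments.keys())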

2. Repetition

The parser knows it's processing ellipsis. It knows where the ellipsis starts (in opinion of others) and ends (8). It has found the similar fragments in the previous clause (in the opinion of some and 7). Now it just has to look at the contribution history, find everything in between those two similar fragments and insert it between in opinion of others and 8. The copying happens upon encountering 8 after "—", and 8's own contribution is nicely integrated into the environment set up by the copied history (e.g. it completes the noun of the nom construction, whose head was set by the copied went).
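Schematically, the copying step then amounts to replaying a slice of the log, where start and end would be the positions of the contributions found similar to in opinion of others and 8 (again, just an illustration):

def copy_elided_span(log, start, end, replay):
    # replay everything contributed between the two similar fragments
    # of the previous clause in the new (elliptical) environment
    for contribution in log[start + 1:end]:
        replay(contribution)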

Intermediate representation


OK, now we have two went next situations, but only one of them was fully expressed in Russian. And we should do the same in translation. So we have to mark the copy in a special way. In the semantic model we have no contributions; we have only frames and assignments to their properties. Hence we should mark the effects of the copied contributions, i.e. the assignments they add.

For this I've introduced a notion of scope into the semantic model. A scope is just a frame that plays a special role during parsing. Which special role it plays depends on the frame and its attributes. What's important is that it's activated at one moment and deactivated at another, with arbitrarily many assignments made in between. In the semantic log this is written as [X] ... [/X], where X is the scope frame and ... stands for the assignments.

I plan to use scopes for many more things, mostly of a meta-linguistic kind: separating paragraphs, reflecting the markup, etc. Using them with ellipsis is simple. You just create a frame, say its type is ellipsis, activate it, do the whole contribution-copying process (which surely adds some assignments) and deactivate it. Thus you get a nice, clear separation of what's said and what's inferred:

A.type=OPINION
#1.opinion_of=A
B.type=SOME
A.arg1=B
#1.time=PAST
C.type=COME_SCALARLY
C.order=AFTER
C.arg1=D
D.type=7
D.number=true
#1.but=#2
--
A.type=OPINION
#2.opinion_of=A
B.type=OTHERS
A.arg1=B
C.type=ellipsis
[C]
D.type=COME_SCALARLY
D.order=AFTER
[/C]
D.arg1=E
E.type=8
E.number=true
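A possible way of producing such bracketed spans, sketched in Python with invented names, would be:

from contextlib import contextmanager

class SemanticLog:
    def __init__(self):
        self.entries = []

    def assign(self, frame, attr, value):
        self.entries.append(f"{frame}.{attr}={value}")

    @contextmanager
    def scope(self, frame):
        # activate the scope frame, let assignments happen, deactivate it
        self.entries.append(f"[{frame}]")
        try:
            yield
        finally:
            self.entries.append(f"[/{frame}]")

log = SemanticLog()
log.assign("C", "type", "ellipsis")
with log.scope("C"):                      # everything inside is marked as inferred
    log.assign("D", "type", "COME_SCALARLY")
    log.assign("D", "order", "AFTER")
log.assign("D", "arg1", "E")
print("\n".join(log.entries))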

Generator


Getting the generator to work is not interesting at all. It's just a matter of adding another condition, given that I mock the generation throughout anyway.


After writing all this, I suddenly thought: well, this is great, but possibly unnecessary. Maybe the right way is not to replicate the history but just to mark the whole second-clause situation as ellipsis, parse only what's in the text, attach it somewhere and assume that the inference engine (mocked) should be responsible for inferring the went next part. But I don't have enough data yet to make a choice.

Wednesday, August 18, 2010

Placeholders aka left-corner parsing

From the parsing perspective, some words have much more far-reaching consequences than others. Take English articles, for example. They don't only denote definiteness or indefiniteness; they also give a broad hint that a noun is coming. So, having encountered an article, a smart parser may immediately create a placeholder node for a noun (and also specify its definiteness). When the noun comes, instead of creating its own structure, it'll just fit into the existing placeholder.

Once a noun placeholder is constructed, a parser may immediately attach it where it's needed. If there is an ambiguous verb or preposition, attaching that noun as an argument can help choose the right alternative and thus reduce working memory consumption.
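Here's a deliberately oversimplified sketch of the idea (it ignores adjectives and everything else, and all names are made up):

class Node:
    def __init__(self, category, **features):
        self.category = category
        self.features = features
        self.lemma = None                  # unknown until the noun arrives

def process(word, pending):
    if word in ("a", "an"):
        pending.append(Node("noun", definite=False))   # placeholder for the coming noun
    elif word == "the":
        pending.append(Node("noun", definite=True))
    elif pending and pending[-1].category == "noun" and pending[-1].lemma is None:
        pending[-1].lemma = word           # the noun fills the existing placeholder
    return pending

pending = []
for w in "the dog".split():
    process(w, pending)
print(pending[0].lemma, pending[0].features)   # dog {'definite': True}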

Hawkins argues that such placeholders play an important role in syntax. He shows for many languages that their word order is in many cases optimized so that such placeholders can be constructed and integrated as early as possible. That helps minimize the number of nodes awaiting integration at any moment during parsing.

In German and Russian, the NP placeholder creators (determiners and adjectives, respectively) also carry case markers. Therefore the case of a placeholder is known, which also aids faster verb disambiguation (e.g. in+Dative denotes location, while in+Accusative denotes direction in German).

Verb placeholders also seem to exist. For example, in Japanese, where the verb comes at the very end of the sentence, there is no sign that a sequence of argument noun phrases causes any processing difficulties. I conclude that these noun phrases get integrated into some structure as they appear. When a verb comes, it just quickly iterates through this structure and assigns the accumulated noun phrases their semantic roles. The rich case marking of Japanese helps the arguments to be stored efficiently and not interfere with each other while awaiting the verb. I think the structure I'm talking about can safely be called a verbal (or clausal) placeholder.

It gets more interesting in other languages. On the surface, German is a verb-second language. A nice processing explanation would be that there is no verb placeholder, and the first constituent hangs active in memory until a verb actually appears. So the verb should appear immediately to reduce memory usage. In most relative clauses, on the other hand, the verb comes at the end, and no processing problems arise even if there are lots of NP/PP dependents before it. Incidentally, all these clauses have to be introduced by some complementizer. When there is no complementizer, the verb comes second just as if it were a main clause. All this makes me think that the complementizer just creates the verb placeholder, which enables the verb to be postponed.

In English, the verb never comes too late, so the need for a placeholder is doubtful. Russian is difficult in this respect, because normally the verb is close to the beginning, but it can easily move to the end for focalization purposes. I'm not aware of any experimental data, but I personally don't feel too overburdened by multiple intermediate NPs and PPs in this case. So I have no idea when a verbal placeholder is created in Russian: always, never, only in case of too many subsequent NPs and PPs, or only in some specific context (e.g. one that forces focalization).

Saturday, February 13, 2010

Charts and clouds

As Doug Arnold pointed out, attacking trees and looking for a replacement isn't new. It has already been done, so I didn't actually invent anything revolutionary. The construction cloud is actually a well-known structure called a chart, as in chart parsing. The chart is designed for keeping all possible parse variants in a compact way to ease backtracking and reparsing. And my cloud does exactly this, in a very similar fashion.

One difference is what kind of information is stored in the chart. In the classical approach, it's a bunch of records like "there may be an NP constituent from word 3 to word 7". No internal structure of these constituents is stored. If someone needs the complete tree, it can be computed on demand, since all of the daughter phrases are also present in the chart. And you're lucky if there's only one variant of such a subtree.

The cloud is a collection of records like "constructions X, Y and Z may form a construction T". Since constructions don't tend to have a deep hierarchy, each of them stores its children. We don't have to compute the subtree on demand. But the subtree rarely covers the whole sentence. Therefore the task is to find a subset of all the constructions that covers the input and whose constructions don't contradict each other.
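A brute-force sketch of that task (with a record format that is my own simplification; real code would prune aggressively instead of enumerating subsets):

from itertools import combinations

def covers(subset, input_words):
    covered = set()
    for record in subset:
        covered |= record["words"]         # the words this construction spans
    return covered == set(input_words)

def consistent(subset, contradictions):
    # `contradictions` is a set of frozensets of construction names
    return not any(frozenset((a["name"], b["name"])) in contradictions
                   for a, b in combinations(subset, 2))

def find_analysis(records, input_words, contradictions):
    for size in range(1, len(records) + 1):
        for subset in combinations(records, size):
            if consistent(subset, contradictions) and covers(subset, input_words):
                return subset
    return None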

Another difference is that usually the chart is tightly connected with the parser. A typical chart parser computes all the results in parallel, storing every possible intermediate constituent in the chart. On the other hand, I'm into single-variant serial parsing with error recovery. Of course, nothing prevents one from implementing that kind of parser with a phrase-structure-based chart: just drive one variant instead of all of them. One of the reasons I prefer dependency-like constructions over constituents is that free word order languages appear to be easier to describe in terms of constructions.

Tuesday, February 9, 2010

What else is wrong with the trees?

I've already explained why I don't like using trees as an intermediate parser representation. Here's why I don't want to use them as the final parser representation, i.e. for capturing the syntactic structure of a text.

1. Double attachment

I wrote a letter to my friend. Did I write something (a letter) to my friend or was it the letter to my friend that I wrote? I have no idea. Both alternatives seem plausible to me. So why should I make some unmotivated choice if I can have 'to my friend' attached to both 'wrote' and 'letter'? But the result won't be a tree.

2. Overlapping

German zum means zu (preposition) + dem (dative definite article). If the tree has separate slots for prepositions and articles (e.g. PP and DP heads), this word would go to both, which is not tree-like.

Welsh has single-constituent subject-verb agreement. For example, in the Welsh translation of the (very "natural") I and my brother saw ourselves, saw agrees with the singular I, whilst ourselves agrees with the plural compound I and my brother. If we think that the agreed items should be directly connected in the syntactic structure (not necessarily true, but it would be nice), then saw should be directly connected both with I (agreement) and with I and my brother (subject).

However, Doctor, I'll stay here. IMHO any text should be analyzed as is. In particular, the punctuation should be processed in the same way as the words, since it's part of the text and is subject to the grammatical rules of the language called Written English. But however at the beginning of a clause requires a comma after it. The title Doctor, used to address someone, also requires surrounding commas. So, where does the comma between however and Doctor belong? If you ask me, it belongs in both places, and this again isn't tree-like.

3. Discontinuous constituents

Actually, there's nothing wrong with them from a programmer's perspective. But both phrase-structure and dependency grammars prohibit discontinuous constituents: for theoretical reasons, they don't allow the lines connecting their nodes to cross.

So, for me trees are also not very convincing as the 'ideal' syntactic representation. But is there an alternative? Yes, and not just one. Attribute-value matrices, for example. Or constructions. But that's another story.

Monday, February 8, 2010

Trees in NL parser

Imagine you're a parser analyzing the famous The horse raced past the barn fell. Additionally, you're operating in an incremental left-to-right fashion, considering a single active parse variant at each moment. Independently of the theory, you'll probably combine the horse and past the barn into some kind of units, say NP1 and PP2. Before you see the last word, you'll probably assume that NP1 is the subject of the finite verb raced, and PP2 is its adjunct. But then fell comes. At that moment, you'll understand that the single active parse variant you have is wrong, return to raced and reanalyze it as a passive participle (The horse [that was] raced past the barn fell). This is all well known and boring.

What's not boring is the implications for a computational parser operating in a similar way. First, it should somehow be able to restore its state right after NP1 and reparse from there, choosing the passive participle alternative of raced. Second, it would be nice not to reparse PP2 again. This word sequence will produce the same result anyway and function as an adjunct, though now inside the relative clause.

Since we're going to store the parsing state and analyze it in various ways (e.g. find slots to fill), we'd better store it as data. What kind of data? The standard answer is an (incomplete) tree. I'll show why that's not convenient.

Let's consider mutable trees first. Nodes can be added to or removed from such trees at any position at any time. Therefore, to store anything, we first need to copy it completely and put the copy aside, so that subsequent changes in the active tree aren't reflected in it. This applies both to the parser state after NP1 and to the I'd-like-it-cached PP2. But we don't want to copy the parser state at every moment just in case we have to backtrack there in the future: we'd run out of memory very soon.

An alternative approach I see is not to copy anything, but to thoroughly log every change made to the tree in an undoable manner. When the need for backtracking arises, we can just go back through the log, undoing every single change, and restart. This solves the state-saving problem but doesn't offer caching for what's been undone: we'll have to reparse PP2 anew.

So, mutable trees don't appear to be useful for our particular parser. Let's look at immutable trees. Their nodes are created once and fixed forever. No nodes can be added or removed. The only way to attach a constituent to a tree is to produce a new tree with the constituent already in the right place. The parts of the old tree not directly dominating the attachment place can be reused in the new tree; everything else has to be recreated. Luckily, the reused part is usually much bigger, so not much memory is consumed. So we could easily save the current parsing state at every moment without fear. And we can also cache PP2 before reparsing, since no one can hurt it anymore: it's immutable.
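Here's a small sketch of such attachment along the right edge, with the reuse made explicit (illustration only):

class Tree:
    def __init__(self, label, children=()):
        self.label = label
        self.children = tuple(children)    # fixed forever

def attach_rightmost(node, new_child, depth):
    # return a new tree with new_child added `depth` levels down the right edge;
    # only the spine from the root to the attachment site is rebuilt
    if depth == 0:
        return Tree(node.label, node.children + (new_child,))
    rebuilt = attach_rightmost(node.children[-1], new_child, depth - 1)
    return Tree(node.label, node.children[:-1] + (rebuilt,))   # siblings are reused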

But all this isn't free. You have to keep track of all the places in the partial tree where attachment is possible. For example, in 'I gave the ...' there are two slots waiting to be filled: the noun after the and the second argument of gave. They should somehow be represented in the tree, and at least one of them won't be at the top level.

The slots to be filled may also not be marked explicitly. In I saw a girl with a telescope, the PP with a telescope may be attached either to the noun girl or to the verb saw. Since it's not required by either of them, the parser has to walk down the whole tree, find all the constituents touching the right edge, and then try to combine them with the PP. Moreover, psycholinguists observe that such an adjunct PP is most often attached to the lowest suitable constituent. Probably it's easier that way for humans, but this would be a very resource-consuming operation in a computational parser.
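That search would look roughly like this, building on the Tree sketch above:

def right_frontier(node, acc=None):
    # all constituents touching the right edge, top-down
    acc = [] if acc is None else acc
    acc.append(node)
    if node.children:
        right_frontier(node.children[-1], acc)
    return acc

def attachment_candidates(root, fits):
    # lowest suitable constituent first, matching the human preference
    return [n for n in reversed(right_frontier(root)) if fits(n)]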

Finally, the attachment place may not be located on the right boundary at all. This is the infrequent but still important case of discontinuous constituents, usually occurring in free word order languages like Russian, Latin or Warlpiri. English also has such examples:

The question has been asked, if there's life on Mars.
"Tomorrow," he said, "I'll return".
Yesterday, however, they left.

In the first example, we should attach the subordinate clause if there's life on Mars to the NP the question, which doesn't touch the right boundary at all. The tree we have at that moment is something like

[S: [NP: the question] [VP: has been asked]]

and represents a complete grammatical sentence. So, do we have to search the whole tree to find the constituent to attach to? Nonsense. But I have no idea what a reasonable attachment strategy that uses immutable trees and covers the presented data should look like.

Of course, all these problems are solvable. It's just that I don't want to implement those solutions due to their clumsiness, complexity and likely bug-riddenness. I'd rather use another intermediate parsing representation for the serial parser. Which one? That's another story.

Sunday, January 17, 2010

Operational Text=>Meaning model

The model has several levels, though they are almost completely different from Melchuk's:

Level 1: text/phonology.
The parser receives it as an input and then produces

Level 2: constructions.
The contiguous text is separated into constructions of all sizes, from morphological ones to complex syntactic ones. Basic constructions are (mostly) word stems; complex constructions combine the simpler ones. The structure is most probably a tree, though not necessarily. Each construction may have information structure. Constructions are ordered and considered executable in that order. When executed, they give us

Level 3: semantics: a program.
In some general-purpose (programming) language with well-defined semantics and a well-defined model. The model probably consists of local variables (max. 4-7), salient context and general memory, which is organized as a graph of frames and represents the listener's knowledge of the world. The language is imperative, object-oriented and almost entirely lacks control flow, being very linear. The basic instructions are frame creation and assignments to local variables or frame slots. If we execute this semantics program, we'll get

Level 4: change in the frame structure in the memory.
Which means a change in the listener's knowledge of the world. Actually, the change may be greater than we'd expect from looking at the semantics program. Each of its simple instructions may cause significant changes that are not directly encoded in the program. This is what we call pragmatics.

We're done! Though it's worth noting that the built frames might well encode a sequence of actions (e.g. a cooking recipe), which can also be executed via interpretation and result in the listener's real actions in the real world.

Sunday, November 15, 2009

Construction grammar

The subject, Construction Grammar (CxG), is exactly what I had in mind. In short:

  • There's no language-specific unit in the brain; the language ability is due to common cognitive units.
  • The basic notion in the language is the construction (a template text fragment). No one has defined it precisely, but everyone uses it.
  • Both the speaker and the hearer operate in terms of constructions, not words, unlike in other prominent theories.
  • Children learn specific constructions first, abstracting them into more general rules afterwards.
  • Lexical items which are often used in similar contexts tend to grammaticalize into new constructions as language evolves.

    The construction approach seems very promising to me, though I see at least two weaknesses in it to which I haven't found an adequate answer so far:

  • Why does learning new languages become much harder after puberty? Is it a degradation in common cognitive abilities? What's different about second language learners? Why do Japanese speakers drop the 3rd person singular -s in English ('he like driving')?

  • CxG accounts only for positive data and doesn't explain why exactly ungrammatical samples are so (or not so) ungrammatical. The vague explanation that one gets used to a specific way of expressing an idea, and that all the other ways are less habitual and hence more difficult to produce/analyze, doesn't satisfy me very much. They may be equally difficult (or easy) from a processing perspective. E.g. adjectives in English could come after the noun with no clear performance disadvantage. The semantics could also be clear, as in 'he go cinema'.

    Another explanation could be Occam's Razor. In all the ungrammatical examples I've seen, there is a grammatical counterpart. So the brain could reason: why should the same meaning be produced in this strange way when there's another, well-established one? And mark the 'strange way' as ungrammatical.


  • The question remains how to create a parser based on constructions, and how to induce those constructions from a corpus. And, as usual, what that parser should produce as its result.

Monday, November 9, 2009

Communicative fragments by example

Communicative fragments (CF) are presented as overlapping template sequences that cover the text. They consist of fixed parts (e.g. words) and placeholders which may be filled by certain expressions. A limerick example:

There was an Old Man in a pew,
Whose waistcoat was spotted with blue;
But he tore it in pieces,
To give to his nieces,--
That cheerful Old Man in a pew.


The resulting fragments will be:

there was X[NP] in Y[NP] -> Clause
//variables are uppercase
//sample placeholder constraints are in brackets
//NP=Noun Phrase
//substituted template forms a clause

an X[NP, Starts with a vowel] -> NP
old X[NP] -> NP
man -> NP
//one-word fragment
X[NP] in Y[NP] -> NP
//no one has promised that CF will form a tree
//cf. the first fragment

a X[NP, Starts with a consonant] -> NP
pew -> NP
X[NP] whose Y[NP] Z[Clause, finite] -> NP
//V=Verb
//note a non-projective construct
was X[Participle] -> Clause
//a rule for passive
spotted with X[Adj] -> Participle
but X[Clause] -> Clause
X[NP] tore Y[NP] -> Clause
X[=tear] in pieces -> Clause
//any form of 'tear'
X[Clause] to Y[Clause, infinitive] -> Clause
X[=give] to Y[NP] -> Clause, finite
his X[NP] -> NP
that X[NP] -> NP
cheerful X[NP] -> NP

That's a way of parsing: treating syntax as a set of CF patterns where every pattern contains at least one lexical entry that triggers the rule. It should also be much easier to extract such a set from a corpus than to induce a typical generative grammar with movements.

It isn't specified whether anything can occur between template components; I assume that anything can. Hence free word order languages are supported, but English syntax rules are weakened considerably, allowing ungrammatical sentences. So there needs to be tight parser integration constraining the fragments' freedom.
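To make the idea concrete, here's a toy sketch that turns a fragment into a regular expression, with the fixed words acting as the lexical trigger (a real implementation would enforce the placeholder constraints and handle intervening material):

import re

def fragment_to_regex(fragment):
    # 'there was X in Y' -> a regex with a named group per placeholder
    parts = []
    for token in fragment.split():
        if token.isupper():                          # placeholder variable
            parts.append(r"(?P<%s>.+)" % token)
        else:
            parts.append(re.escape(token))           # fixed lexical part
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)

pattern = fragment_to_regex("there was X in Y")
m = pattern.search("There was an Old Man in a pew,")
print(m.group("X"), "|", m.group("Y"))               # an Old Man | a pew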

Monday, November 2, 2009

All syntax problems...

...can be solved by another level of constituency. Except, of course, for the problem of too many layers of constituency.

At least, in Minimalism:

Tuesday, October 20, 2009

Linear Unit Grammar

Predominant natural language syntax theories usually deal with one sentence, carefully selected to be well-formed, so-called 'grammatical'. But live language seems to be different and more complex, especially spoken language. It contains lots of repetitions, false starts, hesitations, reformulations, speaker changes and so on.

I therefore like encountering theories that aim to describe spoken discourse 'as is' instead of labelling it as incorrect. One such theory is Linear Unit Grammar by John Sinclair and Anna Mauranen.

The main notion here is the 'chunk'. It's a fuzzy pre-theoretical concept with no precise definition. Basically it's a linear fragment of the input signal (text, speech, etc.) which people tend to comprehend all at once. It may be unfinished. Usually it's formed of closely connected words, like a noun phrase with adjectives (a small dog) or a verb with its obligatory arguments (to love Mary). Relative clauses, of course, form separate chunks. Moreover, auxiliary words (like 'that' in 'the dog that loves Mary') are separated from everything else and go into single-word chunks. The chunks are very small; their size rarely exceeds 5 words.

Here's a sample chunking I made from a spoken English corpus. The analyzed fragment is:

so an example of this s- s- second rule i mean the second rule probably is the easiest one to look at cuz you have the, the f- the the four-six-seven type of relationship

I divide it according to my intuition only. Whenever I have doubts, I put a chunk boundary. And so the result will be:

1. so
2. an example of this
3. s- s-
4. second rule
5. i mean
6. the second rule
7. probably
8. is
9. the easiest one
10. to look at
11. cuz
12. you have the
13. the f-
14. the
15. the four-six-seven type
16. of relationship

Sinclair & Mauranen classify the chunks into organizational fragments (1,5,11) and message fragments (the others). These groups are also divided into subgroups according to organizational function or message completeness. There's a non-deterministic algorithm that translates any kind of text into a well-formed one. In this example it would be something like 'an example of this second rule is probably the easiest one to look at cuz you have the four-six-seven type of relationship'.

That's a bit surprising! How can anyone claim that 'grammatical sentences are unnatural, let's analyze the real ones' and then analyze spoken discourse by first making it grammatical? The answer is that the authors don't in fact aim to compete with major syntactic theories; they strive to co-exist with them, at least in the beginning. The described algorithm may be just a first step in complex language analysis. The authors also suggest chunking could help in second language teaching and learning.

What I personally like about Linear Unit Grammar is precisely the chunking. It's so simple! And, in contrast to Gasparov's approach, where the text is divided into communicative fragments, the LUG chunks are contiguous and non-overlapping. Therefore LUG chunking can be done with simple regular expressions or Markov processes. A great part of the syntax lies inside chunks, so there's no need to analyze it the same way as the 'bigger' structures like clause subordination. NLTK seems to provide chunking out of the box, so I guess I gotta try it.
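As a quick experiment in that direction, here's a toy boundary-insertion chunker (a heuristic sketch of my own, not the authors' algorithm and not NLTK's):

def chunk(tokens, boundary_before):
    chunks, current = [], []
    for token in tokens:
        if current and token.lower() in boundary_before:
            chunks.append(" ".join(current))   # close the current chunk
            current = []
        current.append(token)
    if current:
        chunks.append(" ".join(current))
    return chunks

# words that tend to start a new chunk in the fragment above
boundary_before = {"so", "i", "cuz", "probably", "is", "to", "of", "the"}
text = "so an example of this s- s- second rule i mean the second rule"
print(chunk(text.split(), boundary_before))
# ['so an example', 'of this s- s- second rule', 'i mean', 'the second rule']

Even this crude rule produces something close to the hand-made chunking above.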