Sunday, October 25, 2009

A solution for John

It appears I was too pessimistic when I claimed that I couldn't assemble the meaning of 'John has two sisters' from the word meanings. Blackburn & Bos have taught me:

John: λv (v JOHN)
has: λsubj λobj (subj (λs (obj s id)))
two: λn λy λv (v (∃S (|S|=2 & ∀x∊S (n x y))))
sisters: λx λy SISTER(x,y)

Let's evaluate:

(two sisters) = λy λv (v (∃S (|S|=2 & ∀x∊S SISTER(x,y))))
(has John) = λobj (obj JOHN id)
((has John) (two sisters)):
   ∃S (|S|=2 & ∀x∊S SISTER(x,JOHN))

Note that this semantics also doesn't assume that John has only 2 sisters, so it really looks like an appropriate one.
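Just to check myself, here's a minimal sketch of the same composition in Python, with Python lambdas standing in for the semantic terms and plain strings for the resulting formula (the string encoding is my own choice, not Blackburn & Bos's):

identity = lambda x: x   # the 'id' from the entry for 'has'

john = lambda v: v('JOHN')
has = lambda subj: lambda obj: subj(lambda s: obj(s)(identity))
two = lambda n: lambda y: lambda v: v('∃S (|S|=2 & ∀x∊S ' + n('x')(y) + ')')
sisters = lambda x: lambda y: 'SISTER(%s,%s)' % (x, y)

print(has(john)(two(sisters)))
# prints: ∃S (|S|=2 & ∀x∊S SISTER(x,JOHN))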

So, the applicative text structure is restored to its rightful place. Still, I don't much like this kind of solution, because the semantics of 'has' and 'two' depend too much on the fact that the SISTER predicate takes 2 arguments. I'd have to change them both to express 'John has two dogs', thus making 'two' (and every other numeric quantifier) semantically ambiguous. But it's still better than nothing.

Friday, October 23, 2009

Asking the right questions

When I said that the semantics of 'John has two sisters' was |{ x | SISTER(x, JOHN) }|=2, I wasn't quite correct. In fact, there's nothing in the text preventing John from having 5 or 42 sisters. It's the Maxim of Quantity that may limit the sister count to 2. Not being an absolute rule, this maxim can easily be flouted, and in the right context the sentence could actually mean that John has more than 2 sisters.

Things get even more interesting if we add just one word: 'John has two beautiful sisters'. There just isn't a default meaning here! John may have exactly 2 sisters, both of them beautiful, or he may have 2 beautiful sisters and another 3 who are not so beautiful.

The question is what a computer should do in such situations. Should it apply pragmatic knowledge and disambiguate everything immediately after syntactic analysis, using the whole context? Or should it maintain an intermediate semantic representation and hand it to some pragmatics module that could infer everything from the semantics? I clearly prefer modularization, i.e. the latter possibility. Of course, I don't presuppose any sequentiality: the modules may run in parallel and interact.

If we separate semantics from pragmatics, the representation problem arises again, and it's even harder now. The semantic structure should be very generic: it should be interpretable in all the ways that were possible with the original text (minus the resolved lexical/syntactic ambiguities). At the same time, there should be no way of understanding it in any other way. If we just replace = with >= in the 'John has two sisters' meaning, the pragmatics module still won't be able to apply the Quantity Maxim: such a meaning could just as well have been produced from 'John has at least two sisters', which is unambiguous with respect to the sister count. So it should still be some kind of =2, but in a form open to interpretation. What could that format be? I don't know. Yet.
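To illustrate what I mean by 'open to interpretation' (this is just a toy sketch of the problem, not the format I'm looking for), imagine the semantics module recording only the literally stated number and leaving it to the pragmatics module to decide how to read it:

class StatedCardinality:
    def __init__(self, predicate, stated):
        self.predicate = predicate   # e.g. 'SISTER(x, JOHN)'
        self.stated = stated         # the number literally uttered

    def literal(self):
        # the weakest reading, licensed by the text alone
        return '|{ x | %s }| >= %d' % (self.predicate, self.stated)

    def strengthened(self):
        # the reading pragmatics may add via the Quantity Maxim
        return '|{ x | %s }| = %d' % (self.predicate, self.stated)

sisters = StatedCardinality('SISTER(x, JOHN)', 2)
print(sisters.literal())        # |{ x | SISTER(x, JOHN) }| >= 2
print(sisters.strengthened())   # |{ x | SISTER(x, JOHN) }| = 2

The crucial bit is that the stated 2 is by itself neither >=2 nor =2; it only records what was said.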

Thursday, October 22, 2009

Shallow vs. structural analysis

Let's look at a very simple sentence, namely 'John has two sisters'. I'm interested in its semantics, or, more precisely, its computer representation. The truth condition is actually very simple: it says that the number of those who happen to be sisters of John equals 2:

|{ x | SISTER(x, JOHN) }|=2

(let the uppercase letters denote some semantic meaning here).

A question arises: how can we assemble this semantics from the meanings of the sentence's components? The constituent structure for this sentence would be:

[S [NP John] [VP has [QP two sisters]]]

The dependency structure:

John <- has -> two -> sisters

The beloved one, applicative structure:

(has John (two sisters))

Lexical Functional Grammar-style:

[ PRED 'has'
  SUBJ [ PRED 'John' ]
  OBJ  [ PRED 'sisters'
         SPEC [ NUM 2 ] ] ]

In any of these variants, 'has' has two arguments: John and the combined 'two sisters'. So it appears that we should combine the word meanings in this order, getting something like f(HAS, JOHN, g(2, SISTER)). And this formula should somehow be equivalent to |{ x | SISTER(x, JOHN) }|=2. The question is, what are f and g? I see no direct structural answer. The best variant I've come up with is to change the structure, replacing it with another one that contains only one predicate:

HAS_N_SISTERS(Who,N)

which would translate to

|{ x | SISTER(x, Who) }|=N

This can be generalized a bit (take sibling instead of sister), but not much further. A similar sentence, 'John has two dogs', would have a different semantics, e.g. |{ x | DOG(x) & BELONGS(x, JOHN) }|=2. A two-place, 'sister'-like 'dog' predicate would be funny.

So it seems that all the structures I know of are of no use with this sentence. That's one of the reasons I prefer shallow parsing based on patterns with wildcards: it appears to map better onto semantics. And a probable sad consequence is that the applicative structure, beautiful as it is, will remain unapplied.
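To show what I mean by pattern-based mapping, here's a toy sketch (entirely my own, not a real system): the wildcard pattern 'X has N Ys' mapped straight onto per-noun semantic templates, with no tree structure in between:

import re

WORD_TO_NUM = {'one': 1, 'two': 2, 'three': 3}

# relational nouns get a two-place predicate, other nouns an ownership reading
TEMPLATES = {
    'sister': lambda who, n: '|{ x | SISTER(x, %s) }|=%d' % (who.upper(), n),
    'dog':    lambda who, n: '|{ x | DOG(x) & BELONGS(x, %s) }|=%d' % (who.upper(), n),
}

def interpret(sentence):
    m = re.match(r'(\w+) has (\w+) (\w+?)s$', sentence)
    who, num, noun = m.group(1), WORD_TO_NUM[m.group(2)], m.group(3)
    return TEMPLATES[noun](who, num)

print(interpret('John has two sisters'))  # |{ x | SISTER(x, JOHN) }|=2
print(interpret('John has two dogs'))     # |{ x | DOG(x) & BELONGS(x, JOHN) }|=2

Of course this just pushes the per-noun knowledge into the templates, which is the HAS_N_SISTERS trick all over again, but at least the wildcards line up directly with the semantic slots.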

Wednesday, October 21, 2009

What Science Underlies Natural Language Engineering?

Quote:

A superficial look at the papers presented in our main conferences reveals that the vast majority of them are engineering papers, discussing engineering solutions to practical problems. Virtually none addresses fundamental issues in linguistics.


Couldn't have said it better.

Tuesday, October 20, 2009

Linear Unit Grammar

The predominant natural language syntax theories usually deal with a single sentence, carefully selected to be well-formed, so-called 'grammatical'. But live language seems to be different and more complex, especially spoken language. It contains lots of repetitions, false starts, hesitations, reformulations, speaker changes and so on.

I therefore like coming across theories that aim to describe spoken discourse 'as is' instead of labelling it as incorrect. One such theory is Linear Unit Grammar by John Sinclair and Anna Mauranen.

The main notion here is the 'chunk'. It's a fuzzy pre-theoretical concept with no precise definition. Basically, it's a linear fragment of the input signal (text, speech, etc.) which people tend to comprehend all at once. It may be unfinished. Usually it's formed of closely connected words, like a noun phrase with adjectives (a small dog) or a verb with its obligatory arguments (to love Mary). Relative clauses, of course, form separate chunks. Moreover, auxiliary words (like 'that' in 'the dog that loves Mary') are separated from everything else and go into single-word chunks. The chunks are very small: their size rarely exceeds 5 words.

Here's a sample chunking made by me from a spoken English corpus. The analyzed fragment is:

so an example of this s- s- second rule i mean the second rule probably is the easiest one to look at cuz you have the, the f- the the four-six-seven type of relationship

I divide it according to my intuition only; whenever I have doubts, I put a chunk boundary. The result is:

1. so
2. an example of this
3. s- s-
4. second rule
5. i mean
6. the second rule
7. probably
8. is
9. the easiest one
10. to look at
11. cuz
12. you have the
13. the f-
14. the
15. the four-six-seven type
16. of relationship

Sinclair & Mauranen classify the chunks into organizational fragments (1,5,11) and message fragments (the others). These groups are also divided into subgroups according to organizational function or message completeness. There's a non-deterministic algorithm that translates any kind of text into a well-formed one. In this example it would be something like 'an example of this second rule is probably the easiest one to look at cuz you have the four-six-seven type of relationship'.

That's a bit surprising! How can anyone claim that 'grammatical sentences are unnatural, let's analyze the real ones' and then analyze spoken discourse by first making it grammatical? The answer is that the authors don't in fact aim to compete with the major syntactic theories; they strive to co-exist with them, at least in the beginning. The described algorithm may be just a first step in a more complex language analysis. The authors also suggest that chunking could help in second-language teaching and learning.

What I personally like about Linear Unit Grammar is precisely the chunking. It's so simple! And, in contrast to Gasparov's approach, where the text is divided into communicative fragments, the LUG chunks are contiguous and non-overlapping. Therefore LUG chunking can be done with simple regular expressions or Markov processes. A great part of the syntax lies inside the chunks, so there's no need to analyze it the same way as 'bigger' structures like clause subordination. NLTK seems to provide chunking out of the box, so I guess I gotta try it.
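For the record, here's roughly what I have in mind, using NLTK's RegexpParser; the chunk grammar below is just my guess at something LUG-flavoured, not Sinclair & Mauranen's actual rules:

import nltk   # word_tokenize/pos_tag may need the tokenizer and tagger models downloaded first

# a guess at LUG-flavoured chunk rules over POS tags, nothing more
grammar = r'''
  CHUNK: {<DT>?<JJ.*>*<NN.*>+}    # noun-ish chunks: 'the second rule'
         {<TO>?<VB.*>+<IN>?}      # verb-ish chunks: 'to look at'
'''
parser = nltk.RegexpParser(grammar)

sentence = 'the second rule probably is the easiest one to look at'
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
print(parser.parse(tagged))   # a Tree with CHUNK subtrees for the chunk-like pieces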