Tuesday, October 20, 2009

Linear Unit Grammar

Predominant theories of natural language syntax usually deal with a single sentence, carefully selected to be well-formed, or so-called 'grammatical'. But live language, especially spoken language, seems to be different and more complex. It contains lots of repetitions, false starts, hesitations, reformulations, speaker changes and so on.

I therefore like coming across theories that aim to describe spoken discourse 'as is' instead of labelling it as incorrect. One of those theories is Linear Unit Grammar by John Sinclair and Anna Mauranen.

The main notion here is the 'chunk'. It's a fuzzy pre-theoretical concept with no precise definition. Basically, it's a linear fragment of the input signal (text, speech, etc.) which people tend to comprehend all at once. It may be unfinished. Usually it's formed of closely connected words, like a noun phrase with adjectives (a small dog) or a verb with its obligatory arguments (to love Mary). Relative clauses, of course, form separate chunks. Moreover, auxiliary words (like 'that' in 'the dog that loves Mary') are separated from everything else and form single-word chunks. The chunks are very small: their size rarely exceeds 5 words.

Here's a sample chunking I made of a fragment from a spoken English corpus. The analyzed fragment is:

so an example of this s- s- second rule i mean the second rule probably is the easiest one to look at cuz you have the, the f- the the four-six-seven type of relationship

I divided it according to my intuition only. Whenever I was in doubt, I put a chunk boundary. The result is:

1. so
2. an example of this
3. s- s-
4. second rule
5. i mean
6. the second rule
7. probably
8. is
9. the easiest one
10. to look at
11. cuz
12. you have the
13. the f-
14. the
15. the four-six-seven type
16. of relationship

Sinclair & Mauranen classify the chunks into organizational fragments (1, 5, 11) and message fragments (the others). These groups are further divided into subgroups according to organizational function or message completeness. On top of this classification there's a non-deterministic algorithm that translates any kind of text into a well-formed one. In this example the result would be something like 'an example of this second rule is probably the easiest one to look at cuz you have the four-six-seven type of relationship'.
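To make the first, coarse classification concrete, here is a minimal Python sketch of my own. It only records the chunk list from above and the organizational/message split by chunk number; the names CHUNKS, ORGANIZATIONAL and classify are made up for illustration, and the finer subgroups and the translation step are not attempted, since they need the rules from the book.

```python
# The analyzed fragment, with the comma after "you have the" dropped.
UTTERANCE = ("so an example of this s- s- second rule i mean the second rule "
             "probably is the easiest one to look at cuz you have the the f- "
             "the the four-six-seven type of relationship")

# The sixteen chunks from the list above, in order.
CHUNKS = [
    "so", "an example of this", "s- s-", "second rule", "i mean",
    "the second rule", "probably", "is", "the easiest one", "to look at",
    "cuz", "you have the", "the f-", "the", "the four-six-seven type",
    "of relationship",
]

# 1-based numbers of the chunks marked as organizational in this post.
ORGANIZATIONAL = {1, 5, 11}


def classify(chunks, organizational):
    """Pair every chunk with its coarse class; numbering starts at 1."""
    return [
        ("organizational" if number in organizational else "message", text)
        for number, text in enumerate(chunks, start=1)
    ]


labelled = classify(CHUNKS, ORGANIZATIONAL)
organizational_chunks = [text for kind, text in labelled if kind == "organizational"]
message_chunks = [text for kind, text in labelled if kind == "message"]

# The chunks cover the utterance without gaps or overlaps,
# so joining them gives the original text back.
rebuilt = " ".join(CHUNKS)
```

Here organizational_chunks holds 'so', 'i mean' and 'cuz', and rebuilt is equal to UTTERANCE.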

The translation into well-formed text is a bit surprising! How can anyone claim that 'grammatical sentences are unnatural, let's analyze the real ones' and then analyze spoken discourse by first making it grammatical? The answer is that the authors in fact don't aim to compete with the major syntactic theories; they strive to co-exist with them, at least in the beginning. The described algorithm may be just a first step in a more complex language analysis. The authors also suggest that chunking could help in second language teaching and learning.

What I personally like about Linear Unit Grammar is precisely the chunking. It's so simple! And, in contrast to Gasparov's approach, where the text is divided into communicative fragments, the LUG chunks are contiguous and non-overlapping. Therefore the LUG chunking can be done with simple regular expressions or Markov processes. A great part of the syntax lies inside chunks, so there's no need to analyze it the same way as the 'bigger' structures like clause subordination. NLTK seems to provide chunking out of the box, so I guess I gotta try it.
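To illustrate the regular expression idea, here is a rough sketch in plain Python (no NLTK). It is not the LUG procedure, just my own toy: the part-of-speech tags are assigned by hand (a real pipeline would need a tagger), and the three patterns (verb plus noun phrase, noun phrase, any single word) are assumptions that happen to fit the 'the small dog that loves Mary' example.

```python
import re

# Hand-assigned Penn-style tags; this is an assumption, not tagger output.
TAGGED = [("the", "DT"), ("small", "JJ"), ("dog", "NN"),
          ("that", "WDT"), ("loves", "VBZ"), ("Mary", "NNP")]

# The tag sequence is encoded as a string like "<DT><JJ><NN>...".
# Noun phrase: optional determiner, any adjectives, one or more nouns.
NP = r"(?:<DT>)?(?:<JJ>)*(?:<NNP?>)+"
# Alternatives are tried left to right: verb with its object,
# bare noun phrase, and finally any single tag as a one-word chunk.
CHUNK = re.compile(rf"<VB[A-Z]?>{NP}|{NP}|<[^>]+>")


def chunk(tagged):
    """Split (word, tag) pairs into contiguous, non-overlapping chunks."""
    tags = "".join(f"<{tag}>" for _, tag in tagged)
    chunks, position = [], 0
    for match in CHUNK.finditer(tags):
        size = match.group().count("<")  # number of tags the match covers
        words = [word for word, _ in tagged[position:position + size]]
        chunks.append(" ".join(words))
        position += size
    return chunks


result = chunk(TAGGED)
```

Because the last alternative matches any single tag, every word ends up in exactly one chunk, which is the contiguous and non-overlapping property mentioned above. For this input, result is ['the small dog', 'that', 'loves Mary'].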
