The model has several levels, though they are almost completely different from Melchuk's:
Level 1: text/phonology.
The parser receives it as input and then produces
Level 2: constructions.
The contiguous text is separated into constructions of all sizes, from morphological to complex syntactic ones. Basic constructions are (mostly) word stems; complex constructions combine the simpler ones. The structure is most probably a tree, though not necessarily. Each construction may carry information structure. Constructions are ordered and considered to be executable in that order. Being executed, they give us
Level 3: semantics: a program.
In some general-purpose (programming) language with well-defined semantics and model. The model probably consists of local variables (max. 4-7), salient context and general memory, which is organized as a graph of frames and represents the listener's knowledge of the world. The language is imperative, object-oriented and almost lacks any control flow, being very linear. The basic instructions are frame creation and assignments to local variables or frame slots. If we execute this semantics program, we'll get
Level 4: change in the frame structure in the memory.
Which means a change in the listener's knowledge of the world. Actually, the change may be greater than we'd expect from looking at the semantics program: each of its simple instructions may cause significant changes that are not directly encoded in the program itself. This is what we call pragmatics.
We're done! Though it's worth noting that the built frames might well encode a sequence of actions (e.g. a cooking recipe), which can also be executed via interpretation and result in the listener's real actions in the real world.
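To make Level 3 a bit more concrete, here's a minimal sketch in Python of what such a semantics program might look like. Everything here (the Frame and Memory classes, the slot names, the 'giving' example) is my own illustration of the idea, not part of any existing implementation.

```python
# A sketch of a Level 3 "semantics program": memory is a graph of frames,
# and the only instructions are frame creation and slot assignment.
# All names below are illustrative assumptions.

class Frame:
    def __init__(self, kind):
        self.kind = kind    # the concept this frame instantiates
        self.slots = {}     # slot name -> value (a literal or another Frame)

    def __setitem__(self, slot, value):
        self.slots[slot] = value


class Memory:
    """The listener's knowledge of the world: a growing graph of frames."""
    def __init__(self):
        self.frames = []

    def new_frame(self, kind):
        frame = Frame(kind)
        self.frames.append(frame)
        return frame


# "Executing" the semantics of a sentence like "John gave Mary a book":
# a linear sequence of frame creations and slot assignments using a handful
# of local variables (the 4-7 "registers" mentioned above), with no control flow.
memory = Memory()

john = memory.new_frame("person")       # local variable 1
john["name"] = "John"

mary = memory.new_frame("person")       # local variable 2
mary["name"] = "Mary"

book = memory.new_frame("book")         # local variable 3

giving = memory.new_frame("giving")     # local variable 4
giving["agent"] = john
giving["recipient"] = mary
giving["object"] = book
```

The Level 4 change is then simply the new frames and slot values in `memory`; pragmatics would correspond to extra updates triggered by each assignment, and a recipe would be a chain of such action frames that an interpreter could later walk and execute.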
Sunday, November 15, 2009
Construction grammar
The subject, construction grammar (CxG), is exactly what I had in mind. In short:
There's no language-specific unit in the brain; the language ability arises from general cognitive abilities.
The basic notion of the language is the construction (a template text fragment). No one has defined it precisely, but everyone uses it.
Both the speaker and the hearer operate in terms of constructions, not words, unlike in other prominent theories.
Children learn specific constructions first, abstracting them into more general rules afterwards.
Lexical items which are often used in similar contexts tend to grammaticalize into new constructions as language evolves.
The construction approach seems very promising to me, though I see at least two weaknesses in it, to which I haven't found an adequate answer so far:
Why does learning new languages become much harder after puberty? Is it a degradation of general cognitive abilities? What's different about second language learners? Why do Japanese speakers drop the 3rd person singular -s in English ('he like driving')?
CxG accounts only for positive data, and doesn't explain why exactly the ungrammatical samples are as (un)grammatical as they are. A vague explanation that one gets used to a specific way of expressing an idea, and all the other ways are less habitual and hence more difficult to produce or analyze, doesn't satisfy me very much. The alternatives may be equally difficult (or easy) from a processing perspective. E.g., adjectives in English could have come after the noun with no clear performance disadvantage. The semantics could also be perfectly clear, as in 'he go cinema'.
Another explanation could be Occam's Razor. In all the ungrammatical examples I've seen, there is a grammatical counterpart. So the brain could reason: why should the same meaning be expressed in this strange way while there's another well-established one? And thus mark the 'strange way' as ungrammatical.
The question remains how to create a parser based on constructions, and how to induce those constructions from a corpus. And, as usual, what that parser should produce as its result.
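One very naive way to attack the induction part would be to mine frequent token templates from a corpus and abstract the varying positions into slots. The sketch below is only my illustration of that idea (the function names are mine, and this is nowhere near a real construction learner):

```python
# A naive sketch of construction induction: collect frequent n-grams and
# merge those that differ in exactly one position into a slotted template.
# Purely illustrative; not a serious CxG induction algorithm.

from collections import Counter
from itertools import combinations

def frequent_ngrams(sentences, n=3, min_count=1):
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return [gram for gram, count in counts.items() if count >= min_count]

def abstract_templates(ngrams):
    """Merge n-grams differing in exactly one position into templates with a slot."""
    templates = set()
    for a, b in combinations(ngrams, 2):
        diff = [i for i in range(len(a)) if a[i] != b[i]]
        if len(diff) == 1:
            template = list(a)
            template[diff[0]] = "<SLOT>"
            templates.add(tuple(template))
    return templates

corpus = [
    "i want to go".split(),
    "i want to sleep".split(),
    "i want to eat".split(),
]
print(abstract_templates(frequent_ngrams(corpus)))
# prints {('want', 'to', '<SLOT>')}
```

A real learner would of course need frequencies, semantics and nested constructions, but even this toy version shows how item-specific patterns ('i want to go') could gradually be abstracted into a more general template ('want to <SLOT>'), matching the acquisition story above.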
Tuesday, October 20, 2009
Linear Unit Grammar
Predominant natural language syntax theories usually deal with a single sentence, carefully selected to be well-formed, so-called 'grammatical'. But live language seems to be different and more complex, especially the spoken variety: it contains lots of repetitions, false starts, hesitations, reformulations, speaker changes and so on.
I therefore like meeting theories aiming to describe spoken discourse 'as is' instead of labelling it as incorrect. One of those theories is Linear Unit Grammar by John Sinclair and Anna Mauranen.
The main notion here is the 'chunk'. It's a fuzzy pre-theoretical concept with no precise definition. Basically, it's a linear fragment of the input signal (text, speech, etc.) which people tend to comprehend all at once. It may be unfinished. Usually it's formed of closely connected words, like a noun phrase with adjectives ('a small dog') or a verb with its obligatory arguments ('to love Mary'). Relative clauses, of course, form separate chunks. Moreover, auxiliary words (like 'that' in 'the dog that loves Mary') are separated from everything else and go into single-word chunks. The chunks are very small; their size rarely exceeds 5 words.
Here's a sample chunking made by me from a spoken English corpus. The analyzed fragment is:
so an example of this s- s- second rule i mean the second rule probably is the easiest one to look at cuz you have the, the f- the the four-six-seven type of relationship
I divide it according to my intuition only. Whenever I have doubts, I put a chunk boundary. And so the result will be:
1. so
2. an example of this
3. s- s-
4. second rule
5. i mean
6. the second rule
7. probably
8. is
9. the easiest one
10. to look at
11. cuz
12. you have the
13. the f-
14. the
15. the four-six-seven type
16. of relationship
Sinclair & Mauranen classify the chunks into organizational fragments (1,5,11) and message fragments (the others). These groups are also divided into subgroups according to organizational function or message completeness. There's a non-deterministic algorithm that translates any kind of text into a well-formed one. In this example it would be something like 'an example of this second rule is probably the easiest one to look at cuz you have the four-six-seven type of relationship'.
That's a bit surprising! How can anyone claim that 'grammatical sentences are unnatural, let's analyze the real ones' and then analyze spoken discourse by first making it grammatical? The answer is that the authors in fact don't aim to compete with major syntactic theories; they strive to co-exist with them, at least in the beginning. The described algorithm may be just a first step in complex language analysis. The authors also suggest chunking could help in second language teaching and learning.
What I personally like about Linear Unit Grammar is precisely the chunking. It's so simple! And, in contrast to Gasparov's approach, where the text is divided into communicative fragments, the LUG chunks are contiguous and non-overlapping. Therefore LUG chunking can be done with simple regular expressions or Markov processes. A great part of the syntax lies inside chunks, so there's no need to analyze it the same way as the 'bigger' structures like clause subordination. NLTK seems to provide chunking out of the box, so I guess I gotta try it.
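As a quick illustration of the 'regular expressions' route, here's what a minimal NLTK-based chunker might look like. The grammar is my own toy pattern, and POS-based noun phrase chunks are of course not the same thing as LUG chunks, so treat this purely as a starting-point sketch:

```python
# A toy chunker using NLTK's RegexpParser: POS-tag the words, then group
# determiner/adjective/noun sequences into NP chunks. A rough approximation
# only -- LUG chunks are delimited by intuition, not by POS patterns.
# Requires the NLTK data packages 'punkt' and 'averaged_perceptron_tagger'.

import nltk

sentence = "the second rule probably is the easiest one to look at"
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)

# One rule: an NP is an optional determiner, any adjectives, then one or more nouns.
grammar = "NP: {<DT>?<JJ.*>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)

for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
    print(" ".join(word for word, tag in subtree.leaves()))
# typically prints something like "the second rule" and "the easiest one",
# depending on the tagger's output
```

Getting from here to real LUG chunking would mean adding rules (or a Markov model) for hesitations, false starts and discourse markers rather than just noun phrases, but it shows how little machinery the within-chunk structure needs.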
Labels: language, linear unit grammar, nlp, parsing, syntax