In the previous entry, I outlined an application (working name: Deep Probe) that can find the main themes in a large corpus of text that it gathers on its own. Today, I want to say something about how I think we can tackle such problems.
If computers are left to themselves, our written language looks to them like this looks to us:
That is, as a bunch of symbols with no meaning. There is little actual text analysis that can be done at this level.
Understanding involves knowing the relations between symbols. Let’s say that ‘mu’ can stand next to ‘u’ or ‘qa’, while being a kind of ‘ta’ somewhat similar to ‘wa’. This type of knowledge is fundamental to really using a language. Without it, you could still associate words with some experiences, like pain or seeing some kind of object, such as a bird – but without an intricate system of language, your communication would be like simple animal grunts or meows that can’t even form sentences.
One of the most powerful ideas in semantics, I think, is that complex words can be defined or explained in terms of simple ones. This way we can assume that we know what “simple” words mean, and then build even a huge vocabulary out of them. I particularly love this old programming talk where Guy Steele defines all the words he uses that are longer than one syllable.
Some linguists believe that this is how everyday languages that we use are constructed.
The particular theory that I favor is Natural Semantic Metalanguage, developed chiefly by Anna Wierzbicka since the 1970s, building on ideas going back as far as the Lingua Mentalis imagined by Gottfried Leibniz in the 17th century. Let’s say that we have a closed set of words that are so simple that everybody can understand them and so universal that they appear (in some form) in every language. We should be able to explain everything else in terms of these words, known as semantic primes.
Based on research on numerous languages, including European, Asian, native Australian, African and North American ones, the number of semantic primes is now believed to be about 50-60. They include such concepts as I, you, people (one of my favorites – there are reasons to believe that we understand human in terms of people, not the other way around!), place, know, happen and many others. These are, perhaps, ideas so fundamental that almost any word you know can ultimately be explained with them.
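To make the idea of building vocabulary out of primes a little more concrete, here is a minimal Python sketch of recursive expansion: a word is either a prime or is defined through other words, which are expanded in turn until only primes remain. Everything in it is an assumption made for illustration: the `PRIMES` set is a small subset of the real inventory, the entries in `DEFINITIONS` are toy stand-ins rather than real NSM explications, and the `expand` function is a hypothetical helper, not anything taken from Deep Probe.

```python
# Illustrative subset only; the real inventory has about 50-60 primes.
PRIMES = frozenset({"i", "you", "people", "someone", "know",
                    "want", "say", "happen", "think", "good"})

# Toy definitions invented for this sketch (NOT real NSM explications).
# Each word is defined by a list of simpler words.
DEFINITIONS = {
    "human": ["someone", "people"],
    "opinion": ["think", "good"],
    "advisor": ["human", "say", "opinion"],
}


def expand(word, stack=()):
    """Return the list of primes that `word` ultimately reduces to."""
    if word in PRIMES:
        return [word]
    if word in stack:
        # A definition that leads back to itself explains nothing.
        raise ValueError(f"circular definition involving {word!r}")
    if word not in DEFINITIONS:
        raise KeyError(f"no definition for {word!r}")
    result = []
    for part in DEFINITIONS[word]:
        result.extend(expand(part, stack + (word,)))
    return result


print(expand("advisor"))
```

The cycle check matters more than it may seem: ordinary dictionaries are full of circular definitions, and the whole point of a closed set of primes is that expansion always bottoms out.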
This is not an easy task. There are many nuances that are easy to overlook, and missing one can make a definition mean strange, unintended things or clash with words that are superficially similar. See how Wierzbicka explains the phrase I propose [X]:
I think it would be good if we caused X to happen
I know that I cannot cause it to happen if other people don’t want it to happen
I say: if you people want it to happen, I want it to happen
I say this because I want to cause other people to think about it and to say if they want it to happen
I assume that you will say if you want it to happen
[English speech act verbs: a semantic dictionary, Sydney 1987, p. 188]
This does sound long-winded, and not only because it uses only primes (the ones that were assumed in 1987). But all these points are actually needed and well thought out. For example, ‘suggest’ is more about something that someone else – the person to whom the thing is suggested – could do. Proposing is about deciding something together. Without the second point, it’s not really a proposal: it may be some kind of order or threat. Also, if I don’t really want something to happen (or don’t want people to think that I want it), I would say that I don’t really propose it.
So I believe that in order for a computer to pick up on the main threads and themes appearing in texts, the texts should be converted to a bunch of primes, which would abstract away the specific wording that people use. Of course, in practice, having definitions for all words is completely unrealistic. One needs to seamlessly join the parts that we understand with parts that seem barely comprehensible, if at all. Incidentally, this is something that humans do too. This is why we can enjoy certain Lewis Carroll poems and talk to people who use random jargon.
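As a rough illustration of that last idea, and not a description of how Deep Probe actually works, the Python sketch below reduces a text to a bag of primes while carrying unknown words along as opaque tokens. The `PRIMES` subset, the one-level `LEXICON` with its made-up entries for "propose" and "suggest", the `?` prefix for unknown words, and the weighted-Jaccard `overlap` measure are all assumptions chosen to keep the example small.

```python
import re
from collections import Counter

# Illustrative subset of primes.
PRIMES = frozenset({"i", "you", "people", "want", "say",
                    "happen", "think", "good", "do"})

# Made-up, heavily simplified entries (NOT Wierzbicka's explications).
LEXICON = {
    "propose": ["i", "think", "good", "happen", "want"],
    "suggest": ["i", "think", "good", "you", "do"],
}


def to_primes(text):
    """Map a text to a bag of primes plus opaque unknown tokens."""
    counts = Counter()
    for token in re.findall(r"[a-z']+", text.lower()):
        if token in PRIMES:
            counts[token] += 1
        elif token in LEXICON:
            counts.update(LEXICON[token])
        else:
            # No definition available: keep the word itself, marked with "?".
            counts["?" + token] += 1
    return counts


def overlap(a, b):
    """Weighted Jaccard similarity between two bags (0.0 if both are empty)."""
    total = sum((a | b).values())
    return sum((a & b).values()) / total if total else 0.0


first = to_primes("I propose a picnic")
second = to_primes("I suggest a picnic")
print(first)
print(overlap(first, second))
```

Here "picnic" has no definition, so it simply survives as `?picnic` and can still match the same unknown word in another text, while "propose" and "suggest" meet partway through the primes they share.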
My exact method is nothing fancy really, but since I hope to make Deep Probe a commercial product (and be able to live off doing something that I enjoy), most of the moving parts and plumbing will be left to your imagination. But I do enjoy talking about semantics and language engineering, so this will probably be the main content of this blog.