When I talk to people in my field about a project involving information retrieval that uses semantic representations grounded in linguistics, there is an aura of… let’s say, controversy around it. Nowadays everyone knows that numerical embeddings, representing words and sentences as vectors of numbers trained inside neural nets, are the modern, still-developing way of doing things.
This wasn’t always true; in fact, to this day there are ghostly figures frequenting NLP conferences with their systems that use handcrafted representations.
Of course, in reality, there is no hard border between learning representations from data and using some theoretical frameworks and hand-made knowledge bases. It’s not like natural language processing doesn’t use (heavily!) human knowledge in practice, or that you cannot do interesting things combining these approaches. But the current tide seems to go in the direction of optimization algorithms learning everything on their own.
Today I want to talk about a) why knowledge bases (i.e. things like Cyc, WordNet, Semantic Web) are generally no longer considered a strong strategy by themselves, and b) why – on the other hand – they could have surprising strengths. The main problems with them are these:
They require entering knowledge manually, which can consume a lot of time and money. Also, the labor-to-gain ratio is roughly linear, which is not great; we’ll discuss it in a moment.
They are limited to their manually-covered area, and most often have no good fallback mechanism for when information on something is missing (besides, fall back to what?).
Every knowledge base is made according to some schema. This limits what can be stored in it, and the space of things we’d like to express but cannot express adequately is often uncomfortably large.¹
Also, when many people input data together, the implicit “style” of one person will be incompatible with another’s. It’s easy to lose the predictability that the schema was supposed to provide in the first place.
I mentioned that the labor-to-gain ratio is linear. This means that for each new item we want to handle (a new word, for example), we have to perform a similar amount of work. This is at odds with the more explosive, exciting ways in which technology can compound power. You figure out how to build a computer, and then you can compute innumerable things. You enter 5000 facts into a database, and then you can exploit 5000 facts. It can be a little sad.
Actually, the linearity is not wholly true. Zipf’s law suggests that in a ranking of the most frequently used items of information (such as dictionary words), each successive item becomes rarer at an accelerating rate. Thus adding the most fundamental knowledge should bring the biggest gains, and then, as we add more obscure facts, the benefit of each one should diminish. This doesn’t mean that the “long tail” of rarely used knowledge isn’t a valuable asset (maybe even the most valuable one): just that the effort to get there can be larger by awful multiples.
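The diminishing returns are easy to see numerically. Here is a minimal sketch, assuming an idealized Zipfian distribution where the word of rank r occurs with frequency proportional to 1/r (the vocabulary size and cutoffs are illustrative, not from any real corpus):

```python
# Sketch: diminishing returns under Zipf's law, where frequency of rank r ~ 1/r.
# Vocabulary size and cutoffs below are illustrative assumptions.

def zipf_coverage(top_k: int, vocab_size: int) -> float:
    """Fraction of all word occurrences covered by the top_k most frequent
    words, assuming ideal Zipfian frequencies f(r) = 1/r."""
    total = sum(1 / r for r in range(1, vocab_size + 1))
    covered = sum(1 / r for r in range(1, top_k + 1))
    return covered / total

# With a 100,000-word vocabulary, the first 1% of entries already covers
# most of the text; the remaining 99% adds progressively less.
print(round(zipf_coverage(1_000, 100_000), 2))   # → 0.62
print(round(zipf_coverage(10_000, 100_000), 2))  # → 0.81
```

Entering the first thousand words buys about 62% coverage; the next nine thousand buy only about 19 points more.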
A big promise of a neural network algorithm is that you can just take it and use it for any data – assuming you can get the data from somewhere, which is often hard.
One of the most fascinating topics to me is designing neural nets that take a set of data and not only learn from it, but also expand it for later use. For instance, you can have a grammar correction system that uses its own unsuccessful attempts at correcting sentences as additional training examples. Researchers also work on techniques for automatically expanding data using more complex knowledge bases, such as corpora of text tagged with WordNet senses (technically these are graph algorithms, whose output is then consumed by neural networks).
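The self-expanding loop can be sketched schematically. Everything here is a stand-in (the “model” is just a lookup table, and the example sentences are invented); the point is only the flow: a failed attempt, compared against a reference correction, becomes a new training pair.

```python
# Schematic self-training loop: failed correction attempts are fed back as
# new (wrong, right) training pairs. The "model" is a toy lookup table;
# all sentences and rules here are hypothetical stand-ins.

training_pairs = [("their is", "there is")]   # seed examples
model = dict(training_pairs)                  # toy substitution "model"

def correct(sentence: str) -> str:
    """Apply every known substitution to the sentence."""
    for wrong, right in model.items():
        sentence = sentence.replace(wrong, right)
    return sentence

# An attempt that fails against a reference correction becomes a new example.
attempt = correct("he could of known")
reference = "he could have known"
if attempt != reference:
    training_pairs.append((attempt, reference))  # expand the data set
    model[attempt] = reference                   # retrain (here: just update)

print(correct("he could of known"))  # → he could have known
```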
Now, what is the biggest advantage of knowledge bases, and why do the existing ones tend not to exploit it very effectively?
I could find examples of linguistic phenomena for which deep learning methods currently do not account satisfactorily. But many such issues can be mitigated with some tricks and vector magery if really needed. Alternatively, I could point to the lack of convincing proof that our brains, like artificial networks, produce what they produce with some variant of a gradient descent procedure. But this is a very philosophical path that raises more new questions than it answers.
Instead, here is a simple observation (that happens to be relevant to what Deep Probe does). Neural networks ultimately operate on continuous data – real numbers. You can encode discrete data for them, and/or interpret their output as discrete (this cell, which we agreed represents cats, has 0.124, and the cell for dogs has 0.447, so we read the answer as “dog”), but this is external to what they do. No help from them here.
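That discretization step, with the numbers from the example above, is something we do entirely outside the network:

```python
# The network itself only produces real numbers; reading off "dog" is our
# interpretation, performed outside it. Labels and activations are the
# ones from the example in the text.

labels = ["cat", "dog"]
activations = [0.124, 0.447]  # continuous outputs of the final layer

# Pick the index of the highest activation and map it back to a label.
best_index = max(range(len(activations)), key=lambda i: activations[i])
print(labels[best_index])  # → dog
```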
When you ask a neural network to make for you (continuous) representations of, say, words from a dictionary, they will be vectors of numbers. What each of the numbers represents, generally, no one knows. These representations, embeddings, tend to work anyway – for many tasks. But there are tasks where there are hard reasons, related to basic computer science and combinatorics, why it is difficult to exploit embeddings.
One such situation can be summed up like this: you compute an embedding for each sentence in a set of a hundred million (10⁸). Then you want to be able to know, for each sentence, which other sentences are the most similar to it. There is no way around checking every one of the data points, each looking like this:
[-6.414142 -1.008358 -0.780210 0.892059 -2.524359 ...].
This will involve a hundred million vector distance computations each time. Or, if you prefer that for some reason, storing a table of around ten quadrillion pre-computed distances (as the square of 10⁸ is 10¹⁶) and then only retrieving the relevant hundred million instead of computing them. And you most certainly can do it: it’s about forty petabytes in single-precision floating-point format, which for Google nowadays should be nothing.
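The brute-force search looks like this in miniature. A minimal sketch with three-dimensional toy vectors (real sentence embeddings have hundreds of dimensions, and the set would have 10⁸ entries, not three):

```python
# Minimal sketch of brute-force similarity search: one query forces a
# distance computation against every stored embedding. The tiny vectors
# stand in for real sentence embeddings.
import math

embeddings = [
    [-6.414142, -1.008358, -0.780210],
    [0.892059, -2.524359, 1.100000],
    [-6.400000, -1.000000, -0.800000],  # a near-duplicate of the first
]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query_index: int) -> int:
    """Linear scan: O(n) distances per query, O(n^2) for all pairs."""
    best, best_dist = -1, float("inf")
    for j, emb in enumerate(embeddings):
        if j == query_index:
            continue
        d = euclidean(embeddings[query_index], emb)
        if d < best_dist:
            best, best_dist = j, d
    return best

print(nearest(0))  # → 2 (the near-duplicate)
```

With n = 10⁸, each call to `nearest` is a hundred million distance computations; there is no index structure to shortcut it.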
But it is not a very enjoyable territory. It’s not that in the age of cheaper computing power we don’t care about efficient computation. Recently language modeling moved from recurrent network architectures to transformers because, among other reasons, the latter let us compute more in parallel, i.e. faster.
“Embedding” the sentences with some predesigned, discrete system would give us new possibilities. To adapt a theme from this Ling Space video, you could use predicate logic and say that Bite(x, y) is among the main claims of sentences like Sue returned from the ACL conference with vampire bite marks and A CIA spook returned disturbed after Chomsky bit him in a dark alley (as opposed to bit in It was a bit rainy during the tea party).
Bite(x, y) is then an index in your hash table (also known as a map or dictionary), which is the computer science equivalent of labeled drawers – as seen in the previous photo. You can retrieve all sentences with that specific Bite(x, y) claim from the hash table in constant time; they are guaranteed to be at least somewhat similar. If you want more, you can check which sentences are shared between the various drawers to which the sentence’s representation points, such as Bite(x, y) and Return(x). This isn’t free in computational terms either, but you’d still be working with smaller, more organized subsets of sentences.
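The “labeled drawers” can be sketched as an inverted index from discrete claims to sets of sentences. The claims and the first two sentences come from the example above; the third sentence (and the dict-of-sets layout itself) is just one hypothetical realization:

```python
# Sketch of the "labeled drawers": an inverted index from discrete semantic
# claims to the sentences carrying them. Bite(x, y) and Return(x) come from
# the text; the third sentence is a hypothetical extra entry.

index = {
    "Bite(x, y)": {
        "Sue returned from the ACL conference with vampire bite marks",
        "A CIA spook returned disturbed after Chomsky bit him in a dark alley",
    },
    "Return(x)": {
        "Sue returned from the ACL conference with vampire bite marks",
        "A CIA spook returned disturbed after Chomsky bit him in a dark alley",
        "The expedition returned empty-handed",  # hypothetical
    },
}

# Constant-time (amortized) retrieval of one drawer:
biters = index["Bite(x, y)"]

# Sentences shared between several drawers: intersect the sets they point to.
bite_and_return = index["Bite(x, y)"] & index["Return(x)"]
print(len(bite_and_return))  # → 2
```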
I suspect the main unpleasant surprise for engineers working with knowledge bases is that they don’t sufficiently cover the knowledge needed by their use case. Say you want to build a chatbot that signs people up for doctor visits. You discover that your knowledge base has information on various genera of tropical plants and the capitals of countries, but a weak understanding of how the various problems people report relate to specializations of doctors.
It may also have, for example, detailed descriptions of diseases and their corresponding prescribed medications. This is “specialized” knowledge, “similar” to what you do, but of limited use in reality: after all, the chatbot doesn’t diagnose, it books visits.
So I think the problem with knowledge bases is that they often build a (possibly rich) model of the world for its own sake. No living creature, for example, does that. Instead, even in human scientists, there is always a bias to process information in ways aligned with one’s goal, even if that goal is discovering new things in some field.
Employing such a tool may be very much like braving the jungle with a printed encyclopedia in ten volumes. You can barely carry it, it contains lots of information on things you don’t need, and on the topics you do need, it has but a few paragraphs. I believe that there is a place for knowledge bases written more like survival manuals: very concise, optimized, opportunistic.
This would mean treating your knowledge base a little like a program in a domain-specific declarative language. That is, you are still programming: but instead of telling the machine directly what it should do, you’re programming its knowledge, or “worldview”, upon which it will act.
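A minimal sketch of what that could look like, reusing the doctor-visit chatbot from earlier. Every mapping and name here is hypothetical; the point is only that the knowledge is written like a small declarative program, narrowly aimed at the system’s one goal, which the surrounding logic then interprets (with an explicit fallback for what it doesn’t cover):

```python
# "Programming the worldview": a tiny, goal-directed knowledge base for a
# doctor-visit chatbot. All mappings below are hypothetical illustrations.

SPECIALIST_FOR = {           # complaint keyword -> specialization
    "rash": "dermatologist",
    "toothache": "dentist",
    "blurry vision": "ophthalmologist",
}

def route(complaint: str, fallback: str = "general practitioner") -> str:
    """Interpret the declarative knowledge; fall back when it lacks coverage."""
    text = complaint.lower()
    for keyword, specialist in SPECIALIST_FOR.items():
        if keyword in text:
            return specialist
    return fallback

print(route("I have a rash on my arm"))  # → dermatologist
print(route("my elbow clicks"))          # → general practitioner
```

Nothing here describes tropical plants or drug prescriptions; the “worldview” contains exactly what the scheduling goal needs, plus an honest fallback.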