Appendix 1

Digging into Information Theory

A1.1. Shannon entropy¹

In the late 1940s, Claude Shannon was working on encoding problems at Bell Labs. More specifically, Shannon was interested in establishing reliable transmission of information between a sender S and a recipient R. Think of S as randomly generating messages composed of sequences of words. The fundamental question Shannon was trying to solve is: what is the optimal encoding that makes the encoded messages as short as possible and what is the average length of these encoded words?

The intuition behind the solution that he found is easy to grasp:

Frequent words should be encoded with short strings, whereas infrequent words can be encoded with longer strings.

In this section, we will show how the randomness of messages that S generates is actually related to the amount of information contained in those messages.

Remember that the three forms of information theory complexity that we review in this appendix are all defined as the value of information contained in binary strings once a goal has been specified for what to do with these strings. For the case at hand, Shannon entropy, the goal is to compress the random messages that S sends, as much as possible.

Let us be a little more precise here. Assume that the sender S uses only a finite number N of words, which we denote w_i, where the index i runs from 1 to N. Assume, moreover, that the words w_i are generated with probabilities p_i (numbers that sum up to 1). Let us then denote by l(w_i) the length of the encoded version of w_i. The average length of the encoded words sent by S is simply Σ_i p_i l(w_i). This is precisely the quantity that an optimal encoding should make as small as possible. Shannon was able to find such an optimal coding² (based on some mild technical assumptions about encodings). He also found an explicit formula (which we simply state here) for the optimal length, namely l(w_i) = log(1/p_i). As expected, small probabilities p_i lead to long code words.

Plugging these optimal lengths into the average length, we get Σ_i p_i log(1/p_i) for the average length of the optimally encoded words. Shannon called this quantity the entropy of the source S and denoted it H. Let us emphasize that H depends only on the probabilities p_i and not on the words w_i themselves. Following Shannon, we now argue that H is really a measure of the amount of disorder in the set of words produced by S.
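As an illustration, the following short Python sketch computes H for a given list of probabilities; the function name and the choice of base-2 logarithms are our own conventions, not something prescribed by the text:

import math

def shannon_entropy(probabilities):
    # H = sum_i p_i * log2(1/p_i), in bits; terms with p_i = 0 are skipped,
    # since p * log(1/p) tends to 0 as p tends to 0 (see footnote 3).
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)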

Before continuing, note that the quantity H = Σ_i p_i log(1/p_i) can be defined as soon as probabilities p_i of some events are available, independently of the existence of words w_i. The entropy concept thus has broader significance than just for coding issues. The coding interpretation is useful, however, because it clarifies the relation of H to other concepts such as information, which is our main concern in this book. A convincing connection of H with complexity will have to await the next section on K-complexity.

To get a better intuition of why H is indeed a measure of randomness, let us look at the possible outcomes of throwing a six-sided die, so that N = 6 here. If the die is fair, each side has the same probability p_i = 1/6 of appearing. The entropy H of this situation is readily computed: it equals log 6 ≈ 2.58. If, on the other hand, the die had a 6 on every face, we could be sure that a 6 comes up in any case. In this new situation, p_6 = 1, whereas p_1 = … = p_5 = 0. In this case, H is zero³. The former situation has a higher value of randomness, H = 2.58, than the latter, which has H = 0, just as we expect.
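Using the shannon_entropy sketch given above, the two die scenarios can be checked directly (again, purely illustrative code of our own):

fair_die = [1/6] * 6             # every face equally likely
loaded_die = [0, 0, 0, 0, 0, 1]  # a 6 on every face: the outcome is certain

print(shannon_entropy(fair_die))    # about 2.585, i.e. log2(6)
print(shannon_entropy(loaded_die))  # 0.0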

These simple observations actually generalize in a straightforward way. It can easily be proven that the maximal entropy is always reached when the p_i are all equal to 1/N; recall that N is the number of possible outcomes. These are the situations where we have strictly no clue about the outcomes of the probabilistic events. The entropy H in such cases is just log N. By contrast, in a situation where we know the outcome for sure, H is always zero.

There is a useful rephrasing of the above that relates the concept of entropy H to the perhaps more familiar concept of information. In a situation where H = 0, in other words, when there is no unpredictability at all, we really learn nothing: we already knew the result in advance! By contrast, when H is maximal, we have no prior knowledge of the outcome, and thus we really do gain information when we discover which event (or word) occurred. The information gained from a random source is thus greater the higher the unpredictability, or randomness, of that source.

The Shannon entropy H is a convenient and universal measure of this unpredictability of a random source. The interpretation of H as an average encoding length also shows that encoding highly random messages, or, equivalently, messages that convey much information, requires code words that are on average longer than for a non-random source.

It is more costly to encode information from a very unpredictable source than from a predictable one.

Innumerable uses have been made of the Shannon entropy in all areas of engineering and applied sciences, from biology to object-oriented programming. Too often, however, the need for a genuine probabilistic setting, beyond which H has no clear interpretation, has been overlooked.

A1.2. Shannon entropy in short

For further reference, let us summarize the most important properties of Shannon entropy:

– It measures the amount of randomness present in a system. This can be interpreted as some peculiar form of complexity as we shall justify more explicitly in the next section.

– The concept of entropy arises naturally when optimizing the encoding of random messages. The goal mentioned in the introduction is to compress as much as possible a random message for which the probabilities of words are known. The randomness of these messages (or complexity, or information) is high if their optimal compression is on average long.

– Shannon entropy is easy to compute, provided the probabilities p_i exist and are known.

We emphasize once more that we will not make any explicit use of the entropy formula H within IT but shall use these reflections as our first input for what complexity can mean in a context where randomness is present in a system. As we shall see shortly, it also provides a natural introduction for our next concept.

A1.3. Kolmogorov complexity

One of the most obvious shortcomings of the Shannon entropy H is that it only assigns a measure of randomness or complexity to a source S of random words w_i. What, however, if we are interested in the complexity of just one single word w_i, or of a single object? It turns out that another concept of complexity can be used for this purpose. This is precisely what Kolmogorov complexity achieves, and we now briefly introduce it on a heuristic level.

The idea that Kolmogorov proposed in the 1960s is again related to a concept of optimal compression. This time, however, we look at a single binary string or message s and ask: “What is the optimal compression of that single string?” We suppose we have some generic computing mechanism at hand, something like an abstract computer. Let us denote it by T and assume it can be programmed in the usual sense. Let us denote by p a program, thought of as a binary string, in any language understood by T. When p runs on T, it produces a result s that is itself a binary string. We write s = T(p). For a given computer T, the optimal compression of a string s can then be defined as the shortest program p_min that, when run on T, gives s as a result. In other words, s = T(p_min). The Kolmogorov complexity K(s) of this string s is then simply defined as the length of this optimal program p_min.

Thus, K(s) = length(p_min).

The K-complexity K(s) of a string s is the length of the shortest program p_min that produces it.
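In slightly more formal notation (our own phrasing of the standard definition, consistent with the description above), this reads:

K(s) = length(p_min) = min { length(p) : T(p) = s }.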

This new concept of complexity fits quite well in the general context where the measure of complexity is the value of information with respect to some goal. The goal here is simply: “compress the single string as much as possible”.

Let us illustrate this definition with three simple examples (assume they have comparable length, say one billion bytes):

[Figure not reproduced: the three example sequences are s_alternating (an alternating pattern such as 0101…01), s_prime_numbers (the prime numbers separated by zeros) and s_random_stuff (a completely random sequence).]

The sequence s_alternating is extremely simple in the sense that a very short program can produce it. The next sequence, s_prime_numbers, which lists the prime numbers separated by zeros, will need a slightly longer program. The last sequence, s_random_stuff, is completely random: the only program that produces it is an explicit print operation for that particular string, and the size of this program will be slightly above the length of the string, in this case one billion. We can thus classify the above sequences according to their Kolmogorov complexity:

[Figure not reproduced: the resulting ordering is K(s_alternating) < K(s_prime_numbers) < K(s_random_stuff).]
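To make this concrete, here is an illustrative Python sketch of the three kinds of generating programs; the variable names and the much shorter length (one million symbols instead of one billion, so that the code runs quickly) are our own choices. The point is only that the first two programs stay short however long the target string is, while the third must essentially embed the string itself:

import random

N = 1_000_000  # illustrative length; the text's example uses one billion

# s_alternating: a very short program suffices, whatever N is.
s_alternating = "01" * (N // 2)

# s_prime_numbers: a slightly longer program that lists primes separated by zeros.
def primes_up_to(limit):
    sieve = [True] * (limit + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            sieve[i*i::i] = [False] * len(sieve[i*i::i])
    return [n for n, is_prime in enumerate(sieve) if is_prime]

s_prime_numbers = "0".join(str(p) for p in primes_up_to(100_000))

# s_random_stuff: a stand-in for a truly random string; for a genuinely random
# string there is no shortcut, and the shortest program essentially embeds the data.
s_random_stuff = "".join(random.choice("01") for _ in range(N))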

As it stands, this definition would seem to entail a great deal of arbitrariness. Indeed, it looks as if K(s) depends strongly on both the machine T and the programming language we use for writing our shortest program p_min. Information theory teaches us, however, that this is not true. It was Turing who discovered, in the 1930s, that universal computing mechanisms exist. They are traditionally referred to as Turing machines and are, conceptually, just as powerful as any modern computer (they can perform the same tasks), hence the adjective “universal”. Our T is such a Turing machine. What makes K-complexity really meaningful is that it can be proved that K(s) does not depend very strongly⁴ on which Turing machine we choose in the definition. This might sound amazing or unintuitive, but it is nevertheless true and is the real foundation of algorithmic complexity theory.
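The precise statement behind this robustness, often called the invariance theorem (our phrasing; see also footnote 4), is that for any two universal Turing machines T1 and T2 there exists a constant c, independent of s, such that:

|K_T1(s) − K_T2(s)| ≤ c for every string s.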

On a metaphoric level, we can summarize the above as follows:

An object or a sequence s is K-complex if its shortest possible description is long.

A1.4. Choosing a scale of description

When examining the complexity of an object, an information system (IS), a molecule, or a living being, we claimed above that it is enough to look at its binary description s. However, for such a description to really make sense, we should first choose an appropriate scale of description for the object under consideration. Taking an example from IT, describing an IS at the level of computer chips is certainly not the same as describing the same IS at the level of the source code of the software, which is in turn different from describing its set of business objects and business processes. A fine-grained description s_fine typically contains much more information than a coarse-grained description s_coarse. The point here is that more information does not necessarily mean more useful information. A description s_fine could be useless because the information of interest to us is hidden in a huge amount of mostly irrelevant detail. Conversely, a description s_coarse could be useless because it is just too coarse!

These remarks point out the necessity of choosing an appropriate level of abstraction when describing objects or systems that involve different scales. ISs are definitely in this category. Choosing a level of abstraction implies, among other things, describing an object with just the information of interest for a particular purpose, forgetting smaller-scale details. This is a delicate issue when modeling ISs, with UML diagrams for instance. This topic is discussed in more depth in section 2.3.2.

K-complexity is an intrinsic attribute of a binary string. It is not, however, an intrinsic attribute of an object in the physical world, which requires choosing a scale for its description.

Thus, even though the K-complexity manages to remove the arbitrariness in the evaluation of complexity related to choosing a description mechanism (the T mentioned above), it does not remove the arbitrariness related to choosing an appropriate scale of description of an object.

A1.5. Relation to Shannon entropy

There is a simple relation between the Shannon entropy and the K-complexity that helps justify our previous claim that Shannon entropy is related to an idea of complexity, which is really our main concern.

Recall that the Shannon entropy H can only be defined when probabilities p_i are defined on a set of words or elementary messages w_i, whereas K-complexity measures complexity for each single w_i separately. It is thus hard to resist the temptation to ask: “What is the average of the individual K-complexities K(w_i) of those words w_i when the probabilities p_i are used as weights?” The answer is quite simple (provided the p_i are not too “complicated” in a precise sense): it is just the Shannon entropy H! In other words, the Shannon entropy is nothing but an averaged version of the K-complexity. The two concepts are thus compatible and complementary.
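Written out schematically (our own formulation; the exact error term involves technical conditions the text does not cover), this relation reads:

Σ_i p_i K(w_i) ≈ H + c,

where c is a constant that is small whenever the probability distribution p_i is itself simple to describe, and which does not depend on the individual words w_i.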

A1.6. Computing the Kolmogorov complexity

Now, assume we have an object for which we have chosen an appropriate description level and whose K-complexity we would like to compute. Is there any method or algorithm to compute K or, at least, an approximation thereof? Unfortunately, the prospects here are rather grim. As a matter of fact, information theory tells us that, except for exceptionally simple binary sequences s, the complexity K(s) cannot be computed by any algorithmic means⁵. In other words, there is no efficient way to compute the number K(s), where efficient means: computable in a predictable length of time. It is essential to understand that this is not due to some lack of skill or imagination on the part of mathematicians or engineers; it is an intrinsic property of questions related to finding optimal programs such as p_min. The lesson is that computing complexity is most often so hard that it is practically infeasible.

Facing such a reality, we might relax our requirements and ask whether we can at least find some decent approximation of K(s). The good news is: yes, it is indeed possible to find computable approximations of K(s). The bad news, however, is that it is impossible to know the quality of such approximations, making them of little practical use.
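For example, the length of a string compressed by an off-the-shelf compressor is a computable upper bound on K(s), up to the fixed size of the decompressor. The Python sketch below (our own illustration, not a method proposed in the text) shows such a proxy at work; crucially, nothing in it tells us how far the bound is from the true K(s):

import os
import zlib

def compressed_length(data: bytes) -> int:
    # A computable upper-bound proxy for K(data): the size of a zlib-compressed
    # version. How close it comes to the true K(data) is, in general, unknowable.
    return len(zlib.compress(data, 9))

print(compressed_length(b"01" * 500_000))        # very small: the string is highly compressible
print(compressed_length(os.urandom(1_000_000)))  # close to 1,000,000: random bytes barely compress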

This might seem somewhat depressing, but we cannot just ignore these facts. We draw the following conclusions:

– Even a concept of complexity defined using such a simple goal as the one mentioned above (compress s as much as possible) is extremely hard to compute. Unsurprisingly, things can only get worse when the goal becomes more realistic. The goal G mentioned earlier, related to evaluating compiled code, was such an example. Complexity is really an intrinsically difficult topic. Thus, any claim of an easy solution for computing or optimizing it is necessarily wrong.

– Even if the optimal or shortest description (pmin mentioned above) of an object (s mentioned above) is impossible to discover in a reasonable time, comparing two descriptions to find the shortest could be achievable and useful.

– One possible way out of the non-computability trap could be to abandon the quest for a universal concept of complexity altogether. As we just realized, such concepts are of low practical interest, except for supplying some healthy intuition. The contrapositive formulation is that useful or computable versions of complexity can only have limited meaning for some very specific types of objects or structures. Such examples were given in section 2.3.1.

A1.7. Kolmogorov complexity in short

Let us summarize what we have learned so far regarding K-complexity:

– K-complexity is a concept that applies to single binary strings describing any kind of object or system. K(s) is really an intrinsic property of s.

– In contrast with the Shannon entropy H, the K-complexity K(s) is most often impossible to evaluate, and even to approximate, efficiently. This difficulty is intrinsic to the problem at hand. No future advance in engineering science will ever change this situation.

– Describing an object or a system with a binary string entails choosing a scale of description, or an abstraction level, that keeps only relevant information. This arbitrariness is impossible to remove and must be taken seriously.

– The Shannon entropy can be seen as an averaged version of K-complexity.

Besides the already mentioned practical limitations to the K-complexity, there is another, more conceptual, problem we now briefly address.

A1.8. Bennett’s logical depth

Let us briefly come back to the example mentioned above and look once more at s_prime_numbers and s_random_stuff. Both sequences are complex and have a higher K-complexity than s_alternating. However, they are complex in very different ways: s_prime_numbers is complex because producing it requires some amount of calculation, namely computing a long sequence of prime numbers. By contrast, s_random_stuff is complex because there is no way to compress it. It is just irreducibly complicated.

Thus, we see that K-complexity can assign high complexity to random objects and low complexity to organized objects. Keeping ISs in mind, we now realize that, after all, K-complexity is perhaps not such a good measure of the kind of complexity we are interested in. We would like to distinguish complicated systems, which are systems that have no short description, from complex systems, which require a lot of effort to produce. The table below summarizes the main differences we expect between these two forms of complexity.

Complicated objects | Organized objects
Their description is long. | Their description could be either long or short.
Random objects are always complicated. | Random objects are not organized objects.
They are easy to produce. | They require design or computing to be produced.
They can be produced very quickly; they even occur spontaneously. | They grow slowly; they never occur spontaneously.

Is there a mathematical formalization for this kind of organized complexity? Indeed there is. It is called Bennett's logical depth, after its inventor. The associated theory is, however, less developed than that of K-complexity.

Bennett’s logical depth formalizes the idea that what determines the value of a system or some information is the cost of the computing effort needed to produce it. A goal appropriate for the question at hand would be something like: “find a description of s that requires as little computation as possible to produce s”.

The first attempt to define organized complexity is simply to mimic the definition of K-complexity, that is, to replace, in that definition, the shortest program p_min with the fastest program p_fast. Unfortunately, it turns out that this does not work, for two reasons. First, it is not very hard to prove that this definition leads to a measure of complexity that is simply proportional to the length of the binary string s (whose complexity we are trying to evaluate). This obviously is not very exciting. Second, no universality can be proved: in contrast with K-complexity, this definition does depend strongly on which Turing machine T we select.

An appropriate definition was proposed in the late 1970s by Charles H. Bennett. It is called the Bennett logical depth D(s) and is defined as the number of steps needed by the minimal program p_min, running on a Turing machine (the same as in K-complexity!), to produce s. The good news is that this definition can be proved to be very robust. First, D(s), like K(s) before, does not depend on T too much, making it an intrinsic property of s. Second, programs p that are not strictly minimal, but only close to p_min, are acceptable as well. To summarize:

An object has large Bennett logical depth if it encapsulates a great amount of computation.
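In slightly more formal terms (our own phrasing, consistent with the description above, where b denotes the small tolerance in program length that Bennett allows):

D(s) = number of computation steps taken by p_min on T to produce s,

or, in the tolerance-based variant, D_b(s) = min { steps_T(p) : T(p) = s and length(p) ≤ K(s) + b }.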

There is bad news too, unfortunately, which is quite similar to that for K-complexity. That is, D(s) is most often impossible to calculate practically and even to approximate efficiently.

When attempting to evaluate the depth D of an object or a system, rather than of a binary string s, we must choose a scale of description, just as for K-complexity.

To gain some intuition, let us look at some simple examples from the IT world, using the above-mentioned concepts as metaphors.

 | Low K-complexity (short description) | High K-complexity (long description)
Low Bennett depth (low design effort) | Standard pieces of code that are generated automatically; code for CRUD⁶ operations belongs here, for instance. | Databases with lots of unstructured information.
High Bennett depth (high design effort) | Simple business rules that need to be coded by hand because they are non-standard. | Large and well-structured databases; large and well-structured code.

ISs taken as a whole are obviously both deep and K-complex. Practical complexity metrics for ISs will only apply to some layers and to some limited aspects of enterprise architecture (see Appendix 2).

A1.9. Bennett’s logical depth in short

Let us summarize what we have learned about Bennett’s logical depth:

– Bennett’s logical depth is a robust formalization of organizational complexity in the sense that it measures the amount of computing required to construct an object starting from its optimally compressed description.

– It corresponds to a goal, which is: “Find a description that requires as little computation as possible to produce the object”.

– Just like K-complexity, it is hard to compute in practice and even hard to approximate efficiently.

 

 

1 Our presentation is inspired by the excellent book by J.-L. Delahaye [DEL 99], to which we refer the interested reader for more details.

2 The optimal coding is known as the Shannon-Fano code (see [GRU 04]).

3 Because 0 log 0 really equals 0 when things are carefully worked out.

4 What can be proved is that two different Turing machines T1 and T2 lead to two concepts of complexity K1(s) and K2(s) which differ only by a constant, independent of s; this is obviously of no importance provided we are interested in long sequences s. Kolmogorov complexity is therefore only an asymptotic concept.

5 For the mathematically inclined reader, this can be traced back to limitations implied by Gödel’s theorem.

6 CRUD is an acronym for Create, Read, Update and Delete, which are the basic data-manipulation operations on any set of data.
