CPS222 Lecture: Introduction to Trees and Forests      Last revised 1/18/2013

Objectives:

1. To define "tree" and "forest"
2. To introduce basic operations on trees (e.g. traversals)
3. To show how trees and forests can be represented as binary trees

Materials:

1. Excerpts from an "array of pointers to children" implementation (to project)
2. Excerpts from an "oldest child/next sibling" representation (to project)

I. Introduction
-  ------------

   A. Our discussion of data structures has focussed on sequential structures
      (arrays, stacks, queues, lists etc.).  Now we want to move to a
      consideration of branching structures, in which each element of the
      structure can have more than one "successor".

   B. The most general sort of branching structure is the graph, which we
      shall consider later.  First, though, we want to give considerable
      attention to a particularly useful class of branching structures: trees.

   C. Definition: A tree is a set of nodes, consisting of a special node -
      called the root - and 0 or more disjoint subsets, each of which is a
      tree.

      1. ex:
                      A
                    / | \
                   B  C   E
                      |  / \
                      D F   G
                            |
                            H

         - the set of nodes A .. H is a tree.  A is the root, and the
           subtrees are B, C .. D, and E .. H.
         - in the subtree B, B is the root and there are no subtrees.
         - in the subtree C .. D, C is the root and D is the subtree.  D in
           turn is the root of a tree with no subtrees.
         - in the subtree E .. H, E is the root, and F and G are the roots of
           two subtrees, one of which (F) has no subtrees of its own, and the
           other of which (G) has the subtree H.

      2. Note well the insistence that the subtrees be disjoint.  For example:

                      A
                     / \
                    B   C
                     \ / \
                      D   E

         is not a tree.

      3. This definition differs slightly from the one in the book - though
         it is basically saying the same thing.

         a. A tree cannot be empty - it must at least have a root node.

         b. The first of the two definitions in the book was in terms of the
            parent relationship, rather than subtree.  (But the book also
            gave a second definition like the one above.)

   D. Some terminology:

      1.
         Tree terminology is borrowed from two portions of the natural world:

         a. Wood type trees: we speak of the "root" of a tree and of its
            "leaves".  We have already defined the notion of "root" (but
            notice that we draw it on the top, not on the bottom!)  A leaf of
            a tree is the root of a (sub)tree that has no subtrees of its own.

         b. Genealogical trees (family trees):

            (1) If A is the root of a tree and B is the root of one of its
                subtrees, then we say that A is the "father" or "parent" of
                B, and B is the "son" or "child" of A.  In the above:

                - A is the parent of B, C, and E.  B, C, and E are children
                  of A.
                - C is the parent of D; D is the child of C.
                - E is the parent of F and G; F and G are children of E.

            (2) We can carry this further, speaking of A as the grandparent
                of D etc.  In general, we say that A is the "ancestor" of H
                and H is the "descendant" of A if H is in one of the subtrees
                of A.  In the example above, B, C, D, E, F, G, and H are all
                descendants of A.

            (3) If two nodes are the children of the same parent, we say that
                they are "brothers" or "siblings" or (sometimes) "twins".  In
                the above, B, C, and E are siblings, as are F and G.

            (4) We could go farther and use terms like "uncle" - but we
                seldom do.

      2. Additional terminology:

         a. The leaves of a tree are sometimes also called "external" or
            "terminal" nodes, and the non-leaf nodes can be called "internal"
            or "non-terminal" nodes.

         b. The "degree" of a node is the number of children it has.  (Note
            that we can then define a leaf as a node with degree 0.)  The
            degree of a tree is the maximum degree of any of its nodes.  In
            the above example, the degree of A is three - and this also
            happens to be the degree of the whole tree, since the next
            highest degree is two.  It need not always be the case that the
            root has the highest degree.

         c. A "path" from the root of a tree to a node is a sequence of nodes
            N_1 .. N_h such that N_1 is the root, N_h is the node in
            question, and N_i is the parent of N_(i+1) for all i,
            1 <= i < h.  The length of a path is the number of EDGES
            traversed - i.e. one less than the number of nodes on the path.

         d. The "depth" or "level" of a node can be defined as follows:

            - The depth (level) of a node is its distance from the root -
              the length of the path from the root to it.

            or - equivalently:

            - The depth (level) of the root of a tree is zero.
            - The depth (level) of any other node is 1 + the depth (level)
              of its parent.
            - In the above: A is at depth 0; B, C, and E at depth 1; D, F,
              and G at depth 2; and H at depth 3.

            But note: Some authors define the depth of the root of a tree to
            be 1, not 0.  The effect, in the above example, would be to make
            each value one greater.

         e. The "height" of a node is the length of the longest path from
            that node to a leaf.  This can be measured by counting nodes or
            edges - which leads to two answers that differ by 1.

            - If we count edges, then leaf nodes have height 0.
            - If we count nodes, leaf nodes have a height of 1.

            i.   In either case, the height of any other node is 1 + the
                 maximum of the heights of its children.  The height of a
                 tree is defined to be the height of the root.

            ii.  The book uses the "edges" form of definition, which leads to
                 a single node tree (just a root) having a height of 0.  The
                 "nodes" form of definition is more intuitive, I think.  For
                 example, a single node tree would have a height of 1.

            iii. I'll use the latter definition in subsequent lectures.

      3. In drawing our tree examples, there has been an implicit
         left-to-right ordering of the children of a given parent.  In an
         actual tree, this ordering may or may not be important.  An
         "ordered" tree is one in which there is such an ordering imposed on
         the children of the same parent; in an "unordered" tree, no such
         relationship exists.

         a. Note that any practical scheme for representing a tree imposes an
            order.

         b.
            In our further discussion, we will work with ordered trees unless
            we explicitly say otherwise - though most of what we say about
            ordered trees applies equally to unordered trees.

         c. Sometimes, when we are thinking of a tree as an ordered tree, we
            will say of two siblings that the first is "older" than the
            second if the first is to the left of the second in our drawing.
            We can then use the term "oldest child" to refer to the leftmost
            child of a node.

            Example: In the tree we have been using for examples, B is the
            oldest child of A, C the next oldest, and E the youngest.

   E. To further generalize, we can define the concept of a "forest" as a set
      of 0 or more disjoint trees.

      1. Example:

                   B  C   E
                      |  / \
                      D F   G
                            |
                            H

      2. Observe: we can convert a forest to a tree by adding a single node
         to serve as the root of a tree in which each of the original trees
         is a subtree:

         ex:
                      A
                    / | \
                   B  C   E
                      |  / \
                      D F   G
                            |
                            H

      3. Conversely, deleting the root from a tree leaves behind a forest
         consisting of its subtrees.  (Obviously, this is how we got our
         forest from our original tree.)

   F. In writing about trees, we can adopt one of several systems of
      notation:

      1. The graph-like drawings we have been using thus far.

      2. Indentation:

         ex: Our original tree:

                A
                    B
                    C
                        D
                    E
                        F
                        G
                            H

         ex: Our forest:

                B
                C
                    D
                E
                    F
                    G
                        H

      3. Parentheses.

         ex: our tree     A(B, C(D), E(F, G(H)))

   G. Some uses of trees: Observe that a tree is a fundamentally hierarchical
      structure.  Thus, a tree is appropriate to model any reality that
      exhibits hierarchy:

      1. File system directories are often tree-structured.

      2. Genealogical trees of all sorts: family relationships among
         individuals, tribes, languages etc.

      3. Classification systems:

         a. Taxonomic classification of plants and animals.

         b. Dewey decimal (or Library of Congress) classification of books.

      4. Breakdown of a manufactured product into subassemblies, each of
         which in turn consists of sub-subassemblies etc. down to the
         smallest components.

      5.
         Structure of a program - main routine is the root, procedures it
         contains are subtrees, each of which contains nested procedure
         definitions etc.

   H. Trees are also very useful for information storage and retrieval
      situations such as symbol tables, even though hierarchy may not be
      involved.

II. Operations on trees
--  ---------- -- -----

   A. As with any flexible data structure, there are many possible operations
      we could define on trees.  Certainly, we want a create operation - but
      note that there is no such thing as an empty tree!  So when we create a
      tree, we create a tree having at least one node - the root.

   B. The operation of insertion into a tree is certainly important, but
      depends heavily on the principle by which the nodes are organized.  We
      defer discussion of insertion and deletion to discussion of various
      special kinds of tree organized on various principles.

   C. One class of operations that can be defined for all kinds of tree is
      traversal.  By "traversal", we mean the act of systematically
      "visiting" all of the nodes to perform some operation on them:

      1. Printing out the contents of all of the nodes, or performing some
         other operation on all the nodes, involves a traversal.

      2. Unless the tree is ordered somehow on the basis of some key,
         searching for a node containing a given value would involve a
         traversal (though in practice trees that are to be searched are
         usually structured in such a way as to avoid this.)

   D. One issue that arises in connection with traversal is the order of
      traversal.  Two orders are of particular importance:

      1. Preorder traversal:

            Visit the root of the tree
            Traverse each subtree in turn in preorder

         Example on the above: A B C D E F G H

      2. Postorder traversal:

            Traverse each subtree in postorder
            Visit the root

         Example on the above: B D C F H G E A

   E. Of lesser importance is level order traversal: visit all the nodes on
      level zero, then all on level one etc.

      Example on the above: A B C E D F G H

   F.
      The above operations can be defined on a forest by mentally adding a
      root which is ignored when it comes time to visit it.

III. Representing Trees and Forests
---  ------------ ----- --- -------

   A. We have noted that a forest can be converted to a tree by adding a
      root.  Thus we focus on representing trees - to represent a forest,
      simply include a "root" as a header.

   B. One method is to use a linked representation in which each node
      contains pointers to its children.  This means that when we define the
      data type for a node, the degree of the tree determines the number of
      pointer fields needed.  Pointer fields in a given node that are not
      needed can be set to null.

      PROJECT: Array of pointers to children example - class Node

      1. Now, for example, we could implement operations on this tree as
         follows:

         a. preorder traversal:

            PROJECT: preorder

         b. postorder traversal could be written similarly.  What changes
            would be needed to turn the given preorder code into postorder?

            ASK

            - Change the name of the function!
            - Do the visit AFTER the recursive calls

         c. Reading a tree in from a text file.  Assume that the nodes of a
            tree have been written out, one node to a line, in pre-order.
            Assume each line contains the contents of the node and the
            number of its children.

            ex: The tree

                      A
                    / | \
                   B  C  D
                        / \
                       E   F

            would be stored as:

                A 3
                B 0
                C 0
                D 2
                E 0
                F 0

            PROJECT readTree code

      2. However, this representation runs into a severe efficiency problem
         if the degree of the tree is large.

         a. Thm: For a tree of degree d with n nodes, represented using the
            array of pointers to children representation, we will always
            have n*(d-1) + 1 NULL pointers stored in the nodes.

            Pf: Each of the n nodes has room for d pointers - or n*d
            pointers in all.  Each node (except the root) is pointed to by
            exactly one of these.  So n-1 pointers are used to point to
            other nodes, leaving n*d - (n-1) = n*(d-1) + 1 NULL.

         b. For example, for a tree of degree 10 with 100 nodes, we waste
            901 pointers.

   C.
      An alternate representation can be arrived at by using a linked list
      representation for the children of a node.

      1. Each node holds two pointers.  One points to its oldest child.  The
         other points to its next sibling (next younger node with the same
         parent.)

      2. Such a tree is actually a binary tree.  A binary tree is either
         empty, or it consists of a root and exactly two disjoint sets of
         nodes - designated the left subtree and the right subtree - each of
         which is a binary tree.  We will say more about binary trees in the
         next lecture - for now note that a binary tree is a different thing
         from a tree!

      3. The transformation from a general tree into an equivalent binary
         tree (oldest child/next sibling representation) can be done
         recursively, as follows:

         a. To transform a general tree rooted at a node A to its equivalent
            binary tree:

            - create a binary tree whose root is A.
            - transform the leftmost subtree of A in the general tree, and
              make this the left subtree of A in the binary tree.
            - transform the next sibling of A in the general tree, and make
              this the right subtree of A in the binary tree.

         b. ex: our original tree:

                      A
                     /
                    B
                     \
                      C
                     / \
                    D   E
                       /
                      F
                       \
                        G
                       /
                      H

         c. Note that you can visualize the shape of the original tree by
            mentally rotating the binary equivalent 45 degrees
            counterclockwise.

         d. The same method can be applied to a forest - the right subtree
            of the binary equivalent of the root of one of the trees is the
            transformed version of the next tree in the forest.  We can see
            what this would look like for our example forest by just
            deleting the A node from the above tree.

         PROJECT: Code for Oldest child/next sibling representation - NODE
                  class

      4. Note that this representation dramatically decreases the number of
         NULL pointers.  Using the same reasoning we used previously, an
         n-node tree has 2*n pointer fields, of which n-1 point to other
         nodes, leaving just n + 1 NULL pointers.

      5. Performing traversals on a general tree represented by an
         equivalent binary tree.

         a.
            Preorder traversal of the general tree is accomplished by
            preorder traversal of the transformed tree.

            ex: preorder traversal of the above binary tree:
                A B C D E F G H

            PROJECT: Code for preorder

         b. What about postorder traversal?  How would this be done?

            ASK

            i.   Postorder traversal of the general tree is accomplished by
                 INORDER traversal of the transformed tree.

                 Inorder traversal:

                    traverse the left subtree in inorder
                    visit the root
                    traverse the right subtree in inorder

            ii.  ex: the above: B D C F H G E A

            iii. This works because:

                 - The left subtree of any node in the transformed tree
                   contains all the nodes that were descendants of that
                   node in the original tree.  These should be visited
                   first.
                 - The right subtree of any node in the transformed tree
                   contains all the nodes that were right siblings (or
                   descendants thereof) of the node in the original tree.
                   These should be visited after the node.

            iv.  What would need to be done to change the example code for
                 preorder just projected to do this?

                 ASK

                 - Change the name
                 - Do the visit between subtrees

         c. Postorder traversal of the transformed tree has no relationship
            to any meaningful operation on the original tree.

         d. An equivalent to our readTree procedure defined above can also
            be written.

            PROJECT: Code for readTree
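
The projected excerpts are not included in these notes.  As a stand-in, the
following is a minimal C++ sketch of the oldest child/next sibling
representation and the two traversals discussed in III.C.5.  It is not the
course's Node class: the names Node, addChild, preorder, postorder,
countNullLinks and buildExampleTree are all hypothetical, chosen only for
illustration, and the sketch assumes single-character node contents and a
C++14 compiler.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical node type (an assumption, not the projected class).
// Each node holds exactly two links, as described in III.C.1.
struct Node {
    char label;
    std::unique_ptr<Node> oldestChild;   // "left" link of the binary tree
    std::unique_ptr<Node> nextSibling;   // "right" link of the binary tree
    explicit Node(char c) : label(c) {}
};

// Append a new youngest child to parent; return a pointer to the new node.
Node* addChild(Node* parent, char label) {
    assert(parent != nullptr);
    std::unique_ptr<Node>* link = &parent->oldestChild;
    while (*link) link = &(*link)->nextSibling;   // walk to end of sibling list
    *link = std::make_unique<Node>(label);
    return link->get();
}

// Preorder of the general tree == preorder of the binary tree.
void preorder(const Node* n, std::string& out) {
    if (n == nullptr) return;
    out += n->label;                       // visit first
    preorder(n->oldestChild.get(), out);   // then the descendants
    preorder(n->nextSibling.get(), out);   // then the younger siblings
}

// Postorder of the general tree == INORDER of the binary tree.
void postorder(const Node* n, std::string& out) {
    if (n == nullptr) return;
    postorder(n->oldestChild.get(), out);  // descendants first
    out += n->label;                       // visit between the two subtrees
    postorder(n->nextSibling.get(), out);  // younger siblings afterwards
}

// Count the null links; III.C.4 says an n-node tree has n + 1 of them.
int countNullLinks(const Node* n) {
    if (n == nullptr) return 1;
    return countNullLinks(n->oldestChild.get())
         + countNullLinks(n->nextSibling.get());
}

// Build the lecture's example tree A(B, C(D), E(F, G(H))).
std::unique_ptr<Node> buildExampleTree() {
    auto a = std::make_unique<Node>('A');
    addChild(a.get(), 'B');
    Node* c = addChild(a.get(), 'C');
    addChild(c, 'D');
    Node* e = addChild(a.get(), 'E');
    addChild(e, 'F');
    Node* g = addChild(e, 'G');
    addChild(g, 'H');
    return a;
}
```

On the tree returned by buildExampleTree(), preorder appends ABCDEFGH and
postorder appends BDCFHGEA to the output string, matching the orders worked
out by hand above, and countNullLinks returns 9 (that is, n + 1 for n = 8).
Note that the only difference between the two traversal functions is where
the visit occurs, which is the point of question III.C.5.b.iv.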