**CS542 Project Progress Report**

Li Chen Ying Fua

October 27, 1997

**Background**

**The R-Tree Structure**

R-trees [1] are an extension of B-trees that store multi-dimensional data. R-trees are dynamically balanced. In most R-tree variants, the minimum bounding rectangles (MBRs) of entries are allowed to overlap, which hurts query performance because a search may have to descend into several subtrees. To reduce this weakness, the R*-tree [11] introduces heuristics for dynamic tree adjustment and yields better query performance. R-trees can also be constructed bottom-up; one such approach is the packed R-tree based on the Hilbert curve transformation [18]. The Hilbert value is used to partially sort the data and to pack them into tree nodes. As a result, the node occupancy rate is maximized while the overlap between entry MBRs is kept small.
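The Hilbert-based packing idea described above can be sketched as follows. This is only a minimal illustration, not the algorithm of [18]: the grid resolution, the node capacity, and the names `hilbert_value`, `mbr` and `pack_level` are assumptions made for the example, and rectangles are taken to be `(xmin, ymin, xmax, ymax)` tuples.

```python
from typing import List, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def hilbert_value(n: int, x: int, y: int) -> int:
    """Position of grid cell (x, y) along the Hilbert curve of an n x n grid.

    n must be a power of two and 0 <= x, y < n.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the next finer level is oriented correctly.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def mbr(rects: Sequence[Rect]) -> Rect:
    """Minimum bounding rectangle of a non-empty group of rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def pack_level(rects: Sequence[Rect], capacity: int, grid: int = 1024) -> List[List[Rect]]:
    """Sort rectangles by the Hilbert value of their centers and fill nodes in order."""
    x0, y0, x1, y1 = mbr(rects)
    width = (x1 - x0) or 1.0   # avoid division by zero for degenerate inputs
    height = (y1 - y0) or 1.0

    def key(r: Rect) -> int:
        # Map the rectangle center onto the integer grid, then onto the curve.
        gx = min(grid - 1, int(((r[0] + r[2]) / 2 - x0) / width * grid))
        gy = min(grid - 1, int(((r[1] + r[3]) / 2 - y0) / height * grid))
        return hilbert_value(grid, gx, gy)

    ordered = sorted(rects, key=key)
    # Every node except possibly the last one is filled to capacity.
    return [ordered[i:i + capacity] for i in range(0, len(ordered), capacity)]
```

Applying `mbr` to each returned group gives the entry rectangles of the next level up; repeating `pack_level` on those rectangles would build the tree bottom-up.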

**About Bulk Incremental Update of R-tree**

Perhaps the most critical issue in a data warehouse environment is the time needed to generate and/or refresh the index structure from the raw data; the sheer size of the data does not permit frequent reconstruction. Moreover, creating a new R-tree index structure from scratch for each new input spatial data set is very wasteful. Hence the issue of maintaining the R-tree index structure efficiently arises, for which two postulates are made:

- Operations at record-level granularity on the R-tree are too expensive and might cause many splits in the original R-tree, thus destroying good clustering.
- Bulk incremental update is a promising solution. The cube tree paper [26] proposed a bulk incremental update algorithm that divides the problem into a sort phase and a merge-pack phase.

However, sorting could be the dominant cost factor in the above incremental computation. In addition, sorting multi-dimensional data (especially spatial data objects) according to a predefined global ordering would impose too strong a limitation on the way spatial clustering can be achieved. Therefore, we are trying to find an approach to bulk incremental update that is based solely on standard routines for inserting, splitting and merging R-trees.

**Problem Definition**

Our problem domain is defined as the bulk incremental update of an
R-tree using newly input skewed data sets. This assumption has a basis in
reality: measurements on real data sets (road intersections of U.S.
counties, star coordinates from NASA's Infrared-Ultraviolet Explorer, etc.)
provide evidence that real data are indeed skewed.
Moreover, since the original R-tree has undergone many insertion and
deletion operations, the newly input data sets would, at least in general,
be more clustered than the data in the original R-tree.

**Accomplished Tasks**

- Read and understood conference papers related to R-tree indexing methods. The detailed list can be found in Appendix A.
- Designed an algorithm that performs a bulk incremental update of an
R-tree by first constructing a small R-tree on the skewed
data sets and then inserting it into the original big R-tree. The
proposed algorithm makes some adjustments to obtain a balanced and
better-structured R-tree while maintaining the clustering of the skewed
data sets. The detailed algorithm is described in the following section.

**Proposed Algorithm**

Based on the problem definition, we propose building a small
R-tree for the new input skewed data sets and then fitting it into the original
big R-tree with some slight adjustments. In this way, we hope to obtain
a well-structured resultant R-tree at minimum cost while preserving the natural
clustering of the new input skewed data. Below is a rough
description of the algorithm for fitting a new small
R-tree into a big R-tree.

Let *h_s* be the height of the small R-tree and *h_o* the height of the original R-tree.
We treat the root rectangle of the small R-tree (the enclosing rectangle of all
new data rectangles) as a data rectangle and try to insert it at the level l
of the original R-tree at which the bottom level of the small R-tree lines up
with the bottom level of the original R-tree.
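The level alignment above can be made concrete with a small sketch. The numbering convention used here is an assumption, not something the report fixes: leaves sit at level 0, a tree of height h has its root node at level h - 1, and an entry stored in a node at level k points to a subtree whose root node is at level k - 1. The function name `insertion_level` is hypothetical.

```python
from typing import Tuple


def insertion_level(h_small: int, h_orig: int) -> Tuple[int, int]:
    """Return (level counted from the leaves, depth counted from the root)
    of the original-tree node that should receive the small tree's root entry,
    so that both trees end up with their leaves on the same level.
    """
    if h_small < 1 or h_orig < 1:
        raise ValueError("heights must be at least 1")
    if h_small >= h_orig:
        # Growing the original tree is outside the scope of this sketch.
        raise ValueError("the small tree must be shorter than the original tree")
    level_from_leaves = h_small            # the small root node itself sits at h_small - 1
    depth_from_root = (h_orig - 1) - level_from_leaves
    return level_from_leaves, depth_from_root
```

For example, with a small tree of height 2 and an original tree of height 4, the root entry of the small tree goes into a node one step below the root of the original tree.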

**ALGORITHM InsertSkewedDataset**

- I1: Invoke ChooseSubtree, with the level l as a parameter, to find an appropriate node N in which to place the new entry E (the root rectangle of the small R-tree).
- I2: If N has fewer than M entries, where M is the maximum number of entries per node, accommodate E in N. If N has M entries,
invoke OverflowTreatment with the level l.

**ALGORITHM OverflowTreatment**

- O1: If the level is not the root level and this is the first call of OverflowTreatment in the given level, invoke ReInsert;
- O2: else invoke Split
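The control flow of steps I2, O1 and O2 can be sketched as below. This is a hedged illustration only: the `Node` class, the capacity `M = 4`, the string return values, and the `reinsert` / `split` callbacks are placeholders invented for the example, and ChooseSubtree (step I1) is assumed to have already selected `node`. The set `seen_levels` records the levels at which OverflowTreatment has already been called during the current insertion.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set

M = 4  # assumed maximum number of entries per node


@dataclass
class Node:
    level: int                                   # 0 = leaf level (assumed convention)
    entries: List[object] = field(default_factory=list)
    is_root: bool = False


def overflow_treatment(node: Node, seen_levels: Set[int],
                       reinsert: Callable[[Node], None],
                       split: Callable[[Node], None]) -> str:
    # O1: first overflow at this non-root level during the current insertion.
    if not node.is_root and node.level not in seen_levels:
        seen_levels.add(node.level)
        reinsert(node)
        return "reinsert"
    # O2: otherwise split the node.
    split(node)
    return "split"


def insert_entry(node: Node, entry: object, seen_levels: Set[int],
                 reinsert: Callable[[Node], None],
                 split: Callable[[Node], None]) -> str:
    # I2: store the entry if there is room, otherwise treat the overflow.
    if len(node.entries) < M:
        node.entries.append(entry)
        return "stored"
    return overflow_treatment(node, seen_levels, reinsert, split)
```

Passing ReInsert and Split in as callbacks keeps the dispatch logic separate from the adjustment strategies, which are the part of the algorithm still under design.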

**ALGORITHM ReInsert**

Since the small R-tree is generally more skewed, clustered and packed, we
do not want to decompose it into subtrees and fit them one at a time into
the original R-tree. Instead, we would like to use the methods described below
to adjust other subtrees and keep the new R-tree untouched.

**(1)** Ideal Case

The most suitable level in the original R-tree for insertion of the small R-tree has an entry slot for the root of the small R-tree.

**(2)** Merge Siblings

If there is no entry slot for the root of the small R-tree, we try to merge sibling nodes in order to free an entry slot for it.

*Option 1*

Merge the two closest nodes that have relatively low occupancy.

*Option 2*

Merge *i*+1 nodes into *i* nodes after sorting them by some Hilbert value.

*Options 1 and 2* can both be treated as merges between multiple nodes. If the merge fails, that is, the *i*+1 nodes hold more entries than *i* nodes are allowed to hold, redistribute the entries so that some nodes attract more entries. After redistribution among the (*i*+1) nodes, one node has the lowest occupancy, so it can be deleted and its entries re-inserted from the root node.

*Option 3*

Select a candidate node with relatively low occupancy and perform a merge bottom-up, hoping that the resultant parent node has very few children. In that case, we could delete it to make space for the small R-tree and re-insert its children from the root node.
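The feasibility test behind merging *i*+1 sibling nodes into *i* nodes can be sketched as follows. This is a simplified illustration under stated assumptions: nodes are plain lists of entries, `sort_key` stands in for whatever Hilbert-style ordering is chosen, minimum-fill constraints are ignored, and the name `merge_siblings` is hypothetical. When the entries do not fit, the function returns `None`, which corresponds to the case where redistribution or another option has to be tried.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def merge_siblings(nodes: Sequence[Sequence[T]], max_entries: int,
                   sort_key: Callable[[T], float]) -> Optional[List[List[T]]]:
    """Try to repack the entries of len(nodes) siblings into one node fewer."""
    target = len(nodes) - 1
    entries = [e for node in nodes for e in node]
    # The merge fails if the entries exceed what the remaining nodes can hold.
    if target < 1 or len(entries) > target * max_entries:
        return None
    entries.sort(key=sort_key)
    # Spread the sorted entries as evenly as possible over the target nodes.
    base, extra = divmod(len(entries), target)
    merged, start = [], 0
    for k in range(target):
        size = base + (1 if k < extra else 0)
        merged.append(entries[start:start + size])
        start += size
    return merged
```

A successful merge frees exactly one entry slot in the parent, which is the slot the root of the small R-tree needs.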

*Comparisons between the 3 options*

- Option 1 and 2: *Option 1* may not be as stable as *Option 2* because it may not merge the two most suitable nodes.
- Option 1/2 and 3: *Option 3* might be very expensive, depending on the number of children that need to be re-inserted and on the cost of merging bottom-up.

Overall, *Option 3* might be too expensive and hence less favorable than *Options 1 and 2*.
**(3)** Forced Re-insert

If merging does not produce the expected result, we choose one sibling node to delete and re-insert, hoping to make way for the small R-tree. We can choose the candidate sibling node by considering the distance of each child node from the center of its enclosing rectangle, and select the one that is furthest away.

**(4)** Split

If all of the above attempts fail, split the parent node to create a new entry slot for the root of the small R-tree.
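The candidate selection for the forced re-insert step can be sketched as below. This is one possible reading of the distance criterion, offered as an assumption: each child is represented by its MBR as an `(xmin, ymin, xmax, ymax)` tuple, distances are measured between MBR centers and the center of the rectangle enclosing all children, and the name `pick_reinsert_candidate` is hypothetical.

```python
import math
from typing import Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def center(r: Rect) -> Tuple[float, float]:
    return ((r[0] + r[2]) / 2, (r[1] + r[3]) / 2)


def enclosing(rects: Sequence[Rect]) -> Rect:
    """Rectangle enclosing all child MBRs of a parent node."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def pick_reinsert_candidate(child_mbrs: Sequence[Rect]) -> int:
    """Index of the child whose center lies furthest from the parent's center."""
    cx, cy = center(enclosing(child_mbrs))

    def distance(r: Rect) -> float:
        x, y = center(r)
        return math.hypot(x - cx, y - cy)

    # On ties, max() keeps the first child with the largest distance.
    return max(range(len(child_mbrs)), key=lambda i: distance(child_mbrs[i]))
```

Removing the furthest child tends to shrink the parent's enclosing rectangle the most, which is why it is a natural candidate to give up its slot.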

**Tasks At Hand**

- (1) Look into the packing code obtained via ftp from the University of Michigan.
- (2) Construct an R-tree and build up some datasets.
- (3) Implement the proposed algorithm and satisfy the cases below in that order.
  - (a) Ideal Case: The most suitable level in the tree for insertion of the small R-tree has an entry slot for it.
  - (b) Merge Siblings: Sibling nodes are merged in order to accommodate the small R-tree.
  - (c) "Forced Re-insert": Delete a sibling node to make way for the small R-tree and re-insert its orphaned entries by traversal from the root.
  - (d) Split: This is the worst-case scenario, whereby a split of a node is required due to an overflow.
- (4) Perform comparisons between the proposed algorithm and existing algorithms that manipulate the R-tree, based on insertion cost and the resultant R-tree structure.

**Weekly Schedule**

| Week | Tasks |
|------|-------|
| 1 | Understand the packing code. |
| 2-3 | Implement simple structures and construct a simple R-tree. |
| 4 | Build up datasets and refinement/implementation of the proposed algorithm. |
| 5 | Implement the proposed algorithm and perform comparisons, if time permits. |
| 6 | Writeup of project report. |

**Appendix A**

**[1]**- Antonin Guttman: R-Trees: A Dynamic Index Structure for Spatial
Searching. SIGMOD Conference 1984: 47-57
**[2]**- Nick Roussopoulos, Daniel Leifker: Direct Spatial Search on Pictorial
Databases Using Packed R-Trees. SIGMOD Conference 1985: 17-31
**[3]**- Scott T. Leutenegger, J. M. Edgington, Mario A. Lopez: STR: A Simple
and Efficient Algorithm for R-Tree Packing. ICDE 1997: 497-506
**[4]**- Apostolos Papadopoulos, Yannis Manolopoulos: Performance of Nearest
Neighbor Queries in R-Trees. ICDT 1997: 394-408
**[5]**- Scott T. Leutenegger, Mario A. Lopez: A Buffer Model for Evaluating
the Performance of R-Tree Packing Algorithms. SIGMETRICS 1996: 264-265
**[6]**- Thomas Brinkhoff, Hans-Peter Kriegel, Bernhard Seeger: Efficient
Processing of Spatial Joins Using R-Trees. SIGMOD Conference 1993: 237-246
**[7]**- Vincent Ng, Tiko Kameda: Concurrent Access to R-Trees. SSD 1993: 142-161
**[8]**- Vincent Ng, Tiko Kameda: The R-Link Tree: A Recoverable Index Structure
for Spatial Data. DEXA 1994: 163-172
**[9]**- Marcel Kornacker, Douglas Banks: High-Concurrency Locking in R-Trees.
VLDB 1995: 134-145
**[10]**- Thomas Brinkhoff, Hans-Peter Kriegel, Bernhard Seeger: Efficient Processing
of Spatial Joins Using R-Trees. SIGMOD Conference 1993: 237-246
**[11]**- Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, Bernhard Seeger:
The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles.
SIGMOD Conference 1990: 322-331
**[12]**- Andreas Hutflesz, Hans-Werner Six, Peter Widmayer: The R-File: An Efficient
Access Structure for Proximity Queries. ICDE 1990: 372-379
**[13]**- Raymond T. Ng, Jiawei Han: Efficient and Effective Clustering Methods for
Spatial Data Mining. VLDB 1994: 144-155
**[14]**- Hanan Samet: The Design and Analysis of Spatial Data Structures.
Addison-Wesley 1990
**[15]**- Timos K. Sellis, Nick Roussopoulos, Christos Faloutsos: The R+-Tree: A
Dynamic Index for Multi-Dimensional Objects. VLDB 1987: 507-518
**[16]**- Ibrahim Kamel, Christos Faloutsos: On Packing R-trees. CIKM 1993: 490-499
**[17]**- Ibrahim Kamel, Christos Faloutsos: Parallel R-trees. SIGMOD Conference
1992: 195-204
**[18]**- Ibrahim Kamel, Christos Faloutsos: Hilbert R-tree: An Improved R-tree using
Fractals. VLDB 1994: 500-509
**[19]**- Christos Faloutsos, Ibrahim Kamel: High Performance R-trees. Data Engineering
Bulletin 16(3): 28-33(1993)
**[20]**- Christos Faloutsos, Ibrahim Kamel: Beyond Uniformity and Independence:
Analysis of R-trees Using the Concept of Fractal Dimension. PODS 1994: 4-13
**[21]**- Yannis Theodoridis, Timos K. Sellis: Optimization Issues in R-tree
Construction (Extended Abstract). IGIS 1994: 270-273
**[22]**- Vincent Ng, Tiko Kameda: The R-Link Tree: A Recoverable Index Structure for
Spatial Data. DEXA 1994: 163-172
**[23]**- Dimitris Papadias, Yannis Theodoridis, Timos K. Sellis: The Retrieval of
Direction Relations using R-trees. DEXA 1994: 173-182
**[24]**- Dimitris Papadias, Yannis Theodoridis, Timos K. Sellis, Max J. Egenhofer:
Topological Relations in the World of Minimum Bounding Rectangles: A Study with
R-trees. SIGMOD Conference 1995: 92-103
**[25]**- Joseph M. Hellerstein, Jeffrey F. Naughton, Avi Pfeffer: Generalized Search
Trees for Database Systems. VLDB 1995: 562-573
**[26]**- Nick Roussopoulos, Yannis Kotidis, Mema Roussopoulos:
Cubetree: Organization of and Bulk Incremental Updates on the Data Cube.
SIGMOD Conference 1997: 89-99

Wed Oct 29 23:03:34 EST 1997