No Title

CS542 Project Progress Report
Li Chen Ying Fua
October 27, 1997

Background

The R-Tree Structure

R-trees [1] are an extension of B-trees that store multi-dimensional data. R-trees are dynamically balanced. In most R-tree variant, entry MBRs are allowed to overlap. To improve this weakness, an R*-tree [11] is proposed that introduces heuristics in dynamic tree adjustment and yields a better query performance. R-trees are constructed in a bottom-up approach called the packed R-tree based on the Hilbert curve transformation [18]. The Hilbert value is used to partially sort the data and to pack them into tree nodes. As a result, the node occupancy rate is maximized whereas the overlap between entry MBRs is minimized.

About Bulk Incremental Update of R-tree

Perhaps the most critical issues in data warehouse environment is the time to generate and/or refresh its data index structure from the raw data, and the mere size of it does not permit frequent re-construction. Moreover, creating a new R-Tree index structure for new input spatial data sets from scratch is very wasteful. Hence, arise the issues on maintenance of R-Tree index structure efficiently whereby two postulates are made:

Record-level granularity operations on R-Tree are too expensive and might cause many splitting on original R-Tree and thus destroy good clustering.
Bulk incremental update is a promising solution. The paper on cube tree [26] proposed a bulk incremental update algorithm by dividing the problem into a sort phase and a merge-pack phase.

However, sorting could be the dominant cost factor in the above incremental computation. In addition, sorting multi-dimensional data ( especially spatial data object ) according to a predefined global ordering would impose too strong a limitation on the way spatial clustering can be achieved. Therefore, we are trying to find an approach on bulk incremental update which are solely based on standard routines for inserting, splitting and merging of R-trees.

Problem Definition

Our problem domain is defined as the bulk incremental update of R-Tree using newly input skewed data sets. This assumption has its reality basis: Using measurements on real data sets (road intersections of U.S. counties, star coordinates from NASA's Infrared-Ultraviolet Explorer etc.). We can provide evidence that real data are indeed skewed. Morever, since the original R-Tree has experienced many insertion and deletion operations on it, the newly input datasets would be more clustered than the data in the original R-Tree, at least in general cases.

Accomplished Tasks

Read and understand conference papers related to the R-trees indexing methods. The detailed list can be found in Appendix A.
An algorithm that does bulk incremental update of an R-tree by first constructing a small R-tree on the skewed datasets, and insert it into the original big R-tree. The proposed algorithm makes some adjustments to get a balanced and better-structured R-tree while maintaining the clustering of the skewed datasets. The detailed algorithm is described in the following section.

Proposed Algorithm

Based on the problem definition, we come up with an idea of building a small R-Tree for the new input skewed data sets, then fit it into the original big R-Tree with some slight adjustment. By this way, we hope to achieve a good structured resultant R-Tree while preserving the natural clustering of the new input skewed data with minimum cost. Below is a rough description of the algorithm of how to fit a new small rtree into a big Rtree.

The height of small rtree is ; the height of original rtree is ; We consider the root rectangle of the small rtree ( enclosing rectangle of all new data rectangles ) as a data rectangle, and try to insert it to the level of the original rtree, so the bottom level of the small rtree is on the same level as the original rtree.

ALGORITHM InsertSkewedDataset

I1: Invoke ChooseSubtree, with the level l as a parameter, to find an appropriate node N, in which to place the new entry E.
I2: If N has less than M entriies, accomodate E in N. If N has M entries, invoke OverflowTreatment with the level l.

ALGORITHM OverflowTreatment

O1: If the level is not the root level and this is the first call of OverflowTreatment in the given level, invoke ReInsert;
O2: else invoke Split

ALGORITHM ReInsert

Since the small rtree is generally more skewed, clustered and packed, we do not want to decompose it into subtrees and fit them one at a time into the original rtree. Instead, we would like to use some methods ( described below ) to adjust other subtrees and keep the new rtree untouched.

(1)

Ideal Case
The most suitable level in the oiginal rtree for insertion of the small rtree has an entry slot for the root of the small rtree.

(2)

Merge Siblings
If there is no entry slot for the root of small rtree, we try to merge the sibling nodes in order to leave an entry slot for it.

Option 1
Merge between two closest nodes with relatively low capacity.

Option 2
Merge i+1 nodes into i nodes after sorting them by some hilbert value.

For Option 1 and 2, we can treat them as merging between multiple nodes. If the merge fails, that is the entries in i+1 nodes > max entries allowed in i nodes, redistribute them for some nodes to attract more entries.

Merging the (i+1) nodes, the node would have the lowest capacity, so it could be deleted and re-insert from the root node.

Option 3
Select a candidate node with a relatively low capacity and perform a merge bottom-up, hoping that the resultant parent node would have very few children. In this case, we could delete it and make space for the small rtree, and re-insert its children from the root-node.

Comparisons between the 3 options

Option 1 and 2
Option 1 may not be as stable as Option 2 because we may not merge the 2 most optimal nodes.
Option 1/2 and 3
Option 3 might be very expensive depending on the number of children that needs to be re-inserted and also the merging cost from bottom-up.

Overall, Option 3 might be too expensive and hence not so favorable.

(3)

Forced Re-insert
If no expected result is obtained in the merging step, we choose one sibling node to delete and re-insert, hoping to give way to the small rtree. We can choose the candidate sibling node by considering the distance between each children node from the center of its enclosing rectangle, and select the one that has the furthest distance.

(4)

Split
If all the above tries fail, split the parent node to create new entry slot for them.

Tasks At Hand

(1)

Look into the packing code ftp from the University of Michigan.

(2)

Construct an R-tree and build up some datasets.

(3)

Implement the proposed algorithm and satisfy the cases below in that order.

(a)

Ideal Case
The most suitable level in the tree for insertion of the small r-tree has an entry slot for it.

(b)

Merge Siblings
Siblings nodes are merged in order to accomodate the small r-tree.

(c)

"Forced Re-insert"
Delete a sibling node to give way to the small r-tree and re-insert its orphaned entries by traversal from the root.

(d)

Split.
This is the worst case scenario whereby a split of a node is required due to an overflow.

(4)

Perform comparisons on the proposed algorithm with existing algorithm that manipulates the R-tree based on insertion cost and the resultant R-tree structure.

Weekly Schedule

Weekly Schedule

Week Tasks

1 Understand the packing code.

2-3 Implement simple structures and construct a simple R-tree.

4 Build up datasets and refinement/implementation of the proposed algorithm.

5 Implement the proposed algorithm and perform comparisons, if time permits.

6 Writeup of project report.

Appendix A

[1]: Antodgrsin Guttman: R-Trees: A Dynamic Index Structure for Spatial Searching. SIGMOD Conference 1984: 47-57
[2]: Nick Roussopoulos, Daniel Leifker: Direct Spatial Search on Pictorial Databases Using Packed R-Trees. SIGMOD Conference 1985: 17-31
[3]: Scott T. Leutenegger, J. M. Edgington, Mario A. Lopez: STR: A Simple and Efficient Algorithm for R-Tree Packing. ICDE 1997: 497-506
[4]: Apostolos Papadopou dgrslos, Yannis Manolopoulos: Performance of Nearest Neighbor Queries in R-Trees. ICDT 1997: 394-408
[5]: Scott T. Leutenegger, Mario A. Lopez: A Buffer Model for Evaluating the Performance of R-Tree Packing Algorithms. SIGMETRICS 1996: 264-265
[6]: Thomas Brinkhoff, Hans-Peter Kriegel, Bernhard Seeger: Efficient Processing of Spatial Joins Using R-Trees. SIGMOD Conference 1993: 237-246
[7]: Vincent Ng, Tiko Kameda: Concurrent Access to R-Trees. SSD 1993: 142-161
[8]: Vincent Ng, Tiko Kameda: The R-Link Tree: A Recoverable Index Structure of Data. DEXA 1994: 163-172
[9]: Marcel Kornacker, Douglas Banks: High-Concurrency Locking in R-Trees. VLDB 1995: 134-145
[10]: Thomas Brinkhoff, Hans-Peter Kriegel, Bernhard Seeger: Efficient Processing of Spatial Joins Using R-Trees. SIGMOD Conference 1993: 237-246
[11]: Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, Bernhard Seeger: The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles. SIGMOD Conference 1990: 322-331
[12]: Andreas Hutflesz, Hans-Werner Six, Peter Widmayer: The R-File: An Efficient Access Structure for Proximity Queries. ICDE 1990: 372-379
[13]: Raymond T. Ng, Jiawei Han: Efficient and Effective Clustering Methods for Spatial Data Mining. VLDB 1994: 144-155
[14]: Hanan Samet: The Design and Analysis of Spatial Data Structures. Addison-Wesley 1990
[15]: Timos K. Sellis, Nick Roussopoulos, Christos Faloutsos: The R+-Tree: A Dynamic Index for Multi-Dimensional Objects. VLDB 1987: 507-518
[16]: Ibrahim Kamel, Christos Faloutsos: On Packing R-trees. CIKM 1993: 490-499
[17]: Ibrahim Kamel, Christos Faloutsos: Parallel R-trees. SIGMOD Conference 1992: 195-204
[18]: Ibrahim Kamel, Christos Faloutsos: Hilbert R-tree: An Improved R-tree using Fractals. VLDB 1994: 500-509
[19]: Christos Faloutsos, Ibrahim Kamel: High Performance R-trees. Data Engineering Bulletin 16(3): 28-33(1993)
[20]: Christos Faloutsos, Ibrahim Kamel: Beyond Uniformity and Independence: Analysis of R-trees Using the Concept of Fractal Dimension. PODS 1994: 4-13
[21]: Yannis Theodoridis, Timos K. Sellis: Optimization Issues in R-tree Construction (Extended Abstract). IGIS 1994: 270-273
[22]: Vincent Ng, Tiko Kameda: The R-Link Tree: A Recoverable Index Structure for Spatial Data. DEXA 1994: 163-172
[23]: Dimitris Papadias, Yannis Theodoridis, Timos K. Sellis: The Retrieval of Direction Relations using R-trees. DEXA 1994: 173-182
[24]: Dimitris Papadias, Yannis Theodoridis, Timos K. Sellis, Max J. Egenhofer: Topological Relations in the World of Minimum Bounding Rectangles: A Study with R-trees. SIGMOD Conference 1995: 92-103
[25]: Joseph M. Hellerstein, Jeffrey F. Naughton, Avi Pfeffer: Generalized Search Trees for Database Systems. VLDB 1995: 562-573
[26]: Nick Roussopoulos, Yannis Kotidis, Mema Roussopoulos: Cubetree: Organization of and Bulk Incremental Updates on the Data Cube. SIGMOD Conference 1997: 89-99

Ah...bach!
Wed Oct 29 23:03:34 EST 1997

Weekly Schedule
Week	Tasks
1	Understand the packing code.
2-3	Implement simple structures and construct a simple R-tree.
4	Build up datasets and refinement/implementation of the proposed algorithm.
5	Implement the proposed algorithm and perform comparisons, if time permits.
6	Writeup of project report.