QUADT
Required Arguments
X ‑ NROW by NCOL matrix containing the data to be used on this call. (Input/Output)
On output the rows of X have been rearranged to form the k-d tree. X must not contain missing values (NaN).
IND ‑ Vector of length NVAR containing the column numbers in X to be used in the forming the k-d tree. (Input)
NBUCK ‑ Bucket size. (Input)
NBUCK gives the maximum number of observations in a leaf of the k-d tree. NBUCK = 3 is a common choice. NBUCK should be small when compared to NROW.
IDISCR ‑ Vector of length NROW containing the element number in IND that points to the column of X to be used as the discriminator in the k-d tree. (Output)
IDISCR(I) = 0 if the observation is a terminal node. IND(IDISCR(I)) is the column number in X to be used as the discriminator.
PART ‑ Vector of length NROW containing the value to be used in the partition for this observation. (Output)
Optional Arguments
NROW ‑ Number of rows of X to be used in forming the k-d tree. (Input)
Default: NROW = size (X,1).
NVAR ‑ Number of variables to be used in forming the tree. (Input)
Default: NVAR = size (IND,1).
NCOL ‑ Number of columns in X. (Input)
Default: NCOL = size (X,2).
LDX ‑ Leading dimension of X exactly as specified in the dimension statement in the calling program. (Input)
Default: LDX = size (X,1).
FORTRAN 90 Interface
Generic: CALL QUADT (X, IND, NBUCK, IDISCR, PART [, …])
Specific: The specific interface names are S_QUADT and D_QUADT.
FORTRAN 77 Interface
Single: CALL QUADT (NROW, NVAR, NCOL, X, LDX, IND, NBUCK, IDISCR, PART)
Double: The double precision name is DQUADT.
Description
Routine QUADT creates the data structure required for a k-d tree. A k-d tree is a multivariate form of B‑tree that is especially useful for finding nearest neighbors but may be of use in other situations. Once the k-d tree has been formed, routine NGHBR may be used to find the nearest neighbors for any point in logarithmic time.
The basic algorithm is given by Friedman, Bentley, and Finkel (1977) and can be summarized as follows:
1. Let l = 1 and h = NROW.
2. Let k = (l + h)/2.
3. Each column in X to be used in forming the k-d tree is examined over the range [l, h] in order to find the column with the maximum spread. Let j equal this column number.
4. The k‑th element of PART is set to the median value in the range [l, h] of the j‑th column in X while IDISCR(k) is set to the element in IND that points to this column.
5. The rows of X are interchanged so that all rows of X with values in column j less than or equal to the median value computed in Step 4 occur before (or at) the k‑th element.
6. Go to Step 2 repeatedly with zero, one, or two submatrices in X. Go to Step 2 with the submatrix formed from rows l to k of X if k ‑ l is greater than NBUCK. Go to Step 2 with the submatrix formed from rows k + 1 to h of X if h ‑ k ‑ 1 is greater than NBUCK.
The bucket size, NBUCK, is the maximum number of observations allowed in the lowest level of the k-d tree, i.e., in the leaves of the tree. The choice of NBUCK can affect the speed with which nearest neighbors are found. A value of 3 or 5 is a common choice, but if the number of nearest neighbors to be obtained is large, a larger value for NBUCK should probably be used.
Comments
Workspace may be explicitly provided, if desired, by use of Q2ADT/DQ2ADT. The reference is:
CALL Q2ADT (NROW, NVAR, NCOL, X, LDX, IND, NBUCK, IDISCR, PART, ILOW, IHIGH, WK, IWK)
The additional arguments are as follows:
ILOW ‑ Work vector of length log2 (NROW) + 3.
IHIGH ‑ Work vector of length log2 (NROW) + 3.
WK ‑ Work vector of length NROW.
IWK ‑ Work vector of length NROW.
Example
The following example creates a k-d tree from financial data collected for firms approximately 2 years prior to bankruptcy and for financially sound firms at about the same point in time. The data on five variables, X1 = (population), X2 = (cash flow)/(total debt), X3 = (net income)/(total assets), X4 = (current assets)/(current liabilities), and X5 = (current assets)/(net sales) are taken from Johnson and Wichern (1988, page 536).
USE QUADT_INT
USE WRRRN_INT
USE WRIRN_INT
IMPLICIT NONE
INTEGER LDX, NBUCK, NCOL, NROW, NVAR, I
PARAMETER (LDX=47, NBUCK=3, NCOL=5, NROW=47, NVAR=4)
!
INTEGER IDISCR(NROW), IND(NVAR)
REAL PART(NROW), X(LDX,NCOL)
!
DATA IND/2, 3, 4, 5/
DATA (X(I,1),I=1,47)/1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., &
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 2., 2., 2., 2., 2., &
2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., &
2., 2., 2., 2., 2., 2./
DATA (X(I,2),I=1,47)/-0.4485, -0.5633, 0.0643, -0.0721, -0.1002, &
-0.1421, 0.0351, -0.0653, 0.0724, -0.1353, -0.2298, 0.0713, &
0.0109, -0.2777, 0.1454, 0.3703, -0.0757, 0.0451, 0.0115, &
0.1227, -0.2843, 0.5135, 0.0769, 0.3776, 0.1933, 0.3248, &
0.3132, 0.1184, -0.0173, 0.2169, 0.1703, 0.1460, -0.0985, &
0.1398, 0.1379, 0.1486, 0.1633, 0.2907, 0.5383, -0.3330, &
0.4785, 0.5603, 0.2029, 0.2029, 0.4746, 0.1661, 0.5808/
DATA (X(I,3),I=1,47)/-0.4106, -0.3114, -0.3114, -0.0930, &
-0.0917, -0.0651, 0.0147, -0.0566, -0.0076, -0.1433, &
-0.2961, 0.0205, 0.0011, -0.2316, 0.0500, 0.1098, -0.0821, &
0.0263, -0.0032, 0.1055, -0.2703, 0.1001, 0.0195, 0.1075, &
0.0473, 0.0718, 0.0511, 0.0499, 0.0233, 0.0779, 0.0695, &
0.0518, -0.0123, -0.0312, 0.0728, 0.0564, 0.0486, 0.0597, &
0.1064, -0.0854, 0.0910, 0.1112, 0.0792, 0.0792, 0.1380, &
0.0351, 0.0371/
DATA (X(I,4),I=1,47)/1.0865, 1.5134, 1.0077, 1.4544, 1.5644, &
0.7066, 1.5046, 1.3737, 1.3723, 1.4196, 0.3310, 1.3124, &
2.1495, 1.1918, 1.8762, 1.9941, 1.5077, 1.6756, 1.2602, &
1.1434, 1.2722, 2.4871, 2.0069, 3.2651, 2.2506, 4.2401, &
4.4500, 2.5210, 2.0538, 2.3489, 1.7973, 2.1692, 2.5029, &
0.4611, 2.6123, 2.2347, 2.3080, 1.8381, 2.3293, 3.0124, &
1.2444, 4.2918, 1.9936, 1.9936, 2.9166, 2.4527, 5.0594/
DATA (X(I,5),I=1,47)/0.4526, 0.1642, 0.3978, 0.2589, 0.6683, &
0.2794, 0.7080, 0.4032, 0.3361, 0.4347, 0.1824, 0.2497, &
0.6969, 0.6601, 0.2723, 0.3828, 0.4215, 0.9494, 0.6038, &
0.1655, 0.5128, 0.5368, 0.5304, 0.3548, 0.3309, 0.6279, &
0.6852, 0.6925, 0.3483, 0.3970, 0.5174, 0.5500, 0.5778, &
0.2643, 0.5151, 0.5563, 0.1978, 0.3786, 0.4835, 0.4730, &
0.1847, 0.4443, 0.3018, 0.3018, 0.4487, 0.1370, 0.1268/
!
CALL QUADT (X, IND, NBUCK, IDISCR, PART)
CALL WRRRN ('first 10 rows of X after QUADT', X, 10, NCOL, LDX)
CALL WRRRN ('PART', PART, 1, NROW, 1)
CALL WRIRN ('IDISCR', IDISCR, 1, NROW, 1)
!
END
Output
first 10 rows of X after QUADT
1 2 3 4 5
1 1.000 -0.230 -0.296 0.331 0.182
2 2.000 0.140 -0.031 0.461 0.264
3 1.000 -0.142 -0.065 0.707 0.279
4 1.000 -0.449 -0.411 1.087 0.453
5 1.000 0.064 -0.311 1.008 0.398
6 1.000 0.123 0.105 1.143 0.166
7 1.000 -0.284 -0.270 1.272 0.513
8 1.000 -0.278 -0.232 1.192 0.660
9 1.000 0.012 -0.003 1.260 0.604
10 1.000 0.071 0.021 1.312 0.250
PART
1 2 3 4 5 6 7 8 9 10
0.000 0.461 0.857 0.000 0.064 1.168 0.000 -0.278 0.041 0.000
11 12 13 14 15 16 17 18 19 20
0.072 1.373 0.000 -0.072 0.412 0.000 0.435 -0.015 0.000 1.876
21 22 23 24 25 26 27 28 29 30
0.448 0.000 .708 1.994 0.000 0.203 2.152 0.000 2.308 0.390
31 32 33 34 35 36 37 38 39 40
0.000 .550 0.147 0.000 0.217 2.453 0.000 2.521 0.128 0.000
41 42 43 44 45 46 47
2.612 3.012 0.000 4.240 4.292 4.755 0.000
IDISCR
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
0 3 3 0 1 3 0 1 1 0 1 3 0 1 4 0 4 1 0 3
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
4 0 4 3 0 1 3 0 3 4 0 4 1 0 1 3 0 3 1 0
41 42 43 44 45 46 47
3 3 0 3 3 3 0