A Fast Decision Tree Learning Algorithm

Jiang Su and Harry Zhang
Faculty of Computer Science, University of New Brunswick, NB, Canada, E3B 5A3
{jiang.su, hzhang}@unb.ca

Abstract

There is growing interest in scaling up the widely-used decision-tree learning algorithms to very large data sets. Although numerous diverse techniques have been proposed, a fast tree-growing algorithm without substantial decrease in accuracy and substantial increase in space complexity is essential. In this paper, we present a novel, fast decision-tree learning algorithm that is based on a conditional independence assumption. The new algorithm has a time complexity of O(m · n), where m is the size of the training data and n is the number of attributes. This is a significant asymptotic improvement over the time complexity O(m · n²) of the standard decision-tree learning algorithm C4.5, with an additional space increase of only O(n). Experiments show that our algorithm performs competitively with C4.5 in accuracy on a large number of UCI benchmark data sets, and performs even better and significantly faster than C4.5 on a large number of text classification data sets. The time complexity of our algorithm is as low as naive Bayes'. Indeed, it is as fast as naive Bayes but outperforms naive Bayes in accuracy according to our experiments. Our algorithm is a core tree-growing algorithm that can be combined with other scaling-up techniques to achieve further speedup.

Introduction and Related Work

Decision-tree learning is one of the most successful learning algorithms, due to its various attractive features: simplicity, comprehensibility, no parameters, and the ability to handle mixed-type data. In decision-tree learning, a decision tree is induced from a set of labeled training instances represented by a tuple of attribute values and a class label.
Because of the vast search space, decision-tree learning is typically a greedy, top-down and recursive process starting with the entire training data and an empty tree. An attribute that best partitions the training data is chosen as the splitting attribute for the root, and the training data are then partitioned into disjoint subsets satisfying the values of the splitting attribute. For each subset, the algorithm proceeds recursively until all instances in a subset belong to the same class. A typical tree-growing algorithm, such as C4.5 (Quinlan 1993), has a time complexity of O(m · n²), where m is the size of the training data and n is the number of attributes.

Copyright © 2006, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Since large data sets with millions of instances and thousands of attributes are not rare today, the interest in developing fast decision-tree learning algorithms is rapidly growing. The major reason is that decision-tree learning works comparably well on large data sets, in addition to its general attractive features. For example, decision-tree learning outperforms naive Bayes on larger data sets, while naive Bayes performs better on smaller data sets (Kohavi 1996; Domingos & Pazzani 1997). A similar observation has been made in comparing decision-tree learning with logistic regression (Perlich, Provost, & Simonoff 2003).

Numerous techniques have been developed to speed up decision-tree learning, such as designing a fast tree-growing algorithm, parallelization, and data partitioning. Among them, a large amount of research work has been done on reducing the computational cost related to accessing the secondary storage, such as SLIQ (Mehta, Agrawal, & Rissanen 1996), SPRINT (Shafer, Agrawal, & Mehta 1996), or RainForest (Gehrke, Ramakrishnan, & Ganti 2000). An excellent survey is given in (Provost & Kolluri 1999).
Apparently, however, developing a fast tree-growing algorithm is more essential. There are basically two approaches to designing a fast tree-growing algorithm: searching in a restricted model space, and using a powerful search heuristic.

Learning a decision tree from a restricted model space achieves the speedup by avoiding searching the vast model space. One-level decision trees (Holte 1993) are a simple structure in which only one attribute is used to predict the class variable. A one-level tree can be learned quickly with a time complexity of O(m · n), but its accuracy is often much lower than C4.5's. Auer et al. (1995) present an algorithm T2 for learning two-level decision trees. However, it has been noticed that T2 is no more efficient than C4.5 (Provost & Kolluri 1999).

Learning restricted decision trees often leads to performance degradation in some complex domains. Using a powerful heuristic to search the unrestricted model space is another realistic approach. Indeed, most standard decision-tree learning algorithms are based on heuristic search. Among them, the decision-tree learning algorithm C4.5 (Quinlan 1993) has been well recognized as the reigning standard. C4.5 adopts information gain as the criterion (heuristic) for splitting attribute selection and has a time complexity of O(m · n²). Note that the number n of attributes corresponds to the depth of the decision tree, which is an important factor contributing to the computational cost of tree-growing. Although one empirical study suggests that on average the learning time of ID3 is linear in the number of attributes (Shavlik, Mooney, & Towell 1991), it has also been noticed that C4.5 does not scale well when there are many attributes (Dietterich 1997).

The motivation of this paper is to develop a fast algorithm searching the unrestricted model space with a powerful heuristic that can be computed efficiently.
Our work is inspired by naive Bayes, which is based on an unrealistic assumption: all attributes are independent given the class. Because of this assumption, it has a very low time complexity of O(m · n), and still performs surprisingly well (Domingos & Pazzani 1997). Interestingly, if we introduce a similar assumption into decision-tree learning, the widely used information-gain heuristic can be computed more efficiently, which leads to a more efficient tree-growing algorithm with the same asymptotic time complexity of O(m · n) as naive Bayes and one-level decision trees. That is the key idea of this paper.

A Fast Tree-Growing Algorithm

To simplify our discussion, we assume that all attributes are non-numeric, so that each attribute occurs at most once on each path from the root to a leaf. We will specify how to cope with numeric attributes later. In the algorithm analysis in this paper, we assume that both the number of classes and the number of values of each attribute are much less than m, and they are therefore treated as constants. We also assume that all training data are loaded into main memory.

Conditional Independence Assumption

In tree-growing, the heuristic plays a critical role in determining both classification performance and computational cost. Most modern decision-tree learning algorithms adopt an (im)purity-based heuristic, which essentially measures the purity of the resulting subsets after applying the splitting attribute to partition the training data. Information gain, defined as follows, is widely used as a standard heuristic:

IG(S, X) = Entropy(S) - \sum_{x} \frac{|S_x|}{|S|} Entropy(S_x),  (1)

where S is a set of training instances, X is an attribute and x is its value, S_x is the subset of S consisting of the instances with X = x, and Entropy(S) is defined as

Entropy(S) = -\sum_{i=1}^{|C|} P_S(c_i) \log P_S(c_i),  (2)

where P_S(c_i) is estimated by the percentage of instances belonging to class c_i in S, and |C| is the number of classes. Entropy(S_x) is defined similarly.
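As a concrete reference for Equations 1 and 2, the standard heuristic can be sketched in a few lines of Python (a minimal illustration; the function and variable names are ours, not from the paper):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_i P_S(c_i) log2 P_S(c_i)   (Equation 2)."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def info_gain(instances, labels, attr):
    """IG(S, X) = Entropy(S) - sum_x |S_x|/|S| * Entropy(S_x)   (Equation 1).

    instances: list of dicts mapping attribute name -> value."""
    n = len(labels)
    subsets = {}                      # value x -> class labels of subset S_x
    for inst, c in zip(instances, labels):
        subsets.setdefault(inst[attr], []).append(c)
    return entropy(labels) - sum(
        len(s) / n * entropy(s) for s in subsets.values())
```

Note that `info_gain` passes through every instance of S once per attribute; it is exactly this per-attribute pass that the algorithm below avoids.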
Note that tree-growing is a recursive process of partitioning the training data, and S is the training data associated with the current node. Then P_S(c_i) is actually P(c_i|x_p) on the entire training data, where X_p is the set of attributes along the path from the current node to the root, called path attributes, and x_p is an assignment of values to the variables in X_p. Similarly, P_{S_x}(c_i) is P(c_i|x_p, x) on the entire training data.

In the tree-growing process, each candidate attribute (each attribute not in X_p) is examined using Equation 1, and the one with the highest information gain is selected as the splitting attribute. The most time-consuming part of this process is evaluating P(c_i|x_p, x) for computing Entropy(S_x). It must pass through each instance in S_x, for each of which it iterates through each candidate attribute X. This results in a time complexity of O(|S| · n). Note that the union of the subsets on each level of the tree is the entire training data of size m, so the time complexity for each level is O(m · n). Therefore, the standard decision-tree learning algorithm has a time complexity of O(m · n²).

Our key observation is that we may not need to pass through S for each candidate attribute to estimate P(c_i|x_p, x). According to probability theory, we have

P(c_i|x_p, x) = \frac{P(c_i|x_p) P(x|x_p, c_i)}{P(x|x_p)} = \frac{P(c_i|x_p) P(x|x_p, c_i)}{\sum_{j=1}^{|C|} P(c_j|x_p) P(x|x_p, c_j)}.  (3)

Assume that each candidate attribute is independent of the path attribute assignment x_p given the class, i.e.,

P(X|x_p, C) = P(X|C).  (4)

Then we have

P(c_i|x_p, x) ≈ \frac{P(c_i|x_p) P(x|c_i)}{\sum_{j=1}^{|C|} P(c_j|x_p) P(x|c_j)}.  (5)

The information gain obtained from Equations 5 and 1 is called independent information gain (IIG) in this paper.
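Under the assumption in Equation 4, the posterior at a node can be approximated from two cheap quantities: the node's class distribution P(c_i|x_p) and the global table P(x|c_i). A minimal sketch of Equation 5 (our own naming, not the paper's code):

```python
def eq5_posterior(p_c_node, p_x_given_c):
    """Equation 5: P(c_i | xp, x) ~ P(c_i|xp) P(x|c_i) / normalizer.

    p_c_node:    {class: P(class | xp)} for the current node
    p_x_given_c: {class: P(X = x | class)}, precomputed on the full
                 training data for the fixed attribute value x
    No pass over the node's instances is needed."""
    scores = {c: p_c_node[c] * p_x_given_c.get(c, 0.0) for c in p_c_node}
    z = sum(scores.values())                    # denominator of Equation 5
    posterior = {c: (s / z if z else 0.0) for c, s in scores.items()}
    return posterior, z
```

The returned normalizer `z` is exactly the denominator of Equation 5, which, as discussed below, also serves as the approximation of |S_x|/|S|.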
Note that in Equation 5, P(x|c_i) is the percentage of instances with X = x and C = c_i on the entire training data, which can be pre-computed and stored before the tree-growing process with a time complexity of O(m · n) and an additional space increase of O(n), and P(c_i|x_p) is the percentage of instances belonging to class c_i in S, which can be computed by passing through S once, taking O(|S|). Thus, at each level, the time complexity for computing P(c_i|x_p, x) using Equation 5 is O(m), while C4.5 takes O(m · n) using Equation 3.

To compute IIG, |S_x|/|S| in Equation 1 must also be computed. If we examined the partition for each candidate attribute X, the corresponding time complexity would be O(m · n). Fortunately, |S_x|/|S| = P(x|x_p), which can be approximated by the denominator of Equation 5, taking O(m). In C4.5, it takes O(m · n).

The time complexity for selecting the splitting attribute using IIG is similar to that of using information gain in C4.5. For the current node, IIG(S, X) must be calculated for each candidate attribute, which takes O(n) per node. The total time complexity for splitting attribute selection on the entire tree is then O(k · n), where k is the number of internal nodes of the tree. In the worst case, each leaf of the tree contains a single instance, and the number of internal nodes is O(m). Thus, the total time complexity for splitting attribute selection is O(m · n). Note that C4.5 does the same thing and has exactly the same time complexity for splitting attribute selection.

The total time for tree-growing is the sum of the time for probability estimation, partitioning, and splitting attribute selection. Therefore, the time complexity for tree-growing using IIG is O(m · n), as low as the time complexity for learning a naive Bayes or a one-level tree. Compared with the time complexity O(m · n²) of C4.5 based on information gain, this is indeed a significant asymptotic improvement.
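The one-off O(m · n) pass that builds the P(x|c_i) tables might look as follows (a sketch under our own naming; instances are dicts from attribute name to value):

```python
from collections import Counter, defaultdict

def precompute_px_given_c(instances, labels, attributes):
    """Single pass over all m instances and n attributes: O(m * n) time.

    Stores one small table per attribute, giving the O(n) extra space
    (value and class counts are treated as constants). Returns
    result[attr][class][value] = P(X = value | class)."""
    class_counts = Counter(labels)
    tables = {a: defaultdict(Counter) for a in attributes}
    for inst, c in zip(instances, labels):
        for a in attributes:
            tables[a][c][inst[a]] += 1
    return {a: {c: {x: k / class_counts[c] for x, k in xcounts.items()}
                for c, xcounts in by_class.items()}
            for a, by_class in tables.items()}
```

After this pass, every P(x|c_i) lookup during tree-growing is a constant-time dictionary access, which is what brings the per-level cost of Equation 5 down to O(m).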
The conditional independence assumption in Equation 4 seems similar to the well-known conditional independence assumption in naive Bayes, which assumes that all attributes are independent given the class:

P(X_1, X_2, \cdots, X_n | C) = \prod_{i=1}^{n} P(X_i|C).  (6)

However, the assumption in Equation 4 is weaker than the assumption in Equation 6. First, Equation 6 defines an assumption on the entire instance space (a global independence assumption), while Equation 4 defines an assumption on a subspace specified by a particular assignment x_p. It is essentially context-specific independence (Friedman & Goldszmidt 1996). Apparently, context-specific independence has finer granularity than global independence, in the sense that some attributes that are not globally independent may still be independent in certain contexts. Second, in Equation 4, we do not assume that the candidate attributes not in X_p are independent of one another. Naive Bayes, however, assumes that all attributes are independent, including candidate attributes.

It is well known that the independence assumption of naive Bayes is unrealistic, but naive Bayes still works surprisingly well even in domains containing dependencies. Domingos and Pazzani (1997) ascribe this to the zero-one loss function in classification. That is, for a given instance e, as long as the class with the largest posterior probability P(c|e) in naive Bayes remains the same as the true class, the classification of naive Bayes is correct. Interestingly, their explanation also applies to decision-tree learning. In tree-growing, one attribute is selected at each step. Similarly, as long as the attribute with the largest IIG is the same as the one with the largest information gain, attribute selection is not affected. So an exact information gain value does not matter. To see how this could happen, let us consider the following example.
Given a Boolean concept C = (A1 ∧ A2) ∨ A3 ∨ A4, assume that the training data consist of all 16 instances. The decision tree generated by C4.5 is shown in Figure 1. It is interesting that the tree generated using IIG is the same. When choosing the root, IIG is information gain. In the other steps of tree-growing, IIG might not be equal to information gain, but the attribute with the highest information gain remains the same. For example, when choosing the attribute immediately under the root, the information gain values for A1, A2 and A4 are 0.04, 0.04 and 0.54, and the IIG values for A1, A2 and A4 are 0.02, 0.02 and 0.36. Apparently, A4 is the attribute with both the highest information gain and the highest IIG, although IIG is not equal to information gain. Note that the independence assumption for IIG does not hold in this example.

[Figure 1: Decision tree generated for the Boolean concept (A1 ∧ A2) ∨ A3 ∨ A4, with A3 at the root, Gain(A4) = 0.54 and Gain(A1) = Gain(A2) = 0.04 below it; e(c) denotes the number of instances in class c falling into a leaf.]

Another reason supporting the independence assumption in Equation 4 is the accuracy of probability estimation. It is well known that decision-tree learning suffers from the fragmentation problem: as the splitting process proceeds, the data associated with each descendant node become smaller. Eventually, when the depth of the tree is large, there is very little data associated with each leaf node, and so the estimate of P(c_i|x_p, x) could be inaccurate. In computing IIG, since P(x|c_i) is estimated on the entire training data and P(c_i|x_p) is estimated before partitioning on attribute X, this problem could be less severe.

Actually, the assumption in Equation 4 can be relaxed. To simplify our discussion, let us consider binary classes {+, −}. Then we have

P(+|x_p, x) = \frac{P(+|x_p) P(x|x_p, +)}{P(+|x_p) P(x|x_p, +) + P(-|x_p) P(x|x_p, -)} = \frac{P(+|x_p)}{P(+|x_p) + P(-|x_p) \frac{P(x|x_p, -)}{P(x|x_p, +)}}.
Obviously, the probability estimate generated from Equation 5 is equal to the one from Equation 3 if

\frac{P(x|x_p, +)}{P(x|x_p, -)} = \frac{P(x|+)}{P(x|-)}.  (7)

Naive Tree Algorithm

The tree-growing algorithm based on the conditional independence assumption, called the naive tree algorithm (NT), is illustrated as follows.

Algorithm NT(Π, S)
Input: a set Π of candidate attributes and a set S of labeled instances.
Output: a decision tree T.
1. If (S is pure or empty) or (Π is empty), return T.
2. Compute P_S(c_i) on S for each class c_i.
3. For each attribute X in Π, compute IIG(S, X) based on Equations 1 and 5.
4. Use the attribute X_max with the highest IIG for the root.
5. Partition S into disjoint subsets S_x using X_max.
6. For each value x of X_max:
   • T_x = NT(Π − {X_max}, S_x)
   • Add T_x as a child of X_max.
7. Return T.

Before we call the NT algorithm, a set of probabilities P(X|C) must be computed on the entire training data for each attribute and each class, which takes O(m · n). According to the analysis in the preceding section, the total time complexity of the NT algorithm is O(m · n). Note that the major difference between NT and C4.5 is in Step 3. C4.5 computes the information gain using P(c_i|x_p, x) for each candidate attribute X, which requires passing through each instance for each candidate attribute, taking O(m · n). In NT, P(c_i|x_p, x) is approximated using Equation 5, which takes only O(m). Consequently, the time complexity of NT is as low as the time complexity of naive Bayes and one-level trees.

Note that in the NT algorithm above, we do not cope with numeric attributes, for simplicity. In our implementation of the NT algorithm, we process numeric attributes in the following way.
1. In preprocessing, all numeric attributes are discretized by k-bin discretization, where k = √m.
2. In selecting the splitting attribute, numeric attributes are treated the same as non-numeric attributes.
3.
Once a numeric attribute is chosen, a splitting point is found in the same way as in C4.5. The corresponding partitioning is then done in Step 5, and a recursive call is made for each subset in Step 6. Note that a numeric attribute can be chosen again in the attribute selection at descendant nodes.

Note that we do not change the way the splitting attribute is chosen in NT when there are numeric attributes, and that we use the same method as C4.5 to choose a splitting point only after a numeric attribute has been selected. Thus the time complexity of selecting splitting attributes in NT stays the same, and its advantage over C4.5 also remains. However, the time complexity of NT is then higher than O(m · n). In fact, the time complexity of C4.5 is also higher than O(m · n²) in the presence of numeric attributes.

In a traditional decision-tree learning algorithm, an attribute X is more likely to be selected if it is independent of the path attributes X_p, because it provides more discriminative power than an attribute depending on X_p. In other words, an attribute Y dependent on X_p has less chance of being chosen than X. Under the independence assumption in Equation 4, such attributes Y might have a greater chance of being chosen. In our tree-growing algorithm, we check for this kind of attribute: in our implementation, if an attribute leads to all subsets except one being empty, it is not chosen as the splitting attribute and is removed from the candidate attribute set. Intuitively, the trees generated by NT tend to be larger than C4.5's. This issue can be addressed by post-pruning; in our implementation, the pruning method of C4.5 is applied after a tree is grown. Interestingly, according to the experimental results presented in the next section, the trees generated by NT are not larger than C4.5's after pruning.
Experiments

Experiment Setup

We conducted our experiments under the framework of Weka (version 3.4.7) (Witten & Frank 2000). All experiments were performed on a Pentium 4 with a 2.8GHz CPU and 1GB of RAM.

We design two experiments. The first is on 33 UCI benchmark data sets, which are selected by Weka and represent a wide range of domains and data characteristics. The data sets are listed in descending order of size in Table 1. Missing values are processed using the mechanism in Weka, which replaces all missing values with the modes and means from the training data. The purpose of this experiment is to observe the general performance of NT and the influence of its independence assumption on accuracy in practical environments.

The second experiment is on 19 text classification benchmark data sets (Forman & Cohen 2004). These data sets are mainly from Reuters, TREC and OHSUMED, and are widely used in text classification research. In fact, decision-tree learning is one of the competitive text classification algorithms (Dumais et al. 1998), and its classification time is very low. From the preceding sections, we know that when the number of attributes is large, NT, with complexity O(m · n), should be significantly faster than C4.5, with complexity O(m · n²). Text classification usually involves thousands of attributes, so on these data sets we can observe the advantage of NT over C4.5 in running time (training time).

We compare NT with C4.5 and naive Bayes. Naive Bayes is well known as one of the most practical text classification algorithms, with very low time complexity and considerably good accuracy as well. We use the implementations of C4.5 and naive Bayes in Weka, and implement NT under the framework of Weka. The source code of NT is available at: http://www.cs.unb.ca/profs/hzhang/naivetree.rar.

In our experiments, we use three evaluation measures: accuracy, running time, and tree size.
Certainly, we want the new algorithm to be competitive with C4.5 in accuracy with a substantial decrease in running time. In decision-tree learning, a compact tree usually offers more insightful results than a larger tree, and tree size is also a widely used evaluation criterion in decision-tree learning.

We use the following abbreviations in the tables below.
NT: the NT algorithm. NT(T) denotes its training time.
C4.5: the traditional decision-tree learning algorithm (Quinlan 1993). C4.5(T) denotes its training time.
NB: naive Bayes. NB(T) denotes its training time.
R(S): the ratio of the tree size learned by NT to C4.5's. A ratio greater than 1 means that NT grows a larger tree than C4.5 does.
R(T): the ratio of the running time of NT to the running time of C4.5. A ratio greater than 1 means that the running time of NT is longer than C4.5's.

Experimental Results on UCI Data Sets

Table 1 shows the accuracy, tree size ratio and training time ratio of each algorithm on each data set, obtained via 10 runs of 10-fold stratified cross-validation; the average accuracy and standard deviation over all data sets are summarized at the bottom of the table. Note that we report the running time ratio instead of the running time itself, because the running time on small data sets is very close to zero. Table 2 shows the results of the two-tailed t-test in accuracy with a 95% confidence level, in which each entry w/t/l means that the algorithm in the corresponding row wins on w data sets, ties on t data sets, and loses on l data sets, compared to the algorithm in the corresponding column.

Table 1: Experimental results on UCI data sets.
Data set      NT             C4.5           NB             R(S)  R(T)
Letter        85.66 ± 0.76   88.03 ± 0.71   64.07 ± 0.91   1.18  0.25
Mushroom      100 ± 0        100 ± 0        95.52 ± 0.78   1.10  1.02
Waveform      76.78 ± 1.74   75.25 ± 1.9    80.01 ± 1.45   1.13  0.23
Sick          98.54 ± 0.62   98.81 ± 0.56   92.89 ± 1.37   0.71  0.27
Hypothyroid   99.34 ± 0.41   99.58 ± 0.34   95.29 ± 0.74   0.81  0.76
Chess         97.34 ± 1      99.44 ± 0.37   87.79 ± 1.91   0.89  0.99
Splice        94.24 ± 1.23   94.08 ± 1.29   95.42 ± 1.14   0.81  0.31
Segment       95.34 ± 1.24   96.79 ± 1.29   80.17 ± 2.12   1.40  0.33
German-C      73.01 ± 3.93   71.13 ± 3.19   75.16 ± 3.48   0.42  0.36
Vowel         76.14 ± 4.17   80.08 ± 4.34   62.9 ± 4.38    1.15  0.45
Anneal        98.47 ± 1.22   98.57 ± 1.04   86.59 ± 3.31   1.11  0.16
Vehicle       69.36 ± 4.24   72.28 ± 4.32   44.68 ± 4.59   0.94  0.49
Diabetes      73.96 ± 4.62   74.49 ± 5.27   75.75 ± 5.32   0.77  0.41
Wisconsin-b   94.09 ± 2.77   94.89 ± 2.49   96.12 ± 2.16   0.93  0.44
Credit-A      85.17 ± 3.93   85.75 ± 4.01   77.81 ± 4.21   1.30  0.37
Soybean       91.9 ± 2.74    92.55 ± 2.61   92.2 ± 3.23    1.22  0.72
Balance       78.92 ± 4.04   77.82 ± 3.42   90.53 ± 1.67   0.94  0.62
Vote          95.17 ± 2.93   96.27 ± 2.79   90.21 ± 3.95   0.62  0.76
Horse-Colic   85.27 ± 5.04   83.28 ± 6.05   77.39 ± 6.34   0.47  0.47
Ionosphere    88.89 ± 4.74   89.74 ± 4.38   82.17 ± 6.14   1.00  0.25
Primary-t     38.76 ± 6.15   41.01 ± 6.59   47.2 ± 6.02    0.86  0.91
Heart-c       78.98 ± 7.31   76.64 ± 7.14   82.65 ± 7.38   0.88  0.56
Breast-Can    72.84 ± 5.49   75.26 ± 5.04   72.94 ± 7.71   2.30  0.95
Heart-s       82.89 ± 7.29   78.15 ± 7.42   83.59 ± 5.98   0.63  0.34
Audiology     75.67 ± 6.56   76.69 ± 7.68   71.4 ± 6.37    1.19  1.14
Glass         70.94 ± 9.51   67.63 ± 9.31   49.45 ± 9.5    1.02  0.47
Sonar         71.81 ± 10.74  73.61 ± 9.34   67.71 ± 8.66   1.35  0.24
Autos         74.02 ± 9.18   78.64 ± 9.94   57.67 ± 10.91  1.48  0.54
Hepatitis     80.16 ± 8.89   77.02 ± 10.03  83.29 ± 10.26  0.53  0.67
Iris          95.4 ± 4.51    94.73 ± 5.3    95.53 ± 5.02   1.03  1.07
Lymph         75.17 ± 9.57   75.84 ± 11.05  83.13 ± 8.89   0.85  0.70
Zoo           93.07 ± 6.96   92.61 ± 7.33   94.97 ± 5.86   1.20  0.63
Labor         83.77 ± 14.44  81.23 ± 13.68  93.87 ± 10.63  0.83  0.79
Mean          83.37 ± 4.79   83.57 ± 4.86   79.58 ± 4.92   1.00  0.56

Table 2: T-test summary in accuracy on UCI data sets.
       C4.5     NB
NT     2-27-4   15-11-7
C4.5            15-9-9

Tables 1 and 2 show that the classification performance of NT is better than naive Bayes's, and is very close to C4.5's. We summarize the highlights briefly as follows:
1. NT performs reasonably well compared with C4.5. The average accuracy of NT is 83.37%, very close to C4.5's 83.57%, and the average tree size of NT is equal to C4.5's (the average ratio is 1.00).
2. On the data sets "Letter", "Vowel" and "Chess", strong and high-order attribute dependencies have been observed (Kohavi 1996). We can see that naive Bayes performs severely worse than C4.5 on them, whereas the accuracy decrease of NT is only 2%-4%. This evidence supports our argument that the independence assumption in NT is weaker than naive Bayes's, and shows that NT performs considerably well even when the independence assumption is seriously violated.
3. NT achieves higher accuracy and smaller tree size than C4.5 on the data sets "German-C", "Heart-s" and "Labor". This means that C4.5 is not always superior to NT. We will see that NT even outperforms C4.5 in text classification in the next section.
4. NT inherits the superiority of C4.5 over naive Bayes in accuracy on larger data sets. Since it has the same time complexity as naive Bayes, NT could be a more practical learning algorithm in large-scale applications.
5. NT is faster than C4.5: the average training time ratio is 0.56. On these data sets, however, its advantage is not pronounced. According to our discussion in the preceding section, NT should have a clearer advantage in running time when the data set is large and has a large number of attributes.

Experimental Results in Text Classification

We conducted the second experiment on text classification benchmark data sets. Table 3 shows the accuracy, training time in seconds and tree size ratio for each algorithm on each data set, obtained via 10-fold stratified cross-validation. Table 4 shows the results of the two-tailed t-test in accuracy with a 95% confidence level.
We summarize the highlights briefly as follows:
1. NT performs slightly better than C4.5, with 4 wins and 1 loss. The average accuracy of NT is 79.88%, higher than C4.5's 78%. On most data sets (14 out of 19), NT achieves higher accuracy. In addition, the tree learned by NT is smaller than C4.5's on every data set, and the average tree size ratio is only 0.73. Note that, generally speaking, the conditional independence assumption is violated in text classification.
2. It is clear that NT is remarkably faster than C4.5. The average training time of NT is 93 seconds, thirty times less than C4.5's 2823 seconds.
3. It is not a surprise that NT performs better than naive Bayes, with an average accuracy of 79.88% vs. naive Bayes's 75.63%. An equally important point is that NT is as fast as naive Bayes in training. Note that decision-tree learning also has a significant advantage over naive Bayes in terms of classification time.

Table 3: Experimental results on text classification data sets.

Data     NT     C4.5   NB     NT(T)  NB(T)  C4.5(T)  R(S)
Ohscal   72.48  71.35  70.86  560    643    32694    0.42
La1      75.5   77.22  83.21  419    414    7799     0.55
La2      77.33  76.39  84.55  467    390    6820     0.58
Fbis     76.74  74.06  67.4   24     21     457      0.63
Cora36   39.39  46.94  61.78  106    25     1764     0.88
Re1      83.47  79     75.14  30     15     512      0.74
Re0      76.01  74.73  71.02  16     10     252      0.74
Wap      67.44  62.44  80.45  70     45     2180     0.73
Oh0      81.86  82.56  87.34  7      7      129      0.69
Oh10     78.19  73.62  80.19  8      8      161      0.54
Oh15     72.19  73.7   80.81  7      6      138      0.80
Oh5      84.3   81.58  82.45  5      6      88       0.68
Tr31     94.07  92.77  88.45  12     28     188      0.96
Tr11     86.95  77.58  67.63  5      7      73       0.74
Tr12     81.44  83.05  72.85  3      4      33       0.85
Tr21     88.68  82.44  52.34  4      7      81       0.54
Tr23     97.57  91.12  59.29  1      3      12       0.98
Tr41     93.73  91.35  91.81  8      18     153      0.89
Tr45     90.29  90.14  79.42  10     16     101      0.98
Mean     79.88  78.00  75.63  93     88     2823     0.73

In summary, our experimental results in text classification show:
1. NT scales well to problems with a large number of attributes.
2.
NT could be a practical text classification algorithm, because it is as fast as naive Bayes in training and significantly faster in testing, and still achieves higher classification accuracy.

Conclusion

In this paper, we present a novel, efficient algorithm for decision-tree learning. Its time complexity is O(m · n), as low as that of learning a naive Bayes or a one-level tree, and a significant asymptotic reduction of the O(m · n²) time complexity of C4.5. Our new algorithm is based on a conditional independence assumption, similar to but weaker than the conditional independence assumption of naive Bayes. Experiments show that the new algorithm scales well to large data sets with large numbers of attributes: it performs significantly faster than C4.5 while maintaining competitive accuracy. One drawback of the new algorithm is its handling of missing values, which is clearly inferior to C4.5's and will be a topic of our future research.

Although we focus on information gain-based decision-tree learning in this paper, estimating the purity of the subsets generated by a splitting attribute X, i.e., computing P_{S_x}(c_i), is essential in many other decision-tree learning algorithms, so the method proposed in this paper is also applicable to those algorithms. Note also that the NT algorithm is a core tree-growing algorithm, which can be combined with other scaling-up techniques to achieve further speedup. In addition, NT could be a practical algorithm for various applications with large amounts of data, such as text classification.

Table 4: T-test summary in accuracy on text classification data sets.

       C4.5     NB
NT     4-14-1   9-4-6
C4.5            7-6-6

References

Auer, P.; Holte, R. C.; and Maass, W. 1995. Theory and applications of agnostic PAC-learning with small decision trees. In Proceedings of the 12th International Conference on Machine Learning. Morgan Kaufmann. 21–29.
Dietterich, T. G. 1997. Machine learning research: Four current directions.
AI Magazine 18(4):97–136.
Domingos, P., and Pazzani, M. 1997. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. Machine Learning 29:103–130.
Dumais, S.; Platt, J.; Heckerman, D.; and Sahami, M. 1998. Inductive learning algorithms and representations for text categorization. In Proceedings of the Seventh International Conference on Information and Knowledge Management. 148–155.
Forman, G., and Cohen, I. 2004. Learning from little: Comparison of classifiers given little training. In Proceedings of PKDD 2004. 161–172.
Friedman, N., and Goldszmidt, M. 1996. Learning Bayesian networks with local structure. In Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence. 252–262.
Gehrke, J. E.; Ramakrishnan, R.; and Ganti, V. 2000. RainForest: A framework for fast decision tree construction of large datasets. Data Mining and Knowledge Discovery 4(2/3):127–162.
Holte, R. 1993. Very simple classification rules perform well on most commonly used datasets. Machine Learning 11:63–91.
Kohavi, R. 1996. Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. AAAI Press. 202–207.
Mehta, M.; Agrawal, R.; and Rissanen, J. 1996. SLIQ: A fast scalable classifier for data mining. In Proceedings of the Fifth International Conference on Extending Database Technology. 18–32.
Perlich, C.; Provost, F.; and Simonoff, J. S. 2003. Tree induction vs. logistic regression: A learning-curve analysis. Journal of Machine Learning Research 4:211–255.
Provost, F. J., and Kolluri, V. 1999. A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery 3(2):131–169.
Quinlan, J. 1993. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.
Shafer, J. C.; Agrawal, R.; and Mehta, M. 1996. SPRINT: A scalable parallel classifier for data mining.
In Proceedings of the 22nd International Conference on Very Large Databases. Morgan Kaufmann. 544–555.
Shavlik, J.; Mooney, R.; and Towell, G. 1991. Symbolic and neural network learning algorithms: An experimental comparison. Machine Learning 6:111–143.
Witten, I. H., and Frank, E. 2000. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann.