Witten’s Chapter 3 Output: Knowledge representation
[Diagram: a data mining tool takes training instances (attributes, values) and a concept to be learned, and outputs a concept description (rules, trees, clusters, etc.)]
Representing structural patterns

Many different ways of representing patterns

Decision trees, rules, instance-based, …

Also called “knowledge” representation

Representation determines inference method

Understanding the output is the key to understanding the underlying learning methods

Different types of output for different learning problems (e.g. classification, regression, …)
3.1 Decision tables

Most rudimentary form of representing output:

Use the same format as input!
Is it trivial? NO
Can we use the same input table? Maybe not

Decision table for the weather problem (WEKA on nominal data)
What’s the challenge?

Main challenge: selecting the right attributes
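As a sketch, a decision table amounts to a simple lookup keyed on the selected attributes. The entries below are toy values chosen for illustration (assumed, not WEKA's actual output for the weather data):

```python
# Sketch: a decision table is a lookup keyed on the selected attributes
# (here outlook and humidity, an assumed attribute subset).
# The class entries are toy values for illustration only.
table = {
    ("sunny", "high"): "no",
    ("sunny", "normal"): "yes",
    ("overcast", "high"): "yes",
    ("overcast", "normal"): "yes",
    ("rainy", "high"): "no",
    ("rainy", "normal"): "yes",
}

def classify(instance):
    """Look up the class for the instance's values of the key attributes."""
    return table[(instance["outlook"], instance["humidity"])]

print(classify({"outlook": "sunny", "humidity": "normal"}))
```

Selecting which attributes form the key is exactly the "main challenge" above: the table ignores every attribute not in the key.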
3.2 Decision trees

“Divide-and-conquer” approach produces tree

Nodes involve testing a particular attribute

Usually, attribute value is compared to constant

Other possibilities:

Comparing values of two attributes

Using a function on one or more attributes

Leaves assign classification, set of classifications, or probability distribution to instances

Unknown instance is routed down the tree
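Routing can be sketched in a few lines, assuming the tree is stored as nested dicts (a toy weather tree, not necessarily the one a learner would build):

```python
# Minimal sketch: routing an unknown instance down a decision tree.
# Internal nodes are dicts mapping one tested attribute to {value: subtree};
# leaves are plain class labels. Toy tree, assumed for illustration.
tree = {
    "outlook": {
        "sunny": {"humidity": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rainy": {"windy": {"true": "no", "false": "yes"}},
    }
}

def classify(tree, instance):
    """Follow the branch matching each tested attribute until a leaf."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))   # the attribute tested at this node
        tree = tree[attribute][instance[attribute]]
    return tree

print(classify(tree, {"outlook": "sunny", "humidity": "normal"}))
```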
Example:
The contact lenses data
A decision tree for this problem
Nominal and numeric attributes

Nominal attribute: number of children usually equal to number of values

An attribute won’t get tested more than once
Why?

Other possibility: division into two subsets

Numeric attribute: test whether value is greater or less than constant

An attribute may get tested several times
Can we avoid that?
Other possibility: threeway split (or multiway split)

Integer: less than, equal to, greater than

Real: below, within, above
Missing values

Does absence of value have some significance?

Yes: “missing” can be considered a separate value
Is a couple of missing years from your resume significant?

No: “missing” must be treated in a special way

Solution A: assign instance to most popular branch

Solution B: split instance into pieces

Pieces receive weight according to fraction of training instances that go down each branch

Classifications from leaf nodes are combined using the weights that have percolated to them
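The weighted-pieces scheme (Solution B) can be sketched as follows; the branch weights and leaf distributions below are assumed values for illustration:

```python
# Sketch of "Solution B": an instance with a missing attribute value is
# split into weighted pieces, one per branch; leaf class distributions
# are combined using the weights that percolate down. The branch weight
# is the fraction of training instances that went down that branch
# (values below are assumed for illustration).
def classify_with_missing(branch_weights, leaf_distributions):
    """branch_weights: {branch: fraction of training instances}
    leaf_distributions: {branch: {class: probability}}"""
    combined = {}
    for branch, weight in branch_weights.items():
        for cls, p in leaf_distributions[branch].items():
            combined[cls] = combined.get(cls, 0.0) + weight * p
    return combined

dist = classify_with_missing(
    {"high": 0.6, "normal": 0.4},
    {"high": {"no": 1.0}, "normal": {"yes": 1.0}},
)
print(dist)  # a weighted class distribution rather than a single class
```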
Size of tree: Big vs. Small, which is better
Data from labor negotiations: A more realistic example
Decision trees for the labor data
Note

Neither tree classifies all contracts correctly.

Tree 2 misses fewer examples from the training data

Which tree is better?
Tree 1 – when all else is equal, choose the simpler model.
3.3 Classification rules

Popular alternative to decision trees

Antecedent (precondition): a series of tests (just like the tests at the nodes of a decision tree)

Tests are usually logically ANDed together (but may also be general logical expressions)

Consequent (conclusion): classes, set of classes, or probability distribution assigned by rule

Individual rules are often logically ORed together

Conflicts arise if different conclusions apply (more about this later)
Example:
A complete and correct rule set
If tear production rate = reduced then recommendation = none
If age = young and astigmatic = no and tear production rate = normal
then recommendation = soft
If age = prepresbyopic and astigmatic = no and
tear production rate = normal then recommendation = soft
If age = presbyopic and spectacle prescription = myope and
astigmatic = no then recommendation = none
If spectacle prescription = hypermetrope and astigmatic = no and
tear production rate = normal then recommendation = soft
If spectacle prescription = myope and astigmatic = yes and
tear production rate = normal then recommendation = hard
If age = young and astigmatic = yes and tear production rate = normal
then recommendation = hard
If age = prepresbyopic and spectacle prescription = hypermetrope
and astigmatic = yes then recommendation = none
If age = presbyopic and spectacle prescription = hypermetrope and
astigmatic = yes then recommendation = none
Decision Tree = Classification rules?
From trees to rules

Easy: converting a tree into a set of rules

One rule for each leaf:

Antecedent contains a condition for every node on the path from the root to the leaf

Consequent is class assigned by the leaf

Produces rules that are unambiguous

So it does not matter in which order the rules are listed

But: resulting rules may be unnecessarily complex

Not as good as ones generated directly

Pruning to remove redundant tests/rules
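The one-rule-per-leaf conversion can be sketched directly: the antecedent collects the tests on the path from root to leaf, and the consequent is the leaf's class (toy tree assumed):

```python
# Sketch: converting a decision tree (nested dicts, leaves = class labels)
# into a rule set, one rule per leaf.
def tree_to_rules(tree, path=()):
    """Antecedent = (attribute, value) tests on the root-to-leaf path;
    consequent = the class assigned by the leaf."""
    if not isinstance(tree, dict):
        return [(list(path), tree)]
    attribute = next(iter(tree))
    rules = []
    for value, subtree in tree[attribute].items():
        rules.extend(tree_to_rules(subtree, path + ((attribute, value),)))
    return rules

# Toy tree for illustration (assumed, not from the book):
toy = {"outlook": {"sunny": {"humidity": {"high": "no", "normal": "yes"}},
                   "overcast": "yes"}}
for antecedent, cls in tree_to_rules(toy):
    print(antecedent, "=>", cls)
```

Note how the rule for a deep leaf repeats every test on its path, which is why rules read off a tree can be unnecessarily complex and benefit from pruning.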
From rules to trees

More difficult: transforming a rule set into a tree
If a and b then x
If c and d then x
Class exercise: Create a decision tree for that.

Symmetry needs to be broken

Corresponding tree contains identical subtrees (==> “replicated subtree problem”)
A tree for a simple disjunction
Another Example:
The exclusive-or problem
The rule set is not much smaller than the tree.
However, it could be much worse.
A tree with a replicated subtree
Having a default value reduces the number of rules required.
(WEKA ripple-down rule learner)
“Nuggets” of knowledge

Are rules independent pieces of knowledge? (It seems easy to add a rule to an existing rule base.)
Can we make the last rule independent of the first 2?
No, we cannot ignore how rules are executed.

Two ways of executing a rule set:

Ordered set of rules (“decision list”)

Unordered set of rules

Each rule is independent

Rules may overlap and lead to different conclusions for the same instance.
Interpreting rules

What if two or more rules conflict?

Give no conclusion at all?

Go with rule that is most popular on training data?

Methods used in knowledge-based systems: use probability (or Certainty Factor)

What if no rule applies to a test instance?

Give no conclusion at all?

Go with class that is most frequent in training data?

Make sure this does not happen by checking the consistency of a rule base – rule base verification and validation.
Special case: Boolean class

Assumption: if instance does not belong to class “yes”, it belongs to class “no”

Trick: only learn rules for class “yes” and use default rule for “no”
If x = 1 and y = 1 then class = a
If z = 1 and w = 1 then class = a
//Otherwise class = b

Order of rules is not important. No conflicts!

Rule can be written in disjunctive normal form
Ex.,
(x=1 AND y=1) OR (z = 1 and w = 1)
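A sketch of executing such a rule set: because every rule predicts the same class a, the rules can be checked in any order, with b as the default:

```python
# Sketch: Boolean-class rule set -- rules only for class "a", default "b".
# Order does not matter and no conflicts can arise, because every rule
# draws the same conclusion.
rules = [
    lambda inst: inst["x"] == 1 and inst["y"] == 1,
    lambda inst: inst["z"] == 1 and inst["w"] == 1,
]

def classify(instance):
    """Class "a" if any rule fires, otherwise the default class "b"."""
    return "a" if any(rule(instance) for rule in rules) else "b"

print(classify({"x": 1, "y": 1, "z": 0, "w": 0}))  # a
print(classify({"x": 0, "y": 1, "z": 0, "w": 1}))  # b
```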
3.4 Association rules

Association rules

can predict value(s) of any attribute or combinations of attributes

are not intended to be used together as a set

Problem: huge number of possible associations

Output needs to be restricted to show only the most predictive associations ==> only those with high support (i.e., coverage) and high confidence (i.e., accuracy)
Ex., Apriori algorithm
Support and confidence of a rule

Support: number (or percentage) of instances predicted correctly, out of all the instances

Confidence: number of correct predictions, as proportion of all instances that rule applies to

Weather data Example:
4 cool days with normal humidity ==>
If temperature = cool then humidity = normal
Support = 4 (or 4/14), confidence = 100%
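The support and confidence figures above can be reproduced on the 14 weather instances; only the temperature and humidity columns are kept here:

```python
# Support/confidence sketch for the rule
#   if temperature = cool then humidity = normal
# over the 14 weather instances (temperature, humidity only).
weather = [
    ("hot", "high"), ("hot", "high"), ("hot", "high"), ("mild", "high"),
    ("cool", "normal"), ("cool", "normal"), ("cool", "normal"),
    ("mild", "high"), ("cool", "normal"), ("mild", "normal"),
    ("mild", "normal"), ("mild", "high"), ("hot", "normal"), ("mild", "high"),
]
applies = [t for t in weather if t[0] == "cool"]    # antecedent holds
correct = [t for t in applies if t[1] == "normal"]  # consequent holds too
support = len(correct)                              # 4 (or 4/14)
confidence = len(correct) / len(applies)            # 1.0, i.e. 100%
print(support, confidence)
```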

Normally: minimum support and confidence specified in the mining algorithm (e.g. 58 rules with support >= 2 and confidence >= 95% for weather data)
Interpreting association rules

Interpretation is not obvious:
Rule1: If windy = false and play = no
then outlook = sunny and humidity = high
Rule2: If windy = false and play = no then outlook = sunny
Rule3: If windy = false and play = no then humidity = high
Is rule1 = rule2 && rule3?
Think in terms of confidence (ex., 75%)
Does rule1 = rule4 && rule5?
Rule4: If windy = false and play = no and humidity = high
then outlook = sunny
Rule5: If windy = false and play = no and outlook = sunny
then humidity = high
3.5 Rules with exceptions
Every rule has exceptions  Exception handling
Example:
Classifying iris flowers
rule for iris data
If petallength >= 2.45 and petallength < 4.45
then type = Iris-versicolor
New instance:
What should we do with a rule when we find an exception case?

Remove

Modify: change >= 2.45 to >= 2.6
Rules with exceptions

Idea: allow rules to have exceptions instead of deleting the rule
If petallength >= 2.45 and petallength < 4.45 then Iris-versicolor EXCEPT if petalwidth < 1.0 then Iris-setosa
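A sketch of evaluating such a rule: the exception is tested first inside the rule's body, so it overrides the rule's normal conclusion (attribute names follow the WEKA iris data):

```python
# Sketch: a rule that carries its own exception. The EXCEPT clause is
# checked before the rule's normal conclusion is returned.
def classify(petallength, petalwidth):
    if 2.45 <= petallength < 4.45:
        if petalwidth < 1.0:           # EXCEPT clause fires first
            return "Iris-setosa"
        return "Iris-versicolor"       # the rule's normal conclusion
    return None                        # the rule does not apply

print(classify(3.0, 1.2))  # Iris-versicolor
print(classify(3.0, 0.5))  # Iris-setosa: the exception fires
```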
A more complex example

There are always exceptions to exceptions …
Ex., number of days in February

Rule can get complicated quickly. Ex.,
Default: Iris-setosa
except if petallength >= 2.45 and petallength < 5.355 and
petalwidth < 1.75
then Iris-versicolor
except if petallength >= 4.95 and petalwidth < 1.55
then Iris-virginica
else if sepallength < 4.95 and sepalwidth >= 2.45
then Iris-virginica
else if petallength >= 3.35
then Iris-virginica
except if petallength < 4.85 and sepallength < 5.95
then Iris-versicolor
Advantages of using exceptions

Rules can be updated incrementally

Easy to incorporate new data

Easy to incorporate domain knowledge

People often think in terms of exceptions

Each conclusion can be considered just in the context of rules and exceptions that lead to it

Locality property is important for understanding large rule sets.

“Normal” rule sets don’t offer this advantage
More on exceptions

Default...except if...then...
is logically equivalent to
if...then...else
(where the else specifies what the default did)

But: exceptions offer a psychological advantage

Assumption: defaults and tests early on apply more widely than exceptions further down

Exceptions reflect special cases
3.6 Rules involving relations

So far: all rules involved comparing an attribute value to a constant (e.g. temperature < 45)

These rules are called “propositional” because they have the same expressive power as propositional logic

What if problem involves relationships between examples (shape problem)

Can’t be expressed with propositional rules

More expressive representation required
The shapes problem

Target concept: standing up

Shaded: standing; unshaded: lying
Can we express this with propositional rules? (please give a try!)
A propositional solution
Generalized rules:
If width > 3.5 and height < 7.0 then lying
If height > 3.5 then standing
Does this make sense? Not really.
A relational solution

Comparing attributes with each other
If width > height then lying
If height > width then standing

Generalizes better to new data

Standard relations: =, <, >

Simple solution: adding extra attributes (e.g. a binary attribute is width < height?)
Width and height of what? A block? A building? etc.
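The "extra attribute" trick can be sketched as follows: a derived binary attribute captures the relation, after which ordinary propositional rules suffice (toy data assumed):

```python
# Sketch: turning the relational test "height > width" into an ordinary
# propositional binary attribute, so a standard learner can use it.
blocks = [{"width": 2, "height": 6},   # toy instances, for illustration
          {"width": 7, "height": 3}]
for b in blocks:
    b["taller_than_wide"] = b["height"] > b["width"]  # derived attribute
    # A plain propositional rule on the derived attribute:
    b["class"] = "standing" if b["taller_than_wide"] else "lying"
print([b["class"] for b in blocks])  # ['standing', 'lying']
```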
Rules with variables

Using variables and multiple relations:
If height(x) > width(x) then standing(x)

Instances can be decomposed:
If height(tower.top) > width(tower.top) then standing(tower.top)
If taller_than(x,y) and taller_than(y,z) then taller_than(x,z)

Recursive definition can be seen as logic program

But: recursive definitions are extremely hard to learn in practice

Luckily: very few practical problems require recursion

Thus: many techniques are restricted to nonrecursive definitions to make learning easier
3.7 Numeric prediction
Ex.,
Predicting CPU performance
Examples: 209 different computer configurations
Linear regression for the CPU data
PRP =
- 56.1
+ 0.049 MYCT
+ 0.015 MMIN
+ 0.006 MMAX
+ 0.630 CACH
- 0.270 CHMIN
+ 1.46 CHMAX
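Evaluating the formula for a configuration is a straightforward weighted sum; the configuration values below are hypothetical, chosen only to exercise the coefficients:

```python
# Evaluating the linear-regression formula above; coefficients are taken
# from the slide, the configuration values are hypothetical.
def predict_prp(MYCT, MMIN, MMAX, CACH, CHMIN, CHMAX):
    return (-56.1 + 0.049 * MYCT + 0.015 * MMIN + 0.006 * MMAX
            + 0.630 * CACH - 0.270 * CHMIN + 1.46 * CHMAX)

print(predict_prp(MYCT=125, MMIN=256, MMAX=6000, CACH=16,
                  CHMIN=4, CHMAX=32))
```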

Regression tree: “decision tree” where each leaf predicts a numeric quantity:
Predicted value is average value of training instances that reach the leaf
Regression tree for the CPU data

Model tree: “Regression tree” with linear regression models at the leaf nodes. Ex., LM1: PRP = 8.29 + 0.004 MMAX + 2.77 CHMIN
3.8 Instance-based representation (IBL)

Simplest form of learning: rote learning

Training instances are searched for instance that most closely resembles new instance

The instances themselves represent the knowledge

Also called instance-based (or case-based) learning

Similarity function defines what’s “learned”

Instance-based learning is lazy learning

Methods:

nearest-neighbor – use the class of the closest neighbor

k-nearest-neighbor – use the majority class (or the average, if the output is numeric) of the k nearest neighbors

How do we find the nearest neighbor?
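A minimal k-nearest-neighbor sketch on a single numeric attribute (toy data; ties among equally distant neighbors are broken arbitrarily by the sort):

```python
from collections import Counter

# Sketch: k-nearest-neighbor on one numeric attribute. Distance is the
# absolute difference; the majority class among the k closest wins.
def knn_classify(training, query, k=3):
    """training: list of (value, class) pairs."""
    nearest = sorted(training, key=lambda vc: abs(vc[0] - query))[:k]
    return Counter(cls for _, cls in nearest).most_common(1)[0][0]

# Toy training data, for illustration only:
data = [(1.0, "a"), (1.5, "a"), (3.0, "b"), (3.2, "b"), (5.0, "b")]
print(knn_classify(data, 1.2))  # -> a (nearest: 1.0, 1.5, 3.0)
print(knn_classify(data, 4.0))  # -> b
```

With k = 1 this reduces to the plain nearest-neighbor rule.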
The distance function – a challenge

Simplest case: one numeric attribute

Distance is the difference between the two attribute values involved (or a function thereof)

Several numeric attributes: normally, Euclidean distance is used and attributes are normalized

Nominal attributes: distance is set to 1 if values are different, 0 if they are equal

Are all attributes equally important?

Weighting the attributes might be necessary

Only those instances involved in a decision need to be stored
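The mixed numeric/nominal distance described above can be sketched as follows (numeric attributes assumed already normalized to [0, 1]):

```python
import math

# Sketch: Euclidean distance over mixed attributes. Numeric attributes
# are assumed pre-normalized to [0, 1]; a nominal attribute contributes
# 1 if the values differ and 0 if they are equal.
def distance(a, b, nominal):
    """a, b: attribute-value dicts; nominal: set of nominal attribute names."""
    total = 0.0
    for attr in a:
        if attr in nominal:
            total += 0.0 if a[attr] == b[attr] else 1.0
        else:
            total += (a[attr] - b[attr]) ** 2
    return math.sqrt(total)

# Toy instances, for illustration only:
x = {"temp": 0.2, "outlook": "sunny"}
y = {"temp": 0.5, "outlook": "rainy"}
print(distance(x, y, nominal={"outlook"}))
```

Per-attribute weights would simply multiply each term of the sum, which is one way to make some attributes count more than others.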
How do we decide?

Noisy instances should be filtered out

Idea: Save a few prototype examples for each class

To make the concept more explicit (i.e., quicker decision):
Rectangular generalizations: create a rectangular region for the examples of a class

Nearest-neighbor rule is used outside rectangles

Rectangles are rules! (But they can be more conservative than “normal” rules.)

Nested rectangles are rules with exceptions
3.9 Representing clusters

Possible outputs:

A diagram that shows how the instances fall into clusters (a)

Associating each instance with a cluster number (b)

An instance may belong to more than one cluster, with an attached probability value for each (c)

There also could be hierarchies of clusters (d)
