Witten’s Chapter 3 Output: Knowledge representation




[Diagram: training instances (attributes, values) are fed to a data mining tool, which produces a description of the concept to be learned (rules, trees, clusters, etc.)]



Representing structural patterns


  • Many different ways of representing patterns

  • Decision trees, rules, instance-based, …

  • Also called “knowledge” representation

  • Representation determines inference method

  • Understanding the output is the key to understanding the underlying learning methods

  • Different types of output for different learning problems (e.g. classification, regression, …)


3.1 Decision tables

  • Most rudimentary form of representing output:

    • Use the same format as input!

Is it trivial? No.

Can we simply reuse the input table as the output? Maybe not.


  • Decision table for the weather problem (WEKA on nominal data)

What’s the challenge?



  • Main challenge: selecting the right attributes
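
A minimal sketch of how a decision table works, in Python (the toy data and attribute names below stand in for the book's weather problem; this is not WEKA's implementation): select a subset of attributes, key each training instance by its values for those attributes, and store the majority class for each key.

from collections import Counter, defaultdict

# Toy stand-in for the nominal weather data.
data = [
    {"outlook": "sunny", "humidity": "high", "play": "no"},
    {"outlook": "sunny", "humidity": "normal", "play": "yes"},
    {"outlook": "overcast", "humidity": "high", "play": "yes"},
    {"outlook": "rainy", "humidity": "high", "play": "no"},
    {"outlook": "rainy", "humidity": "normal", "play": "yes"},
]

def build_decision_table(instances, attributes, class_attr="play"):
    """Key each instance by the selected attributes; store the majority class per key."""
    buckets = defaultdict(list)
    for inst in instances:
        key = tuple(inst[a] for a in attributes)
        buckets[key].append(inst[class_attr])
    return {k: Counter(v).most_common(1)[0][0] for k, v in buckets.items()}

table = build_decision_table(data, ["outlook", "humidity"])
print(table[("sunny", "high")])   # -> "no"

Which attributes go into the key is exactly the "selecting the right attributes" challenge above.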



3.2 Decision trees

  • “Divide-and-conquer” approach produces tree

  • Nodes involve testing a particular attribute

  • Usually, an attribute value is compared to a constant

  • Other possibilities:

    • Comparing values of two attributes

    • Using a function on one or more attributes

  • Leaves assign classification, set of classifications, or probability distribution to instances

  • Unknown instance is routed down the tree
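
A minimal sketch of routing an instance down a tree, assuming a simple nested-dict representation invented here for illustration (real learners use richer structures):

# A tiny decision tree as nested dicts: internal nodes test one attribute,
# leaves hold a class label. Attribute names echo the contact lens example.
tree = {
    "attribute": "tear_rate",
    "branches": {
        "reduced": "none",  # leaf
        "normal": {
            "attribute": "astigmatic",
            "branches": {"no": "soft", "yes": "hard"},
        },
    },
}

def classify(node, instance):
    """Route an unknown instance from the root down to a leaf."""
    while isinstance(node, dict):  # internal node: follow the matching branch
        node = node["branches"][instance[node["attribute"]]]
    return node  # leaf = class label

print(classify(tree, {"tear_rate": "normal", "astigmatic": "yes"}))  # -> hard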

Example:



The contact lenses data



A decision tree for this problem


Nominal and numeric attributes


  • Nominal attribute: the number of children is usually equal to the number of attribute values

    • An attribute won’t get tested more than once

Why?

    • Other possibility: division into two subsets




  • Numeric attribute: test whether the value is greater or less than a constant

    • An attribute may get tested several times

Can we avoid that?

Other possibility: three-way split (or multi-way split)



      • Integer: less than, equal to, greater than

      • Real: below, within, above
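
As a small illustration (a sketch, not any particular learner's code), the node tests for the two attribute types might look like:

def nominal_test(instance, attribute):
    """Nominal: one child per attribute value; the value names the branch."""
    return instance[attribute]

def numeric_binary_test(value, threshold):
    """Numeric: the usual binary split against a constant."""
    return "below" if value < threshold else "above"

def integer_three_way_test(value, constant):
    """Integer: three-way split into less than / equal to / greater than."""
    if value < constant:
        return "less than"
    if value == constant:
        return "equal to"
    return "greater than"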


Missing values

  • Does absence of value have some significance?

  • Yes - “missing” can be considered a separate value

Is a couple of missing years from your resume significant?

  • No - “missing” must be treated in a special way

    • Solution A: assign instance to most popular branch

    • Solution B: split instance into pieces

      • Pieces receive weight according to fraction of training instances that go down each branch

      • Classifications from leaf nodes are combined using the weights that have percolated down to them (see the sketch below)
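
A minimal sketch of Solution B, reusing the nested-dict tree shape from the earlier routing sketch; the per-branch training fractions (the "fractions" field) are an assumption added here for illustration:

from collections import Counter

def classify_with_missing(root, instance):
    """Split an instance with a missing value into weighted pieces, one per
    branch; combine the leaf classifications using the percolated weights."""
    votes = Counter()

    def walk(node, weight):
        if not isinstance(node, dict):       # leaf: record a weighted vote
            votes[node] += weight
            return
        attr = node["attribute"]
        if attr in instance:                 # value known: follow one branch
            walk(node["branches"][instance[attr]], weight)
        else:                                # missing: split by training fractions
            for value, child in node["branches"].items():
                walk(child, weight * node["fractions"][value])

    walk(root, 1.0)
    return votes.most_common(1)[0][0]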


Size of tree: big vs. small, which is better?
Data from labor negotiations: A more realistic example




Decision trees for the labor data


Note

  1. Neither tree classifies all contracts correctly.

  2. Tree 2 misclassifies fewer of the training examples.

  3. Which tree is better?

Tree 1: when all else is equal, choose the simpler model.
3.3 Classification rules

  • Popular alternative to decision trees

  • Antecedent (pre-condition): a series of tests (just like the tests at the nodes of a decision tree)

  • Tests are usually logically ANDed together (but may also be general logical expressions)

  • Consequent (conclusion): classes, set of classes, or probability distribution assigned by rule

  • Individual rules are often logically ORed together

  • Conflicts arise if different conclusions apply (more about this later)

Example:


A complete and correct rule set

If tear production rate = reduced then recommendation = none

If age = young and astigmatic = no and tear production rate = normal
then recommendation = soft

If age = pre-presbyopic and astigmatic = no and
tear production rate = normal then recommendation = soft

If age = presbyopic and spectacle prescription = myope and
astigmatic = no then recommendation = none

If spectacle prescription = hypermetrope and astigmatic = no and
tear production rate = normal then recommendation = soft

If spectacle prescription = myope and astigmatic = yes and
tear production rate = normal then recommendation = hard

If age = young and astigmatic = yes and tear production rate = normal
then recommendation = hard

If age = pre-presbyopic and spectacle prescription = hypermetrope
and astigmatic = yes then recommendation = none

If age = presbyopic and spectacle prescription = hypermetrope and
astigmatic = yes then recommendation = none
Decision Tree = Classification rules?
From trees to rules


  • Easy: converting a tree into a set of rules

    • One rule for each leaf:

      • Antecedent contains a condition for every node on the path from the root to the leaf

      • Consequent is class assigned by the leaf

  • Produces rules that are unambiguous

    • It does not matter in which order the rules are listed, because exactly one rule applies to each instance

  • But: resulting rules may be unnecessarily complex

    • Not as good as ones generated directly

    • Pruning to remove redundant tests/rules
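
A minimal sketch of the easy direction, reusing the nested-dict "tree" from the routing sketch above: enumerate every root-to-leaf path and emit one rule per leaf.

def tree_to_rules(node, path=()):
    """Antecedent = the tests on the path from root to leaf;
    consequent = the class assigned by the leaf."""
    if not isinstance(node, dict):           # leaf: one complete rule
        return [(list(path), node)]
    rules = []
    for value, child in node["branches"].items():
        test = f"{node['attribute']} = {value}"
        rules.extend(tree_to_rules(child, path + (test,)))
    return rules

for antecedent, consequent in tree_to_rules(tree):
    print("If", " and ".join(antecedent), "then", consequent)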


From rules to trees


  • More difficult: transforming a rule set into a tree

If a and b then x

If c and d then x


Class exercise: create a decision tree for these two rules.


    • Symmetry needs to be broken

    • Corresponding tree contains identical subtrees (==> “replicated subtree problem”)


A tree for a simple disjunction

Another Example:

The exclusive-or problem

The rule set is not much smaller than the tree.


However, it could be much worse.

A tree with a replicated subtree


Having a default value reduces the number of rules required.

(WEKA's Ridor ripple-down rule learner)

“Nuggets” of knowledge


  • Are rules independent pieces of knowledge? (It seems easy to add a rule to an existing rule base.)

Can we make the last rule independent of the first 2?

No, we cannot ignore how the rules are executed.




  • Two ways of executing a rule set:

    • Ordered set of rules (“decision list”)

    • Unordered set of rules

      • Each rule is independent

      • Rules may overlap and lead to different conclusions for the same instance.
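
A minimal sketch contrasting the two execution modes; representing rules as (test, class) pairs is an ad-hoc choice made here for illustration:

rules = [
    (lambda x: x["tear_rate"] == "reduced", "none"),
    (lambda x: x["astigmatic"] == "no", "soft"),
    (lambda x: x["astigmatic"] == "yes", "hard"),
]

def decision_list(rules, instance, default=None):
    """Ordered rules: the first matching rule wins."""
    for test, cls in rules:
        if test(instance):
            return cls
    return default

def unordered_set(rules, instance):
    """Unordered rules: gather every matching conclusion;
    more than one distinct class is a conflict to resolve."""
    return {cls for test, cls in rules if test(instance)}

inst = {"tear_rate": "reduced", "astigmatic": "yes"}
print(decision_list(rules, inst))  # -> none (first match wins)
print(unordered_set(rules, inst))  # -> {'none', 'hard'} (overlap = conflict)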


Interpreting rules

  • What if two or more rules conflict?

    • Give no conclusion at all?

    • Go with rule that is most popular on training data?

    • Methods used in knowledge-based systems: use probabilities (or certainty factors)




  • What if no rule applies to a test instance?

    • Give no conclusion at all?

    • Go with class that is most frequent in training data?




  • Make sure this does not happen by checking the consistency of the rule base (rule base verification and validation).


Special case: Boolean class

  • Assumption: if instance does not belong to class “yes”, it belongs to class “no”

  • Trick: only learn rules for class “yes” and use default rule for “no”

If x = 1 and y = 1 then class = a

If z = 1 and w = 1 then class = a

//Otherwise class = b


  • Order of rules is not important. No conflicts!

  • Rule can be written in disjunctive normal form

Ex.,

(x = 1 AND y = 1) OR (z = 1 AND w = 1)



3.4 Association rules

  • Association rules

    • can predict value(s) of any attribute or combinations of attributes

    • are not intended to be used together as a set

  • Problem: huge number of possible associations

  • Output needs to be restricted to show only the most predictive associations ==> only those with high support (i.e., coverage) and high confidence (i.e., accuracy)

Ex., Apriori algorithm


Support and confidence of a rule

  • Support: the number (or proportion) of instances that the rule predicts correctly, out of all instances

  • Confidence: number of correct predictions, as proportion of all instances that rule applies to

  • Weather data Example:

4 cool days with normal humidity ==>

If temperature = cool then humidity = normal

Support = 4 (or 4/14), confidence = 100%


  • Normally: minimum support and confidence specified in the mining algorithm (e.g. 58 rules with support >= 2 and confidence >= 95% for weather data)
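
A minimal sketch of computing the two measures (the four-instance dataset below is a made-up stand-in for the 14-day weather data):

def support_confidence(instances, antecedent, consequent):
    """Support: instances where antecedent and consequent both hold, over all
    instances. Confidence: that same count, over the instances the rule applies to."""
    applies = [x for x in instances if antecedent(x)]
    correct = [x for x in applies if consequent(x)]
    return len(correct) / len(instances), len(correct) / len(applies)

days = [
    {"temperature": "cool", "humidity": "normal"},
    {"temperature": "cool", "humidity": "normal"},
    {"temperature": "hot", "humidity": "high"},
    {"temperature": "mild", "humidity": "normal"},
]
support, confidence = support_confidence(
    days,
    antecedent=lambda x: x["temperature"] == "cool",
    consequent=lambda x: x["humidity"] == "normal",
)
print(support, confidence)  # -> 0.5 1.0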


Interpreting association rules

  • Interpretation is not obvious:


Rule1: If windy = false and play = no

then outlook = sunny and humidity = high
Rule2: If windy = false and play = no then outlook = sunny

Rule3: If windy = false and play = no then humidity = high
Is rule1 = rule2 && rule3?

Think in terms of confidence (e.g., 75%): if rule1 holds with 100% confidence, then rule2 and rule3 must hold too, but at lower confidence the separate rules can have different confidences than the combined one.

Does rule1 = rule4 && rule5?


Rule4: If windy = false and play = no and humidity = high

then outlook = sunny

Rule5: If windy = false and play = no and outlook = sunny

then humidity = high

3.5 Rules with exceptions
Every rule has exceptions - Exception handling
Example:
Classifying iris flowers


A rule for the iris data
If petal-length >= 2.45 and petal-length < 4.45

then type = Iris-versicolor


New instance:


What should we do with a rule when we find an exception case?

  1. Remove

  2. Modify: Change >2.45 to >2.6


Rules with exceptions

  • Idea: allow rules to have exceptions instead of deleting the rule




  • Modified rule:

If petal-length >= 2.45 and petal-length < 4.45 then Iris-versicolor
EXCEPT if petal-width < 1.0 then Iris-setosa
A more complex example

  • There are always exceptions to exceptions …

Ex., number of days in February

  • Rule can get complicated quickly. Ex.,

Default: Iris-setosa

except if petal-length >= 2.45 and petal-length < 5.355
       and petal-width < 1.75
       then Iris-versicolor

       except if petal-length >= 4.95 and petal-width < 1.55
              then Iris-virginica

       else if sepal-length < 4.95 and sepal-width >= 2.45
              then Iris-virginica

else if petal-length >= 3.35
     then Iris-virginica

     except if petal-length < 4.85 and sepal-length < 5.95
            then Iris-versicolor
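
Since default...except...then is logically equivalent to if...then...else (as noted below), the nested rule above can be transcribed directly; a minimal sketch:

def classify_iris(pl, pw, sl, sw):
    """The nested default/except rule above as plain if/else
    (pl/pw = petal length/width, sl/sw = sepal length/width)."""
    if 2.45 <= pl < 5.355 and pw < 1.75:
        if pl >= 4.95 and pw < 1.55:         # exception to the exception
            return "Iris-virginica"
        if sl < 4.95 and sw >= 2.45:
            return "Iris-virginica"
        return "Iris-versicolor"
    if pl >= 3.35:
        if pl < 4.85 and sl < 5.95:          # exception
            return "Iris-versicolor"
        return "Iris-virginica"
    return "Iris-setosa"                     # the default

print(classify_iris(pl=4.0, pw=1.2, sl=5.5, sw=3.0))  # -> Iris-versicolor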
Advantages of using exceptions


  • Rules can be updated incrementally

    • Easy to incorporate new data

    • Easy to incorporate domain knowledge

  • People often think in terms of exceptions

  • Each conclusion can be considered just in the context of rules and exceptions that lead to it

    • Locality property is important for understanding large rule sets.

    • “Normal” rule sets don’t offer this advantage



More on exceptions

  • Default...except if...then...

is logically equivalent to

if...then...else

(where the else specifies what the default did)



  • But: exceptions offer a psychological advantage

    • Assumption: defaults and tests early on apply more widely than exceptions further down

    • Exceptions reflect special cases


3.6 Rules involving relations

  • So far: all rules involved comparing an attribute-value to a constant (e.g. temperature < 45)

  • These rules are called “propositional” because they have the same expressive power as propositional logic

  • What if problem involves relationships between examples (shape problem)

    • Can’t be expressed with propositional rules

    • More expressive representation required


The shapes problem

  • Target concept: standing up

  • Shaded: standing; unshaded: lying



Can we express this with propositional rules? (Give it a try!)
A propositional solution

Generalized rules:
If width > 3.5 and height < 7.0 then lying

If height > 3.5 then standing


Do these rules make sense? Not really.
A relational solution

  • Comparing attributes with each other

If width > height then lying

If height > width then standing




  • Generalizes better to new data

  • Standard relations: =, <, >

  • Simple solution: adding extra attributes (e.g. a binary attribute is width < height?)

Width and height of what: the block? the building? etc.

Rules with variables


  • Using variables and multiple relations:

If height(x) > width(x) then standing(x)

  • Instances can be decomposed:

If height(tower.top) > width(tower.top) then standing(tower.top)


  • Recursive definition!

If taller_than(x,y) and taller_than(y,z) then taller_than(x,z)


    • Recursive definition can be seen as logic program

    • But: recursive definitions are extremely hard to learn in practice

    • Luckily: very few practical problems require recursion

    • Thus: many techniques are restricted to non-recursive definitions to make learning easier
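
A minimal sketch of the recursive taller_than rule executed as a logic-program-style query (assuming the base facts form an acyclic set, so the recursion terminates):

def taller_than(x, y, facts):
    """taller_than(x, z) holds if taller_than(x, y) and taller_than(y, z).
    `facts` holds the base (taller, shorter) pairs; assumed acyclic."""
    if (x, y) in facts:                       # base case: a stored fact
        return True
    # recursive case: x is taller than some z that is taller than y
    return any(taller_than(z, y, facts) for (a, z) in facts if a == x)

facts = {("tower", "house"), ("house", "shed")}
print(taller_than("tower", "shed", facts))    # -> True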


3.7 Numeric prediction


Ex.,
Predicting CPU performance

Examples: 209 different computer configurations





Linear regression for the CPU data

PRP = -56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX
      + 0.630 CACH - 0.270 CHMIN + 1.46 CHMAX


  • Regression tree: “decision tree” where each leaf predicts a numeric quantity:

Predicted value is average value of training instances that reach the leaf

Regression tree for the CPU data





  • Model tree: “Regression tree” with linear regression models at the leaf nodes. Ex., LM1: PRP = 8.29 + 0.004 MMAX + 2.77 CHMIN
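
A minimal sketch of the linear-model vs. regression-tree contrast using scikit-learn. The data below is random stand-in data, not the actual 209 CPU configurations; scikit-learn has no model-tree learner, so only the first two representations are shown (in WEKA, M5P builds model trees).

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Random stand-in for the CPU data: columns MYCT, MMIN, MMAX, CACH, CHMIN, CHMAX.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1000, size=(209, 6))
y = X @ np.array([0.049, 0.015, 0.006, 0.630, -0.270, 1.46]) - 56.1

linear = LinearRegression().fit(X, y)                    # one global linear equation
reg_tree = DecisionTreeRegressor(max_depth=3).fit(X, y)  # each leaf predicts the mean
                                                         # of its training instances
print(linear.intercept_, linear.coef_)                   # recovers the equation above
print(reg_tree.predict(X[:1]), linear.predict(X[:1]))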



3.8 Instance-based representation (IBL)

  • Simplest form of learning: rote learning

    • Training instances are searched for instance that most closely resembles new instance

    • The instances themselves represent the knowledge

    • Also called instance-based (or case-based) learning

  • Similarity function defines what’s “learned”

  • Instance-based learning is lazy learning

  • Methods:

    • nearest-neighbor: use the class of the closest neighbor

    • k-nearest-neighbor: use the majority class (or the average, if the output is numeric) of the k nearest neighbors

    • How do we find the nearest neighbor?


The distance function – a challenge

  • Simplest case: one numeric attribute

    • Distance is the difference between the two attribute values involved (or a function thereof)

  • Several numeric attributes: normally, Euclidean distance is used and attributes are normalized

  • Nominal attributes: distance is set to 1 if values are different, 0 if they are equal

  • Are all attributes equally important?

    • Weighting the attributes might be necessary
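
A minimal sketch of such a distance function plus a 1-NN lookup (the attribute names, ranges, and toy instances below are invented for illustration):

import math

def distance(a, b, numeric, nominal, ranges):
    """Euclidean distance: numeric attributes are normalized to [0, 1] using
    their ranges; nominal attributes contribute 1 if different, 0 if equal."""
    d2 = 0.0
    for attr in numeric:
        lo, hi = ranges[attr]
        d2 += ((a[attr] - b[attr]) / (hi - lo)) ** 2
    for attr in nominal:
        d2 += 0.0 if a[attr] == b[attr] else 1.0
    return math.sqrt(d2)

def nearest_neighbor_class(train, query, **kw):
    """1-NN: predict the class of the closest stored instance."""
    return min(train, key=lambda inst: distance(inst, query, **kw))["class"]

train = [
    {"temp": 64, "outlook": "sunny", "class": "yes"},
    {"temp": 85, "outlook": "rainy", "class": "no"},
]
print(nearest_neighbor_class(
    train, {"temp": 70, "outlook": "sunny"},
    numeric=["temp"], nominal=["outlook"], ranges={"temp": (60, 90)},
))  # -> yes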




  • Only those instances involved in a decision need to be stored

How do we decide?

  • Noisy instances should be filtered out

    • Idea: Save a few prototype examples for each class








  • To make the concept more explicit (i.e., quicker decision):

Rectangular generalizations: create a rectangular region around the examples of each class



  • Nearest-neighbor rule is used outside rectangles

  • Rectangles are rules! (But they can be more conservative than “normal” rules.)

  • Nested rectangles are rules with exceptions


3.9 Representing clusters

  • Possible outputs:

    • A diagram that shows how the instances fall into clusters (a)

    • Associate each instance to a cluster number (b)

  • An instance may belong to more than one cluster, with an attached probability value for each (c)

  • There also could be hierarchies of clusters (d)






