Machine Learning and Inductive Inference Hendrik Blockeel 2001-2002
1 Introduction Practical information What is "machine learning and inductive inference"? What is it useful for? (some example applications) Different learning tasks Data representation Brief overview of approaches Overview of the course
Practical information about the course 10 lectures (2h) + 4 exercise sessions (2.5h) Audience with diverse backgrounds Course material: Book: Machine Learning (Mitchell, 1997, McGraw-Hill) Slides & notes, http://www.cs.kuleuven.ac.be/~hendrik/ML/ Examination: oral exam (20') with written preparation (+/- 2h) 2/3 theory, 1/3 exercises Only topics discussed in lectures / exercises
What is machine learning? Study of how to make programs improve their performance on certain tasks from own experience "performance" = speed, accuracy, ... "experience" = set of previously seen cases ("observations") For instance (simple method): experience: taking action A in situation S yielded result R situation S arises again if R was undesirable: try something else if R was desirable: try action A again
This is a very simple example: it only works if precisely the same situation is encountered what if a similar situation arises? Need for generalisation how about choosing another action even if a good one is already known? (you might find a better one) Need for exploration This course focuses mostly on generalisation or inductive inference
Inductive inference = Reasoning from specific to general e.g. statistics: from sample, infer properties of population (figure: sample → population) observation: "these dogs are all brown" hypothesis: "all dogs are brown"
Note: inductive inference is more general than statistics statistics mainly consists of numerical methods for inference infer mean, probability distribution, … of population other approaches: find symbolic definition of a concept (“concept learning”) find laws with complicated structure that govern the data study induction from a logical, philosophical, … point of view …
Applications of inductive inference: Machine learning "sample" of observations = experience generalizing to population = finding patterns in the observations that generally hold and may be used for future tasks Knowledge discovery (Data mining) "sample" = database generalizing = finding patterns that hold in this database and can also be expected to hold on similar data not in the database discovered knowledge = comprehensible description of these patterns ...
What is it useful for? Scientifically: for understanding learning and intelligence in humans and animals interesting for psychologists, philosophers, biologists, … More practically: for building AI systems expert systems that improve automatically with time systems that help scientists discover new laws also useful outside “classical” AI-like applications when we don’t know how to program something ourselves when a program should adapt regularly to new circumstances when a program should tune itself towards its user
Knowledge discovery Scientific knowledge discovery Some “toy” examples: Bacon : rediscovered some laws of physics (e.g. Kepler’s laws of planetary motion) AM: rediscovered some mathematical theorems More serious recent examples: mining the human genome mining the web for information on genes, proteins, … drug discovery context: robots perform lots of experiments at high rate; this yields lots of data, to be studied and interpreted by humans; try to automate this process (because humans can’t keep up with robots)
Example: given molecules that are active against some disease, find out what is common in them; this is probably the reason for their activity.
Data mining in databases, looking for “interesting” patterns e.g. for marketing based on data in DB, who should be interested in this new product? (useful for direct mailing) study customer behaviour to identify typical profiles of customers find out which products in store are often bought together e.g. in hospital: help with diagnosis of patients
Learning to perform difficult tasks Difficult for humans… LEX system : learned how to perform symbolic integration of functions … or “easy” for humans, but difficult to program humans can do it, but can’t explain how they do it e.g.: learning to play games (chess, go, …) learning to fly a plane, drive a car, … recognising faces …
Adaptive systems A robot in a changing environment continuously needs to adapt its behaviour Systems that adapt to the user based on user modelling: observe behaviour of user build model describing this behaviour use model to make user's life easier e.g. adaptive web pages, intelligent mail filters, adaptive user interfaces (e.g. intelligent Unix shell), …
Illustration : building a system that learns checkers Learning = improving on task T, with respect to performance measure P, based on experience E In this example: T = playing checkers P: % of games won in world tournament E: games played against self possible problem: is experience representative for real task? Questions to be answered: exactly what is given, exactly what is learnt, what representation & learning algorithm should we use
What do we want to learn? given board situation, which move to make What is given? direct or indirect evidence ? direct: e.g., which moves were good, which were bad indirect: consecutive moves in game, + outcome of the game in our case: indirect evidence direct evidence would require a teacher
What exactly shall we learn? Choose type of target function: ChooseMove: Board → Move? directly applicable V: Board → ℝ? indicates quality of state; when playing, choose move that leads to best state Note: reasonable definition for V easy to give: V(won) = 100, V(lost) = -100, V(draw) = 0, V(s) = V(e) with e the best state reachable from s when playing optimally Not feasible in practice (requires exhaustive minimax search) … Let's choose the V function here
Choose representation for target function: set of rules? neural network? polynomial function of numerical board features? … Let's choose: V = w1·bp + w2·rp + w3·bk + w4·rk + w5·bt + w6·rt bp, rp: number of black / red pieces bk, rk: number of black / red kings bt, rt: number of black / red pieces threatened wi: constants to be learnt from experience
How to obtain training examples? we need a set of examples [bp, rp, bk, rk, bt, rt, V] bp etc. easy to determine; but how to guess V? we have indirect evidence only! possible method: with V(s) the true target function, V'(s) the learnt function, Vt(s) the training value for a state s: Vt(s) ← V'(Successor(s)) adapt V' using the Vt values (making V' and Vt converge) hope that V' will converge to V intuitively: V for end states is known; propagate V values from later states to earlier states in the game
Training algorithm: how to adapt the weights wi? possible method: look at the "error": error(s) = Vt(s) - V'(s) adapt weights so that the error is reduced e.g. using gradient descent: for each feature fi: wi ← wi + c·fi·error(s) with c some small constant
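To make the weight-update rule concrete, here is a minimal Python sketch (function names are our own; the feature order [bp, rp, bk, rk, bt, rt] and the LMS-style rule follow the slides):

def V(features, w):
    # linear evaluation function: V = w1*bp + w2*rp + w3*bk + w4*rk + w5*bt + w6*rt
    return sum(wi * fi for wi, fi in zip(w, features))

def lms_update(w, features, v_train, c=0.1):
    # error(s) = Vt(s) - V'(s); for each feature fi: wi <- wi + c * fi * error(s)
    error = v_train - V(features, w)
    return [wi + c * fi * error for wi, fi in zip(w, features)]

# e.g. one training step on a board with features [bp, rp, bk, rk, bt, rt]:
w = lms_update([0.0] * 6, [8, 8, 0, 0, 1, 1], v_train=0.0)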
Overview of design choices: type of training experience (games against self / games against expert / table of good moves) → type of target function (Board → ℝ or Board → Move) → representation (linear function of 6 features / …) → learning algorithm (gradient descent / …) → ready!
Some issues that influence choices Which algorithms useful for what type of functions? How is learning influenced by # training examples complexity of hypothesis (function) representation noise in the data Theoretical limits of learning? Can we help the learner with prior knowledge? Could a system alter its representation itself? …
Typical learning tasks Concept learning learn a definition of a concept supervised vs. unsupervised Function learning ("predictive modelling") Discrete ("classification") or continuous ("regression") Concept = function with boolean result Clustering Finding descriptive patterns
Concept learning: supervised Given positive (+) and negative (-) examples of a concept, infer properties that cause instances to be positive or negative (= concept definition) (figure: instance space X with + and - examples, and the learned concept C: X → {true,false})
Concept learning: unsupervised Given examples of instances Invent reasonable concepts (= clustering) Find definitions for these concepts (figure: instances in X grouped into clusters C1, C2, C3) Cf. taxonomy of animals, identification of market segments, ...
Function learning Generalises over concept learning Learn function f: X → S where S is finite set of values: classification S is a continuous range of reals: regression (figure: instances in X labelled with values such as 1.4, 2.7, 0.6, 2.1, 0.9, and the learned function f)
Clustering Finding groups of instances that are similar May be a goal in itself (unsupervised classification) ... but also used for other tasks regression flexible prediction: when it is not known in advance which properties to predict from which other properties (figure: instances in X, and the same instances grouped into clusters)
Finding descriptive patterns Descriptive patterns = any kind of patterns, not necessarily directly useful for prediction Generalises over predictive modelling (= finding predictive patterns) Examples of patterns: "fast cars usually cost more than slower cars" "people are never married to more than one person at the same time"
Representation of data Numerical data: instances are points in ℝⁿ Many techniques focus on this kind of data Symbolic data (true/false, black/white/red/blue, ...) Can be converted to numeric data Some techniques work directly with symbolic data Structural data Instances have internal structure (graphs, sets, …; cf. molecules) Difficult to convert to simpler format Few techniques can handle these directly
Brief overview of approaches Symbolic approaches: Version Spaces, Induction of decision trees, Induction of rule sets, inductive logic programming, … Numeric approaches: neural networks, support vector machines, … Probabilistic approaches (“bayesian learning”) Miscellaneous: instance based learning, genetic algorithms, reinforcement learning
Overview of the course Introduction (today) (Ch. 1) Concept-learning: Versionspaces (Ch. 2 - brief) Induction of decision trees (Ch. 3) Artificial neural networks (Ch. 4 - brief) Evaluating hypotheses (Ch. 5) Bayesian learning (Ch. 6) Computational learning theory (Ch. 7) Support vector machines (brief)
Instance-based learning (Ch. 8) Genetic algorithms (Ch. 9) Induction of rule sets & association rules (Ch. 10) Reinforcement learning (Ch. 13) Clustering Inductive logic programming Combining different models bagging, boosting, stacking, …
2 Version Spaces Recall basic principles from AI course stressing important concepts for later use Difficulties with version space approaches Inductive bias (Mitchell, Ch. 2)
Basic principles Concept learning as search given: hypothesis space H and data set S find: all h ∈ H consistent with S this set is called the version space, VS(H,S) How to search in H enumerate all h in H: not feasible prune search using some generality ordering h1 more general than h2 ⇔ (∀x: x ∈ h2 ⇒ x ∈ h1) See Mitchell Chapter 2 for examples
An example +: belongs to concept; -: does not S = set of these + and - examples Assume hypotheses are rectangles I.e., H = set of all rectangles VS(H,S) = set of all rectangles that contain all + and no - (figure: + and - examples in the plane)
Example of consistent hypothesis: green rectangle (figure: a rectangle containing all + and no - examples)
h1 more general than h2 ⇔ h2 lies totally inside h1 (figure: h2 more specific than h1, h3 incomparable with h1)
Version space boundaries Bound versionspace by giving its most specific (S) and most general (G) borders S: rectangles that cannot become smaller without excluding some + G: rectangles that cannot become larger without including some - Any hypothesis h consistent with the data must be more general than some element in S must be more specific than some element in G Thus, G and S completely specify VS
Example, continued So what are S and G here? S = {h1}, G = {h2, h3} (figure: h1 the most specific hypothesis, h2 a most general hypothesis, h3 another most general hypothesis)
Computing the version space Computing G and S is sufficient to know the full versionspace Algorithms in Mitchell’s book: FindS: computes only S set S is always singleton in Mitchell’s examples Candidate Elimination: computes S and G
Candidate Elimination Algorithm: demonstration with rectangles Algorithm: see Mitchell Representation: Concepts are rectangles Rectangle represented with 2 attributes: <Xmin-Xmax, Ymin-Ymax> Graphical representation: hypothesis consistent with data if all + inside rectangle no - inside rectangle
Start: S = {none}, G = {all} S = {<∅,∅>} G = {<1-6,1-6>} (figure: empty 6×6 grid with G as the outer rectangle)
Start: S = {none}, G = {all} Example e1 = (3,2): + appears, not covered by S S = {<∅,∅>} G = {<1-6,1-6>}
Start: S = {none}, G = {all} Example e1 = (3,2): + appears, not covered by S S is extended to cover e1 S = {<3-3,2-2>} G = {<1-6,1-6>}
Start: S = {none}, G = {all} Example e1 = (3,2): + appears, not covered by S S is extended to cover e1 Example e2 = (5,4): - appears, covered by G S = {<3-3,2-2>} G = {<1-6,1-6>}
Start: S = {none}, G = {all} Example e1 = (3,2): + appears, not covered by S S is extended to cover e1 Example e2 = (5,4): - appears, covered by G G is changed to avoid covering e2 note: G now consists of 2 parts each part covers all + and no - S = {<3-3,2-2>} G = {<1-4,1-6>, <1-6,1-3>}
Start: S = {none}, G = {all} Example e1 = (3,2): + appears, not covered by S S is extended to cover e1 Example e2 = (5,4): - appears, covered by G G is changed to avoid covering e2 Example e3 = (2,4): - appears, covered by G S = {<3-3,2-2>} G = {<1-4,1-6>, <1-6,1-3>}
Start: S = {none}, G = {all} Example e1 = (3,2): + appears, not covered by S S is extended to cover e1 Example e2 = (5,4): - appears, covered by G G is changed to avoid covering e2 Example e3 = (2,4): - appears, covered by G One part of G is affected & reduced S = {<3-3,2-2>} G = {<3-4,1-6>, <1-6,1-3>}
Start: S = {none}, G = {all} Example e1 = (3,2): + appears, not covered by S S is extended to cover e1 Example e2 = (5,4): - appears, covered by G G is changed to avoid covering e2 Example e3 = (2,4): - appears, covered by G One part of G is affected & reduced Example e4 = (5,3): + appears, not covered by S S = {<3-3,2-2>} G = {<3-4,1-6>, <1-6,1-3>}
Start: S = {none}, G = {all} Example e1 = (3,2): + appears, not covered by S S is extended to cover e1 Example e2 = (5,4): - appears, covered by G G is changed to avoid covering e2 Example e3 = (2,4): - appears, covered by G One part of G is affected & reduced Example e4 = (5,3): + appears, not covered by S S is extended to cover e4 S = {<3-5,2-3>} G = {<3-4,1-6>, <1-6,1-3>}
Start: S = {none}, G = {all} Example e1 = (3,2): + appears, not covered by S S is extended to cover e1 Example e2 = (5,4): - appears, covered by G G is changed to avoid covering e2 Example e3 = (2,4): - appears, covered by G One part of G is affected & reduced Example e4 = (5,3): + appears, not covered by S S is extended to cover e4 Part of G not covering new S is removed S = {<3-5,2-3>} G = {<1-6,1-3>}
Current versionspace contains all rectangles covering S and covered by G, e.g. h = <2-5,2-3> S = {<3-5,2-3>} G = {<1-6,1-3>}
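The whole run above can be replayed in code. The sketch below is our own simplified rendering of candidate elimination for this rectangle space (hypotheses are tuples (xmin, xmax, ymin, ymax) on the integer grid; S is kept as a single bounding box), not Mitchell's exact pseudocode:

def covers(h, p):
    (x1, x2, y1, y2), (x, y) = h, p
    return x1 <= x <= x2 and y1 <= y <= y2

def more_general(h1, h2):
    # h1 is more general than (or equal to) h2: h2 lies totally inside h1
    return h1[0] <= h2[0] and h1[1] >= h2[1] and h1[2] <= h2[2] and h1[3] >= h2[3]

def candidate_elimination(examples, domain=(1, 6)):
    lo, hi = domain
    S = None                      # most specific: the "empty" hypothesis <∅,∅>
    G = [(lo, hi, lo, hi)]        # most general: the whole domain
    for (x, y), label in examples:
        if label == '+':
            # minimally generalize S to cover the +; drop g's that miss it
            S = (x, x, y, y) if S is None else \
                (min(S[0], x), max(S[1], x), min(S[2], y), max(S[3], y))
            G = [g for g in G if covers(g, (x, y))]
        else:
            # minimally specialize every g in G that covers the -
            newG = []
            for g in G:
                if not covers(g, (x, y)):
                    newG.append(g)
                    continue
                x1, x2, y1, y2 = g
                for spec in [(x1, x - 1, y1, y2), (x + 1, x2, y1, y2),
                             (x1, x2, y1, y - 1), (x1, x2, y + 1, y2)]:
                    if spec[0] <= spec[1] and spec[2] <= spec[3] \
                       and (S is None or more_general(spec, S)):
                        newG.append(spec)
            # keep only the maximal elements of G
            G = [g for g in newG
                 if not any(h != g and more_general(h, g) for h in newG)]
    return S, G

# Replays the demonstration: S = (3, 5, 2, 3), G = [(1, 6, 1, 3)]
print(candidate_elimination([((3, 2), '+'), ((5, 4), '-'),
                             ((2, 4), '-'), ((5, 3), '+')]))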
Interesting points: We here use an extended notion of generality In book: ∅ < value < ? Here: e.g. ∅ < 2-3 < 2-5 < 1-5 < ? We still use a conjunctive concept definition each concept is 1 rectangle this could be extended as well (but complicated)
Difficulties with version space approaches Idea of VS provides nice theoretical framework But not very useful for most practical problems Difficulties with these approaches: Not very efficient Borders G and S may be very large (may grow exponentially) Not noise resistant VS “collapses” when no consistent hypothesis exists often we would like to find the “best” hypothesis in this case in Mitchell’s examples: only conjunctive definitions We will compare with other approaches...
Inductive bias After having seen a limited number of examples, we believe we can make predictions for unseen cases. From seen cases to unseen cases = inductive leap Why do we believe this ? Is there any guarantee this prediction will be correct ? What extra assumptions do we need to guarantee correctness? Inductive bias : minimal set of extra assumptions that guarantees correctness of inductive leap
Equivalence between inductive and deductive systems (figure): inductive system: training examples + new instance → result (by inductive leap); deductive system: training examples + new instance + inductive bias → result (by proof)
Definition of inductive bias More formal definition of inductive bias (Mitchell): L(x,D) denotes the classification assigned to instance x by learner L after training on D The inductive bias of L is any minimal set of assertions B such that for any target concept c and corresponding training examples D, ∀x ∈ X: (B ∧ D ∧ x) ⊢ L(x,D)
Effect of inductive bias Different learning algorithms give different results on same dataset because each may have a different bias Stronger bias means less learning more is assumed in advance Is learning possible without any bias at all? I.e., “pure” learning, without any assumptions in advance The answer is No .
Inductive bias of version spaces Bias of candidate elimination algorithm: target concept is in H H typically consists of conjunctive concepts in our previous illustration, rectangles H could be extended towards disjunctive concepts Is it possible to use version spaces with H = set of all imaginable concepts, thereby eliminating all bias ?
Unbiased version spaces Let U be the example domain Unbiased: target concept C can be any subset of U hence, H = 2^U Consider VS(H,D) with D a strict subset of U Assume you see an unseen instance x (x ∈ U \ D) For each h ∈ VS that predicts x ∈ C, there is an h' ∈ VS that predicts x ∉ C, and vice versa just take h = h' ∪ {x}: since x ∉ D, h and h' are exactly the same w.r.t. D; so either both are in VS, or neither is
Conclusion: version spaces without any bias do not allow generalisation To be able to make an inductive leap, some bias is necessary. We will see many different learning algorithms that all differ in their inductive bias. When choosing one in practice, bias should be an important criterion unfortunately: not always well understood…
To remember Definition of version space, importance of generality ordering for searching Definition of inductive bias, practical importance, why it is necessary for learning, how it relates inductive systems to deductive systems
3 Induction of decision trees What are decision trees? How can they be induced automatically? top-down induction of decision trees avoiding overfitting converting trees to rules alternative heuristics a generic TDIDT algorithm (Mitchell, Ch. 3)
What are decision trees? Represent sequences of tests According to outcome of test, perform a new test Continue until the result is known Cf. guessing a person using only yes/no questions: ask some question depending on answer, ask a new question continue until answer known
Example decision tree 1 Mitchell's example: Play tennis or not? (depending on weather conditions) (tree:) Outlook? Sunny → Humidity? (High → No, Normal → Yes); Overcast → Yes; Rainy → Wind? (Strong → No, Weak → Yes)
Example decision tree 2 Again from Mitchell: tree for predicting whether C-section necessary Leaves are not pure here; ratio pos/neg is given (tree:) Fetal_Presentation? 1 → Previous_Csection? (0 → Primiparous? …; 1 → [55+,35-] .61+ .39-); 2 → [3+,29-] .11+ .89-; 3 → [8+,22-] .27+ .73-
Representation power Typically: examples represented by array of attributes 1 node in tree tests value of 1 attribute 1 child node for each possible outcome of test Leaf nodes assign classification Note: tree can represent any boolean function i.e., also disjunctive concepts (<-> VS examples) tree can allow noise (non-pure leaves)
Representing boolean formulae E.g., A ∨ B (tree: test A; true → true, false → test B; true → true, false → false) Similarly (try yourself): A ∧ B, A xor B, (A ∧ B) ∨ (C ∧ ¬D ∧ E) "M of N" (at least M out of N propositions are true) What about complexity of tree vs. complexity of original formula?
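As a small illustration, the tree above corresponds directly to nested tests; a Python sketch (one if per internal node):

def a_or_b(a: bool, b: bool) -> bool:
    # tree for A or B: root tests A; only the A = false branch tests B
    if a:
        return True
    return b

def a_xor_b(a: bool, b: bool) -> bool:
    # A xor B needs a test of B in both branches of A
    if a:
        return not b
    return b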
Classification, Regression and Clustering trees Classification trees represent function X -> C with C discrete (like the decision trees we just saw) Regression trees predict numbers in leaves could use a constant (e.g., mean), or linear regression model, or … Clustering trees just group examples in leaves Most (but not all) research in machine learning focuses on classification trees
Example decision tree 3 (from study of river water quality) "Data mining" application Given: descriptions of river water samples biological description: occurrence of organisms in water ("abundance", graded 0-5) chemical description: 16 variables (temperature, concentrations of chemicals (NH4, ...)) Question: characterize chemical properties of water using organisms that occur
Clustering tree (figure): internal nodes test organism abundances, e.g. abundance(Tubifex sp., 5)? and abundance(Sphaerotilus natans, 5)?; each leaf predicts all 16 chemical variables (T, pH, cond, O2, O2sat, CO2, hard, NO2, NO3, NH4, PO4, Cl, SiO2, KMnO4, K2Cr2O7, BOD) as "standardized" values (how many standard deviations above the mean)
Top-Down Induction of Decision Trees Basic algorithm for TDIDT: (later more formal version) start with full data set find test that partitions the examples as well as possible "good" = examples with same class, or otherwise similar examples, should be put together for each outcome of test, create child node move examples to children according to outcome of test repeat procedure for each child that is not "pure" Main question: how to decide which test is "best"
Finding the best test (for classification trees) For classification trees: find test for which children are as "pure" as possible Purity measure borrowed from information theory: entropy is a measure of "missing information"; more precisely, #bits needed to represent the missing information, on average, using optimal encoding Given set S with instances belonging to class i with probability pi: Entropy(S) = -∑i pi log2 pi
Entropy Intuitive reasoning: use shorter encoding for more frequent messages information theory: a message with probability p should get -log2 p bits e.g. A, B, C, D each with 25% probability: 2 bits for each (00, 01, 10, 11) if some are more probable, it is possible to do better the average number of bits per message is then -∑i pi log2 pi
Entropy as a function of p, for 2 classes (figure: curve from 0 at p = 0 up to a maximum of 1 at p = 0.5 and back to 0 at p = 1)
Information gain Heuristic for choosing a test in a node: choose that test that on average provides most information about the class this is the test that, on average, reduces class entropy most on average: class entropy reduction differs according to outcome of test expected reduction of entropy = information gain Gain(S,A) = Entropy(S) - ∑v (|Sv|/|S|) Entropy(Sv)
Example Assume S has 9 + and 5 - examples; partition according to Wind or Humidity attribute S: [9+,5-], E = 0.940 Humidity: High → [3+,4-] (E = 0.985), Normal → [6+,1-] (E = 0.592) Wind: Weak → [6+,2-] (E = 0.811), Strong → [3+,3-] (E = 1.0) Gain(S, Humidity) = .940 - (7/14)·.985 - (7/14)·.592 = 0.151 Gain(S, Wind) = .940 - (8/14)·.811 - (6/14)·1.0 = 0.048
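These numbers are easy to reproduce; a small Python sketch (helper names are our own):

import math

def entropy(pos, neg):
    # Entropy(S) = -sum_i pi log2 pi, here for two classes
    total = pos + neg
    return -sum(k / total * math.log2(k / total) for k in (pos, neg) if k)

def info_gain(parent, children):
    # Gain(S,A) = Entropy(S) - sum_v (|Sv|/|S|) Entropy(Sv)
    n = sum(p + q for p, q in children)
    return entropy(*parent) - sum((p + q) / n * entropy(p, q)
                                  for p, q in children)

print(info_gain((9, 5), [(3, 4), (6, 1)]))   # Humidity: ~0.151
print(info_gain((9, 5), [(6, 2), (3, 3)]))   # Wind:     ~0.048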
Assume Outlook was chosen: continue partitioning in child nodes (tree: [9+,5-] at root) Outlook? Sunny → [2+,3-] (?), Overcast → [4+,0-] (Yes), Rainy → [3+,2-] (?)
Hypothesis space search in TDIDT Hypothesis space H = set of all trees H is searched in a hill-climbing fashion, from simple to complex ...
Inductive bias in TDIDT Note: for e.g. boolean attributes, H is complete: each concept can be represented! given n attributes, can keep on adding tests until all attributes tested So what about inductive bias? Clearly no "restriction bias" (H ⊊ 2^U) as in cand. elim. Preference bias: some hypotheses in H are preferred over others In this case: preference for short trees with informative attributes at the top
Occam’s Razor Preference for simple models over complex models is quite generally used in machine learning Similar principle in science: Occam’s Razor roughly: do not make things more complicated than necessary Reasoning, in the case of decision trees: more complex trees have higher probability of overfitting the data set
Avoiding Overfitting Phenomenon of overfitting: keep improving a model, making it better and better on training set by making it more complicated … increases risk of modelling noise and coincidences in the data set may actually harm predictive power of theory on unseen cases Cf. fitting a curve with too many parameters (figure)
Overfitting: example (figure: + and - examples with an overly specific decision boundary; area with probably wrong predictions)
Overfitting: effect on predictive accuracy Typical phenomenon when overfitting: training accuracy keeps increasing accuracy on unseen validation set starts decreasing (figure: accuracy vs. size of tree; accuracy on training data keeps rising, accuracy on unseen data drops from the point where overfitting starts)
How to avoid overfitting when building classification trees? Option 1: stop adding nodes to tree when overfitting starts occurring need stopping criterion Option 2: don’t bother about overfitting when growing the tree after the tree has been built, start pruning it again
Stopping criteria How do we know when overfitting starts? a) use a validation set: data not considered for choosing the best test when accuracy goes down on validation set: stop adding nodes to this branch b) use some statistical test significance test: e.g., is the change in class distribution still significant? (χ²-test) MDL: minimal description length principle fully correct theory = tree + corrections for specific misclassifications minimize size(f.c.t.) = size(tree) + size(misclassifications(tree)) Cf. Occam's razor
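For the significance-test criterion, the class counts per branch form a contingency table; a sketch using scipy (assumed available; the threshold alpha = 0.05 is an arbitrary choice):

from scipy.stats import chi2_contingency

def split_is_significant(children, alpha=0.05):
    # children: one [pos, neg] row per branch, e.g. [[3, 4], [6, 1]];
    # rows or columns that sum to zero should be removed first
    chi2, p_value, dof, expected = chi2_contingency(children)
    return p_value < alpha    # if not significant: stop growing, make a leaf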
Post-pruning trees After learning the tree: start pruning branches away For all nodes in tree: Estimate effect of pruning tree at this node on predictive accuracy e.g. using accuracy on validation set Prune node that gives greatest improvement Continue until no improvements Note : this pruning constitutes a second search in the hypothesis space
(figure: accuracy vs. size of tree, showing the effect of pruning: accuracy on unseen data improves again once superfluous nodes are pruned, while accuracy on the training data drops)
Comparison Advantage of Option 1: no superfluous work But: tests may be misleading E.g., validation accuracy may go down briefly, then go up again Therefore, Option 2 (post-pruning) is usually preferred (though more work, computationally)
Turning trees into rules From a tree a rule set can be derived Path from root to leaf in a tree = 1 if-then rule Advantages of such rule sets: may increase comprehensibility can be pruned more flexibly in 1 rule, 1 single condition can be removed vs. tree: when removing a node, the whole subtree is removed 1 rule can be removed entirely
Rules from trees: example (tree as before: Outlook? Sunny → Humidity? (High → No, Normal → Yes); Overcast → Yes; Rainy → Wind? (Strong → No, Weak → Yes)) if Outlook = Sunny and Humidity = High then No if Outlook = Sunny and Humidity = Normal then Yes …
Pruning rules Possible method: 1. convert tree to rules 2. prune each rule independently remove conditions that do not harm accuracy of rule 3. sort rules (e.g., most accurate rule first) before pruning: each example covered by 1 rule after pruning, 1 example might be covered by multiple rules therefore some rules might contradict each other
Pruning rules: example Tree representing A ∨ B (tree: test A; true → true, false → test B) if A=true then true if A=false and B=true then true if A=false and B=false then false Rules represent A ∨ (¬A ∧ B) ≡ A ∨ B
Alternative heuristics for choosing tests Attributes with continuous domains (numbers) cannot have a different branch for each possible outcome allow, e.g., binary tests of the form Temperature < 20 Attributes with many discrete values unfair advantage over attributes with few values cf. a question with many possible answers is more informative than a yes/no question To compensate: divide gain by "max. potential gain" SI Gain Ratio: GR(S,A) = Gain(S,A) / SI(S,A) Split-information SI(S,A) = -∑i (|Si|/|S|) log2(|Si|/|S|) with i ranging over the different results of test A
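Building on the entropy/info_gain sketch given earlier, gain ratio needs only one extra helper (again our own names):

def split_information(children):
    # SI(S,A) = -sum_i (|Si|/|S|) log2(|Si|/|S|)
    sizes = [p + q for p, q in children]
    n = sum(sizes)
    return -sum(s / n * math.log2(s / n) for s in sizes if s)

def gain_ratio(parent, children):
    # undefined when SI = 0, i.e. when all examples end up in one branch
    return info_gain(parent, children) / split_information(children)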
Tests may have different costs e.g. medical diagnosis: blood test, visual examination, … have different costs try to find tree with low expected cost instead of low expected number of tests alternative heuristics, taking cost into account, have been proposed
Properties of good heuristics Many alternatives exist ID3 uses information gain or gain ratio CART uses "Gini criterion" (not discussed here) Q: Why not simply use accuracy as a criterion? A1 splits [80-,20+] into [40-,0+] and [40-,20+]; A2 splits [80-,20+] into [40-,10+] and [40-,10+] How would accuracy and information gain rate these splits?
Heuristics compared Good heuristics are strictly concave
Why concave functions? Assume a node with size n, entropy E and proportion of positives p is split into 2 nodes with (n1, E1, p1) and (n2, E2, p2). We have p = (n1/n)·p1 + (n2/n)·p2, and the new average entropy E' = (n1/n)·E1 + (n2/n)·E2 is therefore found by linear interpolation between (p1,E1) and (p2,E2) at p. Gain = difference in height between (p,E) and (p,E'). (figure: entropy curve with the chord from (p1,E1) to (p2,E2); the gain is the vertical distance from the curve to the chord)
Handling missing values What if result of test is unknown for example? e.g. because value of attribute unknown Some possible solutions, when training: guess value: just take most common value (among all examples, among examples in this node / class, …) assign example partially to different branches e.g. counts for 0.7 in yes subtree, 0.3 in no subtree When using tree for prediction: assign example partially to different branches combine predictions of different branches
Generic TDIDT algorithm

function TDIDT(E: set of examples) returns tree:
    T' := grow_tree(E)
    T := prune(T')
    return T

function grow_tree(E: set of examples) returns tree:
    T := generate_tests(E)
    t := best_test(T, E)
    P := partition induced on E by t
    if stop_criterion(E, P)
    then return leaf(info(E))
    else for all Ej in P: tj := grow_tree(Ej)
         return node(t, {(j, tj)})
For classification... prune: e.g. reduced-error pruning, ... generate_tests: Attr=val, Attr<val, ... for numeric attributes: generate val best_test: Gain, Gain ratio, ... stop_criterion: MDL, significance test (e.g. χ²-test), ... info: most frequent class ("mode") Popular systems: C4.5 (Quinlan 1993), C5.0 (www.rulequest.com)
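A compact Python instantiation of the generic algorithm for classification (tests of the form Attr = val, info = mode, stop when a node is pure; no pruning; a teaching sketch, not C4.5):

import math
from collections import Counter

def entropy_of(examples):                  # examples: list of (attrs, class)
    n = len(examples)
    return -sum(c / n * math.log2(c / n)
                for c in Counter(cls for _, cls in examples).values())

def gain(examples, attr):                  # best_test heuristic: information gain
    n = len(examples)
    parts = {}
    for x, c in examples:
        parts.setdefault(x[attr], []).append((x, c))
    return entropy_of(examples) - sum(len(p) / n * entropy_of(p)
                                      for p in parts.values())

def mode(examples):                        # info(E): most frequent class
    return Counter(cls for _, cls in examples).most_common(1)[0][0]

def grow_tree(examples, attributes):
    if len({cls for _, cls in examples}) == 1 or not attributes:  # stop_criterion
        return ('leaf', mode(examples))
    best = max(attributes, key=lambda a: gain(examples, a))
    return ('node', best,                  # one child per outcome of the test
            {v: grow_tree([(x, c) for x, c in examples if x[best] == v],
                          attributes - {best})
             for v in {x[best] for x, _ in examples}})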
For regression... change best_test: e.g. minimize average variance info: mean stop_criterion: significance test (e.g., F-test), ... e.g.: A1 splits {1,3,4,7,8,12} into {1,4,12} and {3,7,8}; A2 splits {1,3,4,7,8,12} into {1,3,7} and {4,8,12}
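For the split above, the variance heuristic can be checked directly (sketch, population variance; with these sets A2 gives the lower average variance, so it is the better split):

def variance(ys):
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def avg_child_variance(children):
    n = sum(len(c) for c in children)
    return sum(len(c) / n * variance(c) for c in children)

print(avg_child_variance([[1, 4, 12], [3, 7, 8]]))   # A1: ~13.1
print(avg_child_variance([[1, 3, 7], [4, 8, 12]]))   # A2: ~8.4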
CART Classification and regression trees (Breiman et al., 1984) Classification: info: mode, best_test: Gini Regression: info: mean, best_test: variance prune: "error complexity" pruning penalty α for each node the higher α, the smaller the tree will be optimal α obtained empirically (cross-validation)
n-dimensional target spaces Instead of predicting 1 number, predict vector of numbers info: mean vector best_test: variance (mean squared distance) in n-dimensional space stop_criterion: F-test mixed vectors (numbers and symbols)? use appropriate distance measure -> "clustering trees"
Clustering tree (figure: the same river-water clustering tree shown earlier, with organism-abundance tests in the nodes and the 16 standardized chemical values in each leaf)
To Remember Decision trees & their representational power Generic TDIDT algorithm and how to instantiate its parameters Search through hypothesis space, bias, tree to rule conversion For classification trees: details on heuristics, handling missing values, pruning, … Some general concepts: overfitting, Occam’s razor
4 Neural networks (Brief summary - studied in detail in other courses) Basic principle of artificial neural networks Perceptrons and multi-layer neural networks Properties (Mitchell, Ch. 4)
Artificial neural networks Modelled after biological neural systems Complex systems built from very simple units 1 unit = neuron has multiple inputs and outputs, connecting the neuron to other neurons when input signal sufficiently strong, neuron fires (i.e., propagates the signal)
An ANN consists of neurons connections between them these connections have weights associated with them input and output ANNs can learn to associate inputs to outputs by adapting the weights For instance (classification): inputs = pixels of photo outputs = classification of photo (person? tree? …)
Perceptrons Simplest type of neural network Perceptron simulates 1 neuron Fires if sum of (inputs * weights) > some threshold Schematically (figure): inputs x1 … x5 with weights w1 … w5; the unit computes X = ∑ wi·xi and applies the threshold function Y = -1 if X < t, Y = 1 otherwise
2-input perceptron represent inputs in 2-D space the perceptron learns a function of the following form: if aX + bY > c then +1, else -1 i.e., it creates a linear separation between the classes + and - (figure: a line separating the +1 region from the -1 region)
n-input perceptrons In general, perceptrons construct a hyperplane in an n-dimensional space one side of hyperplane = +, other side = - Hence, classes must be linearly separable, otherwise perceptron cannot learn them E.g.: learning boolean functions encode true/false as +1, -1 is there a perceptron that encodes 1. A and B? 2. A or B? 3. A xor B?
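A perceptron and its classical training rule fit in a few lines; the sketch below (our own encoding: inputs and targets in {-1, +1}, bias folded in as w[0]) learns AND but, as question 3 suggests, can never get XOR right:

def perceptron_output(w, x):
    s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if s > 0 else -1

def train_perceptron(data, lr=0.1, epochs=20):
    w = [0.0] * (len(data[0][0]) + 1)
    for _ in range(epochs):
        for x, t in data:
            o = perceptron_output(w, x)
            w[0] += lr * (t - o)                  # bias update
            w[1:] = [wi + lr * (t - o) * xi for wi, xi in zip(w[1:], x)]
    return w

AND = [((-1, -1), -1), ((-1, 1), -1), ((1, -1), -1), ((1, 1), 1)]
XOR = [((-1, -1), -1), ((-1, 1), 1), ((1, -1), 1), ((1, 1), -1)]
w = train_perceptron(AND)
print([perceptron_output(w, x) for x, _ in AND])  # [-1, -1, -1, 1]: AND learnt
w = train_perceptron(XOR)
print([perceptron_output(w, x) for x, _ in XOR])  # never all correct: not linearly separable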
Multi-layer networks Increase representation power by combining neurons in a network (figure: inputs X and Y feed two hidden neurons (hidden layer), whose +1/-1 outputs feed one output neuron (output layer))
"Sigmoid" function instead of crisp threshold changes continuously instead of in 1 step has advantages for training multi-layer networks (figure: unit with inputs x1 … x5, weights w1 … w5, and sigmoid output σ(∑ wi·xi))
Non-linear sigmoid function causes non-linear decision surfaces e.g., 5 areas for 5 classes a, b, c, d, e Very powerful representation (figure)
Note : previous network had 2 layers of neurons Layered feedforward neural networks: neurons organised in n layers each layer has output from previous layer as input neurons fully interconnected successive layers = different representations of input 2-layer feedforward networks very popular… … but many other architectures possible! e.g. recurrent NNs
Example: 2-layer net representing the identity function 8 input patterns, mapped to same pattern in output network converges to binary representation in hidden layer for instance: pattern 1 → 101, 2 → 100, 3 → 011, 4 → 111, 5 → 000, 6 → 010, 7 → 110, 8 → 001
Training neural networks Trained by adapting the weights Popular algorithm: backpropagation minimizing error through gradient descent principle: output error of a layer is attributed to 1: weights of connections in that layer adapt these weights 2: inputs of that layer (except if first layer) “ backpropagate” error to these inputs now use same principle to adapt weights of previous layer Iterative process, may be slow
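A minimal numeric sketch of this procedure for one hidden layer of sigmoid units (numpy assumed; the XOR data, layer sizes, learning rate and iteration count are arbitrary illustrative choices):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def add_bias(A):                      # append a constant-1 input to each row
    return np.hstack([A, np.ones((len(A), 1))])

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
y = np.array([[0], [1], [1], [0]], float)        # XOR targets
W1 = rng.normal(size=(3, 3))          # 2 inputs + bias -> 3 hidden units
W2 = rng.normal(size=(4, 1))          # 3 hidden + bias -> 1 output

for _ in range(20000):
    H = sigmoid(add_bias(X) @ W1)                # forward pass
    O = sigmoid(add_bias(H) @ W2)
    d_out = (O - y) * O * (1 - O)                # error at the output layer...
    d_hid = (d_out @ W2[:3].T) * H * (1 - H)     # ...backpropagated to the hidden layer
    W2 -= 0.5 * add_bias(H).T @ d_out            # gradient descent weight updates
    W1 -= 0.5 * add_bias(X).T @ d_hid
print(O.round(2))   # should approximate XOR; convergence can be slow and seed-dependent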
Properties of neural networks Useful for modelling complex, non-linear functions of numerical inputs & outputs symbolic inputs/outputs representable using some encoding, cf. true/false = 1/-1 2 or 3 layer networks can approximate a huge class of functions (if enough neurons in hidden layers) Robust to noise but: risk of overfitting! (because of high expressiveness) may happen when training for too long usually handled using e.g. validation sets
All inputs have some effect cf. decision trees: selection of most important attributes Explanatory power of ANNs is limited model represented as weights in network no simple explanation why networks makes a certain prediction contrast with e.g. trees: can give a “rule” that was used
Hence, ANNs are good when high-dimensional input and output (numeric or symbolic) interpretability of model unimportant Examples: typical: image recognition, speech recognition, … e.g. images: one input per pixel see http://www.cs.cmu.edu/~tom/faces.html for illustration less typical: symbolic problems cases where e.g. trees would work too performance of networks and trees then often comparable
To remember Perceptrons, neural networks: inspiration what they are how they work representation power explanatory power

Machine Learning and Inductive Inference

  • 1.
    Machine Learning and Inductive Inference Hendrik Blockeel 2001-2002
  • 2.
    1 IntroductionPractical information What is &quot;machine learning and inductive inference&quot;? What is it useful for? (some example applications) Different learning tasks Data representation Brief overview of approaches Overview of the course
  • 3.
    Practical information aboutthe course 10 lectures (2h) + 4 exercise sessions (2.5h) Audience with diverse backgrounds Course material: Book: Machine Learning (Mitchell, 1997, McGraw-Hill) Slides & notes, http://www.cs.kuleuven.ac.be/~hendrik/ML/ Examination: oral exam (20') with written preparation (+/- 2h) 2/3 theory, 1/3 exercises Only topics discussed in lectures / exercises
  • 4.
    What is machinelearning? Study of how to make programs improve their performance on certain tasks from own experience &quot;performance&quot; = speed, accuracy, ... &quot;experience&quot; = set of previously seen cases (&quot;observations&quot;) For instance (simple method): experience: taking action A in situation S yielded result R situation S arises again if R was undesirable: try something else if R was desirable: try action A again
  • 5.
    This is avery simple example only works if precisely the same situation is encountered what if similar situation? Need for generalisation how about choosing another action even if a good one is already known ? (you might find a better one) Need for exploration This course focuses mostly on generalisation or inductive inference
  • 6.
    Inductive inference =Reasoning from specific to general e.g. statistics: from sample, infer properties of population sample population observation: &quot;these dogs are all brown&quot; hypothesis: &quot;all dogs are brown&quot;
  • 7.
    Note: inductive inferenceis more general than statistics statistics mainly consists of numerical methods for inference infer mean, probability distribution, … of population other approaches: find symbolic definition of a concept (“concept learning”) find laws with complicated structure that govern the data study induction from a logical, philosophical, … point of view …
  • 8.
    Applications of inductiveinference: Machine learning &quot;sample&quot; of observations = experience generalizing to population = finding patterns in the observations that generally hold and may be used for future tasks Knowledge discovery (Data mining) &quot;sample&quot; = database generalizing = finding patterns that hold in this database and can also be expected to hold on similar data not in the database discovered knowledge = comprehensible description of these patterns ...
  • 9.
    What is ituseful for? Scientifically: for understanding learning and intelligence in humans and animals interesting for psychologists, philosophers, biologists, … More practically: for building AI systems expert systems that improve automatically with time systems that help scientists discover new laws also useful outside “classical” AI-like applications when we don’t know how to program something ourselves when a program should adapt regularly to new circumstances when a program should tune itself towards its user
  • 10.
    Knowledge discovery Scientificknowledge discovery Some “toy” examples: Bacon : rediscovered some laws of physics (e.g. Kepler’s laws of planetary motion) AM: rediscovered some mathematical theorems More serious recent examples: mining the human genome mining the web for information on genes, proteins, … drug discovery context: robots perform lots of experiments at high rate; this yields lots of data, to be studied and interpreted by humans; try to automate this process (because humans can’t keep up with robots)
  • 11.
    Example: given moleculesthat are active against some disease, find out what is common in them; this is probably the reason for their activity.
  • 12.
    Data mining indatabases, looking for “interesting” patterns e.g. for marketing based on data in DB, who should be interested in this new product? (useful for direct mailing) study customer behaviour to identify typical profiles of customers find out which products in store are often bought together e.g. in hospital: help with diagnosis of patients
  • 13.
    Learning to performdifficult tasks Difficult for humans… LEX system : learned how to perform symbolic integration of functions … or “easy” for humans, but difficult to program humans can do it, but can’t explain how they do it e.g.: learning to play games (chess, go, …) learning to fly a plane, drive a car, … recognising faces …
  • 14.
    Adaptive systems Robotsin changing environment continuously needs to adapt its behaviour Systems that adapt to the user based on user modelling: observe behaviour of user build model describing this behaviour use model to make user’s life easier e.g. adaptive web pages, intelligent mail filters, adaptive user interfaces (e.g. intelligent Unix shell), …
  • 15.
    Illustration : buildinga system that learns checkers Learning = improving on task T, with respect to performance measure P, based on experience E In this example: T = playing checkers P: % of games won in world tournament E: games played against self possible problem: is experience representative for real task? Questions to be answered: exactly what is given, exactly what is learnt, what representation & learning algorithm should we use
  • 16.
    What do wewant to learn? given board situation, which move to make What is given? direct or indirect evidence ? direct: e.g., which moves were good, which were bad indirect: consecutive moves in game, + outcome of the game in our case: indirect evidence direct evidence would require a teacher
  • 17.
    What exactly shallwe learn? Choose type of target function: ChooseMove: Board  Move ? directly applicable V: Board   ? indicates quality of state when playing, choose move that leads to best state Note: reasonable definition for V easy to give: V(won) = 100, V(lost) = -100, V(draw) = 0, V(s) = V(e) with e best state reachable from s when playing optimally Not feasible in practice (exhaustive minimax search) … Let’s choose the V function here
  • 18.
    Choose representation for target function: set of rules? neural network? polynomial function of numerical board features? … Let’s choose: V = w 1 bp+w 2 rp+w 3 bk+w 4 rk+w 5 bt+w 6 rt bp, rp : number of black / red pieces bk, rk : number of black / red kings bt, rt : number of black / read pieces threatened w i : constants to be learnt from experience
  • 19.
    How to obtainingtraining examples? we need a set of examples [bp, rp, bk, rk, bt, rt, V] bp etc. easy to determine; but how to guess V? we have indirect evidence only! possible method: with V(s) true target function, V’(s) learnt function, V t (s) training value for a state s: V t (s) <- V’(successor(s)) adapt V’ using V t values (making V’ and V t converge) hope that V’ will converge to V intuitively: V for end states is known; propagate V values from later states to earlier states in the game
  • 20.
    Training algorithm: howto adapt the weights w i ? possible method: look at “error”: error(s) = V’(s) - V t (s) adapt weights so that error is reduced e.g. using gradient descent method for each feature f i : w i  w i + c f i error(s) with c some small constant
  • 21.
    Overview of designchoices type of training experience games against self games against expert table of good moves determine type of target function determine representation determine learning algorithm … … … … ready! Board   Board  Move linear function of 6 features … gradient descent
  • 22.
    Some issues thatinfluence choices Which algorithms useful for what type of functions? How is learning influenced by # training examples complexity of hypothesis (function) representation noise in the data Theoretical limits of learning? Can we help the learner with prior knowledge? Could a system alter its representation itself? …
  • 23.
    Typical learning tasksConcept learning learn a definition of a concept supervised vs. unsupervised Function learning (&quot;predictive modelling&quot;) Discrete (&quot;classification&quot;) or continuous (&quot;regression&quot;) Concept = function with boolean result Clustering Finding descriptive patterns
  • 24.
    Concept learning: supervisedGiven positive (+) and negative (-) examples of a concept, infer properties that cause instances to be positive or negative (= concept definition) + + + + + - - - - - - - - - X C : X  {true,false} + + + + + - - - - - - - - - X C
  • 25.
    Concept learning: unsupervisedGiven examples of instances Invent reasonable concepts ( = clustering ) Find definitions for these concepts . . . . . . . . . . . X . . . . . . . . . . . . . . . . . X . . . . . . C 3 C 2 C 1 Cf. taxonomy of animals, identification of market segments, ...
  • 26.
    Function learning Generalisesover concept learning Learn function f : X  S where S is finite set of values: classification S is a continuous range of reals: regression . 1.4 . 2.7 . 0.6 . 2.1 X . 0.9 . 1.4 . 2.7 . 0.6 . 2.1 X . 0.9 0 1 2 3 f
  • 27.
    Clustering Finding groupsof instances that are similar May be a goal in itself (unsupervised classification) ... but also used for other tasks regression flexible prediction : when it is not known in advance which properties to predict from which other properties . . . . . . . . . . . X . . . . . . . . . . . . . . . . . X . . . . . .
  • 28.
    Finding descriptive patternsDescriptive patterns = any kind of patterns, not necessarily directly useful for prediction Generalises over predictive modelling (= finding predictive patterns) Examples of patterns: &quot;fast cars usually cost more than slower cars&quot; &quot;people are never married to more than one person at the same time&quot;
  • 29.
    Representation of dataNumerical data: instances are points in  n Many techniques focus on this kind of data Symbolic data (true/false, black/white/red/blue, ...) Can be converted to numeric data Some techniques work directly with symbolic data Structural data Instances have internal structure (graphs, sets, …; cf. molecules) Difficult to convert to simpler format Few techniques can handle these directly
  • 30.
    Brief overview ofapproaches Symbolic approaches: Version Spaces, Induction of decision trees, Induction of rule sets, inductive logic programming, … Numeric approaches: neural networks, support vector machines, … Probabilistic approaches (“bayesian learning”) Miscellaneous: instance based learning, genetic algorithms, reinforcement learning
  • 31.
    Overview of thecourse Introduction (today) (Ch. 1) Concept-learning: Versionspaces (Ch. 2 - brief) Induction of decision trees (Ch. 3) Artificial neural networks (Ch. 4 - brief) Evaluating hypotheses (Ch. 5) Bayesian learning (Ch. 6) Computational learning theory (Ch. 7) Support vector machines (brief)
  • 32.
    Instance-based learning (Ch. 8) Genetic algorithms (Ch. 9) Induction of rule sets & association rules (Ch. 10) Reinforcement learning (Ch. 13) Clustering Inductive logic programming Combining different models bagging, boosting, stacking, …
  • 33.
    2 VersionSpaces Recall basic principles from AI course stressing important concepts for later use Difficulties with version space approaches Inductive bias  Mitchell, Ch.2
  • 34.
    Basic principles Conceptlearning as search given : hypothesis space H and data set S find : all h  H consistent with S this set is called the version space , VS(H,S) How to search in H enumerate all h in H: not feasible prune search using some generality ordering h 1 more general than h 2  (x  h 2  x  h 1 ) See Mitchell Chapter 2 for examples
  • 35.
    An example +: belongs to concept; - : does not S = set of these + and - examples Assume hypotheses are rectangles I.e., H = set of all rectangles VS(H,S) = set of all rectangles that contain all + and no - - - - - - - - - - - - - + + + - -
  • 36.
    Example of consistenthypothesis: green rectangle - - - - - - - - - - - - + + + - -
  • 37.
    h1 more generalthan h2  h2 totally inside h1 h 1 h 2 h 3 h 2 more specific than h 1 h 3 incomparable with h 1 - - - - - - - - - - - - + + + - -
  • 38.
    Version space boundariesBound versionspace by giving its most specific (S) and most general (G) borders S: rectangles that cannot become smaller without excluding some + G: rectangles that cannot become larger without including some - Any hypothesis h consistent with the data must be more general than some element in S must be more specific than some element in G Thus, G and S completely specify VS
  • 39.
    Example, continued Sowhat are S and G here? S = {h 1 }, G = {h 2 ,h 3 } - - - - - - - - - - - - + + + - - h 2 : most general hypothesis h 3 : another most general hyp. h 1 : most specific hypothesis
  • 40.
    Computing the versionspace Computing G and S is sufficient to know the full versionspace Algorithms in Mitchell’s book: FindS: computes only S set S is always singleton in Mitchell’s examples Candidate Elimination: computes S and G
  • 41.
    Candidate Elimination Algorithm:demonstration with rectangles Algorithm: see Mitchell Representation: Concepts are rectangles Rectangle represented with 2 attributes: <Xmin-Xmax, Ymin-Ymax> Graphical representation: hypothesis consistent with data if all + inside rectangle no - inside rectangle
  • 42.
    G S Start:S = {none}, G = {all} S = {<  ,  >} G = {<1-6, 1-6>} 3 4 5 6 2 1 3 2 1 4 5 6
  • 43.
    G S Examplee1 appears, not covered by S Start: S = {none}, G = {all} S = {<  ,  >} G = {<1-6,1-6>} (3,2) : + +
  • 44.
    + G S(3,2) : + Example e1 appears, not covered by S Start: S = {none}, G = {all} S is extended to cover e1 S = {< 3-3,2-2 >} G = {<1-6,1-6>}
  • 45.
    (3,2) : +Example e1 appears, not covered by S Start: S = {none}, G = {all} S is extended to cover e1 Example e2 appears, covered by G + G S S = {<3-3,2-2>} G = {<1-6,1-6>} - (5,4) : -
  • 46.
    + G S(3,2) : + Example e1 appears, not covered by S Start: S = {none}, G = {all} S is extended to cover e1 S = {<3-3,2-2>} G = {< 1-4,1-6 >, < 1-6, 1-3 >} Example e2 appears, covered by G G is changed to avoid covering e2 note: now consists of 2 parts each part covers all + and no - (5,4) : - -
  • 47.
    + G S(3,2) : + Example e1 appears, not covered by S Start: S = {none}, G = {all} S is extended to cover e1 Example e2 appears, covered by G G is changed to avoid covering e2 Example e3 appears, covered by G S = {<3-3,2-2>} G = {<1-4,1-6>, <1-6, 1-3>} (5,4) : - - (2,4) : - -
  • 48.
    + G S(3,2) : + Example e1 appears, not covered by S Start: S = {none}, G = {all} S is extended to cover e1 Example e2 appears, covered by G G is changed to avoid covering e2 Example e3 appears, covered by G One part of G is affected & reduced S = {<3-3,2-2>} G = {< 3-4,1-6 >, <1-6, 1-3>} (5,4) : - - (2,4) : - -
  • 49.
    + G S(3,2) : + Example e1 appears, not covered by S Start: S = {none}, G = {all} S is extended to cover e1 Example e2 appears, covered by G G is changed to avoid covering e2 Example e3 appears, covered by G One part of G is affected & reduced Example e4 appears, not covered by S S = {<3-3,2-2>} G = {<3-4,1-6>, <1-6, 1-3>} (5,4) : - - (2,4) : - - (5,3) : + +
  • 50.
    + G S(3,2) : + Example e1 appears, not covered by S Start: S = {none}, G = {all} S is extended to cover e1 Example e2 appears, covered by G G is changed to avoid covering e2 Example e3 appears, covered by G One part of G is affected & reduced Example e4 appears, not covered by S S is extended to cover e4 S = {< 3-5,2-3 >} G = {<3-4,1-6>, <1-6, 1-3>} (5,4) : - - (2,4) : - - (5,3) : + +
  • 51.
    + G S(3,2) : + Example e1 appears, not covered by S Start: S = {none}, G = {all} S is extended to cover e1 Example e2 appears, covered by G G is changed to avoid covering e2 Example e3 appears, covered by G One part of G is affected & reduced Example e4 appears, not covered by S S is extended to cover e4 Part of G not covering new S is removed S = {<3-5,2-3>} G = {<1-6, 1-3>} (5,4) : - - (2,4) : - - (5,3) : + +
  • 52.
    + G SCurrent versionspace contains all rectangles covering S and covered by G, e.g. h = <2-5,2-3> h S = {<3-5,2-3>} G = {<1-6, 1-3>}
  • 53.
    Interesting points: Wehere use an extended notion of generality In book :  < value < ? Here: e.g.  < 2-3 < 2-5 < 1-5 < ? We still use a conjunctive concept definition each concept is 1 rectangle this could be extended as well (but complicated)
  • 54.
    Difficulties with version space approaches Idea of VS provides nice theoretical framework But not very useful for most practical problems Difficulties with these approaches: Not very efficient Borders G and S may be very large (may grow exponentially) Not noise resistant VS “collapses” when no consistent hypothesis exists often we would like to find the “best” hypothesis in this case in Mitchell’s examples: only conjunctive definitions We will compare with other approaches...
  • 55.
    Inductive bias Afterhaving seen a limited number of examples, we believe we can make predictions for unseen cases. From seen cases to unseen cases = inductive leap Why do we believe this ? Is there any guarantee this prediction will be correct ? What extra assumptions do we need to guarantee correctness? Inductive bias : minimal set of extra assumptions that guarantees correctness of inductive leap
  • 56.
    Equivalence between inductiveand deductive systems inductive system training examples new instance deductive system training examples new instance inductive bias result (by proof) result (by inductive leap)
  • 57.
    Definition of inductivebias More formal definition of inductive bias (Mitchell): L(x,D) denotes classification assigned to instance x by learner L after training on D The inductive bias of L is any minimal set of assertions B such that for any target concept c and corresponding training examples D,  x  X: B  D  x |- L(x,D)
  • 58.
    Effect of inductivebias Different learning algorithms give different results on same dataset because each may have a different bias Stronger bias means less learning more is assumed in advance Is learning possible without any bias at all? I.e., “pure” learning, without any assumptions in advance The answer is No .
  • 59.
    Inductive bias ofversion spaces Bias of candidate elimination algorithm: target concept is in H H typically consists of conjunctive concepts in our previous illustration, rectangles H could be extended towards disjunctive concepts Is it possible to use version spaces with H = set of all imaginable concepts, thereby eliminating all bias ?
  • 60.
    Unbiased version spacesLet U be the example domain Unbiased: target concept C can be any subset of U hence, H = 2 U Condider VS(H,D) with D a strict subset of U Assume you see an unseen instance x (x  U \ D) For each h  VS that predicts x  C, there is a h’  VS that predicts x  C, and vice versa just take h = h’  {x}: since x  D, h and h’ are exactly the same w.r.t. D; so either both are in VS, or none of them are
  • 61.
    Conclusion: version spaceswithout any bias do not allow generalisation To be able to make an inductive leap, some bias is necessary . We will see many different learning algorithms that all differ in their inductive bias. When choosing one in practice, bias should be an important criterium unfortunately: not always well understood…
  • 62.
    To remember Definitionof version space, importance of generality ordering for searching Definition of inductive bias, practical importance, why it is necessary for learning, how it relates inductive systems to deductive systems
  • 63.
3 Induction of decision trees. What are decision trees? How can they be induced automatically? Top-down induction of decision trees, avoiding overfitting, converting trees to rules, alternative heuristics, a generic TDIDT algorithm. Reference: Mitchell, Ch. 3.
What are decision trees? They represent sequences of tests: according to the outcome of a test, perform a new test; continue until the result is known. Cf. guessing a person using only yes/no questions: ask some question; depending on the answer, ask a new question; continue until the answer is known.
Example decision tree 1. Mitchell's example: play tennis or not? (depending on weather conditions)
Outlook?
  Sunny -> Humidity?
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rainy -> Wind?
    Strong -> No
    Weak -> Yes
Example decision tree 2. Again from Mitchell: a tree for predicting whether a C-section is necessary. Leaves are not pure here; the ratio pos/neg is given.
Fetal_Presentation?
  1 -> Previous_Csection?
    0 -> Primiparous? ...
    1 -> + [55+, 35-] .61+ .39-
  2 -> - [3+, 29-] .11+ .89-
  3 -> - [8+, 22-] .27+ .73-
Representation power. Typically, examples are represented by an array of attributes; 1 node in the tree tests the value of 1 attribute, with 1 child node for each possible outcome of the test; leaf nodes assign a classification. Note: a tree can represent any boolean function, i.e., also disjunctive concepts (contrast with the VS examples); a tree can allow for noise (non-pure leaves).
Representing boolean formulae. E.g., A ∨ B:
A?
  true -> true
  false -> B?
    true -> true
    false -> false
Similarly (try yourself): A ∧ B, A xor B, (A ∧ B) ∨ (C ∧ ¬D ∧ E), "M of N" (at least M out of N propositions are true). What about the complexity of the tree vs. the complexity of the original formula?
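As a small illustration (my own sketch, not from the course), such a tree is easy to represent and evaluate programmatically; here is the tree for A ∨ B as a nested structure:

    # Decision tree for "A or B": a node is (attribute, {value: subtree}),
    # a leaf is just the class label.
    tree = ("A", {True: True,
                  False: ("B", {True: True,
                                False: False})})

    def predict(tree, example):
        if not isinstance(tree, tuple):          # leaf reached
            return tree
        attr, branches = tree
        return predict(branches[example[attr]], example)

    for a in (True, False):
        for b in (True, False):
            assert predict(tree, {"A": a, "B": b}) == (a or b)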
Classification, regression and clustering trees. Classification trees represent a function X -> C with C discrete (like the decision trees we just saw). Regression trees predict numbers in their leaves: a leaf could use a constant (e.g., the mean), a linear regression model, ... Clustering trees just group the examples in their leaves. Most (but not all) research in machine learning focuses on classification trees.
Example decision tree 3 (from a study of river water quality): a "data mining" application. Given: descriptions of river water samples; biological description: occurrence of organisms in the water ("abundance", graded 0-5); chemical description: 16 variables (temperature, concentrations of chemicals (NH4, ...)). Question: characterize the chemical properties of the water using the organisms that occur in it.
Clustering tree (values are "standardized": number of standard deviations above the mean):
abundance(Tubifex sp., 5)?
  yes -> T = 0.357111, pH = -0.496808, cond = 1.23151, O2 = -1.09279, O2sat = -1.04837, CO2 = 0.893152, hard = 0.988909, NO2 = 0.54731, NO3 = 0.426773, NH4 = 1.11263, PO4 = 0.875459, Cl = 0.86275, SiO2 = 0.997237, KMnO4 = 1.29711, K2Cr2O7 = 0.97025, BOD = 0.67012
  no -> abundance(Sphaerotilus natans, 5)?
    yes -> T = 0.0129737, pH = -0.536434, cond = 0.914569, O2 = -0.810187, O2sat = -0.848571, CO2 = 0.443103, hard = 0.806137, NO2 = 0.4151, NO3 = -0.0847706, NH4 = 0.536927, PO4 = 0.442398, Cl = 0.668979, SiO2 = 0.291415, KMnO4 = 1.08462, K2Cr2O7 = 0.850733, BOD = 0.651707
    no -> ...
Top-Down Induction of Decision Trees. Basic TDIDT algorithm (a more formal version follows later): start with the full data set; find the test that partitions the examples as well as possible ("good" = examples with the same class, or otherwise similar examples, are put together); for each outcome of the test, create a child node; move the examples to the children according to the outcome of the test; repeat the procedure for each child that is not "pure". Main question: how to decide which test is "best".
Finding the best test (for classification trees): find the test for which the children are as "pure" as possible. The purity measure is borrowed from information theory: entropy is a measure of "missing information"; more precisely, the number of bits needed to represent the missing information, on average, using an optimal encoding. Given a set S with instances belonging to class i with probability p_i: Entropy(S) = -Σ_i p_i log2 p_i.
Entropy, intuitive reasoning: use a shorter encoding for more frequent messages. Information theory: a message with probability p should get -log2 p bits. E.g., A, B, C, D each with 25% probability: 2 bits for each (00, 01, 10, 11). If some are more probable, it is possible to do better. The average number of bits per message is then -Σ_i p_i log2 p_i.
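The entropy formula is easy to transcribe directly (a small sketch of my own, for illustration):

    # Entropy(S) = - sum_i p_i log2 p_i, with p_i the class proportions in S.
    from math import log2
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    print(entropy(["+"] * 9 + ["-"] * 5))   # 0.940..., used in the example below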
Entropy as a function of p, for 2 classes (plot): 0 for p = 0 or p = 1, maximal (1 bit) at p = 0.5.
Information gain. Heuristic for choosing a test in a node: choose the test that on average provides the most information about the class, i.e., the test that on average reduces class entropy most (the entropy reduction differs according to the outcome of the test; its expected value is the information gain): Gain(S,A) = Entropy(S) - Σ_v |S_v|/|S| Entropy(S_v).
Example: assume S has 9 + and 5 - examples (E = 0.940); partition according to the Humidity or Wind attribute:
Humidity: High -> [3+, 4-] (E = 0.985), Normal -> [6+, 1-] (E = 0.592)
Wind: Weak -> [6+, 2-] (E = 0.811), Strong -> [3+, 3-] (E = 1.0)
Gain(S, Humidity) = 0.940 - (7/14)·0.985 - (7/14)·0.592 = 0.151
Gain(S, Wind) = 0.940 - (8/14)·0.811 - (6/14)·1.0 = 0.048
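These numbers can be re-checked mechanically (a sketch of my own; slight differences in the last digit come from rounding intermediate results):

    from math import log2

    def entropy(pos, neg):                    # entropy of a [pos+, neg-] node
        e = 0.0
        for c in (pos, neg):
            if c:
                p = c / (pos + neg)
                e -= p * log2(p)
        return e

    e_S = entropy(9, 5)                                         # 0.940
    print(e_S - (7/14)*entropy(3, 4) - (7/14)*entropy(6, 1))    # ~0.152 (Humidity)
    print(e_S - (8/14)*entropy(6, 2) - (6/14)*entropy(3, 3))    # ~0.048 (Wind)

So Humidity is clearly the more informative of the two tests.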
Assume Outlook was chosen: continue partitioning in the child nodes.
Outlook? [9+, 5-]
  Sunny -> [2+, 3-]: further test needed
  Overcast -> [4+, 0-]: Yes
  Rainy -> [3+, 2-]: further test needed
Hypothesis space search in TDIDT. The hypothesis space H = the set of all trees. H is searched in a hill-climbing fashion, from simple to complex...
Inductive bias in TDIDT. Note: for e.g. boolean attributes, H is complete: each concept can be represented! (Given n attributes, we can keep adding tests until all attributes are tested.) So what about the inductive bias? Clearly there is no "restriction bias" (i.e., H ⊊ 2^U) as in candidate elimination, but there is a preference bias: some hypotheses in H are preferred over others. In this case: a preference for short trees with informative attributes at the top.
Occam's Razor. A preference for simple models over complex models is used quite generally in machine learning. A similar principle is used in science: Occam's razor (roughly: do not make things more complicated than necessary). The reasoning, in the case of decision trees: more complex trees have a higher probability of overfitting the data set.
Avoiding overfitting. The phenomenon of overfitting: by making a model more and more complicated, we keep improving its fit on the training set, but this increases the risk of modelling noise and coincidences in the data set, and may actually harm the predictive power of the theory on unseen cases. Cf. fitting a curve with too many parameters.
Overfitting: example (figure): a boundary that exactly separates the + and - training examples, including a few stray ones, carves out an area with probably wrong predictions.
Overfitting: effect on predictive accuracy. Typical phenomenon when overfitting: training accuracy keeps increasing, while accuracy on an unseen validation set starts decreasing. (Plot: accuracy vs. size of tree; accuracy on the training data keeps rising, accuracy on unseen data drops from the point where overfitting starts.)
How to avoid overfitting when building classification trees? Option 1: stop adding nodes to the tree when overfitting starts to occur (this needs a stopping criterion). Option 2: don't bother about overfitting while growing the tree; after the tree has been built, start pruning it again.
Stopping criteria: how do we know when overfitting starts? (a) Use a validation set: data not considered for choosing the best test; when accuracy on the validation set goes down, stop adding nodes to this branch. (b) Use a statistical test; significance test: e.g., is the change in class distribution still significant? (χ²-test). (c) MDL, the minimal description length principle: a fully correct theory = tree + corrections for specific misclassifications; minimize size(f.c.t.) = size(tree) + size(misclassifications(tree)). Cf. Occam's razor.
Post-pruning trees. After learning the tree, start pruning branches away. For all nodes in the tree: estimate the effect of pruning the tree at this node on predictive accuracy (e.g., using the accuracy on a validation set). Prune the node that gives the greatest improvement; continue until no improvement results. Note: this pruning constitutes a second search through the hypothesis space.
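A simplified bottom-up variant of this idea, reduced-error pruning, fits in a few lines (a sketch under my own tree representation of (attribute, {value: subtree}) nodes; the slide above describes a greedy best-node-first loop, which is slightly different):

    from collections import Counter

    def predict(tree, x):
        if not isinstance(tree, tuple):
            return tree
        attr, branches = tree
        return predict(branches[x[attr]], x)

    def accuracy(tree, examples):
        return sum(predict(tree, x) == y for x, y in examples) / len(examples)

    def majority(examples):
        return Counter(y for _, y in examples).most_common(1)[0][0]

    def prune(tree, train, val):
        # Prune the subtrees first, then try replacing this node by a leaf.
        if not isinstance(tree, tuple) or not val:
            return tree
        attr, branches = tree
        pruned = (attr, {v: prune(sub,
                                  [e for e in train if e[0][attr] == v],
                                  [e for e in val if e[0][attr] == v])
                         for v, sub in branches.items()})
        leaf = majority(train)          # candidate: replace subtree by a leaf
        return leaf if accuracy(leaf, val) >= accuracy(pruned, val) else pruned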
(Plot: accuracy vs. size of tree, showing the effect of pruning: accuracy on unseen data recovers as the overfitted tree is pruned back, while accuracy on the training data decreases.)
Comparison: the advantage of Option 1 is that no superfluous work is done, but its tests may be misleading: e.g., validation accuracy may go down briefly and then go up again. Therefore, Option 2 (post-pruning) is usually preferred (though it is computationally more expensive).
Turning trees into rules. From a tree, a rule set can be derived: each path from root to leaf = 1 if-then rule. Advantages of such rule sets: they may increase comprehensibility, and they can be pruned more flexibly: in 1 rule, 1 single condition can be removed (vs. a tree, where removing a node removes the whole subtree below it), and 1 rule can be removed entirely.
Rules from trees: example.
Outlook?
  Sunny -> Humidity? (High -> No, Normal -> Yes)
  Overcast -> Yes
  Rainy -> Wind? (Strong -> No, Weak -> Yes)
if Outlook = Sunny and Humidity = High then No
if Outlook = Sunny and Humidity = Normal then Yes
...
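Rule extraction is just an enumeration of root-to-leaf paths; a minimal sketch (my own representation, for illustration):

    # One if-then rule per leaf: (list of (attribute, value) conditions, label).
    def tree_to_rules(tree, conds=()):
        if not isinstance(tree, tuple):
            return [(list(conds), tree)]
        attr, branches = tree
        rules = []
        for value, subtree in branches.items():
            rules += tree_to_rules(subtree, conds + ((attr, value),))
        return rules

    tennis = ("Outlook", {
        "Sunny":    ("Humidity", {"High": "No", "Normal": "Yes"}),
        "Overcast": "Yes",
        "Rainy":    ("Wind", {"Strong": "No", "Weak": "Yes"})})

    for conds, label in tree_to_rules(tennis):
        print("if " + " and ".join(f"{a}={v}" for a, v in conds) + " then " + label)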
Pruning rules. A possible method: 1. convert the tree to rules; 2. prune each rule independently: remove conditions that do not harm the accuracy of the rule; 3. sort the rules (e.g., most accurate rule first). Before pruning, each example is covered by exactly 1 rule; after pruning, 1 example might be covered by multiple rules, and therefore some rules might contradict each other.
Pruning rules: example. Tree representing A ∨ B:
A?
  true -> true
  false -> B?
    true -> true
    false -> false
Rules: if A = true then true; if A = false and B = true then true; if A = false and B = false then false. The rules represent A ∨ (¬A ∧ B) = A ∨ B.
Alternative heuristics for choosing tests. Attributes with continuous domains (numbers): we cannot create a different branch for each possible outcome; instead allow, e.g., a binary test of the form Temperature < 20. Attributes with many discrete values have an unfair advantage over attributes with few values (cf. a question with many possible answers is more informative than a yes/no question). To compensate, divide the gain by the "maximal potential gain" SI, giving the gain ratio GR(S,A) = Gain(S,A) / SI(S,A), with split information SI(S,A) = -Σ_i |S_i|/|S| log2(|S_i|/|S|), where i ranges over the different outcomes of test A.
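Transcribed directly (a sketch of my own), with a quick check that a maximally splitting test is heavily penalised:

    from math import log2

    def split_information(sizes):        # sizes of the subsets S_i induced by A
        n = sum(sizes)
        return -sum((s / n) * log2(s / n) for s in sizes if s)

    def gain_ratio(gain, sizes):
        return gain / split_information(sizes)

    print(split_information([7, 7]))     # 1.0  (balanced binary split)
    print(split_information([1] * 14))   # 3.81 (14 singletons: log2(14))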
Tests may have different costs: e.g., in medical diagnosis, a blood test and a visual examination have different costs. One can then try to find a tree with a low expected cost instead of a low expected number of tests; alternative heuristics taking cost into account have been proposed.
Properties of good heuristics. Many alternatives exist: ID3 uses information gain or gain ratio; CART uses the "Gini criterion" (not discussed here). Q: Why not simply use accuracy as a criterion? Consider:
split A1: [80-, 20+] -> [40-, 0+] and [40-, 20+]
split A2: [80-, 20+] -> [40-, 10+] and [40-, 10+]
How would (a) accuracy and (b) information gain rate these splits?
Heuristics compared (plot): good heuristics are strictly concave as functions of the class proportion; the next slide explains why that matters.
Why concave functions? (Figure: the entropy curve with points (p1, E1) and (p2, E2) and the chord connecting them.) Assume a node with size n, entropy E and proportion of positives p is split into 2 nodes with (n1, E1, p1) and (n2, E2, p2). We have p = (n1/n)p1 + (n2/n)p2, and the new average entropy E' = (n1/n)E1 + (n2/n)E2 is therefore found by linear interpolation between (p1, E1) and (p2, E2) at p. Gain = the difference in height between (p, E) and (p, E'). For a strictly concave curve, the chord lies strictly below the curve, so any split that changes the class proportions has positive gain.
Handling missing values. What if the result of a test is unknown for an example (e.g., because the value of the attribute is unknown)? Some possible solutions when training: guess the value: just take the most common value (among all examples, or among the examples in this node / class, ...); or assign the example partially to the different branches (e.g., it counts as 0.7 in the yes-subtree and 0.3 in the no-subtree). When using the tree for prediction: assign the example partially to the different branches and combine the predictions of those branches.
Generic TDIDT algorithm:
function TDIDT(E: set of examples) returns tree:
  T' := grow_tree(E)
  T := prune(T')
  return T
function grow_tree(E: set of examples) returns tree:
  T := generate_tests(E)
  t := best_test(T, E)
  P := partition induced on E by t
  if stop_criterion(E, P)
  then return leaf(info(E))
  else for all E_j in P: t_j := grow_tree(E_j)
       return node(t, {(j, t_j)})
For classification... prune: e.g. reduced-error pruning, ...; generate_tests: Attr = val, Attr < val, ... (for numeric attributes: generate a value val); best_test: gain, gain ratio, ...; stop_criterion: MDL, significance test (e.g. χ²-test), ...; info: the most frequent class ("mode"). Popular systems: C4.5 (Quinlan 1993), C5.0 (www.rulequest.com).
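As one possible instantiation of the generic scheme (a sketch with my own naming, not the actual C4.5 code): gain as best_test, "node is pure or no tests left" as stop_criterion, the mode as info, and no pruning:

    from math import log2
    from collections import Counter

    def entropy(examples):                       # examples: list of (dict, label)
        n = len(examples)
        return -sum((c / n) * log2(c / n)
                    for c in Counter(y for _, y in examples).values())

    def grow_tree(examples, attributes):
        labels = [y for _, y in examples]
        if len(set(labels)) == 1 or not attributes:       # stop_criterion
            return Counter(labels).most_common(1)[0][0]   # info = mode
        def gain(a):                                      # best_test = max gain
            subsets = Counter(x[a] for x, _ in examples)
            return entropy(examples) - sum(
                (k / len(examples)) *
                entropy([(x, y) for x, y in examples if x[a] == v])
                for v, k in subsets.items())
        best = max(attributes, key=gain)
        rest = [a for a in attributes if a != best]
        return (best, {v: grow_tree([(x, y) for x, y in examples
                                     if x[best] == v], rest)
                       for v in {x[best] for x, _ in examples}})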
  • 102.
For regression... change best_test: e.g., minimize average variance; info: the mean; stop_criterion: significance test (e.g., F-test), ... Example: target values {1,3,4,7,8,12}; candidate split A1 yields {1,4,12} vs. {3,7,8}, candidate split A2 yields {1,3,7} vs. {4,8,12} (checked in the sketch below).
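A quick check of this example (a sketch of my own): compute the weighted average variance for both candidate splits:

    def variance(ys):
        m = sum(ys) / len(ys)
        return sum((y - m) ** 2 for y in ys) / len(ys)

    def avg_variance(partition):
        n = sum(len(p) for p in partition)
        return sum(len(p) / n * variance(p) for p in partition)

    print(avg_variance([[1, 4, 12], [3, 7, 8]]))   # A1: ~13.1
    print(avg_variance([[1, 3, 7], [4, 8, 12]]))   # A2: ~8.4, the better split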
  • 103.
CART: classification and regression trees (Breiman et al., 1984). Classification: info: mode, best_test: Gini. Regression: info: mean, best_test: variance. prune: "error-complexity" pruning: a penalty α per node; the higher α, the smaller the tree will be; the optimal α is obtained empirically (by cross-validation).
n-dimensional target spaces: instead of predicting 1 number, predict a vector of numbers. info: the mean vector; best_test: variance (mean squared distance) in n-dimensional space; stop_criterion: F-test. Mixed vectors (numbers and symbols)? Use an appropriate distance measure -> "clustering trees".
Clustering tree: see the river water quality example shown earlier (each leaf contains the mean standardized value of the 16 chemical variables for the samples in that leaf).
To remember: Decision trees and their representational power. The generic TDIDT algorithm and how to instantiate its parameters. The search through the hypothesis space, bias, tree-to-rule conversion. For classification trees: details on heuristics, handling missing values, pruning, ... Some general concepts: overfitting, Occam's razor.
4 Neural networks (brief summary - studied in detail in other courses): the basic principle of artificial neural networks; perceptrons and multi-layer neural networks; properties. Reference: Mitchell, Ch. 4.
Artificial neural networks: modelled after biological neural systems; complex systems built from very simple units. 1 unit = a neuron; it has multiple inputs and outputs, connecting the neuron to other neurons; when the input signal is sufficiently strong, the neuron fires (i.e., propagates the signal).
An ANN consists of neurons and connections between them; these connections have weights associated with them; there are inputs and outputs. ANNs can learn to associate inputs with outputs by adapting the weights. For instance (classification): inputs = the pixels of a photo, outputs = the classification of the photo (person? tree? ...).
Perceptrons: the simplest type of neural network. A perceptron simulates 1 neuron: it fires if the sum of (inputs × weights) exceeds some threshold. Schematically (figure: inputs x1, ..., x5 with weights w1, ..., w5): it computes X = Σ w_i x_i and applies the threshold function Y = -1 if X < t, Y = 1 otherwise.
2-input perceptron: represent the inputs in a 2-D space; the perceptron learns a function of the following form: if aX + bY > c then +1, else -1. I.e., it creates a linear separation between the classes + and -.
n-input perceptrons. In general, perceptrons construct a hyperplane in an n-dimensional space; one side of the hyperplane = +, the other side = -. Hence, the classes must be linearly separable, otherwise the perceptron cannot learn them. E.g., learning boolean functions: encode true/false as +1/-1; is there a perceptron that encodes 1. A and B? 2. A or B? 3. A xor B? (See the sketch below.)
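Concretely (a sketch of my own; the weight settings are one possible choice): weights exist for AND and OR, but not for XOR, which is not linearly separable:

    # Perceptron with inputs/outputs in {-1, +1}.
    def perceptron(weights, threshold):
        return lambda *xs: 1 if sum(w * x for w, x in zip(weights, xs)) > threshold else -1

    AND = perceptron((1, 1), 1)    # fires only if both inputs are +1
    OR  = perceptron((1, 1), -1)   # fires if at least one input is +1

    for a in (-1, 1):
        for b in (-1, 1):
            print(a, b, "AND:", AND(a, b), "OR:", OR(a, b))
    # No (w1, w2, t) reproduces XOR: no single line separates its classes.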
Multi-layer networks: increase representation power by combining neurons in a network. (Figure: inputs X and Y feed two hidden-layer neurons, whose outputs feed one output neuron: inputs, hidden layer, output layer.)
A "sigmoid" function can be used instead of a crisp threshold: the output then changes continuously instead of in 1 step, which has advantages for training multi-layer networks (the function is differentiable). (Figure: a sigmoid unit with inputs x1, ..., x5 and weights w1, ..., w5.)
The non-linear sigmoid function gives rise to non-linear decision surfaces, e.g., 5 areas for 5 classes a, b, c, d, e: a very powerful representation.
Note: the previous network had 2 layers of neurons. Layered feedforward neural networks: the neurons are organised in n layers; each layer has the output of the previous layer as its input; the neurons of successive layers are fully interconnected; successive layers form different representations of the input. 2-layer feedforward networks are very popular... but many other architectures are possible! E.g., recurrent NNs.
Example: a 2-layer net representing the identity function. 8 input patterns, each mapped to the same pattern in the output; the network converges to a binary representation in the hidden layer, for instance:
pattern 1 -> 101, 2 -> 100, 3 -> 011, 4 -> 111, 5 -> 000, 6 -> 010, 7 -> 110, 8 -> 001
Training neural networks: trained by adapting the weights. Popular algorithm: backpropagation, which minimizes the error through gradient descent. Principle: the output error of a layer is attributed to (1) the weights of the connections in that layer: adapt these weights; and (2) the inputs of that layer (except for the first layer): "backpropagate" the error to these inputs, and use the same principle to adapt the weights of the previous layer. An iterative process; it may be slow.
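A minimal numpy sketch of the idea (my own illustration, not the course's code): a 2-4-1 sigmoid network trained on XOR with squared-error gradient descent:

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)     # hidden layer
    W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)     # output layer
    sig = lambda z: 1 / (1 + np.exp(-z))

    for _ in range(20000):
        h = sig(X @ W1 + b1)                  # forward pass
        out = sig(h @ W2 + b2)
        d_out = (out - y) * out * (1 - out)   # error attributed to output layer
        d_h = (d_out @ W2.T) * h * (1 - h)    # ... backpropagated to hidden layer
        W2 -= 0.5 * h.T @ d_out;  b2 -= 0.5 * d_out.sum(axis=0)
        W1 -= 0.5 * X.T @ d_h;    b1 -= 0.5 * d_h.sum(axis=0)

    print(out.round(2))   # should end up close to [[0], [1], [1], [0]]
                          # (backprop can get stuck in local minima; reseed if not)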
Properties of neural networks: useful for modelling complex, non-linear functions of numerical inputs and outputs (symbolic inputs/outputs are representable using some encoding, cf. true/false = 1/-1). Networks with 2 or 3 layers can approximate a huge class of functions (given enough neurons in the hidden layers). Robust to noise, but: risk of overfitting (because of the high expressiveness)! This may happen when training for too long; it is usually handled using e.g. validation sets.
All inputs have some effect (cf. decision trees, which select the most important attributes). The explanatory power of ANNs is limited: the model is represented as weights in the network, and there is no simple explanation of why the network makes a certain prediction. Contrast with e.g. trees, which can give a "rule" that was used.
Hence, ANNs are good when the input and output are high-dimensional (numeric or symbolic) and the interpretability of the model is unimportant. Examples: typical: image recognition, speech recognition, ... (e.g. for images: one input per pixel; see http://www.cs.cmu.edu/~tom/faces.html for an illustration); less typical: symbolic problems, i.e. cases where e.g. trees would work too; the performance of networks and trees is then often comparable.
To remember: Perceptrons and neural networks: their inspiration, what they are, how they work, their representation power, their explanatory power.