
If your training instances do not represent the right proportions of the different values of the learning attribute, you can hardly expect an ML classifier learned from that data to perform well on new examples. In such cases, you may want to adjust your training set by removing outliers and/or adding more representative training instances. This can be done by a so-called “Trainer”.
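Before adjusting a training set, it can help to measure how the values of the learning attribute are actually distributed. The following is a minimal sketch in plain Java, not part of the OpenRules API; the class `ClassBalance`, the method `proportions`, and the sample labels are hypothetical and used only for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper (not an OpenRules class): reports how often each value
// of the learning attribute occurs in a training set.
final class ClassBalance {

    // Returns a map from attribute value to its share of all instances (0.0 to 1.0).
    static Map<String, Double> proportions(List<String> labels) {
        // Count occurrences first so that the shares are computed with one division each.
        Map<String, Integer> counts = new TreeMap<>();
        for (String label : labels) {
            counts.merge(label, 1, Integer::sum);
        }
        Map<String, Double> shares = new TreeMap<>();
        counts.forEach((value, count) ->
                shares.put(value, (double) count / labels.size()));
        return shares;
    }

    public static void main(String[] args) {
        // Made-up labels, for illustration only.
        List<String> labels = List.of("good", "good", "bad", "good");
        proportions(labels).forEach((value, share) ->
                System.out.printf("%s: %.0f%%%n", value, share * 100));
    }
}
```

If the reported shares differ strongly from what is expected for new examples, that is a hint that the training set may need adjusting.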
Training functionality usually covers the following tasks:
- Selection of issues and attributes to be considered by a Rule Learner
- Generation of new attributes that generalize the existing attributes by adding nominal attributes, ratios, etc.
- Preliminary classification of issues
- Instance filtering rules
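As an illustration of the attribute-generation task listed above, here is a minimal sketch in plain Java. It is not OpenRules code: the class `DerivedAttributes`, the age cut-offs, and the input attributes (credit amount and duration in months) are assumptions chosen only to show what a derived nominal attribute and a derived ratio might look like.

```java
// Hypothetical helper (not an OpenRules class) that derives new attributes
// from existing numeric ones.
final class DerivedAttributes {

    // Nominal attribute derived from a numeric age. The cut-offs are arbitrary.
    static String ageGroup(int age) {
        if (age < 30) {
            return "under 30";
        }
        if (age < 60) {
            return "30 to 59";
        }
        return "60 or over";
    }

    // Ratio attribute: credit amount per month of credit duration.
    static double amountPerMonth(double creditAmount, int durationMonths) {
        if (durationMonths <= 0) {
            throw new IllegalArgumentException("durationMonths must be positive");
        }
        return creditAmount / durationMonths;
    }

    public static void main(String[] args) {
        // Made-up values, for illustration only.
        System.out.println(ageGroup(42));
        System.out.println(amountPerMonth(1200.0, 12));
    }
}
```

Derived attributes like these can be offered to a rule learner alongside the original ones.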
Trainer – a Simple Example
The standard sample project “Credits” provides an example of how a rule-based Trainer can be implemented. This sample uses a hypothetical German credit data set that contains a sample of 1,000 debtors classified as “good” or “bad”. Its initial Glossary looks as follows:
The Excel table “instances” contains 1,000 training instances, too many to show here. We need to generate rules capable of determining the learning attribute “Classified As” (good or bad). Let’s run “learn.bat” using the following Java launcher “Learner.java”, described on the page “Java API”:
The file “project.properties” is specified as follows:
It will run RIPPER against all 1,000 training instances and generate rules that incorrectly classify 244 instances out of 1,000 (24.4%).
Now, let’s assume that we want to filter these training instances to use only foreign workers who are 30 or older. First, we will add to the glossary a new business concept “Filter” that has two arrays of TrainingInstances (see yellow lines):
To create an array “Filtered Instances” from the array “Source Instances” we will use the following filtering rules:
The table “FilterTrainingRules” iterates through the array “Source Instances” and applies the rules “FilteringRules”, so that only instances satisfying the condition
Age >= 30 and Foreign Worker is yes
are added to the array “Filtered Instances”. For those who know how to use OpenRules, this is the regular way to deal with collections of objects.
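For readers more familiar with Java than with OpenRules tables, the same filtering condition can be expressed as a predicate over a list. This is only a plain-Java analogue, not the OpenRules decision table itself; the class `InstanceFilter`, the record `Debtor` with its three fields, and the sample data are hypothetical.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Plain-Java analogue of the filtering rules (not OpenRules code).
final class InstanceFilter {

    // Hypothetical stand-in for a training instance; the real data set has more attributes.
    record Debtor(int age, boolean foreignWorker, String classifiedAs) {}

    // Mirrors the condition "Age >= 30 and Foreign Worker is yes".
    static final Predicate<Debtor> FILTERING_RULE =
            d -> d.age() >= 30 && d.foreignWorker();

    // Copies into a new list only the source instances that satisfy the rule.
    static List<Debtor> filter(List<Debtor> sourceInstances) {
        return sourceInstances.stream()
                .filter(FILTERING_RULE)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up instances, for illustration only.
        List<Debtor> source = List.of(
                new Debtor(45, true, "good"),
                new Debtor(29, true, "bad"),
                new Debtor(52, false, "good"));
        System.out.println(filter(source).size() + " instance(s) kept");
    }
}
```

Keeping the condition in one named predicate makes it easy to replace with a more complex rule without touching the iteration logic.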
To make sure that this Filter is applied before running Rule Learner, we will create a new Java launcher “LearnerWithTrainer.java” that looks as follows:
It now executes the goal “FilterTrainingRules” to create “filteredInstances” from “sourceInstances” and passes them to the LearningProblem. We also need to add two more properties, goal.name and run.class, to the file “project.properties”:
Now, instead of “learn.bat”, we can run the standard OpenRules “test.bat” to execute the run.class “LearnerWithTrainer”. First, it will run our Filter, producing 606 filtered instances, and then it will run RIPPER against all of them, producing the following classification rules:
These generated rules are quite different from the previously generated rules, and they incorrectly classify only 135 instances out of 605 (about 22%).
Of course, now a subject matter expert can create training (filtering) rules of any complexity, and the same schema with a rules-based trainer will continue to work!
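The reason the schema keeps working is that the filter and the learner are independent steps: the learner only ever sees whatever the filter lets through. A minimal sketch of that separation in plain Java follows. It does not use the OpenRules or Rule Learner API; `TrainerPipeline`, `filterThenLearn`, and the toy "learner" that merely counts instances are all hypothetical.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Generic filter-then-learn schema (not OpenRules code).
final class TrainerPipeline {

    // Applies the trainer's filter first, then hands the surviving instances to the learner.
    static <T, M> M filterThenLearn(List<T> source,
                                    Predicate<T> trainer,
                                    Function<List<T>, M> learner) {
        List<T> filtered = source.stream()
                .filter(trainer)
                .collect(Collectors.toList());
        return learner.apply(filtered);
    }

    public static void main(String[] args) {
        // Toy data and a toy "learner" that only counts its input; made up for illustration.
        List<Integer> ages = List.of(25, 30, 41, 19);
        Integer used = filterThenLearn(ages, age -> age >= 30, List::size);
        System.out.println("Instances passed to the learner: " + used);
    }
}
```

Swapping in a more elaborate filter changes only the `trainer` argument; the rest of the pipeline is untouched.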