Filtering Training Sets

Rule Engine as ML Trainer. Rule Learner implements a supervised Machine Learning approach that requires Training Sets. A training set usually consists of examples indicating when the desired result has been achieved (positive examples) and counterexamples indicating cases when it has not been achieved (negative examples). Training sets are used by the Rule Learner to discover and represent new rules and to measure the accuracy and effectiveness of the rules once they have been learned. If the results are satisfactory, the rules can then be used to predict results for new, previously unseen cases.

If your training instances do not represent the right proportion of different values of the learning attribute, you can hardly expect an ML classifier learned from that data to perform well on new examples. In such cases, you may want to adjust your training sets by removing some outliers and/or adding more representative training instances. This can be done by a so-called “Trainer”.
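One way to check whether a training set represents the values of the learning attribute in a reasonable proportion is to count how often each value occurs. The following is a minimal sketch in plain Java; the class `ClassBalance` and its methods are hypothetical helpers and are not part of Rule Learner, and the labels in `main` are toy values rather than the real data.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Rule Learner: reports how often each value
// of the learning attribute occurs in a training set.
class ClassBalance {

    // Counts occurrences of each label, preserving first-seen order.
    static Map<String, Integer> counts(List<String> labels) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String label : labels) {
            result.merge(label, 1, Integer::sum);
        }
        return result;
    }

    // Share of the given label among all labels, as a fraction in [0, 1].
    static double share(List<String> labels, String label) {
        if (labels.isEmpty()) {
            return 0.0;
        }
        return (double) counts(labels).getOrDefault(label, 0) / labels.size();
    }

    public static void main(String[] args) {
        // Toy labels for illustration only; not taken from any real data set.
        List<String> labels = List.of("good", "good", "bad", "good");
        System.out.println(counts(labels));
        System.out.println(share(labels, "bad"));
    }
}
```

If one value of the learning attribute turns out to be heavily under-represented, that is a signal that the training set may need to be adjusted before learning.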

Usually a trainer is a subject matter expert (SME) who has extensive experience dealing with the historical enterprise data and has the competence and skills to establish goals, concepts, and/or criteria for detecting patterns and rules.

Training functionality usually covers the following tasks:

    • Selection of issues and attributes to be considered by a Rule Learner
    • Generation of new attributes that generalize the existing attributes by adding nominal attributes, ratios, etc.
    • Preliminary classification of issues
    • Instance filtering rules

It is important to automate the trainer function and make it an integral part of the system architecture, because you want to keep your learned rules up to date with the latest changes.

Rule Learner allows you to implement a Trainer as a special rule engine that automatically analyses large volumes of data in accordance with “training rules” created by an SME and uses this data to generate new training sets. Such a Trainer allows domain experts to incorporate their knowledge into Rule Learner by presenting it in the form of domain-specific training rules.
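To illustrate the attribute-generation task listed above, here is a rough sketch in plain Java of deriving a nominal attribute and a ratio attribute from existing ones. Everything in it is an assumption made for illustration: the class `DerivedAttributes`, the attribute names “Age”, “Credit Amount” and “Duration”, and the band thresholds are hypothetical, and in Rule Learner such logic would normally be expressed as training rules rather than Java code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of one Trainer task: deriving new attributes
// from existing ones. Not part of Rule Learner.
class DerivedAttributes {

    // Nominal attribute: maps a numeric age onto a small set of named bands.
    // The thresholds are arbitrary and chosen only for illustration.
    static String ageBand(int age) {
        if (age < 30) return "young";
        if (age < 60) return "middle";
        return "senior";
    }

    // Ratio attribute: credit amount per month of duration.
    static double amountPerMonth(double creditAmount, int durationMonths) {
        if (durationMonths <= 0) {
            throw new IllegalArgumentException("duration must be positive");
        }
        return creditAmount / durationMonths;
    }

    // Returns a copy of the instance extended with the two derived attributes;
    // the original map is left unchanged.
    static Map<String, Object> enrich(Map<String, Object> instance) {
        Map<String, Object> enriched = new LinkedHashMap<>(instance);
        int age = (Integer) instance.get("Age");
        double amount = ((Number) instance.get("Credit Amount")).doubleValue();
        int duration = (Integer) instance.get("Duration");
        enriched.put("Age Band", ageBand(age));
        enriched.put("Amount Per Month", amountPerMonth(amount, duration));
        return enriched;
    }
}
```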

Trainer – a Simple Example. The standard sample project “Credits” provides an example of how a rule-based Trainer can be implemented. This sample uses a hypothetical German credit data set that contains a sample of 1,000 debtors classified as “good” or “bad”. Its initial Glossary looks as follows:

The Excel table “instances” contains 1,000 training instances, too many to show here. We need to generate rules capable of determining the learning attribute “Classified As” (good or bad). Let’s run “learn.bat” using the following Java launcher “” described on the page “Java API”:

The “” are specified as follows:

It will run RIPPER against all 1,000 training instances and generate rules

which incorrectly classify 244 instances out of 1,000.

Now, let’s assume that we want to filter these training instances to use only foreign workers who are at least 30 years old. First, we will add to the glossary a new business concept “Filter” that has two arrays of TrainingInstances (see yellow lines):

To create an array “Filtered Instances” from the array “Source Instances” we will use the following filtering rules:

The table “FilterTrainingRules” iterates through the array “Source Instances” and applies the rules “FilteringRules” to add only instances for which

Age >= 30 and Foreign Worker is yes

to the array “Filtered Instances”. For those who know how to use OpenRules, this is a regular way to deal with collections of objects.
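For readers more familiar with Java than with OpenRules decision tables, the same filtering condition could be sketched roughly as follows. This is only an analogy, not the OpenRules implementation: the class `FilterSketch`, the nested `Instance` type, and its field names are hypothetical and model just the two attributes used by the filter.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical plain-Java analogue of the filtering rules; not OpenRules code.
class FilterSketch {

    // Minimal stand-in for a training instance with only the attributes
    // that the filter looks at.
    static final class Instance {
        final int age;
        final boolean foreignWorker;

        Instance(int age, boolean foreignWorker) {
            this.age = age;
            this.foreignWorker = foreignWorker;
        }
    }

    // Keeps only instances with Age >= 30 and Foreign Worker = yes,
    // preserving the order of the source list.
    static List<Instance> filter(List<Instance> sourceInstances) {
        return sourceInstances.stream()
                .filter(i -> i.age >= 30 && i.foreignWorker)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Toy data for illustration only.
        List<Instance> source = List.of(
                new Instance(25, true),
                new Instance(30, true),
                new Instance(45, false));
        System.out.println(filter(source).size());
    }
}
```

The advantage of keeping this condition in a rules table rather than in code is that an SME can change it without touching the launcher.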

To make sure that this Filter is applied before Rule Learner runs, we will create a new Java launcher “” that looks as follows:

Now it executes the goal “FilterTrainingRules” to create “filteredInstances” from “sourceInstances” and passes them to the LearningProblem. We also need to add two more properties and run.class to the file “”:

Now instead of “learn.bat” we can run the standard OpenRules “test.bat” to execute the run.class “LearnerWithTrainer”. First, it will run our Filter, producing 606 filtered instances, and then it will run RIPPER against all of them, producing the following classification rules:

These generated rules are quite different from the previously generated rules, and they incorrectly classify only 135 instances out of 605.
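Because the two runs use training sets of different sizes, the raw error counts are easier to compare as rates: 244 out of 1,000 is 24.4%, while 135 out of 605 is about 22.3%. The small sketch below performs this arithmetic; the class `ErrorRate` is a hypothetical helper and not part of Rule Learner, and the counts are the ones quoted in the text.

```java
// Hypothetical helper for comparing the two runs; not part of Rule Learner.
class ErrorRate {

    // Fraction of misclassified instances, in [0, 1].
    static double of(int misclassified, int total) {
        if (total <= 0 || misclassified < 0 || misclassified > total) {
            throw new IllegalArgumentException("invalid counts");
        }
        return (double) misclassified / total;
    }

    public static void main(String[] args) {
        // Counts as quoted in the text for the unfiltered and filtered runs.
        System.out.printf("unfiltered: %.1f%%%n", 100 * of(244, 1000));
        System.out.printf("filtered:   %.1f%%%n", 100 * of(135, 605));
    }
}
```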

Of course, a subject matter expert can now create training (filtering) rules of any complexity, and the same schema with a rule-based trainer will continue to work!