Machine Learning Algorithms and Tools. While there are many ML algorithms and implementation tools, it is naive to expect that one ML algorithm is an optimal choice for ALL data sets. In most cases, serious data analysis, as well as data cleaning and tuning, is needed. It is also to be expected that different ML algorithms may behave differently on the same data set.
To support these real-world needs, Rule Learner doesn’t limit itself to one particular ML algorithm. It has been designed to apply different ML algorithms and to accept different input/output formats. However, practical experience shows that two well-known ML algorithms, namely C4.5 and RIPPER, are usually the best choice to produce compact and human-understandable classification rules. The current version of Rule Learner utilizes the implementations of these algorithms provided by the popular open source machine learning framework WEKA.
Selecting ML Algorithms. To switch between different ML algorithms, modify the property “learning.classifier” in the file “project.properties”. For example, in the following extract from this file:
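A sketch of what such an extract might look like (the exact property syntax may differ in your version of Rule Learner):

```
# RIPPER is commented out:
#learning.classifier=RIPPER
# C4.5 is activated:
learning.classifier=C4.5
```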
the classifier “RIPPER” is commented out (by putting # at the beginning of the line) and the classifier “C4.5” is activated.
Setting Learning Attribute. After you have defined your two main tables “Glossary” and “Instances” in an Excel file such as “Lenses.xls” (described here), you need to specify your learning attribute, whose value the generated classification rules should determine for any instance. This attribute is specified by the property “learning.attribute” in the file “project.properties”, e.g.:
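For the lenses example, where the learning attribute is “Contact Lenses”, the property might look like this (a sketch; the exact syntax may differ in your version):

```
learning.attribute=Contact Lenses
```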
Analyzing ML Output. The selected ML algorithm extracts classification rules from the training set to determine the value of the learning attribute, in this case “Contact Lenses”. The generated rules are placed in an Excel file (by default “GeneratedRules.xls”) as a decision table ready to be executed by a rule engine:
After rule generation, you will also see the results in a format specific to the selected ML algorithm, e.g. here is the execution protocol for the C4.5 algorithm:
Along with the generated rules, it shows statistical metrics such as the numbers of correctly and incorrectly classified instances (20 and 4 in this particular case). These numbers estimate how the learned rules will behave on new, unseen instances, not how well they fit the provided training instances. The actual numbers of correctly and incorrectly classified training instances (22 and 2) are also shown, along with a list of the incorrectly classified training instances:
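As a quick sanity check of these metrics, accuracy is simply the share of correctly classified instances. A minimal Java sketch (not part of Rule Learner; the counts are taken from the example above):

```java
public class AccuracyCheck {
    // Accuracy as a percentage: correct / (correct + incorrect) * 100
    static double accuracy(int correct, int incorrect) {
        return 100.0 * correct / (correct + incorrect);
    }

    public static void main(String[] args) {
        // Cross-validated estimate: 20 correct, 4 incorrect out of 24
        System.out.printf("Cross-validation accuracy: %.2f%%%n", accuracy(20, 4));
        // On the training instances themselves: 22 correct, 2 incorrect
        System.out.printf("Training accuracy: %.2f%%%n", accuracy(22, 2));
    }
}
```

Note that the training accuracy (91.67%) is higher than the cross-validated estimate (83.33%), which is exactly the gap the next subsection is about.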
If you want to learn more about other statistical metrics (which is not critical for business analysts), we recommend this very good WEKA-based book.
Avoid Over-Fitting and Under-Fitting. A frequent issue with supervised machine learning is that your training set might not be representative enough.
Over-fitting is a key problem that occurs when a learning algorithm fits the training data set so well that noise and the peculiarities of the training data play a big role during rule generation. In such situations, the accuracy of the learned rules drops when they are tested on unknown data sets. The amount of data used for the learning process is fundamental in this context: small data sets are more prone to over-fitting than large data sets, but unfortunately large data sets can also be affected.
Under-fitting is the opposite of over-fitting and occurs when the ML algorithm is incapable of capturing the variability of the training data.
ML tools provide a built-in technique called cross-validation that helps to avoid these problems. By default, Rule Learner uses the recommended “10-fold cross-validation”, which has become the standard method in many practical situations. However, you may always experiment by changing the value from 10 to, for example, 20 by adding the property “cross.validation=20” to the file “project.properties”.
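To illustrate what k-fold cross-validation does: the training set is split into k parts, and each part is held out once for testing while rules are learned from the remaining k-1 parts. A simplified Java sketch of the fold partitioning (illustration only, not Rule Learner code):

```java
import java.util.ArrayList;
import java.util.List;

public class FoldSplit {
    // Split instance indices 0..n-1 into k folds of near-equal size.
    static List<List<Integer>> folds(int n, int k) {
        List<List<Integer>> result = new ArrayList<>();
        for (int f = 0; f < k; f++) result.add(new ArrayList<>());
        for (int i = 0; i < n; i++) result.get(i % k).add(i);
        return result;
    }

    public static void main(String[] args) {
        // 24 lenses instances, 10 folds: each fold serves once as the test set
        List<List<Integer>> folds = folds(24, 10);
        for (int f = 0; f < folds.size(); f++) {
            System.out.println("Fold " + f + " (test set): " + folds.get(f));
        }
    }
}
```

A real implementation would also stratify the folds so that each one preserves the class proportions of the full training set.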
However, if your training instances do not represent the right proportion of the different values of the learning attribute, you could hardly expect an ML classifier learned from that data to perform well on new examples. One practical way to remove outliers and/or add more representative training instances is to use expert knowledge of the business domain. Rule Learner provides a practical tool for doing exactly this by letting your domain experts represent their knowledge in the form of “training rules” – see Filtering Training Sets.
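A simple way to see whether your training instances are balanced is to count how often each value of the learning attribute occurs. A small Java sketch (illustration only; the labels shown are hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ClassDistribution {
    // Count occurrences of each learning-attribute value in a training set.
    static Map<String, Integer> counts(String[] labels) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String label : labels) result.merge(label, 1, Integer::sum);
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical "Contact Lenses" values from a training set
        String[] labels = {"None", "Soft", "None", "Hard", "None", "Soft"};
        counts(labels).forEach((value, count) ->
            System.out.println(value + ": " + count));
    }
}
```

If one value dominates heavily, the learned rules will tend to favor it, and adding more representative instances (or filtering with training rules) is worth considering.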
Adding More ML Algorithms. More ML algorithms can be added to Rule Learner on an as-needed basis. WEKA is only one possible implementation tool; it is open-sourced and shows good practical results. As Rule Learner is also open-sourced, a Java developer can add a new algorithm by providing new implementations of the standard Java classes. Alternatively, you may send a request for adding another algorithm to firstname.lastname@example.org.
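The interface and class names below are hypothetical – the actual Rule Learner classes may differ – but a new algorithm integration could follow this general shape: implement a training method and a classification method, then plug the implementation in via the configuration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only: the real Rule Learner classes may differ.
interface RuleClassifier {
    // Learn from training instances and their learning-attribute values.
    void train(String[][] instances, String[] labels);
    // Predict the learning-attribute value for a new instance.
    String classify(String[] instance);
}

// A trivial example implementation: always predicts the majority label.
public class MajorityClassifier implements RuleClassifier {
    private String majority = "";

    @Override
    public void train(String[][] instances, String[] labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String label : labels) counts.merge(label, 1, Integer::sum);
        int best = -1;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > best) {
                best = e.getValue();
                majority = e.getKey();
            }
        }
    }

    @Override
    public String classify(String[] instance) {
        return majority;
    }

    public static void main(String[] args) {
        MajorityClassifier c = new MajorityClassifier();
        c.train(new String[][]{{"young"}, {"old"}, {"old"}},
                new String[]{"Soft", "None", "None"});
        System.out.println(c.classify(new String[]{"young"}));
    }
}
```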