A Java Naive Bayes Classifier that works in-memory or off the heap on fast key-value stores (MapDB, LevelDB or RocksDB). Naive Bayes classification is known to be fast. The objective of this ground-up implementation is to provide a self-contained, vertically scalable and explainable classifier.
It comes with three classic examples and unit tests:
- Sport / No Sport, based on weather conditions,
- An Introduction to Naïve Bayes Classifier,
- Banana / Orange or other Fruit.
Features
- learn and classify: NaiveBayesClassifierMapImpl works in-memory with a ConcurrentHashMap or off the heap using org.mapdb.HTreeMap
- NaiveBayesClassifierMapLaplacedImpl adds Laplace smoothing to the implementation above
- other popular key-value stores are supported: LevelDB and RocksDB
- explain: NaiveBayesExplainerImpl provides an explainable trace of the algorithm, which can be read by a human (formulae and expressions) or evaluated as plain JavaScript
- Javadoc documentation
- Maven central signed JARs
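The Laplace smoothing mentioned in the feature list can be illustrated with a minimal sketch. It assumes the classic add-alpha formula, (count + alpha) / (categoryCount + alpha * distinctValues); the class and method names are hypothetical, and NaiveBayesClassifierMapLaplacedImpl may compute its estimates differently.

```java
// Hypothetical sketch of add-alpha (Laplace) smoothing; not the library's internal code.
class LaplaceSketch {

    /**
     * Smoothed estimate of P(feature = value | category).
     *
     * @param valueCount     observations of this feature value within the category
     * @param categoryCount  observations of the category
     * @param distinctValues number of distinct values the feature can take
     * @param alpha          smoothing strength (1.0 gives classic Laplace smoothing)
     */
    static double smoothed(long valueCount, long categoryCount, long distinctValues, double alpha) {
        return (valueCount + alpha) / (categoryCount + alpha * distinctValues);
    }

    public static void main(String[] args) {
        // A value never seen with the category: unsmoothed 0/5, smoothed 1/8.
        System.out.println(smoothed(0, 5, 3, 1.0)); // 0.125
        // A value seen once out of five: unsmoothed 1/5, smoothed 2/8.
        System.out.println(smoothed(1, 5, 3, 1.0)); // 0.25
    }
}
```

Without smoothing, a single feature value never observed with a category drives that category's whole product to zero; smoothing keeps every factor strictly positive.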
API
It has a simple API for train, classify and explain.
Simply add this dependency to the project:
    <dependency>
        <groupId>com.namsor</groupId>
        <artifactId>Java-Naive-Bayes-Classifier-JNBC</artifactId>
        <version>2.0.4</version>
    </dependency>
and then use the INaiveBayesClassifier and INaiveBayesExplainer like this:
    String[] cats = {YES, NO};
    INaiveBayesClassifier bayes = new NaiveBayesClassifierMapImpl("tennis", cats);

    // Learn, e.g. Category=Yes, Conditions=Sunny, Cool, Normal and Weak.
    String category = YES;
    for(...) { // training loop
        Map<String, String> features = new HashMap<>();
        // put features
        bayes.learn(category, features);
    }

    // Shall we play given the weather conditions Sunny, Cool, High humidity and Strong wind?
    Map<String, String> features = new HashMap<>();
    features.put("outlook", "Sunny");
    features.put("temp", "Cool");
    features.put("humidity", "High");
    features.put("wind", "Strong");
    IClassification predict = bayes.classify(features, true);

    // How do you explain that result?
    INaiveBayesExplainer explainer = new NaiveBayesExplainerImpl();
    IClassificationExplained explained = explainer.explain(predict);
    System.out.println(explained.toString());
See the Javadoc for more details about the API.
Performance
Binomial classifiers: the AbstractNaiveBayesClassifierMapImpl with an in-memory ConcurrentHashMap can learn from billions of facts and classify new data very fast. Using off-the-heap persistent key-value stores can help scale vertically to even larger volumes. For example, the MapDB implementation on SSDs is only about 3-5 times slower and can scale to larger volumes.
Multinomial classifiers: with many class categories and many features, you may need to use the in-memory ConcurrentHashMap implementation and allocate more memory to the Java heap. This implementation is known to run smoothly on servers with 192 GB of RAM. Further optimization will be needed to use MapDB, LevelDB or RocksDB effectively when the classification needs to read a LOT of data.
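One way to see why the same classifier can sit on a ConcurrentHashMap or on a persistent key-value store is to treat the model as a flat map from string keys to counters. The sketch below is an assumption, not the library's code: the class CountStoreSketch is hypothetical, and its key layout is only inferred from the variable names that appear in the explainer trace in the Explainability section (gL, gL_cA_No, gL_cA_No_fE_wind_is_Strong, and so on).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: flat count keys in a map, named after the keys in the explainer trace.
class CountStoreSketch {

    private final Map<String, Long> counts = new ConcurrentHashMap<>();

    private void increment(String key) {
        counts.merge(key, 1L, Long::sum); // atomic on ConcurrentHashMap
    }

    /** Records one observation: global, per category, per feature, per feature value. */
    void learn(String category, Map<String, String> features) {
        increment("gL");
        increment("gL_cA_" + category);
        for (Map.Entry<String, String> f : features.entrySet()) {
            increment("gL_fE_" + f.getKey());
            increment("gL_cA_" + category + "_fE_" + f.getKey());
            increment("gL_cA_" + category + "_fE_" + f.getKey() + "_is_" + f.getValue());
        }
    }

    /** Returns the stored count, or 0 for a key that was never observed. */
    long count(String key) {
        return counts.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        CountStoreSketch store = new CountStoreSketch();
        store.learn("Yes", Map.of("outlook", "Sunny", "wind", "Weak"));
        store.learn("No", Map.of("outlook", "Sunny", "wind", "Strong"));
        System.out.println(store.count("gL"));                         // 2
        System.out.println(store.count("gL_cA_No_fE_wind_is_Strong")); // 1
        System.out.println(store.count("gL_fE_outlook"));              // 2
    }
}
```

Because learning only increments counters and classification only reads them, any store exposing a map-like interface could in principle replace the ConcurrentHashMap here, at the cost of slower reads.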
Explainability
Use NaiveBayesExplainerImpl to explain exactly how a classification was calculated by the engine (with or without Laplace smoothing). When running the first example, Sport / No Sport, we see that we are unlikely to play sport given the weather conditions outlook Sunny, temperature Cool, humidity High and wind Strong: P(No)=0.795417348608838 > P(Yes)=0.204582651391162. But how exactly was that calculated? You can reproduce the calculations using a spreadsheet like this one, or you can use the NaiveBayesExplainerImpl to trace the algorithm's calculations as Formulae, Expressions and Values.
    // JavaScript :
    // observation table variables
    var gL=14
    var gL_cA_No=5
    var gL_cA_No_fE_humidity=5
    var gL_cA_No_fE_humidity_is_High=4
    var gL_cA_No_fE_outlook=5
    var gL_cA_No_fE_outlook_is_Sunny=3
    var gL_cA_No_fE_temp=5
    var gL_cA_No_fE_temp_is_Cool=1
    var gL_cA_No_fE_wind=5
    var gL_cA_No_fE_wind_is_Strong=3
    var gL_cA_Yes=9
    var gL_cA_Yes_fE_humidity=9
    var gL_cA_Yes_fE_humidity_is_High=3
    var gL_cA_Yes_fE_outlook=9
    var gL_cA_Yes_fE_outlook_is_Sunny=2
    var gL_cA_Yes_fE_temp=9
    var gL_cA_Yes_fE_temp_is_Cool=3
    var gL_cA_Yes_fE_wind=9
    var gL_cA_Yes_fE_wind_is_Strong=3
    var gL_fE_humidity=14
    var gL_fE_outlook=14
    var gL_fE_temp=14
    var gL_fE_wind=14

    // likelyhoods by category
    // likelyhoods for category No
    var likelyhoodOfNo=gL_cA_No / gL * (gL_cA_No_fE_temp_is_Cool / gL_cA_No_fE_temp * gL_cA_No_fE_humidity_is_High / gL_cA_No_fE_humidity * gL_cA_No_fE_outlook_is_Sunny / gL_cA_No_fE_outlook * gL_cA_No_fE_wind_is_Strong / gL_cA_No_fE_wind * 1 )
    var likelyhoodOfNoExpr=5 / 14 * (1 / 5 * 4 / 5 * 3 / 5 * 3 / 5 * 1 )
    var likelyhoodOfNoValue=0.020571428571428574
    // likelyhoods for category Yes
    var likelyhoodOfYes=gL_cA_Yes / gL * (gL_cA_Yes_fE_temp_is_Cool / gL_cA_Yes_fE_temp * gL_cA_Yes_fE_humidity_is_High / gL_cA_Yes_fE_humidity * gL_cA_Yes_fE_outlook_is_Sunny / gL_cA_Yes_fE_outlook * gL_cA_Yes_fE_wind_is_Strong / gL_cA_Yes_fE_wind * 1 )
    var likelyhoodOfYesExpr=9 / 14 * (3 / 9 * 3 / 9 * 2 / 9 * 3 / 9 * 1 )
    var likelyhoodOfYesValue=0.005291005291005291

    // probability estimates by category
    // probability estimate for category No
    var probabilityOfNo=likelyhoodOfNo/(likelyhoodOfNo+likelyhoodOfYes+0)
    var probabilityOfNoValue=0.795417348608838
    // probability estimate for category Yes
    var probabilityOfYes=likelyhoodOfYes/(likelyhoodOfNo+likelyhoodOfYes+0)
    var probabilityOfYesValue=0.204582651391162

    // return the highest probability estimate for evaluation
    probabilityOfNo
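The arithmetic in the trace above can be checked independently of the library. The following is a minimal sketch that uses only the counts printed in the trace (no smoothing); the class TraceCheck and its method names are hypothetical and are not part of the JNBC API.

```java
// Hypothetical re-computation of the trace, using only the counts printed by the explainer.
class TraceCheck {

    /** Unnormalized score: category prior times the product of per-feature conditional frequencies. */
    static double likelihood(int categoryCount, int total, int[] featureValueCounts) {
        double score = (double) categoryCount / total;
        for (int count : featureValueCounts) {
            score *= (double) count / categoryCount;
        }
        return score;
    }

    /** Returns {P(No), P(Yes)} for the Sport / No Sport example. */
    static double[] posteriors() {
        // Counts in the order temp=Cool, humidity=High, outlook=Sunny, wind=Strong.
        double no = likelihood(5, 14, new int[] {1, 4, 3, 3});
        double yes = likelihood(9, 14, new int[] {3, 3, 2, 3});
        double sum = no + yes;
        return new double[] {no / sum, yes / sum};
    }

    public static void main(String[] args) {
        double[] p = posteriors();
        System.out.println("P(No)=" + p[0] + " P(Yes)=" + p[1]);
    }
}
```

The two scores are normalized by their sum so that they add up to 1, which is what the probabilityOfNo and probabilityOfYes lines of the trace do. The printed values should agree with the trace up to floating-point rounding.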
Credits
This page template is from the Jsign project.
Contact
Elian Carsenat (elian.carsenat@namsor.com, @eliancarsenat)