Java Naive Bayes Classifier JNBC

A Java Naive Bayes Classifier that works in-memory or off the heap on fast key-value stores (MapDB, LevelDB or RocksDB). Naive Bayes Classification is known to be fast. The objective of this ground-up implementation is to provide a self-contained, vertically scalable and explainable classifier.

It comes with three classic examples and unit tests.

Features

API

It has a simple API to train, classify and explain.

Add this Maven dependency to your project:

        <dependency>
            <groupId>com.namsor</groupId>
            <artifactId>Java-Naive-Bayes-Classifier-JNBC</artifactId>
            <version>2.0.4</version>
        </dependency>

and then use the INaiveBayesClassifier and INaiveBayesExplainer like this:

 String[] cats = {YES, NO};
 INaiveBayesClassifier bayes = new NaiveBayesClassifierMapImpl("tennis", cats);
 // learn ex. Category=Yes Conditions=Sunny, Cool, Normal and Weak.
 String category = YES;
 for(...) { // training loop
     Map features = new HashMap();
     // put features
     bayes.learn(category, features);
 }
 // Shall we play given outlook Sunny, temperature Cool, humidity High and wind Strong?
 Map features = new HashMap();
 features.put("outlook", "Sunny");
 features.put("temp", "Cool");
 features.put("humidity", "High");
 features.put("wind", "Strong");
 IClassification predict = bayes.classify(features, true);
 // How do you explain that result? 
 INaiveBayesExplainer explainer = new NaiveBayesExplainerImpl();
 IClassificationExplained explained = explainer.explain(predict);
 System.out.println(explained.toString());

See the Javadoc for more details about the API.


Performance

Binomial classifiers: AbstractNaiveBayesClassifierMapImpl, backed by an in-memory ConcurrentHashMap, can learn from billions of facts and classify new data very fast. Off-heap persistent key-value stores can help scale vertically to even larger volumes. For example, the MapDB implementation on SSDs is only about 3 to 5 times slower and can handle large volumes.

Multinomial classifiers: with many class categories and many features, you may need to use the in-memory ConcurrentHashMap implementation and allocate more memory to the Java heap. This implementation is known to run smoothly on servers with 192 GB of RAM. Further optimization will be needed to use MapDB, LevelDB or RocksDB effectively when classification needs to read a lot of data.
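
One way to see why the same algorithm can run on a heap map or on a key-value store is that training reduces to incrementing counters stored under string keys. The sketch below illustrates that idea only. The class CountStoreSketch is hypothetical and not part of JNBC, and its key layout is an assumption modelled on the variable names printed in the explainer trace in the Explainability section; the library's internal layout may differ.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not part of JNBC: observation counts under flat string keys.
public class CountStoreSketch {
    // Any key-value store offering an atomic increment could stand in for this map.
    private final Map<String, Long> counts = new ConcurrentHashMap<>();

    private void increment(String key) {
        counts.merge(key, 1L, Long::sum); // atomic per key on ConcurrentHashMap
    }

    // Record one observation: a category plus its feature name/value pairs.
    public void learn(String category, Map<String, String> features) {
        increment("gL");                      // all observations
        increment("gL_cA_" + category);       // observations in this category
        for (Map.Entry<String, String> e : features.entrySet()) {
            String featureKey = "gL_cA_" + category + "_fE_" + e.getKey();
            increment("gL_fE_" + e.getKey()); // feature seen, any category
            increment(featureKey);            // feature seen in this category
            increment(featureKey + "_is_" + e.getValue()); // feature value in this category
        }
    }

    // Counts default to zero for keys that were never observed.
    public long count(String key) {
        return counts.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        CountStoreSketch store = new CountStoreSketch();
        store.learn("No", Map.of("outlook", "Sunny", "wind", "Strong"));
        store.learn("Yes", Map.of("outlook", "Overcast", "wind", "Weak"));
        System.out.println(store.count("gL"));
        System.out.println(store.count("gL_cA_No_fE_outlook_is_Sunny"));
    }
}
```

With this layout, classification only needs point lookups by key. That fits stores such as MapDB, LevelDB or RocksDB, but a classifier with many categories and features will issue many such reads per prediction.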


Explainability

Use NaiveBayesExplainerImpl to explain exactly how the engine calculated a classification (with or without Laplace smoothing). Running the first example, Sport / No Sport, shows that we are unlikely to play sport given outlook Sunny, temperature Cool, humidity High and wind Strong: P(No)=0.795417348608838 > P(Yes)=0.204582651391162. But how exactly was that calculated? You can reproduce the calculations using a spreadsheet like this one, or you can use NaiveBayesExplainerImpl to trace the algorithm's calculations as Formulae, Expressions and Values.

// JavaScript : 

// observation table variables 
var gL=14
var gL_cA_No=5
var gL_cA_No_fE_humidity=5
var gL_cA_No_fE_humidity_is_High=4
var gL_cA_No_fE_outlook=5
var gL_cA_No_fE_outlook_is_Sunny=3
var gL_cA_No_fE_temp=5
var gL_cA_No_fE_temp_is_Cool=1
var gL_cA_No_fE_wind=5
var gL_cA_No_fE_wind_is_Strong=3
var gL_cA_Yes=9
var gL_cA_Yes_fE_humidity=9
var gL_cA_Yes_fE_humidity_is_High=3
var gL_cA_Yes_fE_outlook=9
var gL_cA_Yes_fE_outlook_is_Sunny=2
var gL_cA_Yes_fE_temp=9
var gL_cA_Yes_fE_temp_is_Cool=3
var gL_cA_Yes_fE_wind=9
var gL_cA_Yes_fE_wind_is_Strong=3
var gL_fE_humidity=14
var gL_fE_outlook=14
var gL_fE_temp=14
var gL_fE_wind=14


// likelyhoods by category 

// likelyhoods for category No
var likelyhoodOfNo=gL_cA_No / gL * (gL_cA_No_fE_temp_is_Cool / gL_cA_No_fE_temp * gL_cA_No_fE_humidity_is_High / gL_cA_No_fE_humidity * gL_cA_No_fE_outlook_is_Sunny / gL_cA_No_fE_outlook * gL_cA_No_fE_wind_is_Strong / gL_cA_No_fE_wind * 1 )
var likelyhoodOfNoExpr=5 / 14 * (1 / 5 * 4 / 5 * 3 / 5 * 3 / 5 * 1 )
var likelyhoodOfNoValue=0.020571428571428574

// likelyhoods for category Yes
var likelyhoodOfYes=gL_cA_Yes / gL * (gL_cA_Yes_fE_temp_is_Cool / gL_cA_Yes_fE_temp * gL_cA_Yes_fE_humidity_is_High / gL_cA_Yes_fE_humidity * gL_cA_Yes_fE_outlook_is_Sunny / gL_cA_Yes_fE_outlook * gL_cA_Yes_fE_wind_is_Strong / gL_cA_Yes_fE_wind * 1 )
var likelyhoodOfYesExpr=9 / 14 * (3 / 9 * 3 / 9 * 2 / 9 * 3 / 9 * 1 )
var likelyhoodOfYesValue=0.005291005291005291


// probability estimates by category 

// probability estimate for category No
var probabilityOfNo=likelyhoodOfNo/(likelyhoodOfNo+likelyhoodOfYes+0)
var probabilityOfNoValue=0.795417348608838

// probability estimate for category Yes
var probabilityOfYes=likelyhoodOfYes/(likelyhoodOfNo+likelyhoodOfYes+0)
var probabilityOfYesValue=0.204582651391162


// return the highest probability estimate for evaluation 
probabilityOfNo
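
The arithmetic in the trace above can also be checked by hand in plain Java. The sketch below is a hypothetical helper (TennisTraceCheck is not part of JNBC). It hard-codes the counts from the observation table, applies no smoothing, and assumes every feature was observed in every training row, so each per-feature denominator equals the category count, as it does in this trace.

```java
// Hypothetical helper, not part of JNBC: redoes the trace's arithmetic by hand.
public class TennisTraceCheck {

    // Category prior multiplied by the per-feature conditional frequencies (no smoothing).
    // Assumes each feature's denominator equals the category count.
    public static double likelihood(int categoryCount, int total, int[] featureValueCounts) {
        double result = (double) categoryCount / total;
        for (int c : featureValueCounts) {
            result *= (double) c / categoryCount;
        }
        return result;
    }

    // Normalise one likelihood against the sum of the likelihoods of all categories.
    public static double probability(double likelihood, double... allLikelihoods) {
        double sum = 0.0;
        for (double l : allLikelihoods) {
            sum += l;
        }
        return likelihood / sum;
    }

    public static void main(String[] args) {
        // Counts in the order temp=Cool, humidity=High, outlook=Sunny, wind=Strong.
        double no = likelihood(5, 14, new int[] {1, 4, 3, 3});
        double yes = likelihood(9, 14, new int[] {3, 3, 2, 3});
        System.out.println("P(No)  = " + probability(no, no, yes));
        System.out.println("P(Yes) = " + probability(yes, no, yes));
    }
}
```

The two printed estimates should agree with probabilityOfNoValue and probabilityOfYesValue in the trace, up to floating-point rounding.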

Credits

This page template comes from the Jsign project.

Contact

Elian Carsenat (elian.carsenat@namsor.com, @eliancarsenat)