Software Engineering and Machine Learning enthusiast: April 2019

Monday, April 29, 2019

Go Vs Java Benchmark

Disclaimer

I have 9 YEARS of experience in Java and 3 DAYS of experience in Go. This comparison might be highly biased.

What is compared here.

Ease of coding
Execution Time
Memory used

What is being done here?

I recently (3 days ago) started learning Go and wanted to know whether it is actually as good as it is publicized to be. So, to learn and test at the same time, I translated one Java project into Go and benchmarked both versions. Below are the results.

The machine used is an Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz (2 physical cores, 4 logical cores with hyperthreading).

My understanding of Go (based on 3 days):

Go is a better C. Go cherry-picked the good parts of C, Java, and Python and added new concepts like goroutines.

Go seems very simple if you know C and Java.

**Results**
Metric	Go	Java
Serial Execution (Seconds)	86	171
Parallel Execution (Seconds)	105	207
Memory (MB)	3112	4423
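
For reference, here is a minimal sketch of how numbers like these could be collected on the Java side. It is not the harness used for the table above: the class name `BenchmarkSketch`, the `measure` helper, and the dummy workload are made up for illustration, and the heap figure is only the heap in use right after the run, not the peak.

```java
/** Minimal wall-clock and heap measurement helper (illustrative sketch only). */
final class BenchmarkSketch {

    /** One measurement: elapsed seconds and approximate used heap in MB. */
    static final class Measurement {
        final double seconds;
        final long usedHeapMb;

        Measurement(double seconds, long usedHeapMb) {
            this.seconds = seconds;
            this.usedHeapMb = usedHeapMb;
        }
    }

    /** Heap currently in use by the JVM, in bytes (allocated minus free). */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Runs the workload once and records wall-clock time and used heap afterwards. */
    static Measurement measure(Runnable workload) {
        long start = System.nanoTime();
        workload.run();
        long elapsedNanos = System.nanoTime() - start;
        long usedMb = usedHeapBytes() / (1024L * 1024L);
        return new Measurement(elapsedNanos / 1e9, usedMb);
    }

    public static void main(String[] args) {
        // Dummy workload standing in for the real project being benchmarked.
        Measurement m = measure(() -> {
            long sum = 0;
            for (int i = 0; i < 50_000_000; i++) {
                sum += i % 7;
            }
            if (sum < 0) {
                throw new IllegalStateException("unreachable, keeps the loop observable");
            }
        });
        System.out.printf("time=%.3fs heap=%dMB%n", m.seconds, m.usedHeapMb);
    }
}
```

A single run like this includes JIT warm-up, so repeating the workload a few times and discarding the first runs would give a fairer comparison against a compiled Go binary.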

Conclusion:

Go is really useful for small projects with a relatively small code base. I think that as the code grows and gains more functionality, managing it might become a little difficult.
Go is easy to code in and, in this benchmark, quite efficient compared to Java: it took roughly half the execution time and about 30% less memory.

Thursday, April 4, 2019

How to do LightGBM predictions in realtime at high scale in Java

We recently ran into a problem where we needed to serve predictions from a LightGBM model in real time in Java. Training in Python and predicting in Java is not a new problem, but ours was unique because the popular approach, using PMML for prediction, gave us the following issues.

Issues

Validation on the PMML generation side is very strict, even stricter than LightGBM's. Some examples:

PMML doesn't allow single-valued features, while LightGBM does.
PMML imposes odd restrictions on value ranges.

At prediction time, the PMML predictor performed validation that made predictions very costly.
The PMML generator was not able to parse feature values with special characters from the model dump.
PMML generation took 20 minutes and 60 GB of memory for a 160 MB model file. This is a big issue if we need to update the model every hour.
The PMML client has to be aware of the data types of the features.

We were able to find workarounds for some of the above but not all, and we were in a bad position to scale with PMML. Finally, we gave up on PMML and wrote our own parser and predictor in Java, which parses the LightGBM model dump, loads it into our own objects, and does the predictions.
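
To give an idea of what such a predictor looks like, here is a minimal sketch, not our production code. It assumes each tree has already been parsed into parallel arrays similar to the `split_feature`, `threshold`, `left_child`, `right_child` and `leaf_value` fields of a LightGBM text model, with a negative child value `c` pointing at leaf `~c`. The names `GbdtSketch` and `Tree` are made up for illustration, and the sketch only handles numerical splits (no categorical splits, missing-value handling, or objective transform).

```java
/** Sketch of a tree-ensemble predictor over arrays parsed from a model dump. */
final class GbdtSketch {

    /** One decision tree stored as parallel arrays (assumed layout, numerical splits only). */
    static final class Tree {
        final int[] splitFeature;  // feature index tested at each internal node
        final double[] threshold;  // go left when feature value <= threshold
        final int[] leftChild;     // >= 0: internal node index, < 0: leaf with index ~value
        final int[] rightChild;    // same encoding as leftChild
        final double[] leafValue;  // output value of each leaf

        Tree(int[] splitFeature, double[] threshold,
             int[] leftChild, int[] rightChild, double[] leafValue) {
            this.splitFeature = splitFeature;
            this.threshold = threshold;
            this.leftChild = leftChild;
            this.rightChild = rightChild;
            this.leafValue = leafValue;
        }

        /** Walks from the root to a leaf and returns that leaf's value. */
        double predict(double[] features) {
            if (splitFeature.length == 0) {
                return leafValue[0]; // tree with a single leaf and no splits
            }
            int node = 0;
            while (node >= 0) {
                boolean goLeft = features[splitFeature[node]] <= threshold[node];
                node = goLeft ? leftChild[node] : rightChild[node];
            }
            return leafValue[~node]; // ~(-1) == 0, ~(-2) == 1, ...
        }
    }

    /** Raw score of the ensemble: the sum of the leaf values over all trees. */
    static double rawScore(Tree[] trees, double[] features) {
        double sum = 0.0;
        for (Tree tree : trees) {
            sum += tree.predict(features);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Hand-built toy tree: node 0 tests feature 0, node 1 tests feature 1.
        Tree tree = new Tree(
                new int[] {0, 1},
                new double[] {0.5, 2.0},
                new int[] {-1, -2},
                new int[] {1, -3},
                new double[] {0.1, 0.2, 0.3});
        System.out.println(rawScore(new Tree[] {tree}, new double[] {1.0, 1.0}));
    }
}
```

Keeping each tree in primitive arrays instead of node objects keeps the loaded model compact and makes it easy to set a breakpoint or log the visited node inside `predict`.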

Problems faced:

LightGBM uses its own format and tree representation in its model dump. The lack of documentation, combined with a fairly complex representation of the trees, makes the dump hard to understand. Understanding it was the main thing that kept us from taking this approach from the beginning.
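
As an illustration of the parsing side, here is a small sketch that reads one tree block made of `key=value` lines with space-separated numbers. The field names in the sample follow what a LightGBM text model file typically contains, but treat the exact layout as an assumption and check it against your own dump; `ModelTextSketch` and its helper methods are hypothetical names, not part of any library.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of parsing one "key=value" tree block from a text model dump. */
final class ModelTextSketch {

    /** Collects the key=value lines of a block into a map; other lines are skipped. */
    static Map<String, String> parseBlock(String block) {
        Map<String, String> fields = new HashMap<>();
        for (String line : block.split("\\R")) {
            int eq = line.indexOf('=');
            if (eq > 0) {
                fields.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
            }
        }
        return fields;
    }

    /** Converts a space-separated list such as "0.5 2.0" into a double array. */
    static double[] toDoubles(String value) {
        String trimmed = value.trim();
        if (trimmed.isEmpty()) {
            return new double[0];
        }
        String[] parts = trimmed.split("\\s+");
        double[] out = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            out[i] = Double.parseDouble(parts[i]);
        }
        return out;
    }

    /** Converts a space-separated list such as "-1 -2" into an int array. */
    static int[] toInts(String value) {
        String trimmed = value.trim();
        if (trimmed.isEmpty()) {
            return new int[0];
        }
        String[] parts = trimmed.split("\\s+");
        int[] out = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            out[i] = Integer.parseInt(parts[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        // Hand-written sample block; a real dump has more fields per tree.
        String block = "Tree=0\nnum_leaves=3\nsplit_feature=0 1\nthreshold=0.5 2.0\n"
                + "left_child=-1 -2\nright_child=1 -3\nleaf_value=0.1 0.2 0.3\n";
        Map<String, String> fields = parseBlock(block);
        System.out.println(toInts(fields.get("left_child")).length + " internal nodes");
    }
}
```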

What we achieved:

The memory requirement for the loaded model is reduced by 50%.
Prediction time is reduced by 50%, cutting the server cost of computing predictions by 50%.
The model is more debuggable: we can log anything we want and add breakpoints to debug the model better.
We were able to remove the PMML middleman completely, along with the dependency on an external library.
Resource requirements are significantly reduced, and the process completes 20 minutes faster.
