Sunday, December 26, 2021

Do you consider your reporting manager as your mentor?

 I was talking to a friend and she kept comparing what I was saying with her manager. She believes all managers just say things which are complete bullshit and nothing good comes out of their mouths. From my end I was being completely logical and honest, speaking from my own experience. I observed she is not alone; everyone does this, for a few reasons:


1. People don't trust their managers.

2. People think it's the manager's job to lecture, so they ignore the manager as much as possible.

3. Managers usually have more experience and speak from that experience, and it's difficult to understand their advice without personally having had those experiences.


I am not claiming all managers are honest and/or always correct, but sometimes when they are honest and correct and try to help us, we ignore them for the above reasons.

Sunday, October 24, 2021

Random Notes

How to sum the values of a one-column file on the command line.

cat <file with one column of values> |  paste -sd+ - | bc


How to fix vim when keyboard input doesn't behave normally (vi-compatible mode)


Create the file ~/.vimrc if it does not already exist and add the line below to it (without quotes; a leading double quote would make the line a comment):
set nocp


What to do if the mouse/touchpad suddenly stops working on Ubuntu?

Run the following commands; they remove and reload the psmouse kernel module:
sudo modprobe -r psmouse && sudo modprobe psmouse





Friday, July 24, 2020

Explanation of the JSON model format of CatBoost

CatBoost is a very popular package for boosted trees. It automatically handles categorical features, which makes it very useful for training data with many categorical features of large cardinality. You can save a trained model to permanent storage in JSON format (or binary format). Currently its prediction library is available in Java (via JNI), Python & C/C++. I recently worked on a project where we needed to do predictions in Java but could not use JNI, so we decided to write a pure Java client for CatBoost. There is no documentation currently available that explains the JSON model file format, so I ended up understanding it from the C++ source code, and I am sharing that here.
I will use this sample_model.json, which is trained on sample.tsv, to explain with examples.

Section 1 : Some concepts of the CatBoost model

  • CatBoost converts categorical feature values into 32-bit hashes using CityHash (the hash values stored in the model, e.g. in cat_features_hash, fit in 32 bits). These hashes are available in the model if you pass the pool while saving the model. The location and format of these hashes in the model are explained in section 4.1.4.
  • CatBoost creates new features by combining two or more features into more complex features. These are called "ctr features". How these ctr features are represented and computed is explained in section 3.
  • CatBoost uses indexes to refer to features & tree node decisions
    • feature_index: the index of a feature within its own type's list, i.e. the index of a categorical feature in the list of categorical features, or of a numerical feature in the list of numerical features
    • flat_feature_index: the index of the feature (categorical or numerical) in the list of all features.
    • cat_feature_index: the index of a categorical feature in the ordered list of categorical features
      In the example file the order of features is day, hour, month, app_cat, user_occ, age, gender, statecode, so flat_feature_index = 0 means day. This list is further divided into numerical and categorical features; age is the only numerical feature, so its feature_index is 0, and cat_feature_index = 0 means day (the first categorical feature).
  • Borders : these are used in the conditions of internal tree nodes. How to compute the value compared against a border is a bit complex and is explained later. Borders are referred to by their index in the ordered list of borders across all feature types, in the following order:
    • Numerical borders
    • OneHotEncoded feature borders
    • CTR borders

Section 2 : CTR Types 

CatBoost converts categorical features into different numerical values, which are represented in the model as a numerator and a denominator. These numerical values are derived with Laplace smoothing using a numerator prior, a denominator prior and other components. This section explains which components are available in the model and how to compute the values for the different CTR types.

  • Section 2.1 : Counter or FeatureFreq
    • Components
      • Count Prior (prior_numerator)
      • Count 
      • Total Prior (prior_denominator)
      • Total (counter_denominator)
      • Scale
      • Shift
    • Formula
      Shift + Scale * ((Count Prior + Count)/(Total Prior + Total))
  • Section 2.2 : Borders
    • Components
      • Success Prior (prior_numerator)
      • Success
      • Failure Prior (prior_denominator)
      • Failures
      • Scale
      • Shift
    • Formula
      Shift + Scale * ((Success Prior + Successes)/(Success Prior + Successes + Failure Prior + Failures))
How to find the components from json model is explained later. 
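As a quick illustration, here is a minimal Java sketch of the two formulas above. The field names follow the model JSON (prior_numerator, prior_denomerator, scale, shift); the method names are mine.

 // Sketch of the two CTR formulas. The inputs come straight from the model
 // JSON and the hash_map lookups described in section 4.2.

 // "Counter" / "FeatureFreq":
 static double counterCtr(double shift, double scale,
                          double priorNum, double count,
                          double priorDenom, double total) {
     return shift + scale * ((priorNum + count) / (priorDenom + total));
 }

 // "Borders":
 static double bordersCtr(double shift, double scale,
                          double priorNum, double successes,
                          double priorDenom, double failures) {
     return shift + scale * ((priorNum + successes)
             / (priorNum + successes + priorDenom + failures));
 }

For example, counterCtr(0, 15, 0, 12, 1, 23) gives the 7.5 computed in the first tree walkthrough of section 4.3.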


Section 3 : Complex CTR Features: 


These feature combinations are represented as JSON. I will explain this section using the example row below

 day   hour   month   app_cat   user_occ   age   gender   statecode
 fri   22     oct     games     student    47    male     103

and the example combination below (not from sample_model.json) to compute.
{
    "elements":[
        {
            "cat_feature_index":6,
            "combination_element":
            "cat_feature_value"
        },
        {
            "border":12.5,
           	"combination_element":"float_feature",
            "float_feature_index":0
        },
        { 
            "cat_feature_index":5, 
            "combination_element":"cat_feature_exact_value", 
            "value":-845129958
        }
    ],
    "type":"Borders"
}

Representation

elements

This is an ordered list of features to be combined to form the new CTR feature.
In the above example there are 3 features that need to be combined. Each feature can be one of three types, given by the value of combination_element (a small sketch of computing an element's value follows this list):
  • Categorical Feature (combination_element = cat_feature_value):
    • This component has only one other element
      • cat_feature_index: we compute the CityHash of the raw value of the feature at cat_feature_index. In the example cat_feature_index=6, which means we should use the value of the feature statecode and take its CityHash.
    • ComputedValue : CityHash("103") = 2080643764
    • How to compute the CityHash is explained later.
  • Numerical Feature (combination_element = float_feature):
    • This component has 2 elements
      • float_feature_index: this tells which numerical feature to use. In the example float_feature_index = 0 means age.
      • border : a numerical value. We compare the feature value with it; if feature_value >= border this component computes to 1, else 0.
    • ComputedValue : 47 >= 12.5 ? 1 : 0 = 1
  • OneHotFeatures (combination_element = cat_feature_exact_value):
    • Here the CityHash of the feature value is compared with a given value. This has 2 elements
      • cat_feature_index: this tells which feature to use. In the example above cat_feature_index = 5, which is gender.
      • value : the value against which to compare the CityHash of the feature value. If they are equal this component computes to 1, else 0, i.e. if (CityHash(features[gender]) == -845129958) then 1 else 0
    • ComputedValue : CityHash("male") == -845129958 ? 1 : 0 = 1 (this hash value is assumed for this made-up example)
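To make the three cases concrete, here is a hedged Java sketch of computing one element's value. The Element class and cityHash32() are placeholders of mine, not CatBoost API; CatBoost hashes the raw string value with CityHash (see cat_features_hash in section 4.1.4 for precomputed hash/value pairs).

 static class Element {
     String combinationElement; // "cat_feature_value", "float_feature" or "cat_feature_exact_value"
     double border;             // used for "float_feature"
     long value;                // used for "cat_feature_exact_value"
 }

 // rawValue is the input row's value for the feature the element points at
 // (via cat_feature_index or float_feature_index).
 static long elementValue(Element e, String rawValue) {
     switch (e.combinationElement) {
         case "cat_feature_value":
             return cityHash32(rawValue);
         case "float_feature":
             return Double.parseDouble(rawValue) >= e.border ? 1 : 0;
         case "cat_feature_exact_value":
             return cityHash32(rawValue) == e.value ? 1 : 0;
         default:
             throw new IllegalArgumentException(e.combinationElement);
     }
 }

 // Placeholder: plug a real CityHash implementation in here.
 static long cityHash32(String s) {
     throw new UnsupportedOperationException("wire in CityHash");
 }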

type

type is the CTR type as explained in Section 2. 

How to combine the components of the list into one value :

To combine these values into one value CatBoost uses a special multiplier
MAGIC_MULT = 0x4906BA494954CB65
It iterates over the list of elements as follows, starting with hash = 0:
 hash = 0
 for (i in 1:size(elements)) {
     hash = MAGIC_MULT * ( hash + MAGIC_MULT * VALUE(i) )
 }
So to compute the above example:
element[0] :
 MAGIC_MULT * ( 0 + MAGIC_MULT * 2080643764 ) = 5418337360679822996

element[1] :
 MAGIC_MULT * ( 5418337360679822996 + MAGIC_MULT * 1 ) = 6727615186874477117

element[2] :
 MAGIC_MULT * ( 6727615186874477117 + MAGIC_MULT * 1 ) = 4977781465358360298

So finally we will use 4977781465358360298 to look up values in the model to compute the feature value for the CTR type given in the "type" field. How to find the values using this hashed value is explained later in section 4.2.
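A minimal Java sketch of this combination step. I am assuming the C++ implementation uses unsigned 64-bit wraparound arithmetic, which Java's long multiplication reproduces; the worked values above can be used to verify.

 static final long MAGIC_MULT = 0x4906ba494954cb65L;

 // Combines per-element values into the single hash used for ctr_data lookups.
 static long combineCtrHash(long[] elementValues) {
     long hash = 0L;
     for (long v : elementValues) {
         hash = MAGIC_MULT * (hash + MAGIC_MULT * v);
     }
     return hash;
 }

 // combineCtrHash(new long[]{2080643764L, 1L, 1L}) should reproduce
 // 4977781465358360298 from the example above.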

Section 4 : Structure of catboost json file

There are 6 high-level elements in the outer JSON. I will explain them one by one.

section 4.1 : features_info : 

This gives the list of features and other information, including the ordering of borders, as explained below in this section. There are 3 different feature types :

section 4.1.1 : float_features

 "float_features":[ 
    { 
        "feature_name":"age", 
        "nan_value_treatment":"AsIs", 
        "has_nans":false, 
        "flat_feature_index":5, 
        "feature_index":0,"borders":[
            12.5,    13.5,    15.5,    16.5,     20.5,
            21.5,    22.5,    24.5,    25.5,     29.5,
            30.5,    31.5,    34.5,    35.5,     36.5,
            37.5,    41.5,    47.5,    49.5,     50.5,
            52.5,    53.5,    54.5,    55.5,     58.5
        ]
    }
 ]

This is the ordered list of numerical features, along with the ordered list of borders used for decisions in trees.

feature_name : name of the feature, present if the pool is passed while creating the JSON dump.
nan_value_treatment: how to treat the feature if its value is not passed at prediction time.

has_nans : whether this feature had missing values at training time
flat_feature_index : the flat_feature_index of this numerical feature
feature_index : the feature_index of this numerical feature
borders : the first part of the global ordered list of values used in the conditions of nodes where this numerical feature appears. This order is very important, as values are referred to by their indexes. The global index is called split_index. There is a lot of information we can get just by knowing the split_index.

For a FloatFeature type of feature we can get the following information just by knowing the split_index:

  1. Border value
  2. Feature name
  3. Feature index


For the above float_feature example, we can generate 25 split_indexes. Below are two examples, the first and the last. Please note that their split_index values are 0 and 24.

 split_index(0) = 
 { 
     "feature_type": "FloatFeature",
     "border_value": 12.5,
     "flat_feature_index": 5, 
     "feature_index": 0,
     "feature_name": "age"
 }
 split_index(24) = 
 {
     "feature_type": "FloatFeature",
     "border_value": 58.5,
     "flat_feature_index": 5,
     "feature_index": 0,
     "feature_name": "age"
 }

split_indexes are indexed across all the features (numerical, OneHot, CTR), which means the next feature type's split_indexes will start from 25.

section 4.1.2 : categorical_features

This is the ordered list of categorical features, along with information about one-hot encoded values.

 "categorical_features":[ 
    {
        "feature_name":"day", 
       	"flat_feature_index":0, 
        "feature_index":0
    },
    {
    	"feature_name":"hour", 
        "flat_feature_index":1,
        "feature_index":1 
    }, 
    { 
    	"feature_name":"month", 
        "flat_feature_index":2, 
        "feature_index":2
    },
    { 
    	"feature_name":"app_cat", 
        "flat_feature_index":3,
        "feature_index":3
    }, 
    { 
    	"feature_name":"user_occ", 
        "values":[     
            -1445282301,   
            -364013047,
            1886646856
        ],
      	"flat_feature_index":4,
        "feature_index":4 
    }, 
    { 
    	"feature_name":"gender", 
        "flat_feature_index":6, 
        "feature_index":5 
    }, 
    {  
    	"feature_name":"statecode", 
        "flat_feature_index":7, 
        "feature_index":6 
    }
 ]
feature_name, flat_feature_index, feature_index : already explained above.
values : the ordered list of values used for the one-hot feature. These are part of the global list of split_indexes. For categorical_features a split_index gives the following information:
  1. Feature Name
  2. Feature Index
  3. Value for OneHot comparison
For the above example of categorical_features we can generate 3 OneHot split_indexes. Since this is part of the global list of split_indexes and we have already used indexes 0 to 24 for float_features, the indexes for these split_indexes start at 25.

 split_index(25) = 
 {
    "feature_type": "OneHot",
    "value": -1445282301,
    "feature_name": "user_occ",
    "flat_feature_index": 4, 
    "feature_index": 4
 }
 split_index(27) =
 { 
    "feature_type": "OneHot",
    "value": 1886646856, 
    "feature_name": "user_occ",
    "flat_feature_index": 4, 
    "feature_index": 4
 }


Note : there are a few more OneHot split_indexes not listed here. Those values are explained in section 4.3.1.

Section 4.1.3: ctrs

This is the ordered list of CTR features. It has important fields which are tightly linked to the CTR types.

 "ctrs":[ 
    {
        "borders":[ 
            0.9999989867,    1.999999046,    3.999999046,    4.999999046, 
            6.999999046,     8.999999046,    9.999999046,    11.99999905 
        ],
        "prior_numerator":0, 
        "shift":-0,"target_border_idx":0, 
        "ctr_type":"Borders", 
        "scale":15, 
        "elements":[ 
            { 
                "cat_feature_index":0, 
                "combination_element":"cat_feature_value"
            } 
        ],     
        "identifier":"{\"identifier\":[{\"cat_feature_index\":0,\"combination_element\":\"cat_feature_value\"}],\"type\":\"Borders\"}", 
        "prior_denomerator":1 
    }, 
    {
        "borders": [0.9999989867], 
        "prior_numerator":0, 
        "shift":-0, 
        "target_border_idx":0, 
        "ctr_type":"Counter", 
        "scale":15, 
        "elements": [ 
            { 
                "cat_feature_index":0, 
                "combination_element":"cat_feature_value" 
            }, 
            { 
                "float_feature_index":0, 
                "combination_element":"float_feature", 
                "border":15.5 
            } 
        ], 
        "identifier":"{\"identifier\":[{\"cat_feature_index\":0,\"combination_element\":\"cat_feature_value\"},{\"border\":15.5,\"combination_element\":\"float_feature\",\"float_feature_index\":0}],\"type\":\"Counter\"}", 
        "prior_denomerator":1 
    }, 
    .... 
    .... 
 ]
ctr_type: the type of the CTR feature. The different types were explained earlier.
prior_numerator, prior_denominator, scale, shift: the same as explained in the "CTR types" section.
elements: the list of features that form this CTR feature. How to compute the CTR lookup key is explained in section 3; how to compute the value using that key is explained in section 4.2 & section 2.

identifier: a stringified JSON of elements & ctr_type. It is used to find the other values required to compute the CTR value. How to find those values is explained later in section 4.2.

borders: the ordered list of borders used for this CTR feature. These are part of the global list of split_indexes. For ctrs a split_index gives the following information (a small helper that resolves a global split_index to its section is sketched after the example below):
  1. elements
  2. ctr_type
  3. shift
  4. scale
  5. prior_numerator
  6. prior_denominator
  7. border value 
  8. identifier 
    Since this is part of the global list of split_indexes and we have already used indexes 0 to 24 for float_features and 25-27 for OneHot features, split_indexes for ctrs start at 28 in this illustration. (In the actual sample_model.json a few tree-derived OneHot split_indexes, described in section 4.3.1, also come before the ctrs, so the real starting index for ctrs can be higher.)

     split_index(28) = 
     {
        "split_type": "OnlineCTR",
        "prior_numerator":0, 
        "shift":-0, 
        "ctr_type":"Counter", 
        "scale":15, 
        "elements": 
        [ 
        	{ 
                "cat_feature_index":0, 
                "combination_element":"cat_feature_value" 
            }, 
            {
                "float_feature_index":0, 
                "combination_element":"float_feature", 
                "border":15.5 
            } 
        ],
        "identifier":"{\"identifier\":[{\"cat_feature_index\":0,\"combination_element\":\"cat_feature_value\"},{\"border\":15.5,\"combination_element\":\"float_feature\",\"float_feature_index\":0}],\"type\":\"Counter\"}", 
        "prior_denomerator":1,
        "border_value": 0.9999989867 
     }
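To recap the global ordering, here is a small hypothetical helper of mine that resolves a global split_index to its section, given the per-section counts (25 float borders and 3 one-hot values in this model; the counts are plain arguments here, not read from a real model).

     // Resolve a global split_index to its section and local position,
     // using the ordering described above: float borders first, then
     // one-hot values (including the tree-derived ones from section
     // 4.3.1), then CTR borders.
     static String resolveSplitIndex(int splitIndex, int numFloatBorders,
                                     int numOneHotValues, int numCtrBorders) {
         if (splitIndex < numFloatBorders)
             return "FloatFeature border #" + splitIndex;
         splitIndex -= numFloatBorders;
         if (splitIndex < numOneHotValues)
             return "OneHot value #" + splitIndex;
         splitIndex -= numOneHotValues;
         if (splitIndex < numCtrBorders)
             return "CTR border #" + splitIndex;
         throw new IllegalArgumentException("split_index out of range");
     }

     // e.g. resolveSplitIndex(26, 25, 3, 200) -> "OneHot value #1"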


    Section 4.1.4 : cat_features_hash

    It is a list of JSON elements. Each element contains 2 fields, hash and value,

    where CityHash(value) = hash

     "cat_features_hash": 
     [ 
        { 
            "hash":4216280702, 
            "value":"4" 
        }, 
        {
            "hash":1445320840,
            "value":"12"
        },
        {
            "hash":3509359534,
            "value":"thu"
        }, 
        ..... 
        ..... 
     ]
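    A small sketch of loading this section into a lookup map, assuming the JSON has already been parsed into basic Java collections (the parsing library is up to you):

     import java.util.HashMap;
     import java.util.List;
     import java.util.Map;

     // Builds hash -> raw string value from the cat_features_hash entries.
     // Useful for debugging which raw value a stored hash corresponds to.
     static Map<Long, String> loadCatFeaturesHash(List<Map<String, Object>> entries) {
         Map<Long, String> hashToValue = new HashMap<>();
         for (Map<String, Object> e : entries) {
             hashToValue.put(((Number) e.get("hash")).longValue(),
                             (String) e.get("value"));
         }
         return hashToValue;
     }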

    Section 4.2 : ctr_data 

    This gives information about the values needed to compute the different CTR types. It is a JSON map whose keys are the identifiers from section 4.1.3.

     "ctr_data":   
     { 
     	"{\"identifier\":[{\"cat_feature_index\":6,\"combination_element\":\"cat_feature_value\"}],\"type\":\"Counter\"}":  	{ 
        	"counter_denominator":32, 
            "hash_stride":2, 
            "hash_map": 
            [ 
                "18446744073709551615",  
                18,     
                "9285085082777897483",   
                22, 
                "5287930183868741137",	      
                22,     
                "13172738045064563988",     
                20,
                "6703001155837839893",   
                25, 
                "4858559805981312021", 
                17,	      
                "5418337360679822996", 
                26, 
                ..... 
            ]    
        }, 
        "{\"identifier\":[{\"cat_feature_index\":6,\"combination_element\":\"cat_feature_value\"},{\"border\":55.5,\"combination_element\":\"float_feature\",\"float_feature_index\":0}],\"type\":\"Borders\"}":    
        { 
           	"counter_denominator":0,
            "hash_stride":3,	 
            "hash_map": 
            [
                "18446744073709551615", 
                4, 
                12, 
                "16473605376674046465", 
                8, 
                9,
                "14187676640779072514", 
                1, 
                2,  
                "9393377116333705988", 
                7, 
                8, 
                "5094243394766242315", 
                0, 
                2,
                "5094137587211407371",  
                1,     
                1, 
                ..... 
            ]
        }, 
        .....  
     }

    hash_map : hash_map is basically a flat list of keys and values. The number of values per key depends on the CTR type. How to compute the key was explained earlier in the blog.

    • Borders : each key has 2 values, i.e. indexes 0,3,6,... are keys and the index pairs (1,2),(4,5),(7,8),... are (failures, successes) used for computing the value of the "Borders" CTR type for that key.
    • Counter & FeatureFreq : each key has 1 value, i.e. indexes 0,2,4,6,... are keys and 1,3,5,7,... are the counts for computing the value of the "Counter" or "FeatureFreq" CTR type for that key.
    counter_denominator : used as the total when computing the "Counter" or "FeatureFreq" CTR type.
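    Here is a hedged sketch of unflattening one hash_map, assuming the layout above: each entry is 1 key followed by (hash_stride - 1) values, and keys appear as decimal strings of unsigned 64-bit hashes in the JSON.

     import java.util.HashMap;
     import java.util.List;
     import java.util.Map;

     // Turns the flat [key, v1, ..., v(stride-1), key, ...] list into a map.
     // For "Borders" (hash_stride = 3) the values are (failures, successes);
     // for "Counter"/"FeatureFreq" (hash_stride = 2) the value is the count.
     static Map<String, long[]> parseHashMap(List<Object> flat, int hashStride) {
         Map<String, long[]> result = new HashMap<>();
         for (int i = 0; i < flat.size(); i += hashStride) {
             String key = String.valueOf(flat.get(i)); // unsigned 64-bit hash as string
             long[] values = new long[hashStride - 1];
             for (int j = 1; j < hashStride; j++) {
                 values[j - 1] = ((Number) flat.get(i + j)).longValue();
             }
             result.put(key, values);
         }
         return result;
     }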

    Section 4.3 oblivious_trees 

    This is the list of trees. Each tree is computed separately. I will try to explain these trees by computing the tree values for a few example trees.

    Example 1

     {
        "leaf_weights":[2, 49, 18, 931], 
        "leaf_values":[1.153793657, 0.3428686569, 0.06438921655, 0.0150123653], 
        "splits": 
        [ 
        	{ 
            	"split_index":189,   
                "split_type":"OnlineCtr",  
                "border":3.999999046, 
                "ctr_target_border_idx":0 
            }, 
            {
            	"split_index":86,    
                "split_type":"OnlineCtr", 
                "border":4.999999046, 
                "ctr_target_border_idx":0 
            } 
        ]
     }

    There are 2 important parts

    splits : the list of conditions used at the different levels of the tree. Since CatBoost generates symmetric (oblivious) trees, len(splits) = depth of the tree. This list is in reverse order, i.e. the last element in the list is used at the root of the tree. We just need the "split_index" from this part to compute the node conditions.

    leaf_values: there are 2^(tree_depth) values here, representing the terminal node values of the complete tree from left to right.
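    Putting these two parts together, evaluating one symmetric tree reduces to computing one bit per level. A minimal sketch, assuming each split condition has already been evaluated to a boolean (value >= border for float and CTR splits, hash == value for one-hot splits, as worked through in the examples below); consistent with those examples, splits[0] contributes the lowest bit and the last split (the root) the highest.

     // Evaluate one oblivious (symmetric) tree given the per-level conditions.
     static double evalObliviousTree(double[] leafValues, boolean[] splitResults) {
         int leafIndex = 0;
         for (int d = 0; d < splitResults.length; d++) {
             if (splitResults[d]) {
                 leafIndex |= 1 << d; // each level contributes one bit
             }
         }
         return leafValues[leafIndex];
     }

     // In Example 1 below both splits evaluate to true, so leafIndex = 3
     // and the tree returns leaf_values[3] = 0.0150123653.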

    Section 4.3.1

    If the split_type of a split is "OneHotFeature", it might not be listed in the features_info section, but it is still part of the global list of split_indexes.

    For Example in sample_model.json

     oblivious_trees[1]['splits'][0] = 
     {
        "split_index": 28, 
        "cat_feature_index": 5, 
        "value": -845129958, 
        "split_type": 'OneHotFeature'
     }

    So some of the split_indexes need to be derived from here. These indexes start just after the OneHot feature values and before the ctrs.

    I will compute the above tree example, along with a few more trees, for the one input row below

     day   hour   month   app_cat   user_occ   age   gender   statecode
     fri   22     oct     games     student    47    male     103

    For the root node (split_index = 86), find the split information using section 4.1:

     split_index(86) = 
     { 
        "prior_numerator": 0, 
        "shift": 0, 
        "ctr_type": 'Counter', 
        "scale": 15, 
        "elements": 
        [ 
        	{ 
                "cat_feature_index": 0, 
                "combination_element": 'cat_feature_value'
            }, 
            {
                "cat_feature_index": 2, 
                "combination_element": 'cat_feature_value'
            }
        ], 
        "identifier": '{"identifier":[{"cat_feature_index":0,"combination_element":"cat_feature_value"},{"cat_feature_index":2,"combination_element":"cat_feature_value"}],"type":"Counter"}', 
        "prior_denomerator": 1, 
        "border" = 4.999999046 
     }

    Compute the hash to be used for ctr_data using section 3 and Section 4.1.4

     cat_feature_index(0) = "day"
     input["day"] = "fri"
     CityHash("fri") = 2386474191
     MAGIC_MULT * ( 0 + MAGIC_MULT * 2386474191 ) = 10849034401243659895
     cat_feature_index(2) = "month"
     input["month"] = "oct"
     CityHash("oct") = 964427042
     MAGIC_MULT * ( 10849034401243659895 + MAGIC_MULT * 964427042 ) = 7913951620039613893
    So the final hash to be used is "7913951620039613893"

    using section 4.2 find values to be used to compute CTR at the node

     ctr_data['{"identifier":[{"cat_feature_index":0,"combination_element":"cat_feature_value"},{"cat_feature_index":2,"combination_element":"cat_feature_value"}],"type":"Counter"}']['hash_map']['7913951620039613893'] = 
     { 
        "count": 12, 
        "counter_denominator": 23
     }

    Using section 2.1, compute the node value and compare it with the border to decide the next node.

     15 * ((0 + 12)/(1 + 23)) = 7.5
     7.5 >= 4.999999046 is True so traverse to Right Node. 

    using section 4.1

     split_index(189) = 
     {
        "prior_numerator": 0.5, 
        "shift": 0, 
        "scale": 15, 
        "elements": 
        [
        	{
                "cat_feature_index": 2, 
                "combination_element": "cat_feature_value" 
            }
        ], 
        "identifier": '{"identifier":[{"cat_feature_index":2,"combination_element":"cat_feature_value"}],"type":"Borders"}', 
        "prior_denomerator": 1, 
        "ctr_type" : "Border", 
        "border" = 3.999999046
     }

    using section 3 & section 4.1.4

     cat_feature_index(2) = "month"
     input["month"] = "oct"
     CityHash("oct") = 964427042
     MAGIC_MULT * ( 0 + MAGIC_MULT * 964427042 ) = 14429572201882049490

    using section 4.2

     ctr_data['{"identifier":[{"cat_feature_index":2,"combination_element":"cat_feature_value"}],"type":"Borders"}']['hash_map'][14429572201882049490] =
     {
        "successes": 51,
        "failures" : 43
     }

    using section 2.2

     0 + 15 * ((0.5 + 51)/(0.5 + 51 + 1 + 43)) = 8.089005235602095 
     8.089005235602095 >= 3.999999046 is True so traverse to Right node 

    We have reached a terminal node, so the value of the tree is the value of that node.

    Computed value of this tree is 0.0150123653

    Example 2

     {
        "leaf_weights":[551, 9, 434, 6], 
        "leaf_values":[1.142552497, 0.7536726871, 1.019659549, 1.933912389], 
        "splits": 
        [ 
        	{     
                "split_index":24, 
                "float_feature_index":0, 
                "split_type":"FloatFeature", 
                "border":58.5 
            },
            { 
                "split_index":201, 
                "split_type":"OnlineCtr", 
                "border":7.999999046,
                "ctr_target_border_idx":0 
            } 
        ]
     }

    using section 4.1

     split_index(201) = 
     {
        "prior_numerator": 1, 
        "shift": 0, 
        "ctr_type": "Borders", 
        "scale": 15, 
        "elements": 
        [
            {
                "cat_feature_index": 2, 
                "combination_element": "cat_feature_value"
            }
        ], 
        "identifier": '{"identifier":[{"cat_feature_index":2,"combination_element":"cat_feature_value"}],"type":"Borders"}', 
        "prior_denomerator": 1,
        "border" : 7.999999046
     }

    using section 3 & section 4.1.4

     cat_feature_index(2) = "month"
     input["month"] = "oct"
     CityHash("oct") = 964427042
     MAGIC_MULT * ( 0 + MAGIC_MULT * 964427042 ) = 14429572201882049490

    using section 4.2

     ctr_data['{"identifier":[{"cat_feature_index":2,"combination_element":"cat_feature_value"}],"type":"Borders"}']['hash_map'][14429572201882049490] =
     {
        "successes": 51,
        "failures": 43
     }

    using section 2.2

     0 + 15 * ((1 + 51)/(1 + 51 + 1 + 43)) = 8.125
      8.125 >= 7.999999046 is True, so traverse to Right node

    using section 4.1

     split_index(24) =
     {
        "feature_type": 'FloatFeature',
        "feature_name": 'age', 
        "nan_value_treatment": 'AsIs', 
        "has_nans": False, 
        "flat_feature_index": 5, 
        "feature_index": 0,
        "border": 58.5
     }

    using section 3

     input['age'] = 47
     47 >= 58.5 is False, so traverse to Left node 


    Computed value of this tree is 1.019659549

    Example 3

     {
        "leaf_weights":[470, 530],
        "leaf_values":[2.411496103, 2.176863892],
        "splits":
        [
            {
                "cat_feature_index":5,
                "split_index":28,
                "split_type":"OneHotFeature"
                "value":-845129958,
            }
        ]
     }


    using section 4.1

     split_index(28) =
     {
     	"feature_name": "gender",
        "cat_feature_index": 5,
        "split_type": "OneHotFeature"
        "value": -845129958
     }

    using section 3 & section 4.1.4

     input['gender'] = 'male'
     CityHash('male') = 2083200611
     2083200611 == -845129958 is False so traverse to Left node


    Computed value of this tree is 2.411496103

    Example 4

     {
        "leaf_weights":[257, 743],
        "leaf_values":[2.081450524, 2.052298752],
        "splits":
        [
            {
                "split_index":55,
                "split_type":"OnlineCtr",
                "border":12.99999905,
                "ctr_target_border_idx":0
            }
        ]
     }

    using section 4.1

     split_index(55) = 
     {
        "prior_numerator": 0, 
        "shift": 0, 
        "ctr_type": 'Counter', 
        "scale": 15, 
        "elements": 
        [
            {
                "cat_feature_index": 0, 
                "combination_element": 'cat_feature_value'
            }
        ], 
        "identifier": '{"identifier":[{"cat_feature_index":0,"combination_element":"cat_feature_value"}],"type":"Counter"}', 
        "prior_denomerator": 1,
        "border": 12.99999905
     }

    using section 3 & section 4.1.4
     cat_feature_index(0) = 'day'
     input['day'] = 'fri'
     CityHash('fri') = 2386474191
     MAGIC_MULT * ( 0 + MAGIC_MULT * 2386474191 ) = 10849034401243659895
    So the final hash to be used is "10849034401243659895"

    using section 4.2

     ctr_data['{"identifier":[{"cat_feature_index":0,"combination_element":"cat_feature_value"}],"type":"Counter"}']['hash_map']['10849034401243659895'] =
     {
        "counter_denominator" = 154,
        "count" = 151
     }

    using section 2.1
     0 + 15 * ((0 + 151)/(1 + 154)) = 14.612903225806452
     14.612903225806452 >= 12.99999905 is True so traverse to Right node


    Computed value of this tree is 2.052298752 

    Example 5

     {
        "leaf_weights":[944, 56],
        "leaf_values":[1.867651047, 1.653345508],
        "splits":
        [
            {
                "split_index":257,
                "split_type":"OnlineCtr",
                "border":9.999999046,
                "ctr_target_border_idx":0
            }
        ]
     }



    using section 4.1

     split_index(257) = 
     {
        "prior_numerator": 1, 
        "shift": 0, 
        "ctr_type": 'Borders', 
        "scale": 15, 
        "elements": 
        [
            {
                "cat_feature_index": 3, 
                "combination_element": 'cat_feature_value'
            }
        ], 
        "identifier": '{"identifier":[{"cat_feature_index":3,"combination_element":"cat_feature_value"}],"type":"Borders"}', 
        "prior_denomerator": 1,
        "border": 9.999999046
     }

    using section 3 & section 4.1.4

     cat_feature_index(3) = 'app_cat'
     input['app_cat'] = 'games'
     CityHash('games') = 4157097893
     MAGIC_MULT * ( 0 + MAGIC_MULT * 4157097893 ) = 7916588964563354589
    So the final hash to be used is "7916588964563354589"

    using section 4.2

     ctr_data['{"identifier":[{"cat_feature_index":3,"combination_element":"cat_feature_value"}],"type":"Borders"}']['hash_map']['7916588964563354589'] =
     {
    	"failures" = 80,
    	"successes" = 74
     }

    using section 2.2

     0 + 15 * ((1 + 74)/(1 + 74 + 1 + 80)) = 7.211538461538462
    7.211538461538462 >= 9.999999046 is False so traverse to Left node


    Computed value of this tree is 1.867651047

    Example 6

     {
        "leaf_weights":[25, 742, 4, 71, 0, 47, 0, 111],
        "leaf_values":
        [
            -0.4397322896,  0.000866602846,  -0.5008245793,  -0.04967483793,
            0,  0.2684168171,  0,  0.06013996591
        ],
        "splits":
        [
            {
                "split_index":0,
                "float_feature_index":0,
                "split_type":"FloatFeature",
                "border":12.5
            },
            {
                "split_index":136,
                "split_type":"OnlineCtr",
                "border":9.999999046,
                "ctr_target_border_idx":0
            },
            {
                "split_index":148,
                "split_type":"OnlineCtr",
                "border":8.999999046,
                "ctr_target_border_idx":0
            }
        ]
     }

    using section 4.1

     split_index(148) = 
     {
        "prior_numerator": 0, 
        "shift": 0, 
        "ctr_type": 'Borders', 
        "scale": 15, 
        "elements": 
        [
            {
                "cat_feature_index": 1, 
                "combination_element": 'cat_feature_value'
            }, 
            {
                "float_feature_index": 0, 
                "combination_element": 'float_feature', 
                "border": 12.5
            }
        ], 
        "identifier": '{"identifier":[{"cat_feature_index":1,"combination_element":"cat_feature_value"},{"border":12.5,"combination_element":"float_feature","float_feature_index":0}],"type":"Borders"}', 
        "prior_denomerator": 1,
        "border": 8.999999046
     }

    using section 3 & section 4.1.4

     cat_feature_index(1) = 'hour'
     input['hour'] = '12'
     CityHash('12') = 1445320840
     MAGIC_MULT * ( 0 + MAGIC_MULT * 1445320840 ) = 8879677033106486088
    
      float_feature_index(0) = 'age'
      input['age'] = 47
      47 >= 12.5 is True so its computed value is 1
      MAGIC_MULT * ( 8879677033106486088 + MAGIC_MULT * 1 ) = 4244931956247570753 
     So the final hash to be used is "4244931956247570753"

    using section 4.2

     ctr_data['{"identifier":[{"cat_feature_index":1,"combination_element":"cat_feature_value"},{"border":12.5,"combination_element":"float_feature","float_feature_index":0}],"type":"Borders"}']['hash_map']['4244931956247570753'] =
     {
        "failures": 15,
        "successes": 16
     }

    using section 2.2

     0 + 15 * ((0 + 15) / (0 + 15 + 1 + 16)) = 7.03125
     7.03125 >= 8.999999046 is False so traverse to Left node

    using section 4.1

     split_index(136) =
     {
        "prior_numerator": 1, 
        "shift": 0, 
        "ctr_type": 'Borders', 
        "scale": 15, 
        "elements": 
        [
            {
                "cat_feature_index": 1, 
                "combination_element": 'cat_feature_value'
            }
        ], 
        "identifier": '{"identifier":[{"cat_feature_index":1,"combination_element":"cat_feature_value"}],"type":"Borders"}', 
        "prior_denomerator": 1,
        "border" : 9.999999046
     }

    using section 3 & section 4.1.4

     cat_feature_index(1) = 'hour'
     input['hour'] = '12'
     CityHash('12') = 1445320840
     MAGIC_MULT * ( 0 + MAGIC_MULT * 1445320840 ) = 8879677033106486088 

    using section 4.2

     ctr_data['{"identifier":[{"cat_feature_index":1,"combination_element":"cat_feature_value"}],"type":"Borders"}']['hash_map']['8879677033106486088'] = 
     {
        "failures": 15,
        "succeses": 17
     }

    using section 2.2

     0 + 15 * ((1 + 17) / (1 + 17 + 1 + 15)) = 7.9411764705882355
      7.9411764705882355 >= 9.999999046 is False so traverse to Left node

    using section 4.1

     split_index(0) = 
     {
        "float_feature_index":0
        "split_type":"FloatFeature"
        "border":12.5
     }

    using section 3

     float_feature_index(0) = 'age'
     input['age'] = 47
     47 >= 12.5 is True so traverse to Right node

     Computed value of this tree is 0.000866602846 

    Section 4.4 scale_and_bias

    It is an array of 2 elements: the first value is scale and the second is bias. Finally, to compute the model output (for regression) we use the following:

     bias + scale * (Σ oblivious_trees)
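    In Java terms the final step is simply (a sketch; the names are mine):

     static double predict(double scale, double bias, double[] treeValues) {
         double sum = 0;
         for (double v : treeValues) {
             sum += v; // value of each oblivious tree for this row
         }
         return bias + scale * sum;
     }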

    Thursday, June 25, 2020

    Migration to Aerospike from Redis Clusters & HBase

    We recently moved our production database infrastructure from Redis and HBase to Aerospike. In this blog I am sharing the learnings from this exercise.

    What use case are we solving?

    I am responsible for a large-scale, low-latency service. This service handles 350,000 requests per second at peak with an average latency of 10ms. It is one monolithic service serving bids for RTB auctions at scale. Within this time frame I need to query a large amount of data from different data sources and compute multiple models with hundreds of trees on the fly. This data is heterogeneous in nature and cannot be stored together, and some of it is too large to be stored in memory. Data is distributed between many Redis Clusters and HBase, based on the type of data.


    What data is being stored in Redis Cluster

    Redis Cluster is used for storing a large number of key-values where key size is between 32 and 50 characters. Some values are simple strings; others are maps, lists, sets or sorted sets. Value sizes range from a few bytes to large sorted sets of a few gigabytes.

    Write patterns  in Redis

    In a few Redis clusters data is written once every hour; in the other Redis clusters data is inserted/updated and read on the fly with every call to the service.

    Read Patterns in Redis 

    A few operations we perform:
    • Read data against one key 
    • Read data against many keys 
    • Get one value from sorted set
    • Get one/all values from map 

    There are many operations performed directly on the Redis server using Lua scripts. These scripts are used for read-update-write patterns.
    We heavily use the pipeline and TTL functionality of Redis.
     

    What data we are storing in HBase 

    Simple key-value pairs where the number of keys is in the billions and storing them in memory would be very expensive. These keys are write-once with TTL and read many times.

    Why we migrated from Redis to Aerospike:

    We were running tens of Redis clusters with the number of shards ranging from 3 to hundreds in each cluster.
    • Since in Redis cluster individual shards need to be added/removed one by one, it was becoming a management nightmare.
    • Our data was growing, and keeping all of it in memory was becoming very expensive.

    Why we migrated from HBase to Aerospike:

    • HBase was slow for our use case.
    • Managing tail latencies was becoming more and more challenging because of HBase's JVM GC.
    • HBase was costly; the number of requests served per host was very low.

    Some issues and restrictions of Aerospike :

    • Aerospike is good for storing small values against keys. Aerospike rewrites the whole value against a key on every update. Something to keep in mind, because if we use large data-structure values against one key this will slow down the system a lot.
    • For the Hybrid and InMemory versions Aerospike reserves 64 bytes in memory for each key, so to store 1 billion keys we need ~64GB of memory just for the index.
    • There is no control over how Aerospike distributes data across its shards, which can lead to unbalanced sharding. In our testing we found a difference of 10% in the number of slots between the machine with the most slots and the second highest.
    • If we use the in-memory version of Aerospike and a machine goes down we lose the data, whereas with Redis the data can be recovered (though not fully) if persistence is on.
    • Redis supports key tags, where we can force multiple keys onto one shard by specifying the sharding key as a substring of the original key. Aerospike doesn't support this functionality, which is very useful for specific query patterns.
    • Redis supports pipeline/batch calls for inserts/updates. Aerospike doesn't support this.

    Some benefits of Aerospike 

    • Aerospike is multithreaded, unlike Redis, so instead of managing hundreds of Redis instances we can use a few big hosts to make a cluster. This reduces management overhead.
    • Aerospike is optimized for reads & writes to SSD / NVMe, which makes it blazing fast.
    • Aerospike is fully written in C, so there is no JVM or garbage collection.
    • Tail latencies are lower in Aerospike than in HBase.
    • Aerospike supports nested data structures, so you can create keys whose value is a map which in turn holds maps as the values of its inner keys.

    Migration of HBase data

    Our data model for a large number of keys in Aerospike

    We have 8 billion records to store with TTL, where the key is a 32-character hex string and the value is a string of length between 10 and 20 characters. To store these records with a replication factor of 3 we would need 3 x 8x10^9 x 64 bytes, or about 1.4 TB of memory, just for the Aerospike index. With 1.4 TB of memory we could just store all of this data in memory directly.
    To reduce the memory usage we decided to use the nested map functionality of Aerospike: the first 7 characters of the key become the primary key, and inside it we keep a map with the rest of the key as the map key and the value as its value, plus another entry with the timestamp. With this data format we were able to reduce our memory footprint by a factor of 30, but we lose Aerospike's TTL functionality. (A minimal sketch of this key split follows the list below.)
    We decided to go with this data model for the following reasons:
    • Our cost was reduced a lot.
    • There was no degradation in performance.
    • Instead of TTL we want to delete/expire keys based on the last read time, which is not available in Aerospike anyway, so we decided to build a hybrid solution to expire the keys ourselves. If you are interested in our hybrid solution for expiring keys, please check this.
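    For illustration, here is a minimal sketch of the key split described above (the 7-character cut is from this post; the helper itself is mine):

     // Splits a 32-character hex record key into the Aerospike primary key
     // (first 7 characters) and the key used inside the nested map (the rest).
     static String[] splitRecordKey(String hexKey32) {
         String primaryKey = hexKey32.substring(0, 7);
         String mapKey = hexKey32.substring(7);
         return new String[] { primaryKey, mapKey };
     }

    With 16^7 (~268 million) possible primary keys for 8 billion records, each Aerospike index entry now covers roughly 30 logical keys, which is where the ~30x reduction in the index memory footprint comes from.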

                                                                    To Be Continued

    Thursday, January 2, 2020

    How to debug redis at high scale

    Background of our architecture

    We use a very simple architecture: hundreds of Java web boxes with auto-scaling, Redis Cluster as the primary data source and Jedis as the Redis client. We need to hit many different Redises in a single serving call, with 20ms timeouts from the clients of our service.

    Problem we were facing

    At very high scale, whenever the CPU of Redis went beyond 60%, the whole cluster would spike to 100% with a decrease in IOPS. Because of this we had to over-provision the Redis boxes.

    Things we tried to debug the issue

    • Enabled debug logs in Redis

      • Debug logs gave hints of things going bad but didn't give the full picture.
        • Observation from debug logs: hundreds of new connections were being created very frequently.

    • Ran strace on Redis server with full text and timestamps

    sudo strace -s 1000 -tt -T -p <PID>

      • This gave us very good information about what Redis does internally.
        • Redis makes all system calls in batches. The 3 major system calls are:
          • accept
          • read
          • write
        • Because of this batching, reads/writes/accepts were timing out, resulting in disconnection of the TCP connection from Jedis, which then created a new connection.
        • So whenever there was a momentary spike in calls, Jedis created new connections, causing timeouts of existing connections, logging of broken connections, etc., and it went into a cycle of creating and destroying connections.

    • Ran tcpdump on client with full text and timestamp

    sudo tcpdump -tttt -A -i any port <Redis Port>

      • This further clarified everything that was happening at the TCP level.

    Our solution was to increase the number of idle connections, to handle sudden spikes in calls without creating new connections.
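    For example, with Jedis the fix looks roughly like this (the numbers are illustrative, not our production values):

     import redis.clients.jedis.JedisPool;
     import redis.clients.jedis.JedisPoolConfig;

     JedisPoolConfig poolConfig = new JedisPoolConfig();
     poolConfig.setMaxTotal(512); // hard cap on connections per Redis node
     poolConfig.setMaxIdle(512);  // don't evict connections just for being idle
     poolConfig.setMinIdle(128);  // keep enough idle connections to absorb spikes
     JedisPool pool = new JedisPool(poolConfig, "redis-host", 6379);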





    Monday, April 29, 2019

    Go Vs Java Benchmark

    Disclaimer

    I have 9 YEARS of experience in Java and 3 DAYS of experience in Go. This comparison might be highly biased.

    What is compared here:


    • Ease of coding
    • Execution Time
    • Memory used

    What is being done here ? 

    I recently (3 days ago) started to learn Go and wanted to know if Go is actually as good as it is publicized to be. So, in order to learn while testing, I translated one Java project into Go and did some benchmarking on it. Below are the results.

    The machine used here is an Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz (2 physical cores, 4 with hyperthreading).

    My understanding of Go (based on 3 days) :

    Go is a better C. Go cherry-picked the good parts of C, Java and Python, along with adding new concepts like goroutines.

    Go seems very simple if you know C & Java. 


    Results
                                  Go      Java
    Serial Execution (seconds)    86      171
    Parallel Execution (seconds)  105     207
    Memory (MB)                   3112    4423


    Conclusion : 

    Go is really useful for small projects with relatively little code; I think as the codebase grows with a large amount of functionality, managing the code might become a little difficult.
    Go is easy to code in and very efficient compared to Java.






    Thursday, April 4, 2019

    How to do LightGBM predictions in realtime at high scale in Java

    We recently landed on a problem where we needed to do predictions with a LightGBM model in realtime in Java. Training in Python and predicting in Java is no new problem, but this one was unique because the popular way of using PMML for prediction ran into the following issues.

    Issues

    • Validation at the PMML generation end is very strict, even stricter than LightGBM. Some examples:
      • PMML doesn't allow single-valued features while LightGBM does.
      • PMML has weird restrictions on value ranges.
    • At prediction time the PMML predictor was doing validation which made predictions very costly.
    • The PMML generator is not able to parse feature values with special characters from the model dump.
    • PMML generation was taking 20 minutes with 60GB of memory for a 160MB model file. This is a big issue if we need to update the model every hour.
    • The client of PMML has to be aware of the data types of features.
    We were able to find workarounds for some of the above but not all, and were in a bad position to scale with PMML. Finally we gave up on PMML and
    wrote our own parser and predictor, which parses the LightGBM model dump in Java, loads it into our own objects and does predictions.

    Problems faced : 

    LightGBM uses its own format and representation of trees in its model dump. The lack of documentation, along with a fairly complex representation of the trees, makes this hard to understand. Understanding this dump was the primary thing keeping us from taking this approach from the beginning.

    What we achieved : 

    • Memory required for the loaded model is reduced by 50%.
    • Prediction time is reduced by 50%, cutting the server cost of computing predictions by 50%.
    • The model is more debuggable: we can log anything we want, add breakpoints and debug the model better.
    • We were able to remove the PMML middleman completely, along with the dependency on an external library.
    • We significantly reduced resource requirements and cut the 20-minute generation step from the process.





    Monday, October 29, 2018

    Twemproxy vs RedisCluster for high performance large datasets

    This post demonstrates the difference between Twemproxy and RedisCluster when it comes to high-performance, large in-memory datasets.
    RedisCluster solves all the issues that originally come with Twemproxy while providing (almost) all of Twemproxy's current features.


    Twemproxy : 


    Twemproxy is responsible for sharding based on the given configuration. We can start multiple Twemproxy servers with the same configuration. A call first goes to Twemproxy, and Twemproxy then calls Redis to fire the command; it supports all types of commands.


    An architecture based on Twemproxy has many issues : 


    • Data has to be on the same host as Twemproxy to avoid a network call, so data cannot be sharded over different servers.
    • Twemproxy takes more CPU than Redis itself and adds an extra hop. This is especially a problem in AWS, where Redis machines have less CPU power than compute machines.
    • Adding another shard and resharding is a very difficult process.
    • Adding another Twemproxy is difficult because we have to add it to the client code manually via a production build.



    RedisCluster : 


    It makes Twemproxy obsolete by adding the sharding intelligence to the client and the Redis servers. It supports adding new shards by providing auto-sharding, and it auto-syncs between client and server when we add another shard.


    Advantages of using RedisCluster instead of Twemproxy. 


    • Data can be sharded across multiple hosts.
    • The logic of computing the correct shard is on the client itself. This does 2 things:
      • Removes the extra hop
      • Moves the CPU requirement to a different box
    • Easy to add a new shard.
    • Shard information auto-syncs with the client, which means we don't have to do any code upload.
    • Cuts Redis command time by 50%.


    Note : the RedisCluster Java client Jedis doesn't provide pipeline support. I have written my own wrapper to support pipelining with Jedis. This pipeline wrapper is not fully consistent during node failure or resharding.



    Test Results : 


    Running with 10 threads, with 4 Redis servers and 4 Twemproxy servers as the 1st setup, and 4 Redis servers in a Redis cluster as the 2nd setup.


    Setup         Call Type      Time Taken (sec)   Redis CPU x4   Client CPU   Twemproxy CPU x4   Total CPU at a given time   Total CPU cycles
    Twemproxy     GET            140                22             90           50                 378                         52920
    Twemproxy     GET Pipeline   35                 25             70           50                 370                         12950
    RedisCluster  GET            67                 44             190          0                  366                         24522
    RedisCluster  GET Pipeline   29                 44             180          0                  356                         10324

    (Total CPU at a given time = 4 x Redis CPU + Client CPU + 4 x Twemproxy CPU; Total CPU cycles = Total CPU x Time Taken.)


    To generate the results you can follow commands here : https://github.com/uniqueuser/RedisClusterTest

    PS: This is not in production; all tests were done offline only.