Friday, July 24, 2020

Explanation of Json model format of CatBoost

CatBoost is very famous package for boosted trees. It automatically handles categorical features which makes it very useful for training data with many categorical features of large cardinality. You can save the trained model on permanent storage in json format(or binary format). Currently its prediction library is available in Java(via JNI), Python & C/C++. I recently worked on a project where we needed to do predictions in java but cannot use JNI. So we decided to write Java client for Catboost but there is no documentation currently available which explains the model json file format. I ended up understanding the working from c++ source code and I am sharing the same here. 
I will use this sample_model.com which is trained on sample.tsv to explain with example

Section 1 : Some concepts of Catboost model

  • CatBoost convert categorical features into 16 bit hash using CityHash. These hashes are available in model if you pass pool while saving the model. Location and format of these hashes in model is explained in section x. 
  • CatBoost create new features by combining 2 ore more features to make more complex features. These are called "ctr features". How these ctr features are represented and computed are explained in the section 3. 
  • CatBoost use indexes to refer to feature & tree node decision
    • feature_index: This is feature index of categorical feature of in the list of categorical features and numrical feature in the list of numerical feature
    • flat_feature_index: this is the index of the feature(categorical or numerical) in the list of all features.
    • cat_feature_index: this is the index of the feature(categorical) in the ordered list of categorical features
      In the example file order  of features is age, day, hour, month, app_cat, user_occ, gender, statecode with index of age is 0 and its a numerical feature. This list is further divided into numerical features and categorical features. The elements in these lists are referred as flat_feature_index and cat_feature_index. i.e. flat_feature_index = 0 means age and cat_feature_index=0 means day. 
  • Borders : this is used in condition of internal nodes of the trees. How to compute the value in the condition of the node is bit complex and explained later. Borders are referred by their index in the ordered list of borders across all types of features. the order is as following.
    • Numerical Borders
    • OneHotEncoded feature borders
    • CTR borders

Section 2 : CTR Types 

CatBoost convert categorical features into different numerical values which are represented as numerator and denominator in the model. These numerical values do have derived with laplace smoothing using numerator prior and denominator prior and other values. What all components are available in model and how to compute values for different type of CTR values are explained in this section 

  • Section 2.1 : Count Or FeatureFreq
    • Components
      • Count Prior (prior_numerator)
      • Count 
      • Total Prior (prior_denominator)
      • Total (counter_denominator)
      • Scale
      • Shift
    • Formula
      Shift + Scale * ((Count Prior + Count)/(Total Prior + Total))
  • Section 2.2 : Borders
    • Components
      • Success Prior (prior_numerator)
      • Success
      • Failure Prior (prior_denominator)
      • Failures
      • Scale
      • Shift
    • Formula
      Shift + Scale * ((Success Prior + Successes)/(Success Prior + Success + Failure Prior + Failures))
How to find the components from json model is explained later. 


Section 3 : Complex CTR Features: 


These feature combinations are represented in a form of json. I will explain this section with an example of given row

dayhourmonthapp_catuser_occagegenderstatecode
fri22octgamesstudent47male103

and below example(not from sample_model.json) to compute. 
{
    "elements":[
        {
            "cat_feature_index":6,
            "combination_element":
            "cat_feature_value"
        },
        {
            "border":12.5,
           	"combination_element":"float_feature",
            "float_feature_index":0
        },
        { 
            "cat_feature_index":5, 
            "combination_element":"cat_feature_exact_value", 
            "value":-845129958
        }
    ],
    "type":"Borders"
}

Representation

elements

This is an ordered list of features to be combined to form the new CTR feature. 
In the above example there are 3 features which needs to be combined. These features can be one of three types which is given by value of combination_element 
  • Categorical Feature ( combination_element  = cat_feature_value) : 
    • This component has only 1 other element
      • cat_feature_index: In Categorical Feature we compute Cityhash of the feature using feature value at cat_feature_index In the example cat_feature_index=6 that means we should use value of feature statecode and take its Cityhash. 
    • ComputedValue : CityHash("103") = 2080643764 
    • How to compute CityHash is explained
  • Numerical Feature(combination_element = float_feature) :
    • This component has 2 elements 
      • flat_feature_index: this tell which feature to use for this. In the example(flat_feature_index = 0) means age. 
      • border : This is a numerical value. We need to compare feature value with this value and if feature_value >= border then this component will compute to 1 else 0.
    • ComputedValue : 12.5>47?1:0 = 1
  • OneHotFeatures ( combination_element = cat_feature_exact_value) : 
    • Here CityHash of feature will be compared with given value. This has 2 elements
      • cat_feature_index: this tell which feature to use. In the example above cat_feature_index = 5, which is gender. 
      • value : this tell with which value we need to compare the CityHash of feature value. If they are same this component will compute to 1 else 0. i.e. if (CityHash(features[age]) = -845129958) then 1 else 0
    • ComputedValue : CityHash("male") == -845129958 ? 1 else 0 = 1
    • CityHash("male") = -845129958. How to compute this is explained later

type

type is the CTR type as explained in Section 2. 

How to combine components of the list into 1 value : 

To combine these values into 1 value catboost uses 1 special number  
MAGIC_MULT = 0x4906ba494954cb65l 
It iterate over the list of elements as follow and starting with hash value = 0
 hash = 0 
 for i in (1:size(elements)){
     hash = MAGIC_MULT * ( hash + MAGIC_MULT * VALUE(i) )
 }
So to compute the above example
element[0] :
 MAGIC_MULT * ( 0 + MAGIC_MULT *  2080643764) = 5418337360679822996

element[1] :
 MAGIC_MULT * ( 5418337360679822996 + MAGIC_MULT *  1) = 6727615186874477117

element[2]:
 MAGIC_MULT * ( 6727615186874477117 + MAGIC_MULT * 1 = 4977781465358360298

So finally we will use 4977781465358360298 to look for values in model to compute the feature value for decision type given in "type" field. How to find the values using this hashed value is explained later in section 4.2

Section 4 : Structure of catboost json file

There are 6 high level elements in outer json. I will explain 1 by 1

section 4.1 : features_info : 

This gives the list of features and other information including ordering of borders as explained below in this section. There are 3 different Feature Types : 

section 4.1.1 : float_features

 "float_features":[ 
    { 
        "feature_name":"age", 
        "nan_value_treatment":"AsIs", 
        "has_nans":false, 
        "flat_feature_index":5, 
        "feature_index":0,"borders":[
            12.5,    13.5,    15.5,    16.5,     20.5,
            21.5,    22.5,    24.5,    25.5,     29.5,
            30.5,    31.5,    34.5,    35.5,     36.5,
            37.5,    41.5,    47.5,    49.5,     50.5,
            52.5,    53.5,    54.5,    55.5,     58.5
        ]
    }
 ]

this is the ordered list of numerical features along ordered list of borders used for decisions in trees. 

feature_name : name of the feature if pool is passed while creating the json dump. 
nan_value_treatment: How to treat if value of the feature is not passed while prediction.

has_nans : if this feature had missing values at the time of training
flat_feature_index : this tell the flat_feature_index of the this numerical feature
feature_index : this is feature_index of this numerical feature
borders : This is a first part of the global ordered list of values which needs to be used in conditions of the node where this numerical feature is used. This order is very important and values are used by their indexes. This global index is call split_index. There is a alot infomration we will get just by knowing the split_index. 

For FloatFeature type of feature we can get following information just by knowing index of split_index 

  1. Border value
  2. Feature name
  3. Feature index


For above float_feature example, we can generate 25 split_indexes. Below are 2 example of 1st and last split_indexes. Please note the indexes of split_index here are 0 and 24

 split_index(0) = 
 { 
     "feature_type": "FloatFeature",
     "border_value": 12.5,
     "flat_feature_index": 5, 
     "feature_index": 0,
     "feature_name": "age"
 }
 split_index(24) = 
 {
     "feature_type": "FloatFeature",
     "border_value": 58.5,
     "flat_feature_index": 5,
     "feature_index": 0,
     "feature_name": "age"
 }

split_indexes are indexed across all the features(numerical, OneHot, CTR) that means index of the next split_index will start from 25

section 4.1.2 : Categorical_features

This is ordered list of categorical features along with information about one hot encoded values

 "categorical_features":[ 
    {
        "feature_name":"day", 
       	"flat_feature_index":0, 
        "feature_index":0
    },
    {
    	"feature_name":"hour", 
        "flat_feature_index":1,
        "feature_index":1 
    }, 
    { 
    	"feature_name":"month", 
        "flat_feature_index":2, 
        "feature_index":2
    },
    { 
    	"feature_name":"app_cat", 
        "flat_feature_index":3,
        "feature_index":3
    }, 
    { 
    	"feature_name":"user_occ", 
        "values":[     
            -1445282301,   
            -364013047,
            1886646856
        ],
      	"flat_feature_index":4,
        "feature_index":4 
    }, 
    { 
    	"feature_name":"gender", 
        "flat_feature_index":6, 
        "feature_index":5 
    }, 
    {  
    	"feature_name":"statecode", 
        "flat_feature_index":7, 
        "feature_index":6 
    }
 ]
feature_name, flat_feature_index, feature_index : are already explained above.
values :this is the ordered list of values which will be used for one-hot feature. This will be part of global list of split_index. for categorical_features split_index will give following formation 
  1. Feature Name
  2. Feature Index
  3. Value for OneHot comparision
For above example of categorical_features we can generate 3 OneHot split_indexes. Since this is part of global list of split_indexes and we have already used 0 to 24 indexes for float_features, index for this split_index will start with 25. 

 split_index(25) = 
 {
    "feature_type": "OneHot",
    "value": -1445282301,
    "feature_name": "user_occ",
    "flat_feature_index": 4, 
    "feature_index": 4
 }
 split_index(27) =
 { 
    "feature_type": "OneHot",
    "value": 1886646856, 
    "feature_name": "user_occ",
    "flat_feature_index": 4, 
    "feature_index": 4
 }


Note : there are few more OneHot split_indexes not listed here. Those values are explained in section <x>

Section 4.1.3: ctrs

This is ordered list of CTR features, it has important fields which are tightly linked to CTR types

 "ctrs":[ 
    {
        "borders":[ 
            0.9999989867,    1.999999046,    3.999999046,    4.999999046, 
            6.999999046,     8.999999046,    9.999999046,    11.99999905 
        ],
        "prior_numerator":0, 
        "shift":-0,"target_border_idx":0, 
        "ctr_type":"Borders", 
        "scale":15, 
        "elements":[ 
            { 
                "cat_feature_index":0, 
                "combination_element":"cat_feature_value"
            } 
        ],     
        "identifier":"{\"identifier\":[{\"cat_feature_index\":0,\"combination_element\":\"cat_feature_value\"}],\"type\":\"Borders\"}", 
        "prior_denomerator":1 
    }, 
    {
        "borders": [0.9999989867], 
        "prior_numerator":0, 
        "shift":-0, 
        "target_border_idx":0, 
        "ctr_type":"Counter", 
        "scale":15, 
        "elements": [ 
            { 
                "cat_feature_index":0, 
                "combination_element":"cat_feature_value" 
            }, 
            { 
                "float_feature_index":0, 
                "combination_element":"float_feature", 
                "border":15.5 
            } 
        ], 
        "identifier":"{\"identifier\":[{\"cat_feature_index\":0,\"combination_element\":\"cat_feature_value\"},{\"border\":15.5,\"combination_element\":\"float_feature\",\"float_feature_index\":0}],\"type\":\"Counter\"}", 
        "prior_denomerator":1 
    }, 
    .... 
    .... 
 ]
ctr_type: it is type of ctr feature. Different Types are previously explained
prior_numerator, prior_denominator, scale, shift: They are same as explained in the "ctr types" section.
elements: This is the list of features to CTR type. How to compute CTR type key is already explained in section 3. How to compute the value using that key is explained in section 4.2 & section 2. 

identifier: This is a stringified json of elements & ctr_type. This will be used find other values required to compute the CTR type. How to find those values are explained later in section 4.2.

borders: This is the ordered list of borders which will be used for CTR feature. This will be part of global list of split_index. for ctrs split_index will give following formation 
  1. elements
  2. ctr_type
  3. shift
  4. scale
  5. prior_numerator
  6. prior_denominator
  7. border value 
  8. identifier 
    Since this is part of global list of split_indexes and we have already used 0 to 24 indexes for float_features, and 25-27 for OneHot features, split_index for ctrs will start with 28.

     split_index(28) = 
     {
        "split_type": "OnlineCTR",
        "prior_numerator":0, 
        "shift":-0, 
        "ctr_type":"Counter", 
        "scale":15, 
        "elements": 
        [ 
        	{ 
                "cat_feature_index":0, 
                "combination_element":"cat_feature_value" 
            }, 
            {
                "float_feature_index":0, 
                "combination_element":"float_feature", 
                "border":15.5 
            } 
        ],
        "identifier":"{\"identifier\":[{\"cat_feature_index\":0,\"combination_element\":\"cat_feature_value\"},{\"border\":15.5,\"combination_element\":\"float_feature\",\"float_feature_index\":0}],\"type\":\"Counter\"}", 
        "prior_denomerator":1,
        "border_value": 0.9999989867 
     }


    Section 4.1.4 : cat_features_hash: 

    It is a list of json elements. Each elements contains 2 elements, Hash and value,

    where CityHash(value) = Hash

     "cat_features_hash": 
     [ 
        { 
            "hash":4216280702, 
            "value":"4" 
        }, 
        {
            "hash":1445320840
            "value":"12", 
        },
        {
            "hash":3509359534,
            "value":"thu"
        }, 
        ..... 
        ..... 
     ]

    Section 4.2 : ctr_data 

    This gives information about values needed to compute different ctr type. It is a json map where keys are identifiers from section 4.1.3. 

     "ctr_data":   
     { 
     	"{\"identifier\":[{\"cat_feature_index\":6,\"combination_element\":\"cat_feature_value\"}],\"type\":\"Counter\"}":  	{ 
        	"counter_denominator":32, 
            "hash_stride":2, 
            "hash_map": 
            [ 
                "18446744073709551615",  
                18,     
                "9285085082777897483",   
                22, 
                "5287930183868741137",	      
                22,     
                "13172738045064563988",     
                20,
                "6703001155837839893",   
                25, 
                "4858559805981312021", 
                17,	      
                "5418337360679822996", 
                26, 
                ..... 
            ]    
        }, 
        "{\"identifier\":[{\"cat_feature_index\":6,\"combination_element\":\"cat_feature_value\"},{\"border\":55.5,\"combination_element\":\"float_feature\",\"float_feature_index\":0}],\"type\":\"Borders\"}":    
        { 
           	"counter_denominator":0,
            "hash_stride":3,	 
            "hash_map": 
            [
                "18446744073709551615", 
                4, 
                12, 
                "16473605376674046465", 
                8, 
                9,
                "14187676640779072514", 
                1, 
                2,  
                "9393377116333705988", 
                7, 
                8, 
                "5094243394766242315", 
                0, 
                2,
                "5094137587211407371",  
                1,     
                1, 
                ..... 
            ],  
        }, 
        .....  
     }

    hash_map : hash_map is basically a list of key and values. The number of values depends on CTR type. How to get the key is explained earlier in the blog.

    • Borders : It has key and 2 values that means 0,3,6,.. are keys. (1,2),(4,5),(7,8),.. indexes are (failures, successes) used for computing the value of "border" CTR type at that key.
    • Counter & FeatureFreq : It has key and 1 value that mean 0,2,4,6,.. are the keys and 1,3,5,7... are the counts for computing the value of "counter" or "FeatureFreq" CTR type at that key.
    counter_denominator : this will be used as total for computing "counter" or "FeatureFreq" CTR type.

    Section 4.3 oblivious_trees 

    This is the list of trees. Each tree is computed separately. I will try to explain these trees by computing tree values for example trees. 

    Example 1

     {
        "leaf_weights":[2, 49, 18, 931], 
        "leaf_values":[1.153793657, 0.3428686569, 0.06438921655, 0.0150123653], 
        "splits": 
        [ 
        	{ 
            	"split_index":189,   
                "split_type":"OnlineCtr",  
                "border":3.999999046, 
                "ctr_target_border_idx":0 
            }, 
            {
            	"split_index":86,    
                "split_type":"OnlineCtr", 
                "border":4.999999046, 
                "ctr_target_border_idx":0 
            } 
        ]
     }

    There are 2 important parts

    splits : List of conditions to be used in different levels of tree. Since Catboost generate Symmetric trees we need only len(splits) = depth of tree. This list is in reverse order i.e. last element in the list is used in root of the tree. We just need "split_index" from this part to compute the node values. 

    leaf_values: there are 2^(tree_depth) number of values in this representing terminal node values of the complete tree from left to right. We just need split_index to compute the node value. 

    Section 4.6.1 

    If split_type of split is "OneHotFeature", it might not be listed in the features_info section, but it is still the part of global list of split_indexes.

    For Example in sample_model.json

     oblivious_trees[1]['splits][0] = 
     {
        "split_index": 28, 
        "cat_feature_index": 5, 
        "value": -845129958, 
        "split_type": 'OneHotFeature'
     }

    so some of the split indexes needs to be derived from here. These indexes start just after OneHot feature values and before ctrs. 

    I will compute the above tree example along with few more trees for the one row below

    dayhourmonthapp_catuser_occagegenderstatecode
    fri22octgamesstudent47male103

    For root node (split_index = 86), find the split index using section 4.1

     split_index(86) = 
     { 
        "prior_numerator": 0, 
        "shift": 0, 
        "ctr_type": 'Counter', 
        "scale": 15, 
        "elements": 
        [ 
        	{ 
                "cat_feature_index": 0, 
                "combination_element": 'cat_feature_value'
            }, 
            {
                "cat_feature_index": 2, 
                "combination_element": 'cat_feature_value'
            }
        ], 
        "identifier": '{"identifier":[{"cat_feature_index":0,"combination_element":"cat_feature_value"},{"cat_feature_index":2,"combination_element":"cat_feature_value"}],"type":"Counter"}', 
        "prior_denomerator": 1, 
        "border" = 4.999999046 
     }

    Compute the hash to be used for ctr_data using section 3 and Section 4.1.4

     cat_feature_index(0) = "day"
     input["day"] = "fri"
     CityHash("fri") = 2386474191
     MAGIC_MULT * ( 0 + MAGIC_MULT * 2386474191 ) = 10849034401243659895
     cat_feature_index(2) = "month"
     input["month"] = "oct"
     CityHash("oct") = 964427042
     MAGIC_MULT * ( 10849034401243659895 + MAGIC_MULT * 964427042 ) = 7913951620039613893
    So final hash to be used is "7913951620039613893"

    using section 4.2 find values to be used to compute CTR at the node

     ctr_hash['{"identifier":[{"cat_feature_index":0,"combination_element":"cat_feature_value"},{"cat_feature_index":2,"combination_element":"cat_feature_value"}],"type":"Counter"}']['hash_map'][7913951620039613893] = 
     { 
        "count" = 12, 
        "counter_denominator" = 23
     }

    using section 2.1 to compute the node and compare it with border to find the next node.  

     15 * ((0 + 12)/(1 + 23)) = 7.5
     7.5 >= 4.999999046 is True so traverse to Right Node. 

    using section 4.1

     split_index(189) = 
     {
        "prior_numerator": 0.5, 
        "shift": 0, 
        "scale": 15, 
        "elements": 
        [
        	{
                "cat_feature_index": 2, 
                "combination_element": "cat_feature_value" 
            }
        ], 
        "identifier": '{"identifier":[{"cat_feature_index":2,"combination_element":"cat_feature_value"}],"type":"Borders"}', 
        "prior_denomerator": 1, 
        "ctr_type" : "Border", 
        "border" = 3.999999046
     }

    using section 3 & section 4.1.4

     cat_feature_index(2) = "month"
     input["month"] = "oct"
     CityHash("oct") = 964427042
     MAGIC_MULT * ( 0 + MAGIC_MULT * 964427042 ) = 14429572201882049490

    using section 4.2

     ctr_data['{"identifier":[{"cat_feature_index":2,"combination_element":"cat_feature_value"}],"type":"Borders"}']['hash_map'][14429572201882049490] =
     {
        "successes": 51,
        "failures" : 43
     }

    using section 2.2

     0 + 15 * ((0.5 + 51)/(0.5 + 51 + 1 + 43)) = 8.089005235602095 
     8.089005235602095 >= 3.999999046 is True so traverse to Right node 

    Reached terminal node so value of the tree is value of terminal node.

    Computed value of this tree is 0.0150123653

    Example 2

     {
        "leaf_weights":[551, 9, 434, 6], 
        "leaf_values":[1.142552497, 0.7536726871, 1.019659549, 1.933912389], 
        "splits": 
        [ 
        	{     
                "split_index":24, 
                "float_feature_index":0, 
                "split_type":"FloatFeature", 
                "border":58.5 
            },
            { 
                "split_index":201, 
                "split_type":"OnlineCtr", 
                "border":7.999999046,
                "ctr_target_border_idx":0 
            } 
        ]
     }

    using section 4.1

     split_index(201) = 
     {
        "prior_numerator": 1, 
        "shift": 0, 
        "ctr_type": "Borders", 
        "scale": 15, 
        "elements": 
        [
            {
                "cat_feature_index": 2, 
                "combination_element": "cat_feature_value"
            }
        ], 
        "identifier": '{"identifier":[{"cat_feature_index":2,"combination_element":"cat_feature_value"}],"type":"Borders"}', 
        "prior_denomerator": 1,
        "border" : 7.999999046
     }

    using section 3 & section 4.1.4

     cat_feature_index(2) = "month"
     input["month"] = "oct"
     CityHash("oct") = 964427042
     MAGIC_MULT * ( 0 + MAGIC_MULT * 964427042 ) = 14429572201882049490

    using section 4.2

     ctr_data['{"identifier":[{"cat_feature_index":2,"combination_element":"cat_feature_value"}],"type":"Borders"}']['hash_map'][14429572201882049490] =
     {
        "successes": 51,
        "failures": 43
     }

    using section 2.2

     0 + 15 * ((1 + 51)/(1 + 51 + 1 + 43)) = 8.125
     8.125 >= 7.999999046 is True, so travese to Right node

    using section 4.1

     split_index(24) =
     {
        "feature_type": 'FloatFeature',
        "feature_name": 'age', 
        "nan_value_treatment": 'AsIs', 
        "has_nans": False, 
        "flat_feature_index": 5, 
        "feature_index": 0,
        "border": 58.5
     }

    using section 3

     input['age'] = 47
     47 >= 58.5 is False, so traverse to Left node 


    Computed value of this tree is 1.019659549

    Example 3

     {
        "leaf_weights":[470, 530],
        "leaf_values":[2.411496103, 2.176863892],
        "splits":
        [
            {
                "cat_feature_index":5,
                "split_index":28,
                "split_type":"OneHotFeature"
                "value":-845129958,
            }
        ]
     }


    using section 4.1

     split_index(28) =
     {
     	"feature_name": "gender",
        "cat_feature_index": 5,
        "split_type": "OneHotFeature"
        "value": -845129958
     }

    using section 3 & section 4.1.4

     input['gender'] = 'male'
     CityHash('male') = 2083200611
     2083200611 == -845129958 is False so traverse to Left node


    Computed value of this tree is 2.411496103

    Example 4

     {
        "leaf_weights":[257, 743],
        "leaf_values":[2.081450524, 2.052298752],
        "splits":
        [
            {
                "split_index":55,
                "split_type":"OnlineCtr",
                "border":12.99999905,
                "ctr_target_border_idx":0
            }
        ]
     }

    using section 4.1

     split_index(55) = 
     {
        "prior_numerator": 0, 
        "shift": 0, 
        "ctr_type": 'Counter', 
        "scale": 15, 
        "elements": 
        [
            {
                "cat_feature_index": 0, 
                "combination_element": 'cat_feature_value'
            }
        ], 
        "identifier": '{"identifier":[{"cat_feature_index":0,"combination_element":"cat_feature_value"}],"type":"Counter"}', 
        "prior_denomerator": 1,
        "border": 12.99999905
     }

    using section 3 & section 4.1.4
     cat_feature_index(0) = 'day'
     input['day'] = 'fri'
     CityHash('fri') = 2386474191
     MAGIC_MULT * ( 0 + MAGIC_MULT * 2386474191 ) = 10849034401243659895
    So final hash to be used is "10849034401243659895"

    using section 4.2

     ctr_data['{"identifier":[{"cat_feature_index":0,"combination_element":"cat_feature_value"}],"type":"Counter"}']['hash_map']['10849034401243659895'] =
     {
        "counter_denominator" = 154,
        "count" = 151
     }

    using section 2.1
     0 + 15 * ((0 + 151)/(1 + 154)) = 14.612903225806452
     14.612903225806452 >= 12.99999905 is True so traverse to Right node


    Computed value of this tree is 2.052298752 

    Example 5

     {
        "leaf_weights":[944, 56],
        "leaf_values":[1.867651047, 1.653345508],
        "splits":
        [
            {
                "split_index":257,
                "split_type":"OnlineCtr",
                "border":9.999999046,
                "ctr_target_border_idx":0
            }
        ]
     }



    using section 4.1

     {
        "prior_numerator": 1, 
        "shift": 0, 
        "ctr_type": 'Borders', 
        "scale": 15, 
        "elements": 
        [
            {
                "cat_feature_index": 3, 
                "combination_element": 'cat_feature_value'
            }
        ], 
        "identifier": '{"identifier":[{"cat_feature_index":3,"combination_element":"cat_feature_value"}],"type":"Borders"}', 
        "prior_denomerator": 1,
        "border": 9.999999046
     }

    using section 3 & section 4.1.4

     cat_feature_index(3) = 'app_cat'
     input['app_cat'] = 'games'
     CityHash('games') = 4157097893
     MAGIC_MULT * ( 0 + MAGIC_MULT * 4157097893 ) = 7916588964563354589
    So final hash to be used is "7916588964563354589"

    using section 4.2

     ctr_data['{"identifier":[{"cat_feature_index":3,"combination_element":"cat_feature_value"}],"type":"Borders"}']['hash_map']['7916588964563354589'] =
     {
    	"failures" = 80,
    	"successes" = 74
     }

    using section 2.2

     0 + 15 * ((1 + 74)/(1 + 74 + 1 + 80)) = 7.211538461538462
    7.211538461538462 >= 9.999999046 is False so traverse to Left node


    Computed value of this tree is 1.867651047

    Example 6

     {
        "leaf_weights":[25, 742, 4, 71, 0, 47, 0, 111],
        "leaf_values":
        [
            -0.4397322896,  0.000866602846,	-0.5008245793,	-0.04967483793, 
            0, 	0.2684168171, 	0, 	0.06013996591
        ],
        "splits":
        [
            {
                "split_index":0,
                "float_feature_index":0,
                "split_type":"FloatFeature",
                "border":12.5
            },
            {
                "split_index":136,
                "split_type":"OnlineCtr",
                "border":9.999999046,
                "ctr_target_border_idx":0
            },
            {
                "split_index":148,
                "split_type":"OnlineCtr",
                "border":8.999999046,
                "ctr_target_border_idx":0
            }
        ]
     }

    using section 4.1

     split_index(148) = 
     {
        "prior_numerator": 0, 
        "shift": 0, 
        "ctr_type": 'Borders', 
        "scale": 15, 
        "elements": 
        [
            {
                "cat_feature_index": 1, 
                "combination_element": 'cat_feature_value'
            }, 
            {
                "float_feature_index": 0, 
                "combination_element": 'float_feature', 
                "border": 12.5
            }
        ], 
        "identifier": '{"identifier":[{"cat_feature_index":1,"combination_element":"cat_feature_value"},{"border":12.5,"combination_element":"float_feature","float_feature_index":0}],"type":"Borders"}', 
        "prior_denomerator": 1,
        "border": 8.999999046
     }

    using section 3 & section 4.1.4

     cat_feature_index(1) = 'hour'
     input['hour'] = '12'
     CityHash('12') = 1445320840
     MAGIC_MULT * ( 0 + MAGIC_MULT * 1445320840 ) = 8879677033106486088
    
     float_feature_index() = 'age'
     input['age'] = 47
     47 >= 12.5 is True so its computed value is 1
     MAGIC_MULT * ( 8879677033106486088 + MAGIC_MULT * 1 ) = 4244931956247570753 
    So final hash to be used is "7916588964563354589"

    using section 4.2

     ctr_data[''{"identifier":[{"cat_feature_index":1,"combination_element":"cat_feature_value"},{"border":12.5,"combination_element":"float_feature","float_feature_index":0}],"type":"Borders"}']['hash_map']['4244931956247570753'] =
     {
        "failures": 15,
        "successes": 16
     }

    using section 2.2

     0 + 15 * ((0 + 15) / (0 + 15 + 1 + 16)) = 7.03125
     7.03125 >= 8.999999046 is False so traverse to Left node

    using section 4.1

     split_index(136) =
     {
        "prior_numerator": 1, 
        "shift": 0, 
        "ctr_type": 'Borders', 
        "scale": 15, 
        "elements": 
        [
            {
                "cat_feature_index": 1, 
                "combination_element": 'cat_feature_value'
            }
        ], 
        "identifier": '{"identifier":[{"cat_feature_index":1,"combination_element":"cat_feature_value"}],"type":"Borders"}', 
        "prior_denomerator": 1,
        "border" : 9.999999046
     }

    using section 3 & section 4.1.4

     cat_feature_index(1) = 'hour'
     input['hour'] = '12'
     CityHash('12') = 1445320840
     MAGIC_MULT * ( 0 + MAGIC_MULT * 1445320840 ) = 8879677033106486088 

    using section 4.2

     ctr_data['{"identifier":[{"cat_feature_index":1,"combination_element":"cat_feature_value"}],"type":"Borders"}']['hash_map']['8879677033106486088'] = 
     {
        "failures": 15,
        "succeses": 17
     }

    using section 2.2

     0 + 15 * ((1 + 17) / (1 + 17 + 1 + 15)) = 7.9411764705882355
     7.9411764705882355 > = 9.999999046 is False so traverse to Left node

    using section 4.1

     split_index(0) = 
     {
        "float_feature_index":0
        "split_type":"FloatFeature"
        "border":12.5
     }

    using section 3

     float_feature_index(0) = 'age'
     input['age'] = 47
     47 >= 12.5 is True so traverse to Right node

     Computed value of this tree is 0.000866602846 

    Section 4.3 scale_and_bias

    It is an array of 2 elements. 1st value is scale and second value is bias. Finally to compute the model output we need to use following(for regression)

     bias + scale * (Σ oblivious_trees)

    Thursday, June 25, 2020

    Migration to Aerospike from Redis Clusters & HBase

    We recently moved our production data base technology infrastructure from Redis and HBase to Aerospike. In this blog I am sharing the learning from this exercise

    What use case we are solving ? 

    I am responsible for a large scale, low latency service. This service handle 350,000 request per second at peak with average latency of 10ms. This is one monolithic service to serve bids for RTB auctions at scale. I need to query large amount of data from different data sources, compute multiple models with 100s of trees on the fly with in this time frame. This data is heterogeneous in nature and cannot be stored together. Some data is large to be stored in memory.  Data is distributed between many Redis Clusters and HBase based on type of data. 


    What data is being stored in Redis Cluster

    Redis Cluster is used for storing large amount of key-values where key size is between 32 to 50 characters. Some values are simple string other are maps, lists, sets or sorted sets. Size of value range from few bytes to large sorted sets of few Gigabytes.  

    Write patterns  in Redis

    In few Redis cluster data is written once every hour and other Redis clusters data is inserted/updated and read on the fly with every call to service. 

    Read Patterns in Redis 

    Few operations we perform 
    • Read data against one key 
    • Read data against many keys 
    • Get one value from sorted set
    • Get one/all values from map 

    There are many operations performed directly on redis server using Lua scripts. These scripts are used for read update write type of patterns. 
    We heavily use pipeline and TTL functionality of Redis
     

    What data we are storing in HBase 

    Simple Key-Value pairs where number of keys are in billions and storing in memory will be very expensive.  These keys are of type write once with TTL and read many ti\mes. 

    Why we migrated from Redis to Aerospike:

    We were running 10s of Redis clusters with number of shards ranging from 3 to 100s in each cluster. 
    • Since in Redis cluster individual shards needs to be added/removed it was being management nightmare. 
    • Our data was growing and keeping all the data in memory was becoming very expensive. 

    Why we migrated from HBase to Aerospike:

    • HBase was slow for our use case. 
    • Managing tail latencies was coming more and more challenging because of HBase's JVM GC.
    • HBase was costly, the number of requests served by per host was very less. 

    Some issues and restrictions of Aerospace :

    • Aerospike is good for storing small values against keys. Aerospike rewrite replaces the whole value against key on every update. Something to keep in note of because we if we are using large values in data structure against one key this will slow down the system alot. 
    • For HyBrid and InMemory version Aerospike reserves 64 Bytes in memory for each key so to store 1 Billions keys we need ~64GB of memory to store the index.  
    • There is no control over how Aerospike distribute data in its shards, this can lead to unbalanced sharding. In our testing we have found there is a difference of 10% in number of slots between highest slot machine and second highest slot machine. 
    • If we are using in memory version of Aerospike and machine goes down we will loose the data but in case of Redis data can be recovered(not fully) if persistence is on 
    • Redis supports key tags where we can force the multiple keys to go to one shard by specifying the sharding key as substring of original key. Aerospike doesn't support this functionality. This functionality if very useful for specific query patterns. 
    • Redis supports pipeline/batch calls for insert/updates. Aerospike doesn't support this. 

    Some benefits of Aerospike 

    • Aerospike is multithreaded compared to Redis so we don't have to manage 100s of redis instances we can use few big hosts to make a cluster. This reduces management overhead. 
    • Aerospike is optimized for reads & writes to SSD / NVMe which makes it blazing fast. 
    • Aerospike is fully written in C++ so there is no JVM or Garbage collection. 
    • Tail latencies are less in Aerospike then in HBase 
    • Aerospike support nested data structures so you can create keys where value is  map which has key and then map again in value of internal map. 

    Migration of HBase

    Our Data model for large number of keys in Aerospike

    We have 8 Billion records to store with TTL. Where key is hexa string of 32 characters and value is string of length between 10 to 20 characters. To store these records with replication factor of 3 we needed 3 X 8^9 X 64 bytes or 1.4 TB of memory for the Aerospike index. We can just store all of this data in memory with 1.4 TB of memory. 
    To reduce the memory usage we decided to use nested map functionality of Aerospike and use first 7 characters of key as primary key and keep a map inside it with rest of the key as key and value as value and another key with timestamp. Because of this data format we are able to reduce our memory footprint by factor of 30 and but loose the functionality of TTL by Aerospike. 
    We decided to go with this Data Model because of following reasons 
    • Our cost was reducing a lot 
    • There was no degradation of performance because of this. 
    • Instead of TTL we want to delete/expire key from the last read time which is not available in Aerospike, so we decided to build a Hybrid solution to expire they keys ourselves. If you are interested in our hybrid solution of expiring keys please check this. 

                                                                    To Be Continued

    Thursday, January 2, 2020

    How to debug redis at high scale

    Background of our architecture

    We are using very simple architecture of 100s of Java web boxes with auto-scaling and Redis Cluster as primary data source and Jedis as Redis client. We need to hit many different Redises in our single serving call with 20ms timeouts from client of our service. 

    Problem we were facing

    At a very high scale whenever CPU of Redis was going beyond 60% whole cluster was spiking to 100% with decrease in IOPS. Because of this we had to over provision the Redis boxes. 

    Things we tried to debug the issue

    • Enabled debug logs in Redis

      • Debug logs were giving hint of things going bad but didn't give full picture 
        • Observation from debug logs 100s of new connections were getting created very frequently. 

    • Ran strace on Redis server with full text and timestamps

    sudo strace -s 1000 -tt -T -p <PID>

      • This gave us very good information about what Redis does internally. 
        • Redis makes all system calls in batches. 3 major system calls are.
          • Accept
          • Read 
          • Write
        • Because of this batch reading/writing/accepting requests were timing out results in disconnection of TCP connection from Jedis and then it was creating new connection 
        • So whenever there was momentary spike in calls, Jedis was creating new connections, results in timeout of existing connections and logging of connection breaking, etc. and it was going in cycle of creating and destroying new connections. 

    • Ran tcpdump on client with full text and timestamp

    sudo tcpdump -tttt -A -i any port <Redis Port>

      • This further clears everything that was happening about what all is happening at the TCP level. 

    Our solution was to increase the number of idle connections to handle sudden spike in calls without creating new connections.