CatBoost is a popular package for gradient-boosted trees. It handles categorical features automatically, which makes it very useful for training data with many high-cardinality categorical features. You can save a trained model to permanent storage in JSON (or binary) format. Currently its prediction library is available in Java (via JNI), Python and C/C++. I recently worked on a project where we needed to do predictions in Java but could not use JNI, so we decided to write a pure Java client for CatBoost. There is no documentation currently available that explains the model JSON file format, so I ended up working it out from the C++ source code, and I am sharing it here.
Section 1 : Some concepts of the CatBoost model
- CatBoost converts categorical feature values into 32-bit hashes using CityHash. These hashes are available in the model if you pass the pool while saving the model. The location and format of these hashes in the model are explained in Section 4.1.4.
- CatBoost creates new features by combining two or more features into more complex features. These are called "ctr features". How these ctr features are represented and computed is explained in Section 3.
- CatBoost uses indexes to refer to features in tree node decisions:
- feature_index: the index of a categorical feature within the list of categorical features, or of a numerical feature within the list of numerical features.
- flat_feature_index: the index of the feature (categorical or numerical) in the list of all features.
- cat_feature_index: the index of a categorical feature in the ordered list of categorical features.
In the example file the order of features is day, hour, month, app_cat, user_occ, age, gender, statecode, so the flat_feature_index of day is 0 and of age is 5. This list is further divided into numerical features and categorical features, which are referred to by feature_index / cat_feature_index: age is the only numerical feature, so its feature_index is 0, and day is the first categorical feature, so its cat_feature_index is 0 (see the sketch after this list).
- Borders : these are used in the conditions of the internal nodes of the trees. How to compute the value that is compared against a border is a bit complex and explained later. Borders are referred to by their index in the ordered list of borders across all feature types. The order is as follows:
- Numerical Borders
- OneHotEncoded feature borders
- CTR borders
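To make the three index spaces concrete, here is a tiny Java sketch that hard-codes the mapping for the sample model's features. The class name and arrays are mine, purely for illustration.

public class FeatureIndexes {
    public static void main(String[] args) {
        // flat_feature_index runs over all features, in the order used by the sample model
        String[] flat = {"day", "hour", "month", "app_cat", "user_occ", "age", "gender", "statecode"};
        // feature_index runs over the numerical features only
        String[] floatFeatures = {"age"};
        // cat_feature_index runs over the categorical features only
        String[] catFeatures = {"day", "hour", "month", "app_cat", "user_occ", "gender", "statecode"};

        System.out.println("flat_feature_index 0 -> " + flat[0]);           // day
        System.out.println("float feature_index 0 -> " + floatFeatures[0]); // age
        System.out.println("cat_feature_index 0 -> " + catFeatures[0]);     // day
    }
}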
Section 2 : CTR Types
CatBoost converts categorical features into numerical values which are represented by a numerator and a denominator in the model. These numerical values are derived with Laplace-style smoothing using a numerator prior, a denominator prior and a few other values. This section explains which components are available in the model and how to compute the value for each CTR type.
- Section 2.1 : Count Or FeatureFreq
- Components
- Count Prior (prior_numerator)
- Count
- Total Prior (prior_denominator)
- Total (counter_denominator)
- Scale
- Shift
- Formula
Shift + Scale * ((Count Prior + Count)/(Total Prior + Total))
- Section 2.2 : Borders
- Components
- Success Prior (prior_numerator)
- Success
- Failure Prior (prior_denominator)
- Failures
- Scale
- Shift
- Formula
Shift + Scale * ((Success Prior + Successes)/(Success Prior + Successes + Failure Prior + Failures))
How to find these components in the JSON model is explained later; a small code sketch of both formulas follows.
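Here is a minimal Java sketch of the two formulas. The method and parameter names are mine; only the arithmetic comes from the model.

public class CtrFormulas {
    // Counter / FeatureFreq: shift + scale * (countPrior + count) / (totalPrior + total)
    static double counterCtr(double shift, double scale,
                             double countPrior, double count,
                             double totalPrior, double total) {
        return shift + scale * (countPrior + count) / (totalPrior + total);
    }

    // Borders: shift + scale * (successPrior + successes)
    //                        / (successPrior + successes + failurePrior + failures)
    static double bordersCtr(double shift, double scale,
                             double successPrior, double successes,
                             double failurePrior, double failures) {
        return shift + scale * (successPrior + successes)
                / (successPrior + successes + failurePrior + failures);
    }

    public static void main(String[] args) {
        // Values taken from the worked examples later in this post
        System.out.println(counterCtr(0, 15, 0, 12, 1, 23));   // 7.5
        System.out.println(bordersCtr(0, 15, 0.5, 51, 1, 43)); // ~8.089
    }
}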
Section 3 : Complex CTR Features:
These feature combinations are represented as JSON. I will explain this section using the example row below
day | hour | month | app_cat | user_occ | age | gender | statecode |
fri | 22 | oct | games | student | 47 | male | 103 |
and the example combination below (not from sample_model.json).
{
"elements":[
{
"cat_feature_index":6,
"combination_element":
"cat_feature_value"
},
{
"border":12.5,
"combination_element":"float_feature",
"float_feature_index":0
},
{
"cat_feature_index":5,
"combination_element":"cat_feature_exact_value",
"value":-845129958
}
],
"type":"Borders"
}
Representation
elements
This is an ordered list of features to be combined to form the new CTR feature.
In the above example there are 3 features which need to be combined. Each feature can be one of three types, given by the value of combination_element:
- Categorical Feature ( combination_element = cat_feature_value) :
- This component has only 1 other element
- cat_feature_index: for a categorical feature we compute the CityHash of the raw feature value at cat_feature_index. In the example cat_feature_index=6, which means we should take the value of the feature statecode and compute its CityHash.
- ComputedValue : CityHash("103") = 2080643764
- How to compute the CityHash is explained later.
- Numerical Feature(combination_element = float_feature) :
- This component has 2 elements
- float_feature_index: this tells which numerical feature to use. In the example, float_feature_index = 0 means age.
- border : this is a numerical value. We compare the feature value with it: if feature_value >= border then this component computes to 1, else 0.
- ComputedValue : 47 >= 12.5 ? 1 : 0 = 1
- OneHotFeatures ( combination_element = cat_feature_exact_value) :
- Here the CityHash of the feature value is compared with a given value. This has 2 elements
- cat_feature_index: this tells which feature to use. In the example above cat_feature_index = 5, which is gender.
- value : the value with which we compare the CityHash of the feature value. If they are the same this component computes to 1, else 0, i.e. if (CityHash(features[gender]) == -845129958) then 1 else 0
- ComputedValue : CityHash("male") == -845129958 ? 1 : 0 = 1
- (For this illustrative example we take CityHash("male") = -845129958.) How to compute the CityHash is explained later.
type
type is the CTR type as explained in Section 2.
How to combine components of the list into 1 value :
To combine these values into one value, CatBoost uses a special constant
MAGIC_MULT = 0x4906ba494954cb65l
It iterates over the list of elements as follows, starting with hash = 0 (all arithmetic is unsigned 64-bit, i.e. modulo 2^64):
hash = 0
for i in 0..size(elements)-1 {
hash = MAGIC_MULT * ( hash + MAGIC_MULT * VALUE(i) )
}
So to compute the above example
element[0] :
MAGIC_MULT * ( 0 + MAGIC_MULT * 2080643764) = 5418337360679822996
element[1] :
MAGIC_MULT * ( 5418337360679822996 + MAGIC_MULT * 1) = 6727615186874477117
element[2]:
MAGIC_MULT * ( 6727615186874477117 + MAGIC_MULT * 1 ) = 4977781465358360298
So finally we will use 4977781465358360298 to look up values in the model and compute the feature value for the CTR type given in the "type" field. How to find the values using this hashed key is explained later in section 4.2.
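For reference, here is a small Java sketch of the same combination hash. Java long arithmetic wraps modulo 2^64, which matches the unsigned arithmetic used here; the class and method names are mine.

public class CtrHash {
    static final long MAGIC_MULT = 0x4906ba494954cb65L;

    // Combines the per-element values (CityHash of a categorical value, or 0/1 for
    // float / one-hot components) into the single 64-bit key used in ctr_data.
    static long combine(long[] elementValues) {
        long hash = 0;
        for (long v : elementValues) {
            hash = MAGIC_MULT * (hash + MAGIC_MULT * v);
        }
        return hash;
    }

    public static void main(String[] args) {
        // Values from the example above: CityHash("103"), float component, one-hot component
        long key = combine(new long[]{2080643764L, 1L, 1L});
        // Expected (per the worked example above): 4977781465358360298
        System.out.println(Long.toUnsignedString(key));
    }
}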
Section 4 : Structure of the CatBoost JSON file
There are 6 high-level elements in the outer JSON. I will explain them one by one.
Section 4.1 : features_info
This gives the list of features and other information, including the ordering of borders, as explained below in this section. There are 3 different feature types:
Section 4.1.1 : float_features
"float_features":[
{
"feature_name":"age",
"nan_value_treatment":"AsIs",
"has_nans":false,
"flat_feature_index":5,
"feature_index":0,"borders":[
12.5, 13.5, 15.5, 16.5, 20.5,
21.5, 22.5, 24.5, 25.5, 29.5,
30.5, 31.5, 34.5, 35.5, 36.5,
37.5, 41.5, 47.5, 49.5, 50.5,
52.5, 53.5, 54.5, 55.5, 58.5
]
}
]
This is the ordered list of numerical features, along with the ordered list of borders used for decisions in the trees.
feature_name : name of the feature (present if a pool is passed while creating the JSON dump).
nan_value_treatment: how to treat the feature if its value is missing at prediction time.
has_nans : whether this feature had missing values at training time.
flat_feature_index : the flat_feature_index of this numerical feature.
feature_index : the feature_index of this numerical feature.
borders : this is the first part of the global ordered list of values used in the conditions of nodes where this numerical feature appears. The order is very important because values are referred to by their indexes; this global index is called the split_index. There is a lot of information we can get just by knowing a split_index.
For a FloatFeature we can get the following information just by knowing the split_index:
- Border value
- Feature name
- Feature index
For the above float_feature example, we can generate 25 split_indexes. Below are 2 examples, the first and the last. Note that their split_index values are 0 and 24.
split_index(0) =
{
"feature_type": "FloatFeature",
"border_value": 12.5,
"flat_feature_index": 5,
"feature_index": 0,
"feature_name": "age"
}
split_index(24) =
{
"feature_type": "FloatFeature",
"border_value": 58.5,
"flat_feature_index": 5,
"feature_index": 0,
"feature_name": "age"
}
split_indexes are indexed across all the features (numerical, OneHot, CTR), which means the next split_index will start from 25.
Section 4.1.2 : categorical_features
This is the ordered list of categorical features, along with information about one-hot encoded values.
"categorical_features":[
{
"feature_name":"day",
"flat_feature_index":0,
"feature_index":0
},
{
"feature_name":"hour",
"flat_feature_index":1,
"feature_index":1
},
{
"feature_name":"month",
"flat_feature_index":2,
"feature_index":2
},
{
"feature_name":"app_cat",
"flat_feature_index":3,
"feature_index":3
},
{
"feature_name":"user_occ",
"values":[
-1445282301,
-364013047,
1886646856
],
"flat_feature_index":4,
"feature_index":4
},
{
"feature_name":"gender",
"flat_feature_index":6,
"feature_index":5
},
{
"feature_name":"statecode",
"flat_feature_index":7,
"feature_index":6
}
]
feature_name, flat_feature_index, feature_index : are already explained above.
values : this is the ordered list of values which will be used for the one-hot feature. These also become part of the global list of split_indexes. For categorical_features a split_index gives the following information:
- Feature Name
- Feature Index
- Value for OneHot comparison
For the above example of categorical_features we can generate 3 OneHot split_indexes. Since these are part of the global list of split_indexes and indexes 0 to 24 are already used by float_features, these split_indexes start at 25.
split_index(25) =
{
"feature_type": "OneHot",
"value": -1445282301,
"feature_name": "user_occ",
"flat_feature_index": 4,
"feature_index": 4
}
split_index(27) =
{
"feature_type": "OneHot",
"value": 1886646856,
"feature_name": "user_occ",
"flat_feature_index": 4,
"feature_index": 4
}
Note : there are a few more OneHot split_indexes not listed here. Those are explained in Section 4.3.1.
Section 4.1.3: ctrs
This is the ordered list of CTR features. It has several important fields which are tightly linked to the CTR types:
"ctrs":[
{
"borders":[
0.9999989867, 1.999999046, 3.999999046, 4.999999046,
6.999999046, 8.999999046, 9.999999046, 11.99999905
],
"prior_numerator":0,
"shift":-0,"target_border_idx":0,
"ctr_type":"Borders",
"scale":15,
"elements":[
{
"cat_feature_index":0,
"combination_element":"cat_feature_value"
}
],
"identifier":"{\"identifier\":[{\"cat_feature_index\":0,\"combination_element\":\"cat_feature_value\"}],\"type\":\"Borders\"}",
"prior_denomerator":1
},
{
"borders": [0.9999989867],
"prior_numerator":0,
"shift":-0,
"target_border_idx":0,
"ctr_type":"Counter",
"scale":15,
"elements": [
{
"cat_feature_index":0,
"combination_element":"cat_feature_value"
},
{
"float_feature_index":0,
"combination_element":"float_feature",
"border":15.5
}
],
"identifier":"{\"identifier\":[{\"cat_feature_index\":0,\"combination_element\":\"cat_feature_value\"},{\"border\":15.5,\"combination_element\":\"float_feature\",\"float_feature_index\":0}],\"type\":\"Counter\"}",
"prior_denomerator":1
},
....
....
]
ctr_type: the type of the ctr feature. The different types are explained in Section 2.
prior_numerator, prior_denominator, scale, shift: the same values as explained in the "CTR types" section (note the JSON key is spelled prior_denomerator).
elements: the list of feature components that form the CTR feature. How to compute the CTR key from them is explained in Section 3; how to compute the value for that key is explained in Sections 4.2 and 2.
identifier: a stringified JSON of elements & ctr_type. It is used to find the other values required to compute the CTR, as explained later in Section 4.2.
borders: the ordered list of borders which will be used for this CTR feature. These also become part of the global list of split_indexes. For ctrs a split_index gives the following information:
- elements
- ctr_type
- shift
- scale
- prior_numerator
- prior_denominator
- border value
- identifier
Since these are part of the global list of split_indexes, and indexes 0 to 24 are used for float_features and 25-27 for the OneHot values listed above, the ctr split_indexes start at 28 in this illustration. (Note that the extra OneHot split_indexes described in Section 4.3.1 also sit between the OneHot values and the ctrs, so in the actual sample model the ctr split_indexes start after those.)
split_index(28) =
{
"split_type": "OnlineCTR",
"prior_numerator":0,
"shift":-0,
"ctr_type":"Counter",
"scale":15,
"elements":
[
{
"cat_feature_index":0,
"combination_element":"cat_feature_value"
},
{
"float_feature_index":0,
"combination_element":"float_feature",
"border":15.5
}
],
"identifier":"{\"identifier\":[{\"cat_feature_index\":0,\"combination_element\":\"cat_feature_value\"},{\"border\":15.5,\"combination_element\":\"float_feature\",\"float_feature_index\":0}],\"type\":\"Counter\"}",
"prior_denomerator":1,
"border_value": 0.9999989867
}
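Putting Sections 4.1.1 to 4.1.3 together, here is a sketch of how a reader of the JSON could build the global split_index list: float borders first, then one-hot values, then ctr borders. The record and field names are my own simplification of the parsed JSON, not CatBoost's API, and the extra OneHot split_indexes from Section 4.3.1 are ignored for brevity.

import java.util.ArrayList;
import java.util.List;

public class SplitIndexTable {
    // Simplified stand-ins for the parsed features_info entries
    record FloatFeature(int flatFeatureIndex, int featureIndex, String name, double[] borders) {}
    record CatFeature(int flatFeatureIndex, int featureIndex, String name, long[] oneHotValues) {}
    record Ctr(String identifier, String ctrType, double[] borders) {}
    // One entry of the global list; one-hot values are 32-bit hashes, so they fit exactly in a double
    record Split(String type, Object source, double borderOrValue) {}

    static List<Split> buildSplitIndexes(List<FloatFeature> floats,
                                         List<CatFeature> cats,
                                         List<Ctr> ctrs) {
        List<Split> splits = new ArrayList<>();
        // 1. Numerical borders, in feature order, then border order
        for (FloatFeature f : floats)
            for (double b : f.borders())
                splits.add(new Split("FloatFeature", f, b));
        // 2. One-hot encoded values listed in features_info
        for (CatFeature c : cats)
            if (c.oneHotValues() != null)
                for (long v : c.oneHotValues())
                    splits.add(new Split("OneHot", c, v));
        // 3. CTR borders, in ctr order, then border order
        for (Ctr ctr : ctrs)
            for (double b : ctr.borders())
                splits.add(new Split("OnlineCtr", ctr, b));
        return splits; // split_index == position in this list
    }
}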
Section 4.1.4 : cat_features_hash
This is a list of JSON elements. Each element contains 2 fields, hash and value,
where CityHash(value) = hash.
"cat_features_hash":
[
{
"hash":4216280702,
"value":"4"
},
{
"hash":1445320840,
"value":"12"
},
{
"hash":3509359534,
"value":"thu"
},
.....
.....
]
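Since the model already stores the CityHash of every categorical value seen in the pool, a Java client can avoid re-implementing CityHash for known values by inverting this list into a value-to-hash map. A minimal sketch, assuming the list has already been parsed into parallel arrays; unseen values at prediction time would still need a real CityHash implementation.

import java.util.HashMap;
import java.util.Map;

public class CatFeatureHashes {
    public static void main(String[] args) {
        // A few entries copied from the cat_features_hash list above
        long[] hashes = {4216280702L, 1445320840L, 3509359534L};
        String[] values = {"4", "12", "thu"};

        Map<String, Long> valueToHash = new HashMap<>();
        for (int i = 0; i < hashes.length; i++) {
            valueToHash.put(values[i], hashes[i]);
        }
        // Look up the stored CityHash for a raw categorical value
        System.out.println(valueToHash.get("thu")); // 3509359534
    }
}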
Section 4.2 : ctr_data
This gives the values needed to compute each CTR. It is a JSON map whose keys are the identifiers from section 4.1.3.
"ctr_data":
{
"{\"identifier\":[{\"cat_feature_index\":6,\"combination_element\":\"cat_feature_value\"}],\"type\":\"Counter\"}": {
"counter_denominator":32,
"hash_stride":2,
"hash_map":
[
"18446744073709551615",
18,
"9285085082777897483",
22,
"5287930183868741137",
22,
"13172738045064563988",
20,
"6703001155837839893",
25,
"4858559805981312021",
17,
"5418337360679822996",
26,
.....
]
},
"{\"identifier\":[{\"cat_feature_index\":6,\"combination_element\":\"cat_feature_value\"},{\"border\":55.5,\"combination_element\":\"float_feature\",\"float_feature_index\":0}],\"type\":\"Borders\"}":
{
"counter_denominator":0,
"hash_stride":3,
"hash_map":
[
"18446744073709551615",
4,
12,
"16473605376674046465",
8,
9,
"14187676640779072514",
1,
2,
"9393377116333705988",
7,
8,
"5094243394766242315",
0,
2,
"5094137587211407371",
1,
1,
.....
],
},
.....
}
hash_map : hash_map is basically a flat list of keys and values. The number of values per key depends on the CTR type. How to compute the key was explained earlier in this post.
- Borders : each key is followed by 2 values, so indexes 0,3,6,.. are keys and the pairs at indexes (1,2),(4,5),(7,8),.. are (failures, successes) used for computing the "Borders" CTR value for that key.
- Counter & FeatureFreq : each key is followed by 1 value, so indexes 0,2,4,6,.. are keys and indexes 1,3,5,7,.. are the counts used for computing the "Counter" or "FeatureFreq" CTR value for that key.
counter_denominator : this is used as the total when computing the "Counter" or "FeatureFreq" CTR value.
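A sketch of how the flat hash_map array can be read, given the CTR key from Section 3. The stride and value layout follow the description above; the class and method names are mine, and a linear scan is shown only for clarity (a real client would index the entries into a HashMap, and would parse the string keys with Long.parseUnsignedLong).

public class CtrLookup {
    // Looks up the entry for `key` in the flat hash_map array.
    // For "Borders" (hash_stride = 3) the two numbers after the key are (failures, successes);
    // for "Counter"/"FeatureFreq" (hash_stride = 2) the single number after the key is the count.
    static long[] lookup(long[] hashMap, int hashStride, long key) {
        for (int i = 0; i < hashMap.length; i += hashStride) {
            if (hashMap[i] == key) {
                long[] values = new long[hashStride - 1];
                System.arraycopy(hashMap, i + 1, values, 0, hashStride - 1);
                return values;
            }
        }
        return null; // key not present in the training statistics
    }
}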
Section 4.3 oblivious_trees
This is the list of trees. Each tree is evaluated separately. I will explain these trees by computing the tree value for a few example trees.
Example 1
{
"leaf_weights":[2, 49, 18, 931],
"leaf_values":[1.153793657, 0.3428686569, 0.06438921655, 0.0150123653],
"splits":
[
{
"split_index":189,
"split_type":"OnlineCtr",
"border":3.999999046,
"ctr_target_border_idx":0
},
{
"split_index":86,
"split_type":"OnlineCtr",
"border":4.999999046,
"ctr_target_border_idx":0
}
]
}
There are 2 important parts
splits : the list of conditions used at the different levels of the tree. Since CatBoost builds symmetric (oblivious) trees, we only need len(splits) = depth of the tree conditions. This list is in reverse order, i.e. the last element in the list is used at the root of the tree. We mainly need the "split_index" from this part to evaluate the conditions.
leaf_values: there are 2^(tree_depth) values here, representing the terminal node values of the complete tree from left to right. The decisions of the splits determine which of these values the tree returns.
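The post's worked examples below imply a simple mapping from decisions to a leaf: each split's decision is one bit, with splits[i] contributing 2^i, and the result indexes into leaf_values. This is my inference from the examples, not something stated in the model file; the SplitEvaluator interface is an assumed helper that returns the boolean decision for one split, computed as in the walkthroughs.

public class ObliviousTree {
    // Decision for a single split of the tree against the input row
    interface SplitEvaluator {
        boolean evaluate(int splitPositionInTree);
    }

    // leaf_values has 2^depth entries; splits[i] (i = 0 .. depth-1) contributes bit i,
    // so the last split in the list (the root) is the most significant bit.
    static double evaluateTree(double[] leafValues, int depth, SplitEvaluator splits) {
        int leafIndex = 0;
        for (int i = 0; i < depth; i++) {
            if (splits.evaluate(i)) {
                leafIndex |= 1 << i;
            }
        }
        return leafValues[leafIndex];
    }

    public static void main(String[] args) {
        // Example 1 below: both conditions are true, so leafIndex = 3
        double[] leafValues = {1.153793657, 0.3428686569, 0.06438921655, 0.0150123653};
        System.out.println(evaluateTree(leafValues, 2, i -> true)); // 0.0150123653
    }
}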
Section 4.3.1
If the split_type of a split is "OneHotFeature", it might not be listed in the features_info section, but it is still part of the global list of split_indexes.
For Example in sample_model.json
oblivious_trees[1]['splits'][0] =
{
"split_index": 28,
"cat_feature_index": 5,
"value": -845129958,
"split_type": 'OneHotFeature'
}
So some of the split_indexes need to be derived from here. These indexes come just after the OneHot feature values and before the ctrs.
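In practice a Java client does not need to reconstruct the numbering for these extra OneHot splits: each split object inside a tree already carries its split_type and, for a OneHotFeature, the cat_feature_index and value to compare against (see Example 3 below). A minimal sketch of how one split's decision could be made for a row; the method signature and array layout are my own assumptions, not CatBoost's API.

public class SplitDispatch {
    // Decide a single tree split for one input row. `floatValues` is indexed by
    // float_feature_index and `catHashes` by cat_feature_index (hashes of the raw values).
    static boolean decide(String splitType, double border, long oneHotValue,
                          int floatFeatureIndex, int catFeatureIndex, double ctrValue,
                          double[] floatValues, long[] catHashes) {
        switch (splitType) {
            case "FloatFeature":
                return floatValues[floatFeatureIndex] >= border;
            case "OneHotFeature":
                return catHashes[catFeatureIndex] == oneHotValue;
            case "OnlineCtr":
                return ctrValue >= border; // ctrValue computed as in Sections 2 to 4.2
            default:
                throw new IllegalArgumentException("unknown split_type: " + splitType);
        }
    }
}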
I will compute the above tree example, along with a few more trees, for the single row below:
day | hour | month | app_cat | user_occ | age | gender | statecode |
fri | 22 | oct | games | student | 47 | male | 103 |
For the root node (split_index = 86), look up the split using section 4.1:
split_index(86) =
{
"prior_numerator": 0,
"shift": 0,
"ctr_type": 'Counter',
"scale": 15,
"elements":
[
{
"cat_feature_index": 0,
"combination_element": 'cat_feature_value'
},
{
"cat_feature_index": 2,
"combination_element": 'cat_feature_value'
}
],
"identifier": '{"identifier":[{"cat_feature_index":0,"combination_element":"cat_feature_value"},{"cat_feature_index":2,"combination_element":"cat_feature_value"}],"type":"Counter"}',
"prior_denomerator": 1,
"border" = 4.999999046
}
Compute the hash to be used for ctr_data using section 3 and Section 4.1.4
cat_feature_index(0) = "day"
input["day"] = "fri"
CityHash("fri") = 2386474191
MAGIC_MULT * ( 0 + MAGIC_MULT * 2386474191 ) = 10849034401243659895
cat_feature_index(2) = "month"
input["month"] = "oct"
CityHash("oct") = 964427042
MAGIC_MULT * ( 10849034401243659895 + MAGIC_MULT * 964427042 ) = 7913951620039613893
So final hash to be used is
"7913951620039613893"
using section 4.2, find the values to be used to compute the CTR at this node
ctr_hash['{"identifier":[{"cat_feature_index":0,"combination_element":"cat_feature_value"},{"cat_feature_index":2,"combination_element":"cat_feature_value"}],"type":"Counter"}']['hash_map'][7913951620039613893] =
{
"count" = 12,
"counter_denominator" = 23
}
using section 2.1, compute the CTR value and compare it with the border to find the next node
15 * ((0 + 12)/(1 + 23)) = 7.5
7.5 >= 4.999999046 is True so traverse to Right Node.
using section 4.1
split_index(189) =
{
"prior_numerator": 0.5,
"shift": 0,
"scale": 15,
"elements":
[
{
"cat_feature_index": 2,
"combination_element": "cat_feature_value"
}
],
"identifier": '{"identifier":[{"cat_feature_index":2,"combination_element":"cat_feature_value"}],"type":"Borders"}',
"prior_denomerator": 1,
"ctr_type" : "Borders",
"border" : 3.999999046
}
using section 3 & section 4.1.4
cat_feature_index(2) = "month"
input["month"] = "oct"
CityHash("oct") = 964427042
MAGIC_MULT * ( 0 + MAGIC_MULT * 964427042 ) = 14429572201882049490
using section 4.2
ctr_data['{"identifier":[{"cat_feature_index":2,"combination_element":"cat_feature_value"}],"type":"Borders"}']['hash_map'][14429572201882049490] =
{
"successes": 51,
"failures" : 43
}
using section 2.2
0 + 15 * ((0.5 + 51)/(0.5 + 51 + 1 + 43)) = 8.089005235602095
8.089005235602095 >= 3.999999046 is True so traverse to Right node
We have reached a terminal node, so the value of the tree is the value of that terminal node.
Computed value of this tree is 0.0150123653
Example 2
{
"leaf_weights":[551, 9, 434, 6],
"leaf_values":[1.142552497, 0.7536726871, 1.019659549, 1.933912389],
"splits":
[
{
"split_index":24,
"float_feature_index":0,
"split_type":"FloatFeature",
"border":58.5
},
{
"split_index":201,
"split_type":"OnlineCtr",
"border":7.999999046,
"ctr_target_border_idx":0
}
]
}
using section 4.1
split_index(201) =
{
"prior_numerator": 1,
"shift": 0,
"ctr_type": "Borders",
"scale": 15,
"elements":
[
{
"cat_feature_index": 2,
"combination_element": "cat_feature_value"
}
],
"identifier": '{"identifier":[{"cat_feature_index":2,"combination_element":"cat_feature_value"}],"type":"Borders"}',
"prior_denomerator": 1,
"border" : 7.999999046
}
using section 3 & section 4.1.4
cat_feature_index(2) = "month"
input["month"] = "oct"
CityHash("oct") = 964427042
MAGIC_MULT * ( 0 + MAGIC_MULT * 964427042 ) = 14429572201882049490
using section 4.2
ctr_data['{"identifier":[{"cat_feature_index":2,"combination_element":"cat_feature_value"}],"type":"Borders"}']['hash_map'][14429572201882049490] =
{
"successes": 51,
"failures": 43
}
using section 2.2
0 + 15 * ((1 + 51)/(1 + 51 + 1 + 43)) = 8.125
8.125 >= 7.999999046 is True, so traverse to Right node
using section 4.1
split_index(24) =
{
"feature_type": 'FloatFeature',
"feature_name": 'age',
"nan_value_treatment": 'AsIs',
"has_nans": False,
"flat_feature_index": 5,
"feature_index": 0,
"border": 58.5
}
using section 3
input['age'] = 47
47 >= 58.5 is False, so traverse to Left node
Computed value of this tree is 1.019659549
Example 3
{
"leaf_weights":[470, 530],
"leaf_values":[2.411496103, 2.176863892],
"splits":
[
{
"cat_feature_index":5,
"split_index":28,
"split_type":"OneHotFeature",
"value":-845129958
}
]
}
using section 4.1
split_index(28) =
{
"feature_name": "gender",
"cat_feature_index": 5,
"split_type": "OneHotFeature",
"value": -845129958
}
using section 3 & section 4.1.4
input['gender'] = 'male'
CityHash('male') = 2083200611
2083200611 == -845129958 is False so traverse to Left node
Computed value of this tree is 2.411496103
Example 4
{
"leaf_weights":[257, 743],
"leaf_values":[2.081450524, 2.052298752],
"splits":
[
{
"split_index":55,
"split_type":"OnlineCtr",
"border":12.99999905,
"ctr_target_border_idx":0
}
]
}
using section 4.1
split_index(55) =
{
"prior_numerator": 0,
"shift": 0,
"ctr_type": 'Counter',
"scale": 15,
"elements":
[
{
"cat_feature_index": 0,
"combination_element": 'cat_feature_value'
}
],
"identifier": '{"identifier":[{"cat_feature_index":0,"combination_element":"cat_feature_value"}],"type":"Counter"}',
"prior_denomerator": 1,
"border": 12.99999905
}
using section 3 & section 4.1.4
cat_feature_index(0) = 'day'
input['day'] = 'fri'
CityHash('fri') = 2386474191
MAGIC_MULT * ( 0 + MAGIC_MULT * 2386474191 ) = 10849034401243659895
So final hash to be used is
"10849034401243659895"
using section 4.2
ctr_data['{"identifier":[{"cat_feature_index":0,"combination_element":"cat_feature_value"}],"type":"Counter"}']['hash_map']['10849034401243659895'] =
{
"counter_denominator" = 154,
"count" = 151
}
using section 2.1
0 + 15 * ((0 + 151)/(1 + 154)) = 14.612903225806452
14.612903225806452 >= 12.99999905 is True so traverse to Right node
Computed value of this tree is 2.052298752
Example 5
{
"leaf_weights":[944, 56],
"leaf_values":[1.867651047, 1.653345508],
"splits":
[
{
"split_index":257,
"split_type":"OnlineCtr",
"border":9.999999046,
"ctr_target_border_idx":0
}
]
}
using section 4.1
split_index(257) =
{
"prior_numerator": 1,
"shift": 0,
"ctr_type": 'Borders',
"scale": 15,
"elements":
[
{
"cat_feature_index": 3,
"combination_element": 'cat_feature_value'
}
],
"identifier": '{"identifier":[{"cat_feature_index":3,"combination_element":"cat_feature_value"}],"type":"Borders"}',
"prior_denomerator": 1,
"border": 9.999999046
}
using section 3 & section 4.1.4
cat_feature_index(3) = 'app_cat'
input['app_cat'] = 'games'
CityHash('games') = 4157097893
MAGIC_MULT * ( 0 + MAGIC_MULT * 4157097893 ) = 7916588964563354589
So final hash to be used is
"7916588964563354589"
using section 4.2
ctr_data['{"identifier":[{"cat_feature_index":3,"combination_element":"cat_feature_value"}],"type":"Borders"}']['hash_map']['7916588964563354589'] =
{
"failures" = 80,
"successes" = 74
}
using section 2.2
0 + 15 * ((1 + 74)/(1 + 74 + 1 + 80)) = 7.211538461538462
7.211538461538462 >= 9.999999046 is False so traverse to Left node
Computed value of this tree is 1.867651047
Example 6
{
"leaf_weights":[25, 742, 4, 71, 0, 47, 0, 111],
"leaf_values":
[
-0.4397322896, 0.000866602846, -0.5008245793, -0.04967483793,
0, 0.2684168171, 0, 0.06013996591
],
"splits":
[
{
"split_index":0,
"float_feature_index":0,
"split_type":"FloatFeature",
"border":12.5
},
{
"split_index":136,
"split_type":"OnlineCtr",
"border":9.999999046,
"ctr_target_border_idx":0
},
{
"split_index":148,
"split_type":"OnlineCtr",
"border":8.999999046,
"ctr_target_border_idx":0
}
]
}
using section 4.1
split_index(148) =
{
"prior_numerator": 0,
"shift": 0,
"ctr_type": 'Borders',
"scale": 15,
"elements":
[
{
"cat_feature_index": 1,
"combination_element": 'cat_feature_value'
},
{
"float_feature_index": 0,
"combination_element": 'float_feature',
"border": 12.5
}
],
"identifier": '{"identifier":[{"cat_feature_index":1,"combination_element":"cat_feature_value"},{"border":12.5,"combination_element":"float_feature","float_feature_index":0}],"type":"Borders"}',
"prior_denomerator": 1,
"border": 8.999999046
}
using section 3 & section 4.1.4
cat_feature_index(1) = 'hour'
input['hour'] = '12'
CityHash('12') = 1445320840
MAGIC_MULT * ( 0 + MAGIC_MULT * 1445320840 ) = 8879677033106486088
float_feature_index(0) = 'age'
input['age'] = 47
47 >= 12.5 is True so its computed value is 1
MAGIC_MULT * ( 8879677033106486088 + MAGIC_MULT * 1 ) = 4244931956247570753
So final hash to be used is
"4244931956247570753"
using section 4.2
ctr_data['{"identifier":[{"cat_feature_index":1,"combination_element":"cat_feature_value"},{"border":12.5,"combination_element":"float_feature","float_feature_index":0}],"type":"Borders"}']['hash_map']['4244931956247570753'] =
{
"failures": 16,
"successes": 15
}
using section 2.2
0 + 15 * ((0 + 15) / (0 + 15 + 1 + 16)) = 7.03125
7.03125 >= 8.999999046 is False so traverse to Left node
using section 4.1
split_index(136) =
{
"prior_numerator": 1,
"shift": 0,
"ctr_type": 'Borders',
"scale": 15,
"elements":
[
{
"cat_feature_index": 1,
"combination_element": 'cat_feature_value'
}
],
"identifier": '{"identifier":[{"cat_feature_index":1,"combination_element":"cat_feature_value"}],"type":"Borders"}',
"prior_denomerator": 1,
"border" : 9.999999046
}
using section 3 & section 4.1.4
cat_feature_index(1) = 'hour'
input['hour'] = '12'
CityHash('12') = 1445320840
MAGIC_MULT * ( 0 + MAGIC_MULT * 1445320840 ) = 8879677033106486088
using section 4.2
ctr_data['{"identifier":[{"cat_feature_index":1,"combination_element":"cat_feature_value"}],"type":"Borders"}']['hash_map']['8879677033106486088'] =
{
"failures": 15,
"successes": 17
}
using section 2.2
0 + 15 * ((1 + 17) / (1 + 17 + 1 + 15)) = 7.9411764705882355
7.9411764705882355 >= 9.999999046 is False so traverse to Left node
using section 4.1
split_index(0) =
{
"float_feature_index":0,
"split_type":"FloatFeature",
"border":12.5
}
using section 3
float_feature_index(0) = 'age'
input['age'] = 47
47 >= 12.5 is True so traverse to Right node
Computed value of this tree is 0.000866602846
Section 4.4 scale_and_bias
It is an array of 2 elements: the first value is the scale and the second value is the bias. Finally, to compute the model output we use the following (for regression):
bias + scale * (Σ oblivious_trees)
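A minimal sketch of this final step, assuming the per-tree values have already been computed as in the examples above; the method name and argument order are mine.

public class FinalPrediction {
    // scale_and_bias = [scale, bias]; treeValues are the per-tree outputs
    static double predict(double scale, double bias, double[] treeValues) {
        double sum = 0;
        for (double v : treeValues) {
            sum += v;
        }
        return bias + scale * sum;
    }
}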