Section 1 : Some concepts of Catboost model
- CatBoost convert categorical features into 16 bit hash using CityHash. These hashes are available in model if you pass pool while saving the model. Location and format of these hashes in model is explained in section x.
- CatBoost create new features by combining 2 ore more features to make more complex features. These are called "ctr features". How these ctr features are represented and computed are explained in the section 3.
- CatBoost use indexes to refer to feature & tree node decision
- feature_index: This is feature index of categorical feature of in the list of categorical features and numrical feature in the list of numerical feature
- flat_feature_index: this is the index of the feature(categorical or numerical) in the list of all features.
- cat_feature_index: this is the index of the feature(categorical) in the ordered list of categorical featuresIn the example file order of features is age, day, hour, month, app_cat, user_occ, gender, statecode with index of age is 0 and its a numerical feature. This list is further divided into numerical features and categorical features. The elements in these lists are referred as flat_feature_index and cat_feature_index. i.e. flat_feature_index = 0 means age and cat_feature_index=0 means day.
- Borders : this is used in condition of internal nodes of the trees. How to compute the value in the condition of the node is bit complex and explained later. Borders are referred by their index in the ordered list of borders across all types of features. the order is as following.
- Numerical Borders
- OneHotEncoded feature borders
- CTR borders
Section 2 : CTR Types
CatBoost convert categorical features into different numerical values which are represented as numerator and denominator in the model. These numerical values do have derived with laplace smoothing using numerator prior and denominator prior and other values. What all components are available in model and how to compute values for different type of CTR values are explained in this section
- Section 2.1 : Count Or FeatureFreq
- Components
- Count Prior (prior_numerator)
- Count
- Total Prior (prior_denominator)
- Total (counter_denominator)
- Scale
- Shift
- Formula
Shift + Scale * ((Count Prior + Count)/(Total Prior + Total))
- Section 2.2 : Borders
- Components
- Success Prior (prior_numerator)
- Success
- Failure Prior (prior_denominator)
- Failures
- Scale
- Shift
- Formula
Shift + Scale * ((Success Prior + Successes)/(Success Prior + Success + Failure Prior + Failures))
Section 3 : Complex CTR Features:
{
"elements":[
{
"cat_feature_index":6,
"combination_element":
"cat_feature_value"
},
{
"border":12.5,
"combination_element":"float_feature",
"float_feature_index":0
},
{
"cat_feature_index":5,
"combination_element":"cat_feature_exact_value",
"value":-845129958
}
],
"type":"Borders"
}
Representation
elements
- Categorical Feature ( combination_element = cat_feature_value) :
- This component has only 1 other element
- cat_feature_index: In Categorical Feature we compute Cityhash of the feature using feature value at cat_feature_index In the example cat_feature_index=6 that means we should use value of feature statecode and take its Cityhash.
- ComputedValue : CityHash("103") = 2080643764
- How to compute CityHash is explained
- Numerical Feature(combination_element = float_feature) :
- This component has 2 elements
- flat_feature_index: this tell which feature to use for this. In the example(flat_feature_index = 0) means age.
- border : This is a numerical value. We need to compare feature value with this value and if feature_value >= border then this component will compute to 1 else 0.
- ComputedValue : 12.5>47?1:0 = 1
- OneHotFeatures ( combination_element = cat_feature_exact_value) :
- Here CityHash of feature will be compared with given value. This has 2 elements
- cat_feature_index: this tell which feature to use. In the example above cat_feature_index = 5, which is gender.
- value : this tell with which value we need to compare the CityHash of feature value. If they are same this component will compute to 1 else 0. i.e. if (CityHash(features[age]) = -845129958) then 1 else 0
- ComputedValue : CityHash("male") == -845129958 ? 1 else 0 = 1
- CityHash("male") = -845129958. How to compute this is explained later
type
How to combine components of the list into 1 value :
hash = 0
for i in (1:size(elements)){
hash = MAGIC_MULT * ( hash + MAGIC_MULT * VALUE(i) )
}
So to compute the above example
MAGIC_MULT * ( 0 + MAGIC_MULT * 2080643764) = 5418337360679822996
MAGIC_MULT * ( 5418337360679822996 + MAGIC_MULT * 1) = 6727615186874477117
MAGIC_MULT * ( 6727615186874477117 + MAGIC_MULT * 1 = 4977781465358360298
Section 4 : Structure of catboost json file
section 4.1 : features_info :
This gives the list of features and other information including ordering of borders as explained below in this section. There are 3 different Feature Types :
section 4.1.1 : float_features
"float_features":[
{
"feature_name":"age",
"nan_value_treatment":"AsIs",
"has_nans":false,
"flat_feature_index":5,
"feature_index":0,"borders":[
12.5, 13.5, 15.5, 16.5, 20.5,
21.5, 22.5, 24.5, 25.5, 29.5,
30.5, 31.5, 34.5, 35.5, 36.5,
37.5, 41.5, 47.5, 49.5, 50.5,
52.5, 53.5, 54.5, 55.5, 58.5
]
}
]
this is the ordered list of numerical features along ordered list of borders used for decisions in trees.
has_nans : if this feature had missing values at the time of training
flat_feature_index : this tell the flat_feature_index of the this numerical feature
feature_index : this is feature_index of this numerical feature
borders : This is a first part of the global ordered list of values which needs to be used in conditions of the node where this numerical feature is used. This order is very important and values are used by their indexes. This global index is call split_index. There is a alot infomration we will get just by knowing the split_index.
For FloatFeature type of feature we can get following information just by knowing index of split_index
- Border value
- Feature name
- Feature index
For above float_feature example, we can generate 25 split_indexes. Below are 2 example of 1st and last split_indexes. Please note the indexes of split_index here are 0 and 24
split_index(0) =
{
"feature_type": "FloatFeature",
"border_value": 12.5,
"flat_feature_index": 5,
"feature_index": 0,
"feature_name": "age"
}
split_index(24) =
{
"feature_type": "FloatFeature",
"border_value": 58.5,
"flat_feature_index": 5,
"feature_index": 0,
"feature_name": "age"
}
split_indexes are indexed across all the features(numerical, OneHot, CTR) that means index of the next split_index will start from 25
section 4.1.2 : Categorical_features
This is ordered list of categorical features along with information about one hot encoded values
"categorical_features":[
{
"feature_name":"day",
"flat_feature_index":0,
"feature_index":0
},
{
"feature_name":"hour",
"flat_feature_index":1,
"feature_index":1
},
{
"feature_name":"month",
"flat_feature_index":2,
"feature_index":2
},
{
"feature_name":"app_cat",
"flat_feature_index":3,
"feature_index":3
},
{
"feature_name":"user_occ",
"values":[
-1445282301,
-364013047,
1886646856
],
"flat_feature_index":4,
"feature_index":4
},
{
"feature_name":"gender",
"flat_feature_index":6,
"feature_index":5
},
{
"feature_name":"statecode",
"flat_feature_index":7,
"feature_index":6
}
]
- Feature Name
- Feature Index
- Value for OneHot comparision
split_index(25) =
{
"feature_type": "OneHot",
"value": -1445282301,
"feature_name": "user_occ",
"flat_feature_index": 4,
"feature_index": 4
}
split_index(27) =
{
"feature_type": "OneHot",
"value": 1886646856,
"feature_name": "user_occ",
"flat_feature_index": 4,
"feature_index": 4
}
Section 4.1.3: ctrs
This is ordered list of CTR features, it has important fields which are tightly linked to CTR types
"ctrs":[
{
"borders":[
0.9999989867, 1.999999046, 3.999999046, 4.999999046,
6.999999046, 8.999999046, 9.999999046, 11.99999905
],
"prior_numerator":0,
"shift":-0,"target_border_idx":0,
"ctr_type":"Borders",
"scale":15,
"elements":[
{
"cat_feature_index":0,
"combination_element":"cat_feature_value"
}
],
"identifier":"{\"identifier\":[{\"cat_feature_index\":0,\"combination_element\":\"cat_feature_value\"}],\"type\":\"Borders\"}",
"prior_denomerator":1
},
{
"borders": [0.9999989867],
"prior_numerator":0,
"shift":-0,
"target_border_idx":0,
"ctr_type":"Counter",
"scale":15,
"elements": [
{
"cat_feature_index":0,
"combination_element":"cat_feature_value"
},
{
"float_feature_index":0,
"combination_element":"float_feature",
"border":15.5
}
],
"identifier":"{\"identifier\":[{\"cat_feature_index\":0,\"combination_element\":\"cat_feature_value\"},{\"border\":15.5,\"combination_element\":\"float_feature\",\"float_feature_index\":0}],\"type\":\"Counter\"}",
"prior_denomerator":1
},
....
....
]
- elements
- ctr_type
- shift
- scale
- prior_numerator
- prior_denominator
- border value
- identifier
split_index(28) =
{
"split_type": "OnlineCTR",
"prior_numerator":0,
"shift":-0,
"ctr_type":"Counter",
"scale":15,
"elements":
[
{
"cat_feature_index":0,
"combination_element":"cat_feature_value"
},
{
"float_feature_index":0,
"combination_element":"float_feature",
"border":15.5
}
],
"identifier":"{\"identifier\":[{\"cat_feature_index\":0,\"combination_element\":\"cat_feature_value\"},{\"border\":15.5,\"combination_element\":\"float_feature\",\"float_feature_index\":0}],\"type\":\"Counter\"}",
"prior_denomerator":1,
"border_value": 0.9999989867
}
Section 4.1.4 : cat_features_hash:
It is a list of json elements. Each elements contains 2 elements, Hash and value,
where CityHash(value) = Hash
"cat_features_hash":
[
{
"hash":4216280702,
"value":"4"
},
{
"hash":1445320840
"value":"12",
},
{
"hash":3509359534,
"value":"thu"
},
.....
.....
]
Section 4.2 : ctr_data
This gives information about values needed to compute different ctr type. It is a json map where keys are identifiers from section 4.1.3.
"ctr_data": { "{\"identifier\":[{\"cat_feature_index\":6,\"combination_element\":\"cat_feature_value\"}],\"type\":\"Counter\"}": { "counter_denominator":32, "hash_stride":2, "hash_map": [ "18446744073709551615", 18, "9285085082777897483", 22, "5287930183868741137", 22, "13172738045064563988", 20, "6703001155837839893", 25, "4858559805981312021", 17, "5418337360679822996", 26, ..... ] }, "{\"identifier\":[{\"cat_feature_index\":6,\"combination_element\":\"cat_feature_value\"},{\"border\":55.5,\"combination_element\":\"float_feature\",\"float_feature_index\":0}],\"type\":\"Borders\"}": { "counter_denominator":0, "hash_stride":3, "hash_map": [ "18446744073709551615", 4, 12, "16473605376674046465", 8, 9, "14187676640779072514", 1, 2, "9393377116333705988", 7, 8, "5094243394766242315", 0, 2, "5094137587211407371", 1, 1, ..... ], }, ..... }
hash_map : hash_map is basically a list of key and values. The number of values depends on CTR type. How to get the key is explained earlier in the blog.
- Borders : It has key and 2 values that means 0,3,6,.. are keys. (1,2),(4,5),(7,8),.. indexes are (failures, successes) used for computing the value of "border" CTR type at that key.
- Counter & FeatureFreq : It has key and 1 value that mean 0,2,4,6,.. are the keys and 1,3,5,7... are the counts for computing the value of "counter" or "FeatureFreq" CTR type at that key.
Section 4.3 oblivious_trees
This is the list of trees. Each tree is computed separately. I will try to explain these trees by computing tree values for example trees.
Example 1
{
"leaf_weights":[2, 49, 18, 931],
"leaf_values":[1.153793657, 0.3428686569, 0.06438921655, 0.0150123653],
"splits":
[
{
"split_index":189,
"split_type":"OnlineCtr",
"border":3.999999046,
"ctr_target_border_idx":0
},
{
"split_index":86,
"split_type":"OnlineCtr",
"border":4.999999046,
"ctr_target_border_idx":0
}
]
}
There are 2 important parts
splits : List of conditions to be used in different levels of tree. Since Catboost generate Symmetric trees we need only len(splits) = depth of tree. This list is in reverse order i.e. last element in the list is used in root of the tree. We just need "split_index" from this part to compute the node values.
leaf_values: there are 2^(tree_depth) number of values in this representing terminal node values of the complete tree from left to right. We just need split_index to compute the node value.
Section 4.6.1
If split_type of split is "OneHotFeature", it might not be listed in the features_info section, but it is still the part of global list of split_indexes.
For Example in sample_model.json
oblivious_trees[1]['splits][0] =
{
"split_index": 28,
"cat_feature_index": 5,
"value": -845129958,
"split_type": 'OneHotFeature'
}
so some of the split indexes needs to be derived from here. These indexes start just after OneHot feature values and before ctrs.
I will compute the above tree example along with few more trees for the one row below
For root node (split_index = 86), find the split index using section 4.1
split_index(86) =
{
"prior_numerator": 0,
"shift": 0,
"ctr_type": 'Counter',
"scale": 15,
"elements":
[
{
"cat_feature_index": 0,
"combination_element": 'cat_feature_value'
},
{
"cat_feature_index": 2,
"combination_element": 'cat_feature_value'
}
],
"identifier": '{"identifier":[{"cat_feature_index":0,"combination_element":"cat_feature_value"},{"cat_feature_index":2,"combination_element":"cat_feature_value"}],"type":"Counter"}',
"prior_denomerator": 1,
"border" = 4.999999046
}
Compute the hash to be used for ctr_data using section 3 and Section 4.1.4
cat_feature_index(0) = "day"
input["day"] = "fri"
CityHash("fri") = 2386474191
MAGIC_MULT * ( 0 + MAGIC_MULT * 2386474191 ) = 10849034401243659895
cat_feature_index(2) = "month"
input["month"] = "oct"
CityHash("oct") = 964427042
MAGIC_MULT * ( 10849034401243659895 + MAGIC_MULT * 964427042 ) = 7913951620039613893
using section 4.2 find values to be used to compute CTR at the node
ctr_hash['{"identifier":[{"cat_feature_index":0,"combination_element":"cat_feature_value"},{"cat_feature_index":2,"combination_element":"cat_feature_value"}],"type":"Counter"}']['hash_map'][7913951620039613893] =
{
"count" = 12,
"counter_denominator" = 23
}
using section 2.1 to compute the node and compare it with border to find the next node.
15 * ((0 + 12)/(1 + 23)) = 7.5
7.5 >= 4.999999046 is True so traverse to Right Node.
split_index(189) =
{
"prior_numerator": 0.5,
"shift": 0,
"scale": 15,
"elements":
[
{
"cat_feature_index": 2,
"combination_element": "cat_feature_value"
}
],
"identifier": '{"identifier":[{"cat_feature_index":2,"combination_element":"cat_feature_value"}],"type":"Borders"}',
"prior_denomerator": 1,
"ctr_type" : "Border",
"border" = 3.999999046
}
using section 3 & section 4.1.4
cat_feature_index(2) = "month"
input["month"] = "oct"
CityHash("oct") = 964427042
MAGIC_MULT * ( 0 + MAGIC_MULT * 964427042 ) = 14429572201882049490
ctr_data['{"identifier":[{"cat_feature_index":2,"combination_element":"cat_feature_value"}],"type":"Borders"}']['hash_map'][14429572201882049490] =
{
"successes": 51,
"failures" : 43
}
using section 2.2
0 + 15 * ((0.5 + 51)/(0.5 + 51 + 1 + 43)) = 8.089005235602095
8.089005235602095 >= 3.999999046 is True so traverse to Right node
Reached terminal node so value of the tree is value of terminal node.
Computed value of this tree is 0.0150123653
Example 2
{
"leaf_weights":[551, 9, 434, 6],
"leaf_values":[1.142552497, 0.7536726871, 1.019659549, 1.933912389],
"splits":
[
{
"split_index":24,
"float_feature_index":0,
"split_type":"FloatFeature",
"border":58.5
},
{
"split_index":201,
"split_type":"OnlineCtr",
"border":7.999999046,
"ctr_target_border_idx":0
}
]
}
using section 4.1
split_index(201) =
{
"prior_numerator": 1,
"shift": 0,
"ctr_type": "Borders",
"scale": 15,
"elements":
[
{
"cat_feature_index": 2,
"combination_element": "cat_feature_value"
}
],
"identifier": '{"identifier":[{"cat_feature_index":2,"combination_element":"cat_feature_value"}],"type":"Borders"}',
"prior_denomerator": 1,
"border" : 7.999999046
}
using section 3 & section 4.1.4
cat_feature_index(2) = "month"
input["month"] = "oct"
CityHash("oct") = 964427042
MAGIC_MULT * ( 0 + MAGIC_MULT * 964427042 ) = 14429572201882049490
using section 4.2
ctr_data['{"identifier":[{"cat_feature_index":2,"combination_element":"cat_feature_value"}],"type":"Borders"}']['hash_map'][14429572201882049490] =
{
"successes": 51,
"failures": 43
}
using section 2.2
0 + 15 * ((1 + 51)/(1 + 51 + 1 + 43)) = 8.125
8.125 >= 7.999999046 is True, so travese to Right node
using section 4.1
split_index(24) =
{
"feature_type": 'FloatFeature',
"feature_name": 'age',
"nan_value_treatment": 'AsIs',
"has_nans": False,
"flat_feature_index": 5,
"feature_index": 0,
"border": 58.5
}
using section 3
input['age'] = 47
47 >= 58.5 is False, so traverse to Left node
Computed value of this tree is 1.019659549
Example 3
{
"leaf_weights":[470, 530],
"leaf_values":[2.411496103, 2.176863892],
"splits":
[
{
"cat_feature_index":5,
"split_index":28,
"split_type":"OneHotFeature"
"value":-845129958,
}
]
}
using section 4.1
split_index(28) =
{
"feature_name": "gender",
"cat_feature_index": 5,
"split_type": "OneHotFeature"
"value": -845129958
}
using section 3 & section 4.1.4
input['gender'] = 'male'
CityHash('male') = 2083200611
2083200611 == -845129958 is False so traverse to Left node
Computed value of this tree is 2.411496103
Example 4
{
"leaf_weights":[257, 743],
"leaf_values":[2.081450524, 2.052298752],
"splits":
[
{
"split_index":55,
"split_type":"OnlineCtr",
"border":12.99999905,
"ctr_target_border_idx":0
}
]
}
using section 4.1
split_index(55) =
{
"prior_numerator": 0,
"shift": 0,
"ctr_type": 'Counter',
"scale": 15,
"elements":
[
{
"cat_feature_index": 0,
"combination_element": 'cat_feature_value'
}
],
"identifier": '{"identifier":[{"cat_feature_index":0,"combination_element":"cat_feature_value"}],"type":"Counter"}',
"prior_denomerator": 1,
"border": 12.99999905
}
cat_feature_index(0) = 'day'
input['day'] = 'fri'
CityHash('fri') = 2386474191
MAGIC_MULT * ( 0 + MAGIC_MULT * 2386474191 ) = 10849034401243659895
using section 4.2
ctr_data['{"identifier":[{"cat_feature_index":0,"combination_element":"cat_feature_value"}],"type":"Counter"}']['hash_map']['10849034401243659895'] =
{
"counter_denominator" = 154,
"count" = 151
}
0 + 15 * ((0 + 151)/(1 + 154)) = 14.612903225806452
14.612903225806452 >= 12.99999905 is True so traverse to Right node
Computed value of this tree is 2.052298752
Example 5
{
"leaf_weights":[944, 56],
"leaf_values":[1.867651047, 1.653345508],
"splits":
[
{
"split_index":257,
"split_type":"OnlineCtr",
"border":9.999999046,
"ctr_target_border_idx":0
}
]
}
using section 4.1
{
"prior_numerator": 1,
"shift": 0,
"ctr_type": 'Borders',
"scale": 15,
"elements":
[
{
"cat_feature_index": 3,
"combination_element": 'cat_feature_value'
}
],
"identifier": '{"identifier":[{"cat_feature_index":3,"combination_element":"cat_feature_value"}],"type":"Borders"}',
"prior_denomerator": 1,
"border": 9.999999046
}
using section 3 & section 4.1.4
cat_feature_index(3) = 'app_cat'
input['app_cat'] = 'games'
CityHash('games') = 4157097893
MAGIC_MULT * ( 0 + MAGIC_MULT * 4157097893 ) = 7916588964563354589
using section 4.2
ctr_data['{"identifier":[{"cat_feature_index":3,"combination_element":"cat_feature_value"}],"type":"Borders"}']['hash_map']['7916588964563354589'] =
{
"failures" = 80,
"successes" = 74
}
using section 2.2
0 + 15 * ((1 + 74)/(1 + 74 + 1 + 80)) = 7.211538461538462
7.211538461538462 >= 9.999999046 is False so traverse to Left node
Computed value of this tree is 1.867651047
Example 6
{
"leaf_weights":[25, 742, 4, 71, 0, 47, 0, 111],
"leaf_values":
[
-0.4397322896, 0.000866602846, -0.5008245793, -0.04967483793,
0, 0.2684168171, 0, 0.06013996591
],
"splits":
[
{
"split_index":0,
"float_feature_index":0,
"split_type":"FloatFeature",
"border":12.5
},
{
"split_index":136,
"split_type":"OnlineCtr",
"border":9.999999046,
"ctr_target_border_idx":0
},
{
"split_index":148,
"split_type":"OnlineCtr",
"border":8.999999046,
"ctr_target_border_idx":0
}
]
}
using section 4.1
split_index(148) =
{
"prior_numerator": 0,
"shift": 0,
"ctr_type": 'Borders',
"scale": 15,
"elements":
[
{
"cat_feature_index": 1,
"combination_element": 'cat_feature_value'
},
{
"float_feature_index": 0,
"combination_element": 'float_feature',
"border": 12.5
}
],
"identifier": '{"identifier":[{"cat_feature_index":1,"combination_element":"cat_feature_value"},{"border":12.5,"combination_element":"float_feature","float_feature_index":0}],"type":"Borders"}',
"prior_denomerator": 1,
"border": 8.999999046
}
using section 3 & section 4.1.4
cat_feature_index(1) = 'hour'
input['hour'] = '12'
CityHash('12') = 1445320840
MAGIC_MULT * ( 0 + MAGIC_MULT * 1445320840 ) = 8879677033106486088
float_feature_index() = 'age'
input['age'] = 47
47 >= 12.5 is True so its computed value is 1
MAGIC_MULT * ( 8879677033106486088 + MAGIC_MULT * 1 ) = 4244931956247570753
using section 4.2
ctr_data[''{"identifier":[{"cat_feature_index":1,"combination_element":"cat_feature_value"},{"border":12.5,"combination_element":"float_feature","float_feature_index":0}],"type":"Borders"}']['hash_map']['4244931956247570753'] =
{
"failures": 15,
"successes": 16
}
using section 2.2
0 + 15 * ((0 + 15) / (0 + 15 + 1 + 16)) = 7.03125
7.03125 >= 8.999999046 is False so traverse to Left node
using section 4.1
split_index(136) =
{
"prior_numerator": 1,
"shift": 0,
"ctr_type": 'Borders',
"scale": 15,
"elements":
[
{
"cat_feature_index": 1,
"combination_element": 'cat_feature_value'
}
],
"identifier": '{"identifier":[{"cat_feature_index":1,"combination_element":"cat_feature_value"}],"type":"Borders"}',
"prior_denomerator": 1,
"border" : 9.999999046
}
using section 3 & section 4.1.4
cat_feature_index(1) = 'hour'
input['hour'] = '12'
CityHash('12') = 1445320840
MAGIC_MULT * ( 0 + MAGIC_MULT * 1445320840 ) = 8879677033106486088
using section 4.2
ctr_data['{"identifier":[{"cat_feature_index":1,"combination_element":"cat_feature_value"}],"type":"Borders"}']['hash_map']['8879677033106486088'] =
{
"failures": 15,
"succeses": 17
}
using section 2.2
0 + 15 * ((1 + 17) / (1 + 17 + 1 + 15)) = 7.9411764705882355
7.9411764705882355 > = 9.999999046 is False so traverse to Left node
using section 4.1
split_index(0) =
{
"float_feature_index":0
"split_type":"FloatFeature"
"border":12.5
}
using section 3
float_feature_index(0) = 'age'
input['age'] = 47
47 >= 12.5 is True so traverse to Right node
Computed value of this tree is 0.000866602846
Section 4.3 scale_and_bias
It is an array of 2 elements. 1st value is scale and second value is bias. Finally to compute the model output we need to use following(for regression)
bias + scale * (Σ oblivious_trees)