
I am reading the Spark MLlib documentation, and the decision tree documentation says:

 Each partition is chosen greedily by selecting the best split from a set of possible splits, in order to maximize the information gain at a tree node. 

Here is the link.

I am not able to understand:

  1. The "partition" being talked about here: is it a Spark data partition or a feature partition?
  2. Or could it be splits on each data partition?
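For context, "maximizing the information gain" in the quoted sentence refers to the standard greedy split-selection criterion. A minimal, self-contained sketch (plain Python, not Spark's actual code) of picking the best threshold for one numeric feature by entropy-based information gain:

```python
import math

def entropy(labels):
    """Shannon entropy of a collection of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(labels, left, right):
    """Entropy reduction achieved by splitting labels into two children."""
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

def best_split(rows, labels, feature, thresholds):
    """Greedily pick the candidate threshold with the highest information gain."""
    best = None
    for t in thresholds:
        left = [y for x, y in zip(rows, labels) if x[feature] <= t]
        right = [y for x, y in zip(rows, labels) if x[feature] > t]
        if not left or not right:
            continue  # skip degenerate splits
        gain = information_gain(labels, left, right)
        if best is None or gain > best[1]:
            best = (t, gain)
    return best

rows = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,)]
labels = [0, 0, 0, 1, 1, 1]
# Threshold 6.5 separates the classes perfectly, so it wins with gain 1.0.
print(best_split(rows, labels, feature=0, thresholds=[2.5, 6.5, 11.5]))  # (6.5, 1.0)
```

This is only the scoring criterion; how Spark organizes the computation across the cluster is a separate question, which the answer below addresses.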

1 Answer


The reference to "partition" here has nothing to do with Spark data partitions. It refers to the partitioning of the data at a tree node based on a selected feature, i.e. the "data partitioning" as described in the algorithm itself.

If you check the actual implementation, it queues all the nodes that need to be split and selects a batch of them based on the memory available (configurable). The idea is that the number of passes over the data can be reduced if the statistics for a batch of nodes and their features can be computed in one pass. Then, for each node, it takes a subset of the features (configurable) and calculates the statistics for each feature, which yields a set of possible splits. Only the best possible split is sent to the driver node (node here means the Spark driver machine; the terms can be confusing :)), which augments the tree. Each datum, or row in your RDD, is represented as a BaggedTreePoint and stores the information about which tree node it currently belongs to.

It will take a little time to go through the source code, but it may be worth it.
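The batching idea described above can be sketched in a few lines. This is an illustrative model only (the function and parameter names are hypothetical, not Spark's internals): queued nodes are grouped so that the split statistics for one group fit in a memory budget, and each group is then handled in a single pass over the data.

```python
from collections import deque

def plan_passes(node_queue, stats_bytes_per_node, memory_budget):
    """Group queued tree nodes into batches whose split statistics fit
    within the memory budget; each batch costs one pass over the data.
    Illustrative sketch, not Spark's actual implementation."""
    passes = []
    queue = deque(node_queue)
    while queue:
        batch, used = [], 0
        while queue and used + stats_bytes_per_node <= memory_budget:
            batch.append(queue.popleft())
            used += stats_bytes_per_node
        if not batch:
            # A single node exceeds the budget; process it alone anyway.
            batch.append(queue.popleft())
        passes.append(batch)
    return passes

# 7 nodes awaiting splits, 100 units of stats memory each, 250-unit budget:
# only 2 nodes fit per pass, so 4 passes are needed instead of 7.
print(plan_passes(["n1", "n2", "n3", "n4", "n5", "n6", "n7"], 100, 250))
# [['n1', 'n2'], ['n3', 'n4'], ['n5', 'n6'], ['n7']]
```

The point of the grouping is exactly what the answer says: fewer passes over the distributed data, at the cost of holding more per-node statistics in memory at once.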


3 Comments

Thanks Sourabh!! Does this happen with random forest as well as decision tree?
@tesnik03 yeah, it is actually exactly the same code. If you're training just one decision tree, it trains a random forest with numTrees = 1 under the hood.
Thanks @EvgeniiMorozov
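The delegation described in the comments (a single decision tree trained as a one-tree forest) is a common pattern, and can be sketched as follows. All names here are hypothetical stand-ins, not Spark's API:

```python
def train_random_forest(data, num_trees, feature_subset_strategy):
    """Stand-in for the shared forest trainer (hypothetical names).
    Returns a dict describing the trained forest."""
    return {
        "trees": [f"tree_{i}" for i in range(num_trees)],
        "feature_subset": feature_subset_strategy,
    }

def train_decision_tree(data):
    """A single decision tree is just a forest of one tree that
    considers all features at each split (no random subsetting)."""
    forest = train_random_forest(data, num_trees=1,
                                 feature_subset_strategy="all")
    return forest["trees"][0]

print(train_decision_tree([]))  # prints tree_0
```

Keeping one code path for both estimators means bug fixes and optimizations in the forest trainer automatically benefit the single-tree case.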

