Classification-Tree Learning

Machine Learning with Tree-Based Models in Python

Elie Kawerk

Data Scientist

Building Blocks of a Decision-Tree

  • Decision-Tree: data structure consisting of a hierarchy of nodes.

  • Node: question or prediction.

Building Blocks of a Decision-Tree

Three kinds of nodes:

  • Root: no parent node; a question giving rise to two child nodes.

  • Internal node: one parent node; a question giving rise to two child nodes.

  • Leaf: one parent node, no child nodes --> prediction.
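
The three kinds of nodes can be sketched as a small data structure. This is only an illustration: the `Node` class, its field names, and the `predict` helper are hypothetical and are not part of scikit-learn or of the course code.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """One node of a binary decision tree (hypothetical layout)."""
    feature: Optional[int] = None      # index of the feature the question tests
    threshold: Optional[float] = None  # split-point used by the question
    left: Optional["Node"] = None      # child followed when x[feature] <= threshold
    right: Optional["Node"] = None     # child followed otherwise
    prediction: Optional[str] = None   # set only on leaves

    def is_leaf(self) -> bool:
        # A leaf has no children and carries a prediction instead of a question.
        return self.left is None and self.right is None


def predict(node: Node, x: Sequence[float]) -> str:
    """Walk from the root to a leaf, answering one question per node."""
    while not node.is_leaf():
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.prediction


# The root asks a question, its right child is an internal node,
# and the remaining three nodes are leaves.
root = Node(
    feature=0, threshold=2.0,
    left=Node(prediction="A"),
    right=Node(feature=1, threshold=5.0,
               left=Node(prediction="B"),
               right=Node(prediction="A")),
)

print(predict(root, [1.0, 9.0]))  # A
print(predict(root, [3.0, 4.0]))  # B
```

Every input follows exactly one path from the root to a leaf, so a prediction costs one comparison per level of the tree.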

Prediction

DT-labeled

Information Gain (IG)

IG-diagram

Information Gain (IG)

IG-formula

Criteria to measure the impurity of a node $I(\text{node})$:

  • Gini index,
  • entropy. ...
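
Both criteria have standard textbook definitions: the Gini index is $1 - \sum_k p_k^2$ and the entropy is $-\sum_k p_k \log_2 p_k$, where $p_k$ is the fraction of samples in the node that belong to class $k$. Information gain is conventionally the parent's impurity minus the size-weighted impurity of its two children. The sketch below uses these textbook forms, which are assumed here rather than copied from the slide figure; the function names are illustrative.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini index: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Entropy in bits: -sum of p * log2(p) over the classes present."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


parent = ["a", "a", "b", "b"]
print(gini(parent))     # 0.5
print(entropy(parent))  # 1.0
print(information_gain(parent, ["a", "a"], ["b", "b"]))  # 0.5
```

A split that separates the classes perfectly leaves both children pure, so the gain equals the parent's impurity.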

Classification-Tree Learning

  • Nodes are grown recursively.

  • At each node, split the data based on:

    • feature $f$ and split-point $sp$ to maximize $IG(\text{node})$.
  • If $IG(\text{node}) = 0$, declare the node a leaf.

    ...
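
The recursive procedure above can be sketched in plain Python. This is a minimal illustration that assumes the Gini index as the impurity measure and an exhaustive search over observed feature values; scikit-learn's actual implementation is more elaborate. The names `best_split` and `grow` and the dictionary layout are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini index: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(X, y):
    """Return (gain, feature, split_point) for the split with the highest gain."""
    n = len(y)
    best = (0.0, None, None)
    for f in range(len(X[0])):
        for sp in sorted({row[f] for row in X}):
            left = [label for row, label in zip(X, y) if row[f] <= sp]
            right = [label for row, label in zip(X, y) if row[f] > sp]
            if not left or not right:
                continue  # a split must send samples to both children
            gain = gini(y) - (len(left) / n * gini(left)
                              + len(right) / n * gini(right))
            if gain > best[0]:
                best = (gain, f, sp)
    return best


def grow(X, y):
    """Grow a tree recursively; stop when no split has positive gain."""
    gain, f, sp = best_split(X, y)
    if f is None:
        # No split improves purity: declare a leaf predicting the majority class.
        return {"prediction": Counter(y).most_common(1)[0][0]}
    left_idx = [i for i, row in enumerate(X) if row[f] <= sp]
    right_idx = [i for i, row in enumerate(X) if row[f] > sp]
    return {
        "feature": f,
        "split_point": sp,
        "left": grow([X[i] for i in left_idx], [y[i] for i in left_idx]),
        "right": grow([X[i] for i in right_idx], [y[i] for i in right_idx]),
    }


X = [[1.0], [2.0], [3.0], [4.0]]
y = ["a", "a", "b", "b"]
tree = grow(X, y)
print(tree)
```

On this toy data the root splits feature 0 at 2.0, and both children are pure, so each becomes a leaf.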

# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
# Import train_test_split
from sklearn.model_selection import train_test_split
# Import accuracy_score
from sklearn.metrics import accuracy_score
# Split dataset into 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=1)
# Instantiate dt, set 'criterion' to 'gini'
dt = DecisionTreeClassifier(criterion='gini', random_state=1)

Information Criterion in scikit-learn

# Fit dt to the training set
dt.fit(X_train, y_train)

# Predict test-set labels
y_pred = dt.predict(X_test)

# Evaluate test-set accuracy
accuracy_score(y_test, y_pred)
0.92105263157894735
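
The slides do not show how `X` and `y` are loaded. The self-contained sketch below assumes scikit-learn's built-in breast cancer dataset as a stand-in, so its scores need not match the value above, and fits one tree per criterion for comparison.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: a built-in dataset stands in for the X and y used in the slides.
X, y = load_breast_cancer(return_X_y=True)

# Split dataset into 80% train, 20% test, keeping class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

# Fit one tree per information criterion and record test-set accuracy
scores = {}
for criterion in ("gini", "entropy"):
    dt = DecisionTreeClassifier(criterion=criterion, random_state=1)
    dt.fit(X_train, y_train)
    scores[criterion] = accuracy_score(y_test, dt.predict(X_test))

print(scores)
```

The only change between the two models is the `criterion` argument; the split, the seed, and the evaluation are identical.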

Let's practice!
