Accuracy differs between MATLAB and scikit-learn for a decision tree


Is it possible to get different accuracy on the same data set in MATLAB and in a Jupyter notebook using Python code?

For the same data set, I first applied a decision tree in MATLAB and got 96% accuracy; then I applied the same data set in a Jupyter notebook using Python, where I got 53% accuracy for C4.5 (decision tree) with k-fold cross-validation.

I don't understand why I get different accuracy for the same dataset and the same method.

My Python code is given below:

import pandas as pd
import numpy as np
from sklearn import tree
from sklearn.model_selection import KFold, cross_val_score

train=pd.read_csv('E://New.csv')
train.head()

[screenshot of the train.head() output]

# define X (features) and y (target)
feature_cols = ['Past', 'Family_History', 'Current',
                'current or previous workplace',
                'diagnosed with a mental health condition by a medical professional?',
                'do you feel that it interferes with your work when being treated effectively?',
                'Gender']
X = train[feature_cols]

# y is the target column
y = train['Diagonised condition']

kfold = KFold(n_splits=10, random_state=None)
model = tree.DecisionTreeClassifier(criterion='gini')

results = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
result = results.mean() * 100

std = results.std() * 100
print(result)

[screenshot of the printed cross-validation accuracy]

IS2057

Posted 2019-01-23T15:37:08.610

Reputation: 267

Please post the MATLAB code so it can be compared to the Python code. – Brian Spiering – 2019-01-23T16:12:29.590

In MATLAB I use the classification app (decision tree), load my data set, and then calculate the accuracy. – IS2057 – 2019-01-23T18:05:41.697

Are you sure that all other parameters for your decision tree are the same? – TwinPenguins – 2019-01-24T06:23:54.983

@MajidMortazavi Yes, I am sure. I use the same dataset and the same parameters. – IS2057 – 2019-01-24T06:49:55.217

Answers


It is hard to make a direct comparison between a white-box implementation (scikit-learn) and a black-box implementation (MATLAB).

One guess is that they are using different algorithms. scikit-learn uses an optimized version of the CART algorithm, while MATLAB may use ID3, C4.5, or something else. Another guess is that the two implementations are using different hyperparameters (e.g., different splitting criteria, maximum depth, minimum node size, ...).
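For example, here is a minimal sketch of scikit-learn hyperparameters worth aligning with the MATLAB side; the values and the mapping to MATLAB's settings are assumptions, not MATLAB's actual defaults:

from sklearn.tree import DecisionTreeClassifier

# Illustrative values only -- check the settings used in MATLAB and mirror them here
model = DecisionTreeClassifier(
    criterion='gini',      # MATLAB may use a different split criterion
    max_depth=None,        # limit the depth if the MATLAB tree is pruned or depth-limited
    min_samples_split=2,   # roughly analogous to a minimum parent-node size
    min_samples_leaf=1,    # roughly analogous to a minimum leaf size
    random_state=0,        # fix the seed so repeated runs are comparable
)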

Since decision trees are white-box models, you can examine their internal structure. Plot both trained trees and see how each of them makes its splits and how many splits are made.
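On the scikit-learn side, a quick way to inspect the fitted tree is something like the following sketch, reusing model, X, y, and feature_cols from the question (plot_tree and export_text require scikit-learn >= 0.21):

import matplotlib.pyplot as plt
from sklearn.tree import export_text, plot_tree

model.fit(X, y)

# Text summary of the splits the tree actually makes
print(export_text(model, feature_names=feature_cols))

# Graphical view of the tree structure
plt.figure(figsize=(20, 10))
plot_tree(model, feature_names=feature_cols, filled=True)
plt.show()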

Brian Spiering

Posted 2019-01-23T15:37:08.610

Reputation: 10 864


Although different splitting algorithms and hyperparameters do lead to different model performance, the gap here seems too large for that alone. You could try one-hot encoding some of your multi-class categorical features and see whether that makes any difference.
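A minimal sketch of that with pandas, reusing train, feature_cols, y, model, kfold, and cross_val_score from the question (which of those columns are truly categorical is an assumption only you can check):

import pandas as pd

# One-hot encode the categorical feature columns; numeric columns pass through unchanged
X_encoded = pd.get_dummies(train[feature_cols])

# Re-run the same 10-fold cross-validation on the encoded features
results = cross_val_score(model, X_encoded, y, cv=kfold, scoring='accuracy')
print(results.mean() * 100)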

plpopk

Posted 2019-01-23T15:37:08.610

Reputation: 168