Data classification is the process of separating and organizing data into relevant groups (“classes”) based on their shared characteristics, such as their level of sensitivity, the risks they present, and the compliance regulations that protect them. When done right, data classification makes using and protecting data easier and more efficient (Data classification, 2020).
Our textbook describes two data classification models:
Descriptive Modeling: It serves as an explanatory tool to distinguish between objects of different classes.
Predictive Modeling: It can be used to predict the class label of unknown records (Tan, P.-N. et al., 2013).
A decision tree has three types of nodes. The root node has no incoming edges and zero or more outgoing edges. An internal node has exactly one incoming edge and two or more outgoing edges. A terminal (leaf) node has exactly one incoming edge and no outgoing edges. In a decision tree, each leaf node is assigned a class label. The non-terminal nodes, which include the root and other internal nodes, contain attribute test conditions to separate records that have different characteristics (Tan, P.-N. et al., 2013). Decision trees help to visualize and understand data, and they can handle multidimensional data, often with good accuracy.
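The node types described above can be sketched in a few lines of Python. This is only an illustration: the class names `Decision` and `Leaf`, the function `classify`, and the small mammal tree are assumptions made for this example, not code from the textbook.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    """Terminal node: one incoming edge, no outgoing edges, holds a class label."""
    label: str


@dataclass
class Decision:
    """Root or internal node: holds an attribute test condition."""
    attribute: str
    # One outgoing edge per possible value of the tested attribute.
    children: Dict[str, Union["Decision", Leaf]]


def classify(node, record):
    """Follow the test outcomes from the root until a leaf is reached."""
    while isinstance(node, Decision):
        node = node.children[record[node.attribute]]
    return node.label


# Hypothetical tree: the root tests body temperature, one internal node tests birth.
tree = Decision("body_temperature", {
    "cold": Leaf("non-mammal"),
    "warm": Decision("gives_birth", {
        "yes": Leaf("mammal"),
        "no": Leaf("non-mammal"),
    }),
})

print(classify(tree, {"body_temperature": "warm", "gives_birth": "yes"}))  # mammal
```

A record is classified by starting at the root and applying one attribute test per non-terminal node, so the number of tests never exceeds the depth of the tree.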
Hyperparameters are model parameters that are estimated without using actual, observed data. A hyperparameter is basically a “good guess” at what a model’s parameters might be, made without using your actual data. The term “hyperparameter” is used to distinguish the prior “guess” parameters from other parameters used in statistics, such as coefficients in regression analysis (Hyperparameter, 2020). In machine learning, the term usually refers to settings that are fixed before training rather than learned from the data, such as the maximum depth of a decision tree.
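As a small worked example of a prior “guess”, consider estimating a success probability with a Beta prior. The prior's two parameters are hyperparameters, chosen before any data are seen. The function name `posterior_mean` and the numbers below are assumptions made for illustration only.

```python
def posterior_mean(alpha, beta, successes, trials):
    """Mean of the Beta posterior after observing `successes` in `trials`.

    alpha and beta are the hyperparameters of the Beta prior. They are set
    without looking at the observed data.
    """
    return (alpha + successes) / (alpha + beta + trials)


# Hyperparameters: a prior guess that the success probability is near 0.5.
alpha, beta = 2.0, 2.0

# With no data, the estimate is just the prior guess.
print(posterior_mean(alpha, beta, 0, 0))   # 0.5

# Hypothetical observed data: 7 successes in 10 trials.
print(posterior_mean(alpha, beta, 7, 10))  # 9/14, about 0.643
```

The estimate moves from the prior guess of 0.5 toward the observed rate of 0.7 as data arrive, which shows how the hyperparameters differ from quantities estimated from the data.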
Model Selection and Evaluation:
Model selection is the process of choosing between different learning algorithms for modelling our data. For a classification problem, for example, the choice could be between logistic regression, SVM, and tree-based algorithms. Model evaluation aims to check the generalization ability of our model, i.e., its ability to perform well on an unseen dataset (Patel, S., 2020).
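The difference between the two steps can be sketched with a simple hold-out split. This is a minimal sketch under stated assumptions: the two candidate models (a majority-class baseline and a one-level decision tree, or "stump"), the helper names, and the toy data are all hypothetical and stand in for the real algorithms named above.

```python
from collections import Counter


def fit_majority(train):
    """Baseline model: always predict the most common training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label


def fit_stump(train):
    """One-level decision tree: predict True when x >= threshold.

    The threshold is the training value that gives the fewest training errors.
    """
    best_errors, threshold = None, None
    for t in sorted({x for x, _ in train}):
        errors = sum((x >= t) != y for x, y in train)
        if best_errors is None or errors < best_errors:
            best_errors, threshold = errors, t
    return lambda x: x >= threshold


def accuracy(model, data):
    """Fraction of records whose predicted label matches the true label."""
    return sum(model(x) == y for x, y in data) / len(data)


# Hypothetical toy data: (feature value, class label).
train = [(1, False), (2, False), (3, False), (4, False), (6, True), (7, True), (8, True)]
valid = [(2, False), (3, False), (5, True), (7, True)]
test = [(1, False), (4, False), (6, True), (8, True)]

# Fit every candidate on the training set only.
candidates = {"majority": fit_majority, "stump": fit_stump}
fitted = {name: fit(train) for name, fit in candidates.items()}

# Model selection: keep the candidate with the best validation accuracy.
best_name = max(fitted, key=lambda name: accuracy(fitted[name], valid))

# Model evaluation: estimate generalization on data used for neither step.
test_accuracy = accuracy(fitted[best_name], test)
print(best_name, test_accuracy)  # stump 1.0
```

Keeping the test set out of both fitting and selection is what makes the final accuracy an estimate of performance on unseen data rather than a measure of how well the model fits data it has already seen.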