Feature Selection Example
Next, we will look at an example of how to call Feature Selection in Rason.
Feature Selection gives users the ability to rank and select the most relevant variables for inclusion in a classification or prediction model.
In many cases the most accurate models, or the models with the lowest misclassification or residual errors, have benefited from better feature
selection, using a combination of human insight and automated methods. Rason Data Science provides a facility to compute all of the following
metrics, described in the literature, to give users information on which features should be included in, or excluded from, their models.
Only some of these metrics can be used in any given application, depending on the characteristics of the input variables (features) and the
type of problem. In a supervised setting, if we classify data science problems as follows:
ℝⁿ ⟶ ℝ : Real-valued features, prediction (regression) problem
ℝⁿ ⟶ {0,1} : Real-valued features, binary classification problem
ℝⁿ ⟶ {1..C} : Real-valued features, multi-class classification problem
{1..C}ⁿ ⟶ ℝ : Nominal categorical features, prediction (regression) problem
{1..C}ⁿ ⟶ {0,1} : Nominal categorical features, binary classification problem
{1..C}ⁿ ⟶ {1..C} : Nominal categorical features, multi-class classification problem
then we can describe the applicability of the Feature Selection metrics by the following table:
Metric      | R-R | R-{0,1} | R-{1..C} | {1..C}-R | {1..C}-{0,1} | {1..C}-{1..C}
------------|-----|---------|----------|----------|--------------|--------------
Pearson     |  N  |         |          |          |              |
Spearman    |  N  |         |          |          |              |
Kendall     |  N  |         |          |          |              |
Welch's     |  D  |    N    |          |          |              |
F-Test      |  D  |    N    |    N     |          |              |
Chi-squared |  D  |    D    |    D     |    D     |      N       |      N
Mutual Info |  D  |    D    |    D     |    D     |      N       |      N
Gain Ratio  |  D  |    D    |    D     |    D     |      N       |      N
Fisher      |  D  |    N    |    N     |          |              |
Gini        |  D  |    N    |    N     |          |              |
"N" means that metrics can be applied naturally, and āDā means that features and/or the outcome variable must be discretized
before applying the particular filter.
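To make the N vs. D distinction concrete, here is a minimal Python sketch (not Rason code; `pearson`, `discretize`, and `mutual_info` are hypothetical helper names) that computes one naturally real-valued filter, Pearson correlation, and one discrete filter, mutual information, which needs an equal-width discretization step first:

```python
import math
from collections import Counter

def pearson(x, y):
    # Applied "naturally" (N) to a real-valued feature and real-valued outcome.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def discretize(x, bins=3):
    # Equal-width binning: the discretization (D) step required before a
    # discrete filter can be applied to a real-valued variable.
    lo, hi = min(x), max(x)
    width = (hi - lo) / bins or 1.0
    return [min(int((v - lo) / width), bins - 1) for v in x]

def mutual_info(x, y):
    # Mutual information (in nats) between two discrete variables.
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

# A real-valued feature vs. a real-valued outcome: Pearson applies directly,
# mutual information only after discretizing both variables.
x = [7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10]
y = [78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5, 93.1,
     115.9, 83.8, 113.3, 109.4]
r = pearson(x, y)
mi = mutual_info(discretize(x), discretize(y))
```

The same pattern covers the other "D" cells in the table: discretize whichever side (features, outcome, or both) is real-valued, then apply the discrete filter.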
Here is an example of a Rason model calling Feature Selection using linear wrapping.
{
    modelName: 'LinearWrapping',
    modelType: 'datamining',
    modelDescription: 'feature selection - linear wrapping',
    datasources: {
        myTrainSrc: {
            type: 'csv', connection: 'hald-small.txt', direction: 'import'
        }
    },
    datasets: {
        myTrainData: {
            binding: 'myTrainSrc', targetCol: 'Y'
        }
    },
    estimator: {
        linearFSEstimator: {
            type: 'featureSelection', algorithm: 'linearWrapping',
            parameters: {
                fitIntercept: true,
                method: 'EXHAUSTIVE_SEARCH'
            }
        }
    },
    actions: {
        linearFSModel: {
            data: 'myTrainData', estimator: 'linearFSEstimator', action: 'fit',
            evaluations: [
                'bestSubsets',
                'bestSubsetsDetails'
            ]
        },
        reducedTrainData: {
            data: 'myTrainData', fittedmodel: 'linearFSModel',
            parameters: {
                numTopFeatures: 2
            },
            action: 'transform',
            evaluations: [
                'transformation'
            ]
        }
    }
}
The "datasources" section in this example creates one datasource, "myTrainSrc" which imports the hald-small dataset within the hald-small.txt file.
Inside of "datasets", the datasource "myTrainData" is bound to the "myTrainSrc" dataset. An output column is specified as the "Y" column
(targetCol: 'Y'). Note: Input files in a Data Science Rason model must not contain a path to a file location.
A new estimator, linearFSEstimator, is created within "estimator". This estimator is of type "featureSelection", uses the algorithm "linearWrapping",
and sets the parameters "fitIntercept" to true and "method" to 'EXHAUSTIVE_SEARCH'. For a complete list of all parameters associated with Feature
Selection, see the Rason Reference Guide.
Two new action items, "linearFSModel" and "reducedTrainData", are listed within "actions". The "linearFSModel" action item fits the model,
while "reducedTrainData" uses the resulting model to transform the dataset.
The "linearFSModel" action item uses the estimator, linearFSEstimator, to "fit" a feature selection model to the "myTrainData" dataset and
return the best subsets and their details in the results.
The "reducedTrainData" action item uses the fitted "linearFSModel" model to perform feature selection to reduce (transform) the "myTrainData"
dataset from 5 features (not including the output variable, Y) to 2 (three including the output variable, Y).
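The internals of linearWrapping with EXHAUSTIVE_SEARCH are not documented here, but the idea can be sketched in Python: fit an ordinary least squares model on every feature subset of each size and keep the subset with the lowest residual sum of squares (all function names are hypothetical; the X1, X2, and Y columns are taken from the reduced dataset shown in the results):

```python
from itertools import combinations

def ols_fit(x_cols, y):
    # Solve the normal equations (with an intercept column) by Gaussian
    # elimination; fine for the handful of columns used here.
    n = len(y)
    cols = [[1.0] * n] + [list(c) for c in x_cols]
    k = len(cols)
    a = [[sum(cols[i][t] * cols[j][t] for t in range(n)) for j in range(k)]
         for i in range(k)]
    b = [sum(cols[i][t] * y[t] for t in range(n)) for i in range(k)]
    for p in range(k):
        piv = max(range(p, k), key=lambda r: abs(a[r][p]))
        a[p], a[piv] = a[piv], a[p]
        b[p], b[piv] = b[piv], b[p]
        for r in range(p + 1, k):
            f = a[r][p] / a[p][p]
            for c in range(p, k):
                a[r][c] -= f * a[p][c]
            b[r] -= f * b[p]
    beta = [0.0] * k
    for p in range(k - 1, -1, -1):
        beta[p] = (b[p] - sum(a[p][c] * beta[c]
                              for c in range(p + 1, k))) / a[p][p]
    return beta

def rss(x_cols, y, beta):
    # Residual sum of squares of the fitted model.
    preds = [beta[0] + sum(beta[j + 1] * x_cols[j][t]
                           for j in range(len(x_cols)))
             for t in range(len(y))]
    return sum((yt - pt) ** 2 for yt, pt in zip(y, preds))

def best_subsets(features, y, max_size):
    # Exhaustive search: for each subset size, fit every combination of
    # features and keep the one with the lowest RSS.
    best = []
    for size in range(max_size + 1):
        scored = [(combo, rss([features[f] for f in combo], y,
                              ols_fit([features[f] for f in combo], y)))
                  for combo in combinations(sorted(features), size)]
        best.append(min(scored, key=lambda t: t[1]))
    return best

# Two of the Hald features and the outcome, from the results below.
features = {
    "X1": [7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10],
    "X2": [26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68],
}
y = [78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5, 93.1,
     115.9, 83.8, 113.3, 109.4]
subsets = best_subsets(features, y, max_size=2)  # one winner per size 0..2
```

Because a larger subset nests every smaller one, the best RSS can only decrease as the subset size grows, which is exactly the pattern visible in the bestSubsetsDetails output below.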
Here are the results:
First, we see the status of the Rason model:
Getting model results: GET https://rason.net/api/model/2590+LinearWrapping+2020-01-20-01-35-50-761786/result
{"status": {
"id": "2590+LinearWrapping+2020-01-20-01-35-50-761786",
"code":0,
"codeText": "Success"
},
"results":["linearFSModel.bestSubsets", "linearFSModel.bestSubsetsDetails", "reducedTrainData.transformation"],
Results for linearFSModel begin here. Notice that the results are printed in column order. The first subset (Subset 1) contains
only the intercept term. The second subset (Subset 2) contains the intercept term plus one feature, X4. The third subset (Subset 3) contains
the intercept term plus two features, X1 and X2, and so on.
"linearFSModel": {
"bestSubsets": {
"objectType": "dataFrame",
"name": "Best Subsets",
"order": "col",
"rowNames": ["Subset 1", "Subset 2", "Subset 3", "Subset 4", "Subset 5", "Subset 6"],
"colNames": ["Intercept", "X1", "X2", "X3", "X4", "Weights"],
"colTypes": ["double", "double", "double", "double", "double", "double"],
"indexCols": null,
"data": [
[1, 1, 1, 1, 1, 1],
[0, 0, 1, 1, 1, 1],
[0, 0, 1, 1, 1, 1],
[0, 0, 0, 0, 1, 1],
[0, 1, 0, 1, 1, 1],
[0, 0, 0, 0, 0, 1]
]
},
Best subset result details (bestSubsetsDetails) for linearFSModel. Here are the calculated statistics for the six subsets:
the number of coefficients, RSS, Mallows's Cp, R2, Adjusted R2, and Probability.
"bestSubsetsDetails": {
"objectType": "dataFrame",
"name": "Best Subsets Details",
"order": "col",
"rowNames": ["Subset 1", "Subset 2", "Subset 3", "Subset 4", "Subset 5", "Subset 6"],
"colNames": ["#Coefficients", "RSS", "Mallows's Cp", "R2", "Adjusted R2", "Probability"],
"colTypes": ["integer", "double", "double", "double", "double", "double"],
"indexCols": null,
"data": [
[1, 2, 3, 4, 5, 6],
[2715.7630769230273, 883.86691689923498, 57.904483176071921, 47.972729400348463, 47.863639350457937, 47.800255339956486] ,
[386.70376545604654, 120.43588636278355, 1.479690732816751, 2.0252575726668702, 4.009282127686447, 6] ,
[2.5424107263916085e-14, 0.67454196413163481, 0.97867837453564743, 0.98233545120044441, 0.98237562040770021, 0.98239895970818625] ,
[2.5424107263916085e-14, 0.64495486996177454, 0.97441404944277676, 0.97644726826725459, 0.97356343061154282, 0.96982678807116807] ,
[5.4992942301671353e-06, 0.00015856276306674378, 0.69819260728365373, 0.98747306627232578, 0.9259478188926249, 0]
]
}
},
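The statistics columns above can be reproduced from the RSS values alone. A sketch of the arithmetic in Python, assuming (as is conventional for best-subsets output) that the total sum of squares is the intercept-only RSS and that the error variance estimate comes from the largest model:

```python
# RSS values and coefficient counts copied from the bestSubsetsDetails output above.
n = 13                               # records in hald-small
num_coeffs = [1, 2, 3, 4, 5, 6]
rss = [2715.7630769230273, 883.86691689923498, 57.904483176071921,
       47.972729400348463, 47.863639350457937, 47.800255339956486]

tss = rss[0]                         # intercept-only RSS = total sum of squares
s2 = rss[-1] / (n - num_coeffs[-1])  # error variance from the largest model

# Mallows's Cp = RSS_p / s^2 + 2p - n;  R2 = 1 - RSS/TSS;
# Adjusted R2 = 1 - (1 - R2)(n - 1)/(n - p).
mallows_cp = [r / s2 + 2 * p - n for r, p in zip(rss, num_coeffs)]
r2 = [1 - r / tss for r in rss]
adj_r2 = [1 - (1 - q) * (n - 1) / (n - p) for q, p in zip(r2, num_coeffs)]
```

These reproduce the Mallows's Cp, R2, and Adjusted R2 columns of the table; for the largest subset, Cp comes out to 6, i.e. equal to the number of coefficients, as it must when s² is taken from that model.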
Results for the reducedTrainData action begin here. The transformation returns the reduced dataset as a data frame.
"reducedtraindata": {
"transformation": {
"objectType": "dataFrame",
"name": "mytraindata : Reduced",
"order": "col",
"rowNames": ["Record 1", "Record 2", "Record 3", "Record 4", "Record 5", "Record 6", "Record 7", "Record 8", "Record 9",
"Record 10", "Record 11", "Record 12", "Record 13"],
"colNames": [ "X1", "X2", "Y"],
"colTypes": [ "double", "double", "double"],
"indexCols": null,
"data": [
[7,1,11,11,7,11,3,1,2,21,1,11,10],
[26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68],
[78.5, 74.299999999999997, 104.3, 87.599999999999994, 95.900000000000006, 109.2, 102.7, 72.5, 93.099999999999994,
115.90000000000001, 83.799999999999997, 113.3, 109.40000000000001]
]
}
}
}
The dataset has been reduced from 4 features and 1 output variable to 2 features and 1 output variable. The resulting object is a data frame.
          | X1 | X2 | Y
----------|----|----|-------
Record 1  |  7 | 26 |  78.5
Record 2  |  1 | 29 |  74.3
Record 3  | 11 | 56 | 104.3
Record 4  | 11 | 31 |  87.6
Record 5  |  7 | 52 |  95.9
Record 6  | 11 | 55 | 109.2
Record 7  |  3 | 71 | 102.7
Record 8  |  1 | 31 |  72.5
Record 9  |  2 | 54 |  93.1
Record 10 | 21 | 47 | 115.9
Record 11 |  1 | 40 |  83.8
Record 12 | 11 | 66 | 113.3
Record 13 | 10 | 68 | 109.4
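The choice of which columns to keep follows from the bestSubsets membership table: with numTopFeatures set to 2, the subset containing exactly two features (Subset 3: X1 and X2) wins. A sketch of that lookup in Python (top_features is a hypothetical helper; the matrix rows are transcribed from the bestSubsets output):

```python
# Feature membership per subset, transcribed from the bestSubsets data frame
# (Intercept and Weights columns omitted; 1 = feature included).
features = ["X1", "X2", "X3", "X4"]
subsets = [
    [0, 0, 0, 0],  # Subset 1: intercept only
    [0, 0, 0, 1],  # Subset 2: X4
    [1, 1, 0, 0],  # Subset 3: X1, X2
    [1, 1, 0, 1],  # Subset 4: X1, X2, X4
    [1, 1, 1, 1],  # Subset 5: all four features
    [1, 1, 1, 1],  # Subset 6: all four features plus Weights
]

def top_features(num_top):
    # Return the features of the first best subset with num_top features,
    # mirroring what the transform action keeps for numTopFeatures.
    for row in subsets:
        if sum(row) == num_top:
            return [f for f, flag in zip(features, row) if flag]
    return None
```

With num_top = 2 this selects X1 and X2, which is exactly why the transformed data frame above contains the X1, X2, and Y columns.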