Feature Selection Example
Next, we will look at an example of how to call Feature Selection in Rason.
Feature Selection gives users the ability to rank and select the most relevant variables for inclusion in a classification or prediction model.
In many cases the most accurate models, or the models with the lowest misclassification or residual errors, have benefited from better feature
selection, using a combination of human insight and automated methods. Rason Data Science provides a facility to compute all of the following
metrics, described in the literature, to give users information on which features should be included in, or excluded from, their models.
Only some of these metrics can be used in any given application, depending on the characteristics of the input variables (features) and the
type of problem. In a supervised setting, if we classify data science problems as follows:
ℝⁿ ⟶ ℝ : Real-valued features, prediction (regression) problem
ℝⁿ ⟶ {0,1} : Real-valued features, binary classification problem
ℝⁿ ⟶ {1..C} : Real-valued features, multi-class classification problem
{1..C}ⁿ ⟶ ℝ : Nominal categorical features, prediction (regression) problem
{1..C}ⁿ ⟶ {0,1} : Nominal categorical features, binary classification problem
{1..C}ⁿ ⟶ {1..C} : Nominal categorical features, multi-class classification problem
then we can describe the applicability of the Feature Selection metrics by the following table:
Metric      | R-R | R-{0,1} | R-{1..C} | {1..C}-R | {1..C}-{0,1} | {1..C}-{1..C}
------------|-----|---------|----------|----------|--------------|--------------
Pearson     |  N  |         |          |          |              |
Spearman    |  N  |         |          |          |              |
Kendall     |  N  |         |          |          |              |
Welch's     |  D  |    N    |          |          |              |
F-Test      |  D  |    N    |    N     |          |              |
Chi-squared |  D  |    D    |    D     |    D     |      N       |      N
Mutual Info |  D  |    D    |    D     |    D     |      N       |      N
Gain Ratio  |  D  |    D    |    D     |    D     |      N       |      N
Fisher      |  D  |    N    |    N     |          |              |
Gini        |  D  |    N    |    N     |          |              |
"N" means that metrics can be applied naturally, and āDā means that features and/or the outcome variable must be discretized
before applying the particular filter.
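To make the N vs. D distinction concrete, here is a minimal Python sketch (not Rason code; `pearson`, `discretize`, and `mutual_info` are hypothetical helper names) that computes one naturally real-valued filter, Pearson correlation, and one discrete filter, mutual information, which needs an equal-width discretization step first:

```python
import math
from collections import Counter

def pearson(x, y):
    # Applied "naturally" (N) to a real-valued feature and real-valued outcome.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def discretize(x, bins=3):
    # Equal-width binning: the discretization (D) step required before a
    # discrete filter can be applied to a real-valued variable.
    lo, hi = min(x), max(x)
    width = (hi - lo) / bins or 1.0
    return [min(int((v - lo) / width), bins - 1) for v in x]

def mutual_info(x, y):
    # Mutual information (in nats) between two discrete variables.
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

# A real-valued feature vs. a real-valued outcome: Pearson applies directly,
# mutual information only after discretizing both variables.
x = [7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10]
y = [78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5, 93.1,
     115.9, 83.8, 113.3, 109.4]
r = pearson(x, y)
mi = mutual_info(discretize(x), discretize(y))
```

The same pattern covers the other "D" cells in the table: discretize whichever side (features, outcome, or both) is real-valued, then apply the discrete filter.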
Here is an example of a Rason model calling Feature Selection using linear wrapping.
{
    modelName: 'LinearWrapping',
    modelType: 'datamining',
    modelDescription: 'feature selection - linear wrapping',
    datasources: {
        myTrainSrc: {
            type: 'csv', connection: 'hald-small.txt', direction: 'import'
        }
    },
    datasets: {
        myTrainData: {
            binding: 'myTrainSrc', targetCol: 'Y'
        }
    },
    estimator: {
        linearFSEstimator: {
            type: 'featureSelection', algorithm: 'linearWrapping',
            parameters: {
                fitIntercept: true,
                method: 'EXHAUSTIVE_SEARCH'
            }
        }
    },
    actions: {
        linearFSModel: {
            data: 'myTrainData', estimator: 'linearFSEstimator', action: 'fit',
            evaluations: [
                'bestSubsets',
                'bestSubsetsDetails'
            ]
        },
        reducedTrainData: {
            data: 'myTrainData', fittedmodel: 'linearFSModel',
            parameters: {
                numTopFeatures: 2
            },
            action: 'transform',
            evaluations: [
                'transformation'
            ]
        }
    }
}
The "datasources" section in this example creates one datasource, "myTrainSrc" which imports the hald-small dataset within the hald-small.txt file.
Inside of "datasets", the datasource "myTrainData" is bound to the "myTrainSrc" dataset. An output column is specified as the "Y" column
(targetCol: 'Y'). Note: Input files in a Data Science Rason model must not contain a path to a file location.
A new estimator, linearFSEstimator, is created within "estimator". This estimator is of type "featureSelection", uses the algorithm "linearWrapping",
and sets the parameters "fitIntercept" to true and "method" to 'EXHAUSTIVE_SEARCH'. For a complete list of all parameters associated with Feature
Selection, see the Rason Reference Guide.
Two new action items, "linearFSModel" and "reducedTrainData", are listed within "actions". The "linearFSModel" action item fits the model,
while "reducedTrainData" uses the resulting model to transform the dataset.
The "linearFSModel" action item uses the estimator, linearFSEstimator, to "fit" a feature selection model to the "myTrainData" dataset and
return the best subsets and their details in the results.
The "reducedTrainData" action item uses the fitted "linearFSModel" model to perform feature selection to reduce (transform) the "myTrainData"
dataset from 5 features (not including the output variable, Y) to 2 (three including the output variable, Y).
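The internals of linearWrapping with EXHAUSTIVE_SEARCH are not documented here, but the idea can be sketched in Python: fit an ordinary least squares model on every feature subset of each size and keep the subset with the lowest residual sum of squares (all function names are hypothetical; the X1, X2, and Y columns are taken from the reduced dataset shown in the results):

```python
from itertools import combinations

def ols_fit(x_cols, y):
    # Solve the normal equations (with an intercept column) by Gaussian
    # elimination; fine for the handful of columns used here.
    n = len(y)
    cols = [[1.0] * n] + [list(c) for c in x_cols]
    k = len(cols)
    a = [[sum(cols[i][t] * cols[j][t] for t in range(n)) for j in range(k)]
         for i in range(k)]
    b = [sum(cols[i][t] * y[t] for t in range(n)) for i in range(k)]
    for p in range(k):
        piv = max(range(p, k), key=lambda r: abs(a[r][p]))
        a[p], a[piv] = a[piv], a[p]
        b[p], b[piv] = b[piv], b[p]
        for r in range(p + 1, k):
            f = a[r][p] / a[p][p]
            for c in range(p, k):
                a[r][c] -= f * a[p][c]
            b[r] -= f * b[p]
    beta = [0.0] * k
    for p in range(k - 1, -1, -1):
        beta[p] = (b[p] - sum(a[p][c] * beta[c]
                              for c in range(p + 1, k))) / a[p][p]
    return beta

def rss(x_cols, y, beta):
    # Residual sum of squares of the fitted model.
    preds = [beta[0] + sum(beta[j + 1] * x_cols[j][t]
                           for j in range(len(x_cols)))
             for t in range(len(y))]
    return sum((yt - pt) ** 2 for yt, pt in zip(y, preds))

def best_subsets(features, y, max_size):
    # Exhaustive search: for each subset size, fit every combination of
    # features and keep the one with the lowest RSS.
    best = []
    for size in range(max_size + 1):
        scored = [(combo, rss([features[f] for f in combo], y,
                              ols_fit([features[f] for f in combo], y)))
                  for combo in combinations(sorted(features), size)]
        best.append(min(scored, key=lambda t: t[1]))
    return best

# Two of the Hald features and the outcome, from the results below.
features = {
    "X1": [7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10],
    "X2": [26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68],
}
y = [78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5, 93.1,
     115.9, 83.8, 113.3, 109.4]
subsets = best_subsets(features, y, max_size=2)  # one winner per size 0..2
```

Because a larger subset nests every smaller one, the best RSS can only decrease as the subset size grows, which is exactly the pattern visible in the bestSubsetsDetails output below.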
Here are the results:
First, we see the status of the Rason model:
Getting model results: GET https://rason.net/api/model/2590+LinearWrapping+2020-01-20-01-35-50-761786/result
{"status": {
"id": "2590+LinearWrapping+2020-01-20-01-35-50-761786",
"code":0,
"codeText": "Success"
},
"results":["linearFSModel.bestSubsets", "linearFSModel.bestSubsetsDetails", "reducedTrainData.transformation"],
Results for linearFSModel begin here. Notice that the results are printed in column order. The first subset (Subset 1) contains
only the intercept term. The second subset (Subset 2) contains the intercept term plus one feature, X4. The third subset (Subset 3) contains
the intercept term plus two features, X1 and X2, and so on.
"linearFSModel": {
"bestSubsets": {
"objectType": "dataFrame",
"name": "Best Subsets",
"order": "col",
"rowNames": ["Subset 1", "Subset 2", "Subset 3", "Subset 4", "Subset 5", "Subset 6"],
"colNames": ["Intercept", "X1", "X2", "X3", "X4", "Weights"],
"colTypes": ["double", "double", "double", "double", "double", "double"],
"indexCols": null,
"data": [
[1, 1, 1, 1, 1, 1],
[0, 0, 1, 1, 1, 1],
[0, 0, 1, 1, 1, 1],
[0, 0, 0, 0, 1, 1],
[0, 1, 0, 1, 1, 1],
[0, 0, 0, 0, 0, 1]
]
},
Best subset result details (bestSubsetsDetails) for linearFSModel. Here are the calculated statistics for the six subsets:
the number of coefficients, RSS, Mallows's Cp, R2, Adjusted R2, and Probability.
"bestSubsetsDetails": {
"objectType": "dataFrame",
"name": "Best Subsets Details",
"order": "col",
"rowNames": ["Subset 1", "Subset 2", "Subset 3", "Subset 4", "Subset 5", "Subset 6"],
"colNames": ["#Coefficients", "RSS", "Mallows's Cp", "R2", "Adjusted R2", "Probability"],
"colTypes": ["integer", "double", "double", "double", "double", "double"],
"indexCols": null,
"data": [
[1, 2, 3, 4, 5, 6],
[2715.7630769230273, 883.86691689923498, 57.904483176071921, 47.972729400348463, 47.863639350457937, 47.800255339956486] ,
[386.70376545604654, 120.43588636278355, 1.479690732816751, 2.0252575726668702, 4.009282127686447, 6] ,
[2.5424107263916085e-14, 0.67454196413163481, 0.97867837453564743, 0.98233545120044441, 0.98237562040770021, 0.98239895970818625] ,
[2.5424107263916085e-14, 0.64495486996177454, 0.97441404944277676, 0.97644726826725459, 0.97356343061154282, 0.96982678807116807] ,
[5.4992942301671353e-06, 0.00015856276306674378, 0.69819260728365373, 0.98747306627232578, 0.9259478188926249, 0]
]
}
},
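The statistics columns above can be reproduced from the RSS values alone. A sketch of the arithmetic in Python, assuming (as is conventional for best-subsets output) that the total sum of squares is the intercept-only RSS and that the error variance estimate comes from the largest model:

```python
# RSS values and coefficient counts copied from the bestSubsetsDetails output above.
n = 13                               # records in hald-small
num_coeffs = [1, 2, 3, 4, 5, 6]
rss = [2715.7630769230273, 883.86691689923498, 57.904483176071921,
       47.972729400348463, 47.863639350457937, 47.800255339956486]

tss = rss[0]                         # intercept-only RSS = total sum of squares
s2 = rss[-1] / (n - num_coeffs[-1])  # error variance from the largest model

# Mallows's Cp = RSS_p / s^2 + 2p - n;  R2 = 1 - RSS/TSS;
# Adjusted R2 = 1 - (1 - R2)(n - 1)/(n - p).
mallows_cp = [r / s2 + 2 * p - n for r, p in zip(rss, num_coeffs)]
r2 = [1 - r / tss for r in rss]
adj_r2 = [1 - (1 - q) * (n - 1) / (n - p) for q, p in zip(r2, num_coeffs)]
```

These reproduce the Mallows's Cp, R2, and Adjusted R2 columns of the table; for the largest subset, Cp comes out to 6, i.e. equal to the number of coefficients, as it must when s² is taken from that model.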
Results for the reducedTrainData action begin here. The transformation returns the reduced dataset as a data frame.
"reducedtraindata": {
"transformation": {
"objectType": "dataFrame",
"name": "mytraindata : Reduced",
"order": "col",
"rowNames": ["Record 1", "Record 2", "Record 3", "Record 4", "Record 5", "Record 6", "Record 7", "Record 8", "Record 9",
"Record 10", "Record 11", "Record 12", "Record 13"],
"colNames": [ "X1", "X2", "Y"],
"colTypes": [ "double", "double", "double"],
"indexCols": null,
"data": [
[7,1,11,11,7,11,3,1,2,21,1,11,10],
[26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68],
[78.5, 74.299999999999997, 104.3, 87.599999999999994, 95.900000000000006, 109.2, 102.7, 72.5, 93.099999999999994,
115.90000000000001, 83.799999999999997, 113.3, 109.40000000000001]
]
}
}
}
The dataset has been reduced from 4 features and 1 output variable to 2 features and 1 output variable. The resulting object is a data frame.
          | X1 | X2 | Y
----------|----|----|-------
Record 1  |  7 | 26 |  78.5
Record 2  |  1 | 29 |  74.3
Record 3  | 11 | 56 | 104.3
Record 4  | 11 | 31 |  87.6
Record 5  |  7 | 52 |  95.9
Record 6  | 11 | 55 | 109.2
Record 7  |  3 | 71 | 102.7
Record 8  |  1 | 31 |  72.5
Record 9  |  2 | 54 |  93.1
Record 10 | 21 | 47 | 115.9
Record 11 |  1 | 40 |  83.8
Record 12 | 11 | 66 | 113.3
Record 13 | 10 | 68 | 109.4
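The choice of which columns to keep follows from the bestSubsets membership table: with numTopFeatures set to 2, the subset containing exactly two features (Subset 3: X1 and X2) wins. A sketch of that lookup in Python (top_features is a hypothetical helper; the matrix rows are transcribed from the bestSubsets output):

```python
# Feature membership per subset, transcribed from the bestSubsets data frame
# (Intercept and Weights columns omitted; 1 = feature included).
features = ["X1", "X2", "X3", "X4"]
subsets = [
    [0, 0, 0, 0],  # Subset 1: intercept only
    [0, 0, 0, 1],  # Subset 2: X4
    [1, 1, 0, 0],  # Subset 3: X1, X2
    [1, 1, 0, 1],  # Subset 4: X1, X2, X4
    [1, 1, 1, 1],  # Subset 5: all four features
    [1, 1, 1, 1],  # Subset 6: all four features plus Weights
]

def top_features(num_top):
    # Return the features of the first best subset with num_top features,
    # mirroring what the transform action keeps for numTopFeatures.
    for row in subsets:
        if sum(row) == num_top:
            return [f for f, flag in zip(features, row) if flag]
    return None
```

With num_top = 2 this selects X1 and X2, which is exactly why the transformed data frame above contains the X1, X2, and Y columns.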