Feature Selection Example

Next, we will look at an example of how to call Feature Selection in RASON.

Feature Selection gives users the ability to rank and select the most relevant variables for inclusion in a classification or prediction model. In many cases the most accurate models, i.e. the models with the lowest misclassification or residual errors, have benefited from better feature selection, using a combination of human insight and automated methods. RASON Data Science provides a facility to compute all of the following metrics, described in the literature, to help users decide which features should be included in, or excluded from, their models.

  • Correlation-based
      • Pearson product-moment correlation
      • Spearman rank correlation
      • Kendall concordance
  • Statistical/probabilistic independence metrics
      • Welch's statistic
      • F statistic
      • Chi-square statistic
  • Information-theoretic metrics
      • Mutual Information (Information Gain)
      • Gain Ratio
  • Other
      • Cramer's V
      • Fisher Score
      • Gini Index
Only some of these metrics can be used in any given application, depending on the characteristics of the input variables (features) and the type of problem. In a supervised setting, if we classify data science problems as follows:

  • ℝⁿ ⟶ ℝ : Real-valued features, prediction (regression) problem
  • ℝⁿ ⟶ {0,1} : Real-valued features, binary classification problem
  • ℝⁿ ⟶ {1..C} : Real-valued features, multi-class classification problem
  • {1..C}ⁿ ⟶ ℝ : Nominal categorical features, prediction (regression) problem
  • {1..C}ⁿ ⟶ {0,1} : Nominal categorical features, binary classification problem
  • {1..C}ⁿ ⟶ {1..C} : Nominal categorical features, multi-class classification problem
then we can describe the applicability of the Feature Selection metrics by the following table:

                   R-R   R-{0,1}   R-{1..C}   {1..C}-R   {1..C}-{0,1}   {1..C}-{1..C}
    Pearson         N       -          -          -            -               -
    Spearman        N       -          -          -            -               -
    Kendall         N       -          -          -            -               -
    Welch's         D       N          -          -            -               -
    F-Test          D       N          N          -            -               -
    Chi-squared     D       D          D          D            N               N
    Mutual Info     D       D          D          D            N               N
    Gain Ratio      D       D          D          D            N               N
    Fisher          D       N          N          -            -               -
    Gini            D       N          N          -            -               -

    "N" means that the metric can be applied naturally, "D" means that the features and/or the outcome variable must be discretized before applying the particular filter, and "-" means that the metric is not applicable.

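    The computations behind several of these metrics are standard. As a point of reference, here is a minimal sketch in plain Python (not RASON) that scores one real-valued feature against a real-valued outcome; the toy data and the quartile binning used to discretize for the chi-square test are illustrative assumptions.

    import numpy as np
    from scipy import stats
    from sklearn.feature_selection import mutual_info_regression

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)                # a real-valued feature
    y = 2.0 * x + rng.normal(size=200)      # a real-valued outcome (R -> R)

    print("Pearson :", stats.pearsonr(x, y)[0])    # linear association
    print("Spearman:", stats.spearmanr(x, y)[0])   # monotone (rank) association
    print("Kendall :", stats.kendalltau(x, y)[0])  # rank concordance

    # Mutual information for a real feature/outcome pair, estimated via
    # nearest neighbors in scikit-learn.
    print("MI      :", mutual_info_regression(x.reshape(-1, 1), y)[0])

    # The chi-square statistic needs categorical data: discretize both
    # variables first (the "D" case in the table above), then test
    # independence on the resulting contingency table.
    xb = np.digitize(x, np.quantile(x, [0.25, 0.5, 0.75]))
    yb = np.digitize(y, np.quantile(y, [0.25, 0.5, 0.75]))
    counts = np.zeros((4, 4))
    for i, j in zip(xb, yb):
        counts[i, j] += 1
    chi2, pval, dof, _ = stats.chi2_contingency(counts)
    print("Chi2    :", chi2, "p =", pval)
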
    Here is an example of a RASON model calling Feature Selection using the Linear Wrapping algorithm.

    
    {
      modelName: 'LinearWrapping',
      modelType: 'datamining',
      modelDescription: 'feature selection - linear wrapping',
      datasources: {
        myTrainSrc: {
          type: 'csv', connection: 'hald-small.txt', direction: 'import'
        }
      },
      datasets: {
        myTrainData: {
          binding: 'myTrainSrc', targetCol: 'Y'
        }
      },
      estimator: {
        linearFSEstimator: {
          type: 'featureSelection', algorithm: 'linearWrapping',
          parameters: {
            fitIntercept: true,
            method: 'EXHAUSTIVE_SEARCH'
          }
        }
      },
      actions: {
        linearFSModel: {
          data: 'myTrainData', estimator: 'linearFSEstimator', action: 'fit',
          evaluations: [
            'bestSubsets',
            'bestSubsetsDetails'
          ]
        },
        reducedTrainData: {
          data: 'myTrainData', fittedmodel: 'linearFSModel',
          parameters: {
            numTopFeatures: 2
          },
          action: 'transform',
          evaluations: [
            'transformation'
          ]
        }
      }
    }
    

    The "datasources" section in this example creates one datasource, "myTrainSrc", which imports the hald-small dataset from the hald-small.txt file. Inside "datasets", the dataset "myTrainData" is bound to the "myTrainSrc" datasource, and the output column is specified as the "Y" column (targetCol: 'Y'). Note: input files in a Data Science RASON model must not contain a path to a file location.
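
    In Python terms, the datasource/dataset pair amounts to reading the delimited file and designating the output column. A rough analogy with pandas (assuming hald-small.txt is a comma-delimited file with a header row) would be:

    import pandas as pd

    frame = pd.read_csv("hald-small.txt")    # the datasource: import from CSV
    y = frame["Y"]                           # targetCol: the output column
    X = frame.drop(columns=["Y"])            # the remaining columns are features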

    A new estimator, "linearFSEstimator", is created within "estimator". This estimator is of type "featureSelection", uses the algorithm "linearWrapping", and sets the parameter "fitIntercept" to true and "method" to "EXHAUSTIVE_SEARCH". For a complete list of all parameters associated with Feature Selection, see the RASON Reference Guide.

    Two new action items, "linearFSModel" and "reducedTrainData", are listed within "actions". The "linearFSModel" action item fits the model, while "reducedTrainData" uses the resulting model to transform the dataset.
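
    This fit-then-transform pattern mirrors the one used by scikit-learn's feature selectors; a rough Python analogy on made-up data (not the RASON engine) looks like this:

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_regression

    rng = np.random.default_rng(1)
    X = rng.normal(size=(13, 5))                     # 13 records, 5 features
    y = X[:, 0] + 2 * X[:, 1] + rng.normal(size=13)  # the outcome variable

    selector = SelectKBest(f_regression, k=2).fit(X, y)  # the "fit" action
    X_reduced = selector.transform(X)                    # the "transform" action
    print(X_reduced.shape)                               # (13, 2): two features kept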

    The "linearFSModel" action item uses the estimator, linearFSEstimator, to "fit" a feature selection model to the "myTrainData" dataset and return the best subsets and their details in the results.

    The "reducedTrainData" action item uses the fitted "linearFSModel" model to perform feature selection, reducing (transforming) the "myTrainData" dataset from 5 features (not including the output variable, Y) to the 2 best features (3 columns including the output variable, Y).
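
    To make the Linear Wrapping computation concrete, here is a minimal Python sketch of exhaustive-search best subsets: fit ordinary least squares on every feature subset and keep, for each subset size, the subset with the lowest residual sum of squares (RSS). This is an illustration of the technique, not the RASON engine's implementation.

    from itertools import combinations
    import numpy as np

    def best_subsets(X, y):
        """For each subset size, return the lowest-RSS subset with its Cp."""
        n, k = X.shape
        ones = np.ones((n, 1))
        # Mallows's Cp needs a noise-variance estimate from the full model.
        full = np.hstack([ones, X])
        beta_full, *_ = np.linalg.lstsq(full, y, rcond=None)
        sigma2 = np.sum((y - full @ beta_full) ** 2) / (n - k - 1)
        best = {}
        for size in range(k + 1):
            for subset in combinations(range(k), size):
                A = np.hstack([ones, X[:, list(subset)]])  # design matrix
                beta, *_ = np.linalg.lstsq(A, y, rcond=None)
                rss = np.sum((y - A @ beta) ** 2)
                p = size + 1                     # coefficients, intercept included
                cp = rss / sigma2 - n + 2 * p
                if size not in best or rss < best[size][0]:
                    best[size] = (rss, cp, subset)
        return best

    Note that exhaustive search fits all 2^k subsets, which is practical only for modest feature counts. Cp reduces to the coefficient count for the full model, which is why Subset 6 in the results below reports Cp = 6.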

    Here are the results:

    First, we see the status of the RASON model:

    
    
      Getting model results: GET https://rason.net/api/model/2590+LinearWrapping+2020-01-20-01-35-50-761786/result 
      {"status": {
        "id": "2590+LinearWrapping+2020-01-20-01-35-50-761786",
        "code":0,
        "codeText": "Success"
      },
      "results":["linearFSModel.bestSubsets", "linearFSModel.bestSubsetsDetails", "reducedTrainData.transformation"],
    
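
    Retrieving results is an ordinary HTTP GET against the endpoint shown above. Here is a hedged sketch using Python's requests library; the bearer-token header is an assumption, so consult the "Using the REST API" chapter for the authentication scheme your account uses.

    import requests

    MODEL_ID = "2590+LinearWrapping+2020-01-20-01-35-50-761786"
    resp = requests.get(
        f"https://rason.net/api/model/{MODEL_ID}/result",
        headers={"Authorization": "Bearer <your API token>"},  # assumed auth scheme
    )
    resp.raise_for_status()                 # raise if the call failed
    body = resp.json()
    print(body["status"]["codeText"])       # "Success" once the run has finished
    print(body["results"])                  # names of the available evaluations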

    Results for linearFSModel begin here. Notice that the results are printed in column order. The first subset (Subset 1) contains only the intercept term. The second subset (Subset 2) contains the intercept term plus 1 feature, X4. The third subset (Subset 3) contains the intercept term plus 2 features, X1 and X2, and so on.


      "linearFSModel": {
        "bestSubsets": {
          "objectType": "dataFrame",
          "name": "Best Subsets",
          "order": "col",
          "rowNames": ["Subset 1", "Subset 2", "Subset 3", "Subset 4", "Subset 5", "Subset 6"],
          "colNames": ["Intercept", "X1", "X2", "X3", "X4", "Weights"],
          "colTypes": ["double", "double", "double", "double", "double", "double"],
          "indexCols": null,
          "data": [
            [1, 1, 1, 1, 1, 1],
            [0, 0, 1, 1, 1, 1],
            [0, 0, 1, 1, 1, 1],
            [0, 0, 0, 0, 1, 1],
            [0, 1, 0, 1, 1, 1],
            [0, 0, 0, 0, 0, 1]
          ]
        },
    

    Best subset result details (bestSubsetsDetails) for linearFSModel. Here are the calculated statistics for the six subsets. The statistics returned are: the number of coefficients (#Coefficients), RSS, Mallows's Cp, R2, Adjusted R2 and Probability.
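
    For reference, with n records and p coefficients in a subset (intercept included), these statistics follow the standard definitions, where the residual variance estimate s2 = RSS_full / (n - p_full) comes from the full model:

        Mallows's Cp = RSS_p / s2 - n + 2p
        Adjusted R2  = 1 - (1 - R2)(n - 1)/(n - p)

    You can check these against the output below: the full model (Subset 6, p = 6) has Cp = 6 exactly, and Subset 2 (n = 13, p = 2, R2 = 0.6745) gives Adjusted R2 = 0.6450.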

    

      "bestSubsetsDetails": {
        "objectType": "dataFrame",
        "name": "Best Subsets Details",
        "order": "col",
        "rowNames": ["Subset 1", "Subset 2", "Subset 3", "Subset 4", "Subset 5", "Subset 6"],
        "colNames": ["#Coefficients", "RSS", "Mallows's Cp", "R2", "Adjusted R2", "Probability"],
        "colTypes": ["integer", "double", "double", "double", "double", "double"],
        "indexCols": null,
        "data": [
          [1, 2, 3, 4, 5, 6],
          [2715.7630769230273, 883.86691689923498, 57.904483176071921, 47.972729400348463, 47.863639350457937, 47.800255339956486],
          [386.70376545604654, 120.43588636278355, 1.479690732816751, 2.0252575726668702, 4.009282127686447, 6],
          [2.5424107263916085e-14, 0.67454196413163481, 0.97867837453564743, 0.98233545120044441, 0.98237562040770021, 0.98239895970818625],
          [2.5424107263916085e-14, 0.64495486996177454, 0.97441404944277676, 0.97644726826725459, 0.97356343061154282, 0.96982678807116807],
          [5.4992942301671353e-06, 0.00015856276306674378, 0.69819260728365373, 0.98747306627232578, 0.9259478188926249, 0]
        ]
      }
    },
    

    Results for reducedTrainData begin here. Again, notice that the results are printed in column order.

    

      "reducedtraindata": {
        "transformation": {
          "objectType": "dataFrame",
          "name": "mytraindata : Reduced",
          "order": "col",
          "rowNames": ["Record 1", "Record 2", "Record 3", "Record 4", "Record 5", "Record 6", "Record 7", "Record 8", "Record 9",
          "Record 10", "Record 11", "Record 12", "Record 13"],
          "colNames": ["X1", "X2", "Y"],
          "colTypes": ["double", "double", "double"],
          "indexCols": null,
          "data": [
            [7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10],
            [26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68],
            [78.5, 74.299999999999997, 104.3, 87.599999999999994, 95.900000000000006, 109.2, 102.7, 72.5, 93.099999999999994,
            115.90000000000001, 83.799999999999997, 113.3, 109.40000000000001]
          ]
        }
      }
    }
    

    The dataset has been reduced from 5 features and 1 output variable to 2 features and 1 output variable. The resulting object is a data frame:

                X1   X2       Y
    Record 1     7   26    78.5
    Record 2     1   29    74.3
    Record 3    11   56   104.3
    Record 4    11   31    87.6
    Record 5     7   52    95.9
    Record 6    11   55   109.2
    Record 7     3   71   102.7
    Record 8     1   31    72.5
    Record 9     2   54    93.1
    Record 10   21   47   115.9
    Record 11    1   40    83.8
    Record 12   11   66   113.3
    Record 13   10   68   109.4