This project implements the Monotone Optimal Binning (MOB) algorithm in SAS 9.4. We extend the algorithm so that it can be applied to both numerical and categorical data. To avoid creating too many bins, we optimize the p-value iteratively and provide three binning methods, size-first binning (SFB), monotonic-first binning (MFB), and chi-merge binning (CMB), so that users can discretize data more conveniently.
git clone https://github.com/cdfq384903/MonotonicOptimalBinning.git
- Upload the source code following the folder structure shown below.
Note: we made some modifications to the dataset `german_data_credit_cat.csv`. Details are shown below:
- Rename all columns
- Change the values of the column `Cost Matrix(Risk)`:

| Types of Credit Risk | Original value | Revised value |
|---|---|---|
| Good Risk | 1 | 0 |
| Bad Risk | 2 | 1 |
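
The recoding in the table above can be sketched with a DATA step. This is only an illustration, not the project's own preprocessing code: the tables `german_raw_demo` and `german_recode_demo` and the column `risk_raw` are hypothetical names, while `CostMatrixRisk` follows the renamed column used later in this README.

```sas
/* Hypothetical input: a few rows carrying the original 1/2 coding. */
data german_raw_demo;
    input risk_raw;
    datalines;
1
2
1
;
run;

/* Recode Good Risk 1 -> 0 and Bad Risk 2 -> 1. */
data german_recode_demo;
    set german_raw_demo;
    if risk_raw = 1 then CostMatrixRisk = 0;
    else if risk_raw = 2 then CostMatrixRisk = 1;
    else CostMatrixRisk = .;   /* unexpected codes become missing */
    drop risk_raw;
run;
```

With this coding, the mean of `CostMatrixRisk` within a bin is that bin's bad rate.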
Initialize parameters:
```sas
%let data_table = german_credit_card;
%let y = CostMatrixRisk;
%let x = AgeInYears CreditAmount DurationInMonth;
%let exclude_condi = < -99999999;
%let init_sign = auto ;
%let min_samples = %sysevalf(1000 * 0.05);
%let min_bads = 10;
%let min_pvalue = 0.35;
%let show_woe_plot = 1;
%let lib_name = TMPWOE;
%let is_using_encoding_var = 1;
```
Run the `MainSizeFirstBining.sas` script:
```sas
%let min_bins = 3;
%let max_samples = %sysevalf(1000 * 0.4);
PROC DATASETS lib = TMPWOE kill ; QUIT ;RUN ;
%init(data_table = &data_table., y = &y., x = &x., exclude_condi = &exclude_condi., init_sign = &init_sign.,
min_samples = &min_samples., min_bads = &min_bads., min_pvalue = &min_pvalue.,
show_woe_plot = &show_woe_plot.,
is_using_encoding_var = &is_using_encoding_var., lib_name = &lib_name.);
%initSizeFirstBining(max_samples = &max_samples., min_bins = &min_bins., max_bins = 7);
%runMob();
```
SFB RESULT OUTPUT - DurationInMonth:

Note: The image above shows the WoE transformation result of the variable `DurationInMonth` after applying the SFB algorithm. It clearly shows that the WoE values are monotonic.

SFB RESULT OUTPUT - CreditAmount:

Note: The image above shows the WoE transformation result of the variable `CreditAmount` after applying the SFB algorithm. It violates WoE monotonicity because the SFB algorithm gives priority to satisfying the bin-related constraints.
Run the `MainMonotonicFirstBining.sas` script:
```sas
PROC DATASETS lib = TMPWOE kill ; QUIT ;RUN ;
%init(data_table = &data_table., y = &y., x = &x., exclude_condi = &exclude_condi., init_sign = &init_sign.,
min_samples = &min_samples., min_bads = &min_bads., min_pvalue = &min_pvalue.,
show_woe_plot = &show_woe_plot.,
is_using_encoding_var = &is_using_encoding_var., lib_name = &lib_name.);
%initMonotonicFirstBining();
%runMob();
```
MFB RESULT OUTPUT - DurationInMonth:

Note: The image above shows the WoE transformation result of the variable `DurationInMonth` after applying the MFB algorithm. The WoE values are monotonic.

MFB RESULT OUTPUT - CreditAmount:

Note: The image above shows the WoE transformation result of the variable `CreditAmount` after applying the MFB algorithm. The WoE values are monotonic, but this approach can lead to issues such as an excessive share of samples in one bin, too few bins, or insufficient bin size.
Initialize parameters:
```sas
%let data_table = german_credit_card;
%let y = CostMatrixRisk;
%let x = Purpose;
%let max_bins_threshold = 30 ;
%let min_bins = 4 ;
%let max_bins = 6 ;
%let min_samples = 0.05 ;
%let max_samples = 0.4 ;
%let p_value_threshold = 0.35 ;
%let libName = TMPWOE ;
```
Chi Merge Binning (CMB) is an automatic binning algorithm that uses the chi-squared test as its merging criterion. It is subject to the same constraints as SFB and MFB on the number of bins, bin size, sample size, etc. Currently, CMB cannot handle ordinal categorical variables.
Run the `MainChiMerge.sas` script:
```sas
%runChiMerge( dataFrame = german_credit_card, x = &x., y = &y.,
max_bins_threshold = &max_bins_threshold.,
min_bins = &min_bins., max_bins = &max_bins.,
min_samples = &min_samples., max_samples = &max_samples.,
p_value_threshold = &p_value_threshold.,
libName = &libName.) ;
```
CMB OUTPUT RESULT:

The result of CMB is shown above. The CMB algorithm merges the 10 attributes of the categorical variable `Purpose` in `german_credit_card` into 6 groups.
MFB Algorithm macro example:
```sas
%init(data_table, y, x, exclude_condi, min_samples, min_bads, min_pvalue,
show_woe_plot, is_using_encoding_var, lib_name);
%initMonotonicFirstBining();
%runMob();
```
SFB Algorithm macro example:
```sas
%init(data_table, y, x, exclude_condi, min_samples, min_bads, min_pvalue,
show_woe_plot, is_using_encoding_var, lib_name);
%initSizeFirstBining(max_samples, min_bins, max_bins);
%runMob();
```
- `data_table`
  - Default: None
  - Suggestion: A training data set.
  - The `data_table` argument defines the input data set. The data set must include all independent variables and the target (response) variable. For example, in the `MainMonotonicFirstBining.sas` script you can pass `german_credit_card`, a table created by the `%readCsvFile()` macro.
- `y`
  - Default: None
  - Suggestion: The column name of the response variable.
  - The `y` argument defines the column name of the response variable. For example, in the `MainMonotonicFirstBining.sas` script you can pass `CostMatrixRisk`, which exists in the data set `german_credit_card`.
- `x`
  - Default: None
  - Suggestion: The column names of the variables on which to run the algorithm.
  - The `x` argument defines the column names of the chosen variables. Multiple columns can be passed at once. For example, in the `MainMonotonicFirstBining.sas` script you can pass `AgeInYears CreditAmount DurationInMonth`, all of which exist in the data set `german_credit_card`.
- `exclude_condi`
  - Default: None
  - Suggestion: The condition used to exclude observations from the variables.
  - The `exclude_condi` argument defines the condition for excluding observations: any observation whose variable value meets the condition is excluded. For example, in the `MainMonotonicFirstBining.sas` script you can pass `< -99999999`, which means the algorithm excludes observations whose value of the variable is less than -99999999.
- `init_sign`
  - Default: None
  - Suggestion: Setting `init_sign` to `auto` automatically computes the Pearson correlation to determine the direction of the relationship between the `x` and `y` variables. If the Pearson correlation is greater than 0, the program treats the relationship as positive, meaning that the greater `x` is, the higher the default rate (the mean of `y`) is.
- `min_samples`
  - Default: None
  - Suggestion: The minimum number of samples kept in each bin. `min_samples` is usually suggested to be 5% of the total population.
  - The `min_samples` argument defines the minimum number of samples kept in each bin. For example, in the `MainMonotonicFirstBining.sas` script you can pass `%sysevalf(1000 * 0.05)`, which constrains each bin to hold at least 5% of the total samples (1000 observations).
- `min_bads`
  - Default: None
  - Suggestion: The minimum number of positive events (default/bad in risk analysis) kept in each bin. `min_bads` is usually suggested to be 1.
  - The `min_bads` argument defines the minimum number of positive events kept in each bin. For example, in the `MainMonotonicFirstBining.sas` script you can pass 10, which means each bin must contain at least 10 positive events.
- `min_pvalue`
  - Default: None
  - Suggestion: The minimum p-value threshold the algorithm uses to decide whether to merge two bins. A higher `min_pvalue` generally results in fewer bin merges.
  - The `min_pvalue` argument defines the minimum p-value threshold. For example, in the `MainMonotonicFirstBining.sas` script you can pass 0.35, which means the algorithm merges two bins if the p-value of the statistical test (Z-test) conducted between them is greater than 0.35. If no pair of bins has a p-value greater than the given value and the number of bins is still greater than `max_bins`, the value is decreased iteratively.
- `show_woe_plot`
  - Default: None
  - Suggestion: Boolean (0, 1): whether to show the WoE plot while the MOB algorithm is running.
  - The `show_woe_plot` argument defines whether the WoE plot is shown during the algorithm process. For example, in the `MainMonotonicFirstBining.sas` script you can pass 1, which means SAS shows the WoE plot for each given `x`.
- `is_using_encoding_var`
  - Default: None
  - Suggestion: Boolean (0, 1): whether to use the encoding variable table. If your column names (`x` or `y`) are too long for SAS macros, we suggest enabling this parameter.
  - The `is_using_encoding_var` argument defines whether the encoding variable table is used. For example, in the `MainMonotonicFirstBining.sas` script you can pass 1, which means the column names of the data are replaced with encoded variable names.
- `lib_name`
  - Default: None
  - Suggestion: The library name in which to store the output tables. If you have no preference, pass `work`, the temporary library in SAS.
  - The `lib_name` argument defines the output library for the tables created by the algorithm. For example, in the `MainMonotonicFirstBining.sas` script you can pass `TMPWOE`, which is assigned by `LIBNAME TMPWOE "/home/u60021675/output";` and points to the given directory.
- `max_samples`
  - Default: None
  - Suggestion: Only used in the `%initSizeFirstBining()` macro. The maximum number of samples kept in each bin. `max_samples` is usually suggested to be 40% of the population to avoid a serious concentration issue in WoE binning.
  - The `max_samples` argument defines the maximum number of samples kept in each bin. For example, in the `MainSizeFirstBining.sas` script you can pass `%sysevalf(1000 * 0.4)`, which limits each bin to at most 40% of the population.
- `min_bins`
  - Default: None
  - Suggestion: Only used in the `%initSizeFirstBining()` macro. The minimum number of bins kept in the final WoE summary output for each given `x`.
  - The `min_bins` argument defines the minimum number of bins kept in the final WoE summary output for each given `x`. For example, in the `MainSizeFirstBining.sas` script you can pass `3`, which means the algorithm creates at least 3 bins for each given `x`.
- `max_bins`
  - Default: None
  - Suggestion: Only used in the `%initSizeFirstBining()` macro. The maximum number of bins kept in the final WoE summary output for each given `x`. Note that `max_bins` must be higher than `min_bins`.
  - The `max_bins` argument defines the maximum number of bins kept in the final WoE summary output for each given `x`. For example, in the `MainSizeFirstBining.sas` script you can pass `7`, which means the algorithm creates at most 7 bins for each given `x`.
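
To make the `min_pvalue` merge rule above concrete, here is a minimal sketch of a Z-test between two adjacent bins. It assumes a pooled two-proportion Z-test with a two-sided p-value, which may differ from the statistic the project's macros actually compute. The table name `ztest_demo` and all bin counts are hypothetical.

```sas
data ztest_demo;
    /* Hypothetical adjacent bins: n = observations, bads = positive events. */
    n1 = 200; bads1 = 30;
    n2 = 150; bads2 = 33;
    min_pvalue = 0.35;

    p1 = bads1 / n1;                          /* bad rate of bin 1 */
    p2 = bads2 / n2;                          /* bad rate of bin 2 */
    p_pool = (bads1 + bads2) / (n1 + n2);     /* pooled bad rate   */
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2));
    z = (p1 - p2) / se;
    p_value = 2 * (1 - probnorm(abs(z)));     /* two-sided p-value */

    /* Assumed rule from the description: merge when p-value > min_pvalue. */
    merge_bins = (p_value > min_pvalue);
run;

proc print data = ztest_demo noobs;
    var p1 p2 z p_value merge_bins;
run;
```

A large p-value means the two bad rates are not clearly different, so the bins are candidates for merging. That is why raising `min_pvalue` leads to fewer merges.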
- The output files created by the MOB algorithm.
- The WoE summary result table created by the MOB algorithm.

`%printWithoutCname()` macro example:
```sas
%printWithoutCname(lib_name);
```
- `lib_name`
  - Default: None
  - Suggestion: The library assigned for storing the WoE summary result.
  - The `lib_name` argument defines the library assigned for storing the WoE summary result. For example, in the `MainMonotonicFirstBining.sas` script you can pass `TMPWOE`, which means the `%printWithoutCname()` macro outputs the files and result table to the `TMPWOE` library assigned by `LIBNAME TMPWOE "/home/u60021675/output";`.

The output of running the `%printWithoutCname()` macro. It shows the results for all discretized variables.
`%getIvPerVar()` macro example:
```sas
%getIvPerVar(lib_name, min_iv, min_obs_rate, max_obs_rate, min_bin_size, max_bin_size, min_bad_count);
```
- `lib_name`
  - Default: None
  - Suggestion: The library assigned for storing the IV summary result.
  - The `lib_name` argument defines the library assigned for storing the IV summary result. For example, in the `MainMonotonicFirstBining.sas` script you can pass `TMPWOE`, which means the `%getIvPerVar()` macro uses the `TMPWOE` library assigned by `LIBNAME TMPWOE "/home/u60021675/output";`.
- `min_iv`
  - Default: None
  - Suggestion: The minimum threshold of the information value (IV). Usually greater than 0.1.
  - The `min_iv` argument defines the minimum threshold of the information value (IV). For example, in the `MainMonotonicFirstBining.sas` script you can pass 0.1, which means the `%getIvPerVar()` macro marks `is_iv_pass` as 1 if the IV is greater than 0.1.
- `min_obs_rate`
  - Default: None
  - Suggestion: The minimum threshold of the observation rate. `0.05` is a common choice based on experience.
  - The `min_obs_rate` argument defines the minimum threshold of the observation rate. For example, in the `MainMonotonicFirstBining.sas` script you can pass 0.05, which means the `%getIvPerVar()` macro marks `is_obs_pass` as 1 if the observation rate is greater than 0.05 and lower than `max_obs_rate`.
- `max_obs_rate`
  - Default: None
  - Suggestion: The maximum threshold of the observation rate. `0.4` is a common choice based on experience.
  - The `max_obs_rate` argument defines the maximum threshold of the observation rate. For example, in the `MainMonotonicFirstBining.sas` script you can pass 0.4, which means the `%getIvPerVar()` macro marks `is_obs_pass` as 1 if the observation rate is less than 0.4 and greater than `min_obs_rate`.
- `min_bin_size`
  - Default: None
  - Suggestion: The minimum threshold for the number of bins. Usually set to 3.
  - The `min_bin_size` argument defines the minimum number of bins. For example, in the `MainMonotonicFirstBining.sas` script you can pass 3, which means the `%getIvPerVar()` macro marks `is_bin_pass` as 1 if the number of bins is higher than 3 and lower than `max_bin_size`.
- `max_bin_size`
  - Default: None
  - Suggestion: The maximum threshold for the number of bins. Usually set to 6.
  - The `max_bin_size` argument defines the maximum number of bins. For example, in the `MainMonotonicFirstBining.sas` script you can pass 10, which means the `%getIvPerVar()` macro marks `is_bin_pass` as 1 if the number of bins is less than 10 and greater than `min_bin_size`.
- `min_bad_count`
  - Default: None
  - Suggestion: The minimum threshold for the number of positive events (default/bad). Usually set to 1.
  - The `min_bad_count` argument defines the minimum threshold for the number of positive events (a default or bad event in risk analysis). For example, in the `MainMonotonicFirstBining.sas` script you can pass 1, which means the `%getIvPerVar()` macro marks `is_bad_count_pass` as 1 if the bad count is higher than 1.

The output of %getIvPerVar() macro. It shows the IV information for all discretized variables.
- `iv`: the information value of each discretized variable.
- `is_iv_pass`: true (1) if the IV is higher than `min_iv`, otherwise false (0).
- `is_obs_pass`: true (1) if the observation rate is between `min_obs_rate` and `max_obs_rate`, otherwise false (0).
- `is_bad_count_pass`: true (1) if the bad count is higher than `min_bad_count`, otherwise false (0).
- `is_bin_pass`: true (1) if the number of bins is between `min_bin_size` and `max_bin_size`, otherwise false (0).
- `is_woe_pass`: true (1) if the WoE values are monotonic, otherwise false (0).
- `woe_dir`: `asc` if the WoE values show a monotonically increasing pattern, `desc` if they show a monotonically decreasing pattern. Otherwise, null is given.
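
For readers unfamiliar with how WoE and IV are derived from bin counts, the following sketch computes both for a small hypothetical table. It assumes the common convention WoE = ln(share of goods / share of bads); the project's macros may use the opposite sign. The table names `bins_demo` and `woe_demo` and all counts are illustrative only.

```sas
/* Hypothetical bin counts: goods = events with y = 0, bads = events with y = 1. */
data bins_demo;
    input bin $ goods bads;
    datalines;
low 300 40
mid 250 60
high 150 100
;
run;

/* Totals are needed to turn counts into shares. */
proc sql noprint;
    select sum(goods), sum(bads) into :tot_good, :tot_bad
    from bins_demo;
quit;

data woe_demo;
    set bins_demo end = last;
    dist_good = goods / &tot_good.;        /* share of all goods in this bin */
    dist_bad  = bads / &tot_bad.;          /* share of all bads in this bin  */
    woe = log(dist_good / dist_bad);       /* assumed sign convention        */
    iv + (dist_good - dist_bad) * woe;     /* sum statement: running IV      */
    if last then put "Total IV: " iv;      /* written to the SAS log         */
run;
```

Each bin's IV contribution is non-negative under either sign convention, so the total IV does not depend on that choice. Only the direction reported in `woe_dir` would flip.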
`%printWoeBarLineChart()` macro example:
```sas
%printWoeBarLineChart(lib_name, min_iv);
```
- `lib_name`
  - Default: None
  - Suggestion: The library that holds the data used to print the WoE bar chart.
  - The `lib_name` argument defines the library used to store the data for plotting. For example, in the `MainMonotonicFirstBining.sas` script you can pass `TMPWOE`, which means the `%printWoeBarLineChart()` macro uses the `TMPWOE` library assigned by `LIBNAME TMPWOE "/home/u60021675/output";`.
- `min_iv`
  - Default: None
  - Suggestion: The minimum threshold of the information value. Usually set higher than 0.1.
  - The `min_iv` argument defines the minimum threshold of the information value. For example, in the `MainMonotonicFirstBining.sas` script you can pass 0.1, which means the `%printWoeBarLineChart()` macro shows the WoE bar chart of a variable if its IV is greater than 0.1.

The output of running the `%printWoeBarLineChart()` macro. It shows the WoE bar charts of the variables whose IV is greater than `min_iv`.
`%exportSplitRule()` macro example:
```sas
%exportSplitRule(lib_name, output_file);
```
- `lib_name`
  - Default: None
  - Suggestion: The library assigned to store the split rule exported by the macro.
  - The `lib_name` argument defines the library assigned to store the split rule exported by the macro. For example, in the `MainMonotonicFirstBining.sas` script you can pass `TMPWOE`, which means the `%exportSplitRule()` macro uses the `TMPWOE` library assigned by `LIBNAME TMPWOE "/home/u60021675/output";`.
- `output_file`
  - Default: None
  - Suggestion: The output path to which the split rule is exported.
  - The `output_file` argument defines the output path to which the split rule is exported. For example, in the `MainMonotonicFirstBining.sas` script you can pass `/home/u60021675/output/`, which means the `%exportSplitRule()` macro exports the split rule to the `/home/u60021675/output/` directory. Note that you do NOT need to quote the path.

The output of the `%exportSplitRule()` macro.
`%cleanBinsDetail()` macro example:
```sas
%cleanBinsDetail(bins_lib);
```
- `bins_lib`
  - Default: None
  - Suggestion: The library that stores the files created during the algorithm process, which are cleared at the end. We suggest using the same value assigned in the `%init()` macro.
  - The `bins_lib` argument defines the library whose files are cleared at the end. For example, in the `MainMonotonicFirstBining.sas` script you can pass `TMPWOE`, which means the bins summary files and exclude files are deleted.

The output of running the `%cleanBinsDetail()` macro. It shows that the bins_summary and exclude files were deleted.
CMB Algorithm macro example:
```sas
%runChiMerge( dataFrame = german_credit_card, x = &x., y = &y.,
max_bins_threshold = &max_bins_threshold.,
min_bins = &min_bins., max_bins = &max_bins.,
min_samples = &min_samples., max_samples = &max_samples.,
p_value_threshold = &p_value_threshold.,
libName = &libName.) ;
```
- `dataFrame`
  - Default: None
  - Suggestion: A training data set.
  - The `dataFrame` argument defines the input data set. The data set must include all independent variables and the target (response) variable. For example, in the `MainChiMerge.sas` script you can pass `german_credit_card`, a table created by the `%readCsvFile()` macro.
- `y`
  - Default: None
  - Suggestion: The column name of the response variable.
  - The `y` argument defines the column name of the response variable. For example, in the `MainChiMerge.sas` script you can pass `CostMatrixRisk`, which exists in the data set `german_credit_card`.
- `x`
  - Default: None
  - Suggestion: The column names of the variables on which to run the algorithm.
  - The `x` argument defines the column names of the chosen variables. Multiple columns can be passed at once. For example, in the `MainChiMerge.sas` script you can pass `Purpose`, which exists in the data set `german_credit_card`.
- `max_bins_threshold`
  - Default: None
  - Suggestion: The maximum number of initial attributes a variable may have for the CMB algorithm to run.
  - The `max_bins_threshold` argument defines the maximum number of initial unique attributes allowed for running the CMB algorithm. If the number of initial unique attributes of the given `x` exceeds `max_bins_threshold`, the algorithm stops. For example, in the `MainChiMerge.sas` script you can pass `20`, which means that if the given `x` has more than 20 unique attributes, the algorithm stops executing.
- `min_bins`
  - Default: None
  - Suggestion: The minimum number of bins kept in the final WoE summary output for each given `x`.
  - The `min_bins` argument defines the minimum number of bins kept in the final WoE summary output for each given `x`. For example, in the `MainChiMerge.sas` script you can pass `3`, which means the algorithm creates at least 3 bins for each given `x`.
- `max_bins`
  - Default: None
  - Suggestion: The maximum number of bins kept in the final WoE summary output for each given `x`. Note that `max_bins` must be higher than `min_bins`.
  - The `max_bins` argument defines the maximum number of bins kept in the final WoE summary output for each given `x`. For example, in the `MainChiMerge.sas` script you can pass `7`, which means the algorithm creates at most 7 bins for each given `x`.
- `min_samples`
  - Default: None
  - Suggestion: Integer or float: the minimum number of samples kept in each bin. `min_samples` is usually suggested to be `5%` of the total population.
  - The `min_samples` argument defines the minimum number of samples kept in each bin. If the given value is between 0 and 1 (0 < `min_samples` < 1), the program treats it as a proportion of the total population and calculates the sample count automatically. For example, in the `MainChiMerge.sas` script you can pass `0.05`, which constrains each bin to hold at least `5%` of the total samples. Alternatively, you can pass `%sysevalf(1000 * 0.05)`, which sets the minimum sample count directly to 50.
- `max_samples`
  - Default: None
  - Suggestion: Integer or float: the maximum number of samples kept in each bin. `max_samples` is usually suggested to be 40% of the total population to avoid a serious concentration issue in WoE binning.
  - The `max_samples` argument defines the maximum number of samples kept in each bin. For example, in the `MainChiMerge.sas` script you can pass `0.4`, which limits each bin to at most `40%` of the total samples, calculated automatically by the program. Alternatively, you can pass `%sysevalf(1000 * 0.4)`, which sets the maximum sample count directly to 400.
- `p_value_threshold`
  - Default: None
  - Suggestion: The minimum p-value threshold the algorithm uses to decide whether to merge two bins. A higher `p_value_threshold` generally results in fewer bin merges.
  - The `p_value_threshold` argument defines the minimum p-value threshold. For example, in the `MainChiMerge.sas` script you can pass `0.35`, which means the algorithm merges two bins if the p-value of the statistical test (chi-squared test) conducted between them is greater than `0.35`. If no pair of bins has a p-value greater than the given value and the number of bins is still greater than `max_bins`, the value is decreased iteratively.
- `libName`
  - Default: None
  - Suggestion: The library that stores the WoE summary result and other tables.
  - The `libName` argument defines the library in which the WoE summary result and other tables are stored. For example, in the `MainChiMerge.sas` script you can pass `TMPWOE`, which means the output tables are written to the `TMPWOE` library assigned by `LIBNAME TMPWOE "/home/u60021675/output";`.
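
The chi-squared merging criterion behind `p_value_threshold` can be illustrated with `PROC FREQ` on two categories. This is a sketch under assumptions, not the code inside `%runChiMerge()`: the table names `chi_demo`, `chi_result`, and `chi_decision`, the category labels, and the counts are all hypothetical, and the project may compute the statistic differently.

```sas
/* Hypothetical counts of y = 0 / y = 1 for two categories of x. */
data chi_demo;
    input category $ y count;
    datalines;
A 0 80
A 1 20
B 0 70
B 1 30
;
run;

/* Pearson chi-squared test on the 2 x 2 table of category by y. */
proc freq data = chi_demo;
    weight count;
    tables category * y / chisq;
    output out = chi_result pchi;   /* writes _PCHI_, DF_PCHI and P_PCHI */
run;

/* Assumed rule from the description: merge when p-value > threshold. */
data chi_decision;
    set chi_result;
    merge_categories = (P_PCHI > 0.35);
run;
```

In a full run this comparison would be repeated over candidate pairs, with the threshold lowered iteratively when nothing qualifies and the number of bins still exceeds `max_bins`, as described for `p_value_threshold` above.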
- The output files created by the CMB algorithm.

The final output of the WoE binning result is stored in `woe_summary_<x>.sas7bdat`. Details are shown below:
SAS Studio 3.8 with SAS 9.4
- German Credit Risk Analysis : Beginner's Guide . (2022). Retrieved 9 June 2022, from Kaggle
- Pavel Mironchyk and Viktor Tchistiakov. "Monotone optimal binning algorithm for credit risk modeling." (2017): 1-15.
- SAS OnDemand for Academics. (2022). Retrieved 9 June 2022
- MOBPY : Monotonic-Optimal-Binning
- Darren Tsai(https://www.linkedin.com/in/darren-yucheng-tsai/)
- Denny Chen(https://www.linkedin.com/in/dennychen-tahung/)
- Thea Chan(yahui0219@gmail.com)















