Tabular Data Binary Classification: All Tips and Tricks from 5 Kaggle Competitions
This article was originally written by Shahul ES and posted on the Neptune blog.
In this article, I will discuss some great tips and tricks to improve the performance of your structured data binary classification model. These tricks come from solutions to some of Kaggle’s top tabular data competitions. Without further delay, let’s begin.
These are the five competitions that I have gone through to create this article:
- Home Credit Default Risk
- Santander Customer Transaction Prediction
- VSB Power Line Fault Detection
- Microsoft Malware Prediction
- IEEE-CIS Fraud Detection
Dealing with larger datasets
One issue you might face in any machine learning competition is the size of your dataset. If your data is large (3 GB or more for Kaggle kernels and basic laptops), you may find it difficult to load and process with limited resources. Here are some articles and kernels that I have found useful in such situations.
- Faster data loading with pandas.
- Data compression techniques to reduce the size of data by 70%.
- Optimize the memory by reducing the size of some attributes.
- Use open-source libraries such as Dask to read and manipulate the data; it performs parallel computing and saves memory.
- Use cuDF.
- Convert data to Parquet format.
- Convert data to Feather format.
- Reducing memory usage for optimizing RAM.
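As a rough illustration of the memory-reduction idea in the list above, here is a minimal sketch that downcasts numeric columns with pandas. The helper name `downcast_numeric` and the toy frame are assumptions made for this example and do not come from the linked kernels.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose numeric columns use the smallest dtype pandas picks."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # int64 -> int8/int16/int32 when the value range allows it
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # float64 -> float32; this trades some precision for memory
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Toy frame standing in for a large competition table (hypothetical data).
df = pd.DataFrame({
    "count": np.arange(1000, dtype="int64"),
    "amount": np.linspace(0.0, 1.0, 1000),
})
small = downcast_numeric(df)

before = df.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"memory: {before} bytes -> {after} bytes")
```

Downcasting floats to `float32` loses precision, so it is worth checking that validation scores do not change before applying it everywhere.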
Data exploration
Data exploration always helps you understand the data better and gain insights from it. Before developing machine learning models, top competitors always read and do a lot of exploratory data analysis. This helps with feature engineering and data cleaning.
- EDA for Microsoft malware detection.
- Time Series EDA for malware detection.
- Complete EDA for home credit loan prediction.
- Complete EDA for Santander prediction.
- EDA for VSB Power Line Fault Detection.
Data preparation
After data exploration, the first thing to do is to use those insights to prepare the data and tackle issues such as class imbalance and categorical encoding. Let’s see the methods used to do it.
- Methods to tackle class imbalance.
- Data augmentation by Synthetic Minority Oversampling Technique (SMOTE).
- Fast in-place shuffle for augmentation.
- Finding synthetic samples in the dataset.
- Signal denoising used in signal processing competitions.
- Finding patterns of missing data.
- Methods to handle missing data.
- An overview of various encoding techniques for categorical data.
- Building model to predict missing values.
- Random shuffling of data to create new synthetic training set.
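The shuffle-based augmentation mentioned in the list above can be sketched roughly as follows. This is one possible reading of the idea, assuming that features are approximately independent within each class; the function name `shuffle_augment` and the toy data are hypothetical.

```python
import numpy as np


def shuffle_augment(X, y, n_copies=1, seed=0):
    """Append synthetic rows built by shuffling each column within a class.

    Each column is permuted independently among rows of the same class,
    which keeps per-class marginal distributions but breaks any
    interactions between features.
    """
    rng = np.random.default_rng(seed)
    X_parts, y_parts = [X], [y]
    for label in np.unique(y):
        X_cls = X[y == label]
        for _ in range(n_copies):
            X_new = X_cls.copy()
            for j in range(X_new.shape[1]):
                # Permute one column at a time, only among same-class rows
                X_new[:, j] = rng.permutation(X_new[:, j])
            X_parts.append(X_new)
            y_parts.append(np.full(len(X_new), label))
    return np.vstack(X_parts), np.concatenate(y_parts)


# Toy data: 100 rows, 5 features, roughly 20% positives.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = (rng.random(100) < 0.2).astype(int)

X_aug, y_aug = shuffle_augment(X, y, n_copies=2)
print(X_aug.shape, y_aug.shape)
```

To counter class imbalance rather than simply enlarge the training set, the same loop could be restricted to the minority class. Apply it to training folds only, never to validation data.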
Feature engineering
Next, you can check the most popular features and feature engineering techniques used in these top Kaggle competitions. Feature engineering varies from problem to problem depending on the domain.
- Target encoding cross validation for better encoding.
- Entity embedding to handle categories.
- Encoding cyclic features for deep learning.
- Manual feature engineering methods.
- Automated feature engineering techniques using featuretools.
- Top handcrafted features used in Microsoft malware detection.
- Denoising NN for feature extraction.
- Feature engineering using RAPIDS framework.
- Things to remember while processing features using LGBM.
- Lag features and moving averages.
- Principal component analysis for dimensionality reduction.
- LDA for dimensionality reduction.
- Best handcrafted LGBM features for Microsoft malware detection.
- Generating frequency features.
- Dropping variables with different train and test distribution.
- Aggregate time series features for Home Credit competition.
- Time series features used in Home Credit Default Risk.
- Scale, standardize, and normalize with sklearn.
- Handcrafted features for Home Credit Default Risk competition.
- Handcrafted features used in Santander Transaction Prediction.
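Two of the encodings listed above, cross-validated target encoding and frequency features, can be sketched in a few lines of pandas. This is a minimal illustration, not the code from the linked kernels; the helper `target_encode_oof`, the column names, and the toy data are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def target_encode_oof(cat: pd.Series, target: pd.Series, n_splits=5, seed=0) -> pd.Series:
    """Out-of-fold target encoding.

    Each row is encoded with category means computed on the other folds,
    so a row's own label never leaks into its encoded value.
    """
    encoded = pd.Series(np.nan, index=cat.index, dtype="float64")
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(cat):
        fit_cat, fit_target = cat.iloc[fit_idx], target.iloc[fit_idx]
        means = fit_target.groupby(fit_cat).mean()
        # Categories unseen in the fitting folds fall back to the fold mean
        values = cat.iloc[enc_idx].map(means).fillna(fit_target.mean())
        encoded.iloc[enc_idx] = values.to_numpy()
    return encoded


# Toy data (hypothetical column names).
df = pd.DataFrame({
    "city": ["a", "a", "b", "b", "a", "b", "c", "a", "b", "c"],
    "target": [1, 0, 1, 1, 0, 1, 0, 1, 0, 0],
})
df["city_te"] = target_encode_oof(df["city"], df["target"], n_splits=5)

# Frequency feature: share of rows that carry each category.
df["city_freq"] = df["city"].map(df["city"].value_counts(normalize=True))
print(df)
```

For test data, the category means would typically be computed on the full training set. Smoothing toward the global mean is a common refinement that is left out here for brevity.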
Feature selection
After generating many features from your data, you need to decide which features to use to get the maximum performance out of your model. This step also includes identifying the impact each feature has on your model. Let’s see some of the most popular feature selection methods.
- Six ways to do feature selection using sklearn.
- Permutation feature importance.
- Adversarial feature validation.
- Feature selection using null importance.
- Tree explainer using SHAP.
- DeepNN explainer using SHAP.
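Permutation feature importance, one of the methods listed above, is available in scikit-learn. The sketch below uses a synthetic dataset and a random forest as stand-ins; both are assumptions chosen to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: with shuffle=False the first 3 columns are the informative ones.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle one column at a time on held-out data and measure the AUC drop.
result = permutation_importance(
    model, X_valid, y_valid, scoring="roc_auc", n_repeats=5, random_state=0
)
ranking = np.argsort(result.importances_mean)[::-1]
for idx in ranking:
    print(f"feature {idx}: {result.importances_mean[idx]:.4f}")
```

Computing the importance on a validation split rather than the training data keeps overfitted features from looking useful. Features whose importance stays near zero are candidates for removal.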
Modeling
After handcrafting and selecting your features, you should choose the right machine learning algorithm to make your predictions. This is a collection of some of the most used ML models in structured data classification challenges.
- Random forest classifier.
- XGBoost: gradient boosted decision trees.
- LightGBM for distributed and faster training.
- CatBoost to handle categorical data.
- Naive Bayes classifier.
- Gaussian Naive Bayes model.
- LGBM + CNN model used in 3rd place solution of Santander Customer Transaction Prediction.
- Knowledge distillation in Neural Network.
- Follow-the-regularized-leader (FTRL) method.
- Comparison between LGB boosting methods (goss, gbdt and dart).
- NN + focal loss experiment.
- Keras NN with timeseries splitter.
- 5th place NN architecture with code for Santander Transaction prediction.
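Before reaching for the heavier models in the list above, it can help to establish quick baselines. The sketch below compares two of the listed model families, a random forest and Gaussian Naive Bayes, using cross-validated AUC on synthetic data; the dataset and settings are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Imbalanced synthetic data (about 20% positives) standing in for a real table.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4,
    weights=[0.8, 0.2], random_state=0,
)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gaussian_nb": GaussianNB(),
}

# For classifiers with an integer cv, cross_val_score uses stratified folds.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in models.items()
}
for name, auc in scores.items():
    print(f"{name}: mean AUC = {auc:.3f}")
```

Gradient boosting libraries such as XGBoost, LightGBM, and CatBoost provide scikit-learn compatible estimators, so they can usually be dropped into the same dictionary.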
Hyperparameter tuning
- LGBM hyperparameter tuning methods.
- Automated model tuning methods.
- Parameter tuning with hyper plot.
- Bayesian optimization for hyperparameter tuning.
- GPyOpt hyperparameter optimisation.
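The resources above focus on LGBM and Bayesian optimization. As a simpler stand-in that needs only scikit-learn and SciPy, the sketch below runs a randomized search over a random forest. The search space and dataset are assumptions, and random search is a swapped-in technique here, not the Bayesian approach from the links.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=0
)

# Hypothetical search space; randint(low, high) samples integers in [low, high).
param_distributions = {
    "n_estimators": randint(50, 150),
    "max_depth": randint(2, 10),
    "min_samples_leaf": randint(1, 20),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=8,                      # number of sampled configurations
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
best_auc = search.best_score_
print(best_params, round(best_auc, 3))
```

Bayesian optimizers follow the same pattern of defining a search space and scoring each candidate with cross-validation, but they choose the next candidate based on earlier results instead of sampling at random.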
Evaluation
Choosing a suitable validation strategy is very important to avoid huge shake-ups or poor performance of the model on the private test set.
A traditional 80:20 split does not work in many cases. Cross-validation usually estimates model performance more reliably than a single train-validation split.
There are different variations of k-fold cross-validation, such as group k-fold, and you should choose the one that fits your data.
- K-fold cross-validation.
- Stratified KFold cross-validation.
- Group KFold.
- Adversarial validation to check if train and test distributions are similar or not.
- Time Series split validation.
- Extensive time series splitter.
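To make the difference between these splitters concrete, here is a small sketch using scikit-learn’s `StratifiedKFold`, `GroupKFold`, and `TimeSeriesSplit` on toy arrays. The data and the group assignment are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold, TimeSeriesSplit

n = 20
X = np.arange(n).reshape(-1, 1)
y = np.array([0] * 16 + [1] * 4)        # imbalanced labels: 4 positives
groups = np.repeat(np.arange(5), 4)     # 5 groups of 4 rows, e.g. one per customer

# Stratified k-fold keeps the class ratio roughly equal in every fold.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
positives_per_fold = [int(y[valid].sum()) for _, valid in skf.split(X, y)]

# Group k-fold never puts rows of the same group in both train and validation.
gkf = GroupKFold(n_splits=5)
group_overlap = [
    set(groups[train]) & set(groups[valid])
    for train, valid in gkf.split(X, y, groups)
]

# Time series split always validates on rows that come after the training rows.
tss = TimeSeriesSplit(n_splits=4)
time_ordered = [bool(train.max() < valid.min()) for train, valid in tss.split(X)]

print(positives_per_fold, group_overlap, time_ordered)
```

Which splitter to pick depends on how the test set was built: grouped splits fit data with repeated entities, and time-based splits fit data where the test period follows the training period.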
Note:
There are various metrics that you can use to evaluate the performance of your tabular models. A bunch of useful classification metrics are listed and explained here.
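As a brief illustration, a few common binary classification metrics can be computed with scikit-learn as sketched below. The labels and predicted probabilities are made-up values, and the 0.5 threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score, log_loss, roc_auc_score

# Made-up labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9])

# Threshold-free metrics work directly on probabilities.
auc = roc_auc_score(y_true, y_prob)
avg_precision = average_precision_score(y_true, y_prob)
loss = log_loss(y_true, y_prob)

# Threshold-based metrics need hard labels; 0.5 is an assumed cut-off.
y_pred = (y_prob >= 0.5).astype(int)
f1 = f1_score(y_true, y_pred)

print(f"AUC={auc:.3f} AP={avg_precision:.3f} logloss={loss:.3f} F1={f1:.3f}")
```

AUC depends only on how the predictions are ordered, while log loss also penalizes poorly calibrated probabilities, so the two can disagree about which model is better.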
Other training tricks
- GPU acceleration for LGBM.
- Use the GPU efficiently.
- Free Keras memory.
- Save and load models to save runtime and memory.
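Saving a fitted model so that it does not need to be retrained can be sketched with `joblib`, which is installed alongside scikit-learn. The logistic regression and the temporary file path are stand-ins chosen for this example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Write the fitted model to disk, then load it back without retraining.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(model, path)
    restored = joblib.load(path)

same_predictions = np.array_equal(
    model.predict_proba(X), restored.predict_proba(X)
)
print("restored model matches:", same_predictions)
```

Saving one model per cross-validation fold makes it possible to rerun inference or ensembling later without repeating the training step. Only load files you trust, because joblib relies on pickle.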
Ensemble
In a competitive environment, you won’t get to the top of the leaderboard without ensembling. Selecting the appropriate ensembling/stacking method is very important to get the maximum performance out of your models.
Let’s see some of the popular ensembling techniques used in Kaggle competitions:
- Weighted average ensemble.
- Stacked generalization ensemble.
- Out of folds predictions.
- Blending with linear regression.
- Use optuna to determine blending weights.
- Power average ensemble.
- Power 3.5 blending strategy.
- Blending diverse models.
- Different stacking approaches.
- AUC weight optimization.
- Geometric mean for low correlation predictions.
- Weighted rank average.
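Several of the blending schemes listed above reduce to a few lines of NumPy. The sketch below shows one plausible form of each on made-up predictions; the weights, the exponent of 3.5, and the exact formulas are assumptions and may differ from the linked solutions.

```python
import numpy as np
from scipy.stats import rankdata

# Rows are models, columns are samples (made-up predicted probabilities).
preds = np.array([
    [0.10, 0.40, 0.80, 0.95],
    [0.20, 0.30, 0.70, 0.90],
    [0.05, 0.50, 0.60, 0.99],
])
weights = np.array([0.5, 0.3, 0.2])  # hypothetical weights, e.g. tuned on OOF scores

# Weighted average of the raw probabilities.
weighted_avg = np.average(preds, axis=0, weights=weights)

# Power average: raise predictions to a power before averaging,
# which puts more emphasis on confident predictions.
power_avg = np.mean(preds ** 3.5, axis=0)

# Geometric mean, computed in log space; requires strictly positive predictions.
geometric_mean = np.exp(np.mean(np.log(preds), axis=0))

# Weighted rank average: convert each model's scores to normalized ranks first,
# which is useful when models output scores on different scales.
ranks = np.vstack([rankdata(p) / len(p) for p in preds])
weighted_rank_avg = np.average(ranks, axis=0, weights=weights)

print(weighted_avg, power_avg, geometric_mean, weighted_rank_avg, sep="\n")
```

Rank and power averages do not produce calibrated probabilities, so they suit ranking metrics such as AUC better than log loss. Blending weights are usually tuned on out-of-fold predictions, not on the public leaderboard.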
Final thoughts
In this article, you saw many popular and effective ways to improve the performance of your tabular data binary classification model. Hopefully, you will find them useful in your projects.