Boosting Rare Event Data with SMOTE in SAS Data Maker

Leveraging SAS Data Maker's synthetic data generation to overcome imbalanced datasets

Apr. 1, 2026 at 5:07pm

A SAS employee describes using SMOTE (Synthetic Minority Over-sampling Technique) in SAS Data Maker to generate synthetic data that boosts the representation of rare-event cases in a dataset, which can help when training predictive models on imbalanced data.

Why it matters

Imbalanced datasets, where the target variable has far fewer positive cases than negative cases, make it harder to train accurate predictive models. Techniques like SMOTE, which create synthetic minority-class cases by interpolating between existing ones, can help address this and may lead to better model performance.

The details

The author first tried using SMOTE in SAS Data Maker to oversample a rare binary target variable, but the synthetic data came out just as imbalanced as the original. The cause was that the full 40,000-observation dataset had been supplied to SMOTE rather than only the 500 positive cases. Given only the rare-event cases, SMOTE generated 5,000 synthetic positive examples, which could then be combined with the original 40,000 cases to create a more balanced training dataset for predictive modeling: 5,500 positives out of 45,000 rows, or about 12.2%, up from 1.25%.

  • The author started experimenting with SAS Data Maker and its synthetic data generation capabilities last fall.
  • The author needed to boost the rare-event rate for a binary target variable in a 40,000-record dataset that had only 500 positive cases.
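
The idea of oversampling only the event cases can be sketched in a few lines of Python. This is a minimal illustration of the general SMOTE interpolation step, not SAS Data Maker's implementation; the function name smote_minority, the choice of k=5 neighbours, the four numeric features, and the simulated data are all assumptions made for the example. Only the row counts (40,000 total, 500 positives, 5,000 synthetic) come from the story above.

```python
import numpy as np

def smote_minority(X_min, n_synthetic, k=5, seed=0):
    """Hypothetical helper: basic SMOTE interpolation on minority rows only."""
    rng = np.random.default_rng(seed)
    n = X_min.shape[0]
    # Pairwise squared Euclidean distances among the minority rows.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    d2 = (diffs ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)  # a row is never its own neighbour
    neighbours = np.argsort(d2, axis=1)[:, :k]  # k nearest minority rows
    # Pick a random base row and one of its k neighbours for each new row.
    base = rng.integers(0, n, size=n_synthetic)
    picked = neighbours[base, rng.integers(0, k, size=n_synthetic)]
    # Place the synthetic row at a random point on the connecting segment.
    gap = rng.random((n_synthetic, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Simulated stand-in data (assumption): 39,500 negatives and 500 positives.
rng = np.random.default_rng(42)
X_neg = rng.normal(0.0, 1.0, size=(39_500, 4))
X_pos = rng.normal(2.0, 1.0, size=(500, 4))
X = np.vstack([X_neg, X_pos])
y = np.concatenate([np.zeros(39_500, dtype=int), np.ones(500, dtype=int)])

# Pass in only the event cases, not all 40,000 rows.
X_syn = smote_minority(X[y == 1], n_synthetic=5_000)

# Combine the synthetic positives with the original data.
X_bal = np.vstack([X, X_syn])
y_bal = np.concatenate([y, np.ones(len(X_syn), dtype=int)])

rate_before = y.mean()
rate_after = y_bal.mean()
print(f"event rate before: {rate_before:.4f}")  # 500 / 40,000 = 0.0125
print(f"event rate after:  {rate_after:.4f}")   # 5,500 / 45,000 = 0.1222
```

Because the helper only ever sees the 500 event rows, every synthetic row is a positive case, which is what moves the event rate. A tool that is handed the full dataset and asked to reproduce it would be expected to preserve the original class proportions instead.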

The players

SAS Data Maker

A SAS software tool that allows users to quickly and easily generate realistic synthetic data from an original dataset, using techniques like SMOTE to handle imbalanced classes.

Dan Obermiller

A friend of the author whose suggestion made it clear that SMOTE should be applied only to the minority-class cases rather than the full dataset.

What they’re saying

“Don't put 40,000 training cases in if you want to oversample from only the 500 event cases. Just put in the event cases.”

— Dan Obermiller

What’s next

The author plans to attend a hands-on workshop on SAS Data Maker, including the SMOTE technique, at the SAS Innovate conference in Grapevine, Texas, on April 29.

The takeaway

Synthetic data generation with SMOTE in SAS Data Maker can be an effective way to address imbalanced datasets and improve the training of predictive models, provided SMOTE is given only the minority-class cases rather than the full dataset.