Introduction

In the fast-paced domain of digital marketing, businesses rely on various strategies to increase website traffic and enhance user engagement. Among these, display advertising has emerged as a critical tool for capturing consumer attention. The effectiveness of a display ad is often evaluated by its Click-Through Rate (CTR)—a metric that represents the percentage of viewers who click on the advertisement. A high CTR not only reflects successful ad performance but also drives potential sales for online retailers and enhances engagement for content-based websites.

Key points about the project:

Significance of CTR: CTR serves as a primary metric for the success of display advertisements, indicating the percentage of viewers who click on the ad.

Business Impact: High CTRs can improve consumer engagement, boost sales, and enhance marketing campaign efficiency.

Project Focus: Predicting CTR using machine learning techniques based on ad attributes like quality, relevance, type, audience, and content.

Goal: Provide actionable insights for optimizing ad performance and improving targeting strategies through data-driven approaches. This project utilizes a robust dataset and advanced machine learning techniques to analyze and predict CTR, offering valuable tools for optimizing digital advertising efforts.

Objectives

The primary objectives of this project are as follows:

Data Preprocessing and Exploration: Clean and preprocess the provided datasets to ensure consistency and accuracy, addressing missing values and encoding categorical variables appropriately.

Feature Engineering: Identify significant predictors of CTR, engineer new features to enhance model performance, and analyze their impact using correlation and visual tools.

Model Development and Evaluation: Build and compare predictive models (a linear regression baseline, labeled "Logistic Regression" in the code and outputs, plus decision tree, random forest, and XGBoost) to identify the most accurate approach for predicting CTR.

Optimization and Prediction: Use Bayesian Optimization to fine-tune hyperparameters and maximize the model’s performance, ensuring reliable predictions on unseen data.

These objectives aim to demonstrate a comprehensive pipeline from data preparation to model evaluation, emphasizing practical insights for improving the efficiency of display advertising campaigns.

Data Preparation and Exploratory Analysis

Data Loading and Preprocessing

This section involves loading the datasets and preparing them for analysis. The key steps include:

• Loading the necessary libraries to facilitate data manipulation, preprocessing, and modeling.

• Importing the analysis_data and scoring_data datasets.

• Converting character columns to factors for better handling during modeling.

• Imputing missing values:

  • Numerical columns: Replacing missing values with the median.
  • Categorical columns: Replacing missing values with the mode.
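# Load the libraries used throughout the report.
# Package choices are inferred from the functions called below; Metrics is an
# assumption for rmse(), since the original library calls are not shown.
library(dplyr)                    # %>%, mutate(), across(), select(), filter()
library(ggplot2)                  # bar plots and heatmap
library(caret)                    # createDataPartition()
library(randomForest)             # randomForest()
library(rpart)                    # rpart()
library(xgboost)                  # xgb.cv(), xgboost()
library(ParBayesianOptimization)  # bayesOpt(), getBestPars()
library(Metrics)                  # rmse(actual, predicted)

# Convert character columns to factors in both datasets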
convert_to_factor <- function(df) {
  df %>% mutate(across(where(is.character), as.factor))
}

analysis_data <- convert_to_factor(analysis_data)
scoring_data <- convert_to_factor(scoring_data)

# Define functions to impute missing values
# Function to impute numerical columns with median
impute_median <- function(df) {
  df %>% mutate(across(where(is.numeric), ~ ifelse(is.na(.), median(., na.rm = TRUE), .)))
}

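# Function to impute categorical columns with the mode.
# The result is stored as integer level codes rather than factors; the numeric
# correlations and as.matrix() calls later in the report rely on this encoding.
impute_mode <- function(df) {
  df %>% mutate(across(where(is.factor), ~ {
    mode_code <- unname(which.max(table(.x)))  # integer code of the most frequent level
    replace(as.integer(.x), is.na(.x), mode_code)
  }))
}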

# Apply imputations to both datasets
analysis_data <- analysis_data %>% impute_median() %>% impute_mode()
scoring_data <- scoring_data %>% impute_median() %>% impute_mode()
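
As a quick sanity check of the imputation logic, the sketch below applies median and mode imputation to a small made-up data frame using base R only. The data and the helper names impute_num and impute_cat are illustrative assumptions, not part of the project code. Unlike an ifelse()-based approach, which returns a factor's integer codes, assigning a level label in place keeps the column a factor.

```r
# Toy data (made up for illustration) with one numeric and one categorical column
toy <- data.frame(
  score  = c(1.0, NA, 3.0, 10.0),
  format = factor(c("banner", "video", NA, "banner"))
)

# Median imputation for a numeric vector
impute_num <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Mode imputation for a factor; assigning an existing level label keeps the factor type
impute_cat <- function(x) {
  mode_level <- names(which.max(table(x)))  # most frequent level, NAs ignored
  x[is.na(x)] <- mode_level
  x
}

toy$score  <- impute_num(toy$score)
toy$format <- impute_cat(toy$format)
print(toy)
```

Whether to keep factors or convert them to integer codes depends on the downstream model: matrix-based learners need numeric input, while formula-based ones can use factors directly.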

Correlation Analysis with CTR

This section focuses on analyzing the relationship between numerical variables in the dataset and the target variable, CTR. The goal is to identify features that have a significant correlation with CTR, visualize these relationships, and explore patterns among numeric variables to guide further feature selection and model building.

To achieve this, three types of visualizations were used:

• Bar Plot of Significant Correlations with CTR

• Correlation Heatmap

• Scatterplot Matrix of Significant Variables

Correlation Calculation with CTR

To analyze the relationship between numeric variables and CTR, the following steps were performed:

• Correlation Computation: Pearson correlation coefficients were calculated for all numeric columns, dropping rows with missing values via the use = "complete.obs" argument of cor().

• Sorting Correlations: The correlations were ordered by their absolute values to identify variables with the strongest positive or negative relationships to CTR.

• Filtering Significant Correlations: Only variables with an absolute correlation above 0.05 were retained for further analysis. "Significant" here refers to this magnitude cutoff, not to a statistical significance test.

• Outcome: This process highlights the predictors most strongly correlated with CTR, providing a focused direction for feature selection and modeling.

# Calculate correlations with CTR
numeric_columns <- sapply(analysis_data, is.numeric)
correlations <- cor(analysis_data[, numeric_columns], use = "complete.obs")
ctr_correlations <- correlations[,"CTR"]

# Display significant correlations
ctr_correlations <- ctr_correlations[order(abs(ctr_correlations), decreasing = TRUE)]
print(ctr_correlations)
##                    CTR          visual_appeal        targeting_score 
##           1.0000000000           0.5011609081           0.3336587930 
##        headline_length           cta_strength              ad_format 
##          -0.1406820672           0.1164018138           0.0977400302 
##        body_word_count    headline_word_count       body_text_length 
##          -0.0642559115          -0.0598547024          -0.0560056115 
##     headline_sentiment   contextual_relevance               location 
##          -0.0420156207           0.0417913908          -0.0357870555 
##       position_on_page            device_type                     id 
##           0.0305120490           0.0296849089          -0.0187398531 
##              age_group      market_saturation      headline_question 
##          -0.0187237938          -0.0178496353          -0.0149585014 
##         body_sentiment            day_of_week       headline_numbers 
##           0.0121950299           0.0089715145          -0.0085340790 
##   headline_power_words body_readability_score      brand_familiarity 
##          -0.0084196140           0.0076469712          -0.0027191363 
##            seasonality   body_keyword_density            time_of_day 
##           0.0023090672           0.0021354996          -0.0020327812 
##           ad_frequency                 gender 
##           0.0009782948           0.0001673900
# Filter significant correlations
significant_vars <- ctr_correlations[abs(ctr_correlations) > 0.05]
print(significant_vars)
##                 CTR       visual_appeal     targeting_score     headline_length 
##          1.00000000          0.50116091          0.33365879         -0.14068207 
##        cta_strength           ad_format     body_word_count headline_word_count 
##          0.11640181          0.09774003         -0.06425591         -0.05985470 
##    body_text_length 
##         -0.05600561
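
The 0.05 cutoff above is applied to the size of the correlation coefficient. If a formal test of whether a correlation differs from zero is also wanted, base R's cor.test() reports a p-value and confidence interval. The sketch below uses simulated data (x and y are made up, not project variables) and is only meant to show the call.

```r
# Simulated data for illustration: y depends weakly on x
set.seed(42)
n <- 500
x <- rnorm(n)
y <- 0.3 * x + rnorm(n)

# Pearson correlation test: estimate, p-value, and 95% confidence interval
res <- cor.test(x, y, method = "pearson")
print(res$estimate)
print(res$p.value)
print(res$conf.int)
```

With large samples even very small correlations can be statistically significant, so a magnitude cutoff and a significance test answer different questions.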

Bar Plot: Significant Correlations with CTR

Description:

• The bar plot illustrates the strength and direction of the correlations between variables and CTR.

• Each bar represents a variable, and its length corresponds to the correlation coefficient.

• Bars are filled on a diverging gradient: positive correlations shade toward red, while negative correlations shade toward blue.

Interpretation:

• visual_appeal has the highest positive correlation (~0.50), indicating that advertisements with better visual appeal tend to have higher CTR.

• targeting_score (0.33) and cta_strength (0.12) also exhibit positive correlations, suggesting their relevance in driving CTR.

• headline_length has a weak negative correlation (-0.14), implying that longer headlines may slightly reduce user engagement.

Insights:

• Features like visual_appeal and targeting_score should be prioritized in predictive modeling as they show the strongest positive association with CTR. Correlation alone does not establish a causal effect.

• Negative correlations, like headline_length, indicate areas where optimization could potentially improve CTR.

# Filter correlations excluding CTR itself
ctr_correlation_df <- data.frame(
  Variable = names(ctr_correlations),
  Correlation = ctr_correlations
) %>%
  filter(Variable != "CTR")  # Exclude CTR

# Bar plot for significant correlations with CTR
significant_ctr_df <- ctr_correlation_df %>%
  filter(abs(Correlation) > 0.05)

ggplot(significant_ctr_df, aes(x = reorder(Variable, Correlation), y = Correlation, fill = Correlation)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  theme_minimal() +
  labs(
    title = "Significant Correlations with CTR",
    x = "Variables",
    y = "Correlation Coefficient"
  )

Heatmap: Correlation Between Variables

Description:

• The heatmap visualizes the pairwise correlations among numeric variables in the dataset.

• The color gradient represents the strength and direction of the correlation:

  • Red indicates a strong positive correlation.
  • Blue indicates a strong negative correlation.
  • White indicates no correlation.

Interpretation:

• The diagonal (from top-left to bottom-right) represents perfect correlations (1.0) of variables with themselves.

• Variables like visual_appeal and targeting_score show moderate positive correlations with several other predictors.

• Areas with darker shades (e.g., between headline_length and headline_word_count) suggest potential multicollinearity, which could affect model performance.

Insights:

• Understanding variable relationships helps detect multicollinearity, which can be addressed using techniques like regularization.

• Features with weak correlations with others (lighter colors) may add unique predictive value to the model.

# Heatmap of pairwise correlations among predictors (CTR itself is excluded)
numeric_correlations <- as.data.frame(as.table(correlations)) %>%
  filter(Var1 != "CTR", Var2 != "CTR")  # Exclude CTR from rows and columns

ggplot(numeric_correlations, aes(Var1, Var2, fill = Freq)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, 
                       limit = c(-1, 1), space = "Lab", 
                       name = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(
    title = "Correlation Heatmap",
    x = "Variables",
    y = "Variables"
  )
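
To complement the visual check for multicollinearity, highly correlated predictor pairs can also be listed directly from the correlation matrix. The sketch below uses simulated columns a, b, and c and an arbitrary cutoff of 0.8; both the data and the cutoff are assumptions for illustration.

```r
# Simulated predictors: b is nearly a copy of a, c is independent
set.seed(1)
n <- 200
a <- rnorm(n)
toy <- data.frame(a = a, b = a + rnorm(n, sd = 0.1), c = rnorm(n))

cm <- cor(toy)
cm[lower.tri(cm, diag = TRUE)] <- NA  # keep each pair once, drop the diagonal

# Reshape to long format and keep pairs above the (assumed) 0.8 cutoff
high_pairs <- as.data.frame(as.table(cm))
names(high_pairs) <- c("var1", "var2", "correlation")
high_pairs <- high_pairs[!is.na(high_pairs$correlation) &
                           abs(high_pairs$correlation) > 0.8, ]
print(high_pairs)
```

Pairs flagged this way are candidates for dropping one variable, combining them, or relying on a regularized model.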

Scatterplot Matrix: Relationships Among Significant Variables

Description:

• The scatterplot matrix presents pairwise relationships between variables that are significantly correlated with CTR.

• Each plot shows the distribution and trends between two variables.

• Points are colored by CTR relative to its median: blue for CTR above the median, red for CTR at or below it.

Interpretation:

• The scatterplot of visual_appeal against targeting_score shows a positive linear trend between the two predictors, which are also the two strongest positive correlates of CTR.

• Plots involving headline_length show a slight negative trend, consistent with its negative correlation with CTR.

• Other scatterplots reveal varying degrees of clustering or randomness, suggesting different levels of predictive relationships.

Insights:

• Visual patterns in scatterplots highlight potential interactions or transformations (e.g., combining visual_appeal and targeting_score) that could enhance model performance.

• Clustering indicates the possibility of segmentation in the data, which might align with specific user behaviors or ad types.

# Scatterplot matrix for numeric variables using base R pairs()
significant_vars_excluding_ctr <- names(significant_vars)[names(significant_vars) != "CTR"]

# Select only significant numeric variables excluding CTR
selected_numeric_vars <- analysis_data %>%
  select(all_of(significant_vars_excluding_ctr))  # Use only variables related to CTR (excluding CTR itself)

# Scatterplot matrix for numeric variables related to CTR (excluding CTR)
pairs(
  selected_numeric_vars,
  main = "Scatterplot Matrix for Variables Related to CTR",
  pch = 21,
  bg = c("red", "blue")[as.factor(analysis_data$CTR > median(analysis_data$CTR))],
  col = "black",
  upper.panel = NULL
)

Feature Selection for Analysis and Scoring Data

Based on the EDA results, the most relevant features were selected for modeling:

• Analysis Data: The features visual_appeal, targeting_score, headline_length, cta_strength, ad_format, and body_word_count were kept, along with id and the target variable CTR.

• Scoring Data: The same features were selected, excluding CTR since it is not available in the scoring dataset.

Note that headline_word_count and body_text_length also exceeded the 0.05 threshold but were not carried forward. Restricting the models to this subset keeps them compact and focused on the variables most strongly correlated with CTR.

# Select relevant columns for analysis data (including ID and CTR)
analysis_data <- analysis_data %>%
  select(id, CTR, visual_appeal, targeting_score, headline_length, cta_strength, ad_format, body_word_count)

# Select relevant columns for scoring data (including ID without CTR)
scoring_data <- scoring_data %>%
  select(id, visual_appeal, targeting_score, headline_length, cta_strength, ad_format, body_word_count)

Feature Engineering

Based on the insights from the Exploratory Data Analysis, feature engineering was performed to enhance the predictive power of the dataset. The following steps describe the process:

Creation of Interaction Term:

• An interaction term (interaction_term) was introduced by multiplying visual_appeal and targeting_score, as these variables showed significant positive correlations with CTR.

• This term captures the combined effect of these two variables, which may provide additional explanatory power.

Log Transformation:

• The targeting_score variable was log-transformed (targeting_score_log) using the log1p function to handle skewness and compress outliers.

• Log transformation helps stabilize variance and improves model performance for highly skewed data.

Implementation:

• The above transformations were applied to both datasets using the add_features function, which ensures consistency in feature creation for training and scoring data.

##Feature Engineering

# Add interaction term and log-transformed targeting_score to both datasets
add_features <- function(df) {
  df %>%
    mutate(
      interaction_term = visual_appeal * targeting_score,
      targeting_score_log = log1p(targeting_score)
    )
}
# Apply feature engineering to both datasets
analysis_data <- add_features(analysis_data)
scoring_data <- add_features(scoring_data)
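
To make the two transformations concrete, the sketch below applies them to three made-up rows. The values are illustrative only; the column names mirror those used above.

```r
# Made-up values for illustration only
toy <- data.frame(
  visual_appeal   = c(0.2, 0.5, 0.9),
  targeting_score = c(0, 1, 99)
)

# Interaction term: product of the two predictors
toy$interaction_term <- toy$visual_appeal * toy$targeting_score

# log1p(x) computes log(1 + x), so a score of 0 stays 0 and large scores are compressed
toy$targeting_score_log <- log1p(toy$targeting_score)

print(toy)
```

Because log1p() is defined at zero, it avoids the -Inf that log() would produce for a score of 0.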

Model Development

To predict CTR effectively, multiple machine learning models were implemented, trained, and evaluated. The process consisted of several key steps:

Data Preparation for Modeling

Feature and Target Separation:

• Independent variables (x_train) were separated from the target variable (CTR) to facilitate training.

• The unique identifier column (id) was excluded as it does not provide predictive value.

Train-Test Split:

• The dataset was split into 80% training and 20% testing subsets to evaluate model performance on unseen data.

• The createDataPartition function from caret samples within quantile groups of CTR, so the training and testing subsets have similar CTR distributions.

# Split data into training and testing sets for model evaluation
set.seed(123)  # For reproducibility
train_index <- createDataPartition(analysis_data$CTR, p = 0.8, list = FALSE)
train_data <- analysis_data[train_index, ]
test_data <- analysis_data[-train_index, ]

# Build numeric feature matrices (used by XGBoost) and target vectors
x_train <- train_data %>% select(-CTR, -id) %>% as.matrix()
y_train <- train_data$CTR
x_test <- test_data %>% select(-CTR, -id) %>% as.matrix()
y_test <- test_data$CTR
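
The idea behind a representative split for a numeric target can be sketched in base R: bin the target by quantiles, then sample 80% within each bin. This is only an approximation of the idea under assumed settings (five bins, simulated target y), not caret's actual implementation.

```r
# Simulated numeric target for illustration
set.seed(123)
y <- runif(1000)

# Assign each observation to one of five quantile bins (the bin count is an assumption)
breaks <- quantile(y, probs = seq(0, 1, length.out = 6))
bins <- cut(y, breaks = breaks, include.lowest = TRUE)

# Sample 80% of the indices within each bin
train_idx <- unlist(lapply(split(seq_along(y), bins), function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}), use.names = FALSE)
test_idx <- setdiff(seq_along(y), train_idx)

# The two subsets should have similar target distributions
print(summary(y[train_idx]))
print(summary(y[test_idx]))
```

Sampling within bins keeps low and high target values represented in both subsets, which a purely random split does not guarantee for small datasets.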

Model Training

Linear Regression (labeled "Logistic Regression" in the code and output):

• A linear regression model was trained to establish a baseline for CTR prediction. CTR is treated as a continuous target, so this is a regression task rather than a classification task.

• The glm function with a Gaussian family and identity link, which is equivalent to ordinary least squares, was used to predict CTR, and Root Mean Squared Error (RMSE) was calculated to evaluate performance.

Random Forest:

• A Random Forest model was trained using 100 decision trees (ntree = 100), leveraging its ability to handle non-linear relationships.

• Predictions were generated and evaluated using RMSE.

Decision Tree:

• A Decision Tree model was built to capture simple decision-making paths for CTR prediction.

• RMSE was calculated for model comparison.

Model Comparison:

• RMSE Comparison: The RMSE values for the three initial models, the linear baseline labeled Logistic Regression (0.1377), Random Forest (0.0805), and Decision Tree (0.1052), were evaluated to determine their performance in predicting CTR.

• Performance Insights: Random Forest achieved the lowest RMSE, indicating it is the most effective baseline model among the three for capturing relationships in the data.

• Next Steps: The results show that while simpler models like the linear baseline and Decision Tree provide valuable benchmarks, more advanced techniques such as XGBoost may further improve performance by leveraging complex relationships in the data.
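
For reference, RMSE is the square root of the mean squared difference between actual and predicted values. The sketch below computes it by hand on made-up numbers; rmse_manual is a hypothetical helper name chosen to avoid clashing with any package function.

```r
# RMSE computed from its definition
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Made-up values: two exact predictions and one that is off by 2
actual    <- c(1, 2, 3)
predicted <- c(1, 2, 5)
result <- rmse_manual(actual, predicted)
print(result)
```

RMSE is expressed in the same units as the target, so the values reported here are directly comparable to the scale of CTR.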

# Linear regression baseline: a Gaussian GLM with identity link
# (reported under the label "Logistic Regression")
logistic_model <- glm(CTR ~ ., data = train_data %>% select(-id), family = gaussian(link = "identity"))
logistic_preds <- predict(logistic_model, newdata = test_data %>% select(-CTR, -id))
logistic_rmse <- rmse(y_test, logistic_preds)

# Random Forest Model
rf_model <- randomForest(CTR ~ ., data = train_data %>% select(-id), ntree = 100)
rf_preds <- predict(rf_model, newdata = test_data %>% select(-CTR, -id))
rf_rmse <- rmse(y_test, rf_preds)

# Decision Tree Model
dt_model <- rpart(CTR ~ ., data = train_data %>% select(-id))
dt_preds <- predict(dt_model, newdata = test_data %>% select(-CTR, -id))
dt_rmse <- rmse(y_test, dt_preds)

# Compare RMSE values for initial models
rmse_comparison <- data.frame(
  Model = c("Logistic Regression", "Random Forest", "Decision Tree"),
  RMSE = c(logistic_rmse, rf_rmse, dt_rmse)
)

print("RMSE Comparison (Before XGBoost):")
## [1] "RMSE Comparison (Before XGBoost):"
print(rmse_comparison)
##                 Model       RMSE
## 1 Logistic Regression 0.13765717
## 2       Random Forest 0.08046275
## 3       Decision Tree 0.10522640
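
The baseline fitted with glm() and a Gaussian family is a linear regression, not a logistic one. A small check on simulated data (the toy data frame and its coefficients are made up) shows that glm() with gaussian(link = "identity") and lm() return the same coefficients.

```r
# Simulated data for illustration
set.seed(7)
n <- 100
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 0.5 + 0.3 * toy$x1 - 0.2 * toy$x2 + rnorm(n, sd = 0.1)

# Gaussian GLM with identity link versus ordinary least squares
fit_glm <- glm(y ~ ., data = toy, family = gaussian(link = "identity"))
fit_lm  <- lm(y ~ ., data = toy)

# Largest absolute difference between the two sets of coefficients
coef_diff <- max(abs(coef(fit_glm) - coef(fit_lm)))
print(coef_diff)
```

A true logistic regression would use family = binomial and model a probability or binary outcome instead.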

Decision to Use XGBoost

Based on the performance of the initial models (Logistic Regression, Random Forest, and Decision Tree), XGBoost was selected for its ability to handle complex data relationships and its proven effectiveness in predictive modeling. XGBoost offers the following advantages:

• High Predictive Power: XGBoost combines decision trees with gradient boosting, allowing it to capture non-linear relationships and interactions effectively.

• Regularization: Built-in regularization (L1 and L2) helps prevent overfitting, making it suitable for datasets with potential multicollinearity.

• Efficiency: XGBoost is optimized for speed and scalability, ensuring efficient computation even with larger datasets.

• Flexibility: Its ability to handle missing data and support various objective functions makes it versatile for tasks like CTR prediction.

Given these strengths, XGBoost was chosen as the final model for further tuning and evaluation.
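
The core idea of gradient boosting with squared error can be sketched in a few lines: start from the mean, then repeatedly fit a small tree to the current residuals and add a shrunken version of its predictions. The sketch below uses rpart on simulated data and is a simplified illustration, not XGBoost itself (it omits XGBoost's second-order updates and regularization). The values of eta, n_rounds, and maxdepth are arbitrary assumptions.

```r
library(rpart)

# Simulated non-linear data for illustration
set.seed(1)
n <- 300
toy <- data.frame(x = runif(n, 0, 2 * pi))
toy$y <- sin(toy$x) + rnorm(n, sd = 0.1)

eta <- 0.3       # learning rate (assumed)
n_rounds <- 50   # number of boosting rounds (assumed)

pred <- rep(mean(toy$y), n)     # initial prediction: the mean of the target
rmse_path <- numeric(n_rounds)  # training RMSE after each round

for (m in seq_len(n_rounds)) {
  # For squared error, the negative gradient is the residual
  toy$resid <- toy$y - pred
  tree <- rpart(resid ~ x, data = toy, control = rpart.control(maxdepth = 2))
  # Add a shrunken correction from the new tree
  pred <- pred + eta * predict(tree, newdata = toy)
  rmse_path[m] <- sqrt(mean((toy$y - pred)^2))
}

print(round(rmse_path[c(1, 10, n_rounds)], 4))
```

The training RMSE falls as rounds accumulate, which is why the learning rate, tree depth, and regularization terms need tuning to avoid overfitting.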

Advanced Hyperparameter Tuning Using Bayesian Optimization

To improve the performance of the XGBoost model, Bayesian Optimization was used for hyperparameter tuning. The process is explained below:

Objective Function:

• An objective function was defined to minimize the RMSE of the XGBoost model using cross-validation (xgb.cv).

• Key parameters tuned include:

  • max_depth: Maximum tree depth; deeper trees can fit more complex patterns.
  • eta: The learning rate, balancing model accuracy and convergence speed.
  • gamma: Minimum loss reduction required to make a further split, which penalizes model complexity.
  • colsample_bytree: Fraction of columns sampled per tree, reducing overfitting.
  • min_child_weight: Minimum sum of instance weights required in a child node; larger values make the model more conservative about splitting.
  • subsample: Fraction of training rows sampled per boosting round, adding randomness.
  • lambda and alpha: L2 and L1 regularization terms on the leaf weights, reducing overfitting.

Optimization Strategy:

• Bayesian Optimization maximizes the returned Score, so the objective function returns the negative RMSE; maximizing it is equivalent to minimizing RMSE.

• A 5-fold cross-validation was used to ensure robust evaluation of parameter combinations.

Purpose:

• By automating the search for good hyperparameter values, Bayesian Optimization helps improve model performance, typically with fewer model fits than an exhaustive grid search.
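
The cross-validated score that the optimizer works with can be illustrated in base R. The sketch below runs 5-fold cross-validation with a simple lm() on simulated data, averages the fold RMSEs, and negates the result to obtain a score to maximize. The data and the use of lm() in place of XGBoost are assumptions made to keep the example self-contained.

```r
# Simulated data for illustration
set.seed(123)
n <- 200
toy <- data.frame(x = runif(n))
toy$y <- 2 * toy$x + rnorm(n, sd = 0.1)

# Randomly assign each row to one of k folds
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))

# Fit on k - 1 folds, evaluate RMSE on the held-out fold
fold_rmse <- sapply(seq_len(k), function(f) {
  fit  <- lm(y ~ x, data = toy[fold_id != f, ])
  pred <- predict(fit, newdata = toy[fold_id == f, ])
  sqrt(mean((toy$y[fold_id == f] - pred)^2))
})

cv_rmse <- mean(fold_rmse)  # cross-validated RMSE
score   <- -cv_rmse         # negated, because the optimizer maximizes the score
print(c(cv_rmse = cv_rmse, score = score))
```

Each candidate hyperparameter set is scored this way, so the optimizer compares configurations on held-out error rather than training error.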

# Bayesian Optimization Function for Advanced Hyperparameter Tuning
objective_function <- function(max_depth, eta, gamma, colsample_bytree, min_child_weight, subsample, lambda, alpha) {
  cv <- xgb.cv(
    data = x_train,
    label = y_train,
    nrounds = 100,
    objective = "reg:squarederror",
    eval_metric = "rmse",
    nfold = 5,
    verbose = 0,
    params = list(
      max_depth = max_depth,
      eta = eta,
      gamma = gamma,
      colsample_bytree = colsample_bytree,
      min_child_weight = min_child_weight,
      subsample = subsample,
      lambda = lambda,  # L2 regularization
      alpha = alpha      # L1 regularization
    )
  )
  list(Score = -min(cv$evaluation_log$test_rmse_mean)) # Bayesian optimization maximizes, so negate RMSE
}

# Define the bounds for each parameter in Bayesian Optimization
bounds <- list(
  max_depth = c(3L, 10L),
  eta = c(0.01, 0.3),
  gamma = c(0, 5),
  colsample_bytree = c(0.3, 1),
  min_child_weight = c(1, 10),
  subsample = c(0.5, 1),
  lambda = c(0, 10),  # Set bounds for L2 regularization
  alpha = c(0, 10)    # Set bounds for L1 regularization
)
# Perform Bayesian Optimization
opt_result <- bayesOpt(
  FUN = objective_function,
  bounds = bounds,
  initPoints = 10,    # Number of initial random points
  iters.n = 10       # Number of optimization iterations
)
# Get the best parameters
best_params <- getBestPars(opt_result)

# Train the final XGBoost model with the best parameters
xgb_model <- xgboost(
  data = x_train,
  label = y_train,
  objective = "reg:squarederror",
  eval_metric = "rmse",
  max_depth = best_params$max_depth,
  eta = best_params$eta,
  gamma = best_params$gamma,
  colsample_bytree = best_params$colsample_bytree,
  min_child_weight = best_params$min_child_weight,
  subsample = best_params$subsample,
  lambda = best_params$lambda,  # Use best L2 regularization parameter
  alpha = best_params$alpha,    # Use best L1 regularization parameter
  nrounds = 100,
  verbose = FALSE
)

Model Evaluation

RMSE Comparison Across Models

The bar plot visualizes the RMSE values for the four models—Logistic Regression, Decision Tree, Random Forest, and XGBoost—allowing for a direct comparison of their predictive performance.

Results and Observations

Key Observations and Insights

• Logistic Regression (the linear baseline): Has the highest RMSE among all models, indicating that a linear fit is not well-suited for capturing the non-linear relationships in the dataset.

• Decision Tree: Shows an improvement over the linear baseline but still has a relatively high RMSE, suggesting its simplicity limits its performance.

• Random Forest: Significantly reduces RMSE compared to the linear baseline and Decision Tree, demonstrating its ability to handle more complex relationships.

• XGBoost: Achieves the lowest RMSE, outperforming all other models. This highlights its strength in capturing intricate patterns and handling the complexity of the dataset.

Conclusion

XGBoost was chosen as the final model for CTR prediction because it achieved the lowest test RMSE (0.0688, versus 0.0805 for Random Forest). Its gradient boosting, built-in regularization, and efficient computation make it well suited to this problem. Note that only XGBoost was tuned with Bayesian Optimization, while the baseline models used largely default settings, so part of the gap may reflect tuning rather than the algorithm alone.

# Calculate RMSE for XGBoost
xgb_preds <- predict(xgb_model, x_test)
xgb_rmse <- rmse(y_test, xgb_preds)

# Add XGBoost RMSE to comparison
rmse_comparison <- rbind(
  rmse_comparison,
  data.frame(Model = "XGBoost", RMSE = xgb_rmse)
)

# Print RMSE comparison
print("Final RMSE Comparison:")
## [1] "Final RMSE Comparison:"
print(rmse_comparison)
##                 Model       RMSE
## 1 Logistic Regression 0.13765717
## 2       Random Forest 0.08046275
## 3       Decision Tree 0.10522640
## 4             XGBoost 0.06878627
# Visualize RMSE comparison
ggplot(rmse_comparison, aes(x = reorder(Model, RMSE), y = RMSE, fill = Model)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "RMSE Comparison Across Models",
    x = "Model",
    y = "RMSE"
  )

Executive Summary

This project aimed to predict Click-Through Rate (CTR) in digital advertising, a critical metric for measuring the effectiveness of display advertisements. Leveraging a robust dataset with variables describing the quality, relevance, type, audience, and content of advertisements, the project followed a structured pipeline to ensure accurate and actionable insights.

Key Highlights

1. Exploratory Data Analysis (EDA):

• Relationships between predictors and CTR were analyzed, revealing the largest positive correlations for visual_appeal (0.50), targeting_score (0.33), and cta_strength (0.12).

• Visualizations, including correlation heatmaps and scatterplot matrices, were used to understand feature interactions and patterns.

2. Feature Engineering:

• New features, such as an interaction term between visual_appeal and targeting_score, and a log-transformed targeting_score, were created to enhance model performance.

• Feature selection narrowed the dataset to key variables with the highest predictive potential.

3. Model Comparison:

• Three baseline models (a linear regression labeled Logistic Regression, Random Forest, and Decision Tree) were evaluated based on RMSE.

• Random Forest was the best performer among the baseline models (RMSE 0.0805), suggesting that non-linear, ensemble-based methods suit this data.

4. XGBoost Model Selection and Optimization:

• XGBoost was chosen for its ability to handle complex relationships and because it achieved a lower RMSE than the other models.

• Bayesian Optimization was applied to fine-tune hyperparameters, including the L1 and L2 regularization terms that help limit overfitting.

5. Results and Deliverables:

• XGBoost achieved the lowest RMSE among all models (0.0688 on the held-out test set), making it the preferred model for CTR prediction.

• Predictions for the scoring dataset were generated and saved for submission, providing actionable insights for optimizing advertising strategies.
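
The step of generating and saving predictions is not shown in the report. A minimal sketch of what it could look like follows, using a toy lm() model and simulated data in place of the tuned XGBoost model. The output column names (id, CTR) and the CSV format are assumptions, and the file is written to a temporary path.

```r
# Simulated training and scoring data for illustration
set.seed(11)
train <- data.frame(id = 1:50, x = runif(50))
train$CTR <- 0.1 + 0.2 * train$x + rnorm(50, sd = 0.01)
scoring <- data.frame(id = 51:60, x = runif(10))

# Fit a stand-in model and predict CTR for the scoring rows
fit <- lm(CTR ~ x, data = train)
submission <- data.frame(
  id  = scoring$id,
  CTR = predict(fit, newdata = scoring)
)

# Write the predictions to a CSV file (temporary path used here)
out_file <- tempfile(fileext = ".csv")
write.csv(submission, out_file, row.names = FALSE)

# Read the file back to confirm its structure
check <- read.csv(out_file)
print(head(check))
```

For the actual project, the scoring data would need the same feature engineering and column order as the training matrix before calling predict().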