Table of Contents
1. ABSTRACT
2. INTRODUCTION
3. DATA SET
4. FEATURE ENGINEERING
5. TOOLS REQUIRED
6. MODEL DEVELOPMENT
7. ALGORITHM USED
8. MODEL VALIDATION
9. ASSUMPTIONS
10. RESULTS
11. USE CASES
12. DATA SCIENTIST’S OWN CONCLUSIONS
13. FUTURE CONSIDERATIONS
14. REFERENCES
ABSTRACT
This report describes the Call-To-Action model developed by Loxz Digital. The model provides predictive analytics for specific call-to-action (CTA) buttons included in an email campaign before the campaign is deployed, so that the engagement rate can be improved based on the model predictions and recommendations. The CTA model recommends the color and text of the chosen CTA buttons that will give the highest engagement rates for the selected model parameters or inputs.
The current dataset contains 566 samples with features including call-to-action color and text. The machine learning algorithm used in this model is Random Forest Regression, which is an ensemble of decision trees. The model achieves its highest accuracy, 80.49%, with our current dataset. We believe that the model can achieve a higher accuracy using transfer learning from additional data.
I. INTRODUCTION
Calls-to-action (CTAs) are used in email campaigns to prompt users to take an intended action in a particular campaign or webpage [7]. The action can be buying a product, signing up for an event, subscribing to emails, and so on. It is encouraged by a button with a specific color and specific text that attracts the user to convert or complete that action.
Our call-to-action model is developed specifically to provide predictive analytics on CTA button colors and texts in email campaigns. The model predicts the selected email engagement target metric for an email campaign and recommends how to optimize the color or text of the CTAs in the email to maximize that metric. These call-to-action emails can belong to any campaign type in any industry that the campaign engineer provides as inputs. Using our call-to-action model, users can identify in real time the optimum color and text for the CTA button to achieve the best possible outcome. The campaign engineer is empowered to complete these “runs” to serve predictions within the workflow of the campaign.
II. DATA SET
The model uses 566 emails across eight different industries and four campaign types. Table 1 shows the distribution of emails across industries and Figure 1 shows the percentages of emails across the industries considered for the CTA model. This set of emails was selected from a proprietary collection after filtering exclusively for emails that contain CTA buttons with colors and texts. The email data samples are carefully curated emails in the ‘.eml’ format belonging to different campaigns and industries.
The campaign types are listed below. Currently, promotional campaigns account for the largest number of emails because promotions contain more calls-to-action. We plan to introduce additional email datasets for the other campaign types and to extend the list of campaign types in the future.
- Abandoned Cart
- Newsletter
- Promotional
- Transactional
The model extracts its features from the body of the email. The model features are the CTA color, CTA text, industry type, and campaign type. The target variables considered are Click-To-Open Rate and Conversion Rate. These target variables can be customized and should be considered when leveraging the model. For the CTA model, Open Rate was disregarded because CTAs appear only after an email is opened and therefore cannot affect the Open Rate. We are also considering implementing Revenue per email in the next version.
Due to the lack of publicly available data for the target variables, we refer to multiple email campaign benchmark resources to generate a set of data within predefined normalized distribution ranges [1–3, 6, 10]. This approach allows us to build the model now for potential later use in email campaigns with non-synthetic data. Currently we use resources from Campaign Monitor, Ruler Analytics, ContentGrip, Cross-Channel Marketing Automation Platform and CM Commerce. Benchmark data is important for Loxz to establish a baseline. We then consider options to enhance the variance between each recommendation.
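As a minimal sketch of how benchmark averages could be turned into target values, the snippet below draws each value from a normal distribution centered on a benchmark average and clips it to a predefined range. The function name, the industry label, the benchmark figures, and the spread are illustrative assumptions, not values taken from the cited benchmarks or from the report's actual procedure.

```python
import random


def sample_target(mean, low, high, spread=0.25, rng=None):
    """Draw one rate (in percent) near `mean`, clipped to [low, high].

    `spread` is a hypothetical fraction of the range used as the
    standard deviation of the normal distribution.
    """
    rng = rng or random.Random()
    sigma = (high - low) * spread
    value = rng.gauss(mean, sigma)
    return min(max(value, low), high)


# Hypothetical benchmark figures (percent); not taken from the cited sources.
BENCHMARKS = {
    ("Retail", "Promotional"): {"mean": 12.0, "low": 5.0, "high": 20.0},
}

rng = random.Random(0)  # fixed seed so repeated runs give the same samples
bench = BENCHMARKS[("Retail", "Promotional")]
click_to_open = [sample_target(rng=rng, **bench) for _ in range(5)]
```

Clipping keeps every generated value inside the benchmark range, at the cost of a small pile-up at the range boundaries.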
III. FEATURE ENGINEERING
The feature set consists of industry type, campaign type, CTA color and CTA text. The industry type and campaign type are directly available from the curated email collection, in which the emails are already categorized, or labeled, by industry and campaign type.
For the CTA color and CTA text, the email content had to be parsed carefully to identify CTA buttons. This is part of the data preparation phase. The filter first goes through the email, which is in HTML format, and identifies the sections with links to websites. If those sections have the predefined styling characteristics of a CTA button, they are confirmed as CTA buttons. The specific characteristics considered in our filter are listed below. They are CSS properties used to style an HTML element. If the following properties are used to style the element, it is considered a CTA button.
- background-color: Background color of the button
- display: The way the button is displayed
- border-radius: The rounded corners of the button
Figure 2 shows an example of an HTML element for a CTA button with the relevant properties highlighted.
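The filter described above can be sketched as follows. The report uses BeautifulSoup for parsing; this sketch swaps in Python's standard-library `html.parser` so that it runs without extra packages. It also assumes that the three CSS properties appear in the inline `style` attribute of the link itself, which may not hold for every email layout. The class and helper names are hypothetical.

```python
from html.parser import HTMLParser

CTA_STYLE_PROPERTIES = {"background-color", "display", "border-radius"}


def parse_style(style):
    """Turn an inline style string into a {property: value} dict."""
    declarations = {}
    for part in style.split(";"):
        if ":" in part:
            name, value = part.split(":", 1)
            declarations[name.strip().lower()] = value.strip()
    return declarations


class CTAFinder(HTMLParser):
    """Collect links whose inline style carries all three CTA properties."""

    def __init__(self):
        super().__init__()
        self.ctas = []        # finished (color, text) pairs
        self._current = None  # CTA link currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        style = parse_style(attributes.get("style") or "")
        if attributes.get("href") and CTA_STYLE_PROPERTIES.issubset(style):
            self._current = {"color": style["background-color"], "text": []}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"].append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            # Collapse runs of whitespace inside the button label.
            text = " ".join("".join(self._current["text"]).split())
            self.ctas.append((self._current["color"], text))
            self._current = None


html = (
    '<p>Hi</p>'
    '<a href="https://example.com/shop" style="background-color:#FF5733; '
    'display:inline-block; border-radius:4px;">Shop Now</a>'
    '<a href="https://example.com/unsubscribe">Unsubscribe</a>'
)
finder = CTAFinder()
finder.feed(html)
print(finder.ctas)
```

Requiring all three properties favors precision over recall: a plain text link such as the unsubscribe link is skipped, and so is a real button that is styled some other way.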
After the sections corresponding to CTA buttons are filtered, the color and text are extracted. The color of the button is defined by the CSS property background-color. The value can be a Hex value, an RGB value, or the standard name of a color. For consistency, all CTA color feature values are converted into their Hex values. The CTA text embedded within the section is extracted using the BeautifulSoup web-scraping package [12].
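The color normalization step might look like the sketch below. The report relies on the Webcolors package for this conversion; to stay self-contained, the sketch uses only the standard library and a small illustrative subset of CSS color names. The function name `to_hex` and the choice of lowercase six-digit output are assumptions.

```python
import re

# Small illustrative subset of CSS color names; a full table would come
# from a package such as Webcolors.
NAMED_COLORS = {
    "red": "#ff0000", "green": "#008000", "blue": "#0000ff",
    "black": "#000000", "white": "#ffffff", "orange": "#ffa500",
}

RGB_PATTERN = r"rgb\(\s*(\d{1,3})\s*,\s*(\d{1,3})\s*,\s*(\d{1,3})\s*\)"


def to_hex(value):
    """Normalize a CSS color (hex, rgb(), or name) to lowercase #rrggbb."""
    value = value.strip().lower()
    if re.fullmatch(r"#[0-9a-f]{6}", value):
        return value
    if re.fullmatch(r"#[0-9a-f]{3}", value):
        # Expand shorthand such as #fa0 to #ffaa00.
        return "#" + "".join(ch * 2 for ch in value[1:])
    match = re.fullmatch(RGB_PATTERN, value)
    if match:
        channels = [min(int(group), 255) for group in match.groups()]
        return "#{:02x}{:02x}{:02x}".format(*channels)
    if value in NAMED_COLORS:
        return NAMED_COLORS[value]
    raise ValueError(f"unrecognized color: {value!r}")
```

Mapping every notation to one canonical form means the same button color always produces the same feature value, regardless of how the email author wrote it.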
To further validate that the text belongs to a particular CTA, the extracted text is cross-validated with a curated list of commonly used CTA texts and verbs. This process of filtering results in a set of information that exclusively belongs to CTA buttons.
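A minimal sketch of this cross-validation step is shown below. The curated list used in the report is not published, so the phrases and verbs here are hypothetical placeholders, and the matching rule (exact phrase match, or a leading CTA verb) is an assumption.

```python
# Hypothetical curated vocabulary; the report's actual list is not published.
CTA_PHRASES = {"shop now", "buy now", "learn more", "sign up", "subscribe"}
CTA_VERBS = {"shop", "buy", "get", "start", "join", "download", "register"}


def is_cta_text(text):
    """Return True if the text matches a curated phrase or starts with a CTA verb."""
    words = text.lower().split()
    if not words:
        return False
    normalized = " ".join(words)
    return normalized in CTA_PHRASES or words[0].strip(".,!") in CTA_VERBS
```

For example, `is_cta_text("Get 20% Off!")` passes because of its leading verb, while `is_cta_text("Unsubscribe")` does not.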
IV. TOOLS REQUIRED
The model was implemented in Python in a Jupyter Notebook instance on AWS SageMaker [8]. For CTA feature extraction, the BeautifulSoup web-scraping package was used [12]. To identify and display CTA colors, both the Webcolors and Colr packages were used [4, 13]. Machine learning tasks were implemented using the Scikit-learn package for Python [11].
V. MODEL DEVELOPMENT
After generating the features for the dataset as described in Section III, the next step is model development. The model predicts Click-to-open rate and Conversion rate, both of which are continuous values and therefore require a regression model. The regression model takes the set of features and the target variable as inputs and is trained to make predictions. There are four features and two target variables in total, but the model is trained on one selected target variable at a time based on the user’s inputs. For the machine learning model, tree-based algorithms are considered for their simplicity. Both Random Forest and XGBoost were evaluated and, based on their performance, Random Forest regression is used for model development [5, 9]. The algorithm is used with its default hyperparameter values, except that random_state is fixed to keep results consistent. The dataset is normalized with the L2 norm before being fed into the regression model.
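The training setup described above could be sketched with Scikit-learn as follows. The report does not say how the four categorical features are encoded, so the ordinal encoding here is an assumption, as are the function name, the example rows, and the target values. The L2 normalization and the default-hyperparameter Random Forest with a fixed random_state follow the description.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer, OrdinalEncoder


def build_cta_regressor(random_state=0):
    """Encode the four categorical features, L2-normalize, then fit a forest."""
    return make_pipeline(
        # Assumption: ordinal codes for categories; unseen values map to -1.
        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
        Normalizer(norm="l2"),  # scales each sample (row) to unit L2 norm
        RandomForestRegressor(random_state=random_state),  # other settings default
    )


# Made-up rows: [industry, campaign type, CTA color, CTA text]
X = [
    ["Retail", "Promotional", "#ff5733", "shop now"],
    ["Retail", "Newsletter", "#0000ff", "learn more"],
    ["Software", "Promotional", "#008000", "sign up"],
    ["Software", "Transactional", "#ff5733", "view order"],
]
y_click_to_open = [14.2, 9.8, 12.5, 7.1]  # made-up rates in percent

model = build_cta_regressor()
model.fit(X, y_click_to_open)  # one target variable per trained model
prediction = model.predict([["Retail", "Promotional", "#0000ff", "shop now"]])
```

Training a second model for conversion rate would reuse `build_cta_regressor()` with a different target list, matching the one-target-at-a-time design.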
Apart from the prediction and its accuracy (“the output”), the model is further developed to give the user three recommendations. Currently, the recommendations are selected from the historical data and are output “upon run” to the campaign engineer as the best CTA color and/or text combinations for that particular email campaign.
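One simple way to select recommendations from historical data is sketched below: rank the color and text combinations seen for the same industry and campaign type by their average historical rate and keep the top three. The record layout, the function name, and the ranking rule are assumptions; the report does not describe the selection logic in detail.

```python
from collections import defaultdict
from statistics import mean


def recommend(history, industry, campaign, top_n=3):
    """Rank (color, text) pairs by their average historical rate for one segment."""
    rates = defaultdict(list)
    for row in history:
        if row["industry"] == industry and row["campaign"] == campaign:
            rates[(row["color"], row["text"])].append(row["rate"])
    ranked = sorted(rates.items(), key=lambda item: mean(item[1]), reverse=True)
    return [(color, text, mean(values)) for (color, text), values in ranked[:top_n]]


# Made-up historical records (rates in percent).
history = [
    {"industry": "Retail", "campaign": "Promotional", "color": "#ff5733", "text": "shop now", "rate": 14.0},
    {"industry": "Retail", "campaign": "Promotional", "color": "#ff5733", "text": "shop now", "rate": 16.0},
    {"industry": "Retail", "campaign": "Promotional", "color": "#0000ff", "text": "learn more", "rate": 9.0},
    {"industry": "Retail", "campaign": "Newsletter", "color": "#008000", "text": "sign up", "rate": 20.0},
]
recommendations = recommend(history, "Retail", "Promotional")
```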
The user interface is developed with the option for the user to upload the HTML email for the campaign and select parameters (industry and campaign). These parameters can vary in scope and number.
The user then selects which engagement rate the model should predict, as well as whether the CTA color, the CTA text, or both should be optimized based on the predictions. The features are extracted from the uploaded email with the same process described in Section III.
VI. ALGORITHM USED
For machine learning, the CTA model uses the Random Forest regression algorithm [9]. Tree-based algorithms are easier to interpret than many other algorithms. Random Forest is a tree-based ensemble method that uses bagging (bootstrap aggregating): each tree is trained on a bootstrap sample of the data and, for regression, the model output is the average of the predictions of the individual trees. The Random Forest regression model implemented in the Scikit-learn package is used directly for the call-to-action model development.
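For regression, the aggregation step of a Random Forest can be written compactly. The symbols below are chosen here for illustration: $B$ is the number of trees, $T_b$ is the tree fitted on the $b$-th bootstrap sample of the training data, and $x$ is the feature vector of one email.

```latex
\hat{y}(x) = \frac{1}{B} \sum_{b=1}^{B} T_b(x)
```

Averaging many trees trained on different bootstrap samples reduces the variance of the prediction compared with a single decision tree.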
VII. MODEL VALIDATION
The model is validated by measuring how well its predictions fit the data using the coefficient of determination (R²), computed separately for each target variable. The resulting scores, together with the comparison between the Random Forest and XGBoost algorithms, are reported in Section IX.
VIII. ASSUMPTIONS
Several assumptions had to be made during the process. The filters used to extract CTAs from emails will not extract every CTA every time. To extract CTAs accurately and avoid extracting non-CTA information, the filters are built with specific conditions based on information from historical data. As a result, the filters might miss a small number of CTAs within some emails, but there is a high probability that everything they do extract is a genuine CTA. Because real target data is not available for the model, it relies on email benchmark data, and assumptions are made about the ranges of the target variable distributions based on the average benchmarks. In addition, the target variable values are assigned under the assumption that certain colors and texts, such as brighter colors, have higher engagement.
IX. RESULTS
The accuracy of the model is measured using the coefficient of determination, R². It represents how well the model fits the data, and the higher the value, the better the predictions. First, the performance of the Random Forest and XGBoost algorithms was compared in order to select the best regression algorithm for the model. Figure 4 depicts the performance comparison between the two algorithms. As shown in the figure, Random Forest provides higher accuracy than XGBoost in this scenario. Since the runtimes did not appear to affect model performance, Random Forest is selected for the model based on accuracy.
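For reference, the R² metric can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The short function below is a plain-Python sketch of that standard definition (Scikit-learn provides the same metric as `sklearn.metrics.r2_score`). The function name and the example values are illustrative only.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes y_true is not constant, otherwise SS_tot is zero.
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative values only; not taken from the report's dataset.
score = r_squared([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
```

A score of 1.0 means the predictions match the observed values exactly, and a score of 0.0 means the model does no better than always predicting the mean.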
Using the Random Forest algorithm for predictions, the call-to-action model achieves an accuracy score of 80.49% for click-to-open rate and 73.39% for conversion rate.
X. USE CASES
The call-to-action model is developed for email marketing campaigns. It is intended to be used by campaign engineers within the campaign workflow to identify ways to increase user engagement prior to deployment, based on given parameters or inputs. Running two or more models together can yield higher engagement rates, and this would be considered a multi-modal campaign. The current model provides predictive analytics for click-to-open rate and conversion rate. The campaign engineer uploads the campaign email and selects the industry and campaign type. Once again, these input elements can be optimized or increased. The engineer then selects the preferred engagement rate and the call-to-action property to optimize (color, text, or both). This part of the model interface is shown in Figure 5.
After the parameters are selected, the CTA model first scans the campaign email for CTAs and displays them to the user. If the email contains multiple CTAs, the model lists all of them and lets the user select one or more for predictions and recommendations. We firmly believe that offering the combination of text and color as an output, and basing the recommendations on those outputs, will enhance campaign engagement rates. On that basis, the campaign engineer can optimize multiple CTAs for the campaign upon “run.” Figure 6 shows an example of an email with multiple CTAs in which more than one CTA is selected for predictions.
Based on the parameters selected by the user, the model then runs on the given campaign email and provides the predicted rate together with recommendations for increasing that rate by optimizing the CTA color and/or text. We use three recommendations in our model examples. Finally, the model displays the model prediction and overall accuracy. An example of recommendations and model accuracy is given in Figures 7 and 8. Based on these model outputs, the campaign engineer can decide on the CTA properties most likely to yield the highest engagement rate. This is all done prior to deployment and helps ensure a potentially successful campaign. Using the call-to-action model, the campaign engineer can foresee the outcome of the campaign and act to improve it using various workflow tools in her UI.
XI. DATA SCIENTIST’S OWN CONCLUSIONS
The call-to-action model is developed to provide predictive analytics for email campaigns by giving recommendations to improve outcomes within a few seconds. According to the results, the CTA model is able to provide high accuracies for the target variables using the current dataset.
For the click-to-open rate the model achieves an accuracy of 80.49%, and for the conversion rate 73.39%, which validates the performance of the CTA model. As these results are based on our current dataset, the model has room for improvement with further tuning. These results show the potential of the model to be used in real-world scenarios. As these results are based on benchmark data, the model is expected to gain further information and improve its performance when used in real time.
XII. FUTURE CONSIDERATIONS
For immediate future work, the model will be extended to predict Revenue-per-email values. Additional datasets will be added to improve the recommendations made by the model. The current model does not use NLP to obtain information from the email text or the CTA texts. In the future, NLP will be incorporated to make better predictions, including predictions for custom CTA text suggested by the user. We have currently deployed a semantic model internally to curate alternative text from user inputs. The model will potentially be updated to identify the approximate location of the CTA button within the email and to use it in predictions when the email contains multiple CTAs. The current machine learning model uses default hyperparameters, so hyperparameter tuning will be carried out to further improve performance.
References
[1] 2016. Ecommerce Email Marketing Benchmarks. https://cm-commerce.com/academy/email-marketing-benchmarks/
[2] 2018. Email Conversion Rate Benchmarks. https://www.listrak.com/white-papers/2018-email-benchmarks
[3] 2021. Ultimate Email Marketing Benchmarks for 2021: By Industry and Day. https://www.campaignmonitor.com/resources/guides/email-marketing-benchmarks/
[4] James Bennett. 2022. Module contents — webcolors 1.12 documentation. https://webcolors.readthedocs.io/en/1.12/contents.html
[5] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 785–794.
[6] Katie Holmes. 2021. Average Conversion Rate by Industry and Marketing Source. https://www.ruleranalytics.com/blog/insight/conversion-rate-by-industry/
[7] Kimberly Huang. 2021. Calls-To-Action: Best Practices in Email Marketing [Guide] - Litmus. https://www.litmus.com/blog/click-tap-and-touch-a-guide-to-cta-best-practices/
[8] Ameet V Joshi. 2020. Amazon’s Machine Learning Toolkit: Sagemaker. In Machine Learning and Artificial Intelligence. Springer Nature, Chapter 24, 233–243. https://doi.org/10.1007/978-3-030-26622-6_24
[9] Yanli Liu, Yourong Wang, and Jian Zhang. 2012. New machine learning algorithm: Random forest. In International Conference on Information Computing and Applications. Springer, 246–252.
[10] Enricko Lukman. 2021. What is a good conversion rate for your business? https://www.contentgrip.com/conversion-rate-business-benchmark/
[11] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830. http://scikit-learn.sourceforge.net.
[12] Leonard Richardson. 2007. Beautiful Soup documentation. Available: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ [Accessed: 7 July 2018].
[13] Christopher Welborn. 2019. Colr · PyPI. https://pypi.org/project/Colr/