What is A/B Testing?
At its core, A/B testing compares the results of two variants and tells us which one performs better. This superiority could be measured, for instance, as a higher conversion rate for an online business or as the efficacy of a medicine compared to a placebo.
How to Conduct A/B Testing in General?
To conduct A/B testing, we first need to specify the metric we want to assess under different circumstances, such as conversion rate or click-through rate. Afterward, we should formulate a null hypothesis such as “There is no difference in the chosen metric between group A and group B.” Statistical significance can then be examined through the p-value against a significance level (typically 0.05) or through a confidence interval (typically, if the 95% confidence interval does not include the value stated in the null hypothesis, the result is statistically significant).
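As a minimal illustration of this step, the sketch below tests the null hypothesis of equal conversion rates with a two-proportion z-test; the visitor and conversion counts are hypothetical and would come from your own experiment.

```python
# A minimal sketch of a two-proportion z-test for a conversion-rate A/B test.
# The counts below are hypothetical; replace them with your own experiment data.
from math import sqrt, erfc

def two_proportion_z_test(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Test H0: 'group A and group B have the same conversion rate'."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled proportion under the null hypothesis of no difference.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = erfc(abs(z) / sqrt(2))          # two-sided p-value
    # 95% confidence interval for the difference in conversion rates.
    se_diff = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    ci = (p_b - p_a - 1.96 * se_diff, p_b - p_a + 1.96 * se_diff)
    return z, p_value, ci, p_value < alpha

z, p, ci, significant = two_proportion_z_test(conv_a=210, n_a=5000, conv_b=262, n_b=5000)
print(f"z={z:.2f}, p-value={p:.4f}, 95% CI for difference={ci}, reject H0: {significant}")
```

If the p-value falls below the chosen significance level (or the confidence interval excludes zero difference), we reject the null hypothesis and conclude the variants differ on the chosen metric.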
Now it is time to assign each group in the experiment its proper label: control group or treatment group. These two groups are the building blocks of A/B testing and must be logically comparable: they should share the same controlled variables and differ only in the independent (target) variable, the change whose effect on our previously defined metric we want to measure. To be confident that the results derived from the experiment on our samples are representative of the population, the size of the sample and the length of time the experiment runs are of utmost importance. Also, to avoid spurious correlations, we should watch for confounding factors, i.e. a third factor that drives both of the variables we are comparing, which may lead us to wrongly conclude that one variable directly affects the other. It is therefore essential to check whether the controlled variables are correlated with the target variable, and the target variables themselves should not be correlated with one another. There are two primary techniques for running A/B testing: randomised and matched pair designs. We use randomised designs when the volume and velocity of the data are high, such as website visits; we can use matched pair designs when we want to hold all the controlled variables fixed except for the one under investigation. In a matched pair design, each member of a pair is assigned to either the control group or the treatment group. Note that a pair could also be the same person/object at a different time or under another circumstance.
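As a rough illustration of the two designs, the sketch below assigns hypothetical visitors and matched pairs to groups; the identifiers are made up and the assignment logic is deliberately minimal.

```python
# A rough sketch of the two assignment schemes described above; the visitor and
# pair identifiers are hypothetical.
import random

def randomised_assignment(visitor_id: str) -> str:
    """Randomised design: each visitor is independently assigned to a group."""
    return "treatment" if random.random() < 0.5 else "control"

def matched_pair_assignment(pairs):
    """Matched pair design: within each pair of similar subjects,
    one member goes to the control group and the other to the treatment group."""
    assignment = {}
    for subject_a, subject_b in pairs:
        first, second = random.sample([subject_a, subject_b], 2)
        assignment[first] = "control"
        assignment[second] = "treatment"
    return assignment

print(randomised_assignment("visitor-42"))
print(matched_pair_assignment([("user-1a", "user-1b"), ("user-2a", "user-2b")]))
```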

Why Run A/B Testing for ML Models?
Machine learning is a subset of artificial intelligence concerned with extracting patterns from the relationship(s) between the inputs and outputs of a given dataset. These patterns are then used to make predictions when a new set of inputs is given to the model, without explicitly programming the rules. The major types of ML models are classification and regression models. The former is helpful in situations where we want to classify the type of a product, and the latter when we need to answer questions such as “how many units will this product sell?” In both cases, A/B testing is helpful when we bring the ML model to the production environment. A/B testing becomes essential when a model is retrained with a different dataset, algorithm, or framework, or after hyperparameter tuning, and we need to decide whether to replace the current model. This is because all testing against the required KPIs in the ML world is done offline on a specific dataset, so no causal relationship between the model and the user outcome can be established to guarantee high performance in the real world.
A/B Testing Use Cases for ML Models
When conducting A/B testing on ML models, it is possible to direct a specific portion of the traffic (for example, on a website) to the new variant (the treatment) with Amazon SageMaker and see how the results differ from the original variant experienced by the control group. Moreover, multivariate testing, i.e. testing several ML models at the same time in production, is also possible.
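A hedged sketch of how such a traffic split could be configured is shown below, assuming both models are already registered in SageMaker; the model, endpoint, and configuration names are hypothetical.

```python
# A sketch of splitting endpoint traffic between two model variants using
# Amazon SageMaker production variants. The model, endpoint, and configuration
# names are hypothetical; both models are assumed to exist in SageMaker already.
import boto3

sm = boto3.client("sagemaker")

sm.create_endpoint_config(
    EndpointConfigName="reviews-ab-test-config",
    ProductionVariants=[
        {
            "VariantName": "current-model",
            "ModelName": "review-ranker-v1",     # existing (control) model
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.9,         # 90% of traffic
        },
        {
            "VariantName": "candidate-model",
            "ModelName": "review-ranker-v2",     # new (treatment) model
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.1,         # 10% of traffic
        },
    ],
)

sm.create_endpoint(EndpointName="reviews-ab-test",
                   EndpointConfigName="reviews-ab-test-config")
```

SageMaker routes requests to the variants in proportion to their weights, and the weights can later be shifted (for example, with update_endpoint_weights_and_capacities) as confidence in the new model grows.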
Imagine running A/B testing on a trained model that tells us which review for a product is most helpful based on historical data. We can use the traditional A/B testing method by splitting traffic 50/50 between the new and old ML models for a specified period of time and then comparing the results to see which performed better. We can also use a multi-armed bandit to gradually direct more traffic towards the better-performing model, reducing the traffic sent to the sub-optimal variant during the test and subsequently boosting the conversion rate. Although helpful in terms of avoiding a loss in conversion rate, this approach is not helpful when we want to check the overall statistical significance of the difference between the control and treatment groups.
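The multi-armed bandit idea can be illustrated with a small Thompson-sampling simulation; the “true” conversion rates below are made up purely for illustration, and in production the conversions would come from live traffic.

```python
# A minimal Thompson-sampling sketch of the multi-armed-bandit idea described
# above: traffic gradually shifts toward the better-converting model.
import random

true_rates = {"old_model": 0.040, "new_model": 0.048}    # hypothetical ground truth
successes = {m: 1 for m in true_rates}                   # Beta(1, 1) priors
failures = {m: 1 for m in true_rates}

for _ in range(10_000):                                   # each step = one visitor
    # Sample a plausible conversion rate for each model and pick the best.
    sampled = {m: random.betavariate(successes[m], failures[m]) for m in true_rates}
    chosen = max(sampled, key=sampled.get)
    # Serve the chosen model and observe whether the visitor converted.
    if random.random() < true_rates[chosen]:
        successes[chosen] += 1
    else:
        failures[chosen] += 1

for m in true_rates:
    shown = successes[m] + failures[m] - 2
    rate = successes[m] / (successes[m] + failures[m])
    print(f"{m}: shown to {shown} visitors, posterior mean rate {rate:.3f}")
```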
A practical approach for using machine learning models in A/B testing is to classify the segments that are more drawn to a specific variant (for example, visitors coming from a particular browser tend to click more on variant 2), and each time a visitor from that segment is detected, the corresponding variant is shown to them. In this way, the conversion rate is optimised even further.
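A hypothetical sketch of that idea is shown below: a scikit-learn classifier is trained on made-up experiment logs (here, only the visitor’s browser) to predict which variant a segment is more likely to click, and that prediction decides which variant to serve.

```python
# A hypothetical sketch of segment-based variant selection. The training data
# (browser -> best-clicking variant) is made up purely for illustration.
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OneHotEncoder

browsers = [["chrome"], ["chrome"], ["firefox"], ["safari"], ["firefox"], ["safari"]]
best_variant = ["variant_1", "variant_1", "variant_2", "variant_2", "variant_2", "variant_1"]

encoder = OneHotEncoder(handle_unknown="ignore")
X = encoder.fit_transform(browsers).toarray()
model = DecisionTreeClassifier().fit(X, best_variant)

def variant_for(browser: str) -> str:
    """Return the variant predicted to convert best for this visitor's browser."""
    return model.predict(encoder.transform([[browser]]).toarray())[0]

print(variant_for("firefox"))   # e.g. 'variant_2'
```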