Learning Metrics that Maximise Power for Accelerated A/B-Tests (2024)

Olivier Jeunen, ShareChat, Edinburgh, United Kingdom (jeunen@sharechat.co) and Aleksei Ustimenko, ShareChat, London, United Kingdom (aleksei.ustimenko@sharechat.co)


Abstract.

Online controlled experiments are a crucial tool to allow for confident decision-making in technology companies. A North Star metric is defined (such as long-term revenue or user retention), and system variants that statistically significantly improve on this metric in an A/B-test can be considered superior. North Star metrics are typically delayed and insensitive. As a result, the cost of experimentation is high: experiments need to run for a long time, and even then, type-II errors (i.e. false negatives) are prevalent.

We propose to tackle this by learning metrics from short-term signals that directly maximise the statistical power they harness with respect to the North Star. We show that existing approaches are prone to overfitting, in that higher average metric sensitivity does not imply improved type-II errors, and propose to instead minimise the $p$-values a metric would have produced on a log of past experiments. We collect such datasets from two social media applications with over 160 million Monthly Active Users each, totalling over 153 A/B-pairs. Empirical results show that we are able to increase statistical power by up to 78% when using our learnt metrics stand-alone, and by up to 210% when used in tandem with the North Star. Alternatively, we can obtain constant statistical power at a sample size that is down to 12% of what the North Star requires, significantly reducing the cost of experimentation.

A/B-Testing; Evaluation Metrics; Statistical Power

Journal year: 2024. Copyright: ACM licensed. Conference: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '24), August 25–29, 2024, Barcelona, Spain. DOI: 10.1145/3637528.3671512. ISBN: 979-8-4007-0490-1/24/08. CCS concepts: General and reference, Experimentation; Mathematics of computing, Hypothesis testing and confidence interval computation; Computing methodologies, Machine learning.

1. Introduction & Motivation

Modern platforms on the web need to continuously make decisions about their product and user experience, which are often central to the business at hand. These decisions range from design and interface choices to back-end technology adoption and the machine learning models that power personalisation. Online controlled experiments, the modern web-based extension of Randomised Controlled Trials (RCTs) (Rubin, 1974), provide an effective tool to allow for confident decision-making in this context (Kohavi et al., 2020), bar some common pitfalls (Kohavi et al., 2022; Jeunen, 2023).

A North Star metric is adopted, such as long-term revenue or user retention, and system variants that statistically significantly improve the North Star metric are considered superior to the tested alternative (Deng and Shi, 2016). Proper use of statistical hypothesis testing tools, such as Welch's $t$-test (Welch, 1947), then allows us to define and measure statistical significance in a mathematically rigorous manner.

However effective this procedure is, it is far from efficient. Indeed, experiments typically need to run for a long time, and statistically significant changes to the North Star are scarce. This can either be due to false negatives (i.e. type-II errors), or simply because the North Star is not moved by short-term experiments. In these cases, we need to resort to second-tier metrics (e.g. various types of user engagement signals) to make decisions instead. These problems are common in industry, as evidenced by a wide breadth of related work. A first line of research leverages control variates to reduce the variance of the North Star metric, directly reducing type-II errors by increasing sensitivity (Deng et al., 2013; Xie and Aurisset, 2016; Budylin et al., 2018; Poyarkov et al., 2016; Guo et al., 2021; Baweja et al., 2024). Another focuses on identifying second-tier "proxy" or "surrogate" metrics that are promising to consider instead of the North Star (Wang et al., 2022; Richardson et al., 2023; Tripuraneni et al., 2023), or on predicting long-term effects from short-term data (Athey et al., 2019; Tang et al., 2022; Goffrier et al., 2023). Finally, several works learn metric combinations that maximise sensitivity (Deng and Shi, 2016; Kharitonov et al., 2017; Tripuraneni et al., 2023).

This paper synthesises, generalises and extends several of the aforementioned works into a general framework to learn A/B-testing metrics that maximise the statistical power they harness. We specifically extend the work of Kharitonov et al. (Kharitonov et al., 2017) to applications beyond web search, where the North Star can be delayed and insensitive. We highlight how their approach of maximising the average $z$-score does not accurately reflect downstream metric utility in our case, in that it does not sufficiently penalise disagreement with the North Star (i.e. type-III/S errors (Mosteller, 1948; Kaiser, 1960; Gelman and Carlin, 2014; Urbano et al., 2019)). Indeed: whilst this approach maximises the mean $z$-score, it does not necessarily improve the median $z$-score, and as a result does not lead to improved statistical power in the form of reduced type-II errors.

Alternatively, optimising the learnt metric to minimise $p$-values, either directly or after applying a $\log$-transformation, more equitably distributes gains over multiple experiments, leading to more statistically significant results instead of a few extremely significant ones. Furthermore, we emphasise that learnt metrics are not meant to replace existing metrics, but rather to complement them. As such, their evaluation should be done through multiple hypothesis testing (with appropriate corrections (Shaffer, 1995)), declaring a treatment variant superior if any of the North Star, available vetted proxies and surrogates, or learnt metrics are statistically significant. We can then either adopt a conservative plug-in Bonferroni correction to temper type-I errors, or analyse synthetic A/A experiments to ensure the final procedure matches the expected confidence level.

We empirically validate these insights through two datasets of past logged A/B results from large-scale short-video platforms with over 160 million monthly active users each: ShareChat and Moj. Experimental results highlight that our learnt metrics provide significant value to the business: learnt metrics can increase statistical power by up to 78% over the North Star, and by up to 210% when used in tandem with it. Alternatively, if we wish to retain the statistical power the North Star provides, we can do so with as little as 12% of the original required sample size. This significantly reduces the cost of online experimentation to the business. Our learnt metrics are currently used for confident, high-velocity decision-making across ShareChat and Moj business units.

2. Background & Problem Setting

We deal with online controlled experiments, where two system variants $A$ and $B$ are deployed to a properly randomised sub-population of users, adhering to best practices (Kohavi et al., 2020; Jeunen, 2023).

For every system variant, for every experiment, we measure various metrics that describe how users interact with the platform. These metrics include types of implicit engagement (e.g. video-plays and watch time), explicit engagement (e.g. likes and shares), as well as longer-term retention or revenue signals. For each metric, we log empirical means, variances and covariances (of the sample mean). For metrics $m_i$ with $1 \leq i \leq N$, that is:

(1) $\bm{\mu} = [\mu_1, \ldots, \mu_N], \text{ and } \bm{\Sigma} = \begin{bmatrix} \sigma_1 & \dots & \sigma_{1i} & \dots & \sigma_{1N} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ \sigma_{i1} & \dots & \sigma_i & \dots & \sigma_{iN} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ \sigma_{N1} & \dots & \sigma_{Ni} & \dots & \sigma_N \end{bmatrix}.$

Superscripts denote measurements pertaining to different variants in an experiment, e.g. $\bm{\mu}^A$ and $\bm{\mu}^B$.
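To make these logged quantities concrete, the following minimal NumPy sketch computes $\bm{\mu}$ and $\bm{\Sigma}$ for one variant of one experiment, assuming user-level metric logs are available as a two-dimensional array; the function name and input format are our assumptions, not details given in the paper.

```python
import numpy as np

def summarise_variant(user_metrics: np.ndarray):
    """Summarise one variant of one experiment.

    user_metrics: array of shape (n_users, N), one row per user and one column
    per metric. Returns the vector of sample means and the covariance matrix
    of the sample mean (the user-level covariance scaled by 1 / n_users)."""
    n_users = user_metrics.shape[0]
    mu = user_metrics.mean(axis=0)                         # shape (N,)
    sigma = np.cov(user_metrics, rowvar=False) / n_users   # shape (N, N)
    return mu, sigma
```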

2.1. Statistical Significance Testing

We want to assess whether the mean of metric $m_i$ is statistically significantly higher under variant $A$ compared to variant $B$. To this end, we define a significance level $\alpha$ (often $\alpha \approx 0.05$), corresponding to the false-positive rate we deem acceptable. Then, we apply Welch's $t$-test. The test statistic (also known as the $z$-score) for metric $m_i$ and the given variants is given by:

(2) $z_i^{A \succ B} = \frac{\mu_i^A - \mu_i^B}{\sqrt{\sigma_i^A + \sigma_i^B}}.$

We then transform this to a $p$-value for a two-tailed test as:

(3) $p_i^{A \neq B} = 2 \cdot \min\left(\Phi(z_i^{A \succ B});\ 1 - \Phi(z_i^{A \succ B})\right).$

Here, $A \succ B$ denotes a partial ordering between variants, implying that $A$ is preferred over $B$. $\Phi(\cdot)$ represents the cumulative distribution function (CDF) of a standard Gaussian. For completeness, this CDF is given by:

(4) $\Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-\frac{t^2}{2}}\,{\rm d}t.$

When $p_i^{A \neq B} < \alpha$, we can confidently reject the null hypothesis that $A$ and $B$ are equivalent w.r.t. the mean of metric $m_i$. Note that $z$-scores are signed, whereas two-tailed $p$-values are not. Indeed: relabelling the variants changes the $z$-score but not the $p$-value, which leaves room for faulty conclusions of directionality, known as type-III errors (Mosteller, 1948; Kaiser, 1960; Urbano et al., 2019) or sign errors (Gelman and Carlin, 2014). We discuss these phenomena in detail further in this article.

A one-tailed $p$-value for the one-tailed null hypothesis $A \nsucc B$ is given by $p_i^{A \nsucc B} = 1 - \Phi(z_i^{A \succ B})$, and is rejected when it falls below $\frac{\alpha}{2}$. Throughout, we use two-tailed $p$-values unless mentioned otherwise.
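The tests in Eqs. (2)–(4) only need the logged summary statistics; a minimal Python sketch follows, with function names of our own choosing.

```python
import numpy as np
from scipy.stats import norm

def z_score(mu_a, mu_b, var_a, var_b):
    """Test statistic of Eq. (2); var_a and var_b are the variances of the
    sample mean of the metric under each variant."""
    return (mu_a - mu_b) / np.sqrt(var_a + var_b)

def p_two_tailed(z):
    """Two-tailed p-value of Eq. (3)."""
    return 2.0 * np.minimum(norm.cdf(z), 1.0 - norm.cdf(z))

def p_one_tailed(z):
    """One-tailed p-value for the null hypothesis that A is not preferred over B."""
    return 1.0 - norm.cdf(z)
```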

2.1.1. $p$-value corrections

The above procedure is valid for a single metric, a single hypothesis, and, importantly, a single decision. Nevertheless, this is not how experiments run in practice. Without explicit corrections on the $p$-values (or corresponding $z$-scores), violations of these assumptions lead to inflated false-positive rates. We consider two common cases: a (conservative) multiple testing correction when an experiment has several treatments, and a sequential testing correction when experiments have no predetermined end-date or sample size at which to conclude. These corrections are applied at the experiment level, to ensure that for any metric $m_i$ and variants $A, B$, the obtained $p$-values yield the specified coverage at varying confidence levels $\alpha$.

Multiple comparisons

Often, launched experiments will have multiple treatments deployed, leading to the infamous "multiple hypothesis testing" problem (Shaffer, 1995). We apply a Bonferroni correction to deal with this. When there are $T$ treatments, we consider a treatment to be statistically significantly different from control when its two-tailed $p < \frac{\alpha}{T}$, instead of the original $p < \alpha$ threshold.

We can equivalently apply this correction to $z$-scores instead, allowing us to directly compare $z$-scores across experiments with varying numbers of treatments. Recall that the percentile point function $\Phi^{-1}$ is the inverse of the CDF. We obtain a one-tailed $p$-value as $p = 1 - \Phi(z)$, and we reject the one-tailed null hypothesis when $p < \frac{\alpha}{2}$. Now, instead, we reject when $p < \frac{\alpha}{2T}$. As such, computing corrected $z$-scores as $\bar{z} = z\,\frac{\Phi^{-1}(\frac{\alpha}{2})}{\Phi^{-1}(\frac{\alpha}{2T})}$ controls type-I errors effectively.
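As an illustration, this rescaling can be computed directly with the Gaussian percentile point function; a sketch, where the function name and default $\alpha$ are our assumptions.

```python
from scipy.stats import norm

def bonferroni_corrected_z(z, n_treatments, alpha=0.05):
    """Rescale a z-score so that the single-treatment decision threshold can be
    reused when the experiment has `n_treatments` treatments: the corrected
    z-bar crosses the alpha/2 threshold exactly when z crosses alpha/(2T)."""
    return z * norm.ppf(alpha / 2) / norm.ppf(alpha / (2 * n_treatments))
```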

Always-Valid-Inference and peeking

A statistical test should only be performed once, at the end of an experiment. When the treatment effect is large, this implies we may have been able to conclude the experiment earlier. To this end, sequential hypothesis tests have been proposed in the literature (Wald, 1945). Modern versions make use of Always-Valid-Inference (AVI) (Howard et al., 2021) to allow for continuous peeking at intermediate results and making decisions based on them, whilst controlling type-I errors. Here, analogously, we can apply a correction to the $z$-scores as follows:

(5) $\bar{z} = \frac{z}{\sqrt{\frac{N_{AB} + \rho}{N_{AB}} \log\left(\frac{N_{AB} + \rho}{\rho\alpha^2}\right)}}, \quad \rho = \frac{10\,000}{\log\left(\log\left(\frac{e}{\alpha^2}\right)\right) - 2\log(\alpha)},$

where $N_{AB}$ is the total number of samples over both variants combined. For a detailed motivation, see Schmit and Miller (Schmit and Miller, 2022).
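A sketch of the correction in Eq. (5) as stated above; the function name and default $\alpha$ are ours.

```python
import numpy as np

def avi_corrected_z(z, n_samples, alpha=0.05):
    """Always-Valid-Inference correction of Eq. (5). `n_samples` is N_AB, the
    number of users over both variants combined."""
    rho = 10_000 / (np.log(np.log(np.e / alpha ** 2)) - 2 * np.log(alpha))
    scale = np.sqrt((n_samples + rho) / n_samples
                    * np.log((n_samples + rho) / (rho * alpha ** 2)))
    return z / scale
```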

These corrections are applied on a per-experiment level, both in the objective functions of methods introduced in the following Sections and when evaluating the metrics that they produce.

2.2. Learning Metrics that Maximise Sensitivity

The observation that we can learn parameters to maximise statistical sensitivity is not new. Yue et al. apply such ideas specifically for interleaving experiments in web search (Yue et al., 2010). Kharitonov et al. extend this to A/B-testing in web search, aiming to learn combinations of metrics that maximise the average $z$-score (Kharitonov et al., 2017). Deng and Shi discuss "lessons learned" from applying similar techniques (Deng and Shi, 2016). We introduce the approach presented by Kharitonov et al. (Kharitonov et al., 2017), as our proposed improvements build on their foundations.

We consider new metrics as linear transformations of $\bm{\mu}$:

(6) $\omega = \bm{\mu} w^\intercal, \text{ where } w \in \mathbb{R}^{1 \times N}.$

The advantage of restricting ourselves to linearity is that we can write out the $z$-score of the new metric as a function of its weights:

(7) $z_\omega^{A \succ B} = \frac{\bm{\mu}^A w^\intercal - \bm{\mu}^B w^\intercal}{\sqrt{w \bm{\Sigma}^A w^\intercal + w \bm{\Sigma}^B w^\intercal}}.$

These $z$-scores can be used exactly as before to obtain $p$-values. An intuitive property of the $z$-score is that a relative $z$-score of $r = \frac{z_\omega^{A \succ B}}{z_i^{A \succ B}}$ implies that $\omega$ requires a factor $r^2$ fewer samples to achieve the same significance level as $m_i$ (Chapelle et al., 2012). This translates directly to the cost of experimentation, as it allows us to run experiments for shorter time periods or on smaller sub-populations.
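For concreteness, the $z$-score of a learnt linear metric in Eq. (7) can be evaluated from the logged summary statistics alone; a minimal sketch with names of our own choosing.

```python
import numpy as np

def z_score_linear(w, mu_a, mu_b, sigma_a, sigma_b):
    """z-score of the linear metric omega = mu @ w for one A/B-pair (Eq. 7).

    mu_a, mu_b: mean vectors of shape (N,); sigma_a, sigma_b: covariance
    matrices of the sample means, of shape (N, N)."""
    numerator = (mu_a - mu_b) @ w
    denominator = np.sqrt(w @ sigma_a @ w + w @ sigma_b @ w)
    return numerator / denominator

# A relative z-score of r versus an existing metric implies roughly r**2
# fewer samples are needed to reach the same significance level.
```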

As such, it comes naturally to frame the objective as learning the weights $w$ that maximise the $z$-score on the training data. This training data consists of a set of experiments with pairs of variants $\mathcal{E} = \{(A, B)_i\}_{i=1}^{|\mathcal{E}|}$. We consider three distinct relations between pairs of deployed system variants:

  1. Known outcomes: $(A, B) \in \mathcal{E}^+$, where $A \succ B$;
  2. Unknown outcomes: $(A, B) \in \mathcal{E}^?$, where $A \,?\, B$;
  3. A/A outcomes: $(A, B) \in \mathcal{E}^\simeq$, where $A \simeq B$.

Here, $A \succ B$ implies that there is a known and vetted preference of variant $A$ over $B$, typically because the North Star or other guardrail metrics showed statistically significant improvements. These experiments are further validated by replicating outcomes, observing long-term holdouts, or because the experiment was part of an intentional degradation test. We denote inconclusive experiments as $A \,?\, B$, implying statistically insignificant outcomes on the North Star. In rare cases, the North Star might have gone up at the expense of important guardrail metrics, rendering conclusions ambiguous. We only include experiments in the inconclusive set for which we have a very strong intuition that something changed (and we "know" the null hypothesis should be rejected), but where we are unable to make a confident directionality decision. This ensures that we can use this set to truly measure type-II errors. Finally, $A \simeq B$ represents A/A experiments, where we know the null hypothesis to hold true (by design). The first set of experiments is used to measure type-III/S errors. Known and unknown outcomes are used to measure type-II errors, and A/A experiments can inform us about type-I errors. This dataset of past A/B experiments is collected and labelled by hand, from natural experiments occurring on the platform over time.

2.2.1. Optimising Metric Weights with a Geometric Heuristic

Note that $z$ as a function of $w$ is scale-free. That is, the direction of the weight vector $w$ matters, but its scale does not. As Kharitonov et al. write (Kharitonov et al., 2017), we can compute the optimal direction of $w$ using the method of Lagrange multipliers, obtaining:

(8) $w^\star_{A \succ B} \propto (\bm{\Sigma}^A + \bm{\Sigma}^B + \epsilon I)^{-1} (\bm{\mu}^A - \bm{\mu}^B).$

Here, $\epsilon \in \mathbb{R}$ is a small number that ensures the matrix to be inverted is not singular. Kharitonov et al. fix this value at $\epsilon = 0.01$ and never adjust it throughout their paper. We wish to highlight that this technique is known as Ledoit-Wolf shrinkage (Ledoit and Wolf, 2004, 2020), and that it can have substantial influence on the obtained direction. Indeed: it acts as a regularisation term pushing the weights closer to $w = (\bm{\mu}^A - \bm{\mu}^B)$. This can be seen by observing that as $\epsilon \to \infty$, the inverse becomes $\frac{1}{\epsilon}I$, and the solution hence becomes $w = \frac{(\bm{\mu}^A - \bm{\mu}^B)}{\epsilon}$. As we only care about the direction, we can ignore the denominator. To ensure a fair comparison, we also set $\epsilon = 0.01$. Exploring the effects of Ledoit-Wolf shrinkage as a regularisation technique, where $\epsilon$ is a hyper-parameter, is an interesting avenue for future work.

In order to include observations from multiple experiments into a single set of learned weights, they propose to compute the optimal direction per experiment, normalise, and average the weights:

(9) $w^\star_{\mathcal{E}^+} = \frac{1}{|\mathcal{E}^+|} \sum_{(A,B) \in \mathcal{E}^+} \frac{w^\star_{A \succ B}}{\left\|w^\star_{A \succ B}\right\|_2}.$

Whilst this procedure provides no guarantees about the sensitivity of the obtained metric on the overall set of experiments, it is efficient to compute and provides a strong baseline method.
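A sketch of this geometric heuristic (Eqs. (8)–(9)); the list-of-tuples input format is our assumption, and we solve the linear system rather than forming the inverse explicitly.

```python
import numpy as np

def optimal_direction(mu_a, mu_b, sigma_a, sigma_b, eps=0.01):
    """Per-experiment optimal direction of Eq. (8), with shrinkage term eps * I."""
    n = mu_a.shape[0]
    return np.linalg.solve(sigma_a + sigma_b + eps * np.eye(n), mu_a - mu_b)

def heuristic_weights(known_pairs, eps=0.01):
    """Normalise-and-average heuristic of Eq. (9) over pairs with known
    outcome A > B, each given as a (mu_a, mu_b, sigma_a, sigma_b) tuple."""
    directions = [optimal_direction(*pair, eps=eps) for pair in known_pairs]
    return np.mean([d / np.linalg.norm(d) for d in directions], axis=0)
```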

2.2.2. Optimising Metric Weights via Gradient Ascent on $z$-scores

A more principled approach is to cast the above as an optimisation problem. The objective function for this optimisation problem consists of three parts. First, we wish to maximise the $z$-score for all variant pairs with known outcomes:

(10) $\mathcal{L}_z^+(w; \mathcal{E}^+) = \frac{1}{|\mathcal{E}^+|} \sum_{(A,B) \in \mathcal{E}^+} z_\omega^{A \succ B}.$

Second, we wish to maximise the absolute $z$-score for all variant pairs with inconclusive outcomes under the North Star:

(11) $\mathcal{L}_z^?(w; \mathcal{E}^?) = \frac{1}{|\mathcal{E}^?|} \sum_{(A,B) \in \mathcal{E}^?} \left|z_\omega^{A \succ B}\right|.$

Third, we wish to minimise the absolute $z$-score for all variant pairs that are equivalent (i.e. A/A-pairs):

(12) $\mathcal{L}_z^\simeq(w; \mathcal{E}^\simeq) = \frac{1}{|\mathcal{E}^\simeq|} \sum_{(A,B) \in \mathcal{E}^\simeq} \left|z_\omega^{A \succ B}\right|.$

This gives rise to a combined objective as a weighted average:

(13) $\mathcal{L}_z(w; \mathcal{E}) = \mathcal{L}_z^+(w; \mathcal{E}^+) + \lambda^? \mathcal{L}_z^?(w; \mathcal{E}^?) - \lambda^\simeq \mathcal{L}_z^\simeq(w; \mathcal{E}^\simeq).$

Kharitonov et al. demonstrate that, for a variety of different metrics in a web search engine, these approaches can exhibit improved sensitivity (Kharitonov et al., 2017). We apply this method to learn instantaneously available proxies to a delayed North Star metric in general scenarios, and propose several extensions, detailed in the following Sections.
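A minimal sketch of gradient-based optimisation of the combined objective in Eq. (13), using a generic minimiser on the negated objective. The input format, default values of the $\lambda$ hyper-parameters, and the use of finite-difference gradients are our assumptions rather than details given in the paper.

```python
import numpy as np
from scipy.optimize import minimize

def z_score_linear(w, mu_a, mu_b, sigma_a, sigma_b):
    """z-score of the linear metric defined by weights w (Eq. 7)."""
    return (mu_a - mu_b) @ w / np.sqrt(w @ sigma_a @ w + w @ sigma_b @ w)

def neg_combined_z_objective(w, known, unknown, aa_pairs,
                             lam_unknown=1.0, lam_aa=1.0):
    """Negative of Eq. (13), so that a minimiser performs gradient *ascent* on
    the combined z-score objective. Each input is a list of
    (mu_a, mu_b, sigma_a, sigma_b) tuples."""
    objective = np.mean([z_score_linear(w, *e) for e in known])
    if unknown:
        objective += lam_unknown * np.mean(
            [abs(z_score_linear(w, *e)) for e in unknown])
    if aa_pairs:
        objective -= lam_aa * np.mean(
            [abs(z_score_linear(w, *e)) for e in aa_pairs])
    return -objective

# Hypothetical usage, with finite-difference gradients handled by scipy:
# result = minimize(neg_combined_z_objective, x0=np.ones(n_metrics),
#                   args=(known, unknown, aa_pairs))
# w_learnt = result.x
```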

3. Methodology & Contributions

3.1. Learning Metrics that Maximise Power

When directly optimising $z$-scores, an implicit assumption is made that the utility we obtain from increased $z$-scores is linear. This is seldom a truthful characterisation of reality, considering how we wish to actually use these metrics downstream. We provide a toy example in Table 1, reporting $z$-scores and one-tailed $p$-values for two experiments and three possible metrics $m_1, m_2, m_3$, inspired by real data. In this toy example, we know that $A \succ B$, based on a hypothetical North Star metric. Nevertheless, as we do not know this beforehand, we typically test for the null hypothesis $A \simeq B$ with two-tailed $p$-values. In the table, this means that the outcome is statistically significant if the reported one-tailed values are $p < \frac{\alpha}{2}$ or $p > 1 - \frac{\alpha}{2}$. We report the power of every metric at varying significance levels $\alpha \in \{0.05, 0.01\}$, reporting whether (i) the null hypothesis is correctly rejected ($p < \frac{\alpha}{2}$, ✓), (ii) the outcome is inconclusive (?), i.e. a type-II error, or (iii) the null hypothesis is rejected, but for the wrong reason ($p > 1 - \frac{\alpha}{2}$, ×).

This latter case is deeply problematic, as it signifies disagreement with the North Star. Such errors have been described as type-III or type-S errors in the statistical literature (Mosteller, 1948; Kaiser, 1960; Gelman and Carlin, 2014; Urbano et al., 2019). Naturally, we would rather have a metric that fails to reject the null than one that confidently declares a faulty variant to be superior. Indeed, Deng and Shi argue that both directionality and sensitivity are desirable attributes for any metric (Deng and Shi, 2016). Nevertheless, considering the candidate metrics in Table 1, type-III errors are not sufficiently penalised by the average $z$-score: metric $m_3$ maximises this objective despite yielding statistical power that is on par with a coin flip.

Directly maximising power might prove cumbersome, as it is essentially a discrete step function w.r.t. the $z$-score, dependent on the significance level $\alpha$. Instead, it comes naturally to minimise the one-tailed $p$-value reported in Table 1. Indeed, the $p$-value transformation models diminishing returns for high $z$-scores, which allows type-III errors to be sufficiently penalised. When considering this objective, $m_3$ is clearly suboptimal whilst $m_2$ is preferred.

Note that this change in objective would not affect the geometric heuristic described in Section 2.2.1. As we simply apply a monotonic transformation to the $z$-scores, the weight direction that maximises the $z$-score equivalently minimises its $p$-value. When learning via gradient descent, however, the $p$-value transformation affects how we aggregate and attribute gains over different input samples. This allows us to stop focusing on increasing sensitivity for experiments that are already "sensitive enough", and to more equitably consider all experiments in the training data.

Table 1. Toy example: $z$-scores, one-tailed $p$-values, and resulting power at $\alpha \in \{0.05, 0.01\}$ for three candidate metrics over two experiments where $A \succ B$ is known. Best mean value per objective in boldface.

| Metric | | $z$-score | $p$-value | Power ($\alpha=0.05$) | Power ($\alpha=0.01$) |
|---|---|---|---|---|---|
| $m_1$ | Exp. 1 | 1.97 | 2.44e-02 | ✓ | ? |
| | Exp. 2 | 1.97 | 2.44e-02 | ✓ | ? |
| | Mean | 1.97 | 2.44e-02 | 100% | 0% |
| $m_2$ | Exp. 1 | 1.90 | 2.87e-02 | ? | ? |
| | Exp. 2 | 3.50 | 2.33e-04 | ✓ | ✓ |
| | Mean | 2.70 | **1.45e-02** | 50% | 50% |
| $m_3$ | Exp. 1 | -2.58 | 9.95e-01 | × | × |
| | Exp. 2 | 8.00 | 6.66e-16 | ✓ | ✓ |
| | Mean | **5.29** | 4.98e-01 | 50% | 50% |

This change in objective provides an intuitive and efficient extension to existing approaches, allowing us to directly optimise the confidence with which we correctly reject the null hypothesis. For known outcomes, the loss function is given by:

(14) $\mathcal{L}_p^+(w; \mathcal{E}^+) = \frac{1}{|\mathcal{E}^+|} \sum_{(A,B) \in \mathcal{E}^+} 1 - \Phi(z_\omega^{A \succ B}) = \frac{1}{|\mathcal{E}^+|} \sum_{(A,B) \in \mathcal{E}^+} 1 - \Phi\left(\frac{\bm{\mu}^A w^\intercal - \bm{\mu}^B w^\intercal}{\sqrt{w \bm{\Sigma}^A w^\intercal + w \bm{\Sigma}^B w^\intercal}}\right),$

and analogously extended to unknown outcomes $\mathcal{L}_p^?$ and A/A-outcomes $\mathcal{L}_p^\simeq$. Nevertheless, we wish to point out that we only want to maximise $p$-values for A/A-outcomes if type-I error becomes problematic. As we will show empirically in Section 4.3, this is not a problem we encounter. For this reason, we set $\lambda^\simeq \equiv 0$.

Note that whilst direct optimisation of $p$-values is an improvement over myopic consideration of $z$-scores, there is another caveat: the "worst-case" loss of a type-III/S error is bounded at 1, which does not reflect our true utility function: metrics that disagree with the North Star are far less reliable than those that simply remain inconclusive. As such, we also consider another variant of the objective, where $\bar{p} = -p\log(1-p)$. Figure 1 provides visual intuition to clarify how this monotonic transformation of the $p$-values more heavily penalises type-III/S errors, whilst retaining the optimum. From a theoretical perspective, this function provides a convex relaxation for minimising the number of type-III/S errors a metric produces. As a result, we expect this surrogate to exhibit strong generalisation. We refer to this objective as minimising the $\log p$-value.
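A sketch of the $p$-value loss of Eq. (14) and the transformed variant described above, restricted to pairs with known outcomes; the function names and input format are ours.

```python
import numpy as np
from scipy.stats import norm

def z_score_linear(w, mu_a, mu_b, sigma_a, sigma_b):
    """z-score of the linear metric defined by weights w (Eq. 7)."""
    return (mu_a - mu_b) @ w / np.sqrt(w @ sigma_a @ w + w @ sigma_b @ w)

def p_value_loss(w, known):
    """Eq. (14): mean one-tailed p-value over pairs with known outcome A > B."""
    return np.mean([1.0 - norm.cdf(z_score_linear(w, *e)) for e in known])

def log_p_value_loss(w, known):
    """Transformed objective with p-bar = -p * log(1 - p): p-values close to 1
    (type-III/S errors) incur an unbounded penalty, p-values near 0 almost none."""
    p = np.array([1.0 - norm.cdf(z_score_linear(w, *e)) for e in known])
    return np.mean(-p * np.log1p(-p))
```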

[Figure 1: the $\log p$-value transformation, illustrating how it penalises type-III/S errors more heavily than the raw $p$-value whilst retaining the same optimum.]

Note that one could envision further extensions here, where the significance level $\alpha$ is directly incorporated into the objective function to maximise statistical power at a given significance level. Nevertheless, we conjecture that the discrete nature of such objectives might hamper effective optimisation and generalisation, compared to the strictly convex and smooth surrogate we obtain from the $\log p$-value.

3.2. Accelerated Convergence for Scale-Free Objectives via Spherical Regularisation

The objective functions we describe, either $z$-scores or $(\log) p$-values, are scale-free w.r.t. the weights that are being optimised. As a result, out-of-the-box gradient-based optimisation techniques are not well-equipped to handle them efficiently.

Consider a simple toy example where we have two observed metrics for an experiment with a known preference $A \succ B$, and:

(15) $\bm{\mu}^A = [1.0, 1.0], \qquad \bm{\mu}^B = [0.5, 0.5], \qquad \bm{\Sigma}^A = \bm{\Sigma}^B = I.$

For this low-dimensional problem, we can visualise the $z$-score as a function of the metric weights in a contour plot, as shown in Figure 2(a).

[Figure 2: contour plots of (a) the original scale-free $z$-score objective and (b) the spherically regularised objective, as functions of the metric weights.]

Here, it becomes visually clear that whilst the direction of the $w = [w_1, w_2]$ vector matters, its scale does not. The consequence is that the gradient vectors w.r.t. the objective on the right-hand plot can lead to slow convergence, even in this concave objective. Indeed, for a poor initialisation in the bottom-left quadrant (e.g. $w = [-1, -2]$), the gradient direction is perpendicular to the optima.

Recent work makes a similar observation for discrete scale-free objectives as they appear in ranking problems (Ustimenko and Prokhorenkova, 2020). They propose to adopt projected gradient descent, normalising the gradients before every update. Whilst effective, in our setting we would prefer to use out-of-the-box optimisation methods for practitioners' ease-of-use. Instead, we introduce a simple regularisation term that represents the distance between the scale of the $w$ vector and a hyper-sphere:

(16) $\mathcal{L}_{\lVert w \rVert} = -\delta\left(N - \lVert w \rVert_2^2\right)^2.$

All optima of this objective function are also optima of the original function, but the gradient field is more amenable to iterative gradient-based optimisation techniques. Figure 2(b) visualises how this transforms the loss surface. Under this regularised objective, it is visually clear that gradient-based optimisation methods are likely to exhibit faster convergence. Our empirical results confirm this, for a variety of initialisation weights and learning objectives.
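A sketch of the regularisation term in Eq. (16), written with a positive sign so it can be added to a loss that is being minimised; the sign convention, function name, and default $\delta$ are ours.

```python
import numpy as np

def spherical_penalty(w, delta=1.0):
    """Distance between the squared norm of w and a hyper-sphere of radius
    sqrt(N) (Eq. 16). It vanishes whenever ||w||_2^2 equals the number of
    metrics N, so it constrains only the scale of w, never its direction."""
    n = w.shape[0]
    return delta * (n - w @ w) ** 2
```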

4. Experiments & Discussion

To empirically validate the methods proposed in this work, we require a dataset containing logged metrics (sample means, their variances and covariances), together with preference orderings over competing system variants that were collected from real-world A/B-tests, ideally spanning large user-bases and several weeks.

Existing work on this topic used a private dataset from Yandex, focused on web search experiments that ran between 2013 and 2014 (Kharitonov et al., 2017). They report type-I and -II errors for 8 metrics at a fixed 5% significance level, over 118 A/B-tests and 472 A/A-tests.

In this work, we consider more general metrics that are relevant for use-cases beyond web search (i.a. user retention and various engagement signals). Furthermore, we report type-I/II/III/S errors at varying significance levels, providing insights into the learnt metrics' behaviour. For this, we leverage logs of past A/B-experiments on two large-scale short-video platforms with over 160 million monthly active users each: ShareChat and Moj. The datasets consist of 153 A/B-experiments in total (of which 58 were conclusive) that ran in 2023, and over 25,000 A/A-pairs. In total, we have access to roughly 100 metrics detailing various types of interactions with the platform, engagements, and delayed retention signals. Because our dataset is limited in size (a natural consequence of the problem domain), we are bound to overfit when using all available metrics as input features. As such, we limit ourselves to 10 input metrics to learn from, and evaluate them w.r.t. the delayed North Star. This feature selection step also ensures that our linear model consists of fewer parameters, which increases practitioners' and business stakeholders' trust in its output. We focus on non-delayed signals, including activity metrics such as the number of sessions and active days, and counters for positive and negative feedback engagements of various types. These are selected through an analysis of their type-I/II/III/S errors w.r.t. the North Star, as well as their $z$-scores, focusing on metrics with high sensitivity and limited disagreement. The research questions we wish to answer empirically using this data are the following:

RQ1: Do learnt metrics effectively improve on their objectives?

RQ2: How do learnt metrics behave in terms of type-III/S errors?

RQ3: How do learnt metrics' type-I/II errors behave when considered as stand-alone evaluation metrics?

RQ4: How do learnt metrics' type-I/II errors behave when used in conjunction with the North Star and top proxy metrics?

RQ5: How do learnt metrics influence required sample sizes when used in conjunction with the North Star and top proxy metrics?

RQ6: Do we observe accelerated convergence over varying objectives via the proposed spherical regularisation technique?

We report results for the ShareChat platform in what follows, and provide further empirical results for Moj in Appendix A.

4.1. Effectiveness of Learnt Metrics (RQ1)

We learn and evaluate metrics through leave-one-out cross-validation: for every experiment, we train a model on all other experiments and evaluate the $z$-score (Eq. 7) and $p$-value (Eq. 3) the metric yields for the held-out experiment. We report the mean and median $z$-scores and $p$-values we obtain for all A/B-pairs with known outcomes (i.e. $\mathcal{E}^+$) in Table 2. Best performers for every column (either maximising $z$-scores or minimising $p$-values) are highlighted in boldface. Empirical observations match our theoretical expectations: whilst the $z$-score objective does effectively maximise the average $z$-score, it is the worst performer for both mean and median $p$-values, and even for the median $z$-score. Our proposed $\log p$-value objective effectively improves both the median $z$-score and $p$-value over the alternatives.
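A sketch of this leave-one-out protocol, assuming a `fit_weights` routine that implements any of the objectives above; the function names and input format are ours.

```python
import numpy as np
from scipy.stats import norm

def leave_one_out_eval(pairs, fit_weights):
    """Leave-one-out protocol: refit the metric on all other A/B-pairs and
    score the held-out pair. `pairs` is a list of (mu_a, mu_b, sigma_a,
    sigma_b) tuples; `fit_weights` maps such a list to a weight vector."""
    z_scores = []
    for i, (mu_a, mu_b, sigma_a, sigma_b) in enumerate(pairs):
        w = fit_weights(pairs[:i] + pairs[i + 1:])
        z = (mu_a - mu_b) @ w / np.sqrt(w @ sigma_a @ w + w @ sigma_b @ w)
        z_scores.append(z)
    z = np.array(z_scores)
    p = 2.0 * np.minimum(norm.cdf(z), 1.0 - norm.cdf(z))   # Eq. (3)
    return z, p
```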

Table 2. Mean and median $z$-scores ($\uparrow$) and $p$-values ($\downarrow$) obtained via leave-one-out cross-validation on A/B-pairs with known outcomes, per learning objective. Best performers per column in boldface.

| Objective | Mean $z$-score ($\uparrow$) | Median $z$-score ($\uparrow$) | Mean $p$-value ($\downarrow$) | Median $p$-value ($\downarrow$) |
|---|---|---|---|---|
| heuristic | 7.31 | 3.07 | 1.88e-01 | 1.18e-03 |
| $z$-score | **7.55** | 2.67 | 2.33e-01 | 3.88e-03 |
| $p$-value | 5.22 | 3.08 | **4.32e-02** | 1.09e-03 |
| $\log p$-value | 4.33 | **3.17** | 5.19e-02 | **8.60e-04** |
[Figure panels: agreement with the North Star at varying significance levels (Figure 3) and type-I/II errors for stand-alone and combined metrics (Figure 4), discussed in Sections 4.2 and 4.3.]

4.2. Agreement with the North Star (RQ2)

From the obtained $z$-scores and $p$-values summarised in Table 2, we can additionally derive (dis-)agreement with the North Star, for varying significance levels $\alpha$. We visualise these results in Figure 3: if the obtained $p$-value under a learnt metric is lower than $\alpha$, that metric yields a statistically significant result (agreement). If the obtained $p$-value for the alternative hypothesis (i.e. $B \succ A$ when we know $A \succ B$) is lower than $\alpha$, we have statistically significant disagreement, or a type-III error. This is a capital sin we wish to avoid at all costs, as it severely diminishes the trust we can put in the learnt metric. If the $p$-value reveals a statistically insignificant result, we say the result is inconclusive, implying a type-II error. We observe that both the $z$-score-maximising metric and the heuristic approach fail to steer clear of type-III errors. Optimising $(\log) p$-values instead alleviates this issue. For this reason, we only consider these metrics for further evaluation. Indeed: an analysis of type-II error is rendered meaningless when type-III errors are present.

Figure 7(a) in Appendix A highlights that, for the Moj platform as well, type-III errors are common for $z$-score-maximising and heuristic metrics. As such, we only consider metrics optimising ($\log$) $p$-values when assessing increases in statistical power and potential reductions to the cost of running online experiments.

4.3. Power Increase from Learnt Metrics (RQ3โ€“4)

Until now, we have leveraged experiments with known outcomes to assess sensitivity and agreement with the North Star. We now additionally consider A/A-experiments ($\mathcal{E}^{\simeq}$) and experiments with unknown outcomes ($\mathcal{E}^{?}$) to measure type-I and type-II errors, respectively. We measure these for the North Star, for the best-performing proxy metric that serves as input to the learnt metrics, and for learnt metrics that exhibit no empirical disagreement with the North Star. We plot the type-I error (i.e. the fraction of A/A-pairs in $\mathcal{E}^{\simeq}$ that are statistically significant at significance level $\alpha$) and the type-II error (i.e. the fraction of A/B-pairs in $\mathcal{E}^{+} \cup \mathcal{E}^{?}$ that are statistically insignificant at significance level $\alpha$) for varying values of $\alpha$ in Figure 4(a). We observe that we are able to significantly reduce type-II errors compared to the North Star (by up to 78%), whilst keeping type-I errors at the required level (i.e. $\alpha$). However, we also observe that the type-II error obtained with learnt metrics, considered in isolation, does not significantly improve over that of the top proxy metric.
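These empirical error rates follow directly from the per-experiment $p$-values; a minimal sketch, assuming those $p$-values have already been computed as above:

```python
import numpy as np

def error_rates(p_aa, p_ab, alphas):
    """Empirical type-I and type-II error rates for a single metric.
    p_aa: p-values on A/A-pairs; p_ab: p-values on A/B-pairs."""
    p_aa, p_ab = np.asarray(p_aa), np.asarray(p_ab)
    type_i = np.array([(p_aa < a).mean() for a in alphas])    # significant A/A comparisons
    type_ii = np.array([(p_ab >= a).mean() for a in alphas])  # insignificant A/B comparisons
    return type_i, type_ii
```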

Nonetheless, this is not how evaluation metrics are used in practice. Indeed, we track several metrics and can draw conclusions if any of them is statistically significant. As such, metrics should be evaluated on their complementary sensitivity. That is, we compute $p$-values for a set of metrics, apply a Bonferroni correction, and assess statistical significance. The statistical power obtained through this procedure is visualised in Figure 4(b). We consider either the North Star in isolation, the North Star in conjunction with the top proxy, or a further combination with a learnt metric. Here, we observe that the learnt metric provides a substantial increase in statistical power: statistical power (i.e. $1-$ type-II error) increases by up to a relative 210% compared to the North Star alone, and by 25–30% over the North Star plus proxies. Furthermore, as the Bonferroni correction is slightly conservative, we observe lower-than-expected type-I errors for higher significance levels $\alpha$. This implies that a more fine-grained multiple-testing correction could further improve statistical power. We empirically observe that this works as expected, but its effects are negligible in practice.
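A minimal sketch of this Bonferroni-corrected decision rule over a set of metrics (the $p$-values in the usage example are illustrative only, not results from our experiments):

```python
def significant_with_bonferroni(p_values, alpha):
    """Declare a statistically significant outcome if any metric in the set
    clears the Bonferroni-adjusted threshold alpha / m."""
    m = len(p_values)
    return any(p < alpha / m for p in p_values)

# Illustrative p-values only (not measured results):
print(significant_with_bonferroni([0.04], alpha=0.05))               # True: a single metric at alpha
print(significant_with_bonferroni([0.20, 0.04, 0.001], alpha=0.05))  # True, driven by the third metric
```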

4.4. Cost Reduction from Learnt Metrics (RQ5)

So far, we have shown that metrics learnt to minimise ($\log$) $p$-values are effective at improving sensitivity (Table 2), whilst minimising type-III errors (Figure 3) and improving statistical power (Figure 4).

On one hand, powerful learnt metrics can lead to more confident decisions from statistically significant A/B-test outcomes. Alternatively, we could make the same number of decisions based on fewer data points, as we reach statistical significance with smaller sample sizes. This implies a cost reduction: we can run experiments either on smaller portions of user traffic or for shorter periods of time, directly improving experimentation velocity.

This reduction in required sample size is equal to the square of the relative $z$-score (Chapelle et al., 2012; Kharitonov et al., 2017). We visualise this quantity in Figure 5, for varying significance levels $\alpha$, using the same Bonferroni-corrected procedure as in Figure 4(b). To obtain a $z$-score for a set of metrics, we simply take the maximal score and apply a Bonferroni correction to it as laid out in Section 2.1.1. Note that this procedure depends on $\alpha$, explaining the slope in Figure 5. We observe that our learnt metrics can achieve the same level of statistical confidence as the North Star with up to 8 times fewer samples, i.e. a reduction down to 12.5%. This significantly reduces the cost of experimentation for the business, further strengthening the case for our learnt metrics.
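As a worked illustration of this rule (with illustrative $z$-scores, not our measured values): a metric whose $z$-score is roughly $2.8\times$ that of the North Star requires about $1/2.8^{2} \approx 12.5\%$ of the samples to reach the same confidence.

```python
def relative_sample_size(z_metric, z_north_star):
    """Sample size required by a candidate metric, relative to the North Star,
    to reach the same statistical confidence: (z_north_star / z_metric)^2."""
    return (z_north_star / z_metric) ** 2

print(relative_sample_size(z_metric=2.83, z_north_star=1.0))  # ~0.125, i.e. ~8x fewer samples
```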


4.5. Spherical Regularisation (RQ6)

Our goal is to assess and quantify the effects of the spherical regularisation method proposed in Section 3.2. We train models on all available data with known outcomes $\mathcal{E}^{+}$, where we have a vetted preference over variants $A \succ B$. We consider three weight initialisation strategies to set $w_{\rm init}$, and normalise weights to ensure $\mathcal{L}_{\lVert w_{\rm init}\rVert}=0$: (i) good initialisation at $w_{\rm init}=\frac{1}{|\mathcal{E}^{+}|}\sum_{(A,B)\in\mathcal{E}^{+}}\bm{\mu}^{A}-\bm{\mu}^{B}$, (ii) constant initialisation at the all-one vector $w_{\rm init}=\vec{\bm{1}}$, and (iii) bad initialisation at $w_{\rm init}=\frac{1}{|\mathcal{E}^{+}|}\sum_{(A,B)\in\mathcal{E}^{+}}\bm{\mu}^{B}-\bm{\mu}^{A}$. We train models for all learning objectives considered in this paper ($z$-scores, $p$-values, and $\log p$-values), whilst varying the strength $\delta$ of the spherical regularisation term. As discussed, this term does not affect the optima, but simply transforms the loss to be more amenable to gradient-based optimisation; we therefore expect convergence after fewer training iterations. All models are trained until convergence with the Adam optimiser (Kingma and Ba, 2014), initialising the learning rate at $5\cdot10^{-4}$ and halving it every 1,000 steps without observed improvements. We use the RAdam variant to avoid convergence issues (Reddi et al., 2018; Liu et al., 2020), and have validated that this choice does not significantly alter our results and conclusions. We consider a model converged if the learning objective does not improve for 10,000 steps. All methods are implemented using Python 3.9 and PyTorch (Paszke et al., 2019).
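The sketch below illustrates this training setup in PyTorch under simplifying assumptions: the loss is a stand-in mean $\log p$-value for a linear metric (our exact objectives and the spherical regularisation term are defined in Section 3.2; the regulariser's form below is assumed), and the per-experiment inputs (mean-difference vectors and variance estimates) are hypothetical.

```python
import torch

def train_metric(deltas, variances, w_init, reg_strength=1.0,
                 max_steps=100_000, patience=10_000):
    """Learn linear metric weights w by minimising a stand-in objective:
    the mean one-sided log p-value over past experiments.

    deltas:    (E, d) tensor of mean-difference vectors (mu^A - mu^B) per experiment.
    variances: (E, d) tensor of per-dimension variance estimates of that difference.
    """
    w = torch.nn.Parameter(w_init.clone())
    opt = torch.optim.RAdam([w], lr=5e-4)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.5, patience=1_000)
    std_normal = torch.distributions.Normal(0.0, 1.0)

    best, steps_since_best = float("inf"), 0
    for _ in range(max_steps):
        opt.zero_grad()
        z = (deltas @ w) / torch.sqrt(variances @ (w ** 2) + 1e-12)  # per-experiment z-scores
        log_p = torch.log(std_normal.cdf(-z).clamp_min(1e-30))       # one-sided log p-values
        # Spherical regularisation (assumed form): keep ||w|| close to 1,
        # which leaves the optima of the scale-free objective unchanged.
        loss = log_p.mean() + reg_strength * (w.norm() - 1.0) ** 2
        loss.backward()
        opt.step()
        sched.step(loss.item())
        if loss.item() < best - 1e-6:
            best, steps_since_best = loss.item(), 0
        else:
            steps_since_best += 1
            if steps_since_best >= patience:  # no improvement for `patience` steps
                break
    return w.detach()
```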

Figure 6 visualises the evolution of the learning objective over optimisation steps, for all mentioned learning objectives, initialisation strategies, and regularisation strengths. We observe that the method is robust, significantly improving convergence speed in all settings and requiring up to 40% fewer iterations until convergence is reached. This positively influences the practical utility of the learnt-metric pipeline for researchers and practitioners.

We provide source code to reproduce Figure 2 and our regularisation method at github.com/olivierjeunen/learnt-metrics-kdd-2024.

5. Insights from Learnt Metrics

In this Section, we briefly discuss insights that arose through our empirical evaluation of all metrics: the North Star, classical surrogates and proxies, as well as learnt metrics. These insights are specific to our platforms, but we believe they can contribute to a general intuition and understanding of metrics for online content platforms and broader application areas.

Ratio metrics are easily fooled.

Often, important metrics can be framed as a ratio of the means (or sums) of two existing metrics (Baweja et al., 2024; Budylin et al., 2018). Examples include click-through rate (i.e. clicks / impressions), variants of user retention (i.e. retained users / active users), or general engagement ratios (e.g. likes / video-plays). We observe that, whilst these metrics can be important from a business perspective, they typically exhibit significant type-III/S errors w.r.t. the North Star. Indeed, in the examples above, both the numerator and denominator represent positive signals we wish to increase. Suppose an online experiment increases the number of video-plays by $Y\%$, and the overall number of likes by $X\% < Y\%$. These two positive signals lead to a decreasing ratio, whilst we are likely to still prefer the treatment w.r.t. the North Star if $X$ and $Y$ are substantially large. Similar observations cautioning against the use of ratio metrics have been made by Dmitriev et al. (2017).
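A toy example with made-up counts makes this failure mode concrete:

```python
# Both signals improve under treatment, yet the engagement ratio drops:
likes_a, plays_a = 1_000, 10_000        # control
likes_b, plays_b = 1_050, 11_000        # treatment: likes +5%, video-plays +10%
print(likes_a / plays_a)  # 0.100
print(likes_b / plays_b)  # ~0.095: the ratio metric penalises a variant we likely prefer
```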

We believe this is connected to common offline ranking evaluation metrics prevalent in the recommender systems field (Steck, 2013; Jeunen, 2019). Indeed, such metrics are cumulative in nature, optimising overall value instead of some notion of value-per-item (Jeunen et al., 2024).

User-level aggregations conquer general counters.

In the previous example, we described general count metrics for the number of likes and the number of video-play events. User behaviour on online platforms often follows a power-law distribution: a few "power users" generate the majority of such events (Chi, 2020). As a result, such metrics are easily skewed, and they are not guaranteed to accurately reflect improvements for the full population of users, empirically leading to type-III/S errors w.r.t. the North Star. Aggregating such counters per user (e.g. counting the number of days a user has at least $X$ video-plays), instead of using raw event counters, provides strong and sensitive proxies to the North Star.
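A minimal sketch of such a user-level aggregation, assuming a pandas event log with hypothetical `user_id`, `date`, and `video_plays` columns:

```python
import pandas as pd

def active_days_per_user(events: pd.DataFrame, min_plays: int = 3) -> pd.Series:
    """Number of days on which each user logged at least `min_plays` video-plays."""
    daily = events.groupby(["user_id", "date"])["video_plays"].sum()
    return (daily >= min_plays).groupby(level="user_id").sum()
```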

Interestingly, this framing is reminiscent of recall, as we effectively measure the coverage of users about whom we have positive signals. Recall metrics are again strongly connected to offline evaluation practices in recommender systems, especially in the first stage of the two-stage systems common in industry (Covington et al., 2016; Ma et al., 2020).

6. Conclusions & Outlook

A/B-testing is a crucial tool for decision-making in online businesses, and it has been widely adopted as the go-to approach to allow for continuous system improvements. Notwithstanding their popularity, online experiments are often expensive to perform. Indeed, many experiments lead to statistically insignificant outcomes, presenting an obstacle for confident decision-making. Experiments that do lead to significant outcomes are costly too: by their very definition, a portion of user traffic interacts with a sub-optimal system variant. As such, we want to maximise the number of decisions we can make based on the experiments we run, and minimise the sample size required for statistically significant outcomes. In this work, we achieve this by learning metrics that maximise the statistical power they harness. We present novel learning objectives for such metrics, and provide a thorough evaluation of the effectiveness of our proposed approaches. Our learnt metrics are currently used for confident, high-velocity decision-making across the ShareChat and Moj business units.

We believe our work opens several avenues for future research to improve the efficacy of learnt metrics, e.g. by relaxing the linearity constraint we rely on. Furthermore, we wish to leverage our learnt metrics as reward signals for personalisation through machine learning models (Jeunen, 2021).

References

  • Athey et al. (2019) Susan Athey, Raj Chetty, Guido W. Imbens, and Hyunseung Kang. 2019. The Surrogate Index: Combining Short-Term Proxies to Estimate Long-Term Treatment Effects More Rapidly and Precisely. Working Paper 26463. National Bureau of Economic Research. https://doi.org/10.3386/w26463
  • Baweja et al. (2024) Shubham Baweja, Neeti Pokharna, Aleksei Ustimenko, and Olivier Jeunen. 2024. Variance Reduction in Ratio Metrics for Efficient Online Experiments. In Proc. of the 46th European Conference on Information Retrieval (ECIR '24). Springer.
  • Budylin et al. (2018) Roman Budylin, Alexey Drutsa, Ilya Katsev, and Valeriya Tsoy. 2018. Consistent Transformation of Ratio Metrics for Efficient Online Controlled Experiments. In Proc. of the Eleventh ACM International Conference on Web Search and Data Mining (WSDM '18). ACM, 55–63. https://doi.org/10.1145/3159652.3159699
  • Chapelle et al. (2012) Olivier Chapelle, Thorsten Joachims, Filip Radlinski, and Yisong Yue. 2012. Large-Scale Validation and Analysis of Interleaved Search Evaluation. ACM Trans. Inf. Syst. 30, 1, Article 6 (March 2012), 41 pages. https://doi.org/10.1145/2094072.2094078
  • Chi (2020) Ed H. Chi. 2020. From Missing Data to Boltzmann Distributions and Time Dynamics: The Statistical Physics of Recommendation. In Proc. of the 13th International Conference on Web Search and Data Mining (WSDM '20). ACM, 1–2. https://doi.org/10.1145/3336191.3372193
  • Covington et al. (2016) Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep Neural Networks for YouTube Recommendations. In Proc. of the 10th ACM Conference on Recommender Systems (RecSys '16). ACM, 191–198. https://doi.org/10.1145/2959100.2959190
  • Deng and Shi (2016) Alex Deng and Xiaolin Shi. 2016. Data-Driven Metric Development for Online Controlled Experiments: Seven Lessons Learned. In Proc. of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). ACM, 77–86. https://doi.org/10.1145/2939672.2939700
  • Deng et al. (2013) Alex Deng, Ya Xu, Ron Kohavi, and Toby Walker. 2013. Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data. In Proc. of the Sixth ACM International Conference on Web Search and Data Mining (WSDM '13). ACM, 123–132. https://doi.org/10.1145/2433396.2433413
  • Dmitriev et al. (2017) Pavel Dmitriev, Somit Gupta, Dong Woo Kim, and Garnet Vaz. 2017. A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments. In Proc. of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17). ACM, 1427–1436. https://doi.org/10.1145/3097983.3098024
  • Gelman and Carlin (2014) Andrew Gelman and John Carlin. 2014. Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors. Perspectives on Psychological Science 9, 6 (2014), 641–651. https://doi.org/10.1177/1745691614551642. PMID: 26186114.
  • Goffrier et al. (2023) Graham Van Goffrier, Lucas Maystre, and Ciarán Mark Gilligan-Lee. 2023. Estimating long-term causal effects from short-term experiments and long-term observational data with unobserved confounding. In Proc. of the Second Conference on Causal Learning and Reasoning (Proc. of Machine Learning Research, Vol. 213), Mihaela van der Schaar, Cheng Zhang, and Dominik Janzing (Eds.). PMLR, 791–813. https://proceedings.mlr.press/v213/goffrier23a.html
  • Guo et al. (2021) Yongyi Guo, Dominic Coey, Mikael Konutgan, Wenting Li, Chris Schoener, and Matt Goldman. 2021. Machine Learning for Variance Reduction in Online Experiments. In Advances in Neural Information Processing Systems, Vol. 34. Curran Associates, Inc., 8637–8648.
  • Howard et al. (2021) Steven R. Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. 2021. Time-uniform, nonparametric, nonasymptotic confidence sequences. The Annals of Statistics 49, 2 (2021), 1055–1080. https://doi.org/10.1214/20-AOS1991
  • Jeunen (2019) Olivier Jeunen. 2019. Revisiting Offline Evaluation for Implicit-Feedback Recommender Systems. In Proc. of the 13th ACM Conference on Recommender Systems (RecSys '19). ACM, 596–600. https://doi.org/10.1145/3298689.3347069
  • Jeunen (2021) Olivier Jeunen. 2021. Offline Approaches to Recommendation with Online Success. Ph.D. Dissertation. University of Antwerp.
  • Jeunen (2023) Olivier Jeunen. 2023. A Common Misassumption in Online Experiments with Machine Learning Models. SIGIR Forum 57, 1, Article 13 (Dec 2023), 9 pages. https://doi.org/10.1145/3636341.3636358
  • Jeunen et al. (2024) Olivier Jeunen, Ivan Potapov, and Aleksei Ustimenko. 2024. On (Normalised) Discounted Cumulative Gain as an Offline Evaluation Metric for Top-$n$ Recommendation. In Proc. of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '24). arXiv:2307.15053 [cs.IR]
  • Kaiser (1960) Henry F. Kaiser. 1960. Directional statistical decisions. Psychological Review 67, 3 (1960), 160.
  • Kharitonov et al. (2017) Eugene Kharitonov, Alexey Drutsa, and Pavel Serdyukov. 2017. Learning Sensitive Combinations of A/B Test Metrics. In Proc. of the Tenth ACM International Conference on Web Search and Data Mining (WSDM '17). ACM, 651–659. https://doi.org/10.1145/3018661.3018708
  • Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. In Proc. of the 3rd International Conference on Learning Representations (ICLR '14). arXiv:1412.6980 [cs.LG]
  • Kohavi et al. (2022) Ron Kohavi, Alex Deng, and Lukas Vermeer. 2022. A/B Testing Intuition Busters: Common Misunderstandings in Online Controlled Experiments. In Proc. of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '22). ACM, 3168–3177. https://doi.org/10.1145/3534678.3539160
  • Kohavi et al. (2020) Ron Kohavi, Diane Tang, and Ya Xu. 2020. Trustworthy online controlled experiments: A practical guide to A/B testing. Cambridge University Press.
  • Ledoit and Wolf (2004) Olivier Ledoit and Michael Wolf. 2004. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 88, 2 (2004), 365–411. https://doi.org/10.1016/S0047-259X(03)00096-4
  • Ledoit and Wolf (2020) Olivier Ledoit and Michael Wolf. 2020. The Power of (Non-)Linear Shrinking: A Review and Guide to Covariance Matrix Estimation. Journal of Financial Econometrics 20, 1 (2020), 187–218. https://doi.org/10.1093/jjfinec/nbaa007
  • Liu et al. (2020) Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. 2020. On the Variance of the Adaptive Learning Rate and Beyond. In International Conference on Learning Representations (ICLR '20). https://arxiv.org/abs/1908.03265
  • Ma et al. (2020) Jiaqi Ma, Zhe Zhao, Xinyang Yi, Ji Yang, Minmin Chen, Jiaxi Tang, Lichan Hong, and Ed H. Chi. 2020. Off-policy learning in two-stage recommender systems. In Proc. of The Web Conference 2020. 463–473.
  • Mosteller (1948) Frederick Mosteller. 1948. A k-Sample Slippage Test for an Extreme Population. The Annals of Mathematical Statistics 19, 1 (1948), 58–65. http://www.jstor.org/stable/2236056
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2019/file/bdbca288fee7f92f2bfa9f7012727740-Paper.pdf
  • Poyarkov et al. (2016) Alexey Poyarkov, Alexey Drutsa, Andrey Khalyavin, Gleb Gusev, and Pavel Serdyukov. 2016. Boosted Decision Tree Regression Adjustment for Variance Reduction in Online Controlled Experiments. In Proc. of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). ACM, 235–244. https://doi.org/10.1145/2939672.2939688
  • Reddi et al. (2018) Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. 2018. On the Convergence of Adam and Beyond. In International Conference on Learning Representations (ICLR '18). https://openreview.net/forum?id=ryQu7f-RZ
  • Richardson et al. (2023) Lee Richardson, Alessandro Zito, Dylan Greaves, and Jacopo Soriano. 2023. Pareto optimal proxy metrics. arXiv:2307.01000 [stat.ME]
  • Rubin (1974) Donald B. Rubin. 1974. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66, 5 (1974), 688.
  • Schmit and Miller (2022) Sven Schmit and Evan Miller. 2022. Sequential confidence intervals for relative lift with regression adjustments. (2022).
  • Shaffer (1995) Juliet Popper Shaffer. 1995. Multiple Hypothesis Testing. Annual Review of Psychology 46, 1 (1995), 561–584. https://doi.org/10.1146/annurev.ps.46.020195.003021
  • Steck (2013) Harald Steck. 2013. Evaluation of recommendations: rating-prediction and ranking. In Proc. of the 7th ACM Conference on Recommender Systems (RecSys '13). ACM, 213–220. https://doi.org/10.1145/2507157.2507160
  • Tang et al. (2022) Ziyang Tang, Yiheng Duan, Steven Zhu, Stephanie Zhang, and Lihong Li. 2022. Estimating Long-Term Effects from Experimental Data. In Proc. of the 16th ACM Conference on Recommender Systems (RecSys '22). ACM, 516–518. https://doi.org/10.1145/3523227.3547398
  • Tripuraneni et al. (2023) Nilesh Tripuraneni, Lee Richardson, Alexander D'Amour, Jacopo Soriano, and Steve Yadlowsky. 2023. Choosing a Proxy Metric from Past Experiments. arXiv:2309.07893 [stat.ME]
  • Urbano et al. (2019) Julián Urbano, Harlley Lima, and Alan Hanjalic. 2019. Statistical Significance Testing in Information Retrieval: An Empirical Analysis of Type I, Type II and Type III Errors. In Proc. of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '19). ACM, 505–514. https://doi.org/10.1145/3331184.3331259
  • Ustimenko and Prokhorenkova (2020) Aleksei Ustimenko and Liudmila Prokhorenkova. 2020. StochasticRank: Global Optimization of Scale-Free Discrete Functions. In Proc. of the 37th International Conference on Machine Learning (ICML '20, Vol. 119). PMLR, 9669–9679. https://proceedings.mlr.press/v119/ustimenko20a.html
  • Wald (1945) Abraham Wald. 1945. Sequential Tests of Statistical Hypotheses. The Annals of Mathematical Statistics 16, 2 (1945), 117–186. https://doi.org/10.1214/aoms/1177731118
  • Wang et al. (2022) Yuyan Wang, Mohit Sharma, Can Xu, Sriraj Badam, Qian Sun, Lee Richardson, Lisa Chung, Ed H. Chi, and Minmin Chen. 2022. Surrogate for Long-Term User Experience in Recommender Systems. In Proc. of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '22). ACM, 4100–4109. https://doi.org/10.1145/3534678.3539073
  • Welch (1947) Bernard Lewis Welch. 1947. The Generalization of 'Student's' Problem when Several Different Population Variances are Involved. Biometrika 34, 1-2 (1947), 28–35. https://doi.org/10.1093/biomet/34.1-2.28
  • Xie and Aurisset (2016) Huizhi Xie and Juliette Aurisset. 2016. Improving the Sensitivity of Online Controlled Experiments: Case Studies at Netflix. In Proc. of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). ACM, 645–654. https://doi.org/10.1145/2939672.2939733
  • Yue et al. (2010) Yisong Yue, Yue Gao, Oliver Chapelle, Ya Zhang, and Thorsten Joachims. 2010. Learning More Powerful Test Statistics for Click-Based Retrieval Evaluation. In Proc. of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '10). ACM, 507–514. https://doi.org/10.1145/1835449.1835534
[Figure 7(a)–(d): results for the Moj platform, reproducing Figures 3–5; see Appendix A.]

Appendix A Additional Experimental Results

To further empirically validate our theoretical insights w.r.t. the proposed methods, we repeat the experiments reported in Section 4 on data collected for the Moj platform, reproducing Figures 3–5. Results are visualised in Figure 7. Observations match our expectations, further strengthening trust in the replicability of our results.

All improvements in sensitivity and statistical power are of a similar order of magnitude as those for ShareChat: learnt metrics that minimise ($\log$) $p$-values can substantially reduce type-II/III errors without affecting type-I errors. We observe an improvement over the ShareChat data in Figure 7(d): learnt metrics for Moj exhibit a 12-fold reduction in the sample size required to attain the same statistical confidence as the North Star.
