Best-Arm Identification with Noisy Actuation

Merve Karakas^†, Osama Hanna^‡, Lin F. Yang^†, and Christina Fragouli^†
^†University of California, Los Angeles, ^‡ Meta, Central Applied Science
Email: mervekarakas@ucla.edu, ohanna@meta.com, {linyang, christina.fragouli}@ucla.edu

This work is partially supported by NSF grants #2221871, #2007714, and #2221871, by an Army Research Laboratory grant under Cooperative Agreement W911NF-17-2-0196, and by an Amazon Faculty Award.

Abstract

In this paper, we consider a multi-armed bandit (MAB) instance and study how to identify the best arm when arm commands are conveyed from a central learner to a distributed agent over a discrete memoryless channel (DMC). Depending on the agent capabilities, we provide communication schemes along with their analysis, which interestingly relate to the zero-error capacity of the underlying DMC.

I Introduction

Motivated by the growing interest in distributed learning applications, recent work has begun to investigate learning over noisy communication channels [6, 5, 9, 10, 16]. Within this landscape, we consider a new formulation: best arm identification in which arm commands are conveyed from a central learner to a distributed agent over a discrete memoryless channel (DMC).

Such channels can model situations in which actions are communicated over unreliable interfaces, including physical controls, low-bandwidth links, or human-mediated instructions, where commands may be received correctly or confused with a small set of alternatives. Despite their simplicity, DMCs capture key sources of noise in practical action-communication pipelines and allow us to derive theoretical insights.

Our focus is on identifying when performance guarantees can be made independent of the channel error probabilities and depend only on confusability. For that question, zero-error capacity is the natural threshold quantity.

In this paper, we study three models of increasing capability for the distributed agent. In the first model, the agent can only execute the commands it receives. In the second, the agent is equipped with a preloaded codebook that allows it to decode the received signals into actions. In the third, most powerful model, the agent can maintain state and execute multi-round plans installed via zero-error packets. For each model, we compare performance against an idealized benchmark in which communication occurs over an error-free channel. Our main findings are summarized below:

• Case 1 (No decoding). If the agent simply executes received commands, the channel induces mixing of the action distributions. Performance degrades by a factor governed by the smallest singular value of the channel matrix, which depends on the channel error probability.

• Case 2 (Fixed decoding). If the zero-error capacity of the channel is nonzero and the agent can use a fixed block code, then the resulting performance loss can be only a constant multiplicative factor, independent of the channel error probabilities.

• Case 3 (Stateful execution). If the agent can maintain state and execute multi-round plans installed via zero-error packets, then the performance gap relative to the error-free benchmark can be reduced to an additive overhead.

We note that when the zero-error capacity of the underlying DMC is zero, no coding strategy can eliminate dependence on the channel noise (Remark 1), and performance necessarily degrades with the channel error probability.

Related Work. Multi-armed bandits (MAB) are a standard model for sequential decision-making under uncertainty; see, e.g., [13] and references therein. Our objective is fixed-confidence best-arm identification (BAI), for which classic elimination and tracking-style algorithms yield instance-dependent sample complexities of order $\tilde{O}\!\big(\sum_{a\neq a^{\star}}\Delta_{a}^{-2}\log(1/\delta)\big)$ ; representative references include successive elimination [2], lil’UCB [7], and Track-and-Stop [3] as well as complexity/lower-bound characterizations [11].

Prior work on noisy bandits focuses on the reward channel—adversarial corruption [15, 4, 1] and delayed/censored feedback [8]—but assumes the chosen arm is executed as intended. Recent work has studied uncertainty in the executed arm via arm-erasure models, for regret [6, 5, 9, 10] and BAI [16]. In contrast, we consider general discrete memoryless actuation channels with confusability (typewriter channels as a running example), leveraging zero-error communication tools [17, 14, 12].

Paper Organization. Section II provides our model, objectives, and background on zero-error capacity; Sections III, IV, and V discuss cases 1–3 respectively.

II Model and Objectives

II-A Bandit instance and objective (BAI)

We consider a stochastic $K$ -armed bandit with arms indexed by $\mathcal{K}\coloneq\{0,1,\dots,K-1\}$ . Pulling arm $a$ produces a reward $r\sim\nu_{a}$ with mean $\mu_{a}\triangleq\mathbb{E}[r]$ . Let $a^{\star}\in\arg\max_{a\in\mathcal{K}}\mu_{a}$ denote a best arm (assumed unique for simplicity), and define gaps $\Delta_{a}\triangleq\mu_{a^{\star}}-\mu_{a}$ for $a\neq a^{\star}$ .

Our goal is fixed-confidence best-arm identification (BAI): an algorithm adaptively interacts with the environment, stops at a (random) time $\tau$ , and outputs $\hat{a}$ . It is $\delta$ -correct if

\Pr(\hat{a}=a^{\star})\geq 1-\delta.

Among $\delta$ -correct algorithms, we measure performance primarily by the total number of physical rounds (pulls). In our interface, each round also uses the command channel once, so the pull count coincides with the number of channel uses.

In the noiseless setting, instance-dependent lower and upper bounds show that fixed-confidence BAI can be solved with $\Theta\left(\sum_{a\neq a^{\star}}\Delta^{-2}_{a}\log{(1/\delta)}\right)$ samples (see, e.g., [7, 2, 3]). We denote this benchmark by $N_{\text{clean}}(\delta,\mu)$ .

II-B Actuation Channel and Arm Mismatch

At each time slot $t$ , the learner produces a channel input $X_{t}\in\mathcal{X}$ which is passed through a discrete memoryless channel (DMC) $W:\mathcal{X}\to\mathcal{Y}$ , yielding an output $Y_{t}\in\mathcal{Y}$ at the agent according to

W(y\mid x)\;\triangleq\;\Pr(Y_{t}=y\mid X_{t}=x).

We consider an arm channel with $\mathcal{X}=\mathcal{Y}=\mathcal{K}$ , so $X_{t},Y_{t}\in\mathcal{K}$ and block codes correspond to sequences $X^{n},Y^{n}\in\mathcal{K}^{n}$ . The agent then executes an arm $\tilde{a}_{t}\in\mathcal{K}$ (as a function of its received symbols and actuation rule), and a reward $r_{t}$ is generated with mean $\mu_{\tilde{a}_{t}}$ . The learner observes $r_{t}$ and its own transmissions $X_{t}$ , but does not observe channel outputs $Y_{t}$ .

II-C Zero-error capacity and confusability graphs

We briefly introduce notation and review zero-error communication over a DMC. Zero-error rates depend only on which outputs are possible for each input (the support of $W$ ), not on their probabilities [17]. Define the output support sets

\mathcal{Y}(x)\triangleq\{y\in\mathcal{Y}:W(y\mid x)>0\},\qquad x\in\mathcal{X}.

Confusability graph

The confusability graph of $W$ is the undirected graph $G=(V,E)$ with $V=\mathcal{X}$ and

\{x,x^{\prime}\}\in E\quad\Longleftrightarrow\quad\mathcal{Y}(x)\cap\mathcal{Y}(x^{\prime})\neq\emptyset.

Blocklength- $n$ packets

A length- $n$ packet is an input sequence $x^{n}\in V^{n}$ . Confusability of length- $n$ sequences is captured by the $n$ -fold strong graph power $G^{\boxtimes n}$ : distinct $x^{n},(x^{\prime})^{n}\in V^{n}$ are adjacent in $G^{\boxtimes n}$ iff for all coordinates $i$ , $x_{i}=x^{\prime}_{i}$ or $\{x_{i},x^{\prime}_{i}\}\in E$ [17, 12].

Zero-error message count

Let $\alpha(\cdot)$ be the independence number. The maximum number of messages conveyable with zero decoding error using exactly $n$ channel uses is

M(n)\;\triangleq\;\alpha\!\big(G^{\boxtimes n}\big),

(1)

via the standard equivalence between zero-error codebooks and independent sets [17].

Minimal blocklength for a message set

For a finite message set $\mathcal{S}$ (only its size $|\mathcal{S}|$ matters, so we also write $n^{\star}(M)$ for a set of $M$ messages), define

n^{\star}(\mathcal{S})\;\triangleq\;\min\{n\in\mathbb{N}:\alpha(G^{\boxtimes n})\geq|\mathcal{S}|\}.

(2)

Zero-error capacity

Shannon’s zero-error capacity is

C_{0}(G)\;\triangleq\;\lim_{n\to\infty}\frac{1}{n}\log_{2}\alpha\!\big(G^{\boxtimes n}\big)\;=\;\log_{2}\Theta(G),

(3)

where $\Theta(G)$ is the Shannon capacity [17]. When $C_{0}(G)>0$ , $M(n)$ grows exponentially in $n$ , so any fixed finite message set is achievable with some finite blocklength.
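As a sanity check on these definitions, the following Python sketch computes $M(n)=\alpha(G^{\boxtimes n})$ and the minimal blocklength by exhaustive search on small cycles. The helper names (`cycle_adjacent`, `strong_power_adjacent`, `independence_number`, `min_blocklength`) and the cap `n_max` are illustrative assumptions, not part of the paper; the search is exponential and only intended for tiny graphs.

```python
from itertools import product

def cycle_adjacent(K):
    """Adjacency test for the cycle C_K on vertices 0..K-1."""
    return lambda u, v: (u - v) % K in (1, K - 1)

def strong_power_adjacent(adj, x, y):
    """Strong power: distinct tuples whose coordinates are pairwise equal or adjacent."""
    return x != y and all(a == b or adj(a, b) for a, b in zip(x, y))

def independence_number(K, n):
    """alpha of the n-fold strong power of C_K, by branch and bound over bitmasks."""
    adj = cycle_adjacent(K)
    verts = list(product(range(K), repeat=n))
    closed = []  # closed[i] = bitmask of vertex i together with its neighbours
    for i, x in enumerate(verts):
        mask = 1 << i
        for j, y in enumerate(verts):
            if strong_power_adjacent(adj, x, y):
                mask |= 1 << j
        closed.append(mask)
    best = 0

    def search(cands, size):
        nonlocal best
        best = max(best, size)
        # Prune when even taking every remaining candidate cannot beat the incumbent.
        if cands == 0 or size + bin(cands).count("1") <= best:
            return
        i = (cands & -cands).bit_length() - 1  # lowest-index candidate
        search(cands & ~closed[i], size + 1)   # include vertex i
        search(cands & ~(1 << i), size)        # exclude vertex i

    search((1 << len(verts)) - 1, 0)
    return best

def min_blocklength(K, num_messages, n_max=3):
    """Smallest n <= n_max with alpha(C_K^{strong n}) >= num_messages, else None."""
    for n in range(1, n_max + 1):
        if independence_number(K, n) >= num_messages:
            return n
    return None

print(independence_number(5, 1), independence_number(5, 2))  # 2 5
```

For the pentagon, a single use carries only two messages while two uses carry five, which is the gap that the length-2 code of Section IV-A exploits.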

Remark 1

The following are equivalent (see [17]):

C_{0}(G)=0\;\Longleftrightarrow\;\alpha(G^{\boxtimes n})=1\ \forall n\;\Longleftrightarrow\;G\text{ is complete}.

In this case, $M(n)=1$ for all $n$ , so no fixed-blocklength, no-feedback protocol can convey even a binary command with zero error.

II-D Typewriter channel and general graphs

A central example is the one-sided typewriter channel (see Fig. 1) over alphabet $\mathcal{X}=\{0,1,\dots,K-1\}$ :

Y=\begin{cases}X,&\text{w.p. }1-\varepsilon,\\ X+1\!\!\!\!\pmod{K},&\text{w.p. }\varepsilon.\end{cases}

More generally, our coding layer is described by the channel’s confusability graph $G$ , which allows us to state results for arbitrary discrete actuation links beyond typewriter channels. For any $\varepsilon\in(0,1)$ , the one-sided typewriter channel has confusability graph $G=C_{K}$ , the undirected cycle on $K$ vertices with edges $\{i,i+1\}$ (indices modulo $K$ ); throughout the paper we use $C_{5}$ and $C_{6}$ as running examples.

Figure 1: Example of a one-sided typewriter channel over alphabet $\mathcal{X}=\{0,\dots,4\}$ (left) and its confusability graph $C_{5}$ (right).
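The claim that the one-sided typewriter channel has confusability graph $C_{K}$ can be checked directly from the output supports. The sketch below is illustrative only; the function names are hypothetical, and the channel is represented purely by its support sets, which is all that matters for any $\varepsilon\in(0,1)$.

```python
def typewriter_support(K):
    """Output supports Y(x) = {x, x+1 mod K} of the one-sided typewriter channel."""
    return {x: {x, (x + 1) % K} for x in range(K)}

def confusability_edges(support):
    """Edges {x, x'} (x != x') whose output supports intersect."""
    xs = sorted(support)
    return {
        frozenset((x, xp))
        for i, x in enumerate(xs)
        for xp in xs[i + 1:]
        if support[x] & support[xp]
    }

def cycle_edges(K):
    """Edges {i, i+1 mod K} of the undirected cycle C_K."""
    return {frozenset((i, (i + 1) % K)) for i in range(K)}

for K in (5, 6):
    print(K, confusability_edges(typewriter_support(K)) == cycle_edges(K))
```

Both running examples print `True`: inputs $x$ and $x+1$ share the output $x+1$, and no other pair of inputs shares an output.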

III Case 1: No decoding and $\varepsilon$ -dependent inflation

We begin with the vanilla single-shot method: in each round $t$ , the learner selects an intended arm $a_{t}\in\mathcal{K}$ and transmits it once over the channel. The agent then executes the received symbol $\tilde{a}_{t}$ as the pulled arm immediately, generating a reward $r_{t}$ with mean $\mu_{\tilde{a}_{t}}$ . The learner observes $(a_{t},r_{t})$ but not $\tilde{a}_{t}$ .

Let $\mu=(\mu_{0},\dots,\mu_{K-1})^{\top}$ be the (unknown) physical arm-mean vector. Define the command-conditioned mean vector ${\tilde{\mu}}=({\tilde{\mu}_{0}},\dots,{\tilde{\mu}_{K-1}})^{\top}$ by ${\tilde{\mu}_{i}}\triangleq\mathbb{E}[r_{t}\mid a_{t}=i],\quad i\in\mathcal{K}.$

For a general DMC with transition matrix $W$ , the single-shot method yields

{\tilde{\mu}_{i}}\;=\;\sum_{y=0}^{K-1}W(y\mid i)\,\mu_{y},\qquad\text{i.e.}\qquad{\tilde{\mu}}=W\mu.

(4)

Thus, in the baseline, the learner observes rewards from a mixed instance ${\tilde{\mu}}$ , not directly from $\mu$ .

Typewriter specialization.

Under the typewriter law from Section II-D, we have

{\tilde{\mu}_{i}}\;=\;(1-\varepsilon)\mu_{i}+\varepsilon\,\mu_{i+1},\qquad i\in\mathcal{K},\ \text{with the index }i+1\text{ taken modulo }K.

(5)

i.e., $W=(1-\varepsilon)I+\varepsilon S$ , where $S$ is the cyclic shift $(S\mu)_{i}=\mu_{i+1}$ .
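To make the mixing concrete, the sketch below builds the typewriter matrix and evaluates ${\tilde{\mu}}=W\mu$ exactly with rational arithmetic. The instance (`mu`, `eps`) and the helper names are assumptions chosen for illustration.

```python
from fractions import Fraction

def typewriter_matrix(K, eps):
    """Row-stochastic W with W[i][i] = 1 - eps and W[i][(i+1) % K] = eps."""
    W = [[Fraction(0)] * K for _ in range(K)]
    for i in range(K):
        W[i][i] += 1 - eps
        W[i][(i + 1) % K] += eps
    return W

def mix(W, mu):
    """Command-conditioned means mu_tilde = W mu."""
    return [sum(w * m for w, m in zip(row, mu)) for row in W]

eps = Fraction(1, 10)
mu = [Fraction(x, 10) for x in (9, 5, 5, 5, 5)]  # hypothetical instance, arm 0 is best
mu_tilde = mix(typewriter_matrix(5, eps), mu)
print([str(m) for m in mu_tilde])  # ['43/50', '1/2', '1/2', '1/2', '27/50']
```

Commanding arm 4 yields a mean above $\mu_{4}$ because a fraction $\varepsilon$ of those commands are executed as arm 0, so the observed gaps differ from the physical gaps even when the best arm is unchanged.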

III-1 Identifiability: when the best arm cannot be recovered

The BAI goal is to identify $a^{\star}\in\arg\max_{i}\mu_{i}$ . This is impossible if the mixing map $\mu\mapsto{\tilde{\mu}}=W\mu$ is not injective.

Lemma 1 (Non-identifiability under non-injective mixing)

If there exist $\mu\neq\mu^{\prime}$ such that $W\mu=W\mu^{\prime}$ but $\arg\max_{i}\mu_{i}\neq\arg\max_{i}\mu^{\prime}_{i}$ , then no algorithm can be $\delta$ -correct for all instances (for any $\delta<1$ ).
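A concrete pair of instances satisfying the hypothesis of Lemma 1 arises for the typewriter channel with $K=6$ and $\varepsilon=\tfrac{1}{2}$, where the alternating vector $(1,-1,1,-1,1,-1)$ lies in the null space of $W$. The sketch below verifies this numerically; the two mean vectors are hypothetical choices made for illustration.

```python
from fractions import Fraction

EPS = Fraction(1, 2)

def mix(mu, eps=EPS):
    """Typewriter mixing (5): (1 - eps) * mu_i + eps * mu_{(i+1) mod K}."""
    K = len(mu)
    return [(1 - eps) * mu[i] + eps * mu[(i + 1) % K] for i in range(K)]

def argmax(v):
    """Index of the largest entry (first one in case of ties)."""
    return max(range(len(v)), key=lambda i: v[i])

mu = [Fraction(x, 10) for x in (6, 5, 4, 4, 4, 4)]      # best arm: 0
# mu_alt = mu - 0.2 * (1, -1, 1, -1, 1, -1)
mu_alt = [Fraction(x, 10) for x in (4, 7, 2, 6, 2, 6)]  # best arm: 1

print(mix(mu) == mix(mu_alt), argmax(mu), argmax(mu_alt))  # True 0 1
```

Both instances produce the same command-conditioned means, so a learner that only sees single-shot rewards receives the same mean feedback from two instances with different best arms.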

III-2 Baseline inflation via conditioning of the mixing map

Under single-shot mixing (4), the learner observes the mixed means ${\tilde{\mu}}=W\mu$ rather than $\mu$ . When $W$ is invertible, a natural baseline is to estimate ${\tilde{\mu}}$ from samples and then unmix via $\hat{\mu}=W^{-1}{\hat{\tilde{\mu}}}$ . This unmixing step amplifies estimation error according to the conditioning of $W$ .

Proposition 1

Assume $W$ is known and invertible. Let $\sigma_{\min}(W)$ denote the smallest singular value of $W$ . Then, relative to the noiseless setting, the single-shot + unmixing baseline incurs an inflation governed by

T_{\text{baseline}}(\delta,\mu)\;=\;\tilde{O}\!\Big(\frac{1}{\sigma_{\min}(W)^{2}}\,N_{\text{clean}}(\delta,\mu)\Big),

(6)

where $\tilde{O}(\cdot)$ hides universal constants and logarithmic factors and

N_{\text{clean}}(\delta,\mu)\;=\;\Theta\!\Big(\sum_{a\neq a^{\star}}\Delta_{a}^{-2}\log(1/\delta)\Big).

(7)

Proof sketch. Let ${\hat{\tilde{\mu}}}$ be an estimator of ${\tilde{\mu}}$ and define $\hat{\mu}=W^{-1}{\hat{\tilde{\mu}}}$ . By submultiplicativity,

\|\hat{\mu}-\mu\|_{2}=\|W^{-1}({\hat{\tilde{\mu}}}-{\tilde{\mu}})\|_{2}\leq\frac{1}{\sigma_{\min}(W)}\,\|{\hat{\tilde{\mu}}}-{\tilde{\mu}}\|_{2}.

(8)

In the noiseless (direct-actuation) setting, standard fixed-confidence BAI algorithms require the number of pulls given in (7). Thus, to achieve the same accuracy on $\mu$ as in the noiseless case, it suffices by (8) to estimate ${\tilde{\mu}}$ more accurately by a factor $\sigma_{\min}(W)$ . Since statistical estimation error scales as $t^{-1/2}$ , shrinking the target error by a factor $\sigma_{\min}(W)$ increases the required sample size by a factor $1/\sigma_{\min}(W)^{2}$ , yielding (6).¹

¹ Note that $\sigma_{\min}(W)\leq 1$ for any square stochastic matrix, so $1/\sigma_{\min}(W)^{2}\geq 1$ and the baseline never improves over the noiseless case.

Remark 2

In general, the baseline inflation is governed by the conditioning of the mixing map via $1/\sigma_{\min}(W)^{2}$ . For the typewriter mixing matrix $W=(1-\varepsilon)I+\varepsilon S$ , when $K$ is even we have $\sigma_{\min}(W)=|1-2\varepsilon|$ , so the inflation reduces to $(1-2\varepsilon)^{-2}$ and diverges as $\varepsilon\to\tfrac{1}{2}$ , where identifiability fails.

Example ( $C_{5}$ , $C_{6}$ ):

• $K=6$ (even): $\sigma_{\min}(W)=|1-2\varepsilon|$ , so the baseline inflation blows up as $\varepsilon\to\tfrac{1}{2}$ , and identification is impossible at $\varepsilon=\tfrac{1}{2}$ .

• $K=5$ (odd): $\sigma_{\min}(W)>0$ for all $\varepsilon\in(0,1)$ , so the baseline remains identifiable (no singularity) but still suffers conditioning-based inflation.
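The two cases can be reproduced numerically. The sketch below relies on a standard fact not spelled out in the text: $W=(1-\varepsilon)I+\varepsilon S$ is circulant, hence normal, so its singular values are the moduli of its eigenvalues $(1-\varepsilon)+\varepsilon\,\omega^{k}$ with $\omega=e^{2\pi i/K}$. The function names are illustrative.

```python
import cmath
import math

def typewriter_sigma_min(K, eps):
    """Smallest singular value of W = (1 - eps) I + eps S.

    W is circulant (hence normal), so its singular values are
    |(1 - eps) + eps * omega^k| for k = 0..K-1, with omega = exp(2*pi*i/K).
    """
    return min(
        abs((1 - eps) + eps * cmath.exp(2j * math.pi * k / K))
        for k in range(K)
    )

def inflation(K, eps):
    """Baseline inflation factor 1 / sigma_min(W)^2 from (6)."""
    return 1.0 / typewriter_sigma_min(K, eps) ** 2

for eps in (0.1, 0.3, 0.45):
    print(eps, round(inflation(5, eps), 2), round(inflation(6, eps), 2))
```

At $\varepsilon=\tfrac{1}{2}$ the $K=6$ value of $\sigma_{\min}$ is zero up to rounding, whereas for $K=5$ it equals $\cos(2\pi/5)\approx 0.309$, so the odd cycle stays invertible but is still inflated by roughly a factor of ten.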

IV Case 2: Zero-Error Block Codes

In this section, we consider systems in which the learner and the agent preshare a fixed block code that achieves zero-error message transmission in $n$ channel uses. This removes $\varepsilon$ -dependent conditioning entirely (since correctness is zero-error and depends only on the channel support), but introduces a constant multiplicative slowdown whose value depends on the zero-error control scheme. We present two generic zero-error coding constructions, illustrated on the typewriter channels with confusability graphs $C_{5}$ and $C_{6}$ (Section II-D).

• Scheme 1: zero-error capacity codebook. A length- $n_{u}$ zero-error codebook for $K$ messages allows the learner to install any arm index once every $n_{u}$ channel uses.

• Scheme 2: independent-set schedules. The learner uses a public schedule over independent sets of the confusability graph; in each time slot, only arms in the active independent set may be transmitted, but each such arm can be decoded in a single channel use.

IV-A Scheme 1: zero-error capacity codebook for arm updates

We illustrate on $C_{5}$ , producing a $(2,1)$ zero-error code, where $(i,j)$ indicates $i$ channel use(s) to install $j$ arm choice(s). Set $K=5$ . Consider the length- $2$ codebook

\varphi(i)\;\triangleq\;\big(i,\;2i\!\!\!\!\pmod{5}\big),\qquad i\in\{0,1,2,3,4\}.

(9)

Under the typewriter law, transmitting $\varphi(i)$ yields

Y_{1}\in\{i,i+1\},\qquad Y_{2}\in\{2i,2i+1\}\pmod{5}.

Lemma 2

The codebook $\{\varphi(i)\}_{i=0}^{4}\subseteq\{0,\dots,4\}^{2}$ is zero-error for the typewriter channel (with $C_{5}$ ). Equivalently, $\{\varphi(i)\}$ is an independent set in $C_{5}^{\boxtimes 2}$ , so $\alpha(C_{5}^{\boxtimes 2})\geq 5$ .

This code conveys $\log_{2}5$ bits in $2$ uses, i.e., rate $\tfrac{1}{2}\log_{2}5$ bits/use. Since $C_{0}(C_{5})=\tfrac{1}{2}\log_{2}5$ [14], it is capacity-achieving on $C_{5}$ .²

² By (1) and Sec. II-D, the number of zero-error messages in $n$ channel uses is $M(n)=\alpha(C_{K}^{\boxtimes n})$ .
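Lemma 2 can be verified exhaustively: the five codewords must have pairwise disjoint sets of reachable output pairs. The sketch below builds those sets and, as a by-product, a lookup-table decoder; `output_set` and `decode_table` are hypothetical names introduced for this illustration.

```python
K = 5

def phi(i):
    """Length-2 codeword for arm i from (9): (i, 2i mod 5)."""
    return (i % K, (2 * i) % K)

def output_set(codeword):
    """All output pairs reachable over the one-sided typewriter channel."""
    x1, x2 = codeword
    return {
        (y1, y2)
        for y1 in (x1, (x1 + 1) % K)
        for y2 in (x2, (x2 + 1) % K)
    }

# Zero-error decoder: every reachable output pair maps to exactly one message.
decode_table = {}
for i in range(K):
    for y in output_set(phi(i)):
        assert y not in decode_table, "two codewords share an output"
        decode_table[y] = i

print(len(decode_table))  # 20 reachable pairs, 4 per codeword
```

The inner assertion never fires, which confirms that the output sets are disjoint, so the agent can recover the arm index from any received pair regardless of $\varepsilon$.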

Proposition 2

Assume there exists a fixed-length zero-error update code of blocklength $n_{u}$ with $M(n_{u})=\alpha(G^{\boxtimes n_{u}})\geq K$ . Assume the agent executes the decoded update immediately upon decoding, i.e., on the last channel use of the packet.
Let $\mathsf{A}$ be any $\delta$ -correct clean BAI algorithm and $\tau_{\rm clean}$ its pull count. Then there exists an algorithm $\widetilde{\mathsf{A}}$ over the actuation link that is $\delta$ -correct and whose total number of physical rounds satisfies

\tau\;\leq\;n_{u}\,\tau_{\rm clean}.

Proof sketch. Run $\mathsf{A}$ as a virtual algorithm. Partition time into consecutive blocks of length $n_{u}$ . In block $i$ , transmit the $n_{u}$ -symbol zero-error codeword for the arm requested by $\mathsf{A}$ at virtual step $i$ . By assumption, the decoded arm is executed on the last slot of the block; use the reward from that last slot as the virtual reward fed back to $\mathsf{A}$ . All intermediate rewards within each block can be logged but ignored. Zero decoding error ensures the virtual interaction matches the clean bandit interaction exactly. Thus, $\delta$ -correctness is preserved and each virtual pull costs $n_{u}$ physical rounds. Full proof in App. A.
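The block wrapper in this proof sketch can be simulated on $C_{5}$ with the length-2 code $\varphi$. The sketch below is a simplification under stated assumptions: the clean algorithm is replaced by a fixed list of arm requests (rather than an adaptive one), rewards are omitted, and the names `typewriter`, `decode`, and `run_wrapper` are illustrative.

```python
import random

K, N_U = 5, 2

def phi(i):
    """Length-2 zero-error codeword for arm i on C_5."""
    return (i, (2 * i) % K)

def typewriter(x, eps, rng):
    """One use of the one-sided typewriter channel."""
    return (x + 1) % K if rng.random() < eps else x

def decode(y1, y2):
    """Return the unique arm whose codeword can produce the pair (y1, y2)."""
    for i in range(K):
        c1, c2 = phi(i)
        if y1 in (c1, (c1 + 1) % K) and y2 in (c2, (c2 + 1) % K):
            return i
    raise ValueError("output pair not reachable from any codeword")

def run_wrapper(requests, eps, seed=0):
    """Send each requested arm as a block; return executed arms and physical rounds."""
    rng = random.Random(seed)
    executed, rounds = [], 0
    for a in requests:
        y1, y2 = (typewriter(x, eps, rng) for x in phi(a))
        rounds += N_U                    # each virtual pull costs n_u physical rounds
        executed.append(decode(y1, y2))  # arm executed on the last slot of the block
    return executed, rounds

requests = [0, 3, 3, 1, 4, 2]
print(run_wrapper(requests, eps=0.5))  # ([0, 3, 3, 1, 4, 2], 12)
```

The executed sequence matches the requested one for every channel realization, and the round count is exactly $n_{u}$ times the number of requests, in line with $\tau\leq n_{u}\,\tau_{\rm clean}$.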

Corollary 1

If $\mathsf{A}$ satisfies $\tau_{\rm clean}\leq N_{\rm clean}(\delta,\mu)$ , then $\tau\leq n_{u}\,N_{\rm clean}(\delta,\mu)$ .

Corollary 2

For $K=5$ , Lemma 2 gives an independent set of size $5$ in $C_{5}^{\boxtimes 2}$ , so $M(2)\geq 5$ while $M(1)=\alpha(C_{5})=2<5$ . Hence $n^{\star}(5)=2$ . For $K=6$ , since $\alpha(C_{6})=3$ , we have $M(1)=3<6$ , and by Cartesian-product coding $M(2)\geq\alpha(C_{6})^{2}=9\geq 6$ ; hence $n^{\star}(6)=2$ for Scheme 1 codes. Therefore, a fully flexible zero-error update packet exists with blocklength $n_{u}=2$ for both $K=5$ and $K=6$ . By Proposition 2 and Corollary 1, the number of physical rounds satisfies $\tau\leq 2\,N_{\rm clean}(\delta,\mu)$ .
These guarantees hold for all $\varepsilon\in(0,1)$ , whereas the single-shot baseline can be non-identifiable at $\varepsilon=\nicefrac{{1}}{{2}}$ for even $K$ .

Remark 3 (Non–zero-error preshared codes)

Instead of zero-error codes, one could preshare a classical block code (e.g., Reed–Solomon) with small decoding error. From the learner’s viewpoint this is equivalent to replacing $W$ by an effective DMC $\widetilde{W}$ on $\mathcal{K}$ , so the analysis reduces to the single-shot baseline of Section III with $\sigma_{\min}(\widetilde{W})$ instead of $\sigma_{\min}(W)$ .

IV-B Scheme 2: independent sets using a public parity schedule

This scheme exploits the graph structure of $G$ more directly: instead of sending arbitrary arms at every use, we restrict each time slot to an independent set of vertices, which allows zero-error decoding in a single channel use. We illustrate on $C_{6}$ by providing a constrained $(1,1)$ zero-error code on $C_{6}$ with a public parity schedule.

Lemma 3

Let $G=(V,E)$ be the confusability graph of $W$ and $\mathcal{S}\subseteq V$ an independent set. If the encoder restricts to inputs $X\in\mathcal{S}$ and the decoder knows $\mathcal{S}$ , then $X$ can be decoded with zero-error from a single channel output $Y$ .

Fix a finite collection of independent sets $\mathcal{S}_{1},\dots,\mathcal{S}_{m}\subseteq\mathcal{K}$ , and a public periodic schedule $s:\mathbb{N}\to[m]$ known to both learner and agent. At time $t$ , only arms in the active set $\mathcal{S}_{s(t)}$ may be transmitted; by Lemma 3, any arm in $\mathcal{S}_{s(t)}$ can be sent and decoded with zero-error in one channel use.

To simulate a clean BAI algorithm $\mathsf{A}$ , the learner runs $\mathsf{A}$ virtually and, whenever $\mathsf{A}$ requests arm $a$ , waits until the first time $t$ such that $a\in\mathcal{S}_{s(t)}$ , then transmits $a$ and counts the resulting reward as the virtual reward. The slowdown factor depends on how frequently each arm (or pair of arms) appears in the active sets. We now instantiate this on $C_{6}$ .

Set $K=6$ . The cycle $C_{6}$ admits a $2$ -coloring with independent sets $\mathcal{S}_{\mathrm{even}}=\{0,2,4\}$ and $\mathcal{S}_{\mathrm{odd}}=\{1,3,5\}$ .

Fix a public, deterministic schedule known to encoder/decoder that specifies which set is active at each channel use, e.g., alternating $\mathcal{S}_{\mathrm{even}},\mathcal{S}_{\mathrm{odd}},\dots$ . In a use where $\mathcal{S}$ is active, the encoder is restricted to transmit $X\in\mathcal{S}$ .

Lemma 4

For the typewriter channel on $C_{6}$ , if $X$ is restricted to an independent set $\mathcal{S}\in\{\mathcal{S}_{\mathrm{even}},\mathcal{S}_{\mathrm{odd}}\}$ known to the decoder, then $X$ can be decoded from a single output $Y$ with zero-error.

Each use conveys one of $|\mathcal{S}|=3$ possibilities with zero-error, a ternary digit per channel use, where the active set $\mathcal{S}$ acts as shared side information. Moreover, for even cycles $C_{2m}$ , $C_{0}(C_{2m})=\log_{2}m$ [14]; hence $C_{0}(C_{6})=\log_{2}3$ . The above scheme then achieves $\log_{2}3$ bits/use, and is capacity-achieving on $C_{6}$ .
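The single-use decoder behind Lemma 4 is short enough to write out. In the sketch below, the convention that the even set is active in odd-numbered slots (starting from slot 1) is an assumption made for illustration, as are the function names.

```python
S_EVEN, S_ODD = (0, 2, 4), (1, 3, 5)

def active_set(t):
    """Public alternating schedule: even set in slots 1, 3, 5, ..., odd set otherwise."""
    return S_EVEN if t % 2 == 1 else S_ODD

def decode(y, t):
    """Zero-error decoding on C_6: the input is y or y - 1 (mod 6),
    and exactly one of the two lies in the active independent set."""
    candidates = [x for x in active_set(t) if y in (x, (x + 1) % 6)]
    assert len(candidates) == 1
    return candidates[0]

# Every admissible input is recovered from either of its two possible outputs.
for t in (1, 2):
    for x in active_set(t):
        for y in (x, (x + 1) % 6):
            assert decode(y, t) == x
print("zero-error decoding verified for both parity classes")
```

The shared schedule is what resolves the ambiguity: without knowing the active set, output $y$ could have come from either $y$ or $y-1$.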

Proposition 3

Consider Scheme 2 with $K=6$ (public alternating schedule $\mathcal{S}_{\rm even}$ , $\mathcal{S}_{\rm odd}$ ). In each slot, the transmitted arm is decoded with zero-error in one channel use (Lemma 4).
Let $\mathsf{A}$ be any $\delta$ -correct clean BAI algorithm, and let $\tau_{\rm clean}$ be its (random) number of pulls under no channel noise. Let $a_{1},\ldots,a_{\tau_{\rm clean}}$ denote the random sequence of arm requests produced by $\mathsf{A}$ on the clean instance, and define the parity class sequence

{p_{i}\triangleq\begin{cases}E,&a_{i}\in\mathcal{S}_{\rm even},\\ O,&a_{i}\in\mathcal{S}_{\rm odd}.\end{cases}}

Then there exists an actuation wrapper that simulates $\mathsf{A}$ over the noisy link with the same error probability and a stopping time $\tau$ satisfying

{\tau=\tau_{\rm clean}+\mathbf{1}_{\{p_{1}=O\}}+\textstyle\sum_{i=2}^{\tau_{\rm clean}}\mathbf{1}_{\{p_{i}=p_{i-1}\}}\leq 2\tau_{\rm clean}.}

(10)

Proof sketch. The wrapper runs $\mathsf{A}$ virtually and only counts rewards when the requested arm belongs to the active parity class; otherwise it waits one slot and then transmits the arm. Zero-error decoding ensures the counted interaction matches the clean bandit path. If the $i$ -th request has the same parity as the $(i{-}1)$ -th, the wrapper must wait one extra slot; otherwise the next slot already has the correct parity. Summing these waiting slots yields (10). Full proof in App. B.

Equation (10) shows that the slowdown under Scheme 2 depends on the local pattern of parity requests, not merely on the total number of pulls from each parity class. In particular,

{\frac{\tau}{\tau_{\rm clean}}=1+\frac{\mathbf{1}\{p_{1}=O\}+\sum_{i=2}^{\tau_{\rm clean}}\mathbf{1}\{p_{i}=p_{i-1}\}}{\tau_{\rm clean}}\;\in\;[1,2].}

When the requested parity alternates often, the slowdown is close to $1$ ; when the clean algorithm makes long runs within the same parity class, the slowdown approaches $2$ .
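The closed form (10) can be cross-checked against a slot-by-slot simulation of the alternating schedule. The sketch below assumes the even class is active in slot 1 (consistent with the $\mathbf{1}_{\{p_{1}=O\}}$ term) and encodes parity sequences as strings or tuples of `"E"` and `"O"`; both functions are illustrative.

```python
from itertools import product

def tau_formula(parities):
    """Right-hand side of (10) for a non-empty parity sequence such as "EOOE"."""
    extra = int(parities[0] == "O")
    extra += sum(parities[i] == parities[i - 1] for i in range(1, len(parities)))
    return len(parities) + extra

def tau_simulated(parities):
    """Slot-by-slot simulation: slot t serves class E if t is odd, O if t is even."""
    t = 0
    for p in parities:
        t += 1
        while ("E" if t % 2 == 1 else "O") != p:
            t += 1  # wait for the requested parity class to become active
    return t

# Worst-case slowdown over all request patterns of length 4.
worst = max(tau_formula(s) / len(s) for s in product("EO", repeat=4))
print(tau_formula("EOEO"), tau_formula("EEOO"), worst)  # 4 6 2.0
```

A perfectly alternating request pattern incurs no waiting, while a constant odd-parity pattern attains the factor of 2.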

IV-B1 Beyond bipartitions: overlapping independent sets

The parity schedule on $C_{6}$ is based on a partition of the arms into two independent sets. There is no requirement, however, that the active independent sets form a partition: an arm may belong to several independent sets and thus be admissible in multiple time slots. This can be exploited to reduce the worst-case slowdown constant.

As a simple illustration, consider $C_{5}$ with vertex set $\{0,1,2,3,4\}$ . Independent sets include, for example, $\{1,3\}$ , $\{2,4\}$ , $\{0,2\},$ and we may use a length- $3$ periodic schedule that activates these sets in turn. Arm $2$ then belongs to two of the three sets and is therefore available in $2/3$ of the slots, rather than $1/3$ . This illustrates that one can bias the schedule toward specific arms by letting them appear in multiple independent sets. Intuitively, this higher availability fraction can translate into a smaller slowdown factor when simulating a clean algorithm, because the wrapper needs to wait less frequently for an arm to become admissible.
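The availability fractions in this example can be computed mechanically. The sketch below checks that each scheduled set is independent in $C_{5}$ and counts how often each arm is admissible within one period; the helper names are hypothetical.

```python
from fractions import Fraction

C5_EDGES = {frozenset((i, (i + 1) % 5)) for i in range(5)}
SCHEDULE = [{1, 3}, {2, 4}, {0, 2}]  # one period of the public schedule

def is_independent(s):
    """True if no two distinct vertices of s are adjacent in C_5."""
    return all(frozenset((u, v)) not in C5_EDGES for u in s for v in s if u != v)

def availability(schedule, K=5):
    """Fraction of slots per period in which each arm may be transmitted."""
    return {
        a: Fraction(sum(a in s for s in schedule), len(schedule))
        for a in range(K)
    }

print(all(is_independent(s) for s in SCHEDULE))
print({a: str(f) for a, f in availability(SCHEDULE).items()})
```

Arm 2 is admissible in two of every three slots and every other arm in one of three, so a wrapper that mostly requests arm 2 waits less often than under a partition-based schedule.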

Designing schedules (and collections of independent sets) that optimally balance the availability of individual arms and pairs of arms is a combinatorial problem closely related to fractional colorings and covering designs of the confusability graph. In principle, such designs can push the worst-case slowdown below the simple factor- $2$ bound in Prop. 3. We leave a systematic exploration of this direction to future work.

V Case 3: Stateful Plan Execution via Packetized Successive Elimination (PSE)

The previous section showed that if we insist on issuing a fresh zero-error arm command for every pull, then any clean BAI algorithm can be simulated at the cost of a constant multiplicative slowdown. We now show that if the agent can maintain state and execute a multi-round plan, then the learner needs to communicate only at decision times (phase boundaries). This converts actuation cost into an additive overhead.

Throughout this section, the system acts continuously: while an $n$ -symbol command packet is being transmitted, the agent keeps executing its currently committed behavior, and the newly decoded plan takes effect only after the packet is decoded (switching latency).

V-A Packetized Successive Elimination (PSE)

We now describe Algorithm 1 (PSE), which builds on a phased version of Successive Elimination [2]. The coding layer only appears through the existence of a zero-error plan packet of length $n_{r}$ that conveys any phase plan; explicit constructions (e.g., on $C_{5}$ and $C_{6}$ ) are plugged in later.

Algorithm 1 Packetized Successive Elimination (PSE)

Input: confidence $\delta\in(0,1)$; phase budgets $m_{r}=2^{r-1}$; plan families $(\mathcal{P}_{r})$; zero-error plan packets of lengths $n_{r}$ with $\alpha(G^{\boxtimes n_{r}})\geq|\mathcal{P}_{r}|$
$r\leftarrow 1$, $S_{1}\leftarrow\mathcal{K}$, choose a hold arm $h\in\mathcal{K}$
while $|S_{r}|>1$ do
    Install: transmit a length-$n_{r}$ zero-error packet for $\text{plan}(S_{r},m_{r})$. (During installation, the agent pulls $h$.)
    Execute: for each $a\in S_{r}$, pull arm $a$ for $m_{r}$ rounds and count these rewards.
    Let $t_{r}\leftarrow 2^{r}-1$. For each $a\in S_{r}$, set $\mathrm{UCB}_{a}=\widehat{\mu}_{a}(t_{r})+\beta(t_{r})$ and $\mathrm{LCB}_{a}=\widehat{\mu}_{a}(t_{r})-\beta(t_{r})$.
    $b_{r}\in\arg\max_{a\in S_{r}}\mathrm{LCB}_{a}$; $S_{r+1}\leftarrow\{a\in S_{r}:\mathrm{UCB}_{a}\geq\mathrm{LCB}_{b_{r}}\}$; $h\leftarrow$ the last arm pulled in phase $r$.
    $r\leftarrow r+1$
return the unique arm in $S_{r}$

Confidence radius and phase schedule.

Let

\beta(t)\triangleq\sqrt{\nicefrac{{2\log\!\left(\frac{8Kt^{2}}{\delta}\right)}}{{t}}}.

We use phases $r=1,2,\dots$ with per-phase budget $m_{r}\triangleq 2^{r-1}$ counted pulls per active arm, so that the cumulative number of counted pulls per surviving arm at the end of phase $r$ is $t_{r}=\sum_{j=1}^{r}m_{j}=2^{r}-1$ .

During plan installation (the $n_{r}$ channel uses), the agent keeps executing its previously committed arm.

Plan packets.

In phase $r$ , the learner selects an active set $S_{r}\subseteq\mathcal{K}$ and a pre-agreed repetition budget $m_{r}\in\mathbb{N}$ . The phase plan is:

\text{plan}(S_{r},m_{r}):\ \text{pull each arm in $S_{r}$ exactly $m_{r}$ times.}

Let $\mathcal{P}_{r}$ denote the set of admissible phase plans (equivalently, admissible active sets). A zero-error plan packet of length $n_{r}$ is any code that can convey any index in $\mathcal{P}_{r}$ with zero decoding error; its existence is ensured whenever $\alpha(G^{\boxtimes n_{r}})\geq|\mathcal{P}_{r}|$ .
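The following Python sketch mirrors Algorithm 1 under simplifying assumptions: the coding layer is abstracted away (each plan is assumed to arrive intact after $n_{r}$ channel uses, as a zero-error packet guarantees), the packet length is a constant `n_r`, the initial hold arm is arm 0, and `pull` is a hypothetical callback returning one reward. It is meant to show the accounting of counted pulls versus physical rounds, not to serve as a reference implementation.

```python
import math

def beta(t, K, delta):
    """Confidence radius sqrt(2 * log(8 K t^2 / delta) / t)."""
    return math.sqrt(2 * math.log(8 * K * t * t / delta) / t)

def pse(pull, K, delta, n_r):
    """Return (best arm, counted pulls, physical rounds, executed phases)."""
    active = list(range(K))
    sums = [0.0] * K
    hold = 0                      # arbitrary initial hold arm (assumption)
    counted = rounds = 0
    r = 1
    while len(active) > 1:
        # Install: n_r channel uses; the agent keeps pulling the hold arm
        # and these rewards are discarded.
        for _ in range(n_r):
            pull(hold)
        rounds += n_r
        # Execute: m_r = 2^(r-1) counted pulls for each active arm.
        m_r = 2 ** (r - 1)
        for a in active:
            for _ in range(m_r):
                sums[a] += pull(a)
            hold = a              # last arm pulled in this phase
        counted += m_r * len(active)
        rounds += m_r * len(active)
        # Eliminate with t_r = 2^r - 1 counted samples per surviving arm.
        t_r = 2 ** r - 1
        rad = beta(t_r, K, delta)
        means = {a: sums[a] / t_r for a in active}
        best_lcb = max(means[a] - rad for a in active)
        active = [a for a in active if means[a] + rad >= best_lcb]
        r += 1
    return active[0], counted, rounds, r - 1

# Noise-free toy instance (an assumption) so that the run is deterministic.
mu = [1.0, 0.0, 0.0]
arm, counted, rounds, phases = pse(lambda a: mu[a], K=3, delta=0.1, n_r=3)
print(arm, counted, rounds, phases)  # 0 381 402 7
```

In this toy run the installation overhead is $3\times 7=21$ rounds on top of 381 counted pulls; a noisy instance can be simulated by passing, for example, `lambda a: random.gauss(mu[a], 1.0)` as `pull`.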

Theorem 1

Assume $1$ -sub-Gaussian rewards. Then PSE is $\delta$ -correct. Moreover, with probability at least $1-\delta$ , its total number of physical rounds satisfies

\tau\;\leq\;N_{\rm SE}(\delta,\mu)\;+\;\sum_{r=1}^{R}n_{r},

where $R$ is the number of executed phases and $N_{\rm SE}(\delta,\mu)$ is the (instance-dependent) pull complexity of the clean phased successive-elimination algorithm (standard analyses), i.e.,

N_{\rm SE}(\delta,\mu)=\tilde{O}\!\left(\sum_{a\neq a^{\star}}\frac{1}{\Delta_{a}^{2}}\log\!\frac{K\log(1/\Delta_{a})}{\delta}\right).

Proof sketch. Define the standard uniform concentration event $\mathcal{E}\triangleq\bigcap_{a\in\mathcal{K}}\bigcap_{t\geq 1}\left\{|\widehat{\mu}_{a}(t)-\mu_{a}|\leq\beta(t)\right\}.$ By a union bound over arms and times, $\Pr(\mathcal{E})\geq 1-\delta$ .
On $\mathcal{E}$ , the elimination rule of phased successive elimination is sound: $a^{\star}$ is never removed, and any suboptimal arm $a$ is eliminated once $4\beta(t)\leq\Delta_{a}$ at the end of some phase (since $\mathrm{UCB}_{a}\leq\mu_{a}+2\beta(t)$ and $\mathrm{LCB}_{b_{r}}\geq\mathrm{LCB}_{a^{\star}}\geq\mu_{a^{\star}}-2\beta(t)$ ).
Crucially, PSE’s counted rewards are generated by exactly the same arm sequence and sample sizes as the clean phased-SE algorithm (the additional $n_{r}$ rounds during plan installation are simply ignored for the statistical test). Therefore, on $\mathcal{E}$ , the number of counted pulls until termination is at most $N_{\rm SE}(\delta,\mu)$ .
Finally, each executed phase incurs $n_{r}$ additional physical rounds for plan installation (switching latency), yielding $\tau\leq N_{\rm SE}(\delta,\mu)+\sum_{r=1}^{R}n_{r}$ on $\mathcal{E}$ . See full proof in App. C.
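For completeness, the union bound behind $\Pr(\mathcal{E})\geq 1-\delta$ can be written out as follows. This derivation assumes the standard two-sided tail bound for the empirical mean of $t$ independent $1$-sub-Gaussian samples, $\Pr(|\widehat{\mu}_{a}(t)-\mu_{a}|>\beta)\leq 2\exp(-t\beta^{2}/2)$, which is not stated explicitly above.

```latex
\Pr(\mathcal{E}^{c})
\;\le\; \sum_{a\in\mathcal{K}}\sum_{t\ge 1} 2\exp\!\Big(-\tfrac{t\,\beta(t)^{2}}{2}\Big)
\;=\; \sum_{a\in\mathcal{K}}\sum_{t\ge 1} \frac{2\,\delta}{8Kt^{2}}
\;=\; \frac{\delta}{4}\sum_{t\ge 1}\frac{1}{t^{2}}
\;=\; \frac{\pi^{2}}{24}\,\delta
\;\le\; \delta .
```

The second step uses $t\,\beta(t)^{2}/2=\log(8Kt^{2}/\delta)$, which follows directly from the definition of $\beta(t)$.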

Remark 4

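For $\beta(t)=\sqrt{2\log(8Kt^{2}/\delta)/t}$ , the condition $4\beta(t)\leq\Delta$ is implied by $t\geq\frac{c}{\Delta^{2}}\log\!\Big(\frac{c^{\prime}K}{\delta\Delta^{2}}\Big)$ for universal constants $c,c^{\prime}>0$ . Consequently, the number of phases is at most $R^{\star}=1+\lceil\log_{2}T_{\max}\rceil=O(\log(1/\Delta_{\min}))$ , where $\Delta_{\min}\triangleq\min_{a\neq a^{\star}}\Delta_{a}$ , $T_{\max}$ is the above threshold on $t$ evaluated at $\Delta=\Delta_{\min}$ , and $K$ and $\delta$ are treated as fixed in the $O(\cdot)$ .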

Corollary 3

Consider the one-sided typewriter channel with $\varepsilon\in(0,1)$ , so the confusability graph is $C_{K}$ . Assume the simple plan family $\mathcal{P}_{r}=\{\,\text{plan}(S,m_{r}):\emptyset\neq S\subseteq\mathcal{K}\,\}$ , so $|\mathcal{P}_{r}|\leq 2^{K}-1$ for all $r$ .
(i) $K=5$ . Using the length- $2$ code of Lemma 2 as a base- $5$ digit on $C_{5}$ (two uses per digit), $n_{r}=2\Big\lceil\log_{5}|\mathcal{P}_{r}|\Big\rceil\leq 2\Big\lceil\log_{5}(2^{5})\Big\rceil=6.$
(ii) $K=6$ . Using the public parity schedule of Scheme 2 on $C_{6}$ (one ternary digit per use), $n_{r}=\Big\lceil\log_{3}|\mathcal{P}_{r}|\Big\rceil\leq\Big\lceil\log_{3}(2^{6})\Big\rceil=4.$
In both cases, Theorem 1 gives $\tau\leq N_{\rm SE}(\delta,\mu)+O(R)$ with small constants: at most $6R$ on $C_{5}$ and $4R$ on $C_{6}$ .
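The packet lengths in Corollary 3 follow from counting digits. The sketch below recomputes them with integer arithmetic (avoiding floating-point logarithms); `digits_needed` and `plan_packet_length` are illustrative names, and only the two running examples are covered.

```python
def digits_needed(num_messages, base):
    """Smallest d with base**d >= num_messages."""
    d = 0
    while base ** d < num_messages:
        d += 1
    return d

def plan_packet_length(K):
    """Plan-packet length n_r for the simple plan family of nonempty subsets."""
    plans = 2 ** K - 1                      # number of nonempty subsets of K arms
    if K == 5:
        return 2 * digits_needed(plans, 5)  # two channel uses per base-5 digit on C_5
    if K == 6:
        return digits_needed(plans, 3)      # one ternary digit per use on C_6
    raise ValueError("only the two running examples are covered")

print(plan_packet_length(5), plan_packet_length(6))  # 6 4
```

Three base-5 digits (125 values) cover the 31 plans on $C_{5}$, and four ternary digits (81 values) cover the 63 plans on $C_{6}$.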

Thus, when the statistical term $N_{\rm SE}(\delta,\mu)$ dominates (small gaps / small $\delta$ ), the $O(R)$ installation overhead of PSE is negligible, whereas per-pull zero-error updates inflate the entire pull count by a multiplicative constant $\approx n^{\star}(\mathcal{K})$ .

References

[1] I. Amir, I. Attias, T. Koren, Y. Mansour, and R. Livni (2020) Prediction with corrupted expert advice. Advances in Neural Information Processing Systems 33, pp. 14315–14325. Cited by: §I.
[2] E. Even-Dar, S. Mannor, Y. Mansour, and S. Mahadevan (2006) Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems.. Journal of machine learning research 7 (6). Cited by: §I, §II-A, §V-A.
[3] A. Garivier and E. Kaufmann (2016) Optimal best arm identification with fixed confidence. In Conference on Learning Theory, pp. 998–1027. Cited by: §I, §II-A.
[4] A. Gupta, T. Koren, and K. Talwar (2019) Better algorithms for stochastic bandits with adversarial corruptions. In Proceedings of the Thirty-Second Conference on Learning Theory, A. Beygelzimer and D. Hsu (Eds.), Proceedings of Machine Learning Research, Vol. 99, pp. 1562–1578. External Links: Link Cited by: §I.
[5] O. A. Hanna, M. Karakas, L. Yang, and C. Fragouli (2024) Multi-agent bandit learning through heterogeneous action erasure channels. In International Conference on Artificial Intelligence and Statistics, pp. 3898–3906. Cited by: §I, §I.
[6] O. A. Hanna, M. Karakas, L. F. Yang, and C. Fragouli (2023) Multi-arm bandits over action erasure channels. In 2023 IEEE International Symposium on Information Theory (ISIT), Vol. , pp. 1312–1317. External Links: Document Cited by: §I, §I.
[7] K. Jamieson, M. Malloy, R. Nowak, and S. Bubeck (2014) lil’ UCB: an optimal exploration algorithm for multi-armed bandits. In Proceedings of The 27th Conference on Learning Theory, M. F. Balcan, V. Feldman, and C. Szepesvári (Eds.), Proceedings of Machine Learning Research, Vol. 35, Barcelona, Spain, pp. 423–439. External Links: Link Cited by: §I, §II-A.
[8] S. Kapoor, K. K. Patel, and P. Kar (2019) Corruption-tolerant bandit learning. Mach. Learn. 108 (4), pp. 687–715. External Links: ISSN 0885-6125, Document, Link Cited by: §I.
[9] M. Karakas, O. Hanna, L. F. Yang, and C. Fragouli (2025) Does feedback help in bandits with arm erasures?. In 2025 IEEE International Symposium on Information Theory (ISIT), Vol. , pp. 1–6. External Links: Document Cited by: §I, §I.
[10] M. Karakas, O. Hanna, L. F. Yang, and C. Fragouli (2026) Fundamental limits of learning under erasure-constrained communication channels. IEEE Journal on Selected Areas in Information Theory, pp. 1–1. External Links: Document Cited by: §I, §I.
[11] E. Kaufmann, O. Cappé, and A. Garivier (2016) On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research 17 (1), pp. 1–42. Cited by: §I.
[12] J. Korner and A. Orlitsky (1998) Zero-error information theory. IEEE Transactions on Information Theory 44 (6), pp. 2207–2229. External Links: Document Cited by: §I, §II-C.
[13] T. Lattimore and C. Szepesvári (2020) Bandit algorithms. Cambridge University Press. Cited by: §I.
[14] L. Lovász (1979) On the shannon capacity of a graph. IEEE Transactions on Information theory 25 (1), pp. 1–7. Cited by: §I, §IV-A, §IV-B.
[15] T. Lykouris, V. Mirrokni, and R. Paes Leme (2018) Stochastic bandits robust to adversarial corruptions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp. 114–122. Cited by: §I.
[16] K. S. Reddy, P. N. Karthik, and V. Y. F. Tan (2024) Best arm identification with arm erasures. In 2024 IEEE International Symposium on Information Theory (ISIT), Vol. , pp. 2293–2298. External Links: Document Cited by: §I, §I.
[17] C. Shannon (1956) The zero error capacity of a noisy channel. IRE Transactions on Information Theory 2 (3), pp. 8–19. External Links: Document Cited by: §I, §II-C, §II-C, §II-C, §II-C, Remark 1.

Appendix A Proof of Proposition 2

Fix a zero-error update code of blocklength $n_{u}$ for the message set $\mathcal{K}$ : an encoder $\mathrm{enc}:\mathcal{K}\to\mathcal{X}^{n_{u}}$ and decoder $\mathrm{dec}:\mathcal{Y}^{n_{u}}\to\mathcal{K}$ such that for every $a\in\mathcal{K}$ , if $X^{n_{u}}=\mathrm{enc}(a)$ is transmitted then $\mathrm{dec}(Y^{n_{u}})=a$ almost surely.

We construct $\widetilde{\mathsf{A}}$ by simulating $\mathsf{A}$ as a virtual clean algorithm. Partition physical time into blocks $B_{i}\triangleq\{(i-1)n_{u}+1,\dots,in_{u}\}$ for $i=1,2,\dots$ , and let $t_{i}\triangleq in_{u}$ denote the last slot of block $i$ .

At virtual step $i$ , the simulated algorithm $\mathsf{A}$ outputs an arm request $a_{i}\in\mathcal{K}$ as a measurable function of its past virtual rewards. During physical block $B_{i}$ , the learner transmits the length- $n_{u}$ codeword $\mathrm{enc}(a_{i})$ over the actuation channel. By the assumed timing model, the agent executes the decoded arm immediately upon decoding, i.e., in the last slot $t_{i}$ the executed arm equals $a_{i}$ . Define the virtual reward fed back to $\mathsf{A}$ at step $i$ to be the physical reward $r_{t_{i}}$ observed in that last slot. All other rewards inside the block are ignored.

We now verify that the virtual interaction has the same law as a clean run of $\mathsf{A}$ . Condition on the virtual history up to step $i-1$ , equivalently, on the $\sigma$ -field generated by $\{a_{1},r_{t_{1}},\dots,a_{i-1},r_{t_{i-1}}\}$ . Then the requested arm $a_{i}$ is determined. By zero-error decoding and the “execute-on-decode” assumption, the executed arm at time $t_{i}$ equals $a_{i}$ , so the reward $r_{t_{i}}$ is distributed exactly as a clean reward sample from arm $a_{i}$ (and is independent of the past conditional on $a_{i}$ under the standard bandit model). Hence the joint law of $(a_{i},r_{t_{i}})_{i\geq 1}$ under $\widetilde{\mathsf{A}}$ matches the joint law of the arm/reward pairs produced by $\mathsf{A}$ in the clean bandit instance. Therefore, $\widetilde{\mathsf{A}}$ is $\delta$ -correct whenever $\mathsf{A}$ is $\delta$ -correct.

Finally, if $\mathsf{A}$ stops after $\tau_{\rm clean}$ virtual pulls, then $\widetilde{\mathsf{A}}$ uses exactly $\tau=n_{u}\,\tau_{\rm clean}$ physical rounds (each virtual pull consumes one full block), which proves the claimed time bound.
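The block construction can be illustrated with a small simulation. The sketch below assumes $K=5$ arms that double as channel symbols, the one-sided typewriter channel, and Shannon's length-2 pentagon code $a\mapsto(a,2a \bmod 5)$ as the zero-error update code (so $n_{u}=2$ ). The function names, the Gaussian rewards, and the round-robin test policy are illustrative choices, not part of the paper.

```python
import random

K = 5  # arms double as channel symbols on the cycle C_5


def channel(x, eps, rng):
    """One-sided typewriter channel: x is received as x or x+1 (mod K)."""
    return (x + 1) % K if rng.random() < eps else x


def enc(a):
    """Length-2 zero-error code on the pentagon: a -> (a, 2a mod 5)."""
    return (a, (2 * a) % K)


def dec(y):
    """Return the unique message whose codeword can produce the output y."""
    matches = [a for a in range(K)
               if all((yj - cj) % K in (0, 1) for yj, cj in zip(y, enc(a)))]
    assert len(matches) == 1
    return matches[0]


def run_wrapped(clean_policy, means, eps, seed=0):
    """Simulate a clean policy over the channel, one block per virtual pull.

    clean_policy(history) returns an arm, or None to stop; history is the
    list of (arm, reward) pairs fed back so far.
    """
    rng = random.Random(seed)
    history, physical_rounds = [], 0
    while True:
        a = clean_policy(history)
        if a is None:
            return history, physical_rounds
        y = tuple(channel(x, eps, rng) for x in enc(a))
        physical_rounds += len(y)                 # the block costs n_u slots
        executed = dec(y)                         # executed in the last slot
        assert executed == a                      # zero-error decoding
        reward = rng.gauss(means[executed], 1.0)  # only this reward is kept
        history.append((a, reward))


def round_robin(history, pulls=20):
    """Toy clean policy: cycle through the arms for a fixed number of pulls."""
    return len(history) % K if len(history) < pulls else None


history, rounds = run_wrapped(round_robin, [0.1, 0.3, 0.5, 0.7, 0.9], eps=0.3)
```

Rewards in the non-final slots of each block are not simulated here because the wrapper ignores them; the physical round count is therefore exactly twice the number of virtual pulls.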

Appendix B Proof of Proposition 3

Let the public schedule activate $\mathcal{S}_{\rm even}$ on odd time slots and $\mathcal{S}_{\rm odd}$ on even time slots:

s(t)=\begin{cases}\mathrm{even},&t\text{ odd},\\ \mathrm{odd},&t\text{ even}.\end{cases}

By Lemma 4, any transmitted arm in the active set is decoded with zero error in that same slot and is executed as intended.

We define a wrapper that simulates $\mathsf{A}$ as a virtual clean algorithm driven only by counted rewards. Let $a_{1},a_{2},\dots$ be the random sequence of arm requests produced by $\mathsf{A}$ when run on the clean instance, and let $\tau_{\rm clean}$ be its (random) stopping time. The wrapper maintains a virtual step counter $i$ , initialized to $i=1$ . At each physical slot $t$ :

• If $a_{i}\in\mathcal{S}_{s(t)}$ , i.e., the requested arm is admissible in the active parity class, the learner transmits $a_{i}$ , the agent executes $a_{i}$ , the wrapper counts the resulting reward as the virtual reward for step $i$ , and increments $i\leftarrow i+1$ .

• Otherwise, the learner transmits any arm in the active set, e.g., an arbitrary fixed element of $\mathcal{S}_{s(t)}$ , and discards the reward (this slot is an uncounted “wait” slot).

The wrapper terminates when $i=\tau_{\rm clean}+1$ , i.e., once it has produced $\tau_{\rm clean}$ counted rewards.

Correctness & coupling. Consider the subsequence of physical slots at which the wrapper counts rewards. By construction, at the $i$ -th counted slot the executed arm equals $a_{i}$ (zero-error decoding and immediate execution). Hence the $i$ -th counted reward has exactly the same conditional distribution as a clean reward sample from arm $a_{i}$ . Since $\mathsf{A}$ chooses $a_{i}$ as a measurable function of past counted rewards, the sequence of counted arm-reward pairs produced by the wrapper has the same joint distribution as the clean interaction of $\mathsf{A}$ . Therefore the wrapper preserves the error probability of $\mathsf{A}$ , and the resulting algorithm is $\delta$ -correct.

Time bound.

Let $p_{i}\in\{E,O\}$ denote the parity class of the $i$ -th arm request $a_{i}$ made by the clean run of $\mathsf{A}$ , where $E$ corresponds to $\mathcal{S}_{\rm even}$ and $O$ to $\mathcal{S}_{\rm odd}$ . Let $t_{i}$ be the physical time at which the wrapper obtains the $i$ -th counted reward. Because the public schedule alternates between $\mathcal{S}_{\rm even}$ and $\mathcal{S}_{\rm odd}$ , and the wrapper always serves the current request at the first slot in which its parity class is active, we have

{t_{1}=1+\mathbf{1}\{p_{1}=O\},}

and for every $i\geq 2$ ,

{t_{i}=t_{i-1}+1+\mathbf{1}\{p_{i}=p_{i-1}\}.}

Indeed, if $p_{i}\neq p_{i-1}$ then the next requested parity class is active in the very next slot, whereas if $p_{i}=p_{i-1}$ the wrapper must wait one additional slot before serving it. Summing the recurrence yields

{\tau=t_{\tau_{\rm clean}}=\tau_{\rm clean}+\mathbf{1}\{p_{1}=O\}+\sum_{i=2}^{\tau_{\rm clean}}\mathbf{1}\{p_{i}=p_{i-1}\}\leq 2\tau_{\rm clean}.}
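The recurrence can be checked against a direct slot-by-slot simulation of the schedule. The sketch below is illustrative: the function names are hypothetical, and parity classes are written as the characters "E" and "O".

```python
from itertools import product


def counted_times(parities):
    """Simulate the schedule: 'E' is active on odd slots, 'O' on even slots."""
    times, t = [], 0
    for p in parities:
        t += 1
        if ("E" if t % 2 == 1 else "O") != p:
            t += 1  # wait slot: the wrong parity class is active
        times.append(t)
    return times


def recurrence_times(parities):
    """Evaluate t_1 = 1 + 1{p_1 = O} and t_i = t_{i-1} + 1 + 1{p_i = p_{i-1}}."""
    times = []
    for i, p in enumerate(parities):
        if i == 0:
            times.append(1 + (p == "O"))
        else:
            times.append(times[-1] + 1 + (p == parities[i - 1]))
    return times


# Exhaustive check over all request sequences of length 1..8.
for n in range(1, 9):
    for seq in product("EO", repeat=n):
        assert counted_times(seq) == recurrence_times(seq)
        assert counted_times(seq)[-1] <= 2 * n
```

The bound $2\tau_{\rm clean}$ is attained when every request falls in $\mathcal{S}_{\rm odd}$ , since then each counted pull is preceded by one wait slot.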

Appendix C Proof of Theorem 1

We recall that rewards are independent across rounds, and when arm $a\in\mathcal{K}$ is executed the reward distribution has mean $\mu_{a}$ and is $1$ -subGaussian. The confidence radius is

\beta(t)\triangleq\sqrt{\frac{2\log\!\left(\frac{8Kt^{2}}{\delta}\right)}{t}},\qquad t\geq 1.

For PSE, in phase $r$ the per-arm counted budget is $m_{r}=2^{r-1}$ , hence each arm in $S_{r}$ has exactly

t_{r}\triangleq\sum_{j=1}^{r}m_{j}=2^{r}-1

counted samples at the end of phase $r$ .

C-A Step 1: Anytime concentration with explicit constants

For each arm $a$ , let $\widehat{\mu}_{a}(t)$ denote the empirical mean of the first $t$ counted rewards obtained when executing arm $a$ . (Equivalently, one may imagine an infinite i.i.d. sequence of arm- $a$ rewards, and $\widehat{\mu}_{a}(t)$ is the mean of the first $t$ draws; this is standard and matches the algorithm because PSE only ever uses $\widehat{\mu}_{a}(t)$ at times $t=t_{r}$ for arms still active.)

Lemma 5

Define the event

\mathcal{E}\triangleq\bigcap_{a\in\mathcal{K}}\bigcap_{t\geq 1}\left\{\,|\widehat{\mu}_{a}(t)-\mu_{a}|\leq\beta(t)\right\}.

If rewards are $1$ -subGaussian, then $\Pr(\mathcal{E})\geq 1-\delta$ .

Proof: Fix an arm $a$ and a time $t\geq 1$ . Since rewards are $1$ -subGaussian, the empirical mean satisfies

\Pr\!\left(\left|\widehat{\mu}_{a}(t)-\mu_{a}\right|>\varepsilon\right)\leq 2\exp\!\left(-\frac{t\varepsilon^{2}}{2}\right)\qquad\forall\varepsilon>0.

Plugging in $\varepsilon=\beta(t)$ yields

\begin{aligned}\Pr\!\left(\left|\widehat{\mu}_{a}(t)-\mu_{a}\right|>\beta(t)\right)&\leq 2\exp\!\left(-\frac{t\beta(t)^{2}}{2}\right)\\ &=2\exp\!\left(-\log\!\left(\frac{8Kt^{2}}{\delta}\right)\right)\\ &=\frac{\delta}{4Kt^{2}}.\end{aligned}

By a union bound over all $a\in\mathcal{K}$ and all $t\geq 1$ ,

\Pr(\mathcal{E}^{c})\leq\sum_{a\in\mathcal{K}}\sum_{t=1}^{\infty}\frac{\delta}{4Kt^{2}}=\frac{\delta}{4}\sum_{t=1}^{\infty}\frac{1}{t^{2}}\leq\frac{\delta}{4}\cdot\frac{\pi^{2}}{6}<\delta,

hence $\Pr(\mathcal{E})\geq 1-\delta$ .
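As a numerical sanity check of the constants, the following sketch evaluates the tail bound at $\varepsilon=\beta(t)$ and the resulting union bound. The values of `K` and `delta` are arbitrary illustrative choices.

```python
import math


def beta(t, K, delta):
    """Confidence radius beta(t) = sqrt(2 log(8 K t^2 / delta) / t)."""
    return math.sqrt(2 * math.log(8 * K * t**2 / delta) / t)


def tail_bound(t, K, delta):
    """Two-sided 1-subGaussian tail bound evaluated at eps = beta(t)."""
    return 2 * math.exp(-t * beta(t, K, delta) ** 2 / 2)


K, delta = 5, 0.05
for t in (1, 10, 1000):
    # Each (arm, t) pair contributes delta / (4 K t^2) to the union bound.
    assert math.isclose(tail_bound(t, K, delta), delta / (4 * K * t**2))

# Summing over K arms and all t >= 1 gives (delta / 4) * (pi^2 / 6).
union_bound = (delta / 4) * (math.pi**2 / 6)
assert union_bound < delta
```

The union bound equals roughly $0.41\,\delta$ , so the constant $8$ inside the logarithm leaves some slack.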

C-B Step 2: Exact coupling of counted samples

We now state the coupling lemma formally. Define the clean phased SE algorithm as the same phased procedure as PSE but with no installation delays: in phase $r$ , it pulls each arm in $S_{r}$ exactly $m_{r}$ times (counting all rewards), then applies the same elimination rule as PSE using $\beta(t_{r})$ , and repeats until one arm remains.

Lemma 6

Assume every phase- $r$ plan packet in PSE is decoded with zero error, so the agent executes the intended phase plan $\text{\rm plan}(S_{r},m_{r})$ exactly. Let $(A^{\rm cnt}_{s},R^{\rm cnt}_{s})_{s\geq 1}$ denote the sequence of counted arm pulls and counted rewards produced by PSE, indexed by counted time $s=1,2,\dots$ , i.e., ignoring all installation rounds. Let $(A^{\rm clean}_{s},R^{\rm clean}_{s})_{s\geq 1}$ denote the sequence of arm pulls and rewards of the clean phased-SE algorithm run with the same $\beta(\cdot)$ , the same phase budgets $m_{r}$ , and the same elimination rule.

Then there exists a coupling under which

(A^{\rm cnt}_{s},R^{\rm cnt}_{s})_{s\geq 1}\equiv(A^{\rm clean}_{s},R^{\rm clean}_{s})_{s\geq 1}\qquad\text{almost surely}.

In particular, PSE and clean phased-SE compute identical empirical means from counted samples, hence produce the same active sets $(S_{r})$ and stop after the same number of counted pulls.

Proof: Construct a probability space as follows. For each arm $a\in\mathcal{K}$ , generate two independent i.i.d. sequences:

\left(R^{(c)}_{a,1},R^{(c)}_{a,2},\dots\right)\quad\text{and}\quad\left(R^{(u)}_{a,1},R^{(u)}_{a,2},\dots\right),

each distributed as the arm- $a$ reward law (mean $\mu_{a}$ , $1$ -subGaussian). Interpret $R^{(c)}$ as the rewards that will be used for counted pulls of arm $a$ , and $R^{(u)}$ as rewards used for uncounted pulls, i.e., installation rounds. This is valid because, in the true system, every time arm $a$ is executed the reward is an independent draw from the same law; splitting draws into two independent pools preserves the joint law of any finite collection of executed rewards.

Now run the clean phased-SE algorithm and define that its $j$ -th pull of arm $a$ receives reward $R^{(c)}_{a,j}$ . Run PSE on the same probability space, with the following reward assignment:

• During execution (counted) rounds, whenever PSE executes arm $a$ for the $j$ -th counted time, it receives reward $R^{(c)}_{a,j}$ .

• During installation (uncounted) rounds, PSE executes the hold arm $h$ (the last arm pulled in the previous phase) and receives rewards from the $R^{(u)}$ pool (which are never used in estimates).

Because plan packets decode with zero error, in each phase $r$ , PSE executes exactly the intended plan $\text{plan}(S_{r},m_{r})$ in its execution segment. Therefore, the counted arm sequence of PSE in phase $r$ is exactly the same as the arm sequence of the clean phased-SE algorithm in phase $r$ (both pull each $a\in S_{r}$ exactly $m_{r}$ times). Under the construction above, they also receive exactly the same counted rewards, namely the corresponding $R^{(c)}$ samples. Hence, after phase $r$ , both algorithms compute identical empirical means $\widehat{\mu}_{a}(t_{r})$ for each $a\in S_{r}$ , apply the same elimination rule, and therefore produce the same $S_{r+1}$ .

By induction over phases, the entire counted pull/reward sequence agrees almost surely under this coupling.
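The coupling can be mimicked in simulation by letting both procedures read counted rewards from one shared pool. The sketch below is a simplified stand-in for Algorithm 1, not a transcription of it: it assumes Gaussian rewards, a fixed installation cost per phase, and (as a placeholder) uses the first active arm as the hold arm. All function names are hypothetical.

```python
import math
import random


def beta(t, K, delta):
    return math.sqrt(2 * math.log(8 * K * t**2 / delta) / t)


def make_pool(means, seed=0):
    """Shared counted pool: draw(a, j) is the j-th counted reward of arm a."""
    rng = random.Random(seed)
    pools = [[] for _ in means]

    def draw(a, j):
        while len(pools[a]) <= j:
            pools[a].append(rng.gauss(means[a], 1.0))
        return pools[a][j]

    return draw


def phased_se(means, delta, counted_pool, install_cost=0, seed=1):
    """Phased SE; install_cost > 0 adds uncounted hold-arm pulls per phase."""
    K = len(means)
    uncounted = random.Random(seed)  # separate source for discarded rewards
    active, sums, t, r = list(range(K)), [0.0] * K, 0, 0
    trace, physical = [], 0
    while len(active) > 1:
        r += 1
        m = 2 ** (r - 1)
        for _ in range(install_cost):  # installation rounds: reward discarded
            uncounted.gauss(means[active[0]], 1.0)
        physical += install_cost
        for a in active:  # execution rounds: counted pulls t+1, ..., t+m
            sums[a] += sum(counted_pool(a, j) for j in range(t, t + m))
        t += m
        physical += m * len(active)
        rad = beta(t, K, delta)
        best_lcb = max(sums[a] / t for a in active) - rad
        active = [a for a in active if sums[a] / t + rad >= best_lcb]
        trace.append(tuple(active))
    return trace, physical


means = [0.9, 0.5, 0.1]
pool = make_pool(means)
clean_trace, clean_rounds = phased_se(means, 0.05, pool)
pse_trace, pse_rounds = phased_se(means, 0.05, pool, install_cost=4)
assert clean_trace == pse_trace  # identical active sets in every phase
```

Because the discarded rewards come from a separate random source, the installation rounds cannot affect the empirical means, which is the content of the lemma.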

C-C Step 3: Correctness of elimination on the good event

Lemma 7

On the event $\mathcal{E}$ from Lemma 5, PSE never eliminates the best arm $a^{\star}$ . Moreover, fix any suboptimal $a\neq a^{\star}$ . For any phase index $r$ such that

4\beta(t_{r})<\Delta_{a},

(11)

arm $a$ is eliminated at the end of phase $r$ (i.e., $a\notin S_{r+1}$ ).

Proof: Fix a phase $r$ and let $t_{r}=2^{r}-1$ be the counted sample size per surviving arm at the end of phase $r$ .

(i) The best arm is not eliminated. Let $b_{r}\in\arg\max_{b\in S_{r}}\mathrm{LCB}_{b}$ where

\mathrm{LCB}_{b}\triangleq\widehat{\mu}_{b}(t_{r})-\beta(t_{r}),\qquad\mathrm{UCB}_{b}\triangleq\widehat{\mu}_{b}(t_{r})+\beta(t_{r}).

On $\mathcal{E}$ , we have $\widehat{\mu}_{a^{\star}}(t_{r})\geq\mu_{a^{\star}}-\beta(t_{r})$ , hence

\mathrm{UCB}_{a^{\star}}=\widehat{\mu}_{a^{\star}}(t_{r})+\beta(t_{r})\geq\mu_{a^{\star}}.

Also on $\mathcal{E}$ , for any $b\in S_{r}$ we have $\widehat{\mu}_{b}(t_{r})\leq\mu_{b}+\beta(t_{r})$ , so

\mathrm{LCB}_{b_{r}}=\widehat{\mu}_{b_{r}}(t_{r})-\beta(t_{r})\leq\mu_{b_{r}}\leq\mu_{a^{\star}}.

Therefore $\mathrm{UCB}_{a^{\star}}\geq\mu_{a^{\star}}\geq\mathrm{LCB}_{b_{r}}$ , so $a^{\star}$ satisfies the retention condition $\mathrm{UCB}_{a^{\star}}\geq\mathrm{LCB}_{b_{r}}$ and is not eliminated.

(ii) A suboptimal arm is eliminated once $4\beta(t_{r})<\Delta_{a}$ . Fix $a\neq a^{\star}$ . On $\mathcal{E}$ we have

\mathrm{UCB}_{a}=\widehat{\mu}_{a}(t_{r})+\beta(t_{r})\leq(\mu_{a}+\beta(t_{r}))+\beta(t_{r})=\mu_{a}+2\beta(t_{r}).

Also, since $b_{r}$ maximizes $\mathrm{LCB}$ over $S_{r}$ and $a^{\star}\in S_{r}$ by part (i),

\mathrm{LCB}_{b_{r}}\geq\mathrm{LCB}_{a^{\star}}=\widehat{\mu}_{a^{\star}}(t_{r})-\beta(t_{r})\geq(\mu_{a^{\star}}-\beta(t_{r}))-\beta(t_{r})=\mu_{a^{\star}}-2\beta(t_{r}).

Using $\Delta_{a}=\mu_{a^{\star}}-\mu_{a}$ in the bound on $\mathrm{UCB}_{a}$ ,

\mathrm{UCB}_{a}\leq\mu_{a}+2\beta(t_{r})=\mu_{a^{\star}}-\Delta_{a}+2\beta(t_{r}).

If $4\beta(t_{r})<\Delta_{a}$ , then $-\Delta_{a}+2\beta(t_{r})<-2\beta(t_{r})$ , hence

\mathrm{UCB}_{a}<\mu_{a^{\star}}-2\beta(t_{r})\leq\mathrm{LCB}_{b_{r}}.

Therefore $a$ fails the retention test $\mathrm{UCB}_{a}\geq\mathrm{LCB}_{b_{r}}$ and is eliminated, i.e., $a\notin S_{r+1}$ .
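To get a feel for condition (11), the sketch below finds the first phase $r$ with $4\beta(t_{r})<\Delta_{a}$ for a given gap, using $t_{r}=2^{r}-1$ . The function name and the numerical values are illustrative only.

```python
import math


def beta(t, K, delta):
    return math.sqrt(2 * math.log(8 * K * t**2 / delta) / t)


def elimination_phase(gap, K, delta):
    """Smallest phase r such that 4 * beta(2^r - 1) < gap."""
    r = 1
    while 4 * beta(2**r - 1, K, delta) >= gap:
        r += 1
    return r


K, delta = 5, 0.05
for gap in (1.0, 0.5, 0.25):
    r = elimination_phase(gap, K, delta)
    t_r = 2**r - 1
    assert 4 * beta(t_r, K, delta) < gap
    print(f"gap={gap}: eliminated by phase r={r} (t_r={t_r})")
```

On the event $\mathcal{E}$ , an arm with gap $\Delta_{a}$ is therefore gone no later than this phase; it may be eliminated earlier, since (11) is only a sufficient condition.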

C-D Step 4: Physical stopping time decomposition and conclusion

We now prove Theorem 1.

Proof: Assume plan packets are decoded with zero error (as in the theorem statement). Let $R$ be the number of executed phases, i.e., the number of iterations of the while-loop in Algorithm 1. Define the number of counted pulls made by PSE as

N_{\rm counted}\triangleq\sum_{r=1}^{R}|S_{r}|\,m_{r}.

In each executed phase $r$ , PSE spends exactly $n_{r}$ physical rounds to transmit the plan packet (during which it pulls the hold arm and discards these rewards for estimation), and then spends exactly $|S_{r}|m_{r}$ physical rounds executing the plan (counting those rewards). Therefore the total number of physical rounds is exactly

\tau=\sum_{r=1}^{R}\big(n_{r}+|S_{r}|m_{r}\big)=N_{\rm counted}+\sum_{r=1}^{R}n_{r}.

(12)

Next, by Lemma 6, the counted pull/reward sequence of PSE is (under a coupling) identical to that of the clean phased-SE algorithm with the same $(m_{r})$ and elimination rule. In particular, PSE and clean phased-SE stop after the same number of counted pulls and output the same arm as a function of the counted data. Let $\tau_{\rm SE}$ denote the (random) stopping time (number of pulls) of this clean phased-SE algorithm. Then

N_{\rm counted}=\tau_{\rm SE}\qquad\text{almost surely under the coupling},

hence, in distribution as well.

By Lemma 5, $\Pr(\mathcal{E})\geq 1-\delta$ . On $\mathcal{E}$ , Lemma 7 implies that $a^{\star}$ is never eliminated and every suboptimal arm is eventually eliminated, so both algorithms terminate and output $a^{\star}$ on $\mathcal{E}$ . Therefore PSE is $\delta$ -correct.

Finally, standard analyses of phased successive elimination imply that with probability at least $1-\delta$ ,

\tau_{\rm SE}\leq N_{\rm SE}(\delta,\mu),

where $N_{\rm SE}(\delta,\mu)$ is the usual instance-dependent clean sample complexity bound (one such bound is stated in Theorem 1). Combining with (12) yields, with probability at least $1-\delta$ ,

\tau=N_{\rm counted}+\sum_{r=1}^{R}n_{r}=\tau_{\rm SE}+\sum_{r=1}^{R}n_{r}\leq N_{\rm SE}(\delta,\mu)+\sum_{r=1}^{R}n_{r},

which concludes our proof.

Appendix D Missing Proof Sketches

This appendix contains the proofs and proof sketches omitted from the main body of the paper.

D-A Proof Sketch of Lemma 1

If $W\mu=W\mu^{\prime}$ , then for every command $i$ the conditional reward law given $X_{t}=i$ is identical under $\mu$ and $\mu^{\prime}$ . Hence the entire observation process has the same distribution under both instances, so any decision rule must behave identically and cannot be correct on both.

D-B Proof of Lemma 3

By definition of the confusability graph, $\{x,x^{\prime}\}\in E$ iff $\mathcal{Y}(x)\cap\mathcal{Y}(x^{\prime})\neq\emptyset$ . Independence of $\mathcal{S}$ means that for any distinct $x,x^{\prime}\in\mathcal{S}$ we have $\mathcal{Y}(x)\cap\mathcal{Y}(x^{\prime})=\emptyset$ . Therefore, any output $y$ can come from at most one $x\in\mathcal{S}$ , and the decoder can recover $X$ as the unique element of $\{x\in\mathcal{S}:W(y\mid x)>0\}$ .

D-C Proof Sketch of Lemma 4

For the typewriter channel, a transmitted $x\in\mathcal{S}$ is received as $y\in\{x,x+1\}$ , with indices taken modulo $K$ . Since $\mathcal{S}$ is independent, $x+1\notin\mathcal{S}$ . Thus, the decoder rule $\hat{x}(y\mid\mathcal{S})=y$ if $y\in\mathcal{S}$ and $\hat{x}(y\mid\mathcal{S})=y-1\pmod{K}$ otherwise returns $x$ in both cases.
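The decoder rule is short enough to check exhaustively on a small instance. The sketch below assumes arms labeled $0,\dots,K-1$ on the cycle; the function name `decode` and the choice $K=6$ , $\mathcal{S}=\{0,2,4\}$ are illustrative.

```python
def decode(y, S, K):
    """Keep y if it lies in the active set S, otherwise step back by one."""
    return y if y in S else (y - 1) % K


K = 6
S = {0, 2, 4}  # an independent set of the cycle C_6
for x in S:
    for y in (x, (x + 1) % K):  # the two outputs the channel can produce
        assert decode(y, S, K) == x
```

The same check passes for any independent set of the cycle, for example $\{1,3,5\}$ on $C_{6}$ or $\{0,2\}$ on $C_{5}$ .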