Best-Arm Identification with Noisy Actuation
Abstract
In this paper, we consider a multi-armed bandit (MAB) instance and study how to identify the best arm when arm commands are conveyed from a central learner to a distributed agent over a discrete memoryless channel (DMC). Depending on the agent capabilities, we provide communication schemes along with their analysis, which interestingly relate to the zero-error capacity of the underlying DMC.
I Introduction
Motivated by the growing interest in distributed learning applications, recent work has begun to investigate learning over noisy communication channels [6, 5, 9, 10, 16]. Within this landscape, we consider a new formulation: best arm identification in which arm commands are conveyed from a central learner to a distributed agent over a discrete memoryless channel (DMC).
Such channels can model situations in which actions are communicated over unreliable interfaces, including physical controls, low-bandwidth links, or human-mediated instructions, where commands may be received correctly or confused with a small set of alternatives. Despite their simplicity, DMCs capture key sources of noise in practical action-communication pipelines and allow us to derive theoretical insights.
Our focus is on identifying when performance guarantees can be made independent of the channel error probabilities and depend only on confusability. For that question, zero-error capacity is the natural threshold quantity.
In this paper, we study three models of increasing capability for the distributed agent. In the first model, the agent can only execute the commands it receives. In the second, the agent is equipped with a preloaded codebook that allows it to decode the received signals into actions. In the third, most powerful model, the agent can maintain state and execute multi-round plans installed via zero-error packets. For each model, we compare performance against an idealized benchmark in which communication occurs over an error-free channel. Our main findings are summarized below:
- Case 1 (No decoding). If the agent simply executes received commands, the channel induces mixing of the action distributions. Performance degrades by a factor governed by the smallest singular value of the channel matrix, which depends on the channel error probability.
- Case 2 (Fixed decoding). If the zero-error capacity of the channel is nonzero and the agent can use a fixed block code, then the resulting performance loss can be only a constant multiplicative factor, independent of the channel error probabilities.
- Case 3 (Stateful execution). If the agent can maintain state and execute multi-round plans installed via zero-error packets, then the performance gap relative to the error-free benchmark can be reduced to an additive overhead.
We note that when the zero-error capacity of the underlying DMC is zero, no coding strategy can eliminate dependence on the channel noise (Remark 1), and performance necessarily degrades with the channel error probability.
Related Work. Multi-armed bandits (MAB) are a standard model for sequential decision-making under uncertainty; see, e.g., [13] and references therein. Our objective is fixed-confidence best-arm identification (BAI), for which classic elimination and tracking-style algorithms yield instance-dependent sample complexities of order $\sum_{i \neq i^\star} \Delta_i^{-2} \log(1/\delta)$, up to logarithmic factors; representative references include successive elimination [2], lil'UCB [7], and Track-and-Stop [3], as well as complexity and lower-bound characterizations [11].
Prior work on noisy bandits focuses on the reward channel—adversarial corruption [15, 4, 1] and delayed/censored feedback [8]—but assumes the chosen arm is executed as intended. Recent work has studied uncertainty in the executed arm via arm-erasure models, for regret [6, 5, 9, 10] and BAI [16]. In contrast, we consider general discrete memoryless actuation channels with confusability (typewriter channels as a running example), leveraging zero-error communication tools [17, 14, 12].
II Model and Objectives
II-A Bandit instance and objective (BAI)
We consider a stochastic $K$-armed bandit with arms indexed by $[K]$. Pulling arm $i$ produces a reward with mean $\mu_i$. Let $i^\star \in \arg\max_{i} \mu_i$ denote a best arm (assumed unique for simplicity), and define gaps $\Delta_i = \mu_{i^\star} - \mu_i$ for $i \neq i^\star$.
Our goal is fixed-confidence best-arm identification (BAI): an algorithm adaptively interacts with the environment, stops at a (random) time $\tau$, and outputs $\hat{i}$. It is $\delta$-correct if
$$\Pr(\hat{i} \neq i^\star) \le \delta.$$
Hence, we measure performance primarily by the total number of physical rounds (pulls). In our interface, each round also uses the command channel once, so the pull count coincides with the number of channel uses.
II-B Actuation Channel and Arm Mismatch
At each time slot $t$, the learner produces a channel input $X_t \in \mathcal{X}$, which is passed through a discrete memoryless channel (DMC) $W$, yielding an output $Y_t \in \mathcal{Y}$ at the agent according to
$$\Pr(Y_t = y \mid X_t = x) = W(y \mid x).$$
We consider an arm channel with $\mathcal{X} = \mathcal{Y} = [K]$, so $|\mathcal{X}| = K$ and block codes correspond to sequences $x^n \in [K]^n$. The agent then executes an arm $A_t$ (as a function of its received symbols and actuation rule), and a reward $R_t$ is generated with mean $\mu_{A_t}$. The learner observes $R_t$ and its own transmissions $X_t$, but does not observe the channel outputs $Y_t$.
II-C Zero-error capacity and confusability graphs
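We briefly introduce notation and review zero-error communication over a DMC. Zero-error rates depend only on which outputs are possible for each input (the support of $W(\cdot \mid x)$), not on their probabilities [17]. Define the output support sets
$$\mathcal{S}(x) = \{\, y \in \mathcal{Y} : W(y \mid x) > 0 \,\}, \qquad x \in \mathcal{X}.$$
Confusability graph
The confusability graph of $W$ is the undirected graph $G = (V, E)$ with $V = \mathcal{X}$ and
$$\{x, x'\} \in E \iff x \neq x' \ \text{and} \ \mathcal{S}(x) \cap \mathcal{S}(x') \neq \emptyset.$$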
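Blocklength-$n$ packets
Zero-error message count
Let $\alpha(\cdot)$ be the independence number and let $G^{\boxtimes n}$ denote the $n$-fold strong product of $G$. The maximum number of messages conveyable with zero decoding error using exactly $n$ channel uses is
$$M_0(n) = \alpha(G^{\boxtimes n}), \tag{1}$$
via the standard equivalence between zero-error codebooks and independent sets [17].
Minimal blocklength for a message set
For a finite message set $\mathcal{M}$ of size $|\mathcal{M}|$, define
$$n_0(\mathcal{M}) = \min\{\, n \ge 1 : \alpha(G^{\boxtimes n}) \ge |\mathcal{M}| \,\}. \tag{2}$$
Zero-error capacity
Shannon's zero-error capacity is
$$C_0(W) = \lim_{n \to \infty} \frac{1}{n} \log_2 \alpha(G^{\boxtimes n}) = \log_2 \Theta(G), \tag{3}$$
where $\Theta(G) = \lim_{n \to \infty} \alpha(G^{\boxtimes n})^{1/n}$ is the Shannon capacity of $G$ [17]. When $C_0(W) > 0$, $M_0(n)$ grows exponentially in $n$, so any fixed finite message set is achievable with some finite blocklength.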
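The zero-error message count can be checked by brute force on small examples. The sketch below is illustrative only: it assumes the one-sided typewriter support (input $x$ can be received as $x$ or $x+1 \bmod K$), and the helper names `typewriter_support`, `confusable`, `independence_number`, and `max_zero_error_messages` are hypothetical, not part of the paper. The search is exponential and meant for toy sizes.

```python
from itertools import product

def typewriter_support(K):
    """Assumed support sets: input x may be received as x or x+1 (mod K)."""
    return {x: {x, (x + 1) % K} for x in range(K)}

def confusable(support, u, v):
    """Two length-n input words are confusable iff every coordinate pair
    shares at least one possible output (adjacency in the strong product)."""
    return all(support[a] & support[b] for a, b in zip(u, v))

def independence_number(vertices, adjacent):
    """Exact independence number by branching on the first vertex
    (exponential time; small graphs only)."""
    if not vertices:
        return 0
    v, rest = vertices[0], vertices[1:]
    skip_v = independence_number(rest, adjacent)
    take_v = 1 + independence_number(
        [u for u in rest if not adjacent(v, u)], adjacent)
    return max(skip_v, take_v)

def max_zero_error_messages(support, n):
    """Largest number of pairwise non-confusable length-n input words."""
    words = list(product(sorted(support), repeat=n))
    return independence_number(words, lambda u, v: confusable(support, u, v))
```

For the 5-cycle this gives 2 messages with one channel use and 5 messages with two uses, which is why a single use cannot carry all five arm indices but a two-use packet can.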
Remark 1
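The following are equivalent (see [17]): (i) $C_0(W) = 0$; (ii) $\alpha(G) = 1$, i.e., every pair of inputs is confusable; (iii) $\alpha(G^{\boxtimes n}) = 1$ for all $n \ge 1$.
Here, only one message can be conveyed with zero error at any blocklength, so no fixed-blocklength, no-feedback protocol can convey even a binary command with zero error.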
II-D Typewriter channel and general graphs
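A central example is the one-sided typewriter channel (see Fig. 1) over alphabet $\mathbb{Z}_K = \{0, 1, \dots, K-1\}$ with error probability $\epsilon$:
$$W(y \mid x) = \begin{cases} 1 - \epsilon, & y = x, \\ \epsilon, & y = x + 1 \pmod K, \\ 0, & \text{otherwise.} \end{cases}$$
More generally, our coding layer is described by the channel's confusability graph $G$, which allows us to state results for arbitrary discrete actuation links beyond typewriter channels. For any $K$, the one-sided typewriter channel has confusability graph $C_K$, the undirected cycle on $K$ vertices with edges $\{i, i+1\}$ (indices modulo $K$); throughout the paper we use $C_5$ and $C_6$ as running examples.
Figure 1: One-sided typewriter channel.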
III Case 1: No decoding and $\epsilon$-dependent inflation
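We begin with the vanilla single-shot method: in each round $t$, the learner selects an intended arm $X_t$ and transmits it once over the channel. The agent then executes the received symbol $Y_t$ as the pulled arm immediately, generating a reward $R_t$ with mean $\mu_{Y_t}$. The learner observes $R_t$ but not $Y_t$.
Let $\mu \in \mathbb{R}^K$ be the (unknown) physical arm-mean vector. Define the command-conditioned mean vector $\tilde{\mu} \in \mathbb{R}^K$ by
$$\tilde{\mu}_x = \mathbb{E}[R_t \mid X_t = x].$$
For a general DMC with transition matrix $W$, the single-shot method yields
$$\tilde{\mu} = W \mu. \tag{4}$$
Thus, in the baseline, the learner observes rewards from a mixed instance $\tilde{\mu}$, not directly from $\mu$.
Typewriter specialization. For the one-sided typewriter channel with error probability $\epsilon$, (4) reads
$$\tilde{\mu}_x = (1 - \epsilon)\, \mu_x + \epsilon\, \mu_{x+1}, \tag{5}$$
with indices taken modulo $K$.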
III-1 Identifiability: when the best arm cannot be recovered
The BAI goal is to identify $i^\star$. This is impossible if the mixing map $\mu \mapsto W\mu$ is not injective.
Lemma 1 (Non-identifiability under non-injective mixing)
If there exist $\mu, \mu'$ such that $W\mu = W\mu'$ but $i^\star(\mu) \neq i^\star(\mu')$, then no algorithm can be $\delta$-correct for all instances (for any $\delta < 1/2$).
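A small numerical instance makes the lemma concrete. The sketch below assumes the one-sided typewriter mixing $\tilde{\mu}_x = (1-\epsilon)\mu_x + \epsilon\mu_{x+1}$ with $K = 4$ and $\epsilon = 1/2$; the two mean vectors and the helper names `mixed_means` and `best_arm` are made up for illustration. The vectors differ by a multiple of the alternating vector $(-1, +1, -1, +1)$, which this mixing map sends to zero.

```python
import math

def mixed_means(mu, eps):
    """Command-conditioned means under the assumed one-sided typewriter mixing."""
    K = len(mu)
    return [(1 - eps) * mu[x] + eps * mu[(x + 1) % K] for x in range(K)]

def best_arm(mu):
    """Index of the largest mean."""
    return max(range(len(mu)), key=lambda i: mu[i])

# Two hypothetical instances that differ along the alternating direction.
mu_a = [0.6, 0.5, 0.3, 0.3]
mu_b = [m + 0.2 * (-1) ** (i + 1) for i, m in enumerate(mu_a)]

same_observations = all(
    math.isclose(a, b, abs_tol=1e-12)
    for a, b in zip(mixed_means(mu_a, 0.5), mixed_means(mu_b, 0.5))
)
```

Here `same_observations` is true while `best_arm(mu_a)` and `best_arm(mu_b)` differ, so a learner that only sees command-conditioned rewards cannot tell the two instances apart.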
III-2 Baseline inflation via conditioning of the mixing map
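Under single-shot mixing (4), the learner observes the mixed means $\tilde{\mu} = W\mu$ rather than $\mu$. When $W$ is invertible, a natural baseline is to estimate $\tilde{\mu}$ from samples and then unmix via $\hat{\mu} = W^{-1} \hat{\tilde{\mu}}$. This unmixing step amplifies estimation error according to the conditioning of $W$.
Proposition 1
Assume $W$ is known and invertible. Let $\sigma_{\min}(W)$ denote the smallest singular value of $W$. Then, relative to the noiseless setting, the single-shot + unmixing baseline incurs an inflation governed by
$$T_{\mathrm{base}} = \tilde{O}\!\left( \frac{T_{\mathrm{clean}}}{\sigma_{\min}(W)^2} \right), \tag{6}$$
where $\tilde{O}(\cdot)$ hides universal constants and logarithmic factors and
$$T_{\mathrm{clean}} = \tilde{O}\!\left( \sum_{i \neq i^\star} \frac{1}{\Delta_i^2} \log\frac{1}{\delta} \right). \tag{7}$$
Proof sketch. Let $\hat{\tilde{\mu}}$ be an estimator of $\tilde{\mu}$ and define $\hat{\mu} = W^{-1}\hat{\tilde{\mu}}$. By submultiplicativity,
$$\|\hat{\mu} - \mu\|_2 \le \|W^{-1}\|_2 \, \|\hat{\tilde{\mu}} - \tilde{\mu}\|_2 = \frac{1}{\sigma_{\min}(W)} \, \|\hat{\tilde{\mu}} - \tilde{\mu}\|_2. \tag{8}$$
In the noiseless (direct-actuation) setting, standard fixed-confidence BAI algorithms use the number of pulls in (7). Thus, to achieve the same accuracy on $\hat{\mu}$ as in the noiseless case, it suffices to estimate $\tilde{\mu}$ more accurately by a factor $\sigma_{\min}(W)$. Since statistical estimation error scales as $1/\sqrt{N}$ in the number of samples $N$, shrinking the target error by a factor $\sigma_{\min}(W)$ increases the required sample size by a factor $1/\sigma_{\min}(W)^2$, yielding (6).
Footnote 1. Note that $\sigma_{\min}(W) \le 1$ for any square stochastic matrix, so $1/\sigma_{\min}(W)^2 \ge 1$ and the baseline never improves over the noiseless case.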
Remark 2
In general, the baseline inflation is governed by the conditioning of the mixing map via $\sigma_{\min}(W)$. For the typewriter mixing matrix $W$, when $K$ is even we have $\sigma_{\min}(W) = |1 - 2\epsilon|$, so the inflation reduces to $1/(1 - 2\epsilon)^2$ and diverges as $\epsilon \to 1/2$, where identifiability fails.
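Example ($K = 6$ and $K = 5$):
- $K = 6$ (even): $\sigma_{\min}(W) = |1 - 2\epsilon|$, so the baseline inflation blows up as $\epsilon \to 1/2$ and identification is impossible at $\epsilon = 1/2$.
- $K = 5$ (odd): $\sigma_{\min}(W) > 0$ for all $\epsilon \in [0, 1]$, so the baseline remains identifiable (no singularity), but still suffers conditioning-based inflation.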
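The even/odd contrast can be checked numerically. The sketch below assumes the typewriter matrix has the form $W = (1-\epsilon) I + \epsilon P$ with $P$ the cyclic shift; this matrix is circulant and hence normal, so its singular values are the moduli of its eigenvalues $(1-\epsilon) + \epsilon e^{2\pi i k / K}$. The function names are hypothetical.

```python
import cmath
import math

def typewriter_singular_values(K, eps):
    """Singular values of W = (1 - eps) * I + eps * P (P = cyclic shift).
    W is circulant, hence normal, so these are the eigenvalue moduli."""
    return sorted(
        abs((1 - eps) + eps * cmath.exp(2j * math.pi * k / K))
        for k in range(K)
    )

def sigma_min(K, eps):
    return typewriter_singular_values(K, eps)[0]

def inflation(K, eps, tol=1e-12):
    """Baseline inflation factor 1 / sigma_min^2 (infinite if W is singular)."""
    s = sigma_min(K, eps)
    return math.inf if s < tol else 1.0 / s ** 2
```

Under this assumption, `inflation(6, 0.25)` evaluates to about 4 and `inflation(6, 0.5)` is infinite, while for $K = 5$ the smallest singular value at $\epsilon = 1/2$ equals $\cos(2\pi/5) \approx 0.309$, giving a finite inflation of roughly 10.5.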
IV Case 2: Zero-Error Block Codes
In this section, we consider systems that enable a learner and an agent to preshare a fixed block code that achieves zero-error message transmission in $n$ channel uses. This removes $\epsilon$-dependent conditioning entirely (since correctness is zero-error and depends only on the channel support), but introduces a constant multiplicative slowdown whose value depends on the zero-error control scheme. We present two generic zero-error coding constructions, using for illustration the typewriter channels with confusability graphs $C_5$ and $C_6$ (Section II-D).
- Scheme 1: zero-error capacity codebook. A length-$n$ zero-error codebook for $K$ messages allows the learner to install any arm index once every $n$ channel uses.
- Scheme 2: independent-set schedules. The learner uses a public schedule over independent sets of the confusability graph; in each time slot, only arms in the active independent set may be transmitted, but each such arm can be decoded in a single channel use.
IV-A Scheme 1: zero-error capacity codebook for arm updates.
We illustrate on , producing a zero-error code, where indicates channel use(s) to install arm choice(s). Set . Consider the length- codebook
| (9) |
Under the typewriter law, transmitting yields
Lemma 2
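The codebook $\mathcal{C}$ in (9) is zero-error for the typewriter channel (with $K = 5$). Equivalently, $\mathcal{C}$ is an independent set in $C_5^{\boxtimes 2}$, so $\alpha(C_5^{\boxtimes 2}) \ge 5$.
This code conveys $\log_2 5$ bits in $2$ uses, i.e., rate $\tfrac{1}{2}\log_2 5$ bits/use. Since $\Theta(C_5) = \sqrt{5}$ [14], it is capacity-achieving on $C_5$.
Footnote 2. By (1) and Sec. II-D, the number of zero-error messages in $n$ channel uses is $\alpha(C_5^{\boxtimes n})$.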
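As a sanity check on a two-use zero-error code for the 5-cycle, the sketch below verifies Shannon's classical construction $\{(i, 2i \bmod 5)\}$. This is an assumption made for illustration: the codebook in (9) may use a different but equivalent labeling. The one-sided typewriter support $\{x, x+1\}$ and all function names are likewise assumed.

```python
def out_set(x, K=5):
    """Assumed output support of input x on the one-sided typewriter channel."""
    return {x, (x + 1) % K}

def confusable(u, v, K=5):
    """Two codewords are confusable iff they can produce the same output word."""
    return all(out_set(a, K) & out_set(b, K) for a, b in zip(u, v))

def is_zero_error(code, K=5):
    """True iff no two distinct codewords are confusable."""
    return not any(
        confusable(u, v, K) for i, u in enumerate(code) for v in code[i + 1:]
    )

def decode(received, code, K=5):
    """Return the unique codeword consistent with the received word."""
    matches = [
        c for c in code
        if all(y in out_set(x, K) for x, y in zip(c, received))
    ]
    if len(matches) != 1:
        raise ValueError("received word is not uniquely decodable")
    return matches[0]

# Shannon's classical length-2 code on the pentagon (illustrative choice).
SHANNON_CODE = [(i, (2 * i) % 5) for i in range(5)]
```

Every codeword in `SHANNON_CODE` decodes correctly under all four noise patterns of a two-use packet, whereas a naive repetition-style pair such as `[(0, 0), (1, 1)]` fails the `is_zero_error` check.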
Proposition 2
Assume there exists a fixed-length zero-error update code of blocklength $n$ with $\alpha(G^{\boxtimes n}) \ge K$. Assume the agent executes the decoded update immediately upon decoding, i.e., on the last channel use of the packet. Let $\mathcal{A}$ be any $\delta$-correct clean BAI algorithm and $\tau$ its pull count. Then there exists an algorithm $\mathcal{A}'$ over the actuation link that is $\delta$-correct and whose total number of physical rounds satisfies
$$T = n\,\tau.$$
Proof sketch. Run $\mathcal{A}$ as a virtual algorithm. Partition time into consecutive blocks of length $n$. In block $k$, transmit the $n$-symbol zero-error codeword for the arm requested by $\mathcal{A}$ at virtual step $k$. By assumption, the decoded arm is executed on the last slot of the block; use the reward from that last slot as the virtual reward fed back to $\mathcal{A}$. All intermediate rewards within each block can be logged but are ignored. Zero decoding error ensures the virtual interaction matches the clean bandit interaction exactly. Thus, $\delta$-correctness is preserved and each virtual pull costs $n$ physical rounds. Full proof in App. A.
Corollary 1
If satisfies , then .
Corollary 2
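For $C_5$, Lemma 2 gives an independent set of size $5$ in $C_5^{\boxtimes 2}$, so $\alpha(C_5^{\boxtimes 2}) \ge 5$ while $\alpha(C_5) = 2 < 5$. Hence $n = 2$. For $C_6$, since $\alpha(C_6) = 3 < 6$, we have $n \ge 2$, and by Cartesian-product coding, $\alpha(C_6^{\boxtimes 2}) \ge \alpha(C_6)^2 = 9 \ge 6$, hence $n = 2$ for Scheme 1 type of codes.
Therefore, a fully flexible zero-error update packet exists with blocklength $n = 2$ for both $C_5$ and $C_6$. By Proposition 2, the number of physical rounds satisfies $T = 2\tau$.
These guarantees hold for all $\epsilon$, whereas the single-shot baseline can be non-identifiable at $\epsilon = 1/2$ for even $K$.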
Remark 3 (Non–zero-error preshared codes)
Instead of zero-error codes, one could preshare a classical block code (e.g., Reed–Solomon) with small decoding error. From the learner's viewpoint this is equivalent to replacing $W$ by an effective DMC $\tilde{W}$ on $[K]$, so the analysis reduces to the single-shot baseline of Section III with $\tilde{W}$ instead of $W$.
IV-B Scheme 2: independent sets using a public parity schedule
This scheme exploits the graph structure of $G$ more directly: instead of sending arbitrary arms at every use, we restrict each time slot to an independent set of vertices, which allows zero-error decoding in a single channel use. We illustrate on $C_6$ by providing a constrained zero-error code with a public parity schedule.
Lemma 3
Let $G$ be the confusability graph of $W$ and $I \subseteq V$ an independent set. If the encoder restricts to inputs $x \in I$ and the decoder knows $I$, then $x$ can be decoded with zero error from a single channel output $y$.
Fix a finite collection of independent sets $I_1, \dots, I_m$, and a public periodic schedule $t \mapsto I(t)$ known to both learner and agent. At time $t$, only arms in the active set $I(t)$ may be transmitted; by Lemma 3, any arm in $I(t)$ can be sent and decoded with zero error in one channel use.
To simulate a clean BAI algorithm $\mathcal{A}$, the learner runs $\mathcal{A}$ virtually and, whenever $\mathcal{A}$ requests arm $a$, waits until the first time $t$ such that $a \in I(t)$, then transmits $a$ and counts the resulting reward as the virtual reward. The slowdown factor depends on how frequently each arm (or pair of arms) appears in the active sets. We now instantiate this on $C_6$.
Set $K = 6$. The cycle $C_6$ admits a $2$-coloring with independent sets
$$I_0 = \{0, 2, 4\}, \qquad I_1 = \{1, 3, 5\}.$$
Fix a public, deterministic schedule known to encoder/decoder that specifies which set is active at each channel use, e.g., alternating $I_0, I_1, I_0, I_1, \dots$. In a use where $I_b$ is active, the encoder is restricted to transmit $x \in I_b$.
Lemma 4
For the typewriter channel on $C_6$, if $x$ is restricted to an independent set $I$ known to the decoder, then $x$ can be decoded from a single output $y$ with zero error.
Each use conveys one of $3$ possibilities with zero error, a ternary digit per channel use, where the active set acts as shared side information. Moreover, for even cycles $C_{2m}$, $\Theta(C_{2m}) = m$ [14]; hence $C_0 = \log_2 3$ for $C_6$. The above scheme then achieves $\log_2 3$ bits/use, and is capacity-achieving on $C_6$.
Proposition 3
Consider Scheme 2 with $K = 6$ (public alternating schedule over $I_0 = \{0, 2, 4\}$ and $I_1 = \{1, 3, 5\}$). In each slot, the transmitted arm is decoded with zero error in one channel use (Lemma 4). Let $\mathcal{A}$ be any $\delta$-correct clean BAI algorithm, let $\tau$ be its (random) number of pulls under no channel noise, let $a_1, \dots, a_\tau$ denote the random sequence of arm requests produced by $\mathcal{A}$ on the clean instance, and define the parity class sequence
$$b_j = a_j \bmod 2, \qquad j = 1, \dots, \tau.$$
Then there exists an actuation wrapper that simulates $\mathcal{A}$ over the noisy link with the same error probability and a stopping time $T$ satisfying
$$T \le \tau + 1 + \sum_{j=1}^{\tau - 1} \mathbb{1}\{b_{j+1} = b_j\}. \tag{10}$$
Proof sketch. The wrapper runs $\mathcal{A}$ virtually and only counts rewards when the requested arm belongs to the active parity class; otherwise it waits one slot and then transmits the arm. Zero-error decoding ensures the counted interaction matches the clean bandit path. If the $(j+1)$-th request has the same parity as the $j$-th, the wrapper must wait one extra slot; otherwise the next slot already has the correct parity. Summing these waiting slots yields (10). Full proof in App. B.
Equation (10) shows that the slowdown under Scheme 2 depends on the local pattern of parity requests, not merely on the total number of pulls from each parity class. In particular,
$$\tau \le T \le 2\tau.$$
When the requested parity alternates often, the slowdown is close to $1$; when the clean algorithm makes long runs within the same parity class, the slowdown approaches $2$.
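The dependence on the request pattern can be reproduced with a slot-by-slot simulation. The sketch below assumes an alternating two-class schedule in which class `first_active` is active in slot 1 (which class goes first is an assumption, as are the function names), and compares the simulated number of physical slots with the count obtained by summing one slot per request plus one wait slot per repeated parity.

```python
def slots_used(parities, first_active=0):
    """Simulate the wrapper: serve each request in the first slot whose
    active parity class matches it. Returns the total number of slots."""
    t = 0
    for b in parities:
        t += 1                                   # move to the next slot
        active = (first_active + t - 1) % 2      # class active in slot t
        if active != b:
            t += 1                               # one uncounted wait slot
    return t

def slots_closed_form(parities, first_active=0):
    """t_1 plus one slot per later request plus one wait per repeated parity."""
    if not parities:
        return 0
    t1 = 1 if parities[0] == first_active else 2
    repeats = sum(1 for a, b in zip(parities, parities[1:]) if a == b)
    return t1 + (len(parities) - 1) + repeats
```

A perfectly alternating request sequence of length 4 uses 4 slots, while four requests of the same parity use 7 slots, matching the range between $\tau$ and $2\tau$.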
IV-B1 Beyond bipartitions: overlapping independent sets
The parity schedule on is based on a partition of the arms into two independent sets. There is no requirement, however, that the active independent sets form a partition: an arm may belong to several independent sets and thus be admissible in multiple time slots. This can be exploited to reduce the worst-case slowdown constant.
As a simple illustration, consider with vertex set . Independent sets include, for example, , , and we may use a length- periodic schedule that activates these sets in turn. Arm then belongs to two of the three sets and is therefore available in of the slots, rather than . This illustrates that one can bias the schedule toward specific arms by letting them appear in multiple independent sets. Intuitively, this higher availability fraction can translate into a smaller slowdown factor when simulating a clean algorithm, because the wrapper needs to wait less frequently for an arm to become admissible.
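The availability fractions of such a schedule are easy to tabulate. The sketch below uses a hypothetical length-3 schedule on the 6-cycle, $\{0,2,4\}$, $\{1,3,5\}$, $\{0,3\}$; these particular sets are chosen for illustration and are not claimed to be the ones intended above. The helper names are also assumptions.

```python
from fractions import Fraction

def is_independent_in_cycle(S, K):
    """S is independent in the K-cycle iff it contains no pair {x, x+1 mod K}."""
    return not any((x + 1) % K in S for x in S)

def availability(schedule, K):
    """Fraction of slots in one period during which each arm may be sent."""
    period = len(schedule)
    return {
        a: Fraction(sum(a in S for S in schedule), period) for a in range(K)
    }

# Hypothetical overlapping schedule on C_6 (period 3).
SCHEDULE = [{0, 2, 4}, {1, 3, 5}, {0, 3}]
```

With this schedule arms 0 and 3 are admissible in 2/3 of the slots, while the remaining arms drop to 1/3, so the bias toward some arms is paid for by the others.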
Designing schedules (and collections of independent sets) that optimally balance the availability of individual arms and pairs of arms is a combinatorial problem closely related to fractional colorings and covering designs of the confusability graph. In principle, such designs can push the worst-case slowdown below the simple factor-$2$ bound in Prop. 3. We leave a systematic exploration of this direction to future work.
V Case 3: Stateful Plan Execution via Packetized Successive Elimination (PSE)
The previous section showed that if we insist on issuing a fresh (zero-error) arm command for every pull, then any clean BAI algorithm can be simulated with a constant multiplicative slowdown for per-pull zero-error control. We now show that if the agent can maintain state and execute a multi-round plan, then the learner needs to communicate only at decision times (phase boundaries). This converts actuation cost into an additive overhead.
Throughout this section, the system acts continuously: while an $n$-symbol command packet is being transmitted, the agent keeps executing its currently committed behavior, and the newly decoded plan takes effect only after the packet is decoded (switching latency).
V-A Packetized Successive Elimination (PSE)
We now describe Algorithm 1 (PSE), which builds on a phased version of Successive Elimination [2]. The coding layer only appears through the existence of a zero-error plan packet of length $n$ that conveys any phase plan; explicit constructions (e.g., on $C_5$ and $C_6$) are plugged in later.
Confidence radius and phase schedule.
Let
We use phases with per-phase budget counted pulls per active arm, so that the cumulative number of counted pulls per surviving arm at the end of phase is .
During plan installation (the channel uses), the agent keeps executing its previously committed arm.
Plan packets.
In phase , the learner selects an active set and a pre-agreed repetition budget . The phase plan is:
Let denote the set of admissible phase plans (equivalently, admissible active sets). A zero-error plan packet of length is any code that can convey any index in with zero decoding error; its existence is ensured whenever .
Theorem 1
Assume $\sigma$-subGaussian rewards. Then PSE is $\delta$-correct. Moreover, with probability at least $1 - \delta$, its total number of physical rounds satisfies
$$T_{\mathrm{PSE}} \le T_{\mathrm{SE}} + nL,$$
where $L$ is the number of executed phases, $n$ is the plan-packet length, and $T_{\mathrm{SE}}$ is the (instance-dependent) pull complexity of the clean phased successive-elimination algorithm (standard analyses).
Proof sketch.
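Define the standard uniform concentration event
$$\mathcal{E} = \{\, |\hat{\mu}_i(t) - \mu_i| \le r(t) \ \text{for all arms } i \text{ and all } t \ge 1 \,\},$$
where $r(t)$ is the confidence radius. By a union bound over arms and times, $\Pr(\mathcal{E}) \ge 1 - \delta$. On $\mathcal{E}$, the elimination rule of phased successive elimination is sound: $i^\star$ is never removed, and any suboptimal arm $i$ is eliminated at the end of some phase once $\Delta_i > 4 r(t_\ell)$ (since $\hat{\mu}_i(t_\ell) \le \mu_i + r(t_\ell)$ and $\hat{\mu}_{i^\star}(t_\ell) \ge \mu_{i^\star} - r(t_\ell)$ on $\mathcal{E}$). Crucially, PSE's counted rewards are generated by exactly the same arm sequence and sample sizes as the clean phased-SE algorithm (the additional rounds during plan installation are simply ignored for the statistical test). Therefore, on $\mathcal{E}$, the number of counted pulls until termination is at most $T_{\mathrm{SE}}$. Finally, each executed phase incurs $n$ additional physical rounds for plan installation (switching latency), yielding $T_{\mathrm{PSE}} \le T_{\mathrm{SE}} + nL$ on $\mathcal{E}$. See full proof in App. C.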
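The additive structure of the bound can be seen in a toy simulation. The sketch below is not Algorithm 1 itself: it assumes Gaussian rewards, doubling per-phase budgets, the confidence radius $r(t) = \sqrt{2\sigma^2 \log(4Kt^2/\delta)/t}$, and the retention rule "keep arm $i$ if its empirical mean is within $2r$ of the best empirical mean". All of these choices, and the names `radius` and `pse`, are assumptions made for illustration.

```python
import math
import random

def radius(t, K, delta, sigma):
    """Assumed anytime confidence radius after t counted samples per arm."""
    return math.sqrt(2 * sigma ** 2 * math.log(4 * K * t * t / delta) / t)

def pse(means, delta, n_install, sigma=1.0, seed=0):
    """Toy packetized successive elimination.
    Returns (output arm, physical rounds, counted pulls, executed phases)."""
    rng = random.Random(seed)
    K = len(means)
    active = list(range(K))
    sums = [0.0] * K
    t = 0                          # counted samples per surviving arm
    physical = counted = phases = 0
    hold = active[0]               # arm executed while a packet is in flight
    while len(active) > 1:
        phases += 1
        for _ in range(n_install):             # plan installation
            rng.gauss(means[hold], sigma)      # reward observed, then discarded
            physical += 1
        budget = 2 ** (phases - 1)             # assumed doubling budgets
        for arm in active:                     # plan execution (counted)
            for _ in range(budget):
                sums[arm] += rng.gauss(means[arm], sigma)
            physical += budget
            counted += budget
            hold = arm
        t += budget
        r = radius(t, K, delta, sigma)
        best = max(sums[a] / t for a in active)
        active = [a for a in active if sums[a] / t >= best - 2 * r]
    return active[0], physical, counted, phases
```

By construction the physical round count equals the counted pulls plus `n_install` rounds per executed phase, so the channel only contributes the additive term while the statistical term is untouched.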
Remark 4
For , the condition is implied by for universal constants . Consequently, .
Corollary 3
Consider the one-sided typewriter channel with , so the confusability graph is .
Assume the simple plan family ,
so for all .
(i) . Using the base- digit on (two uses per digit),
(ii) . Using the calendar interface on (one ternary digit per use),
In both cases, Theorem 1 gives with small constants:
at most on and on .
Thus when the statistical term dominates (small gaps / small ), PSE is dramatically better than per-pull zero-error updates, which incur a multiplicative constant .
References
- [1] (2020) Prediction with corrupted expert advice. Advances in Neural Information Processing Systems 33, pp. 14315–14325.
- [2] (2006) Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research 7 (6).
- [3] (2016) Optimal best arm identification with fixed confidence. In Conference on Learning Theory, pp. 998–1027.
- [4] (2019) Better algorithms for stochastic bandits with adversarial corruptions. In Proceedings of the Thirty-Second Conference on Learning Theory, A. Beygelzimer and D. Hsu (Eds.), Proceedings of Machine Learning Research, Vol. 99, pp. 1562–1578.
- [5] (2024) Multi-agent bandit learning through heterogeneous action erasure channels. In International Conference on Artificial Intelligence and Statistics, pp. 3898–3906.
- [6] (2023) Multi-arm bandits over action erasure channels. In 2023 IEEE International Symposium on Information Theory (ISIT), pp. 1312–1317.
- [7] (2014) lil' UCB: an optimal exploration algorithm for multi-armed bandits. In Proceedings of The 27th Conference on Learning Theory, M. F. Balcan, V. Feldman, and C. Szepesvári (Eds.), Proceedings of Machine Learning Research, Vol. 35, Barcelona, Spain, pp. 423–439.
- [8] (2019) Corruption-tolerant bandit learning. Machine Learning 108 (4), pp. 687–715.
- [9] (2025) Does feedback help in bandits with arm erasures?. In 2025 IEEE International Symposium on Information Theory (ISIT), pp. 1–6.
- [10] (2026) Fundamental limits of learning under erasure-constrained communication channels. IEEE Journal on Selected Areas in Information Theory, pp. 1–1.
- [11] (2016) On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research 17 (1), pp. 1–42.
- [12] (1998) Zero-error information theory. IEEE Transactions on Information Theory 44 (6), pp. 2207–2229.
- [13] (2020) Bandit algorithms. Cambridge University Press.
- [14] (1979) On the Shannon capacity of a graph. IEEE Transactions on Information Theory 25 (1), pp. 1–7.
- [15] (2018) Stochastic bandits robust to adversarial corruptions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp. 114–122.
- [16] (2024) Best arm identification with arm erasures. In 2024 IEEE International Symposium on Information Theory (ISIT), pp. 2293–2298.
- [17] (1956) The zero error capacity of a noisy channel. IRE Transactions on Information Theory 2 (3), pp. 8–19.
Appendix A Proof of Proposition 2
Fix a zero-error update code of blocklength for the message set : an encoder and decoder such that for every , if is transmitted then almost surely.
We construct by simulating as a virtual clean algorithm. Partition physical time into blocks for , and let denote the last slot of block .
At virtual step , the simulated algorithm outputs an arm request as a measurable function of its past virtual rewards. During physical block , the learner transmits the length- codeword over the actuation channel. By the assumed timing model, the agent executes the decoded arm immediately upon decoding, i.e., in the last slot the executed arm equals . Define the virtual reward fed back to at step to be the physical reward observed in that last slot. All other rewards inside the block are ignored.
We now verify that the virtual interaction has the same law as a clean run of . Condition on the virtual history up to step , equivalently, on the -field generated by . Then the requested arm is determined. By zero-error decoding and the “execute-on-decode” assumption, the executed arm at time equals , so the reward is distributed exactly as a clean reward sample from arm (and is independent of the past conditional on under the standard bandit model). Hence the joint law of under matches the joint law of the arm/reward pairs produced by in the clean bandit instance. Therefore, is -correct whenever is -correct.
Finally, if $\mathcal{A}$ stops after $\tau$ virtual pulls, then $\mathcal{A}'$ uses exactly $n\tau$ physical rounds (each virtual pull consumes one full block), which proves the claimed time bound.
Appendix B Proof of Proposition 3
Let the public schedule activate $I_0$ on odd time slots and $I_1$ on even time slots:
$$I(t) = \begin{cases} I_0, & t \text{ odd}, \\ I_1, & t \text{ even}. \end{cases}$$
By Lemma 4, any transmitted arm in the active set is decoded with zero-error in that same slot and is executed as intended.
We define a wrapper that simulates $\mathcal{A}$ as a virtual clean algorithm driven only by counted rewards. Let $a_1, a_2, \dots$ be the random sequence of arm requests produced by $\mathcal{A}$ when run on the clean instance, and let $\tau$ be its (random) stopping time. The wrapper maintains a virtual step counter $k$, initialized to $1$. At each physical slot $t$:
- If $a_k \in I(t)$, i.e., the requested arm is admissible in the active parity class, the learner transmits $a_k$, the agent executes $a_k$, the wrapper counts the resulting reward as the virtual reward for step $k$, and increments $k$.
- Otherwise, the learner transmits any arm in the active set, e.g., an arbitrary fixed element of $I(t)$, and discards the reward (this slot is an uncounted "wait" slot).
The wrapper terminates when $k > \tau$, i.e., once it has produced $\tau$ counted rewards.
Correctness & coupling. Consider the subsequence of physical slots at which the wrapper counts rewards. By construction, at the $k$-th counted slot the executed arm equals $a_k$ (zero-error decoding and immediate execution). Hence the $k$-th counted reward has exactly the same conditional distribution as a clean reward sample from arm $a_k$. Since $\mathcal{A}$ chooses $a_{k+1}$ as a measurable function of past counted rewards, the sequence of counted arm-reward pairs produced by the wrapper has the same joint distribution as the clean interaction of $\mathcal{A}$. Therefore the wrapper preserves the error probability of $\mathcal{A}$, and the resulting algorithm is $\delta$-correct.
Time bound.
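Let $b_j \in \{0, 1\}$ denote the parity class of the $j$-th arm request made by the clean run of $\mathcal{A}$, where $b_j = 0$ corresponds to $I_0$ and $b_j = 1$ to $I_1$. Let $t_j$ be the physical time at which the wrapper obtains the $j$-th counted reward. Because the public schedule alternates between $I_0$ and $I_1$, and the wrapper always serves the current request at the first slot in which its parity class is active, we have
$$t_1 \in \{1, 2\},$$
and for every $j \ge 1$,
$$t_{j+1} = t_j + 1 + \mathbb{1}\{b_{j+1} = b_j\}.$$
Indeed, if $b_{j+1} \neq b_j$ then the next requested parity class is active in the very next slot, whereas if $b_{j+1} = b_j$ the wrapper must wait one additional slot before serving it. Summing the recurrence yields
$$T = t_\tau = t_1 + (\tau - 1) + \sum_{j=1}^{\tau - 1} \mathbb{1}\{b_{j+1} = b_j\} \le \tau + 1 + \sum_{j=1}^{\tau - 1} \mathbb{1}\{b_{j+1} = b_j\}.$$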
Appendix C Proof of Theorem 1
We recall that rewards are independent across rounds, and when arm is executed the reward distribution has mean and is -subGaussian. The confidence radius is
For PSE, in phase the per-arm counted budget is , hence each arm in has exactly
counted samples at the end of phase .
C-A Step 1: Anytime concentration with explicit constants
For each arm , let denote the empirical mean of the first counted rewards obtained when executing arm . (Equivalently, one may imagine an infinite i.i.d. sequence of arm- rewards, and is the mean of the first draws; this is standard and matches the algorithm because PSE only ever uses at times for arms still active.)
Lemma 5
Define the event
If rewards are -subGaussian, then .
Proof: Fix an arm and a time . Since rewards are -subGaussian, the empirical mean satisfies
Plugging in yields
By a union bound over all and all ,
hence .
C-B Step 2: Exact coupling of counted samples
We now state the coupling lemma formally. Define the clean phased SE algorithm as the same phased procedure as PSE but with no installation delays: in phase , it pulls each arm in exactly times (counting all rewards), then applies the same elimination rule as PSE using , and repeats until one arm remains.
Lemma 6
Assume every phase- plan packet in PSE is decoded with zero error, so the agent executes the intended phase plan exactly. Let denote the sequence of counted arm pulls and counted rewards produced by PSE, indexed by counted time , i.e., ignoring all installation rounds. Let denote the pull, or equivalently reward, sequence of the clean phased-SE algorithm run with the same , the same phase budgets , and the same elimination rule.
Then there exists a coupling under which
In particular, PSE and clean phased-SE compute identical empirical means from counted samples, hence produce the same active sets and stop after the same number of counted pulls.
Proof: Construct a probability space as follows. For each arm $i$, generate two independent i.i.d. sequences
$$(Z_{i,s})_{s \ge 1} \quad \text{and} \quad (Z'_{i,s})_{s \ge 1},$$
each distributed as the arm-$i$ reward law (mean $\mu_i$, $\sigma$-subGaussian). Interpret $(Z_{i,s})$ as the rewards that will be used for counted pulls of arm $i$, and $(Z'_{i,s})$ as rewards used for uncounted pulls, i.e., installation rounds. This is valid because, in the true system, every time arm $i$ is executed the reward is an independent draw from the same law; splitting draws into two independent pools preserves the joint law of any finite collection of executed rewards.
Now run the clean phased-SE algorithm and define that its $s$-th pull of arm $i$ receives reward $Z_{i,s}$. Run PSE on the same probability space, with the following reward assignment:
- During execution (counted) rounds, whenever PSE executes arm $i$ for the $s$-th counted time, it receives reward $Z_{i,s}$.
- During installation (uncounted) rounds, PSE executes the hold arm (the last arm pulled in the previous phase) and receives rewards from the pool $(Z'_{i,s})$ (which are never used in estimates).
Because plan packets decode with zero error, in each phase , PSE executes exactly the intended plan in its execution segment. Therefore, the counted arm sequence of PSE in phase is exactly the same as the arm sequence of the clean phased-SE algorithm in phase (both pull each exactly times). Under the construction above, they also receive exactly the same counted rewards, namely the corresponding samples. Hence, after phase , both algorithms compute identical empirical means for each , apply the same elimination rule, and therefore produce the same .
By induction over phases, the entire counted pull/reward sequence agrees almost surely under this coupling.
C-C Step 3: Correctness of elimination on the good event
Lemma 7
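On the event $\mathcal{E}$ from Lemma 5, PSE never eliminates the best arm $i^\star$. Moreover, fix any suboptimal $i \neq i^\star$. For any phase index $\ell$ such that
$$\Delta_i > 4\, r(t_\ell), \tag{11}$$
arm $i$ is eliminated at the end of phase $\ell$ (i.e., $i \notin A_{\ell+1}$).
Proof: Fix a phase $\ell$ and let $t_\ell$ be the counted sample size per surviving arm at the end of phase $\ell$.
(i) The best arm is not eliminated. Let $\hat{\mu}_{\max} = \max_{j \in A_\ell} \hat{\mu}_j(t_\ell)$, where arm $j$ is retained if
$$\hat{\mu}_j(t_\ell) \ge \hat{\mu}_{\max} - 2\, r(t_\ell).$$
On $\mathcal{E}$, we have $|\hat{\mu}_{i^\star}(t_\ell) - \mu_{i^\star}| \le r(t_\ell)$, hence
$$\hat{\mu}_{i^\star}(t_\ell) \ge \mu_{i^\star} - r(t_\ell).$$
Also on $\mathcal{E}$, for any $j \in A_\ell$ we have $\hat{\mu}_j(t_\ell) \le \mu_j + r(t_\ell) \le \mu_{i^\star} + r(t_\ell)$, so
$$\hat{\mu}_{\max} \le \mu_{i^\star} + r(t_\ell).$$
Therefore $\hat{\mu}_{i^\star}(t_\ell) \ge \hat{\mu}_{\max} - 2\, r(t_\ell)$, so $i^\star$ satisfies the retention condition and is not eliminated.
(ii) A suboptimal arm is eliminated once $\Delta_i > 4\, r(t_\ell)$. Fix $i \in A_\ell$ with $i \neq i^\star$. On $\mathcal{E}$ we have
$$\hat{\mu}_i(t_\ell) \le \mu_i + r(t_\ell) = \mu_{i^\star} - \Delta_i + r(t_\ell).$$
Also, since $\hat{\mu}_{\max}$ maximizes $\hat{\mu}_j(t_\ell)$ over $j \in A_\ell$ and $i^\star \in A_\ell$ by part (i),
$$\hat{\mu}_{\max} \ge \hat{\mu}_{i^\star}(t_\ell) \ge \mu_{i^\star} - r(t_\ell).$$
Combining the two displays,
$$\hat{\mu}_{\max} - \hat{\mu}_i(t_\ell) \ge \Delta_i - 2\, r(t_\ell).$$
If $\Delta_i > 4\, r(t_\ell)$, then $\Delta_i - 2\, r(t_\ell) > 2\, r(t_\ell)$, hence
$$\hat{\mu}_i(t_\ell) < \hat{\mu}_{\max} - 2\, r(t_\ell).$$
Therefore $i$ fails the retention test and is eliminated, i.e., $i \notin A_{\ell+1}$.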
C-D Step 4: Physical stopping time decomposition and conclusion
We now prove Theorem 1.
Proof: Assume plan packets are decoded with zero error (as in the theorem statement). Let $L$ be the number of executed phases, i.e., the number of iterations of the while-loop in Algorithm 1. Define the number of counted pulls made by PSE as
$$N_{\mathrm{count}} = \sum_{\ell=1}^{L} |A_\ell|\, m_\ell,$$
where $A_\ell$ is the active set and $m_\ell$ the per-arm budget in phase $\ell$.
In each executed phase $\ell$, PSE spends exactly $n$ physical rounds to transmit the plan packet (during which it pulls the hold arm and discards these rewards for estimation), and then spends exactly $|A_\ell|\, m_\ell$ physical rounds executing the plan (counting those rewards). Therefore the total number of physical rounds is exactly
$$T_{\mathrm{PSE}} = N_{\mathrm{count}} + nL. \tag{12}$$
Next, by Lemma 6, the counted pull/reward sequence of PSE is (under a coupling) identical to that of the clean phased-SE algorithm with the same $\delta$ and elimination rule. In particular, PSE and clean phased-SE stop after the same number of counted pulls and output the same arm as a function of the counted data. Let $T_{\mathrm{SE}}$ denote the (random) stopping time (number of pulls) of this clean phased-SE algorithm. Then
$$N_{\mathrm{count}} = T_{\mathrm{SE}} \quad \text{almost surely under the coupling},$$
hence in distribution as well.
Appendix D Missing Proof Sketches
This section contains the proof (sketches) we removed from the main body of the paper.
D-A Proof Sketch of Lemma 1
If $W\mu = W\mu'$, then for every command $x$ the conditional reward law given $X_t = x$ is identical under $\mu$ and $\mu'$. Hence the entire observation process has the same distribution under both instances, so any decision rule must behave identically and cannot be correct on both.
D-B Proof of Lemma 3
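By definition of the confusability graph, $\{x, x'\} \in E$ iff $\mathcal{S}(x) \cap \mathcal{S}(x') \neq \emptyset$. Independence of $I$ means that for any distinct $x, x' \in I$ we have $\mathcal{S}(x) \cap \mathcal{S}(x') = \emptyset$. Therefore, any output $y$ can come from at most one $x \in I$, and the decoder can recover $x$ as the unique element of $\{x' \in I : y \in \mathcal{S}(x')\}$.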
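The decoding rule in this proof translates directly into a lookup. The sketch below assumes the one-sided typewriter support $\{x, x+1 \bmod K\}$ and uses hypothetical names `support` and `decode_in_set`; it returns the unique input in the active independent set that could have produced the observed output, and raises an error when uniqueness fails.

```python
def support(x, K):
    """Assumed output support of input x on the one-sided typewriter channel."""
    return {x, (x + 1) % K}

def decode_in_set(y, I, K):
    """Recover the transmitted input from a single output y, given that the
    input was restricted to the set I known to the decoder."""
    candidates = [x for x in I if y in support(x, K)]
    if len(candidates) != 1:
        raise ValueError("output is not uniquely decodable within this set")
    return candidates[0]
```

For $K = 6$ and the active set $\{0, 2, 4\}$, both possible outputs of each admissible input decode back to that input; with a non-independent set such as $\{0, 1\}$, the output $1$ is ambiguous and the function raises.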
D-C Proof Sketch of Lemma 4
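For the typewriter channel, $\mathcal{S}(x) = \{x, x+1\}$ (indices modulo $K$). Since $I$ is independent, $x \in I$ implies $x + 1 \notin I$. Thus, the decoder rule $\hat{x} = y$ if $y \in I$ and $\hat{x} = y - 1$ otherwise returns $x$ in both cases.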