EPICENTER OF RANDOM EPIDEMIC SPANNING TREES ON FINITE GRAPHS

Epidemic source detection is the problem of identifying the network node that originated an epidemic from a partial observation of the epidemic process. The problem finds applications in different contexts, such as detecting the origin of rumors in online social media, and has been studied under various assumptions. Different from prior studies, this work considers an epidemic process on a finite graph that starts on a random node (epidemic source) and terminates when all nodes are infected, yielding a rooted and directed epidemic tree that encodes node infections (i.e., a directed spanning tree of the graph with every edge directed away from the epidemic source). Assuming knowledge of the underlying graph and the undirected spanning tree (i.e., the infection edges but not their directions), can the epidemic source be accurately identified? This work tackles this problem by introducing the epicenter, an efficient estimator for the epidemic source and, thus, for the direction of every edge in the epidemic tree. When the underlying graph is vertex-transitive, the epicenter can be computed in linear time and it coincides with the well-known distance center of the epidemic tree. Moreover, on a complete graph the epicenter is also the maximum likelihood estimator for the source. Finally, the accuracy of the epicenter is evaluated numerically on five different graph models; the performance strongly depends on the graph structure, varying from 31% (on complete graphs) to 13% (on sparse power-law graphs). However, for all graph models considered the epicenter exhibited an accuracy higher than the distance center, being three times more accurate on sparse power-law graphs.


Introduction
Network epidemics is a ubiquitous model to capture different diffusion processes on networks such as the spread of a disease in a population, viruses in computer networks, and fake news in online social networks [4,16,17,19,20]. Within this context, an important and general problem is determining the epidemic source: identifying the node or nodes that were first infected, and thus originated the epidemic, from a partial observation of the epidemic process [1,2,6,10,13,21,24,25,26]. This problem finds various applications such as identifying the node responsible for starting the dissemination of fake news in an online social network, or identifying the individual first infected by a disease within a given population.
The problem of identifying the epidemic source has many flavors, as it strongly depends on the epidemic model and the information that is observed. One of the simplest epidemic models is the SI model, where network nodes can either be susceptible (S) or infected (I). In this model, nodes do not recover and the only possible epidemic transition is from S to I. Starting from a single infected node (epidemic source) the infection spreads to other nodes through the edges of the network. Consider observing the set of infected nodes at a given time during the epidemic. Given the network and the set of infected nodes, can the epidemic source be identified?
The seminal work of Shah and Zaman introduced the notion of rumor centrality and characterized its effectiveness in identifying the epidemic source on trees [23,24]. Their work considers a probabilistic SI epidemic model spreading on a known infinite tree (possibly random) and the set of infected nodes at a given time. They show that rumor centrality coincides with the Maximum Likelihood (ML) estimator for the epidemic source, and provide a linear time algorithm to compute it. Various subsequent works have followed this methodology and assumed the underlying graph to be infinite or very large in comparison to the observed infected nodes [3,6,9,13].
In a different setting, the seminal work of Pinto et al. considered that the infection times of a small fraction of the nodes are observed along with knowledge of the underlying network [21]. However, multiple epidemic cascades (i.e., independent realizations of the epidemic process) are observed. They provide a polynomial time algorithm to infer the epidemic source, studying the inherent tradeoffs in the model. Various subsequent works have taken this approach of observing multiple cascades to infer the epidemic source and other epidemic characteristics [2,6,18,26].
Different from prior studies, this work considers an SI epidemic process on an arbitrary finite graph that starts on a random node (epidemic source) and terminates when all nodes are infected. The epidemic process generates a rooted and directed epidemic tree that contains all network nodes and encodes node infections (i.e., a directed spanning tree of the graph with every edge directed away from the epidemic source). Given the network G and a single observation of the unrooted and undirected epidemic tree τ , can the epidemic source be identified? Note that the observation is a spanning tree of the network that encodes the edges through which nodes were infected, but not their direction or any other timing information. Moreover, the identification of the epidemic source also reveals the direction of all edges of the epidemic tree.
As a possible application for this model, consider a meme (e.g., a picture) spreading through a messaging service (e.g., WhatsApp) where edges of the network are given by the users' contact lists. Imagine that an adversary learns which pairs of users first exchanged the meme, but obtains neither timing information nor the direction of each exchange. Can the adversary identify the source of the meme?
Another possible application concerns the study of arborescences, which are rooted and directed spanning trees of graphs used in various optimization problems, such as distribution networks [11]. While the minimum cost root location problem asks for the best arborescence of a graph, the proposed formulation asks to determine the most likely arborescence of a graph when constructed by a random process, given its edges but not their direction.
Beyond the novel observation model, this work makes the following contributions:
- Propose a novel estimator for the epidemic source named epicenter (see Def. 4.4) that leverages distances on both G and the spanning tree τ to reveal the epidemic source. The epicenter can always be computed in polynomial time, and structural properties of G and τ allow its running time to be reduced.
- Show that when G is a vertex-transitive graph the epicenter can be computed in linear time (in the number of nodes). In such graphs, it is shown that the epicenter coincides with the distance center and rumor center of τ (under an exponential SI model). Moreover, when G is a complete graph, the epicenter is also the Maximum Likelihood (ML) estimator for the epidemic source.
- Evaluate, through simulations, the accuracy and other characteristics of the epicenter in identifying the epidemic source in five different graph models, as well as a direct comparison with the distance center of the epidemic tree. While the accuracy strongly depends on the graph structure, varying from 31% (on complete graphs) to 13% (on sparse power-law graphs), the epicenter always exhibited an accuracy higher than the distance center, being three times more accurate on sparse power-law graphs.
The remainder of this paper is organized as follows. The related work is briefly addressed in Section 2. Section 3 presents the epidemic model and observation process. Section 4 presents the epicenter and some of its structural properties. The special case of vertex-transitive graphs is studied in Section 5. Section 6 presents a numerical evaluation of the epicenter on different graph models. Section 7 presents a short conclusion.

Related work
Identifying the source of an epidemic is a fundamental problem that has received attention from both theoretical and practical perspectives. The problem has direct applications such as determining the source of a rumor that spreads through an online social network or the source of a blackout that spreads through the power grid [10,25].
Not surprisingly, various formulations of the problem have been investigated and their main differences concern the prior knowledge about the epidemic process and the observation model. The former determines the prior information that is available for the source identification, such as the network structure over which the epidemic unfolds, or the epidemic model and its parameters. This prior information does not depend on the realization of the epidemic. The latter determines what is observed with respect to the epidemic realization such as the infected nodes or the infection times (of a small fraction) up to a given time instant [10,25].
Beyond the observation model, another fundamental aspect in the problem formulation is the number of epidemic realizations (cascades) observed. While some formulations rely on a single epidemic cascade, others consider multiple independent (and even dependent) cascades where results are often a function of the number of cascades (more cascades leading to better results). This formulation is often used in more difficult cases such as when the network structure is unknown or multiple epidemic sources co-exist in the network [6,10,15,18,28].
Part of the prior theoretical works assume that the underlying graph is an infinite tree (possibly random) and that infected nodes are observed after a long period of time (yielding a finite tree, a subgraph of the original infinite tree) [3,24]. Indeed, the main theoretical contributions of [24] are lower and upper bounds for the asymptotic accuracy of the Maximum Likelihood Estimator (which can be computed in linear time) for the source when the observation time goes to infinity (or the number of observed nodes goes to infinity). The main proof technique is casting the epidemic model as a generalized Pólya urn model and using known results from the latter to estimate the asymptotic accuracy of the estimator. Unfortunately, this approach cannot be used in the problem formulation considered in this work, for two reasons: (i) the underlying graph is not a tree (thus, the mapping to a Pólya urn is, in general, not possible); (ii) the underlying graph is finite (thus, asymptotic results are not available). While the subsequent work of Dong et al. considers a partial observation of the set of infected nodes and provides results for a finite number of nodes, its proof technique also relies on Pólya urn models [3]. Indeed, prior works that consider a finite underlying graph assume that the observation of the epidemic process occurs when a relatively small fraction of the network nodes are infected [6,13,27].
The observation model proposed in Kumar et al. consists of the set of infected nodes as well as a random fraction of directed edges from the epidemic tree [13]. While this observation model is related to the model proposed here, there are some fundamental differences: when a directed edge is observed, it is clear that the target node of that edge cannot be the epidemic source. Thus, as more edges are observed, fewer nodes are left as candidates for the epidemic source. In contrast, this work observes all epidemic edges, no direction or timing information is available, and thus, all nodes are candidates for the epidemic source.
A model related to SI epidemics are random tree growing processes: nodes arrive sequentially and connect to the existing tree according to some probabilistic rule (becoming part of the tree before the next node arrives). Two classic models are uniform attachment and preferential attachment. Thus, given an observed tree of a certain size, can the first node be identified? This problem is related to epidemic source identification and has been investigated mostly on theoretical grounds [1,8,14]. In this context, a common problem formulation asks for a set of tree nodes such that the epidemic source is within this set with probability at least 1 − ε. Interestingly, a key result is that the size of this set does not depend on the network size, but only on ε [1]. Again, a key ingredient for proving such a result is casting the tree construction process as a generalized Pólya urn model, and allowing the tree to grow to infinity. Differently from the model considered in this paper, the tree resulting from these random tree growing processes is not constrained by an underlying graph.

Random epidemic trees on graphs
In the classic SI epidemic model individuals of a population can either be susceptible (S) or infected (I), and the only possible epidemic transition is from S to I. The epidemic will unfold on an arbitrary undirected finite graph G = (V, E), henceforth assumed connected, where the set V represents the individuals (nodes) with size n = |V|, and the set E the possibility of contagion (edges) with size m = |E|. The epidemic is described by a discrete time model and a partition of V into S(t) and I(t), representing the set of susceptible and infected nodes at time t = 1, 2, .... Initially, i.e., at time one, a single node is infected and I(1) = {v}, with v ∈ V. This node is called the epidemic source since the epidemic process will unfold from this node. We assume that a single node is infected at each time instant such that |I(t + 1) \ I(t)| = 1 for t = 1, ..., n − 1. Note that the epidemic process will eventually reach all nodes of G and, in particular, I(n) = V.
The edges of G encode the possibility of contagion, in the sense that the epidemic unfolds through the edges. In particular, in order for a node to become infected, one of its neighbors must be infected (with the exception of the epidemic source). Therefore, only nodes that have an infected neighbor at time t can become infected at time t + 1. Moreover, we assume that an infection event occurs through a specific edge: a node becomes infected by exactly one of its infected neighbors.
Let $C(t)$ denote the edge cut induced by the partition $I(t)$ and $S(t)$. Note that for each edge $e = \{u, v\} \in C(t)$, one node $u \in I(t)$ is infected and the other node $v \in S(t)$ is susceptible. Let $b_t = (u, v)$ be the directed edge through which node $v$ is infected at time $t + 1$, in the sense that $I(t + 1) \setminus I(t) = \{v\}$. Thus, $b_t$ provides the infection event at time $t$. As it turns out, the entire epidemic process is characterized by the epidemic source, denoted by $r$, and the sequence of directed edges $b_1, \ldots, b_{n-1}$, where $b_1$ is incident to $r$. This rooted sequence will be denoted as $b_r = (b_1, \ldots, b_{n-1})$, meaning that $I(1) = \{r\}$ and that $b_1$ is incident to $r$. Note that $b_r$ induces one rooted and directed spanning tree of $G$, since a node is only infected once. Let $(\tau, r)$ denote the rooted and directed spanning tree induced by $b_r$. While $b_r$ precisely constructs $(\tau, r)$, it is possible for $(\tau, r)$ to be constructed by other edge sequences.
The following probabilistic model for edge selection is considered, which also determines how the epidemic unfolds through the graph. Let $e_t$ be a random variable denoting the edge chosen at time $t$ by the epidemic process. We assume that $e_t$ has a uniform distribution over $C(t)$, the edge cut at time $t$. In particular, $P(e_t = \{u, v\}) = 1/|C(t)|$ for all $\{u, v\} \in C(t)$, and is 0 otherwise. Note that the probability that a susceptible node is infected at time $t$ is proportional to its number of infected neighbors at that time, and thus in general not uniform over $S(t)$.
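The edge-selection rule above can be sketched as a short simulation. This is a minimal illustration, assuming G is given as an adjacency dictionary; the function names `simulate_epidemic_tree` and `undirected_tree` are hypothetical and not taken from the paper. The edge cut is recomputed from scratch at every step for clarity, not efficiency.

```python
import random


def simulate_epidemic_tree(adj, source, rng=random):
    """Run the discrete-time SI process until every node is infected.

    adj maps each node to an iterable of its neighbours in a connected,
    undirected graph G. Returns the directed infection edges
    b_1, ..., b_{n-1} in the order in which they occurred.
    """
    infected = {source}
    sequence = []
    while len(infected) < len(adj):
        # Edge cut C(t): edges with exactly one infected endpoint,
        # stored as (infected endpoint, susceptible endpoint).
        cut = [(u, v) for u in infected for v in adj[u] if v not in infected]
        u, v = rng.choice(cut)  # uniform choice over the cut
        infected.add(v)
        sequence.append((u, v))
    return sequence


def undirected_tree(sequence):
    """Drop directions and order: this is the observation tau."""
    return {frozenset(edge) for edge in sequence}
```

Calling `undirected_tree(simulate_epidemic_tree(adj, source))` yields one realization of the observed spanning tree; passing a seeded `random.Random` instance as `rng` makes the run reproducible.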
The probability that a given rooted edge sequence $b_r = (b_1, \ldots, b_{n-1})$ is observed is simply the product of the inverses of the edge cut sizes induced by the sequence. In particular, the sets of infected nodes are given by $I_{b_r}(1) = \{r\}$ and, for $t = 2, \ldots, n$,
$$I_{b_r}(t) = I_{b_r}(t-1) \cup \{v\}, \quad \text{where } b_{t-1} = (u, v),$$
so that, writing $C_{b_r}(t)$ for the edge cut induced by $I_{b_r}(t)$, the probability of the sequence given the source is
$$\prod_{t=1}^{n-1} \frac{1}{|C_{b_r}(t)|}.$$
The network epidemic model above is related to random tree growing process models [5]. In the classic uniform attachment random tree model, a node at time t joins the tree by connecting uniformly at random to one of the nodes in the existing tree. Note that this model is equivalent to the above epidemic model when G is a complete graph (nodes are relabeled by their infection times), since in complete graphs a susceptible node is infected by an infected node chosen uniformly at random.
The model above is also related to the classic continuous time SI network epidemic model, where the time to infection of a node follows an exponential distribution with rate given by the number of infected neighbors [23]. Since time is continuous, only one node will be infected at any given time instant. Moreover, since time is exponentially distributed, the probability that a given node is the next to become infected can be shown to be exactly the same as in the above model. Thus, these two epidemic models are equivalent.
Problem formulation: Given a graph G, the proposed network epidemic model generates (τ , r), a random rooted and directed spanning tree induced by the random source r and random sequence b r . Let τ be the unrooted and undirected spanning tree constructed from (τ , r) by removing the direction of every edge (as well as the root). Thus, τ encodes the infection edges but not the infection direction. We consider the following problem: given G and a single realization of τ , determine the epidemic source. Note that τ encodes the infection edges with no information concerning the infection direction or any other timing information.
While any node of τ can be the epidemic source, their probability of being the source varies and depends on τ . Intuitively, the structure of τ along with G provides evidence for nodes that are more likely to be the epidemic source. For example, the epidemic source is more likely to be at the "center" of τ when also considering G.

Epicenter and ML source estimator of epidemic trees
Given a graph G and a single realization of the epidemic tree τ (unrooted and undirected), the goal is to determine the epidemic source. Henceforth, let V (τ ), E(τ ) and |τ |, denote the set of nodes, set of edges and size, of the tree τ , respectively. Table 2 provides a summary of the main notation used throughout the paper.
Note that when fixing a possible root for τ, say v, there are a myriad of different edge sequences starting from v which could have generated the now rooted and directed tree $\tau_v$. This motivates the following definition, which establishes the conditions for an edge sequence to be capable of generating the tree τ starting from a node v: let $B(\tau_v)$ denote the set of all edge sequences rooted at v that generate the rooted tree $\tau_v$. Note that $B(\tau_v)$ depends only on τ and v but not on the graph G from which τ was constructed. In what follows we present a couple of results on the size of $B(\tau_v)$. Let us start with a notation that will be used quite extensively: given a rooted tree $\tau_v$ and a node $u \in V(\tau_v)$, we denote by $\tau^v_u$ the rooted subtree of $\tau_v$ dangling from node u (with respect to v), rooted at u. Specifically, if $u \neq v$, $\tau^v_u$ denotes the subtree rooted at u obtained by removing the edge connecting u to its parent with respect to v (i.e., the neighbor of u in the unique path between u and v in $\tau_v$); whereas, if $u = v$, $\tau^v_u = \tau_v$. The first lemma establishes a recursive formula for $|B(\tau_v)|$.
The combinatorial argument used above is quite standard and a sketch of the proof can be found in the Appendix. Recursively applying Lemma 4.2, we obtain the following proposition.
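As an illustration of counting the sequences in B(τ_v), the sketch below compares a brute-force enumeration with the standard closed form for the number of orderings of a rooted tree in which every node appears after its parent, namely n! divided by the product of all subtree sizes. That this closed form matches the exact statement of the proposition is an assumption; the function names are hypothetical.

```python
from itertools import permutations
from math import factorial


def orient(tree_adj, root):
    """Return (parent map, BFS order) of the tree rooted at `root`."""
    parent = {root: None}
    order = [root]
    for u in order:  # the list grows while we iterate (BFS)
        for w in tree_adj[u]:
            if w not in parent:
                parent[w] = u
                order.append(w)
    return parent, order


def count_sequences_closed_form(tree_adj, root):
    """n! / prod_u |subtree(u)|, assumed to equal |B(tau_root)|."""
    parent, order = orient(tree_adj, root)
    size = {u: 1 for u in order}
    for u in reversed(order):  # accumulate sizes from leaves to root
        if parent[u] is not None:
            size[parent[u]] += size[u]
    denominator = 1
    for u in order:
        denominator *= size[u]
    return factorial(len(order)) // denominator


def count_sequences_brute_force(tree_adj, root):
    """Count infection orders in which each node follows its parent."""
    parent, order = orient(tree_adj, root)
    others = [u for u in order if u != root]
    count = 0
    for perm in permutations(others):
        seen = {root}
        valid = True
        for u in perm:
            if parent[u] not in seen:
                valid = False
                break
            seen.add(u)
        count += valid
    return count
```

The brute-force version is exponential and only meant for checking small trees; the closed form runs in linear time once the tree is oriented.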
Recall that given a tree τ generated by the random epidemic process, the probability that a node is the epidemic source of τ is node dependent. Specifically, the probability that τ was constructed from a root v can be computed by summing over all possible rooted edge sequences that generate $\tau_v$ (since they are all mutually exclusive). In particular, we have:
$$P_G(\tau \mid \text{root} = v) = \sum_{b \in B(\tau_v)} \prod_{t=1}^{n-1} \frac{1}{|C_b(t)|}, \qquad (4.1)$$
where $C_b(t)$ is the edge cut of G induced by the nodes infected at time t under the sequence b, so that the dependence on the underlying graph G is through the edge cut sizes. Applying Bayes rule we obtain the probability that a node is the epidemic source given the tree τ:
$$P_G(\text{root} = v \mid \tau) = \frac{P_G(\tau \mid \text{root} = v)\, P(\text{root} = v)}{P_G(\tau)}. \qquad (4.2)$$
The above equation requires a prior for the epidemic source, namely $P(\text{root} = v)$, which is assumed to be uniform across nodes in V, i.e., $P(\text{root} = v) = 1/n$ for all $v \in V$. Moreover, it also requires $P_G(\tau)$, which can be computed using the Law of Total Probability. More importantly, neither the prior nor $P_G(\tau)$ depends on the specific v and thus both can be treated as constants in equation (4.2). Thus, the Maximum Likelihood estimator (MLE) for the epidemic source, denoted by $r^*_{ML}(G, \tau)$, corresponds to
$$r^*_{ML}(G, \tau) = \arg\max_{v \in V} P_G(\text{root} = v \mid \tau) = \arg\max_{v \in V} P_G(\tau \mid \text{root} = v), \qquad (4.3)$$
where the last equality holds due to the uniform prior.
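The likelihood of a tree given its root can be evaluated directly on very small instances. The sketch below is a brute-force illustration, assuming adjacency dictionaries for G and τ; it enumerates every valid infection order and is therefore exponential in n. The names `tree_likelihood` and `ml_sources` are hypothetical.

```python
from fractions import Fraction


def tree_likelihood(graph_adj, tree_adj, root):
    """P_G(tau | root) by enumerating every infection order of tau."""
    n = len(graph_adj)

    def cut_size(infected):
        # Number of edges of G with exactly one infected endpoint.
        return sum(1 for u in infected for w in graph_adj[u] if w not in infected)

    def extend(infected):
        if len(infected) == n:
            return Fraction(1)
        step = Fraction(1, cut_size(infected))
        total = Fraction(0)
        # Only tree edges crossing the cut keep the sequence inside B(tau_root).
        for u in infected:
            for w in tree_adj[u]:
                if w not in infected:
                    total += step * extend(infected | {w})
        return total

    return extend(frozenset([root]))


def ml_sources(graph_adj, tree_adj):
    """Return (sorted list of likelihood maximisers, likelihood per node)."""
    like = {v: tree_likelihood(graph_adj, tree_adj, v) for v in graph_adj}
    best = max(like.values())
    return sorted(v for v in like if like[v] == best), like
```

Exact rational arithmetic via `Fraction` avoids spurious ties caused by floating-point rounding. On the complete graph with four nodes and a star-shaped tree, the center of the star is the unique maximiser.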
In general, maximizing equation (4.1) is computationally expensive, in part due to the dependence on the underlying graph G (see discussion in Sect. 4.1). Intuitively, the structure of G and τ can be used more directly (in terms of computational complexity) to provide information about the chances that a given node is the epidemic source. Thus, a new estimator for the epidemic source is proposed, called the epicenter:
Definition 4.4. Let G = (V, E) be a graph and τ a spanning tree of G. The epicenter of τ in G is defined as
$$r^*_{EPI}(G, \tau) := \arg\min_{v \in V} \sum_{u \in V} \big( d_\tau(u, v) - d_G(u, v) \big),$$
where $d_G(u, v)$ denotes the graph distance between nodes u and v in G, and $d_\tau(u, v)$ the distance between them in τ (with ties in arg min broken uniformly at random).
Since τ is a subgraph of G, $d_\tau(u, v) \geq d_G(u, v)$ for every pair of nodes, so each difference is non-negative and the epicenter of τ in G is the node that minimizes the sum of those differences. Intuitively, the epicenter is the node that best aligns the tree τ in G in terms of distances. Note that the epicenter can be rewritten as
$$r^*_{EPI}(G, \tau) = \arg\min_{v \in V} \big( d_\tau(v) - d_G(v) \big),$$
where $d_G(v) := \sum_{u \in V} d_G(u, v)$ is the distance centrality of v in G (equivalently for τ) [16].
Example 4.5. Let G, τ, τ′ be as depicted in Figure 1.
Thus, the epicenter of τ in G is chosen uniformly from the nodes {1, 2}, whereas the epicenter of τ′ in G is 4.
When the underlying graph is itself a tree, i.e., G = τ, the only possible epidemic tree is τ itself, which implies $d_\tau(v) - d_G(v) = 0$ for all $v \in V$. Thus, whenever G is a tree, the epicenter does not provide any information concerning the epidemic source. Remarkably, the ML estimator $r^*_{ML}$ in this scenario will be uniformly distributed on V, since $P_G(\tau \mid \text{root} = v)$ will not depend on v. This follows because the epidemic source is chosen uniformly at random by the epidemic model.

Computing the epicenter
The epicenter of a spanning tree in a graph can be directly computed from its definition. In particular, one can simply compute the distance centrality for every node v ∈ V , both in G = (V, E) and its spanning tree τ , compute the difference of the corresponding distances, and return the node with the smallest value.
However, given a particular structure for G and τ , it is reasonable that not every node v ∈ V needs to be considered in this computation. Indeed, the following proposition (to be proven later) states that distance centrality of leaves of τ (i.e., nodes with degree one in τ ) are not required to determine the epicenter of τ in G.
where, u is the unique neighbor of v in G (and also in τ ).
The above proposition guarantees that a node which is a leaf in τ but not in G is never the epicenter. Moreover, a leaf in τ is an epicenter only if its parent in τ is an epicenter.
The computation of the epicenter is shown in Algorithm 1. Note that a breadth-first search (BFS) starting at v is sufficient to compute the distances from v to every node in G and τ . This has computational complexity Θ(m) and Θ(n), respectively. Computing the differences requires time Θ(n) (lines 9 − 11). This process is repeated for every non-leaf node of τ , which is bounded above by n. In general, the number of leaves in τ can be arbitrary, and thus the complexity of Algorithm 1 is O(nm).
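The procedure described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not a transcription of Algorithm 1: graphs are adjacency dictionaries, the helper names are hypothetical, and all minimisers among the non-leaf nodes of τ are returned so that the caller can break ties.

```python
from collections import deque


def distance_centrality(adj, start):
    """Sum of BFS distances from `start` to every node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return sum(dist.values())


def epicenter_candidates(graph_adj, tree_adj):
    """Nodes minimising d_tau(v) - d_G(v), skipping the leaves of tau."""
    candidates = [v for v in tree_adj if len(tree_adj[v]) > 1]
    if not candidates:  # trees with at most two nodes have only leaves
        candidates = list(tree_adj)
    score = {
        v: distance_centrality(tree_adj, v) - distance_centrality(graph_adj, v)
        for v in candidates
    }
    best = min(score.values())
    return sorted(v for v in candidates if score[v] == best)
```

Skipping the leaves relies on the proposition discussed above: a leaf of τ can only attain the minimum if its neighbor in τ does, so at least one minimiser always survives the filter. A leaf that ties with its neighbor is dropped, which slightly changes the tie-breaking distribution compared with the definition.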
Note that finding the MLE for the epidemic source as defined by equation (4.3) requires solving equation (4.1) for every node v ∈ V. A direct computation of equation (4.1) for a given node v would require iterating over all sequences in $B(\tau_v)$. While this number strongly depends on the structure of $\tau_v$, it is likely to grow exponentially with the number of nodes for most trees. For example, for the root of a full binary tree with k > 0 levels and …
[Algorithm 1: Epicenter of a spanning tree of a graph. Input: G = (V, E), τ (G and a spanning tree τ).]
Thus, computing the MLE directly is prohibitive (i.e., exponential number of iterations) for most cases and the epicenter provides a much more efficient approach. Moreover, when G is vertex-transitive the epicenter can be computed even more efficiently, requiring only linear time in n (details in Sect. 5).

Some properties of the epicenter
This section presents auxiliary results which will be used to prove Proposition 4.6, and also Proposition 5.2 in the sequel.
Given a graph G = (V, E) and two nodes u, v ∈ V, we denote by $P_G(u, v)$ the set of shortest paths in G between u and v (a shortest path may not be unique). Also, given a path $p \in P_G(u, v)$ and a node w, we say that $w \in p$ if and only if the path p crosses the node w. Then, we define
$$V^u_v(G) := \{ w \in V : \exists\, p \in P_G(w, u) \text{ such that } v \in p \},$$
the set of nodes w for which there exists a shortest path between w and u crossing v in G.
[Figure 2. The nodes drawn as circles correspond to the set $V^u_v(G)$; the nodes drawn as squares correspond to the set $V^v_u(G)$; the filled nodes belong to neither set.]
Note that $V^u_v(G) \cap V^v_u(G) = \emptyset$ for every G, since if a shortest path from a node w to u crosses node v, it cannot be the case that a shortest path from w to v crosses u. Thus, we also have that $|V^u_v(G)| + |V^v_u(G)| \leq n$. See Figure 2 for an example of the set $V^u_v(G)$. To avoid clutter, we shall remove the dependence on G and write $V^u_v$, unless otherwise needed. The following lemma relates the distance centrality of two neighboring nodes in a graph.
where we recall that τ v u denotes the subtree of τ corresponding to the connected component of u when the edge {u, v} is removed from τ .
If we apply Lemma 4.7 to a tree τ, we obtain that, for neighbors u and v in τ,
$$d_\tau(u) + |\tau^v_u| = d_\tau(v) + |\tau^u_v|.$$
Therefore, by applying Lemma 4.7 to G and to a spanning tree τ of G, we obtain equation (4.4).
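The relation between the distance centralities of two neighboring nodes can be checked numerically. The sketch below assumes the identity d_G(u) + |V^v_u| = d_G(v) + |V^u_v| for adjacent u and v, which is inferred from how Lemma 4.7 is applied in the proof of Proposition 5.2 rather than copied from its statement; all function names are hypothetical.

```python
from collections import deque


def bfs_distances(adj, start):
    """Shortest-path distances from `start` in an unweighted graph."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist


def crossing_set(adj, target, via):
    """Nodes w having a shortest path to `target` that crosses `via`.

    Assumes `target` and `via` are adjacent, so such a path exists
    exactly when d(w, target) = d(w, via) + 1.
    """
    d_target = bfs_distances(adj, target)
    d_via = bfs_distances(adj, via)
    return {w for w in adj if d_target[w] == d_via[w] + 1}


def neighbor_identity_holds(adj, u, v):
    """Check d(u) + |V^v_u| == d(v) + |V^u_v| for adjacent u, v."""
    d_u = sum(bfs_distances(adj, u).values())
    d_v = sum(bfs_distances(adj, v).values())
    return d_u + len(crossing_set(adj, v, u)) == d_v + len(crossing_set(adj, u, v))
```

The identity follows because, for adjacent u and v, the difference d(w, u) − d(w, v) is +1 exactly for the nodes w whose shortest path to u crosses v, −1 for those whose shortest path to v crosses u, and 0 otherwise.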
Since v is a leaf in τ it has a unique neighbor (in τ), henceforth denoted by u, and $|\tau^u_v| = 1$ and $|\tau^v_u| = n - 1$. … Towards this goal, let us partition the set of nodes V as … There are three possible scenarios: … Note that, in this latter case, v must necessarily be a leaf in G. The proof of the second claim follows from noticing that if v is a leaf in G, then 3) holds and, given that v must necessarily also be a leaf in τ, from Corollary 4.8 we obtain … Therefore, we obtain that …

Epicenter of spanning trees in vertex-transitive graphs
In this section, we show that whenever the underlying graph G is vertex-transitive, the epicenter of any spanning tree of G can be computed more efficiently. Notable examples of such graphs are complete graphs, balanced complete bipartite graphs, cycles, and hypercubes, among others.
Let us begin by observing that if f : V → V is an automorphism of G = (V, E), then for every node u ∈ V, it holds that $d_G(u) = d_G(f(u))$. Hence, for a vertex-transitive G, it holds that $d_G(v) = d_G(u)$, ∀u, v ∈ V. Thus, the epicenter of a spanning tree τ of a vertex-transitive G is equivalent to
$$r^*_{EPI}(G, \tau) = \arg\min_{v \in V} d_\tau(v).$$
In this case the epicenter only depends on τ, and reduces to a well-known notion for the center of a tree τ called distance center, which is defined as $r^*_{DC}(\tau) := \arg\min_{v \in V} \sum_{u \in V} d_\tau(v, u)$ [16]. In general, however, the epicenter cannot be easily compared to the distance center because it depends on both G (the network) and τ.
Remark 5.1. The distance center of trees is related to another network centrality concept called rumor center, defined for a tree τ as $r^*_{RC}(\tau) := \arg\max_{v \in V(\tau)} |B(\tau_v)|$ [24].
In particular, it can be shown that, ruling out possible tie breaking, $r^*_{DC}(\tau) = r^*_{RC}(\tau)$. However, for arbitrary graphs it is often the case that $r^*_{DC}(G) \neq r^*_{RC}(G)$ (see Prop. 2 in [23]).
Before stating the main result of this section, which provides a linear time algorithm to compute the epicenter of trees in vertex-transitive graphs, let us introduce some notation. Given a tree τ and a root $v \in V(\tau)$, let $S_v := \{u \in V(\tau) : |\tau^v_u| \geq |\tau|/2\}$ be the set of vertices such that the tree dangling from them with respect to v has size at least half of |τ|; note that $S_v$ is non-empty, since $v \in S_v$. Let
$$u^*(\tau, v) := \arg\max_{u \in S_v} d_\tau(u, v),$$
i.e., the node in $S_v$ with the maximal distance from the chosen root v. Note that for a fixed τ, by varying v, the node $u^*(\tau, v)$ may change. However, for a fixed τ and v the node $u^*(\tau, v)$ is unique; as a matter of fact, if the maximal distance is zero, then the only possible node is v itself. Assuming that the maximal distance is k ≥ 1, if there were another node with the same distance from v in $S_v$, it would necessarily imply that the tree τ has size at least |τ|/2 + |τ|/2 + 1, which is clearly a contradiction.
In the following proposition we show that: i) the node $u^*(\tau, v)$ is such that after removing any edge incident to it, $u^*$ always belongs to a subtree of size at least half of τ, and this holds regardless of the specific v chosen; in formula, $u^*$ is such that $|\tau^w_{u^*}| \geq n/2$ for all w which are neighbors of $u^*$ in τ.
ii) the node $u^*(\tau, v)$ is an epicenter (ruling out tie breaking), regardless of the specific root v chosen. Since $u^*(\tau, v)$ can be easily computed in linear time (see Algorithm 2), this provides an efficient way to find the epicenter of a spanning tree of a vertex-transitive graph.
Proposition 5.2. Let G = (V, E) be a vertex-transitive graph and τ a spanning tree of G. Then, it holds that:
i) $\{u^*(\tau, v) : v \in V\} = \{w \in V : |\tau^v_w| \geq n/2, \ \forall v \in V\}$; (5.2)
ii) ruling out tie breaking, $u^*(\tau, v) = r^*_{EPI}(G, \tau)$ for every $v \in V$;
and, for every tree τ, the set in (5.2) is non-empty. Moreover, if the tree τ has a bisection (i.e., there exists an edge whose removal partitions the tree into two equal size subtrees), then the set in (5.2) contains two elements, namely the endpoints of the edge whose removal halves the tree. If the tree does not admit a bisection, the set in (5.2) contains a unique element. In passing, note that if the size of the tree n is odd, then the epicenter is uniquely determined.
Proof of Proposition 5.2. We begin by proving i), and specifically that $\{w \in V : |\tau^v_w| \geq n/2, \ \forall v\} \subseteq \{u^*(\tau, v) : v \in V\}$. For that, it is enough to show that if $w' \in \{w \in V : |\tau^v_w| \geq |\tau|/2, \ \forall v \in V\}$, then there exists a v such that $w' = u^*(\tau, v)$. Let us assume, towards a contradiction, that for all v, $u^*(\tau, v) \neq w'$. Let v be arbitrary; by definition of $w'$, we know that $|\tau^v_{w'}| \geq n/2$, i.e., $w' \in S_v$. Therefore, we must have $d_\tau(u^*, v) > d_\tau(w', v)$. Since $u^* \in S_v$, we know that $|\tau^v_{u^*}| \geq n/2$. Note that, if $|\tau^v_{w'}| \geq n/2$ and $|\tau^v_{u^*}| \geq n/2$, necessarily $w'$ and $u^*$ must be neighbors in τ, and $|\tau^{w'}_{u^*}| = |\tau^{u^*}_{w'}| = n/2$, i.e., the removal of the edge $\{u^*, w'\} \in E(\tau)$ bisects τ. Thus, if τ does not have a bisection, we obtain a contradiction. If τ admits a bisection and we denote by $\{u', v'\}$ the edge whose removal halves the tree, then it is not difficult to see that …
Thus, it remains to consider the nodes $v' \in V(\tau^v_{u^*}) \setminus \{u^*\}$. Note that all nodes in the latter set have distance from v strictly bigger than the distance $d_\tau(u^*, v)$. Consequently, these nodes cannot belong to the set $S_v$ (otherwise it would contradict the definition of $u^*$), and therefore we have that $|\tau^v_{v'}| < n/2$ for all $v' \in V(\tau^v_{u^*}) \setminus \{u^*\}$. Also, for all these nodes it holds that $|\tau^v_{v'}| = |\tau^{u^*}_{v'}| < n/2$. Let w be a node in $V(\tau^v_{u^*}) \setminus \{u^*\}$ having distance one from $u^*$; then, $|\tau^{u^*}_w| < n/2$. However, since w and $u^*$ are neighbors, it holds that $|\tau^{u^*}_w| + |\tau^w_{u^*}| = |\tau| = n$, which implies $|\tau^w_{u^*}| \geq n/2$ for all $w \in V(\tau^v_{u^*}) \setminus \{u^*\}$ which are neighbors of $u^*$. Noticing that, for all …
We now proceed to prove ii). For a spanning tree τ of a vertex-transitive G, we know that, ruling out tie breaking, $r^*_{EPI}(G, \tau) = \arg\min_{v \in V} d_\tau(v)$. … Let us assume, towards a contradiction, that $u^* \neq \arg\min_{v \in V} d_\tau(v)$, which is equivalent to saying that $\exists r \in V$ such that $d_\tau(r) < d_\tau(u^*)$.
Let p denote the unique path in τ connecting $u^*$ and r, and without loss of generality we assume $p = w_0, w_1, \ldots, w_{k-1}, w_k$, with $w_0 = u^*$ and $w_k = r$. Let $j := \min\{i \geq 0 : d_\tau(w_i) > d_\tau(w_{i+1})\}$, and note that, by the hypothesis $d_\tau(r) < d_\tau(u^*)$, j is always less than or equal to k − 1. Applying Lemma 4.7 we have that $d_\tau(w_j) + |\tau^{w_{j+1}}_{w_j}| = d_\tau(w_{j+1}) + |\tau^{w_j}_{w_{j+1}}|$. Note that $|\tau^{w_{j+1}}_{w_j}| + |\tau^{w_j}_{w_{j+1}}| = n$ (because $w_j$ and $w_{j+1}$ are at distance one in τ), and that $|\tau^{w_{j+1}}_{w_j}| \geq |\tau^{w_{j+1}}_{u^*}|$. By point i) above, $u^*$ satisfies $|\tau^w_{u^*}| \geq n/2$ for all $w \in V$, which implies $|\tau^{w_{j+1}}_{w_j}| \geq n/2$, and thus $d_\tau(w_j) \leq d_\tau(w_{j+1})$, which is a contradiction.
Let us now show the converse. Let $u'$ be a node minimizing $d_\tau$ and assume, towards a contradiction, that there exists an $r \in V$ such that $|\tau^{r}_{u'}| < n/2$ (we are using i)). Note that, without loss of generality, we may assume $r$ is a neighbor of $u'$ in $\tau$. Given that $|\tau^{r}_{u'}| < n/2$, it is also the case that $|\tau^{u'}_{r}| \ge n/2$, since $r$ and $u'$ are neighbors. Applying Lemma 4.7 to $\tau$, we have that $d_\tau(r) + |\tau^{u'}_{r}| = d_\tau(u') + |\tau^{r}_{u'}|$, which implies $d_\tau(r) < d_\tau(u')$, contradicting the minimality of $u'$.

Computing the epicenter in vertex-transitive graphs
The special structure of vertex-transitive graphs yields theoretical results that allow for the design of an efficient algorithm to compute the epicenter. In particular, Proposition 5.2 establishes that the epicenter of τ in G is given by u * (τ, v). The goal of the algorithm is to compute u * (τ, v) efficiently, and its pseudo-code is shown in Algorithm 2.
The main idea of the algorithm is to create an orientation for τ using an arbitrary node as root (e.g., the first node of V ), and then compute the subtree sizes from the leaves towards the root: a leaf has subtree size equal to one, and a parent has subtree size equal to one plus the sum of the subtree sizes of its children. The algorithm stops when it reaches a node that has subtree size of at least n/2, and returns this node as the epicenter.
The algorithm is iterative and prunes the leaves of the rooted tree τ v1 which in turn may create new leaves. When a leaf is pruned, the subtree size of its parent is updated, as well as the number of children of its parent (lines 15−16).
The algorithm stops when it encounters a node (i.e., a leaf in the pruned tree) that has subtree size of at least n/2 (lines 12−13). Since the algorithm iterates from the leaves towards the root, the stopping condition is satisfied by a node at the largest possible distance from v 1 . In light of Proposition 5.2, this is the epicenter of the tree τ (and does not depend on the choice of v 1 ). Algorithm 2 has computational complexity Θ(n), where n is the number of nodes in τ . Note that the BFS in line 1 runs on the tree τ , which has n − 1 edges. Moreover, a node enters the leaf set Leafs only once, and thus the main loop requires at most n iterations. Finally, all computations within the main loop (lines 10−19) require constant time.
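The leaf-pruning procedure described above can be sketched in Python as follows. This is an illustrative sketch and not the paper's Algorithm 2 verbatim: the function name `epicenter_vertex_transitive`, the adjacency-dictionary input, and the variable names are assumptions made for the example.

```python
from collections import deque

def epicenter_vertex_transitive(adj, root):
    """Return the first pruned node whose subtree holds at least n/2 nodes.

    adj  : dict mapping each node of the tree to a list of its neighbors
    root : arbitrary node used to orient the tree (plays the role of v1)
    """
    n = len(adj)
    # BFS from the root to obtain parents and the number of children.
    parent = {root: None}
    n_children = {v: 0 for v in adj}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in parent:
                parent[w] = u
                n_children[u] += 1
                queue.append(w)
    # Prune leaves, accumulating subtree sizes towards the root.
    size = {v: 1 for v in adj}
    leaves = deque(v for v in adj if n_children[v] == 0)
    while leaves:
        u = leaves.popleft()
        if 2 * size[u] >= n:          # stopping condition: size >= n/2
            return u
        p = parent[u]
        size[p] += size[u]
        n_children[p] -= 1
        if n_children[p] == 0:        # the parent became a leaf
            leaves.append(p)
    return root                       # not reached for a valid tree

# Path 0-1-2-3-4: the middle node is returned for any choice of root.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print([epicenter_vertex_transitive(path, r) for r in path])  # [2, 2, 2, 2, 2]
```

Each node enters the queue of leaves once and is handled in constant time, which matches the linear running time discussed above. When the tree admits a bisection, the returned node may depend on the root, which corresponds to the tie-breaking case.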
The running time of Algorithm 2, Θ(n), is in sharp contrast with that of the general Algorithm 1, O(nm). Indeed, finding the epicenter on vertex-transitive graphs requires significantly less effort.

Epicenter in complete graphs
The random network epidemic model under study (described in Sect. 3) has a distinctive feature on complete graphs, henceforth denoted as K: the edge cut size depends only on time t. Specifically, with t infected nodes at time t, it holds that $|C(t)| = t(n - t)$. Consequently, the probability of a rooted tree $\tau_v$ is proportional to $|B(\tau_v)|$ (equation (5.3)), where $|B(\tau_v)|$ denotes the number of edge sequences rooted at v which generate the rooted tree $\tau_v$. In essence, when the underlying graph is complete, the probability of an edge sequence does not depend on the specific sequence. This implies that the ML estimator $r^*_{\mathrm{ML}}$ for the epidemic source is given by the node v which maximizes $|B(\tau_v)|$.
Let us point out that in vertex-transitive graphs, equation (5.3) will not hold in general. For example, in the 3d-hypercube, there exist two different rooted sequences which generate the same rooted tree but with different probabilities. A similar counterexample can be found in the complete (balanced) bipartite graph $K_{3,3}$.
Theorem 5.4. If G = K, then for every spanning tree τ of K, it holds that $r^*_{\mathrm{ML}}(K, \tau) = r^*_{\mathrm{EPI}}(K, \tau)$, ruling out possible tie breaking.
Thus, on a complete graph, the two estimators for the epidemic source, ML and EPI, coincide. Note, however, that $r^*_{\mathrm{EPI}}(K, \tau)$ depends only on τ (not on any probability model), whereas $r^*_{\mathrm{ML}}(K, \tau)$ depends on the probability model that generates τ (see Eq. (4.3)). Moreover, due to Theorem 5.4, for a complete graph Algorithm 2 also computes the MLE for the epidemic source (and equivalently, the rumor center and the distance center).
Proof of Theorem 5.4. In light of equation (5.3) and Proposition 4.3, the ML estimator for the epidemic source of a spanning tree of a complete graph is the node v which minimizes $\prod_{u \ne v} |\tau^v_u|$. Thus, it holds that $r^*_{\mathrm{ML}}(K, \tau) = r^*_{\mathrm{RC}}(\tau)$ (see the first remark in Sect. 5). Furthermore, for any tree it holds that $r^*_{\mathrm{RC}}(\tau) = r^*_{\mathrm{DC}}(\tau)$. Finally, $r^*_{\mathrm{EPI}}(K, \tau) = r^*_{\mathrm{DC}}(\tau)$ by Proposition 5.2, which concludes the proof.
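The chain of equalities used in the proof can be checked numerically on a small tree. The following Python sketch is only an illustration: the helper names `subtree_sizes` and `centers` and the example tree are assumptions. It computes, for every candidate root v, the product of the subtree sizes $|\tau^v_u|$ over $u \ne v$ (rumor-center criterion) and their sum, which equals the sum of distances from v (distance-center criterion).

```python
from math import prod

def subtree_sizes(adj, root):
    """Sizes of the subtrees of every node when the tree is rooted at root."""
    parent = {root: None}
    order = [root]
    i = 0
    while i < len(order):             # BFS order without recursion
        u = order[i]
        i += 1
        for w in adj[u]:
            if w not in parent:
                parent[w] = u
                order.append(w)
    size = {v: 1 for v in adj}
    for u in reversed(order[1:]):     # accumulate from leaves to root
        size[parent[u]] += size[u]
    return size

def centers(adj):
    """Return (rumor centers, distance centers) as sets of nodes."""
    prod_score, sum_score = {}, {}
    for v in adj:
        size = subtree_sizes(adj, v)
        others = [size[u] for u in adj if u != v]
        prod_score[v] = prod(others)  # minimized by the rumor center
        sum_score[v] = sum(others)    # equals the sum of distances from v
    rc = {v for v in adj if prod_score[v] == min(prod_score.values())}
    dc = {v for v in adj if sum_score[v] == min(sum_score.values())}
    return rc, dc

# Example tree with edges 0-1, 1-2, 1-3, 3-4.
tree = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1, 4], 4: [3]}
print(centers(tree))  # ({1}, {1})
```

On a tree that admits a bisection, both sets contain the two endpoints of the bisecting edge, which is the tie-breaking case excluded in the statements above.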
Remark 5.5. Another example of a vertex-transitive graph for which Theorem 5.4 holds is the cycle graph C. Indeed, in a cycle graph the size of the edge cut induced by the epidemic model is always equal to 2, for every time t. Thus, equation (4.1) reduces to an expression that does not depend on the specific edge sequence, as in the complete graph. Note that the cycle graph yields a relatively simple scenario because all spanning trees of C are isomorphic paths, contrary to the spanning trees of the complete graph.

Numerical evaluation
Is the epicenter the epidemic source? What is the distance between the epicenter and the epidemic source? Is the epicenter significantly more accurate than the distance center? Clearly, the answers to such questions depend on the underlying graph structure, and this section provides empirical evidence to address them.

Graph models and performance metrics
The following graph models are considered in the characterization and evaluation of the epicenter [16]:
- Complete graph (CO): every pair of nodes is connected by an edge.
- Torus (TO): a square lattice on n nodes in which every node has degree 4.
- Erdős-Rényi random graph (ER): every pair of nodes is connected by an edge independently with probability p.
- Watts-Strogatz random graph (WS): n nodes are arranged in a ring and each node is connected to the nodes at distance k or less on the ring. Every edge is then rewired with probability p, choosing one of its endpoints uniformly at random among the nodes. This sparse random graph model yields high clustering and short distances.
- Barabási-Albert random graph (BA): This generative random graph model follows the preferential attachment principle, adding k edges per node. The model generates graphs with power-law degree distribution and short distances (but low clustering).

Given a random graph instance G, the epidemic source is a node chosen uniformly at random and the epidemic process is simulated on G to generate the epidemic tree. Let r denote the epidemic source and τ the tree generated. Given G and τ , let node s = epi(G, τ ) denote its epicenter. Let o(s) denote the order in which node s was infected during the epidemic. Note that when o(s) = 1 the epicenter is the epidemic source.

Another important characteristic is the distance between the epicenter and the epidemic source on τ . Intuitively, the epicenter should be close to the epidemic source even if this node was infected rather late in the epidemic. In particular, the epicenter should be much closer to the epidemic source than most nodes on the tree. Let d(s) denote the hop distance on the tree τ between node s = epi(G, τ ) and node r, the epidemic source. Note that when d(s) = 0 the epicenter is the epidemic source.
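For illustration, the simulation step can be sketched in Python as follows. The epidemic model itself is defined in Sect. 3, which is not reproduced here, so this sketch rests on an assumption: at every step, an edge is chosen uniformly at random from the current edge cut (edges joining an infected and a non-infected node). The function name `simulate_epidemic` and the adjacency-dictionary input are also hypothetical, and the graph is assumed to be connected.

```python
import random

def simulate_epidemic(adj, rng):
    """Simulate one epidemic on a connected graph.

    Assumption: each infection uses an edge drawn uniformly at random
    from the current edge cut. Returns (source, infection order, tree),
    where tree is a list of directed edges (infector, infected).
    """
    source = rng.choice(list(adj))        # uniformly random epidemic source
    infected = {source}
    order = [source]
    tree = []
    # Candidate cut edges; entries may become stale once both ends are infected.
    cut = [(source, w) for w in adj[source]]
    while len(order) < len(adj):
        i = rng.randrange(len(cut))       # uniform draw, stale edges rejected
        u, w = cut[i]
        cut[i] = cut[-1]                  # swap-remove the drawn entry
        cut.pop()
        if w in infected:
            continue                      # stale edge: draw again
        infected.add(w)
        order.append(w)
        tree.append((u, w))
        cut.extend((w, x) for x in adj[w] if x not in infected)
    return source, order, tree

# One run on the complete graph with 5 nodes.
K5 = {u: [w for w in range(5) if w != u] for u in range(5)}
source, order, tree = simulate_epidemic(K5, random.Random(0))
```

Rejecting stale entries keeps the draw uniform over the edges that are still in the cut, and every edge is inserted into the candidate list at most once, so a run handles each graph edge a constant number of times.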
The following methodology is adopted to characterize o(s) and d(s). Consider R independent runs of the simulator, each generating a random graph instance and a random epidemic tree $\tau_j$, for $j = 1, \dots, R$. Let $f_i$ denote the fraction of runs in which the node returned by Algorithm 1 or 2 was the i-th infected node in the simulation, i.e., $f_i = \frac{1}{R} \sum_{j=1}^{R} \mathbb{1}(o(\mathrm{epi}(G, \tau_j)) = i)$, where $\mathbb{1}$ is the indicator function. Note that $f_1$ is the fraction of runs where the algorithm identified the epidemic source. Similarly, let $g_i$ denote the fraction of runs in which the node returned by Algorithm 1 or 2 was at distance i from the epidemic source on tree $\tau_j$, i.e., $g_i = \frac{1}{R} \sum_{j=1}^{R} \mathbb{1}(d(\mathrm{epi}(G, \tau_j)) = i)$. Note that $g_0$ is the fraction of runs where the algorithm identified the epidemic source, and thus $g_0 = f_1$.

In what follows, the graphs have size n = 1000 (with the exception of TO, which has n = 1024 nodes) and for all random graphs the average degree is d ∈ {6, 12}. Note that d determines the parameters for each model accordingly: in ER, p = d/1000; in WS, k = d/2 and p = 0.05; and in BA, k = d/2. Algorithm 2 was used on CO and TO (since they are vertex-transitive) and Algorithm 1 was used on the random graph models. In all scenarios, $R = 1.2 \times 10^5$ runs (or more). Thus, the standard error for the average accuracy reported in any scenario is always less than 0.0014, giving rise to very small confidence intervals that have been omitted from the results.

Figure 3 shows the fraction of runs in which the epicenter is the i-th infected node, where the left plot shows a restricted range. Interestingly, the trend is similar for all graph models: the values of $f_i$ decrease monotonically and fast with i, and $f_1$ (correct identification of the epidemic source) shows the highest value. However, $f_1$ greatly depends on the graph model, being highest for CO (at 31%) and lowest for BA (at 13%). Clearly, the power-law degree distribution of BA poses a challenge in identifying the epidemic source.
Recall that for CO the epicenter coincides with the MLE, so the accuracy of the MLE is also just 31%. For ER graphs, the accuracy of the epicenter is 24%, a relatively high value given their sparseness when compared with the complete graph.
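The empirical fractions $f_i$ and $g_i$ defined above, together with the standard error of an estimated accuracy, can be computed as in the following Python sketch. The function names and the toy data are hypothetical, and the standard error uses the usual formula for a binomial proportion, which is an assumption about how the reported bound was obtained.

```python
from collections import Counter
from math import sqrt

def empirical_fractions(values):
    """Map each observed value i to the fraction of runs in which it occurred."""
    counts = Counter(values)
    runs = len(values)
    return {i: c / runs for i, c in sorted(counts.items())}

def standard_error(p, runs):
    """Standard error of a proportion p estimated from independent runs."""
    return sqrt(p * (1 - p) / runs)

# Toy data for R = 4 runs: infection order o(s) and tree distance d(s)
# of the returned node in each run (hypothetical values).
orders = [1, 1, 2, 3]
distances = [0, 0, 1, 1]
f = empirical_fractions(orders)      # f[1] is the accuracy
g = empirical_fractions(distances)   # g[0] equals f[1]
print(f[1], g[0])                    # 0.5 0.5

# With R = 1.2e5 runs and an accuracy of 0.31, the standard error is
# below the bound of 0.0014 quoted in the text.
print(standard_error(0.31, 120_000) < 0.0014)  # True
```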

Infection order of the epicenter
It is curious that $f_1$ and $f_2$ are practically identical for all graph models. Indeed, for vertex-transitive graphs (such as CO and TO), given just the first edge of the epidemic tree, there is no information on which of the two nodes is the epidemic source. The epidemic trees that will be generated and hung on each of these two nodes are statistically equivalent. Thus, the algorithm returns each one of them with the same frequency. While this is not the case for general graphs, such information depends on the degrees of the nodes, and its connection to the epicenter should be further studied.
The right plot of Figure 3 also reveals that nodes infected late in the epidemic process are never identified as the epicenter. Again, this clearly depends on the graph model: for CO, only the first 15 infected nodes were ever identified as the epicenter, while for BA this number is around 75 (still very small compared to n = 1000). Moreover, the apparent straight line for each model shown in the semi-log plot indicates that $f_i$ has an exponential decay. Again, the decay rate (i.e., slope) depends on the graph model, with BA being the slowest.

Figure 4 shows the relative frequency of distances between the epidemic source and the epicenter (left) and other nodes (right). Recall that $g_0 = f_1$, since both correspond to the epicenter being the epidemic source. Interestingly, all graph models show a very similar trend: $g_1$ is much larger than $g_0$ and then $g_i$ decreases monotonically and fast (with the exception of BA, where the peak is at $g_2$). Thus, the epicenter is more likely to be a neighbor of the epidemic source (on the epidemic tree) than the epidemic source itself. This result follows from the fact that the epidemic source has at least one neighbor in the epidemic tree (and possibly more) and the second infected node is in this neighborhood.

Distance to the epicenter
The peak at $g_2$ for BA in Figure 4 is a consequence of its power-law degrees and the uniform choice of the epidemic source. In particular, the source is likely to be a low-degree node whose neighbor is a high-degree node, which in turn has many other neighbors. Thus, the epicenter is often a node that has a common neighbor with the epidemic source, and thus is at distance two.
The right plot of Figure 4 shows the distance between the epidemic source and all other nodes on the epidemic tree (empirical distribution). Interestingly, distances follow a bell-shaped curve independent of the graph model. However, the mode and variance of the distribution strongly depend on the graph model. In particular, note that CO and BA are quite similar and generate trees where nodes are closest to the epidemic source. In sharp contrast, distances are much larger for WS and even more so for TO (distribution for TO is not shown completely).

Comparison with distance center
While the epicenter of a spanning tree τ of a vertex-transitive graph G is equivalent to the distance center of τ , this is clearly not the case for arbitrary graphs. Figure 5 shows a comparison between the accuracy of the epicenter (given by $f_1$) and the distance center in identifying the epidemic source (computed on the exact same spanning trees generated for each run). Note that for all random graph models, the accuracy of the epicenter is significantly higher than that of the distance center. The relative improvement depends on the graph model and average degree: for d = 6, the epicenter is around 26% more accurate than the distance center for ER, but around 226% more accurate (3.26 times as accurate) for BA. Note that with an average degree d = 12 both estimators improve their accuracy (with respect to d = 6) and their relative difference becomes smaller. Indeed, as the density of the random graphs increases, the diversity of paths of short length also increases and the underlying network becomes less informative (recall the complete graph). In any case, the epicenter leverages the underlying graph in combination with τ to provide a better estimator for the epidemic source.

Beyond accuracy, it is interesting to compare the distances from the epicenter and from the distance center to the epidemic source, given that they have not identified the epidemic source. Table 1 shows the average distance for the epicenter and distance center (given they have not identified the source) as well as the average distance between the source and all nodes on the tree. The average results for the epicenter are supported by Figure 4. Interestingly, the average distance for the distance center is very similar to that for the epicenter, while much smaller than the average for the entire tree. Thus, given that the epicenter and distance center failed to identify the source, the nodes identified are at a similar distance to the epidemic source (on average).

Comparison on real networks
While random network models allow for a more principled evaluation of epidemic source estimators, real networks often exhibit structural features that are not captured by such models. Thus, two real social contact networks are considered, both publicly available. The Haslemere dataset is the result of a three-day experiment with 469 volunteers in the town of Haslemere that were tracked continuously using a mobile phone app [12]. The dataset has also been used to evaluate localized COVID-19 control strategies [7]. In the network generated from the raw dataset, an edge between two individuals is present if they were closer than 12 m in any time step (this threshold was required to obtain a connected graph). This network has 449 nodes and an average degree of 9.3. The US High School dataset is a high-resolution (in time and space) recording of contacts between 788 individuals during a typical day at an American high school [22]. In the network generated from the raw dataset, an edge between two individuals is present if they were together (less than 3 m apart) for more than 5 minutes (this time threshold was required to obtain a connected graph). This network has 786 nodes and an average degree of 51.8.

Figure 6 shows a comparison between the accuracy of the epicenter (as predicted by Algorithm 1) and the distance center in identifying the epidemic source (computed on the exact same spanning trees generated for each run, using $f_1$ and $R = 10^3$ runs). For the Haslemere network, the epicenter is 54% more accurate than the distance center (17.8% versus 11.5% accuracy, respectively). For the US High School network, the epicenter is only 13% more accurate. While the US High School network is larger, it is also significantly denser (the average degree is five times larger). Interestingly, as with the random graph models (Fig. 5), the relative superiority of the epicenter over the distance center seems to decrease as the average degree increases.

Convergence of epicenter accuracy
While the previous results characterize the epicenter for a fixed graph size of n = 1000, it is interesting to consider its dependence on n. Growing n in random graph models generally requires scaling the average degree, and such a choice has a fundamental impact on the graph structure. For example, if the average degree is kept constant as n grows in the ER model, then the graph is almost surely not connected as n grows. However, if the average degree grows as Ω(log n), with a sufficiently large constant, then the graph is almost surely connected. To avoid this difficulty, only CO and TO are considered in the following evaluation since their degree is a fixed function of n (i.e., n − 1 for CO and 4 for TO).

Figure 7 shows the accuracy of the epicenter ($f_1$) as a function of n for CO and TO (x-axis in log-scale). Interestingly, the accuracy for CO is larger for smaller graphs but converges relatively quickly to about 0.308. Thus, larger graphs and epidemic trees do not provide additional structural information to increase the accuracy of the epicenter. On the positive side, they also do not diminish the accuracy of the epicenter! However, the story is quite different for TO (also in Figure 7), since the accuracy decreases monotonically with n. In this case, larger graphs and epidemic trees reduce the accuracy of the epicenter. A key distinctive feature with respect to CO is the distances, both on the graph and on the epidemic tree. While all distances on CO are 1, in TO they scale with $\sqrt{n}$ (recall that the graph is a square lattice with n nodes). Moreover, Figure 4 (left plot) and Table 1 indicate that distances from the source on the epidemic tree are significantly smaller for CO, and in particular the average distance for TO is three times larger (see Tab. 1). Thus, since the epicenter is determined by the sum of distance differences and such distances increase with n, the estimator seems to become less accurate as distances increase.
The dependence of epicenter accuracy on the underlying graph size and structure is an interesting and challenging question. On a vertex-transitive graph of size n, in light of Proposition 5.2, the probability of correct detection is equal to the probability that every subtree dangling from a neighbor of the epidemic source has size less than n/2. Specifically, for a complete graph of size n, the epicenter accuracy corresponds to the probability that in the Chinese restaurant process at time n all occupied tables have fewer than n/2 people (each table corresponds to a subtree hanging from the epidemic source). In any case, it can be argued that the epicenter accuracy converges, as n tends to infinity, to 1 − ln 2 ≈ 0.307, in accordance with the simulation results shown in Figure 7.
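The limit 1 − ln 2 can be illustrated with a short Monte Carlo experiment in Python. The sketch below is not the simulator used for Figure 7: it grows a uniform random recursive tree (each new node attaches to a uniformly chosen earlier node, the growth rule described for the complete graph) and checks whether every branch hanging from the root has fewer than n/2 nodes. The function names are hypothetical, and a branch of size exactly n/2 is counted as a failure, which is an assumption about tie breaking. The helper `exact_accuracy` uses the fact that the expected number of branches of size k is 1/k and that at most one branch can reach size n/2.

```python
import math
import random

def source_is_detected(n, rng):
    """Grow a random recursive tree on nodes 0..n-1 (node 0 is the source)
    and return True if every branch at the source has fewer than n/2 nodes."""
    head = [0] * n            # head[t]: neighbor of the source leading to t
    size = {}                 # branch head -> number of nodes in the branch
    for t in range(1, n):
        parent = rng.randrange(t)             # uniform over nodes 0..t-1
        head[t] = t if parent == 0 else head[parent]
        size[head[t]] = size.get(head[t], 0) + 1
    return all(2 * s < n for s in size.values())

def estimate_accuracy(n, runs, seed=0):
    """Fraction of runs in which the source satisfies the n/2 condition."""
    rng = random.Random(seed)
    return sum(source_is_detected(n, rng) for _ in range(runs)) / runs

def exact_accuracy(n):
    """1 minus the probability that some branch holds at least n/2 nodes."""
    k_min = (n + 1) // 2      # smallest integer k with 2k >= n
    return 1 - sum(1 / k for k in range(k_min, n))

limit = 1 - math.log(2)       # approximately 0.3069
print(estimate_accuracy(200, 4000, seed=1), exact_accuracy(200), limit)
```

The estimate fluctuates with the seed and the number of runs, while `exact_accuracy(n)` approaches `limit` as n grows.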
The asymptotic result above is related to the accuracy of the rumor center on infinite d-regular trees, which also converges to 1 − ln 2 as d tends to infinity [24]. Recall that in this model, for any fixed d, the probability of correct detection is computed when the time (in our case n) goes to infinity. This is possible since the d-regular tree is infinite. In our case, since the graph is finite, the epidemic stops at time n (the size of the graph). Despite a technical issue (related to the double limit in [24], first letting time go to infinity and then d, while in our case there is only one limit in n), there is an intuitive explanation for why these two results are related. In particular, for any given n, one can find a $d_0 = d_0(n)$ sufficiently large such that the epidemic processes up to time n on a complete graph of size n and on a $d_0$-regular tree both essentially evolve according to a uniform random recursive tree; i.e., for every t ≤ n the node infected at time t connects uniformly at random to one of the already infected nodes in the tree.
For a torus of size n, the probability of correct detection also corresponds to the probability that every subtree dangling from a neighbor of the epidemic source has size less than n/2. However, in this case, the comparison with the Chinese restaurant process does not hold. On the one hand, the epidemic source has at most four neighbors (thus the number of possible tables is bounded) and, on the other hand, the probability of a node infected at time t ≤ n to be connected to one of these subtrees need not be proportional to the subtree size. In particular, this probability might even be zero, since every node has at most four neighbors and all neighboring nodes of one subtree might have already been infected by others. The behavior observed for the torus (TO) in Figure 7, where the accuracy decreases monotonically as n grows, might be explained by the fact that one of the four possible dangling subtrees of the epidemic source will eventually dominate the others, and therefore for n sufficiently large its size will be larger than n/2. Establishing the limiting accuracy of the epicenter on the torus and other graphs is an open question and the subject of future investigation.

Conclusion
The problem of identifying the epidemic source by partially observing a network epidemic is quite fundamental and has been explored over the last decade both from a practical and theoretical perspective. While most works assume that infected nodes are observed (or partially observed) at a given point in time during the epidemic, this work observes the epidemic after it terminates. Note that in this scenario, observing the set of infected nodes provides no information, as this is simply the set of nodes of the graph. Thus, the observation here is the undirected and unrooted tree (a spanning tree of the graph) that encodes the edges responsible for infections, but not their direction (who infected whom).
The proposed epicenter is an estimator that leverages both the epidemic tree and the graph to estimate the epidemic source (and thus the direction of all edges) by identifying the node that better aligns the graph and the tree in terms of their distances to corresponding nodes. The epicenter can be computed in time O(nm) in general, and Θ(n) when the graph is vertex-transitive. Moreover, the epicenter and distance center (of the epidemic tree) coincide for vertex-transitive graphs. However, numerical simulations indicated that the epicenter is always more accurate than the distance center (for all random graph models considered), with larger relative improvements for sparse and power-law graphs.
Last, numerical results strongly suggest that when the epicenter (and distance center) makes a wrong prediction, the predicted node is often close (in the tree) to the epidemic source. Can a new estimator that leverages the epicenter be designed to correct for such mistakes? This seems possible, opening the door to further explorations.

Appendix A. Proof sketch of Lemma 4.2
Given the rooted tree $\tau_v$, we identify $d_v = |N_v|$ rooted subtrees $\tau^v_u$, with $u \in N_v$. Consider a rooted sequence which generates $\tau_v$. Assume for the moment that such an edge sequence corresponds to building one of the $d_v$ branches of $\tau_v$ entirely before moving to a different branch and so on until all branches have been generated. The number of possible ways to do that is accounted for in the first term of equation (4.2), i.e., $\prod_{u \in N_v} |B(\tau^v_u)|$ (the subtrees $\tau^v_u$, with $u \in N_v$, are disjoint, thus the product). Note that, for every $u \in N_v$, the size of every rooted edge sequence which generates $\tau^v_u$ is $|\tau^v_u| - 1$, and $\sum_{u \in N_v} |\tau^v_u| = |\tau_v| - 1$. Thus, when concatenating a rooted edge sequence for each $\tau^v_u$ we obtain an edge sequence of size $|\tau_v| - 1 - d_v$, despite a rooted sequence which generates $\tau_v$ having size $|\tau_v| - 1$. However, given that for every $u \in N_v$ there is only one edge connecting v to u, a concatenation of the $d_v$ sequences can be modified in a unique manner to give rise to a rooted edge sequence which generates $\tau_v$, namely by inserting the edge $\{v, u\}$ right before the corresponding edge sequence generating $\tau^v_u$.
Note that a rooted edge sequence which generates $\tau_v$ does not necessarily correspond to a concatenation of $d_v$ rooted sequences corresponding to the $d_v$ different branches. In particular, a rooted edge sequence which generates $\tau_v$ may correspond to alternating between the edges of the rooted sequences generating the different branches. In order to account for the latter, we need to compute the number of different ways a given concatenation of the rooted sequences generating the subtrees $\tau^v_u$, with $u \in N_v$, can be rearranged to give rise to different rooted edge sequences which generate $\tau_v$. The second factor in equation (4.2), i.e., $\left(\sum_{u \in N_v} |\tau^v_u|\right)! \big/ \prod_{u \in N_v} |\tau^v_u|!$, accounts for this number.
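The counting argument above can be checked on a small tree with the following Python sketch. The recursion multiplies the counts of the branches by the number of interleavings, as in the two factors discussed for equation (4.2); the function names, the adjacency-dictionary input, and the brute-force check are assumptions made for illustration.

```python
from itertools import permutations
from math import factorial, prod

def rooted_children(adj, root):
    """Children lists of the tree given by adj when rooted at root."""
    children = {v: [] for v in adj}
    seen, stack = {root}, [root]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                children[u].append(w)
                stack.append(w)
    return children

def count_sequences(children, v):
    """Return (number of rooted edge sequences, subtree size) for node v."""
    counts, sizes = [], []
    for u in children[v]:
        b, s = count_sequences(children, u)
        counts.append(b)
        sizes.append(s)
    # Number of ways to interleave the branches: a multinomial coefficient.
    interleavings = factorial(sum(sizes)) // prod(factorial(s) for s in sizes)
    return prod(counts) * interleavings, 1 + sum(sizes)

def brute_force(adj, root):
    """Count edge orderings in which every edge infects exactly one new node."""
    edges = [(u, w) for u in adj for w in adj[u] if u < w]
    total = 0
    for seq in permutations(edges):
        infected = {root}
        for u, w in seq:
            if (u in infected) == (w in infected):
                break                 # edge does not cross the cut
            infected.update((u, w))
        else:
            total += 1
    return total

# Example tree with edges 0-1, 1-2, 1-3, 3-4.
tree = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1, 4], 4: [3]}
for v in tree:
    print(v, count_sequences(rooted_children(tree, v), v)[0], brute_force(tree, v))
```

For this example, rooting at node 1 gives 12 sequences and rooting at node 0 gives 3, and the recursion agrees with the exhaustive enumeration for every root. The brute-force check enumerates all edge orderings and is only practical for very small trees.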