One of the most pressing challenges in genomics is to reconstruct a long and contiguous DNA sequence from short DNA subsequences (contigs). Very recent developments show that one can get linkage information that is statistically correlated with the true contig ordering. We observe many links between neighboring pairs of contigs, while relatively few links between non-neighboring
pairs of contigs.
If we represent contigs as vertices and the observed number of links between two contigs as the edge weight, then the problem of recovering the contig ordering reduces to a Travelling Salesman Problem (TSP) in a weighted complete graph with a hidden TSP tour. We assume edges on the hidden TSP tour have independent Gaussian weights with mean μ and variance 1, while all the other edges have independent standard Gaussian weights. We show that when μ² > 4 log n, a simple linear programming relaxation of TSP, namely fractional 2-factor LP, exactly recovers the hidden TSP tour with probability tending to one as n → ∞; when μ² < 4 log n, all algorithms fail in exact recovery with probability tending to one. These two results together prove that the fractional 2-factor LP achieves the information-theoretic exact recovery threshold with the sharp constant.
This work is at the intersection of optimization, statistics, and information theory. Based on joint work with Vivek K. Bagaria and David Tse from Stanford, and Yihong Wu from Yale.