## An elementary proof of a key lemma in Shor's quantum factoring algorithm

### Introduction

 The quantum factoring algorithm of Shor depends on a crucial lemma involving the behavior of DFT's of periodic vectors. This step in the derivation is advanced, and is a difficult and deep line of reasoning. The excellent presentation of quantum computing by Vazirani notes: "the quantum Fourier transform can detect the period of a periodic vector even if it does not divide M. But the derivation is not as clean as in the case when the period does divide M, so we shall not go any further into this."

 The point of this note is to provide an alternative proof of the lemma that is based on elementary mathematics, and is relatively easy to understand.

### The period finding lemma for DFT's

 We will find it useful to consider a particularly simple type of vector in CN, which we will call a "comb vector". Informally, a comb vector is a vector with a sequence of evenly spaced 1's, and zeroes everywhere else:

```     [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]'

[0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]'

[0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]'

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]'
```

 Definition: A "comb vector" V in CN with period p ≤ N and offset M < p is a vector of 1's and 0's with a value of 1 at each position

```    i = M + k * p
```

 such that 0 ≤ i < N for some integer k, and 0 elsewhere.

 (The symbol "*" will be used to denote standard scalar and matrix multiplication.)

 We define "DFT(V)" here as "dftMatrix * V", where

```    dftMatrix[i,j] = e-i*j * 2 I π/N / sqrt(N).
```

 In what follows we will use a convenient notation "θN" as a shorthand for "2 * π / N", the angle obtained by dividing a full circle into N equal angles. The above definition is then expressed as

```    dftMatrix[i,j] = e-i*j * I θN / sqrt(N).
```

 Definition: "Prob(V)" is a function that takes a non-zero vector V in CN and normalizes it: Each coordinate of V is multiplied by its own conjugate, giving the square of the length of that coordinate, and the resulting vector is divided by its length. This gives a vector of non-negative real values that sum to 1.0. Such a vector can be interpreted as defining a discrete probability distribution.

 Theorem: Consider a comb vector V in CN with period p and offset m. (Assume N is a power of 2.)

 Form the vector P by taking the discrete Fourier transform of V and forming probabilities:

```    P = Prob( DFT(V) )
```

 Then the probabilities in the vector P are concentrated around p locations:

```    P, P[round(N/p)], P[round(2 * (N/p)] .. P[round((p-1) * (N/p))].
```

 Specifically, each of these p locations has a value greater than 0.4 / p.

 It is important to note that even if a comb vector has a non-zero initial offset (i.e., [ 0, 0, .. , 1, 0 .. ]'), the DFT of this vector has its p equally spaced regions of high weight "shifted" to the left, and the first one is at position zero of the output vector.

### Implications of the theorem

 Let P be Prob(DFT(V)) as above. A few observations:

 - The probability vector P has p separate coordinates with large probability; together, these coordinates have probability at least 0.4.

 - The first coordinate of P is one of those with large probability, and the gaps between coordinates with high probability are approximately N/p.

 - If one of the expressions i * (N/p) is an integer, then P[ i * (N/p) ] has a probability of 1/p. This is the value it would have if all of the probability were uniformly concentrated on p locations.

 - If one of the expressions i * (N/p) defining the locations of high probability is exactly half way between two integers, then the two adjacent coordinates bracketing i * (N/p) each have probability at least 0.4/p. So, that local region of P has a probability of at least 0.8/p.

 - It turns out that every one of the expressions i * (N/p) that is not an integer is bracketed on either side by coordinates that add up to at least 0.8/p.

 - Consequently, at least 80% of all of the probability in the entire vector P is concentrated around p evenly spaced locations starting at coordinate 0.

 (The simplest proof of this last assertion that I can come up with is substantially more complicated than the proof of the less powerful theorem proved here.)

### Description of the central idea

 Before diving into the technical details of the proof, here is a synopsis of the essential idea.

 Consider a comb vector V in CN with period p and offset m:

```    [0, .. 0, 1, 0, .. 0, .. 0,
0, .. 0, 1, 0, .. 0, .. 0,
..
0, .. 0, 1, 0, .. 0]'.
```

 (We make no assumption that p evenly divides N.)

We will decompose V into a particularly useful set of p vectors V0, .. Vp-1 that exactly sum to V.

 Moreover, we will see that these vectors form an orthogonal basis for a p-dimensional subspace of CN that contains all periodic vectors with a period of p.

 Each of the p vectors, when passed through the DFT process, will produce one of the p local regions of high probability in the output.

 The special vectors will use what we will refer to as the Ω vector, raised to certain scalar powers.

 Ω is simply the second column of the positive Discrete Fourier Transform matrix:

 Definition: "Ω" is a vector in CN with the following coordinates:

```      [e( 0   * I θN),
e( 1   * I θN),
..
e(N-1) * I θN)]
```

 Or, if you prefer, Ω is composed of the N'th roots of unity in order starting from 1 and going counter-clockwise.

 Definition: Ωr, for real scalar r, is the Ω vector with each coordinate raised to the r'th power.

 Note: For complex exponential operations, we use the positive real axis as the seam. To raise a complex number eiθ to a real power r, first add or subtract an integral multiple of 2π to get θ' in the range [0 .. 2 π). Then, multiply θ' by r:

```    exp(eiθ, r) = eiθ' * r
```

 Raising a complex value to a fraction between zero and one in this way always rotates it clockwise on the complex plane.

 The vectors Ωk, k in [0 .. n-1], are the columns of the positive DFT matrix. We will be particularly interested in vectors "between" the columns on the DFT matrix, i.e., Ωr for non-integer powers r.

 Here are the promised p vectors that add up to V:

```    c0 Ω0 * N/p
c1 Ω1 * N/p
c2 Ω2 * N/p
..
cp-1 Ω(p-1) * N/p
```

 The values ci are complex constants to normalize the vector and take into account the offset m of the comb vector:

```    ci = e-m * 2 I π/N  / p
```

 That is to say, every comb vector with period p and arbitrary offset m is in the p-dimensional linear subspace spanned by the p vectors

```    Ωi * N/p, i in [0 .. p-1].
```

 whether or not p evenly divides N.

 (The above key assertion of course requires a proof, which we will tackle below.)

 It is of course also true that all scalar multiples of a comb vector with period p, and all sums of comb vectors with possibly varying offsets but the same period p, are in the above-mentioned p-dimensional linear subspace of CN.

 Finally, it will be shown that for all real r,

```    Prob( DFT(Ωr) )
```

 has a probability of at least 0.4 at coordinate round(r) mod N.

### An Example

 As an example, consider the vector in C256 with a period of 10, from the classic work of Shor (see  for example):

```    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
<23 more repetitions>
1, 0, 0, 0, 0]'
```

 The following ten vectors exactly add up to the given comb vector.

```     [1, 1, ... 1]' / 10
Ω25.6  / 10
Ω51.2  / 10
Ω76.8  / 10
Ω102.4  / 10
Ω128.0  / 10
Ω153.6  / 10
Ω179.2  / 10
Ω204.8  / 10
Ω230.4  / 10
```

 Moreover, each of the 10 vectors contributes one localized region of high probability to the DFT result:

 vector index 1 index 2 prob 1 prob 2 total probability DFT([1, 1, ... 1]') 0 - 1.0 - 1.0 DFT(Ω25.6) 25 26 0.2545 0.5727 0.8272 DFT(Ω51.2) 51 52 0.8751 0.0646 0.9297 DFT(Ω76.8) 76 77 0.0646 0.8751 0.9297 DFT(Ω102.4) 102 103 0.5727 0.2545 0.8272 DFT(Ω128.0) 128 - 1.0 - 1.0 DFT(Ω153.6) 153 154 0.2545 0.5727 0.8272 DFT(Ω179.2) 179 180 0.8751 0.0646 0.9297 DFT(Ω204.8) 204 205 0.0646 0.8751 0.9297 DFT(Ω230.4) 230 231 0.5727 0.2545 0.8272

### Summary of the proof steps

 For the sequence of lemmas that lead to the final desired result, we will consider a single one of our p vectors of the form Ωr.

 We will closely examine the behavior of this vector for values of r in [-0.5 .. 0.5]. It will then be a simple matter to extend to all other real values for r.

 The first step will be to show that the first coordinate of Prob(DFT(Ω0.5)) has a value greater than 0.4. (This and the following results all refer to vectors in CN for N a power of 2.)

 Second, we will show that the first coordinate of Prob(DFT(Ωr)) for r in [-0.5 .. 0.5] is minimized at r = +/- 0.5.

 These facts together will allow us to conclude that the first coordinate of Prob(DFT(Ωr)) is greater than 0.4 for all r in [-0.5 .. 0.5].

 Then, using the shifting theorem we will observe that Prob(DFT(Ωk+r)) has probability greater than 0.4 in coordinate position "k mod N" for all integer k and real r in [-0.5 .. 0.5].

 The result then follows immediately: Given the sum of a collection of p "Ωr" vectors that add up to a comb vector V, taking their DFT's individually, and then summing them, produces the desired output vector with p equally spaced local regions of high probability. The distributivity property then gives us the desired final result.

### The gory details

 Lemma: The set of all periodic vectors in CN with period p ≤ N is contained in a p-dimensional linear subspace Sp of CN.

 Proof: Let M be any full-dimensional p x p matrix. Form an N x p matrix S out of the rows of M: the i'th row of S is the (i mod p)'th row of M. Every periodic vector P with period p is a linear combination of the columns of S.

 To see this, consider vector p1 that contains the first p elements of P. This p-dimensional vector is a linear combination of the columns of M, since M is a full-dimensional square p x p matrix: p1 = M p1', where p1' = M-1 p1. Our original vector P is simply S p1', since the i'th element of P equals the (i mod p)'th element of P and the i'th row of S equals the (i mod p)'th row of M by construction. QED.

 Intuitively, we take the p x p matrix M, and stack copies of it vertically to make the N x p matrix S. (If p does not divide N, the last copy of M will not have all of its rows.)

 Lemma: A basis for Sp is the following set of p vectors based on Ω:

```    Ω0 * (N/p)
Ω1 * (N/p)
Ω2 * (N/p)
...
Ω(p-1) * (N/p)
```

 Proof: Take M in the above proof to be the p x p positive DFT matrix for CP. The columns of the n x p matrix S as formed above are the desired p vectors. Consider coordinate j of vector k from the above list:

```    ωj * k * (N/p) i 2 π / N = ωj * k * i 2 π / p = ω(j mod p) * k * i 2 π / p
```

 The last step above is due to the fact that f(m) = ωm i 2 π / p is a periodic function with period p. Removing a multiple of p simply means that we go around the unit circle fewer times before ending up in the same place.

 On the other hand, S[j, k] = ω(j mod p) * k * i 2 π / p by construction. QED.

 In the above lemma, the "(N/p)" term in the exponent changes the scalar terms of Ω from powers of ω2 π i / N to powers of ω2 π i / p. This changes the stepsize or angle at which the elements of Ω step around the unit circle.

 Lemma: Given a comb vector V in CN with period p, the following p vectors sum to V:

```    Ω0 * (N/p) / p
Ω1 * (N/p) / p
Ω2 * (N/p) / p
...
Ω(p-1) * (N/p) / p
```

Proof: The vectors are simply the basis vectors from the previous lemma. The constants 1/p are the values of the vector V re-expressed in the new basis. QED.

 Lemma: Consider a vector V = Ω0.5 in CN, with N >= 1 and N a power of 2. The first coordinate of prob(DFT(V)) is greater than 0.4.

 Proof: The first row of dftMatrix is [1, .. 1]'/sqrt(N), so the first element of DFT(V) is ΣVi/sqrt(N). V has length sqrt(N), so V/sqrt(N) is a vector of unit length. dftMatrix is orthonormal, so "dftMatrix * V/sqrt(N)" is also of unit length. The first coordinate of this quantity is just ΣVi/N, the average of the coordinates of V.

 The vector prob(V) has elements that are the squares of the lengths of the coordinates of V, divided by the length of V. The length of "dftMatrix V/sqrt(N)" is 1, so the first component of prob(DFV(V/sqrt(N))) is just the square of the length of the average of the coordinates.

 It is convenient to consider the average of the coordinates of V/sqrt(N) instead of the square of the length, so we will prove that the average is greater than sqrt(0.4).

 We proceed by induction on log2(N). For the base case, the claimed result can be directly confirmed for powers of 2 from 1 through 16.

 For N=17, the average of the elements of V* is slightly larger than 0.635, which is greater than sqrt(0.4).

 Now, for the inductive case. We will consider the vector [V', -1]'. This vector has coordinates that are symmetric about the imaginary axis on the complex plane, so their sum is on the imaginary axis. (We will refer below to the "height" of vectors on the complex plane, meaning their distance above the real axis.)

 Moreover, the height of the average of [V', -1]' for N = 16 is about 0.635, which is larger than sqrt(0.4).

 The average of the coordinates of [V', -1]' is less than the average of the coordinates of V for all N: Since the sum of the coordinates of [V, -1]' is on the imaginary axis, and the sum of the coordinates of V is obtained by adding 1 (as a horizontal vector on the complex plane), we have a right triangle. (ΣV) - 1 is a leg and ΣV is the hypotenuse, so the latter must be longer. (N=1 is a trivial special case.)

 The height of the average of the coordinates of [V', -1]' grows monotonically as the dimension N goes up: Going from N to 2N, the principal angle of N is divided by two to get the principal angle of 2N, the number of coordinates doubles, and both vectors of coordinates start at the horizontal vector 1 in C. So, every arc of [V2n', -1]' is an arc of [V', 1]'. Each coordinate except I in [V, 1]' also has a new adjacent coordinate in [V2n-1', -1]' that has a larger height. So, the sum of the elements of [V2n', -1]' has a larger height than the sum of the elements of [V', -1]'.
QED.

 Lemma: Consider a vector V = Ωr, r in [-0.5 .. 0.5]. The first coordinate of prob(DFT(V)) is minimized when r = -0.5 or r = 0.5.

 Proof: Let f(r) be the first element of "W * conjugate(W) / length(W)", where W = DFT(V).

 This is the function we want to show is at a minimum for r = 0.5. For convenience we will ignore the constant in the denominator.

 To get f(r), we left-multiply "Ωr" by dftMatrix and consider the first element. We multiply this value by its conjugate.

 Since the first row of dftMatrix is [1, 1, .. 1], we have:

```    f(r) =                    Σ(Ωr) * conjugate(Σ(Ωr).
```

 We are interested in the value of r in [-0.5 .. 0.5] that minimizes the above quantity.

 The conjugate of a sum is the sum of the conjugates. So, the above is

```      Σ(V) * Σ(conjugate(V)).
```

 This gives us N2 elements to consider. A typical element is

```  e(r * i * I θN * i) * e(r * (-j) * I θN).
```

 The above quantity is just

```  e(r * (i-j) * I θN).
```

 In the above expression, i goes from 0 to N-1, and so does j. So, for i != j the above expressions come in conjugate pairs.

 A typical pair is

```    e(r * (i-j) * I θN)
+ e(r * (j-i) * I θN).
```

 Without loss of generality, assume i >= j. (The terms come in pairs with "i-j" and "j-i".)

 From Euler's formula, The above is equal to

```    cos(  r * (i-j) * θN)  + I sin(  r * (i-j) * θN
+ cos(-(r * (i-j) * θN)) + I sin(-(r * (i-j) * θN)

=

cos(  r * (i-j) * θN)  + I sin(  r * (i-j) * I θN)
+ cos( (r * (i-j) * θN)) - I sin( (r * (i-j) * I θN)

= 2 cos(r * (i-j) * θN).
```

 For i >= j, (i-j) is between 0 and (N-1). r is in [-.5 .. .5]. So the argument of cos() above ranges from

```    -0.5 * θN * (N-1) to 0.5 * θN * (N-1).
```

 Since by definition θN = 2 * π / N, the above quantity is less than π in absolute value.

 This value ranges from a high of cos(0), i.e., 1 for r = 0, to a minimum of

```    cos(+/-0.5 * θN * (N-1)) for r = -0.5 or 0.5.
```

 Since f(r) is a sum of N * (N-1) / 2 conjugate pairs and N constant values (those cases where i = j), and each component is at a minimum for r = +/-0.5, and hence is a minimum at r = +/-0.5.
QED.

 We made the simplifying assumption that our input comb vector has a non-zero element in its first slot. The shifting theorem allows us to relax that assumption.

 Shifting Theorem: For a vector V and an arbitrary integer k, let Vk be V circularly rotated upward k slots. Then DFT(Vk) is DFT(V) * Ωk, where '*' is element-by-element multiplication of two vectors.

 The elements of the DFT of an input vector that is circularly shifted therefore do not change their lengths, since each one is multiplied by a vector of length 1. So, Prob( DFT(Xk) ) does not change for any circular shift k of an input vector X.

 Finally,

```    DFT(V) = DFT(V0 + .. + Vp-1) = DFT(V0) + .. + DFT(Vk-1).
```

 The latter is a sum of vectors each of which has a local region of high probability, positioned around index "i * N/p".

### Conclusion

 The idea of decomposing a comb vector V into p constituent vectors each of which contributes one of the p local regions of high probability in the DFT of V results in a conceptually simple proof of an important lemma in the quantum factoring algorithm.

### References

  Shor, Peter W., Polynomial-Time Algorithms for Prime Factorization and Discrete Logarithms on a Quantum Computer, arXiv:quant-ph/9508027v2, 25 January 1996.

  Dasgupta, S., C.H. Papadimitriou, and U.V. Vazirani, Algorithms, Chapter 10, 2006.

  https://en.wikipedia.org/wiki/Discrete_Fourier_transform#Shift_theorem