An elementary proof of a key lemma in Shor's quantum factoring algorithm

Introduction
The quantum factoring algorithm[1] of Shor depends on a crucial lemma
involving the behavior of DFT's of periodic vectors. This step in the
derivation is advanced, and is a difficult and deep line of reasoning.
The excellent presentation of quantum computing by Vazirani[2] notes:
"the quantum Fourier transform can detect the period of a periodic vector
even if it does not divide M. But the derivation is not as clean as in
the case when the period does divide M, so we shall not go any further
into this."

The point of this note is to provide an alternative proof of the lemma
that is based on elementary mathematics, and is relatively easy to
understand.

The period finding lemma for DFT's
We will find it useful to consider a particularly simple type of vector in
C^{N}, which we will call a "comb vector". Informally, a comb vector
is a vector with a sequence of evenly spaced 1's, and zeroes everywhere else:

[1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]'
[0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]'
[0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]'
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]'
Definition: A "comb vector" V in C^{N}
with period p ≤ N and offset M < p is
a vector of 1's and 0's
with a value of 1 at each position

i = M + k * p
such that 0 ≤ i < N for some integer k, and 0 elsewhere.

(The symbol "*" will be used to denote standard scalar and matrix multiplication.)

We define "DFT(V)" here as "dftMatrix * V", where

dftMatrix[i,j] = e^{i*j * 2 I π/N} / sqrt(N).
In what follows we will use a convenient notation "θ_{N}" as
a shorthand for "2 * π / N", the angle obtained by dividing a full circle into
N equal angles.
The above definition is then expressed as

dftMatrix[i,j] = e^{i*j * I θN} / sqrt(N).
Definition: "Prob(V)" is a function that takes a nonzero vector V in
C^{N} and normalizes it: Each coordinate of V is multiplied by its own
conjugate, giving the square of the length of that coordinate, and
the resulting vector is divided by its length. This gives a vector
of nonnegative real values that sum to 1.0. Such a vector can be
interpreted as defining a discrete probability distribution.

Theorem: Consider a comb vector V in C^{N} with period p and offset m.
(Assume N is a power of 2.)

Form the vector P by taking the discrete Fourier transform of V
and forming probabilities:

P = Prob( DFT(V) )
Then the probabilities in the vector P are concentrated around p locations:

P[0], P[round(N/p)], P[round(2 * (N/p)] .. P[round((p1) * (N/p))].
Specifically, each of these p locations has a value greater than 0.4 / p.

It is important to note that even if a comb vector has a nonzero
initial offset (i.e., [ 0, 0, .. , 1, 0 .. ]'), the DFT of this vector
has its p equally spaced regions of high weight "shifted" to the left, and the
first one is at position zero of the output vector.

Implications of the theorem
Let P be Prob(DFT(V)) as above. A few observations:

 The probability vector P has p separate coordinates with large
probability; together, these coordinates have probability at least 0.4.

 The first coordinate of P is one of those with large probability, and
the gaps between coordinates with high probability are approximately
N/p.

 If one of the expressions i * (N/p) is an integer, then P[ i * (N/p) ] has
a probability of 1/p. This is the value it would
have if all of the probability were uniformly concentrated on p locations.

 If one of the expressions i * (N/p) defining the locations of high probability is
exactly half way between two integers, then the two adjacent coordinates
bracketing i * (N/p) each have probability at least 0.4/p. So,
that local region of P has a probability of at least 0.8/p.

 It turns out that every one of the expressions i * (N/p) that is not an integer
is bracketed on either side by coordinates that add up to at least 0.8/p.

 Consequently, at least 80% of all of the probability in the entire vector P is
concentrated around p evenly spaced locations starting at coordinate 0.

(The simplest proof of this last assertion that I can come up with
is substantially more complicated than the proof of the less powerful
theorem proved here.)

Description of the central idea
Before diving into the technical details of the proof, here is a synopsis
of the essential idea.

Consider a comb vector V in C^{N} with period p and offset m:

[0, .. 0, 1, 0, .. 0, .. 0,
0, .. 0, 1, 0, .. 0, .. 0,
..
0, .. 0, 1, 0, .. 0]'.
(We make no assumption that p evenly divides N.)

We will decompose V into a particularly useful set of p vectors
V_{0}, .. V_{p1} that exactly sum to
V.
Moreover, we will see that these vectors form an orthogonal basis for
a pdimensional subspace of C^{N} that contains all periodic
vectors with a period of p.

Each of the p vectors, when passed through the DFT process, will produce
one of the p local regions of high probability in the output.

The special vectors will use what we will refer to as the Ω vector,
raised to certain scalar powers.

Ω is simply the second column of the positive Discrete Fourier Transform matrix:

Definition: "Ω" is a vector in C^{N} with the following coordinates:

[e^{( 0 * I θN)},
e^{( 1 * I θN)},
..
e^{(N1) * I θN)}]
Or, if you prefer, Ω is composed of the N'th roots of unity in
order starting from 1 and going counterclockwise.

Definition: Ω^{r}, for real scalar r, is the Ω vector with
each coordinate raised to the r'th power.

Note: For complex exponential operations, we use the positive real axis as the
seam. To raise a complex number e^{iθ} to a real power r,
first add or subtract an integral multiple of 2π to get θ' in the
range [0 .. 2 π). Then, multiply θ' by r:

exp(e^{iθ}, r) = e^{iθ' * r}
Raising a complex value to a fraction between zero
and one in this way always rotates it clockwise on the complex plane.

The vectors Ω^{k}, k in [0 .. n1], are the columns of the positive DFT matrix.
We will be particularly interested in vectors "between" the columns on the DFT matrix,
i.e., Ω^{r} for noninteger powers r.

Here are the promised p vectors that add up to V:

c_{0} Ω^{0 * N/p}
c_{1} Ω^{1 * N/p}
c_{2} Ω^{2 * N/p}
..
c_{p1} Ω^{(p1) * N/p}
The values c_{i} are complex constants to normalize the vector and
take into account the offset m of the comb vector:

c_{i} = e^{m * 2 I π/N } / p
That is to say, every comb vector with period p and arbitrary offset m is in
the pdimensional linear subspace spanned by the p vectors

Ω^{i * N/p}, i in [0 .. p1].
whether or not p evenly divides N.

(The above key assertion of course requires a proof, which we will
tackle below.)

It is of course also true that all scalar multiples of a comb vector
with period p, and all sums of comb vectors with possibly varying offsets but the
same period p, are in the abovementioned pdimensional linear subspace
of C^{N}.

Finally, it will be shown that for all real r,

Prob( DFT(Ω^{r}) )
has a probability of at least 0.4 at coordinate round(r) mod N.

An Example
As an example, consider the vector in C^{256} with a period of 10, from
the classic work of Shor (see [1] for example):

[1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
<23 more repetitions>
1, 0, 0, 0, 0]'
The following ten vectors exactly add up to the given comb vector.

[1, 1, ... 1]' / 10
Ω^{25.6} / 10
Ω^{51.2} / 10
Ω^{76.8} / 10
Ω^{102.4} / 10
Ω^{128.0} / 10
Ω^{153.6} / 10
Ω^{179.2} / 10
Ω^{204.8} / 10
Ω^{230.4} / 10
Moreover, each of the 10 vectors contributes one localized region of
high probability to the DFT result:


vector 
index 1 
index 2 
prob 1 
prob 2 
total probability 
DFT([1, 1, ... 1]') 
0 
 
1.0 
 
1.0 
DFT(Ω^{25.6}) 
25 
26 
0.2545 
0.5727 
0.8272 
DFT(Ω^{51.2}) 
51 
52 
0.8751 
0.0646 
0.9297 
DFT(Ω^{76.8}) 
76 
77 
0.0646 
0.8751 
0.9297 
DFT(Ω^{102.4}) 
102 
103 
0.5727 
0.2545 
0.8272 
DFT(Ω^{128.0}) 
128 
 
1.0 
 
1.0 
DFT(Ω^{153.6}) 
153 
154 
0.2545 
0.5727 
0.8272 
DFT(Ω^{179.2}) 
179 
180 
0.8751 
0.0646 
0.9297 
DFT(Ω^{204.8}) 
204 
205 
0.0646 
0.8751 
0.9297 
DFT(Ω^{230.4}) 
230 
231 
0.5727 
0.2545 
0.8272 
Summary of the proof steps
For the sequence of lemmas that lead to the final desired result, we
will consider a single one of our p vectors of the form Ω^{r}.

We will closely examine the behavior of this vector for values of r in
[0.5 .. 0.5]. It will then be a simple matter to extend to all other
real values for r.

The first step will be to show that the first coordinate of
Prob(DFT(Ω^{0.5})) has a value greater than 0.4. (This and the
following results all refer to vectors in C^{N} for N a power of 2.)

Second, we will show that the first coordinate of Prob(DFT(Ω^{r}))
for r in [0.5 .. 0.5] is minimized at r = +/ 0.5.

These facts together will allow us to conclude that the first coordinate
of Prob(DFT(Ω^{r})) is greater than 0.4 for all r in [0.5 .. 0.5].

Then, using the shifting theorem we will observe that
Prob(DFT(Ω^{k+r})) has probability greater than 0.4 in coordinate
position "k mod N" for all integer k and real r in [0.5 .. 0.5].

The result then follows immediately: Given the sum of a collection
of p "Ω^{r}" vectors that add up to a comb vector V, taking
their DFT's individually, and then summing them, produces the desired
output vector with p equally spaced local regions of high probability.
The distributivity property then gives us the desired final result.

The gory details
Lemma: The set of all periodic vectors in C^{N} with period p ≤ N is contained
in a pdimensional linear subspace S_{p} of C^{N}.

Proof: Let M be any fulldimensional p x p matrix. Form an N x p matrix S out of the
rows of M: the i'th row of S is the (i mod p)'th row of M. Every periodic vector P with
period p is a linear combination of the columns of S.

To see this, consider vector p1
that contains the first p elements of P. This pdimensional vector is a linear
combination of the columns of M,
since M is a fulldimensional square p x p matrix: p1 = M p1', where p1' = M^{1} p1.
Our original vector P is simply S p1', since the i'th element of P equals the (i mod p)'th
element of P and the i'th row of S equals the (i mod p)'th row of M by construction.
QED.

Intuitively, we take the p x p matrix M, and stack copies of it vertically
to make the N x p matrix S. (If p does not divide N, the last copy of M will not have
all of its rows.)

Lemma: A basis for S_{p} is the following set of p vectors based on Ω:

Ω^{0 * (N/p)}
Ω^{1 * (N/p)}
Ω^{2 * (N/p)}
...
Ω^{(p1) * (N/p)}
Proof: Take M in the above proof to be the p x p positive DFT matrix for C^{P}.
The columns of the n x p matrix S as formed above are the desired p vectors.
Consider coordinate j of vector k from the above list:

ω^{j * k * (N/p) i 2 π / N} = ω^{j * k * i 2 π / p} = ω^{(j mod p) * k * i 2 π / p}
The last step above is due to the fact that f(m) = ω^{m i 2 π / p} is
a periodic function with period p. Removing a multiple of p simply means that we go around
the unit circle fewer times before ending up in the same place.

On the other hand, S[j, k] = ω^{(j mod p) * k * i 2 π / p} by construction.

QED.
Lemma: Given a comb vector V in C^{N} with period p,
the following p vectors sum to V:

Ω^{0 * (N/p)} / p
Ω^{1 * (N/p)} / p
Ω^{2 * (N/p)} / p
...
Ω^{(p1) * (N/p)} / p
Proof:
The vectors are simply the basis vectors from the previous lemma. The constants 1/p are
the values of the vector V reexpressed in the new basis.
QED.
Lemma: Consider a vector V = Ω^{0.5} in C^{N}, with N >= 1 and N
a power of 2. The first coordinate of prob(DFT(V)) is greater than 0.4.

Proof: The first row of dftMatrix is [1, .. 1]'/sqrt(N), so the first
element of DFT(V) is ΣV_{i}/sqrt(N). V has length sqrt(N),
so V/sqrt(N) is a vector of unit length. dftMatrix is orthonormal, so
"dftMatrix * V/sqrt(N)" is also of unit length. The first coordinate of this
quantity is just ΣV_{i}/N, the average of the coordinates of V.

The vector prob(V) has elements that are the squares of the lengths
of the coordinates of V, divided by the length of V. The length of
"dftMatrix V/sqrt(N)" is 1, so the first component of prob(DFV(V/sqrt(N)))
is just the square of the length of the average of the coordinates.

It is convenient to consider the average of the coordinates of V/sqrt(N)
instead of the square of the length, so we will prove that the average
is greater than sqrt(0.4).

We proceed by induction on log_{2}(N). For the base case, the claimed
result can be directly confirmed for powers of 2 from 1 through 16.

For N=17, the average of the elements of V* is slightly larger than 0.635,
which is greater than sqrt(0.4).

Now, for the inductive case. We will consider the vector [V', 1]'.
This vector has coordinates that are symmetric about the imaginary
axis on the complex plane, so their sum is on the imaginary axis.
(We will refer below to the "height" of vectors on the complex plane,
meaning their distance above the real axis.)

Moreover, the height of the average of [V', 1]' for N = 16 is about
0.635, which is larger than sqrt(0.4).

The average of the coordinates of [V', 1]' is less than the average
of the coordinates of V for all N: Since the sum of the coordinates
of [V, 1]' is on the imaginary axis, and the sum of the coordinates
of V is obtained by adding 1 (as a horizontal vector on the complex
plane), we have a right triangle. (ΣV)  1 is a leg
and ΣV is the hypotenuse, so the latter must be longer.
(N=1 is a trivial special case.)

The height of the average of the coordinates of [V', 1]' grows
monotonically as the dimension N goes up: Going from N to 2N, the
principal angle of N is divided by two to get the principal angle of
2N, the number of coordinates doubles, and both vectors of coordinates
start at the horizontal vector 1 in C. So, every arc of [V_{2n}', 1]'
is an arc of [V', 1]'. Each coordinate except I in [V, 1]' also has a
new adjacent coordinate in [V_{2n1}', 1]' that has a larger height. So,
the sum of the elements of [V_{2n}', 1]' has a larger height than the
sum of the elements of [V', 1]'.

QED.
Lemma: Consider a vector V = Ω^{r}, r in [0.5 .. 0.5]. The first
coordinate of prob(DFT(V)) is minimized when r = 0.5 or r = 0.5.

Proof:
Let f(r) be the first element of "W * conjugate(W) / length(W)", where
W = DFT(V).

This is the function we want to show is at a minimum for r = 0.5.
For convenience we will ignore the constant in the denominator.

To get f(r), we leftmultiply "Ω^{r}" by dftMatrix and consider the
first element. We multiply this value by its conjugate.

Since the first row of dftMatrix is [1, 1, .. 1], we have:

f(r) = Σ(Ω^{r}) * conjugate(Σ(Ω^{r}).
We are interested in the value of r in [0.5 .. 0.5] that minimizes the
above quantity.

The conjugate of a sum is the sum of the conjugates. So, the above is

Σ(V) * Σ(conjugate(V)).
This gives us N^{2} elements to consider. A typical element is

e^{(r * i * I θN * i)} * e^{(r * (j) * I θN)}.
The above quantity is just

e^{(r * (ij) * I θN)}.
In the above expression, i goes from 0 to N1, and so does j. So,
for i != j the above expressions come in conjugate pairs.

e^{(r * (ij) * I θN)}
+ e^{(r * (ji) * I θN)}.
Without loss of generality, assume i >= j. (The terms come in pairs with
"ij" and "ji".)

From Euler's formula, The above is equal to

cos( r * (ij) * θ_{N}) + I sin( r * (ij) * θ_{N}
+ cos((r * (ij) * θ_{N})) + I sin((r * (ij) * θ_{N})
=
cos( r * (ij) * θ_{N}) + I sin( r * (ij) * I θ_{N})
+ cos( (r * (ij) * θ_{N}))  I sin( (r * (ij) * I θ_{N})
= 2 cos(r * (ij) * θ_{N}).
For i >= j, (ij) is between 0 and (N1). r is in [.5 .. .5]. So the
argument of cos() above ranges from

0.5 * θ_{N} * (N1) to 0.5 * θ_{N} * (N1).
Since by definition θ_{N} = 2 * π / N, the above quantity is less
than π in absolute value.

This value ranges from a high of cos(0), i.e., 1 for r = 0, to a
minimum of

cos(+/0.5 * θ_{N} * (N1)) for r = 0.5 or 0.5.
Since f(r) is a sum of N * (N1) / 2 conjugate pairs and N constant values
(those cases where i = j), and each component is at a minimum for r =
+/0.5, and hence is a minimum at r = +/0.5.

QED.
We made the simplifying assumption that our input comb vector has a nonzero
element in its first slot. The shifting theorem allows us to relax that assumption.

Shifting Theorem[3]:
For a vector V and an arbitrary integer k, let Vk be V
circularly rotated upward k slots.
Then DFT(Vk) is DFT(V) * Ω^{k}, where '*' is
elementbyelement multiplication of two vectors.

The elements of the DFT of an input vector that is circularly shifted therefore
do not change their lengths, since each one is multiplied by a vector of length 1.
So, Prob( DFT(Xk) ) does not change for any circular shift k of an input vector X.

DFT(V) = DFT(V_{0} + .. + V_{p1}) = DFT(V_{0}) + .. + DFT(V_{k1}).
The latter is a sum of vectors each of which has a local region of high
probability, positioned around index "i * N/p".

Conclusion
The idea of decomposing a comb vector V into p constituent vectors
each of which contributes one of the p local regions of high probability
in the DFT of V results in a conceptually simple proof of an important
lemma in the quantum factoring algorithm.

References
[1] Shor, Peter W., PolynomialTime Algorithms for Prime Factorization
and Discrete Logarithms on a Quantum Computer, arXiv:quantph/9508027v2,
25 January 1996.

[2] Dasgupta, S., C.H. Papadimitriou, and U.V. Vazirani, Algorithms,
Chapter 10, 2006.

[3] https://en.wikipedia.org/wiki/Discrete_Fourier_transform#Shift_theorem

