Library mont_mul_prg
Require Import ssreflect.
Require Import ZArith_ext.
Require Import machine_int.
Import MachineInt.
Require Import mips_cmd.
Local Open Scope mips_cmd_scope.
Import expr_m.
Require Import ZArith_ext.
Require Import machine_int.
Import MachineInt.
Require Import mips_cmd.
Local Open Scope mips_cmd_scope.
Import expr_m.
The Montgomery multiplication is a modular
multiplication. The advantage of the Montgomery multiplication is
that it does not require multi-precision division, but used
less-expensive shifts instead. The price to pay is a parasite
multiplicative factor whose elimination requires two passes.
The implementation of the Montgomery multiplication we deal with is the so-called ``Finely Integrated Operand Scanning'' (FIOS) variant . Intuitively, it resembles the classical algorithm for multi-precision multiplication: it has two nested loops, the inner-loop incrementally computes partial products (modulo) that are successively added by the outer-loop. These partial products are computed in such a way that the least significant word is always zero, thus guaranteeing that the final result will fit in k+1 words of storage (where k is the length of multi-precision integers). For this to be possible, the Montgomery multiplication requires the pre-computation of the modular inverse of the least significant work of the modulus.
The SmartMIPS architecture is well-suited to the implementation of the FIOS variant of the Montgomery multiplication because the addition performed in the inner-loop (that adds two products of 32-bits integers) fits in the integer multiplier (of size greater than 72 bits).
The implementation of the Montgomery multiplication we deal with is the so-called ``Finely Integrated Operand Scanning'' (FIOS) variant . Intuitively, it resembles the classical algorithm for multi-precision multiplication: it has two nested loops, the inner-loop incrementally computes partial products (modulo) that are successively added by the outer-loop. These partial products are computed in such a way that the least significant word is always zero, thus guaranteeing that the final result will fit in k+1 words of storage (where k is the length of multi-precision integers). For this to be possible, the Montgomery multiplication requires the pre-computation of the modular inverse of the least significant work of the modulus.
The SmartMIPS architecture is well-suited to the implementation of the FIOS variant of the Montgomery multiplication because the addition performed in the inner-loop (that adds two products of 32-bits integers) fits in the integer multiplier (of size greater than 72 bits).
Definition montgomery
k alpha
x y z m
one ext int X Y M Z quot C
t s :=
addiu one r0 one16 ;
addiu C r0 zero16;
addiu ext r0 zero16;
while.while (bne ext k) (
lwxs X ext x;
lw Y zero16 y;
lw Z zero16 z;
multu X Y;
lw M zero16 m;
maddu Z one;
mflo t;
mfhi s;
multu t alpha;
addiu int r0 one16;
mflo quot;
mthi s;
mtlo t;
maddu quot M;
mflhxu Z;
addiu t z zero16;
while.while (bne int k) (
lwxs Y int y;
lwxs Z int z;
maddu X Y;
lwxs M int m;
maddu Z one;
maddu quot M;
addiu int int one16;
mflhxu Z;
addiu t t four16;
sw Z mfour16 t);
maddu C one;
mflhxu Z;
addiu ext ext one16;
sw Z zero16 t;
mflhxu C).
Local Open Scope machine_int_scope.
Local Open Scope eqmod_scope.
Local Open Scope zarith_ext_scope.
k alpha
x y z m
one ext int X Y M Z quot C
t s :=
addiu one r0 one16 ;
addiu C r0 zero16;
addiu ext r0 zero16;
while.while (bne ext k) (
lwxs X ext x;
lw Y zero16 y;
lw Z zero16 z;
multu X Y;
lw M zero16 m;
maddu Z one;
mflo t;
mfhi s;
multu t alpha;
addiu int r0 one16;
mflo quot;
mthi s;
mtlo t;
maddu quot M;
mflhxu Z;
addiu t z zero16;
while.while (bne int k) (
lwxs Y int y;
lwxs Z int z;
maddu X Y;
lwxs M int m;
maddu Z one;
maddu quot M;
addiu int int one16;
mflhxu Z;
addiu t t four16;
sw Z mfour16 t);
maddu C one;
mflhxu Z;
addiu ext ext one16;
sw Z zero16 t;
mflhxu C).
Local Open Scope machine_int_scope.
Local Open Scope eqmod_scope.
Local Open Scope zarith_ext_scope.
Let A by the value of the multiplier before
entering the inner-loop and A % n be the
remainder of the division of A by 2^n. The
Montgomery multiplication computes
quot = ((A%32) [.*] alpha)%32 and adds quot [.*] M0
to the multiplier. The lemma below captures the fact that the
resulting multiplier is a multiple of 2^32, and thus the
least significant word of the partial product is always zero:
Lemma montgomery_lemma : forall alpha vm, u2Z vm * u2Z alpha =m -1 {{ Zbeta 1 }} ->
forall (A : int 64) (m : store.t) ,
store.utoZ m < Zbeta 2 -> store.utoZ m = u2Z A ->
store.lo (store.maddu_op (((A [.%] 32 [.*] alpha) [.%] 32) [.*] vm) m) = zero32.
Proof.
move=> alpha0 m0 H A mu H0 H1.
apply lo_remainder_zero.
rewrite store.utoZ_maddu.
- rewrite H1.
inversion_clear H as [K H2].
exists ((K * (u2Z A)) -
(K * (Zbeta 1) * (u2Z (shr_shrink 32 A))) +
(u2Z (shr_shrink 32 A)) -
((u2Z m0) * (u2Z (shr_shrink 32 (umul (rem 32 A) alpha0))))).
rewrite (@u2Z_umul 32) u2Z_rem //; last repeat constructor.
rewrite (@u2Z_umul 32) u2Z_rem //; last repeat constructor.
rewrite -Zbeta1_Zpower2 4!Zmult_minus_distr_r Zmult_plus_distr_l Zmult_minus_distr_r -Zmult_assoc -(Zmult_comm (u2Z m0)) H2.
rewrite (_ : u2Z (shr_shrink 32 A) * Zbeta 1 * u2Z alpha0 * u2Z m0 =
u2Z (shr_shrink 32 A) * Zbeta 1 * (u2Z m0 * u2Z alpha0 ) ); last by ring.
rewrite H2; ring.
- have H2 : 1 <= 2 ^^ store.acx_size - 1.
have H2 : Zpower 2 8 <= Zpower 2 store.acx_size.
apply Zpower_2_le.
by apply store.acx_size_min.
by apply Zle_trans with (2 ^^ 8 - 1).
by apply Zlt_le_trans with (Zbeta 2).
Qed.
forall (A : int 64) (m : store.t) ,
store.utoZ m < Zbeta 2 -> store.utoZ m = u2Z A ->
store.lo (store.maddu_op (((A [.%] 32 [.*] alpha) [.%] 32) [.*] vm) m) = zero32.
Proof.
move=> alpha0 m0 H A mu H0 H1.
apply lo_remainder_zero.
rewrite store.utoZ_maddu.
- rewrite H1.
inversion_clear H as [K H2].
exists ((K * (u2Z A)) -
(K * (Zbeta 1) * (u2Z (shr_shrink 32 A))) +
(u2Z (shr_shrink 32 A)) -
((u2Z m0) * (u2Z (shr_shrink 32 (umul (rem 32 A) alpha0))))).
rewrite (@u2Z_umul 32) u2Z_rem //; last repeat constructor.
rewrite (@u2Z_umul 32) u2Z_rem //; last repeat constructor.
rewrite -Zbeta1_Zpower2 4!Zmult_minus_distr_r Zmult_plus_distr_l Zmult_minus_distr_r -Zmult_assoc -(Zmult_comm (u2Z m0)) H2.
rewrite (_ : u2Z (shr_shrink 32 A) * Zbeta 1 * u2Z alpha0 * u2Z m0 =
u2Z (shr_shrink 32 A) * Zbeta 1 * (u2Z m0 * u2Z alpha0 ) ); last by ring.
rewrite H2; ring.
- have H2 : 1 <= 2 ^^ store.acx_size - 1.
have H2 : Zpower 2 8 <= Zpower 2 store.acx_size.
apply Zpower_2_le.
by apply store.acx_size_min.
by apply Zle_trans with (2 ^^ 8 - 1).
by apply Zlt_le_trans with (Zbeta 2).
Qed.