NB: This Coq documentation contains a revised version of the Coq implementation of these papers [1], [2], [3], [4], and is also the support for ongoing research. A partial archive (14/01/2001) is available at here. Drop us a line if you are interested in a complete, up-to-date archive.

Library mont_mul_prg

Require Import ssreflect.
Require Import ZArith_ext.
Require Import machine_int.
Import MachineInt.
Require Import mips_cmd.

Local Open Scope mips_cmd_scope.
Import expr_m.

The Montgomery multiplication is a modular multiplication. The advantage of the Montgomery multiplication is that it does not require multi-precision division, but used less-expensive shifts instead. The price to pay is a parasite multiplicative factor whose elimination requires two passes.

The implementation of the Montgomery multiplication we deal with is the so-called ``Finely Integrated Operand Scanning'' (FIOS) variant . Intuitively, it resembles the classical algorithm for multi-precision multiplication: it has two nested loops, the inner-loop incrementally computes partial products (modulo) that are successively added by the outer-loop. These partial products are computed in such a way that the least significant word is always zero, thus guaranteeing that the final result will fit in k+1 words of storage (where k is the length of multi-precision integers). For this to be possible, the Montgomery multiplication requires the pre-computation of the modular inverse of the least significant work of the modulus.

The SmartMIPS architecture is well-suited to the implementation of the FIOS variant of the Montgomery multiplication because the addition performed in the inner-loop (that adds two products of 32-bits integers) fits in the integer multiplier (of size greater than 72 bits).
Definition montgomery
  k alpha
  x y z m
  one ext int X Y M Z quot C
  t s :=
  addiu one r0 one16 ;
  addiu C r0 zero16;
  addiu ext r0 zero16;
  while.while (bne ext k) (
    lwxs X ext x;
    lw Y zero16 y;
    lw Z zero16 z;
    multu X Y;
    lw M zero16 m;
    maddu Z one;
    mflo t;
    mfhi s;
    multu t alpha;
    addiu int r0 one16;
    mflo quot;
    mthi s;
    mtlo t;
    maddu quot M;
    mflhxu Z;
    addiu t z zero16;
    while.while (bne int k) (
      lwxs Y int y;
      lwxs Z int z;
      maddu X Y;
      lwxs M int m;
      maddu Z one;
      maddu quot M;
      addiu int int one16;
      mflhxu Z;
      addiu t t four16;
      sw Z mfour16 t);
    maddu C one;
    mflhxu Z;
    addiu ext ext one16;
    sw Z zero16 t;
    mflhxu C).
Local Open Scope machine_int_scope.
Local Open Scope eqmod_scope.
Local Open Scope zarith_ext_scope.

Let A by the value of the multiplier before entering the inner-loop and A % n be the remainder of the division of A by 2^n. The Montgomery multiplication computes quot = ((A%32) [.*] alpha)%32 and adds quot [.*] M0 to the multiplier. The lemma below captures the fact that the resulting multiplier is a multiple of 2^32, and thus the least significant word of the partial product is always zero:
Lemma montgomery_lemma : forall alpha vm, u2Z vm * u2Z alpha =m -1 {{ Zbeta 1 }} ->
  forall (A : int 64) (m : store.t) ,
    store.utoZ m < Zbeta 2 -> store.utoZ m = u2Z A ->
    store.lo (store.maddu_op (((A [.%] 32 [.*] alpha) [.%] 32) [.*] vm) m) = zero32.
Proof.
move=> alpha0 m0 H A mu H0 H1.
apply lo_remainder_zero.
rewrite store.utoZ_maddu.
- rewrite H1.
  inversion_clear H as [K H2].
  exists ((K * (u2Z A)) -
    (K * (Zbeta 1) * (u2Z (shr_shrink 32 A))) +
    (u2Z (shr_shrink 32 A)) -
    ((u2Z m0) * (u2Z (shr_shrink 32 (umul (rem 32 A) alpha0))))).
  rewrite (@u2Z_umul 32) u2Z_rem //; last repeat constructor.
  rewrite (@u2Z_umul 32) u2Z_rem //; last repeat constructor.
  rewrite -Zbeta1_Zpower2 4!Zmult_minus_distr_r Zmult_plus_distr_l Zmult_minus_distr_r -Zmult_assoc -(Zmult_comm (u2Z m0)) H2.
  rewrite (_ : u2Z (shr_shrink 32 A) * Zbeta 1 * u2Z alpha0 * u2Z m0 =
    u2Z (shr_shrink 32 A) * Zbeta 1 * (u2Z m0 * u2Z alpha0 ) ); last by ring.
  rewrite H2; ring.
- have H2 : 1 <= 2 ^^ store.acx_size - 1.
    have H2 : Zpower 2 8 <= Zpower 2 store.acx_size.
      apply Zpower_2_le.
      by apply store.acx_size_min.
    by apply Zle_trans with (2 ^^ 8 - 1).
  by apply Zlt_le_trans with (Zbeta 2).
Qed.