Bivariate CKKS
May 23, 2026 · 23 min read
CKKS is usually presented over RNS: pick a chain of NTT-friendly primes , store every polynomial as its CRT residues, and let level be the unit of homomorphic budget. The recent paper of Belorgey, Carpov, Gama, Guasch and Jetchev, Revisiting Key Decomposition Techniques for FHE: Simpler, Faster and More Generic, keeps the scheme but throws out the prime chain entirely. In its place sits the bivariate representation: one cyclotomic axis, one multi-precision axis, and a free parameter that has nothing to do with primes.
This article walks through CKKS as it is implemented in poulpy-ckks on top of that representation. It is meant as a bridge between the rather abstract setting of the paper (Section 3 introduces the representation in full generality, with gadget decompositions and external products as the main motivation) and what actually runs on a CPU when you call ckks_add_into or ckks_mul_into. Some familiarity with CKKS is assumed, deep RNS expertise is not.
1. The bivariate representation, math first
Fix the cyclotomic ring
and let be the real torus. CKKS encrypts elements of the real torus polynomial ring
Pick an integer , the limb size. In poulpy this is called base2k.
A typical value is for the NTT120 backend. The bivariate representation describes an element by a bivariate integer polynomial
with (the paper calls such a polynomial -normalized and reduced), together with the evaluation map
We require . The picture splits into two independent axes:
- the -axis carries the cyclotomic structure: a length- vector of integer coefficients per limb, handled by an FFT over or an NTT modulo a single large prime;
- the -axis carries the multi-precision structure: limbs in , with carry propagation playing the role of arithmetic over .
Two structural consequences worth highlighting up front.
Limb count is a function of precision, not of primes. The number of limbs depends only on the desired precision and on . There is no prime chain, and is a free parameter rather than a property of some prime family.
Prefix property. A precision- ciphertext, truncated to its first limbs along , is a valid precision- ciphertext of the same plaintext, with . Modulus switching becomes a length adjustment on the limb sequence, i.e. a slice, not a computation.
2. CKKS in poulpy: where the bivariate ring lives
A CKKS ciphertext in poulpy-ckks is a GLWE buffer paired with semantic precision metadata, plus a typestate marker for whether the limb digits are carry-normalized:
pub struct CKKSCiphertext<D: Data, S: CKKSNormalizationState = Normalized> {
pub(crate) inner: GLWE<D>,
pub(crate) meta: CKKSMeta,
_state: PhantomData<S>,
}
pub struct CKKSMeta {
pub log_delta: usize,
pub log_budget: usize,
}
The S parameter is either Normalized or Unnormalized and tracks the limb invariant the unsafe arithmetic variants discussed in §5.3 expose at the type level.
The fields have direct mathematical meaning:
- (
log_delta) is the base- logarithm of the plaintext scaling factor: the encoded message lives at scale . - (
log_budget) is the remaining homomorphic capacity in bits: the noise floor sits bits above the scaled plaintext.
Together they define the effective torus width used by the kernels:
Storage uses limbs, and the rounded width is what the GLWE buffer actually holds. The bits between and are padding from limb-boundary alignment.
Picturing one ciphertext as a stack of limbs along , with the most-significant limb on top:
A plaintext is the same shape minus the secret-key half. Unlike RNS, where a plaintext has to live across the full RNS basis to interact with a ciphertext, a bivariate plaintext can be much shorter than the ciphertext it operates on and stored with just enough precision to hold it. Alignment of the two limb stacks is handled inside the kernels to allow operations.
3. The torus is a lie
The torus picture is elegant. It is also, on a finite computer, a representation lie. What we call the torus is , a continuous one-dimensional object. What actually lives in RAM, in poulpy, is
a vector of signed -bit integers per cyclotomic coefficient. Reading the limbs as the -coefficient of a torus element means evaluating
which is exactly applied coefficient-wise. The torus has been discretized at resolution , and any “real number on the torus” only exists up to that resolution.
From RNS-CKKS to bivariate
For readers more comfortable with the RNS picture: in RNS-CKKS a ciphertext at level lives in with , stored as the CRT tuple of residues modulo each . The “level” is both a budget and a storage layout: dropping a level means physically dropping a residue and recomputing the rest by mod-switch.
In bivariate, the same ciphertext lives in , -normalized, with limbs along . The CRT axis is gone. The “level” decouples into three quantities:
| Quantity | RNS-CKKS | Bivariate |
|---|---|---|
| Budget unit | one prime (– bits) | one bit |
| Storage unit | one RNS residue (one prime) | one limb of bits |
| ”Drop a level” | mod-switch: arithmetic + slice | slice (prefix property) |
| Modulus | product of primes | |
| Key-switch online NTTs | ||
| Evaluation key | tied to prime chain | prefix-shared across precisions |
The hard part of the migration is not algebra: it is essentially a change of basis at the storage level, but rebuilding the kernels around three independent axes (log_delta, log_budget, max_k) instead of the single fused level. Going from RNS to bivariate looks like a change of representation, but in practice it is mostly a change of bookkeeping, and the bookkeeping is most of CKKS engineering.
Bivariate CKKS is also easier to parameterize since instead of having to choose a chain of prime moduli, one can just choose and .
Note that Grafting in RNS provides a similar decoupling functionality, but the difference is that what Grafting gets by engineering, you get by desing in the bivariate representation.
4. Three independence axes
Once the representation is set, every CKKS operation has to deal with three quantities that may differ between operands and result:
- - the scale at which the message is encoded.
- - the bits of clean torus above the message.
- - the physical storage capacity, in bits.
In RNS-CKKS these collapse into a single state (the level), so reconciling two ciphertexts amounts to “modulus-switch to the lowest common level”. In bivariate they are independent and each has its own primitive.
The axis is reconciled by shifting inside the torus. Two ciphertexts at the same scale but with different budgets have their noise floors at different bit positions in the limb stack. A left-shift of the higher-budget operand by the budget difference brings its noise floor down to the lower-budget operand’s, at the cost of consuming the extra budget:
The axis is reconciled by storage truncation. If the destination buffer is narrower than the operand, the kernel discards the requested number of high bits before performing the addition or multiplication. This is the bivariate equivalent of “rescale before the operation to make room”, and it costs exactly that many bits of .
The axis is reconciled at the boundary, not in the kernel: scalar and constant plaintexts are quantized at the destination’s and at the chosen , so the kernel only ever sees aligned operands. Ciphertext–ciphertext operations either find equal scales (the common case) or surface MultiplicationPrecisionUnderflow and ask the caller to rescale explicitly. The analogous ciphertext–plaintext failure mode, i.e. the plaintext doesn’t fit inside the ciphertext’s headroom, is PlaintextAlignmentImpossible.
5. Bivariate CKKS operations
5.1 Encoding and decoding
Encoding follows the standard CKKS slot map. A complex slot vector is placed at the canonical embedding positions and inverse-FFT’d to yield a real-coefficient polynomial - in poulpy a temporary Vec<f64> (or Vec<f128>) inside the encoder:
This step is identical to RNS-CKKS: the canonical embedding obviously does not care how the underlying ring is stored, as it is just a ring homomorhism.
The interesting step is the next one: turning into the bivariate representation at scale . We want an integer polynomial such that , then we want to spread across the -axis so that the limbs of each coefficient encode the integer in base , with the message lying in the lowest bits of the limb stack and the bits above it left clean for noise growth:
In RNS this same operation requires representing modulo each prime , which forces the encoder to carry around the full prime chain. In poulpy the encoder writes into limbs of bits, with chosen freely. The slot map is unchanged, but the quantization-to-storage is done in poulpy so that the integer lands in the right limbs with the right alignment, and so that subsequent kernels see the limb stack the way they expect.
There is a single plaintext type, CKKSPlaintext, which lives in the ZNX (integer-limb torus) domain:
pub struct CKKSPlaintext<D: Data = Vec<u8>> {
pub(crate) inner: GLWEPlaintext<D>,
pub(crate) meta: CKKSMeta,
}
The slot map → quantization → limb decomposition pipeline is exposed as one call. The Encoder packs a complex slot vector into real coefficients via IFFT and immediately hands them to the plaintext’s encode_host_floats method, which scales by , rounds to i64 or i128, and decomposes the integer into base- limbs:
encoder.encode_reim(&mut pt, &re, &im)?; // slot map + IFFT + quantize-to-limbs
The host-codec method itself lives on a trait:
pub trait CKKSPlaintextVecHostCodec<F: CKKSScalar>: CKKSInfos + LWEInfos {
fn encode_host_floats(&mut self, coeffs: &[F]) -> Result<()>;
fn decode_host_floats(&self, coeffs: &mut [F]) -> Result<()>;
}
Decoding runs the chain in reverse with encoder.decode_reim: the bivariate plaintext is collapsed back to a single integer per coefficient by evaluating the limb stack as a base- number, the result is divided by , and the slot map is applied in the forward direction (FFT) to recover an approximation of the input slot vector. The accuracy of the round-trip is governed by the bits of that survived the homomorphic computation.
5.2 Encryption
Encryption is the standard RLWE encryption of the ZNX plaintext into a pair . The resulting ciphertext inherits the plaintext’s CKKSMeta, possibly trimmed to account for the encryption noise floor. The fresh ciphertext has corresponding to the headroom between the encoded message and where the encryption error sits in the limb stack.
5.3 Addition
Adding two ciphertexts at the same scale is just GLWE addition, if the budgets and storage widths agree. When they don’t, the kernel reconciles all three axes in a single pass:
- Compute the storage offset (clamped to ).
- Left-shift the lower-budget operand by bits and add the higher-budget operand left-shifted by bits.
- Set the result’s metadata: , .
The “safe” variant follows the addition with a carry-propagation pass that brings every limb back into . The “unsafe” variant skips that step, which is correct for chains of linear operations as long as one normalization happens before any nonlinear operation downstream. “Unsafe” here flags a numeric invariant (limbs may overflow i64 if you keep accumulating), not a memory invariant.
Plaintext addition takes a CKKSPlaintext (already in the ZNX domain). The caller encodes the floating-point coefficients into limbs ahead of time via encoder.encode_reim or encode_host_floats. What the kernel does is the alignment: the plaintext is allocated to its own bits, which can be much narrower than the ciphertext’s , and the kernel positions it inside the ciphertext via a right-shift offset before adding in place.
5.4 Multiplication
A product of two CKKS ciphertexts at scales and is a ciphertext at scale . To recover a single-scale result, one factor of must be absorbed, which costs bits of :
with the same storage offset as in addition. The pair handles the common case (equal scales: , so the subtraction is just ) and surfaces an error in the unaligned case. MultiplicationPrecisionUnderflow is the bivariate counterpart of “out of levels”: multiplication is impossible because the available headroom is smaller than the scale that would have to be absorbed.
Pictorially, for two operands at the same scale and budgets :
The rescale brings the noise floor down by bits relative to the message: the budget after the multiplication is .
The kernel produces the tensor product of the two ciphertexts under the appropriate shift, then relinearizes using a precomputed tensor key. The tensor key is sized once at keygen to the ciphertext precision plus a small overhead; it does not need to be regenerated for different multiplicative depths the way RNS evaluation keys are tied to the prime chain. The same key handles depth- and depth- circuits.
Plaintext multiplication is dramatically cheaper. A constant multiplication consumes only the plaintext’s bits of budget, with no factor of the ciphertext’s scale, and skips relinearization entirely. A small constant such as consumes exactly bits, instead of forcing an entire prime drop as it would in RNS without Grafting.
5.5 Rescale and the prefix property
ckks_rescale_assign is the operation that trades budget for re-anchoring the message inside the limb stack. Mathematically it is a left-shift in the torus by bits:
The plaintext stays at the same scale; only the headroom shrinks. The check rejects rescales that would push below zero: there is no equivalent to “negative levels”.
The prefix property the paper highlights costs nothing. A high-precision ciphertext of limbs, restricted to its first limbs, is itself a valid ciphertext of the same plaintext at lower precision. In poulpy this is exposed as ckks_compact_limbs, which reallocates the buffer to limbs after a budget-consuming operation has shrunk the meaningful content.
The picture, in the storage layout the kernels actually manipulate (limbs indexed as in §1, with carrying weight on the buffer integer, so is the high half, the low half):
The discarded limb(s) held only the zeros the rescale shifted in, and dropping them is exact: the high-weight portion (the new ) is the same integer, just stored in a smaller buffer.
There is no CRT lift, no re-projection. Evaluation keys and plaintexts inherit the same property: they are sized once at keygen and any caller can use a prefix of them to operate at lower precision.
5.6 Keyswitching, relinearization, and automorphisms
Relinearization, slot rotations, and conjugation all reduce to the same primitive: a keyswitch, where a ciphertext is rewritten under a different secret. In poulpy the underlying kernel is glwe_keyswitch from poulpy-core, wrapped at the CKKS level by ckks_rotate_into, ckks_conjugate_into, and the relinearization step inside ckks_mul_into.
A keyswitch evaluates where the are the gadget decomposition of one ciphertext polynomial and the are the key. In RNS-CKKS the operands have to be raised to an auxiliary modulus before the elementwise products, and that base extension dominates the online cost. In bivariate the are already polynomials over in the same ring: the online phase is one NTT per limb, the products, and a carry-propagation pass: no base extension, no CRT reprojection.
The prefix property carries over to key material: a relinearization, automorphism, or generic keyswitching key generated at maximum precision works for every lower precision by prefix truncation. In RNS-CKKS, evaluation keys are tied to the prime chain and are typically held at multiple levels or regenerated when the depth budget changes.
6. Why this is cool: a worked example
The runnable example at poulpy-cpu-ref/examples/ckks_poly2.rs (build with --features enable-ckks) evaluates
on encrypted complex slots, with parameters , (so , two limbs). The ciphertext buffer is sized to hold bits and the plaintext coefficient constants use .
The trace tells the bivariate story compactly:
ciphertext x dec=30 hom=65 eff= 95 limbs= 2 max=104
x^2 dec=30 hom=35 eff= 65 limbs= 2 max=104
x^2 compacted dec=30 hom=35 eff= 65 limbs= 2 max=104
a + b * x dec=30 hom=61 eff= 91 limbs= 2 max=104
c + d * x dec=30 hom=61 eff= 91 limbs= 2 max=104
(c + d * x) * x^2 dec=30 hom= 5 eff= 35 limbs= 1 max= 52
(c + d * x) * x^2 compacted dec=30 hom= 5 eff= 35 limbs= 1 max= 52
final polynomial dec=30 hom= 5 eff= 35 limbs= 1 max= 52
After encryption, has and bits of clean headroom below the noise floor, so fills almost all of the -bit buffer ( bits of unused storage padding at the limb boundary). The square consumes exactly bits of budget (, one factor of absorbed by the product). ckks_compact_limbs is called after the square but does not actually shrink here ( limbs are still needed); it would shrink as soon as crosses a -boundary. Each linear branch and is built with the affine helper ckks_affine_pt_const_into, which consumes only bits per branch (). The ciphertext-ciphertext multiply by is the budget-eating step: bits of remain, and the result is allocated to exactly the storage it needs (1 limb of bits).
The headline observation is that the budget falls in bits, not in levels. A constant multiplication by with consumes bits: not a -bit prime. The output of the final ct-ct multiply is allocated to a single -bit limb, sized to what the data needs rather than to a level boundary. The same circuit run under RNS would have used a discrete prime chain and rounded each consumption up to the next prime.
For a degree- polynomial that gap is small. For a degree- polynomial whose constants are mostly small fractions, it compounds: dozens of bit-sized consumptions versus dozens of full-prime drops. The practical consequence is fewer bootstrappings for the same circuit.
7. Conclusion and what’s next
The bivariate representation keeps the CKKS scheme intact and replaces the way ciphertexts are stored and operated on. Polynomials become bivariate (a cyclotomic axis of length and a multi-precision axis of length ), the limb size is a free parameter, and the prefix property turns modulus switching into a length adjustment. The torus picture remains a useful piece of mathematical scaffolding, but it is honest to say that the working object is a stack of -bit signed integers and that every operation is, ultimately, integer arithmetic on those stacks.
In poulpy-ckks this materializes as ciphertexts and plaintexts carrying a CKKSMeta { log_delta, log_budget } alongside their GLWE storage; arithmetic kernels that simultaneously reconcile mismatched scales, budgets and storage capacities through a single offset and a pair of shifts; a compact_limbs operation that exploits the prefix property to free unused storage; and bit-granular error reporting where RNS-CKKS would have signalled “out of levels”.
What is coming next to poulpy on top of this foundation:
- Polynomial evaluation - Paterson–Stockmeyer and friends, taking advantage of the bit-granular budget to evaluate functions like , , , without wasting time on limb dropping and alignment.
- Linear transforms - slot rotations and matrix–vector products fused with rescales so that the budget cost matches the analytic depth rather than rounding up to prime boundaries.
- Bootstrapping - the bit-granular framework should let bootstrapping be parameterized by precision rather than by a fixed prime chain, with the same evaluation key serving multiple precision targets via the prefix property.
- Discrete CKKS - variants of the scheme where the plaintext space is integer rather than real, sharing the bivariate plumbing with the standard real-valued CKKS, with functional bootstrapping.
- Faster backends - as poulpy was designed to work with any hardware, it is relatively easy to add a new backend hardware acceleration. Poulpy already has an
AVX2 + FMAand anAVX512backend, but anARMandCUDAbackend are actively being developped.
The torus is a lie. The cake, on the other hand, is real.