Bivariate CKKS

May 23, 2026 · 23 min read

CKKS is usually presented over RNS: pick a chain of NTT-friendly primes $q_0, q_1, \dots, q_L$, store every polynomial as its CRT residues, and let the level be the unit of homomorphic budget. The recent paper of Belorgey, Carpov, Gama, Guasch and Jetchev, Revisiting Key Decomposition Techniques for FHE: Simpler, Faster and More Generic, keeps the scheme but throws out the prime chain entirely. In its place sits the bivariate representation: one cyclotomic axis, one multi-precision axis, and a free parameter $K$ that has nothing to do with primes.

This article walks through CKKS as it is implemented in poulpy-ckks on top of that representation. It is meant as a bridge between the rather abstract setting of the paper (Section 3 introduces the representation in full generality, with gadget decompositions and external products as the main motivation) and what actually runs on a CPU when you call ckks_add_into or ckks_mul_into. Some familiarity with CKKS is assumed; deep RNS expertise is not.

1. The bivariate representation, math first

Fix the cyclotomic ring

$$\mathcal{R} = \mathbb{Z}[X]/\langle X^N + 1\rangle, \qquad N \text{ a power of two},$$

and let $\mathbb{T} = \mathbb{R}/\mathbb{Z}$ be the real torus. CKKS encrypts elements of the real torus polynomial ring

$$\mathbb{T}_N[X] \;=\; \mathbb{R}[X]/\langle X^N+1\rangle \bmod \mathbb{Z}.$$

Pick an integer $K \in [10, 60]$, the limb size. In poulpy this is called base2k. A typical value is $K = 52$ for the NTT120 backend. The bivariate representation describes an element $Q \in \mathbb{T}_N[X]$ by a bivariate integer polynomial

$$P(X, Y) \;=\; \sum_{i=0}^{N-1} \sum_{j \geq 1} a_{i,j}\, X^{i}\, Y^{j} \;\in\; \mathbb{Z}[X, Y]/\langle X^N + 1\rangle,$$

with $a_{i,j} \in [-2^{K-1}, 2^{K-1})$ (the paper calls such a polynomial $K$-normalized and reduced), together with the evaluation map

$$\varphi_K : \mathbb{R}[X, Y]/\langle X^N+1\rangle \;\longrightarrow\; \mathbb{R}[X]/\langle X^N+1\rangle, \qquad P(X, Y) \;\longmapsto\; P(X,\, 2^{-K}).$$

We require $\varphi_K(P) \equiv Q \pmod{\mathbb{Z}}$. The picture splits into two independent axes:

  • the $X$-axis carries the cyclotomic structure: a length-$N$ vector of integer coefficients per limb, handled by an FFT over $\mathbb{R}$ or an NTT modulo a single large prime;
  • the $Y$-axis carries the multi-precision structure: $\ell$ limbs in $[-2^{K-1}, 2^{K-1})$, with carry propagation playing the role of arithmetic over $\mathbb{Z}$.
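To make the $Y$-axis concrete, here is a minimal sketch of a $K$-normalized decomposition of a single coefficient, together with its inverse. The function names are illustrative and are not poulpy's API. The coefficient is modelled as an integer numerator over $2^{K\ell}$, and a small $K$ is assumed so that everything fits in an i128.

```rust
/// Split `x` (a numerator over 2^(k*ell)) into `ell` balanced base-2^k
/// digits in [-2^(k-1), 2^(k-1)), most-significant limb first.
/// Illustrative helper, not part of poulpy.
fn decompose(x: i128, k: u32, ell: usize) -> Vec<i64> {
    let base = 1i128 << k;
    let half = base >> 1;
    let mut limbs = vec![0i64; ell];
    let mut rest = x;
    for j in (0..ell).rev() {
        // Balanced remainder of the current value.
        let mut d = rest.rem_euclid(base);
        if d >= half {
            d -= base;
        }
        limbs[j] = d as i64;
        // Exact division: `rest - d` is a multiple of 2^k.
        rest = (rest - d) >> k;
    }
    // Whatever is left in `rest` is an integer part and vanishes modulo 1.
    limbs
}

/// Rebuild the numerator from the limbs (the integer form of phi_K).
fn recompose(limbs: &[i64], k: u32) -> i128 {
    let base = 1i128 << k;
    limbs.iter().fold(0i128, |acc, &a| acc * base + a as i128)
}
```

With $K = 8$ and $\ell = 3$, the numerator 0x12F080 decomposes into the limbs (19, -15, -128), and recomposing gives back the same integer. A numerator whose top digit overflows the balanced range, such as 0xFF0000, comes back only modulo $2^{K\ell}$, which is exactly the reduction modulo $\mathbb{Z}$ on the torus.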

Two structural consequences are worth highlighting up front.

Limb count is a function of precision, not of primes. The number of limbs $\ell \approx L/K$ depends only on the desired precision $L = -\log_2(\text{noise})$ and on $K$. There is no prime chain, and $K$ is a free parameter rather than a property of some prime family.

Prefix property. A precision-$L$ ciphertext, truncated to its first $\ell' < \ell$ limbs along $Y$, is a valid precision-$L'$ ciphertext of the same plaintext, with $L' = \ell' \cdot K$. Modulus switching becomes a length adjustment on the limb sequence, i.e. a slice, not a computation.
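The prefix property can be checked numerically on a single coefficient. The sketch below uses hypothetical helper names (not poulpy's API) and evaluates a limb stack with $\varphi_K$ in f64, which is exact for the small $K$ assumed here. It then compares the full stack with a truncated one.

```rust
/// Evaluate sum_j a_j * 2^(-j*k) for limbs stored most-significant first
/// (j starts at 1). Illustrative helper, not part of poulpy.
fn torus_value(limbs: &[i64], k: u32) -> f64 {
    limbs
        .iter()
        .enumerate()
        .map(|(j, &a)| {
            let exp = -(((j + 1) as i32) * (k as i32));
            a as f64 * 2f64.powi(exp)
        })
        .sum()
}

/// "Modulus switching" as a slice: keep the first `ell_prime` limbs.
fn truncate_prefix(limbs: &[i64], ell_prime: usize) -> &[i64] {
    &limbs[..ell_prime]
}
```

For the limbs (19, -15, -128) with $K = 8$, keeping the first two limbs changes the torus value by $128 \cdot 2^{-24} = 2^{-17}$, which is below the $2^{-16}$ resolution of a two-limb stack. No arithmetic is performed on the limbs that are kept.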

[Figure: the two axes of the representation: $Y$ (multi-precision, $\ell$ limbs) and $X$ (cyclotomic, length $N$).]

2. CKKS in poulpy: where the bivariate ring lives

A CKKS ciphertext in poulpy-ckks is a GLWE buffer paired with semantic precision metadata, plus a typestate marker for whether the limb digits are carry-normalized:

pub struct CKKSCiphertext<D: Data, S: CKKSNormalizationState = Normalized> {
    pub(crate) inner: GLWE<D>,
    pub(crate) meta: CKKSMeta,
    _state: PhantomData<S>,
}

pub struct CKKSMeta {
    pub log_delta: usize,
    pub log_budget: usize,
}

The S parameter is either Normalized or Unnormalized: it tracks, at the type level, the limb invariant that the unsafe arithmetic variants discussed in §5.3 relax.

The fields have direct mathematical meaning:

  • $\log\Delta$ (log_delta) is the base-$2$ logarithm of the plaintext scaling factor: the encoded message lives at scale $\Delta = 2^{\log\Delta}$.
  • $\log\beta$ (log_budget) is the remaining homomorphic capacity in bits: the noise floor sits $\log\beta$ bits above the scaled plaintext.

Together they define the effective torus width used by the kernels:

$$k_{\mathrm{eff}} \;=\; \log\Delta \;+\; \log\beta.$$

Storage uses $\lceil k_{\mathrm{eff}} / K \rceil$ limbs, and the rounded width $k_{\max} = \lceil k_{\mathrm{eff}}/K\rceil \cdot K$ is what the GLWE buffer actually holds. The bits between $k_{\mathrm{eff}}$ and $k_{\max}$ are padding from limb-boundary alignment.
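The width bookkeeping is plain integer arithmetic. As a minimal sketch, with a stand-in Meta struct that mirrors the two fields of CKKSMeta and helper names that are assumptions rather than poulpy functions:

```rust
/// Stand-in for the two metadata fields; not poulpy's CKKSMeta itself.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Meta {
    log_delta: usize,
    log_budget: usize,
}

/// Effective torus width: k_eff = log_delta + log_budget.
fn k_eff(m: Meta) -> usize {
    m.log_delta + m.log_budget
}

/// Number of limbs of `base2k` bits needed to hold k_eff bits.
fn limbs_needed(m: Meta, base2k: usize) -> usize {
    k_eff(m).div_ceil(base2k)
}

/// Rounded width k_max actually held by the buffer.
fn k_max(m: Meta, base2k: usize) -> usize {
    limbs_needed(m, base2k) * base2k
}

/// Padding bits between k_eff and k_max.
fn padding_bits(m: Meta, base2k: usize) -> usize {
    k_max(m, base2k) - k_eff(m)
}
```

With $\log\Delta = 30$, $\log\beta = 65$ and $K = 52$ this gives $k_{\mathrm{eff}} = 95$, two limbs, $k_{\max} = 104$, and $9$ bits of padding.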

Picturing one ciphertext as a stack of limbs along $Y$, with the most-significant limb on top:

[Figure: one ciphertext as a limb stack from MSB to LSB, with regions labelled ceiling, clean bits, and encoded message.]

A plaintext is the same shape minus the secret-key half. Unlike RNS, where a plaintext has to live across the full RNS basis to interact with a ciphertext, a bivariate plaintext can be much shorter than the ciphertext it operates on and stored with just enough precision to hold it. Alignment of the two limb stacks is handled inside the kernels to allow operations.

3. The torus is a lie

The torus picture is elegant. It is also, on a finite computer, a representation lie. What we call the torus is $\mathbb{T} = \mathbb{R}/\mathbb{Z}$, a continuous one-dimensional object. What actually lives in RAM, in poulpy, is

$$\widetilde{\mathbb{T}}_K^{(\ell)} \;\simeq\; \prod_{j=1}^{\ell} \big[-2^{K-1},\, 2^{K-1}\big)\;\subset\;\mathbb{Z}^{\ell},$$

a vector of $\ell$ signed $K$-bit integers per cyclotomic coefficient. Reading the limbs $(a_{i,1}, a_{i,2}, \dots, a_{i,\ell})$ as the $X^i$-coefficient of a torus element means evaluating

$$\sum_{j=1}^{\ell} a_{i,j} \cdot 2^{-jK} \pmod{1},$$

which is exactly $\varphi_K$ applied coefficient-wise. The torus has been discretized at resolution $2^{-K\ell}$, and any “real number on the torus” only exists up to that resolution.

From RNS-CKKS to bivariate

For readers more comfortable with the RNS picture: in RNS-CKKS a ciphertext at level $L$ lives in $\mathcal{R}_{Q_L}^2$ with $Q_L = \prod_{j=0}^{L} q_j$, stored as the CRT tuple of residues modulo each $q_j$. The “level” is both a budget and a storage layout: dropping a level means physically dropping a residue and recomputing the rest by mod-switch.

In bivariate, the same ciphertext lives in $(\mathbb{Z}[X,Y]/\langle X^N+1\rangle)^2$, $K$-normalized, with $\ell$ limbs along $Y$. The CRT axis is gone. The “level” decouples into three quantities:

| Quantity | RNS-CKKS | Bivariate |
| --- | --- | --- |
| Budget unit | one prime $q_j$ ($\sim 30$ to $50$ bits) | one bit |
| Storage unit | one RNS residue (one prime) | one limb of $K$ bits |
| “Drop a level” | mod-switch: arithmetic + slice | slice (prefix property) |
| Modulus | product $Q_L$ of primes | $2^{K\ell}$ |
| Key-switch online NTTs | $O(\ell^2)$ | $O(\ell)$ |
| Evaluation key | tied to prime chain | prefix-shared across precisions |

The hard part of the migration is not the algebra, which is essentially a change of basis at the storage level, but rebuilding the kernels around three independent axes (log_delta, log_budget, max_k) instead of the single fused level. Going from RNS to bivariate looks like a change of representation, but in practice it is mostly a change of bookkeeping, and the bookkeeping is most of CKKS engineering. Bivariate CKKS is also easier to parameterize: instead of choosing a chain of prime moduli, one just chooses $K$ and $\ell$. Grafting in RNS provides a similar decoupling, but what Grafting gets by engineering, the bivariate representation gets by design.

4. Three independent axes

Once the representation is set, every CKKS operation has to deal with three quantities that may differ between operands and result:

  • $\log\Delta$ - the scale at which the message is encoded.
  • $\log\beta$ - the bits of clean torus above the message.
  • $k_{\max}$ - the physical storage capacity, in bits.

In RNS-CKKS these collapse into a single state (the level), so reconciling two ciphertexts amounts to “modulus-switch to the lowest common level”. In bivariate they are independent and each has its own primitive.

The $\log\beta$ axis is reconciled by shifting inside the torus. Two ciphertexts at the same scale but with different budgets have their noise floors at different bit positions in the limb stack. A left-shift of the higher-budget operand by the budget difference brings its noise floor down to the lower-budget operand’s, at the cost of consuming the extra budget:

[Figure: budget alignment by shifting within the limb stack, MSB to LSB.]

The $k_{\max}$ axis is reconciled by storage truncation. If the destination buffer is narrower than the operand, the kernel discards the requested number of high bits before performing the addition or multiplication. This is the bivariate equivalent of “rescale before the operation to make room”, and it costs exactly that many bits of $\log\beta$.

The $\log\Delta$ axis is reconciled at the boundary, not in the kernel: scalar and constant plaintexts are quantized at the destination’s $K$ and at the chosen $\log\Delta$, so the kernel only ever sees aligned operands. Ciphertext–ciphertext operations either find equal scales (the common case) or surface MultiplicationPrecisionUnderflow and ask the caller to rescale explicitly. The analogous ciphertext–plaintext failure mode, i.e. the plaintext doesn’t fit inside the ciphertext’s headroom, is PlaintextAlignmentImpossible.

5. Bivariate CKKS operations

5.1 Encoding and decoding

Encoding follows the standard CKKS slot map. A complex slot vector $\mathbf{z} \in \mathbb{C}^{N/2}$ is placed at the canonical embedding positions and inverse-FFT’d to yield a real-coefficient polynomial - in poulpy a temporary Vec<f64> (or Vec<f128>) inside the encoder:

$$\mathbf{z} \in \mathbb{C}^{N/2} \;\xrightarrow{\;\text{slot map}\;}\; m(X) \in \mathbb{R}[X]/\langle X^N+1\rangle.$$

This step is identical to RNS-CKKS: the canonical embedding does not care how the underlying ring is stored, as it is just a ring homomorphism.

The interesting step is the next one: turning $m(X)$ into the bivariate representation at scale $\Delta = 2^{\log\Delta}$. We want an integer polynomial $\tilde m(X) \in \mathbb{Z}[X]/\langle X^N+1\rangle$ such that $\tilde m \approx \Delta \cdot m$, then we want to spread $\tilde m$ across the $Y$-axis so that the limbs of each coefficient encode the integer $\tilde m_i$ in base $2^K$, with the message lying in the lowest $\log\Delta$ bits of the limb stack and the $\log\beta$ bits above it left clean for noise growth:

[Figure: limb stack from MSB to LSB, with two clean regions above the integer bits of the message.]

In RNS this same operation requires representing $\tilde m$ modulo each prime $q_j$, which forces the encoder to carry around the full prime chain. In poulpy the encoder writes into limbs of $K$ bits, with $K$ chosen freely. The slot map is unchanged, but the quantization-to-storage is done in poulpy so that the integer $\tilde m_i$ lands in the right limbs with the right alignment, and so that subsequent kernels see the limb stack the way they expect.

There is a single plaintext type, CKKSPlaintext, which lives in the ZNX (integer-limb torus) domain:

pub struct CKKSPlaintext<D: Data = Vec<u8>> {
    pub(crate) inner: GLWEPlaintext<D>,
    pub(crate) meta: CKKSMeta,
}

The slot map → quantization → limb decomposition pipeline is exposed as one call. The Encoder packs a complex slot vector into real coefficients via IFFT and immediately hands them to the plaintext’s encode_host_floats method, which scales by $\Delta = 2^{\log\Delta}$, rounds to i64 or i128, and decomposes the integer into base-$2^K$ limbs:

encoder.encode_reim(&mut pt, &re, &im)?;        // slot map + IFFT + quantize-to-limbs

The host-codec method itself lives on a trait:

pub trait CKKSPlaintextVecHostCodec<F: CKKSScalar>: CKKSInfos + LWEInfos {
    fn encode_host_floats(&mut self, coeffs: &[F]) -> Result<()>;
    fn decode_host_floats(&self, coeffs: &mut [F]) -> Result<()>;
}

Decoding runs the chain in reverse with encoder.decode_reim: the bivariate plaintext is collapsed back to a single integer per coefficient by evaluating the limb stack as a base-$2^K$ number, the result is divided by $\Delta$, and the slot map is applied in the forward direction (FFT) to recover an approximation of the input slot vector. The accuracy of the round-trip is governed by the bits of $\log\beta$ that survived the homomorphic computation.
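The quantization step on its own is small enough to sketch for one coefficient. The code below is an illustration under simplifying assumptions, not poulpy's encode_host_floats: it leaves out the slot map and the limb decomposition and only shows the scale-and-round, along with its inverse.

```rust
/// Scale one real coefficient by 2^log_delta and round to the nearest integer.
/// Illustrative helper, not part of poulpy.
fn quantize(x: f64, log_delta: u32) -> i64 {
    (x * 2f64.powi(log_delta as i32)).round() as i64
}

/// Inverse direction: divide the integer by 2^log_delta.
fn dequantize(m: i64, log_delta: u32) -> f64 {
    m as f64 / 2f64.powi(log_delta as i32)
}
```

For $x = 0.3$ and $\log\Delta = 30$ the quantized integer is 322122547, and the round-trip error stays below $2^{-(\log\Delta + 1)}$, the usual rounding bound. Any further loss in a real computation comes from the homomorphic noise, which this sketch does not model.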

5.2 Encryption

Encryption is the standard RLWE encryption of the ZNX plaintext into a pair $(a,\, b = a \cdot s + \mathrm{pt} + e)$. The resulting ciphertext inherits the plaintext’s CKKSMeta, possibly trimmed to account for the encryption noise floor. The fresh ciphertext has $\log\beta$ corresponding to the headroom between the encoded message and where the encryption error sits in the limb stack.

5.3 Addition

Adding two ciphertexts at the same scale is just GLWE addition, if the budgets and storage widths agree. When they don’t, the kernel reconciles all three axes in a single pass:

  1. Compute the storage offset $o = \min(k_{\mathrm{eff}}^a,\, k_{\mathrm{eff}}^b) \,-\, k_{\max}^{\mathrm{dst}}$ (clamped to $\geq 0$).
  2. Left-shift the lower-budget operand by $o$ bits and add the higher-budget operand left-shifted by $(\log\beta^{\text{high}} - \log\beta^{\text{low}}) + o$ bits.
  3. Set the result’s metadata: $\log\Delta^{\mathrm{dst}} = \min(\log\Delta^a, \log\Delta^b)$, $\log\beta^{\mathrm{dst}} = \min(\log\beta^a, \log\beta^b) - o$.
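The three steps above only need the metadata of the operands and the destination width, so they can be sketched as a small planning function. The Meta and AddPlan types and plan_add itself are hypothetical names for this illustration; poulpy's kernel also performs the shifts and the addition on the limbs, which is not modelled here.

```rust
use std::cmp::{max, min};

/// Stand-in for the two metadata fields; not poulpy's CKKSMeta itself.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Meta {
    log_delta: usize,
    log_budget: usize,
}

/// Shifts to apply to each operand, and the metadata of the sum.
#[derive(Debug, PartialEq, Eq)]
struct AddPlan {
    shift_low: usize,  // left-shift of the lower-budget operand
    shift_high: usize, // left-shift of the higher-budget operand
    out: Meta,
}

/// Steps 1-3 of the addition, on metadata only.
/// Returns None if the storage offset would exhaust the budget.
fn plan_add(a: Meta, b: Meta, k_max_dst: usize) -> Option<AddPlan> {
    let k_eff_a = a.log_delta + a.log_budget;
    let k_eff_b = b.log_delta + b.log_budget;
    // Step 1: storage offset, clamped to >= 0.
    let o = min(k_eff_a, k_eff_b).saturating_sub(k_max_dst);
    // Step 2: shifts for the lower- and higher-budget operands.
    let lo = min(a.log_budget, b.log_budget);
    let hi = max(a.log_budget, b.log_budget);
    // Step 3: result metadata.
    let log_budget = lo.checked_sub(o)?;
    Some(AddPlan {
        shift_low: o,
        shift_high: (hi - lo) + o,
        out: Meta { log_delta: min(a.log_delta, b.log_delta), log_budget },
    })
}
```

Adding an operand with $(\log\Delta, \log\beta) = (30, 61)$ to one with $(30, 5)$ into a $52$-bit destination gives $o = 0$, a shift of $56$ bits on the higher-budget operand, and a result at $(30, 5)$. Adding two $(30, 65)$ operands into the same $52$-bit destination gives $o = 43$ and leaves $22$ bits of budget.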

The “safe” variant follows the addition with a carry-propagation pass that brings every limb back into $[-2^{K-1}, 2^{K-1})$. The “unsafe” variant skips that step, which is correct for chains of linear operations as long as one normalization happens before any nonlinear operation downstream. “Unsafe” here flags a numeric invariant (limbs may overflow i64 if you keep accumulating), not a memory invariant.
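A carry-propagation pass on one coefficient can be sketched as follows. This is an illustration with an assumed function name, not poulpy's normalization kernel: limbs are stored most-significant first, carries flow from the last limb towards the first, and the carry out of the top limb is dropped because the value lives modulo $1$.

```rust
/// Bring every limb back into [-2^(k-1), 2^(k-1)), propagating carries
/// from the least-significant limb (last) to the most-significant (first).
/// Illustrative helper, not part of poulpy.
fn normalize(limbs: &mut [i64], k: u32) {
    let base = 1i64 << k;
    let half = base >> 1;
    let mut carry = 0i64;
    for a in limbs.iter_mut().rev() {
        let v = *a + carry;
        // Balanced remainder of v modulo 2^k.
        let mut d = v.rem_euclid(base);
        if d >= half {
            d -= base;
        }
        // Exact: v - d is a multiple of 2^k.
        carry = (v - d) >> k;
        *a = d;
    }
    // The final carry is an integer on the torus and is discarded.
}
```

With $K = 8$, adding the limb stack (19, -15, -128) to itself without normalizing gives (38, -30, -256), where the last limb is out of range. One pass turns it into (38, -31, 0), which represents the same value.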

Plaintext addition takes a CKKSPlaintext (already in the ZNX domain). The caller encodes the floating-point coefficients into limbs ahead of time via encoder.encode_reim or encode_host_floats. What the kernel does is the alignment: the plaintext is allocated to its own $\mathrm{min}_k = \lceil(\log\Delta + \log\beta)/K\rceil \cdot K$ bits, which can be much narrower than the ciphertext’s $k_{\max}$, and the kernel positions it inside the ciphertext via a right-shift offset before adding in place.

5.4 Multiplication

A product of two CKKS ciphertexts at scales $\Delta_a$ and $\Delta_b$ is a ciphertext at scale $\Delta_a \cdot \Delta_b$. To recover a single-scale result, one factor of $\Delta$ must be absorbed, which costs $\max(\log\Delta_a, \log\Delta_b)$ bits of $\log\beta$:

$$\log\beta^{\mathrm{dst}} \;=\; \min(\log\beta^a, \log\beta^b) \;-\; \max(\log\Delta_a, \log\Delta_b) \;-\; o,$$

$$\log\Delta^{\mathrm{dst}} \;=\; \min(\log\Delta_a, \log\Delta_b),$$

with $o$ the same storage offset as in addition. The $\min/\max$ pair handles the common case (equal scales: $\min = \max = \log\Delta$, so the subtraction is just $-\log\Delta$) and surfaces an error in the unaligned case. MultiplicationPrecisionUnderflow is the bivariate counterpart of “out of levels”: multiplication is impossible because the available headroom is smaller than the scale that would have to be absorbed.
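The two formulas translate directly into a metadata-only function. In the sketch below the names are hypothetical, and the error enum is a stand-in that only models the underflow condition described above, not poulpy's actual error type or its handling of unaligned scales.

```rust
use std::cmp::{max, min};

/// Stand-in for the two metadata fields; not poulpy's CKKSMeta itself.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Meta {
    log_delta: usize,
    log_budget: usize,
}

/// Stand-in error for "not enough headroom to absorb one scale factor".
#[derive(Debug, PartialEq, Eq)]
enum MulError {
    PrecisionUnderflow,
}

/// Metadata of a ciphertext-ciphertext product, with storage offset `o`.
fn plan_mul(a: Meta, b: Meta, o: usize) -> Result<Meta, MulError> {
    let headroom = min(a.log_budget, b.log_budget);
    let absorbed = max(a.log_delta, b.log_delta) + o;
    let log_budget = headroom
        .checked_sub(absorbed)
        .ok_or(MulError::PrecisionUnderflow)?;
    Ok(Meta { log_delta: min(a.log_delta, b.log_delta), log_budget })
}
```

Squaring an operand at $(\log\Delta, \log\beta) = (30, 65)$ with $o = 0$ yields $(30, 35)$. Multiplying two operands that each have only $5$ bits of budget left fails, since $5 < 30$.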

Pictorially, for two operands at the same scale $\Delta$ and budgets $\beta_a \leq \beta_b$:

[Figure: limb stack from MSB to LSB, with parts labelled upper and lower.]

The rescale brings the noise floor down by $\log\Delta$ bits relative to the message: the budget after the multiplication is $\beta'' = \min(\beta_a, \beta_b) - \log\Delta$.

The kernel produces the tensor product of the two ciphertexts under the appropriate shift, then relinearizes using a precomputed tensor key. The tensor key is sized once at keygen to the ciphertext precision plus a small overhead; it does not need to be regenerated for different multiplicative depths the way RNS evaluation keys are tied to the prime chain. The same key handles depth-$1$ and depth-$10$ circuits.

Plaintext multiplication is dramatically cheaper. A constant multiplication consumes only the plaintext’s $\log\Delta$ bits of budget, with no factor of the ciphertext’s scale, and skips relinearization entirely. A small constant such as $3 / 2^{8}$ consumes exactly $8$ bits, instead of forcing an entire prime drop as it would in RNS without Grafting.

5.5 Rescale and the prefix property

ckks_rescale_assign is the operation that trades budget for re-anchoring the message inside the limb stack. Mathematically it is a left-shift in the torus by $k$ bits:

$$\log\beta \;\leftarrow\; \log\beta - k, \qquad \log\Delta \;\text{ unchanged}.$$

The plaintext stays at the same scale; only the headroom shrinks. The check rejects rescales that would push $\log\beta$ below zero: there is no equivalent to “negative levels”.

The prefix property the paper highlights costs nothing. A high-precision ciphertext of $\ell$ limbs, restricted to its first $\ell' < \ell$ limbs, is itself a valid ciphertext of the same plaintext at lower precision. In poulpy this is exposed as ckks_compact_limbs, which reallocates the buffer to $\lceil k_{\mathrm{eff}}/K\rceil$ limbs after a budget-consuming operation has shrunk the meaningful content.
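On the metadata side, rescaling and compaction reduce to a checked subtraction and a ceiling division. The sketch below uses assumed names and only tracks metadata; it does not touch limb data and is not poulpy's ckks_rescale_assign or ckks_compact_limbs.

```rust
/// Stand-in for the two metadata fields; not poulpy's CKKSMeta itself.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Meta {
    log_delta: usize,
    log_budget: usize,
}

/// Rescale by `k` bits: the budget shrinks, the scale is unchanged.
/// Returns None if the budget would go below zero.
fn rescale(m: Meta, k: usize) -> Option<Meta> {
    let log_budget = m.log_budget.checked_sub(k)?;
    Some(Meta { log_delta: m.log_delta, log_budget })
}

/// Limb count after compaction: ceil(k_eff / base2k).
fn compacted_limbs(m: Meta, base2k: usize) -> usize {
    (m.log_delta + m.log_budget).div_ceil(base2k)
}
```

For example, with $K = 52$ a ciphertext at $(\log\Delta, \log\beta) = (30, 35)$ needs two limbs. Rescaling it by $13$ bits brings $k_{\mathrm{eff}}$ to exactly $52$, so one limb suffices, while a rescale by $36$ bits is rejected.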

The picture, in the storage layout the kernels actually manipulate (limbs indexed $j = 1, \dots, \ell$ as in §1, with $a_j$ carrying weight $2^{(\ell-j) \cdot K}$ on the buffer integer, so $a_1$ is the high half, $a_\ell$ the low half):

[Figure: a buffer of four $K$-bit limbs from MSB to LSB, shown as (old, old, old, zero) and, after dropping the zero limb, as (old, old, old).]

The discarded limb(s) held only the zeros the rescale shifted in, and dropping them is exact: the high-weight portion (the new $a_1, \dots, a_{\ell'}$) is the same integer, just stored in a smaller buffer.

There is no CRT lift, no re-projection. Evaluation keys and plaintexts inherit the same property: they are sized once at keygen and any caller can use a prefix of them to operate at lower precision.

5.6 Keyswitching, relinearization, and automorphisms

Relinearization, slot rotations, and conjugation all reduce to the same primitive: a keyswitch, where a ciphertext is rewritten under a different secret. In poulpy the underlying kernel is glwe_keyswitch from poulpy-core, wrapped at the CKKS level by ckks_rotate_into, ckks_conjugate_into, and the relinearization step inside ckks_mul_into.

A keyswitch evaluates $\sum_i A_i(X) \cdot B_i(X)$ where the $A_i$ are the gadget decomposition of one ciphertext polynomial and the $B_i$ are the key. In RNS-CKKS the operands have to be raised to an auxiliary modulus before the elementwise products, and that base extension dominates the online cost. In bivariate the $A_i$ are already polynomials over $\mathcal{R}$ in the same ring: the online phase is one NTT per limb, the products, and a carry-propagation pass, with no base extension and no CRT reprojection.
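The algebraic shape of that sum can be shown with a deliberately naive sketch. It is not glwe_keyswitch: the function names are hypothetical, the product is a quadratic-time negacyclic convolution rather than an NTT, coefficients wrap modulo $2^{64}$, and the limb axis and carry propagation are left out.

```rust
/// Schoolbook product in Z[X]/(X^N + 1): X^N wraps around to -1.
/// Illustrative helper, not part of poulpy.
fn negacyclic_mul(a: &[i64], b: &[i64]) -> Vec<i64> {
    let n = a.len();
    assert_eq!(n, b.len());
    let mut out = vec![0i64; n];
    for i in 0..n {
        for j in 0..n {
            let p = a[i].wrapping_mul(b[j]);
            let k = i + j;
            if k < n {
                out[k] = out[k].wrapping_add(p);
            } else {
                // X^(n + r) = -X^r
                out[k - n] = out[k - n].wrapping_sub(p);
            }
        }
    }
    out
}

/// sum_i A_i(X) * B_i(X), with `digits` playing the role of the A_i
/// and `key` the role of the B_i. Assumes `digits` is non-empty.
fn gadget_product(digits: &[Vec<i64>], key: &[Vec<i64>]) -> Vec<i64> {
    let n = digits[0].len();
    let mut acc = vec![0i64; n];
    for (a, b) in digits.iter().zip(key) {
        for (s, t) in acc.iter_mut().zip(negacyclic_mul(a, b)) {
            *s = s.wrapping_add(t);
        }
    }
    acc
}
```

With $N = 4$, the product $X^3 \cdot X$ comes out as $-1$, which is the negacyclic wrap-around. In a real kernel each product would be done pointwise in the NTT or FFT domain.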

The prefix property carries over to key material: a relinearization, automorphism, or generic keyswitching key generated at maximum precision works for every lower precision by prefix truncation. In RNS-CKKS, evaluation keys are tied to the prime chain and are typically held at multiple levels or regenerated when the depth budget changes.

6. Why this is cool: a worked example

The runnable example at poulpy-cpu-ref/examples/ckks_poly2.rs (build with --features enable-ckks) evaluates

$$f(x) \;=\; (a + b\, x) \;+\; (c + d\, x)\cdot x^2$$

on encrypted complex slots, with parameters $K = 52$, $k^{\mathrm{ct}} = 95$ (so $k_{\max} = \lceil 95/52\rceil \cdot 52 = 104$, two limbs). The ciphertext buffer is sized to hold $104$ bits and the plaintext coefficient constants use $(\log\Delta, \log\beta) = (4, 0)$.

The trace tells the bivariate story compactly:

ciphertext x                 dec=30 hom=65 eff= 95 limbs= 2 max=104
x^2                          dec=30 hom=35 eff= 65 limbs= 2 max=104
x^2 compacted                dec=30 hom=35 eff= 65 limbs= 2 max=104
a + b * x                    dec=30 hom=61 eff= 91 limbs= 2 max=104
c + d * x                    dec=30 hom=61 eff= 91 limbs= 2 max=104
(c + d * x) * x^2            dec=30 hom= 5 eff= 35 limbs= 1 max= 52
(c + d * x) * x^2 compacted  dec=30 hom= 5 eff= 35 limbs= 1 max= 52
final polynomial             dec=30 hom= 5 eff= 35 limbs= 1 max= 52

After encryption, $\mathrm{ct}_x$ has $\log\Delta = 30$ and $\log\beta = 65$ bits of clean headroom below the noise floor, so $k_{\mathrm{eff}} = 95$ fills almost all of the $104$-bit buffer ($9$ bits of unused storage padding at the limb boundary). The square consumes exactly $\log\Delta = 30$ bits of budget ($65 \to 35$, one factor of $\Delta$ absorbed by the product). ckks_compact_limbs is called after the square but does not actually shrink here ($\lceil 65/52\rceil = 2$ limbs are still needed); it would shrink as soon as $k_{\mathrm{eff}}$ crosses a $K$-boundary. Each linear branch $a + b\,x$ and $c + d\,x$ is built with the affine helper ckks_affine_pt_const_into, which consumes only $\log\Delta_{\mathrm{pt}} = 4$ bits per branch ($65 \to 61$). The ciphertext-ciphertext multiply by $x^2$ is the budget-eating step: $\min(61, 35) - \max(30, 30) = 5$ bits of $\log\beta$ remain, and the result is allocated to exactly the storage it needs (1 limb of $52$ bits).
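The budget column of the trace can be replayed with metadata arithmetic alone. The sketch below is a bookkeeping model with hypothetical names: it applies the multiplication rule from §5.4 with zero storage offset and treats the affine helper as a constant multiplication that costs $\log\Delta_{\mathrm{pt}}$ bits. It does not run the example.

```rust
use std::cmp::{max, min};

/// Stand-in for the two metadata fields; not poulpy's CKKSMeta itself.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Meta {
    log_delta: usize,
    log_budget: usize,
}

/// Ciphertext-ciphertext product, storage offset assumed to be zero.
fn mul_ct(a: Meta, b: Meta) -> Option<Meta> {
    let cost = max(a.log_delta, b.log_delta);
    let log_budget = min(a.log_budget, b.log_budget).checked_sub(cost)?;
    Some(Meta { log_delta: min(a.log_delta, b.log_delta), log_budget })
}

/// Constant multiplication: costs only the plaintext's log_delta.
fn mul_const(a: Meta, log_delta_pt: usize) -> Option<Meta> {
    let log_budget = a.log_budget.checked_sub(log_delta_pt)?;
    Some(Meta { log_delta: a.log_delta, log_budget })
}

/// Limbs needed for k_eff bits at limb size `base2k`.
fn limbs(m: Meta, base2k: usize) -> usize {
    (m.log_delta + m.log_budget).div_ceil(base2k)
}

/// Returns the metadata of x^2, of one linear branch, and of their product.
fn replay_poly2() -> Option<(Meta, Meta, Meta)> {
    let x = Meta { log_delta: 30, log_budget: 65 };
    let x2 = mul_ct(x, x)?;          // 65 -> 35
    let branch = mul_const(x, 4)?;   // 65 -> 61
    let prod = mul_ct(branch, x2)?;  // min(61, 35) - 30 = 5
    Some((x2, branch, prod))
}
```

The three returned budgets are $35$, $61$ and $5$, matching the hom column of the trace, and the limb counts at $K = 52$ are $2$, $2$ and $1$.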

The headline observation is that the budget falls in bits, not in levels. A constant multiplication by $d$ with $\log\Delta_{\mathrm{pt}} = 4$ consumes $4$ bits, not a $30$-bit prime. The output of the final ct-ct multiply is allocated to a single $52$-bit limb, sized to what the data needs rather than to a level boundary. The same circuit run under RNS would have used a discrete prime chain and rounded each consumption up to the next prime.

For a degree-$2$ polynomial that gap is small. For a degree-$d$ polynomial whose constants are mostly small fractions, it compounds: dozens of bit-sized consumptions versus dozens of full-prime drops. The practical consequence is fewer bootstrappings for the same circuit.

7. Conclusion and what’s next

The bivariate representation keeps the CKKS scheme intact and replaces the way ciphertexts are stored and operated on. Polynomials become bivariate (a cyclotomic axis $X$ of length $N$ and a multi-precision axis $Y$ of length $\ell \approx L/K$), the limb size $K$ is a free parameter, and the prefix property turns modulus switching into a length adjustment. The torus picture remains a useful piece of mathematical scaffolding, but it is honest to say that the working object is a stack of $K$-bit signed integers and that every operation is, ultimately, integer arithmetic on those stacks.

In poulpy-ckks this materializes as ciphertexts and plaintexts carrying a CKKSMeta { log_delta, log_budget } alongside their GLWE storage; arithmetic kernels that simultaneously reconcile mismatched scales, budgets and storage capacities through a single offset and a pair of shifts; a compact_limbs operation that exploits the prefix property to free unused storage; and bit-granular error reporting where RNS-CKKS would have signalled “out of levels”.

What is coming next to poulpy on top of this foundation:

  • Polynomial evaluation - Paterson–Stockmeyer and friends, taking advantage of the bit-granular budget to evaluate functions like $\sigma(x)$, $\tanh(x)$, $\exp(x)$, without wasting time on limb dropping and alignment.
  • Linear transforms - slot rotations and matrix–vector products fused with rescales so that the budget cost matches the analytic depth rather than rounding up to prime boundaries.
  • Bootstrapping - the bit-granular framework should let bootstrapping be parameterized by precision rather than by a fixed prime chain, with the same evaluation key serving multiple precision targets via the prefix property.
  • Discrete CKKS - variants of the scheme where the plaintext space is integer rather than real, sharing the bivariate plumbing with the standard real-valued CKKS, with functional bootstrapping.
  • Faster backends - as poulpy was designed to work with any hardware, it is relatively easy to add a new hardware-accelerated backend. Poulpy already has AVX2 + FMA and AVX512 backends, and ARM and CUDA backends are actively being developed.

The torus is a lie. The cake, on the other hand, is real.