# 4. Exercises on Rcpp

### Computing the c.d.f. of the Tweedie distribution

Here we consider the Tweedie distribution which includes continuous distributions, such as the normal and gamma, and discrete distributions, such as the Poisson. Its density is \[ p(y|\mu,\phi,p)=a(y,\phi,p)\exp\bigg[\frac{1}{\phi}\{y\theta-\kappa(\theta)\}\bigg], \] where \[ \theta=\frac{\mu^{1-p}}{1-p}\;\;\text{for}\;p\neq1\;\;\;\text{and}\;\;\;\theta=\log\mu\;\;\text{for}\;p=1, \] and \[ \kappa(\theta)=\frac{\mu^{2-p}}{2-p}\;\;\text{for}\;p\neq2\;\;\;\text{and}\;\;\;\theta=\log\mu\;\;\text{for}\;p=2. \] with \(\mu>0\) being the mean, \(\phi>0\) the scale and \(1\leq p\leq2\) is such that \(\text{var}(y)=\mu^{p}\). As explained in Dunn and Smyth (2005), evaluating the Tweedie density requires approximating the factor \(a(y,\phi,p)\), which does not have a closed-form expression, using specifically designed numerical methods. Here we consider the simpler problem of numerically approximating the Tweedie c.d.f. \(P(y|\mu,\phi,p)\), under the assumption that \(a(y,\phi,p)\) is unknown.

For \(1\leq p\leq2\), a Tweedie variable \(y\) can be written as the sum of \(N\) i.i.d. Gamma distributed random variables \(z_{1},\dots,z_{N}\) with shape \(-\alpha\) and scale \(\gamma\) (hence with mean \(-\alpha\gamma\) and variance \(-\alpha\gamma^{2}\)), while \(N\) follows a Poisson distribution with rate \(\lambda\). In terms of the original Tweedie parameters, we have \[ \lambda=\frac{\mu^{2-p}}{\phi(2-p)}, \;\;\; \alpha=(2-p)/(1-p) \;\;\; \text{and} \;\;\; \gamma=\phi(p-1)\mu^{p-1}. \] So, if \(p_{G}\) indicates the c.d.f. of a Gamma distribution with parameters \(-k\alpha\) and \(\gamma\), \(p_{p}\) is the p.m.f. of a Poisson with rate \(\lambda\) and we indicate with \(p(Y < y)\) the Tweedie c.d.f., we have \[ p(Y<y)=p\Big(\sum_{i=1}^{N}z_{i}<y\Big)=\sum_{k=1}^{\infty}p_{G}\Big(\sum_{i=1}^{k}z_{i}<y\Big)p_{P}(N=k), \] where the second equality holds due to the Law of Total Probability. Now, we want to avoid computing that infinite sum, hence we approximate it by \[ p(Y<y)\approx\hat{p}(Y<y)=\sum_{k=k_{min}}^{k_{max}}p_{G}\Big (\sum_{i=1}^{k}z_{i}<y \Big )p_{P}(N=k), \] for some \(k_{min}\) and \(k_{max}\).

How to choose \(k_{min}\) and \(k_{max}\)? Given that \(p_{P}\) is maximal at \(k = \text{floor}(\lambda)\), and then monotonically decreases as we move \(k\) away from the mode, a reasonable approach is to choose \(k_{min}\) and \(k_{max}\) such that \(p_{P}(N=k) < \epsilon p_{P}\{N=\text{floor}(\lambda)\}\) for \(k<k_{min}\) or \(k>k_{max}\), for some small \(\epsilon\). If we do so, it is clear that the a very pessimistic bound on the absolute approximation error is \[ |p(Y<y)-\hat{p}(Y<y)| < p_{P}(N<k_{min} \; \text{or} \; N>k_{max}) = 1 - p_{P}(k_{min}<N<k_{max}), \] which is easy to compute. Of course, \(k_{min}\) and \(k_{max}\) are not known in advance, but we can initialize \(k\) to \(\text{floor}(\lambda)\) and then increase \(k\) until we get to \(p_{P}(N=k) < \epsilon p_{P}\{N=\text{floor}(\lambda)\}\), which means that we have found \(k_{max}\). Then we can set \(k\) to the Poisson mode and decrease \(k\) until we find \(k_{min}\). All this is implemented in the following R function:

```
pTweedR0 <- function(y, mu, phi, p, eps = 1e-17, log = FALSE){
# Get param lambda, gamma and alpha to be used in gamma and poisson distrib
la <- mu^(2-p) / ( phi * (2-p) )
ga <- phi * (p-1) * mu^( p-1 )
al <- (2-p) / ( 1-p )
# Mode of Poisson is at k = floor(lambda).
# If mode is 0, we start from 1 instead.
k0 <- max(floor(la), 1)
# Poisson density at its mode
mxlpP <- dpois(k0, la)
# Get probability contribution at Poisson mode
pTw <- mxlpP * pgamma(y, shape = - k0 * al, scale = ga)
# Initialize k at mode
k <- k0
lP <- mxlpP
# Sum from mode k = floor(lambda) upward until we find kmax
while ( lP > mxlpP * eps ){
k <- k + 1
lP <- dpois(k, la)
pTw <- pTw + lP * pgamma(y, shape = - k * al, scale = ga)
}
kmax <- k
# Reset k to the Poisson mode, and now go down until we find kmin
k <- k0
lP <- mxlpP
while ( lP > mxlpP * eps && k > 0 ){
k <- k - 1
lP <- dpois(k, la)
pTw <- pTw + lP * pgamma(y, shape = - k * al, scale = ga)
}
kmin <- k
if( log ) { pTw <- log( pTw ) }
return( pTw )
}
```

Having defined a function for approximating the Tweedie c.d.f., we need to verify whether it is correct. Given that the true c.d.f. is unknown, we compare our function with the `ptweedie`

function from the `tweedie`

package. Below, we simulate a random vector of Tweedie parameters, and then we compare the (log) c.d.f.s produced by `pTweedR0`

with those obtained using `ptweedie`

:

```
library("tweedie")
nsim <- 1e3
mu <- runif(nsim, 0, 10)
phi <- 0.01 + rexp(nsim, 1)
p <- runif(nsim, 1.001, 1.999)
lpr1 <- lpr2 <- rep(NA, nsim)
for(ii in 1:nsim){
y <- rtweedie(n = 1, mu = mu[ii], phi = phi[ii], power = p[ii])
lpr1[ii] <- pTweedR0(y = y, mu = mu[ii], phi = phi[ii], p = p[ii], log = TRUE)
lpr2[ii] <- log(ptweedie(q = y, mu = mu[ii], phi = phi[ii], power = p[ii]))
}
par(mfrow = c(1, 2))
plot(lpr1, lpr2, xlab = "pTweedR0", ylab = "ptweedie")
abline(0, 1)
plot(lpr1 - lpr2, ylab = "pTweedR0 - ptweedie")
abline(h = 0)
```

The log c.d.f.s are very close, why is not surprising as `ptweedie`

uses pretty much the same approximation we are using.

**Q1 start**: The `pTweedR0`

function requires explicitly looping in R, because \(k_{min}\) and \(k_{max}\) are not known in advance but must be determined iteratively. To address this, create an `Rcpp`

version of `pTweedR0`

and compare its computational performance with that of `pTweedR0`

**Q1 end**.

**Q2 start**: Most the function provided by `R`

for evaluating the c.d.f. of a random variable (e.g. `pnorm()`

, `pexp()`

) are vectorized in all their main arguments (e.g. `pnorm()`

is vectorized in the arguments `q`

, `mean`

and `sd`

). Our `pTweedR0`

function cannot be not vectorized directly, because \(k_{min}\) and \(k_{max}\) depend on \(y\) and on the Tweedie parameters. Use `Rcpp`

to create a vectorized C++ version of `pTweedR0`

(that is a version of `pTweedR0`

with accepts vector inputs for `y`

, `mu`

, `phi`

and `p`

) and compare its performance with that of simply using `pTweedR0`

within a `for()`

loop in `R`

**Q2 end**.