SC2

Statistical Computing 2

4. Exercises on Rcpp


Computing the c.d.f. of the Tweedie distribution

Here we consider the Tweedie distribution which includes continuous distributions, such as the normal and gamma, and discrete distributions, such as the Poisson. Its density is \[ p(y|\mu,\phi,p)=a(y,\phi,p)\exp\bigg[\frac{1}{\phi}\{y\theta-\kappa(\theta)\}\bigg], \] where \[ \theta=\frac{\mu^{1-p}}{1-p}\;\;\text{for}\;p\neq1\;\;\;\text{and}\;\;\;\theta=\log\mu\;\;\text{for}\;p=1, \] and \[ \kappa(\theta)=\frac{\mu^{2-p}}{2-p}\;\;\text{for}\;p\neq2\;\;\;\text{and}\;\;\;\theta=\log\mu\;\;\text{for}\;p=2. \] with \(\mu>0\) being the mean, \(\phi>0\) the scale and \(1\leq p\leq2\) is such that \(\text{var}(y)=\mu^{p}\). As explained in Dunn and Smyth (2005), evaluating the Tweedie density requires approximating the factor \(a(y,\phi,p)\), which does not have a closed-form expression, using specifically designed numerical methods. Here we consider the simpler problem of numerically approximating the Tweedie c.d.f. \(P(y|\mu,\phi,p)\), under the assumption that \(a(y,\phi,p)\) is unknown.

For \(1\leq p\leq2\), a Tweedie variable \(y\) can be written as the sum of \(N\) i.i.d. Gamma distributed random variables \(z_{1},\dots,z_{N}\) with shape \(-\alpha\) and scale \(\gamma\) (hence with mean \(-\alpha\gamma\) and variance \(-\alpha\gamma^{2}\)), while \(N\) follows a Poisson distribution with rate \(\lambda\). In terms of the original Tweedie parameters, we have \[ \lambda=\frac{\mu^{2-p}}{\phi(2-p)}, \;\;\; \alpha=(2-p)/(1-p) \;\;\; \text{and} \;\;\; \gamma=\phi(p-1)\mu^{p-1}. \] So, if \(p_{G}\) indicates the c.d.f. of a Gamma distribution with parameters \(-k\alpha\) and \(\gamma\), \(p_{p}\) is the p.m.f. of a Poisson with rate \(\lambda\) and we indicate with \(p(Y < y)\) the Tweedie c.d.f., we have \[ p(Y<y)=p\Big(\sum_{i=1}^{N}z_{i}<y\Big)=\sum_{k=1}^{\infty}p_{G}\Big(\sum_{i=1}^{k}z_{i}<y\Big)p_{P}(N=k), \] where the second equality holds due to the Law of Total Probability. Now, we want to avoid computing that infinite sum, hence we approximate it by \[ p(Y<y)\approx\hat{p}(Y<y)=\sum_{k=k_{min}}^{k_{max}}p_{G}\Big (\sum_{i=1}^{k}z_{i}<y \Big )p_{P}(N=k), \] for some \(k_{min}\) and \(k_{max}\).

How to choose \(k_{min}\) and \(k_{max}\)? Given that \(p_{P}\) is maximal at \(k = \text{floor}(\lambda)\), and then monotonically decreases as we move \(k\) away from the mode, a reasonable approach is to choose \(k_{min}\) and \(k_{max}\) such that \(p_{P}(N=k) < \epsilon p_{P}\{N=\text{floor}(\lambda)\}\) for \(k<k_{min}\) or \(k>k_{max}\), for some small \(\epsilon\). If we do so, it is clear that the a very pessimistic bound on the absolute approximation error is \[ |p(Y<y)-\hat{p}(Y<y)| < p_{P}(N<k_{min} \; \text{or} \; N>k_{max}) = 1 - p_{P}(k_{min}<N<k_{max}), \] which is easy to compute. Of course, \(k_{min}\) and \(k_{max}\) are not known in advance, but we can initialize \(k\) to \(\text{floor}(\lambda)\) and then increase \(k\) until we get to \(p_{P}(N=k) < \epsilon p_{P}\{N=\text{floor}(\lambda)\}\), which means that we have found \(k_{max}\). Then we can set \(k\) to the Poisson mode and decrease \(k\) until we find \(k_{min}\). All this is implemented in the following R function:

pTweedR0 <- function(y, mu, phi, p, eps = 1e-17, log = FALSE){
  
  # Get param lambda, gamma and alpha to be used in gamma and poisson distrib
  la <- mu^(2-p) / ( phi * (2-p) )
  ga <- phi * (p-1) * mu^( p-1 )
  al <- (2-p) / ( 1-p )
  
  # Mode of Poisson is at k = floor(lambda). 
  # If mode is 0, we start from 1 instead.
  k0 <- max(floor(la), 1)
  
  # Poisson density at its mode
  mxlpP <-  dpois(k0, la)
  
  # Get probability contribution at Poisson mode
  pTw <- mxlpP * pgamma(y, shape = - k0 * al, scale = ga)
  
  # Initialize k at mode
  k <- k0 
  lP <- mxlpP
  
  # Sum from mode k = floor(lambda) upward until we find kmax
  while ( lP > mxlpP * eps ){
    k <- k + 1
    lP <- dpois(k, la)
    pTw <- pTw + lP * pgamma(y, shape = - k * al, scale = ga)
  }
  kmax <- k
  
  # Reset k to the Poisson mode, and now go down until we find kmin
  k <- k0
  lP <- mxlpP
  while ( lP > mxlpP * eps && k > 0  ){
    k <- k - 1
    lP <- dpois(k, la)
    pTw <- pTw + lP * pgamma(y, shape = - k * al, scale = ga)
  }
  kmin <- k
  
  if( log ) { pTw <- log( pTw ) }
  
  return( pTw )
  
}

Having defined a function for approximating the Tweedie c.d.f., we need to verify whether it is correct. Given that the true c.d.f. is unknown, we compare our function with the ptweedie function from the tweedie package. Below, we simulate a random vector of Tweedie parameters, and then we compare the (log) c.d.f.s produced by pTweedR0 with those obtained using ptweedie:

library("tweedie")
nsim <- 1e3
mu <- runif(nsim, 0, 10)
phi <- 0.01 + rexp(nsim, 1)
p <- runif(nsim, 1.001, 1.999)

lpr1 <- lpr2 <- rep(NA, nsim)
for(ii in 1:nsim){
  y <- rtweedie(n = 1, mu = mu[ii], phi = phi[ii], power = p[ii])
  lpr1[ii] <- pTweedR0(y = y, mu = mu[ii], phi = phi[ii], p = p[ii], log = TRUE)
  lpr2[ii] <- log(ptweedie(q = y, mu = mu[ii], phi = phi[ii], power = p[ii]))
}

par(mfrow = c(1, 2))
plot(lpr1, lpr2, xlab = "pTweedR0", ylab = "ptweedie") 
abline(0, 1)
plot(lpr1 - lpr2, ylab = "pTweedR0 - ptweedie") 
abline(h = 0)

The log c.d.f.s are very close, why is not surprising as ptweedie uses pretty much the same approximation we are using.

Q1 start: The pTweedR0 function requires explicitly looping in R, because \(k_{min}\) and \(k_{max}\) are not known in advance but must be determined iteratively. To address this, create an Rcpp version of pTweedR0 and compare its computational performance with that of pTweedR0 Q1 end.

Q2 start: Most the function provided by R for evaluating the c.d.f. of a random variable (e.g. pnorm(), pexp()) are vectorized in all their main arguments (e.g. pnorm() is vectorized in the arguments q, mean and sd). Our pTweedR0 function cannot be not vectorized directly, because \(k_{min}\) and \(k_{max}\) depend on \(y\) and on the Tweedie parameters. Use Rcpp to create a vectorized C++ version of pTweedR0 (that is a version of pTweedR0 with accepts vector inputs for y, mu, phi and p) and compare its performance with that of simply using pTweedR0 within a for() loop in R Q2 end.