Attenuation of species abundance distributions by sampling

Quantifying biodiversity aspects such as species presence/ absence, richness and abundance is an important challenge to answer scientific and resource management questions. In practice, biodiversity can only be assessed from biological material taken by surveys, a difficult task given limited time and resources. A type of random sampling, or often called sub-sampling, is a commonly used technique to reduce the amount of time and effort for investigating large quantities of biological samples. However, it is not immediately clear how (sub-)sampling affects the estimate of biodiversity aspects from a quantitative perspective. This paper specifies the effect of (sub-)sampling as attenuation of the species abundance distribution (SAD), and articulates how the sampling bias is induced to the SAD by random sampling. The framework presented also reveals some confusion in previous theoretical studies.


Introduction
Assessing biodiversity measures in species communities, such as species presence/absence, richness and abundance, has been an important task for many aspects of ecological studies including environmental research, resource management and conservation planning. As the observations we can obtain in ecological studies are often a consequence of sampling from the population of interest, a key challenge in assessing biodiversity is dealing properly with the uncertainty induced by sampling.
Owing to limited time and resources, it is almost impossible to collect the entire species community for identification, counting and weighing, forcing us to take a part of the entire community, or sampling, for investigation. However, whenever a large number of specimens are caught in surveys, collecting a part of the whole

The marine survey
The dataset used in this paper is one of the trawl bycatch samples from the west of Mornington Island in Australia's Northern Prawn Fishery in 1998 [12]. The purpose of the study was to determine the effect of sub-sampling on the estimate of species composition and abundance, particularly due to the samples being taken from different times on the conveyor of seawater hoppers. The catch chosen for this study was sub-sampled and all sub-samples were completely enumerated.
On the research vessel, large animals in the catch were first removed using an aluminium grid (300 mm) and a recirculating seawater hopper was then used to hold the catch before sorting. The entire catch in the hopper was extracted by the sorting conveyor belt and collected into consecutively numbered boxes (sub-sample replicates), each of which was about 10 kg. Each box number represented the chronological order of sub-sample extracted from the seawater hopper. The samples were identified to the lowest taxonomic level possible (mostly species). Where this was not possible, the data were grouped to genus and in a few cases to family. Total numbers and weights were recorded for each species in each sub-sample and entered directly into a relational database. See Heales et al. [12] for more details of the data collection method.
For this study, one trawl catch consisting of 116 species with 13 611 individuals was chosen. This was the largest catch over the trawls with a total catch weight of 334 kg. The entire catch was split into 26 boxes, each of which represents roughly 4% of the entire catch.

Sub-sampling experiment
To investigate the effect of taking different size sub-samples from the whole catch, the contents of several boxes were combined, repeatedly choosing boxes without replacement, and combining the data from each box into a single sub-sample. We varied the number of boxes chosen to generate a range of sub-sample sizes.
Let x kb be the number of individuals of species k in box b, and whether species k is not observed in a subset of all 26 boxes B i ⊆ B = {1, 2, . . . , 26} is defined as t ki = b∈B i I(x kb = 0). Each subset of boxes represents a sub-sample, indexed by i with i = 1, 2, . . . , q. Note here that t ki = 1 means that species k is unobserved in all boxes in the set B i , i.e. completely absent from the ith sub-sample. A set of boxes is randomly selected from the 26 boxes B without replacement, with the number chosen to approximate the required sub-sample size. The possible number of combination of boxes is equal to 26!/{(26 − |B i |)!|B i |!} so we considered q = 500 to be adequate for the experiments. Here, |B i | denotes the number of boxes in the set. However, when the number of boxes in the set B i is one, only 26 experiments are possible.
The empirical probability of species k being absent from the sub-sample,p k , is equal tô Note that 1/q is the unit resolution ofp k in this experiment. Each box contains a different number of individuals, so that the number of individuals in the ith sub-sample, n i , is also different. The sub-sampling ratio r is therefore calculated by x kb . Note that these calculations are based on counts of individuals, where N represents the number of individuals in the catch.
This combining-boxes experiment obviously assumes homogeneity over the sequence of specimens from the seawater hopper, and the model described in §4.1 also makes that assumption. This could be an issue when the assumption fails. However, it should always be expected that a kind of heterogeneity can be somehow unintentionally involved in the samples from field surveys. The detail will be investigated in §5.2 by looking at the discrepancy from the data.

The framework 4.1. Sampling in ecological studies
Given the data described earlier ( §3), a sub-sample, the boxes combined, can be simply considered as an extracted part of the entire catch. We assume here that the population of our interest is the entire catch that consists of N individuals of K species which we sub-sample. Let X = (X 1 , X 2 , . . . , X K ) be the vector of random variables representing the number of individuals of each species observed in a sub-sample. The number of individuals, X = x, acquired by a simple random sampling scheme without replacement, follows a multivariate hypergeometric distribution [14,15]: where m k is the number of individuals of species k in the catch, N = K k=1 m k , and the sub-sample size is n = K k=1 x k . Note that the distribution here assumes that the probability of each individual being in a sub-sample is common regardless of their species or body size, n!(N − n)!/N!.
The procedure of sub-sampling simply deducts x k individuals of species k from m k according to the sub-sampling ratio r = n/N. The expected value of X k is given as E[X k ] = rm k . So it is likely that at least one individual of species k is observed in the sub-sample (x k = 0) if m k is reasonably big with respect to N. This implies that sub-sampling may not affect inferences relating to species presence/absence and richness for abundant species. However, it is not immediately clear for rare species, that is, those that occur in relatively small numbers (abundances).

Species abundance distributions
The SADs are a useful representation of species composition defined as a sequence of the number of species that occur with particular frequency within a community [5]. Often SADs are defined informally in ecological literature, and there seems to have been a confusion referring to different distributions as the same SAD (see §6 for details). To avoid extra confusion, we here give a formal definition of SADs following the description by McGill et al. [5], and adopt it elsewhere in the paper. We consider the SAD of the catch m = (m 1 , m 2 , . . . , m K ) and a sub-sample X = (X 1 , X 2 , . . . , X K ). The mass of a particular frequency j of SADs for the catch and sub-sample are, respectively, defined as is an indicator function. The vectors y = (y 0 , y 1 , . . . , y N ) and Z = (Z 0 , Z 1 , . . . , Z n ) then represent the SAD of m and X; we refer to y as the original SAD and Z as the sample SAD. There are obvious constraints such as n j=0 y j = n j=0 Z j = K, the number of species, and n j=0 jy j = N and n j=0 jZ j = n, the number of individuals in the catch and sub-sample, respectively. Note that it is always true that the number of individuals in the sub-sample is always less than in the catch, i.e. x k ≤ m k , but the number of species that occur with a particular frequency can be greater in the subsample, z j , than in the catch, y j . It is, for example, always true that y 0 = 0 but z 0 ≥ 0; the sample SAD, z, can include zeros, and is typically right skewed.
As defined above, SADs are exactly the same as what is called size index [16] or frequency of frequencies [7,17] in statistics. Note that the term 'distributions' used for SADs here can be a little misleading, and does not mean the distributions in the statistical sense, according to our definition of the SAD. It is not a probability distribution that governs the random variable Z j but a sequence of (random) variables representing the number of species that occur with particular frequency (see §6).

Attenuation of species abundance distributions by sampling
The effect of sub-sampling is investigated as the discrepancy between the SADs of the total catch y and sub-sample z. Given the total number of individuals in the catch N of species composition m (or y) and the number of individuals in the sub-sample of size n, to determine the extent to which subsampling may reduce the original SAD, y, the expected value of the sample SAD, E[Z], is needed. From the definition of Z (equation (4.2)), the expected value is derived as The symbol here (·) k denotes the descending factorial moment that is defined as The hypergeometric distribution in equation (5.2) stems from the multivariate hypergeometric distribution (equation (4.1)) as its marginal distribution. In ecological literature, Dewdney [8] is one of the earliest researchers to have investigated the sampling effect on SADs, and has reached an equivalent result as equation (5.3). Note that E[Z s ] can be considered as a weighted total of the original SAD, y j (equation (5.4)). The coefficient part or weight represents the attenuation factor of a biological quantity. From the definition, its variance is In a more general statistical context, Sibuya [18] also derived the same result in the study of statistical disclosure control, discussing the risk of revealing a singleton identifiable when data are publicly available. It is useful to consider their asymptotic behaviour, that is, when the total number of individuals, N, is relatively large. This is not unreasonable as sub-sampling is commonly taken when catches are large. The asymptotic description of equation (5.4) simplifies to where n * = max{j : y j > 0} = max k {m k } and r is the sub-sampling ratio. This result occurs since Both asymptotic forms involve the binomial form which is mathematically more tractable. The theoretical result (equation (5.4)) is investigated by considering the dataset. Figure 1 shows the appearance of three types of SADs: in the catch, the original SAD, y j , (bar plot), in each one of 26 boxes as a sub-sample, the observed sample SAD, z j , (dots) and the expected sample SAD, E[Z], (red line). The x-axis is restricted to abundances of less than 50 individuals. The mean number of individuals in the sub-samples was calculated as described in §3. 2

Risk of species absence in the sub-sample
In figure 1, there is obviously no species with zero abundance in the catch since only species observed in the catch were reported. Nevertheless, after taking sub-samples, a substantial number (approx. 70-80) of the species were absent, demonstrating the risk of information loss on those species due to sub-sampling. We now quantify this reduction by considering the case when a species is absent from a sub-sample. When s = 0 in equation (5.4), the expected number of missing species in a sub-sample is and its asymptotic form is given as Equation (5.9) therefore represents the risk of a species being absent in the sub-sample. The attenuation factor can now be described as a function of sub-sampling ratio r and species abundance of count j, where (1 − r) is commonly referred to in survey literature as the finite population correction term. When the entire catch is sorted r = 1, and there is no information loss as E[Z 0 ] = 0. An intuitive interpretation of this asymptotic formulation is the risk where all j individuals of the species may not be present in a sub-sample, as (1 − r) simply represents the probability that an individual is not in the sub-sample. The attenuation factor also indicates that the effect of sub-sampling is not equal over all species but dependent on each species' abundance, j. For those species with more than 20 individuals present, the risk of their absence in the sub-sample is relatively low and almost zero even if the sub-sampling ratio was only 20% (figure 2). For rare species, however, the risk of being absent from a sub-sample is high, even for large sub-samples.    The performance of the attenuation factor p(r, j) is investigated using the dataset described in §3. In figure 3, the empirical probability of the kth species absencep k is plotted against species abundance j for the catch when the sub-sample consists of only one box (circles). The superposed red line represents the attenuation factor with r = 0.04 and the dashed lines are for the cases max{r i } and min{r i }, where i = 1, 2, . . . , 26 is a sub-sample replicate. The points are surprisingly well aligned with the theoretical line except for some outlier species that are labelled in figure 3. On close inspection of the data we discovered that these species were observed in particular boxes, which suggests it was a non-random sampling procedure for those species. According to the description of data collection procedure, these species would have been clustered in a particular chronological position on the sorting conveyor belt as Heales et al. [12] discussed. In fact, they identified Amusium pleuronectes (saucer scallop) as such a species.
Similar results for the sub-sampling experiment involving larger sub-samples, namely for r = 0.19, 0.39, 0.58 and 0.77, are displayed in figure 4. Note that the scale of the x-axis is shown as a log scale for illustrative purposes, which is different to that in figure 3. For the larger sub-samples as well, the theoretical line represents the data remarkably well. described by a different approach from ours, in a hierarchical manner adopted by the previous studies [6][7][8][9][10] as

Relationship with previous studies
In equation (6.1), the first term f (s|m), insert sampling formula, defines the sampling process by which the observed abundance s is deducted from the true abundance m, the second term f (m|λ) describes the extent to which the true abundance m varies by chance, and the third term f (λ) determines the variation of abundances among species. By introducing the hierarchical description above, the species abundance m is now regarded as random, although we have instead assumed it to be non-random (known) in our framework ( §4.2). Note that the difference among species, k (k = 1, 2, . . . , K), in equation (6.1) has vanished as the parameter λ now governs the variation of abundances among species. As in a pioneer work by Fisher et al. [6], if f (λ) is chosen to be a gamma distribution and f (s|λ) to be a Poisson distribution, the integration of these becomes a negative binomial distribution, and its zero-truncation of the resulting formula leads to Fisher's log series. If a lognormal distribution is chosen for f (λ), it leads to the discrete lognormal (or Poisson-lognormal) by Preston [19], as Pielou [7] notes.
A key part of accounting for the sampling effect is the first two terms taking the sum with respect to abundance m in equation (6.1). A typical choice for it may be a compound distribution of binomial and Poisson distributions, assuming simple random sampling, which becomes where r is the sampling ratio as the sampling effect. This has been studied among researchers (see e.g. Dewdney [8], Green & Plotkin [9]). Green & Plotking [9] have also suggested a form of negative binomial distributions for f (s|λ) as an alternative, reflecting the idea that the regional sampling effect can be proportional to the mean abundance of the area under study, like the Poisson case above, rE [M]. We, however, note that the sampling effect can be proportional to the mean abundance. This is a special feature of the Poisson distribution, which does not hold for the negative binomial distribution under random sampling settings. Instead of using equation (6.1), Dewdney [8] used equation (5.3) as we did but proposed a Poisson approximation to the hypergeometric term. This is misleading, since such approximation works when N and j or n tend to infinity and jn/N remains equal to λ. Note that both cases rescale j as λ but the original SAD, y j , is also defined on the original scale j. His conclusion has consequently slipped to equation (6.2) that subsequently confuses y j and f (λ) unfortunately, although they are not the same. We emphasize here that λ in equation (6.2) is the average abundance of species so the distribution f (λ) is not the same original SAD, y j , defined by equation (4.2), the number of species that occur with particular frequency within a community, whereas f (λ) defines the distribution of average abundance among species.
In fact, there are variations in ways of describing the SAD among previous studies. Some researchers refer to f (λ) as the SAD [9,10,20]. The normalized expected value of sample SADs, E[Z]/K, is also called relative SADs (e.g. [21]). They look like a probability function as its total mass is one, but we stress here that neither of these is the probability distribution that the number of species that occur with frequency s, Z s , follows.
The discussion above distinguishes our approach from the previous studies that mainly describe how the sample SAD is shaped from individual species abundances defined by f (λ). By contrast, our conditional approach, assuming the original SAD to be known, has shown the mechanism of how the original SAD, y, changes its shape to the sample SAD, E[Z], owing to sampling. As shown (equations (5.4) and (5.6)), this is described as a weighted linear combination of the original SAD, y j , and the attenuation factor, p(s, j), as where j is species abundance (the number of individuals). Equations (6.2) and (6.3) both provide the (expected) sample SAD, although their interpretations are very different. Nevertheless, this fact appears to have been overlooked and perhaps confused in the previous studies.

Asymptotic properties of the attenuation factor
Since the attenuation factor is asymptotically derived from equation (5.8), it is important to investigate appropriate conditions for use in actual situations. Figure 5 shows its asymptotic behaviour for different sizes of the catch, N. Note that for the case j = 1, E[Z 0 ] is not an approximation but an exact result (equation (5.8)). The approximation error tends not to be ignorable for species with relatively large abundance j when N is relatively small. However, as j increases, the size of attenuation factor decreases. By contrast, for large N, the approximation error tends to be quite small. In fact, little difference can be observed when N = 300 (figure 5). Given that sub-sampling of the catch is common when N is large, this asymptotic approximation may work reasonably well in general situations.

Sub-sampling ratio based on weight
Up to this section, the sub-sampling ratio has been based on count of individuals, but this requires counting the number of individuals N of the entire catch; therefore, the sub-sampling ratio is often calculated using weights in marine surveys. For this situation, the sub-sampling ratio based on weight r w can be expressed replacing r i by where W is the total weight of the catch andw k is mean individual weight of species k. If r ≈ r w , the subsampling ratio based on weight r w is a substitute for r. It may be difficult to address a general condition where this assumption is satisfied. However, the sub-sampling experiment undertaken in this paper shows that the calculated sub-sampling ratio based on volume (number of boxes), r w = 0.0385, was very close to that based on count, r = 0.04, making it feasible to use r w as an approximation of r. Using one dataset as an example in this study, it would be too ambitious to say that we can always expect this kind of agreement. This implies another challenge of sampling issues, calculating an adequate value of the (sub-)sampling ratio, r, in field surveys. The nature of sampling stands on the fact that sampling is done on individual basis which may not follow the actual sampling protocol in surveys. Sampling can sometimes be undertaken instead by arbitrary sampling unit, such as weight, boxes and quadrants for example. This may cause a divergence in calculating r when heterogeneity is involved among the sampling units; see Gotelli & Colwell [22] for a comprehensive review of the distinction between individual-and sample-based data. This is not the limitation of the framework discussed here but an important strategy that needs to be considered before taking surveys, the sampling design and the way of analysing the data collected by the design.

Modelling over-dispersion
Our results show that any prediction of biodiversity such as species presence/absence and richness may need to be qualified if inferences from sub-sampled data are based on the common approach of normalizing the biological response by sub-sampling ratio. In fact, the prediction from sub-sampled data will underestimate species presence/absence and richness. For species presence/absence, it will be discounted by (1 − r) j . For species richness, sub-sampling causes a bias E[Z 0 ], so inferences made by using sub-samples tend to estimate fewer species than the actual richness K. As a consequence, the prediction model will be over-dispersed even if the assumed model was correct. The impact upon rare species can be more severe in the resource management context as there is a high risk of missing rare species completely.

Concluding remarks
The effect of (sub-)sampling on SADs, the extent to which the shape of original SADs is altered by sampling, has been quantified. The framework presented stands on simple random sampling of individuals from a finite population. The key to understanding the sampling effect has been identified as the attenuation factor, presented in both explicit and asymptotic forms. Reasonable agreement between the theory and data supports the potential remark of our theoretical result. Throughout this paper, the results have been investigated and illustrated with the example data from a marine survey, but the proposed framework is not limited to marine applications, as among sampling taken in ecological studies there is a commonality: individuals (or equivalent) are, no matter what the sampling unit is, always the sampling target, that forms the sampling framework. However, the way of calculating (sub-)sampling ratio can be a slightly different story, and may depend upon its sampling unit, as heterogeneity among the sampling units, a divergence from the assumption of random sampling, can be involved.
The attenuation factor assumes that the sub-sample is taken randomly, a standard requirement of sampling design. This aspect may play a key role in some extent of further research directions. The performance or bias of (sub-)sampling procedures can be evaluated as the unexpected departure from the size of attenuation factor. The attenuation factor can be regarded as the benchmark of any (sub-) sampling process.
Taken together, our conditional approach has addressed the sampling effect on SADs as an attenuation factor, a function of sub-sampling ratio r and species abundance j. The sampling effect is, importantly, shown to be uneven among species, largely dependent on their abundance (number of individuals).
Data accessibility. The dataset used in this study can be found in the electronic supplementary material.