Reply to F.Spinazzi (translation)

From: Eric BARAN (baran@biomserv.univ-lyon1.fr)
Date: Mon Oct 13 1997 - 20:24:20 MET DST


QUICK AND SEMI-AUTOMATIC TRANSLATION OF D. CHESSEL'S REPLY TO F. SPINAZZI.
I HOPE THIS CAN BE USEFUL TO ENGLISH READERS.
REGARDS.
########################################################################

>1
>I would like to make some observations on a species-centered PCA on a
>site-species count table.
>Could it be that the first component is very often something like
>a size component?
>I found a correlation between the first component and 1/(Simpson's D)
>and/or Shannon's H of up to 0.7.
>

This is the first and most important property of PCA applied to a
faunistic table. Let xij be the abundance of taxon j in sample unit i
and mj the average abundance of that taxon over all sample units. The PCA
centered by taxon studies the quantity xij - mj. Some sample units may be
rich and others poor, so the first factor expresses a size effect. There
are two possibilities:
*** The sample unit is experimentally standardized (1 m2 of ground, 15
minutes of listening, 100 liters of water, ...) and the size effect may be
the main information in the table. This is the case with limiting factors
(salinity, drought) and with pollution studies. In these experimental
cases one should run a PCA and use axis 1 as the best indicator of
richness.
*** The sample unit cannot be standardized (mix of several methods,
unfavorable or favorable weather conditions, sampling methods that are hard
to implement). There are large sample units and small ones, but this is
nuisance information and the PCA centered by taxon should not be used in
this case. The nuisance effect must first be removed by a double
centering:
    xij - mj - mi + m        doubly centered PCA
    xij - ai bj              double multiplicative centering
    n xij / (xi. x.j) - 1    implicit double centering of CA
(mi is the mean of sample unit i and m the overall mean; xi., x.j and n
are the row, column and grand totals).
The simplest reasoning is to say that data = obviousness + structure +
error. What is obviousness? (Example: there are large sample units and
small ones, there are rare species and common species.) One looks for
structures by analyzing the "data minus obviousness" table. The papers by
Austin, Noy-Meir and Orloci in the 1960s-70s are very important on these
questions. They are surprisingly little known today. (Orloci, L. (1966)
Geometric models in ecology. I. The theory and applications of some
ordination methods. Journal of Ecology : 54, 193-215.
Austin, M.P. & Orloci, L. (1966) Geometric models in ecology II An
evaluation of some ordination techniques. Journal of Ecology : 54, 217-227,
see the doc of PCA: Non-centred PCA).
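
As a minimal sketch of the centerings discussed above (this is not ADE-4
code; the function names and the row-by-row storage of the table are
assumptions made for illustration), centering by taxon and double additive
centering could be written as:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout: x is an n-by-p table stored row by row,
   x[i * p + j] = abundance of taxon j in sample unit i. */

/* Centering by taxon: out_ij = x_ij - m_j. */
void center_by_taxon(const double *x, double *out, size_t n, size_t p)
{
    assert(n > 0 && p > 0);
    for (size_t j = 0; j < p; j++) {
        double mj = 0.0;
        for (size_t i = 0; i < n; i++) mj += x[i * p + j];
        mj /= (double)n;
        for (size_t i = 0; i < n; i++) out[i * p + j] = x[i * p + j] - mj;
    }
}

/* Double additive centering: out_ij = x_ij - m_j - m_i + m.
   After centering by taxon, the mean of row i equals m_i - m,
   so subtracting it gives the doubly centered value. */
void double_center(const double *x, double *out, size_t n, size_t p)
{
    center_by_taxon(x, out, n, p);
    for (size_t i = 0; i < n; i++) {
        double ri = 0.0;
        for (size_t j = 0; j < p; j++) ri += out[i * p + j];
        ri /= (double)p;
        for (size_t j = 0; j < p; j++) out[i * p + j] -= ri;
    }
}
```

With a rich unit (10, 20) and a poor unit (1, 2), centering by taxon leaves
rows whose sums are far from zero (the size effect), whereas after double
centering every row and every column sums to zero.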

>>2
>>On frequency tables, if they are expressed as %, every row will sum
>>to 100, for example.
>>In such cases we have compositional data.
>>It does not seem appropriate to use PCA on such a table because of the
>>spurious correlation that could occur between variables.
>>We can use some transformation instead (centered logratio and so on).
>>Do you agree ?
>
Not exactly. Log-ratio centering (Aitchison, J. (1983) Principal component
analysis of compositional data. Biometrika : 70, 57-65) addresses a problem
which comes from geology (especially granulometry, with the famous
clay-silt-sand triangle, and the composition of rocks). The quantity of
matter studied is not controlled and is therefore converted into
percentages. The artefactual covariance due to sum(pi) = 1 matters because
there are few categories, and it generates curved clouds of points in
space. On a large faunistic table this effect is of minor importance and
causes no problem. On the other hand, the choice of how to convert into
percentages is decisive. The question is:
*** is a sample unit a frequency distribution between species?
*** is a species a frequency distribution between sample units?
*** can one hold the two points of view simultaneously?
The only method which meets the third criterion is CA (Thioulouse, J. &
Chessel, D. (1992) A method for reciprocal scaling of species tolerance and
sample diversity. Ecology : 73, 670-680). CA is thus very particular
and should be used only from this point of view, because the price to pay
(arch effect) is high. This is shown by the fact that a PCA on % per sample
unit, a PCA on % per taxon and a CA lead to very different results. The
reference point (origin) in space is also very important. In niche theory,
the model involved is the second one (a species is a frequency
distribution between sample units; it has a mean = optimum and a
variance = amplitude). The reference point is either the profile of all
species together, or the profile of a ubiquitous species evenly distributed
in space. This generally produces very different centered or non-centered
PCAs. Taking weights into account leads to even more possibilities
(see the documentation about Niche: Species Profile PCA, thematic form n°
4.7 Ecological niches and tables matching, and the documentation about non
symmetrical correspondence analysis COA: NSCA_ Row_ Profile and COA:
NSCA_ Col_ Profile). It is therefore necessary to choose a main goal:
*** typology of sample units (taxa are variables), for instance a
phyto-ecological map or an expert report on water quality (a sample unit is
a frequency distribution between species, species are used to order sample
units) --> PCA on sample units in %
*** typology of species (taxa are the objects studied, sample units are the
experimental means: study of niches, competition, biological
interactions) --> PCA on taxa in %
*** Both at once --> COA (rare case!)
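
To make the first two choices concrete, here is a minimal sketch (not
ADE-4 code; the function names, the row-by-row storage and the decision to
leave empty rows or columns at 0 are assumptions of this example) of the
two conversions into percentages. The same count table gives two different
tables, hence two different PCAs:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout: x[i * p + j] = count of taxon j in sample unit i. */

/* % per sample unit: each non-empty row of out sums to 100. */
void percent_by_sample_unit(const double *x, double *out, size_t n, size_t p)
{
    assert(n > 0 && p > 0);
    for (size_t i = 0; i < n; i++) {
        double tot = 0.0;
        for (size_t j = 0; j < p; j++) tot += x[i * p + j];
        for (size_t j = 0; j < p; j++)
            out[i * p + j] = (tot != 0.0) ? 100.0 * x[i * p + j] / tot : 0.0;
    }
}

/* % per taxon: each non-empty column of out sums to 100. */
void percent_by_taxon(const double *x, double *out, size_t n, size_t p)
{
    assert(n > 0 && p > 0);
    for (size_t j = 0; j < p; j++) {
        double tot = 0.0;
        for (size_t i = 0; i < n; i++) tot += x[i * p + j];
        for (size_t i = 0; i < n; i++)
            out[i * p + j] = (tot != 0.0) ? 100.0 * x[i * p + j] / tot : 0.0;
    }
}
```

For the table with rows (1, 3) and (1, 1), the first function gives
(25, 75) and (50, 50) while the second gives (50, 75) and (50, 25).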

>>3
>>Sometimes it happens that a categorical scatterplot (Option|Ellipses) 'does
>>not work' after a Reciprocal Scaling.
>>The module gives the message 'no item in category ...'.
>>When does such a situation occur?
>>I tried to understand the matrix algebra behind Reciprocal Scaling but
>>without great success.
>
In fact the message is issued by the subroutine "compteindiv", which
counts individuals per category. COA: Reciprocal scaling can operate
on empty rows or columns but the graphic program does not accept them.
This comes from the particular nature of COA: Reciprocal scaling. This
option lists all the correspondences of the table (non-null cells) by row
and by column and rewrites the table as triplets: value, row number, column
number. An empty row generates a category with no item and makes the
graphic program crash. The next question shows that the problem has been
well understood.
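
The rewriting described above can be sketched as follows. This is only an
illustration, not the ADE-4 source: the struct, the function names and the
0-based indices are assumptions of the example. It lists the non-null
cells as (value, row, column) triplets, then counts the items per row
category, which is where an empty row shows up as a category with no item:

```c
#include <assert.h>
#include <stddef.h>

/* One correspondence = one non-null cell of the table (hypothetical type). */
struct correspondence {
    double value;
    size_t row;
    size_t col;
};

/* Lists the non-null cells of an n-by-p table stored row by row.
   out must have room for n * p entries. Returns the number of cells kept. */
size_t list_correspondences(const double *x, size_t n, size_t p,
                            struct correspondence *out)
{
    size_t nc = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < p; j++)
            if (x[i * p + j] != 0.0) {
                out[nc].value = x[i * p + j];
                out[nc].row = i;
                out[nc].col = j;
                nc++;
            }
    return nc;
}

/* Counts the correspondences in each of the n row categories and
   returns how many categories have no item at all. */
size_t count_empty_row_categories(const struct correspondence *c, size_t nc,
                                  size_t n, size_t *count)
{
    size_t empty = 0;
    for (size_t i = 0; i < n; i++) count[i] = 0;
    for (size_t k = 0; k < nc; k++) {
        assert(c[k].row < n);
        count[c[k].row]++;
    }
    for (size_t i = 0; i < n; i++)
        if (count[i] == 0) empty++;
    return empty;
}
```

For a table with rows (2, 0), (0, 0) and (1, 3), three correspondences are
listed and the second row is a category with no item.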

>>4
>>To me it seems meaningless to insert in a table sites with no species or
>>vice versa.
>>In fact I noticed that ADE often crashes when trying to perform PCA or COA
>>on such a table.
>>Often, but not always. What happens if some rows or some columns
>>sum to zero?
>>How should we consider the row and column scores in such situations?
>>Are they correct, or was the fact that the program finished the job only
>>the realization of an improbable event?
>
If one runs PCA: Correlation matrix PCA or PCA: Covariance matrix PCA on
DouPoi, there is no problem. PCA tolerates rows of 0. If one
transposes the table DouPoi into A (species in rows) then Covariance matrix
PCA can be run (a column of 0 is accepted: its mean and variance are
null). On the contrary, Correlation matrix PCA fails (there should be an
error message, sorry) because division by the standard deviation is
impossible. COA: COrrespondence Analysis can be run on both
tables. The program is written to tolerate rows and columns of 0 with:
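    for (i = 1; i <= i1; i++) {
        a1 = poili[i];                 /* row weight (poids ligne) */
        if (a1 != 0.0) {               /* skip rows of 0 */
            for (j = 1; j <= j1; j++) {
                a2 = poico[j];         /* column weight (poids colonne) */
                if (a2 != 0.0)         /* skip columns of 0 */
                    w[i][j] = w[i][j] / a1 / a2 - 1;
            }
        }
    }
Thus rows and columns of 0 will have null coordinates.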
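
A self-contained variant of this loop is sketched below so that the effect
on a row of 0 can be checked. It is not the ADE-4 source: the function
name, the 0-based row-by-row storage, and the assumption that the weights
are the marginal relative frequencies (xi./n and x.j/n) are choices made
for this example.

```c
#include <assert.h>
#include <stddef.h>

/* In place: the n-by-p count table x becomes N * x_ij / (x_i. * x_.j) - 1,
   where N is the grand total. rw and cw receive the row and column
   weights (assumed here to be x_i. / N and x_.j / N). Cells in rows or
   columns of weight 0 are skipped and therefore stay at 0. */
void coa_transform(double *x, size_t n, size_t p, double *rw, double *cw)
{
    double total = 0.0;
    for (size_t k = 0; k < n * p; k++) total += x[k];
    assert(total > 0.0);

    for (size_t i = 0; i < n; i++) rw[i] = 0.0;
    for (size_t j = 0; j < p; j++) cw[j] = 0.0;

    /* Relative frequencies and their margins. */
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < p; j++) {
            x[i * p + j] /= total;
            rw[i] += x[i * p + j];
            cw[j] += x[i * p + j];
        }

    /* Same guards as in the loop quoted above. */
    for (size_t i = 0; i < n; i++) {
        if (rw[i] != 0.0) {
            for (size_t j = 0; j < p; j++) {
                if (cw[j] != 0.0)
                    x[i * p + j] = x[i * p + j] / rw[i] / cw[j] - 1.0;
            }
        }
    }
}
```

With rows (2, 0), (0, 0) and (0, 2), the empty second row keeps the values
(0, 0) and a weight of 0, while the other two rows become (1, -1) and
(-1, 1).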
*********************
Therefore centered PCA and COA on tables with rows or columns of 0 can be
run, each with its own logic. If this does not work, please report it.
*********************
COA: Reciprocal scaling also runs with the two tables DouPoi and A. But
this is not a good thing because there is an error in the files _ mvco or A_
mvli (value not allocated). What is surprising is that ScatterClass:
Ellipses, which should crash, runs! This is because there was a bug in
"compteindiv"!! Therefore the question was quite useful and has pointed
out a defect. Thanks! We are going to fix this. It would be logical to
block COA: Reciprocal scaling in case of rows or columns of 0, and things
will be clearer.

>
>>5
>>Which is the best method to represent units each of which is a set of fish
>>on which we count parasites?
>>Via MCA (how many fish have no parasites, how many fish...).
This is another question, which remains to be explored further.

Best regards
Daniel Chessel
>----------------------------------------------------------------
>Universite Lyon 1 - Bat 401C - 69622 Villeurbanne CEDEX - France
>Tel : 04 72 44 82 77 Fax : 04 72 43 11 41
>----------------------------------------------------------------
>ADE-4 sur Internet ---> http://biomserv.univ-lyon1.fr/ADE-4.html
>----------------------------------------------------------------


