This is a copy of the vignette of package parallelpam
.
It is included here since parallelpam is underlying this package and you
will need to know how does it work to process the single cell data using
the PAM algorithm. But you must NOT load package
parallelpam
. All the functions detailed below are already
included into scellpam
and are available just by loading it
with library(scellpam)
.
library(scellpam)
The parallelpam
package (Domingo
(2023c)) is meant as a way to apply the PAM algorithm to quite
(or very) big sets of data, such as the results of single-cell
sequencing, but can be generally used for any type of data, as long as a
distance/dissimilarity matrix can be calculated.
Differently to other packages, its main strength is its ability to perform clustering based on Partitioning Around Medoids (PAM) using a large number of items and doing it in parallel. Memory and speed limitations are reduced by extensive use of C++ programming which allows use of alternative data types (for instance, float vs. double to represent distance/dissimilarity), intermediate disk-storage to avoid memory copy operations whenever possible and use of threads.
Both phases of PAM (initialization with BUILD
and
optimization) have been parallelized so it you have a multi-core
machine, many threads will be launched with a great acceleration of the
calculations. This is done automatically, even you are allowed to choose
the number of threads yourself for comparison purposes or to allow your
machine to execute other things simultaneously.
Also, calculation of the matrix of distances/dissimilarities from the initial data and calculation of silhouette of the resulting clusterization are available, too, and are calculated in parallel in a multi-core machine.
The data are stored in memory reading them from binary files created
with the package jmatrix
(Domingo
(2023b)). To be familiar with them please read the vignette
called jmatrixsc
which is included with this package.
WARNING: you must NOT load neither jmatrix
nor
parallelpam
explicitly. Indeed, you do not need even to
install them. All their functions have been included here, too, so doing
library(scellpam)
is enough.
First of all, the package can show quite informative (but sometimes verbose) messages in the console. To turn on/off such messages you can use
# Initially, state of debug is FALSE. Turn it on exclusively for the
# parallelpam part with
ScellpamSetDebug(FALSE,debparpam=TRUE)
#> Debugging for parallelpam inside scellpam package set to ON.
# There is another parameter, debjmat, to turn on messages about
# binary matrix creation/manipulation. By default is FALSE but turn it on
# if you like with
# ScellpamSetDebug(FALSE,debparpam=TRUE,debjmat=TRUE)
The first step is to load raw data coming from external sources like
the main formats used in single cell experiments which should have been
stored as a binary matrix file in jmatrix
format. Since
this is a separate package, and for purposes of illustration, we will
create an artificial matrix for a small problem that fits in
R
memory with 5000 vectors with 500 dimensions each. Then
we will calculate the dissimilarity matrix and finally we will apply the
PAM algorithm to it.
# Create the matrix with row names V1 to V5000 and column names d1 to d500
<-5000
nvec<-500
ndim<-matrix(runif(nvec*ndim),nrow=nvec)
Prownames(P)<-paste0("V",1:nvec)
colnames(P)<-paste0("d",1:ndim)
# Write it to disk as a binary file in jmatrix format. Please,
# see vignette jmatrixsc.
JWriteBin(P,"datatest.bin",dtype="float",dmtype="full",
comment="Synthetic problem data to test PAM")
For your real problem, the input format can be a .csv
file. See function CsvToJMat
in package
scellpam
(Domingo
(2023a)).
To know details about the generated files do
JMatInfo("datatest.bin")
#> File: datatest.bin
#> Matrix type: FullMatrix
#> Number of elements: 2500000
#> Data type: float
#> Endianness: little endian (same as this machine)
#> Number of rows: 5000
#> Number of columns: 500
#> Metadata: Stored names of rows and columns.
#> Metadata comment: "Synthetic problem data to test PAM"
This is the most computationally intensive part of the process
(particularly, for samples with a high number of points and/or high
dimensionality) and therefore has been programmed in parallel, taking
advantage of the multiple cores of the machine, if available. The
funcion is called CalcAndWriteDissimilarityMatrix
. Its
input and output files (first and second parameters) are of course
compulsory. Input file can be a sparse of full binary
jmatrix
(but obviously, not a symmetric matrix).
WARNING: notice that the vectors to calculate dissimilarities amongst them MUST be stored BY ROWS. This is due to efficiency reasons.
Output dissimilarity matrix will always be a binary symmetric
(obviously square) matrix with a number of rows (and columns) equal to
the number of rows (in this case, vectors) of the input file. The type
of distance/dissimilarity can be L1
(Manhattan distance),
L2
(Euclidean distance) or Pearson
(Pearson
dissimilarity coefficient). The resulting matrix stores only names for
the rows, which are the names of the vectors stored as rows in file
datatest.bin
. If the number of vectors is \(N\), only \(N(N+1)/2\) dissimilarity values are really
stored.
A note on the number of threads, valid also for other algorithms that will be explained later:
Possible values for the number of threads are:
-1
(or any negative number) to indicate you do not
want to use threads (strictly sequential computation).
0
to allow the program to choose the number of
threads according to the problem size and the number of available
cores.
Any positive number to force the use of such number of threads.
Choosing explicitly a number of threads bigger than the number of available cores is allowed, but discouraged and the program emits a warning about it.
With respect to option 0
, for small problems (in this
case, less than 1000 vectors) the function makes the choice of not using
threads, since the overhead of opening and waiting termination is not
worth. For bigger problems the number of chosen threads is the number of
available cores, or twice this number if the processor is capable of
hyperthreading. Nevertheless, this choice may not be the best, depending
on your machine, possibly due (I guess) to the memory access conflicts
created by the need of keep processor cache coherence. You may have to
do some trials with your data in your machine.
Now, let us try it with this small dataset.
CalcAndWriteDissimilarityMatrix("datatest.bin","datatestL2.bin",
distype="L2",restype="float",
comment="L2 distance for vectors in
jmatrix file datatest.bin",nthreads=-1)
#> Input matrix is a full matrix with elements of type 'float' and size (5000,500)
#> Read full matrix from file datatest.bin. Its size is [5000 x 500] and it uses 2500000 elements of 4 bytes each with accounts for 9.53674 MBytes.
#> Creating dissimilarity matrix of size (5000x5000)
#> Loading required package: memuse
#> Package memuse is installed. OK.
#> Memory used by the matrix: 48837 KiB, which is 0.01% of the available memory, which is 255859508 Kib.
#> That seems OK.
#> End of dissimilarity matrix calculation (serial version). Elapsed time: 19.7453 s
#> Output binary file datatestL2.bin written.
WARNING: the normal way of calling
CalcAndWriteDissimilarityMatrix
would use nthreads=0 to
make use of all available cores in your machine. Nevertheless, this does
not seem to be allowed by CRAN to pass the test so I have had to use the
serial version invoked with nthreads=-1. In your normal use of code try
always nthreads=0.
The resulting matrix is stored as a binary symmetric matrix of float values, as we can check.
JMatInfo("datatestL2.bin")
#> File: datatestL2.bin
#> Matrix type: SymmetricMatrix
#> Number of elements: 25000000 (12502500 really stored)
#> Data type: float
#> Endianness: little endian (same as this machine)
#> Number of rows: 5000
#> Number of columns: 5000
#> Metadata: Stored only names of rows.
#> Metadata comment: "L2 distance for vectors in
#> jmatrix file datatest.bin"
The last step is to take the previously calculated matrix and apply
the Partitioning Around Medoids classifier. Function is
ApplyPAM
. First parameter (name of the file containing the
dissimilarity matrix in jmatrix
format) and second
parameter (k
, number of medoids) are compulsory. The names
and default values for the rest of parameters are as in this
example.
=ApplyPAM("datatestL2.bin",k=5,init_method="BUILD",max_iter=1000,
Lnthreads=-1)
#> Reading symmetric distance/dissimilarity matrix datatestL2.bin
#> Package memuse is installed. OK.
#> Memory used by the matrix: 48837 KiB, which is 0.01% of the available memory, which is 256178316 Kib.
#> That seems OK.
#> Matrix is a correct distance/dissimilarity matrix.
#> Starting BUILD initialization method, serial version
#> WARNING: all successive messages use R-numbering (from 1) for points and medoids. Substract 1 to get the internal C-numbers.
#> Looking for medoid 1. Medoid 1 found. Point 2987. TD=8.798232
#> Looking for medoid 2. Medoid 2 found. Point 2579. 2458 reassigned points. TD=8.692621
#> Looking for medoid 3. Medoid 3 found. Point 1126. 1581 reassigned points. TD=8.638234
#> Looking for medoid 4. Medoid 4 found. Point 4460. 1120 reassigned points. TD=8.602614
#> Looking for medoid 5. Medoid 5 found. Point 3863. 891 reassigned points. TD=8.577416
#> Current TD: 8.577416
#> BUILD initialization method (serial version) finished. Elapsed time: 0.317203 s
#> Starting improved FastPAM1 method in serial implementation...
#> WARNING: all successive messages use R-numbering (from 1) for points and medoids. Substract 1 to get the internal C-numbers.
#> Iteration 0. Exiting, since DeltaTDst is 0.000000. Final value of TD is 8.577407
#> Optimization method (serial version) finished. Elapsed time: 0.0614121 s
#> Time summary (serial implementation).
#> Initalization: 0.317203 s (method BUILD).
#> Optimization: 0.0614121 s in 0 iterations.
#> Total time: 0.378615 s (0 minutes, 0.378615 seconds).
WARNING: the normal way of calling ApplyPAM
would use
nthreads=0 to make use of all available cores in your machine.
Nevertheless, this does not seem to be allowed by CRAN to pass the test
so I have had to use the serial version invoked with nthreads=-1. In
your normal use of code try always nthreads=0.
Parameters init_method
(and another optional parameter,
initial_med
) deserve special comment. The first is the
method to initialize the medoids. Its possible values are
BUILD
, LAB
and PREV
. The rest of
the algorithm make medoid swapping between the points of the initial set
made with BUILD
or LAB
and the rest of points
until no swap can reduce the objective function, which is the sum of
distances of each point to its closest medoid. But this may fall (and
indeed falls) in local minima. If you initialize with BUILD
or LAB
the optional parameter initial_med
cannot be used.
The initialization methods BUILD
and LAB
are described in the paper from Schubert at al. (Schubert and Rousseeuw (2019)).
BUILD
is deterministic. LAB
uses a sample of
the total points to initialize. Obviously, you can run LAB
to get different initializations and compare the results.
The returned object is a list with two fields: med
and
clasif
. This will be explained later.
From now on, typical calls to obtain only the initial medoids would be
=ApplyPAM("datatestL2.bin",k=5,init_method="BUILD",max_iter=0,nthreads=-1)
Lbuild#> Reading symmetric distance/dissimilarity matrix datatestL2.bin
#> Package memuse is installed. OK.
#> Memory used by the matrix: 48837 KiB, which is 0.01% of the available memory, which is 256142540 Kib.
#> That seems OK.
#> Matrix is a correct distance/dissimilarity matrix.
#> Starting BUILD initialization method, serial version
#> WARNING: all successive messages use R-numbering (from 1) for points and medoids. Substract 1 to get the internal C-numbers.
#> Looking for medoid 1. Medoid 1 found. Point 2987. TD=8.798232
#> Looking for medoid 2. Medoid 2 found. Point 2579. 2458 reassigned points. TD=8.692621
#> Looking for medoid 3. Medoid 3 found. Point 1126. 1581 reassigned points. TD=8.638234
#> Looking for medoid 4. Medoid 4 found. Point 4460. 1120 reassigned points. TD=8.602614
#> Looking for medoid 5. Medoid 5 found. Point 3863. 891 reassigned points. TD=8.577416
#> Current TD: 8.577416
#> BUILD initialization method (serial version) finished. Elapsed time: 0.320641 s
#> Time summary (serial implementation).
#> Initalization: 0.320641 s (method BUILD).
#> Optimization: 0 s in 0 iterations.
#> Total time: 0.320641 s (0 minutes, 0.320641 seconds).
=ApplyPAM("datatestL2.bin",k=5,init_method="LAB",max_iter=0,nthreads=-1)
Llab1#> Reading symmetric distance/dissimilarity matrix datatestL2.bin
#> Package memuse is installed. OK.
#> Memory used by the matrix: 48837 KiB, which is 0.01% of the available memory, which is 256184624 Kib.
#> That seems OK.
#> Matrix is a correct distance/dissimilarity matrix.
#> Starting LAB initialization method, serial version.
#> WARNING: all successive messages use R-numbering (from 1) for points and medoids. Substract 1 to get the internal C-numbers.
#> Looking for medoid 1. Medoid 1 found. Point 1125. TD=8.818431
#> Looking for medoid 2. Medoid 2 found. Point 3602. 1972 reassigned points. TD=8.739282
#> Looking for medoid 3. Medoid 3 found. Point 3049. 1178 reassigned points. TD=8.700571
#> Looking for medoid 4. Medoid 4 found. Point 3189. 1341 reassigned points. TD=8.654654
#> Looking for medoid 5. Medoid 5 found. Point 1210. 735 reassigned points. TD=8.635015
#> Current TD: 8.635015
#> LAB initialization method (serial version) finished. Elapsed time: 0.00125725 s
#> Time summary (serial implementation).
#> Initalization: 0.00125725 s (method LAB).
#> Optimization: 0 s in 0 iterations.
#> Total time: 0.00125725 s (0 minutes, 0.00125725 seconds).
=ApplyPAM("datatestL2.bin",k=5,init_method="LAB",max_iter=0,nthreads=-1)
Llab2#> Reading symmetric distance/dissimilarity matrix datatestL2.bin
#> Package memuse is installed. OK.
#> Memory used by the matrix: 48837 KiB, which is 0.01% of the available memory, which is 256144296 Kib.
#> That seems OK.
#> Matrix is a correct distance/dissimilarity matrix.
#> Starting LAB initialization method, serial version.
#> WARNING: all successive messages use R-numbering (from 1) for points and medoids. Substract 1 to get the internal C-numbers.
#> Looking for medoid 1. Medoid 1 found. Point 2373. TD=8.857759
#> Looking for medoid 2. Medoid 2 found. Point 4976. 2170 reassigned points. TD=8.765445
#> Looking for medoid 3. Medoid 3 found. Point 3388. 1577 reassigned points. TD=8.711335
#> Looking for medoid 4. Medoid 4 found. Point 2417. 1151 reassigned points. TD=8.674360
#> Looking for medoid 5. Medoid 5 found. Point 2452. 973 reassigned points. TD=8.645628
#> Current TD: 8.645628
#> LAB initialization method (serial version) finished. Elapsed time: 0.0012494 s
#> Time summary (serial implementation).
#> Initalization: 0.0012494 s (method LAB).
#> Optimization: 0 s in 0 iterations.
#> Total time: 0.0012494 s (0 minutes, 0.0012494 seconds).
WARNING: the normal way of calling ApplyPAM
would use
nthreads=0 to make use of all available cores in your machine.
Nevertheless, this does not seem to be allowed by CRAN to pass the test
so I have had to use the serial version invoked with nthreads=-1. In
your normal use of code try always nthreads=0. For the LAB method this
does not matter, since parallel implementation is not yet provided.
As it can be seen, to get and compare different initializations you
must set the parameter max_iter
to the value
0
. In this case no iterations of objective function
reduction are performed, and the returned object contains the initial
medoids and the classification induced by them. Notice that even looking
equal, the results of the latter two calls are different since
LAB
initializes with a random component (the sample to
choose initial medoids is chosen randomly).
You can check that the medoids, stored in Llab1$med
and
Llab2$med
(see more on this below) are in general
different.
Now, these results can be used to initialize PAM
if you
find that any of them contains a specially good set of medoids. This is
the role of method PREV
that we have mentioned before. A
typical call would be
=ApplyPAM("datatestL2.bin",k=5,init_method="PREV",
Llab2Finalinitial_med=Llab2$med,nthreads=-1)
#> Reading symmetric distance/dissimilarity matrix datatestL2.bin
#> Package memuse is installed. OK.
#> Memory used by the matrix: 48837 KiB, which is 0.01% of the available memory, which is 256144296 Kib.
#> That seems OK.
#> Matrix is a correct distance/dissimilarity matrix.
#> Starting improved FastPAM1 method in serial implementation...
#> WARNING: all successive messages use R-numbering (from 1) for points and medoids. Substract 1 to get the internal C-numbers.
#> Iteration 0. Medoid at place 2 (point 4976) swapped with point 2988; TD-change=-0.027064; TD=8.618556. 1621 reassigned points.
#> Iteration 1. Medoid at place 4 (point 2417) swapped with point 1126; TD-change=-0.015580; TD=8.602976. 1372 reassigned points.
#> Iteration 2. Medoid at place 3 (point 3388) swapped with point 2579; TD-change=-0.012838; TD=8.590137. 1293 reassigned points.
#> Iteration 3. Medoid at place 5 (point 2452) swapped with point 4460; TD-change=-0.010259; TD=8.579878. 1119 reassigned points.
#> Iteration 4. Medoid at place 1 (point 2374) swapped with point 3863; TD-change=-0.002470; TD=8.577408. 1096 reassigned points.
#> Iteration 5. Exiting, since DeltaTDst is 0.000000. Final value of TD is 8.577408
#> Optimization method (serial version) finished. Elapsed time: 0.383343 s
#> Time summary (serial implementation).
#> Initalization: 0 s (method PREV).
#> Optimization: 0.383343 s in 4 iterations (0.0958359 seconds/iteration).
#> Total time: 0.383343 s (0 minutes, 0.383343 seconds).
The initial set of medoids is taken from the object returned by the former calls.
With respect to that object, as we said it is a list with two
vectors. The first one, L$med
, has as many components as
requested medoids and the second, L$clasif
, has as many
components as instances.
Medoids are expressed in L$med
by its number in the
array of vectors (row number in the dissimilarity matrix) starting at 1
(R
convention).
L$clasif
contains the number of the medoid (i.e.: the
cluster) to which each instance has been assigned, according to their
order in L$med
(also from 1).
This means that if L$clasif[p]
is m
, the
point p
belongs to the class grouped around medoid
L$med[m]
. Let us see it.
# Which are the indexes of the points chosen as medoids?
$med
L#> V2988 V2579 V1126 V4460 V3863
#> 2988 2579 1126 4460 3863
#
# In which class has point 147 been classified?
$clasif[147]
L#> V147
#> 4
#
# And which is the index (row in the dissimilarity matrix)
# of the medoid closest to point 147?
$med[L$clasif[147]]
L#> V4460
#> 4460
In this way, values in L$clasif
are between 1 and the
number of medoids, as we can see:
min(L$clasif)
#> [1] 1
max(L$clasif)
#> [1] 5
They can be used as factors.
It is interesting to filter points based on the degree in which they belong to a cluster. Indeed, cluster refinement can be done getting rid of points far away from any cluster center, or which are at a similar distance of two or more of them.
This is characterized by the silhouette of each point. Three
functions deal with this: CalculateSilhouette
,
FilterBySilhouetteQuantile
and
FilterBySilhouetteThreshold
.
=CalculateSilhouette(Llab2$clasif,"datatestL2.bin",nthreads=-1)
S#> Package memuse is installed. OK.
#> Memory used by the matrix: 48837 KiB, which is 0.01% of the available memory, which is 256107440 Kib.
#> That seems OK.
#> Calculating silhouette (serial implementation)...
#> 5000 points classified in 5 classes.
#> Finished serial implementation of silhouette (including dissimilarity matrix load). Elapsed time: 0.0808602 s
WARNING: the normal way of calling CalculateSilhouette
would use nthreads=0 to make use of all available cores in your machine.
Nevertheless, this does not seem to be allowed by CRAN to pass the test
so I have had to use the serial version invoked with nthreads=-1. In
your normal use of code try always nthreads=0.
The parameters to function CalculateSilhouette
are the
array of class membership, as returned by ApplyPAM
in its
clasif
field, and the file with the matrix of
dissimilarities.
A parallel implementation has been programmed, being nthreads as explained before.
Silhouette is a number in \([-1,1]\); the higher its value, the most centered a point is in its cluster.
The returned object S
is a numeric vector with the value
of the silhouette for each point, which will be a named vector if the
classification vector was named.
This vector can be converted to an object of the class
cluster:silhouette
with the function
NumSilToClusterSil
(which needs the vector of
classifications, too). This is done so that, if you load the package
cluster
(Maechler et al.
(2022)), plot will generate the kind of silhouette plots included
in such package.
If the package cluster is installed you can try to execute this: (Sorry, we can’t try ourselves since we don’t know if cluster is installed in your system and the CRAN check does not allow the use of installed.packages to test it)
<- NumSilToClusterSil(Llab2$clasif,S)
Sclus library(cluster)
plot(Sclus)
Probably the plot does not look very nice with this random data which
yields a very bad clustering (since the points are not, by its own
nature, organized in clusters) but with real data you should see
significant things (see package scellpam
(Domingo (2023a))).
Once the silhouette is calculated we can filter it by quantile or by threshold. All points under this quantile or threshold will be discarded, except if they are medoids. Parameters are:
The silhouette, as returned by
CalculateSilhouette
.
The list of medoids/clasif, as returned by
ApplyPAM
.
The file with matrix of counts for the whole set of cells, which was our first input.
The file that will contain the matrix of counts of the remaining cells.
The file with the dissimilarity matrix for the whole set, as
calculated by CalcAndWriteDissimilarityMatrix
.
The file that will contain the dissimilarity for the remaining cells.
And (depending on the function used) the quantile in \([0,1]\) or the silhouette threshold in \([-1,1]\).
As an example,
=FilterBySilhouetteQuantile(S,Llab2,"datatest.bin",
Lfilt"datatestFilt.bin","datatestL2.bin",
"datatestL2Filt.bin",0.2)
If the original matrix contained row and column names, the column names are copied and the row names are transported for those rows that remain. The same happens with respect to rows of the dissimilarity matrix.
Notice that the new dissimilarity matrix could have been calculated
from the matrix of filtered counts with
CalcAndWriteDissimilarityMatrix
but creating it here,
simply getting rid of the filtered rows and columns is much faster.
Also, if a medoid is below the silhouette quantile, it will not be filtered out, but a warning message will be shown, since this is a strange situation that may indicate that some of your clusters are not real but artifacts due to a few outliers that are close to each other.
But remember that this was the result of the first step of the PAM algorithm, so probably you will want to make them iterate.
=ApplyPAM("datatestL2Filt.bin",k=length(Lfilt$med),
Lfinalinit_method="PREV",initial_med=Lfilt$med,nthreads=-1)
#> Reading symmetric distance/dissimilarity matrix datatestL2Filt.bin
#> Package memuse is installed. OK.
#> Memory used by the matrix: 31257 KiB, which is 0.01% of the available memory, which is 256092544 Kib.
#> That seems OK.
#> Matrix is a correct distance/dissimilarity matrix.
#> Starting improved FastPAM1 method in serial implementation...
#> WARNING: all successive messages use R-numbering (from 1) for points and medoids. Substract 1 to get the internal C-numbers.
#> Iteration 0. Medoid at place 2 (point 3981) swapped with point 2385; TD-change=-0.021357; TD=8.605878. 1226 reassigned points.
#> Iteration 1. Medoid at place 3 (point 2714) swapped with point 906; TD-change=-0.012773; TD=8.593104. 1049 reassigned points.
#> Iteration 2. Medoid at place 4 (point 1935) swapped with point 2063; TD-change=-0.010820; TD=8.582285. 989 reassigned points.
#> Iteration 3. Medoid at place 5 (point 1964) swapped with point 3422; TD-change=-0.003460; TD=8.578824. 901 reassigned points.
#> Iteration 4. Medoid at place 1 (point 1899) swapped with point 3080; TD-change=-0.000778; TD=8.578047. 860 reassigned points.
#> Iteration 5. Exiting, since DeltaTDst is 0.000000. Final value of TD is 8.578047
#> Optimization method (serial version) finished. Elapsed time: 0.244156 s
#> Time summary (serial implementation).
#> Initalization: 0 s (method PREV).
#> Optimization: 0.244156 s in 4 iterations (0.0610389 seconds/iteration).
#> Total time: 0.244156 s (0 minutes, 0.244156 seconds).
WARNING: the normal way of calling ApplyPAM
would use
nthreads=0 to make use of all available cores in your machine.
Nevertheless, this does not seem to be allowed by CRAN to pass the test
so I have had to use the serial version invoked with nthreads=-1. In
your normal use of code try always nthreads=0.
Of course, we might have used simply 5 as number of medoids,
k
, since this does not change by filtering, but this is to
emphasize the fact that ApplyPAM
with method
PREV
requires both parameters to be consistent.
The user might want to compare this PAM implementation with others
provided for instance by packages cluster
(Maechler et al. (2022)) or ClusterR
(Mouselimis (2023)). In
cluster
the input is either the data matrix (so the
distance matrix is calculated inside the pam function) or directly the
distance matrix but as a R vector with the lower-diagonal part of the
symmetric matrix ordered by columns (i.e.: column 1 from M(2,1) to
M(n,1), followed by column 2 from M(3,2) to M(n,2) and so on, up to
M(n,n-1). This is a vector of \(n(n-1)/2\) components. To facilitate such
comparison the function GetSubdiag
is provided which takes
as input the jmatrix file with the distance matrix and returns the
vector to be passed to pam in the aforementioned packages.
= GetSubdiag("datatestL2.bin") d
Then, explicit comparison could be done with:
(Sorry, we can’t try ourselves since we don’t know if cluster is installed in your system and the CRAN check does not allow the use of installed.packages to test it)
library(cluster)
= pam(d,diss=TRUE,k=5)
clusterpam print(sort(clusterpam$id.med))
print(sort(L$med))
Similarly, you can check against the ClusterR package. In this package you need the complete dissimilarity matrix to be passed so we have to get it:
# Be patient, this may take some time...
= GetJManyRows("datatestL2.bin",seq(1:nvec)) Dm
and then
library(ClusterR)
= Cluster_Medoids(Dm,clusters=5)
ClusterRpam print(sort(ClusterRpam$medoid_indices))
print(sort(L$med))
In all cases we tried with this simple (but random) example results
were the same. In other cases with a large number of points some medoids
were different in the different implementations but the value of the
function to minimize (sum of distances) was always the same, indicating
that they were equivalent minima. You can test this with function
GetTD
as follows:
= GetTD(L,"datatestL2.bin")
TDparallelpam
# This is to adapt cluster package output format to ours, since this is what our GetTD function expects...
= list()
Lcl $med = clusterpam$id.med
Lcl$clasif = clusterpam$clustering
Lcl= GetTD(Lcl,"datatestL2.bin")
TDcluster
# The same with ClusterR package:
= list()
LclR $med = ClusterRpam$medoid_indices
LclR$clasif = ClusterRpam$clusters
LclR= GetTD(LclR,"datatestL2.bin") TDClusterR
and see that variables TDparallelpam
,
TDcluster
and TDClusterR
are equal.