# Preliminary steps

Start by loading the necessary packages (which must be installed
beforehand if they are not already installed) : `TraMineR`

and `TraMineRextras`

for sequence analysis, `cluster`

and `WeightedCluster`

for clustering, `FactoMineR`

et `ade4`

for
correspondence analysis, `RColorBrewer`

for color
palettes, `questionr`

et `descriptio`

for descriptive statistics, `dplyr`

et `purrr`

for data
manipulation, `ggplot2`

for plots and `seqhandbook`

which accompanies the handbook.

```
library(TraMineR)
library(TraMineRextras)
library(cluster)
library(WeightedCluster)
library(FactoMineR)
library(ade4)
library(RColorBrewer)
library(questionr)
library(descriptio)
library(dplyr)
library(purrr)
library(ggplot2)
library(seqhandbook)
```

The data used in the handbook is then loaded. The data on employment
trajectories are in the data frame `trajact`

: there are 500
observations, i.e. 500 individual trajectories, and 37 variables,
corresponding to the activity status observed each year between the ages
of 14 and 50.

```
# loading the trajectories
data(trajact)
str(trajact)
```

```
'data.frame': 500 obs. of 37 variables:
$ sact14: num 2 1 1 1 1 1 1 1 1 1 ...
$ sact15: num 2 1 1 1 1 1 1 1 1 1 ...
$ sact16: num 2 1 1 1 1 1 1 1 1 1 ...
$ sact17: num 2 1 1 1 1 1 2 2 2 2 ...
$ sact18: num 2 1 2 2 1 1 2 2 2 2 ...
$ sact19: num 2 3 2 2 1 1 2 2 2 2 ...
$ sact20: num 2 3 2 6 2 2 5 2 6 2 ...
$ sact21: num 2 3 2 2 2 2 5 6 6 2 ...
$ sact22: num 2 3 2 2 2 2 5 6 2 2 ...
$ sact23: num 2 3 2 2 2 2 5 2 2 2 ...
$ sact24: num 2 3 2 2 2 5 3 2 2 2 ...
$ sact25: num 2 3 2 2 2 5 3 2 2 2 ...
$ sact26: num 2 3 2 2 2 5 3 2 2 2 ...
$ sact27: num 2 3 2 2 2 2 3 2 2 2 ...
$ sact28: num 2 3 5 2 2 2 3 2 2 2 ...
$ sact29: num 2 3 2 2 2 2 3 2 2 2 ...
$ sact30: num 2 3 5 2 2 2 3 2 2 2 ...
$ sact31: num 2 3 5 2 2 2 3 2 2 2 ...
$ sact32: num 2 3 5 2 2 2 5 2 2 2 ...
$ sact33: num 2 3 5 2 2 2 5 2 2 2 ...
$ sact34: num 2 3 5 2 2 2 5 2 2 2 ...
$ sact35: num 2 3 5 2 2 2 2 2 2 2 ...
$ sact36: num 2 3 5 2 2 2 2 2 2 2 ...
$ sact37: num 2 3 5 2 2 2 2 2 2 2 ...
$ sact38: num 2 2 5 2 2 2 2 2 2 2 ...
$ sact39: num 2 2 5 2 2 2 2 2 2 2 ...
$ sact40: num 2 2 5 2 2 2 2 2 2 2 ...
$ sact41: num 2 2 5 2 2 2 2 2 2 2 ...
$ sact42: num 2 2 5 2 2 2 2 2 2 2 ...
$ sact43: num 2 2 5 2 2 2 2 2 2 2 ...
$ sact44: num 2 2 5 2 2 2 2 2 2 2 ...
$ sact45: num 2 2 3 2 2 2 2 2 2 2 ...
$ sact46: num 2 2 3 2 2 2 2 2 2 2 ...
$ sact47: num 2 2 3 2 2 2 2 2 2 2 ...
$ sact48: num 2 2 3 2 2 2 2 2 2 2 ...
$ sact49: num 2 2 3 2 2 2 2 2 2 2 ...
$ sact50: num 2 3 3 2 2 2 2 2 2 2 ...
```

The first step in sequence analysis is to create a corpus of
sequences, i.e. a `stslist`

class object using the
`seqdef`

function. First the labels of the 6 states and a
palette of 6 colours are defined (this is optional: `seqdef`

creates labels and palette automatically if not provided).

```
# defining of a corpus of sequences
<- c("studies","full-time","part-time","short jobs","inactivity","military")
labs <- brewer.pal(length(labs), 'Set2')
palette <- seqdef(trajact, labels=labs, cpal=palette) seqact
```

Our corpus of 500 sequences consists of 377 distinct sequences, which confirms the interest of using a statistical procedure to group together sequences that are similar.

```
# number of distinct sequences
seqtab(seqact, idx=0) %>% nrow
```

`[1] 377`

The *state distribution plot* of all the sequences shows the
preponderance of full-time employment and the non-negligible weight of
inactivity.

```
# state distribution plot
seqdplot(seqact, xtlab=14:50, cex.legend=0.7)
```

A data frame is also loaded containing some socio-demographic
variables on the individuals. Note that the categorical variables are in
`factor`

format.

```
# loading the socio-demographic variables
data(socdem)
str(socdem)
```

```
'data.frame': 500 obs. of 9 variables:
$ annais : num 1950 1937 1933 1948 1944 ...
$ nbenf : Factor w/ 4 levels "aucun","un","deux",..: 4 3 3 3 3 2 4 4 4 1 ...
$ nbunion : Factor w/ 3 levels "aucune","une",..: 2 2 2 2 3 2 2 2 2 1 ...
$ mereactive : Factor w/ 2 levels "non","oui": 1 1 2 2 1 1 2 1 1 2 ...
$ sexe : Factor w/ 2 levels "homme","femme": 1 2 2 1 2 2 2 1 1 2 ...
$ PCS : Factor w/ 8 levels "NA","agric","indep",..: 7 8 5 7 6 5 6 7 7 4 ...
$ PCSpere : Factor w/ 8 levels "agric","indep",..: 7 1 5 2 6 3 6 5 6 6 ...
$ diplome : Factor w/ 4 levels "aucun","<bac",..: 1 2 2 2 3 4 2 2 2 2 ...
$ nationalite: Factor w/ 2 levels "francaise","etrangere": 1 1 1 1 1 1 1 1 1 1 ...
```

# Building a distance matrix

## Synthetic indicators

As an example, a distance matrix is constructed from indicators describing the number of episodes in the different states. The first individual has spent his entire trajectory in full-time employment, while the second has experienced one full-time episode but also one episode of study and two of part-time employment.

```
<- seqinepi(seqact)
indics head(indics)
```

```
nepi1 nepi2 nepi3 nepi4 nepi5 nepi6
1 0 1 0 0 0 0
2 1 1 2 0 0 0
3 1 2 1 0 2 0
4 1 2 0 0 0 1
5 1 1 0 0 0 0
6 1 2 0 0 1 0
```

The matrix can be calculated directly from the indicators or after a Principal Component Analysis (PCA) step, here by retaining the first 5 dimensions.

```
# distance matrix from the indicator
<- dist(indics, method='euclidean') %>% as.matrix
dissim
# distance matrix from PCA results
<- PCA(indics, scale.unit=FALSE, ncp=5, graph=FALSE)$ind$coord
acp_coords <- dist(acp_coords, method='euclidean') %>% as.matrix dissim
```

Other synthetic indicators (durations, states visited, etc.) can be
calculated simply from the functions `seqistatd`

,
`seqi1epi`

, `seqifpos`

, `seqindic`

or
`seqpropclust`

.

## Disjunctive coding and QHA

In the case of complete disjunctive coding, the distance matrix can be calculated directly, with the Euclidean distance or the chi2 distance, or after a Principal Component Analysis (PCA) or Multiple Correspondence Analysis (MCA) step, here by retaining the first 5 dimensions.

NB: `map_df`

allows you to apply the same function to all
the columns of a data frame. Here, this function is used to convert
columns from numerical format to `factor`

format.

```
# complete disjunctive coding
<- as.data.frame(tab.disjonctif(seqact))
disjo <- disjo[,colSums(disjo)>0]
disjo
# euclidian distance
<- dist(disjo, method='euclidean') %>% as.matrix
dissim
# chi2 distance
<- map_df(disjo, as.factor) %>%
dissim dudi.acm(scannf=FALSE, nf=ncol(disjo)) %>%
dist.dudi() %>%
as.matrix
# after a PCA step
<- PCA(disjo, scale.unit=FALSE, ncp=5, graph=FALSE)$ind$coord
acp_coords <- dist(acp_coords, method='euclidean') %>% as.matrix
dissim
# after a MCA step
<- purrr::map_df(disjo, as.factor) %>%
acm_res MCA(ncp=5, graph=FALSE)
<- dist(acm_res$ind$coord, method='euclidean') %>% as.matrix dissim
```

For the qualitative harmonic analysis (QHA), the calculation of the distance matrix can be done directly (chi2 distance) or after a correspondence analysis (CA), here using the first 5 dimensions.

```
# QHA coding
<- seq2qha(seqact, c(1,3,7,10,15,20,28))
ahq <- ahq[,colSums(ahq)>0]
ahq
# chi2 distance
<- dudi.coa(ahq, scannf=FALSE, nf=ncol(ahq)) %>%
dissim dist.dudi() %>%
as.matrix
# after a CA step
<- CA(ahq, ncp=5, graph=FALSE)$row$coord
afc_coord <- dist(afc_coord, method='euclidean') %>% as.matrix dissim
```

*Optimal Matching* and alternatives

For *Optimal Matching*, the construction of a distance matrix
between sequences is done with the `seqdist`

function. This
also involves defining a substitution cost matrix between states (with
the `seqsubm`

function). Here the substitution costs are
constant and equal to 2 and the *indel* cost is equal to 1.5.

```
# building the distance matrix
<- seqsubm(seqact, method="CONSTANT", cval=2)
couts <- seqdist(seqact, method="OM", sm=couts, indel=1.5) dissim
```

From experience, *Optimal Matching* with the parameters
adopted here is a choice that allows the different dimensions of
sequence temporality to be taken into account - *sequencing*,
calendar (*timing*), duration (in different states or episodes,
*duration* and *spell duration*). If you wish to focus on
one of these dimensions, you can follow the recommendations of Studer
& Ritschard (2016, see in particular pages 507-509), and choose one
of the many other metrics implemented in the `TraMineR`

extension.

```
# sequencing
<- seqdist(seqact, method="OMstran", otto=0.1, sm=couts, indel=1)
dissim <- seqdist(seqact, method="OMspell", expcost=0, sm=couts, indel=1)
dissim <- seqdist(seqact, method="SVRspell", tpow=0)
dissim
# timing
<- seqdist(seqact, method="HAM", sm=couts)
dissim <- seqdist(seqact, method="CHI2", step=1)
dissim
# duration (distribution aver the entire period)
<- seqdist(seqact, method="EUCLID", step=37)
dissim
# duration (spell lengths)
<- seqdist(seqact, method="OMspell", expcost=1, sm=couts, indel=1)
dissim <- seqdist(seqact, method="LCS") dissim
```

Note that methods using the complete disjunctive coding or QHA are
also implemented in the `seqdist`

function (“EUCLID” and
“CHI2” methods).

# Typology of sequences

## Building a typology

A hierarchical agglomerative clustering (HAC) is then carried out
using Ward’s aggregation criterion, using the `agnes`

function of the `cluster`

extension.

NB: With a high number of sequences, HAC may require a significant
amount of computing time. However, there is a much faster implementation
in the `fastcluster`

package (`hclust`

function).

```
# hierarchical agglomerative clustering
<- as.dist(dissim) %>% agnes(method="ward", keep.diss=FALSE) agnes
```

In order to explore the solutions of a hierarchical agglomerative clustering, one usually starts by examining the dendrogram.

```
# dendrogram
as.dendrogram(agnes) %>% plot(leaflab="none")
```

The following graph combines dendrogram and *index plot*: the
sequences of the index plot index are sorted according to their position
in the dendrogram, which is shown in the margin of the graph.

```
# heatmap (dendrogram + index plot)
seq_heatmap(seqact, agnes)
```

Examination of inertia gaps can also be useful in determining the number of clusters in the typology. For example, it can be seen that there is a noticeable difference in inertia between partitions in 5 and 6 clusters.

```
# plot of inertia
plot(sort(agnes$height, decreasing=TRUE)[1:20], type="s", xlab="number of clusters", ylab="inertia")
```

There are also a number of indicators of partition quality (silhouette, Calinski-Harabasz, pseudo-R2, etc.; see Studer, 2013).

```
# indicators of quality
<- as.clustrange(agnes, diss=dissim)
wardRange summary(wardRange, max.rank=2)
```

```
1. N groups 1. stat 2. N groups 2. stat
PBC 5 0.79612318 4 0.79012613
HG 5 0.92899997 4 0.92259604
HGSD 5 0.92672338 4 0.92027518
ASW 3 0.53491169 2 0.52958820
ASWw 3 0.53862662 2 0.53182999
CH 2 156.86409716 3 103.74539004
R2 20 0.60648740 19 0.59473652
CHsq 2 324.53910100 3 245.89231855
R2sq 20 0.82862437 19 0.82073439
HC 5 0.04010082 4 0.04392402
```

The quality of the partitions for different numbers of clusters for the silhouette, pseudo-R2 and Calinski-Harabasz indicators is shown graphically here.

`plot(wardRange, stat=c('ASW','R2','CH'), norm="zscore")`

In the end, we opt for a partition in 5 clusters, by “cutting the
tree” of the HAC using the `cutree`

function.

```
# choosing the partition in 5 clusters
<- 5
nbcl <- cutree(agnes, nbcl) part
```

It is possible to “consolidate” the partition using the PAM algorithm
(*Partition Around Medoids*) and the `wcKMedoids`

function of the `WeightedCluster`

package. This results in a
very similar distribution of sequences between the clusters (see the
table crossing the clusters before and after consolidation) but the
quality of the consolidated partition is slightly higher (the R² goes
from 61 to 64%).

```
# consolidating the partition
<- wcKMedoids(dissim, k=nbcl, initialclust=part, cluster.only=TRUE)
newpart table(part, newpart)
```

```
newpart
part 28 34 103 307 480
1 104 241 1 0 0
2 9 24 2 7 0
3 0 6 6 63 1
4 0 0 17 0 0
5 8 0 0 1 10
```

`wcClusterQuality(dissim, part)$stats['R2sq'] %>% round(3)`

```
R2sq
0.607
```

`wcClusterQuality(dissim, newpart)$stats['R2sq'] %>% round(3)`

```
R2sq
0.64
```

If you wish to keep the consolidated partition :

`<- as.numeric(as.factor(newpart)) part `

NB: Another option, the “fuzzy” clustering, here with the Fanny
algorithm (extension `cluster`

). Unlike HAC or PAM, each
sequence does not belong to a single cluster but is characterised by
degrees of membership to the different clusters. The following table
presents the degrees of membership to the 6 clusters of the first 3
sequences of the corpus. The first sequence belongs 99% to cluster 1,
but the second is more “balanced”, mainly between clusters 2 and 5.

```
# fuzzy clustering
<- as.dist(dissim) %>% fanny(k=5, metric='euclidean', memb.exp=1.2)
fanny $membership %>% round(2) %>% .[1:3,] fanny
```

```
[,1] [,2] [,3] [,4] [,5]
1 0.99 0.00 0.00 0.01 0.00
2 0.06 0.51 0.02 0.13 0.28
3 0.00 0.11 0.80 0.00 0.08
```

## Describing the typology: plots

The graphical representations give a quick and intuitive idea of the
nature of the clusters in the typology. The most commonly used type of
graph is the *state distribution plot*.

```
# state distribution plots of the typology
seqdplot(seqact, group=part, xtlab=14:50, border=NA, cex.legend=0.8)
```

The *index plots* are also very common.

```
# index plots of the typology
seqIplot(seqact, group=part, xtlab=14:50, yaxis=FALSE, cex.legend=0.8)
```