# Piecewise Linear Segmentation by Dynamic Programming

#### 10 08 2020

$\newcommand{\Ell}{\mathcal{L}} \newcommand{\jump}{\mathcal{J}} \newcommand{\Var}{\mathrm{Var}} \newcommand{\Cov}{\mathrm{Cov}} \newcommand{\lmax}{\ell_\text{max}} \newcommand{\lmin}{\ell_\text{min}} \newcommand{\def}{\stackrel{\mathrm{def}}{=}}$

The R package dpseg performs piecewise linear segmentation of 2-dimensional data by a dynamic programming algorithm. It was developed for time series data, dissection of bacterial growth phases, and for genome-wide read-count data from next generation sequencing.

print, plot and predict methods allow quick evaluation of the results. Parameter scanning routines (estimateP, scanP) help to evaluate the best choice of parameters for a given problem.

The package and its documentation are also intended to serve as a tutorial on the concepts of linear regression, dynamic programming and the segmentation problem. The movie function animates the progress of the algorithm through the data. Generic implementations of dynamic programming routines allow to test alternative segmentation criteria.

# 1 Theory

## 1.1 Recursion

The problem is to find break-points in 2-dimensional ordered data $$\{(x_i,y_i), i=1,...,n\}$$, eg., a time series. This can be formulated as a dynamic programming recursion:

$\begin{equation} S_j = \max_{i\le j} (S_{i-\jump} + \text{score}(i,j)) - P\;, \tag{1.1} \end{equation}$

where the scoring function $$\text{score}(i,j)$$ quantifies how well a segment between points $$i$$ and $$j$$ is defined.

The break-point penalty term $$P$$ sets bounds on segment lengths and should be chosen close to the optimal value of the scoring function (section 2.2). Higher $$P$$ will yield longer segments, and especially for non-monotonic data $$P$$ lower than the optimal value for the scoring function can work better.

The binary jump parameter $$\jump \in \{0,1\}$$ determines whether the break-point $$i$$ is both, the end of the previous and start of the current segment ($$\jump=0$$), or the previous segments ends at $$i-1$$ ($$\jump=1$$). The latter choice allows discontinuous jumps between adjacent segments.

## 1.2 Scoring Functions

Segmentation into linear segments can be achieved by using a goodness-of-fit measure. Linear models are often evaluated by the r-squared value $$R^2$$, and we can use this directly as a scoring function (type="r2" in dpseg):

$\begin{equation} \text{score}(i,j)= \frac{\sum (\hat y_i - \bar y)^2}{\sum (y_i - \bar y)^2}-1\;, \tag{1.2} \end{equation}$

where $$\hat y_i$$ is the linear model and $$\bar y$$ the mean of the original data, see section 6 and equation (6.4) for details.

Subtraction of 1 simply aligns different scoring functions in dpseg at $$\text{score} \le 0$$ and thereby allows the use of a consistent default penalty parameter of $$P=0$$.

$$R^2$$ depends on the slope (eq. (6.4)) and will score poorly for segments without slope (Fig. 2.8). The negative variance of the residuals $$r_i$$ between data and the regression line does not depend on the slope (eq. (6.5)) and is for many cases a better choice

$\begin{equation} \text{score}(i,j)= - \frac{1}{n-1} \sum r_i^2\;, \tag{1.3} \end{equation}$

and this is the default type="var" in dpseg.

Notably, incremental calculation of the variances (section 6) allows for a computationally efficient implementation of the recursion (Fig. 3.1).

## 1.3 Backtracing

During calculation of $$S_j$$ the position $$i_\text{max}<j$$ that yielded the maximal score is stored as a vector. The segmentation can be easily retrieved by starting at the last position $$j=n$$, the end of the last segment, looking up its $$i_\text{max}$$, the start of the segment, and proceed from $$j=i_\text{max} - \jump$$ (the end of the next segment). If jumps were allowed ($$\jump=1$$) the previous segment ended one position ahead of the next segment at $$i-1$$, and this must be accounted for in backtracing.

simple_backtrace <- function (imax, jumps = 0) {
end <- length(imax) # end of the first segment = end of the data
ends <- c(end)      # initiate vector
while (end > 1) {
end <- imax[end] - jumps # end of previous segment
ends <- c(end, ends)     # note the order, new end is prepended
}
ends
}

## 1.4 Segment Length Restrictions

Depending on the scoring function, short segments may not be meaningful, eg. goodness-of-fit measures for linear regression such as the variance of residuals are 0 for segments of length $$\ell=2$$ and an "optimal" segmentation would split the data in pairs with perfect "fits". Thus, we can restrict score function maximization for only $$i\le(j-\lmin+1)$$.

When handling data sets much larger then the expected segment length, one can also define a maximal segment length, and thereby save memory and computation time by only considering $$i\ge(j-\lmax +1)$$.

Combining both restrictions, and with $$\Ell_x=\ell_x+1$$ the recursion becomes:

$\begin{equation} S_j = \max_{\substack{i\le j-\Ell_{\min}\\i\ge j-\Ell_{\max}}} (S_{i-\jump} + \text{score}(i,j)) - P\;. \tag{1.4} \end{equation}$

# 2 Usage

## 2.1 Basic

The main function dpseg returns an object of class dpseg and results can be inspected by print, plot and predict methods.

The result object is a simple list (S3 object) with the list item segments as the main result: a table (data frame) that lists the start and end x-values of the segments, the start and end indices in the data vectors, the linear regression coefficients and goodness-of-fit measures for the segments (intercept, slope, r-squared, variance of residuals).

library(dpseg)

type <- "var"  # use the (default) scoring, -var(residuals(lm(y~x)))
jumps <- FALSE # allow discrete jumps between segments?
P <- 1e-4      # break-point penalty, use higher P for longer segments

# get example data oddata - bacterial growth measured as optical density OD
x <- oddata\$Time
y <- log(oddata[,"A2"]) # note: exponential growth -> log(y) is linear

# NOTE: the scoring function results are stored as a matrix for re-use below
segs <- dpseg(x=x, y=y, jumps=jumps, P=P, type=type, store.matrix=TRUE)
## calculating recursion for 229 datapoints
print(segs)
##
## dynamic programming-based segmentation of 229 xy data points:
##
##          x1        x2 start end  intercept       slope        r2          var
## 1  1.022778  1.362778     1   3  0.7194953 -4.38397708 0.9909440 5.076021e-03
## 2  1.362778  1.703194     3   5 -9.3479269  3.01611065 0.9626144 1.023547e-02
## 3  1.703194  2.213611     5   8 -6.2212774  1.15334663 0.9766701 1.532996e-03
## 4  2.213611  6.982361     8  36 -4.4810363  0.42112342 0.9646807 1.365581e-02
## 5  6.982361  8.688194    36  46 -3.2862340  0.23483030 0.9857475 2.551765e-04
## 6  8.688194 15.689028    46  87 -2.1271834  0.10395130 0.9882125 5.654426e-04
## 7 15.689028 17.911806    87 100 -3.4426960  0.18899698 0.9940106 1.101408e-04
## 8 17.911806 21.329583   100 120 -0.3825874  0.01895679 0.8638382 6.374447e-05
## 9 21.329583 39.940000   120 229  0.4767462 -0.01989881 0.9919736 9.501867e-05
##
## Parameters: type: var; minl: 3; maxl: 229; P: 1e-04; jumps: 0
plot(segs)
lines(predict(segs),lty=2, lwd=3, col="yellow") Figure 2.1: Segmentation of a growth curve (E. coli in M9 minimal medium) by dpseg. Vertical lines indicate segment starts (dashed red) and ends (solid black). The predict method returns the piecewise linear model (dashed yellow).

### 2.1.1 Exploratory & Educational Movies

For both, educational purposes or detailed evaluation of novel scoring functions (section 2.5) or parameters, the movie function allows to visualize the performance of the dynamic programming as it progresses through the data:

# use options frames, and res to control file size
movie(segs, format="gif", file.name="movie", path="pkg/vignettes/figures",
frames=seq(1,length(x),3), delay=.3,res=50)

This will plot each step of the algorithm and visualize the development of the scoring function. For this function to work, the option store.matrix=TRUE must be set in the call to dpseg.