--- title: "Timing experiments for dtwclust" author: "Alexis Sarda-Espinosa" output: html_vignette: toc: true number_sections: true fig_width: 6.5 fig_height: 7 vignette: > %\VignettePackage{dtwclust} %\VignetteIndexEntry{Timing experiments for dtwclust} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown_notangle} bibliography: REFERENCES.bib editor_options: chunk_output_type: console --- ```{r setup, include=FALSE} library("dtwclust") library("dplyr") library("ggplot2") data("dtwclustTimings") dist_single_results <- dtwclustTimings$dist$single dist_multiple_results <- dtwclustTimings$dist$multiple cent_results <- dtwclustTimings$cent clus_tadpole_results <- dtwclustTimings$tadpole partitional_results <- dtwclustTimings$partitional factor_columns <- c("series_length", "window_size", "k", "num_repetitions") adjust_factors <- function(df) { for (column in factor_columns) { if (column %in% colnames(df)) { df[[column]] <- factor(df[[column]]) } } df } dist_single_results <- adjust_factors(dist_single_results) dist_multiple_results <- adjust_factors(dist_multiple_results) cent_results <- adjust_factors(cent_results) clus_tadpole_results <- adjust_factors(clus_tadpole_results) partitional_results$dtwlb_vs_dtwbasic$pam <- adjust_factors(partitional_results$dtwlb_vs_dtwbasic$pam) partitional_results$dtwlb_vs_dtwbasic$pam_vs_reps <- adjust_factors(partitional_results$dtwlb_vs_dtwbasic$pam_vs_reps) partitional_results$dtwlb_vs_dtwbasic$dba <- adjust_factors(partitional_results$dtwlb_vs_dtwbasic$dba) partitional_results$sparse_pam_k$non_symmetric <- adjust_factors(partitional_results$sparse_pam_k$non_symmetric) partitional_results$sparse_pam_k$symmetric <- adjust_factors(partitional_results$sparse_pam_k$symmetric) levels(dist_single_results$distance) <- c( lb_keogh = "LB Keogh", lb_improved = "LB Improved", sbd = "Shape-Based Distance", dtw_univariate = "DTW (Univariate)", dtw_multivariate = "DTW (Multivariate)", sdtw_univariate = "Soft-DTW (Univariate)", sdtw_multivariate = "Soft-DTW (Multivariate)", unnormalized_gak_univariate = "Unnormalized GAK (Univariate)", unnormalized_gak_multivariate = "Unnormalized GAK (Multivariate)", normalized_gak_univariate = "Normalized GAK (Univariate)", normalized_gak_multivariate = "Normalized GAK (Multivariate)" ) levels(dist_multiple_results$distance) <- c( lb_keogh = "LB Keogh", lb_improved = "LB Improved", dtw_lb = "LBI + DTW", sbd = "Shape-Based Distance", dtw_univariate = "DTW (Univariate)", dtw_multivariate = "DTW (Multivariate)", sdtw_univariate = "Soft-DTW (Univariate)", gak_univariate = "Normalized GAK (Univariate)" ) levels(cent_results$cent) <- c( shape_univariate = "Shape Extraction (Univariate)", shape_multivariate = "Shape Extraction (Multivariate)", dba_univariate = "DBA (Univariate)", dba_multivariate_byS = "DBA (Multivariate By Series)", dba_multivariate_byV = "DBA (Multivariate By Variable)", sdtw_cent_univariate = "Soft-DTW (Univariate)", sdtw_cent_multivariate = "Soft-DTW (Multivariate)" ) dist_multiple_results$num_threads <- factor( dist_multiple_results$num_threads, levels = unique(dist_multiple_results$num_threads), labels = paste(unique(dist_multiple_results$num_threads), " Thread(s)") ) clus_tadpole_results$lb <- factor( clus_tadpole_results$lb, levels = c("lbk", "lbi"), labels = c("LB Keogh", "LB Improved") ) # execution time exec_time <- attr(dist_single_results, "proctime")["elapsed"] + attr(dist_multiple_results, "proctime")["elapsed"] + attr(cent_results, "proctime")["elapsed"] + attr(clus_tadpole_results, 
"proctime")["elapsed"] + attr(partitional_results, "proctime")["elapsed"] exec_time <- unname(round(exec_time / 3600)) # ggplot defaults ggplot2::theme_set(ggplot2::theme_bw()) ggplot2::theme_update(legend.position = "bottom") # knitr defaults knitr::opts_chunk$set(echo = FALSE, comment = "#>") ``` # Introduction Time-series clustering is affected by several factors, such as the characteristics of time-series themselves, the choice of distance or centroid function, etc. In many situations, run-time characteristics are more important, e.g. when the amount of memory is limited by the system, or when excessive running times must be avoided. Most of these aspects cannot be generalized, especially in regards to evaluation of correctness, but things like complexity or growth rate of an algorithm can be assessed relatively more easily. To get an idea of the running times that can be expected for some of the algorithms included in `dtwclust`, a series of timing experiments were made. Said experiments were not concerned with correctness or accuracy, only with timing characteristics. The experiments were run with `R` v3.6.0 and `dtwclust` v5.5.3. The `microbenchmark` package (v1.4.6) was also used for most of the experiments. The computer used was running GNU/Linux (LTS kernel v4.19) with an `i5-6500` Intel processor (4 cores) and 16GB of RAM. The whole set of experiments took approximately `r exec_time` hours to complete. The data used comes from the Character Trajectories set [@lichman2013], which have different lengths and are originally multivariate series with 3 variables; the univariate versions were extracted from these. All scripts are available [online on GitHub](https://github.com/asardaes/dtwclust/tree/master/timing-experiments). Naturally, since we are dealing with execution times, the experiments cannot be reproduced exactly, but hopefully the median times would not vary too much between systems with similar characteristics. # Distance experiments ## Calculations for single time-series First we look at the results of the timing experiments for single time-series, i.e., for distances calculated between two individual time-series. The distances tested are those included in the package. Here we look at the effect of window sizes and series' lengths. Each calculation was repeated `r attr(dist_single_results, "times")` times and the **median** value was extracted. Note that the vertical scales differ for the following figures. ### DTW lower bounds ```{r dist-single-lb, fig.height=4.5} id_gg <- grepl("^LB", dist_single_results$distance) gg <- ggplot(dist_single_results[id_gg,], aes( x = series_length, y = median_time_us, group = window_size, colour = window_size )) gg + geom_line() + facet_wrap(~distance) + scale_color_discrete(name = "Window size") + labs(x = "Series' length", y = expression("Median time ("*mu*"s)")) ``` The first interesting result relates to the DTW lower bounds: `lb_keogh` and `lb_improved`. The window size does not seem to have a very significant effect, and the running time appears to (mostly?) grow slowly with the series' length. However, `lb_improved` was faster that `lb_keogh`. Considering that `lb_improved` first needs to calculate `lb_keogh` and then perform additional calculations, this is somewhat puzzling, and the reason is not immediately evident. Perhaps there are compiler optimizations at play. 
### Shape-based distance

```{r dist-single-sbd, fig.height=3.5, fig.width=5}
id_gg <- grepl("^Shape", dist_single_results$distance)
gg <- ggplot(dist_single_results[id_gg,], aes(
    x = series_length, y = median_time_us, group = 1
))
gg + geom_line() +
    labs(x = "Series' length", y = expression("Median time ("*mu*"s)"))
```

The shape-based distance also presents weird behavior. While it was expected that its running time would increase with the series' length, the bump for the length of 152 is considerably large. It is true that SBD is based on the FFT, and thus it adjusts the input series' lengths to powers of 2, but if that were the cause, then the bump should have occurred for the series with length of 130, since the next power of 2 for 109 is 128, and it jumps to 256 for a length of 130.

### DTW

```{r dist-single-dtw, fig.height=4.5}
id_gg <- grepl("^DTW", dist_single_results$distance)
gg <- ggplot(dist_single_results[id_gg,], aes(
    x = series_length, y = median_time_us,
    group = window_size, colour = window_size
))
gg + geom_line() +
    facet_wrap(~distance) +
    scale_color_discrete(name = "Window size") +
    labs(x = "Series' length", y = expression("Median time ("*mu*"s)"))
```

In the case of DTW, we can see that a window constraint can indeed have a very significant effect on running time, considering that a window size of 10 resulted in a calculation that was about 4 times faster than when using no constraint. In this case, using multivariate series (with 3 variables) did not have a very significant effect.

### Soft-DTW

```{r dist-single-sdtw, fig.height=4}
id_gg <- grepl("^Soft-DTW", dist_single_results$distance)
gg <- ggplot(dist_single_results[id_gg,], aes(
    x = series_length, y = median_time_us, group = 1
))
gg + geom_line() +
    facet_wrap(~distance) +
    labs(x = "Series' length", y = expression("Median time ("*mu*"s)"))
```

In principle, the soft-DTW algorithm is very similar to that of unconstrained DTW. However, its run-time characteristics are clearly different, resulting in considerably slower calculations. Interestingly, the multivariate case was marginally faster.

### Triangular global alignment kernel

```{r dist-single-gak, fig.height=6}
id_gg <- grepl("GAK", dist_single_results$distance)
gg <- ggplot(dist_single_results[id_gg,], aes(
    x = series_length, y = median_time_us,
    group = window_size, colour = window_size
))
gg + geom_line() +
    facet_wrap(~distance) +
    scale_color_discrete(name = "Window size") +
    labs(x = "Series' length", y = expression("Median time ("*mu*"s)"))
```

The behavior of GAK was rather surprising. Its running times increase very fast with the series' length, and neither window size nor number of variables seems to have any effect whatsoever. The normalized version is 3 times slower because it effectively calculates a GAK distance 3 times: once for `x` against `y`, once for `x` alone, and once for `y` alone; these 3 values are used to calculate the normalized version.
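To illustrate the difference between the two versions, single normalized and unnormalized computations could be sketched as follows; the `sigma` value is arbitrary, and the `normalize` flag is assumed to toggle the normalization described above.

```{r gak-single-example, eval=FALSE, echo=TRUE}
# Hedged sketch: one normalized and one unnormalized GAK computation
x <- CharTraj[[1L]]
y <- CharTraj[[100L]]
GAK(x, y, sigma = 100, window.size = 50L)                    # normalized, usable as a distance
GAK(x, y, sigma = 100, window.size = 50L, normalize = FALSE) # unnormalized similarity
```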
## Calculations for several time-series

Computing cross-distance matrices for time-series can be optimized in different ways depending on the distance that is used. In the following sections, we look at the way the included distances are optimized, and evaluate how this affects their running times when computing distances between several time-series. These experiments were repeated `r attr(dist_multiple_results, "times")` times and the median value was computed. We look at the effect of several factors: the length of the series, a window size where applicable, the effect of parallelization, and the size of the cross-distance matrix.

### DTW lower bounds

First we assess the distances that involve the DTW lower bounds. In the following figure, the `x` axis contains the total number of cells in the cross-distance matrix, but the color is mapped to the number of *columns* in said matrix. This is relevant in this case because of the envelopes that need to be computed when calculating the lower bounds. These are computed for the series in `y`, which end up across the columns of the distance matrix. The applied optimization consists of calculating the envelopes for `y` only once, and re-using them across `x`. In the case of `dtw_lb` (which is based on `lb_improved`), this is also important because nearest neighbors are searched row-wise by default, and more series in `y` equates to more neighbors to consider. Also note that the `dtw_lb` calculations make more sense when the series in `x` and `y` are different (if the series are the same, the nearest neighbor of a series is always itself, so no iterations take place), which is why there are fewer data points in those experiments. Given the results of the previous section, the window size was fixed at 50 for these experiments.

```{r dist-multiple-lb-plot, fig.cap="*The facets' columns indicate the number of parallel threads, whereas the rows indicate the distance that was used. The vertical scales are different for each row, but they are all in milliseconds.*"}
id_gg <- grepl("LB", dist_multiple_results$distance)
ggplot(dist_multiple_results[id_gg,],
       aes(x = num_total, y = median_time_ms,
           colour = num_y, shape = series_length)) +
    geom_point(size = 3) +
    facet_grid(distance ~ num_threads, scales = "free_y") +
    scale_color_continuous(name = "Amount of series in y") +
    scale_shape_discrete(name = "Series' length") +
    labs(x = "Total number of distance calculations",
         y = "Median time (ms)")
```

For `lb_keogh` and `lb_improved`, we see that the effect of the series' length is more consistent when calculating cross-distance matrices. Parallelization with multi-threading yields proportionally better performance, and this time `lb_improved` was about twice as slow as `lb_keogh`. As expected, increasing the number of series in `y` increases the running times.

The behavior of `dtw_lb` is also consistent, but the length of the series affected running times in a strange way, since the calculations with series of length 152 were the slowest ones. Using multi-threading can also be advantageous in this case, and this is applied on two occasions: when estimating the initial distance matrix with `lb_improved`, and when updating the nearest neighbors with DTW. Nevertheless, the procedure is much slower than the lower bounds alone.
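To make the setup concrete, a cross-distance computation like the ones timed above could be written roughly as follows (a sketch only): the distances are registered with the `proxy` package, the number of threads is controlled through `RcppParallel`, and the subsets, length and window size shown here are arbitrary.

```{r dist-multiple-lb-example, eval=FALSE, echo=TRUE}
# Hedged sketch of a parallel cross-distance matrix with dtw_lb
RcppParallel::setThreadOptions(numThreads = 2L)
# equal lengths are required by the lower bounds
series <- lapply(CharTraj, reinterpolate, new.length = 130L)
D <- proxy::dist(series[1L:50L], series[51L:100L],
                 method = "dtw_lb", window.size = 50L)
```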
### Shape-based distance

As mentioned before, the shape-based distance is based on the FFT. Similarly to the DTW lower bounds, the optimization applied here consists of calculating the FFTs only once, although in this case they must be calculated for both `x` and `y`. The results are summarized in the next figure.

```{r dist-multiple-sbd-plot, fig.height=4.5, fig.cap="*The facets' columns indicate the number of parallel threads.*"}
id_gg <- grepl("^Shape", dist_multiple_results$distance)
ggplot(dist_multiple_results[id_gg,],
       aes(x = num_total, y = median_time_ms, colour = series_length)) +
    geom_point(size = 3) +
    facet_wrap(~num_threads) +
    scale_color_discrete(name = "Series' length") +
    labs(x = "Total number of distance calculations",
         y = "Median time (ms)")
```

For sequential calculations, we see here the expected effect of the series' lengths. Adjusting them to powers of 2 for the FFT meant that the calculations were faster for series of length 109, and for series of length 152 and 196 the times were virtually the same (so much so that the points overlap). Parallelization helped reduce times proportionally.

### DTW

The DTW distance doesn't allow for many optimizations. The version implemented in `dtw_basic` for cross-distance matrices uses less RAM by saving only 2 columns of the local cost matrix (LCM) at all times (since no back-tracking is performed in this case). Moreover, this LCM is only allocated once in each thread. The results are shown in the next figure.

```{r dist-multiple-dtw-plot, fig.height=8, fig.cap="*The facets' columns indicate the number of parallel threads, whereas the rows indicate the length of the series being considered (all series had the same length for each case). All times are in milliseconds.*"}
id_gg <- grepl("^DTW \\(", dist_multiple_results$distance)
ggplot(dist_multiple_results[id_gg,],
       aes(x = num_total, y = median_time_ms,
           colour = window_size, shape = distance)) +
    geom_point(size = 3) +
    facet_grid(series_length ~ num_threads) +
    scale_shape_manual(name = "Distance", values = c(0, 3)) +
    scale_color_discrete(name = "Window size") +
    labs(x = "Total number of distance calculations",
         y = "Median time (ms)")
```

The DTW distance presents a much more constant behavior across the board. The difference between univariate and multivariate series is very small, but window sizes and series' lengths can have a very significant effect, especially for sequential calculations. Additionally, DTW benefits linearly from parallelization, since using 2 threads reduced times by half, and using 4 reduced them practically by a factor of 4. Also note that the growth is essentially linear, which indicates that DTW cannot be easily optimized much more, something which was already pointed out before [@keogh2004].

### Soft-DTW

As with DTW, soft-DTW has few optimizations: its helper matrix is allocated only once in each thread during the calculation. In this case, we look only at the univariate version.

```{r dist-multiple-sdtw-plot, fig.height=4.5, fig.cap="*The facets' columns indicate the number of parallel threads. All times are in milliseconds.*"}
id_gg <- grepl("^Soft", dist_multiple_results$distance)
ggplot(dist_multiple_results[id_gg,],
       aes(x = num_total, y = median_time_ms, colour = series_length)) +
    geom_point(size = 3) +
    facet_wrap(~num_threads) +
    scale_color_discrete(name = "Series' length") +
    labs(x = "Total number of distance calculations",
         y = "Median time (ms)")
```

The benefits of parallelization are also very evident for soft-DTW, and present similar characteristics to the ones obtained for DTW. Unfortunately, running times also grow very fast with the series' length.
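For comparison, the two kinds of cross-distance calls discussed in this and the previous subsection could be sketched as follows; both methods are registered with `proxy` by the package, and the subset and window size are illustrative.

```{r dist-multiple-dtw-sdtw-example, eval=FALSE, echo=TRUE}
# Hedged sketch: DTW and soft-DTW cross-distance matrices over the same subset
series <- CharTraj[1L:20L]
D_dtw  <- proxy::dist(series, method = "dtw_basic", window.size = 20L)
D_sdtw <- proxy::dist(series, method = "sdtw") # gamma left at its default
```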
### Triangular global alignment kernel

Finally we look at the computations with the GAK distance. As shown in the previous section, this distance is considerably slower. Moreover, only the normalized version can be used as a *distance* measure (as opposed to a similarity), so only the normalized version was tested.

There are 2 optimizations in place here. As mentioned before, normalization effectively requires calculating a GAK for `x` and `y` by themselves, so in order to avoid repeated calculations, these normalization factors are only computed once for cross-distance matrices. GAK also uses a helper matrix to save logarithms during the intermediate calculations, and this matrix is allocated only once in each thread.

It is worth pointing out that, in principle, the GAK code is very similar to that of DTW. However, GAK relies on logarithms, whereas DTW only uses arithmetic operations. This means that, unfortunately, GAK probably cannot be optimized much more.

```{r dist-multiple-gak-plot, fig.height=6, fig.cap="*The facets' columns indicate the number of parallel threads, whereas the rows indicate the window size that was used. All times are in milliseconds.*"}
id_gg <- grepl("GAK", dist_multiple_results$distance)
ggplot(dist_multiple_results[id_gg,],
       aes(x = num_total, y = median_time_ms, colour = series_length)) +
    geom_point(size = 3) +
    facet_grid(window_size ~ num_threads) +
    scale_color_discrete(name = "Series' length") +
    labs(x = "Total number of distance calculations",
         y = "Median time (ms)")
```

Here we see a clear effect of window sizes and series' lengths and, as was the case for DTW, parallelization can be very beneficial for GAK.

# Prototyping experiments

In this section, we briefly look at the running times of 3 centroid functions: shape extraction, DBA and the soft-DTW procedure. Each experiment was run `r attr(cent_results, "times")` times and the median value was computed.

## Shape extraction

Shape extraction is based on SBD, and thus has no window constraints. The multivariate version simply applies the univariate algorithm to each of the variables in the series. It does not use parallelization right now.

```{r cent-shape-plot, fig.height=5}
id_gg <- grepl("^Shape", cent_results$cent)
ggplot(cent_results[id_gg,],
       aes(x = num_series, y = median_time_ms, colour = series_length)) +
    geom_line() +
    geom_point(size = 2) +
    facet_wrap(~cent) +
    scale_color_discrete(name = "Series' length") +
    labs(x = "Amount of time-series considered",
         y = "Median time (ms)")
```

As expected, the multivariate version is proportionally slower than the univariate version. The length of the series can indeed have a significant effect, and using more series naturally increases running times, albeit linearly.

## DTW barycenter averaging

DBA is based on DTW, so it can use window constraints. Additionally, it supports multivariate series for the same reason, but in 2 variations. One variation simply uses the same strategy as shape extraction: it applies the univariate version to each of the variables and binds the resulting series. However, DBA uses the backtracking feature of DTW, and this can be computed based on an LCM calculated from two multivariate series as a whole. In `dtwclust`, the former variation is the `by-variable` version, and the latter is the `by-series` version. The implementation also supports multi-threading, separating the input series onto different threads. Also note that this implementation of DBA allocates the LCM used for backtracking only once in each thread; it does not matter if series have different lengths, as long as the LCM's dimensions are defined based on the longest series.
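A sketch of the corresponding calls might look as follows; the subset and window size are arbitrary, and the `mv.ver` argument is assumed to select between the two multivariate variations just described.

```{r cent-dba-example, eval=FALSE, echo=TRUE}
# Hedged sketch: univariate DBA and the two multivariate variations
cent_uv  <- dba(CharTraj[1L:10L], window.size = 20L)
cent_mvV <- dba(CharTrajMV[1L:10L], window.size = 20L, mv.ver = "by-variable")
cent_mvS <- dba(CharTrajMV[1L:10L], window.size = 20L, mv.ver = "by-series")
```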
```{r cent-dba-plot, fig.height=20, fig.cap="*The facets' rows indicate the DBA version and the window size used. The facets' columns indicate how many threads were used. Note the different vertical scales for each row of the facets.*"}
id_gg <- grepl("^DBA", cent_results$cent)
ggplot(cent_results[id_gg,],
       aes(x = num_series, y = median_time_ms, colour = series_length)) +
    geom_line() +
    geom_point(size = 1.5) +
    facet_grid(cent + window_size ~ num_threads, scales = "free") +
    scale_color_discrete(name = "Series' length") +
    labs(x = "Amount of time-series considered",
         y = "Median time (ms)")
```

Both series' length and window size have a somewhat constant effect on running times. As expected, the `by-series` multivariate version is more or less as fast as the univariate version, and the `by-variable` version is proportionally slower. In all cases, however, the growth is linear in the number of series considered, and multi-threading can provide a noticeable improvement.

## Soft-DTW centroid

This centroid calculation is based on soft-DTW, so it also supports multivariate series. Additionally, it depends on numerical optimization, and uses the `nloptr::nloptr` function. The implementation in `dtwclust` supports multi-threading in the same manner as DBA. Only the "NLOPT_LD_LBFGS" algorithm with a maximum of 20 evaluations was tested. Both `gamma` and the `weights` vector were left at their default values in all tests.

```{r cent-sdtwc-plot, fig.height=7, fig.cap="*The facets' rows indicate the centroid tested, and the columns show how many threads were used. Note the different vertical scales for each row of the facets.*"}
id_gg <- grepl("^Soft", cent_results$cent)
ggplot(cent_results[id_gg,],
       aes(x = num_series, y = median_time_ms, colour = series_length)) +
    geom_line() +
    geom_point(size = 2) +
    facet_grid(cent ~ num_threads, scales = "free") +
    scale_color_discrete(name = "Series' length") +
    labs(x = "Amount of time-series considered",
         y = "Median time (ms)")
```

The series' length has a very considerable effect in this case. Multi-threading also proves to be beneficial, but the optimization procedure can be considerably slow for data with high dimensionality, making this centroid calculation orders of magnitude slower than the other options. Interestingly, the multivariate version was faster in this case, but this probably shouldn't be generalized.
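For reference, the corresponding calls are sketched below; as in the experiments, `gamma` and `weights` are left at their default values, and the subset size is arbitrary.

```{r cent-sdtwc-example, eval=FALSE, echo=TRUE}
# Hedged sketch: soft-DTW centroids for a small subset of series
cent_uv <- sdtw_cent(CharTraj[1L:10L])
cent_mv <- sdtw_cent(CharTrajMV[1L:10L])
```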
# Clustering experiments

In this section we look at particular cases of time-series clustering, namely TADPole and some special variations of partitional clustering. Hierarchical clustering is not considered here because its complexity essentially depends on the complexity of computing the whole cross-distance matrix, which was shown indirectly in Section 2.2; this also applies to fuzzy c-medoids. For fuzzy clustering with fuzzy c-means, the whole cross-distance matrix is not computed, but the running times also depend on the distance used, so the results of Section 2.2 are also applicable here.

For all experiments in this section, the number of available threads was left at 4. Since multi-threading is active by default, it is expected that most users will benefit from it; what matters here is relative performance, and since all tests had access to all threads, the comparisons remain fair.

## TADPole

TADPole is a very particular algorithm. It is mostly limited by RAM, since it requires 3 `N x N` matrices for `N` series. It is based on DTW but also requires a cutoff distance (`dc`) parameter, which can directly affect how many DTW calculations are performed. Hence, its run-time behavior cannot be generalized easily. Nevertheless, here we look at its characteristics for the given dataset and a `dc` value of 10. Each experiment was repeated `r attr(clus_tadpole_results, "times")` times and the median value was computed. No multi-process parallelization tests were made here because the current implementation of TADPole only takes advantage of that for multiple `dc` values, and that was not explored.

```{r clust-tadpole-plot}
ggplot(clus_tadpole_results,
       aes(x = num_series, y = median_time_s, colour = window_size)) +
    geom_line() +
    geom_point(size = 2) +
    facet_wrap(~lb) +
    scale_color_discrete(name = "Window size") +
    labs(x = "Number of time-series in the data",
         y = "Median time (s)")
```

As expected (from Section 2.2), the choice of DTW lower bound has practically no effect. On the other hand, the choice of window size can have very significant effects. Finally, we can see that the growth is not linear in the number of series.
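A run like the ones timed above could be sketched as follows; the `dc` value matches the one mentioned before, while `k`, the common length and the window size are illustrative.

```{r clust-tadpole-example, eval=FALSE, echo=TRUE}
# Hedged sketch of a TADPole run through tsclust
# (TADPole relies on the DTW lower bounds, so the series must have equal lengths)
series <- lapply(CharTraj, reinterpolate, new.length = 150L)
pc_tadp <- tsclust(series, type = "tadpole", k = 20L,
                   control = tadpole_control(dc = 10, window.size = 20L, lb = "lbk"))
```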
## DTW special cases

Even if we were to test only time-series clusterings with the DTW distance, we would have several options to choose from. In this section, we look at the differences between clustering with `dtw_basic` and `dtw_lb` when using 2 centroids: partitioning around medoids (PAM) and DBA. For PAM, we also look briefly at the usage of sparse matrices. These tests were repeated `r attr(partitional_results, "times")` times and the median value was computed.

### PAM centroids

We test 3 different strategies to apply PAM centroids: pre-computing the whole distance matrix once, performing no pre-computation and using a sparse matrix that is updated iteratively, and performing no pre-computation but using `dtw_lb` at each iteration. Even though **the remarks for sparse matrices apply to any distance**, we only use DTW here (the `dtw_basic` version). The value of `k` was fixed at 20. For the next figure, only 1 repetition was made in each experiment.

```{r clust-part-dtw-pam-plot, fig.height=5}
ggdf <- reshape2::melt(partitional_results$dtwlb_vs_dtwbasic$pam,
                       id.vars = c("num_series", "k", "window_size"))
levels(ggdf$variable) <- c("DTW (Full Pre-Computation)", "DTW (Sparse Matrix)", "LBI + DTW")
ggplot(ggdf, aes(x = num_series, y = value, colour = window_size)) +
    geom_line() +
    geom_point(size = 2) +
    facet_wrap(~variable) +
    scale_color_discrete(name = "Window size") +
    labs(x = "Number of time-series in the data",
         y = "Median time (s)")
```

We see above that using `dtw_lb` can be considerably faster than using `dtw_basic`, and that using a sparse matrix appears to be the fastest. However, performing only 1 repetition for partitional clustering is a poor choice, and performing more repetitions changes things entirely, as the next figure shows.

```{r clust-part-dtw-pam-reps-plot, fig.height=5, fig.cap="*The window size was fixed at 20 for these experiments.*"}
cols <- setdiff(colnames(partitional_results$dtwlb_vs_dtwbasic$pam_vs_reps),
                "sparse_distmat_filled_percent")
ggdf <- reshape2::melt(partitional_results$dtwlb_vs_dtwbasic$pam_vs_reps[, cols],
                       id.vars = c("num_series", "k", "num_repetitions"))
levels(ggdf$variable) <- c("DTW (Full Pre-Computation)", "DTW (Sparse Matrix)", "LBI + DTW")
ggplot(ggdf, aes(x = num_series, y = value, colour = num_repetitions)) +
    geom_line() +
    geom_point(size = 2) +
    facet_wrap(~variable) +
    scale_color_discrete(name = "Number of repetitions") +
    labs(x = "Number of time-series in the data",
         y = "Median time (s)")
```

Pre-computing the whole distance matrix once means that it can be re-used across all repetitions (left-most facet above), so it makes sense that the running times are practically the same there. As we can see, using `dtw_lb` is a bad idea after around 4 repetitions; beyond that, the time required is much larger. Using a sparse matrix can indeed be faster, but by the time we make 5 repetitions, the running times are almost the same as when pre-computing the whole matrix. This makes sense, since more repetitions mean that there is a higher chance of filling the sparse matrix, and there is an overhead for keeping track of how many cells of the sparse matrix are already set. The next figure shows the percentage of the sparse matrix that was filled for the experiments in the middle facet above.

```{r clust-part-dtw-pam-reps-distmat-plot, fig.height=5}
ggplot(partitional_results$dtwlb_vs_dtwbasic$pam_vs_reps,
       aes(x = num_series, y = sparse_distmat_filled_percent, colour = num_repetitions)) +
    geom_line() +
    geom_point(size = 2) +
    scale_color_discrete(name = "Number of repetitions") +
    labs(x = "Number of time-series in the data",
         y = "Percentage of non-zero entries in the sparse matrix")
```

It can be argued that 10 repetitions is still too few for partitional clustering, so using PAM with sparse matrices or using `dtw_lb` is not a very good choice. Unfortunately, pre-computing the whole distance matrix requires much more RAM, and this requirement increases quadratically with the number of series in the dataset. In cases where RAM is a limitation, using `dtw_lb` may be an appropriate alternative.
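The three strategies compared above map to calls like the following sketch, under the assumption that `pam.precompute` and `pam.sparse` in `partitional_control` toggle the pre-computation and sparse-matrix behavior discussed here; the number of repetitions, `k` and the window size are illustrative.

```{r clust-part-pam-strategies-example, eval=FALSE, echo=TRUE}
# Hedged sketch of the three PAM strategies (equal lengths are needed for dtw_lb)
series <- lapply(CharTraj, reinterpolate, new.length = 150L)
dist_args <- tsclust_args(dist = list(window.size = 20L))

# 1) full pre-computation of the cross-distance matrix
pc_full <- tsclust(series, k = 20L, distance = "dtw_basic", centroid = "pam",
                   control = partitional_control(pam.precompute = TRUE, nrep = 10L),
                   args = dist_args)
# 2) no pre-computation, sparse matrix updated iteratively
pc_sparse <- tsclust(series, k = 20L, distance = "dtw_basic", centroid = "pam",
                     control = partitional_control(pam.precompute = FALSE,
                                                   pam.sparse = TRUE, nrep = 10L),
                     args = dist_args)
# 3) no pre-computation, dtw_lb at each iteration
pc_dtwlb <- tsclust(series, k = 20L, distance = "dtw_lb", centroid = "pam",
                    control = partitional_control(pam.precompute = FALSE, nrep = 10L),
                    args = dist_args)
```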
### DBA centroids

The previous section looked solely at PAM centroids because there are several specific nuances for them. However, for other centroids the story changes. For all non-PAM centroids, the centroid time-series change at every iteration and for every repetition, so there can be no pre-computation of the cross-distance matrix. With that in mind, we look here at the difference between `dtw_basic` and `dtw_lb` for DBA centroids. This comparison of DTW-based distances probably **applies to other centroids too**.

```{r clust-part-dtw-dba-plot, fig.height=5}
ggdf <- reshape2::melt(partitional_results$dtwlb_vs_dtwbasic$dba,
                       id.vars = c("num_series", "k", "window_size"))
levels(ggdf$variable) <- c("DTW", "LBI + DTW")
ggplot(ggdf, aes(x = num_series, y = value, colour = window_size)) +
    geom_line() +
    geom_point(size = 2) +
    facet_wrap(~variable) +
    scale_color_discrete(name = "Window size") +
    labs(x = "Number of time-series in the data",
         y = "Median time (s)")
```

We can see that, depending on the value of the window size, using `dtw_lb` can be somewhat faster. Since partitional clustering with `dtw_lb` should provide the **same** results as using `dtw_basic`, there seems to be no reason not to use it for non-PAM centroids.

## Effect of `k` on PAM centroids with sparse cross-distance matrices

In the previous section we looked at some characteristics of PAM centroids with a sparse cross-distance matrix, where the value of `k` was fixed. In this section, we briefly explore the effect of `k`. A sparse matrix can be further optimized if it is symmetric, i.e. if its lower and upper triangular parts are equal. A cross-distance sparse matrix is symmetric if the distance used to compute it is itself symmetric. Therefore, we perform the following experiments for a non-symmetric case of DTW and the symmetric SBD. These experiments were also repeated `r attr(partitional_results, "times")` times and the median value computed. Only 1 repetition was made in each case.

### Non-symmetric

```{r clust-part-sparse-pam-k-nonsymmetric-plot, fig.height=5}
cols <- setdiff(colnames(partitional_results$sparse_pam_k$non_symmetric),
                "sparse_distmat_filled_percent")
ggdf <- reshape2::melt(partitional_results$sparse_pam_k$non_symmetric[, cols],
                       id.vars = c("num_series", "k"))
levels(ggdf$variable) <- c("Full Matrix Pre-Computation", "Sparse Matrix")
ggplot(ggdf, aes(x = num_series, y = value, colour = k)) +
    geom_line() +
    geom_point(size = 2) +
    scale_color_discrete(name = "Amount of clusters") +
    facet_wrap(~variable) +
    labs(x = "Number of time-series in the data",
         y = "Median time (s)")

ggplot(partitional_results$sparse_pam_k$non_symmetric,
       aes(x = num_series, y = sparse_distmat_filled_percent, colour = k)) +
    geom_line() +
    geom_point(size = 2) +
    scale_color_discrete(name = "Amount of clusters") +
    labs(x = "Number of time-series in the data",
         y = "Percentage of non-zero entries in the sparse matrix")
```

As expected, smaller values of `k` cause the sparse matrix to fill up more, and the associated overhead leads to greater running times, although they are still lower than those with pre-computation of the whole matrix.

### Symmetric

```{r clust-part-sparse-pam-k-symmetric-plot, fig.height=5}
cols <- setdiff(colnames(partitional_results$sparse_pam_k$symmetric),
                "sparse_distmat_filled_percent")
ggdf <- reshape2::melt(partitional_results$sparse_pam_k$symmetric[, cols],
                       id.vars = c("num_series", "k"))
levels(ggdf$variable) <- c("Full Matrix Pre-Computation", "Sparse Matrix")
ggplot(ggdf, aes(x = num_series, y = value, colour = k)) +
    geom_line() +
    geom_point(size = 2) +
    scale_color_discrete(name = "Amount of clusters") +
    facet_wrap(~variable) +
    labs(x = "Number of time-series in the data",
         y = "Median time (s)")

ggplot(partitional_results$sparse_pam_k$symmetric,
       aes(x = num_series, y = sparse_distmat_filled_percent, colour = k)) +
    geom_line() +
    geom_point(size = 2) +
    scale_color_discrete(name = "Amount of clusters") +
    labs(x = "Number of time-series in the data",
         y = "Percentage of non-zero entries in the sparse matrix")
```

Here we see the opposite behavior, i.e. using a sparse matrix is actually slower than computing the whole matrix at the beginning.

### Remarks

There is indeed a dependency on the value of `k`, but if we also consider the results from Section 4.2.1 for many repetitions, it can be argued that using sparse matrices with PAM centroids would seldom be useful.

# References