DIFFER Publication

Two-step clustering for data reduction combining DBSCAN and k-means clustering

Author
Abstract
A novel combination of two widely used clustering algorithms is proposed here for the detection and reduction of high data density regions. The density-based spatial clustering of applications with noise (DBSCAN) algorithm is used for the detection of high data density regions and the k-means algorithm for reduction. The proposed algorithm iterates while successively decrementing the DBSCAN search radius, allowing for an adaptive reduction factor based on the effective data density. The algorithm is demonstrated for a physics simulation application, where a surrogate model for fusion reactor plasma turbulence is generated with neural networks. A training dataset for the surrogate model is created with a quasilinear gyrokinetics code for turbulent transport calculations in fusion plasmas. The training set consists of model inputs derived from a repository of experimental measurements, meaning there is a potential risk of over-representing specific regions of this input parameter space. This study demonstrates that applying the proposed reduction algorithm shrinks the training dataset by a factor of approximately 20 without a noticeable loss in the surrogate model accuracy. This reduction also provides a way of analyzing existing high-dimensional datasets for biases and consequently reducing them, which lowers the cost of re-populating that parameter space with higher-quality data.
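The two-step idea in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the function names, the `factor` and `min_samples` parameters, the fixed list of radii, and the choice to replace each dense cluster by its k-means centroids are all assumptions made for illustration. A production version would more likely rely on an established clustering library.

```python
import math
import random


def dbscan(points, eps, min_samples):
    """Naive O(n^2) DBSCAN: returns a cluster id per point, -1 for noise."""
    n = len(points)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point, not expanded further
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # core point: grow the cluster
    return labels


def kmeans(points, k, rng, n_iter=20):
    """Plain Lloyd's algorithm: returns k centroids as tuples."""
    centroids = rng.sample(points, k)
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        # Keep the previous centroid if a group ends up empty.
        centroids = [
            tuple(sum(x) / len(g) for x in zip(*g)) if g else centroids[c]
            for c, g in enumerate(groups)
        ]
    return centroids


def two_step_reduce(points, eps_values, min_samples=5, factor=4, seed=0):
    """Hypothetical reduction loop (assumed, not taken from the paper).

    For each search radius, from largest to smallest, DBSCAN flags dense
    regions and each dense cluster is replaced by len(cluster) // factor
    k-means centroids. Points labelled as noise are kept unchanged.
    """
    rng = random.Random(seed)
    points = [tuple(p) for p in points]
    for eps in sorted(eps_values, reverse=True):
        labels = dbscan(points, eps, min_samples)
        kept = [p for p, lab in zip(points, labels) if lab == -1]
        for cluster in set(labels) - {-1}:
            members = [p for p, lab in zip(points, labels) if lab == cluster]
            k = max(1, len(members) // factor)
            kept.extend(kmeans(members, k, rng))
        points = kept
    return points
```

Calling `two_step_reduce(data, [0.5, 0.25])` on a list of coordinate tuples would run two passes. Regions that remain dense at the smaller radius are reduced again, which is one way to obtain a reduction factor that adapts to the local data density, while isolated points pass through untouched.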
Year of Publication
2023
Journal
Contributions to Plasma Physics
Volume
63
Issue
5-6
Article Number
202200177
URL
https://arxiv.org/abs/2111.12559
DOI
10.1002/ctpp.202200177
Dataset
10.5281/zenodo.7761172
PId
70da813fe4a4146414ed29b3bd976269
Alternate Journal
Contrib. Plasma Phys.
Label
OA
Journal Article