Digital repository of Slovenian research organisations


Title:Ground truth clustering is not the optimum clustering
Authors:Absalom Bautista, Lucia (Author)
Hrga, Timotej (Author)
Povh, Janez (Author)
Zhao, Shudian (Author)
Files:URL - Source URL: https://www.nature.com/articles/s41598-025-90865-9
PDF - Presentation file (3.70 MB)
MD5: B7E811F029090C6131103F8630717E70
 
Language:English
Typology:1.01 - Original Scientific Article
Organization:RUDOLFOVO - Rudolfovo - Science and Technology Centre Novo Mesto
Abstract:Data clustering is a fundamental yet challenging task in data science. The minimum sum-of-squares clustering (MSSC) problem aims to partition data points into k clusters to minimize the sum of squared distances between the points and their cluster centers (centroids). Despite being NP-hard, solvers exist that can compute optimal solutions for small to medium-sized datasets. One such solver is SOS-SDP, a branch-and-bound algorithm based on semidefinite programming. We used it to obtain optimal MSSC solutions (optimum clusterings) for various k across multiple datasets with known ground truth clusterings. We evaluated the alignment between the optimum and ground truth clusterings using six extrinsic measures and assessed their quality using three intrinsic measures. The results reveal that the optimum clusterings often differ significantly from the ground truth clusterings. Additionally, the optimum clusterings frequently outperform the ground truth clusterings, according to the intrinsic measures that we used. However, when ground truth clusters are well-separated convex shapes, such as ellipsoids, the optimum and ground truth clusterings closely align.
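The MSSC objective described in the abstract can be illustrated with a minimal sketch. The function name `mssc_objective` and the four-point toy dataset below are hypothetical and are not taken from the article or from SOS-SDP; the sketch only evaluates the objective for two given partitions and does not solve the optimization problem.

```python
from collections import defaultdict


def mssc_objective(points, labels):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    # Group points by their cluster label.
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)

    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        # The centroid is the coordinate-wise mean of the cluster members.
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        total += sum((m[d] - centroid[d]) ** 2 for m in members for d in range(dim))
    return total


# Hypothetical toy data: two long, thin horizontal stripes as "ground truth".
points = [(0.0, 0.0), (10.0, 0.0), (0.0, 1.0), (10.0, 1.0)]
ground_truth = [0, 0, 1, 1]  # stripe y = 0 versus stripe y = 1
alternative = [0, 1, 0, 1]   # left pair versus right pair

gt_cost = mssc_objective(points, ground_truth)
alt_cost = mssc_objective(points, alternative)
print(gt_cost, alt_cost)
```

In this toy case the left/right partition has a much lower MSSC objective than the stripe labelling, which is the kind of mismatch between ground truth and optimum clusterings that the abstract reports for clusters that are not well-separated convex shapes.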
Keywords:minimum sum-of-squares clustering, ground truth clustering, extrinsic measures, intrinsic measures
Publication version:Version of Record
Publication date:01.01.2025
Year of publishing:2025
Number of pages:pp. 1-17
Numbering:Vol. 15, article no. 9223
PID:20.500.12556/DiRROS-22530
UDC:519.85
ISSN on article:2045-2322
DOI:10.1038/s41598-025-90865-9
COBISS.SI-ID:229897731
Note:Title from title screen; Description based on resource as of 22 March 2025; Co-authors: Timotej Hrga, Janez Povh & Shudian Zhao;
Publication date in DiRROS:29.05.2025
Views:571
Downloads:258

Record is a part of a journal

Title:Scientific reports
Shortened title:Sci. rep.
Publisher:Nature Publishing Group
ISSN:2045-2322
COBISS.SI-ID:18727432

Document is financed by a project

Funder:ARIS - Slovenian Research and Innovation Agency
Project number:DIGITOP- RRI
Name:Digitalna transformacija robotiziranih tovarn prihodnosti (Digital transformation of robotized factories of the future)
Acronym:DIGITOP

Licences

License:CC BY 4.0, Creative Commons Attribution 4.0 International
Link:http://creativecommons.org/licenses/by/4.0/
Description:This is the standard Creative Commons license that gives others maximum freedom to do what they want with the work as long as they credit the author.

Secondary language

Language:Slovenian
Abstract:Razvrščanje podatkov v skupine je temeljna, a zelo zahtevna naloga v podatkovni znanosti. Problem razvrščanja z minimalno vsoto kvadratov odklonov (MSSC) je osredotočen na razvrščanje podatkovnih točk v k skupin na način, da bi bila vsota kvadratov razdalj med točkami in centri skupin (centroidi) minimalna. Kljub temu, da je to NP-težek problem, obstajajo reševalniki za ta problem, ki lahko izračunajo optimalne rešitve za majhne in srednje velike nabore podatkov. Eden takšnih reševalnikov je SOS-SDP, ki temelji na razveji in omeji algoritmu in na semidefinitnem programiranju. Uporabili smo ga za pridobitev optimalnih rešitev MSSC (optimalnih razvrščanj) za različne vrednosti k preko več naborov podatkov z znanimi dejanskimi razvrstitvami. Ugotavljali smo skladnost med optimalnimi in dejanskimi razvrstitvami z uporabo šestih zunanjih mer ter ocenili njihovo kakovost z uporabo treh notranjih mer. Rezultati kažejo, da se optimalne razvrstitve pogosto znatno razlikujejo od dejanskih razvrstitev. Poleg tega optimalne razvrstitve pogosto presegajo dejanske razvrstitve glede na vrednosti notranjih mer, ki smo jih uporabili. Kadar pa so dejanske skupine dobro ločene in imajo konveksne oblike, kot so npr. elipsoidi, so optimalne in dejanske razvrstitve tesno usklajene.
Keywords:razvrščanje z minimalno vsoto kvadratov, dejansko razvrščanje, zunanje mere, notranje mere

