Abstract
In approximate query searching (AQS), the given query point \({\bar{\textbf{q}}}'\) can be seen as a noise-corrupted version of one of the points \({\bar{\textbf{q}}}\) in the existing database \({\mathcal {X}}\), i.e., \({\bar{\textbf{q}}}' = {\bar{\textbf{q}}} + {\bar{\mathbf{\eta }}}\), where \({\bar{\mathbf{\eta }}}\) denotes the noise. Choosing a distance d that returns the correct match \({\bar{\textbf{q}}}\) therefore requires that the distance be aware of the type of distribution of the noise. In this work, we study the suitability of Minkowski-type distances in AQS when \({\bar{\textbf{q}}}\) is corrupted by white and coloured noise to different extents. To this end, we employ a simple similarity-search-based scoring algorithm proposed in François et al. (ESANN 2005, 13th European Symposium on Artificial Neural Networks, Bruges, Belgium, April 27–29, 2005, Proceedings, pp 339–344, 2005). Our study reveals an interesting interplay of three D's in the quest for an appropriate distance: the Dimensionality and Domain geometry of the data, and the type of noise Distribution. This interplay has led us to explore the problem from a basic geometric perspective. Our main contribution herein is the proposal of a novel index called the Relative Contained Volume (RCV), which helps explain the performance of the considered distances.
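The setting described in the abstract can be illustrated with a minimal sketch. It is not the scoring algorithm of François et al. (2005) nor the authors' MATLAB code; it is only a plain nearest-neighbour check under assumed conditions. The database (uniform points in \([0,1]^{10}\)), the noise level `sigma`, the values of `p`, and the helper names `minkowski_pp` and `retrieval_rate` are all hypothetical choices made for illustration.

```python
import numpy as np


def minkowski_pp(X, q, p):
    """Return sum_i |x_i - q_i|^p for every row x of X.

    This is the p-th power of the Minkowski distance; it ranks the rows
    in the same order as the distance itself.
    """
    return np.sum(np.abs(X - q) ** p, axis=1)


def retrieval_rate(X, sigma, p, rng):
    """Fraction of noisy queries whose nearest database point is the original."""
    # Assumption: white Gaussian noise with the same sigma in every coordinate.
    noise = rng.normal(0.0, sigma, size=X.shape)
    hits = 0
    for i, q_noisy in enumerate(X + noise):
        nearest = int(np.argmin(minkowski_pp(X, q_noisy, p)))
        hits += int(nearest == i)
    return hits / len(X)


rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 10))  # hypothetical database in [0, 1]^10

for p in (0.5, 1.0, 2.0):
    rate = retrieval_rate(X, sigma=0.05, p=p, rng=rng)
    print(f"p = {p}: fraction of correct matches = {rate:.3f}")
```

Varying `sigma`, the dimension, or the noise model in such a sketch is one way to see how the choice of `p` interacts with the noise distribution, which is the question the article studies in detail.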
Data availability
The analysis of all the data generated during this study is included in this article. The MATLAB codes to generate and analyse the data are available from the corresponding author on reasonable request.
Notes
Note that \(\left( \ell _p({\bar{\textbf{x}}}, {\bar{\textbf{y}}}) \right) ^p = \Vert {\bar{\textbf{x}}} - {\bar{\textbf{y}}} \Vert _p^p = \sum _{i=1}^m \vert x_i - y_i \vert ^p \) is, in fact, not a metric, since it does not satisfy the triangle inequality. It does satisfy the other two properties of a metric, hence our choice of terminology.
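As a quick check of the triangle-inequality failure noted above, consider an illustrative case that is an assumption of this sketch rather than an example taken from the article: \(m = 1\), \(p = 2\), and the points \(x = 0\), \(y = 1\), \(z = 2\). Then

\[ \vert x - z \vert ^2 = 4 > 2 = \vert x - y \vert ^2 + \vert y - z \vert ^2 , \]

so the \(p\)-th power of \(\ell _p\) violates the triangle inequality in this case, while symmetry and non-negativity (with value zero only for identical points) still hold.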
This is an expanded version of the preliminary findings presented in Singh and Jayaram (2020).
Perhaps this explains the choice of \(\sigma _i = 0.3\) in François et al. (2005) where \({\mathcal {D}} = [0,1]^m\).
References
Aeberhard S, Forina M (1991) Wine. UCI Machine Learning Repository. https://doi.org/10.24432/C5PC7J
Aggarwal CC, Hinneburg A, Keim DA (2001) On the surprising behavior of distance metrics in high dimensional spaces. In: Database theory—ICDT 2001, 8th International Conference, London, UK, January 4–6, 2001, Proceedings, pp 420–434
Bator M (2015) Dataset for sensorless drive diagnosis. UCI Machine Learning Repository. https://doi.org/10.24432/C5VP5F
Beyer KS, Goldstein J, Ramakrishnan R, Shaft U (1999) When is “nearest neighbor” meaningful? In: Database theory—ICDT ’99, 7th International Conference, Jerusalem, Israel, January 10–12, 1999, Proceedings, pp 217–235
Blum A, Hopcroft J, Kannan R (2020) Foundations of data science. Cambridge University Press, Cambridge
Bock R (2007) Magic gamma telescope. UCI Machine Learning Repository. https://doi.org/10.24432/C52C8B
Durrant RJ, Kabán A (2009) When is ‘nearest neighbour’ meaningful: a converse theorem and implications. J Complex 25(4):385–397
François D, Wertz V, Verleysen M (2005) Non-Euclidean metrics for similarity search in noisy datasets. In: ESANN 2005, 13th European symposium on artificial neural networks, Bruges, Belgium, April 27–29, 2005, Proceedings, pp 339–344
François D, Wertz V, Verleysen M (2007) The concentration of fractional distances. IEEE Trans Knowl Data Eng 19(7):873–886
Hinneburg A, Aggarwal CC, Keim DA (2000) What is the nearest neighbor in high dimensional spaces? In: VLDB 2000, Proceedings of 26th international conference on very large data bases, September 10–14, 2000, Cairo, Egypt, pp 506–515
Hu Y, Yu M, Wang H, Ting Z (2015) A similarity-based learning algorithm using distance transformation. IEEE Trans Knowl Data Eng 27(6):1452–1464
Jayaram B, Klawonn F (2012) Can unbounded distance measures mitigate the curse of dimensionality? Int J Data Min Model Manag 4(4):361–383
Klawonn F, Höppner F, Jayaram B (2012) What are clusters in high dimensions and are they difficult to find? In: Revised selected papers of the first international workshop on clustering high-dimensional data, vol 7627. Springer, Berlin, pp 14–33
Kumari S, Jayaram B (2017) Measuring concentration of distances—an effective and efficient empirical index. IEEE Trans Knowl Data Eng 29(2):373–386
Pestov V (2000) On the geometry of similarity search: dimensionality curse and concentration of measure. Inf Process Lett 73(1–2):47–51
Qiao M, Li J (2016) Distance-based mixture modeling for classification via hypothetical local mapping. Stat Anal Data Min 9(1):43–57
Singh A, Jayaram B (2020) Performance of Minkowski-type distances in similarity search—a geometrical approach. In: IEEE 5th International Conference on Computing Communication and Automation (ICCCA), pp 467–47
Smith DJ, Vamanamurthy MK (1989) How small is a unit ball? Math Mag 62(2):101–107
Wang Z, Bovik AC (2009) Mean squared error: love it or leave it? A new look at signal fidelity measures. IEEE Signal Process Mag 26(1):98–117
Weinberger KQ, Sha F, Saul LK (2010) Convex optimizations for distance metric learning and pattern classification [Applications Corner]. IEEE Signal Process Mag 27(3):146–158
Wolberg W, Street W, Mangasarian O (1995) Breast cancer Wisconsin (prognostic). UCI Machine Learning Repository. https://doi.org/10.24432/C5GK50
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest. The authors have no relevant financial or non-financial interests to disclose.
About this article
Cite this article
Singh, A., Jayaram, B. Minkowski-type distances in approximate query searches. Comp. Appl. Math. 43, 187 (2024). https://doi.org/10.1007/s40314-024-02704-8
DOI: https://doi.org/10.1007/s40314-024-02704-8
Keywords
- Approximate query searching
- High-dimensional data analysis
- Minkowski distances
- Relative contained volume