Statistical embedding: Beyond principal components
by
Dag Tjøstheim, Martin Jullum, and Anders Løland
2021
Abstract
There has been intense recent activity in the embedding of very
high-dimensional and nonlinear data structures, much of it in the data science
and machine learning literature. We survey this activity in four parts. In the
first part we cover nonlinear methods such as principal curves,
multidimensional scaling, local linear methods, ISOMAP, graph-based methods and
kernel-based methods. The second part is concerned with topological embedding
methods, in particular the mapping of topological properties into persistence
diagrams. Another type of data experiencing tremendous growth is very
high-dimensional network data. The task considered in part three is how to
embed such data in a vector space of moderate dimension so as to make the data
amenable to traditional techniques such as clustering and classification. The
final part of the survey deals with embedding in
ℝ^2, that is, visualization. Three methods are presented: t-SNE,
UMAP and LargeVis, based on the methods in parts one, two and three, respectively.
The methods are illustrated and compared on two simulated data sets; one
consisting of a triple of noisy Ranunculoid curves, and one consisting of
networks of increasing complexity and with two types of nodes.
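As a minimal illustration of the visualization task described above, the sketch below embeds a single noisy Ranunculoid curve (an epicycloid with five cusps) into ℝ^2 with scikit-learn's t-SNE. This is an assumption-laden toy, not the paper's actual experiment: the survey uses a triple of such curves and compares t-SNE, UMAP and LargeVis, whereas here one curve and one method stand in, with illustrative noise level and perplexity.

```python
# Hypothetical sketch, not the authors' code: t-SNE embedding of one
# noisy Ranunculoid curve (the paper's simulation uses three curves).
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 300)
# Ranunculoid: epicycloid with 5 cusps, plus Gaussian noise (scale chosen ad hoc)
x = 6 * np.cos(t) - np.cos(6 * t) + rng.normal(scale=0.1, size=t.size)
y = 6 * np.sin(t) - np.sin(6 * t) + rng.normal(scale=0.1, size=t.size)
X = np.column_stack([x, y])

# Embed into R^2; perplexity and seed are illustrative choices
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)  # (300, 2)
```

The resulting two-column array can be scatter-plotted directly; with three interleaved curves, the interesting question the survey examines is whether the embedding keeps the curves visually separated.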
Archived Files and Locations
application/pdf, 1.6 MB — arxiv.org (repository), web.archive.org (webarchive)
arXiv: 2106.01858v1