T-distributed stochastic neighbor embedding

T-distributed Stochastic Neighbor Embedding (t-SNE), which is widely used in machine learning, is a non-linear, non-deterministic algorithm that creates a two-dimensional map of high-dimensional data, even data with thousands of dimensions.

In other words, it transforms data from a high-dimensional space onto a 2D plane while trying to preserve each point's local neighborhood. It is a very popular approach for dimensionality reduction, as it is flexible and often finds structure or relationships in the data where other algorithms fail. It does this by calculating the probability that object i would pick object j as its neighbor: objects that are similar in the high-dimensional space (using Euclidean distance as the similarity metric) receive a higher probability than less similar objects. t-SNE uses the perplexity parameter to fine-tune this step and decide how to balance the local and global structure of the data.
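The neighbor-picking probabilities described above can be written out explicitly. In the standard t-SNE formulation, the probability that point $x_i$ picks $x_j$ as a neighbor is a Gaussian over Euclidean distances, the low-dimensional map points $y_i$ use a heavier-tailed Student t-distribution, and the perplexity setting fixes each bandwidth $\sigma_i$:

```latex
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}
```

Each $\sigma_i$ is chosen so that the entropy of $p_{\cdot|i}$ matches the user-supplied perplexity, $\mathrm{Perp}(P_i) = 2^{-\sum_j p_{j|i} \log_2 p_{j|i}}$, and the map is learned by minimizing the Kullback-Leibler divergence $\mathrm{KL}(P \,\|\, Q)$ with gradient descent, which is the "error" reported in the output below.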

t-SNE implementations are available in many languages; we are going to use the one available at https://github.com/lejon/T-SNE-Java. Using git and mvn, you can build the project and run the examples it provides. Execute the following commands:

> git clone https://github.com/lejon/T-SNE-Java.git
> cd T-SNE-Java
> mvn install
> cd tsne-demo
> java -jar target/tsne-demos-2.4.0.jar -nohdr -nolbls src/main/resources/datasets/iris_X.txt

The output will be as follows:

TSneCsv: Running 2000 iterations of t-SNE on src/main/resources/datasets/iris_X.txt
NA string is: null
Loaded CSV with: 150 rows and 4 columns.
Dataset types:[class java.lang.Double, class java.lang.Double, class java.lang.Double, class java.lang.Double]
V0 V1 V2 V3
0 5.10000000 3.50000000 1.40000000 0.20000000
1 4.90000000 3.00000000 1.40000000 0.20000000
2 4.70000000 3.20000000 1.30000000 0.20000000
3 4.60000000 3.10000000 1.50000000 0.20000000
4 5.00000000 3.60000000 1.40000000 0.20000000
5 5.40000000 3.90000000 1.70000000 0.40000000
6 4.60000000 3.40000000 1.40000000 0.30000000
7 5.00000000 3.40000000 1.50000000 0.20000000
8 4.40000000 2.90000000 1.40000000 0.20000000
9 4.90000000 3.10000000 1.50000000 0.10000000

Dim:150 x 4
000: [5.1000, 3.5000, 1.4000, 0.2000...]
001: [4.9000, 3.0000, 1.4000, 0.2000...]
002: [4.7000, 3.2000, 1.3000, 0.2000...]
003: [4.6000, 3.1000, 1.5000, 0.2000...]
004: [5.0000, 3.6000, 1.4000, 0.2000...]
.
.
.
145: [6.7000, 3.0000, 5.2000, 2.3000]
146: [6.3000, 2.5000, 5.0000, 1.9000]
147: [6.5000, 3.0000, 5.2000, 2.0000]
148: [6.2000, 3.4000, 5.4000, 2.3000]
149: [5.9000, 3.0000, 5.1000, 1.8000]
X:Shape is = 150 x 4
Using no_dims = 2, perplexity = 20.000000, and theta = 0.500000
Computing input similarities...
Done in 0.06 seconds (sparsity = 0.472756)!
Learning embedding...
Iteration 50: error is 64.67259135061494 (50 iterations in 0.19 seconds)
Iteration 100: error is 61.50118570075227 (50 iterations in 0.20 seconds)
Iteration 150: error is 61.373758889762875 (50 iterations in 0.20 seconds)
Iteration 200: error is 55.78219488135168 (50 iterations in 0.09 seconds)
Iteration 250: error is 2.3581173593529687 (50 iterations in 0.09 seconds)
Iteration 300: error is 2.2349608757095827 (50 iterations in 0.07 seconds)
Iteration 350: error is 1.9906437450336596 (50 iterations in 0.07 seconds)
Iteration 400: error is 1.8958764344779482 (50 iterations in 0.08 seconds)
Iteration 450: error is 1.7360726540960958 (50 iterations in 0.08 seconds)
Iteration 500: error is 1.553250634564741 (50 iterations in 0.09 seconds)
Iteration 550: error is 1.294981722012944 (50 iterations in 0.06 seconds)
Iteration 600: error is 1.0985607573299603 (50 iterations in 0.03 seconds)
Iteration 650: error is 1.0810715645272573 (50 iterations in 0.04 seconds)
Iteration 700: error is 0.8168399675722107 (50 iterations in 0.05 seconds)
Iteration 750: error is 0.7158739920771124 (50 iterations in 0.03 seconds)
Iteration 800: error is 0.6911748222330966 (50 iterations in 0.04 seconds)
Iteration 850: error is 0.6123536061655738 (50 iterations in 0.04 seconds)
Iteration 900: error is 0.5631133416913786 (50 iterations in 0.04 seconds)
Iteration 950: error is 0.5905547118496892 (50 iterations in 0.03 seconds)
Iteration 1000: error is 0.5053631170520657 (50 iterations in 0.04 seconds)
Iteration 1050: error is 0.44752244538411406 (50 iterations in 0.04 seconds)
Iteration 1100: error is 0.40661841893114614 (50 iterations in 0.03 seconds)
Iteration 1150: error is 0.3267394426152807 (50 iterations in 0.05 seconds)
Iteration 1200: error is 0.3393774577158965 (50 iterations in 0.03 seconds)
Iteration 1250: error is 0.37023103950965025 (50 iterations in 0.04 seconds)
Iteration 1300: error is 0.3192975790641602 (50 iterations in 0.04 seconds)
Iteration 1350: error is 0.28140161036965816 (50 iterations in 0.03 seconds)
Iteration 1400: error is 0.30413739839879855 (50 iterations in 0.04 seconds)
Iteration 1450: error is 0.31755361125826165 (50 iterations in 0.04 seconds)
Iteration 1500: error is 0.36301524742916624 (50 iterations in 0.04 seconds)
Iteration 1550: error is 0.3063801941900375 (50 iterations in 0.03 seconds)
Iteration 1600: error is 0.2928584822753138 (50 iterations in 0.03 seconds)
Iteration 1650: error is 0.2867502934852756 (50 iterations in 0.03 seconds)
Iteration 1700: error is 0.470469997545481 (50 iterations in 0.04 seconds)
Iteration 1750: error is 0.4792376115843584 (50 iterations in 0.04 seconds)
Iteration 1800: error is 0.5100126924750723 (50 iterations in 0.06 seconds)
Iteration 1850: error is 0.37855035406353427 (50 iterations in 0.04 seconds)
Iteration 1900: error is 0.32776847081948496 (50 iterations in 0.04 seconds)
Iteration 1950: error is 0.3875134029990107 (50 iterations in 0.04 seconds)
Iteration 1999: error is 0.32560416632168365 (50 iterations in 0.04 seconds)
Fitting performed in 2.29 seconds.
TSne took: 2.43 seconds

This example uses iris_X.txt, which has 150 rows and 4 columns, so the input dimensions are 150 x 4. t-SNE reduces these 4 dimensions to 2, with the perplexity set to 20 and theta to 0.5. It iterates over the data in iris_X.txt and, using gradient descent, produces a plot of the data on a 2D plane after 2,000 iterations. The plot shows the clusters in the data, thereby effectively reducing the dimensionality. For the mathematical details of how this is achieved, there are many papers on the topic, and the Wikipedia article (https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) explains it too.
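If you prefer to run the same experiment from code rather than the command line, the reduction can be sketched with scikit-learn's TSNE class. Note that this is an independent implementation, not T-SNE-Java, and the parameter names below are scikit-learn's; it uses the same Iris data (150 rows, 4 columns) and the same perplexity of 20 as the demo above.

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

# The Iris data: 150 samples x 4 features, matching iris_X.txt.
X = load_iris().data

# Reduce the 4 input dimensions to 2; perplexity=20 mirrors the
# demo's setting. random_state pins the otherwise non-deterministic
# initialization so repeated runs give the same map.
embedding = TSNE(n_components=2, perplexity=20,
                 random_state=0).fit_transform(X)

print(embedding.shape)  # (150, 2): one 2D point per flower
```

Plotting the two columns of `embedding` (for example with matplotlib's `scatter`) shows the same kind of 2D cluster structure that the T-SNE-Java demo draws.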
