t-SNE (t-Distributed Stochastic Neighbor Embedding) is a powerful technique for dimensionality reduction that can effectively visualize high-dimensional data in a lower-dimensional space.
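Dimensionality reduction can improve machine learning results by reducing the computational complexity of the algorithms, lowering the risk of overfitting, and focusing on the most relevant features in the dataset. Note that t-SNE is computationally expensive, so it is best applied when the number of features is low. For data with a very large number of features, the scikit-learn documentation recommends first reducing the feature count with another method such as PCA.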
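For datasets with many features, one common approach is to compress the data with PCA before running t-SNE. The following is a minimal sketch of that two-step idea on synthetic data. The array sizes, the choice of 50 components, and all variable names are illustrative assumptions, not part of this example's dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# synthetic stand-in data (assumption): 200 rows, 100 features
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 100))

# step 1: compress to 50 components with PCA
X_pca = PCA(n_components=50, random_state=42).fit_transform(X)

# step 2: embed the PCA output in 2 dimensions with t-SNE
X_embedded = TSNE(n_components=2, learning_rate=100, random_state=42).fit_transform(X_pca)

print(X_pca.shape, X_embedded.shape)
```

With only a handful of features, as in the pendulum data used in this notebook, the PCA step is unnecessary and t-SNE can be applied directly.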
Import libraries
In [ ]:
import datarobot as dr
import pandas as pd
import seaborn as sns
from sklearn.manifold import TSNE
In [3]:
# either directly pass in your endpoint/token, use a config file, or connect using DataRobot notebooks
dr.Client()
Out[3]:
<datarobot.rest.RESTClientObject at 0x7f5f10312280>
Get dataset
This example uses data on the movement of a double pendulum. The dataset has already been loaded into DataRobot for this example, but can also be found here.
In [40]:
# replace the dataset ID with your own data
ds_id = "62fbcdf583b30f0ef972dc31"
# get dataset from DataRobot
ds = dr.Dataset.get(ds_id)
df = ds.get_as_dataframe()
display(df)
Out[40]:
              t      x1      x2       v1        v2     a1     a2
0      0.000000    2.36    3.14  -0.0100  -0.01000  -9.24   6.53
1      0.000862    2.36    3.14  -0.0180  -0.00437  -9.24   6.53
2      0.001720    2.36    3.14  -0.0259   0.00126  -9.24   6.53
3      0.002590    2.36    3.14  -0.0339   0.00689  -9.24   6.53
4      0.003450    2.36    3.14  -0.0418   0.01250  -9.24   6.53
…             …       …       …        …         …      …      …
2424   9.970000  -14.70  -22.40   1.1400   1.82000   6.94  -3.84
2425   9.980000  -14.70  -22.30   1.2000   1.79000   7.04  -3.64
2426   9.980000  -14.70  -22.30   1.2500   1.76000   7.12  -3.42
2427   9.990000  -14.70  -22.30   1.3100   1.73000   7.20  -3.19
2428  10.000000  -14.70  -22.30   1.3700   1.70000   7.28  -2.95

2429 rows × 7 columns
Reduce the number of features in the dataset
In [ ]:
# features to exclude from the reduction,
# such as target, ID, or time columns
exclude_cols = ["t", "a2"]
# TSNE embeds into 2 dimensions by default (n_components=2)
model = TSNE(learning_rate=100, random_state=42)
transformed = model.fit_transform(df.drop(columns=exclude_cols))
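Because t-SNE works from pairwise distances between rows, columns with larger numeric ranges can dominate the embedding. The sketch below shows one optional way to guard against that by standardizing features first. It uses a small synthetic frame rather than the pendulum data, and the `demo`, `scaled`, and `embedding` names are hypothetical. Whether scaling helps depends on the dataset, so treat this as an experiment rather than a required step.

```python
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# synthetic stand-in (assumption) with columns on very different scales
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "x1": rng.normal(scale=10.0, size=100),
    "x2": rng.normal(scale=10.0, size=100),
    "v1": rng.normal(scale=0.1, size=100),
})

# rescale every column to zero mean and unit variance
scaled = StandardScaler().fit_transform(demo)

# embed the scaled features; perplexity must be smaller than the number of rows
embedding = TSNE(learning_rate=100, perplexity=30, random_state=42).fit_transform(scaled)

print(embedding.shape)
```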
Create a new dataframe with the reduced columns and the previously excluded columns
In [39]:
# get the tsne dataset
reduced_df = pd.DataFrame(transformed, columns=["tsne_x", "tsne_y"])
# join in target and time columns from original dataset
reduced_df = pd.concat([reduced_df, df[exclude_cols]], axis=1)
display(reduced_df)
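The concat above pairs rows by index label, which works here because both frames carry the default 0 to n-1 index. If the source frame had a different index (for example, after filtering rows), the same call would misalign the rows. The following sketch illustrates the difference with a small hypothetical frame; the values and the `misaligned` and `aligned` names are made up for illustration.

```python
import numpy as np
import pandas as pd

# hypothetical frame whose index is not 0..n-1 (e.g. after filtering rows)
df = pd.DataFrame({"t": [0.0, 0.1, 0.2], "a2": [6.5, 6.4, 6.3]}, index=[10, 11, 12])
# stand-in for the array returned by TSNE.fit_transform
transformed = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# a default RangeIndex does not match df's index, so concat pads with NaN
misaligned = pd.concat(
    [pd.DataFrame(transformed, columns=["tsne_x", "tsne_y"]), df], axis=1
)

# reusing df's index keeps each embedding row next to its source row
aligned = pd.concat(
    [pd.DataFrame(transformed, columns=["tsne_x", "tsne_y"], index=df.index), df],
    axis=1,
)

print(misaligned.shape)  # (6, 4)
print(aligned.shape)     # (3, 4)
```

Passing `index=df.index` when building the embedding frame is a small safeguard that costs nothing when the index is already the default.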