In this Jupyter Notebook, we construct a Kangas DataGrid containing the data and projections of the embeddings into 2 dimensions.
In this Jupyter Notebook, we construct a Kangas DataGrid containing the data and projections of the embeddings into 2 dimensions.
Kangas as an open source, mixed-media, dataframe-like tool for data scientists. It was developed by Comet, a company designed to help reduce the friction of moving models into production.
To get started, we pip install kangas, and import it.
%pip install kangas --quiet
import kangas as kg
We create a Kangas Datagrid with the original data and the embeddings. The data is composed of a rows of reviews, and the embeddings are composed of 1536 floating-point values. In this example, we get the data directly from github, in case you aren't running this notebook inside OpenAI's repo.
We use Kangas to read the CSV file into a DataGrid for further processing.
data = kg.read_csv("https://raw.githubusercontent.com/openai/openai-cookbook/main/examples/data/fine_food_reviews_with_embeddings_1k.csv")
Loading CSV file 'fine_food_reviews_with_embeddings_1k.csv'...
1001it [00:00, 2412.90it/s] 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 2899.16it/s]
We can review the fields of the CSV file:
data.info()
DataGrid (in memory) Name : fine_food_reviews_with_embeddings_1k Rows : 1,000 Columns: 9 # Column Non-Null Count DataGrid Type --- -------------------- --------------- -------------------- 1 Column 1 1,000 INTEGER 2 ProductId 1,000 TEXT 3 UserId 1,000 TEXT 4 Score 1,000 INTEGER 5 Summary 1,000 TEXT 6 Text 1,000 TEXT 7 combined 1,000 TEXT 8 n_tokens 1,000 INTEGER 9 embedding 1,000 TEXT
And get a glimpse of the first and last rows:
data
row-id | Column 1 | ProductId | UserId | Score | Summary | Text | combined | n_tokens | embedding |
---|---|---|---|---|---|---|---|---|---|
1 | 0 | B003XPF9BO | A3R7JR3FMEBXQB | 5 | where does one | Wanted to save | Title: where do | 52 | [0.007018072064 |
2 | 297 | B003VXHGPK | A21VWSCGW7UUAR | 4 | Good, but not W | Honestly, I hav | Title: Good, bu | 178 | [-0.00314055196 |
3 | 296 | B008JKTTUA | A34XBAIFT02B60 | 1 | Should advertis | First, these sh | Title: Should a | 78 | [-0.01757248118 |
4 | 295 | B000LKTTTW | A14MQ40CCU8B13 | 5 | Best tomato sou | I have a hard t | Title: Best tom | 111 | [-0.00139322795 |
5 | 294 | B001D09KAM | A34XBAIFT02B60 | 1 | Should advertis | First, these sh | Title: Should a | 78 | [-0.01757248118 |
... | 996 | 623 | B0000CFXYA | A3GS4GWPIBV0NT | 1 | Strange inflamm | Truthfully wasn | Title: Strange | 110 | [0.000110913533 |
997 | 624 | B0001BH5YM | A1BZ3HMAKK0NC | 5 | My favorite and | You've just got | Title: My favor | 80 | [-0.02086931467 |
998 | 625 | B0009ET7TC | A2FSDQY5AI6TNX | 5 | My furbabies LO | Shake the conta | Title: My furba | 47 | [-0.00974910240 |
999 | 619 | B007PA32L2 | A15FF2P7RPKH6G | 5 | got this for th | all i have hear | Title: got this | 50 | [-0.00521062919 |
1000 | 999 | B001EQ5GEO | A3VYU0VO6DYV6I | 5 | I love Maui Cof | My first experi | Title: I love M | 118 | [-0.00605782261 |
[1000 rows x 9 columns] | |||||||||
* Use DataGrid.save() to save to disk | |||||||||
** Use DataGrid.show() to start user interface |
Now, we create a new DataGrid, converting the numbers into an Embedding:
import ast # to convert string of a list of numbers into a list of numbers
dg = kg.DataGrid(
name="openai_embeddings",
columns=data.get_columns(),
converters={"Score": str},
)
for row in data:
embedding = ast.literal_eval(row[8])
row[8] = kg.Embedding(
embedding,
name=str(row[3]),
text="%s - %.10s" % (row[3], row[4]),
projection="umap",
)
dg.append(row)
The new DataGrid now has an Embedding column with proper datatype.
dg.info()
DataGrid (in memory) Name : openai_embeddings Rows : 1,000 Columns: 9 # Column Non-Null Count DataGrid Type --- -------------------- --------------- -------------------- 1 Column 1 1,000 INTEGER 2 ProductId 1,000 TEXT 3 UserId 1,000 TEXT 4 Score 1,000 TEXT 5 Summary 1,000 TEXT 6 Text 1,000 TEXT 7 combined 1,000 TEXT 8 n_tokens 1,000 INTEGER 9 embedding 1,000 EMBEDDING-ASSET
We simply save the datagrid, and we're done.
dg.save()
To render the data directly in the notebook, simply show it. Note that each row contains an embedding projection.
Scroll to far right to see embeddings projection per row.
The color of the point in projection space represents the Score.
dg.show()
Group by "Score" to see rows of each group.
dg.show(group="Score", sort="Score", rows=5, select="Score,embedding")
An example of this datagrid is hosted here: https://kangas.comet.com/?datagrid=/data/openai_embeddings.datagrid