Vectors are More than Just Text Embeddings

Environment & Language

The Jupyter notebooks in this repository were developed with Python 3.11.

To get started, create a virtual environment and install the packages listed in requirements.txt:

pip install -r requirements.txt

Introduction

The wider computing world has recently been introduced to the concept of word embeddings, which are vector representations of words learned from large text corpora. These embeddings have proven useful in a variety of tasks, including sentiment analysis, machine translation, and named entity recognition.

More recently, semantic embeddings generated by Large Language Models (LLMs) have proven useful in tasks such as question answering and text summarization, and they can enhance the capabilities of chatbots and search engines.

Text embeddings are typically stored as vectors, which are arrays of numbers. By converting text to numbers, computers are more easily able to work with, manipulate, and "understand" language.

But we can create embeddings of much more than text - and we can generate text embeddings with more than just LLMs. In this repository, we'll explore some of the wider applications of embeddings, and how we can use them to solve problems in a variety of domains.
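As a small illustration of turning text into numbers without an LLM, here is a minimal sketch of a bag-of-words representation using only the Python standard library. It is not taken from the notebooks in this repository; the sentences and the helper names `bag_of_words_vector` and `cosine_similarity` are made up for the example. Each sentence becomes a vector of word counts, and cosine similarity gives a rough measure of how alike two sentences are.

```python
import math
from collections import Counter


def bag_of_words_vector(text, vocabulary):
    """Count how often each vocabulary word appears in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical example sentences, chosen only for illustration.
sentences = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply today",
]

# The vocabulary is every distinct word, in a fixed (sorted) order,
# so that each position in a vector always refers to the same word.
vocabulary = sorted({word for s in sentences for word in s.lower().split()})
vectors = [bag_of_words_vector(s, vocabulary) for s in sentences]

sim_cats = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
print(f"cat sentences:       {sim_cats:.2f}")
print(f"cat vs stock market: {sim_unrelated:.2f}")
```

The two sentences about the cat share several words and so score well above the unrelated pair, which shares none. Counting words ignores meaning and word order, which is the gap that learned embeddings aim to close, but the principle of comparing text by comparing vectors is the same.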

Vector Embeddings for Beginners Video

Ania Kubów has a 30-minute video, Vector Embeddings for Beginners, that provides a great introduction to embeddings. Watch the first 10:37 of the linked video. The video goes on to show how to generate text embeddings with OpenAI and store them in the Astra cloud vector database, but we will not be covering that in this repository.

The Iris Dataset - Flowers as Vectors

The Iris flower dataset is "a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper *The use of multiple measurements in taxonomic problems* as an example of linear discriminant analysis. It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula 'all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus'."

This dataset is commonly used in introductory machine learning texts, and we'll do much the same here. But rather than delve into building a machine learning classifier, we'll use the Iris dataset to explore how we can represent data as vectors, and how we can visualize these vectors.

The notebook for this section is iris.ipynb.
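To make the idea of a flower as a vector concrete, here is a minimal sketch that is independent of the notebook and uses only the standard library. It hard-codes one sample per species (the first row of each species in the classic dataset) as a 4-dimensional vector of sepal length, sepal width, petal length, and petal width in centimetres. The names `FEATURES`, `flowers`, and `distance` are illustrative choices, not identifiers from the notebook.

```python
import math

# Order of the four measurements in every vector below (all in cm).
FEATURES = ("sepal_length", "sepal_width", "petal_length", "petal_width")

# One sample per species: the first row of each species in the Iris dataset.
flowers = {
    "setosa": (5.1, 3.5, 1.4, 0.2),
    "versicolor": (7.0, 3.2, 4.7, 1.4),
    "virginica": (6.3, 3.3, 6.0, 2.5),
}


def distance(species_a, species_b):
    """Euclidean distance between the sample vectors of two species."""
    return math.dist(flowers[species_a], flowers[species_b])


# Compare every pair of samples.
pairs = [
    ("setosa", "versicolor"),
    ("setosa", "virginica"),
    ("versicolor", "virginica"),
]
for a, b in pairs:
    print(f"{a:>10} vs {b:<10} distance = {distance(a, b):.2f}")
```

For these three samples, the versicolor and virginica vectors lie much closer to each other than either does to setosa. Three rows say nothing statistically about the full dataset, but they show the point of the exercise: once a flower is a list of numbers, similarity becomes a distance that can be computed and plotted.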
