AI-Phil/VectorMoreThanText

Vectors are More than Just Text Embeddings

Environment & Language

The Jupyter notebooks in this repository were developed with Python 3.11.

To get started, create a virtual environment and install the packages listed in requirements.txt:

pip install -r requirements.txt 

Introduction

The wider computer world has recently been introduced to the concept of word embeddings, which are vector representations of words that are learned from large text corpora. These embeddings have been shown to be useful in a variety of tasks, including sentiment analysis, machine translation, and named entity recognition.

Most recently, semantic embeddings generated by Large Language Models (LLMs) have proven useful in tasks such as question answering and text summarization, and they can enhance the capabilities of chatbots and search engines.

Text embeddings are typically stored as vectors, which are arrays of numbers. By converting text to numbers, computers are more easily able to work with, manipulate, and "understand" language.

But we can create embeddings of much more than text - and we can generate text embeddings with more than just LLMs. In this repository, we'll explore some of the wider applications of embeddings, and how we can use them to solve problems in a variety of domains.
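As a minimal sketch of generating a text embedding without an LLM, the example below builds bag-of-words count vectors and compares them with cosine similarity. The sentences and the helper names (bag_of_words, cosine_similarity) are made up for illustration and are not taken from the notebooks in this repository.

```python
import math
from collections import Counter


def bag_of_words(text, vocabulary):
    """Return a count vector: one number per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Illustrative sentences (assumed, not from the repository)
sentences = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply today",
]

# The vocabulary fixes the meaning of each vector position
vocabulary = sorted({word for s in sentences for word in s.split()})
vectors = [bag_of_words(s, vocabulary) for s in sentences]

for sentence, vector in zip(sentences, vectors):
    print(sentence, "->", vector)

print("cat vs cat:  ", cosine_similarity(vectors[0], vectors[1]))
print("cat vs stock:", cosine_similarity(vectors[0], vectors[2]))
```

The two cat sentences share several words, so their vectors point in a similar direction, while the stock sentence shares none and scores zero. Count vectors like these ignore word order and meaning, which is one reason learned embeddings are often preferred for language tasks.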

Vector Embeddings for Beginners Video

Ania Kubów has a 30-minute video, Vector Embeddings for Beginners, that provides a great introduction to embeddings. Watch the first 10:37 of the linked video. The video goes on to show how to generate text embeddings with OpenAI and store them in the Astra cloud vector database, but we will not be covering that in this repository.

The Iris Dataset - Flowers as Vectors

The Iris flower dataset is "a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis. It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula 'all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus'."

The dataset is commonly used in introductory machine learning texts, and we'll do much the same here. But rather than delve into building a machine learning classifier, we'll use the Iris dataset to explore how we can represent data as vectors, and how we can visualize these vectors.

The notebook for this section is iris.ipynb.
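To make the idea of flowers as vectors concrete, here is a small sketch that treats four Iris samples as 4-dimensional vectors (sepal length, sepal width, petal length, petal width, in centimetres) and compares them by Euclidean distance. The four rows are from the standard Iris dataset; the labels and the nearest helper are hypothetical names chosen for this illustration, and the notebook itself may load and explore the data differently.

```python
import math

# Order of the measurements in every vector below
FEATURES = ("sepal_length", "sepal_width", "petal_length", "petal_width")

# A handful of rows from the standard Iris dataset (centimetres).
# The dictionary keys are illustrative labels, not dataset fields.
flowers = {
    "setosa_a": (5.1, 3.5, 1.4, 0.2),
    "setosa_b": (4.9, 3.0, 1.4, 0.2),
    "versicolor": (7.0, 3.2, 4.7, 1.4),
    "virginica": (6.3, 3.3, 6.0, 2.5),
}


def nearest(name, candidates):
    """Return the label of the candidate closest to `name` (Euclidean distance)."""
    target = candidates[name]
    others = {k: v for k, v in candidates.items() if k != name}
    return min(others, key=lambda k: math.dist(target, others[k]))


for label, vector in flowers.items():
    # Each flower is just a point in 4-dimensional space
    print(label, dict(zip(FEATURES, vector)))

print("Closest to setosa_a:", nearest("setosa_a", flowers))
```

In this small sample the two setosa flowers sit close together, while the versicolor and virginica samples lie further away, mainly because of their longer and wider petals. That geometric closeness is the same intuition we rely on when comparing text embeddings.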
