Mapping Embeddings
🗺️🔍

From meaning to vectors and back

João Galego

$$\left|\text{🧠}\right>$$

Contents 📓

  1. Introduction to Embeddings
  2. Working with Vector Databases
  3. Dimensionality Reduction Techniques
  4. Advanced Retrieval Strategies
  5. Retrieval Augmented Generation (RAG)

Warning ⚠️

This deck is a work in progress…

and always will be

Feel free to search around 🔎

Cite this presentation 📑

@misc{a-tour-of-genai-jgalego,
    title = {Mapping Embeddings: from meaning to vectors and back},
    author = {Galego, João},
    howpublished = {\url{jgalego.github.io/MappingEmbeddings}},
    year = {2024}
}

Note on implementation 👨‍💻

The slides were created using reveal.js

and the presentation is hosted on GitHub Pages

Want to contribute? ✨

Just open an issue/PR for this project

github.com/JGalego/MappingEmbeddings

Introduction to Embeddings

Let's start by sending some love
to Amazon Titan for Embeddings...

Titan Love 🔱💗


                        """
                        Sends love to Amazon Titan for Embeddings 💖
                        and gets a bunch of numbers in return 🔢
                        """

                        import json
                        import boto3

                        # Initialize Bedrock Runtime client
                        # https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-runtime.html
                        bedrock = boto3.client("bedrock-runtime")

                        # Call Amazon Titan for Embeddings model on "love"
                        # https://docs.aws.amazon.com/bedrock/latest/userguide/titan-embedding-models.html
                        response = bedrock.invoke_model(
                            modelId="amazon.titan-embed-text-v1",
                            body="{\"inputText\": \"love\"}"
                        )

                        # Process the model response and print the final result
                        body = json.loads(response.get('body').read())
                        print(body['embedding'])
                    

WTF?

Where is the love?

Let's put this question on hold for now...

and work out some definitions.

What are embeddings?

A numerical representation of a piece of information

Source: Adapted from Arize and Medium

Example: Embedding Wikipedia

What if you had the embeddings of ALL Wikipedia?

Source: Cohere

Example: Amazon Music

Neighboring vectors, similar tracks

Source: AWS Big Data Blog
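
To make "neighboring vectors" concrete, here is a minimal sketch of a brute-force nearest-neighbor lookup based on cosine similarity. The track names and 3-dimensional vectors are made up for illustration; real embeddings come from a model and usually have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, catalog, k=2):
    """Return the k catalog entries most similar to the query vector."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in catalog.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Hypothetical track embeddings (illustrative values only)
TRACKS = {
    "calm piano": [0.9, 0.1, 0.0],
    "soft strings": [0.8, 0.2, 0.1],
    "heavy metal": [0.0, 0.1, 0.9],
}

neighbors = nearest([1.0, 0.0, 0.0], TRACKS)
print(neighbors)
```

The two quiet tracks come out on top because their vectors point in almost the same direction as the query.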

Example: Embedding Projector

Source: Google Research

Example: AI Virtual Cell

Source: Bunne et al. (2024)

Data $\rightarrow$ "meaningful" numbers

💬 🖼️ 🔊 🎞️ 🦠

Now, let's get back to our original example...

Why love?

Rule of thumb: 1 token ~ 4 characters

You may have heard of the
$\texttt{1 token} \sim \texttt{4 chars}$ rule of thumb

☝️ Caveat: only valid for English, more on this later...
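
The rule of thumb fits in a one-liner. The sketch below is only an estimate: `estimate_tokens` is a hypothetical helper, not a tokenizer, and the divisor of 4 is just the heuristic stated above. For an exact count, use the tokenizer of the model in question.

```python
import math

CHARS_PER_TOKEN = 4  # heuristic for English text, not a property of any tokenizer


def estimate_tokens(text: str) -> int:
    """Rough token count from the character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


print(estimate_tokens("love"))                # 4 characters -> 1
print(estimate_tokens("Where is the love?"))  # 18 characters -> 5
```

Under this heuristic, "love" is a four-character word that maps to roughly one token.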

Well, things are a bit more complicated than that...

so let's spend a few tokens on tokenization

Tokenization is the root of (almost) all evils

Source: Andrej Karpathy

Some are just plain 𝚠𝔢𝒾𝐫𝔡...

Source: Adapted from LessWrong

Tokenization is one of the reasons why LLMs
are usually bad at math...

Source: Beren's Blog

Tiktokenizing integers

Replicating Integer Tokenization is Insane 🤯

All languages are not created (tokenized) equal!

Source: Art Fish Intelligence

Ok, time to head back to our main feature...

How do we actually train an embedding model?

Image Embeddings 🖼️

Contrastive Learning

Source: Adapted from Hadsell, Chopra & LeCun (2006)
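
For reference, the contrastive loss of Hadsell, Chopra & LeCun can be written as below. The notation follows the convention of the training code in this deck and is an assumption about labeling rather than a quote from the paper: $y = 0$ marks a similar pair, $y = 1$ a dissimilar pair, $d$ is the Euclidean distance between the two embeddings, and $m$ is the margin.

$$L(y, d) = (1 - y)\,\tfrac{1}{2}\,d^{2} + y\,\tfrac{1}{2}\,\max(0,\; m - d)^{2}$$

The first term pulls similar pairs together. The second pushes dissimilar pairs apart, but only while they are closer than the margin $m$.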

Distance Measures

Source: Maarten Grootendorst
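
A few common distance measures are easy to compare side by side. This is a minimal sketch in plain Python with made-up 2-dimensional vectors; the function names are illustrative and not taken from any library.

```python
import math


def euclidean(a, b):
    """Straight-line (L2) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences (L1)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity: depends on direction only, not on length."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


# Same direction, different magnitude
a, b = [3.0, 4.0], [6.0, 8.0]
print(euclidean(a, b), manhattan(a, b), cosine_distance(a, b))
```

Here the Euclidean and Manhattan distances are non-zero while the cosine distance vanishes, which is why the choice of measure matters when vector length carries no meaning.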

Train an Image Embedding model from scratch


                        # pylint: disable=import-error,invalid-name
                        """
                        Train image embeddings model from scratch using contrastive learning.

                        Adapted from Hadsell, Chopra & LeCun (2006)
                        https://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf
                        and Underfitted's 'Training a model to generate image embeddings'
                        https://underfitted.svpino.com/p/training-a-model-to-generate-image
                        """

                        import numpy as np

                        from keras import datasets, Input, Model
                        from keras.layers import Dense, Lambda
                        from keras.metrics import binary_accuracy, BinaryAccuracy
                        from keras.models import Sequential
                        from keras.ops import cast, maximum, norm, square

                        ########
                        # Data #
                        ########

                        # Load dataset
                        (X_train, y_train), (X_test, y_test) = datasets.mnist.load_data()

                        # Reshape and normalize it
                        X_train = X_train.reshape(-1, 784)
                        X_test = X_test.reshape(-1, 784)
                        X_train = X_train.astype('float32') / 255.0
                        X_test = X_test.astype('float32') / 255.0

                        def generate_pairs(X, y):
                            """
                            Creates a collection of positive and negative image pairs.
                            """

                            X_pairs = []
                            y_pairs = []

                            for i in range(len(X)):
                                digit = y[i]

                                # Create positive match
                                positive_digit_index = np.random.choice(np.where(y == digit)[0])
                                X_pairs.append([X[i], X[positive_digit_index]])
                                y_pairs.append([0])

                                # Create negative match
                                negative_digit_index = np.random.choice(np.where(y != digit)[0])
                                X_pairs.append([X[i], X[negative_digit_index]])
                                y_pairs.append([1])

                            # Shuffle everything
                            indices = np.arange(len(X_pairs))
                            np.random.shuffle(indices)

                            return np.array(X_pairs)[indices], np.array(y_pairs)[indices]

                        # Prepare input pairs
                        X_train_pairs, y_train_pairs = generate_pairs(X_train, y_train)
                        X_test_pairs, y_test_pairs = generate_pairs(X_test, y_test)

                        #########
                        # Model #
                        #########

                        # Define inputs
                        input1 = Input(shape=(784,))
                        input2 = Input(shape=(784,))

                        # Build siamese network
                        network = Sequential(
                            [
                                Input(shape=(784,)),
                                Dense(512, activation="relu"),
                                Dense(256, activation="relu"),
                                Dense(128, activation=None),
                            ]
                        )

                        # Define twin branches
                        twin1 = network(input1)
                        twin2 = network(input2)

                        # Define distance
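                        def euclidean_distance(twins):
                            """Computes the Euclidean distance between the two twin outputs."""
                            a, b = twins
                            return norm(a - b, axis=1, keepdims=True)

                        # A Lambda layer takes a single input, so pass both branches as a list
                        distance = Lambda(euclidean_distance)([twin1, twin2])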

                        # Set up the model
                        model = Model(inputs=[input1, input2], outputs=distance)

                        ########
                        # Loss #
                        ########

                        def contrastive_loss(y, d):
                            """
                            Computes the contrastive loss from Hadsell, Chopra & LeCun (2006)
                            """
                            margin = 1.0
                            y = cast(y, d.dtype)
                            loss = (1 - y) / 2 * square(d) + y / 2 * square(maximum(0.0, margin - d))
                            return loss

                        # Compile model using contrastive loss
                        model.compile(
                            loss=contrastive_loss,
                            optimizer="adam",
                            metrics=[binary_accuracy]
                        )

                        #########
                        # Train #
                        #########

                        # Fit the model
                        history = model.fit(
                            x=[X_train_pairs[:, 0], X_train_pairs[:, 1]],
                            y=y_train_pairs[:],
                            validation_data=([X_test_pairs[:, 0], X_test_pairs[:, 1]], y_test_pairs[:]),
                            batch_size=32,
                            epochs=5,
                        )

                        ########
                        # Test #
                        ########

                        # Generate predictions
                        predictions = model.predict([X_test_pairs[:, 0], X_test_pairs[:, 1]]) >= 0.5

                        # Compute model accuracy
                        accuracy = BinaryAccuracy()
                        accuracy.update_state(y_test_pairs, predictions.astype(int))
                        print(f"Accuracy: {accuracy.result().numpy():.2f}")

                        ############
                        # Generate #
                        ############

                        # Initialize model
                        embedding_model = model.layers[2]

                        # Generate embeddings
                        digits = np.where(y_test == 7)[0]
                        embeddings = embedding_model.predict(X_test[np.random.choice(digits)].reshape(1, -1))
                        print(embeddings, embeddings.shape)
                    

Sentence Embeddings 💬

Step 1: Model

pip install -q sentence-transformers

There are plenty to choose from on 🤗

Step 2: Data + Loss Function

Source: 🤗

Step 3: Test

MTEB: Massive Text Embedding Benchmark

Source: 🤗

How good/bad is Amazon Titan for Embeddings?

* Inspired by Phil Schmid's post

October 2023: amazon.titan-embed-text-v1

September 2024: amazon.titan-embed-text-v2:0

Are bigger embeddings always better?

Well, not necessarily...

Matryoshka Representation Learning 🪆

Source: 🤗
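
With a model trained using Matryoshka Representation Learning, a shorter embedding can be obtained by keeping only the leading dimensions and re-normalizing. A minimal sketch, assuming the input vector comes from such a model (the 8-dimensional vector below is made up, and `truncate_embedding` is a hypothetical helper):

```python
import math


def truncate_embedding(embedding, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    head = embedding[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]


# Illustrative values only; a real vector would come from an MRL-trained model
full = [0.5, -0.4, 0.3, 0.2, -0.1, 0.1, 0.05, -0.05]
short = truncate_embedding(full, 4)
print(len(short), short)
```

Truncating embeddings from a model that was not trained this way will generally lose far more information, since nothing forces the leading dimensions to carry the most meaning.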

Working with Vector Databases

Quote #1

"The most important piece of the preprocessing pipeline, from a systems standpoint, is the vector database."
Andreessen Horowitz

Quote #2

"In the future, we believe that every database will be a vector database."
Google

Vector databases are everywhere

But... what are they?

Any database that treats vectors as first class citizens is a vector database.

Source: Pinecone

CR7 Embeddings ⚽

Source: Medium

Vector Database Types

Source: Medium

Vector Databases on AWS

Demo: SQLite + Amazon Bedrock

Multimodal vector search with sqlite-rembed
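
The demo itself relies on the sqlite-rembed extension and Amazon Bedrock, neither of which is reproduced here. As a rough stand-in, the sketch below uses only Python's built-in `sqlite3` module: vectors are stored as JSON text and ranked by a user-defined SQL function. The table layout, the `cosine_sim` function and the 3-dimensional vectors are all assumptions made for illustration.

```python
import json
import math
import sqlite3


def cosine_sim(a_json, b_json):
    """SQL-callable cosine similarity over two JSON-encoded vectors."""
    a, b = json.loads(a_json), json.loads(b_json)
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


conn = sqlite3.connect(":memory:")
conn.create_function("cosine_sim", 2, cosine_sim)
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, text TEXT, embedding TEXT)")

# Made-up vectors; a real pipeline would call an embedding model here
docs = [
    ("cats purr", [0.9, 0.1, 0.0]),
    ("dogs bark", [0.7, 0.3, 0.1]),
    ("stocks fell", [0.0, 0.2, 0.9]),
]
conn.executemany(
    "INSERT INTO docs (text, embedding) VALUES (?, ?)",
    [(text, json.dumps(vec)) for text, vec in docs],
)


def search(query_vec, k=2):
    """Brute-force scan: score every row, keep the k best."""
    rows = conn.execute(
        "SELECT text, cosine_sim(embedding, ?) AS score "
        "FROM docs ORDER BY score DESC LIMIT ?",
        (json.dumps(query_vec), k),
    )
    return rows.fetchall()


results = search([1.0, 0.0, 0.0])
print(results)
```

A full scan like this is fine for small tables. Dedicated vector databases and extensions add indexes so that the search does not have to touch every row.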

Mind the (multimodal) gap!

Source: Adapted from JinaAI

Modality Gap Explorer 🧭

Dimensionality Reduction Techniques

Dimensionality reduction is used to
make sense of high-dimensional data

It can bring huge benefits...

  • Compute / Storage ⬇️
  • Data Visualization ✨

It comes in many flavors...

  • global 🆚 local
  • linear 🆚 non-linear
  • parametric 🆚 non-parametric
  • deterministic 🆚 stochastic

We'll focus on 3 different techniques...

  • PCA captures global patterns
  • t-SNE emphasizes local patterns and clusters
  • UMAP handles complex relationships
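
Of the three techniques, PCA is the simplest to write out by hand. The sketch below projects data onto its top principal components using NumPy's SVD; the random matrix is a stand-in for real embeddings and `pca` is an illustrative helper, not a library function. t-SNE and UMAP need dedicated libraries and are not shown.

```python
import numpy as np


def pca(X, n_components=2):
    """Project the rows of X onto the top principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, sorted by explained variance
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))  # stand-in for 100 embeddings with 16 dimensions
X_2d = pca(X)
print(X_2d.shape)
```

Each row of `X_2d` is a 2D point that can be plotted directly, with the first coordinate carrying the most variance.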

Embedding a 2D circle with t-SNE

Source: Kobak & Linderman (2021)

Initialization is critical

Source: Adapted from Kobak & Linderman (2021)

Dimensionality reduction as probabilistic inference

Source: Ravuri & Lawrence (2024)

The map is not the territory!

Von Neumann's Elephant (or rather, Woolly Mammoth)

Source: PAIR

Now, you may be wondering...

How do models represent more features
than they have dimensions?

Let's talk about Superposition

(not the quantum type)

Superposition Hypothesis

Source: Anthropic
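
One intuition behind the hypothesis is geometric: a space with d dimensions has room for many more than d directions that are almost orthogonal to one another. The sketch below illustrates this with random unit vectors. It is not a reproduction of Anthropic's experiments, and the sizes (150 directions in 100 dimensions) are arbitrary choices.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Sample a direction uniformly at random on the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


rng = random.Random(0)
DIM, N_FEATURES = 100, 150  # more "features" than dimensions
features = [random_unit_vector(DIM, rng) for _ in range(N_FEATURES)]

# Largest overlap (absolute cosine) between any two distinct directions
max_overlap = max(
    abs(sum(x * y for x, y in zip(features[i], features[j])))
    for i in range(N_FEATURES)
    for j in range(i + 1, N_FEATURES)
)
print(f"{N_FEATURES} directions in {DIM} dimensions, max |cosine| = {max_overlap:.2f}")
```

The directions are not perfectly orthogonal, so features stored this way interfere with each other a little. That interference is the price paid for packing in more features than dimensions.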

The Hunt for Monosemanticity

Source: Anthropic

Example: Golden Gate Bridge 🌉

Source: Anthropic

Advanced Retrieval Strategies

When (naive) vector search fails!

RAG Triad

Source: TruLens

Advanced Retrieval Techniques

  1. Query transformations
    • Generated answers
      $\texttt{query} \rightarrow \texttt{LLM} \rightarrow \texttt{hypothetical answer}$
    • Multiple queries
      $\texttt{query} \rightarrow \texttt{LLM} \rightarrow \texttt{sub-queries}$
  2. Cross-encoder re-ranking
  3. Embedding adaptors
  4. Other techniques
    • Fine-tune embedding model
    • Fine-tune LLM for retrieval (RA-DIT, InstructRetro)
    • Deep embedding adaptors
    • Deep relevance modelling
    • Deep chunking

RAG Fusion

Source: LangChain
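
RAG Fusion is usually described as running several generated queries and merging their result lists with reciprocal rank fusion (RRF). Below is a minimal sketch of the merging step only. The document IDs and rankings are made up, and k = 60 is a commonly used default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each document scores sum(1 / (k + rank))."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# One ranked list per generated query (hypothetical results)
rankings = [
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_b", "doc_c", "doc_a"],
]
fused = reciprocal_rank_fusion(rankings)
print(fused)
```

Documents that rank well across many queries rise to the top, even if no single query placed them first every time.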

Demo: RAGmap 🗺️🔍

RAGxplorer 🦙🦺

A simple tool for RAG visualizations

First major bug!

RAGmap 🗺️🔍

Visualization tool for exploring embeddings

Retrieval Augmented Generation (RAG)

RAG in a nutshell 🥜

  • LLMs are trained on HUGE amounts of data, but...
  • LLMs haven't seen your data
  • RAG is key to connecting LLMs to external data

RAG Stages

  1. Load
  2. Index
  3. Store
  4. Query
  5. Evaluate
  6. Update
Source: LlamaIndex
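
The first four stages can be sketched end to end in a few lines. Everything here is a toy: the documents are made up, `embed` is a bag-of-words counter standing in for a real embedding model, the "store" is a Python list, and the final prompt is printed instead of being sent to an LLM. The Evaluate and Update stages are not covered.

```python
import math
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (NOT a real embedding model)."""
    return Counter(text.lower().split())


def similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# 1. Load: a few made-up documents
documents = [
    "Embeddings map text to vectors",
    "Vector databases store and search embeddings",
    "PCA reduces the number of dimensions",
]

# 2-3. Index + Store: embed each document and keep it in memory
store = [(doc, embed(doc)) for doc in documents]


# 4. Query: retrieve the closest documents and build a prompt
def retrieve(question, k=1):
    q = embed(question)
    ranked = sorted(store, key=lambda item: similarity(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


question = "Where do I store embeddings"
context = retrieve(question)
prompt = f"Answer using only this context:\n{context[0]}\n\nQuestion: {question}"
print(prompt)
```

Swapping the toy pieces for an embedding model, a vector database and an LLM call gives the usual shape of a RAG pipeline.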

Let's look at an example...

Step 1: Load

Source: Medium

Step 2: Index + Store

Step 3: Query

Step 4: Evaluate

Key Takeaways

In summary...

  • Embedding models turn data into "meaningful" numerical representations (vectors)

  • Vector databases can be used to search through these representations efficiently

  • Dimensionality reduction allows us to make sense of high-dimensional data

  • Advanced retrieval strategies can be a great asset when naive search is not enough

  • RAG ties everything together by bringing external data to the model but...

  • It's still just a clever hack!

  • There's much we don't know...

HIC SVNT DRACONES 🐉

What we don't know...

💧

"What we know is a drop,
what we don't know is an ocean."
Isaac Newton

References 📚

General

Short Courses 👩‍🏫

Advanced Courses 👨🏻‍🎓

  • (MIT) 6.S191: Introduction to Deep Learning
  • (Stanford) CS224N: Natural Language Processing with Deep Learning
  • (Stanford) CS224U: Natural Language Understanding
  • (Stanford) CS324: Large Language Models
  • (UMass) CS685: Advanced Natural Language Processing
  • (ETH) 263-5354-00L: Large Language Models