Unlocking Company Data Potential: The Role of Vector Embeddings
When meeting with clients who have been briefed internally, we frequently observe a basic grasp of terms like vector embeddings. However, that understanding typically centers on why embeddings are necessary, rather than on how they work or what it takes to implement them.
And since this is such an instrumental part of AI, we feel that if you and your company are serious about data, it is important to understand embeddings even if you do not work with them directly. So in this blog post, we try to explain what vector embeddings are without going too deep into the technical details.
What are vector embeddings?
At a very high level, a vector embedding is a numerical representation of text, images, sound, and so on. The most elementary form is to take each word of a sentence and assign a number to it. That way, we can translate the sentence ‘A man is walking in a field’ into a vector of numbers, as the sketch below shows.
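To make that concrete, here is a minimal sketch in Python (the vocabulary and the function name are ours, purely for illustration):

```python
# A minimal sketch of the most elementary "embedding": every word
# gets a fixed number from a vocabulary lookup. The vocabulary and
# function name are purely illustrative.
vocabulary = {"a": 0, "man": 1, "is": 2, "walking": 3, "in": 4, "field": 5}

def naive_embedding(sentence: str) -> list[int]:
    """Map each word of the sentence to its vocabulary number."""
    return [vocabulary[word] for word in sentence.lower().split()]

print(naive_embedding("A man is walking in a field"))
# [0, 1, 2, 3, 4, 0, 5]
```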
The vector embeddings used in advanced AI applications are more complex. Sticking to text, such an embedding tries to map the essence of a word or sentence to a vector. In a well-known example from Word2Vec (an early implementation developed at Google), you can find several animals mapped to a set of concepts.
This example maps to 6 different dimensions, but real embedding models can have hundreds or even thousands of dimensions. That is impossible for our minds to visualize, but it works the same way as going from 2-dimensional to 3-dimensional space: every extra dimension adds more possibilities. Say you can only use integers on a scale from 1 to 100. One dimension gives you 100 positions, a second dimension gives you 100 x 100 = 10,000 positions, and a third dimension already gives you 1 million positions (100 x 100 x 100).
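For a feel of what a real model produces, here is a sketch assuming the open-source sentence-transformers library; the model named below is just one common choice, not a recommendation:

```python
# A sketch assuming the open-source sentence-transformers library
# (pip install sentence-transformers). The model below is one common
# choice and happens to output 384-dimensional vectors.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embedding = model.encode("A man is walking in a field")

print(embedding.shape)  # (384,): one number per dimension
print(embedding[:5])    # the first few of those 384 numbers
```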
What can you do with those vector embeddings?
Once the vector embedding is created, we have a numerical representation of the text. Being numbers means we can start calculating, and that opens up a lot of possibilities, but they all come down to one simple idea: finding things that are close to each other. Whether it is creating clusters, finding the best alternative, or identifying anomalies, it all comes down to how close different vectors are to each other.
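"Close" needs a concrete measure, and cosine similarity is one of the most common choices: it looks at the angle between two vectors. A minimal sketch in plain NumPy (the function name is ours):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: close to 1.0 means
    very similar, around 0.0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```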
The graphs on the right depict a 2D visualization (dimension reduction) of the Word2Vec example. There, the bunny and the rabbit sit very close to each other because they share concepts such as mammal and non-rodent. In some cases the hamster will also be considered close, whereas in other cases it would be considered part of a different cluster.
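Such 2D views are typically produced by reducing the high-dimensional vectors to two dimensions. A sketch of one common approach, PCA, assuming scikit-learn (the random array below is a placeholder for real embeddings):

```python
# A sketch of dimension reduction with PCA, assuming scikit-learn.
# The random array is a placeholder for real embeddings, e.g. one
# 300-dimensional vector per animal in the Word2Vec example.
import numpy as np
from sklearn.decomposition import PCA

embeddings = np.random.rand(5, 300)
points_2d = PCA(n_components=2).fit_transform(embeddings)
print(points_2d.shape)  # (5, 2): one (x, y) point per item to plot
```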
Going back to the earlier example, "A man is walking in a field" is now closer to the text "a little boy is walking" than to "a dog is walking past a field". Even though on the surface "A man is walking in a field" and "a dog is walking past a field" look more alike, the connection between a man and a little boy is stronger, so the similarity score of those two sentences is higher.
If we have all these texts in a database, we can calculate how close each of them is to the string we are looking for. This is what we call a similarity search, and the sketch below shows the idea.
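Putting the pieces together, here is a sketch of a similarity search over a tiny in-memory "database", reusing `model` and `cosine_similarity` from the sketches above:

```python
# A sketch of a similarity search over a tiny in-memory "database",
# reusing `model` and `cosine_similarity` from the sketches above.
database = [
    "A man is walking in a field",
    "a little boy is walking",
    "a dog is walking past a field",
]
db_vectors = [model.encode(text) for text in database]

query = model.encode("A man is walking in a field")
scores = [cosine_similarity(query, vec) for vec in db_vectors]

# Rank the stored texts from most to least similar to the query.
for score, text in sorted(zip(scores, database), reverse=True):
    print(f"{score:.3f}  {text}")
```

If the model behaves as described above, the "little boy" sentence should end up with a higher score than the "dog" sentence.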
What are the applications for vector embeddings?
It is interesting to note that, although many people have only recently become aware of vector embeddings, the concepts have been applied for many years. A Google image search or a recommendation after you finish a series on Netflix: it is all based on this concept.
To get your head spinning, here are some cool examples of vector embeddings:
- Magic photo editors, e.g. removing people from the background
- Product recommendations
- Chatbots answering questions based on your data
- Identifying churning customers
- De-duplication of data (sketched briefly after this list)
- Anomaly detection
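Several of these, de-duplication and anomaly detection in particular, boil down to putting a threshold on the similarity scores discussed above. A rough sketch of de-duplication, reusing the model and helper from the earlier sketches (the threshold value is purely illustrative):

```python
# A rough sketch of similarity-based de-duplication: any pair of
# texts scoring above the threshold is flagged as a likely duplicate.
# Assumes `model` and `cosine_similarity` from the earlier sketches;
# the threshold value is purely illustrative, real systems tune it.
THRESHOLD = 0.9

def find_duplicates(texts: list[str]) -> list[tuple[str, str]]:
    vectors = [model.encode(text) for text in texts]
    duplicates = []
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if cosine_similarity(vectors[i], vectors[j]) > THRESHOLD:
                duplicates.append((texts[i], texts[j]))
    return duplicates
```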
An example of a use case for vector embeddings
Recently we did a project in which we developed a chatbot that answers questions about a dataset of millions of articles. Adopting the ideas discussed in this post, we built a system that retrieves only the relevant articles: we vectorized the question and compared that vector to the vectors in the database (a similarity search), which gave us the 10 articles that were then used to answer the question. The sketch below outlines that retrieval step.
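A simplified sketch (the function name and the in-memory article vectors are illustrative; the real system used a proper vector database, and a language model wrote the final answer from the retrieved articles):

```python
import numpy as np

# A simplified sketch of the retrieval step: embed the question,
# score it against every article vector at once, keep the top 10.
# `article_vectors` (one row per article) stands in for the real
# vector database; `model` is the embedding model from above.
def top_articles(question: str, articles: list[str],
                 article_vectors: np.ndarray, k: int = 10) -> list[str]:
    query = model.encode(question)
    scores = article_vectors @ query / (
        np.linalg.norm(article_vectors, axis=1) * np.linalg.norm(query)
    )
    best = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    return [articles[i] for i in best]

# The returned articles are then handed to a language model as
# context for writing the final answer.
```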
We will discuss that project in a follow-up post, but you can already find more information about it in our use cases.
“This Vincent guy really, really knows his shit!”
As stated by one happy customer