We Loved Vectors Before They Were Cool: Here’s Why
AI and Machine Learning have made vectors very cool. But at the same time, the focus on the way vectors support AI and ML has created a sort of tunnel vision about the power of vectors.
We know this because at KX we’ve been embedding time series data, transactional data, and all the other data needed for apps in finance, manufacturing, and life sciences in vectors for more than 20 years. We use our KDB+ database to store the vectors and link them to metadata and all sorts of other data, structured and unstructured.
The result is that the apps we build can process all this data in vector space in ways that are incredibly scalable but also support streaming and real-time processing without having to use caches and other layers of supporting components. Embedding data in vectors, storing them in our KDB+ database, and then using our Q language or Python to process the ETL, database queries, app logic, and analytics in vector space squishes the data science stack mightily.
We loved vectors long before they were cool. It’s a mistake to think of vectors only as a way to support AI/ML model training and development. The power of vector engineering can and should be applied to much more than that. This article will explain why.
Why Vectors Conquered the AI/ML Space
In the AI/ML space, all sorts of data is represented as vectors, a process known as embedding. The vectors are lists of numbers, and each position in the list is one dimension of a vector space, similar to the X, Y, and Z axes of three-dimensional space.
Each component of a vector in AI/ML represents some aspect of the entity being encoded. In ChatGPT, for example, words are encoded into vectors that have 12,288 dimensions. A super simplified way of thinking of vectors was offered by @svpino in a recent tweet that imagined encoding four words. Here’s an example using Dog, Cat, Puppy, and Kitten, each encoded as a two-dimensional vector:
Dog → [3, 1]
Cat → [3, 2]
Puppy → [1, 1]
Kitten → [1, 2]
The first component represents the concept of "age": Cat and Dog have a value of 3, indicating they are older than Puppy and Kitten with a value of 1. The second component represents the concept of "species": Dog and Puppy have a value of 1, indicating they are of the canine species, while Cat and Kitten have a value of 2, indicating feline.
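To make the toy example concrete, here is a minimal Python sketch that treats the four vectors above as points in a two-dimensional space. The function name `nearest` and the choice of Euclidean distance are assumptions made for illustration; real embedding systems often use other similarity measures and far more dimensions.

```python
import math

# Toy embeddings from the example above: [age, species]
embeddings = {
    "Dog": [3, 1],
    "Cat": [3, 2],
    "Puppy": [1, 1],
    "Kitten": [1, 2],
}

def nearest(word):
    """Return the other word whose vector is closest by Euclidean distance."""
    others = (w for w in embeddings if w != word)
    return min(others, key=lambda w: math.dist(embeddings[word], embeddings[w]))

# Component-wise arithmetic: Kitten - Cat + Dog
analogy = [
    k - c + d
    for k, c, d in zip(embeddings["Kitten"], embeddings["Cat"], embeddings["Dog"])
]

print(nearest("Dog"))   # Cat (same age, one step away on species)
print(analogy)          # [1, 1], the same vector as Puppy
```

Because meaning lives in the components, simple arithmetic on the numbers recovers relationships between the words: removing "adult cat" from Kitten and adding "adult dog" lands exactly on Puppy in this toy space.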
But any type of data can be encoded into a vector. You can encode the different types of alerts and the temperature from a heat-sensing IoT device into vectors, or all of the data from the electronic sensors of an automobile engine, or all of the trades for a day in the financial markets. The search for features to encode into vectors can also be automated.
After encoding, these vectors can be used to train neural networks that make predictions, analyze and categorize data, and serve zillions of other purposes in areas like asset tracking, environmental monitoring, industrial automation, and autonomous vehicles, to name a few.
AI/ML has taken off because once you convert raw data into vectors, you can unleash the power of all sorts of algorithms for supervised learning or reinforcement learning. These algorithms work well on numbers but wouldn’t know what to do with raw text, images, transactional data, or time series.
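As a rough illustration of the IoT case, the sketch below encodes each sensor reading as a short numeric vector. The alert categories, the function name `encode_reading`, and the sample readings are all hypothetical; a real device would define its own alert types and probably scale the temperature.

```python
# Hypothetical alert categories; a real device would define its own.
ALERT_TYPES = ["none", "overheat", "sensor_fault"]

def encode_reading(alert, temperature_c):
    """Encode one reading as [one-hot alert flags..., temperature in Celsius]."""
    one_hot = [1.0 if alert == a else 0.0 for a in ALERT_TYPES]
    return one_hot + [float(temperature_c)]

# Illustrative readings: (alert type, temperature)
readings = [("none", 21.5), ("overheat", 96.0), ("none", 22.1)]
vectors = [encode_reading(alert, temp) for alert, temp in readings]

for v in vectors:
    print(v)
```

The categorical alert becomes a set of 0/1 flags (one-hot encoding) and the temperature stays a number, so every reading ends up as a vector of the same length that numeric algorithms can consume.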
So the broad recipe AI/ML applications have used is this:
Convert all sorts of data into vectors.
Use vector space to process huge amounts of data in an incredibly scalable manner.
Automate the process of using vectors.
Use powerful algorithms and techniques to find patterns in the vectors and build models for prediction, categorization, and many other purposes.
Build applications based on all this power to do useful things.
Notice that this recipe doesn’t mention AI/ML at all.
But when people think of vectors now, they almost always think of the way they are used in AI/ML applications, and that’s a mistake.
The Power of Vectors Beyond AI/ML
The idea of using vectors as a way to model applications has been around for a long time. APL, an array processing language built around vectors, traces back to notation devised in the late 1950s. Many other array languages followed, and now that AI/ML has drawn interest to vectors, a whole new generation of vector databases and vector processing technology has cropped up.
Our efforts started 23 years ago when we created the KDB+ vector database, and we continue to innovate every week.
At KX, we found that the recipe mentioned above could be used to build applications in arenas where a huge amount of data had to be processed in real time to support high-value decisions.
Here’s how we applied the recipe to a variety of domains.
Financial Services
KX converted time series data into vectors, added a variety of other metadata also encoded as vectors, and was able to create applications that could handle huge amounts of data, process streams of data, and run in real time, all with super low latency.
The result is that the KDB+ database is now used by all of the top 50 investment banks and is a linchpin of both the sell- and buy-side infrastructure for algorithmic trading, risk analysis and management, and portfolio optimization.
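To give a feel for what storing time series as vectors means, here is a small Python sketch that keeps a trade table column-wise, one vector per field, and computes a volume-weighted average price per symbol. This is not KDB+ or Q code; the symbols, prices, and the function name `vwap_by_symbol` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical trades stored column-wise: one vector per field.
trades = {
    "time":   [1, 2, 3, 4, 5],  # seconds since open (illustrative)
    "symbol": ["AAA", "BBB", "AAA", "AAA", "BBB"],
    "price":  [10.0, 20.0, 11.0, 12.0, 22.0],
    "size":   [100, 50, 200, 100, 150],
}

def vwap_by_symbol(table):
    """Volume-weighted average price per symbol, computed from column vectors."""
    notional = defaultdict(float)
    volume = defaultdict(int)
    # Walk the three columns in lockstep; row i is the i-th entry of each vector.
    for sym, px, sz in zip(table["symbol"], table["price"], table["size"]):
        notional[sym] += px * sz
        volume[sym] += sz
    return {sym: notional[sym] / volume[sym] for sym in notional}

vwap = vwap_by_symbol(trades)
print(vwap)  # {'AAA': 11.0, 'BBB': 21.5}
```

The point of the columnar layout is that an analytic like this touches only the columns it needs (symbol, price, size) and processes each as a contiguous run of values, which is the property vector-oriented engines exploit for speed.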
Life Sciences
Speeding up clinical trials and assembling and using huge amounts of data in drug discovery can result in ROI measured in billions of dollars. Through a partnership with KX, Syneos, a company focused on creating integrated biopharmaceutical solutions, converted vast arrays of clinical trial data to time-indexed vectors. The vectors allowed huge amounts of data to be processed quickly, speeding the work of researchers, and also made it easier to assemble collections of relevant data and identify sequences that lead to interesting outcomes, such as:
Creating a data timehouse to help customers address complex healthcare decisions.
Improving the quality of analysis for drug trial site selection.
Improving clinical trial efficiency, reducing the costs and speeding time to market for life-changing therapies for patients.
At KX we love vectors because they enabled us to build a great business for companies that had huge amounts of data to be processed in real time. Time-series and streaming applications proved very friendly to our model.
Along the way we learned how to:
Map all sorts of data into vectors including time series.
Connect vectors with all sorts of structured and unstructured data.
Use Python or KX’s Q language to squish the stack and do ETL, data engineering, analytics, streaming, and application logic in one platform.
Handle huge data volumes at scale.
Now that the whole world is learning vectors are cool for AI/ML, we want to make sure that mastery of vectors has the biggest possible payoff.