When considering databases for AI applications, one typically envisions a specialized tool tailored exclusively for vector operations. However, Alibaba has chosen a different approach, integrating vector search support directly into AliSQL, their relational database. This decision might appear to be a mere technical detail, but it reflects a fundamental strategy. Let's explore what this means in practice and how it's engineered.
What Are Vectors, and Why Does a Database Need Them?
If you've ever used semantic search, «similar item» recommendations, or a voice assistant, you'll know that these features are all powered by vectors. Simply put, a vector is a set of numbers that describes an object, such as text, an image, or a sound. Two objects with similar meanings will have closely corresponding vectors.
In this context, the database's role is not only to store these numbers but also to quickly find those closest to a given query. Why adapt a traditional relational database for this purpose? The answer is straightforward: most applications already leverage relational databases. By adding vector search capabilities, developers no longer need to create and maintain a separate vector system, synchronize data between two storage solutions, or worry about consistency. Everything remains within a single environment.
How Do Numbers Become a Structure on Disk?
The primary question that arises when working with vectors in a database is how to store them. A vector is essentially an array of floating-point numbers. In AliSQL, a special data type has been introduced for this purpose, which is serialized (or packed) into a binary format before being written to disk.
The format is structured as follows: it begins with a header containing metadata, including the vector's dimension (the number of numeric values it contains) and the data type (e.g., 32-bit floating-point numbers). This is followed by the numerical values themselves, written sequentially. Because the header states the dimension up front, vector boundaries can be determined precisely without wasting space on delimiters.
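The exact binary layout is internal to AliSQL, but the general scheme described above – a metadata header, then tightly packed float32 values – can be sketched in a few lines of Python. The `"<IB"` header layout and the `F32` type code below are illustrative assumptions, not the real on-disk format:

```python
import struct

F32 = 1  # hypothetical type code for 32-bit floats

def pack_vector(values):
    """Serialize a vector: a small header (dimension, element type),
    then the float32 values written back to back."""
    header = struct.pack("<IB", len(values), F32)
    body = struct.pack(f"<{len(values)}f", *values)
    return header + body

def unpack_vector(blob):
    """Read the header first; it says exactly how many floats follow,
    so vector boundaries are known without any delimiters."""
    dim, type_code = struct.unpack_from("<IB", blob, 0)
    assert type_code == F32
    offset = struct.calcsize("<IB")
    return list(struct.unpack_from(f"<{dim}f", blob, offset))

blob = pack_vector([0.1, 0.2, 0.3])  # 17 bytes: 5-byte header + 3 × 4-byte floats
```

Note that round-tripping through float32 loses a little precision relative to Python's native 64-bit floats, which is the usual trade-off embedding storage makes for halving the size.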
An important consideration is dimensionality. Modern AI models often generate vectors with dimensions ranging from several hundred to several thousand. AliSQL supports up to 16,000 dimensions, which covers the needs of most current tasks.
Why Is Brute-Force Iteration Inefficient?
Imagine you have a million vectors in your database, and you need to find the 10 most similar to a query. The most direct method is to compare the query with each of the million vectors. Mathematically, this is known as an exact nearest neighbor search.
The problem is that this approach is catastrophically slow with large volumes of data. A million comparisons might still be tolerable, but with tens or hundreds of millions of data points, the response time becomes unacceptable for real-world applications.
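To make the cost concrete, here is a brute-force k-nearest-neighbors sketch in Python (the function names are illustrative). Every query touches every stored vector, so the work grows linearly with the table size:

```python
import math

def euclidean(a, b):
    """Standard geometric distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_knn(query, vectors, k):
    """Exact nearest neighbor search: compare the query against every
    stored vector -- O(n) distance computations per query, which is
    exactly what makes brute force slow at scale."""
    order = sorted(range(len(vectors)), key=lambda i: euclidean(query, vectors[i]))
    return order[:k]

data = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
nearest = exact_knn([0.9, 1.1], data, 2)  # indices of the 2 closest vectors
```

With a million 1000-dimensional vectors, each query means a billion multiplications – tolerable once, unacceptable at serving rates.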
This is precisely why vector databases employ approximate nearest neighbor (ANN) search – algorithms that return results that are very likely among the true closest, rather than guaranteed exact ones, and do so much faster. AliSQL utilizes the HNSW algorithm for this.
HNSW: The Graph That Enables Fast Searches
HNSW stands for Hierarchical Navigable Small World. The name sounds complex, yet the underlying idea is quite elegant.
Imagine a city map. There are high-speed highways that connect distant districts but do not extend into every courtyard. There are local roads that cover specific neighborhoods, and pedestrian paths for precise movement within a small area.
HNSW constructs a similar structure, but with vectors. The upper layers of the graph act as «highways»: providing coarse but fast connections between distant points. The lower layers offer detailed connections between neighboring vectors. The search begins from the top, allowing for quick navigation to the relevant area, then proceeds to a lower level to refine the result.
This approach offers an excellent balance between speed and accuracy, which is why HNSW has become the de facto standard in most vector databases.
How Is HNSW Implemented in AliSQL?
The implementation in AliSQL adheres to the general logic of HNSW but is adapted to the specific requirements of a relational database. Let's examine the key points.
Graph Construction
When a new vector is added to a table, AliSQL inserts it into the graph. First, it determines at which hierarchy level the vector will appear. This is decided randomly, by a probabilistic rule: most vectors end up only on the bottom layer, and only a few «rise» higher. This is what produces the skewed hierarchy the algorithm depends on – sparse upper layers for coarse jumps, and a dense bottom layer for precise search.
Then, for each level, the nearest neighbors of the new vector are identified, and connections are established. The number of connections per node is capped by a parameter (the HNSW literature calls it M). This limit not only conserves memory but also keeps searches efficient, as an overly «congested» graph slows down navigation.
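Both rules fit in a few lines of Python. The constants below (M = 16 and the level factor mL) follow the conventions of the original HNSW paper and are purely illustrative – they are not AliSQL's documented defaults:

```python
import math
import random

M = 16                 # max connections per node per layer (illustrative)
mL = 1 / math.log(M)   # level-generation factor from the HNSW paper

def random_level(rng=random):
    """Exponentially decaying level assignment: most vectors land only on
    layer 0, and each higher layer holds exponentially fewer nodes."""
    return int(-math.log(rng.random()) * mL)

def connect(node, candidates, distance):
    """Keep only the M closest candidates as the node's neighbors;
    capping the degree bounds both memory use and hops per search."""
    return sorted(candidates, key=lambda c: distance(node, c))[:M]
```

With these values, roughly 15 out of 16 vectors stay on the bottom layer, and each step up thins the population by another factor of 16.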
Search
The search process works in reverse: it starts from the top layer, where vectors are sparsely connected but span long distances. It moves to the nearest node, then descends to the level below, repeating the process. On the bottom, most detailed layer, a final local search is conducted, and the result is returned.
Additionally, a list of «candidates» is employed. During navigation, the algorithm tracks several promising nodes simultaneously to avoid getting stuck in a local optimum. The size of this list is a configurable parameter (known as ef in the HNSW literature): a larger list yields more accurate results but slower searches.
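The two phases can be sketched in Python. This is a simplified single-layer version based on the generic HNSW scheme, not AliSQL's actual internals; the toy graph and function names are assumptions. `greedy_step` plays the role of the upper «highway» layers, while `search_layer` shows the ef-sized candidate list at work on the bottom layer:

```python
import heapq

def greedy_step(entry, query, neighbors, distance):
    """Upper-layer phase: hop to whichever neighbor is closer to the
    query until no neighbor improves on the current node."""
    current, improved = entry, True
    while improved:
        improved = False
        for n in neighbors(current):
            if distance(n, query) < distance(current, query):
                current, improved = n, True
    return current

def search_layer(entry, query, neighbors, distance, ef):
    """Bottom-layer phase: keep an ef-sized list of the best results
    found so far, so the search can escape local optima."""
    visited = {entry}
    candidates = [(distance(entry, query), entry)]  # min-heap: closest first
    best = [(-distance(entry, query), entry)]       # max-heap of up to ef results
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -best[0][0]:
            break  # no remaining candidate can beat the current worst result
        for n in neighbors(node):
            if n not in visited:
                visited.add(n)
                dn = distance(n, query)
                if len(best) < ef or dn < -best[0][0]:
                    heapq.heappush(candidates, (dn, n))
                    heapq.heappush(best, (-dn, n))
                    if len(best) > ef:
                        heapq.heappop(best)
    return sorted((-d, n) for d, n in best)

# Toy graph: nodes are the integers 0..9 on a line, each linked to its
# immediate neighbors; the query is the point 7.2.
nbrs = lambda i: [j for j in (i - 1, i + 1) if 0 <= j <= 9]
dist = lambda a, b: abs(a - b)
entry = greedy_step(0, 7.2, nbrs, dist)            # coarse phase lands on 7
top3 = search_layer(entry, 7.2, nbrs, dist, ef=3)  # refined: nodes 7, 8, 6
```

Raising `ef` widens the beam: more nodes are inspected, recall improves, and latency grows accordingly.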
Distance Metrics
AliSQL supports several methods to measure vector «similarity»:
- Euclidean distance – the standard geometric distance between two points in a multidimensional space;
- Cosine similarity – measures the angle between vectors, rather than the distance, and works well for text;
- Dot product – another option, frequently used in recommendation tasks.
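All three metrics are simple to express; the sketch below uses plain Python over lists of numbers, with function names chosen for clarity:

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance in multidimensional space; smaller = more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Angle between the vectors, in [-1, 1]; ignores vector length,
    which is why it works well for text embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def dot_product(a, b):
    """Raw inner product; larger = more similar, common in recommenders."""
    return sum(x * y for x, y in zip(a, b))
```

Note the directions differ: for Euclidean distance smaller is better, while for cosine similarity and dot product larger is better – a query must sort accordingly.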
The choice of metric depends on the specific task and how the model that generates the vectors was trained.
Integration with Storage: More Than Just an Index
One of the non-trivial aspects of the implementation is how the HNSW graph is persisted to disk. In standalone vector databases, the index is usually kept entirely in RAM. In a relational database, this is not feasible, as the data volume can be too large, and it must persist after a server restart.
AliSQL saves the graph as part of its storage engine. The graph nodes and their connections are serialized and written to disk alongside regular data. When loaded, the graph is restored in memory for operation. This necessitates careful memory management – keeping the entire graph in memory is expensive, so a partial loading mechanism is implemented.
Furthermore, since data in a relational database is constantly changing – rows are added, updated, and deleted – the graph must accurately reflect these changes. Deleted rows should not appear in search results, even if they are still physically present in the graph until the next re-indexing.
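One way to picture that filtering – an assumption on my part, since the engine's real strategy is not described here – is a tombstone set consulted when results are collected: the graph may still route *through* deleted nodes, but they are never *returned*.

```python
def knn_with_tombstones(query, k, search, deleted):
    """Over-fetch candidates from the index, then drop tombstoned rows,
    so deleted data never reaches the user even while it still
    physically sits in the graph awaiting re-indexing."""
    # Ask for extra results to compensate for candidates that turn
    # out to be deleted (a common trick; hypothetical here).
    candidates = search(query, k + len(deleted))
    return [row for row in candidates if row not in deleted][:k]
```

The eventual re-index then physically removes the tombstoned nodes and repairs the connections that pointed at them.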
Who Is This For?
If you are developing an application that already uses MySQL or a compatible database – and AliSQL is MySQL-compatible – and you wish to add semantic search, similar image search, or a recommendation system, this solution allows you to do so without introducing a separate component into your infrastructure.
For small to medium-sized projects, this is particularly valuable: the fewer moving parts in a system, the fewer points of failure and the lower the operational load.
For large, high-load systems, the issue is more complex – specialized vector databases may still offer superior performance specifically for vector tasks. However, the ability to work with vectors within a familiar SQL environment is no longer an exotic feature but a practical tool.
This is the first part of a technical deep dive into AliSQL, focusing on its vector capabilities. It lays the foundation: how data is stored and how the search algorithm works. It is anticipated that subsequent parts will be devoted to other aspects, such as performance and tuning for real-world scenarios.