the vector db needs of imagehub
2025-07-11
I’m running into a problem building the vector database for my search on top of open source options. I want to avoid starving search results, which means filtering data before doing distance calculations instead of filtering after the fact.
It seems that convex only supports OR operators for filtering, which means that I can’t do much beyond filtering content for a particular user (with my current schema, which indexes only on ownerId and decouples image data from album data).
In other words, I can’t do a search like “image belongs to user X and is a member of album Y”.
Moreover, exclusions are also a challenge: “find me all photos that I could add to this album” requires comparisons to ones already in it. If you limit search results and filter after the fact, you run into a problem of having no results when your album is already full of relevant content (but there’s still more to add).
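To make the starvation problem concrete, here is a minimal sketch that compares filtering after the top-k cut with filtering before it. The `Img` shape, the `inAlbum` flag, and the toy two-dimensional embeddings are assumptions for illustration only; none of this is convex's API.

```typescript
// Toy model: each image has an id, an embedding, and an album-membership flag.
type Img = { id: string; embedding: number[]; inAlbum: boolean };

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force nearest neighbours: rank every image, keep the best k.
function topK(images: Img[], query: number[], k: number): Img[] {
  return [...images]
    .sort((x, y) => cosine(y.embedding, query) - cosine(x.embedding, query))
    .slice(0, k);
}

// Post-filter: cut to k first, then drop images already in the album.
function postFilterSearch(images: Img[], query: number[], k: number): Img[] {
  return topK(images, query, k).filter((img) => !img.inAlbum);
}

// Pre-filter: drop images already in the album, then rank what is left.
function preFilterSearch(images: Img[], query: number[], k: number): Img[] {
  return topK(images.filter((img) => !img.inAlbum), query, k);
}

const demoLibrary: Img[] = [
  { id: "a", embedding: [1, 0], inAlbum: true },
  { id: "b", embedding: [0.9, 0.1], inAlbum: true },
  { id: "c", embedding: [0.8, 0.2], inAlbum: false },
  { id: "d", embedding: [0, 1], inAlbum: false },
];
const demoQuery = [1, 0];

// The two best matches are both in the album, so the post-filter returns nothing.
const starved = postFilterSearch(demoLibrary, demoQuery, 2).map((img) => img.id);
// The pre-filter still finds the remaining candidates.
const served = preFilterSearch(demoLibrary, demoQuery, 2).map((img) => img.id);
```

With a result limit of 2, the post-filter comes back empty even though "c" is a strong match that could still be added.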
There are a few core usage patterns that I hope to address.
The problem is that I don’t know which vector databases have good index support for metadata.
I looked at chroma a bit, and there seem to be performance concerns over its implementation of this feature.
I’d like to lay out in this blog post the three modes of search functionality that I want my app to support, and where I currently stand with respect to each of them.
1. Library Search
- retrieve an uploaded image (without navigating to an album)
This is actually perfectly well supported in convex right now!
I can filter on ownerId for images to search over only the user’s images.
No other filters are needed.
2. Album Search
- find a specific image in an album (to show someone)
- find images to remove from an album
An album is actually a pointer to a “latest revision,” which represents a set of images, stored as a list of ids.
Users should be able to search for photos in any revision of an album (be it the latest published, the draft, or a historical one).
In order to do this with convex, I’d (probably) have to create a record for each image per revision, and index on [ownerId, revisionId].
If I did that, I’d only need a single filter on the multi-index to leverage the existing search functionality.
The downside is that this creates an explosion of data, since a single album can have dozens of revisions.
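As a rough sketch of what one record per image per revision could look like, the snippet below expands a list of revisions into flat records. The record shape and the `ownerRevisionKey` field are hypothetical; the composite key is just one way a single equality filter might stand in for "owner X and revision Y", not something confirmed against convex.

```typescript
// Hypothetical shapes; field names are illustrative, not the app's schema.
type Revision = { revisionId: string; imageIds: string[] };
type RevisionImageRecord = {
  ownerId: string;
  revisionId: string;
  imageId: string;
  // Composite value so one equality filter can express
  // "ownerId = X AND revisionId = Y".
  ownerRevisionKey: string;
};

// Emit one record per (revision, image) pair.
// In a real table each record would also carry a copy of the image's
// embedding, which is where the duplication cost comes from.
function expandRevisions(ownerId: string, revisions: Revision[]): RevisionImageRecord[] {
  const records: RevisionImageRecord[] = [];
  for (const rev of revisions) {
    for (const imageId of rev.imageIds) {
      records.push({
        ownerId,
        revisionId: rev.revisionId,
        imageId,
        ownerRevisionKey: `${ownerId}:${rev.revisionId}`,
      });
    }
  }
  return records;
}

const albumRevisions: Revision[] = [
  { revisionId: "r1", imageIds: ["img1", "img2"] },
  { revisionId: "r2", imageIds: ["img1", "img2", "img3"] },
  { revisionId: "r3", imageIds: ["img1", "img3"] },
];

const revisionRecords = expandRevisions("userX", albumRevisions);
const uniqueImageCount = new Set(revisionRecords.map((r) => r.imageId)).size;

// A single equality check selects exactly one user's revision.
const imagesInR2 = revisionRecords
  .filter((r) => r.ownerRevisionKey === "userX:r2")
  .map((r) => r.imageId);
```

Three small revisions over three unique images already produce seven records, which is the data growth described above.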
Alternatively (without duplicating data): if I filtered on membership in the revision after vector search completed, I would encounter “starvation” whenever highly relevant images outside the album outnumbered those inside it. That’s a non-starter, as search quality would depend on the size of the album and the contents of the library.
Currently, I am handling this by performing a query to pre-filter the data and using a custom vector search function that I wrote.
I already have a set of imageIds that represent a revision, so I can retrieve these records directly and send them over to a convex action.
The problem that my current solution encounters is that there’s a limit on the memory consumption of actions, so after ~2000 images the process will fail.
In theory, the multi-index solution with convex could scale album search further than this.
The memory limits can be addressed with pagination (doing multiple searches and re-combining), but for now I think an enforced album limit of 2000 is a reasonable stopgap.
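The pagination idea can be sketched as follows: search one page of vectors at a time, keep only the best k seen so far, and merge as you go. This assumes scores are plain dot products (which ranks the same as cosine similarity for normalized embeddings); the `Vec` and `Scored` types and the in-memory `slice` that stands in for a per-page query are illustrative assumptions.

```typescript
type Vec = { id: string; embedding: number[] };
type Scored = { id: string; score: number };

// Dot product; equivalent to cosine ranking if embeddings are normalized.
function dotProduct(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

// Score a single page and keep its best k results.
function searchPage(page: Vec[], query: number[], k: number): Scored[] {
  return page
    .map((v) => ({ id: v.id, score: dotProduct(v.embedding, query) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Walk the data page by page, merging each page's top k into a running top k.
// Only one page of vectors plus k results is held at any time.
function pagedSearch(all: Vec[], query: number[], k: number, pageSize: number): Scored[] {
  let best: Scored[] = [];
  for (let start = 0; start < all.length; start += pageSize) {
    // In a real app this slice would be a separate query or action call.
    const page = all.slice(start, start + pageSize);
    best = best
      .concat(searchPage(page, query, k))
      .sort((x, y) => y.score - x.score)
      .slice(0, k);
  }
  return best;
}

const sampleVectors: Vec[] = [
  { id: "v0", embedding: [0.2, 0.9] },
  { id: "v1", embedding: [0.9, 0.1] },
  { id: "v2", embedding: [0.5, 0.5] },
  { id: "v3", embedding: [0.7, 0.3] },
  { id: "v4", embedding: [0.1, 0.8] },
];
const sampleQuery = [1, 0];

const pagedIds = pagedSearch(sampleVectors, sampleQuery, 2, 2).map((r) => r.id);
const singlePassIds = pagedSearch(sampleVectors, sampleQuery, 2, 100).map((r) => r.id);
```

Because every page contributes its own top k, the merged result matches what a single pass over all the data would return; the cost is one round trip per page.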
3. Album Management
- find new images to add to an album
I already have a good process for copying (or moving) images from one album to another, but there’s technically an upload button on the Library page that adds images without placing them in any album. If someone uploaded many images directly to the library instead of to distinct albums, they may want to build albums after the fact.
This is the really tricky one with respect to avoiding starvation. If an album already contains many relevant photos, vector search would be starved if filtering occurred after distance calculations.
convex has a limit of 256 results, so searching for content not in an album becomes infeasible after the first batch of relevant images is added: all subsequent searches will return the same results, which are then filtered out.
I am currently handling this in the same way as (2), with my custom vector search and pre-filtering.
To get all images not in an album requires checking each imageId in the user’s library against membership in a list.
I don’t know if it’s possible to solve this in convex natively, as it’d require a “not equals” comparison to check against revision membership, so even my multi-index approach would not handle this case.
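The exclusion step itself is cheap to do outside the database. A minimal sketch, assuming the library and the revision are both available as plain arrays of image ids (the function and variable names here are made up for illustration):

```typescript
// Return the library images that are not yet part of the given revision.
// A Set makes each membership check constant time, instead of scanning
// the revision's id list once per library image.
function candidatesForAlbum(libraryIds: string[], revisionIds: string[]): string[] {
  const inRevision = new Set(revisionIds);
  return libraryIds.filter((id) => !inRevision.has(id));
}

const libraryImageIds = ["a", "b", "c", "d"];
const revisionImageIds = ["b", "d"];

// Only these ids would be handed to the vector search.
const addable = candidatesForAlbum(libraryImageIds, revisionImageIds);
```

The remaining ids are the pre-filtered candidate set, so the distance calculations never see images that are already in the album.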
Notes
- An ideal solution would require complex pre-filtering before distance calculations.
- I’m okay with placing a limit of 100k photos in the library and 2k in an album for now (ideally 5k-10k eventually), which means I’ll run into memory limits that require pagination for (3) with my current solution.
- I’d prefer not to duplicate vectors, but if multi-indexes could be leveraged to avoid starvation of results with pre-filtering, I’d be open to the additional complexity with cleanup processes.
- I’m not super-concerned about increasing memory usage if it means I can handle much larger volumes of data.
- The 16MB server action limit in convex can be bypassed if I move vectors to a separate database.
Closing thoughts
I think my application is honestly in a pretty good state already, capable of being useful for many situations, especially when dealing with hundreds of photos. So in some sense, many of the problems I’ve laid out could be solved by imposing strict limits on accounts, forcing users to create different accounts for different projects and segmenting their experience. While I could deal with that personally, I know that the current limits of my app will stifle the use-cases I have for it. At the very least, it can already
After writing this up, I think my current implementation is actually pretty close to sufficient, but it’s far from ideal (it won’t scale to millions of images).
The only real limitation with my current solution is the convex memory limit on actions, which I can deal with using paginated queries and re-combining the vector search results.
It may feel slow but it would return the right results.
If I can handle about 2000 images per query right now, five queries would let me search a library of 10k images. That should suffice for now, but at 100k images (50 queries) the wait becomes untenable, and a part of me keeps thinking about when this will break and whether I should be considering a different vector database.