What implementation can quickly search the intersection of the sets (tag system)?

I would like to ask, who decides how the task search by tags. What.

Two basic conditions:
1) Target system is quickly (~1-50 MS) to answer the question in the spirit of "to find all documents in which there is (Tag-1 or Tag 30 or a Tag-100500 or ... ) and (Tag 50 or the Tag-1000 or...) and ...". In OR may be up to two dozen of tags And terms can be a dozen.
2) Since data can be updated frequently, it is necessary to achieve the minimum time of updating the changes.

I tried to do on Redis using styles and SET BITMAP (around here a "Quick filter directory for the Internet stores on ..."). SET is not approached (in the set-and kept-ID documents), as in the case of intersecting two sets in 100k thinks any longer than necessary. BITMAP did not come due to strong low ID documents, as a result of excess memory consumption by "holes". In General, if multiple large, then Redis boxes are looking bad.

Right now we have the option to Sphinx. ID tags are written in sql_attr_multi. This ensures that the specified requirement for the speed of document search. Requirement updates is solved by the construction of the main and delta index. The main index (which are search) declared as distributed. This principle works well, but sometimes it's a lot of new changes and the Delta index starts to slow down. The rebuild of the main index takes a few minutes (now about 3.5 M id documents in it). Did not seem long, but it is planned to increase the number of documents in dozens of times. Time of data updating also will increase.

Would like to know if there is any other solution (S? Tarantool? Elasticsearch?) and anyone that uses.
July 4th 19 at 22:45
2 answers
July 4th 19 at 22:47
Leave as comment for the history. At the moment the scheme in sphinx turns the fastest. The issue with the Delta index were more frequent recalculation. The application now keeps track of its size and as soon as it has more 20K documents starts dotirovanie. It turns out the required sampling speed, even on complex queries.
And be sure to get ALL? All is still at once the user will not be read. You can get each tag separately, and then AJAX to pull up and , if necessary, to sort on the client... - tyrique.Rodriguez13 commented on July 4th 19 at 22:50
: Of course it's not needed. Always need X page. And on the client, neither of which do not work. In the category of maybe 10K of goods not on the client work. For far too long. I certainly understand that the developers are sitting on good computers with big monitors. And they have 10K to digest with difficulty, but possible. In General, what to say about the client, this problem is purely a server and only there it can be solved. But Ajax can only pull up the results. This dataset for it and need to. - davonte commented on July 4th 19 at 22:53
July 4th 19 at 22:49
In version 4 of the radish had the opportunity to connect external modules.
For example, you can add support for JSON and roaring bitmaps:
https://github.com/RedisLabsModules/rejson
https://github.com/aviggiano/redis-roaring

The latter reduces the memory bitmap discharged.
Faster operations and, or and xor bitmasks can't be.
I think the same mechanism of compression is used in Sphinx.
Thank you! Have not seen it. And to feel you had?

He recently briefly tested the schema with Postgresql and JSONB (table with products, one of the fields is JSONB in which the array is a list of tags) to 100k products who have 10 characteristics (tags). Worked very good, but you need to supati on a complex sample and catalog for the 1M. Plus compared to the Sphinx is the consistency of the data (the index of which one, and the base is a little different). - tyrique.Rodriguez13 commented on July 4th 19 at 22:52
Just now the probe, the prod is nothing.
I had the exact same problem with memory and bit masks.

127.0.0.1:6379> SETBIT aaa 4000000000 1
(integer) 0
127.0.0.1:6379> info memory
# Memory
used_memory:537721360
used_memory_human:512.81 M

And I also refused bitmasks, even though he knew it to be the fastest method.
With compression, things got a lot better + write that bit faster operation times than conventional lines.
JSONB will definitely lag

127.0.0.1:6379> SETBIT R. bbb 4000000000 1
(integer) 0
127.0.0.1:6379> info memory
# Memory
used_memory:850144
used_memory_human:830.22 K - davonte commented on July 4th 19 at 22:55
as I understand it had from the sources to collect himself redis server from the unstable branch? - davonte commented on July 4th 19 at 22:58
, redis server 4.0 is already stable, and modules from source. The module connects to redis.conf using the loadmodule Directive with the path obtained so file. - ottilie.Paucek commented on July 4th 19 at 23:01
got around potestit. Impressive. Waved from 15K-tagging, 1.5 M products, 20 tags/item (30M ties). Filter fulfills for ~2 MS, eats 20Mb of RAM. Directly some space. I will recheck again. Well, parallel to the load have not test. But yeah, JSONB there is certainly not even close to pass.

In prod was drinking? - davonte commented on July 4th 19 at 23:04
great result! Radish single-threaded, all parallel queries will be executed sequentially but very fast.
I have volumes less than you, so this approach both from bfg on sparrows. But if you think about highload, then the stock is huge. And that single instance without any clusters and load balancers. - ottilie.Paucek commented on July 4th 19 at 23:07

Find more questions by tags MySQLSQL