How do you design a "timeline" (activity feed) for a social project?

The problem statement is as follows:


1) Each user has friends (the "maximum" option: everyone is friends with everyone)

2) Each user can join an unlimited number of groups


Users can generate content themselves (a personal "wall", as on Facebook or VK), and content can also be generated by internal components of the project.


I want to build a consolidated feed of updates (the user's and their friends' statuses/thoughts, internal project events, and group events) and present them as one common list with paging.


The backend runs on a MySQL database, but it seems quite obvious to me that the "normal" relational approach is not applicable to this problem in the general case because of the constraints above: if a user has, say, 2500+ friends, a single WHERE clause grows to monstrous proportions.


So I am looking towards NoSQL solutions, but first I just want to understand for myself how such a feed works algorithmically, and only then choose the tools.


Should I denormalize and maintain a separate feed per user (although then, when someone leaves a group or unfriends a person, updating that list can take a very long time)?


Maybe there are articles about this, or well-known implementation approaches?
October 8th 19 at 01:25
7 answers
October 8th 19 at 01:27
On Facebook and Vkontakte the feed data is generated by a daemon written in a compiled language, which also caches a lot of information in memory. Can you manage that? If you don't know C, you can try writing it in Node or Java, or at least in PHP; if you plan for scaling, that will also hold up for a while.

The algorithm works like this: 1) select the ids of the user's friends; 2) select the latest events for each of them; 3) throw out the ones hidden by privacy settings; 4) merge the lists, sort by date, and hand the result to the browser.
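
A minimal sketch of those four steps in Python (pull model; get_friend_ids, get_recent_events and can_see are hypothetical stand-ins for the real storage layer, not part of the original answer):

import heapq
import itertools

# Hypothetical storage helpers, stand-ins for the real data layer.
def get_friend_ids(user_id):
    raise NotImplementedError

def get_recent_events(friend_id, limit):
    # must return up to `limit` events as dicts, newest first
    raise NotImplementedError

def can_see(viewer_id, event):
    raise NotImplementedError

def build_feed(user_id, page=0, per_page=20):
    friend_ids = get_friend_ids(user_id)                  # 1) ids of the friends
    per_source = (page + 1) * per_page                    # enough to fill this page
    candidates = [get_recent_events(f, per_source) for f in friend_ids]
    merged = heapq.merge(*candidates,                     # 4) merge the sorted lists,
                         key=lambda e: e["created_at"],   #    newest first
                         reverse=True)
    visible = (e for e in merged if can_see(user_id, e))  # 3) privacy filter
    page_slice = itertools.islice(visible, page * per_page, (page + 1) * per_page)
    return list(page_slice)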

In MySQL the load grows with the number of friends, and that will quickly bring the server down. A simple calculation: the average user generates 10 events per day, so with even a few thousand users (5,000 users at 10 events a day is 50,000 rows a day) the events table quickly grows to millions of records. And constructions like JOIN, first, load the server and, second, don't shard.
Writing a daemon is not a problem; the question was purely theoretical, and I will choose the tool only after understanding the right approach.

If I understood you correctly, you are against denormalizing the database and in favor of on-the-fly selection, something similar to map/reduce (clustering activities by a few primary attributes, parallel processing by separate modules, and final gluing of the results before they go to the browser).

That is pretty close to what I had worked out for myself, namely: store long-term data in the DBMS, and cache current data (e.g., the last week) in a NoSQL solution such as MongoDB, clustered, to start with, by activity type and date. When a new request for a feed of X events arrives, first request X events from each data node, sort them by time/privacy, and keep the result in the application's "hot" cache.

When an event is updated, both the "hot" cache and the DBMS/NoSQL storage change with it. In particularly hard cases the "hot" cache is pregenerated. - Johathan72 commented on October 8th 19 at 01:30
You cannot even imagine how wrong the phrase "pregenerated cache" sounds at millions of events =) - aniya.Kertzmann65 commented on October 8th 19 at 01:33
I see two options using the database:

1) Expensive SELECT, but cheap INSERT: make a table of the form (user_id, event_id, time, event_details); when an event occurs, do a single INSERT, and to read the timeline, select the ids of all the user's friends and do SELECT * FROM events WHERE user_id IN :friendIds ORDER BY time DESC LIMIT :limit. The insert is cheap, but the select cannot be fully optimized: we can use either the index on time or the one on user_id, but not both. That is, with a long event history, a large number of events, or a large number of users, this SELECT will slow down badly. Intuition tells me that under any serious load the filesort in this query will bring the database down. (A sketch contrasting both variants follows after option 2.)

2) Cheap SELECT, but expensive INSERT: keep a per-user feed table of friends' events, and when an event occurs, insert it into the feed of every friend. Here the select would be SELECT * FROM events WHERE feed_user_id = :userId ORDER BY time DESC LIMIT :limit. In this variant, if a user has 100 friends and uploads, for example, a photo, we have to do 100 INSERTs to put it into all of his friends' feeds. Generally a losing option, but if the storage and the "add the new event to 100 feeds" work are shifted onto the shoulders of a daemon, it gets some chance.
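
The two variants side by side over a plain DB-API connection; a sketch, with the schema details assumed and `conn` standing for a MySQL connection using %s placeholders:

# Variant 1: cheap write, expensive read (fan-out on read).
def insert_event_v1(conn, user_id, details):
    conn.cursor().execute(
        "INSERT INTO events (user_id, time, event_details) VALUES (%s, NOW(), %s)",
        (user_id, details))

def read_feed_v1(conn, friend_ids, limit):
    marks = ",".join(["%s"] * len(friend_ids))
    cur = conn.cursor()
    # can use the index on user_id OR the one on time, not both: filesort at scale
    cur.execute(f"SELECT * FROM events WHERE user_id IN ({marks}) "
                "ORDER BY time DESC LIMIT %s",
                (*friend_ids, limit))
    return cur.fetchall()

# Variant 2: expensive write, cheap read (fan-out on write).
def insert_event_v2(conn, author_id, friend_ids, details):
    conn.cursor().executemany(   # N inserts for N friends
        "INSERT INTO feeds (feed_user_id, author_id, time, event_details) "
        "VALUES (%s, %s, NOW(), %s)",
        [(fid, author_id, details) for fid in friend_ids])

def read_feed_v2(conn, user_id, limit):
    cur = conn.cursor()
    # fully served by one composite index on (feed_user_id, time)
    cur.execute("SELECT * FROM feeds WHERE feed_user_id = %s "
                "ORDER BY time DESC LIMIT %s", (user_id, limit))
    return cur.fetchall()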

You, as I understand it, are in favor of option 1 with part of the data cached in NoSQL? How will that cache be refreshed? Someone uploads a picture and you drop the cache keys of all his friends? Then the cache is constantly being flushed and is practically useless. And if instead you update the caches of all 100 friends, that is 100 queries to the NoSQL storage.

I do not see what NoSQL buys here, and both 1) and 2) are inefficient with a large number of users and friends (although I know there are daredevils building activity feeds in PHP and MySQL; I just don't know how those behave under load).

So the variant with the daemon is really variant 1), only the data is cached in the daemon's memory instead of the database; with gigabytes of memory, and old events deleted from the cache, we can cache the entire feed of every user of the project. - antonette.Gislas commented on October 8th 19 at 01:36
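
A toy version of that in-memory daemon cache (purely illustrative names): a bounded, newest-first deque per user, so old events fall off by themselves:

from collections import defaultdict, deque

FEED_CAP = 500  # keep only the newest N events per user in RAM

feeds = defaultdict(lambda: deque(maxlen=FEED_CAP))

def push_event(follower_ids, event):
    # fan-out on write, but into process memory instead of the database
    for uid in follower_ids:
        feeds[uid].appendleft(event)   # oldest entries drop off automatically

def read_page(user_id, page=0, per_page=20):
    feed = feeds[user_id]
    return list(feed)[page * per_page:(page + 1) * per_page]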
And in general, NoSQL is not a magic bullet. You either use an index (and get the same pros and cons as in SQL), or just use it as a dumb key-value store.

What matters, of course, is how many users you have and how active they are; from that you can estimate the volume and frequency of queries and draw conclusions. - antonette.Gislas commented on October 8th 19 at 01:39
And now I have reread the question: "each user has friends ("maximum" option: everyone is friends with everyone)" is bold, to say the least. - antonette.Gislas commented on October 8th 19 at 01:42
I thought about refreshing the hot cache; yes, other than refreshing it on a timer I don't see a way, otherwise the cache would be regenerated constantly. - Johathan72 commented on October 8th 19 at 01:45
> The insert is cheap, but the select cannot be fully optimized:
> we can use either the index on time or the one on user_id, but not both.
Two-column index? - destin_Miller commented on October 8th 19 at 01:48
October 8th 19 at 01:29
There is such a thing as a cache. Caches come in different kinds; it is not only query results stored in a file somewhere. A trivial example is karma on Habr. I am pretty sure they store in the database the full data of who gave whom how many pluses and minuses, but we see only 3 (+4 - 1). That is a kind of cache: the needed data in a separate place. Move the connections out separately, specifically for this feed, and duplicate data without fear; duplicates sometimes help performance. Let's start with an activity table:
id, user_id, event_desc, event_date
Write a row into it on every activity, and this impossibly simple table lets a single query fetch a user's activity in primitive form. From there, develop the idea however it is convenient =)
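
A runnable toy version of that table (sqlite3 standing in for MySQL; column names from the comment above):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE activity (
    id         INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL,
    event_desc TEXT NOT NULL,
    event_date TEXT NOT NULL)""")
conn.execute("CREATE INDEX idx_user_date ON activity (user_id, event_date)")

# one INSERT per activity...
conn.execute("INSERT INTO activity (user_id, event_desc, event_date) "
             "VALUES (?, ?, datetime('now'))", (42, "uploaded a photo"))

# ...and one query returns the user's activity, newest first
rows = conn.execute("SELECT * FROM activity WHERE user_id = ? "
                    "ORDER BY event_date DESC LIMIT 20", (42,)).fetchall()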
Karma is not a cache, it's aggregation :) - Johathan72 commented on October 8th 19 at 01:32
October 8th 19 at 01:31
To be honest, I did not understand the difficulty. It doesn't look like anything complex.
In pseudocode it is something like:
select news.*
from news, users_users as friends
where friends.left_user = :currentUser
and friends.right_user = news.user_id
union
select news.*
from news, groups_users
where groups_users.user_id = :currentUser
and groups_users.group_id = news.group_id

Or is there something in the requirements I did not take into account?
Here groups_users and users_users are many-to-many link tables. - Johathan72 commented on October 8th 19 at 01:34
It seems to me that with a large number of users (hundreds of thousands, millions), and hence of generated content and events, the performance of this solution is unaffordable for any DBMS known to me. - aniya.Kertzmann65 commented on October 8th 19 at 01:37
October 8th 19 at 01:33
Here is another example where an RDBMS is not the right fit, and something like MongoDB/CouchDB, or Neo4j if the data is highly connected, should be used instead.
First build an array of all the ObjectIds the user is associated with (one query), and then take all events whose author is in that array. Alternatively, using DBRef, find the events that link to the current user. The first option would be faster, with less code.
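
A rough shape of the first option with pymongo (collection and field names here are assumptions, not from the answer):

from pymongo import MongoClient, DESCENDING

db = MongoClient()["social"]

def read_feed(user_id, limit=20):
    # one query: gather every id the user is linked to
    user = db.users.find_one({"_id": user_id},
                             {"friend_ids": 1, "group_ids": 1})
    source_ids = user["friend_ids"] + user["group_ids"]

    # one more query: all events whose author is in that array
    return list(db.events
                .find({"author_id": {"$in": source_ids}})
                .sort("created_at", DESCENDING)
                .limit(limit))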
Neo4j is very interesting, thank you very much; I will look at it too. - Johathan72 commented on October 8th 19 at 01:36
October 8th 19 at 01:35
Facebook and Twitter denormalize, and there is no "cleanup" of the feed after leaving a group or unfriending: plainly, whatever got into the feed stays there. Additionally, the FB feed seems to be time-limited, which is reasonable. Twitter also had a problem with the huge feeds of users with 20k+ subscriptions; at one time such users simply got banned. In general, store the data the way you are going to fetch it; there should be few operations at fetch time. Normalization is not for large volumes. But even denormalized MySQL will not go big unless you split instances (say, a maximum of 100 users with all their feed data per database).
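
The instance-splitting mentioned at the end could be as simple as fixed hash routing; a sketch, with the shard count and connection strings invented for illustration:

# route each user's denormalized feed data to a fixed database instance
SHARD_DSNS = ["mysql://feed-db-%02d/feeds" % i for i in range(64)]  # hypothetical

def shard_for(user_id):
    return SHARD_DSNS[user_id % len(SHARD_DSNS)]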
October 8th 19 at 01:37
I recently got interested in this myself and ran into this article: myforce.ru/tyeoriya-i-praktika/delaem-lentu-obnovlenij-na-mongodb-php/
Well, yes, they also propose denormalization and MongoDB; the main thing is for the NoSQL store to "hold up" under such a large number of inserts. - Johathan72 commented on October 8th 19 at 01:40
October 8th 19 at 01:39
Look towards Redis sorted sets. Each set is one person's feed, with the event id as the member and the event time as the score; Redis can hold about 200,000 queries per second, and since the set is kept sorted you can page with the usual limit/offset or with queries of the form ID < x.
The event object itself is a serialized array (or whatever is more convenient to you), and the object is fetched by id from memcache, for example. Essentially, as comrade denver wrote above, use denormalization.

The described method runs successfully on fairly high-traffic sites and easily copes with the case where a person with 30,000 followers creates a post.
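
A minimal redis-py take on this scheme (key names are assumptions; the answer keeps the serialized objects in memcache, plain Redis strings stand in for it here):

import json
import time
import redis

r = redis.Redis()

def push_event(follower_ids, event_id, payload):
    r.set("event:%d" % event_id, json.dumps(payload))        # the serialized object
    pipe = r.pipeline()
    for uid in follower_ids:                                 # fan-out on write
        pipe.zadd("feed:%d" % uid, {event_id: time.time()})  # score = event time
    pipe.execute()

def read_page(user_id, offset=0, limit=20):
    # the sorted set stays ordered by score, so paging is a plain range query
    ids = r.zrevrange("feed:%d" % user_id, offset, offset + limit - 1)
    return [json.loads(r.get("event:%s" % i.decode())) for i in ids]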
Have you not run into MongoDB?
Why Redis? - Johathan72 commented on October 8th 19 at 01:42
I also built feeds in Redis for a "pseudo-social" network.
A feed was formed for each individual user. We stored indexes of the data source and the event type. Sorting by score or by id; in fact it was possible to "kick out" events (the score held time()) and output events in chronological order.
But for some reason our admins could not push it past 40-50K per second. PHP + Rediska.

At small volumes it runs smartly. One pressing problem remains: how to store the "history". - aniya.Kertzmann65 commented on October 8th 19 at 01:45
In what sense, "history"? - antonette.Gislas commented on October 8th 19 at 01:48

Find more questions by tags: Algorithms, Social networks, Feeds