A database to store big data?

I'm planning a website, and here's the situation:

A table of about 40-60 million rows will be built up gradually (roughly 10 rows per second), and data will then need to be continuously updated or added to it.

The database will be under roughly the following load:
- lookup: check whether a record with the specified name already exists, and if it does, update the information stored there. I.e., before adding an entry, check the database and then either insert or update the data. This operation runs constantly and in bulk, so ideally it should not take longer than 0.1 seconds.

- search the database by specified parameters (for example, where some parameter is greater than a given value, and similar conditions). This function is used often (about 100 times a day now, and it will grow many times over as the number of site users increases). Speed here is not as critical, but it still needs to be reasonably fast (say, 5-20 seconds).

The data itself is simple: a string id, title, link to a website, date fields, and numeric and text fields (up to 10 of them).
The database structure is very simple: everything can live in a single database in a single table. No links to other tables are expected - except perhaps a separate table holding the categories and their numeric ids, with only the category id written in the main table.
Data is not updated all at once but gradually and continuously (a few additions/changes per second or less).
Rows in the table remain relevant for only about 6 months; anything older is no longer needed and can be deleted.

The main implementation language will presumably be PHP.
One part serves the user and shows information filtered by parameters; the other part runs continuously and gradually adds/updates information in the database.

Which database should I choose? Is MySQL up to these tasks?
June 14th 19 at 20:39
5 answers
June 14th 19 at 20:41
Which database should I choose? Is MySQL up to these tasks?
If you have real experience with it, solid DB design skills, and a full understanding of why it should be done "exactly this way" and "why not otherwise", I think it is fine. In general, a database is usually sized not by the number of rows in one table, but by the total volume of data (giga-/petabytes) and a few other parameters.

- lookup: check whether a record with the specified name exists, and if so update the information there. I.e., before adding an entry (and, as a reminder, there will initially be 5-40 million of them, and the number will keep growing) it is checked against the database and the data is inserted/updated.
For this, every database I know of has indexes. Most likely a standard B-tree index, which works about the same in all of them.
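For illustration, a minimal MySQL sketch of that approach; the table and column names below are invented for the example, not taken from the question:

    -- hypothetical table; names and column set are assumptions for the sketch
    CREATE TABLE items (
        id        BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        name      VARCHAR(255)    NOT NULL,
        url       VARCHAR(1024),
        added_at  DATETIME        NOT NULL,
        price     INT,
        UNIQUE KEY uq_items_name (name)   -- B-tree index that serves the existence check
    ) ENGINE=InnoDB;

    -- "check, then insert or update" collapsed into a single upsert statement
    INSERT INTO items (name, url, added_at, price)
    VALUES ('some name', 'http://example.com', NOW(), 100)
    ON DUPLICATE KEY UPDATE
        url = VALUES(url), added_at = VALUES(added_at), price = VALUES(price);

With the unique index in place, both the lookup and the upsert are single index operations, which should fit the 0.1-second budget on reasonable hardware.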

The database will be under roughly the following load:
The load falls on your hardware rather than on the database as such; whether the hardware holds up is a separate question, but from the database's point of view I see no logical problem in storing 40 million records.

How should the structure for storing the information be organized?
"Big information" or large amounts of data? 40 million. records is absolutely not necessarily a large amount. For example, the index on numeric (INT) field to 40 million. records will occupy only a few megabytes. Storage is "great information" - can take, such as PostgreSQL, there is a ready mechanism, TOAST, designed specifically for this purpose, or to design a MySQL database so that the data would need to lay separate from any "information collection" ("tails"), this will reduce the size of the individual tables on disk and as a consequence - to increase the working speed.
added in italics - Valentin_Swift commented on June 14th 19 at 20:44
Yes, I think MySQL is quite suitable for this; for more speed you can try MyISAM tables (or their equivalents), as well as a single-byte encoding such as CP-1251 for data storage. The other details have already been covered above, so I see no point in rewriting other people's answers :) - Salvado commented on June 14th 19 at 20:47
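For reference, the commenter's suggestion would look roughly like this; a sketch only, with a hypothetical table, not a recommendation:

    -- MyISAM engine plus a single-byte charset, as suggested in the comment above
    CREATE TABLE items_fast (
        id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        name  VARCHAR(255) NOT NULL,
        KEY idx_name (name)
    ) ENGINE=MyISAM DEFAULT CHARSET=cp1251;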
June 14th 19 at 20:43
Every day, or a little less often, the data for 5-40 million records will need to be updated,
does that have to happen in a single pass?
- lookup: check whether a record with the specified name exists, and if so update the information there.
What kind of name is that? Text, an int hash? How long? In the general case an index lookup is VERY fast; it depends more on the hardware than on the database.
- search the database by specified parameters (for example, where some parameter is greater than a given value, and similar conditions)
Indexes solve this; if it is a simple selection from a flat table, it will be fast - except for searches like `field` LIKE '%some text%', where the leading wildcard prevents an index from being used.
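As a rough illustration, assuming a hypothetical items table with name and price columns like the one sketched in the first answer:

    -- range condition on an indexed numeric column: served by the index
    CREATE INDEX idx_items_price ON items (price);
    SELECT id, name FROM items WHERE price > 1000;

    -- leading-wildcard LIKE: the index on name cannot be used, so this is a full scan
    SELECT id, name FROM items WHERE name LIKE '%some text%';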
How should the structure for storing the information be organized?
Read up on database "normal forms".
Which database should I choose? Is MySQL up to these tasks?
MySQL or PostgreSQL - you have to look at the whole hardware/software combination, because your task is either very unusual, or you are designing something wrong, or you have some "super secret goal"; accordingly, any answer can only be very approximate.

UPD:
The data itself is simple: a string id,
I hope that is just a typo and you mean an id of integer type?
Or perhaps keep the categories and their numeric ids in a separate table, with only the category id written in the main one.
Read about normalization - no, really, this is IMPORTANT.
Numeric data ALWAYS works faster than mixed (alphanumeric) data of the same length (in "characters"), I hope that is obvious. Accordingly, a query with WHERE categoryid = 55 will run faster than one with WHERE category = 'somecategoryname'. Otherwise I don't see any problems.
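A minimal sketch of that normalization, with hypothetical names, building on the items table from the first answer:

    -- categories live in their own table; the main table stores only the integer id
    CREATE TABLE categories (
        id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        name  VARCHAR(100) NOT NULL UNIQUE
    ) ENGINE=InnoDB;

    ALTER TABLE items
        ADD COLUMN category_id INT UNSIGNED,
        ADD KEY idx_items_category (category_id);

    -- integer comparison on an indexed column instead of matching a string name
    SELECT id, name FROM items WHERE category_id = 55;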
added in italics - Valentin_Swift commented on June 14th 19 at 20:46
June 14th 19 at 20:45
Every day, or a little less often, the data for 5-40 million records will need to be updated, and this number will keep growing: roughly 5-10% more records will be added to the table every month
So this is a scraping service (perhaps of a search engine, or some kind of social-network analytics).

I would first think about how you are going to fetch the data before updating 5-40 million records per day on a single server: that is ~500 requests per second just to retrieve the data!

Are you sure you want to cram all of that onto a single server?!

MySQL will be enough. The main thing is to take care of partitioning the table as the number of rows in a single table approaches the limits.
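A sketch of what such partitioning might look like in MySQL; the table and column names are invented. Monthly RANGE partitions also turn the "older than ~6 months can be deleted" requirement into a cheap DROP PARTITION instead of a mass DELETE:

    -- hypothetical partitioned variant of the main table, split by month
    CREATE TABLE items_part (
        id        BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        name      VARCHAR(255)    NOT NULL,
        added_at  DATETIME        NOT NULL,
        PRIMARY KEY (id, added_at),   -- the partitioning column must be part of every unique key
        KEY idx_name (name)
    ) ENGINE=InnoDB
    PARTITION BY RANGE (TO_DAYS(added_at)) (
        PARTITION p2019_01 VALUES LESS THAN (TO_DAYS('2019-02-01')),
        PARTITION p2019_02 VALUES LESS THAN (TO_DAYS('2019-03-01')),
        PARTITION p_future VALUES LESS THAN MAXVALUE
    );

    -- dropping expired data is then a metadata operation, not a row-by-row DELETE
    ALTER TABLE items_part DROP PARTITION p2019_01;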

Limits on the count of rows.
The MyISAM storage engine supports 2^32 rows per table, but you can build MySQL with the --with-big-tables option to make it support up to 2^64 rows per table.
no, it will be updated gradually but constantly (several times per second) - Valentin_Swift commented on June 14th 19 at 20:48
well, that is what I assumed by "constantly": 40 million over the course of a day) - Salvado commented on June 14th 19 at 20:51
about 800 thousand rows per day will need to be updated. That is roughly 10 operations per second - Valentin_Swift commented on June 14th 19 at 20:54
i.e. the requirements have already changed, right?)))
Then it is about 9.25 per second... Yes...
MySQL will be more than enough...
And don't forget to budget CPU for the data processing itself. - Valentin_Swift commented on June 14th 19 at 20:57
June 14th 19 at 20:47
I use Mongo to store statistics on search-engine queries,
name -> hashed index.
in a database of more than 2 billion records, lookup and update take less than 1 millisecond.
That is more than 10 times faster than your requirement, while storing 50 times more data.
June 14th 19 at 20:49
Everything above has been laid out correctly, except for one thing. There is generally no reason to use MySQL. IMHO, take PostgreSQL and hire a good DBA.
Why do you not recommend using MySQL? - Valentin_Swift commented on June 14th 19 at 20:52
There is no reason at all, except perhaps that it is hard to find an alternative to it on Russian (RF) hosting.
In short, it is "not a database but a dick with nails". A very dumb optimizer that you can back into a corner without any trouble, no FTS, no recursive queries, and a whole cartload of other nice features missing. - Salvado commented on June 14th 19 at 20:55
