How do I protect a website from scraping without blocking search engines?

As I see it, the only option is a hard IP filter, because every other method suggested on the Internet looks naive.
Only the Google and Yandex search engines would be whitelisted.
Download all the IP ranges (subnets) of the search engines and check every request to the site against them: if it comes from a search engine, don't limit it; if there are 3 requests in 10 seconds from one IP, show a captcha. Something like that? Which raises the question: where do I get the IP ranges of Google and Yandex?
Number of pages: ~5 million.

UPD: instead of downloading all the IPs, you can do a reverse DNS lookup and check that the host is on the allowed list, then store those IPs in a database to minimize further DNS queries.
https://yandex.ru/support/webmaster/robot-workings...
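For reference, a minimal sketch of that reverse-DNS check (reverse lookup, host-suffix check, then a forward lookup to confirm the host resolves back to the same IP, as the webmaster docs describe), with the result cached per IP. The suffix list and the in-memory cache are illustrative assumptions, not an official list.

```python
# Sketch of the reverse-DNS check from the UPD: reverse-lookup the IP, check
# the host against an allowed suffix list, then forward-resolve the host to
# confirm it really maps back to the same IP (otherwise anyone can put an
# arbitrary PTR record on their own address range). The suffix list and the
# in-memory cache are illustrative assumptions.
import socket

ALLOWED_SUFFIXES = (
    ".yandex.ru", ".yandex.net", ".yandex.com",   # Yandex crawlers
    ".googlebot.com", ".google.com",              # Google crawlers
)

_verified = {}  # ip -> bool; in production, persist this in a database or Redis


def is_search_engine_ip(ip: str) -> bool:
    """True if the IP verifies as a whitelisted search-engine crawler."""
    if ip in _verified:
        return _verified[ip]
    ok = False
    try:
        host, _, _ = socket.gethostbyaddr(ip)                # reverse (PTR) lookup
        if host.endswith(ALLOWED_SUFFIXES):
            _, _, addresses = socket.gethostbyname_ex(host)  # forward lookup
            ok = ip in addresses
    except (socket.herror, socket.gaierror, OSError):
        ok = False
    _verified[ip] = ok
    return ok
```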
April 4th 20 at 00:50
5 answers
April 4th 20 at 00:52
I remember how Roskomnadzor tried to block Telegram by IP; in the end they blocked everything except Telegram.
You won't find the IP ranges used by Yandex and Google anywhere, especially since they change.
And yes, you cannot protect a website from scraping; it is impossible in principle.
What do you mean?

UPD: instead of downloading all the IPs, you can do a reverse DNS lookup and check that the host is on the allowed list, then store those IPs in a database to minimize further DNS queries. - Stefan_Lynch commented on April 4th 20 at 00:55
@Stefan_Lynch, and what does that give you? You do realize that Google and Yandex rent out servers of their own, and those servers can host scrapers, i.e. they sit in the same domain zone as the Google and Yandex crawlers. Stop inventing nonsense; none of this will protect your precious little blog from scraping. - Mohammad66 commented on April 4th 20 at 00:58
@Mohammad66, God, you're flooding without checking the facts, answering the question only to convince yourself and everyone else of your own contrived misconception. 'Precious little blog', sure)
Don't be a toxic resident of the platform; my question isn't toxic, don't scare people off)
https://yandex.ru/support/webmaster/robot-workings... - Stefan_Lynch commented on April 4th 20 at 01:01
@Stefan_Lynch, calm down already, you're the one flooding. How do you even define a 'robot'? How are you even going to know the site is being scraped? Limiting the number of requests only hurts users who open several pages at once; scrapers usually work through proxies, so there won't be one IP making many requests per minute. Ban pages served without JS? Then there's Selenium and the like. You cannot protect a site from a scraper. Not at all, full stop. Most of the time you only make life harder for yourself and your users, not for the people doing the scraping. - Mohammad66 commented on April 4th 20 at 01:04
@Stefan_Lynch, and yes, you should learn to read:
"The robots use multiple IP addresses that change frequently, so their list is not disclosed."
And I repeat: I can host my scraper on a Yandex server and it will have the same host as the Yandex bot. - Mohammad66 commented on April 4th 20 at 01:07
@Stefan_Lynch, I don't quite understand. Picking out the Yandex robots is one thing, but how do you distinguish everyone else from ordinary users? - Maria10 commented on April 4th 20 at 01:10
This reminds me of the joke about Elusive Joe, whom nobody can catch because nobody actually wants to. - Adell_Rau commented on April 4th 20 at 01:13
@Maria10, you can't. He'll ban everyone and sit alone with his blog, safe and sound. - Mohammad66 commented on April 4th 20 at 01:16
@Mohammad66, which Yandex server rental are you talking about? You mean cloud.yandex.ru? Or are you fantasizing again?) - Stefan_Lynch commented on April 4th 20 at 01:19
@Maria10, users can be limited the way I wrote above: 3 requests per 10 seconds. Opened 10 tabs? Well, sit and wait for them to load. - Stefan_Lynch commented on April 4th 20 at 01:22
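A minimal sketch of that 3-requests-per-10-seconds rule, assuming a single-process in-memory counter; the threshold numbers come from the thread, everything else is illustrative, and a real deployment would keep the counters in Redis or use the web server's own rate-limit module.

```python
# Sketch of the "3 requests per 10 seconds per IP" rule from the thread, as a
# sliding-window counter kept in process memory. Requests from IPs that pass
# the search-engine check above would simply skip this.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10
MAX_REQUESTS = 3

_hits = defaultdict(deque)  # ip -> timestamps of recent requests


def should_show_captcha(ip: str) -> bool:
    """True if this IP exceeded the limit and should be served a captcha."""
    now = time.monotonic()
    window = _hits[ip]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()            # discard hits older than the window
    window.append(now)
    return len(window) > MAX_REQUESTS
```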
@Mohammad66, don't flood. I asked a question; it wasn't about the value of some resource and its information. Read the question again. - Stefan_Lynch commented on April 4th 20 at 01:25
@Stefan_Lynch,
> which Yandex server rental are you talking about? You mean cloud.yandex.ru? Or are you fantasizing again?)

Yes, that one. And nobody is suggesting actually doing it - that would violate the service's rules - so keep the 'fantasizing' to yourself.
> users can be limited the way I wrote above: 3 requests per 10 seconds. Opened 10 tabs? Well, sit and wait for them to load.

In the end the user will close the website and forget about it, while the scraper, working through proxies, will keep right on working.
> don't flood. I asked a question; it wasn't about the value of some resource and its information. Read the question again.

I never wrote anything about value. I'm just trying to get across that it is physically impossible to protect a site from scraping. Impossible, full stop. If you think otherwise, give an example and I will refute it. - Mohammad66 commented on April 4th 20 at 01:28
@Stefan_Lynch, that might work. I don't know what IP pools scrapers have, but I've watched SSH brute-force attempts in my logs, and there is a clear pattern there. That is, those IPs need to be not just recorded but banned as well. - Maria10 commented on April 4th 20 at 01:31
@Mohammad66, a reverse DNS lookup on a cloud.yandex.ru server will not return yandex.ru/net/com (as in the documentation), but cloud.yandex.ru.
Too bad we can't restrict particular users' access to our questions. Objectively, you only muddy the water and get in the way of answers to my question. - Stefan_Lynch commented on April 4th 20 at 01:34
@Stefan_Lynch, it's a pity you don't want to read what I write. Let's start from the beginning: how do you distinguish a user from a scraper? - Mohammad66 commented on April 4th 20 at 01:37
@Mohammad66, I don't. I just set rules for how fast the site may be visited and its information copied. Right now that's 3 requests per 10 seconds. Anything faster is a violator, whether it's a live user or a scraper. - Stefan_Lynch commented on April 4th 20 at 01:40
@Stefan_Lynch, let's take it a bit further: the scraper works through proxies, i.e. every request comes from a different IP, and the proxy pool is huge. Right now I have a crawler running in Microsoft Azure that pulls in about 15 million pages of information in a single pass from airbnb, a site with a well-built per-minute request limit, and that limit doesn't save them. And their limit is looser than 3 requests per 10 seconds; 3 requests per 10 seconds is a problem primarily for users, not for scrapers.
So I repeat my question: how are you going to distinguish a user from a scraper? - Mohammad66 commented on April 4th 20 at 01:43
@Mohammad66, how many proxies is that? Thousands, tens of thousands, hundreds of thousands? That costs money, and a lot of it. Whoever really wants all the information will get it simply by buying hundreds of thousands of proxies, but this scheme will weed out all the other villains. - Stefan_Lynch commented on April 4th 20 at 01:46
@Stefan_Lynch, those 15 million pages cost me around $40. Is that expensive? No. Moreover, you can always find around 15K publicly available proxies completely free. That is more than enough. - Mohammad66 commented on April 4th 20 at 01:49
@Stefan_Lynch, why hundreds of thousands? Do the math: say I have 1000 proxies, and your site has a limit of 3 requests per 10 seconds = 18 requests per minute per IP. Say I scrape the site in 100 threads with circular proxy rotation, i.e. the first 100 proxies, then the next 100, and so on. As a result each proxy makes only about 1 request per 10 seconds, and that's the best case, with good low-ping proxies and very light pages. Under more realistic conditions (free proxies, huge ping, heavy pages) each proxy makes a request roughly every 40 seconds, which gives a scraping speed in the region of 1500 pages per minute = 90,000 per hour = 2,160,000 per day. And that's with only 1000 proxies. What if you use a rotating-proxy service with a pool of 20,000 proxies? - Mohammad66 commented on April 4th 20 at 01:52
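The arithmetic in that comment checks out; a tiny sketch reproducing it under the same assumptions (1000 proxies, the pessimistic one-request-per-40-seconds case):

```python
# Reproducing the back-of-the-envelope numbers from the comment above:
# 1000 proxies, the pessimistic case of one request per proxy every 40 seconds.
proxies = 1000
seconds_per_request = 40            # free proxies, huge ping, heavy pages

pages_per_minute = proxies * 60 / seconds_per_request    # 1500
pages_per_hour = pages_per_minute * 60                   # 90,000
pages_per_day = pages_per_hour * 24                      # 2,160,000

print(pages_per_day, 5_000_000 / pages_per_day)          # ~2.3 days for 5M pages
```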
@Mohammad66, I've dealt with public free proxies: a machine scraped them from several sources and ran them through a checker - if a single HTTPS request to a well-known resource succeeded within a 10-second timeout, the proxy was considered alive. I remember fiddling with this until the database held a couple of million of them, of which only a couple of hundred were active, and even those had poor speed or kept dropping out. Then I noticed something interesting: public proxies serve one fast, successful request and then start refusing - an artificial limit of one request per client. Public proxies are dirt and trash; I don't touch them anymore. Where do you buy your proxies? - Stefan_Lynch commented on April 4th 20 at 01:55
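A sketch of the proxy checker described in that comment, assuming the third-party requests library; the probe URL and the example proxy address are placeholders, not anything from the thread.

```python
# Sketch of the proxy checker described above: a proxy counts as alive if one
# HTTPS request to a well-known resource succeeds within a 10-second timeout.
# Assumes the third-party `requests` library; the probe URL and the example
# proxy address are placeholders.
import requests

PROBE_URL = "https://example.com/"
TIMEOUT = 10  # seconds, as in the comment


def is_proxy_alive(proxy: str) -> bool:
    """`proxy` is something like 'http://1.2.3.4:8080'."""
    try:
        resp = requests.get(PROBE_URL,
                            proxies={"http": proxy, "https": proxy},
                            timeout=TIMEOUT)
        return resp.ok
    except requests.RequestException:
        return False


alive = [p for p in ("http://1.2.3.4:8080",) if is_proxy_alive(p)]
```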
@Stefan_Lynch, I don't buy them; I use services like Luminati. Sometimes I also use public proxies: there are several sources where you can easily collect 1-2K proxies with decent ping. - Mohammad66 commented on April 4th 20 at 01:58
April 4th 20 at 00:54
> Something like that?

Don't dig a pit for others - you'll fall into it yourself!

Public content is for the public!
If you want to hide it, show it only to authorized users.

Or hold back publication until the search engines have fully indexed the new content (a rough sketch follows the list):
1. Post the new article. (Don't give the link to anyone! Hide it from your blog's own search and from topic/tag pages as well.)
2. Add the link to the Sitemap. (And don't make the Sitemap file name a trivial one!)
3. Set up a trigger that checks whether the material has shown up in the search engines' results.
4. Once the article appears in the search results everywhere (i.e. it has been indexed), open it to the public on the website.
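A rough sketch of how steps 1-4 might be wired together. `Article`, `is_indexed_by` and the polling interval are hypothetical placeholders; the thread does not say how the "trigger" in step 3 would actually query the search engines.

```python
# Rough sketch of the delayed-publication scheme above. `Article` and
# `is_indexed_by` are hypothetical placeholders: how the step-3 "trigger"
# actually checks the search engines is not specified in the thread.
import time
from dataclasses import dataclass

SEARCH_ENGINES = ("yandex", "google")


@dataclass
class Article:
    url: str
    public: bool = False   # step 1: posted, but hidden from site search and tag pages


def is_indexed_by(engine: str, url: str) -> bool:
    """Hypothetical check that `url` already appears in `engine`'s results."""
    raise NotImplementedError


def wait_and_publish(article: Article, poll_seconds: int = 3600) -> None:
    # steps 3-4: poll until every engine has indexed the page, then make it public
    while not all(is_indexed_by(e, article.url) for e in SEARCH_ENGINES):
        time.sleep(poll_seconds)
    article.public = True
```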
Public content, fair enough, but I don't want people copy-pasting my content and publishing it under their own name. And there are a lot of pages, ~5 million. - Stefan_Lynch commented on April 4th 20 at 00:57
@Stefan_Lynch, then take a different approach entirely!
1. Post the new article. (Don't give the link to anyone! Hide it from your blog's own search and from topic/tag pages as well.)
2. Add the link to the Sitemap. (And don't make the Sitemap file name a trivial one!)
3. Set up a trigger that checks whether the material has shown up in the search engines' results.
4. Once the article appears in the search results everywhere (i.e. it has been indexed), open it to the public on the website. - Vidal.Orn commented on April 4th 20 at 01:00
@Stefan_Lynch, how do you imagine forbidding scraping while the content is publicly viewable in a browser? - Adell_Rau commented on April 4th 20 at 01:03
@Adell_Rau, there are 5 million pages; users are limited to 3 requests per 10 seconds, but search engines are not limited. That is the statement of the problem. - Stefan_Lynch commented on April 4th 20 at 01:06
@Stefan_Lynch,
> there are 5 million pages; users are limited to 3 requests per 10 seconds, but search engines are not limited. That is the statement of the problem.

That's nonsense, not a statement of a problem.
Nothing prevents such a site from being scraped anyway. Nothing at all. - Vidal.Orn commented on April 4th 20 at 01:09
@Vidal.Orn, that's exactly why I'm asking the community) What's nonsense about it? At that speed, scraping 5 million pages takes about half a year (3 requests per 10 seconds is roughly 26 thousand pages a day, i.e. around 190 days), and that is precisely the point. - Stefan_Lynch commented on April 4th 20 at 01:12
@Stefan_Lynch,
> What's nonsense about it?

You spin up a cloud of parallel sessions with different generated 'users', and these restrictions of yours mean absolutely nothing to me. - Vidal.Orn commented on April 4th 20 at 01:15
@Vidal.Orn, we're talking about filtering by IP; there is no login on the site. - Stefan_Lynch commented on April 4th 20 at 01:18
@Stefan_Lynch, and I'm talking about public sessions too - a cloud of browsers and IPs. - Vidal.Orn commented on April 4th 20 at 01:21
@Vidal.Orn, yeah, this really comes down to how effective such protection is. In the end it weeds out all the casual villains, and only the most motivated remain - the ones with both the money and the will. Eventually it comes down to tens or hundreds of thousands of proxies plus browser emulators and so on, and at that point it's clear, of course, that nothing will help. But that can still be considered a job done. - Stefan_Lynch commented on April 4th 20 at 01:24
@Stefan_Lynch, a holding period between indexing and public release is the real way to protect yourself from having your unique content stolen. - Vidal.Orn commented on April 4th 20 at 01:27
April 4th 20 at 00:56
> if there are 3 requests in 10 seconds from one IP, show a captcha

Your page has 10 images plus a pile of JS and CSS, so a single user won't even finish loading one page before flying into the ban? And after refreshing the page, a captcha? Well, that's it, the site's effectively closed...
April 4th 20 at 00:58
> How do I protect a website from scraping without blocking search engines?

Block access for every IP except the bots'. Otherwise, you can't.
thank you! - Stefan_Lynch commented on April 4th 20 at 01:01
April 4th 20 at 01:00
Forget it, everyone steals from everyone. At most you can obfuscate the markup so a scraper has a slightly harder time stealing from you) Even VK hasn't been scraped in full, but only because of its size: one computer wouldn't withstand that load)

PS: and even if it did, it would take a long time. A very, very long time)

Tags: Parsing, Web Development