Precautions while parsing in Python?

I've recently been learning Python and want to do my first project: parsing data from a closed (login-protected) area of a site.

I went through one lesson (a gist, with a link to a YouTube video as well), where everything is explained quite clearly. But the author doesn't use any modules for authentication, doesn't send headers, doesn't use proxies, etc., so I have the following questions:
- If I need to fetch a few thousand pages, what precautions should I take to avoid getting banned?
- Presumably, if I put a pause between requests I won't get banned? (And how do I read the situation, i.e. understand that this site can be parsed easily, while that one shows a complex captcha after the first 3 requests?)
- Is it OK to parse from my desktop machine (as the author did)?
- Can you recommend a simple HTTP client?
- Is it enough to send headers similar to the ones my browser sends?

The data to parse is simple overall: titles, travel contacts, no JS, with pagination.
July 2nd 19 at 17:02
5 answers
July 2nd 19 at 17:04
Solution
If you're parsing seriously, I recommend taking a look at Scrapy, a slick Python framework for parsing sites.
The task in the title can be solved without any ugly code.
Bottom line: 1 page of clean code, 57 seconds in 16 threads, 345 pages of weblancer downloaded, yielding 3420 projects.
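For illustration, a minimal Scrapy spider along those lines might look like this; the start URL, CSS selectors and the concurrency setting below are placeholders and assumptions, not weblancer's actual markup or the author's original code.

import scrapy


class ProjectsSpider(scrapy.Spider):
    name = "projects"
    start_urls = ["https://example.com/projects/"]          # placeholder listing URL
    custom_settings = {"CONCURRENT_REQUESTS": 16}           # roughly the "16 threads" above

    def parse(self, response):
        # One item per project block on the listing page.
        for project in response.css("div.project"):
            yield {
                "title": project.css("a.title::text").get(),
                "link": response.urljoin(project.css("a.title::attr(href)").get()),
            }
        # Follow pagination until there is no "next" link.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Run it with "scrapy runspider projects_spider.py -o projects.json"; Scrapy takes care of the concurrency, retries and throttling for you.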
i.e. Scrapy parses in multiple threads by default? - Candida93 commented on July 2nd 19 at 17:07
It doesn't seem to support Python 3 - Presley64 commented on July 2nd 19 at 17:10
It does now. "Scrapy runs on Python 2.7 and Python 3.3 or above (except on Windows where Python 3 is not supported yet)" -- from the documentation. - Lucious_Jaskolski commented on July 2nd 19 at 17:13
July 2nd 19 at 17:06
On the contrary, sometimes it's easier to set the parser to 10 threads and pull everything down in 30 minutes, before the admins catch on, than to drag it out over many hours ))
July 2nd 19 at 17:08
A good check is to run wget and see whether it pulls down the whole site - it's single-threaded, so if it manages that, the protection is nothing special.
Another trick is to pretend to be Googlebot; believe me, very few sites actually verify the bot, especially if you parse from US IPs.
For VK and other sites popular with spammers there will always be protection; where exactly the line runs, you'll have to see for yourself.
For headers, see https://pypi.python.org/pypi/fake-useragent/0.1.2
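As a rough sketch of sending browser-like headers with the fake-useragent package mentioned above (the target URL is a placeholder):

import requests
from fake_useragent import UserAgent

ua = UserAgent()
headers = {
    "User-Agent": ua.random,               # a random real-browser UA string
    "Accept-Language": "en-US,en;q=0.9",
}

# Placeholder URL; swap in the page you actually need.
resp = requests.get("https://example.com/page/1", headers=headers, timeout=10)
print(resp.status_code, len(resp.text))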
Can you say more about pretending to be Googlebot? - Candida93 commented on July 2nd 19 at 17:11
https://support.google.com/webmasters/answer/10619... - Presley64 commented on July 2nd 19 at 17:14
Well, I wouldn't say wget is a good way - it's of little use for parsing.
At best you can use it to pull down a student's homepage - it just blindly follows the links that are already there.
Most sites need to be parsed at least somewhat interactively: send a request, parse the response, and based on the analysis of that response form the next request.
And these days JS is everywhere - if you don't execute the script, you won't even see the links on the first page and won't get anywhere. - Lucious_Jaskolski commented on July 2nd 19 at 17:17
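A minimal request-parse-follow loop of the kind described in that comment could look like this (requests + BeautifulSoup; the URL and selectors are placeholders):

import requests
from bs4 import BeautifulSoup

session = requests.Session()
url = "https://example.com/catalog/"            # placeholder listing URL

while url:
    resp = session.get(url, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")

    # Parse the response...
    for row in soup.select("div.item"):          # placeholder selector
        print(row.get_text(strip=True))

    # ...and form the next request based on what was found.
    next_link = soup.find("a", rel="next")
    url = requests.compat.urljoin(url, next_link["href"]) if next_link else None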
I found a couple of popups that load several rows via AJAX (this data isn't in the page source). What library can you recommend for a parser in Python? - Misael.Krajcik commented on July 2nd 19 at 17:20
: If the required data can be fetched without executing scripts, then grab.
Otherwise the only option is to drive a browser, for example with Selenium. But that's slow and resource-intensive. - Juanita17 commented on July 2nd 19 at 17:23
: Well, not that slow - earl.Weissnat commented on July 2nd 19 at 17:26
And if you don't need images, not that resource-intensive either ) - Candida93 commented on July 2nd 19 at 17:29
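For the Selenium route mentioned in the comments above, a bare-bones sketch could look like this (it assumes chromedriver is installed; the URL and selector are placeholders):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")      # no visible browser window
driver = webdriver.Chrome(options=options)

try:
    driver.get("https://example.com/page-with-popup")   # placeholder URL
    driver.implicitly_wait(10)          # give the AJAX request time to finish
    for row in driver.find_elements(By.CSS_SELECTOR, "div.popup-row"):
        print(row.text)
finally:
    driver.quit()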
July 2nd 19 at 17:10
1) Watch your request intervals, use different IPs and accounts if possible (see the sketch after this list).
2) Probably yes. However, nobody can give you a definite answer - it's all very site-specific. Reconnaissance is always done by trial and error.
3) Yes, go ahead. Nobody goes to jail for parsing; in the worst case you get banned. Your call.
4) In what sense?
5) How would we know?
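A sketch of point 1, combined with logging into a closed area as in the question - the login endpoint and form field names below are hypothetical placeholders:

import random
import time
import requests

session = requests.Session()
# Placeholder login URL and form field names - adjust to the real site.
session.post("https://example.com/login",
             data={"username": "user", "password": "pass"})

for page in range(1, 51):
    resp = session.get(f"https://example.com/data?page={page}", timeout=10)
    # ...process resp.text here...
    time.sleep(random.uniform(2, 5))   # keep the interval irregular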
July 2nd 19 at 17:12
For trying things out, those libraries will do, but if you really want to parse large sites, you need to use Scrapy.
- If I need to fetch a few thousand pages, what precautions should I take to avoid getting banned?
If there is no authorization involved, you can use: user-agent rotation, proxy rotation, and random delays (see the sketch at the end of this answer).
- Presumably, if I put a pause between requests I won't get banned? (And how do I read the situation, i.e. understand that this site can be parsed easily, while that one shows a complex captcha after the first 3 requests?)
Just write the parser without any pauses first; if everything gets parsed, there is no protection. From my experience, very few sites defend against a large number of requests - mostly only big projects.
- Is it OK to parse from my desktop machine (as the author did)?
Of course.
- Is it enough to send headers similar to the ones my browser sends?
You have to look at the protection in each case; mostly they just check for a missing user-agent.
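A sketch combining those three measures - the user-agent strings, proxy addresses and URL below are placeholders, not working values:

import random
import time
import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]
PROXIES = ["http://10.0.0.1:3128", "http://10.0.0.2:3128"]   # placeholder proxies

for page in range(1, 1001):
    proxy = random.choice(PROXIES)
    resp = requests.get(
        f"https://example.com/list?page={page}",             # placeholder URL
        headers={"User-Agent": random.choice(USER_AGENTS)},
        proxies={"http": proxy, "https": proxy},
        timeout=10,
    )
    # ...extract titles and contacts from resp.text...
    time.sleep(random.uniform(1, 4))   # random delay between requests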

Find more questions by tags: Python, Parsing, Beautiful Soup