What is the best and fastest way to parse Amazon in Python?

I am writing a parser for Amazon products. It only handles static HTML pages, i.e. it is not supposed to parse anything loaded via AJAX or rendered dynamically (for example with Selenium). On each page I am interested in a few text fields (price, shipping, etc.). Since there are many products and Amazon has a lot of anti-scraping protection, my question is about choosing the right libraries to build a robust parser that can work through proxies and do so quickly.

I have partially written the code using BeautifulSoup (lxml) + requests (with a proxy list) + random User-Agent, but my feeling is that this will not be very fast. Should I look at other libraries? Please share if you have had similar experience. Should I use Scrapy or something else?
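Roughly, what I have now looks like this (a simplified sketch; the proxy addresses, User-Agent strings and CSS selectors below are placeholders, not the exact code I run):

```python
import random

import requests
from bs4 import BeautifulSoup

# Placeholder pools -- in the real script these are loaded from files.
PROXIES = ["http://user:pass@proxy1:8080", "http://user:pass@proxy2:8080"]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]

def fetch_product(url):
    """Fetch one product page through a random proxy with a random UA."""
    proxy = random.choice(PROXIES)
    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }
    resp = requests.get(url, headers=headers,
                        proxies={"http": proxy, "https": proxy}, timeout=15)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "lxml")
    # Selectors are illustrative; Amazon's markup varies by page type.
    title = soup.select_one("#productTitle")
    price = soup.select_one(".a-price .a-offscreen")
    return {
        "title": title.get_text(strip=True) if title else None,
        "price": price.get_text(strip=True) if price else None,
    }
```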

Or, if I stay with this stack, which language features or techniques would you recommend paying attention to for the speed of the parser?
June 14th 19 at 20:10
2 answers
June 14th 19 at 20:12
Solution
I parse Amazon at industrial scale (hundreds of thousands of pages per day). The biggest problem is not the libraries, but the fact that Amazon is very skilled at detecting scraping attempts and constantly improves its detection techniques. So the most effective approach is a decent set of quality proxies (proxies that differ only in the last octet and the port number will not work for long: you will get blacklisted for anywhere from a few hours to days, depending on how intensively you use them to send requests).
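To illustrate the idea (not my production code), here is one way to rotate a pool and rest proxies that look blocked; the block-detection heuristics are deliberately naive placeholders:

```python
import random
import time

import requests

class ProxyPool:
    """Rotate proxies and bench ones that look blocked for a cooldown period."""

    def __init__(self, proxies, cooldown=6 * 3600):
        self.proxies = proxies
        self.cooldown = cooldown        # seconds to rest a "burned" proxy
        self.benched = {}               # proxy -> time it was benched

    def pick(self):
        now = time.time()
        alive = [p for p in self.proxies
                 if now - self.benched.get(p, 0) > self.cooldown]
        return random.choice(alive or self.proxies)

    def bench(self, proxy):
        self.benched[proxy] = time.time()

def fetch(url, pool, retries=3):
    """Retry a request through different proxies, benching blocked ones."""
    for _ in range(retries):
        proxy = pool.pick()
        resp = requests.get(url, proxies={"http": proxy, "https": proxy},
                            timeout=15)
        # Naive block heuristics: HTTP 503 or a captcha page in the body.
        if resp.status_code == 503 or "captcha" in resp.text.lower():
            pool.bench(proxy)
            continue
        return resp
    raise RuntimeError("all attempts appeared to be blocked")
```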
As for libraries, choose according to your needs and the volume of requests you have to send. The simplest options are requests, urllib, pycurl, multicurl; they are fine for single-threaded, synchronous parsers, but you will have to write almost everything by hand. If you want a bit more power and convenience, look at Grab: it can do a lot, including convenient proxy handling. If you need large volume and speed, use Scrapy. It is a great tool, though it comes with its own rules; if you need to adapt it to your needs, there is plenty of information about it online.
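If you go with Scrapy, a bare-bones spider looks roughly like this; the selectors and settings here are only placeholders to show the shape, not a tuned configuration:

```python
import scrapy

class AmazonProductSpider(scrapy.Spider):
    name = "amazon_products"
    # In practice the start URLs come from a file or a database of ASINs.
    start_urls = ["https://www.amazon.com/dp/EXAMPLEASIN"]

    custom_settings = {
        "DOWNLOAD_DELAY": 1.0,       # tune to your proxy pool size
        "CONCURRENT_REQUESTS": 8,
        # Proxy and User-Agent rotation is normally done in downloader
        # middlewares rather than in the spider itself.
    }

    def parse(self, response):
        yield {
            "url": response.url,
            "title": response.css("#productTitle::text").get(default="").strip(),
            "price": response.css(".a-price .a-offscreen::text").get(),
        }
```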
You can and should work with the Amazon API, but there are several problems:
1. There is a limit on the number of requests (there is more than one limit, but you can send up to 10 ASINs in a single request; a batching sketch follows this list).
2. Most worryingly, for some products (when using the lookup methods) the information that comes back differs from what is on the site. Do not count on the API returning information identical to the website.
3. There is a limit on the number of items for which information is returned (when using the search methods): 100 items. Beyond that, parsing is the only option. This restriction exists not only on Amazon but on eBay as well; without parsing, the dropshippers and other intermediaries simply could not operate.
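For the 10-ASIN limit from point 1, batching the ASIN list is trivial; `send_lookup` below is a hypothetical stand-in for whatever API client wrapper you use:

```python
def chunks(asins, size=10):
    """Split a list of ASINs into batches that fit a single lookup request."""
    for i in range(0, len(asins), size):
        yield asins[i:i + size]

# for batch in chunks(all_asins):
#     response = send_lookup(batch)   # hypothetical ItemLookup wrapper
```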
A few nuances:
- Do not try to impersonate Googlebot; nothing good will come of it, you will just waste time.
- Using browser technologies such as PhantomJS or even Selenium will not help: on top of the IP problem you add cookies and so on, and the speed will be too low for large volumes.
- Most importantly, as you can see, the task is to get around the system that detects bots and crawlers. So improvise, experiment, use your head and look for your own solutions: there are people sitting at the other end too ) There is plenty of advice online on this subject (you can start with the last section here); the small sketch below shows only the obvious baseline.
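As an obvious baseline (not a detection bypass), jittered pauses and a shuffled crawl order; `fetch` here is whatever request function you already have:

```python
import random
import time

def human_pause(base=1.5, jitter=3.0):
    """Sleep a randomized interval so requests do not arrive on a fixed beat."""
    time.sleep(base + random.uniform(0, jitter))

def crawl(urls, fetch):
    """Fetch URLs in shuffled order with jittered pauses in between."""
    random.shuffle(urls)                # avoid a predictable crawl order
    results = []
    for url in urls:
        results.append(fetch(url))
        human_pause()
    return results
```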
Thank you very much for sharing your experience, very useful. Tell me, do I need to log in to an account, or can I just as easily parse as a guest visitor?
- Should any parameters be kept in the query string (Amazon adds them if you arrive, for example, from its own search or recommendations, etc.)?

As I understand it, the API will not work for me: first, there are the restrictions, so it is a half-measure, and second, I need the most current information.

The Google Bot idea is funny; that had not even occurred to me. I immediately assumed I should use normal User-Agents (desktop, phone, Mac, etc.).

Have you tried working through Tor? - Candida93 commented on June 14th 19 at 20:15
Does Amazon block requests in any way other than serving a captcha, which is fairly easy to solve? - Bianka_Kassulke commented on June 14th 19 at 20:18
Tor is very slow; buy normal proxies, they cost about a dollar apiece per month. I would not parse from an account, it is always better to do it anonymously. - Bianka_Kassulke commented on June 14th 19 at 20:21
June 14th 19 at 20:14
Here is an example of scraping Amazon with Scrapy that may come in handy:
blog.datahut.co/tutorial-how-to-scrape-amazon-usin...
