I parse Amazon on an industrial scale (hundreds of thousands of pages per day). The biggest problem is not the libraries: Amazon is very good at detecting parsing attempts and keeps improving its detection techniques. So the most effective thing is to have a decent pool of quality proxies. Proxies that differ only in the last octet and the port number won't work for long: you'll get blacklisted for anywhere from hours to days, depending on how intensively you use them to send requests.
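As one way to keep a pool diverse, here is a sketch, assuming you manage a plain list of host:port HTTP proxies: group them by /24 subnet and rotate across subnets rather than hammering neighbours in the same range (the addresses below are placeholders).

```python
# Group proxies by /24 subnet so consecutive requests never reuse the
# same subnet: proxies differing only in the last octet and port tend
# to get blacklisted together.
import ipaddress
import itertools
from collections import defaultdict

proxies = [
    "203.0.113.10:8080", "203.0.113.11:8080",  # same /24 - risky pair
    "198.51.100.7:3128", "192.0.2.55:8000",    # placeholder addresses
]

by_subnet = defaultdict(list)
for p in proxies:
    host = p.split(":")[0]
    net = ipaddress.ip_network(host + "/24", strict=False)
    by_subnet[net].append(p)

# Round-robin across subnets, not across individual proxies.
rotation = itertools.cycle(list(by_subnet.keys()))

def next_proxy():
    subnet = next(rotation)
    pool = by_subnet[subnet]
    pool.append(pool.pop(0))  # also rotate within the subnet
    return pool[-1]

print(next_proxy())
print(next_proxy())
```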
As for libraries, choose according to your needs, based on the volume of requests you have to send. The simplest option is something like requests. It suits single-threaded, synchronous parsers, but you will write almost everything by hand yourself (a sketch of that style follows).
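A minimal sketch of that hand-written approach: one synchronous GET through a proxy with a browser-like User-Agent; the URL and proxy address are placeholders.

```python
import requests

session = requests.Session()
# Send a browser-like User-Agent instead of the library default.
session.headers["User-Agent"] = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
)

resp = session.get(
    "https://www.example.com/dp/B000000000",  # placeholder product URL
    proxies={"http": "http://203.0.113.10:8080",    # placeholder proxy
             "https": "http://203.0.113.10:8080"},
    timeout=15,
)
print(resp.status_code, len(resp.text))
```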
If you want a little more power and convenience, look at Grab. It can do a lot, including working with proxies conveniently, and so on. If you need large volume and speed, use Scrapy. It's a powerful tool, though it imposes its own rules; if you need to tailor it to your own needs, there is plenty of information about it online.
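For a feel of those rules, a minimal Scrapy spider sketch with throttling-related settings; the URL, CSS selectors, and proxy address are placeholders, not a claim about real Amazon markup.

```python
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    # Per-spider settings: slow down and randomize requests so the
    # crawl looks less mechanical (values here are illustrative).
    custom_settings = {
        "DOWNLOAD_DELAY": 2.0,
        "RANDOMIZE_DOWNLOAD_DELAY": True,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 4,
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    }

    start_urls = ["https://www.example.com/dp/B000000000"]  # placeholder

    def start_requests(self):
        for url in self.start_urls:
            # Route each request through a proxy from your pool.
            yield scrapy.Request(
                url, meta={"proxy": "http://203.0.113.10:8080"}
            )

    def parse(self, response):
        # Hypothetical selectors; adjust to the actual page markup.
        yield {
            "title": response.css("h1::text").get(default="").strip(),
            "price": response.css(".price::text").get(),
        }
```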
The Amazon API can and should be used. But there are several problems:
1. There is a limit on the number of requests (it can be stretched a little: a single request can carry up to 10 ASINs; see the batching sketch after this list).
2. Most worryingly, for some products (when you use the lookup methods) the returned info differs from the original on the site. That is, do not count on the API returning information identical to what the website shows.
3. There is a limit on the number of products for which the search methods return info: 100 items. Beyond that, parsing is the only option. This restriction is not unique to Amazon; eBay has it too. No dropshipper or other intermediary can get by on that number alone.
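On point 1, a sketch of batching ASINs into groups of 10; `lookup_items` is a hypothetical stand-in for whatever Product Advertising API client call you actually use, and the ASINs are dummies.

```python
def chunks(items, size=10):
    """Split a list of ASINs into groups of at most `size`."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

asins = [f"ASIN{n:04d}" for n in range(25)]  # dummy placeholder ASINs

for batch in chunks(asins):
    # One API call per group of up to 10 ASINs.
    # response = lookup_items(item_ids=batch)  # hypothetical client call
    print("requesting", len(batch), "ASINs:", ",".join(batch))
```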
A few nuances:
- Do not try to impersonate Googlebot; nothing good will come of it, you will only waste time.
- Using browser technologies like PhantomJS or even Selenium won't get you anywhere either. On top of the IP problem they add cookies and the like, and they are slow; for large volumes they are unsuitable.
- Most importantly, as you can see, the task is to get around a system that detects bots and crawlers. So improvise, experiment, use your head, and look for your own solutions; there are people sitting at the other end too ) The net is full of tips on this subject (you can start with the last section here).
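To make "your own solutions" concrete, here is a sketch of one home-grown heuristic: treat HTTP 503 or a captcha-looking phrase in the body as a bot check (both markers are assumptions, not a documented contract), retire the proxy that triggered it, and back off before retrying.

```python
import random
import time
import requests

BOT_MARKERS = ("captcha", "robot check")  # assumed phrases, tune from observation

def fetch(url, proxy_pool):
    """Fetch url, retiring proxies that trigger what looks like a bot check."""
    for _ in range(5):
        if not proxy_pool:
            break
        proxy = random.choice(proxy_pool)
        try:
            resp = requests.get(url,
                                proxies={"http": proxy, "https": proxy},
                                timeout=15)
        except requests.RequestException:
            continue  # network error: just try another proxy
        body = resp.text.lower()
        if resp.status_code == 503 or any(m in body for m in BOT_MARKERS):
            proxy_pool.remove(proxy)           # this proxy is burned for now
            time.sleep(random.uniform(5, 30))  # back off before the next try
            continue
        return resp
    raise RuntimeError("every attempt looked like a bot check")
```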