Problem statement: every day I need to parse a large number of websites (>100 sites, >1000 pages) and extract product information from them, for example from online shops. This has to run multi-threaded through proxies, with one proxy per page (not per website).
The actual question: please recommend a complete stack of ready-made solutions, with a focus on:
- Performance and efficiency (running multiple threads on a single machine and a single network connection).
- Robustness and compatibility (pagination, frames, Unicode, and all the other quirks of page markup).
- Security (for example, malicious code or a virus delivered along with a page).
Which components should I use to work with HTML: CsQuery, HtmlAgilityPack...? How do I access sites through pre-purchased proxies? Do the HTML components support proxies themselves, or is an additional layer needed in between?
I would be very grateful for a detailed description and a step-by-step sequence of actions (I am not a professional programmer).
Programming language/development environment: VC++ 2015. I understand that it may not be the best language for this kind of task, but please do not raise the question of choosing or switching languages. I am only interested in VC++.
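
To make the "one page, one proxy" requirement concrete, here is a rough sketch of only the assignment and threading logic in standard C++ (it should build with VC++ 2015). All names (`Job`, `assignProxies`, `fetchThroughProxy`, `runJobs`) are hypothetical placeholders, and `fetchThroughProxy` does no real networking; it only marks the place where an actual HTTP library would be called. It also assumes that proxies may be reused round-robin when there are fewer proxies than pages.

```cpp
#include <atomic>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// One unit of work: a page URL paired with the proxy it must be fetched through.
struct Job {
    std::string url;
    std::string proxy;
};

// Pair each page with a proxy, cycling through the proxy list (round-robin).
// With at least as many proxies as pages, every page gets its own proxy.
std::vector<Job> assignProxies(const std::vector<std::string>& urls,
                               const std::vector<std::string>& proxies) {
    std::vector<Job> jobs;
    if (proxies.empty()) return jobs;  // nothing can be assigned without proxies
    jobs.reserve(urls.size());
    for (std::size_t i = 0; i < urls.size(); ++i)
        jobs.push_back(Job{urls[i], proxies[i % proxies.size()]});
    return jobs;
}

// Hypothetical placeholder: a real version would download job.url via job.proxy.
std::string fetchThroughProxy(const Job& job) {
    return job.url + " via " + job.proxy;
}

// Run all jobs on a fixed number of worker threads.
// Workers claim job indices from a shared atomic counter, so each job is
// processed exactly once and results stay in the order of the input.
std::vector<std::string> runJobs(const std::vector<Job>& jobs, unsigned workerCount) {
    std::vector<std::string> results(jobs.size());
    std::atomic<std::size_t> next(0);
    if (workerCount == 0) workerCount = 1;  // always use at least one worker
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < workerCount; ++w) {
        workers.emplace_back([&]() {
            for (;;) {
                std::size_t i = next.fetch_add(1);
                if (i >= jobs.size()) break;  // no work left
                results[i] = fetchThroughProxy(jobs[i]);
            }
        });
    }
    for (auto& t : workers) t.join();
    return results;
}
```

Calling `assignProxies(urls, proxies)` and then `runJobs(jobs, 8)` would process the pages on eight threads. The parts I am asking about (the real download through the proxy, the HTML parsing, and the safety checks) would go where `fetchThroughProxy` is.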