Which components to use for multi-threaded HTML parsing in VC++ using proxies?

Problem statement: every day I need to parse a large number of websites (>100 sites, >1000 pages) and extract product information from them, for example from online shops. This has to run multi-threaded through proxies, with one proxy per page (not per website).

The actual question: please recommend a complete toolchain, a finished solution, with a focus on:
  1. Performance and efficiency (running multiple threads on a single machine and a single network connection).
  2. Stability and compatibility (pagination, frames, Unicode, and all the other quirks of page markup).
  3. Security (for example, a page that delivers malicious code or a virus).


Which components should I use to work with HTML? CsQuery, HtmlAgilityPack, ...? How do I access sites through pre-purchased proxies? Do the HTML components support proxies themselves, or is an additional layer needed?

I would be extremely grateful for a detailed description and a step-by-step sequence of actions (I'm not a professional programmer).

Programming language / development environment: VC++ 2015. I understand that it may not be the best language for such tasks, but please do not raise the question of choosing or changing the language. I am only interested in VC++.
July 8th 19 at 11:24
1 answer
July 8th 19 at 11:26
I processed 1kk (1 million) pages with Python 2.7, using Pool (from multiprocessing import Pool). See my earlier post where I mentioned it.
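
For the VC++ setup in the question, a rough native counterpart of the worker-pool idea might look like the sketch below. It is only an illustration under stated assumptions, not a finished solution: crawlPages, PageResult and FetchFn are hypothetical names, the one-proxy-per-page rule is approximated by round-robin assignment (page i gets proxy i modulo the number of proxies), and the download itself is left to a caller-supplied callback. A real callback could wrap an HTTP client such as libcurl and pass the proxy via its CURLOPT_PROXY option; HTML parsing would then run on the returned body.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <string>
#include <thread>
#include <vector>

// Outcome for one page. All names here are illustrative, not from any library.
struct PageResult {
    std::string url;    // page that was requested
    std::string proxy;  // proxy assigned to this page
    std::string body;   // downloaded HTML (empty if the fetch failed)
    bool ok;            // false if the fetch callback threw
};

// Hypothetical callback: downloads `url` through `proxy` and returns the HTML.
// It is called from several threads at once, so it must be thread-safe.
using FetchFn =
    std::function<std::string(const std::string& url, const std::string& proxy)>;

// Processes every URL using up to `workerCount` threads.
// Page i is fetched through proxies[i % proxies.size()].
std::vector<PageResult> crawlPages(const std::vector<std::string>& urls,
                                   const std::vector<std::string>& proxies,
                                   std::size_t workerCount,
                                   const FetchFn& fetch) {
    if (proxies.empty()) throw std::invalid_argument("at least one proxy is required");
    if (workerCount == 0) throw std::invalid_argument("workerCount must be positive");

    std::vector<PageResult> results(urls.size());
    std::atomic<std::size_t> next(0);  // index of the next unclaimed page

    auto worker = [&]() {
        for (;;) {
            const std::size_t i = next.fetch_add(1);  // claim one page
            if (i >= urls.size()) return;             // nothing left to do
            PageResult r{urls[i], proxies[i % proxies.size()], std::string(), true};
            try {
                r.body = fetch(r.url, r.proxy);
            } catch (...) {
                r.ok = false;  // one bad page must not take down the whole run
            }
            results[i] = std::move(r);  // each thread writes only its own slot
        }
    };

    std::vector<std::thread> threads;
    const std::size_t n = std::min(workerCount, urls.size());
    for (std::size_t t = 0; t < n; ++t) threads.emplace_back(worker);
    for (auto& t : threads) t.join();

    assert(next.load() >= urls.size());  // every page was claimed by some worker
    return results;
}
```

A call such as crawlPages(urls, proxies, 8, myFetch) returns the results in the same order as urls. With fewer proxies than pages, proxies are reused across pages; for a strict one-proxy-per-page setup the proxy list needs to be at least as long as the URL list.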
