Is there a C++ library that can parse web pages and save the extracted information to a file or a database?
If so, will it work faster than file_get_contents in PHP?
P.S. The library needs to load a large number of pages (about 5 million) and save the information from them.
99% of the time will be spent downloading the content (5,000,000 pages is no small task). You could write this even in BASIC. Yes, C++ will be faster, but for this task the difference in absolute terms is unlikely to be noticeable.
I think your problem is better solved in Perl or Python, not C++.
I use MultiCurl. I start 100 downloads at a time, and a bash script launches 20 copies of the worker (IPC is implemented via a queue in Redis). 18k links are processed in about 10 minutes (around 30 per second). And the script doesn't just download: it analyzes each page, transcodes it into the required charset, builds the DOM, and pulls out the desired data via XPath. Each copy consumes around 50 MB. The scheme is simple and scales horizontally with ease.
By the way, you should understand and separate two processes: downloading a page and parsing it. The first can easily be done with wget.
C libraries will in principle give you one advantage: saving RAM (if I'm not mistaken). Otherwise, most of the time will be spent interacting with the network and writing data to the database.
You shouldn't use file_get_contents without specifying a stream context, because 1) it is blocking, and 2) it sets no explicit timeouts.
I can't speak to the existing libraries, because I've never written anything in C++. But it will work faster than PHP (especially given the volume you specify), that much is certain, if only because PHP is an interpreted language.