How do I find the area of a page that holds the useful content?

The gist of the task: search for certain words on a website's pages. The search would be done by a browser extension.
Problem: how do I make sure the word is searched for not in the entire document, but only in the useful part of it?

For example, I would immediately exclude tags like aside and nav from the search... maybe even header and footer, although they can also hold useful information; the header, for example, may contain the article title. Search the whole body? Then I will find text inside ad blocks, which is no good.

Has anyone solved this problem?
July 8th 19 at 16:06
4 answers
July 8th 19 at 16:08
This question has come up repeatedly on Toster, and I have written answers to it myself. Search the site.
Yes! I agree! +1 )) - Kyleigh_Hills commented on July 8th 19 at 16:11
Cool that you agree, but Google gave me nothing intelligible. If it is that obvious to you, you could post a link to the question, or at least the search query to use. Or maybe you misunderstood the question after all? - Sterling_Parker commented on July 8th 19 at 16:14
: https://toster.ru/search?q=%D0%BF%D0%B0%D1%80%D1%8... - lily.Kozey commented on July 8th 19 at 16:17
July 8th 19 at 16:10
The task itself is somewhat tricky for two reasons:
1. Not everyone follows the standards when writing markup.
2. You need to look at the structure of the specific resource being parsed.
That is exactly the issue: the resource can be anything :( - Kyleigh_Hills commented on July 8th 19 at 16:13
You can rely on the various CMSs and their consistency in how they mark up content: most likely, any text that is not there for service purposes gets put into paragraph tags, so you can take those and read only from them. - Sterling_Parker commented on July 8th 19 at 16:16
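A minimal sketch of that paragraph-only idea, in case it helps. The function names `paragraphText` and `pageMentions` are made up for illustration, and the search is a plain case-insensitive substring match, which is an assumption about what "search for a word" should mean here.

```javascript
// Sketch: search only inside <p> elements instead of the whole <body>.
// `paragraphs` is any iterable of objects with a `textContent` string,
// for example the NodeList returned by document.querySelectorAll('p').
function paragraphText(paragraphs) {
  const parts = [];
  for (const p of paragraphs) {
    // Collapse runs of whitespace so line breaks in the markup do not matter.
    const text = (p.textContent || '').replace(/\s+/g, ' ').trim();
    if (text) parts.push(text);
  }
  return parts.join('\n');
}

// Hypothetical helper: case-insensitive substring search over paragraph text.
function pageMentions(paragraphs, word) {
  return paragraphText(paragraphs).toLowerCase().includes(word.toLowerCase());
}

// In a content script this would be called roughly as:
//   pageMentions(document.querySelectorAll('p'), 'parser');
```

This will miss sites that put article text straight into div or td without paragraphs, so it probably only works as a first filter.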
July 8th 19 at 16:12
If we are talking about any resource in general and any data matching some pattern, there are a lot of problems here. Everyone writes markup however they find convenient.
A year ago I wrote a parser for pulling out emails and phone numbers, and the best result was 56%. That is, from 100 pages I got 56 contacts. And that is with formats known in advance, for which you can write regular expressions...
July 8th 19 at 16:14
Well, in short: this is the task of finding the MAIN content of the page.
1. Remove all containers (except paragraph and text-formatting tags) that have more than one child element.
2. Strip every tag from the body except the container tags (div, td).
3. Find the container (div, td) with the longest text.
4. That is the one to scrape.

Example.
Was:
<div1>
<div2>
 <a href="/1/">link1</a>
 <a href="/2/">link2</a>

<div3>
 <span content>
 some text
<p>
 <i>more text</i>
</p>
</span>
</div3>
</div2></div1>

Became:
<div3>
 some text
 more text
</div3>
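A rough sketch of the "longest text" idea on the example above. This is a simplified reading of the steps, not an exact implementation: instead of physically removing tags, each div or td is scored by the text that belongs to it directly, without counting nested containers. The names `ownText` and `findMainContainer` and the decision to skip script and style are assumptions made for this sketch.

```javascript
// Container tags whose text is compared (the steps name div and td).
const CONTAINERS = new Set(['DIV', 'TD']);
// Assumption: text inside these tags is never page content.
const SKIPPED = new Set(['SCRIPT', 'STYLE']);

const tagOf = (node) => node.tagName.toUpperCase();
const isContainer = (node) => node.nodeType === 1 && CONTAINERS.has(tagOf(node));

// Text that belongs to `node` itself: descend through inline and paragraph
// tags, but stop at nested containers, which are scored separately.
function ownText(node) {
  let text = '';
  for (const child of node.childNodes) {
    if (child.nodeType === 3) {
      text += child.nodeValue + ' '; // text node
    } else if (child.nodeType === 1 && !isContainer(child) && !SKIPPED.has(tagOf(child))) {
      text += ownText(child);
    }
  }
  return text;
}

// Walk the tree and return the container with the longest own text.
function findMainContainer(root) {
  let best = { node: null, text: '' };
  const visit = (node) => {
    if (isContainer(node)) {
      const text = ownText(node).replace(/\s+/g, ' ').trim();
      if (text.length > best.text.length) best = { node, text };
    }
    for (const child of node.childNodes) {
      if (child.nodeType === 1) visit(child);
    }
  };
  visit(root);
  return best;
}

// In a content script: const { node, text } = findMainContainer(document.body);
```

On the example tree, div2 only owns the two link texts and div3 owns "some text more text", so div3 should win. Pages that split the article across several sibling containers would need the scores of neighbours to be merged somehow, which this sketch does not attempt.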
> 1. Remove all containers that have more than one child element.
By this logic, any wrapper with a bunch of paragraphs gets deleted right away. - Kyleigh_Hills commented on July 8th 19 at 16:17
No, only the duplicate div, ul, td containers are deleted. The check does not touch text-formatting tags ("p", etc.). - Sterling_Parker commented on July 8th 19 at 16:20
