I am looking for a smart way to extract only the actual informative content from a range of different webpages. My idea is that certain HTML tags tend to hold more of that content than others. Is filtering by tag a good preprocessing step, or do you have other ideas?
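To make the idea concrete, here is a minimal sketch of the kind of tag-based filter I have in mind, using only Python's standard `html.parser`. The two tag sets and the names `TagFilter` and `extract_text` are my own illustrative assumptions, not a validated list or an existing API. It keeps text that sits inside a "content" tag and outside a "skip" tag.

```python
from html.parser import HTMLParser

# Assumption: both tag sets are illustrative guesses, not a validated list.
CONTENT_TAGS = {"p", "article", "h1", "h2", "h3", "li", "blockquote"}
SKIP_TAGS = {"script", "style", "nav", "footer", "header", "aside"}


class TagFilter(HTMLParser):
    """Collect text found inside content tags and outside skipped tags."""

    def __init__(self):
        super().__init__()
        self.content_depth = 0  # how many content tags are currently open
        self.skip_depth = 0     # how many skipped tags are currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in CONTENT_TAGS:
            self.content_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in CONTENT_TAGS and self.content_depth > 0:
            self.content_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep non-empty text only when inside content and not inside a skip tag.
        if text and self.content_depth > 0 and self.skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the kept text chunks joined by newlines."""
    parser = TagFilter()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

This sketch assumes every tag is explicitly closed; pages that leave `<p>` or `<li>` unclosed would throw off the depth counters. It also drops any text that sits only in a `<div>` or `<span>`, which is one reason I am unsure whether tag filtering alone is enough.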

