java - How to find and extract the "main" image in a website
I need help tackling a problem. I need a program which, given a site, finds and extracts the "main" picture, i.e. the one that represents the site. (Assuming it is the biggest or the first picture is not always correct.)
How should I approach this? Are there any libraries that can help me with this? Thanks!
Option 1
You could check out Goose. It is similar to what Pocket and Readability do, i.e. it tries to extract the main article from a given web page using a set of heuristics. It can apparently also extract the main image of the article, but that is a bit hit and miss: 60% of the time, it works every time.
It used to be a Java project but was rewritten in Scala.
From the readme:
Goose will try to extract the following information:
- main text of article
- main image of article
- any youtube/vimeo movies embedded in article
- meta description
- meta tags
- publish date
Try it here: http://jimplush.com/blog/goose
Option 2
You could use a Java wrapper (e.g. GhostDriver) to run a headless browser such as PhantomJS. Then fetch the website and find the img element with the largest dimensions. This GhostDriver test case shows how to query DOM elements and their rendered size.
Option 3
Use a library such as jsoup, which helps you parse HTML. Get the value of the src attribute of all img tags. Request each URL to fetch the image and measure its size. The one with the biggest dimensions is likely the website's main image.
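A minimal sketch of the measuring step in option 3, assuming the image URLs have already been collected from the page (for example with jsoup's `select("img[src]")` and `absUrl("src")`). The class name `MainImageFinder` and the method `findLargest` are made up for illustration. It uses only the JDK's `ImageIO`, and it treats "biggest" as largest pixel area, which is an assumption rather than something the approach prescribes.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.net.URL;
import java.util.List;
import javax.imageio.ImageIO;

public class MainImageFinder {

    /**
     * Returns the URL whose image has the largest pixel area,
     * or null if none of the URLs could be decoded as an image.
     * (Hypothetical helper; "largest area" is an assumed heuristic.)
     */
    public static URL findLargest(List<URL> imageUrls) {
        URL best = null;
        long bestArea = -1;
        for (URL url : imageUrls) {
            BufferedImage image;
            try {
                // ImageIO.read returns null when no registered reader
                // understands the data (e.g. the URL is not an image).
                image = ImageIO.read(url);
            } catch (IOException e) {
                continue; // unreachable or broken image: skip it
            }
            if (image == null) {
                continue;
            }
            long area = (long) image.getWidth() * image.getHeight();
            if (area > bestArea) {
                bestArea = area;
                best = url;
            }
        }
        return best;
    }
}
```

Note that this downloads and decodes every image in full, so on image-heavy pages it can be slow. Tiny images such as icons and tracking pixels never win, but a large banner or advertisement can, so the result is only a guess at the "main" image.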