java - How to find and extract "main" image in website


I need help tackling a problem. I need a program which, given a site, finds and extracts the "main" picture, i.e. the one that represents the site. (Assuming it is the biggest or the first picture is not always true.)

How should I approach this? Are there any libraries that can help me with this? Thanks!

Option 1

You could check out Goose. It does something similar to what Pocket and Readability do, i.e. it tries to extract the main article from a given webpage using a set of heuristics. It can apparently also extract the main image of the article, but that is a bit hit and miss: 60% of the time, it works every time.

It used to be a Java project but has been rewritten in Scala.

From the readme:

Goose will try to extract the following information:

  • main text of article
  • main image of article
  • any youtube/vimeo movies embedded in article
  • meta description
  • meta tags
  • publish date

Try it here: http://jimplush.com/blog/goose


Option 2

You could use a Java wrapper (e.g. GhostDriver) for running a headless browser such as PhantomJS. Then fetch the website and find the img element with the largest dimensions. This GhostDriver test case shows how to query DOM elements and their rendered size.


Option 3

Use a library like jsoup, which helps you parse HTML. Get the value of the src attribute for all img tags. Request each URL to fetch the image and measure its size. The one with the biggest dimensions is likely the website's main image.
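
The "measure and pick the biggest" half of option 3 could look roughly like the sketch below. It uses only the JDK (javax.imageio) and assumes you already have the absolute image URLs, e.g. collected with jsoup via doc.select("img[src]") and absUrl("src"). The class name MainImageFinder and its method names are made up for illustration, and "biggest" is taken to mean largest pixel area, which is just one possible heuristic.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.net.URI;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import javax.imageio.ImageIO;

// Hypothetical helper: picks the image with the largest pixel area
// from a list of candidate image URLs.
final class MainImageFinder {

    // Returns width * height in pixels, or 0 if the URL cannot be decoded as an image.
    static long pixelArea(URL imageUrl) {
        try {
            // ImageIO.read returns null when no registered reader understands the data.
            BufferedImage image = ImageIO.read(imageUrl);
            return image == null ? 0L : (long) image.getWidth() * image.getHeight();
        } catch (IOException e) {
            return 0L; // unreachable or broken image: treat it as size zero
        }
    }

    // Returns the candidate with the largest area; empty if none could be decoded.
    static Optional<URL> findLargest(List<URL> candidates) {
        URL best = null;
        long bestArea = 0L;
        for (URL candidate : candidates) {
            long area = pixelArea(candidate);
            if (area > bestArea) {
                best = candidate;
                bestArea = area;
            }
        }
        return Optional.ofNullable(best);
    }

    // Usage: java MainImageFinder <imageUrl> [<imageUrl> ...]
    public static void main(String[] args) throws Exception {
        List<URL> urls = new ArrayList<>();
        for (String arg : args) {
            urls.add(URI.create(arg).toURL());
        }
        Optional<URL> largest = findLargest(urls);
        System.out.println(largest.map(URL::toString).orElse("no decodable image found"));
    }
}
```

Note that this downloads and fully decodes every candidate, which is slow on image-heavy pages. You would probably want to skip tiny icons and tracking pixels, or cap the number of requests, before relying on it.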

