java - How to find and extract "main" image in website

January 15, 2013

I need help tackling a problem. I need a program which, given a site, finds and extracts the "main" picture, i.e. the one that represents the site. (Saying it is the biggest or the first picture is not always true.)

How should I approach this? Are there any libraries that can help me with this? Thanks!

Option 1

You could check out Goose. It does something similar to what Pocket and Readability do, i.e. it tries to extract the main article from a given webpage using a set of heuristics. It can apparently also extract the main image of the article, but that is a bit hit and miss: 60% of the time, it works every time.

It used to be a Java project but has been rewritten in Scala.

From the README:

Goose will try to extract the following information:

Main text of the article

Main image of the article

Any YouTube/Vimeo movies embedded in the article

Meta description

Meta tags

Publish date

You can try it here: http://jimplush.com/blog/goose

Option 2

You could use a Java wrapper (e.g. GhostDriver) to run a headless browser such as PhantomJS. Then fetch the website and find the img element with the largest dimensions. This GhostDriver test case shows how to query DOM elements and their rendered size.

Option 3

Use a library like jsoup, which helps you parse HTML. Get the value of the src attribute of all img tags. Request each URL you find to fetch the image and measure its size. The one with the biggest dimensions is taken to be the website's main image.
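The steps in this option can be sketched roughly as follows. To keep the example free of external dependencies, it uses a simple regular expression as a stand-in for jsoup's parsing (a real HTML parser will be more robust), and the class and method names (MainImageFinder, imageUrls, largestImage, measureOverHttp) are hypothetical, chosen only for illustration. It assumes "biggest" means largest width times height, which, as the question notes, will not always pick the image that best represents the site.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.imageio.ImageIO;

/** Hypothetical helper; the names here are illustrative and not part of jsoup. */
class MainImageFinder {

    // Simplified stand-in for an HTML parser: captures a quoted src value inside an <img> tag.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\ssrc\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    /** Collects the src of every img tag, resolved against the page URL. */
    static List<String> imageUrls(String html, String pageUrl) {
        URI base = URI.create(pageUrl);
        List<String> urls = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            String src = m.group(2).trim();
            if (src.isEmpty()) {
                continue;
            }
            try {
                urls.add(base.resolve(src).toString());
            } catch (IllegalArgumentException e) {
                // Skip src values that are not valid URIs (e.g. containing spaces).
            }
        }
        return urls;
    }

    /** Returns the URL with the largest measured area, or empty if nothing could be measured. */
    static Optional<String> largestImage(List<String> urls, Function<String, int[]> measure) {
        String best = null;
        long bestArea = 0;
        for (String url : urls) {
            int[] size = measure.apply(url); // expected as {width, height}, or null
            if (size == null) {
                continue;
            }
            long area = (long) size[0] * size[1];
            if (area > bestArea) {
                bestArea = area;
                best = url;
            }
        }
        return Optional.ofNullable(best);
    }

    /** Downloads an image and returns {width, height}, or null if it cannot be read. */
    static int[] measureOverHttp(String url) {
        try {
            BufferedImage img = ImageIO.read(URI.create(url).toURL());
            return img == null ? null : new int[] {img.getWidth(), img.getHeight()};
        } catch (IOException | IllegalArgumentException e) {
            return null;
        }
    }
}
```

Given the page's HTML as a string, a call such as MainImageFinder.largestImage(MainImageFinder.imageUrls(html, pageUrl), MainImageFinder::measureOverHttp) would download every candidate and return the largest one. Passing the measuring step in as a function keeps the selection logic separate from the network access, so a cache or a size threshold could be swapped in later.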
