HTML parsing - Android, Proper HtmlCleaner Usage
I know we should try our own stuff here, and that this is not a place to make requests, but I hate having to read stuff from HTML, and I really don't understand its ways.
So, I am awarding a bounty of 150 points (not that I'm cheap, I just can't give more :( ) if I can get some help, or at least be pointed in the right direction with sample code.
What am I trying to accomplish?
- I am trying to get the latest news from the following NASA page.
- I plan to display the news in a ListView. Of course, the ListView has little content to display to begin with, only the data available through the page above. Here's a quick mock-up.
That's it. When the user clicks a link, they will be taken to a different fragment that shows the full article, but I'll figure out how to do that later, once this part is done.
So, I tried using HtmlCleaner with the following bit:
    private class CleanUrlTask extends AsyncTask<Void, Void, Void> {

        @Override
        protected Void doInBackground(Void... params) {
            try {
                // Try cleaning the NASA page.
                mNode = mCleaner.clean(mUrl);
            } catch (Exception e) {
                Constants.logMessage("Error cleaning file: " + e.toString());
            }
            return null;
        }

        @Override
        protected void onPostExecute(Void result) {
            try {
                // For writing an XML file that I can sort of read through.
                // God, HTML code is ugly.
                new PrettyXmlSerializer(mProps).writeToFile(mNode, FILE_NAME, "utf-8");
            } catch (Exception e) {
                Constants.logMessage("Error writing file: " + e.toString());
            }
        }
    }
But from there, I am pretty lost. Here's the XML output, by the way. I did notice that there is a sort of repetition in the tag hierarchy for each article's content. It seems to go like this: on the left goes the image and article link, and on the right goes the article title and preview content.
So, if anyone can help me figure out how to obtain this content somehow, I'd really appreciate it.
Just as a side note, this project is for educational purposes and is part of the 2013 NASA International Space Apps Challenge; more info here.
As a bonus question, the same link contains information about the current, future, and past expeditions, including the current members, and for each member of an expedition there is a link to their bio page.
The tags there do not seem to be repetitive, but the names seem to be preset and constant: they have "tab1", "tab2", "tab3", and so forth.
What I'd like to obtain is:
- Expedition number and dates.
- Expedition crew members.
- A link to each member's bio.
Again, thanks for the support; if there is any, I'd really appreciate it.
So apparently I needed to figure out how to use XPath in order to get the data out of the XML output.
Basically, the idea is that XPath can reach any node within the XML, and in my case, as you can see in the image above, I wanted some specific information.
Here's the XPath for the article link:
    public static final String XPATH_ARTICLE_LINKS =
            "//div[@class='landing-slide']/div[@class='landing-slide-inner']/div[@class='fpss-img_holder_div_landing']/div[@id='fpss-img-div_466']/a/@href";
Here, //div[@class='landing-slide'] means that I am looking for a div with the class name landing-slide, regardless of where it may be located in the document (the '//' declares that). From there on, I go further down the hierarchy of the item to obtain the value of the href attribute (attributes are addressed via the '@' character).
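To see the two pieces of syntax in isolation, here is a minimal sketch that runs the same kind of expression against a tiny document. It uses the JDK's own javax.xml.xpath rather than HtmlCleaner's evaluateXPath, purely so it runs without the library. The class name XPathDemo, the helper evaluate, and the sample markup are made up for illustration and are much simpler than the real NASA page.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathDemo {

    // Hypothetical, heavily simplified stand-in for the cleaned page.
    static final String SAMPLE =
            "<html><body>"
            + "<div class='landing-slide'><div class='landing-slide-inner'>"
            + "<a href='/mission_pages/station/a.html'>First</a></div></div>"
            + "<div class='landing-slide'><div class='landing-slide-inner'>"
            + "<a href='/mission_pages/station/b.html'>Second</a></div></div>"
            + "</body></html>";

    // Evaluates an XPath expression and returns the text of every match.
    // For attribute nodes this is the attribute value; for elements, their text.
    static List<String> evaluate(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression,
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath or document: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // '//' finds the div anywhere in the document; '@href' selects the attribute.
        System.out.println(evaluate(SAMPLE,
                "//div[@class='landing-slide']/div[@class='landing-slide-inner']/a/@href"));
        // Without '@', the match is the element itself, so we get its text.
        System.out.println(evaluate(SAMPLE,
                "//div[@class='landing-slide']/div[@class='landing-slide-inner']/a"));
    }
}
```

The first expression returns one value per slide because the path matches every landing-slide div, which is the same reason the HtmlCleaner call hands back an array.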
Now that we have the XPath, we need to pass this value to HtmlCleaner. I am doing this via an AsyncTask.
Please keep in mind that this isn't the final code, but it gets the info I want.
First, the XPaths used:
    private class News {
        static final String XPATH_ARTICLE_LINKS =
                "//div[@class='landing-slide']/div[@class='landing-slide-inner']/div[@class='fpss-img_holder_div_landing']/div[@id='fpss-img-div_466']/a/@href";
        static final String XPATH_ARTICLE_IMAGES =
                "//div[@class='landing-slide']/div[@class='landing-slide-inner']/div[@class='fpss-img_holder_div_landing']/div[@id='fpss-img-div_466']/a/img/@src";
        static final String XPATH_ARTICLE_HEADERS =
                "//div[@class='landing-slide']/div[@class='landing-slide-inner']/div[@class='landing-fpss-introtext']/div[@class='landing-slidetext']/h1/a";
        static final String XPATH_ARTICLE_DESCRIPTIONS =
                "//div[@class='landing-slide']/div[@class='landing-slide-inner']/div[@class='landing-fpss-introtext']/div[@class='landing-slidetext']/p";
    }
Now the AsyncTask:
    private class CleanUrlTask extends AsyncTask<Void, Void, Void> {

        @Override
        protected Void doInBackground(Void... params) {
            try {
                // Try cleaning the NASA page (root node).
                mNode = mCleaner.clean(mUrl);

                // Get all of the article links.
                Object[] mArticles = mNode.evaluateXPath(News.XPATH_ARTICLE_LINKS);
                // Get all of the image links.
                Object[] mImages = mNode.evaluateXPath(News.XPATH_ARTICLE_IMAGES);
                // Get all of the article titles.
                Object[] mTitles = mNode.evaluateXPath(News.XPATH_ARTICLE_HEADERS);
                // Get all of the article descriptions.
                Object[] mDescriptions = mNode.evaluateXPath(News.XPATH_ARTICLE_DESCRIPTIONS);

                Constants.logMessage("Found: " + mArticles.length + " articles");

                // Value containers.
                String link, image, title, description;
                for (int i = 0; i < mArticles.length; i++) {
                    // The NASA page returns links that are not fully qualified URLs,
                    // so I need to prepend the prefix if needed.
                    link = mArticles[i].toString().startsWith(FULL_HTML_PREFIX)
                            ? mArticles[i].toString()
                            : NASA_PREFIX + mArticles[i].toString();
                    image = mImages[i].toString().startsWith(FULL_HTML_PREFIX)
                            ? mImages[i].toString()
                            : NASA_PREFIX + mImages[i].toString();

                    // For the previous two items we are getting an attribute value.
                    // Here, I need the text inside the actual element, so I cast the
                    // Object to a TagNode, which lets me extract the element's text.
                    title = ((TagNode) mTitles[i]).getText().toString();
                    description = ((TagNode) mDescriptions[i]).getText().toString();

                    // Only log the values for now.
                    Constants.logMessage("Link to article " + link);
                    Constants.logMessage("Image from article " + image);
                    Constants.logMessage("Title of article " + title);
                    Constants.logMessage("Description of article " + description);
                }
            } catch (Exception e) {
                Constants.logMessage("Error cleaning file: " + e.toString());
            }
            return null;
        }
    }
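The loop above walks four parallel arrays using the length of the links array, so it would throw if, say, one slide had no image. Below is a minimal sketch of one way to bundle the values into objects that a ListView adapter could consume later. NewsItem, qualify and combine are hypothetical names, and the two prefix values are guesses, since the post does not show what FULL_HTML_PREFIX and NASA_PREFIX actually contain. Titles and descriptions are passed in as plain strings, assuming the TagNode text has already been extracted.

```java
import java.util.ArrayList;
import java.util.List;

class NewsItems {

    // Assumed values; the original constants are not shown in the post.
    static final String FULL_HTML_PREFIX = "http";
    static final String NASA_PREFIX = "http://www.nasa.gov";

    // Simple immutable holder for one article row.
    static final class NewsItem {
        final String link;
        final String image;
        final String title;
        final String description;

        NewsItem(String link, String image, String title, String description) {
            this.link = link;
            this.image = image;
            this.title = title;
            this.description = description;
        }
    }

    // Prepends the site prefix when the URL is not already fully qualified.
    static String qualify(String url) {
        return url.startsWith(FULL_HTML_PREFIX) ? url : NASA_PREFIX + url;
    }

    // Zips the parallel arrays, stopping at the shortest one so that a
    // missing image or description cannot cause an index error.
    static List<NewsItem> combine(Object[] links, Object[] images,
                                  String[] titles, String[] descriptions) {
        int count = Math.min(Math.min(links.length, images.length),
                             Math.min(titles.length, descriptions.length));
        List<NewsItem> items = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            items.add(new NewsItem(
                    qualify(links[i].toString()),
                    qualify(images[i].toString()),
                    titles[i].trim(),
                    descriptions[i].trim()));
        }
        return items;
    }

    public static void main(String[] args) {
        List<NewsItem> items = combine(
                new Object[] {"/news/a.html"},
                new Object[] {"/images/a.jpg"},
                new String[] {" Title A "},
                new String[] {"Preview of A"});
        for (NewsItem item : items) {
            System.out.println(item.title + " -> " + item.link);
        }
    }
}
```

Stopping at the shortest array avoids the crash but does not guarantee that row i of each array belongs to the same article; evaluating the link, image, title and description relative to each landing-slide node would be the more robust approach.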
In case anyone was as lost as I was, I hope this can shed some light along the way.