My Java regex doesn't work properly
I wrote the regular expression below, which is used to extract date strings:
(monday|tuesday|wednesday|thursday|friday|saturday|sunday)(\*){0,2}\s+\d{1,2}\s+(january|february|march|april|may|june|july|august|september|october|november|december)\s+\d{4}
Before converting it to a Java regex, I tested it here: http://regexr.com?35vlm
The results look fine; it matches everything I want.
The "el" object is an ArrayList of strings with the following content:
holiday: new year's day wednesday 1 january 2014
holiday: chinese new year friday 31 january 2014 saturday 1 february 2014
holiday: friday friday 18 april 2014
holiday: labour day thursday 1 may 2014
holiday: vesak day tuesday 13 may 2014
holiday: hari raya puasa monday 28 july 2014
holiday: national day saturday 9 august 2014
holiday: hari raya haji sunday* 5 october 2014
holiday: deepavali thursday** 23 october 2014
holiday: christmas day thursday 25 december 2014
The problem is that in Java some dates are missed while others are matched. I also tested the expression at http://java-regex-tester.appspot.com/ and got the same error.
Update:
Full version of the code:
import java.io.IOException;
import java.text.DecimalFormat;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Tester {

    /**
     * @param args
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        updateSingaporeHolidayCalendar();
    }

    public static void updateSingaporeHolidayCalendar() throws IOException {
        String url = "http://www.mom.gov.sg/employment-practices/leave-and-holidays/pages/public-holidays-2014.aspx";
        Document document = Jsoup.connect(url).get();
        Elements holidays = document.select("#contentarea table tr");
        // System.out.println("12312312");
        // System.out.println("web page context: " + question);

        // Collect the text of every third <td> from every other table row.
        List<String> el = new ArrayList<String>();
        for (int i = 2; i < holidays.size() + 1; i++) {
            if ((i & 1) == 1)
                continue;
            Elements threeGroup = holidays.get(i - 2).getElementsByTag("td");
            int j = 2;
            for (Element e : threeGroup) {
                if (j-- != 0)
                    continue;
                j = 2;
                el.add(e.text());
            }
        }

        Pattern pattern = Pattern.compile("(monday|tuesday|wednesday|thursday|friday|saturday|sunday)(\\*){0,2}\\s+\\d{1,2}\\s+(january|february|march|april|may|june|july|august|september|october|november|december)\\s+\\d{4}");

        // Output: print every match, then the source string it was searched in.
        for (int k = 0; k < el.size(); k++) {
            Matcher matcher = pattern.matcher(el.get(k));
            // Check all occurrences
            while (matcher.find()) {
                // System.out.print("start index: " + matcher.start());
                // System.out.print(" end index: " + matcher.end());
                System.out.println(" found: " + matcher.group());
            }
            System.out.println("holiday: " + el.get(k));
        }
    }
}
External jar: jsoup.jar
Output:
found: wednesday 1 january 2014
holiday: new year's day wednesday 1 january 2014
found: saturday 1 february 2014
holiday: chinese new year friday 31 january 2014 saturday 1 february 2014
holiday: friday friday 18 april 2014
found: thursday 1 may 2014
holiday: labour day thursday 1 may 2014
holiday: vesak day tuesday 13 may 2014
holiday: hari raya puasa monday 28 july 2014
holiday: national day saturday 9 august 2014
found: sunday* 5 october 2014
holiday: hari raya haji sunday* 5 october 2014
holiday: deepavali thursday** 23 october 2014
found: thursday 25 december 2014
holiday: christmas day thursday 25 december 2014
holiday:
holiday:
Solved:
As @pshemo said, "The data you got from the site contains a no-break space, which can be written in HTML as &nbsp; and apparently doesn't belong to the \s class. To solve this problem, replace each \s with [\s\u00a0] to include this character (written with its Unicode identifier)."
So I changed the expression to:
Pattern pattern = Pattern.compile("(monday|tuesday|wednesday|thursday|friday|saturday|sunday)(\\*){0,2}[\\s\u00a0]+\\d{1,2}[\\s\u00a0]+(january|february|march|april|may|june|july|august|september|october|november|december)[\\s\u00a0]+\\d{4}");
That solved the issue.
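For anyone who wants to see the effect in isolation, here is a minimal sketch that needs no jsoup. It uses a shortened version of the pattern, and the class name, helper method, and sample strings are made up for illustration. Where exactly the no-break spaces sit in the real page text is an assumption here.

```java
import java.util.regex.Pattern;

public class NbspDemo {
    // Shortened, hypothetical variants of the pattern from the post.
    static final Pattern PLAIN =
            Pattern.compile("\\d{1,2}\\s+may\\s+\\d{4}");
    static final Pattern NBSP_AWARE =
            Pattern.compile("\\d{1,2}[\\s\u00a0]+may[\\s\u00a0]+\\d{4}");

    // True if the pattern occurs anywhere in the text.
    static boolean found(Pattern pattern, String text) {
        return pattern.matcher(text).find();
    }

    public static void main(String[] args) {
        String withSpace = "tuesday 13 may 2014";
        // Same text, but with a no-break space after the day number (assumed position).
        String withNbsp = "tuesday 13\u00a0may 2014";

        System.out.println(found(PLAIN, withSpace));      // true
        System.out.println(found(PLAIN, withNbsp));       // false: default \s skips U+00A0
        System.out.println(found(NBSP_AWARE, withSpace)); // true
        System.out.println(found(NBSP_AWARE, withNbsp));  // true
    }
}
```

The two strings look identical when printed, which is why the misses in the output above were so hard to spot.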
The data you got from the site contains a no-break space, which can be written in HTML as &nbsp; and apparently doesn't belong to the \\s class. To solve this problem, replace each \\s with [\\s\u00a0] to include this character (written with its Unicode identifier).
So your regex can be:
Pattern pattern = Pattern.compile("(monday|tuesday|wednesday|thursday|friday|saturday|sunday)(\\*){0,2}[\\s\u00a0]+\\d{1,2}[\\s\u00a0]+(january|february|march|april|may|june|july|august|september|october|november|december)[\\s\u00a0]+\\d{4}");
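A different route, not part of the answer above, would be to leave the original \\s pattern untouched and replace the no-break spaces in the scraped text before matching. The sketch below assumes that approach. The class and method names are hypothetical, and the sample input only imitates the page text with a no-break space in an assumed position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NormalizeDemo {
    // The original pattern from the question, unchanged.
    static final Pattern DATE = Pattern.compile(
            "(monday|tuesday|wednesday|thursday|friday|saturday|sunday)(\\*){0,2}\\s+\\d{1,2}\\s+"
            + "(january|february|march|april|may|june|july|august|september|october|november|december)"
            + "\\s+\\d{4}");

    // Replaces each no-break space with a regular space, then collects all matches.
    static List<String> findDates(String text) {
        String normalized = text.replace('\u00a0', ' ');
        List<String> dates = new ArrayList<>();
        Matcher matcher = DATE.matcher(normalized);
        while (matcher.find()) {
            dates.add(matcher.group());
        }
        return dates;
    }

    public static void main(String[] args) {
        // Hypothetical input: no-break space between "thursday**" and "23".
        String scraped = "deepavali thursday**\u00a023 october 2014";
        System.out.println(findDates(scraped)); // [thursday** 23 october 2014]
    }
}
```

This keeps the regex shorter, at the cost of returning matches from the normalized text, so match offsets and characters may differ from the raw scraped string.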