My Java regex doesn't work properly


I wrote the regular expression below, which is used to extract date strings:

(monday|tuesday|wednesday|thursday|friday|saturday|sunday)(\*){0,2}\s+\d{1,2}\s+(january|february|march|april|may|june|july|august|september|october|november|december)\s+\d{4} 

Before converting it to a Java regex, I tested it here: http://regexr.com?35vlm

The results look fine; it matches what I want.

The "el" object is an ArrayList of String and contains:

holiday: new year's day wednesday 1 january 2014
holiday: chinese new year friday 31 january 2014 saturday 1 february 2014
holiday: friday friday 18 april 2014
holiday: labour day thursday 1 may 2014
holiday: vesak day tuesday 13 may 2014
holiday: hari raya puasa monday 28 july 2014
holiday: national day  saturday 9 august 2014
holiday: hari raya haji  sunday* 5 october 2014
holiday: deepavali  thursday** 23 october 2014
holiday: christmas day thursday 25 december 2014

The problem is that in Java some of the dates are missed and not matched. I also tested it here, http://java-regex-tester.appspot.com/, and got the same error.

Update:

Full version of the code:

import java.io.IOException;
import java.text.DecimalFormat;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Tester {

    /**
     * @param args
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        updateSingaporeHolidayCalendar();
    }

    public static void updateSingaporeHolidayCalendar() throws IOException {
        String url = "http://www.mom.gov.sg/employment-practices/leave-and-holidays/pages/public-holidays-2014.aspx";
        Document document = Jsoup.connect(url).get();

        Elements holidays = document.select("#contentarea table tr");
        List<String> el = new ArrayList<String>();
        for (int i = 2; i < holidays.size() + 1; i++) {
            // skip odd values of i, so only every second table row is read
            if ((i & 1) == 1) continue;
            Elements threeGroup = holidays.get(i - 2).getElementsByTag("td");

            // keep only the text of the third cell in the row
            int j = 2;
            for (Element e : threeGroup) {
                if (j-- != 0) continue;
                j = 2;
                el.add(e.text());
            }
        }

        Pattern pattern = Pattern.compile("(monday|tuesday|wednesday|thursday|friday|saturday|sunday)(\\*){0,2}\\s+\\d{1,2}\\s+(january|february|march|april|may|june|july|august|september|october|november|december)\\s+\\d{4}");

        // output
        for (int k = 0; k < el.size(); k++) {
            Matcher matcher = pattern.matcher(el.get(k));
            // check all occurrences
            while (matcher.find()) {
                //System.out.print("start index: " + matcher.start());
                //System.out.print(" end index: " + matcher.end());
                System.out.println(" found: " + matcher.group());
            }
            System.out.println("holiday: " + el.get(k));
        }
    }
}

External JAR: jsoup.jar

Output:

 found: wednesday 1 january 2014
holiday: new year's day wednesday 1 january 2014
 found: saturday 1 february 2014
holiday: chinese new year friday 31 january 2014 saturday 1 february 2014
holiday: friday friday 18 april 2014
 found: thursday 1 may 2014
holiday: labour day thursday 1 may 2014
holiday: vesak day tuesday 13 may 2014
holiday: hari raya puasa monday 28 july 2014
holiday: national day  saturday 9 august 2014
 found: sunday* 5 october 2014
holiday: hari raya haji  sunday* 5 october 2014
holiday: deepavali  thursday** 23 october 2014
 found: thursday 25 december 2014
holiday: christmas day thursday 25 december 2014
holiday:
holiday:

Solved:

As @pshemo said, "The data you got from the site contains a no-break space, which can be written in HTML as &#160; and which apparently doesn't belong to the \s class. To solve this problem, replace each \s with [\s\u00a0] to include this character (written by its Unicode identifier)."

So I changed the expression to:

Pattern pattern = Pattern
        .compile("(monday|tuesday|wednesday|thursday|friday|saturday|sunday)(\\*){0,2}[\\s\u00a0]+\\d{1,2}[\\s\u00a0]+(january|february|march|april|may|june|july|august|september|october|november|december)[\\s\u00a0]+\\d{4}");

This solved the issue.
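To see the difference in isolation, here is a minimal sketch. The class name NbspWhitespaceDemo and the sample string are made up for illustration; the sample assumes that a no-break space (U+00A0) sits between "thursday**" and "23", which is a guess at what the scraped cell contained. The patterns are shortened to a single day and month to keep the example small.

```java
import java.util.regex.Pattern;

class NbspWhitespaceDemo {
    // Hypothetical sample: a no-break space (U+00A0) follows "thursday**"
    // where a normal space would be expected.
    static final String SAMPLE = "thursday**\u00a023 october 2014";

    // Shortened form of the original expression: plain \s between the tokens.
    static final Pattern PLAIN = Pattern.compile(
            "thursday(\\*){0,2}\\s+\\d{1,2}\\s+october\\s+\\d{4}");

    // Same expression with U+00A0 added to each whitespace class.
    static final Pattern WITH_NBSP = Pattern.compile(
            "thursday(\\*){0,2}[\\s\u00a0]+\\d{1,2}[\\s\u00a0]+october[\\s\u00a0]+\\d{4}");

    // True if the pattern occurs anywhere in the text.
    static boolean found(Pattern p, String text) {
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println("plain whitespace class matches: " + found(PLAIN, SAMPLE));
        System.out.println("class with no-break space matches: " + found(WITH_NBSP, SAMPLE));
    }
}
```

With this sample the first pattern finds nothing and the second one finds the date. By default, \s in java.util.regex covers only space, tab, newline, vertical tab, form feed and carriage return.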

The data you got from the site contains a no-break space, which can be written in HTML as &#160; and which apparently doesn't belong to the \\s class. To solve this problem, replace each \\s with [\\s\u00a0] to include this character (written by its Unicode identifier).

So your regex can be:

Pattern pattern = Pattern
        .compile("(monday|tuesday|wednesday|thursday|friday|saturday|sunday)(\\*){0,2}[\\s\u00a0]+\\d{1,2}[\\s\u00a0]+(january|february|march|april|may|june|july|august|september|october|november|december)[\\s\u00a0]+\\d{4}");
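When a pattern works in an online tester but misses text scraped from a page, it can help to make invisible characters visible before changing the regex. The helper below is a hypothetical sketch, not part of the original code: the class name HiddenCharInspector and the method escapeNonAscii are invented here, and the sample input assumes a no-break space after "sunday*".

```java
class HiddenCharInspector {
    // Returns a copy of the text in which every char above the ASCII range
    // is replaced by a visible backslash-u escape with four hex digits.
    static String escapeNonAscii(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c > 127) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed input: a no-break space between "sunday*" and "5".
        String cell = "hari raya haji sunday*\u00a05 october 2014";
        System.out.println(escapeNonAscii(cell));
    }
}
```

Calling something like this on each el.get(k) before matching would show where the page uses U+00A0 instead of an ordinary space.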
