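I'm building a local events calendar that takes RSS feeds and website scrapes and extracts event dates from them.
I previously asked here how to extract dates from text in PHP, and at the time received an excellent answer from MarcDefiant: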
```php
function parse_date_tokens($tokens) {
    # only try to extract a date if we have 2 or more tokens
    if (!is_array($tokens) || count($tokens) < 2) return false;
    return strtotime(implode(" ", $tokens));
}

function extract_dates($text) {
    static $patterns = Array(
        '/^[0-9]+(st|nd|rd|th|)?$/i', # day
        '/^(Jan(uary)?|Feb(ruary)?|Mar(ch)?|etc)$/i', # month
        '/^20[0-9]{2}$/', # year
        '/^of$/' # words
    );

    # defines which of the above patterns aren't actually part of a date
    static $drop_patterns = Array(
        false,
        false,
        false,
        true
    );

    $tokens = Array();
    $result = Array();
    $text = str_word_count($text, 1, '0123456789'); # get all words in text

    # iterate words and search for matching patterns
    foreach ($text as $word) {
        $found = false;
        foreach ($patterns as $key => $pattern) {
            if (preg_match($pattern, $word)) {
                if (!$drop_patterns[$key]) {
                    $tokens[] = $word;
                }
                $found = true;
                break;
            }
        }

        if (!$found) {
            $result[] = parse_date_tokens($tokens);
            $tokens = Array();
        }
    }
    $result[] = parse_date_tokens($tokens);

    return array_filter($result);
}

# test
$texts = Array(
    "The focus of the seminar, on Saturday 2nd February 2013 will be [...]",
    "Valentines Special @ The Radisson, Feb 14th",
    "On Friday the 15th of February, a special Hollywood themed [...]",
    "Symposium on Childhood Play on Friday, February 8th",
    "Hosting a craft workshop March 9th - 11th in the old [...]"
);

$dates = extract_dates(implode(" ", $texts));

echo "Dates: \n";
foreach ($dates as $date) {
    echo " " . date('d.m.Y H:i:s', $date) . "\n";
}
```
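However, this solution has some drawbacks. First and foremost, it cannot match date ranges.
I'm now looking for a more sophisticated solution that can extract dates, times and date ranges from sample text.
What is the best approach? It looks like I'm falling back on a series of regex statements, run one after another, to catch these cases. I can't see a better way to capture date ranges, but I know there must be one. Are there any libraries dedicated to date parsing in PHP?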
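To make the regex approach concrete, here is a minimal sketch of a single pattern that captures "month day - [month] day [year]" ranges. The names `MONTHS` and `extract_month_day_ranges`, the `$defaultYear` parameter, and the rule that a range such as "Dec 30th - Jan 3rd, 2014" starts in the previous year are all assumptions made for illustration. It deliberately ignores the "day - day month" and "Jan-April 2014" forms.

```php
<?php
// Hypothetical helper for illustration only; not part of MarcDefiant's answer.
const MONTHS = [
    'jan' => 1, 'feb' => 2, 'mar' => 3, 'apr' => 4, 'may' => 5, 'jun' => 6,
    'jul' => 7, 'aug' => 8, 'sep' => 9, 'oct' => 10, 'nov' => 11, 'dec' => 12,
];

// Returns a list of [start, end] pairs formatted as Y-m-d strings.
function extract_month_day_ranges(string $text, int $defaultYear): array
{
    $month = '(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?'
           . '|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)';
    $day = '(\d{1,2})(?:st|nd|rd|th)?';
    // Groups: 1 start month, 2 start day, 3 end month (optional), 4 end day, 5 year (optional).
    // The separator is a hyphen or an en dash.
    $pattern = "/\\b($month)\\s+$day\\s*[-\u{2013}]\\s*(?:($month)\\s+)?$day(?:,?\\s+(20\\d{2})\\b)?/iu";

    if (!preg_match_all($pattern, $text, $matches, PREG_SET_ORDER)) {
        return [];
    }

    $ranges = [];
    foreach ($matches as $m) {
        $startMonth = MONTHS[strtolower(substr($m[1], 0, 3))];
        $endMonth = $m[3] !== '' ? MONTHS[strtolower(substr($m[3], 0, 3))] : $startMonth;
        // A trailing unmatched group is absent from $m, hence the empty() check.
        $endYear = !empty($m[5]) ? (int) $m[5] : $defaultYear;
        // Assumption: if the start month is later than the end month, the range spans New Year.
        $startYear = $startMonth > $endMonth ? $endYear - 1 : $endYear;

        if (!checkdate($startMonth, (int) $m[2], $startYear)
            || !checkdate($endMonth, (int) $m[4], $endYear)) {
            continue; // skip impossible dates such as "Feb 30th"
        }
        $ranges[] = [
            sprintf('%04d-%02d-%02d', $startYear, $startMonth, (int) $m[2]),
            sprintf('%04d-%02d-%02d', $endYear, $endMonth, (int) $m[4]),
        ];
    }
    return $ranges;
}

echo json_encode(extract_month_day_ranges('Dec 30th - Jan 3rd, 2014', 2014)), "\n";
// prints [["2013-12-30","2014-01-03"]]
```

Passing the default year explicitly, rather than letting `strtotime()` fill it in from the current date, keeps the result reproducible when the text omits the year.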
As requested, here are some sample dates and date ranges:
```php
$dates = [
    " Saturday 28th December",
    "2013/2014",
    "Friday 10th of January",
    "Thursday 19th December",
    " on Sunday the 15th December at 1 p.m",
    "On Saturday December 14th ",
    "On Saturday December 21st at 7.30pm",
    "Saturday, March 21st, 9.30 a.m.",
    "Jan-April 2014",
    "January 21st - Jan 24th 2014",
    "Dec 30th - Jan 3rd, 2014",
    "February 14th-16th, 2014",
    "Mon 14 - Wed 16 April, 12 - 2pm",
    "Sun 13 April, 8pm",
    "Mon 21 - Wed 23 April",
    "Friday 25 April, 10 – 3pm",
    "The focus of the seminar, on Saturday 2nd February 2013 will be [...]",
    "Valentines Special @ The Radisson, Feb 14th",
    "On Friday the 15th of February, a special Hollywood themed [...]",
    "Symposium on Childhood Play on Friday, February 8th",
    "Hosting a craft workshop March 9th - 11th in the old [...]"
];
```
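The function I'm currently using (not the one above) is about 90% accurate. It can capture date ranges, but struggles when a time is also specified. It relies on a list of regular expressions and is very convoluted.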
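One way to reduce the interference between times and date ranges is to pull the time (or time range) out of the string first and hand only the remainder to the date logic. The sketch below is a rough illustration under several assumptions: the function names are hypothetical, only 12-hour times with an am/pm marker are recognised, and a start time without its own marker (as in "10 – 3pm") inherits the end marker unless that would put the start after the end.

```php
<?php
// Hypothetical helpers for illustration; the names are not from the original post.

// Convert a 12-hour clock reading to minutes since midnight.
function to_minutes(int $hour, int $minute, string $meridiem): int
{
    return (($hour % 12) + (strtolower($meridiem) === 'p' ? 12 : 0)) * 60 + $minute;
}

function format_minutes(int $minutes): string
{
    return sprintf('%02d:%02d', intdiv($minutes, 60), $minutes % 60);
}

// Returns ['start' => 'HH:MM', 'end' => 'HH:MM'|null, 'rest' => string], or null if no time is found.
function split_off_time(string $text): ?array
{
    $time = '(\d{1,2})(?:[.:](\d{2}))?';   // "7", "7.30" or "7:30"
    $mer  = '([ap])\.?\s?m\b\.?';          // "am", "p.m", "a.m."
    $dash = "[-\u{2013}]";                 // hyphen or en dash
    // Groups 1-3: optional range start (hour, minute, marker); groups 4-6: the time itself.
    $pattern = "/(?:\\bat\\s+)?\\b(?:$time\\s*(?:$mer\\s*)?$dash\\s*)?$time\\s*$mer/iu";

    if (!preg_match($pattern, $text, $m)) {
        return null;
    }

    $last = to_minutes((int) $m[4], (int) $m[5], $m[6]);
    $rest = trim(preg_replace($pattern, '', $text, 1), ' ,');

    if ($m[1] === '') {
        // A single time such as "8pm".
        return ['start' => format_minutes($last), 'end' => null, 'rest' => $rest];
    }

    // Assumption: a bare start hour borrows the end marker ("12 - 2pm" means 12pm to 2pm)...
    $first = to_minutes((int) $m[1], (int) $m[2], $m[3] !== '' ? $m[3] : $m[6]);
    if ($m[3] === '' && $first > $last) {
        // ...unless that runs backwards, in which case the other marker is tried ("10 - 3pm").
        $first = to_minutes((int) $m[1], (int) $m[2], strtolower($m[6]) === 'p' ? 'a' : 'p');
    }
    return ['start' => format_minutes($first), 'end' => format_minutes($last), 'rest' => $rest];
}
```

With this split, "Mon 14 - Wed 16 April, 12 - 2pm" yields a 12:00 to 14:00 time range and leaves "Mon 14 - Wed 16 April" for the date-range patterns, so the two kinds of dash no longer compete in one expression.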
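Update: 6th January 2014
I'm working on code that does this, sticking with my original approach of a series of regular expressions run in sequence. I think I'm close to a working solution that can extract almost any date, time range or format from a piece of text. When I'm finished I'll post it here as an answer.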