I'm finally parsing Wikipedia's wikitext. I have the following type of text here:
{{Airport-list|the Solomon Islands}}
* '''AGAF''' (AFT) – [[Afutara Airport]] – [[Afutara]]
* '''AGAR''' (RNA) – [[Ulawa Airport]] – [[Arona]], [[Ulawa Island]]
* '''AGAT''' (ATD) – [[Uru Harbour]] – [[Atoifi]], [[Malaita]]
* '''AGBA''' – [[Barakoma Airport]] – [[Barakoma]]
I need to retrieve, in a single array, all lines that start with the pattern
* '''
I think a regular expression is in order here, but I'm really shaky on regular expressions.
In another example, I have the following text:
{{otheruses}}
{{Infobox Settlement
|official_name = Doha
|native_name = {{rtl-lang|ar|الدوحة}} ''ad-Dawḥa''
|image_skyline = Doha Sheraton.jpg
|imagesize =
|image_caption = West Bay at night
|image_map = QA-01.svg
|mapsize = 100px
|map_caption = Location of the municipality of Doha within [[Qatar]].
|pushpin_map =
|pushpin_label_position =
|pushpin_mapsize =
|subdivision_type = [[Countries of the world|Country]]
|subdivision_name = [[Qatar]]
|subdivision_type1 = [[Municipalities of Qatar|Municipality]]
|subdivision_name1 = [[Ad Dawhah]]
|established_title = Established
|established_date = 1850
|area_total_km2 = 132
|area_total_sq_mi = 51
|area_land_km2 =
|area_land_sq_mi =
|area_water_km2 =
|area_water_sq_mi =
|area_water_percent =
|area_urban_km2 =
|area_urban_sq_mi =
|area_metro_km2 =
|area_metro_sq_mi =
|population_as_of = 2004
|population_note =
|population_footnotes = [http://www.planning.gov.qa/Qatar-Census-2004/Flash/introduction.html Qatar 2004 Census]
|population_total = 339847
|population_metro = 998651
|population_density_km2 = 2574
|population_density_sq_mi = 6690
|latd=25 |latm=17 | lats=12 |latNS=N
|longd=51|longm=32 | longs=0| longEW=E
|coordinates_display = inline,title
|coordinates_type = type:city_region:QA
|timezone= [[Arab Standard Time|AST]]
|utc_offset = +3
|website =
|footnotes =
}}
'''Doha''' ({{lang-ar|الدوحة}}, ''{{transl|ar|ad-Dawḥa}}'' or ''{{unicode|ad-Dōḥa}}'') is the [[capital city]] of [[Qatar]]. It has a population of 400,051 according to the 2005 census,[http://www.hotelrentalgroup.com/Qatar/Sheraton%20Doha%20Hotel%20&%20Resort.htm Sheraton Doha Hotel & Resort | Hotel discount bookings in Qatar] and is located in the [[Ad Dawhah]] municipality on the [[Persian Gulf]]. Doha is Qatar's largest city, with over 80% of the nation's population residing in Doha or its surrounding [[suburbs]], and is also the economic center of the country.
It is also the seat of government of Qatar, which is ruled by [[Sheikh Hamad bin Khalifa Al Thani]]–the current ruling Emir of Qatar.
I need to extract the infobox here. The infobox consists of all text that starts at the first occurrence of
{{Infobox Settlement
and ends with the first occurrence of
}}
I'm totally lost when it comes to regular expressions and I could use some help here. I'm using PHP.
I've been battling for 40 hours and I can't get the stupid regular expression to work right :( So far I just have this:
{{Infobox[^\b(\r|\n)}}(\r|\n)\b]*[\b(\r|\n)}}(\r|\n)(\r|\n)\b]
But it's not working. I want it to match all the text that starts with {{Infobox and ends with \n}}\n
I'm using PHP and can't get this to work :( It just returns the first occurrence of }}, ignoring the fact that I want it to match a }} preceded by a linefeed. Help, please, before I waste more of my sanity on this :'(
I need to extract the infobox ...
Try this, this time making sure dotall mode is enabled:
\{\{Infobox.*?(?=\}\} )
And again, explanation for that:
(?xs) # x=comment mode, s=dotall mode
\{\{ # two opening braces (special char, so needs escaping here.)
Infobox # literal text
.*? # any char (including newlines), non-greedily match zero or more times.
(?= # begin positive lookahead
\}\} # two closing braces
# literal text
) # end positive lookahead
This will match up to (but excluding) the ending expression. If you need the match to include the ending, remove the lookahead wrapper and keep just its contents.
Update, based on comment to answer:
\{\{Infobox.*?(?=\n\}\}\n)
Same as above, but the lookahead looks for two closing braces on their own line.
To optionally allow the comment also, use:
\{\{Infobox.*?(?=\n\}\}(?: )?\n)
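Since the question mentions PHP, here is a minimal sketch of how the "braces on their own line" version might be used with `preg_match`. The sample text is a shortened copy of the Doha example, the variable names are mine, and it assumes `\n` line endings and an infobox that contains no nested template closing on a line of its own.

```php
<?php
// Shortened sample based on the question's wikitext (assumption: "\n" line endings).
$wikitext = "{{otheruses}}\n"
    . "{{Infobox Settlement\n"
    . "|official_name = Doha\n"
    . "|native_name = {{rtl-lang|ar|الدوحة}} ''ad-Dawḥa''\n"
    . "|utc_offset = +3\n"
    . "}}\n"
    . "'''Doha''' is the [[capital city]] of [[Qatar]].\n";

// s = dotall, so . also matches newlines.
// The lookahead stops the lazy match just before the first line that is exactly "}}".
$pattern = '/\{\{Infobox.*?(?=\n\}\}\n)/s';

$infobox = null;
if (preg_match($pattern, $wikitext, $m)) {
    // The lookahead is not part of the match, so re-append the closing braces.
    $infobox = $m[0] . "\n}}";
}

echo $infobox, "\n";
```

The inline `{{rtl-lang|...}}` template does not end the match early, because its closing braces are not on a line of their own.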
MediaWiki is open-source. Have a look at their source code ... ;-)
I think the best way is to merge all lines into one string, especially for the infobox.
Then something along the lines of
$reg = "/\n(\* '''[^\n]*)/";
for the first part (after a newline, everything that starts with * ''' up to, but not including, the next newline).
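A rough sketch of that idea, assuming the page is already available as an array of lines (the `$lines` array below is made-up sample input). Note that PHP's `preg_*` functions expect delimiters around the pattern, and that a leading newline is prepended so that a match on the very first line is not missed.

```php
<?php
// Hypothetical input: the page already split into lines, e.g. by file().
$lines = [
    "{{Airport-list|the Solomon Islands}}",
    "* '''AGAF''' (AFT) – [[Afutara Airport]] – [[Afutara]]",
    "* '''AGBA''' – [[Barakoma Airport]] – [[Barakoma]]",
];

// Merge everything into one string; the leading "\n" lets line 1 match as well.
$text = "\n" . implode("\n", $lines);

// A newline, then (captured) "* '''" followed by everything up to the next newline.
$reg = "/\n(\\* '''[^\n]*)/";
preg_match_all($reg, $text, $matches);

// Capture group 1 holds each matching line without its leading newline.
$entries = $matches[1];
print_r($entries);
```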
And for the second part I'm not quite sure right now, but this is a nice place to play around a bit: http://www.solmetra.com/scripts/regex/index.php
And here is a short reference for regular expression syntax: http://www.regular-expressions.info/reference.html
I need to retrieve all lines in a single array which start with the pattern
* '''
Enable multiline mode and ensure dotall mode is disabled, and use this:
^\* '''.*$
That expression dissected is:
(?xm-s) # Flags:
# x enables comment mode (spaces ignored, hashes start comments)
# m enables multiline mode (^ and $ match at line boundaries)
# -s disables dotall (so . does not match newline)
^ # start of line
\* # literal asterisk
[ ] # literal space (needs brackets in comment mode, but not otherwise)
''' # three literal apostrophes
.* # any character (excluding newline), greedily matched zero or many times.
$ # end of line
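In PHP this could look roughly like the sketch below, using `preg_match_all` with the `m` modifier and no `s` modifier. The sample text is copied from the question; the variable names are only for illustration.

```php
<?php
// Sample wikitext taken from the question.
$wikitext = <<<'WIKI'
{{Airport-list|the Solomon Islands}}
* '''AGAF''' (AFT) – [[Afutara Airport]] – [[Afutara]]
* '''AGAR''' (RNA) – [[Ulawa Airport]] – [[Arona]], [[Ulawa Island]]
* '''AGBA''' – [[Barakoma Airport]] – [[Barakoma]]
WIKI;

// m: ^ and $ match at line boundaries. No s flag, so . stops at newlines.
preg_match_all("/^\\* '''.*$/m", $wikitext, $matches);

// $matches[0] is the array of full matches, one entry per matching line.
$lines = $matches[0];
print_r($lines);
```

If the input uses `\r\n` line endings, each match will keep a trailing `\r`, so you may want to normalize line endings first.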