R/R Regular Expression Functions

Wikibooks, the free textbook collection
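Regular expressions

  • A regular expression can be thought of as a combination of literals and metacharacters.
  • By analogy with natural language, literals are the words of the text and metacharacters are its grammar.
  • Regular expressions have a rich set of metacharacters.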

Literals

The simplest pattern consists only of literals. The literal “nuclear” would match the following lines:

Ooh. I just learned that to keep myself alive after a nuclear blast! All I have to do is milk some rats then drink the milk. Aweosme. :}

Laozi says nuclear weapons are mas macho

Chaos in a country that has nuclear weapons -- not good.

my nephew is trying to teach me nuclear physics, or possibly just trying to show me how smart he is. so I’ll be proud of him [which I am].

lol if you ever say "nuclear" people immediately think DEATH by radiation LOL

The literal “Obama” would match the following lines:

Politics r dum. Not 2 long ago Clinton was sayin Obama was crap n now she sez vote 4 him n unite? WTF? Screw em both + Mcain. Go Ron Paul!

Clinton conceeds to Obama but will her followers listen??

Are we sure Chelsea didn’t vote for Obama?

thinking ... Michelle Obama is terrific!

jetlag..no sleep...early mornig to starbux..Ms. Obama was moving
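
As a rough illustration in R, the sketch below tests a few of the sample lines against literal patterns with grepl, a function covered later in this chapter. The vector name tweets is made up for this example.

```r
# Two of the sample lines above, plus one that mentions neither word
tweets <- c(
  "Chaos in a country that has nuclear weapons -- not good.",
  "thinking ... Michelle Obama is terrific!",
  "i like basketballs"
)

# A pattern made only of literals matches wherever that exact text occurs
has_nuclear <- grepl("nuclear", tweets)
has_obama   <- grepl("Obama", tweets)

# Matching is case-sensitive by default, so "obama" finds nothing here
has_lower_obama <- grepl("obama", tweets)

print(has_nuclear)
print(has_obama)
print(has_lower_obama)
```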

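Regular expressions

  • The simplest pattern consists only of literals; it matches any text that contains that exact sequence of characters anywhere.
  • But what if we only want the word “Obama”, or sentences that end in “Clinton”, “clinton”, or “clinto”?

We need a way to express:

  • whitespace word boundaries
  • sets of literals
  • the beginning and end of a line
  • alternatives (such as “war” or “peace”)

Metacharacters solve these problems.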


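Metacharacters

Some metacharacters represent the start of a line. ^ matches at the start of a line, so

^i think

would match the following lines: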

i think we all rule for participating
i think i have been outed
i think this will be quite fun actually
i think i need to go to work
i think i first saw zombo in 1999.

Metacharacters

$ represents the end of a line, so

morning$

would match the following lines:

well they had something this morning
then had to catch a tram home in the morning
dog obedience school in the morning
and yes happy birthday i forgot to say it earlier this morning
I walked in the rain this morning
good morning
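
The two anchors can be tried out in R with grep, described later in this chapter. This is a minimal sketch; the vector lines_in and its contents are invented for illustration.

```r
# Sample lines; only some of them start with "i think" or end with "morning"
lines_in <- c(
  "i think we all rule for participating",
  "well i think so",
  "good morning",
  "morning has broken"
)

# ^ anchors the pattern to the start of the line
starts_with_i_think <- grep("^i think", lines_in, value = TRUE)

# $ anchors the pattern to the end of the line
ends_with_morning <- grep("morning$", lines_in, value = TRUE)

print(starts_with_i_think)
print(ends_with_morning)
```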

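Character classes with [ ]

We can list a set of characters that are acceptable at a given position in the match.

[Bb][Uu][Ss][Hh]

would match the following lines: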

The democrats are playing, "Name the worst thing about Bush!"
I smelled the desert creosote bush, brownies, BBQ chicken
BBQ and bushwalking at Molonglo Gorge
Bush TOLD you that North Korea is part of the Axis of Evil
I’m listening to Bush - Hurricane (Album Version)

^[Ii] am

would match the following lines:

i am so angry at my boyfriend i can’t even bear to look at him

i am boycotting the apple store

I am twittering from iPhone

I am a very vengeful person when you ruin my sweetheart.

I am so over this. I need food. Mmmm bacon...

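Similarly, you can specify a range of letters, such as [a-z] or [a-zA-Z]; note that the order of the ranges does not matter.

^[0-9][a-zA-Z]

would match the following lines: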

7th inning stretch
2nd half soon to begin. OSU did just win something
3am - cant sleep - too hot still.. :(
5ft 7 sent from heaven
1st sign of starvagtion

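When “^” is the first character inside [ ], it negates the class, so the class matches any character that is not listed.

[^?.]$

would match the following lines: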

i like basketballs
6 and 9
dont worry... we all die anyway!
Not in Baghdad
helicopter under water? hmmm
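
The three character-class patterns above can be checked in R along these lines. This is only a sketch; the vector x is made up, reusing some of the sample lines.

```r
x <- c(
  "Bush TOLD you that North Korea is part of the Axis of Evil",
  "BBQ and bushwalking at Molonglo Gorge",
  "7th inning stretch",
  "helicopter under water? hmmm",
  "is it over?"
)

# Each bracket accepts either case of one letter, so any capitalization of "bush" matches
any_case_bush <- grepl("[Bb][Uu][Ss][Hh]", x)

# A digit at the start of the line, immediately followed by a letter
digit_then_letter <- grepl("^[0-9][a-zA-Z]", x)

# Lines whose last character is neither "?" nor "."
no_final_punct <- grepl("[^?.]$", x)

print(any_case_bush)
print(digit_then_letter)
print(no_final_punct)
```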

More metacharacters

“.” matches any single character.

9.11

would match the following lines:

its stupid the post 9-11 rules
if any 1 of us did 9/11 we would have been caught in days.
NetBios: scanning ip 203.169.114.66
Front Door 9:11:46 AM
Sings: 0118999881999119725...3 !

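In a regular expression, | is not a pipe; it means logical “or”. It joins two expressions, called alternatives, and a line matches if either alternative matches.

flood|fire

would match the following lines: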

is firewire like usb on none macs?
the global flood makes sense within the context of the bible
yeah ive had the fire on tonight
... and the floods, hurricanes, killer heatwaves, rednecks, gun nuts, etc.

Any number of alternatives can be included.

flood|earthquake|hurricane|coldfire

would match the following lines:

Not a whole lot of hurricanes in the Arctic.
We do have earthquakes nearly every day somewhere in our State
hurricanes swirl in the other direction
coldfire is STRAIGHT!
’cause we keep getting earthquakes

The alternatives can be full expressions, not just literals.

^[Gg]ood|[Bb]ad

would match the following lines:

good to hear some good knews from someone here
Good afternoon fellow american infidels!
good on you-what do you drive?
Katie... guess they had bad experiences...
my middle name is trouble, Miss Bad News

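When ^ and | are used together, parentheses are needed to group the alternatives; otherwise ^ applies only to the first alternative.

^([Gg]ood|[Bb]ad)

would match the following lines: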

bad habbit
bad coordination today
good, becuase there is nothing worse than a man in kinky underwear
Badcop, its because people want to use drugs
Good Monday Holiday
Good riddance to Limey
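
The difference between the ungrouped and grouped forms can be seen side by side in R. The following is a small sketch with an invented vector x; in the ungrouped pattern the ^ anchor belongs only to the first alternative.

```r
x <- c(
  "good to hear some good knews from someone here",
  "my middle name is trouble, Miss Bad News",
  "bad habbit",
  "not good"
)

# Without parentheses: "good" at the start of the line, OR "bad" anywhere
ungrouped <- grepl("^[Gg]ood|[Bb]ad", x)

# With parentheses: either word, but only at the start of the line
grouped <- grepl("^([Gg]ood|[Bb]ad)", x)

print(ungrouped)
print(grouped)
```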

? marks the preceding expression as optional: it matches zero or one time.

[Gg]eorge( [Ww]\.)? [Bb]ush

would match the following lines:

i bet i can spell better than you and george bush combined
BBC reported that President George W. Bush claimed God told him to invade I
a bird in the hand is worth two george bushes

One thing to note

. is a metacharacter; to match a literal dot, escape it with a backslash (\).

[Gg]eorge( [Ww]\.)? [Bb]ush
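
When this pattern is written as an R string, the backslash itself has to be escaped, so \. becomes "\\.". The sketch below, using a made-up vector x, compares the escaped and unescaped forms.

```r
x <- c("george bush", "George W. Bush", "George Wx Bush")

# Escaped dot: the optional group must be " W." or " w." exactly
escaped <- grepl("[Gg]eorge( [Ww]\\.)? [Bb]ush", x)

# Unescaped dot: "." accepts any character, so " Wx" is accepted as well
unescaped <- grepl("[Gg]eorge( [Ww].)? [Bb]ush", x)

print(escaped)
print(unescaped)
```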

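* and +

* and + are metacharacters that indicate repetition: * means any number of times, including zero, and + means at least once. To match a literal parenthesis, escape it with a backslash.

\(.*\)

would match the following lines: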

anyone wanna chat? (24, m, germany)
hello, 20.m here... ( east area + drives + webcam )
(he means older men)
()

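The following pattern looks for at least one digit, a space, any number of characters, and then at least one digit again:

[0-9]+ (.*)[0-9]+

would match the following lines: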

working as MP here 720 MP battallion, 42nd birgade
so say 2 or 3 years at colleage and 4 at uni makes us 23 when and if we fin
it went down on several occasions for like, 3 or 4 *days*
Mmmm its time 4 me 2 go 2 bed
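
Both repetition patterns can be tried in R as follows. This sketch uses the escaped form \(.*\), which is the form that matches text enclosed in literal parentheses; the vector x is invented for illustration.

```r
x <- c(
  "anyone wanna chat? (24, m, germany)",
  "()",
  "no parentheses here",
  "Mmmm its time 4 me 2 go 2 bed",
  "only 1 number"
)

# Literal "(", then any number of characters (possibly none), then literal ")"
in_parens <- grepl("\\(.*\\)", x)

# One or more digits, a space, anything, then one or more digits again
two_numbers <- grepl("[0-9]+ (.*)[0-9]+", x)

print(in_parens)
print(two_numbers)
```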

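Metacharacters: { }

{ } is an interval quantifier; it sets the minimum and maximum number of times an expression may match. To find “Bush” followed by “debate” with one to five words in between, use

[Bb]ush( +[^ ]+){1,5} +debate

which would match the following lines: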

Bush has historically won all major debates he’s done.
in my view, Bush doesn’t need these debates..
bush doesn’t need the debates? maybe you are right
That’s what Bush supporters are doing about the debate.
Felix, I don’t disagree that Bush was poorly prepared for the debate.
indeed, but still, Bush should have taken the debate more seriously.
Keep repeating that Bush smirked and scowled during the debate

{m,n} means at least m but no more than n matches; {m} means exactly m matches; {m,} means at least m matches.
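
An interval quantifier can be exercised in R as sketched below. The pattern used here is [Bb]ush( +[^ ]+){1,5} +debate, in which each repetition of the group is one space-separated word; the vector x is made up.

```r
# "Bush", then one to five words, then "debate"
pattern <- "[Bb]ush( +[^ ]+){1,5} +debate"

x <- c(
  "in my view, Bush doesn't need these debates..",  # three words in between
  "Bush debate",                                    # no word in between
  "Bush one two three four five six debate"         # six words in between
)

within_five_words <- grepl(pattern, x)
print(within_five_words)
```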

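Metacharacters: ( )

  • In regular expressions, parentheses do more than limit the scope of alternatives; they also remember the text matched by the enclosed subexpression.
  • The remembered text is referred to with \1, \2, and so on.

The pattern below begins with a space and looks for a word that is immediately repeated:

 +([a-zA-Z]+) +\1 +

It would match the following lines: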

time for bed, night night twitter!
blah blah blah blah
my tattoo is so so itchy today
i was standing all all alone against the world outside...
hi anybody anybody at home
estudiando css css css css.... que desastritooooo
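
In an R string the backreference is written "\\1". The following sketch, with an invented vector x, checks the repeated-word pattern; note that the pattern starts with a space, so a repeated word at the very beginning of a line is not enough on its own.

```r
# Spaces, a word, spaces, the same word again, spaces
pattern <- " +([a-zA-Z]+) +\\1 +"

x <- c(
  "time for bed, night night twitter!",
  "blah blah blah blah",
  "my tattoo is so so itchy today",
  "nothing repeated here",
  "so so"
)

has_repeated_word <- grepl(pattern, x)
print(has_repeated_word)
```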

* is greedy: it always matches the longest possible string that satisfies the regular expression.

^s(.*)s

would match the following lines:

sitting at starbucks
setting up mysql and rails
studying stuff for the exams
spaghetti with marshmallows
stop fighting with crackers
sore shoulders, stupid ergonomics

The greediness of * can be turned off by following it with ?, as in

^s(.*?)s$
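
To see what greedy and non-greedy matching actually select, the matched text can be pulled out in R with regexpr and regmatches (both covered later in this chapter). This is a small sketch on one of the sample lines. Note that once $ is added, as in ^s(.*?)s$, the match has to reach the end of the line anyway, so the difference only shows without the trailing anchor.

```r
x <- "sitting at starbucks"

# Greedy: .* runs as far as possible, so the match ends at the last "s"
greedy <- regmatches(x, regexpr("^s(.*)s", x))

# Non-greedy: .*? stops as early as possible, at the next "s"
lazy <- regmatches(x, regexpr("^s(.*?)s", x))

print(greedy)
print(lazy)
```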

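Summary

  • Regular expressions are used in many different programming languages, not just R.
  • Regular expressions are composed of literals and metacharacters, and represent sets or classes of strings.
  • Extracting data with regular expressions is a convenient way to work with text.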

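Regular expression functions

The primary R functions for working with regular expressions are:

  • grep, grepl: search a character vector for matches to a regular expression; grep returns the indices of the matching elements, and grepl returns a TRUE/FALSE vector indicating whether each element matches
  • regexpr, gregexpr: search a character vector for matches to a regular expression and return the starting position and length of each match
  • sub, gsub: search a character vector for matches and replace them with another string
  • regexec: like regexpr, but also returns the locations of parenthesized sub-expressions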

grep

The Baltimore City homicides dataset:

> homicides <- readLines("homicides.txt")
> homicides[1]
[1] "39.311024, -76.674227, iconHomicideShooting, ’p2’, ’<dl><dt>Leon
Nelson</dt><dd class=\"address\">3400 Clifton Ave.<br />Baltimore, MD
21216</dd><dd>black male, 17 years old</dd>
<dd>Found on January 1, 2007</dd><dd>Victim died at Shock
Trauma</dd><dd>Cause: shooting</dd></dl>’"

> homicides[1000]
[1] "39.33626300000, -76.55553990000, icon_homicide_shooting, ’p1200’,...
> length(grep("iconHomicideShooting", homicides))
[1] 228
> length(grep("iconHomicideShooting|icon_homicide_shooting", homicides))
[1] 1003
> length(grep("Cause: shooting", homicides))
[1] 228
> length(grep("Cause: [Ss]hooting", homicides))
[1] 1003
> length(grep("[Ss]hooting", homicides))
[1] 1005
> i <- grep("[cC]ause: [Ss]hooting", homicides)
> j <- grep("[Ss]hooting", homicides)
> str(i)
 int [1:1003] 1 2 6 7 8 9 10 11 12 13 ...
> str(j)
 int [1:1005] 1 2 6 7 8 9 10 11 12 13 ...
> setdiff(i, j)
integer(0)
> setdiff(j, i)
[1] 318 859
> homicides[859]
[1] "39.33743900000, -76.66316500000, icon_homicide_bluntforce,
’p914’, ’<dl><dt><a href=\"http://essentials.baltimoresun.com/
micro_sun/homicides/victim/914/steven-harris\">Steven Harris</a>
</dt><dd class=\"address\">4200 Pimlico Road<br />Baltimore, MD 21215
</dd><dd>Race: Black<br />Gender: male<br />Age: 38 years old</dd>
<dd>Found on July 29, 2010</dd><dd>Victim died at Scene</dd>
<dd>Cause: Blunt Force</dd><dd class=\"popup-note\"><p>Harris was
found dead July 22 and ruled a shooting victim; an autopsy
subsequently showed that he had not been shot,...</dd></dl>’"

By default, grep returns the indices of the elements that match.

> grep("^New", state.name)
[1] 29 30 31 32

Setting value = TRUE returns the actual elements of the character vector that match.

> grep("^New", state.name, value = TRUE)
[1] "New Hampshire" "New Jersey"    "New Mexico"    "New York"

grepl returns a logical vector indicating which elements match.

> grepl("^New", state.name)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE


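regexpr

Limitations of grep

  • grep tells you which elements of the vector match, but not where in each string the match occurs or what text was matched.
  • regexpr returns the starting position of the match within each string, together with its length.
  • regexpr reports only the first match in each string; gregexpr reports all matches.

> homicides[1]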
[1] "39.311024, -76.674227, iconHomicideShooting, ’p2’, ’<dl><dt>Leon
Nelson</dt><dd class=\"address\">3400 Clifton Ave.<br />Baltimore,
MD 21216</dd><dd>black male, 17 years old</dd>
<dd>Found on January 1, 2007</dd><dd>Victim died at Shock
Trauma</dd><dd>Cause: shooting</dd></dl>’"
> homicides[954]
[1] "39.30677400000, -76.59891100000, icon_homicide_shooting, ’p816’,
’<dl><dd class=\"address\">1400 N Caroline St<br />Baltimore, MD 21213</dd>
<dd>Race: Black<br />Gender: male<br />Age: 29 years old</dd>
<dd>Found on March  3, 2010</dd><dd>Victim died at Scene</dd>
<dd>Cause: Shooting</dd><dd class=\"popup-note\"><p>Wheeler\\’s body
was&nbsp;found on the grounds of Dr. Bernard Harris Sr.&nbsp;Elementary
School</p></dd></dl>’"
> regexpr("<dd>[Ff]ound(.*)</dd>", homicides[1:10])
 [1] 177 178 188 189 178 182 178 187 182 183
attr(,"match.length")
 [1] 93 86 89 90 89 84 85 84 88 84
attr(,"useBytes")
[1] TRUE
> substr(homicides[1], 177, 177 + 93 - 1)
[1] "<dd>Found on January 1, 2007</dd><dd>Victim died at Shock
 Trauma</dd><dd>Cause: shooting</dd>"

Adding ? after * turns off greedy matching:

> regexpr("<dd>[Ff]ound(.*?)</dd>", homicides[1:10])
 [1] 177 178 188 189 178 182 178 187 182 183
attr(,"match.length")
 [1] 33 33 33 33 33 33 33 33 33 33
attr(,"useBytes")
[1] TRUE
> substr(homicides[1], 177, 177 + 33 - 1)
[1] "<dd>Found on January 1, 2007</dd>"

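regmatches

regmatches extracts the matched text from the result of regexpr or gregexpr, so there is no need to compute substr positions by hand.

> r <- regexpr("<dd>[Ff]ound(.*?)</dd>", homicides[1:5])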
> regmatches(homicides[1:5], r)
[1] "<dd>Found on January 1, 2007</dd>" "<dd>Found on January 2, 2007</dd>"
[3] "<dd>Found on January 2, 2007</dd>" "<dd>Found on January 3, 2007</dd>"
[5] "<dd>Found on January 5, 2007</dd>"
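
gregexpr is mentioned above but not demonstrated. The sketch below contrasts it with regexpr on a made-up string modeled on the homicide records. It uses [^<]* rather than .*? to keep each match inside one element; that is a choice made for this example, not something taken from the text above.

```r
# A made-up string with two <dd> elements
x <- "<dd>Found on January 1, 2007</dd><dd>Cause: shooting</dd>"

# [^<]* matches any run of characters other than "<"
pattern <- "<dd>[^<]*</dd>"

# regexpr reports only the first match in each string
first_match <- regmatches(x, regexpr(pattern, x))

# gregexpr reports every match; regmatches then returns a list with one
# character vector per input string
all_matches <- regmatches(x, gregexpr(pattern, x))[[1]]

print(first_match)
print(all_matches)
```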

sub/gsub

sub replaces only the first match in each string; gsub replaces every match.

> x <- substr(homicides[1], 177, 177 + 33 - 1)
> x
[1] "<dd>Found on January 1, 2007</dd>"
> sub("<dd>[Ff]ound on |</dd>", "", x)
[1] "January 1, 2007</dd>"
> gsub("<dd>[Ff]ound on |</dd>", "", x)
[1] "January 1, 2007"
> r <- regexpr("<dd>[Ff]ound(.*?)</dd>", homicides[1:5])
> m <- regmatches(homicides[1:5], r)
> m
[1] "<dd>Found on January 1, 2007</dd>" "<dd>Found on January 2, 2007</dd>"
[3] "<dd>Found on January 2, 2007</dd>" "<dd>Found on January 3, 2007</dd>"
[5] "<dd>Found on January 5, 2007</dd>"
> d <- gsub("<dd>[Ff]ound on |</dd>", "", m)
> d
[1] "January 1, 2007" "January 2, 2007" "January 2, 2007" "January 3, 2007"
[5] "January 5, 2007"
> as.Date(d, "%B %d, %Y")
[1] "2007-01-01" "2007-01-02" "2007-01-02" "2007-01-03" "2007-01-05"
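
As a possible alternative to deleting the surrounding tags, the replacement string in sub can refer back to a parenthesized group with "\\1". The sketch below shows that approach, together with the basic difference between sub and gsub, on made-up inputs.

```r
# sub changes only the first match, gsub changes all of them
first_only <- sub("a", "A", "banana")
every_one  <- gsub("a", "A", "banana")

# Keep only the text captured by the parentheses
x <- "<dd>Found on January 1, 2007</dd>"
date_text <- sub("<dd>[Ff]ound on (.*)</dd>", "\\1", x)

print(first_only)
print(every_one)
print(date_text)
```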

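regexec

regexec works like regexpr, except that it also returns the locations of any parenthesized sub-expressions.

> regexec("<dd>[Ff]ound on (.*?)</dd>", homicides[1])
[[1]]
[1] 177 190
attr(,"match.length")
[1] 33 15

Without the parentheses, only the overall match is reported:

> regexec("<dd>[Ff]ound on .*?</dd>", homicides[1])
[[1]]
[1] 177
attr(,"match.length")
[1] 33

In the first result, the first position and length (177 and 33) describe the whole match, and the second pair (190 and 15) describe the sub-expression: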

> substr(homicides[1], 177, 177 + 33 - 1)
[1] "<dd>Found on January 1, 2007</dd>"

> substr(homicides[1], 190, 190 + 15 - 1)
[1] "January 1, 2007"

regmatches also works on the result of regexec, returning the whole match followed by each sub-expression:

> r <- regexec("<dd>[Ff]ound on (.*?)</dd>", homicides[1:2])
> regmatches(homicides[1:2], r)
[[1]]
[1] "<dd>Found on January 1, 2007</dd>" "January 1, 2007"

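[[2]]
[1] "<dd>Found on January 2, 2007</dd>" "January 2, 2007"

This makes it easy to extract the date from every record and plot a histogram of homicides by month:

> r <- regexec("<dd>[Ff]ound on (.*?)</dd>", homicides)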
> m <- regmatches(homicides, r)
> dates <- sapply(m, function(x) x[2])
> dates <- as.Date(dates, "%B %d, %Y")
> hist(dates, "month", freq = TRUE)
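
For readers without the homicides file, the same steps can be followed on a few made-up records, as in the sketch below. The records vector is invented, and the as.Date call assumes a locale in which %B recognizes English month names; in other locales it may return NA.

```r
# Invented records modeled on the dataset; the last one has no date
records <- c(
  "39.31, -76.67, <dd>Found on January 1, 2007</dd><dd>Cause: shooting</dd>",
  "39.30, -76.59, <dd>found on March 3, 2010</dd><dd>Cause: Shooting</dd>",
  "no date in this record"
)

# regexec records the whole match and the parenthesized sub-expression
r <- regexec("<dd>[Ff]ound on (.*?)</dd>", records)
m <- regmatches(records, r)

# Element 2 of each result is the sub-expression; records without a match give NA
date_text <- sapply(m, function(x) x[2])
print(date_text)

# Locale-dependent conversion (assumes English month names)
dates <- as.Date(date_text, "%B %d, %Y")
print(dates)
```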

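Summary

The primary R functions for working with regular expressions are:

  • grep, grepl: search a character vector for matches to a regular expression and return indices or a logical vector
  • regexpr, gregexpr: return the starting positions and lengths of the matches; often used together with regmatches
  • sub, gsub: search for matches and replace them
  • regexec: like regexpr, but also returns the locations of parenthesized sub-expressions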