Regular Expressions

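A regular expression can be thought of as a combination of literals and metacharacters.
To draw an analogy with natural language, literals are the words of the language and metacharacters are its grammar.
Regular expressions have a rich set of metacharacters.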

Literals

The simplest pattern consists only of literals. The literal "nuclear" matches the following lines:

Ooh. I just learned that to keep myself alive after a nuclear blast! All I have to do is milk some rats then drink the milk. Aweosme. :}

Laozi says nuclear weapons are mas macho

Chaos in a country that has nuclear weapons -- not good.

my nephew is trying to teach me nuclear physics, or possibly just trying to show me how smart he is. so I’ll be proud of him [which I am].

lol if you ever say "nuclear" people immediately think DEATH by radiation LOL

The literal "Obama" matches the following lines:

Politics r dum. Not 2 long ago Clinton was sayin Obama was crap n now she sez vote 4 him n unite? WTF? Screw em both + Mcain. Go Ron Paul!

Clinton conceeds to Obama but will her followers listen??

Are we sure Chelsea didn’t vote for Obama?

thinking ... Michelle Obama is terrific!

jetlag..no sleep...early mornig to starbux..Ms. Obama was moving
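
A minimal sketch of literal matching in R, using grepl on a few of the lines above. The vector name `tweets` and the third, non-matching line are assumptions added for illustration.

```r
# Hypothetical sample: two lines from the examples above plus one non-match.
tweets <- c(
  "Chaos in a country that has nuclear weapons -- not good.",
  "thinking ... Michelle Obama is terrific!",
  "good morning"
)

# A literal pattern matches wherever that exact character sequence occurs.
has_nuclear <- grepl("nuclear", tweets)
has_obama   <- grepl("Obama", tweets)

# Literals are case-sensitive by default, so "obama" finds nothing here.
has_lower   <- grepl("obama", tweets)

print(has_nuclear)
print(has_obama)
print(has_lower)
```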

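Regular Expressions

The simplest pattern consists only of literals; a match occurs wherever that sequence of characters appears anywhere in the text being tested.
What if we only want the word "Obama", or sentences that end in the word "Clinton", "clinton", or "clinto"?

We need a way to express:

whitespace word boundaries
sets of literals
the beginning and end of a line
alternatives ("war" or "peace")

Metacharacters solve these problems.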

Metacharacters

Some metacharacters represent positions in a line. ^ represents the start of a line.

^i think

matches the following lines:

i think we all rule for participating
i think i have been outed
i think this will be quite fun actually
i think i need to go to work
i think i first saw zombo in 1999.

Metacharacters

$ represents the end of a line.

morning$

matches the following lines:

well they had something this morning
then had to catch a tram home in the morning
dog obedience school in the morning
and yes happy birthday i forgot to say it earlier this morning
I walked in the rain this morning
good morning
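
The two anchors can be tried side by side. This is a small sketch; the vector `lines` mixes lines from the examples above with two made-up lines in which the same words appear away from the start or end.

```r
# Hypothetical test lines: the phrase appears at the start, in the middle,
# at the end, and at the start of a line that does not end with it.
lines <- c(
  "i think i need to go to work",
  "well, i think so",
  "good morning",
  "morning coffee first"
)

# ^ anchors the pattern to the start of the string.
starts_with_i_think <- grepl("^i think", lines)

# $ anchors the pattern to the end of the string.
ends_with_morning <- grepl("morning$", lines)

print(starts_with_i_think)
print(ends_with_morning)
```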

Character Classes with []

We can list a set of characters that are accepted at a given position in the match.

[Bb][Uu][Ss][Hh]

matches the following lines:

The democrats are playing, "Name the worst thing about Bush!"
I smelled the desert creosote bush, brownies, BBQ chicken
BBQ and bushwalking at Molonglo Gorge
Bush TOLD you that North Korea is part of the Axis of Evil
I’m listening to Bush - Hurricane (Album Version)

^[Ii] am

matches the following lines:

i am so angry at my boyfriend i can’t even bear to look at him

i am boycotting the apple store

I am twittering from iPhone

I am a very vengeful person when you ruin my sweetheart.

I am so over this. I need food. Mmmm bacon...

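Similarly, you can specify a range of letters with [a-z] or [a-zA-Z]; note that the order of the ranges inside the brackets does not matter.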

^[0-9][a-zA-Z]

matches the following lines:

7th inning stretch
2nd half soon to begin. OSU did just win something
3am - cant sleep - too hot still.. :(
5ft 7 sent from heaven
1st sign of starvagtion
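
A short sketch of character classes and ranges in R. The vectors `x` and `y` are hypothetical; some entries are shortened from the examples above and the non-matching entries are made up.

```r
# Each [..] accepts exactly one character from the listed set.
x <- c("Bush TOLD you", "BBQ and bushwalking", "bus stop")
any_case_bush <- grepl("[Bb][Uu][Ss][Hh]", x)

# A digit at the start of the line, immediately followed by a letter.
y <- c("7th inning stretch", "3am - cant sleep", "42 is the answer", "at 7pm")
digit_then_letter <- grepl("^[0-9][a-zA-Z]", y)

# The order of the ranges inside the class does not matter.
same_result <- identical(grepl("^[0-9][A-Za-z]", y), digit_then_letter)

print(any_case_bush)
print(digit_then_letter)
print(same_result)
```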

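When ^ is the first character inside [], it negates the class: the class then matches any single character that is NOT listed.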

[^?.]$

matches lines whose last character is neither a question mark nor a period:

i like basketballs
6 and 9
dont worry... we all die anyway!
Not in Baghdad
helicopter under water? hmmm
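
A minimal sketch of a negated class, assuming a hypothetical vector `z` that adds two lines ending in the excluded characters. Inside [], the period is an ordinary character, so it needs no escaping there.

```r
# Two lines from the examples above and two made-up lines that end in ? or .
z <- c("i like basketballs", "Not in Baghdad", "is it over?", "The end.")

# [^?.] matches one character that is neither "?" nor "."; $ pins it to the end.
not_q_or_period <- grepl("[^?.]$", z)

# Inside a class, "." is literal: it matches only an actual period.
literal_period <- grepl("[.]", c("a.b", "ab"))

print(not_q_or_period)
print(literal_period)
```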

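One Thing to Note

. is a metacharacter that matches any single character; to match a literal period outside a character class, escape it with \. In the pattern below, the ? after ( [Ww]\.) makes the middle initial optional.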

[Gg]eorge( [Ww]\.)? [Bb]ush
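
The same pattern can be tried in R, where a backslash inside a string literal must itself be doubled. This is a sketch; the vector `people` is made up for illustration.

```r
# Hypothetical inputs: with initial, without initial, and a near miss.
people <- c("George W. Bush", "george bush", "George Wx Bush")

# "\\." in an R string is the regex \. (a literal period);
# the trailing ? makes the whole " W." group optional.
is_bush <- grepl("[Gg]eorge( [Ww]\\.)? [Bb]ush", people)

# Unescaped, "." matches any character; escaped, only a period.
any_char <- grepl("W.", "Wx")
only_dot <- grepl("W\\.", "Wx")

print(is_bush)
print(c(any_char, only_dot))
```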

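`*` and `+`

* and + are metacharacters that indicate repetition: * means any number of repetitions, including zero, and + means at least one. In the pattern below the parentheses are escaped with \ so that they match literal parentheses; unescaped, (.*) would be a group that matches every line.

\(.*\)

matches the following lines: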

anyone wanna chat? (24, m, germany)
hello, 20.m here... ( east area + drives + webcam )
(he means older men)
()

The following pattern matches lines that contain a number, a space, any amount of text, and then another number:

[0-9]+ (.*)[0-9]+

matches the following lines:

working as MP here 720 MP battallion, 42nd birgade
so say 2 or 3 years at colleage and 4 at uni makes us 23 when and if we fin
it went down on several occasions for like, 3 or 4 *days*
Mmmm its time 4 me 2 go 2 bed
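
A brief sketch contrasting * and + in R. The input vectors are hypothetical, and the first pattern escapes the parentheses so that it looks for literal "(" and ")" in the text.

```r
# * allows zero repetitions, + requires at least one.
abc <- c("ac", "abc", "abbbc")
star <- grepl("ab*c", abc)
plus <- grepl("ab+c", abc)

# Literal parentheses with anything (or nothing) between them.
x <- c("(he means older men)", "()", "no parens here")
has_parens <- grepl("\\(.*\\)", x)

# A number, a space, any text, then another number.
y <- c("like, 3 or 4 *days*", "only 1 number here")
two_numbers <- grepl("[0-9]+ (.*)[0-9]+", y)

print(star)
print(plus)
print(has_parens)
print(two_numbers)
```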

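Metacharacters - `{ }`

{ } are interval quantifiers; they specify the minimum and maximum number of times an expression may match. The pattern below matches "Bush" or "bush", followed by one to five words, followed by "debate"; each repetition of the group consumes a run of spaces and the word after it.

[Bb]ush( +[^ ]+){1,5} +debate

matches the following lines: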

Bush has historically won all major debates he’s done.
in my view, Bush doesn’t need these debates..
bush doesn’t need the debates? maybe you are right
That’s what Bush supporters are doing about the debate.
Felix, I don’t disagree that Bush was poorly prepared for the debate.
indeed, but still, Bush should have taken the debate more seriously.
Keep repeating that Bush smirked and scowled during the debate

{m,n} means at least m but no more than n matches; {m} means exactly m matches; {m,} means at least m matches.
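
A sketch of interval quantifiers in R. Note the assumption: it uses the form `[Bb]ush( +[^ ]+){1,5} +debate`, in which each repetition is a run of spaces followed by one word, so that ordinary single-spaced text matches. The test vectors are made up, apart from one line taken from the examples above.

```r
# "Bush", then 1 to 5 words, then "debate".
pattern <- "[Bb]ush( +[^ ]+){1,5} +debate"
s <- c(
  "Felix, I don't disagree that Bush was poorly prepared for the debate.",
  "Bush debate",              # zero words in between: too few
  "Bush a b c d e f debate"   # six words in between: too many
)
one_to_five_words <- grepl(pattern, s)

# {m} is exact, {m,} is open-ended.
digits <- c("12", "123", "1234")
exactly_three <- grepl("^[0-9]{3}$", digits)
at_least_two  <- grepl("^[0-9]{2,}$", digits)

print(one_to_five_words)
print(exactly_three)
print(at_least_two)
```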

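Metacharacters - `()`

In regular expressions, parentheses not only limit the scope of alternatives; they can also be used to remember the text matched by the enclosed sub-expression.
The remembered text is referred to with \1, \2, and so on.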

+([a-zA-Z]+) +\1 +

matches lines that contain a repeated word (note that the pattern begins with a space):

time for bed, night night twitter!
blah blah blah blah
my tattoo is so so itchy today
i was standing all all alone against the world outside...
hi anybody anybody at home
estudiando css css css css.... que desastritooooo
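
A minimal sketch of a backreference in R, using two of the lines above and one made-up line without a repeated word. In an R string the backreference is written `\\1`.

```r
x <- c(
  "time for bed, night night twitter!",
  "my tattoo is so so itchy today",
  "no repeats here"
)

# ([a-zA-Z]+) remembers a word; \\1 requires the same word to appear again.
pattern <- " +([a-zA-Z]+) +\\1 +"
has_repeat <- grepl(pattern, x)

# regmatches() returns the matched text for the elements that matched.
matched <- regmatches(x, regexpr(pattern, x))

print(has_repeat)
print(matched)
```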

* is greedy: it always matches the longest possible string that satisfies the regular expression.

^s(.*)s

matches the following lines:

sitting at starbucks
setting up mysql and rails
studying stuff for the exams
spaghetti with marshmallows
stop fighting with crackers
sore shoulders, stupid ergonomics

The greediness of * can be turned off with ?:

^s(.*?)s$
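
The difference between greedy and lazy matching shows up in the text that is actually matched. This is a sketch on one of the lines above; the variable names are illustrative.

```r
x <- "sore shoulders, stupid ergonomics"

# Greedy: .* runs as far as it can, so the match ends at the last "s".
greedy <- regmatches(x, regexpr("^s(.*)s", x))

# Lazy: .*? stops as early as it can, so the match ends at the next "s".
lazy <- regmatches(x, regexpr("^s(.*?)s", x))

# With a trailing $, the match has to reach the end of the line either way.
lazy_anchored <- regmatches(x, regexpr("^s(.*?)s$", x))

print(greedy)
print(lazy)
print(lazy_anchored)
```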

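Summary

Regular expressions are used in many different programming languages, not just R.
Regular expressions are composed of literals and metacharacters that together represent a set or class of strings.
Extracting data from text with regular expressions is very convenient.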

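Regular Expression Functions

The primary R functions for dealing with regular expressions are:

grep, grepl: search for matches of a regular expression in a character vector; grep returns the indices of the elements that match, and grepl returns a TRUE/FALSE vector indicating which elements match
regexpr, gregexpr: search a character vector for matches of a regular expression and return the starting index and the length of each match
sub, gsub: search for matches and replace them with another string
regexec: like regexpr, but also returns the positions of parenthesized sub-expressions

grep

The Baltimore City homicides dataset: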

> homicides <- readLines("homicides.txt")
> homicides[1]
[1] "39.311024, -76.674227, iconHomicideShooting, ’p2’, ’<dl><dt>Leon
Nelson</dt><dd class=\"address\">3400 Clifton Ave.<br />Baltimore, MD
21216</dd><dd>black male, 17 years old</dd>
<dd>Found on January 1, 2007</dd><dd>Victim died at Shock
Trauma</dd><dd>Cause: shooting</dd></dl>’"

> homicides[1000]
[1] "39.33626300000, -76.55553990000, icon_homicide_shooting, ’p1200’,...

> length(grep("iconHomicideShooting", homicides))
[1] 228
> length(grep("iconHomicideShooting|icon_homicide_shooting", homicides))
[1] 1003
> length(grep("Cause: shooting", homicides))
[1] 228
> length(grep("Cause: [Ss]hooting", homicides))
[1] 1003
> length(grep("[Ss]hooting", homicides))
[1] 1005

> i <- grep("[cC]ause: [Ss]hooting", homicides)
> j <- grep("[Ss]hooting", homicides)
> str(i)
 int [1:1003] 1 2 6 7 8 9 10 11 12 13 ...
> str(j)
 int [1:1005] 1 2 6 7 8 9 10 11 12 13 ...
> setdiff(i, j)
integer(0)
> setdiff(j, i)
[1] 318 859

> homicides[859]
[1] "39.33743900000, -76.66316500000, icon_homicide_bluntforce,
’p914’, ’<dl><dt><a href=\"http://essentials.baltimoresun.com/
micro_sun/homicides/victim/914/steven-harris\">Steven Harris</a>
</dt><dd class=\"address\">4200 Pimlico Road<br />Baltimore, MD 21215
</dd><dd>Race: Black<br />Gender: male<br />Age: 38 years old</dd>
<dd>Found on July 29, 2010</dd><dd>Victim died at Scene</dd>
<dd>Cause: Blunt Force</dd><dd class=\"popup-note\"><p>Harris was
found dead July 22 and ruled a shooting victim; an autopsy
subsequently showed that he had not been shot,...</dd></dl>’"

grep returns the indices of the elements of the character vector that match.

> grep("^New", state.name)
[1] 29 30 31 32
Setting value = TRUE returns the actual elements of the character vector that match.
> grep("^New", state.name, value = TRUE)
[1] "New Hampshire" "New Jersey"    "New Mexico"    "New York"
grepl returns a logical vector indicating which element matches.
> grepl("^New", state.name)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE
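
The three return forms can be compared directly on the built-in state.name vector. This is a small sketch; the variable names are illustrative.

```r
# Indices of the matching elements.
idx <- grep("^New", state.name)

# The matching elements themselves.
matched_names <- grep("^New", state.name, value = TRUE)

# One TRUE/FALSE per element, convenient for subsetting and counting.
flag <- grepl("^New", state.name)

print(idx)
print(matched_names)
print(sum(flag))

# All three views describe the same matches.
print(identical(state.name[flag], matched_names))
print(identical(which(flag), idx))
```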

regexpr

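Limitations of grep

grep reports which elements of the character vector match, but not where within a string the match occurs or what the matched text is.
regexpr returns the starting position of the match within each string and the length of the match.
regexpr returns only the first match in each string; gregexpr returns the positions of all matches.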

> homicides[1]
[1] "39.311024, -76.674227, iconHomicideShooting, ’p2’, ’<dl><dt>Leon
Nelson</dt><dd class=\"address\">3400 Clifton Ave.<br />Baltimore,
MD 21216</dd><dd>black male, 17 years old</dd>
<dd>Found on January 1, 2007</dd><dd>Victim died at Shock
Trauma</dd><dd>Cause: shooting</dd></dl>’"

> homicides[954]
[1] "39.30677400000, -76.59891100000, icon_homicide_shooting, ’p816’,
’<dl><dd class=\"address\">1400 N Caroline St<br />Baltimore, MD 21213</dd>
<dd>Race: Black<br />Gender: male<br />Age: 29 years old</dd>
<dd>Found on March  3, 2010</dd><dd>Victim died at Scene</dd>
<dd>Cause: Shooting</dd><dd class=\"popup-note\"><p>Wheeler\\’s body
was&nbsp;found on the grounds of Dr. Bernard Harris Sr.&nbsp;Elementary
School</p></dd></dl>’"

> regexpr("<dd>[Ff]ound(.*)</dd>", homicides[1:10])
 [1] 177 178 188 189 178 182 178 187 182 183
attr(,"match.length")
 [1] 93 86 89 90 89 84 85 84 88 84
attr(,"useBytes")
[1] TRUE
> substr(homicides[1], 177, 177 + 93 - 1)
[1] "<dd>Found on January 1, 2007</dd><dd>Victim died at Shock
 Trauma</dd><dd>Cause: shooting</dd>"

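The previous pattern is too greedy and matches too much of the string. Use ? to turn off greedy matching so that the match stops at the first </dd>.

> regexpr("<dd>[Ff]ound(.*?)</dd>", homicides[1:10])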
 [1] 177 178 188 189 178 182 178 187 182 183
attr(,"match.length")
 [1] 33 33 33 33 33 33 33 33 33 33
attr(,"useBytes")
[1] TRUE
> substr(homicides[1], 177, 177 + 33 - 1)
[1] "<dd>Found on January 1, 2007</dd>"
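
Because homicides.txt may not be at hand, the greedy and lazy behaviour can be reproduced on a made-up record. In this sketch `rec` is a hypothetical string modelled on the tail of the first record above.

```r
# Hypothetical record fragment with three <dd> elements.
rec <- paste0(
  "<dd>Found on January 1, 2007</dd>",
  "<dd>Victim died at Shock Trauma</dd>",
  "<dd>Cause: shooting</dd>"
)

greedy <- regexpr("<dd>[Ff]ound(.*)</dd>", rec)
lazy   <- regexpr("<dd>[Ff]ound(.*?)</dd>", rec)

# The greedy match runs to the last </dd>, the lazy one to the first.
greedy_len <- attr(greedy, "match.length")
lazy_len   <- attr(lazy, "match.length")

# Start position plus length gives the substring to extract.
start <- as.integer(lazy)
extracted <- substr(rec, start, start + lazy_len - 1)

print(c(greedy_len, lazy_len))
print(extracted)
```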

regmatches

regmatches extracts the matched text from the original strings, so there is no need to call substr.

> r <- regexpr("<dd>[Ff]ound(.*?)</dd>", homicides[1:5])
> regmatches(homicides[1:5], r)
[1] "<dd>Found on January 1, 2007</dd>" "<dd>Found on January 2, 2007</dd>"
[3] "<dd>Found on January 2, 2007</dd>" "<dd>Found on January 3, 2007</dd>"
[5] "<dd>Found on January 5, 2007</dd>"
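
regmatches also works with gregexpr, in which case it returns every match per string rather than only the first. This is a sketch on two hypothetical records (`recs`) modelled on the data above.

```r
# Hypothetical record fragments, each with two <dd> elements.
recs <- c(
  "<dd>Found on January 1, 2007</dd><dd>Cause: shooting</dd>",
  "<dd>Found on March 3, 2010</dd><dd>Cause: Shooting</dd>"
)

# regexpr: first match per string, returned as a character vector.
first_match <- regmatches(recs, regexpr("<dd>.*?</dd>", recs))

# gregexpr: all matches per string, returned as a list of character vectors.
all_matches <- regmatches(recs, gregexpr("<dd>.*?</dd>", recs))

print(first_match)
print(all_matches)
```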

sub/gsub

> x <- substr(homicides[1], 177, 177 + 33 - 1) 
> x
[1] "<dd>Found on January 1, 2007</dd>"

> sub("<dd>[Ff]ound on |</dd>", "", x)
[1] "January 1, 2007</dd>"
> gsub("<dd>[Ff]ound on |</dd>", "", x)
[1] "January 1, 2007"

> r <- regexpr("<dd>[Ff]ound(.*?)</dd>", homicides[1:5])
> m <- regmatches(homicides[1:5], r)
> m
[1] "<dd>Found on January 1, 2007</dd>" "<dd>Found on January 2, 2007</dd>"
[3] "<dd>Found on January 2, 2007</dd>" "<dd>Found on January 3, 2007</dd>"
[5] "<dd>Found on January 5, 2007</dd>"
> d <- gsub("<dd>[Ff]ound on |</dd>", "", m)
> d
[1] "January 1, 2007" "January 2, 2007" "January 2, 2007" "January 3, 2007"
[5] "January 5, 2007"
> as.Date(d, "%B %d, %Y")
[1] "2007-01-01" "2007-01-02" "2007-01-02" "2007-01-03" "2007-01-05"
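
The difference between sub and gsub, and the final date conversion, can be sketched on a small hypothetical vector `m`. Note that "%B" parses month names in the current locale, so the as.Date step assumes an English (or C) locale.

```r
# Hypothetical matches, shaped like the regmatches() output above.
m <- c("<dd>Found on January 1, 2007</dd>", "<dd>Found on January 2, 2007</dd>")

# sub() replaces only the first match in each string ...
first_only <- sub("<dd>[Ff]ound on |</dd>", "", m)

# ... while gsub() replaces every match, stripping both tags.
cleaned <- gsub("<dd>[Ff]ound on |</dd>", "", m)

# Locale-dependent: month names must be in the session's language.
dates <- as.Date(cleaned, "%B %d, %Y")

print(first_only)
print(cleaned)
print(dates)
```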

regexec

regexec works like regexpr, but it also returns the positions of any parenthesized sub-expressions.

> regexec("<dd>[Ff]ound on (.*?)</dd>", homicides[1])
[[1]]
[1] 177 190
attr(,"match.length")
[1] 33 15

> regexec("<dd>[Ff]ound on .*?</dd>", homicides[1])
[[1]]
[1] 177
attr(,"match.length")
[1] 33

> substr(homicides[1], 177, 177 + 33 - 1)
[1] "<dd>Found on January 1, 2007</dd>"

> substr(homicides[1], 190, 190 + 15 - 1)
[1] "January 1, 2007"

The regmatches function

> r <- regexec("<dd>[Ff]ound on (.*?)</dd>", homicides[1:2])
> regmatches(homicides[1:2], r)
[[1]]
[1] "<dd>Found on January 1, 2007</dd>" "January 1, 2007"

[[2]]
[1] "<dd>Found on January 2, 2007</dd>" "January 2, 2007"

> r <- regexec("<dd>[Ff]ound on (.*?)</dd>", homicides)
> m <- regmatches(homicides, r)
> dates <- sapply(m, function(x) x[2])
> dates <- as.Date(dates, "%B %d, %Y")
> hist(dates, "month", freq = TRUE)
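
The regexec and regmatches pipeline can be tried without the data file. In this sketch `recs` is a hypothetical vector; the third record has no "Found on" field, which shows what happens when there is no match.

```r
# Hypothetical records; the last one contains no "Found on" element.
recs <- c(
  "<dd>Found on January 1, 2007</dd><dd>Cause: shooting</dd>",
  "<dd>Found on March 3, 2010</dd><dd>Cause: Shooting</dd>",
  "<dd>Cause: unknown</dd>"
)

# Each list element holds the full match followed by the captured group;
# a record without a match yields character(0).
m <- regmatches(recs, regexec("<dd>[Ff]ound on (.*?)</dd>", recs))

# Element 2 is the captured date text; it is NA where nothing matched.
found_on <- sapply(m, function(x) x[2])

print(m)
print(found_on)
```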

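Summary

The primary R functions for dealing with regular expressions are:

grep, grepl: search for matches of a regular expression in a character vector
regexpr, gregexpr: return the position and length of matches within each string; often used together with regmatches
sub, gsub: search for matches and replace them
regexec: return the positions of the match and of parenthesized sub-expressions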
