C++/Regex

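<regex> is a header of the C++ standard library that defines the standard's regular expression facilities. It was introduced in C++11.

By default, C++11 <regex> uses the ECMAScript grammar (a modified form of the regular expression grammar in ECMA-262, the JavaScript standard), and therefore does not support look-behind syntax.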

Type definitions

syntax_option_type
match_flag_type
error_type

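Class templates

The header provides the following classes and class templates:

basic_regex: the regular expression object.
sub_match: the character sequence captured by a sub-expression match.
match_results: one match of a regular expression, including all sub-expression matches.
regex_iterator: an iterator over all regular expression matches in a character sequence.
regex_token_iterator: an iterator over specified sub-expressions in all regular expression matches in a given character sequence.
regex_error: the error report produced by the regular expression library.
regex_traits: the metadata about the character type that the regular expression library requires.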

basic_regex

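When a regular expression object is constructed, the syntax options can be selected:

flag	Effect on syntax	Notes
icase	Case insensitive	Matching ignores differences in case.
nosubs	No sub-expressions	Sub-expressions are not considered marked; the match_results object contains no sub-expression matches.
optimize	Optimize matching	Matching efficiency takes priority over the efficiency of constructing the regex object.
collate	Locale sensitivity	Character ranges such as "[a-b]" are affected by the locale.
ECMAScript	ECMAScript grammar	The regular expression follows exactly one of these grammars; they cannot be combined. If none is set, ECMAScript is the default.
basic	Basic POSIX grammar
extended	Extended POSIX grammar
awk	Awk POSIX grammar
grep	Grep POSIX grammar
egrep	Egrep POSIX grammar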

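regex_iterator class template

When a sequence is searched with a regular expression, the read-only forward iterator regex_iterator is used to iterate over all match positions.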

template<
   class BidirIt,
   class CharT = typename std::iterator_traits<BidirIt>::value_type,
   class Traits = std::regex_traits<CharT>
> class regex_iterator

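// Constructor
regex_iterator(BidirIt first, BidirIt last, // begin and end iterators of the underlying sequence (two BidirIt instances)
       const regex_type& rgx, // reference to the regular expression
       std::regex_constants::match_flag_type flags = std::regex_constants::match_default); // match flags

The constructor takes the start and end positions of the sequence to be searched, the regular expression object to use, and the match flags. It first calls regex_search to find the first match. If there is no match, the iterator is equivalent to a default-constructed one, which denotes the end of the sequence. Each time the iterator is incremented, it calls std::regex_search and remembers the result (that is, it stores a copy of the std::match_results<BidirIt> value).

Incrementing a std::regex_iterator past the last match makes it equal to the end-of-sequence iterator. Dereferencing or incrementing the end-of-sequence iterator further is undefined behavior.

Each application of operator++ advances the iterator to the next match; dereferencing it yields a reference to the internal match_results object.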

regex_token_iterator

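An iterator similar to regex_iterator, except that it refers to specific sub_match objects within each match of the regular expression. The fourth constructor argument of regex_token_iterator selects which sub_match object (or objects) to visit: 0 denotes the entire match, 1, 2, ... denote the corresponding sub-matches, and -1 denotes the character sequences that are not part of any match. The last form can be used to tokenize a sequence in which the non-matching parts are the desired data; such a use is called a tokenizer.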

match_results

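match_results is close to a container class: it stores the set of results produced by one regular expression matching operation performed by regex_match, regex_search, or a regex_iterator, and each result is of type sub_match. Once a match_results object holds the outcome of a matching operation (even if that outcome is empty), its member function match_results::ready returns true; a match_results obtained by dereferencing a regex_iterator is always in this state. If the match succeeded, the member function empty returns false and match_results contains a sequence of sub_match elements: the first is the entire match, followed in order by the sub-expressions corresponding to the capture groups (groups enclosed in parentheses). The results can also be accessed directly through member functions such as str(i), length, and position, through operator[], or through the iterators begin, end, cbegin, and cend.

When a match_results object is used with regex_search, the parts of the target sequence outside the match are available through the member functions prefix and suffix.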

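A match_results object in the ready state can format a character sequence through its format member function. The available format specifiers are:

Characters	Replaced by
$n	The n-th backreference (the n-th captured sub-match); n must be greater than 0 and have at most two digits.
$&	The entire match
$`	The prefix (the part of the target sequence that precedes the match)
$'	The suffix (the part of the target sequence that follows the match)
$$	A single $ character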

The following specializations are predefined:

typedef match_results<const char*> cmatch;
typedef match_results<const wchar_t*> wcmatch;
typedef match_results<string::const_iterator> smatch;
typedef match_results<wstring::const_iterator> wsmatch;

Member types worth noting:

value_type	sub_match<BidirectionalIterator>	 
char_type	iterator_traits<BidirectionalIterator>::value_type	 
reference	value_type&	
const_reference	const value_type&	
iterator	a forward iterator to const value_type	
const_iterator	a forward iterator to const value_type	The same as iterator

sub_match

sub_match is a class template derived from std::pair, declared as follows:

template <class BidirectionalIterator>
        class sub_match : public pair <BidirectionalIterator, BidirectionalIterator>;

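A sub_match represents the match of one sub-expression within a single matching operation of a regular expression. A matching operation is performed by the functions regex_match or regex_search, or by the regex iterators (regex_iterator or regex_token_iterator). The match of a sub-expression is a character sequence, but sub_match does not store the characters themselves; it uses its std::pair base class to store the begin iterator and the end (past-the-end) iterator of the sequence.

The public data member matched of sub_match indicates whether the sub-expression took part in the match. It is false for a default-constructed sub_match and true for a sub_match in a match_results object whose sub-expression participated in the match; for a capture group that did not participate (for example, the unused branch of an alternation) it stays false.

A sub_match object can be converted to a string object, behaves like a string in comparisons (compare), and has a length member function that behaves like the string member function of the same name.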

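Member types:

Type name	Definition	Meaning
value_type	iterator_traits<BidirectionalIterator>::value_type	Character type of the character sequence
string_type	basic_string<value_type>	String type of the character sequence
iterator	BidirectionalIterator	The template parameter, i.e. the iterator type of the character sequence
difference_type	iterator_traits<BidirectionalIterator>::difference_type	Typically ptrdiff_t
first_type	BidirectionalIterator	First template argument of the base class std::pair
second_type	BidirectionalIterator	Second template argument of the base class std::pair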

Predefined specializations:

typedef sub_match<const char*> csub_match;
typedef sub_match<const wchar_t*> wcsub_match;
typedef sub_match<string::const_iterator> ssub_match;
typedef sub_match<wstring::const_iterator> wssub_match;

regex_traits

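translate: maps a character to another character. If two characters translate to the same character, matching treats them as identical.
value: returns the int value of a character interpreted as a digit in the specified radix.
isctype: determines whether a character belongs to the specified character class. Character classes are represented by integer (bitmask) values.
lookup_classname: returns the bitmask value that represents the named character class.
lookup_collatename: returns the string of the named collating element.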

regex_error

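regex_error is the exception object that functions of the regex library may throw. Its member function code() returns a value of the enumeration regex_constants::error_type:

flag	error
error_collate	The expression contains an invalid collating element name
error_ctype	The expression contains an invalid character class name
error_escape	The expression contains an invalid escaped character or a trailing escape
error_backref	The expression contains an invalid back reference
error_brack	The expression contains mismatched square brackets
error_paren	The expression contains mismatched parentheses
error_brace	The expression contains mismatched braces
error_badbrace	The expression contains an invalid range between braces
error_range	The expression contains an invalid character range
error_space	Insufficient memory to convert the expression into a finite state machine
error_badrepeat	The expression contains a repeat specifier (one of *?+{) that is not preceded by a valid regular expression
error_complexity	The computational complexity of an attempted match exceeded a preset level
error_stack	Insufficient memory on the run-time stack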

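Algorithm functions

regex_match: attempts to match a regular expression against an entire character sequence.
regex_search: attempts to match a regular expression against any part of a character sequence.
regex_replace: replaces the parts matched by a regular expression.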

Non-member functions

std::swap(std::basic_regex): specialization of swap for regular expression objects.
Comparison of two sub-match objects:
- operator==
- operator!=
- operator<
- operator<=
- operator>
- operator>=
operator<<: outputs the matched character subsequence
Lexicographical comparison of the values of two match results:
- operator==
- operator!=
std::swap(std::match_results): specialization of swap for regular expression match results.

Constant definitions

match_flag_type

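std::regex_constants::match_flag_type has the following bitmask flag values:

flag	effects on match	notes
match_default	Default	Default matching behavior; the value is 0.
match_not_bol	Not Beginning-Of-Line	The first character is not considered the beginning of a line ("^" does not match there).
match_not_eol	Not End-Of-Line	The last character is not considered the end of a line ("$" does not match there).
match_not_bow	Not Beginning-Of-Word	The escape sequence "\b" does not match the beginning of a word.
match_not_eow	Not End-Of-Word	The escape sequence "\b" does not match the end of a word.
match_any	Any match	If more than one match is possible, any of them is acceptable.
match_not_null	Not null	Empty sequences are not matched.
match_continuous	Continuous	The expression must match a subsequence that begins at the first character.
match_prev_avail	Previous Available	One or more characters exist before the first character of the sequence (match_not_bol and match_not_bow are ignored).
format_default	Default formatting	Uses the ECMAScript replacement rules by default; the value is 0.
format_sed	sed formatting	Uses the replacement rules of the POSIX sed utility.
format_no_copy	No copy	Parts of the target sequence that do not match the regular expression are not copied during replacement.
format_first_only	First only	Only the first occurrence of the regular expression is replaced.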

syntax_option_type


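Value	Effect
icase	Matching ignores case.
nosubs	All marked sub-expressions (expr) are treated as non-marking sub-expressions (?:expr). As a result, no sub-matches are stored in the std::match_results structure and mark_count() is 0.
optimize	Instructs the regular expression engine to spend more time constructing the regex object in order to produce a faster representation, for example by converting a non-deterministic finite state automaton (FSA) into a deterministic one.
collate	Character ranges of the form "[a-b]" are locale sensitive.
multiline (C++17)	When the ECMAScript engine is used, specifies that ^ matches the beginning of a line and $ matches the end of a line.
ECMAScript	Uses the modified ECMAScript regular expression grammar.
basic	Uses the basic POSIX regular expression grammar.
extended	Uses the extended POSIX regular expression grammar.
awk	Uses the regular expression grammar of awk.
grep	Uses the regular expression grammar of grep. This is effectively the basic grammar with the newline character '\n' added as an alternation separator.
egrep	Uses the regular expression grammar of egrep (grep with the -E option). This is effectively the extended grammar with the newline character '\n' added as an alternation separator in addition to '|'.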

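At most one grammar option may be chosen from ECMAScript, basic, extended, awk, grep, and egrep. If none is chosen, ECMAScript is the default. The other options act as modifiers; for example, std::regex("meow", std::regex::icase) is equivalent to std::regex("meow", std::regex::ECMAScript|std::regex::icase).

ECMAScript matching is depth-first, whereas POSIX matching is leftmost-longest. For example, searching zzxayyzz with the regular expression .*(a|xayy) gives: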

ECMA (depth first search) match: zzxa
POSIX (leftmost longest)  match: zzxayy

Example program

#include <iostream>
#include <string>
#include <regex>
 
int main()
{
    std::string fnames[] = {"foo.txt", "bar.txt", "baz.dat", "zoidberg"};
    // Two capture groups: the base name and the extension.
    std::regex pieces_regex("([a-z]+)\\.([a-z]+)");
    std::smatch pieces_match;
    for (const auto &fname : fnames) {
        // regex_match succeeds only if the whole file name matches.
        if (std::regex_match(fname, pieces_match, pieces_regex)) {
            std::cout << fname << '\n';
            // Element 0 is the entire match; elements 1 and 2 are the groups.
            for (size_t i = 0; i < pieces_match.size(); ++i) {
                std::ssub_match sub_match = pieces_match[i];
                std::string piece = sub_match.str();
                std::cout << "  submatch " << i << ": " << piece << '\n';
            }
        }
    }
}

Output:

foo.txt
  submatch 0: foo.txt
  submatch 1: foo
  submatch 2: txt
bar.txt
  submatch 0: bar.txt
  submatch 1: bar
  submatch 2: txt
baz.dat
  submatch 0: baz.dat
  submatch 1: baz
  submatch 2: dat

References

C++ reference for Standard library header <regex>