Matching URLs with a grep regular expression

★ A grep regular expression that matches URLs

grep -ohr -E 'https?://[a-zA-Z0-9./_&=@$%?~#-]*' ./folder

Tested under Cygwin.

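Options:

Option                   Meaning
-E, --extended-regexp    Interpret the pattern as an extended regular expression (ERE); without -E, grep uses basic regular expressions (BRE)
-o, --only-matching      Print only the matched part of each line (here, the URL) instead of the whole line
-h, --no-filename        Do not prefix each match with the name of the file it came from
-r, --recursive          Search all subdirectories recursively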
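
To see what each option contributes, here is a minimal sketch that runs the command against a throwaway directory. The directory layout, file names, and example.com/example.org URLs are made up for illustration, and the script assumes a grep that supports -o and -r (GNU grep does).

```shell
#!/bin/sh
# Hypothetical sample data: these files and URLs exist only for this demo.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/sub"
printf 'see http://example.com/a?x=1 for details\n' > "$demo_dir/top.txt"
printf 'mirror: https://example.org/b/c.html (no other link)\n' > "$demo_dir/sub/nested.txt"

# -r descends into sub/, -o prints only the matched URL,
# -h drops the "filename:" prefix. Sorting makes the order deterministic.
urls=$(grep -ohrE 'https?://[a-zA-Z0-9./_&=@$%?~#-]*' "$demo_dir" | LC_ALL=C sort)
printf '%s\n' "$urls"

# Without -h, every match is prefixed with the file it was found in.
with_names=$(grep -orE 'https?://[a-zA-Z0-9./_&=@$%?~#-]*' "$demo_dir" | LC_ALL=C sort)
printf '%s\n' "$with_names"

rm -rf "$demo_dir"
```

The first listing contains just the two URLs; the second contains the same URLs, each preceded by its file path and a colon.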

★ Basic vs Extended Regular Expressions

The differences between basic regular expressions (BRE) and extended regular expressions (ERE):

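In BRE, ?, +, {, |, (, and ) are literal characters: a question mark ? simply matches a question mark.
In ERE, ?, +, {, |, (, and ) have special meanings: ? means zero or one occurrence, and + means one or more. To match these characters literally in ERE, escape them with a backslash: \?, \+, \{, \|, \(, and \).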
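
The rule can be checked with a small sketch using + as the test character. The two input lines are made-up samples, and the last command relies on the backslashed \+ that GNU grep accepts in BRE, so treat that part as GNU-specific.

```shell
#!/bin/sh
# Made-up sample input: one line with a literal "+", one with repeated letters.
sample=$(printf 'a+b\naaab\n')

# BRE: "+" is an ordinary character, so only the line "a+b" matches.
bre=$(printf '%s\n' "$sample" | grep 'a+b')

# ERE: "+" means "one or more of the preceding item", so "a+b" matches
# "aaab" but not the line containing a literal plus sign.
ere=$(printf '%s\n' "$sample" | grep -E 'a+b')

# ERE with the plus escaped: back to matching the literal "a+b".
ere_escaped=$(printf '%s\n' "$sample" | grep -E 'a\+b')

# BRE with the plus escaped (GNU grep): "\+" becomes the repetition operator.
bre_escaped=$(printf '%s\n' "$sample" | grep 'a\+b')

printf 'BRE  a+b  -> %s\n' "$bre"
printf 'ERE  a+b  -> %s\n' "$ere"
printf 'ERE  a\\+b -> %s\n' "$ere_escaped"
printf 'BRE  a\\+b -> %s\n' "$bre_escaped"
```

The backslash flips the meaning in both modes: it makes + literal in ERE and, in GNU grep, makes it special in BRE.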

For example, take the following text:

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way.

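To search for both hope and despair at once, either of the following works:

  • Option 1: use BRE and run grep "hope\|despair". With a plain grep "hope|despair", the | is a literal vertical bar rather than an alternation between two choices.

  • Option 2: use ERE and run grep -E "hope|despair". With -E, | means alternation between the two choices rather than a literal vertical bar.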

You can test this on the command line:

$ echo "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way." | grep -o "hope\|despair"

hope
despair
$ echo "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way." | grep -oE "hope|despair"

hope
despair
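
The converse case, an unescaped | in BRE, can be checked the same way. This is a minimal sketch: the shortened sample text and the second input line containing a literal "hope|despair" are made up for the demo.

```shell
#!/bin/sh
# Shortened, made-up sample based on the passage above.
text='it was the spring of hope, it was the winter of despair'

# BRE without the backslash: "|" is literal, so nothing in the text matches.
# grep exits with status 1 on no match, hence the "|| true".
plain=$(printf '%s\n' "$text" | grep -o 'hope|despair' || true)

# The same pattern does match input that really contains a vertical bar.
literal=$(printf '%s\n' 'choose hope|despair here' | grep -o 'hope|despair' || true)

printf 'matches in the passage: [%s]\n' "$plain"
printf 'match in the second input: [%s]\n' "$literal"
```

The first result is empty, and the second is the whole string hope|despair, bar included.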

★ References

  • From man grep:

    In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).

  • https://www.gnu.org/software/grep/manual/html_node/Basic-vs-Extended.html

    3.6 Basic vs Extended Regular Expressions
    In basic regular expressions the meta-characters ‘?’, ‘+’, ‘{’, ‘|’, ‘(’, and ‘)’ lose their special meaning; instead use the backslashed versions ‘\?’, ‘\+’, ‘\{’, ‘\|’, ‘\(’, and ‘\)’.

    Traditional egrep did not support the ‘{’ meta-character, and some egrep implementations support ‘\{’ instead, so portable scripts should avoid ‘{’ in ‘grep -E’ patterns and should use ‘[{]’ to match a literal ‘{’.

    GNU grep -E attempts to support traditional usage by assuming that ‘{’ is not special if it would be the start of an invalid interval specification. For example, the command ‘grep -E '{1'’ searches for the two-character string ‘{1’ instead of reporting a syntax error in the regular expression. POSIX allows this behavior as an extension, but portable scripts should avoid it. (In other words, when a ‘{’ does not begin a valid interval, GNU grep -E reports no syntax error and simply searches for it literally. Scripts that need to run on different systems should avoid relying on this.)
