Today's toy is strtok().
Docs first:
http://man7.org/linux/man-pages/man3/strtok.3.html
Some excerpts:
#include <string.h>
char *strtok(char *str, const char *delim);
The strtok() function breaks a string into a sequence of zero or more
nonempty tokens. On the first call to strtok(), the string to be
parsed should be specified in str. In each subsequent call that
should parse the same string, str must be NULL.

The delim argument specifies a set of bytes that delimit the tokens
in the parsed string. The caller may specify different strings in
delim in successive calls that parse the same string.

Each call to strtok() returns a pointer to a null-terminated string
containing the next token. This string does not include the
delimiting byte. If no more tokens are found, strtok() returns NULL.

A sequence of calls to strtok() that operate on the same string
maintains a pointer that determines the point from which to start
searching for the next token. The first call to strtok() sets this
pointer to point to the first byte of the string. The start of the
next token is determined by scanning forward for the next
nondelimiter byte in str. If such a byte is found, it is taken as
the start of the next token. If no such byte is found, then there
are no more tokens, and strtok() returns NULL. (A string that is
empty or that contains only delimiters will thus cause strtok() to
return NULL on the first call.)

The end of each token is found by scanning forward until either the
next delimiter byte is found or until the terminating null byte
('\0') is encountered. If a delimiter byte is found, it is
overwritten with a null byte to terminate the current token, and
strtok() saves a pointer to the following byte; that pointer will be
used as the starting point when searching for the next token. In
this case, strtok() returns a pointer to the start of the found
token.

From the above description, it follows that a sequence of two or more
contiguous delimiter bytes in the parsed string is considered to be a
single delimiter, and that delimiter bytes at the start or end of the
string are ignored. Put another way: the tokens returned by strtok()
are always nonempty strings. Thus, for example, given the string
"aaa;;bbb,", successive calls to strtok() that specify the delimiter
string ";," would return the strings "aaa" and "bbb", and then a null
pointer.
I wrote a test program:
#include <stdio.h>
#include <string.h>

int main() {
    char str[] = "abc,deffff,ghi,jkl";
    char *tok;
    tok = strtok(str, ",");
    printf("1st: %s\n", tok);
    tok = strtok(NULL, ",");
    printf("2nd: %s\n", tok);
    tok = strtok(NULL, ",");
    printf("3rd: %s\n", tok);
    tok = strtok(NULL, ",");
    printf("4th: %s\n", tok);
    printf("str: %s, length: %lu, size: %lu\n", str, strlen(str), sizeof(str));
    return 0;
}
The file is named strtokalt.c.
$ make && ./strtokalt
1st: abc
2nd: deffff
3rd: ghi
4th: jkl
str: abc, length: 3, size: 19
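First, strtok() returns a pointer to a C string, null terminator included.
Second, the first time you call strtok(), pass in a pointer to the string you want to tokenize.
To keep tokenizing that same string, the first argument must be NULL on every later call,
otherwise things go badly.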
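Look: the tokens all come out fine, but when we go back to the original str, it now holds only the first token. Its length is 3, while its size is still 19.
What does that tell us? The first "," in str has been overwritten with "\0", the null terminator.
Let's check with gdb: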
$ gdb ./strtoalt
Reading symbols from ./strtoalt...done.
(gdb) l 11
6 char *tok;
7 tok = strtok(str, ",");
8 printf("1st: %s\n", tok);
9 tok = strtok(NULL, ",");
10 printf("2nd: %s\n", tok);
11 tok = strtok(NULL, ",");
12 printf("3rd: %s\n", tok);
13 tok = strtok(NULL, ",");
14 printf("4th: %s\n", tok);
15 printf("str: %s, length: %lu, size: %lu\n", str, strlen(str), sizeof(str));
(gdb) b 7
Breakpoint 1 at 0x8bd: file strtoalt.c, line 7.
(gdb) b 9
Breakpoint 2 at 0x8ec: file strtoalt.c, line 9.
(gdb) b 11
Breakpoint 3 at 0x919: file strtoalt.c, line 11.
(gdb) b 13
Breakpoint 4 at 0x946: file strtoalt.c, line 13.
(gdb) b 16
Breakpoint 5 at 0x99f: file strtoalt.c, line 16.
(gdb) r
Starting program: /home/yuyue/Coding/Play/strtoalt
Breakpoint 1, main () at strtoalt.c:7
7 tok = strtok(str, ",");
(gdb) p str
$1 = "abc,deffff,ghi,jkl"
(gdb) p sizeof(str)
$2 = 19
(gdb) x/19b str
0x7fffffffe200: 97 98 99 44 100 101 102 102
0x7fffffffe208: 102 102 44 103 104 105 44 106
0x7fffffffe210: 107 108 0
(gdb) p tok
$3 = 0x0
(gdb) c
Continuing.
1st: abc
Breakpoint 2, main () at strtoalt.c:9
9 tok = strtok(NULL, ",");
(gdb) p str
$4 = "abc\000deffff,ghi,jkl"
(gdb) x/19b str
0x7fffffffe200: 97 98 99 0 100 101 102 102
0x7fffffffe208: 102 102 44 103 104 105 44 106
0x7fffffffe210: 107 108 0
(gdb) p tok
$5 = 0x7fffffffe200 "abc"
(gdb) p sizeof(tok)
$6 = 8
(gdb) x/8b tok
0x7fffffffe200: 97 98 99 0 100 101 102 102
(gdb) c
Continuing.
2nd: deffff
Breakpoint 3, main () at strtoalt.c:11
11 tok = strtok(NULL, ",");
(gdb) p str
$7 = "abc\000deffff\000ghi,jkl"
(gdb) x/19b str
0x7fffffffe200: 97 98 99 0 100 101 102 102
0x7fffffffe208: 102 102 0 103 104 105 44 106
0x7fffffffe210: 107 108 0
(gdb) p tok
$8 = 0x7fffffffe204 "deffff"
(gdb) p sizeof(tok)
$9 = 8
(gdb) x/8b tok
0x7fffffffe204: 100 101 102 102 102 102 0 103
(gdb) c
Continuing.
3rd: ghi
Breakpoint 4, main () at strtoalt.c:13
13 tok = strtok(NULL, ",");
(gdb) p str
$10 = "abc\000deffff\000ghi\000jkl"
(gdb) x/19b str
0x7fffffffe200: 97 98 99 0 100 101 102 102
0x7fffffffe208: 102 102 0 103 104 105 0 106
0x7fffffffe210: 107 108 0
(gdb) p tok
$11 = 0x7fffffffe20b "ghi"
(gdb) x/8b tok
0x7fffffffe20b: 103 104 105 0 106 107 108 0
(gdb) c
Continuing.
4th: jkl
str: abc, length: 3, size: 19
Breakpoint 5, main () at strtoalt.c:16
16 return 0;
(gdb) p str
$12 = "abc\000deffff\000ghi\000jkl"
(gdb) x/19b str
0x7fffffffe200: 97 98 99 0 100 101 102 102
0x7fffffffe208: 102 102 0 103 104 105 0 106
0x7fffffffe210: 107 108 0
(gdb) p tok
$13 = 0x7fffffffe20f "jkl"
(gdb) x/8b tok
0x7fffffffe20f: 106 107 108 0 -1 -1 127 0
(gdb) c
Continuing.
[Inferior 1 (process 3138) exited normally]
(gdb) q
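As you can see, after the first call to strtok(), the "," really was replaced by "\0". Each later call likewise turns the delimiter that ends its token into a null terminator.
In this gdb session, only delimiter bytes changed; every other character stayed intact. I originally suspected this was not guaranteed, for example when strtok() is called from a for or while loop, but I never verified that. The man page excerpt above only describes strtok() overwriting delimiter bytes with null bytes, so the loop form should behave the same way.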
Another interesting point: I never initialized char *tok; myself. It gets its value from the return value of strtok(). gdb reports its size as 8 bytes, which is enough here, but what if the first token is 3 bytes and the second is 11 bytes?
So I changed the program:
#include <stdio.h>
#include <string.h>

int main() {
    char str[] = "abc,123456789ab,ghi,jkl";
    char *tok;
    tok = strtok(str, ",");
    printf("tok 1: %s\n", tok);
    for (int i = 0; i < 3; i++) {
        tok = strtok(NULL, ",");
        printf("tok %d: %s\n", i + 2, tok);
    }
    printf("str: %s, length: %lu, size: %lu\n", str, strlen(str), sizeof(str));
    return 0;
}
Result of compiling and running:
$ make && ./strtoalt
cc -o strtoalt strtoalt.c -Wall -lm -pg -g
tok 1: abc
tok 2: 123456789ab
tok 3: ghi
tok 4: jkl
str: abc, length: 3, size: 24
A quick check with gdb again:
gdb ./strtoalt
Reading symbols from ./strtoalt...done.
(gdb) l 11
6 char *tok;
7 tok = strtok(str, ",");
8 printf("tok 1: %s\n", tok);
9 for (int i = 0; i < 3; i++) {
10 tok = strtok(NULL, ",");
11 printf("tok %d: %s\n", i + 2, tok);
12 }
13 printf("str: %s, length: %lu, size: %lu\n", str, strlen(str), sizeof(str));
14 return 0;
15 }
(gdb) b 8
Breakpoint 1 at 0x8d8: file strtoalt.c, line 8.
(gdb) b 11
Breakpoint 2 at 0x90e: file strtoalt.c, line 11.
(gdb) r
Starting program: /home/yuyue/Coding/CMPSC311/CMPSC311Play/strtoalt
Breakpoint 1, main () at strtoalt.c:8
8 printf("tok 1: %s\n", tok);
(gdb) p str
$1 = "abc\000\061\062\063\064\065\066\067\070\071ab,ghi,jkl"
(gdb) p sizeof(str)
$2 = 24
(gdb) x/24b str
0x7fffffffe200: 97 98 99 0 49 50 51 52
0x7fffffffe208: 53 54 55 56 57 97 98 44
0x7fffffffe210: 103 104 105 44 106 107 108 0
(gdb) p tok
$3 = 0x7fffffffe200 "abc"
(gdb) p sizeof(tok)
$4 = 8
(gdb) x/8b tok
0x7fffffffe200: 97 98 99 0 49 50 51 52
(gdb) c
Continuing.
tok 1: abc
Breakpoint 2, main () at strtoalt.c:11
11 printf("tok %d: %s\n", i + 2, tok);
(gdb) p str
$5 = "abc\000\061\062\063\064\065\066\067\070\071ab\000ghi,jkl"
(gdb) x/24b str
0x7fffffffe200: 97 98 99 0 49 50 51 52
0x7fffffffe208: 53 54 55 56 57 97 98 0
0x7fffffffe210: 103 104 105 44 106 107 108 0
(gdb) p sizeof(tok)
$6 = 8
(gdb) p tok
$7 = 0x7fffffffe204 "123456789ab"
(gdb) x/8b tok
0x7fffffffe204: 49 50 51 52 53 54 55 56
(gdb) x/16b tok
0x7fffffffe204: 49 50 51 52 53 54 55 56
0x7fffffffe20c: 57 97 98 0 103 104 105 44
(gdb) quit
A debugging session is active.
Inferior 1 [process 3190] will be killed.
Quit anyway? (y or n) y
Fine, fine, you win!
First, in the output of p str, $1 = "abc\000\061\062\063\064\065\066\067\070\071ab,ghi,jkl", the digits are all shown as octal escapes of their ASCII codes...
Second, char *tok; is 8 bytes in size no matter what. The token is clearly longer than 8 bytes, yet the size still shows 8... Is this gdb's problem or strtok()'s?
Let's printf it from the program and see.
#include <stdio.h>
#include <string.h>

int main() {
    char str[] = "abc,123456789ab,ghi,jkl";
    char *tok;
    tok = strtok(str, ",");
    printf("tok 1: %s, length %lu, size: %lu\n", tok, strlen(tok), sizeof(tok));
    for (int i = 0; i < 3; i++) {
        tok = strtok(NULL, ",");
        printf("tok %d: %s, length %lu, size: %lu\n", i + 2, tok, strlen(tok), sizeof(tok));
    }
    printf("str: %s, length: %lu, size: %lu\n", str, strlen(str), sizeof(str));
    return 0;
}
Output:
$ make && ./strtoalt
cc -o strtoalt strtoalt.c -Wall -lm -pg -g
tok 1: abc, length 3, size: 8
tok 2: 123456789ab, length 11, size: 8
tok 3: ghi, length 3, size: 8
tok 4: jkl, length 3, size: 8
str: abc, length: 3, size: 24
OK, so it's strtok()'s problem, not gdb's.
On second thought, it may not be strtok()'s problem either; it may be mine.
Why? strtok() returns a pointer to a token, and my char *tok was already initialized the first time I assigned strtok()'s return value to it. C then remembers the type of tok, and its size is fixed.
Let's try again with a new variable for the later tokens and see whether it is still 8 bytes.
#include <stdio.h>
#include <string.h>

int main() {
    char str[] = "abc,123456789ab,ghi,jkl";
    char *tok, *tok2;
    tok = strtok(str, ",");
    printf("tok 1: %s, length %lu, size: %lu\n", tok, strlen(tok), sizeof(tok));
    for (int i = 0; i < 3; i++) {
        tok2 = strtok(NULL, ",");
        printf("tok %d: %s, length %lu, size: %lu\n", i + 2, tok2, strlen(tok2), sizeof(tok2));
    }
    printf("str: %s, length: %lu, size: %lu\n", str, strlen(str), sizeof(str));
    return 0;
}
$ make && ./strtoalt
cc -o strtoalt strtoalt.c -Wall -lm -pg -g
tok 1: abc, length 3, size: 8
tok 2: 123456789ab, length 11, size: 8
tok 3: ghi, length 3, size: 8
tok 4: jkl, length 3, size: 8
str: abc, length: 3, size: 24
Fine, I give up. I didn't wrong it. It keeps lying through its teeth, telling me my 123456789ab has size 8. Next time I use strtok(), never trust sizeof(); trust strlen()...
Everything above ran on Ubuntu 18.04. Let me see how my macOS 10.14.6 does.
» ./strtoalt
tok 1: abc, length 3, size: 8
tok 2: 123456789ab, length 11, size: 8
tok 3: ghi, length 3, size: 8
tok 4: jkl, length 3, size: 8
str: abc, length: 3, size: 24
Yep, same.
No rush, I also have an OpenBSD 6.5 machine to try.
$ gmake && ./strtokalt
cc -o strtokalt strtokalt.c -Wall -lm -pg -g
tok 1: abc, length 3, size: 8
tok 2: 123456789ab, length 11, size: 8
tok 3: ghi, length 3, size: 8
tok 4: jkl, length 3, size: 8
str: abc, length: 3, size: 24
Yep, same.
----------
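Wait a moment. Is anything declared like char *tok; always 8 bytes?
Let's write some code to test it:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main()
{
    char *str;
    /* sizeof(*str) is the size of one char; sizeof(str) would be the pointer size */
    str = malloc(10 * sizeof(*str));
    if (str == NULL) {
        return 1;
    }
    for (int i = 0; i < 9; i++) {
        str[i] = 'a' + i;
    }
    str[9] = '\0';
    printf("str: %s length: %lu size: %lu\n", str, strlen(str), sizeof(str));
    free(str);
    str = NULL;
    return 0;
}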
$ make && ./charsizetest
cc -o charsizetest charsizetest.c -Wall -lm -pg -g
str: abcdefghi length: 9 size: 8
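Ah, just as I suspected!
So my understanding of C was off. char *str and char str[] are used in much the same way, but they are different types. sizeof on a char *str always gives the size of the pointer itself (8 bytes on all three systems I tried), no matter how much data it points to, whereas sizeof on a char str[] gives the size of the whole array as initialized, null terminator included.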