Objectives
1) Upon completion of this unit, you should be able to:
- Use tools for extracting, analyzing and manipulating text data
Tools for Extracting Text
1) File Contents: less and cat
2) File Excerpts: head and tail
3) Extract by Column: cut
4) Extract by Keyword: grep
Viewing File Contents - less and cat
1) cat: dump one or more files to STDOUT
- Multiple files are concatenated together
2) less: view a file or STDIN one page at a time
- Useful commands while viewing:
- /text searches for text
- n/N jumps to the next/previous match
- v opens the file in a text editor
- less is the pager used by man
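The concatenation behaviour of cat can be sketched as follows. The file names first.txt and second.txt are hypothetical and are created in a scratch directory; the less invocation is left as a comment because it is interactive.

```shell
# Work in a scratch directory so nothing outside it is touched.
workdir=$(mktemp -d)
printf 'alpha\nbeta\n' > "$workdir/first.txt"
printf 'gamma\n' > "$workdir/second.txt"

# cat writes the files to STDOUT in the order they are named.
combined=$(cat "$workdir/first.txt" "$workdir/second.txt")
printf '%s\n' "$combined"

# less reads STDIN when no file is named (interactive, so not run here):
# cat "$workdir/first.txt" "$workdir/second.txt" | less

rm -r "$workdir"
```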
Viewing File Excerpts - head and tail
1) head: Display the first 10 lines of a file
- Use -n to change the number of lines displayed
2) tail: Display the last 10 lines of a file
- Use -n to change the number of lines displayed
- Use -f to follow subsequent additions to the file
- Very useful for monitoring log files!
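A minimal sketch of the defaults and the -n option, using a hypothetical 12-line file numbers.txt. The tail -f line is only a comment because it runs until interrupted, and the log path in it is an assumption.

```shell
workdir=$(mktemp -d)
# Twelve lines, one number per line.
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > "$workdir/numbers.txt"

# With no option, head stops after 10 lines, so the last line shown is "10".
default_head_last=$(head "$workdir/numbers.txt" | tail -n 1)

# -n changes how many lines are shown.
first_three=$(head -n 3 "$workdir/numbers.txt")
last_two=$(tail -n 2 "$workdir/numbers.txt")

printf '%s\n' "$first_three" "$last_two"

# Follow a growing log file until Ctrl-C (path is an assumption):
# tail -f /var/log/messages

rm -r "$workdir"
```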
Extracting Text by keyword - grep
1) Print lines of files or STDIN where a pattern is matched (output goes to STDOUT)
- grep 'john' /etc/passwd
- date --help | grep year
2) Use -i to search case-insensitively
3) Use -n to print the line numbers of matches
4) Use -v to print lines not containing the pattern
5) Use -A x to include the x lines after each match
6) Use -B x to include the x lines before each match
7) Use -l to return the names of the files that contain the pattern
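The options above can be tried on a small sample rather than the real /etc/passwd. In this sketch users.txt and notes.txt are hypothetical files in passwd-like format, created only for the demonstration.

```shell
workdir=$(mktemp -d)
printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'john:x:1000:1000:John Smith:/home/john:/bin/bash' \
  'mary:x:1001:1001:Mary Jones:/home/mary:/bin/zsh' > "$workdir/users.txt"
printf 'nothing relevant here\n' > "$workdir/notes.txt"

# -i: the upper-case pattern still matches the lower-case login name.
ignore_case=$(grep -i 'JOHN:X' "$workdir/users.txt")

# -n: prefix each match with its line number.
numbered=$(grep -n 'mary' "$workdir/users.txt")

# -v: print only the lines that do NOT contain the pattern.
non_bash=$(grep -v 'bash' "$workdir/users.txt")

# -A 1: print each match plus the one line that follows it.
with_context=$(grep -A 1 'root' "$workdir/users.txt")

# -l: print only the names of the files that contain a match.
matching_files=$(grep -l 'john' "$workdir/users.txt" "$workdir/notes.txt")

printf '%s\n' "$numbered" "$non_bash" "$with_context" "$matching_files"
rm -r "$workdir"
```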
Extracting Text by Column - cut
1) Display specified columns of file or STDIN data
- cut -d: -f1 /etc/passwd
- grep root /etc/passwd | cut -d: -f7
2) Use -d to specify the column delimiter (default is TAB)
3) Use -f to specify the column to print
4) Use -c to cut by characters
- cut -c2-5 /usr/share/dict/words
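A short sketch of field and character extraction. The sample file users.txt is hypothetical and stands in for /etc/passwd; the word list is fed through STDIN instead of a dictionary file.

```shell
workdir=$(mktemp -d)
printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'mary:x:1001:1001:Mary Jones:/home/mary:/bin/zsh' > "$workdir/users.txt"

# Field 1 of each colon-delimited line: the login names.
logins=$(cut -d: -f1 "$workdir/users.txt")

# Field 7 of the line selected by grep: that account's shell.
mary_shell=$(grep mary "$workdir/users.txt" | cut -d: -f7)

# Characters 2 through 5 of each line read from STDIN.
middle_chars=$(printf 'linux\nkernel\n' | cut -c2-5)

printf '%s\n' "$logins" "$mary_shell" "$middle_chars"
rm -r "$workdir"
```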
Tools for Analyzing Text
1) Text Stats: wc
2) Sorting text: sort
3) Comparing Files: diff and patch
4) Spell check: aspell
Gathering Text Statistics - wc (word count)
1) Counts words, lines, bytes and characters
2) Can act upon a file or STDIN
3) Use -l for only line count
4) Use -w for only word count
5) Use -c for only byte count
6) Use -m for character count (not displayed by default)
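A small sketch of the individual counters on a hypothetical file notes.txt. Input is redirected so that wc prints only the number; tr -d ' ' strips the padding that some wc implementations add.

```shell
workdir=$(mktemp -d)
# Two lines, three words, 14 bytes (8 + 6, newlines included).
printf 'one two\nthree\n' > "$workdir/notes.txt"

line_count=$(wc -l < "$workdir/notes.txt" | tr -d ' ')
word_count=$(wc -w < "$workdir/notes.txt" | tr -d ' ')
byte_count=$(wc -c < "$workdir/notes.txt" | tr -d ' ')

printf 'lines=%s words=%s bytes=%s\n' "$line_count" "$word_count" "$byte_count"
rm -r "$workdir"
```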
Sorting Text - sort
1) Sort text to STDOUT - original file unchanged
2) Common options
- -r performs a reverse (descending) sort
- -n performs a numeric sort
- -f ignores (folds) case of characters in strings
- -u (unique) removes duplicate lines in output
- -t c uses c as a field separator
- -k x sorts by c-delimited field x, can be used multiple times
sort -t : -k 3 -n /etc/passwd
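The difference between a text sort and a numeric sort on a delimited field can be sketched as follows. The file scores.txt and its contents are made up for the example.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'carol:30' 'alice:4' 'bob:200' > "$workdir/scores.txt"

# Field 2 compared as text: "200" sorts before "30", which sorts before "4".
as_text=$(sort -t : -k 2 "$workdir/scores.txt")

# Field 2 compared as numbers: 4 < 30 < 200.
as_numbers=$(sort -t : -k 2 -n "$workdir/scores.txt")

# -r reverses the result of the numeric sort.
descending=$(sort -t : -k 2 -n -r "$workdir/scores.txt")

printf '%s\n' "$as_text" "$as_numbers" "$descending"
rm -r "$workdir"
```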
Eliminating Duplicate Lines - sort and uniq
1) sort -u: removes duplicate lines from input
2) uniq: removes duplicate adjacent lines from input
- Use -c to count the number of occurrences
- Use with sort for best effect:
sort userlist.txt | uniq -c
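Why uniq is usually paired with sort can be seen in a small sketch. The contents of userlist.txt here are invented; the exact width of the count column printed by uniq -c differs between implementations.

```shell
workdir=$(mktemp -d)
printf '%s\n' ann bob ann ann bob > "$workdir/userlist.txt"

# uniq alone only collapses ADJACENT duplicates, so ann and bob both reappear.
adjacent_only=$(uniq "$workdir/userlist.txt")

# sort -u removes every duplicate in one step.
unique_names=$(sort -u "$workdir/userlist.txt")

# Sorting first makes duplicates adjacent, so uniq -c can count them all.
counts=$(sort "$workdir/userlist.txt" | uniq -c)

printf '%s\n' "$adjacent_only" "$unique_names" "$counts"
rm -r "$workdir"
```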
Comparing Files - diff
1) Compares two files for differences
2) Use gvimdiff for graphical diff
- Provided by vim-X11 package
Duplicating File Changes - patch
1) diff output stored in a file is called a "patchfile"
- Use -u for "unified" diff, best in patchfiles
2) patch duplicates changes in other files (use with care!)
- Use -b to automatically back up the changed file
diff -u foo.conf-broken foo.conf-works > foo.patch
patch -b foo.conf-broken foo.patch
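The diff and patch round trip above can be rehearsed in a scratch directory. This is a sketch that assumes both diff and patch are installed; the contents of the two foo.conf files are invented, and the .orig backup suffix is the usual default of patch -b rather than something the source states.

```shell
workdir=$(mktemp -d)
printf 'port=80\nhost=localhots\n' > "$workdir/foo.conf-broken"
printf 'port=80\nhost=localhost\n' > "$workdir/foo.conf-works"

# diff exits with status 1 when the files differ, so "|| true" keeps
# scripts that run under "set -e" from stopping here.
diff -u "$workdir/foo.conf-broken" "$workdir/foo.conf-works" > "$workdir/foo.patch" || true

# Count the removed and added "host" lines recorded in the patchfile.
changed_lines=$(grep -c '^[-+]host' "$workdir/foo.patch")

# Apply the patch; -b keeps the old version (by default as foo.conf-broken.orig).
patch -b "$workdir/foo.conf-broken" "$workdir/foo.patch"

if cmp -s "$workdir/foo.conf-broken" "$workdir/foo.conf-works"; then
  patched_matches=yes
else
  patched_matches=no
fi
backup_content=$(cat "$workdir/foo.conf-broken.orig")

printf 'patched file matches working file: %s\n' "$patched_matches"
rm -r "$workdir"
```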
Spell Checking with aspell
1) Interactively spell-check files:
- aspell check letter.txt
2) Non-interactively list misspelled words in STDIN
- aspell list < letter.txt
- aspell list < letter.txt | wc -l
Tools for Manipulating Text - tr and sed
1) Alter (translate) Characters: tr
- Converts characters in one set to corresponding characters in another set
- Only reads data from STDIN
$ tr 'a-z' 'A-Z' < lowercase.txt
2) Alter Strings: sed
- stream editor
- Performs search/replace operations on a stream of text
- Normally does not alter source file
- Use -i.bak to alter the source file in place, keeping a backup with the .bak suffix
- The i flag after the final slash makes the match case-insensitive (GNU sed)
- The g flag after the final slash replaces every match on a line, not only the first
sed 's/cat/dog/' pets
sed 's/cat/dog/gi' pets
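The commands above can be tried on a two-line sample. This sketch assumes GNU sed, since the case-insensitive i flag on the s command is a GNU extension; the contents of the pets file are invented.

```shell
workdir=$(mktemp -d)
printf 'the cat sat\nCat and cat\n' > "$workdir/pets"

# tr reads STDIN only, so the file is redirected in.
upper=$(tr 'a-z' 'A-Z' < "$workdir/pets")

# Without flags: only the first case-sensitive match on each line changes.
first_only=$(sed 's/cat/dog/' "$workdir/pets")

# g replaces every match on the line, i ignores case (GNU sed).
all_any_case=$(sed 's/cat/dog/gi' "$workdir/pets")

# -i.bak edits the file in place and keeps the original as pets.bak.
sed -i.bak 's/cat/dog/' "$workdir/pets"
edited=$(cat "$workdir/pets")
backup=$(cat "$workdir/pets.bak")

printf '%s\n' "$upper" "$first_only" "$all_any_case"
rm -r "$workdir"
```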
Sed Examples
1) Quote search and replace instructions!
2) sed addresses
- sed 's/dog/cat/g' pets
- sed '1,50s/dog/cat/g' pets   # replace only on lines 1 to 50
- sed '/digby/,/duncan/s/dog/cat/g' pets   # replace from the first line containing 'digby' through the next line containing 'duncan'
3) Multiple sed instructions
- sed -e 's/dog/cat/' -e 's/hi/lo/' pets
- sed -f myedits pets
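Line-number addresses, pattern-range addresses and multiple instructions can be compared on a five-line sample. The contents of pets and myedits below are invented for the sketch; note that a pattern address is written with a slash on both sides, as in /digby/.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'hi dog' 'digby dog' 'dog dog' 'duncan dog' 'bye dog' > "$workdir/pets"

# Line-number address: substitute only on lines 1 and 2.
by_lines=$(sed '1,2s/dog/cat/g' "$workdir/pets")

# Pattern-range address: from the line matching digby through the line matching duncan.
by_patterns=$(sed '/digby/,/duncan/s/dog/cat/g' "$workdir/pets")

# Two instructions on the command line with -e (no g flag: first match per line only).
two_edits=$(sed -e 's/dog/cat/' -e 's/hi/lo/' "$workdir/pets")

# The same two instructions stored one per line in a script file and loaded with -f.
printf '%s\n' 's/dog/cat/' 's/hi/lo/' > "$workdir/myedits"
from_file=$(sed -f "$workdir/myedits" "$workdir/pets")

printf '%s\n' "$by_lines" "$by_patterns" "$two_edits"
rm -r "$workdir"
```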
Special Characters for Complex Searches - Regular Expressions
1) ^ represents beginning of line
2) $ represents end of line
3) Character classes as in bash:
- [abc], [^abc]
- [[:upper:]], [^[:upper:]]
4) Used by:
End of Unit 8
1) Questions and Answers
2) Summary
- Extracting Text: cat, less, head, tail, grep, cut
- Analyzing Text: wc, sort, uniq, diff, patch
- Manipulating Text: tr, sed
- Special Search Characters: ^, $, [abc], [[:alpha:]], [^[:alpha:]], etc
[:digit:]   Only the digits 0 to 9.
[:alnum:]   Any alphanumeric character: 0 to 9, A to Z, or a to z.
[:alpha:]   Any alphabetic character: A to Z or a to z.
[:blank:]   Space and TAB characters only.
[:xdigit:]  Hexadecimal digits: 0-9, A-F, a-f.
[:punct:]   Punctuation symbols: . , " ' ? ! ; : # $ % & ( ) * + - / < > = @ [ ] \ ^ _ { } | ~
[:print:]   Any printable character, including the space.
[:space:]   Any whitespace character (space, TAB, NL, FF, VT, CR). Many tools abbreviate this as \s.
[:graph:]   Any printable character except the space. (\W is not an equivalent; it matches non-word characters.)
[:upper:]   Any uppercase letter A to Z.
[:lower:]   Any lowercase letter a to z.
[:cntrl:]   Control characters: NL (LF) CR TAB VT FF NUL SOH STX ETX EOT ENQ ACK SO SI DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC IS1 IS2 IS3 IS4 DEL.
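The anchors and character classes from this unit can be combined in grep patterns. The following is a minimal sketch on a hypothetical file names.txt; the class names go inside an outer pair of brackets, as in [[:upper:]].

```shell
workdir=$(mktemp -d)
printf '%s\n' 'Alice' 'bob' 'Carol42' '42dave' > "$workdir/names.txt"

# ^ anchors the class to the start of the line: lines beginning with an upper-case letter.
starts_upper=$(grep '^[[:upper:]]' "$workdir/names.txt")

# A leading ^ inside the brackets negates the class: lines NOT beginning with upper case.
starts_other=$(grep '^[^[:upper:]]' "$workdir/names.txt")

# $ anchors to the end of the line: lines ending in a digit.
ends_digit=$(grep '[[:digit:]]$' "$workdir/names.txt")

# A plain set: lines beginning with a, b or c (lower case only).
starts_abc=$(grep '^[abc]' "$workdir/names.txt")

printf '%s\n' "$starts_upper" "$starts_other" "$ends_digit" "$starts_abc"
rm -r "$workdir"
```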