RH033 Unit 8: Text Processing Tools

1) Upon completion of this unit, you should be able to:
- Use tools for extracting, analyzing and manipulating text data
Tools for Extracting Text
1) File contents: less and cat
2) File Excerpts: head and tail
3) Extract by Column: cut
4) Extract by Keyword: grep
Viewing File Contents - less and cat
1) cat: dump one or more files to STDOUT
- Multiple files are concatenated together
2) less: view a file or STDIN one page at a time
- Useful commands while viewing:
  • /text searches for text
  • n/N jumps to the next/previous match
  • v opens the file in a text editor
- less is the pager used by man
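A minimal sketch of the concatenation behaviour described above. The scratch directory and the file names part1.txt and part2.txt are hypothetical, created only for this demonstration; less is interactive, so its line is left commented out.

```shell
# Scratch files with hypothetical names, created only for this demo.
workdir=$(mktemp -d)
printf 'alpha\nbeta\n' > "$workdir/part1.txt"
printf 'gamma\n' > "$workdir/part2.txt"

# cat writes the files to STDOUT in the order given, with nothing in between.
combined=$(cat "$workdir/part1.txt" "$workdir/part2.txt")
printf '%s\n' "$combined"

# less reads STDIN when no file is given; uncomment to page interactively.
# cat "$workdir/part1.txt" "$workdir/part2.txt" | less

rm -r "$workdir"
```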
Viewing File Excerpts - head and tail
1) head: Display the first 10 lines of a file
- Use -n to change the number of lines displayed
2) tail: Display the last 10 lines of a file
- Use -n to change the number of lines displayed
- Use -f to follow subsequent additions to the file
  • Very useful for monitoring the log files!
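The options above can be tried on a small generated file. This is only a sketch: numbers.txt is a hypothetical scratch file, and the tail -f line is commented out because it runs until interrupted (the log path shown there is an assumption and varies by system).

```shell
# Hypothetical scratch file holding the numbers 1 to 20, one per line.
workdir=$(mktemp -d)
seq 1 20 > "$workdir/numbers.txt"

# With no option, head prints the first 10 lines.
default_count=$(head "$workdir/numbers.txt" | wc -l)

# -n changes how many lines are shown.
first_three=$(head -n 3 "$workdir/numbers.txt")   # 1, 2, 3
last_two=$(tail -n 2 "$workdir/numbers.txt")      # 19, 20

printf '%s\n' "$first_three" "$last_two"

# tail -f keeps printing new lines as they are appended; stop it with Ctrl+C.
# tail -f /var/log/messages

rm -r "$workdir"
```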
Extracting Text by keyword - grep
1) Prints lines of files or STDIN where a pattern is matched
  • grep 'john' /etc/passwd
  • date --help | grep year
2) Use -i to search case-insensitively
3) Use -n to print line numbers of matches
4) Use -v to print lines not containing the pattern
5) Use -Ax to include the x lines after each match
6) Use -Bx to include the x lines before each match
7) Use -l to return the names of the files that contain the pattern
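The following sketch exercises each option listed above against a small sample file instead of the real /etc/passwd. The file users.txt and its three entries are invented for illustration.

```shell
# Hypothetical passwd-style sample data.
workdir=$(mktemp -d)
cat > "$workdir/users.txt" <<'EOF'
root:x:0:0:root:/root:/bin/bash
john:x:500:500:John Smith:/home/john:/bin/bash
mary:x:501:501:Mary Jones:/home/mary:/bin/zsh
EOF
f="$workdir/users.txt"

plain=$(grep 'john' "$f")         # the john line
nocase=$(grep -i 'JOHN' "$f")     # same line, matched regardless of case
numbered=$(grep -n 'mary' "$f")   # prefixed with "3:"
inverted=$(grep -v 'bash' "$f")   # only the mary line lacks "bash"
after=$(grep -A1 'root' "$f")     # root line plus the line after it
before=$(grep -B1 'mary' "$f")    # mary line plus the line before it
names=$(grep -l 'zsh' "$workdir"/*.txt)   # file names only, not lines

printf '%s\n' "$numbered" "$inverted"

rm -r "$workdir"
```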
Extracting Text by Column - cut
1) Display specified columns of file or STDIN data
  • cut -d: -f1 /etc/passwd
  • grep root /etc/passwd | cut -d: -f7
2) Use -d to specify the column delimiter (default is TAB)
3) Use -f to specify the column to print
4) Use -c to cut by characters
  • cut -c2-5 /usr/share/dict/words
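As a quick illustration of the three options, the sketch below feeds cut from STDIN so it does not depend on any file being present. The sample record and strings are made up for the example.

```shell
# A single passwd-style record, used here only as sample input.
record='root:x:0:0:root:/root:/bin/bash'

# -d sets the delimiter, -f picks the field.
login=$(printf '%s\n' "$record" | cut -d: -f1)        # root
login_shell=$(printf '%s\n' "$record" | cut -d: -f7)  # /bin/bash

# Without -d, fields are separated by TAB.
second=$(printf 'one\ttwo\tthree\n' | cut -f2)        # two

# -c selects character positions instead of fields.
middle=$(printf 'extract\n' | cut -c2-5)              # xtra

printf '%s\n' "$login" "$login_shell" "$second" "$middle"
```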
Tools for Analyzing Text
1) Text Stats: wc
2) Sorting text: sort
3) Comparing files: diff and patch
4) Spell check: aspell
Gathering Text Statistics - wc (word count)
1) Counts words, lines, bytes and characters
2) Can act upon a file or STDIN
3) Use -l for only line count
4) Use -w for only word count
5) Use -c for only byte count
6) Use -m for character count (not displayed by default)
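A small sketch of the counting options, reading from STDIN. The sample text is invented. The character count from -m depends on the current locale, so no expected value is given for it.

```shell
# Two lines, five words.
text='one two three
four five'

lines=$(printf '%s\n' "$text" | wc -l)   # 2
words=$(printf '%s\n' "$text" | wc -w)   # 5

# "café" followed by a newline: the é is written as two UTF-8 bytes.
bytes=$(printf 'caf\303\251\n' | wc -c)  # 6 bytes
chars=$(printf 'caf\303\251\n' | wc -m)  # locale-dependent; fewer than bytes in a UTF-8 locale

echo "lines=$lines words=$words bytes=$bytes chars=$chars"
```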
Sorting Text - sort
1) Sorts text to STDOUT - original file unchanged
  • sort [options] file(s)
2) Common options
  • -r performs a reverse (descending) sort
  • -n performs a numeric sort
  • -f ignores (folds) case of characters in strings
  • -u (unique) removes duplicate lines in output
  • -t c uses c as a field separator
  • -k x sorts by c-delimited field x, can be used multiple times
  sort -t : -k 3 -n /etc/passwd
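The options above can be tried on a short sample, sketched here with a hypothetical accounts.txt that holds three colon-separated fields per line. cut is used only to keep the output short.

```shell
# Hypothetical sample: name, placeholder, numeric ID.
workdir=$(mktemp -d)
cat > "$workdir/accounts.txt" <<'EOF'
mary:x:501
root:x:0
john:x:500
daemon:x:2
EOF
f="$workdir/accounts.txt"

# Numeric sort on the third colon-delimited field, as in the example above.
by_id=$(sort -t : -k 3 -n "$f" | cut -d: -f1)   # root, daemon, john, mary

# -r reverses the default (whole-line) ordering.
reversed=$(sort -r "$f" | cut -d: -f1)          # root, mary, john, daemon

# -u drops repeated lines from the sorted output.
unique=$(printf 'b\na\nb\n' | sort -u)          # a, b

# The input file itself is left unchanged.
first_line=$(head -n 1 "$f")                    # still mary:x:501

printf '%s\n' "$by_id"
rm -r "$workdir"
```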
Eliminating Duplicate Lines - sort and uniq
1) sort -u: removes duplicate lines from input
2) uniq: removes duplicate adjacent lines from input
  • Use -c to count number of occurrences
  • Use with sort for best effect:
sort userlist.txt | uniq -c
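The sketch below shows why uniq is normally paired with sort: on unsorted input only adjacent repeats are collapsed. The contents of userlist.txt are invented for the example.

```shell
# Hypothetical user list with repeats that are not all adjacent.
workdir=$(mktemp -d)
printf 'ann\nbob\nann\nann\ncarl\nbob\n' > "$workdir/userlist.txt"
f="$workdir/userlist.txt"

# Only the adjacent "ann ann" pair collapses: ann bob ann carl bob.
unsorted_count=$(uniq "$f" | wc -l)    # 5

# After sorting, all duplicates are adjacent, so each name appears once.
sorted_count=$(sort "$f" | uniq | wc -l)   # 3

# -c prefixes each line with its count (left-padded with spaces).
counts=$(sort "$f" | uniq -c)          # 3 ann, 2 bob, 1 carl

printf '%s\n' "$counts"
rm -r "$workdir"
```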
Comparing Files - diff
1) Compares two files for differences
2) Use gvimdiff for graphical diff
  • Provided by vim-X11 package
Duplicating File Changes - patch
1) diff output stored in a file is called a "patchfile"
  • Use -u for "unified" diff, best in patchfiles
2) patch duplicates changes in other files (use with care!)
  • Use -b to automatically back up the changed file
diff -u foo.conf-broken foo.conf-works > foo.patch
patch -b foo.conf-broken foo.patch
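A self-contained sketch of the two commands above, run in a scratch directory. The file contents are invented; only the file names come from the example. It assumes GNU patch, which saves the backup made by -b with an .orig suffix.

```shell
workdir=$(mktemp -d)

# Hypothetical config files: the broken one has a typo in the host name.
printf 'port=80\nhost=locahost\n'  > "$workdir/foo.conf-broken"
printf 'port=80\nhost=localhost\n' > "$workdir/foo.conf-works"

# diff exits with status 1 when the files differ; that is expected here.
diff -u "$workdir/foo.conf-broken" "$workdir/foo.conf-works" \
    > "$workdir/foo.patch" || true

# Apply the recorded changes to the broken file, keeping a backup copy.
patch -b "$workdir/foo.conf-broken" "$workdir/foo.patch"

fixed=$(cat "$workdir/foo.conf-broken")
wanted=$(cat "$workdir/foo.conf-works")
backup=$(cat "$workdir/foo.conf-broken.orig")   # assumed GNU patch backup name

printf '%s\n' "$fixed"
rm -r "$workdir"
```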
Spell Checking with aspell
1) Interactively spell-check files:
  • aspell check letter.txt
2) Non-interactively list misspelled words in STDIN
  • aspell list < letter.txt
  • aspell list < letter.txt | wc -l
Tools for Manipulating Text - tr and sed
1) Alter (translate) Characters: tr
  • Converts characters in one set to corresponding characters in another set
  • Only reads data from STDIN
        $ tr 'a-z' 'A-Z' < lowercase.txt
2) Alter Strings: sed
  • stream editor
  • Performs search/replace operations on a stream of text
  • Normally does not alter source file
  • Use -i.bak to alter the source file in place, keeping a backup with a .bak suffix
  • Flags after the final / of an s command: g replaces every match on a line (global), i matches case-insensitively (GNU sed)
   sed 's/cat/dog/' pets
   sed 's/cat/dog/gi' pets
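The behaviour described above can be sketched on a few short strings. The sample text and the scratch copy of pets are invented. The i flag on the s command is assumed to be available, as it is in GNU sed.

```shell
# tr maps each character in the first set to its counterpart in the second.
upper=$(printf 'hello world\n' | tr 'a-z' 'A-Z')          # HELLO WORLD

# Without flags, s replaces only the first match on each line.
first_only=$(printf 'cat Cat cat\n' | sed 's/cat/dog/')   # dog Cat cat

# g replaces every match; i ignores case (GNU sed extension).
all_nocase=$(printf 'cat Cat cat\n' | sed 's/cat/dog/gi') # dog dog dog

# -i.bak edits the file in place and keeps the original as pets.bak.
workdir=$(mktemp -d)
printf 'my cat\n' > "$workdir/pets"
sed -i.bak 's/cat/dog/' "$workdir/pets"
edited=$(cat "$workdir/pets")        # my dog
original=$(cat "$workdir/pets.bak")  # my cat

printf '%s\n' "$upper" "$first_only" "$all_nocase" "$edited"
rm -r "$workdir"
```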
Sed Examples
1) Quote search and replace instructions!
2) sed addresses
  • sed 's/dog/cat/g' pets
  • sed '1,50s/dog/cat/g' pets ### the replacement is performed only on lines 1 to 50
  • sed '/digby/,/duncan/s/dog/cat/g' pets ### the replacement starts on the first line containing 'digby' and continues through the next line containing 'duncan'
3) Multiple sed instructions
  • sed -e 's/dog/cat/' -e 's/hi/lo/' pets
  • sed -f myedits pets
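To see how addresses limit a substitution, the sketch below runs the same commands on a five-line sample. The contents of pets and myedits are hypothetical; only the file names come from the examples above.

```shell
workdir=$(mktemp -d)
cat > "$workdir/pets" <<'EOF'
alice has a dog
digby has a dog
carol has a dog
duncan has a dog
erin has a dog
EOF
f="$workdir/pets"

# Line-number address: only lines 1 and 2 are changed.
by_lines=$(sed '1,2s/dog/cat/g' "$f")

# Pattern address: from the digby line through the duncan line (lines 2 to 4).
by_pattern=$(sed '/digby/,/duncan/s/dog/cat/g' "$f")

# Several instructions on the command line with -e ...
multi=$(printf 'hi dog\n' | sed -e 's/dog/cat/' -e 's/hi/lo/')   # lo cat

# ... or the same instructions read from a script file with -f.
printf 's/dog/cat/\ns/hi/lo/\n' > "$workdir/myedits"
from_file=$(printf 'hi dog\n' | sed -f "$workdir/myedits")       # lo cat

printf '%s\n' "$by_pattern"
rm -r "$workdir"
```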
Special Characters for Complex Searches - Regular Expressions
1) ^ represents beginning of line
2) $ represents end of line
3) Character classes as in bash:
  • [abc], [^abc]
  • [[:upper:]], [^[:upper:]]
4) Used by:
  • grep, sed, less, others
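A short sketch of the anchors and character classes listed above, using grep on an invented word list.

```shell
# Hypothetical sample input, one entry per line.
words='apple
Banana
cherry pie
pineapple'

starts=$(printf '%s\n' "$words" | grep '^apple')         # apple
ends=$(printf '%s\n' "$words" | grep 'apple$')           # apple, pineapple
in_set=$(printf '%s\n' "$words" | grep '^[abc]')         # apple, cherry pie
not_in_set=$(printf '%s\n' "$words" | grep '^[^abc]')    # Banana, pineapple
has_upper=$(printf '%s\n' "$words" | grep '[[:upper:]]') # Banana

printf '%s\n' "$ends"
```

Note that [abc] is case-sensitive, so Banana falls into the negated set.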
End of Unit 8
1) Questions and Answers
2) Summary
  • Extracting Text: cat, less, head, tail, grep, cut
  • Analyzing Text: wc, sort, uniq, diff, patch
  • Manipulating Text: tr, sed
  • Special Search Characters: ^, $, [abc], [[:alpha:]], [^[:alpha:]], etc
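POSIX character classes (used inside a bracket expression, e.g. [[:digit:]])
  • [:digit:]   Only the digits 0 to 9.
  • [:alnum:]   Any alphanumeric character: 0 to 9, A to Z or a to z.
  • [:alpha:]   Any alpha character: A to Z or a to z.
  • [:blank:]   Space and TAB characters only.
  • [:xdigit:]  Hexadecimal notation: 0-9, A-F, a-f.
  • [:punct:]   Punctuation symbols . , " ' ? ! ; : # $ % & ( ) * + - / < > = @ [ ] \ ^ _ { } | ~
  • [:print:]   Any printable character, including space.
  • [:space:]   Any whitespace character (space, tab, NL, FF, VT, CR). Many systems abbreviate this as \s.
  • [:graph:]   Any visible character, that is, printable characters excluding whitespace. Many systems offer \S (non-whitespace) as a near equivalent.
  • [:upper:]   Any alpha character A to Z.
  • [:lower:]   Any alpha character a to z.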
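The POSIX classes [:digit:], [:upper:], [:punct:], [:blank:] and [:xdigit:] can be tried with tr and grep, as sketched below. The sample string is invented, and grep -o (print only the matched part) is assumed to be available, as it is in GNU grep.

```shell
sample='User42: id=0x1F, ok!'

# tr -cd deletes every character that is NOT in the named class.
digits=$(printf '%s\n' "$sample" | tr -cd '[:digit:]')   # 4201
uppers=$(printf '%s\n' "$sample" | tr -cd '[:upper:]')   # UF
puncts=$(printf '%s\n' "$sample" | tr -cd '[:punct:]')   # :=,!

# tr -s squeezes runs of spaces and tabs into a single space.
squeezed=$(printf 'a \t b\n' | tr -s '[:blank:]' ' ')    # a b

# Inside a regular expression the class sits in a bracket expression.
hex=$(printf '%s\n' "$sample" | grep -o '0x[[:xdigit:]]*')   # 0x1F

printf '%s\n' "$digits" "$uppers" "$puncts" "$squeezed" "$hex"
```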
