Hey guys,
I've pushed a snapshot update to Cascalog that includes two new taps -- hfs-delimited and lfs-delimited. These support the same keyword options as the other hfs-* and lfs-* taps, with a few extras I'll detail below.
If any of you find these useful, I'd really appreciate it if you would give them a try and let me know how the API works out for you. This feature is available in either of the following builds:
[cascalog "1.8.7-SNAPSHOT"]
[cascalog "1.9.0-wip8"]
As an example, say you had a text file with data like this:
exchange,stock_symbol,date,open,high,low,close,volume,adj
NYSE,AA,2008-03-05,37.01,37.9,36.13,36.6,17752400,36.6
NYSE,AA,2008-03-04,38.85,39.28,38.26,38.37,11279900,38.37
The default separator is a tab character, and this file contains no tabs, so the standard hfs-delimited tap with no options would produce 1-tuples, each holding a single line of text:
(hfs-delimited "/path/to/file")
;; produces 1-tuples, each containing one whole line of text
The ":delimiter" option allows you to change this:
(hfs-delimited "/pathto/data"
               :delimiter ",")
;; produces 9-tuples, all strings
Now we have the problem of the header line getting in the way. :skip-header? to the rescue:
(hfs-delimited "/pathto/data"
               :delimiter ","
               :skip-header? true)
;; produces 9-tuples of strings
Next, if you include a vector of classes with the :classes keyword, the tap will do class conversions on the fields for you:
(hfs-delimited "/pathto/data"
               :delimiter ","
               :classes [String String String Float Float Float Float Integer Float]
               :skip-header? true)
;; produces 9-tuples with the above classes -- numbers are parsed properly, strings stay strings.
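To make the effect of :classes concrete, here is a rough plain-Clojure sketch of the per-field conversion applied to one of the sample lines. It is only an illustration, not how the tap is implemented: parsers and parse-line are hypothetical helpers made up for this example, and the tap itself may coerce values differently.

```clojure
(require '[clojure.string :as str])

;; Hypothetical lookup table (not part of Cascalog): one parser per class.
(def parsers
  {String  identity
   Float   (fn [^String s] (Float/parseFloat s))
   Integer (fn [^String s] (Integer/parseInt s))})

(defn parse-line
  "Splits line on the literal delimiter and converts each field with the
  parser registered for the class at the same position."
  [line delimiter classes]
  (mapv (fn [field klass] ((parsers klass) field))
        (str/split line (re-pattern (java.util.regex.Pattern/quote delimiter)))
        classes))

(def classes [String String String Float Float Float Float Integer Float])

;; One data line from the sample file above.
(def row
  (parse-line "NYSE,AA,2008-03-05,37.01,37.9,36.13,36.6,17752400,36.6"
              ","
              classes))
```

Here row is a 9-element vector: the first three fields are still strings, and the remaining six have been parsed into numbers in the order given by classes.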
Finally, by providing :outfields you gain the ability to select out specific fields by name:
(def stock-tap
  (hfs-delimited "/pathto/data"
                 :delimiter ","
                 :outfields ["?exchange" "?stock-sym" "?date" "?open" "?high" "?low" "?close" "?volume" "?adj"]
                 :classes [String String String Float Float Float Float Integer Float]
                 :skip-header? true))
(select-fields stock-tap ["?stock-sym" "?open"])
;; returns 2-tuples of [String, Float]: the stock symbol and opening price for each line.
Looking forward to hearing your feedback! The API here will probably change a bit before release, so get your notes in now.
Cheers,
http://grokbase.com/t/gg/cascalog-user/123ky5apsx/new-taps-hfs-delimited-and-lfs-delimited