The Penalty Box
Five minutes for attempted world domination
Down with OPP?
February 21, 2012 at 08:02 PM | categories: Coding, Tech
A lot of people rag on Perl as a language. Indeed, I can agree with some of their points. It's not great as a page-serving language, despite a decent framework like Catalyst. The syntax lends itself to awful-looking code. Figuring out what a script does can take just as long as rewriting it yourself. However, I still think it's a good language for system-level stuff: parsing, deployment and other assorted back-end work. It's pretty fast (in the world of scripting languages at least, although there are some crazy things being done in Python with JIT compiling to get it close to the speed of compiled C), there is good library support, and it's been around forever, so it's pretty stable.
The downside usually lies in reading code that is not your own (Other People's Perl). I was trying to use a Perl library recently that interacted with Amazon's S3 storage, but it wasn't doing quite what I wanted to do. I then came across this little tidbit that made me want to punch the developer in the face.
my $cmd = qq[curl $curl_options $aws $header --request $verb $content --location @{[cq($url)]}];
Anyone want to hazard a guess? No? Yeah, didn't think so, not unless you're one of those crazy Perl guys. It starts off pretty normal. "qq[...]" is just a double-quoted string with different delimiters, so the variables inside get interpolated and the finished command string is assigned to $cmd. Nothing is executed at this point; that would be the backtick operator (qx). The rest is pretty straightforward too, until you get to the parameter of --location. Eventually I figured out that the output of cq was a list of strings. The square brackets wrap that list in an anonymous array reference, and the @{} syntax dereferences it so the elements get interpolated into the string, joined by the list separator $" (a single space by default). This is an example of someone trying to be clever with his code when he could have just used a freaking join function!
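For anyone who doesn't speak Perl, here is roughly what that last bit boils down to, transliterated into Python. This is only a sketch: the cq below is a hypothetical stand-in (all I know about the real one is what it returns), and build_command and interpolate_list are names I made up for illustration. The one thing it leans on that is actual Perl behavior is that interpolating a list into a double-quoted string joins the elements with a single space by default.

```python
def cq(url):
    # Hypothetical stand-in for the library's cq(): it returns a list of
    # strings. Here it just wraps the URL in single quotes.
    return ["'" + url + "'"]


def interpolate_list(items, separator=" "):
    # Mimics Perl's "@{[ ... ]}" inside a double-quoted string: the list
    # elements are joined with the list separator (a space by default).
    return separator.join(items)


def build_command(verb, url):
    # The explicit version of the one-liner: join first, then build the string.
    location = interpolate_list(cq(url))
    return "curl --request {} --location {}".format(verb, location)


if __name__ == "__main__":
    print(build_command("GET", "http://example.com/bucket/key"))
```

Written this way, the join is visible to whoever has to read the code next, which is the whole point of the complaint.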
You down with OPP? Hell no, not me.
Apache Pig
January 29, 2012 at 07:28 PM | categories: Coding, Tech
I've been messing around with various ways to parallelize jobs in order to aggregate logs faster. Fortunately, I didn't have to delve too far into the insanity of CPAN, or worse yet, the depths of the Internet (although arguably, that is CPAN). This particular problem has been solved in a rather elegant fashion by the Apache Software Foundation: Apache Pig. It's a fairly high-level language that lets you write MapReduce programs that run on Hadoop.
Previously, I had been crunching away in aggregation with Perl and a series of in-memory data structures. For a relatively low-traffic application/website, this is feasible. Where it becomes infeasible is when you start getting into the millions of impressions per hour. Enter Apache Pig. Tell it to load your logs, give it a couple filters to get rid of stuff you don't care about, group by the stuff you want to aggregate on and store the results into a file. Just running it on one processor was already miles better than crunching it through the most optimized Perl script.
Say you have a couple of pages you want to track on your website. From your Apache logs, you can extract out your page names or page IDs from the URL requests. Then group by the ID and do a count after you extract out the data you want.
-- Assumes piggybank.jar has already been registered with REGISTER (the path depends on your install)
define DateExtractor org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('[dd/MMM/yyyy:HH:mm:ss', 'yyyyMMddHH');

-- Split the Apache logs into columns with the CSVExcelStorage parser, split by whitespace (your Apache log may vary)
LOGS = LOAD '/tmp/my-logs' USING org.apache.pig.piggybank.storage.CSVExcelStorage(' ') as (ip, datetime, offset, request, status, bytes, useragent);

-- Now we look at only the lines we want, in this case the requests with "mypage" in them that were 200 OK
LOGS = FILTER LOGS BY request MATCHES '.*mypage.*';
LOGS = FILTER LOGS BY status MATCHES '200';

-- Now we extract the hour from the timestamp and regenerate the lines with the YMDH type of timestamp
-- (no parentheses around the field list, so each field stays a separate column and the AS alias is legal)
PARSED = FOREACH LOGS GENERATE ip, DateExtractor(datetime) AS HourDateTime, request, status;

-- Group the aggregation by the hour
LOGS_GROUPED = GROUP PARSED BY (HourDateTime);

-- Now do the counts by the group and count the lines we generated in the PARSED variable
REQ_COUNT = FOREACH LOGS_GROUPED GENERATE group, COUNT(PARSED) AS mycount;
STORE REQ_COUNT INTO 'myoutput';
The results are stored like this:
(2012012900) 1
(2012012901) 4
etc...
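For comparison, here is the same filter, group and count logic as a single-process, in-memory script, which is roughly the "traditional scripting" approach that Pig replaces. It's a sketch in Python rather than my actual Perl, and the log layout in the regex is an assumption chosen to mirror the columns in the Pig script above (ip, datetime, offset, request, status, bytes, useragent). The names LOG_LINE and hourly_counts are made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed log layout (hypothetical, mirrors the Pig schema):
#   ip [dd/Mon/yyyy:HH:mm:ss offset] "request" status bytes "useragent"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \[(?P<datetime>\S+) (?P<offset>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<useragent>[^"]*)"'
)


def hourly_counts(lines, page="mypage"):
    """Count 200 OK requests containing `page`, keyed by yyyyMMddHH."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that don't fit the assumed layout
        # Same two filters as the Pig script: page name and status code.
        if page not in match.group("request"):
            continue
        if match.group("status") != "200":
            continue
        # Truncate the timestamp to the hour, like DateExtractor does.
        stamp = datetime.strptime(match.group("datetime"), "%d/%b/%Y:%H:%M:%S")
        counts[stamp.strftime("%Y%m%d%H")] += 1
    return dict(counts)


if __name__ == "__main__":
    with open("/tmp/my-logs") as handle:
        for hour, count in sorted(hourly_counts(handle).items()):
            print(hour, count)
```

Everything here lives in one process and one dictionary, which is fine for a low-traffic site and exactly what stops scaling once the logs get big.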
From here, you can do what you want with it: insert into a database, store as a flat file, parse directly in a report page, whatever. What's cool is that while this runs fairly quickly on one machine compared to a parsing script, you can throw this onto a Hadoop cluster like Amazon's Elastic MapReduce and parallelize it trivially. You just give it the input logs you want to parse, the Pig script you wrote to parse them and an output directory on S3. Then, automagically, Amazon farms the job out to however many EC2 instances you specify and spits out the results in your output directory.
Even doing complex analysis on your logs takes way less time than it used to with traditional scripting methods. If you leverage these "as-a-service" platforms, you don't even need to keep a supercomputing cluster around. It's already up in the cloud for you to use. Microsoft's "Yay Cloud!" commercials may be a giant misnomer, but this kind of thing is very much a "Yay Cloud!" moment for me (in the proper "cloud" sense, if you will). This may be one of those moments where I nerd out over something and people just look at me and say, "Um, okay. What else does it do?" But I think it's pretty rad.
