The Penalty Box

Five minutes for attempted world domination

Down with OPP?

February 21, 2012 at 08:02 PM | categories: Coding, Tech

A lot of people rag on Perl as a language. Indeed, I can agree with some of their points. It's not great as a page-serving language, despite a decent framework like Catalyst. The syntax lends itself to awful-looking code. Figuring out what a piece of it does can take just as long as rewriting it yourself. However, I still think it's a good language for system-level stuff: parsing, deployment and other assorted back end work. It's pretty fast (in the world of scripting languages at least, although there are some crazy things being done with JIT compiling to get Python closer to the speed of compiled C), there is good library support and it's been around forever, so it's pretty stable.

The downside usually lies in reading code that is not your own (Other People's Perl). I was trying to use a Perl library recently that interacted with Amazon's S3 storage, but it wasn't doing quite what I wanted. I then came across this little tidbit that made me want to punch the developer in the face.

  my $cmd = qq[curl $curl_options $aws $header --request $verb $content --location @{[cq($url)]}];

Anyone want to hazard a guess? No? Yeah, didn't think so, not unless you're one of those crazy Perl guys. It starts off pretty normal. "qq[" is just a double-quoted string with square brackets as the delimiters (it's "qx" that does the backtick thing), so $cmd ends up holding the interpolated command string, not its output. The rest is pretty straightforward too, until you get to the parameter after --location. Eventually I figured out that the output of cq was an array of strings. The inner square brackets wrap that list in an anonymous array reference, and the @{} syntax dereferences it so the elements get interpolated into the string, joined by the list separator $" (a single space by default). It's a trick for calling a function in the middle of a string. This is an example of someone who is trying to be clever with his code when he could have just used a freaking join function!

You down with OPP? Hell no, not me.


Look Up

February 18, 2012 at 11:25 AM | categories: Five Hole Photo

Ah yes, another cliche Internet cat photo. Don't worry, I won't be putting any LOLText on it. This is my sister's cat, Beanie. Beanie is a bit of a putz; everything either amazes her or scares the crap out of her. Or the world is ending. She is very much a #FirstWorldProblems cat.

This was taken when she was still a kitten, so the prospect of a camera looming over her was a very curious sight indeed. Hence the eyes bugging out.

IMG_1261_v1.JPG

f/1.4, 1/40 sec, ISO 400, exposure correction in post


Sunset Theatre

February 10, 2012 at 11:09 PM | categories: Five Hole Photo

I took this photo on a walk around the Seawall by Stanley Park. As the sun went down, this couple sat down on a log to watch the sunset. As I waited for the skies to redden a little deeper to get a landscape shot with the ships in the distance, I noticed the light from the setting sun leading straight to the lovebirds on the log. Once the sun got low enough to silhouette their two heads in an embrace, I snapped a shot. Enter your own mushy romantic cliche here.

IMG_0431.JPG

f/14, 1/30 sec, ISO 100


Bugs are cool

February 04, 2012 at 04:35 PM | categories: Five Hole Photo

I snapped this photo before I had my DSLR. If there's one thing I've learned since getting into photography, it's that the gear doesn't matter nearly as much as the operator of the gear. A little luck doesn't hurt either. Immediately after I hit the shutter for this picture, this little guy took off and buzzed a canoe like a Japanese Zero Fighter. Except, you know, without the kamikaze bit. Macro shots for nature's creatures usually take some level of sneakiness, but I happened upon this guy while walking along the trail at Burnaby Lake.

IMG_6676.JPG

f/5, 1/500 sec, ISO 100


Apache Pig

January 29, 2012 at 07:28 PM | categories: Coding, Tech

I've been messing around with various ways to parallelize jobs in order to aggregate logs faster. Fortunately, I didn't have to delve too far into the insanity of CPAN, or worse yet, the depths of the Internet (although arguably, that is CPAN). This particular problem has been solved in a rather elegant fashion by the Apache Software Foundation: Apache Pig. Its language, Pig Latin, is fairly high level and lets you write MapReduce programs that run on Hadoop.

Previously, I had been crunching away in aggregation with Perl and a series of in-memory data structures. For a relatively low-traffic application/website, this is feasible. Where it becomes infeasible is when you start getting into the millions of impressions per hour. Enter Apache Pig. Tell it to load your logs, give it a couple filters to get rid of stuff you don't care about, group by the stuff you want to aggregate on and store the results into a file. Just running it on one processor was already miles better than crunching it through the most optimized Perl script.

Say you have a couple of pages you want to track on your website. From your Apache logs, you can extract the page names or page IDs from the URL requests. Once you have the data you want, group by the ID and do a count.

  -- Piggybank ships separately from core Pig, so register the jar first (the path here is an assumption, adjust it for your install)
  REGISTER piggybank.jar;
  DEFINE DateExtractor org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('[dd/MMM/yyyy:HH:mm:ss', 'yyyyMMddHH');
  -- Split each Apache log line into columns with the CSVExcelStorage parser, using a space as the delimiter (your Apache log may vary)
  LOGS = LOAD '/tmp/my-logs' USING org.apache.pig.piggybank.storage.CSVExcelStorage(' ') as (ip, datetime, offset, request, status, bytes, useragent);
  -- Keep only the lines we want: in this case, requests with "mypage" in them that came back 200 OK
  LOGS = FILTER LOGS BY request MATCHES '.*mypage.*';
  LOGS = FILTER LOGS BY status MATCHES '200';
  -- Extract the hour from the timestamp and regenerate the lines with a yyyyMMddHH timestamp
  PARSED = FOREACH LOGS GENERATE (ip, DateExtractor(datetime) as HourDateTime, request, status);
  -- Group by the hour
  LOGS_GROUPED = GROUP PARSED BY (HourDateTime);
  -- Count the PARSED lines that fell into each group
  REQ_COUNT = FOREACH LOGS_GROUPED GENERATE group, COUNT(PARSED) AS mycount;
  STORE REQ_COUNT INTO 'myoutput';

The results are stored like this:

(2012012900)    1
(2012012901)    4
etc...
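
For comparison, a single-machine version of the same filter, group and count might look something like the sketch below. It's Python rather than the Perl mentioned earlier, and it is only an illustration of the in-memory approach, not the script that was actually in use. The regex assumes the common Apache log layout (adjust it to your LogFormat), and count_hits_per_hour and the sample lines are made up for the example.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed Apache common/combined log layout; adjust the pattern to your LogFormat.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3}) '
)


def count_hits_per_hour(lines, needle="mypage"):
    """Count 200 OK requests containing `needle`, keyed by yyyyMMddHH."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None:
            continue  # skip lines that don't fit the assumed layout
        if needle not in m.group("request") or m.group("status") != "200":
            continue
        # "29/Jan/2012:00:15:32 -0800" -> "2012012900" (the UTC offset is ignored)
        ts = datetime.strptime(m.group("ts").split()[0], "%d/%b/%Y:%H:%M:%S")
        counts[ts.strftime("%Y%m%d%H")] += 1
    return counts


# Made-up sample lines, purely for illustration.
sample = [
    '10.0.0.1 - - [29/Jan/2012:00:15:32 -0800] "GET /mypage?id=1 HTTP/1.1" 200 512',
    '10.0.0.2 - - [29/Jan/2012:01:02:03 -0800] "GET /mypage?id=2 HTTP/1.1" 200 512',
    '10.0.0.3 - - [29/Jan/2012:01:05:00 -0800] "GET /other HTTP/1.1" 200 128',
    '10.0.0.4 - - [29/Jan/2012:01:09:41 -0800] "GET /mypage?id=2 HTTP/1.1" 404 0',
]

for hour, hits in sorted(count_hits_per_hour(sample).items()):
    print(f"{hour}\t{hits}")
```

Everything lives in one Counter in memory, which is fine for a small site and exactly the part that stops scaling once the logs get big. Either way, you end up with a list of hour and count pairs.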

From here, you can do what you want with it: insert into a database, store as a flat file, parse directly in a report page, whatever. What's cool is that while this runs fairly quickly on one machine compared to a parsing script, you can throw this onto a Hadoop cluster like Amazon's Elastic MapReduce and parallelize it trivially. You just give it the input logs you want to parse, the Pig script you wrote to parse them and an output directory on S3. Then automagically, Amazon takes it, farms the work out to however many EC2 instances you specify and spits out the results in your specified directory.

Even doing complex analysis on your logs takes way less time than it used to with traditional scripting methods. If you leverage these "as-a-service" platforms, you don't even need to keep a supercomputing cluster around. It's already up in the cloud for you to use. Microsoft's "Yay Cloud!" commercials may be a giant misnomer, but this kind of thing is very much a "Yay Cloud!" moment for me (in the proper "cloud" sense if you will). This may be one of those moments where I nerd out over something and people just look at me and say, "Um, okay. What else does it do?" But I think it's pretty rad.

