Allen J. Hall

Materials Science & Engineering, Productivity, and Life

Software For Scientists: Awk (& OsX)

For years, while Apple's system software was proprietary, I was itching to have a Unix system underneath with the ease of the windowing system on top. So when OsX was released, I was ecstatic. Why? Because some tasks are so much easier in Unix than in other OSes. This post is only one example of what you can do if you do a bit of research into how to use your Mac. For those who have un*x boxes, this will merely be a place-holder for a few AWK scripts for you.

One of the first research programs I worked on as an undergraduate involved taking large amounts of data from an HPGC running a fixed-bed reactor. There were many problems with the work: (1) since our filter materials were very good at filtering the chemicals used, the run-times to breakthrough were very long (which meant that I had to come into the lab every 4 hours and restart the machine), (2) even with the lowest data setting, we had gobs and gobs of data (gigabytes in a day, when we were used to kilobytes), and (3) the data was saved in the form of: A 1 B 2 C3 (return) D4 E5 F6 (return). To the rescue came a friend of mine, Nicolas, who was an excellent coder. He turned me on to AWK and it worked beautifully. The code he came up with was more complex than the following, and I’ve misplaced the old code, so for now, let’s deal with a simple case…

Let’s say you have too much data (you set the machine wrong, or you can’t set the machine properly), and you don’t mind throwing some of it away, because the trends you want to see play out over far more points than the ones you wish to pitch. If you have thousands of data-points and you only need, say, every 5th line or every other line, do you really put it in Excel and edit it manually? If so, you shouldn’t get a pay raise next quarter. You can save crap tons of time by using the right tools.

Enter Awk – a great command-line program available on almost every Linux/Unix computer (maybe yours is called gawk). [A huge book on how to use all the aspects of awk is available here: The GNU Awk User's Guide.]

With a single one-liner of code in the terminal, you can accomplish your goal of reducing your data. My pal Brandon wanted to keep only every 5th line from his data:

cat "$@" | awk -F, '{if (count++%5==0) print $0;}' >~/Desktop/AwkOutput5thLines.txt

This is the code I used within Automator to accomplish his data-throw-away needs. (There was an error with line-endings that I fix later on in this post…) cat "$@" takes the output of the choose-file Automator task (passed in as arguments) and feeds the file to the awk command through a pipe. The -F, option only sets awk's field separator to a comma; it has nothing to do with piping, and since the script prints whole lines it is not strictly needed here. The count++ expression is doing the dirty work: count starts at zero and goes up by one on every line, so the test is true for lines 1, 6, 11, and so on. Hat-tip to Duane’s Brain blog for the awk portion of the code! Details of how to pass strings as arguments came from this great post on MacOsXHints.com.
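
To see which lines each idiom actually keeps, here is a minimal sketch. It assumes a made-up 12-line sample generated with seq (not Brandon's data), and the variable names are only for illustration:

```shell
# Build a small sample file: 12 lines holding the numbers 1..12,
# so it is obvious which lines survive.
sample=$(mktemp)
seq 1 12 > "$sample"

# The counter idiom from the post: count is 0 on the first line,
# so lines 1, 6 and 11 are kept.
kept_count=$(awk '{ if (count++ % 5 == 0) print $0 }' "$sample")

# The same selection using awk's built-in record number NR.
kept_nr=$(awk 'NR % 5 == 1' "$sample")

# A different selection: NR % 5 == 0 keeps lines 5 and 10 instead.
kept_fifth=$(awk 'NR % 5 == 0' "$sample")

echo "$kept_count"
echo "$kept_fifth"
rm -f "$sample"
```

Whether you want lines 1, 6, 11 or lines 5, 10, 15 depends on your data; for trend-spotting it rarely matters, but it is worth knowing which one you are getting.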

Another frequent problem is throwing away every other line. Here’s the code to pitch every 2nd line:
cat "$@" | awk -F, 'NR % 2 == 1' > ~/Desktop/AwkOutput.txt
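
If you find yourself changing the number often, the two one-liners can be folded into one. This is only a sketch: keep_every_nth is a hypothetical helper name of my own, and it assumes you always want to keep the first line of each group of n.

```shell
# keep_every_nth is a hypothetical helper, not part of the Automator workflow.
# Usage: keep_every_nth N [file ...]   (reads standard input if no file is given)
keep_every_nth() {
    n=$1
    shift
    # -v passes the shell value into awk as the variable n.
    # (NR - 1) % n == 0 is true for lines 1, n+1, 2n+1, ...
    awk -v n="$n" '(NR - 1) % n == 0' "$@"
}

# Every other line of a six-line sample: keeps lines 1, 3 and 5.
seq 1 6 | keep_every_nth 2
```

With n set to 5 this keeps the same lines as the count++ version above.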

One of the errors my pal had with his data was the line-endings. So, note that awk requires linefeeds (Unix-format text) in order to accomplish its goals. A simple fix is to use tr (translate characters) to turn the line-endings into something Unix understands.

So, I solved the text line-ending problems by adding the following code:
tr '\r' '\n'
making the final code appear like this (within Automator) (all in one line):
cat "$@" | tr '\r' '\n' | awk -F, '{if (count++%5==0) print $0;}' >~/Desktop/AwkOutput5thLines.txt
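
One caveat, sketched below with made-up sample text: translating '\r' to '\n' suits files whose lines end in a carriage return only (old Mac style). I am assuming here that your file might instead come from Windows, where lines end in carriage return plus linefeed; in that case translating would turn every carriage return into an extra blank line, so deleting it with tr -d is the safer choice.

```shell
# Old-Mac style text: lines end with CR only, so translate CR to LF
# (this is the fix used in the post).
cr_only=$(printf 'a,1\rb,2\rc,3\r' | tr '\r' '\n' | awk 'NR % 2 == 1')

# Windows style text: lines end with CR+LF, so delete the CR instead.
# Translating it would double the number of lines with blank ones.
crlf=$(printf 'a,1\r\nb,2\r\nc,3\r\n' | tr -d '\r' | awk 'NR % 2 == 1')

echo "$cr_only"
echo "$crlf"
```

Both pipelines end up keeping the first and third records of the three-line sample.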


Finally, to give you a few presents for dropping in to read this meager blog, here’s a finished Automator script as well as an application form of the script in case you need exactly every 5th line like my pal, Brandon.

4 Comments

  1. foo bar
    Posted April 12, 2011 at 2:05 pm | Permalink

    It’s nice that you want to use awk, but please read parts of the manpage before making a post like this [which people may accidentally stumble upon -- i found this when trying to find out what version of awk that OSX uses]

    what rankles me is '{if (count++%5==0) print $0;}'

    Forget about the fact that you can write it as 'NR % 5 == 1', but you should at least recognize that it's 'count++%5==0' [no need for print $0 as it's the implicit rule]

  2. Posted April 12, 2011 at 2:14 pm | Permalink

    Hi Foo Bar- I’m glad you could drop in to comment.

    I’m also glad you took the time to comment on “proper” code. Those interested will read your thoughts here. In my honest opinion, while I understand the interest in minimal code, and better understanding of the language/software, the above code is not incorrect, and gets the job done, which is what most will want.

    Also- being that osx is built upon freebsd and a Mach kernel, the user is allowed to alter the standard install in a myriad of ways- so, there is no one version of awk for osx.

    My best, and I am serious when I say thank you for the correction and taking the time to drop by.
    -Allen

  3. Kefas
    Posted June 5, 2012 at 3:57 pm | Permalink

    Hi both!

    Yes, it’s true. "$ awk 'NR % 5 == 0' data_file > AwkOutput5thLines.txt" is all the code needed, but there’s no reason to be an ass towards someone with good intentions.

    NR here is the number of the line
    % is the remainder operator

    So if the number of a line is divided by 5 and leaves a remainder of 0, it will print the line
    And because all you’re doing is printing the whole line, you don’t need {if (…) print $0;}
    And there’s also no need to provide the comma as a field separator with -F,

    Simpler and cleaner codes can noticeably improve the performance when processing large data. But I don’t think it’s an acceptable excuse to be grumpy towards others.

  4. Posted June 5, 2012 at 4:18 pm | Permalink

    Hi Kefas!

    Thanks for dropping in! I appreciate the detail added – this helps me learn from foobar’s post and hopefully others will as well! Thank you for the information!!

    I didn’t take foo bar’s comments too harshly, so no worries! There will always be someone who can code something in 1 line of perl that will take me 10 pages. :)

    Thanks!!

A Quick Introduction...

I'm a graduate student (PhD Candidate) at the University of Illinois at Urbana-Champaign.

I've studied and researched in two fields of Materials Science and Engineering (Polymers and Semiconductors). My interests are as diverse as my musical tastes and I usually have my hand in some crazy project during my free time.

I'm available for consulting and have access to a world-renowned materials research user-facility supported by the D.O.E. If you would like to know more, please contact me.
