2008-01-14

UNIX tip of the day: cut columns from a text file

Most people who know their UNIX shells know of "cut", a utility which can grab only certain parts of each line within a text file. I'm kind of a weather geek, so I'll use some raw CSV METAR data from Weather Underground for this demo.

For instance, the following will grab the forth through tenth byte from each line of Data.csv:
$ cut -b 4-10 Data.csv
eCST,Te
53 AM,3
3 AM,37
3 AM,37
3 AM,37
3 AM,37
3 AM,37
3 AM,37
0 AM,37


If Data.csv happens to be a genuine comma-separated file, the following will grab the second and fifth fields (columns) from each line, using a comma as the delimiter between fields:
$ cut -f2,5 -d, Data.csv
TemperatureF,Sea Level PressureIn
37.0,29.79
37.0,29.77
37.0,29.75
37.0,29.69
37.9,29.66
37.9,29.65
37.0,29.65
37.4,29.61

This, as you can see, grabbed the Fahrenheit Temperature as well as the Sea Level Pressure (in Inches of Mercury), not like you cared. You're not here to check the weather. This is just a demo.

Unfortunately, cut always dumps the columns out in the order they're in in the file, regardless of which order you specify. For example switching the field order in the command line still nets you the identical output as before:
$ cut -f5,2 -d, Data.csv
TemperatureF,Sea Level PressureIn
37.0,29.79
37.0,29.77
37.0,29.75
37.0,29.69
37.9,29.66
37.9,29.65
37.0,29.65
37.4,29.61

Also, cut is really stupid about data that's organized in columns. Unless the formatting is very, very clean, it can be hard to get predictable results using cut.

For parsing data that's in columns (say, the output of the ls command, for example), it's better to take a crack at using awk. Awk is a somewhat intimidating beast for the unfamiliar user. For simply pulling data out of columns, though, it can't be beat.

Each column that awk identifies in a line is given an incremental variable, starting with $1. By default, awk uses white space as its field separator. Let's give this a try.


$ ls -la
total 16
drwx------+ 2 axon staff 136 Jan 14 13:50 .
drwxr-xr-x 14 axon staff 646 Jan 4 17:38 ..
-rw-r--r-- 1 axon staff 2970 Jan 14 13:50 DailyHistory.csv
-rw-r--r-- 1 axon staff 672 Jan 14 13:50 Data.csv


Lets try to grab the owner and the file name only. That would be the third and ninth fields in the above listing.

$ ls -la | awk '{print $3 $9}'

axon.
axon..
axonDailyHistory.csv
axonData.csv

As you can see, there's no separation between the two fields now. You must separate your records within the print command. You can use a space, tabs, a comma, or whatever you wish. We'll just separate them with a space in quotes:

$ ls -la | awk '{print $3 " " $9}'

axon .
axon ..
axon DailyHistory.csv
axon Data.csv

Also, unlike cut, you can specify things in any order you wish. This is great for getting only the data you want from large CSV files or other tabular data right from the command line.

$ ls -la | awk '{print $9 "," $3 "," $1}'
,,total
.,axon,drwx------+
..,axon,drwxr-xr-x
DailyHistory.csv,axon,-rw-r--r--
Data.csv,axon,-rw-r--r--

blog comments powered by Disqus