awk && Regular Expressions For Finding Text

The History of Computing - A podcast by Charles Edge

Programming was once all about math. And life was good. Then came strings, or those icky non-numbery things. Then we had to process those strings. And much of that is looking for patterns, which wouldn’t be needed with integers, or numbers. For example, a space in a string of text. Let’s say we want to print hello world to the screen in bash. That would be the echo command, followed by “Hello World!” Now let’s say we ran that without the quotes. The interpreter sees the space and splits the text into two separate arguments, Hello and World!, rather than one string. Echo happens to print both, but plenty of other commands would treat that second word as something else entirely, like another file or option, according to which command is being used.

Unix was started in 1969 at Bell Labs. Part of that work was the Thompson shell, the first Unix shell, which shipped in 1971. And C was written in 1972. These make up the ancestral underpinnings of the modern Linux, BSD, Android, Chrome, iPhone, and Mac operating systems. A lot of the work the team at Bell Labs was doing was shifting from pure statistical and mathematical operations, meant to connect phones and do R&D faster, to more general computing applications. Those meant going from math to those annoying stringy things. Unix was an early operating system and that shell gave them new abilities to interact with the computer.

People called files funny things. There was text in those files. And so text manipulation became a thing. Lee McMahon developed sed in 1974, which was great for finding patterns and doing basic substitutions. Another team at Bell Labs, which included Canadian computer scientist Alfred Aho, Peter Weinberger, and Brian Kernighan, had more advanced needs. Take the initials of their last names and we get awk. Awk is a programming language they developed in 1977 for data processing, or more specifically for text manipulation. Marc Rochkind had been working on a version management tool for code at Bell, and that involved some text manipulation and provided a good starting point for awk.
It’s meant to be concise and, given some input, produce the desired output. It’s a nice, short, and efficient scripting language to help people who didn’t need to go out and learn C to do some basic tasks. AWK is a programming language with its own interpreter, so there’s no need to compile in order to run AWK scripts as executable programs. Sed and awk were both written to be used as one-line programs, or more if needed. But building in implicit loops and implicit variables made it simple to build short but powerful scripts around regular expressions.

Think of an awk program as a pair of objects. The first is a pattern, and it’s followed by an action to take, in curly brackets. It can be dangerous to call if the pattern is too wide open, especially when piping information. For example, run ls -al at the root of a volume, pipe that to awk to print $1 or some other position, and then pipe that into xargs to rm, and a systems administrator could have a really rough day. Those $1, $2, and so on represent the positions of words in each line. So they could be directories.

Think about this, though. In a world before relational databases, when we were looking to query the 3rd column in a file with information separated by some delimiter, piping those positions represented a simple way to effectively join tables of information into a text file or screen output. Or to find files on a computer that match a pattern for whatever reason.

Awk began powerful. Over time, improvements have enabled it to be used in increasingly complicated scenarios, especially when it comes to pattern matching with regular expressions. Various coding styles for input and output have been added as well, which can be changed depending on the need at hand. Awk is also important because it influenced other languages. It was included in IEEE Standard 1003.1, so it is now a part of the POSIX standard. And after a few years, Larry Wall came up with some improvements, and along came Perl. But awk has always been one of the most succinct and usable regular expression engines.
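
The pattern-and-action pairing and the $1, $2 positions described above can be sketched in a few lines of shell. This is only an illustration, not anything from the original awk authors: the sample_data function and its colon-separated name, department, and salary records are made up for the example, and -F: tells awk which delimiter to split on.

```shell
# Hypothetical sample data: name, department, and salary separated by colons.
sample_data() {
  printf '%s\n' \
    'alice:engineering:120' \
    'bob:sales:90' \
    'carol:engineering:110'
}

# A pattern followed by an action in curly brackets:
# for every line that matches /engineering/, print the first field ($1).
engineers=$(sample_data | awk -F: '/engineering/ { print $1 }')
echo "$engineers"

# Querying the 3rd column: with no pattern, the action runs on every line.
# The implicit loop reads each line, and sum is an implicit variable
# that starts out empty (treated as 0). END runs once after the last line.
total=$(sample_data | awk -F: '{ sum += $3 } END { print sum }')
echo "$total"
```

Run as written, this prints alice and carol on separate lines, then 320. Narrowing the pattern before the curly brackets is the safer habit: leave the pattern out, and the action applies to every single line of input, which is exactly how that rough day with xargs and rm gets started.
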
Part of what makes awk so succinct is the wildcard, piping, and file redirection techniques borrowed from the original shells. The AWK creators wrote a book called The AWK Programming Language for Addison-Wesley in 1988.

Aho would go on to develop influential algorithms, write compilers, and write books (some of which were about compilers). Weinberger continued to do work at Bell before becoming the Chief Technology Officer of the hedge fund Renaissance Technologies, with former code breaker and mathematician James Simons and Robert Mercer. His face led to much love from his coworkers at Bell during the advent of digital photography, and hopefully some day we’ll see it on the Google Search page, given he now works there.

Brian Kernighan was a contributor to the early Multics and then Unix work, as well as C. In fact, an important version of C, K&R C, stands for Kernighan and Ritchie C. He coauthored The C Programming Language and has written a number of other books, including one on the Go programming language. He also wrote a number of influential algorithms, as well as some other programming languages, including AMPL. His 1978 description of how to manage memory when working with those pesky strings we discussed earlier went on to give us the Hello World example we use for pretty much all introductions to programming languages today. He worked on ARPA projects at Stanford, helped with emacs, and now teaches computer science at Princeton, where he can help to shape future generations of programming languages and the minds of their creators.