For a list of BASHing data 2 blog posts see the index page.


Extract the year from a date string without using the date command

My wife asked if I could add a year-only field to a big table she was working on. It had a fairly messy date field, "DATE R'CD", and the much-simplified, pipe-separated table below ("years") has one example each of the formatting variations I found in that field.

RECORD|DATE R'CD
1467-1|Nov. 1992
1467-2|1987
1467-3|
1467-4|xxx. 1995
1467-5|March 2003
1467-6|2006?
1467-7|Feb.2001
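
To follow along, the sample table can be recreated with a here-document. This is only a convenience sketch: it writes the file "years" into a fresh temporary directory, and the variable name workdir is an assumption for illustration, not part of the original workflow.

```shell
# Recreate the sample "years" table in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/years" <<'EOF'
RECORD|DATE R'CD
1467-1|Nov. 1992
1467-2|1987
1467-3|
1467-4|xxx. 1995
1467-5|March 2003
1467-6|2006?
1467-7|Feb.2001
EOF
cat "$workdir/years"
```

Running cd "$workdir" afterwards lets the commands in this post be pasted as written.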

The date command doesn't like any of these strings:

[Screenshot: dates1]

But there are other ways on the command line to extract the year from "DATE R'CD". This post looks at 8 methods that work with the "years" file above.


With AWK. Each of the 4 AWK methods below begins with awk -F"|" 'NR==1 {print $0"|YEAR"; next}. This tells AWK that the field separator is a pipe, and that when processing the first line AWK should print the full line, then a pipe and the string YEAR before going to the next line.

awk -F"|" 'NR==1 {print $0"|YEAR"; next} {printf("%s|",$0); gsub(/[^0-9]/,"",$2); print $2}' years

This is the method I actually used. For each line after the header, AWK first uses printf to print the full line followed by a pipe, with no trailing newline. AWK then deletes every character in the second field that isn't a digit, prints whatever is left in the field, and by default also prints a newline. This works here because the only digits in each "DATE R'CD" entry are those of the year.
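
The order of the two steps matters. In AWK, changing a field (which gsub does to $2) rebuilds $0 using the output field separator, a single space by default. The following is a small sketch on one sample line, using only POSIX AWK features; the variable names line, good and bad are illustrative.

```shell
line='1467-7|Feb.2001'

# Print the untouched line first, then clean field 2 (the order used above).
good=$(printf '%s\n' "$line" |
  awk -F"|" '{printf("%s|",$0); gsub(/[^0-9]/,"",$2); print $2}')

# Clean field 2 first: $0 is rebuilt with the default OFS (a space),
# so the original date string and its pipe separator are lost.
bad=$(printf '%s\n' "$line" |
  awk -F"|" '{gsub(/[^0-9]/,"",$2); print $0"|"$2}')

printf '%s\n%s\n' "$good" "$bad"
```

The first result keeps the original record intact and appends the year; the second has already overwritten the date string by the time the line is printed.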

[Screenshot: dates2]

awk -F"|" 'NR==1 {print $0"|YEAR"; next} {print $0,gensub(/[^0-9]/,"","g",$2)}' OFS="|" years

The next command (above) is a variation of the first, this time using GNU AWK's gensub instead of gsub to delete any non-digits in field 2. Unlike gsub, gensub returns the modified string and leaves field 2 itself unchanged. The pipe between field 2 and the new field 3 is inserted by setting OFS="|", but could just as easily be printed with print $0"|"gensub(/[^0-9]/,"","g",$2). (I love AWK's flexibility...)

awk -F"|" 'NR==1 {print $0"|YEAR"; next} {patsplit($2,a,"[0-9]{4}"); print $0,a[1]}' OFS="|" years

In the above command the patsplit function breaks field 2 into pieces and puts them in the array "a"; each piece matches the regex for a year number, [0-9]{4}. After this is done, AWK prints the full line followed by the first (and only) year number in the array.

awk -F"|" 'NR==1 {print $0"|YEAR"; next} {split($2,a,"[0-9]{4}",s); print $0,s[1]}' OFS="|" years

A variation of the last command. Here the split function builds an array "a" in which the pieces have been separated by anything matching a year number. Those separating pieces are stored in another array, "s", and this time AWK prints the full line followed by the first (and only) separator.
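
gensub, patsplit and the four-argument form of split are GNU AWK extensions. Where gawk isn't available, a similar result can be sketched with the POSIX match function, which sets RSTART and RLENGTH for the first match. This is an alternative under that assumption, not one of the methods above. The year regex is spelled out as four bracket expressions because some older AWKs lack support for {4}, and the sample rows and the variable name result are illustrative.

```shell
# A few sample rows; the header is handled as in the commands above.
result=$(
  printf '%s\n' "RECORD|DATE R'CD" '1467-1|Nov. 1992' '1467-3|' '1467-6|2006?' |
  awk -F"|" 'NR==1 {print $0"|YEAR"; next}
    {
      year = ""
      # match() returns 0 when field 2 has no run of four digits.
      if (match($2, /[0-9][0-9][0-9][0-9]/))
        year = substr($2, RSTART, RLENGTH)
      print $0 "|" year
    }'
)
printf '%s\n' "$result"
```

Like the patsplit version, this picks out a four-digit run rather than stripping non-digits, so an empty field 2 simply yields an empty YEAR.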


Without AWK. Each of the following AWK-less commands has the structure paste -d"|" years <(echo "YEAR" && tail -n +2 years | cut -d"|" -f2 | [SOMETHING]). In other words, paste joins the "years" file with the output of the process substitution that follows, using a pipe as a separator. That process substitution begins by echoing "YEAR" and a newline, then the non-header part of "years" has its second field cut out before SOMETHING happens to that second field.

paste -d"|" years <(echo "YEAR" && tail -n +2 years | cut -d"|" -f2 | sed 's/[^0-9]//g')

In this first command the SOMETHING is that sed deletes every character that isn't a digit, line by line.

[Screenshot: dates3]
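
The <(...) construct is BASH process substitution. As a hedged sketch of the same structure for shells that lack it, paste can read its second input from standard input through the - operand. The cut-down sample file, the temporary directory and the variable names tmp and result are assumptions for illustration.

```shell
tmp=$(mktemp -d)
printf '%s\n' "RECORD|DATE R'CD" '1467-1|Nov. 1992' '1467-3|' '1467-7|Feb.2001' > "$tmp/years"

# Build the new column (header first, then the cleaned field 2) and feed it
# to paste on standard input; "-" stands for standard input.
result=$(
  { echo "YEAR"; tail -n +2 "$tmp/years" | cut -d"|" -f2 | sed 's/[^0-9]//g'; } |
    paste -d"|" "$tmp/years" -
)
printf '%s\n' "$result"
rm -r "$tmp"
```

The braces group the echo and the pipeline so that both write to the same stream, which plays the role of the process substitution in the original command.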

In the next 3 commands, the SOMETHING is a while loop that works line-by-line through the header-less, field 2 remnant of "years".

paste -d"|" years <(echo "YEAR" && tail -n +2 years | cut -d"|" -f2 | while read line; do echo "${line//[!0-9]/}"; done)

The parameter expansion ${line//[!0-9]/} tells BASH to delete every character in "line" that isn't a digit.

paste -d"|" years <(echo "YEAR" && tail -n +2 years | cut -d"|" -f2 | while read line; do var="$(echo "$line" | tr -dc '0-9')"; echo "$var"; done)

This time tr is used to delete the complement of the digit range, or in other words any non-digits. (The set is written 0-9 without square brackets, because tr would treat the brackets as literal characters to keep.) Note that the result is stored in a variable "var" and that "var" is echoed with a newline. If you use while read line; do echo "$line" | tr -dc '0-9'; done, tr will delete the newlines in the remnant as well.
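
The newline behaviour can be checked directly, and it also suggests a loop-free variant: adding \n to the set of characters that tr keeps. This is a sketch of an alternative, not one of the methods above, and the variable names sample, joined and lines are illustrative.

```shell
sample=$(printf '%s\n' 'Nov. 1992' '' '2006?')

# Digits only: the newlines are deleted too, so the years run together.
joined=$(printf '%s\n' "$sample" | tr -dc '0-9')

# Keep digits and newlines: one output line per input line, no loop needed.
lines=$(printf '%s\n' "$sample" | tr -dc '0-9\n' | wc -l)

echo "joined: $joined"
echo "lines kept: $lines"
```

With the newlines preserved, the empty entry still produces an empty line, so the output would stay aligned with the records when pasted.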

[Screenshot: dates4]

paste -d"|" years <(echo "YEAR" && tail -n +2 years | cut -d"|" -f2 | while read line; do var=$(grep -oE '[0-9]{4}' <<<"$line"); echo "$var"; done)

In this last command grep extracts year numbers directly from the field 2 string, but note again that the result is stored in a variable and then echoed. Without that step, grep would print nothing at all for the line without a year number, not even a blank line, and the pasted columns would no longer line up.

[Screenshot: dates5]
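
The missing-line effect can be seen by counting output lines. In this sketch grep is fed through a pipe from printf rather than a here-string, which is assumed to be equivalent for this purpose; the variable names sample, direct and looped are illustrative.

```shell
sample=$(printf '%s\n' '1987' '' '2006?')

# grep on the whole stream prints nothing for the empty line.
direct=$(printf '%s\n' "$sample" | grep -oE '[0-9]{4}' | wc -l)

# Capturing each result and echoing it gives one output line per input line.
looped=$(printf '%s\n' "$sample" |
  while IFS= read -r line; do
    var=$(printf '%s\n' "$line" | grep -oE '[0-9]{4}')
    echo "$var"
  done | wc -l)

echo "direct: $direct lines, looped: $looped lines"
```

Three lines go in; only the looped version gives three lines back, which is what paste needs to keep each year beside its own record.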



Last update: 2025-02-21
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License