
Extract the year from a date string without using the date command
My wife asked if I could add a year-only field to a big table she was working on. It had a fairly messy date field, "DATE R'CD", and the much-simplified, pipe-separated table below ("years") has one example each of the formatting variations I found in that field.
RECORD|DATE R'CD
1467-1|Nov. 1992
1467-2|1987
1467-3|
1467-4|xxx. 1995
1467-5|March 2003
1467-6|2006?
1467-7|Feb.2001
The date command doesn't like any of these strings.

But there are other ways on the command line to extract the year from "DATE R'CD". This post looks at 8 methods that work with the "years" file above.
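With AWK. Each of the 4 AWK methods below begins with awk -F"|" 'NR==1 {print $0"|YEAR"; next}. This tells AWK that the field separator is a pipe, and that when processing the first line AWK should print the full line, then a pipe and the string YEAR, before going to the next line. The first method works in any AWK; the other three rely on GNU AWK (gawk) extensions: gensub, patsplit and the four-argument form of split.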
awk -F"|" 'NR==1 {print $0"|YEAR"; next} {printf("%s|",$0); gsub(/[^0-9]/,"",$2); print $2}' years
This is the method I actually used. When processing lines after the header, AWK first uses printf to print the full line followed by a pipe, but with no following newline. AWK then deletes every character in the second field that isn't a digit, prints whatever is left in the field, and by default also prints a newline.
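To try the method on the sample table, here is a minimal sketch. It assumes the table is written to a scratch file created with mktemp; the variable names years_file and result are only for illustration and are not part of the original command.

```shell
#!/usr/bin/env bash
# Recreate the sample "years" table in a scratch file (the path is arbitrary).
years_file="$(mktemp)"
cat > "$years_file" <<'EOF'
RECORD|DATE R'CD
1467-1|Nov. 1992
1467-2|1987
1467-3|
1467-4|xxx. 1995
1467-5|March 2003
1467-6|2006?
1467-7|Feb.2001
EOF

# Header line: append "|YEAR". Other lines: print the line plus a pipe,
# strip every non-digit from field 2, then print what remains.
result="$(awk -F"|" 'NR==1 {print $0"|YEAR"; next}
    {printf("%s|",$0); gsub(/[^0-9]/,"",$2); print $2}' "$years_file")"

printf '%s\n' "$result"
rm -f "$years_file"
```

The record with an empty date field comes out as 1467-3||, so the new column stays aligned with the original rows.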

awk -F"|" 'NR==1 {print $0"|YEAR"; next} {print $0,gensub(/[^0-9]/,"","g",$2)}' OFS="|" years
The command above is a variation of the first, this time using gensub instead of gsub to delete any non-digits in field 2. The pipe between field 2 and the new field 3 is inserted with OFS="|", but could just as easily be printed with print $0"|"gensub(/[^0-9]/,"","g",$2). (I love AWK's flexibility...)
awk -F"|" 'NR==1 {print $0"|YEAR"; next} {patsplit($2,a,"[0-9]{4}"); print $0,a[1]}' OFS="|" years
In the above command the patsplit function breaks field 2 into pieces and puts them in the array "a"; each piece matches the regex for a year number, [0-9]{4}. After this is done, AWK prints the full line followed by the first (and only) year number in the array.
awk -F"|" 'NR==1 {print $0"|YEAR"; next} {split($2,a,"[0-9]{4}",s); print $0,s[1]}' OFS="|" years
A variation of the last command. Here the split function builds an array "a" from the pieces of field 2 that lie on either side of anything matching a year number. The matched separators themselves are stored in a second array, "s", so this time AWK prints the full line followed by the first (and only) separator, which is the year.
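gensub, patsplit and the four-argument split are gawk features. If only a POSIX AWK is available, the same "first 4-digit run" idea can be sketched with match and substr instead. This is a swapped-in technique rather than one of the four methods above, and the year is spelled out as four bracket expressions because some AWKs do not support the {4} interval. The names years_file, result and year are illustrative.

```shell
#!/usr/bin/env bash
# Recreate the sample "years" table in a scratch file (the path is arbitrary).
years_file="$(mktemp)"
cat > "$years_file" <<'EOF'
RECORD|DATE R'CD
1467-1|Nov. 1992
1467-2|1987
1467-3|
1467-4|xxx. 1995
1467-5|March 2003
1467-6|2006?
1467-7|Feb.2001
EOF

result="$(awk -F"|" -v OFS="|" 'NR==1 {print $0, "YEAR"; next}
    {
        year = ""
        # match() sets RSTART and RLENGTH when the regex is found in field 2
        if (match($2, /[0-9][0-9][0-9][0-9]/))
            year = substr($2, RSTART, RLENGTH)
        print $0, year
    }' "$years_file")"

printf '%s\n' "$result"
rm -f "$years_file"
```

Unlike the delete-the-non-digits methods, this picks out a run of four digits, so other digits in the field (a day number, say) would not be glued onto the year.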
Without AWK. Each of the following AWK-less commands has the structure paste -d"|" years <(echo "YEAR" && tail -n +2 years | cut -d"|" -f2 | [SOMETHING]). In other words, paste joins the "years" file with the output of the process substitution that follows, using a pipe as a separator. That process substitution begins by echoing "YEAR" and a newline, then the non-header part of "years" has its second field cut out before SOMETHING happens to that second field.
paste -d"|" years <(echo "YEAR" && tail -n +2 years | cut -d"|" -f2 | sed 's/[^0-9]//g')
In this first command the SOMETHING is that sed deletes anything that isn't a number, line by line.

In the next 3 commands, the SOMETHING is a while loop that works line-by-line through the header-less, field 2 remnant of "years".
paste -d"|" years <(echo "YEAR" && tail -n +2 years | cut -d"|" -f2 | while read line; do echo "${line//[!0-9]/}"; done)
The expression ${line//[!0-9]/} tells BASH to delete anything that isn't a number.
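The expansion can be tried on single strings outside the loop. In this small sketch the helper name digits_only is hypothetical and exists only to wrap the expansion for testing.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the argument with every non-digit removed.
digits_only() {
    local s="$1"
    # ${s//pattern/} replaces every match of the glob pattern with nothing;
    # [!0-9] matches any single character that is not a digit.
    printf '%s\n' "${s//[!0-9]/}"
}

for sample in "Nov. 1992" "2006?" "Feb.2001" ""; do
    printf '[%s] -> [%s]\n' "$sample" "$(digits_only "$sample")"
done
```

An empty input gives an empty result, which is what keeps the blank "DATE R'CD" entry as a blank YEAR entry.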
paste -d"|" years <(echo "YEAR" && tail -n +2 years | cut -d"|" -f2 | while read line; do var="$(echo "$line" | tr -dc '0-9')"; echo "$var"; done)
This time tr is used to delete the complement of the digit range, or in other words any non-digits. Note that the result is stored in a variable "var" and that "var" is echoed with a newline. If you use while read line; do echo "$line" | tr -dc '0-9'; done, tr will delete the newlines in the remnant as well.

paste -d"|" years <(echo "YEAR" && tail -n +2 years | cut -d"|" -f2 | while read line; do var=$(grep -oE '[0-9]{4}' <<<"$line"); echo "$var"; done)
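In this last command grep extracts the 4-digit year directly from the field 2 string, but note again that the result is stored in a variable and then echoed. Without that step, grep would print nothing at all for the line with no year, so that row would be missing from the new column and paste would pair the remaining years with the wrong records.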
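The effect of the variable can be checked by counting output lines. The sketch below assumes the seven field 2 values from the sample table have been written to a scratch file; field2_file, bare_count and var_count are illustrative names.

```shell
#!/usr/bin/env bash
# The seven "DATE R'CD" values from the sample table, including the blank one.
field2_file="$(mktemp)"
printf '%s\n' 'Nov. 1992' '1987' '' 'xxx. 1995' 'March 2003' '2006?' 'Feb.2001' \
    > "$field2_file"

# grep called directly in the loop: a line with no match prints nothing.
bare_count="$(while read -r line; do
    grep -oE '[0-9]{4}' <<<"$line"
done < "$field2_file" | wc -l | tr -d ' ')"

# grep result captured in a variable: echo still prints an empty line.
var_count="$(while read -r line; do
    var=$(grep -oE '[0-9]{4}' <<<"$line")
    echo "$var"
done < "$field2_file" | wc -l | tr -d ' ')"

echo "without variable: $bare_count lines; with variable: $var_count lines"
rm -f "$field2_file"
```

With the sample data the first loop yields one line fewer than the second, which is the missing row that would shift the YEAR column.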

Last update: 2025-02-21
The blog posts on this website are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.