Friday, 11 May 2018

unicode - Using awk to remove the Byte-order mark



What would an awk script (presumably a one-liner) for removing a BOM look like?



Specification:




  • print every line after the first (NR > 1)


  • for the first line: If it starts with #FE #FF or #FF #FE, remove those and print the rest


Answer



Try this:



awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}{print}' INFILE > OUTFILE


On the first record (line), remove the UTF-8 BOM bytes if they are present. Then print every record.




Or slightly shorter, using the knowledge that the default action in awk is to print the record:



awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}1' INFILE > OUTFILE


1 is the shortest condition that always evaluates to true, so each record is printed.
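
POSIX awk does not specify \x hex escapes in regular expressions, so some implementations may not accept them. Octal escapes are a more portable spelling of the same three bytes. The following is a minimal sketch under stated assumptions: strip_bom is a hypothetical helper name, the sample file names are made up, and LC_ALL=C is set here on the assumption that it makes awk match raw bytes rather than decoded characters.

```shell
# strip_bom FILE: print FILE with a leading UTF-8 BOM (EF BB BF) removed.
# The name strip_bom is hypothetical; \357\273\277 is EF BB BF in octal.
strip_bom() {
    LC_ALL=C awk 'NR==1{sub(/^\357\273\277/,"")}1' "$1"
}

# Quick check on a throwaway sample file that starts with a BOM.
printf '\357\273\277hello\nworld\n' > bom_sample.txt
strip_bom bom_sample.txt > bom_stripped.txt

# Dump the first three bytes of each file in hex to compare them.
od -An -tx1 -N3 bom_sample.txt
od -An -tx1 -N3 bom_stripped.txt
```

The first od line should show the BOM bytes and the second should show the start of the text instead.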



Enjoy!



-- ADDENDUM --




The Unicode Byte Order Mark (BOM) FAQ includes the following table listing the exact BOM bytes for each encoding:



Bytes       | Encoding Form
------------+----------------------
00 00 FE FF | UTF-32, big-endian
FF FE 00 00 | UTF-32, little-endian
FE FF       | UTF-16, big-endian
FF FE       | UTF-16, little-endian
EF BB BF    | UTF-8



Thus, \xef\xbb\xbf in the awk commands corresponds to EF BB BF, the UTF-8 BOM bytes in the table above.
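
To find out which row of the table applies to a given file before stripping anything, you can look at its first four bytes. This is only a sketch: detect_bom is a hypothetical helper name, and the UTF-32 little-endian signature is checked before the UTF-16 one because FF FE 00 00 also begins with FF FE. A UTF-16 little-endian file whose first character is U+0000 would therefore be misreported, which is assumed here to be rare enough to ignore.

```shell
# detect_bom FILE: print the encoding form suggested by FILE's leading bytes.
# The name detect_bom is hypothetical; the signatures come from the table above.
detect_bom() {
    # First four bytes as one lowercase hex string, e.g. "efbbbf68".
    sig=$(od -An -tx1 -N4 "$1" | tr -d ' \t\n')
    case "$sig" in
        0000feff) echo "UTF-32, big-endian" ;;
        fffe0000) echo "UTF-32, little-endian" ;;
        efbbbf*)  echo "UTF-8" ;;
        feff*)    echo "UTF-16, big-endian" ;;
        fffe*)    echo "UTF-16, little-endian" ;;
        *)        echo "no BOM" ;;
    esac
}
```

For example, running detect_bom INFILE on the input used above tells you whether the UTF-8 pattern in the awk one-liner is the right one for that file.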

