Thursday, 28 September 2017

regex - How do I grep for all non-ASCII characters?



I have several very large XML files and I'm trying to find the lines that contain non-ASCII characters. I've tried the following:



grep -e "[\x{00FF}-\x{FFFF}]" file.xml



But this returns every line in the file, regardless of whether the line contains a character in the range specified.



Do I have the syntax wrong or am I doing something else wrong? I've also tried:



egrep "[\x{00FF}-\x{FFFF}]" file.xml 


(with both single and double quotes surrounding the pattern).



Answer



You can use the command:



grep --color='auto' -P -n "[\x80-\xFF]" file.xml


This prints the line number of each matching line and highlights the non-ASCII characters in red.



On some systems, depending on your settings, the above will not work, so you can grep for the inverse instead:




grep --color='auto' -P -n "[^\x00-\x7F]" file.xml
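To try the inverse pattern on something small before running it over the large XML files, a minimal sketch follows. It assumes GNU grep built with -P support. The sample file and its contents are made up for illustration, and forcing LC_ALL=C (so the pattern is matched byte by byte) is an assumption about which "setting" matters, not something stated above.

```shell
# Hypothetical sample file: line 2 contains "é" encoded as the UTF-8 bytes C3 A9.
sample=$(mktemp)
printf 'plain ascii\ncaf\303\251 au lait\ndone\n' > "$sample"

# LC_ALL=C makes grep treat the input as raw bytes, so any byte >= 0x80 matches.
LC_ALL=C grep -n -P '[^\x00-\x7F]' "$sample"
# prints: 2:café au lait

# -c reports only the number of matching lines.
LC_ALL=C grep -c -P '[^\x00-\x7F]' "$sample"
# prints: 1
```

If the pattern behaves as expected on the sample, the same command can be pointed at file.xml.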


Note also that the important bit is the -P flag, which is equivalent to --perl-regexp, so grep interprets your pattern as a Perl regular expression. The grep documentation for this option also says that




this is highly experimental and grep -P may warn of unimplemented
features.
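Since -P may not be available in every grep build, one possible workaround is to probe for it first and fall back to POSIX character classes. This is only a sketch: the function name find_non_ascii is hypothetical, and the fallback is an approximation that also reports ASCII control characters other than whitespace, so it is stricter than "non-ASCII".

```shell
# Hypothetical helper: print numbered lines that contain non-ASCII bytes.
find_non_ascii() {
    # Probe whether this grep accepts -P at all.
    if printf 'a\n' | grep -qP 'a' 2>/dev/null; then
        LC_ALL=C grep -n -P '[^\x00-\x7F]' "$1"
    else
        # Approximation: anything that is neither printable ASCII nor whitespace.
        LC_ALL=C grep -n '[^[:print:][:space:]]' "$1"
    fi
}
```

It would be called as find_non_ascii file.xml. Running it under LC_ALL=C is a deliberate choice here so that both branches look at bytes rather than decoded characters.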


