I have some old migrated files that contain non-printable characters. I would like to find all files with such names and delete them completely from the system.
Example:
ls -l
-rwxrwxr-x 1 cws cws 0 Dec 28 2011 ??"??
ls -lb
-rwxrwxr-x 1 cws cws 0 Dec 28 2011 \a\211"\206\351
I would like to find all such files.
Here is an example screenshot of what I'm seeing when I do a ls
in such folders:
I want to find these files with the non-printable characters and just delete them.
Answer
ASCII character codes range from 0x00
to 0x7F
in hex. Therefore, any character with a code greater than 0x7F
is a non-ASCII character. This includes the bulk of the characters in UTF-8 (ASCII codes are essentially a subset of UTF-8). For example, the Japanese character
あ
is encoded in hex in UTF-8 as
E3 81 82
UTF-8 has been the default character encoding on, among others, Red Hat Linux since version 8.0 (2002), SuSE Linux since version 9.1 (2004), and Ubuntu Linux since version 5.04 (2005).
Out of the ASCII codes, 0x00
through 0x1F
and 0x7F
represent control characters such as ESC
(0x1B
). These control characters were not originally intended to be printable even though some of them, like the line feed character 0x0A
, can be interpreted and displayed.
On my system, ls
displays all control characters as ?
by default, unless I pass the --show-control-chars
option. I'm guessing that the files you want to delete contain ASCII control characters, as opposed to non-ASCII characters. This is an important distinction: if you delete filenames containing non-ASCII characters, you may blow away legitimate files that just happen to be named in another language.
POSIX
POSIX provides a very handy collection of character classes for dealing with these types of characters (thanks to bashophil for pointing this out):
[:cntrl:] Control characters
[:graph:] Graphic printable characters (same as [:print:] minus the space character)
[:print:] Printable characters (same as [:graph:] plus the space character)
PCRE
Perl Compatible Regular Expressions allow hexadecimal character codes using the syntax
\x00
For example, a PCRE regex for the Japanese character あ
would be
\xE3\x81\x82
In addition to the POSIX character classes listed above, PCRE also provides the [:ascii:]
character class, which is a convenient shorthand for [\x00-\x7F]
.
GNU's version of grep
supports PCRE using the -P
flag, but BSD grep
(on Mac OS X, for example) does not. Neither GNU nor BSD find
supports PCRE regexes.
GNU find
supports POSIX regexes (thanks to iscfrc for pointing out the pure find
solution to avoid spawning additional processes). The following command will list all filenames (but not directory names) below the current directory that contain non-printable control characters:
find -type f -regextype posix-basic -regex '^.*/[^/]*[[:cntrl:]][^/]*$'
The regex is a little complicated because the -regex
option has to match the entire file path, not just the filename, and because I'm assuming that we don't want to blow away files with normal names simply because they are inside directories with names containing control characters.
To delete the matching files, simply pass the -delete
option to find
, after all other options (this is critical; passing -delete
as the first option will blow away everything in your current directory):
find -type f -regextype posix-basic -regex '^.*/[^/]*[[:cntrl:]][^/]*$' -delete
I highly recommend running the command without the -delete
first, so you can see what will be deleted before it's too late.
If you also pass the -print
option, you can see what is being deleted as the command runs:
find -type f -regextype posix-basic -regex '^.*/[^/]*[[:cntrl:]][^/]*$' -print -delete
To blow away any paths (files or directories) that contain control characters, the regex can be simplified and you can drop the -type
option:
find -regextype posix-basic -regex '.*[[:cntrl:]].*' -print -delete
With this command, if a directory name contains control characters, even if none of the filenames inside the directory do, they will all be deleted.
It looks like your files contain both non-ASCII characters and ASCII control characters. As it turns out, [:ascii:]
is not a POSIX character class, but it is provided by PCRE. I couldn't find a POSIX regex to do this, so it's Perl to the rescue. We'll still use find
to traverse our directory tree, but we'll pass the results to Perl for processing.
To make sure we can handle filenames containing newlines (which seems likely in this case), we need to use the -print0
argument to find
(supported on both GNU and BSD versions); this separates records with a null character (0x00
) instead of a newline, since the null character is the only character that can't be in a valid filename on Linux. We need to pass the corresponding flag -0
to our Perl code so it knows how records are separated. The following command will print every path inside the current directory, recursively:
find . -print0 | perl -n0e 'print $_, "\n"'
Note that this command only spawns a single instance of the Perl interpreter, which is good for performance. The starting path argument (in this case, .
for CWD
) is optional in GNU find
but is required in BSD find
on Mac OS X, so I've included it for the sake of portability.
Now for our regex. Here is a PCRE regex matching names that contain either non-ASCII or non-printable (i.e. control) characters (or both):
[[:^ascii:][:cntrl:]]
The following command will print all paths (directories or files) in the current directory that match this regex:
find . -print0 | perl -n0e 'chomp; print $_, "\n" if /[[:^ascii:][:cntrl:]]/'
The chomp
is necessary because it strips off the trailing null character from each path, which would otherwise match our regex. To delete the matching files and directories, we can use the following:
find . -print0 | perl -MFile::Path=remove_tree -n0e 'chomp; remove_tree($_, {verbose=>1}) if /[[:^ascii:][:cntrl:]]/'
This will also print out what is being deleted as the command runs (although control characters are interpreted so the output will not quite match the output of ls
).
No comments:
Post a Comment