Removing illegal characters from file names in Java

How to convert strings in any language and character set to valid filenames in Java?


I need to generate file names from user inputted names. These names could be in any language. For example:

  • "John Smith"
  • "高岡和子"
  • "محمد سعيد بن عبد العزيز الفلسطيني"

These are user-inputted values, so I have no guarantee that the names don't contain characters that are invalid in file names.

Users will be downloading these files from their browser, so I need to ensure the file names are valid on all operating systems in all configurations.

I am currently doing this for English speaking countries by simply removing all non-alphanumeric characters with a simple regex:

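    string = string.replaceAll("[^a-zA-Z0-9\\s]", ""); // drop everything except ASCII letters, digits and whitespace
    string = string.replaceAll("\\s+", "_");           // turn each run of whitespace into one underscore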

Some example conversions:

  • "John Smith" -> "John_Smith.ext"
  • "John O'Henry" -> "John_OHenry.ext"
  • "John van Smith III" -> "John_van_Smith_III.ext"

Obviously this does not work internationally.

I've considered finding/generating a blacklist of all characters that are invalid on all file systems and stripping those from the names. I've been unable to find a comprehensive list.

I'd prefer to use existing code in a common library if possible. I imagine this is an already solved problem, however I can't find a solution that works internationally.

The filename is for the user downloading the file, not for me. I'm not going to be storing these files. These files are dynamically generated by the server upon request from data in a database. The filenames are for the convenience of the person downloading the file.

 
What happens if you just return a unicode file name? I would assume the operating system could figure this sort of thing out (but I wouldn't be surprised if some don't).  – Brendan Long   Apr 14 '12 at 3:56
 
But... spaces are valid...  – Ignacio Vazquez-Abrams   Apr 14 '12 at 3:58
 
I know spaces are valid, I just prefer underscores to spaces.  – leros   Apr 14 '12 at 4:00
 
Whitelists are better than blacklists.... It's hard to enumerate all evil...  – Jordão   Apr 14 '12 at 4:20
 
@Jordão: But it is fun trying...  – Ignacio Vazquez-Abrams   Apr 14 '12 at 4:21

6 Answers

up vote 3 down vote accepted

The regex [^a-zA-Z0-9] removes every character that is not an ASCII letter or digit, so all non-ASCII (Unicode) characters, i.e. those with code points of 128 and above, are stripped out.

Assuming that you want to filter user input for valid file-names by replacing invalid file-name characters such as ? \ / : | < > * with underscore (_):

    public class ReplaceI18N {
        public static void main(String[] args) {
            String[] names = {
                "John Smith",
                "高岡和子",
                "محمد سعيد بن عبد العزيز الفلسطيني",
                "|J:o<h>n?Sm\\it/h*",
                "高?岡和\\子*",
                "محمد /سعيد بن عبد ?العزيز :الفلسطيني\\"
            };
            for (String s : names) {
                // Java strings are already Unicode, so no re-encoding step is needed;
                // a getBytes() round trip through the platform charset can corrupt the text.
                String u = s.replaceAll("[\\?\\\\/:|<>\\*]", " "); // replace ? \ / : | < > * with a space
                u = u.replaceAll("\\s+", "_");                     // collapse whitespace runs into one underscore
                System.out.println(s + " = " + u);
            }
        }
    }

The output:

    John Smith = John_Smith
    高岡和子 = 高岡和子
    محمد سعيد بن عبد العزيز الفلسطيني = محمد_سعيد_بن_عبد_العزيز_الفلسطيني
    |J:o<h>n?Sm\it/h* = _J_o_h_n_Sm_it_h_
    高?岡和\子* = 高_岡和_子_
    محمد /سعيد بن عبد ?العزيز :الفلسطيني\ = محمد_سعيد_بن_عبد_العزيز_الفلسطيني_

The valid filenames even with Unicode characters will be displayable on any webpage that supports UTF-8 encoding with the correct Unicode font.

In addition, each will be the correct name for its file on any OS file-system that supports Unicode (tested OK on Windows XP, Windows 7).

But if you want to pass a valid filename as part of a URL, make sure to encode it properly using URLEncoder, and later decode the encoded string using URLDecoder.
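As a minimal sketch of that encode/decode round trip: the class name `UrlNameDemo` and its two helpers are hypothetical, and the `Charset` overloads used here assume Java 10 or later. Note that `URLEncoder` performs form encoding, so a space comes out as `+` rather than `%20`.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of the original answer.
final class UrlNameDemo {

    // Percent-encode the UTF-8 bytes of the name (form encoding: space -> '+').
    static String encode(String name) {
        return URLEncoder.encode(name, StandardCharsets.UTF_8);
    }

    // Reverse the encoding above.
    static String decode(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }
}
```

For example, `UrlNameDemo.encode("John Smith")` gives `John+Smith`, and decoding that string returns the original name.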

 
They also need to be suitable for download via HTTP.  – Ignacio Vazquez-Abrams   Apr 14 '12 at 7:16
@IgnacioVazquez-Abrams If we want to pass the valid filename as a URL string, then convert the string using URLEncoder like what you have suggested.  – eee   Apr 14 '12 at 7:23
up vote 0 down vote

Encode the filename as UTF-8, and then URL-encode the result.

'高岡和子' -> '%E9%AB%98%E5%B2%A1%E5%92%8C%E5%AD%90'
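One place such an encoded name is commonly sent is the `Content-Disposition` response header. The sketch below assumes the RFC 6266 `filename*` form and Java 10 or later; the class and method names are hypothetical, and the answer itself does not say which header to use.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; assumes the RFC 6266 filename* syntax.
final class DownloadHeaderDemo {

    // Build a Content-Disposition value carrying a UTF-8 percent-encoded file name.
    static String contentDisposition(String fileName) {
        String encoded = URLEncoder.encode(fileName, StandardCharsets.UTF_8)
                .replace("+", "%20")   // URLEncoder writes spaces as '+'; use %20 instead
                .replace("*", "%2A");  // URLEncoder leaves '*' unescaped
        return "attachment; filename*=UTF-8''" + encoded;
    }
}
```

The browser decodes the value and saves the file under the readable name, so the percent-encoded form is only seen on the wire.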
 
The filename needs to contain the name in human-readable fashion. In your example the file name needs to be "高岡和子.ext"  – leros   Apr 14 '12 at 4:03
 
You encode on download. You store it normally locally.  – Ignacio Vazquez-Abrams   Apr 14 '12 at 4:04
up vote 0 down vote

Windows appears to support Unicode file names, I know Linux does, and apparently OS X does too. Presumably a well-written browser would fix invalid characters in a file name before saving it.

It seems like you should be able to just use Unicode file names. Is there some OS or browser that this doesn't work on?

 
OS X supports them in a... non-conventional way though.  – Ignacio Vazquez-Abrams   Apr 14 '12 at 4:06
up vote 0 down vote

My advice would be to make it a requirement that your application runs on a platform that supports Unicode filenames. Most do these days.

I don't think it is feasible to map from Unicode to an (unspecified) restricted character set, while still retaining human readability AND the original meaning AND avoiding collisions. Indeed, it is not even possible to do this mapping from Latin-1 to ASCII.

If your application has to run on platforms that don't support Unicode filenames, then you will need to sacrifice human readability and/or meaning in the filenames in some cases. Besides, consider whether (for example) ASCII-ized Chinese characters or Cyrillic letters or letters with accents stripped off are going to be acceptable to your end users.
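To illustrate the loss described above, here is a small sketch that strips accents with `java.text.Normalizer` (the class and method names are hypothetical). Distinct names collapse onto the same ASCII string, and characters that have no decomposition, such as Chinese ones, are not converted at all.

```java
import java.text.Normalizer;

// Hypothetical demo of accent stripping; shows why the mapping is lossy.
final class AccentStripDemo {

    // Decompose accented letters (NFD), then delete the combining marks.
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

With this, "José" and "Jose" both become "Jose", so two different people can end up with the same file name, while "高岡和子" passes through unchanged and is still not ASCII.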

What I'd do is offer the user two options to select from:

  • An option that uses Unicode filenames for uploaded files. This should be the default, since most users' machines will support this.

  • A fallback option that uses generated names which are not related to the original strings / text.

In reality, if the user's machine doesn't support Unicode, they are going to have huge problems dealing with textual names that are not encoded using the machine's native encoding. There's no completely reliable way to find out what that is. Even if you have a semi-reliable way of figuring that out ... on the server side ... the problem of mapping all of Unicode to that encoding is intractable.

It is better to encourage the user to upgrade his / her operating system to a Unicode capable one.

 
"These files are not going to be saved server side. I'll be generating them on requests from data in a database."– Ignacio Vazquez-Abrams   Apr 14 '12 at 5:54
 
That doesn't affect the point I'm making. If the user's platform can't represent the filenames, then attempting to map them to the user's machine's native character set is not going to give acceptable names ... if the user is a native speaker.  – Stephen C   Apr 14 '12 at 6:02
 
But... it's a web app. You can't force your users to only run Linux.  – Ignacio Vazquez-Abrams   Apr 14 '12 at 6:03
 
1) Most Windows systems support Unicode too. 2) If they don't then just map to non-meaningful names. One should not spend lots of effort in supporting users who haven't upgraded their OS for N years.  – Stephen C   Apr 14 '12 at 6:07
 
"You can't force your users to only run Linux.". Apart from being misleading, it is also illogical. It is like saying "you can't force your users to run a Windows browser" or "you can't force your users to run MS Word". In fact, many websites / organizations try to do exactly this kind of thing ... and get away with it.  – Stephen C   Apr 14 '12 at 6:12
up vote 0 down vote

Summarizing and paraphrasing @eee's answer...

    String sanitizeFilename(String unsanitized) {
        return unsanitized
            .replaceAll("[\\?\\\\/:|<>\\*]", " ") // filter out ? \ / : | < > *
            .replaceAll("\\s", "_");              // white space as underscores
    }

(not joining multiple spaces into one!)
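A whitelist variant of the same idea, as suggested in the comments on the question, might look like the sketch below. It is an assumption-laden illustration rather than anyone's posted answer: the class name is hypothetical, it keeps only Unicode letters, combining marks, digits and whitespace, and unlike the method above it trims the ends and joins runs of whitespace.

```java
// Hypothetical whitelist-based sanitizer; keeps letters and digits from any script.
final class WhitelistSanitizer {

    static String sanitize(String name) {
        return name
                .replaceAll("[^\\p{L}\\p{M}\\p{N}\\s]", "") // drop anything that is not a letter, mark, digit or whitespace
                .trim()                                      // no leading or trailing underscores
                .replaceAll("\\s+", "_");                    // each whitespace run becomes one underscore
    }
}
```

Under these rules "John O'Henry" becomes "John_OHenry", matching the conversions listed in the question, while names in other scripts keep their letters.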

   
up vote 0 down vote

Letting the input determine a file name without proper sanitizing seems prone to security attacks. You can use a hash function (SHA-1, MD5) to generate a valid filename. Just be aware that you can't derive the original name from the hash.
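A minimal sketch of the hashing idea, using SHA-1 from the standard `java.security.MessageDigest` API (the class and method names are hypothetical). The result is always 40 lowercase hex characters, which are safe in any file name, but it is not human readable.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper that derives a file-system-safe name from arbitrary text.
final class HashedFileName {

    // Return the SHA-1 digest of the UTF-8 bytes of the name as lowercase hex.
    static String sha1Hex(String name) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(name.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b)); // two hex digits per byte
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on every Java platform.
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }
}
```

A caller could then append the extension, for example `HashedFileName.sha1Hex(userName) + ".ext"`.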

Also, if you can have a simple lookup table, you can assign special identifiers to the names (like sequential numbers or GUIDs), and use the identifier as the filename.

Another thing, have you thought about homonyms?

Or you could just sanitize the filename.  – Ignacio Vazquez-Abrams   Apr 14 '12 at 4:00
 
Sure............  – Jordão   Apr 14 '12 at 4:00
 
The filename needs to contain the name in human readable form, which is why I can't just generate something.  – leros   Apr 14 '12 at 4:01
 
Very stringent requirements ... just be careful with file-system equivalents to Little Bobby Tables  – Jordão Apr 14 '12 at 4:06
 
It's going to be a very common use case that a user would download 10-100+ files for various people. It's absolutely a requirement that the user be able to quickly find the file that corresponds to a person.  – leros Apr 14 '12 at 4:10


 
 
http://stackoverflow.com/questions/10150850/how-to-convert-strings-in-any-language-and-character-set-to-valid-filenames-in-j
