I've got to handle some input that can be either English or Chinese characters. Does anyone have any good suggestions for preventing malicious input when the input has to be a bit more relaxed than normal?
So for example, I need to validate the name of a place. In English, I'd say it should consist only of letters, numbers and some punctuation. I could do this with a regex using a standard pattern (e.g. [a-zA-Z0-9...], you get the idea). But how do you handle this with Chinese characters, where a single character can represent an entity (e.g. 狗, which means 'dog')? Saying "must be made of letters or numbers only" doesn't seem to cut it.
This is mainly for a Java program, and going by the syntax in the Pattern class documentation, there's not much there to help me match Chinese characters representing letters, numbers and the like.
Every Chinese character has a Unicode code point, so you can validate that way: compare the code points in the input against the ones you are willing to accept.
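To make the code point idea concrete, here is a minimal Java sketch. It assumes Java 8 or later, and the class and method names (`HanCheck`, `isAllHan`) are made up for illustration. Rather than comparing against individual code points, it asks the JDK which Unicode script each code point belongs to.

```java
final class HanCheck {
    // Hypothetical helper: true only if the string is non-empty and every
    // code point in it belongs to the Han script (Chinese characters).
    static boolean isAllHan(String s) {
        return s != null
            && !s.isEmpty()
            && s.codePoints().allMatch(
                cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }

    public static void main(String[] args) {
        System.out.println(isAllHan("狗"));
        System.out.println(isAllHan("dog"));
    }
}
```

Iterating with `codePoints()` instead of `charAt()` matters here, because some rarer Han characters sit outside the Basic Multilingual Plane and take two `char` values each.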
Mev-Rael
As you said yourself, your goal is preventing malicious input.
It's great that you already understand that you ALWAYS have to check user input on the back end. However, you don't write over 9000 validation rules, and this has nothing to do with the language. You just use a technique called escaping a string.
You escape any input based on what you are going to do next with it.
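As a rough illustration of escaping for one specific destination, the sketch below escapes a string before it is written into HTML. The class and method names (`HtmlEscaper`, `escapeHtml`) are hypothetical, and in a real project a vetted library would be a safer choice than a hand-rolled helper. The point is only that Chinese characters pass through untouched while the characters that are special in HTML are neutralised.

```java
final class HtmlEscaper {
    // Hypothetical helper: replaces the five characters that are special
    // in HTML text and attribute values; everything else is copied as-is.
    static String escapeHtml(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeHtml("<b>狗</b>"));
    }
}
```

If the next step is a SQL query rather than HTML output, the equivalent move in Java is normally a `PreparedStatement` with bound parameters instead of manual escaping.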
The only exception is when you intentionally want to remove (strip) specific parts of user input, such as all HTML tags, emojis, or whatever. That depends on your business case, and if you really want to allow only a-z, 0-9 and Chinese characters, you can use regex ranges for Unicode itself.
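For the whitelist route, Java's `Pattern` class supports Unicode scripts (Java 7 and later), so `\p{IsHan}` matches Chinese characters without spelling out code point ranges by hand. The sketch below is one possible shape for a place-name check. The class name `PlaceNameValidator`, the punctuation set and the 100-character limit are assumptions for illustration, not something taken from the question.

```java
import java.util.regex.Pattern;

final class PlaceNameValidator {
    // Assumed whitelist: ASCII letters and digits, Han characters,
    // space, and a few punctuation marks; length 1 to 100.
    private static final Pattern ALLOWED =
        Pattern.compile("[a-zA-Z0-9\\p{IsHan} .,'-]{1,100}");

    // matches() requires the whole input to fit the pattern,
    // so no explicit ^ and $ anchors are needed.
    static boolean isValid(String name) {
        return name != null && ALLOWED.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("北京"));
        System.out.println(isValid("<script>"));
    }
}
```

Even with a whitelist like this in place, the value should still be escaped for wherever it goes next, since an apostrophe that is legal in a place name is still special in SQL and HTML.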