What's the best way to validate input that can contain Chinese characters?

I've got to handle some input that can be either English or Chinese characters. Does anyone have any good suggestions for preventing malicious input when the input has to be a bit more relaxed than normal?

So for example, I need to validate the name of a place. In English, I'd say it should only be made up of letters, numbers and some punctuation. I could do this in regex using a standard pattern (e.g. [a-zA-Z0-9...], you get the idea). But how do you handle this with Chinese characters, where a single character can represent an entity (e.g. 狗, which represents 'Dog')? Saying "must be made of letters or numbers only" doesn't seem to cut it.

This is mainly for a Java program, and going by the syntax in the Pattern class documentation, there's not much there to help me detect Chinese characters representing letters, numbers and the like.

Responses(2)

As you said yourself your goal is

to prevent malicious input

It's great that you already understand that you ALWAYS have to check user input on the back end. However, you do not write over 9000 validation rules. It has nothing to do with the language. You just use a technique called escaping.

You escape any input based on what you are going to do next with the input

Insert into MySQL? -> Use prepared statements.
Also display in browser to other users? -> Prevent XSS attacks and encode HTML characters when sending the string to your template/view/output. You NEVER encode an HTML string before saving it in the DB; you save it AS IS and later you just escape the output from the DB.
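To make the two cases above concrete, here is a minimal Java sketch. The `places` table, its `name` column, and the class and method names are all made up for illustration. In a real project you would probably use your template engine's or a library's HTML encoder instead of a hand-rolled one.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

final class OutputEscaping {
    private OutputEscaping() {}

    // Store the raw value AS IS. The driver binds it as a parameter, so it is
    // never parsed as SQL. Table "places" and column "name" are hypothetical.
    static void insertPlace(Connection conn, String rawName) throws SQLException {
        String sql = "INSERT INTO places (name) VALUES (?)";
        try (PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setString(1, rawName);
            stmt.executeUpdate();
        }
    }

    // Escape only at output time, for an HTML text or attribute context.
    // Chinese characters are not special in HTML and pass through unchanged.
    static String escapeHtml(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

Neither method cares whether the text is English or Chinese, which is the point: the escaping depends on where the string is going, not on what language it is in.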

The only exception is when you intentionally want to remove (strip) specific parts of user input, like all HTML tags, emojis, whatever. This depends on your business case, and if you really want to allow just a-z, 0-9 and Chinese characters, you can use Unicode ranges directly in your regex.
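If you do go the whitelist route in Java, something like the sketch below might work. It relies on the `\p{IsHan}` script class that `java.util.regex.Pattern` supports (Java 7 and later). The class name, the exact punctuation set, and the 100-character limit are my assumptions, so adjust them to your business case.

```java
import java.util.regex.Pattern;

final class PlaceNameValidator {
    private PlaceNameValidator() {}

    // Allowed: Han script characters, ASCII letters and digits, space,
    // and a small punctuation set (. , ' -). Length 1 to 100 is an assumption.
    private static final Pattern ALLOWED =
        Pattern.compile("[\\p{IsHan}A-Za-z0-9 .,'\\-]{1,100}");

    // Returns true only if the whole input consists of allowed characters.
    static boolean isValid(String input) {
        return input != null && ALLOWED.matcher(input).matches();
    }
}
```

With this pattern `PlaceNameValidator.isValid("北京")` and `isValid("St. John's")` pass, while `isValid("<script>")` is rejected. Note that full-width Chinese punctuation such as 。 is not part of the Han script, so you would have to add it to the character class explicitly if you need it. Even with a whitelist like this, you still escape on output as described above.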
