Messages published in 2 2012

Accepting user input: Beware of fullwidth characters

If your application accepts user input, you should be ready to handle all kinds of “unexpected” data. This is especially true when the application faces the open internet, so you should always canonicalise (normalise) user input before processing it. Canonicalisation converts all the different representations of the same data into one standard form, which simplifies further processing. Usual operations include normalising character case, trimming spaces from the beginning and end, and collapsing duplicated spaces between words. But there are other aspects we should take into consideration, and one of them is different representations of the same character.
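As a rough illustration of the usual operations listed above, here is a minimal Python sketch. The function name `canonicalise` is made up for this example, and using `casefold()` for case normalisation is an assumption; your application may need different rules.

```python
def canonicalise(text: str) -> str:
    """Illustrative helper: trim, collapse whitespace, normalise case."""
    # split() with no arguments splits on any run of whitespace and drops
    # leading and trailing whitespace, so joining with a single space both
    # trims the ends and collapses duplicated spaces between words.
    collapsed = " ".join(text.split())
    # casefold() is a more aggressive lower() meant for caseless comparison.
    return collapsed.casefold()


if __name__ == "__main__":
    print(repr(canonicalise("   Hello    World!  ")))
```

Note that this sketch does nothing about different representations of the same character, which is the aspect discussed next.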

Unicode halfwidth and fullwidth forms

If you look into Unicode characters, you will notice that some characters are duplicated, although they look different (how different depends on the application used to display the text). The lines below are written using halfwidth and fullwidth forms:

Hello World!
Hello World!
Hello World!
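
Depending on the font, the forms can be hard to tell apart on screen, so it may help to look at them in code. The sketch below is only one possible approach, not necessarily the right one for every application: it builds the fullwidth variant of a string and folds it back with Unicode NFKC normalisation from Python's standard `unicodedata` module. The helper names `to_fullwidth` and `fold_width` are invented for this example.

```python
import unicodedata


def to_fullwidth(text: str) -> str:
    """Illustrative helper: map printable ASCII to its fullwidth form."""
    out = []
    for ch in text:
        code = ord(ch)
        if ch == " ":
            # The wide counterpart of a space is U+3000 IDEOGRAPHIC SPACE.
            out.append("\u3000")
        elif 0x21 <= code <= 0x7E:
            # Fullwidth forms U+FF01..U+FF5E mirror ASCII 0x21..0x7E.
            out.append(chr(code + 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


def fold_width(text: str) -> str:
    """Illustrative helper: fold compatibility forms using NFKC."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    plain = "Hello World!"
    wide = to_fullwidth(plain)
    print(plain == wide)              # the two strings are not equal
    print(fold_width(wide) == plain)  # equal again after normalisation
    print([hex(ord(c)) for c in wide[:5]])
```

Keep in mind that NFKC folds more than just width: it also rewrites other compatibility characters such as ligatures and superscript digits, so check that this is acceptable for your data before applying it to all input.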