115: Data Types: Strings Part 2.

Take Up Code - A podcast by Take Up Code: build your own computer games, apps, and robotics with podcasts and live classes

You need more than a bunch of numbers and logic to write an application. You need text, and working with individual characters isn’t enough either. This episode continues the discussion about the string data type and covers the following points:

Can the encoding be changed and how are character boundaries detected? Is the encoding self-synchronizing?
How do you expand and collapse composite characters?
How do you convert numbers and other data types to strings and back?
How do you append, insert, and remove sections of a string?
How do you reorder and reverse strings?
How do you change and detect case of letters?
How do you control the formatting of a string with placeholders?

Listen to the full episode, or read the full transcript below.

Transcript

This episode continues the explanation of the string data type, what you can do with it, and many of the unique considerations that apply to strings. Listen to episode 114 for the first part and to tomorrow’s episode for the third part. Here are the next seven points, #8 through #14.

#8 Can the encoding be changed and how are character boundaries detected? Is the encoding self-synchronizing?

Yes, the encoding can be changed, but not usually in place. You’ll normally have to read one string and use the information to construct another string. And if you’re reading a stream of string data, then you can write another stream with the output.

If the data you’re reading uses a single byte for each character, then there are no issues with detecting character boundaries. You’ll know where one byte ends and the next begins. This is handled for you at a much lower level. But at some point, yes, even the bytes need to be separated from a flow of bits.

Where you can run into problems with your code is when working with multiple bytes per character. You need to know which comes first, the high-order byte or the low-order byte, and many files and streams will start with a byte order mark, or BOM, to help identify this.
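
As a rough illustration of the points above, here is a minimal Python sketch that looks for a byte order mark, decodes the bytes into a string, and then builds a new byte sequence in a different encoding. The helper names `detect_bom` and `reencode` are made up for this example, and the `latin-1` fallback for data without a BOM is only an assumption; real data may need a different default.

```python
import codecs


def detect_bom(data: bytes):
    """Hypothetical helper: return (encoding_name, bom_length) or (None, 0).

    Only the UTF-8 and UTF-16 marks are checked here. A UTF-32 little-endian
    BOM begins with the same two bytes as the UTF-16 one, so a fuller version
    would need to test for UTF-32 first.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8", len(codecs.BOM_UTF8)
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le", len(codecs.BOM_UTF16_LE)
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be", len(codecs.BOM_UTF16_BE)
    return None, 0


def reencode(data: bytes, fallback: str = "latin-1", target: str = "utf-8") -> bytes:
    """Hypothetical helper: read one encoding and construct new bytes in another.

    The change is not made in place: the input is decoded to a string first,
    and a separate byte sequence is produced from that string.
    """
    encoding, skip = detect_bom(data)
    text = data[skip:].decode(encoding or fallback)
    return text.encode(target)


# The same character (U+00E9) stored with the low-order byte first,
# with the high-order byte first, and as a single byte with no BOM.
little = codecs.BOM_UTF16_LE + "\u00e9".encode("utf-16-le")
big = codecs.BOM_UTF16_BE + "\u00e9".encode("utf-16-be")
single = b"\xe9"

print(little.hex())            # fffee900
print(big.hex())               # feff00e9
print(reencode(little).hex())  # c3a9
print(reencode(big).hex())     # c3a9
print(reencode(single).hex())  # c3a9
```

The two UTF-16 inputs hold the same pair of bytes in opposite order, and the BOM is the only thing that tells the decoder which order applies. All three inputs end up as the same two UTF-8 bytes.
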
Imagine for a moment a train passing by where the cars are related in pairs. If you’re lucky enough to catch the beginning of the train and don’t lose track, then you can keep track of the first and second cars, then the third and fourth cars, then the fifth and sixth cars, etc. But if all the cars look the same and you start observing the passing train after the engine has long since passed by, then you have no way of knowing if a particular car is related to the one before it or after it. This is a problem for encoding systems that are not self-synchronizing and you’ll just have to start at the beginning.

If the cars don’t all look the same though, and let’s say the first car in each related pair has a green stripe, then it’s easy to tell when a new pair begins. Some encoding systems such as UTF-8 are self-synchronizing. They don’t have green stripes, but you can still tell exactly when a new character begins.

#9 How do you expand and collapse composite characters?

The Unicode standard defines four normalization forms. This can get complicated and is often overlooked. Normalization is a process where you put things in a standard representation. The dictionary says it is a process to put something back into a usual or expected state. This is really important for file systems and communications.

Let’s say you’re working on a scientific research paper and decide to include a special character in the name of your document. This special character is the angstrom symbol and it looks like an A with a circle on top. In fact it looks exactly like an A with a circle on top. There is another character that’s an A with a circle on top that’s not the angstrom character. Even though both look identical, they’re not. Sometimes, it’s important to swap special characters like this with other characters that look the same but are more common and expected. Another example