RR 340: Strings and Encodings in Ruby with Aaron Lasseigne

Ruby Rogues - A podcast by Charles M Wood - Wednesdays

Panel:Charles Max WoodDave KimuraEric BerryDavid RichardsIn this episode, the Ruby Rogues panel discuss Strings and Encodings in Ruby with Aaron Lasseigne. Aaron has been a Ruby developer for over a decade and is the author of Mastering Ruby: Strings and Encodings. Also, Aaron talks about his recent work on a service object Gem called Active Interaction. This is a great episode on learning about Strings and Encodings.In particular, we dive pretty deep on: Discussion Points (contributed by guests and hosts):•Why is it so important to understand strings?◦“The internet is powered by multimillion-dollar string manipulation machines. We put strings in a box, and get new strings out. While there’s plenty of mathy things that can happen in the middle, there is no denying the importance of strings in today’s world.” - Schneems◦They’re the only data structure that lies to you. You can see the exact contents of an array or hash but strings mask what’s happening. That’s why you can get situations when a single character has a length of 2.•What are character sets?◦A character set defines a group of characters, their order, and it assigns each an identifier (a code point).▪Unicode is a character set.◦What are code points?▪Unique identifiers within the character set.◦UTF-8, UTF-16, and UTF-32 are implementations of the Unicode character set▪Each has its own benefits•Normalization forms◦Different representations of the same character. We can represent “é” as a single character or as an “e” and a combining mark (2 characters). Normalization forms allow us to change between forms.◦There are 4 forms, NFC, NFD, NFKC, and NFKD and they all do slightly different things.▪They can be switched between with `String#unicode_normalize`.•Sorting◦Easy for English only but can be quite difficult with other languages. Sorting “e” and “é” can be tricky.•Security◦Identical characters, similar characters, and invisible characters can all be used to spoof user names.▪https://www.huffingtonpost.com/entry/how-to-avoid-downloading-a-fake-app_us_5a147d40e4b0f401dfa7eafb ▪https://www.reddit.com/r/Android/comments/7ahujw/psa_two_different_developers_under_the_same_name/ •The current state of Unicode support in Ruby. It was improved in 2.4 when methods like `upcase` started working with Unicode characters.•The addition of grapheme support in Ruby 2.5.•Freezing strings with `String#freeze` and with the special comment at the top of a file.◦Immutable strings may still make it into Ruby 3 as the default.•Character set expressions (a term I made up) for use with methods like `String#count` and `String#delete`.◦They’re like the inside of a regular expression character set (e.g. `[a-z]`)•Tofu and mojibake◦Tofu are those white boxes you see when a character doesn’t exist on your computer.◦Mojibake is when the characters show up but they don’t make sense because you’re using the wrong encoding or they were misencoded somewhere along the way.•Fixing bad characters◦Strings can be checked with `valid_encoding?`.◦`String#scrub` lets you replace invalid bytes with a single character which is the replacement character by default (that black diamond with a question mark in it).◦`String#encode` also does replacement work and will let you swap out characters if you go from something like UTF-8 to ASCII.▪You can even change out newline types with it.◦`Encoding::Converter` is an even more powerful way to convert but it’s a tool for when things...