The Localization of Stack Overflow - QCon China 2014

Slides supporting the "Localization of Stack Overflow" talk presented at QCon China in April 2014.

Marco Cecconi

April 25, 2014

Transcript

  1. The Localization of Stack Overflow. Marco Cecconi, @sklivvz, sklivvz@stackoverflow.com

  2. Who are we?

  3. None
  4. None
  5. 561,027,840 pageviews in the last 30 days* (~100% growth year over year). *Source: Quantcast
  6. Why did we want to do this?

  7. Expert programmers Fluent English EN Stack Overflow

  8. Expert programmers Fluent English EN Stack Overflow Fluent Portuguese PT Stack Overflow
  9. Different languages make Different communities

  10. Expert programmers Stack Overflow

  11. We want to make the internet a better place for all
  12. Architectural Requirements

  13. Easy.

  14. Easy. Fast.

  15. Performance is a feature

  16. None
  17. None
  18. 1. Allocates two objects per translated string
    2. Does a lot of lookups
  19. None
  20. 1. No allocation (strings are interned)
    2. No lookups
    3. Not “easy”, not “usable”
  21. None
  22. Simplest/easiest possible code

  23. Simplest/easiest possible code Implementation

  24. Source Code Compile Time Run Time aspnet_compiler Look Up

  25. Simplest/easiest possible code Compiled to the equivalent of this

  26. Source Code Compile Time Run Time extended aspnet_compiler using Roslyn Look Up
  27. JavaScript
    - No GC pressure, so we don’t care about interned strings
    - Can’t really precompile either
    - We simply create one set of JS files per language, e.g. “stub.en.js” and “stub.pt.js”
    - For all that follows, the same APIs are available to JavaScript
  28. Our solution

  29. None
  30. API
    _s(string value): meaning “Substitute String”
    _m(string value): meaning “Substitute Markdown”
  31. Languages are WEIRD (part 1)

  32. _s("Hello $name$", new { name = "Marco" }) → Hello Marco

  33. 一隻雞 兩隻雞

  34. 1 chicken 2 chickens

  35. Language Name | Code | Category | Examples | Rules
    Chinese | zh | other | 0-999; 1.2... | other → everything
    English | en | one | 1 | one → n is 1; other → everything else
    English | en | other | 0, 2-999; 1.2, 2.07... |
  36. Ukrainian!
    one | 1, 21, 31, 41, 51, 61, 71, 81, 101, 1001, … | i % 10 = 1 and i % 100 != 11
    few | 2~4, 22~24, 32~34, 42~44, 52~54, 62, 102, 1002, … | i % 10 = 2..4 and i % 100 != 12..14
    many | 0, 5~19, 100, 1000, 10000, 100000, 1000000, … | i % 10 = 0 or i % 10 = 5..9 or i % 100 = 11..14
    other | 0.0~1.5, 10.0, 100.0, 1000.0, 10000.0, 100000.0, 1000000.0, …
    other | 0~15, 100, 1000, 10000, 100000, 1000000, …
  37. Welsh has SIX modes…

  38. _s("#num# chickens", new { num = 3 }) → 3 chickens

  39. Behind the scenes
    All combinations are generated for each language and sent to translators:
    - For Chinese: “$num:other$ chickens” will be sent
    - For a 2 mode language: “$num:one$ chickens” and “$num:other$ chickens” will be sent
    Rules have to be evaluated at runtime to choose the correct translation.
  40. Languages are WEIRD (part 2)

  41. None
  42. 10 classes called Class I to Class X and containing all sorts of arbitrary groupings but often characterised as:
    • people
    • long objects
    • animals
    • miscellaneous objects
    • large objects and liquids
    • small objects
    • languages
    • pejoratives
    • infinitives
    • mass nouns
    Uganda
  43. _s("Active $~posts$") → Attivi
    _s("Active $~questions$") → Attive

  44. Post Mortem

  45. Some numbers
    - 700 views localized
    - 100,000 lines of code
    - A lot of JavaScript
    - A LOT of refactoring/fixing/tech debt repaid
    - Very little performance impact
    - ~6 months of work (team of ~3)
  46. More numbers
    - Portuguese released Dec. 12
    - 4k Questions
    - 7k Answers
    - 8k Users
    - One of the best performing new communities ever
  47. None
  48. Lessons learned

  49. Never put non-content text data in the DB
    It’s A Good Thing™ if all the text to be localized is in the views or JavaScript.
  50. Never compose sentences in code
    n == 1 ? "1 unicorn" : n.ToString() + " unicorns"
  51. None
  52. Never assume anything about the language
    10 genders and 6 plurals? REALLY?
  53. Designing a global application is hard

  54. Conclusions

  55. 1. It’s possible to internationalize quickly and cheaply, without performance hits.
    2. Localization is a surprisingly rich problem. There are many gotchas that can be painful later, like pluralization “bugs”. Fun!
    3. Localization is a very healthy choice for Stack Overflow and we hope to provide more and more users with a native interface some day :-)
  56. Questions? Marco Cecconi @sklivvz sklivvz@stackoverflow.com
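
The pluralization scheme from slides 35 to 39 (one translated string per plural category, with the rule evaluated at run time) can be sketched roughly as below. This is a minimal JavaScript illustration, not Stack Overflow's actual implementation. The names `pluralRules`, `translations`, and `translate` are hypothetical, the `$num$` placeholder syntax is assumed for every string, the rules are the ones transcribed on the slides and cover non-negative integers only, and the translated strings are illustrative.

```javascript
// Hypothetical sketch: none of these names come from the Stack Overflow codebase.
// Each rule maps a non-negative integer to a plural category, following
// the rules shown on slides 35 and 36.
const pluralRules = {
  zh: () => "other",                      // Chinese: a single category
  en: (n) => (n === 1 ? "one" : "other"), // English: two categories
  uk: (n) => {                            // Ukrainian (integers only)
    const mod10 = n % 10;
    const mod100 = n % 100;
    if (mod10 === 1 && mod100 !== 11) return "one";
    if (mod10 >= 2 && mod10 <= 4 && !(mod100 >= 12 && mod100 <= 14)) return "few";
    // Every remaining integer satisfies the "many" rule on slide 36:
    // i % 10 = 0, i % 10 = 5..9, or i % 100 = 11..14.
    return "many";
  },
};

// One translated template per (language, source string, plural category).
// The strings below are illustrative placeholders.
const translations = {
  en: { "$num$ chickens": { one: "$num$ chicken", other: "$num$ chickens" } },
  zh: { "$num$ chickens": { other: "$num$隻雞" } },
};

// Pick the template for the plural category of args.num, then substitute
// every $name$ placeholder with the matching argument.
function translate(lang, key, args) {
  const category = pluralRules[lang](args.num);
  const template = translations[lang][key][category];
  return template.replace(/\$(\w+)\$/g, (_, name) => String(args[name]));
}
```

With these assumptions, `translate("en", "$num$ chickens", { num: 3 })` returns `"3 chickens"`, and `pluralRules.uk(21)` returns `"one"` while `pluralRules.uk(11)` returns `"many"`. Keeping whole sentences as templates, instead of concatenating fragments in code, is what lets translators supply a different sentence for each category.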