Distribution of characters in Squeak code
Last updated at 11:34 am UTC on 24 June 2016
Tobias Pape (Squeak Mailing list Wed, Jun 22, 2016)
I was curious about the relative distribution of characters in Squeak code.
I sampled the source code and drew a histogram (attached).
" Uses the new HistogramMorph"
| characterFrequency |
CurrentReadOnlySourceFiles cacheDuring: [
	"Skip methods whose collection literals add up to 1500 elements or more"
	characterFrequency := ((CompiledMethod allInstances select:
		[:method | (method allLiterals detectSum:
			[:lit | lit isCollection ifFalse: [0] ifTrue: [lit size]]) < 1500])
		gather: [:method | method getSource
			reject: [:c | c isSeparator]]) asBag].
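(HistogramMorph on: characterFrequency)
	labelBlock: [:c | c codePoint > 32 ifTrue: [c asString] ifFalse: [c printString]];
	openInWorld. "assumed: the final message of this cascade is missing in this copy"
"The 90 most frequent characters, joined into one string"
((characterFrequency sortedCounts collect: [:ea | ea value]) first: 90) join.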
- The most frequent (printable) characters are, in order:
and, in more detail, the 90 most frequent characters:
- This is quite close to actual English:
- The most frequent punctuation character is : and . follows a long way behind.
- Cascading is comparatively rare. We have more blocks and equality/identity comparisons than ;
- Blocks are more common than parentheses and literal arrays
- You cannot spell ifTrue or ifFalse with the 20 most common characters
- ifTrue: is far more common than ifFalse:
- The most frequent uppercase character is S. I have no conjecture here, though.
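Several of the observations above can be re-checked directly on a Bag of characters. The following is a minimal workspace sketch, not part of the original post: sampleSource is an invented one-line stand-in for the sampled sources, and all variable names are hypothetical. In a real image, the characterFrequency bag from the snippet at the top could be used in place of freq. A claim such as ifTrue: versus ifFalse: concerns whole selectors, so the sketch counts it on the source string rather than on the bag.

```smalltalk
"Sketch under assumptions: sampleSource is made up for illustration."
| sampleSource freq counts top20 missing firstUpper |
sampleSource := 'foo: x x isNil ifTrue: [^ self bar]. ^ Set new add: x; yourself'.
freq := (sampleSource reject: [:c | c isSeparator]) asBag.

"Raw counts for a few punctuation characters."
':.;[(' do: [:c |
	Transcript show: c asString, ' -> ', (freq occurrencesOf: c) printString; cr].

"sortedCounts answers count -> element associations, most frequent first."
counts := freq sortedCounts.
top20 := (counts first: (20 min: counts size)) collect: [:ea | ea value].

"Characters of 'ifTrue' that are not among the 20 most frequent ones."
missing := 'ifTrue' reject: [:c | top20 includes: c].

"Most frequent uppercase character (as a count -> element association), or nil."
firstUpper := counts detect: [:ea | ea value isUppercase] ifNone: [nil].

"Whole selectors are counted on the source text, not on single characters."
Transcript
	show: 'ifTrue: '; show: (sampleSource occurrencesOfString: 'ifTrue:') printString; cr;
	show: 'ifFalse: '; show: (sampleSource occurrencesOfString: 'ifFalse:') printString; cr.
```

Evaluate it in a workspace and inspect missing and firstUpper; with the tiny sample the ranking is dominated by ties, so only the real bag gives meaningful answers.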
C, sampling the Linux kernel:
- under_score_case vs. camelCase is rather obvious.
- (Not displayed, but tab and newline are among the 6 most frequent characters!)
- Punctuation starts much earlier.
- The beginning differs a lot, the ending not so much.
- 0 is far more important than 1
- : is unimportant
Ruby, sampling Rails:
- The underscore shows up, but not as much as in C.
- The : is (as in Smalltalk) more important than in C.
- Uppercase is less common than in both C and Smalltalk.
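
The three code bases differ in size, so comparisons like the ones above only make sense on relative frequencies, not raw counts. A minimal sketch of such a comparison follows. The block name share and the three one-line samples are invented for illustration; they are not taken from Squeak, the Linux kernel, or Rails, and this is not how the original figures were produced.

```smalltalk
"Sketch under assumptions: the sample strings are invented stand-ins."
| share samples |
"Fraction of the non-separator characters in text that satisfy predicate."
share := [:text :predicate | | chars |
	chars := text reject: [:c | c isSeparator].
	chars isEmpty
		ifTrue: [0]
		ifFalse: [(chars count: [:c | predicate value: c]) / chars size]].

samples := {
	'Smalltalk' -> 'dict at: aKey ifAbsent: [nil]'.
	'C' -> 'if (list_empty(&q->node)) return 0;'.
	'Ruby' -> 'validates :user_name, presence: true'}.

"Print the share of underscores, uppercase letters and colons per sample."
samples do: [:each |
	Transcript
		show: each key;
		show: ' underscore: ', ((share value: each value value: [:c | c = $_]) * 100) rounded printString, '%';
		show: ' uppercase: ', ((share value: each value value: [:c | c isUppercase]) * 100) rounded printString, '%';
		show: ' colon: ', ((share value: each value value: [:c | c = $:]) * 100) rounded printString, '%';
		cr].
```

Working with fractions keeps the arithmetic exact until the final rounding; for real corpora the sample strings would be replaced by the concatenated sources of each project.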