A comprehensive analysis of English words present in the Human proteome

Proteolinguistica 1:1. Received Apr 1. Revised Apr 1. Accepted Apr 1. Published online Apr 15.

Prof. Albert

    Abstract, introduction, results and discussion: We analyzed the occurrence of English words (>= 4 letters) in the amino acid sequences of all known human proteins. Of the 236,736 English words in the NLTK dictionary (84,249 of which have 4 or more letters and are encodable, see below), the human proteome contains 10,445 (Table 1), demonstrating that the human proteome has the potential to become a bestselling novelist and outsmart many of us (me included). We found that the longest English words in the proteome are TARGETEER and SKULLFISH, while the protein with the most words (97) hidden within its amino acid sequence is, to our surprise, one of the longest genes/proteins, TITIN. We (I) believe that this result will prove useful for the study of proteomic linguistics and will be worthy of (at least a nomination for) the Ig Nobel Prize.

     

    Table 1. Stats

    Number of words in the NLTK dictionary: 236,736
    Number of encodable 4+ letter words (without B, J, O, X, Z): 84,249
    Number of unique 4+ letter words hidden in the human proteome: 10,445
    Protein hiding the largest number of words (# words): TTN [a.k.a. TITIN] (97)
    Longest words in the human proteome (gene): TARGETEER (C12orf42)*, SKULLFISH, found as SKVLLFISH (CRB1)

    *There’s another 5-letter word hidden in C12orf42. Can you find it?

     

    Figures – no figures. Save figure fee and color fee. Also, I don’t have figure-preparation software other than PowerPoint.

     

    Materials & Methods: We used the English word list provided by the NLTK corpus package (nltk.corpus; https://www.nltk.org/). We downloaded the knownGene table and protein sequences from the UCSC Genome Browser (Table Browser). We then searched the protein sequences for each word in the English word list. Because the amino acid alphabet contains only 20 letters, we excluded words containing B, J, O, X, or Z and converted every U to V (apparently U was often written as V a few centuries back).
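The search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code actually used for the analysis: the function names `encode_word` and `find_words` are hypothetical, and the two toy "proteins" are made-up sequences rather than real UCSC records. In a real run the word list would presumably come from `nltk.corpus.words.words()` and the sequences from the downloaded protein table.

```python
# The 20 standard amino acid letters, plus U, which gets rewritten as V.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
ALLOWED = AMINO_ACIDS | {"U"}


def encode_word(word, min_len=4):
    """Return the protein-alphabet spelling of a word, or None if it is
    too short or contains a letter that cannot be encoded (B, J, O, X, Z)."""
    w = word.upper()
    if len(w) < min_len or not set(w) <= ALLOWED:
        return None
    return w.replace("U", "V")


def find_words(words, proteins, min_len=4):
    """Map each protein name to the sorted list of words hidden in it."""
    # encoded spelling -> original word (first spelling wins on collisions)
    encoded = {}
    for word in words:
        enc = encode_word(word, min_len)
        if enc is not None:
            encoded.setdefault(enc, word.upper())

    hits = {}
    for name, seq in proteins.items():
        seq = seq.upper()
        found = sorted(orig for enc, orig in encoded.items() if enc in seq)
        if found:
            hits[name] = found
    return hits


# Toy inputs (assumptions for illustration only, not real data).
toy_words = ["targeteer", "skullfish", "target", "fish", "box", "tar"]
toy_proteins = {"TOY1": "MKTARGETEERQ", "TOY2": "AASKVLLFISHGG"}

hits = find_words(toy_words, toy_proteins)
print(hits)
```

A plain substring test per word is the simplest approach and is quadratic in spirit; for the full word list against the whole proteome, something like an Aho-Corasick automaton would likely be faster, but that is a design choice beyond what the text specifies.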

     

    Jokes aside, please check out our serious science using the menu on the left.