Roman Klimenko

K-Means clustering for a social network

October 23, 2021

Tags: data, clustering

This weekend I am revisiting the old days and trying to cluster the data of a relatively small internet community.

We have a community of 157,343 users. Each user can affect another user's karma by casting a karma vote of +2, +1, -1, or -2.

There are 3,406,926 karma votes at the moment: 34,665 users have voted on the karma of 64,259 users. If we sum up all the votes, negative and positive, we get a total of 793,067. In other words, users upvote more than they downvote.
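The figures above are simple aggregates over the vote log. Here is a minimal sketch of how they could be computed, assuming the votes are available as (voter, target, value) triples. The toy data and the `summarize` helper are hypothetical and only illustrate the arithmetic.

```python
# Hypothetical toy data: (voter, target, value) triples, value in {+2, +1, -1, -2}.
votes = [
    ("alice", "bob", 2),
    ("alice", "carol", -1),
    ("bob", "carol", 1),
    ("dave", "bob", 1),
    ("dave", "alice", -2),
]


def summarize(votes):
    """Return (number of votes, distinct voters, distinct targets, sum of votes)."""
    voters = {voter for voter, _, _ in votes}
    targets = {target for _, target, _ in votes}
    total = sum(value for _, _, value in votes)
    return len(votes), len(voters), len(targets), total


n_votes, n_voters, n_targets, total = summarize(votes)
print(n_votes, n_voters, n_targets, total)
```

A positive total means the upvotes outweigh the downvotes.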

Let's attempt to cluster the users based on whose karma they have voted on.

I will use a simple K-Means algorithm. You can also read about it in this post.

First, we select the users who got karma votes from at least 1000 users. There are 374 of them.
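This selection step can be sketched as follows, assuming that "votes from at least 1000 users" means distinct voters and that the votes are (voter, target, value) triples. The `popular_targets` helper and the toy data are hypothetical.

```python
from collections import defaultdict


def popular_targets(votes, min_voters):
    """Return the targets that received votes from at least `min_voters` distinct users."""
    voters_by_target = defaultdict(set)
    for voter, target, _ in votes:
        voters_by_target[target].add(voter)
    return sorted(
        target
        for target, voters in voters_by_target.items()
        if len(voters) >= min_voters
    )


# Toy data: "x" has three distinct voters, "y" has only one (who voted twice).
toy_votes = [
    ("a", "x", 1),
    ("b", "x", -1),
    ("c", "x", 2),
    ("a", "y", 1),
    ("a", "y", 2),
]
selected = popular_targets(toy_votes, min_voters=2)
print(selected)
```

On the real data, the threshold would be 1000 instead of 2.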

Then, we cluster all users based on their votes on the karma of these 374 users. We get 10 clusters.
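To make the clustering step concrete, here is a minimal pure-Python sketch of K-Means (Lloyd's algorithm), not necessarily the implementation used for this post. It assumes each voter is described by a vector with one slot per selected user, holding the sum of that voter's votes for them (0 if there were none). It only includes voters with at least one vote on a selected user. The helper names and the toy data are hypothetical.

```python
import random


def build_vectors(votes, targets):
    """Map each voter to a vector with one slot per selected target (0 = no vote)."""
    index = {target: i for i, target in enumerate(targets)}
    vectors = {}
    for voter, target, value in votes:
        if target in index:
            vec = vectors.setdefault(voter, [0.0] * len(targets))
            vec[index[target]] += value  # assumption: repeated votes are summed
    return vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy data: u1 and u2 like "x" and dislike "y"; u3 and u4 do the opposite.
toy_votes = [
    ("u1", "x", 2), ("u1", "y", -2),
    ("u2", "x", 1), ("u2", "y", -1),
    ("u3", "x", -2), ("u3", "y", 2),
    ("u4", "x", -1), ("u4", "y", 1),
]
vectors = build_vectors(toy_votes, ["x", "y"])
voters = list(vectors)
labels = kmeans([vectors[v] for v in voters], k=2)
print(dict(zip(voters, labels)))
```

On the real data, `targets` would be the 374 selected users and `k` would be 10. A library implementation would be a more practical choice at that scale.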

For each cluster, the table below shows the number of users, the average karma vote cast by the cluster, and the average karma of the users in it. We also determine the preferred communities for each cluster, that is, the communities where the cluster's users get more positive votes for their posts and comments:

clusters.png

Clusters #9 and #4 together account for about half of the users. This is more or less expected, as most users are not very active voters.

Most of the smaller clusters have strong voting preferences, meaning that their average vote for a particular user is more than +1 (or less than -1):

  • Cluster #0 (2,272 users):
    • Downvotes users Oldrover and NigilistNeo.
  • Cluster #6 (1,004 users):
    • Upvotes user infinum.
  • Cluster #1 (just 956 users):
    • Upvotes downvotes users: mongol, Oldrover, NigilistNeo, infinum, Andreich, paraom, rongo, Bart, VadimoV, xaerostar, ripcord, DJGlooM, TurboDIMA, mercaptan, vorchuchelo, papa_s_perforatorom, Postojannaya_Junga, SergKz, NikVic, north_spb, TotSamyiStalinist, Stas_II, Дениска, kybik, DMTRYO, Nic, atlakatl, Zaloginen, Issues, yt369, doublefree, suslik_agronom, SlafFIG, Kafka_Grefnewaya, hott_griff, cardeur, clayman, Voen47, adv, Mauop_Buxpb, futurmultur, dmitryco, 2015, wilddolphin, torero, mash8055, Zianovietrusiki, DonAnton, AlFonso, lonestar, avot1ya, kudejar, psy_iCe, one_of_us, McArrow, einherjar, aabazh, Евгений_Зубарев, Ragero, nazi, platoon, Loginov, boris_minaev, Papagalo, benbow, CherieDeVille, IvanIvan, SuperOzma, Danila1990, Ezhich, jesusdiedforyou, vitriform, Zitron, medevic, ninorover, mynameAstaroth, Biglibom2, Tolik_5_let, Sokira_Stalina, vl55, zaputina, ratslayer, Motolog, Xic, volgaoka, IvanZarubin
  • Cluster #2 (438 users):
    • Upvotes: infinum, sly2m, onlooker, pomorin, ckkpss, Didja, MrGrey, pavelhunt, wereman, Sap_ru, xenon, Baryonyx, Portal, Cyprian_Norwid, Marcus_Octavius, Kushavera, snob, logixor, JIMM, an_tosha, StivyG, Nave, oxygenh, 50creative, asderru, smartov, vadim_zhartun, Navookhodonosor, September66, Sher-Khan, leha_chifir, zapovednik_slonov, Pavol, edduardi, avprof, Arkomen, Alexandras, qqshka, doctorwho, Alexandro66, kindzarp, Insomn1ac, andrey_321, ADR4_2, mku, foxm, rbbb, samopavel, dbor, rvr, KamikaZze, Ponevoloki, BUFF, ga3ry, Lrk, mentat, SuperALF, asasha9, openocean, Tom_Braider, iocus, b_rodrigez, Toljatti, RattusN, bravecitizen, our_son_of_a_bitch
  • Cluster #3 (325 users):
    • Upvotes: mongol, Oldrover, NigilistNeo, Andreich, paraom, rongo, xaerostar, ripcord, TurboDIMA, mercaptan, vorchuchelo, experov, SergKz, NikVic, north_spb, TotSamyiStalinist, Stas_II, Дениска, kybik, Stalin_ist, Issues, yt369, SlafFIG, Kafka_Grefnewaya, hott_griff, adv, Argentatus, DonAnton, AlFonso, paroxizm, medevic, mcf
    • Downvotes: onlooker, pomorin, ckkpss, Didja, pavelhunt, wereman, Baryonyx, Cyprian_Norwid, snob, a17s76, StivyG, 50creative, AY, Navookhodonosor, EasyLiving, unlaba, Understander, sambuka, Pavol, avprof, Arkomen, qqshka, Alexandro66, Insomn1ac, one_of_us, nevmer, CheburatorUA, mku, ____, Biochemik, rvr, Stratosferov, BLR, a_LEX, Bizett, cheblin, Tek_Glittering_Spear, CottonHead, Funtazer, SuperALF, polonist, IGld, ipun, Adamkus, sleepydaemon (notice how this matches the upvote list of cluster #2).
  • Cluster #7 (220 users):
    • Downvotes: nickpo, mongol
  • Cluster #8 (187 users):
    • Upvotes: Baryonyx
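Lists like the ones above can be derived by averaging, per cluster, the votes cast for each user and keeping the averages beyond the ±1 threshold. The sketch below assumes the average is taken over the cluster members who actually voted for that user. The post does not spell this out, so treat it as one possible reading. The `strong_preferences` helper and the toy data are hypothetical.

```python
from collections import defaultdict


def strong_preferences(votes, cluster_of, threshold=1.0):
    """Per cluster, list targets whose mean vote is above +threshold or below -threshold."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for voter, target, value in votes:
        if voter in cluster_of:
            key = (cluster_of[voter], target)
            sums[key] += value
            counts[key] += 1

    prefs = defaultdict(lambda: {"upvotes": [], "downvotes": []})
    for (cluster, target), total in sorted(sums.items()):
        mean = total / counts[(cluster, target)]
        if mean > threshold:
            prefs[cluster]["upvotes"].append(target)
        elif mean < -threshold:
            prefs[cluster]["downvotes"].append(target)
    return dict(prefs)


# Toy data: cluster 0 strongly likes "x" and dislikes "y"; cluster 1 is lukewarm.
toy_votes = [
    ("u1", "x", 2), ("u2", "x", 2),
    ("u1", "y", -2), ("u2", "y", -1),
    ("u3", "x", 1),
]
toy_clusters = {"u1": 0, "u2": 0, "u3": 1}
prefs = strong_preferences(toy_votes, toy_clusters)
print(prefs)
```

Clusters without any strong preference, such as cluster 1 in the toy data, do not appear in the result.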

So the smaller clusters have more obvious common characteristics. We could dissect the data further to look for more patterns, but that is outside the scope of this short post. Here, I only want to show that this direction is promising, effective, and simple to implement.

To summarize, this is how the voting patterns can be visualized with a chord diagram:

chord.png

Each arc color represents the average cluster karma, and arrows represent the sum of votes from each cluster to the other clusters and to itself.
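The numbers behind such a diagram form a square matrix of vote sums between clusters. Here is a minimal sketch of how it could be built, assuming every user has a cluster label and the votes are (voter, target, value) triples. The `cluster_flows` helper and the toy data are hypothetical, and the drawing itself is left to whatever chord-diagram tool you prefer.

```python
def cluster_flows(votes, cluster_of, k):
    """flows[i][j] = sum of the votes cast by users in cluster i on users in cluster j."""
    flows = [[0] * k for _ in range(k)]
    for voter, target, value in votes:
        # Skip votes where either side has no cluster label.
        if voter in cluster_of and target in cluster_of:
            flows[cluster_of[voter]][cluster_of[target]] += value
    return flows


# Toy data: "a" and "b" are in cluster 0, "c" is in cluster 1.
toy_votes = [
    ("a", "b", 2),
    ("b", "a", -1),
    ("a", "c", 1),
    ("c", "a", -2),
]
toy_clusters = {"a": 0, "b": 0, "c": 1}
flows = cluster_flows(toy_votes, toy_clusters, k=2)
print(flows)
```

The diagonal entries hold the votes a cluster casts on its own members, which corresponds to the arrows pointing from a cluster back to itself.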