I have taken a look at the Twitter algorithm and uploaded videos to this playlist
First video here:
The edited transcript of the first video is below:
We are talking about the Twitter recommendation algorithm, which was released in full as open source on March 31st. I've had some time to look at it; some parts are better than I expected, and some parts are exactly as I expected. We're going to talk about what makes it good and bad and try not to dive too deep, but maybe catch a little bit of code as well. So, the first question that's important to address before we start is: what are even the criteria for a good social media algorithm? When we talk about algorithms in college, we talk about efficiency. You sort a trillion numbers quickly, great, you pass the test. But here, the question is not efficiency, it's "does it do the correct thing?" And defining what the correct thing is, is itself a challenge; it's the majority of the challenge. And what should the goal of the algorithm be in the first place? Things like efficiency and understandability, while valuable to look at, are not really the issue here. Understandability matters because if an algorithm is hard to understand, it is likely doing something strange. Good algorithms of this sort should be as legible as, say, the Bitcoin algorithm, but this one is not. Elon's goal, or one of his many goals, is: does the algorithm improve "unregretted user minutes"? This is an okay goal; there's a lot of good in it. Specifically, it centers the user: it asks whether a user enjoys spending time on the site even after reflection. Looking back after years on the site, would they view that time with positive emotion? This is a good goal. But it's hard to A/B test and hard to measure, because you can't wait years for users. You can sometimes test it by asking people to move away from Twitter, or to use it more, and see how they feel. But I think there are other goals worth looking at that are in a similar style but not exactly unregretted user minutes.
I'll talk about contractual fulfillment and incentives. But before that, there's a big part of what I would consider the algorithm that isn't published, and this is a big transparency question: what are the criteria for the A/B tests that determine algorithm changes? I've worked for Bing and Amazon, and people can make all kinds of changes to an algorithm, but nobody agrees on whether a change is good until it's tested in production: some users get the old algorithm, some get the new one, and then you look at which metrics went up or down. So the first question is: which metrics are examined within the A/B test, and what is the decision criterion for which metrics matter most? These are not published, and they are, in some ways, quite important for understanding how this algorithm arrived at its current form. When people say Twitter optimizes for engagement, it kind of does, in the sense that the internal optimization may look at engagement and say more engagement on this tweet means it's a good tweet. But it's also important to look at the external optimization loop: after comparing two algorithms, what is the Twitter developers' outside view of them? Are they looking at likes, or at time spent on the site? If they're looking at likes, that's a good thing; if they're looking at time spent, that's a bad thing. Likes are probably more correlated with unregretted user minutes than time spent, which may even be negatively correlated. The goals that I think are worth ranking as high as unregretted user minutes are contractual fulfillment and incentives.
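To make the external-loop question concrete, here is a hypothetical sketch of that unpublished decision step. All metric names and numbers are invented; the point is that the same experiment data "ships" or "reverts" depending on which metric the team declares primary.

```python
# Hypothetical A/B decision rule: ship the treatment iff the chosen
# primary metric improved. Names and numbers are invented.

def decide(control: dict, treatment: dict, primary_metric: str) -> str:
    delta = treatment[primary_metric] - control[primary_metric]
    return "ship" if delta > 0 else "revert"

# Invented example: the treatment raises time spent but lowers likes.
control   = {"likes_per_user": 3.1, "minutes_per_user": 24.0}
treatment = {"likes_per_user": 2.8, "minutes_per_user": 29.5}

print(decide(control, treatment, "likes_per_user"))    # revert
print(decide(control, treatment, "minutes_per_user"))  # ship
```

The same data, two opposite conclusions: this is exactly why the unpublished decision criteria matter more than the published ranking code.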
Contractual fulfillment:
Does the site do roughly what users expect it to do? People have different expectations, but one example of violating them is that Twitter pushes out-of-network recommendations into your feed. I see people, myself included, complain about this all the time.
The second question of expectations is that it's unclear what a like does within the algorithm. Open sourcing is a good step towards understanding this, but it's still not legible what a like represents: how does the relationship between me and the person I'm liking change? I don't think even Twitter devs fully understand this.
Incentives
In addition to meeting user expectations, the incentives produced by the algorithm are of great importance. The algorithm should aim towards promoting good behavior such as truthfulness, usefulness, and uplifting content while avoiding incentives that increase anger or hostility towards others. One example of an area where the algorithm can be improved in this regard is quote tweets. Quote tweets are currently counted the same as retweets, even though they can often be negative. This creates an incentive for hate engagements to propagate, which is not desirable. Incentives like this should be avoided, as they promote bad behavior and create a negative user experience.
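The quote-tweet incentive problem can be made concrete with a minimal sketch. The weights, field names, and numbers below are invented, not Twitter's actual scoring; the point is that if quote tweets count the same as retweets, a dunk that attracts hostile quotes can outrank genuinely endorsed content.

```python
# Invented engagement score: likes + retweets + weighted quote tweets.
def engagement_score(tweet: dict, quote_weight: float = 1.0) -> float:
    return (tweet["likes"]
            + tweet["retweets"]
            + quote_weight * tweet["quote_tweets"])

dunk = {"likes": 50, "retweets": 10, "quote_tweets": 500}   # mostly hostile quotes
post = {"likes": 300, "retweets": 150, "quote_tweets": 10}  # genuinely endorsed

# Quotes weighted like retweets (the criticized behavior): the dunk wins.
print(engagement_score(dunk) > engagement_score(post))  # True

# Discounting quotes until their valence is known: the dunk loses.
print(engagement_score(dunk, quote_weight=0.1)
      > engagement_score(post, quote_weight=0.1))       # False
```

A single weight choice like `quote_weight` is enough to decide whether hate engagement propagates or not.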
TrustRank
As a critic, people might ask me if I can do more than just criticize algorithms, and the answer is yes. In fact, I have developed a superior ranking algorithm for Twitter that addresses many of the issues I've mentioned. It's called Cozy Twitter, and it's based on my proprietary implementation of TrustRank. It's a Chrome extension that can solve problems related to contractual fulfillment, incentives, and optimization in society. It's also fairly cost-effective, if you don't count the exorbitant API call pricing. So, if you're Elon and you want an easy and cheap way to fix your entire site, I can definitely make a better algorithm. I won't go into detail about how Cozy Twitter works; this post is more about how the Twitter algorithm is not doing the right things.
Ad hoc improvements
However, I will say that one sign of a good core algorithm is that it does not require ad hoc patches running after it.
One example of a massive ad hoc change is censorship. When it comes to the topic of censorship, there has been a lot of discussion lately. Some people want to censor others on Twitter, but to me, this is a failure of the algorithm. The algorithm should properly separate people into groups where they don't see each other's information. If people are fighting for censorship of a jointly owned discussion space, maybe they shouldn't be in each other's space and instead have their own spaces. This could help reduce the fighting. To me, the big fight over censorship is a failure to properly implement the algorithm. There are too many incentives to block people on Twitter due to Twitter's power level and reach. The algorithm is promoting certain content to people who don't want to see it, and as a result, they want to censor it, sometimes using government power. This ad hoc change points to strangeness in the algorithm.
Even something like not showing too many tweets from the same user is an ad hoc change applied after the core algorithm. If diversity isn't already a byproduct of the core ranking function, then that function isn't working correctly. So, there are a lot of problems in the core algorithm if you have to patch it frequently.
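The same-author cap can be sketched as exactly this kind of post-ranking patch. The decay factor and structure below are invented for illustration, not Twitter's actual implementation; the point is that the pass runs *after* the core ranker, overriding its scores.

```python
# Invented post-pass: multiplicatively decay repeated authors so one
# prolific account doesn't dominate the timeline.

def author_diversity_pass(ranked, decay=0.5):
    """ranked: list of (author, score), highest score first."""
    seen = {}
    rescored = []
    for author, score in ranked:
        penalty = decay ** seen.get(author, 0)   # 1.0, 0.5, 0.25, ...
        rescored.append((author, score * penalty))
        seen[author] = seen.get(author, 0) + 1
    return sorted(rescored, key=lambda t: t[1], reverse=True)

ranked = [("alice", 10.0), ("alice", 9.0), ("bob", 8.0)]
print(author_diversity_pass(ranked))
# [('alice', 10.0), ('bob', 8.0), ('alice', 4.5)]
```

If the core ranker already valued diversity, alice's second tweet would not have needed its score rewritten from 9.0 to 4.5 after the fact.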
How does it affect society?
So, for me, the question of what makes a good algorithm is all about how it affects society. There's a lot of talk about social media hurting people's mental health, and I remember Zuckerberg going in front of Congress and saying that passive consumption of Facebook hurts people's mental health. But there were very few follow-up questions, which makes me wonder if it can even be measured. Can you do A/B tests on whether people who quit social media for two weeks see a marked improvement in their mental health? I think some people have done this and found that, yes, something in the algorithm is actually contributing to the destruction of people's expectations of what discourse ought to be. Specifically, this seems related to passive consumption, which discourages people who are not very heavy users from reading and engaging with content. So, this is something that could be fixed at the algorithm level.
Another aspect of algorithm optimization is the question of whether society as a whole becomes more cohesive as a result of this algorithm. Do people like each other more after interacting on Twitter, or do they come away from it feeling angry and resentful? There are definitely times when Twitter mobs cancel people, and this is a thing that now gives people pause before doing certain things in the workplace or online. Cancel culture is not only a Twitter thing, but this hostility may discourage a lot of people from starting new businesses or putting themselves out there. So, the encouragement of Twitter mobs may actually be pretty negative.
Finally, there's also the question of ranking ontology. What are you ranking? Tweets are being ranked, but is that the right thing? Do you want to rank conversations, people, or online communities?
In terms of the algorithm's team affiliation, I won't be discussing whether it leans towards Republicans or Democrats. Some people have been trying to find clues in the codebase, and they did find some debug features. The questions that remain are more meta, such as how we manage the power dynamic between power users and new users, how we handle trust, whether we propagate a trusted network, and how we manage time decay. These are important questions that have a significant impact on society, and making a statement on them requires careful consideration.
Engineering blog post
Okay, so next up we're going to go through this blog post, and that will be the end of part one. Philosophically, I would love for the recommendation system to be a single service, a single job, but unfortunately, it's composed of many connected services and jobs. This makes it difficult to separate the business logic from the efficiency code and know exactly what's happening. These models aim to answer important questions about the network, such as "what is the probability you will interact with another user in the future?" However, this very first question in the design blog post is poorly posed, since the probability depends on the algorithm itself. We are off to a great start finding problems!
It's not an isolated quantity, like the probability of rain tomorrow, which is unaffected by what you're doing right now. If this is what you're computing, you're doing something self-referential that is extremely hard to evaluate. What I think they may have meant is, "What is the probability that you interact with this tweet, given that we show it to you?", which is a far more reasonable quantity to compute with a model. However, the question as stated depends on the algorithm itself. In my opinion, it's a bad idea to do this unless you really know what you're doing, and I don't think they do.
In / out of network balance
Today, the For You timeline consists of 50% in-network and 50% out-of-network tweets.
What? Why?
Why is this so high? This is not what I want.
If I like somebody, I follow them, and you can follow many accounts. The out-of-network ratio should be 5%–10%.
Is this a target that was hard-coded, or is it an emergent property of other algorithms? Because if it's an emergent property, I bet there's a bug in there somewhere. This can't be the proper way to manage things.
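To illustrate the hard-coded-target case, here is an invented interleaver (this is not Twitter's code): a single `out_ratio` constant fully determines the split, which is the kind of thing the question above is probing for.

```python
# Invented timeline blender: slot a fixed fraction of out-of-network
# candidates into a page of n tweets.

def blend(in_network, out_of_network, out_ratio=0.5, n=10):
    k_out = round(n * out_ratio)
    return out_of_network[:k_out] + in_network[:n - k_out]

timeline = blend([f"in{i}" for i in range(10)],
                 [f"out{i}" for i in range(10)],
                 out_ratio=0.5)
print(sum(t.startswith("out") for t in timeline) / len(timeline))  # 0.5
```

If the 50% figure instead *emerges* from candidate scoring rather than from a constant like `out_ratio`, there is no single knob to turn it down to 5–10%, which is what makes the emergent case suspicious.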
In-network
There is the real graph, which aims to understand the strength of engagement between two users on Twitter. I find the idea of assigning a qualitative strength value for each directional pairwise connection on Twitter to be reasonable, as not all connections have the same strength. However, the implementation details are somewhat confusing, and I wish they would explain more about what exactly the model does.
The paper mentions different types of interactions, such as retweets, favorites, mentions, messages, and clicks on a profile page. I find it strange that visiting a profile page is included as an interaction to optimize for, as it is not something I personally do often; I wonder if it is a common enough interaction to be worth considering. The paper also discusses different types of follow relationships, such as one-directional follows and two-directional follows (mutuals).
Some of the features here include "daily exponential decay." This raises a surprisingly deep question about what Twitter wants to be. Does it want to be an "everything" app, or a place for news distribution only? Because I'm going to tell you, you are not going to be an everything app if you have daily exponential decay. Time and frequency of interaction are worth considering, but this exponential decay biases too heavily towards recent interactions.
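Here is a sketch of why daily exponential decay punishes casual users. The half-life value is an invented illustration, not the constant in Twitter's code; the shape of the curve is the point.

```python
import math

# Invented decay: an interaction's weight halves every `half_life_days`.
def decayed_weight(days_ago: float, half_life_days: float = 7.0) -> float:
    return math.exp(-math.log(2) * days_ago / half_life_days)

print(round(decayed_weight(1), 3))   # 0.906: yesterday counts almost fully
print(round(decayed_weight(30), 3))  # 0.051: a month ago barely counts
```

A user who checks in monthly has their entire interaction history discounted to a few percent, so the model effectively only knows the daily power users.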
So to have an overview of what's happening:
a) we have a bunch of features for each interaction;
b) we create a person-to-person score, basically a friendship score;
c) then we compute personalized PageRank. I really like this; this is extremely positive. It's an implementation of something we can call "status": a question of who is cool in your community. The issue is that the input to this function is highly biased towards recent and frequent interactions, which means it is highly biased towards people who interact with the site all the time. If Twitter wants to be an addictive platform for power users, sure, this is good, but it is not friendly to a casual user.
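Steps (b) and (c) can be sketched together. The graph, edge weights, and restart probability below are invented, and this is a toy power iteration, not Twitter's implementation; it just shows how friendship scores feed a random walk that restarts at the user whose timeline is being built.

```python
# Invented personalized PageRank over friendship scores: the walk
# restarts at `me` with probability `alpha`, so status is measured
# from my point of view.

def personalized_pagerank(weights, me, alpha=0.15, iters=50):
    """weights: {(u, v): friendship score from u to v}."""
    nodes = {n for edge in weights for n in edge}
    rank = {n: (1.0 if n == me else 0.0) for n in nodes}
    out_sum = {n: sum(w for (u, v), w in weights.items() if u == n)
               for n in nodes}
    for _ in range(iters):
        nxt = {n: (alpha if n == me else 0.0) for n in nodes}
        for (u, v), w in weights.items():
            if out_sum[u] > 0:
                nxt[v] += (1 - alpha) * rank[u] * w / out_sum[u]
        rank = nxt
    return rank

# I interact with bob far more than carol, so bob ranks as cooler in
# my community, and bob's friend dave inherits some of that status.
weights = {("me", "bob"): 9.0, ("me", "carol"): 1.0, ("bob", "dave"): 1.0}
r = personalized_pagerank(weights, "me")
print(r["bob"] > r["carol"])  # True
```

Note how completely the output is determined by the input weights: if the friendship scores are skewed towards recent, frequent interactions, the "status" ranking inherits that bias wholesale.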
If we take a look at the code for real graph, it's even more drastic:
" label interactions that occurred one day after the target period"
It seems to me that if you interact with somebody on Monday, Wednesday, and Friday of every week, then this algorithm will not pick those up as positive interactions. You have to interact on consecutive days, Monday, Tuesday, Wednesday, for example. This is not friendly to a casual user, and I'd recommend changing the window to one week.
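A sketch of the labeling rule being quoted, with invented dates and an invented helper (not the actual Real Graph code): a Mon/Wed/Fri user is missed by a one-day label window but caught by the suggested one-week one.

```python
from datetime import date

# Invented label rule: an interaction counts as a positive label only
# if it falls within `label_days` after the feature window ends.
def is_positive_label(interaction_day: date, window_end: date,
                      label_days: int = 1) -> bool:
    gap = (interaction_day - window_end).days
    return 0 < gap <= label_days

window_end = date(2023, 4, 3)  # a Monday

# A Mon/Wed/Fri user whose next interaction is Wednesday:
print(is_positive_label(date(2023, 4, 5), window_end))                # False
# The same user under the suggested one-week label window:
print(is_positive_label(date(2023, 4, 5), window_end, label_days=7))  # True
```

With a one-day window, the model is trained to predict only daily-cadence relationships, which is exactly the casual-user bias described above.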