Some time ago, Randal Olson analyzed how to create successful Reddit posts in a data-driven way. I thought it would be interesting to do the same for Hacker News, a popular social news portal (not only for tech) that I first came across when my Xerox article made it to its front page, causing quite a bit of server load on this blog. Hacker News is kind of important in the tech web; a popular post there can, for example, instantly provide you with the critical mass of page impressions needed to make something go viral.
In order to write a data-driven post, the most important thing you need is … data (no shit, Sherlock!). So lots of thanks go to Shital Shah, who downloaded all Hacker News posts since 2006 and proceeded to make them available for download as a big-*ss JSON file. Thanks, Shital! From the JSON file we can read a whopping 1333789 posts.
The remainder of the article is structured as follows. First, for the readers in a hurry, I will analyze when and what to post in order to improve your odds of getting a popular post. This will be done on the more recent part of the data, namely all posts since 2013. After that, I will try to derive some possible explanations for the popularity observations by having a coarse-grained look at when and what HN users post. Third, I will go further and analyze the behavior of HN users in a more general way, considering the whole dataset, not only the recent posts. I will also look at how the users' behavior changed over time.
Disclaimer: Please be aware that, like Randal, I am only making statements about probability in this post. Following these guidelines will by no means 100% guarantee that you will have a successful post.
Even more important Disclaimer: As a HN user points out: “You get popular by posting content that conforms to the group-think and avoid content that doesn't”. Another user states “yes, what the world needs now is more and more people who care about marketing their brand”. Both are right and point to the most important thing. What I show here is a data-driven analysis that may increase your odds by like a few percent – but it won't get you anywhere if you post nonsense or otherwise behave like a dipshit.
Analyzing popularity of posts since 2013
What popularity is
In this article, we say a HN post is popular if it belongs to the top 3% of its year in terms of post score (this way of defining popularity is quite resilient against changes in the number of HN viewers over time, as more readers cause higher scores in general). For instance, for 2009, we consider posts with a score of at least 55 popular, and for 2013 a post needs 97 points to be popular.
This means that in the following plots, any post attribute with a popularity percentage above 3% is disproportionately successful: a post exhibiting the attribute statistically increases its chance of being popular. Conversely, post attributes with a popularity percentage below 3% are disproportionately unsuccessful.
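The popularity definition above can be sketched in a few lines of plain Python. This is only an illustration under assumptions, not the code behind the plots: the field names `year` and `score` are made up for the example, and ties at the threshold are all counted as popular.

```python
from collections import defaultdict


def popularity_thresholds(posts, top_percent=3):
    """Return {year: minimum score needed to be in the top `top_percent` % of that year}."""
    scores_by_year = defaultdict(list)
    for post in posts:
        scores_by_year[post["year"]].append(post["score"])

    thresholds = {}
    for year, scores in scores_by_year.items():
        scores.sort(reverse=True)
        # Number of posts that count as popular; at least one per year.
        n_popular = max(1, len(scores) * top_percent // 100)
        thresholds[year] = scores[n_popular - 1]
    return thresholds


def is_popular(post, thresholds):
    # Posts that tie with the threshold score are counted as popular too.
    return post["score"] >= thresholds[post["year"]]
```

Because the threshold is computed separately for each year, a year with generally higher scores simply gets a higher bar, which is what makes the definition robust against a growing readership.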
Popularity by day of week and daytime
Edit: I used wrong images instead of the following two plots at first. I corrected this mistake, thanks for notifying.
As Randal found for Reddit, the posting time might play an important role in making popular HN posts. Let's have a look at the popularity percentages across the hours of the day:
This plot doesn't look too interesting at first glance. All bars hover around the baseline popularity percentage of 3%. There is no time of day that, for instance, doubles your chance of having a popular post. There is a slightly above-average chance of popularity around 11 am Greenwich Mean Time (which is early morning on the US East Coast), but nothing special. Now let's have a look at the popularity percentages across weekdays.
Now, this is clearer. On the X-axis, 0 is Monday, 1 is Tuesday, and so forth. It seems that posts appearing on Saturdays and Sundays have a better chance than posts appearing during the week. Posts created on Sundays even have a 4% popularity percentage, which means they have a 33% better chance of getting popular than the average post. Also, on Monday, there is a slightly higher probability of getting popular than in the rest of the week, which suggests that the weekend itself is in fact the trigger, but in a slightly time-zone-shifted way. Now let's have a look at weekdays and posting time in one single heat map plot:
Remember that the time zone of the data is Greenwich Mean Time. Now, have a look at the time span between Saturday, 13h GMT, and Monday, 8h GMT. In this weekly period, the chances of getting a popular post are noticeably higher than in the rest of the week. This is pretty accurately the weekend with respect to the US time zones, with a slight bias towards East Coast time, which suggests that, like Reddit, HN is dominated by US users.
Unlike on Reddit, and as you saw in the hour-of-day plot above, there is no real everyday bias towards a specific time of day. There is a slight peak in popularity around 11h GMT (6am EST), but it seems less significant than Reddit's.
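For readers who want to reproduce this kind of breakdown, here is a minimal sketch of how the popularity percentage per weekday, per hour, or per heat map cell could be computed. The field names `created` (a timestamp in GMT) and `popular` (a boolean) are assumptions made for this example.

```python
from collections import defaultdict
from datetime import datetime, timezone


def popularity_share(posts, key):
    """Percentage of popular posts per group, where key(post) names the group."""
    total = defaultdict(int)
    popular = defaultdict(int)
    for post in posts:
        group = key(post)
        total[group] += 1
        popular[group] += post["popular"]  # True counts as 1, False as 0
    return {group: 100.0 * popular[group] / total[group] for group in total}


def weekday_hour(post):
    # weekday(): 0 is Monday ... 6 is Sunday, matching the X-axis of the weekday plot.
    ts = post["created"]
    return (ts.weekday(), ts.hour)


# Toy data only: two Sunday posts and one Monday post.
sample = [
    {"created": datetime(2013, 6, 2, 14, tzinfo=timezone.utc), "popular": True},
    {"created": datetime(2013, 6, 2, 15, tzinfo=timezone.utc), "popular": False},
    {"created": datetime(2013, 6, 3, 14, tzinfo=timezone.utc), "popular": False},
]
by_weekday = popularity_share(sample, lambda p: p["created"].weekday())
by_cell = popularity_share(sample, weekday_hour)  # one value per heat map cell
```

Passing a different `key` function is enough to switch between the hour-of-day, weekday and heat map views, so all three plots can share the same aggregation.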
Popularity by title length
Another factor that probably influences whether someone takes a closer look at a post is the number of words in its title. Let's see:
Yes, short titles are better than long ones in general, with a peak at two-word titles, but at around 16 words the title at least doesn't do any harm.
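A rough sketch of the title length breakdown, assuming that words are counted by splitting the title on whitespace (the exact tokenization is an assumption) and that each post carries hypothetical `title` and `popular` fields:

```python
from collections import defaultdict


def popularity_by_title_length(posts):
    """Map word count -> (number of posts, percentage of popular posts)."""
    stats = defaultdict(lambda: [0, 0])  # word count -> [posts, popular posts]
    for post in posts:
        n_words = len(post["title"].split())  # an empty title has 0 words
        stats[n_words][0] += 1
        stats[n_words][1] += post["popular"]
    return {
        n: (total, 100.0 * popular / total)
        for n, (total, popular) in sorted(stats.items())
    }
```

Returning the group size alongside the percentage makes it easier to spot word counts that are backed by only a handful of posts.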
Popularity by post type
As you probably know, posts at Hacker News can have types of a sort. I declared “Ask HN” and “Show HN” as types and derived two more from the data itself: PDFs (the post title contains “[pdf]”) and videos (the post title contains “[video]”).
We can see quite a few things from this plot: HN users don't really like to be asked things; “Ask HN” posts do exceptionally badly in terms of popularity. On the other hand, PDFs and videos do exceptionally well. Posting a “Show HN” doesn't really make a difference, and, not surprisingly, the popularity average of all other posts is close to the 3% baseline.
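The type assignment can be sketched as a small title classifier. The “[pdf]” and “[video]” markers come from the description above; matching “Ask HN” and “Show HN” as case-insensitive title prefixes, and checking them before the markers, are assumptions of this example.

```python
def post_type(title):
    """Classify a post by its title into one of the five types used in the plot."""
    lowered = title.lower()
    # Assumption: "Ask HN" / "Show HN" posts are recognized by their title prefix.
    if lowered.startswith("ask hn"):
        return "Ask HN"
    if lowered.startswith("show hn"):
        return "Show HN"
    if "[pdf]" in lowered:
        return "PDF"
    if "[video]" in lowered:
        return "Video"
    return "other"
```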
So all in all, statistically, you can maximize your chances of getting a popular HN post by
- posting on weekends (US time zones)
- trying to shorten your titles
- posting videos or PDFs.
Now, let's try to get a bit more insight into general HN user behavior.
Current Hacker News user behavior
It's always interesting to see when people post, regardless of popularity. Here is the number of posts by hour of day:
We can see here that most posts are made in the afternoon GMT, which is around the start of the working day in the US. So, as with Reddit, it seems that US people hit HN before actually starting their work day. Another interesting aspect is the distribution of posts across the week.
As you can see, way fewer posts are made on the weekend than during the rest of the week. This could explain the higher probability of weekend posts becoming popular: there is less competition, just like on Reddit.
Trends in Hacker News
Now, let's have a look at the whole dataset, not just all posts since 2013.
General user behavior
Looking at the data, the first thing I noticed is that there were only 49 posts in 2006. This is way too few posts to derive meaningful statistics, so I cut off the 2006 part of the dataset and start the analysis in 2007. Let's have a look at the posts per day since 2007:
Well, well, well, what can we see? First, there seem to be some nice holes in the data (at least I don't believe that there weren't any HN posts made in the second half of 2009 and some other periods of time). When I find the time, I'll drop Shital a note about this. Second, HN grew from almost zero to 1200 posts a day between 2007 and 2012, and after that, post numbers dropped slightly, but not in a way that causes fear about HN's future. Full disclosure: I applied a 7-day rolling average to the time series in order to smooth away the weekend fluctuations (the trends in weekday and time-of-day post distributions over time show that the weekend always had fewer posts than the workweek, no plots shown here).
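The smoothing step can be illustrated as follows. This is a sketch with a trailing 7-day window; whether the plot used a trailing or a centered window is not stated, so treat that choice as an assumption.

```python
from collections import Counter
from datetime import date, timedelta


def posts_per_day(post_dates):
    """Count posts per calendar day, filling days without any posts with 0."""
    counts = Counter(post_dates)
    first, last = min(counts), max(counts)
    n_days = (last - first).days + 1
    days = [first + timedelta(days=i) for i in range(n_days)]
    return [(day, counts.get(day, 0)) for day in days]


def rolling_mean(values, window=7):
    """Trailing mean over `window` values; the first window - 1 positions are skipped."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]
```

Filling missing days with zero before averaging keeps holes in the data visible as dips instead of silently closing them.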
Now let's see how popularity developed over time. Here is a trellis variant of the above “popularity by daytime” plot.
We can see that in the early history of HN, the data was noisier than it is now. I suspect this is because of the smaller sample: HN had way fewer users back then. Until 2011, there is a clearly marked popularity dent around 6am GMT, which is around midnight on the US East Coast and late evening on the West Coast. I suppose that HN was mainly popular in the US back then, and even hackers need their beauty sleep. After 2011, the dent starts to disappear. Can this be seen as a sign of the rest of the world kicking in?
Here is how the popularity of weekend posts developed over time:
Another interesting thing is watching how the popularity of title lengths developed over time.
In the early years, longer titles were more popular than they are now. The popularity maximum has clearly been moving towards shorter titles from 2012 on. There are some anomalies in the plots, though. Some word counts occur relatively seldom but manage to get one single popular post, which then yields a high popularity percentage. For a funny example, have a look at the 2008 plot: there were indeed 7 (in words: seven) posts with an empty title and, lo and behold, one of them managed to get popular. This is why we get this large popularity percentage.
Now, let's see how the popularity of types changed over time.
This is another way of plotting which seemed convenient because there are only a few types. Each column corresponds to one type and has seven bars for the different years. One can clearly see that PDFs used to be unpopular but are now on the up. Videos used to be popular, then their popularity dropped, but nowadays they are becoming popular again. Unsurprisingly, “other” (most of the articles) always hangs around the popularity average. Ask HN posts are becoming quite unpopular (they have their own community though), and Show HN posts were once popular and now at least don't hurt.
Some more trivia
While we're at it, we can look at the data from a few more angles. I grouped all posts by target server. Here are the top 5 server targets (regardless of popularity):
- github.com is obviously HN's first choice code dump with 9623 posts pointing to it
- youtube.com is HN's place to watch videos (7242 posts, there are no statistics about cat and non-cat videos)
- techcrunch.com is one of HN's first-to-look-at places when it comes to tech news (6393 posts)
- medium.com is also popular, probably for all the writers' blog posts there (6367 posts)
- nytimes.com seems to be the first-choice for general news (4133 posts)
As an alternative to techcrunch, arstechnica is also received well with 3401 posts, and if you don't like nytimes, bbc.co.uk is also fine (3313 posts).
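Grouping posts by target server boils down to extracting the host name from each post URL and counting. A minimal sketch with the standard library follows; stripping a leading `www.` so that both spellings of a site are counted together is an assumption of this example, not necessarily what was done for the numbers above.

```python
from collections import Counter
from urllib.parse import urlparse


def target_server(url):
    """Return the host part of a URL without a leading 'www.' ('' if there is none)."""
    host = urlparse(url).hostname or ""  # hostname is already lowercased
    return host[4:] if host.startswith("www.") else host


def top_servers(urls, n=5):
    """Most frequent target servers; posts without a usable URL are skipped."""
    hosts = (target_server(url) for url in urls)
    return Counter(host for host in hosts if host).most_common(n)
```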
Next, I threw away the servers that occurred in fewer than 50 posts to get rid of all the strays, and looked at the popularity percentage per server on the remainder. Here are the top five server targets by popularity percentage, in descending order:
- teslamotors.com (35%)
- wikileaks.org (32%)
- stripe.com (29%)
- codinghorror.com (23%)
- sivers.org (23%)
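The ranking above combines a minimum-count filter with the popularity percentage per server. A sketch of that combination, assuming hypothetical `server` and `popular` fields on each post:

```python
from collections import defaultdict


def server_popularity(posts, min_posts=50):
    """Rank servers by percentage of popular posts, dropping servers with fewer than min_posts posts."""
    total = defaultdict(int)
    popular = defaultdict(int)
    for post in posts:
        total[post["server"]] += 1
        popular[post["server"]] += post["popular"]
    shares = {
        server: 100.0 * popular[server] / count
        for server, count in total.items()
        if count >= min_posts
    }
    # Highest popularity percentage first.
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)
```

Without the minimum count, a server with a single post that happened to become popular would top the list at 100%.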
Here are the most post-intensive authors (regardless of popularity):
- shawndumas (2174 posts 8-0 – what the heck!)
- Libertatea (1456 posts)
- ColinWright (1384 posts)
- danso (1326 posts)
- iProject (1215 posts)
And last but not least, here are the posts linking to my blog:
All of those posts link to the Xerox Saga.
BTW: I know the axis titles are not all aligned – however I did my best to explain them in the text. For my own convenience I used different plotting frameworks across the plots. I may clean them later when there is time.