Indexing reddit’s /r/funny images based on user comments

[Edit: This feature on PixPit is currently down]
What would you get if /r/funny images from reddit were indexed based on user comments? Something interesting.

corgi

I took on the task this weekend and spent some time spinning up Elasticsearch and indexing the last three months of /r/funny. I included the image title and URL along with all the comments. I flattened all the comments and added them as an array of text, ignoring depth and context. The initial idea was to treat all comments equally. This actually ended up working nicely for some searches, like ‘cat’, ‘bear’, ‘game of thrones’, and ‘Metallica’, but some results are obviously showing up because of random discussions in the comments section.
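To make the flattening step concrete, here is a rough sketch of how a comment tree could be turned into the kind of document described above. The comment shape (`body` and `replies` keys) and the function names are my own simplification for illustration; the real reddit JSON is nested differently, and this is not the exact code behind the index.

```python
from typing import Any, Dict, List


def flatten_comments(comments: List[Dict[str, Any]]) -> List[str]:
    """Collect every comment body in the tree, ignoring depth and context."""
    flat: List[str] = []
    # Explicit stack instead of recursion, so very deep threads are safe.
    stack = list(reversed(comments))
    while stack:
        node = stack.pop()
        flat.append(node["body"])
        # Push replies in reverse so they are visited in their original order.
        stack.extend(reversed(node.get("replies", [])))
    return flat


def build_document(title: str, url: str, comments: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build one document per image: title, url, and all comments as an array of text."""
    return {"title": title, "url": url, "comments": flatten_comments(comments)}


# Hypothetical sample thread, only for illustration.
sample_thread = [
    {"body": "what a cute corgi", "replies": [
        {"body": "my dog does this too", "replies": [
            {"body": "unrelated argument about cars"},
        ]},
        {"body": "so fluffy"},
    ]},
    {"body": "repost"},
]

doc = build_document("Corgi on a skateboard", "http://example.com/corgi.jpg", sample_thread)
```

Every comment ends up with the same weight in `doc["comments"]`, which is why an off-topic reply deep in a thread can match a search just as easily as a top-level comment about the image.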

I’m already planning on decreasing the value of comments further down the tree. These deep comments are usually not relevant to the image and make for bad indexing data.
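One way this could be done, as a sketch only: keep top-level and deeper comments in separate fields, then give each field a different boost at query time. The field names (`comments_top`, `comments_deep`), the boost values, and the `body`/`replies` comment shape are all assumptions of mine. The only Elasticsearch feature relied on is the `multi_match` query with `field^boost` syntax.

```python
from typing import Any, Dict, List


def split_by_depth(comments: List[Dict[str, Any]], max_top_depth: int = 0) -> Dict[str, List[str]]:
    """Separate comment bodies into 'top' and 'deep' buckets by their depth in the tree."""
    top: List[str] = []
    deep: List[str] = []
    stack = [(c, 0) for c in reversed(comments)]
    while stack:
        node, depth = stack.pop()
        bucket = top if depth <= max_top_depth else deep
        bucket.append(node["body"])
        # Replies sit one level deeper than their parent.
        stack.extend((r, depth + 1) for r in reversed(node.get("replies", [])))
    return {"comments_top": top, "comments_deep": deep}


def build_query(text: str, top_boost: float = 3.0, deep_boost: float = 0.5) -> Dict[str, Any]:
    """Build a multi_match query body that favors top-level comments over deep ones."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": [
                    "title",
                    f"comments_top^{top_boost}",    # assumed boost, tune by experiment
                    f"comments_deep^{deep_boost}",  # assumed boost, tune by experiment
                ],
            }
        }
    }


# Hypothetical thread: one on-topic top-level comment, one off-topic nested reply.
thread = [
    {"body": "that bear looks happy", "replies": [
        {"body": "anyone watch game of thrones last night?"},
    ]},
]

fields = split_by_depth(thread)
query = build_query("bear")
```

With something like this, a search for ‘bear’ would still match the nested reply if it mentioned bears, but a match in `comments_top` would count for more. The right boost values are a matter of trial and error.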

I also indexed usernames along with the comments. Although this makes it much easier to find images that certain users have commented on, it does skew the results based on values that are hardly relevant to the images. However, people have such odd and mostly meaningless usernames that not many queries match them.

If you want to play around with it, give it a try here
