Wikipedia visual link position and user navigation.

Understanding user behavior by exploiting the visual position of links in Wikipedia pages


By: H. Allègre, I. Bantzis, M. Chatzakis, B. Tornare

Group: souvlADAki


Introduction

Wikipedia is the largest crowd-sourced encyclopedia openly available to the public. Its own users constantly update it with new material, and its enormous body of content covers a large share of the knowledge available to us. Wikipedia has long attracted computer scientists and data analysts, because the links between its pages form a vast knowledge graph that is a rich subject for network research.

Wikispeedia was developed in this direction. It is an interactive game that challenges users to exploit their common sense by reaching a target Wikipedia article from an initial one, using only the links that appear on each page. Wikispeedia provided more insight into the semantic relationships between the entities of the Wikipedia graph, as well as the tactics that users followed in order to win the game. This work gave the scientific community a dataset containing the Wikipedia graph and the game data of the paths that the users followed, enabling even more research directions.

Using this dataset, a considerable body of research on such networks has examined the semantic similarity of entities through the paths provided by the game. In this evolving field of research, some questions about user behavior still remain unanswered:

- What is the position of the most clicked links?
- How is the position of each link related to the decision of the user?


Wikispeedia: Just a simple game?

Players browse Wikipedia pages in which links are highlighted in yellow, and each article has a number of links leading to other articles. Players start from an initial article and, using their common sense and their semantic interpretation of the relations between entities, must reach the target as fast as possible. Although the idea might seem simple, the strategies that users follow in order to win the game are really interesting.

In this example, the player must find the article "Amazon Basin" starting from the article "Constantine I". The path so far contains 7 links (Constantine I > Great Britain > Europe > Russia > United States > Gulf of Mexico > Spain > Latin), and the player must now choose the next link to follow, which will supposedly lead them closer to "Amazon Basin".



What are we trying to do?

Through this work, we address a number of research questions that revolve around our basic point of interest:

How do users navigate in a Wikispeedia game?

In detail, we address the following research points:
  • How does the position of links in the page affect user behavior?
  • Are there specific sections whose links are preferred in navigation?
  • Are the results of visual link analysis on Wikispeedia consistent with the available literature?

Wikispeedia Data

The Wikispeedia dataset contains several kinds of data. Among others, it provides information about the articles, their categories, and the links used in the game, which enables us to recreate the whole Wikipedia graph. In addition, we have the path data of each game: whether the path was finished or not, the articles of the path, the total time needed to reach the target or abandon the game, and the backtracks that were performed.
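To make the path data concrete, here is a minimal sketch of how one raw path could be parsed. It assumes that a path is stored as a single string of article names separated by ";" and that a backtrack is recorded as a "<" entry; both conventions, as well as the names `parse_path` and `ParsedPath`, are assumptions made for illustration and should be checked against the actual files.

```python
from dataclasses import dataclass

BACKTRACK = "<"  # assumed marker for a "back" click in the raw path string


@dataclass
class ParsedPath:
    clicks: list[str]    # raw entries, including backtrack markers
    visited: list[str]   # articles in the order they were displayed
    backtracks: int      # number of "back" clicks
    duration_sec: int    # total time of the game in seconds


def parse_path(raw_path: str, duration_sec: int) -> ParsedPath:
    """Split a raw path string and resolve backtracks with a stack."""
    clicks = raw_path.split(";")
    stack: list[str] = []    # current navigation history
    visited: list[str] = []
    backtracks = 0
    for click in clicks:
        if click == BACKTRACK:
            backtracks += 1
            if len(stack) > 1:
                stack.pop()                # leave the current article
                visited.append(stack[-1])  # the previous one is shown again
        else:
            stack.append(click)
            visited.append(click)
    return ParsedPath(clicks, visited, backtracks, duration_sec)


# One of the finished paths shown on this page (no backtracks).
finished = parse_path("Africa;Coffee;Cancer;Human;Sleep", 105)
print(len(finished.visited) - 1, "links,", finished.duration_sec, "seconds")

# A made-up path with one backtrack, only to illustrate the marker.
with_back = parse_path("Africa;Coffee;<;Human;Sleep", 60)
print(with_back.visited, with_back.backtracks)
```

Keeping both the raw clicks and the resolved sequence of displayed articles makes it possible to count backtracks separately from the links that were actually followed.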

How big is the graph?

We present here a summary of the available data for the Wikispeedia graph and paths that we used for our analysis.

  • 4604 Wikipedia articles
  • 51318 finished paths
  • 24875 unfinished paths



What does the graph look like?

Visualizing a graph like this is not an easy task: the nodes are so densely interconnected that drawing them legibly is challenging. Using Gephi, we clustered the nodes before the visualization.


What do the finished paths look like?

Africa > Coffee > Cancer > Human > Sleep (4 links, 105 seconds)

Pythagoras > Science > Pollution > Carbon_dioxide > Greenhouse_effect (4 links, 53 seconds)

Monty_Python > Democracy > Monarchy > Kuwait > Iraq > Saudi_Arabia > Afghanistan > Osama_bin_Laden (7 links, 248 seconds)

Game analysis

A user facing a Wikispeedia challenge is unlikely to reach the final goal by randomly clicking the links that appear on the page. To actually win the game, users need to follow a specific strategy.


Wait, Wikispeedia needs a strategy?

Well, yes! In order to win, users seem to follow a specific tactic. During the early phase of the game (the first few clicks), users tend to follow links that lead them to general Wikipedia pages, called hubs. Hubs are pages with many incoming and outgoing links, which let the user branch out in many directions. Examples are pages about major historic events, such as wars, and pages about countries.

Reaching such a page allows the player to start the next phase of the game, called the "narrowing-down" phase. During this phase, after encountering a hub, users tend to follow links to more specific pages that seem related to their target article. This is a well-known strategy for the game, which is also discussed in the official paper.
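The notion of a hub can be sketched in a few lines. The snippet below counts incoming and outgoing links from a list of (source, target) pairs and ranks articles by their sum. Ranking by the sum of the two degrees is an assumption made for illustration, not necessarily the definition used in the official paper, and the edge list is a made-up toy example rather than real Wikispeedia data.

```python
from collections import Counter


def degree_table(edges):
    """Return {article: (in_degree, out_degree)} for a list of (source, target) links."""
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    articles = set(in_deg) | set(out_deg)
    return {a: (in_deg[a], out_deg[a]) for a in articles}


def top_hubs(edges, k=3):
    """Rank articles by in_degree + out_degree (assumed hub score), ties by name."""
    table = degree_table(edges)
    return sorted(table, key=lambda a: (-(table[a][0] + table[a][1]), a))[:k]


# Hypothetical toy links, only for illustration.
toy_edges = [
    ("Coffee", "Africa"),
    ("Africa", "Coffee"),
    ("Cancer", "Human"),
    ("Human", "Africa"),
    ("Sleep", "Human"),
    ("Africa", "Human"),
]
print(top_hubs(toy_edges, k=2))
```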

Let's see graphs!

The graph above presents the average number of links on the pages that users visit at every step (click) of the game. Clearly, in the very early phase of the game, users visit pages with many links, the so-called hubs. After a hub is reached, we can see the "narrowing-down" strategy: players visit more specific pages, with fewer links at every new click.
This reflects the two phases of the game: reaching a hub and narrowing down in the next steps.
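A curve of this kind can be computed roughly as follows. This is a minimal sketch, assuming each path is a list of article names and that the number of outgoing links per article is available in a dictionary; the function name `mean_out_degree_per_step` and the toy values are hypothetical.

```python
from collections import defaultdict


def mean_out_degree_per_step(paths, out_degree):
    """Average number of outgoing links of the article visited at each step.

    Step 0 is the start article, step 1 the article reached by the first click, etc.
    Articles missing from out_degree are skipped.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for path in paths:
        for step, article in enumerate(path):
            if article in out_degree:
                totals[step] += out_degree[article]
                counts[step] += 1
    return {step: totals[step] / counts[step] for step in sorted(counts)}


# Hypothetical toy values, only for illustration.
toy_out_degree = {"A": 10, "B": 20, "Hub": 100, "C": 5}
toy_paths = [["A", "Hub", "C"], ["B", "Hub"]]
print(mean_out_degree_per_step(toy_paths, toy_out_degree))
```

Because paths have different lengths, later steps are averaged over fewer games, so the tail of such a curve is noisier than its beginning.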

These graphs present the percentage of the top categories of articles that players follow during the initial, "hub-searching" phase of the game. The first graph corresponds to the very beginning of the game, after the first click of the player, while the second corresponds to the second click.
We see that in both cases more than 15% of the clicks go to a country page!
This is expected! In Wikipedia, pages that represent countries include a great amount of content and have numerous outgoing links to many different kinds of pages. Thus, it is natural for a player to start by looking for a country page, in order to reach a high-quality hub.


The rest of the graphs present the percentage of the top categories of articles visited during the "narrowing-down" phase of the game, in steps 7 and 8 respectively.
Here, we have a more balanced selection of categories, and no category exceeds 5%!
This is also expected! During this phase, players tend to look for more specific articles (and thus, categories). The articles that they select are related to their final goal, resulting in the above graphs.
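The category percentages at a given step can be sketched as below. The sketch assumes one category per article, which is a simplification since an article may belong to several categories; the function name `category_share_at_step` and the toy data are hypothetical.

```python
from collections import Counter


def category_share_at_step(paths, category_of, step):
    """Percentage of paths whose article at position `step` falls in each category.

    Paths shorter than `step + 1` articles are ignored; articles without a
    known category are counted as "unknown".
    """
    counts = Counter(
        category_of.get(path[step], "unknown")
        for path in paths
        if len(path) > step
    )
    total = sum(counts.values())
    return {cat: 100 * n / total for cat, n in counts.most_common()}


# Hypothetical toy data, only for illustration.
toy_categories = {
    "France": "Countries",
    "Spain": "Countries",
    "Coffee": "Food",
    "Sleep": "Health",
}
toy_paths = [
    ["Coffee", "France", "Sleep"],
    ["Sleep", "Spain"],
    ["France", "Coffee"],
    ["Spain", "Sleep", "Coffee"],
]
print(category_share_at_step(toy_paths, toy_categories, step=1))
```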

Good, but where are those links?

In the lead?

In the main body?

In the infobox?

In which paragraph?

Maybe.. NOWHERE?

Who knows?

Conclusion

The Wikispeedia dataset provided us with rich resources for analyzing how users behave when they navigate the Wikipedia knowledge graph using their common sense. After an extensive analysis of the available data, we gained interesting insights.

Our results can be summarized in the following observations:

  • In Wikipedia articles, most links are usually found in the first few paragraphs of each page
  • For Wikispeedia players, links in the lead section point to articles that are more relevant for reaching the final goal
  • For Wikispeedia players, there is a strong preference towards the first paragraphs of a page
  • In Wikipedia, articles of the lead section contain on average more links than the ones in the body
  • In Wikispeedia games, the percentage of sections clicked at each stage of the game remains almost equal, but:
  • In Wikispeedia games, the percentage of clicks in the body section increases slightly in the later stages of the game, implying that players look further down the document for more informative links

This work could be extended in the future in multiple directions. This analysis could be combined with the ideas presented in the related work section, in order to provide insight into both the absolute position of the links in a page (right, left, top, bottom) and their conceptual position in the article (lead, body, infobox). In addition, this work could be extended with experiments where the paragraphs are evaluated by their actual name, instead of their relative position in the text.


Thank you very much for your attention!