Wikipedia Link Analysis

Introduction

Wikipedia is the largest crowd-sourced encyclopedia openly available to the public. It is constantly updated with new material by its own users, and its enormous body of content covers a large share of the knowledge we have available. Wikipedia has been very attractive to computer scientists and data analysts, because the links between its pages form a vast knowledge graph, a subject of interesting network research.

In this direction, Wikispeedia was developed: an interactive game that challenges users to exploit their common sense by requiring them to reach a target Wikipedia article from an initial one, using only the links that appear on each page. Wikispeedia provided more insight into the semantic relationships between the entities of the Wikipedia graph, as well as the tactics that users followed in order to win the game. This work provided the scientific community with a dataset containing the Wikipedia graph and the game data of the paths that the users followed, enabling even more research directions.

Using this dataset, a considerable amount of research is being performed on such networks, examining the semantic similarity of entities through the paths recorded in the game. In the context of this evolving field of research, some questions regarding user behavior still remain unanswered:

- What is the position of the most clicked links?
- How is the position of each link related to the decision of the user?

Wikispeedia: Just a simple game?

Players browse Wikipedia pages in which links are highlighted in yellow, and each article has a number of links leading to other articles. Players start from an initial article and, using their common sense and their semantic interpretation of the relations between entities, must reach the target as fast as possible. Although the idea might seem simple, the strategies and methods that the users follow in order to win the game are really interesting.

In this example, the player must find the article "Amazon Basin" starting from the article "Constantine I". The path contains 7 links already visited (Constantine I > Great Britain > Europe > Russia > United States > Gulf of Mexico > Spain > Latin), and now the player should choose the next link to follow, which will supposedly lead them closer to "Amazon Basin".

What are we trying to do?

Through this work, we are trying to address a number of research questions, which revolve around our basic point of interest:

How do users navigate in a Wikispeedia game?

In detail, we are trying to address the following research points:

How does the position of links in the page affect user behavior?
Are there specific sections whose links are preferred during navigation?
Are the results of visual link analysis on Wikispeedia consistent with the available literature?

Wikispeedia Data

The Wikispeedia dataset contains several kinds of data. Among others, it provides information about the articles, their categories and the links used in the game, enabling us to recreate the whole Wikipedia graph. In addition, we have the path data of each game. The path data record whether a path was finished or not, the articles of the path, the total time needed to reach the target or abandon the game, and the backtracks that were performed.
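As a minimal sketch of how such path data could be processed, the snippet below assumes that a path is stored as a semicolon-separated string of article names in which a "<" token marks a backtrack. Both this encoding and the helper name resolve_path are assumptions made for illustration, and the example string is invented. The function rebuilds the sequence of pages the player actually saw and counts the backtracks.

```python
def resolve_path(raw_path):
    """Return (visited_pages, n_backtracks) for one raw path string.

    Assumption: articles are separated by ';' and '<' means the player
    pressed "back", returning to the previously displayed page.
    """
    stack = []      # pages on the current browsing trail
    visited = []    # every page displayed, in order
    backtracks = 0
    for token in raw_path.split(";"):
        if token == "<":
            backtracks += 1
            if len(stack) > 1:
                stack.pop()                # leave the current page
                visited.append(stack[-1])  # the previous page is shown again
        else:
            stack.append(token)
            visited.append(token)
    return visited, backtracks


# Invented example path with a single backtrack after "Coffee".
pages, n_back = resolve_path("Africa;Coffee;<;Cancer;Human;Sleep")
print(pages)
print(n_back)
```

Keeping a stack of the current trail makes consecutive backtracks resolve correctly, since each "<" returns to the page before the one currently shown.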

How big is the graph?

We present here a summary of the available data for the Wikispeedia graph and paths that we used for our analysis.

- 4604 Wikipedia articles
- 51318 finished paths
- 24875 unfinished paths

What does the graph look like?

Visualizing a graph like this is not an easy task: the nodes are so densely interconnected that drawing them clearly is challenging. For this reason, we clustered the nodes in Gephi before producing the visualization.

What do the finished paths look like?

Africa > Coffee > Cancer > Human > Sleep (4 links, 105 seconds)

Pythagoras > Science > Pollution > Carbon_dioxide > Greenhouse_effect (4 links, 53 seconds)

Monty_Python > Democracy > Monarchy > Kuwait > Iraq > Saudi_Arabia > Afghanistan > Osama_bin_Laden (7 links, 248 seconds)

Game analysis

A user facing a Wikispeedia game challenge is unlikely to reach the final goal by randomly clicking the links that appear on the page. To actually win the game, users need to follow a specific strategy.

Wait, Wikispeedia needs a strategy?

Well, yes! In order to win, there is a specific tactic which users seem to use. During the early phase of the game (the first few clicks), users tend to follow links that lead them to general Wikipedia pages, called hubs. Hubs are pages with many incoming and outgoing links, which let the user branch out in many directions. Typical examples are pages about major historic events, such as wars, and pages about countries.

Reaching such a page allows the player to start the next phase of the game, which is called the "narrowing-down" phase. During this part of the game, after a hub has been reached, users tend to follow links to more specific pages that seem related to their target article. This is a well-known strategic approach to the game, which is also discussed in the official paper.

Let's see graphs!

The graph above presents the average number of links on the pages that users visit at every step (click) of the game. Clearly, in the very early phase of the game, users visit pages with many links, the so-called hubs. After a hub is reached, we can see the "narrowing-down" strategy: players visit more specific pages, with fewer links at every new click.
This is an actual representation of the two phases of the game: reaching a hub and narrowing down in the next steps.
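A curve like this could be computed roughly as follows. This is a minimal sketch, not the code used for the plot: it assumes each path is already a list of article names and that the graph is available as a dictionary of outgoing links. The function name and the toy data are invented for illustration.

```python
from collections import defaultdict


def mean_out_degree_per_step(paths, out_links):
    """Average number of outgoing links of the page visited at each step.

    paths     : list of lists of article names (step 0 is the start article)
    out_links : dict mapping an article name to the set of articles it links to
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for path in paths:
        for step, article in enumerate(path):
            totals[step] += len(out_links.get(article, ()))
            counts[step] += 1
    return {step: totals[step] / counts[step] for step in sorted(counts)}


# Toy data, invented for illustration only.
toy_links = {
    "Pythagoras": {"Science", "Greece"},
    "Science": {"Pollution", "Physics", "Greece", "Pythagoras"},
    "Pollution": {"Carbon_dioxide"},
    "Greece": {"Science", "Pythagoras", "Pollution"},
}
toy_paths = [
    ["Pythagoras", "Science", "Pollution"],
    ["Pythagoras", "Greece", "Pollution"],
]
print(mean_out_degree_per_step(toy_paths, toy_links))
```

In the toy data the average rises at step 1 and drops at step 2, which is the hub-then-narrow-down shape described above.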

Those graphs present the percentage of the top categories of articles that players follow during the initial, "hub-searching" phase of the game. The right graph corresponds to the very beginning of the game, after the first click of the player, while the other corresponds to the second click.
We see that in both cases more than 15% of the clicks go to a country page!
This is expected! In Wikipedia, pages that represent countries include a great amount of content and have numerous outgoing links to many different kinds of pages. Thus, it is natural for a player to start by looking for a country page, in order to reach a high-quality hub.

The rest of the graphs present the percentage of the top categories of articles visited during the "narrowing-down" phase of the game, in steps 7 and 8 respectively.
Here, the selection of categories is more balanced and no category exceeds 5%!
This is also expected! During this phase, players look for more specific articles (and thus, categories). The articles that they select are related to their final goal, resulting in the above graphs.
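The category percentages for a given click could be obtained along these lines. This is a sketch under simplifying assumptions: each article is mapped to a single category (articles may in practice carry more than one), paths are lists of article names, and the names and toy data below are invented for illustration.

```python
from collections import Counter


def category_shares_at_step(paths, category_of, step):
    """Share of each category among the articles visited at a given step.

    step counts clicks: step=1 is the article reached by the first click.
    Paths shorter than the requested step are ignored.
    """
    counter = Counter(
        category_of.get(path[step], "unknown")
        for path in paths
        if len(path) > step
    )
    total = sum(counter.values())
    if total == 0:
        return {}
    return {cat: n / total for cat, n in counter.most_common()}


# Toy data, invented for illustration only.
toy_category = {"France": "Countries", "Germany": "Countries", "Physics": "Science"}
toy_paths = [
    ["A", "France", "X"],
    ["B", "Germany"],
    ["C", "Physics", "Y"],
    ["D", "France"],
]
print(category_shares_at_step(toy_paths, toy_category, step=1))
```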

Good, but where are those links?

In the lead?

In the main body?

In the infobox?

In which paragraph?

Maybe.. NOWHERE?

Who knows?

Visual Link Position Analysis

Before finding the positions of the links, it is important to keep in mind the structure of a typical Wikipedia page. A Wikipedia page is divided into three major sections, the lead, the body and the infobox, each of which can contain links that users can follow.

The lead is at the top of the page and contains a descriptive paragraph about the current page, with the corresponding links. The body follows, containing multiple paragraphs and links, which make up the main content of the article. Finally, the infobox is a small frame at the right of the page, showing summary information. Infoboxes sometimes contain links, but there are several cases where they contain only text.

In the Wikispeedia data, approximately 1% of the links belong to the infobox, so for simplicity we decided not to include them in our analysis.

This is an example of the typical structure of a Wikipedia page. We see that the page begins with the lead, with the infobox on the right, followed by the body of the page with the actual content.

Let's get to the point!

Now that we are aware of the basic statistics and analysis of the game, as well as the structure of the pages, we can proceed with our visual link position analysis work!

The above graph presents the power-law-like distribution of the number of links per paragraph across all articles; the x-axis represents the position of the paragraph in the page. We conclude that most links are usually found in the first few paragraphs of each page, although the remaining paragraphs can still contain some links.

The right graph presents the total number of links in the lead and body sections across all the pages of the dataset, while the left graph presents the number of clicks performed across all the games in the same two sections. From this graph, we conclude that although links in the lead section represent only about 20% of the total links in the corpus, users click on them about 40% of the time when playing the game! This implies that links in the lead section point to more relevant articles that lead users to their final goal.
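The comparison between the two graphs can be expressed as a ratio of shares. In the sketch below, the counts are round numbers chosen only to mirror the approximate 20% and 40% figures quoted above; they are not the real dataset counts, and the function names are hypothetical.

```python
def section_shares(counts):
    """Turn raw counts per section into fractions that sum to 1."""
    total = sum(counts.values())
    return {section: n / total for section, n in counts.items()}


def click_to_link_ratio(link_counts, click_counts):
    """Ratio of click share to link share for each section.

    A ratio above 1 means the section attracts more clicks than its
    share of links alone would suggest.
    """
    link_share = section_shares(link_counts)
    click_share = section_shares(click_counts)
    return {s: click_share[s] / link_share[s] for s in link_share}


# Illustrative round numbers, not the real counts.
links = {"lead": 20, "body": 80}
clicks = {"lead": 40, "body": 60}
for section, ratio in click_to_link_ratio(links, clicks).items():
    print(section, round(ratio, 2))
```

With these illustrative numbers, lead links are clicked about twice as often as their share of links would predict, while body links are clicked less often than predicted.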

What if the links were chosen at random? We replay every path in our dataset and, for every click, we calculate an expected click location based on all of the links in that particular page. The location is simply the index of the paragraph that contains the link. We then compare these expected locations with the locations of the actual clicks that players made in the game. The results are clear: there is a strong preference towards the first paragraphs.
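One way to build such a random baseline is sketched below. It assumes that "at random" means every link on the page is equally likely to be clicked, so the expected location is the mean paragraph index over all links of that page. The data layout, function names and toy values are assumptions made for illustration.

```python
def expected_paragraph(link_paragraphs):
    """Expected paragraph index of a uniformly random click on one page.

    link_paragraphs: the paragraph index of every link on the page.
    """
    return sum(link_paragraphs) / len(link_paragraphs)


def compare_to_random(clicks):
    """Compare observed clicks with the uniform-random baseline.

    clicks: list of (clicked_paragraph, link_paragraphs_on_page) tuples,
            one per click in the replayed paths.
    Returns (mean observed paragraph, mean expected paragraph).
    """
    observed = sum(clicked for clicked, _ in clicks) / len(clicks)
    expected = sum(expected_paragraph(p) for _, p in clicks) / len(clicks)
    return observed, expected


# Toy clicks, invented for illustration only.
toy_clicks = [
    (0, [0, 0, 1, 2, 3]),  # clicked a lead link on a page with 5 links
    (1, [0, 1, 2, 5]),     # clicked in paragraph 1 on a page with 4 links
]
observed, expected = compare_to_random(toy_clicks)
print(observed, expected)
```

An observed mean below the expected mean indicates that clicks sit higher on the page than uniform random choice would place them.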

We have plenty of evidence so far that the links located in the first paragraphs are the most chosen ones. Do the articles they point to also contain more links? Yes! By analyzing all the chosen articles in the paths, we can see that articles reached through a click in the introductory lead section contain on average more links than the ones reached from the body. This difference persists even as the game progresses. This behavior is to be expected. To quote the Wikipedia Manual of Style: "The lead serves as an introduction to the article and a summary of its most important contents." Even as we approach more specialized, shorter articles, the introductory section contains the most important points of the article and naturally leads to more general "hub" articles.

The percentages of sections clicked at each stage of the game remain almost equal. On the first click, we notice a slight preference for links in the lead section. This reflects the strategy of users to get away from the unrelated source article towards more general articles that will give them access to a larger number of links and relevant results (the hub-searching phase).

As we note from the plot on the left, the first click is located higher in the article on average. This occurs because the first introductory paragraphs lead to articles of higher importance and generality, as seen in the plots before. On the second click the paragraph number increases drastically. This happens because the general "high importance" articles contain more paragraphs, and players search through the whole text for more relevant, specialized articles, which can be found lower in the page. After the second click the paragraph number decreases gradually, which can be attributed to the lower total number of paragraphs these articles contain. However, players are still looking deep into the text, as the right figure shows. The percentage of clicks in the body section increases slightly in the later stages of the game, which implies that players are looking further down the document even though these highly specialized documents contain less text overall.

Related Work

Is there related work?

As we mentioned in the introduction, the study of user navigation, and of how user behavior and common sense can be exploited to locate semantically similar entities, is an evolving field with many works around this general topic.

There have been studies of user navigation in Wikipedia articles, but those evaluations are different, as they are captured from users who browse Wikipedia freely, clicking the links that they find interesting. Although such works provide useful results, they are not comparable with Wikispeedia navigation, because Wikispeedia captures the common sense and the notion of semantic similarity that users apply while navigating towards a target.

The related work most applicable to our research is described here. In a nutshell, the authors tried to find where in the page the most clicked links are located. By creating a heatmap of the positions of the links in the pages, they found that users tend to prefer clicking on the left side of the screen. This means that this work is mostly oriented towards the absolute positions of the links in a page (left, right, bottom, top).

Interaction with our work

Our research in this project is orthogonal to what we described above. Rather than studying the absolute positions of links on the screen, we study in which relative parts of the page the most popular links reside. Instead of using terms such as left side or right side, we analyze link positions based on whether they appear in the lead or the body, which sections are the most clicked, how the links in different sections have different attributes, and so on.

The combination of these two works as a future direction would be really interesting.

Conclusion

The Wikispeedia dataset provided us with rich resources, helping us to analyze the behavior of users when they navigate over the Wikipedia knowledge graph using their common sense. After an extensive analysis of the available data, we gained interesting insights.

Our results can be summarized in the following observations:

In Wikipedia articles, most links are usually present in the first few paragraphs of each page
For Wikispeedia players, links in the lead section point to more relevant articles for reaching the final goal
For Wikispeedia players, there is a strong preference towards the first paragraphs of a page
In Wikipedia, articles linked from the lead section contain on average more links than the ones linked from the body
In Wikispeedia games, the percentage of sections clicked at each stage of the game remains almost equal, but:
In Wikispeedia games, the percentage of clicks in the body section increases slightly in the later stages of the game, implying that players are looking further down the document for more informative links

This work could be extended in the future in multiple directions. This analysis could be combined with the ideas presented in the related work section, in order to provide insight into both the absolute position of the links in a page (right, left, top, bottom) and their conceptual position in the article (lead, body, infobox). In addition, this work could be extended with experiments where the paragraphs are identified by their actual section name, instead of their relative position in the text.

Wikipedia visual link position and user navigation.