Introduction
Wikipedia is the largest crowd-sourced encyclopedia openly available to the public. It is constantly
updated with new material by its own users, and its enormous size means that it covers a large part of
the knowledge available to us.
Wikipedia has been very attractive to computer scientists and data analysts because the links between its
pages form a vast knowledge graph, which is an interesting subject for network research.
Wikispeedia was developed in this direction. It is an interactive game that challenges users to exploit
their common sense by requiring them to reach a target Wikipedia article from an initial one, using only
the links that appear in each page.
Wikispeedia provided more insight into the semantic relationships between the entities of the
Wikipedia graph, as well as the tactics that users followed in order to win the game.
This work provided the scientific community with a dataset containing the Wikipedia graph and the game data
of the paths that the users followed, enabling even more research directions.
Using this dataset, a considerable amount of research is being performed on
such networks, examining the semantic similarity of entities through the paths recorded in the game. In
the context of this evolving field of research, some questions regarding user behavior still remain
unanswered:
- What is the position of the most clicked links?
- How is the position of each link related to the decision of the user?
Wikispeedia: Just a simple game?
Players browse Wikipedia pages in which links are highlighted in yellow, and each article has
a number of links leading to other articles. Players start from an initial article and, using their
common sense and their semantic interpretation of the relations between entities, must reach the
target as fast as possible.
Although the idea might seem simple, the strategies and methods that the users follow in order to
win the game are really interesting.
In this example, the player must find the article "Amazon Basin" starting from the article
"Constantine I". The player has already followed 7 links (Constantine I > Great Britain >
Europe > Russia > United States > Gulf of Mexico > Spain > Latin) and must now choose the
next link to follow, which will supposedly lead them closer to "Amazon Basin".
What are we trying to do?
How do users navigate in a Wikispeedia game?
In detail, we are trying to address the following research points:
- How does the position of links in the page affect user behavior?
- Are there specific sections whose links are preferred in navigation?
- Are the results of visual link analysis on Wikispeedia consistent with the available literature?
Wikispeedia Data
The Wikispeedia dataset contains several kinds of data. Among others,
it provides information about the articles, their categories and the links used
in the game, enabling us to recreate the whole Wikipedia graph.
In addition, we have the path data of each game available. The path data
record whether a path was finished or not, the articles of the
path, the total time needed to reach the target or abandon the game, and the backtracks
that were performed.
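To illustrate how a recorded path and its backtracks can be read, here is a minimal sketch. It assumes that a path is stored as a single string with articles separated by `;` and that a backtrack is encoded as `<`; the function name `resolve_path` and the toy path are hypothetical and only serve the illustration.

```python
def resolve_path(raw_path):
    """Return (visited, backtracks) for a ';'-separated path string.

    Assumption: '<' means the player pressed "back" and returned to the
    previously visited article.
    """
    stack = []       # articles between the start and the current page
    visited = []     # every page the player landed on, in order
    backtracks = 0
    for token in raw_path.split(";"):
        if token == "<":
            backtracks += 1
            if len(stack) > 1:
                stack.pop()                 # leave the current page
                visited.append(stack[-1])   # land on the previous one again
        else:
            stack.append(token)
            visited.append(token)
    return visited, backtracks


# Toy path (hypothetical): A -> B, back to A, then A -> C
visited, backtracks = resolve_path("A;B;<;C")
print(visited, backtracks)
```

Keeping a stack makes repeated backtracks resolve correctly, since each `<` returns to the page the player came from rather than to the previous token in the string.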
How big is the graph?
We present here a summary of the available data for the Wikispeedia graph and paths that we used for our analysis.
Wikipedia Articles
Finished paths
Unfinished paths
What does the graph look like?
What do the finished paths look like?
Game analysis
A user facing a Wikispeedia game challenge is unlikely to reach the final goal by randomly clicking the links that appear in the page. To actually win the game, users need to follow a specific strategy.
Wait, Wikispeedia needs a strategy?
Well, yes! In order to win, there is a specific tactic which users seem to use. During the early phase
of the game (the first few clicks), users tend to follow links that might lead them to general Wikipedia
pages, called hubs.
Hubs are pages with many incoming and outgoing links, which enable the user to follow paths in a
broader direction.
Examples of such pages are pages of general historic events, such as wars, pages of countries etc.
Reaching such a page allows the player to start the next phase of the game, which is called the
"narrowing-down" phase. During this part of the game, after encountering a hub, users tend to
follow links to more specific pages that seem related to their target article. This is a well-known
strategic approach to the game, which is also discussed in the official paper.
Let's see graphs!
The graph above presents the average number of links on the pages that users visit at every step (click)
of the game.
Clearly, in the very early phase of the game, users visit pages with many links, the so-called
hubs.
After a hub is reached, we can see the "narrowing-down" strategy: the players visit more
specific pages, with fewer links at every new click.
This is an actual representation of the two phases of the game: reaching a
hub and narrowing down in the next steps.
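A curve of this kind can be computed with a few lines of code. The following is a minimal sketch, not the exact code behind the graph: it assumes each path is available as a list of article names and that the number of outgoing links per article is known. The names `mean_out_degree_per_step`, `paths` and `out_degree`, as well as the toy values, are hypothetical.

```python
from collections import defaultdict


def mean_out_degree_per_step(paths, out_degree):
    """Average number of outgoing links of the page visited at each step.

    paths: list of article-name lists (step 0 is the start article).
    out_degree: dict mapping article name -> number of outgoing links.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for path in paths:
        for step, article in enumerate(path):
            if article in out_degree:      # skip articles missing from the graph
                totals[step] += out_degree[article]
                counts[step] += 1
    return {step: totals[step] / counts[step] for step in sorted(counts)}


# Toy data (hypothetical values, for illustration only)
out_degree = {"Start": 10, "Country": 200, "Topic": 40, "Target": 15}
paths = [
    ["Start", "Country", "Topic", "Target"],
    ["Start", "Country", "Target"],
]
print(mean_out_degree_per_step(paths, out_degree))
```

In the toy data the average peaks at step 1, where both paths visit the hub-like page, and drops afterwards, which is the shape the two-phase strategy would produce.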
These graphs present the percentage of the top categories of articles that players follow
during the initial, "hub-searching" phase of the game. The first graph corresponds to the very
beginning of the game, after the player's first click, while the second corresponds to the
player's second click.
We see that in both cases more than 15% of the clicks go to a country page!
This is expected! In Wikipedia, pages that represent countries include a great amount of content
and have numerous outgoing links to many different kinds of pages. Thus, it is natural for a player
to start by looking for a country page in order to reach a high-quality hub.
The rest of the graphs present the percentage of the top categories of articles visited during the
"narrowing-down" phase of the game, in steps 7 and 8 respectively.
Here, we have a more balanced
selection of categories, and no category exceeds 5%!
This is also expected! During this phase, players tend to look for more specific articles (and thus,
categories). The articles
that they select are related to their final goal, resulting in the above graphs.
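The category shares at a given step can be sketched as follows. This is an illustrative sketch under simplifying assumptions: each article is mapped to a single category (in the actual data an article may belong to several), and the names `category_share_at_step`, `paths` and `category_of` and the toy data are hypothetical.

```python
from collections import Counter


def category_share_at_step(paths, category_of, step):
    """Share of each category among the articles visited at a given step.

    paths: list of article-name lists (step 0 is the start article).
    category_of: dict mapping article name -> category label (assumed unique).
    """
    counts = Counter(
        category_of[path[step]]
        for path in paths
        if len(path) > step and path[step] in category_of
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders the categories from most to least visited
    return {category: n / total for category, n in counts.most_common()}


# Toy data (hypothetical, for illustration only)
category_of = {
    "France": "Countries",
    "Germany": "Countries",
    "Water": "Chemistry",
    "Sun": "Astronomy",
}
paths = [
    ["S1", "France", "X"],
    ["S2", "Germany", "Y"],
    ["S3", "Water", "Z"],
    ["S4", "Sun"],
]
print(category_share_at_step(paths, category_of, step=1))
```

Calling the function once per step and comparing the largest shares is enough to see whether one category dominates, as in the early clicks, or whether the clicks are spread out, as in the later steps.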
Good, but where are those links?
Visual Link Position Analysis
Before finding the positions of the links, it is important
to keep in mind the structure of a typical Wikipedia page.
A Wikipedia page has a specific structure: it is divided into three major sections, the lead, the
body and the infobox, each of which can contain links that users can follow.
The lead is at the top of the page and contains a descriptive paragraph about the
current page, with the corresponding links.
Then the body follows, containing multiple paragraphs and links, which are the main content of the article.
Finally, the infobox is a small frame on the right of the
page, showing summary information.
Infoboxes sometimes contain links, but in several cases they only contain
text.
In the Wikispeedia data, approximately 1% of the links belong to the infobox, so we decided not to
include them in our analysis, for simplicity.
This is an example of the typical structure of a Wikipedia page. We see that the page begins with the lead, the infobox is on the right, and the body of the page follows with the actual content.
Let's get to the point!
Now that we are aware of the basic statistics and analysis of the game, as well as the structure of the pages, we can proceed with our visual link position analysis work!
The above graph presents the power-law-like distribution of the number of links per paragraph across all articles; the x-axis represents the position of the paragraph in the page. We conclude that most links usually appear in the first few paragraphs of each page, although the rest of the paragraphs can still contain some links.
The right graph presents the total number of links present in the lead and body sections across all the pages of the dataset, while the left graph presents the number of clicks performed across all the games in the corresponding sections. From these graphs, we conclude that although links in the lead section represent only about 20% of the total links in the corpus, users click on them about 40% of the time when playing the game! This suggests that links in the lead section point to more relevant articles that lead users to their final goal.
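The comparison between link share and click share can be made explicit as a ratio per section. The following is a minimal sketch; the function name `section_preference` is hypothetical, and the toy labels are chosen only to mirror the approximate 20% and 40% proportions mentioned above, not taken from the actual counts.

```python
from collections import Counter


def section_preference(link_sections, click_sections):
    """Ratio of click share to link share for each section.

    A ratio above 1 means the section is clicked more often than its
    share of links alone would suggest. Both inputs are non-empty
    iterables of section labels such as 'lead' or 'body'.
    """
    link_counts = Counter(link_sections)
    click_counts = Counter(click_sections)
    n_links = sum(link_counts.values())
    n_clicks = sum(click_counts.values())
    ratios = {}
    for section, n in link_counts.items():
        link_share = n / n_links
        click_share = click_counts[section] / n_clicks
        ratios[section] = click_share / link_share
    return ratios


# Toy labels (hypothetical): 20% of links but 40% of clicks are in the lead
all_links = ["lead"] * 2 + ["body"] * 8
clicks = ["lead"] * 4 + ["body"] * 6
print(section_preference(all_links, clicks))
```

With these toy proportions, a lead link is clicked about twice as often as its share of links would predict, while body links fall below a ratio of 1.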
What if the links were chosen at random? We replay every path in our dataset, and for every click we calculate an expected click location based on all of the links in that particular page. The location is simply the paragraph that contains the link. We then compare these expected click locations with those of the actual clicks that players made in the game. The results are clear: there is a strong preference towards the first paragraphs.
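One way to sketch such a random baseline is shown below. It assumes that "at random" means every link on the page is equally likely to be clicked, so the expected location is the mean paragraph index over all links of the page; the actual comparison may have been computed differently. The names `expected_paragraph` and `compare_with_random` and the toy data are hypothetical.

```python
def expected_paragraph(link_paragraphs):
    """Expected paragraph index of a click if all links were equally likely.

    link_paragraphs: paragraph index of every link on the page
    (one entry per link, so paragraphs with many links weigh more).
    """
    return sum(link_paragraphs) / len(link_paragraphs)


def compare_with_random(clicks):
    """Return (mean observed paragraph, mean expected paragraph).

    clicks: list of (clicked_paragraph, link_paragraphs_on_page) pairs,
    one pair per click in the replayed paths.
    """
    usable = [(clicked, links) for clicked, links in clicks if links]
    observed = sum(clicked for clicked, _ in usable) / len(usable)
    expected = sum(expected_paragraph(links) for _, links in usable) / len(usable)
    return observed, expected


# Toy clicks (hypothetical): players click higher up than a uniform choice would
clicks = [
    (0, [0, 0, 1, 2, 5]),   # clicked in paragraph 0; links in paragraphs 0, 0, 1, 2, 5
    (1, [0, 1, 3, 4]),
]
observed, expected = compare_with_random(clicks)
print(observed, expected)
```

An observed mean below the expected mean indicates that clicks land in earlier paragraphs than a uniform choice among the page's links would produce.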
We have plenty of evidence so far that the links located in the first paragraphs are the ones chosen most often. Do the articles they lead to also contain more links? Yes! By analyzing all the chosen articles in the paths, we can see that articles reached through a click in the lead section contain on average more links than those reached through the body. This difference persists even as the game progresses! This behavior is to be expected. To quote the Wikipedia Manual of Style: "The lead serves as an introduction to the article and a summary of its most important contents." Even as we approach more specialized, shorter articles, the introductory section contains the most important points of the article and naturally leads to more general "hub" articles.
The percentages of sections clicked at each stage of the game remain almost equal. On the first click, we notice a slight preference for links in the lead section. This reflects the strategy of the users to get away from the unrelated source articles towards more general articles that will give them access to a larger number of links and more relevant results (the hub-searching phase).
As we note from the plot on the left, the first click is on average located higher in the article. This occurs because the first introductory paragraphs lead to articles of higher importance and generality, as seen in the previous plots. On the second click the paragraph number increases drastically. This happens because the general, "high importance" articles contain more paragraphs, and players search through the whole text for more relevant specialized articles, which can be found lower in the page. After the second click the paragraph number decreases gradually, which can be attributed to the lower total number of paragraphs these articles contain. However, players are still looking deep into the text, as the right figure shows. The percentage of clicks in the body section increases slightly in the later stages of the game, which implies that players are looking further down the document even though these highly specialized documents contain less text overall.
Conclusion
The Wikispeedia dataset provided us with rich resources, helping us to analyze user behavior when
players navigate the Wikipedia knowledge graph using their common sense. After an extensive analysis
of the available data, we gained interesting insights.
Our results can be summarized in the following observations:
- In Wikipedia articles, most links are usually present in the first few paragraphs of each page
- For Wikispeedia players, links in the lead section point to articles that are more relevant for reaching the final goal
- For Wikispeedia players, there is a strong preference towards the first paragraphs of a page
- In Wikipedia, articles linked from the lead section contain on average more links than the ones linked from the body
- In Wikispeedia games, the percentage of sections clicked at each stage of the game remains almost equal
- However, the percentage of clicks in the body section increases slightly in the later stages of the game, implying that players are looking further down the document for more informative links
This work could be extended in the future in multiple directions. This analysis could be combined with the ideas presented in the related work section, in order to provide insight into both the absolute position of the links in a page (right, left, top, bottom) and their conceptual position in the article (lead, body, infobox). In addition, this work could be extended with experiments where the paragraphs are evaluated by their actual name, instead of their relative position in the text.