Wikipedia Dumps

AAAS24

May 15, 2022

Coding

Wikipedia is one of the most reliable references for data available today. It is constantly updated and is cited in many books. I find myself constantly learning from and searching its data, so I thought it would be interesting to learn and document different methods for obtaining information from it.

This project studies downloading the entire Wikipedia dump and how to quickly obtain the information for a single article.

As a summary, here is my assessment of the libraries used:

Scale used: 1 = Least -> 5 = Most

Concept         | Wiki Dump Parser | Wiki Dump Reader
----------------|------------------|-----------------
Ease to install | 4                | 4
Ease to use     | 4                | 4
Output Value    | 3                | 5

Wikipedia does not allow web crawlers to download large numbers of articles. As stated in their how-to download guide, Wikipedia's servers would not be able to cope with the constant pressure of scraping the entire site. However, they make copies of the site available for download in different formats; the easiest is the latest copy of the state of all pages. This is the method explored here.

To follow along with this project's code, please view location.

METHODOLOGY

Disclaimer: I tried running all commands from the Jupyter notebook in an effort to test its capabilities, expand my knowledge of libraries, keep a better trace of code changes and reduce switching between tools. This may have introduced inefficiencies that could be avoided by running scripts directly. To run shell commands I used this function:

    import subprocess

    def runcmd(cmd, verbose = False, *args, **kwargs):
        """Run a shell command and optionally print its output."""
        process = subprocess.Popen(
            cmd,
            stdout = subprocess.PIPE,
            stderr = subprocess.PIPE,
            text = True,
            shell = True
        )
        # wait for the command to finish and capture stdout/stderr
        std_out, std_err = process.communicate()
        if verbose:
            print(std_out.strip(), std_err)

Example:

    runcmd("echo 'Hello World'", verbose = True)

DEPENDENCIES

PYTHON LIBRARIES

    import pandas as pd
    import numpy as np
    import os
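
The two dump-parsing libraries evaluated in Step 2 are also required. A minimal install sketch, assuming the package names as published on PyPI (wiki_dump_parser and wiki-dump-reader):

    # install the parsing libraries from PyPI (package names are an assumption)
    runcmd("pip install wiki_dump_parser wiki-dump-reader", verbose = True)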

SHELL PACKAGES:

  • wget
  • bzip2
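
Both can be installed with the system package manager. A minimal sketch using the runcmd helper, assuming macOS with Homebrew (as the paths later in this post suggest); on Debian/Ubuntu, "sudo apt-get install wget bzip2" would be the equivalent:

    # install the shell dependencies via Homebrew (assumed environment)
    runcmd("brew install wget bzip2", verbose = True)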

STEP 1: DOWNLOADING AND DECOMPRESSING THE DUMP

    # delete any previous copy of the dump
    filename = 'enwiki-latest-pages-articles.xml.bz2'
    try:
        os.remove(filename)
    except OSError:
        pass

    # download the latest English Wikipedia dump
    runcmd("wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2", verbose = True)

    # decompress the dump in place with bzip2
    runcmd("bzip2 -d /Users/alialvarez/Desktop/STUDIES/github/code_library/wikipedia/enwiki-latest-pages-articles.xml.bz2", verbose = True)

STEP 2: PROCESSING THE DUMP

  • OBTAIN METADATA

Evaluated the Wiki Dump Parser Library.

    # delete any previous CSV output from the parser
    try:
        os.remove(filename[:-8] + '.csv')
    except OSError:
        pass

    # parse the decompressed XML dump into a CSV of page metadata
    import wiki_dump_parser as parser
    parser.xml_to_csv('enwiki-latest-pages-articles.xml')

The resulting file, shown below, has several rows where the text columns (page_title & contributor_name) failed to parse, which makes it hard to use directly. With some data wrangling, however, it should be possible to obtain a curated list of page_ids and titles (a sketch follows the example below). It would also be an interesting exercise to check whether one of the dump files, namely "enwiki-**-all-titles.gz", already contains this data pre-arranged.

Metadata Example
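
A minimal wrangling sketch along those lines, assuming the parser's CSV has page_id and page_title columns and a standard comma separator (both are assumptions about the output format):

    # hypothetical clean-up: keep only rows whose page_id parses as an integer
    import pandas as pd

    df = pd.read_csv('enwiki-latest-pages-articles.csv', on_bad_lines='skip', dtype=str)
    df['page_id'] = pd.to_numeric(df['page_id'], errors='coerce')  # failed parses become NaN
    curated = df.dropna(subset=['page_id'])[['page_id', 'page_title']].drop_duplicates()
    curated.to_csv('curated_titles.csv', index=False)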

  • OBTAIN MARKUP

Evaluated the Wiki Dump Reader Library.

    from wiki_dump_reader import Cleaner, iterate

    wiki = {}
    cleaner = Cleaner()

    # iterate over every page in the dump and clean its wiki markup
    for title, text in iterate('enwiki-latest-pages-articles.xml'):
        text = cleaner.clean_text(text)
        cleaned_text, links = cleaner.build_links(text)

        # add the cleaned article to the dictionary, keyed by title
        wiki.update({title: [cleaned_text]})

    # create a DataFrame and export it as CSV
    df_wiki = pd.DataFrame.from_dict(wiki, orient='index')
    df_wiki.columns = ['cleaned']
    df_wiki.to_csv(os.path.join(os.getcwd(), 'wiki_dump_example.csv'))

The code above results in a clean file, although with many redirect rows, like this:

Metadata Example

One potential improvement is to delete these REDIRECT rows and, when failing to find a term, use the "enwiki-**-redirects.gz" file provided as part of the dumps to locate the corresponding title page (see the sketch below).
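
A minimal sketch of that filter, assuming redirect pages keep a literal "REDIRECT" marker in their cleaned text (an assumption about the cleaner's output):

    # hypothetical filter: drop rows whose cleaned text still contains a redirect marker
    mask = df_wiki['cleaned'].str.contains('REDIRECT', case=False, na=False)
    df_articles = df_wiki[~mask]
    print(f"removed {mask.sum()} redirect rows, kept {len(df_articles)} articles")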

CONCLUSIONS

The Wiki Dump Parser library is very easy to use, but the output requires further transformations because it does not correctly parse the text fields. If you only need the IDs or titles, they may already be available in one of the file dumps released by Wikipedia. However, for basic stats like bytes or contributor name, it seems like a good start.

The Wiki Dump Reader provides a great cleaned-markup result for a quick understanding of each article, and its output could be fed into an ML model with ease.
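
As one illustration of that hand-off, a minimal sketch that vectorizes the cleaned column with scikit-learn's TfidfVectorizer (scikit-learn is an assumption; any text pipeline would do):

    # hypothetical hand-off: turn the cleaned articles into TF-IDF features
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    df = pd.read_csv('wiki_dump_example.csv', index_col=0)
    vectorizer = TfidfVectorizer(max_features=10000)
    X = vectorizer.fit_transform(df['cleaned'].fillna(''))
    print(X.shape)  # (number of articles, 10000)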

Scale: 1 = Least -> 5 = Most

Concept         | Wiki Dump Parser | Wiki Dump Reader
----------------|------------------|-----------------
Ease to install | 4                | 4
Ease to use     | 4                | 4
Output Value    | 3                | 5

IMPROVEMENTS

  • Compare the output of "enwiki-**-all-titles.gz" with that of the Wiki Dump Parser library.

  • Delete the REDIRECT rows and, when failing to find a term, use the "enwiki-**-redirects.gz" file provided as part of the dumps to find the corresponding title page.
