Wikipedia data dump to XML conversion
Intention
- Pure XML format of the documents in the Wikipedia collection for easier textual processing of its data.
- Simpler scheme serving the needs of information retrieval and friends.
- One file per document for parallel and incremental processing and easier debugging.
- Crystallize the relations important for textual information processing but hard to extract from the original dump format. For example the heading to cell relations in tables.
- Open a discussion and share efforts.
Example XML plain document
This example XML document illustrates the output generated by the conversion from the original dump.
XML tag summary
A summary of all tag paths appearing in the English Wikipedia collection, with an example and some statistics, can be found here. Unfortunately there is a bug in the calculation of the df (document frequency): it is always 1. Nevertheless the analysis gives you an overview of the tag paths appearing in the converted content. A schema will be provided in the future.
Example calls
You have to get a Wikipedia dump from here. To get all articles and redirects, use the option -n 0 of the strusWikimediaToXml call to restrict the extraction to documents of namespace 0 (articles).
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
bunzip2 enwiki-latest-pages-articles.xml.bz2
mkdir xml
strusWikimediaToXml -I -B -n 0 -P 10000 -t 12 enwiki-latest-pages-articles.xml xml
If you want to resolve page links to redirect pages, you can run the program twice: first with option -R <redirectfile> and then with option -L <redirectfile>. In the link extraction mode (option -R specified) no converted XML documents are written and the program runs single threaded.
strusWikimediaToXml -n 0 -P 10000 -R ./redirects.txt enwiki-latest-pages-articles.xml xml
strusWikimediaToXml -I -B -n 0 -P 10000 -t 12 -L ./redirects.txt enwiki-latest-pages-articles.xml xml
The option -I for the conversion generates more than one attribute with the same name per tag. For example a table cell may look like <cell id="C1" id="R2"> if called with -I. Unfortunately this is not valid XML. Without -I the same tag is printed as <cell id="C1,R2">.
Resources needed
You need less than 8 GB RAM. The following measurement was taken on an Intel(R) Core(TM) i7-6800K CPU @ 3.40GHz with 12 threads and strusWikimediaToXml called with option -t 12:
Command being timed: "strusWikimediaToXml -I -B -n 0 -t 12 -L ./redirects.txt enwiki-latest-pages-articles.xml doc"
User time (seconds): 10381.37
System time (seconds): 405.87
Percent of CPU this job got: 764%
Elapsed (wall clock) time (h:mm:ss or m:ss): 23:30.65
Maximum resident set size (kbytes): 2682604
Exit status: 0
Program
The conversion program is part of the project strusWikipediaSearch.
Usage: strusWikimediaToXml [options] <inputfile> [<outputdir>]
<inputfile>   :File to process or '-' for stdin
<outputdir>   :Directory where output files and directories are written to.
options:
-h            :Print this usage
-V            :Verbosity level 1 (output document title and errors to stderr)
-VV           :Verbosity level 2 (output lexems found in addition to level 1)
-S <doc>      :Select processed documents containing <doc> as title sub string
-B            :Beautified readable XML output
-P <mod>      :Print progress counter modulo <mod> to stderr
-D            :Write dump files always, not only in case of an error
-K <filename> :Write dump file to file <filename> before processing it.
-t <threads>  :Number of conversion threads to use is <threads>. Total number of threads is <threads>+1 (conversion threads + main thread)
-n <ns>       :Reduce output to namespace <ns> (0=article)
-I            :Produce one 'id' attribute per table cell reference, instead of one attribute with the ids separated by commas (e.g. id='C1,R2'). One 'id' attribute per table cell reference is not valid XML, but you should use this format if you process the XML with strus.
-R <lnkfile>  :Collect redirects only and write them to <lnkfile>
-L <lnkfile>  :Load link file <lnkfile> for verifying page links
Description
- Takes an unpacked Wikipedia XML dump as input and tries to convert it to a set of XML files in a schema suitable for information retrieval.
- The XML documents produced with option -I are not valid XML because table cells and headings may contain multiple 'id' attributes. We have not yet found a better way to describe heading-to-cell relations, which may be N to N. See section "Processing the data for information retrieval". Without option -I the documents produced are valid XML. Ideas are welcome.
- The produced XML document files have the extension .xml and are written into a subdirectory of <outputdir>. The subdirectories for the output are enumerated with 4 digits in ascending order starting with 0000.
- Each subdirectory contains at maximum 1000 <docid>.xml output files.
- Each output file contains one document and has an identifier derived from the Wikipedia title.
- You are encouraged to use multiple threads (option -t) for faster conversion.
- <docid>.err :File with recoverable errors in the document
- <docid>.mis :File with unresolvable page links in the document
- <docid>.ftl :File with an exception thrown while processing
- <docid>.wtf :File listing some suspicious text elements. This list is useful for tracking classification errors.
- <docid>.org :File with a dump of the document processed (only written if required or on error)
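The directory layout described above can be traversed with a small script. The following Python sketch is an illustration only (the function name and the sample paths are invented); it walks the enumerated subdirectories of an output directory, yields every converted document and reports whether a recoverable-error file exists next to it:

```python
import os

def walk_output(outputdir):
    """Yield (docid, xmlpath, has_err) for every converted document.

    The converter writes at most 1000 <docid>.xml files per 4-digit
    subdirectory (0000, 0001, ...); recoverable errors go to <docid>.err.
    """
    for sub in sorted(os.listdir(outputdir)):
        subdir = os.path.join(outputdir, sub)
        if not os.path.isdir(subdir):
            continue
        for name in sorted(os.listdir(subdir)):
            if not name.endswith(".xml"):
                continue
            docid = name[: -len(".xml")]
            errfile = os.path.join(subdir, docid + ".err")
            yield docid, os.path.join(subdir, name), os.path.isfile(errfile)
```

Because each file is one document, such a walker is a natural unit for parallel or incremental post-processing.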
Output XML tags
The tag hierarchy is intended, as a best effort, to be as flat as possible. The following list explains the tags in the output:
Structural XML tags embedding a structure
- <quot> :a quoted string (any type of quote) in a document
- <heading id='h#'> :a subtitle or heading in a document
- <list id='l#'> :a list item in a document
- <attr> :an attribute in a document
- <entity> :a marked entity of the Wikipedia collection
- <citation> :a citation in a document
- <ref> :a reference structure in a document
- <table> :a table implementation
- <tabtitle> :sub-title text in the table
- <head id='C#'> :head cells of a table addressing a column. '#' represents a non-negative number, e.g. "C#" ~ "C3".
- <head id='R#'> :head cells of a table addressing a row. '#' represents a non-negative number here, e.g. "R#" ~ "R4".
- <cell id='R#'> :data cells of a table with a list of identifiers making them addressable. '#' represents a non-negative number here, e.g. "R#" ~ "R0".
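The head-to-cell relation can be resolved by matching 'id' values. The following Python sketch uses an invented table fragment in the schema described above (the comma-separated id form written without option -I) and maps every data cell to the head cells addressing it:

```python
import xml.etree.ElementTree as ET

# Invented example fragment: one column head, one row head, one data cell
# addressed by both (ids comma-separated, as written without option -I).
table_xml = """
<table>
  <head id="C1"><text>Name</text></head>
  <head id="R1"><text>Row one</text></head>
  <cell id="C1,R1"><text>Alice</text></cell>
</table>
"""

def head_cell_relations(table):
    """Return a list of (cell, [related head cells]) pairs."""
    heads = {h.get("id"): h for h in table.findall("head")}
    relations = []
    for cell in table.findall("cell"):
        refs = cell.get("id", "").split(",")
        related = [heads[r] for r in refs if r in heads]
        relations.append((cell, related))
    return relations

table = ET.fromstring(table_xml)
for cell, heads in head_cell_relations(table):
    print(cell.find("text").text, "->", [h.find("text").text for h in heads])
# prints: Alice -> ['Name', 'Row one']
```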
Structural XML tags describing links
- <pagelink> :a link to a Wikipedia page
- <category> :a link to a Wikipedia category
- <imglink> :a link to an image
- <filelink> :a link to a file
- <weblink> :a web link
- <tablink> :internal link to a table in this document
- <citlink> :internal link to a citation in this document
- <reflink> :internal link to a reference in this document
Textual XML tags (tags marking a content)
- <docid> :The content specifies a unique document identifier (the title with '_' instead of ' ' and some other encodings)
- <text> :A text passage in a document
- <char> :Content is one special character or a sequence of special characters
- <code> :Text describing some sort of an identifier not suitable for retrieval
- <math> :Text marked as LaTeX syntax math formula
- <time> :A timestamp of the form "YYMMDDThhmmss<zone>", <zone> = Z (UTC)
- <bibref> :bibliographic reference
- <nowiki> :information not to index
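The <time> content described above can be turned into a native timestamp. A minimal Python sketch, assuming the two-digit-year form "YYMMDDThhmmss<zone>" stated above and handling only the zone marker 'Z' (the function name is invented):

```python
from datetime import datetime, timezone

def parse_wiki_time(value):
    """Parse a <time> content of the form 'YYMMDDThhmmss<zone>'.

    Only the zone marker 'Z' (UTC) is handled here; other markers
    would need additional rules.
    """
    if value.endswith("Z"):
        naive = datetime.strptime(value[:-1], "%y%m%dT%H%M%S")
        return naive.replace(tzinfo=timezone.utc)
    raise ValueError("unsupported zone marker in: " + value)
```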
Processing the data for information retrieval
- Indexable text passages can be retrieved with //text, //char, //code and //math.
- Special identifiers are selected by //bibref, //time and //code.
- Tables //table are defined with a title selected with //tabtitle and cell headings and cells. Cell headings and cells have attributes 'id' that relate headings to cells with a common value for the 'id' attribute.
- Tables, reference inclusions and citations are written as //table, //ref and //citation at the end of the section in which they appear, and are referenced in the text with //tablink, //reflink and //citlink. The idea behind this is to avoid interrupting sentences; this makes part-of-speech tagging and retrieval with positional relations within sentences possible. To relate entities with their context, use the 'id' attribute of the link and of the reference: items that belong together carry an identical value.
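The selection expressions above can be sketched with Python's ElementTree (in strus you would use the equivalent selection expressions of the analyzer instead). The document fragment below is invented for illustration; it assumes, as described above, that a //tablink and its //table carry an identical 'id' value:

```python
import xml.etree.ElementTree as ET

# Invented minimal document in the schema described above.
doc = ET.fromstring("""
<doc>
  <docid>Example_Page</docid>
  <text>An introductory sentence with a formula </text>
  <math>E = mc^2</math>
  <tablink id="T1"/>
  <table id="T1"><tabtitle>Numbers</tabtitle></table>
</doc>
""")

# Indexable text passages: //text, //char, //code and //math
passages = [e.text for e in doc.iter()
            if e.tag in ("text", "char", "code", "math")]

# Relate each table link to its table via the common 'id' value.
tables = {e.get("id"): e for e in doc.iter("table")}
for link in doc.iter("tablink"):
    print(link.get("id"), "->", tables[link.get("id")].find("tabtitle").text)
# prints: T1 -> Numbers
```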