CirrusSearch has a method to export its current search indexes to a file. The export contains one JSON string per article, formatted for use with the Elasticsearch bulk API. Because the bulk requests are just JSON, they can easily be processed with anything that reads JSON. This information is already public, but only on a per-article basis[1]. Full wiki dumps could be made publicly available and might be useful to anyone doing text-based analysis of the corpus. This is also something we could point to when throttling abusive clients.
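As a rough illustration of how little tooling a consumer would need, here is a sketch in Python that streams such a dump as newline-delimited JSON. The file name, the assumption that bulk action/metadata lines may be interleaved with the article documents, and the field names "title" and "text" are mine for the example and are not guaranteed by the dump format.

    # Sketch: stream a gzipped bulk-style dump as newline-delimited JSON.
    # Assumes one JSON object per line; bulk action/metadata lines (objects
    # whose only key is "index") are skipped. Field names like "title" and
    # "text" are illustrative and may vary between indexes.
    import gzip
    import json

    def iter_articles(path):
        with gzip.open(path, "rt", encoding="utf-8") as dump:
            for line in dump:
                obj = json.loads(line)
                if set(obj) == {"index"}:  # bulk action line, not an article
                    continue
                yield obj

    if __name__ == "__main__":
        # Hypothetical dump file name, for illustration only.
        for article in iter_articles("enwiki_content.json.gz"):
            print(article.get("title"), len(article.get("text", "")))

Anything that can read JSON line by line (jq, Spark, a few lines of any scripting language) could do the same, which is the point of reusing the bulk format.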
The total cluster size is currently 2.5TB; limited to only the content namespaces, it is 1.2TB. The indexes generally compress about 10:1 with gzip relative to their reported size. For reference, the ten largest indexes are:
enwiki_general       438 GB
commonswiki_file     239 GB
enwiki_content       200 GB
commonswiki_general   70 GB
frwiki_general        65 GB
jawiki_content        62 GB
dewiki_general        62 GB
dewiki_content        55 GB
frwiki_content        55 GB
metawiki_general      54 GB
[1] http://en.wikipedia.org/wiki/California?action=cirrusdump