in the ~five years of running Digg Reader, we collected 12 billion pieces of content (~27TB of article content). that's probably most of what was published over those five years. if anyone wants access to it for research or whatever, let me know.
ah, the challenges of such volumes! We're now closing in on 14 billion docs. Are you using compression? I'm asking because we're currently at ~19TB of compressed content + a 9TB Elasticsearch index.
We don't store the contents in ES (_all and _source are disabled); it's only used for full-text search, nothing more. The actual content is in MySQL (believe it or not), sharded across multiple databases, compressed with zlib and serialized with msgpack.
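Roughly, the pack/unpack step looks like this (a minimal sketch only; the field layout and the surrounding MySQL/ES plumbing are assumptions, not the actual code):

```python
import zlib
import msgpack

def pack_article(article: dict) -> bytes:
    """Serialize an article dict with msgpack, then compress it with zlib."""
    return zlib.compress(msgpack.packb(article, use_bin_type=True))

def unpack_article(blob: bytes) -> dict:
    """Reverse the write path: decompress, then deserialize."""
    return msgpack.unpackb(zlib.decompress(blob), raw=False)

# The compressed blob goes into a BLOB column in one of the MySQL shards,
# keyed by article id. Elasticsearch only indexes the extracted text
# (_all and _source disabled), so a search returns ids that are then
# fetched back from MySQL.
```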
that's awesome! yeah, sorry, i meant having the entire archive's metadata in search. that's very cool. are you running it on AWS? always cool to hear how other people have done it - thanks for sharing! we did dynamo/redis for metadata and memcache/s3 for article content (json).
Ah OK, for us the content is metadata too ;) Only ids/feed ids/dates are first-class citizens here :) We use Redis heavily too, for the polling system. Everything runs on our own cloud (1TB+ RAM, 300+ cores) using OpenNebula and @storpool. Should write a blog post about it soon.
we built a service that sits in front of S3 and does multigets on the stored files. so memcache + s3 as article storage worked pretty well, was fairly cheap, and we didn't have to worry about scaling the storage.
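Conceptually the read path was something like this (a rough sketch, not the actual service; the client libraries, bucket name, and key layout here are assumptions):

```python
import boto3
from pymemcache.client.base import Client as MemcacheClient

s3 = boto3.client("s3")
cache = MemcacheClient(("localhost", 11211))
BUCKET = "article-content"  # hypothetical bucket name

def get_articles(article_ids):
    """Fetch article JSON blobs, hitting memcache first and S3 on misses."""
    results = dict(cache.get_many(article_ids))  # one round trip for the batch
    for article_id in article_ids:
        if article_id in results:
            continue
        # cache miss: pull the JSON object from S3 (hypothetical key scheme)
        obj = s3.get_object(Bucket=BUCKET, Key=f"articles/{article_id}.json")
        body = obj["Body"].read()
        results[article_id] = body
        cache.set(article_id, body)  # backfill the cache for next time
    return results
```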
Sounds like a great setup. For us the latency to AWS is a bit of an issue, since our main datacenter is in Europe and five years ago they weren't very present in the EU :)
Hi @myoung - I'm not familiar with the data format, but have you considered putting this into Google's #BigQuery as a public dataset? I think that's a great way to go for terabyte-scale datasets that you want to share.
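For example, if the archive were dumped as newline-delimited JSON in GCS, a load could be roughly this (a sketch; the project/dataset/bucket names are made up):

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "digg-reader-archive.public.articles"  # hypothetical project.dataset.table

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # let BigQuery infer the schema from the JSON
)

load_job = client.load_table_from_uri(
    "gs://digg-reader-archive/articles-*.json.gz",  # hypothetical GCS path
    table_id,
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
```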
+1 for Internet Archive! If we could get a compressed WARC representation of the JSON, it'd (a) be reasonably sized (gzip), (b) allow random access, and (c) likely be a sane size for torrents ^_^ Happy to help however possible!
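For instance, with warcio each article's JSON could become its own gzipped "resource" record, which is what gives you the per-record random access (a sketch; the field names are assumptions):

```python
from io import BytesIO
from warcio.warcwriter import WARCWriter

def write_archive(articles, out_path="digg-reader.warc.gz"):
    """Write each article's JSON as a 'resource' record in a gzipped WARC."""
    with open(out_path, "wb") as out:
        # gzip=True writes one gzip member per record, enabling random access
        writer = WARCWriter(out, gzip=True)
        for article in articles:
            record = writer.create_warc_record(
                article["url"],                    # assumed field name
                "resource",
                payload=BytesIO(article["json"]),  # raw JSON bytes, assumed field
                warc_content_type="application/json",
            )
            writer.write_record(record)
```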
Any chance of exporting it to an open format (CSV, DBF) outside of DynamoDB and then compressing it in several parts, maybe 1 GB per file, so it can be imported and transformed into another DB? Those subsets would be very useful ✌.
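Something like this chunked export would do it (a sketch; iter_articles() and the column list are placeholders):

```python
import csv
import gzip

CHUNK_BYTES = 1_000_000_000  # target ~1 GB per part (uncompressed estimate)
FIELDS = ["id", "url", "title", "published", "feed_id"]  # assumed columns

def export_chunks(iter_articles, prefix="digg-reader-part"):
    """Stream articles into gzipped CSV files, rolling to a new file near 1 GB."""
    part, written = 0, CHUNK_BYTES  # force opening the first file
    out = writer = None
    for article in iter_articles():
        if written >= CHUNK_BYTES:
            if out:
                out.close()
            part += 1
            out = gzip.open(f"{prefix}-{part:04d}.csv.gz", "wt", newline="")
            writer = csv.DictWriter(out, fieldnames=FIELDS)
            writer.writeheader()
            written = 0
        row = {k: article.get(k, "") for k in FIELDS}
        writer.writerow(row)
        written += sum(len(str(v)) for v in row.values())  # rough size estimate
    if out:
        out.close()
```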
It is indeed very interesting, but I'm not sure whether the AWS S3 bucket would be open to non-users of AWS. I'd love to download the whole thing any way possible and put it into the Elastic Stack for an API and into Hadoop for Spark :)
Hi Michael, we have interactive Spark notebooks on top of a Hadoop cluster with 180TB of HDFS, used by @CNRS and @ISCPIF researchers/scientists. We would love to host this data and make it available to the scientific community through REST APIs and batch processing.
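For example, a batch job over newline-delimited JSON dumps on HDFS could be as simple as this (a sketch; the paths and field names are assumptions):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("digg-reader-archive").getOrCreate()

# assumed layout: gzipped newline-delimited JSON dumped onto HDFS
articles = spark.read.json("hdfs:///data/digg-reader/articles-*.json.gz")

# example batch job: count articles per feed per day (field names are assumptions)
daily_counts = (
    articles
    .withColumn("day", F.to_date("published"))
    .groupBy("feed_id", "day")
    .count()
)
daily_counts.write.mode("overwrite").parquet("hdfs:///data/digg-reader/daily_counts")
```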
Oh yes, please add me to the list. What's the license on the content? Can we preprocess/open-source some of it under github.com/RaRe-Technolog…? CC @gensim_py
Hello from @parsely! We'd be interested in getting a look at this dataset for research and product-development purposes. Are you considering making it a publicly available dataset like Common Crawl? Amazon or Google may be willing to host it for free.
I'm confused - is this 12 billion URLs plus the associated HTML/text/JSON? Why not just publish the URLs and let people scrape them from the web or the Internet Archive? IMO the real value here is the user perspective (clicks, saves, tags, etc.) on how this content relates to each other.