in the ~five years of running Digg Reader, we collected 12 billion pieces of content (~27TB of article content). that's prob most of what was published over the last five years. if anyone wants access to it for research or whatever, let me know.
93 replies and sub-replies as of Mar 18 2018

Can you upload the data/metadata to BigQuery? That way anyone can access it quickly. cc @felipehoffa
it's all on AWS now. the metadata is in dynamodb and article content (json) is on s3.
Happy to help, thanks for the ping Max! Wondering - what interesting data do you have in that metadata?
yeah, do you also have the shares in that metadata? e.g. how many people shared an article. I just did some similar research for Reddit.
Selfishly if there was a recipe to pull the data to my own AWS environment that’d be perfect.
Details of this s3 bucket etc? Thank you for what you’re doing.
ah the challenges with such volumes! We are now closing in on 14 billion docs. Are you using compression? I'm asking because we're currently at ~19TB compressed content + 9TB Elasticsearch index.
27TB is for the article content (as json files), no compression. do you have all of your archive in ES?
We don't store the contents in ES (_all and _source are disabled). It's only being used for full text search, nothing more. The actual content is in MySQL (believe it or not), sharded into multiple databases compressed with zlib and serialized with msgpack.
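That compress-then-serialize blob pattern can be sketched in a few lines. This is a stdlib-only sketch, not their implementation: json stands in for msgpack, plain dicts stand in for the sharded MySQL tables, and the shard count is made up.

```python
import json
import zlib

NUM_SHARDS = 8  # assumption; the real shard count isn't stated

# dicts stand in for the sharded MySQL tables
shards = [{} for _ in range(NUM_SHARDS)]

def shard_for(doc_id: int) -> dict:
    """Pick a shard from the document id (simple modulo here)."""
    return shards[doc_id % NUM_SHARDS]

def store(doc_id: int, doc: dict) -> None:
    # serialize (they used msgpack; json keeps this sketch stdlib-only),
    # then zlib-compress before writing the blob
    blob = zlib.compress(json.dumps(doc).encode("utf-8"))
    shard_for(doc_id)[doc_id] = blob

def load(doc_id: int) -> dict:
    blob = shard_for(doc_id)[doc_id]
    return json.loads(zlib.decompress(blob).decode("utf-8"))

store(42, {"title": "hello", "body": "world"})
print(load(42))  # → {'title': 'hello', 'body': 'world'}
```

For text-heavy article bodies, zlib typically cuts storage by a large factor, which is presumably how 14 billion docs fit in ~19TB.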
that's awesome! yeah, sorry, i meant the entire archive metadata in search. that's very cool. are you running it on AWS? always cool to hear how other people have done it - thanks for sharing! we did dynamo/redis for metadata and memcache/s3 for article content (json).
Ah OK for us the content is metadata too ;) Only ids/feed ids/dates are 1st class citizens here :) We use redis heavily too for the polling system. Everything is on our own cloud (1TB+ RAM, 300+ cores) using opennebula and @storpool. Should make a blog post about it soon.
we built a service that sits in front of S3 and does multigets on files in s3. so memcache + s3 as article storage worked pretty well, was fairly cheap and we didnt have to worry about scaling the storage.
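The memcache-in-front-of-S3 multiget described above is the classic cache-aside pattern. A minimal sketch, assuming nothing about the real service: plain dicts stand in for memcache and S3, and the keys are invented.

```python
# Cache-aside multiget: serve hits from cache, batch-fetch misses from
# the object store, then backfill the cache for next time.
cache = {}
s3 = {"a": '{"title": "A"}', "b": '{"title": "B"}', "c": '{"title": "C"}'}

def multiget(keys):
    hits = {k: cache[k] for k in keys if k in cache}
    misses = [k for k in keys if k not in hits]
    for k in misses:      # a real service would fetch these from S3
        body = s3[k]      # concurrently rather than one at a time
        cache[k] = body   # backfill so the next read skips S3
        hits[k] = body
    return hits

print(multiget(["a", "b"]))
print(multiget(["b", "c"]))  # "b" now comes from cache
```

The appeal is exactly what's described: S3 handles durability and scaling, while the cache absorbs the hot keys, so neither layer needs manual capacity planning.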
Sounds like a great setup. For us the latency to AWS is a bit of an issue since our main datacenter is in Europe and 5 years ago they weren't very present in EU :)
I *want* this but not sure I know how or what I’d do with it
i know. i'll keep you posted...will see if i can package up some useful parts of it for ppl to use.
Can you add me to your notify list? I'm also interested in this dataset.
I’m trying out a database at the moment - would love to use this. Is it available anywhere?
Hmmm, maybe give it an API? I suggest calling it "Digg Dugg".
academictorrents.com would be a great way to host a dataset like this!
Ooh tempting. What’s included in the article content? Any html or just the text?
We’d love to find a way to collaborate here cc @aabtzu
Hi @myoung - I am not familiar with the data format, but have you considered putting this into Google's #BigQuery as a public dataset? I think that's a great way to go with terabyte scale datasets that you want to share.
- let me know if I can be any help with this #BigQuery
We’d love to have our students use it for a Data analytics class!
Jason Scott, Internet Archive. We want it.
Jscott@archive.org
+1 for Internet Archive! If we could get a compressed WARC representation of the JSON it'd be (a) reasonably sized (gzip), (b) allow for random access, and (c) likely be of a sane size for torrents ^_^ Happy to help however possible!
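The random-access property mentioned here comes from gzip allowing a file to be a series of independent members: write each JSON record as its own member, keep a byte-offset index, and you can decompress any single record without touching the rest. This is the same trick WARC readers rely on. A stdlib sketch (not actual WARC format):

```python
import gzip
import io

docs = [b'{"id": 1}', b'{"id": 2}', b'{"id": 3}']

# Write each record as its own gzip member, recording where it starts.
buf = io.BytesIO()
offsets = []
for doc in docs:
    offsets.append(buf.tell())
    buf.write(gzip.compress(doc))
offsets.append(buf.tell())  # sentinel: end of the last member

def read_record(i: int) -> bytes:
    """Decompress a single record by slicing out its gzip member."""
    data = buf.getvalue()[offsets[i]:offsets[i + 1]]
    return gzip.decompress(data)

print(read_record(1))  # → b'{"id": 2}'
```

Because concatenated members are themselves a valid gzip stream, the same file works both for sequential bulk processing and for random lookups via the offset index.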
If we can help make this available for academic research let me know.
I wonder if Google would want to host it as open data in BigQuery.
Super cool! Let's discuss and then reach out.
Does it contain fake news?
It’s full of fake news!
Michael, would you consider making your data sets available on @datadotworld? It's an amazing data collaboration platform. @databrett @jonloyens @bryonjacob @scuttlemonkey
Wow! Maybe @GCPcloud will host it for you and connect it up w/ BigQuery :)
So sad to see it go. Was always great to get a response from you when I ran into bugs or had feature requests. All the best for your next ventures.
Awesome! Can I have access to the datasets?
What kind of meta data does it have
Any chance of exporting it to an open format (CSV, dbf) and then compressing it, outside of DynamoDB, in several parts, maybe 1 GB per file, to import and transform into another DB? Those subsets would be very useful ✌.
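A chunked export like that is straightforward to script. A sketch with assumed column names and a tiny size cap so the demo runs; in practice you'd stream a DynamoDB scan into it and set the cap near 1 GB:

```python
import csv
import gzip
import io

def export_parts(rows, header, max_bytes=200):
    """Split rows into gzip-compressed CSV parts, each capped at
    roughly max_bytes of uncompressed data (~1 GB in practice)."""
    parts = []
    buf, writer, count = None, None, 0

    def start():
        nonlocal buf, writer, count
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(header)  # every part is independently usable
        count = 0

    start()
    for row in rows:
        writer.writerow(row)
        count += 1
        if buf.tell() >= max_bytes:
            parts.append(gzip.compress(buf.getvalue().encode("utf-8")))
            start()
    if count:  # flush the non-empty tail
        parts.append(gzip.compress(buf.getvalue().encode("utf-8")))
    return parts

# hypothetical two-column schema just for the demo
rows = [(i, f"http://example.com/{i}") for i in range(50)]
parts = export_parts(rows, ["id", "url"])
print(f"{len(parts)} parts")
```

Repeating the header in every part keeps each file self-describing, so a subset can be loaded into another database without the rest of the export.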
oh. you're on it. never mind.
What does "or whatever" comprise?
Would love to have access to this for @InsightDataSci Fellows to build ML projects!
Please train (a) a topic model and (b) Brown clusters and share the outputs.
would like access but to specific parts. i am an independent data researcher
I’d love to take a look at the data. Keep me posted.
Haven’t used Digg in a long while, but would it be possible to extract URLs country-wise? Year-wise would be readily available, I guess.
I'd love access to the file if it becomes available.
Yes I would love to access it!
I am interested for student projects, please keep me posted
I’d also love access to this data
would like access Mike, thanks.
Can you put it in a public, requester-pays S3 bucket?
could be interested
It is indeed very interesting, but I am not sure if the AWS S3 bucket would be open to non-users of AWS. But I’d love to download the whole thing in any possible way and put it into the Elastic Stack for an API and Hadoop for Spark :)
Hi Michael, we have interactive Spark notebooks on top of Hadoop cluster with 180TB HDFS, which is used by @CNRS and @ISCPIF researchers/scientists. We would love to host this data and make it available through Rest APIs and batch processing for the scientific community.
Oh yes, please add me to the list. What's the license on the content? Can we preprocess/opensource some of it under github.com/RaRe-Technolog…? CC @gensim_py
How can one gain access to this Michael?
Hi @myoung my company @exonar would love to get access to assist with privacy research, what's the easiest way of us facilitating?
, useful for the PersComm project?
Hello from @parsely! We would be interested to get a look at this data set for research and product development purposes. Are you considering making this a publicly available dataset like common crawl? Amazon or Google may be willing to host for free.
I’d be very much interested in using it for research!
How bout putting a small sample of it on Kaggle?
What kinds of metadata/stats on viewership, time collected are available? Supervaluable for studying transmission, virality, and fake news.
Did you consider putting it in an S3 requester-pays bucket?
Hi Michael, I'm interested on it. Especially in articles written in portuguese. Thank you!
It would be awesome if we could have a way to access it. Many cool research projects here. Cc @KristinaLerman
Do you have just the articles, or also interactions (diggs/likes)?
I'd love to have access to it as well! Bet there are some crazy cool stuff to be found in that dataset!
I would like to access it. Please provide me the details / keep me in the list. angsuman [ at ] taragana [ dot ] com
Interested. We would love to analyze this dataset!
I'm interested (for NLP program we're working on).
Interested. Thanks, Michael!
Would love to have that as well to play with. What about sharing on bittorent?
pretty interested in that, how can I reach you?
I'm confused - this is 12B URLs and the associated html/text/json? Why not just include the URLs and people can scrape from the web or the Internet Archive. IMO the real value in this is the user perspective (clicks, saves, tags, etc.) of how this content relates to each other.