in the ~five years of running Digg Reader, we collected 12 billion pieces of content (~27TB of article content). that's probably most of what was published over those five years. if anyone wants access to it for research or whatever, let me know.
ah, the challenges of such volumes! We're now closing in on 14 billion docs. Are you using compression? I'm asking because we're currently at ~19TB of compressed content + a 9TB Elasticsearch index.
We don't store the contents in ES (_all and _source are disabled); it's only used for full-text search, nothing more. The actual content is in MySQL (believe it or not), sharded across multiple databases, compressed with zlib and serialized with msgpack.
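Roughly, the pack/unpack step looks like this (a minimal sketch only; the field layout and the surrounding MySQL/ES plumbing are assumptions, not the actual code):

```python
import zlib
import msgpack

def pack_article(article: dict) -> bytes:
    """Serialize an article dict with msgpack, then compress it with zlib."""
    return zlib.compress(msgpack.packb(article, use_bin_type=True))

def unpack_article(blob: bytes) -> dict:
    """Reverse the write path: decompress, then deserialize."""
    return msgpack.unpackb(zlib.decompress(blob), raw=False)

# The compressed blob goes into a BLOB column in one of the MySQL shards,
# keyed by article id. Elasticsearch only indexes the extracted text
# (_all and _source disabled), so a search returns ids that are then
# fetched back from MySQL.
```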
that's awesome! yeah, sorry, i meant having the entire archive's metadata in search. that's very cool. are you running it on AWS? always cool to hear how other people have done it - thanks for sharing! we did dynamo/redis for metadata and memcache/s3 for article content (json).
Ah OK, for us the content is metadata too ;) Only ids/feed ids/dates are first-class citizens here :) We use Redis heavily too, for the polling system. Everything runs on our own cloud (1TB+ RAM, 300+ cores) using OpenNebula and @storpool. Should write a blog post about it soon.
we built a service that sits in front of S3 and does multigets on the stored files. so memcache + s3 as article storage worked pretty well, was fairly cheap, and we didn't have to worry about scaling the storage.
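Conceptually the read path was something like this (a rough sketch, not the actual service; the client libraries, bucket name, and key layout here are assumptions):

```python
import boto3
from pymemcache.client.base import Client as MemcacheClient

s3 = boto3.client("s3")
cache = MemcacheClient(("localhost", 11211))
BUCKET = "article-content"  # hypothetical bucket name

def get_articles(article_ids):
    """Fetch article JSON blobs, hitting memcache first and S3 on misses."""
    results = dict(cache.get_many(article_ids))  # one round trip for the batch
    for article_id in article_ids:
        if article_id in results:
            continue
        # cache miss: pull the JSON object from S3 (hypothetical key scheme)
        obj = s3.get_object(Bucket=BUCKET, Key=f"articles/{article_id}.json")
        body = obj["Body"].read()
        results[article_id] = body
        cache.set(article_id, body)  # backfill the cache for next time
    return results
```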
Sounds like a great setup. For us the latency to AWS is a bit of an issue, since our main datacenter is in Europe and five years ago they weren't very present in the EU :)
Hi @myoung - I'm not familiar with the data format, but have you considered putting this into Google's #BigQuery as a public dataset? I think that's a great way to go for terabyte-scale datasets that you want to share.
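For example, if the archive were dumped as newline-delimited JSON in GCS, a load could be roughly this (a sketch; the project/dataset/bucket names are made up):

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "digg-reader-archive.public.articles"  # hypothetical project.dataset.table

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # let BigQuery infer the schema from the JSON
)

load_job = client.load_table_from_uri(
    "gs://digg-reader-archive/articles-*.json.gz",  # hypothetical GCS path
    table_id,
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
```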
+1 for Internet Archive! If we could get a compressed WARC representation of the JSON, it'd (a) be reasonably sized (gzip), (b) allow random access, and (c) likely be a sane size for torrents ^_^ Happy to help however possible!
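For instance, with warcio each article's JSON could become its own gzipped "resource" record, which is what gives you the per-record random access (a sketch; the field names are assumptions):

```python
from io import BytesIO
from warcio.warcwriter import WARCWriter

def write_archive(articles, out_path="digg-reader.warc.gz"):
    """Write each article's JSON as a 'resource' record in a gzipped WARC."""
    with open(out_path, "wb") as out:
        # gzip=True writes one gzip member per record, enabling random access
        writer = WARCWriter(out, gzip=True)
        for article in articles:
            record = writer.create_warc_record(
                article["url"],                    # assumed field name
                "resource",
                payload=BytesIO(article["json"]),  # raw JSON bytes, assumed field
                warc_content_type="application/json",
            )
            writer.write_record(record)
```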
Any chance of exporting it to an open format (CSV, DBF) outside of DynamoDB and then compressing it in several parts, maybe 1 GB per file, so it can be imported and transformed into another DB? Those subsets would be very useful ✌.
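Something like this chunked export would do it (a sketch; iter_articles() and the column list are placeholders):

```python
import csv
import gzip

CHUNK_BYTES = 1_000_000_000  # target ~1 GB per part (uncompressed estimate)
FIELDS = ["id", "url", "title", "published", "feed_id"]  # assumed columns

def export_chunks(iter_articles, prefix="digg-reader-part"):
    """Stream articles into gzipped CSV files, rolling to a new file near 1 GB."""
    part, written = 0, CHUNK_BYTES  # force opening the first file
    out = writer = None
    for article in iter_articles():
        if written >= CHUNK_BYTES:
            if out:
                out.close()
            part += 1
            out = gzip.open(f"{prefix}-{part:04d}.csv.gz", "wt", newline="")
            writer = csv.DictWriter(out, fieldnames=FIELDS)
            writer.writeheader()
            written = 0
        row = {k: article.get(k, "") for k in FIELDS}
        writer.writerow(row)
        written += sum(len(str(v)) for v in row.values())  # rough size estimate
    if out:
        out.close()
```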
It is indeed very interesting, but I'm not sure whether the AWS S3 bucket would be open to non-users of AWS. I'd love to download the whole thing any way possible and put it into the Elastic Stack for an API and into Hadoop for Spark :)
Hi Michael, we have interactive Spark notebooks on top of a Hadoop cluster with 180TB of HDFS, used by @CNRS and @ISCPIF researchers/scientists. We would love to host this data and make it available to the scientific community through REST APIs and batch processing.
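For example, a batch job over newline-delimited JSON dumps on HDFS could be as simple as this (a sketch; the paths and field names are assumptions):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("digg-reader-archive").getOrCreate()

# assumed layout: gzipped newline-delimited JSON dumped onto HDFS
articles = spark.read.json("hdfs:///data/digg-reader/articles-*.json.gz")

# example batch job: count articles per feed per day (field names are assumptions)
daily_counts = (
    articles
    .withColumn("day", F.to_date("published"))
    .groupBy("feed_id", "day")
    .count()
)
daily_counts.write.mode("overwrite").parquet("hdfs:///data/digg-reader/daily_counts")
```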
Oh yes, please add me to the list. What's the license on the content? Can we preprocess/open-source some of it under github.com/RaRe-Technolog…? CC @gensim_py
Hello from @parsely! We'd be interested in getting a look at this dataset for research and product-development purposes. Are you considering making it a publicly available dataset like Common Crawl? Amazon or Google may be willing to host it for free.
I'm confused - is this 12 billion URLs plus the associated HTML/text/JSON? Why not just publish the URLs and let people scrape them from the web or the Internet Archive? IMO the real value here is the user perspective (clicks, saves, tags, etc.) on how this content relates to each other.