Inspired by the SpiegelMining lecture by David Kriesel, I decided to try some mining of my own using this New York Times dataset from Kaggle. In this example we will concentrate on the articles dataset: nyt-articles-2020.csv. It contains more than 16K articles with 11 features we can toy around with:

  • newsdesk
  • section
  • subsection
  • material
  • headline
  • abstract
  • keywords
  • word_count
  • pub_date
  • n_comments
  • uniqueID

So here are a couple of questions we can ask ourselves:

  • Which article has the largest/smallest word count?
  • Which article has the most/fewest comments?
  • What is the average word count per article for each newsdesk?
  • How many articles are published on average per day?
  • How many articles are published on average per weekday?
  • On which weekday are the most articles published on average?
  • ..
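
To give a feel for what such a query can look like, here is a minimal sketch for the "average word count per newsdesk" question. It is not the query from the repo: the pipeline is written as plain Python dicts in MongoDB aggregation syntax (the kind of list you could hand to pymongo's `collection.aggregate()`), and the helper `avg_word_count_by_newsdesk` is a hypothetical pure-Python equivalent so the logic can be tried without a running database. The sample documents and the newsdesk names in them are made up for illustration.

```python
from collections import defaultdict

# Aggregation pipeline in MongoDB syntax, written as plain Python dicts.
# Field names (newsdesk, word_count) come from the feature list above;
# the output field name avg_word_count is an assumption of this sketch.
AVG_WORD_COUNT_PIPELINE = [
    {"$group": {"_id": "$newsdesk", "avg_word_count": {"$avg": "$word_count"}}},
    {"$sort": {"avg_word_count": -1}},
]


def avg_word_count_by_newsdesk(docs):
    """Pure-Python equivalent of the pipeline above, for in-memory documents."""
    # newsdesk -> [sum of word counts, number of articles]
    totals = defaultdict(lambda: [0, 0])
    for doc in docs:
        entry = totals[doc["newsdesk"]]
        entry[0] += doc["word_count"]
        entry[1] += 1
    averages = [
        {"_id": desk, "avg_word_count": total / count}
        for desk, (total, count) in totals.items()
    ]
    # Highest average first, mirroring the $sort stage.
    return sorted(averages, key=lambda row: row["avg_word_count"], reverse=True)


# Made-up sample documents, not taken from the dataset.
sample = [
    {"newsdesk": "Foreign", "word_count": 1200},
    {"newsdesk": "Foreign", "word_count": 800},
    {"newsdesk": "Sports", "word_count": 600},
]
print(avg_word_count_by_newsdesk(sample))
```

The other questions follow the same group-then-aggregate pattern, for example grouping by day instead of newsdesk and counting documents instead of averaging a field.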

So let’s get started by running the MongoDB Docker container, setting up a database and importing the data. For this part, please refer to the documentation inside this repo.

The repo also contains the file with all the queries to answer the above questions.

Additionally, you can watch this YouTube video, which explains the whole process step by step, including data visualization with MongoDB Atlas and Charts.