Important
● Before you start: Please make sure you have a valid NYTimes account and have created an API key. Watch the video on how to create a new API key, which you can find on Canvas in Files/Lecture Videos; the file is titled week14_newyorktimes_api.mp4
● The procedure is straightforward: go to developer.nytimes.com and create an account, then create an App and generate an API Key.
● Your final output should consist of two Python files:
○ code to get the data from the NYTimes API, clean it, retrieve the relevant fields, and store it in files
○ code to analyze the data and present the results
Checklist for your code
Required
● Your code executes successfully
● Your code should have detailed comments
Important: You will lose points if you do not complete all the tasks. However, if your code does not run correctly, you will not receive partial credit. So it is safer to implement the project one goal at a time, making sure the code still runs after each step.
Part 1: Data collection and cleaning
Prerequisites:
● You will be using the Archive endpoint of the NYTimes API for the project. Familiarize yourself with the endpoint (it is also discussed in detail in the lecture video).
● You can go to https://developer.nytimes.com/docs/archive-product/1/overview and read the documentation.
● The API endpoint just needs the year, the month, and the API key. So if you want to get data for January 2019, you request the URL https://api.nytimes.com/svc/archive/v1/2019/1.json?api-key=<yourkey>
● Parse the JSON and, for each document in response -> docs, get the field under headline -> main, which is the title of the article.
● Remember: the JSON you receive from the URL is minified and hard to read. To make it viewable, either install a JSON-formatter extension for your browser (Chrome or Safari) or use Firefox, where JSON formatting is available by default.
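For example, a quick way to inspect the structure from Python is to fetch one month and pretty-print it (a minimal sketch; it assumes the requests library is installed, and YOUR_KEY is a placeholder for your own API key):

import json
import requests

# Archive API URL for January 2019 (YOUR_KEY is a placeholder for your own key)
url = "https://api.nytimes.com/svc/archive/v1/2019/1.json?api-key=YOUR_KEY"
data = requests.get(url).json()

# Pretty-print with indentation so the response -> docs -> headline -> main path is easy to see
print(json.dumps(data, indent=2)[:2000])  # only the first part, to keep the output short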
Output:
● Create a file named data_collection.py in which you get all the posts from two different points in time: October 1918 (Spanish Flu pandemic) and October 2020 (COVID pandemic). Extract the titles and store them in two different files titled “titles_1918.txt” and “titles_2020.txt”. Clean the titles and store one title per line, i.e. each file should have exactly the same number of lines as the number of results returned (accessible via response -> meta -> hits).
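As one possible starting point, data_collection.py could be structured like the sketch below. Treat it as a sketch rather than the required implementation: it assumes the requests library, uses YOUR_KEY as a placeholder for your API key, and its cleaning step simply collapses embedded newlines and extra whitespace, which you may need to extend.

import requests

API_KEY = "YOUR_KEY"  # placeholder: substitute your own NYTimes API key

def fetch_titles(year, month):
    """Request one month from the Archive API and return a list of cleaned titles."""
    url = f"https://api.nytimes.com/svc/archive/v1/{year}/{month}.json?api-key={API_KEY}"
    data = requests.get(url).json()
    titles = []
    for doc in data["response"]["docs"]:
        title = (doc.get("headline") or {}).get("main") or ""
        # Clean the title: collapse newlines/extra spaces so it fits on a single line
        titles.append(" ".join(title.split()))
    # Sanity check from the assignment: one line per result reported by meta -> hits
    assert len(titles) == data["response"]["meta"]["hits"]
    return titles

def save_titles(titles, filename):
    """Write one title per line to the given file."""
    with open(filename, "w", encoding="utf-8") as f:
        for title in titles:
            f.write(title + "\n")

if __name__ == "__main__":
    save_titles(fetch_titles(1918, 10), "titles_1918.txt")
    save_titles(fetch_titles(2020, 10), "titles_2020.txt")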
Part 2: Analysis
Next, create another file titled analysis.py in which you read the two files titles_1918.txt and titles_2020.txt and perform the following analysis:
1. Count the 10 words that appear most frequently in each of these files and print them in decreasing order of frequency for each year (most frequent first, as in the sample output below). Make sure to remove commonly occurring English words using the code from the remove_stopwords function provided in the course materials; you should call this function on each article title. A sketch covering these analysis steps appears after this list.
2. For each of the two years, what fraction of article titles did each of the words “flu”, “virus” and “death” appear in?
3. Count occurrences of dollar amounts in the headlines and produce a total of the dollar amounts mentioned each year.
○ For instance, consider three headlines:
1. “$503,200 was spent on fixing roads”
2. “Twitter subscriptions now cost $8”
3. “Covid-19 deaths could cost the economy a trillion dollars”.
○ The “dollar amounts” to be considered in these headlines are the ones that have a ‘$’ symbol followed by a number. Only the first two headlines qualify, so we need to output 503200 + 8 = 503208; the third headline mentions dollars but has no ‘$’ followed by a number, so it is not counted.
○ Hint: Use a regular expression to identify all the dollar amounts. You could use something like “\$[0-9,]+” to identify all occurrences; note that the ‘$’ must be escaped, because an unescaped ‘$’ acts as an end-of-string anchor in regular expressions.
4. Identify the average sentiment of these article titles (check out the section titled “Sentiment analysis in python” on Canvas; use the function provided in that section, and watch the accompanying video). Even though the video mentions tweets, the function works for article titles too. Important: make sure you install vaderSentiment on replit!
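To make the pieces concrete, here is a rough sketch of how analysis.py could be organized (it also illustrates the use of functions mentioned in the bonus below). Treat it as an illustration only: the remove_stopwords function and STOPWORDS set here are simplified stand-ins for the version provided on Canvas, the whole-word matching and rounding choices are assumptions, and the sentiment part assumes the vaderSentiment package is installed.

import re
from collections import Counter
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Simplified stand-in: use the remove_stopwords function provided on Canvas instead
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "on", "for", "is"}

def remove_stopwords(title):
    """Return the title's words with common English words removed (stand-in version)."""
    return [w for w in title.lower().split() if w not in STOPWORDS]

def top_words(titles, n=10):
    """Count word frequencies across all titles and return the n most common."""
    counts = Counter()
    for title in titles:
        counts.update(remove_stopwords(title))
    return counts.most_common(n)

def fraction_containing(titles, word):
    """Fraction of titles in which the given word appears as a whole word."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return sum(1 for t in titles if pattern.search(t)) / len(titles)

def total_dollars(titles):
    """Sum every dollar amount of the form $<number> found in the titles."""
    total = 0
    for title in titles:
        for match in re.findall(r"\$[0-9][0-9,]*", title):
            total += int(match[1:].replace(",", ""))  # drop the '$' and the commas
    return total

def average_sentiment(titles):
    """Average VADER compound sentiment score over all titles."""
    analyzer = SentimentIntensityAnalyzer()
    return sum(analyzer.polarity_scores(t)["compound"] for t in titles) / len(titles)

def read_titles(filename):
    """Read one title per line, skipping blank lines."""
    with open(filename, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

if __name__ == "__main__":
    for year in ("1918", "2020"):
        titles = read_titles(f"titles_{year}.txt")
        print(f"Most frequent words {year}:", top_words(titles))
        for word in ("flu", "virus", "death"):
            print(word, round(fraction_containing(titles, word), 3))
        print(f"Dollar amounts {year}: ${total_dollars(titles):,}")
        print(f"Average sentiment {year}: {average_sentiment(titles):.3f}")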
Sugar: Bonus points for making use of functions for the various components, such as counting word frequency and searching for keywords.
Expected outcome: below is an example of how you can format your output. It need not look exactly like this, but the output of each analysis should be clearly indicated. Also note that the exact numbers you get will differ depending on the data and your analysis; the values shown are only for illustration.
************************
Most frequent words 1918
************************
abc, 204
ssi, 134
kasgas, 122
…
************************
Most frequent words 2020
************************
abc, 204
dead, 124
gdvjs, 102
…
************************
Fraction of articles in 1918
************************
flu 0.085
virus 0.123
death 0.03
************************
Fraction of articles in 2020
************************
flu 0.083
virus 0.14
death 0.001
************************
Dollar amounts
************************
1918 $302,402,425
2020 $204,325,381
************************
Sentiment 1918
************************
The average sentiment of the articles is 0.421
************************
Sentiment 2020
************************
The average sentiment of the articles is 0.421