May 6, 2020

How I read the News without ads (WordPress based News Sites)

By Remejy (Python)

How often do you follow a link to a news article that looks really interesting, only to find the page so packed with ads and distractions that you have a difficult time working out where the article text is? Some of these ads can take over your screen or block your view.

This happens to me all the time. I tend to avoid news sources that carry far too many ads, especially when the ads pop up and take over the screen. However, I still want to read the news. So I came up with a very simple solution: a quick Python script that grabs the text I want to read.

The following script should work for just about any WordPress site running version 4.7 or later (the release that added the REST API content endpoints to WordPress core), as long as the site has not disabled the API. The API provides JSON-based endpoints for fetching posts and pages from the site, and it supports search parameters and pagination. Read more in the WordPress REST API Handbook.

Disclaimer

This script should not be used in any way that harms anyone’s website. I recommend testing it against your own personal site to understand how it works. Better yet, instead of using the script to grab live data over and over, start by copying the result of one API call and pasting it into a temporary string for testing. I make no guarantees that this script works or that it will not cause harm to anyone’s website or your own computer. You assume all risk when using anyone else’s scripts, so be sure to learn how they work before you use them.
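As a rough illustration of that "paste one response into a string" idea, here is a minimal sketch. The SAMPLE_RESPONSE value is made up for this example and heavily trimmed; a real response carries many more fields per post.

```python
import json

# Hypothetical, trimmed-down stand-in for the result of one API call.
# Paste a real response here once and reuse it while experimenting.
SAMPLE_RESPONSE = '''
[
  {
    "id": 101,
    "date": "2020-05-01T09:00:00",
    "modified": "2020-05-02T10:30:00",
    "link": "https://www.news.site.com/sample-post/",
    "title": {"rendered": "Sample headline"},
    "content": {"rendered": "<p>Sample body text.</p>"}
  }
]
'''

# Parse the saved string instead of hitting the live site again.
posts = json.loads(SAMPLE_RESPONSE)
for post in posts:
    print(post["id"], post["title"]["rendered"])
```

Working from a saved string like this means you can rerun your parsing code as often as you like without sending a single request to the site.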

Setup Python

First, I created a folder to store the script and set up a Python virtual environment to contain all the code and libraries I may need. See my quick How to setup Python Virtual Environments. Truthfully, because this script is so simple and uses no third-party libraries, you can run it from any folder you want, without a virtual environment.

Next, create a new file called news.py (or whatever name you like) and paste in the following code:

import json
import re
from urllib.request import Request, urlopen
from urllib.error import URLError

siteBaseUrl = 'https://www.news.site.com/'

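def stripTags(text):
    # Remove whole <script>...</script> blocks, including their contents.
    scripts = re.compile(r'<script[. \S\s]*?/script>')
    # re.DOTALL lets '.' match newlines, so multi-line blocks and tags are caught.
    css = re.compile(r'<style.*?/style>', re.DOTALL)
    tags = re.compile(r'<.*?>', re.DOTALL)

    text = scripts.sub(' ', text)
    text = css.sub(' ', text)
    text = tags.sub(' ', text)

    return text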

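req = Request(siteBaseUrl + 'wp-json/wp/v2/posts')
try:
    response = urlopen(req)
except URLError as e:
    # HTTPError is a URLError that carries both 'code' and 'reason',
    # so test for 'code' first or that branch can never run.
    if hasattr(e, 'code'):
        print('The server couldn\'t fulfill the request.')
        print('Error code: ', e.code)
    elif hasattr(e, 'reason'):
        print('We failed to reach a server.')
        print('Reason: ', e.reason)
    # Stop here: without a response there is nothing to parse below.
    raise SystemExit(1)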

data = json.loads(response.read())
for item in data:
    print()
    print('*********************************')
    print(item['id'], item['date'], item['modified'], item['title']['rendered'])
    print(item['link'])
    print(stripTags(item['content']['rendered']))
    print()
    print()

Note: this is just a quick and dirty script that could use tons of improvements and better libraries. However, it does what I need in the interim: it reads the 10 most recent articles from the news site (10 is the API's default page size).

Imports

The imports all come from the standard library, so the script should run on a standard Python 3 install.

import json
import re
from urllib.request import Request, urlopen
from urllib.error import URLError

The json library converts the response data from the website into a Python object. The re library processes the rendered content, removing script, style, and any other HTML tags with regular expression substitutions. The urllib library fetches the website data and provides meaningful errors if the data could not be downloaded.

URL structure

I set a base URL string for the website at the top for ease of changing.

siteBaseUrl = 'https://www.news.site.com/'

You could easily modify the script to use an array of sites, loop through the array, and store the data from each one.
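A minimal sketch of that multi-site idea might look like the following. The SITES list, posts_endpoint, and collect_posts are names invented for this example, and the second base URL is just another placeholder. The fetch argument stands in for whatever download function you use, which keeps the loop easy to try out without touching the network.

```python
from urllib.parse import urljoin

# Placeholder base URLs; swap in real WordPress sites.
SITES = [
    "https://www.news.site.com/",
    "https://blog.example.org",
]

def posts_endpoint(base_url):
    """Return the posts endpoint for a site, with or without a trailing slash."""
    return urljoin(base_url.rstrip("/") + "/", "wp-json/wp/v2/posts")

def collect_posts(sites, fetch):
    """Call fetch(url) once per site and store each result keyed by base URL."""
    results = {}
    for base_url in sites:
        results[base_url] = fetch(posts_endpoint(base_url))
    return results

# Dry run: "fetch" simply echoes the URL it was asked for.
for site, url in collect_posts(SITES, lambda url: url).items():
    print(site, "->", url)
```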

A WordPress API call to get posts takes the base URL of the website and appends the path '/wp-json/wp/v2/posts', so you end up with something like 'https://news.site.com/wp-json/wp/v2/posts'. To try it out, swap in the base URL of your favorite WordPress site and see what the response looks like in your web browser. This is the JSON data we will process to pull out the useful parts of a post and convert them to plain text.
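The search and pagination options mentioned at the start are passed as query parameters on that same path. Here is a small sketch, assuming the standard per_page, page, and search parameters of the posts endpoint; build_posts_url is a helper name made up for this example.

```python
from urllib.parse import urlencode

def build_posts_url(base_url, per_page=10, page=1, search=None):
    """Build a posts URL with optional paging and search parameters."""
    params = {"per_page": per_page, "page": page}
    if search:
        params["search"] = search
    return base_url.rstrip("/") + "/wp-json/wp/v2/posts?" + urlencode(params)

print(build_posts_url("https://www.news.site.com/", per_page=5, search="local election"))
# prints https://www.news.site.com/wp-json/wp/v2/posts?per_page=5&page=1&search=local+election
```

Using urlencode rather than gluing strings together by hand takes care of escaping spaces and other special characters in the search term.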

Strip HTML tags

I set up a function that uses regular expressions to hunt for and replace any HTML tags. I found this simple function on Stack Overflow, and it seems pretty effective. I am not very good with regular expressions, but as far as I understand it, the [. \S\s] character class in the script pattern matches any character at all, including newlines (\S covers non-whitespace and \s covers whitespace), and the lazy *? stops at the first closing /script> it finds.

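def stripTags(text):
    # Remove whole <script>...</script> blocks, including their contents.
    scripts = re.compile(r'<script[. \S\s]*?/script>')
    # re.DOTALL lets '.' match newlines, so multi-line blocks and tags are caught.
    css = re.compile(r'<style.*?/style>', re.DOTALL)
    tags = re.compile(r'<.*?>', re.DOTALL)

    text = scripts.sub(' ', text)
    text = css.sub(' ', text)
    text = tags.sub(' ', text)

    return text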

I was able to remove most HTML tags with the simple regular expression '<.*?>', but this left behind the script or CSS code that sits between the tags. So I needed patterns that remove everything from the opening tag to the closing tag, including all the data in the middle, but only for script and style tags (we don't want to remove the text we actually came to read, which sits between <p> tags). That is why the function removes the script and style blocks first, then goes back and removes all remaining tags.
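To see that ordering in action on a small input, here is a self-contained variant. It is only a sketch: strip_tags_demo and the sample string are invented for this example, and it goes slightly beyond the script above by also decoding HTML entities with html.unescape and collapsing leftover whitespace.

```python
import html
import re

# DOTALL lets '.' span newlines; IGNORECASE also catches <SCRIPT> and <STYLE>.
SCRIPT_RE = re.compile(r"<script.*?</script>", re.DOTALL | re.IGNORECASE)
STYLE_RE = re.compile(r"<style.*?</style>", re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")

def strip_tags_demo(text):
    """Drop script/style blocks first, then any remaining tags."""
    text = SCRIPT_RE.sub(" ", text)
    text = STYLE_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)        # turn &amp; and friends back into characters
    return " ".join(text.split())     # collapse runs of whitespace

sample = (
    "<style>\np { color: red; }\n</style>"
    "<p>Hello &amp; welcome</p>"
    "<script>\nalert(1);\n</script>"
)
print(strip_tags_demo(sample))
# prints Hello & welcome
```

If the generic tag pattern ran first, only the <style> and <script> tags themselves would disappear, and the CSS and JavaScript between them would end up in the output.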

Try/Except to get data

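I adapted a helpful try/except block from the Python documentation.

req = Request(siteBaseUrl + 'wp-json/wp/v2/posts')
try:
    response = urlopen(req)
except URLError as e:
    # HTTPError is a URLError that carries both 'code' and 'reason',
    # so test for 'code' first or that branch can never run.
    if hasattr(e, 'code'):
        print('The server couldn\'t fulfill the request.')
        print('Error code: ', e.code)
    elif hasattr(e, 'reason'):
        print('We failed to reach a server.')
        print('Reason: ', e.reason)
    # Stop here: without a response there is nothing to parse below.
    raise SystemExit(1)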

If the request to the server fails, the error messages at least give you some guidance as to what may be wrong. When I first wrote the full script I had a few errors in my URL patterns, and by reading the printed errors I quickly figured out that I had typed the website's base URL incorrectly. So it is good to have this try/except in the script to help you out.
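One way to package the same idea as reusable functions is sketched below. The names describe_error and fetch_posts are made up for this example, and the 10 second timeout is an arbitrary choice. It relies on the fact that HTTPError is a subclass of URLError, so a single except clause covers both cases.

```python
import json
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

def describe_error(error):
    """Turn a urllib error into a short, readable message."""
    # HTTPError is a subclass of URLError, so it has to be checked first.
    if isinstance(error, HTTPError):
        return f"The server couldn't fulfill the request. Error code: {error.code}"
    return f"We failed to reach a server. Reason: {error.reason}"

def fetch_posts(url, timeout=10):
    """Return the decoded JSON from url, or None if the request fails."""
    try:
        with urlopen(Request(url), timeout=timeout) as response:
            return json.loads(response.read())
    except URLError as error:
        print(describe_error(error))
        return None
```

Returning None on failure lets the calling code decide what to do next, for example skipping that site and moving on to the next one.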

Loop through JSON data response

Once we get the response from the website, we use the json.loads function (loads = load string, which converts the JSON string data into a Python object) to make the data easier to loop through and process.

data = json.loads(response.read())
for item in data:
    print()
    print('*********************************')
    print(item['id'], item['date'], item['modified'], item['title']['rendered'])
    print(item['link'])
    print(stripTags(item['content']['rendered']))
    print()
    print()

We loop through the posts (which are now Python objects) and grab each post's id number, created date, updated date, title, website link, and content (stripped of HTML tags thanks to our stripTags function above). I am just printing the results in the terminal, but you could easily save them to a database. Later you could request a specific post again to see whether its updated date has changed, which would likely mean the post content had changed. The call would look like 'https://news.site.com/wp-json/wp/v2/posts/<id>', where you substitute the post id number for '<id>', and you would then compare the dates in the response data (after using json.loads to convert it to an object, of course).
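The "has this post changed?" check could be sketched as follows. The helper names and the sample values are invented for this example, and it assumes the modified field arrives as an ISO 8601 string such as 2020-05-02T10:30:00.

```python
from datetime import datetime

def single_post_url(base_url, post_id):
    """Build the URL for one post, given its id number."""
    return f"{base_url.rstrip('/')}/wp-json/wp/v2/posts/{post_id}"

def has_changed(stored_modified, fetched_post):
    """True when the fetched post reports a later 'modified' time than the stored one."""
    stored = datetime.fromisoformat(stored_modified)
    current = datetime.fromisoformat(fetched_post["modified"])
    return current > stored

# Hypothetical values: a date saved earlier and a freshly fetched post.
saved = "2020-05-02T10:30:00"
fresh = {"id": 101, "modified": "2020-05-03T08:15:00"}

print(single_post_url("https://www.news.site.com/", 101))
# prints https://www.news.site.com/wp-json/wp/v2/posts/101
print(has_changed(saved, fresh))
# prints True
```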

Save the file and run

All that remains is to save the file and run it. Open a terminal or command prompt window in the directory where you saved news.py, then type python news.py on Windows, or python3 news.py on macOS or Linux.

You should see 10 news articles separated with a line of asterisks ( ***************).

Troubleshooting

For the most part, I believe this script should work without issue. If you get errors from the try/except block, first check that the site really is a WordPress site and that you changed the base URL correctly to match the site you want to view. You can test that JSON data is actually returned by putting the full API URL in your browser, for example https://news.site.com/wp-json/wp/v2/posts
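The same sanity check can be done in code. This is only a sketch, and looks_like_posts is a name made up for this example. It treats the response as usable when it decodes to a JSON list whose items have title and content keys, which rules out an HTML page or a JSON error object.

```python
import json

def looks_like_posts(raw):
    """Return True if raw (bytes or str) decodes to a JSON list of post-like objects."""
    try:
        data = json.loads(raw)
    except ValueError:
        # Not JSON at all, for example an HTML page from a non-WordPress site.
        return False
    return isinstance(data, list) and all(
        isinstance(item, dict) and "title" in item and "content" in item
        for item in data
    )

print(looks_like_posts('[{"title": {"rendered": "Hi"}, "content": {"rendered": ""}}]'))
# prints True
print(looks_like_posts("<html><body>Not an API</body></html>"))
# prints False
```

Note that an empty list also passes this check, since a site with no posts would legitimately return one.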

Otherwise, leave me a comment and I will see what ideas I can share with you to help.
