## Making an Atom Feed using Selenium

Making an Atom Feed using Selenium.

by

Christoph Lohmann <[email protected]>

## Intro

* The web is getting more complex.

* You have pure javascript, framework-auto-generated websites with
  nothing left to parse.

* The only way to get any content out of them is to execute the
  javascript.

* Sadly, we need a big part of a browser for that.
	* It is not enough to run just some javascript engine.
	* Everything is intertwingled.
	* Google wanted it that way.

## Basic Atom Feed Generation

{
       printf '<?xml version="1.0" encoding="utf-8"?>\n'
       printf '<feed xmlns="http://www.w3.org/2005/Atom">\n'
       printf '<updated>%s</updated>\n' "$(date "+%FT%T%z")"

	hurl "$uri" \
       | grep content | sed 's,rawcontent,content,g' \
       | while read -r line;
       do
               printf "<entry>"
               printf "<content><![CDATA[%s]]></content>" "${line}"
               printf "</entry>\n"
       done

       printf "</feed>\n"
} > somefeed.atom
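One detail to watch: Atom's <updated> element wants an RFC 3339 timestamp, and `date "+%FT%T%z"` prints the zone offset without a colon (e.g. `+0100`), which strictly is not valid RFC 3339. A simple alternative is to emit UTC directly:

```shell
# RFC 3339 / Atom-compatible timestamp in UTC, e.g. 2023-12-24T12:00:00Z
date -u "+%FT%TZ"
```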

## How it evolved.

* frameworks like python requests
* webkit
	* Small browsers evolved for scraping.
	* PhantomJS

--> They became outdated due to the speed at which web engines evolve.
	--> Feature bloat.
	--> Corporate need for new things, regardless of actual need.
	--> Sell more products.

* Intermediate steps of complex control protocols followed.
	* I will skip them so you stay sane.

## Current State: WebDriver

https://w3c.github.io/webdriver/

> WebDriver is a remote control interface that enables introspection and
> control of user agents. It provides a platform- and language-neutral
> wire protocol as a way for out-of-process programs to remotely instruct
> the behavior of web browsers.

Web browsers expose HTTP endpoints:

       POST /session/...
       DELETE /session/...
       GET /session/...

* Could be wrapped into C too.
* For fast prototyping we use selenium and python.
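On the wire this looks roughly like the following exchange (endpoint paths per the W3C spec; the session id here is made up):

```
POST   /session              {"capabilities":{"alwaysMatch":{"browserName":"chrome"}}}
  -> {"value":{"sessionId":"6fb4...","capabilities":{...}}}
POST   /session/6fb4.../url  {"url":"https://www.bitreich.org"}
GET    /session/6fb4.../title
  -> {"value":"..."}
DELETE /session/6fb4...
```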

## Selenium Environment

1. Get Selenium

       $ pip install selenium
       # Huge bloat is installed.

2. Get a Chromium WebDriver

Normally included in your chromium installation at:
       /usr/bin/chromedriver

Or:
Gentoo:         emerge www-apps/chromedriver-bin
Binary package: https://chromedriver.chromium.org/downloads

## Selenium Environment

Other Web Browsers:

* Edge
* Firefox
* Internet Explorer
* Safari

All have their quirks:

https://www.selenium.dev/documentation/webdriver/browsers/

## Basic Selenium Script

	#!/usr/bin/env python
	from selenium import webdriver
	from selenium.webdriver.common.by import By

	driver = webdriver.Chrome()
	driver.get("https://www.bitreich.org")
	driver.implicitly_wait(1.0)
	print(driver.find_element(By.XPATH, "//*[@class=\"proletariat\"]").text)

Output: gophers://bitreich.org

## Selenium IDE

https://www.selenium.dev/selenium-ide/

Browser Extension for Firefox and Chromium to record interactions with
websites.
	* Easily generate scripts from the recordings.

## Select Content in Websites

driver.find_element or driver.find_elements

       <p><elem id="text" class="info" meta="subcontext">
               text to grep
       </elem></p>

e = driver.find_element(By.ID, "text").text
e = driver.find_element(By.TAG_NAME, "elem").text
e = driver.find_element(By.CLASS_NAME, "info").text
e = driver.find_element(By.XPATH, "//p/elem").text

Others: By.NAME (forms), By.CSS_SELECTOR, By.LINK_TEXT,
       By.PARTIAL_LINK_TEXT

e.get_attribute("meta")
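To get a feel for what these XPaths match, you can replay the example snippet without any browser; Python's stdlib ElementTree understands a small XPath subset (this is plain ElementTree, not Selenium, purely for illustration):

```python
import xml.etree.ElementTree as ET

# The example snippet from above, parsed as plain XML.
doc = ET.fromstring(
    '<p><elem id="text" class="info" meta="subcontext">text to grep</elem></p>'
)

e = doc.find(".//*[@class='info']")  # roughly what By.CLASS_NAME selects
print(e.text.strip())                # text to grep
print(e.get("meta"))                 # subcontext, like e.get_attribute("meta")
```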

## Stuff we won't handle here.

Selenium can do:
       * input
               * fill out forms
                       * key presses emulation
                       * upload files
               * send forms
               * scroll web pages
               * do pen actions (tablet)
               * mouse emulation
                       * drag and drop elements
       * history / navigate around in the browser

## Stuff we won't handle here.

Selenium can do:
       * window manipulation
               * handle multiple tabs / windows
               * handle iframes
               * move windows around
               * take screenshots
       * print websites
       * handle popup alerts
       * set / get cookies
       * let you run inline javascript
       * do color animations
       * debug javascript for you using the bidirectional protocol
       * build huge action chains for time-perfect handling

## Complex Example

https://www.kvsachsen.de/
* The 'new modern' website of my doctors' association.
* All in a javascript framework.
* The news is hidden behind loading even more javascript.
* No RSS feed.

News:
1. Open https://www.kvsachsen.de/fuer-praxen/aktuelle-informationen/praxis-news
2. Parse the javascript-rendered content in the subframe.

## Complex Example

Get stuff ready:

       from selenium import webdriver
       from selenium.webdriver.chrome.options import Options as chromeoptions
       from selenium.webdriver.support.ui import WebDriverWait
       from selenium.webdriver.support import expected_conditions as EC
       from selenium.webdriver.common.by import By
       from datetime import datetime
       import pytz

	link = ("https://www.kvsachsen.de/fuer-praxen/"
		"aktuelle-informationen/praxis-news")

## Complex Example

Get ChromeDriver ready:

       options = chromeoptions()
       chromearguments = [
               "headless",
               "no-sandbox",
               "disable-extensions",
               "disable-dev-shm-usage",
               "start-maximized",
               "window-size=1900,1080",
               "disable-gpu"
       ]
       for carg in chromearguments:
               options.add_argument(carg)

       driver = webdriver.Chrome(options=options)

## Complex Example

Get the content:

       driver.get(link)

## Complex Example

Wait for the content to be ready and loaded with a timeout
of 60 seconds:

       isnews = WebDriverWait(driver=driver, timeout=60).until(
                       EC.presence_of_element_located((By.XPATH,
                               "//div[@data-last-letter]")
                       )
       )

EC ... Expected Condition
EC can be very many things:

       https://www.selenium.dev/selenium/docs/api/py/\
               webdriver_support/\
               selenium.webdriver.support.expected_conditions.html

Pro Tip: Do not wait for a static amount of time; use a suitable EC
instead. You will be safer and see fewer errors.

## Complex Example

Get the root news element we work from:

       newslist = driver.find_elements(By.XPATH,
               "//div[@data-filter-target=\"list\"]")[0]

Get some metadata for the atom feed:

       title = driver.find_elements(By.XPATH,
               "//meta[@property=\"og:title\"]")[0].\
               get_attribute("content")
       description = title

## Complex Example

Print the header of the atom feed to stdout:

       print("""<?xml version="1.0" encoding="utf-8"?>""")
       print("""<feed xmlns="http://www.w3.org/2005/Atom">""")
       print("\t<title><![CDATA[%s]]></title>" % (title))
       print("\t<subtitle><![CDATA[%s]]></subtitle>" % (description))
       print("\t<id>%s</id>" % (link))
       print("\t<link href=\"%s\" rel=\"self\" />" % (link))
       print("\t<link href=\"%s\" />" % (link))

Use the current date for updated:

       utcnow = datetime.now(pytz.utc)
       print("\t<updated>%s</updated>" % (utcnow.isoformat()))

## Complex Example

Get the entries:

       articles = newslist.find_elements(By.XPATH, "./div")

Prepare a base URI for appending relative links:

       baselink = "/".join(link.split("/", 3)[:-1])

Loop over all entries in reverse order:

       for article in articles[::-1]:
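As a side note, the `baselink` expression above keeps only the scheme and host; standalone it behaves like this:

```python
# The link from the example; split("/", 3) yields
# ['https:', '', 'www.kvsachsen.de', 'fuer-praxen/...'],
# and [:-1] drops the path part before re-joining.
link = "https://www.kvsachsen.de/fuer-praxen/aktuelle-informationen/praxis-news"
baselink = "/".join(link.split("/", 3)[:-1])
print(baselink)  # https://www.kvsachsen.de
```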

## Complex Example

Find the deep link to the article:

               link = article.find_elements(By.XPATH, "./a")[0]
               plink = link.get_attribute("href")

Normalize the link in case it is relative:

		if not plink.startswith("http"):
			plink = "%s/%s" % (baselink, plink.lstrip("/"))

Get the entry title, content and set an absolute author:

               ptitle = link.get_attribute("data-title")
               pcontent = article.text
               pauthor = "[email protected]"

## Complex Example

Parse the datetime for the article release:

               updateds = article.find_elements(By.XPATH, ".//time")[0].text
               try:
                       dtupdated = datetime.strptime(updateds, "%d.%m.%Y")
               except ValueError:
                       continue

Bring the datetime into python native format for further processing:

               dtupdated = dtupdated.replace(hour=12, minute=0,\
                               second=0, tzinfo=pytz.utc)
               if dtupdated.year > utcnow.year:
                       dtupdated = dtupdated.replace(year=utcnow.year)
               pupdated = dtupdated
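The date handling can be tried standalone; here with the stdlib's timezone.utc instead of pytz (both work, pytz is simply what the script imports):

```python
from datetime import datetime, timezone

# German-style date as it appears in the <time> element, e.g. "24.12.2023".
dt = datetime.strptime("24.12.2023", "%d.%m.%Y")
dt = dt.replace(hour=12, minute=0, second=0, tzinfo=timezone.utc)
print(dt.isoformat())  # 2023-12-24T12:00:00+00:00
```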

## Complex Example

Print the entry:

               print("\t<entry>")
               print("\t\t<id>%s</id>" % (plink))
               print("\t\t<title><![CDATA[%s]]></title>" % (ptitle))
               print("\t\t<link href=\"%s\" />" % (plink))
               print("\t\t<author><name>%s</name></author>" % (pauthor))
               print("\t\t<updated>%s</updated>" % (pupdated.isoformat()))
               print("\t\t<content><![CDATA[%s]]></content>" % (pcontent))
               print("\t</entry>")

Print the footer (out of feeds loop):

       print("</feed>")
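One caveat with the CDATA approach: if the scraped text itself contains the sequence `]]>`, the feed breaks. A tiny helper (hypothetical, not part of the script above) guards against that by splitting the sequence across two CDATA sections:

```python
def cdata(s):
    # "]]>" ends a CDATA section, so split it over two sections.
    return "<![CDATA[%s]]>" % s.replace("]]>", "]]]]><![CDATA[>")

print(cdata("harmless text"))  # <![CDATA[harmless text]]>
print(cdata("evil ]]> text"))  # <![CDATA[evil ]]]]><![CDATA[> text]]>
```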

## Example Script

The full example script and how I use it can be found in:

       git://bitreich.org/brcon2023-hackathons
       ./sfeed-atom/kvssachsen2atom

## Summary

* With Selenium you can script all of the modern web in every way.

* We fight bloat with bloat.
	* You run a full web browser process just to parse the web.
	* That is exactly what scraping was meant to avoid.

* You can easily prototype web access in, for example, ipython(1).

* There are still privacy concerns.
       * You run a huge blob of hundreds of thousands of sloc.
       * Plato's cave allegory

## Plato's cave allegory

+--------------;,,.;        ..\.|./,.
|                       .------(_)------
|#         #  (too bright!) - /,|.\
|#     o  =|       o/     ,  /. |. .(
|#    o|o =|o      |       / , .|, (_|
|     | | = |     ,..,            ___|_,
+---------+----."'    ''''''~~~~~~\____|~~~

* People are in a cave, watching the shadow figure of a hash,
  presented to them by a narrator behind the wall near the exit
  of the cave.
* When people want to leave the cave, they are blinded by
  the sun. The sunlight hurts their eyes. They go back into
  the cave.
* The outside is not as finely presented and prepared as the
  narrator's shadow play, which does not hurt the eyes.
* Only some people are able to adapt their eyes and see the
  beauty of not depending on a narrator. They will be able
  to leave the cave.

## Questions?

Do you have questions?

## Thanks

Thank you for listening.

For further suggestions, contact me at

       Christoph Lohmann <[email protected]>