## Making an Atom Feed using Selenium

Making an Atom Feed using Selenium.

by

Christoph Lohmann <[email protected]>

## Intro

* The web is getting more complex.

* You have pure javascript, framework-auto-generated websites with
  nothing left to parse.

* The only way to get any content out of them is to execute the
  javascript.

* Sadly, we need a big part of a browser for that.
	* It is not enough to run just some javascript engine.
	* Everything is intertwingled.
	* Google wanted it that way.

## Basic Atom Feed Generation

{
       printf '<?xml version="1.0" encoding="utf-8"?>\n'
       printf '<feed xmlns="http://www.w3.org/2005/Atom">\n'
       printf '<updated>%s</updated>\n' "$(date "+%FT%T%z")"

	hurl "$uri" \
       | grep content | sed 's,rawcontent,content,g' \
       | while read -r line;
       do
               printf "<entry>"
               printf "<content><![CDATA[%s]]></content>" "${line}"
               printf "</entry>\n"
       done

       printf "</feed>\n"
} > somefeed.atom
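One detail to watch: Atom's <updated> element wants an RFC 3339 timestamp, and `date "+%FT%T%z"` prints the zone offset without a colon (e.g. `+0100`), which strictly is not valid RFC 3339. A simple alternative is to emit UTC directly:

```shell
# RFC 3339 / Atom-compatible timestamp in UTC, e.g. 2023-12-24T12:00:00Z
date -u "+%FT%TZ"
```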

## How it evolved.

* frameworks like python requests
* webkit
	* Small browsers evolved for scraping.
	* PhantomJS

--> They became outdated due to the speed at which web engines evolve.
	--> Feature bloat.
	--> Corporate need for new things, regardless of actual need.
	--> Sell more products.

* Intermediate steps of complex control protocols followed.
	* I will skip them so you stay sane.

## Current State: WebDriver

https://w3c.github.io/webdriver/

> WebDriver is a remote control interface that enables introspection and
> control of user agents. It provides a platform- and language-neutral
> wire protocol as a way for out-of-process programs to remotely instruct
> the behavior of web browsers.

Web browsers expose HTTP endpoints:

       POST /session/...
       DELETE /session/...
       GET /session/...

* Could be wrapped into C too.
* For fast prototyping we use selenium and python.
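On the wire this looks roughly like the following exchange (endpoint paths per the W3C spec; the session id here is made up):

```
POST   /session              {"capabilities":{"alwaysMatch":{"browserName":"chrome"}}}
  -> {"value":{"sessionId":"6fb4...","capabilities":{...}}}
POST   /session/6fb4.../url  {"url":"https://www.bitreich.org"}
GET    /session/6fb4.../title
  -> {"value":"..."}
DELETE /session/6fb4...
```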

## Selenium Environment

1. Get Selenium

       $ pip install selenium
       # Huge bloat is installed.

2. Get a Chromium WebDriver

Normally included in your chromium installation at:
       /usr/bin/chromedriver

Or:
Gentoo:         emerge www-apps/chromedriver-bin
Binary package: https://chromedriver.chromium.org/downloads

## Selenium Environment

Other Web Browsers:

* Edge
* Firefox
* Internet Explorer
* Safari

All have their quirks:

https://www.selenium.dev/documentation/webdriver/browsers/

## Basic Selenium Script

	#!/usr/bin/env python
	from selenium import webdriver
	from selenium.webdriver.common.by import By

	driver = webdriver.Chrome()
	driver.get("https://www.bitreich.org")
	driver.implicitly_wait(1.0)
	print(driver.find_element(By.XPATH, "//*[@class=\"proletariat\"]").text)

Output: gophers://bitreich.org

## Selenium IDE

https://www.selenium.dev/selenium-ide/

Browser Extension for Firefox and Chromium to record interactions with
websites.
	* Easily generate scripts from the recordings.

## Select Content in Websites

driver.find_element or driver.find_elements

       <p><elem id="text" class="info" meta="subcontext">
               text to grep
       </elem></p>

e = driver.find_element(By.ID, "text").text
e = driver.find_element(By.TAG_NAME, "elem").text
e = driver.find_element(By.CLASS_NAME, "info").text
e = driver.find_element(By.XPATH, "//p/elem").text

Others: By.NAME (forms), By.CSS_SELECTOR, By.LINK_TEXT,
       By.PARTIAL_LINK_TEXT

e.get_attribute("meta")
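To get a feel for what these XPaths match, you can replay the example snippet without any browser; Python's stdlib ElementTree understands a small XPath subset (this is plain ElementTree, not Selenium, purely for illustration):

```python
import xml.etree.ElementTree as ET

# The example snippet from above, parsed as plain XML.
doc = ET.fromstring(
    '<p><elem id="text" class="info" meta="subcontext">text to grep</elem></p>'
)

e = doc.find(".//*[@class='info']")  # roughly what By.CLASS_NAME selects
print(e.text.strip())                # text to grep
print(e.get("meta"))                 # subcontext, like e.get_attribute("meta")
```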

## Stuff we won't handle here.

Selenium can do:
       * input
               * fill out forms
                       * key presses emulation
                       * upload files
               * send forms
               * scroll web pages
               * do pen actions (tablet)
               * mouse emulation
                       * drag and drop elements
       * history / navigate around in the browser

## Stuff we won't handle here.

Selenium can do:
       * window manipulation
               * handle multiple tabs / windows
               * handle iframes
               * move windows around
               * take screenshots
       * print websites
       * handle popup alerts
       * set / get cookies
       * let you run inline javascript
       * do color animations
       * debug javascript for you using the bidirectional protocol
       * build huge action chains for time-perfect handling

## Complex Example

https://www.kvsachsen.de/
* The 'new modern' website of my doctors' association.
* All in a javascript framework.
* The news is hidden behind loading even more javascript.
* No RSS feed.

News:
1. Open https://www.kvsachsen.de/fuer-praxen/aktuelle-informationen/praxis-news
2. Parse the javascript-rendered content in the subframe.

## Complex Example

Get stuff ready:

       from selenium import webdriver
       from selenium.webdriver.chrome.options import Options as chromeoptions
       from selenium.webdriver.support.ui import WebDriverWait
       from selenium.webdriver.support import expected_conditions as EC
       from selenium.webdriver.common.by import By
       from datetime import datetime
       import pytz

	link = ("https://www.kvsachsen.de/fuer-praxen/"
		"aktuelle-informationen/praxis-news")

## Complex Example

Get ChromeDriver ready:

       options = chromeoptions()
       chromearguments = [
               "headless",
               "no-sandbox",
               "disable-extensions",
               "disable-dev-shm-usage",
               "start-maximized",
               "window-size=1900,1080",
               "disable-gpu"
       ]
       for carg in chromearguments:
               options.add_argument(carg)

       driver = webdriver.Chrome(options=options)

## Complex Example

Get the content:

       driver.get(link)

## Complex Example

Wait for the content to be ready and loaded with a timeout
of 60 seconds:

       isnews = WebDriverWait(driver=driver, timeout=60).until(
                       EC.presence_of_element_located((By.XPATH,
                               "//div[@data-last-letter]")
                       )
       )

EC ... Expected Condition
EC can be very many things:

       https://www.selenium.dev/selenium/docs/api/py/\
               webdriver_support/\
               selenium.webdriver.support.expected_conditions.html

Pro Tip: Do not wait for a static amount of time; use a suitable EC
instead. You will be safer and see fewer errors.

## Complex Example

Get the root news element we work from:

       newslist = driver.find_elements(By.XPATH,
               "//div[@data-filter-target=\"list\"]")[0]

Get some metadata for the atom feed:

       title = driver.find_elements(By.XPATH,
               "//meta[@property=\"og:title\"]")[0].\
               get_attribute("content")
       description = title

## Complex Example

Print the header of the atom feed to stdout:

       print("""<?xml version="1.0" encoding="utf-8"?>""")
       print("""<feed xmlns="http://www.w3.org/2005/Atom">""")
       print("\t<title><![CDATA[%s]]></title>" % (title))
       print("\t<subtitle><![CDATA[%s]]></subtitle>" % (description))
       print("\t<id>%s</id>" % (link))
       print("\t<link href=\"%s\" rel=\"self\" />" % (link))
       print("\t<link href=\"%s\" />" % (link))

Use the current date for updated:

       utcnow = datetime.now(pytz.utc)
       print("\t<updated>%s</updated>" % (utcnow.isoformat()))

## Complex Example

Get the entries:

       articles = newslist.find_elements(By.XPATH, "./div")

Prepare a base URI for appending relative links:

       baselink = "/".join(link.split("/", 3)[:-1])

Loop over all entries in reverse order:

       for article in articles[::-1]:
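As a side note, the `baselink` expression above keeps only the scheme and host; standalone it behaves like this:

```python
# The link from the example; split("/", 3) yields
# ['https:', '', 'www.kvsachsen.de', 'fuer-praxen/...'],
# and [:-1] drops the path part before re-joining.
link = "https://www.kvsachsen.de/fuer-praxen/aktuelle-informationen/praxis-news"
baselink = "/".join(link.split("/", 3)[:-1])
print(baselink)  # https://www.kvsachsen.de
```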

## Complex Example

Find the deep link to the article:

               link = article.find_elements(By.XPATH, "./a")[0]
               plink = link.get_attribute("href")

Normalize the link in case it is relative:

		if not plink.startswith("http"):
			plink = "%s/%s" % (baselink, plink.lstrip("/"))

Get the entry title, content and set an absolute author:

               ptitle = link.get_attribute("data-title")
               pcontent = article.text
               pauthor = "[email protected]"

## Complex Example

Parse the datetime for the article release:

               updateds = article.find_elements(By.XPATH, ".//time")[0].text
               try:
                       dtupdated = datetime.strptime(updateds, "%d.%m.%Y")
               except ValueError:
                       continue

Bring the datetime into python native format for further processing:

               dtupdated = dtupdated.replace(hour=12, minute=0,\
                               second=0, tzinfo=pytz.utc)
               if dtupdated.year > utcnow.year:
                       dtupdated = dtupdated.replace(year=utcnow.year)
               pupdated = dtupdated
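The date handling can be tried standalone; here with the stdlib's timezone.utc instead of pytz (both work, pytz is simply what the script imports):

```python
from datetime import datetime, timezone

# German-style date as it appears in the <time> element, e.g. "24.12.2023".
dt = datetime.strptime("24.12.2023", "%d.%m.%Y")
dt = dt.replace(hour=12, minute=0, second=0, tzinfo=timezone.utc)
print(dt.isoformat())  # 2023-12-24T12:00:00+00:00
```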

## Complex Example

Print the entry:

               print("\t<entry>")
               print("\t\t<id>%s</id>" % (plink))
               print("\t\t<title><![CDATA[%s]]></title>" % (ptitle))
               print("\t\t<link href=\"%s\" />" % (plink))
               print("\t\t<author><name>%s</name></author>" % (pauthor))
               print("\t\t<updated>%s</updated>" % (pupdated.isoformat()))
               print("\t\t<content><![CDATA[%s]]></content>" % (pcontent))
               print("\t</entry>")

Print the footer (out of feeds loop):

       print("</feed>")
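One caveat with the CDATA approach: if the scraped text itself contains the sequence `]]>`, the feed breaks. A tiny helper (hypothetical, not part of the script above) guards against that by splitting the sequence across two CDATA sections:

```python
def cdata(s):
    # "]]>" ends a CDATA section, so split it over two sections.
    return "<![CDATA[%s]]>" % s.replace("]]>", "]]]]><![CDATA[>")

print(cdata("harmless text"))  # <![CDATA[harmless text]]>
print(cdata("evil ]]> text"))  # <![CDATA[evil ]]]]><![CDATA[> text]]>
```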

## Example Script

The full example script and how I use it can be found in:

       git://bitreich.org/brcon2023-hackathons
       ./sfeed-atom/kvssachsen2atom

## Summary

* With Selenium you can script all of the modern web in every way.

* We fight bloat with bloat.
	* You run a full web browser process just to parse the web.
	* That is exactly what scraping was meant to avoid.

* You can easily prototype web access in, for example, ipython(1).

* There are still privacy concerns.
       * You run a huge blob of hundreds of thousands of sloc.
       * Plato's cave allegory

## Plato's cave allegory

+--------------;,,.;        ..\.|./,.
|                       .------(_)------
|#         #  (too bright!) - /,|.\
|#     o  =|       o/     ,  /. |. .(
|#    o|o =|o      |       / , .|, (_|
|     | | = |     ,..,            ___|_,
+---------+----."'    ''''''~~~~~~\____|~~~

* People are in a cave, watching the shadow figure of a hash,
  presented to them by a narrator behind the wall near the exit
  of the cave.
* When people want to leave the cave, they are blinded by
  the sun. The sunlight hurts their eyes. They go back into
  the cave.
* The outside is not as finely presented and prepared as the
  narrator's shadow play, which does not hurt the eyes.
* Only some people are able to adapt their eyes and see the
  beauty of not depending on a narrator. They will be able
  to leave the cave.

## Questions?

Do you have questions?

## Thanks

Thank you for listening.

For further suggestions, contact me at

       Christoph Lohmann <[email protected]>