As the saying goes, I spent hours automating a task that takes five minutes. In my defense, I had to repeat the check often while looking for two spots, and the automation itself took only an hour or two, so this post is short.
Here is how I automated a lookup for two vacancies at Innova Schools using Airflow (yeah, really).
Goal
My goal was simple: get notified whenever there were open vacancies for my two siblings (4th and 5th grade) in a specific location.
I believe there are two ways to scrape data from the internet: the right way, and using Selenium.
So I went with the right way. I am sure there is a way to make a browser driver run in an Airflow instance, but it felt unnecessary for this case.
Request inspection
I used good old requests + BeautifulSoup. First, I inspected the requests made by the main page. It turned out to be much simpler than I thought: calling the location URL with query params already returned the information I needed:
https://admision.innovaschools.edu.pe/Vacant/SearchVacant?VacancyNumber=2&City=153&District=1480
The URL can be split into three parts:
- Base: https://admision.innovaschools.edu.pe/
- Resource: Vacant/SearchVacant
- Params: VacancyNumber=2, City=153, District=1480
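To make the split concrete, here is a minimal sketch that rebuilds the same URL from its three parts using only the standard library. The helper name `build_search_url` is hypothetical; the base, resource, and parameter values are the ones listed above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE = "https://admision.innovaschools.edu.pe/"
RESOURCE = "Vacant/SearchVacant"
PARAMS = {"VacancyNumber": 2, "City": 153, "District": 1480}


def build_search_url(base: str, resource: str, params: dict) -> str:
    """Join base, resource, and query params into one URL (hypothetical helper)."""
    return f"{base.rstrip('/')}/{resource}?{urlencode(params)}"


url = build_search_url(BASE, RESOURCE, PARAMS)
print(url)

# Splitting the URL again recovers the same three parts.
parts = urlsplit(url)
print(parts.netloc, parts.path, parse_qs(parts.query))
```

With requests, passing the dictionary as `params=` to `requests.get` should produce the same query string, which makes it easy to change the city or district later.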
Calling that URL returned everything I needed. The response headers showed the site runs on ASP.NET, so the page is rendered server-side and the availability data is already in the HTML (no extra API call needed).
From there, it was just a matter of parsing the HTML. One of the script tags included the data used to populate the dropdown where available slots appeared. After that, an if statement plus a notification was enough. At the time I used ntfy, a self-hostable service for pushing notifications from anywhere.
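The parse, check, and notify steps can be sketched offline as follows. This is an illustration under assumptions, not the exact DAG code: the sample HTML, the variable name `model`, and the vacancy numbers are made up (only the key names come from the DAG in this post), the standard-library `HTMLParser` stands in for BeautifulSoup so the sketch runs without extra installs, and the public `https://ntfy.sh` server stands in for a self-hosted one. The request is only built, never sent.

```python
from html.parser import HTMLParser
from urllib.request import Request
import json

# Made-up sample page: the variable name and the numbers are assumptions.
SAMPLE_HTML = """
<html><body>
<script>var other = 1;</script>
<script>
var model = {"Sedes": [{"HeadquarterName": "El Retablo", "VacantList": [
  {"NombreGrado": "3° Primaria", "NumeroDeVacantes": 1},
  {"NombreGrado": "4° Primaria", "NumeroDeVacantes": 2}]}]};
</script>
</body></html>
"""


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> tag."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts[-1] += data


def extract_vacancies(html: str, sede_name: str) -> dict:
    """Return {grade name: open vacancies} for the location matching sede_name."""
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    script = next(s for s in parser.scripts if sede_name in s)
    # Keep the one statement that mentions the location, then parse its JSON object.
    statement = next(s for s in script.split(";") if sede_name in s)
    data = json.loads(statement[statement.find("{"): statement.rfind("}") + 1])
    sede = next(s for s in data["Sedes"] if sede_name in s["HeadquarterName"])
    return {g["NombreGrado"]: g["NumeroDeVacantes"] for g in sede["VacantList"]}


def both_open(vacancies: dict, grades: tuple) -> bool:
    """True only when every requested grade has at least one vacancy."""
    return all(vacancies.get(g, 0) > 0 for g in grades)


def build_ntfy_request(server, topic, title, message, priority="default", tags=""):
    """Build (but do not send) a POST request in the shape ntfy expects."""
    return Request(
        url=f"{server.rstrip('/')}/{topic}",
        data=message.encode("utf-8"),
        headers={"Title": title, "Priority": priority, "Tags": tags},
        method="POST",
    )


GRADES = ("3° Primaria", "4° Primaria")
vacancies = extract_vacancies(SAMPLE_HTML, "El Retablo")
message = ", ".join(f"{grade}: {count}" for grade, count in vacancies.items())
if both_open(vacancies, GRADES):
    req = build_ntfy_request("https://ntfy.sh", "vacantes_innova",
                             "Vacancies open", message, "urgent", "tada")
    print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would actually send the notification. Keeping the parsing separate from the network calls makes it possible to test the logic against a saved copy of the page.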
In the end, I did not get both slots because openings happened at different times. But still, the setup worked, and it would have landed perfectly if the stars had aligned :)
Code
Here is the DAG I ended up with:
from airflow import DAG
from time import sleep
from airflow.operators.python import PythonOperator
from pendulum import datetime


def checking_innova():
    # Imports live inside the task so they are only loaded when the task runs.
    import requests
    from bs4 import BeautifulSoup
    import json
    from libs.send_mail import send_py_email  # local helper module (not shown here)

    url = 'https://admision.innovaschools.edu.pe/Vacant/SearchVacant?VacancyNumber=2&City=153&District=1480'

    def notify(topic, message, title, priority, tags):
        # Log the notification, then push it to the ntfy server.
        print(
            f"Topic: {topic}",
            f"Message: {message}",
            f"Title: {title}",
            f"Priority: {priority}",
            f"Tags: {tags}",
            sep="\n",
        )
        requests.post(
            url=f'https://ntfy.franzrg.uk/{topic}',
            headers={
                "Title": title,
                "Priority": priority,
                "Tags": tags
            },
            # Encode explicitly so non-ASCII characters survive the request.
            data=message.encode('utf-8')
        )

    r = requests.get(url, timeout=30)
    if r.status_code != 200:
        # "Error in GET request"
        notify(
            topic='vacantes_innova',
            title="Error en get request",
            priority="min",
            tags="warning,skull",
            message="Status code: " + str(r.status_code)
        )
        raise RuntimeError(f"The status code should be 200, got {r.status_code}")

    soup = BeautifulSoup(r.text, 'html.parser')
    try:
        # Find the single <script> tag that embeds the data for the El Retablo location.
        script_tags = [i for i in soup.find_all('script') if 'El Retablo' in i.text]
        assert len(script_tags) == 1, "There should be only one script tag with El Retablo"

        # Within that script, keep the one statement that mentions El Retablo.
        statements = [i for i in script_tags[0].text.split(';') if 'El Retablo' in i]
        assert len(statements) == 1, "There should be only one var statement with El Retablo"
        statement = statements[0]

        # The statement assigns a JSON object: parse everything between the outermost braces.
        data = json.loads(statement[statement.find('{'):statement.rfind('}') + 1])

        sede = [i for i in data['Sedes'] if 'El Retablo' in i['HeadquarterName']]
        assert len(sede) == 1, "There should be only one sede with El Retablo"
        sede = sede[0]

        grado = [i for i in sede['VacantList'] if i['NombreGrado'] in ('3° Primaria', '4° Primaria')]
        assert len(grado) == 2, "There should be 2 grados with 3° Primaria or 4° Primaria"
        print(sede['HeadquarterName'])
        print(grado)

        # Notify only when both grades have at least one open vacancy.
        if grado[0]['NumeroDeVacantes'] > 0 and grado[1]['NumeroDeVacantes'] > 0:
            print("Hay en los dos")  # "Both have openings"
            sleep(1)
            notify(
                topic='vacantes_innova',
                title="Hay vacantes para los dos en Innova!!",
                priority="urgent",
                tags="partying_face,tada",
                message=f'{grado[0]["NombreGrado"][0]} grado tiene {grado[0]["NumeroDeVacantes"]} vacantes y {grado[1]["NombreGrado"][0]} grado tiene {grado[1]["NumeroDeVacantes"]} vacantes'
            )
            sleep(1)
            send_py_email(
                subject='Innova tiene vacantes!!',
                to='franz-1241@hotmail.com',
                # The link is now a closed anchor with visible text.
                html_content=f'{grado[0]["NombreGrado"][0]} grado tiene {grado[0]["NumeroDeVacantes"]} vacantes y {grado[1]["NombreGrado"][0]} grado tiene {grado[1]["NumeroDeVacantes"]} vacantes. Visita <a href="{url}" target="_blank">este link</a>.'
            )
            sleep(1)
    except Exception as e:
        # "Error parsing vacancies"
        notify(
            topic='vacantes_innova',
            title="Error en parseo de vacantes",
            priority="urgent",
            tags="warning,skull",
            message="Este fue el error: " + str(e)
        )
        raise


default_args = {
    'owner': 'airflow',
    'email': ['xxxxxxx@xxxxxxx.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
}

with DAG(
    dag_id='innova_vacantes',
    description='Revisando las vacantes de Innova Schools',
    # Every 30 minutes between 09:00 and 20:30.
    schedule_interval='*/30 9-20 * * *',
    start_date=datetime(2023, 12, 13),
    catchup=False,
    default_args=default_args,
    # tags is a DAG argument, so it is set here rather than in default_args.
    tags=['personal'],
) as dag:
    check_innova = PythonOperator(
        task_id='check_innova',
        python_callable=checking_innova,
    )
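To see what the cron expression `*/30 9-20 * * *` in the DAG amounts to, the sketch below lists the matching times for one day. It is a hand-written check of the minute and hour fields only, not a general cron parser, and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def matches_schedule(ts: datetime) -> bool:
    """Hand-written equivalent of the cron fields '*/30 9-20 * * *'."""
    return ts.minute in (0, 30) and 9 <= ts.hour <= 20


def runs_for_day(day: datetime) -> list:
    """List every minute of `day` that the schedule would match."""
    minutes = (day + timedelta(minutes=m) for m in range(24 * 60))
    return [ts for ts in minutes if matches_schedule(ts)]


runs = runs_for_day(datetime(2023, 12, 13))
print(len(runs), runs[0].strftime("%H:%M"), runs[-1].strftime("%H:%M"))
# prints: 24 09:00 20:30
```

That is 24 checks per day. Since `pendulum.datetime` defaults to UTC, these are most likely UTC times unless the DAG or the Airflow instance is configured with another timezone.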