Innova Scraping

As the saying goes, I spent hours automating a task that takes five minutes. In my defense, I had to repeat that task often while hunting for two spots, and the automation itself only took an hour or two, so this post is short.

Here is how I automated a lookup for two vacancies at Innova Schools using Airflow (yeah, really).

Goal

My goal was simple: get notified whenever there were open vacancies for my two siblings (4th and 5th grade) in a specific location.

I believe there are two ways to scrape data from the internet: the right way, and using Selenium.

So I went with the right way. I am sure there is a way to make a browser driver run in an Airflow instance, but it felt unnecessary for this case.

Request inspection

I used good old requests + BeautifulSoup. First, I inspected the requests made by the main page. It turned out to be much simpler than I thought: calling the location URL with query params already returned the information I needed:

https://admision.innovaschools.edu.pe/Vacant/SearchVacant?VacancyNumber=2&City=153&District=1480

The URL can be split into three parts:

  • Base: https://admision.innovaschools.edu.pe/
  • Resource: Vacant/SearchVacant
  • Params: VacancyNumber=2, City=153, District=1480
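The same three parts can be put back together in code. This is a minimal sketch using only the standard library; the function name `build_search_url` is made up for illustration, and the numeric codes are simply copied from the URL above.

```python
from urllib.parse import urlencode, urljoin

BASE = "https://admision.innovaschools.edu.pe/"
RESOURCE = "Vacant/SearchVacant"


def build_search_url(vacancy_number: int, city: int, district: int) -> str:
    """Assemble the search URL from base, resource, and query params."""
    params = {
        "VacancyNumber": vacancy_number,
        "City": city,
        "District": district,
    }
    # urljoin glues base and resource; urlencode renders the query string.
    return urljoin(BASE, RESOURCE) + "?" + urlencode(params)


url = build_search_url(vacancy_number=2, city=153, district=1480)
print(url)
```

Keeping the params in a dict makes it easy to point the same check at another city or district later.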

The response headers showed the site runs on ASP.NET, so the page is rendered server-side and the availability data is already present in the HTML (no extra API call needed).
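For the curious, here is roughly how that header check could look in code. I did not note which header gave it away, so the markers below are assumptions: they are headers and a cookie name that ASP.NET sites commonly emit, and `looks_like_aspnet` is a hypothetical helper.

```python
def looks_like_aspnet(headers: dict) -> bool:
    """Return True if the response headers carry a common ASP.NET marker."""
    # Header names are case-insensitive, so normalize before comparing.
    lowered = {name.lower(): str(value) for name, value in headers.items()}
    if "asp.net" in lowered.get("x-powered-by", "").lower():
        return True
    if "x-aspnet-version" in lowered or "x-aspnetmvc-version" in lowered:
        return True
    # ASP.NET session cookies are named ASP.NET_SessionId by default.
    return "asp.net_sessionid" in lowered.get("set-cookie", "").lower()


print(looks_like_aspnet({"X-Powered-By": "ASP.NET"}))
print(looks_like_aspnet({"Server": "nginx"}))
```

With requests, the dict to pass in would be `r.headers`. A site can hide all of these markers, so a negative result proves nothing.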

From there, it was just a matter of parsing the HTML. One of the script tags included the data used to populate the dropdown where available slots appeared. After that, an if statement plus a notification was enough. At the time I used ntfy, a self-hostable service for pushing notifications from anywhere.
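The parsing step can be sketched on its own, without Airflow or the network. The sample page below is made up (the real markup and the JavaScript variable name may differ); only the JSON keys are taken from the DAG further down. To keep the sketch dependency-free it uses the standard library's `html.parser` instead of BeautifulSoup, and it searches the whole script rather than splitting it into statements.

```python
import json
from html.parser import HTMLParser

# Made-up sample page: the real markup and variable name may differ.
SAMPLE_HTML = """
<html><body>
<script>var unrelated = 1;</script>
<script>
var model = {"Sedes": [{"HeadquarterName": "El Retablo", "VacantList": [
  {"NombreGrado": "3° Primaria", "NumeroDeVacantes": 1},
  {"NombreGrado": "4° Primaria", "NumeroDeVacantes": 0}
]}]};
</script>
</body></html>
"""


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> tag."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts[-1] += data


def extract_vacancies(html, campus, grades):
    """Return {grade name: open vacancies} for one campus."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    # Pick the script that mentions the campus.
    script = next(s for s in collector.scripts if campus in s)
    # The JSON object spans from the first '{' to the last '}' in the script.
    data = json.loads(script[script.find("{"):script.rfind("}") + 1])
    sede = next(s for s in data["Sedes"] if campus in s["HeadquarterName"])
    return {
        g["NombreGrado"]: g["NumeroDeVacantes"]
        for g in sede["VacantList"]
        if g["NombreGrado"] in grades
    }


vacancies = extract_vacancies(SAMPLE_HTML, "El Retablo", ("3° Primaria", "4° Primaria"))
# The "if statement": notify only when every grade has at least one open slot.
both_open = all(count > 0 for count in vacancies.values())
print(vacancies, both_open)
```

Returning a plain dict keeps the decision (notify or not) separate from the scraping, which makes both easier to test.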

In the end, I did not get both slots because openings happened at different times. But still, the setup worked, and it would have landed perfectly if the stars had aligned :)

Code

Here is the DAG I ended up with:

from airflow import DAG
from time import sleep
from airflow.operators.python import PythonOperator
from pendulum import datetime

def checking_innova():
    import requests
    from bs4 import BeautifulSoup
    import json
    from libs.send_mail import send_py_email


    def notify(topic, message, title, priority, tags):
        # Log what is about to be sent, then push it to the self-hosted ntfy server.
        print(f"""
Topic: {topic}
Message: {message}
Title: {title}
Priority: {priority}
Tags: {tags}
""")
        requests.post(
            url=f'https://ntfy.franzrg.uk/{topic}',
            headers={
                "Title": title,
                "Priority": priority,
                "Tags": tags
            },
            data=message,
            # Do not let a slow ntfy server hang the task.
            timeout=10
        )


    url = 'https://admision.innovaschools.edu.pe/Vacant/SearchVacant?VacancyNumber=2&City=153&District=1480'

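    # A timeout keeps the task from hanging if the site stops responding.
    r = requests.get(url, timeout=30)
    # Explicit check instead of assert, so it still runs under "python -O".
    if r.status_code != 200:
        notify(
            topic='vacantes_innova',
            title="Error en get request",
            priority="min",
            tags="warning,skull",
            message="Status code: " + str(r.status_code)
        )
        raise RuntimeError(f"The status code should be 200, got {r.status_code}")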
    soup = BeautifulSoup(r.text, 'html.parser')



    try:
        # Find the single <script> tag that embeds the campus data.
        script_tags = [i for i in soup.find_all('script') if 'El Retablo' in i.text]
        assert len(script_tags) == 1, "There should be only one script tag with El Retablo"
        # Within that script, keep the one JavaScript statement that mentions the campus.
        statements = [i for i in script_tags[0].text.split(';') if 'El Retablo' in i]
        assert len(statements) == 1, "There should be only one var statement with El Retablo"
        statement = statements[0]
        # The JSON payload is everything between the first '{' and the last '}'.
        data = json.loads(statement[statement.find('{'):statement.rfind('}')+1])
        # Narrow down to the campus ("sede") of interest.
        sede = [i for i in data['Sedes'] if 'El Retablo' in i['HeadquarterName']]
        assert len(sede) == 1, "There should be only one sede with El Retablo"
        sede = sede[0]
        # Keep only the two grades ("grados") of interest.
        grado = [i for i in sede['VacantList'] if i['NombreGrado'] in ('3° Primaria','4° Primaria')]
        assert len(grado) == 2, "There should be 2 grados with 3° Primaria or 4° Primaria"
        print(sede['HeadquarterName'])
        print(grado)

        if grado[0]['NumeroDeVacantes']>0 and grado[1]['NumeroDeVacantes']>0:
            print("Hay en los dos")
            sleep(1)
            notify(
                topic='vacantes_innova',
                title="Hay vacantes para los dos en Innova!!",
                priority="urgent",
                tags="partying_face,tada",
                message=f' {grado[0]["NombreGrado"][0]} grado tiene {grado[0]["NumeroDeVacantes"]} vacantes y {grado[1]["NombreGrado"][0]} tiene {grado[1]["NumeroDeVacantes"]} vacantes'
            )
            sleep(1)
            send_py_email(
                subject='Innova tiene vacantes!!',
                to='franz-1241@hotmail.com',
                html_content=f'{grado[0]["NombreGrado"][0]} grado tiene {grado[0]["NumeroDeVacantes"]} vacantes y {grado[1]["NombreGrado"][0]} grado tiene {grado[1]["NumeroDeVacantes"]} vacantes. Visita este link: <a href="{url}" target="_blank">{url}</a>'
            )
            sleep(1)
    except Exception as e:
        notify(
            topic='vacantes_innova',
            title="Error en parseo de vacantes",
            priority="urgent",
            tags="warning,skull",
            message="Este fue el error: " + str(e)
        )
        raise e



default_args = {
    'owner': 'airflow',
    'email': ['xxxxxxx@xxxxxxx.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
}

with DAG(
    dag_id='innova_vacantes',
    description='Revisando las vacantes de Innova Schools',
    # Every 30 minutes between 09:00 and 20:30.
    schedule_interval='*/30 9-20 * * *',
    start_date=datetime(2023, 12, 13),
    catchup=False,
    default_args=default_args,
    # Tags are a DAG-level argument; inside default_args they do not tag the DAG.
    tags=['personal'],
) as dag:
    # Operators created inside the "with DAG" block attach to that DAG automatically.
    check_innova = PythonOperator(
        task_id='check_innova',
        python_callable=checking_innova,
    )

check_innova
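The notify helper in the DAG boils down to a single POST with a few headers. Here is a sketch that builds that request without sending it, which is handy for checking the payload. It swaps requests for the standard library's `urllib`, and `ntfy.example.com` is a placeholder host, not a real server.

```python
from urllib.request import Request

# Hypothetical server; swap in your own ntfy instance.
NTFY_BASE = "https://ntfy.example.com"


def build_ntfy_request(topic, message, title, priority="default", tags=""):
    """Build (but do not send) the POST request for an ntfy notification."""
    return Request(
        url=f"{NTFY_BASE}/{topic}",
        # The message body is sent as UTF-8 bytes.
        data=message.encode("utf-8"),
        # Title, Priority, and Tags are the same headers the DAG sets.
        headers={"Title": title, "Priority": priority, "Tags": tags},
        method="POST",
    )


req = build_ntfy_request(
    topic="vacantes_innova",
    message="3 grado tiene 1 vacantes",
    title="Hay vacantes",
    priority="urgent",
    tags="tada",
)
print(req.get_method(), req.full_url, req.get_header("Priority"))
# Sending it would be: urllib.request.urlopen(req, timeout=10)
```

Building the request separately from sending it means the notification logic can be tested without spamming your phone.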