Traffic management

We had a situation where the access times of our application increased so much that it became almost unusable for real users. To find the reason for this behavior, we checked the access logs and saw a huge number of accesses to our website by bots. Since the application makes it possible to compare multiple resources by joining them into one URL with one or more ‘+’ characters, there is a huge number of possible links (for \(n\) resources there are \(n^k\) links, where \(k\) is the number of ‘+’ plus 1).
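The size of this link space can be sketched in Python (the corpus size below is illustrative, not taken from our logs):

```python
# Number of distinct comparison URLs for n resources joined by '+' characters.
# With k '+' separators the URL contains k + 1 resource slots, giving
# n ** (k + 1) possible links (resources may repeat and order matters).
def possible_links(n: int, pluses: int) -> int:
    return n ** (pluses + 1)

# Even a modest corpus explodes quickly:
for pluses in range(4):
    print(pluses, possible_links(500, pluses))
```

With 500 resources, a single ‘+’ already yields 250,000 crawlable URLs, which is why bots discovering these links can overwhelm the application.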

Varnish

Varnish Cache is a caching HTTP reverse proxy (“web application accelerator”) that sits in front of the web server and answers repeated requests from its cache.

Let all ‘+’ lead to 404

Our initial approach was to define a Varnish rule that blocks all URL accesses containing more than one ‘+’ and returns a 404. This vastly stabilized the application and made it usable again. But we should keep Google’s warning in mind: “Note that returning ‘no availability’ codes for more than a few days will cause Google to permanently slow or stop crawling URLs on your site, so follow the additional next steps” (source: Handle overcrawling of your site (emergencies)). We should therefore change this behavior in the future:

snippet from default.vcl
if (req.url ~ "/texts/") {
   if (req.url ~ "\+.*\+.*") {
      return (synth(404, "Denied by request filtering configuration"));
   }
   return (pass);
}
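Note that the inner regex `\+.*\+` only matches URLs containing at least two ‘+’, i.e. comparisons of three or more resources; single-‘+’ comparisons still pass. A quick Python check of the same pattern (the URLs are made up for illustration):

```python
import re

# Same pattern as in the Varnish rule above.
blocked = re.compile(r"\+.*\+.*")

print(bool(blocked.search("/texts/urn:a+urn:b")))        # one '+': not blocked
print(bool(blocked.search("/texts/urn:a+urn:b+urn:c")))  # two '+': blocked
```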

The next step is to comment out this rule and switch to a request-limit approach as described in https://support.platform.sh/hc/en-us/community/posts/16439617864722-Rate-limit-connections-to-your-application-using-Varnish.

Before making any changes, please check the syntax of the file via varnishd -C -f /etc/varnish/default.vcl (see this blog post).

After making changes, reload Varnish with service varnish reload (https://stackoverflow.com/a/46088507/7924573).

robots.txt

We use a robots.txt file to disallow all bots from crawling pages whose URL contains a “+”. To achieve this, we used the robots.txt wildcard character “*”. It is not known whether all bots accept this form of wildcard notation.

If you intend to make changes to the robots.txt, consider checking its functionality beforehand via online tools such as the robots.txt Testing & Validation Tool.

robots.txt
# robots.txt
## Although the "*" should include all bots, I observed that certain ones need specific instructions.
## The goal is to disallow the crawling and automatic processing of all pages with a "+" or "%2B" (its percent-encoding equivalent).
## Since the site is refreshed quite infrequently, we use a Crawl-delay of 30 seconds.
User-agent: *
User-agent: AdsBot-Google
User-agent: Yandex
User-agent: YandexBot
User-agent: AcademicBotRTU
User-agent: Googlebot
Disallow: *+*
Disallow: *%2B*
Crawl-delay: 30
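The wildcard semantics we rely on can be illustrated with Python’s fnmatch; shell-style globbing is close to, though not identical with, the matching major crawlers implement, and the URLs below are made up:

```python
from fnmatch import fnmatch

# "*+*" matches any path containing a literal '+',
# "*%2B*" matches its percent-encoded form.
print(fnmatch("/texts/urn:a+urn:b", "*+*"))      # True
print(fnmatch("/texts/urn:a%2Burn:b", "*%2B*"))  # True
print(fnmatch("/texts/urn:a", "*+*"))            # False
```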

This reduced the accesses by over 50% within one day (10k-20k 404s per hour on 09.10.24 vs. 300-6k 404s per hour on 10.10.24). The most prominent of the remaining crawlers on 10.10.24 were AcademicBotRTU and YandexBot; these remained until 15.10.24, when I added explicit rules to block them. Since Google needs at most 24-36 hours to pick up the new robots.txt, I expect other crawlers to react on a similar timescale.

Why add crawl-delay?

Googlebot and Yandex won’t use it, so for them it causes no harm and is simply ignored (https://support.google.com/webmasters/thread/251817470/putting-crawl-delay-in-robots-txt-file-is-good?hl=en). But some bots do recognize it, so I added the arbitrary limit of 30 seconds, which is, for instance, respected by AcademicBotRTU.

Why add %2B?

Because it is the percent-encoding equivalent of ‘+’.
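The equivalence can be checked with Python’s standard library:

```python
from urllib.parse import quote, unquote

# '+' is not in the unreserved character set, so it is percent-encoded as %2B.
print(quote("+"))      # '%2B'
print(unquote("%2B"))  # '+'
```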

Which “good” bots frequently crawled our website?

crawler-list

name                  | url                                                                                                            | remarks
----------------------|----------------------------------------------------------------------------------------------------------------|--------------
AcademicBotRTU        | https://academicbot.rtu.lv/                                                                                    | academic
YandexBot             |                                                                                                                | search engine
GoogleBot             |                                                                                                                | search engine
BingBot               |                                                                                                                | search engine
Baiduspider           | https://www.baidu.com/search/robots_english.html                                                               | search engine
Nexus 5X Build/MMB29P | https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers?hl=de#googlebot-smartphone | search engine
CCBot                 | ???                                                                                                            | ???

What if a crawler ignores robots.txt?

https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/

nginx

flask

Controlled by the MAX_NUMBER_OF_TEXTS_FOR_NOT_AUTHENTICATED_USER environment variable, which is then used by the r_multipassage method.

By default, SERVER_TYPE is set to ‘dev’, so authentication is not required. If you want to activate the authentication-required process, set it to ‘production’: create or modify the .env file in the root directory with the following variables:

SERVER_TYPE = "production"
MAX_NUMBER_OF_TEXTS_FOR_NOT_AUTHENTICATED_USER = 1
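The authentication gate in r_multipassage (shown further below) boils down to the following check. This is a simplified standalone sketch, not the actual route code; the names and semantics are taken from Config.py and r_multipassage:

```python
def requires_auth(object_ids: str, threshold: int, server_type: str,
                  is_authenticated: bool) -> bool:
    """Return True if the request must be rejected with 401.

    A URL with k '+' separators requests k + 1 texts; non-authenticated
    users may view at most `threshold` texts unless SERVER_TYPE is 'dev'.
    """
    if 'dev' in server_type.lower():
        return False          # dev servers impose no restriction
    if is_authenticated:
        return False          # logged-in users see everything
    return object_ids.count('+') >= threshold

print(requires_auth("urn:a+urn:b", 1, "production", False))  # True -> 401
print(requires_auth("urn:a", 1, "production", False))        # False -> allowed
```

So with the default threshold of 1, an anonymous user on a production server can view single texts but not ‘+’-joined comparisons.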
Config.py
class Config(object):
    SECRET_KEY = os.environ.get('NEMO_KEY') or 'you-will-never-guess'
    SQLALCHEMY_DATABASE_URI = os.environ.get('DATABASE_URL') or 'sqlite:///' + os.path.join(basedir, 'app.db')
    SQLALCHEMY_TRACK_MODIFICATIONS = False
    POSTS_PER_PAGE = 10
    # if the environment variable ELASTICSEARCH_URL is not set, the application starts without it
    ELASTICSEARCH_URL = os.environ.get('ELASTICSEARCH_URL').split(';') if os.environ.get('ELASTICSEARCH_URL') else False
    ES_CLIENT_CERT = os.environ.get('ES_CLIENT_CERT', '')
    ES_CLIENT_KEY = os.environ.get('ES_CLIENT_KEY', '')
    LANGUAGES = ['en', 'de', 'fr']
    BABEL_DEFAULT_LOCALE = 'de'
    CORPUS_FOLDERS = os.environ.get('CORPUS_FOLDERS').split(';') if os.environ.get('CORPUS_FOLDERS') else ["../formulae-corpora/"]
    INFLECTED_LEM_JSONS = os.environ.get('INFLECTED_LEM_JSONS').split(';') if os.environ.get('INFLECTED_LEM_JSONS') else []
    LEM_TO_LEM_JSONS = os.environ.get('LEM_TO_LEM_JSONS').split(';') if os.environ.get('LEM_TO_LEM_JSONS') else []
    DEAD_URLS = os.environ.get('DEAD_URLS').split(';') if os.environ.get('DEAD_URLS') else []
    COMP_PLACES = os.environ.get('COMP_PLACES').split(';') if os.environ.get('COMP_PLACES') else []
    LEMMA_LISTS = os.environ.get('LEMMA_LISTS').split(';') if os.environ.get('LEMMA_LISTS') else []
    COLLECTED_COLLS = os.environ.get('COLLECTED_COLLS').split(';') if os.environ.get('COLLECTED_COLLS') else []
    # TERM_VECTORS = os.environ.get('TERM_VECTORS')
    CACHE_DIRECTORY = os.environ.get('NEMO_CACHE_DIR') or './cache/'
    MAIL_SERVER = os.environ.get('MAIL_SERVER')
    MAIL_PORT = int(os.environ.get('MAIL_PORT') or 25)
    MAIL_USE_TLS = os.environ.get('MAIL_USE_TLS') is not None
    MAIL_USERNAME = os.environ.get('MAIL_USERNAME')
    MAIL_PASSWORD = os.environ.get('MAIL_PASSWORD')
    ADMINS = os.environ.get('ADMINS').split(';') if os.environ.get('ADMINS') else ['no-reply@example.com']
    SESSION_TYPE = 'filesystem'
    IIIF_SERVER = os.environ.get('IIIF_SERVER')
    IIIF_MAPPING = os.environ.get('IIIF_MAPPING') or ';'.join(['{}/iiif'.format(f) for f in CORPUS_FOLDERS])
    # This should only be changed to True when collecting search queries and responses for mocking ES
    SAVE_REQUESTS = False
    CACHE_MAX_AGE = os.environ.get('VARNISH_MAX_AGE') or 0  # Only needed on the server, where this should be set in the env
    PDF_ENCRYPTION_PW = os.environ.get('PDF_ENCRYPTION_PW', 'hard_pw')
    SESSION_COOKIE_SECURE = os.environ.get('SESSION_COOKIE_SECURE', True)
    REMEMBER_COOKIE_SECURE = os.environ.get('REMEMBER_COOKIE_SECURE', True)
    REDIS_URL = os.environ.get('REDIS_URL') or 'redis://'
    PREFERRED_URL_SCHEME = os.environ.get('PREFERRED_URL_SCHEME', 'http')
    VIDEO_FOLDER = os.environ.get('VIDEO_FOLDER') or ''
    COLLATE_API_URL = os.environ.get('COLLATE_API_URL', 'http://localhost:7300')
    WORD_GRAPH_API_URL = os.environ.get('WORD_GRAPH_API_URL', '')
    # Used to decide whether authentication is needed for certain resources (dev -> access to all without restriction; production -> restricted access for non-authenticated users)
    SERVER_TYPE = os.environ.get('SERVER_TYPE', 'dev')
    # Number of texts a non-authenticated user should be able to see. int > 0
    try:
        MAX_NUMBER_OF_TEXTS_FOR_NOT_AUTHENTICATED_USER = int(os.environ.get('MAX_NUMBER_OF_TEXTS_FOR_NOT_AUTHENTICATED_USER'))
        if not MAX_NUMBER_OF_TEXTS_FOR_NOT_AUTHENTICATED_USER > 0:
            MAX_NUMBER_OF_TEXTS_FOR_NOT_AUTHENTICATED_USER = 1
    except TypeError:
        MAX_NUMBER_OF_TEXTS_FOR_NOT_AUTHENTICATED_USER = 1
r_multipassage()
    def r_multipassage(self, objectIds: str, subreferences: str, lang: str = None, collate: bool = False) -> Dict[str, Any]:
        """ Retrieve the text of the passage

        :param objectIds: Collection identifiers separated by '+'
        :param lang: Lang in which to express main data
        :param subreferences: Reference identifiers separated by '+'
        :return: Template, collections metadata and Markup object representing the text
        """
        # authentication check:
        texts_threshold = self.app.config['MAX_NUMBER_OF_TEXTS_FOR_NOT_AUTHENTICATED_USER']
        if (texts_threshold <= objectIds.count('+') or texts_threshold <= subreferences.count('+')) and \
            not current_user.is_authenticated and \
            not 'dev' in self.app.config['SERVER_TYPE'].lower():
                abort(401)

        if 'reading_format' not in session:
            session['reading_format'] = 'columns'
        ids = objectIds.split('+')
        translations = {}
        view = 1
        passage_data = {'template': 'main::multipassage.html', 'objects': [], "translation": {}}
        subrefers = subreferences.split('+')
        all_parent_colls = list()
        collate_html_dict = dict()
        if 'collate' in request.values:
            collate_html_dict = self.call_collate_api(obj_ids=ids, subrefers=subrefers)
        if len(subrefers) != len(ids):
            abort(404)
        for i, id in enumerate(ids):
            manifest_requested = False
            if id in self.dead_urls:
                id = self.dead_urls[id]
            if "manifest:" in id:
                id = re.sub(r'^manifest:', '', id)
                manifest_requested = True
            if self.check_project_team() is True or id in self.open_texts:
                if subrefers[i] in ["all", 'first']:
                    subref = self.get_reffs(id)[0][0]
                else:
                    subref = subrefers[i]
                d = self.r_passage(id, subref, lang=lang)
                d['text_passage'] = collate_html_dict.get(id, d['text_passage'])
                d.update(self.get_prev_next_texts(d['objectId']))
                del d['template']
                parent_colls = defaultdict(list)
                parent_colls[99] = [(id, str(d['collections']['current']['label'].replace('<br>', ' ')))]
                parent_textgroups = [x for x in d['collections']['parents'] if 'cts:textgroup' in x['subtype']]
                for parent_coll in parent_textgroups:
                    grandparent_depths = list()
                    for grandparent in parent_coll['ancestors'].values():
                        start_number = 0
                        if 'cts:textgroup' in grandparent.subtype:
                            start_number = 1
                        grandparent_depths.append(len([x for x in self.make_parents(grandparent) if 'cts:textgroup' in x['subtype']]) + start_number)
                    max_grandparent_depth = max(grandparent_depths)
                    parent_colls[max_grandparent_depth].append((parent_coll['id'], str(parent_coll['short_title'])))
                all_parent_colls.append([v for k, v in sorted(parent_colls.items())])
                translations[id] = []
                for x in d.pop('translations', None):
                    if x[0].id not in ids and x not in translations[id]:
                        translations[id].append(x)
                if manifest_requested:
                    # This is when there are multiple manuscripts and the edition cannot be tied to any single one of them
                    formulae = dict()
                    if 'manifest:' + d['collections']['current']['id'] in self.app.picture_file:
                        formulae = self.app.picture_file['manifest:' + d['collections']['current']['id']]
                    d['alt_image'] = ''
                    if os.path.isfile(self.app.config['IIIF_MAPPING'] + '/' + 'alternatives.json'):
                        with open(self.app.config['IIIF_MAPPING'] + '/' + 'alternatives.json') as f:
                            alt_images = json_load(f)
                        d['alt_image'] = alt_images.get(id)
                    d["objectId"] = "manifest:" + id
                    d["div_v"] = "manifest" + str(view)
                    view = view + 1
                    del d['text_passage']
                    del d['notes']
                    if formulae == {}:
                        d['manifest'] = None
                        d["label"] = [d['collections']['current']['label'], '']
                        d['lib_link'] = self.ms_lib_links.get(id.split(':')[-1].split('.')[0], '')
                        passage_data['objects'].append(d)
                        continue
                    # This viewer works when the library or archive provides an IIIF API for external use of their books
                    d["manifest"] = url_for('viewer.static', filename=formulae["manifest"])
                    with open(self.app.config['IIIF_MAPPING'] + '/' + formulae['manifest']) as f:
                        this_manifest = json_load(f)
                    self.app.logger.warn("this_manifest['@id'] {}".format(this_manifest['@id']))
                    if 'fuldig.hs-fulda.de' in this_manifest['@id']:
                        # This works for resources from https://fuldig.hs-fulda.de/
                        d['lib_link'] = this_manifest['sequences'][0]['canvases'][0]['rendering'][1]['@id']
                    elif 'gallica.bnf.fr' in this_manifest['@id']:
                        # This link needs to be constructed from the thumbnail link for images from https://gallica.bnf.fr/
                        d['lib_link'] = this_manifest['sequences'][0]['canvases'][0]['thumbnail']['@id'].replace('.thumbnail', '')
                        self.app.logger.warn("gallica.bnf.fr: lib_link created:{}".format(d['lib_link']))
                    elif 'api.digitale-sammlungen.de' in this_manifest['@id']:
                        # This works for resources from the Bayerische Staatsbibliothek
                        # (and perhaps other German digital libraries?)
                        d['lib_link'] = this_manifest['sequences'][0]['canvases'][0]['@id'] + '/view'
                    elif 'digitalcollections.universiteitleiden.nl' in this_manifest['@id']:
                        # This works for resources from the Leiden University Library
                        # This links to the manuscript as a whole.
                        # I am not sure how to link to specific pages in their IIIF viewer.
                        d['lib_link'] = 'https://iiifviewer.universiteitleiden.nl/?manifest=' + this_manifest['@id']
                    elif 'digi.vatlib.it' in this_manifest['@id']:
                        # This works for resources from the Vatican Libraries
                        d['lib_link'] = this_manifest['sequences'][0]['canvases'][0]['@id'].replace('iiif', 'view').replace('canvas/p', '')
                    elif 'digital.blb-karlsruhe.de' in this_manifest['@id']:
                        # This works for resources from the BLB Karlsruhe
                        # This links to the manuscript as a whole.
                        # I am not sure how to link to specific pages in their IIIF viewer.
                        d['lib_link'] = 'https://i3f.vls.io/?collection=i3fblbk&id=' + this_manifest['@id']
                    elif 'www.e-codices.unifr.ch' in this_manifest['@id']:
                        # This works for resources from the E-Codices
                        d['lib_link'] = this_manifest['related'].replace('/list/one', '') + '/' + this_manifest['sequences'][0]['canvases'][0]['label']

                    self.app.logger.debug(msg="lib_link: {}".format(d['lib_link']))

                    folios = re.sub(r'(\d+)([rvab]{1,2})', r'\1<span class="verso-recto">\2</span>',
                                    this_manifest['sequences'][0]['canvases'][0]['label'])
                    if len(this_manifest['sequences'][0]['canvases']) > 1:
                        folios += ' - ' + re.sub(r'(\d+)([rvab]{1,2})', r'\1<span class="verso-recto">\2</span>',
                                                 this_manifest['sequences'][0]['canvases'][-1]['label'])
                    d["label"] = [formulae["title"], ' [fol.' + folios + ']']

                else:
                    d["IIIFviewer"] = []
                    for transcription, t_title, t_partOf, t_siglum in d['transcriptions']:
                        t_id = 'no_image'
                        if "manifest:" + transcription.id in self.app.picture_file:
                            t_id = "manifest:" + transcription.id
                        elif 'wa1' in transcription.id:
                            t_id = self.ms_lib_links['wa1']
                        d["IIIFviewer"].append((t_id,
                                                t_title + ' (' + t_siglum + ')',
                                                t_partOf))

                    self.app.logger.warn(msg='d["IIIFviewer"]: {}'.format(d["IIIFviewer"]))
                    if 'previous_search' in session:
                        result_ids = [x for x in session['previous_search'] if x['id'] == id]
                        if result_ids and any([x.get('highlight') for x in result_ids]):
                            d['text_passage'] = self.highlight_found_sents(d['text_passage'], result_ids)
                    if d['collections']['current']['sigla'] != '':
                        d['collections']['current']['label'] = d['collections']['current']['label'].split(' [')
                        d['collections']['current']['label'][-1] = ' [' + d['collections']['current']['label'][-1]
                filtered_transcriptions = []
                for x in d['transcriptions']:
                    if x[0].id not in ids and x not in filtered_transcriptions:
                        filtered_transcriptions.append(x)
                d['transcriptions'] = filtered_transcriptions
                passage_data['objects'].append(d)
        passage_data['breadcrumb_colls'] = all_parent_colls
        if len(ids) > len(passage_data['objects']):
            # "At least one of the texts you wish to display is not available."
            flash(_('Mindestens ein Text, den Sie anzeigen möchten, ist nicht verfügbar.'))
        passage_data['translation'] = translations
        passage_data['videos'] = [v for k, v in self.VIDEOS.items() if 2 in k][0]
        passage_data['word_graph_url'] = self.app.config['WORD_GRAPH_API_URL']
        return passage_data