Traffic management

We had a situation where the access times of our application increased so much that it became almost unusable for real users. To find the reason for this behavior, we checked the access logs and saw a huge number of accesses to our website by bots. Since the application makes it possible to compare multiple resources by joining them into one URL with one or more ‘+’ characters, there is a huge number of possible links (for \(n\) resources there are \(n^k\) links, where \(k\) is the number of ‘+’ plus 1).
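The size of this link space can be sketched in Python (the corpus size below is illustrative, not taken from our logs):

```python
# Number of distinct comparison URLs for n resources joined by '+' characters.
# With k '+' separators the URL contains k + 1 resource slots, giving
# n ** (k + 1) possible links (resources may repeat and order matters).
def possible_links(n: int, pluses: int) -> int:
    return n ** (pluses + 1)

# Even a modest corpus explodes quickly:
for pluses in range(4):
    print(pluses, possible_links(500, pluses))
```

With 500 resources, a single ‘+’ already yields 250,000 crawlable URLs, which is why bots discovering these links can overwhelm the application.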

Varnish

Varnish Cache is a caching HTTP reverse proxy (“web application accelerator”) that sits in front of the web server and answers repeated requests from its cache.

Let all ‘+’ lead to 404

Our initial approach was to define a Varnish rule that blocks all URL accesses containing more than one ‘+’ and returns a 404. This vastly stabilized the application and made it usable again. But we should keep Google’s warning in mind: “Note that returning ‘no availability’ codes for more than a few days will cause Google to permanently slow or stop crawling URLs on your site, so follow the additional next steps” (source: Handle overcrawling of your site (emergencies)). We should therefore change this behavior in the future:

snippet from default.vcl
if (req.url ~ "/texts/") {
   if (req.url ~ "\+.*\+.*") {
      return (synth(404, "Denied by request filtering configuration"));
   }
   return (pass);
}
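Note that the inner regex `\+.*\+` only matches URLs containing at least two ‘+’, i.e. comparisons of three or more resources; single-‘+’ comparisons still pass. A quick Python check of the same pattern (the URLs are made up for illustration):

```python
import re

# Same pattern as in the Varnish rule above.
blocked = re.compile(r"\+.*\+.*")

print(bool(blocked.search("/texts/urn:a+urn:b")))        # one '+': not blocked
print(bool(blocked.search("/texts/urn:a+urn:b+urn:c")))  # two '+': blocked
```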

The next step is to comment out this rule and switch to a request-limit approach as described in https://support.platform.sh/hc/en-us/community/posts/16439617864722-Rate-limit-connections-to-your-application-using-Varnish.

Before making any changes, please check the syntax of the file via varnishd -C -f /etc/varnish/default.vcl (see this blog post).

After making changes, reload Varnish with service varnish reload (https://stackoverflow.com/a/46088507/7924573).

robots.txt

We use a robots.txt file to disallow all bots from crawling pages whose URL contains a “+”. To achieve this, we used the robots.txt wildcard character “*”. It is not known whether all bots accept this form of wildcard notation.

If you intend to make changes to the robots.txt, consider checking its functionality beforehand via online tools such as the robots.txt Testing & Validation Tool.

robots.txt
# robots.txt
## Although the "*" should include all bots, I observed that certain ones need specific instructions.
## The goal is to disallow the crawling and automatic processing of all pages with a "+" or "%2B" (its percent-encoding equivalent).
## Since the site is refreshed quite infrequently, we use a Crawl-delay of 30 seconds.
User-agent: *
User-agent: AdsBot-Google
User-agent: Yandex
User-agent: YandexBot
User-agent: AcademicBotRTU
User-agent: Googlebot
Disallow: *+*
Disallow: *%2B*
Crawl-delay: 30
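The wildcard semantics we rely on can be illustrated with Python’s fnmatch; shell-style globbing is close to, though not identical with, the matching major crawlers implement, and the URLs below are made up:

```python
from fnmatch import fnmatch

# "*+*" matches any path containing a literal '+',
# "*%2B*" matches its percent-encoded form.
print(fnmatch("/texts/urn:a+urn:b", "*+*"))      # True
print(fnmatch("/texts/urn:a%2Burn:b", "*%2B*"))  # True
print(fnmatch("/texts/urn:a", "*+*"))            # False
```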

This reduced the accesses by over 50% within one day (10k-20k 404s per hour on 09.10.24 vs. 300-6k 404s per hour on 10.10.24). The most prominent of the remaining crawlers on 10.10.24 were AcademicBotRTU and YandexBot; these remained until 15.10.24, when I added explicit rules to block them. Since Google needs at most 24-36 hours to pick up the new robots.txt, I expect other crawlers to react on a similar timescale.

Why add crawl-delay?

Googlebot and Yandex won’t use it, so for them it causes no harm and is simply ignored (https://support.google.com/webmasters/thread/251817470/putting-crawl-delay-in-robots-txt-file-is-good?hl=en). But some bots do recognize it, so I added the arbitrary limit of 30 seconds, which is, for instance, respected by AcademicBotRTU.

Why add %2B?

Because it is the percent-encoding equivalent of ‘+’.
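The equivalence can be checked with Python’s standard library:

```python
from urllib.parse import quote, unquote

# '+' is not in the unreserved character set, so it is percent-encoded as %2B.
print(quote("+"))      # '%2B'
print(unquote("%2B"))  # '+'
```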

Which “good” bots frequently crawled our website?

crawler-list

name                  | url                                                                                                            | remarks
----------------------|----------------------------------------------------------------------------------------------------------------|--------------
AcademicBotRTU        | https://academicbot.rtu.lv/                                                                                    | academic
YandexBot             |                                                                                                                | search engine
GoogleBot             |                                                                                                                | search engine
BingBot               |                                                                                                                | search engine
Baiduspider           | https://www.baidu.com/search/robots_english.html                                                               | search engine
Nexus 5X Build/MMB29P | https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers?hl=de#googlebot-smartphone | search engine
CCBot                 | ???                                                                                                            | ???

What if a crawler ignores robots.txt?

https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/

nginx

flask

Controlled by the MAX_NUMBER_OF_TEXTS_FOR_NOT_AUTHENTICATED_USER environment variable, which is then used by the r_multipassage method.

By default, SERVER_TYPE is set to ‘dev’, so authentication is not required. If you want to activate the authentication-required process, set it to ‘production’: create or modify the .env file in the root directory with the following variables:

SERVER_TYPE = "production"
MAX_NUMBER_OF_TEXTS_FOR_NOT_AUTHENTICATED_USER = 1
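The authentication gate in r_multipassage (shown further below) boils down to the following check. This is a simplified standalone sketch, not the actual route code; the names and semantics are taken from Config.py and r_multipassage:

```python
def requires_auth(object_ids: str, threshold: int, server_type: str,
                  is_authenticated: bool) -> bool:
    """Return True if the request must be rejected with 401.

    A URL with k '+' separators requests k + 1 texts; non-authenticated
    users may view at most `threshold` texts unless SERVER_TYPE is 'dev'.
    """
    if 'dev' in server_type.lower():
        return False          # dev servers impose no restriction
    if is_authenticated:
        return False          # logged-in users see everything
    return object_ids.count('+') >= threshold

print(requires_auth("urn:a+urn:b", 1, "production", False))  # True -> 401
print(requires_auth("urn:a", 1, "production", False))        # False -> allowed
```

So with the default threshold of 1, an anonymous user on a production server can view single texts but not ‘+’-joined comparisons.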
Config.py
class Config(object):
    SECRET_KEY = os.environ.get('NEMO_KEY') or 'you-will-never-guess'
    SQLALCHEMY_DATABASE_URI = os.environ.get('DATABASE_URL') or 'sqlite:///' + os.path.join(basedir, 'app.db')
    SQLALCHEMY_TRACK_MODIFICATIONS = False
    POSTS_PER_PAGE = 10
    # if the environment variable ELASTICSEARCH_URL is not set, the application starts without it
    ELASTICSEARCH_URL = os.environ.get('ELASTICSEARCH_URL').split(';') if os.environ.get('ELASTICSEARCH_URL') else False
    ES_CLIENT_CERT = os.environ.get('ES_CLIENT_CERT', '')
    ES_CLIENT_KEY = os.environ.get('ES_CLIENT_KEY', '')
    LANGUAGES = ['en', 'de', 'fr']
    BABEL_DEFAULT_LOCALE = 'de'
    CORPUS_FOLDERS = os.environ.get('CORPUS_FOLDERS').split(';') if os.environ.get('CORPUS_FOLDERS') else ["../formulae-corpora/"]
    INFLECTED_LEM_JSONS = os.environ.get('INFLECTED_LEM_JSONS').split(';') if os.environ.get('INFLECTED_LEM_JSONS') else []
    LEM_TO_LEM_JSONS = os.environ.get('LEM_TO_LEM_JSONS').split(';') if os.environ.get('LEM_TO_LEM_JSONS') else []
    DEAD_URLS = os.environ.get('DEAD_URLS').split(';') if os.environ.get('DEAD_URLS') else []
    COMP_PLACES = os.environ.get('COMP_PLACES').split(';') if os.environ.get('COMP_PLACES') else []
    LEMMA_LISTS = os.environ.get('LEMMA_LISTS').split(';') if os.environ.get('LEMMA_LISTS') else []
    COLLECTED_COLLS = os.environ.get('COLLECTED_COLLS').split(';') if os.environ.get('COLLECTED_COLLS') else []
    # TERM_VECTORS = os.environ.get('TERM_VECTORS')
    CACHE_DIRECTORY = os.environ.get('NEMO_CACHE_DIR') or './cache/'
    MAIL_SERVER = os.environ.get('MAIL_SERVER')
    MAIL_PORT = int(os.environ.get('MAIL_PORT') or 25)
    MAIL_USE_TLS = os.environ.get('MAIL_USE_TLS') is not None
    MAIL_USERNAME = os.environ.get('MAIL_USERNAME')
    MAIL_PASSWORD = os.environ.get('MAIL_PASSWORD')
    ADMINS = os.environ.get('ADMINS').split(';') if os.environ.get('ADMINS') else ['no-reply@example.com']
    SESSION_TYPE = 'filesystem'
    IIIF_SERVER = os.environ.get('IIIF_SERVER')
    IIIF_MAPPING = os.environ.get('IIIF_MAPPING') or ';'.join(['{}/iiif'.format(f) for f in CORPUS_FOLDERS])
    # This should only be changed to True when collecting search queries and responses for mocking ES
    SAVE_REQUESTS = False
    CACHE_MAX_AGE = os.environ.get('VARNISH_MAX_AGE') or 0  # Only needed on the server, where this should be set in the env
    PDF_ENCRYPTION_PW = os.environ.get('PDF_ENCRYPTION_PW', 'hard_pw')
    SESSION_COOKIE_SECURE = os.environ.get('SESSION_COOKIE_SECURE', True)
    REMEMBER_COOKIE_SECURE = os.environ.get('REMEMBER_COOKIE_SECURE', True)
    REDIS_URL = os.environ.get('REDIS_URL') or 'redis://'
    PREFERRED_URL_SCHEME = os.environ.get('PREFERRED_URL_SCHEME', 'http')
    VIDEO_FOLDER = os.environ.get('VIDEO_FOLDER') or ''
    COLLATE_API_URL = os.environ.get('COLLATE_API_URL', 'http://localhost:7300')
    WORD_GRAPH_API_URL = os.environ.get('WORD_GRAPH_API_URL', '')
    # Used to decide whether authentication is needed for certain resources (dev -> access to all without restriction; production -> restricted access for non-authenticated users)
    SERVER_TYPE = os.environ.get('SERVER_TYPE', 'dev')
    # Number of texts a non-authenticated user should be able to see. int > 0
    try:
        MAX_NUMBER_OF_TEXTS_FOR_NOT_AUTHENTICATED_USER = int(os.environ.get('MAX_NUMBER_OF_TEXTS_FOR_NOT_AUTHENTICATED_USER'))
        if not MAX_NUMBER_OF_TEXTS_FOR_NOT_AUTHENTICATED_USER > 0:
            MAX_NUMBER_OF_TEXTS_FOR_NOT_AUTHENTICATED_USER = 1
    except TypeError:
        MAX_NUMBER_OF_TEXTS_FOR_NOT_AUTHENTICATED_USER = 1
r_multipassage()
    def r_multipassage(self, objectIds: str, subreferences: str, lang: str = None, collate: bool = False) -> Dict[str, Any]:
        """ Retrieve the text of the passage

        :param objectIds: Collection identifiers separated by '+'
        :param lang: Lang in which to express main data
        :param subreferences: Reference identifiers separated by '+'
        :return: Template, collections metadata and Markup object representing the text
        """
        # authentication check:
        texts_threshold = self.app.config['MAX_NUMBER_OF_TEXTS_FOR_NOT_AUTHENTICATED_USER']
        if (texts_threshold <= objectIds.count('+') or texts_threshold <= subreferences.count('+')) and \
            not current_user.is_authenticated and \
            not 'dev' in self.app.config['SERVER_TYPE'].lower():
                abort(401)

        if 'reading_format' not in session:
            session['reading_format'] = 'columns'
        ids = objectIds.split('+')
        translations = {}
        view = 1
        passage_data = {'template': 'main::multipassage.html', 'objects': [], "translation": {}}
        subrefers = subreferences.split('+')
        all_parent_colls = list()
        collate_html_dict = dict()
        if 'collate' in request.values:
            collate_html_dict = self.call_collate_api(obj_ids=ids, subrefers=subrefers)
        if len(subrefers) != len(ids):
            abort(404)
        for i, id in enumerate(ids):
            manifest_requested = False
            if id in self.dead_urls:
                id = self.dead_urls[id]
            if "manifest:" in id:
                id = re.sub(r'^manifest:', '', id)
                manifest_requested = True
            if self.check_project_team() is True or id in self.open_texts:
                if subrefers[i] in ["all", 'first']:
                    subref = self.get_reffs(id)[0][0]
                else:
                    subref = subrefers[i]
                d = self.r_passage(id, subref, lang=lang)
                d['text_passage'] = collate_html_dict.get(id, d['text_passage'])
                d.update(self.get_prev_next_texts(d['objectId']))
                del d['template']
                parent_colls = defaultdict(list)
                parent_colls[99] = [(id, str(d['collections']['current']['label'].replace('<br>', ' ')))]
                parent_textgroups = [x for x in d['collections']['parents'] if 'cts:textgroup' in x['subtype']]
                for parent_coll in parent_textgroups:
                    grandparent_depths = list()
                    for grandparent in parent_coll['ancestors'].values():
                        start_number = 0
                        if 'cts:textgroup' in grandparent.subtype:
                            start_number = 1
                        grandparent_depths.append(len([x for x in self.make_parents(grandparent) if 'cts:textgroup' in x['subtype']]) + start_number)
                    max_grandparent_depth = max(grandparent_depths)
                    parent_colls[max_grandparent_depth].append((parent_coll['id'], str(parent_coll['short_title'])))
                all_parent_colls.append([v for k, v in sorted(parent_colls.items())])
                translations[id] = []
                for x in d.pop('translations', None):
                    if x[0].id not in ids and x not in translations[id]:
                        translations[id].append(x)
                if manifest_requested:
                    # This is when there are multiple manuscripts and the edition cannot be tied to any single one of them
                    formulae = dict()
                    if 'manifest:' + d['collections']['current']['id'] in self.app.picture_file:
                        formulae = self.app.picture_file['manifest:' + d['collections']['current']['id']]
                    d['alt_image'] = ''
                    if os.path.isfile(self.app.config['IIIF_MAPPING'] + '/' + 'alternatives.json'):
                        with open(self.app.config['IIIF_MAPPING'] + '/' + 'alternatives.json') as f:
                            alt_images = json_load(f)
                        d['alt_image'] = alt_images.get(id)
                    d["objectId"] = "manifest:" + id
                    d["div_v"] = "manifest" + str(view)
                    view = view + 1
                    del d['text_passage']
                    del d['notes']
                    if formulae == {}:
                        d['manifest'] = None
                        d["label"] = [d['collections']['current']['label'], '']
                        d['lib_link'] = self.ms_lib_links.get(id.split(':')[-1].split('.')[0], '')
                        passage_data['objects'].append(d)
                        continue
                    # This viewer works when the library or archive provides an IIIF API for external use of their books
                    d["manifest"] = url_for('viewer.static', filename=formulae["manifest"])
                    with open(self.app.config['IIIF_MAPPING'] + '/' + formulae['manifest']) as f:
                        this_manifest = json_load(f)
                    self.app.logger.warn("this_manifest['@id'] {}".format(this_manifest['@id']))
                    if 'fuldig.hs-fulda.de' in this_manifest['@id']:
                        # This works for resources from https://fuldig.hs-fulda.de/
                        d['lib_link'] = this_manifest['sequences'][0]['canvases'][0]['rendering'][1]['@id']
                    elif 'gallica.bnf.fr' in this_manifest['@id']:
                        # This link needs to be constructed from the thumbnail link for images from https://gallica.bnf.fr/
                        d['lib_link'] = this_manifest['sequences'][0]['canvases'][0]['thumbnail']['@id'].replace('.thumbnail', '')
                        self.app.logger.warn("gallica.bnf.fr: lib_link created:{}".format(d['lib_link']))
                    elif 'api.digitale-sammlungen.de' in this_manifest['@id']:
                        # This works for resources from the Bayerische Staatsbibliothek
                        # (and perhaps other German digital libraries?)
                        d['lib_link'] = this_manifest['sequences'][0]['canvases'][0]['@id'] + '/view'
                    elif 'digitalcollections.universiteitleiden.nl' in this_manifest['@id']:
                        # This works for resources from the Leiden University Library
                        # This links to the manuscript as a whole.
                        # I am not sure how to link to specific pages in their IIIF viewer.
                        d['lib_link'] = 'https://iiifviewer.universiteitleiden.nl/?manifest=' + this_manifest['@id']
                    elif 'digi.vatlib.it' in this_manifest['@id']:
                        # This works for resources from the Vatican Libraries
                        d['lib_link'] = this_manifest['sequences'][0]['canvases'][0]['@id'].replace('iiif', 'view').replace('canvas/p', '')
                    elif 'digital.blb-karlsruhe.de' in this_manifest['@id']:
                        # This works for resources from the BLB Karlsruhe
                        # This links to the manuscript as a whole.
                        # I am not sure how to link to specific pages in their IIIF viewer.
                        d['lib_link'] = 'https://i3f.vls.io/?collection=i3fblbk&id=' + this_manifest['@id']
                    elif 'www.e-codices.unifr.ch' in this_manifest['@id']:
                        # This works for resources from the E-Codices
                        d['lib_link'] = this_manifest['related'].replace('/list/one', '') + '/' + this_manifest['sequences'][0]['canvases'][0]['label']

                    self.app.logger.debug(msg="lib_link: {}".format(d['lib_link']))

                    folios = re.sub(r'(\d+)([rvab]{1,2})', r'\1<span class="verso-recto">\2</span>',
                                    this_manifest['sequences'][0]['canvases'][0]['label'])
                    if len(this_manifest['sequences'][0]['canvases']) > 1:
                        folios += ' - ' + re.sub(r'(\d+)([rvab]{1,2})', r'\1<span class="verso-recto">\2</span>',
                                                 this_manifest['sequences'][0]['canvases'][-1]['label'])
                    d["label"] = [formulae["title"], ' [fol.' + folios + ']']

                else:
                    d["IIIFviewer"] = []
                    for transcription, t_title, t_partOf, t_siglum in d['transcriptions']:
                        t_id = 'no_image'
                        if "manifest:" + transcription.id in self.app.picture_file:
                            t_id = "manifest:" + transcription.id
                        elif 'wa1' in transcription.id:
                            t_id = self.ms_lib_links['wa1']
                        d["IIIFviewer"].append((t_id,
                                                t_title + ' (' + t_siglum + ')',
                                                t_partOf))

                    self.app.logger.warn(msg='d["IIIFviewer"]: {}'.format(d["IIIFviewer"]))
                    if 'previous_search' in session:
                        result_ids = [x for x in session['previous_search'] if x['id'] == id]
                        if result_ids and any([x.get('highlight') for x in result_ids]):
                            d['text_passage'] = self.highlight_found_sents(d['text_passage'], result_ids)
                    if d['collections']['current']['sigla'] != '':
                        d['collections']['current']['label'] = d['collections']['current']['label'].split(' [')
                        d['collections']['current']['label'][-1] = ' [' + d['collections']['current']['label'][-1]
                filtered_transcriptions = []
                for x in d['transcriptions']:
                    if x[0].id not in ids and x not in filtered_transcriptions:
                        filtered_transcriptions.append(x)
                d['transcriptions'] = filtered_transcriptions
                passage_data['objects'].append(d)
        passage_data['breadcrumb_colls'] = all_parent_colls
        if len(ids) > len(passage_data['objects']):
            # "At least one of the texts you wish to display is not available."
            flash(_('Mindestens ein Text, den Sie anzeigen möchten, ist nicht verfügbar.'))
        passage_data['translation'] = translations
        passage_data['videos'] = [v for k, v in self.VIDEOS.items() if 2 in k][0]
        passage_data['word_graph_url'] = self.app.config['WORD_GRAPH_API_URL']
        return passage_data