Traffic management
We ran into a situation where the access times of our application increased so much that it became almost unusable for real users. To find the reason behind this behavior, we checked the access logs and saw a huge number of accesses to our website by bots. Since the application makes it possible to compare multiple resources by joining them into one URL with one or more ‘+’, there is a huge number of possible links (for \(n\) resources there are \(n^k\) possible links, where \(k\) is the number of ‘+’ plus 1).
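To get a feeling for the scale, a quick back-of-the-envelope calculation in Python (the resource count of 1000 is purely illustrative):

```
# Each of the k = (number of '+') + 1 slots can hold any of the n resources,
# so a crawler can discover n**k distinct comparison URLs.
n = 1000  # hypothetical number of resources
for plus_signs in (1, 2, 3):
    k = plus_signs + 1
    print(f"{plus_signs} '+' sign(s): {n ** k:,} possible URLs")
```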
Varnish
Varnish Cache is a…
Let all ‘+’ lead to 404
Our initial approach was to define a Varnish rule that blocks URL accesses containing ‘+’ (the rule below matches URLs with at least two ‘+’) and returns a 404. This vastly stabilized the application and made it usable again. However, Google warns: “Note that returning ‘no availability’ codes for more than a few days will cause Google to permanently slow or stop crawling URLs on your site, so follow the additional next steps” (source: Handle overcrawling of your site (emergencies)). With this in mind, we should change this behavior in the future:
```
# Request-filtering rule (typically placed in vcl_recv):
# answer every /texts/ URL containing at least two '+' with a synthetic 404.
if (req.url ~ "/texts/") {
    if (req.url ~ "\+.*\+.*") {
        return (synth(404, "Denied by request filtering configuration"));
    }
    return (pass);
}
```
The next step is to comment out this rule and switch to a request-limit approach as described in https://support.platform.sh/hc/en-us/community/posts/16439617864722-Rate-limit-connections-to-your-application-using-Varnish.
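Conceptually, such a request limit is a per-client token bucket: every client may burst a fixed number of requests and then regains them at a steady rate. A minimal Python sketch of the idea (illustrative only; the linked post configures this in Varnish itself):

```
import time
from collections import defaultdict


class TokenBucket:
    """Allow bursts of up to `capacity` requests per client,
    refilled at `rate` tokens per second."""

    def __init__(self, capacity: int = 10, rate: float = 1.0):
        self.capacity = capacity
        self.rate = rate
        # client ip -> (remaining tokens, time of last update)
        self.state = defaultdict(lambda: (float(capacity), time.monotonic()))

    def allow(self, client_ip: str) -> bool:
        tokens, last = self.state[client_ip]
        now = time.monotonic()
        # refill proportionally to the elapsed time, but never above capacity
        tokens = min(float(self.capacity), tokens + (now - last) * self.rate)
        if tokens < 1.0:
            self.state[client_ip] = (tokens, now)
            return False  # over the limit; answer with e.g. HTTP 429
        self.state[client_ip] = (tokens - 1.0, now)
        return True
```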
Before making any changes, please check the syntax of the file via varnishd -C -f /etc/varnish/default.vcl (see this blog post). After making changes, reload Varnish: service varnish reload (https://stackoverflow.com/a/46088507/7924573).
robots.txt
We use a robots.txt file to disallow all bots from crawling pages whose URLs contain a “+”. To achieve this we use the robots.txt wildcard “*”. It is not known whether all bots accept this form of wildcard notation.
If you intend to make changes to the robots.txt, consider checking its functionality beforehand via online tools such as the robots.txt Testing & Validation Tool, or with the quick local check sketched after the file below.
```
# robots.txt
## Although "*" should cover all bots, I did observe that certain ones need specific instructions.
## The goal is to disallow the crawling and automatic processing of all pages with a "+" or "%2B" (its percent-encoded equivalent).
## Since the site is refreshed quite infrequently, we use a Crawl-delay of 30 seconds.
User-agent: *
User-agent: AdsBot-Google
User-agent: Yandex
User-agent: YandexBot
User-agent: AcademicBotRTU
User-agent: Googlebot
Disallow: *+*
Disallow: *%2B*
Crawl-delay: 30
```
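For a local sanity check: Python’s standard urllib.robotparser does not, as far as I know, implement Google-style “*” wildcards, so testing these rules needs a hand-rolled matcher. A minimal sketch (not a full robots.txt implementation; the example paths are made up):

```
import re


def disallowed(url_path: str, patterns=("*+*", "*%2B*")) -> bool:
    """Check a URL path against Google-style Disallow patterns ('*' = any characters)."""
    for pattern in patterns:
        # translate the wildcard pattern into a regex anchored at the start of the path
        regex = "^" + ".*".join(re.escape(part) for part in pattern.split("*"))
        if re.search(regex, url_path):
            return True
    return False


assert disallowed("/texts/urn:a+urn:b/passage")    # literal '+'
assert disallowed("/texts/urn:a%2Burn:b/passage")  # percent-encoded '+'
assert not disallowed("/texts/urn:a/passage")      # single text stays crawlable
```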
This reduced the accesses by over 50% within one day (10k-20k 404s per hour on 09.10.24 vs. 300-6k 404s per hour on 10.10.24). The most prominent of the remaining crawlers on 10.10.24 were AcademicBotRTU and YandexBot; these remained until 15.10.24, when I added explicit rules to block them. Since Google needs at most 24-36 hours to pick up a new robots.txt, I expect other crawlers to react at a similar speed.
Why add Crawl-delay?
Well, Googlebot and Yandex won’t use it, so it won’t cause any harm there and is mostly just ignored (https://support.google.com/webmasters/thread/251817470/putting-crawl-delay-in-robots-txt-file-is-good?hl=en). But some bots do recognize it, so I added the arbitrary limit of 30 seconds, which is for instance honored by AcademicBotRTU.
Why add %2B?
Because it is the percent-encoding equivalent of ‘+’.
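A quick check with Python’s standard library confirms the equivalence:

```
from urllib.parse import quote, unquote

assert quote("+") == "%2B"    # '+' percent-encodes to %2B
assert unquote("%2B") == "+"  # and %2B decodes back to '+'
```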
Which “good” bots frequently crawled our website?
| name | type | remarks |
| --- | --- | --- |
| AcademicBotRTU | academic | |
| YandexBot | search engine | |
| GoogleBot | search engine | |
| BingBot | search engine | |
| Baiduspider | search engine | |
| Nexus 5X Build/MMB29P | search engine | |
| CCBot | ??? | ??? |
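For reference, an overview like this can be produced from the access log by counting User-Agent substrings (a sketch; the log path and format are assumptions):

```
from collections import Counter

BOTS = ["AcademicBotRTU", "YandexBot", "Googlebot", "bingbot",
        "Baiduspider", "Nexus 5X Build/MMB29P", "CCBot"]
counts = Counter()

with open("/var/log/nginx/access.log") as log:  # hypothetical path, combined log format
    for line in log:
        for bot in BOTS:
            if bot.lower() in line.lower():
                counts[bot] += 1
                break  # count each request at most once

for bot, n in counts.most_common():
    print(f"{bot}: {n}")
```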
What if a crawler ignores robots.txt?
nginx
flask
Controlled by the MAX_NUMBER_OF_TEXTS_FOR_NOT_AUTHENTICATED_USER environment variable, which is then used by the r_multipassage method.
By default SERVER_TYPE is set to ‘dev’, so authentication is not required. If you want to activate the authentication requirement, set it to ‘production’. To do so, create or modify the .env file in the root directory with the following variables:
```
SERVER_TYPE = "production"
MAX_NUMBER_OF_TEXTS_FOR_NOT_AUTHENTICATED_USER = 1
```
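To make the interplay of the two variables concrete, here is the guard from r_multipassage distilled into a standalone function (the function name needs_login is mine, and the current_user check is reduced to a boolean parameter):

```
def needs_login(object_ids: str, subreferences: str, *,
                threshold: int = 1, server_type: str = "production",
                authenticated: bool = False) -> bool:
    """Return True if the request would be rejected with HTTP 401."""
    # n joined texts contain n - 1 '+' signs, so threshold = 1 allows single texts only.
    too_many = (threshold <= object_ids.count("+")
                or threshold <= subreferences.count("+"))
    return too_many and not authenticated and "dev" not in server_type.lower()


assert not needs_login("urn:a", "1")                              # one text: open to all
assert needs_login("urn:a+urn:b", "1+2")                          # two texts: login required
assert not needs_login("urn:a+urn:b", "1+2", authenticated=True)  # logged in: allowed
assert not needs_login("urn:a+urn:b", "1+2", server_type="dev")   # dev server: unrestricted
```

The relevant excerpts from the configuration and the view method follow.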
```
import os

# assumed to be defined at the top of config.py
basedir = os.path.abspath(os.path.dirname(__file__))


class Config(object):
    SECRET_KEY = os.environ.get('NEMO_KEY') or 'you-will-never-guess'
    SQLALCHEMY_DATABASE_URI = os.environ.get('DATABASE_URL') or 'sqlite:///' + os.path.join(basedir, 'app.db')
    SQLALCHEMY_TRACK_MODIFICATIONS = False
    POSTS_PER_PAGE = 10
    # if the environment variable ELASTICSEARCH_URL is not set, the application starts without it
    ELASTICSEARCH_URL = os.environ.get('ELASTICSEARCH_URL').split(';') if os.environ.get('ELASTICSEARCH_URL') else False
    ES_CLIENT_CERT = os.environ.get('ES_CLIENT_CERT', '')
    ES_CLIENT_KEY = os.environ.get('ES_CLIENT_KEY', '')
    LANGUAGES = ['en', 'de', 'fr']
    BABEL_DEFAULT_LOCALE = 'de'
    CORPUS_FOLDERS = os.environ.get('CORPUS_FOLDERS').split(';') if os.environ.get('CORPUS_FOLDERS') else ["../formulae-corpora/"]
    INFLECTED_LEM_JSONS = os.environ.get('INFLECTED_LEM_JSONS').split(';') if os.environ.get('INFLECTED_LEM_JSONS') else []
    LEM_TO_LEM_JSONS = os.environ.get('LEM_TO_LEM_JSONS').split(';') if os.environ.get('LEM_TO_LEM_JSONS') else []
    DEAD_URLS = os.environ.get('DEAD_URLS').split(';') if os.environ.get('DEAD_URLS') else []
    COMP_PLACES = os.environ.get('COMP_PLACES').split(';') if os.environ.get('COMP_PLACES') else []
    LEMMA_LISTS = os.environ.get('LEMMA_LISTS').split(';') if os.environ.get('LEMMA_LISTS') else []
    COLLECTED_COLLS = os.environ.get('COLLECTED_COLLS').split(';') if os.environ.get('COLLECTED_COLLS') else []
    # TERM_VECTORS = os.environ.get('TERM_VECTORS')
    CACHE_DIRECTORY = os.environ.get('NEMO_CACHE_DIR') or './cache/'
    MAIL_SERVER = os.environ.get('MAIL_SERVER')
    MAIL_PORT = int(os.environ.get('MAIL_PORT') or 25)
    MAIL_USE_TLS = os.environ.get('MAIL_USE_TLS') is not None
    MAIL_USERNAME = os.environ.get('MAIL_USERNAME')
    MAIL_PASSWORD = os.environ.get('MAIL_PASSWORD')
    ADMINS = os.environ.get('ADMINS').split(';') if os.environ.get('ADMINS') else ['no-reply@example.com']
    SESSION_TYPE = 'filesystem'
    IIIF_SERVER = os.environ.get('IIIF_SERVER')
    IIIF_MAPPING = os.environ.get('IIIF_MAPPING') or ';'.join(['{}/iiif'.format(f) for f in CORPUS_FOLDERS])
    # This should only be changed to True when collecting search queries and responses for mocking ES
    SAVE_REQUESTS = False
    CACHE_MAX_AGE = os.environ.get('VARNISH_MAX_AGE') or 0  # Only needed on the server, where this should be set in the environment
    PDF_ENCRYPTION_PW = os.environ.get('PDF_ENCRYPTION_PW', 'hard_pw')
    SESSION_COOKIE_SECURE = os.environ.get('SESSION_COOKIE_SECURE', True)
    REMEMBER_COOKIE_SECURE = os.environ.get('REMEMBER_COOKIE_SECURE', True)
    REDIS_URL = os.environ.get('REDIS_URL') or 'redis://'
    PREFERRED_URL_SCHEME = os.environ.get('PREFERRED_URL_SCHEME', 'http')
    VIDEO_FOLDER = os.environ.get('VIDEO_FOLDER') or ''
    COLLATE_API_URL = os.environ.get('COLLATE_API_URL', 'http://localhost:7300')
    WORD_GRAPH_API_URL = os.environ.get('WORD_GRAPH_API_URL', '')
    # Used to decide whether authentication is needed for certain resources
    # (dev -> access to all without restriction; production -> restricted access for non-authenticated users)
    SERVER_TYPE = os.environ.get('SERVER_TYPE', 'dev')
    # Number of texts a not-authenticated user should be able to see. int > 0
    try:
        MAX_NUMBER_OF_TEXTS_FOR_NOT_AUTHENTICATED_USER = int(os.environ.get('MAX_NUMBER_OF_TEXTS_FOR_NOT_AUTHENTICATED_USER'))
        if not MAX_NUMBER_OF_TEXTS_FOR_NOT_AUTHENTICATED_USER > 0:
            MAX_NUMBER_OF_TEXTS_FOR_NOT_AUTHENTICATED_USER = 1
    except (TypeError, ValueError):
        # TypeError: variable not set; ValueError: value is not an integer
        MAX_NUMBER_OF_TEXTS_FOR_NOT_AUTHENTICATED_USER = 1
```
```
def r_multipassage(self, objectIds: str, subreferences: str, lang: str = None, collate: bool = False) -> Dict[str, Any]:
    """ Retrieve the text of the passage

    :param objectIds: Collection identifiers separated by '+'
    :param lang: Lang in which to express main data
    :param subreferences: Reference identifiers separated by '+'
    :param collate: Whether the passages should be collated via the collate API
    :return: Template, collections metadata and Markup object representing the text
    """
    # authentication check:
    texts_threshold = self.app.config['MAX_NUMBER_OF_TEXTS_FOR_NOT_AUTHENTICATED_USER']
    if (texts_threshold <= objectIds.count('+') or texts_threshold <= subreferences.count('+')) and \
            not current_user.is_authenticated and \
            'dev' not in self.app.config['SERVER_TYPE'].lower():
        abort(401)

    if 'reading_format' not in session:
        session['reading_format'] = 'columns'
    ids = objectIds.split('+')
    translations = {}
    view = 1
    passage_data = {'template': 'main::multipassage.html', 'objects': [], "translation": {}}
    subrefers = subreferences.split('+')
    all_parent_colls = list()
    collate_html_dict = dict()
    if 'collate' in request.values:
        collate_html_dict = self.call_collate_api(obj_ids=ids, subrefers=subrefers)
    if len(subrefers) != len(ids):
        abort(404)
    for i, id in enumerate(ids):
        manifest_requested = False
        if id in self.dead_urls:
            id = self.dead_urls[id]
        if "manifest:" in id:
            id = re.sub(r'^manifest:', '', id)
            manifest_requested = True
        if self.check_project_team() is True or id in self.open_texts:
            if subrefers[i] in ["all", 'first']:
                subref = self.get_reffs(id)[0][0]
            else:
                subref = subrefers[i]
            d = self.r_passage(id, subref, lang=lang)
            d['text_passage'] = collate_html_dict.get(id, d['text_passage'])
            d.update(self.get_prev_next_texts(d['objectId']))
            del d['template']
            parent_colls = defaultdict(list)
            parent_colls[99] = [(id, str(d['collections']['current']['label'].replace('<br>', ' ')))]
            parent_textgroups = [x for x in d['collections']['parents'] if 'cts:textgroup' in x['subtype']]
            for parent_coll in parent_textgroups:
                grandparent_depths = list()
                for grandparent in parent_coll['ancestors'].values():
                    start_number = 0
                    if 'cts:textgroup' in grandparent.subtype:
                        start_number = 1
                    grandparent_depths.append(len([x for x in self.make_parents(grandparent) if 'cts:textgroup' in x['subtype']]) + start_number)
                max_grandparent_depth = max(grandparent_depths)
                parent_colls[max_grandparent_depth].append((parent_coll['id'], str(parent_coll['short_title'])))
            all_parent_colls.append([v for k, v in sorted(parent_colls.items())])
            translations[id] = []
            for x in d.pop('translations', []):  # default to [] so a missing key cannot crash the loop
                if x[0].id not in ids and x not in translations[id]:
                    translations[id].append(x)
            if manifest_requested:
                # This is when there are multiple manuscripts and the edition cannot be tied to any single one of them
                formulae = dict()
                if 'manifest:' + d['collections']['current']['id'] in self.app.picture_file:
                    formulae = self.app.picture_file['manifest:' + d['collections']['current']['id']]
                d['alt_image'] = ''
                if os.path.isfile(self.app.config['IIIF_MAPPING'] + '/' + 'alternatives.json'):
                    with open(self.app.config['IIIF_MAPPING'] + '/' + 'alternatives.json') as f:
                        alt_images = json_load(f)
                        d['alt_image'] = alt_images.get(id)
                d["objectId"] = "manifest:" + id
                d["div_v"] = "manifest" + str(view)
                view = view + 1
                del d['text_passage']
                del d['notes']
                if formulae == {}:
                    d['manifest'] = None
                    d["label"] = [d['collections']['current']['label'], '']
                    d['lib_link'] = self.ms_lib_links.get(id.split(':')[-1].split('.')[0], '')
                    passage_data['objects'].append(d)
                    continue
                # this viewer works when the library or archive provides an IIIF API for external use of their books
                d["manifest"] = url_for('viewer.static', filename=formulae["manifest"])
                with open(self.app.config['IIIF_MAPPING'] + '/' + formulae['manifest']) as f:
                    this_manifest = json_load(f)
                self.app.logger.warn("this_manifest['@id'] {}".format(this_manifest['@id']))
                if 'fuldig.hs-fulda.de' in this_manifest['@id']:
                    # This works for resources from https://fuldig.hs-fulda.de/
                    d['lib_link'] = this_manifest['sequences'][0]['canvases'][0]['rendering'][1]['@id']
                elif 'gallica.bnf.fr' in this_manifest['@id']:
                    # This link needs to be constructed from the thumbnail link for images from https://gallica.bnf.fr/
                    d['lib_link'] = this_manifest['sequences'][0]['canvases'][0]['thumbnail']['@id'].replace('.thumbnail', '')
                    self.app.logger.warn("gallica.bnf.fr: lib_link created:{}".format(d['lib_link']))
                elif 'api.digitale-sammlungen.de' in this_manifest['@id']:
                    # This works for resources from the Bayerische Staatsbibliothek
                    # (and perhaps other German digital libraries?)
                    d['lib_link'] = this_manifest['sequences'][0]['canvases'][0]['@id'] + '/view'
                elif 'digitalcollections.universiteitleiden.nl' in this_manifest['@id']:
                    # This works for resources from the Leiden University Library
                    # This links to the manuscript as a whole.
                    # I am not sure how to link to specific pages in their IIIF viewer.
                    d['lib_link'] = 'https://iiifviewer.universiteitleiden.nl/?manifest=' + this_manifest['@id']
                elif 'digi.vatlib.it' in this_manifest['@id']:
                    # This works for resources from the Vatican Libraries
                    d['lib_link'] = this_manifest['sequences'][0]['canvases'][0]['@id'].replace('iiif', 'view').replace('canvas/p', '')
                elif 'digital.blb-karlsruhe.de' in this_manifest['@id']:
                    # This works for resources from the BLB Karlsruhe
                    # This links to the manuscript as a whole.
                    # I am not sure how to link to specific pages in their IIIF viewer.
                    d['lib_link'] = 'https://i3f.vls.io/?collection=i3fblbk&id=' + this_manifest['@id']
                elif 'www.e-codices.unifr.ch' in this_manifest['@id']:
                    # This works for resources from e-codices
                    d['lib_link'] = this_manifest['related'].replace('/list/one', '') + '/' + this_manifest['sequences'][0]['canvases'][0]['label']

                self.app.logger.debug(msg="lib_link: {}".format(d['lib_link']))

                folios = re.sub(r'(\d+)([rvab]{1,2})', r'\1<span class="verso-recto">\2</span>',
                                this_manifest['sequences'][0]['canvases'][0]['label'])
                if len(this_manifest['sequences'][0]['canvases']) > 1:
                    folios += ' - ' + re.sub(r'(\d+)([rvab]{1,2})', r'\1<span class="verso-recto">\2</span>',
                                             this_manifest['sequences'][0]['canvases'][-1]['label'])
                d["label"] = [formulae["title"], ' [fol.' + folios + ']']

            else:
                d["IIIFviewer"] = []
                for transcription, t_title, t_partOf, t_siglum in d['transcriptions']:
                    t_id = 'no_image'
                    if "manifest:" + transcription.id in self.app.picture_file:
                        t_id = "manifest:" + transcription.id
                    elif 'wa1' in transcription.id:
                        t_id = self.ms_lib_links['wa1']
                    d["IIIFviewer"].append((t_id,
                                            t_title + ' (' + t_siglum + ')',
                                            t_partOf))

                self.app.logger.warn(msg='d["IIIFviewer"]: {}'.format(d["IIIFviewer"]))
                if 'previous_search' in session:
                    result_ids = [x for x in session['previous_search'] if x['id'] == id]
                    if result_ids and any([x.get('highlight') for x in result_ids]):
                        d['text_passage'] = self.highlight_found_sents(d['text_passage'], result_ids)
                if d['collections']['current']['sigla'] != '':
                    d['collections']['current']['label'] = d['collections']['current']['label'].split(' [')
                    d['collections']['current']['label'][-1] = ' [' + d['collections']['current']['label'][-1]
                filtered_transcriptions = []
                for x in d['transcriptions']:
                    if x[0].id not in ids and x not in filtered_transcriptions:
                        filtered_transcriptions.append(x)
                d['transcriptions'] = filtered_transcriptions
            passage_data['objects'].append(d)
    passage_data['breadcrumb_colls'] = all_parent_colls
    if len(ids) > len(passage_data['objects']):
        # "At least one of the texts you wish to display is not available."
        flash(_('Mindestens ein Text, den Sie anzeigen möchten, ist nicht verfügbar.'))
    passage_data['translation'] = translations
    passage_data['videos'] = [v for k, v in self.VIDEOS.items() if 2 in k][0]
    passage_data['word_graph_url'] = self.app.config['WORD_GRAPH_API_URL']
    return passage_data
```