Traffic management

We experienced a situation in which the access times of our application increased so much that it became almost unusable for real users. To find the cause of this behavior, we checked the access logs and saw a huge number of requests to our website by bots. Since the application makes it possible to compare multiple resources by joining them into one URL with one or more ‘+’ characters, the number of possible links is huge: for \(n\) resources there are \(n^k\) URLs, where \(k\) is the number of ‘+’ characters plus one.
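To illustrate the scale (with made-up numbers, not our actual corpus size), the crawlable URL space can be computed directly:

```python
# n resources joined by k-1 '+' separators yield n**k distinct comparison URLs.
n = 1000  # hypothetical number of resources
for plus_signs in range(3):
    k = plus_signs + 1
    print(f"{plus_signs} separator(s): {n ** k:,} possible URLs")
```

Already with two ‘+’ signs, a crawler that follows every link faces a billion distinct pages, each of which misses the cache on first access.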

Varnish

Varnish Cache is a caching HTTP reverse proxy (a "web application accelerator") that sits in front of the application and answers repeated requests from its cache.

Before making any changes, please check the syntax of the file with varnishd -C -f /etc/varnish/default.vcl (see this blog post).

After changing it, reload the configuration with service varnish reload (https://stackoverflow.com/a/46088507/7924573). The following is the current Varnish configuration, which blocks access for clients that identify themselves as bots:

Current version of default.vcl
#
# This is an example VCL file for Varnish.
#
# It does not do anything by default, delegating control to the
# builtin VCL. The builtin VCL is called when there is no explicit
# return statement.
#
# See the VCL chapters in the Users Guide at https://www.varnish-cache.org/docs/
# and https://www.varnish-cache.org/trac/wiki/VCLExamples for more examples.

# Marker to tell the VCL compiler that this VCL has been adapted to the
# new 4.0 format.
vcl 4.0;

# Default backend definition. This points to the gunicorn server.
backend default {
    .host = "127.0.0.1";
    .port = "8000";
}

sub vcl_recv {
    # Happens before we check if we have this in cache already.
    #
    # Typically you clean up the request here, removing cookies you don't need,
    # rewriting the request, etc.

    # deny requests for this internal URN
    if (req.url ~ "urn:cts:formulae:pancarte_noir_internal") {
        return (synth(404, "Denied by request filtering configuration"));
    }
    if (req.http.user-agent ~ "Bot") {
        if (req.url ~ "\+" || req.url ~ "%2B") {
            return (synth(404, "Denied by request filtering configuration"));
        }
        return (pass);
    }
    if (req.http.user-agent ~ "bot") {
        if (req.url ~ "\+" || req.url ~ "%2B") {
            return (synth(404, "Denied by request filtering configuration"));
        }
        return (pass);
    }
    if (req.http.user-agent ~ "Bytespider") {
        if (req.url ~ "\+" || req.url ~ "%2B") {
            return (synth(404, "Denied by request filtering configuration"));
        }
        return (pass);
    }
    if (req.http.referer ~ "google") {
        if (req.url ~ "\+" || req.url ~ "%2B") {
            return (synth(404, "Denied by request filtering configuration"));
        }
        return (pass);
    }
    if (req.url ~ "/texts/") {
        #if (req.url ~ "\+.*\+.*") {
        #    return (synth(404, "Denied by request filtering configuration"));
        #}
        return (pass);
    }
    # remove the Matomo tracking cookies (their names start with an underscore)
    set req.http.Cookie = regsuball(req.http.Cookie, "(^|;\s*)(_[_a-z0-9\.]+)=[^;]*", "");
    set req.http.Cookie = regsub(req.http.Cookie, "^;\s*", "");
    # save the cookies before the built-in vcl_recv (allows caching of pages with cookies)
    set req.http.Cookie-Backup = req.http.Cookie;
    unset req.http.Cookie;
    # To deal with the Authorization header I should probably do the same thing I did with the Cookie header above,
    # i.e., put it in an Authorization-Backup header, then restore it to the Authorization header in vcl_hash.
    # I will then need to get the backend to send a Vary header that says the response should be varied according to Authorization.
    # See https://stackoverflow.com/questions/35119283/best-pratice-for-varnish-cache-content-with-authorization-header for some guidance
}

sub vcl_backend_response {
    # Happens after we have read the response headers from the backend.
    #
    # Here you clean the response headers, removing silly Set-Cookie headers
    # and other mistakes your backend does.

    # I think I should move the Set-Cookie header into a temporary header and then reset it again in vcl_deliver
    #set beresp.http.Set-Cookie-Backup = beresp.http.Set-Cookie;
    #unset beresp.http.Set-Cookie;
}

sub vcl_deliver {
    # Happens when we have all the pieces we need, and are about to send the
    # response to the client.
    #
    # You can do accounting or modifying the final object here.

    # Reset the Set-Cookie header
    #set resp.http.Set-Cookie = resp.http.Set-Cookie-Backup;
    #unset resp.http.Set-Cookie-Backup;
}

sub vcl_hash {
    if (req.http.Cookie-Backup) {
        # restore the cookies before the lookup if any. This should allow auth cookies to affect results
        # may need to add an HTTP Vary header on the backend to send either project or non-project pages
        set req.http.Cookie = req.http.Cookie-Backup;
        unset req.http.Cookie-Backup;
    }
}

Restricting parallel resources

Our initial approach was to define a Varnish rule that blocked every URL containing a ‘+’ and returned a 404. This vastly stabilized the application and made it usable again. However, Google warns: "Note that returning 'no availability' codes for more than a few days will cause Google to permanently slow or stop crawling URLs on your site, so follow the additional next steps" (source: Handle overcrawling of your site (emergencies)). With this in mind, we should change this behavior in the future:
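In line with that guidance, answering with 429 or 503 plus a Retry-After header signals temporary overload without telling crawlers the pages are permanently gone. A minimal application-side sketch of such a policy (plain Python; the function name and the one-hour back-off are illustrative, not the deployed rule):

```python
def overload_response(user_agent: str, url: str) -> tuple:
    """Decide how to answer a request for a '+'-joined comparison URL.

    Bots asking for combined resources get 429 plus Retry-After (a polite
    back-off signal), instead of a 404 that reads as permanent removal.
    """
    is_bot = 'bot' in user_agent.lower()
    is_combined = '+' in url or '%2B' in url
    if is_bot and is_combined:
        return 429, {'Retry-After': '3600'}  # ask the crawler to retry later
    return 200, {}

print(overload_response('AcademicBotRTU/1.0', '/texts/urn:a+urn:b/passage'))
```

The same decision could of course also be expressed as a synth response in vcl_recv instead of in the application.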

Snippet from the OLD default.vcl including the first and most restrictive filtering rule.
if (req.url ~ "/texts/") {
   if (req.url ~ "\+.*\+.*") {
      return (synth(404, "Denied by request filtering configuration"));
   }
   return (pass);
}

Cookies

“Varnish will, in the default configuration, not cache an object coming from the backend with a ‘Set-Cookie’ header present. Also, if the client sends a Cookie header, Varnish will bypass the cache and go directly to the backend.” We therefore decided to strip certain tracking cookies in vcl_recv (see the regsuball calls in the configuration above).
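The effect of the two substitutions in vcl_recv can be reproduced in Python (same regexes, translated from VCL syntax; the cookie names below are examples):

```python
import re

def strip_tracking_cookies(cookie_header: str) -> str:
    """Drop cookies whose names start with '_' (Matomo/analytics style),
    mirroring the regsuball/regsub pair in vcl_recv."""
    cookie_header = re.sub(r'(^|;\s*)(_[_a-z0-9.]+)=[^;]*', '', cookie_header)
    return re.sub(r'^;\s*', '', cookie_header)  # trim a leftover leading '; '

print(strip_tracking_cookies('_pk_id.1.fff=abc; session=xyz; _ga=1'))
# session=xyz
```

Only the session cookie survives, so Varnish can still vary cached objects on it in vcl_hash.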

Cache Control with max-age

“The ‘Cache-Control’ header instructs caches how to handle the content. Varnish cares about the max-age parameter and uses it to calculate the TTL for an object.” - cache-control. We therefore added a default max-age value in the config:

config.py
CACHE_MAX_AGE = os.environ.get('VARNISH_MAX_AGE') or 0  # Only need cache on the server, where this should be set in env

This value is later used in the application to construct the header:

    def after_request(self, response: Response) -> Response:
        """
        Post-processes the response object after each request.

        - Disables caching for authentication and language-switching routes.
        - Applies extended caching for static assets.
        - Sets general cache-control headers using `CACHE_MAX_AGE` for all other responses.
        - Persists certain variables from `g` to the session.
        """
        path = request.path

        # Disable caching for login and language-switch
        if path.startswith('/auth/') or path.startswith('/lang/'):
            response.cache_control.no_cache = True
            response.cache_control.no_store = True
            response.cache_control.must_revalidate = True
            response.headers['Pragma'] = 'no-cache'
            response.headers['Expires'] = '0'
        # Long caching for assets
        elif path.startswith('/assets/'):
            response.cache_control.max_age = 60 * 60 * 24  # 1 day
            response.cache_control.public = True
        else:
            # Default caching for other routes
            max_age = self.app.config['CACHE_MAX_AGE']
            self.app.logger.debug(f"Applying default cache for path: {path}, max_age={max_age}")
            response.cache_control.max_age = max_age
            response.cache_control.public = True
            response.vary = 'Cookie'  # vary by session
        if getattr(g, 'previous_search', None) is not None:
            session['previous_search'] = g.previous_search
        if getattr(g, 'previous_search_args', None):
            session['previous_search_args'] = g.previous_search_args
        if getattr(g, 'previous_aggregations', None):
            session['previous_aggregations'] = g.previous_aggregations
        if getattr(g, 'highlighted_words', None):
            session['highlighted_words'] = g.highlighted_words
        return response

Previously we had if re.search('/(lang|auth|texts)/', request.url): response.cache_control.no_cache = True, which meant these resources were never cached. Since in production we refresh a resource at most daily, this restriction did not seem practical. I removed the clause, so everything except the assets now falls under the max-age mentioned above. I see no problem in setting it to 24*60*60 seconds; the assets could arguably be cached even longer.

Rate limit

Planned options:

- https://github.com/nand2/libvmod-throttle
- https://support.platform.sh/hc/en-us/community/posts/16439617864722-Rate-limit-connections-to-your-application-using-Varnish
- https://varnish-cache.org/vmods/ (the rate-limit vmod is still at the development stage)
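Until one of these vmods is adopted, the idea can be prototyped on the application side. Below is a hedged sketch of a per-client token bucket (class name and limits are invented for illustration; the deployed solution is still open):

```python
import time
from collections import defaultdict

class TokenBucket:
    """Per-client token bucket: clients regain `rate` tokens per second,
    up to a burst allowance of `capacity` requests."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        # each client starts with a full bucket
        self.buckets = defaultdict(lambda: (capacity, time.monotonic()))

    def allow(self, client_ip: str) -> bool:
        tokens, last = self.buckets[client_ip]
        now = time.monotonic()
        # refill proportionally to the time elapsed since the last request
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens < 1:
            self.buckets[client_ip] = (tokens, now)
            return False  # bucket empty: answer with 429
        self.buckets[client_ip] = (tokens - 1, now)
        return True

bucket = TokenBucket(rate=2.0, capacity=10)  # ~2 requests/s, burst of 10
```

A request arriving with an empty bucket would then be answered with 429, analogous to the vmod-based throttling linked above.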

robots.txt

We use a robots.txt file to disallow all bots from crawling pages whose URL contains a “+”. To achieve this we use the robots.txt wildcard character “*”. It is not known whether all bots accept this form of wildcard notation.

If you intend to make changes to the robots.txt, consider checking its functionality beforehand via online tools such as the robots.txt Testing & Validation Tool.

robots.txt
# robots.txt
## Although "*" should cover all bots, I observed that certain ones need specific instructions.
## The goal is to disallow the crawling and automatic processing of all sites with a "+" or "%2B" (its percent-encoded equivalent)
## Since the site is refreshed quite infrequently, we use a Crawl-delay of 30 seconds, although Google ignores this parameter
## I disallowed add_text because it is merely a redirect. If this is no longer the case, change this file.
## I used https://technicalseo.com/tools/robots-txt/ to validate this file.
User-agent: *
User-agent: AdsBot-Google
User-agent: Yandex
User-agent: YandexBot
User-agent: AcademicBotRTU
User-agent: Googlebot
Disallow: *+*
Disallow: *%2B*
Disallow: */add_text/*
Disallow: */pdf/*
Crawl-delay: 30
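Besides online validators, the wildcard rules can be sanity-checked locally. Python's stdlib urllib.robotparser does not implement wildcard matching, so the sketch below hand-rolls Google-style semantics (“*” matches any character sequence, “$” anchors the end); it is a test aid only, not a complete robots.txt parser:

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Google-style robots.txt path matching: '*' is a wildcard,
    '$' anchors the match to the end of the URL."""
    parts = []
    for c in pattern:
        if c == '*':
            parts.append('.*')
        elif c == '$':
            parts.append('$')
        else:
            parts.append(re.escape(c))
    return re.match(''.join(parts), path) is not None

for rule in ('*+*', '*%2B*', '*/add_text/*', '*/pdf/*'):
    print(rule, rule_matches(rule, '/texts/urn:x:a+urn:x:b'))
```

A comparison URL like /texts/urn:x:a+urn:x:b is caught by the first rule; its percent-encoded form would be caught by the second.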

This reduced the accesses by over 50% within one day (10k-20k 404s per hour on 09.10.24 vs. 300-6k 404s per hour on 10.10.24). The most prominent remaining crawlers on 10.10.24 were AcademicBotRTU and YandexBot; these remained until 15.10.24, when I added explicit rules to block them. Since Google needs at most 24-36 hours to pick up a new robots.txt, I expect other crawlers to have a similar turnaround.

Why add crawl-delay?

Well, Googlebot and Yandex won't use it, so for them it does no harm and is simply ignored (https://support.google.com/webmasters/thread/251817470/putting-crawl-delay-in-robots-txt-file-is-good?hl=en). But some bots do recognize it, so I added an arbitrary limit of 30 seconds, which is, for instance, honored by AcademicBotRTU.

Why add %2B?

Because it is the percent-encoding equivalent of ‘+’: encoded requests may contain ‘%2B’ where the decoded URL would contain a literal ‘+’.
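The equivalence is easy to verify with the standard library:

```python
from urllib.parse import quote, unquote

# A literal '+' percent-encodes to '%2B'; crawlers may request either form,
# which is why robots.txt has to disallow both spellings.
print(quote('+', safe=''))  # -> %2B
print(unquote('%2B'))       # -> +
```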

Which “good” bots frequently crawled our website?

crawler-list

name | url | remarks
--- | --- | ---
AcademicBotRTU | https://academicbot.rtu.lv/ | academic
YandexBot | – | search engine
GoogleBot | – | search engine
BingBot | – | search engine
Baiduspider | https://www.baidu.com/search/robots_english.html | search engine
Nexus 5X Build/MMB29P | https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers?hl=de#googlebot-smartphone | search engine
CCBot | ??? | ???
Nexus 5X Build/MMB29P | https://developers.facebook.com/docs/sharing/webmasters/web-crawlers | search engine

What if a crawler ignores robots.txt?

https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/

nginx

flask

In the previous section, the Varnish way of limiting access was introduced. In addition, there is a mechanism for controlling access on the application side. It is governed by the MAX_NUMBER_OF_TEXTS_FOR_NOT_AUTHENTICATED_USER environment variable, which is used by the r_multipassage method. So, if you see a rising number of 504 errors, it can help to reduce this parameter.

By default, SERVER_TYPE is set to ‘dev’, so authentication is not required. If you want to activate the authentication requirement, set it to ‘production’:

1. Create or modify the .env file in the root directory:

   cd formulae-capitains-nemo
   nano .env

2. Add the following variables:

   SERVER_TYPE = "production"
   MAX_NUMBER_OF_TEXTS_FOR_NOT_AUTHENTICATED_USER = 1

3. Reload the application for the variables to take effect.
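The threshold check itself (shown in full in r_multipassage below) boils down to the following logic — a standalone restatement for illustration, with the config values passed in as plain arguments:

```python
def access_allowed(object_ids: str, subreferences: str,
                   threshold: int, server_type: str,
                   is_authenticated: bool) -> bool:
    """Mirror of the authentication check in r_multipassage:
    unauthenticated users on a non-dev server may not request more texts
    than MAX_NUMBER_OF_TEXTS_FOR_NOT_AUTHENTICATED_USER allows."""
    too_many = (threshold <= object_ids.count('+')
                or threshold <= subreferences.count('+'))
    if too_many and not is_authenticated and 'dev' not in server_type.lower():
        return False  # the view aborts with 401 at this point
    return True

print(access_allowed('urn:a+urn:b', '1+1', threshold=1,
                     server_type='production', is_authenticated=False))
# False
```

With threshold 1, any unauthenticated comparison request (one or more ‘+’) on a production server is rejected, while authenticated users and dev servers are unaffected.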

Config.py
class Config(object):
    SECRET_KEY = os.environ.get('NEMO_KEY') or 'you-will-never-guess'
    SQLALCHEMY_DATABASE_URI = os.environ.get('DATABASE_URL') or 'sqlite:///' + os.path.join(basedir, 'app.db')
    SQLALCHEMY_TRACK_MODIFICATIONS = False
    POSTS_PER_PAGE = 10
    # if the environment variable ELASTICSEARCH_URL is not set, the application starts without it
    ELASTICSEARCH_URL = os.environ.get('ELASTICSEARCH_URL').split(';') if os.environ.get('ELASTICSEARCH_URL') else False
    ES_CLIENT_CERT = os.environ.get('ES_CLIENT_CERT', '')
    ES_CLIENT_KEY = os.environ.get('ES_CLIENT_KEY', '')
    LANGUAGES = ['en', 'de', 'fr']
    BABEL_DEFAULT_LOCALE = 'de'
    CORPUS_FOLDERS = os.environ.get('CORPUS_FOLDERS').split(';') if os.environ.get('CORPUS_FOLDERS') else ["../formulae-corpora/"]
    INFLECTED_LEM_JSONS = os.environ.get('INFLECTED_LEM_JSONS').split(';') if os.environ.get('INFLECTED_LEM_JSONS') else []
    LEM_TO_LEM_JSONS = os.environ.get('LEM_TO_LEM_JSONS').split(';') if os.environ.get('LEM_TO_LEM_JSONS') else []
    DEAD_URLS = os.environ.get('DEAD_URLS').split(';') if os.environ.get('DEAD_URLS') else []
    COMP_PLACES = os.environ.get('COMP_PLACES').split(';') if os.environ.get('COMP_PLACES') else []
    LEMMA_LISTS = os.environ.get('LEMMA_LISTS').split(';') if os.environ.get('LEMMA_LISTS') else []
    COLLECTED_COLLS = os.environ.get('COLLECTED_COLLS').split(';') if os.environ.get('COLLECTED_COLLS') else []
    # TERM_VECTORS = os.environ.get('TERM_VECTORS')
    CACHE_DIRECTORY = os.environ.get('NEMO_CACHE_DIR') or './cache/'
    ##### MAILING
    MAIL_SERVER = os.environ.get('MAIL_SERVER')
    MAIL_PORT = int(os.environ.get('MAIL_PORT') or 25)
    MAIL_USE_TLS = os.environ.get('MAIL_USE_TLS') is not None
    MAIL_USERNAME = os.environ.get('MAIL_USERNAME')
    MAIL_PASSWORD = os.environ.get('MAIL_PASSWORD')
    ADMINS = os.environ.get('ADMINS').split(';') if os.environ.get('ADMINS') else ['no-reply@example.com']
    SEND_MAILS_TO_ADMINS = os.environ.get('SEND_MAILS_TO_ADMINS', False)
    MAIL_USE_TLS = True
    MAIL_USE_SSL = False
    ######
    SESSION_TYPE = 'filesystem'
    IIIF_SERVER = os.environ.get('IIIF_SERVER')
    # This folder houses all transcription / manuscript images
    # Particularly it is required to have a file named 'Mapping.json'
    IIIF_MAPPING = os.environ.get('IIIF_MAPPING') or ';'.join(['{}/iiif'.format(f) for f in CORPUS_FOLDERS])
    # This should only be changed to True when collecting search queries and responses for mocking ES
    SAVE_REQUESTS = False
    CACHE_MAX_AGE = os.environ.get('VARNISH_MAX_AGE') or 0  # Only need cache on the server, where this should be set in env
    PDF_ENCRYPTION_PW = os.environ.get('PDF_ENCRYPTION_PW', 'hard_pw')
    SESSION_COOKIE_SECURE = os.environ.get('SESSION_COOKIE_SECURE', True)
    REMEMBER_COOKIE_SECURE = os.environ.get('REMEMBER_COOKIE_SECURE', True)
    REDIS_URL = os.environ.get('REDIS_URL') or 'redis://'
    PREFERRED_URL_SCHEME = os.environ.get('PREFERRED_URL_SCHEME', 'http')
    VIDEO_FOLDER = os.environ.get('VIDEO_FOLDER') or ''
    COLLATE_API_URL = os.environ.get('COLLATE_API_URL', 'http://localhost:7300')
    WORD_GRAPH_API_URL = os.environ.get('WORD_GRAPH_API_URL', '')
    # Used to decide whether authentication is needed for certain resources (dev -> access to all without restriction; production -> restricted access for non-authenticated users)
    SERVER_TYPE = os.environ.get('SERVER_TYPE', 'dev')
    # Number of texts a not-authenticated user should be able to see. int > 0
    try:
        MAX_NUMBER_OF_TEXTS_FOR_NOT_AUTHENTICATED_USER = int(os.environ.get('MAX_NUMBER_OF_TEXTS_FOR_NOT_AUTHENTICATED_USER'))
        if not MAX_NUMBER_OF_TEXTS_FOR_NOT_AUTHENTICATED_USER > 0:
            MAX_NUMBER_OF_TEXTS_FOR_NOT_AUTHENTICATED_USER = 1
    except TypeError:
        MAX_NUMBER_OF_TEXTS_FOR_NOT_AUTHENTICATED_USER = 1
r_multipassage()
    def r_multipassage(self, objectIds: str, subreferences: str, lang: str = None, collate: bool = False) -> Dict[str, Any]:
        """ Retrieve the text of the passage

        :param objectIds: Collection identifiers separated by '+'
        :param lang: Lang in which to express main data
        :param subreferences: Reference identifiers separated by '+'
        :return: Template, collections metadata and Markup object representing the text
        """
        # authentication check:
        texts_threshold = self.app.config['MAX_NUMBER_OF_TEXTS_FOR_NOT_AUTHENTICATED_USER']
        if (texts_threshold <= objectIds.count('+') or texts_threshold <= subreferences.count('+')) and \
            not current_user.is_authenticated and \
            not 'dev' in self.app.config['SERVER_TYPE'].lower():
                abort(401)

        if 'reading_format' not in session:
            session['reading_format'] = 'columns'
        ids = objectIds.split('+')
        translations = {}
        view = 1
        passage_data = {'template': 'main::multipassage.html', 'objects': [], "translation": {}}
        subrefers = subreferences.split('+')
        all_parent_colls = list()
        collate_html_dict = dict()
        if 'collate' in request.values:
            collate_html_dict = self.call_collate_api(obj_ids=ids, subrefers=subrefers)
        if len(subrefers) != len(ids):
            abort(404)
        for i, id in enumerate(ids):
            manifest_requested = False
            if id in self.dead_urls:
                id = self.dead_urls[id]
            if "manifest:" in id:
                id = re.sub(r'^manifest:', '', id)
                manifest_requested = True
            if self.check_project_team() is True or id in self.open_texts:
                if subrefers[i] in ["all", 'first']:
                    subref = self.get_reffs(id)[0][0]
                else:
                    subref = subrefers[i]
                d = self.r_passage(id, subref, lang=lang)
                d['text_passage'] = collate_html_dict.get(id, d['text_passage'])
                d.update(self.get_prev_next_texts(d['objectId']))
                del d['template']
                parent_colls = defaultdict(list)
                parent_colls[99] = [(id, str(d['collections']['current']['label'].replace('<br>', ' ')))]
                parent_textgroups = [x for x in d['collections']['parents'] if 'cts:textgroup' in x['subtype']]
                for parent_coll in parent_textgroups:
                    grandparent_depths = list()
                    for grandparent in parent_coll['ancestors'].values():
                        start_number = 0
                        if 'cts:textgroup' in grandparent.subtype:
                            start_number = 1
                        grandparent_depths.append(len([x for x in self.make_parents(grandparent) if 'cts:textgroup' in x['subtype']]) + start_number)
                    max_grandparent_depth = max(grandparent_depths)
                    parent_colls[max_grandparent_depth].append((parent_coll['id'], str(parent_coll['short_title'])))
                all_parent_colls.append([v for k, v in sorted(parent_colls.items())])
                translations[id] = []
                for x in d.pop('translations', None):
                    if x[0].id not in ids and x not in translations[id]:
                        translations[id].append(x)
                if manifest_requested:
                    # This is when there are multiple manuscripts and the edition cannot be tied to any single one of them
                    formulae = dict()
                    if 'manifest:' + d['collections']['current']['id'] in self.app.picture_file:
                        formulae = self.app.picture_file['manifest:' + d['collections']['current']['id']]
                    d['alt_image'] = ''
                    if os.path.isfile(self.app.config['IIIF_MAPPING'] + '/' + 'alternatives.json'):
                        with open(self.app.config['IIIF_MAPPING'] + '/' + 'alternatives.json') as f:
                            alt_images = json_load(f)
                        d['alt_image'] = alt_images.get(id)
                    d["objectId"] = "manifest:" + id
                    d["div_v"] = "manifest" + str(view)
                    view = view + 1
                    del d['text_passage']
                    del d['notes']
                    if formulae == {}:
                        d['manifest'] = None
                        d["label"] = [d['collections']['current']['label'], '']
                        d['lib_link'] = self.ms_lib_links.get(id.split(':')[-1].split('.')[0], '')
                        passage_data['objects'].append(d)
                        continue
                    # this viewer works when the library or archive provides a IIIF API for external use of their books
                    d["manifest"] = url_for('viewer.static', filename=formulae["manifest"])
                    with open(self.app.config['IIIF_MAPPING'] + '/' + formulae['manifest']) as f:
                        this_manifest = json_load(f)
                    self.app.logger.debug("this_manifest['@id'] {}".format(this_manifest['@id']))
                    if 'fuldig.hs-fulda.de' in this_manifest['@id']:
                        # This works for resources from https://fuldig.hs-fulda.de/
                        d['lib_link'] = this_manifest['sequences'][0]['canvases'][0]['rendering'][1]['@id']
                    elif 'gallica.bnf.fr' in this_manifest['@id']:
                        # This link needs to be constructed from the thumbnail link for images from https://gallica.bnf.fr/
                        d['lib_link'] = this_manifest['sequences'][0]['canvases'][0]['thumbnail']['@id'].replace('.thumbnail', '')
                        self.app.logger.debug("gallica.bnf.fr: lib_link created:{}".format(d['lib_link']))
                    elif 'api.digitale-sammlungen.de' in this_manifest['@id']:
                        # This works for resources from the Bayerische Staatsbibliothek
                        # (and perhaps other German digital libraries?)
                        d['lib_link'] = this_manifest['sequences'][0]['canvases'][0]['@id'] + '/view'
                    elif 'digitalcollections.universiteitleiden.nl' in this_manifest['@id']:
                        # This works for resources from the Leiden University Library
                        # This links to the manuscript as a whole.
                        # I am not sure how to link to specific pages in their IIIF viewer.
                        d['lib_link'] = 'https://iiifviewer.universiteitleiden.nl/?manifest=' + this_manifest['@id']
                    elif 'digi.vatlib.it' in this_manifest['@id']:
                        # This works for resources from the Vatican Libraries
                        d['lib_link'] = this_manifest['sequences'][0]['canvases'][0]['@id'].replace('iiif', 'view').replace('canvas/p', '')
                    elif 'digital.blb-karlsruhe.de' in this_manifest['@id']:
                        # This works for resources from the BLB Karlsruhe
                        # This links to the manuscript as a whole.
                        # I am not sure how to link to specific pages in their IIIF viewer.
                        d['lib_link'] = 'https://i3f.vls.io/?collection=i3fblbk&id=' + this_manifest['@id']
                    elif 'www.e-codices.unifr.ch' in this_manifest['@id']:
                        # This works for resources from E-Codices
                        d['lib_link'] = this_manifest['related'].replace('/list/one', '') + '/' + this_manifest['sequences'][0]['canvases'][0]['label']

                    self.app.logger.debug(msg="lib_link: {}".format(d['lib_link']))

                    folios = re.sub(r'(\d+)([rvab]{1,2})', r'\1<span class="verso-recto">\2</span>',
                                    this_manifest['sequences'][0]['canvases'][0]['label'])
                    if len(this_manifest['sequences'][0]['canvases']) > 1:
                        folios += ' - ' + re.sub(r'(\d+)([rvab]{1,2})', r'\1<span class="verso-recto">\2</span>',
                                                 this_manifest['sequences'][0]['canvases'][-1]['label'])
                    d["label"] = [formulae["title"], ' [fol.' + folios + ']']

                else:
                    d["IIIFviewer"] = []
                    for transcription, t_title, t_partOf, t_siglum in d['transcriptions']:
                        t_id = 'no_image'
                        if "manifest:" + transcription.id in self.app.picture_file:
                            t_id = "manifest:" + transcription.id
                        elif 'wa1' in transcription.id:
                            t_id = self.ms_lib_links['wa1']
                        d["IIIFviewer"].append((t_id,
                                                t_title + ' (' + t_siglum + ')',
                                                t_partOf))

                    if 'previous_search' in session:
                        result_ids = [x for x in session['previous_search'] if x['id'] == id]
                        if result_ids and any([x.get('highlight') for x in result_ids]):
                            d['text_passage'] = self.highlight_found_sents(d['text_passage'], result_ids)
                    if d['collections']['current']['sigla'] != '':
                        d['collections']['current']['label'] = d['collections']['current']['label'].split(' [')
                        d['collections']['current']['label'][-1] = ' [' + d['collections']['current']['label'][-1]
                filtered_transcriptions = []
                for x in d['transcriptions']:
                    if x[0].id not in ids and x not in filtered_transcriptions:
                        filtered_transcriptions.append(x)
                d['transcriptions'] = filtered_transcriptions
                passage_data['objects'].append(d)
        passage_data['breadcrumb_colls'] = all_parent_colls
        if len(ids) > len(passage_data['objects']):
            flash(_('Mindestens ein Text, den Sie anzeigen möchten, ist nicht verfügbar.'))
        passage_data['translation'] = translations
        passage_data['videos'] = [v for k, v in self.VIDEOS.items() if 2 in k][0]
        passage_data['word_graph_url'] = self.app.config['WORD_GRAPH_API_URL']
        return passage_data