When I implemented this in 9a4c24c, there were not many rows in
entity_documentproperties in the database here. Now that there are,
computing the document_document -> entity_documentproperties ->
entity_entity join and then filtering is really, really slow. Postgres
seems to materialize the whole join and then scan it.
If we get a set of matching document IDs for the entity query in a
subquery, and then just filter with IN on that, things are much faster:
scan entity_entity; in a nested loop, get the document_ids via
entity_documentproperties; hash this set; and then scan
document_document.
Searching for a single character, this brings the query from ~1.1s to
~400ms. Searching for a full word, ~800ms to 120ms
This condition is getting really ugly -- I am sorry!
References #2935
This should have been included with a8dcbbb, which changed the
related_name to access DocumentProperties from Document. (There's no
actual change to the database.)
Without this fix, a condition like:
{key: 'id', operator: '==', value: 'A/B'}
gets mapped to:
public_id__exact=('A/B'.lower())
which is wrong.
I introduced this bug in b3df5b8. I didn't catch it because I was
mostly interested in the 'layer' key -- but layer names are
conventionally lowercase anyway so lowercasing them had no effect.
Fetching documents for each entity in turn is expensive. (I have tried
using ArrayAgg to fetch them in the same query as the Entity — no
improvement. It's possible that being able to join to entity_entity,
and then use ArrayAgg, would be better.)
Even once you've fetched them all, if the same entity appears many
times in an item, then get(..., keys=['layers']) duplicates the whole
JSON for the entity many times: expensive to serialize, expensive to
send over the wire.
Pandora's own web interface only depends on the 'id' key of 'entity' in
each annotation, and refetches the rest of the entity to show the pop-up
dialog when you press E. So by just not bothering to fetch and send any
other keys, get(..., keys=['layers']) on an item with many entity
annotations is substantially faster.
(I experimented with splitting the full entities off to one side, so,
you'd have:
{
"layers": {
somelayer: [...,
{..., "entity": {"id": ABC}},
], ...
},
"entities": {
ABC: {...},
...
}
}
This is quicker than the status quo, but obviously not as fast as not
fetching & sending the rest at all!)
Clip.public_id uses self.item.public_id.
Clip.json() uses self.item.sort, so we should select_related on that
rather than the clip's own sort field. (They are identical objects. Is
Clip.sort ever used directly?)
With this change, findClips() issues one query to fetch clips plus one
query per flavour of annotation; before, it issued two extra queries per
clip.
Requiring layer to have the right case is consistent with
addAnnotation(), and means the _layer[_like] index can be used. In my
testing, if itemsQuery specifies a single item, then postgres doesn't
bother with the layer index anyway; but if not, it makes a pretty big
(~3×) difference.
Matching public_id and item__public_id case-sensitively also seems
reasonable (it's consistent with get() and getAnnotation()).
(Is lower() redundant for the case-insensitive comparisons? ie. is
UPPER(x.lower()) == UPPER(x)? I'm not sure, it's cheap, let's leave it.)
This is about 30% faster, presumably because it avoids allocation and/or
closing over variables is slow(?). It's not hugely significant (I
misread a line_profile report) but why not.
Otherwise this:
self.name_find = '||' + '||'.join((self.name,) + self.alternativeNames) + '||'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
fails because () + [] is an error. I guess this must have been
introduced by the DictField/TupleField rewrite.
Without this fix, it is impossible to create a new entity.
Basically the same logic is used for Event and Place too so I've made
the same change to those, and, in passing, fix another copy of the bug
fixed for Entity.name_find in fe7f961.
FileField.delete() will, by default, save() the model instance it is
attached to. This is pointless if we're in the process of deleting the
Document -- and since Document.save() calls Document.update_matches(),
this scans all annotations every time a document is deleted.
With 17 layers and 12 clipLayers, this repeated fetching was around 49%
of the cost of this function, which was in turn 94% of the cost of
creating many new annotations with mostly-unique endpoints. This helps a
bit...
If the order of clipLayers is not meant to be significant to sortvalue
(which I assume it is) then this could be simpler.
The case must be correct anyway for the layer to be found in
settings.CONFIG['layers']. Running this:
Q(annotation__layer__iexact='foo') &
Q(annotation__findvalue__icontains='bar')
compiles to
upper(layer) = upper('foo') and
...
which can't use the case-sensitive annotation_annotation_layer index.
This:
Q(annotation__layer__exact='foo') &
Q(annotation__findvalue__icontains='bar')
can. (It still can't use the findvalue_like index, though! The other
option is to add indices on upper(layer) and upper(findvalue)
[varchar_pattern_ops].)
This has several benefits:
• Clip.get_layers() (used by smart edits) and Item.get_layers() pick up
the select_related('user') optimization added for static edits in
r5007.
• Static edits and items pick up the optimization from r4941 to select
annotations once, not once per layer.
Fetching an item with ~1000 annotations took ~1s without this patch,
~0.34s with this patch. Another item with ~6000 annotations took ~11.6s
before, ~8.6s after.
Because this block is moved out to the top:
if user and user.is_anonymous():
user = None
then, for anonymous users,
"editable": false,
is no longer included in the annotations. The old behaviour ended up
including this key in all layers listed before the first private layer
in the config, and leaving it out from later ones. So this new behaviour
is more consistent.
The expensive part of fetching an edit is JSONifying the clips'
annotations. Profiling showed that the main cost was Annotation.json(),
and within that:
File: /srv/pandora/pandora/annotation/models.py
Function: json at line 216
Line # Hits Time Per Hit % Time Line Contents
==============================================================
216 def json(self, layer=False, keys=None, user=None):
217 632 827 1.3 0.1 j = {
218 632 1048170 1658.5 89.6 'user': self.user.username,
219 }
Obviously this join just moves some of the cost further out, but it
brings my micro-benchmark down from 1.3s to 0.3s.
This would crash:
api.findChangeLogs(keys=[])
because there's no 'name' key for changelog entries. Instead, default to
chronological order, newest first.