Don't include documents in entities in annotation JSON

wjt commented

2016-04-19 15:03:24 +00:00

If you have many documents attached to entities, each entity has a lot of data, and an item X has lots of annotations pointing to entities, then get(id=X, keys=['layers']) gets a bit painful. Partly because fetching the documents for each entity in turn is expensive, and partly because the JSON for each entity is big. Also, if the same entity appears many times in an item, then it's completely redundant to send the full entity JSON along with each annotation.

So here are two ideas for improving this, both branches on https://gitlab.com/wjt/pandora.git (on top of #2804). They're the same except for the top patch on each:

1. get-layers-no-entity-documents: leave documents out of entities inside annotations. This is about 30% faster in my testing.
2. get-layers-only-entity-name: Pandora's own interface only uses the id and name keys of annotation.entity (it refetches the whole entity to show the popover), so just leave everything else out. This is ~3× faster.

Both could break applications using the API; obviously 2 is more likely to.
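To make the difference between the two branches concrete, here is a minimal Python sketch. It is not the code on either branch: the function names and the sample entity are hypothetical, and it only illustrates which keys each approach would leave in annotation.entity.

```python
# Hypothetical sketch, not the actual pan.do/ra code.

def entity_without_documents(entity):
    """Idea 1: keep every key except 'documents'."""
    return {k: v for k, v in entity.items() if k != "documents"}


def entity_id_and_name(entity):
    """Idea 2: keep only the keys the interface is said to read."""
    return {k: entity[k] for k in ("id", "name") if k in entity}


# Made-up entity, for illustration only.
entity = {
    "id": "ABC",
    "name": "Some Entity",
    "type": "person",
    "documents": [{"id": "D1"}, {"id": "D2"}],
}

slim = entity_without_documents(entity)   # everything but the documents
minimal = entity_id_and_name(entity)      # just id and name
```

A client that needs more than id and name under idea 2 would have to fetch the full entity separately, as the popover in Pandora's interface already does.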

A Third Way would be to split the entities out, so get would return:

        {
            "layers": {
                "somelayer": [...,
                  {..., "entity": {"id": "ABC"}},
                ], ...
            },
            "entities": {
                "ABC": {...},
                ...
            }
        }

This is somewhere in between the two branches here, speed-wise, and every API method that returns annotations would need to be changed...
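A rough Python sketch of that split, assuming the layers are already available as plain dicts with the full entity embedded in each annotation. The function name split_entities and the sample data are hypothetical; nothing like this exists on either branch.

```python
# Hypothetical sketch of the "Third Way": store each entity once and
# leave only an id reference inside each annotation.

def split_entities(layers):
    """Return {'layers': ..., 'entities': ...} with each entity stored once."""
    entities = {}
    slim_layers = {}
    for layer_name, annotations in layers.items():
        slim_annotations = []
        for annotation in annotations:
            annotation = dict(annotation)  # shallow copy; the input is not modified
            entity = annotation.get("entity")
            if entity is not None:
                # The first occurrence wins; later duplicates are not stored again.
                entities.setdefault(entity["id"], entity)
                annotation["entity"] = {"id": entity["id"]}
            slim_annotations.append(annotation)
        slim_layers[layer_name] = slim_annotations
    return {"layers": slim_layers, "entities": entities}


# Made-up input: the same entity is referenced by two annotations.
abc = {"id": "ABC", "name": "Some Entity", "documents": [{"id": "D1"}]}
layers = {
    "somelayer": [
        {"id": "X/1", "entity": abc},
        {"id": "X/2", "entity": abc},
        {"id": "X/3", "value": "plain text annotation"},
    ]
}

result = split_entities(layers)
```

With this shape the entity payload is sent once per item, however many annotations point at it.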

What do you think?

cf #2830

j added the backend label 2016-04-19 15:03:24 +00:00

j added this to the 14.04 milestone 2016-04-19 15:03:24 +00:00

j self-assigned this 2016-04-19 15:03:24 +00:00

j added the normal and enhancement labels 2016-04-19 15:03:24 +00:00

j commented

2016-04-25 10:15:58 +00:00

Owner

I think just including id/name is good enough. Always including entities instead of fetching them on demand might also slow things down. Will test your patches and merge; I'm not aware of it breaking any use of the API besides possibly yours.

wjt commented

2016-04-25 11:02:43 +00:00

Author

Yes, only fetching & returning id & name for entities does break my application, but obviously I am happy to fix it! :-)

wjt commented

2016-04-28 13:12:57 +00:00

Author

Ah, one of these patches actually breaks Entity.json(keys=['documents']), will fix…


wjt commented

2016-04-28 13:23:39 +00:00

Author

Fixed both branches. Previously, “Entity.json: get document ids from join table” was fetching id from entity_documentproperties rather than document_id.

Apart from this, no other problems to report from running the get-layers-only-entity-name branch here for the last couple of days.
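The kind of mix-up described here can be illustrated with a small self-contained sketch using sqlite3. The table name and the id and document_id columns come from the comment above; the entity_id column, the sample rows and the helper function are assumptions made for illustration, not the real Pandora schema or code.

```python
import sqlite3

# In-memory stand-in for the join table; the real schema may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entity_documentproperties ("
    " id INTEGER PRIMARY KEY, entity_id INTEGER, document_id INTEGER)"
)
conn.executemany(
    "INSERT INTO entity_documentproperties (id, entity_id, document_id)"
    " VALUES (?, ?, ?)",
    [(1, 10, 501), (2, 10, 502), (3, 11, 501)],
)


def column_for_entity(column, entity_id):
    """Read one column of the join rows belonging to an entity."""
    # The column name is interpolated only because it comes from the
    # fixed literals below, never from user input.
    rows = conn.execute(
        f"SELECT {column} FROM entity_documentproperties"
        " WHERE entity_id = ? ORDER BY id",
        (entity_id,),
    )
    return [row[0] for row in rows]


wrong = column_for_entity("id", 10)           # primary keys of the join rows
right = column_for_entity("document_id", 10)  # ids of the attached documents
```

Both queries succeed, which is why this kind of bug is easy to miss: the first simply returns the join rows' own ids rather than the document ids.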


j commented

2016-04-28 22:17:30 +00:00

Owner

In 34747c0/pandora (https://code.0x2620.org/0x2620/pandora/commit/34747c0fd7d8ecfcff4cf7106ecef5f773bf16ff):

Merge remote-tracking branch 'wjt/get-layers-only-entity-name'

(fixes #2913)

0x2620 added the fixed label 2016-04-28 22:17:30 +00:00

j closed this issue 2016-04-28 22:17:30 +00:00
