Don't include documents in entities in annotation JSON #2913

Closed
opened 2016-04-19 15:03:24 +00:00 by wjt · 5 comments

cf #2830

If you have many documents attached to entities, each entity has a lot of data, and an item `X` has lots of annotations pointing to entities, then `get(id=X, keys=['layers'])` gets a bit painful: partly because fetching the documents for each entity in turn is expensive, and partly because the JSON for each entity is big. Also, if the same entity appears many times in an item, it's completely redundant to send the full entity JSON along with each annotation.

So here are two ideas for improving this, both branches on https://gitlab.com/wjt/pandora.git (on top of #2804). They're the same except for the top patch on each:

  1. `get-layers-no-entity-documents`: leave documents out of entities inside annotations. This is about 30% faster in my testing.
  2. `get-layers-only-entity-name`: Pandora's own interface only uses the `id` and `name` keys of `annotation.entity` (it refetches the whole entity to show the popover), so just leave everything else out. This is ~3× faster.

Both could break applications using the API; obviously 2 is more likely to.
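
To make the shape of option 2 concrete, here is a minimal sketch in plain Python. It only illustrates what the response would look like after trimming; the actual branch avoids fetching the extra entity data on the server in the first place. The helper names `trim_entity` and `trim_layers`, the annotation fields other than `entity`, and the sample values are all hypothetical.

```python
def trim_entity(entity, keys=("id", "name")):
    """Return a copy of an entity dict containing only the given keys."""
    return {k: entity[k] for k in keys if k in entity}


def trim_layers(layers, keys=("id", "name")):
    """Return a copy of `layers` with each embedded entity reduced to `keys`."""
    trimmed = {}
    for layer, annotations in layers.items():
        trimmed[layer] = [
            # Annotations without an entity are copied through unchanged.
            {**a, "entity": trim_entity(a["entity"], keys)} if "entity" in a else dict(a)
            for a in annotations
        ]
    return trimmed


# Hypothetical sample data: one annotation with a "heavy" entity, one without.
layers = {
    "somelayer": [
        {
            "id": "A/1",
            "value": "x",
            "entity": {"id": "ABC", "name": "Foo", "documents": [{"id": "D1"}]},
        },
        {"id": "A/2", "value": "plain text"},
    ]
}
slim = trim_layers(layers)
```

A client that needs the full entity (as Pandora's popover does) would then fetch it separately by id.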

A Third Way would be to split the entities out, so `get` would return:

        {
            "layers": {
                "somelayer": [...,
                  {..., "entity": {"id": "ABC"}},
                ], ...
            },
            "entities": {
                "ABC": {...},
                ...
            }
        }

This is somewhere in between the two branches here, speed-wise, and every API method that returns annotations would need to be changed...
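
As a rough illustration of the split-out shape above, the following sketch takes layers with embedded entities and produces the `layers` / `entities` structure, storing each entity once however many annotations point to it. `split_entities` is a hypothetical helper, not something in either branch, and the sample data is made up.

```python
def split_entities(layers):
    """Replace embedded entities with {"id": ...} and collect each entity once."""
    entities = {}
    out = {}
    for layer, annotations in layers.items():
        out[layer] = []
        for annotation in annotations:
            annotation = dict(annotation)  # shallow copy; the input is left untouched
            entity = annotation.get("entity")
            if entity is not None:
                # Keep the first full copy seen for each entity id.
                entities.setdefault(entity["id"], entity)
                annotation["entity"] = {"id": entity["id"]}
            out[layer].append(annotation)
    return {"layers": out, "entities": entities}


# Hypothetical sample: the same entity attached to two annotations.
abc = {"id": "ABC", "name": "Foo", "documents": [{"id": "D1"}]}
result = split_entities({
    "somelayer": [
        {"id": "A/1", "entity": abc},
        {"id": "A/2", "entity": abc},
        {"id": "A/3"},
    ]
})
```

The redundancy saving comes from `entities` holding one copy of `ABC` even though two annotations reference it.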

What do you think?

j added the backend label 2016-04-19 15:03:24 +00:00
j added this to the 14.04 milestone 2016-04-19 15:03:24 +00:00
j self-assigned this 2016-04-19 15:03:24 +00:00
j added the normal and enhancement labels 2016-04-19 15:03:24 +00:00
Owner

i think just including id/name is good enough. always including entities instead of fetching them on demand might also slow things down. will test your patches and merge; not aware of it breaking any use of the api besides yours, possibly.

Author

Yes, only fetching & returning id & name for entities does break my application, but obviously I am happy to fix it! :-)

Author

Ah, one of these patches actually breaks `Entity.json(keys=['documents'])`, will fix…

Author

Fixed both branches. Previously, “Entity.json: get document ids from join table” was fetching `id` from `entity_documentproperties` rather than `document_id`.

Apart from this, no other problems to report from running the get-layers-only-entity-name branch here for the last couple of days.
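
For anyone curious why that bug matters, here is a small self-contained sketch using an in-memory SQLite table. The table name and the `id` / `document_id` columns come from the fix described above; the `entity_id` column, the sample rows, and the helper `column_for_entity` are assumptions for illustration, and the real code goes through Django's ORM rather than raw SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entity_documentproperties ("
    "id INTEGER PRIMARY KEY, entity_id INTEGER, document_id INTEGER)"
)
# Hypothetical rows: entity 10 has documents 501 and 502; entity 11 has document 501.
conn.executemany(
    "INSERT INTO entity_documentproperties (id, entity_id, document_id) VALUES (?, ?, ?)",
    [(1, 10, 501), (2, 10, 502), (3, 11, 501)],
)


def column_for_entity(column, entity_id):
    """Return the values of `column` for one entity's join rows."""
    # `column` is only ever one of the two fixed strings used below.
    rows = conn.execute(
        f"SELECT {column} FROM entity_documentproperties "
        "WHERE entity_id = ? ORDER BY id",
        (entity_id,),
    )
    return [row[0] for row in rows]


wrong = column_for_entity("id", 10)           # primary keys of the join rows
right = column_for_entity("document_id", 10)  # ids of the attached documents
```

Selecting `id` returns the join rows' own primary keys, which only coincide with document ids by accident.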

Owner

In [34747c0/pandora](https://code.0x2620.org/0x2620/pandora/commit/34747c0fd7d8ecfcff4cf7106ecef5f773bf16ff):

Merge remote-tracking branch 'wjt/get-layers-only-entity-name'

(fixes #2913)
0x2620 added the fixed label 2016-04-28 22:17:30 +00:00
j closed this issue 2016-04-28 22:17:30 +00:00
Reference: 0x2620/pandora#2913