Quick start

Please read the rest of this guide thoroughly before using this data for anything serious! You should also consult the documentation for each Source you intend to use.

Most of the time you will want to grab the dataset in bulk, for a particular source, or for a particular DOI prefix. You can then filter it, load it into your own data store, etc. Check the Crossref blog for ideas. We collect a few tens of thousands of Events per day, which can weigh in at over 10 MB of data per day. Bear this in mind if you point a browser at the URLs.

This quick start shows you how to fetch data and then run some rudimentary queries on it using the popular jq tool.

Data is available on a per-day basis. To fetch 10,000 Events from Event Data, collected at any time:

curl "https://0-api-eventdata-crossref-org.libus.csd.mu.edu/v1/events?mailto=YOUR_EMAIL_HERE&rows=10000" > all-events.json

That returns 10,000 Events (out of a possible 1,363,971 at the time of writing).

If you're only interested in Reddit, you can filter that:

curl "https://0-api-eventdata-crossref-org.libus.csd.mu.edu/v1/events?mailto=YOUR_EMAIL_HERE&rows=1000&source=reddit" > reddit-events.json

If you're only interested in PLOS articles (4013 Events), you can additionally filter by their DOI prefix, 10.1371:

curl "https://0-api-eventdata-crossref-org.libus.csd.mu.edu/v1/events?mailto=YOUR_EMAIL_HERE&rows=1000&source=reddit&obj-id.prefix=10.1371" > reddit-plos.json
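
The commands above differ only in their query parameters, so the URL can also be assembled in a script. The following Python sketch is one possible way to do that and is not part of Event Data itself. The function name build_events_url and its filters argument are made up for illustration, and the base URL is simply copied from the curl commands above.

```python
from urllib.parse import urlencode

# Base URL copied from the curl commands above; change it if you use another host.
BASE_URL = "https://0-api-eventdata-crossref-org.libus.csd.mu.edu/v1/events"


def build_events_url(mailto, rows=1000, filters=None):
    """Return a query URL for the Events endpoint.

    `filters` is a hypothetical convenience argument: a dict of extra
    query parameters (for example {"source": "reddit"}) appended as given.
    """
    params = {"mailto": mailto, "rows": rows}
    params.update(filters or {})
    # urlencode percent-escapes values, so "@" in an email becomes "%40".
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_events_url("YOUR_EMAIL_HERE", filters={"source": "reddit"}))
```

Any further filter, such as the prefix filter in the last command, can be passed through the same dict, which avoids hand-editing long URLs.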

Now you've got a few thousand Events to crunch.

We can pipe the file through jq to format it nicely. I've trimmed the output to the first 25 lines:

$ jq . reddit-plos.json | head -n 25
  "events": [
    {
      "license": "https://creativecommons.org/publicdomain/zero/1.0/",
      "obj_id": "https://0-doi-org.libus.csd.mu.edu/10.1370/afm.1885",
      "source_token": "a6c9d511-9239-4de8-a266-b013f5bd8764",
      "occurred_at": "2016-01-16T01:20:49Z",
      "subj_id": "https://reddit.com/r/psychology/comments/4166g7/long_term_prescription_opioid_use_associated_with/",
      "id": "e37eaea3-003b-414e-822e-3fcbd61090a4",
      "evidence_record": "https://0-evidence-eventdata-crossref-org.libus.csd.mu.edu/evidence/201702226e03dbb4-bc2e-46e3-8c1e-d27f2d7fc1e4",
      "terms": "https://0-doi-org.libus.csd.mu.edu/10.13003/CED-terms-of-use",
      "action": "add",
      "subj": {
        "pid": "https://reddit.com/r/psychology/comments/4166g7/long_term_prescription_opioid_use_associated_with/",
        "type": "post",
        "title": "Long term prescription opioid use associated with increased risk of new-onset depression, independent of the known contribution of pain to depression.",
        "issued": "2016-01-16T01:20:49.000Z"
      },
      "source_id": "reddit",
      "obj": {
        "pid": "https://0-doi-org.libus.csd.mu.edu/10.1370/afm.1885",
        "url": "http://www.annfammed.org/content/14/1/54"
      },
      "timestamp": "2017-02-22T16:15:50Z",
      "relation_type_id": "discusses"
    },

Note the cursor in the response (it is not visible in the excerpt above). You can use it to navigate your query back and forward through time on the API.
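
If you want to load Events into your own data store, a flat table is often a convenient first step. The sketch below, in Python rather than jq, turns a list of Events into CSV text. It assumes each Event is a JSON object with the top-level keys seen in the excerpt above; the function name events_to_csv and the choice of columns are illustrative, not something Event Data prescribes.

```python
import csv
import io

# Columns to keep; each name is a top-level key seen in the Event excerpt above.
FIELDS = ["id", "source_id", "subj_id", "relation_type_id", "obj_id", "occurred_at"]


def events_to_csv(events, fields=FIELDS):
    """Render a list of Event dicts as CSV text, one row per Event."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for event in events:
        # Missing keys become empty cells instead of raising an error.
        writer.writerow({name: event.get(name, "") for name in fields})
    return buffer.getvalue()


if __name__ == "__main__":
    # Values copied from the Event shown in the excerpt above.
    sample = {
        "id": "e37eaea3-003b-414e-822e-3fcbd61090a4",
        "source_id": "reddit",
        "relation_type_id": "discusses",
        "obj_id": "https://0-doi-org.libus.csd.mu.edu/10.1370/afm.1885",
        "occurred_at": "2016-01-16T01:20:49Z",
    }
    print(events_to_csv([sample]))
```

To run it on a downloaded file, read the file with json.load and pass in the list stored under its "events" key, assuming the file has the layout shown in the excerpt.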

I'm going to use jq to select the events, then return all of the distinct source names. The file 2017-02-21.json used here holds the Events for a single day.

$ jq '.events | map(.source_id) | unique' 2017-02-21.json
[
  "reddit",
  "stackexchange"
]

We were only collecting for those two sources on that day. Now let's group by the DOI and count how many Events we got for each DOI. Again, I've snipped a long output.

$ jq '.events | group_by(.obj_id) | map([.[0].obj_id, length])' 2017-02-21.json | head -n 17
[
  [
    "https://0-doi-org.libus.csd.mu.edu/10.1001/journalofethics.2017.19.2.pfor1-1702",
    1
  ],
  [
    "https://0-doi-org.libus.csd.mu.edu/10.1001/journalofethics.2017.19.2.stas1-1702",
    3
  ],
  [
    "https://0-doi-org.libus.csd.mu.edu/10.1001/virtualmentor.2010.12.9.imhl1-1009",
    11
  ],
  [
    "https://0-doi-org.libus.csd.mu.edu/10.1001/virtualmentor.2013.15.5.imhl1-1305",
    4
  ],
  [
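
If you would rather do this kind of querying in a script, the two jq queries above can be approximated in Python. This is a minimal sketch, assuming the downloaded file is a JSON object with a top-level "events" list (as the .events filter implies). The function names are made up for illustration, and the demo Events are placeholders, not real data.

```python
import json
from collections import Counter


def load_events(path):
    """Read the top-level "events" list from a downloaded JSON file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)["events"]


def distinct_sources(events):
    """Sorted unique source_id values, like `map(.source_id) | unique`."""
    return sorted({event["source_id"] for event in events})


def events_per_obj(events):
    """[obj_id, count] pairs sorted by obj_id, like the group_by query."""
    counts = Counter(event["obj_id"] for event in events)
    return [[obj_id, counts[obj_id]] for obj_id in sorted(counts)]


if __name__ == "__main__":
    # Placeholder Events, only to show the shape of the results.
    demo = [
        {"source_id": "reddit", "obj_id": "https://example.org/doi-b"},
        {"source_id": "stackexchange", "obj_id": "https://example.org/doi-a"},
        {"source_id": "reddit", "obj_id": "https://example.org/doi-b"},
    ]
    print(distinct_sources(demo))
    print(events_per_obj(demo))
```

With a real file you would call load_events("2017-02-21.json") and pass the result to either function.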

That's all for now. The Query API page describes the Query API and connected services in depth.