Wikipedia API - Inconsistent Page Summary


#1

Hello all,

I’m wrapping up the Wikipedia API project. I’ve noticed some weird behavior: for some pages the title and link are present, but the summary sentence (which I retrieve through the ‘extracts’ property) is missing. I could use a second pair of eyes on my API query. I haven’t had much luck debugging my query in the sandbox.

Current URL

https://en.wikipedia.org/w/api.php?action=query&format=json&origin=*&prop=extracts&generator=allpages&exsentences=1&exintro=1&explaintext=1&gapprefix=test

#2

Hey. If you check the output of the URL you posted, the first two lines tell you why the summary sometimes doesn’t come back for all your articles:

"batchcomplete":""

That should return “true”. If it doesn’t, most likely, it means there is additional information to be displayed that you’ll be able to get by using the “continue” property in the JSON:

"continue":{"gapcontinue":"Test-Oath,_Missouri","continue":"gapcontinue||"}

If you don’t want to do that, I would suggest using the “prefixsearch” generator and limiting the number of articles and summaries (even better, via characters instead of sentences). Basically, your URL would be like this:

"https://en.wikipedia.org/w/api.php?action=query&origin=*&format=json&prop=extracts&generator=prefixsearch&exchars=150&exlimit=10&exintro=1&explaintext=1&gpsnamespace=0&gpslimit=10&gpssearch="

#3

@noyb thanks again! I was able to create a similar URL using prefixsearch as you suggested. I’m still a little confused about how I could use the continue property to access the missing extract data in my original query.


#4

why do you prefer a complicated prefix search generator over a simple search query that relies on defaults?

action=query&origin=*&format=json&formatversion=2&utf8&list=search&srsearch=music

this returns 10 results

{batchcomplete:true,continue:{sroffset:10,continue:"-||"}, ...

you simply take the continue block and use it as-is in the next query

action=query&origin=*&format=json&formatversion=2&utf8&list=search&srsearch=music&sroffset=10&continue=-||

which returns the next 10 results

{batchcomplete:true,continue:{sroffset:20,continue:"-||"}, ...

loop till there’s no continue or you have what you need
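
The loop described above can be sketched in Python roughly like this. It is only a sketch: the names `search_all`, `http_fetch` and `max_results`, and the User-Agent string, are made up for illustration, and `fetch` is assumed to be any callable that takes a URL and returns the decoded JSON as a dict.

```python
import json
import urllib.request
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def http_fetch(url):
    """Fetch a URL and decode the JSON body (illustrative helper, not called below)."""
    # The User-Agent value is a placeholder; use something that identifies your app.
    req = urllib.request.Request(url, headers={"User-Agent": "wiki-viewer-demo/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def search_all(term, fetch, max_results=30):
    """Yield search hits, feeding each continue block back into the next request."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": term,
        "format": "json",
        "formatversion": "2",
        "utf8": "1",
        "origin": "*",  # only matters for browser CORS; kept to mirror the query above
    }
    cont = {}  # first request has no continuation parameters
    yielded = 0
    while True:
        data = fetch(API + "?" + urlencode({**params, **cont}))
        for hit in data.get("query", {}).get("search", []):
            yield hit
            yielded += 1
            if yielded >= max_results:
                return  # we have what we need
        cont = data.get("continue")
        if not cont:
            return  # no continue block: nothing left to fetch
```

Something like `for hit in search_all("music", http_fetch): print(hit["title"])` would then walk through the first 30 results, three requests of 10 each if the defaults are as shown above.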


#5

I suggested prefixsearch because I used it for my Wikipedia project, where I wanted control over the snippets’ length and type (I was also fetching images, and I didn’t want to use continue). Thanks for the input, though; for most cases the search query is simpler for sure.


#6

I see - generators are a bit advanced and have complications with batchcomplete - it’s best to avoid them unless necessary


#7

Thanks @noyb & @ppc, I’ve experimented with both queries and definitely see the pros and cons of each approach. I did have one question: was the reason my original URL didn’t work that the extracts were too long (more than 1 line)? I see why the two URLs you both submitted work, but I’m still not sure why my original one didn’t.


#8

I think it’s just how the wikipedia api works, but I’m not totally sure. By default it returns a batch of 10 articles and if something doesn’t fit, it provides the continue.

I forgot to mention, if you’ll be using prefixsearch, you should also add “&redirects=1” to your URL. Wikipedia is full of redirects, and without that parameter the API won’t follow them automatically, which leaves some articles without text (you can test it by searching “dgs” with and without the “redirects” parameter).
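
For what it’s worth, here is a minimal Python sketch of assembling the prefixsearch URL from earlier in the thread with redirects=1 added, so the search term gets percent-encoded properly. The function name `prefixsearch_url` and its arguments are made up for illustration; the parameter names are the ones already used in this thread.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def prefixsearch_url(term, limit=10, chars=150):
    """Build the prefixsearch + extracts query, following redirects."""
    params = {
        "action": "query",
        "origin": "*",
        "format": "json",
        "prop": "extracts",
        "generator": "prefixsearch",
        "redirects": 1,      # resolve redirects so their targets carry the extract
        "exchars": chars,    # extract length in characters
        "exlimit": limit,
        "exintro": 1,
        "explaintext": 1,
        "gpsnamespace": 0,   # articles only
        "gpslimit": limit,
        "gpssearch": term,
    }
    return API + "?" + urlencode(params)


print(prefixsearch_url("dgs"))
```

urlencode writes the asterisk in origin=* as %2A, which the server decodes back to the same value.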


#9

the original query includes redirects which do not have extracts
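
one way to see this for yourself is to split the pages in a response by whether they carry an extract - a rough Python sketch, where the function name `split_by_extract` and the sample response are made up for illustration (with the default JSON format query.pages is an object keyed by page id, with formatversion=2 it is a list, so the sketch accepts both)

```python
from typing import Dict, List, Tuple


def split_by_extract(data: Dict) -> Tuple[List[str], List[str]]:
    """Return (titles with an extract, titles without one) from a query response."""
    pages = data.get("query", {}).get("pages", {})
    if isinstance(pages, dict):  # default format: object keyed by page id
        pages = list(pages.values())
    with_text, without_text = [], []
    for page in pages:
        target = with_text if page.get("extract") else without_text
        target.append(page["title"])
    return with_text, without_text


# Hypothetical response, shaped like the original query's output.
sample = {
    "query": {
        "pages": {
            "1": {"pageid": 1, "ns": 0, "title": "Test", "extract": "A test is..."},
            "2": {"pageid": 2, "ns": 0, "title": "Test (redirect)"},
        }
    }
}

print(split_by_extract(sample))
```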

btw boolean parameters like exintro and explaintext do not need a value - they are true if present, false if absent

looking at forum posts on the wikipedia search project it’s clear there’s a whole bunch of confusing examples - I see opensearch queries that refer to an obsolete search protocol - it was pretty much DOA - I see search used as a generator when it can be used directly as I showed above

the wikipedia api is quite powerful - it’s better to start as simple as possible then explore more complex queries

this is the place to start

https://www.mediawiki.org/wiki/API:Search

the search api has the srsearch parameter for the search terms - this query searches for “music” in titles - it is the simplest and has good defaults - it includes a snippet of the match - it resolves redirects

action=query&list=search&srsearch=music&origin=*&format=json&formatversion=2&utf8

UTF-8 is the best encoding for non-English characters

JSON format version 2 is explained here https://www.mediawiki.org/wiki/API:JSON_version_2

origin=* is needed for CORS explained here https://www.mediawiki.org/wiki/API:Cross-site_requests

page content can be searched by adding srwhat=text

action=query&list=search&srwhat=text&srsearch=music&origin=*&format=json&formatversion=2&utf8

The response looks like this

{
   batchcomplete : true,
   continue : {
      sroffset : 10,
      continue : "-||"
   },
   query : {
      search : [
         {
            timestamp : "2017-08-28T22:23:08Z",
            size : 139746,
            snippet : "<span class=\"searchmatch\">Music</span> is an art form and cultural activity whose medium is sound organized in time. The common elements of <span class=\"searchmatch\">music</span> are pitch (which governs melody and harmony)",
            pageid : 18839,
            wordcount : 17243,
            ns : 0,
            title : "Music"
         },

more array entries in between

         {
            snippet : "institution can also be known as a school of <span class=\"searchmatch\">music</span>, <span class=\"searchmatch\">music</span> academy, <span class=\"searchmatch\">music</span> faculty, college of <span class=\"searchmatch\">music</span>, <span class=\"searchmatch\">music</span> department (of a larger institution), conservatory",
            ns : 0,
            wordcount : 2320,
            pageid : 24782280,
            title : "Music school",
            size : 18997,
            timestamp : "2017-08-30T08:17:18Z"
         }
      ],
      searchinfo : {
         totalhits : 734632
      }
   }
}
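
the snippet field contains html markup around the matched words - if plain text is wanted, a small Python sketch like the one below strips the tags and decodes any html entities (the name `plain_snippet` is made up for illustration, and the regex is a simple tag stripper that is adequate for these snippets, not a general html parser)

```python
import html
import re

TAG = re.compile(r"<[^>]+>")  # matches any single html tag


def plain_snippet(snippet: str) -> str:
    """Drop the searchmatch span tags and decode html entities."""
    return html.unescape(TAG.sub("", snippet))


raw = '<span class="searchmatch">Music</span> is an art form and cultural activity'
print(plain_snippet(raw))
```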

a page is retrieved by replacing spaces with underscores in the title - e.g. the entry above for “Music school” has the url

https://en.wikipedia.org/wiki/Music_school

or by its page id in a url

https://en.wikipedia.org?curid=24782280

or the page html can be retrieved through the parse api

action=parse&pageid=24782280&prop=text&origin=*&format=json&formatversion=2&utf8
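
the title and page id forms of the page url can be built from a search result like this - a minimal Python sketch, with the helper names `page_url` and `curid_url` made up for illustration

```python
from urllib.parse import quote


def page_url(title: str) -> str:
    """Article URL from a title: spaces become underscores, the rest is percent-encoded."""
    return "https://en.wikipedia.org/wiki/" + quote(title.replace(" ", "_"))


def curid_url(pageid: int) -> str:
    """Article URL from a page id, in the same form as the example above."""
    return "https://en.wikipedia.org?curid=%d" % pageid


hit = {"title": "Music school", "pageid": 24782280}  # values from the response above
print(page_url(hit["title"]))
print(curid_url(hit["pageid"]))
```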

#10

Thanks @ppc & @noyb, the redirects explanation makes sense. I’ve definitely learned a lot by reading these posts and by trying out the different queries posted here. I’m seeing the benefit of making the query as simple as possible :smile: for this API.