Earlier this month, during Wikidata Data Modeling Days, Hannah Bast from the University of Freiburg presented QLever, a SPARQL query engine with some really cool features (slides, recording). After a month of using it, in this post I’ll discuss how it’s relevant for the OSM community and my experience so far.
What’s SPARQL and why should we care?
SPARQL is a query language for RDF data, which can come from RDF-native knowledge graphs (there are thousands of them, public and private; the best known in the OSM community is Wikidata) or from other sources (for example OSM) converted to RDF with some tool or middleware.
SPARQL includes an optional extension for geospatial data, GeoSPARQL. Query services implementing it let you run all kinds of spatial queries, with function names similar to those of other query languages based on OGC standards (like SQL on PostGIS).
One notable feature of RDF and SPARQL is that they are designed from the ground up for interoperability between different data sources (“linked data”). SPARQL natively supports querying multiple RDF data sources in a single query through “federated queries”. In the query sent to the first service you specify the URL of the second service and the sub-query to run on it, then how its results should be merged or joined with the data from the first service. If the second service is not blacklisted, the first service will autonomously handle the communication, merge the data and directly return the final output.
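To give an intuition of the merge step, here is a minimal Python sketch of joining the rows returned by two services on the variables they share. It is only an illustration of the idea, not how QLever or any other engine actually implements it; the function name and the toy rows (including the made-up IDs) are my own assumptions.

```python
from typing import Dict, List

Binding = Dict[str, str]


def join_bindings(left: List[Binding], right: List[Binding]) -> List[Binding]:
    """Natural join: merge rows that agree on every variable they share."""
    joined = []
    for l_row in left:
        for r_row in right:
            shared = l_row.keys() & r_row.keys()
            if all(l_row[var] == r_row[var] for var in shared):
                joined.append({**l_row, **r_row})
    return joined


# Toy rows from the "first" service: OSM elements with a wikidata tag.
# All IDs here are made up for illustration.
osm_rows = [
    {"element": "osmnode:1", "wdItem": "wd:QA"},
    {"element": "osmnode:2", "wdItem": "wd:QB"},
]
# Toy rows from the "second" service: Wikidata items with a label.
wikidata_rows = [
    {"wdItem": "wd:QA", "label": "Station A"},
    {"wdItem": "wd:QC", "label": "Station C"},
]

result = join_bindings(osm_rows, wikidata_rows)
print(result)
```

Only the rows that agree on the shared variable (here `wdItem`) survive the join. A real engine would use far smarter join strategies than this nested loop.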
QLever
QLever and osm2rdf are two projects by the University of Freiburg; they were introduced respectively in 2017 and in 2021 but only recently started getting attention in the OSM world.
osm2rdf is a tool for converting OSM data into RDF. It transforms geometries from OSM’s node-way-relation format to Well-Known Text (WKT) and can materialize containment and intersection relations between elements to improve spatial querying speed. It’s FOSS and extracts of the data it generates are available online.
QLever is a SPARQL query engine which can be used on any RDF dataset. It’s FOSS as well and instructions are available for setting up custom instances on your own data. SPARQL services for various data sources are available on the official website. The full list is in the dropdown on the top left; the most relevant for OSM are:
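To show what the geometry conversion means in practice, here is a small Python sketch that turns a way (a list of node references) into a WKT string. This is not osm2rdf's code, just a simplified illustration under my own assumptions: the function name, the input shape and the `is_area` flag are hypothetical, and relations and multipolygons are ignored.

```python
from typing import Dict, List, Tuple

# Hypothetical input shape: node id -> (longitude, latitude)
Nodes = Dict[int, Tuple[float, float]]


def way_to_wkt(node_refs: List[int], nodes: Nodes, is_area: bool = False) -> str:
    """Resolve the node references of a way and serialize them as WKT."""
    # WKT uses "x y" order, i.e. longitude first, then latitude
    text = ", ".join(f"{lon} {lat}" for lon, lat in (nodes[ref] for ref in node_refs))
    closed = len(node_refs) >= 4 and node_refs[0] == node_refs[-1]
    if is_area and closed:
        return f"POLYGON(({text}))"
    return f"LINESTRING({text})"


nodes = {1: (0.0, 0.0), 2: (1.0, 0.0), 3: (1.0, 1.0)}
line = way_to_wkt([1, 2, 3], nodes)
area = way_to_wkt([1, 2, 3, 1], nodes, is_area=True)
print(line)
print(area)
```

Note that the node IDs are gone from the output: the WKT string only carries coordinates.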
- OpenStreetMap (generated through osm2rdf)
- Wikidata (RDF native)
- Wikimedia Commons structured data (RDF native)
At the time of writing (Dec 2023) QLever has implemented most features of the SPARQL language and is ready for most queries, but some features are still missing. The team has expressed interest in implementing some features of GeoSPARQL, but currently the only implemented geospatial features are sfContains and sfIntersects between OSM elements, distance() between points, and latitude() and longitude() for points. Currently there is no way to filter elements of all geometry types that are inside/intersecting/… a bounding box or an arbitrary geometry.
QLever and Overpass
When talking about OSM queries the first name that comes to mind is Overpass. QLever has some clear advantages:
- Learning SPARQL once and being able to use it both on OSM and thousands of other data sources is a great prospect when compared to Overpass, which once learnt can be used only on OSM
- SPARQL federated queries significantly simplify the integration of data from other sources, enabling use cases that were not previously possible
So, will QLever make Overpass obsolete? No. It certainly won’t in the short term, for multiple reasons that could change in the future but are currently active obstacles:
- As mentioned above, QLever will implement at least part of GeoSPARQL, but it’s yet to be seen how long this will take. The current lack of spatial querying methods is a deal breaker for some use cases.
- More generally, QLever is still in an early phase of its life: various SPARQL features are still missing and some have bugs (definitely more than in mature software like Overpass, which has been running for more than 10 years).
- Currently OSM data is updated only weekly on QLever, while it’s updated every minute in Overpass. This might change in the future, but the QLever team has expressed no intention to change it and, based on other people’s experience, data update speed seems to be a pain point of QLever. Depending on the use case this might be a deal breaker.
- Currently there is only one public QLever instance for the full OSM planet and there is no index of public instances. If this QLever instance were to shut down, no alternative provider would be available. Both Overpass and QLever let you install your own instance, but that has a cost in time and money that is prohibitive for a lot of use cases. Overpass API, instead, has a main instance supported by the OSMF and many other instances listed on the wiki. QLever is among the engines being considered to improve Wikidata Query Service and some talks about funding it have come up, but at the time of writing nothing concrete has been concluded.
- Overpass includes OSM metadata (last changeset, user, timestamp, …), QLever currently does not.
Regardless, in my opinion QLever won’t make Overpass obsolete even in the long run, for other reasons:
- The fact that QLever uses WKT for geometries is positive, as WKT is an industry standard format recognized by most geographic software, unlike OSM’s format. This however means that some information on the original geometry is lost, making Overpass the only choice for OSM-specific low-level use cases.
- OverpassQL is a domain-specific query language, with a narrow range of use cases but excellent efficiency and good ease of use. SPARQL, on the contrary, is a general-purpose language which can be useful in basically any field but in return is more complex to learn and harder to optimize, both for the user writing the query and for the query engine implementer. It must be said that the QLever team did a great job of optimization: it is going viral in the Wikidata world because of how fast it is compared with other SPARQL engines, and the implementation of spatial queries between OSM elements is smart and fast. However, it is still to be seen how fast the engine will be in hard domain-specific tasks like geospatial querying with arbitrary geometries, where existing implementations are not sufficient (more on that in the osm2rdf paper, this video and this thread).
QLever and Sophox
QLever is not the first SPARQL query service that allows querying OSM. Sophox already did it, but based on the Blazegraph query engine and with some significant differences.
First off, let’s address the elephant in the room: Sophox is currently stuck on OSM data from July 2021. Sadly, this makes it unusable for a lot of use cases. Anyway, I will compare them as if both query services were updated regularly.
The biggest difference is that while QLever contains full geometries and containment/intersection relations between elements, Sophox only includes centroids. Also, Sophox discards tags with non-ASCII characters, while osm2rdf keeps all of them.
Both engines allow federated queries to other SPARQL services. Sophox also supports kind-of-federated-queries to Overpass API, OSM wiki data items, OSM metadata and key statistics from Taginfo, but this data is stale as well.
Sophox allows bounding box filtering through the wikibase:box service; QLever currently allows it on points through geof:latitude(), geof:longitude() and FILTER(), and sooner or later it will be implemented for all geometries through GeoSPARQL functions. Both engines allow distance filtering through geof:distance() and FILTER() (Sophox obviously only on points, QLever currently only on points).
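As a rough idea of what point-based bounding box filtering on QLever could look like, here is a small Python helper that builds such a query. Take it as a sketch under assumptions: the helper name and its parameters are mine, the osmkey and geo prefixes are copied as they appear in the queries of this post, geof is the standard GeoSPARQL function namespace, and I am assuming geof:longitude() and geof:latitude() accept the WKT literal of a point. Elements with non-point geometries will simply not pass the filter.

```python
def bbox_query(key: str, value: str, min_lon: float, min_lat: float,
               max_lon: float, max_lat: float) -> str:
    """Build a SPARQL query for point elements with key=value inside a bounding box.

    Hypothetical helper: `key` and `value` are inserted as-is, so only pass
    trusted, simple values (no quotes, no colons in the key).
    """
    if min_lon > max_lon or min_lat > max_lat:
        raise ValueError("bounding box minimums must not exceed maximums")
    lines = [
        "PREFIX osmkey: </wiki/Key:>",
        "PREFIX geo: <http://www.opengis.net/ont/geosparql#>",
        "PREFIX geof: <http://www.opengis.net/def/function/geosparql/>",
        "SELECT ?element ?geometry WHERE {",
        f'  ?element osmkey:{key} "{value}" ;',
        "           geo:hasGeometry ?geometry .",
        "  FILTER(",
        f"    geof:longitude(?geometry) >= {min_lon} &&",
        f"    geof:longitude(?geometry) <= {max_lon} &&",
        f"    geof:latitude(?geometry) >= {min_lat} &&",
        f"    geof:latitude(?geometry) <= {max_lat}",
        "  )",
        "}",
    ]
    return "\n".join(lines)


# Roughly the area around Paris, as an example
query = bbox_query("railway", "station", 2.2, 48.8, 2.5, 48.9)
print(query)
```

The printed query can be pasted into the QLever web interface for the OpenStreetMap dataset.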
On this OSM wiki page it is possible to compare equivalent queries for QLever and Sophox.
So, how can you use QLever?
RDF data represents the world in triples which define relationships between a subject and an object through a predicate. Entities are not identified by simple IDs or codes, they are identified by Uniform Resource Identifiers (URIs), which universally identify entities and their source, improving interoperability between data sources.
For example, in RDF data generated from OSM through osm2rdf, the tag name="Paris Gare du Nord" on the node 3090733718 is represented with this triple:
</node/3090733718> </wiki/Key:name> "Paris Gare du Nord"
… where …
- /node/3090733718, the subject, identifies the entity we are talking about (OSM node 3090733718) with a URI
- /wiki/Key:name, the predicate, specifies which type of relationship we are declaring (OSM key name=*) with a URI
- “Paris Gare du Nord”, the object, can be a literal (like the string in this example) or the URI of another entity
Including the element’s type and geometry and showing it in the more readable Turtle syntax, it looks like this:
@prefix osmkey: </wiki/Key:> .
@prefix osmnode: </node/> .
@prefix geo: <http://www.opengis.net/ont/geosparql#> .
@prefix osm: </> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

osmnode:3090733718 osmkey:name "Paris Gare du Nord" ; # This row is equivalent to the one explained above
    rdf:type osm:node ; # The previous row ended with a ";" so the subject is implicitly the same
    geo:hasGeometry "POINT(2.3549733 48.8804003)" .
SPARQL uses a similar syntax but lets you replace any subject, predicate or object with a variable. The query engine will match the given pattern against the available data and find all possible values for the variables (“bindings”). For example, we can ask for the name, type and geometry of the element above with this query:
PREFIX osmkey: </wiki/Key:>
PREFIX osmnode: </node/>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?name ?type ?geometry
WHERE {
    osmnode:3090733718 osmkey:name ?name;
        rdf:type ?type;
        geo:hasGeometry ?geometry .
}
(Click here to run it on QLever)
In response to a SELECT SPARQL query you receive a list of possible combinations of values for the requested variables. In this example you will receive one row with ?name=Paris Gare du Nord, ?type=/node and ?geometry=POINT(2.3549733 48.8804003).
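If the idea of bindings still feels abstract, this toy Python matcher may help. It is a deliberately naive sketch (nothing like what a real engine such as QLever does internally): triples are plain tuples of strings, terms starting with "?" are variables, and every function and variable name here is my own invention.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]

# The three triples from the Turtle example, written with prefixed names
TRIPLES: List[Triple] = [
    ("osmnode:3090733718", "osmkey:name", "Paris Gare du Nord"),
    ("osmnode:3090733718", "rdf:type", "osm:node"),
    ("osmnode:3090733718", "geo:hasGeometry", "POINT(2.3549733 48.8804003)"),
]


def match_pattern(pattern: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Return `binding` extended so that `pattern` matches `triple`, or None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the same value everywhere it appears
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def solve(patterns: List[Triple], triples: List[Triple]) -> List[Binding]:
    """Find every combination of variable values satisfying all patterns."""
    bindings: List[Binding] = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


rows = solve(
    [
        ("osmnode:3090733718", "osmkey:name", "?name"),
        ("osmnode:3090733718", "rdf:type", "?type"),
        ("osmnode:3090733718", "geo:hasGeometry", "?geometry"),
    ],
    TRIPLES,
)
print(rows)
```

Each dictionary in `rows` corresponds to one row of the SELECT response, with one entry per variable.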
Let’s say you want to find all railway stations (railway=station) in France. You can run this query:
PREFIX osmkey: </wiki/Key:>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX osmrel: </relation/>
PREFIX ogc: <http://www.opengis.net/rdf#>

SELECT ?element ?name ?type ?geometry
WHERE {
    osmrel:2202162 ogc:sfContains ?element . # Only elements in France (/relation/2202162)
    ?element osmkey:railway "station" ;
        osmkey:name ?name ;
        rdf:type ?type ;
        geo:hasGeometry ?geometry .
}
(Click here to run it on QLever. You might notice that area ways currently create duplicate rows in the response; this is not intended, it’s a bug)
Let’s say you want to find all French stations that are served by the TGV (French high-speed trains). This can be checked with a federated query to Wikidata:
PREFIX osmkey: </wiki/Key:>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX osmrel: </relation/>
PREFIX ogc: <http://www.opengis.net/rdf#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX schema: <http://schema.org/>
PREFIX osm: </>

SELECT DISTINCT ?wdItem ?wpArticle ?element ?name ?type ?geometry
WHERE {
    # French train stations served by TGV
    SERVICE <https://qlever.cs.uni-freiburg.de/api/wikidata> {
        ?wdItem wdt:P17 wd:Q142; # Only items in France: P17=country, Q142=France
            wdt:P31/wdt:P279* wd:Q55488; # Only train stations: P31=instance of, P279=subclass of, Q55488=railway station
            wdt:P1192 wd:Q129337. # Only items served by TGV: P1192=connecting service, Q129337=TGV
        ?wpArticle schema:about ?wdItem;
            schema:isPartOf <https://en.wikipedia.org/>. # Only items with an English Wikipedia article
    }
    ?element osmkey:railway "station" ;
        osm:wikidata ?wdItem; # This row joins the OSM element to the Wikidata item
        osmkey:name ?name ;
        rdf:type ?type ;
        geo:hasGeometry ?geometry .
}
(Click here to run it on QLever)
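Queries don't have to be run from the web interface. SPARQL services normally speak the standard SPARQL protocol over HTTP and can return results in the standard SPARQL JSON results format, so a script along these lines should work. This is a sketch with assumptions: the OSM endpoint URL is my guess based on the URL scheme of the Wikidata endpoint used in the SERVICE clause above, and the function names are hypothetical. The sample response is hand-written so the parsing part can be tried offline.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, List

# Assumption: the OSM endpoint follows the same URL scheme as the Wikidata one
QLEVER_OSM_ENDPOINT = "https://qlever.cs.uni-freiburg.de/api/osm-planet"


def run_query(query: str, endpoint: str = QLEVER_OSM_ENDPOINT, timeout: int = 60) -> dict:
    """Send a SPARQL query via HTTP POST and return the parsed JSON response."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        endpoint,
        data=data,
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def rows_from_results(results: dict) -> List[Dict[str, str]]:
    """Flatten SPARQL JSON results into one {variable: value} dict per row."""
    variables = results["head"]["vars"]
    return [
        # Unbound variables are simply absent from a row
        {var: row[var]["value"] for var in variables if var in row}
        for row in results["results"]["bindings"]
    ]


# Hand-written sample in the SPARQL JSON results format, for offline testing
sample = {
    "head": {"vars": ["name", "geometry"]},
    "results": {
        "bindings": [
            {
                "name": {"type": "literal", "value": "Paris Gare du Nord"},
                "geometry": {"type": "literal", "value": "POINT(2.3549733 48.8804003)"},
            }
        ]
    },
}
rows = rows_from_results(sample)
print(rows)
```

With network access, `rows_from_results(run_query(query))` would do the same for a real query string.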
For other query examples I suggest checking out
- the official examples on the QLever website
- the comparisons between SPARQL and Overpass queries on this OSM wiki page
- the Examples section of this OSM wiki page
- sites based on OSM-Wikidata Map Framework, where I added the possibility to use QLever as back-end and see the used queries (these queries are still in beta and have room for improvement): Open Etymology Map, OSM-Wikidata Map, Open Artist Map, Open Architect Map, Open Burial Map
License considerations
When using federated queries, like every time you combine data from multiple sources, you should analyze the licenses of all sources to understand the limitations they impose. When you combine OSM data with anything else to create a derived dataset, it must be licensed under ODbL. Wikidata is licensed under CC0 but there are some caveats that you should be aware of; for example, you can’t import the results into OSM. All other sources that can be used directly on QLever (Wikimedia Commons, UniProt, DBLP, IMDb, DNB, …) or through federated queries have their own licenses with their own limitations. Make sure to understand them.
Conclusions
QLever is still young and needs some work to implement the missing pieces and fix some bugs, but in the long term it’s very promising. I am really excited about the future that it presents us: a single query language to rule them all, one standard way to query OSM and thousands of other datasets, with baked-in interoperability for seamless combination of multiple sources.