Hi,
I have to launch 2 million queries against a Wikidata instance.
I have loaded Wikidata in Virtuoso 7 (512 RAM, 32 cores, SSD disks with RAID 0).
The queries are simple, just 2 types.
select ?s ?p ?o {
?s ?p ?o.
filter (?s = ?param)
}
select ?s ?p ?o {
?s ?p ?o.
filter (?o = ?param)
}
With a Java ThreadPoolExecutor, the full run takes 6 hours.
How can I speed up the query processing even more?
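The two query shapes above differ only in which position is fixed by the parameter. As a minimal sketch (not something proposed in the original message), several parameters could be bound in one request with a SPARQL `VALUES` clause, so that fewer round trips are needed. The helper names `chunks` and `batched_query`, the batch size, and the assumption that every parameter is an IRI are all illustrative; literal object values would need different serialization, and whether batching actually helps on a given Virtuoso setup would have to be measured.

```python
from typing import Iterable, Iterator, List


def chunks(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def batched_query(iris: Iterable[str], position: str = "s") -> str:
    """Build one query that looks up many IRIs in the given position.

    position="s" fixes the subject, position="o" fixes the object.
    Assumption: every parameter is an IRI, serialized as <...>.
    """
    if position not in ("s", "o"):
        raise ValueError("position must be 's' or 'o'")
    values = " ".join(f"<{iri}>" for iri in iris)
    return (
        "SELECT ?s ?p ?o WHERE {\n"
        f"  VALUES ?{position} {{ {values} }}\n"
        "  ?s ?p ?o .\n"
        "}"
    )


if __name__ == "__main__":
    # Hypothetical parameters; a real run would read the 2 million values from a file.
    params = [f"http://www.wikidata.org/entity/Q{n}" for n in range(1, 6)]
    for batch in chunks(params, 2):
        print(batched_query(batch, position="s"))
```

Each generated query could still be submitted through the existing thread pool; only the number of requests changes.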
I was thinking of:
a) implementing a Virtuoso cluster to distribute the queries, or
b) loading Wikidata into a Spark dataframe (since the Sansa framework is
very slow, I would use my own implementation), or
c) loading Wikidata into a PostgreSQL table and using Presto to distribute
the queries, or
d) loading Wikidata into a PG-Strom table to use GPU parallelism.
What do you think? I am looking for ideas.
Any suggestion will be appreciated.
Best,
Hello all!
We are happy to announce the availability of Wikimedia Commons Query
Service (WCQS): https://wcqs-beta.wmflabs.org/.
This is a beta SPARQL endpoint exposing the Structured Data on Commons
(SDoC) dataset. This endpoint can federate with WDQS. More work is needed
as we iterate on the service, but feel free to begin using the endpoint.
Known limitations are listed below:
* The service is a beta endpoint that is updated via weekly dumps. Some
caveats include limited performance, expected downtimes, and no interface,
naming, or backward compatibility stability guarantees.
* The service is hosted on Wikimedia Cloud Services, with limited
resources and limited monitoring. This means there may be random unplanned
downtime.
* The data will be reloaded weekly from dumps. The service will be down
during data reload. With the current amount of SDoC data, downtime will
last approximately 4 hours, but this may increase as SDoC data grows.
* Due to an issue with the dump format, the data currently only dates
back to July 5th. We’re working on getting more up-to-date data and hope to
have a solution soon. (https://phabricator.wikimedia.org/T258507 and
https://phabricator.wikimedia.org/T258474)
* The MediaInfo concept URIs (e.g.
http://commons.wikimedia.org/entity/M37200540) are currently HTTP; we may
change these to HTTPS in the near future. Please comment on T258590 if you
have concerns about this change.
* The service is restricted behind OAuth authentication, backed by Commons.
You will need an account on Commons to access the service. This is so that
we can contact abusive bots and/or users and block them selectively as a
last resort if needed.
* Please note that to log out of the service correctly, you need to use
the logout link in WCQS - logging out of just Wikimedia Commons will not
work for WCQS. This limitation will be lifted once we move to production.
* No documentation on the service is available yet. In particular, no
examples are provided yet. You can add your own examples at
https://commons.wikimedia.org/wiki/Commons:SPARQL_query_service/queries/exa…
following the format at
https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples.
* Please use the SPARQL template. Note that while there is currently a
bug that doesn’t allow us to change the “Try it!” link endpoint, the
examples will be displayed correctly on the WCQS GUI.
* WCQS is a work in progress and some bugs are to be expected, especially
related to generalizing WDQS to fit SDoC data. For example, current bugs
include:
  * URI prefixes specific to SDoC data don’t yet work; you need to use
    full URIs if you want to query using them. Relations and Q items are
    defined by Wikidata’s URI prefixes, so they work correctly.
  * Autocomplete for SDoC items doesn’t work. Without prefixes it would be
    unusable anyway, but additional work will be required after we inject SDoC
    URI prefixes into the WCQS GUI.
* If you find any additional bugs or issues, please report them via
Phabricator with the tag wikidata-query-service.
* We do plan to move the service to production, but we don’t have a
timeline on that yet. We want to emphasize that while we do expect a SPARQL
endpoint to be part of a medium to long-term solution, it will only be part
of that solution. Even once the service is production-ready, it will still
have limitations in terms of timeouts, expensive queries, and federation.
Some use cases will need to be migrated, over time, to better solutions -
once those solutions exist.
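Since SDoC prefixes do not work yet, queries have to spell out MediaInfo entities as full URIs. A minimal sketch of building such a query follows, assuming the HTTP base seen in the example URI in this announcement (which may later change to HTTPS). The names `MEDIAINFO_BASE`, `mediainfo_uri`, and `describe_query` are hypothetical, and nothing here contacts the service or handles the OAuth login.

```python
# Assumption: base taken from the example URI in the announcement; it may move to HTTPS.
MEDIAINFO_BASE = "http://commons.wikimedia.org/entity/"


def mediainfo_uri(mid: str) -> str:
    """Expand a MediaInfo ID such as 'M37200540' to its full concept URI."""
    if not (mid.startswith("M") and mid[1:].isdigit()):
        raise ValueError(f"not a MediaInfo ID: {mid!r}")
    return MEDIAINFO_BASE + mid


def describe_query(mid: str) -> str:
    """Build a query listing all statements about one MediaInfo entity,
    written with the full URI instead of a prefix."""
    return f"SELECT ?p ?o WHERE {{ <{mediainfo_uri(mid)}> ?p ?o }}"


if __name__ == "__main__":
    print(describe_query("M37200540"))
```

Keeping the base in one constant means a later switch to HTTPS would be a one-line change.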
Have fun!
Guillaume
--
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+1 / CET
Hi all,
We experienced WDQS service disruptions on 2020/07/23. As a result there
was a full outage (inability to respond to all queries) for a period of
several minutes, and a more extended period of intermittently degraded
service (inability to respond to a subset of queries) for 1-2 hours.
The full incident report is available here:
https://wikitech.wikimedia.org/wiki/Incident_documentation/20200723-wdqs-ou…
Ultimately, we traced the proximate cause to a series of non-performant
queries, which caused a deadlock in Blazegraph, the backend for WDQS. We
have placed a temporary block on the IP address in question and are taking
steps to better define service availability expectations as well as
processes to make detection of these events more streamlined going forward.
Greetings everyone,
The PCC<https://www.loc.gov/aba/pcc/>, an international cooperative cataloging effort for library collections, is launching a Wikidata Pilot to further advance the movement toward identity management. Stated broadly in its Strategic Directions document, the PCC hopes to “Accelerate the movement toward ubiquitous identifier creation and identity management at the network level … attain an environment where identity management work activity is characterized by much greater proportions and numbers of entities receiving identifiers … strategic partnerships and collaboration existing among cultural heritage organizations, rights management agencies, Wikidata, and others … collaborate with other identity management communities to facilitate and promote the use of unique identifiers.”
More specifically, this Pilot is anticipated to involve
• Comparing ease of use and benefits of Wikidata to other registries (LCNAF, ISNI)
• Assessing the productivity and quality assurance tools that exist (or should exist)
• Learning about the culture of the Wikidata community
The upcoming Pilot was featured in the LD4 Wikidata Affinity Group meeting of June 16, and more background information and discussion can be found in the presentation recording<https://stanford.zoom.us/rec/share/_eAtNuzb_HNLcK_97GzcBJ95MN2-T6a8hHRI-PYO…>, slides<https://docs.google.com/presentation/d/1NpkAQdGGft1Wi2vX0zgMtIxwXWjPq96NtXx…>, and notes<https://docs.google.com/document/d/1z1SSAp4c4tftOGW3BbJ6Fxfd8oRIhfzveh0zjeb…>.
Participants can choose to experiment in a range of focus areas based on what is of interest to their own institution, sharing their findings without each being required to delve into all the areas that are covered by the pilot. Projects of any size, however small or large, and at any stage of progress are welcome. The PCC invites interested institutions (both PCC and non-PCC) to participate by completing a short survey<https://forms.gle/5VEHS8sbQbG1JyQa9> describing their project and the issues of interest to them. Initial expressions of interest by the end of July will allow the Pilot to get underway with a kickoff meeting in early August. We will solicit firm commitments for ongoing participation at a later date.
The pilot is anticipated to last about 12 months. If you have questions, please write to John Riemer<mailto:jriemer@library.ucla.edu> or Michelle Durocher<mailto:durocher@fas.harvard.edu>.
Hilary Thorsen, on behalf of the PCC Task Group on Identity Management in NACO
Hilary Thorsen
Resource Sharing Librarian
Stanford Libraries
thorsenh(a)stanford.edu
650-285-9429