[00:10:17] !repooled wdqs1006 - caught up [00:10:21] !log repooled wdqs1006 - caught up [00:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:11:38] !log depooled wdqs1005 - let it catch up [00:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:51:23] 10Operations, 10Release-Engineering-Team, 10Scap: Enable scap to roll back broken changes to MediaWiki - https://phabricator.wikimedia.org/T225207 (10Krinkle) I'm pretty sure this has been brought up in past tasks, and been thought about from time to time as something to add to Scap. Can't find a task for it... [02:18:58] (03CR) 10Mathew.onipe: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: 10Mathew.onipe) [02:20:50] (03CR) 10jerkins-bot: [V: 04-1] Add maps reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: 10Mathew.onipe) [02:53:14] 10Operations, 10MediaWiki-Logging, 10Wikimedia-Logstash, 10wmerrors, and 7 others: Port mediawiki/php/wmerrors to PHP7 and deploy - https://phabricator.wikimedia.org/T187147 (10Legoktm) [03:17:55] (03CR) 10Tim Starling: "Sure, I can move it. 
I forgot to do a git pull before I started working on this, so the base revision still had hhvm-fatal-error.php until" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516975 (https://phabricator.wikimedia.org/T187147) (owner: 10Tim Starling) [03:18:10] (03Abandoned) 10Tim Starling: Add a fatal error page to go with the proposed wmerrors feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516975 (https://phabricator.wikimedia.org/T187147) (owner: 10Tim Starling) [03:57:55] !log repooled wdqs1005 [03:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:21:31] (03PS1) 10Tim Starling: Add a fatal error page to go with the proposed wmerrors feature [puppet] - 10https://gerrit.wikimedia.org/r/516988 (https://phabricator.wikimedia.org/T187147) [04:22:32] (03CR) 10jerkins-bot: [V: 04-1] Add a fatal error page to go with the proposed wmerrors feature [puppet] - 10https://gerrit.wikimedia.org/r/516988 (https://phabricator.wikimedia.org/T187147) (owner: 10Tim Starling) [04:24:00] (03PS2) 10Tim Starling: Add a fatal error page to go with the proposed wmerrors feature [puppet] - 10https://gerrit.wikimedia.org/r/516988 (https://phabricator.wikimedia.org/T187147) [04:24:54] (03CR) 10jerkins-bot: [V: 04-1] Add a fatal error page to go with the proposed wmerrors feature [puppet] - 10https://gerrit.wikimedia.org/r/516988 (https://phabricator.wikimedia.org/T187147) (owner: 10Tim Starling) [04:30:00] (03CR) 10Tim Starling: "> error during compilation: Evaluation Error: Error while evaluating a Function Call, Could not find data item statsd in any Hiera data fi" [puppet] - 10https://gerrit.wikimedia.org/r/516988 (https://phabricator.wikimedia.org/T187147) (owner: 10Tim Starling) [04:41:25] (03PS3) 10ArielGlenn: Add a fatal error page to go with the proposed wmerrors feature [puppet] - 10https://gerrit.wikimedia.org/r/516988 (https://phabricator.wikimedia.org/T187147) (owner: 10Tim Starling) [04:42:21] (03CR) 10jerkins-bot: [V: 04-1] Add 
a fatal error page to go with the proposed wmerrors feature [puppet] - 10https://gerrit.wikimedia.org/r/516988 (https://phabricator.wikimedia.org/T187147) (owner: 10Tim Starling) [04:56:02] (03CR) 10ArielGlenn: "> > error during compilation: Evaluation Error: Error while" [puppet] - 10https://gerrit.wikimedia.org/r/516988 (https://phabricator.wikimedia.org/T187147) (owner: 10Tim Starling) [05:10:45] 10Operations, 10observability: Upgrade grafana to 6.1 - https://phabricator.wikimedia.org/T220838 (10Peter) If we could upgrade to 6.2.x that would be great. I've been using it for my projects for a while and the [lazy loading of panels out of view](https://grafana.com/docs/guides/whats-new-in-v6-2/#lazy-loadi... [05:13:08] 10Operations, 10SRE-Access-Requests: Typo in workboard column name: "Confirmation" - https://phabricator.wikimedia.org/T225696 (10ArielGlenn) 05Open→03Resolved a:03ArielGlenn Fixed, thanks! [05:22:50] 10Operations, 10Elasticsearch, 10Icinga, 10Discovery-Search (Current work): Create Icinga plugin to check number of eligible masters - https://phabricator.wikimedia.org/T224073 (10Mathew.onipe) [05:24:05] (03PS1) 10Mathew.onipe: icinga: cirrus masters eligible check [puppet] - 10https://gerrit.wikimedia.org/r/516992 (https://phabricator.wikimedia.org/T224073) [05:24:45] (03CR) 10jerkins-bot: [V: 04-1] icinga: cirrus masters eligible check [puppet] - 10https://gerrit.wikimedia.org/r/516992 (https://phabricator.wikimedia.org/T224073) (owner: 10Mathew.onipe) [05:28:49] (03PS2) 10Mathew.onipe: icinga: cirrus masters eligible check [puppet] - 10https://gerrit.wikimedia.org/r/516992 (https://phabricator.wikimedia.org/T224073) [06:58:33] 10Operations, 10LDAP-Access-Requests: Grant WMDE engineers access to logstash / Add WMDE engineers to 'nda' LDAP group - https://phabricator.wikimedia.org/T225004 (10ArielGlenn) Great! I'll make sure this is brought up at the next SRE meeting then (Monday). 
[06:59:30] (03PS2) 10ArielGlenn: add awight as deployer [puppet] - 10https://gerrit.wikimedia.org/r/516109 (https://phabricator.wikimedia.org/T225062) [07:02:26] (03CR) 10ArielGlenn: [C: 03+2] add awight as deployer [puppet] - 10https://gerrit.wikimedia.org/r/516109 (https://phabricator.wikimedia.org/T225062) (owner: 10ArielGlenn) [07:04:16] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment cluster for awight - https://phabricator.wikimedia.org/T225062 (10ArielGlenn) It will be live in around half an hour everywhere; sometime after that, please check that you can get to the hosts you expect. [07:09:30] 10Operations, 10media-storage: CPU scaling governor on HP Gen9 hosts - https://phabricator.wikimedia.org/T225713 (10ArielGlenn) p:05Triage→03High [07:12:04] 10Operations, 10Operations-Software-Development: Error while checking binary files for python shebang - https://phabricator.wikimedia.org/T225710 (10ArielGlenn) p:05Triage→03Normal [07:12:32] 10Operations, 10serviceops, 10Core Platform Team Backlog (Watching / External), 10SCB, 10Services (watching): Upgrade python-service-checker across the fleet - https://phabricator.wikimedia.org/T225707 (10ArielGlenn) p:05Triage→03Normal [07:13:43] 10Operations, 10Discovery-Search, 10Elasticsearch, 10Wikimedia-Logstash: Migrate Elasticsearch from deprecated Gelf logstash input to rsyslog Kafka logging pipeline - https://phabricator.wikimedia.org/T225125 (10ArielGlenn) p:05Triage→03Normal [07:14:12] 10Operations, 10Performance-Team: Test usage of igbinary with apcu with MediaWiki - https://phabricator.wikimedia.org/T225074 (10ArielGlenn) p:05Triage→03Normal [07:53:41] Who maintains Graphite? Is that SRE territory? [07:54:26] nvm ^ I spent 5 seconds reading Phabricator, have my answer. 
[07:54:55] 10Operations, 10Graphite: PHP statsd client doesn't support tagging metrics - https://phabricator.wikimedia.org/T225721 (10awight) [08:02:01] 10Operations, 10Graphite: PHP statsd client doesn't support tagging metrics - https://phabricator.wikimedia.org/T225721 (10ArielGlenn) p:05Triage→03Normal [08:05:01] 10Operations, 10Operations-Software-Development: Error while checking binary files for python shebang - https://phabricator.wikimedia.org/T225710 (10hashar) The code should test whether the first line read from the file is `valid_encoding?`, else its binary I guess? The container lacks the `file` utility: `sh... [08:09:21] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:09:59] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:12:45] 10Operations, 10Graphite: PHP statsd client doesn't support tagging metrics - https://phabricator.wikimedia.org/T225721 (10awight) To say it out loud, it looks like the `liuggio/statsd-php-client` is no longer maintained. A question about releases, https://github.com/liuggio/statsd-php-client/issues/55 has be... 
[08:13:51] telia maintenance window, expires in 3.5 hours [08:20:57] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:21:33] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:23:11] 10Operations, 10SRE-Access-Requests: Requesting access to deployment cluster for awight - https://phabricator.wikimedia.org/T225062 (10awight) >>! In T225062#5258273, @ArielGlenn wrote: > It will be live in around half an hour everywhere; sometime after that, please check that you can get to the hosts you expe... [08:32:37] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:33:15] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:33:37] (03PS2) 10Gehel: wdqs: limit number of messages from the same logger also for file logging. [puppet] - 10https://gerrit.wikimedia.org/r/516837 [08:35:10] (03CR) 10Gehel: [C: 03+2] wdqs: limit number of messages from the same logger also for file logging. 
[puppet] - 10https://gerrit.wikimedia.org/r/516837 (owner: 10Gehel) [08:36:59] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:37:39] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:39:30] 10Operations, 10SRE-Access-Requests: Requesting access to deployment cluster for awight - https://phabricator.wikimedia.org/T225062 (10ArielGlenn) 05Open→03Resolved a:03ArielGlenn [08:40:10] \o/ (for the filter) [08:53:11] 10Operations, 10Analytics, 10Traffic: Investigate varnish behavior change since new ATS-change in upload - https://phabricator.wikimedia.org/T225786 (10JAllemandou) [09:12:17] 10Operations, 10Wikimedia-Mailing-lists: LGBT mailing list moderator password reset - https://phabricator.wikimedia.org/T225787 (10Ladsgroup) [09:19:50] awight: yeah SRE maintains graphite/statsd, for tagged/multidimensional metrics (and in general) we've been pushing for prometheus though (re T225721) [09:19:51] T225721: PHP statsd client doesn't support tagging metrics - https://phabricator.wikimedia.org/T225721 [09:20:27] awight: I'm assuming that's for mediawiki statsd metrics ? [09:26:50] !log test setting 'performance' governor on ms-be2016 - T210723 [09:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:55] T210723: Address recurrent service check time out for "HP RAID" on swift backend hosts - https://phabricator.wikimedia.org/T210723 [09:28:22] !log test setting 'performance' governor on ms-be2038 - T210723 [09:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:45] godog: Thanks for the breadcrumb! 
Will we still use the statsd protocol to send data from MediaWiki to prometheus? [09:33:44] awight: good question, I don't have a definitive answer though. I know it is in the air that "we should do something about mw metrics", the solution we have to bridge the statsd/prometheus gap is to use https://github.com/prometheus/statsd_exporter to leave statsd in place but have prometheus multidimensional metrics too [09:35:29] awight: some of the porting work has been done as part of T205870 although the mw review isn't merged yet [09:35:29] T205870: Provision >= 50% of statsd/Graphite-only metrics in Prometheus - https://phabricator.wikimedia.org/T205870 [09:36:17] Great for my purposes, since we'll still get datadog-style tags transformed into something equivalent. [09:36:39] Unfortunately, the stale PHP client is a blocker either way. [09:40:54] yeah definitely "tagged metrics for mediawiki" would be nice to have, whether statsd and/or prometheus and/or , I don't know what/if the plans are there [09:43:57] !log test setting 'performance' governor on ms-be2033 - T210723 [09:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:02] T210723: Address recurrent service check time out for "HP RAID" on swift backend hosts - https://phabricator.wikimedia.org/T210723 [09:44:33] !log test setting 'performance' governor on ms-be2037 - T210723 [09:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:55] 10Operations, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Address recurrent service check time out for "HP RAID" on swift backend hosts - https://phabricator.wikimedia.org/T210723 (10fgiunchedi) >>! In T210723#5255625, @faidon wrote: > So, the timeout patch above bumped the timeouts to 100s I t... 
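Note on the "datadog-style tags" mentioned at 09:36:17: a minimal Python sketch, assuming the DogStatsD wire format `name:value|type|#key:value,...`, of how a tagged statsd line splits into a metric name plus labels. This only illustrates the format; it is not the statsd_exporter implementation or the MediaWiki PHP client, and the metric name and tag keys in the usage line are hypothetical.

```python
def parse_dogstatsd_line(line: str):
    """Split 'name:value|type|#k:v,k2:v2' into (name, value, type, tags)."""
    head, *sections = line.strip().split("|")
    name, _, raw_value = head.partition(":")
    if not name or not raw_value or not sections:
        raise ValueError(f"not a statsd line: {line!r}")
    metric_type = sections[0]          # e.g. 'c' (counter), 'ms' (timer), 'g' (gauge)
    tags = {}
    for section in sections[1:]:
        # Only the '#...' section carries tags; others (e.g. '@0.5' sample rate) are skipped.
        if section.startswith("#"):
            for tag in section[1:].split(","):
                key, _, val = tag.partition(":")
                if key:
                    tags[key] = val
    return name, float(raw_value), metric_type, tags


# Hypothetical metric and tag names, for illustration only.
print(parse_dogstatsd_line("mediawiki.example.count:1|c|#wiki:ukwikisource,result:ok"))
```

A line without a `#` section yields an empty tag dict, which is how plain (untagged) statsd traffic would look to the same parser.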
[09:48:06] (03Abandoned) 10Mvolz: Enable reftabs on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514461 (https://phabricator.wikimedia.org/T199197) (owner: 10Mvolz) [10:00:39] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [10:02:38] 10Operations, 10MediaWiki-Cache, 10Patch-For-Review, 10Performance-Team (Radar), 10User-Elukey: Consider raising Memcached MWObject cache memory size limit - https://phabricator.wikimedia.org/T217731 (10elukey) 05Open→03Declined This is probably not worth pursuing for the moment, the config for mc101... [10:02:41] 10Operations, 10MediaWiki-Cache, 10Patch-For-Review, 10Performance-Team (Radar), and 2 others: Mcrouter periodically reports soft TKOs for mc1029 (was mc1035, mc1022) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) [10:04:08] 10Operations, 10MediaWiki-Cache, 10Performance-Team (Radar), 10User-Elukey: mcrouter does not remove a memcached shard from consistent hashing when timeouts happen - https://phabricator.wikimedia.org/T208934 (10elukey) [10:13:49] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517027 [10:14:00] 10Operations, 10MediaWiki-Cache, 10Performance-Team (Radar), 10User-Elukey: mcrouter does not remove a memcached shard from consistent hashing when timeouts happen - https://phabricator.wikimedia.org/T208934 (10elukey) The parent task has been completed, we now have way less and sporadic TKO events and req... 
[10:14:26] !log test setting 'performance' governor on ms-be2031 - T210723 [10:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:32] T210723: Address recurrent service check time out for "HP RAID" on swift backend hosts - https://phabricator.wikimedia.org/T210723 [10:15:55] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Fully repool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517027 (owner: 10Marostegui) [10:16:47] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517027 (owner: 10Marostegui) [10:17:02] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517027 (owner: 10Marostegui) [10:17:31] (03PS1) 10Gehel: Revert "wdqs: ban disallowed User-Agent at nginx" [puppet] - 10https://gerrit.wikimedia.org/r/517029 [10:17:57] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1077 after recovering from a crash (duration: 00m 49s) [10:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:10] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1077 crashed - https://phabricator.wikimedia.org/T225391 (10Marostegui) db1077 has been fully repooled. 
[10:18:46] (03CR) 10Vgutierrez: [C: 03+1] Revert "wdqs: ban disallowed User-Agent at nginx" [puppet] - 10https://gerrit.wikimedia.org/r/517029 (owner: 10Gehel) [10:19:11] (03PS2) 10Gehel: Revert "wdqs: ban disallowed User-Agent at nginx" [puppet] - 10https://gerrit.wikimedia.org/r/517029 [10:19:59] (03CR) 10Gehel: [C: 03+2] Revert "wdqs: ban disallowed User-Agent at nginx" [puppet] - 10https://gerrit.wikimedia.org/r/517029 (owner: 10Gehel) [10:22:05] !log Optimize tables on pc2008 - T210725 [10:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:10] T210725: Replace parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725 [10:27:53] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:37:24] (03PS1) 10Vgutierrez: wdqs: Set a custom U-A for the categories lag check [puppet] - 10https://gerrit.wikimedia.org/r/517032 [10:41:34] (03CR) 10Gehel: [C: 03+1] "tested on wdqs1007, works just fine" [puppet] - 10https://gerrit.wikimedia.org/r/517032 (owner: 10Vgutierrez) [10:44:19] 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10tramm) Maybe you know how to move on with this, @Aklapper? I have now sent e-mails to everybody involved in this issue without answer for several weeks, the so... 
[10:45:54] (03PS2) 10Gehel: wdqs: Set a custom U-A for the categories lag check [puppet] - 10https://gerrit.wikimedia.org/r/517032 (owner: 10Vgutierrez) [10:48:37] (03CR) 10CDanis: [C: 03+1] wdqs: Set a custom U-A for the categories lag check [puppet] - 10https://gerrit.wikimedia.org/r/517032 (owner: 10Vgutierrez) [10:50:41] (03CR) 10Vgutierrez: [C: 03+2] wdqs: Set a custom U-A for the categories lag check [puppet] - 10https://gerrit.wikimedia.org/r/517032 (owner: 10Vgutierrez) [10:50:55] (03PS3) 10Vgutierrez: wdqs: Set a custom U-A for the categories lag check [puppet] - 10https://gerrit.wikimedia.org/r/517032 [10:58:18] 10Operations, 10Analytics, 10Traffic: Investigate varnish behavior change since new ATS-change in webrequest upload - https://phabricator.wikimedia.org/T225786 (10JAllemandou) [10:59:54] 10Operations, 10Analytics, 10Traffic: Investigate varnish behavior change since new ATS-change in webrequest upload - https://phabricator.wikimedia.org/T225786 (10JAllemandou) [11:03:07] (03PS1) 10Fdans: Analytics Refinery: bump up jar version to apply latest source changes [puppet] - 10https://gerrit.wikimedia.org/r/517050 [11:36:36] !log test setting 'performance' governor on ms-be2034 - T210723 [11:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:41] T210723: Address recurrent service check time out for "HP RAID" on swift backend hosts - https://phabricator.wikimedia.org/T210723 [12:38:13] !log test setting 'performance' governor on ms-be2032 - T210723 [12:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:19] T210723: Address recurrent service check time out for "HP RAID" on swift backend hosts - https://phabricator.wikimedia.org/T210723 [12:40:58] there's a ton of mw errors going on since 12 UTC btw [12:41:05] PHP Notice: Undefined property: stdClass::$value [12:41:16] /srv/mediawiki/php-1.34.0-wmf.8/includes/api/ApiQueryQueryPage.php:125 [12:44:13] looks like it is a wikisource 
only thing [12:44:32] I'm guessing someone started requesting a specific query page [12:44:37] got a url? [12:44:56] /w/api.php?action=query&list=querypage&qppage=IndexPages&format=json&qplimit=500&qpoffset=500 [12:45:04] on uk.wikisource.org [12:45:11] https://github.com/wikimedia/mediawiki/blob/61544d6eb2353/includes/api/ApiQueryQueryPage.php#L125 [12:45:42] Probably a ProofreadPage bug [12:46:01] or just that querycache is crap [12:46:02] how would the table not have that column though [12:46:52] probably there but not selected? [12:47:19] I'm wondering if it might have anything to do with db1077 repooling, it is on s3 [12:47:45] marostegui: ^ JFYI [12:47:50] Hi, has anyone changed the CI config regarding ratelimits? Our (=Wikibase) tests started to be super flaky with ratelimit errors for IPs being set to 8 per 60s, which doesn't seem right [12:48:09] See https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-hhvm-docker/53930/consoleFull for the log and https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-hhvm-docker/53930/artifact/log/mw-ratelimit.log for the rate-limits [12:48:29] the first ratelimit error corresponds with the first test failure [12:49:27] Reedy: Yeah, it's not selecting a value - https://github.com/wikimedia/mediawiki-extensions-ProofreadPage/blob/fb1b06269afef0a7d424774992e729c46852fe42/includes/Special/SpecialProofreadPages.php#L211 [12:49:42] heh [12:49:44] during cache look up, core enforces presence of it but during raw fetch, it's up to the original query [12:49:47] and if it's not there, it's not there [12:50:15] I guess it doesn't need it, but yeah, needs to be set to null in that case I guess? 
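Note on the "needs to be set to null" suggestion at 12:50:15: a minimal Python analogue, not the MediaWiki PHP code, of reading an optional field from a result row so that a row fetched without a `value` column yields None instead of an error. `SimpleNamespace` stands in for PHP's stdClass, and the `namespace`/`title`/`value` field names and the `format_row` helper are assumptions made for illustration.

```python
from types import SimpleNamespace


def row_value(row):
    """Return row.value if the query selected it, else None (no error raised)."""
    return getattr(row, "value", None)


def format_row(row):
    """Build an output dict, including 'value' only when the row carries one."""
    out = {"ns": row.namespace, "title": row.title}
    value = row_value(row)
    if value is not None:
        out["value"] = value
    return out


# Hypothetical rows: one from a query that selected no value, one that did.
print(format_row(SimpleNamespace(namespace=0, title="Example")))
print(format_row(SimpleNamespace(namespace=0, title="Example", value=7)))
```

Whether the default belongs in the consumer (as here) or in the query that builds the row is the open question in the conversation; the sketch only shows the consumer-side variant.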
[12:50:24] I don't know what we do there for other query pages that don't need a third value [12:50:32] anyway, likely very old bug just being hit by users now [12:50:55] godog: that host has been repooled for days, I just gave it more traffic [12:51:09] godog: not infra/DB related [12:51:20] deterministic, affects all DBs, old bug. user triggered. [12:51:35] marostegui Krinkle thanks! yeah misled by SAL time alignment [12:51:56] Michael_WMDE: Unless someone has overridden them in some hook/extension, there doesn't seem to be any changes in DefaultSettings recently [12:52:07] godog: wanna file #prod-error task with a sample trace and exception id? [12:52:41] Krinkle: sure, doing now [12:54:37] {{done}} T225813 [12:54:38] T225813: ErrorException from line 125 of /srv/mediawiki/php-1.34.0-wmf.8/includes/api/ApiQueryQueryPage.php: PHP Notice: Undefined property: stdClass::$value - https://phabricator.wikimedia.org/T225813 [12:56:32] zeljkof: to move to this channel. We haven't really changed anything in Wikibase since yesterday, when all was fine. Hence we wonder whether some CI setup has changed recently? 
(git history does not suggest it, but what do i know) [12:57:57] PROBLEM - HHVM rendering on mw1231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:59:13] RECOVERY - HHVM rendering on mw1231 is OK: HTTP OK: HTTP/1.1 200 OK - 81814 bytes in 0.134 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:06:24] (03CR) 10Reedy: [C: 03+1] Consistent beta wikidata urls, without www [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516753 (owner: 10Matthias Mullie) [13:12:51] PROBLEM - High lag on wdqs1003 is CRITICAL: 3640 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [13:14:43] (03Restored) 10Mvolz: Enable reftabs on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514461 (https://phabricator.wikimedia.org/T199197) (owner: 10Mvolz) [13:15:45] PROBLEM - High lag on wdqs1003 is CRITICAL: 3635 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [13:16:06] (03PS5) 10Mvolz: Enable reftabs on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514461 (https://phabricator.wikimedia.org/T199197) [13:21:31] PROBLEM - High lag on wdqs1003 is CRITICAL: 3668 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [13:22:18] !log joal@deploy1001 Started restart [analytics/aqs/deploy@fc1d232]: (no justification provided) [13:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:45] Sorry forgot to message -- This is to try fixing AQS incoherent config state (deploy before puppet had run everywhere) [13:26:17] ACKNOWLEDGEMENT - High lag on wdqs1003 is CRITICAL: 3654 ge 3600 Gehel server depooled to catch up on lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [13:26:27] !log depooling wdqs1003 to allow it to catch up on lag 
[13:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:51] (03PS1) 10Filippo Giunchedi: prometheus: let group 'prometheus' own metrics directory [puppet] - 10https://gerrit.wikimedia.org/r/517073 [13:47:21] leszek_wmde: sorry, I was out to lunch, this is the task? T225796 [13:47:22] T225796: Wikibase and Lexeme browser tests are failing with `failed-save: The save has failed.` - https://phabricator.wikimedia.org/T225796 [13:47:44] yes [13:47:56] leszek_wmde: thanks, looking into it [13:55:37] PROBLEM - puppet last run on dns2001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:57:03] PROBLEM - Host mw1294 is DOWN: PING CRITICAL - Packet loss = 100% [13:58:44] I don't see anything obvious [13:58:54] all videos I've checked are broken :/ [13:59:55] but screenshots are there, but they look fine [14:02:04] strange that it fails for both node and ruby, so I guess the problem is not there, but somewhere else [14:09:06] leszek_wmde: Lucas_WMDE, sorry, I can't help much, I'm not sure why a mediawiki in a jenkins VM is rate limited at all [14:10:50] zeljkof: yeah, that's what we've been wondering as well [14:11:14] zeljkof: I guess we'd need to open a ticket and hope someone with the know chimes in? [14:11:41] there have been some changes to the CI (node 6 to 10 upgrade) but I don't see how that would cause this problem [14:11:59] I've already pinged hashar on the task, he would know [14:15:04] fantastic, thank you sir [14:22:47] RECOVERY - puppet last run on dns2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:32:19] 10Operations, 10Graphite: PHP statsd client doesn't support tagging metrics - https://phabricator.wikimedia.org/T225721 (10Ottomata) FWIW, WMF is slowly moving away from statsd in favor of Prometheus. I'm not sure what the Mediawiki plan is. 
@fgiunchedi [14:32:50] (03CR) 10Ottomata: "I'll wait til monday to merge." [puppet] - 10https://gerrit.wikimedia.org/r/517050 (owner: 10Fdans) [14:35:17] !log powercycle mw1294, down and no console [14:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:11] RECOVERY - Host mw1294 is UP: PING WARNING - Packet loss = 86%, RTA = 0.23 ms [14:37:52] 10Operations, 10Graphite: PHP statsd client doesn't support tagging metrics - https://phabricator.wikimedia.org/T225721 (10awight) >>! In T225721#5259308, @Ottomata wrote: > FWIW, WMF is slowly moving away from statsd in favor of Prometheus. I'm not sure what the Mediawiki plan is. @fgiunchedi It looks like... [15:22:32] 10Operations, 10Graphite: PHP statsd client doesn't support tagging metrics - https://phabricator.wikimedia.org/T225721 (10fgiunchedi) Indeed, there's been a push to move onto Prometheus as the supported platform. We (SRE) haven't done any work towards supporting tags in statsd/graphite though as that platform... 
[15:24:00] !log test setting 'performance' governor on ms-be2035 - T210723 [15:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:05] T210723: Address recurrent service check time out for "HP RAID" on swift backend hosts - https://phabricator.wikimedia.org/T210723 [15:56:39] !log repooling wdqs1003, not catching up anyway (high edit load) [15:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:10] (03PS1) 10Fdans: ReportUpdater: change repo of all queries to reportupdater-queries [puppet] - 10https://gerrit.wikimedia.org/r/517085 (https://phabricator.wikimedia.org/T221064) [16:00:19] (03CR) 10Fdans: [C: 04-1] "do not merge until https://gerrit.wikimedia.org/r/#/c/517084/ is merged" [puppet] - 10https://gerrit.wikimedia.org/r/517085 (https://phabricator.wikimedia.org/T221064) (owner: 10Fdans) [16:17:23] (03PS1) 10Hashar: contint: remove colordiff [puppet] - 10https://gerrit.wikimedia.org/r/517091 (https://phabricator.wikimedia.org/T225735) [16:23:05] (03PS1) 10Hashar: contint: remove unused contint::packages::python [puppet] - 10https://gerrit.wikimedia.org/r/517092 (https://phabricator.wikimedia.org/T225735) [16:27:15] (03PS1) 10Hashar: contint: remove several unused packages [puppet] - 10https://gerrit.wikimedia.org/r/517093 (https://phabricator.wikimedia.org/T225735) [16:27:17] (03PS1) 10Hashar: contint: remove unneeded profile::ci::hhvm [puppet] - 10https://gerrit.wikimedia.org/r/517094 (https://phabricator.wikimedia.org/T225735) [16:31:44] (03PS2) 10Hashar: contint: remove several unused packages [puppet] - 10https://gerrit.wikimedia.org/r/517093 (https://phabricator.wikimedia.org/T225735) [16:31:46] (03PS2) 10Hashar: contint: remove unneeded profile::ci::hhvm [puppet] - 10https://gerrit.wikimedia.org/r/517094 (https://phabricator.wikimedia.org/T225735) [16:31:48] (03PS1) 10Hashar: contint: remove MySQL related configuration [puppet] - 10https://gerrit.wikimedia.org/r/517095 
(https://phabricator.wikimedia.org/T225735) [16:36:42] (03PS1) 10Hashar: contint: drop contint::tmpfs [puppet] - 10https://gerrit.wikimedia.org/r/517098 (https://phabricator.wikimedia.org/T225735) [17:51:29] (03PS1) 10Bstorm: toolforge: make backup registry optional (for toolsbeta) [puppet] - 10https://gerrit.wikimedia.org/r/517110 (https://phabricator.wikimedia.org/T221721) [17:53:03] (03PS1) 10Brennen Bearnes: CI: Create lightweight agent role for Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/517111 (https://phabricator.wikimedia.org/T224069) [17:53:28] (03CR) 10jerkins-bot: [V: 04-1] CI: Create lightweight agent role for Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/517111 (https://phabricator.wikimedia.org/T224069) (owner: 10Brennen Bearnes) [17:53:54] (03CR) 10Brennen Bearnes: "Paired with Thcipriani." [puppet] - 10https://gerrit.wikimedia.org/r/517111 (https://phabricator.wikimedia.org/T224069) (owner: 10Brennen Bearnes) [17:55:52] (03PS2) 10Brennen Bearnes: CI: Create lightweight agent role for Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/517111 (https://phabricator.wikimedia.org/T224069) [18:24:29] (03PS1) 10Bstorm: dumps distribution: enable the service for nfs to start on reboot [puppet] - 10https://gerrit.wikimedia.org/r/517117 (https://phabricator.wikimedia.org/T217474) [18:31:57] (03CR) 10ArielGlenn: "Can you say a bit more about 'hoping that it will page'? Can we ensure it does?" [puppet] - 10https://gerrit.wikimedia.org/r/517117 (https://phabricator.wikimedia.org/T217474) (owner: 10Bstorm) [19:08:11] (03CR) 10Krinkle: "Indeed. profile::webperf::processors uses it without issue. Not sure what's going on..." 
[puppet] - 10https://gerrit.wikimedia.org/r/516988 (https://phabricator.wikimedia.org/T187147) (owner: 10Tim Starling) [19:10:01] (03CR) 10Krinkle: Add a fatal error page to go with the proposed wmerrors feature (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/516988 (https://phabricator.wikimedia.org/T187147) (owner: 10Tim Starling) [19:12:55] (03PS1) 1020after4: phabricator: change owner / mode on redirect_config.json [puppet] - 10https://gerrit.wikimedia.org/r/517124 [19:14:01] (03PS2) 1020after4: phabricator: change owner / mode on redirect_config.json [puppet] - 10https://gerrit.wikimedia.org/r/517124 [19:17:22] !log depooled wdqs1003 to catch up [19:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:49] (03PS6) 10Paladox: Gerrit: Quadruple web session cache memory to 8192 [puppet] - 10https://gerrit.wikimedia.org/r/513682 (https://phabricator.wikimedia.org/T222472) [19:30:04] (03PS7) 10Paladox: Gerrit: Quadruple web session cache memory to 8192 [puppet] - 10https://gerrit.wikimedia.org/r/513682 (https://phabricator.wikimedia.org/T222472) [19:42:12] (03CR) 10Thcipriani: [C: 03+1] Gerrit: Quadruple web session cache memory to 8192 [puppet] - 10https://gerrit.wikimedia.org/r/513682 (https://phabricator.wikimedia.org/T222472) (owner: 10Paladox) [19:42:36] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "A change in the tests hiera value is needed. I can take care of that, the patch overall LGTM (but see Krinkle's comment as well). My -1 mo" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/516988 (https://phabricator.wikimedia.org/T187147) (owner: 10Tim Starling) [19:56:08] !log depooled wdqs2003 to catch up [19:56:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:03] 10Operations, 10Wikimedia-Mailing-lists: LGBT mailing list moderator password reset - https://phabricator.wikimedia.org/T225787 (10Quiddity) Done, and sent to Laura. 
[21:50:04] (03PS1) 1020after4: phabricator: Allow admins to silence maniphest bulk jobs via sudo [puppet] - 10https://gerrit.wikimedia.org/r/517140 [21:51:43] (03CR) 10Greg Grossmeier: [C: 03+1] "Yes please." [puppet] - 10https://gerrit.wikimedia.org/r/517140 (owner: 1020after4) [21:51:46] (03CR) 10Smalyshev: varnish: Rate limit wdqs requests violating UA policy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/516803 (owner: 10Vgutierrez) [21:59:05] (03CR) 1020after4: "https://puppet-compiler.wmflabs.org/compiler1002/16957/" [puppet] - 10https://gerrit.wikimedia.org/r/517124 (owner: 1020after4) [21:59:12] (03CR) 1020after4: [C: 03+1] phabricator: change owner / mode on redirect_config.json [puppet] - 10https://gerrit.wikimedia.org/r/517124 (owner: 1020after4) [22:02:12] (03CR) 1020after4: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/16958/" [puppet] - 10https://gerrit.wikimedia.org/r/517140 (owner: 1020after4) [22:19:16] (03CR) 10Bstorm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/517117 (https://phabricator.wikimedia.org/T217474) (owner: 10Bstorm) [22:29:02] 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10CRoslof) >>! In T204056#5181204, @tramm wrote: > Previously I didn't understand your desired solution because I didn't get what @CRoslof means by "retaining ow... 
[22:53:13] RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 1033 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [23:10:36] <_joe_> !log set cpufreq governor for mw1348 to performance [23:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:02] !log repooled wdqs2003 [23:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:59] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [23:19:30] !log repooled wdqs1003 [23:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:59] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1