[00:51:33] Operations, Wikidata, Wikidata-Query-Service: Make the user agent configurable for Wikidata Query Service Updater - https://phabricator.wikimedia.org/T217896 (Smalyshev) Blazegraph accepts `http.userAgent` but looks like Updater does not. Probably makes sense to make it do the same.
[00:52:09] Operations, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: Make the user agent configurable for Wikidata Query Service Updater - https://phabricator.wikimedia.org/T217896 (Smalyshev)
[00:52:19] Operations, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: Make the user agent configurable for Wikidata Query Service Updater - https://phabricator.wikimedia.org/T217896 (Smalyshev) p:Triage→Normal
[01:09:51] (PS5) Paladox: Add "multi-site" plugin so gerrit can have multi masters [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/494865
[01:25:03] (PS1) Jdlrobson: Restore descriptions to beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/495419 (https://phabricator.wikimedia.org/T217931)
[01:41:49] (CR) Krinkle: [C: +2] "Beta-only. Will revert if it blows up." [mediawiki-config] - https://gerrit.wikimedia.org/r/495419 (https://phabricator.wikimedia.org/T217931) (owner: Jdlrobson)
[01:42:35] (CR) Krinkle: [C: +1] Remove $wgMediaInTargetLanguage, matches the MW default now [mediawiki-config] - https://gerrit.wikimedia.org/r/495408 (owner: MaxSem)
[01:42:50] (Merged) jenkins-bot: Restore descriptions to beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/495419 (https://phabricator.wikimedia.org/T217931) (owner: Jdlrobson)
[01:49:45] (CR) jenkins-bot: Restore descriptions to beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/495419 (https://phabricator.wikimedia.org/T217931) (owner: Jdlrobson)
[02:27:55] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 26187896 and 1 seconds
[02:30:17] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 52288 and 56 seconds
[02:52:33] (PS1) Paladox: Merge branch 'stable-2.16' into wmf/stable-2.16 [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/495424
[02:54:11] (Abandoned) Paladox: Merge branch 'stable-2.16' into wmf/stable-2.16 [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/495265 (owner: Paladox)
[02:54:16] (CR) Paladox: [V: +2 C: +2] Merge branch 'stable-2.16' into wmf/stable-2.16 [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/495424 (owner: Paladox)
[02:55:11] (PS2) Paladox: Merge branch 'stable-2.16' into wmf/stable-2.16 [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/495424
[06:28:09] PROBLEM - netbox HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 547 bytes in 0.163 second response time https://wikitech.wikimedia.org/wiki/Netbox
[06:28:37] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:37:03] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational
[06:37:47] RECOVERY - netbox HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 302 Found - 348 bytes in 0.669 second response time https://wikitech.wikimedia.org/wiki/Netbox
[10:07:21] PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[10:08:23] mcrouter is complaining https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-instance=All
[10:08:26] for the apis
[10:09:14] and it is for mc1022 (I checked on /var/log/mcrouter.log on one api appserver)
[10:10:22] and from https://grafana.wikimedia.org/d/000000317/memcache-slabs?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=memcached&var-instance=mc1022&var-slab=All the culprit seems to be slab 168, that is the recurrent issue with the translate-groups key
[10:10:32] in theory it should recover pretty soon
[10:10:45] on the hosts I can see tkos already gone
[10:14:02] * elukey goes afk now but available if needed
[10:24:05] RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[10:40:43] (PS1) Jayprakash12345: Enable $wgAllowCopyUploads for pawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/495446 (https://phabricator.wikimedia.org/T217486)
[11:33:27] (PS1) Volans: check_icinga: fix retry logic [software/external-monitoring] - https://gerrit.wikimedia.org/r/495447 (https://phabricator.wikimedia.org/T217599)
[12:33:42] (CR) Hashar: [C: +1] "+1 per Bryan :-] Then we will see whether it actually changes anything on the LDAP Grafana dashboard." [puppet] - https://gerrit.wikimedia.org/r/494922 (https://phabricator.wikimedia.org/T217280) (owner: GTirloni)
[13:29:08] (PS1) Elukey: superset: fix database name for analytics-tool1004 [puppet] - https://gerrit.wikimedia.org/r/495452
[13:31:54] (CR) Elukey: [C: +2] superset: fix database name for analytics-tool1004 [puppet] - https://gerrit.wikimedia.org/r/495452 (owner: Elukey)
[13:51:01] RECOVERY - superset on analytics-tool1004 is OK: TCP OK - 0.036 second response time on 10.64.36.116 port 9080
[13:52:37] RECOVERY - puppet last run on analytics-tool1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[16:30:27] PROBLEM - superset on analytics-tool1004 is CRITICAL: connect to address 10.64.36.116 and port 9080: Connection refused
[16:30:57] PROBLEM - Check systemd state on analytics-tool1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[16:32:49] RECOVERY - superset on analytics-tool1004 is OK: TCP OK - 0.036 second response time on 10.64.36.116 port 9080
[16:33:00] this is me testing on a testing host :)
[16:33:19] RECOVERY - Check systemd state on analytics-tool1004 is OK: OK - running: The system is fully operational
[21:21:00] Operations, MobileFrontend, TechCom, Traffic, Readers-Web-Backlog (Tracking): Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (Krinkle) >>! In T214998#5007578, @Tbayer wrote: >>>! In T214998#5005391, @Krinkle wrote: >> I...
[22:10:14] (CR) MarkAHershberger: "glad you suggested it. coming up!" [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/495158 (owner: MarkAHershberger)
[22:20:56] (PS2) MarkAHershberger: Add the current time to the tag for the nightly build [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/495158
[22:36:16] PROBLEM - puppet last run on db1062 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:02:08] RECOVERY - puppet last run on db1062 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures