[00:00:04] twentyafterfour: That opportune time is upon us again. Time for a Phabricator update deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190822T0000).
[00:09:59] Operations, Traffic: Configure Layer3 hashing for router ECMP (for anycast DNS) - https://phabricator.wikimedia.org/T230955 (ayounsi) Tested in ulsfo with: ` # show forwarding-options enhanced-hash-key family inet { no-destination-port; no-source-port; } family inet6 { no-destination-port;...
[00:15:18] !log Starting phabricator upgrade from tag release/2019-08-14/1 to release/2019-08-22/1
[00:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:11] !log push L3 ECMP to codfw routers - T230955
[00:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:16] T230955: Configure Layer3 hashing for router ECMP (for anycast DNS) - https://phabricator.wikimedia.org/T230955
[00:21:37] !log phabricator update completed without incident
[00:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:22:02] !log push L3 ECMP to eqsin routers - T230955
[00:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:34] !log push L3 ECMP to esams routers - T230955
[00:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:27:26] !log push L3 ECMP to eqiad routers - T230955
[00:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:27:32] T230955: Configure Layer3 hashing for router ECMP (for anycast DNS) - https://phabricator.wikimedia.org/T230955
[00:35:24] ACKNOWLEDGEMENT - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
Ayounsi https://phabricator.wikimedia.org/T230964 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:36:03] RECOVERY - PHP opcache health on mwdebug1002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:37:23] !log run /usr/local/sbin/restart-php7.2-fpm on mwdebug1001/2
[00:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:48:37] (PS2) Vgutierrez: prometheus: Consider the new layer label for ATS aggregation rules [puppet] - https://gerrit.wikimedia.org/r/531334 (https://phabricator.wikimedia.org/T221594)
[05:02:28] (PS2) Marostegui: db-eqiad,db-codfw.php: Remove db2059 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/531480 (https://phabricator.wikimedia.org/T230884)
[05:04:30] (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Remove db2059 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/531480 (https://phabricator.wikimedia.org/T230884) (owner: Marostegui)
[05:05:20] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Remove db2059 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/531480 (https://phabricator.wikimedia.org/T230884) (owner: Marostegui)
[05:06:33] (CR) jenkins-bot: db-eqiad,db-codfw.php: Remove db2059 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/531480 (https://phabricator.wikimedia.org/T230884) (owner: Marostegui)
[05:06:52] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db2059 from config T230884 (duration: 00m 59s)
[05:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:06:59] T230884: Decommission db2059.codfw.wmnet - https://phabricator.wikimedia.org/T230884
[05:08:02] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db2059 from config T230884 (duration: 00m 55s)
[05:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:14:52] !log Remove db2059 from tendril and zarcillo - T230884
[05:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:14:58] T230884: Decommission db2059.codfw.wmnet - https://phabricator.wikimedia.org/T230884
[05:23:45] (PS1) Marostegui: mariadb: Decommission db2059 [puppet] - https://gerrit.wikimedia.org/r/531593 (https://phabricator.wikimedia.org/T230884)
[05:24:32] (CR) Marostegui: [C: +2] mariadb: Decommission db2059 [puppet] - https://gerrit.wikimedia.org/r/531593 (https://phabricator.wikimedia.org/T230884) (owner: Marostegui)
[05:27:29] (CR) Marostegui: [C: +1] mariadb::backups - codfw: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531207 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[05:27:59] (CR) Marostegui: [C: +1] role::mariadb::misc - codfw: add ipv6 mapped [puppet] - https://gerrit.wikimedia.org/r/531166 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[05:37:16] (PS1) Marostegui: db-eqiad,db-codfw.php: Remove db2066 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/531594 (https://phabricator.wikimedia.org/T230885)
[05:38:10] (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Remove db2066 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/531594 (https://phabricator.wikimedia.org/T230885) (owner: Marostegui)
[05:39:04] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Remove db2066 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/531594 (https://phabricator.wikimedia.org/T230885) (owner: Marostegui)
[05:39:22] (CR) jenkins-bot: db-eqiad,db-codfw.php: Remove db2066 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/531594 (https://phabricator.wikimedia.org/T230885) (owner: Marostegui)
[05:40:23] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db2066 from config T230885 (duration: 00m 54s)
[05:40:28] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:28] T230885: Decommission db2066.codfw.wmnet - https://phabricator.wikimedia.org/T230885
[05:41:22] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db2066 from config T230885 (duration: 00m 54s)
[05:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:13] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:37:04] !log installing python-pip updates from Stretch 9.9 point release
[06:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:18] Operations, ops-eqiad, DBA: db1114 crashed due to memory issues (server under warranty) - https://phabricator.wikimedia.org/T229452 (Marostegui) Open→Resolved The log is still clear. So closing this, if it happens again I will re-open ` /admin1-> racadm getsel Record: 1 Date/Time: 08/2...
[06:40:51] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:52:22] !log installing mariadb-10.1 updates from Stretch 9.9 point release (unrelated to wmf-mariadb, mostly client-side clients/libraries as shipped in Debian)
[06:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:23] I'm going to do the backport to fix this train blocker in a moment: T230937
[06:57:23] T230937: TermboxView.php: Call to a member function getSerialization() on a non-object (null) - https://phabricator.wikimedia.org/T230937
[07:13:37] Operations: Integrate Stretch 9.9 point update - https://phabricator.wikimedia.org/T222053 (MoritzMuehlenhoff)
[07:17:10] (PS1) Marostegui: dbproxy1019: Provision dbproxy1019 to replace dbproxy1011 [puppet] - https://gerrit.wikimedia.org/r/531598 (https://phabricator.wikimedia.org/T202367)
[07:21:12] (PS2) Marostegui: dbproxy1019: Provision dbproxy1019 to replace dbproxy1011 [puppet] - https://gerrit.wikimedia.org/r/531598 (https://phabricator.wikimedia.org/T202367)
[07:44:40] Doing the backport now Jenkins is finally done
[07:46:47] !log Deploy grants on labsdb1009-labsdb1012 to allow connections for haproxy from dbproxy1019 - T202367
[07:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:54] T202367: Productionize dbproxy101[2-7].eqiad.wmnet and dbproxy200[1-4] - https://phabricator.wikimedia.org/T202367
[07:54:30] !log tarrow@deploy1001 Synchronized php-1.34.0-wmf.19/extensions/Wikibase/repo/: Backport for UBN [[gerrit:531527|Hack to avoid trying to termbox render page before save (T230937)]] (duration: 00m 56s)
[07:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:36] T230937: TermboxView.php: Call to a member function getSerialization() on a non-object (null) - https://phabricator.wikimedia.org/T230937
[07:57:38] (CR) Marostegui: "I have created the haproxy for dbproxy1019 on all the labs hosts" [puppet] -
https://gerrit.wikimedia.org/r/531598 (https://phabricator.wikimedia.org/T202367) (owner: Marostegui)
[07:57:55] (CR) Marostegui: [C: +2] dbproxy1019: Provision dbproxy1019 to replace dbproxy1011 [puppet] - https://gerrit.wikimedia.org/r/531598 (https://phabricator.wikimedia.org/T202367) (owner: Marostegui)
[07:59:04] (PS1) Muehlenhoff: Fix typo [puppet] - https://gerrit.wikimedia.org/r/531604
[08:02:48] I'm now done :)
[08:03:52] (PS2) Muehlenhoff: Fix typo [puppet] - https://gerrit.wikimedia.org/r/531604
[08:14:24] Operations, Developer-Advocacy, Discourse: Migration of discourse-mediawiki.wmflabs.org from wmflabs to production - https://phabricator.wikimedia.org/T184461 (Aklapper)
[08:15:10] (CR) Muehlenhoff: [C: +2] Fix typo [puppet] - https://gerrit.wikimedia.org/r/531604 (owner: Muehlenhoff)
[08:16:14] Operations, Phabricator, Traffic, Release-Engineering-Team (Development services), and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (mmodell) >>! In T226044#5429736, @JAufrecht wrote: > The ideal breadcrumb might be > >...
[08:17:31] !log restarting oozie on an-coord1001
[08:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:33] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&panelId=9&fullscreen
[08:26:51] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1004 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&panelId=9&fullscreen
[08:36:58] if anyone is going to puppet-merge something, I have a clean up patch for contint "Remove role::ci::slave::webperformance" | https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/531420/
[08:37:03] it is no more used :-]
[08:39:23] <_joe_> hashar: in a few :)
[08:40:35] (CR) Filippo Giunchedi: [C: -1] "I apologize I wasn't clear enough in my instructions, the _layer name should be appended before the first column in the metric name, I've " (1 comment) [puppet] - https://gerrit.wikimedia.org/r/531334 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[08:41:15] oops :)
[08:43:55] (PS3) Vgutierrez: prometheus: Consider the new layer label for ATS aggregation rules [puppet] - https://gerrit.wikimedia.org/r/531334 (https://phabricator.wikimedia.org/T221594)
[08:44:05] (PS1) Marostegui: install_server: Allow re-image dbproxy1018,dbproxy1019 [puppet] - https://gerrit.wikimedia.org/r/531660 (https://phabricator.wikimedia.org/T202367)
[08:44:16] and we have apparently lost phabricator :-\
[08:44:25] oh was transient
[08:44:29] * hashar blames internet
[08:44:30] hashar: woot?
[08:44:34] hashar: works here
[08:44:39] marostegui: probably it was just me :-]
[08:45:46] (CR) Marostegui: [C: +2] install_server: Allow re-image dbproxy1018,dbproxy1019 [puppet] - https://gerrit.wikimedia.org/r/531660 (https://phabricator.wikimedia.org/T202367) (owner: Marostegui)
[08:46:00] (CR) Vgutierrez: "> Patch Set 1: Code-Review-1" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/531334 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[08:46:07] I just have some packet loss over IPv6 but that is my ISP to blame for it
[08:47:29] (PS1) Ema: ATS: allow @debug syscall family [puppet] - https://gerrit.wikimedia.org/r/531662
[08:49:06] (CR) Muehlenhoff: [C: +1] ATS: allow @debug syscall family [puppet] - https://gerrit.wikimedia.org/r/531662 (owner: Ema)
[08:49:30] (CR) Vgutierrez: [C: +1] ATS: allow @debug syscall family [puppet] - https://gerrit.wikimedia.org/r/531662 (owner: Ema)
[08:49:46] that's a +2?
[08:49:57] looks like!
[08:50:23] bots aren't as fast as my colleagues though
[08:50:30] so now waiting for jenkins
[08:54:57] (CR) Ema: [C: +2] ATS: allow @debug syscall family [puppet] - https://gerrit.wikimedia.org/r/531662 (owner: Ema)
[09:00:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[09:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:11] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): relocate/reimage cloudvirt1023 with 10G interfaces - https://phabricator.wikimedia.org/T229871 (Andrew) @Cmjohnson and @wiki_willy, I just want to clarify what's happening on this ticket. The primary task (re-racking and movi...
[09:01:14] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): relocate/reimage cloudvirt1022 with 10G interfaces - https://phabricator.wikimedia.org/T229872 (Andrew) @Cmjohnson and @wiki_willy, I just want to clarify what's happening on this ticket.
The primary task (re-racking and movi...
[09:01:19] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): relocate/reimage cloudvirt1021 with 10G interfaces - https://phabricator.wikimedia.org/T229873 (Andrew) @Cmjohnson and @wiki_willy, I just want to clarify what's happening on this ticket. The primary task (re-racking and movi...
[09:02:12] (PS1) Filippo Giunchedi: mediawiki: alert on 2xx latency [puppet] - https://gerrit.wikimedia.org/r/531664 (https://phabricator.wikimedia.org/T230396)
[09:02:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:43] (CR) Filippo Giunchedi: "Should be less noisy now" [puppet] - https://gerrit.wikimedia.org/r/531664 (https://phabricator.wikimedia.org/T230396) (owner: Filippo Giunchedi)
[09:03:59] (CR) Giuseppe Lavagetto: [C: +1] mediawiki: alert on 2xx latency [puppet] - https://gerrit.wikimedia.org/r/531664 (https://phabricator.wikimedia.org/T230396) (owner: Filippo Giunchedi)
[09:04:01] (CR) Filippo Giunchedi: [C: +2] "LGTM!"
[puppet] - https://gerrit.wikimedia.org/r/531334 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[09:04:47] nice :D
[09:04:53] godog: I'm merging this one before: https://gerrit.wikimedia.org/r/c/operations/puppet/+/508289/
[09:04:54] (CR) Filippo Giunchedi: [C: +2] mediawiki: alert on 2xx latency [puppet] - https://gerrit.wikimedia.org/r/531664 (https://phabricator.wikimedia.org/T230396) (owner: Filippo Giunchedi)
[09:04:56] (PS2) Filippo Giunchedi: mediawiki: alert on 2xx latency [puppet] - https://gerrit.wikimedia.org/r/531664 (https://phabricator.wikimedia.org/T230396)
[09:06:05] vgutierrez: yup, sounds good to me
[09:09:36] !log rolling ats-backend-restart to enable @debug system call family
[09:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:36] (CR) Vgutierrez: [C: +2] prometheus: Identify trafficserver instances using the layer label [puppet] - https://gerrit.wikimedia.org/r/508289 (https://phabricator.wikimedia.org/T221217) (owner: Vgutierrez)
[09:10:46] (PS10) Vgutierrez: prometheus: Identify trafficserver instances using the layer label [puppet] - https://gerrit.wikimedia.org/r/508289 (https://phabricator.wikimedia.org/T221217)
[09:14:39] (PS4) Vgutierrez: prometheus: Consider the new layer label for ATS aggregation rules [puppet] - https://gerrit.wikimedia.org/r/531334 (https://phabricator.wikimedia.org/T221594)
[09:16:07] (PS1) Filippo Giunchedi: prometheus: rename mtail appserver handlers [puppet] - https://gerrit.wikimedia.org/r/531666 (https://phabricator.wikimedia.org/T230396)
[09:17:24] (CR) jerkins-bot: [V: -1] prometheus: rename mtail appserver handlers [puppet] - https://gerrit.wikimedia.org/r/531666 (https://phabricator.wikimedia.org/T230396) (owner: Filippo Giunchedi)
[09:20:19] (PS2) Filippo Giunchedi: prometheus: rename mtail appserver handlers [puppet] - https://gerrit.wikimedia.org/r/531666
(https://phabricator.wikimedia.org/T230396)
[09:22:00] (PS1) Muehlenhoff: Add ferm service for mysql replicas [puppet] - https://gerrit.wikimedia.org/r/531667
[09:23:01] (CR) jerkins-bot: [V: -1] Add ferm service for mysql replicas [puppet] - https://gerrit.wikimedia.org/r/531667 (owner: Muehlenhoff)
[09:24:40] (PS2) Muehlenhoff: Add ferm service for mysql replicas [puppet] - https://gerrit.wikimedia.org/r/531667
[09:25:14] Operations, cloud-services-team, netops: Review switches ACL to connect from tools-bastion to dbproxy1019 - https://phabricator.wikimedia.org/T230980 (Marostegui)
[09:25:45] (CR) jerkins-bot: [V: -1] Add ferm service for mysql replicas [puppet] - https://gerrit.wikimedia.org/r/531667 (owner: Muehlenhoff)
[09:35:50] Operations, netbox: Netbox LibreNMS report fails - https://phabricator.wikimedia.org/T230964 (Joe) p:Triage→Normal
[09:38:06] (Abandoned) Muehlenhoff: Add ferm service for mysql replicas [puppet] - https://gerrit.wikimedia.org/r/531667 (owner: Muehlenhoff)
[09:38:52] Operations, Wikimedia-Mailing-lists: Transfer ownership of mediawiki-security mailman list to Security Team - https://phabricator.wikimedia.org/T230951 (Joe) p:Triage→Low a:Joe
[09:40:30] (PS1) Muehlenhoff: Enable ferm for dbproxy1019 [puppet] - https://gerrit.wikimedia.org/r/531670
[09:44:05] (CR) Effie Mouzeli: [V: +1 C: +2] mediawiki::users: Allow adding privileges to mwdeploy user [puppet] - https://gerrit.wikimedia.org/r/531474 (owner: Effie Mouzeli)
[09:44:27] (PS4) Effie Mouzeli: mediawiki::users: Allow adding privileges to mwdeploy user [puppet] - https://gerrit.wikimedia.org/r/531474
[09:46:28] (CR) Effie Mouzeli: [V: +1 C: +2] mediawiki::common: Allow mwdeploy user to restart php-fpm [puppet] - https://gerrit.wikimedia.org/r/531475 (https://phabricator.wikimedia.org/T224857) (owner: Effie Mouzeli)
[09:46:39] (PS7) Effie Mouzeli: mediawiki::common: Allow
mwdeploy user to restart php-fpm [puppet] - https://gerrit.wikimedia.org/r/531475 (https://phabricator.wikimedia.org/T224857)
[09:47:23] Operations, Wikimedia-Mailing-lists: Transfer ownership of mediawiki-security mailman list to Security Team - https://phabricator.wikimedia.org/T230951 (Joe) @Reedy @JBennett I've set you up as list administrators. You now need to change the list admin password, I'm happy to help if you can't reset it yo...
[09:50:22] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver
[09:50:26] (CR) Effie Mouzeli: [C: +1] redis - eqiad: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531268 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[09:51:10] (CR) Effie Mouzeli: [C: +1] redis - codfw: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531267 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[09:55:01] (CR) Marostegui: [C: +1] mariadb::misc::multiinstance: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531200 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[09:55:59] (PS1) Ema: ATS: add icinga check for traffic_server restarts [puppet] - https://gerrit.wikimedia.org/r/531671 (https://phabricator.wikimedia.org/T227432)
[09:57:23] jouncebot: next
[09:57:23] In 1 hour(s) and 2 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190822T1100)
[09:57:58] (PS2) Ema: ATS: add icinga check for traffic_server restarts [puppet] - https://gerrit.wikimedia.org/r/531671
(https://phabricator.wikimedia.org/T227432)
[10:01:04] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[10:01:08] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver
[10:03:45] (PS2) Jbond: mariadb::backups - codfw: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531207 (https://phabricator.wikimedia.org/T102099)
[10:04:10] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[10:04:20] (CR) Jbond: [C: +2] mariadb::backups - codfw: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531207 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[10:06:36] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:07:03] (CR) Ema: "No obvious problem with pcc https://puppet-compiler.wmflabs.org/compiler1002/17979/cp1076.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/531671 (https://phabricator.wikimedia.org/T227432) (owner: Ema)
[10:08:43] Operations, MediaWiki-General: Elevated php7 latency during mw deploy - https://phabricator.wikimedia.org/T230934 (Joe) p:Triage→Normal it's indeed strange. In particular I find it strange that it affects mainly 400s and 404s. Maybe the #performance-team might have an insight to why 4xx and 301s...
[10:10:01] (PS3) Jbond: role::mariadb::misc - codfw: add ipv6 mapped [puppet] - https://gerrit.wikimedia.org/r/531166 (https://phabricator.wikimedia.org/T102099)
[10:10:48] (CR) Jbond: [C: +2] role::mariadb::misc - codfw: add ipv6 mapped [puppet] - https://gerrit.wikimedia.org/r/531166 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[10:11:05] (PS2) Effie Mouzeli: scap: set flag for check-and-restart-php [puppet] - https://gerrit.wikimedia.org/r/530014 (https://phabricator.wikimedia.org/T224857) (owner: Thcipriani)
[10:11:35] Operations, ORES, Scoring-platform-team, serviceops: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (Joe) I think this is a reasonable explanation, but how would you suggest we should fix our monitoring?
[10:11:59] Operations, ORES, Scoring-platform-team, serviceops: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (Joe) p:Triage→Normal a:Joe
[10:12:07] Operations, Traffic: Allow running several ATS instances on the same server - https://phabricator.wikimedia.org/T221217 (Vgutierrez) Open→Resolved a:Vgutierrez
[10:12:09] Operations, serviceops, PHP 7.2 support, PHP 7.3 support: PHP 7.2 is very slow on an allocation-intensive benchmark - https://phabricator.wikimedia.org/T230861 (Joe) p:Triage→Normal
[10:12:13] Operations, Traffic, Patch-For-Review: Evaluate ATS TLS stack - https://phabricator.wikimedia.org/T220383 (Vgutierrez)
[10:12:49] (PS1) Muehlenhoff: Remove obsolete puppetdb settings [puppet] - https://gerrit.wikimedia.org/r/531673
[10:12:57] (PS2) Jbond: mariadb::misc::multiinstance: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531200 (https://phabricator.wikimedia.org/T102099)
[10:14:02] (CR) Jbond: [C: +2] mariadb::misc::multiinstance: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531200 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[10:14:30] (CR) Effie Mouzeli: [C: +2] scap: set flag for check-and-restart-php [puppet] - https://gerrit.wikimedia.org/r/530014 (https://phabricator.wikimedia.org/T224857) (owner: Thcipriani)
[10:14:46] (PS3) Effie Mouzeli: scap: set flag for check-and-restart-php [puppet] - https://gerrit.wikimedia.org/r/530014 (https://phabricator.wikimedia.org/T224857) (owner: Thcipriani)
[10:15:49] (CR) Giuseppe Lavagetto: [C: +1] "not blocked anymore" [puppet] - https://gerrit.wikimedia.org/r/530014 (https://phabricator.wikimedia.org/T224857) (owner: Thcipriani)
[10:16:02] (PS2) Jbond: redis - codfw: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531267
(https://phabricator.wikimedia.org/T102099)
[10:16:36] (CR) Jbond: [C: +2] redis - codfw: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531267 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[10:16:58] (CR) Effie Mouzeli: [C: +2] "Unblocked with I3f13dd41b549a5db41fb7f065f70825f84c5789e and Id0ebed6f2622364c750549a6eca96ad672baf705" [puppet] - https://gerrit.wikimedia.org/r/530014 (https://phabricator.wikimedia.org/T224857) (owner: Thcipriani)
[10:17:45] (PS4) Effie Mouzeli: scap: set flag for check-and-restart-php [puppet] - https://gerrit.wikimedia.org/r/530014 (https://phabricator.wikimedia.org/T224857) (owner: Thcipriani)
[10:21:01] (PS2) Jbond: redis - eqiad: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531268 (https://phabricator.wikimedia.org/T102099)
[10:21:37] (CR) Jbond: [C: +2] redis - eqiad: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531268 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[10:24:18] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver
[10:24:40] ^downtime expired
[10:24:54] (CR) Jcrespo: [C: -1] Enable ferm for dbproxy1019 [puppet] - https://gerrit.wikimedia.org/r/531670 (owner: Muehlenhoff)
[10:30:10] (PS2) Jbond: parsoid: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531281 (https://phabricator.wikimedia.org/T102099)
[10:31:02] (CR) Jbond: [C: +2] parsoid: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531281 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[10:32:17] (CR) Effie Mouzeli: [C: +2] Send 33.3% of anonymous users to PHP7.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/529924 (https://phabricator.wikimedia.org/T219150) (owner: Effie Mouzeli)
[10:34:41] (PS2) Jbond: otrs: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531250 (https://phabricator.wikimedia.org/T102099)
[10:36:38] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver
[10:36:55] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/531673 (owner: Muehlenhoff)
[10:36:59] (PS3) Ema: ATS: add icinga check for traffic_server restarts [puppet] - https://gerrit.wikimedia.org/r/531671 (https://phabricator.wikimedia.org/T227432)
[10:37:04] (CR) Jbond: [C: +2] otrs: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531250 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[10:38:47] (PS2) Effie Mouzeli: Send 33.3% of anonymous users to PHP7.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/529924 (https://phabricator.wikimedia.org/T219150)
[10:39:37] (PS4) Ema: ATS: add icinga check for traffic_server restarts [puppet] - https://gerrit.wikimedia.org/r/531671 (https://phabricator.wikimedia.org/T227432)
[10:39:44] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver
[10:43:26] (CR) Vgutierrez: [C: +1] ATS: add icinga check for traffic_server restarts [puppet] - https://gerrit.wikimedia.org/r/531671 (https://phabricator.wikimedia.org/T227432) (owner: Ema)
[10:44:20] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver
[10:44:37] (PS2) Jbond: xhgui::app: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531279 (https://phabricator.wikimedia.org/T102099)
[10:48:11] (CR) Jbond: [C: +2] xhgui::app: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531279 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[10:50:41] (CR) Effie Mouzeli: Send 33.3% of anonymous users to PHP7.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/529924 (https://phabricator.wikimedia.org/T219150) (owner: Effie Mouzeli)
[10:50:44] (CR) Effie Mouzeli: [C: +2] Send 33.3% of anonymous users to PHP7.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/529924 (https://phabricator.wikimedia.org/T219150) (owner: Effie Mouzeli)
[10:51:41] (Merged) jenkins-bot: Send 33.3% of anonymous users to PHP7.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/529924 (https://phabricator.wikimedia.org/T219150) (owner: Effie Mouzeli)
[10:51:56] (CR) jenkins-bot: Send 33.3% of anonymous users to PHP7.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/529924 (https://phabricator.wikimedia.org/T219150) (owner: Effie
Mouzeli) [10:52:30] (03PS2) 10Jbond: docker_registry: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531269 (https://phabricator.wikimedia.org/T102099) [10:53:05] (03CR) 10Jbond: [C: 03+2] docker_registry: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531269 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [10:55:02] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver [10:55:48] !log jiji@deploy1001 Synchronized wmf-config/CommonSettings.php: Push PHP7 traffic to 33.3% - T219150 (duration: 01m 01s) [10:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:59] T219150: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 [10:57:52] (03PS2) 10Jbond: poolcounter: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531265 (https://phabricator.wikimedia.org/T102099) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190822T1100). [11:00:04] alaa_wmde: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:34] (03CR) 10Jbond: [C: 03+2] poolcounter: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531265 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [11:03:54] o/ [11:04:32] \o/ [11:04:44] o/ [11:04:50] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 3 others: Some thumbnail images delivered with wrong application/x-www-form-urlencoded mime-type - https://phabricator.wikimedia.org/T188831 (10fgiunchedi) >>! In T188831#5414614, @brion wrote: > The only `POST`s I see in there are in... [11:05:35] Amir1 and alaa_wmde [11:05:51] jijiki: yes sir [11:05:53] please keep in mind that we just increased PHP7 traffic [11:06:08] roger thanks for the heads up [11:06:09] jijiki: YESSS [11:06:15] tx:) [11:06:37] it is my going away for vacations present :p [11:07:20] sorry jijiki .. should say yes madam? [11:07:45] or neither is better probably :D .. have nice vacations :) [11:08:00] alaa_wmde: we are in an era where it doesn't matter really [11:08:10] it is the intention that counts :p [11:08:17] yeap that's the spirit! [11:08:22] jijiki: have fun! 
[11:08:30] haha tx :) [11:08:43] (03PS2) 10Jbond: eventschemas: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531274 (https://phabricator.wikimedia.org/T102099) [11:09:21] (03CR) 10Jbond: [C: 03+2] eventschemas: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531274 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [11:09:58] (03CR) 10Filippo Giunchedi: ATS: add icinga check for traffic_server restarts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/531671 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [11:11:38] <_joe_> please note it's still just about 15% of all requests anyways, as it only applies to users who accept cookies (so, humans with a browsers, mostly) [11:13:18] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10Patch-For-Review, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133 - https://phabricator.wikimedia.org/T229657 (10Marostegui) @CDanis @Volans can you confirm this command will set wikitech (db1073 is its master) on read-only?:... [11:13:33] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/17981/" [puppet] - 10https://gerrit.wikimedia.org/r/531673 (owner: 10Muehlenhoff) [11:13:40] (03PS2) 10Muehlenhoff: Remove obsolete puppetdb settings [puppet] - 10https://gerrit.wikimedia.org/r/531673 [11:14:35] _joe_: but API appservers are on php7 as well (20% last time I checked) + lots of jobs \o/ [11:15:06] (03PS2) 10Jbond: kafka::main: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531238 (https://phabricator.wikimedia.org/T102099) [11:15:12] tarrow: alaa_wmde Is it okay if I deploy the wb_terms thingy? [11:15:38] (03CR) 10Jbond: [C: 03+2] kafka::main: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531238 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [11:15:39] well I'd rather have you rest in bed if you are sick, but I can't force you can I? 
[11:17:01] Amir1: if you do, please sync it first to debug I can test on a page I prepared for that .. I'll be watching kibana & grafana too [11:18:03] We actually have the meeting in 10 minutes, Should postpone it? [11:19:35] I moved the meeting till 2 [11:19:48] 10Operations, 10serviceops, 10Performance-Team (Radar), 10User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (10jijiki) [11:19:48] we have to get the revert deployed in this swat for sure [11:20:09] so that the issue doesn't go to wikipedias in the train today [11:21:10] (03PS3) 10Muehlenhoff: Remove obsolete puppetdb settings [puppet] - 10https://gerrit.wikimedia.org/r/531673 [11:26:05] (03CR) 10Ema: ATS: add icinga check for traffic_server restarts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/531671 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [11:27:31] 10Operations, 10serviceops, 10PHP 7.2 support, 10PHP 7.3 support: PHP 7.2 is very slow on an allocation-intensive benchmark - https://phabricator.wikimedia.org/T230861 (10jijiki) We are on our way to finishing migration to PHP7, my opinion is to try the PHP7.2 bandaid rather than upgrading production to P... [11:30:53] (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete puppetdb settings [puppet] - 10https://gerrit.wikimedia.org/r/531673 (owner: 10Muehlenhoff) [11:31:27] omg CI is killing our SWAT [11:31:51] hehe you must be new here [11:32:00] lol how did you know? [11:32:45] that place, where you feel very new and very old at the same time :P [11:33:05] CI on backorts usually take around 40 minutes, if you hit a flaky test, you need to do it again ^_^ [11:35:27] yeah crazy times! 
[11:36:40] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [11:41:02] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 13.69 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [11:44:20] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [11:45:18] <_joe_> uhm [11:45:25] <_joe_> this is bad [11:45:28] <_joe_> all on php [11:45:31] <_joe_> jijiki: ^^ [11:45:34] I am here [11:45:53] I was noticing that [11:45:56] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-12h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET [11:46:03] there is this pattern [11:46:25] and now that we pushed a little more traffic, it got a little worse [11:48:42] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 114.2 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [11:49:10] so we need to find where this pattern 
comes from [11:49:45] <_joe_> it seems we have a latency that's a bit higher than normal at times [11:49:51] <_joe_> but I hardly see any regularity [11:50:28] the thing is that it affects only php [11:50:33] not hhvm [11:50:35] <_joe_> sure [11:50:55] <_joe_> best thing we can do is check if that's clear on all appservers at the same time [11:51:14] I think that it is on API servers, but it is ahunch [11:51:30] <_joe_> sorry, the graph you linked is just for appservers [11:52:01] <_joe_> oh right that is unified [11:52:04] yep [11:52:14] it was confusing me as well at first [11:52:15] <_joe_> uhm why was that done, I don't agree :P [11:52:43] we'll fix, I didn't know it was unified until I tried to check :p [11:55:22] (03PS1) 10Muehlenhoff: Hiera settings for puppetdb1002/2002 [puppet] - 10https://gerrit.wikimedia.org/r/531683 [11:55:37] <_joe_> jijiki: I can confirm it's API [11:55:53] <_joe_> and btw [11:56:03] <_joe_> the other graphs were correctly separated [11:56:28] I fixed it [11:56:41] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-method=GET&var-code=200 [11:57:53] Just wondering if that is some how caused by or related to the extra traffic termbox might be putting on the API servers? The spikes in that graph seem to correspond to the times we get termbox error spikes. 
[11:58:11] that is a good start [11:58:28] or at least worth looking [11:59:05] (we get termbox error spikes because our requests timeout talking to the api appservers) [11:59:07] <_joe_> tarrow: the issue started at 17:46 yesterday evening [11:59:18] <_joe_> tarrow: and yes, this is clearly related to your issue [11:59:53] tarrow: let me know how I can help [12:00:02] cool [12:00:13] I don't really know where to go from here [12:00:28] <_joe_> jijiki: you can start looking at logstash around the time of one spike [12:00:31] we could look at those specific requests [12:00:36] and find a pattern [12:00:41] <_joe_> then go look at api.log [12:01:00] sure, tx joe [12:01:42] 10Operations, 10ops-eqiad, 10User-Elukey: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-conf1001.eqiad.wmnet'] ` The log can be found... [12:01:50] tarrow: I will be back in 5' and we can start looking [12:02:09] awesome! 
I'm in a meeting but I'm also here [12:03:22] <_joe_> jijiki: it's even "better", it's just mw1347 and mw1348 [12:06:10] oh come on [12:06:24] those are php7 only servers [12:06:30] and I think they are the good ones :p [12:06:47] let me try something [12:07:15] I am curious if it is just these two, what will happen if we depool them [12:07:42] !log Depooling mw1347 and mw1348 [12:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:23] <_joe_> uhm I see peaks in cpu usage around those times [12:10:39] <_joe_> https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&panelId=3&fullscreen&orgId=1&var-server=mw1348&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver [12:10:41] (03PS1) 10DCausse: Add services_proxy to wikitech [puppet] - 10https://gerrit.wikimedia.org/r/531686 (https://phabricator.wikimedia.org/T230994) [12:12:18] (03PS2) 10Jbond: ores: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531260 (https://phabricator.wikimedia.org/T102099) [12:12:26] 10Operations, 10ops-eqiad, 10User-Elukey: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-conf1001.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-conf1001.eqiad.wmnet'] ` [12:12:30] end network errors [12:12:56] (03CR) 10Jbond: [C: 03+2] ores: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531260 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [12:13:30] <_joe_> we also maxed out the php-fpm workers, which explains the network errors I guess [12:13:49] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [12:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:25] 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10tramm) >>! 
In T204056#5429694, @Slaporte wrote: > Do you have nameservers that we could add to the domain? I think that would be the quickest way to let the ch... [12:15:23] is it possible that LVS kept sending termbox requests to those specific servers? [12:15:49] I guess that is possible, does it do it by ip has or something? [12:15:57] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [12:15:57] hash* [12:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:09] tarrow: iirc it does [12:16:16] (03PS1) 10Elukey: Remove ipv6 mapped conf from an-conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/531687 [12:16:51] !log upgrading mariadb (packaged Debian version) on matomo1001 [12:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:07] but let's wait a little bit more [12:17:10] (03CR) 10Elukey: [C: 03+2] Remove ipv6 mapped conf from an-conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/531687 (owner: 10Elukey) [12:18:31] 10Operations: Integrate Stretch 9.9 point update - https://phabricator.wikimedia.org/T222053 (10MoritzMuehlenhoff) [12:18:44] 10Operations: Integrate Stretch 9.9 point update - https://phabricator.wikimedia.org/T222053 (10MoritzMuehlenhoff) 05Open→03Resolved All complete [12:19:49] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/531683 (owner: 10Muehlenhoff) [12:20:06] <_joe_> jijiki: no it's not really possible [12:20:17] <_joe_> unless they have persistent connections on the termbox side [12:20:34] _joe_: that was my thought [12:20:44] <_joe_> anyways, I was looking at /var/log/php7.2-fpm-www-slowlog.log on mw1348 [12:20:54] <_joe_> there are a lot of traces around the time of the issue [12:21:17] <_joe_> I think we might try to disable tracing of slowlog requests on either mw1348 or mw1347 [12:21:32] <_joe_> to see if that creates a pathological runaway situation [12:22:12] the thing is, so far it hasn't happened again 
[12:22:39] looking at this [12:22:42] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-6h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-method=GET&var-code=200 [12:23:04] 10Operations, 10CirrusSearch, 10Discovery-Search (Current work), 10Patch-For-Review: labweb100[12]: Search backend error during get of .[array] after 0: unknown: No enabled connection - https://phabricator.wikimedia.org/T230994 (10dcausse) p:05Normal→03High Bumping as I think this causing all cirrus up... [12:23:17] (03PS2) 10Jbond: etcd::networking: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531224 (https://phabricator.wikimedia.org/T102099) [12:23:54] (03CR) 10Jbond: [C: 03+2] etcd::networking: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531224 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [12:24:02] tarrow: are there any errors so far? [12:24:44] Looks like a couple: https://grafana.wikimedia.org/d/AJf0z_7Wz/termbox?refresh=1m&orgId=1&from=now-24h&to=now [12:25:18] but maybe they are unrelated [12:26:20] _joe_: do we have tracing of slowlog requests on all servers? [12:26:34] Most recent one was 21 minutes past the hour [12:26:55] 10Operations, 10ops-eqiad, 10User-Elukey: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-conf1002.eqiad.wmnet', 'an-conf1003.eqiad.wmne... 
[12:27:32] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [12:28:14] (03PS2) 10Jbond: etcd::kubernetes: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531222 (https://phabricator.wikimedia.org/T102099) [12:28:17] ok that is appserver, different issue [12:28:49] (03CR) 10Jbond: [C: 03+2] etcd::kubernetes: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531222 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [12:30:38] tarrow: can you check if you are using persistent connections ? [12:30:48] (03PS2) 10Jbond: mediawiki::memcached - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531247 (https://phabricator.wikimedia.org/T102099) [12:31:08] jijiki: I can look; give me a moment [12:31:17] sure tx [12:31:53] <_joe_> tarrow: what's the user agent of termbox? 
[12:31:58] <_joe_> I'm pretty sure it's not that [12:32:25] <_joe_> so the slow responses were overwhelmingly for requests coming from parsoid [12:32:26] (03CR) 10Elukey: [C: 03+1] mediawiki::memcached - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531247 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [12:32:26] _joe_: this time [12:32:35] it was just mw1270 [12:32:47] <_joe_> that's an appserver [12:32:49] <_joe_> not an api [12:32:59] then it was related to the appserver alert [12:33:04] <_joe_> yes [12:33:11] <_joe_> and that's fully on php7 as well [12:33:59] <_joe_> jijiki: I'll try to disable the slowlog on mw1348 [12:34:15] +1 [12:34:31] and I will pool back mw1347 [12:34:44] _joe_: should be wikibase-termbox 0.1.0 [12:35:00] (03CR) 10Jbond: [C: 03+2] mediawiki::memcached - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531247 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [12:35:24] as well as some library versions / author at the end [12:35:24] <_joe_> tarrow: then it's definitely not it [12:35:32] right [12:35:42] Then we should do that ASAP [12:35:45] I guess? [12:35:53] !log Pooling mw1247 [12:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:02] I thought that we had already done something about that [12:36:50] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [12:37:05] <_joe_> tarrow: about what? [12:37:13] <_joe_> tarrow: I told you termbox is not causing issues [12:37:24] <_joe_> jijiki: 1247?? 
[12:37:27] using persistent connects from our http client [12:37:44] !log Pooling mv1347 not mw1247 [12:37:47] (03CR) 10Gehel: [C: 04-1] Add maps reboot cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: 10Mathew.onipe) [12:37:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:02] <_joe_> !log disabled slowlog on mw1348, repooling after reload [12:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:10] _joe_: I pooled the right server though :p [12:38:27] ah! You were saying "termbox isn't the problem" not "termbox is the problem because you aren't using persistent connections" [12:38:36] <_joe_> yeah :P [12:38:43] <_joe_> sorry I wasn't very clear [12:39:01] :P no worries; my bad trying to multitask too much as well [12:39:07] (03PS3) 10Gehel: Allow glent indices to auto-create in cirrus clusters [puppet] - 10https://gerrit.wikimedia.org/r/531273 (https://phabricator.wikimedia.org/T227364) (owner: 10EBernhardson) [12:39:19] <_joe_> jijiki: in the meantime, what about raising those timeouts for alerts a bit? 
[12:39:24] <_joe_> they're clearly too noisy [12:39:55] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [12:39:55] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [12:39:58] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [12:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:59] (03CR) 10Gehel: [C: 03+2] Allow glent indices to auto-create in cirrus clusters [puppet] - 10https://gerrit.wikimedia.org/r/531273 (https://phabricator.wikimedia.org/T227364) (owner: 10EBernhardson) [12:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:10] _joe_: sure [12:40:31] (the downtime cookbook is started by wmf-reimage I suppose, but since those are new hosts it fails?) [12:40:33] <_joe_> it looks, btw, like something happened yesterday that made php7's performance so much worse than hhvm [12:40:47] (03PS3) 10Jbond: mediawiki::memcached - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531247 (https://phabricator.wikimedia.org/T102099) [12:40:53] <_joe_> it was better until yesterday at 13:20 [12:40:55] elukey: yes and no [12:41:03] <_joe_> so I would be inclined to think of the train [12:41:32] yes they are started by the reimage cookbook but no, the one launched in background is done only after the first puppet run is started and a puppet run is forced on the icinga host [12:41:34] PROBLEM - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [12:41:52] so the host should be there, but that was changed recently so it's possible there is an unknown bug [12:41:53] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:57] I can have a look in a bit [12:42:11] sure [12:42:46] As part of my multitasking is now an ok time to backport a patch to unblock the train? [12:43:11] tarrow: we might block the train, which patch ? [12:43:35] https://phabricator.wikimedia.org/T230926 [12:44:00] so nothing to do with this [12:44:10] conversation [12:44:18] (03PS2) 10Gehel: Fix water-polygons download URL [puppet] - 10https://gerrit.wikimedia.org/r/530863 (owner: 10MSantos) [12:44:26] zeljkof: you might need to hold the train a bit [12:44:27] but I wanted to check if now was a bad time to do it while you are digging [12:44:49] _joe_: I think there is not reason not to revert this, agree? [12:44:55] jijiki: ok, what's the problem? [12:45:10] zeljkof: we are having some perfomance issues since yesterday [12:45:19] and we are not sure yet if it is due to the train [12:45:35] (03CR) 10Gehel: [C: 03+2] Fix water-polygons download URL [puppet] - 10https://gerrit.wikimedia.org/r/530863 (owner: 10MSantos) [12:45:49] jijiki: ok, please keep me updated? is there a task? [12:46:04] "updated", with no question mark :) [12:46:10] zeljkof: not yet, I will make sure to create one of I have a little bit more info [12:46:20] jijiki: thanks! [12:46:23] s/of/when [12:46:26] tx [12:46:46] tarrow: there used to be "Pre MediaWiki train sanity break" an hour before train, I don't see it in the calendar now ;) [12:47:02] zeljkof: nothing to do with me :P [12:47:15] I was hoping we'd have done the backport in the SWAT window [12:47:34] but sadly we were too delayed by jenkins [12:47:34] but it wasn't? 
[12:47:40] ah :/ [12:48:06] The patch is now landed on the branch but I didn't yet deploy it [12:48:47] because I wasn't keen to do it as the same time as the urgent performance investigation [12:49:00] tarrow: makes sense [12:49:07] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [12:49:45] tarrow: please wait then until the investigation is over, you can deploy during train window to unblock the train, if the performance investigation doesn't block it [12:49:57] (03PS1) 10Filippo Giunchedi: wdqs: improve alert description [puppet] - 10https://gerrit.wikimedia.org/r/531690 (https://phabricator.wikimedia.org/T228878) [12:50:17] zeljkof: thanks :) [12:50:38] sorry for still having our building equipment across the train tracks :P [12:52:33] (03PS1) 10Effie Mouzeli: profile::mediawiki::alerts: Increasing alerts for mw request time [puppet] - 10https://gerrit.wikimedia.org/r/531691 (https://phabricator.wikimedia.org/T230396) [12:52:41] <_joe_> zeljkof: the performance investigation shouldn't block the ddeployment [12:53:18] _joe_: jijiki said to wait with the train until they investigate... 
[12:53:38] _joe_: if you think so, I am ok [12:53:55] my fear was that this will be magnified when we deploy to enwiki [12:54:08] <_joe_> zeljkof: oh the train [12:54:18] <_joe_> I thought you were referring to a single change by tarrow [12:54:26] (03PS2) 10Jbond: mediawiki::memcached - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531248 (https://phabricator.wikimedia.org/T102099) [12:54:43] <_joe_> so yeah we think something happened with yesterday's train [12:54:55] <_joe_> but I don't think we can realistically get to the bottom of it right now [12:55:10] <_joe_> and tbh, if the train shows there are issues in wmf.19 with php7 [12:55:15] <_joe_> we should just rollback [12:55:15] _joe_: tarrow is trying to unblock the train, so you think it's ok for him to deploy now? [12:55:22] <_joe_> sure [12:55:24] right, so shall I do the single backport now while you keep investigating. Then I'm out of the way and you can either block the train or not with the other stuff [12:55:27] great [12:55:36] tarrow: yes, looks like you can go ahead [12:55:39] I'll just start now then! [12:55:41] :D [12:56:23] _joe_, jijiki: please create a task and make it the train blocker, I can wait with the train, or revert if needed [12:56:32] (03CR) 10Jbond: [C: 03+2] mediawiki::memcached - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531248 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [12:56:42] <_joe_> !log restarting mw1270 with slowlog disabled [12:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:50] <_joe_> zeljkof: this is not even a train blocker, as we're not sure it's linked to the train [12:58:12] <_joe_> or just to the fact the train is a massive deployment of new code to cache [12:58:34] _joe_: I should wait with promoting wmf.19 to all wikis for a while while you investigate, right? 
[12:58:55] <_joe_> zeljkof: my position is I don't have time to investigate further right now [12:59:16] <_joe_> and we're too thin on resources to be realistically able to find the culprit in this precise moment [12:59:26] <_joe_> worse that can happen, we have to rollback the train [12:59:47] since we are confident that this is php7 specific [13:00:04] zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - European version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190822T1300). [13:00:07] I can move php7 traffic back to 20% [13:00:15] and we can let zeljkof deploy [13:00:32] jijiki: sounds good to me [13:00:45] let me know when I can deploy, if there's trouble, I'll roll back [13:00:46] and then go to the fallback plan of rolling back the train [13:00:49] <_joe_> jijiki: a restart fixed the high response times on mw1270 :/ [13:00:59] oh gawd not again [13:01:11] <_joe_> I am starting to think that restarting php every time we deploy is the best strategy [13:01:45] :( [13:02:02] zeljkof: gives a little more time please [13:02:12] "sudo: [/usr/local/sbin/check-and-restart-php,: command not found [13:02:12] 13:01:43 php-fpm restart failed!" from scap pull? [13:02:38] on on the debug server [13:03:33] zeljkof: any idea? [13:03:40] (03PS2) 10Gehel: wdqs: improve alert description [puppet] - 10https://gerrit.wikimedia.org/r/531690 (https://phabricator.wikimedia.org/T228878) (owner: 10Filippo Giunchedi) [13:04:02] <_joe_> wut [13:04:07] <_joe_> tarrow: I do [13:04:20] am I doing something stupid? [13:04:22] (03CR) 10Gehel: [C: 03+1] "I changed the description to 'WDQS high update lag', which I think is even more descriptive. LGTM now." [puppet] - 10https://gerrit.wikimedia.org/r/531690 (https://phabricator.wikimedia.org/T228878) (owner: 10Filippo Giunchedi) [13:04:25] <_joe_> can you please paste the whole output? 
[13:04:26] jijiki: sure [13:04:28] <_joe_> tarrow: nope [13:04:46] https://www.irccloud.com/pastebin/85NNRwZ9/ [13:04:59] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10Patch-For-Review, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133 - https://phabricator.wikimedia.org/T229657 (10Volans) >>! In T229657#5431230, @Marostegui wrote: > @CDanis @Volans can you confirm this command will set wikit... [13:05:13] <_joe_> there is a comma in excess [13:05:21] <_joe_> ok lemme try to see the hotfix here [13:05:21] there is [13:05:25] how did that get there? [13:06:04] (03CR) 10Filippo Giunchedi: [C: 03+2] "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/531690 (https://phabricator.wikimedia.org/T228878) (owner: 10Filippo Giunchedi) [13:06:13] (03PS3) 10Filippo Giunchedi: wdqs: improve alert description [puppet] - 10https://gerrit.wikimedia.org/r/531690 (https://phabricator.wikimedia.org/T228878) [13:06:13] <_joe_> sigh [13:06:23] <_joe_> it's correct in the configuration AFAICT [13:06:28] <_joe_> no commas :P [13:06:40] wtf [13:06:45] tarrow: we updated scap recently [13:06:51] <_joe_> jijiki: can you rollback the change introducing the new scap configs? 
[13:06:56] yeah [13:07:05] tarrow: hold on [13:07:32] * tarrow hangs on [13:07:40] (03PS1) 10Effie Mouzeli: Revert "scap: set flag for check-and-restart-php" [puppet] - 10https://gerrit.wikimedia.org/r/531692 [13:08:46] (03PS2) 10Effie Mouzeli: Revert "scap: set flag for check-and-restart-php" [puppet] - 10https://gerrit.wikimedia.org/r/531692 [13:09:07] (03CR) 10Effie Mouzeli: [C: 03+2] Revert "scap: set flag for check-and-restart-php" [puppet] - 10https://gerrit.wikimedia.org/r/531692 (owner: 10Effie Mouzeli) [13:09:18] (03PS3) 10Effie Mouzeli: Revert "scap: set flag for check-and-restart-php" [puppet] - 10https://gerrit.wikimedia.org/r/531692 [13:10:29] sigh damn jenkins [13:11:17] (03CR) 10Effie Mouzeli: Revert "scap: set flag for check-and-restart-php" [puppet] - 10https://gerrit.wikimedia.org/r/531692 (owner: 10Effie Mouzeli) [13:11:30] (03PS2) 10Muehlenhoff: Hiera settings for puppetdb1002/2002 [puppet] - 10https://gerrit.wikimedia.org/r/531683 [13:11:32] (03CR) 10Effie Mouzeli: [C: 03+2] Revert "scap: set flag for check-and-restart-php" [puppet] - 10https://gerrit.wikimedia.org/r/531692 (owner: 10Effie Mouzeli) [13:12:15] <_joe_> sigh ofc it can't work [13:12:36] I think I +2'd too soon [13:12:38] <_joe_> jijiki: please rollback also the php7 percentage, the php-fpm restarting code in scap is wrong [13:13:07] great [13:13:09] ok [13:15:06] (03PS3) 10Muehlenhoff: Hiera settings for puppetdb1002/2002 [puppet] - 10https://gerrit.wikimedia.org/r/531683 [13:15:38] <_joe_> jijiki: tbh, I'm not sure why [13:16:35] let's do for now what is safest for prod [13:17:31] 👍 [13:17:45] tarrow: I merged the patch to unblock you [13:17:58] <_joe_> jijiki: did you also run puppet on deploy1001? 
[13:18:00] !log joal@deploy1001 Started deploy [analytics/refinery@a9b99e9]: Regular weekly analytics deployment train (1 day late) [13:18:03] jijiki: cool [13:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:05] _joe_: doing now [13:18:12] thanks :) [13:18:24] Sorry I'm not being all that much help [13:18:53] _joe_: hangon [13:19:49] actually, you helped :p [13:20:10] this would hit the train deployment too [13:20:27] :P [13:20:28] (03CR) 10Muehlenhoff: [C: 03+2] Hiera settings for puppetdb1002/2002 [puppet] - 10https://gerrit.wikimedia.org/r/531683 (owner: 10Muehlenhoff) [13:20:43] ok tarrow, you can proceed [13:20:55] cheers! pulling to debug now :) [13:20:55] 10Operations, 10ORES, 10Scoring-platform-team, 10serviceops: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10Halfak) I'm looking into what it would take to monitor a celery worker pool on a specific machine.... [13:20:57] [13:21:31] jijiki: I'm on debug1002 should I actually be on 1001? [13:21:55] (obviously puppet didn't run on 1002 yet) [13:22:00] <_joe_> tarrow: sorry you got that error earlier when running "scap pull" on mwdebug1002? [13:22:04] yep [13:22:09] <_joe_> oh, ok [13:22:12] <_joe_> interesting [13:22:30] and I still get it now [13:22:54] <_joe_> tarrow: oh, wait [13:23:07] How should I chose which debug server? I just always used 1002 since that's what someone once told me to use [13:23:10] <_joe_> we thought it was on a scap sync-file [13:23:16] (03PS1) 10Effie Mouzeli: Revert "Send 33.3% of anonymous users to PHP7.2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531695 (https://phabricator.wikimedia.org/T219150) [13:23:17] <_joe_> tarrow: that's ok [13:23:24] nope, on `scap pull` [13:23:39] <_joe_> tarrow: ok it should be fixed soon [13:23:43] :) [13:23:44] oh [13:23:54] <_joe_> tarrow: try now? 
[13:24:11] I thought it was on sync-file as well [13:24:13] <_joe_> jijiki: I'll try debugging from another machine [13:24:15] _joe_: all green! [13:24:36] I'll just check that the change is behaving as expected :) [13:26:06] revert works on mwdebug1002 :) [13:26:25] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/531691 (https://phabricator.wikimedia.org/T230396) (owner: 10Effie Mouzeli) [13:27:27] (03CR) 10Filippo Giunchedi: "Note that this will create *new* metrics, so it'll take some time to have full graphs again as data accumulates" [puppet] - 10https://gerrit.wikimedia.org/r/531666 (https://phabricator.wikimedia.org/T230396) (owner: 10Filippo Giunchedi) [13:27:56] !log tarrow@deploy1001 Synchronized php-1.34.0-wmf.19/extensions/Wikibase/client/: [[gerrit:531677|Revert "Use the backwards-compatible HTML ID for the wikidata item link" (T230958, T66315)]] (duration: 00m 58s) [13:28:01] (03PS1) 10Muehlenhoff: Enable puppetdb1002/2002 as puppetdb hosts [puppet] - 10https://gerrit.wikimedia.org/r/531697 [13:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:03] T230958: deWikisource doesn't show Wikibase-Item (API) / Link to Wikidata (UI) - https://phabricator.wikimedia.org/T230958 [13:28:03] T66315: Move "Data item" link outside of sidebar toolbox - https://phabricator.wikimedia.org/T66315 [13:28:30] (03CR) 10Effie Mouzeli: [C: 03+2] Revert "Send 33.3% of anonymous users to PHP7.2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531695 (https://phabricator.wikimedia.org/T219150) (owner: 10Effie Mouzeli) [13:29:25] (03Merged) 10jenkins-bot: Revert "Send 33.3% of anonymous users to PHP7.2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531695 (https://phabricator.wikimedia.org/T219150) (owner: 10Effie Mouzeli) [13:29:28] (03CR) 10Filippo Giunchedi: ATS: add icinga check for traffic_server restarts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/531671 
(https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [13:29:42] (03CR) 10jenkins-bot: Revert "Send 33.3% of anonymous users to PHP7.2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531695 (https://phabricator.wikimedia.org/T219150) (owner: 10Effie Mouzeli) [13:29:50] <_joe_> jijiki: ftr, mw1348 had the problem again [13:30:18] <_joe_> and mw1347 *at the same time* [13:30:26] <_joe_> so it must be some request pattern [13:30:30] We are done :). Thank you all for your help and support! [13:31:05] zeljkof: We are done; sorry for the delay :) [13:31:34] _joe_: I am running scap [13:31:48] 10Operations, 10Maps (Kartotherian): Create helm chart for kartotherian k8s deployment - https://phabricator.wikimedia.org/T231006 (10Mathew.onipe) [13:31:58] but it is very weird that is just those 2 [13:32:07] !log jiji@deploy1001 Synchronized wmf-config/CommonSettings.php: Reverting PHP7 traffic back to 20% - T219150 (duration: 00m 57s) [13:32:10] as we have 10 PHP7 servers [13:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:16] <_joe_> jijiki: it's not, those are the two api servers that get most traffic [13:32:16] T219150: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 [13:32:25] <_joe_> jijiki: what are the other ones? [13:32:41] tarrow: great, that closes all train blockers :) [13:32:44] we started from the old ones, so 1222+5 ? [13:32:46] let me check [13:32:50] <_joe_> yeah I think so [13:32:56] <_joe_> it is indeed a bit weird [13:32:58] jijiki, _joe_: do you need more time? 
[13:33:05] <_joe_> zeljkof: not for me, no [13:33:15] <_joe_> if the problem is the train, we will discover soon [13:33:20] zeljkof: let's go ahead, and worst case [13:33:21] <_joe_> please proceed [13:33:24] we rollback [13:33:43] _joe_: I will create a task for now, to attach it to the train [13:33:49] ok, starting with train, I'll monitor logs and revert in case of trouble [13:34:01] <_joe_> jijiki: I don't think we have elements to attach it to the train [13:34:08] <_joe_> jijiki: if things get bad now, maybe [13:34:51] <_joe_> once zeljkof is done, please restart fpm on one of those two hosts, although I think the problem is different and has to do with incoming requests [13:35:00] ok [13:36:18] (03PS1) 10Mathew.onipe: kartotherian: add kartotherian chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/531699 (https://phabricator.wikimedia.org/T231006) [13:36:58] !log joal@deploy1001 Finished deploy [analytics/refinery@a9b99e9]: Regular weekly analytics deployment train (1 day late) (duration: 18m 57s) [13:37:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:19] zeljkof: I won't create a task for now then [13:38:02] jijiki: ok [13:38:03] (03PS1) 10Zfilipin: all wikis to 1.34.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531700 [13:38:05] (03CR) 10Zfilipin: [C: 03+2] all wikis to 1.34.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531700 (owner: 10Zfilipin) [13:38:44] <_joe_> jijiki: you can create one without linking it to the train [13:38:50] <_joe_> we are seeing this issue right? [13:39:02] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10Patch-For-Review, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133 - https://phabricator.wikimedia.org/T229657 (10Marostegui) Indeed @Volans - thanks! ` root@cumin1001:~# dbctl --scope eqiad section wikitech rw && dbctl confi... 
[13:39:16] (03Merged) 10jenkins-bot: all wikis to 1.34.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531700 (owner: 10Zfilipin) [13:39:26] (03CR) 10jenkins-bot: all wikis to 1.34.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531700 (owner: 10Zfilipin) [13:41:18] !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.34.0-wmf.19 [13:41:21] _joe_: apart from the pattern we saw [13:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:35] jijiki, _joe_: wmf.19 is deployed to all wikis [13:41:36] I am not sure we have anything more to write [13:41:48] and now we have reverted [13:42:21] I suggest to wait a couple of hours, or create a task tomorrow [13:42:29] tx zeljkof [13:42:48] !log Restart php-fpm on mw1348 and mw1347 [13:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:17] 10Operations, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (10jijiki) [13:45:56] (03PS1) 10Elukey: Create the new Analytics Zookeeper cluster [puppet] - 10https://gerrit.wikimedia.org/r/531701 (https://phabricator.wikimedia.org/T227025) [13:47:16] jijiki, _joe_: as far as I can see, nothing exploded, I'll update the roadmap [13:47:26] if anything explodes, I'll revert [13:47:41] !log update puppet compiler's facts [13:47:42] for things that can wait until next week, we can block next week's train, wmf.20 [13:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:56] zeljkof: if something is wrong, ping me [13:47:59] I will be around [13:50:26] I'll revert first, ask for help later :D [13:52:12] 10Operations, 10Analytics, 10Analytics-Wikistats, 10Performance-Team, 10Traffic: Piwik JS isn't cached - https://phabricator.wikimedia.org/T230772 (10ema) p:05Triage→03Normal 
[13:52:50] 10Operations, 10Release Pipeline, 10Maps (Kartotherian): Make jobprocessor's test not depend on external files - https://phabricator.wikimedia.org/T231009 (10Mathew.onipe) [14:05:26] 10Operations, 10Discovery-Search, 10Elasticsearch: Change partitioning scheme for elasticsearch from RAID to JBOD - https://phabricator.wikimedia.org/T231010 (10Mathew.onipe) [14:06:05] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:12] (03PS3) 10Muehlenhoff: Switch Failoid in codfw to failoid2001 [puppet] - 10https://gerrit.wikimedia.org/r/531165 (https://phabricator.wikimedia.org/T224559) [14:10:27] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:40] (03CR) 10Muehlenhoff: [C: 03+2] Switch Failoid in codfw to failoid2001 [puppet] - 10https://gerrit.wikimedia.org/r/531165 (https://phabricator.wikimedia.org/T224559) (owner: 10Muehlenhoff) [14:13:38] (03PS2) 10Elukey: Create the new Analytics Zookeeper cluster [puppet] - 10https://gerrit.wikimedia.org/r/531701 (https://phabricator.wikimedia.org/T227025) [14:14:03] (03CR) 10Ema: ATS: add icinga check for traffic_server restarts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/531671 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [14:15:43] (03CR) 10CDanis: [C: 03+1] prometheus: rename mtail appserver handlers [puppet] - 10https://gerrit.wikimedia.org/r/531666 (https://phabricator.wikimedia.org/T230396) (owner: 10Filippo Giunchedi) [14:17:40] 10Operations, 10serviceops, 10Core Platform Team (Needs Cleaning - Services Operations): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10Jdlrobson) [14:20:16] 10Operations, 10serviceops, 10PHP 7.2 support: Mysterious, coordinated slowdowns 
every ~ 25 minutes on mw1347,mw1348 (php7 api servers) - https://phabricator.wikimedia.org/T231011 (10Joe) [14:20:51] 10Operations, 10serviceops, 10PHP 7.2 support: Mysterious, coordinated slowdowns every ~ 25 minutes on mw1347,mw1348 (php7 api servers) - https://phabricator.wikimedia.org/T231011 (10Joe) p:05Triage→03High [14:21:15] (03PS1) 10Muehlenhoff: Switch Failoid in eqiad to failoid1001 [puppet] - 10https://gerrit.wikimedia.org/r/531706 (https://phabricator.wikimedia.org/T224559) [14:23:59] 10Operations, 10serviceops, 10PHP 7.2 support: Mysterious, coordinated slowdowns every ~ 25 minutes on mw1347,mw1348 (php7 api servers) - https://phabricator.wikimedia.org/T231011 (10jijiki) [14:25:04] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/531706 (https://phabricator.wikimedia.org/T224559) (owner: 10Muehlenhoff) [14:28:15] (03CR) 10Mforns: "Should we do the 'ensure => absent' first, and then remove the code in a subsequent patch? This way the script files generated by the job " [puppet] - 10https://gerrit.wikimedia.org/r/531489 (https://phabricator.wikimedia.org/T229042) (owner: 10Nuria) [14:29:24] (03CR) 10Muehlenhoff: [C: 03+2] Switch Failoid in eqiad to failoid1001 [puppet] - 10https://gerrit.wikimedia.org/r/531706 (https://phabricator.wikimedia.org/T224559) (owner: 10Muehlenhoff) [14:36:20] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Migrate Failoid hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224559 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff New VMs (failoid1001 and failoid2001) have been setup and are in active use now. I'll keep the old jessie VMs around f... 
[14:36:32] (03PS2) 10Jbond: failoid: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531218 (https://phabricator.wikimedia.org/T102099) [14:37:19] <_joe_> !log restarting php-fpm on mw1348 to observe the effect on the slowdown, T231011 [14:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:26] T231011: Mysterious, coordinated slowdowns every ~ 25 minutes on mw1347,mw1348 (php7 api servers) - https://phabricator.wikimedia.org/T231011 [14:38:23] (03CR) 10Jbond: failoid: add ipv6 mapped address (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/531218 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [14:39:03] (03CR) 10Elukey: [C: 03+2] Create the new Analytics Zookeeper cluster [puppet] - 10https://gerrit.wikimedia.org/r/531701 (https://phabricator.wikimedia.org/T227025) (owner: 10Elukey) [14:39:06] Krinkle: I just wanted to say thanks for mediawiki-new-errors, it's making the train conductor life so much easier [14:39:11] (03PS3) 10Elukey: Create the new Analytics Zookeeper cluster [puppet] - 10https://gerrit.wikimedia.org/r/531701 (https://phabricator.wikimedia.org/T227025) [14:40:28] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/531218 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [14:40:38] 10Operations, 10vm-requests: Site: 2 VMs for corp LDAP replicas - https://phabricator.wikimedia.org/T231015 (10MoritzMuehlenhoff) [14:40:48] 10Operations, 10vm-requests: Site: 2 VMs for corp LDAP replicas - https://phabricator.wikimedia.org/T231015 (10MoritzMuehlenhoff) p:05Triage→03Normal a:03MoritzMuehlenhoff [14:41:21] 10Operations, 10vm-requests: eqiad/codfw: 2 VMs for corp LDAP replicas - https://phabricator.wikimedia.org/T231015 (10MoritzMuehlenhoff) [14:43:57] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/531218 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [14:44:10] (03CR) 
10Elukey: [C: 03+2] Create the new Analytics Zookeeper cluster [puppet] - 10https://gerrit.wikimedia.org/r/531701 (https://phabricator.wikimedia.org/T227025) (owner: 10Elukey) [14:45:29] (03CR) 10Filippo Giunchedi: [C: 03+1] ATS: add icinga check for traffic_server restarts [puppet] - 10https://gerrit.wikimedia.org/r/531671 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [14:46:36] (03CR) 10Giuseppe Lavagetto: [C: 03+1] profile::mediawiki::alerts: Increasing alerts for mw request time [puppet] - 10https://gerrit.wikimedia.org/r/531691 (https://phabricator.wikimedia.org/T230396) (owner: 10Effie Mouzeli) [14:47:12] (03CR) 10Effie Mouzeli: [C: 03+2] profile::mediawiki::alerts: Increasing alerts for mw request time [puppet] - 10https://gerrit.wikimedia.org/r/531691 (https://phabricator.wikimedia.org/T230396) (owner: 10Effie Mouzeli) [14:47:22] (03PS2) 10Effie Mouzeli: profile::mediawiki::alerts: Increasing alerts for mw request time [puppet] - 10https://gerrit.wikimedia.org/r/531691 (https://phabricator.wikimedia.org/T230396) [14:47:24] (03CR) 10BBlack: [C: 03+1] "What moritz said, this doesn't affect the actual failoid service." [puppet] - 10https://gerrit.wikimedia.org/r/531218 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [14:49:20] (03PS3) 10Jbond: failoid: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531218 (https://phabricator.wikimedia.org/T102099) [14:49:36] zeljkof: I'm glad it's useful :) [14:49:57] it is! :) [14:50:07] (03CR) 10Jbond: [C: 03+2] failoid: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531218 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [14:50:30] 10Operations, 10ORES, 10Scoring-platform-team, 10serviceops: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10Halfak) So, I've been trying explore the behaviors of `celery -A ores_celery inspect ping` to see if... 
[14:50:45] (03PS5) 10Ema: ATS: add icinga check for traffic_server restarts [puppet] - 10https://gerrit.wikimedia.org/r/531671 (https://phabricator.wikimedia.org/T227432) [14:54:14] (03CR) 10Ema: [C: 03+2] ATS: add icinga check for traffic_server restarts [puppet] - 10https://gerrit.wikimedia.org/r/531671 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [14:55:30] (03CR) 10Jcrespo: [C: 04-1] "I change is pending indeed, but let's discuss on IRC what is the best way to go over this- I think there is a lot of confusion on proxy fi" [puppet] - 10https://gerrit.wikimedia.org/r/531670 (owner: 10Muehlenhoff) [14:57:45] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10elukey) [14:58:07] (03CR) 10Nuria: "@mforns yes, that makes total sense. Can you submit a patch so I can see how do we specify that?" [puppet] - 10https://gerrit.wikimedia.org/r/531489 (https://phabricator.wikimedia.org/T229042) (owner: 10Nuria) [14:58:27] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10elukey) >>! In T227025#5428221, @elukey wrote: > @Cmjohnson I was able to install the OS on an-conf1001 via manual PXE install, but I... [14:59:38] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10elukey) [15:01:21] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [15:01:35] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:02:21] jijiki: what does the 'mediawiki_http_requests_' metric in Prometheus measure? Is it similar to varnish_backend_timing (which, despite the name, measures apache backend-timing) [15:03:00] Krinkle: it is apache backend-timing direct from the apache logs on each appserver [15:03:02] via mtail [15:03:21] PROBLEM - Check systemd state on restbase2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:03:23] PROBLEM - cassandra-b CQL 10.192.48.122:9042 on restbase2017 is CRITICAL: connect to address 10.192.48.122 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [15:03:37] cdanis: aha, I didn't know we could do mtail there, I guess we don't need it from varnishlog anymore then. [15:03:40] Krinkle: we pars logs with mtail [15:03:43] Krinkle: recently added https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/mtail/files/programs/mediawiki_access_log.mtail [15:03:51] It's mostly a hack (filtered to only count non-cache hits) [15:03:53] PROBLEM - cassandra-b service on restbase2017 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:04:46] cool. yeah. 
Are there plans to phase out the varnish_backend_timing metric? [15:05:17] I don't think so for now, maybe it will go away along with varnish itself [15:05:20] :p [15:05:49] yeah it'll eventually die out I think when text moves to ats, although having generic ats per-backend timing would be great [15:06:09] 10Operations: expand list of those who have permissions to edit the #wikimedia-operations topic - https://phabricator.wikimedia.org/T231016 (10CDanis) [15:06:11] 10Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, 10Core Platform Team (Needs Cleaning - Services Operations): Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 (10MoritzMuehlenhoff) T208087, T223976 and T222960 are fixed. Could we get restbase2009-rest... [15:06:23] RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [15:06:45] (03PS3) 10Elukey: profile::analytics::refinery::job::druid_load: absent readingdepth job [puppet] - 10https://gerrit.wikimedia.org/r/531489 (https://phabricator.wikimedia.org/T229042) (owner: 10Nuria) [15:09:22] (03CR) 10Elukey: [C: 03+2] profile::analytics::refinery::job::druid_load: absent readingdepth job [puppet] - 10https://gerrit.wikimedia.org/r/531489 (https://phabricator.wikimedia.org/T229042) (owner: 10Nuria) [15:09:31] PROBLEM - traffic_server backend process restarted on cp1076 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/000000610/ats-instance-drilldown?orgId=1&var-site=eqiad+prometheus/ops&var-instance=cp1076&var-layer=backend [15:10:01] known ^ [15:10:35] RECOVERY - cassandra-b service on restbase2017 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:11:09] RECOVERY - Check systemd state on restbase2017 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:12:21] RECOVERY - cassandra-b CQL 10.192.48.122:9042 on restbase2017 is OK: TCP OK - 0.036 second response time on 10.192.48.122 port 9042 https://phabricator.wikimedia.org/T93886 [15:13:44] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/530096 (https://phabricator.wikimedia.org/T225297) (owner: 10Elukey) [15:14:05] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/531697 (owner: 10Muehlenhoff) [15:14:23] (03CR) 10Elukey: Add sre.hadoop.reboot-workers.py (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/530096 (https://phabricator.wikimedia.org/T225297) (owner: 10Elukey) [15:14:33] (03PS6) 10BBlack: authdns: Stop cloning over /p/ [puppet] - 10https://gerrit.wikimedia.org/r/507076 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [15:16:39] (03CR) 10BBlack: [C: 03+2] authdns: Stop cloning over /p/ [puppet] - 10https://gerrit.wikimedia.org/r/507076 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [15:16:59] (03PS1) 10Ema: Revert "ATS: add icinga check for traffic_server restarts" [puppet] - 10https://gerrit.wikimedia.org/r/531708 [15:17:07] (03PS2) 10Ema: Revert "ATS: add icinga check for traffic_server restarts" [puppet] - 10https://gerrit.wikimedia.org/r/531708 [15:18:10] (03PS3) 10Elukey: Add sre.hadoop.reboot-workers.py [cookbooks] - 10https://gerrit.wikimedia.org/r/530096 (https://phabricator.wikimedia.org/T225297) [15:18:44] (03CR) 10Ema: [C: 03+2] Revert "ATS: add icinga check for traffic_server restarts" [puppet] - 10https://gerrit.wikimedia.org/r/531708 (owner: 10Ema) [15:19:44] (03PS1) 10Elukey: profile::analytics::refinery::job::druid_load: remove absent job [puppet] - 10https://gerrit.wikimedia.org/r/531709 [15:20:50] godog: fwiw, varnish backend timing does not measure the time spent for the backend request from varnish perspective. 
It just takes the Backend-Timing header that Apache outputs and forwards that value. [15:21:05] (03CR) 10Mforns: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/531709 (owner: 10Elukey) [15:21:26] 10Operations, 10Dumps-Generation: Migrate dumpsdata hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224563 (10ArielGlenn) [15:21:49] PROBLEM - traffic_server tls process restarted on cp1076 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/000000610/ats-instance-drilldown?orgId=1&var-site=eqiad+prometheus/ops&var-instance=cp1076&var-layer=tls [15:21:57] (03PS4) 10BBlack: Remove unused/untrusted IP ranges from trusted lists [puppet] - 10https://gerrit.wikimedia.org/r/509140 (https://phabricator.wikimedia.org/T222392) (owner: 10Ayounsi) [15:22:11] (03CR) 10Elukey: [C: 03+2] profile::analytics::refinery::job::druid_load: remove absent job [puppet] - 10https://gerrit.wikimedia.org/r/531709 (owner: 10Elukey) [15:22:28] (03Abandoned) 10BBlack: Revert "Convert most DYNA into 1H CNAME records" [dns] - 10https://gerrit.wikimedia.org/r/508979 (owner: 10Cwek) [15:22:53] Krinkle: indeed, I forgot about that but thanks for the reminder [15:23:15] the backend generic metric is the histogram at varnish_backend_requests_seconds_sum iirc [15:23:53] (03PS5) 10BBlack: Remove unused/untrusted IP ranges from trusted lists [puppet] - 10https://gerrit.wikimedia.org/r/509140 (https://phabricator.wikimedia.org/T222392) (owner: 10Ayounsi) [15:27:51] 10Operations, 10Analytics, 10vm-requests: Decommission analytics-tool1002 (old turnilo vm) - https://phabricator.wikimedia.org/T231021 (10elukey) [15:28:44] (03CR) 10BBlack: [C: 03+2] Remove unused/untrusted IP ranges from trusted lists [puppet] - 10https://gerrit.wikimedia.org/r/509140 (https://phabricator.wikimedia.org/T222392) (owner: 10Ayounsi) [15:30:48] 10Operations, 10netbox: Netbox LibreNMS report fails - https://phabricator.wikimedia.org/T230964 (10crusnov) 
a:03crusnov [15:33:20] (03PS1) 10Elukey: Decommission analytics-tool1002 (old turnilo Ganeti vm) [puppet] - 10https://gerrit.wikimedia.org/r/531712 (https://phabricator.wikimedia.org/T231021) [15:35:21] Krinkle: I'm not sure if I'm using mediawiki-new-errors correctly :/ I can't filter out everything. Every time I filter out something, something else pops up :/ [15:47:04] 10Operations, 10Security-Team, 10Traffic: scan external ranges with current Nessus rulesets - https://phabricator.wikimedia.org/T222097 (10ayounsi) [15:52:41] (03PS1) 10BBlack: [WIP] Fix up varnish ACLs... [puppet] - 10https://gerrit.wikimedia.org/r/531716 [16:00:04] godog and _joe_: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190822T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. [16:03:18] 10Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, 10Core Platform Team (Needs Cleaning - Services Operations): Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 (10Eevans) >>! In T224553#5431770, @MoritzMuehlenhoff wrote: > T208087, T223976 and T222960 a... [16:18:25] 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10Slaporte) @Dzahn Can you confirm how @tramm should configure the MX records? I think he'll need to add 18-19 from [here](https://gerrit.wikimedia.org/r/plugins... [16:31:51] (03CR) 10MSantos: "@AKosiaris is there anything else we need to move this forward?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/526679 (https://phabricator.wikimedia.org/T229287) (owner: 10MSantos) [16:32:36] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decom californium - https://phabricator.wikimedia.org/T189921 (10Jclark-ctr) a:05Jclark-ctr→03None [16:35:00] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decom californium - https://phabricator.wikimedia.org/T189921 (10Jclark-ctr) Disk wiped, Removed host from Netbox and racks placed in storage. [16:35:38] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decom californium - https://phabricator.wikimedia.org/T189921 (10Jclark-ctr) a:03Cmjohnson [16:35:54] 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10Joe) >>! In T204056#5431991, @Slaporte wrote: > @Dzahn Can you confirm how @tramm should configure the MX records? I think he'll need to add 18-19 from [here](... [16:48:17] (03PS6) 10MSantos: First version of the wikifeeds chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/526679 (https://phabricator.wikimedia.org/T229287) [16:51:53] jouncebot: next [16:51:53] In 0 hour(s) and 8 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190822T1700) [17:00:05] cscott, arlolra, subbu, halfak, and accraze: #bothumor I � Unicode. All rise for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190822T1700). 
[17:04:32] 10Operations, 10Analytics, 10vm-requests, 10Patch-For-Review: Decommission analytics-tool1002 (old turnilo vm) - https://phabricator.wikimedia.org/T231021 (10fdans) p:05Triage→03High [17:10:18] 10Operations, 10ops-eqiad, 10decommission: Decommission labnet1001 & labnet1002 - https://phabricator.wikimedia.org/T221818 (10Jclark-ctr) [17:10:44] (03CR) 10Elukey: [C: 03+2] Decommission analytics-tool1002 (old turnilo Ganeti vm) [puppet] - 10https://gerrit.wikimedia.org/r/531712 (https://phabricator.wikimedia.org/T231021) (owner: 10Elukey) [17:12:42] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission [17:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:49] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [17:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:56] 10Operations, 10Analytics, 10vm-requests, 10Patch-For-Review: Decommission analytics-tool1002 (old turnilo vm) - https://phabricator.wikimedia.org/T231021 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: `analytics-tool1002.eqiad.wmnet` - analytics-tool1002.e... [17:13:12] I love cookbooks :D [17:14:33] !log remove analytics-tool1002 from ganeti - T231021 [17:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:39] T231021: Decommission analytics-tool1002 (old turnilo vm) - https://phabricator.wikimedia.org/T231021 [17:16:02] 10Operations, 10ops-eqiad, 10decommission: Decommission labnet1001 & labnet1002 - https://phabricator.wikimedia.org/T221818 (10Jclark-ctr) wiped disk on labnet1001 removed from netbox and racks. labnet1002 not in rack listed previously listed b3 [17:16:15] (03PS1) 10Elukey: Decom analytics-tool1002 [dns] - 10https://gerrit.wikimedia.org/r/531725 (https://phabricator.wikimedia.org/T231021) [17:17:09] anybody up for a quick dns review? 
--^ [17:17:18] 10Operations, 10ops-eqiad, 10decommission: Decommission labnet1001 & labnet1002 - https://phabricator.wikimedia.org/T221818 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [17:19:24] (03CR) 10Volans: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/531725 (https://phabricator.wikimedia.org/T231021) (owner: 10Elukey) [17:20:34] (03CR) 10Elukey: [C: 03+2] Decom analytics-tool1002 [dns] - 10https://gerrit.wikimedia.org/r/531725 (https://phabricator.wikimedia.org/T231021) (owner: 10Elukey) [17:24:16] (03PS1) 10Cmjohnson: Removing dns entries for labnet100[1-2] [dns] - 10https://gerrit.wikimedia.org/r/531728 (https://phabricator.wikimedia.org/T221818) [17:24:43] 10Operations, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Replace Torrus with Prometheus snmp_exporter for PDUs monitoring - https://phabricator.wikimedia.org/T148541 (10RobH) >>! In T148541#5413482, @fgiunchedi wrote: > From a chat with @faidon it emerged that we have at least three main use c... [17:25:47] (03PS5) 10Herron: prometheus: add prometheus ipsec exporter service & config in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/530616 (https://phabricator.wikimedia.org/T230236) [17:26:59] (03CR) 10jerkins-bot: [V: 04-1] prometheus: add prometheus ipsec exporter service & config in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/530616 (https://phabricator.wikimedia.org/T230236) (owner: 10Herron) [17:27:01] (03CR) 10Cmjohnson: [C: 03+2] Removing dns entries for labnet100[1-2] [dns] - 10https://gerrit.wikimedia.org/r/531728 (https://phabricator.wikimedia.org/T221818) (owner: 10Cmjohnson) [17:27:03] (03PS2) 10Cmjohnson: Removing dns entries for labnet100[1-2] [dns] - 10https://gerrit.wikimedia.org/r/531728 (https://phabricator.wikimedia.org/T221818) [17:27:07] (03CR) 10Cmjohnson: [V: 03+2 C: 03+2] Removing dns entries for labnet100[1-2] [dns] - 10https://gerrit.wikimedia.org/r/531728 (https://phabricator.wikimedia.org/T221818) (owner: 10Cmjohnson) [17:27:30] 10Operations, 
10Analytics, 10Analytics-Wikistats, 10Performance-Team, 10Traffic: Piwik JS isn't cached - https://phabricator.wikimedia.org/T230772 (10fdans) a:05Nuria→03elukey [17:28:12] 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10Slaporte) :thumbsup: thanks, @Joe. @tramm, we're in the process of changing the nameserver now. [17:28:16] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission labnet1001 & labnet1002 - https://phabricator.wikimedia.org/T221818 (10Cmjohnson) @jclark-ctr did you add this to the tracking sheet? [17:28:30] (03PS6) 10Herron: prometheus: add prometheus ipsec exporter service & config in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/530616 (https://phabricator.wikimedia.org/T230236) [17:29:05] 10Operations, 10Analytics, 10Analytics-Wikistats, 10Performance-Team, 10Traffic: Piwik JS isn't cached - https://phabricator.wikimedia.org/T230772 (10fdans) a:05elukey→03Nuria [17:30:17] 10Operations, 10Readers-Web-Backlog, 10Traffic: [Bug] iPadOS 13 shows the desktop version of Safari with a broken layout - https://phabricator.wikimedia.org/T229875 (10dr0ptp4kt) @ovasileva okay to redirect Safari desktop to mdot? 
CC @DStrine @mepps @ejegg @BBlack [17:32:32] 10Operations, 10Analytics, 10vm-requests, 10Patch-For-Review: Decommission analytics-tool1002 (old turnilo vm) - https://phabricator.wikimedia.org/T231021 (10elukey) 05Open→03Resolved [17:32:41] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission labnet1001 & labnet1002 - https://phabricator.wikimedia.org/T221818 (10Cmjohnson) [17:37:19] (03CR) 10Cwhite: [C: 03+2] logster: add ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/529399 (https://phabricator.wikimedia.org/T229357) (owner: 10Cwhite) [17:37:26] (03PS7) 10Cwhite: logster: add ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/529399 (https://phabricator.wikimedia.org/T229357) [17:46:03] 10Operations, 10conftool, 10serviceops: update master DC switch script for a post-dbctl world - https://phabricator.wikimedia.org/T231035 (10CDanis) [17:47:49] 10Operations, 10conftool, 10serviceops: update master DC switch script for a post-dbctl world - https://phabricator.wikimedia.org/T231035 (10CDanis) 05Open→03Invalid `17:44:35 cdanis: that was the switcdc repo that become the basis to write spicerack, now lives as cookbooks in the cookbooks repo... [17:48:02] 10Operations, 10Readers-Web-Backlog, 10Traffic: [Bug] iPadOS 13 shows the desktop version of Safari with a broken layout - https://phabricator.wikimedia.org/T229875 (10ovasileva) >>! In T229875#5432239, @dr0ptp4kt wrote: > @ovasileva okay to redirect Safari desktop to mdot? > > CC @DStrine @mepps @ejegg @BB... [17:49:22] (03PS1) 10Cwhite: profile, varnishkafka: remove logster cron entries from varnishkafka hosts [puppet] - 10https://gerrit.wikimedia.org/r/531730 (https://phabricator.wikimedia.org/T229357) [17:49:57] 10Operations, 10Readers-Web-Backlog, 10Traffic: [Bug] iPadOS 13 shows the desktop version of Safari with a broken layout - https://phabricator.wikimedia.org/T229875 (10dr0ptp4kt) Okay, let me look into this. 
[17:51:25] (03CR) 10Cwhite: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/17988/" [puppet] - 10https://gerrit.wikimedia.org/r/531730 (https://phabricator.wikimedia.org/T229357) (owner: 10Cwhite) [17:57:08] 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10Slaporte) Hi @tramm, can you let me know when your nameservers are configured? The registry requires this, and the update will not complete (on our side) until... [18:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: That opportune time is upon us again. Time for a Morning SWAT (Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190822T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:00:57] (03CR) 10Mathew.onipe: [C: 03+1] Add services_proxy to wikitech [puppet] - 10https://gerrit.wikimedia.org/r/531686 (https://phabricator.wikimedia.org/T230994) (owner: 10DCausse) [18:17:35] Uhhm any deployer around for a last-minute patch? [18:21:44] (03PS1) 10Mholloway: MachineVision: Update image labeling handler config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531734 [18:45:29] o/ hey folks, would someone be able to remove 2fa from my account? I seem to have lost my token. [blush] [18:52:12] (03CR) 10Catrope: [C: 03+1] "a" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528546 (https://phabricator.wikimedia.org/T221871) (owner: 10Sbisson) [18:54:44] (03CR) 10Catrope: [C: 03+1] Enable and configure ORES damaging and goodfaith on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528537 (https://phabricator.wikimedia.org/T225562) (owner: 10Sbisson) [19:22:22] (03PS1) 10DannyS712: Clean up `wgRateLimits` to remove unneeded entries. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/531739 (https://phabricator.wikimedia.org/T231040) [19:22:49] (03PS2) 10DannyS712: Clean up `wgRateLimits` to remove unneeded entries. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531739 (https://phabricator.wikimedia.org/T231040) [19:24:45] 10Operations, 10Phabricator, 10Traffic, 10Release-Engineering-Team (Development services), and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (10JAufrecht) > __Wikimedia Foundation__ (https://wikimediafoundation.org/) → Technical Blog... [19:41:44] (03PS1) 10DannyS712: Clean up `groupOverrides` to remove unneeded entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531741 (https://phabricator.wikimedia.org/T231041) [19:47:12] 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10tramm) Thanks @Slaporte, we will have to consult our hosting provider Elkdata and this will happen in 12h or so. They don't provide us UI to make the changes,... [19:49:21] 10Operations, 10ops-eqiad, 10DC-Ops, 10Wikimedia-Logstash, and 3 others: Decommission old eqiad logstash hardware hosts logstash100[456] - https://phabricator.wikimedia.org/T217556 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [19:50:09] (03CR) 10Urbanecm: [C: 03+1] "LGTM!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/531739 (https://phabricator.wikimedia.org/T231040) (owner: 10DannyS712) [20:14:22] (03PS2) 10DannyS712: Clean up `groupOverrides` to remove unneeded entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531741 (https://phabricator.wikimedia.org/T231041) [20:28:53] (03CR) 10Cwhite: prometheus: add prometheus ipsec exporter service & config in ulsfo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/530616 (https://phabricator.wikimedia.org/T230236) (owner: 10Herron) [20:32:28] (03PS3) 10DannyS712: Clean up `groupOverrides` to remove unneeded entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531741 (https://phabricator.wikimedia.org/T231041) [20:33:26] (03CR) 10jerkins-bot: [V: 04-1] Clean up `groupOverrides` to remove unneeded entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531741 (https://phabricator.wikimedia.org/T231041) (owner: 10DannyS712) [20:37:11] (03PS4) 10DannyS712: General cleanup of `groupOverrides`. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531741 (https://phabricator.wikimedia.org/T231041) [20:38:31] (03CR) 10jerkins-bot: [V: 04-1] General cleanup of `groupOverrides`. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531741 (https://phabricator.wikimedia.org/T231041) (owner: 10DannyS712) [20:43:34] (03PS1) 10Ayounsi: Fastnetmon, add GeoIP DBs [puppet] - 10https://gerrit.wikimedia.org/r/531746 (https://phabricator.wikimedia.org/T226810) [20:46:58] (03CR) 10DannyS712: "Reasoning for each specific removal is in the inline comments." (0335 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531741 (https://phabricator.wikimedia.org/T231041) (owner: 10DannyS712) [20:47:36] (03PS2) 10Ayounsi: Fastnetmon, add GeoIP DBs [puppet] - 10https://gerrit.wikimedia.org/r/531746 (https://phabricator.wikimedia.org/T226810) [20:47:38] (03PS5) 10DannyS712: General cleanup of `groupOverrides`. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/531741 (https://phabricator.wikimedia.org/T231041) [20:57:46] (03PS3) 10Ayounsi: Fastnetmon, add GeoIP DBs [puppet] - 10https://gerrit.wikimedia.org/r/531746 (https://phabricator.wikimedia.org/T226810) [20:59:14] PROBLEM - Host cloudvirt1015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [20:59:15] (03CR) 10Ayounsi: [C: 03+2] Fastnetmon, add GeoIP DBs [puppet] - 10https://gerrit.wikimedia.org/r/531746 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi) [21:06:41] (03PS1) 10Ayounsi: Fastnetmon add python3-geoip2 [puppet] - 10https://gerrit.wikimedia.org/r/531747 (https://phabricator.wikimedia.org/T226810) [21:12:03] (03CR) 10Ayounsi: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/17991/" [puppet] - 10https://gerrit.wikimedia.org/r/531747 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi) [21:21:39] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10bd808) >>! In T220853#5429415, @Cmjohnson wrote: > Board arrived DOA...need another one The haunting extends to replaceme... [21:22:25] (03CR) 10Urbanecm: [C: 04-1] "Thanks for your work, few things I catched on first sight." (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531741 (https://phabricator.wikimedia.org/T231041) (owner: 10DannyS712) [21:42:30] (03CR) 10DannyS712: General cleanup of `groupOverrides`. 
(034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531741 (https://phabricator.wikimedia.org/T231041) (owner: 10DannyS712) [21:46:52] 10Operations: setup/install gerrit1001 - https://phabricator.wikimedia.org/T231046 (10RobH) p:05Triage→03Normal [21:47:00] 10Operations: setup/install gerrit1001 - https://phabricator.wikimedia.org/T231046 (10RobH) [21:47:53] 10Operations: setup/install gerrit1001 - https://phabricator.wikimedia.org/T231046 (10RobH) [21:48:53] 10Operations: apply hostname label for wmf5176/gerrit1001 - https://phabricator.wikimedia.org/T231047 (10RobH) [21:49:09] 10Operations: setup/install gerrit1001 - https://phabricator.wikimedia.org/T231046 (10RobH) [21:57:15] RECOVERY - Host cloudvirt1015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms [21:57:27] (03PS1) 10RobH: updating gerrit1001 mgmt dns [dns] - 10https://gerrit.wikimedia.org/r/531751 (https://phabricator.wikimedia.org/T231046) [21:58:04] (03PS2) 10Mholloway: MachineVision: Update image labeling handler config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531734 [21:58:06] (03PS1) 10Ayounsi: Pmacct, add source and dest countries based on GeoIP DB [puppet] - 10https://gerrit.wikimedia.org/r/531752 [21:58:53] (03CR) 10jerkins-bot: [V: 04-1] MachineVision: Update image labeling handler config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531734 (owner: 10Mholloway) [22:01:30] (03PS3) 10Mholloway: MachineVision: Update image labeling handler config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531734 [22:03:16] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: site=eqsin https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [22:03:28] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10Jclark-ctr) motherboard replaced 
set idrac and password [22:03:37] (03PS2) 10Ayounsi: Pmacct, add source and destination countries based on GeoIP DB [puppet] - 10https://gerrit.wikimedia.org/r/531752 [22:03:52] (03CR) 10Mholloway: [C: 03+2] MachineVision: Update image labeling handler config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531734 (owner: 10Mholloway) [22:04:55] (03Merged) 10jenkins-bot: MachineVision: Update image labeling handler config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531734 (owner: 10Mholloway) [22:05:31] (03PS3) 10Ayounsi: Pmacct, add source and destination countries based on GeoIP DB [puppet] - 10https://gerrit.wikimedia.org/r/531752 [22:05:58] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1002/17993/netflow1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/531752 (owner: 10Ayounsi) [22:07:24] (03CR) 10jenkins-bot: MachineVision: Update image labeling handler config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531734 (owner: 10Mholloway) [22:07:35] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: Update MachineVision Beta config (duration: 00m 47s) [22:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:59] (03CR) 10RobH: [C: 03+2] updating gerrit1001 mgmt dns [dns] - 10https://gerrit.wikimedia.org/r/531751 (https://phabricator.wikimedia.org/T231046) (owner: 10RobH) [22:16:11] (03PS1) 10RobH: adding gerrit1001 production dns entries [dns] - 10https://gerrit.wikimedia.org/r/531757 (https://phabricator.wikimedia.org/T231046) [22:16:36] (03CR) 10jerkins-bot: [V: 04-1] adding gerrit1001 production dns entries [dns] - 10https://gerrit.wikimedia.org/r/531757 (https://phabricator.wikimedia.org/T231046) (owner: 10RobH) [22:18:05] (03CR) 10Urbanecm: [C: 03+1] "LGTM then"" (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531741 (https://phabricator.wikimedia.org/T231041) (owner: 10DannyS712) [22:18:16] jouncebot: next [22:18:16] In 
0 hour(s) and 41 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190822T2300) [22:19:47] (03PS2) 10RobH: adding gerrit1001 production dns entries [dns] - 10https://gerrit.wikimedia.org/r/531757 (https://phabricator.wikimedia.org/T231046) [22:20:38] (03CR) 10RobH: [C: 03+2] adding gerrit1001 production dns entries [dns] - 10https://gerrit.wikimedia.org/r/531757 (https://phabricator.wikimedia.org/T231046) (owner: 10RobH) [22:23:42] 10Operations: setup/install gerrit1001 - https://phabricator.wikimedia.org/T231046 (10RobH) [22:32:52] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [22:36:16] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:38:55] (03CR) 10Ayounsi: "John, is there any way to not have to disable the linter for those require ?" [puppet] - 10https://gerrit.wikimedia.org/r/531752 (owner: 10Ayounsi) [22:40:09] (03PS1) 10RobH: gerrit1001 base install params [puppet] - 10https://gerrit.wikimedia.org/r/531760 (https://phabricator.wikimedia.org/T231046) [22:40:54] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:41:05] eh [22:41:19] Oh yah [22:41:26] fix on that in a bit [22:44:54] (03CR) 10RobH: [C: 03+2] gerrit1001 base install params [puppet] - 10https://gerrit.wikimedia.org/r/531760 (https://phabricator.wikimedia.org/T231046) (owner: 10RobH) [22:54:26] 10Operations, 10Patch-For-Review: setup/install gerrit1001 - https://phabricator.wikimedia.org/T231046 (10RobH) [23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: Time to snap out of that daydream and deploy Evening SWAT (Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190822T2300). [23:00:04] Smalyshev: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:20] am here... [23:00:27] the patch is still stuck in CI though... [23:00:46] I can SWAT today! [23:01:32] well it'd have to go through the gate-and-submit-swat pipeline anyway [23:01:43] I'll do a few config patches before this merges [23:02:09] Urbanecm: ok, thanks. [23:02:14] (03CR) 10Urbanecm: [C: 03+2] Revert "Revert "Clean up `wgNamespacesToBeSearchedDefault` to remove unneeded entries"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531501 (https://phabricator.wikimedia.org/T230797) (owner: 10Urbanecm) [23:02:22] (03CR) 10Urbanecm: [C: 03+2] Clean up `wgRateLimits` to remove unneeded entries. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/531739 (https://phabricator.wikimedia.org/T231040) (owner: 10DannyS712) [23:02:26] CI for wikidata modules is insanely long lately :( [23:02:44] :/ [23:02:57] (03PS2) 10Urbanecm: Revert "Revert "Clean up `wgNamespacesToBeSearchedDefault` to remove unneeded entries"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531501 (https://phabricator.wikimedia.org/T230797) [23:03:10] (03CR) 10Urbanecm: [C: 03+2] Revert "Revert "Clean up `wgNamespacesToBeSearchedDefault` to remove unneeded entries"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531501 (https://phabricator.wikimedia.org/T230797) (owner: 10Urbanecm) [23:03:39] (03PS1) 10RobH: removing ipv6 call while role spare [puppet] - 10https://gerrit.wikimedia.org/r/531762 (https://phabricator.wikimedia.org/T231046) [23:04:03] (03Merged) 10jenkins-bot: Clean up `wgRateLimits` to remove unneeded entries. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531739 (https://phabricator.wikimedia.org/T231040) (owner: 10DannyS712) [23:04:05] (03CR) 10RobH: [C: 03+2] removing ipv6 call while role spare [puppet] - 10https://gerrit.wikimedia.org/r/531762 (https://phabricator.wikimedia.org/T231046) (owner: 10RobH) [23:04:20] (03CR) 10jenkins-bot: Clean up `wgRateLimits` to remove unneeded entries. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/531739 (https://phabricator.wikimedia.org/T231040) (owner: 10DannyS712) [23:07:03] (03PS1) 10CRusnov: librenms: Fix for api shift [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/531763 [23:08:01] (03PS3) 10Urbanecm: Change language code for punjabiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530769 (https://phabricator.wikimedia.org/T230680) (owner: 10MarcoAurelio) [23:08:03] (03CR) 10CRusnov: [C: 03+2] librenms: Fix for api shift [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/531763 (owner: 10CRusnov) [23:08:05] (03CR) 10Urbanecm: [C: 03+2] Change language code for punjabiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530769 (https://phabricator.wikimedia.org/T230680) (owner: 10MarcoAurelio) [23:08:58] (03Merged) 10jenkins-bot: Change language code for punjabiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530769 (https://phabricator.wikimedia.org/T230680) (owner: 10MarcoAurelio) [23:09:14] (03CR) 10jenkins-bot: Change language code for punjabiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530769 (https://phabricator.wikimedia.org/T230680) (owner: 10MarcoAurelio) [23:09:46] 10Operations, 10netbox: Netbox LibreNMS report fails - https://phabricator.wikimedia.org/T230964 (10crusnov) 05Open→03Resolved Fix deployed with https://gerrit.wikimedia.org/r/531763 [23:10:00] PROBLEM - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [23:10:38] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: a5917e4: Clean up `wgRateLimits` to remove unneeded entries (T231040) (duration: 00m 48s) [23:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:44] T231040: Clean up `wgRateLimits` to remove unneeded entries - https://phabricator.wikimedia.org/T231040 [23:10:59] (03PS3) 10Urbanecm: Revert "Revert "Clean up `wgNamespacesToBeSearchedDefault` to remove unneeded entries"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531501 (https://phabricator.wikimedia.org/T230797) [23:11:03] (03CR) 10Urbanecm: [C: 03+2] Revert "Revert "Clean up `wgNamespacesToBeSearchedDefault` to remove unneeded entries"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531501 (https://phabricator.wikimedia.org/T230797) (owner: 10Urbanecm) [23:12:08] (03Merged) 10jenkins-bot: Revert "Revert "Clean up `wgNamespacesToBeSearchedDefault` to remove unneeded entries"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531501 (https://phabricator.wikimedia.org/T230797) (owner: 10Urbanecm) [23:12:24] (03CR) 10jenkins-bot: Revert "Revert "Clean up `wgNamespacesToBeSearchedDefault` to remove unneeded entries"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531501 (https://phabricator.wikimedia.org/T230797) (owner: 10Urbanecm) [23:13:28] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 872f4b0: Change language code for punjabiwikimedia (T230680) (duration: 00m 48s) [23:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:34] T230680: punjabi.wikimedia.org: Change site language code from fake 'punjabi' to 'pa' - https://phabricator.wikimedia.org/T230680 [23:14:10] resyncing the above, got "IOError: [Errno 32] Broken pipe" from scap [23:15:01] !log urbanecm@deploy1001 Synchronized 
wmf-config/InitialiseSettings.php: SWAT: 872f4b0: Change language code for punjabiwikimedia, resyncing, got broken pipe at the end (T230680) (duration: 00m 47s) [23:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:12] worked fine this time [23:16:15] (03CR) 10Urbanecm: [C: 03+2] General cleanup of `groupOverrides`. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531741 (https://phabricator.wikimedia.org/T231041) (owner: 10DannyS712) [23:16:16] PROBLEM - Check the Netbox report-s- librenms for fail status. on netmon1002 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [23:16:48] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 295ce06: Revert "Revert "Clean up `wgNamespacesToBeSearchedDefault` to remove unneeded entries"" (T230797) (duration: 00m 48s) [23:16:51] XioNoX: ^ [23:17:19] (03PS6) 10Urbanecm: General cleanup of `groupOverrides`. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531741 (https://phabricator.wikimedia.org/T231041) (owner: 10DannyS712) [23:17:26] (03CR) 10Urbanecm: [C: 03+2] General cleanup of `groupOverrides`. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531741 (https://phabricator.wikimedia.org/T231041) (owner: 10DannyS712) [23:19:48] chaomodus: yay! [23:19:54] (03Merged) 10jenkins-bot: General cleanup of `groupOverrides`. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531741 (https://phabricator.wikimedia.org/T231041) (owner: 10DannyS712) [23:19:59] I'm wondering what changed [23:20:15] I guess the new REs don't report the FPM over SNMP... [23:20:31] (03CR) 10jenkins-bot: General cleanup of `groupOverrides`. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/531741 (https://phabricator.wikimedia.org/T231041) (owner: 10DannyS712) [23:21:18] FPM = https://www.juniper.net/documentation/en_US/release-independent/junos/topics/topic-map/mx480-chassis.html#id-mx480-craft-interface-description [23:22:02] yeah, eqiad's router do report it https://librenms.wikimedia.org/device/device=1/tab=entphysical/ [23:22:37] (03CR) 10Urbanecm: [C: 03+2] "Should be ready now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480074 (owner: 10Daimona Eaytoy) [23:23:12] 10Operations, 10ops-codfw: Degraded RAID on db2056 - https://phabricator.wikimedia.org/T231056 (10ops-monitoring-bot) [23:24:13] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 66b719d: General cleanup of `groupOverrides` (T231041) (duration: 00m 47s) [23:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:22] T231041: Clean up `groupOverrides` to remove unneeded entries - https://phabricator.wikimedia.org/T231041 [23:25:04] SMalyshev: we have a failed test [23:25:07] https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php72-docker/16917/ [23:25:15] could you please have a look? [23:25:17] Urbanecm: yeah 73 test timed out [23:25:35] wait that's different one [23:25:56] ah of course... Lexeme qunit tests are super flaky [23:26:01] they time out all the time [23:26:01] doesn't seem like a timeout? https://www.irccloud.com/pastebin/Pg7RxN7h/ [23:26:55] hmm no idea actually since this same code passed 3 times in master and the change has absolutely nothing to do with editing forms... 
it's RDF export [23:26:56] (03Merged) 10jenkins-bot: Rename globals and rights in AbuseFilter config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480074 (owner: 10Daimona Eaytoy) [23:26:59] so I'd try to recheck [23:27:11] (03CR) 10jenkins-bot: Rename globals and rights in AbuseFilter config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480074 (owner: 10Daimona Eaytoy) [23:27:40] SMalyshev: well it fails in php7 pipeline [23:27:46] Urbanecm: it does that a lot: https://phabricator.wikimedia.org/T229634 [23:28:19] those tests are very unstable [23:28:35] see the third one in the phab task - it's exactly this one [23:29:01] okay, going to get the tests restarted somehow [23:29:15] yeah I'd try recheck... it usually goes away [23:29:56] sorry, these tests are long and unstable... :( [23:30:15] and we have UBN issue hanging on them [23:31:04] okay [23:31:15] cancelled the rest of the build, and re+2'ed [23:31:17] let's see [23:32:57] !log urbanecm@deploy1001 Synchronized wmf-config/abusefilter.php: SWAT: eb1c4ea: Rename globals and rights in AbuseFilter config (duration: 00m 47s) [23:33:01] 10Operations: setup/install gerrit1001 - https://phabricator.wikimedia.org/T231046 (10RobH) a:05RobH→03Dzahn [23:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:13] CI seems to be quite busy SMalyshev :/ [23:33:23] yep [23:33:50] 10Operations: setup/install gerrit1001 - https://phabricator.wikimedia.org/T231046 (10RobH) @dzahn, I think you are the person to push this into service, being the author of the original hardware request. If not you, please advise if you know who should get this, and if not sure, assign back to me for followup... [23:35:24] given that particular job succeeded in master, I'm tempted to just V+2 it [23:36:22] RECOVERY - Check the Netbox report-s- puppetdb for fail status.
on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [23:37:00] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:37:36] did that SMalyshev , we don't have another half an hour, possibly more [23:38:32] Urbanecm: I am trying to verify it on beta, but for some reason it didn't deploy the latest master yet for WikibaseLexeme... [23:38:42] SMalyshev: pulled onto mwdebug1002, could you check there? [23:38:44] do you know how that works? Does anything manual need to be done? [23:38:55] Urbanecm: ok let me check now [23:39:08] well the deployment job seems to have failed [23:39:12] I'll rerun it manually [23:39:30] oh not [23:39:33] that's diff one [23:39:46] Urbanecm: looks like it works on mwdebug [23:39:52] okay, let's sync then [23:40:06] yep not broken anymore [23:41:11] Urbanecm: so on mwdebug it's good you can deploy [23:41:16] syncing, thanks SMalyshev [23:41:41] !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.19/extensions/WikibaseLexeme: SWAT: e4a5457: Fix Lexemes RDF generation (T230974) (duration: 00m 49s) [23:41:45] done SMalyshev ! [23:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:47] T230974: New lexemes missing in Wikidata Query Service - https://phabricator.wikimedia.org/T230974 [23:41:49] please verify and let me know! [23:42:32] Urbanecm: yep RDF on Lexemes not broken anymore. Thanks! [23:42:38] happy to help SMalyshev ! [23:42:50] I'll look into the beta issue a little [23:43:11] now I have to go back and re-update all lexemes that were broken since it started... but that's easier