[00:14:44] (03CR) 10CRusnov: [V: 03+2 C: 03+2] Rebuild Netbox 2.9.10 dependencies [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/651280 (owner: 10CRusnov) [00:19:55] !log crusnov@deploy1001 Started deploy [netbox/deploy@b17db99]: Redeploy of 2.9.10 to netbox-dev for dep test [00:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:20:49] !log crusnov@deploy1001 Finished deploy [netbox/deploy@b17db99]: Redeploy of 2.9.10 to netbox-dev for dep test (duration: 00m 54s) [00:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:26:17] (03CR) 10Ori.livneh: "> Patch Set 1: Code-Review+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/651267 (owner: 10Ori.livneh) [00:32:50] (03PS1) 10Legoktm: aptrepo: Pull pyall component over HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/651300 [00:41:17] (03PS2) 10Legoktm: aptrepo: Pull pyall component over HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/651300 [00:48:57] (03PS1) 10Bstorm: wikireplicas: set up VM haproxy layer [puppet] - 10https://gerrit.wikimedia.org/r/651301 (https://phabricator.wikimedia.org/T267376) [00:50:03] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 10923952192 and 592 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:50:03] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 8314045280 and 433 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:50:27] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2451684784 and 154 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:50:27] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 6385288736 and 362 seconds 
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:52:13] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 3135964968 and 295 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:52:31] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2444305600 and 271 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:52:41] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1554093408 and 238 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:52:43] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 4012208296 and 368 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:52:47] (03CR) 10Bstorm: wikireplicas: set up VM haproxy layer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/651301 (https://phabricator.wikimedia.org/T267376) (owner: 10Bstorm) [00:53:47] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 133408 and 237 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:55:33] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 5880 and 343 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:55:49] RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 149224 and 359 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:56:01] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 67840 and 370 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:57:41] RECOVERY - 
Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 12704 and 471 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:58:21] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 32216 and 510 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:58:45] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 4376 and 535 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:59:59] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 22376 and 608 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:13:24] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T270663 (10ops-monitoring-bot) [01:21:07] 10Operations, 10Traffic, 10netops: Was unable to connect (esams) for about 20 minutes - https://phabricator.wikimedia.org/T270664 (10AlexisJazz) [02:07:38] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.36.0-wmf.23 [core] (wmf/1.36.0-wmf.23) - 10https://gerrit.wikimedia.org/r/651306 [02:11:22] (03PS2) 10DannyS712: Branch commit for wmf/1.36.0-wmf.23 [core] (wmf/1.36.0-wmf.23) - 10https://gerrit.wikimedia.org/r/651306 (https://phabricator.wikimedia.org/T267416) (owner: 10TrainBranchBot) [02:11:32] (03Abandoned) 10DannyS712: Branch commit for wmf/1.36.0-wmf.23 [core] (wmf/1.36.0-wmf.23) - 10https://gerrit.wikimedia.org/r/651306 (https://phabricator.wikimedia.org/T267416) (owner: 10TrainBranchBot) [02:32:57] PROBLEM - Check systemd state on dbprov2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:27:27] RECOVERY - Check systemd state on dbprov2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:53:24] (03CR) 10Marostegui: [C: 03+1] "Thanks!" [software/conftool] - 10https://gerrit.wikimedia.org/r/651269 (https://phabricator.wikimedia.org/T269324) (owner: 10CDanis) [05:54:45] 10Operations, 10MediaWiki-General, 10Performance-Team, 10serviceops-radar, and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (10Marostegui) Thank you @krinkle One further question, we are still thinking whether to configure ROW or STA... [05:55:18] (03CR) 10Marostegui: "This won't be needed, quoting Timo:" [puppet] - 10https://gerrit.wikimedia.org/r/649820 (https://phabricator.wikimedia.org/T269324) (owner: 10Jcrespo) [06:00:48] (03PS1) 10Marostegui: install_server: Do not reimage x2 eqiad hosts [puppet] - 10https://gerrit.wikimedia.org/r/651313 [06:01:33] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage x2 eqiad hosts [puppet] - 10https://gerrit.wikimedia.org/r/651313 (owner: 10Marostegui) [06:04:33] (03PS3) 10Legoktm: aptrepo: Pull pyall component over HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/651300 [06:59:17] (03PS2) 10Elukey: hive: remove analytics-replicated-hive config [puppet] - 10https://gerrit.wikimedia.org/r/650077 (https://phabricator.wikimedia.org/T268028) [07:01:35] (03CR) 10Elukey: [C: 03+2] hive: remove analytics-replicated-hive config [puppet] - 10https://gerrit.wikimedia.org/r/650077 (https://phabricator.wikimedia.org/T268028) (owner: 10Elukey) [07:08:46] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [07:11:04] ah this seems to be mr1 [07:11:13] the alarm is missing something [07:11:55] yep indeed https://librenms.wikimedia.org/device/53 [07:13:46] (Processor usage over 85%) resolved: 
Processor usage over 85% - https://alerts.wikimedia.org [07:20:51] I checked and the cause seems to be flowd_octeon_hm, but IIUC from reading it seems a system daemon that sometimes processes packets at higher rate (consuming more cpu) [07:27:43] !log reboot stat100[4-8] (analytics hadoop clients) for kernel upgrades [07:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:45] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.199 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:28:01] ah lovely [07:28:21] PROBLEM - Juniper alarms on mr1-eqiad is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.154.199 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [07:30:28] it seems the high cpu usage [07:32:45] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.199 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:34:15] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:34:51] RECOVERY - Juniper alarms on mr1-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [07:40:05] RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:40:21] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:42:47] (Processor usage over 85%) firing: Processor usage over 85% - 
https://alerts.wikimedia.org [07:47:47] (Processor usage over 85%) resolved: (2) Processor usage over 85% - https://alerts.wikimedia.org [08:29:49] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10observability, and 2 others: Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086 (10hashar) [08:36:01] PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:36:11] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:48:03] (03PS1) 10Elukey: admin: deprecate the analytics-users posix group [puppet] - 10https://gerrit.wikimedia.org/r/651448 (https://phabricator.wikimedia.org/T269150) [08:48:29] (03CR) 10jerkins-bot: [V: 04-1] admin: deprecate the analytics-users posix group [puppet] - 10https://gerrit.wikimedia.org/r/651448 (https://phabricator.wikimedia.org/T269150) (owner: 10Elukey) [08:49:43] (03PS2) 10Elukey: admin: deprecate the analytics-users posix group [puppet] - 10https://gerrit.wikimedia.org/r/651448 (https://phabricator.wikimedia.org/T269150) [08:52:02] !log gerrit: running jhat heap analyzer on gerrit2001 # T263008 [08:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:06] T263008: Gerrit out of heap - https://phabricator.wikimedia.org/T263008 [08:55:20] a 34GB dump :-\ [09:18:06] (03PS1) 10Giuseppe Lavagetto: Remove user/group settings from php-fpm's configuration [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/651450 [09:20:13] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T270663 (10Peachey88) [09:21:12] 10Operations, 10Traffic, 10netops: Was unable to connect (esams) for about 20 minutes 
- https://phabricator.wikimedia.org/T270664 (10Peachey88) [09:42:13] (03PS1) 10Arturo Borrero Gonzalez: nftables: introduce nft-check exec [puppet] - 10https://gerrit.wikimedia.org/r/651453 [09:44:56] (03PS1) 10Jbond: enable-puppet: ensure we make sure puppet is not running before enableing [puppet] - 10https://gerrit.wikimedia.org/r/651454 [09:45:20] volans: ^^^ could be the fix [09:48:37] (03CR) 10Volans: [C: 03+1] "It shouldn't hurt and makes sense if we run the enable on a host with puppet already enabled and hence potentially running" [puppet] - 10https://gerrit.wikimedia.org/r/651454 (owner: 10Jbond) [09:50:08] (03CR) 10Arturo Borrero Gonzalez: "yes @dcaro I did some tests in toolsbeta. I'm convinced the patch works." [puppet] - 10https://gerrit.wikimedia.org/r/650470 (https://phabricator.wikimedia.org/T267966) (owner: 10Arturo Borrero Gonzalez) [09:50:27] (03Abandoned) 10Arturo Borrero Gonzalez: kubeadm: etcd: introduce systemd-based higher priority scheduling policy [puppet] - 10https://gerrit.wikimedia.org/r/650470 (https://phabricator.wikimedia.org/T267966) (owner: 10Arturo Borrero Gonzalez) [10:00:07] PROBLEM - Hive Metastore on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [10:01:03] this is me --^ [10:02:43] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T270663 (10Volans) 05Open→03Resolved a:03Volans It was a transient degradation due to a rebuild, it's all good now, resolving. ` $ sudo /usr/local/lib/nagios/plugins/get-raid-s... 
[10:06:53] !log upgraded python3-wmflib to 0.0.5 on cumin2001 [10:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:06] (03PS1) 10Elukey: cdh::hive: don't use TLS for local db connections [puppet] - 10https://gerrit.wikimedia.org/r/651455 [10:08:30] (03CR) 10jerkins-bot: [V: 04-1] cdh::hive: don't use TLS for local db connections [puppet] - 10https://gerrit.wikimedia.org/r/651455 (owner: 10Elukey) [10:08:39] iff [10:09:28] (03PS2) 10Elukey: cdh::hive: don't use TLS for local db connections [puppet] - 10https://gerrit.wikimedia.org/r/651455 [10:11:30] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27229/console" [puppet] - 10https://gerrit.wikimedia.org/r/651455 (owner: 10Elukey) [10:14:08] (03PS3) 10Elukey: cdh::hive: don't use TLS for local db connections [puppet] - 10https://gerrit.wikimedia.org/r/651455 [10:14:13] RECOVERY - Hive Metastore on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [10:16:40] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27230/console" [puppet] - 10https://gerrit.wikimedia.org/r/651455 (owner: 10Elukey) [10:17:16] (03CR) 10Elukey: [V: 03+1 C: 03+2] cdh::hive: don't use TLS for local db connections [puppet] - 10https://gerrit.wikimedia.org/r/651455 (owner: 10Elukey) [10:20:24] (03CR) 10Jbond: [C: 03+2] enable-puppet: ensure we make sure puppet is not running before enableing [puppet] - 10https://gerrit.wikimedia.org/r/651454 (owner: 10Jbond) [10:20:26] (03PS1) 10Elukey: Failover analytics-hive.eqiad.wmnet to an-coord1001 [dns] - 10https://gerrit.wikimedia.org/r/651456 (https://phabricator.wikimedia.org/T268028) [10:21:43] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Move various support services for Cloud VPS 
currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (10aborrero) [10:23:36] (03PS1) 10Jbond: enable-puppet: initialise attempts [puppet] - 10https://gerrit.wikimedia.org/r/651457 [10:24:16] (03PS2) 10Jbond: enable-puppet: initialise attempts [puppet] - 10https://gerrit.wikimedia.org/r/651457 [10:24:23] (03CR) 10Jbond: [V: 03+2 C: 03+2] enable-puppet: initialise attempts [puppet] - 10https://gerrit.wikimedia.org/r/651457 (owner: 10Jbond) [10:24:44] (03CR) 10Elukey: [C: 03+2] Failover analytics-hive.eqiad.wmnet to an-coord1001 [dns] - 10https://gerrit.wikimedia.org/r/651456 (https://phabricator.wikimedia.org/T268028) (owner: 10Elukey) [10:32:25] 10Operations, 10observability, 10CAS-SSO: Sign-in links from Grafana dashboards don't work when not signed into SSO - https://phabricator.wikimedia.org/T269272 (10Volans) This is happening to me even with an SSO session already active fwiw. [10:34:34] (03PS1) 10Elukey: role::analytics_clustr::coordinator: move Presto to analytics-hive [puppet] - 10https://gerrit.wikimedia.org/r/651458 (https://phabricator.wikimedia.org/T257412) [10:34:51] 10Operations, 10observability, 10CAS-SSO: Sign-in links from Grafana dashboards don't work when not signed into SSO - https://phabricator.wikimedia.org/T269272 (10Volans) And to be clear, even if I just paste into the browser the URL: https://grafana-rw.wikimedia.org/d/000000377/host-overview?refresh=5m&orgId... 
[10:35:00] (03PS2) 10Elukey: role::analytics_cluster::coordinator: move Presto to analytics-hive [puppet] - 10https://gerrit.wikimedia.org/r/651458 (https://phabricator.wikimedia.org/T257412) [10:36:04] (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::coordinator: move Presto to analytics-hive [puppet] - 10https://gerrit.wikimedia.org/r/651458 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [10:39:09] (03PS1) 10Elukey: role::analytics_cluster::coordinator: set new hive metastore for presto [puppet] - 10https://gerrit.wikimedia.org/r/651459 [10:39:40] (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::coordinator: set new hive metastore for presto [puppet] - 10https://gerrit.wikimedia.org/r/651459 (owner: 10Elukey) [10:49:29] volans@cumin2001 test message [10:49:33] elukey: ^^ [10:49:37] \o/ [10:49:37] (03PS1) 10Odder: Add localised logos for the Madurese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/651462 (https://phabricator.wikimedia.org/T270693) [10:49:40] \o/ [10:49:42] niceeeee [10:51:51] !log volans@cumin2001 test SAL message from wmflib, please ignore [10:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:57] elukey: ^^^ :D [10:56:13] (03PS1) 10Volans: doc: improve documentation [software/pywmflib] - 10https://gerrit.wikimedia.org/r/651463 [10:57:49] (03CR) 10Elukey: [C: 03+1] doc: improve documentation [software/pywmflib] - 10https://gerrit.wikimedia.org/r/651463 (owner: 10Volans) [10:58:15] (03CR) 10Volans: [C: 03+2] doc: improve documentation [software/pywmflib] - 10https://gerrit.wikimedia.org/r/651463 (owner: 10Volans) [10:59:33] (03Merged) 10jenkins-bot: doc: improve documentation [software/pywmflib] - 10https://gerrit.wikimedia.org/r/651463 (owner: 10Volans) [10:59:58] (03PS1) 10Odder: Add localised logos for the Madurese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/651464 (https://phabricator.wikimedia.org/T270693) [11:00:56] (03CR) 10jerkins-bot: [V: 04-1] 
Add localised logos for the Madurese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/651464 (https://phabricator.wikimedia.org/T270693) (owner: 10Odder) [11:02:24] !log upload puppet 5.5.22 to stretch-wikimedia [11:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:51] (03Abandoned) 10Odder: Add localised logos for the Madurese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/651464 (https://phabricator.wikimedia.org/T270693) (owner: 10Odder) [11:05:39] (03PS1) 10Odder: Add localised logos for the Madurese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/651465 (https://phabricator.wikimedia.org/T270693) [11:10:55] !log upload puppet 5.5.22 to jessie-wikimedia [11:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:56] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "I just captured what is not being NAT'ed at this moment:" [puppet] - 10https://gerrit.wikimedia.org/r/651169 (https://phabricator.wikimedia.org/T209082) (owner: 10Arturo Borrero Gonzalez) [11:19:57] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "That list is:" [puppet] - 10https://gerrit.wikimedia.org/r/651169 (https://phabricator.wikimedia.org/T209082) (owner: 10Arturo Borrero Gonzalez) [11:42:23] (03CR) 10ArielGlenn: [V: 03+2 C: 03+2] make sure bz2 header is read when reading blocks backwards [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/645340 (https://phabricator.wikimedia.org/T269225) (owner: 10ArielGlenn) [11:43:07] (03CR) 10ArielGlenn: [C: 03+2] version 0.1.2 [debs/mwbzutils] - 10https://gerrit.wikimedia.org/r/651162 (owner: 10ArielGlenn) [11:52:30] !log Set db1151 to writable T269324 [11:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:34] T269324: Productionize x2 databases - https://phabricator.wikimedia.org/T269324 [11:56:00] 10Operations, 10DBA: Refresh and decommission db1074-db1095 (22 servers) - 
https://phabricator.wikimedia.org/T258361 (10Marostegui) [11:56:03] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10Marostegui) [12:08:27] PROBLEM - DPKG on db2113 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:10:11] 10Operations, 10Puppet, 10User-jbond: Update puppet infrastructure latest 5.5 version - https://phabricator.wikimedia.org/T265139 (10jbond) puppet 5.5.22 has now been deployed fleet wide. [12:10:23] PROBLEM - DPKG on elastic2050 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:10:38] looking ^^ [12:10:43] PROBLEM - DPKG on ms-be1028 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:11:52] jbond42: A puppet run on db2113 seems to have fixed it [12:11:53] PROBLEM - DPKG on elastic2054 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:12:37] marostegui: thanks looks like the puppet update had a few issues on some machines [12:12:39] PROBLEM - DPKG on kubernetes2012 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:12:43] RECOVERY - DPKG on db2113 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:13:01] PROBLEM - DPKG on an-worker1096 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:13:05] PROBLEM - DPKG on db1082 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:15:57] RECOVERY - DPKG on db1082 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:15:57] RECOVERY - DPKG on an-worker1096 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:15:57] 
RECOVERY - DPKG on ms-be1028 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:15:57] RECOVERY - DPKG on elastic2050 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:15:57] RECOVERY - DPKG on elastic2054 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:15:57] RECOVERY - DPKG on kubernetes2012 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:34:10] (03PS2) 10David Caro: wmcs.backups: move wikidumpparse to cloudvirt1025 [puppet] - 10https://gerrit.wikimedia.org/r/651175 [12:38:26] (03PS1) 10David Caro: wmcs.backup: Add a method to create a vm backup [puppet] - 10https://gerrit.wikimedia.org/r/651507 [12:38:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Great work! some comments inline." (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/651301 (https://phabricator.wikimedia.org/T267376) (owner: 10Bstorm) [13:45:15] (03CR) 10David Caro: [C: 03+2] wmcs.backups: move wikidumpparse to cloudvirt1025 [puppet] - 10https://gerrit.wikimedia.org/r/651175 (owner: 10David Caro) [14:11:28] (03CR) 10CDanis: [C: 03+2] "thanks! 
dbctl wikitech page also updated" [software/conftool] - 10https://gerrit.wikimedia.org/r/651269 (https://phabricator.wikimedia.org/T269324) (owner: 10CDanis) [14:13:45] (03Merged) 10jenkins-bot: dbctl: README: document section 'flavor' [software/conftool] - 10https://gerrit.wikimedia.org/r/651269 (https://phabricator.wikimedia.org/T269324) (owner: 10CDanis) [14:23:03] 10Operations, 10Epic, 10cloud-services-team (Kanban): CloudVPS: network architecture - https://phabricator.wikimedia.org/T209460 (10aborrero) [14:41:25] (03PS1) 10David Caro: wmcs.backup: Remove all dangling snapshots [puppet] - 10https://gerrit.wikimedia.org/r/651537 [15:01:08] (03CR) 10Razzi: [C: 03+2] role::analytics_cluster::ui::dashboards: Add superset to an-tool1010 [puppet] - 10https://gerrit.wikimedia.org/r/650179 (https://phabricator.wikimedia.org/T268219) (owner: 10Razzi) [15:05:19] (03CR) 10Razzi: [C: 03+2] superset: Switch traffic from analytics-tool1004 to an-tool1010 [puppet] - 10https://gerrit.wikimedia.org/r/650522 (owner: 10Razzi) [15:05:20] There is a small change we would like to push before the holidays: https://gerrit.wikimedia.org/r/c/mediawiki/services/wikifeeds/+/650185. We have already reached out to thcipriani and it looks like the change is small enough to be pushed over the holiday deployment freeze. [15:06:27] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:06:45] 10Operations, 10Traffic, 10netops: Was unable to connect (esams) for about 20 minutes - https://phabricator.wikimedia.org/T270664 (10CDanis) 05Open→03Resolved a:03CDanis Thanks for the report. I took a look and found no evidence of larger issues at this time: * no increase in automated Network Error L... [15:08:32] !log jgiannelos@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . 
[15:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:02] !log jgiannelos@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [15:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:20] nemo-yiannis: thanks :) [15:23:37] !log jgiannelos@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [15:23:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:58] (03CR) 10Bstorm: wikireplicas: set up VM haproxy layer (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/651301 (https://phabricator.wikimedia.org/T267376) (owner: 10Bstorm) [15:48:34] (03PS10) 10CDanis: Pre-review: initial commit [software/klaxon] - 10https://gerrit.wikimedia.org/r/651238 [15:49:22] (03PS1) 10Bstorm: cloud nfs: switch to the 10G interface for booting labstore1004 [puppet] - 10https://gerrit.wikimedia.org/r/651545 (https://phabricator.wikimedia.org/T266202) [16:01:06] PROBLEM - Check systemd state on analytics-tool1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:01:08] PROBLEM - superset on analytics-tool1004 is CRITICAL: connect to address 10.64.36.116 and port 9080: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Superset [16:04:08] this is expected --^ [16:04:11] razzi: ---^ [16:04:28] (so you can test if the icinga ack works now :) [16:15:08] (03PS2) 10David Caro: wmcs.backup: Remove all dangling snapshots [puppet] - 10https://gerrit.wikimedia.org/r/651537 [16:15:10] (03PS1) 10David Caro: wmcs.backup: Add a way to remove old backups and snapshots [puppet] - 10https://gerrit.wikimedia.org/r/651550 [16:15:16] !log downtimed and stopped puppet on labstore1004 and labstore1005 for failover T266202 [16:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:21] T266202: Move or recable labstore1004 to 10Gbps rack (if needed) and ethernet - https://phabricator.wikimedia.org/T266202 [16:20:54] elukey: ah when I saw that I quickly downtimed it for 24h, will have to try icinga ack on something else [16:27:26] PROBLEM - Check systemd state on ms-be1026 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:38:17] !log mforns@deploy1001 Started deploy [analytics/refinery@21c0c89]: Regular analytics weekly train [analytics/refinery@Ie7bce02179547ee4c6756d52f9956f492c5b4df6] [16:39:38] mforns: stashbot is dead [16:39:52] I'm guessing it's known as nfs maint is ongoing [16:40:15] RhinosF1: aha, thanks for the heads up [16:41:32] mforns: it just joined back if you wanna log again [16:42:24] ok! [16:44:23] RhinosF1: I tried to !log from the wikimedia-analytics channel, and it didn't work [16:44:36] mforns: must be just not talking but online now [16:44:49] ok! 
thanks [16:44:51] Will have to wait until b_storm finishes [16:45:08] "Unfortunately, the failover has not gone well due to some upgrades to the secondary it seems. I apologize for the service disruptions, and we are trying to bring everything back as fast as possible." [16:45:08] np! [16:45:15] That just went to the cloud list [16:47:32] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1026 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:48:29] !log restarted ferm on ms-be1026 (failed with DNS query for 'ms-be1055.eqiad.wmnet' failed: query timed out ) [16:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:42] RECOVERY - Check systemd state on ms-be1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:49:00] PROBLEM - k8s API server requests latencies on neon is CRITICAL: instance=10.64.0.40 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:50:28] !log mforns@deploy1001 Finished deploy [analytics/refinery@21c0c89]: Regular analytics weekly train [analytics/refinery@Ie7bce02179547ee4c6756d52f9956f492c5b4df6] (duration: 12m 11s) [16:50:40] RECOVERY - k8s API server requests latencies on neon is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:51:05] !log mforns@deploy1001 Started deploy [analytics/refinery@21c0c89] (thin): Regular analytics weekly train THIN [analytics/refinery@Ie7bce02179547ee4c6756d52f9956f492c5b4df6] [16:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:13] !log mforns@deploy1001 Finished deploy [analytics/refinery@21c0c89] (thin): Regular analytics weekly train THIN [analytics/refinery@Ie7bce02179547ee4c6756d52f9956f492c5b4df6] (duration: 00m 08s) [16:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:20] 10Operations, 10ops-eqiad, 10Data-Services, 10Patch-For-Review, 10cloud-services-team (Hardware): Move or recable labstore1004 to 10Gbps rack (if needed) and ethernet - https://phabricator.wikimedia.org/T266202 (10Bstorm) Ok, that has gone very badly. Working to fix. [16:56:59] (03PS1) 10Mforns: analytics::refinery::job::refine: Bump up refinery_version to 142 [puppet] - 10https://gerrit.wikimedia.org/r/651559 [17:18:00] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1026 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:24:45] 10Operations, 10Traffic, 10netops: Was unable to connect (esams) for about 20 minutes - https://phabricator.wikimedia.org/T270664 (10AlexisJazz) >>! In T270664#6708236, @CDanis wrote: > Thanks for the report. > > I took a look and found no evidence of larger issues at this time: > * no increase in automated... 
[17:27:43] !log shutting down labstore1004 in preparation for move and reimage
[17:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:29:52] PROBLEM - Host mw1274.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:30:10] PROBLEM - Host mw1272.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:30:16] PROBLEM - Host mw1273.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:31:13] (PS1) Andrew Bogott: labstore1004: move to 'insetup' role [puppet] - https://gerrit.wikimedia.org/r/651564 (https://phabricator.wikimedia.org/T266202)
[17:31:28] PROBLEM - Host mw1279.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:31:32] PROBLEM - Host mw1275.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:31:59] (CR) Bstorm: [C: +2] labstore1004: move to 'insetup' role [puppet] - https://gerrit.wikimedia.org/r/651564 (https://phabricator.wikimedia.org/T266202) (owner: Andrew Bogott)
[17:32:01] (CR) Arturo Borrero Gonzalez: [C: +1] labstore1004: move to 'insetup' role [puppet] - https://gerrit.wikimedia.org/r/651564 (https://phabricator.wikimedia.org/T266202) (owner: Andrew Bogott)
[17:33:04] (CR) Bstorm: [C: +2] cloud nfs: switch to the 10G interface for booting labstore1004 [puppet] - https://gerrit.wikimedia.org/r/651545 (https://phabricator.wikimedia.org/T266202) (owner: Bstorm)
[17:33:51] (PS1) Elukey: Force the Hive Servers to use their local Metastores [puppet] - https://gerrit.wikimedia.org/r/651565
[17:35:26] RECOVERY - Host mw1274.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms
[17:35:44] RECOVERY - Host mw1272.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms
[17:35:48] RECOVERY - Host mw1273.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms
[17:37:00] RECOVERY - Host mw1279.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms
[17:37:04] RECOVERY - Host mw1275.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms
[17:37:11] (PS2) Elukey: Force the Hive Servers to use their local Metastores [puppet] - https://gerrit.wikimedia.org/r/651565
[17:39:08] (CR) Elukey: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27232/console" [puppet] - https://gerrit.wikimedia.org/r/651565 (owner: Elukey)
[17:51:56] (CR) Elukey: [V: +1 C: +2] Force the Hive Servers to use their local Metastores [puppet] - https://gerrit.wikimedia.org/r/651565 (owner: Elukey)
[17:56:54] (PS1) Bstorm: cloud nfs: get primary interface from puppet and create mountpoints [puppet] - https://gerrit.wikimedia.org/r/651568 (https://phabricator.wikimedia.org/T266202)
[18:06:16] RECOVERY - MegaRAID on db1101 is OK: OK: optimal, 1 logical, 10 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:07:29] (CR) Elukey: [C: +2] analytics::refinery::job::refine: Bump up refinery_version to 142 [puppet] - https://gerrit.wikimedia.org/r/651559 (owner: Mforns)
[18:11:28] (PS1) Elukey: Point analytics-hive.eqiad.wmnet to an-coord1002 [dns] - https://gerrit.wikimedia.org/r/651569
[18:13:13] (CR) Elukey: [C: +2] Point analytics-hive.eqiad.wmnet to an-coord1002 [dns] - https://gerrit.wikimedia.org/r/651569 (owner: Elukey)
[18:25:28] Operations, Graphoid, serviceops, Platform Engineering (Icebox): The graphs on a 'info' special page are no longer displayed - https://phabricator.wikimedia.org/T270720 (Iniquity)
[18:27:09] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on labstore1004.eqiad.wmnet with reason: REIMAGE
[18:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:29:08] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on labstore1004.eqiad.wmnet with reason: REIMAGE
[18:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:30:43] (PS1) Andrew Bogott: Revert "labstore1004: move to 'insetup' role" [puppet] - https://gerrit.wikimedia.org/r/651496
[18:31:45] Operations, MW-on-K8s, serviceops, Release Pipeline (Blubber), Release-Engineering-Team (Pipeline): Deployment infrastructure for PHP microservices - https://phabricator.wikimedia.org/T261369 (Legoktm)
[18:34:12] (CR) Bstorm: [C: +1] "May want to review https://gerrit.wikimedia.org/r/c/operations/puppet/+/651568 first. Puppet is disabled on 1005, so it would be safe enou" [puppet] - https://gerrit.wikimedia.org/r/651496 (owner: Andrew Bogott)
[18:37:22] (CR) Andrew Bogott: [C: +1] cloud nfs: get primary interface from puppet and create mountpoints [puppet] - https://gerrit.wikimedia.org/r/651568 (https://phabricator.wikimedia.org/T266202) (owner: Bstorm)
[18:43:29] (PS2) Bstorm: cloud nfs: get primary interface from puppet and create mountpoints [puppet] - https://gerrit.wikimedia.org/r/651568 (https://phabricator.wikimedia.org/T266202)
[18:48:50] (CR) Andrew Bogott: [C: +2] cloud nfs: get primary interface from puppet and create mountpoints [puppet] - https://gerrit.wikimedia.org/r/651568 (https://phabricator.wikimedia.org/T266202) (owner: Bstorm)
[18:49:03] (CR) Andrew Bogott: [C: +2] Revert "labstore1004: move to 'insetup' role" [puppet] - https://gerrit.wikimedia.org/r/651496 (owner: Andrew Bogott)
[18:49:53] (PS1) Ahmon Dancy: 0.3.1: Use securityContext instead of main_app.rootImage [deployment-charts] - https://gerrit.wikimedia.org/r/651571
[18:56:47] (CR) RLazarus: "nitpicks ahoy!" (14 comments) [software/klaxon] - https://gerrit.wikimedia.org/r/651238 (owner: CDanis)
[18:57:22] (PS1) Bstorm: labstore: remove excess variable from hiera [puppet] - https://gerrit.wikimedia.org/r/651575 (https://phabricator.wikimedia.org/T266202)
[18:59:32] (CR) Bstorm: [C: +2] labstore: remove excess variable from hiera [puppet] - https://gerrit.wikimedia.org/r/651575 (https://phabricator.wikimedia.org/T266202) (owner: Bstorm)
[19:01:23] (PS11) CDanis: Pre-review: initial commit [software/klaxon] - https://gerrit.wikimedia.org/r/651238
[19:11:25] Operations, ops-esams, DC-Ops, Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (wiki_willy)
[19:32:15] !log restarting gerrit to pick up config change in gitiles for T269300
[19:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:34:22] T269300: Modify readme.md for gerrit markdown - https://phabricator.wikimedia.org/T269300
[19:39:52] (PS12) CDanis: Pre-review: initial commit [software/klaxon] - https://gerrit.wikimedia.org/r/651238
[19:40:11] (CR) CDanis: "Thanks for the helpful comments! Most have been resolved." (14 comments) [software/klaxon] - https://gerrit.wikimedia.org/r/651238 (owner: CDanis)
[19:41:07] Operations, ops-eqiad, DBA: Degraded RAID on db1101 - https://phabricator.wikimedia.org/T270571 (Marostegui) Open→Resolved Not sure who changed the disk, but thank you either John or Chris! ` 19:06:25 <+icinga-wm> RECOVERY - MegaRAID on db1101 is OK: OK: optimal, 1 logical, 10 physical, Write...
[19:45:47] Operations, Release-Engineering-Team-TODO, serviceops, Patch-For-Review, and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Dzahn) >>! In T245757#6695685, @hnowlan wrote: > mw1265 is now reimaged and pooled with weight 5 (as opposed...
[19:48:02] Operations, Release-Engineering-Team-TODO, serviceops, Patch-For-Review, and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Dzahn)
[19:50:30] (PS1) Bstorm: cloud nfs: make sure the /srv/scratch dir is there and another fix [puppet] - https://gerrit.wikimedia.org/r/651608 (https://phabricator.wikimedia.org/T266202)
[19:51:17] Operations, MediaWiki-General, Performance-Team, serviceops-radar, and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (Krinkle) @Marostegui Regarding schema changes, I'd say very unlikely. It is a key-value store that hasn't...
[19:52:16] Operations, ops-eqiad, Data-Services, Patch-For-Review, cloud-services-team (Hardware): Move or recable labstore1004 to 10Gbps rack (if needed) and ethernet - https://phabricator.wikimedia.org/T266202 (Bstorm)
[19:58:46] Operations, Graphoid, serviceops, Platform Engineering (Icebox): The graphs on a 'info' special page are no longer displayed - https://phabricator.wikimedia.org/T270720 (Jdforrester-WMF) Yup, this is an inescapable consequence of client-side-only rendering. For now we should switch off the graphs...
[19:59:03] Operations, Graphoid, PageViewInfo, serviceops, Platform Engineering (Icebox): The graphs on a 'info' special page are no longer displayed - https://phabricator.wikimedia.org/T270720 (Jdforrester-WMF)
[20:00:20] (PS1) Nikki Nikkhoui: Enable safehtml in gitiles [puppet] - https://gerrit.wikimedia.org/r/651614 (https://phabricator.wikimedia.org/T269300)
[20:06:31] Operations, Graphoid, PageViewInfo, serviceops, Platform Engineering (Icebox): The graphs on a 'info' special page are no longer displayed - https://phabricator.wikimedia.org/T270720 (Jdforrester-WMF) Hmm, no, it's rendering for me? It's all client-side already.
[20:07:54] Operations, Graphoid, PageViewInfo, serviceops, Platform Engineering (Icebox): The graphs on a 'info' special page are no longer displayed - https://phabricator.wikimedia.org/T270720 (Jdforrester-WMF) {F33970350} and {F33970348} for example.
[20:21:58] Operations, Graphoid, PageViewInfo, serviceops, Platform Engineering (Icebox): The graphs on a 'info' special page are no longer displayed - https://phabricator.wikimedia.org/T270720 (Iniquity) >>! In T270720#6708984, @Jdforrester-WMF wrote: > Hmm, no, it's rendering for me? It's all client-s...
[20:23:56] (PS13) CDanis: Initial commit of Klaxon, a webapp for trusted users to page SRE [software/klaxon] - https://gerrit.wikimedia.org/r/651238 (https://phabricator.wikimedia.org/T270324)
[20:23:56] Operations, Graphoid, PageViewInfo, serviceops, Platform Engineering (Icebox): The graphs on a 'info' special page are no longer displayed - https://phabricator.wikimedia.org/T270720 (Iniquity)
[20:35:03] Operations, Graphoid, PageViewInfo, serviceops, Platform Engineering (Icebox): The graphs on a 'info' special page are no longer displayed - https://phabricator.wikimedia.org/T270720 (Jdforrester-WMF) It looks like `Шаблон:График просмотров` is inserting a graph callback, which isn't supporte...
[20:39:18] PROBLEM - Logstash rate of ingestion percent change compared to yesterday #o11y on alert1001 is CRITICAL: 238.9 ge 210 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[20:45:27] Operations, MediaWiki-General, Performance-Team, serviceops-radar, and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (Krinkle)
[20:45:52] Operations, MediaWiki-General, Performance-Team, serviceops-radar, and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (Krinkle)
[20:46:10] Operations, MediaWiki-General, Performance-Team, serviceops-radar, and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (Krinkle)
[20:55:30] (CR) RLazarus: [C: +1] "LGTM! 🚀" (4 comments) [software/klaxon] - https://gerrit.wikimedia.org/r/651238 (https://phabricator.wikimedia.org/T270324) (owner: CDanis)
[21:00:57] (Abandoned) Nikki Nikkhoui: Enable safehtml in gitiles [puppet] - https://gerrit.wikimedia.org/r/651614 (https://phabricator.wikimedia.org/T269300) (owner: Nikki Nikkhoui)
[21:03:49] Operations, ops-codfw, DC-Ops: (Need By: 2021-01-30) rack/setup/install frqueue2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T269481 (Papaul)
[21:04:17] (CR) CDanis: [V: +2 C: +2] "thanks again!" (3 comments) [software/klaxon] - https://gerrit.wikimedia.org/r/651238 (https://phabricator.wikimedia.org/T270324) (owner: CDanis)
[21:08:42] (PS1) CDanis: klaxon: address comments from ba13da2 [software/klaxon] - https://gerrit.wikimedia.org/r/651625
[21:09:10] (CR) CDanis: [V: +2 C: +2] klaxon: address comments from ba13da2 [software/klaxon] - https://gerrit.wikimedia.org/r/651625 (owner: CDanis)
[21:10:07] Operations, ops-codfw: RMA failed codfw C7 switch - WMF6114 - https://phabricator.wikimedia.org/T267950 (Papaul) Received the new switch {F33970453}
[21:23:56] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/27237/" [puppet] - https://gerrit.wikimedia.org/r/651158 (https://phabricator.wikimedia.org/T245757) (owner: Muehlenhoff)
[21:24:48] (CR) Dzahn: [C: +2] "Thanks! it's also installed on mwmaint by the way: https://debmonitor.wikimedia.org/packages/php-readline" [puppet] - https://gerrit.wikimedia.org/r/651158 (https://phabricator.wikimedia.org/T245757) (owner: Muehlenhoff)
[21:26:48] !log upgrading wikitech-static: mediawiki to 1.35.1 and general apt upgrade
[21:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:31:15] !log deploy1002/deploy2002 - apt-get remove --purge php-readline and let puppet reinstall it (7.2 vs 7.3 after gerrit 651158) T265963
[21:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:31:18] T265963: Replace production deployment servers and update them to Buster - https://phabricator.wikimedia.org/T265963
[21:32:32] (CR) Dzahn: "on deploy1001 the repo gets added but 2:7.2+69+0~20190215163918.14+stretch~1.gbpfa617b+wmf1 is installed before and after. on deploy1002/" [puppet] - https://gerrit.wikimedia.org/r/651158 (https://phabricator.wikimedia.org/T245757) (owner: Muehlenhoff)
[21:34:08] (CR) Dzahn: "2:7.2+69+0~20190215163918.14+stretch~1.gbpfa617b+wmf1 is installed on all 4 deployment servers now" [puppet] - https://gerrit.wikimedia.org/r/651158 (https://phabricator.wikimedia.org/T245757) (owner: Muehlenhoff)
[21:45:19] (PS3) Dzahn: aptrepo: replace cron with systemd timer [puppet] - https://gerrit.wikimedia.org/r/641315 (https://phabricator.wikimedia.org/T265138)
[21:47:18] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 134 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:50:34] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 30 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:53:01] (CR) Dzahn: [C: +2] aptrepo: replace cron with systemd timer [puppet] - https://gerrit.wikimedia.org/r/641315 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn)
[21:56:08] (CR) Dzahn: "apt2001 - nothing because the sync is only on primary servers - on apt1001 cron removed, timer added" [puppet] - https://gerrit.wikimedia.org/r/641315 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn)
[21:57:50] !log apt1001 - sudo systemctl status rsync-aptrepo-apt2001.wikimedia.org.service - confirmed timer job is working like the cron before
[21:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:58:48] (CR) Dzahn: "Also manually ran the service and confirmed it ran and finished fine." [puppet] - https://gerrit.wikimedia.org/r/641315 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn)
[21:59:53] (CR) Dzahn: [C: +2] "only moves an existing line around and adds a comment (for now)" [dns] - https://gerrit.wikimedia.org/r/650628 (https://phabricator.wikimedia.org/T247653) (owner: Dzahn)
[22:00:23] (PS2) Dzahn: add doc to misc services with multiple backends [dns] - https://gerrit.wikimedia.org/r/650628 (https://phabricator.wikimedia.org/T247653)
[22:04:58] (PS1) Razzi: Add cookbook for rebooting druid nodes [cookbooks] - https://gerrit.wikimedia.org/r/651636 (https://phabricator.wikimedia.org/T269596)
[22:05:51] (CR) Razzi: "First pass at a cookbook - I'm sure there are many issues - let me know what you think, and if / how this can be tested!" [cookbooks] - https://gerrit.wikimedia.org/r/651636 (https://phabricator.wikimedia.org/T269596) (owner: Razzi)
[22:06:21] (CR) jerkins-bot: [V: -1] Add cookbook for rebooting druid nodes [cookbooks] - https://gerrit.wikimedia.org/r/651636 (https://phabricator.wikimedia.org/T269596) (owner: Razzi)
[22:15:38] (CR) Bstorm: [C: +2] cloud nfs: make sure the /srv/scratch dir is there and another fix [puppet] - https://gerrit.wikimedia.org/r/651608 (https://phabricator.wikimedia.org/T266202) (owner: Bstorm)
[22:18:44] (PS2) Razzi: Add cookbook for rebooting druid nodes [cookbooks] - https://gerrit.wikimedia.org/r/651636 (https://phabricator.wikimedia.org/T269596)
[22:38:26] (PS1) Ladsgroup: cache: Migrate hiera() to lookup() and setting datatype [puppet] - https://gerrit.wikimedia.org/r/651640 (https://phabricator.wikimedia.org/T209953)
[22:41:08] (PS2) Ladsgroup: cache: Migrate hiera() to lookup() and setting datatype [puppet] - https://gerrit.wikimedia.org/r/651640 (https://phabricator.wikimedia.org/T209953)
[22:51:10] (CR) Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/27238/" [puppet] - https://gerrit.wikimedia.org/r/651640 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[22:54:04] Operations, Traffic, netops, Patch-For-Review: Create Generalised blocking stratagy - https://phabricator.wikimedia.org/T270618 (jbond)
[22:56:41] Operations, Traffic, netops, Patch-For-Review: Create Generalised blocking stratagy - https://phabricator.wikimedia.org/T270618 (jbond)
[22:57:19] Operations, Traffic, netops, Patch-For-Review: Create Generalised blocking stratagy - https://phabricator.wikimedia.org/T270618 (jbond)
[22:57:53] win 68
[23:03:36] (PS1) Dzahn: Merge branch 'production' of ssh://gerrit.wikimedia.org:29418/operations/puppet into production [puppet] - https://gerrit.wikimedia.org/r/651643
[23:03:38] (PS1) Dzahn: aptrepo: remove absented cron code, replaced by timer [puppet] - https://gerrit.wikimedia.org/r/651644
[23:04:07] (CR) jerkins-bot: [V: -1] Merge branch 'production' of ssh://gerrit.wikimedia.org:29418/operations/puppet into production [puppet] - https://gerrit.wikimedia.org/r/651643 (owner: Dzahn)
[23:04:26] (Abandoned) Dzahn: Merge branch 'production' of ssh://gerrit.wikimedia.org:29418/operations/puppet into production [puppet] - https://gerrit.wikimedia.org/r/651643 (owner: Dzahn)
[23:04:34] (PS2) Dzahn: aptrepo: remove absented cron code, replaced by timer [puppet] - https://gerrit.wikimedia.org/r/651644
[23:05:11] (PS3) Dzahn: aptrepo: remove absented cron code, replaced by timer [puppet] - https://gerrit.wikimedia.org/r/651644 (https://phabricator.wikimedia.org/T265138)
[23:05:47] Operations, ops-eqiad, Data-Services, Patch-For-Review, cloud-services-team (Hardware): Move or recable labstore1004 to 10Gbps rack (if needed) and ethernet - https://phabricator.wikimedia.org/T266202 (Andrew)
[23:08:12] (CR) Dzahn: [C: +2] aptrepo: remove absented cron code, replaced by timer [puppet] - https://gerrit.wikimedia.org/r/651644 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn)
[23:10:32] Operations, Mail, Epic: Move most (all?) exim personal aliases to WMF ITS - https://phabricator.wikimedia.org/T122144 (Dzahn) Stalled→Open Will be resolved Jan 1st
[23:58:51] Operations, Beta-Cluster-Infrastructure, Release-Engineering-Team, observability, User-DannyS712: Beta cluster logstash down - https://phabricator.wikimedia.org/T268200 (DannyS712) >>! In T268200#6708351, @colewhite wrote: > Hi @DannyS712 > > As of this writing, logs are flowing in deployme...