[00:10:23] (PS1) CRusnov: acme_chief: Add netbox-dev keys [puppet] - https://gerrit.wikimedia.org/r/606040 (https://phabricator.wikimedia.org/T253140) [00:10:39] (CR) CRusnov: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/606040 (https://phabricator.wikimedia.org/T253140) (owner: CRusnov) [00:11:57] Operations, ops-codfw, DC-Ops: Decomission oresrdb2002.codfw.wmnet - https://phabricator.wikimedia.org/T254240 (Papaul) [00:12:09] (PS2) CRusnov: acme_chief: Add netbox-dev keys [puppet] - https://gerrit.wikimedia.org/r/606040 (https://phabricator.wikimedia.org/T253140) [00:18:06] PROBLEM - dump of m2 in eqiad on db2093 is CRITICAL: dump for m2 at eqiad taken more than 8 days ago: Most recent backup 2020-06-09 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [00:30:29] (PS1) Andrew Bogott: wmcs galera: add a second icinga check [puppet] - https://gerrit.wikimedia.org/r/606042 (https://phabricator.wikimedia.org/T242455) [00:30:41] (CR) jerkins-bot: [V: -1] wmcs galera: add a second icinga check [puppet] - https://gerrit.wikimedia.org/r/606042 (https://phabricator.wikimedia.org/T242455) (owner: Andrew Bogott) [00:32:25] (PS2) Andrew Bogott: wmcs galera: add a second icinga check [puppet] - https://gerrit.wikimedia.org/r/606042 (https://phabricator.wikimedia.org/T242455) [00:37:54] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 22562104 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:39:48] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 3080 and 31 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:50:42] (PS3) Andrew Bogott: wmcs galera: add a second icinga check [puppet] - https://gerrit.wikimedia.org/r/606042 (https://phabricator.wikimedia.org/T242455) [00:51:02] 
PROBLEM - Check systemd state on an-launcher1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:53:57] (PS1) DannyS712: Do not return internal edit status from EditPage [core] (wmf/1.35.0-wmf.37) - https://gerrit.wikimedia.org/r/606047 (https://phabricator.wikimedia.org/T255177) [00:56:33] (PS1) Krinkle: mediawiki,logstash: Update type:parsoid-php -> type:mediawiki [puppet] - https://gerrit.wikimedia.org/r/606049 (https://phabricator.wikimedia.org/T255627) [01:02:17] (CR) Subramanya Sastry: "I am a +1 in principle." [mediawiki-config] - https://gerrit.wikimedia.org/r/606038 (https://phabricator.wikimedia.org/T255627) (owner: Krinkle) [01:08:24] (CR) Krinkle: "Those dashboards panels would change from type:parsoid-php to servergroup:php. I can help with that." [mediawiki-config] - https://gerrit.wikimedia.org/r/606038 (https://phabricator.wikimedia.org/T255627) (owner: Krinkle) [01:10:09] (CR) Krinkle: "Overall, filtering is for individual team/feature dashboards. 
There are no "known bug" filters on the others, so that's totally fine to ad" [mediawiki-config] - https://gerrit.wikimedia.org/r/606038 (https://phabricator.wikimedia.org/T255627) (owner: Krinkle) [01:20:06] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 56 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [01:25:56] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 48 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [01:26:18] can someone check the trace for `17775ae6-a6a5-4e42-831c-3b8a15382cc5`? T255633 [01:26:19] T255633: Error when examining edit in Abuse Filter management interface - https://phabricator.wikimedia.org/T255633 [01:43:36] !log restart elasticsearch on logstash1011 [01:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:13:19] (CR) Andrew Bogott: [C: +1] cloud apt: set periodic autocleaning to once a week [puppet] - https://gerrit.wikimedia.org/r/605979 (https://phabricator.wikimedia.org/T127374) (owner: Bstorm) [02:20:00] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37 [02:20:58] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is CRITICAL: 107.8 gt 100 
https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37 [03:01:08] (CR) Subramanya Sastry: [C: +1] logging: Fold Parsoid into type:mediawiki, add 'servergroup:' instead [mediawiki-config] - https://gerrit.wikimedia.org/r/606038 (https://phabricator.wikimedia.org/T255627) (owner: Krinkle) [03:01:23] (CR) Subramanya Sastry: [C: +1] "Thanks!" [mediawiki-config] - https://gerrit.wikimedia.org/r/606038 (https://phabricator.wikimedia.org/T255627) (owner: Krinkle) [03:06:02] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={atlas_exporter,swagger_check_cxserver_cluster_eqiad} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:07:52] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:31:30] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 2 (netbox-dev2001, ...), Fresh: 96 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [04:17:10] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:19:00] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:19:26] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:38:27] !log marostegui@cumin2001 dbctl commit (dc=all): 'Depool db1129', diff saved to https://phabricator.wikimedia.org/P11547 and previous config saved to /var/cache/conftool/dbconfig/20200617-043826-marostegui.json [04:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:42:09] (CR) Marostegui: [C: +2] wikireplica_analytics: Increase query killer time [puppet] - https://gerrit.wikimedia.org/r/605902 (owner: Marostegui) [04:44:12] !log Reload pt-kill on labsdb analytics host to pick up new config [04:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:45:06] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 51 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [04:49:43] (PS1) Marostegui: mariadb: Reimage db1090 to Buster and 10.4 [puppet] - https://gerrit.wikimedia.org/r/606065 (https://phabricator.wikimedia.org/T250666) [04:50:24] (CR) Marostegui: [C: +2] mariadb: Reimage db1090 to Buster and 10.4 [puppet] - https://gerrit.wikimedia.org/r/606065 (https://phabricator.wikimedia.org/T250666) (owner: Marostegui) [04:50:54] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 47 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [04:51:05] !log marostegui@cumin1001 dbctl 
commit (dc=all): 'Depool db1090:3312, db1090:3317 for reimage', diff saved to https://phabricator.wikimedia.org/P11548 and previous config saved to /var/cache/conftool/dbconfig/20200617-045105-marostegui.json [04:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:59:11] (PS1) Marostegui: dbproxy1008: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/606066 (https://phabricator.wikimedia.org/T255406) [04:59:40] (CR) Marostegui: [C: +2] dbproxy1008: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/606066 (https://phabricator.wikimedia.org/T255406) (owner: Marostegui) [04:59:58] Operations, ops-eqiad, DBA, decommission-hardware, Patch-For-Review: decommission dbproxy1008.eqiad.wmnet - https://phabricator.wikimedia.org/T255406 (Marostegui) [05:02:56] (PS1) Marostegui: dbproxy1013: Reimage to Buster [puppet] - https://gerrit.wikimedia.org/r/606067 (https://phabricator.wikimedia.org/T255408) [05:03:00] (PS1) DannyS712: Revert "Hard deprecate the `TitleMoveCompleting` hook" [core] (wmf/1.35.0-wmf.37) - https://gerrit.wikimedia.org/r/606068 (https://phabricator.wikimedia.org/T255608) [05:03:09] (PS1) DannyS712: Revert "Hooks: Use PageMoveComplete instead of TitleMoveCompleting" [extensions/Flow] (wmf/1.35.0-wmf.37) - https://gerrit.wikimedia.org/r/606069 (https://phabricator.wikimedia.org/T255608) [05:03:33] (CR) Marostegui: [C: +2] dbproxy1013: Reimage to Buster [puppet] - https://gerrit.wikimedia.org/r/606067 (https://phabricator.wikimedia.org/T255408) (owner: Marostegui) [05:08:05] !log marostegui@cumin2001 START - Cookbook sre.hosts.downtime [05:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:10:34] !log marostegui@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [05:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:19:17] !log marostegui@cumin2001 dbctl commit 
(dc=all): 'Slowly repool db1090:3312, db1090:3317', diff saved to https://phabricator.wikimedia.org/P11549 and previous config saved to /var/cache/conftool/dbconfig/20200617-051916-marostegui.json [05:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:20:28] (PS1) Marostegui: db1090: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/606070 [05:21:03] (CR) Marostegui: [C: +2] db1090: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/606070 (owner: Marostegui) [05:22:03] !log marostegui@cumin2001 dbctl commit (dc=all): 'Slowly repool db1090:3312, db1090:3317', diff saved to https://phabricator.wikimedia.org/P11550 and previous config saved to /var/cache/conftool/dbconfig/20200617-052202-marostegui.json [05:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:27:33] (PS1) Marostegui: mariadb: Reimage db1113 to Buster [puppet] - https://gerrit.wikimedia.org/r/606072 (https://phabricator.wikimedia.org/T250666) [05:28:10] !log marostegui@cumin2001 dbctl commit (dc=all): 'Slowly repool db1090:3312, db1090:3317', diff saved to https://phabricator.wikimedia.org/P11551 and previous config saved to /var/cache/conftool/dbconfig/20200617-052809-marostegui.json [05:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:29:14] !log Deploy schema change on s7 codfw (lag will appear) - T250066 [05:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:29:18] T250066: text table still has old_* fields and indexes on some hosts - https://phabricator.wikimedia.org/T250066 [05:30:25] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs 
https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37 [05:34:22] !log marostegui@cumin2001 dbctl commit (dc=all): 'Fully repool db1090:3312, db1090:3317', diff saved to https://phabricator.wikimedia.org/P11552 and previous config saved to /var/cache/conftool/dbconfig/20200617-053421-marostegui.json [05:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:42:43] !log volker-e@deploy1001 Started deploy [design/style-guide@37c67dd]: Deploy design/style-guide: [05:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:42:47] !log volker-e@deploy1001 Finished deploy [design/style-guide@37c67dd]: Deploy design/style-guide: (duration: 00m 05s) [05:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:45:13] !log clean up old systemd timer config on an-coord1001 (came up after the last reboot) [05:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:45:34] !log reboot stat1007/8 for kernel upgrades [05:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:45:51] RECOVERY - Check systemd state on an-launcher1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:46:11] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:47:39] PROBLEM - Host stat1008 is DOWN: PING CRITICAL - Packet loss = 100% [05:49:11] PROBLEM - Host stat1007 is DOWN: PING CRITICAL - Packet loss = 100% [05:49:25] RECOVERY - Host stat1007 is UP: PING WARNING - Packet loss = 75%, RTA = 0.46 ms [05:50:08] elukey: I guess it rebooted itself? 
^ [05:51:09] RECOVERY - Host stat1008 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [05:52:29] PROBLEM - Check systemd state on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:53:39] PROBLEM - puppet last run on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [05:54:15] PROBLEM - configured eth on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [05:55:03] PROBLEM - MD RAID on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [05:55:27] marostegui: nono I did it, but something probably is not going in the right direction [05:55:30] (See sal) [06:00:34] elukey: aaah sorry yeah, missed it [06:01:01] np! 
Should recover in few sec [06:01:02] in theory [06:03:15] !log reboot an-conf100[1-3] for kernel upgrades [06:03:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:31] ah no sorry I downtimed it, so it will not show recoveries [06:03:37] anyway, stat1007 is up now :) [06:04:17] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:05:31] RECOVERY - MD RAID on stat1007 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:10:16] (PS2) Tim Starling: Enable PoolCounter fastStale mode on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/605437 [06:10:54] (CR) Tim Starling: [C: +2] Enable PoolCounter fastStale mode on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/605437 (owner: Tim Starling) [06:11:42] (Merged) jenkins-bot: Enable PoolCounter fastStale mode on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/605437 (owner: Tim Starling) [06:15:47] !log tstarling@deploy1001 Synchronized wmf-config/PoolCounterSettings.php: test fast stale mode on testwiki T250248 (duration: 01m 17s) [06:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:15:51] T250248: Fast stale ParserCache responses on PoolCounter contention - https://phabricator.wikimedia.org/T250248 [06:16:28] Operations, serviceops, Patch-For-Review, User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (elukey) Note after checking slab distribution on the gutter pool. The last slab sizes seem to not follow the prediction made by the script: ` STAT 49:chunk_si... 
[06:23:09] !log set lacp active on cr2-esams:ae2 - T253970 [06:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:23:14] T253970: No LACP info for cr2-esams:ae2 - https://phabricator.wikimedia.org/T253970 [06:24:21] RECOVERY - configured eth on stat1007 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:24:27] !log reboot an-master100[1,2] for kernel upgrades [06:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:39] Operations, Analytics-Radar, Traffic: Spammy events coming our way for sites such us https://ru.wikipedia.kim - https://phabricator.wikimedia.org/T190843 (JAllemandou) > I thought ru stand for Russia, this can just be a Russia version of wikipedia `ru.wikipedia.ORG` is official Russian wikipedia - `... [06:28:13] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:51] Operations, MediaWiki-General, Patch-For-Review, Sustainability (Incident Prevention): Stop a poolcounter server fail from being a SPOF for the service and the api (and the site) - https://phabricator.wikimedia.org/T105378 (tstarling) The count over 24 hours was 1689 connection timeout errors. I... [06:30:38] (CR) Elukey: [C: +1] "It looks really nice to me, thanks a lot for all this work. I left some comments but nothing blocking. 
I was worried about the systemd uni" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/605937 (https://phabricator.wikimedia.org/T233933) (owner: Jbond) [06:31:51] (PS1) Majavah: Set hiwiktionary timezone to Asia/Kolkata [mediawiki-config] - https://gerrit.wikimedia.org/r/606075 (https://phabricator.wikimedia.org/T255531) [06:32:52] (PS2) Majavah: Set hiwiktionary timezone to Asia/Kolkata [mediawiki-config] - https://gerrit.wikimedia.org/r/606075 (https://phabricator.wikimedia.org/T255531) [06:33:50] (CR) Majavah: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/606075 (https://phabricator.wikimedia.org/T255531) (owner: Majavah) [06:34:00] (CR) Elukey: "It looks good to me! Ignorant question - have we decided to re-use puppet host level TLS certs for service TLS termination as well? I have" [puppet] - https://gerrit.wikimedia.org/r/605947 (https://phabricator.wikimedia.org/T233933) (owner: Jbond) [06:40:13] !log reboot krb1001 for kernel upgrades [06:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:41] Operations, netops: No LACP info for cr2-esams:ae2 - https://phabricator.wikimedia.org/T253970 (ayounsi) This didn't work. Configuring the link as active took it down. But did show statistics. ` cr2-esams> show lacp interfaces ae2 Aggregated interface: ae2 LACP state: Role Exp Def Dis... 
[06:48:59] PROBLEM - Host an-master1001 is DOWN: PING CRITICAL - Packet loss = 100% [06:49:57] this is me --^ [06:51:21] RECOVERY - Host an-master1001 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [06:55:47] (PS2) Aaron Schulz: arclamp: add svgs for some key entrypoint/singleton methods calls [puppet] - https://gerrit.wikimedia.org/r/598292 [07:04:15] Operations, netops: ulsfo - codfw Zayo link down - https://phabricator.wikimedia.org/T255393 (ayounsi) Open→Resolved Restored as of Tue Jun 16 22:11:00 GMT 2020 [07:16:50] (PS1) Elukey: kafka: force openjdk-8 for Kafka clusters to avoid issues with Buster [puppet] - https://gerrit.wikimedia.org/r/606115 [07:19:59] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/23294/" [puppet] - https://gerrit.wikimedia.org/r/606115 (owner: Elukey) [07:21:48] godog: --^ [07:21:52] if you are around [07:40:01] (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/606115 (owner: Elukey) [07:44:39] (CR) Elukey: [C: +2] kafka: force openjdk-8 for Kafka clusters to avoid issues with Buster [puppet] - https://gerrit.wikimedia.org/r/606115 (owner: Elukey) [07:48:39] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:50:25] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:53:03] !log reboot kafka-jumbo1009 for kernel upgrades [07:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:55] elukey: yup will take a look shortly [07:59:03] ah nevermind [07:59:33] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:00:15] Operations, MediaWiki-General, Patch-For-Review, Sustainability (Incident Prevention): Stop a poolcounter server fail from being a SPOF for the service and the api (and the site) - https://phabricator.wikimedia.org/T105378 (akosiaris) >>! In T105378#6230593, @tstarling wrote: > The count over 24... [08:01:23] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:01:25] godog: thanks! 
completing the puppet run now, all no ops [08:04:22] *nod* [08:05:12] !log marostegui@cumin2001 dbctl commit (dc=all): 'Depool db1113:3315 db1113:3316', diff saved to https://phabricator.wikimedia.org/P11553 and previous config saved to /var/cache/conftool/dbconfig/20200617-080511-marostegui.json [08:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:16] (CR) Marostegui: [C: +2] mariadb: Reimage db1113 to Buster [puppet] - https://gerrit.wikimedia.org/r/606072 (https://phabricator.wikimedia.org/T250666) (owner: Marostegui) [08:08:17] (CR) Alexandros Kosiaris: [C: +1] Initial commit of debian directory (1 comment) [debs/chartmuseum] - https://gerrit.wikimedia.org/r/605940 (https://phabricator.wikimedia.org/T253843) (owner: JMeybohm) [08:10:14] !log stop logstash temporarily on logstash7 hosts to test increased es shards - T255243 [08:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:18] T255243: Increase logging pipeline ingestion capacity - https://phabricator.wikimedia.org/T255243 [08:10:32] (PS1) Kosta Harlan: Fix NewcomerTask schema [extensions/GrowthExperiments] (wmf/1.35.0-wmf.37) - https://gerrit.wikimedia.org/r/606121 (https://phabricator.wikimedia.org/T255597) [08:11:17] (PS1) Kosta Harlan: Fix NewcomerTask schema [extensions/GrowthExperiments] (wmf/1.35.0-wmf.36) - https://gerrit.wikimedia.org/r/606122 (https://phabricator.wikimedia.org/T255597) [08:15:51] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=jmx_logstash site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:19:10] (CR) Volans: "seems sane to me but I'll leave it to @vgutierrez to check that it has all the bits needed." 
[puppet] - https://gerrit.wikimedia.org/r/606040 (https://phabricator.wikimedia.org/T253140) (owner: CRusnov) [08:20:52] (CR) Vgutierrez: [C: +1] acme_chief: Add netbox-dev keys (1 comment) [puppet] - https://gerrit.wikimedia.org/r/606040 (https://phabricator.wikimedia.org/T253140) (owner: CRusnov) [08:20:56] (CR) Muehlenhoff: "Looks good, a few comments inline" (4 comments) [debs/chartmuseum] - https://gerrit.wikimedia.org/r/605940 (https://phabricator.wikimedia.org/T253843) (owner: JMeybohm) [08:20:57] !log marostegui@cumin2001 START - Cookbook sre.hosts.downtime [08:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:35] (CR) Jforrester: [C: +2] Do not return internal edit status from EditPage [core] (wmf/1.35.0-wmf.37) - https://gerrit.wikimedia.org/r/606047 (https://phabricator.wikimedia.org/T255177) (owner: DannyS712) [08:22:12] (PS2) Jforrester: Fix NewcomerTask schema [extensions/GrowthExperiments] (wmf/1.35.0-wmf.37) - https://gerrit.wikimedia.org/r/606121 (https://phabricator.wikimedia.org/T255597) (owner: Kosta Harlan) [08:22:23] (PS2) Jforrester: Fix NewcomerTask schema [extensions/GrowthExperiments] (wmf/1.35.0-wmf.36) - https://gerrit.wikimedia.org/r/606122 (https://phabricator.wikimedia.org/T255597) (owner: Kosta Harlan) [08:22:43] (CR) Muehlenhoff: Initial commit of debian directory (1 comment) [debs/chartmuseum] - https://gerrit.wikimedia.org/r/605940 (https://phabricator.wikimedia.org/T253843) (owner: JMeybohm) [08:23:30] !log marostegui@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:00] (CR) Jforrester: [C: +2] Enable DiscussionTools on all labs wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/605958 (https://phabricator.wikimedia.org/T255223) (owner: Esanders) [08:27:16] (CR) Jforrester: [C: +1] logging: Fold 
Parsoid into type:mediawiki, add 'servergroup:' instead [mediawiki-config] - https://gerrit.wikimedia.org/r/606038 (https://phabricator.wikimedia.org/T255627) (owner: Krinkle) [08:27:48] (Merged) jenkins-bot: Enable DiscussionTools on all labs wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/605958 (https://phabricator.wikimedia.org/T255223) (owner: Esanders) [08:28:22] (CR) Muehlenhoff: "@hashar: There's not urgency in getting this merged, if you prefer to be around, just ping me on IRC whenever it works for you this or nex" [puppet] - https://gerrit.wikimedia.org/r/605886 (https://phabricator.wikimedia.org/T253553) (owner: Muehlenhoff) [08:29:16] Operations, MediaWiki-Vagrant, phan: It should be possible to install php-ast using apt-get on MediaWiki-Vagrant - https://phabricator.wikimedia.org/T234240 (Lokal_Profil) @Mainframe98 Many thanks ! [08:29:49] !log prune nginx from remaining mw* servers in codfw T255565 [08:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:53] T255565: Remaining nginx packages on some mw servers - https://phabricator.wikimedia.org/T255565 [08:30:56] !log start logstash on logstash7 - T255243 [08:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:00] T255243: Increase logging pipeline ingestion capacity - https://phabricator.wikimedia.org/T255243 [08:31:21] !log marostegui@cumin2001 dbctl commit (dc=all): 'Slowly repool db1113:3315, db1113:3316', diff saved to https://phabricator.wikimedia.org/P11554 and previous config saved to /var/cache/conftool/dbconfig/20200617-083120-marostegui.json [08:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:20] (PS1) Marostegui: db1113: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/606124 [08:33:05] (CR) Marostegui: [C: +2] db1113: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/606124 (owner: Marostegui) 
[08:35:26] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:37:38] (PS1) Marostegui: install_server: Do not reimage db1113 [puppet] - https://gerrit.wikimedia.org/r/606125 [08:37:43] (PS2) DCausse: [wdqs] bump vocabulary and inline URI handler version [puppet] - https://gerrit.wikimedia.org/r/605536 (https://phabricator.wikimedia.org/T255399) [08:38:12] (CR) Ammarpad: [C: +1] Set hiwiktionary timezone to Asia/Kolkata [mediawiki-config] - https://gerrit.wikimedia.org/r/606075 (https://phabricator.wikimedia.org/T255531) (owner: Majavah) [08:38:16] (CR) Jforrester: [C: +2] "> Patch Set 1:" [mediawiki-config] - https://gerrit.wikimedia.org/r/599369 (https://phabricator.wikimedia.org/T247943) (owner: Jforrester) [08:38:22] (CR) Marostegui: [C: +2] install_server: Do not reimage db1113 [puppet] - https://gerrit.wikimedia.org/r/606125 (owner: Marostegui) [08:38:24] (PS2) Jforrester: Install MediaModeration extension - I: Add i18n [mediawiki-config] - https://gerrit.wikimedia.org/r/599369 (https://phabricator.wikimedia.org/T247943) [08:38:31] (CR) Jforrester: [C: +2] "…" [mediawiki-config] - https://gerrit.wikimedia.org/r/599369 (https://phabricator.wikimedia.org/T247943) (owner: Jforrester) [08:38:44] (Merged) jenkins-bot: Do not return internal edit status from EditPage [core] (wmf/1.35.0-wmf.37) - https://gerrit.wikimedia.org/r/606047 (https://phabricator.wikimedia.org/T255177) (owner: DannyS712) [08:39:26] (Merged) jenkins-bot: Install MediaModeration extension - I: Add i18n [mediawiki-config] - https://gerrit.wikimedia.org/r/599369 (https://phabricator.wikimedia.org/T247943) (owner: Jforrester) [08:42:45] (CR) Filippo Giunchedi: [C: +2] thanos: fix Thanos sidecar Prometheus connection alert [puppet] - 
https://gerrit.wikimedia.org/r/605960 (https://phabricator.wikimedia.org/T252186) (owner: Filippo Giunchedi) [08:42:52] (PS2) Filippo Giunchedi: thanos: fix Thanos sidecar Prometheus connection alert [puppet] - https://gerrit.wikimedia.org/r/605960 (https://phabricator.wikimedia.org/T252186) [08:43:05] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.37/includes/EditPage.php: T255177 T255614 Do not return internal edit status from EditPage (duration: 01m 08s) [08:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:10] T255177: Tests failing for the master branch of MassMessage - https://phabricator.wikimedia.org/T255177 [08:43:10] T255614: PHP Notice: Object of class MediaWiki\Debug\DeprecatablePropertyArray could not be converted to int - https://phabricator.wikimedia.org/T255614 [08:43:37] (CR) Jforrester: [C: +2] Revert "Hard deprecate the `TitleMoveCompleting` hook" [core] (wmf/1.35.0-wmf.37) - https://gerrit.wikimedia.org/r/606068 (https://phabricator.wikimedia.org/T255608) (owner: DannyS712) [08:43:40] (CR) Jforrester: [C: +2] Revert "Hooks: Use PageMoveComplete instead of TitleMoveCompleting" [extensions/Flow] (wmf/1.35.0-wmf.37) - https://gerrit.wikimedia.org/r/606069 (https://phabricator.wikimedia.org/T255608) (owner: DannyS712) [08:44:03] !log marostegui@cumin2001 dbctl commit (dc=all): 'Slowly repool db1113:3315, db1113:3316', diff saved to https://phabricator.wikimedia.org/P11556 and previous config saved to /var/cache/conftool/dbconfig/20200617-084402-marostegui.json [08:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:12] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=jmx_logstash site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:46:15] (PS4) JMeybohm: Initial commit of debian directory [debs/chartmuseum] 
- 10https://gerrit.wikimedia.org/r/605940 (https://phabricator.wikimedia.org/T253843) [08:46:36] (03CR) 10JMeybohm: Initial commit of debian directory (035 comments) [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/605940 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [08:46:48] PROBLEM - DPKG on mw2350 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [08:47:29] PROBLEM - DPKG on mw2324 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [08:47:49] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/605957 (owner: 10Jbond) [08:47:53] !log marostegui@cumin2001 dbctl commit (dc=all): 'Slowly repool db1113:3315, db1113:3316', diff saved to https://phabricator.wikimedia.org/P11557 and previous config saved to /var/cache/conftool/dbconfig/20200617-084751-marostegui.json [08:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:16] PROBLEM - DPKG on mw2321 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [08:49:08] PROBLEM - DPKG on mw2297 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [08:49:38] !log marostegui@cumin2001 START - Cookbook sre.hosts.downtime [08:49:38] PROBLEM - DPKG on mw2302 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [08:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:42] PROBLEM - DPKG on mw2360 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [08:50:56] PROBLEM - DPKG on mw2298 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [08:51:10] PROBLEM - DPKG on mw2292 is CRITICAL: DPKG CRITICAL dpkg reports broken packages 
https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [08:51:50] moritzm: ^^ expected? [08:52:03] yeah, I'm fixing those, the garbage postinst insists on a directory being busy [08:52:13] !log marostegui@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:41] and unless nginx-common is fully purged, the /etc/init.d/nginx is kept around as it's a conffile, which generates a systemd.service, which fails on next boot [08:52:47] * moritzm shakes fist at nginx [08:52:56] PROBLEM - DPKG on mw2293 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [08:53:08] PROBLEM - DPKG on mw2370 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [08:53:48] PROBLEM - DPKG on mw2374 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [08:53:48] PROBLEM - DPKG on mw2296 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [08:53:48] PROBLEM - DPKG on mw2368 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [08:55:24] PROBLEM - DPKG on mw2326 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [08:56:04] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [08:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:22] PROBLEM - DPKG on mw2319 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [08:56:28] PROBLEM - DPKG on mw2362 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [08:56:58] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:57:00] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:04] PROBLEM - DPKG on mw2295 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [08:57:10] (03PS1) 10Volans: Exclude local external plugins from the PyPi build [software/homer] - 10https://gerrit.wikimedia.org/r/606129 [08:59:18] (03PS2) 10Jforrester: Install MediaModeration extension - II: Add flag to IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599370 (https://phabricator.wikimedia.org/T247943) [08:59:32] (03CR) 10Jforrester: [C: 03+2] Install MediaModeration extension - II: Add flag to IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599370 (https://phabricator.wikimedia.org/T247943) (owner: 10Jforrester) [09:00:14] PROBLEM - DPKG on mw2244 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:00:20] (03CR) 10Hashar: [C: 03+1] "> this seems fine as long as it isn't applying to a ton of VMs on the same hypervisor." [puppet] - 10https://gerrit.wikimedia.org/r/605550 (https://phabricator.wikimedia.org/T255371) (owner: 10Hashar) [09:00:24] (03Merged) 10jenkins-bot: Install MediaModeration extension - II: Add flag to IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599370 (https://phabricator.wikimedia.org/T247943) (owner: 10Jforrester) [09:01:25] (03PS4) 10Jforrester: Install MediaModeration extension - III: Install where enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599363 (https://phabricator.wikimedia.org/T247943) (owner: 10Cicalese) [09:01:34] PROBLEM - DPKG on mw2291 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:01:39] (03CR) 10Jforrester: "Should do the PS patch first." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/599363 (https://phabricator.wikimedia.org/T247943) (owner: 10Cicalese) [09:02:20] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T247943 Install MediaModeration extension - II: Add flag to IS (duration: 01m 05s) [09:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:24] T247943: Deploy MediaModeration Extension to Wikimedia Production - https://phabricator.wikimedia.org/T247943 [09:04:20] PROBLEM - DPKG on mw2299 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:05:57] (03Merged) 10jenkins-bot: Revert "Hard deprecate the `TitleMoveCompleting` hook" [core] (wmf/1.35.0-wmf.37) - 10https://gerrit.wikimedia.org/r/606068 (https://phabricator.wikimedia.org/T255608) (owner: 10DannyS712) [09:06:01] (03Merged) 10jenkins-bot: Revert "Hooks: Use PageMoveComplete instead of TitleMoveCompleting" [extensions/Flow] (wmf/1.35.0-wmf.37) - 10https://gerrit.wikimedia.org/r/606069 (https://phabricator.wikimedia.org/T255608) (owner: 10DannyS712) [09:06:06] RECOVERY - DPKG on mw2291 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:06:06] RECOVERY - DPKG on mw2292 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:06:06] RECOVERY - DPKG on mw2295 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:06:06] RECOVERY - DPKG on mw2293 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:06:06] RECOVERY - DPKG on mw2244 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:06:06] RECOVERY - DPKG on mw2297 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:06:06] RECOVERY - DPKG on mw2296 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:06:07] RECOVERY - DPKG on mw2298 is OK: All packages OK 
https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:06:07] RECOVERY - DPKG on mw2299 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:09:29] (03PS1) 10RhinosF1: Merge commit 'refs/changes/78/605978/2' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config into T247330 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606133 [09:10:20] (03Abandoned) 10RhinosF1: Merge commit 'refs/changes/78/605978/2' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config into T247330 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606133 (owner: 10RhinosF1) [09:11:05] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.37/includes/HookContainer/DeprecatedHooks.php: T255608 Revert 'Hard deprecate the hook' (duration: 01m 05s) [09:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:10] T255608: DBUnexpectedError when moving a page with a StructuredDiscussion talkpage - https://phabricator.wikimedia.org/T255608 [09:11:59] (03PS1) 10Marostegui: dbproxy1013: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/606134 (https://phabricator.wikimedia.org/T255408) [09:12:23] (03CR) 10Marostegui: [C: 03+2] dbproxy1013: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/606134 (https://phabricator.wikimedia.org/T255408) (owner: 10Marostegui) [09:14:12] (03PS2) 10Filippo Giunchedi: thanos: pass min / max time to store [puppet] - 10https://gerrit.wikimedia.org/r/605949 (https://phabricator.wikimedia.org/T252186) [09:14:14] (03PS2) 10Filippo Giunchedi: thanos: use object storage for data older than 15d [puppet] - 10https://gerrit.wikimedia.org/r/605950 (https://phabricator.wikimedia.org/T252186) [09:14:16] (03PS1) 10Filippo Giunchedi: prometheus: pass min_time to Thanos sidecar [puppet] - 10https://gerrit.wikimedia.org/r/606135 (https://phabricator.wikimedia.org/T252186) [09:15:11] !log marostegui@cumin2001 dbctl commit (dc=all): 'Fully repool db1113:3315, db1113:3316', 
diff saved to https://phabricator.wikimedia.org/P11558 and previous config saved to /var/cache/conftool/dbconfig/20200617-091509-marostegui.json [09:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:19] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.37/extensions/Flow/: T255608 Revert 'Hooks: Use PageMoveComplete instead of TitleMoveCompleting' (duration: 01m 05s) [09:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:22] T255608: DBUnexpectedError when moving a page with a StructuredDiscussion talkpage - https://phabricator.wikimedia.org/T255608 [09:18:22] (03PS1) 10Marostegui: mariadb: Reimage db2122 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/606137 (https://phabricator.wikimedia.org/T250666) [09:18:27] RECOVERY - DPKG on mw2321 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:18:46] (03CR) 10Muehlenhoff: Initial commit of debian directory (031 comment) [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/605940 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [09:19:21] (03CR) 10Marostegui: [C: 03+2] mariadb: Reimage db2122 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/606137 (https://phabricator.wikimedia.org/T250666) (owner: 10Marostegui) [09:19:39] RECOVERY - DPKG on mw2302 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:19:41] RECOVERY - DPKG on mw2324 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:19:43] RECOVERY - DPKG on mw2360 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:23:11] RECOVERY - DPKG on mw2370 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:23:59] RECOVERY - DPKG on mw2368 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:23:59] RECOVERY - DPKG on mw2374 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:25:39] 
RECOVERY - DPKG on mw2326 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:26:07] RECOVERY - Thanos sidecar cannot connect to Prometheus on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar [09:26:39] RECOVERY - DPKG on mw2319 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:26:43] RECOVERY - DPKG on mw2362 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:27:54] 10Operations, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10MoritzMuehlenhoff) Since we don't use helm3 in production yet, just adding it to the task. There's a new vulnerability specific to 3.x, so let's upgrade to 3.2.4 before we roll th... [09:29:20] (03PS5) 10Jforrester: Install MediaModeration extension - III: Install where enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599363 (https://phabricator.wikimedia.org/T247943) (owner: 10Cicalese) [09:30:12] (03CR) 10Ayounsi: [C: 03+1] Exclude local external plugins from the PyPi build [software/homer] - 10https://gerrit.wikimedia.org/r/606129 (owner: 10Volans) [09:31:22] (03PS3) 10Filippo Giunchedi: thanos: use object storage for data older than 15d [puppet] - 10https://gerrit.wikimedia.org/r/605950 (https://phabricator.wikimedia.org/T252186) [09:32:47] RECOVERY - DPKG on mw2350 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:34:16] (03PS3) 10RhinosF1: close trwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605978 (https://phabricator.wikimedia.org/T247330) [09:35:25] yes!, now pass jenkins [09:37:35] (03PS4) 10RhinosF1: close trwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605978 (https://phabricator.wikimedia.org/T247330) [09:40:09] !log killing stale changeprop instances running on scb hosts 
[09:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:23] (03PS1) 10Jbond: mailman: add redirects [puppet] - 10https://gerrit.wikimedia.org/r/606143 (https://phabricator.wikimedia.org/T255009) [09:42:19] !log marostegui@cumin2001 START - Cookbook sre.hosts.downtime [09:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:44] (03CR) 10Jbond: [C: 03+2] mailman: add redirects [puppet] - 10https://gerrit.wikimedia.org/r/606143 (https://phabricator.wikimedia.org/T255009) (owner: 10Jbond) [09:43:37] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:43:51] there might be a kafka lag alert for logstash7, expected [09:44:50] !log marostegui@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:13] (03PS1) 10RhinosF1: Add http://pashaei.studio/ to the wgCopyUploadsDomains whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606144 [09:45:45] PROBLEM - Check systemd state on scb1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:46:19] PROBLEM - Check systemd state on scb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:46:22] (03PS1) 10Jbond: mailman: fix copy paste error [puppet] - 10https://gerrit.wikimedia.org/r/606145 [09:46:45] (03CR) 10Jbond: [V: 03+2 C: 03+2] mailman: fix copy paste error [puppet] - 10https://gerrit.wikimedia.org/r/606145 (owner: 10Jbond) [09:47:35] (03PS2) 10RhinosF1: Add http://pashaei.studio/ to the wgCopyUploadsDomains whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606144 (https://phabricator.wikimedia.org/T255336) [09:50:43] PROBLEM - Check systemd state on scb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:54:29] (03PS1) 10RhinosF1: Create 'rollbacker' user group for elwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606146 [09:56:07] PROBLEM - Check systemd state on scb2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:56:10] 10Operations, 10SRE-Access-Requests: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users, nda for AndrewKuznetsov - https://phabricator.wikimedia.org/T254939 (10jbond) [09:58:26] (03PS2) 10RhinosF1: Create 'rollbacker' user group for elwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606146 (https://phabricator.wikimedia.org/T255569) [10:02:32] 10Operations, 10SRE-Access-Requests: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users, nda for AndrewKuznetsov - https://phabricator.wikimedia.org/T254939 (10jbond) @MNovotny_WMF are you able to let us know the end date for this internship, this helps us with our au... 
[10:02:38] (03PS3) 10RhinosF1: Add numerous domains to the wgCopyUploadsDomains whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606144 (https://phabricator.wikimedia.org/T255336) [10:03:29] (03PS1) 10Jbond: admin: add Andrew Kuznetsov to ldap only [puppet] - 10https://gerrit.wikimedia.org/r/606147 [10:04:13] (03CR) 10Jbond: "have added a very short end data while i clarify the correct value" [puppet] - 10https://gerrit.wikimedia.org/r/606147 (owner: 10Jbond) [10:04:19] (03CR) 10Majavah: [C: 04-1] "Sysops should probably be able to revoke the right too" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606146 (https://phabricator.wikimedia.org/T255569) (owner: 10RhinosF1) [10:08:02] (03PS1) 10Marostegui: db2122: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/606148 [10:08:23] (03PS4) 10RhinosF1: Add numerous domains to the wgCopyUploadsDomains whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606144 (https://phabricator.wikimedia.org/T255336) [10:08:25] (03CR) 10Marostegui: [C: 03+2] db2122: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/606148 (owner: 10Marostegui) [10:11:06] (03PS7) 10Vgutierrez: ATS: Add http-redirect.lua [puppet] - 10https://gerrit.wikimedia.org/r/603447 (https://phabricator.wikimedia.org/T254235) [10:12:03] PROBLEM - Check systemd state on scb2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:12:10] (03PS3) 10RhinosF1: Create 'rollbacker' user group for elwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606146 (https://phabricator.wikimedia.org/T255569) [10:13:17] PROBLEM - Check systemd state on scb1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:13:49] PROBLEM - Check systemd state on scb2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:13:50] (03CR) 10Majavah: [C: 04-1] "almost there" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606146 (https://phabricator.wikimedia.org/T255569) (owner: 10RhinosF1) [10:13:55] PROBLEM - Check systemd state on scb2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:14:05] PROBLEM - Check systemd state on scb2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:14:28] (03PS1) 10Marostegui: mariadb: Move db2091 to s8 [puppet] - 10https://gerrit.wikimedia.org/r/606149 (https://phabricator.wikimedia.org/T253217) [10:15:32] (03PS4) 10RhinosF1: Create 'rollbacker' user group for elwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606146 (https://phabricator.wikimedia.org/T255569) [10:15:45] (03PS2) 10Marostegui: mariadb: Move db2091 to s8 [puppet] - 10https://gerrit.wikimedia.org/r/606149 (https://phabricator.wikimedia.org/T253217) [10:15:52] (03CR) 10RhinosF1: ">" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606146 (https://phabricator.wikimedia.org/T255569) (owner: 10RhinosF1) [10:16:14] (03CR) 10Majavah: [C: 03+1] Create 'rollbacker' user group for elwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606146 (https://phabricator.wikimedia.org/T255569) (owner: 10RhinosF1) [10:16:20] (03Abandoned) 10RhinosF1: Add wikimedia cloud mailing list to mailman’s robots.txt [puppet] - 10https://gerrit.wikimedia.org/r/563684 (https://phabricator.wikimedia.org/T242520) (owner: 10RhinosF1) [10:16:51] 10Operations, 
10Wikimedia-Mailing-lists, 10Patch-For-Review, 10User-RhinosF1: Allow Cloud mailing list to be indexed - https://phabricator.wikimedia.org/T242520 (10RhinosF1) 05Open→03Declined Reopen if still needed/wanted [10:16:59] (03PS3) 10Arturo Borrero Gonzalez: toolforge: legacy_redirector: refresh set of allowed tools [puppet] - 10https://gerrit.wikimedia.org/r/602054 (https://phabricator.wikimedia.org/T234617) [10:18:20] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: legacy_redirector: refresh set of allowed tools [puppet] - 10https://gerrit.wikimedia.org/r/602054 (https://phabricator.wikimedia.org/T234617) (owner: 10Arturo Borrero Gonzalez) [10:18:40] (03CR) 10Kormat: [C: 03+1] mariadb: Move db2091 to s8 [puppet] - 10https://gerrit.wikimedia.org/r/606149 (https://phabricator.wikimedia.org/T253217) (owner: 10Marostegui) [10:19:18] (03CR) 10Marostegui: [C: 03+2] mariadb: Move db2091 to s8 [puppet] - 10https://gerrit.wikimedia.org/r/606149 (https://phabricator.wikimedia.org/T253217) (owner: 10Marostegui) [10:19:23] (03PS3) 10RhinosF1: Add localised sitename for bewikibooks. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/599760 (https://phabricator.wikimedia.org/T253962) [10:20:25] (03PS4) 10Jbond: memcached: add TLS support [puppet] - 10https://gerrit.wikimedia.org/r/605937 (https://phabricator.wikimedia.org/T233933) [10:20:36] (03CR) 10Jbond: "thanks so inline for comments" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/605937 (https://phabricator.wikimedia.org/T233933) (owner: 10Jbond) [10:22:41] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 52 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:23:40] 10Operations, 10Traffic: HTML Dumps 429 error on RESTBase endpoints - https://phabricator.wikimedia.org/T255524 (10Kelson) We get many HTTP 429 errors from the rest(base) API if we scrape with nodes outside the VPS cluster. Really a hassle to deal with. It seems to me we are impacted... But maybe I get somethi... [10:27:34] PROBLEM - Check systemd state on scb2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:27:47] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 46 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:27:58] (03PS1) 10Lars Wirzenius: group1 wikis to 1.35.0-wmf.37 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606152 [10:28:00] (03CR) 10Lars Wirzenius: [C: 03+2] group1 wikis to 1.35.0-wmf.37 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606152 (owner: 10Lars Wirzenius) [10:28:46] (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.37 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606152 (owner: 10Lars Wirzenius) [10:30:23] !log liw@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.37 [10:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:20] liw: Looking good to me. [10:31:28] !log liw@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.37 (duration: 01m 04s) [10:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:16] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] cloud apt: set periodic autocleaning to once a week [puppet] - 10https://gerrit.wikimedia.org/r/605979 (https://phabricator.wikimedia.org/T127374) (owner: 10Bstorm) [10:32:25] James_F, not exploding immediately, at least [10:32:44] * James_F grins. [10:32:49] That's the spirit. [10:34:07] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [10:34:20] ^ not me! :) [10:34:36] marostegui: probably? 
[10:35:31] 10Operations, 10Wikimedia-Logstash, 10observability: Increase logging pipeline ingestion capacity - https://phabricator.wikimedia.org/T255243 (10fgiunchedi) I've stopped logstash for 20 min first and 60m later to test backlog processing. The sharding definitely worked and now all SSD hosts are receiving equa... [10:35:55] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [10:38:08] !log marostegui@cumin2001 START - Cookbook sre.hosts.downtime [10:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:43] !log marostegui@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:28] (03PS1) 10Arturo Borrero Gonzalez: toolforge: legacy-redirector: use specific certificate [puppet] - 10https://gerrit.wikimedia.org/r/606155 (https://phabricator.wikimedia.org/T247236) [10:45:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: legacy-redirector: use specific certificate [puppet] - 10https://gerrit.wikimedia.org/r/606155 (https://phabricator.wikimedia.org/T247236) (owner: 10Arturo Borrero Gonzalez) [10:48:17] !log marostegui@cumin2001 dbctl commit (dc=all): 'Remove db2091 from dbctl in s2 and s4', diff saved to https://phabricator.wikimedia.org/P11562 and previous config saved to /var/cache/conftool/dbconfig/20200617-104816-marostegui.json [10:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:04] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [10:51:28] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/605957 (owner: 10Jbond) 
[10:51:40] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [10:59:45] (03CR) 10Muehlenhoff: memcached: add TLS support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605937 (https://phabricator.wikimedia.org/T233933) (owner: 10Jbond) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: May I have your attention please! European mid-day backport window(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200617T1100) [11:00:04] Majavah, kostajh, and Ammarpad: A patch you scheduled for European mid-day backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:23] \o [11:00:26] o/ [11:00:57] (03CR) 10Jbond: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/605947 (https://phabricator.wikimedia.org/T233933) (owner: 10Jbond) [11:01:33] I can BACON today [11:01:41] Tsk. [11:01:45] Please don't call it that. [11:02:09] We just intentionally renamed it *away* from things to do with slaughter. 
[11:02:25] (03CR) 10Jbond: [C: 03+2] memcached: Clear ExecStart when used as an override [puppet] - 10https://gerrit.wikimedia.org/r/605957 (owner: 10Jbond) [11:02:59] (03PS3) 10Ladsgroup: Set hiwiktionary timezone to Asia/Kolkata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606075 (https://phabricator.wikimedia.org/T255531) (owner: 10Majavah) [11:03:04] (03CR) 10Ladsgroup: [C: 03+2] Set hiwiktionary timezone to Asia/Kolkata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606075 (https://phabricator.wikimedia.org/T255531) (owner: 10Majavah) [11:03:54] (03Merged) 10jenkins-bot: Set hiwiktionary timezone to Asia/Kolkata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606075 (https://phabricator.wikimedia.org/T255531) (owner: 10Majavah) [11:04:10] James_F: noted [11:04:15] sorry I'm late [11:04:30] Amir1: I'm just no fun, I know. ;-) [11:05:13] James_F: nah, I can survive not using that word lol [11:05:28] Majavah: live at mwdebug1001 [11:05:46] Amir1: working, thanks [11:06:57] deploying [11:07:33] (03CR) 10Ladsgroup: [C: 03+2] "B&C" [extensions/GrowthExperiments] (wmf/1.35.0-wmf.36) - 10https://gerrit.wikimedia.org/r/606122 (https://phabricator.wikimedia.org/T255597) (owner: 10Kosta Harlan) [11:07:36] James_F: so...CAB would be better? :) [11:07:47] (03CR) 10Ladsgroup: [C: 03+2] "B&C" [extensions/GrowthExperiments] (wmf/1.35.0-wmf.37) - 10https://gerrit.wikimedia.org/r/606121 (https://phabricator.wikimedia.org/T255597) (owner: 10Kosta Harlan) [11:07:58] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:606075|Set hiwiktionary timezone to Asia/Kolkata (T255531)]] (duration: 01m 05s) [11:07:58] B&C [11:07:59] :P [11:08:01] Urbanecm: … or just the name we settled on after talking about it for months. 
:-) [11:08:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:02] T255531: Set hiwiktionary timezone to Asia/Kolkata - https://phabricator.wikimedia.org/T255531 [11:08:03] how about BCDW? :D [11:08:33] James_F: does https://gerrit.wikimedia.org/r/605978 look right if you're free? [11:09:03] RhinosF1: Don't you have to adjust the groupOverride stuff too? [11:09:19] I think they always involve IS.php changing, don't they? [11:09:46] The backport is going to take some time to merge, I deploy the config change in the mean time [11:10:14] @seen Ammarpad [11:10:14] Amir1: Last time I saw Ammarpad they were quitting the network with reason: Quit: Connection closed for inactivity N/A at 6/16/2020 2:21:40 AM (1d8h48m33s ago) [11:10:24] James_F: Do I just remove any entry in group overrides with that wiki name? [11:10:40] Well, since Ammarpad is not around. They are not getting deployed [11:10:49] RhinosF1: I think so. [11:10:59] Sorry, it's been a while. [11:11:01] James_F: will do after lunch then [11:11:13] * RhinosF1 brand new to wiki closures [11:14:23] (03Merged) 10jenkins-bot: Fix NewcomerTask schema [extensions/GrowthExperiments] (wmf/1.35.0-wmf.36) - 10https://gerrit.wikimedia.org/r/606122 (https://phabricator.wikimedia.org/T255597) (owner: 10Kosta Harlan) [11:14:52] kostajh: can you test this? 
[11:15:28] Amir1: not really; I can look to see if the new schema revision ID is referenced in outgoing requests 🤷 [11:15:42] noted, let's just deploy it then [11:15:49] k [11:16:38] (03PS5) 10Jbond: profile::idp::memcached: move SSL termination to memcached [puppet] - 10https://gerrit.wikimedia.org/r/605947 (https://phabricator.wikimedia.org/T233933) [11:16:51] (03CR) 10jerkins-bot: [V: 04-1] profile::idp::memcached: move SSL termination to memcached [puppet] - 10https://gerrit.wikimedia.org/r/605947 (https://phabricator.wikimedia.org/T233933) (owner: 10Jbond) [11:17:15] (03Merged) 10jenkins-bot: Fix NewcomerTask schema [extensions/GrowthExperiments] (wmf/1.35.0-wmf.37) - 10https://gerrit.wikimedia.org/r/606121 (https://phabricator.wikimedia.org/T255597) (owner: 10Kosta Harlan) [11:18:32] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.36/extensions/GrowthExperiments/extension.json: [[gerrit:606121|Fix NewcomerTask schema (T255597)]] (duration: 01m 06s) [11:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:43] T255597: NewcomerTask EventLogging schema has invalid array items type specification - https://phabricator.wikimedia.org/T255597 [11:19:04] kostajh: can you check now? 
[11:19:12] deployed on wmf.36 [11:19:14] Amir1: yep one sec [11:19:22] deploying wmf.37 atm [11:21:49] Amir1: yes wmf.36 has the correct schema rev ID [11:22:05] cool, deploying [11:22:44] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/605947 (https://phabricator.wikimedia.org/T233933) (owner: Jbond) [11:22:56] (PS5) Jbond: memcached: add TLS support [puppet] - https://gerrit.wikimedia.org/r/605937 (https://phabricator.wikimedia.org/T233933) [11:23:04] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.37/extensions/GrowthExperiments/extension.json: [[gerrit:606122|Fix NewcomerTask schema (T255597)]] (duration: 01m 04s) [11:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:53] (CR) Jbond: "check experimental" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/605937 (https://phabricator.wikimedia.org/T233933) (owner: Jbond) [11:24:07] (PS6) Jbond: profile::idp::memcached: move SSL termination to memcached [puppet] - https://gerrit.wikimedia.org/r/605947 (https://phabricator.wikimedia.org/T233933) [11:25:27] (CR) Ladsgroup: [C: -1] Change sidebar upload link destination for tr.wikisource (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/605656 (https://phabricator.wikimedia.org/T253490) (owner: Ammarpad) [11:26:44] (PS1) Arturo Borrero Gonzalez: toolforge: legacy-redirector: declare explicitly sslcert::dhparam [puppet] - https://gerrit.wikimedia.org/r/606163 (https://phabricator.wikimedia.org/T247236) [11:27:06] (PS4) Ladsgroup: Add extended-confirmed group and restriction level for rowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/605652 (https://phabricator.wikimedia.org/T254471) (owner: Ammarpad) [11:27:11] (CR) Ladsgroup: [C: +2] Add extended-confirmed group and restriction level for rowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/605652 (https://phabricator.wikimedia.org/T254471) (owner: 
Ammarpad) [11:28:02] (Merged) jenkins-bot: Add extended-confirmed group and restriction level for rowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/605652 (https://phabricator.wikimedia.org/T254471) (owner: Ammarpad) [11:28:19] (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: legacy-redirector: declare explicitly sslcert::dhparam [puppet] - https://gerrit.wikimedia.org/r/606163 (https://phabricator.wikimedia.org/T247236) (owner: Arturo Borrero Gonzalez) [11:28:50] Ammarpad: the rowiki patch is live on mwdebug1001 now [11:29:28] OK, I see [11:29:51] (PS1) Marostegui: mariadb: Reimage es1025 to Buster [puppet] - https://gerrit.wikimedia.org/r/606164 (https://phabricator.wikimedia.org/T250666) [11:30:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1025 for reimage, give weight to es1023 (es5 master)', diff saved to https://phabricator.wikimedia.org/P11563 and previous config saved to /var/cache/conftool/dbconfig/20200617-113026-marostegui.json [11:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:02] Operations, Traffic: noc.wikimedia.org consistently 503s in eqsin and sometimes 503s in esams - https://phabricator.wikimedia.org/T255368 (ema) Alright I finally understood what's going on. The problem here is that (1) the origin server is adding `Transfer-Encoding: chunked` to conditional HEAD requests,...
[11:31:27] (CR) Marostegui: [C: +2] mariadb: Reimage es1025 to Buster [puppet] - https://gerrit.wikimedia.org/r/606164 (https://phabricator.wikimedia.org/T250666) (owner: Marostegui) [11:31:55] (PS3) Ammarpad: Change sidebar upload link destination for tr.wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/605656 (https://phabricator.wikimedia.org/T253490) [11:32:10] (CR) Esanders: "Done" [mediawiki-config] - https://gerrit.wikimedia.org/r/599307 (https://phabricator.wikimedia.org/T252264) (owner: Esanders) [11:34:30] (CR) Ammarpad: "I updated the patch." [mediawiki-config] - https://gerrit.wikimedia.org/r/605656 (https://phabricator.wikimedia.org/T253490) (owner: Ammarpad) [11:36:23] (CR) RhinosF1: [C: +1] Install DiscussionTools on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/599307 (https://phabricator.wikimedia.org/T252264) (owner: Esanders) [11:36:32] (CR) Muehlenhoff: memcached: add TLS support (1 comment) [puppet] - https://gerrit.wikimedia.org/r/605937 (https://phabricator.wikimedia.org/T233933) (owner: Jbond) [11:38:03] Ammarpad: let me know when I can move forward with the rowiki patch [11:40:06] (CR) Ladsgroup: "> Patch Set 3:" [mediawiki-config] - https://gerrit.wikimedia.org/r/605656 (https://phabricator.wikimedia.org/T253490) (owner: Ammarpad) [11:45:46] Amir1 You can proceed.
Things look right to me [11:47:31] (CR) Ladsgroup: [C: +2] Change sidebar upload link destination for tr.wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/605656 (https://phabricator.wikimedia.org/T253490) (owner: Ammarpad) [11:47:56] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:605652|Add extended-confirmed group and restriction level for rowiki (T254471)]] (duration: 01m 04s) [11:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:00] T254471: Add extended confirmed protection to Romanian Wikipedia - https://phabricator.wikimedia.org/T254471 [11:48:17] (Merged) jenkins-bot: Change sidebar upload link destination for tr.wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/605656 (https://phabricator.wikimedia.org/T253490) (owner: Ammarpad) [11:48:36] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [11:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:30] Operations, CAS-SSO, User-jbond: cas-puppetboard fails to show facts - https://phabricator.wikimedia.org/T255665 (jbond) p:Triage→Medium [11:49:35] Ammarpad: the trwikisource patch is live on mwdebug1001 [11:50:17] Amir1 OK [11:53:20] Amir1 Everything looks OK [11:53:46] James_F: I don't think they are any groupOverrides set for trwikinews [11:54:12] cool, moving forward [11:55:36] James_F: there's enableuploads set but other than that just namespace and logo stuff [11:55:43] !log ladsgroup@deploy1001 Synchronized dblists/commonsuploads.dblist: [[gerrit:605656|Change sidebar upload link destination for tr.wikisource (T253490)]] (duration: 01m 04s) [11:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:47] T253490: Change the link of the Upload file section in the tools section in the sidebar on tr.wikisource.org - 
https://phabricator.wikimedia.org/T253490 [11:58:05] !log ladsgroup@deploy1001 Synchronized wmf-config/config/trwikisource.yaml: [[gerrit:605656|Change sidebar upload link destination for tr.wikisource (T253490)]] (duration: 01m 03s) [11:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:32] !log B&C is done for today [11:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:51] !log not today, just EU noon [11:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200617T1200) [12:06:23] !log marostegui@cumin2001 dbctl commit (dc=all): 'Depool db2076 to remove triggers from sanitarium T238966', diff saved to https://phabricator.wikimedia.org/P11565 and previous config saved to /var/cache/conftool/dbconfig/20200617-120622-marostegui.json [12:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:27] T238966: Apply updates for MCR, actor migration, and content migration, to production wikis. 
- https://phabricator.wikimedia.org/T238966 [12:07:31] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=jmx_logstash site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:08:02] Operations, Puppet, User-jbond: puppetmaster - ignoring invalid UTF-8 byte sequences in data to be sent to PuppetDB - https://phabricator.wikimedia.org/T255667 (jbond) p:Triage→Low [12:08:13] (PS1) Alexandros Kosiaris: Remove ganeti100[1-4], ganeti200[1-6] [dns] - https://gerrit.wikimedia.org/r/606168 (https://phabricator.wikimedia.org/T255553) [12:08:15] (PS2) Alexandros Kosiaris: Remove role spare from old ganeti hosts [puppet] - https://gerrit.wikimedia.org/r/605895 (https://phabricator.wikimedia.org/T255553) [12:08:21] (CR) Alexandros Kosiaris: [V: +2 C: +2] Remove role spare from old ganeti hosts [puppet] - https://gerrit.wikimedia.org/r/605895 (https://phabricator.wikimedia.org/T255553) (owner: Alexandros Kosiaris) [12:10:06] Operations, CAS-SSO, User-jbond: cas-puppetboard fails to show facts - https://phabricator.wikimedia.org/T255665 (MoritzMuehlenhoff) I see a failing CORS request in the developer console.
[12:10:58] (PS2) Alexandros Kosiaris: Remove ganeti100[1-4], ganeti200[1-6] [dns] - https://gerrit.wikimedia.org/r/606168 (https://phabricator.wikimedia.org/T255553) [12:12:24] (CR) Alexandros Kosiaris: [C: +2] Remove ganeti100[1-4], ganeti200[1-6] [dns] - https://gerrit.wikimedia.org/r/606168 (https://phabricator.wikimedia.org/T255553) (owner: Alexandros Kosiaris) [12:14:10] Operations, ops-eqiad, decommission-hardware: decommission ganeti100[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T255553 (akosiaris) a:Cmjohnson [12:14:45] Operations, ops-eqiad, decommission-hardware: decommission ganeti100[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T255553 (akosiaris) Service ops owner steps done, machines are ready to be handled by dc ops. [12:16:55] Operations, CAS-SSO, User-jbond: cas-puppetboard fails to show facts - https://phabricator.wikimedia.org/T255665 (MoritzMuehlenhoff) The facts are retrieved from "https://puppetboard.wikimedia.org/fact/", I think this will simply vanish when this is no longer a separate vhost, but mod_cas "owns" the... [12:17:40] (CR) Alexandros Kosiaris: [C: +2] Add kubernetes[12]007-kubernetes[12]014 to BGP [homer/public] - https://gerrit.wikimedia.org/r/605848 (https://phabricator.wikimedia.org/T241850) (owner: Alexandros Kosiaris) [12:17:51] (CR) Alexandros Kosiaris: [C: +2] "Let's see how well I 'll fare with that" [homer/public] - https://gerrit.wikimedia.org/r/605848 (https://phabricator.wikimedia.org/T241850) (owner: Alexandros Kosiaris) [12:18:12] Operations, CAS-SSO, User-jbond: cas-puppetboard fails to show facts - https://phabricator.wikimedia.org/T255665 (jbond) I think we can probably make that change safley now? however i dont share your confidence, seems they are fetching the url from https://cas-puppetboard.wikimedia.org//node/cumi...
[12:18:15] (Merged) jenkins-bot: Add kubernetes[12]007-kubernetes[12]014 to BGP [homer/public] - https://gerrit.wikimedia.org/r/605848 (https://phabricator.wikimedia.org/T241850) (owner: Alexandros Kosiaris) [12:21:08] (PS3) Hashar: Switch CI to profile::java [puppet] - https://gerrit.wikimedia.org/r/605886 (https://phabricator.wikimedia.org/T253553) (owner: Muehlenhoff) [12:21:10] (CR) Hashar: Switch CI to profile::java (3 comments) [puppet] - https://gerrit.wikimedia.org/r/605886 (https://phabricator.wikimedia.org/T253553) (owner: Muehlenhoff) [12:22:03] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/605886 (https://phabricator.wikimedia.org/T253553) (owner: Muehlenhoff) [12:24:18] (PS4) Hashar: Switch CI to profile::java [puppet] - https://gerrit.wikimedia.org/r/605886 (https://phabricator.wikimedia.org/T253553) (owner: Muehlenhoff) [12:24:27] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/605886 (https://phabricator.wikimedia.org/T253553) (owner: Muehlenhoff) [12:29:50] RECOVERY - Check systemd state on scb2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:32:20] RECOVERY - Check systemd state on scb2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:32:28] RECOVERY - Check systemd state on scb1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:32:30] RECOVERY - Check systemd state on scb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:32:32] RECOVERY - Check systemd state on scb2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:32:32] RECOVERY - Check systemd state on 
scb1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:32:40] RECOVERY - Check systemd state on scb2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:32:53] !log Removed remaining changeprop systemd components from scb [12:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:30] RECOVERY - Check systemd state on scb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:34:32] (PS3) Filippo Giunchedi: thanos: pass min / max time to store [puppet] - https://gerrit.wikimedia.org/r/605949 (https://phabricator.wikimedia.org/T252186) [12:34:34] (PS2) Filippo Giunchedi: prometheus: pass min_time to Thanos sidecar [puppet] - https://gerrit.wikimedia.org/r/606135 (https://phabricator.wikimedia.org/T252186) [12:34:36] (PS4) Filippo Giunchedi: thanos: use object storage for data older than 15d [puppet] - https://gerrit.wikimedia.org/r/605950 (https://phabricator.wikimedia.org/T252186) [12:36:36] (CR) Hashar: "https://puppet-compiler.wmflabs.org/compiler1003/433/" [puppet] - https://gerrit.wikimedia.org/r/605886 (https://phabricator.wikimedia.org/T253553) (owner: Muehlenhoff) [12:37:10] RECOVERY - Check systemd state on scb2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:37:10] RECOVERY - Check systemd state on scb2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:38:15] (PS5) Hashar: Switch CI to profile::java [puppet] - https://gerrit.wikimedia.org/r/605886 (https://phabricator.wikimedia.org/T253553) (owner: Muehlenhoff) [12:38:31] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/605886 
(https://phabricator.wikimedia.org/T253553) (owner: Muehlenhoff) [12:40:33] Operations, CAS-SSO, User-jbond: cas-puppetboard fails to show facts - https://phabricator.wikimedia.org/T255665 (MoritzMuehlenhoff) Yeah, scratch what I wrote before, I looked into the wrong place. After some more digging this could to be a similar case to T251513; I'm now logged in with a long te... [12:40:35] !log marostegui@cumin2001 dbctl commit (dc=all): 'Add db2091 to s8 T253217', diff saved to https://phabricator.wikimedia.org/P11566 and previous config saved to /var/cache/conftool/dbconfig/20200617-124034-marostegui.json [12:40:36] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 51 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:40] T253217: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 [12:42:20] (CR) Hashar: [C: +1] "https://puppet-compiler.wmflabs.org/compiler1003/434/" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/605886 (https://phabricator.wikimedia.org/T253553) (owner: Muehlenhoff) [12:42:59] (PS1) Marostegui: install_server: Reimage db2091 as Buster [puppet] - https://gerrit.wikimedia.org/r/606173 [12:43:08] Operations, CAS-SSO, User-jbond: cas-puppetboard fails to show facts - https://phabricator.wikimedia.org/T255665 (jbond) >>! In T255665#6231425, @MoritzMuehlenhoff wrote: > Yeah, scratch what I wrote before, I looked into the wrong place. > > After some more digging this could to be a similar case...
[12:43:44] (CR) Marostegui: [C: +2] install_server: Reimage db2091 as Buster [puppet] - https://gerrit.wikimedia.org/r/606173 (owner: Marostegui) [12:44:55] (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1003/23298/" [puppet] - https://gerrit.wikimedia.org/r/605950 (https://phabricator.wikimedia.org/T252186) (owner: Filippo Giunchedi) [12:45:15] (CR) Filippo Giunchedi: [C: +2] thanos: pass min / max time to store [puppet] - https://gerrit.wikimedia.org/r/605949 (https://phabricator.wikimedia.org/T252186) (owner: Filippo Giunchedi) [12:45:51] Operations, ops-codfw, serviceops, Patch-For-Review: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (ops-monitoring-bot) Script wmf-auto-reimage was launched by akosiaris on cumin2001.codfw.wmnet for... [12:46:20] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 46 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:48:04] (CR) Filippo Giunchedi: [C: +2] prometheus: pass min_time to Thanos sidecar [puppet] - https://gerrit.wikimedia.org/r/606135 (https://phabricator.wikimedia.org/T252186) (owner: Filippo Giunchedi) [12:50:49] (CR) Volans: [C: +2] Exclude local external plugins from the PyPi build [software/homer] - https://gerrit.wikimedia.org/r/606129 (owner: Volans) [12:53:21] (PS1) Jbond: profile::idp::client::httpd: remove trailing slash from proxied as [puppet] - https://gerrit.wikimedia.org/r/606177 (https://phabricator.wikimedia.org/T255665) [12:54:34] !log upgraded cpjobqueue to newer container image, rolled back [12:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:39] (Merged) jenkins-bot: Exclude local external 
plugins from the PyPi build [software/homer] - https://gerrit.wikimedia.org/r/606129 (owner: Volans) [12:56:05] Operations, SRE-tools, Traffic, Goal, and 3 others: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 (Volans) @ayounsi @crusnov I've created mini-one-time script to automatically reserve the first 5 IP addresses in all relevant prefixes. Did a... [12:57:34] (PS2) Jbond: profile::idp::client::httpd: remove trailing slash from proxied as [puppet] - https://gerrit.wikimedia.org/r/606177 (https://phabricator.wikimedia.org/T255665) [12:59:35] !log akosiaris@cumin2001 START - Cookbook sre.hosts.downtime [12:59:35] !log akosiaris@cumin2001 START - Cookbook sre.hosts.downtime [12:59:35] !log akosiaris@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [12:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:49] !log akosiaris@cumin2001 START - Cookbook sre.hosts.downtime [12:59:49] !log akosiaris@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [12:59:51] !log akosiaris@cumin2001 START - Cookbook sre.hosts.downtime [12:59:51] !log akosiaris@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [12:59:51] !log akosiaris@cumin2001 START - Cookbook sre.hosts.downtime [12:59:51] !log akosiaris@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [12:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:53] !log akosiaris@cumin2001 START - Cookbook sre.hosts.downtime [12:59:53] !log akosiaris@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [12:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:56] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:05] liw and brennen: Time to snap out of that daydream and deploy Mediawiki train - European+American Version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200617T1300). [13:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:19] !log akosiaris@cumin2001 START - Cookbook sre.hosts.downtime [13:00:19] !log akosiaris@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [13:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:35] !log marostegui@cumin2001 START - Cookbook sre.hosts.downtime [13:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:47] (PS1) Lars Wirzenius: all wikis to 1.35.0-wmf.37 [mediawiki-config] - https://gerrit.wikimedia.org/r/606179 [13:00:48] (CR) Lars Wirzenius: [C: +2] all wikis to 1.35.0-wmf.37 [mediawiki-config] - https://gerrit.wikimedia.org/r/606179 (owner: Lars Wirzenius) [13:01:33] (Merged) jenkins-bot: all wikis to 1.35.0-wmf.37 [mediawiki-config] - https://gerrit.wikimedia.org/r/606179 (owner: Lars Wirzenius) [13:02:02] (CR) Jbond: "ready for review" [puppet] - https://gerrit.wikimedia.org/r/606177 (https://phabricator.wikimedia.org/T255665) (owner: Jbond) [13:02:08] !log akosiaris@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [13:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:38] !log 
marostegui@cumin2001 dbctl commit (dc=all): 'Depool db1129', diff saved to https://phabricator.wikimedia.org/P11567 and previous config saved to /var/cache/conftool/dbconfig/20200617-130236-marostegui.json [13:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:39] !log disable puppet on C:memcache to deploy a new change [13:03:39] !log liw@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.37 [13:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:59] !log The above db1129 depool was meant to be a repool, wrong commit message [13:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:41] !log marostegui@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:50] (CR) Jbond: [C: +2] memcached: add TLS support (1 comment) [puppet] - https://gerrit.wikimedia.org/r/605937 (https://phabricator.wikimedia.org/T233933) (owner: Jbond) [13:08:22] (CR) Jgiannelos: "After this [1] patch, push-notifications service also requires configuration for APNS. That means:" [deployment-charts] - https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) (owner: MSantos) [13:08:26] Operations, ops-codfw, serviceops, Patch-For-Review: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes2013.codfw.wmnet', 'kubernetes2... [13:08:55] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:09:02] (CR) Jgiannelos: [C: -1] charts for push-notification service [deployment-charts] - https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) (owner: MSantos) [13:09:03] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:10:11] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:11:19] mhhh I think that last alert was a monitoring artifact [13:18:50] (PS1) Elukey: dumps::web::fetches::stats: fix bash script for mediawiki_history_dumps [puppet] - https://gerrit.wikimedia.org/r/606183 (https://phabricator.wikimedia.org/T255485) [13:19:11] Operations, Release Pipeline, Release-Engineering-Team-TODO, Epic, and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (akosiaris) [13:19:14] (CR) jerkins-bot: [V: -1] dumps::web::fetches::stats: fix bash script for mediawiki_history_dumps [puppet] - https://gerrit.wikimedia.org/r/606183 (https://phabricator.wikimedia.org/T255485) (owner: Elukey) [13:19:28] Operations, Release Pipeline, Release-Engineering-Team-TODO, Epic, and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (akosiaris) [13:19:38] (PS2) Elukey: dumps::web::fetches::stats: fix bash script for mediawiki_history_dumps 
[puppet] - https://gerrit.wikimedia.org/r/606183 (https://phabricator.wikimedia.org/T255485) [13:19:43] (PS4) Privacybatm: [WIP] transferpy: Generate checksum parallel to the data transfer [software/transferpy] - https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) [13:19:58] (CR) jerkins-bot: [V: -1] dumps::web::fetches::stats: fix bash script for mediawiki_history_dumps [puppet] - https://gerrit.wikimedia.org/r/606183 (https://phabricator.wikimedia.org/T255485) (owner: Elukey) [13:20:15] (CR) jerkins-bot: [V: -1] [WIP] transferpy: Generate checksum parallel to the data transfer [software/transferpy] - https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) (owner: Privacybatm) [13:21:58] !log re-enable puppet on C:memcached nodes [13:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:43] (PS5) Privacybatm: [WIP] transferpy: Generate checksum parallel to the data transfer [software/transferpy] - https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) [13:25:13] RhinosF1: Ack. [13:25:47] (CR) Privacybatm: "As a starting point, I have added 2 tests, Can you please comment on them?"
[software/transferpy] - https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) (owner: Privacybatm) [13:25:54] (PS3) Elukey: dumps::web::fetches::stats: fix bash script for mediawiki_history_dumps [puppet] - https://gerrit.wikimedia.org/r/606183 (https://phabricator.wikimedia.org/T255485) [13:26:41] (PS1) Ottomata: Bump SearchSatisfaction schema version to 1.1.0 to pick up client_dt field [mediawiki-config] - https://gerrit.wikimedia.org/r/606185 (https://phabricator.wikimedia.org/T249261) [13:27:17] James_F: I've scheduled for evening SWAT [13:27:39] i just heard SWAT is now called BACON [13:27:40] :) [13:27:52] hear ye hear ye [13:27:57] (PS2) DCausse: [wdqs] drop updater mode config [puppet] - https://gerrit.wikimedia.org/r/602353 [13:27:57] * RhinosF1 needs to get his brain used to it [13:27:59] (PS16) DCausse: [wdqs] add a new streaming updater profile [puppet] - https://gerrit.wikimedia.org/r/597790 [13:28:03] ottomata: No, it isn't. [13:28:10] hahah [13:28:13] (CR) jerkins-bot: [V: -1] [wdqs] drop updater mode config [puppet] - https://gerrit.wikimedia.org/r/602353 (owner: DCausse) [13:28:19] (CR) jerkins-bot: [V: -1] [wdqs] add a new streaming updater profile [puppet] - https://gerrit.wikimedia.org/r/597790 (owner: DCausse) [13:29:30] (CR) Ottomata: [C: +2] Bump SearchSatisfaction schema version to 1.1.0 to pick up client_dt field [mediawiki-config] - https://gerrit.wikimedia.org/r/606185 (https://phabricator.wikimedia.org/T249261) (owner: Ottomata) [13:30:04] Operations, SRE-tools, Traffic, Goal, and 3 others: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 (ayounsi) LGTM! Especially for a one-time job. [13:30:07] Operations, Wikimedia-Logstash, observability: Increase logging pipeline ingestion capacity - https://phabricator.wikimedia.org/T255243 (fgiunchedi) >>!
In T255243#6231066, @fgiunchedi wrote: > I've stopped logstash for 20 min first and 60m later to test backlog processing. The sharding definitely wo... [13:30:26] !log upgrade remaining parsoid nodes to PHP 7.2.31 [13:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:20] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: EventLogging to EventGate: - SearchSatisfaction on testwiki version 1.1.0 - T249261 (duration: 00m 58s) [13:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:24] T249261: Vertical: Migrate SearchSatisfaction EventLogging event stream to Event Platform - https://phabricator.wikimedia.org/T249261 [13:33:22] (CR) Jbond: [C: +2] profile::idp::memcached: move SSL termination to memcached [puppet] - https://gerrit.wikimedia.org/r/605947 (https://phabricator.wikimedia.org/T233933) (owner: Jbond) [13:33:27] (PS1) Gergő Tisza: Fix help panel sizing logic [extensions/GrowthExperiments] (wmf/1.35.0-wmf.37) - https://gerrit.wikimedia.org/r/606186 (https://phabricator.wikimedia.org/T255607) [13:34:26] (PS15) DCausse: Consolidate query_service profile duplication [puppet] - https://gerrit.wikimedia.org/r/599146 (owner: EBernhardson) [13:34:28] (PS3) DCausse: [wdqs] drop updater mode config [puppet] - https://gerrit.wikimedia.org/r/602353 [13:34:30] (PS17) DCausse: [wdqs] add a new streaming updater profile [puppet] - https://gerrit.wikimedia.org/r/597790 [13:34:58] (PS10) DCausse: Revert "Revert "Role for SDoC WDQS"" [puppet] - https://gerrit.wikimedia.org/r/602171 (owner: EBernhardson) [13:35:44] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[13:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:52] (CR) jerkins-bot: [V: -1] [wdqs] add a new streaming updater profile [puppet] - https://gerrit.wikimedia.org/r/597790 (owner: DCausse) [13:37:08] (CR) Jforrester: "Want this deployed immediately?" [extensions/GrowthExperiments] (wmf/1.35.0-wmf.37) - https://gerrit.wikimedia.org/r/606186 (https://phabricator.wikimedia.org/T255607) (owner: Gergő Tisza) [13:37:22] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'mathoid' for release 'production' . [13:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:51] Operations, ops-codfw, DC-Ops: (Need By: TDB) rack/setup/install rdb200[78] - https://phabricator.wikimedia.org/T251626 (akosiaris) Resolved→Open Great! Thanks Papaul Repurposing the ticket for setting up the service. [13:40:51] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={swagger_check_cxserver_cluster_eqiad,swagger_check_mathoid_cluster_eqiad} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:41:01] PROBLEM - Memcached on idp-test2001 is CRITICAL: connect to address 208.80.153.25 and port 11000: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [13:41:27] PROBLEM - Check systemd state on idp-test2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:42:34] PROBLEM - LVS mathoid eqiad port 10042/tcp - Mathematical rendering service- mathoid.svc.eqiad.wmnet IPv4 #page on mathoid.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:42:39] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - mathoid_10042: Servers kubernetes1003.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:42:43] o/ [13:42:51] let me know if I can help [13:42:52] here as well [13:43:01] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - mathoid_10042: Servers kubernetes1001.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:43:04] o/ [13:43:09] here too [13:43:13] did we lose some k8s nodes? [13:43:25] dammit, that's probably me [13:43:37] should i start a doc? [13:43:44] no, but I just enabled after tests the others [13:43:51] ack [13:43:53] ack [13:43:55] jbond42: no need, we can rollback really quickly from this [13:44:51] !log redrain kubernetes1007-14 [13:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:05] I was pretty sure they were ready dammit. 
Even ran tests [13:45:51] RECOVERY - LVS mathoid eqiad port 10042/tcp - Mathematical rendering service- mathoid.svc.eqiad.wmnet IPv4 #page on mathoid.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 925 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:45:59] PROBLEM - Mathoid LVS eqiad on mathoid.svc.eqiad.wmnet is CRITICAL: / (spec from root) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mathoid [13:46:28] ACKED on VO by texting that number [13:46:44] here [13:46:45] it recovered anyway [13:47:09] * akosiaris finding out what failed ... [13:47:25] (PS2) Ottomata: EventLogging - use EventGate on group0 wikis for SearchSatisfaction [mediawiki-config] - https://gerrit.wikimedia.org/r/601732 (https://phabricator.wikimedia.org/T249261) [13:47:52] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:49:10] (CR) Gergő Tisza: "> Want this deployed immediately?" [extensions/GrowthExperiments] (wmf/1.35.0-wmf.37) - https://gerrit.wikimedia.org/r/606186 (https://phabricator.wikimedia.org/T255607) (owner: Gergő Tisza) [13:50:06] akosiaris: so what happened? new kubelets added to the pool, something didn't work (maybe bad BGP peerings or fw configs)... it doesn't quite add up to me that just mathoid failed, I'd expect either all things using NodePorts to break, or hm, maybe only the mathoid pods got rescheduled onto the new kubelets and the return path wasn't working?
[13:50:35] the mathoid fail was because I had deploy it to send some load to the new nodes [13:50:43] there was nothing else but mathoid on the new nodes [13:51:01] ack [13:51:04] (PS3) Ottomata: EventLogging - use EventGate on group0 wikis for SearchSatisfaction [mediawiki-config] - https://gerrit.wikimedia.org/r/601732 (https://phabricator.wikimedia.org/T249261) [13:51:10] not sure what did not work though yet [13:51:11] Operations, ops-codfw, DC-Ops: (Need By: TDB) rack/setup/install rdb200[78] - https://phabricator.wikimedia.org/T251626 (Papaul) @akosiaris the racking/setup ticket needs to be closed and open another ticket for the service. Thanks [13:51:30] (Abandoned) Elukey: dumps::web::fetches::stats: fix bash script for mediawiki_history_dumps [puppet] - https://gerrit.wikimedia.org/r/606183 (https://phabricator.wikimedia.org/T255485) (owner: Elukey) [13:51:42] RECOVERY - Mathoid LVS eqiad on mathoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mathoid [13:51:48] !log cleanup msw1-codfw interfaces [13:51:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:10] er, all codfw mgmt will be unreach for < 1min the time the switch rolls back automatically [13:53:58] good now [13:54:19] cdanis: "Commit was not confirmed; automatic rollback complete."
that's for sure something I like with Junos :) [13:54:28] eheh [13:55:17] (PS2) Dzahn: site: add mw2335-mw2339 as appservers [puppet] - https://gerrit.wikimedia.org/r/604429 (https://phabricator.wikimedia.org/T241852) [13:55:27] (CR) RhinosF1: [C: -1] "Assuming https://phabricator.wikimedia.org/T255675#6231642 will happen if this rolls forward, please disable on loginwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/599307 (https://phabricator.wikimedia.org/T252264) (owner: Esanders) [13:56:10] liw: ^ that's the patch to switch it on for prod [13:56:11] PROBLEM - puppet last run on labstore1007 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:56:16] Operations, WMF-Design, Design: Create sub-directory URL for Design blog (https://design.wikimedia.org/blog) - https://phabricator.wikimedia.org/T254118 (Dzahn) Great, I'm glad it worked out and it looks good to me, @Prtksxna :) [13:57:04] (PS1) Majavah: Do not use DiscussionTools on beta cluster loginwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/606188 (https://phabricator.wikimedia.org/T255675) [13:57:44] (CR) RhinosF1: [C: +1] Do not use DiscussionTools on beta cluster loginwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/606188 (https://phabricator.wikimedia.org/T255675) (owner: Majavah) [13:58:24] puppet on labstore1007 is elukey doing some work [13:58:33] (CR) RhinosF1: [C: -1] "should DiscussionToolsEnable be false as well" [mediawiki-config] - https://gerrit.wikimedia.org/r/606188 (https://phabricator.wikimedia.org/T255675) (owner: Majavah) [13:58:41] apergos: and ..it's actually running [13:59:01] it was disabled when i looked 2 mins ago [13:59:40] (CR) Dzahn: [C: +2] site: add mw2335-mw2339 as appservers [puppet] - https://gerrit.wikimedia.org/r/604429 (https://phabricator.wikimedia.org/T241852) (owner: Dzahn) [13:59:51] (CR) Majavah: "> Patch Set
1: Code-Review-1" [mediawiki-config] - https://gerrit.wikimedia.org/r/606188 (https://phabricator.wikimedia.org/T255675) (owner: Majavah) [13:59:56] RhinosF1, thanks, added link to train task [13:59:59] (PS3) Dzahn: site: add mw2335-mw2339 as appservers [puppet] - https://gerrit.wikimedia.org/r/604429 (https://phabricator.wikimedia.org/T241852) [14:00:30] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:01:15] liw: as long as they fix the config to enable it, it should be fine [14:01:30] RECOVERY - puppet last run on labstore1007 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:02:42] !log rebooting mw2335 through mw2339 (not in service) [14:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:19] (PS1) Elukey: dumps::web::fetches::analytics::job: use ls when checking dirs [puppet] - https://gerrit.wikimedia.org/r/606190 (https://phabricator.wikimedia.org/T255485) [14:11:50] (CR) Elukey: [C: +2] "Tested on labstore1007, works as expected. This is not a very elegant solution but refactoring the whole puppet code seems a bit overkill " [puppet] - https://gerrit.wikimedia.org/r/606190 (https://phabricator.wikimedia.org/T255485) (owner: Elukey) [14:13:06] Operations, Wikimedia-Logstash, observability: Increase logging pipeline ingestion capacity - https://phabricator.wikimedia.org/T255243 (fgiunchedi) To recap, my understanding so far is the following: 1. We're seeing a maximum of ~6-7k index/s per SSD host (total of 4 hosts), with 2 replicas and 4 s... [14:13:08] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:13:33] !log generating new mcrouter certs for mw2335 - mw2339 (T247021) [14:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:37] T247021: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 [14:18:38] cdanis: found it [14:18:53] 👀 [14:19:00] PEBKAC on my part. In a pretty error-prone part of the system. The allocation of ippools in calico [14:19:11] by a typo I had enter in eqiad 2 codfw ip blocks [14:19:15] entered* [14:19:34] thankfully the routers filtered it out (and hence the error) [14:20:05] it's the part that is manual and I was us to move into deployments-charts by using the calico CRDs for that. So we can have that in git and reviewed [14:20:14] and not manual actions as it is now :-( [14:20:24] ahhhhh [14:20:28] but that requires the calico upgrade we 've been fighting to do so for so long [14:20:29] (CR) Andrew Bogott: [C: +2] wmcs galera: add a second icinga check [puppet] - https://gerrit.wikimedia.org/r/606042 (https://phabricator.wikimedia.org/T242455) (owner: Andrew Bogott) [14:21:00] (PS1) Dzahn: add fake mcrouter certs for mw2335 - mw2339 [labs/private] - https://gerrit.wikimedia.org/r/606195 (https://phabricator.wikimedia.org/T247021) [14:21:55] Operations, Wikimedia-Logstash, observability, User-fgiunchedi: Increase logging pipeline ingestion capacity - https://phabricator.wikimedia.org/T255243 (fgiunchedi) [14:21:56] (CR) Dzahn: [V: +2 C: +2] add fake mcrouter certs for mw2335 - mw2339 [labs/private] - https://gerrit.wikimedia.org/r/606195 (https://phabricator.wikimedia.org/T247021) (owner: Dzahn) [14:22:13] !log uncordon kubernetes10{07..14} again [14:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:58] gonna try this 1
more time but with a non paging service this time around [14:24:37] (PS1) Dzahn: conftool: add mw2335 - mw2339 [puppet] - https://gerrit.wikimedia.org/r/606197 (https://phabricator.wikimedia.org/T247021) [14:25:45] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [14:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:12] (PS1) Hnowlan: cpjobqueue: increase wikibase-addUsagesForPage concurrency [deployment-charts] - https://gerrit.wikimedia.org/r/606198 [14:27:29] !log disabling puppet on icinga to avoid alert spam when adding new appservers [14:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:40] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [14:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:43] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:52] Operations, serviceops, Patch-For-Review: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (ops-monitoring-bot) Icinga downtime for 1:00:00 set by dzahn@cumin1001 on 5 host(s) and their services with reason: new_install ` mw[2335... [14:28:51] Operations, ops-codfw, DC-Ops: Put rdb200[78] into service - https://phabricator.wikimedia.org/T255681 (akosiaris) [14:28:52] (PS1) Marostegui: mariadb: Fix typo [puppet] - https://gerrit.wikimedia.org/r/606199 [14:28:58] jynus: ^ [14:29:20] Operations, ops-codfw, DC-Ops: Put rdb200[78] into service - https://phabricator.wikimedia.org/T255681 (akosiaris) p:Triage→High [14:29:24] Operations, ops-codfw, DC-Ops: (Need By: TDB) rack/setup/install rdb200[78] - https://phabricator.wikimedia.org/T251626 (akosiaris) Open→Resolved >>!
In T251626#6231628, @Papaul wrote: > @akosiaris the racking/setup ticket needs to be closed and open another ticket for the service. > > Than... [14:29:44] (CR) Dzahn: [C: +2] conftool: add mw2335 - mw2339 [puppet] - https://gerrit.wikimedia.org/r/606197 (https://phabricator.wikimedia.org/T247021) (owner: Dzahn) [14:29:45] ok, we are good this time around it seems :-) [14:29:54] (PS2) Dzahn: conftool: add mw2335 - mw2339 [puppet] - https://gerrit.wikimedia.org/r/606197 (https://phabricator.wikimedia.org/T247021) [14:29:56] marostegui: that is ok, but let me check db2091 now [14:30:11] wikifeeds is now running entirely on the new nodes [14:30:18] (CR) Milimetric: [C: +1] dumps::web::fetches::analytics::job: use ls when checking dirs [puppet] - https://gerrit.wikimedia.org/r/606190 (https://phabricator.wikimedia.org/T255485) (owner: Elukey) [14:30:29] however, I need to do 1 more thing first before putting them fully into service. [14:30:58] (CR) Jcrespo: [C: +1] "both look ok now" [puppet] - https://gerrit.wikimedia.org/r/606199 (owner: Marostegui) [14:30:58] !log redrain kubernetes1007-14 [14:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:07] (CR) Ppchelko: [C: +1] cpjobqueue: increase wikibase-addUsagesForPage concurrency [deployment-charts] - https://gerrit.wikimedia.org/r/606198 (owner: Hnowlan) [14:31:09] (CR) Marostegui: [C: +2] mariadb: Fix typo [puppet] - https://gerrit.wikimedia.org/r/606199 (owner: Marostegui) [14:31:27] (PS1) Andrew Bogott: Split out galera tests into two configs [puppet] - https://gerrit.wikimedia.org/r/606200 [14:32:06] PROBLEM - Check correctness of the icinga configuration on icinga1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [14:32:15] (CR) Andrew Bogott: [C: +2] Split out galera tests into two configs [puppet] - https://gerrit.wikimedia.org/r/606200 (owner: Andrew
Bogott) [14:33:08] icinga issue is a duplicate definition of the check_galera command, which i assume will be fixed by that merge above [14:33:19] just that puppet is disabled for a minute [14:33:27] to avoid unrelated alert spam [14:36:35] mutante: I fixed it by hand [14:36:46] hopefully when puppet is back on it won't break again, in theory it's fixed in puppet now [14:36:47] andrewbogott: ah, cool! [14:37:00] andrewbogott: i will check it after re-enabling puppet, ack [14:37:05] thanks [14:37:26] waiting for an initial puppet run on some new appservers [14:42:11] RECOVERY - Check correctness of the icinga configuration on icinga1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [14:43:17] !log mholloway-shell@deploy1001 Started deploy [recommendation-api/deploy@c39d567]: Update recommendation-api to db97742 [14:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:33] !log mholloway-shell@deploy1001 Finished deploy [recommendation-api/deploy@c39d567]: Update recommendation-api to db97742 (duration: 01m 16s) [14:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:16] jouncebot: next [14:45:16] In 3 hour(s) and 14 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200617T1800) [14:45:16] In 3 hour(s) and 14 minute(s): Morning backport window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200617T1800) [14:45:21] (CR) Jforrester: [C: +2] Fix help panel sizing logic [extensions/GrowthExperiments] (wmf/1.35.0-wmf.37) - https://gerrit.wikimedia.org/r/606186 (https://phabricator.wikimedia.org/T255607) (owner: Gergő Tisza) [14:46:04] (Abandoned) Jforrester: Ensure an array is passed to ApiEchoMute::lookupIds() [extensions/Echo] (wmf/1.35.0-wmf.36) - https://gerrit.wikimedia.org/r/604848 (https://phabricator.wikimedia.org/T254699) (owner: Krinkle) [14:46:10]
(Abandoned) Jforrester: Check for block in GlobalBlocking::getUserBlockDetails [extensions/GlobalBlocking] (wmf/1.35.0-wmf.36) - https://gerrit.wikimedia.org/r/604849 (https://phabricator.wikimedia.org/T254955) (owner: Krinkle) [14:48:52] (CR) Hnowlan: [C: +2] cpjobqueue: increase wikibase-addUsagesForPage concurrency [deployment-charts] - https://gerrit.wikimedia.org/r/606198 (owner: Hnowlan) [14:49:18] !log rolled back recommendation-api deployment due to canary endpoint check failure (T255683) [14:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:23] T255683: 2020-06-17 recommendation-api production deployment failed - https://phabricator.wikimedia.org/T255683 [14:49:30] (Merged) jenkins-bot: cpjobqueue: increase wikibase-addUsagesForPage concurrency [deployment-charts] - https://gerrit.wikimedia.org/r/606198 (owner: Hnowlan) [14:50:06] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [14:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:34] (PS1) Elukey: profile::analytics::backup::database: add mariadb-client [puppet] - https://gerrit.wikimedia.org/r/606203 (https://phabricator.wikimedia.org/T252740) [14:52:41] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:52:43] !log hnowlan@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [14:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:49] Operations, serviceops, Patch-For-Review: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (ops-monitoring-bot) Icinga downtime for 1:00:00 set by dzahn@cumin1001 on 5 host(s) and their services with reason: new_install ` mw[2335...
[14:53:23] (Merged) jenkins-bot: Fix help panel sizing logic [extensions/GrowthExperiments] (wmf/1.35.0-wmf.37) - https://gerrit.wikimedia.org/r/606186 (https://phabricator.wikimedia.org/T255607) (owner: Gergő Tisza) [14:54:39] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [14:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:44] (CR) CDanis: [C: +1] thanos: use object storage for data older than 15d [puppet] - https://gerrit.wikimedia.org/r/605950 (https://phabricator.wikimedia.org/T252186) (owner: Filippo Giunchedi) [14:57:27] anyone around here willing to deploy a betacluster-only config change? would like to be able to actually login there [14:57:50] Majavah: Sure. Which? [14:57:53] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/606188 [14:58:32] (CR) Jforrester: [C: +2] Do not use DiscussionTools on beta cluster loginwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/606188 (https://phabricator.wikimedia.org/T255675) (owner: Majavah) [14:58:33] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.37/extensions/GrowthExperiments/modules/help/ext.growthExperiments.HelpPanelProcessDialog.js: T255607 Fix help panel sizing logic (duration: 00m 56s) [14:58:35] (CR) Elukey: [C: +2] profile::analytics::backup::database: add mariadb-client [puppet] - https://gerrit.wikimedia.org/r/606203 (https://phabricator.wikimedia.org/T252740) (owner: Elukey) [14:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:37] T255607: [regression-wmf.37] Post-edit dialog display issue - https://phabricator.wikimedia.org/T255607 [14:58:49] Majavah: Ha, sorry.
[14:59:29] (Merged) jenkins-bot: Do not use DiscussionTools on beta cluster loginwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/606188 (https://phabricator.wikimedia.org/T255675) (owner: Majavah) [15:00:27] James_F: no worries, glad it was found on beta cluster and not on production [15:00:44] Majavah: Yeah, though in prod we'd notice it faster. [15:01:02] (PS1) Ema: ATS: unset Transfer-Encoding on 304 responses from origins [puppet] - https://gerrit.wikimedia.org/r/606204 (https://phabricator.wikimedia.org/T255368) [15:07:15] (PS6) Jforrester: Install MediaModeration extension - III: Install where enabled [mediawiki-config] - https://gerrit.wikimedia.org/r/599363 (https://phabricator.wikimedia.org/T247943) (owner: Cicalese) [15:07:23] (CR) Jforrester: [C: +2] Install MediaModeration extension - III: Install where enabled [mediawiki-config] - https://gerrit.wikimedia.org/r/599363 (https://phabricator.wikimedia.org/T247943) (owner: Cicalese) [15:08:09] (Merged) jenkins-bot: Install MediaModeration extension - III: Install where enabled [mediawiki-config] - https://gerrit.wikimedia.org/r/599363 (https://phabricator.wikimedia.org/T247943) (owner: Cicalese) [15:08:57] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw233[5-9].codfw.wmnet [15:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:27] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2339.codfw.wmnet [15:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:39] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2337.codfw.wmnet [15:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:49] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2336.codfw.wmnet [15:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:26] !log
dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2335.codfw.wmnet [15:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:34] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T247943 Install MediaModeration extension - III: Install where enabled (duration: 00m 56s) [15:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:37] T247943: Deploy MediaModeration Extension to Wikimedia Production - https://phabricator.wikimedia.org/T247943 [15:11:45] (PS1) CDanis: allow easy overriding of VRRP priority on all interfaces [homer/public] - https://gerrit.wikimedia.org/r/606206 [15:11:50] !log dzahn@cumin1001 conftool action : set/weight=15; selector: name=mw233[5-9].codfw.wmnet [15:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:05] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2338.codfw.wmnet [15:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:25] !log jforrester@deploy1001 Synchronized private/PrivateSettings.php: T247943 Add API key and recipient config for MediaModeration (duration: 00m 55s) [15:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:28] T247943: Deploy MediaModeration Extension to Wikimedia Production - https://phabricator.wikimedia.org/T247943 [15:17:44] ^^ CindyCicaleseWMF [15:18:29] \o/ [15:21:03] (PS1) Jforrester: Install MediaModeration extension - IV: Enable on Beta Clusetr [mediawiki-config] - https://gerrit.wikimedia.org/r/606209 (https://phabricator.wikimedia.org/T247943) [15:21:46] (CR) Jforrester: "> Patch Set 3: Code-Review-1" [mediawiki-config] - https://gerrit.wikimedia.org/r/599307 (https://phabricator.wikimedia.org/T252264) (owner: Esanders) [15:22:47] (PS3) CRusnov: acme_chief: Add netbox-dev keys [puppet] - https://gerrit.wikimedia.org/r/606040
(https://phabricator.wikimedia.org/T253140) [15:23:21] (CR) CRusnov: "Thanks :)" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/606040 (https://phabricator.wikimedia.org/T253140) (owner: CRusnov) [15:23:49] (CR) Jforrester: [C: +2] Install MediaModeration extension - IV: Enable on Beta Clusetr [mediawiki-config] - https://gerrit.wikimedia.org/r/606209 (https://phabricator.wikimedia.org/T247943) (owner: Jforrester) [15:24:25] Operations, ops-codfw, serviceops: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (Dzahn) >>! In T241852#6211870, @Papaul wrote: > @Dzahn the 5 servers in C3 are ready for services They are now in production. (details in T247021) [15:24:48] (Merged) jenkins-bot: Install MediaModeration extension - IV: Enable on Beta Clusetr [mediawiki-config] - https://gerrit.wikimedia.org/r/606209 (https://phabricator.wikimedia.org/T247943) (owner: Jforrester) [15:25:02] Operations, serviceops: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (Dzahn) mw2335 through mw2339 in rack C3 have also been taken into production now. This should complete the ticket. All hosts from mw2291 through mw2376 are po... [15:25:27] CindyCicaleseWMF: OK, should magically go live on Beta Cluster in a few minutes' time. [15:25:49] Operations, SRE-tools, Traffic, Goal, and 3 others: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 (crusnov) Nice, this is what we pretty much had in mind, although in the future of course if we add more prefixes or change them we'll have to...
[15:28:37] !log temp bump logstash7 workers to 8 and temp stop logstash - T255243 [15:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:40] T255243: Increase logging pipeline ingestion capacity - https://phabricator.wikimedia.org/T255243 [15:28:59] CindyCicaleseWMF: It's live. [15:29:20] Excellent, James_F. Thank you!! [15:30:09] Operations, ops-codfw, serviceops: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (Dzahn) @Papaul This ticket talks about mw2377 but it seems to me we never had a host mw2377 (only up to mw2376). Can you confirm that? [15:30:10] James_F: while we're talking about beta cluster, can you give global developer permissions there? they would have been handy couple of times when testing bugs [15:30:22] or do I need to file a phab ticket for that or something [15:31:15] Majavah: Sure. [15:31:42] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=jmx_logstash site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:32:32] Majavah: "global", not just on beta enwiki or whatever? [15:32:51] Operations, ops-codfw, serviceops: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (Dzahn) 76 servers are pooled as appservers. 10 have been used for kubernetes. Adds up to 86. [15:33:12] Operations, serviceops: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (Dzahn) Open→Resolved 76 servers are pooled as appservers. 10 have been used for kubernetes. Adds up to 86. [15:33:14] Operations, ops-codfw, serviceops: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (Dzahn) [15:33:45] Majavah: Done.
[15:34:01] James_F: thank you [15:34:14] Please don't abuse it, etc. etc. [15:34:33] n [15:34:41] ofc [15:34:52] also you really didn't see "Please do rights changes, global locking and blocking on Deployment Wiki, and not here" on beta meta main page? [15:35:07] Oh, who put that there? [15:35:12] Also, who reads the main page? [15:35:25] I generally do them on meta. [15:35:31] But I don't read its main page. ;-) [15:36:07] Also, who reads the instructions anywhere? [15:36:10] (PS4) CRusnov: acme_chief: Add netbox-dev keys [puppet] - https://gerrit.wikimedia.org/r/606040 (https://phabricator.wikimedia.org/T253140) [15:36:16] Indeed. [15:36:37] (CR) Dzahn: [C: +2] gerrit: Clarify that `container.javaOptions` is currently unused [puppet] - https://gerrit.wikimedia.org/r/606004 (owner: QChris) [15:37:10] (CR) Dzahn: [C: +2] gerrit: Quote `container.javaOptions` in config [puppet] - https://gerrit.wikimedia.org/r/606003 (owner: QChris) [15:37:23] (CR) CRusnov: [C: +2] acme_chief: Add netbox-dev keys [puppet] - https://gerrit.wikimedia.org/r/606040 (https://phabricator.wikimedia.org/T253140) (owner: CRusnov) [15:37:31] (PS3) Dzahn: gerrit: Clarify that `container.javaOptions` is currently unused [puppet] - https://gerrit.wikimedia.org/r/606004 (owner: QChris) [15:37:53] heh mutante i got your change too in puppet-merge [15:38:23] chaomodus: i would like to merge it together with other changes. could you say "no" and just merge yours? [15:38:26] most of the time it works [15:41:12] chaomodus: i guess it doesn't work in this case... i will merge yours in a minute, ok?
[15:41:25] mutante: yep no problem [15:41:57] (CR) Dzahn: [C: +2] gerrit: Split `container.javaOptions` settings onto separate lines [puppet] - https://gerrit.wikimedia.org/r/606005 (owner: QChris) [15:42:07] (PS3) Dzahn: gerrit: Split `container.javaOptions` settings onto separate lines [puppet] - https://gerrit.wikimedia.org/r/606005 (owner: QChris) [15:42:37] Operations, Phabricator, Traffic: Accessing Phabricator from Tor - https://phabricator.wikimedia.org/T254568 (sbassett) Removing #security-team as this is now managed by #operations. [15:42:49] (CR) Jbond: [C: +1] "LGTM" [homer/public] - https://gerrit.wikimedia.org/r/606206 (owner: CDanis) [15:43:40] chaomodus: you can try acme_chief now [15:45:05] mutante: cool thanks [15:46:10] (CR) ArielGlenn: "one nit and one thing that showed up after testing." (3 comments) [puppet] - https://gerrit.wikimedia.org/r/602645 (https://phabricator.wikimedia.org/T254480) (owner: Jbond) [15:47:57] Puppet, Phragile, Composer: Puppet fail due to composer install on Phragile instance - https://phabricator.wikimedia.org/T133967 (Aklapper) Open→Declined Declining this task as Phragile is not under development or maintained anymore - see T240308#6164990 [15:48:16] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 52 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:49:03] Operations, Phabricator, Traffic: Accessing Phabricator from Tor - https://phabricator.wikimedia.org/T254568 (Dzahn) @sbassett Where can Operations find information on where and how these blocks are configured? (also: T229620#5386233 , T218589) [15:51:16] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:54:06] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 46 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:56:52] (CR) Dzahn: [C: +2] gerrit: Escape quotes for pipeline commentlinks [puppet] - https://gerrit.wikimedia.org/r/606001 (owner: QChris) [15:57:00] (PS2) Dzahn: gerrit: Escape quotes for pipeline commentlinks [puppet] - https://gerrit.wikimedia.org/r/606001 (owner: QChris) [16:00:14] !log marostegui@cumin2001 dbctl commit (dc=all): 'Depool db1094', diff saved to https://phabricator.wikimedia.org/P11571 and previous config saved to /var/cache/conftool/dbconfig/20200617-160013-marostegui.json [16:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:54] RECOVERY - dump of m2 in eqiad on db2093 is OK: Last dump for m2 at eqiad (db1117.eqiad.wmnet:3322) taken on 2020-06-17 10:55:50 (426 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:05:55] Operations, Phabricator, Traffic: Accessing Phabricator from Tor - https://phabricator.wikimedia.org/T254568 (sbassett) @Dzahn - @herron and @chasemp would have the most domain knowledge about this right now, as they initially worked on T218784. @JBennett should also be able provide any broad securi...
[16:09:09] (CR) Dzahn: [C: +2] gerrit: Fix comment for enableReverseDnsLookup [puppet] - https://gerrit.wikimedia.org/r/606018 (owner: QChris)
[16:09:18] (PS2) Dzahn: gerrit: Fix comment for enableReverseDnsLookup [puppet] - https://gerrit.wikimedia.org/r/606018 (owner: QChris)
[16:11:20] (PS6) Privacybatm: [WIP] transferpy: Generate checksum parallel to the data transfer [software/transferpy] - https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979)
[16:19:03] (CR) Privacybatm: "> Patch Set 5:" [software/transferpy] - https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) (owner: Privacybatm)
[16:19:38] (CR) Bstorm: [C: +2] "I'm merging this and making a note to myself to check that it is doing the right thing next week." [puppet] - https://gerrit.wikimedia.org/r/605979 (https://phabricator.wikimedia.org/T127374) (owner: Bstorm)
[16:30:49] (PS1) Dzahn: mediawiki::maintenance: add server-header config [puppet] - https://gerrit.wikimedia.org/r/606218 (https://phabricator.wikimedia.org/T255629)
[16:54:02] (PS4) Ottomata: EventLogging - use EventGate on group0 wikis for SearchSatisfaction [mediawiki-config] - https://gerrit.wikimedia.org/r/601732 (https://phabricator.wikimedia.org/T249261)
[16:56:14] (CR) Ottomata: [C: +2] EventLogging - use EventGate on group0 wikis for SearchSatisfaction [mediawiki-config] - https://gerrit.wikimedia.org/r/601732 (https://phabricator.wikimedia.org/T249261) (owner: Ottomata)
[16:57:58] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: EventLogging to EventGate: - SearchSatisfaction on group0 wikis - T249261 (duration: 00m 56s)
[16:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:58:03] T249261: Vertical: Migrate SearchSatisfaction EventLogging event stream to Event Platform - https://phabricator.wikimedia.org/T249261
[17:04:31] (PS1) Cicalese: Add logging for
mediamoderation [mediawiki-config] - https://gerrit.wikimedia.org/r/606222
[17:04:57] (PS2) Cicalese: Add logging for mediamoderation [mediawiki-config] - https://gerrit.wikimedia.org/r/606222
[17:07:29] Operations, ops-codfw, netops: codfw: rack/setup new srx300 (mr1) - https://phabricator.wikimedia.org/T255577 (Papaul) @ayounsi there is an IRB interface setup on the new SRX300 . Are we going to use it ? if yes what will be the unit to use and the inet default config below ~~~ irb { unit 0 {...
[17:07:32] https://commons.wikimedia.beta.wmflabs.org Is this an issue with Wikibase or beta?
[17:16:02] (PS1) CRusnov: netbox: parameterize the acmechief profile and set it for netbox-dev2001 [puppet] - https://gerrit.wikimedia.org/r/606225
[17:22:03] (CR) Ppchelko: [C: -1] Add logging for mediamoderation (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/606222 (owner: Cicalese)
[17:23:29] (CR) CRusnov: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/606225 (owner: CRusnov)
[17:29:28] Operations, ops-codfw, netops: codfw: rack/setup new srx300 (mr1) - https://phabricator.wikimedia.org/T255577 (ayounsi) IRBs are used when several L2 devices are connected to the router and they all share the same vlan. For example in POPs where all mgmt switches terminate. In core sites, only the ms...
[17:36:29] Operations, ops-codfw, serviceops: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (Papaul) @Dzahn confirm we have up to mw2376
[17:39:20] (PS3) Cicalese: Add temporary logging for mediamoderation [mediawiki-config] - https://gerrit.wikimedia.org/r/606222 (https://phabricator.wikimedia.org/T247943)
[17:40:30] (CR) Ayounsi: "Thanks! That will be useful."
(1 comment) [homer/public] - https://gerrit.wikimedia.org/r/606206 (owner: CDanis)
[17:41:11] (CR) Ppchelko: [C: +1] Add temporary logging for mediamoderation [mediawiki-config] - https://gerrit.wikimedia.org/r/606222 (https://phabricator.wikimedia.org/T247943) (owner: Cicalese)
[17:41:23] Operations, SRE-tools, Traffic, Goal, and 3 others: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 (Volans) >>! In T233183#6231998, @crusnov wrote: > Nice, this is what we pretty much had in mind, although in the future of course if we add mo...
[17:46:29] (PS4) Cicalese: Add temporary logging for mediamoderation [mediawiki-config] - https://gerrit.wikimedia.org/r/606222 (https://phabricator.wikimedia.org/T247943)
[17:48:58] (PS4) Bartosz Dziewoński: Install DiscussionTools on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/599307 (https://phabricator.wikimedia.org/T252264) (owner: Esanders)
[17:49:57] (CR) Vgutierrez: "[huge-nitpick] technically you're choosing between acme-chief certificates, not profiles :)" [puppet] - https://gerrit.wikimedia.org/r/606225 (owner: CRusnov)
[17:50:48] Operations, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install - https://phabricator.wikimedia.org/T255520 (Jclark-ctr)
[17:51:37] (PS5) Bartosz Dziewoński: Install DiscussionTools on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/599307 (https://phabricator.wikimedia.org/T252264) (owner: Esanders)
[17:52:01] (PS2) CRusnov: netbox: parameterize the acmechief certificate and set it for netbox-dev2001 [puppet] - https://gerrit.wikimedia.org/r/606225
[17:52:44] (CR) CRusnov: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/606225 (owner: CRusnov)
[17:53:06] (CR) jerkins-bot: [V: -1] netbox: parameterize the acmechief certificate and set it for netbox-dev2001 [puppet] - https://gerrit.wikimedia.org/r/606225 (owner:
CRusnov)
[17:53:33] (CR) Bartosz Dziewoński: "@Rhinos Thank you for pointing this out, I added 'votewiki' => false, 'loginwiki' => false, to match the configuration of wmgUseLinter, th" [mediawiki-config] - https://gerrit.wikimedia.org/r/599307 (https://phabricator.wikimedia.org/T252264) (owner: Esanders)
[17:53:35] (CR) Vgutierrez: netbox: parameterize the acmechief certificate and set it for netbox-dev2001 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/606225 (owner: CRusnov)
[17:55:00] (PS6) Bartosz Dziewoński: Install DiscussionTools on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/599307 (https://phabricator.wikimedia.org/T252264) (owner: Esanders)
[17:55:45] (PS3) CRusnov: netbox: parameterize the acmechief certificate and set it for netbox-dev2001 [puppet] - https://gerrit.wikimedia.org/r/606225
[17:55:47] (CR) RhinosF1: [C: -1] "> @Rhinos Thank you for pointing this out, I added 'votewiki' =>" [mediawiki-config] - https://gerrit.wikimedia.org/r/599307 (https://phabricator.wikimedia.org/T252264) (owner: Esanders)
[17:55:54] (CR) Cicalese: Add temporary logging for mediamoderation (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/606222 (https://phabricator.wikimedia.org/T247943) (owner: Cicalese)
[17:56:26] (PS2) Bartosz Dziewoński: Set DiscussionToolsEnableVisual to true by default [mediawiki-config] - https://gerrit.wikimedia.org/r/605997 (https://phabricator.wikimedia.org/T251654) (owner: Esanders)
[17:58:34] (CR) CRusnov: "I have been blessed by the CI" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/606225 (owner: CRusnov)
[17:58:45] (CR) RhinosF1: [C: -1] "nonbetafeatures is just vote and login so only wikitech needs adding I believe" [mediawiki-config] - https://gerrit.wikimedia.org/r/599307 (https://phabricator.wikimedia.org/T252264) (owner: Esanders)
[17:59:14] MatmaRex: I believe you are bartosz so ^
[17:59:45] yes, looking
[18:00:04] liw and brennen: I, the Bot under the Fountain, allow thee, The Deployer, to do Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200617T1800).
[18:00:04] cicalese: A patch you scheduled for Train log triage with CPT is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[18:00:04] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Morning backport window(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200617T1800).
[18:00:04] tgr and MatmaRex: A patch you scheduled for Morning backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[18:00:32] here
[18:00:35] RhinosF1: yeah good point, we depend on VE too
[18:01:50] MatmaRex: As far as I can see, as long as you disable on wikitech then it should be fine.
[18:01:55] (PS1) Elukey: Decommission matomo1001 [puppet] - https://gerrit.wikimedia.org/r/606229 (https://phabricator.wikimedia.org/T252740)
[18:02:02] mine got emergency-deployed ahead of time (thanks James!)
[18:02:03] Make sure you check wikitech,login+vote all work
[18:02:34] (CR) Elukey: [C: +2] Decommission matomo1001 [puppet] - https://gerrit.wikimedia.org/r/606229 (https://phabricator.wikimedia.org/T252740) (owner: Elukey)
[18:02:50] (PS7) Bartosz Dziewoński: Install DiscussionTools on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/599307 (https://phabricator.wikimedia.org/T252264) (owner: Esanders)
[18:02:53] RhinosF1: does that look good to you? ^
[18:02:56] * RhinosF1 can't see a votewiki on labs so I assume it's *just* login we broke
[18:03:15] (why is 'nonbetafeatures' even a thing?)
[18:03:36] oops, just noticed I put my patch in the table cell above where it was supposed to go
[18:03:56] (CR) RhinosF1: [C: +1] "Looks fine now, of them 3, I only see a loginwiki on labs so I think we only broke login earlier. I'll follow that up though now." [mediawiki-config] - https://gerrit.wikimedia.org/r/599307 (https://phabricator.wikimedia.org/T252264) (owner: Esanders)
[18:04:06] MatmaRex: dblist
[18:04:16] But +1 from me
[18:04:17] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission
[18:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:04:32] yeah but why do we have a dblist for two wikis, and with a name that doesn't really make sense
[18:04:36] MatmaRex: can't you just check if wmgUseLinter and wmgUseVisualEditor are also true when calling wfLoadExtension()
[18:04:37] anyway, whatever
[18:05:16] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[18:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:06:09] Majavah: i can if you think that would be better. i didn't do it because then i'd need to make it a separate patch, because CommonSettings and InitialiseSettings will need to be synced in the right order
[18:06:37] also. is anyone deploying today? :)
[18:06:59] (PS1) Elukey: Remove records related to matomo1001 [dns] - https://gerrit.wikimedia.org/r/606230 (https://phabricator.wikimedia.org/T252740)
[18:07:00] MatmaRex: that would at least disallow breaking wikis, so I see an obvious benefit :-)
[18:07:44] (CR) Elukey: [C: +2] Remove records related to matomo1001 [dns] - https://gerrit.wikimedia.org/r/606230 (https://phabricator.wikimedia.org/T252740) (owner: Elukey)
[18:07:49] MatmaRex: I can deploy today!
[18:07:54] MatmaRex: I assume we have it because it's controlling numerous settings.
They must be a benefit as it says they're expensive
[18:08:32] Urbanecm: I sent you a pm earlier, please see
[18:09:06] MatmaRex: I assume 599307 needs to go first, is that right?
[18:09:43] Urbanecm: either order should be fine
[18:09:53] okay, thanks
[18:09:59] I'll proceed in the calendar order then
[18:10:05] (CR) Urbanecm: [C: +2] "B&C" [mediawiki-config] - https://gerrit.wikimedia.org/r/605997 (https://phabricator.wikimedia.org/T251654) (owner: Esanders)
[18:11:11] (Merged) jenkins-bot: Set DiscussionToolsEnableVisual to true by default [mediawiki-config] - https://gerrit.wikimedia.org/r/605997 (https://phabricator.wikimedia.org/T251654) (owner: Esanders)
[18:11:52] MatmaRex: available to test at mwdebug1001
[18:12:52] Urbanecm: looks good
[18:12:57] syncing
[18:13:03] thanks
[18:14:22] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: c9f6452: Set DiscussionToolsEnableVisual to true by default (T251654) (duration: 00m 56s)
[18:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:14:27] T251654: Deploy Replying v2.0 as Beta Feature to partner wikis - https://phabricator.wikimedia.org/T251654
[18:14:39] Majavah: you're right, but i also notice that we don't do that with other extensions that have dependencies (e.g. wmgUseContentTranslation does not check for wmgUseVisualEditor first, even though it depends on it).
i don't want to introduce a new convention, and i also don't want to do it while making actual configuration changes
[18:15:37] maybe it's a mistake that we don't do that, but i don't really want to lead the change here, sorry
[18:16:58] !log milimetric@deploy1001 Started deploy [analytics/refinery@6640d6f]: Quick fix for data quality bundles
[18:16:59] (CR) Urbanecm: [C: +2] "B&C" [mediawiki-config] - https://gerrit.wikimedia.org/r/599307 (https://phabricator.wikimedia.org/T252264) (owner: Esanders)
[18:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:07] (PS8) Urbanecm: Install DiscussionTools on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/599307 (https://phabricator.wikimedia.org/T252264) (owner: Esanders)
[18:17:14] (CR) Urbanecm: Install DiscussionTools on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/599307 (https://phabricator.wikimedia.org/T252264) (owner: Esanders)
[18:17:19] (CR) Urbanecm: [C: +2] "B&C" [mediawiki-config] - https://gerrit.wikimedia.org/r/599307 (https://phabricator.wikimedia.org/T252264) (owner: Esanders)
[18:18:06] (Merged) jenkins-bot: Install DiscussionTools on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/599307 (https://phabricator.wikimedia.org/T252264) (owner: Esanders)
[18:19:28] MatmaRex: available at mwdebug1001, could you have a look, please?
[18:19:48] (PS5) Urbanecm: Add temporary logging for mediamoderation [mediawiki-config] - https://gerrit.wikimedia.org/r/606222 (https://phabricator.wikimedia.org/T247943) (owner: Cicalese)
[18:20:16] Urbanecm: looks good!
[18:20:21] MatmaRex: syncing!
[18:21:05] (PS2) CDanis: allow easy overriding of VRRP priority on all interfaces & update docs [homer/public] - https://gerrit.wikimedia.org/r/606206
[18:21:31] (PS1) Bstorm: unattendedupgrades: allow configurable kernel cleanup [puppet] - https://gerrit.wikimedia.org/r/606234 (https://phabricator.wikimedia.org/T127374)
[18:21:38] !log urbanecm@deploy1001 scap failed: average error rate on 9/9 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/e474f13ffac6b8c3bf919c4aeafc8c9b for details)
[18:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:46] uhuh
[18:21:48] (PS3) CDanis: allow easy overriding of VRRP priority on all interfaces & update docs [homer/public] - https://gerrit.wikimedia.org/r/606206
[18:22:29] MatmaRex: reverting, see logmsgbot's entry above
[18:22:40] (CR) jerkins-bot: [V: -1] unattendedupgrades: allow configurable kernel cleanup [puppet] - https://gerrit.wikimedia.org/r/606234 (https://phabricator.wikimedia.org/T127374) (owner: Bstorm)
[18:22:48] !log urbanecm@deploy1001 scap failed: average error rate on 3/9 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/e474f13ffac6b8c3bf919c4aeafc8c9b for details)
[18:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:22:50] commons broken?
[18:22:53] huh
[18:23:12] (PS2) Bstorm: unattendedupgrades: allow configurable kernel cleanup [puppet] - https://gerrit.wikimedia.org/r/606234 (https://phabricator.wikimedia.org/T127374)
[18:23:14] I can't sync the reverted version?
[18:23:17] (PS4) CDanis: allow easy overriding of VRRP priority on all interfaces & update docs [homer/public] - https://gerrit.wikimedia.org/r/606206
[18:23:24] Urbanecm: you sometimes need to --force reverts
[18:23:28] the canary check is not very smart
[18:23:29] yeah commons breaks with mwdebug1001 with that patch
[18:23:34] okay, using --revert
[18:23:36] *--force
[18:23:58] thanks cdanis
[18:24:06] I assume thats why commons on beta cluster is also broken
[18:24:12] (PS3) Bstorm: unattendedupgrades: allow configurable kernel cleanup [puppet] - https://gerrit.wikimedia.org/r/606234 (https://phabricator.wikimedia.org/T127374)
[18:24:15] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: REVERT: ae76450: Install DiscussionTools on all wikis (T252264; T253943) (duration: 00m 34s)
[18:24:17] (CR) jerkins-bot: [V: -1] unattendedupgrades: allow configurable kernel cleanup [puppet] - https://gerrit.wikimedia.org/r/606234 (https://phabricator.wikimedia.org/T127374) (owner: Bstorm)
[18:24:18] probably, I'll merge the revert once synced
[18:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:24:20] T252264: Deploy Reply tool via query string parameter to zh.wiki - https://phabricator.wikimedia.org/T252264
[18:24:20] T253943: Enable DiscussionTools at all wikis via query string - https://phabricator.wikimedia.org/T253943
[18:24:44] (CR) jerkins-bot: [V: -1] unattendedupgrades: allow configurable kernel cleanup [puppet] - https://gerrit.wikimedia.org/r/606234 (https://phabricator.wikimedia.org/T127374) (owner: Bstorm)
[18:25:00] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 52 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[18:25:36] (PS1) Urbanecm: Revert "Install DiscussionTools on all
wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/606235
[18:26:02] (PS2) Urbanecm: Revert "Install DiscussionTools on all wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/606235 (https://phabricator.wikimedia.org/T252264)
[18:26:09] the error message is "Role mediainfo is already defined", and it's not coming from anywhere inside DiscussionTools
[18:26:16] (CR) CDanis: "PTAL!" (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/606206 (owner: CDanis)
[18:26:25] (PS3) Urbanecm: Revert "Install DiscussionTools on all wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/606235 (https://phabricator.wikimedia.org/T252264)
[18:26:34] MatmaRex: that's why I'm confused
[18:26:35] (CR) Urbanecm: [C: +2] "caused fatals" [mediawiki-config] - https://gerrit.wikimedia.org/r/606235 (https://phabricator.wikimedia.org/T252264) (owner: Urbanecm)
[18:26:54] but I clearly saw that break when DiscussionTools was enabled on commons
[18:27:36] (Merged) jenkins-bot: Revert "Install DiscussionTools on all wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/606235 (https://phabricator.wikimedia.org/T252264) (owner: Urbanecm)
[18:28:10] * RhinosF1 didn't see anything
[18:28:16] Code wise
[18:28:37] MatmaRex: I'm now going to deploy CindyCicaleseWMF's patch
[18:28:45] thanks!
[18:28:55] sure
[18:28:56] (PS6) Urbanecm: Add temporary logging for mediamoderation [mediawiki-config] - https://gerrit.wikimedia.org/r/606222 (https://phabricator.wikimedia.org/T247943) (owner: Cicalese)
[18:29:06] (CR) Urbanecm: [C: +2] "B&C" [mediawiki-config] - https://gerrit.wikimedia.org/r/606222 (https://phabricator.wikimedia.org/T247943) (owner: Cicalese)
[18:29:12] The mediainfo slot role is from WikibaseMediaInfo.
[18:29:29] Note that Commons is the only wiki with MCR slots.
[18:29:38] Possibly DT needs to change its code to work with slots?
[18:29:57] (Merged) jenkins-bot: Add temporary logging for mediamoderation [mediawiki-config] - https://gerrit.wikimedia.org/r/606222 (https://phabricator.wikimedia.org/T247943) (owner: Cicalese)
[18:30:14] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 47 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[18:30:17] I remember this error happening before but couldn't find a ticket that referenced it when I searched before
[18:30:17] CindyCicaleseWMF: syncing :)
[18:30:43] James_F: do you remember this issue cropping up in the past?
[18:30:51] marktraceur: No.
[18:31:19] (PS4) Bstorm: unattendedupgrades: allow configurable kernel cleanup [puppet] - https://gerrit.wikimedia.org/r/606234 (https://phabricator.wikimedia.org/T127374)
[18:31:46] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 96153f9: Add temporary logging for mediamoderation (T247943) (duration: 00m 56s)
[18:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:31:53] T247943: Deploy MediaModeration Extension to Wikimedia Production - https://phabricator.wikimedia.org/T247943
[18:32:38] CindyCicaleseWMF: should be live. The code will arrive to beta shortly, I can't affect that.
[18:32:45] excellent - thank you!
[18:32:50] happy to help!
[18:32:54] (PS4) CRusnov: netbox: parameterize the acmechief certificate and set it for netbox-dev2001 [puppet] - https://gerrit.wikimedia.org/r/606225
[18:34:11] (CR) CRusnov: [C: +2] netbox: parameterize the acmechief certificate and set it for netbox-dev2001 [puppet] - https://gerrit.wikimedia.org/r/606225 (owner: CRusnov)
[18:35:09] MatmaRex: I assume there's nothing else to do now.
[18:36:36] Urbanecm: i was wondering if we should try enabling everywhere except Commons to see if this only happens with the WikibaseMediaInfoHooks extension?
[18:38:54] MatmaRex: if you want to do that, I think we can
[18:39:20] marktraceur: to me this looks like something calls the onMediaWikiServices hook while we're already inside a onMediaWikiServices hook handler. that's the only way i can see for the WikibaseMediaInfoHooks code that fails here to be called twice
[18:40:15] Huh, that's not great
[18:41:07] Urbanecm: eh, actually, let's not do it. thank you
[18:41:22] i'll write up a task with some details
[18:41:25] okay, that's fine.
[18:41:32] !log Morning B&C window done
[18:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:41:45] thanks for deploying (and reverting!)
[18:42:36] happy to help!
[18:44:53] !log milimetric@deploy1001 Finished deploy [analytics/refinery@6640d6f]: Quick fix for data quality bundles (duration: 27m 55s)
[18:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:47:01] (PS1) Cicalese: DO NOT MERGE Remove temporary logging for mediamoderation [mediawiki-config] - https://gerrit.wikimedia.org/r/606239 (https://phabricator.wikimedia.org/T247943)
[18:49:35] (CR) Cicalese: [C: -1] DO NOT MERGE Remove temporary logging for mediamoderation [mediawiki-config] - https://gerrit.wikimedia.org/r/606239 (https://phabricator.wikimedia.org/T247943) (owner: Cicalese)
[18:49:56] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[18:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:51:06] Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (wiki_willy) @Andrew - just wanted to keep you posted with the latest update on this from my bi-weekly meeting with Dell to...
[18:52:31] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[18:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:55:59] Operations, Developer Productivity: Apache error log noise "Connection refused: AH00957: FCGI: attempt to connect to 127.0.0.1" on mwdebug1001 - https://phabricator.wikimedia.org/T236401 (Krinkle)
[18:56:25] Operations, Developer Productivity: Apache error log noise "Connection refused: AH00957: FCGI: attempt to connect to 127.0.0.1" on mwdebug1001 - https://phabricator.wikimedia.org/T236401 (Krinkle) Debug-only, not a prod failure. Tagging as devprod given the noise on mwdebug is scaring people routinely wh...
[18:56:49] (CR) Hashar: "I have build the image locally with:" [docker-images/production-images] - https://gerrit.wikimedia.org/r/605653 (owner: Hashar)
[18:57:45] !log milimetric@deploy1001 Started deploy [analytics/refinery@6640d6f] (thin): Quick fix for data quality bundles (THIN)
[18:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:57:53] Operations, Release-Engineering-Team, serviceops, Developer Productivity, and 2 others: All debug hosts give (likely spurious) message: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp) - https://phabricator.wikimedia.org/T214734 (Krinkle) Debug-only, not...
[18:57:56] !log milimetric@deploy1001 Finished deploy [analytics/refinery@6640d6f] (thin): Quick fix for data quality bundles (THIN) (duration: 00m 10s)
[18:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:00:04] liw and brennen: It is that lovely time of the day again! You are hereby commanded to deploy Mediawiki train - European+American Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200617T1900).
[19:01:00] train status: currently on all wikis; i'm keeping an eye on T255704 in case a rollback seems necessary.
[19:01:01] T255704: Fatal LogicException: Role mediainfo is already defined - https://phabricator.wikimedia.org/T255704
[19:02:42] MatmaRex: marktraceur: Commons beta is still down, even the change arrived to the beta servers already
[19:02:53] Urbanecm: re https://phabricator.wikimedia.org/T255704 did that get enabled on all sites in beta first?
[19:03:10] Urbanecm: IS-labs.php?
[19:03:18] no...
[19:03:42] huh
[19:04:08] * addshore thinks we should define a set flow of sites for feature enables like that to flow through before hitting prod. I'm guessing setting the default to true in beta first would have spotted that issue, as would it being enabled on group0!
[19:04:23] Urbanecm: i think there is actually a separate config for labs
[19:04:24] 'wmgUseDiscussionTools' => [
[19:04:24] 'default' => true,
[19:04:24] 'loginwiki' => false,
[19:04:24] ],
[19:04:36] so it actually was enabled at beta before?
[19:04:37] so it's still enabled there, according to this
[19:04:53] yesterday
[19:04:55] Urbanecm: yes
[19:04:59] Interesting, thats a shame and also super odd it didnt spot the issue :D
[19:05:10] * Urbanecm is going to disable it at beta commons then
[19:05:32] Urbanecm: that's how we found the loginwiki issue and didn't completely break prod!
[19:05:41] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer
[19:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:06:11] * addshore just spotted something to do with slot and figured he should at least look at the stacktraces :D
[19:06:15] * addshore goes back to doing other things
[19:06:18] addshore / MatmaRex: ah, so just so i'm clear, that error spike was an attempted config deploy and shouldn't recur?
[19:06:35] brennen: yes, I've put the links at https://phabricator.wikimedia.org/T255704
[19:06:36] Urbanecm: btw i filed https://phabricator.wikimedia.org/T255708 about this
[19:06:40] beautiful, thanks.
[19:06:47] (CR) Krinkle: [C: +1] "https://puppet-compiler.wmflabs.org/compiler1002/23305/" [puppet] - https://gerrit.wikimedia.org/r/606218 (https://phabricator.wikimedia.org/T255629) (owner: Dzahn)
[19:06:57] oh
[19:07:00] you already filed a task
[19:07:53] addshore: isnt beta/g0/1/2 already standard practice?
[19:07:54] jouncebot: now
[19:07:55] For the next 1 hour(s) and 52 minute(s): Mediawiki train - European+American Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200617T1900)
[19:08:00] (PS1) Urbanecm: Disable DiscussionTools at beta [mediawiki-config] - https://gerrit.wikimedia.org/r/606244 (https://phabricator.wikimedia.org/T255705)
[19:08:18] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[19:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:08:24] Krinkle: for enabling extensions that are already enabled at one or more WMF sites? I don't think so
[19:08:42] Urbanecm: this exception happens in prod as well
[19:08:53] LogicException mediainfo
[19:09:50] Krinkle: I see the exception ended with https://sal.toolforge.org/log/ybGGw3IBLkHzneNNQNbu, but maybe I'm missing something?
[19:09:58] (PS2) Krinkle: Disable DiscussionTools at beta [mediawiki-config] - https://gerrit.wikimedia.org/r/606244 (https://phabricator.wikimedia.org/T255704) (owner: Urbanecm)
[19:10:41] ah, you're referring to the bug ID I used in the commit message...
[19:11:32] Urbanecm: should commit msg say beta commons not beta?
[19:11:40] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer
[19:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:49] (PS3) Urbanecm: Disable DiscussionTools at beta commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/606244 (https://phabricator.wikimedia.org/T255704)
[19:11:50] right, fixed
[19:12:02] RhinosF1: was just about to suggest that :D
[19:12:16] Majavah: snap :)
[19:12:18] Urbanecm: ty
[19:12:31] Krinkle: this is perhaps indeed an edge case where it was already on some sites, but wikidata and commons are pretty different to "test rest" right now, using their test systems as a target for rolling to them directly would be good to codify somewhere
[19:12:43] Am on mobile so my typing is slower :/
[19:13:12] Majavah: so am I
[19:13:17] Also what Gerrit perms are needed to modify patches made by others? +2 on that repo or what?
[19:13:28] Majavah: trusted contributors, I think
[19:13:48] ^
[19:13:57] MatmaRex: I've added you to the group :)
[19:14:22] (CR) Krinkle: "This was reverted in https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/606235/ per T255704." [mediawiki-config] - https://gerrit.wikimedia.org/r/599307 (https://phabricator.wikimedia.org/T252264) (owner: Esanders)
[19:14:48] Krinkle: MatmaRex: I assume there aren't any issues with https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/606244 going out, is that right?
[19:15:11] yes, it's fine
[19:15:27] Majavah: Urbanecm just added you to trusted contributors - https://gerrit.wikimedia.org/r/#/admin/groups/1505,audit-log
[19:15:43] I noticed that
[19:15:49] :)
[19:16:18] (CR) Urbanecm: [C: +2] "[beta-only] to fix commons beta" [mediawiki-config] - https://gerrit.wikimedia.org/r/606244 (https://phabricator.wikimedia.org/T255704) (owner: Urbanecm)
[19:17:02] (Merged) jenkins-bot: Disable DiscussionTools at beta commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/606244 (https://phabricator.wikimedia.org/T255704) (owner: Urbanecm)
[19:17:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1094', diff saved to https://phabricator.wikimedia.org/P11572 and previous config saved to /var/cache/conftool/dbconfig/20200617-191723-marostegui.json
[19:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:45] Operations, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install - https://phabricator.wikimedia.org/T255520 (RobH) a:elukey
[19:21:33] !log Deploy schema change on s6 codfw master T238966
[19:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:21:39] T238966: Apply updates for MCR, actor migration, and content migration, to production wikis.
- https://phabricator.wikimedia.org/T238966
[19:22:20] thanks MatmaRex, commons beta is now up
[19:22:52] Yey
[19:27:24] (PS1) Hashar: zuul: add gerrit-test:29418 as a ssh known host [puppet] - https://gerrit.wikimedia.org/r/606249 (https://phabricator.wikimedia.org/T253263)
[19:27:39] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/606249 (https://phabricator.wikimedia.org/T253263) (owner: Hashar)
[19:28:55] (CR) Hashar: "The big devil is whether the known_hosts file will end up with BOTH entries bah" [puppet] - https://gerrit.wikimedia.org/r/606249 (https://phabricator.wikimedia.org/T253263) (owner: Hashar)
[19:31:05] (PS2) Hashar: zuul: add gerrit-test:29418 as a ssh known host [puppet] - https://gerrit.wikimedia.org/r/606249 (https://phabricator.wikimedia.org/T253263)
[19:31:19] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/606249 (https://phabricator.wikimedia.org/T253263) (owner: Hashar)
[19:38:32] (CR) Hashar: [V: +1 C: +1] "Puppet compiler is not helpful for this use case. What I have done to test it is:" [puppet] - https://gerrit.wikimedia.org/r/606249 (https://phabricator.wikimedia.org/T253263) (owner: Hashar)
[19:42:03] (CR) QChris: [C: +1] zuul: add gerrit-test:29418 as a ssh known host [puppet] - https://gerrit.wikimedia.org/r/606249 (https://phabricator.wikimedia.org/T253263) (owner: Hashar)
[19:42:45] (PS3) Hashar: zuul: add gerrit-test:29418 as a ssh known host [puppet] - https://gerrit.wikimedia.org/r/606249 (https://phabricator.wikimedia.org/T253263)
[19:43:36] (CR) CDanis: [C: +2] zuul: add gerrit-test:29418 as a ssh known host [puppet] - https://gerrit.wikimedia.org/r/606249 (https://phabricator.wikimedia.org/T253263) (owner: Hashar)
[20:00:05] halfak and accraze: Time to snap out of that daydream and deploy Services – Graphoid / Citoid / ORES. Get on with it.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200617T2000).
[20:01:03] (CR) CDanis: [C: +1] ATS: unset Transfer-Encoding on 304 responses from origins [puppet] - https://gerrit.wikimedia.org/r/606204 (https://phabricator.wikimedia.org/T255368) (owner: Ema)
[20:02:23] (CR) Cwhite: [C: +1] thanos: use object storage for data older than 15d [puppet] - https://gerrit.wikimedia.org/r/605950 (https://phabricator.wikimedia.org/T252186) (owner: Filippo Giunchedi)
[20:08:27] !log Stopped zuul-merger on contint1001
[20:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:13:56] (PS1) Paladox: phabricator: Change DEPLOYMENT_HOST -> CACH_HOSTS in ferm [puppet] - https://gerrit.wikimedia.org/r/606255
[20:14:00] RECOVERY - zuul_merger_service_running on contint1001 is OK: PROCS OK: 1 process with regex args bin/zuul-merger https://www.mediawiki.org/wiki/Continuous_integration/Zuul
[20:14:05] (CR) jerkins-bot: [V: -1] phabricator: Change DEPLOYMENT_HOST -> CACH_HOSTS in ferm [puppet] - https://gerrit.wikimedia.org/r/606255 (owner: Paladox)
[20:14:17] (PS2) Paladox: phabricator: Change DEPLOYMENT_HOST -> CACH_HOSTS in ferm [puppet] - https://gerrit.wikimedia.org/r/606255
[20:14:26] (CR) jerkins-bot: [V: -1] phabricator: Change DEPLOYMENT_HOST -> CACH_HOSTS in ferm [puppet] - https://gerrit.wikimedia.org/r/606255 (owner: Paladox)
[20:14:57] (CR) Alex Monk: "cache with an E, possibly?" [puppet] - https://gerrit.wikimedia.org/r/606255 (owner: Paladox)
[20:15:42] (PS3) Paladox: phabricator: Change DEPLOYMENT_HOST -> CACHE_HOSTS in ferm [puppet] - https://gerrit.wikimedia.org/r/606255
[20:15:51] (CR) Paladox: "> cache with an E, possibly?"
[puppet] - https://gerrit.wikimedia.org/r/606255 (owner: Paladox)
[20:15:53] (CR) jerkins-bot: [V: -1] phabricator: Change DEPLOYMENT_HOST -> CACHE_HOSTS in ferm [puppet] - https://gerrit.wikimedia.org/r/606255 (owner: Paladox)
[20:18:36] (PS4) Paladox: phabricator: Change DEPLOYMENT_HOST -> CACHE_HOSTS in ferm [puppet] - https://gerrit.wikimedia.org/r/606255
[20:18:44] (CR) jerkins-bot: [V: -1] phabricator: Change DEPLOYMENT_HOST -> CACHE_HOSTS in ferm [puppet] - https://gerrit.wikimedia.org/r/606255 (owner: Paladox)
[20:24:06] (CR) Paladox: "recheck" [puppet] - https://gerrit.wikimedia.org/r/606255 (owner: Paladox)
[20:24:36] (CR) Dzahn: [C: -1] phabricator: Change DEPLOYMENT_HOST -> CACHE_HOSTS in ferm (1 comment) [puppet] - https://gerrit.wikimedia.org/r/606255 (owner: Paladox)
[20:25:02] (PS5) Paladox: phabricator: Change DEPLOYMENT_HOSTS -> CACHE_HOSTS in ferm [puppet] - https://gerrit.wikimedia.org/r/606255
[20:25:10] (CR) jerkins-bot: [V: -1] phabricator: Change DEPLOYMENT_HOSTS -> CACHE_HOSTS in ferm [puppet] - https://gerrit.wikimedia.org/r/606255 (owner: Paladox)
[20:25:53] (CR) Paladox: phabricator: Change DEPLOYMENT_HOSTS -> CACHE_HOSTS in ferm (1 comment) [puppet] - https://gerrit.wikimedia.org/r/606255 (owner: Paladox)
[20:29:54] (CR) Paladox: "recheck" [puppet] - https://gerrit.wikimedia.org/r/606255 (owner: Paladox)
[20:29:54] PROBLEM - zuul_merger_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args bin/zuul-merger https://www.mediawiki.org/wiki/Continuous_integration/Zuul
[20:30:34] i think the zuul issue is known (-releng)
[20:31:42] RECOVERY - zuul_merger_service_running on contint1001 is OK: PROCS OK: 1 process with regex args bin/zuul-merger https://www.mediawiki.org/wiki/Continuous_integration/Zuul
[20:32:02] (PS6) Dzahn: phabricator: Change DEPLOYMENT_HOSTS -> CACHE_HOSTS in ferm [puppet] -
https://gerrit.wikimedia.org/r/606255 (owner: Paladox)
[20:32:39] !log Fixed up zuul-merger on contint1001 due to some faulty hotfix
[20:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:38:21] (PS1) Papaul: DNS: Remove mgmt entries for oresrdb2002 [dns] - https://gerrit.wikimedia.org/r/606262
[20:42:37] (CR) Papaul: [C: +2] DNS: Remove mgmt entries for oresrdb2002 [dns] - https://gerrit.wikimedia.org/r/606262 (owner: Papaul)
[21:01:02] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[21:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:01:18] (PS7) Dzahn: phabricator: Hiera'ize ferm srange for port 80 [puppet] - https://gerrit.wikimedia.org/r/606255 (owner: Paladox)
[21:04:56] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/23306/phab1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/606255 (owner: Paladox)
[21:08:19] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/23308/phabricator-prod-1001.devtools.eqiad.wmflabs/index.html" [puppet] - https://gerrit.wikimedia.org/r/606255 (owner: Paladox)
[21:13:36] (PS1) Andrew Bogott: openstack: add templatized database grants for openstack services [puppet] - https://gerrit.wikimedia.org/r/606266 (https://phabricator.wikimedia.org/T242455)
[21:14:44] (CR) jerkins-bot: [V: -1] openstack: add templatized database grants for openstack services [puppet] - https://gerrit.wikimedia.org/r/606266 (https://phabricator.wikimedia.org/T242455) (owner: Andrew Bogott)
[21:15:38] (PS1) Paladox: Fix setting profile::phabricator::main::http_srange [puppet] - https://gerrit.wikimedia.org/r/606267
[21:16:23] (PS2) Paladox: Fix setting profile::phabricator::main::http_srange [puppet] - https://gerrit.wikimedia.org/r/606267
[21:17:03] (CR) Dzahn: [C: +2] Fix setting profile::phabricator::main::http_srange
[puppet] - https://gerrit.wikimedia.org/r/606267 (owner: Paladox)
[21:20:06] (PS2) Andrew Bogott: openstack: add templatized database grants for openstack services [puppet] - https://gerrit.wikimedia.org/r/606266 (https://phabricator.wikimedia.org/T242455)
[21:21:13] (CR) jerkins-bot: [V: -1] openstack: add templatized database grants for openstack services [puppet] - https://gerrit.wikimedia.org/r/606266 (https://phabricator.wikimedia.org/T242455) (owner: Andrew Bogott)
[21:23:15] (PS3) Andrew Bogott: openstack: add templatized database grants for openstack services [puppet] - https://gerrit.wikimedia.org/r/606266 (https://phabricator.wikimedia.org/T242455)
[21:23:50] Operations, MediaWiki-General, Patch-For-Review, Sustainability (Incident Prevention): Stop a poolcounter server fail from being a SPOF for the service and the api (and the site) - https://phabricator.wikimedia.org/T105378 (tstarling) The log messages represent failure of all servers in the pool,...
[21:24:22] (CR) jerkins-bot: [V: -1] openstack: add templatized database grants for openstack services [puppet] - https://gerrit.wikimedia.org/r/606266 (https://phabricator.wikimedia.org/T242455) (owner: Andrew Bogott)
[21:24:28] (PS1) CRusnov: netbox: parameterize the acmechief certificate in scripts [puppet] - https://gerrit.wikimedia.org/r/606270
[21:25:14] (CR) CRusnov: "This change is ready for review."
[puppet] - https://gerrit.wikimedia.org/r/606270 (owner: CRusnov)
[21:27:56] (PS4) Andrew Bogott: openstack: add templatized database grants for openstack services [puppet] - https://gerrit.wikimedia.org/r/606266 (https://phabricator.wikimedia.org/T242455)
[21:28:43] (CR) CRusnov: [C: +2] netbox: parameterize the acmechief certificate in scripts [puppet] - https://gerrit.wikimedia.org/r/606270 (owner: CRusnov)
[21:34:37] (PS5) Andrew Bogott: openstack: add templatized database grants for openstack services [puppet] - https://gerrit.wikimedia.org/r/606266 (https://phabricator.wikimedia.org/T242455)
[21:37:38] (PS6) Andrew Bogott: openstack: add templatized database grants for openstack services [puppet] - https://gerrit.wikimedia.org/r/606266 (https://phabricator.wikimedia.org/T242455)
[21:42:36] (PS7) Andrew Bogott: openstack: add templatized database grants for openstack services [puppet] - https://gerrit.wikimedia.org/r/606266 (https://phabricator.wikimedia.org/T242455)
[21:54:58] (CR) Andrew Bogott: [C: +2] openstack: add templatized database grants for openstack services [puppet] - https://gerrit.wikimedia.org/r/606266 (https://phabricator.wikimedia.org/T242455) (owner: Andrew Bogott)
[21:58:25] (PS1) Bartosz Dziewoński: Use $wgLocaltimezone global instead of request context [extensions/DiscussionTools] (wmf/1.35.0-wmf.37) - https://gerrit.wikimedia.org/r/606279 (https://phabricator.wikimedia.org/T252264)
[21:58:39] (PS1) Bartosz Dziewoński: Install DiscussionTools on all wikis (attempt 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/606280 (https://phabricator.wikimedia.org/T252264)
[22:00:22] (CR) Bartosz Dziewoński: "Patch https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/606279 should resolve the issue that required the revert."
[mediawiki-config] - https://gerrit.wikimedia.org/r/606280 (https://phabricator.wikimedia.org/T252264) (owner: Bartosz Dziewoński)
[22:06:03] (PS1) Andrew Bogott: openstack: fix some .erb mistakes in the db grant template [puppet] - https://gerrit.wikimedia.org/r/606281
[22:06:33] (CR) Andrew Bogott: [C: +2] openstack: fix some .erb mistakes in the db grant template [puppet] - https://gerrit.wikimedia.org/r/606281 (owner: Andrew Bogott)
[22:09:21] (PS1) Andrew Bogott: Openstack db grant template: For the moment we're using raw passwords, not a hash [puppet] - https://gerrit.wikimedia.org/r/606283
[22:09:27] (CR) jerkins-bot: [V: -1] Openstack db grant template: For the moment we're using raw passwords, not a hash [puppet] - https://gerrit.wikimedia.org/r/606283 (owner: Andrew Bogott)
[22:10:08] (PS2) Bartosz Dziewoński: Install DiscussionTools on all wikis (attempt 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/606280 (https://phabricator.wikimedia.org/T252264)
[22:10:19] (PS2) Andrew Bogott: Openstack db grant template: we're using raw passwords, not a hash [puppet] - https://gerrit.wikimedia.org/r/606283
[22:10:58] (CR) Andrew Bogott: [C: +2] Openstack db grant template: we're using raw passwords, not a hash [puppet] - https://gerrit.wikimedia.org/r/606283 (owner: Andrew Bogott)
[22:18:38] (PS1) Andrew Bogott: openstack::db::project_grants: show_diff => false [puppet] - https://gerrit.wikimedia.org/r/606284
[22:19:24] (CR) Andrew Bogott: [C: +2] openstack::db::project_grants: show_diff => false [puppet] - https://gerrit.wikimedia.org/r/606284 (owner: Andrew Bogott)
[22:47:14] !log ryankemper@deploy1001 Started deploy [wdqs/wdqs@79fb82f]: 0.3.39
[22:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:48:19] (PS1) Andrew Bogott: codfw1dev glance: use galera on openstack.codfw1dev.wikimediacloud.org [puppet] -
https://gerrit.wikimedia.org/r/606285
[22:50:34] (PS1) Dzahn: jenkins: replace system user/group with systemd-sysuser [puppet] - https://gerrit.wikimedia.org/r/606286
[22:50:47] (CR) jerkins-bot: [V: -1] jenkins: replace system user/group with systemd-sysuser [puppet] - https://gerrit.wikimedia.org/r/606286 (owner: Dzahn)
[22:55:05] (PS2) Dzahn: jenkins: replace system user/group with systemd-sysuser [puppet] - https://gerrit.wikimedia.org/r/606286
[22:55:08] (PS1) Dzahn: planet: replace system/user group with systemd-sysuser [puppet] - https://gerrit.wikimedia.org/r/606287
[22:55:31] (CR) jerkins-bot: [V: -1] planet: replace system/user group with systemd-sysuser [puppet] - https://gerrit.wikimedia.org/r/606287 (owner: Dzahn)
[22:55:37] (CR) jerkins-bot: [V: -1] jenkins: replace system user/group with systemd-sysuser [puppet] - https://gerrit.wikimedia.org/r/606286 (owner: Dzahn)
[23:00:04] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Evening backport window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200617T2300). Please do the needful.
[23:00:04] MatmaRex: A patch you scheduled for Evening backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:23] hi
[23:00:39] hi MatmaRex :)
[23:00:42] I can deploy today!
[23:00:43] Operations, WMF-Design, Design: Create sub-directory URL for Design blog (https://design.wikimedia.org/blog) - https://phabricator.wikimedia.org/T254118 (Iniquity) >>! In T254118#6182683, @Iniquity wrote: > No matter, I missed. Sorry :) Oh, I was not mistaken. Can you add a link to the blog on the ma...
[23:00:56] (again :))
[23:01:06] Urbanecm: so, it's the same patch, but with a backport that should prevent it from exploding this time
[23:01:16] the backport needs to go first
[23:01:19] noted
[23:01:34] I guess there is no way to test the backport without also deploying the config patch, is that right?
[23:01:41] (do we need to backport to wmf.36 too? it doesn't seem to be deployed any more)
[23:01:48] (CR) Urbanecm: [C: +2] "B&C" [extensions/DiscussionTools] (wmf/1.35.0-wmf.37) - https://gerrit.wikimedia.org/r/606279 (https://phabricator.wikimedia.org/T252264) (owner: Bartosz Dziewoński)
[23:01:52] !log ryankemper@deploy1001 Finished deploy [wdqs/wdqs@79fb82f]: 0.3.39 (duration: 14m 38s)
[23:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:02:19] (PS2) Dzahn: planet: replace system/user group with systemd-sysuser [puppet] - https://gerrit.wikimedia.org/r/606287
[23:02:36] Urbanecm: we can test it on the wikis where DiscussionTools is already available
[23:02:47] (it should do nothing)
[23:03:22] I see, so I'll prepare both patches to the debug server, and hopefully commons will be fine :)
[23:04:55] it's probably wise to do wmf.36 too, in case train is going back, to prevent commons going down
[23:05:05] Urbanecm: oh also, we could try re-enabling it on beta commons first (reverting https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/606244)
[23:05:18] assuming that the DiscussionTools we merged on master is already deployed there
[23:05:22] (Merged) jenkins-bot: Use $wgLocaltimezone global instead of request context [extensions/DiscussionTools] (wmf/1.35.0-wmf.37) - https://gerrit.wikimedia.org/r/606279 (https://phabricator.wikimedia.org/T252264) (owner: Bartosz Dziewoński)
[23:05:58] (PS1) Urbanecm: Use $wgLocaltimezone global instead of request context [extensions/DiscussionTools] (wmf/1.35.0-wmf.36) - https://gerrit.wikimedia.org/r/606292
(https://phabricator.wikimedia.org/T255704)
[23:06:15] good idea, on it :)
[23:07:00] (PS1) Urbanecm: Revert "Disable DiscussionTools at beta commonswiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/606293 (https://phabricator.wikimedia.org/T255704)
[23:07:12] (CR) Urbanecm: [C: +2] "beta only" [mediawiki-config] - https://gerrit.wikimedia.org/r/606293 (https://phabricator.wikimedia.org/T255704) (owner: Urbanecm)
[23:07:22] (PS3) Dzahn: jenkins: replace system user/group with systemd-sysuser [puppet] - https://gerrit.wikimedia.org/r/606286
[23:07:57] (CR) jerkins-bot: [V: -1] jenkins: replace system user/group with systemd-sysuser [puppet] - https://gerrit.wikimedia.org/r/606286 (owner: Dzahn)
[23:08:04] (Merged) jenkins-bot: Revert "Disable DiscussionTools at beta commonswiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/606293 (https://phabricator.wikimedia.org/T255704) (owner: Urbanecm)
[23:08:09] (CR) Urbanecm: [C: +2] "B&C" [extensions/DiscussionTools] (wmf/1.35.0-wmf.36) - https://gerrit.wikimedia.org/r/606292 (https://phabricator.wikimedia.org/T255704) (owner: Urbanecm)
[23:09:36] (CR) jerkins-bot: [V: -1] Use $wgLocaltimezone global instead of request context [extensions/DiscussionTools] (wmf/1.35.0-wmf.36) - https://gerrit.wikimedia.org/r/606292 (https://phabricator.wikimedia.org/T255704) (owner: Urbanecm)
[23:09:47] MatmaRex: ^^
[23:10:40] ugh
[23:11:02] Operations, WMF-Design, Design: Create sub-directory URL for Design blog (https://design.wikimedia.org/blog) - https://phabricator.wikimedia.org/T254118 (Dzahn) a:Dzahn→Prtksxna
[23:11:25] Urbanecm: the Phan failures were fixed in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/604502 , but it didn't make it into wmf.36 .
it can probably be backported but it's kind of big
[23:12:14] Urbanecm: i'd prefer just force-merging the $wgLocaltimezone patch, it's tiny and it would fail in an obvious way if anything was wrong with it
[23:12:46] gotcha, I'm going to force-merge that, as it's an unrelated failure
[23:12:59] MatmaRex: but I need you to remove the -1, I can't do that...
[23:13:40] huh
[23:13:49] Urbanecm: done. i apparently can
[23:13:55] (CR) Urbanecm: [V: +2 C: +2] Use $wgLocaltimezone global instead of request context [extensions/DiscussionTools] (wmf/1.35.0-wmf.36) - https://gerrit.wikimedia.org/r/606292 (https://phabricator.wikimedia.org/T255704) (owner: Urbanecm)
[23:13:58] even though i can't +2/-2 myself
[23:14:05] (in wmf branches)
[23:14:22] yeah, it's kinda weird - I think there is a task for that :). Merged.
[23:14:39] (PS4) Dzahn: jenkins: replace system user/group with systemd-sysuser [puppet] - https://gerrit.wikimedia.org/r/606286
[23:14:43] the beta-only patch should be at beta commons now, and I confirm it's up
[23:14:52] (CR) Bartosz Dziewoński: "(the Phan failures were fixed in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/604502 , but it didn't make it in" [extensions/DiscussionTools] (wmf/1.35.0-wmf.36) - https://gerrit.wikimedia.org/r/606292 (https://phabricator.wikimedia.org/T255704) (owner: Urbanecm)
[23:15:12] nice
[23:15:32] (CR) Urbanecm: [C: +2] "B&C" [mediawiki-config] - https://gerrit.wikimedia.org/r/606280 (https://phabricator.wikimedia.org/T252264) (owner: Bartosz Dziewoński)
[23:16:21] (Merged) jenkins-bot: Install DiscussionTools on all wikis (attempt 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/606280 (https://phabricator.wikimedia.org/T252264) (owner: Bartosz Dziewoński)
[23:17:51] MatmaRex: both backport and config is at mwdebug1001, could you test please?
[23:18:43] yeah
[23:19:29] commons still looks up, and DiscussionTools works as expected
[23:19:33] Urbanecm: seems good to me
[23:19:44] thanks, syncing
[23:19:52] (syncing backport now)
[23:21:49] !log urbanecm@deploy1001 Synchronized php-1.35.0-wmf.37/extensions/DiscussionTools/includes/Hooks.php: 4551d29: Use $wgLocaltimezone global instead of request context (T252264; T253943; T255704) (duration: 00m 58s)
[23:21:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:21:56] T252264: Deploy Reply tool via query string parameter to zh.wiki - https://phabricator.wikimedia.org/T252264
[23:21:56] T255704: Fatal LogicException: Role mediainfo is already defined (DiscussionTools conflicts with WikibaseMediaInfo) - https://phabricator.wikimedia.org/T255704
[23:21:56] T253943: Enable DiscussionTools at all wikis via query string - https://phabricator.wikimedia.org/T253943
[23:23:20] !log urbanecm@deploy1001 Synchronized php-1.35.0-wmf.36/extensions/DiscussionTools/includes/Hooks.php: ff01083: Use $wgLocaltimezone global instead of request context (T255704) (duration: 00m 57s)
[23:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:23:33] and now the config patch itself...
[23:25:03] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 0e7079d: Install DiscussionTools on all wikis (attempt 2) (T252264; T253943) (duration: 00m 56s)
[23:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:25:09] MatmaRex: done!
[23:25:49] Urbanecm: thank you!
[23:26:07] no problem :)
[23:26:10] I'm glad it works now
[23:27:01] (CR) Andrew Bogott: [C: +2] codfw1dev glance: use galera on openstack.codfw1dev.wikimediacloud.org [puppet] - https://gerrit.wikimedia.org/r/606285 (owner: Andrew Bogott)
[23:28:20] (CR) Dzahn: "> Patch Set 4:" [dns] - https://gerrit.wikimedia.org/r/599883 (https://phabricator.wikimedia.org/T252526) (owner: Dzahn)
[23:55:29] (PS1) Bartosz Dziewoński: Use 'nonbetafeatures' dblist more instead of listing both wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/606295
[23:57:39] (CR) Bartosz Dziewoński: "Not sure if it's a good idea, but I found the inconsistency confusing when trying to figure out which wikis should have DiscussionTools." [mediawiki-config] - https://gerrit.wikimedia.org/r/606295 (owner: Bartosz Dziewoński)