[00:00:05] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210113T0000).
[00:00:05] No GERRIT patches in the queue for this window AFAICS.
[00:03:40] PROBLEM - PHP opcache health on mw2225 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:20:46] RECOVERY - Check systemd state on ms-be2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:25:29] (CR) Andrew Bogott: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/655762 (https://phabricator.wikimedia.org/T252199) (owner: Andrew Bogott)
[00:35:04] * Krinkle done testing on mwdebug1002
[00:42:48] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37
[00:48:48] PROBLEM - SSH on logstash1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:49:10] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1008 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f2a92cbe4e0: Failed to establish a new connection: [Errno 111] Connection
[00:49:10] ://wikitech.wikimedia.org/wiki/Search%23Administration
[00:50:18] RECOVERY - SSH on logstash1008 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:50:44] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1008 is OK: OK - elasticsearch status production-logstash-eqiad: unassigned_shards: 0, active_shards: 916, number_of_nodes: 6, relocating_shards: 4, active_shards_percent_as_number: 100.0, number_of_in_flight_fetch: 0, initializing_shards: 0, delayed_unassigned_shards: 0, active_primary_shards: 483, cluster_name: production-logstash-eqiad, status: green, task_max
[00:50:44] _millis: 0, number_of_data_nodes: 3, timed_out: False, number_of_pending_tasks: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:55:38] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 51970528 and 281 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:57:18] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 339880 and 380 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:14:43] (PS1) Herron: kibana7: change vhost from logstash-next to logstash.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/655802 (https://phabricator.wikimedia.org/T234854)
[01:31:20] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is OK: (C)100 gt (W)80 gt 77.29 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37
[02:11:29] PROBLEM - Check systemd state on ms-be2055 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:17:53] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 79, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:21:15] RECOVERY - Check systemd state on ms-be2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:13:11] PROBLEM - Check systemd state on ms-be2055 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:20:53] RECOVERY - Check systemd state on ms-be2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:08:14] (PS1) Marostegui: Revert "db1079: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/655777
[06:09:00] (CR) Marostegui: [C: +2] Revert "db1079: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/655777 (owner: Marostegui)
[06:10:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1079 (re)pooling @ 25%: After cloning db1155:3317', diff saved to https://phabricator.wikimedia.org/P13742 and previous config saved to /var/cache/conftool/dbconfig/20210113-061024-root.json
[06:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1079 (re)pooling @ 50%: After cloning db1155:3317', diff saved to https://phabricator.wikimedia.org/P13743 and previous config saved to /var/cache/conftool/dbconfig/20210113-062528-root.json
[06:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:11] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/655743 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[06:40:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1079 (re)pooling @ 75%: After cloning db1155:3317', diff saved to https://phabricator.wikimedia.org/P13744 and previous config saved to /var/cache/conftool/dbconfig/20210113-064031-root.json
[06:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:38] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/655731 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[06:44:25] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/655733 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[06:48:56] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/655734 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[06:50:52] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/652575 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[06:51:09] (PS1) Ryan Kemper: elasticsearch: spicerack now supports cloudelastic [cookbooks] - https://gerrit.wikimedia.org/r/655810 (https://phabricator.wikimedia.org/T268779)
[06:55:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1079 (re)pooling @ 100%: After cloning db1155:3317', diff saved to https://phabricator.wikimedia.org/P13745 and previous config saved to /var/cache/conftool/dbconfig/20210113-065535-root.json
[06:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:02] (CR) Muehlenhoff: [C: +2] Add a wrapper to optionally mail the output of a systemd timer [puppet] - https://gerrit.wikimedia.org/r/655077 (owner: Muehlenhoff)
[06:58:36] (CR) Ryan Kemper: [C: +2] elasticsearch: spicerack now supports cloudelastic [cookbooks] - https://gerrit.wikimedia.org/r/655810 (https://phabricator.wikimedia.org/T268779) (owner: Ryan Kemper)
[06:59:15] (CR) Ryan Kemper: [C: +2] "This only changes the elasticsearch arg parsing logic (allows a new type of clustergroup as supported in spicerack) so is low-risk" [cookbooks] - https://gerrit.wikimedia.org/r/655810 (https://phabricator.wikimedia.org/T268779) (owner: Ryan Kemper)
[07:01:11] (Merged) jenkins-bot: elasticsearch: spicerack now supports cloudelastic [cookbooks] - https://gerrit.wikimedia.org/r/655810 (https://phabricator.wikimedia.org/T268779) (owner: Ryan Kemper)
[07:03:50] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-restart
[07:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:35] !log T266492 T268779 T265699 Restarting cloudelastic to apply new readahead changes, this will also verify cloudelastic support works in our elasticsearch spicerack code. Only going one node at a time because cloudelastic elasticsearch indices only have 1 replica shard per index.
[07:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:40] T268779: Support cloudelastic in spicerack elasticsearch - https://phabricator.wikimedia.org/T268779
[07:04:42] T266492: Restart elasticsearch clusters to apply readahead changes - https://phabricator.wikimedia.org/T266492
[07:04:42] T265699: 40-elasticsearch-readahead udev rule failing for cloudelastic100[5,6] - https://phabricator.wikimedia.org/T265699
[07:06:15] (PS3) Elukey: Add cookbook to upgrade hadoop client nodes to Bigtop [cookbooks] - https://gerrit.wikimedia.org/r/655630 (https://phabricator.wikimedia.org/T269919)
[07:07:27] (CR) Muehlenhoff: [C: +2] Add a new option to enable mail output for a systemd timer [puppet] - https://gerrit.wikimedia.org/r/655628 (owner: Muehlenhoff)
[07:13:11] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37
[07:13:39] !log [WDQS Deploy] All tests passing on canary instance `wdqs1003` prior to start of deploy. Proceeding with canary deploy of version `0.3.59`...
[07:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:44] !log ryankemper@deploy1001 Started deploy [wdqs/wdqs@fdd2c2f]: 0.3.59
[07:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:59] !log [WDQS Deploy] All tests passing on canary instance `wdqs1003` following canary deploy. Proceeding to rest of fleet...
[07:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:58] (PS1) Muehlenhoff: Enable sending mail for Cumin alias check [puppet] - https://gerrit.wikimedia.org/r/655812
[07:28:07] !log ryankemper@deploy1001 Finished deploy [wdqs/wdqs@fdd2c2f]: 0.3.59 (duration: 14m 23s)
[07:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:40] !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts simultaneously: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'`
[07:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:01] !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'`
[07:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:24] !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'`
[07:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:32] (CR) Elukey: [C: +2] Add cookbook to upgrade hadoop client nodes to Bigtop [cookbooks] - https://gerrit.wikimedia.org/r/655630 (https://phabricator.wikimedia.org/T269919) (owner: Elukey)
[07:29:40] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users and wmf LDAP for fkaelin - https://phabricator.wikimedia.org/T267817 (elukey) Open→Resolved >>! In T267817#6741652, @fkaelin wrote: > I am trying to access Hue, and after looking at these tasks requesting access for Hue [[...
[07:38:46] (PS1) Elukey: sre.hadoop.change-distro-from-cdh-clients: fix cumin query [cookbooks] - https://gerrit.wikimedia.org/r/655858
[07:41:28] * elukey merges before Riccardos sees my pebcak
[07:41:35] *Riccardo :D
[07:50:00] (CR) Elukey: [C: +2] sre.hadoop.change-distro-from-cdh-clients: fix cumin query [cookbooks] - https://gerrit.wikimedia.org/r/655858 (owner: Elukey)
[07:50:58] clever :-)
[07:51:59] (CR) Muehlenhoff: [C: +2] Make check-cumin-aliases always return 0 [puppet] - https://gerrit.wikimedia.org/r/645311 (owner: Muehlenhoff)
[07:59:13] !log draining ganeti4001 for eventual reboot
[07:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:14] !log [WDQS Deploy] Deploy is complete, and the WDQS service is healthy
[08:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:17] (CR) Elukey: [C: +1] "I think it is good for the moment, the depool/repool use case is definitely good and we'll need to follow up also in the roll-restart cook" [cookbooks] - https://gerrit.wikimedia.org/r/651636 (https://phabricator.wikimedia.org/T269596) (owner: Razzi)
[08:06:41] (CR) Muehlenhoff: [C: +2] Enable sending mail for Cumin alias check [puppet] - https://gerrit.wikimedia.org/r/655812 (owner: Muehlenhoff)
[08:07:16] (PS1) Gergő Tisza: [no-op] GrowthExperiments: Disable link recommendations [mediawiki-config] - https://gerrit.wikimedia.org/r/655863 (https://phabricator.wikimedia.org/T261408)
[08:09:33] (CR) Giuseppe Lavagetto: "> Patch Set 2:" [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/655411 (https://phabricator.wikimedia.org/T219398) (owner: Giuseppe Lavagetto)
[08:13:43] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[08:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:36] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[08:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:28] !log draining ganeti4002 for eventual reboot
[08:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:46] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[08:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:20] (PS1) Elukey: cumin: update analytics coordinator's alias with the new replica host [puppet] - https://gerrit.wikimedia.org/r/655864
[08:26:51] (CR) Elukey: [C: +2] cumin: update analytics coordinator's alias with the new replica host [puppet] - https://gerrit.wikimedia.org/r/655864 (owner: Elukey)
[08:28:03] (PS1) Gergő Tisza: Add GrowthExperiments maintenance script [puppet] - https://gerrit.wikimedia.org/r/655865 (https://phabricator.wikimedia.org/T261408)
[08:29:09] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[08:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:17] (PS1) Elukey: sre.hadoop: add more client cumin aliases [cookbooks] - https://gerrit.wikimedia.org/r/655866
[08:35:04] !log failover ganeti master in ulsfo to ganeti4002
[08:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:15] (CR) Elukey: [C: +1] "Left a nit about importing, feel free to merge afterwards!" (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/651636 (https://phabricator.wikimedia.org/T269596) (owner: Razzi)
[08:37:51] (CR) Elukey: [C: +2] sre.hadoop: add more client cumin aliases [cookbooks] - https://gerrit.wikimedia.org/r/655866 (owner: Elukey)
[08:39:05] PROBLEM - ganeti-wconfd running on ganeti4003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[08:46:18] !log cp5008: re-enable puppet to undo JIT tslua experiment T265625
[08:46:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:21] T265625: ats-be occasional system CPU usage increase - https://phabricator.wikimedia.org/T265625
[08:47:58] !log draining ganeti4003 for eventual reboot
[08:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:45] SRE, Performance-Team, Traffic: ats-be occasional system CPU usage increase - https://phabricator.wikimedia.org/T265625 (ema) Disabling JIT in all Lua scripts resulted in significantly decreased CPU usage [[https://grafana.wikimedia.org/d/7-ZqK8-Wz/varnish-frontend-ttfb-comparison?viewPanel=4&orgId=1...
[08:54:49] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[08:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:23] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[09:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:07] !log installing efivar bugfix update
[09:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:14] !log ayounsi@deploy1001 Started deploy [homer/deploy@723ebfe]: Netbox 2.9 changes
[09:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:37] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-restart (exit_code=99)
[09:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:25] !log ayounsi@deploy1001 Finished deploy [homer/deploy@723ebfe]: Netbox 2.9 changes (duration: 03m 11s)
[09:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:03] (CR) Filippo Giunchedi: [C: +2] role: add interface::rps to swift::storage (1 comment) [puppet] - https://gerrit.wikimedia.org/r/655636 (https://phabricator.wikimedia.org/T271415) (owner: Filippo Giunchedi)
[09:16:39] SRE: Integrate Buster 10.7 point update - https://phabricator.wikimedia.org/T269558 (MoritzMuehlenhoff)
[09:17:54] (CR) Jbond: [C: +1] phabricator/phab_epipe.py: Port to Python 3 [puppet] - https://gerrit.wikimedia.org/r/655743 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[09:21:32] (PS1) Ema: ATS: disable JIT compiler on ats-be Lua scripts [puppet] - https://gerrit.wikimedia.org/r/655870 (https://phabricator.wikimedia.org/T265625)
[09:22:35] (CR) David Caro: "Thanks a lot for the review!" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/654419 (https://phabricator.wikimedia.org/T268877) (owner: David Caro)
[09:23:03] (CR) jerkins-bot: [V: -1] ATS: disable JIT compiler on ats-be Lua scripts [puppet] - https://gerrit.wikimedia.org/r/655870 (https://phabricator.wikimedia.org/T265625) (owner: Ema)
[09:23:46] (CR) Vgutierrez: [C: +1] ATS: disable JIT compiler on ats-be Lua scripts [puppet] - https://gerrit.wikimedia.org/r/655870 (https://phabricator.wikimedia.org/T265625) (owner: Ema)
[09:25:06] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/655731 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[09:25:58] (CR) Jbond: [C: +1] check_graphite_freshness.py: Port to Python 3 [puppet] - https://gerrit.wikimedia.org/r/655733 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[09:26:21] (PS1) Volans: netbox: fix check report for Netbox 2.9 [puppet] - https://gerrit.wikimedia.org/r/655871 (https://phabricator.wikimedia.org/T266487)
[09:26:43] (PS2) Ema: ATS: disable JIT compiler on ats-be Lua scripts [puppet] - https://gerrit.wikimedia.org/r/655870 (https://phabricator.wikimedia.org/T265625)
[09:27:01] (CR) Jbond: [C: +1] check_graphite.py: Port to Python 3 [puppet] - https://gerrit.wikimedia.org/r/655734 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[09:27:36] (CR) Jbond: [C: +1] modules/interface/files/interface-rps.py: Adapt for Python3 [puppet] - https://gerrit.wikimedia.org/r/652575 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[09:28:25] (CR) Ayounsi: [C: +1] netbox: fix check report for Netbox 2.9 [puppet] - https://gerrit.wikimedia.org/r/655871 (https://phabricator.wikimedia.org/T266487) (owner: Volans)
[09:28:56] (CR) Ema: [C: +2] ATS: disable JIT compiler on ats-be Lua scripts [puppet] - https://gerrit.wikimedia.org/r/655870 (https://phabricator.wikimedia.org/T265625) (owner: Ema)
[09:29:08] (PS2) Volans: netbox: fix check report for Netbox 2.9 [puppet] - https://gerrit.wikimedia.org/r/655871 (https://phabricator.wikimedia.org/T266487)
[09:31:03] (CR) Ayounsi: [C: +1] "That was my idea." [puppet] - https://gerrit.wikimedia.org/r/655871 (https://phabricator.wikimedia.org/T266487) (owner: Volans)
[09:31:39] (CR) Volans: [C: +2] "> Patch Set 2: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/655871 (https://phabricator.wikimedia.org/T266487) (owner: Volans)
[09:42:06] !log swift eqiad-prod: add weight to ms-be106[0-3] - T268435
[09:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:10] T268435: Add ms-be106[0-3] to swift - https://phabricator.wikimedia.org/T268435
[09:45:07] (PS1) Muehlenhoff: Make bast3005 a bastion [puppet] - https://gerrit.wikimedia.org/r/655872
[09:47:48] SRE, ops-eqsin, serviceops: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (MoritzMuehlenhoff) @RobH, what's the status here? Was the IPMI error reproducible on a second attempt?
[09:49:43] !log Enable report_host on all codfw sby masters - T271106
[09:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:46] T271106: Enable report_host on candidate masters - https://phabricator.wikimedia.org/T271106
[09:53:08] (CR) Jbond: [C: -1] cloud.encapi: enable ssl nginx vhost (2 comments) [puppet] - https://gerrit.wikimedia.org/r/654419 (https://phabricator.wikimedia.org/T268877) (owner: David Caro)
[09:56:08] PROBLEM - Check systemd state on db2100 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:57:58] (PS1) Volans: CHANGELOG: add changelogs for release v0.0.47 [software/spicerack] - https://gerrit.wikimedia.org/r/655873
[09:58:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1020', diff saved to https://phabricator.wikimedia.org/P13747 and previous config saved to /var/cache/conftool/dbconfig/20210113-095834-marostegui.json
[09:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:05] !log Enable report_host on es1020 T271106
[09:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:10] T271106: Enable report_host on candidate masters - https://phabricator.wikimedia.org/T271106
[09:59:36] (CR) Ema: [C: +1] varnish: migrate abuse_nets acl to abuse_networks hiera block (1 comment) [puppet] - https://gerrit.wikimedia.org/r/651174 (https://phabricator.wikimedia.org/T193762) (owner: Jbond)
[10:00:41] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/655872 (owner: Muehlenhoff)
[10:02:09] (PS3) Giuseppe Lavagetto: Always refresh the base images [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/655411 (https://phabricator.wikimedia.org/T219398)
[10:02:11] (PS3) Giuseppe Lavagetto: Add ability to separate the apt and the general http proxy [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/655412 (https://phabricator.wikimedia.org/T183545)
[10:02:22] RECOVERY - Check systemd state on db2100 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:02:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 25%: After restarting mysql', diff saved to https://phabricator.wikimedia.org/P13748 and previous config saved to /var/cache/conftool/dbconfig/20210113-100253-root.json
[10:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:47] SRE, Analytics: Request for Kerberos password - https://phabricator.wikimedia.org/T271845 (elukey) Open→Resolved a:elukey
[10:05:16] (CR) Ema: [C: +1] varnish: add phabricator specific ban in varnish (2 comments) [puppet] - https://gerrit.wikimedia.org/r/651171 (https://phabricator.wikimedia.org/T270618) (owner: Jbond)
[10:05:46] (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v0.0.47 [software/spicerack] - https://gerrit.wikimedia.org/r/655873 (owner: Volans)
[10:11:01] (Merged) jenkins-bot: CHANGELOG: add changelogs for release v0.0.47 [software/spicerack] - https://gerrit.wikimedia.org/r/655873 (owner: Volans)
[10:16:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1020', diff saved to https://phabricator.wikimedia.org/P13749 and previous config saved to /var/cache/conftool/dbconfig/20210113-101606-marostegui.json
[10:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:10] (PS1) Ema: ATS: disable JIT compiler on ats-tls too [puppet] - https://gerrit.wikimedia.org/r/655876 (https://phabricator.wikimedia.org/T244538)
[10:18:15] !log disable puppet on the cp::text to deploy block list changes 651174 + 651171
[10:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Reduce weight on es1021', diff saved to https://phabricator.wikimedia.org/P13750 and previous config saved to /var/cache/conftool/dbconfig/20210113-102245-marostegui.json
[10:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:35] (PS2) Jbond: varnish: add phabricator specific ban in varnish [puppet] - https://gerrit.wikimedia.org/r/651171 (https://phabricator.wikimedia.org/T270618)
[10:25:47] (CR) Vgutierrez: [C: +1] ATS: disable JIT compiler on ats-tls too [puppet] - https://gerrit.wikimedia.org/r/655876 (https://phabricator.wikimedia.org/T244538) (owner: Ema)
[10:26:11] (CR) Jbond: [C: +2] varnish: add phabricator specific ban in varnish [puppet] - https://gerrit.wikimedia.org/r/651171 (https://phabricator.wikimedia.org/T270618) (owner: Jbond)
[10:26:13] (CR) Jbond: [C: +2] varnish: migrate abuse_nets acl to abuse_networks hiera block [puppet] - https://gerrit.wikimedia.org/r/651174 (https://phabricator.wikimedia.org/T193762) (owner: Jbond)
[10:27:52] (PS1) Volans: Upstream release v0.0.47 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/655877
[10:28:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 25%: After restarting mysql', diff saved to https://phabricator.wikimedia.org/P13751 and previous config saved to /var/cache/conftool/dbconfig/20210113-102802-root.json
[10:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:49] SRE, Traffic, Beta-Cluster-reproducible, HTTPS: The certificate for upload.beta.wmflabs.org expired on January 12, 2021. - https://phabricator.wikimedia.org/T271808 (Aklapper) >>! In T271808#6739592, @RhinosF1 wrote: > But yeah I guess this should be fixed/monitored better so it doesn't need manu...
[10:34:07] (PS2) Volans: Upstream release v0.0.47 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/655877
[10:35:00] (CR) Elukey: [C: +1] Upstream release v0.0.47 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/655877 (owner: Volans)
[10:35:12] (CR) Ema: [C: +2] ATS: disable JIT compiler on ats-tls too [puppet] - https://gerrit.wikimedia.org/r/655876 (https://phabricator.wikimedia.org/T244538) (owner: Ema)
[10:35:32] !log puppet re-enabled on all cp-text hosts
[10:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:04] (CR) Volans: [C: +2] Upstream release v0.0.47 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/655877 (owner: Volans)
[10:36:44] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana-ssl_443: Servers logstash1007.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:39:50] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:40:26] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is OK: (C)100 gt (W)80 gt 77.29 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37
[10:41:45] (PS8) David Caro: cloud.encapi: enable ssl nginx vhost [puppet] - https://gerrit.wikimedia.org/r/654419 (https://phabricator.wikimedia.org/T268877)
[10:43:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 50%: After restarting mysql', diff saved to https://phabricator.wikimedia.org/P13753 and previous config saved to /var/cache/conftool/dbconfig/20210113-104305-root.json
[10:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:18] SRE, Traffic, Beta-Cluster-reproducible, HTTPS: The certificate for upload.beta.wmflabs.org expired on January 12, 2021. - https://phabricator.wikimedia.org/T271808 (RhinosF1) >>! In T271808#6742982, @Aklapper wrote: >>>! In T271808#6739592, @RhinosF1 wrote: >> But yeah I guess this should be fix...
[10:44:08] (PS9) Jbond: P:phabricator: remove apache level blocking [puppet] - https://gerrit.wikimedia.org/r/649882 (https://phabricator.wikimedia.org/T270285)
[10:47:48] (CR) Jbond: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27447/console" [puppet] - https://gerrit.wikimedia.org/r/649882 (https://phabricator.wikimedia.org/T270285) (owner: Jbond)
[10:48:28] (CR) Jbond: [V: +1] "ready for review" [puppet] - https://gerrit.wikimedia.org/r/649882 (https://phabricator.wikimedia.org/T270285) (owner: Jbond)
[10:48:46] SRE, docker-pkg, serviceops: Duplicate image name in docker-images/production-images - https://phabricator.wikimedia.org/T271901 (Joe)
[10:49:02] SRE, docker-pkg, serviceops: Duplicate image name in docker-images/production-images - https://phabricator.wikimedia.org/T271901 (Joe) p:Triage→Medium a:Joe
[10:51:58] (CR) Giuseppe Lavagetto: Switch the base image to buster from stretch. (2 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/615683 (owner: Giuseppe Lavagetto)
[10:55:10] (CR) David Caro: cloud.encapi: enable ssl nginx vhost (13 comments) [puppet] - https://gerrit.wikimedia.org/r/654419 (https://phabricator.wikimedia.org/T268877) (owner: David Caro)
[10:57:09] !log uploaded spicerack_0.0.47 to apt.wikimedia.org buster-wikimedia
[10:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 75%: After restarting mysql', diff saved to https://phabricator.wikimedia.org/P13754 and previous config saved to /var/cache/conftool/dbconfig/20210113-105809-root.json
[10:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:10] (CR) Arturo Borrero Gonzalez: [C: +2] cloud: neutron: conntrackd: fix gw address in filter [puppet] - https://gerrit.wikimedia.org/r/655650 (owner: Arturo Borrero Gonzalez)
[11:00:35] (CR) David Caro: cloud.encapi: enable ssl nginx vhost (1 comment) [puppet] - https://gerrit.wikimedia.org/r/654419 (https://phabricator.wikimedia.org/T268877) (owner: David Caro)
[11:04:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove weight on es4 the master', diff saved to https://phabricator.wikimedia.org/P13755 and previous config saved to /var/cache/conftool/dbconfig/20210113-110419-marostegui.json
[11:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:32] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana-ssl_443: Servers logstash1008.eqiad.wmnet are marked down but pooled: kibana_80: Servers logstash1008.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:04:54] (CR) Jbond: "think you missed one, otherwise looks good thanks <3" (8 comments) [puppet] - https://gerrit.wikimedia.org/r/654419 (https://phabricator.wikimedia.org/T268877) (owner: David Caro)
[11:07:36] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:09:42] (CR) Jbond: cloud.encapi: enable ssl nginx vhost (1 comment) [puppet] - https://gerrit.wikimedia.org/r/654419 (https://phabricator.wikimedia.org/T268877) (owner: David Caro)
[11:12:12] SRE, Analytics, Analytics-Kanban, Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (elukey) Reserved port 4992 in https://wikitech.wikimedia.org/wiki/Service_ports
[11:13:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 100%: After restarting mysql', diff saved to https://phabricator.wikimedia.org/P13756 and previous config saved to /var/cache/conftool/dbconfig/20210113-111312-root.json
[11:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:36] (PS2) KartikMistry: Update cxserver to 2021-01-12-095820-production [deployment-charts] - https://gerrit.wikimedia.org/r/655642 (https://phabricator.wikimedia.org/T234220)
[11:20:31] * kart_ updating cxserver
[11:20:47] (CR) KartikMistry: [C: +2] Update cxserver to 2021-01-12-095820-production [deployment-charts] - https://gerrit.wikimedia.org/r/655642 (https://phabricator.wikimedia.org/T234220) (owner: KartikMistry)
[11:20:49] (PS1) Elukey: Add eventstreams-internal k8s service dummy token config [labs/private] - https://gerrit.wikimedia.org/r/655879 (https://phabricator.wikimedia.org/T269160)
[11:22:02] (Merged) jenkins-bot: Update cxserver to 2021-01-12-095820-production [deployment-charts] - https://gerrit.wikimedia.org/r/655642 (https://phabricator.wikimedia.org/T234220) (owner: KartikMistry)
[11:23:56] !log kartik@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
[11:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:24] SRE, DBA: Refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) This is a recap of the new servers we can use to replace the old ones and their rack - I will do a 1:1 replacing planning once I have checked where the servers we have to replace are ra...
[11:32:12] (PS3) Effie Mouzeli: hiera: upgrade mc1029, mc2029 to buster [puppet] - https://gerrit.wikimedia.org/r/655373 (https://phabricator.wikimedia.org/T213089)
[11:32:50] (CR) Effie Mouzeli: [C: +2] hiera: upgrade mc1029, mc2029 to buster [puppet] - https://gerrit.wikimedia.org/r/655373 (https://phabricator.wikimedia.org/T213089) (owner: Effie Mouzeli)
[11:33:44] !log kartik@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' .
[11:33:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:52] (CR) David Caro: "Also tested again on cloud-puppetmaster-03" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/654419 (https://phabricator.wikimedia.org/T268877) (owner: David Caro)
[11:34:00] SRE, Platform Engineering, Wikidata, serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mc1029.eqiad.wmnet ` The log can be found i...
[11:34:15] SRE, Platform Engineering, Wikidata, serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mc2029.codfw.wmnet ` The log can be found i...
[11:34:22] (03PS9) 10David Caro: cloud.encapi: enable ssl nginx vhost [puppet] - 10https://gerrit.wikimedia.org/r/654419 (https://phabricator.wikimedia.org/T268877) [11:35:34] 10SRE, 10Analytics, 10Analytics-Kanban, 10Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10elukey) I am following https://wikitech.wikimedia.org/wiki/Kubernetes#Add_a_new_se... [11:37:18] (03CR) 10Jbond: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/654419 (https://phabricator.wikimedia.org/T268877) (owner: 10David Caro) [11:37:24] !log kartik@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' . [11:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:33] (03CR) 10David Caro: [C: 03+2] cloud.encapi: enable ssl nginx vhost [puppet] - 10https://gerrit.wikimedia.org/r/654419 (https://phabricator.wikimedia.org/T268877) (owner: 10David Caro) [11:40:01] (03PS2) 10Jbond: wmflib: create dir::split and dir::mkdir_p functions [puppet] - 10https://gerrit.wikimedia.org/r/655741 [11:40:03] (03PS1) 10Jbond: P:microsite::peopleweb: convert dirtree to mkdir_p [puppet] - 10https://gerrit.wikimedia.org/r/655882 [11:40:05] (03PS1) 10Jbond: wmflib: drop dirtree in favour of dir::split and dir::mkdir_p [puppet] - 10https://gerrit.wikimedia.org/r/655883 [11:40:26] !log Updated cxserver to 2021-01-12-095820-production (T234220, T270408) [11:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:30] T234220: cxserver error prevents translation of a section: Cannot read property 'replace' of undefined - https://phabricator.wikimedia.org/T234220 [11:40:31] T270408: Create Wikipedia Nias - https://phabricator.wikimedia.org/T270408 [11:41:08] (03CR) 10jerkins-bot: [V: 04-1] wmflib: create dir::split and dir::mkdir_p functions [puppet] - 
10https://gerrit.wikimedia.org/r/655741 (owner: 10Jbond) [11:47:49] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1029.eqiad.wmnet with reason: REIMAGE [11:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:41] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2029.codfw.wmnet with reason: REIMAGE [11:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:55] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1029.eqiad.wmnet with reason: REIMAGE [11:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:42] (03PS2) 10David Caro: last-puppet-run: don't crash if puppet has not run yet [puppet] - 10https://gerrit.wikimedia.org/r/641207 [11:52:03] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2029.codfw.wmnet with reason: REIMAGE [11:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:48] (03PS3) 10Jbond: wmflib: create dir::split and dir::mkdir_p functions [puppet] - 10https://gerrit.wikimedia.org/r/655741 [11:56:50] (03CR) 10jerkins-bot: [V: 04-1] wmflib: create dir::split and dir::mkdir_p functions [puppet] - 10https://gerrit.wikimedia.org/r/655741 (owner: 10Jbond) [11:59:27] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27450/console" [puppet] - 10https://gerrit.wikimedia.org/r/655882 (owner: 10Jbond) [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: That opportune time is upon us again. Time for a European mid-day backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210113T1200). [12:00:04] dcausse: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:05] (03PS4) 10Jbond: wmflib: create dir::split and dir::mkdir_p functions [puppet] - 10https://gerrit.wikimedia.org/r/655741 [12:00:14] o/ [12:00:25] I can deploy [12:01:35] (03PS3) 10DCausse: Revert "Disable sanity check cirrus jobs for Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655389 (https://phabricator.wikimedia.org/T239931) [12:02:45] (03CR) 10DCausse: [C: 03+2] Revert "Disable sanity check cirrus jobs for Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655389 (https://phabricator.wikimedia.org/T239931) (owner: 10DCausse) [12:02:53] (03CR) 10Urbanecm: [C: 04-2] "Ib167f0d919216c51e5b2b61c22c0ee973d804541 needs to be done first" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655786 (https://phabricator.wikimedia.org/T271828) (owner: 10Luke081515) [12:03:07] (03PS2) 10Jbond: P:microsite::peopleweb: convert dirtree to mkdir_p [puppet] - 10https://gerrit.wikimedia.org/r/655882 [12:03:13] (03PS2) 10Jbond: wmflib: drop dirtree in favour of dir::split and dir::mkdir_p [puppet] - 10https://gerrit.wikimedia.org/r/655883 [12:03:24] (03CR) 10Jbond: [C: 03+2] wmflib: create dir::split and dir::mkdir_p functions [puppet] - 10https://gerrit.wikimedia.org/r/655741 (owner: 10Jbond) [12:03:36] (03Merged) 10jenkins-bot: Revert "Disable sanity check cirrus jobs for Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655389 (https://phabricator.wikimedia.org/T239931) (owner: 10DCausse) [12:06:24] (03CR) 10Jbond: [C: 03+2] P:microsite::peopleweb: convert dirtree to mkdir_p [puppet] - 10https://gerrit.wikimedia.org/r/655882 (owner: 10Jbond) [12:06:28] (03CR) 10Jbond: [C: 03+2] wmflib: drop dirtree in favour of dir::split and dir::mkdir_p [puppet] - 10https://gerrit.wikimedia.org/r/655883 (owner: 10Jbond) [12:09:46] !log dcausse@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T239931: Revert "Disable sanity check cirrus jobs for 
Wikidata" (duration: 01m 16s) [12:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:50] T239931: Reduce the impact of the sanitizer on wikidata - https://phabricator.wikimedia.org/T239931 [12:14:01] 10SRE, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc1029.eqiad.wmnet'] ` Of which those **FAILED**: ` ['mc1029.eqiad.wmnet'] ` [12:15:03] !log European mid-day backport window done [12:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:46] 10SRE, 10Release-Engineering-Team, 10puppet-compiler, 10User-jbond: puppet documentation generation is missing some compnets - https://phabricator.wikimedia.org/T271909 (10jbond) p:05Triage→03Medium [12:18:02] (03PS3) 10Ayounsi: Run Homer during the decom cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/647629 [12:30:57] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana-ssl_443: Servers logstash1008.eqiad.wmnet are marked down but pooled: wdqs-internal_80: Servers wdqs1011.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:34:57] (03PS4) 10Ayounsi: Run Homer during the decom cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/647629 [12:36:12] 10SRE, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc2029.codfw.wmnet'] ` Of which those **FAILED**: ` ['mc2029.codfw.wmnet'] ` [12:36:41] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw1391.eqiad.wmnet, wdqs1011.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [12:38:13] (03PS5) 10Ayounsi: Run Homer during the decom cookbook 
[cookbooks] - 10https://gerrit.wikimedia.org/r/647629 [12:52:18] (03PS1) 10Phuedx: Use {{link-mainpage}} in legacy sidebar same as new logo [skins/Vector] (wmf/1.36.0-wmf.26) - 10https://gerrit.wikimedia.org/r/655779 (https://phabricator.wikimedia.org/T271873) [12:52:21] (03PS1) 10Arturo Borrero Gonzalez: neutron: conntrackd: double systemd watchdog timeout [puppet] - 10https://gerrit.wikimedia.org/r/655890 (https://phabricator.wikimedia.org/T268335) [12:53:52] (03CR) 10Ssingh: [C: 03+2] "> Patch Set 3: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/655763 (owner: 10Ssingh) [12:54:03] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] neutron: conntrackd: double systemd watchdog timeout [puppet] - 10https://gerrit.wikimedia.org/r/655890 (https://phabricator.wikimedia.org/T268335) (owner: 10Arturo Borrero Gonzalez) [12:54:45] Is the backport window still open? [12:55:54] jouncebot now [12:55:54] For the next 0 hour(s) and 4 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210113T1200) [12:57:21] https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/655779 fixes https://phabricator.wikimedia.org/T271873, which is a deployment blocker for 1.36.0-wmf.26 [12:57:31] *is a backport of the fix for [13:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210113T1300) [13:02:37] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/594492 (https://phabricator.wikimedia.org/T239459) (owner: 10Alexandros Kosiaris) [13:02:55] https://wikitech.wikimedia.org/wiki/RT states that procurement@ is still being handled in RT. Is that true, given that https://phabricator.wikimedia.org/S4 and https://phabricator.wikimedia.org/tag/procurement exist? 
[13:03:16] (plus https://phabricator.wikimedia.org/T242860#5807771 from a year ago) [13:04:03] (03Merged) 10jenkins-bot: eventgate, eventstreams: Log with namedlevels [deployment-charts] - 10https://gerrit.wikimedia.org/r/594492 (https://phabricator.wikimedia.org/T239459) (owner: 10Alexandros Kosiaris) [13:07:32] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [13:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:07] (03CR) 10Filippo Giunchedi: [C: 03+1] kibana7: change vhost from logstash-next to logstash.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/655802 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [13:17:29] (03CR) 10Filippo Giunchedi: [C: 03+1] ELK: promote logstash-next to logstash.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/655754 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [13:18:55] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, although untested" [puppet] - 10https://gerrit.wikimedia.org/r/655734 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [13:19:23] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, although untested" [puppet] - 10https://gerrit.wikimedia.org/r/655733 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [13:19:55] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, although untested" [puppet] - 10https://gerrit.wikimedia.org/r/655731 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [13:20:49] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:10] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [13:31:10] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . 
[13:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:21] phuedx: deployment blockers may be deployed at (almost) any time :) [13:33:11] phuedx: if you're available and able to test the patch, I'm happy to deploy this one for you [13:33:17] also ping liw , the train conductor [13:33:26] (it's about T271873) [13:33:27] T271873: [Legacy] Site logo in Vector links to current page instead of main page on mediawiki.org - https://phabricator.wikimedia.org/T271873 [13:35:02] Urbanecm: o/ Sorry. I was AFK for 15 minutes. I'm around to test the patch [13:35:43] (03CR) 10Urbanecm: [C: 03+2] "train blocker" [skins/Vector] (wmf/1.36.0-wmf.26) - 10https://gerrit.wikimedia.org/r/655779 (https://phabricator.wikimedia.org/T271873) (owner: 10Phuedx) [13:35:57] phuedx: excellent, let's wait for CI then. I'll ping you once ready. [13:36:51] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [13:36:52] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . 
[13:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:05] (03PS1) 10Jbond: ferm-status: add ability to ignore rules with a specific comment prefix [puppet] - 10https://gerrit.wikimedia.org/r/655896 (https://phabricator.wikimedia.org/T271702) [13:39:42] jbond42: Ah thanks for ^, /me reviewing [13:41:07] 10SRE, 10CX-cxserver, 10Product-Infrastructure-Team-Backlog, 10serviceops-radar, and 2 others: service-runner apps running on kubernetes emit logs with log level 50 - https://phabricator.wikimedia.org/T239459 (10akosiaris) [13:41:16] 10SRE, 10CX-cxserver, 10Product-Infrastructure-Team-Backlog, 10serviceops-radar, and 2 others: service-runner apps running on kubernetes emit logs with log level 50 - https://phabricator.wikimedia.org/T239459 (10akosiaris) eventstreams done. Double checked in logstash and I can see nice log levels now. [13:42:54] (03CR) 10Volans: [C: 03+1] "LGTM!" [cookbooks] - 10https://gerrit.wikimedia.org/r/647629 (owner: 10Ayounsi) [13:48:26] !log swift codfw-prod: more weight to ms-be20[58-61] - T269337 [13:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:29] T269337: Add ms-be20[58-61] to swift - https://phabricator.wikimedia.org/T269337 [13:49:10] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [13:49:10] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [13:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:25] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . 
[13:50:25] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [13:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:17] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [13:51:17] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [13:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:03] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [13:52:04] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . 
[13:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:40] (03PS1) 10Jbond: wmflib::dir::mkdir_p: use ensure-resource to avoid duplicate [puppet] - 10https://gerrit.wikimedia.org/r/655897 [13:53:33] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27451/console" [puppet] - 10https://gerrit.wikimedia.org/r/655897 (owner: 10Jbond) [13:55:43] (03CR) 10Jbond: [V: 03+1 C: 03+2] wmflib::dir::mkdir_p: use ensure-resource to avoid duplicate [puppet] - 10https://gerrit.wikimedia.org/r/655897 (owner: 10Jbond) [13:58:57] (03Merged) 10jenkins-bot: Use {{link-mainpage}} in legacy sidebar same as new logo [skins/Vector] (wmf/1.36.0-wmf.26) - 10https://gerrit.wikimedia.org/r/655779 (https://phabricator.wikimedia.org/T271873) (owner: 10Phuedx) [14:00:04] liw and longma: (Dis)respected human, time to deploy Mediawiki train - European+American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210113T1400). Please do the needful. 
[14:00:19] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventlogging: Remove profile::eventlogging::analytics::files [puppet] - 10https://gerrit.wikimedia.org/r/655791 (https://phabricator.wikimedia.org/T259030) (owner: 10Ladsgroup) [14:00:36] (03PS1) 10Ladsgroup: Close lrcwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655898 [14:00:47] (03PS1) 10Lars Wirzenius: group1 wikis to 1.36.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655899 [14:00:57] (03CR) 10Lars Wirzenius: [C: 03+2] group1 wikis to 1.36.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655899 (owner: 10Lars Wirzenius) [14:00:57] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:01:54] (03Merged) 10jenkins-bot: group1 wikis to 1.36.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655899 (owner: 10Lars Wirzenius) [14:02:24] Urbanecm: A friendly ping now that the change has been merged [14:02:44] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Nice. Couple of nitpicks, but otherwise /me likes :-)" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/655896 (https://phabricator.wikimedia.org/T271702) (owner: 10Jbond) [14:03:04] thanks phuedx [14:03:04] !log liw@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.26 [14:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:21] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:03:42] liw: do you mind me pushing a backport for a train blocker (T271873)? 
[14:03:43] T271873: [Legacy] Site logo in Vector links to current page instead of main page on mediawiki.org - https://phabricator.wikimedia.org/T271873 [14:04:08] !log liw@deploy1001 Synchronized php: group1 wikis to 1.36.0-wmf.26 (duration: 01m 03s) [14:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:25] Urbanecm, not right now, please, I'm promoting train to group1 and would like 15 minutes to let that finish and settle [14:05:03] liw: sure, ping me when ready (through note it's already merged :/) [14:05:33] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and wmf LDAP for fkaelin - https://phabricator.wikimedia.org/T267817 (10fkaelin) @elukey, thanks for the background and for adding my user to Hue - I was able to login. [14:05:42] Urbanecm, check; if you don't hear from me within 15 minutes, ping me or go ahead [14:06:30] liw: sure. thanks - sorry for merging it in the silent hour before train, I thought since there's an open blocker, it won't move forward :( [14:06:57] Urbanecm, no problem [14:07:26] (train at group1 now, nothing has exploded in four minutes) [14:12:47] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [14:12:47] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [14:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:11] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [14:14:11] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . 
[14:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:59] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [14:15:59] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [14:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:11] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [14:17:11] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [14:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:10] Urbanecm, go ahead when you're ready [14:18:33] Thanks low! [14:18:37] (03PS2) 10Jbond: ferm-status: add ability to ignore rules with a specific comment prefix [puppet] - 10https://gerrit.wikimedia.org/r/655896 (https://phabricator.wikimedia.org/T271702) [14:18:38] *liw [14:18:51] phuedx: still around? 
[14:19:02] Urbanecm: o/ [14:19:59] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1267.eqiad.wmnet are marked down but pooled: wdqs-internal_80: Servers wdqs1003.eqiad.wmnet are marked down but pooled: kibana_80: Servers logstash1008.eqiad.wmnet are marked down but pooled: kibana-ssl_443: Servers logstash1007.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:20:42] (03CR) 10Elukey: "Already added the config in the puppet private repo" [labs/private] - 10https://gerrit.wikimedia.org/r/655879 (https://phabricator.wikimedia.org/T269160) (owner: 10Elukey) [14:21:46] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2021-01-30) rack/setup/install frqueue2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T269481 (10Jgreen) [14:22:11] phuedx: available for you to test at mwdebug1001. Can you test and let me know how it looks? [14:22:19] On it [14:22:38] (03CR) 10Jbond: "thanks updated" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/655896 (https://phabricator.wikimedia.org/T271702) (owner: 10Jbond) [14:25:44] Urbanecm: Thanks. LGTM [14:25:55] phuedx: great, I'll sync it then [14:27:30] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.26/skins/Vector/includes/templates/legacy/Sidebar.mustache: 5a117ded68b5e0fc7f9b4a8a4513780e57eceefe: Use {{link-mainpage}} in legacy sidebar same as new logo (T271873) (duration: 01m 05s) [14:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:33] T271873: [Legacy] Site logo in Vector links to current page instead of main page on mediawiki.org - https://phabricator.wikimedia.org/T271873 [14:27:39] phuedx: should be live! [14:27:54] liw: I'm done, thanks! [14:28:06] Urbanecm, phuedx, thank you! [14:28:15] Urbanecm: Thanks! [14:28:18] no problem! 
[14:28:27] * liw filed two new train blocker in the meanwhile [14:28:33] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs1003.eqiad.wmnet, mw1389.eqiad.wmnet, mw1267.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [14:28:48] :( [14:29:48] Urbanecm, if the fix you deployed is on the servers now, can we close the task? T271873 that is [14:30:17] liw: I think so, but phuedx is probably a better person to ask that question [14:31:03] phuedx, can T271873 be closed? [14:31:07] liw: I think it can be closed [14:31:47] (03PS1) 10Filippo Giunchedi: role: add interface::rps to swift::storage [puppet] - 10https://gerrit.wikimedia.org/r/655902 (https://phabricator.wikimedia.org/T271415) [14:31:50] closed it, thank you [14:33:52] is anybody checking the pybal alerts? [14:34:15] gehel, ryankemper - o/ I see pybal not happy about wdqs1003, can you check? [14:34:50] dcausse, zpapierski: any chance you could have a look? [14:35:51] sure looking [14:35:53] <3 [14:36:11] I am checking the mw hosts [14:36:13] thanks! [14:36:16] (03CR) 10Alexandros Kosiaris: [C: 03+1] ferm-status: add ability to ignore rules with a specific comment prefix (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/655896 (https://phabricator.wikimedia.org/T271702) (owner: 10Jbond) [14:37:00] hm wdqs1003 seems functional... 
[14:37:15] liw: uploaded a patch for T271932, just needs a +2 :) [14:37:15] T271932: GlobalVarConfig::get: undefined option: 'wgDisableTextSearch' - https://phabricator.wikimedia.org/T271932 [14:37:46] (03PS3) 10Jbond: ferm-status: add ability to ignore rules with a specific comment prefix [puppet] - 10https://gerrit.wikimedia.org/r/655896 (https://phabricator.wikimedia.org/T271702) [14:38:10] Urbanecm, I wish I could do that, but I am too ignorant :( [14:38:26] Urbanecm, but thank you for quick work [14:38:46] dcausse: checking if there is anything ongoing with the lvs [14:39:10] thanks [14:39:23] it might be https://phabricator.wikimedia.org/T271087, the three hosts are in row A [14:39:24] (03CR) 10CDanis: [C: 03+1] "https://grafana.wikimedia.org/d/OERePosZk/cdanis-host-per-cpu-skew?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=swift&var-i" [puppet] - 10https://gerrit.wikimedia.org/r/655902 (https://phabricator.wikimedia.org/T271415) (owner: 10Filippo Giunchedi) [14:40:02] so lvs1016's leg on row a is faulty [14:40:27] XioNoX: around? [14:40:43] (03CR) 10Jbond: "presenting an example of a rule line with a quoted comment to make review a bit easier" [puppet] - 10https://gerrit.wikimedia.org/r/655896 (https://phabricator.wikimedia.org/T271702) (owner: 10Jbond) [14:40:45] (03PS1) 10Giuseppe Lavagetto: Fix duplicate detection when running in parallel [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/655904 [14:41:59] elukey: https://phabricator.wikimedia.org/T271087 ? 
[14:42:03] PROBLEM - SSH on ms-be2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:42:27] cdanis: yep I linked it above, I see on the switch that the port seems down :( [14:42:39] elukey@asw2-a-eqiad> show interfaces descriptions | match lvs1016 [14:42:42] xe-4/0/7 up down lvs1016:enp4s0f1 {#3917} [14:42:53] (03CR) 10Jbond: ferm-status: add ability to ignore rules with a specific comment prefix (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/655896 (https://phabricator.wikimedia.org/T271702) (owner: 10Jbond) [14:43:08] (03PS2) 10Giuseppe Lavagetto: Fix duplicate detection when running in parallel [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/655904 (https://phabricator.wikimedia.org/T271901) [14:43:33] RECOVERY - SSH on ms-be2017 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:43:58] hm :( [14:44:02] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/655896 (https://phabricator.wikimedia.org/T271702) (owner: 10Jbond) [14:44:24] 10SRE, 10ops-eqiad, 10DC-Ops: frdev1001 ILO inaccessible - https://phabricator.wikimedia.org/T267969 (10Jgreen) >>! In T267969#6740997, @Cmjohnson wrote: > @jgreen do you have our mgmt ports in a vlan? I don't think anyone has touched the server prior to the ILO becoming inaccessible. The problem happens of... [14:45:23] (03CR) 10Fdans: [C: 03+1] analytics:refinery:job:data_purge Activate netflow auto-purging [puppet] - 10https://gerrit.wikimedia.org/r/655120 (https://phabricator.wikimedia.org/T231339) (owner: 10Mforns) [14:46:01] cdanis: not sure what's best, in theory it should be fine if cmjohnson1 later on will swap cables etc.. 
[14:46:02] 10SRE, 10ops-eqiad, 10Traffic: Interface errors on asw2-a-eqiad:xe-4/0/7 (lvs1016) - https://phabricator.wikimedia.org/T271087 (10elukey) The interface seems having trouble at the moment, we have some icinga alerts about pybal not reaching row-a hosts: ` elukey@asw2-a-eqiad> show interfaces descriptions | m... [14:46:08] is lvs1016 still pooled? [14:46:28] ah good point, it might be the standby and the alert was an expired downtime [14:46:49] indeed [14:46:51] https://grafana.wikimedia.org/d/000000343/load-balancers-lvs?orgId=1&from=now-3h&to=now [14:46:56] lvs1016 is still at 0 [14:47:10] perfect then, I'll add info the task [14:47:18] and downtime the host [14:47:25] thanks for checking [14:47:37] 10SRE, 10ops-eqiad, 10Traffic: Interface errors on asw2-a-eqiad:xe-4/0/7 (lvs1016) - https://phabricator.wikimedia.org/T271087 (10ayounsi) p:05Medium→03High [14:47:44] yeah lvs1016 has bgp-med=100 [14:48:03] so is the secondary anyway [14:48:11] left a message on -traffic as well [14:49:46] (03CR) 10Ottomata: [C: 03+1] Add eventstreams-internal k8s service dummy token config [labs/private] - 10https://gerrit.wikimedia.org/r/655879 (https://phabricator.wikimedia.org/T269160) (owner: 10Elukey) [14:54:02] 10SRE, 10ops-eqiad, 10Traffic: Interface errors on asw2-a-eqiad:xe-4/0/7 (lvs1016) - https://phabricator.wikimedia.org/T271087 (10elukey) The lvs is a secondary so not taking traffic, added a day of downtime :) [14:55:08] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add eventstreams-internal k8s service dummy token config [labs/private] - 10https://gerrit.wikimedia.org/r/655879 (https://phabricator.wikimedia.org/T269160) (owner: 10Elukey) [14:55:44] (03PS4) 10Jbond: role and profile specs: add example spec test [puppet] - 10https://gerrit.wikimedia.org/r/642423 [14:57:16] (03CR) 10jerkins-bot: [V: 04-1] role and profile specs: add example spec test [puppet] - 10https://gerrit.wikimedia.org/r/642423 (owner: 10Jbond) [14:57:25] PROBLEM - Rate of JVM GC Old generation-s 
runs - logstash1010-production-logstash-eqiad on logstash1010 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37 [14:57:57] !log imported jenkins 2.263.2 (security release) to apt.wikimedia.org/buster-wikimedia [14:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:11] 10SRE, 10Traffic, 10netops: Create Generalised blocking stratagy - https://phabricator.wikimedia.org/T270618 (10CDanis) Thanks for the writeup with all the background! And for the cleanup patches so far :) Just a few things to add. Re: the schema of blocks themselves: * In addition to being able to specif... [14:59:52] (03PS5) 10Jbond: role and profile specs: add example spec test [puppet] - 10https://gerrit.wikimedia.org/r/642423 [15:00:09] (03CR) 10CDanis: [C: 03+1] pcc: filter out noise [puppet] - 10https://gerrit.wikimedia.org/r/651911 (owner: 10Jbond) [15:01:40] !log Upgraded Jenkins on releases1002 / releases2002 hosts # T271507 [15:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:04] 10SRE, 10Analytics, 10Patch-For-Review: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 (10CDanis) Hey @Ottomata, I meant to get around to this last quarter but didn't. Would very much like to get some mechanism in place soon -- do you have any i... 
[15:05:57] !log upgraded spicerack to 0.0.47-1+deb10u1 on cumin2001 - T257905 [15:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:11] T257905: Spin off common Spicerack modules into a standalone Python library importable anywhere - https://phabricator.wikimedia.org/T257905 [15:06:16] !log volans@cumin2001 START - Cookbook sre.hosts.downtime for 0:10:00 on cumin2001.codfw.wmnet with reason: volans's test [15:06:17] !log volans@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on cumin2001.codfw.wmnet with reason: volans's test [15:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:30] 10SRE, 10Analytics, 10Patch-For-Review: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 (10Ottomata) The long term solution here is still not clear and is very tied up with some other yet undefined long term projects, like Data Governance. Let's... 
[15:10:00] (03CR) 10Jbond: [C: 03+2] pcc: filter out noise [puppet] - 10https://gerrit.wikimedia.org/r/651911 (owner: 10Jbond) [15:10:33] (03CR) 10Jbond: [C: 03+2] pcc: add more info to the status message [puppet] - 10https://gerrit.wikimedia.org/r/651788 (https://phabricator.wikimedia.org/T270757) (owner: 10Jbond) [15:10:51] (03PS1) 10Urbanecm: GlobalVarConfig::get should not be provided with the wg prefix [extensions/ProofreadPage] (wmf/1.36.0-wmf.26) - 10https://gerrit.wikimedia.org/r/655780 (https://phabricator.wikimedia.org/T271932) [15:11:08] !log elukey@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters [15:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:53] this is the test cluster --^ [15:14:21] 10SRE, 10Analytics, 10Patch-For-Review: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 (10CDanis) >>! In T263496#6744057, @Ottomata wrote: > The long term solution here is still not clear and is very tied up with some other yet undefined long ter... 
[15:20:41] (03CR) 10Muehlenhoff: [C: 03+2] Make bast3005 a bastion [puppet] - 10https://gerrit.wikimedia.org/r/655872 (owner: 10Muehlenhoff) [15:22:57] !log Stopping Jenkins CI on contint2001 to upgrade Jenkins # T271507 [15:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:17] (03CR) 10Ottomata: VCL: Attach a variety of GeoIP info as bereq headers; test GeoIP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630316 (https://phabricator.wikimedia.org/T263496) (owner: 10CDanis) [15:29:13] (03CR) 10Elukey: "I followed the steps in https://wikitech.wikimedia.org/wiki/Kubernetes#Add_a_new_service to create the tokens, I am wondering if we need t" [deployment-charts] - 10https://gerrit.wikimedia.org/r/644612 (https://phabricator.wikimedia.org/T269160) (owner: 10Ottomata) [15:32:59] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): (Need By: TBD) rack/setup/install cloudgw2002-dev.codfw.wmnet - https://phabricator.wikimedia.org/T271590 (10aborrero) [15:34:16] (03CR) 10Ottomata: "> If nobody opposes I am going to verify what's missing and add it to this patch, assuming that what we already have is good (it seems ve" [deployment-charts] - 10https://gerrit.wikimedia.org/r/644612 (https://phabricator.wikimedia.org/T269160) (owner: 10Ottomata) [15:34:38] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10jijiki) [15:34:42] 10SRE: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10jijiki) [15:34:44] 10SRE, 10serviceops, 10Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10jijiki) [15:35:13] /q vg [15:35:16] 10SRE, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10jijiki) [15:35:19] almost! 
[15:35:49] 10SRE, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10jijiki) 05Open→03Resolved a:03jijiki Despite of what the above messages say, mc2029 and mc1029 were properly reimaged 🎉 [15:38:01] 10SRE, 10Platform Engineering, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar): Phasing out "redis_sessions" MediaWiki cluster - https://phabricator.wikimedia.org/T267581 (10jijiki) [15:38:03] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10jijiki) [15:38:08] 10SRE, 10Platform Engineering, 10Wikidata, 10serviceops, and 4 others: Upgrade memcached cluster to Debian Buster - https://phabricator.wikimedia.org/T213089 (10jijiki) [15:38:18] (03CR) 10CDanis: VCL: Attach a variety of GeoIP info as bereq headers; test GeoIP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630316 (https://phabricator.wikimedia.org/T263496) (owner: 10CDanis) [15:38:43] 10SRE, 10Platform Engineering, 10serviceops, 10Patch-For-Review, 10User-jijiki: Upgrade MediaWiki's Redis cluster to Debian Buster - https://phabricator.wikimedia.org/T265643 (10jijiki) 05Open→03Resolved a:03jijiki We ported version 2.8 to Buster, and all servers were upgraded as part of T213089. [15:38:51] 10SRE: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10jijiki) [15:41:04] 10SRE, 10serviceops, 10Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10jijiki) [15:41:12] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [15:41:12] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . 
[15:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:31] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [15:42:31] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [15:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:52] 10SRE, 10WMF-NDA-Requests: Request from WMDE intern Amrutha - https://phabricator.wikimedia.org/T271725 (10amy_rc) [15:44:08] 10SRE, 10CX-cxserver, 10Product-Infrastructure-Team-Backlog, 10serviceops-radar, and 2 others: service-runner apps running on kubernetes emit logs with log level 50 - https://phabricator.wikimedia.org/T239459 (10akosiaris) [15:44:33] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [15:44:33] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [15:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:27] (03PS1) 10Volans: wmf-auto-reimage: fix Netbox update for 2.9 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/655909 (https://phabricator.wikimedia.org/T257905) [15:45:29] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [15:45:29] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . 
[15:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:05] 10SRE, 10WMF-NDA-Requests: Request from WMDE intern Amrutha - https://phabricator.wikimedia.org/T271725 (10Lea_WMDE) I confirm that @amy_rc is an intern in my team, and I approve the request. [15:47:15] (03CR) 10jerkins-bot: [V: 04-1] wmf-auto-reimage: fix Netbox update for 2.9 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/655909 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [15:49:17] 10SRE, 10Wikifeeds, 10Wikimedia-Logstash, 10observability: Move wikifeeds to the logging pipeline - https://phabricator.wikimedia.org/T245604 (10akosiaris) [15:49:22] 10SRE, 10Product-Infrastructure-Team-Backlog, 10Proton, 10Wikimedia-Logstash, and 3 others: Move proton logging to new logging pipeline - https://phabricator.wikimedia.org/T219925 (10akosiaris) [15:49:25] 10SRE, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Deprecate all non-Kafka logstash inputs - https://phabricator.wikimedia.org/T227080 (10akosiaris) [15:49:28] 10SRE, 10Page Content Service, 10Product-Infrastructure-Team-Backlog, 10Wikimedia-Logstash, and 3 others: Move mobileapps logging to new logging pipeline - https://phabricator.wikimedia.org/T219924 (10akosiaris) [15:49:31] 10SRE, 10Citoid, 10Wikimedia-Logstash, 10observability, and 2 others: Move citoid logging to new logging pipeline - https://phabricator.wikimedia.org/T219919 (10akosiaris) [15:49:35] 10SRE, 10Wikimedia-Logstash, 10observability, 10service-runner, 10Patch-For-Review: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 (10akosiaris) [15:49:38] (03PS2) 10Volans: wmf-auto-reimage: fix Netbox update for 2.9 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/655909 (https://phabricator.wikimedia.org/T257905) [15:50:48] 10SRE, 10CX-cxserver, 
10Product-Infrastructure-Team-Backlog, 10serviceops-radar, and 2 others: service-runner apps running on kubernetes emit logs with log level 50 - https://phabricator.wikimedia.org/T239459 (10akosiaris) 05Open→03Resolved a:03akosiaris eventgate done. And with this, we can close th... [15:53:05] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1022 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:56:14] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) [15:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:52] !log upgraded spicerack to 0.0.47-1+deb10u1 on cumin1001 - T257905 [15:56:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:55] T257905: Spin off common Spicerack modules into a standalone Python library importable anywhere - https://phabricator.wikimedia.org/T257905 [15:57:28] jouncebot: now [15:57:28] For the next 0 hour(s) and 2 minute(s): Mediawiki train - European+American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210113T1400) [15:57:33] jouncebot: next [15:57:34] In 3 hour(s) and 2 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210113T1900) [15:57:34] In 3 hour(s) and 2 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210113T1900) [15:58:35] (03CR) 10Urbanecm: [C: 03+2] "UBN train blocker" [extensions/ProofreadPage] (wmf/1.36.0-wmf.26) - 10https://gerrit.wikimedia.org/r/655780 (https://phabricator.wikimedia.org/T271932) (owner: 10Urbanecm) [16:00:07] (03CR) 10Ayounsi: [C: 03+1] "Looks reasonable" [puppet] - 10https://gerrit.wikimedia.org/r/655909 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [16:00:59] (03CR) 10Volans: "> Patch Set 2: 
Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/655909 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [16:01:14] (03PS3) 10Volans: wmf-auto-reimage: fix Netbox update for 2.9 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/655909 (https://phabricator.wikimedia.org/T266487) [16:03:36] (03CR) 10Volans: [C: 03+2] wmf-auto-reimage: fix Netbox update for 2.9 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/655909 (https://phabricator.wikimedia.org/T266487) (owner: 10Volans) [16:04:06] (03Merged) 10jenkins-bot: GlobalVarConfig::get should not be provided with the wg prefix [extensions/ProofreadPage] (wmf/1.36.0-wmf.26) - 10https://gerrit.wikimedia.org/r/655780 (https://phabricator.wikimedia.org/T271932) (owner: 10Urbanecm) [16:06:52] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.26/extensions/ProofreadPage/includes/Special/SpecialProofreadPages.php: d73ba7c1aa92190903cd4b07fe3e8cf1bed13d70: GlobalVarConfig::get should not be provided with the wg prefix (T271932) (duration: 01m 07s) [16:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:55] T271932: GlobalVarConfig::get: undefined option: 'wgDisableTextSearch' - https://phabricator.wikimedia.org/T271932 [16:08:17] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:15:54] (03CR) 10Ayounsi: [C: 03+2] Run Homer during the decom cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/647629 (owner: 10Ayounsi) [16:15:57] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37 [16:17:58] !log jmm@cumin2001 START - Cookbook 
sre.ganeti.makevm [16:17:58] !log jmm@cumin2001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [16:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:26] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm [16:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:37] !log jmm@cumin2001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [16:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:10] (03Merged) 10jenkins-bot: Run Homer during the decom cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/647629 (owner: 10Ayounsi) [16:22:16] 10SRE, 10ops-eqiad, 10Traffic: lvs1016 interface down - https://phabricator.wikimedia.org/T271087 (10ayounsi) [16:23:41] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1022 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:29:00] 10SRE, 10WVUI: Import npm 6.14.8 to buster dist. on apt.wikimedia.org - https://phabricator.wikimedia.org/T270321 (10nnikkhoui) 05Open→03Declined Closing, as discussion on T269957 has moved back to using buster backports instead. [16:39:51] !log upload pdns-recursor_4.4.2-2wm1 to apt.wm.o (buster) - T252132 [16:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:55] T252132: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) public resolver - https://phabricator.wikimedia.org/T252132 [16:48:06] (03CR) 10Awight: "I've smoke-tested the scripts on a stat machine. The last step is that the start date should be adjusted in the job." 
[puppet] - 10https://gerrit.wikimedia.org/r/649662 (https://phabricator.wikimedia.org/T262209) (owner: 10Awight) [16:49:40] (03PS16) 10Hnowlan: sockpuppet-api: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) [16:51:40] (03PS1) 10Hnowlan: similar-users: add helmfile configuration. [deployment-charts] - 10https://gerrit.wikimedia.org/r/655915 (https://phabricator.wikimedia.org/T268837) [16:51:57] (03CR) 10CRusnov: "This change is ready for review." [cookbooks] - 10https://gerrit.wikimedia.org/r/655914 (https://phabricator.wikimedia.org/T266487) (owner: 10CRusnov) [16:52:34] (03PS1) 10Muehlenhoff: Remove bast3004/bast4002/bast5001 from Prometheus Hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/655916 [16:52:56] (03CR) 10jerkins-bot: [V: 04-1] similar-users: add helmfile configuration. [deployment-charts] - 10https://gerrit.wikimedia.org/r/655915 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [16:53:32] (03CR) 10Volans: ganeti.makevm: Make necessary changes to port for Netbox 2.9 API (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/655914 (https://phabricator.wikimedia.org/T266487) (owner: 10CRusnov) [16:53:44] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:43] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host... 
[16:55:43] mutante: there was an issue earlier with the reimage script related to the netbox upgrade, please let me know if it works all ok now given that the fix has been already deployed [16:56:40] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host... [16:56:43] (03PS1) 10Cparle: Guard against this file being included twice [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.26) - 10https://gerrit.wikimedia.org/r/655926 (https://phabricator.wikimedia.org/T271933) [16:56:44] volans: reasons seems unrelated. mw2228 had remote IPMI issue [16:56:54] works on the next host [16:57:02] it's just for the final netbox_update stuff [16:57:08] ah, ok, will let you know [16:57:22] thx [16:57:23] ^ effie also reimaged two mc hosts today [16:57:28] she reported the issue :D [16:57:32] ah :-) [16:59:29] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host... [17:00:11] (03Abandoned) 10Cparle: Guard against this file being included twice [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.26) - 10https://gerrit.wikimedia.org/r/655926 (https://phabricator.wikimedia.org/T271933) (owner: 10Cparle) [17:02:34] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:45] !log mw2228 resetting DRAC/BMC - trying to solve remote IPMI issue - bmc-device --cold-reset; echo $? 
[17:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:50] (03Restored) 10Cparle: Guard against this file being included twice [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.26) - 10https://gerrit.wikimedia.org/r/655926 (https://phabricator.wikimedia.org/T271933) (owner: 10Cparle) [17:05:03] (03PS1) 10Matthias Mullie: Guard against this file being included twice [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.26) - 10https://gerrit.wikimedia.org/r/655919 (https://phabricator.wikimedia.org/T271933) [17:05:34] (03PS2) 10Matthias Mullie: Guard against this file being included twice [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.26) - 10https://gerrit.wikimedia.org/r/655919 (https://phabricator.wikimedia.org/T271933) [17:05:48] (03CR) 10Cparle: "This change is ready for review." [extensions/WikibaseMediaInfo] (wmf/1.35.0-wmf.26) - 10https://gerrit.wikimedia.org/r/655927 (https://phabricator.wikimedia.org/T271933) (owner: 10Cparle) [17:06:06] (03Abandoned) 10Cparle: Guard against this file being included twice [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.26) - 10https://gerrit.wikimedia.org/r/655926 (https://phabricator.wikimedia.org/T271933) (owner: 10Cparle) [17:07:33] Is there something else I can do after "bmc-device --cold-reset; echo $?" returned 0 but I am still getting "Remote IPMI failed" ? [17:07:53] that's when I go to dcops, right [17:07:55] (03CR) 10Cparle: [C: 03+2] Guard against this file being included twice [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.26) - 10https://gerrit.wikimedia.org/r/655919 (https://phabricator.wikimedia.org/T271933) (owner: 10Matthias Mullie) [17:08:05] mutante: have you followed https://wikitech.wikimedia.org/wiki/Ipmi ? 
[17:09:26] volans: yes, local works, remote does not work, reset works but does not fix issue [17:09:33] (03CR) 10jerkins-bot: [V: 04-1] Guard against this file being included twice [extensions/WikibaseMediaInfo] (wmf/1.35.0-wmf.26) - 10https://gerrit.wikimedia.org/r/655927 (https://phabricator.wikimedia.org/T271933) (owner: 10Cparle) [17:09:40] no diff shown [17:09:54] https://wikitech.wikimedia.org/wiki/Management_Interfaces#Is_remote_IPMI_enabled? [17:09:57] ah ok [17:10:03] might be physical then... [17:10:44] volans: ack, also no --diff for permissions [17:10:53] i'm asking papaul [17:11:20] !log beginning cutover of https://logstash.wikimedia.org frontend to ELK7 T234854 [17:11:23] well, I could try re-setting the password [17:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:27] T234854: Upgrade ELK Stack to version 7 - https://phabricator.wikimedia.org/T234854 [17:11:40] (03CR) 10Herron: [C: 03+2] kibana7: change vhost from logstash-next to logstash.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/655802 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [17:11:48] (03PS2) 10Herron: kibana7: change vhost from logstash-next to logstash.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/655802 (https://phabricator.wikimedia.org/T234854) [17:12:51] (03PS1) 10David Caro: wmcs.ceph.osd: disable write caches when possible [puppet] - 10https://gerrit.wikimedia.org/r/655923 (https://phabricator.wikimedia.org/T271527) [17:13:50] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host... [17:14:26] volans: interesting, i logged in on mgmt, with the normal mgmt password, then "racadm config -g cfgUserAdmin -o cfgUserAdminPassword -i 2 .." 
to set that same password again.. then re-tried reimage cookbook.. and now it works [17:14:40] lol [17:14:54] the gut feeling was weird but turned out true [17:15:07] good it got fixed [17:15:11] :) [17:22:27] (03PS2) 10Herron: ELK: promote logstash-next to logstash.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/655754 (https://phabricator.wikimedia.org/T234854) [17:23:56] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2227.codfw.wmnet with reason: REIMAGE [17:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:00] (03CR) 10Herron: [C: 03+2] ELK: promote logstash-next to logstash.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/655754 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [17:25:50] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2229.codfw.wmnet with reason: REIMAGE [17:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:00] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2227.codfw.wmnet with reason: REIMAGE [17:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:02] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2229.codfw.wmnet with reason: REIMAGE [17:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:39] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2230.codfw.wmnet with reason: REIMAGE [17:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:40] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2230.codfw.wmnet with reason: REIMAGE [17:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:02] volans: added that case to the wikitech page. 
let's see if it happens ever again :) [17:35:11] thanks! [17:37:59] (03PS13) 10Razzi: Add cookbook for rebooting druid nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/651636 (https://phabricator.wikimedia.org/T269596) [17:38:39] (03PS14) 10Razzi: Add cookbook for rebooting druid nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/651636 (https://phabricator.wikimedia.org/T269596) [17:39:57] (03PS15) 10Razzi: Add cookbook for rebooting druid nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/651636 (https://phabricator.wikimedia.org/T269596) [17:41:15] (03Merged) 10jenkins-bot: Guard against this file being included twice [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.26) - 10https://gerrit.wikimedia.org/r/655919 (https://phabricator.wikimedia.org/T271933) (owner: 10Matthias Mullie) [17:41:31] (03CR) 10Awight: "PS 4: new Depends-On for the job tweak" [puppet] - 10https://gerrit.wikimedia.org/r/649662 (https://phabricator.wikimedia.org/T262209) (owner: 10Awight) [17:42:40] (03CR) 10jerkins-bot: [V: 04-1] Add cookbook for rebooting druid nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/651636 (https://phabricator.wikimedia.org/T269596) (owner: 10Razzi) [17:44:44] (03CR) 10Kosta Harlan: [C: 03+1] [no-op] GrowthExperiments: Disable link recommendations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655863 (https://phabricator.wikimedia.org/T261408) (owner: 10Gergő Tisza) [17:45:01] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2228.codfw.wmnet with reason: REIMAGE [17:45:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:37] (03PS16) 10Razzi: Add cookbook for rebooting druid nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/651636 (https://phabricator.wikimedia.org/T269596) [17:45:51] (03CR) 10Kosta Harlan: [C: 03+1] "Is there a way to review stderr/stdout from this job?" 
[puppet] - 10https://gerrit.wikimedia.org/r/655865 (https://phabricator.wikimedia.org/T261408) (owner: 10Gergő Tisza) [17:47:07] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2228.codfw.wmnet with reason: REIMAGE [17:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:09] 10SRE, 10serviceops, 10User-jijiki: Upgrade memcached to version 1.6.x - https://phabricator.wikimedia.org/T270315 (10jijiki) [18:08:05] (03CR) 10CRusnov: "This change is ready for review." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/655946 (https://phabricator.wikimedia.org/T266487) (owner: 10CRusnov) [18:09:46] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve100[1-4] - https://phabricator.wikimedia.org/T267050 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ml-serve1001.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/... [18:14:11] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2227.codfw.wmnet'] ` Of which those **F... [18:14:36] volans: did you mean this? Unable to run wmf-auto-reimage-host: 'NoneType' object is not subscriptable [18:14:51] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2229.codfw.wmnet'] ` Of which those **F... 
[18:14:51] then I can confirm it on 2 hosts now [18:17:39] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host... [18:17:48] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host... [18:18:03] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2230.codfw.wmnet'] ` Of which those **F... [18:18:34] 10SRE, 10CheckUser, 10Traffic: Log source port for anonymous users and expose it for sysops/checkusers - https://phabricator.wikimedia.org/T181368 (10Ladsgroup) I'm inclined to close this as declined in favor of {T271953} which basically gives people who have access to hadoop to be able to see the source por... 
[18:18:43] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2227.codfw.wmnet [18:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:09] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2230.codfw.wmnet [18:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:27] 10SRE, 10serviceops, 10User-jijiki: Enable TLS on memcached - https://phabricator.wikimedia.org/T271967 (10jijiki) [18:19:34] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2229.codfw.wmnet [18:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:46] 10SRE, 10serviceops, 10User-jijiki: Enable TLS on memcached - https://phabricator.wikimedia.org/T271967 (10jijiki) [18:19:49] 10SRE, 10serviceops, 10Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10jijiki) [18:20:19] 10SRE, 10serviceops, 10User-jijiki: Enable TLS on memcached - https://phabricator.wikimedia.org/T271967 (10jijiki) [18:23:45] (03CR) 10Gergő Tisza: "> Is there a way to review stderr/stdout from this job?" 
[puppet] - 10https://gerrit.wikimedia.org/r/655865 (https://phabricator.wikimedia.org/T261408) (owner: 10Gergő Tisza) [18:27:56] (03CR) 10Dzahn: "34 logfile_basedir => '/var/log/mediawiki'," [puppet] - 10https://gerrit.wikimedia.org/r/655865 (https://phabricator.wikimedia.org/T261408) (owner: 10Gergő Tisza) [18:28:29] (03PS1) 10Herron: elk7: enable icinga notifications [puppet] - 10https://gerrit.wikimedia.org/r/655951 (https://phabricator.wikimedia.org/T234854) [18:29:57] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2230.codfw.wmnet [18:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:04] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host... [18:32:41] (03CR) 10CDanis: "PCC is currently no-op; intent is to set this to the core SRE team's team-id in private hiera once merged" [puppet] - 10https://gerrit.wikimedia.org/r/655485 (owner: 10CDanis) [18:34:29] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2228.codfw.wmnet'] ` Of which those **F... 
[18:34:43] 10SRE, 10Traffic, 10serviceops, 10HTTPS, 10User-jijiki: Enable TLS on memcached - https://phabricator.wikimedia.org/T271967 (10Nintendofan885) [18:34:51] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2228.codfw.wmnet [18:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:58] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2227.codfw.wmnet [18:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:39] (03PS1) 10Bstorm: nfs: change the default throttles for primary cluster read and egress [puppet] - 10https://gerrit.wikimedia.org/r/655952 [18:35:41] (03CR) 10Herron: [C: 03+2] elk7: enable icinga notifications [puppet] - 10https://gerrit.wikimedia.org/r/655951 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [18:36:18] 10SRE, 10Traffic, 10serviceops, 10HTTPS, and 3 others: Move blubberoid to use TLS only. - https://phabricator.wikimedia.org/T236017 (10Nintendofan885) [18:38:13] mutante: do you have the output pasted somewhere? [18:38:34] T266487 is a good place if not already there [18:38:35] T266487: Upgrade Netbox and accompanying to the 2.9 series (Tracking Task) - https://phabricator.wikimedia.org/T266487 [18:38:39] it should have been fixed :/ [18:39:19] what's that in regard to? [18:39:58] (03CR) 10Razzi: [C: 03+2] Add cookbook for rebooting druid nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/651636 (https://phabricator.wikimedia.org/T269596) (owner: 10Razzi) [18:42:55] chaomodus: " 'NoneType' object is not subscriptable" [18:43:01] when trying to update netbox [18:43:33] volans: I am not sure if there is anything I need to do in netbox manually. The host is still there and just got reimaged. Not sure what info it would have updated if it had worked [18:43:47] oh it's the reimage script [18:44:16] chaomodus: yes, it is. 
and the issue happens after "Updated Netbox:" [18:44:22] kay [18:44:51] seems like I can ignore that though. or at least I can't think of what it would actually update there [18:45:01] if it was just a reimage [18:45:22] mutante: no the issue is just printing the result of the update [18:45:34] the update was already successful if it says "Updated Netbox" [18:45:43] so nothing to do for you manually [18:45:44] volans: ok, great. I will leave a comment on that ticket but not worry much about it otherwise then [18:45:54] but do you have a stacktrace? [18:46:35] if anyone's able to backport the fix to https://phabricator.wikimedia.org/T271933 I would appreciate that [18:46:53] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2231.codfw.wmnet with reason: REIMAGE [18:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:01] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2232.codfw.wmnet with reason: REIMAGE [18:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:03] no, I just have the output of the script or what is in logs [18:47:39] yes I was expecting a stacktrace there in the output :) [18:48:52] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2231.codfw.wmnet with reason: REIMAGE [18:48:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:48] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2232.codfw.wmnet with reason: REIMAGE [18:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:19] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve100[1-4] - https://phabricator.wikimedia.org/T267050 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ml-serve1001.eqiad.wmnet'] ` Of which those **FAILED**: ` ['ml-serve1001.eqiad.wmnet'] ` [18:53:10] volans: chaomodus: 
https://phabricator.wikimedia.org/P13760 [18:53:46] Ooh it's using a customscript? [18:53:50] mutante: ack thx, I might need to add some polling [18:53:53] 10SRE, 10serviceops, 10User-jijiki: Enable TLS on memcached - https://phabricator.wikimedia.org/T271967 (10Aklapper) @Nintendofan885 This is unrelated to #HTTPS [18:54:07] volans: check out the solution in custom_script_proxy [18:55:20] added that to the ticket, thx [18:57:13] (03CR) 10Volans: [C: 04-1] "I'm not convinced at all that this move towards job results and polling works for this proxy." (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/655946 (https://phabricator.wikimedia.org/T266487) (owner: 10CRusnov) [18:57:47] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review, 10Release Pipeline (Blubber): Move blubberoid to use TLS only. - https://phabricator.wikimedia.org/T236017 (10Aklapper) @Nintendofan885 This is unrelated to HTTPS [18:59:25] (03PS1) 10Ladsgroup: Remove gui stuff from wdqs [puppet] - 10https://gerrit.wikimedia.org/r/655955 (https://phabricator.wikimedia.org/T271851) [19:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210113T1900). [19:00:04] No GERRIT patches in the queue for this window AFAICS. [19:00:05] liw and longma: Your horoscope predicts another unfortunate Train log triage with CPT deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210113T1900). 
[19:00:17] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2233.codfw.wmnet with reason: REIMAGE [19:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:17] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2233.codfw.wmnet with reason: REIMAGE [19:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:01] (03PS2) 10Ladsgroup: query_service: Remove gui files from wdqs [puppet] - 10https://gerrit.wikimedia.org/r/655955 (https://phabricator.wikimedia.org/T271851) [19:08:34] (03PS4) 10CRusnov: custom_script_proxy: adjust for Netbox 2.9 API [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/655946 (https://phabricator.wikimedia.org/T266487) [19:09:42] (03PS1) 10Razzi: sre.druid.reboot-workers: pass single host as list [cookbooks] - 10https://gerrit.wikimedia.org/r/655956 (https://phabricator.wikimedia.org/T269596) [19:10:19] (03CR) 10Volans: [C: 03+1] "right!" [cookbooks] - 10https://gerrit.wikimedia.org/r/655956 (https://phabricator.wikimedia.org/T269596) (owner: 10Razzi) [19:10:58] (03PS1) 10Herron: elk7: change kibana7 monitoring to critical [puppet] - 10https://gerrit.wikimedia.org/r/655957 (https://phabricator.wikimedia.org/T234854) [19:11:00] (03PS1) 10Herron: elk7: remove logstash-next cache setting [puppet] - 10https://gerrit.wikimedia.org/r/655958 (https://phabricator.wikimedia.org/T234854) [19:11:24] (03CR) 10CRusnov: "> Patch Set 3: Code-Review-1" (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/655946 (https://phabricator.wikimedia.org/T266487) (owner: 10CRusnov) [19:13:20] (03CR) 10Volans: [C: 04-1] "Implementation look good, one detail to fix given the API response." 
(031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/655946 (https://phabricator.wikimedia.org/T266487) (owner: 10CRusnov) [19:13:30] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2228.codfw.wmnet [19:13:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:47] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2229.codfw.wmnet [19:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:01] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2230.codfw.wmnet [19:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:11] (03PS1) 10Herron: dns: remove logstash-next.wikimedia.org record [dns] - 10https://gerrit.wikimedia.org/r/655959 (https://phabricator.wikimedia.org/T234854) [19:14:59] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host... 
[19:16:25] (03CR) 10Razzi: [C: 03+2] sre.druid.reboot-workers: pass single host as list [cookbooks] - 10https://gerrit.wikimedia.org/r/655956 (https://phabricator.wikimedia.org/T269596) (owner: 10Razzi) [19:17:31] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:18:21] (03PS5) 10CRusnov: custom_script_proxy: adjust for Netbox 2.9 API [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/655946 (https://phabricator.wikimedia.org/T266487) [19:19:01] (03Merged) 10jenkins-bot: sre.druid.reboot-workers: pass single host as list [cookbooks] - 10https://gerrit.wikimedia.org/r/655956 (https://phabricator.wikimedia.org/T269596) (owner: 10Razzi) [19:19:13] (03CR) 10CRusnov: "Sounds reasonable. I have tested the current solution." (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/655946 (https://phabricator.wikimedia.org/T266487) (owner: 10CRusnov) [19:20:18] !log razzi@cumin1001 START - Cookbook sre.druid.reboot-workers for Druid test cluster: Reboot Druid nodes - razzi@cumin1001 [19:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:30] 10SRE, 10serviceops, 10User-jijiki: Enable TLS on memcached - https://phabricator.wikimedia.org/T271967 (10elukey) My 2c: I'd vote for 1.6.x since it is close to what upstream is currently supporting, plus I don't think that it would be less stable than the last 1.5.x version.. In 1.6 a lot of new things wer... 
[19:21:30] (03CR) 10Volans: "question inline I missed it earlier" (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/655946 (https://phabricator.wikimedia.org/T266487) (owner: 10CRusnov) [19:24:15] (03CR) 10CRusnov: custom_script_proxy: adjust for Netbox 2.9 API (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/655946 (https://phabricator.wikimedia.org/T266487) (owner: 10CRusnov) [19:24:17] (03CR) 10Herron: [C: 03+1] "thanks! LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/655916 (owner: 10Muehlenhoff) [19:25:17] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:26:03] PROBLEM - PHP opcache health on mw2227 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:26:08] 10SRE, 10serviceops, 10User-jijiki: Enable TLS on memcached - https://phabricator.wikimedia.org/T271967 (10MoritzMuehlenhoff) Also, memcached 1.6.6 is already used on the IDPs and available in a component. [19:26:28] (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/655946 (https://phabricator.wikimedia.org/T266487) (owner: 10CRusnov) [19:26:59] (03CR) 10CRusnov: [C: 03+2] custom_script_proxy: adjust for Netbox 2.9 API [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/655946 (https://phabricator.wikimedia.org/T266487) (owner: 10CRusnov) [19:27:44] 10SRE, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020-2021 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10Krinkle) I'm seeing this several times a week and have for several months. I haven't reported it before since it's not essential prod code and we... 
[19:29:20] (03PS3) 10CRusnov: ganeti.makevm: Make necessary changes to port for Netbox 2.9 API [cookbooks] - 10https://gerrit.wikimedia.org/r/655914 (https://phabricator.wikimedia.org/T266487) [19:29:53] PROBLEM - PHP opcache health on mw2229 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:30:27] (03PS1) 10Ottomata: Undo finalization of EL migration of SpecialMuteSubmit [puppet] - 10https://gerrit.wikimedia.org/r/655962 (https://phabricator.wikimedia.org/T268517) [19:31:23] (03PS1) 10Volans: wmf-auto-reimage: poll Netbox script results [puppet] - 10https://gerrit.wikimedia.org/r/655963 (https://phabricator.wikimedia.org/T266487) [19:31:36] (03PS1) 10Ottomata: Undo finalization of EL migration of SpecialMuteSubmit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655964 (https://phabricator.wikimedia.org/T268517) [19:32:25] (03CR) 10Ottomata: [C: 03+2] Undo finalization of EL migration of SpecialMuteSubmit [puppet] - 10https://gerrit.wikimedia.org/r/655962 (https://phabricator.wikimedia.org/T268517) (owner: 10Ottomata) [19:32:57] PROBLEM - PHP opcache health on mw2230 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:33:00] (03CR) 10jerkins-bot: [V: 04-1] wmf-auto-reimage: poll Netbox script results [puppet] - 10https://gerrit.wikimedia.org/r/655963 (https://phabricator.wikimedia.org/T266487) (owner: 10Volans) [19:34:03] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:34:25] (03CR) 10Ottomata: [C: 03+2] Undo finalization of EL migration of SpecialMuteSubmit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655964 (https://phabricator.wikimedia.org/T268517) (owner: 10Ottomata) [19:34:26] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2231.codfw.wmnet'] ` Of which those **F... [19:34:46] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2231.codfw.wmnet [19:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:15] (03PS2) 10Volans: wmf-auto-reimage: poll Netbox script results [puppet] - 10https://gerrit.wikimedia.org/r/655963 (https://phabricator.wikimedia.org/T266487) [19:35:34] (03CR) 10Herron: [C: 03+2] mailman: set Czech language to iso-8859-2 [puppet] - 10https://gerrit.wikimedia.org/r/654915 (https://phabricator.wikimedia.org/T271123) (owner: 10Herron) [19:35:43] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host... 
[19:36:19] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Undo - Migrate SpecialMuteSubmit to EventGate - T268517 (duration: 01m 06s) [19:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:22] T268517: Migrate Anti-Harassment EventLogging schemas to Event Platform - https://phabricator.wikimedia.org/T268517 [19:36:24] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2232.codfw.wmnet'] ` Of which those **F... [19:36:45] (03CR) 10jerkins-bot: [V: 04-1] wmf-auto-reimage: poll Netbox script results [puppet] - 10https://gerrit.wikimedia.org/r/655963 (https://phabricator.wikimedia.org/T266487) (owner: 10Volans) [19:38:14] 10SRE, 10SRE-Access-Requests: Requesting access to Analytics dev for clarakosi - https://phabricator.wikimedia.org/T271973 (10Clarakosi) [19:39:31] !log razzi@cumin1001 END (PASS) - Cookbook sre.druid.reboot-workers (exit_code=0) for Druid test cluster: Reboot Druid nodes - razzi@cumin1001 [19:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:11] !log thcipriani@deploy1001 Synchronized php-1.36.0-wmf.26/extensions/WikibaseMediaInfo/src/Search/MediaSearchProfiles.php: [[gerrit:655919|Guard against this file being included twice]] T271933 (duration: 01m 04s) [19:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:14] T271933: PHP Fatal Error: Cannot redeclare Wikibase\MediaInfo\Search\closureToAnonymousClass() (previously declared in /srv/mediawiki/php-1.36.0-wmf.26/extensions/WikibaseMediaInfo/src/Search/MediaSearchProfiles.php:24) - https://phabricator.wikimedia.org/T271933 [19:44:19] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2234.codfw.wmnet with reason: REIMAGE [19:44:20] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:04] 10SRE, 10SRE-Access-Requests: Requesting access to Analytics dev for clarakosi - https://phabricator.wikimedia.org/T271973 (10WDoranWMF) As @Clarakosi manager I approve the request and her need for access as part of her role on the #platform_engineering team [19:46:16] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2234.codfw.wmnet with reason: REIMAGE [19:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:12] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve100[1-4] - https://phabricator.wikimedia.org/T267050 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['ml-serve1001.eqiad.wmnet', 'ml-serve1002.eqiad.wmnet', 'ml-serve1003.eqiad.w... [19:47:46] !log thcipriani@deploy1001 Synchronized php-1.36.0-wmf.26/extensions/WikibaseMediaInfo/src/Search/MediaSearchProfiles.php: [[gerrit:655919|Guard against this file being included twice]] T271933 (for real -- forgot to submodule update) (duration: 01m 04s) [19:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:49] T271933: PHP Fatal Error: Cannot redeclare Wikibase\MediaInfo\Search\closureToAnonymousClass() (previously declared in /srv/mediawiki/php-1.36.0-wmf.26/extensions/WikibaseMediaInfo/src/Search/MediaSearchProfiles.php:24) - https://phabricator.wikimedia.org/T271933 [19:48:49] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2233.codfw.wmnet'] ` Of which those **F... 
[19:48:58] (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools' newtopictool as beta feature on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655966 (https://phabricator.wikimedia.org/T267595) [19:50:11] 10SRE, 10Wikimedia-Mailing-lists, 10I18n, 10Patch-For-Review: Mailman password reminder mail (and other texts) has broken encoding in Czech - https://phabricator.wikimedia.org/T271123 (10herron) Was hoping for some feedback on the above patch, but since it's been a few days I've gone ahead and merged it.... [19:51:14] thcipriani: are you done with deploying? If so, can I sync sth? [19:51:52] Urbanecm: yep, I'm done, sorry, just realized there was some code that was merged that wasn't deployed :) [19:53:08] no problem, just asking to be sure you're done :) [19:53:16] (03PS2) 10Urbanecm: Set import sources for mrwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655293 (https://phabricator.wikimedia.org/T270402) [19:53:23] (03CR) 10Urbanecm: [C: 03+2] Set import sources for mrwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655293 (https://phabricator.wikimedia.org/T270402) (owner: 10Urbanecm) [19:56:22] (03CR) 10Cwhite: [C: 03+1] elk7: remove logstash-next cache setting [puppet] - 10https://gerrit.wikimedia.org/r/655958 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [19:56:39] (03Merged) 10jenkins-bot: Set import sources for mrwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655293 (https://phabricator.wikimedia.org/T270402) (owner: 10Urbanecm) [19:56:46] (03CR) 10Cwhite: [C: 03+1] elk7: change kibana7 monitoring to critical [puppet] - 10https://gerrit.wikimedia.org/r/655957 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [19:57:07] (03CR) 10Cwhite: [C: 03+1] dns: remove logstash-next.wikimedia.org record [dns] - 10https://gerrit.wikimedia.org/r/655959 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [19:58:41] !log urbanecm@deploy1001 Synchronized 
wmf-config/InitialiseSettings.php: 726e972bc8cff1ff8ed90c8dd853aae4997329f5: Set import sources for mrwikibooks (T270402) (duration: 01m 04s) [19:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:46] T270402: Allow admins and importers to import on mrwikibooks and set sources for the same. - https://phabricator.wikimedia.org/T270402 [20:00:04] liw and longma: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Mediawiki train - European+American Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210113T2000). [20:01:19] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1002.eqiad.wmnet with reason: REIMAGE [20:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:08] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1003.eqiad.wmnet with reason: REIMAGE [20:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:40] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1001.eqiad.wmnet with reason: REIMAGE [20:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:20] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1002.eqiad.wmnet with reason: REIMAGE [20:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:57] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2235.codfw.wmnet with reason: REIMAGE [20:04:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:10] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1001.eqiad.wmnet with reason: REIMAGE [20:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:20] !log robh@cumin1001 END (FAIL) - 
Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ml-serve1003.eqiad.wmnet with reason: REIMAGE [20:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:17] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2235.codfw.wmnet with reason: REIMAGE [20:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:47] PROBLEM - PHP opcache health on mw2228 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [20:19:14] (03PS1) 10Gerrit maintenance bot: Add alt to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/655972 (https://phabricator.wikimedia.org/T271980) [20:19:43] (03CR) 10Urbanecm: [C: 03+1] Add alt to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/655972 (https://phabricator.wikimedia.org/T271980) (owner: 10Gerrit maintenance bot) [20:24:02] (03PS1) 10Andrew Bogott: Horizon: put into maintenance mode for Stein upgrade [puppet] - 10https://gerrit.wikimedia.org/r/655977 (https://phabricator.wikimedia.org/T261134) [20:28:10] (03PS1) 10Andrew Bogott: cloud-vps: move eqiad1 from openstack 'rocky' to 'stein' [puppet] - 10https://gerrit.wikimedia.org/r/655979 (https://phabricator.wikimedia.org/T261134) [20:28:12] (03PS1) 10Andrew Bogott: Revert "Horizon: put into maintenance mode for Stein upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/655980 [20:31:13] (03CR) 10Dzahn: [C: 03+2] "approved by langcom https://meta.wikimedia.org/wiki/Requests_for_new_languages/Wikipedia_Southern_Altai" [dns] - 10https://gerrit.wikimedia.org/r/655972 (https://phabricator.wikimedia.org/T271980) (owner: 10Gerrit maintenance bot) [20:32:14] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 
(10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2234.codfw.wmnet'] ` Of which those **F... [20:36:11] thanks mutante :) [20:40:44] !log DNS - new project language "alt" added. Altai (also Gorno-Altai) is a Turkic language, spoken officially in the Altai Republic, Russia. [20:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:47] Urbanecm: yw [20:41:43] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve100[1-4] - https://phabricator.wikimedia.org/T267050 (10RobH) [20:46:10] mutante: I first read 'alt' as alternative language ;) [20:47:56] (03PS3) 10Volans: wmf-auto-reimage: poll Netbox script results [puppet] - 10https://gerrit.wikimedia.org/r/655963 (https://phabricator.wikimedia.org/T266487) [20:48:00] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve100[1-4] - https://phabricator.wikimedia.org/T267050 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ml-serve1002.eqiad.wmnet', 'ml-serve1001.eqiad.wmnet', 'ml-serve1003.eqiad.wmnet', 'ml-serve1004.eqiad.wmnet'] ` Of which t... [20:49:48] volans: I was about to say something about "alternative Wikipedia" because of that :) [20:51:31] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [20:54:01] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2235.codfw.wmnet'] ` Of which those **F... 
[20:54:08] (03CR) 10Volans: [C: 03+1] "LGTM but should be tested once merged" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/655914 (https://phabricator.wikimedia.org/T266487) (owner: 10CRusnov) [20:55:52] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve100[1-4] - https://phabricator.wikimedia.org/T267050 (10RobH) So 100[123] failed the final part of the reimage script with puppet run, no clue why, need to investigate. 1004 is failing to get a DHCP response for PXE boot, and isn't bei... [21:00:04] chrisalbon and accraze: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210113T2100). [21:06:32] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2232.codfw.wmnet [21:06:33] PROBLEM - PHP opcache health on mw2233 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:06:39] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2233.codfw.wmnet [21:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:46] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2234.codfw.wmnet [21:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:57] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2235.codfw.wmnet [21:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:02] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet 
for host... [21:09:34] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host... [21:10:51] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host... [21:11:11] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host... [21:12:16] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2231.codfw.wmnet [21:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:41] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2232.codfw.wmnet [21:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:09] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2233.codfw.wmnet [21:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:50] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2234.codfw.wmnet [21:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:55] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2235.codfw.wmnet [21:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:33] (03PS2) 10Legoktm: sbuild: Switch require_package -> ensure_packages [puppet] - 
10https://gerrit.wikimedia.org/r/655794 (https://phabricator.wikimedia.org/T266479) [21:18:01] (03CR) 10Legoktm: [C: 03+2] "Appears unused" [puppet] - 10https://gerrit.wikimedia.org/r/655794 (https://phabricator.wikimedia.org/T266479) (owner: 10Legoktm) [21:18:55] (03CR) 10Dzahn: "compiler can't find nodes using these classes" [puppet] - 10https://gerrit.wikimedia.org/r/655520 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [21:23:15] (03CR) 10CRusnov: [C: 03+1] "LGTM. Can we test it ahead of merging?" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/654435 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond) [21:24:32] (03CR) 10CRusnov: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/655963 (https://phabricator.wikimedia.org/T266487) (owner: 10Volans) [21:24:54] (03CR) 10Volans: [C: 03+2] wmf-auto-reimage: poll Netbox script results [puppet] - 10https://gerrit.wikimedia.org/r/655963 (https://phabricator.wikimedia.org/T266487) (owner: 10Volans) [21:25:55] (03CR) 10Dzahn: "also integration-docker-registry-1003.integration.eqiad.wmflabs is 404 .. the name change in cloud makes it unclear" [puppet] - 10https://gerrit.wikimedia.org/r/655520 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [21:26:18] mutante: the above should have fixed the netbox update stuff in the reimage script. If you're running any more reimages it *should* work now. [21:26:56] volans: ok, cool! 
I'll keep an eye on it once the currently running ones are finishing [21:27:10] thx [21:31:16] (03PS2) 10Legoktm: shiny_server: Switch require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/655796 (https://phabricator.wikimedia.org/T266479) [21:31:25] (03PS1) 10RobH: correcting ml-serve1004 mac [puppet] - 10https://gerrit.wikimedia.org/r/655997 (https://phabricator.wikimedia.org/T267050) [21:31:40] (03CR) 10jerkins-bot: [V: 04-1] correcting ml-serve1004 mac [puppet] - 10https://gerrit.wikimedia.org/r/655997 (https://phabricator.wikimedia.org/T267050) (owner: 10RobH) [21:31:46] (03PS2) 10RobH: correcting ml-serve1004 mac [puppet] - 10https://gerrit.wikimedia.org/r/655997 (https://phabricator.wikimedia.org/T267050) [21:32:22] (03CR) 10RobH: [C: 03+2] correcting ml-serve1004 mac [puppet] - 10https://gerrit.wikimedia.org/r/655997 (https://phabricator.wikimedia.org/T267050) (owner: 10RobH) [21:38:16] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2237.codfw.wmnet with reason: REIMAGE [21:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:51] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2238.codfw.wmnet with reason: REIMAGE [21:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:01] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2239.codfw.wmnet with reason: REIMAGE [21:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:21] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2237.codfw.wmnet with reason: REIMAGE [21:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:23] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2239.codfw.wmnet with reason: REIMAGE [21:40:24] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:59] RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [21:41:29] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2240.codfw.wmnet with reason: REIMAGE [21:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:17] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2238.codfw.wmnet with reason: REIMAGE [21:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:15] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2240.codfw.wmnet with reason: REIMAGE [21:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:51] PROBLEM - nutcracker socket on mw2239 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.65: Connection reset by peer https://wikitech.wikimedia.org/wiki/Nutcracker [21:49:03] PROBLEM - Check size of conntrack table on mw2239 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.65: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [21:49:03] PROBLEM - MD RAID on mw2239 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.65: Connection reset by peer https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [21:49:03] PROBLEM - php7.2-fpm service on mw2239 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.65: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:50:03] PROBLEM - Check no envoy runtime configuration is left persistent on mw2239 is CRITICAL: CHECK_NRPE: Error - Could 
not connect to 10.192.0.65: Connection reset by peer https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [21:50:14] ^ that is being reimaged and it's being unlucky somehow because it did not happen for other hosts before [21:50:30] exit code 99 [21:51:28] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on mw2239.codfw.wmnet with reason: new install on buster [21:51:28] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mw2239.codfw.wmnet with reason: new install on buster [21:51:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve100[1-4] - https://phabricator.wikimedia.org/T267050 (10RobH) ml-serve1--4 will pxe boot now. next step is use image script again and investigate its errors. [21:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:33] (03PS1) 10CDanis: Add IRC/SAL notifications via tcpircbot. [software/klaxon] - 10https://gerrit.wikimedia.org/r/655998 [22:03:01] PROBLEM - PHP opcache health on mw2234 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [22:03:26] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/27459/" [puppet] - 10https://gerrit.wikimedia.org/r/655519 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [22:08:35] 10SRE, 10MW-on-K8s, 10serviceops, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Pipeline): Deployment infrastructure for PHP microservices - https://phabricator.wikimedia.org/T261369 (10sbassett) >>! In T261369#6695002, @akosiaris wrote: >> As I understand it, there's a halt on that npm appro...
[22:09:25] PROBLEM - PHP opcache health on mw2235 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [22:13:21] (03CR) 10Dzahn: "noop on alert1001, logstash1028" [puppet] - 10https://gerrit.wikimedia.org/r/655519 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [22:18:27] PROBLEM - PHP opcache health on mw2232 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [22:19:41] PROBLEM - PHP opcache health on mw2231 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [22:20:51] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:22:25] ACKNOWLEDGEMENT - PHP opcache health on mw2224 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn reimaged https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [22:22:25] ACKNOWLEDGEMENT - PHP opcache health on mw2225 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn reimaged https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [22:22:25] ACKNOWLEDGEMENT - PHP opcache health on mw2231 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn reimaged https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [22:23:07] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:26:24] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2237.codfw.wmnet'] ` Of which those **F... [22:26:27] RECOVERY - Check size of conntrack table on mw2239 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [22:26:39] RECOVERY - nutcracker socket on mw2239 is OK: TCP OK - 0.000 second response time on socket /var/run/nutcracker/redis_codfw.sock https://wikitech.wikimedia.org/wiki/Nutcracker [22:28:43] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2239.codfw.wmnet'] ` Of which those **F... [22:28:47] RECOVERY - php7.2-fpm service on mw2239 is OK: OK - php7.2-fpm is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:29:38] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2238.codfw.wmnet'] ` Of which those **F... [22:30:12] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2240.codfw.wmnet'] ` Of which those **F... 
[22:33:35] RECOVERY - MD RAID on mw2239 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [22:45:07] (03PS1) 10Krinkle: Edit link may not be present, avoid undefined index notice [skins/CologneBlue] (wmf/1.36.0-wmf.26) - 10https://gerrit.wikimedia.org/r/655932 (https://phabricator.wikimedia.org/T271978) [22:51:34] RECOVERY - Check no envoy runtime configuration is left persistent on mw2239 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [22:53:02] !log T266492 T268779 T265699 Restarting cloudelastic to apply new readahead changes, this will also verify cloudelastic support works in our elasticsearch spicerack code. Only going one node at a time because cloudelastic elasticsearch indices only have 1 replica shard per index [22:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:07] T268779: Support cloudelastic in spicerack elasticsearch - https://phabricator.wikimedia.org/T268779 [22:53:08] T266492: Restart elasticsearch clusters to apply readahead changes - https://phabricator.wikimedia.org/T266492 [22:53:08] T265699: 40-elasticsearch-readahead udev rule failing for cloudelastic100[5,6] - https://phabricator.wikimedia.org/T265699 [22:53:09] !log T266492 T268779 T265699 `sudo -i cookbook sre.elasticsearch.rolling-restart cloudelastic "cloudelastic cluster restart" --task-id T266492 --nodes-per-run 1` [22:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:25] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-restart [22:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:52] 10SRE, 10MW-on-K8s, 10serviceops, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Pipeline): Deployment infrastructure for PHP 
microservices - https://phabricator.wikimedia.org/T261369 (10bd808) >>! In T261369#6746098, @sbassett wrote: > That being said, npm install shouldn't be run on any prod... [23:18:31] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2237.codfw.wmnet [23:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:37] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2238.codfw.wmnet [23:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:43] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2239.codfw.wmnet [23:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:53] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2240.codfw.wmnet [23:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:42] (03CR) 10CRusnov: [C: 03+2] ganeti.makevm: Make necessary changes to port for Netbox 2.9 API [cookbooks] - 10https://gerrit.wikimedia.org/r/655914 (https://phabricator.wikimedia.org/T266487) (owner: 10CRusnov) [23:40:44] PROBLEM - PHP opcache health on mw2237 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [23:43:10] PROBLEM - PHP opcache health on mw2238 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [23:44:34] PROBLEM - PHP opcache health on mw2240 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [23:44:43] !log rebooting Netbox instances to apply updates [23:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:04] !log crusnov@cumin1001 START - Cookbook sre.hosts.reboot-single [23:46:04] !log crusnov@cumin1001 END (FAIL) - 
Cookbook sre.hosts.reboot-single (exit_code=99) [23:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:40] !log crusnov@cumin1001 START - Cookbook sre.hosts.reboot-single [23:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:25] !log crusnov@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [23:49:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:50] !log crusnov@cumin1001 START - Cookbook sre.hosts.reboot-single [23:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:51] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-restart (exit_code=0) [23:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:41] !log crusnov@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [23:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:56] !log crusnov@cumin1001 START - Cookbook sre.hosts.reboot-single [23:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:14] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:57:26] expected