[01:59:18] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:01:00] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:24:02] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:32:32] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:37:12] <wikibugs>	 (PS1) Elukey: Decom analytics1047 from the Hadoop cluster [puppet] - https://gerrit.wikimedia.org/r/633385 (https://phabricator.wikimedia.org/T255140)
[05:38:11] <wikibugs>	 (CR) Elukey: [C: +2] Decom analytics1047 from the Hadoop cluster [puppet] - https://gerrit.wikimedia.org/r/633385 (https://phabricator.wikimedia.org/T255140) (owner: Elukey)
[05:54:17] <wikibugs>	 (PS1) Elukey: Set the test coordinator role to an-test-coord1001 [puppet] - https://gerrit.wikimedia.org/r/633386 (https://phabricator.wikimedia.org/T255139)
[05:55:18] <wikibugs>	 (CR) Elukey: [C: +2] Set the test coordinator role to an-test-coord1001 [puppet] - https://gerrit.wikimedia.org/r/633386 (https://phabricator.wikimedia.org/T255139) (owner: Elukey)
[06:32:21] <wikibugs>	 (PS1) Elukey: role::analytics_test_cluster::coordiantor: avoid db backups [puppet] - https://gerrit.wikimedia.org/r/633387 (https://phabricator.wikimedia.org/T255139)
[06:33:13] <wikibugs>	 (CR) Elukey: [C: +2] role::analytics_test_cluster::coordiantor: avoid db backups [puppet] - https://gerrit.wikimedia.org/r/633387 (https://phabricator.wikimedia.org/T255139) (owner: Elukey)
[06:53:41] <wikibugs>	 (PS1) Elukey: Reduce the HDFS block replication factor for the Hadoop test cluster [puppet] - https://gerrit.wikimedia.org/r/633471 (https://phabricator.wikimedia.org/T255139)
[06:55:43] <wikibugs>	 (CR) Elukey: [C: +2] Reduce the HDFS block replication factor for the Hadoop test cluster [puppet] - https://gerrit.wikimedia.org/r/633471 (https://phabricator.wikimedia.org/T255139) (owner: Elukey)
[07:08:05] <wikibugs>	 (PS3) Muehlenhoff: Install ldap-replica200[34] as additional LDAP replicas [puppet] - https://gerrit.wikimedia.org/r/632648 (https://phabricator.wikimedia.org/T264388)
[07:12:06] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good." [puppet] - https://gerrit.wikimedia.org/r/633275 (https://phabricator.wikimedia.org/T264991) (owner: Dzahn)
[07:17:15] <wikibugs>	 Operations, serviceops, Patch-For-Review: Upgrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 (MoritzMuehlenhoff) >>! In T264991#6534293, @bd808 wrote: > Should this task be merged with {T245757} somehow?  Probablyish, but the scope is a little differe...
[07:19:01] <wikibugs>	 Operations, serviceops, Patch-For-Review: Upgrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 (MoritzMuehlenhoff) >>! In T264991#6534026, @Dzahn wrote: > ii  prometheus-nutcracker-exporter       0.2+nmu1                         all          Prometheus...
[07:33:10] <librenms-wmf>	 Warning Alert for device scs-c1-codfw.mgmt.codfw.wmnet - Processor usage over 85%
[07:34:14] <wikibugs>	 (PS1) Muehlenhoff: Remove dotfiles for banyek, demon, rush [puppet] - https://gerrit.wikimedia.org/r/633473
[07:39:39] <wikibugs>	 Operations, Traffic, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (Gilles) Thanks for trying that out. I ran the same command for much longer with higher concurrency. For Main_Page, I saw no difference.  Then...
[07:53:27] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.downtime
[07:53:27] <logmsgbot>	 !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[07:53:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:33] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.downtime
[07:53:35] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:53:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:37] <wikibugs>	 Operations, ops-codfw: Degraded RAID on ms-be2036 - https://phabricator.wikimedia.org/T265208 (ops-monitoring-bot) Icinga downtime for 4:00:00 set by filippo@cumin1001 on 1 host(s) and their services with reason: reboot ` ms-be2036.codfw.wmnet `
[07:53:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:48] <godog>	 !log reboot ms-be2036 - T265208
[07:54:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:54] <stashbot>	 T265208: Degraded RAID on ms-be2036 - https://phabricator.wikimedia.org/T265208
[07:57:39] <wikibugs>	 (PS1) Elukey: role::analytics_test_cluster::coordinator: set /srv/mysql as datadir for mariadb [puppet] - https://gerrit.wikimedia.org/r/633476 (https://phabricator.wikimedia.org/T255139)
[07:58:04] <wikibugs>	 Operations, serviceops, Patch-For-Review: Upgrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 (Joe) >>! In T264991#6533992, @Legoktm wrote: >>>! In T264991#6533968, @Dzahn wrote: >> - ploticus >  > {T253377}  Given we're sunsetting graphoid instead, Ea...
[07:58:38] <wikibugs>	 (CR) Elukey: [C: +2] role::analytics_test_cluster::coordinator: set /srv/mysql as datadir for mariadb [puppet] - https://gerrit.wikimedia.org/r/633476 (https://phabricator.wikimedia.org/T255139) (owner: Elukey)
[08:00:30] <wikibugs>	 Operations, Traffic, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (Gilles) I've checked with curl and it does get cached by both hosts:   ` X-Cache: cp3060 miss, cp3052 hit/1000011 X-Cache-Status: hit-front...
[08:06:59] <wikibugs>	 (PS1) Volans: documentation: refactor configuration [software/spicerack] - https://gerrit.wikimedia.org/r/633477
[08:07:01] <wikibugs>	 (PS1) Volans: pylint: allow 'logger' as module-scope name [software/spicerack] - https://gerrit.wikimedia.org/r/633478
[08:07:31] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:08:58] <wikibugs>	 Operations, ops-codfw: Degraded RAID on ms-be2036 - https://phabricator.wikimedia.org/T265208 (fgiunchedi) The host rebooted into Linux OK, however there were error messages at boot. Looks like related to both ilo and the hw raid firmware.  ` Slot 3 Port 1 : Smart Array P840 Controller - (4096 MB, V4.52)...
[08:09:39] <wikibugs>	 (CR) jerkins-bot: [V: -1] pylint: allow 'logger' as module-scope name [software/spicerack] - https://gerrit.wikimedia.org/r/633478 (owner: Volans)
[08:09:44] <wikibugs>	 Operations, Analytics, SRE-Access-Requests: Renable SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T265071 (Kormat) Open→Resolved a:Kormat Sounds like this is complete, so resolving.
[08:09:47] <wikibugs>	 (CR) jerkins-bot: [V: -1] documentation: refactor configuration [software/spicerack] - https://gerrit.wikimedia.org/r/633477 (owner: Volans)
[08:09:47] <icinga-wm>	 RECOVERY - puppet last run on ms-be2036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:11:03] <icinga-wm>	 RECOVERY - MD RAID on ms-be2036 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[08:19:50] <wikibugs>	 Operations, Analytics, SRE-Access-Requests: Renable SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T265071 (elukey) Just executed:  ` elukey@krb1001:~$ sudo manage_principals.py create lexnasser --email_address=lexnasser@icloud.com Principal successfully created. Make...
[08:20:47] <wikibugs>	 Operations, ops-codfw: Degraded RAID on ms-be2036 - https://phabricator.wikimedia.org/T265208 (fgiunchedi) a:Papaul @papaul I've updated the hw raid firmware to 6.88 and rebooted to apply the upgrade. On reboot the message below was still there, what do you think ? Feel free to upgrade the ilo firmwa...
[08:27:38] <wikibugs>	 Operations, Traffic, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (ema) >>! In T264398#6535510, @Gilles wrote: > Thanks for trying that out. I ran the same command for much longer with higher concurrency. For...
[08:37:09] <icinga-wm>	 RECOVERY - Disk space on ms-be2036 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2036&var-datasource=codfw+prometheus/ops
[08:38:05] <wikibugs>	 Operations, SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T265250 (Aklapper)
[08:40:39] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] logstash: add field checks to filter throttle [puppet] - https://gerrit.wikimedia.org/r/633224 (owner: Herron)
[08:43:28] <wikibugs>	 (CR) Ayounsi: [C: +2] Prioritize SG-IX [homer/public] - https://gerrit.wikimedia.org/r/633200 (https://phabricator.wikimedia.org/T260991) (owner: Ayounsi)
[08:43:57] <wikibugs>	 (Merged) jenkins-bot: Prioritize SG-IX [homer/public] - https://gerrit.wikimedia.org/r/633200 (https://phabricator.wikimedia.org/T260991) (owner: Ayounsi)
[08:44:22] <wikibugs>	 Operations, SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T265250 (Kormat) Hi @Aklapper - i think you accidentally created a new task with the form instead of editing the form template ({T265248}).  https://phabricator.wikimedia.org/transactions/editen...
[08:49:57] <wikibugs>	 (PS1) Ayounsi: Add BGP to HE on cr2-eqsin [homer/public] - https://gerrit.wikimedia.org/r/633480
[08:50:47] <moritzm>	 !log uploaded libxml2 2.9.4+dfsg1-2.2+deb9u3+wmf1 to component/icu63 T264991
[08:50:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:54] <stashbot>	 T264991: Upgrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991
[08:51:24] <wikibugs>	 (CR) Ayounsi: [C: +2] Add BGP to HE on cr2-eqsin [homer/public] - https://gerrit.wikimedia.org/r/633480 (owner: Ayounsi)
[08:51:47] <wikibugs>	 (Merged) jenkins-bot: Add BGP to HE on cr2-eqsin [homer/public] - https://gerrit.wikimedia.org/r/633480 (owner: Ayounsi)
[08:53:49] <wikibugs>	 Operations, Traffic, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (Gilles) I tried a bunch of different benchmark settings before I ended up using those, some of which definitely had too much concurrency and...
[08:57:13] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE, AS6939/IPv4: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:01:57] <wikibugs>	 (PS2) Volans: documentation: refactor configuration [software/spicerack] - https://gerrit.wikimedia.org/r/633477
[09:01:59] <wikibugs>	 (PS2) Volans: pylint: allow 'logger' as module-scope name [software/spicerack] - https://gerrit.wikimedia.org/r/633478
[09:02:01] <wikibugs>	 (PS1) Volans: log: adjust return type as required by mypy [software/spicerack] - https://gerrit.wikimedia.org/r/633482
[09:02:08] <wikibugs>	 (PS1) Volans: pylint: allow 'logger' as module-scope name [cookbooks] - https://gerrit.wikimedia.org/r/633483
[09:02:10] <wikibugs>	 (PS1) Volans: sre.hosts.downtime: convert to class API [cookbooks] - https://gerrit.wikimedia.org/r/633484 (https://phabricator.wikimedia.org/T221212)
[09:03:54] <wikibugs>	 (CR) Elukey: "This is a good start, I like the approach about working on a single domain at the time. I think that we should consider some failure scena" [puppet] - https://gerrit.wikimedia.org/r/633227 (https://phabricator.wikimedia.org/T240439) (owner: Razzi)
[09:04:14] <wikibugs>	 Operations, Discovery-Search (Current work), Patch-For-Review: (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (Gehel) Open→Resolved Service implementation will follow in T262211.
[09:04:27] <wikibugs>	 Operations, Traffic, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (Gilles) If it was indeed the last benchmark run, what you're saying is that cp3052 favored the benchmark requests over others, while cp3054 r...
[09:04:40] <wikibugs>	 (CR) jerkins-bot: [V: -1] documentation: refactor configuration [software/spicerack] - https://gerrit.wikimedia.org/r/633477 (owner: Volans)
[09:04:43] <wikibugs>	 (CR) jerkins-bot: [V: -1] sre.hosts.downtime: convert to class API [cookbooks] - https://gerrit.wikimedia.org/r/633484 (https://phabricator.wikimedia.org/T221212) (owner: Volans)
[09:08:15] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=gerrit site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:08:37] <wikibugs>	 (CR) Volans: [C: +2] log: adjust return type as required by mypy [software/spicerack] - https://gerrit.wikimedia.org/r/633482 (owner: Volans)
[09:09:55] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:10:18] <wikibugs>	 (CR) Volans: "recheck" [software/spicerack] - https://gerrit.wikimedia.org/r/633477 (owner: Volans)
[09:11:44] <wikibugs>	 (Merged) jenkins-bot: log: adjust return type as required by mypy [software/spicerack] - https://gerrit.wikimedia.org/r/633482 (owner: Volans)
[09:14:54] <wikibugs>	 (CR) Volans: "Finally ready to be reviewed. A lot of things have changed since the last implementation so I suggest to starts over like it was a fresh C" (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/506956 (https://phabricator.wikimedia.org/T221212) (owner: Volans)
[09:17:31] <wikibugs>	 (PS2) Volans: pylint: allow 'logger' as module-scope name [cookbooks] - https://gerrit.wikimedia.org/r/633483
[09:22:13] <wikibugs>	 Operations, SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T265250 (Aklapper) Open→Invalid Argh... Thanks for catching this. (I thought I had had a coffee before this?!) Fixed now (for real).
[09:27:36] <wikibugs>	 Operations, ops-eqiad, Analytics-Clusters, User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (elukey) a:Jclark-ctr→Cmjohnson
[09:28:10] <wikibugs>	 (CR) Gehel: [C: +1] "LGTM: always good to fight less with the linters!" [software/spicerack] - https://gerrit.wikimedia.org/r/633478 (owner: Volans)
[09:28:13] <wikibugs>	 Operations, ops-eqiad, Analytics-Clusters, User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (elukey) @Cmjohnson if you have some time during the next days can we swap the NIC on one node only? (to verify the procedure and make sure that the NICs...
[09:29:59] <wikibugs>	 (PS1) Ayounsi: Nfacctd, add src_net, dst_net [puppet] - https://gerrit.wikimedia.org/r/633510 (https://phabricator.wikimedia.org/T254332)
[09:40:05] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [software/pywmflib] - https://gerrit.wikimedia.org/r/631792 (https://phabricator.wikimedia.org/T257905) (owner: Elukey)
[09:47:23] <wikibugs>	 Operations, Traffic, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (ema) >>! In T264398#6535643, @Gilles wrote: > I've checked with curl and it does get cached by both hosts: [...] > Oddly, cp3054 has Content-...
[09:48:28] <wikibugs>	 (PS2) Klausman: amd_rocm: Ensure linux-headers-amd64 is installed [puppet] - https://gerrit.wikimedia.org/r/633194
[09:48:30] <wikibugs>	 (CR) Klausman: amd_rocm: Ensure linux-headers-amd64 is installed (2 comments) [puppet] - https://gerrit.wikimedia.org/r/633194 (owner: Klausman)
[09:53:06] <wikibugs>	 (PS3) Klausman: amd_rocm: Ensure linux-headers-amd64 is installed [puppet] - https://gerrit.wikimedia.org/r/633194
[09:54:19] <wikibugs>	 (CR) Klausman: [C: +2] amd_rocm: Ensure linux-headers-amd64 is installed [puppet] - https://gerrit.wikimedia.org/r/633194 (owner: Klausman)
[09:58:25] <wikibugs>	 (CR) Muehlenhoff: amd_rocm: Ensure linux-headers-amd64 is installed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/633194 (owner: Klausman)
[10:00:51] <wikibugs>	 (CR) Elukey: amd_rocm: Ensure linux-headers-amd64 is installed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/633194 (owner: Klausman)
[10:01:27] <wikibugs>	 Operations, Data-Persistence: Create intengration test env for wmfmariadbpy - https://phabricator.wikimedia.org/T265266 (Kormat)
[10:10:05] <wikibugs>	 Operations, LDAP-Access-Requests: Access to the Logstash for John Bolorinos - https://phabricator.wikimedia.org/T264918 (JMeybohm) p:Triage→Medium
[10:11:37] <wikibugs>	 (CR) Muehlenhoff: amd_rocm: Ensure linux-headers-amd64 is installed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/633194 (owner: Klausman)
[10:13:14] <wikibugs>	 Operations, DBA, Data-Persistence: Create intengration test env for wmfmariadbpy - https://phabricator.wikimedia.org/T265266 (LSobanski)
[10:14:29] <wikibugs>	 (CR) Klausman: amd_rocm: Ensure linux-headers-amd64 is installed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/633194 (owner: Klausman)
[10:19:12] <wikibugs>	 Operations, DBA, Data-Persistence: Create integration test env for wmfmariadbpy - https://phabricator.wikimedia.org/T265266 (Aklapper)
[10:22:57] <wikibugs>	 Operations, Traffic, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (Gilles) It was:  ` gilles@cp3052:/home/ema$ curl -I -H "X-Forwarded-Proto: https" -H "Host: en.wikipedia.org" http://localhost/wiki/Barack_Ob...
[10:24:46] <wikibugs>	 (CR) Elukey: [C: +2] Import the action module from spicerack [software/pywmflib] - https://gerrit.wikimedia.org/r/631792 (https://phabricator.wikimedia.org/T257905) (owner: Elukey)
[10:24:58] <wikibugs>	 Operations, Traffic, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (Gilles) I guess the difference is that I did a HEAD request.
[10:26:09] <hnowlan>	 !log roll-restarting restbase201[345678] for cert refresh
[10:26:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:43] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.cassandra.roll-restart
[10:26:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:23] <wikibugs>	 Operations, serviceops, Growth-Team (Current Sprint), Patch-For-Review, User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (kostajh) >>! In T252391#6533353, @nettrom_WMF wrote: > @kostajh : Thanks for picking this up and pinging me about it. I think...
[10:30:04] <jouncebot>	 jan_drewniak: How many deployers does it take to do Wikimedia Portals Update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201012T1030).
[10:32:08] <wikibugs>	 Operations, Machine Learning Platform, SRE-Access-Requests: Requesting adding to ores-admin for Ladsgroup - https://phabricator.wikimedia.org/T265172 (JMeybohm) p:Triage→Medium @calbon Please review/approve if you are fine with this.
[10:32:42] <wikibugs>	 (PS1) Kosta Harlan: Disable wgWMEUnderstandingFirstDay (EditorJourney) logging [mediawiki-config] - https://gerrit.wikimedia.org/r/633514 (https://phabricator.wikimedia.org/T252391)
[10:33:09] <librenms-wmf>	 Warning Alert for device scs-c1-codfw.mgmt.codfw.wmnet - Processor usage over 85%
[10:33:18] <wikibugs>	 (CR) Kosta Harlan: [C: -1] "Do not merge until the code change to WikimediaEvents is done per T252391#6536174" [mediawiki-config] - https://gerrit.wikimedia.org/r/633514 (https://phabricator.wikimedia.org/T252391) (owner: Kosta Harlan)
[10:33:45] <wikibugs>	 (PS2) Kosta Harlan: [DNM] Disable wgWMEUnderstandingFirstDay (EditorJourney) logging [mediawiki-config] - https://gerrit.wikimedia.org/r/633514 (https://phabricator.wikimedia.org/T252391)
[10:37:34] <wikibugs>	 (CR) Volans: "One nit and one question inline." (2 comments) [software/pywmflib] - https://gerrit.wikimedia.org/r/631909 (https://phabricator.wikimedia.org/T257905) (owner: Elukey)
[10:39:31] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" (1 comment) [software/pywmflib] - https://gerrit.wikimedia.org/r/631910 (https://phabricator.wikimedia.org/T257905) (owner: Elukey)
[10:39:52] <wikibugs>	 (CR) Elukey: Import the config module from Spicerack (1 comment) [software/pywmflib] - https://gerrit.wikimedia.org/r/631909 (https://phabricator.wikimedia.org/T257905) (owner: Elukey)
[10:42:32] <wikibugs>	 Operations, Wikimedia-Apache-configuration, Patch-For-Review: Update 2030.wikimedia.org redirect to new URI - https://phabricator.wikimedia.org/T264797 (JMeybohm) p:Triage→Medium a:Dzahn Assigned to @Dzahn, since it looks as if a additional review of yours has been requested.
[10:44:22] <wikibugs>	 (PS5) Elukey: Import the config module from Spicerack [software/pywmflib] - https://gerrit.wikimedia.org/r/631909 (https://phabricator.wikimedia.org/T257905)
[10:44:24] <wikibugs>	 (PS4) Elukey: Import the phabricator module from Spicerack [software/pywmflib] - https://gerrit.wikimedia.org/r/631910 (https://phabricator.wikimedia.org/T257905)
[10:45:36] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" (2 comments) [software/pywmflib] - https://gerrit.wikimedia.org/r/631909 (https://phabricator.wikimedia.org/T257905) (owner: Elukey)
[10:46:21] <wikibugs>	 (CR) Elukey: [C: +2] Import the config module from Spicerack [software/pywmflib] - https://gerrit.wikimedia.org/r/631909 (https://phabricator.wikimedia.org/T257905) (owner: Elukey)
[11:00:04] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201012T1100).
[11:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[11:00:27] <Lucas_WMDE>	 I’d like to add a config change to the deploy window
[11:00:50] <Urbanecm>	 Lucas_WMDE: I don't see any reason why not :)
[11:01:54] <Urbanecm>	 Lucas_WMDE: and please ping me when done, I have also sth to do :)
[11:01:57] <wikibugs>	 (PS5) Lucas Werkmeister (WMDE): Require autoconfirmed status to edit Wikidata Properties [mediawiki-config] - https://gerrit.wikimedia.org/r/631809 (https://phabricator.wikimedia.org/T254280) (owner: Abián)
[11:01:58] <Lucas_WMDE>	 ok
[11:02:06] <Lucas_WMDE>	 just added ^ to the calendar
[11:02:34] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] Require autoconfirmed status to edit Wikidata Properties [mediawiki-config] - https://gerrit.wikimedia.org/r/631809 (https://phabricator.wikimedia.org/T254280) (owner: Abián)
[11:03:20] <wikibugs>	 (Merged) jenkins-bot: Require autoconfirmed status to edit Wikidata Properties [mediawiki-config] - https://gerrit.wikimedia.org/r/631809 (https://phabricator.wikimedia.org/T254280) (owner: Abián)
[11:04:18] <Lucas_WMDE>	 pulled onto mwdebug2001, I’ll test it with a colleague
[11:06:53] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:08:35] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:10:06] <wikibugs>	 (PS1) Elukey: Remove lvmbackups of Analytics meta from the Hadoop Standby master [puppet] - https://gerrit.wikimedia.org/r/633516
[11:16:58] <Lucas_WMDE>	 test successful, syncing config change
[11:17:30] <wikibugs>	 (PS2) Elukey: Remove lvmbackups of Analytics meta from the Hadoop Standby master [puppet] - https://gerrit.wikimedia.org/r/633516 (https://phabricator.wikimedia.org/T257412)
[11:17:53] <icinga-wm>	 PROBLEM - Too many messages in kafka logging-eqiad #o11y on alert1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1002 job=burrow partition={3,4,5} prometheus=ops site=eqiad topic={udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-d
[11:17:53] <icinga-wm>	 var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[11:18:00] <wikibugs>	 (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/25819/" [puppet] - https://gerrit.wikimedia.org/r/633516 (https://phabricator.wikimedia.org/T257412) (owner: Elukey)
[11:18:26] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:631809|Require autoconfirmed status to edit Wikidata Properties (T254280)]] (duration: 01m 00s)
[11:18:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:32] <stashbot>	 T254280: Semi-protecting all properties - https://phabricator.wikimedia.org/T254280
[11:18:47] <Lucas_WMDE>	 Urbanecm: the stage is yours
[11:19:11] <Urbanecm>	 thanks
[11:19:33] <elukey>	 klausman: ok if I puppet-merge your change?
[11:20:15] <wikibugs>	 (PS4) Urbanecm: Enable wgCheckUserLogLogins at all wikis but few large wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/629227 (https://phabricator.wikimedia.org/T253802)
[11:20:21] <wikibugs>	 (CR) Urbanecm: [C: +2] Enable wgCheckUserLogLogins at all wikis but few large wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/629227 (https://phabricator.wikimedia.org/T253802) (owner: Urbanecm)
[11:20:30] <elukey>	 (since it is a little change already reviewed I am proceeding)
[11:21:07] <wikibugs>	 (Merged) jenkins-bot: Enable wgCheckUserLogLogins at all wikis but few large wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/629227 (https://phabricator.wikimedia.org/T253802) (owner: Urbanecm)
[11:25:06] <klausman>	 elukey: yes!
[11:26:03] <Urbanecm>	 marostegui: heads-up, T253802 is going to be enabled at all wikis but few very large ones. Please do ping me if it causes anything unexpected, but given the cswiki/fawiki trial, it should be good.
[11:26:04] <stashbot>	 T253802: Configure WMF wikis to log login attempts in CheckUser - https://phabricator.wikimedia.org/T253802
[11:27:28] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0)
[11:27:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:05] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 4966e8a6b8ae4e6d5623dd35e65ed8fcf3338bc1: Enable wgCheckUserLogLogins at all wikis but few large wikis (T253802) (duration: 00m 58s)
[11:28:07] <icinga-wm>	 RECOVERY - Too many messages in kafka logging-eqiad #o11y on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[11:28:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:51] <wikibugs>	 (PS3) Urbanecm: [testwiki, test2wiki] Allow bureaucrats to grant import rights [mediawiki-config] - https://gerrit.wikimedia.org/r/633273
[11:28:58] <wikibugs>	 (CR) Urbanecm: [C: +2] [testwiki, test2wiki] Allow bureaucrats to grant import rights [mediawiki-config] - https://gerrit.wikimedia.org/r/633273 (owner: Urbanecm)
[11:29:30] <wikibugs>	 (PS1) Elukey: role::analytics_cluster::coordinator: avoid lvm backups to an-master1002 [puppet] - https://gerrit.wikimedia.org/r/633519 (https://phabricator.wikimedia.org/T257412)
[11:29:43] <wikibugs>	 (Merged) jenkins-bot: [testwiki, test2wiki] Allow bureaucrats to grant import rights [mediawiki-config] - https://gerrit.wikimedia.org/r/633273 (owner: Urbanecm)
[11:31:40] <wikibugs>	 (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/25820/" [puppet] - https://gerrit.wikimedia.org/r/633519 (https://phabricator.wikimedia.org/T257412) (owner: Elukey)
[11:32:29] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: fff2532424f84970962f7de1e35d4250b83cb3da: [testwiki, test2wiki] Allow bureaucrats to grant import rights (duration: 00m 58s)
[11:32:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:14] <wikibugs>	 (CR) Urbanecm: [C: +2] [labs] Remove wmgMonologChannels override [mediawiki-config] - https://gerrit.wikimedia.org/r/631224 (owner: Urbanecm)
[11:33:56] <wikibugs>	 (Merged) jenkins-bot: [labs] Remove wmgMonologChannels override [mediawiki-config] - https://gerrit.wikimedia.org/r/631224 (owner: Urbanecm)
[11:38:06] <wikibugs>	 (CR) Elukey: [C: +2] Import the phabricator module from Spicerack [software/pywmflib] - https://gerrit.wikimedia.org/r/631910 (https://phabricator.wikimedia.org/T257905) (owner: Elukey)
[11:38:43] <Urbanecm>	 !log EU B&C done
[11:38:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:44:11] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:45:53] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:47:53] <wikibugs>	 10Operations, 10DBA, 10Data-Persistence, 10User-Kormat: Create integration test env for wmfmariadbpy - https://phabricator.wikimedia.org/T265266 (10Kormat)
[11:49:13] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:50:55] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:54:14] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:55:16] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:16:01] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] "Updated pcc https://puppet-compiler.wmflabs.org/compiler1003/25821/maps1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/608726 (https://phabricator.wikimedia.org/T254014) (owner: 10Ryan Kemper)
[12:26:21] <moritzm>	 !log installing spice security updates on Buster
[12:26:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:04] <moritzm>	 !log installing rails security updates on Stretch
[12:39:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:04] <wikibugs>	 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10ema) >>! In T264398#6536169, @Gilles wrote: > I guess the difference is that I did a HEAD request. It's still reproducible right now.  OK, I'...
[12:41:06] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] "LGTM: always good to fight less with the linters!" [cookbooks] - 10https://gerrit.wikimedia.org/r/633483 (owner: 10Volans)
[12:42:46] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:44:26] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:02:46] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:04:28] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:33:10] <librenms-wmf>	 08Warning Alert for device scs-c1-codfw.mgmt.codfw.wmnet - Processor usage over 85%
[13:41:27] <wikibugs>	 (03PS1) 10Kormat: (Mostly) convert to pytest. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/633534
[13:42:26] <icinga-wm>	 PROBLEM - Too many messages in kafka logging-eqiad #o11y on alert1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1002 job=burrow partition={3,4,5} prometheus=ops site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-clu
[13:42:26] <icinga-wm>	 d&var-topic=All&var-consumer_group=All
[13:42:28] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] (Mostly) convert to pytest. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/633534 (owner: 10Kormat)
[13:42:32] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10LSobanski) @Cmjohnson @Jclark-ctr Is there anything we (#dba) can do to help move this forward?
[13:45:54] <icinga-wm>	 RECOVERY - Too many messages in kafka logging-eqiad #o11y on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[13:47:43] <wikibugs>	 (03PS1) 10Ayounsi: cr2-eqsin: set real HE IPs [homer/public] - 10https://gerrit.wikimedia.org/r/633535
[13:48:16] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] cr2-eqsin: set real HE IPs [homer/public] - 10https://gerrit.wikimedia.org/r/633535 (owner: 10Ayounsi)
[13:48:43] <wikibugs>	 (03Merged) 10jenkins-bot: cr2-eqsin: set real HE IPs [homer/public] - 10https://gerrit.wikimedia.org/r/633535 (owner: 10Ayounsi)
[13:50:32] <icinga-wm>	 RECOVERY - BGP status on cr2-eqsin is OK: BGP OK - up: 28, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:50:39] <wikibugs>	 10Operations, 10Wikimedia-Apache-configuration, 10Patch-For-Review: Update 2030.wikimedia.org redirect to new URI - https://phabricator.wikimedia.org/T264797 (10Dzahn) a:05Dzahn→03None
[13:57:29] <wikibugs>	 (03CR) 10Nikerabbit: "Thanks. Some earlier discussion about this happened in I689b3eeee765822743d99a5820a95ca72c8a7680" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631224 (owner: 10Urbanecm)
[13:59:13] <wikibugs>	 10Operations, 10Fundraising-Backlog, 10Thank-You-Page, 10MW-1.36-notes (1.36.0-wmf.4; 2020-08-11), and 3 others: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10Nintendofan885)
[14:05:09] <wikibugs>	 (03PS1) 10Elukey: Clean up Analytics lvm-based backup not used anymore [puppet] - 10https://gerrit.wikimedia.org/r/633545 (https://phabricator.wikimedia.org/T257412)
[14:05:37] <moritzm>	 !log uploaded php7.2 7.2.31-1+0~20200514.41+debian9~1.gbpe2a56b+wmf1+icu63 to component/icu63 T264991
[14:05:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:43] <stashbot>	 T264991: Upgrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991
[14:06:01] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Clean up Analytics lvm-based backup not used anymore [puppet] - 10https://gerrit.wikimedia.org/r/633545 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey)
[14:12:28] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:14:08] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:22:34] <wikibugs>	 (03PS1) 10Elukey: sre.hadoop.change-distro-from-cdh: allow to select workers/journal [cookbooks] - 10https://gerrit.wikimedia.org/r/633550
[14:39:44] <icinga-wm>	 PROBLEM - Disk space on ms-be2036 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=90%): /tmp 0 MB (0% inode=90%): /var/tmp 0 MB (0% inode=90%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2036&var-datasource=codfw+prometheus/ops
[14:41:48] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] pylint: allow 'logger' as module-scope name [software/spicerack] - 10https://gerrit.wikimedia.org/r/633478 (owner: 10Volans)
[14:42:49] <_joe_>	 uh what's up with that swift server?
[14:44:03] <_joe_>	 !log freed 1.5 GB of space on ms-be2036 by running "apt-get clean"
[14:44:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:15] <wikibugs>	 (03PS1) 10Kormat: tox: Expand to test against py3.[5678] [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/633552
[14:49:47] <wikibugs>	 (03PS2) 10Kormat: (Mostly) convert to pytest. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/633534
[14:53:38] <_joe_>	 something's wrong with ms-be2036. du reports only 13 GB used, but 53 appear to be used.
[14:53:46] <_joe_>	 when looking at df
[14:53:46] <apergos>	 https://phabricator.wikimedia.org/T265208  it's being worked on, _joe_
[14:54:13] <_joe_>	 and there are no deleted files
[14:54:19] <_joe_>	 apergos: I suspect this is another problem
[14:54:27] <wikibugs>	 (03PS2) 10Kormat: tox: Expand to test against py3.[5678] [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/633552
[14:54:40] <_joe_>	 specifically, that some data was written to /srv/swift-storage while the partitions were unmounted
[14:55:05] <apergos>	 sounds likely
[14:55:28] <apergos>	 I would just note that on the task and let folks handle it once the main issues are sorted out
[14:55:38] <apergos>	 there might be more reboots/unmounts in its future
[14:56:57] <wikibugs>	 (03PS3) 10Kormat: tox: Expand to test against py3.[5678] [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/633552
[14:57:25] <godog>	 thanks folks, I'll take a look too at ms-be2036
[14:57:45] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2036 - https://phabricator.wikimedia.org/T265208 (10Joe) I just want to comment that this server had its root directory filled up today, and it's in a strange state where only 13 GB are found by `du -xsh /`, but 53 are occupied on `/dev/md0` according to `df`. G...
[14:57:49] <_joe_>	 godog: oh I thought you were off today
[14:58:19] <godog>	 _joe_: yeah technically you are right, it is a holiday in Spain but I'll be floating it
[14:58:42] <kormat>	 _joe_: he Observes all
[14:59:10] <godog>	 I'm not observing this holiday tho! File:Sting.ogg
[14:59:37] <kormat>	 you're.. an englishman in new york?
[15:00:25] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "Looks sane to me" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/633552 (owner: 10Kormat)
[15:00:38] <godog>	 I'm afraid I'm not getting it
[15:00:54] <kormat>	 https://www.youtube.com/watch?v=d27gTrPPAyk
[15:01:17] <godog>	 hah! thanks that explains
[15:01:24] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] tox: Expand to test against py3.[5678] [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/633552 (owner: 10Kormat)
[15:01:33] <godog>	 I was more like https://commons.wikimedia.org/wiki/File:Sting.ogg
[15:02:47] <wikibugs>	 (03Merged) 10jenkins-bot: tox: Expand to test against py3.[5678] [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/633552 (owner: 10Kormat)
[15:03:03] <wikibugs>	 (03PS3) 10Kormat: (Mostly) convert to pytest. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/633534
[15:05:10] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:07:53] <kormat>	 godog: ahh :)
[15:08:36] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:09:38] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:13:50] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2036 - https://phabricator.wikimedia.org/T265208 (10fgiunchedi) Thanks! Indeed that's what happened, I've unmounted the filesystems and deleted the files
[15:15:33] <wikibugs>	 (03CR) 10Volans: [C: 03+2] pylint: allow 'logger' as module-scope name [cookbooks] - 10https://gerrit.wikimedia.org/r/633483 (owner: 10Volans)
[15:16:46] <wikibugs>	 (03Merged) 10jenkins-bot: pylint: allow 'logger' as module-scope name [cookbooks] - 10https://gerrit.wikimedia.org/r/633483 (owner: 10Volans)
[15:18:08] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:18:35] <wikibugs>	 (03PS8) 10Volans: cookbook API: add class API [software/spicerack] - 10https://gerrit.wikimedia.org/r/506956 (https://phabricator.wikimedia.org/T221212)
[15:18:44] <wikibugs>	 (03PS2) 10Volans: sre.hosts.downtime: convert to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/633484 (https://phabricator.wikimedia.org/T221212)
[15:19:20] <wikibugs>	 (03CR) 10Volans: "An example of a converted cookbook can be checked in I97b48851c3e33cea6f34439d3036ff0cba11eac7" [software/spicerack] - 10https://gerrit.wikimedia.org/r/506956 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans)
[15:19:55] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] sre.hosts.downtime: convert to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/633484 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans)
[15:20:22] <wikibugs>	 (03CR) 10Volans: "Expected CI failure as it depends on the Depends-On change to be merged and released." [cookbooks] - 10https://gerrit.wikimedia.org/r/633484 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans)
[15:21:02] <icinga-wm>	 RECOVERY - Disk space on ms-be2036 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2036&var-datasource=codfw+prometheus/ops
[15:27:05] <wikibugs>	 10Operations, 10ops-eqiad, 10Discovery-Search: Memory issue on elastic1063 caused elasticsearch to be killed - https://phabricator.wikimedia.org/T265113 (10Gehel) p:05Triage→03High
[15:27:48] <wikibugs>	 10Operations, 10Analytics-Clusters, 10Traffic: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (10ema) >>! In T264074#6507717, @elukey wrote:  > Most of the usage seems to be VUT related, especially for `fxstatat64` (no idea where it is used).   You are indeed correct. The n...
[15:30:39] <wikibugs>	 (03PS1) 10Hnowlan: api-gateway: more instances in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/633559
[15:33:10] <icinga-wm>	 PROBLEM - Too many messages in kafka logging-eqiad #o11y on alert1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1002 job=burrow partition={2,3,4} prometheus=ops site=eqiad topic={udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-d
[15:33:10] <icinga-wm>	 var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[15:35:54] <godog>	 I'm taking a look, looks like codfw only though
[15:41:42] <icinga-wm>	 RECOVERY - Too many messages in kafka logging-eqiad #o11y on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[15:41:47] <godog>	 !log roll-restart logstash5 in codfw
[15:41:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:16] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic: ~1 request/minute to intake-logging.wikimedia.org times out at the traffic/service interface - https://phabricator.wikimedia.org/T264021 (10Ottomata) a:03Ottomata Interesting.  So we don't know exactly where the timeout is occurring?  Assigning to...
[16:02:38] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:04:22] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:04:43] <wikibugs>	 10Operations, 10Analytics-Clusters, 10Traffic: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (10elukey) Thanks Ema, really great analysis!  I am wondering if we could quickly test how varnishncsa behaves when we pass `-q`, that seems to be the big difference between the tw...
[16:33:10] <librenms-wmf>	 08Warning Alert for device scs-c1-codfw.mgmt.codfw.wmnet - Processor usage over 85%
[16:47:57] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2036 - https://phabricator.wikimedia.org/T265208 (10JMeybohm) Also the system is sending root mails since ~15:10.  ` Cron <swift@ms-be2036>   test -x /usr/bin/swift-recon-cron && test -r /etc/swift/object-server.conf && /usr/bin/swift-recon-cron /etc/swift/objec...
[16:48:14] <jayme>	 godog: you're around by chance?
[17:00:04] <jouncebot>	 ryankemper: #bothumor My software never has bugs. It just develops random features. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201012T1700).
[17:03:03] <jayme>	 !log fixed /var/lock/ permission (1777) on ms-be2036 - T265208
[17:03:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:03:10] <stashbot>	 T265208: Degraded RAID on ms-be2036 - https://phabricator.wikimedia.org/T265208
[17:05:16] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2036 - https://phabricator.wikimedia.org/T265208 (10JMeybohm) I could not find any evidence that this was an intentional change, so I fixed the permissions.
[17:17:20] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={swagger_check_citoid_cluster_codfw,swagger_check_restbase_esams} site={codfw,esams} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:19:02] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:36:46] <wikibugs>	 10Operations, 10Analytics-Clusters, 10Traffic: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (10ema) >>! In T264074#6536967, @elukey wrote: > I am wondering if we could quickly test how varnishncsa behaves when we pass `-q`, that seems to be the big difference between the...
[17:58:54] <wikibugs>	 (03PS1) 10ArielGlenn: get revision info from stubs file and use to generate page range info [dumps] - 10https://gerrit.wikimedia.org/r/633567
[18:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201012T1800).
[18:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[18:02:27] <wikibugs>	 (03PS2) 10ArielGlenn: get revision info from stubs file and use to generate page range info [dumps] - 10https://gerrit.wikimedia.org/r/633567 (https://phabricator.wikimedia.org/T263319)
[18:18:28] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:21:42] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:37:43] <wikibugs>	 (03CR) 10Nuria: [C: 03+1] "Let's merge this, ya'all?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/625771 (https://phabricator.wikimedia.org/T260582) (owner: 10Gergő Tisza)
[18:44:38] <wikibugs>	 10Operations, 10Analytics-Clusters, 10Traffic: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (10elukey) There is another test that we could do, namely grouping. As far as I can see in the varnishkafka change, the [[ https://github.com/wikimedia/varnishkafka/commit/b0675e80...
[18:45:15] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools as a beta feature on 30 more wikis ("phase 2") [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633568 (https://phabricator.wikimedia.org/T264693)
[19:02:58] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 123 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:04:34] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 4 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:30:58] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={pdu_sentry4,swagger_check_restbase_esams} site={eqsin,esams} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:32:36] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:33:10] <librenms-wmf>	 08Warning Alert for device scs-c1-codfw.mgmt.codfw.wmnet - Processor usage over 85%
[19:55:34] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:57:16] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:00:04] <jouncebot>	 chrisalbon and accraze: (Dis)respected human, time to deploy Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201012T2000). Please do the needful.
[20:50:17] <wikibugs>	 10Operations, 10Machine Learning Platform, 10ORES, 10Okapi, and 3 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10Ladsgroup) ^ This will make redis connection handling slightly healthier but I can't say it will handle this case as observability of ores is...
[21:00:04] <jouncebot>	 Reedy and sbassett: My dear minions, it's time we take the moon! Just kidding. Time for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201012T2100).
[22:02:36] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 130 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:04:09] <wikibugs>	 10Operations, 10Advanced-Search, 10Discovery-Search, 10Traffic, and 3 others: Strange URL pattern after search https://en.wikipedia.org/w/index.php?sort=relevance&sort=relevance&sort=relevance&sort=relevance&sort=relevance&sort=relevance ... - https://phabricator.wikimedia.org/T243884 (10Ladsgroup) I think...
[22:04:14] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 14 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:24:17] <wikibugs>	 (03PS1) 10MichaelSchoenitzer: Add neovim, fd and ripgrep to toolforge [puppet] - 10https://gerrit.wikimedia.org/r/633583
[22:24:19] <wikibugs>	 (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - 10https://gerrit.wikimedia.org/r/633583 (owner: 10MichaelSchoenitzer)
[22:28:01] <wikibugs>	 (03PS2) 10MichaelSchoenitzer: Add neovim, fd and ripgrep to toolforge [puppet] - 10https://gerrit.wikimedia.org/r/633583 (https://phabricator.wikimedia.org/T219501)
[22:33:10] <librenms-wmf>	 08Warning Alert for device scs-c1-codfw.mgmt.codfw.wmnet - Processor usage over 85%
[23:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: Time to snap out of that daydream and deploy Evening backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201012T2300).
[23:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[23:56:18] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:58:00] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets