[01:59:18] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:01:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:24:02] PROBLEM - Check systemd state on ms-be2024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:32:32] RECOVERY - Check systemd state on ms-be2024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:37:12] (PS1) Elukey: Decom analytics1047 from the Hadoop cluster [puppet] - https://gerrit.wikimedia.org/r/633385 (https://phabricator.wikimedia.org/T255140)
[05:38:11] (CR) Elukey: [C: +2] Decom analytics1047 from the Hadoop cluster [puppet] - https://gerrit.wikimedia.org/r/633385 (https://phabricator.wikimedia.org/T255140) (owner: Elukey)
[05:54:17] (PS1) Elukey: Set the test coordinator role to an-test-coord1001 [puppet] - https://gerrit.wikimedia.org/r/633386 (https://phabricator.wikimedia.org/T255139)
[05:55:18] (CR) Elukey: [C: +2] Set the test coordinator role to an-test-coord1001 [puppet] - https://gerrit.wikimedia.org/r/633386 (https://phabricator.wikimedia.org/T255139) (owner: Elukey)
[06:32:21] (PS1) Elukey: role::analytics_test_cluster::coordiantor: avoid db backups [puppet] - https://gerrit.wikimedia.org/r/633387 (https://phabricator.wikimedia.org/T255139)
[06:33:13] (CR) Elukey: [C: +2] role::analytics_test_cluster::coordiantor: avoid db backups [puppet] - https://gerrit.wikimedia.org/r/633387
(https://phabricator.wikimedia.org/T255139) (owner: Elukey)
[06:53:41] (PS1) Elukey: Reduce the HDFS block replication factor for the Hadoop test cluster [puppet] - https://gerrit.wikimedia.org/r/633471 (https://phabricator.wikimedia.org/T255139)
[06:55:43] (CR) Elukey: [C: +2] Reduce the HDFS block replication factor for the Hadoop test cluster [puppet] - https://gerrit.wikimedia.org/r/633471 (https://phabricator.wikimedia.org/T255139) (owner: Elukey)
[07:08:05] (PS3) Muehlenhoff: Install ldap-replica200[34] as additional LDAP replicas [puppet] - https://gerrit.wikimedia.org/r/632648 (https://phabricator.wikimedia.org/T264388)
[07:12:06] (CR) Muehlenhoff: [C: +1] "Looks good." [puppet] - https://gerrit.wikimedia.org/r/633275 (https://phabricator.wikimedia.org/T264991) (owner: Dzahn)
[07:17:15] Operations, serviceops, Patch-For-Review: Upgrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 (MoritzMuehlenhoff) >>! In T264991#6534293, @bd808 wrote: > Should this task be merged with {T245757} somehow? Probablyish, but the scope is a little differe...
[07:19:01] Operations, serviceops, Patch-For-Review: Upgrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 (MoritzMuehlenhoff) >>! In T264991#6534026, @Dzahn wrote: > ii prometheus-nutcracker-exporter 0.2+nmu1 all Prometheus...
[07:33:10] Warning Alert for device scs-c1-codfw.mgmt.codfw.wmnet - Processor usage over 85%
[07:34:14] (PS1) Muehlenhoff: Remove dotfiles for banyek, demon, rush [puppet] - https://gerrit.wikimedia.org/r/633473
[07:39:39] Operations, Traffic, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (Gilles) Thanks for trying that out. I ran the same command for much longer with higher concurrency. For Main_Page, I saw no difference. Then...
[07:53:27] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime
[07:53:27] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[07:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:33] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime
[07:53:35] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:37] Operations, ops-codfw: Degraded RAID on ms-be2036 - https://phabricator.wikimedia.org/T265208 (ops-monitoring-bot) Icinga downtime for 4:00:00 set by filippo@cumin1001 on 1 host(s) and their services with reason: reboot ` ms-be2036.codfw.wmnet `
[07:53:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:48] !log reboot ms-be2036 - T265208
[07:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:54] T265208: Degraded RAID on ms-be2036 - https://phabricator.wikimedia.org/T265208
[07:57:39] (PS1) Elukey: role::analytics_test_cluster::coordinator: set /srv/mysql as datadir for mariadb [puppet] - https://gerrit.wikimedia.org/r/633476 (https://phabricator.wikimedia.org/T255139)
[07:58:04] Operations, serviceops, Patch-For-Review: Upgrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 (Joe) >>! In T264991#6533992, @Legoktm wrote: >>>! In T264991#6533968, @Dzahn wrote: >> - ploticus > > {T253377} Given we're sunsetting graphoid instead, Ea...
[07:58:38] (CR) Elukey: [C: +2] role::analytics_test_cluster::coordinator: set /srv/mysql as datadir for mariadb [puppet] - https://gerrit.wikimedia.org/r/633476 (https://phabricator.wikimedia.org/T255139) (owner: Elukey)
[08:00:30] Operations, Traffic, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (Gilles) I've checked with curl and it does get cached by both hosts: ` X-Cache: cp3060 miss, cp3052 hit/1000011 X-Cache-Status: hit-front...
[08:06:59] (PS1) Volans: documentation: refactor configuration [software/spicerack] - https://gerrit.wikimedia.org/r/633477
[08:07:01] (PS1) Volans: pylint: allow 'logger' as module-scope name [software/spicerack] - https://gerrit.wikimedia.org/r/633478
[08:07:31] RECOVERY - Check systemd state on ms-be2036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:08:58] Operations, ops-codfw: Degraded RAID on ms-be2036 - https://phabricator.wikimedia.org/T265208 (fgiunchedi) The host rebooted into Linux OK, however there were error messages at boot. Looks like related to both ilo and the hw raid firmware. ` Slot 3 Port 1 : Smart Array P840 Controller - (4096 MB, V4.52)...
[08:09:39] (CR) jerkins-bot: [V: -1] pylint: allow 'logger' as module-scope name [software/spicerack] - https://gerrit.wikimedia.org/r/633478 (owner: Volans)
[08:09:44] Operations, Analytics, SRE-Access-Requests: Renable SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T265071 (Kormat) Open→Resolved a:Kormat Sounds like this is complete, so resolving.
[08:09:47] (CR) jerkins-bot: [V: -1] documentation: refactor configuration [software/spicerack] - https://gerrit.wikimedia.org/r/633477 (owner: Volans)
[08:09:47] RECOVERY - puppet last run on ms-be2036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:11:03] RECOVERY - MD RAID on ms-be2036 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[08:19:50] Operations, Analytics, SRE-Access-Requests: Renable SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T265071 (elukey) Just executed: ` elukey@krb1001:~$ sudo manage_principals.py create lexnasser --email_address=lexnasser@icloud.com Principal successfully created. Make...
[08:20:47] Operations, ops-codfw: Degraded RAID on ms-be2036 - https://phabricator.wikimedia.org/T265208 (fgiunchedi) a:Papaul @papaul I've updated the hw raid firmware to 6.88 and rebooted to apply the upgrade. On reboot the message below was still there, what do you think ? Feel free to upgrade the ilo firmwa...
[08:27:38] Operations, Traffic, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (ema) >>! In T264398#6535510, @Gilles wrote: > Thanks for trying that out. I ran the same command for much longer with higher concurrency. For...
[08:37:09] RECOVERY - Disk space on ms-be2036 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2036&var-datasource=codfw+prometheus/ops
[08:38:05] Operations, SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T265250 (Aklapper)
[08:40:39] (CR) Filippo Giunchedi: [C: +1] logstash: add field checks to filter throttle [puppet] - https://gerrit.wikimedia.org/r/633224 (owner: Herron)
[08:43:28] (CR) Ayounsi: [C: +2] Prioritize SG-IX [homer/public] - https://gerrit.wikimedia.org/r/633200 (https://phabricator.wikimedia.org/T260991) (owner: Ayounsi)
[08:43:57] (Merged) jenkins-bot: Prioritize SG-IX [homer/public] - https://gerrit.wikimedia.org/r/633200 (https://phabricator.wikimedia.org/T260991) (owner: Ayounsi)
[08:44:22] Operations, SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T265250 (Kormat) Hi @Aklapper - i think you accidentally created a new task with the form instead of editing the form template ({T265248}). https://phabricator.wikimedia.org/transactions/editen...
[08:49:57] (PS1) Ayounsi: Add BGP to HE on cr2-eqsin [homer/public] - https://gerrit.wikimedia.org/r/633480
[08:50:47] !log uploaded libxml2 2.9.4+dfsg1-2.2+deb9u3+wmf1 to component/icu63 T264991
[08:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:54] T264991: Upgrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991
[08:51:24] (CR) Ayounsi: [C: +2] Add BGP to HE on cr2-eqsin [homer/public] - https://gerrit.wikimedia.org/r/633480 (owner: Ayounsi)
[08:51:47] (Merged) jenkins-bot: Add BGP to HE on cr2-eqsin [homer/public] - https://gerrit.wikimedia.org/r/633480 (owner: Ayounsi)
[08:53:49] Operations, Traffic, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (Gilles) I tried a bunch of different benchmark settings before I ended up using those, some of which definitely had too much concurrency and...
[08:57:13] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE, AS6939/IPv4: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:01:57] (PS2) Volans: documentation: refactor configuration [software/spicerack] - https://gerrit.wikimedia.org/r/633477
[09:01:59] (PS2) Volans: pylint: allow 'logger' as module-scope name [software/spicerack] - https://gerrit.wikimedia.org/r/633478
[09:02:01] (PS1) Volans: log: adjust return type as required by mypy [software/spicerack] - https://gerrit.wikimedia.org/r/633482
[09:02:08] (PS1) Volans: pylint: allow 'logger' as module-scope name [cookbooks] - https://gerrit.wikimedia.org/r/633483
[09:02:10] (PS1) Volans: sre.hosts.downtime: convert to class API [cookbooks] - https://gerrit.wikimedia.org/r/633484 (https://phabricator.wikimedia.org/T221212)
[09:03:54] (CR) Elukey: "This is a good start, I like the approach about working on a single domain at the time. I think that we should consider some failure scena" [puppet] - https://gerrit.wikimedia.org/r/633227 (https://phabricator.wikimedia.org/T240439) (owner: Razzi)
[09:04:14] Operations, Discovery-Search (Current work), Patch-For-Review: (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (Gehel) Open→Resolved Service implementation will follow in T262211.
[09:04:27] Operations, Traffic, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (Gilles) If it was indeed the last benchmark run, what you're saying is that cp3052 favored the benchmark requests over others, while cp3054 r...
[09:04:40] (CR) jerkins-bot: [V: -1] documentation: refactor configuration [software/spicerack] - https://gerrit.wikimedia.org/r/633477 (owner: Volans)
[09:04:43] (CR) jerkins-bot: [V: -1] sre.hosts.downtime: convert to class API [cookbooks] - https://gerrit.wikimedia.org/r/633484 (https://phabricator.wikimedia.org/T221212) (owner: Volans)
[09:08:15] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=gerrit site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:08:37] (CR) Volans: [C: +2] log: adjust return type as required by mypy [software/spicerack] - https://gerrit.wikimedia.org/r/633482 (owner: Volans)
[09:09:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:10:18] (CR) Volans: "recheck" [software/spicerack] - https://gerrit.wikimedia.org/r/633477 (owner: Volans)
[09:11:44] (Merged) jenkins-bot: log: adjust return type as required by mypy [software/spicerack] - https://gerrit.wikimedia.org/r/633482 (owner: Volans)
[09:14:54] (CR) Volans: "Finally ready to be reviewed. A lot of things have changed since the last implementation so I suggest to starts over like it was a fresh C" (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/506956 (https://phabricator.wikimedia.org/T221212) (owner: Volans)
[09:17:31] (PS2) Volans: pylint: allow 'logger' as module-scope name [cookbooks] - https://gerrit.wikimedia.org/r/633483
[09:22:13] Operations, SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T265250 (Aklapper) Open→Invalid Argh... Thanks for catching this. (I thought I had had a coffee before this?!) Fixed now (for real).
[09:27:36] Operations, ops-eqiad, Analytics-Clusters, User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (elukey) a:Jclark-ctr→Cmjohnson
[09:28:10] (CR) Gehel: [C: +1] "LGTM: always good to fight less with the linters!" [software/spicerack] - https://gerrit.wikimedia.org/r/633478 (owner: Volans)
[09:28:13] Operations, ops-eqiad, Analytics-Clusters, User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (elukey) @Cmjohnson if you have some time during the next days can we swap the NIC on one node only? (to verify the procedure and make sure that the NICs...
[09:29:59] (PS1) Ayounsi: Nfacctd, add src_net, dst_net [puppet] - https://gerrit.wikimedia.org/r/633510 (https://phabricator.wikimedia.org/T254332)
[09:40:05] (CR) Volans: [C: +1] "LGTM" [software/pywmflib] - https://gerrit.wikimedia.org/r/631792 (https://phabricator.wikimedia.org/T257905) (owner: Elukey)
[09:47:23] Operations, Traffic, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (ema) >>! In T264398#6535643, @Gilles wrote: > I've checked with curl and it does get cached by both hosts: [...] > Oddly, cp3054 has Content-...
[09:48:28] (PS2) Klausman: amd_rocm: Ensure linux-headers-amd64 is installed [puppet] - https://gerrit.wikimedia.org/r/633194
[09:48:30] (CR) Klausman: amd_rocm: Ensure linux-headers-amd64 is installed (2 comments) [puppet] - https://gerrit.wikimedia.org/r/633194 (owner: Klausman)
[09:53:06] (PS3) Klausman: amd_rocm: Ensure linux-headers-amd64 is installed [puppet] - https://gerrit.wikimedia.org/r/633194
[09:54:19] (CR) Klausman: [C: +2] amd_rocm: Ensure linux-headers-amd64 is installed [puppet] - https://gerrit.wikimedia.org/r/633194 (owner: Klausman)
[09:58:25] (CR) Muehlenhoff: amd_rocm: Ensure linux-headers-amd64 is installed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/633194 (owner: Klausman)
[10:00:51] (CR) Elukey: amd_rocm: Ensure linux-headers-amd64 is installed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/633194 (owner: Klausman)
[10:01:27] Operations, Data-Persistence: Create intengration test env for wmfmariadbpy - https://phabricator.wikimedia.org/T265266 (Kormat)
[10:10:05] Operations, LDAP-Access-Requests: Access to the Logstash for John Bolorinos - https://phabricator.wikimedia.org/T264918 (JMeybohm) p:Triage→Medium
[10:11:37] (CR) Muehlenhoff: amd_rocm: Ensure linux-headers-amd64 is installed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/633194 (owner: Klausman)
[10:13:14] Operations, DBA, Data-Persistence: Create intengration test env for wmfmariadbpy - https://phabricator.wikimedia.org/T265266 (LSobanski)
[10:14:29] (CR) Klausman: amd_rocm: Ensure linux-headers-amd64 is installed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/633194 (owner: Klausman)
[10:19:12] Operations, DBA, Data-Persistence: Create integration test env for wmfmariadbpy - https://phabricator.wikimedia.org/T265266 (Aklapper)
[10:22:57] Operations, Traffic, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 ->
6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (Gilles) It was: ` gilles@cp3052:/home/ema$ curl -I -H "X-Forwarded-Proto: https" -H "Host: en.wikipedia.org" http://localhost/wiki/Barack_Ob...
[10:24:46] (CR) Elukey: [C: +2] Import the action module from spicerack [software/pywmflib] - https://gerrit.wikimedia.org/r/631792 (https://phabricator.wikimedia.org/T257905) (owner: Elukey)
[10:24:58] Operations, Traffic, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (Gilles) I guess the difference is that I did a HEAD request.
[10:26:09] !log roll-restarting restbase201[345678] for cert refresh
[10:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:43] !log hnowlan@cumin1001 START - Cookbook sre.cassandra.roll-restart
[10:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:23] Operations, serviceops, Growth-Team (Current Sprint), Patch-For-Review, User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (kostajh) >>! In T252391#6533353, @nettrom_WMF wrote: > @kostajh : Thanks for picking this up and pinging me about it. I think...
[10:30:04] jan_drewniak: How many deployers does it take to do Wikimedia Portals Update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201012T1030).
[10:32:08] Operations, Machine Learning Platform, SRE-Access-Requests: Requesting adding to ores-admin for Ladsgroup - https://phabricator.wikimedia.org/T265172 (JMeybohm) p:Triage→Medium @calbon Please review/approve if you are fine with this.
[10:32:42] (PS1) Kosta Harlan: Disable wgWMEUnderstandingFirstDay (EditorJourney) logging [mediawiki-config] - https://gerrit.wikimedia.org/r/633514 (https://phabricator.wikimedia.org/T252391)
[10:33:09] Warning Alert for device scs-c1-codfw.mgmt.codfw.wmnet - Processor usage over 85%
[10:33:18] (CR) Kosta Harlan: [C: -1] "Do not merge until the code change to WikimediaEvents is done per T252391#6536174" [mediawiki-config] - https://gerrit.wikimedia.org/r/633514 (https://phabricator.wikimedia.org/T252391) (owner: Kosta Harlan)
[10:33:45] (PS2) Kosta Harlan: [DNM] Disable wgWMEUnderstandingFirstDay (EditorJourney) logging [mediawiki-config] - https://gerrit.wikimedia.org/r/633514 (https://phabricator.wikimedia.org/T252391)
[10:37:34] (CR) Volans: "One nit and one question inline." (2 comments) [software/pywmflib] - https://gerrit.wikimedia.org/r/631909 (https://phabricator.wikimedia.org/T257905) (owner: Elukey)
[10:39:31] (CR) Volans: [C: +1] "LGTM" (1 comment) [software/pywmflib] - https://gerrit.wikimedia.org/r/631910 (https://phabricator.wikimedia.org/T257905) (owner: Elukey)
[10:39:52] (CR) Elukey: Import the config module from Spicerack (1 comment) [software/pywmflib] - https://gerrit.wikimedia.org/r/631909 (https://phabricator.wikimedia.org/T257905) (owner: Elukey)
[10:42:32] Operations, Wikimedia-Apache-configuration, Patch-For-Review: Update 2030.wikimedia.org redirect to new URI - https://phabricator.wikimedia.org/T264797 (JMeybohm) p:Triage→Medium a:Dzahn Assigned to @Dzahn, since it looks as if a additional review of yours has been requested.
[10:44:22] (PS5) Elukey: Import the config module from Spicerack [software/pywmflib] - https://gerrit.wikimedia.org/r/631909 (https://phabricator.wikimedia.org/T257905)
[10:44:24] (PS4) Elukey: Import the phabricator module from Spicerack [software/pywmflib] - https://gerrit.wikimedia.org/r/631910 (https://phabricator.wikimedia.org/T257905)
[10:45:36] (CR) Volans: [C: +1] "LGTM" (2 comments) [software/pywmflib] - https://gerrit.wikimedia.org/r/631909 (https://phabricator.wikimedia.org/T257905) (owner: Elukey)
[10:46:21] (CR) Elukey: [C: +2] Import the config module from Spicerack [software/pywmflib] - https://gerrit.wikimedia.org/r/631909 (https://phabricator.wikimedia.org/T257905) (owner: Elukey)
[11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201012T1100).
[11:00:04] No GERRIT patches in the queue for this window AFAICS.
[11:00:27] I’d like to add a config change to the deploy window
[11:00:50] Lucas_WMDE: I don't see any reason why not :)
[11:01:54] Lucas_WMDE: and please ping me when done, I have also sth to do :)
[11:01:57] (PS5) Lucas Werkmeister (WMDE): Require autoconfirmed status to edit Wikidata Properties [mediawiki-config] - https://gerrit.wikimedia.org/r/631809 (https://phabricator.wikimedia.org/T254280) (owner: Abián)
[11:01:58] ok
[11:02:06] just added ^ to the calendar
[11:02:34] (CR) Lucas Werkmeister (WMDE): [C: +2] Require autoconfirmed status to edit Wikidata Properties [mediawiki-config] - https://gerrit.wikimedia.org/r/631809 (https://phabricator.wikimedia.org/T254280) (owner: Abián)
[11:03:20] (Merged) jenkins-bot: Require autoconfirmed status to edit Wikidata Properties [mediawiki-config] - https://gerrit.wikimedia.org/r/631809 (https://phabricator.wikimedia.org/T254280) (owner: Abián)
[11:04:18] pulled onto mwdebug2001, I’ll test it with a colleague
[11:06:53] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:08:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:10:06] (PS1) Elukey: Remove lvmbackups of Analytics meta from the Hadoop Standby master [puppet] - https://gerrit.wikimedia.org/r/633516
[11:16:58] test successful, syncing config change
[11:17:30] (PS2) Elukey: Remove lvmbackups of Analytics meta from the Hadoop Standby master [puppet] - https://gerrit.wikimedia.org/r/633516 (https://phabricator.wikimedia.org/T257412)
[11:17:53] PROBLEM - Too many messages in kafka logging-eqiad #o11y on alert1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1002 job=burrow partition={3,4,5} prometheus=ops site=eqiad topic={udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-d
[11:17:53] var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[11:18:00] (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/25819/" [puppet] - https://gerrit.wikimedia.org/r/633516 (https://phabricator.wikimedia.org/T257412) (owner: Elukey)
[11:18:26] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:631809|Require autoconfirmed status to edit Wikidata Properties (T254280)]] (duration: 01m 00s)
[11:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:32] T254280: Semi-protecting all properties - https://phabricator.wikimedia.org/T254280
[11:18:47] Urbanecm: the stage is yours
[11:19:11] thanks
[11:19:33] klausman: ok if I puppet-merge your change?
[11:20:15] (PS4) Urbanecm: Enable wgCheckUserLogLogins at all wikis but few large wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/629227 (https://phabricator.wikimedia.org/T253802)
[11:20:21] (CR) Urbanecm: [C: +2] Enable wgCheckUserLogLogins at all wikis but few large wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/629227 (https://phabricator.wikimedia.org/T253802) (owner: Urbanecm)
[11:20:30] (since it is a little change already reviewed I am proceeding)
[11:21:07] (Merged) jenkins-bot: Enable wgCheckUserLogLogins at all wikis but few large wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/629227 (https://phabricator.wikimedia.org/T253802) (owner: Urbanecm)
[11:25:06] elukey:yes!
[11:26:03] marostegui: heads-up, T253802 is going to be enabled at all wikis but few very large ones. Please do ping me if it causes anything unexpected, but given the cswiki/fawiki trial, it should be good.
[11:26:04] T253802: Configure WMF wikis to log login attempts in CheckUser - https://phabricator.wikimedia.org/T253802
[11:27:28] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0)
[11:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:05] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 4966e8a6b8ae4e6d5623dd35e65ed8fcf3338bc1: Enable wgCheckUserLogLogins at all wikis but few large wikis (T253802) (duration: 00m 58s)
[11:28:07] RECOVERY - Too many messages in kafka logging-eqiad #o11y on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[11:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:51] (PS3) Urbanecm: [testwiki, test2wiki] Allow bureaucrats to grant import rights [mediawiki-config] - https://gerrit.wikimedia.org/r/633273
[11:28:58] (CR) Urbanecm: [C: +2] [testwiki, test2wiki] Allow bureaucrats to grant import rights [mediawiki-config] - https://gerrit.wikimedia.org/r/633273 (owner: Urbanecm)
[11:29:30] (PS1) Elukey: role::analytics_cluster::coordinator: avoid lvm backups to an-master1002 [puppet] - https://gerrit.wikimedia.org/r/633519 (https://phabricator.wikimedia.org/T257412)
[11:29:43] (Merged) jenkins-bot: [testwiki, test2wiki] Allow bureaucrats to grant import rights [mediawiki-config] - https://gerrit.wikimedia.org/r/633273 (owner: Urbanecm)
[11:31:40] (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/25820/" [puppet] - https://gerrit.wikimedia.org/r/633519 (https://phabricator.wikimedia.org/T257412) (owner: Elukey)
[11:32:29] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: fff2532424f84970962f7de1e35d4250b83cb3da: [testwiki, test2wiki] Allow bureaucrats to grant import rights (duration: 00m 58s)
[11:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:14] (CR) Urbanecm: [C: +2] [labs] Remove wmgMonologChannels override [mediawiki-config] - https://gerrit.wikimedia.org/r/631224 (owner: Urbanecm)
[11:33:56] (Merged) jenkins-bot: [labs] Remove wmgMonologChannels override [mediawiki-config] - https://gerrit.wikimedia.org/r/631224 (owner: Urbanecm)
[11:38:06] (CR) Elukey: [C: +2] Import the phabricator module from Spicerack [software/pywmflib] -
https://gerrit.wikimedia.org/r/631910 (https://phabricator.wikimedia.org/T257905) (owner: Elukey)
[11:38:43] !log EU B&C done
[11:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:44:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:45:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:47:53] Operations, DBA, Data-Persistence, User-Kormat: Create integration test env for wmfmariadbpy - https://phabricator.wikimedia.org/T265266 (Kormat)
[11:49:13] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:50:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:54:14] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:55:16] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:16:01] (CR) Hnowlan: [C: +1] "Updated pcc https://puppet-compiler.wmflabs.org/compiler1003/25821/maps1004.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/608726 (https://phabricator.wikimedia.org/T254014) (owner: Ryan Kemper)
[12:26:21] !log installing spice security updates on Buster
[12:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:04] !log installing rails security updates on Stretch
[12:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:04] Operations, Traffic, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (ema) >>! In T264398#6536169, @Gilles wrote: > I guess the difference is that I did a HEAD request. It's still reproducible right now. OK, I'...
[12:41:06] (CR) Gehel: [C: +1] "LGTM: always good to fight less with the linters!" [cookbooks] - https://gerrit.wikimedia.org/r/633483 (owner: Volans)
[12:42:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:44:26] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:02:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:04:28] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:33:10] 08Warning Alert for device scs-c1-codfw.mgmt.codfw.wmnet - Processor usage over 85% [13:41:27] (03PS1) 10Kormat: (Mostly) convert to pytest. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/633534 [13:42:26] PROBLEM - Too many messages in kafka logging-eqiad #o11y on alert1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1002 job=burrow partition={3,4,5} prometheus=ops site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-clu [13:42:26] d&var-topic=All&var-consumer_group=All [13:42:28] (03CR) 10jerkins-bot: [V: 04-1] (Mostly) convert to pytest. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/633534 (owner: 10Kormat) [13:42:32] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10LSobanski) @Cmjohnson @Jclark-ctr Is there anything we (#dba) can do to help move this forward? [13:45:54] RECOVERY - Too many messages in kafka logging-eqiad #o11y on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [13:47:43] (03PS1) 10Ayounsi: cr2-eqsin: set real HE IPs [homer/public] - 10https://gerrit.wikimedia.org/r/633535 [13:48:16] (03CR) 10Ayounsi: [C: 03+2] cr2-eqsin: set real HE IPs [homer/public] - 10https://gerrit.wikimedia.org/r/633535 (owner: 10Ayounsi) [13:48:43] (03Merged) 10jenkins-bot: cr2-eqsin: set real HE IPs [homer/public] - 10https://gerrit.wikimedia.org/r/633535 (owner: 10Ayounsi) [13:50:32] RECOVERY - BGP status on cr2-eqsin is OK: BGP OK - up: 28, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:50:39] 10Operations, 10Wikimedia-Apache-configuration, 10Patch-For-Review: Update 2030.wikimedia.org redirect to new URI - https://phabricator.wikimedia.org/T264797 (10Dzahn) a:05Dzahn→03None [13:57:29] (03CR) 10Nikerabbit: "Thanks. 
Some earlier discussion about this happened in I689b3eeee765822743d99a5820a95ca72c8a7680" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631224 (owner: 10Urbanecm) [13:59:13] 10Operations, 10Fundraising-Backlog, 10Thank-You-Page, 10MW-1.36-notes (1.36.0-wmf.4; 2020-08-11), and 3 others: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10Nintendofan885) [14:05:09] (03PS1) 10Elukey: Clean up Analytics lvm-based backup not used anymore [puppet] - 10https://gerrit.wikimedia.org/r/633545 (https://phabricator.wikimedia.org/T257412) [14:05:37] !log uploaded php7.2 7.2.31-1+0~20200514.41+debian9~1.gbpe2a56b+wmf1+icu63 to component/icu63 T264991 [14:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:43] T264991: Upgrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 [14:06:01] (03CR) 10Elukey: [C: 03+2] Clean up Analytics lvm-based backup not used anymore [puppet] - 10https://gerrit.wikimedia.org/r/633545 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [14:12:28] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:14:08] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:22:34] (03PS1) 10Elukey: sre.hadoop.change-distro-from-cdh: allow to select workers/journal [cookbooks] - 10https://gerrit.wikimedia.org/r/633550 [14:39:44] PROBLEM - Disk space on ms-be2036 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=90%): /tmp 0 MB (0% inode=90%): /var/tmp 0 MB (0% inode=90%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2036&var-datasource=codfw+prometheus/ops [14:41:48] (03CR) 10Elukey: [C: 03+1] pylint: allow 'logger' as module-scope name [software/spicerack] - 10https://gerrit.wikimedia.org/r/633478 (owner: 10Volans) [14:42:49] <_joe_> uh what's up with that swift server? [14:44:03] <_joe_> !log freed 1.5 GB of space on ms-be2036 by running "apt-get clean" [14:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:15] (03PS1) 10Kormat: tox: Expand to test against py3.[5678] [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/633552 [14:49:47] (03PS2) 10Kormat: (Mostly) convert to pytest. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/633534 [14:53:38] <_joe_> something's wrong with ms-be2036. du reports only 13 GB used, but 53 appear to be used. 
[14:53:46] <_joe_> when looking at df [14:53:46] https://phabricator.wikimedia.org/T265208 it's being worked on, _joe_ [14:54:13] <_joe_> and there are no deleted files [14:54:19] <_joe_> apergos: I suspect this is another problem [14:54:27] (03PS2) 10Kormat: tox: Expand to test against py3.[5678] [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/633552 [14:54:40] <_joe_> specifically, that some data was written to /srv/swift-storage while the partitions were unmounted [14:55:05] sounds likely [14:55:28] I would just note that on the task and let folks handle it once the main issues are sorted out [14:55:38] there might be more reboots/unmounts in its future [14:56:57] (03PS3) 10Kormat: tox: Expand to test against py3.[5678] [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/633552 [14:57:25] thanks folks, I'll take a look too at ms-be2036 [14:57:45] 10Operations, 10ops-codfw: Degraded RAID on ms-be2036 - https://phabricator.wikimedia.org/T265208 (10Joe) I just want to comment that this server had its root directory filled up today, and it's in a strange state where only 13 GB are found by `du -xsh /`, but 53 are occupied on `/dev/md0` according to `df`. G... [14:57:49] <_joe_> godog: oh I thought you were off today [14:58:19] _joe_: yeah technically you are right, it is a holiday in Spain but I'll be floating it [14:58:42] _joe_: he Observes all [14:59:10] I'm not observing this holiday tho! File:Sting.ogg [14:59:37] you're.. an englishman in new york? [15:00:25] (03CR) 10Volans: [C: 03+1] "Looks sane to me" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/633552 (owner: 10Kormat) [15:00:38] I'm afraid I'm not getting it [15:00:54] https://www.youtube.com/watch?v=d27gTrPPAyk [15:01:17] hah! 
thanks that explains [15:01:24] (03CR) 10Kormat: [C: 03+2] tox: Expand to test against py3.[5678] [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/633552 (owner: 10Kormat) [15:01:33] I was more like https://commons.wikimedia.org/wiki/File:Sting.ogg [15:02:47] (03Merged) 10jenkins-bot: tox: Expand to test against py3.[5678] [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/633552 (owner: 10Kormat) [15:03:03] (03PS3) 10Kormat: (Mostly) convert to pytest. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/633534 [15:05:10] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:07:53] godog: ahh :) [15:08:36] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:09:38] PROBLEM - Check systemd state on ms-be2036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:13:50] 10Operations, 10ops-codfw: Degraded RAID on ms-be2036 - https://phabricator.wikimedia.org/T265208 (10fgiunchedi) Thanks! 
Indeed that's what happened, I've unmounted the filesystems and deleted the files [15:15:33] (03CR) 10Volans: [C: 03+2] pylint: allow 'logger' as module-scope name [cookbooks] - 10https://gerrit.wikimedia.org/r/633483 (owner: 10Volans) [15:16:46] (03Merged) 10jenkins-bot: pylint: allow 'logger' as module-scope name [cookbooks] - 10https://gerrit.wikimedia.org/r/633483 (owner: 10Volans) [15:18:08] RECOVERY - Check systemd state on ms-be2036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:35] (03PS8) 10Volans: cookbook API: add class API [software/spicerack] - 10https://gerrit.wikimedia.org/r/506956 (https://phabricator.wikimedia.org/T221212) [15:18:44] (03PS2) 10Volans: sre.hosts.downtime: convert to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/633484 (https://phabricator.wikimedia.org/T221212) [15:19:20] (03CR) 10Volans: "An example of a converted cookbook can be checked in I97b48851c3e33cea6f34439d3036ff0cba11eac7" [software/spicerack] - 10https://gerrit.wikimedia.org/r/506956 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans) [15:19:55] (03CR) 10jerkins-bot: [V: 04-1] sre.hosts.downtime: convert to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/633484 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans) [15:20:22] (03CR) 10Volans: "Expected CI failure as it depends on the Depends-On change to be merged and released."
[cookbooks] - 10https://gerrit.wikimedia.org/r/633484 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans) [15:21:02] RECOVERY - Disk space on ms-be2036 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2036&var-datasource=codfw+prometheus/ops [15:27:05] 10Operations, 10ops-eqiad, 10Discovery-Search: Memory issue on elastic1063 caused elasticsearch to be killed - https://phabricator.wikimedia.org/T265113 (10Gehel) p:05Triage→03High [15:27:48] 10Operations, 10Analytics-Clusters, 10Traffic: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (10ema) >>! In T264074#6507717, @elukey wrote: > Most of the usage seems to be VUT related, especially for `fxstatat64` (no idea where it is used). You are indeed correct. The n... [15:30:39] (03PS1) 10Hnowlan: api-gateway: more instances in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/633559 [15:33:10] PROBLEM - Too many messages in kafka logging-eqiad #o11y on alert1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1002 job=burrow partition={2,3,4} prometheus=ops site=eqiad topic={udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-d [15:33:10] var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [15:35:54] I'm taking a look, looks like codfw only though [15:41:42] RECOVERY - Too many messages in kafka logging-eqiad #o11y on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [15:41:47] !log roll-restart logstash5 in codfw [15:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:16] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic: ~1 request/minute to intake-logging.wikimedia.org times out at the traffic/service interface - https://phabricator.wikimedia.org/T264021 (10Ottomata) a:03Ottomata Interesting. So we don't know exactly where the timeout is occurring? Assigning to... [16:02:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:04:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:04:43] 10Operations, 10Analytics-Clusters, 10Traffic: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (10elukey) Thanks Ema, really great analysis! I am wondering if we could quickly test how varnishncsa behaves when we pass `-q`, that seems to be the big difference between the tw... [16:33:10] 08Warning Alert for device scs-c1-codfw.mgmt.codfw.wmnet - Processor usage over 85% [16:47:57] 10Operations, 10ops-codfw: Degraded RAID on ms-be2036 - https://phabricator.wikimedia.org/T265208 (10JMeybohm) Also the system is sending root mails since ~15:10. ` Cron test -x /usr/bin/swift-recon-cron && test -r /etc/swift/object-server.conf && /usr/bin/swift-recon-cron /etc/swift/objec... [16:48:14] godog: you're around by chance? 
[17:00:04] ryankemper: #bothumor My software never has bugs. It just develops random features. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201012T1700). [17:03:03] !log fixed /var/lock/ permission (1777) on ms-be2036 - T265208 [17:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:10] T265208: Degraded RAID on ms-be2036 - https://phabricator.wikimedia.org/T265208 [17:05:16] 10Operations, 10ops-codfw: Degraded RAID on ms-be2036 - https://phabricator.wikimedia.org/T265208 (10JMeybohm) I could not find any evidence that this was a intentional change, so I fixed the permissions. [17:17:20] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={swagger_check_citoid_cluster_codfw,swagger_check_restbase_esams} site={codfw,esams} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:19:02] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:36:46] 10Operations, 10Analytics-Clusters, 10Traffic: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (10ema) >>! In T264074#6536967, @elukey wrote: > I am wondering if we could quickly test how varnishncsa behaves when we pass `-q`, that seems to be the big difference between the... [17:58:54] (03PS1) 10ArielGlenn: get revision info from stubs file and use to generate page range info [dumps] - 10https://gerrit.wikimedia.org/r/633567 [18:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201012T1800). 
[18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:02:27] (03PS2) 10ArielGlenn: get revision info from stubs file and use to generate page range info [dumps] - 10https://gerrit.wikimedia.org/r/633567 (https://phabricator.wikimedia.org/T263319) [18:18:28] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:21:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:37:43] (03CR) 10Nuria: [C: 03+1] "Let's merge this, ya'all?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/625771 (https://phabricator.wikimedia.org/T260582) (owner: 10Gergő Tisza) [18:44:38] 10Operations, 10Analytics-Clusters, 10Traffic: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (10elukey) There is another test that we could do, namely grouping. As far as I can see in the varnishkafka change, the [[ https://github.com/wikimedia/varnishkafka/commit/b0675e80... 
[18:45:15] (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools as a beta feature on 30 more wikis ("phase 2") [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633568 (https://phabricator.wikimedia.org/T264693) [19:02:58] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 123 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:04:34] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 4 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:30:58] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={pdu_sentry4,swagger_check_restbase_esams} site={eqsin,esams} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:32:36] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:33:10] 08Warning Alert for device scs-c1-codfw.mgmt.codfw.wmnet - Processor usage over 85% [19:55:34] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:57:16] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:00:04] chrisalbon and accraze: (Dis)respected human, time to deploy Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201012T2000). Please do the needful. [20:50:17] 10Operations, 10Machine Learning Platform, 10ORES, 10Okapi, and 3 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10Ladsgroup) ^ This will make redis connection handling slightly healthier but I can't say it will handle this case as observability of ores is... [21:00:04] Reedy and sbassett: My dear minions, it's time we take the moon! Just kidding. Time for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201012T2100). [22:02:36] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 130 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:04:09] 10Operations, 10Advanced-Search, 10Discovery-Search, 10Traffic, and 3 others: Strange URL pattern after search https://en.wikipedia.org/w/index.php?sort=relevance&sort=relevance&sort=relevance&sort=relevance&sort=relevance&sort=relevance ... - https://phabricator.wikimedia.org/T243884 (10Ladsgroup) I think... 
[22:04:14] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 14 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:24:17] (03PS1) 10MichaelSchoenitzer: Add neovim, fd and ripgrep to toolforge [puppet] - 10https://gerrit.wikimedia.org/r/633583 [22:24:19] (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - 10https://gerrit.wikimedia.org/r/633583 (owner: 10MichaelSchoenitzer) [22:28:01] (03PS2) 10MichaelSchoenitzer: Add neovim, fd and ripgrep to toolforge [puppet] - 10https://gerrit.wikimedia.org/r/633583 (https://phabricator.wikimedia.org/T219501) [22:33:10] 08Warning Alert for device scs-c1-codfw.mgmt.codfw.wmnet - Processor usage over 85% [23:00:04] RoanKattouw, Niharika, and Urbanecm: Time to snap out of that daydream and deploy Evening backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201012T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:56:18] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:58:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets