[00:30:32] (PS1) Ammarpad: Add 'deletedtext' permission to researcher group [mediawiki-config] - https://gerrit.wikimedia.org/r/598553 (https://phabricator.wikimedia.org/T253420)
[00:50:15] PROBLEM - Check systemd state on an-launcher1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:56:16] (CR) Ammarpad: [C: +1] "Looks good" [mediawiki-config] - https://gerrit.wikimedia.org/r/598245 (https://phabricator.wikimedia.org/T252986) (owner: RhinosF1)
[01:03:52] (CR) Ammarpad: [C: +1] "Looks good" [mediawiki-config] - https://gerrit.wikimedia.org/r/598509 (https://phabricator.wikimedia.org/T253559) (owner: MarcoAurelio)
[01:31:51] (PS1) Privacybatm: Firewall.py: Add function to kill process by its port number [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/598560 (https://phabricator.wikimedia.org/T252950)
[01:33:53] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 55 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:36:31] (CR) Privacybatm: "Please see my comment."
(1 comment) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/598560 (https://phabricator.wikimedia.org/T252950) (owner: Privacybatm)
[01:39:43] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 47 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:44:20] Operations, ops-codfw, DBA: db2097 memory errors leading to crash - https://phabricator.wikimedia.org/T252492 (Papaul) I will be onsite tomorrow
[01:51:57] Operations, ops-codfw, DC-Ops: (Need By: TDB) rack/setup/install rdb200[78] - https://phabricator.wikimedia.org/T251626 (Papaul)
[02:00:41] Operations, ops-codfw: BBU faulty on ms-be2016 - https://phabricator.wikimedia.org/T252851 (Papaul) We have some decom DL380 onsite but there are GEN8 i will check and see if they have the same BBU as the GEN9 if not I will open a procurement task
[02:23:23] (PS1) Privacybatm: transfer.py: Add proper error message at source/target split of option_parse [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/598562 (https://phabricator.wikimedia.org/T253560)
[02:23:47] (CR) jerkins-bot: [V: -1] transfer.py: Add proper error message at source/target split of option_parse [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/598562 (https://phabricator.wikimedia.org/T253560) (owner: Privacybatm)
[02:25:29] (PS2) Privacybatm: transfer.py: Add proper error message at source/target split of option_parse [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/598562 (https://phabricator.wikimedia.org/T253560)
[02:25:58] (CR) jerkins-bot: [V: -1] transfer.py: Add proper error message at source/target split of option_parse [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/598562 (https://phabricator.wikimedia.org/T253560) (owner: Privacybatm)
[02:47:47] (PS3) Privacybatm:
transfer.py: Add proper error message at source/target split of option_parse [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/598562 (https://phabricator.wikimedia.org/T253560)
[03:01:55] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:09:19] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:18:35] !log tstarling@deploy1001 sync-file aborted: (no justification provided) (duration: 00m 32s)
[03:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:20:00] !log tstarling@deploy1001 Synchronized php-1.35.0-wmf.32/includes/specials/SpecialChangeContentModel.php: for UBN T252963 (duration: 01m 07s)
[03:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:20:06] T252963: Fatal TypeError: Argument to SpamChecker::checkSummary() must be of the type string, null given - https://phabricator.wikimedia.org/T252963
[03:23:57] PROBLEM - MD RAID on restbase-dev1004 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[03:23:58] ACKNOWLEDGEMENT - MD RAID on restbase-dev1004 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T253607 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[03:24:01] Operations, ops-eqiad: Degraded RAID on restbase-dev1004 - https://phabricator.wikimedia.org/T253607 (ops-monitoring-bot)
[03:36:45] RECOVERY - OSPF status on cr2-eqiad is OK:
OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:42:15] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:51:01] PROBLEM - Device not healthy -SMART- on restbase-dev1004 is CRITICAL: cluster=restbase_dev device=sdc instance=restbase-dev1004:9100 job=node site=eqiad https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=restbase-dev1004&var-datasource=eqiad+prometheus/ops
[03:53:02] !log tstarling@deploy1001 Synchronized php-1.35.0-wmf.32/includes/export/XmlDumpWriter.php: T253468 (duration: 01m 07s)
[03:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:53:06] T253468: stubs dumps (plus all later stages) broken on select wikis - https://phabricator.wikimedia.org/T253468
[03:55:21] !log tstarling@deploy1001 Synchronized php-1.35.0-wmf.31/includes/export/XmlDumpWriter.php: T253468 (duration: 01m 06s)
[03:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:02:17] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:05:49] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:13:59] !log Stop slaves and stop mysql on labsdb1011 T249188
[04:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:14:03] T249188: Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188
[04:29:52] (PS1) Marostegui: Revert "dbproxy1018: Depool labsdb1011" [puppet] - https://gerrit.wikimedia.org/r/598568
[04:31:05] (CR)
Marostegui: [C: +2] Revert "dbproxy1018: Depool labsdb1011" [puppet] - https://gerrit.wikimedia.org/r/598568 (owner: Marostegui)
[04:35:38] !log Repool labsdb1011 - T249188
[04:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:35:42] T249188: Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188
[04:52:06] (PS1) Marostegui: Revert "Revert "dbproxy1018: Depool labsdb1011"" [puppet] - https://gerrit.wikimedia.org/r/598570
[04:52:41] (CR) Marostegui: [C: +2] Revert "Revert "dbproxy1018: Depool labsdb1011"" [puppet] - https://gerrit.wikimedia.org/r/598570 (owner: Marostegui)
[04:59:15] (PS1) Marostegui: dbproxy1018: Repool labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/598571 (https://phabricator.wikimedia.org/T249188)
[04:59:47] (CR) Marostegui: [C: +2] dbproxy1018: Repool labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/598571 (https://phabricator.wikimedia.org/T249188) (owner: Marostegui)
[05:23:25] RECOVERY - Device not healthy -SMART- on restbase-dev1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=restbase-dev1004&var-datasource=eqiad+prometheus/ops
[06:01:03] !log Deploy schema change on s2 directly on the master T253342
[06:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:07] T253342: Apply Babel schema change expanding babel_lang in Wikimedia production - https://phabricator.wikimedia.org/T253342
[06:06:57] (PS1) Legoktm: ExtensionDistributor: Remove EOL REL1_32 [mediawiki-config] - https://gerrit.wikimedia.org/r/598573
[06:10:00] Operations, Commons, Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (Nemo_bis) >>! In T205619#6153801, @Jidanni wrote: > Yes.
I have to take my cellphone to the top of the mountain for a clear connection to upl...
[06:24:19] !log Deploy schema change on s8 directly on the master T253342
[06:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:22] T253342: Apply Babel schema change expanding babel_lang in Wikimedia production - https://phabricator.wikimedia.org/T253342
[06:29:25] !log Deploy schema change on s7 directly on the master T253342
[06:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:29] T253342: Apply Babel schema change expanding babel_lang in Wikimedia production - https://phabricator.wikimedia.org/T253342
[06:31:09] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 51 probes of 570 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:31:41] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 66 probes of 570 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:34:38] Operations, DC-Ops: scs-ulsfo 100% CPU - https://phabricator.wikimedia.org/T253609 (ayounsi) p:Triage→Medium
[06:35:41] !log reboot scs-ulsfo - T253609
[06:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:45] T253609: scs-ulsfo 100% CPU - https://phabricator.wikimedia.org/T253609
[06:44:30] !log Deploy schema change on s4 directly on the master T253342
[06:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:34] T253342: Apply Babel schema change expanding babel_lang in Wikimedia production - https://phabricator.wikimedia.org/T253342
[06:45:13] Operations, DC-Ops: scs-ulsfo 100% CPU -
https://phabricator.wikimedia.org/T253609 (ayounsi) Open→Resolved CPU is back to normal.
[06:45:37] (PS1) Elukey: role::archiva: move to profile::java [puppet] - https://gerrit.wikimedia.org/r/598672 (https://phabricator.wikimedia.org/T253553)
[06:47:07] !log Deploy schema change on s1 directly on the master T253342
[06:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:55] !log Deploy schema change on s3 directly on the master with 1 minute sleep in between wikis T253342
[06:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:59] T253342: Apply Babel schema change expanding babel_lang in Wikimedia production - https://phabricator.wikimedia.org/T253342
[06:50:25] (PS2) Elukey: role::archiva: move to profile::java [puppet] - https://gerrit.wikimedia.org/r/598672 (https://phabricator.wikimedia.org/T253553)
[06:55:36] (CR) Ayounsi: [C: -1] "The logic can be simplified here, and indentation corrected."
[homer/public] - https://gerrit.wikimedia.org/r/592938 (https://phabricator.wikimedia.org/T191667) (owner: Ayounsi)
[06:59:18] (CR) Elukey: [C: +2] "Pcc looks good: https://puppet-compiler.wmflabs.org/compiler1001/22758/archiva1001.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/598672 (https://phabricator.wikimedia.org/T253553) (owner: Elukey)
[07:03:56] !log upgrade to ats 8.0.7-1wm11 on cp3064 and cp3065
[07:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:57] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 50 probes of 570 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:08:31] Operations, DBA: In-place conversion from LVM to normal partition - https://phabricator.wikimedia.org/T252195 (Kormat) Scan of db fleet complete: ` kormat@cumin1001:~(0:98)$ sudo cumin 'A:db-all-codfw or A:db-all-eqiad' "lvs -o lv_layout --noheadings" 185 hosts will be targeted: db[2071-2092,2094-2101,21...
[07:08:55] (PS1) Muehlenhoff: Extend access for iflorez [puppet] - https://gerrit.wikimedia.org/r/598673
[07:09:35] (CR) Dzahn: "nitpick: commit message says 24 but code does 28" [puppet] - https://gerrit.wikimedia.org/r/598131 (owner: Subramanya Sastry)
[07:10:03] (PS2) Dzahn: Bump rt-test clients to 28 [puppet] - https://gerrit.wikimedia.org/r/598131 (owner: Subramanya Sastry)
[07:10:23] (CR) Dzahn: [C: +2] "update commit message and merging ..."
[puppet] - https://gerrit.wikimedia.org/r/598131 (owner: Subramanya Sastry)
[07:12:17] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 46 probes of 570 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:12:44] (CR) Muehlenhoff: [C: +2] Extend access for iflorez [puppet] - https://gerrit.wikimedia.org/r/598673 (owner: Muehlenhoff)
[07:17:14] (CR) Dzahn: [C: +2] role::postgres::common: Remove jessie support [puppet] - https://gerrit.wikimedia.org/r/598484 (owner: Muehlenhoff)
[07:17:49] (PS1) Kormat: cumin: db-misc alias needs updating [puppet] - https://gerrit.wikimedia.org/r/598674
[07:17:53] marostegui: ^
[07:18:49] (CR) Dzahn: [C: +2] zuul: add convenience link to 'zuul' bin [puppet] - https://gerrit.wikimedia.org/r/598068 (https://phabricator.wikimedia.org/T224591) (owner: Hashar)
[07:19:41] (CR) Marostegui: [C: +1] cumin: db-misc alias needs updating [puppet] - https://gerrit.wikimedia.org/r/598674 (owner: Kormat)
[07:20:23] !log installing libssh security updates
[07:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:36] (CR) Kormat: [C: +2] cumin: db-misc alias needs updating [puppet] - https://gerrit.wikimedia.org/r/598674 (owner: Kormat)
[07:21:05] (PS1) Elukey: Swap profile::java::analytics with profile::java [puppet] - https://gerrit.wikimedia.org/r/598675 (https://phabricator.wikimedia.org/T253553)
[07:24:19] (CR) Ema: [V: +2 C: +2] 5.1.3-1wm15: don't set temperature to cold [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/596626 (https://phabricator.wikimedia.org/T236754) (owner: Ema)
[07:24:39] (CR) Ema: [V: +2 C: +2] 5.1.3-1wm15: add 0038-vcl_active-lock.patch [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/597091
(https://phabricator.wikimedia.org/T236754) (owner: Ema)
[07:24:57] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/22759/" [puppet] - https://gerrit.wikimedia.org/r/598675 (https://phabricator.wikimedia.org/T253553) (owner: Elukey)
[07:25:31] Operations, Prod-Kubernetes, serviceops, Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (JMeybohm)
[07:28:22] (PS1) Ema: Release 5.1.3-1wm15 [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/598676
[07:36:40] (PS1) Muehlenhoff: Revert "Exclude /mnt/hfds from all debdeploy restart checks" [puppet] - https://gerrit.wikimedia.org/r/598677
[07:37:01] (CR) jerkins-bot: [V: -1] Revert "Exclude /mnt/hfds from all debdeploy restart checks" [puppet] - https://gerrit.wikimedia.org/r/598677 (owner: Muehlenhoff)
[07:39:03] Operations, netops: Zayo link eqiad-codfw down - TTN-0004110251 - https://phabricator.wikimedia.org/T253610 (ayounsi) p:Triage→Medium
[07:39:19] (PS2) Muehlenhoff: Revert "Exclude /mnt/hfds from all debdeploy restart checks" [puppet] - https://gerrit.wikimedia.org/r/598677
[07:40:41] ACKNOWLEDGEMENT - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP Ayounsi T253610 - ACK expires in 24h - The acknowledgement expires at: 2020-05-27 07:39:45. https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:40:41] ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi T253610 - ACK expires in 24h - The acknowledgement expires at: 2020-05-27 07:39:45.
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:42:18] (PS1) KartikMistry: Enable ContentTranslation in Galician Wikipedia as a default tool [mediawiki-config] - https://gerrit.wikimedia.org/r/598678 (https://phabricator.wikimedia.org/T250355)
[07:44:33] (CR) Muehlenhoff: [C: +2] Revert "Exclude /mnt/hfds from all debdeploy restart checks" [puppet] - https://gerrit.wikimedia.org/r/598677 (owner: Muehlenhoff)
[07:44:43] Operations, serviceops, Continuous-Integration-Config: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (ayounsi) ACKing the alert https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=deneb&service=Check+systemd+state with this task. ` ayounsi@d...
[07:45:02] (CR) jerkins-bot: [V: -1] Release 5.1.3-1wm15 [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/598676 (owner: Ema)
[07:45:06] ACKNOWLEDGEMENT - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
Ayounsi T251918 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:45:36] (PS2) Dzahn: site: Move the D3 hosts under the D3 comment [puppet] - https://gerrit.wikimedia.org/r/598066 (owner: RLazarus)
[07:50:47] (CR) Ayounsi: "recheck" [homer/public] - https://gerrit.wikimedia.org/r/547584 (https://phabricator.wikimedia.org/T250429) (owner: Ayounsi)
[07:54:56] (PS1) Muehlenhoff: Exclude /mnt/hfds on labstore1006/1007 for debdeploy restart checks [puppet] - https://gerrit.wikimedia.org/r/598680
[07:58:22] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/22760/" [puppet] - https://gerrit.wikimedia.org/r/598066 (owner: RLazarus)
[08:04:33] (PS9) DCausse: [WIP][wdqs] add a new streaming updater test role [puppet] - https://gerrit.wikimedia.org/r/597790
[08:04:35] (PS1) DCausse: Import lexeme ttl dumps to HDFS [puppet] - https://gerrit.wikimedia.org/r/598681
[08:05:32] (CR) Dzahn: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/598073 (https://phabricator.wikimedia.org/T252132) (owner: Ssingh)
[08:07:01] (PS3) Lucas Werkmeister (WMDE): DNM: Enable Data Bridge on Test Wikidata clients [mediawiki-config] - https://gerrit.wikimedia.org/r/595542 (https://phabricator.wikimedia.org/T232584)
[08:07:03] (PS3) Lucas Werkmeister (WMDE): DNM: Enable Data Bridge on Catalan Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/595543 (https://phabricator.wikimedia.org/T232584)
[08:08:11] (CR) Dzahn: "> Patch Set 1: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/598068 (https://phabricator.wikimedia.org/T224591) (owner: Hashar)
[08:18:21] (PS1) Muehlenhoff: Integrate hardened java.security into profile::java [puppet] - https://gerrit.wikimedia.org/r/598682
[08:19:25] (CR) jerkins-bot: [V: -1] Integrate hardened java.security into profile::java [puppet] - https://gerrit.wikimedia.org/r/598682 (owner: Muehlenhoff)
[08:30:01]
RECOVERY - Check systemd state on an-launcher1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:41:28] !log oblivian@cumin1001 conftool action : set/weight=1:pooled=yes; selector: name=mw1337.eqiad.wmnet
[08:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:46] (PS2) Muehlenhoff: Integrate hardened java.security into profile::java [puppet] - https://gerrit.wikimedia.org/r/598682
[08:42:50] (CR) jerkins-bot: [V: -1] Integrate hardened java.security into profile::java [puppet] - https://gerrit.wikimedia.org/r/598682 (owner: Muehlenhoff)
[08:45:47] (PS1) Giuseppe Lavagetto: jobrunner: switch all to envoy for TLS termination [puppet] - https://gerrit.wikimedia.org/r/598689 (https://phabricator.wikimedia.org/T247389)
[08:46:27] (PS2) Giuseppe Lavagetto: jobrunner: switch all to envoy for TLS termination [puppet] - https://gerrit.wikimedia.org/r/598689 (https://phabricator.wikimedia.org/T247389)
[08:47:47] (PS1) Elukey: Revert "Set BigTop repository config for the Hadoop Test cluster" [puppet] - https://gerrit.wikimedia.org/r/598690
[08:48:07] (Abandoned) Elukey: Revert "Set BigTop repository config for the Hadoop Test cluster" [puppet] - https://gerrit.wikimedia.org/r/598690 (owner: Elukey)
[08:48:23] (CR) Jbond: "LGTM" [puppet] - https://gerrit.wikimedia.org/r/598475 (https://phabricator.wikimedia.org/T233947) (owner: Muehlenhoff)
[08:51:19] (CR) Giuseppe Lavagetto: [C: +2] jobrunner: switch all to envoy for TLS termination [puppet] - https://gerrit.wikimedia.org/r/598689 (https://phabricator.wikimedia.org/T247389) (owner: Giuseppe Lavagetto)
[08:52:11] (PS1) Marostegui: dbproxy1018: Add labsdb1010 with reduced weight [puppet] - https://gerrit.wikimedia.org/r/598691 (https://phabricator.wikimedia.org/T249188)
[08:52:13] (PS10) ZPapierski: Role for SDoC WDQS [puppet] -
https://gerrit.wikimedia.org/r/595041 (https://phabricator.wikimedia.org/T237089) (owner: EBernhardson)
[08:52:22] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/598680 (owner: Muehlenhoff)
[08:55:05] (PS11) ZPapierski: Role for SDoC WDQS [puppet] - https://gerrit.wikimedia.org/r/595041 (https://phabricator.wikimedia.org/T237089) (owner: EBernhardson)
[08:55:50] <_joe_> !log progressively converting jobrunners to envoy
[08:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:15] (PS2) Marostegui: dbproxy1018: Add labsdb1010 with reduced weight [puppet] - https://gerrit.wikimedia.org/r/598691 (https://phabricator.wikimedia.org/T249188)
[08:56:48] (CR) Jbond: "lgtm one more minor nit" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/598073 (https://phabricator.wikimedia.org/T252132) (owner: Ssingh)
[08:57:06] !log oblivian@cumin1001 conftool action : set/weight=1:pooled=yes; selector: name=mw133[4-7].eqiad.wmnet
[08:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:07] (PS1) Elukey: Set CDH repository back for the Hadoop test cluster [puppet] - https://gerrit.wikimedia.org/r/598693 (https://phabricator.wikimedia.org/T244499)
[08:59:14] (CR) Marostegui: "This shows the expected change: https://puppet-compiler.wmflabs.org/compiler1001/22763/dbproxy1018.eqiad.wmnet/change.dbproxy1018.eqiad.wm" [puppet] - https://gerrit.wikimedia.org/r/598691 (https://phabricator.wikimedia.org/T249188) (owner: Marostegui)
[08:59:34] (CR) Marostegui: [C: -2] "Let's wait a few days until we are sure labsdb1011 is not crashing" [puppet] - https://gerrit.wikimedia.org/r/598691 (https://phabricator.wikimedia.org/T249188) (owner: Marostegui)
[09:00:12] (CR) Elukey: [C: +2] Set CDH repository back for the Hadoop test cluster [puppet] - https://gerrit.wikimedia.org/r/598693 (https://phabricator.wikimedia.org/T244499) (owner:
Elukey)
[09:01:46] !log oblivian@cumin1001 conftool action : set/weight=1:pooled=yes; selector: name=mw13(1|3)8.eqiad.wmnet
[09:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:49] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission
[09:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:13] !log decom'ing people1001 - replaced by people1002
[09:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:43] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[09:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:50] Operations, serviceops, Patch-For-Review: decom people1001 - https://phabricator.wikimedia.org/T253296 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `people1001.eqiad.wmnet` - people1001.eqiad.wmnet (**PASS**) - Downtimed host on Icinga - Found Gane...
[09:03:46] (CR) Dzahn: [C: +2] site/DHCP: remove people1001.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/597756 (https://phabricator.wikimedia.org/T253296) (owner: Dzahn)
[09:03:57] (PS2) Dzahn: site/DHCP: remove people1001.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/597756 (https://phabricator.wikimedia.org/T253296)
[09:04:51] (CR) MarcoAurelio: "I think we need T&S signoff for this one. Adding Joe for that." [mediawiki-config] - https://gerrit.wikimedia.org/r/598553 (https://phabricator.wikimedia.org/T253420) (owner: Ammarpad)
[09:05:31] (PS3) Muehlenhoff: Integrate hardened java.security into profile::java [puppet] - https://gerrit.wikimedia.org/r/598682
[09:06:19] (CR) Gehel: [C: -1] "See comments inline."
(3 comments) [puppet] - https://gerrit.wikimedia.org/r/595041 (https://phabricator.wikimedia.org/T237089) (owner: EBernhardson)
[09:09:39] !log oblivian@cumin1001 conftool action : set/weight=1:pooled=yes; selector: name=mw13(0[89]|1[01]).eqiad.wmnet
[09:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:57] (CR) Gehel: [C: -1] "PCC is unhappy: https://puppet-compiler.wmflabs.org/compiler1001/22764/" [puppet] - https://gerrit.wikimedia.org/r/595041 (https://phabricator.wikimedia.org/T237089) (owner: EBernhardson)
[09:12:59] (PS1) Dzahn: decom people1001.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/598695 (https://phabricator.wikimedia.org/T253296)
[09:14:55] (CR) Arturo Borrero Gonzalez: [C: +1] "thanks" [puppet] - https://gerrit.wikimedia.org/r/598469 (owner: Muehlenhoff)
[09:17:01] (CR) Muehlenhoff: [C: +2] Remove absented reprepro configs [puppet] - https://gerrit.wikimedia.org/r/598469 (owner: Muehlenhoff)
[09:22:18] !log repool wdqs1007, catched up on lag
[09:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:28] Operations, Security-Team, Patch-For-Review, User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (jbond) >>! In T244792#6136255, @jbond wrote: > Hi @HMarcus > > Thanks for configuring this however the application ass...
[09:23:35] (CR) MarcoAurelio: [C: +1] "Approved on Task by Jan: " [mediawiki-config] - https://gerrit.wikimedia.org/r/598553 (https://phabricator.wikimedia.org/T253420) (owner: Ammarpad)
[09:24:42] (CR) Hashar: "contint1001 has /usr/bin/zuul which is installed by the Debian package:" [puppet] - https://gerrit.wikimedia.org/r/598068 (https://phabricator.wikimedia.org/T224591) (owner: Hashar)
[09:25:02] (CR) Dzahn: [C: +2] decom people1001.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/598695 (https://phabricator.wikimedia.org/T253296) (owner: Dzahn)
[09:26:21] (PS2) Dzahn: decom people1001.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/598695 (https://phabricator.wikimedia.org/T253296)
[09:27:12] !log oblivian@cumin1001 conftool action : set/weight=1:pooled=yes; selector: name=mw130[4-7].eqiad.wmnet
[09:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:32] Operations, Security-Team, CAS-SSO, User-jbond: Further steps for CAS/web SSO - https://phabricator.wikimedia.org/T233921 (jbond)
[09:27:34] Operations, CAS-SSO, Performance Issue: Investigate CAS performance - https://phabricator.wikimedia.org/T246010 (jbond) Open→Resolved a:jbond I asked @Zbyszko to look at this and they where unable to get to the root cause however as we now use an external tomcat instance this issue no lon...
[09:28:10] Operations, Security-Team, CAS-SSO, User-jbond: Icinga check for CAS-protected web services - https://phabricator.wikimedia.org/T245743 (jbond) Open→Resolved a:jbond This is now in place, resolving
[09:28:12] Operations, Security-Team, CAS-SSO, User-jbond: Further steps for CAS/web SSO - https://phabricator.wikimedia.org/T233921 (jbond)
[09:28:40] Operations, CAS-SSO, User-jbond: Revisit Tomcat deployment of CAS - https://phabricator.wikimedia.org/T233950 (jbond) Open→Resolved a:jbond This is now in place, resolving
[09:28:43] Operations, Security-Team, CAS-SSO, User-jbond: Further steps for CAS/web SSO - https://phabricator.wikimedia.org/T233921 (jbond)
[09:30:59] !log oblivian@cumin1001 conftool action : set/weight=1:pooled=yes; selector: name=mw130[0-3].eqiad.wmnet
[09:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:01] Operations, serviceops, Patch-For-Review: decom people1001 - https://phabricator.wikimedia.org/T253296 (Dzahn) Open→Resolved
[09:31:04] Operations, serviceops: upgrade people.wikimedia.org backend to buster - https://phabricator.wikimedia.org/T247649 (Dzahn)
[09:31:57] Operations, CAS-SSO, User-jbond: Review ticket policies - https://phabricator.wikimedia.org/T233948 (jbond) This has been discussed and [[ https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration#Session_timeout_handling | documented ]] .
Please reopen if further investigation is required
[09:33:10] (CR) Jbond: [C: +1] Also deploy production IDPs via deb [puppet] - https://gerrit.wikimedia.org/r/598475 (https://phabricator.wikimedia.org/T233947) (owner: Muehlenhoff)
[09:33:59] Operations, observability, CAS-SSO, User-jbond: Icinga Monitoring for CAS - https://phabricator.wikimedia.org/T233935 (Peachey88)
[09:34:01] Operations, CAS-SSO, User-jbond: Review ticket policies - https://phabricator.wikimedia.org/T233948 (jbond) Open→Resolved a:jbond
[09:34:03] Operations, Security-Team, CAS-SSO, User-jbond: Further steps for CAS/web SSO - https://phabricator.wikimedia.org/T233921 (jbond)
[09:34:44] Operations, CAS-SSO, User-jbond: Wikimedia theme for SSO login page - https://phabricator.wikimedia.org/T233939 (jbond) Open→Resolved a:jbond A new skin is now in place
[09:34:46] Operations, Security-Team, CAS-SSO, User-jbond: Further steps for CAS/web SSO - https://phabricator.wikimedia.org/T233921 (jbond)
[09:35:56] (PS8) Filippo Giunchedi: thanos: add Store Gateway [puppet] - https://gerrit.wikimedia.org/r/597019 (https://phabricator.wikimedia.org/T252186)
[09:35:58] (PS7) Filippo Giunchedi: thanos: add objstore support to sidecar [puppet] - https://gerrit.wikimedia.org/r/597071 (https://phabricator.wikimedia.org/T252186)
[09:36:00] (PS8) Filippo Giunchedi: thanos: add thanos::compact [puppet] - https://gerrit.wikimedia.org/r/597072 (https://phabricator.wikimedia.org/T252186)
[09:36:02] (PS1) Filippo Giunchedi: prometheus: add store and compact jobs [puppet] - https://gerrit.wikimedia.org/r/598698 (https://phabricator.wikimedia.org/T252186)
[09:36:39] !log oblivian@cumin1001 conftool action : set/weight=10:pooled=yes; selector: name=mw129[3-9].eqiad.wmnet
[09:36:41] Operations, CAS-SSO, Patch-For-Review, User-jbond: Replicated ticket registry - https://phabricator.wikimedia.org/T233933 (jbond) Currently
blocked waiting on [[ https://gerrit.wikimedia.org/r/c/operations/debs/mcrouter/+/596779 | mcrouter for buster ]] [09:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:00] !log oblivian@cumin1001 conftool action : set/weight=10; selector: name=mw130[0-3].eqiad.wmnet [09:37:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:32] !log oblivian@cumin1001 conftool action : set/weight=10; selector: name=mw130[0-9].eqiad.wmnet [09:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:21] !log oblivian@cumin1001 conftool action : set/weight=10; selector: name=mw133[4-8].eqiad.wmnet [09:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:43] !log oblivian@cumin1001 conftool action : set/weight=10; selector: name=mw131[0-1].eqiad.wmnet [09:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:17] (03PS1) 10Dzahn: planet: fix feeds with 404 that moved to https [puppet] - 10https://gerrit.wikimedia.org/r/598701 (https://phabricator.wikimedia.org/T168459) [09:41:17] <_joe_> !log all jobrunners converted to use envoy for TLS termination [09:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:48] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/22767/" [puppet] - 10https://gerrit.wikimedia.org/r/598682 (owner: 10Muehlenhoff) [09:44:03] (03PS1) 10Muehlenhoff: pcc: Also recommend jenkinsapi Debian package [puppet] - 10https://gerrit.wikimedia.org/r/598704 [09:44:25] (03CR) 10jerkins-bot: [V: 04-1] pcc: Also recommend jenkinsapi Debian package [puppet] - 10https://gerrit.wikimedia.org/r/598704 (owner: 10Muehlenhoff) [09:45:44] (03PS2) 10Dzahn: planet: fix feeds with 404 that moved to https [puppet] - 10https://gerrit.wikimedia.org/r/598701 (https://phabricator.wikimedia.org/T168459) [09:47:45] (03CR) 10Dzahn: [C: 03+2] planet: fix feeds with 
404 that moved to https [puppet] - 10https://gerrit.wikimedia.org/r/598701 (https://phabricator.wikimedia.org/T168459) (owner: 10Dzahn) [09:48:24] !log rolling upgrade to ats 8.0.7-1wm11 [09:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:03] 10Operations, 10serviceops: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10hashar) [09:49:51] jouncebot: next [09:49:51] In 1 hour(s) and 10 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200526T1100) [09:53:32] (03PS2) 10Muehlenhoff: pcc: Also recommend jenkinsapi Debian package [puppet] - 10https://gerrit.wikimedia.org/r/598704 [10:00:10] (03CR) 10Jbond: "lgtm some minor nits" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/598682 (owner: 10Muehlenhoff) [10:04:49] (03CR) 10Joal: [C: 03+1] "LGTM - We need to manually create the destination directory on HDFS with correct permissions." [puppet] - 10https://gerrit.wikimedia.org/r/598681 (owner: 10DCausse) [10:05:54] headsup: I'm merging a change to the IDPs which will void current SSO sessions shortly [10:06:03] (03PS1) 10Dzahn: planet: remove some feeds that are gone, fix some links [puppet] - 10https://gerrit.wikimedia.org/r/598707 (https://phabricator.wikimedia.org/T168459) [10:08:58] (03CR) 10Muehlenhoff: [C: 03+2] Also deploy production IDPs via deb [puppet] - 10https://gerrit.wikimedia.org/r/598475 (https://phabricator.wikimedia.org/T233947) (owner: 10Muehlenhoff) [10:09:25] 10Operations, 10ops-codfw, 10DBA: db2097 memory errors leading to crash - https://phabricator.wikimedia.org/T252492 (10jcrespo) >>! In T252492#6164093, @Papaul wrote: > > I will be onsite tomorrow Will stop backup processes and stop the server. 
[10:10:43] (03CR) 10Dzahn: [C: 03+2] planet: remove some feeds that are gone, fix some links [puppet] - 10https://gerrit.wikimedia.org/r/598707 (https://phabricator.wikimedia.org/T168459) (owner: 10Dzahn) [10:13:48] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:18:06] (03PS1) 10Dzahn: site: move appservers in codfw rack C4 into their own regex [puppet] - 10https://gerrit.wikimedia.org/r/598709 [10:18:34] !log stop db2097 for hw maintenance T252492 [10:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:38] T252492: db2097 memory errors leading to crash - https://phabricator.wikimedia.org/T252492 [10:21:12] 10Operations, 10ops-codfw, 10DBA: db2097 memory errors leading to crash - https://phabricator.wikimedia.org/T252492 (10jcrespo) ` $ ssh db2097.mgmt User:root logged-in to ILOMXQ91304KD.(10.193.2.204 / FE80::8230:E0FF:FE3E:F9A2) iLO Standard 1.40 at Feb 05 2019 Server Name: Server Power: Off ` host is down... [10:22:07] (03PS1) 10Dzahn: site: define mw2187,mw2188 as new canary appservers [puppet] - 10https://gerrit.wikimedia.org/r/598710 (https://phabricator.wikimedia.org/T242606) [10:24:42] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:26:21] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/22768/" [puppet] - 10https://gerrit.wikimedia.org/r/598709 (owner: 10Dzahn) [10:27:00] (03CR) 10Giuseppe Lavagetto: "you should also add the "canary" service for them in conftool-data." 
[puppet] - 10https://gerrit.wikimedia.org/r/598710 (https://phabricator.wikimedia.org/T242606) (owner: 10Dzahn) [10:28:31] (03PS1) 10Filippo Giunchedi: prometheus: allow setting min/max block duration [puppet] - 10https://gerrit.wikimedia.org/r/598711 (https://phabricator.wikimedia.org/T252186) [10:28:33] (03PS1) 10Filippo Giunchedi: WIP: add upload-data-to-thanos feature flag [puppet] - 10https://gerrit.wikimedia.org/r/598712 [10:30:20] (03PS2) 10Dzahn: site: define mw2187,mw2188 as new canary appservers [puppet] - 10https://gerrit.wikimedia.org/r/598710 (https://phabricator.wikimedia.org/T242606) [10:30:24] (03CR) 10Dzahn: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/598710 (https://phabricator.wikimedia.org/T242606) (owner: 10Dzahn) [10:30:59] (03CR) 10jerkins-bot: [V: 04-1] WIP: add upload-data-to-thanos feature flag [puppet] - 10https://gerrit.wikimedia.org/r/598712 (owner: 10Filippo Giunchedi) [10:32:37] (03PS1) 10Giuseppe Lavagetto: profile::tlsproxy::envoy: add max_requests parameter [puppet] - 10https://gerrit.wikimedia.org/r/598713 [10:35:01] (03CR) 10Jbond: WIP: add upload-data-to-thanos feature flag (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598712 (owner: 10Filippo Giunchedi) [10:35:03] (03CR) 10Giuseppe Lavagetto: [C: 03+1] site: define mw2187,mw2188 as new canary appservers [puppet] - 10https://gerrit.wikimedia.org/r/598710 (https://phabricator.wikimedia.org/T242606) (owner: 10Dzahn) [10:35:26] _joe_: done! also going to add 2 canaries for jobrunners since i see in eqiad we have that but not in codfw. 
(currently 13 canaries in eqiad (5x app, 4x api, 2x jobs, 2x parsoid) and 10 canaries in codfw (4x app, 4x api, 0x jobs, 2x parsoid) [10:36:37] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/22770/ merging without further review than a good pcc because we need to solve the situat" [puppet] - 10https://gerrit.wikimedia.org/r/598713 (owner: 10Giuseppe Lavagetto) [10:36:49] <_joe_> mutante: great [10:39:44] (03PS1) 10JMeybohm: tls_helper: readd default upstream timeout (60s) [deployment-charts] - 10https://gerrit.wikimedia.org/r/598716 [10:41:39] (03PS2) 10Filippo Giunchedi: WIP: add upload-data-to-thanos feature flag [puppet] - 10https://gerrit.wikimedia.org/r/598712 [10:41:45] <_joe_> and it worked like a charm [10:41:47] (03PS5) 10Ssingh: dnsdist: add a class to install and configure dnsdist [puppet] - 10https://gerrit.wikimedia.org/r/598073 (https://phabricator.wikimedia.org/T252132) [10:42:07] (03CR) 10Giuseppe Lavagetto: [C: 03+1] tls_helper: readd default upstream timeout (60s) [deployment-charts] - 10https://gerrit.wikimedia.org/r/598716 (owner: 10JMeybohm) [10:43:37] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1001/22773/" [puppet] - 10https://gerrit.wikimedia.org/r/598073 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [10:46:13] !log Stop tendril's event scheduler [10:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:34] (03CR) 10Vgutierrez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/598519 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [10:47:19] (03CR) 10Ssingh: [C: 03+2] acme_chief: update configuration to generate a certificate for malmok [puppet] - 10https://gerrit.wikimedia.org/r/598519 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [10:47:42] (03PS1) 10Vgutierrez: trafficserver_exporter: Gather Lua VM metrics [puppet] - 10https://gerrit.wikimedia.org/r/598717 [10:48:14] (03PS8) 10Filippo 
Giunchedi: thanos: add objstore support to sidecar [puppet] - 10https://gerrit.wikimedia.org/r/597071 (https://phabricator.wikimedia.org/T252186) [10:48:16] (03PS9) 10Filippo Giunchedi: thanos: add thanos::compact [puppet] - 10https://gerrit.wikimedia.org/r/597072 (https://phabricator.wikimedia.org/T252186) [10:48:18] (03PS2) 10Filippo Giunchedi: prometheus: add store and compact jobs [puppet] - 10https://gerrit.wikimedia.org/r/598698 (https://phabricator.wikimedia.org/T252186) [10:48:20] (03PS2) 10Filippo Giunchedi: prometheus: allow setting min/max block duration [puppet] - 10https://gerrit.wikimedia.org/r/598711 (https://phabricator.wikimedia.org/T252186) [10:48:22] (03PS3) 10Filippo Giunchedi: WIP: add upload-data-to-thanos feature flag [puppet] - 10https://gerrit.wikimedia.org/r/598712 [10:55:03] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/22771/mw2187.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/598710 (https://phabricator.wikimedia.org/T242606) (owner: 10Dzahn) [10:55:14] (03PS3) 10Dzahn: site: define mw2187,mw2188 as new canary appservers [puppet] - 10https://gerrit.wikimedia.org/r/598710 (https://phabricator.wikimedia.org/T242606) [10:56:58] (03PS1) 10Muehlenhoff: Remove support for non-Tomcat and non-DEB deployments [puppet] - 10https://gerrit.wikimedia.org/r/598718 [10:57:45] (03PS2) 10JMeybohm: tls_helper: readd default upstream timeout (60s) [deployment-charts] - 10https://gerrit.wikimedia.org/r/598716 [10:58:05] (03CR) 10jerkins-bot: [V: 04-1] Remove support for non-Tomcat and non-DEB deployments [puppet] - 10https://gerrit.wikimedia.org/r/598718 (owner: 10Muehlenhoff) [10:59:01] (03PS1) 10Jbond: example: ignore [puppet] - 10https://gerrit.wikimedia.org/r/598719 [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Dear deployers, time to do the European Mid-day SWAT(Max 6 patches) deploy. Dont look at me like that. You signed up for it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200526T1100). [11:00:04] hauskatze: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:13] Present. [11:00:23] I can SWAT today [11:00:30] (03CR) 10Jbond: example: ignore (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598719 (owner: 10Jbond) [11:01:50] (03CR) 10Awight: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598509 (https://phabricator.wikimedia.org/T253559) (owner: 10MarcoAurelio) [11:02:29] (03Abandoned) 10Jbond: example: ignore [puppet] - 10https://gerrit.wikimedia.org/r/598719 (owner: 10Jbond) [11:02:42] (03Merged) 10jenkins-bot: [nnwiki] Change category collation to `uca-nn-u-kn` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598509 (https://phabricator.wikimedia.org/T253559) (owner: 10MarcoAurelio) [11:03:12] hauskatze: Is it possible to test this config on mwdebug, or maybe not because it requires the update script? 
[11:03:27] awight: change is not testable [11:03:37] ack [11:03:57] so we'll need to scap/deploy it first then run the script and see if it worked [11:04:48] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:06:16] !log awight@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:598509|[nnwiki] Change category collation to (T253559)]] (duration: 01m 10s) [11:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:21] T253559: Category sort of nnwiki is wrong - https://phabricator.wikimedia.org/T253559 [11:06:56] API now shows the correct config: ""categorycollation": "uca-nn-u-kn"," [11:07:19] hauskatze: Running the script under screen--currently at 20k of 1M :-) [11:07:30] :D [11:07:38] well, hopefully it doesn't take much [11:08:58] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:09:21] (03PS9) 10Filippo Giunchedi: thanos: add objstore support to sidecar [puppet] - 10https://gerrit.wikimedia.org/r/597071 (https://phabricator.wikimedia.org/T252186) [11:09:24] (03PS10) 10Filippo Giunchedi: thanos: add thanos::compact [puppet] - 10https://gerrit.wikimedia.org/r/597072 (https://phabricator.wikimedia.org/T252186) [11:09:26] (03PS3) 10Filippo Giunchedi: prometheus: add store and compact jobs [puppet] - 10https://gerrit.wikimedia.org/r/598698 (https://phabricator.wikimedia.org/T252186) [11:09:27] (03PS3) 10Filippo Giunchedi: prometheus: allow setting min/max block duration [puppet] - 10https://gerrit.wikimedia.org/r/598711 (https://phabricator.wikimedia.org/T252186) [11:09:29] (03PS4) 10Filippo Giunchedi: WIP: add upload-data-to-thanos feature flag [puppet] - 10https://gerrit.wikimedia.org/r/598712 [11:09:43] (03PS2) 10Awight: Add 'deletedtext' permission to researcher group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598553 (https://phabricator.wikimedia.org/T253420) (owner: 10Ammarpad) [11:10:10] (03CR) 10Awight: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598553 (https://phabricator.wikimedia.org/T253420) (owner: 10Ammarpad) [11:10:50] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:11:27] (03Merged) 10jenkins-bot: Add 'deletedtext' permission to researcher group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598553 (https://phabricator.wikimedia.org/T253420) (owner: 10Ammarpad) [11:12:29] (03CR) 10Filippo Giunchedi: "PCC (for this whole review chain) https://puppet-compiler.wmflabs.org/compiler1001/22775/" [puppet] - 10https://gerrit.wikimedia.org/r/598712 (owner: 10Filippo Giunchedi) [11:13:33] (03CR) 10Ema: [C: 04-1] trafficserver_exporter: Gather Lua VM metrics (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/598717 (owner: 10Vgutierrez) [11:14:27] (03CR) 10Ema: [V: 03+2 C: 03+2] Release 5.1.3-1wm15 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/598676 (owner: 10Ema) [11:14:37] !log awight@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:598553|Add 'deletedtext' permission to researcher group (T253420)]] (duration: 01m 06s) [11:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:43] T253420: Add 'deletedtext' right to user group 'researchers' - https://phabricator.wikimedia.org/T253420 [11:16:18] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:16:36] !log EU SWAT done (pending a maintenance script to updateCollation) [11:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:42] thanks awight :) [11:18:02] likewise! I'll comment on the task once the maintenance job finishes. 
[11:18:52] thanks, so hashar can ride the train [11:19:16] * awight leaps off of the tracks [11:20:20] 10Operations, 10netops: Zayo link eqiad-codfw (OGYX/120003//ZYO) down - TTN-0004110251 - https://phabricator.wikimedia.org/T253610 (10CDanis) [11:23:12] (03CR) 10Jbond: "looks good see inline for comments" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/598712 (owner: 10Filippo Giunchedi) [11:25:26] (03CR) 10CDanis: profile::conftool::client: allow overriding the user root can access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598415 (https://phabricator.wikimedia.org/T97972) (owner: 10Giuseppe Lavagetto) [11:25:34] PROBLEM - PHP opcache health on mw2187 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:25:45] (03PS1) 10Ayounsi: Depool ulsfo for routers upgrade [dns] - 10https://gerrit.wikimedia.org/r/598723 (https://phabricator.wikimedia.org/T243080) [11:25:48] eh..yea.. that is one of the new canaries [11:26:27] (03CR) 10Ayounsi: [C: 03+2] Depool ulsfo for routers upgrade [dns] - 10https://gerrit.wikimedia.org/r/598723 (https://phabricator.wikimedia.org/T243080) (owner: 10Ayounsi) [11:26:32] PROBLEM - PHP opcache health on mw2188 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:26:55] !log depool ulsfo for routers upgrade - T243080 [11:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:56] !log nnwiki updateCollation.php script has finished. [11:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:11] !log cr3-ulsfo> request vmhost software add ... 
- T243080 [11:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:12] (03CR) 10CDanis: thanos: add Store Gateway (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/597019 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [11:30:38] (03CR) 10CDanis: [C: 03+1] thanos: add thanos::compact [puppet] - 10https://gerrit.wikimedia.org/r/597072 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [11:33:20] (03CR) 10CDanis: thanos: add objstore support to sidecar (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/597071 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [11:33:26] (03CR) 10Muehlenhoff: [C: 03+2] Remove jessie check in Docker registry [puppet] - 10https://gerrit.wikimedia.org/r/598479 (owner: 10Muehlenhoff) [11:36:17] (03CR) 10Muehlenhoff: [C: 03+2] Exclude /mnt/hfds on labstore1006/1007 for debdeploy restart checks [puppet] - 10https://gerrit.wikimedia.org/r/598680 (owner: 10Muehlenhoff) [11:40:13] RECOVERY - PHP opcache health on mw2187 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:42:52] !log cr4-ulsfo> request vmhost software add ... 
- T243080 [11:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:10] (03CR) 10JMeybohm: [C: 03+2] tls_helper: readd default upstream timeout (60s) [deployment-charts] - 10https://gerrit.wikimedia.org/r/598716 (owner: 10JMeybohm) [11:45:33] (03Merged) 10jenkins-bot: tls_helper: readd default upstream timeout (60s) [deployment-charts] - 10https://gerrit.wikimedia.org/r/598716 (owner: 10JMeybohm) [11:46:28] PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:48:24] (03PS2) 10Muehlenhoff: Switch the IDPs to profile::java [puppet] - 10https://gerrit.wikimedia.org/r/598488 (https://phabricator.wikimedia.org/T253553) [11:49:33] alright, ulsfo is depooled, time for some router upgrades, no impact expected, but there will most likely be some icinga noise [11:49:53] !log cr3-ulsfo> request vmhost reboot - T243080 [11:49:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:36] it should take ~10min to come back up [11:54:13] console is back up [11:54:20] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:54:37] waiting for interfaces to initialize [11:54:38] PROBLEM - OSPF status on mr1-ulsfo is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:55:42] RECOVERY - PHP opcache health on mw2188 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:56:13] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for Apache on miscweb [puppet] - 10https://gerrit.wikimedia.org/r/598724 (https://phabricator.wikimedia.org/T135991) [11:57:06] PROBLEM - IPv6 ping to 
ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 76 probes of 570 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:57:33] all interfaces are up [11:58:00] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 64, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:58:20] RECOVERY - OSPF status on mr1-ulsfo is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:59:11] no alarms, everything back to normal [11:59:20] so less than 10min total, great [12:01:05] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for Apache on Turnilo [puppet] - 10https://gerrit.wikimedia.org/r/598724 (https://phabricator.wikimedia.org/T135991) [12:01:38] !log cr4-ulsfo deactivate transit/ix/4/6 - T243080 [12:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:56] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 47 probes of 570 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:03:10] !log cr4-ulsfo> request vmhost reboot - T243080 [12:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:50] cli is back [12:07:23] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:07:30] PROBLEM - OSPF status on mr1-ulsfo is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:07:50] (03CR) 10Alexandros Kosiaris: pcc: Also recommend jenkinsapi Debian package (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/598704 (owner: 10Muehlenhoff) [12:08:02] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:08:32] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:09:51] interfaces are up [12:10:17] (03CR) 10Jbond: [C: 03+1] "lgtm, minor nit but feel free to merge or fix and merge" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598073 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [12:10:20] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:11:02] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:11:08] RECOVERY - OSPF status on mr1-ulsfo is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:11:35] !log cr4-ulsfo re-activate transit/ix/4/6 - T243080 [12:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:40] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:12:17] (03CR) 10Jbond: "lftm, minor nit not related to your change" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598704 (owner: 10Muehlenhoff) [12:14:33] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:14:37] (03PS2) 10Vgutierrez: trafficserver_exporter: Gather Lua VM metrics [puppet] - 
10https://gerrit.wikimedia.org/r/598717 [12:14:51] (03PS6) 10Ssingh: dnsdist: add a class to install and configure dnsdist [puppet] - 10https://gerrit.wikimedia.org/r/598073 (https://phabricator.wikimedia.org/T252132) [12:14:59] (03CR) 10Ssingh: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598073 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [12:16:21] (03CR) 10Vgutierrez: trafficserver_exporter: Gather Lua VM metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598717 (owner: 10Vgutierrez) [12:17:23] downtimes removed [12:17:27] jouncebot: next [12:17:28] In 0 hour(s) and 42 minute(s): Mediawiki train - European Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200526T1300) [12:18:56] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1003/22778/" [puppet] - 10https://gerrit.wikimedia.org/r/598073 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [12:19:19] (03PS1) 10Ayounsi: Revert "Depool ulsfo for routers upgrade" [dns] - 10https://gerrit.wikimedia.org/r/598728 [12:19:28] XioNoX: \o/ [12:19:32] (03CR) 10Ema: [C: 03+1] trafficserver_exporter: Gather Lua VM metrics [puppet] - 10https://gerrit.wikimedia.org/r/598717 (owner: 10Vgutierrez) [12:19:55] went very smooth! [12:19:56] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:20:20] cdanis: is that Zayo still playing with our circuit ^ ? 
:) [12:20:52] (03CR) 10Ayounsi: [C: 03+2] Revert "Depool ulsfo for routers upgrade" [dns] - 10https://gerrit.wikimedia.org/r/598728 (owner: 10Ayounsi) [12:20:58] XioNoX: yeah I believe so [12:21:02] !log repool ulsfo - T243080 [12:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:48] (03PS10) 10Muehlenhoff: profile::url_downloader: Add types and switch to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/562472 [12:23:51] (03CR) 10Jbond: [C: 03+1] "lgtm" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598073 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [12:23:56] (03CR) 10Ssingh: [C: 03+2] dnsdist: add a class to install and configure dnsdist [puppet] - 10https://gerrit.wikimedia.org/r/598073 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [12:25:30] (03PS1) 10Dzahn: site: define mw2249,mw2250 as jobrunner canaries in codfw [puppet] - 10https://gerrit.wikimedia.org/r/598729 (https://phabricator.wikimedia.org/T242606) [12:29:08] (03CR) 10Dzahn: [C: 03+2] site: define mw2249,mw2250 as jobrunner canaries in codfw [puppet] - 10https://gerrit.wikimedia.org/r/598729 (https://phabricator.wikimedia.org/T242606) (owner: 10Dzahn) [12:33:13] 10Operations, 10LDAP: Problems accesing superset and horizon.wikimedia.org - https://phabricator.wikimedia.org/T253414 (10MoritzMuehlenhoff) Sure, you can even do that yourself via https://wikitech.wikimedia.org/wiki/Special:PasswordReset :-) [12:34:07] (03Abandoned) 10Giuseppe Lavagetto: mediawiki: convert all appserver to use envoy for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/586205 (https://phabricator.wikimedia.org/T247389) (owner: 10Giuseppe Lavagetto) [12:36:12] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:36:45] (03CR) 10Muehlenhoff: profile::ganeti: refactor hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598498 
(owner: 10Jbond) [12:38:03] (03PS1) 10Kormat: site.pp: Add missing mariadb multiinstance comments. [puppet] - 10https://gerrit.wikimedia.org/r/598731 [12:39:20] (03PS10) 10Jbond: profile::ganeti: refactor hiera [puppet] - 10https://gerrit.wikimedia.org/r/598498 [12:39:25] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Nice!. A nit would be perhaps to stagger this a bit by upgrading a single service first, trying it out a bit inside SRE, see if anything b" [deployment-charts] - 10https://gerrit.wikimedia.org/r/598487 (https://phabricator.wikimedia.org/T252428) (owner: 10JMeybohm) [12:39:26] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:39:36] (03CR) 10Jbond: profile::ganeti: refactor hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598498 (owner: 10Jbond) [12:40:50] (03CR) 10Hashar: "recheck there is now a dedicated job to verify the logstash filters whenever a change touches modules/profile/files/logstash" [puppet] - 10https://gerrit.wikimedia.org/r/594460 (https://phabricator.wikimedia.org/T251869) (owner: 10Filippo Giunchedi) [12:43:21] !log swift eqiad-prod: decom ms-be101[678] - T252008 [12:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:25] T252008: Decom ms-be101[678] - https://phabricator.wikimedia.org/T252008 [12:46:16] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [12:46:38] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:46:46] 10Operations, 10LDAP: Problems accesing superset and horizon.wikimedia.org - https://phabricator.wikimedia.org/T253414 (10diego) Ups! I thought was a different procedure, thanks for clarifying, problem solved. [12:46:58] 10Operations, 10LDAP: Problems accesing superset and horizon.wikimedia.org - https://phabricator.wikimedia.org/T253414 (10diego) 05Open→03Resolved [12:47:20] PROBLEM - Unmerged changes on repository puppet on labtestpuppetmaster2001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [12:48:50] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:50:19] (03PS10) 10Filippo Giunchedi: thanos: add objstore support to sidecar [puppet] - 10https://gerrit.wikimedia.org/r/597071 (https://phabricator.wikimedia.org/T252186) [12:50:21] (03PS11) 10Filippo Giunchedi: thanos: add thanos::compact [puppet] - 10https://gerrit.wikimedia.org/r/597072 (https://phabricator.wikimedia.org/T252186) [12:50:23] (03PS4) 10Filippo Giunchedi: prometheus: add store and compact jobs [puppet] - 10https://gerrit.wikimedia.org/r/598698 (https://phabricator.wikimedia.org/T252186) [12:50:25] (03PS4) 10Filippo Giunchedi: prometheus: allow setting min/max block duration [puppet] - 10https://gerrit.wikimedia.org/r/598711 (https://phabricator.wikimedia.org/T252186) [12:50:39] (03PS5) 10Filippo Giunchedi: prometheus: enable Thanos upload for Prometheus k8s-staging [puppet] - 10https://gerrit.wikimedia.org/r/598712 (https://phabricator.wikimedia.org/T252186) [12:51:33] (03CR) 10jerkins-bot: [V: 04-1] thanos: add objstore support to sidecar [puppet] - 10https://gerrit.wikimedia.org/r/597071 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [12:51:40] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [12:52:12] (03CR) 10Hashar: "Locally (and on CI), the command takes roughly 40 seconds before running the tests. I am assuming that is the time to boot and setup logst" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/594460 (https://phabricator.wikimedia.org/T251869) (owner: 10Filippo Giunchedi) [12:52:44] RECOVERY - Unmerged changes on repository puppet on labtestpuppetmaster2001 is OK: No changes to merge. 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [12:54:01] (03CR) 10Muehlenhoff: pcc: Also recommend jenkinsapi Debian package (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/598704 (owner: 10Muehlenhoff) [12:54:47] (03CR) 10Marostegui: [C: 03+1] "Checked that the added hosts belong to those sections - thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/598731 (owner: 10Kormat) [12:55:30] (03CR) 10Kormat: [C: 03+2] site.pp: Add missing mariadb multiinstance comments. [puppet] - 10https://gerrit.wikimedia.org/r/598731 (owner: 10Kormat) [12:56:17] (03PS2) 10Kormat: site.pp: Add missing mariadb multiinstance comments. [puppet] - 10https://gerrit.wikimedia.org/r/598731 [12:56:46] (03CR) 10Vgutierrez: [C: 03+2] trafficserver_exporter: Gather Lua VM metrics [puppet] - 10https://gerrit.wikimedia.org/r/598717 (owner: 10Vgutierrez) [12:57:02] 10Operations, 10serviceops, 10Patch-For-Review: No mw canary servers in codfw - https://phabricator.wikimedia.org/T242606 (10Dzahn) mw2187, mw2188 are new canary appservers, replacing mw2271, mw2272 mw2249, mw2250 are new jobrunner canaries that we did not have in codfw. Now we have 13 canaries in eqiad an... [12:57:06] (03CR) 10Hashar: [C: 03+1] "My previous comments were just for info, not meant to be blockers to this change they are really just nitpicks ;)" [puppet] - 10https://gerrit.wikimedia.org/r/594460 (https://phabricator.wikimedia.org/T251869) (owner: 10Filippo Giunchedi) [12:57:16] 10Operations, 10serviceops, 10Patch-For-Review: No mw canary servers in codfw - https://phabricator.wikimedia.org/T242606 (10Dzahn) 05Open→03Resolved [12:57:27] vgutierrez: hi! is it safe to merge your puppet CR? [12:57:35] it is [12:57:39] I got the lock right now... [12:57:43] so is it safe to merge yours? 
;P [12:57:50] perfect, yes please :) [12:57:51] (03PS4) 10Filippo Giunchedi: profile: initial tests for logstash filters [puppet] - 10https://gerrit.wikimedia.org/r/594460 (https://phabricator.wikimedia.org/T251869) [12:57:54] * vgutierrez merging [12:58:28] kormat: done [12:58:29] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: initial tests for logstash filters (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/594460 (https://phabricator.wikimedia.org/T251869) (owner: 10Filippo Giunchedi) [12:58:36] vgutierrez: ty! [13:00:02] I am going to push 1.35.0wmf.32 to all wikis ( https://phabricator.wikimedia.org/T249964 ) [13:00:04] hashar: That opportune time is upon us again. Time for a Mediawiki train - European Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200526T1300). [13:00:05] (03CR) 10JMeybohm: "> Patch Set 1: Code-Review+1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/598487 (https://phabricator.wikimedia.org/T252428) (owner: 10JMeybohm) [13:02:14] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3055 is CRITICAL: connect to address 10.20.0.55 and port 9322: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:02:16] PROBLEM - Ensure trafficserver_exporter is running for instance backend on cp3055 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:02:18] (03PS1) 10Hashar: all wikis to 1.35.0-wmf.32 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598734 [13:02:21] (03CR) 10Hashar: [C: 03+2] all wikis to 1.35.0-wmf.32 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598734 (owner: 10Hashar) [13:02:28] (03PS1) 10Vgutierrez: trafficserver_exporter: Fix Lua VM metrics [puppet] - 
10https://gerrit.wikimedia.org/r/598735 [13:02:32] PROBLEM - Ensure trafficserver_exporter is running for instance tls on cp4025 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:02:45] yeah... that's me [13:02:48] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp3055 is CRITICAL: connect to address 10.20.0.55 and port 9122: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:02:49] fix on its way... [13:02:56] (03PS1) 10Kormat: mariadb: Add db2138 to s2+s4 [puppet] - 10https://gerrit.wikimedia.org/r/598736 (https://phabricator.wikimedia.org/T252987) [13:03:02] (03Merged) 10jenkins-bot: all wikis to 1.35.0-wmf.32 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598734 (owner: 10Hashar) [13:03:09] rolling [13:03:12] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp4025 is CRITICAL: connect to address 10.128.0.125 and port 9122: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:03:14] hashar: ack [13:03:18] PROBLEM - Ensure trafficserver_exporter is running for instance tls on cp3055 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:03:26] PROBLEM - Check systemd state on cp4025 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:03:34] (03CR) 10Vgutierrez: [C: 03+2] trafficserver_exporter: Fix Lua VM metrics [puppet] - 10https://gerrit.wikimedia.org/r/598735 (owner: 10Vgutierrez) [13:03:38] PROBLEM - Ensure trafficserver_exporter is running for instance backend on cp4025 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:03:43] (03CR) 10Marostegui: "The bug looks linked to an incorrect task?" [puppet] - 10https://gerrit.wikimedia.org/r/598736 (https://phabricator.wikimedia.org/T252987) (owner: 10Kormat) [13:03:46] PROBLEM - Check systemd state on cp3055 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:04:00] PROBLEM - Check systemd state on cp3064 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:04:02] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp4025 is CRITICAL: connect to address 10.128.0.125 and port 9322: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:04:30] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp2027 is CRITICAL: connect to address 10.192.0.23 and port 9322: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:04:30] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp2032 is CRITICAL: connect to address 10.192.16.33 and port 9122: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:04:36] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3064 is CRITICAL: connect to address 10.20.0.64 and port 9322: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:04:40] PROBLEM - Check systemd state on cp2035 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:04:42] PROBLEM - Ensure trafficserver_exporter is running for instance tls on cp2032 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:04:44] PROBLEM - Ensure trafficserver_exporter is running for instance backend on cp4028 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:04:49] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.32 [13:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:53] hashar: checking! [13:04:56] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp2027 is CRITICAL: connect to address 10.192.0.23 and port 9122: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:04:56] (03PS2) 10Kormat: mariadb: Add db2138 to s2+s4 [puppet] - 10https://gerrit.wikimedia.org/r/598736 (https://phabricator.wikimedia.org/T252985) [13:04:58] PROBLEM - Check systemd state on cp2032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:05:04] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:05:06] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp2034 is CRITICAL: connect to address 10.192.16.184 and port 9122: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:05:06] PROBLEM - Ensure trafficserver_exporter is running for instance tls on cp2027 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:05:13] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp4028 is CRITICAL: connect to address 10.128.0.128 and port 9122: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:05:16] PROBLEM - Ensure trafficserver_exporter is running for instance backend on cp2035 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:05:16] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp3064 is CRITICAL: connect to address 10.20.0.64 and port 9122: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:05:28] PROBLEM - Ensure trafficserver_exporter is running for instance tls on cp4028 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:05:30] 
PROBLEM - Ensure trafficserver_exporter is running for instance backend on cp3064 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:05:30] PROBLEM - Ensure trafficserver_exporter is running for instance tls on cp3064 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:05:34] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp2034 is CRITICAL: connect to address 10.192.16.184 and port 9322: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:05:36] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp2030 is CRITICAL: connect to address 10.192.0.32 and port 9322: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:05:38] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp2032 is CRITICAL: connect to address 10.192.16.33 and port 9322: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:05:40] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp5003 is CRITICAL: connect to address 10.132.0.103 and port 9322: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:05:43] PROBLEM - Ensure trafficserver_exporter is running for instance tls on cp2035 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server 
[13:05:44] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp2035 is CRITICAL: connect to address 10.192.32.18 and port 9122: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:05:50] PROBLEM - Check systemd state on cp4028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:05:50] RECOVERY - Check systemd state on cp3064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:05:50] PROBLEM - Check systemd state on cp2027 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:05:52] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp2035 is CRITICAL: connect to address 10.192.32.18 and port 9322: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:05:53] PROBLEM - Check systemd state on cp2030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:05:53] PROBLEM - Ensure trafficserver_exporter is running for instance backend on cp2027 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:06:00] (03PS1) 10Ssingh: wikidough: deploy certificates on the malmok host [puppet] - 10https://gerrit.wikimedia.org/r/598737 (https://phabricator.wikimedia.org/T252132) [13:06:02] RECOVERY - PHP opcache health on mwdebug1002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:06:06] :/ [13:06:06] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp4028 is CRITICAL: connect to address 10.128.0.128 and port 9322: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:06:08] PROBLEM - Ensure trafficserver_exporter is running for instance backend on cp2032 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:06:12] PROBLEM - Check systemd state on cp5003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:06:16] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp5003 is CRITICAL: connect to address 10.132.0.103 and port 9122: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:06:26] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3064 is OK: HTTP OK: HTTP/1.0 200 OK - 23537 bytes in 0.256 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:06:30] RECOVERY - Check systemd state on cp2035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:06:34] RECOVERY - Ensure trafficserver_exporter is running for instance backend on cp4028 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:06:41] (03CR) 10Marostegui: [C: 03+1] mariadb: Add db2138 to s2+s4 [puppet] - 10https://gerrit.wikimedia.org/r/598736 (https://phabricator.wikimedia.org/T252985) (owner: 10Kormat) [13:06:46] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp2027 is OK: HTTP OK: HTTP/1.0 200 OK - 25898 bytes in 0.096 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:06:46] PROBLEM - Ensure trafficserver_exporter is running for instance tls on cp1089 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:06:50] RECOVERY - Check systemd state on cp2032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[13:06:52] PROBLEM - Check systemd state on cp1089 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:06:52] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp4025 is OK: HTTP OK: HTTP/1.0 200 OK - 25746 bytes in 0.227 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:06:56] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp2034 is OK: HTTP OK: HTTP/1.0 200 OK - 25745 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:06:58] RECOVERY - Ensure trafficserver_exporter is running for instance tls on cp3055 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:06:58] RECOVERY - Ensure trafficserver_exporter is running for instance tls on cp2027 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:07:03] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp4028 is OK: HTTP OK: HTTP/1.0 200 OK - 25957 bytes in 0.226 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:07:06] RECOVERY - Ensure trafficserver_exporter is running for instance backend on cp2035 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:07:06] RECOVERY - Check systemd state on cp4025 is OK: OK - 
running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:07] there are some INSERT flowing in on db1083 / 9104 [13:07:08] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp3064 is OK: HTTP OK: HTTP/1.0 200 OK - 26039 bytes in 0.279 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:07:18] RECOVERY - Ensure trafficserver_exporter is running for instance tls on cp4028 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:07:18] RECOVERY - Ensure trafficserver_exporter is running for instance backend on cp4025 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:07:19] (03CR) 10Kormat: [C: 03+2] mariadb: Add db2138 to s2+s4 [puppet] - 10https://gerrit.wikimedia.org/r/598736 (https://phabricator.wikimedia.org/T252985) (owner: 10Kormat) [13:07:20] RECOVERY - Ensure trafficserver_exporter is running for instance backend on cp3064 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:07:20] RECOVERY - Ensure trafficserver_exporter is running for instance tls on cp3064 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:07:24] RECOVERY - Ensure traffic_exporter binds on port 9322 
and responds to HTTP requests on cp2034 is OK: HTTP OK: HTTP/1.0 200 OK - 23240 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:07:26] RECOVERY - Check systemd state on cp3055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:26] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp2030 is OK: HTTP OK: HTTP/1.0 200 OK - 23230 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:07:28] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp2032 is OK: HTTP OK: HTTP/1.0 200 OK - 23258 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:07:28] hashar: yeah, I am seeing those [13:07:32] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp5003 is OK: HTTP OK: HTTP/1.0 200 OK - 23354 bytes in 0.729 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:07:32] RECOVERY - Ensure trafficserver_exporter is running for instance tls on cp2035 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:07:34] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp2035 is OK: HTTP OK: HTTP/1.0 200 OK - 25898 bytes in 0.095 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:07:36] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1001/22780/" [puppet] - 10https://gerrit.wikimedia.org/r/598737 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [13:07:40] RECOVERY - Check systemd state on cp4028 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:40] RECOVERY - Check systemd state on cp2027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:42] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp4025 is OK: HTTP OK: HTTP/1.0 200 OK - 23306 bytes in 0.227 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:07:42] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp2035 is OK: HTTP OK: HTTP/1.0 200 OK - 23350 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:07:43] RECOVERY - Check systemd state on cp2030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:43] RECOVERY - Ensure trafficserver_exporter is running for instance backend on cp2027 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:07:43] hashar: but so far I haven't noticed a huge increase [13:07:44] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3055 is OK: HTTP OK: HTTP/1.0 200 OK - 23381 bytes in 0.273 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:07:46] RECOVERY - Ensure trafficserver_exporter is running for instance backend on cp3055 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:07:56] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp4028 is OK: HTTP OK: HTTP/1.0 
200 OK - 23421 bytes in 0.226 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:07:58] RECOVERY - Ensure trafficserver_exporter is running for instance backend on cp2032 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:08:02] RECOVERY - Check systemd state on cp5003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:08:04] RECOVERY - Ensure trafficserver_exporter is running for instance tls on cp4025 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:08:08] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp5003 is OK: HTTP OK: HTTP/1.0 200 OK - 25853 bytes in 0.729 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:08:10] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp2027 is OK: HTTP OK: HTTP/1.0 200 OK - 23349 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:08:10] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp2032 is OK: HTTP OK: HTTP/1.0 200 OK - 25746 bytes in 0.096 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:08:16] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Oh, perfect then! +1, let's proceed!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/598487 (https://phabricator.wikimedia.org/T252428) (owner: 10JMeybohm) [13:08:16] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp3055 is OK: HTTP OK: HTTP/1.0 200 OK - 25868 bytes in 0.274 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:08:19] * vgutierrez very sorry about the noise.... [13:08:22] RECOVERY - Ensure trafficserver_exporter is running for instance tls on cp2032 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:08:32] hashar: they are picking up [13:08:36] RECOVERY - Ensure trafficserver_exporter is running for instance tls on cp1089 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:08:39] vgutierrez: no worries. i'm just glad it wasn't me who ran puppet-merge in the end :) [13:08:42] RECOVERY - Check systemd state on cp1089 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:08:46] (03CR) 10Elukey: [C: 03+1] Enable base::service_auto_restart for Apache on Turnilo [puppet] - 10https://gerrit.wikimedia.org/r/598724 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:08:49] kormat: LOL [13:09:04] marostegui: I guess the spike of insert is to be expected for a new version but eventually they will phan out once the module_deps cache is populated [13:09:10] at least that is my understanding [13:09:34] kormat: yet another futile attempt of winning the t-shirt(TM) [13:09:34] vgutierrez: how dare you spamming us! 
:D [13:09:48] this channel is so clean and tidy [13:09:55] hashar: yeah, I am expecting it to go down, we'll see [13:10:14] elukey: the noise2signal ratio here is normally nice and high [13:10:32] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:10:46] hashar: I am checking if there's lag [13:10:47] kormat: yes I agree, it used to be a lot worse (kudos to the SRE team for the improvements) [13:10:47] vgutierrez: is it safe to merge another innocent puppet change? :) [13:11:02] kormat: sure, I've fixed the issue [13:11:04] elukey: it used to be _worse_? oh god [13:11:10] vgutierrez: cool, thanks :) [13:11:48] kormat: yes a lot, we have aggregated icinga alarms now, before we didn't (try to picture the puppet master not reachable or failing in here..) [13:12:13] oh boy [13:12:25] 10Operations, 10Commons, 10Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (10Jidanni) I'm sure alternate browsers were the first thing I tried. Anyway, you would have to fix it for the world's biggest browser anyway.... [13:13:26] there is no new mediawiki errors in logstash \o/ [13:14:22] PHP Fatal error: Uncaught Wikimedia\Rdbms\DBAccessError: Database access has been disabled. 
in /srv/mediawiki/php-1.35.0-wmf.32/includes/libs/rdbms/loadbalancer/LoadBalancer.php:1270 in /srv/mediawiki/php-1.35.0-wmf.32/includes/libs/rdbms/loadbalancer/LoadBalancer.php on line 1270 [13:14:30] a single occurrence from commonswiki [13:15:00] hashar: I am not seeing lag on the hosts we saw lag list time [13:15:07] *last [13:15:51] there is still ~ 130 INSERT per second though [13:16:02] yes, that is still going up [13:16:21] So that's not ideal, but if it doesn't cause lag, we are not having user impact per se [13:17:20] hashar: the spike looks less severe than the one at https://phabricator.wikimedia.org/T247028#6128426 which is the one we had last time [13:17:24] well we could get a stronger GPU, optical fiber and a 144Hz monitor. That helps solve lag issues in the gaming world [13:17:49] XDDDD [13:17:51] (PS2) Ssingh: wikidough: deploy certificates on the malmok host [puppet] - https://gerrit.wikimedia.org/r/598737 (https://phabricator.wikimedia.org/T252132) [13:18:04] was it mainly s2 this time? [13:18:10] it was s6 last time [13:18:16] hashar: So the issue is obviously the spike, but we can handle that, the problem is that if the spike is too big, we get lag on the slaves and that does have impact [13:18:54] jynus: no, all the sections have the spike, but I am seeing no lag [13:18:57] (CR) Ssingh: "https://puppet-compiler.wmflabs.org/compiler1003/22781/" [puppet] - https://gerrit.wikimedia.org/r/598737 (https://phabricator.wikimedia.org/T252132) (owner: Ssingh) [13:19:03] and last time we saw lag on s6 and s2 [13:19:39] I am also looking at requests to resource loader entry point load.php ( https://grafana.wikimedia.org/d/000000066/resourceloader?orgId=1&from=now-1h&to=now ) [13:19:54] do you have a dashboard of db lags? 
[13:19:55] yeah, I think I noticed on s2 first this time because s2 had porcentually less load due to timezone [13:20:23] so the relative increase was larger at this timezone (s2 is more loaded on our night) [13:20:25] hashar: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1 but that can have some false positives [13:20:37] as we include hosts that maybe aren't in production [13:21:05] nice [13:21:19] in s2 the writes increased 500% [13:21:26] while it was less than 100% on the others [13:21:33] even the absolute number was comparable [13:22:02] s1 does 105k operations per seconds!? ouch [13:22:15] I never realized the db were doing so much nowadays [13:22:37] This huge spike isn't nice :( https://grafana.wikimedia.org/d/000000273/mysql?panelId=2&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1122&var-port=9104 [13:22:52] still handleable by the db, but that's doubling the amount of inserts [13:22:58] (03PS1) 10Jbond: cas-icinga: Add an entry point for the eternal monitoring script [puppet] - 10https://gerrit.wikimedia.org/r/598742 (https://phabricator.wikimedia.org/T239323) [13:23:17] s8 had also a large spike [13:23:29] seen with 1 second of lag on several servers [13:23:54] https://grafana.wikimedia.org/d/000000303/mysql-replication-lag?panelId=12&fullscreen&orgId=1&from=1590498383495&to=1590499345192&var-dc=eqiad%20prometheus%2Fops [13:24:03] (03CR) 10jerkins-bot: [V: 04-1] cas-icinga: Add an entry point for the eternal monitoring script [puppet] - 10https://gerrit.wikimedia.org/r/598742 (https://phabricator.wikimedia.org/T239323) (owner: 10Jbond) [13:24:31] (control-filter the labsdb hosts and reload until you get the prometheus server that got it) [13:25:34] 1 second lag is unfortunately quite "common" on s8 [13:25:47] (03PS2) 10Jbond: cas-icinga: Add an entry point for the eternal monitoring script [puppet] - 10https://gerrit.wikimedia.org/r/598742 (https://phabricator.wikimedia.org/T239323) [13:26:10] marostegui: yeah, but 
it correlates with 4-5 hosts at the same time [13:26:14] when the deploy happens [13:26:25] also: https://grafana.wikimedia.org/d/000000273/mysql?panelId=2&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1119&var-port=9104&from=1590488746157&to=1590499546157 [13:26:49] (03CR) 10jerkins-bot: [V: 04-1] cas-icinga: Add an entry point for the eternal monitoring script [puppet] - 10https://gerrit.wikimedia.org/r/598742 (https://phabricator.wikimedia.org/T239323) (owner: 10Jbond) [13:26:52] yep, as I said the spike is going on on all the sections (as expected) [13:27:04] is https://dbtree.wikimedia.org/ still any relevant? [13:27:07] but at least is less severe than past week (although s2 looks quite a big increase) [13:27:10] hashar: yes [13:27:10] sorry, I meant to share https://grafana.wikimedia.org/d/000000273/mysql?panelId=2&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1109&var-port=9104&from=1590488821748&to=1590499621748 [13:27:24] but it could the the regular s8 spikes there [13:27:40] hashar: so what's next? [13:27:45] cause nothing shows any lag on that page [13:28:02] hashar: the lag is hard to catch, it is not huge, but enough to make MW complain [13:28:04] and struggle [13:28:39] hashar: so in regard to next steps, I think T247028 needs more love, I don't think it is solved [13:28:40] T247028: Database 'INSERT' query rate doubled (module_deps regression?) - https://phabricator.wikimedia.org/T247028 [13:28:46] hashar: we are precisely looking at historical data because what manuel says [13:29:31] I am tempted to let it flows assuming that the issue has been occurring on each train for the last few weeks / months [13:29:57] and is thus not really an issue with this train version but comes from whatever old code. 
So that sounds unfair to hold the train on that [13:29:59] THEN [13:30:13] it is not ongoing but as you can see, it is not something that we would like to suffer every deploy [13:30:18] if whenever we push a new version we have DB lag all over the place, that is indeed blocking [13:30:31] it is not the new code indeed [13:30:58] it is whatever happens that gets invalidated and creates an invalidation storm, on the db [13:31:42] (03PS3) 10Jbond: cas-icinga: Add an entry point for the eternal monitoring script [puppet] - 10https://gerrit.wikimedia.org/r/598742 (https://phabricator.wikimedia.org/T239323) [13:31:58] hashar: I think it is not an issue as of now, but as jaime says, this is not something we can afford on every deploy, it might or might not cause lag, but if it does, then we do have user impacting deploys [13:32:11] roger [13:32:32] so I propose we keep monitoring the lag / INSERT rate for now [13:32:49] what was this 0 -> 1? [13:32:50] and if that is really too concerning the question would be whether we rollback or hope for it to fix itself eventually [13:33:19] I am guessing the groups don't matter it is the new version [13:33:27] I believe so yeah, jynus [13:33:29] but iirc from looking at the historical rate of insert, it takes a few hours to drain :( [13:33:43] hashar: yep, from the last time, it takes a few hours [13:33:51] I will comment on the task though, as we need to find a way around this [13:35:12] hashar: I need to step out for an hour or so, I will check later and comment on the task [13:35:24] marostegui: perfect thank you for attending! 
[13:35:29] hashar: more like days: https://grafana.wikimedia.org/d/000000273/mysql?panelId=2&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1119&var-port=9104&from=1587908120177&to=1590500120178 [13:36:36] (03PS1) 10Ema: vcl: apply mobileaction/useformat ttl cap to cacheable responses [puppet] - 10https://gerrit.wikimedia.org/r/598744 (https://phabricator.wikimedia.org/T247783) [13:37:51] (03PS4) 10Jbond: cas-icinga: Add an entry point for the eternal monitoring script [puppet] - 10https://gerrit.wikimedia.org/r/598742 (https://phabricator.wikimedia.org/T239323) [13:43:48] (03CR) 10Ottomata: [C: 03+1] "Lemme know if/when I should merge." [puppet] - 10https://gerrit.wikimedia.org/r/598681 (owner: 10DCausse) [13:44:50] hashar: just FYI, the long term issue is known and has been known for at least 9 years. It has been this way since 2011. Only the traffic and cache behaviors on top (eg cdn 30 -> 4, single varnish backend -> multiple ats backend) have changed [13:45:15] hashar: the parent task is about refactoring the system wholesale which is already in progress [13:46:43] What's also changed is that we have more traffic and we deploy more often, and make more changes in front end code that need invalidating every week [13:49:53] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:50:09] 10Operations, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Investigate how automated tasks can authenticate against CAS - https://phabricator.wikimedia.org/T239323 (10jbond) I had a look at this on the CAS side and i think it would be doable to add some level of 2FA with an account in ldap. however i thin... 
[13:50:49] (03PS2) 10JMeybohm: Enable atomic helm upgrades for all service deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/598487 (https://phabricator.wikimedia.org/T252428) [13:51:02] (03CR) 10Krinkle: vcl: apply mobileaction/useformat ttl cap to cacheable responses (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598744 (https://phabricator.wikimedia.org/T247783) (owner: 10Ema) [13:51:46] Krinkle: hi! yeah what confused me is it has been made a train blcoker and I originally thought it to be an issue with the code being shipped [13:51:50] (03CR) 10JMeybohm: [C: 03+2] Enable atomic helm upgrades for all service deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/598487 (https://phabricator.wikimedia.org/T252428) (owner: 10JMeybohm) [13:52:12] (03Merged) 10jenkins-bot: Enable atomic helm upgrades for all service deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/598487 (https://phabricator.wikimedia.org/T252428) (owner: 10JMeybohm) [13:52:53] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:53:39] 10Operations: update profile::waf::apache2::administrative to use the new abuse_networks hiera key - https://phabricator.wikimedia.org/T253632 (10jbond) [13:53:50] 10Operations, 10User-jbond: update profile::waf::apache2::administrative to use the new abuse_networks hiera key - https://phabricator.wikimedia.org/T253632 (10jbond) p:05Triage→03Medium [13:54:09] Krinkle: what I don't get is why it takes so long to get the module_deps populated [13:54:09] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'zotero' for release 'staging' . 
[13:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:53] 10Operations, 10Security-Team, 10CAS-SSO, 10User-jbond: Further steps for CAS/web SSO - https://phabricator.wikimedia.org/T233921 (10jbond) [13:55:51] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10Dzahn) percentage of jessie systems left as of today (because we were asked): 4.2% [13:56:16] 10Operations, 10observability, 10CAS-SSO, 10User-jbond: Icinga Monitoring for CAS - https://phabricator.wikimedia.org/T233935 (10jbond) a:03jbond [13:56:41] (03PS17) 10Vgutierrez: ATS: Add atstls.mtail program [puppet] - 10https://gerrit.wikimedia.org/r/594677 (https://phabricator.wikimedia.org/T244538) [14:04:45] (03CR) 10Muehlenhoff: [C: 03+2] Enable base::service_auto_restart for Apache on Turnilo [puppet] - 10https://gerrit.wikimedia.org/r/598724 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:11:25] (03PS11) 10Filippo Giunchedi: thanos: add objstore support to sidecar [puppet] - 10https://gerrit.wikimedia.org/r/597071 (https://phabricator.wikimedia.org/T252186) [14:11:26] (03PS12) 10Filippo Giunchedi: thanos: add thanos::compact [puppet] - 10https://gerrit.wikimedia.org/r/597072 (https://phabricator.wikimedia.org/T252186) [14:11:31] (03PS5) 10Filippo Giunchedi: prometheus: add store and compact jobs [puppet] - 10https://gerrit.wikimedia.org/r/598698 (https://phabricator.wikimedia.org/T252186) [14:11:34] (03PS5) 10Filippo Giunchedi: prometheus: allow setting min/max block duration [puppet] - 10https://gerrit.wikimedia.org/r/598711 (https://phabricator.wikimedia.org/T252186) [14:11:35] (03PS6) 10Filippo Giunchedi: prometheus: enable Thanos upload for Prometheus k8s-staging [puppet] - 10https://gerrit.wikimedia.org/r/598712 (https://phabricator.wikimedia.org/T252186) [14:11:58] 10Operations, 10netops: scrape ripe atlas data for a few anchors at other large networks - 
https://phabricator.wikimedia.org/T252890 (10CDanis) >>! In T252890#6154888, @ayounsi wrote: > Good idea! What's the limit? While I'm not very concerned about cardinality limits in Prometheus here, I think we'd start to... [14:14:09] 10Operations, 10Release-Engineering-Team, 10SRE-tools: Support running puppet Beaker on CI - https://phabricator.wikimedia.org/T253635 (10hashar) [14:17:41] 10Operations, 10SRE-Access-Requests: Requesting access to sites from Google Search Console for soworu - https://phabricator.wikimedia.org/T252705 (10RLazarus) a:05RLazarus→03soworu @soworu Ping on my question above. :) [14:18:16] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile: remove unused profile::etcd::auth [puppet] - 10https://gerrit.wikimedia.org/r/598010 (owner: 10Giuseppe Lavagetto) [14:18:29] (03PS4) 10Muehlenhoff: Integrate hardened java.security into profile::java [puppet] - 10https://gerrit.wikimedia.org/r/598682 [14:18:32] (03PS1) 10JMeybohm: tls_helper: Add a checksum for tls config (envoy) [deployment-charts] - 10https://gerrit.wikimedia.org/r/598746 [14:18:48] (03CR) 10Muehlenhoff: Integrate hardened java.security into profile::java (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/598682 (owner: 10Muehlenhoff) [14:19:54] (03CR) 10Giuseppe Lavagetto: [C: 03+1] tls_helper: Add a checksum for tls config (envoy) [deployment-charts] - 10https://gerrit.wikimedia.org/r/598746 (owner: 10JMeybohm) [14:20:51] (03CR) 10JMeybohm: [C: 03+2] tls_helper: Add a checksum for tls config (envoy) [deployment-charts] - 10https://gerrit.wikimedia.org/r/598746 (owner: 10JMeybohm) [14:21:13] (03Merged) 10jenkins-bot: tls_helper: Add a checksum for tls config (envoy) [deployment-charts] - 10https://gerrit.wikimedia.org/r/598746 (owner: 10JMeybohm) [14:21:15] (03CR) 10Filippo Giunchedi: thanos: add Store Gateway (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/597019 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [14:22:09] (03CR) 10Giuseppe 
Lavagetto: [C: 03+1] Unconditionally enable mod-crs [puppet] - 10https://gerrit.wikimedia.org/r/598482 (owner: 10Muehlenhoff) [14:26:22] (03CR) 10Filippo Giunchedi: thanos: add objstore support to sidecar (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/597071 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [14:26:29] jouncebot: now [14:26:29] For the next 0 hour(s) and 33 minute(s): Mediawiki train - European Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200526T1300) [14:26:45] (03PS9) 10Filippo Giunchedi: thanos: add Store Gateway [puppet] - 10https://gerrit.wikimedia.org/r/597019 (https://phabricator.wikimedia.org/T252186) [14:26:47] Well, as the train back-up, we're not using that, so I'll sling out some config. [14:26:47] (03PS12) 10Filippo Giunchedi: thanos: add objstore support to sidecar [puppet] - 10https://gerrit.wikimedia.org/r/597071 (https://phabricator.wikimedia.org/T252186) [14:26:49] (03PS13) 10Filippo Giunchedi: thanos: add thanos::compact [puppet] - 10https://gerrit.wikimedia.org/r/597072 (https://phabricator.wikimedia.org/T252186) [14:26:51] (03PS6) 10Filippo Giunchedi: prometheus: add store and compact jobs [puppet] - 10https://gerrit.wikimedia.org/r/598698 (https://phabricator.wikimedia.org/T252186) [14:26:53] (03PS6) 10Filippo Giunchedi: prometheus: allow setting min/max block duration [puppet] - 10https://gerrit.wikimedia.org/r/598711 (https://phabricator.wikimedia.org/T252186) [14:26:55] (03PS7) 10Filippo Giunchedi: prometheus: enable Thanos upload for Prometheus k8s-staging [puppet] - 10https://gerrit.wikimedia.org/r/598712 (https://phabricator.wikimedia.org/T252186) [14:26:57] (03CR) 10Muehlenhoff: "Updated PCC: https://puppet-compiler.wmflabs.org/compiler1003/22786/" [puppet] - 10https://gerrit.wikimedia.org/r/598682 (owner: 10Muehlenhoff) [14:27:19] (03PS2) 10Jforrester: ExtensionDistributor: Remove EOL REL1_32 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598573 
(owner: 10Legoktm) [14:27:23] (03CR) 10Jforrester: [C: 03+2] ExtensionDistributor: Remove EOL REL1_32 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598573 (owner: 10Legoktm) [14:28:56] (03Merged) 10jenkins-bot: ExtensionDistributor: Remove EOL REL1_32 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598573 (owner: 10Legoktm) [14:29:20] (03CR) 10Bstorm: toolforge-kubeadm: kubeadm 1.16 requires docker 18.09 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598093 (https://phabricator.wikimedia.org/T250866) (owner: 10Bstorm) [14:29:37] (03PS3) 10Jforrester: SpecialVersionVersionUrl: Don't use confusing local variable name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596726 [14:29:43] (03CR) 10Jforrester: [C: 03+2] SpecialVersionVersionUrl: Don't use confusing local variable name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596726 (owner: 10Jforrester) [14:30:15] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: ExtensionDistributor: Remove EOL REL1_32 (duration: 00m 58s) [14:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:36] (03Merged) 10jenkins-bot: SpecialVersionVersionUrl: Don't use confusing local variable name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596726 (owner: 10Jforrester) [14:31:31] (03PS6) 10Jforrester: Clean up MWMultiVersion check in CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579816 (owner: 10Krinkle) [14:31:39] (03CR) 10Jforrester: [C: 03+2] Clean up MWMultiVersion check in CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579816 (owner: 10Krinkle) [14:31:58] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: SpecialVersionVersionUrl: Don't use confusing local variable name (duration: 00m 58s) [14:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:20] (03Merged) 10jenkins-bot: Clean up MWMultiVersion check in CommonSettings.php 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/579816 (owner: 10Krinkle) [14:33:29] !log test bgp med on dns4002 [14:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:09] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Clean up MWMultiVersion check in CommonSettings.php (duration: 00m 59s) [14:35:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:13] (03CR) 10Vgutierrez: [C: 03+1] wikidough: deploy certificates on the malmok host [puppet] - 10https://gerrit.wikimedia.org/r/598737 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [14:35:22] (03PS3) 10Jforrester: CommonSettings.php: Move uncondition/no-sideeffect includes up [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579814 (owner: 10Krinkle) [14:35:52] (03CR) 10Jforrester: [C: 03+2] CommonSettings.php: Move uncondition/no-sideeffect includes up [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579814 (owner: 10Krinkle) [14:35:54] (03PS1) 10Mholloway: Update wikifeeds to 2020-05-26-142242-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/598751 (https://phabricator.wikimedia.org/T253411) [14:36:39] (03Merged) 10jenkins-bot: CommonSettings.php: Move uncondition/no-sideeffect includes up [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579814 (owner: 10Krinkle) [14:38:30] (03PS5) 10Jforrester: Move mobile-labs into CommonSettings-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/586405 [14:38:30] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: CommonSettings.php: Move uncondition/no-sideeffect includes up (duration: 00m 57s) [14:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:49] (03CR) 10Jforrester: [C: 03+2] Move mobile-labs into CommonSettings-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/586405 (owner: 10Jforrester) [14:39:38] (03Merged) 10jenkins-bot: Move mobile-labs into 
CommonSettings-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/586405 (owner: 10Jforrester) [14:41:37] !log jforrester@deploy1001 Synchronized wmf-config/mobile.php: Don't try to load mobile-labs.php from mobile.php (duration: 00m 57s) [14:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:41] !log installing rails security updates [14:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:21] !log jforrester@deploy1001 Synchronized docroot/noc/: Clear out symlink to mobile-labs.php, now removed (duration: 00m 58s) [14:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:45] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/598498 (owner: 10Jbond) [14:44:45] !log upgrade packages in buster-wikimedia/thirdpardy/kubeadm-k8s-1-16 (T246122) [14:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:49] T246122: Upgrade the Toolforge Kubernetes cluster to v1.16 - https://phabricator.wikimedia.org/T246122 [14:46:15] (03CR) 10Mholloway: [C: 03+2] Update wikifeeds to 2020-05-26-142242-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/598751 (https://phabricator.wikimedia.org/T253411) (owner: 10Mholloway) [14:46:46] (03Merged) 10jenkins-bot: Update wikifeeds to 2020-05-26-142242-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/598751 (https://phabricator.wikimedia.org/T253411) (owner: 10Mholloway) [14:47:07] (03PS5) 10Jforrester: Move mobile into CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/586406 [14:48:33] hashar: DB load OK? Can we close T249964? [14:48:33] T249964: 1.35.0-wmf.32 deployment blockers - https://phabricator.wikimedia.org/T249964 [14:48:49] James_F: have you read it ? ;D [14:48:59] pending monitoring of the database INSERT rate [14:49:11] Yes, I looked at the rate, it seems fine? 
[14:49:13] just hoping we don't have to rollback [14:49:21] We're not rolling back. [14:49:27] The next train rolls in a couple of hours. [14:49:38] Perf had a fortnight to fix it enough for the DBAs to be happy. [14:50:10] so I am keeping it open until perf / dba confirm we can close it [14:50:38] Fine, but I take over train duties in a couple of hours, at which point I'll be closing it. ;-) [14:51:00] hehe ;) [14:51:09] (03CR) 10Jforrester: [C: 03+2] Move mobile into CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/586406 (owner: 10Jforrester) [14:51:16] (03PS3) 10Muehlenhoff: pcc: Also recommend jenkinsapi Debian package [puppet] - 10https://gerrit.wikimedia.org/r/598704 [14:51:49] (03PS1) 10Kormat: base: Add some small quality-of-life packages. [puppet] - 10https://gerrit.wikimedia.org/r/598752 [14:51:56] (03Merged) 10jenkins-bot: Move mobile into CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/586406 (owner: 10Jforrester) [14:52:00] (03CR) 10Muehlenhoff: pcc: Also recommend jenkinsapi Debian package (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598704 (owner: 10Muehlenhoff) [14:53:08] (03CR) 10Bstorm: "Daniel, can I get a +1 before I merge this? I'm scheduled to deploy this week, but I want to make sure the logic is sound." 
[puppet] - 10https://gerrit.wikimedia.org/r/595201 (https://phabricator.wikimedia.org/T252219) (owner: 10Bstorm) [14:53:20] (03PS5) 10Alexandros Kosiaris: admin: deduplicate main helmfile.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/581656 [14:53:22] (03PS5) 10Alexandros Kosiaris: admin/namespace: Deduplicate all helmfile templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/581657 [14:53:24] (03PS5) 10Alexandros Kosiaris: admin: Default to sensible values for deploUser, namespaceName [deployment-charts] - 10https://gerrit.wikimedia.org/r/581658 [14:53:26] (03PS5) 10Alexandros Kosiaris: admin: Remove all override files [deployment-charts] - 10https://gerrit.wikimedia.org/r/581748 [14:53:51] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Move mobile.php into CommonSettings.php (duration: 00m 57s) [14:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:18] !log mholloway-shell@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [14:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:25] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: 0.01388 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:55:30] (03CR) 10Ema: vcl: apply mobileaction/useformat ttl cap to cacheable responses (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598744 (https://phabricator.wikimedia.org/T247783) (owner: 10Ema) [14:55:41] (03PS1) 10Hnowlan: deployment-docker-changeprop01: override docker configuration [puppet] - 10https://gerrit.wikimedia.org/r/598753 (https://phabricator.wikimedia.org/T251176) [14:56:11] !log mholloway-shell@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . 
[14:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:46] !log jforrester@deploy1001 Synchronized docroot/noc/: Clear out symlink to mobile.php, now removed (duration: 00m 55s) [14:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:50] (03PS4) 10Jforrester: Drop enwiki mainpage special casing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570180 (https://phabricator.wikimedia.org/T32405) (owner: 10Jdlrobson) [14:57:16] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [14:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:20] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'coredns' . [14:57:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:23] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [14:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:34] (03CR) 10RLazarus: site: move appservers in codfw rack C4 into their own regex (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598709 (owner: 10Dzahn) [14:57:38] !log sync staging namespaces configuration [14:57:39] (03CR) 10Ema: [C: 03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/594677 (https://phabricator.wikimedia.org/T244538) (owner: 10Vgutierrez) [14:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:02] (03PS1) 10CDanis: dbctl: diffs: when available, prefer icdiff interactively [software/conftool] - 10https://gerrit.wikimedia.org/r/598754 (https://phabricator.wikimedia.org/T253025) [14:58:11] !log mholloway-shell@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . 
[14:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:41] (03CR) 10Jforrester: [C: 03+2] Drop enwiki mainpage special casing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570180 (https://phabricator.wikimedia.org/T32405) (owner: 10Jdlrobson) [14:59:00] (03CR) 10RLazarus: [C: 03+1] site: define mw2249,mw2250 as jobrunner canaries in codfw [puppet] - 10https://gerrit.wikimedia.org/r/598729 (https://phabricator.wikimedia.org/T242606) (owner: 10Dzahn) [14:59:30] (03Merged) 10jenkins-bot: Drop enwiki mainpage special casing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570180 (https://phabricator.wikimedia.org/T32405) (owner: 10Jdlrobson) [15:00:58] (03CR) 10Kormat: [C: 04-1] dbproxy1018: Add labsdb1010 with reduced weight (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598691 (https://phabricator.wikimedia.org/T249188) (owner: 10Marostegui) [15:01:16] (03CR) 10Vgutierrez: [C: 03+2] ATS: Add atstls.mtail program [puppet] - 10https://gerrit.wikimedia.org/r/594677 (https://phabricator.wikimedia.org/T244538) (owner: 10Vgutierrez) [15:01:19] (03PS2) 10Muehlenhoff: Remove support for non-Tomcat and non-DEB deployments [puppet] - 10https://gerrit.wikimedia.org/r/598718 [15:01:30] (03PS3) 10Bstorm: toolforge-kubeadm: kubeadm 1.16 requires docker 18.09 [puppet] - 10https://gerrit.wikimedia.org/r/598093 (https://phabricator.wikimedia.org/T250866) [15:01:40] !log jforrester@deploy1001 Synchronized dblists/mobilemainpagelegacy.dblist: Drop enwiki mobile mainpage special casing T32405 (duration: 00m 59s) [15:01:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:45] T32405: [EPIC] MobileFrontend extension should stop special-casing main page - https://phabricator.wikimedia.org/T32405 [15:02:07] 10Operations, 10SRE-Access-Requests: Requesting access to deployment rights for BPirkle - https://phabricator.wikimedia.org/T253640 (10WDoranWMF) [15:02:16] (03CR) 10Bstorm: [C: 03+2] 
toolforge-kubeadm: kubeadm 1.16 requires docker 18.09 [puppet] - 10https://gerrit.wikimedia.org/r/598093 (https://phabricator.wikimedia.org/T250866) (owner: 10Bstorm) [15:02:28] (03CR) 10CDanis: [C: 03+1] thanos: add Store Gateway (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/597019 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [15:02:36] (03CR) 10jerkins-bot: [V: 04-1] Remove support for non-Tomcat and non-DEB deployments [puppet] - 10https://gerrit.wikimedia.org/r/598718 (owner: 10Muehlenhoff) [15:03:19] (03CR) 10CDanis: [C: 03+1] thanos: add objstore support to sidecar (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/597071 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [15:03:34] (03CR) 10Giuseppe Lavagetto: "LGTM, I mostly am unsure about one detail. Also - update the copyright notice!" (032 comments) [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/596779 (https://phabricator.wikimedia.org/T251574) (owner: 10Jbond) [15:03:39] (03PS1) 10RLazarus: site: Restore mw2180, inadvertently dropped from the regex. [puppet] - 10https://gerrit.wikimedia.org/r/598755 [15:03:41] (03CR) 10Giuseppe Lavagetto: [C: 03+1] docker build: update the build process to us docker [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/596779 (https://phabricator.wikimedia.org/T251574) (owner: 10Jbond) [15:04:05] PROBLEM - Check systemd state on ms-be1034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:05:10] (03CR) 10Dzahn: site: move appservers in codfw rack C4 into their own regex (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598709 (owner: 10Dzahn) [15:05:20] (03PS2) 10Jforrester: Add lazy-loading to Wikimedia Foundation powered-by icon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593038 (https://phabricator.wikimedia.org/T239377) [15:05:52] (03CR) 10Jforrester: [C: 03+2] Add lazy-loading to Wikimedia Foundation powered-by icon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593038 (https://phabricator.wikimedia.org/T239377) (owner: 10Jforrester) [15:06:22] 10Operations, 10SRE-Access-Requests: Requesting access to deployment rights for BPirkle - https://phabricator.wikimedia.org/T253640 (10BPirkle) Shell access was initially granted under https://phabricator.wikimedia.org/T202546 [15:06:42] (03Merged) 10jenkins-bot: Add lazy-loading to Wikimedia Foundation powered-by icon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593038 (https://phabricator.wikimedia.org/T239377) (owner: 10Jforrester) [15:06:48] (03CR) 10Jforrester: "Oops, wanted to roll this after wmf.31." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/593038 (https://phabricator.wikimedia.org/T239377) (owner: 10Jforrester) [15:07:10] (03PS1) 10Dzahn: site: re-add mw2180 to the right regex [puppet] - 10https://gerrit.wikimedia.org/r/598756 [15:07:16] (03CR) 10RLazarus: "PCC ("compiles correctly only with the change"): https://puppet-compiler.wmflabs.org/compiler1003/22790/" [puppet] - 10https://gerrit.wikimedia.org/r/598755 (owner: 10RLazarus) [15:08:19] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Add lazy-loading to Wikimedia Foundation powered-by icon T239377 (duration: 00m 57s) [15:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:22] T239377: Use native for non-printable images - https://phabricator.wikimedia.org/T239377 [15:08:41] !log delete/re-import docker/containerd.io packages in the right version in buster-wikimedia/thirdparty/kubeadm-k8s-1-{15,16} (T250866) [15:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:44] T250866: Stage packages for upstream kubeadm v1.16.9 to use in Toolforge - https://phabricator.wikimedia.org/T250866 [15:09:23] 10Operations, 10SRE-Access-Requests: Requesting access to deployment rights for BPirkle - https://phabricator.wikimedia.org/T253640 (10Dzahn) To SRE doing clinic duty: This is an easy one. Really just need to move the existing user from group "restricted" to group "deployment" and the other steps are all done... [15:10:15] 10Operations, 10Security-Team, 10serviceops, 10vm-requests, 10PM: Eqiad: 1VM request for Peek (PM service in use by Security Team) - https://phabricator.wikimedia.org/T252210 (10chasemp) @MoritzMuehlenhoff could you revisit this when you have a minute? I'd like to get this off my plate and @wiki_willy a... 
[15:10:58] (03PS1) 10JMeybohm: blubberoid: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598759 (https://phabricator.wikimedia.org/T253396) [15:11:19] (03CR) 10Dzahn: [C: 03+2] site: re-add mw2180 to the right regex [puppet] - 10https://gerrit.wikimedia.org/r/598756 (owner: 10Dzahn) [15:11:29] (03PS2) 10Dzahn: site: re-add mw2180 to the right regex [puppet] - 10https://gerrit.wikimedia.org/r/598756 [15:12:28] (03CR) 10Alexandros Kosiaris: [C: 03+1] "This has one side effect and that's that in the old behavior >1 cluster per DC/PoP was possible, as the cluster was identifiable in puppet" [puppet] - 10https://gerrit.wikimedia.org/r/598498 (owner: 10Jbond) [15:13:46] (03CR) 10Dzahn: [C: 03+2] site: Restore mw2180, inadvertently dropped from the regex. [puppet] - 10https://gerrit.wikimedia.org/r/598755 (owner: 10RLazarus) [15:14:19] (03Abandoned) 10Dzahn: site: re-add mw2180 to the right regex [puppet] - 10https://gerrit.wikimedia.org/r/598756 (owner: 10Dzahn) [15:14:36] (03PS1) 10JMeybohm: charts: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598760 (https://phabricator.wikimedia.org/T253396) [15:15:10] (03CR) 10Jbond: [C: 03+2] "> Patch Set 10: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/598498 (owner: 10Jbond) [15:15:24] !log sync kubernetes codfw namespaces configuration with helmfile [15:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:18] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: add Store Gateway [puppet] - 10https://gerrit.wikimedia.org/r/597019 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [15:18:02] (03CR) 10Alexandros Kosiaris: [C: 04-1] deployment-docker-changeprop01: override docker configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598753 (https://phabricator.wikimedia.org/T251176) (owner: 10Hnowlan) [15:18:15] (03PS2) 10JMeybohm: citoid: Update to v0.2 helpers [deployment-charts] - 
10https://gerrit.wikimedia.org/r/598760 (https://phabricator.wikimedia.org/T253396) [15:18:28] (03PS2) 10Dzahn: Revert "icinga: replace check_ssl_http with check_ssl_http_letsencrypt" [puppet] - 10https://gerrit.wikimedia.org/r/594672 [15:19:16] (03CR) 10Hnowlan: deployment-docker-changeprop01: override docker configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598753 (https://phabricator.wikimedia.org/T251176) (owner: 10Hnowlan) [15:19:21] (03CR) 10Dzahn: [C: 03+2] "reverting this per https://phabricator.wikimedia.org/T251726#6109023 until there is a better fix" [puppet] - 10https://gerrit.wikimedia.org/r/594672 (owner: 10Dzahn) [15:19:41] (03PS2) 10Hnowlan: deployment-docker-changeprop01: override docker configuration [puppet] - 10https://gerrit.wikimedia.org/r/598753 (https://phabricator.wikimedia.org/T251176) [15:19:50] (03PS1) 10JMeybohm: cxserver: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598766 (https://phabricator.wikimedia.org/T253396) [15:20:12] (03CR) 10Dzahn: "at a glance this is "only" planet and wmfusercontent monitoring, but it is actually also monitoring for the unified cert as such" [puppet] - 10https://gerrit.wikimedia.org/r/594672 (owner: 10Dzahn) [15:20:39] 10Operations, 10Security-Team, 10serviceops, 10vm-requests, 10PM: Eqiad: 1VM request for Peek (PM service in use by Security Team) - https://phabricator.wikimedia.org/T252210 (10MoritzMuehlenhoff) @chasemp I think it's a little overblown, but if it helps unblocking existing tests, feel free go ahead. Our... 
[15:21:39] (03PS1) 10JMeybohm: wikifeeds: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598774 (https://phabricator.wikimedia.org/T253396) [15:22:28] !log sync kubernetes eqiad namespaces configuration with helmfile [15:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:51] (03PS1) 10JMeybohm: chromium-render: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598777 (https://phabricator.wikimedia.org/T253396) [15:25:13] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.005044 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:25:23] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Certificate *.wikipedia.org valid until 2020-06-20 - https://phabricator.wikimedia.org/T251726 (10Dzahn) >>! In T251726#6109023, @Vgutierrez wrote: > IMHO 7 / 3 is not enough for the unified cert even when LE is the issuer considering our anti cloc... [15:26:30] mutante: merging your change too [15:26:43] godog: thanks, it was 3-way blocked, now just 2 :) [15:26:52] haha! [15:27:06] 10Operations, 10Security-Team, 10serviceops, 10vm-requests, 10PM: Eqiad: 1VM request for Peek (PM service in use by Security Team) - https://phabricator.wikimedia.org/T252210 (10chasemp) >>! In T252210#6165524, @MoritzMuehlenhoff wrote: > @chasemp I think it's a little overblown, but if it helps unblocki... 
[15:27:20] (03PS1) 10JMeybohm: mobileapps: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598779 (https://phabricator.wikimedia.org/T253396) [15:28:27] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:29:42] (03PS1) 10JMeybohm: recommendation-api: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598780 (https://phabricator.wikimedia.org/T253396) [15:30:12] (03PS7) 10Filippo Giunchedi: prometheus: add store and compact jobs [puppet] - 10https://gerrit.wikimedia.org/r/598698 (https://phabricator.wikimedia.org/T252186) [15:30:15] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:30:24] (03CR) 10Alexandros Kosiaris: [C: 04-1] Automate deployments (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/597653 (https://phabricator.wikimedia.org/T253264) (owner: 10Jeena Huneidi) [15:30:37] (03CR) 10Muehlenhoff: [C: 03+1] docker build: update the build process to us docker (031 comment) [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/596779 (https://phabricator.wikimedia.org/T251574) (owner: 10Jbond) [15:31:22] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: deduplicate main helmfile.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/581656 (owner: 10Alexandros Kosiaris) [15:31:34] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: add store and compact jobs [puppet] - 10https://gerrit.wikimedia.org/r/598698 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [15:31:56] 10Operations, 10ops-codfw, 10DBA: db2097 memory errors leading to crash - https://phabricator.wikimedia.org/T252492 (10Papaul) 05Open→03Resolved memory replacement and firmware upgrade complete 
Return label information below {F31842805} [15:33:21] (03CR) 10Ssingh: [C: 03+2] wikidough: deploy certificates on the malmok host [puppet] - 10https://gerrit.wikimedia.org/r/598737 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [15:33:38] James_F: 1.35.0-wmf.32 blocker task closed :] [15:33:47] PROBLEM - Check systemd state on thanos-fe2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:33:47] PROBLEM - Check systemd state on thanos-fe2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:33:56] that's me ^ [15:34:07] PROBLEM - Check systemd state on thanos-fe2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:34:48] (03CR) 10Muehlenhoff: base: Add some small quality-of-life packages. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/598752 (owner: 10Kormat) [15:34:55] (03PS1) 10JMeybohm: _scaffold: add image_version to tls_enabled fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/598782 [15:34:55] RECOVERY - Check systemd state on ms-be1034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:35:16] (03PS6) 10Alexandros Kosiaris: admin/namespace: Deduplicate all helmfile templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/581657 [15:37:23] (03PS2) 10JMeybohm: recommendation-api: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598780 (https://phabricator.wikimedia.org/T253396) [15:37:45] RECOVERY - Check systemd state on thanos-fe2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:38:16] hashar: Excellent. 
[15:38:56] (03PS3) 10JMeybohm: recommendation-api: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598780 (https://phabricator.wikimedia.org/T253396) [15:39:15] RECOVERY - Check systemd state on thanos-fe2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:39:15] RECOVERY - Check systemd state on thanos-fe2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:39] (03PS13) 10Filippo Giunchedi: thanos: add objstore support to sidecar [puppet] - 10https://gerrit.wikimedia.org/r/597071 (https://phabricator.wikimedia.org/T252186) [15:45:41] (03PS14) 10Filippo Giunchedi: thanos: add thanos::compact [puppet] - 10https://gerrit.wikimedia.org/r/597072 (https://phabricator.wikimedia.org/T252186) [15:45:43] (03PS7) 10Filippo Giunchedi: prometheus: allow setting min/max block duration [puppet] - 10https://gerrit.wikimedia.org/r/598711 (https://phabricator.wikimedia.org/T252186) [15:45:45] (03PS8) 10Filippo Giunchedi: prometheus: enable Thanos upload for Prometheus k8s-staging [puppet] - 10https://gerrit.wikimedia.org/r/598712 (https://phabricator.wikimedia.org/T252186) [15:46:51] (03CR) 10jerkins-bot: [V: 04-1] thanos: add objstore support to sidecar [puppet] - 10https://gerrit.wikimedia.org/r/597071 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [15:49:04] (03PS15) 10Filippo Giunchedi: thanos: add thanos::compact [puppet] - 10https://gerrit.wikimedia.org/r/597072 (https://phabricator.wikimedia.org/T252186) [15:49:06] (03PS14) 10Filippo Giunchedi: thanos: add objstore support to sidecar [puppet] - 10https://gerrit.wikimedia.org/r/597071 (https://phabricator.wikimedia.org/T252186) [15:49:08] (03PS8) 10Filippo Giunchedi: prometheus: allow setting min/max block duration [puppet] - 10https://gerrit.wikimedia.org/r/598711 (https://phabricator.wikimedia.org/T252186) [15:49:10] 
(03PS9) 10Filippo Giunchedi: prometheus: enable Thanos upload for Prometheus k8s-staging [puppet] - 10https://gerrit.wikimedia.org/r/598712 (https://phabricator.wikimedia.org/T252186) [15:49:53] PROBLEM - HTTPS-wmfusercontent on phab.wmfusercontent.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2020-06-20 07:01:41 +0000 (expires in 24 days) https://phabricator.wikimedia.org/tag/phabricator/ [15:50:52] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: add thanos::compact [puppet] - 10https://gerrit.wikimedia.org/r/597072 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [15:51:04] (03PS51) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [15:53:27] James_F: I don't consider this to be fixed to be honest. we have a very high rate of inserts and we might get lag on the next rollout or not. But that uncertainty is not good, as if we do, that affect users [15:53:34] (03PS5) 10Jbond: cas-icinga: Add an entry point for the external monitoring script [puppet] - 10https://gerrit.wikimedia.org/r/598742 (https://phabricator.wikimedia.org/T239323) [15:53:59] marostegui: Oh, I agree, it needs improvement. [15:54:05] I am going to comment on that task so someone can work further on reducing the big spikes of inserts we have on each deployment [15:54:05] PROBLEM - HTTPS-planet on en.planet.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2020-06-20 07:01:41 +0000 (expires in 24 days) https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org [15:54:30] We could blacklist Europe to reduce traffic, but other than that it needs the decade-long re-writing to finish. 
[15:57:11] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/598682 (owner: 10Muehlenhoff) [15:59:48] (03CR) 10Jbond: [C: 03+1] "looks good" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/597071 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [16:00:05] godog and _joe_: It is that lovely time of the day again! You are hereby commanded to deploy Puppet SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200526T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. [16:00:18] (03CR) 10Volans: "The patch looks ok but it would be nice to add the following:" (032 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/598754 (https://phabricator.wikimedia.org/T253025) (owner: 10CDanis) [16:00:37] James_F: Yeah, I cannot really say what the fix is, but right now it looks like it is just pure luck that we don't get lag on each deploy it seems very fragile on that sense. If we get lag, like we did a few days ago, then we are producing errors [16:00:45] Let's follow up the discussion on the task [16:00:46] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/598711 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [16:00:48] I will answer now [16:00:57] Thanks. 
[16:01:25] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/598712 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [16:01:54] (03PS1) 10ArielGlenn: enable dumps of structured data from commons [puppet] - 10https://gerrit.wikimedia.org/r/598787 (https://phabricator.wikimedia.org/T221917) [16:02:41] (03PS52) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [16:02:49] (03CR) 10Cwhite: mtail: update varnishrls compatibility with rc35 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/594316 (https://phabricator.wikimedia.org/T251466) (owner: 10Cwhite) [16:06:13] PROBLEM - Check systemd state on ms-be1032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:06:27] (03PS15) 10Filippo Giunchedi: thanos: add objstore support to sidecar [puppet] - 10https://gerrit.wikimedia.org/r/597071 (https://phabricator.wikimedia.org/T252186) [16:06:29] (03PS9) 10Filippo Giunchedi: prometheus: allow setting min/max block duration [puppet] - 10https://gerrit.wikimedia.org/r/598711 (https://phabricator.wikimedia.org/T252186) [16:06:31] (03PS10) 10Filippo Giunchedi: prometheus: enable Thanos upload for Prometheus k8s-staging [puppet] - 10https://gerrit.wikimedia.org/r/598712 (https://phabricator.wikimedia.org/T252186) [16:11:28] (03PS1) 10Jbond: base::debdeploy: switch to lookup and default in hiera [puppet] - 10https://gerrit.wikimedia.org/r/598791 [16:13:31] (03CR) 10Jbond: [C: 03+2] base::debdeploy: switch to lookup and default in hiera [puppet] - 10https://gerrit.wikimedia.org/r/598791 (owner: 10Jbond) [16:14:02] (03PS53) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [16:14:10] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=thanos_compact site=codfw 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:14:27] (03PS4) 10Jdlrobson: Use AddFooterLink hook for code of conduct and contact links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596277 (https://phabricator.wikimedia.org/T251817) [16:16:23] (03PS1) 10Bstorm: kubeadm: fix broken definition of extra volume [puppet] - 10https://gerrit.wikimedia.org/r/598792 (https://phabricator.wikimedia.org/T246122) [16:19:54] 10Operations, 10observability: Duplicate definitions found in Icinga configuration - https://phabricator.wikimedia.org/T211692 (10Dzahn) 05Open→03Resolved a:03Dzahn Thank you @herron ! Yes, confirmed. No more duplicates right now. [16:20:16] 10Operations, 10observability: Duplicate definitions found in Icinga configuration - https://phabricator.wikimedia.org/T211692 (10Dzahn) a:05Dzahn→03herron [16:25:29] 10Puppet, 10SRE-tools, 10Python3-Porting, 10User-crusnov, 10User-jbond: Forward port Python2 files to Python3 in Puppet Repository - https://phabricator.wikimedia.org/T247364 (10crusnov) [16:29:11] (03CR) 10Bstorm: [C: 03+2] kubeadm: fix broken definition of extra volume [puppet] - 10https://gerrit.wikimedia.org/r/598792 (https://phabricator.wikimedia.org/T246122) (owner: 10Bstorm) [16:29:53] RECOVERY - Check systemd state on ms-be1032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:36:06] (03CR) 10Marostegui: [C: 04-2] dbproxy1018: Add labsdb1010 with reduced weight (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598691 (https://phabricator.wikimedia.org/T249188) (owner: 10Marostegui) [16:36:47] (03PS3) 10Marostegui: dbproxy1018: Add labsdb1010 with reduced weight [puppet] - 10https://gerrit.wikimedia.org/r/598691 (https://phabricator.wikimedia.org/T249188) [16:41:22] (03PS1) 10Jbond: base::debdeploy: add defaults [puppet] - 10https://gerrit.wikimedia.org/r/598795 
[16:41:56] (03CR) 10Jbond: [C: 03+2] base::debdeploy: add defaults [puppet] - 10https://gerrit.wikimedia.org/r/598795 (owner: 10Jbond) [16:45:14] 10Puppet, 10SRE-tools, 10Python3-Porting, 10User-MoritzMuehlenhoff, and 2 others: Forward port Python2 files to Python3 in Puppet Repository - https://phabricator.wikimedia.org/T247364 (10MoritzMuehlenhoff) [16:45:54] !log installing jsp-api bugfix update from Buster point release [16:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:31] (03PS1) 10CRusnov: import-mgmt-dns: Fix for choices API [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/598797 [16:47:53] (03CR) 10jerkins-bot: [V: 04-1] import-mgmt-dns: Fix for choices API [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/598797 (owner: 10CRusnov) [16:48:33] 10Operations, 10Traffic, 10observability, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Document and/or improve navigation of the various HTTP frontend Grafana dashboards. - https://phabricator.wikimedia.org/T253655 (10Krinkle) [16:50:50] (03PS2) 10CRusnov: import-mgmt-dns: Fix for choices API [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/598797 [16:53:08] (03CR) 10CRusnov: "This change is ready for review." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/598797 (owner: 10CRusnov) [16:54:50] 10Operations, 10Traffic, 10observability, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Document and/or improve navigation of the various HTTP frontend Grafana dashboards. - https://phabricator.wikimedia.org/T253655 (10CDanis) Thanks for filing this @Krinkle. Ideally Grafana would al... 
[16:55:07] 10Operations, 10Traffic, 10observability, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Document and/or improve navigation of the various HTTP frontend Grafana dashboards - https://phabricator.wikimedia.org/T253655 (10Krinkle) [16:55:11] (03PS1) 10CRusnov: ganeti-netbox-sync: Fix tox-failing code [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/598799 [16:55:59] (03CR) 10CRusnov: [C: 03+2] "Self merging to fix a committed error." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/598799 (owner: 10CRusnov) [16:58:18] (03CR) 10Gehel: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/598787 (https://phabricator.wikimedia.org/T221917) (owner: 10ArielGlenn) [16:58:45] (03PS3) 10CRusnov: import-mgmt-dns: Fix for choices API [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/598797 [17:00:05] halfak and accraze: Your horoscope predicts another unfortunate Services – Graphoid / Citoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200526T1700). 
[17:01:18] 10Operations: Add Prometheus machine metric to track core dumps - https://phabricator.wikimedia.org/T165323 (10MoritzMuehlenhoff) p:05Triage→03Low a:05MoritzMuehlenhoff→03None [17:02:28] 10Operations, 10User-MoritzMuehlenhoff: Reprepro should bail if it can't read and sign using the root keys - https://phabricator.wikimedia.org/T116951 (10MoritzMuehlenhoff) [17:02:55] 10Operations: Integrate Buster 10.4 point update - https://phabricator.wikimedia.org/T252394 (10MoritzMuehlenhoff) [17:03:19] (03PS1) 10Hnowlan: changeprop-jobqueue: change testjob to thumbnailRender [deployment-charts] - 10https://gerrit.wikimedia.org/r/598802 (https://phabricator.wikimedia.org/T220399) [17:05:54] 10Operations: Add Prometheus machine metric to track core dumps - https://phabricator.wikimedia.org/T165323 (10CDanis) It is far from perfect, or even incredibly usable, but we have some semblance of this via `mtail` syslog scraping: https://w.wiki/RtZ [17:07:53] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/598797 (owner: 10CRusnov) [17:08:44] (03CR) 10Volans: [C: 03+1] "Change look sane, would be nice to see the compiler results on this one (including at least also one cumin host)" [puppet] - 10https://gerrit.wikimedia.org/r/598463 (owner: 10Giuseppe Lavagetto) [17:11:43] 10Operations, 10netops: intermittent brief data dropouts for esams netflow data - https://phabricator.wikimedia.org/T253128 (10CDanis) There's lots of this in kern.log on netflow3001: `May 26 15:15:03 netflow3001 kernel: [1214948.953863] nfacctd[422]: segfault at 0 ip 00007f54d61ebb17 sp 00007ffdcefb36e8 erro... 
[17:16:51] !log 1.35.0-wmf.34 was branched at b5012a1e7d0bbd2bf7444b8708d421992bcbe2fb for T253022 [17:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:55] T253022: 1.35.0-wmf.34 deployment blockers - https://phabricator.wikimedia.org/T253022 [17:17:22] (03PS5) 10Volans: scripts: add offline device script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/576461 [17:18:03] (03CR) 10CRusnov: [C: 03+2] import-mgmt-dns: Fix for choices API [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/598797 (owner: 10CRusnov) [17:23:54] 10Operations, 10Traffic, 10netops: Anycast: consistent routers->servers routing - https://phabricator.wikimedia.org/T253666 (10ayounsi) p:05Triage→03Medium [17:24:27] 10Operations, 10Analytics, 10Privacy Engineering, 10Traffic: Publishing project anomaly data for censorship researchers. Evaluate privacy threats - https://phabricator.wikimedia.org/T183990 (10JFishback_WMF) [17:26:24] 10Operations, 10Cloud-VPS, 10DNS, 10Maps, and 2 others: multi-component wmflabs.org subdomains doesn't work under simple wildcard TLS cert - https://phabricator.wikimedia.org/T161256 (10Andrew) [17:30:11] (03Abandoned) 10Volans: scripts: improve decommission script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/577529 (owner: 10Volans) [17:34:26] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin/namespace: Deduplicate all helmfile templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/581657 (owner: 10Alexandros Kosiaris) [17:34:55] (03Merged) 10jenkins-bot: admin/namespace: Deduplicate all helmfile templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/581657 (owner: 10Alexandros Kosiaris) [17:36:55] (03PS6) 10Alexandros Kosiaris: admin: Default to sensible values for deploUser, namespaceName [deployment-charts] - 10https://gerrit.wikimedia.org/r/581658 [17:37:26] (03CR) 10Volans: [C: 03+2] scripts: add offline device script [software/netbox-extras] - 
10https://gerrit.wikimedia.org/r/576461 (owner: 10Volans) [17:38:45] (03PS7) 10Alexandros Kosiaris: admin: Default to sensible values for deploUser, namespaceName [deployment-charts] - 10https://gerrit.wikimedia.org/r/581658 [17:43:47] (03PS8) 10Alexandros Kosiaris: admin: Default to sensible values for deployUser, namespaceName [deployment-charts] - 10https://gerrit.wikimedia.org/r/581658 [17:44:57] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: Default to sensible values for deployUser, namespaceName [deployment-charts] - 10https://gerrit.wikimedia.org/r/581658 (owner: 10Alexandros Kosiaris) [17:45:21] (03Merged) 10jenkins-bot: admin: Default to sensible values for deployUser, namespaceName [deployment-charts] - 10https://gerrit.wikimedia.org/r/581658 (owner: 10Alexandros Kosiaris) [17:47:00] !log netflow3001: disabling puppet and testing some pmacct/librdkafka config tweaks T253128 [17:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:05] T253128: intermittent brief data dropouts for esams netflow data - https://phabricator.wikimedia.org/T253128 [17:57:25] !log installing bind security updates for stretch (only client-side tools/libraries in use) [17:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:18] (03PS2) 10Ppchelko: changeprop-jobqueue: change testjob to thumbnailRender [deployment-charts] - 10https://gerrit.wikimedia.org/r/598802 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [17:59:52] (03CR) 10Ppchelko: [C: 03+1] "Please self-merge when you're ready to deploy" [deployment-charts] - 10https://gerrit.wikimedia.org/r/598802 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200526T1800) [18:02:20] !log cr[12]-eqiad: re-route ns0.wikimedia.org to authdns1001 - T241770 [18:02:21] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:27] !log jforrester@deploy1001 Pruned MediaWiki: 1.35.0-wmf.30 (duration: 20m 45s) [18:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:25] (03CR) 10Jbond: "Ready for review, PCC shows noop for all host except ones which also fail on production" [puppet] - 10https://gerrit.wikimedia.org/r/566559 (owner: 10Jbond) [18:14:28] (03PS54) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [18:14:53] (03PS6) 10Alexandros Kosiaris: admin: Remove all override files [deployment-charts] - 10https://gerrit.wikimedia.org/r/581748 [18:15:03] (03CR) 10jerkins-bot: [V: 04-1] admin: Remove all override files [deployment-charts] - 10https://gerrit.wikimedia.org/r/581748 (owner: 10Alexandros Kosiaris) [18:15:51] (03PS7) 10Alexandros Kosiaris: admin: Remove all override files [deployment-charts] - 10https://gerrit.wikimedia.org/r/581748 [18:16:10] (03CR) 10jerkins-bot: [V: 04-1] admin: Remove all override files [deployment-charts] - 10https://gerrit.wikimedia.org/r/581748 (owner: 10Alexandros Kosiaris) [18:16:15] (03PS8) 10Alexandros Kosiaris: admin: Remove all override files [deployment-charts] - 10https://gerrit.wikimedia.org/r/581748 [18:17:01] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: Remove all override files [deployment-charts] - 10https://gerrit.wikimedia.org/r/581748 (owner: 10Alexandros Kosiaris) [18:17:26] (03Merged) 10jenkins-bot: admin: Remove all override files [deployment-charts] - 10https://gerrit.wikimedia.org/r/581748 (owner: 10Alexandros Kosiaris) [18:31:11] (03PS1) 10Ssingh: dnsdist: add parameters for TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/598822 (https://phabricator.wikimedia.org/T252132) [18:33:00] 10Operations, 10Traffic, 10netops: Anycast: consistent routers->servers routing - https://phabricator.wikimedia.org/T253666 (10BBlack) Even if we could experimentally verify option A, we probably can't trust it 
across future firmware differences (between sites, or between two routers in a site). Option B vi... [18:33:15] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1003/22795/" [puppet] - 10https://gerrit.wikimedia.org/r/598822 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [18:33:59] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TDB) rack/setup/install rdb200[78] - https://phabricator.wikimedia.org/T251626 (10Papaul) [18:43:48] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TDB) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (10Andrew) Regardless of whether or not we move existing cloudvirts from 2 ports to 1, we can definitely rack these new servers... [18:44:02] (03CR) 10Dzahn: "I think it would be even better if in the module itself there are no default values to the cert/key parameters at all and you also pass th" [puppet] - 10https://gerrit.wikimedia.org/r/598822 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [18:49:41] PROBLEM - PHP opcache health on mwdebug2001 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [18:51:29] RECOVERY - PHP opcache health on mwdebug2001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [18:51:51] (03CR) 10Ssingh: "> I think it would be even better if in the module itself there are" [puppet] - 10https://gerrit.wikimedia.org/r/598822 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [18:57:15] (03PS1) 1020after4: testwikis wikis to 1.35.0-wmf.34 refs T253022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598824 [18:57:17] (03CR) 1020after4: [C: 03+2] testwikis wikis to 1.35.0-wmf.34 refs T253022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598824 (owner: 1020after4) [18:57:58] (03Merged) 10jenkins-bot: testwikis wikis to 
1.35.0-wmf.34 refs T253022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598824 (owner: 1020after4) [18:58:08] !log twentyafterfour@deploy1001 Started scap: testwikis wikis to 1.35.0-wmf.34 refs T253022 [18:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:12] T253022: 1.35.0-wmf.34 deployment blockers - https://phabricator.wikimedia.org/T253022 [18:58:49] (03PS1) 10Krinkle: profiler: Use X-Request-Id instead of UNIQUE_ID for XHGui [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598825 (https://phabricator.wikimedia.org/T253674) [19:00:04] twentyafterfour and James_F: #bothumor I � Unicode. All rise for Mediawiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200526T1900). [19:00:39] PROBLEM - MariaDB Slave Lag: m1 on db1117 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1131.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [19:04:17] RECOVERY - MariaDB Slave Lag: m1 on db1117 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [19:10:26] 10Operations, 10SRE-Access-Requests: Requesting access for deployment and restricted group for cicalese - https://phabricator.wikimedia.org/T253676 (10AMooney) [19:11:50] 10Operations, 10SRE-Access-Requests: Requesting access for deployment and restricted group for cicalese - https://phabricator.wikimedia.org/T253676 (10CCicalese_WMF) I have have read and signed the [[https://phabricator.wikimedia.org/L3|L3 Wikimedia Server Access Responsibilities]] document. 
[19:18:40] (03CR) 10Krinkle: vcl: apply mobileaction/useformat ttl cap to cacheable responses (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598744 (https://phabricator.wikimedia.org/T247783) (owner: 10Ema) [19:26:58] (03PS1) 10Esanders: Enable DiscussionTools visual mode on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598834 (https://phabricator.wikimedia.org/T253668) [19:27:35] PROBLEM - PHP opcache health on mwdebug2001 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:27:45] (03CR) 10Esanders: "Per this task, this can be deployed asap" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598834 (https://phabricator.wikimedia.org/T253668) (owner: 10Esanders) [19:29:23] RECOVERY - PHP opcache health on mwdebug2001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:37:22] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TDB) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (10bd808) >>! In T251627#6166529, @Andrew wrote: > Regardless of whether or not we move existing cloudvirts from 2 ports to 1, w... [19:37:40] twentyafterfour: How's the deploy going? Still running? :-( [19:41:05] James_F: 96% [19:41:41] Of the rsync before the CDB build? [19:42:40] no the final sync [19:42:49] sync-apaches, 14 left [19:42:57] Ah, cool. Still, 50 mins sucks for this step. [19:43:53] 393 servers to sync [19:44:05] Oh, yeah, SRE de-racked a few. [19:44:16] it took quite a while just to sync the proxies [19:44:20] We were at ~470 for a while. [19:44:40] not sure where the stragglers are, these last few are taking a long time [19:44:58] Due to random variance in load, isn't it? 
[19:45:19] well yeah there may be a couple of servers overloaded by maintenance scripts or something like that [19:45:43] Oh, hmm, Beta Cluster config sync is dead. [19:45:51] * James_F restarts the executor. [19:48:25] all of the stragglers are in codfw, I guess the link is a bit slow for some reason? [19:48:32] Could be. [19:48:38] Or there's a big rebuild happening there. [19:48:50] (03PS1) 10Ayounsi: Anycast: introduce new "deterministic" variable [puppet] - 10https://gerrit.wikimedia.org/r/598836 (https://phabricator.wikimedia.org/T253666) [19:48:55] Because it's "off" so it's "fine" to abuse it. Except it isn't and it isn't. [19:49:06] I'm not sure if we optimize for the speed of the inter-dc link...like scap may not ensure that servers in codfw fetch from proxies in codfw [19:49:40] (probably wouldn't be hard to do that but I don't think we actually do it) [19:52:51] meh still stuck at 14 :-/ [19:53:04] Joy. [19:53:08] ah ha [19:53:16] Complaining about it fixed it? [19:53:16] I've reproduced the must-hit-enter bug [19:58:53] 10Operations, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Uploading, and 3 others: Some (recent?) uploads to Commons are not available on other wikis - https://phabricator.wikimedia.org/T253405 (10Krinkle) a:05aaron→03Krinkle To write incident report. [20:08:10] !log twentyafterfour@deploy1001 Finished scap: testwikis wikis to 1.35.0-wmf.34 refs T253022 (duration: 70m 02s) [20:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:14] T253022: 1.35.0-wmf.34 deployment blockers - https://phabricator.wikimedia.org/T253022 [20:08:20] Hurrah. 
[20:10:55] PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [20:10:59] (03PS2) 10Krinkle: profiler: Use X-Request-Id instead of UNIQUE_ID for XHGui [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598825 (https://phabricator.wikimedia.org/T253674) [20:12:43] RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [20:15:31] (03PS1) 10CDanis: pmacctd: various increases to buffer sizes [puppet] - 10https://gerrit.wikimedia.org/r/598841 (https://phabricator.wikimedia.org/T253128) [20:16:24] (03CR) 10Jbond: "> Patch Set 1:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/598822 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [20:17:17] (03CR) 10CDanis: "Been testing these (with Puppet disabled) on netflow3001 and haven't seen a hard drop there yet (aside from when I restart nfacctd). Will" [puppet] - 10https://gerrit.wikimedia.org/r/598841 (https://phabricator.wikimedia.org/T253128) (owner: 10CDanis) [20:17:37] twentyafterfour: Things look good on testwiki; time for group0? 
[20:21:07] (03CR) 10CDanis: "pcc https://puppet-compiler.wmflabs.org/compiler1003/22796/netflow3001.esams.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/598841 (https://phabricator.wikimedia.org/T253128) (owner: 10CDanis) [20:21:39] (03CR) 10Jforrester: [C: 03+2] Enable DiscussionTools visual mode on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598834 (https://phabricator.wikimedia.org/T253668) (owner: 10Esanders) [20:22:28] (03Merged) 10jenkins-bot: Enable DiscussionTools visual mode on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598834 (https://phabricator.wikimedia.org/T253668) (owner: 10Esanders) [20:23:15] (03PS2) 10CDanis: nfacctd: various increases to buffer sizes [puppet] - 10https://gerrit.wikimedia.org/r/598841 (https://phabricator.wikimedia.org/T253128) [20:24:43] 10Operations, 10SRE-Access-Requests: Request for srv/phab/phabricator/bin/bulk make-silent --id * command via SSH for moving tasks quarterly - https://phabricator.wikimedia.org/T251349 (10MBinder_WMF) Thanks for your patience on this. I can SSH into bastion.wmflabs.org with user mbinder I am unable... [20:25:00] James_F: yep [20:25:11] (03PS1) 1020after4: group0 wikis to 1.35.0-wmf.34 refs T253022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598842 [20:25:13] (03CR) 1020after4: [C: 03+2] group0 wikis to 1.35.0-wmf.34 refs T253022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598842 (owner: 1020after4) [20:25:53] (03Merged) 10jenkins-bot: group0 wikis to 1.35.0-wmf.34 refs T253022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598842 (owner: 1020after4) [20:27:27] 10Operations, 10SRE-Access-Requests: Request for srv/phab/phabricator/bin/bulk make-silent --id * command via SSH for moving tasks quarterly - https://phabricator.wikimedia.org/T251349 (10RLazarus) >>! In T251349#6166850, @MBinder_WMF wrote: > Should I instead point to the private key I generated (per https://... 
[20:29:22] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.34 refs T253022 [20:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:26] T253022: 1.35.0-wmf.34 deployment blockers - https://phabricator.wikimedia.org/T253022 [20:30:51] 10Operations, 10Traffic, 10Core Platform Team Workboards (Clinic Duty Team), 10Developer Productivity: Whitelist x-wikimedia-debug header field (currently not allowed by Access-Control-Allow-Headers in preflight response) - https://phabricator.wikimedia.org/T252826 (10eprodromou) p:05Triage→03Medium [20:31:31] (03PS1) 10Aaron Schulz: Enable $wgResourceLoaderUseObjectCacheForDeps for testwiki and mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598844 [20:31:50] twentyafterfour: Success. [20:55:04] (03CR) 10Gehel: [C: 04-1] increment extra plugin to 6.5.4-wmf-9 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/593833 (https://phabricator.wikimedia.org/T222669) (owner: 10Mstyles) [20:58:08] (03CR) 10Ryan Kemper: [C: 03+2] increment extra plugin to 6.5.4-wmf-9 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/593833 (https://phabricator.wikimedia.org/T222669) (owner: 10Mstyles) [21:01:11] (03CR) 10BBlack: [C: 03+1] Anycast: introduce new "deterministic" variable [puppet] - 10https://gerrit.wikimedia.org/r/598836 (https://phabricator.wikimedia.org/T253666) (owner: 10Ayounsi) [21:02:34] (03CR) 10Krinkle: [C: 04-1] "-1 to look into the svg issue." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/594943 (https://phabricator.wikimedia.org/T252108) (owner: 10Gilles) [21:14:46] (03CR) 10Krinkle: [C: 03+2] profiler: Use X-Request-Id instead of UNIQUE_ID for XHGui [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598825 (https://phabricator.wikimedia.org/T253674) (owner: 10Krinkle) [21:15:35] (03Merged) 10jenkins-bot: profiler: Use X-Request-Id instead of UNIQUE_ID for XHGui [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598825 (https://phabricator.wikimedia.org/T253674) (owner: 10Krinkle) [21:17:18] (03PS1) 10Aaron Schulz: Enable "coalesceKeys"="non-global" for WANCache on commonswiki (II) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598851 [21:17:20] (03PS1) 10Aaron Schulz: Enable "coalesceKeys"="non-global" for WANCache on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598852 [21:17:22] (03PS1) 10Aaron Schulz: Enable "coalesceKeys"="non-global" for WANCache on all but enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598853 [21:17:24] (03PS1) 10Aaron Schulz: Enable "coalesceKeys"="non-global" for WANCache on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598854 [21:17:26] (03PS1) 10Aaron Schulz: Enable "coalesceKeys" for global keys for WANCache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598855 [21:18:44] !log krinkle@deploy1001 Synchronized wmf-config/profiler.php: Ib0bf8d97b10b, T253674 (duration: 01m 06s) [21:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:48] T253674: WikimediaDebug "Find in XHGui" can't find profiles - https://phabricator.wikimedia.org/T253674 [21:19:34] (03PS2) 10Krinkle: Enable "coalesceKeys"="non-global" for WANCache on commonswiki (II) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598851 (https://phabricator.wikimedia.org/T252564) (owner: 10Aaron Schulz) [21:19:44] (03CR) 10Krinkle: [C: 03+2] Enable "coalesceKeys"="non-global" for WANCache on commonswiki 
(II) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598851 (https://phabricator.wikimedia.org/T252564) (owner: 10Aaron Schulz) [21:20:38] (03Merged) 10jenkins-bot: Enable "coalesceKeys"="non-global" for WANCache on commonswiki (II) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598851 (https://phabricator.wikimedia.org/T252564) (owner: 10Aaron Schulz) [21:22:29] AaronSchulz: staged on mwdebug1002 [21:24:30] jouncebot: next [21:24:30] In 1 hour(s) and 35 minute(s): Evening SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200526T2300) [21:25:54] (03CR) 10Krinkle: [C: 03+2] Enable $wgResourceLoaderUseObjectCacheForDeps for testwiki and mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598844 (owner: 10Aaron Schulz) [21:26:41] (03PS2) 10Ssingh: dnsdist: add configuration setting for TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/598822 (https://phabricator.wikimedia.org/T252132) [21:26:51] (03Merged) 10jenkins-bot: Enable $wgResourceLoaderUseObjectCacheForDeps for testwiki and mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598844 (owner: 10Aaron Schulz) [21:27:49] (03CR) 10jerkins-bot: [V: 04-1] dnsdist: add configuration setting for TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/598822 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [21:30:30] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: I2714e2ae26404 (duration: 01m 06s) [21:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:06] (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/598822 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [21:31:46] Krinkle: lgtm [21:32:08] Krinkle: I guess I'll make a task for the deps hashing idea before doing large wikis [21:33:50] AaronSchulz: sgtm. how close are we to mainstash/db though? 
Last I heard there was some concern about whether x1 could fit the data, Jaime wanted to see some numbers [21:34:02] I guess with all the Echo work done, we now have those numbers :) [21:34:18] !log krinkle@deploy1001 Synchronized wmf-config/mc.php: I0fb124b3593 (duration: 01m 05s) [21:34:20] and the good news is that it's pretty stable. [21:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:32] AaronSchulz: https://grafana.wikimedia.org/d/000000174/redis?orgId=1&refresh=1m&var-datasource=eqiad%20prometheus%2Fops&var-job=redis_sessions&var-instance=All [21:34:46] the size is nicely oscillating at a ceiling far below the actual ceiling [21:34:55] so stuff expires before needing to be evicted [21:35:10] (even with sessions) [21:48:48] (03PS3) 10Ssingh: dnsdist: add configuration setting for TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/598822 (https://phabricator.wikimedia.org/T252132) [21:49:52] (03CR) 10jerkins-bot: [V: 04-1] dnsdist: add configuration setting for TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/598822 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [21:50:58] (03PS4) 10Ssingh: dnsdist: add configuration setting for TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/598822 (https://phabricator.wikimedia.org/T252132) [21:51:14] argh [21:53:53] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By:TBD) rack/setup/install rows C and D new PDUs - https://phabricator.wikimedia.org/T253694 (10RobH) [22:03:10] (03PS5) 10Ssingh: dnsdist: add configuration setting for TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/598822 (https://phabricator.wikimedia.org/T252132) [22:05:25] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1003/22800/" [puppet] - 10https://gerrit.wikimedia.org/r/598822 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [22:30:40] 10Operations, 10ops-eqiad, 10Analytics: an-presto1004 down - 
https://phabricator.wikimedia.org/T253438 (10wiki_willy) a:03Cmjohnson [22:33:10] jouncebot: refresh [22:33:11] I refreshed my knowledge about deployments. [22:33:20] * RhinosF1 moved his patches [22:34:51] (03CR) 10Cwhite: [C: 03+1] "LGTM! It might be worth trying out on deployment-prep and/or Pontoon if additional testing is desired." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/566559 (owner: 10Jbond) [22:42:25] PROBLEM - PHP opcache health on scandium is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [23:00:05] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Evening SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200526T2300) [23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:04:17] RECOVERY - PHP opcache health on scandium is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [23:26:49] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TDB) rack/setup/install rdb200[78] - https://phabricator.wikimedia.org/T251626 (10Papaul) [23:48:04] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TDB) rack/setup/install rdb200[78] - https://phabricator.wikimedia.org/T251626 (10Papaul) [23:58:13] (03PS12) 10EBernhardson: Role for SDoC WDQS [puppet] - 10https://gerrit.wikimedia.org/r/595041 (https://phabricator.wikimedia.org/T237089) [23:58:15] (03PS1) 10EBernhardson: Rename role::wdqs to role::wdqs::public [puppet] - 10https://gerrit.wikimedia.org/r/598884 [23:58:17] (03PS1) 10EBernhardson: Define a shared profile to remove duplication in roles [puppet] - 10https://gerrit.wikimedia.org/r/598885 [23:58:19] (03PS1) 10EBernhardson: [DNM] Demonstrate problem with wdqs hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/598886 [23:59:47] (03CR) 10jerkins-bot: [V: 04-1] Rename role::wdqs to role::wdqs::public [puppet] - 
10https://gerrit.wikimedia.org/r/598884 (owner: 10EBernhardson)