[00:00:39] (PS7) RLazarus: mcrouter_wancache: Add mcrouter support for a machine-local memcached instance [puppet] - https://gerrit.wikimedia.org/r/594760 (https://phabricator.wikimedia.org/T244340)
[00:00:48] (PS1) Reedy: Remove OAuthReplaceMessage hook subscriber [mediawiki-config] - https://gerrit.wikimedia.org/r/601910 (https://phabricator.wikimedia.org/T254301)
[00:01:06] (CR) Reedy: [C: -2] "https://gerrit.wikimedia.org/r/601909 needs to be everywhere first" [mediawiki-config] - https://gerrit.wikimedia.org/r/601910 (https://phabricator.wikimedia.org/T254301) (owner: Reedy)
[00:22:54] (PS2) Aaron Schulz: Enable "coalesceKeys"="non-global" for WANCache on all but enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/598853
[00:24:53] (CR) Bstorm: "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/601374 (owner: Bstorm)
[00:25:58] (CR) Bstorm: [C: +2] labstore: turn off systemd paging for labstore1004/5 [puppet] - https://gerrit.wikimedia.org/r/601753 (owner: Bstorm)
[00:26:02] (CR) RLazarus: "PTAL as I've refactored a little. PCC still looks good: there are some diffs even with the hiera flag off, but they're no-ops, of the form" [puppet] - https://gerrit.wikimedia.org/r/594760 (https://phabricator.wikimedia.org/T244340) (owner: RLazarus)
[00:33:49] (CR) Krinkle: [C: +1] Enable "coalesceKeys"="non-global" for WANCache on all but enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/598853 (owner: Aaron Schulz)
[00:42:20] (CR) Krinkle: [C: +2] Enable "coalesceKeys"="non-global" for WANCache on all but enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/598853 (owner: Aaron Schulz)
[00:43:07] (Merged) jenkins-bot: Enable "coalesceKeys"="non-global" for WANCache on all but enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/598853 (owner: Aaron Schulz)
[00:43:28] (PS1) Reedy: Add apiportalwiki to wgLocalVirtualHosts [mediawiki-config] - https://gerrit.wikimedia.org/r/601920 (https://phabricator.wikimedia.org/T254185)
[00:49:02] (PS2) Reedy: Add apiportalwiki to wgLocalVirtualHosts [mediawiki-config] - https://gerrit.wikimedia.org/r/601920 (https://phabricator.wikimedia.org/T254185)
[00:49:07] (CR) Reedy: [C: +2] Add apiportalwiki to wgLocalVirtualHosts [mediawiki-config] - https://gerrit.wikimedia.org/r/601920 (https://phabricator.wikimedia.org/T254185) (owner: Reedy)
[00:49:51] (Merged) jenkins-bot: Add apiportalwiki to wgLocalVirtualHosts [mediawiki-config] - https://gerrit.wikimedia.org/r/601920 (https://phabricator.wikimedia.org/T254185) (owner: Reedy)
[00:51:18] AaronSchulz: staged on mwdebug1002
[00:55:46] Krinkle: frwiki looks fine to me
[00:56:58] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[00:58:12] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[00:58:17] AaronSchulz: ack, logstash/mwdebug looks good as well
[00:58:20] rolling out then
[00:59:13] !log krinkle@deploy1001 Synchronized wmf-config/mc.php: Ic27b605aec0118c5 (duration: 01m 11s)
[00:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:00:58] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings-labs.php: wgLocalVirtualHosts (duration: 01m 06s)
[01:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:03:02] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 52 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:08:36] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 47 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:13:36] (PS1) Reedy: Make wikimania.wikimedia.org redirect to mobile site [puppet] - https://gerrit.wikimedia.org/r/601923
[01:13:38] (PS1) Reedy: Make api.wikimedia redirect to mobile site [puppet] - https://gerrit.wikimedia.org/r/601924 (https://phabricator.wikimedia.org/T254185)
[03:25:24] Operations, DC-Ops: determine/process/document bios firmware tracking/updating policies - https://phabricator.wikimedia.org/T141128 (RobH) >>! In T141128#6187019, @wiki_willy wrote: > Demo for Dell's System Management Tool set up for next Monday on June 8, to evaluate if it's something we want to use goi...
[05:01:50] (PS1) Marostegui: db1138: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/601948 (https://phabricator.wikimedia.org/T253808)
[05:07:16] (CR) Marostegui: [C: +2] db1138: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/601948 (https://phabricator.wikimedia.org/T253808) (owner: Marostegui)
[05:09:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1138 T253808', diff saved to https://phabricator.wikimedia.org/P11369 and previous config saved to /var/cache/conftool/dbconfig/20200603-050911-marostegui.json
[05:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:09:16] T253808: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808
[05:11:59] @TimStarling do you have a minute?
[05:12:37] yes
[05:13:06] if you are going to ask about hook deprecation, I have a patch for that almost ready to commit
[05:13:38] Yeah, that was it - how to soft deprecate a hook without triggering deprecation errors
[05:14:02] !log turn cr1-codfw:fpc0 online - T254110
[05:14:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:14:06] T254110: cr1-codfw:fpc0 failure - https://phabricator.wikimedia.org/T254110
[05:14:10] you can achieve this by merging my code after I write it ;)
[05:14:10] But also I found that the doc block for `RevisionFromEditCompleteHook` is wrong - there is nothing returned. Can you take a look at https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/601947/ ?
[05:15:16] (PS1) Marostegui: mariadb: Reimage db2071 [puppet] - https://gerrit.wikimedia.org/r/601949 (https://phabricator.wikimedia.org/T238966)
[05:15:34] ok
[05:16:54] (CR) Marostegui: [C: +2] mariadb: Reimage db2071 [puppet] - https://gerrit.wikimedia.org/r/601949 (https://phabricator.wikimedia.org/T238966) (owner: Marostegui)
[05:17:36] (CR) Cwhite: [C: -1] "as-is, this patch will attempt adding the flag before the rc35 upgrade. it needs to be broken into two steps" [puppet] - https://gerrit.wikimedia.org/r/599474 (https://phabricator.wikimedia.org/T251466) (owner: Cwhite)
[05:17:50] (CR) Cwhite: [C: -1] "as-is, this patch will attempt adding the flag before the rc35 upgrade. it needs to be broken into two steps" [puppet] - https://gerrit.wikimedia.org/r/601874 (https://phabricator.wikimedia.org/T251466) (owner: Cwhite)
[05:18:08] RECOVERY - OSPF status on mr1-codfw is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:18:32] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:19:20] Operations, netops: cr1-codfw:fpc0 failure - https://phabricator.wikimedia.org/T254110 (ayounsi) Open→Resolved The FPC is up and healthy. Interfaces are up as well. Netbox updated with the new serial#.
[05:25:39] (PS1) Ayounsi: Depool esams for network maintenance [dns] - https://gerrit.wikimedia.org/r/601951 (https://phabricator.wikimedia.org/T254021)
[05:32:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[05:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:35:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[05:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:37:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1138 T253808', diff saved to https://phabricator.wikimedia.org/P11370 and previous config saved to /var/cache/conftool/dbconfig/20200603-053748-marostegui.json
[05:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:37:52] T253808: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808
[05:38:11] !log deactivate graceful-switchover on cr3-esams - T254021
[05:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:38:14] T254021: Amsterdam maintenance (June 2020) - https://phabricator.wikimedia.org/T254021
[05:40:58] !log Stop MySQL on db2130 to clone db2071
[05:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:41:18] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:41:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2130 to clone db2071', diff saved to https://phabricator.wikimedia.org/P11371 and previous config saved to /var/cache/conftool/dbconfig/20200603-054117-marostegui.json
[05:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:41:57] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:48:15] (CR) Ayounsi: [C: +2] Depool esams for network maintenance [dns] - https://gerrit.wikimedia.org/r/601951 (https://phabricator.wikimedia.org/T254021) (owner: Ayounsi)
[05:48:55] !log depool esams - T254021
[05:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:48:59] T254021: Amsterdam maintenance (June 2020) - https://phabricator.wikimedia.org/T254021
[05:51:13] !log deactivate transit BGP ton cr3-knams - T254021
[05:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:09] !log reboot cr3-knams - T254021
[05:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:14] T254021: Amsterdam maintenance (June 2020) - https://phabricator.wikimedia.org/T254021
[05:59:19] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 56.07 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[06:00:56] that's expected with the depool ^
[06:01:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1138 T253808', diff saved to https://phabricator.wikimedia.org/P11373 and previous config saved to /var/cache/conftool/dbconfig/20200603-060124-marostegui.json
[06:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:30] T253808: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808
[06:04:22] (CR) Elukey: "@Bearloga: if this package is not needed on Hadoop workers etc.. I'd deploy it via profile::analytics::cluster::packages::statistics, that" [puppet] - https://gerrit.wikimedia.org/r/601848 (https://phabricator.wikimedia.org/T254278) (owner: Bearloga)
[06:06:12] XioNoX: if you need any help etc.. please ping :)
[06:06:20] elukey: thank you!
[06:08:18] !log re-activate transit BGP to cr3-knams - T254021
[06:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:23] T254021: Amsterdam maintenance (June 2020) - https://phabricator.wikimedia.org/T254021
[06:09:20] alright cr3-knams is done, unfortunately it didn't solve T244497
[06:09:21] T244497: cr3-knams:xe-0/1/3 down - https://phabricator.wikimedia.org/T244497
[06:11:02] !log cr3-esams> request vmhost reboot re1 (backup re) - T244497
[06:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:24] !log cr3-esams> request chassis routing-engine master switch - T244497
[06:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:30] T244497: cr3-knams:xe-0/1/3 down - https://phabricator.wikimedia.org/T244497
[06:23:33] FPCs are coming online
[06:23:41] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:25:29] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:26:27] re1 is back on the new version and healthy, pushing the OS to re0
[06:27:59] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[06:29:43] PROBLEM - ores uWSGI web app on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores
[06:29:55] PROBLEM - Check systemd state on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:30:05] PROBLEM - Check size of conntrack table on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[06:31:58] (PS10) WMDE-leszek: Wikidata: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569258 (https://phabricator.wikimedia.org/T242087)
[06:31:59] PROBLEM - puppet last run on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:33:31] RECOVERY - Check systemd state on ores2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:33:41] RECOVERY - Check size of conntrack table on ores2008 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[06:35:33] re0 (backup) is rebooting
[06:37:46] RECOVERY - puppet last run on ores2008 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:37:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1138 T253808', diff saved to https://phabricator.wikimedia.org/P11374 and previous config saved to /var/cache/conftool/dbconfig/20200603-063752-marostegui.json
[06:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:58] T253808: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808
[06:39:22] re1.cr3-esams> request chassis routing-engine master switch - T244497
[06:39:22] T244497: cr3-knams:xe-0/1/3 down - https://phabricator.wikimedia.org/T244497
[06:40:26] (PS11) WMDE-leszek: Wikidata: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569258 (https://phabricator.wikimedia.org/T254315)
[06:40:46] (CR) WMDE-leszek: Wikidata client wikis: Define entity sources configuration (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T242087) (owner: WMDE-leszek)
[06:40:54] (PS12) WMDE-leszek: Wikidata client wikis: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T242087)
[06:41:01] (PS13) WMDE-leszek: Wikidata client wikis: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T254315)
[06:41:16] (PS9) WMDE-leszek: Commons: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569260 (https://phabricator.wikimedia.org/T242087)
[06:41:33] (PS8) WMDE-leszek: Wikibase: Removed config option wmgUseEntitySourceBasedFederation [mediawiki-config] - https://gerrit.wikimedia.org/r/569263 (https://phabricator.wikimedia.org/T254315)
[06:41:41] FPCs are rebooting
[06:41:47] (PS10) WMDE-leszek: Commons: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569260 (https://phabricator.wikimedia.org/T254315)
[06:42:20] (PS9) WMDE-leszek: Wikibase: Removed config option wmgUseEntitySourceBasedFederation [mediawiki-config] - https://gerrit.wikimedia.org/r/569263 (https://phabricator.wikimedia.org/T241975)
[06:43:29] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:45:16] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:45:45] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:48:15] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:50:03] PROBLEM - Check the last execution of mediawiki_job_cirrus_build_completion_indices_eqiad on mwmaint1002 is CRITICAL: CRITICAL: Status of the systemd unit mediawiki_job_cirrus_build_completion_indices_eqiad https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:51:06] cr3-esams is all good
[06:51:11] nice
[06:51:24] going to failover vrrp to cr3-esams and reboot cr2-esams
[06:51:56] ack
[06:54:53] !log failover vrrp to cr3-esams - T244497
[06:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:59] T244497: cr3-knams:xe-0/1/3 down - https://phabricator.wikimedia.org/T244497
[06:56:20] !log deactivate peering/transit BGP cr2-esams - T244497
[06:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:24] is this the right task? :)
[06:59:30] should it be T254021 instead?
[06:59:32] T254021: Amsterdam maintenance (June 2020) - https://phabricator.wikimedia.org/T254021
[07:00:05] (or one of the cr2-esams subtasks)
[07:00:11] paravoid: oops, I picked the wrong one from scrollback :)
[07:00:19] !log re0.cr2-esams> request system reboot both-routing-engines - T254021
[07:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:40] monitoring the reboot on console
[07:00:47] (CR) Kormat: install_server: Allow reuse of partitions during reimage. [WIP] (1 comment) [puppet] - https://gerrit.wikimedia.org/r/601761 (https://phabricator.wikimedia.org/T252027) (owner: Kormat)
[07:01:29] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:04:56] host is back up, waiting for linecards to boot
[07:07:41] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:09:06] !log re-activate peering/transit BGP on cr2-esams - T254021
[07:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:12] T254021: Amsterdam maintenance (June 2020) - https://phabricator.wikimedia.org/T254021
[07:10:37] checking everything and removing icinga downtimes, but we should be good to repool
[07:10:48] (PS6) Elukey: Add support to pull datapoints from Kafka [software/druid_exporter] - https://gerrit.wikimedia.org/r/600295
[07:14:09] (PS1) Ayounsi: Revert "Depool esams for network maintenance" [dns] - https://gerrit.wikimedia.org/r/602007 (https://phabricator.wikimedia.org/T254021)
[07:14:12] (Abandoned) Elukey: Swap profile::java::analytics with profile::java [puppet] - https://gerrit.wikimedia.org/r/599389 (owner: Elukey)
[07:15:13] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:15:58] !log repool esams - T254021
[07:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:03] T254021: Amsterdam maintenance (June 2020) - https://phabricator.wikimedia.org/T254021
[07:16:08] (CR) Ayounsi: [C: +2] Revert "Depool esams for network maintenance" [dns] - https://gerrit.wikimedia.org/r/602007 (https://phabricator.wikimedia.org/T254021) (owner: Ayounsi)
[07:17:58] (PS1) Muehlenhoff: Extend MOU date for aarora [puppet] - https://gerrit.wikimedia.org/r/602008
[07:18:44] Operations, Analytics, Traffic: Compare logs produced by atskfafka with those produced by varnishkafka - https://phabricator.wikimedia.org/T254317 (ema)
[07:18:52] Operations, Analytics, Traffic: Compare logs produced by atskfafka with those produced by varnishkafka - https://phabricator.wikimedia.org/T254317 (ema) p:Triage→Medium
[07:19:09] Operations, Analytics, Traffic: Compare logs produced by atskfafka with those produced by varnishkafka - https://phabricator.wikimedia.org/T254317 (ema)
[07:19:19] Operations, ops-esams, netops, Patch-For-Review: Amsterdam maintenance (June 2020) - https://phabricator.wikimedia.org/T254021 (ayounsi)
[07:22:53] Operations, Analytics, Traffic: Compare logs produced by atskfafka with those produced by varnishkafka - https://phabricator.wikimedia.org/T254317 (ema)
[07:23:30] Operations, DBA: In-place conversion from LVM to normal partition - https://phabricator.wikimedia.org/T252195 (Kormat) Open→Stalled This is on-hold for now. It looks like we don't need to get rid of lvm (https://gerrit.wikimedia.org/r/c/operations/puppet/+/601761).
[07:24:18] Operations, ops-esams, netops, Patch-For-Review: Amsterdam maintenance (June 2020) - https://phabricator.wikimedia.org/T254021 (ayounsi) Open→Resolved Everything got done smoothly, no user impact. T253970 and T244497 are still not solved. T245520 is solved. I also used the opportunity t...
[07:25:36] Operations, netops, Patch-For-Review: Configure management-instance on routers with Junos > 17.3 - https://phabricator.wikimedia.org/T247073 (ayounsi) Applied to cr2-esams and cr3-esams. Confirmed mgmt is still reachable.
[07:25:51] (CR) Muehlenhoff: [C: +2] Extend MOU date for aarora [puppet] - https://gerrit.wikimedia.org/r/602008 (owner: Muehlenhoff)
[07:26:29] Operations, ops-eqiad, DBA, Wikimedia-Incident: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (Marostegui) Open→Resolved Host repooled. All done. Thanks John for replacing the memory!
[07:26:59] Operations, ops-esams, netops: 2*10G optics down on cr2-esams - https://phabricator.wikimedia.org/T245520 (ayounsi) Open→Resolved Links are back up after a reboot.
[07:28:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2071 after cloning it from db2130 to restore all the schema changes applied', diff saved to https://phabricator.wikimedia.org/P11375 and previous config saved to /var/cache/conftool/dbconfig/20200603-072841-marostegui.json
[07:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:29] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 47.17 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[07:31:36] this is expected and due to esams repool ^
[07:32:22] Operations, netops: No LACP info for cr2-esams:ae2 - https://phabricator.wikimedia.org/T253970 (ayounsi) The reboot didn't help.
[07:32:54] Operations, netops: cr3-knams:xe-0/1/3 down - https://phabricator.wikimedia.org/T244497 (ayounsi) The upgrade and reboot of cr3-knams didn't help.
[07:33:20] (PS1) Elukey: Deprecate profile::java::analytics in favor of profile::java [puppet] - https://gerrit.wikimedia.org/r/602009
[07:33:50] !log cp: upgrade purged to 0.15
[07:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:42] Can someone please delete https://wikitech.wikimedia.org/wiki/WordPress_%E2%80%93_Ease_and_Convenience_to_use! ? Spam
[07:34:53] Operations, DC-Ops: determine/process/document bios firmware tracking/updating policies - https://phabricator.wikimedia.org/T141128 (wiki_willy) Sure, no problem @RobH . I just asked Paul to add you to the invite. > > Please include me in this, as last time we evaluated this it didn't meet open source O...
[07:35:47] DannyS712: done
[07:35:55] !log imported PHP 7.2.31 to apt.wikimedia.org/component/php72
[07:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:57] PROBLEM - Check the last execution of mediawiki_job_cirrus_build_completion_indices_codfw on mwmaint1002 is CRITICAL: CRITICAL: Status of the systemd unit mediawiki_job_cirrus_build_completion_indices_codfw https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:36:22] dcausse: --^
[07:36:30] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw218[0-6].codfw.wmnet
[07:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:36] (PS2) Elukey: Deprecate profile::java::analytics in favor of profile::java [puppet] - https://gerrit.wikimedia.org/r/602009
[07:36:39] elukey: looking
[07:36:45] !log depooling mw2180 - mw2186
[07:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:37] Operations, netops: cr3-knams:xe-0/1/3 down - https://phabricator.wikimedia.org/T244497 (ayounsi) a:faidon→ayounsi
[07:39:19] (PS3) Elukey: Deprecate profile::java::analytics in favor of profile::java [puppet] - https://gerrit.wikimedia.org/r/602009
[07:41:12] Operations, ops-codfw, DBA, DC-Ops: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (Kormat)
[07:42:02] elukey: do you know how to retrieve the logs for a profile::mediawiki::periodic_job ?
[07:43:05] dcausse: they should be in /var/log/mediawiki/ on mwmaint1002
[07:43:11] mutante: thanks
[07:44:13] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw218[0-6].codfw.wmnet
[07:44:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:37] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission
[07:45:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:52] (PS4) Elukey: Deprecate profile::java::analytics in favor of profile::java [puppet] - https://gerrit.wikimedia.org/r/602009
[07:48:29] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[07:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:37] Operations, ops-codfw, decommission, serviceops, Patch-For-Review: codfw: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw[21...
[07:51:12] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission
[07:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:24] (PS2) Dzahn: site: decom mw2180 through mw2186 [puppet] - https://gerrit.wikimedia.org/r/599606 (https://phabricator.wikimedia.org/T247018)
[07:52:32] Operations, observability: Setup HTCP monitoring alerts - https://phabricator.wikimedia.org/T82176 (Joe) Open→Declined We're moving to purged using kafka, so we will set up alerting based on that, rather than on htcpd
[07:52:58] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[07:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:05] Operations, ops-codfw, decommission, serviceops, Patch-For-Review: codfw: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw[21...
[07:54:51] (CR) Dzahn: [C: +2] site: decom mw2180 through mw2186 [puppet] - https://gerrit.wikimedia.org/r/599606 (https://phabricator.wikimedia.org/T247018) (owner: Dzahn)
[07:56:44] (PS1) Marostegui: db2071: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/602011
[07:58:10] (CR) Marostegui: [C: +2] db2071: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/602011 (owner: Marostegui)
[08:01:49] (PS1) Dzahn: remove production IPs for mw2173 through mw2186 [dns] - https://gerrit.wikimedia.org/r/602012 (https://phabricator.wikimedia.org/T247018)
[08:04:57] (CR) Dzahn: [C: +2] remove production IPs for mw2173 through mw2186 [dns] - https://gerrit.wikimedia.org/r/602012 (https://phabricator.wikimedia.org/T247018) (owner: Dzahn)
[08:05:31] (PS5) Elukey: Deprecate profile::java::analytics in favor of profile::java [puppet] - https://gerrit.wikimedia.org/r/602009
[08:08:06] !log remove ae2 physical interfaces from external group - T253970
[08:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:10] T253970: No LACP info for cr2-esams:ae2 - https://phabricator.wikimedia.org/T253970
[08:08:44] !log reimaging pc2009 to buster T252182
[08:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:47] T252182: Upgrade parsercache to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T252182
[08:09:11] (PS1) Dzahn: remove mgmt IPs for mw2150 through mw2186 [dns] - https://gerrit.wikimedia.org/r/602013 (https://phabricator.wikimedia.org/T247018)
[08:09:48] !log upgrading remaining mwdebug* servers to PHP 7.2.31
[08:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:00] (PS1) Kormat: install_server: Allow reimage of pc2009 [puppet] - https://gerrit.wikimedia.org/r/602014 (https://phabricator.wikimedia.org/T252182)
[08:12:57] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/22946/" [puppet] - https://gerrit.wikimedia.org/r/602009 (owner: Elukey)
[08:14:14] Operations, ops-codfw, decommission, serviceops, Patch-For-Review: codfw: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (Dzahn) @Papaul All remaining old mw servers in rack C3 (mw2154 through mw2186) are also decom'ed...
[08:14:28] Operations, ops-codfw, decommission, serviceops, Patch-For-Review: codfw: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (Dzahn) p:High→Medium a:Dzahn→Papaul
[08:15:36] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:16:59] !log re-add ae2 physical interfaces to external group - T253970
[08:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:03] T253970: No LACP info for cr2-esams:ae2 - https://phabricator.wikimedia.org/T253970
[08:17:54] (PS1) DCausse: [cirrus] prepend the wiki id in comp suggest logs [puppet] - https://gerrit.wikimedia.org/r/602015
[08:19:59] !log upgrading mw1261 to PHP 7.2.31
[08:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:58] (PS1) Elukey: Set Debian Buster for druid100[2,3] [puppet] - https://gerrit.wikimedia.org/r/602016 (https://phabricator.wikimedia.org/T253980)
[08:23:03] (PS2) Elukey: Set Debian Buster for druid100[2,3] [puppet] - https://gerrit.wikimedia.org/r/602016 (https://phabricator.wikimedia.org/T253980)
[08:23:33] Operations, ops-codfw, procurement: codfw: Next Gen test rack - https://phabricator.wikimedia.org/T251570 (Dzahn) All the mw servers in rack C3 have been decom'ed. Also see T247018#6187845
[08:24:37] Operations, ops-codfw, procurement: codfw: Next Gen test rack - https://phabricator.wikimedia.org/T251570 (Dzahn)
[08:25:06] (CR) Elukey: [C: +2] Set Debian Buster for druid100[2,3] [puppet] - https://gerrit.wikimedia.org/r/602016 (https://phabricator.wikimedia.org/T253980) (owner: Elukey)
[08:26:28] (CR) Muehlenhoff: [C: +1] "Looks good to me, this seems totally preferable in terms of flexibility over the shared class. There's no Java to bind them all in analyti" [puppet] - https://gerrit.wikimedia.org/r/602009 (owner: Elukey)
[08:27:13] (CR) Dzahn: [C: +2] microsites::design: add blog subdirectory [puppet] - https://gerrit.wikimedia.org/r/601344 (https://phabricator.wikimedia.org/T254118) (owner: Dzahn)
[08:28:59] (CR) Elukey: Deprecate profile::java::analytics in favor of profile::java (1 comment) [puppet] - https://gerrit.wikimedia.org/r/602009 (owner: Elukey)
[08:32:02] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:33:13] (CR) Filippo Giunchedi: "LGTM modulo what Cole said" [puppet] - https://gerrit.wikimedia.org/r/601836 (owner: Herron)
[08:33:46] (PS6) Elukey: Deprecate profile::java::analytics in favor of profile::java [puppet] - https://gerrit.wikimedia.org/r/602009
[08:35:03] (PS1) Dzahn: microsites::design: add git clone for blog site [puppet] - https://gerrit.wikimedia.org/r/602018 (https://phabricator.wikimedia.org/T254118)
[08:35:08] (CR) Filippo Giunchedi: [C: +2] icinga: delete unreferenced contact groups [puppet] - https://gerrit.wikimedia.org/r/601672 (https://phabricator.wikimedia.org/T254006) (owner: Filippo Giunchedi)
[08:35:20] (PS1) Kormat: db-codfw.php: Replace pc2009 with pc2010 while reimaging [mediawiki-config]
- 10https://gerrit.wikimedia.org/r/602019 (https://phabricator.wikimedia.org/T252182) [08:36:10] (03PS2) 10Dzahn: microsites::design: add git clone for blog site [puppet] - 10https://gerrit.wikimedia.org/r/602018 (https://phabricator.wikimedia.org/T254118) [08:36:31] (03CR) 10Marostegui: [C: 03+1] db-codfw.php: Replace pc2009 with pc2010 while reimaging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602019 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [08:36:40] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:36:51] (03CR) 10Marostegui: [C: 03+1] install_server: Allow reimage of pc2009 [puppet] - 10https://gerrit.wikimedia.org/r/602014 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [08:37:04] (03CR) 10Kormat: [C: 03+2] install_server: Allow reimage of pc2009 [puppet] - 10https://gerrit.wikimedia.org/r/602014 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [08:37:20] (03CR) 10Kormat: [C: 03+2] db-codfw.php: Replace pc2009 with pc2010 while reimaging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602019 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [08:37:46] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: enable Thanos upload for analytics [puppet] - 10https://gerrit.wikimedia.org/r/601326 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [08:38:06] (03Merged) 10jenkins-bot: db-codfw.php: Replace pc2009 with pc2010 while reimaging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602019 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [08:38:18] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/22948/miscweb1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/602018 (https://phabricator.wikimedia.org/T254118) (owner: 10Dzahn) [08:42:13] 
!log kormat@deploy1001 Synchronized wmf-config/db-codfw.php: Replace pc2009 with pc2010 while reimaging (duration: 01m 16s) [08:42:15] 10Operations, 10Gerrit: Initial backup run for Gerrit LFS data - https://phabricator.wikimedia.org/T254162 (10jcrespo) a:05jcrespo→03QChris Recovery took 20 minutes, I chose the full one we took for testing yesterday, but remember we have hourly backups: https://grafana.wikimedia.org/d/413r2vbWk/bacula?org... [08:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:36] (03PS1) 10DannyS712: Remove `wgCommentTableSchemaMigrationStage` settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602022 [08:44:05] (03CR) 10Elukey: "New pcc: https://puppet-compiler.wmflabs.org/compiler1002/22947/" [puppet] - 10https://gerrit.wikimedia.org/r/602009 (owner: 10Elukey) [08:44:21] (03CR) 10Muehlenhoff: [C: 03+1] Deprecate profile::java::analytics in favor of profile::java [puppet] - 10https://gerrit.wikimedia.org/r/602009 (owner: 10Elukey) [08:49:10] PROBLEM - Prometheus prometheus2004/analytics restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1:9905 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/analytics [08:49:51] (03PS2) 10DannyS712: Remove obsolete schema migration settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602022 [08:50:22] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): WDQS Categories update lag alert - https://phabricator.wikimedia.org/T246497 (10Gehel) 05Open→03Resolved [08:50:30] PROBLEM - Prometheus prometheus1004/analytics restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1:9905 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted 
https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/analytics [08:52:34] (03PS1) 10Marostegui: mariadb: Reimage db1081 to Buster and stretch [puppet] - 10https://gerrit.wikimedia.org/r/602026 (https://phabricator.wikimedia.org/T250666) [08:53:56] PROBLEM - MariaDB Slave SQL: pc3 on pc2009 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [08:54:05] kormat: ^ [08:54:16] PROBLEM - MariaDB Slave IO: pc3 on pc2009 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [08:54:59] (03CR) 10Kormat: [C: 04-1] mariadb: Reimage db1081 to Buster and stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/602026 (https://phabricator.wikimedia.org/T250666) (owner: 10Marostegui) [08:55:38] (03PS2) 10Marostegui: mariadb: Reimage db1080 to Buster and stretch [puppet] - 10https://gerrit.wikimedia.org/r/602026 (https://phabricator.wikimedia.org/T250666) [08:56:02] kormat: pc2009 isn't downtimed and it has notifications enabled [08:56:08] Did puppet run on icinga? [08:56:09] let's seeee [08:56:11] marostegui: it did [08:56:22] i double-checked that before stopping mariadb [08:56:37] and the change went thru? [08:57:00] it didn't show up in the web u/i [08:57:01] another case of https://phabricator.wikimedia.org/T251407 ? [08:57:36] yep :/ [08:57:42] (03PS6) 10Muehlenhoff: Ship the sysusers default config via systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/601743 (https://phabricator.wikimedia.org/T235162) [08:58:44] 10Operations, 10Icinga, 10observability: Icinga notifications didn't get applied after a puppet run - https://phabricator.wikimedia.org/T251407 (10Kormat) This just happened again with https://gerrit.wikimedia.org/r/c/operations/puppet/+/602014 I manually ran puppet on the target host, and confirmed that pu... 
[09:00:43] (03CR) 10Kormat: [C: 03+1] mariadb: Reimage db1080 to Buster and stretch [puppet] - 10https://gerrit.wikimedia.org/r/602026 (https://phabricator.wikimedia.org/T250666) (owner: 10Marostegui) [09:01:01] (03CR) 10Marostegui: [C: 03+2] mariadb: Reimage db1080 to Buster and stretch [puppet] - 10https://gerrit.wikimedia.org/r/602026 (https://phabricator.wikimedia.org/T250666) (owner: 10Marostegui) [09:01:35] marostegui: oh wow. the web u/i now shows notifications are disabled for _some_ of the services, but not all. whyyy. [09:01:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1080 for reimage', diff saved to https://phabricator.wikimedia.org/P11376 and previous config saved to /var/cache/conftool/dbconfig/20200603-090143-marostegui.json [09:01:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:52] kormat: that is also what that ticket reflected :( [09:02:14] https://phabricator.wikimedia.org/P11083#63751 [09:02:16] :( [09:03:12] 10Operations, 10Icinga, 10observability: Icinga notifications didn't get applied after a puppet run - https://phabricator.wikimedia.org/T251407 (10Kormat) The web u/i has finally updated to show that _some_ services have notifications disabled, but not all. 
{F31851959} [09:03:49] marostegui: i think it's time to rename the puppet value to `profile::base::notifications: maybe-disabled` [09:04:01] hahaha [09:04:19] !log Reimage db1080 [09:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:31] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [09:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:01] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:29] !log upgrading mw1262-1265 to PHP 7.2.31 [09:10:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:37] you will have to run puppet more than once on icinga [09:10:45] the second run should remove the rest [09:10:57] because of exported resources afaict [09:11:18] RECOVERY - Prometheus prometheus2004/analytics restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/analytics [09:12:09] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/22950/" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/598415 (https://phabricator.wikimedia.org/T97972) (owner: 10Giuseppe Lavagetto) [09:12:21] (03PS7) 10Giuseppe Lavagetto: profile::conftool::client: allow overriding the user root can access [puppet] - 10https://gerrit.wikimedia.org/r/598415 (https://phabricator.wikimedia.org/T97972) [09:13:03] mutante: it's ran a second time now, but the problem still persists [09:13:30] kormat: some services are executed via NRPE on the monitored hosts and some on the Icinga server itself. is that the difference which are disabled and which are not by any chance? 
[09:13:47] s/services/service checks [09:14:06] i have no clue which are which [09:14:32] f.e. stuff like DISK space is via NRPE because you can't query it from external [09:14:54] RECOVERY - Prometheus prometheus1004/analytics restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/analytics [09:14:55] or anything that is a process check [09:15:39] because of exported resources you might need a puppet run first on the monitored host and then on icinga itself [09:16:05] i did that this time [09:17:32] i can't find pc2009 in icinga, wasnt it that one? [09:17:49] https://cas-icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=pc2009 it's there [09:18:08] ah, but, it's currently reimaging [09:18:25] so.. maybe the page refreshes for me because i already had it open v0v [09:18:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [09:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:03] kormat: i don't see it on that link either. eh.. is it possible this is icinga1001 vs icinga2001 [09:19:07] 10Operations, 10ops-eqiad: (Need by: TDB) rack/setup/install cloudelastic100[56] - https://phabricator.wikimedia.org/T249062 (10elukey) ` 09:09:53 | druid1002.eqiad.wmnet | WARNING: unable to verify that BIOS boot parameters are back to normal, got: Boot parameter version: 1 Boot parameter 5 is valid/unlocked... [09:21:06] mutante: i'm not sure what you mean. icinga1001 is what i've been logged into [09:21:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:06] ah, to clarify - i didn't manually run puppet on icinga1001. 
i ran it manually on pc2009, and logged into icinga1001, and saw it was already running. so maybe it was a race [09:22:27] you're saying that the procedure is to run it on the target host, then run it twice on icinga1001, and then maybe it will work [09:22:31] i notice puppet is disabled on icinga2001 fwiw [09:22:57] kormat: yes, if it was already running it almost certainly needs another one which starts after the puppet run happened on the monitored machine [09:23:16] puppet runs on icinga can take quite a while [09:23:19] sigh, ok. [09:23:24] well, i can try this next time. [09:23:37] ack [09:24:30] (03CR) 10Jbond: Ship the sysusers default config via systemd::sysuser (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/601743 (https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff) [09:24:57] jbond42: still testing "cas enabled vhost" stuff on icinga2001 ? [09:24:57] icinga's the gift that keeps on giving! [09:25:20] well.. it's exported resources and puppet though [09:25:31] replacing the tool doesn't necessarily fix it [09:25:56] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [09:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:00] godog: it's by far the worst piece of technology i've had to interact with here [09:26:06] mutante: we had a job that used to create one log file per wiki in /var/log/mediawiki/cirrus-suggest, these were treated by logrotate. with profile::mediawiki::periodic_job all these logs are now in a single log file. is this OK or should the script still be logging to its own log files? [09:26:58] mutante: no sorry that can be reenabled [09:27:01] i'll do it now [09:27:51] it's not icinga's fault that we need multiple puppet runs to update its config.. shrug [09:27:56] jbond42: thanks [09:28:14] mutante: this one thing might not be icinga's fault. 
but pretty much everything else about it is :P [09:28:28] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:29] (03PS1) 10Elukey: Add zookeeper overrides for druid1002 [puppet] - 10https://gerrit.wikimedia.org/r/602035 (https://phabricator.wikimedia.org/T253980) [09:29:30] dcausse: i guess it's ok unless the log file becomes too large. as far as i know logrotate config should not be affected by the switch because the logs are written to the same location before and after [09:30:00] (03CR) 10Jbond: Ship the sysusers default config via systemd::sysuser (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/601743 (https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff) [09:30:24] (03CR) 10Elukey: [C: 03+2] Add zookeeper overrides for druid1002 [puppet] - 10https://gerrit.wikimedia.org/r/602035 (https://phabricator.wikimedia.org/T253980) (owner: 10Elukey) [09:31:28] (03PS1) 10Marostegui: db1081: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/602036 [09:31:28] kormat: pc2009 is now back with PENDING checks and all notifications disabled [09:31:53] (and downtime) [09:33:20] hurrah [09:33:50] kormat: yeah fair enough [09:34:01] mutante: I think when switching we simply started to use stdout/stderr. looking at other job log files it does not seem too big, so I'll keep it this way, thanks! 
[09:35:02] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::conftool::client: allow overriding the user root can access [puppet] - 10https://gerrit.wikimedia.org/r/598415 (https://phabricator.wikimedia.org/T97972) (owner: 10Giuseppe Lavagetto) [09:35:17] (03CR) 10Marostegui: [C: 03+2] db1081: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/602036 (owner: 10Marostegui) [09:35:22] dcausse: if it's ok for you that that sounds good, yep [09:35:23] (03CR) 10Marostegui: db1081: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/602036 (owner: 10Marostegui) [09:35:58] (03PS2) 10Marostegui: db1080: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/602036 [09:37:41] (03CR) 10Marostegui: [C: 03+2] db1080: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/602036 (owner: 10Marostegui) [09:38:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1080', diff saved to https://phabricator.wikimedia.org/P11378 and previous config saved to /var/cache/conftool/dbconfig/20200603-093810-marostegui.json [09:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:32] 10Operations, 10Icinga, 10observability: Icinga notifications didn't get applied after a puppet run - https://phabricator.wikimedia.org/T251407 (10Dzahn) per chat on IRC: puppet was already running on icinga1001 when the puppet run on the monitored host was started. a little while later pc2009 was removed... [09:40:19] 10Operations, 10Analytics, 10netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10faidon) [09:40:28] 10Operations, 10Gerrit: Initial backup run for Gerrit LFS data - https://phabricator.wikimedia.org/T254162 (10Dzahn) >>! In T254162#6184536, @jcrespo wrote: > Bacula has its own consistency model that may not be compatible with git consistency model, I want to put that to the test/enquire if lvm snapshoting wo... 
[09:47:18] (03PS1) 10Kormat: Revert "install_server: Allow reimage of pc2009" [puppet] - 10https://gerrit.wikimedia.org/r/602037 [09:47:29] (03CR) 10jerkins-bot: [V: 04-1] Revert "install_server: Allow reimage of pc2009" [puppet] - 10https://gerrit.wikimedia.org/r/602037 (owner: 10Kormat) [09:48:07] (03PS7) 10Muehlenhoff: Ship the sysusers default config via systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/601743 (https://phabricator.wikimedia.org/T235162) [09:49:30] (03Abandoned) 10Kormat: Revert "install_server: Allow reimage of pc2009" [puppet] - 10https://gerrit.wikimedia.org/r/602037 (owner: 10Kormat) [09:49:56] (03PS2) 10DCausse: [cirrus] prepend the wiki id in comp suggest logs [puppet] - 10https://gerrit.wikimedia.org/r/602015 [09:51:56] (03PS1) 10Kormat: mariadb: Enable notifications for db2009 [puppet] - 10https://gerrit.wikimedia.org/r/602038 (https://phabricator.wikimedia.org/T252182) [09:58:39] (03PS1) 10Giuseppe Lavagetto: profile::etcd::tlsproxy: add testserver ACLs [puppet] - 10https://gerrit.wikimedia.org/r/602039 [09:59:39] (03CR) 10Ammarpad: [C: 03+1] Use AddFooterLink hook for code of conduct and contact links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596277 (https://phabricator.wikimedia.org/T251817) (owner: 10Jdlrobson) [10:01:16] (03CR) 10Jbond: [C: 03+2] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/568857 (owner: 10Legoktm) [10:01:39] (03PS2) 10Jbond: admin: Fix data_test.py on Python 3.9+ [puppet] - 10https://gerrit.wikimedia.org/r/568857 (owner: 10Legoktm) [10:04:24] (03PS1) 10Dzahn: microsites::design: move repo dir out of design dir [puppet] - 10https://gerrit.wikimedia.org/r/602040 (https://phabricator.wikimedia.org/T254118) [10:04:26] (03CR) 10Ammarpad: [C: 03+1] Enable talk pages on Swedish Minerva [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601843 (https://phabricator.wikimedia.org/T253985) (owner: 10Jdlrobson) [10:05:29] (03PS2) 10Dzahn: microsites::design: move blog repo dir out of 
design dir [puppet] - 10https://gerrit.wikimedia.org/r/602040 (https://phabricator.wikimedia.org/T254118) [10:05:47] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/22951/conf1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/602039 (owner: 10Giuseppe Lavagetto) [10:07:05] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/601743 (https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff) [10:07:54] <_joe_> please everyone stop merging patches [10:07:57] <_joe_> for puppet [10:08:19] ok, was about to. stopped [10:08:34] <_joe_> jbond42: I think I need your assistance [10:09:09] hi _joe_ [10:09:36] <_joe_> https://phabricator.wikimedia.org/P11379 [10:09:45] <_joe_> this happened when I merged on puppetmaster1001 [10:09:57] looking [10:10:01] <_joe_> and sadly I cannot re-run merge from 2001 anymore to fix the issue [10:10:55] <_joe_> I can confirm the other puppetmasters are not updated [10:11:56] one sec i can fix [10:12:27] (03PS4) 10Lucas Werkmeister (WMDE): DNM: Enable Data Bridge on Test Wikidata clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595542 (https://phabricator.wikimedia.org/T232584) [10:12:29] (03PS4) 10Lucas Werkmeister (WMDE): DNM: Enable Data Bridge on Catalan Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595543 (https://phabricator.wikimedia.org/T232584) [10:12:32] <_joe_> how do you plan to fix this? just reset --hard HEAD~1 on 1001? [10:13:15] run /usr/local/bin/puppet-merge.py --ops 241409a74069699dec856b7e33145d469a4686c2 [10:14:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1080', diff saved to https://phabricator.wikimedia.org/P11380 and previous config saved to /var/cache/conftool/dbconfig/20200603-101426-marostegui.json [10:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:02] _joe_: should be fixed now. [10:16:07] <_joe_> jbond42: thanks! 
[10:16:17] i'm curious if it was this issue https://phabricator.wikimedia.org/T251104 [10:17:06] 10Operations, 10Analytics, 10Traffic: Compare logs produced by atskfafka with those produced by varnishkafka - https://phabricator.wikimedia.org/T254317 (10elukey) Some extra context: Ema added prometheus monitoring for ATSKafka in https://grafana.wikimedia.org/d/1EUhPpzMz/atskafka?orgId=1, and the cp3050's... [10:19:50] <_joe_> jbond42: nope [10:23:04] _joe_: hmm, strange. seems like a similar end result, and i noticed "problems merging LABS", so perhaps a similar root cause. i have a PS out for the above issue; i'll have a poke and see if i can tie this to that [10:23:31] <_joe_> jbond42: anyways, thanks for the quick resolution [10:23:33] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (10jcrespo) 05Open→03Resolved a:05jcrespo→03Jclark-ctr db1140 has been repopulated from dbprov snapshots of s1 and s6 and upgraded to 10.4. It has been added to tendril and zarcil... [10:23:37] np [10:25:45] _joe_: there wasn't by chance more output? looks like puppet-merge failed when trying to merge the private repo, and i would have expected some type of error [10:27:44] (03PS1) 10Marostegui: mariadb: Move db1091 from s4 to s1 [puppet] - 10https://gerrit.wikimedia.org/r/602042 (https://phabricator.wikimedia.org/T252512) [10:28:47] (03PS3) 10Dzahn: microsites::design: move blog repo dir out of design dir [puppet] - 10https://gerrit.wikimedia.org/r/602040 (https://phabricator.wikimedia.org/T254118) [10:29:18] 10Operations, 10netops: Zayo link eqiad-codfw (OGYX/120003//ZYO) down (May 2020) - https://phabricator.wikimedia.org/T253610 (10ayounsi) And TTN-0004131048 for the record, currently down. [10:30:31] (03PS5) 10Kormat: install_server: Allow reuse of partitions during reimage. 
[WIP] [puppet] - 10https://gerrit.wikimedia.org/r/601761 (https://phabricator.wikimedia.org/T252027) [10:31:21] (03PS2) 10Kormat: mariadb: Enable notifications for db2009 [puppet] - 10https://gerrit.wikimedia.org/r/602038 (https://phabricator.wikimedia.org/T252182) [10:32:13] <_joe_> jbond42: i'll check, but no iirc [10:32:49] (03CR) 10Muehlenhoff: [C: 03+2] Ship the sysusers default config via systemd::sysuser (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/601743 (https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff) [10:34:30] ack thanks _joe_ [10:35:52] (03CR) 10Jbond: monitoring: add data types to monitoring::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/551882 (owner: 10Dzahn) [10:37:06] (03PS1) 10Jcrespo: Revert "db1140: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/602044 (https://phabricator.wikimedia.org/T250602) [10:37:31] jouncebot: next [10:37:31] In 0 hour(s) and 22 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200603T1100) [10:37:38] (03CR) 10jerkins-bot: [V: 04-1] Revert "db1140: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/602044 (https://phabricator.wikimedia.org/T250602) (owner: 10Jcrespo) [10:38:52] (03PS2) 10Jcrespo: Revert "db1140: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/602044 (https://phabricator.wikimedia.org/T250602) [10:40:30] (03CR) 10Kormat: [C: 04-1] "The commit message should mention that you're also reimaging it to buster." 
[puppet] - 10https://gerrit.wikimedia.org/r/602042 (https://phabricator.wikimedia.org/T252512) (owner: 10Marostegui) [10:40:44] (03CR) 10Jbond: [C: 03+2] "LGTM PCC is noop" [puppet] - 10https://gerrit.wikimedia.org/r/601871 (owner: 10Alex Monk) [10:41:33] (03PS3) 10MarcoAurelio: [eswiki] Normalize talk namespaces for Anexo, Portal and Wikiproyecto [mediawiki-config] - 10https://gerrit.wikimedia.org/r/600979 (https://phabricator.wikimedia.org/T254077) [10:41:42] (03PS2) 10Marostegui: mariadb: Move db1091 from s4 to s1 [puppet] - 10https://gerrit.wikimedia.org/r/602042 (https://phabricator.wikimedia.org/T252512) [10:42:03] (03CR) 10Jcrespo: [C: 03+2] Revert "db1140: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/602044 (https://phabricator.wikimedia.org/T250602) (owner: 10Jcrespo) [10:42:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1091 - will be reimaged and moved to s1 T252512', diff saved to https://phabricator.wikimedia.org/P11381 and previous config saved to /var/cache/conftool/dbconfig/20200603-104251-marostegui.json [10:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:55] T252512: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 [10:44:52] <_joe_> jbond42: ok so, it failed to fetch changes for labs/private for some reason [10:46:17] (03CR) 10Dzahn: monitoring: add data types to monitoring::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/551882 (owner: 10Dzahn) [10:47:13] (03PS4) 10Dzahn: microsites::design: move blog repo dir out of design dir [puppet] - 10https://gerrit.wikimedia.org/r/602040 (https://phabricator.wikimedia.org/T254118) [10:47:47] _joe_: ack, thanks. i'll look at my patch; the prod merge should still continue even if there are failures in the labs merge [10:48:29] (03CR) 10Gilles: [V: 03+2 C: 03+2] engine.imagemagick: Catch pyexiv2.ExifValueError also [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/600400 
(https://phabricator.wikimedia.org/T193326) (owner: 10AntiCompositeNumber) [10:50:35] (03CR) 10Jbond: monitoring: add data types to monitoring::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/551882 (owner: 10Dzahn) [10:51:34] (03PS1) 10Muehlenhoff: systemd-sysuser: Use /bin/systemd-sysusers and skip config on jessie [puppet] - 10https://gerrit.wikimedia.org/r/602046 (https://phabricator.wikimedia.org/T235162) [10:55:19] systemd-sysuser :O [10:56:01] gerrit down? [10:56:33] (03PS1) 10Giuseppe Lavagetto: profile::conftool::client: only use root on cumin*, puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/602047 (https://phabricator.wikimedia.org/T97972) [10:56:36] back for me [10:56:45] I got a 502 Proxy Error [10:56:49] but yes, back now [10:57:17] legoktm: there's always https://lwn.net/Articles/822066/ :-) [10:58:00] 10Operations, 10Traffic, 10conftool, 10Patch-For-Review, and 2 others: Figure out a security model for etcd - https://phabricator.wikimedia.org/T97972 (10Joe) My tests went fine: - `mwdebug*` servers got the datacenter-appropriate `pool-$dc-testserver` user - the deployment servers got the `conftool` user... [10:58:20] legoktm: go to sleep man :) [10:58:26] (03CR) 10Jcrespo: "The test is ok, but the blind killing if the transfer fails I think is a bad idea. One of the reasons why transfer can fail is because the" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598560 (https://phabricator.wikimedia.org/T252950) (owner: 10Privacybatm) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200603T1100). [11:00:04] DannyS712 and hauskatze: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:10] legoktm: it's quite nice, it provides a common place to define system users which is better than "oh, that got added in some random postinst". for our Puppet you can use systemd::sysuser, and to ship it as part of a deb it can be installed into /usr/lib/sysusers.d [11:00:12] Meow [11:00:53] hauskatze: shhhh. I'm actually being productive! [11:01:56] (03CR) 10Jcrespo: "Keep the error message but please instead of a try catch, let's test for a single ':' appearing on both strings with an if without an else" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598562 (https://phabricator.wikimedia.org/T253560) (owner: 10Privacybatm) [11:02:15] moritzm: hah, I remember reading that someone reimplemented a non-systemd version of it too. It's on my list to look into, one of my packages is using adduser in postinst [11:02:24] , which I'd like to get rid of [11:03:13] there's https://packages.qa.debian.org/o/opensysusers.html, but I don't think it makes any sense [11:03:40] systemd upstream has committed to making sure that systemd-sysusers and systemd-tmpfiles can be used without systemd being PID 1 [11:04:19] maybe we should instead be discussing why you haven't switched to the new tracker.d.o :) [11:04:29] I'm here [11:04:54] any SWATter around? 
I don’t have time for the full hour (lunch + meeting) :/ [11:04:55] it doesn't display changelogs which is a huge regression for me [11:05:00] though I could try if there’s nobody else [11:05:25] I poked Urbanec--m elsewhere but he seems to be away [11:05:35] @Lucas_WMDE my change is just removing obsolete settings that aren't used and should have no visible impact [11:05:38] !log installing pango updates for buster [11:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:42] but thats not urgent at all [11:05:50] My patch ain't urgent either [11:06:46] (my IRC client is being super laggy, lemme restart it) [11:07:42] (03CR) 10Jcrespo: [C: 03+1] "Let's do the below typo fix and then I will merge all this." (031 comment) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598295 (https://phabricator.wikimedia.org/T253219) (owner: 10Privacybatm) [11:08:51] !log Add rev_id to revision table on db2124 - T238966 [11:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:55] T238966: Apply updates for MCR, actor migration, and content migration, to production wikis. 
- https://phabricator.wikimedia.org/T238966 [11:10:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2124', diff saved to https://phabricator.wikimedia.org/P11382 and previous config saved to /var/cache/conftool/dbconfig/20200603-111055-marostegui.json [11:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:28] !log remove unused logical-systems from all MX204 routers - T247073 [11:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:31] T247073: Configure management-instance on routers with Junos > 17.3 - https://phabricator.wikimedia.org/T247073 [11:13:06] (03CR) 10Dzahn: [C: 03+2] microsites::design: move blog repo dir out of design dir [puppet] - 10https://gerrit.wikimedia.org/r/602040 (https://phabricator.wikimedia.org/T254118) (owner: 10Dzahn) [11:13:28] (03CR) 10Jcrespo: "for example, we can check that target.count(':') == 1, but also that the server string should be non-empty. I think the path could be empt" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598562 (https://phabricator.wikimedia.org/T253560) (owner: 10Privacybatm) [11:15:13] !log installing python-oslo.utils security updates [11:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:35] hauskatze: let’s go ahead with your change? [11:19:48] does it need a maintenance script to be run afterwards? [11:19:52] !log configure management-instance on cr1-eqsin - T247073 [11:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:56] T247073: Configure management-instance on routers with Junos > 17.3 - https://phabricator.wikimedia.org/T247073 [11:19:59] ah yeah, task mentions namespacesDupes.php [11:20:08] Lucas_WMDE: hey [11:20:11] probably won’t hurt even if it’s not necessary (IIRC that script goes fast) [11:20:27] Lucas_WMDE: danny's change comes first right? 
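(Editorial note on the 11:13:28 review comment above.) The check Jcrespo describes, exactly one ':' in the target plus a non-empty server part, tested with a plain if rather than a try/catch, could look roughly like the sketch below. The function name split_target and the "host:path" reading of the target string are assumptions made for illustration; this is not the actual wmfmariadbpy/transferpy code. The log truncates the remark about whether the path may be empty, so the sketch leaves the path unchecked.

```python
from typing import Tuple


def split_target(target: str) -> Tuple[str, str]:
    """Split a 'host:path' string into (host, path).

    Hypothetical helper, not the real transferpy implementation.
    """
    # Explicit test for a single ':' instead of relying on an exception
    # raised by unpacking the result of split().
    if target.count(":") != 1:
        raise ValueError(f"expected exactly one ':' in target, got {target!r}")
    host, path = target.split(":")
    # The server part must be non-empty; the path is deliberately not checked.
    if not host:
        raise ValueError(f"empty host in target {target!r}")
    return host, path
```

For example, split_target("db1001.eqiad.wmnet:/srv/sqldata") yields the host and path as a pair, while "nohost" or ":/srv/sqldata" is rejected with a ValueError that carries the error message.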
[11:20:40] hauskatze: well yours seems more important, so I’d pull that forwards :) [11:20:51] also for Danny’s change I’d want to review if the settings are actually unused [11:20:55] which I’m not sure I have time for [11:21:00] Lucas_WMDE: alright then [11:21:07] sorry DannyS712 [11:21:15] Lucas_WMDE: yes please I'd like namespaceDupes to be run after the change is deployed [11:21:23] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/600979 (https://phabricator.wikimedia.org/T254077) (owner: 10MarcoAurelio) [11:21:34] first dry-run, then with --fix [11:22:17] ok [11:22:40] (03Merged) 10jenkins-bot: [eswiki] Normalize talk namespaces for Anexo, Portal and Wikiproyecto [mediawiki-config] - 10https://gerrit.wikimedia.org/r/600979 (https://phabricator.wikimedia.org/T254077) (owner: 10MarcoAurelio) [11:23:19] change is on mwdebug1001 [11:23:33] !log configure management-instance on cr1-codfw - T247073 [11:23:34] I'll see if it can be sort of tested [11:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:50] seems to work [11:23:59] I just went to a random Portal page and looked at the talk link [11:24:07] and it’s also not blowing up [11:24:11] that’s good enough for me ^^ [11:24:20] but I’ll wait if you have anything else to say [11:24:34] I was doing the same and no errors for me either [11:24:49] no access to logstash though [11:25:04] (03PS1) 10Dzahn: base/monitoring: allow setting different contactgroup for systemd [puppet] - 10https://gerrit.wikimedia.org/r/602052 [11:25:07] * Lucas_WMDE peeks at logstash [11:25:12] !log install brltty updates on Buster [11:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:33] looks fine, I’ll sync [11:25:58] (03PS8) 10Jbond: puppetmaster: update puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/601712 (https://phabricator.wikimedia.org/T251104) [11:26:52] !log 
lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:600979|[eswiki] Normalize talk namespaces for Anexo, Portal and Wikiproyecto (T254077)]] (duration: 01m 03s) [11:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:56] T254077: Rename some es.wikipedia namespaces - https://phabricator.wikimedia.org/T254077 [11:27:05] !log installing rubygems-integration updates for Buster [11:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:49] “dry run” is just namespaceDupes.php without options, right? [11:27:53] and then --fix for the second run? [11:28:12] (03PS1) 10Arturo Borrero Gonzalez: toolforge: legacy_redirector: refresh set of allowed tools [puppet] - 10https://gerrit.wikimedia.org/r/602054 (https://phabricator.wikimedia.org/T234617) [11:28:15] (03PS1) 10Dzahn: microsites::design: use variable for repo dir when cloning blog files [puppet] - 10https://gerrit.wikimedia.org/r/602055 (https://phabricator.wikimedia.org/T254118) [11:28:37] yeah, mw.org documentation seems to agree [11:29:03] Lucas_WMDE: yup [11:29:07] (03CR) 10Jbond: [C: 03+1] systemd-sysuser: Use /bin/systemd-sysusers and skip config on jessie [puppet] - 10https://gerrit.wikimedia.org/r/602046 (https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff) [11:29:29] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes.php eswiki | tee T254077.dry-run [11:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:23] (03PS2) 10Arturo Borrero Gonzalez: toolforge: legacy_redirector: refresh set of allowed tools [puppet] - 10https://gerrit.wikimedia.org/r/602054 (https://phabricator.wikimedia.org/T234617) [11:30:46] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes.php eswiki --fix | tee T254077.fix [11:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:20] (03CR) 10Dzahn: [C: 03+2] 
microsites::design: use variable for repo dir when cloning blog files [puppet] - 10https://gerrit.wikimedia.org/r/602055 (https://phabricator.wikimedia.org/T254118) (owner: 10Dzahn) [11:31:40] I’ll check what that page is later [11:31:47] but for now I have to close the SWAT, meeting [11:31:52] (I’ll keep an eye on the error log) [11:31:57] !log EU SWAT done [11:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:20] Thanks Lucas_WMDE [11:32:26] Just in time for my lunch [11:40:01] (03PS1) 10Dzahn: microsites::design: add httpd alias and config for blog site [puppet] - 10https://gerrit.wikimedia.org/r/602057 (https://phabricator.wikimedia.org/T254118) [11:40:47] (03CR) 10Ema: [C: 03+2] Add wikimedia server_alias for api.wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/601808 (https://phabricator.wikimedia.org/T254185) (owner: 10Reedy) [11:43:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1080', diff saved to https://phabricator.wikimedia.org/P11383 and previous config saved to /var/cache/conftool/dbconfig/20200603-114351-marostegui.json [11:43:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2124 after MCR schema change', diff saved to https://phabricator.wikimedia.org/P11384 and previous config saved to /var/cache/conftool/dbconfig/20200603-114409-marostegui.json [11:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:29] 10Operations, 10Analytics, 10netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10JAllemandou) Some more on the Druid aspect of things: - We have used multi-value dimensions in Druid without problem - Data needs to be an array and that's it. - We h... 
[11:51:45] !log configure management-instance on cr2-codfw - T247073 [11:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:48] T247073: Configure management-instance on routers with Junos > 17.3 - https://phabricator.wikimedia.org/T247073 [11:56:44] !log configure management-instance on cr1/2-eqiad - T247073 [11:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:44] !log updating linux-libc-dev on stretch and buster hosts [11:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:01] (03CR) 10Muehlenhoff: [C: 03+2] systemd-sysuser: Use /bin/systemd-sysusers and skip config on jessie [puppet] - 10https://gerrit.wikimedia.org/r/602046 (https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff) [12:01:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1080', diff saved to https://phabricator.wikimedia.org/P11385 and previous config saved to /var/cache/conftool/dbconfig/20200603-120136-marostegui.json [12:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:46] (03PS1) 10JMeybohm: eventstreams: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/602060 (https://phabricator.wikimedia.org/T253396) [12:02:23] ACKNOWLEDGEMENT - Host re0.cr1-eqiad is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi investigating - The acknowledgement expires at: 2020-06-04 12:02:07. 
[12:02:26] (03PS2) 10JMeybohm: eventstreams: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/602060 (https://phabricator.wikimedia.org/T253396) [12:03:07] (03PS1) 10JMeybohm: eventgate: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/602061 (https://phabricator.wikimedia.org/T253396) [12:04:39] 10Operations, 10Patch-For-Review, 10SRE-OnFire-Incident-Docs: Document 2020-02-04 kartotherian incident - https://phabricator.wikimedia.org/T244278 (10jbond) Is this a duplicate of https://phabricator.wikimedia.org/T251340 ? [12:08:10] (03CR) 10Ayounsi: [C: 03+2] Configure management-instance on routers [homer/public] - 10https://gerrit.wikimedia.org/r/592920 (https://phabricator.wikimedia.org/T247073) (owner: 10Ayounsi) [12:08:41] (03PS1) 10Muehlenhoff: Update example to reflect the new typed arguments [puppet] - 10https://gerrit.wikimedia.org/r/602063 [12:10:50] 10Operations, 10netops, 10Patch-For-Review, 10Sustainability (Incident Prevention): Juniper HA audit - https://phabricator.wikimedia.org/T191667 (10ayounsi) [12:10:53] 10Operations, 10netops, 10Patch-For-Review: Configure management-instance on routers with Junos > 17.3 - https://phabricator.wikimedia.org/T247073 (10ayounsi) 05Open→03Resolved a:03ayounsi All done. 
[12:13:27] (03CR) 10Privacybatm: "> Patch Set 1:" (031 comment) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598560 (https://phabricator.wikimedia.org/T252950) (owner: 10Privacybatm) [12:14:52] (03CR) 10Dzahn: [C: 03+2] microsites::design: add httpd alias and config for blog site [puppet] - 10https://gerrit.wikimedia.org/r/602057 (https://phabricator.wikimedia.org/T254118) (owner: 10Dzahn) [12:16:09] (03PS8) 10Privacybatm: Write documentation using Sphinx [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598295 (https://phabricator.wikimedia.org/T253219) [12:17:49] (03CR) 10Privacybatm: Write documentation using Sphinx (031 comment) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598295 (https://phabricator.wikimedia.org/T253219) (owner: 10Privacybatm) [12:19:29] (03PS1) 10Muehlenhoff: Enable managed adduser config for codfw, take three [puppet] - 10https://gerrit.wikimedia.org/r/602067 (https://phabricator.wikimedia.org/T235162) [12:19:52] (03CR) 10Jbond: [C: 03+1] Update example to reflect the new typed arguments [puppet] - 10https://gerrit.wikimedia.org/r/602063 (owner: 10Muehlenhoff) [12:23:05] 10Operations, 10WMF-Design, 10Design, 10Patch-For-Review: Create sub-directory URL for Design blog (https://design.wikimedia.org/blog) - https://phabricator.wikimedia.org/T254118 (10Dzahn) @Prtksxna The site has been setup and the content has been cloned. But you'll have to adjust the links to the CSS fil... [12:25:20] (03CR) 10Hashar: "i think there is a way to autogenerate the various API documentations instead of committing them ( example: transferpy/doc/transferpy/tran" (034 comments) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598295 (https://phabricator.wikimedia.org/T253219) (owner: 10Privacybatm) [12:28:35] (03CR) 10Dzahn: "> It's the way modules/base/manifests/monitoring/host.pp is set up. 
You can monitor systemd or not, and you can set a contact group for th" [puppet] - 10https://gerrit.wikimedia.org/r/601374 (owner: 10Bstorm) [12:32:04] (03CR) 10CDanis: profile::conftool::client: only use root on cumin*, puppetmasters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/602047 (https://phabricator.wikimedia.org/T97972) (owner: 10Giuseppe Lavagetto) [12:35:27] (03CR) 10Muehlenhoff: [C: 03+2] Enable managed adduser config for codfw, take three [puppet] - 10https://gerrit.wikimedia.org/r/602067 (https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff) [12:35:51] (03PS1) 10Jbond: apereo_cas: remove cas.ticket.tgt.timeout.maxTimeToLiveInSeconds [puppet] - 10https://gerrit.wikimedia.org/r/602069 [12:38:22] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/22955/labstore1006.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/602052 (owner: 10Dzahn) [12:39:54] (03PS2) 10Dzahn: base/monitoring: allow setting different contactgroup for systemd [puppet] - 10https://gerrit.wikimedia.org/r/602052 [12:40:16] (03CR) 10Dzahn: "see the compiler links above, it changes the contactgroup for labstore1006 but stays "admins" on a random other host and icinga1001 itself" [puppet] - 10https://gerrit.wikimedia.org/r/602052 (owner: 10Dzahn) [12:40:58] (03CR) 10Dzahn: "The change above would allow changing the contact group just for systemd." [puppet] - 10https://gerrit.wikimedia.org/r/601374 (owner: 10Bstorm) [12:43:08] (03CR) 10CDanis: "This patch reflects what systemd allows for unit names." [puppet] - 10https://gerrit.wikimedia.org/r/601460 (owner: 10CDanis) [12:45:08] (03CR) 10Hashar: [C: 04-1] "The files under transferpy/doc/transferpy do not need to be commited in. 
They can be generated on the fly using sphinx-apidoc which would " [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598295 (https://phabricator.wikimedia.org/T253219) (owner: 10Privacybatm) [12:46:02] (03PS4) 10Filippo Giunchedi: conftool: bail on confctl not found [puppet] - 10https://gerrit.wikimedia.org/r/599299 (https://phabricator.wikimedia.org/T253840) [12:46:04] (03PS4) 10Filippo Giunchedi: prometheus: use profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/599298 (https://phabricator.wikimedia.org/T253840) [12:48:18] (03PS1) 10Dzahn: mediawiki::monitoring: allow disabling opcache monitoring on parsoid::test [puppet] - 10https://gerrit.wikimedia.org/r/602071 (https://phabricator.wikimedia.org/T254025) [12:49:06] (03PS1) 10Hashar: transferpy requires paramiko [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/602072 [12:49:44] (03PS2) 10Dzahn: mediawiki::monitoring: disable opcache monitoring on parsoid::test [puppet] - 10https://gerrit.wikimedia.org/r/602071 (https://phabricator.wikimedia.org/T254025) [12:51:30] (03Abandoned) 10Hnowlan: cpjobqueue: reduce log level [deployment-charts] - 10https://gerrit.wikimedia.org/r/601776 (owner: 10Hnowlan) [12:51:37] (03CR) 10Privacybatm: "> Patch Set 3:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598562 (https://phabricator.wikimedia.org/T253560) (owner: 10Privacybatm) [12:51:42] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/22957/scandium.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/602071 (https://phabricator.wikimedia.org/T254025) (owner: 10Dzahn) [12:51:46] (03CR) 10Hnowlan: [C: 03+2] changeprop-jobqueue: enable all remaining jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/601798 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [12:52:14] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/602069 (owner: 10Jbond) [12:52:17] (03Merged) 10jenkins-bot: 
changeprop-jobqueue: enable all remaining jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/601798 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [12:56:45] (03CR) 10Dzahn: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/598990 (owner: 10Muehlenhoff) [12:56:53] (03PS2) 10Dzahn: ci: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/598990 (owner: 10Muehlenhoff) [12:57:08] (03CR) 10Privacybatm: "> Patch Set 8: Code-Review-1" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598295 (https://phabricator.wikimedia.org/T253219) (owner: 10Privacybatm) [12:57:16] 10Operations, 10observability, 10Goal: Export Prometheus-compatible JVM metrics from JVMs in production - https://phabricator.wikimedia.org/T177197 (10fgiunchedi) [12:58:17] 10Operations, 10observability, 10Goal, 10Technical-Debt, and 2 others: Reduce technical debt in metrics monitoring - https://phabricator.wikimedia.org/T177195 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Boldly resolving, the bulk of the work here has been done and the remaining tasks are low(er)... 
[12:58:37] (03CR) 10jerkins-bot: [V: 04-1] ci: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/598990 (owner: 10Muehlenhoff) [12:59:14] (03CR) 10Jbond: [C: 03+2] apereo_cas: remove cas.ticket.tgt.timeout.maxTimeToLiveInSeconds [puppet] - 10https://gerrit.wikimedia.org/r/602069 (owner: 10Jbond) [12:59:22] (03CR) 10Privacybatm: "> Patch Set 8:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598295 (https://phabricator.wikimedia.org/T253219) (owner: 10Privacybatm) [12:59:40] (03CR) 10CDanis: puppetmaster: update puppet-merge (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/601712 (https://phabricator.wikimedia.org/T251104) (owner: 10Jbond) [13:00:22] (03CR) 10CDanis: [C: 03+1] mediawiki::monitoring: disable opcache monitoring on parsoid::test [puppet] - 10https://gerrit.wikimedia.org/r/602071 (https://phabricator.wikimedia.org/T254025) (owner: 10Dzahn) [13:01:47] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . 
[13:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:30] (03CR) 10CDanis: [C: 03+1] Add check_prometheus rules for navtiming [puppet] - 10https://gerrit.wikimedia.org/r/597176 (https://phabricator.wikimedia.org/T225739) (owner: 10Dave Pifke) [13:03:15] (03CR) 10Privacybatm: transferpy requires paramiko (031 comment) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/602072 (owner: 10Hashar) [13:03:29] (03PS9) 10Jbond: puppetmaster: update puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/601712 (https://phabricator.wikimedia.org/T251104) [13:04:08] (03CR) 10Kormat: [C: 03+1] mariadb: Move db1091 from s4 to s1 [puppet] - 10https://gerrit.wikimedia.org/r/602042 (https://phabricator.wikimedia.org/T252512) (owner: 10Marostegui) [13:04:48] (03PS1) 10Kormat: Revert "db-codfw.php: Replace pc2009 with pc2010 while reimaging" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602075 [13:04:51] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:06:58] (03CR) 10Marostegui: [C: 03+1] Revert "db-codfw.php: Replace pc2009 with pc2010 while reimaging" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602075 (owner: 10Kormat) [13:07:29] 10Operations, 10Product-Analytics, 10Wikimedia-General-or-Unknown, 10Readers-Web-Backlog (Needs Product Owner Decisions), 10SEO: Yoruba Language Wikipedia not being indexed by search engines - https://phabricator.wikimedia.org/T236241 (10Formatierer) See also this discussion in german wiktionary: https:/... 
[13:08:25] (03Abandoned) 10CDanis: Revert "Revert "prepend {es,kn}ams"" [homer/public] - 10https://gerrit.wikimedia.org/r/596319 (owner: 10CDanis) [13:08:35] (03CR) 10Marostegui: [C: 03+1] mariadb: Enable notifications for db2009 [puppet] - 10https://gerrit.wikimedia.org/r/602038 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [13:08:52] (03CR) 10Kormat: [C: 03+2] Revert "db-codfw.php: Replace pc2009 with pc2010 while reimaging" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602075 (owner: 10Kormat) [13:08:56] (03CR) 10Kormat: [C: 03+2] mariadb: Enable notifications for db2009 [puppet] - 10https://gerrit.wikimedia.org/r/602038 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [13:09:38] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Replace pc2009 with pc2010 while reimaging" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602075 (owner: 10Kormat) [13:09:49] (03CR) 10Jbond: "updated thanks" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/601712 (https://phabricator.wikimedia.org/T251104) (owner: 10Jbond) [13:10:16] 10Operations, 10Cassandra, 10Goal, 10User-Eevans: Handle HBA controllers in get-raid-status-hpssacli - https://phabricator.wikimedia.org/T185216 (10fgiunchedi) [13:10:17] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:13:36] !log kormat@deploy1001 Synchronized wmf-config/db-codfw.php: Put pc2009 back into pc3 after reimaging T252182 (duration: 01m 05s) [13:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:40] T252182: Upgrade parsercache to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T252182 [13:15:21] (03CR) 10CDanis: "Stephen, there's explanatory context in the bug :)" [puppet] - 10https://gerrit.wikimedia.org/r/601454 (https://phabricator.wikimedia.org/T252605) (owner: 10CDanis) [13:15:44] cdanis: 😱 (i haven't even looked at it yet) [13:17:39] cdanis: can you give me an example host where the service is running? [13:18:07] kormat: any mediawiki appserver :) mwdebug1001.eqiad.wmnet or mw1280.eqiad.wmnet is fine [13:20:29] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [13:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:07] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:23:32] (03PS1) 10Jcrespo: mariadb: Add snapshoting capabilities to content backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/602080 (https://phabricator.wikimedia.org/T79922) [13:24:39] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:24:45] (03PS2) 10Jcrespo: mariadb: Add snapshotting capabilities to content backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/602080 (https://phabricator.wikimedia.org/T79922) [13:24:51] (03PS9) 10Privacybatm: Write documentation using Sphinx [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598295 (https://phabricator.wikimedia.org/T253219) [13:25:12] (03CR) 10jerkins-bot: [V: 04-1] Write documentation using Sphinx [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598295 (https://phabricator.wikimedia.org/T253219) (owner: 10Privacybatm) [13:29:28] (03CR) 10Kormat: [C: 03+1] textfile exporter for php-fpm worker pool status [puppet] - 10https://gerrit.wikimedia.org/r/601454 (https://phabricator.wikimedia.org/T252605) (owner: 10CDanis) [13:31:23] (03PS10) 10Privacybatm: Write documentation using Sphinx [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598295 (https://phabricator.wikimedia.org/T253219) [13:32:20] (03CR) 10Hashar: [V: 03+1 C: 03+1] "That has been running for a couple months now and seems all fine ;)" [puppet] - 10https://gerrit.wikimedia.org/r/579602 (owner: 10Thcipriani) [13:32:24] (03CR) 10Jcrespo: "I didn't focus too much on tox here because we are not going to setup this on CI, but the new transferpy repo once moved, and it will be f" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598295 (https://phabricator.wikimedia.org/T253219) (owner: 10Privacybatm) [13:33:11] (03CR) 10Privacybatm: "Thank you so much for your comments, it was helpful :-) I have incorporated the changes you mentioned." 
(032 comments) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598295 (https://phabricator.wikimedia.org/T253219) (owner: 10Privacybatm) [13:34:14] (03PS1) 10Kormat: mariadb: Allow reimage of pc1009 [puppet] - 10https://gerrit.wikimedia.org/r/602081 (https://phabricator.wikimedia.org/T252182) [13:35:47] (03CR) 10Marostegui: "No stretch-installer?" [puppet] - 10https://gerrit.wikimedia.org/r/602081 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [13:36:59] (03CR) 10Kormat: "> No stretch-installer?" [puppet] - 10https://gerrit.wikimedia.org/r/602081 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [13:37:29] (03CR) 10Marostegui: [C: 03+1] "+1 but up to you if you want to do it today or tomorrow" [puppet] - 10https://gerrit.wikimedia.org/r/602081 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [13:39:30] (03CR) 10Jcrespo: "> But is throwing the error necessary?" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598560 (https://phabricator.wikimedia.org/T252950) (owner: 10Privacybatm) [13:39:55] (03CR) 10Jcrespo: "> Patch Set 1:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598560 (https://phabricator.wikimedia.org/T252950) (owner: 10Privacybatm) [13:44:07] !log reimaging pc1007 to buster, wish me luck T252182 [13:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:11] T252182: Upgrade parsercache to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T252182 [13:45:12] (03CR) 10Privacybatm: "> > The only thing I feel strongly against is the kill. 
I agree with removing the logic, put a print('ERROR...'), and continue, so we can " [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598560 (https://phabricator.wikimedia.org/T252950) (owner: 10Privacybatm) [13:46:43] (03CR) 10Kormat: [C: 03+2] mariadb: Allow reimage of pc1009 [puppet] - 10https://gerrit.wikimedia.org/r/602081 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [13:46:46] (03CR) 10Jcrespo: "> Patch Set 3:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598562 (https://phabricator.wikimedia.org/T253560) (owner: 10Privacybatm) [13:46:48] kormat: pc1007? [13:47:03] er, crap. [13:47:05] (03CR) 10CDanis: "First pass" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/601761 (https://phabricator.wikimedia.org/T252027) (owner: 10Kormat) [13:47:34] !log reimaging *pc1009 (promise) to buster T252182 [13:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:25] kormat: buuuut it is not yet depooled, no? [13:49:10] marostegui: correct. the order is disable notifications, run puppet a million times on pc1009 and icinga, hope that icinga notices, then depool [13:49:19] XDD [13:49:27] kormat: make sure to also downtime pc2009, or it will alert [13:49:35] siigh. 
yes, thanks :) [13:50:22] (03CR) 10Jcrespo: [C: 03+2] mariadb: Add snapshotting capabilities to content backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/602080 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [13:51:01] (03PS1) 10Filippo Giunchedi: WIP thanos: add alerts [puppet] - 10https://gerrit.wikimedia.org/r/602082 (https://phabricator.wikimedia.org/T252186) [13:52:41] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 58 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:52:48] (03CR) 10jerkins-bot: [V: 04-1] WIP thanos: add alerts [puppet] - 10https://gerrit.wikimedia.org/r/602082 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [13:54:00] (03CR) 10Privacybatm: "> Patch Set 3:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598562 (https://phabricator.wikimedia.org/T253560) (owner: 10Privacybatm) [13:55:06] (03CR) 10Jcrespo: "We can skip all non-cumin based executions from documentation. They are not supported not tested that they work well, but I would not like" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/602072 (owner: 10Hashar) [13:55:08] (03CR) 10CDanis: "thanks! 
LGTM, just a couple comments" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/600928 (https://phabricator.wikimedia.org/T210818) (owner: 10Cwhite) [13:55:23] (03PS2) 10Filippo Giunchedi: WIP thanos: add alerts [puppet] - 10https://gerrit.wikimedia.org/r/602082 (https://phabricator.wikimedia.org/T252186) [13:55:54] (03PS1) 10Kormat: db-eqiad.php: Replace pc1009 with pc1010 while reimaging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602083 (https://phabricator.wikimedia.org/T252182) [13:56:42] (03CR) 10Marostegui: [C: 03+1] db-eqiad.php: Replace pc1009 with pc1010 while reimaging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602083 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [13:56:44] (03CR) 10Muehlenhoff: [C: 03+2] Update example to reflect the new typed arguments [puppet] - 10https://gerrit.wikimedia.org/r/602063 (owner: 10Muehlenhoff) [13:57:09] (03CR) 10Kormat: [C: 03+2] db-eqiad.php: Replace pc1009 with pc1010 while reimaging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602083 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [13:57:11] (03CR) 10Jcrespo: "Again, I would like not to work too much on dependencies on this repo just before we are going to move away the tree to another dedicated " [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/602072 (owner: 10Hashar) [13:57:39] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 46 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:57:55] (03Merged) 10jenkins-bot: db-eqiad.php: Replace pc1009 with pc1010 while reimaging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602083 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [13:58:04] (03PS2) 10Muehlenhoff: Enable managed adduser config fleet-wide [puppet] - 
10https://gerrit.wikimedia.org/r/599359 (https://phabricator.wikimedia.org/T235162) [13:58:36] (03CR) 10Bearloga: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/601848 (https://phabricator.wikimedia.org/T254278) (owner: 10Bearloga) [13:59:56] !log kormat@deploy1001 Synchronized wmf-config/db-eqiad.php: Replace pc1009 with pc1010 reimaging T252182 (duration: 01m 06s) [13:59:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:00] T252182: Upgrade parsercache to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T252182 [14:00:29] !log Restarted CI Jenkins for plugin update [14:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:19] (03PS2) 10Bearloga: profile::analytics::cluster::packages::statistics: Add pandoc-citeproc, libfontconfig1-dev [puppet] - 10https://gerrit.wikimedia.org/r/601848 (https://phabricator.wikimedia.org/T254278) [14:03:22] (03CR) 10jerkins-bot: [V: 04-1] profile::analytics::cluster::packages::statistics: Add pandoc-citeproc, libfontconfig1-dev [puppet] - 10https://gerrit.wikimedia.org/r/601848 (https://phabricator.wikimedia.org/T254278) (owner: 10Bearloga) [14:03:39] (03CR) 10Jcrespo: "> Could you suggest to me how? 
Shall we have an if-else like this:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598562 (https://phabricator.wikimedia.org/T253560) (owner: 10Privacybatm) [14:07:08] (03PS3) 10Bearloga: profile::analytics::cluster::packages::statistics: Add packages [puppet] - 10https://gerrit.wikimedia.org/r/601848 (https://phabricator.wikimedia.org/T254278) [14:08:03] (03CR) 10Privacybatm: "I have a doubt:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598560 (https://phabricator.wikimedia.org/T252950) (owner: 10Privacybatm) [14:09:27] (03PS3) 10Dzahn: ci: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/598990 (owner: 10Muehlenhoff) [14:10:27] (03CR) 10DCausse: [C: 03+1] "looked at EntityData on test.wikidata.org and it looked good to me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569258 (https://phabricator.wikimedia.org/T254315) (owner: 10WMDE-leszek) [14:11:13] (03CR) 10jerkins-bot: [V: 04-1] ci: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/598990 (owner: 10Muehlenhoff) [14:11:16] (03CR) 10Jcrespo: "> Patch Set 1:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598560 (https://phabricator.wikimedia.org/T252950) (owner: 10Privacybatm) [14:12:07] (03PS1) 10Ottomata: Add kafka-jumbo100[7-9] to network policy for eventgate-analytics and eventgate-analytics-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/602087 (https://phabricator.wikimedia.org/T252675) [14:15:49] PROBLEM - MariaDB Slave Lag: pc1 on pc2010 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 423.41 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [14:16:35] !log cleaning commonsrdf-dumps cron entry manually on snapshot1008 [14:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:39] apergos: ^ [14:17:01] great [14:17:53] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [14:17:55] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:56] (03CR) 10RLazarus: [C: 03+1] remove mgmt IPs for mw2150 through mw2186 [dns] - 10https://gerrit.wikimedia.org/r/602013 (https://phabricator.wikimedia.org/T247018) (owner: 10Dzahn) [14:20:27] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:06] kormat: you might want to downtime pc2010 [14:22:09] ^ [14:22:52] wait, why? [14:23:03] probably because of coldness [14:23:13] and as pc1010 got inserts...pc2010 isn't really warm [14:23:36] uff, ack. [14:25:25] well, it is more because of the topology [14:25:42] pc2010 is now receiving pc1 and pc3 updates - double the writes [14:26:05] pc1010 is presumably in the same boat [14:26:10] kormat: you should stop replication on pc1010 while it is receiving writes [14:26:24] yeah, but pc2010 cannot lag as a master [14:26:28] pc1010 [14:26:43] and pc2010 has a 40ms extra tax on every transaction [14:26:49] !log stopping replication on pc1010 [14:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:59] stop replication, you will save 50% writes [14:27:02] done [14:27:14] (03CR) 10Gehel: [C: 03+2] Remove duplication and improve clarity in role::wdqs [puppet] - 10https://gerrit.wikimedia.org/r/598884 (owner: 10EBernhardson) [14:27:23] (03PS8) 10Gehel: Remove duplication and improve clarity in role::wdqs [puppet] - 10https://gerrit.wikimedia.org/r/598884 (owner: 10EBernhardson) [14:27:52] (03PS4) 10Dzahn: ci: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/598990 (owner: 10Muehlenhoff) [14:35:01] (03PS2) 10Reedy: Make wikimania.wikimedia.org redirect to mobile site [puppet] - 10https://gerrit.wikimedia.org/r/601923 [14:37:58] (03PS1) 10Giuseppe Lavagetto: grafana: disallow access to /avatars/ [puppet] - 10https://gerrit.wikimedia.org/r/602091 [14:40:11] (03PS2) 10Privacybatm: Firewall.py: Add 
function to kill process by its port number [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598560 (https://phabricator.wikimedia.org/T252950) [14:40:13] (03CR) 10Elukey: [C: 03+2] profile::analytics::cluster::packages::statistics: Add packages [puppet] - 10https://gerrit.wikimedia.org/r/601848 (https://phabricator.wikimedia.org/T254278) (owner: 10Bearloga) [14:41:09] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1001/22963/grafana1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/602091 (owner: 10Giuseppe Lavagetto) [14:46:14] (03PS4) 10Gehel: Define a shared profile to remove duplication in roles [puppet] - 10https://gerrit.wikimedia.org/r/598885 (owner: 10EBernhardson) [14:46:44] (03PS3) 10Herron: centrallog: split syslogs into host directories [puppet] - 10https://gerrit.wikimedia.org/r/601836 [14:46:49] (03CR) 10RLazarus: [C: 03+1] grafana: disallow access to /avatars/ [puppet] - 10https://gerrit.wikimedia.org/r/602091 (owner: 10Giuseppe Lavagetto) [14:46:54] (03CR) 10Privacybatm: "Infact, the error message is already taken care of by" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598560 (https://phabricator.wikimedia.org/T252950) (owner: 10Privacybatm) [14:46:58] (03PS5) 10Dzahn: ci: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/598990 (owner: 10Muehlenhoff) [14:47:00] (03CR) 10Hashar: "The releases1001.eqiad.wmnet and releases2001.eqiad.wmnet hosts are still on Stretch." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598990 (owner: 10Muehlenhoff) [14:47:06] (03CR) 10Gehel: "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1002/22964/" [puppet] - 10https://gerrit.wikimedia.org/r/598885 (owner: 10EBernhardson) [14:47:08] <_joe_> I see some lag [14:47:09] (03PS5) 10Gehel: Define a shared profile to remove duplication in roles [puppet] - 10https://gerrit.wikimedia.org/r/598885 (owner: 10EBernhardson) [14:47:17] yes same here [14:47:17] (03CR) 10RLazarus: [C: 04-1] "updating as discussed, Location vs Directory" [puppet] - 10https://gerrit.wikimedia.org/r/602091 (owner: 10Giuseppe Lavagetto) [14:47:19] (03CR) 10Herron: centrallog: split syslogs into host directories (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/601836 (owner: 10Herron) [14:47:21] (03PS1) 10Marostegui: Revert "mariadb: Allow reimage of pc1009" [puppet] - 10https://gerrit.wikimedia.org/r/602092 [14:47:27] (03CR) 10Gehel: [C: 03+2] Define a shared profile to remove duplication in roles [puppet] - 10https://gerrit.wikimedia.org/r/598885 (owner: 10EBernhardson) [14:47:29] (03CR) 10Marostegui: [C: 03+2] Revert "mariadb: Allow reimage of pc1009" [puppet] - 10https://gerrit.wikimedia.org/r/602092 (owner: 10Marostegui) [14:47:31] (03CR) 10Dzahn: "fixed spec test and then amended again to still support stretch because jenkins on releases* is on stretch. 
so only removing jessie" [puppet] - 10https://gerrit.wikimedia.org/r/598990 (owner: 10Muehlenhoff) [14:47:35] !log updated grafana on cloudmetrics* to 6.7.4 [14:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:39] (03PS1) 10Kormat: Revert "db-eqiad.php: Replace pc1009 with pc1010 while reimaging" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602093 [14:47:45] (03CR) 10Marostegui: [C: 03+1] Revert "db-eqiad.php: Replace pc1009 with pc1010 while reimaging" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602093 (owner: 10Kormat) [14:47:59] (03CR) 10Kormat: [C: 03+2] Revert "db-eqiad.php: Replace pc1009 with pc1010 while reimaging" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602093 (owner: 10Kormat) [14:48:12] (03CR) 10Jcrespo: "Does this mean you are ok with this as it is now? So I can give it a last check?" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598560 (https://phabricator.wikimedia.org/T252950) (owner: 10Privacybatm) [14:49:07] (03CR) 10Dzahn: [C: 03+1] "noop on releases1001: https://puppet-compiler.wmflabs.org/compiler1002/22966/" [puppet] - 10https://gerrit.wikimedia.org/r/598990 (owner: 10Muehlenhoff) [14:49:15] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Replace pc1009 with pc1010 while reimaging" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602093 (owner: 10Kormat) [14:50:40] !log kormat@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool pc1009 in pc3 after reimaging T252182 (duration: 01m 06s) [14:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:43] T252182: Upgrade parsercache to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T252182 [14:52:02] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Cluster for Yi-Ju,Lu - https://phabricator.wikimedia.org/T254130 (10Dzahn) a:03YiJuLu [14:52:20] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Cluster for Yi-Ju,Lu - 
https://phabricator.wikimedia.org/T254130 (10Dzahn) 05Open→03Stalled p:05Triage→03Medium [14:52:53] 10Operations, 10SRE-Access-Requests: Adding Italian Wikinews to Google Search Console to add it to Google News - https://phabricator.wikimedia.org/T253988 (10Dzahn) a:03Ferdi2005 [14:53:00] 10Operations, 10SRE-Access-Requests: Adding Italian Wikinews to Google Search Console to add it to Google News - https://phabricator.wikimedia.org/T253988 (10Dzahn) p:05Triage→03Medium [14:54:23] 10Operations, 10LDAP-Access-Requests, 10observability, 10serviceops, 10Patch-For-Review: Grant Access to Logstash to Peter(peter.ovchyn@speedandfunction.com) - https://phabricator.wikimedia.org/T249037 (10Dzahn) Hi @AMooney any updates on this? [14:57:01] (03PS1) 10Hnowlan: changeprop-jobqueue: fix indentation of partitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/602094 (https://phabricator.wikimedia.org/T220399) [14:57:58] (03PS2) 10Hnowlan: changeprop-jobqueue: fix indentation of partitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/602094 (https://phabricator.wikimedia.org/T220399) [14:58:14] 10Operations, 10LDAP-Access-Requests, 10observability, 10serviceops, 10Patch-For-Review: Grant Access to Logstash to Peter(peter.ovchyn@speedandfunction.com) - https://phabricator.wikimedia.org/T249037 (10AMooney) Thanks for checking @Dzahn. It is currently in approval process. I will reach out when we a... [14:58:46] (03CR) 10Ppchelko: [C: 03+2] changeprop-jobqueue: fix indentation of partitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/602094 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [14:59:13] (03Merged) 10jenkins-bot: changeprop-jobqueue: fix indentation of partitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/602094 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [15:00:27] (03CR) 10Ayounsi: "Talked to Faidon, he is fine with proceeding with it." 
[puppet] - 10https://gerrit.wikimedia.org/r/585434 (https://phabricator.wikimedia.org/T249176) (owner: 10Ayounsi) [15:01:13] (03Abandoned) 10Giuseppe Lavagetto: grafana: disallow access to /avatars/ [puppet] - 10https://gerrit.wikimedia.org/r/602091 (owner: 10Giuseppe Lavagetto) [15:01:35] (03PS19) 10Gehel: Role for SDoC WDQS [puppet] - 10https://gerrit.wikimedia.org/r/595041 (https://phabricator.wikimedia.org/T237089) (owner: 10EBernhardson) [15:01:50] (03CR) 10Gehel: "PCC agrees this is a noop: https://puppet-compiler.wmflabs.org/compiler1001/22967/" [puppet] - 10https://gerrit.wikimedia.org/r/595041 (https://phabricator.wikimedia.org/T237089) (owner: 10EBernhardson) [15:03:39] (03PS1) 10Elukey: role::druid::analytics::worker: upgrade druid1003's settings for Buster [puppet] - 10https://gerrit.wikimedia.org/r/602095 (https://phabricator.wikimedia.org/T253980) [15:07:06] (03CR) 10Gehel: [C: 03+2] Role for SDoC WDQS [puppet] - 10https://gerrit.wikimedia.org/r/595041 (https://phabricator.wikimedia.org/T237089) (owner: 10EBernhardson) [15:07:22] 10Operations, 10netops: Homer: manage transit BGP sessions - https://phabricator.wikimedia.org/T250136 (10CDanis) LGTM [15:08:21] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/22968/" [puppet] - 10https://gerrit.wikimedia.org/r/602095 (https://phabricator.wikimedia.org/T253980) (owner: 10Elukey) [15:08:32] (03PS1) 10Gehel: Revert "Role for SDoC WDQS" [puppet] - 10https://gerrit.wikimedia.org/r/602096 [15:09:05] (03PS2) 10Gehel: Revert "Role for SDoC WDQS" [puppet] - 10https://gerrit.wikimedia.org/r/602096 [15:09:37] (03PS1) 10Bearloga: profile::analytics::cluster::packages::statistics: Add cairo [puppet] - 10https://gerrit.wikimedia.org/r/602097 (https://phabricator.wikimedia.org/T254278) [15:09:39] (03CR) 10CDanis: [C: 03+1] Remove RPKIcounter [puppet] - 10https://gerrit.wikimedia.org/r/585434 (https://phabricator.wikimedia.org/T249176) (owner: 10Ayounsi) [15:10:49] (03CR) 10Bearloga: "Sorry to bother 
you again for another minor CR. There was a config inconsistency between stat1008 and other stat hosts." [puppet] - 10https://gerrit.wikimedia.org/r/602097 (https://phabricator.wikimedia.org/T254278) (owner: 10Bearloga) [15:11:02] (03CR) 10Gehel: [C: 03+2] Revert "Role for SDoC WDQS" [puppet] - 10https://gerrit.wikimedia.org/r/602096 (owner: 10Gehel) [15:11:17] (03CR) 10Ayounsi: [C: 03+2] Remove RPKIcounter [puppet] - 10https://gerrit.wikimedia.org/r/585434 (https://phabricator.wikimedia.org/T249176) (owner: 10Ayounsi) [15:11:28] (03CR) 10Elukey: [C: 03+2] profile::analytics::cluster::packages::statistics: Add cairo [puppet] - 10https://gerrit.wikimedia.org/r/602097 (https://phabricator.wikimedia.org/T254278) (owner: 10Bearloga) [15:11:30] (03PS2) 10Ayounsi: Remove RPKIcounter [puppet] - 10https://gerrit.wikimedia.org/r/585434 (https://phabricator.wikimedia.org/T249176) [15:11:35] (03CR) 10CDanis: [C: 03+1] "thanks, looks good" [puppet] - 10https://gerrit.wikimedia.org/r/601712 (https://phabricator.wikimedia.org/T251104) (owner: 10Jbond) [15:11:43] (03CR) 10Gehel: "Error was not caught by PCC:" [puppet] - 10https://gerrit.wikimedia.org/r/602096 (owner: 10Gehel) [15:12:16] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . 
[15:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:45] (03CR) 10Privacybatm: "> Patch Set 2:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598560 (https://phabricator.wikimedia.org/T252950) (owner: 10Privacybatm) [15:13:57] (03Abandoned) 10Rush: shinken: WMCS: add load alerts for tools-bastion-0[23] [puppet] - 10https://gerrit.wikimedia.org/r/413781 (https://phabricator.wikimedia.org/T186552) (owner: 10Chico Venancio) [15:14:16] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, sth to keep an eye on will be rsyslog max open fds limits" [puppet] - 10https://gerrit.wikimedia.org/r/601836 (owner: 10Herron) [15:14:37] (03Abandoned) 10Rush: rabbitmq: setup monitor manifest user via rabbitmq::user [puppet] - 10https://gerrit.wikimedia.org/r/419839 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [15:15:58] (03PS3) 10Filippo Giunchedi: WIP thanos: add alerts [puppet] - 10https://gerrit.wikimedia.org/r/602082 (https://phabricator.wikimedia.org/T252186) [15:16:52] (03CR) 10Privacybatm: "> Patch Set 3:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598562 (https://phabricator.wikimedia.org/T253560) (owner: 10Privacybatm) [15:17:37] (03CR) 10Giuseppe Lavagetto: profile::conftool::client: only use root on cumin*, puppetmasters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/602047 (https://phabricator.wikimedia.org/T97972) (owner: 10Giuseppe Lavagetto) [15:19:15] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [15:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:14] 10Operations, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Replicated ticket registry - https://phabricator.wikimedia.org/T233933 (10jbond) I have configured memcache and mcrouter for CAS however there is currently an error. If CAS talks directly to memcache then all works fine. however when CAS talks to... 
[15:20:50] (03PS4) 10Filippo Giunchedi: WIP thanos: add alerts [puppet] - 10https://gerrit.wikimedia.org/r/602082 (https://phabricator.wikimedia.org/T252186) [15:21:12] !log hnowlan@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [15:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:04] (03PS6) 10Hashar: ci: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/598990 (owner: 10Muehlenhoff) [15:23:06] (03PS1) 10Hashar: jenkins: reindent spec to two spaces [puppet] - 10https://gerrit.wikimedia.org/r/602100 [15:23:08] (03PS1) 10Hashar: enkins: fix spec to use proper facts [puppet] - 10https://gerrit.wikimedia.org/r/602101 [15:25:01] (03PS1) 10Gehel: Role for SDoC WDQS [puppet] - 10https://gerrit.wikimedia.org/r/602102 (https://phabricator.wikimedia.org/T237089) [15:25:22] 10Operations, 10netops: Zayo link eqiad-codfw (OGYX/120003//ZYO) down (May 2020) - https://phabricator.wikimedia.org/T253610 (10CDanis) >>! In T253610#6173473, @ayounsi wrote: > Link is back up. > >> We found a defective fiber between nodes. >> Currently seeing service restored. >> Please advise the ONCC... [15:25:49] (03CR) 10Hashar: [C: 03+1] "From a discussion with Daniel, the spec only tests against Debian 8 and never tried to use apt::components_from_package. 
I rebased the cha" [puppet] - 10https://gerrit.wikimedia.org/r/598990 (owner: 10Muehlenhoff) [15:26:28] (03PS2) 10Hashar: jenkins: fix spec to use proper facts [puppet] - 10https://gerrit.wikimedia.org/r/602101 [15:27:15] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM (for the new bit added compared to PS1)" [puppet] - 10https://gerrit.wikimedia.org/r/598990 (owner: 10Muehlenhoff) [15:30:32] (03PS1) 10Jforrester: Stop setting wgCommentTableSchemaMigrationStage, no longer read in core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602105 [15:30:33] (03PS1) 10Jforrester: Stop setting wgChangeTagsSchemaMigrationStage, no longer read in core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602106 [15:31:16] 10Operations: Switch blog.wikimedia.org to diff.wikimedia.org - https://phabricator.wikimedia.org/T254367 (10CKoerner_WMF) [15:31:49] (03CR) 10Gehel: Role for SDoC WDQS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/602102 (https://phabricator.wikimedia.org/T237089) (owner: 10Gehel) [15:32:03] (03PS2) 10Rush: peek: Reenable cron with correct params [puppet] - 10https://gerrit.wikimedia.org/r/601173 (https://phabricator.wikimedia.org/T254127) (owner: 10Jcrespo) [15:33:36] (03PS1) 10Muehlenhoff: Rename reprepro definition for grafana [puppet] - 10https://gerrit.wikimedia.org/r/602108 [15:34:50] 10Operations, 10Peek, 10Security-Team, 10PM: Change peek scheduled jobs to systemd timer or k8s cron - https://phabricator.wikimedia.org/T254368 (10chasemp) p:05Triage→03Medium [15:36:14] (03PS5) 10Filippo Giunchedi: thanos: add alerts for Thanos components [puppet] - 10https://gerrit.wikimedia.org/r/602082 (https://phabricator.wikimedia.org/T252186) [15:36:23] (03PS3) 10Rush: peek: Reenable cron with correct params [puppet] - 10https://gerrit.wikimedia.org/r/601173 (https://phabricator.wikimedia.org/T254127) (owner: 10Jcrespo) [15:41:15] (03PS2) 10Giuseppe Lavagetto: profile::conftool::client: only use root on cumin*, puppetmasters [puppet] - 
10https://gerrit.wikimedia.org/r/602047 (https://phabricator.wikimedia.org/T97972) [15:44:04] (03CR) 10Ottomata: [C: 03+1] Deprecate profile::java::analytics in favor of profile::java [puppet] - 10https://gerrit.wikimedia.org/r/602009 (owner: 10Elukey) [15:44:05] 10Operations, 10Gerrit: Initial backup run for Gerrit LFS data - https://phabricator.wikimedia.org/T254162 (10QChris) >>! In T254162#6187911, @jcrespo wrote: > Recovery took 20 minutes, [...] Nice! > **Content of everything backed up on `gerrit1001` was restored into `gerrit2001:/srv/T254162_restore`.** I w... [15:45:32] (03CR) 10Giuseppe Lavagetto: prometheus: use profile::lvs::realserver (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/599298 (https://phabricator.wikimedia.org/T253840) (owner: 10Filippo Giunchedi) [15:46:47] (03CR) 10Giuseppe Lavagetto: [C: 03+1] conftool: bail on confctl not found [puppet] - 10https://gerrit.wikimedia.org/r/599299 (https://phabricator.wikimedia.org/T253840) (owner: 10Filippo Giunchedi) [15:47:20] (03CR) 10Filippo Giunchedi: "LGTM, see comment on .forward tho" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/601173 (https://phabricator.wikimedia.org/T254127) (owner: 10Jcrespo) [15:48:12] 10Operations, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Replicated ticket registry - https://phabricator.wikimedia.org/T233933 (10jbond) > We could resolve this by having mcrouter talk directly to memcache in the other DC however this requires 1.5.13 which is not currently in buster. 1.6.6 is available... [15:49:16] (03CR) 10Nray: [C: 03+1] Disable growth survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601842 (https://phabricator.wikimedia.org/T251741) (owner: 10Jdlrobson) [15:50:43] 10Operations, 10Analytics, 10Product-Analytics, 10SRE-Access-Requests: Hive access for Sam Patton - https://phabricator.wikimedia.org/T248097 (10spatton) @Dzahn, @jcrespo, @mpopov, @nettrom_WMF, @Nuria, @Volans : thank you for the pursuit, and sorry it was necessary. 
Busy times :) I really appreciate your... [15:55:06] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 67 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:58:16] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 51 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:00:33] (03CR) 10Rush: peek: Reenable cron with correct params (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/601173 (https://phabricator.wikimedia.org/T254127) (owner: 10Jcrespo) [16:02:21] 10Operations, 10Analytics, 10Product-Analytics, 10SRE-Access-Requests: Hive access for Sam Patton - https://phabricator.wikimedia.org/T248097 (10Nuria) Seems that your data access needs wouls be served by https://turnilo.wikimedia.org It does not seem you need access to raw data. Do you have an account on... [16:11:57] 10Operations, 10Gerrit: Initial backup run for Gerrit LFS data - https://phabricator.wikimedia.org/T254162 (10jcrespo) > There were a few inconsistencies between the currently live data on gerrit1001 and the restored data, but those are expected. Of course, there is a 24 hours gap. My question is if there wou... 
[16:15:40] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 50 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:22:55] (03CR) 10Privacybatm: "I have added those checks for both origin and destination because of the point you mentioned about the disk space issue (What if the user " [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598562 (https://phabricator.wikimedia.org/T253560) (owner: 10Privacybatm) [16:31:30] RECOVERY - MariaDB Slave Lag: pc1 on pc2010 is OK: OK slave_sql_lag Replication lag: 41.92 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [16:36:26] 10Operations, 10Analytics, 10Traffic: missing wmf_netflow data, 18:30-19:00 May 31 - https://phabricator.wikimedia.org/T254161 (10elukey) The missing data seems from May 31st 18:30 to 19:00. I did a quick check via Spark and on HDFS the data seems present: ` scala> spark.sql("select stamp_inserted from wmf.... 
[16:44:08] 10Operations: Integrate Buster 10.4 point update - https://phabricator.wikimedia.org/T252394 (10MoritzMuehlenhoff) [16:48:10] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 51 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:53:56] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 50 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:58:52] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 47 probes of 576 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:04:42] PROBLEM - Check systemd state on an-launcher1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:57] (03Abandoned) 10Hashar: transferpy requires paramiko [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/602072 (owner: 10Hashar) [17:08:33] !log ganeti: gnd-instance reboot an-launcher1001 to get new memory settings - T254125 [17:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:37] T254125: Increase memory available for an-launcher1001 - https://phabricator.wikimedia.org/T254125 [17:08:43] (03PS2) 10Ppchelko: Session Store: Switch everything to kask-session [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570396 (https://phabricator.wikimedia.org/T243106) [17:11:13] 10Operations, 10Analytics, 10Analytics-Kanban: Increase memory available for an-launcher1001 - https://phabricator.wikimedia.org/T254125 (10elukey) [17:12:08] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:13:56] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:16:01] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-novastats-puppetleaks: clean up prefixes for delete projects [puppet] - 10https://gerrit.wikimedia.org/r/601896 (https://phabricator.wikimedia.org/T252224) (owner: 10Andrew Bogott) [17:20:56] (03PS4) 10Privacybatm: transfer.py: Add proper error message at source/target split of option_parse [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598562 (https://phabricator.wikimedia.org/T253560) [17:37:34] 10Operations, 10ORES, 10Scoring-platform-team (Current): ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart) - https://phabricator.wikimedia.org/T242705 (10elukey) I need to review the task and think about next steps if any, let me know if there is anything outstand... [17:41:09] (03PS1) 10Ayounsi: BGP: add transit links [homer/public] - 10https://gerrit.wikimedia.org/r/602119 [17:41:17] (03CR) 10jerkins-bot: [V: 04-1] BGP: add transit links [homer/public] - 10https://gerrit.wikimedia.org/r/602119 (owner: 10Ayounsi) [17:42:20] 10Operations, 10ORES, 10Scoring-platform-team (Current): ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart) - https://phabricator.wikimedia.org/T242705 (10Halfak) Currently, the basic problem of the task is still outstanding. The stack overflow post is the best re... [17:47:19] (03CR) 10BPirkle: [C: 03+1] Session Store: Switch everything to kask-session [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570396 (https://phabricator.wikimedia.org/T243106) (owner: 10Ppchelko) [17:53:48] (03CR) 10BPirkle: [C: 03+1] "Approved to deploy, keeping in mind Jforrester's advice on sync order." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/570396 (https://phabricator.wikimedia.org/T243106) (owner: 10Ppchelko) [17:54:58] Jdlrobson: for the SWAT, do you wanna go ahead with your stuff? mine would take quite some time [17:57:47] (03PS2) 10Ayounsi: BGP: add transit links [homer/public] - 10https://gerrit.wikimedia.org/r/602119 [17:58:08] (03CR) 10jerkins-bot: [V: 04-1] BGP: add transit links [homer/public] - 10https://gerrit.wikimedia.org/r/602119 (owner: 10Ayounsi) [18:00:04] James_F and longma: That opportune time is upon us again. Time for a Train log triage with CPT deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200603T1800). [18:00:05] RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Morning SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200603T1800). [18:00:05] bpirkle, Pchelolo, Pchelolo, and Jdlrobson: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:12] RECOVERY - Check systemd state on an-launcher1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:01:02] (03PS3) 10Ayounsi: BGP: add transit links [homer/public] - 10https://gerrit.wikimedia.org/r/602119 [18:05:27] I can do the SWAT, but are any of the requesters here? 
[18:05:51] RoanKattouw: I will do my own [18:06:13] I guess we'll start bpirkle [18:06:15] OK go for it [18:06:21] ok [18:07:41] (03CR) 10Ppchelko: [C: 03+2] Session Store: Switch everything to kask-session [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570396 (https://phabricator.wikimedia.org/T243106) (owner: 10Ppchelko) [18:08:29] (03Merged) 10jenkins-bot: Session Store: Switch everything to kask-session [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570396 (https://phabricator.wikimedia.org/T243106) (owner: 10Ppchelko) [18:14:23] !log ppchelko@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: gerrit 570396 - enable kask-session everywhere. IS.php (duration: 01m 06s) [18:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:43] !log ppchelko@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: gerrit 570396 - enable kask-session everywhere. CS.php (duration: 01m 05s) [18:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:19] 1 done, 1 to go [18:22:34] (03PS6) 10Ppchelko: Enable kafka purges production on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599150 (https://phabricator.wikimedia.org/T250781) [18:22:47] (03CR) 10Ppchelko: [C: 03+2] Enable kafka purges production on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599150 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko) [18:23:34] (03Merged) 10jenkins-bot: Enable kafka purges production on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599150 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko) [18:29:03] RoanKattouw: super sorry I'm late [18:31:54] Jdlrobson: I'm still not done [18:34:40] !log ppchelko@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: gerrit 599150 - enable kafka purges for group0 (duration: 01m 06s) [18:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[18:36:20] Jdlrobson: I'm done, all yours [18:38:58] Pchelolo I need help swatting - don't have rights [18:39:13] ok, I can do it for you [18:39:18] Thanks Pchelolo I appreciate it [18:40:00] (03PS1) 10Ppchelko: EventGate-main: allow resource-purge to be produced [deployment-charts] - 10https://gerrit.wikimedia.org/r/602125 (https://phabricator.wikimedia.org/T250781) [18:40:01] ok. Is there any specific order? [18:41:11] nope [18:41:15] any order will do [18:41:20] ok. [18:41:26] (03CR) 10Ppchelko: [C: 03+2] Use AddFooterLink hook for code of conduct and contact links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596277 (https://phabricator.wikimedia.org/T251817) (owner: 10Jdlrobson) [18:41:49] (03CR) 10Ppchelko: "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596277 (https://phabricator.wikimedia.org/T251817) (owner: 10Jdlrobson) [18:41:53] (03CR) 10Ppchelko: [C: 03+2] Use AddFooterLink hook for code of conduct and contact links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596277 (https://phabricator.wikimedia.org/T251817) (owner: 10Jdlrobson) [18:42:39] (03PS9) 10Jdlrobson: Use AddFooterLink hook for code of conduct and contact links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596277 (https://phabricator.wikimedia.org/T251817) [18:43:51] (03CR) 10Ppchelko: [C: 03+2] Use AddFooterLink hook for code of conduct and contact links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596277 (https://phabricator.wikimedia.org/T251817) (owner: 10Jdlrobson) [18:44:45] (03Merged) 10jenkins-bot: Use AddFooterLink hook for code of conduct and contact links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596277 (https://phabricator.wikimedia.org/T251817) (owner: 10Jdlrobson) [18:46:08] Jdlrobson: ^ is on mwdebug1002 [18:46:17] can you test? [18:46:40] on it [18:47:43] Pchelolo: LGTM! 
[18:47:58] ok, proceeding [18:49:37] !log ppchelko@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: gerrit 596277 Use AddFooterLink hook for code of conduct and contact links (duration: 01m 05s) [18:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:12] Jdlrobson: done. Moving to next one [18:51:24] (03PS2) 10Ppchelko: Disable growth survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601842 (https://phabricator.wikimedia.org/T251741) (owner: 10Jdlrobson) [18:51:34] (03CR) 10Ppchelko: [C: 03+2] Disable growth survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601842 (https://phabricator.wikimedia.org/T251741) (owner: 10Jdlrobson) [18:52:20] (03CR) 10Ppchelko: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601842 (https://phabricator.wikimedia.org/T251741) (owner: 10Jdlrobson) [18:52:23] (03Merged) 10jenkins-bot: Disable growth survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601842 (https://phabricator.wikimedia.org/T251741) (owner: 10Jdlrobson) [18:52:56] (03CR) 10Ottomata: [C: 03+2] EventGate-main: allow resource-purge to be produced [deployment-charts] - 10https://gerrit.wikimedia.org/r/602125 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko) [18:53:43] Jdlrobson: Disable growth survey is on mwdebug1002 [18:53:45] can you test? [18:54:05] that one is also good to go! 
[18:54:09] oki [18:54:34] (03PS1) 10Hnowlan: changeprop-jobqueue: disable partition jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/602133 (https://phabricator.wikimedia.org/T220399) [18:55:18] (03CR) 10Hnowlan: [C: 03+2] changeprop-jobqueue: disable partition jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/602133 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [18:55:44] (03PS3) 10Ppchelko: Enable talk pages on Swedish Minerva [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601843 (https://phabricator.wikimedia.org/T253985) (owner: 10Jdlrobson) [18:55:46] (03Merged) 10jenkins-bot: changeprop-jobqueue: disable partition jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/602133 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [18:55:49] !log ppchelko@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: gerrit 601842 - Disable growth survey (duration: 01m 06s) [18:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:58] (03CR) 10Ppchelko: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601843 (https://phabricator.wikimedia.org/T253985) (owner: 10Jdlrobson) [18:56:14] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [18:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:48] (03Merged) 10jenkins-bot: Enable talk pages on Swedish Minerva [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601843 (https://phabricator.wikimedia.org/T253985) (owner: 10Jdlrobson) [18:57:47] Jdlrobson: Enable talk pages on Swedish Minerva is on mwdebug1002 [18:58:25] That one is also good to go - talk for anons! yaty! 
https://usercontent.irccloud-cdn.com/file/BxyKNyYC/Screen%20Shot%202020-06-03%20at%2011.58.13%20AM.png [18:59:33] !log hnowlan@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [18:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:51] 10Operations, 10MediaWiki-General, 10serviceops-radar, 10Performance-Team (Radar), and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (10Krinkle) 10 second samples from two Redis servers. ` $ redis-cli -a "$AUTH" monitor > dump …... [19:00:05] James_F and longma: It is that lovely time of the day again! You are hereby commanded to deploy Mediawiki train - American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200603T1900). [19:00:34] James_F: we're running late with 1 more in a SWAT. Would you mind waiting a few minutes? [19:00:39] Okie-dokie. [19:00:55] James_F: just more main page configuration leftshouldnt take long [19:00:56] sorry about that [19:01:19] Fine. [19:01:28] (03PS5) 10Ppchelko: Stop special casing the main page on several projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601847 (https://phabricator.wikimedia.org/T32405) (owner: 10Jdlrobson) [19:01:39] actually, https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/601847 is left [19:01:54] and I don't really know how to do these big multi-file config changes [19:01:59] !log ppchelko@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: gerrit 601843 Enable talk pages on Swedish Minerva (duration: 01m 08s) [19:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:03] can you please help with that James_F? [19:02:43] Pchelolo: Oh, sure, that's just sync of the dblist. [19:02:45] I'll do it. 
[19:02:51] (03CR) 10Jforrester: [C: 03+2] Stop special casing the main page on several projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601847 (https://phabricator.wikimedia.org/T32405) (owner: 10Jdlrobson) [19:02:57] ok, thank you [19:02:58] thanks James_F (and for all your help with this deprecation in general) - almost there! [19:03:04] I'm stepping out of deployment [19:03:05] and thanks Pchelolo for your help today [19:03:07] Jdlrobson: It's wonderful to see it happening. :-) [19:03:09] that's a load off my plate [19:03:25] Down to "just" 121 wikis, with this change. [19:03:38] (03Merged) 10jenkins-bot: Stop special casing the main page on several projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601847 (https://phabricator.wikimedia.org/T32405) (owner: 10Jdlrobson) [19:04:41] Syncing now. [19:05:44] !log jforrester@deploy1001 Synchronized dblists/mobilemainpagelegacy.dblist: T32405 Stop special casing the main page on another 47 projects (duration: 01m 08s) [19:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:48] T32405: [EPIC] MobileFrontend extension should stop special-casing main page - https://phabricator.wikimedia.org/T32405 [19:06:54] Jdlrobson: All done? [19:07:58] James_F: yup! [19:08:00] thanks! [19:08:14] James_F: i've sent out a last call site notice and am aiming for next branch cut to turn off the rest of them [19:08:21] Brill. [19:08:41] ill update all 121 myself if it comes to that haha [19:08:49] Let's not. ;-) [19:09:21] (03PS1) 10Jforrester: group1 wikis to 1.35.0-wmf.35 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602136 [19:09:23] (03CR) 10Jforrester: [C: 03+2] group1 wikis to 1.35.0-wmf.35 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602136 (owner: 10Jforrester) [19:10:10] (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.35 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602136 (owner: 10Jforrester) [19:11:37] longma: Ready? 
[19:11:46] yup [19:12:02] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:13:05] Syncing now. [19:13:15] !log jforrester@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.35 [19:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:48] (03PS1) 10Hnowlan: changeprop-jobqueue: disable low traffic jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/602137 (https://phabricator.wikimedia.org/T220399) [19:14:00] Wikidata and Commons are still up. [19:14:18] (03CR) 10Ppchelko: [C: 03+1] changeprop-jobqueue: disable low traffic jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/602137 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [19:14:21] !log jforrester@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.35 (duration: 01m 05s) [19:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:27] (03CR) 10Hnowlan: [C: 03+2] changeprop-jobqueue: disable low traffic jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/602137 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [19:14:52] (03Merged) 10jenkins-bot: changeprop-jobqueue: disable low traffic jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/602137 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [19:15:18] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [19:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:34] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:15:45] Ooh, fun, spike in errors from EventStreamConfig [19:16:30] !log hnowlan@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [19:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:34] PROBLEM - Router interfaces on cr1-eqsin is CRITICAL: CRITICAL: host 103.102.166.129, interfaces up: 85, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:18:02] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:18:24] RECOVERY - Router interfaces on cr1-eqsin is OK: OK: host 103.102.166.129, interfaces up: 89, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:18:28] Eurgh, yeah, rolling back. [19:18:30] James_F: I don't see it going down [19:20:31] !log jforrester@deploy1001 rebuilt and synchronized wikiversions files: Revert group1 wikis to wmf.34 T253023 [19:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:36] T253023: 1.35.0-wmf.35 deployment blockers - https://phabricator.wikimedia.org/T253023 [19:23:24] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:29:39] !log ppchelko@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [19:29:39] !log ppchelko@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [19:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:05] !log ryankemper@cumin2001 START - Cookbook sre.elasticsearch.rolling-upgrade [19:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:59] !log ppchelko@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [19:32:59] !log ppchelko@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [19:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:38] !log ppchelko@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [19:35:38] !log ppchelko@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . 
[19:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:46] PROBLEM - HP RAID on ms-be2018 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Temporarily Disabled - Cable Error - Battery/Capacitor: Recharging https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [19:37:49] ACKNOWLEDGEMENT - HP RAID on ms-be2018 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Temporarily Disabled - Cable Error - Battery/Capacitor: Recharging nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T254392 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Run [19:37:49] aid_Information_Gathering [19:37:52] 10Operations, 10ops-codfw: Degraded RAID on ms-be2018 - https://phabricator.wikimedia.org/T254392 (10ops-monitoring-bot) [19:42:52] (03CR) 10Herron: [C: 03+1] "don't know enough yet to comment about specific thresholds, but looks like a good start!" [puppet] - 10https://gerrit.wikimedia.org/r/602082 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [19:57:42] (03CR) 10Herron: [C: 03+2] centrallog: split syslogs into host directories [puppet] - 10https://gerrit.wikimedia.org/r/601836 (owner: 10Herron) [20:00:04] halfak and accraze: Your horoscope predicts another unfortunate Services – Graphoid / Citoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200603T2000). 
[20:07:40] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [20:08:08] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [20:08:33] 10Operations, 10Gerrit: Initial backup run for Gerrit LFS data - https://phabricator.wikimedia.org/T254162 (10QChris) 05Open→03Resolved >>! In T254162#6189550, @jcrespo wrote: > My question is if there would be issues with seting up the service from that dump, if they would be internally consistent, even i... [20:11:14] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [20:11:42] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [20:12:14] keeping an eye on these, my patch bounced the rsyslogs but was not expecting these alerts [20:13:16] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.944e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [20:14:48] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [20:14:51] hm, this one is due to refreshLinks job ^ [20:15:04] Pchelolo: would your resource-purge change affect that? 
[20:15:18] oh sorry ^^ [20:15:20] the mirror maker lag [20:15:44] ryankemper: ^^ [20:16:41] the "Mediawiki Cirrussearch update rate" is due to the current cluster restart, that check needs to be tuned [20:18:18] mirror maker might be related too, the number of retries on CirrusWrites is increasing during cluster restarts [20:18:47] !log ryankemper@cumin2001 END (ERROR) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=97) [20:18:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:42] !log elasticsearch cluster restart stopped [20:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:00] let's just check if the MirrorMaker issue is related to that cluster restart [20:24:47] Looks like the rise in MirrorMaker lag began ~30 mins after the codfw rolling upgrade began [20:25:52] (03PS1) 10Herron: icinga: increase "rsyslog failing to deliver messages" check threshold [puppet] - 10https://gerrit.wikimedia.org/r/602153 [20:26:11] MirrorMaker lag is going down [20:26:24] we should have a circuit breaker on that one, not sure what happened [20:28:55] Lag is almost recovered [20:29:09] Given how fast lag dropped when we paused the cluster restart, that was almost certainly the source [20:29:48] and behind the scene explanation: the circuit breaker that I thought was in place isn't actually deployed yet [20:37:35] (03PS1) 10Mholloway: Mobileapps: Add initial helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/602155 (https://phabricator.wikimedia.org/T218733) [20:37:55] (03CR) 10Cwhite: [C: 03+1] icinga: increase "rsyslog failing to deliver messages" check threshold [puppet] - 10https://gerrit.wikimedia.org/r/602153 (owner: 10Herron) [20:41:48] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates 
https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [20:47:57] !log ryankemper@cumin2001 START - Cookbook sre.elasticsearch.rolling-upgrade [20:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:25] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=0) [20:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:54] PROBLEM - PHP opcache health on mwdebug2002 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [20:58:38] !log ryankemper@cumin2001 START - Cookbook sre.elasticsearch.rolling-upgrade [20:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:32] RECOVERY - PHP opcache health on mwdebug2002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:00:03] (03CR) 10BearND: "Thank you for doing this. Where are these files copied from? They look a bit different from the patch Akosiaris pointed us to in the task." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/602155 (https://phabricator.wikimedia.org/T218733) (owner: 10Mholloway) [21:03:35] !log ryankemper@cumin2001 END (ERROR) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=97) [21:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:04] !log Elasticsearch Eqiad was in yellow cluster status before starting the above cookbook run (therefore the run was a no-op until I ctlr+C'd), going to try unsticking the two unassigned shards via `/_cluster/reroute?retry_failed=true` [21:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:38] (03PS1) 10Mholloway: Chromium-render: Add initial helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/602164 (https://phabricator.wikimedia.org/T225680) [21:09:26] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 3884 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [21:13:03] !log Ran `curl -X POST "https://localhost:9243/_cluster/reroute?pretty&retry_failed=true&explain=true" -H 'Content-Type: application/json' -d '{}' --insecure` via the ssh tunnel `ssh bast4002.wikimedia.org -L 9243:search.svc.eqiad.wmnet:9243 -L 9443:search.svc.eqiad.wmnet:9443 -L 9643:search.svc.eqiad.wmnet:9643`, two unassigned shards are now initializing [21:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:37] !log The previously ran `_cluster/reroute?retry_failed=true` command worked as intended, the two shards in question have recovered and we're back to green cluster status. 
We're now in a known state and ready to proceed with the eqiad rolling upgrade [21:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:20] !log ryankemper@cumin2001 START - Cookbook sre.elasticsearch.rolling-upgrade [21:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:00] 10Operations, 10SRE-Access-Requests: Adding Italian Wikinews to Google Search Console to add it to Google News - https://phabricator.wikimedia.org/T253988 (10Ferdi2005) Hi @Dzahn. I'd like to thank you really much for the help that you're providing us! So, as I can see we don't have to submit a publication for... [21:19:31] (03CR) 10Mholloway: "Yes, I was looking at the current versions of the same files in the wikifeeds service. (The 'private' directories in each subfolder are ap" [deployment-charts] - 10https://gerrit.wikimedia.org/r/602155 (https://phabricator.wikimedia.org/T218733) (owner: 10Mholloway) [21:22:44] (03PS2) 10Mholloway: Mobileapps: Add initial helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/602155 (https://phabricator.wikimedia.org/T218733) [21:34:48] (03PS5) 10EBernhardson: query_service: Move shared config into common file [puppet] - 10https://gerrit.wikimedia.org/r/599145 [21:34:50] (03PS8) 10EBernhardson: Consolidate query_service profile duplication [puppet] - 10https://gerrit.wikimedia.org/r/599146 [21:34:52] (03PS1) 10EBernhardson: Revert "Revert "Role for SDoC WDQS"" [puppet] - 10https://gerrit.wikimedia.org/r/602171 [21:35:32] (03CR) 10BearND: Mobileapps: Add initial helmfile stanzas (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/602155 (https://phabricator.wikimedia.org/T218733) (owner: 10Mholloway) [21:36:03] 10Operations, 10Wikimedia-Apache-configuration, 10Developer Productivity, 10Patch-For-Review, 10Performance-Team (Radar): VirtualHost for mod_status breaks debugging Apache/MediaWiki from localhost (on jobrunners) - 
https://phabricator.wikimedia.org/T190111 (10Krinkle) [21:38:33] Pchelolo: hey looks like mobile diffs on mobile are broken [21:38:54] I'm guessing this relates to the changes your team has been doing as we haven't touched it [21:38:57] will open an UBN shortly [21:42:50] ok phew.. After investigating some more looks like a gadget is the cause (thank goodness!) nothing to see here. :) [21:44:50] (03CR) 10EBernhardson: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/22977/" [puppet] - 10https://gerrit.wikimedia.org/r/602171 (owner: 10EBernhardson) [21:47:07] I'm deploying a UBN fix and then the train. [21:47:32] (03PS1) 10Andrew Bogott: Keystone: add project-proxy-dns-manager to password whitelist [puppet] - 10https://gerrit.wikimedia.org/r/602176 [21:48:31] (03CR) 10Andrew Bogott: [C: 03+2] Keystone: add project-proxy-dns-manager to password whitelist [puppet] - 10https://gerrit.wikimedia.org/r/602176 (owner: 10Andrew Bogott) [21:49:18] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.35/extensions/EventStreamConfig/includes/ApiStreamConfigs.php: T254390 ApiStreamConfigs: If the 'constraints' parameter is unset, don't explode (duration: 01m 06s) [21:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:23] T254390: Argument 1 passed to MediaWiki\Extension\EventStreamConfig\ApiStreamConfigs::multiParamToAssocArray() must be of the type array, null given, called in /srv/mediawiki/php-1.35.0-wmf.35/extensions/EventStreamConfig/includes/ApiStreamConfigs.php on line 66 - https://phabricator.wikimedia.org/T254390 [21:54:17] !log jforrester@deploy1001 rebuilt and synchronized wikiversions files: Re-rolling group1 to 1.35.0-wmf.35 for T253023 [21:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:20] T253023: 1.35.0-wmf.35 deployment blockers - https://phabricator.wikimedia.org/T253023 [21:58:22] (03PS8) 10RLazarus: mcrouter_wancache: Add mcrouter support for a machine-local memcached 
instance [puppet] - 10https://gerrit.wikimedia.org/r/594760 (https://phabricator.wikimedia.org/T244340) [21:59:48] (03PS1) 10Andrew Bogott: profile:wmcs:proxy:static: add acme-chief certs for each mapped domain [puppet] - 10https://gerrit.wikimedia.org/r/602178 (https://phabricator.wikimedia.org/T251558) [22:00:34] (03PS2) 10Andrew Bogott: profile:wmcs:proxy:static: add acme-chief certs for each mapped domain [puppet] - 10https://gerrit.wikimedia.org/r/602178 (https://phabricator.wikimedia.org/T251558) [22:02:18] (03CR) 10jerkins-bot: [V: 04-1] profile:wmcs:proxy:static: add acme-chief certs for each mapped domain [puppet] - 10https://gerrit.wikimedia.org/r/602178 (https://phabricator.wikimedia.org/T251558) (owner: 10Andrew Bogott) [22:02:31] (03CR) 10Krinkle: [C: 03+1] "Moral support :) - I don't know what including this config file will do, whether that's all good, or whether this is how to do it. So some" [puppet] - 10https://gerrit.wikimedia.org/r/599683 (https://phabricator.wikimedia.org/T190111) (owner: 10Dzahn) [22:12:02] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [22:13:49] ^ expected [22:16:33] (03PS3) 10Andrew Bogott: profile:wmcs:proxy:static: add acme-chief certs as specified by hiera [puppet] - 10https://gerrit.wikimedia.org/r/602178 (https://phabricator.wikimedia.org/T251558) [22:18:16] (03CR) 10jerkins-bot: [V: 04-1] profile:wmcs:proxy:static: add acme-chief certs as specified by hiera [puppet] - 10https://gerrit.wikimedia.org/r/602178 (https://phabricator.wikimedia.org/T251558) (owner: 10Andrew Bogott) [22:20:14] (03PS4) 10Andrew Bogott: profile:wmcs:proxy:static: add acme-chief certs as specified by hiera [puppet] - 10https://gerrit.wikimedia.org/r/602178 (https://phabricator.wikimedia.org/T251558) 
[22:21:57] (03CR) 10jerkins-bot: [V: 04-1] profile:wmcs:proxy:static: add acme-chief certs as specified by hiera [puppet] - 10https://gerrit.wikimedia.org/r/602178 (https://phabricator.wikimedia.org/T251558) (owner: 10Andrew Bogott) [22:22:04] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=0) [22:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:30] (03CR) 10Krinkle: php: $enable_request_profiling should affect CLI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/599476 (https://phabricator.wikimedia.org/T253547) (owner: 10Dave Pifke) [22:27:37] (03CR) 10Krinkle: [C: 03+1] php: $enable_request_profiling should affect CLI [puppet] - 10https://gerrit.wikimedia.org/r/599476 (https://phabricator.wikimedia.org/T253547) (owner: 10Dave Pifke) [22:28:35] (03PS5) 10Andrew Bogott: profile:wmcs:proxy:static: add acme-chief certs as specified by hiera [puppet] - 10https://gerrit.wikimedia.org/r/602178 (https://phabricator.wikimedia.org/T251558) [22:28:39] (03CR) 10Krinkle: [C: 03+1] "CC-ing Tim. @Tim: We intentionally limited exposure of Tideways due to its invasive runtime behaviour. 
Just checking with you that it seem" [puppet] - 10https://gerrit.wikimedia.org/r/599476 (https://phabricator.wikimedia.org/T253547) (owner: 10Dave Pifke) [22:38:51] (03CR) 10Andrew Bogott: [C: 03+2] profile:wmcs:proxy:static: add acme-chief certs as specified by hiera [puppet] - 10https://gerrit.wikimedia.org/r/602178 (https://phabricator.wikimedia.org/T251558) (owner: 10Andrew Bogott) [22:42:44] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [22:55:32] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [22:55:44] PROBLEM - Unmerged changes on repository puppet on labtestpuppetmaster2001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [23:00:04] RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Evening SWAT(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200603T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:01:20] I was planning to do some simple maintenance with a bot, but it seems like I can't even log in with the bot at meta. It seems like it still work properly at other wikis. I can log into the bot account through a browser. There are no bot password, no 2FA, only a very long usual password. [23:03:48] jeblad: Are you wanting help? 
:P [23:03:50] Given the page names I hit the correct wiki (--family:meta) [23:04:13] * jeblad grabs Reedy [23:04:28] * Reedy pokes wikibugs [23:04:32] Yeah, I have no idea what goes on. [23:05:44] I can double check whether the login works at nowiki, but it seemed like it worked… [23:06:01] And btw I did a fresh git pull [23:06:29] Is this a pywikibot issue? [23:06:56] O'h, I didn't say. Yes it is pywikibot [23:06:57] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: T249834 (duration: 01m 06s) [23:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:01] T249834: Configure permissions - https://phabricator.wikimedia.org/T249834 [23:07:18] jeblad: Can you login at any other non-Wikipedias? [23:07:34] Could try commons [23:07:36] jeblad: I just rolled the train so all wikis except Wikipedias are running the new branch. [23:07:53] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Cluster for Yi-Ju,Lu - https://phabricator.wikimedia.org/T254130 (10diego) @Dzahn & @Aklapper here you have @YiJuLu 's key: https://office.wikimedia.org/wiki/User:Diego_(WMF)/internKeys Regarding the L3, I see that @YiJuLu has added her conf... [23:08:12] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings-labs.php: T249834 (duration: 01m 06s) [23:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:29] Reedy: … you synchronised a Labs-only file? [23:08:41] Don't we normally? [23:08:41] Are you quite well? [23:08:44] No. [23:08:53] * Reedy always has [23:09:03] Just pull onto the deployment host so it doesn't get in the way of others' workflows. [23:09:15] It'll get rolled out eventually when the next full scap happens. [23:09:43] Then noc is out of date [23:09:51] It often is. [23:10:05] We have gerrit for the one source of truth. [23:10:08] Now I get some exception on language… [23:10:14] still think we should bin noc [23:10:20] Yeah. 
[23:10:31] Or just redirect it to gerrit.wikimedia.org/g/operations/mediawiki-config [23:10:54] sure [23:10:55] We could slap a nice README.md into mw-config so people could find their way. [23:11:37] (As opposed to the silly md-like unprefixed README which isn't good for anything as nothing speculatively renders it.) [23:11:45] postfixed, even. [23:12:27] andrewbogott: that puppet change ready to merge? [23:12:48] jeblad's message sounded important [23:12:50] Now I get some exception on language… [23:12:52] Yes please [23:13:09] shdubsh: ^ [23:13:13] I mean, it's pywikibot, it's a day with a 'y' in it, so it's broken… [23:13:20] hi, there have been no account creations on Meta in ~15 minutes https://meta.wikimedia.org/wiki/Special:Log/newusers [23:13:29] this doesn't seem right [23:13:34] andrewbogott: ack, done :) [23:13:38] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [23:13:56] interesting [23:14:02] but also the timing doesn't quite align with a deployment? [23:14:08] I do see accounts being created on enwiki though [23:14:17] Thanks! [23:14:43] musikanimal: Did someone make a heavy-handed AbuseFilter? [23:15:09] Some prodding and it seems like user-config.py overrides command line arguments [23:15:12] I was fiddling with one that targets account creation but it is log-only [23:15:22] https://meta.wikimedia.org/wiki/Special:AbuseFilter/238 [23:15:30] also 248 [23:15:36] Train roll was at 21:54; last account creation was at 22:59. Seems unlikely it's the train. [23:15:40] but looking at the AF log, also too soon to explain the sudden stop [23:15:42] RECOVERY - Unmerged changes on repository puppet on labtestpuppetmaster2001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [23:15:44] And then I get CRITICAL: Exiting due to uncaught exception [23:16:34] woah... 
okay I disabled 238 and I'm seeing accounts again [23:17:02] that filter was very clearly log-only though... [23:17:10] well, I see a single account, which doesn't prove much :) [23:17:24] jeblad: Can you login now? [23:17:29] yeah it should be more by now. Several per minute, minimum [23:17:48] I also disabled 248, for good measure [23:18:18] if it were abusefilters wouldn't https://meta.wikimedia.org/wiki/Special:AbuseLog be flooded? [23:18:26] yes [23:18:36] okay [23:18:46] Not if it errors before it logs somehow? [23:18:47] how about we try making a new account somewhere else and merging on meta [23:18:54] seeing what happens [23:18:58] But the fatal monitor level isn't high. [23:19:32] Seems like it worked, not sure [23:19:38] my test account was created successfully, let me try while logged out [23:20:30] Yes, I'm in! [23:20:40] 23:20, 3 June 2020 User account Krenair (test account) talk contribs was created automatically [23:20:43] yeah the logs look normal again [23:21:03] account creation logs are flowing again [23:21:05] Could be an error between the keys and display… [23:21:28] is it me or was there a sudden flood of a ton of creations all at once? [23:21:40] yeah [23:21:45] maybe stuck in a queue somewhere waiting to be appended to logging, or something? [23:21:50] seems like it backfilled [23:21:54] Possibly. [23:22:09] do we even have a job queue in front of that? [23:22:35] it seems weird... won't be a DB replication thing as the timestamps were clearly affected [23:23:37] I'm surprised, yes. [23:25:11] I re-enabled both of those filters and all seems fine. I think we can rule those out [23:25:24] sounds fair enough yeah [23:58:41] what is the task for the current train blocker?