[00:21:58] !log depool and restart cp3065 cp3061
[00:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:22:18] !log depool and restart cp3065 cp3061 - T238305
[00:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:22:21] T238305: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305
[00:23:10] !log jiji@cumin1001 conftool action : set/pooled=no; selector: name=cp3065.esams.wmnet
[00:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:20] !log jiji@cumin1001 conftool action : set/pooled=no; selector: name=cp3061.esams.wmnet
[00:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:29:33] RECOVERY - Host cp3061 is UP: PING OK - Packet loss = 0%, RTA = 83.31 ms
[00:31:49] RECOVERY - Host cp3065 is UP: PING OK - Packet loss = 0%, RTA = 83.37 ms
[00:53:41] !log jiji@cumin1001 conftool action : set/pooled=yes; selector: name=cp3065.esams.wmnet
[00:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:58:20] !log jiji@cumin1001 conftool action : set/pooled=yes; selector: name=cp3061.esams.wmnet
[00:58:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:59:16] Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (jijiki) prometheus-trafficserver-tls-exporter.service initially failed to start on both cp3065 and cp3061 after reboot
[01:35:17] Operations, ops-esams: Terminate OE10,11,12,13 Racks - https://phabricator.wikimedia.org/T237055 (wiki_willy) Draft of termination letter completed by Jim from Legal. Pending review via email, before mailing out to Iron Mountain. Thanks, Willy
[02:02:08] !log restarted mariadb on cloudservices1003, cloudservices1004, cloudservices2001-dev, clouddb2001-dev for T239791
[02:02:10] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Andrew) >>! In T239791#5770128, @Marostegui wrote: > @Andrew @Bstorm who in WMCS would be responsible for restarting mysql on these hosts? > ` >...
[02:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:02:21] T239791: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791
[04:58:36] (CR) Masumrezarock100: [C: +1] "Please also update the commit message. The new subtask is T242569. I can't do it myself for some reason." [mediawiki-config] - https://gerrit.wikimedia.org/r/563653 (https://phabricator.wikimedia.org/T218626) (owner: DannyS712)
[05:01:13] (PS3) Ammarpad: Deploy partial blocks on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/563653 (https://phabricator.wikimedia.org/T242569) (owner: DannyS712)
[05:05:09] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 57.51 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[05:06:43] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 59.67 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[05:12:09] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 75.25 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[05:12:21] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 81.98 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[05:49:49] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui) >>! In T239791#5796099, @Andrew wrote: >>>! In T239791#5770128, @Marostegui wrote: >> @Andrew @Bstorm who in WMCS would be responsible...
[05:50:02] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[05:51:43] !log Deploy schema change on x1 master on flowdb with replication - T241387
[05:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:51:46] T241387: Extend flow_wiki_ref.ref_src_wiki - https://phabricator.wikimedia.org/T241387
[05:53:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1084 after compression', diff saved to https://phabricator.wikimedia.org/P10120 and previous config saved to /var/cache/conftool/dbconfig/20200113-055315-marostegui.json
[05:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2091:3312', diff saved to https://phabricator.wikimedia.org/P10121 and previous config saved to /var/cache/conftool/dbconfig/20200113-055554-marostegui.json
[05:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1105:3312 - T239453', diff saved to https://phabricator.wikimedia.org/P10122 and previous config saved to /var/cache/conftool/dbconfig/20200113-055811-marostegui.json
[05:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:16] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453
[05:58:40] !log Remove partitions from db1105:3312 - T239453
[05:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1075 T234052', diff saved to https://phabricator.wikimedia.org/P10123 and previous config saved to /var/cache/conftool/dbconfig/20200113-060012-marostegui.json
[06:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:16] T234052: Add abuse_filter_log.afl_filter_id and afl_global columns - https://phabricator.wikimedia.org/T234052
[06:01:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1084 after compression', diff saved to https://phabricator.wikimedia.org/P10124 and previous config saved to /var/cache/conftool/dbconfig/20200113-060112-marostegui.json
[06:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1013', diff saved to https://phabricator.wikimedia.org/P10125 and previous config saved to /var/cache/conftool/dbconfig/20200113-060841-marostegui.json
[06:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:10:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1075 T234052', diff saved to https://phabricator.wikimedia.org/P10126 and previous config saved to /var/cache/conftool/dbconfig/20200113-061025-marostegui.json
[06:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:10:29] T234052: Add abuse_filter_log.afl_filter_id and afl_global columns - https://phabricator.wikimedia.org/T234052
[06:11:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool es1013', diff saved to https://phabricator.wikimedia.org/P10127 and previous config saved to /var/cache/conftool/dbconfig/20200113-061106-marostegui.json
[06:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:39] !log Deploy schema change on s1 master (db1083) - T234052
[06:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1084 after compression', diff saved to https://phabricator.wikimedia.org/P10128 and previous config saved to /var/cache/conftool/dbconfig/20200113-061434-marostegui.json
[06:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1084', diff saved to https://phabricator.wikimedia.org/P10129 and previous config saved to /var/cache/conftool/dbconfig/20200113-061835-marostegui.json
[06:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1081 for compression T232446', diff saved to https://phabricator.wikimedia.org/P10130 and previous config saved to /var/cache/conftool/dbconfig/20200113-062007-marostegui.json
[06:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:11] T232446: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446
[06:35:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1112', diff saved to https://phabricator.wikimedia.org/P10131 and previous config saved to /var/cache/conftool/dbconfig/20200113-063513-marostegui.json
[06:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:03] !log Deploy schema change on db1112 with replication (lag will appear on s3 on labs) - T234052
[06:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:05] T234052: Add abuse_filter_log.afl_filter_id and afl_global columns - https://phabricator.wikimedia.org/T234052
[06:45:11] !log Upgrade db1112
[06:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:00] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[07:26:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1112', diff saved to https://phabricator.wikimedia.org/P10132 and previous config saved to /var/cache/conftool/dbconfig/20200113-072611-marostegui.json
[07:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:51] (CR) Giuseppe Lavagetto: [C: -1] nodejs10: Add buster image (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/563185 (https://phabricator.wikimedia.org/T237911) (owner: Alexandros Kosiaris)
[07:30:50] !log cr3-knams> clear bfd session fe80::5e5e:ab00:d3d:85c - T240659
[07:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:53] T240659: BFD session alerts due to inconsistent status on cr3-knams - https://phabricator.wikimedia.org/T240659
[07:31:12] (CR) Giuseppe Lavagetto: [C: -1] nodejs10: Add buster image (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/563185 (https://phabricator.wikimedia.org/T237911) (owner: Alexandros Kosiaris)
[07:31:19] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:34:11] Operations, netops: BFD session alerts due to inconsistent status on cr3-knams - https://phabricator.wikimedia.org/T240659 (ayounsi) Removed BFD traceoptions on cr1-eqiad, keeping knams-eqdfw down for JTAC investigation.
[07:36:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1112', diff saved to https://phabricator.wikimedia.org/P10133 and previous config saved to /var/cache/conftool/dbconfig/20200113-073656-marostegui.json
[07:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:37] (PS1) Ayounsi: Reject RPKI invalids on both transit and peering links [homer/public] - https://gerrit.wikimedia.org/r/563824 (https://phabricator.wikimedia.org/T220669)
[07:49:43] (CR) Ayounsi: "[edit policy-options policy-statement BGP_IXP_in]" [homer/public] - https://gerrit.wikimedia.org/r/563824 (https://phabricator.wikimedia.org/T220669) (owner: Ayounsi)
[07:53:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1112', diff saved to https://phabricator.wikimedia.org/P10134 and previous config saved to /var/cache/conftool/dbconfig/20200113-075334-marostegui.json
[07:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:12] Operations, Phabricator, Release-Engineering-Team-TODO, Traffic, Release-Engineering-Team (Development services): Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (faidon) This task is about preparing "Phame to support heavy traffic...
[08:10:09] Operations, ops-codfw, DBA: Missing Network drivers from Stretch and Buster installer for BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (Marostegui) >>! In T242481#5794886, @Papaul wrote: > @Marostegui I will focus more on troubleshooting this on the NIC level on Monday....
[08:24:17] PROBLEM - nutcracker socket on snapshot1008 is CRITICAL: connect to address 10.64.16.16 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker
[08:24:23] PROBLEM - DPKG on snapshot1008 is CRITICAL: connect to address 10.64.16.16 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[08:24:35] PROBLEM - Check size of conntrack table on snapshot1008 is CRITICAL: connect to address 10.64.16.16 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[08:24:41] PROBLEM - Check whether ferm is active by checking the default input chain on snapshot1008 is CRITICAL: connect to address 10.64.16.16 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[08:24:53] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: connect to address 10.64.16.16 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:24:55] PROBLEM - mcrouter process on snapshot1008 is CRITICAL: connect to address 10.64.16.16 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Mcrouter
[08:25:05] PROBLEM - configured eth on snapshot1008 is CRITICAL: connect to address 10.64.16.16 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[08:25:39] PROBLEM - nutcracker process on snapshot1008 is CRITICAL: connect to address 10.64.16.16 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker
[08:25:39] PROBLEM - dhclient process on snapshot1008 is CRITICAL: connect to address 10.64.16.16 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[08:25:53] PROBLEM - Disk space on snapshot1008 is CRITICAL: connect to address 10.64.16.16 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=snapshot1008&var-datasource=eqiad+prometheus/ops
[08:25:59] PROBLEM - MD RAID on snapshot1008 is CRITICAL: connect to address 10.64.16.16 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[08:27:43] PROBLEM - puppet last run on snapshot1008 is CRITICAL: connect to address 10.64.16.16 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:30:08] <_joe_> looks like snapshot1008 is not in a good state
[08:33:50] huh
[08:34:05] i'll have a look
[08:35:44] <_joe_> thanks
[08:36:41] RECOVERY - Disk space on snapshot1008 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=snapshot1008&var-datasource=eqiad+prometheus/ops
[08:36:45] RECOVERY - MD RAID on snapshot1008 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[08:36:53] RECOVERY - nutcracker socket on snapshot1008 is OK: TCP OK - 0.000 second response time on socket /var/run/nutcracker/redis_eqiad.sock https://wikitech.wikimedia.org/wiki/Nutcracker
[08:36:59] RECOVERY - DPKG on snapshot1008 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[08:37:11] RECOVERY - Check size of conntrack table on snapshot1008 is OK: OK: nf_conntrack is 9 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[08:37:17] RECOVERY - Check whether ferm is active by checking the default input chain on snapshot1008 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[08:37:29] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:37:31] RECOVERY - mcrouter process on snapshot1008 is OK: PROCS OK: 1 process with UID = 115 (mcrouter), command name mcrouter https://wikitech.wikimedia.org/wiki/Mcrouter
[08:37:41] RECOVERY - configured eth on snapshot1008 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[08:38:10] <_joe_> you clearly intimidated that poor server
[08:38:15] RECOVERY - nutcracker process on snapshot1008 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker
[08:38:15] RECOVERY - dhclient process on snapshot1008 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[08:38:20] Notice: /Stage[main]/Nrpe/Base::Service_unit[nagios-nrpe-server]/Service[nagios-nrpe-server]/ensure: ensure changed 'stopped' to 'running'
[08:38:25] oom killer
[08:38:30] everything's normal now
[08:38:40] it would have been fine at the next puppet run in 10 minutes
[08:39:19] RECOVERY - puppet last run on snapshot1008 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:41:09] looks like a php memory leak in a test sdc dump process set this off, I'll report that but we have a workaround anyways. and it was a one-off test so we won't have a repeat
[08:57:57] (PS1) Filippo Giunchedi: nagios: add PD to sms contact group [puppet] - https://gerrit.wikimedia.org/r/563963 (https://phabricator.wikimedia.org/T236075)
[09:03:33] (CR) Filippo Giunchedi: [C: +2] nagios: add PD to sms contact group [puppet] - https://gerrit.wikimedia.org/r/563963 (https://phabricator.wikimedia.org/T236075) (owner: Filippo Giunchedi)
[09:09:17] Operations, Traffic, Patch-For-Review: track NIC firmware version numbers across the fleet - https://phabricator.wikimedia.org/T236744 (ayounsi) This might help issues like T242481
[09:09:23] Operations, ops-eqiad, SRE-swift-storage: Degraded RAID on ms-be1039 - https://phabricator.wikimedia.org/T242511 (fgiunchedi) a:Jclark-ctr @Jclark-ctr @Cmjohnson host is under warranty for another month according to netbox, please order a replacement for the failed 4TB disk (led is blinking), tha...
[09:09:59] ACKNOWLEDGEMENT - Device not healthy -SMART- on ms-be1039 is CRITICAL: cluster=swift device=cciss,13 instance=ms-be1039:9100 job=node site=eqiad Filippo Giunchedi https://phabricator.wikimedia.org/T242511 https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1039&var-datasource=eqiad+prometheus/ops
[09:09:59] ACKNOWLEDGEMENT - Disk space on ms-be1039 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdd1 is not accessible: Input/output error Filippo Giunchedi https://phabricator.wikimedia.org/T242511 https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1039&var-datasource=eqiad+prometheus/ops
[09:10:16] (CR) Giuseppe Lavagetto: "> Patch Set 2:" (12 comments) [docker-images/docker-report] - https://gerrit.wikimedia.org/r/563482 (owner: Giuseppe Lavagetto)
[09:12:45] (CR) Giuseppe Lavagetto: Add a registryctl command-line utility (5 comments) [docker-images/docker-report] - https://gerrit.wikimedia.org/r/563482 (owner: Giuseppe Lavagetto)
[09:13:18] (PS3) Giuseppe Lavagetto: Add a registryctl command-line utility [docker-images/docker-report] - https://gerrit.wikimedia.org/r/563482
[09:14:32] (CR) jerkins-bot: [V: -1] Add a registryctl command-line utility [docker-images/docker-report] - https://gerrit.wikimedia.org/r/563482 (owner: Giuseppe Lavagetto)
[09:16:11] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime
[09:16:12] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:16] RECOVERY - Disk space on ms-be1035 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1035&var-datasource=eqiad+prometheus/ops
[09:20:38] RECOVERY - MD RAID on ms-be1035 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[09:20:50] RECOVERY - Check systemd state on ms-be1035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:25:22] RECOVERY - puppet last run on ms-be1035 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:27:55] Operations, ops-eqiad, SRE-swift-storage: Degraded RAID on ms-be1035 - https://phabricator.wikimedia.org/T242471 (fgiunchedi) Open→Resolved a:fgiunchedi Looks like the controller freaked out (T141756), firmware upgraded and rebooted.
[09:29:37] (PS2) Elukey: Add role to mc-gp200[1-3] [puppet] - https://gerrit.wikimedia.org/r/563381 (https://phabricator.wikimedia.org/T239249)
[09:31:57] (CR) jerkins-bot: [V: -1] Add role to mc-gp200[1-3] [puppet] - https://gerrit.wikimedia.org/r/563381 (https://phabricator.wikimedia.org/T239249) (owner: Elukey)
[09:32:51] (CR) Elukey: Add role to mc-gp200[1-3] (1 comment) [puppet] - https://gerrit.wikimedia.org/r/563381 (https://phabricator.wikimedia.org/T239249) (owner: Elukey)
[09:37:46] Operations, ops-codfw, DBA: Missing Network drivers from Stretch and Buster installer for BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (Marostegui) I have tried to look thru the BIOS to find a way to disable the 10G capability but I have found nothing. On the installer...
[09:38:01] !log rename Ganeti group in ulsfo from "default" to "row_1"
[09:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:19] ^ vgutierrez
[09:38:40] nice
[09:38:44] thanks :)
[09:40:32] (PS1) Elukey: role::memcached: move interface::rps settings to dedicated profile [puppet] - https://gerrit.wikimedia.org/r/563969 (https://phabricator.wikimedia.org/T239249)
[09:42:44] (CR) Elukey: "elukey@cumin1001:~$ sudo cumin 'c:role::memcached' 'ls -l' --dry-run" [puppet] - https://gerrit.wikimedia.org/r/563969 (https://phabricator.wikimedia.org/T239249) (owner: Elukey)
[09:43:02] (PS4) Giuseppe Lavagetto: Add a registryctl command-line utility [docker-images/docker-report] - https://gerrit.wikimedia.org/r/563482
[09:54:44] Operations, ops-codfw, DBA: Missing Network drivers from Stretch and Buster installer for BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (MoritzMuehlenhoff) For a test maybe disable Puppet on the install* servers and add ` d-i netcfg/choose_interface select eno3 ` to th...
[09:57:14] (CR) Muehlenhoff: "Ack, I wasn't aware of that. I'll rework the patch to use apt::package_from_component, then." [puppet] - https://gerrit.wikimedia.org/r/563472 (owner: Muehlenhoff)
[10:02:55] (CR) Elukey: [C: +2] role::memcached: move interface::rps settings to dedicated profile [puppet] - https://gerrit.wikimedia.org/r/563969 (https://phabricator.wikimedia.org/T239249) (owner: Elukey)
[10:05:48] (PS3) Elukey: Add role to mc-gp200[1-3] [puppet] - https://gerrit.wikimedia.org/r/563381 (https://phabricator.wikimedia.org/T239249)
[10:05:52] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 35399112 and 9 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:07:59] (CR) jerkins-bot: [V: -1] Add role to mc-gp200[1-3] [puppet] - https://gerrit.wikimedia.org/r/563381 (https://phabricator.wikimedia.org/T239249) (owner: Elukey)
[10:09:27] (CR) Elukey: ">" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/563381 (https://phabricator.wikimedia.org/T239249) (owner: Elukey)
[10:09:30] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 11400 and 66 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:11:36] (PS1) Marostegui: install_server: Install es2024 with stretch [puppet] - https://gerrit.wikimedia.org/r/563975
[10:13:12] (PS1) Filippo Giunchedi: cacheproxy: default tx ring [puppet] - https://gerrit.wikimedia.org/r/563976
[10:13:44] (CR) Filippo Giunchedi: "Note I'm not sure 1024 is a sensible default!" [puppet] - https://gerrit.wikimedia.org/r/563976 (owner: Filippo Giunchedi)
[10:13:59] (CR) Marostegui: [C: +2] install_server: Install es2024 with stretch [puppet] - https://gerrit.wikimedia.org/r/563975 (owner: Marostegui)
[10:14:21] (CR) Elukey: "recheck" [puppet] - https://gerrit.wikimedia.org/r/563381 (https://phabricator.wikimedia.org/T239249) (owner: Elukey)
[10:16:38] (CR) jerkins-bot: [V: -1] Add role to mc-gp200[1-3] [puppet] - https://gerrit.wikimedia.org/r/563381 (https://phabricator.wikimedia.org/T239249) (owner: Elukey)
[10:18:12] (PS1) Filippo Giunchedi: varnish: use syslog for varnishlog consumers [puppet] - https://gerrit.wikimedia.org/r/563977 (https://phabricator.wikimedia.org/T227108)
[10:18:24] Operations, ops-codfw, DBA: Missing Network drivers from Stretch and Buster installer for BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (Marostegui) Thanks @MoritzMuehlenhoff - I have merged https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/563975/ to make sure we...
[10:22:11] (PS1) Elukey: role::memcached: move base::mysterious_sysctl into the perf profile [puppet] - https://gerrit.wikimedia.org/r/563979 (https://phabricator.wikimedia.org/T239249)
[10:24:00] (PS2) Elukey: role::memcached: move base::mysterious_sysctl into the perf profile [puppet] - https://gerrit.wikimedia.org/r/563979 (https://phabricator.wikimedia.org/T239249)
[10:26:00] (CR) Elukey: [C: +2] "Noop: https://puppet-compiler.wmflabs.org/compiler1001/20324/" [puppet] - https://gerrit.wikimedia.org/r/563979 (https://phabricator.wikimedia.org/T239249) (owner: Elukey)
[10:30:36] (CR) Muehlenhoff: [C: +1] "Looks great!" (1 comment) [software/debmonitor] - https://gerrit.wikimedia.org/r/563606 (owner: Volans)
[10:30:57] (PS4) Elukey: Add role to mc-gp200[1-3] [puppet] - https://gerrit.wikimedia.org/r/563381 (https://phabricator.wikimedia.org/T239249)
[10:31:44] (PS8) Vgutierrez: ganeti: Add esams, ulsfo and eqsin clusters and rows [software/spicerack] - https://gerrit.wikimedia.org/r/563132
[10:32:32] (PS2) Gehel: airflow: Properly pass quoted cli arguments in wrapper [puppet] - https://gerrit.wikimedia.org/r/563280 (owner: EBernhardson)
[10:32:49] (CR) Vgutierrez: [C: +1] "@volans now that the rows have been renamed to follow the "row_X" nomenclature, we can merge this :)" [software/spicerack] - https://gerrit.wikimedia.org/r/563132 (owner: Vgutierrez)
[10:33:13] Operations, Traffic: Setup netconsole on upload@esams hosts - https://phabricator.wikimedia.org/T242579 (ema)
[10:33:22] (CR) Volans: [C: -1] "A couple of questions inline on possible errors and a deprecation warning ;)" (5 comments) [puppet] - https://gerrit.wikimedia.org/r/562852 (owner: Jbond)
[10:33:35] Operations, Traffic: Setup netconsole on upload@esams hosts - https://phabricator.wikimedia.org/T242579 (ema) p:Triage→Normal
[10:34:32] (CR) Elukey: [C: +2] Add role to mc-gp200[1-3] [puppet] - https://gerrit.wikimedia.org/r/563381 (https://phabricator.wikimedia.org/T239249) (owner: Elukey)
[10:34:45] (CR) Ema: [C: +1] acme_chief: Add smokeping certificate [puppet] - https://gerrit.wikimedia.org/r/552398 (https://phabricator.wikimedia.org/T238900) (owner: Vgutierrez)
[10:36:34] (CR) Volans: "@vgutierrez: ack, was just waiting on a final ack on https://phabricator.wikimedia.org/T242412#5796386" [software/spicerack] - https://gerrit.wikimedia.org/r/563132 (owner: Vgutierrez)
[10:36:57] (CR) Volans: [C: +2] binary packages: optimize queries [software/debmonitor] - https://gerrit.wikimedia.org/r/563606 (owner: Volans)
[10:39:15] (Merged) jenkins-bot: binary packages: optimize queries [software/debmonitor] - https://gerrit.wikimedia.org/r/563606 (owner: Volans)
[10:40:57] Operations: Unknown address in security alias - https://phabricator.wikimedia.org/T242580 (LarsWirzenius)
[10:42:33] Operations, Traffic, netops, Patch-For-Review: add TLS support for smokeping.wikimedia.org - https://phabricator.wikimedia.org/T238900 (Volans) I was made aware that the two above comments are contradictory. I don't recall the why of my above comment or any limitation on the 2 certs approach. I a...
[10:43:57] (PS2) Vgutierrez: acme_chief: Add smokeping certificate [puppet] - https://gerrit.wikimedia.org/r/552398 (https://phabricator.wikimedia.org/T238900)
[10:44:53] (PS2) Muehlenhoff: gerrit: Switch to apt::package_from_component [puppet] - https://gerrit.wikimedia.org/r/563472
[10:45:48] (PS12) Jbond: ldap - idp: add ldap helper script for enabling u2f on cas [puppet] - https://gerrit.wikimedia.org/r/562852
[10:45:56] (CR) Vgutierrez: [C: +2] acme_chief: Add smokeping certificate [puppet] - https://gerrit.wikimedia.org/r/552398 (https://phabricator.wikimedia.org/T238900) (owner: Vgutierrez)
[10:46:22] (CR) Jbond: "thanks updated and responses inline" (5 comments) [puppet] - https://gerrit.wikimedia.org/r/562852 (owner: Jbond)
[10:47:43] (CR) Ema: [C: +1] Add ncredir400[12] DNS records [dns] - https://gerrit.wikimedia.org/r/563401 (https://phabricator.wikimedia.org/T242321) (owner: Vgutierrez)
[10:48:16] Operations: Unknown address in security alias - https://phabricator.wikimedia.org/T242580 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff Thanks! I updated the alias to use jcross@wikimedia.org
[10:48:42] (CR) Ema: [C: +1] Pool esams for ncredir service [dns] - https://gerrit.wikimedia.org/r/563382 (https://phabricator.wikimedia.org/T242321) (owner: Vgutierrez)
[10:50:09] (CR) Vgutierrez: [C: +2] Pool esams for ncredir service [dns] - https://gerrit.wikimedia.org/r/563382 (https://phabricator.wikimedia.org/T242321) (owner: Vgutierrez)
[10:51:20] !log pooling esams for ncredir - T242321
[10:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:25] T242321: Provide non-canonical-redirect service from every datacenter - https://phabricator.wikimedia.org/T242321
[10:51:55] Operations, ops-codfw, DBA: Missing Network drivers from Stretch and Buster installer for BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (Marostegui) The hack didn't work. It keeps choosing a different NIC than eno3, which unfortunately looks like the 10G one. I tried a di...
[10:53:02] 10Operations, 10Wikimedia-Logstash, 10observability, 10User-fgiunchedi: Move cassandra logging to logging pipeline - https://phabricator.wikimedia.org/T242585 (10fgiunchedi) [10:58:25] (03PS1) 10Volans: Release v0.2.3 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/563980 [11:00:49] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/563980 (owner: 10Volans) [11:01:06] (03CR) 10Volans: [V: 03+2 C: 03+2] Release v0.2.3 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/563980 (owner: 10Volans) [11:03:22] !log volans@deploy1001 Started deploy [debmonitor/deploy@265059b]: Release v0.2.3 [11:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:32] !log volans@deploy1001 Finished deploy [debmonitor/deploy@265059b]: Release v0.2.3 (duration: 01m 10s) [11:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:54] (03PS3) 10Gehel: airflow: Properly pass quoted cli arguments in wrapper [puppet] - 10https://gerrit.wikimedia.org/r/563280 (owner: 10EBernhardson) [11:12:18] (03CR) 10Gehel: [C: 03+2] airflow: Properly pass quoted cli arguments in wrapper [puppet] - 10https://gerrit.wikimedia.org/r/563280 (owner: 10EBernhardson) [11:17:08] 10Operations, 10ops-codfw: (Need By: Jan 15) codfw: rack/setup/install mc-gp200[123].codfw.wmnet - https://phabricator.wikimedia.org/T239249 (10elukey) [11:17:22] 10Operations, 10serviceops: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T240684 (10elukey) [11:27:58] (03CR) 10Muehlenhoff: codesearch: Install docker-ce from thirdparty/kubeadm-k8s component (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/563633 (owner: 10Legoktm) [11:30:04] jan_drewniak: Time to snap out of that daydream and deploy Wikimedia Portals Update. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200113T1130). 
[11:35:06] (03PS3) 10Vgutierrez: Add ncredir400[12] DNS records [dns] - 10https://gerrit.wikimedia.org/r/563401 (https://phabricator.wikimedia.org/T242321) [11:36:34] (03CR) 10Vgutierrez: [C: 03+2] Add ncredir400[12] DNS records [dns] - 10https://gerrit.wikimedia.org/r/563401 (https://phabricator.wikimedia.org/T242321) (owner: 10Vgutierrez) [11:42:44] !log upgrading remaining mwdebug* servers and mw1261 to PHP 7.2.26 T241222 [11:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:48] T241222: Update Wikimedia production to PHP 7.2.26 - https://phabricator.wikimedia.org/T241222 [11:51:49] 10Operations, 10Citoid, 10Release Pipeline, 10Services, 10serviceops: Migrate citoid and zotero services to helm ( scap-helm is deprecated ) - https://phabricator.wikimedia.org/T233702 (10Mvolz) @akosiaris is this done then? [11:52:30] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/563988 (https://phabricator.wikimedia.org/T128546) [11:54:20] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/563988 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:55:30] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/563988 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:57:04] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:558017| Bumping portals to master (563985)]] (duration: 00m 55s) [11:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:00] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:558017| Bumping portals to master (563985)]] (duration: 00m 55s) [11:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: May I have your attention please! 
European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200113T1200) [12:00:04] No GERRIT patches in the queue for this window AFAICS. [12:00:39] o/ [12:00:51] I don’t see any patches in the queue either :) [12:01:18] * Urbanecm claims the window [12:01:46] (03PS9) 10ArielGlenn: Set up dumps for mediainfo RDF generation [puppet] - 10https://gerrit.wikimedia.org/r/516444 (https://phabricator.wikimedia.org/T221917) (owner: 10Smalyshev) [12:02:44] jan_drewniak: /srv/mediawiki-staging is dirty, could you have a look? [12:02:54] (03CR) 10ArielGlenn: [C: 03+2] Set up dumps for mediainfo RDF generation [puppet] - 10https://gerrit.wikimedia.org/r/516444 (https://phabricator.wikimedia.org/T221917) (owner: 10Smalyshev) [12:03:53] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [12:03:53] 10Operations, 10Traffic: Setup netconsole on upload@esams hosts - https://phabricator.wikimedia.org/T242579 (10fgiunchedi) In case it is helpful: we can reuse the centrallog hosts in codfw/eqiad. For site-local netconsole instead we'll need to setup local syslog collectors anyways (on ganeti VMs) for network d... 
[12:03:55] (03CR) 10Urbanecm: [C: 03+2] Deploy partial blocks on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/563653 (https://phabricator.wikimedia.org/T242569) (owner: 10DannyS712) [12:04:55] (03Merged) 10jenkins-bot: Deploy partial blocks on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/563653 (https://phabricator.wikimedia.org/T242569) (owner: 10DannyS712) [12:06:26] * Urbanecm is deploying the patch, since he doesn't need to touch portals in any way [12:07:13] (03PS1) 10Vgutierrez: install_server,ncredir: Install ncredir400[12] [puppet] - 10https://gerrit.wikimedia.org/r/563990 (https://phabricator.wikimedia.org/T242321) [12:08:03] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: c7cf53c: Deploy partial blocks on enwiki (T242569) (duration: 00m 55s) [12:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:06] T242569: Deploy partial blocks on English wikipedia - https://phabricator.wikimedia.org/T242569 [12:08:13] !log EU SWAT done [12:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:19] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 22370424 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:12:23] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 85087656 and 6 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:12:36] (03CR) 10Filippo Giunchedi: "Friendly ping, can we move forward with this Halfak?" 
[puppet] - 10https://gerrit.wikimedia.org/r/502527 (https://phabricator.wikimedia.org/T213899) (owner: 10Herron) [12:13:09] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 4456 and 99 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:14:09] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 154952 and 75 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:19:57] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [12:46:42] (03PS1) 10Elukey: Revert "Revert "hue: add row limit threshold for hive queries"" [puppet] - 10https://gerrit.wikimedia.org/r/563996 [12:52:27] (03CR) 10Elukey: [C: 03+2] Revert "Revert "hue: add row limit threshold for hive queries"" [puppet] - 10https://gerrit.wikimedia.org/r/563996 (owner: 10Elukey) [13:00:45] 10Operations: Sort out plan for install* servers in edge sites - https://phabricator.wikimedia.org/T242602 (10MoritzMuehlenhoff) [13:04:48] (03CR) 10Alexandros Kosiaris: [C: 03+1] "+1ed. I can help with deploying this once @halfak +1s it." 
[puppet] - 10https://gerrit.wikimedia.org/r/502527 (https://phabricator.wikimedia.org/T213899) (owner: 10Herron) [13:07:34] 10Operations, 10Performance-Team, 10Traffic: Production load.php spends ~ 10% time doing output compression within PHP - https://phabricator.wikimedia.org/T242478 (10ema) [13:11:34] !log upgrade mw canaries to PHP 7.2.26 T241222 [13:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:38] T241222: Update Wikimedia production to PHP 7.2.26 - https://phabricator.wikimedia.org/T241222 [13:11:51] 10Operations: Unknown address in security alias - https://phabricator.wikimedia.org/T242580 (10Reedy) I really don't know why OIT don't keep the -ctr suffix as an alias when people transition from being a contractor... [13:15:06] (03PS2) 10Arturo Borrero Gonzalez: toolforge: new k8s: prometheus: scrape metrics from each individual ingress pod [puppet] - 10https://gerrit.wikimedia.org/r/562838 (https://phabricator.wikimedia.org/T237643) [13:17:24] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: new k8s: prometheus: scrape metrics from each individual ingress pod [puppet] - 10https://gerrit.wikimedia.org/r/562838 (https://phabricator.wikimedia.org/T237643) (owner: 10Arturo Borrero Gonzalez) [13:18:49] (03CR) 10RLazarus: [C: 03+1] "> Patch Set 2:" (031 comment) [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/563482 (owner: 10Giuseppe Lavagetto) [13:25:55] 10Operations, 10Citoid, 10Release Pipeline, 10Services, 10serviceops: Migrate citoid and zotero services to helmfile ( scap-helm is deprecated ) - https://phabricator.wikimedia.org/T233702 (10akosiaris) [13:27:44] 10Operations, 10SRE-tools, 10docker-pkg, 10serviceops: Report image metadata to debmonitor - https://phabricator.wikimedia.org/T241206 (10Joe) 05Open→03Resolved [13:27:51] 10Operations, 10Citoid, 10Release Pipeline, 10Services, 10serviceops: Migrate citoid and zotero services to helmfile ( scap-helm is deprecated ) - 
https://phabricator.wikimedia.org/T233702 (10akosiaris) > Helm files were added but then had to be reverted as the build no longer worked. > Addition of helm... [13:28:59] (03PS1) 10Elukey: statistics::site::stats: create symlink only when dir is ready [puppet] - 10https://gerrit.wikimedia.org/r/564004 [13:29:00] 10Operations, 10Performance-Team, 10Traffic: Production load.php spends ~ 10% time doing output compression within PHP - https://phabricator.wikimedia.org/T242478 (10ema) >>! In T242478#5794025, @Krinkle wrote: > I recall that in the pre-ATS setup, we explicitly configured the interaction between applayer an... [13:29:30] (03CR) 10Elukey: [C: 03+2] statistics::site::stats: create symlink only when dir is ready [puppet] - 10https://gerrit.wikimedia.org/r/564004 (owner: 10Elukey) [13:32:30] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops: Remove obsoleted docker images - https://phabricator.wikimedia.org/T242604 (10Joe) [13:33:04] (03PS1) 10Ema: Revert "Revert "ATS: unset Accept-Encoding"" [puppet] - 10https://gerrit.wikimedia.org/r/564005 (https://phabricator.wikimedia.org/T242478) [13:42:25] (03PS3) 10Elukey: wikistats: serve the v2 version of the website by default [puppet] - 10https://gerrit.wikimedia.org/r/563508 (https://phabricator.wikimedia.org/T237752) [13:45:06] (03PS4) 10Elukey: wikistats: serve the v2 version of the website by default [puppet] - 10https://gerrit.wikimedia.org/r/563508 (https://phabricator.wikimedia.org/T237752) [13:45:51] (03CR) 10jerkins-bot: [V: 04-1] wikistats: serve the v2 version of the website by default [puppet] - 10https://gerrit.wikimedia.org/r/563508 (https://phabricator.wikimedia.org/T237752) (owner: 10Elukey) [13:46:45] yes Jenkins I know today I am a bad person [13:46:52] no need to keep remembering me that [13:46:55] (03PS5) 10Elukey: wikistats: serve the v2 version of the website by default [puppet] - 10https://gerrit.wikimedia.org/r/563508 (https://phabricator.wikimedia.org/T237752) 
[13:52:47] (03PS5) 10Giuseppe Lavagetto: Add a registryctl command-line utility [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/563482 (https://phabricator.wikimedia.org/T242604) [13:55:37] 10Operations, 10serviceops: No mw canary servers in codfw - https://phabricator.wikimedia.org/T242606 (10MoritzMuehlenhoff) [13:57:25] 10Operations, 10Citoid, 10Release Pipeline, 10Services, 10serviceops: Migrate citoid and zotero services to helmfile ( scap-helm is deprecated ) - https://phabricator.wikimedia.org/T233702 (10Mvolz) 05Open→03Resolved [13:59:16] 10Operations: Sort out plan for install* servers in edge sites - https://phabricator.wikimedia.org/T242602 (10BBlack) Seems like a good plan! [14:02:13] 10Operations, 10ops-codfw, 10DBA: Missing Network drivers from Stretch and Buster installer for BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (10Marostegui) Last test done with @ayounsi: I have removed the drivers `tg3` and `bnxt_en` from the OS: ` /bin # ip addr 1: lo: (03PS1) 10Muehlenhoff: Convert tftp/dhcp ferm rules to ferm services [puppet] - 10https://gerrit.wikimedia.org/r/564010 [14:09:02] (03PS2) 10Muehlenhoff: Convert tftp/dhcp ferm rules to ferm services [puppet] - 10https://gerrit.wikimedia.org/r/564010 [14:14:34] 10Operations, 10Thumbor, 10Wikimedia-Logstash, 10observability, and 2 others: Stream Thumbor logs to logstash - https://phabricator.wikimedia.org/T212946 (10fgiunchedi) Reviving this as part of this Q's OKRs to move services off logstash non-kafka inputs, I'll followup with patches to move to the localhost... [14:16:11] 10Operations, 10Phabricator, 10Release-Engineering-Team-TODO, 10Traffic, 10Release-Engineering-Team (Development services): Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (10Bmueller) Hey @faidon yes, you're right and that's the plan :-) @sro... 
[14:16:21] Urbanecm: Sorry I just saw the message [14:18:16] 10Operations, 10Wikimedia-Logstash, 10observability, 10User-fgiunchedi: Move thumbor to the logging pipeline - https://phabricator.wikimedia.org/T242609 (10fgiunchedi) [14:20:30] Oh I see what happened, I forgot to do a git submodule update this morning... [14:21:30] I'm going to go ahead and re-deploy that now since the deploy window is open [14:21:41] hh [14:21:43] heh even [14:22:28] (03CR) 10Ema: [C: 03+1] install_server,ncredir: Install ncredir400[12] [puppet] - 10https://gerrit.wikimedia.org/r/563990 (https://phabricator.wikimedia.org/T242321) (owner: 10Vgutierrez) [14:22:30] 10Operations, 10ops-codfw, 10DBA: Missing Network drivers from Stretch and Buster installer for BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (10Marostegui) Some more notes. The only way to ping the gateway seems to be using the 1G ports, and making sure the 10G aren't there. The... [14:22:35] (03PS4) 10Muehlenhoff: tor: switch to apt::package_from_component [puppet] - 10https://gerrit.wikimedia.org/r/563208 [14:23:02] (03CR) 10Muehlenhoff: tor: switch to apt::package_from_component (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/563208 (owner: 10Muehlenhoff) [14:23:09] (03CR) 10Jakob: [C: 03+2] Pin termbox chart versions at 0.0.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/563438 (owner: 10Tarrow) [14:24:01] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:558017| Bumping portals to master (563985)]] (duration: 00m 56s) [14:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:18] (03CR) 10Ottomata: "https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas" [deployment-charts] - 10https://gerrit.wikimedia.org/r/562623 (https://phabricator.wikimedia.org/T240985) (owner: 10Ottomata) [14:24:56] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:558017| 
Bumping portals to master (563985)]] (duration: 00m 55s) [14:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:54] thank you jan_drewniak [14:30:07] 10Operations, 10Wikimedia-Logstash, 10observability: Migrate services using deprecated Gelf logstash input to Kafka enabled logging pipeline - https://phabricator.wikimedia.org/T225122 (10fgiunchedi) [14:30:09] 10Operations, 10Wikimedia-Logstash, 10observability, 10User-fgiunchedi: Deprecate all non-Kafka logstash inputs - https://phabricator.wikimedia.org/T227080 (10fgiunchedi) [14:31:15] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10serviceops-radar, 10Core Platform Team Workboards (Clinic Duty Team): Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (10hnowlan) Thanks Daniel! I've created a new key and uploaded it here http://keys.gnupg.net/pks/lookup?o... [14:33:52] (03PS1) 10Vgutierrez: redirects.dat: Funnel fixcopyright.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/564020 (https://phabricator.wikimedia.org/T239141) [14:43:47] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:44:23] 10Operations: Sort out plan for install* servers in edge sites - https://phabricator.wikimedia.org/T242602 (10MoritzMuehlenhoff) Looking at the data usage of install* sans the repository and related files (like firmware images) we'd only need 20G for the install* servers and still have plenty of head room (the T... 
[14:50:14] (03PS1) 10Jhedden: Revert "openstack: change cloudvirt1022 to ceph based virt role" [puppet] - 10https://gerrit.wikimedia.org/r/564031 [14:50:51] (03PS2) 10Jhedden: Revert "openstack: change cloudvirt1022 to ceph based virt role" [puppet] - 10https://gerrit.wikimedia.org/r/564031 [14:51:55] (03CR) 10Jhedden: [C: 03+2] Revert "openstack: change cloudvirt1022 to ceph based virt role" [puppet] - 10https://gerrit.wikimedia.org/r/564031 (owner: 10Jhedden) [14:55:30] !log remove hassaleh in Ganeti T224567 [14:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:34] T224567: decom debug proxies (was: Migrate debug proxies to Stretch/Buster) - https://phabricator.wikimedia.org/T224567 [14:57:31] !log jmm@cumin2001 START - Cookbook sre.hosts.decommission [14:57:31] PROBLEM - Host hassaleh is DOWN: PING CRITICAL - Packet loss = 100% [14:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:25] !log jmm@cumin2001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [14:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:28] 10Operations, 10serviceops: decom debug proxies (was: Migrate debug proxies to Stretch/Buster) - https://phabricator.wikimedia.org/T224567 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2001 for hosts: `hassaleh.codfw.wmnet` - hassaleh.codfw.wmnet (**FAIL**) - Downtimed host on... [14:59:00] moritzm: interesting that icinga-wm has complained... 
[15:00:05] !log joal@deploy1001 Started deploy [analytics/hdfs-tools/deploy@a1b4d34]: Deploy hdfs-rsync bug correction [15:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:13] !log joal@deploy1001 Finished deploy [analytics/hdfs-tools/deploy@a1b4d34]: Deploy hdfs-rsync bug correction (duration: 00m 08s) [15:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:24] yeah, it printed that the downtime was set, I'll dig in Icinga/spicerack logs what happened in a bit [15:00:33] already on it [15:01:42] was the host already down? [15:02:18] first ping down was at 14:56:57, and disk and puppet unknowns before that at 14:56:41 [15:04:26] completed downtime at 14:57:55,553, but Icinga might take a few seconds to actually act on it [15:04:32] yeah, the host was already removed in Ganeti [15:04:55] maybe a minute before the decom cookbook ran [15:05:04] ah [15:05:10] then it explains it [15:08:53] but the downtime is set on icinga1001, why does it make a difference? 
[15:08:53] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:12:15] that if you shutdown the VM before running the decom script I guess that icinga noticed the down before the cookbook could downtime it [15:12:25] at least that's what I got from the timeline, but correct me if I got it wrong [15:13:00] the other option is a race condition from where the cookbook issue the downtime and how slow is icinga to register it [15:13:16] it's a non-sync operation [15:13:26] async even :D [15:16:08] volans: ack, makes total sense ofc [15:18:05] !log jmm@cumin2001 START - Cookbook sre.hosts.decommission [15:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:32] fixed the wikitech docs [15:18:34] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [15:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:42] !log jmm@cumin2001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [15:18:43] thanks! [15:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:46] 10Operations, 10serviceops: decom debug proxies (was: Migrate debug proxies to Stretch/Buster) - https://phabricator.wikimedia.org/T224567 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2001 for hosts: `hassium.eqiad.wmnet` - hassium.eqiad.wmnet (**FAIL**) - Downtimed host on I... [15:19:08] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . 
[15:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:37] !log remove hassium in Ganeti T224567 [15:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:40] T224567: decom debug proxies (was: Migrate debug proxies to Stretch/Buster) - https://phabricator.wikimedia.org/T224567 [15:21:25] 10Operations, 10Traffic: ats-tls is having issues when varnish-fe goes away - https://phabricator.wikimedia.org/T242620 (10Vgutierrez) [15:21:52] 10Operations, 10Phabricator, 10Release-Engineering-Team-TODO, 10Traffic, 10Release-Engineering-Team (Development services): Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (10srodlund) 05Stalled→03Declined Declining this task. Here is a l... [15:22:09] (03PS2) 10Muehlenhoff: install_server: remove hassaleh and hassium [puppet] - 10https://gerrit.wikimedia.org/r/563566 (https://phabricator.wikimedia.org/T242456) (owner: 10Dzahn) [15:23:15] (03PS2) 10Vgutierrez: install_server,ncredir: Install ncredir400[12] [puppet] - 10https://gerrit.wikimedia.org/r/563990 (https://phabricator.wikimedia.org/T242321) [15:24:29] (03CR) 10Muehlenhoff: [C: 03+2] install_server: remove hassaleh and hassium [puppet] - 10https://gerrit.wikimedia.org/r/563566 (https://phabricator.wikimedia.org/T242456) (owner: 10Dzahn) [15:24:35] (03CR) 10Vgutierrez: [C: 03+2] install_server,ncredir: Install ncredir400[12] [puppet] - 10https://gerrit.wikimedia.org/r/563990 (https://phabricator.wikimedia.org/T242321) (owner: 10Vgutierrez) [15:24:56] it looks like I'm competing (and losing) against moritzm to merge that [15:25:15] moritzm: let me know when you've finished merging your CR :) [15:25:43] (03PS2) 10Muehlenhoff: site: remove hassaleh and hassium [puppet] - 10https://gerrit.wikimedia.org/r/563567 (https://phabricator.wikimedia.org/T242456) (owner: 10Dzahn) [15:25:58] sorry :-) [15:26:20] my dhcp change is merged, please go ahead 
[15:26:26] thx [15:26:39] (03PS3) 10Vgutierrez: install_server,ncredir: Install ncredir400[12] [puppet] - 10https://gerrit.wikimedia.org/r/563990 (https://phabricator.wikimedia.org/T242321) [15:30:00] merged :D [15:30:56] (03PS3) 10Muehlenhoff: site: remove hassaleh and hassium [puppet] - 10https://gerrit.wikimedia.org/r/563567 (https://phabricator.wikimedia.org/T242456) (owner: 10Dzahn) [15:31:48] 10Operations, 10Analytics: Grant access to archiva-deployers for zpapierski - https://phabricator.wikimedia.org/T242622 (10dcausse) [15:35:14] 10Operations, 10Analytics: Grant access to archiva-deployers for mstyles - https://phabricator.wikimedia.org/T242624 (10dcausse) [15:37:38] (03CR) 10Muehlenhoff: [C: 03+2] site: remove hassaleh and hassium [puppet] - 10https://gerrit.wikimedia.org/r/563567 (https://phabricator.wikimedia.org/T242456) (owner: 10Dzahn) [15:40:51] (03PS1) 10Vgutierrez: Serve smokeping.wm.o directly from netmon1002 [dns] - 10https://gerrit.wikimedia.org/r/564045 (https://phabricator.wikimedia.org/T238900) [15:40:55] (03PS1) 10Muehlenhoff: Remove debug proxy roles/classes [puppet] - 10https://gerrit.wikimedia.org/r/564044 (https://phabricator.wikimedia.org/T224567) [15:41:58] 10Operations, 10serviceops, 10Patch-For-Review: decom debug proxies (was: Migrate debug proxies to Stretch/Buster) - https://phabricator.wikimedia.org/T224567 (10MoritzMuehlenhoff) [15:42:42] (03PS3) 10Muehlenhoff: remove hassium.eqiad.wmnet and hassaleh.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/563570 (https://phabricator.wikimedia.org/T242456) (owner: 10Dzahn) [15:42:46] (03PS1) 10Vgutierrez: smokeping: Serve traffic directly and using TLS [puppet] - 10https://gerrit.wikimedia.org/r/564046 (https://phabricator.wikimedia.org/T238900) [15:43:26] (03CR) 10Muehlenhoff: [C: 03+2] remove hassium.eqiad.wmnet and hassaleh.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/563570 (https://phabricator.wikimedia.org/T242456) (owner: 10Dzahn) [15:43:28] (03CR) 10jerkins-bot: [V: 
04-1] smokeping: Serve traffic directly and using TLS [puppet] - 10https://gerrit.wikimedia.org/r/564046 (https://phabricator.wikimedia.org/T238900) (owner: 10Vgutierrez) [15:44:37] (03PS2) 10Vgutierrez: smokeping: Serve traffic directly and using TLS [puppet] - 10https://gerrit.wikimedia.org/r/564046 (https://phabricator.wikimedia.org/T238900) [15:44:46] 10Operations, 10Analytics: Grant access to archiva-deployers for zpapierski - https://phabricator.wikimedia.org/T242622 (10Gehel) 05Open→03Resolved a:03Gehel access granted [15:45:10] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [15:45:16] 10Operations, 10serviceops, 10Patch-For-Review: decom debug proxies (was: Migrate debug proxies to Stretch/Buster) - https://phabricator.wikimedia.org/T224567 (10MoritzMuehlenhoff) 05Open→03Resolved hassium/hassaleh have been retired. [15:45:26] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [15:45:56] 10Operations, 10Analytics: Grant access to archiva-deployers for mstyles - https://phabricator.wikimedia.org/T242624 (10Gehel) 05Open→03Declined @Mstyles is already a member of that group [15:46:23] 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission hassaleh.codfw.wmnet - https://phabricator.wikimedia.org/T242457 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Removed via T224567 (we don't need full decom tasks for Ganeti VMs) [15:46:28] 10Operations, 10serviceops, 10Patch-For-Review: decom debug proxies (was: Migrate debug proxies to Stretch/Buster) - https://phabricator.wikimedia.org/T224567 (10MoritzMuehlenhoff) [15:46:36] 10Operations, 10serviceops, 10Patch-For-Review: decom debug proxies (was: Migrate debug proxies to Stretch/Buster) - https://phabricator.wikimedia.org/T224567 (10MoritzMuehlenhoff) [15:46:38] 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission 
hassium.eqiad.wmnet - https://phabricator.wikimedia.org/T242456 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Removed via T224567 (we don't need full decom tasks for Ganeti VMs) [15:51:57] (03CR) 10Jhedden: [C: 03+2] openstack: add cloudvirt1022 back into scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/564048 (https://phabricator.wikimedia.org/T225320) (owner: 10Jhedden) [15:56:24] !log installing cyrus-sasl security updates [15:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:31] (03Abandoned) 10Herron: apply profile::base::firewall to default nodes [puppet] - 10https://gerrit.wikimedia.org/r/562856 (owner: 10Herron) [15:58:06] 10Operations, 10Traffic: ats-tls is having issues when varnish-fe goes away - https://phabricator.wikimedia.org/T242620 (10Vgutierrez) [16:00:21] (03PS1) 10Vgutierrez: Add ncredir-lb.ulsfo.wikimedia.org DNS records [dns] - 10https://gerrit.wikimedia.org/r/564051 (https://phabricator.wikimedia.org/T242321) [16:03:05] (03PS1) 10Ottomata: [POC] Add service.name as label to use when matching a k8s service [deployment-charts] - 10https://gerrit.wikimedia.org/r/564052 [16:06:49] (03CR) 10Volans: "Will we remove manually the files and changes generated by these resources or we need to first absent them?" 
[puppet] - 10https://gerrit.wikimedia.org/r/564044 (https://phabricator.wikimedia.org/T224567) (owner: 10Muehlenhoff) [16:08:11] (03PS1) 10MarcoAurelio: Restore contentadmin ability to manage abuse filters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564053 (https://phabricator.wikimedia.org/T242593) [16:09:28] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 44443128 and 6 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:09:57] (03CR) 10Ottomata: [POC] Add service.name as label to use when matching a k8s service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/564052 (owner: 10Ottomata) [16:11:21] (03CR) 10Ottomata: "I'm trying to keep the release label to always be .Release.Name, and matching the service using a new label 'service', which corresponds t" [deployment-charts] - 10https://gerrit.wikimedia.org/r/564052 (owner: 10Ottomata) [16:11:46] (03PS2) 10Alexandros Kosiaris: Deduplicate cluster-helmfile.sh [deployment-charts] - 10https://gerrit.wikimedia.org/r/559106 [16:11:48] (03PS2) 10Alexandros Kosiaris: cluster-helmfile: Add a simple sleep 1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/559107 [16:11:50] (03PS9) 10Alexandros Kosiaris: Switch eqiad calico controller to the new etcd cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/558473 (https://phabricator.wikimedia.org/T239835) [16:11:52] (03PS1) 10Alexandros Kosiaris: mathoid: Rework the canary approach [deployment-charts] - 10https://gerrit.wikimedia.org/r/564054 [16:11:57] (03CR) 10Ottomata: "> Maybe we can get this from helmfile somehow?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/564052 (owner: 10Ottomata) [16:12:40] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 16896 and 63 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:13:59] (03CR) 10Alexandros Kosiaris: nodejs10: Add buster image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/563185 (https://phabricator.wikimedia.org/T237911) (owner: 10Alexandros Kosiaris) [16:14:56] (03CR) 10Muehlenhoff: "We took the big hammer and removed the entire Ganeti instances that carried these resources (hassium/hassaleh) :-)" [puppet] - 10https://gerrit.wikimedia.org/r/564044 (https://phabricator.wikimedia.org/T224567) (owner: 10Muehlenhoff) [16:14:59] (03CR) 10Alexandros Kosiaris: [C: 03+2] cluster-helmfile: Add a simple sleep 1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/559107 (owner: 10Alexandros Kosiaris) [16:16:20] (03PS2) 10Alexandros Kosiaris: nodejs10: Add buster image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/563185 (https://phabricator.wikimedia.org/T237911) [16:19:32] (03CR) 10Ottomata: "> OOO, can we use the namespace? it is set to e.g 'eventgate-analytics'" [deployment-charts] - 10https://gerrit.wikimedia.org/r/564052 (owner: 10Ottomata) [16:20:16] (03PS2) 10Jforrester: [wikitech] Restore contentadmin ability to manage abuse filters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564053 (https://phabricator.wikimedia.org/T242593) (owner: 10MarcoAurelio) [16:27:42] (03CR) 10Jforrester: [C: 03+1] "Looks good, thank you." 
[puppet] - 10https://gerrit.wikimedia.org/r/564020 (https://phabricator.wikimedia.org/T239141) (owner: 10Vgutierrez) [16:36:38] 10Operations, 10Packaging, 10serviceops: package requirements for upgrading deployment_servers to buster - https://phabricator.wikimedia.org/T242480 (10Dzahn) p:05Triage→03Low [16:36:46] 10Operations, 10Wikimedia-Mailing-lists: Please create a private mailing list: sectrainings - https://phabricator.wikimedia.org/T242343 (10Dzahn) a:03Dzahn [16:41:55] (03CR) 10Giuseppe Lavagetto: [C: 03+1] nodejs10: Add buster image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/563185 (https://phabricator.wikimedia.org/T237911) (owner: 10Alexandros Kosiaris) [16:43:22] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "sorry, just noticed..." (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/563185 (https://phabricator.wikimedia.org/T237911) (owner: 10Alexandros Kosiaris) [16:46:18] (03PS2) 10Alexandros Kosiaris: Pin termbox chart versions at 0.0.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/563438 (owner: 10Tarrow) [16:47:01] 10Operations, 10ops-codfw, 10DBA: Missing Network drivers from Stretch and Buster installer for BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (10MoritzMuehlenhoff) The 10G card is identical to what we have running fine on Stretch in e.g. ms-be2050 and I also validated there are n... 
[16:58:21] 10Puppet, 10Toolforge, 10cloud-services-team (Kanban): Cleanup unsigned puppet cleint certs on tools-puppetnmaster-01 - https://phabricator.wikimedia.org/T242642 (10bd808) [16:58:48] 10Puppet, 10Toolforge, 10cloud-services-team (Kanban): Cleanup unsigned puppet cleint certs on tools-puppetnmaster-01 - https://phabricator.wikimedia.org/T242642 (10bd808) p:05Triage→03High [17:01:49] 10Operations, 10ops-codfw, 10DBA: Missing Network drivers from Stretch and Buster installer for BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (10Marostegui) Thanks @MoritzMuehlenhoff for your tests. So recap: - this host has 1G and 10G: we bought the 10G because once we start do... [17:12:16] (03PS3) 10Alexandros Kosiaris: nodejs10: Add buster image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/563185 (https://phabricator.wikimedia.org/T237911) [17:14:35] (03PS1) 10Joal: Add mediawiki-history-dumps rsync to labstore [puppet] - 10https://gerrit.wikimedia.org/r/564066 [17:15:04] elukey: --^ for when you have a minute :) I have doubts on syntax :) [17:15:15] (03CR) 10jerkins-bot: [V: 04-1] Add mediawiki-history-dumps rsync to labstore [puppet] - 10https://gerrit.wikimedia.org/r/564066 (owner: 10Joal) [17:20:28] (03PS2) 10Joal: Add mediawiki-history-dumps rsync to labstore [puppet] - 10https://gerrit.wikimedia.org/r/564066 [17:20:38] 10Operations, 10Traffic: ats-tls is having issues when varnish-fe goes away - https://phabricator.wikimedia.org/T242620 (10Vgutierrez) Apparently ATS solves our current issue about TTFB VS connect timeout on https://github.com/apache/trafficserver/pull/4028, this is currently backported to 7.x and 9.x, I'm che... 
[17:21:18] (03CR) 10jerkins-bot: [V: 04-1] Add mediawiki-history-dumps rsync to labstore [puppet] - 10https://gerrit.wikimedia.org/r/564066 (owner: 10Joal) [17:22:59] joal: there seems to be an extra bash line at the bottom of one of the files [17:23:06] I think that jenkins doesn't like it [17:23:12] elukey: just cleaned - sorry :( [17:23:17] sorry for the spams [17:23:36] (03PS3) 10Joal: Add mediawiki-history-dumps rsync to labstore [puppet] - 10https://gerrit.wikimedia.org/r/564066 [17:24:18] joal: nah that's fine, you should see my regular spam :) [17:27:37] (03CR) 10Elukey: Add mediawiki-history-dumps rsync to labstore (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/564066 (owner: 10Joal) [17:29:45] (03CR) 10Elukey: Add mediawiki-history-dumps rsync to labstore (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/564066 (owner: 10Joal) [17:30:27] (03CR) 10Joal: "1 comment for you elukey" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/564066 (owner: 10Joal) [17:31:39] (03PS4) 10Joal: Add mediawiki-history-dumps rsync to labstore [puppet] - 10https://gerrit.wikimedia.org/r/564066 [17:34:50] 10Operations, 10ops-codfw, 10DBA: Missing Network drivers from Stretch and Buster installer for BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (10Marostegui) For what is worth, exactly the same behaviour is happening on es2020. And what was done at T242481#5797643 for es2024 also... [17:41:31] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1100 - https://phabricator.wikimedia.org/T241506 (10Marostegui) @Jclark-ctr feel free to replace the disk once you get to the DC, disk #0 is the one. 
[17:43:11] (03PS2) 10Ottomata: [POC] eventgate - Use service.name as primary resource grouping, not wmf.releasename [deployment-charts] - 10https://gerrit.wikimedia.org/r/564052 [17:44:06] (03PS3) 10Ottomata: [POC] eventgate - Use service.name as primary resource grouping, not wmf.releasename [deployment-charts] - 10https://gerrit.wikimedia.org/r/564052 [17:44:15] (03PS5) 10Joal: Add mediawiki-history-dumps rsync to labstore [puppet] - 10https://gerrit.wikimedia.org/r/564066 [17:44:50] (03PS1) 10Elukey: admin: add krb flag for user musikanimal [puppet] - 10https://gerrit.wikimedia.org/r/564073 (https://phabricator.wikimedia.org/T242525) [17:45:06] 10Puppet, 10Toolforge, 10cloud-services-team (Kanban): Cleanup unsigned puppet client certs on tools-puppetmaster-01 - https://phabricator.wikimedia.org/T242642 (10bd808) [17:46:54] (03CR) 10Elukey: [C: 03+2] admin: add krb flag for user musikanimal [puppet] - 10https://gerrit.wikimedia.org/r/564073 (https://phabricator.wikimedia.org/T242525) (owner: 10Elukey) [17:47:17] (03CR) 10Ottomata: [POC] eventgate - Use service.name as primary resource grouping, not wmf.releasename (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/564052 (owner: 10Ottomata) [17:50:31] (03PS2) 10Alexandros Kosiaris: mathoid: Rework the canary approach [deployment-charts] - 10https://gerrit.wikimedia.org/r/564054 [17:51:50] 10Puppet, 10Toolforge, 10cloud-services-team (Kanban): Cleanup unsigned puppet client certs on tools-puppetmaster-01 - https://phabricator.wikimedia.org/T242642 (10bd808) 05Open→03Resolved `puppet cert revoke` and `puppet cert clean` only work on signed certificates. The deprecated `puppet ca destroy` co... 
[17:53:48] (03CR) 10Elukey: "Rendered command is:" [puppet] - 10https://gerrit.wikimedia.org/r/564066 (owner: 10Joal) [17:56:59] (03PS3) 10Alexandros Kosiaris: mathoid: Rework the canary approach [deployment-charts] - 10https://gerrit.wikimedia.org/r/564054 [17:57:38] 10Operations, 10Jade, 10TechCom, 10Core Platform Team Legacy (Watching / External), and 4 others: Deploy Jade extension MVP to production - https://phabricator.wikimedia.org/T183381 (10Halfak) [18:00:04] gehel and onimisionipe: Your horoscope predicts another unfortunate Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200113T1800). [18:13:46] (03PS4) 10Alexandros Kosiaris: mathoid: Rework the canary approach [deployment-charts] - 10https://gerrit.wikimedia.org/r/564054 [18:14:37] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 23346560 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:16:27] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 181672 and 81 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:21:30] (03PS5) 10Alexandros Kosiaris: mathoid: Rework the canary approach [deployment-charts] - 10https://gerrit.wikimedia.org/r/564054 [18:21:32] (03PS1) 10Alexandros Kosiaris: mathoid: 2nd test of canary functionality in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/564084 [18:22:42] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Reworked the entire approach in https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/564054/ and even fixed the object "" not" [deployment-charts] - 10https://gerrit.wikimedia.org/r/469662 (owner: 10Alexandros Kosiaris) [18:34:25] (03PS4) 10Ayounsi: Juniper to Netbox import script [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/507217 [18:35:43] (03Abandoned) 
10EBernhardson: Deploy analytics-search keytab to an-airflow [puppet] - 10https://gerrit.wikimedia.org/r/556072 (owner: 10EBernhardson) [18:36:53] (03CR) 10Ottomata: "I think we need a top level concept of 'service' that should be the same between all releases of the service. For eventgate instance, thi" [deployment-charts] - 10https://gerrit.wikimedia.org/r/564054 (owner: 10Alexandros Kosiaris) [18:50:43] (03PS4) 10Ottomata: [POC] eventgate - Use service.name as primary resource grouping, not wmf.releasename [deployment-charts] - 10https://gerrit.wikimedia.org/r/564052 [18:51:46] (03PS5) 10Ottomata: [POC] eventgate - Use service.name as primary resource grouping, not wmf.releasename [deployment-charts] - 10https://gerrit.wikimedia.org/r/564052 [19:00:04] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200113T1900). Please do the needful. [19:00:04] tgr: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:01:27] (03PS1) 10Bstorm: gridengine: Make webservices "not rerunable" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/564095 (https://phabricator.wikimedia.org/T242397) [19:02:06] tgr: assuming you do your own SWAT? 
[19:02:20] I can do it, sure [19:02:47] (03PS2) 10Bstorm: gridengine: Make webservices "not rerunable" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/564095 (https://phabricator.wikimedia.org/T242397) [19:04:28] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/20328/torrelay1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/563208 (owner: 10Muehlenhoff) [19:06:27] (03PS4) 10Bstorm: kubernetes: Set php7.3 as the default type [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/496564 (owner: 10BryanDavis) [19:06:50] (03PS4) 10Bstorm: Report error messages on stderr [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/496565 (owner: 10BryanDavis) [19:07:03] (03PS4) 10Bstorm: Remove lighttpd-precise handling [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/496566 (owner: 10BryanDavis) [19:07:42] (03CR) 10Jdlrobson: "Has a deployment been scheduled?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557439 (https://phabricator.wikimedia.org/T240728) (owner: 10Ammarpad) [19:08:58] (03CR) 10Paladox: [C: 03+1] gerrit: Switch to apt::package_from_component [puppet] - 10https://gerrit.wikimedia.org/r/563472 (owner: 10Muehlenhoff) [19:10:15] (03CR) 10Dzahn: "File[/etc/apt/sources.list.d/thirdparty-tor.list]/ensure: removed" [puppet] - 10https://gerrit.wikimedia.org/r/563208 (owner: 10Muehlenhoff) [19:12:27] (03PS3) 10Gergő Tisza: [DNM until June 15] Revert "Invalidate CommonsMetadata cache for entries affected by T222935" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509914 [19:14:16] (03CR) 10Gergő Tisza: [C: 03+2] [DNM until June 15] Revert "Invalidate CommonsMetadata cache for entries affected by T222935" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509914 (owner: 10Gergő Tisza) [19:14:40] (03CR) 10Volans: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/564044 (https://phabricator.wikimedia.org/T224567) (owner: 10Muehlenhoff) [19:15:14] 
(03Merged) 10jenkins-bot: [DNM until June 15] Revert "Invalidate CommonsMetadata cache for entries affected by T222935" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509914 (owner: 10Gergő Tisza) [19:15:38] (03PS4) 10Bstorm: Improve support for extra_args [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/496567 (owner: 10BryanDavis) [19:15:54] (03CR) 10jerkins-bot: [V: 04-1] Improve support for extra_args [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/496567 (owner: 10BryanDavis) [19:16:49] (03CR) 10Bstorm: "recheck" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/496567 (owner: 10BryanDavis) [19:17:05] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/20329/gerrit1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/563472 (owner: 10Muehlenhoff) [19:18:26] (03PS3) 10Dzahn: gerrit: Switch to apt::package_from_component [puppet] - 10https://gerrit.wikimedia.org/r/563472 (owner: 10Muehlenhoff) [19:19:43] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure: Puppet agent unable to run in Beta Cluster (Evaluation Error: Error while evaluating a Resource Statement) - https://phabricator.wikimedia.org/T242658 (10Krinkle) [19:21:14] (03CR) 10Dzahn: [C: 03+1] "does not show up in https://tools.wmflabs.org/openstack-browser/puppetclass/ which doesn't cover 100% of all cases but is a strong indicat" [puppet] - 10https://gerrit.wikimedia.org/r/564044 (https://phabricator.wikimedia.org/T224567) (owner: 10Muehlenhoff) [19:21:35] (03CR) 10Krinkle: [C: 04-1] "Beta puppet still broken. 
Filed T242658" [puppet] - 10https://gerrit.wikimedia.org/r/559262 (https://phabricator.wikimedia.org/T241097) (owner: 10Krinkle) [19:22:24] !log tgr@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:509914|Revert a temporary CommonsMetadata cache validation hook that has been unneeded for a long time]] (duration: 00m 56s) [19:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:38] (03PS1) 10DannyS712: Deploy partial blocks on commons wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564097 (https://phabricator.wikimedia.org/T242570) [19:24:03] (03PS2) 10DannyS712: Deploy partial blocks on commons wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564097 (https://phabricator.wikimedia.org/T242570) [19:24:48] (03CR) 10Dzahn: "puppet removed /etc/apt/sources.list.d/wikimedia-openjdk8.list and added /etc/apt/sources.list.d/repository_wikimedia-openjdk8.list but th" [puppet] - 10https://gerrit.wikimedia.org/r/563472 (owner: 10Muehlenhoff) [19:30:49] (03PS2) 10Dzahn: codesearch: Install docker-ce from thirdparty/kubeadm-k8s component [puppet] - 10https://gerrit.wikimedia.org/r/563633 (owner: 10Legoktm) [19:32:33] (03CR) 10Dzahn: [C: 03+1] Convert tftp/dhcp ferm rules to ferm services [puppet] - 10https://gerrit.wikimedia.org/r/564010 (owner: 10Muehlenhoff) [19:38:57] (03CR) 10Krinkle: "I've filed. 
Please fix or revert :)" [puppet] - 10https://gerrit.wikimedia.org/r/561816 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [19:38:59] (03CR) 10Krinkle: "filed T242658 *" [puppet] - 10https://gerrit.wikimedia.org/r/561816 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [19:39:42] (03PS1) 10Urbanecm: Configure GlobalRename blacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564102 (https://phabricator.wikimedia.org/T101615) [19:39:54] !log ran disableOATHAuthForUser.php for T242543 [19:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:58] T242543: Disable 2FA for "Ocean behind ears" on Wikitech - https://phabricator.wikimedia.org/T242543 [19:40:30] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure: Puppet agent unable to run in Beta Cluster (Evaluation Error: Error while evaluating a Resource Statement) - https://phabricator.wikimedia.org/T242658 (10Krinkle) >>! @dzahn wrote at (owner: @jbond) > the `etcd_cli... [19:55:33] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure: Puppet agent unable to run in Beta Cluster (Evaluation Error: Error while evaluating a Resource Statement) - https://phabricator.wikimedia.org/T242658 (10thcipriani) hrm, is this possibly from 06abb84939331458b1187fd9c3a562b82731b329 by @jbond ? Not exac... [19:55:49] (03PS3) 10Dzahn: Convert tftp/dhcp ferm rules to ferm services [puppet] - 10https://gerrit.wikimedia.org/r/564010 (owner: 10Muehlenhoff) [20:00:04] Urbanecm and Amir1: That opportune time is upon us again. Time for a Create ngwikimedia deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200113T2000). 
[20:00:14] * Urbanecm around [20:00:54] o/ [20:01:15] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure: Puppet agent unable to run in Beta Cluster (Evaluation Error: Error while evaluating a Resource Statement) - https://phabricator.wikimedia.org/T242658 (10Dzahn) "parameter 'srv_domain' expects a match for Variant[Stdlib::Fqdn" that is the new thing. unti... [20:02:03] "Don't be afraid" seems very apt for a new wiki creation.. [20:02:41] hehe [20:02:49] mutante: could you do https://gerrit.wikimedia.org/r/c/operations/puppet/+/559073 please? [20:03:23] that's dark comedy right there [20:04:10] You really should've got the apache change scheduled ahead of time... [20:04:16] sorry, bad timing. i am on public transport on the way to an appointment [20:05:14] yeah, maybe elukey can help with that [20:05:54] He's marked /away, so I'm guessing not [20:06:54] okay, then people from traffic would be great [20:07:04] bblack: Can you take a look please? [20:07:33] Reedy: The thing is most wikis don't need puppet change, except chapter wikis, which don't happen too often for me to remember [20:07:36] sorry [20:14:59] (03CR) 10RLazarus: [C: 03+1] Add ng.wikimedia.org as chapter site [puppet] - 10https://gerrit.wikimedia.org/r/559073 (https://phabricator.wikimedia.org/T240771) (owner: 10IAmNetx) [20:19:15] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 55967248 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:19:46] (03PS1) 10Dzahn: hieradata/labs: add etcd srv_domain parameter for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/564116 (https://phabricator.wikimedia.org/T242658) [20:21:03] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:21:38] !log clarakosi@deploy1001 Started deploy 
[restbase/deploy@bfdd342]: Use parsoid_uri, add ngwiki. T241756, T240771 [20:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:42] T241756: Clean-up Parsoid-PHP transition code from RESTBase - https://phabricator.wikimedia.org/T241756 [20:21:42] T240771: Create a wiki for Wikimedia User Group Nigeria - https://phabricator.wikimedia.org/T240771 [20:22:16] (03CR) 10Dzahn: "follow-up to https://phabricator.wikimedia.org/rOPUP06abb84939331458b1187fd9c3a562b82731b329" [puppet] - 10https://gerrit.wikimedia.org/r/564116 (https://phabricator.wikimedia.org/T242658) (owner: 10Dzahn) [20:23:42] (03PS2) 10Dzahn: hieradata/labs: add etcd srv_domain parameter for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/564116 (https://phabricator.wikimedia.org/T242658) [20:25:20] (03CR) 10Dzahn: etcd: add parameter type checking and clean up (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/561816 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [20:26:44] (03CR) 10Dzahn: hieradata/labs: add etcd srv_domain parameter for deployment-prep (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/564116 (https://phabricator.wikimedia.org/T242658) (owner: 10Dzahn) [20:29:21] (03CR) 10Dmaza: "Commons might benefit from having the "Partial Block info" banner show up. 
This can be achieve by setting wgWikimediaMessagesPartialBlockB" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564097 (https://phabricator.wikimedia.org/T242570) (owner: 10DannyS712) [20:30:24] (03CR) 10Legoktm: [C: 03+1] codesearch: Install docker-ce from thirdparty/kubeadm-k8s component (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/563633 (owner: 10Legoktm) [20:30:45] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:32:25] (03CR) 1020after4: [C: 03+1] hieradata/labs: add etcd srv_domain parameter for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/564116 (https://phabricator.wikimedia.org/T242658) (owner: 10Dzahn) [20:32:33] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:35:24] (03CR) 10Dzahn: [C: 03+2] hieradata/labs: add etcd srv_domain parameter for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/564116 (https://phabricator.wikimedia.org/T242658) (owner: 10Dzahn) [20:37:19] !log clarakosi@deploy1001 Finished deploy [restbase/deploy@bfdd342]: Use parsoid_uri, add ngwiki. 
T241756, T240771 (duration: 15m 41s) [20:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:23] T241756: Clean-up Parsoid-PHP transition code from RESTBase - https://phabricator.wikimedia.org/T241756 [20:37:23] T240771: Create a wiki for Wikimedia User Group Nigeria - https://phabricator.wikimedia.org/T240771 [20:47:52] (03PS1) 10Tchanders: Enable banner for wikis that recently opted in to partial blocks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564121 (https://phabricator.wikimedia.org/T240300) [20:49:12] (03CR) 10Tchanders: [C: 03+1] "Good point Dmaza - here's a config patch for the banner: I256d8241a1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564097 (https://phabricator.wikimedia.org/T242570) (owner: 10DannyS712) [20:57:23] (03CR) 10Tchanders: "NB If there's any reason to wait on I7e78720a, this can go first, since the banner will only display on wikis with partial blocks enabled," [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564121 (https://phabricator.wikimedia.org/T240300) (owner: 10Tchanders) [20:59:45] (03CR) 10Jforrester: "Presumably this is a lot safer now? :-) Should we bump it to 20180101?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443866 (owner: 10Reedy) [21:00:04] cscott, arlolra, subbu, halfak, and accraze: I, the Bot under the Fountain, allow thee, The Deployer, to do Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200113T2100). [21:09:33] (03PS2) 10Jforrester: Bump default cache epochs from 20130601 to 20160101 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443866 (owner: 10Reedy) [21:10:04] (03CR) 10Jforrester: "PS2: Manual trivial rebase." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/443866 (owner: 10Reedy) [21:11:39] (03PS1) 10Cwhite: mtail: track new subscription requests in prometheus [puppet] - 10https://gerrit.wikimedia.org/r/564129 (https://phabricator.wikimedia.org/T236505) [21:24:09] (03CR) 10Dmaza: [C: 03+1] Enable banner for wikis that recently opted in to partial blocks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564121 (https://phabricator.wikimedia.org/T240300) (owner: 10Tchanders) [21:24:15] (03CR) 10Dmaza: [C: 03+1] Deploy partial blocks on commons wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564097 (https://phabricator.wikimedia.org/T242570) (owner: 10DannyS712) [21:24:15] !log arlolra@deploy1001 Started deploy [parsoid/deploy@dd92eeb]: Updating Parsoid to 5d37da1 [21:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:06] !log milimetric@deploy1001 Started deploy [analytics/refinery@690517c]: Referer Classify change [21:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:36] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@dd92eeb]: Updating Parsoid to 5d37da1 (duration: 08m 21s) [21:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:15] !log milimetric@deploy1001 Finished deploy [analytics/refinery@690517c]: Referer Classify change (duration: 09m 08s) [21:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:04] Amir1: is it ready to go, or needs some coordination at merge time? 
[21:44:29] Amir1: (I mean https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/559073/ ) [21:48:46] (03PS1) 10Volans: binary packages: optimize queries (part 2) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/564138 [21:50:22] (03CR) 10Volans: "The change has already been applied to the test instance in Cloud" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/564138 (owner: 10Volans) [22:00:04] Reedy and sbassett: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200113T2200). [22:26:12] (03PS1) 10Holger Knust: Migrate changeprop & cpjobqueue to kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/564143 [22:26:32] (03CR) 10jerkins-bot: [V: 04-1] Migrate changeprop & cpjobqueue to kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/564143 (owner: 10Holger Knust) [22:28:13] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 37 probes of 505 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [22:30:50] 10Operations, 10netops: deploy pfw policy 107271f986 - https://phabricator.wikimedia.org/T242681 (10Jgreen) [22:33:21] (03PS2) 10Holger Knust: Migrate changeprop & cpjobqueue to kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 [22:33:35] (03CR) 10jerkins-bot: [V: 04-1] Migrate changeprop & cpjobqueue to kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 (owner: 10Holger Knust) [22:34:01] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 35 probes of 505 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [22:37:54] (03Abandoned) 10Holger Knust: Migrate changeprop & cpjobqueue to kubernetes [deployment-charts] - 
10https://gerrit.wikimedia.org/r/564143 (owner: 10Holger Knust) [22:44:11] (03PS3) 10Holger Knust: Migrate changeprop & cpjobqueue to kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 [22:44:24] (03CR) 10jerkins-bot: [V: 04-1] Migrate changeprop & cpjobqueue to kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 (owner: 10Holger Knust) [22:48:50] (03CR) 10Holger Knust: "Addressed all the issues highlighted in PS1. Let me know if anything else." (0321 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 (owner: 10Holger Knust) [23:04:57] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:06:47] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:08:11] (03CR) 10BryanDavis: "The man page says this is the default behavior. Spot checking `qstat -j $ID -xml` a random job on the webgrid-lighttpd queue shows " PROBLEM - snapshot of s4 in eqiad on db1115 is CRITICAL: snapshot for s4 at eqiad taken more than 4 days ago: Most recent backup 2020-01-09 23:22:38 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [23:40:26] (03CR) 10Halfak: [C: 03+1] "Looks good to me. Sorry for the delay. Missed the ping." [puppet] - 10https://gerrit.wikimedia.org/r/502527 (https://phabricator.wikimedia.org/T213899) (owner: 10Herron) [23:46:48] 10Operations, 10Traffic, 10Patch-For-Review: Start warning and deprecation process for all legacy TLS - https://phabricator.wikimedia.org/T238038 (10Nux) >>! 
In T238038#5727398, @TheDJ wrote: > Question. https://wikitech.wikimedia.org/wiki/HTTPS/Browser_Recommendations > > Windows 7: I know it CAN support T...